src/doc/manpage.xml
Michael Fuchs, Software Engineer

herold(1)  User Commands

herold - HTML to DocBook converter

herold OPTIONS

Description

Reusing HTML content in a presentation-neutral form is a frequent
problem. One possible solution is to convert the HTML to DocBook XML,
because DocBook is a semantic markup language for documentation that
lets authors capture the logical structure of their content. The
command-line tool herold converts HTML to DocBook. Because HTML elements
are often not used as intended, the possibilities for such a
transformation are somewhat limited. herold is part of the dbdoclet
suite of tools. For more information, visit http://www.dbdoclet.org.

Options

--docbook-add-index, -x
    Automatically adds an index element at the end of the document.

--docbook-decompose-tables, -T
    Decomposes the tables from the HTML code into single paragraphs.
    This can be useful if a document contains a lot of tables for
    formatting reasons.

--docbook-encoding, -d
    Specifies the encoding of the generated DocBook XML files.

--docbook-root-element, -r
    The root element of the document. Possible values are: book,
    article, reference, part, chapter or section. The default value
    for this option is 'article'.

--docbook-title, -t
    The title for the resulting document.

--in, -i
    Specifies the HTML input file.

--help, -h
    Prints a help page on the console.

--html-encoding, -s
    Specifies the encoding of the HTML source files, such as
    ISO-8859-1.

--out, -o
    Specifies the DocBook XML destination file.

--profile, -p
    A profile file with predefined settings.

--verbose, -v
    Enables verbose console output.

--version, -V
    Displays the version of herold.

Configuration

The details of a transformation are controlled by a profile file. A
profile file offers more ways to influence the transformation than the
command-line arguments do. The following example shows a typical
profile file.

transformation html2docbook;
section section-detection {
    attribute-class = ["^MsoHeading(\d+)$"];
    section-numbering-pattern = "((\d+\.)+)?\d*\.?\p{Z}*";
}
section list-detection {
    itemized-attribute-class = ["^MsoListBullet(\w*)$", "Aufzählung(\w+)$"];
    itemized-strip-prefix = [ "-", "o", "\u00b7" ];
    ordered-attribute-class = ["^MsoListNumbered(\w*)$"];
    ordered-strip-prefix = [ "\d+\.\s+" ];
}
section HTML {
    encoding = "windows-1252";
    exclude = [ "//p[starts-with(@class, 'MsoToc')]", "" ];
}
section DocBook {
    abstract = """<title>Lorem ipsum</title>
<para>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed
do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut
enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.sed, dolor
amet.</para>""";
    add-index = true;
    author-email = "me@somewhere.de";
    author-firstname = "Michael";
    author-surname = "Fuchs";
    collapse-protected-space = "true";
    copyright-holder = "Ingenieurbüro Michael Fuchs";
    copyright-year = "2012";
    corporation = "";
    create-condition-attribute = false;
    create-prolog = true;
    create-remap-attribute = false;
    create-xref-label = false;
    decompose-tables = false;
    detect-trapped-br = true;
    documentation-id = "doc01";
    document-element = "book";
    encoding = "UTF-8";
    hyphenation-char = "soft-hyphen";
    image-data-formats = [ "gif", "base64" ];
    image-path = "./figures";
    language = "de";
    release-info = "Version 3.1";
    table-style = "all";
    title = "Tutorial";
    title-normalize-space = true;
    use-absolute-image-path = false;
}
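
The two regular expressions in the section-detection block of the
example profile can be tried out in isolation. The following Python
sketch is not herold's implementation; it only mimics the documented
behaviour under a few assumptions: the class patterns are searched
within the class attribute, a pattern without a group falls back to
level seven, and the numbering pattern is applied at the start of the
title. Python's re module has no \p{Z} class, so \s* stands in for
\p{Z}* here. The function names section_level and strip_numbering are
hypothetical.

```python
import re

# Pattern list copied from the section-detection block of the example profile.
ATTRIBUTE_CLASS = [r"^MsoHeading(\d+)$"]

# The profile uses \p{Z}* (Unicode separators). Python's re module does not
# support \p{...}, so \s* is used as a stand-in (an assumption of this sketch).
SECTION_NUMBERING = r"((\d+\.)+)?\d*\.?\s*"

DEFAULT_LEVEL = 7  # level assumed when no level group is available


def section_level(css_class, patterns=ATTRIBUTE_CLASS):
    """Return the section level for a class attribute value.

    Returns None if no pattern matches, the number captured by the first
    group if there is one, and DEFAULT_LEVEL otherwise.
    """
    for pattern in patterns:
        match = re.search(pattern, css_class)
        if match is None:
            continue
        if match.re.groups >= 1 and match.group(1):
            try:
                return int(match.group(1))
            except ValueError:
                return DEFAULT_LEVEL
        return DEFAULT_LEVEL
    return None


def strip_numbering(title, pattern=SECTION_NUMBERING):
    """Remove a leading numbering prefix such as '1.2.3 ' from a title."""
    match = re.match(pattern, title)
    return title[match.end():] if match else title


if __name__ == "__main__":
    print(section_level("MsoHeading8"))
    print(strip_numbering("1.2.3 Installation"))
```

Under these assumptions, a p element with the class MsoHeading8 would be
treated as a level-eight section, and a title such as
"1.2.3 Installation" would be reduced to "Installation".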

Syntax

A profile file consists mainly of sections. Sections group parameters
that share the same context. Every section starts with the keyword
section, followed by the name of the section. After the name comes the
block of parameters, enclosed in curly braces. Parameters can be of
type String, Number, Boolean or Array. Strings must be enclosed in
double quotes. If a String contains newlines, use three double quotes
instead of one. Arrays are enclosed in square brackets, and their
elements must be comma-separated. Every assignment must end with a
semicolon. Multi-line comments have the form /* my comment */;
single-line comments have the form // my comment and run to the end of
the line.

Mandatory Elements

A profile for herold must start with the line
transformation html2docbook;.

Section section-detection

The section section-detection is used
to detect section elements in HTML code and to strip off any
numbering prefix from the titles.

Many authoring tools allow deeply nested sections. When such documents
are exported to HTML, the nesting can become deeper than six levels.
HTML provides header elements for only six levels, h1 to h6; there is
no h7 or beyond. Beyond that point, the formatting is normally done
with CSS and div or p elements. herold is able to detect the header
elements of HTML, but it cannot know the export format of a specific
tool. To handle at least some of these cases, you can specify the
parameter attribute-class. It consists of a list of regular
expressions, which are matched against the class attribute of each
HTML element. If a match is found, the element is treated as a section
element. The regular expression can have a group, which is interpreted
as the level indicator. The group must be the first group and it must
match a number, e.g. ^heading(\d+)$. If the level cannot be detected,
a level of seven is assumed.

Because the DocBook XSL stylesheets take care of the section numbering
while transforming the DocBook XML to a specific output format, it is
often necessary to strip the numbering already present in the HTML
page. Otherwise you end up with two numbers in front of each title. To
help herold detect numbering patterns, use the parameter
section-numbering-pattern.

attribute-class
    A regular expression that is applied to every p and div element.
    If the expression matches, the current element is handled as a
    section element. If the regular expression has groups, the first
    group is used as the nesting level; otherwise level seven is
    assumed.

section-numbering-pattern
    Normally you want to get rid of the section numbering that comes
    with the HTML data, because it becomes part of the title text in
    DocBook. The section numbers would then appear twice in your
    target media: once from the HTML and once from the DocBook XSL
    processing. The parameter section-numbering-pattern defines a
    regular expression, which is matched against the beginning of
    every section title. If it matches, the matching part is removed.

Section list-detection

Sometimes lists are not represented with ul, ol or dl
tags, but as p tags with additional CSS formatting. If you use a tool
that creates or exports HTML with such a construct, the conversion
ends up with para elements instead of the corresponding list elements
in DocBook. To recreate the lists in some cases, you can use the
section list-detection. The parameters itemized-attribute-class and
ordered-attribute-class let you define lists of regular expressions,
which are matched against list items in the HTML. herold tries to
rebuild the proper list structure from this information, even for
nested lists.

Section HTML

The section HTML defines parameters that control the loading and
parsing of the HTML input data.

encoding
    The character set used to read the input stream.

exclude
    Defines an array of XPath expressions. All matches are removed
    from the HTML DOM tree before transformation.

Section DocBook

abstract
    The text for the abstract element of the info section. If
    the text is structured with newlines, use three double quotes as
    delimiters. If the text starts with a "<" character, it is
    embedded into an abstract element; otherwise the text is embedded
    into a para element inside an abstract element. The text is parsed
    and can contain DocBook elements.

add-index
    If set to true, an index element is inserted at the end of the
    DocBook XML.

create-xref-label
    If set to false, anchor elements do not get an xreflabel
    attribute.

decompose-tables
    If set to true, table structures are ignored. The content of the
    table cells is inserted into the DocBook XML as a sequence of
    paragraphs. This parameter can be useful if your HTML contains
    tables for formatting purposes. Normally you want to get rid of
    them, because they distort the logical structure.

document-element
    The document element you want to use. Must be one of article,
    book, part or reference.

encoding
    The character set used for writing the output file.

image-data-formats
    An array of image formats. These formats are inserted as
    imageobject elements, in addition to the format found in the src
    attribute of the corresponding img element. The original format is
    inserted twice, with the roles "html" and "fo". The other formats
    are inserted as "html-<FORMAT>" and "fo-<FORMAT>".

title
    The title of the resulting document. If this parameter is
    undefined, herold tries to detect the title from the head section
    of the HTML data.

use-absolute-image-path
    If you want absolute image paths in the fileref attribute of the
    imagedata element, set this parameter to true.

Copyright

Copyright 2001-2013 Michael Fuchs. License GPLv3+: GNU GPL version 3
or later http://gnu.org/licenses/gpl.html.
This is free software: you are free to change and redistribute it. There
is NO WARRANTY, to the extent permitted by law.
src/conf/config.properties
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
# Framework config properties.
#
# To override the packages the framework exports by default from the
# class path, set this variable.
#org.osgi.framework.system.packages=
# To append packages to the default set of exported system packages,
# set this value.
#org.osgi.framework.system.packages.extra=
# The following property makes specified packages from the class path
# available to all bundles. You should avoid using this property.
#org.osgi.framework.bootdelegation=sun.*,com.sun.*
# Felix tries to guess when to implicitly boot delegate in certain
# situations to ease integration without outside code. This feature
# is enabled by default, uncomment the following line to disable it.
#felix.bootdelegation.implicit=false
# The following property explicitly specifies the location of the bundle
# cache, which defaults to "felix-cache" in the current working directory.
# If this value is not absolute, then the felix.cache.rootdir controls
# how the absolute location is calculated. (See next property)
#org.osgi.framework.storage=${felix.cache.rootdir}/felix-cache
# The following property is used to convert a relative bundle cache
# location into an absolute one by specifying the root to prepend to
# the relative cache path. The default for this property is the
# current working directory.
#felix.cache.rootdir=${user.dir}
# The following property controls whether the bundle cache is flushed
# the first time the framework is initialized. Possible values are
# "none" and "onFirstInit"; the default is "none".
#org.osgi.framework.storage.clean=onFirstInit
# The following property determines which actions are performed when
# processing the auto-deploy directory. It is a comma-delimited list of
# the following values: 'install', 'start', 'update', and 'uninstall'.
# An undefined or blank value is equivalent to disabling auto-deploy
# processing.
felix.auto.deploy.action=install,start
# The following property specifies the directory to use as the bundle
# auto-deploy directory; the default is 'bundle' in the working directory.
#felix.auto.deploy.dir=bundle
# The following property is a space-delimited list of bundle URLs
# to install when the framework starts. The ending numerical component
# is the target start level. Any number of these properties may be
# specified for different start levels.
#felix.auto.install.1=
# The following property is a space-delimited list of bundle URLs
# to install and start when the framework starts. The ending numerical
# component is the target start level. Any number of these properties
# may be specified for different start levels.
#felix.auto.start.1=
felix.log.level=1
# Sets the initial start level of the framework upon startup.
#org.osgi.framework.startlevel.beginning=1
# Sets the start level of newly installed bundles.
#felix.startlevel.bundle=1
# Felix installs a stream and content handler factories by default,
# uncomment the following line to not install them.
#felix.service.urlhandlers=false
# The launcher registers a shutdown hook to cleanly stop the framework
# by default, uncomment the following line to disable it.
#felix.shutdown.hook=false
#
# Bundle config properties.
#
org.osgi.service.http.port=8080
obr.repository.url=http://felix.apache.org/obr/releases.xml
src/conf/log4j.xml