Home

News

Dec. 26, 2007: HtmlCleaner release 1.6
  • New flag parameter ignoreQuestAndExclam is introduced offering control over special tags - <?TAGNAME....>, <!TAGNAME....>.
  • Bug fixes.
Sep. 27, 2007: HtmlCleaner release 1.55
  • Added Reader based HtmlCleaner constructors.
  • New parameter pruneTags is introduced offering a way to remove undesired tags with all the children from XML tree after parsing and cleaning.
  • Bug fixes.
Sep. 8, 2007: HtmlCleaner release 1.5
  • Several bug fixes.
  • Added option to escape XML content in DOM serializer - HtmlCleaner.createDOM(boolean escapeXml)
Aug. 24, 2007: HtmlCleaner release 1.4
  • New flag allowHtmlInsideAttributes is introduced in order to give the parser flexibility in handling attribute values.
  • Several bug fixes.
Jul. 12, 2007: HtmlCleaner release 1.3
  • New browser-compact serializer added, which preserves a single whitespace where multiple occur.
  • New flag namespacesAware is introduced in order to control namespace prefixes and namespace declarations. It should be used instead of omitXmlnsAttributes that existed in previous versions and had limited functionality.
  • New flag allowMultiWordAttributes is introduced giving HtmlCleaner's parser flexibility to (dis)allow tag attributes consisting of multiple words.
  • New flag useEmptyElementTags is introduced in order to control output of tags with empty body
    (<xxx/> vs <xxx></xxx>).
  • Several bug fixes.
May. 05, 2007: HtmlCleaner release 1.2
  • Several bugs fixed.
  • New flags added to control behaviour of unknown/deprecated tags.
  • New flag added to optionally remove HTML envelope from resulting XML.
  • JDOM serializer added.
Apr. 16, 2007: SVN support added
Apr. 13, 2007: HtmlCleaner release 1.13
  • Serialization of XML to Java DOM supported with createDOM() method of HtmlCleaner class.
Jan. 28, 2007: HtmlCleaner release 1.12
  • Hexadecimal entities escaping supported (i.e. &#x09;).
Jan. 11, 2007: HtmlCleaner release 1.1
  • Compact XML serializer improved.
  • Minor XML escaping bug fixed.
Jan. 02, 2007: HtmlCleaner release 1.0.5
  • An HTML tokenizing bug fixed.
  • Methods of the class TagNode made public in order to enable creating custom XML serializers.
  • Method writeXml(XmlSerializer) added to HtmlCleaner class in order to support creating custom XML serializers.
Dec. 23, 2006: HtmlCleaner release 1.0
  • Minor bug in advanced XML escaping fixed.
Dec. 05, 2006: HtmlCleaner release 0.9
  • HtmlCleaner Ant task added
  • XML compact serializer added - strips all unneeded whitespace from the result
  • Few minor bugs fixed
Nov. 27, 2006: HtmlCleaner initial release (version 0.8)

Introduction

HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to create the document object model. However, the user may provide a custom tag and rule set for tag filtering and balancing.

At the time this tool was developed, several open-source Java solutions had already existed for a long time. However, in the author's experience, they are either no longer maintained or fail to produce well-formed XML in all cases. A few of them sometimes produce XML with unexpected or unstable structure. This was the main motive for starting this project: to create a small (JAR file below 30K), fast and reliable tool that will always produce well-formed XML.

Setting cleaner's behavior

HtmlCleaner can be tuned using several parameters. They are briefly described in the following table:

Parameter Default value Explanation
advancedXmlEscape true If this parameter is set to true, an ampersand sign (&) that starts a valid XML character sequence (&XXX;) will not be escaped as &amp;XXX;
translateSpecialEntities true If true, special HTML entities (e.g. &ocirc;, &permil;, &times;) are replaced with the Unicode characters they represent (ô, ‰, ×). This doesn't include &, <, >, ", '.
recognizeUnicodeChars true If true, HTML characters represented by their codes in the form &#XXXX; are replaced with real Unicode characters (e.g. &#1078; is replaced with ж)
useCdataForScriptAndStyle true If true, HtmlCleaner will treat SCRIPT and STYLE tag contents as CDATA sections; otherwise they will be regarded as ordinary text (special characters will be escaped).
omitUnknownTags false Tells whether to skip (ignore) unknown tags during cleanup.
treatUnknownTagsAsContent false Tells whether to treat unknown tags as ordinary content, i.e. <something...> will be transformed to &lt;something...&gt;. This attribute is applicable only if omitUnknownTags is set to false.
omitDeprecatedTags false Tells whether to skip (ignore) deprecated HTML tags during cleanup.
treatDeprecatedTagsAsContent false Tells whether to treat deprecated tags as ordinary content, i.e. <font...> will be transformed to &lt;font...&gt;. This attribute is applicable only if omitDeprecatedTags is set to false.
omitComments false Tells whether to skip HTML comments.
omitXmlDeclaration false Tells whether or not to put XML declaration line at the beginning of the resulting XML.
omitDoctypeDeclaration true Tells whether to skip the DOCTYPE declaration found in the source document. If the HTML document being cleaned doesn't contain one, none is placed in the result anyway.
omitXmlnsAttributes false This flag is deprecated since version 1.3 and namespacesAware should be used instead.
omitHtmlEnvelope false Tells whether to remove HTML and BODY tags from the resulting XML, and use first tag in the BODY section instead. If BODY section doesn't contain any tags, then this attribute has no effect.
useEmptyElementTags true Specifies how to serialize tags with empty body - if true, compact notation is used (<xxx/>), otherwise - <xxx></xxx>
allowMultiWordAttributes true Tells the parser whether to allow attribute values consisting of multiple words. If true, attribute att="a b c" will stay as it is, and if false the parser will split it into att="a" b="b" c="c" (this is the default browsers' behaviour).
allowHtmlInsideAttributes false Tells the parser whether to allow HTML tags inside attribute values. For example, when this flag is set, att="here is <a href='xxxx'>link</a>" will stay as it is, and if not, the parser will end the attribute value after "here is ". This flag makes sense only if allowMultiWordAttributes is set as well.
ignoreQuestAndExclam false Tells parser whether to completely ignore tags that have form <?TAGNAME....> or <!TAGNAME....>. This way some HTML/XML processing instructions may be omitted from the resulting xml.
namespacesAware true If true, namespace prefixes found during parsing will be preserved and all necessary XML namespace declarations will be added in the root element. If false, all namespace prefixes and all xmlns namespace declarations will be stripped.
hyphenReplacementInComment = XML doesn't allow a double hyphen sequence (--) inside comments. This parameter tells which replacement to use for it when a double hyphen is encountered during parsing.
pruneTags empty string Comma-separated list of tags that will be completely removed (with all nested elements) from the XML tree after parsing. For example, if pruneTags is "script,style", the resulting XML will not contain scripts and styles.
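
The advancedXmlEscape row in the table above can be illustrated with a small standalone sketch. It does not use HtmlCleaner itself: the class name, the method name and the regular expression are assumptions made for illustration, and the library may recognise entity sequences differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the HtmlCleaner API.
final class AmpersandEscapeSketch {

    // Assumed shape of a "valid XML character sequence": a named reference,
    // a decimal reference or a hexadecimal reference, terminated by ';'.
    private static final Pattern ENTITY =
            Pattern.compile("&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");

    /**
     * Escapes ampersands in text. With advanced == true, ampersands that
     * start an entity-like sequence are left untouched; with false, every
     * ampersand is escaped.
     */
    static String escapeAmpersands(String text, boolean advanced) {
        if (!advanced) {
            return text.replace("&", "&amp;");
        }
        StringBuilder out = new StringBuilder();
        Matcher m = ENTITY.matcher(text);
        int pos = 0;
        while (m.find()) {
            // Escape stray ampersands in the text before the entity...
            out.append(text.substring(pos, m.start()).replace("&", "&amp;"));
            // ...and copy the entity itself unchanged.
            out.append(m.group());
            pos = m.end();
        }
        out.append(text.substring(pos).replace("&", "&amp;"));
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a & b &lt; c &#x09;";
        System.out.println(escapeAmpersands(input, true));  // a &amp; b &lt; c &#x09;
        System.out.println(escapeAmpersands(input, false)); // a &amp; b &amp;lt; c &amp;#x09;
    }
}
```

With the flag on, only the stray ampersand is escaped. With it off, existing entities are escaped a second time, which is usually undesirable for input that already contains them.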

Command line use

HtmlCleaner can be called from the command line with the following syntax:

    java -jar htmlcleanerXX.jar src = <url | file> [incharset = <charset>] 
                                [dest = <file>] [outcharset = <charset>] 
                                [options...]
    
where options include:

    outputtype = simple | compact | browser-compact | pretty
    advancedxmlescape = true | false
    usecdata = true | false
    specialentities = true | false
    unicodechars = true | false
    omitunknowntags = true | false
    treatunknowntagsascontent = true | false
    omitdeprtags = true | false
    treatdeprtagsascontent = true | false
    omitcomments = true | false
    omitxmldecl = true | false
    omitdoctypedecl = true | false
    omitxmlnsatt = true | false
    omithtmlenvelope = true | false
    useemptyelementtags = true | false
    allowmultiwordattributes = true | false
    allowhtmlinsideattributes = true | false
    ignoreqe = true | false
    namespacesaware = true | false
    hyphenreplacement = <string value>
    prunetags = <string value>
    
Note: in order to distinguish URLs from files, URLs must begin with http://
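
The rule in the note above amounts to a prefix test. The following is a minimal sketch, assuming a plain case-sensitive match (the note only says that URLs must begin with http://). The class and method names are hypothetical and are not part of HtmlCleaner.

```java
// Hypothetical illustration only; not part of the HtmlCleaner API.
final class SourceKindSketch {

    /** Returns true if src would be treated as a URL under the http:// rule. */
    static boolean looksLikeUrl(String src) {
        // Assumption: a simple, case-sensitive prefix check.
        return src.startsWith("http://");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUrl("http://www.example.com/page.html")); // true
        System.out.println(looksLikeUrl("pages/index.html"));                 // false
        System.out.println(looksLikeUrl("www.example.com/page.html"));        // false
    }
}
```

Under this reading, a value such as www.example.com/page.html is taken as a file name, so the http:// prefix has to be written out for remote sources.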

Ant use

Ensure that the HtmlCleaner JAR file is on Ant's class path. Define the Ant task in the following way:

    <taskdef name="mytask" classname="org.htmlcleaner.HtmlCleanerForAnt"/>
    ....
    <target name="...">
        <mytask [src = "..."] 
                [incharset = "..."]
                [dest = "..."] 
                [outcharset = "..."] 
                [outputtype = "simple" | "compact" | "pretty"]
                [advancedxmlescape = "true" | "false"]
                [usecdata = "true" | "false"]
                [specialentities = "true" | "false"]
                [unicodechars = "true" | "false"]
                [omitunknowntags = "true" | "false"]
                [treatunknowntagsascontent = "true" | "false"]
                [omitdeprtags = "true" | "false"]
                [treatdeprtagsascontent = "true" | "false"]
                [omitcomments = "true" | "false"]
                [omitxmldecl = "true" | "false"]
                [omitdoctypedecl = "true" | "false"]
                [omitxmlnsatt = "true" | "false"]
                [omithtmlenvelope = "true" | "false"]
                [useemptyelementtags = "true" | "false"]
                [allowmultiwordattributes = "true" | "false"]
                [allowhtmlinsideattributes = "true" | "false"]
                [ignoreqe = "true" | "false"]
                [namespacesaware = "true" | "false"]
                [hyphenreplacement = "..."]
                [prunetags = "..."]>
 
            .... optional HTML code ....
            
        </mytask>
    </target>
    
Note: in order to distinguish URLs from files, URLs must begin with http:// in the src attribute. If the src attribute is not specified, HTML from the task's body is used.

Java code use

Typical usage from Java code is the following:

    // one of several constructors
    HtmlCleaner cleaner = new HtmlCleaner(...);
    
    // optionally, set cleaner's behavior     
    cleaner.setXXX(...);
    
    // calls cleaning process
    cleaner.clean();
    
    // writes resulting XML to string, file or any output stream...
    cleaner.writeXmlXXX(...);

    // ...or create DOM object
    org.w3c.dom.Document myDoc = cleaner.createDOM();

    // ...or create JDom object
    org.jdom.Document myJDoc = cleaner.createJDom();

    // ... or write to your own serializer
    cleaner.writeXml(myXmlSerializer);

    // ... or just take resulting node and do whatever you want with it
    org.htmlcleaner.TagNode rootNode = cleaner.getRootNode();
    
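Once a DOM object is available, it can be processed with the standard Java XML APIs. The sketch below shows one such step: collecting link targets from an org.w3c.dom.Document. So that it runs without the HtmlCleaner JAR, a hand-written well-formed XML string stands in for the cleaner's output. The class and method names are hypothetical, and the parseWellFormed helper exists only for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of the HtmlCleaner API.
final class DomLinksSketch {

    /** Collects the href attribute of every <a> element, in document order. */
    static List<String> collectLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                links.add(anchor.getAttribute("href"));
            }
        }
        return links;
    }

    /** Stand-in for cleaner output: parses an already well-formed XML string. */
    static Document parseWellFormed(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // In real use, the Document would come from the cleaner's createDOM().
        Document doc = parseWellFormed(
                "<html><body><a href=\"a.html\">A</a><p><a href=\"b.html\">B</a></p></body></html>");
        System.out.println(collectLinks(doc)); // [a.html, b.html]
    }
}
```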

Implementing custom tag info set

HtmlCleaner implements a default HTML tag set and rules for balancing tags that mimic browsers' behavior. Some users may not like particular rules, for example that an implicit TBODY is inserted before TR in an HTML table. In order to define custom rules and a custom set of tags, do the following:

  • Implement the interface org.htmlcleaner.ITagInfoProvider
  • Pass an instance of the implementing class to one of the HtmlCleaner constructors.
See the Java API for more details, and download the source code to check how the default HTML provider is implemented.