Setting Behavior

Cleaner parameters

HtmlCleaner can be configured with a number of parameters, briefly described in the following table. Parameter names vary slightly between Java code, the command line and Ant use, but the differences should not cause any ambiguity:

Parameter Default Explanation
advancedXmlEscape true If this parameter is set to true, an ampersand sign (&) that precedes a valid XML character sequence (&XXX;) will not be escaped as &amp;XXX;
transResCharsToNCR false If this parameter is set to true, reserved XML characters (&, ", ', <, >) are serialized as their Numeric Character References (&#38;, &#34;, &#39;, &#60;, &#62;). This parameter has effect only if advancedXmlEscape is set to true.
translateSpecialEntities true If true, special HTML entities (e.g. &ocirc;, &permil;, &times;) are replaced with the Unicode characters they represent (ô, ‰, ×). This doesn't include &, <, >, ", '.
transSpecialEntitiesToNCR false If this parameter is set to true, special HTML entities (e.g. &Alpha;) are serialized as their Numeric Character References (&#913;). This parameter has effect only if translateSpecialEntities is set to true.
recognizeUnicodeChars true If true, HTML characters represented by their codes in the form &#XXXX; are replaced with real Unicode characters (e.g. &#1078; is replaced with ж).
useCdata true If true, HtmlCleaner treats the contents of SCRIPT and STYLE tags as CDATA sections; otherwise they are regarded as ordinary text (special characters will be escaped).
omitUnknownTags false Tells whether to skip (ignore) unknown tags during cleanup.
treatUnknTagsAsContent false Tells whether to treat unknown tags as ordinary content, e.g. <something...> will be transformed to &lt;something...&gt;. This attribute is applicable only if omitUnknownTags is set to false.
omitDeprTags false Tells whether to skip (ignore) deprecated HTML tags during cleanup.
treatDeprTagsAsContent false Tells whether to treat deprecated tags as ordinary content, e.g. <font...> will be transformed to &lt;font...&gt;. This attribute is applicable only if omitDeprTags is set to false.
omitCdataOutsideScriptAndStyle false If set to true, any CDATA sections not inside a script or style element are removed. (Since version 2.7)
omitComments false Tells whether to skip HTML comments.
omitXmlDeclaration false Tells whether to omit the XML declaration line at the beginning of the resulting XML.
omitDoctypeDeclaration true Tells whether to skip the DOCTYPE declaration found in the source document. If the HTML document being cleaned doesn't contain one, none is placed in the result anyway.
omitXmlnsAttributes false This flag is deprecated since version 1.3; namespacesAware should be used instead.
omitEnvelope false Tells whether to omit the open and close tags of the node being serialized. If set to true, serialization skips the node's open and close tags and outputs only its children. This parameter was introduced in HtmlCleaner 2.2 to replace omitHtmlEnvelope.
useEmptyElementTags true Specifies how to serialize tags with an empty body: if true, compact notation is used (<xxx/>), otherwise <xxx></xxx>.
allowMultiWordAttributes true Tells parser whether to allow attribute values consisting of multiple words or not. If true, attribute att="a b c" will stay like it is, and if false parser will split this into att="a" b="b" c="c" (this is default browsers' behaviour).
allowHtmlInsideAttributes false Tells the parser whether to allow HTML tags inside attribute values. For example, when this flag is set, att="here is <a href='xxxx'>link</a>" will stay as it is; if not, the parser will end the attribute value after "here is ". This flag makes sense only if allowMultiWordAttributes is set as well.
ignoreQuestAndExclam true Tells the parser whether to completely ignore tags that have the form <?TAGNAME....> or <!TAGNAME....>. This way some HTML/XML processing instructions may be omitted from the resulting XML.
namespacesAware true If true, namespace prefixes found during parsing will be preserved and all necessary XML namespace declarations will be added to the root element. If false, all namespace prefixes and all xmlns namespace declarations will be stripped.
hyphenReplacement = XML doesn't allow double hyphen sequence (--) inside comments. This parameter tells which replacement to use for it when double hyphen is encountered during parsing.
pruneTags empty string Comma-separated list of tags that will be completely removed (with all nested elements) from the XML tree after parsing. For example, if pruneTags is "script,style", the resulting XML will not contain scripts and styles.
booleanAtts self Tells cleaner what value to give to boolean attributes, like checked, selected and similar. Allowed values are self - value of attribute is the same as attribute name (checked = "checked"), empty - attribute value is empty string (checked = "") and true - value of attribute is "true" (checked = "true").
nodeByXpath XPath expression used to select the first node that is going to be serialized instead of the whole HTML document. For example, if this parameter is set to //table[1], only the first table in the document will be serialized.
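
To make the interplay of translateSpecialEntities and transSpecialEntitiesToNCR more concrete, the following minimal sketch mimics the behaviour the table describes. It is an illustration only and not HtmlCleaner's own implementation: the class EntitySketch, its translate method and the three-entry entity map are hypothetical, chosen to mirror the examples in the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; none of these names belong to HtmlCleaner.
class EntitySketch {
    // Tiny sample of named entities and their code points.
    // Assumption: a real entity table is far larger.
    static final Map<String, Integer> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("ocirc", 244);  // U+00F4
        ENTITIES.put("times", 215);  // U+00D7
        ENTITIES.put("Alpha", 913);  // U+0391
    }

    // translateSpecialEntities == false: text is returned untouched,
    // so the toNcr flag has no effect (as the table states).
    // Otherwise each known entity becomes either its Unicode character
    // or, when toNcr is true, its numeric character reference.
    static String translate(String text, boolean translateSpecialEntities, boolean toNcr) {
        if (!translateSpecialEntities) {
            return text;
        }
        String result = text;
        for (Map.Entry<String, Integer> entry : ENTITIES.entrySet()) {
            String named = "&" + entry.getKey() + ";";
            String replacement = toNcr
                    ? "&#" + entry.getValue() + ";"
                    : new String(Character.toChars(entry.getValue()));
            result = result.replace(named, replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "&Alpha; &times; 2";
        System.out.println(translate(source, true, false));  // entities become Unicode characters
        System.out.println(translate(source, true, true));   // entities become numeric references
        System.out.println(translate(source, false, true));  // source is left as it is
    }
}
```

In the real library the corresponding switches are set on the cleaner's properties rather than passed as method arguments; the sketch only shows why the second flag depends on the first.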

Parsing transformations

HtmlCleaner 2.1 introduces a way to quickly skip specified tags and/or attributes, or to transform them to other tags/attributes, during the parsing process. This avoids expensive document object model manipulation after cleaning. Here is an example of HTML that we want to change slightly as part of the standard cleanup process:
...My content 1...
<cfoutput>
    Yin and yang describe the polar effects of phenomena.
</cfoutput>
...My content 2...
<c:block parent=b1 count=331>
    Yin-yang are Mutually Rooted
</c:block>
...My content 3...
<font id=f21 size=12 face=Arial style="color:red">
    The Yin and yang aspects are in dynamic equilibrium    
</font>
...My content 4...
 
The following transformation rules will be applied in the cleaning process:
  1. cfoutput
  2. c:block->div,false
  3. font->span,true
  4. font.size
  5. font.face
  6. font.style=${style};font-family=${face};font-size=${size};
They have the following meaning:
  1. cfoutput tag will be ignored by the parser (but not the content inside it).
  2. c:block tag will be transformed to a div tag and all original attributes will be ignored (false in the transformation description).
  3. font tag will be transformed to span and original attributes will be preserved, except those whose transformation is explicitly described. Attributes size and face will be removed, and attribute style has a more complex transformation rule: it will be translated to the value given by the template ${style};font-family=${face};font-size=${size};. The template is evaluated against the source tag attributes (names between ${ and }).
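
As a reading aid for the rule notation used above, the following sketch splits a rule string into its parts. It is a hypothetical parser written for this page, not the one HtmlCleaner uses: the class RuleSketch and its fields are invented names, tag names are assumed not to contain a dot, and treating a missing true/false flag as true is an assumption.

```java
// Hypothetical illustration; none of these names belong to HtmlCleaner.
class RuleSketch {
    final String tag;                 // source tag the rule applies to
    final String attribute;           // attribute name, or null for a tag rule
    final String target;              // destination tag or attribute template; null means "remove"
    final boolean preserveAttributes; // meaningful for tag rules only

    RuleSketch(String tag, String attribute, String target, boolean preserveAttributes) {
        this.tag = tag;
        this.attribute = attribute;
        this.target = target;
        this.preserveAttributes = preserveAttributes;
    }

    static RuleSketch parse(String rule) {
        // Everything before the first '=' names the tag or tag.attribute.
        int eq = rule.indexOf('=');
        String head = eq >= 0 ? rule.substring(0, eq) : rule;
        int dot = head.indexOf('.');
        if (dot >= 0) {
            // Attribute rule: "tag.attr" removes it, "tag.attr=template" rewrites it.
            String template = eq >= 0 ? rule.substring(eq + 1) : null;
            return new RuleSketch(head.substring(0, dot), head.substring(dot + 1), template, false);
        }
        int arrow = rule.indexOf("->");
        if (arrow < 0) {
            // Bare tag name: the tag itself is skipped.
            return new RuleSketch(rule, null, null, false);
        }
        // Tag rule: "source->dest,preserveFlag".
        String[] dest = rule.substring(arrow + 2).split(",", 2);
        boolean preserve = dest.length < 2 || Boolean.parseBoolean(dest[1].trim());
        return new RuleSketch(rule.substring(0, arrow), null, dest[0].trim(), preserve);
    }

    public static void main(String[] args) {
        String[] rules = {
            "cfoutput", "c:block->div,false", "font->span,true",
            "font.size", "font.face",
            "font.style=${style};font-family=${face};font-size=${size};"
        };
        for (String rule : rules) {
            RuleSketch r = parse(rule);
            System.out.println(r.tag + " | " + r.attribute + " | " + r.target + " | " + r.preserveAttributes);
        }
    }
}
```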
Finally, HtmlCleaner gives the following XML as the result:
...My content 1...
Yin and yang describe the polar effects of phenomena.
...My content 2...
<div>
    Yin-yang are Mutually Rooted
</div>
...My content 3...
<span id="f21" style="color:red;font-family=Arial;font-size=12;">
    The Yin and yang aspects are in dynamic equilibrium
</span>
...My content 4...
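
The style value in the result comes from evaluating the template against the attributes of the source font tag. The sketch below shows one way such an evaluation could work. It is an assumption-laden illustration rather than HtmlCleaner's code: the class TemplateSketch and its evaluate method are hypothetical, and replacing a missing attribute with an empty string is a choice made for this example only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; none of these names belong to HtmlCleaner.
class TemplateSketch {
    // Matches placeholders of the form ${name}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces every ${name} with the source attribute of that name.
    // Assumption: a missing attribute is replaced with an empty string.
    static String evaluate(String template, Map<String, String> sourceAttributes) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = sourceAttributes.getOrDefault(matcher.group(1), "");
            // quoteReplacement keeps '$' and '\' in attribute values literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Attributes of the font tag from the example above.
        Map<String, String> font = new LinkedHashMap<>();
        font.put("id", "f21");
        font.put("size", "12");
        font.put("face", "Arial");
        font.put("style", "color:red");

        String style = evaluate("${style};font-family=${face};font-size=${size};", font);
        System.out.println(style); // prints color:red;font-family=Arial;font-size=12;
    }
}
```

The printed value matches the style attribute of the span element in the result above.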
 
See how to specify transformations when HtmlCleaner is used from Java code, from the command line or as an Ant task.