HtmlCleaner Project Home Page

Setting Behavior

Cleaner parameters
Parsing transformations

Cleaner parameters

HtmlCleaner can be set up with a number of parameters. They are briefly described in the following table (parameter names vary slightly between Java code, the command line, and Ant, but this should not cause any ambiguity):

Parameter	Default	Explanation
maxDepth	1000	The maximum depth of nested tags that HtmlCleaner will attempt to handle.
advancedXmlEscape	true	If this parameter is set to true, an ampersand (&) that precedes a valid XML character sequence (&XXX;) will not be escaped as &amp;XXX;.
deserializeEntities	false	If this parameter is set to true, special entities in element content will be deserialized to their text equivalents, enabling you to access parsed text content from the TagNode and ContentNode classes. This will not override the Serializer output, which will escape the output to create valid XML or HTML.
transResCharsToNCR	false	If this parameter is set to true, reserved XML characters (&, ", ', <, >) are serialized to their Numeric Character Representations (&#38;, &#34;, &#39;, &#60;, &#62;). This parameter has effect only if `advancedXmlEscape` is set to true.
translateSpecialEntities	true	If true, special HTML entities (e.g. &ocirc;, &permil;, &times;) are replaced with the Unicode characters they represent (ô, ‰, ×). This doesn't include &, <, >, ", '.
transSpecialEntitiesToNCR	false	If this parameter is set to true, special HTML entities (e.g. &Alpha;) are serialized to their Numeric Character Representations (&#913;). This parameter has effect only if `translateSpecialEntities` is set to true.
recognizeUnicodeChars	true	If true, HTML characters represented by their codes in the form &#XXXX; are replaced with real Unicode characters (e.g. &#1078; is replaced with ж).
useCdata	true	If true, HtmlCleaner treats the contents of SCRIPT and STYLE tags as CDATA sections; otherwise they are treated as ordinary text (special characters are escaped).
useCdataFor	"script,style"	HtmlCleaner will treat the contents of specified tags as CDATA sections, or otherwise it will be regarded as ordinary text (special characters will be escaped).
omitUnknownTags	false	Tells whether to skip (ignore) unknown tags during cleanup.
treatUnknTagsAsContent	false	Tells whether to treat unknown tags as ordinary content, i.e. `<something...>` will be transformed to `&lt;something...&gt;`. This attribute is applicable only if `omitUnknownTags` is set to false.
omitDeprTags	false	Tells whether to skip (ignore) deprecated HTML tags during cleanup.
treatDeprTagsAsContent	false	Tells whether to treat deprecated tags as ordinary content, i.e. `<font...>` will be transformed to `&lt;font...&gt;`. This attribute is applicable only if `omitDeprecatedTags` is set to false.
omitCdataOutsideScriptAndStyle	false	If set to true, any CDATA sections not inside a script or style element are removed. (Since version 2.7)
omitComments	false	Tells whether to skip HTML comments.
omitXmlDeclaration	false	Tells whether or not to put XML declaration line at the beginning of the resulting XML.
omitDoctypeDeclaration	true	Tells whether to skip the DOCTYPE declaration found in the source document. If the HTML document being cleaned doesn't contain one, none is added to the result.
omitXmlnsAttributes	false	This flag is deprecated since version 1.3 and `namespacesAware` should be used instead.
omitEnvelope	false	Tells whether to remove open and close tag being serialized. This parameter is introduced in HtmlCleaner 2.2 to replace `omitHtmlEnvelope`. If set to true, serialization skips open and close tags of the node, outputs only node's children.
useEmptyElementTags	true	Specifies how to serialize tags with an empty body: if true, compact notation is used (`<xxx/>`); otherwise, `<xxx></xxx>`.
allowMultiWordAttributes	true	Tells parser whether to allow attribute values consisting of multiple words or not. If true, attribute `att="a b c"` will stay like it is, and if false parser will split this into `att="a" b="b" c="c"` (this is default browsers' behaviour).
allowHtmlInsideAttributes	false	Tells parser whether to allow HTML tags inside attribute values. For example, when this flag is set, `att="here is <a href='xxxx'>link</a>"` will stay as it is; if not, the parser will end the attribute value after `"here is "`. This flag makes sense only if `allowMultiWordAttributes` is set as well.
ignoreQuestAndExclam	true	Tells parser whether to completely ignore tags that have form `<?TAGNAME....>` or `<!TAGNAME....>`. This way some HTML/XML processing instructions may be omitted from the resulting xml.
namespacesAware	true	If true, namespace prefixes found during parsing will be preserved and all necessary XML namespace declarations will be added to the root element. If false, all namespace prefixes and all xmlns namespace declarations will be stripped.
hyphenReplacement	=	XML doesn't allow double hyphen sequence (--) inside comments. This parameter tells which replacement to use for it when double hyphen is encountered during parsing.
pruneTags	empty string	Comma-separated list of tags that will be completely removed (with all nested elements) from the XML tree after parsing. For example, if `pruneTags` is `"script,style"`, the resulting XML will not contain scripts and styles.
booleanAtts	self	Tells cleaner what value to give to boolean attributes, like `checked`, `selected` and similar. Allowed values are `self` - value of attribute is the same as attribute name (checked = "checked"), `empty` - attribute value is empty string (checked = "") and `true` - value of attribute is "true" (checked = "true").
nodeByXpath		XPath expression used to select the first node that is going to be serialized instead of the whole HTML document. For example, if this parameter is set to `//table[1]`, only the first table in the document will be serialized.
allowInvalidAttributeNames	False	Determines whether HtmlCleaner attempts to fix attribute names by making them XML-compliant. The default value is `False`, meaning that HtmlCleaner will remove any invalid characters from attribute names, and omit attributes when the attribute name cannot be made valid in this way. If set to `True`, attribute names are always left largely as-is, even if this will break XML error checking. Because of this, setting this value to True will also set strictErrorChecking=False in DomSerializer. JDomSerializer does not have an error checking value, so be careful with using this setting as it can easily result in invalid JDom documents throwing exceptions. Note that HtmlCleaner uses the WHATWG/HTML5 attribute specification rules for parsing attributes, which means that some attributes will be considered invalid and omitted even if this property is `True`.
invalidAttributeNamePrefix	Empty String	A prefix added to HTML attribute names to indicate that HtmlCleaner has modified them from the original name as part of the cleaning process. The default value is an empty string ("").
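
To make two of the rows above concrete, here is a minimal plain-Java sketch of the effect described for `transResCharsToNCR` and `booleanAtts`. It is an illustration only, not HtmlCleaner's implementation: the class `ParameterEffectsSketch` and both method names are hypothetical, and it uses nothing beyond the Java standard library.

```java
// Illustrative sketch only: mimics the documented effect of two parameters.
// The class and method names are hypothetical and not part of HtmlCleaner.
final class ParameterEffectsSketch {

    // transResCharsToNCR: replace each reserved XML character with the
    // numeric character reference listed in the table.
    static String reservedToNcr(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&#38;"); break;
                case '"':  out.append("&#34;"); break;
                case '\'': out.append("&#39;"); break;
                case '<':  out.append("&#60;"); break;
                case '>':  out.append("&#62;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // booleanAtts: choose the value written for a boolean attribute such as
    // "checked", according to the three allowed modes.
    static String booleanAttributeValue(String attributeName, String mode) {
        switch (mode) {
            case "self":  return attributeName; // checked="checked"
            case "empty": return "";            // checked=""
            case "true":  return "true";        // checked="true"
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        System.out.println(reservedToNcr("a < b & c"));
        for (String mode : new String[] {"self", "empty", "true"}) {
            System.out.println("checked=\"" + booleanAttributeValue("checked", mode) + "\"");
        }
    }
}
```

In real use these behaviours are selected through the cleaner's parameters rather than by calling code like this; the sketch only shows what output each setting is described as producing.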

Parsing transformations

HtmlCleaner 2.1 introduces a way to quickly skip specified tags and/or attributes, or to transform them into other tags/attributes, during the parsing process, avoiding expensive document object model manipulation after cleaning. Here is an example of HTML that we want to change slightly as part of the standard cleanup process: