News
- New flag parameter ignoreQuestAndExclam is introduced, offering control over special tags of the form <?TAGNAME....> and <!TAGNAME....>.
- Bug fixes.
- Added Reader-based HtmlCleaner constructors.
- New parameter pruneTags is introduced, offering a way to remove undesired tags, with all their children, from the XML tree after parsing and cleaning.
- Bug fixes.
- Several bug fixes.
- Added option to escape XML content in the DOM serializer: HtmlCleaner.createDOM(boolean escapeXml).
- New flag allowHtmlInsideAttributes is introduced in order to give the parser flexibility in handling attribute values.
- Several bug fixes.
- New browser-compact serializer added, which preserves a single whitespace where multiple occur.
- New flag namespacesAware is introduced in order to control namespace prefixes and namespace declarations. It should be used instead of omitXmlnsAttributes, which existed in previous versions and had limited functionality.
- New flag allowMultiWordAttributes is introduced, giving HtmlCleaner's parser flexibility to (dis)allow tag attributes consisting of multiple words.
- New flag useEmptyElementTags is introduced in order to control output of tags with empty body (<xxx/> vs <xxx></xxx>).
- Several bug fixes.
- Several bugs fixed.
- New flags added to control behaviour of unknown/deprecated tags.
- New flag added to optionally remove HTML envelope from resulting XML.
- JDOM serializer added.
- Latest source may be checked out from https://htmlcleaner.svn.sourceforge.net/svnroot/htmlcleaner.
- Source can be browsed at http://htmlcleaner.svn.sourceforge.net/viewvc/htmlcleaner/
- Serialization of XML to Java DOM supported with the createDOM() method of the HtmlCleaner class.
- Hexadecimal entity escaping supported (e.g. &#x9;).
- Compact XML serializer improved.
- Minor XML escaping bug fixed.
- An HTML tokenizing bug fixed.
- Methods of the class TagNode made public in order to enable creating custom XML serializers.
- Method writeXml(XmlSerializer) added to HtmlCleaner class in order to support creating custom XML serializers.
- Minor bug in advanced XML escaping fixed.
- HtmlCleaner Ant task added.
- XML compact serializer added; it strips all unneeded whitespace from the result.
- A few minor bugs fixed.
Introduction
HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to create the document object model. However, the user may provide a custom tag and rule set for tag filtering and balancing.
At the time this tool was developed, several open-source Java solutions had already existed for a long time. However, in the author's experience, they are either no longer maintained or fail to produce well-formed XML in all cases. A few of them sometimes produce XML with unexpected or unstable structure. This was the main motive for starting this project: to create a small (JAR file below 30K), fast and reliable tool that will always produce well-formed XML.
Setting cleaner's behavior
HtmlCleaner can be tuned using several parameters. They are briefly described in the following table:
| Parameter | Default value | Explanation |
|---|---|---|
| advancedXmlEscape | true | If true, an ampersand (&) that begins a valid XML character sequence (&XXX;) is left as it is rather than being escaped to &amp;XXX;. |
| translateSpecialEntities | true | If true, special HTML entities (e.g. &ocirc;, &permil;, &times;) are replaced with the Unicode characters they represent (ô, ‰, ×). This doesn't include &amp;, &lt;, &gt;, &quot;, &apos;. |
| recognizeUnicodeChars | true | If true, HTML characters represented by their codes in the form &#XXXX; are replaced with real Unicode characters (e.g. &#1078; is replaced with ж). |
| useCdataForScriptAndStyle | true | If true, HtmlCleaner treats the contents of SCRIPT and STYLE tags as CDATA sections; otherwise they are regarded as ordinary text (special characters are escaped). |
| omitUnknownTags | false | Tells whether to skip (ignore) unknown tags during cleanup. |
| treatUnknownTagsAsContent | false | Tells whether to treat unknown tags as ordinary content, e.g. <something...> will be transformed to &lt;something...&gt;. This attribute is applicable only if omitUnknownTags is set to false. |
| omitDeprecatedTags | false | Tells whether to skip (ignore) deprecated HTML tags during cleanup. |
| treatDeprecatedTagsAsContent | false | Tells whether to treat deprecated tags as ordinary content, e.g. <font...> will be transformed to &lt;font...&gt;. This attribute is applicable only if omitDeprecatedTags is set to false. |
| omitComments | false | Tells whether to skip HTML comments. |
| omitXmlDeclaration | false | Tells whether to omit the XML declaration line at the beginning of the resulting XML. |
| omitDoctypeDeclaration | true | Tells whether to skip the DOCTYPE declaration found in the source document. If the HTML document being cleaned doesn't contain one, none is placed in the result anyway. |
| omitXmlnsAttributes | false | This flag is deprecated since version 1.3; namespacesAware should be used instead. |
| omitHtmlEnvelope | false | Tells whether to remove the HTML and BODY tags from the resulting XML and use the first tag in the BODY section instead. If the BODY section doesn't contain any tags, this attribute has no effect. |
| useEmptyElementTags | true | Specifies how to serialize tags with an empty body: if true, compact notation is used (<xxx/>), otherwise <xxx></xxx>. |
| allowMultiWordAttributes | true | Tells the parser whether to allow attribute values consisting of multiple words. If true, attribute att="a b c" will stay as it is; if false, the parser will split it into att="a" b="b" c="c" (this is the default behaviour of browsers). |
| allowHtmlInsideAttributes | false | Tells the parser whether to allow HTML tags inside attribute values. For example, when this flag is set, att="here is <a href='xxxx'>link</a>" will stay as it is; if not, the parser will end the attribute value after "here is ". This flag makes sense only if allowMultiWordAttributes is set as well. |
| ignoreQuestAndExclam | false | Tells the parser whether to completely ignore tags that have the form <?TAGNAME....> or <!TAGNAME....>. This way some HTML/XML processing instructions may be omitted from the resulting XML. |
| namespacesAware | true | If true, namespace prefixes found during parsing will be preserved and all necessary XML namespace declarations will be added to the root element. If false, all namespace prefixes and all xmlns namespace declarations will be stripped. |
| hyphenReplacementInComment | = | XML doesn't allow double hyphen sequence (--) inside comments. This parameter tells which replacement to use for it when double hyphen is encountered during parsing. |
| pruneTags | empty string | Comma-separated list of tags that will be completely removed (with all nested elements) from the XML tree after parsing. For example, if pruneTags is "script,style", the resulting XML will not contain scripts and styles. |
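
To make a few of the table entries more concrete, here is a minimal, self-contained sketch of the kind of text transformations that advancedXmlEscape, recognizeUnicodeChars, hyphenReplacementInComment and pruneTags describe. It is not HtmlCleaner's own code: the class CleanerOptionSketch and all of its methods are hypothetical, the regular expressions are only an approximation of "valid XML character sequence", and lower-casing the pruned tag names is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative stand-ins for a few table entries; NOT HtmlCleaner's implementation. */
final class CleanerOptionSketch {

    // An ampersand that does not start a sequence such as &amp; or &#1078; or &#x436;
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // A decimal character reference such as &#1078; (at most 7 digits, so it fits an int)
    private static final Pattern DECIMAL_REFERENCE = Pattern.compile("&#([0-9]{1,7});");

    /** Like advancedXmlEscape=true: escape only ampersands that do not begin an entity. */
    static String escapeBareAmpersands(String text) {
        return BARE_AMPERSAND.matcher(text).replaceAll("&amp;");
    }

    /** Like recognizeUnicodeChars=true: replace decimal references with their characters. */
    static String decodeDecimalReferences(String text) {
        Matcher m = DECIMAL_REFERENCE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave the reference untouched when it does not name a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Like hyphenReplacementInComment: replace each "--" found in comment text. */
    static String replaceDoubleHyphens(String commentText, String replacement) {
        return commentText.replace("--", replacement);
    }

    /** Like pruneTags: split a comma-separated list into tag names (lower-cased by assumption). */
    static Set<String> parsePruneTags(String pruneTags) {
        Set<String> names = new LinkedHashSet<>();
        for (String part : pruneTags.split(",")) {
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("a & b &amp; c"));
        System.out.println(replaceDoubleHyphens("one -- two", "="));
        System.out.println(parsePruneTags("script,style"));
    }
}
```

The negative lookahead in BARE_AMPERSAND is what keeps an already escaped sequence from being escaped a second time; how the real parser decides what counts as a valid sequence may differ.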
Command line use
HtmlCleaner can be called from the command line with the following syntax:
java -jar htmlcleanerXX.jar src = <url | file> [incharset = <charset>]
[dest = <file>] [outcharset = <charset>]
[options...]
where options include:
outputtype = simple | compact | browser-compact | pretty
advancedxmlescape = true | false
usecdata = true | false
specialentities = true | false
unicodechars = true | false
omitunknowntags = true | false
treatunknowntagsascontent = true | false
omitdeprtags = true | false
treatdeprtagsascontent = true | false
omitcomments = true | false
omitxmldecl = true | false
omitdoctypedecl = true | false
omitxmlnsatt = true | false
omithtmlenvelope = true | false
useemptyelementtags = true | false
allowmultiwordattributes = true | false
allowhtmlinsideattributes = true | false
ignoreqe = true | false
namespacesaware = true | false
hyphenreplacement = <string value>
prunetags = <string value>
Note: in order to distinguish URLs from files, URLs must begin with http://
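
The rule in the note can be illustrated with a short sketch that splits name=value arguments and classifies the src value. This is only an illustration under stated assumptions, not the tool's actual argument handling: the class SourceKindSketch and its methods are hypothetical, arguments are assumed to be written without spaces around the equals sign, and the prefix test is applied exactly as the note words it.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of name=value argument handling; NOT HtmlCleaner's own code. */
final class SourceKindSketch {

    /** Applies the documented rule literally: a src is a URL only if it begins with http:// */
    static boolean isUrl(String src) {
        return src.startsWith("http://");
    }

    /** Splits each argument at its first '=' so that values may themselves contain '='. */
    static Map<String, String> parseArgs(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0) {
                continue; // not a name=value token; ignored in this sketch
            }
            String name = arg.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            String value = arg.substring(eq + 1).trim();
            options.put(name, value);
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> options = parseArgs(new String[] {
            "src=http://example.com/page?id=1", "omitcomments=true"
        });
        String src = options.get("src");
        System.out.println(src + " is a " + (isUrl(src) ? "URL" : "file"));
    }
}
```

Under the rule as stated, any value without the http:// prefix, including a relative path such as page.html, would be treated as a file name.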
Ant use
Ensure that the HtmlCleaner JAR file is on Ant's class path.
Create the Ant task in the following way:
<taskdef name="mytask" classname="org.htmlcleaner.HtmlCleanerForAnt"/>
....
<target name="...">
<mytask [src = "..."]
[incharset = "..."]
[dest = "..."]
[outcharset = "..."]
[outputtype = "simple" | "compact" | "pretty"]
[advancedxmlescape = "true" | "false"]
[usecdata = "true" | "false"]
[specialentities = "true" | "false"]
[unicodechars = "true" | "false"]
[omitunknowntags = "true" | "false"]
[treatunknowntagsascontent = "true" | "false"]
[omitdeprtags = "true" | "false"]
[treatdeprtagsascontent = "true" | "false"]
[omitcomments = "true" | "false"]
[omitxmldecl = "true" | "false"]
[omitdoctypedecl = "true" | "false"]
[omitxmlnsatt = "true" | "false"]
[omithtmlenvelope = "true" | "false"]
[useemptyelementtags = "true" | "false"]
[allowmultiwordattributes = "true" | "false"]
[allowhtmlinsideattributes = "true" | "false"]
[ignoreqe = "true" | "false"]
[namespacesaware = "true" | "false"]
[hyphenreplacement = "..."]
[prunetags = "..."]>
.... optional HTML code ....
</mytask>
</target>
Note: in order to distinguish URLs from files, URLs in the src attribute must begin with http://.
If the src attribute is not specified, HTML from the task's body is used.
Java code use
Typical usage from Java code is the following:
// one of several constructors
HtmlCleaner cleaner = new HtmlCleaner(...);
// optionally, set cleaner's behavior
cleaner.setXXX(...);
// run the cleaning process
cleaner.clean();
// write resulting XML to a string, file or any output stream...
cleaner.writeXmlXXX(...);
// ...or create a DOM object
org.w3c.dom.Document myDoc = cleaner.createDOM();
// ...or create a JDom object
org.jdom.Document myJDoc = cleaner.createJDom();
// ...or write to your own serializer
cleaner.writeXml(myXmlSerializer);
// ...or just take the resulting node and do whatever you want with it
org.htmlcleaner.TagNode rootNode = cleaner.getRootNode();
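
Since createDOM() yields a standard org.w3c.dom.Document, the result can be processed with ordinary JDK APIs. The following sketch shows one possible way to collect link targets with XPath. So that it runs without HtmlCleaner on the class path, it builds the Document from a hard-coded, already well-formed string as a stand-in for cleaner output. The class DomConsumerSketch and both of its helper methods are hypothetical, and lower-case element names in the DOM are an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical consumer of a DOM document such as the one createDOM() returns. */
final class DomConsumerSketch {

    /** Collects the href attribute value of every a element in the document. */
    static List<String> collectLinks(Document doc) {
        try {
            NodeList hrefs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    /** Stand-in for cleaner output: parses a string that is already well-formed XML. */
    static Document parseWellFormed(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWellFormed(
                "<html><head/><body><a href=\"a.html\">A</a>"
                + "<p><a href=\"b.html\">B</a></p></body></html>");
        System.out.println(collectLinks(doc));
    }
}
```

In real use, the Document passed to collectLinks would come from cleaner.createDOM() rather than from parseWellFormed.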
Implementing custom tag info set
HtmlCleaner implements a default HTML tag set and rules for balancing tags that mimic browsers' behavior. Someone may not like, for example, the rule that an implicit TBODY is inserted before TR in an HTML table. To supply a custom set of tags and rules, do the following:
- Implement interface org.htmlcleaner.ITagInfoProvider.
- Use an instance of the implementing class in one of the HtmlCleaner constructors.
