Common usage
Typically, the following steps are taken:

    // create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();

    // take default cleaner properties
    CleanerProperties props = cleaner.getProperties();

    // customize cleaner's behaviour with property setters
    props.setXXX(...);

    // Clean HTML taken from simple string, file, URL, input stream,
    // input source or reader. Result is root node of created
    // tree-like structure. Single cleaner instance may be safely used
    // multiple times.
    TagNode node = cleaner.clean(...);

    // optionally find parts of the DOM or modify some nodes
    TagNode[] myNodes = node.getElementsByXXX(...);
    // and/or
    Object[] myNodes = node.evaluateXPath(xPathExpression);
    // and/or
    aNode.removeFromTree();
    // and/or
    aNode.addAttribute(attName, attValue);
    // and/or
    aNode.removeAttribute(attName);
    // and/or
    cleaner.setInnerHtml(aNode, htmlContent);

    // serialize a node to a file, output stream, DOM, JDom...
    new XXXSerializer(props).writeXmlXXX(aNode, ...);
    myJDom = new JDomSerializer(props, true).createJDom(aNode);
    myDom = new DomSerializer(props, true).createDOM(aNode);

HtmlCleaner API
Create cleaner instance:
| Constructor or method | Purpose |
|---|---|
| HtmlCleaner() | Create cleaner with default tag information provider. |
| HtmlCleaner(ITagInfoProvider) | Create cleaner with custom tag information provider. |
Set cleaner properties in order to tune its behavior:
Set cleaner transformations:
| Constructor or method | Purpose |
|---|---|
| CleanerTransformations() | Create collection of transformations. |
| TagTransformation(String, String, boolean) | Create single tag transformation. |
| CleanerTransformations. | Add tag transformation to transformations collection. |
| TagTransformation. | Specify attribute transformation for the tag transformation. |
| HtmlCleaner. | Set cleaner transformations. |
Clean HTML with instance of HtmlCleaner:
| Constructor or method | Purpose |
|---|---|
| class HtmlCleaner: clean(String) | Clean HTML that comes from various sources. |
Search cleaned DOM and modify its structure:
| Constructor or method | Purpose |
|---|---|
| class TagNode: getAttributeByName(String) | Work with node (tag) attributes. |
| class TagNode: getChildTagList() | Find and modify nodes. |
| HtmlCleaner.setInnerHtml(TagNode, String) | Cleans given portion of HTML and stores it in specified tag node. |
Serialize DOM nodes:
| Constructor or method | Purpose |
|---|---|
| SimpleXmlSerializer(CleanerProperties) | Create various kinds of XML serializers. |
| class XmlSerializer: writeXmlToStream(TagNode, OutputStream, String) | Serialize node to different outputs. |
| DomSerializer.createDOM(TagNode) | Create common DOM objects out of cleaned HTML. |
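
Once a cleaned tree has been turned into a standard `org.w3c.dom.Document`, it can be processed with the XML APIs that ship with the JDK. The following is a minimal sketch of that step only. To stay independent of the HtmlCleaner jar, it builds the `Document` by parsing an already well-formed string, which stands in for the result of `DomSerializer.createDOM(TagNode)`. The class name `DomUsageSketch` and its methods are hypothetical and not part of HtmlCleaner.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of HtmlCleaner.
class DomUsageSketch {

    /** Stand-in for DomSerializer.createDOM(TagNode): parses already well-formed markup. */
    static Document parseWellFormed(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
    }

    /** Collects the href value of every <a> element using the JDK XPath engine. */
    static List<String> linkTargets(Document doc) {
        try {
            NodeList hrefs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> targets = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                targets.add(hrefs.item(i).getNodeValue());
            }
            return targets;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    /** Serializes the document to an XML string without an XML declaration. */
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWellFormed(
                "<html><body><a href=\"a.html\">A</a><p><a href=\"b.html\">B</a></p></body></html>");
        System.out.println(linkTargets(doc));
        System.out.println(toXml(doc));
    }
}
```

In real use, the call to `parseWellFormed` would be replaced by the `Document` returned from `DomSerializer.createDOM`; the XPath and serialization steps stay the same.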
Providing custom tag info set
HtmlCleaner implements a default HTML tag set and rules for balancing tags that are similar to browsers' behavior. However, users are free to implement the ITagInfoProvider interface or extend one of its implementations in order to provide a custom tag info set. The easiest way to do that is to write an XML configuration file that describes all tags and their dependencies, and use ConfigFileTagProvider like:
HtmlCleaner cleaner =
new HtmlCleaner( new ConfigFileTagProvider(myConfigFile) );
Perhaps the best starting point is the default tag ruleset description file. It is the basis for DefaultTagProvider. For example, someone may not like the rule that an implicit TBODY is inserted before TR in an HTML table. To remove it, find the <tag name="tr"... element in the XML and remove tbody from the req-enclosing-tags section.
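
To make the effect of that edit concrete, here is a toy model of a "required enclosing tag" rule. It is only a sketch under stated assumptions: the class `EnclosingTagSketch`, its methods, and the choice to insert the first listed tag are all hypothetical, and HtmlCleaner's real balancing logic may differ. With the rule present, a `tr` directly inside `table` triggers an implicit `tbody`; with the rule removed, nothing is inserted.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; not HtmlCleaner's implementation.
class EnclosingTagSketch {

    // Hypothetical rule table: tag name -> tags, one of which must already be open.
    private final Map<String, List<String>> requiredEnclosing = new HashMap<>();

    /** Registers a rule similar in spirit to a req-enclosing-tags entry. */
    void require(String tag, String... enclosingTags) {
        requiredEnclosing.put(tag, Arrays.asList(enclosingTags));
    }

    /**
     * Returns the tag that this model would open implicitly before `tag`,
     * or null when no rule applies or a required tag is already open.
     */
    String implicitParent(List<String> openTags, String tag) {
        List<String> required = requiredEnclosing.getOrDefault(tag, Collections.emptyList());
        if (required.isEmpty()) {
            return null; // no rule: the tag is accepted where it appears
        }
        for (String open : openTags) {
            if (required.contains(open)) {
                return null; // already enclosed correctly
            }
        }
        return required.get(0); // assumption: the first listed tag is inserted
    }

    public static void main(String[] args) {
        List<String> openTags = Arrays.asList("html", "body", "table");

        EnclosingTagSketch defaults = new EnclosingTagSketch();
        defaults.require("tr", "tbody");
        System.out.println(defaults.implicitParent(openTags, "tr"));

        EnclosingTagSketch customized = new EnclosingTagSketch(); // tbody rule removed
        System.out.println(customized.implicitParent(openTags, "tr"));
    }
}
```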
Setting cleaner transformations
The following code snippet demonstrates how to set the transformations from the example:

    ...
    HtmlCleaner cleaner = new HtmlCleaner(...);
    ...
    CleanerTransformations transformations = new CleanerTransformations();

    TagTransformation tt = new TagTransformation("cfoutput");
    transformations.addTransformation(tt);

    tt = new TagTransformation("c:block", "div", false);
    transformations.addTransformation(tt);

    tt = new TagTransformation("font", "span", true);
    tt.addAttributeTransformation("size");
    tt.addAttributeTransformation("face");
    tt.addAttributeTransformation(
        "style",
        "${style};font-family=${face};font-size=${size};"
    );
    transformations.addTransformation(tt);
    ...
    cleaner.setTransformations(transformations);
    ...
    TagNode node = cleaner.clean(...);
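
The `${...}` placeholders in the `style` template refer to attributes of the source tag. The sketch below shows one plausible way such a template could be expanded, using only the JDK. It is not HtmlCleaner's actual code: the class `AttributeTemplateSketch`, the method `expand`, and the choice to replace a missing attribute with an empty string are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of HtmlCleaner.
class AttributeTemplateSketch {

    // Matches placeholders of the form ${attributeName}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /**
     * Replaces every ${name} in the template with the value of the source
     * attribute `name`. Missing attributes become "" (an assumption).
     */
    static String expand(String template, Map<String, String> sourceAttributes) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = sourceAttributes.get(matcher.group(1));
            // quoteReplacement keeps '$' and '\' in attribute values literal
            matcher.appendReplacement(out, Matcher.quoteReplacement(value == null ? "" : value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Attributes of a source tag such as <font face="Arial" size="2">
        Map<String, String> fontAttributes = new HashMap<>();
        fontAttributes.put("face", "Arial");
        fontAttributes.put("size", "2");
        System.out.println(
                expand("${style};font-family=${face};font-size=${size};", fontAttributes));
    }
}
```

For a `font` tag with `face="Arial"` and `size="2"` and no `style` attribute, this sketch yields `;font-family=Arial;font-size=2;`.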

