org.htmlcleaner
Class HtmlCleaner

java.lang.Object
  extended by org.htmlcleaner.HtmlCleaner

public class HtmlCleaner
extends Object

Main HtmlCleaner class.

It represents the public interface to the user. Its task is to call the tokenizer with the specified source HTML, traverse the list of produced tokens, and create the internal object model. It also offers a set of methods to write the resulting XML to a string, file, or any output stream.

Typical usage is the following:

    // create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();

    // take default cleaner properties
    CleanerProperties props = cleaner.getProperties();

    // customize cleaner's behavior with property setters
    props.setXXX(...);

    // Clean HTML taken from simple string, file, URL, input stream,
    // input source or reader. Result is root node of created
    // tree-like structure. Single cleaner instance may be safely used
    // multiple times.
    TagNode node = cleaner.clean(...);

    // optionally find parts of the DOM or modify some nodes
    TagNode[] myNodes = node.getElementsByXXX(...);
    // and/or
    Object[] myNodes = node.evaluateXPath(xPathExpression);
    // and/or
    aNode.removeFromTree();
    // and/or
    aNode.addAttribute(attName, attValue);
    // and/or
    aNode.removeAttribute(attName);
    // and/or
    cleaner.setInnerHtml(aNode, htmlContent);
    // and/or do some other tree manipulation/traversal

    // serialize a node to a file, output stream, DOM, JDom...
    new XXXSerializer(props).writeXmlXXX(aNode, ...);
    myJDom = new JDomSerializer(props, true).createJDom(aNode);
    myDom = new DomSerializer(props, true).createDOM(aNode);


Nested Class Summary
protected  class HtmlCleaner.NestingState
           
 
Constructor Summary
HtmlCleaner()
          Constructor - creates cleaner instance with default tag info provider and default properties.
HtmlCleaner(CleanerProperties properties)
          Constructor - creates the instance with default tag info provider and specified properties
HtmlCleaner(ITagInfoProvider tagInfoProvider)
          Constructor - creates the instance with specified tag info provider and default properties
HtmlCleaner(ITagInfoProvider tagInfoProvider, CleanerProperties properties)
          Constructor - creates the instance with specified tag info provider and specified properties
 
Method Summary
protected  void addPruneNode(TagNode node, org.htmlcleaner.CleanTimeValues cleanTimeValues)
           
 TagNode clean(File file)
           
 TagNode clean(File file, String charset)
           
 TagNode clean(InputStream in)
           
 TagNode clean(InputStream in, String charset)
           
 TagNode clean(Reader reader)
           
protected  TagNode clean(Reader reader, org.htmlcleaner.CleanTimeValues cleanTimeValues)
          Basic version of the cleaning call.
 TagNode clean(String htmlContent)
           
 TagNode clean(URL url)
          Creates a TagNode instance from the content downloaded from the specified URL.
 TagNode clean(URL url, String charset)
          Deprecated. 
protected  Set<ITagNodeCondition> getAllowTagSet(org.htmlcleaner.CleanTimeValues cleanTimeValues)
           
protected  Set<String> getAllTags(org.htmlcleaner.CleanTimeValues cleanTimeValues)
           
 String getInnerHtml(TagNode node)
          For the specified node, returns its content as a string.
 CleanerProperties getProperties()
           
protected  Set<ITagNodeCondition> getPruneTagSet(org.htmlcleaner.CleanTimeValues cleanTimeValues)
           
 ITagInfoProvider getTagInfoProvider()
           
 CleanerTransformations getTransformations()
           
 void initCleanerTransformations(Map transInfos)
           
protected  boolean isRemovingNodeReasonablySafe(TagNode startTagToken)
           
 void setInnerHtml(TagNode node, String content)
          For the specified tag node, defines its HTML content.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlCleaner

public HtmlCleaner()
Constructor - creates cleaner instance with default tag info provider and default properties.


HtmlCleaner

public HtmlCleaner(ITagInfoProvider tagInfoProvider)
Constructor - creates the instance with specified tag info provider and default properties

Parameters:
tagInfoProvider - Provider for tag filtering and balancing

HtmlCleaner

public HtmlCleaner(CleanerProperties properties)
Constructor - creates the instance with default tag info provider and specified properties

Parameters:
properties - Properties used during parsing and serializing

HtmlCleaner

public HtmlCleaner(ITagInfoProvider tagInfoProvider,
                   CleanerProperties properties)
Constructor - creates the instance with specified tag info provider and specified properties

Parameters:
tagInfoProvider - Provider for tag filtering and balancing
properties - Properties used during parsing and serializing
Method Detail

clean

public TagNode clean(String htmlContent)

clean

public TagNode clean(File file,
                     String charset)
              throws IOException
Throws:
IOException

clean

public TagNode clean(File file)
              throws IOException
Throws:
IOException

clean

@Deprecated
public TagNode clean(URL url,
                     String charset)
              throws IOException
Deprecated. 

Deprecated because unmanaged network I/O does not handle proxies, slow servers, or broken connections well. Callers should manage the connection themselves and pass HtmlCleaner only a stream.

Parameters:
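url - URL to download the HTML content from
charset - name of the character set used to decode the content
Returns:
The root TagNode of the cleaned document.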
Throws:
IOException
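
The deprecation note above suggests that the caller should own the connection. A minimal sketch of that idea, using only the JDK: the class name ManagedFetchSketch, the helper openWithTimeouts and the timeout values are hypothetical and not part of HtmlCleaner.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, not part of HtmlCleaner: the caller opens the
// connection and decides how long to wait for it.
final class ManagedFetchSketch {

    /** Opens the URL with explicit connect and read timeouts (milliseconds). */
    static InputStream openWithTimeouts(URL url, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // give up if the server is unreachable
        conn.setReadTimeout(readTimeoutMs);       // give up if the server stalls mid-response
        return conn.getInputStream();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: ManagedFetchSketch <url>");
            return;
        }
        // try-with-resources: the caller, not the cleaner, closes the stream.
        try (InputStream in = openWithTimeouts(new URL(args[0]), 5000, 10000)) {
            System.out.println("first byte: " + in.read());
        }
    }
}
```

The stream obtained this way could then be passed to clean(InputStream in, String charset) in place of the deprecated URL overload. A proxy can be supplied in the same spirit through URL.openConnection(Proxy).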

clean

public TagNode clean(URL url)
              throws IOException
Creates a TagNode instance from the content downloaded from the specified URL. The HTML encoding is resolved by trying the following in order: 1. the Content-Type response header, 2. META tags at the beginning of the HTML, 3. the platform's default charset.

Parameters:
url - URL to download the HTML content from
Returns:
The root TagNode of the cleaned document.
Throws:
IOException
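
The three-step encoding lookup described for clean(URL url) can be pictured with the following sketch. It is an illustration of the documented order only, not HtmlCleaner's actual implementation; the class CharsetResolutionSketch, its methods and its regular expressions are assumptions made for this example.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the documented fallback order.
final class CharsetResolutionSketch {

    // charset=... inside a Content-Type header value
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // charset=... inside a <meta ...> tag (both the http-equiv and charset forms)
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first charset name matched by pattern that this JVM supports, or null. */
    static String firstSupported(Pattern pattern, String text) {
        if (text == null) {
            return null;
        }
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name: ignore it and keep looking.
            }
        }
        return null;
    }

    /** Applies the order: 1. header, 2. META tag, 3. platform default. */
    static String resolveCharset(String contentTypeHeader, String htmlHead) {
        String fromHeader = firstSupported(HEADER_CHARSET, contentTypeHeader);
        if (fromHeader != null) {
            return fromHeader;
        }
        String fromMeta = firstSupported(META_CHARSET, htmlHead);
        if (fromMeta != null) {
            return fromMeta;
        }
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(resolveCharset("text/html", "<meta charset=\"ISO-8859-1\">"));
    }
}
```

Because the last step depends on the platform, the same page may decode differently on two machines when neither the header nor a META tag names a charset. Passing the charset explicitly through one of the clean(..., String charset) overloads avoids that ambiguity.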

clean

public TagNode clean(InputStream in,
                     String charset)
              throws IOException
Throws:
IOException

clean

public TagNode clean(InputStream in)
              throws IOException
Throws:
IOException

clean

public TagNode clean(Reader reader)
              throws IOException
Throws:
IOException

clean

protected TagNode clean(Reader reader,
                        org.htmlcleaner.CleanTimeValues cleanTimeValues)
                 throws IOException
Basic version of the cleaning call.

Parameters:
reader - the source reader; it is not closed by this method
Returns:
An instance of TagNode object which is the root of the XML tree.
Throws:
IOException

isRemovingNodeReasonablySafe

protected boolean isRemovingNodeReasonablySafe(TagNode startTagToken)
Parameters:
startTagToken - the tag node being considered for removal
Returns:
true if the tag has neither an id attribute nor a class attribute
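
The documented return contract can be restated as a small predicate. This is a sketch of the contract only, not the library's source: the real method takes a TagNode, while the hypothetical SafeRemovalSketch below uses a plain attribute map as a stand-in and assumes attribute names are already lowercase.

```java
import java.util.Map;

// Hypothetical stand-in: a Map of attribute names to values replaces TagNode.
final class SafeRemovalSketch {

    /** Returns true when the attributes contain neither "id" nor "class". */
    static boolean isRemovingReasonablySafe(Map<String, String> attributes) {
        return !attributes.containsKey("id") && !attributes.containsKey("class");
    }

    public static void main(String[] args) {
        System.out.println(isRemovingReasonablySafe(
                java.util.Collections.singletonMap("href", "/index.html")));
    }
}
```

A plausible reading, not stated in the source, is that elements carrying an id or class are more likely to be referenced by stylesheets or scripts, so removing them is treated as riskier.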

getProperties

public CleanerProperties getProperties()

getPruneTagSet

protected Set<ITagNodeCondition> getPruneTagSet(org.htmlcleaner.CleanTimeValues cleanTimeValues)

getAllowTagSet

protected Set<ITagNodeCondition> getAllowTagSet(org.htmlcleaner.CleanTimeValues cleanTimeValues)

addPruneNode

protected void addPruneNode(TagNode node,
                            org.htmlcleaner.CleanTimeValues cleanTimeValues)

getAllTags

protected Set<String> getAllTags(org.htmlcleaner.CleanTimeValues cleanTimeValues)

getTagInfoProvider

public ITagInfoProvider getTagInfoProvider()
Returns:
ITagInfoProvider instance for this HtmlCleaner

getTransformations

public CleanerTransformations getTransformations()
Returns:
Transformations defined for this instance of cleaner

getInnerHtml

public String getInnerHtml(TagNode node)
For the specified node, returns its content as a string.

Parameters:
node - the node whose content is returned
Returns:
node's content as string

setInnerHtml

public void setInnerHtml(TagNode node,
                         String content)
For the specified tag node, defines its HTML content. This causes the cleaner to re-clean the given HTML fragment and insert it into the node in place of the previous content.

Parameters:
node - the tag node whose content is replaced
content - the HTML fragment to clean and insert

initCleanerTransformations

public void initCleanerTransformations(Map transInfos)
Parameters:
transInfos -


Copyright © 2006-2014. All Rights Reserved.