Introduction

August 25th 2014: HtmlCleaner 2.9 released!

Various bug fixes, better HTML5 support, and a "silent mode" for less console output.

For more details see the release notes.

March 18th 2014: HtmlCleaner 2.8 released!

Performance improvements, better handling of foreign markup, and more.

For more details see the release notes.

December 10th 2013: HtmlCleaner 2.7 released!

Lots of improvements to namespace handling and CDATA sections, and we also have a new desktop app to play with.

For more details see the release notes.

September 5th 2013: HtmlCleaner 2.6.1 released!

Fixes a mistake in 2.6 that altered part of the published API.

For more details see the release notes.

August 9th 2013: HtmlCleaner 2.6 released!

Fixes various issues, including thread safety, Android support, and dependencies.

For more details see the release notes.

May 13th 2013: HtmlCleaner 2.5 released!

Fixes various issues related to DOCTYPEs, HTML5 and XML namespaces.

For more details see the release notes.

March 5th 2013: HtmlCleaner 2.4 released!

This major release merges in changes from the GitHub fork and fixes lots of issues.

For more details see the release notes.

February 8th 2013: HtmlCleaner 2.2.1 released!

This maintenance release contains a fix for a hex-encoding bug in 2.2.

February 21st, 2011: HtmlCleaner published to the Maven repository
<dependency>
	<groupId>net.sourceforge.htmlcleaner</groupId>
	<artifactId>htmlcleaner</artifactId>
	<version>2.2</version>
</dependency>
  
December 22nd, 2010: HtmlCleaner 2.2 released!

The new version brings most of the required features and a number of bug fixes. HtmlCleaner is now thread-safe, it introduces HTML-based serializers, and the API is extended to ease document manipulation. The parser is about 20% faster and now runs on Java 1.5+, benefiting from language improvements.
For details see the release notes.

HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.

Here is a typical example - improperly structured HTML containing unclosed tags and missing quotes:

<table id=table1 cellspacing=2px
    <h1>CONTENT</h1>
    <td><a href=index.html>1 -> Home Page</a>
    <td><a href=intro.html>2 -> Introduction</a>
 

After putting it through HtmlCleaner, XML similar to the following comes out:

<?xml version="1.0" encoding="UTF-8"?>
<html>
   <head />
   <body>
      <h1>CONTENT</h1>
      <table id="table1" cellspacing="2px">
         <tbody>
            <tr>
               <td>
                  <a href="index.html">1 -&gt; Home Page</a>
               </td>
               <td>
                  <a href="intro.html">2 -&gt; Introduction</a>
               </td>
            </tr>
         </tbody>
      </table>
   </body>
</html>
 

HtmlCleaner can be used in Java code, as a command-line tool or as an Ant task. It is designed to be small, independent (no runtime dependencies except JRE 1.5+), fast and flexible (its behavior is configurable through a number of parameters). Although the main motive was to prepare ordinary HTML for XML processing with XPath, XQuery and XSLT, the structured data produced by HtmlCleaner may be consumed and handled in many other ways.
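To illustrate the XPath use case, here is a minimal sketch that takes the cleaned XML from the example above and extracts the link targets using only the standard JDK XML APIs. It does not call HtmlCleaner itself: the XML is pasted in as a string constant, and the class name `CleanedXmlLinks`, the method `extractHrefs` and the XPath expression are illustrative assumptions, not part of the HtmlCleaner API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, used here only for illustration.
class CleanedXmlLinks {

    // Cleaned output copied from the example above (whitespace condensed).
    static final String CLEANED_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<html><head /><body><h1>CONTENT</h1>"
        + "<table id=\"table1\" cellspacing=\"2px\"><tbody><tr>"
        + "<td><a href=\"index.html\">1 -&gt; Home Page</a></td>"
        + "<td><a href=\"intro.html\">2 -&gt; Introduction</a></td>"
        + "</tr></tbody></table></body></html>";

    // Parses well-formed XML and returns the href of every link in table1,
    // in document order.
    static List<String> extractHrefs(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList attrs = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("//table[@id='table1']//a/@href", doc, XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<String>();
            for (int i = 0; i < attrs.getLength(); i++) {
                hrefs.add(attrs.item(i).getNodeValue());
            }
            return hrefs;
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions to keep the sketch short.
            throw new IllegalStateException("Could not evaluate XPath on cleaned XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractHrefs(CLEANED_XML));
    }
}
```

The original, uncleaned snippet could not be processed this way: a standard XML parser would reject its unclosed tags and unquoted attribute values before any XPath expression could be evaluated.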

Features Summary

  • HtmlCleaner parses input HTML and generates a tree structure suitable for programmatic manipulation.
  • Serializers are responsible for outputting the DOM structure to XML, HTML, DOM or JDom.
  • The parsing phase relies on tag descriptions, which can be customized by the user.
  • HtmlCleaner's behaviour can be configured through a number of parameters.
  • HtmlCleaner is thread-safe, meaning that a single instance can clean multiple HTML sources at the same time.
  • HtmlCleaner can be used from Java code, from the command line or as an Ant task.
  • HtmlCleaner requires JRE 1.5+.