Introduction

HtmlCleaner is an open source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring some order to the tags, attributes and ordinary text. For any given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to create the Document Object Model. However, you can provide custom tag and rule sets for tag filtering and balancing.

August 24th 2015: HtmlCleaner 2.14 released!

A number of improvements to the cleaning algorithm, plus some bug fixes around new HTML 5 tags.

For more details see the release notes.

July 1st 2015: HtmlCleaner 2.13 released!

Maintenance release fixing some recursion issues.

For more details see the release notes.

May 15th 2015: HtmlCleaner 2.12 released!

Maintenance release to fix an issue with option tags.

For more details see the release notes.

May 12th 2015: HtmlCleaner 2.11 released!

Adds much better HTML5 support, pipelining of HTML from stdin (and XML to stdout), and more.

For more details see the release notes.

October 31st 2014: HtmlCleaner 2.10 released!

Various small bug fixes.

For more details see the release notes.

August 25th 2014: HtmlCleaner 2.9 released!

Various bug fixes, better HTML5 support, and a "silent mode" for less console output.

For more details see the release notes.

March 18th 2014: HtmlCleaner 2.8 released!

Performance improvements, better handling of foreign markup, and more.

For more details see the release notes.

December 10th 2013: HtmlCleaner 2.7 released!

Lots of improvements to namespace handling and CDATA sections, and we also have a new desktop app to play with.

For more details see the release notes.

September 5th 2013: HtmlCleaner 2.6.1 released!

Fixes a mistake in 2.6 that altered part of the published API.

For more details see the release notes.

August 9th 2013: HtmlCleaner 2.6 released!

Fixes various issues including thread safety, Android support, and dependencies.

For more details see the release notes.

May 13th 2013: HtmlCleaner 2.5 released!

Fixes various issues related to DOCTYPEs, HTML5 and XML namespaces.

For more details see the release notes.

March 5th 2013: HtmlCleaner 2.4 released!

This major release merges in changes from the GitHub fork, and fixes lots of issues.

For more details see the release notes.

February 8th 2013: HtmlCleaner 2.2.1 released!

This maintenance release contains a fix for a hex-encoding bug in 2.2.

February 21st, 2011: HtmlCleaner published to the Maven repository
<dependency>
	<groupId>net.sourceforge.htmlcleaner</groupId>
	<artifactId>htmlcleaner</artifactId>
	<version>2.2</version>
</dependency>
  
December 22nd, 2010: HtmlCleaner 2.2 released!

The new version brings most of the required features and a number of bug fixes. HtmlCleaner is now thread-safe, it introduces HTML-based serializers, and the API has been extended to ease document manipulation. The parser is about 20% faster and now runs on Java 1.5+, benefiting from language improvements.
For details see the release notes.

How does it work?

Here is a typical example - improperly structured HTML containing unclosed tags and missing quotes:

<table id=table1 cellspacing=2px
    <h1>CONTENT</h1>
    <td><a href=index.html>1 -> Home Page</a>
    <td><a href=intro.html>2 -> Introduction</a>
 

After passing through HtmlCleaner, it comes out as XML similar to the following:

<?xml version="1.0" encoding="UTF-8"?>
<html>
   <head />
   <body>
      <h1>CONTENT</h1>
      <table id="table1" cellspacing="2px">
         <tbody>
            <tr>
               <td>
                  <a href="index.html">1 -&gt; Home Page</a>
               </td>
               <td>
                  <a href="intro.html">2 -&gt; Introduction</a>
               </td>
            </tr>
         </tbody>
      </table>
   </body>
</html>
 

HtmlCleaner can be used in Java code, as a command-line tool or as an Ant task. It is designed to be small, independent (no runtime dependencies except JRE 1.5+), fast and flexible (its behavior is configurable through a number of parameters). Although the main motive was to prepare ordinary HTML for XML processing with XPath, XQuery and XSLT, the structured data produced by HtmlCleaner may be consumed and handled in many other ways.
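To illustrate the XPath use case, here is a minimal sketch that queries the cleaned XML from the example above using only the JDK's built-in XML APIs. It does not call HtmlCleaner itself: the cleaned output is simply pasted in as a string (without indentation), and the names `CleanedOutputXPath`, `CLEANED_XML` and `linkTargets` are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of HtmlCleaner.
class CleanedOutputXPath {

    // The cleaned XML from the example above, hard-coded for illustration.
    static final String CLEANED_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<html><head /><body>"
        + "<h1>CONTENT</h1>"
        + "<table id=\"table1\" cellspacing=\"2px\"><tbody><tr>"
        + "<td><a href=\"index.html\">1 -&gt; Home Page</a></td>"
        + "<td><a href=\"intro.html\">2 -&gt; Introduction</a></td>"
        + "</tr></tbody></table>"
        + "</body></html>";

    // Returns the href of every link inside the table with id "table1",
    // in document order.
    static List<String> linkTargets(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Parsing succeeds only because the input is well-formed XML.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs = (NodeList) xpath.evaluate(
                "//table[@id='table1']//a/@href", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                result.add(hrefs.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(linkTargets(CLEANED_XML)); // prints [index.html, intro.html]
    }
}
```

The original, uncleaned markup could not be handled this way: a standard XML parser would reject it because of the unclosed tags and unquoted attribute values.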

Features Summary

  • HtmlCleaner parses input HTML and generates a tree structure suitable for programmatic manipulation.
  • Serializers are responsible for outputting the DOM structure to XML, HTML, DOM or JDom.
  • The parsing phase relies on tag descriptions, which can be customized by the user.
  • HtmlCleaner's behavior can be configured through a number of parameters.
  • HtmlCleaner is thread-safe, meaning that a single instance can clean multiple HTML sources at the same time.
  • HtmlCleaner can be used from Java code, from the command line or as an Ant task.
  • HtmlCleaner requires JRE 1.5+.