Introduction
This release introduces parsing transformations, an easy way to skip or change certain tags or attributes, along with some code improvements. See the release notes.
July 20th, 2008: HtmlCleaner 2.0 released!
The new version comes with a number of improvements and fixes, including:
- Complete code refactoring, making the Cleaner's API better and more flexible.
- Methods for DOM manipulation added.
- Basic XPath support added.
- New parameters introduced to control the cleaner's behavior.
HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.
Here is a typical example - improperly structured HTML containing unclosed tags and missing quotes:
    <table id=table1 cellspacing=2px
        <h1>CONTENT</h1>
        <td><a href=index.html>1 -> Home Page</a>
        <td><a href=intro.html>2 -> Introduction</a>
After it is run through HtmlCleaner, XML similar to the following comes out:
    <?xml version="1.0" encoding="UTF-8"?>
    <html>
        <head />
        <body>
            <h1>CONTENT</h1>
            <table id="table1" cellspacing="2px">
                <tbody>
                    <tr>
                        <td>
                            <a href="index.html">1 -> Home Page</a>
                        </td>
                        <td>
                            <a href="intro.html">2 -> Introduction</a>
                        </td>
                    </tr>
                </tbody>
            </table>
        </body>
    </html>
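To make the difference between the two snippets concrete, here is a minimal sketch that feeds both to the XML parser bundled with the JDK. It does not call HtmlCleaner itself: the two strings are copied by hand from the example above, and the class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for illustration. It is written for a current JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // The improperly structured input from the example (copied by hand).
    static final String DIRTY =
        "<table id=table1 cellspacing=2px <h1>CONTENT</h1> "
        + "<td><a href=index.html>1 -> Home Page</a> "
        + "<td><a href=intro.html>2 -> Introduction</a>";

    // The sample output from the example (copied by hand, not generated here).
    static final String CLEANED =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?> <html> <head /> <body> "
        + "<h1>CONTENT</h1> <table id=\"table1\" cellspacing=\"2px\"> <tbody> <tr> "
        + "<td> <a href=\"index.html\">1 -> Home Page</a> </td> "
        + "<td> <a href=\"intro.html\">2 -> Introduction</a> </td> "
        + "</tr> </tbody> </table> </body> </html>";

    /** Returns true if the text parses as well-formed XML with the JDK parser. */
    static boolean isWellFormed(String text) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (Exception e) {
            // Any parse failure means the text is not well-formed XML.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("dirty input well-formed:   " + isWellFormed(DIRTY));
        System.out.println("cleaned output well-formed: " + isWellFormed(CLEANED));
    }
}
```

The raw input is rejected as soon as the parser reaches the first unquoted attribute value, while the cleaned output is accepted, which is what allows standard XML tools to work with it.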
HtmlCleaner can be used in Java code, as a command-line tool or as an Ant task. It is designed to be small (the JAR file is currently around 55K), independent (no runtime dependencies except JRE 1.4+), fast and flexible (its behavior is configurable through a number of parameters). Although the main motive was to prepare ordinary HTML for XML processing with XPath, XQuery and XSLT, the structured data produced by HtmlCleaner may be consumed and handled in many other ways.
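As an illustration of the XPath use case, the following sketch runs a query over XML of the kind shown in the example, using only the javax.xml.xpath classes that ship with the JDK. It assumes the cleaned XML is already available as a string (here copied by hand from the sample output rather than produced by calling HtmlCleaner). The class name CleanedLinks and the method hrefs are hypothetical, and the code targets a current JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CleanedLinks {

    // Sample cleaned output, copied by hand from the example.
    static final String CLEANED =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?> <html> <head /> <body> "
        + "<h1>CONTENT</h1> <table id=\"table1\" cellspacing=\"2px\"> <tbody> <tr> "
        + "<td> <a href=\"index.html\">1 -> Home Page</a> </td> "
        + "<td> <a href=\"intro.html\">2 -> Introduction</a> </td> "
        + "</tr> </tbody> </table> </body> </html>";

    /** Collects the href attribute of every link inside a table cell, in document order. */
    static List<String> hrefs(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            "//td/a/@href",
            new InputSource(new StringReader(xml)),
            XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hrefs(CLEANED)); // prints [index.html, intro.html]
    }
}
```

The same query could not be evaluated against the original markup, because an XML parser refuses it before any XPath expression is applied.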

