Release notes

March 18, 2014: HtmlCleaner release 2.8
  • Fixed Issue 110: Performance problem take CPU to 100% [org.htmlcleaner.XPather]
  • Fixed Issue 109: Domserializer does not properly tag html ID attribute
  • Fixed Issue 107: Remove redundant escaping code from HtmlSerializer
  • Fixed Issue 106: JDOMSerializer fails unless you setUseCdataForScriptAndStyle=false
  • Fixed Issue 105: Element names in other namespaces than HTML should not be lowercased
  • Fixed Issue 104: svg:style rules should not be aggregated in the html:head section
  • Fixed Issue 82: Remove block-level restriction for tags

Thanks to Rafael Karst and Chris173 for patches used in this release.

There is a known issue with DomSerializer and HTML5 documents: see Issue 108

December 10, 2013: HtmlCleaner release 2.7
  • Added a new desktop app for use with HtmlCleaner (thanks to Marton Szeles)
  • Fixed Issue 99: SVG elements in HTML are incorrectly modified
  • Fixed Issue 98: char sequence &; will be treated as SpecialEntity
  • Fixed Issue 97: BrowserCompactXmlSerializer erroneous whitespace handling for inline tags
  • Fixed Issue 95: NPE when trying to use JDomSerializer
  • Fixed Issue 93: Invalid cleaned HTML when empty DIV
  • Fixed Issue 89: Raw List type in tagnode.getElementList(condition, recursive);
  • Fixed Issue 88: Illegal character escaping in attributes values
  • Fixed Issue 87: Reinstate the HtmlSerializers
  • Fixed Issue 67: New line after XML declaration is wrongly taken into account
  • Fixed Issue 33: CDATA blocks are not recognized
September 9, 2013: HtmlCleaner release 2.6.1
  • Fixed Issue 90: Re-instating the HtmlCleaner's public instance method clean(Reader)
August 9, 2013: HtmlCleaner release 2.6
  • Fixed Issue 86: Thread safety
  • Fixed Issue 85: String.isEmpty not supported on Android 2.2 -> java.lang.NoSuchMethodError
  • Fixed Issue 84: HTMLCleaner 2.5 don't ignore anymore CDATA not in script/style elements
  • Fixed Issue 76: Make Ant dependency optional
  • Fixed Issue 27: DomSerializer ignores the doctype
  • Fixed Issue 81: ConfigFileTagProvider, DefaultTagProvider out of sync
May 15, 2013: HtmlCleaner release 2.5
  • Fixed Issue 77: HeadlessTagNode Constructor Does Not Correctly Copy Wrapped TagNode's Children
  • Fixed Issue 69: leaking resources - connection not closed
  • Fixed Issue 67: New line after XML declaration is wrongly taken into account
  • Fixed Issue 58/62: xml: namespace error on DomSerializer
  • Fixed Issue 55: Doctype upper case, name and validation
  • Fixed Issue 52: Bad serialization of HTML5 DOCTYPE clauses
  • Fixed Issue 48: Multiple requests for the HTML page in 2.2
  • DocType handling has been significantly enhanced with support for parsing all currently valid DocTypes and providing additional information via the Java API.
Mar. 5, 2013: HtmlCleaner release 2.4
  • This is a major merge of the GitHub fork into the core HtmlCleaner code.
Feb. 8, 2013: HtmlCleaner release 2.2.1
  • An issue with Hex-based character encoding was fixed.
Dec. 22, 2010: HtmlCleaner release 2.2
  • HtmlCleaner is now thread-safe. A single instance can be used from multiple threads to parse multiple HTML sources safely. All serializers included in the package are thread-safe as well.
  • HTML-based serializers are introduced, intended to produce browser-friendly HTML. There are now two serializer flavors: XML (simple, pretty, compact) and HTML (simple, pretty, compact). HTML serializers don't strictly produce well-formed XML, but rather HTML for further browser consumption (for example, special entities like &Alpha; are preserved rather than escaped as &amp;Alpha;, and empty tags like script are serialized as <script></script> rather than <script />).
  • New parameter transResCharsToNCR is introduced, telling whether reserved XML characters (&, ", ', <, >) are serialized as numeric character references (&#dd;).
  • New parameter transSpecialEntitiesToNCR is introduced, telling whether special HTML entities (&Alpha; for example) are serialized as numeric character references (&#dd;).
  • Parameter omitHtmlEnvelope is deprecated. It is replaced by a new parameter omitEnvelope in the command line/Ant task and by an optional parameter in the XXXSerializer.writeToXXX() methods, which moves this logic to the right place. This way the whole body without its enclosing tags is serialized, not only the first inner node as before.
  • The list of special HTML entities is extended with a number of new ones. The class SpecialEntity that holds them has a public method addEntity(entityName, entityCode) to define new ones if some are still missing.
  • TagNode has a number of new methods for easier node manipulation (see API docs).
  • The visitor concept is implemented in TagNode to make it easy to traverse the DOM tree and collect data or update the document.
  • Pretty XML/HTML serializers have an optional constructor parameter specifying the indentation string (default is the TAB character).
  • Tag definitions updated (col, legend...) to be consistent with the browsers.
  • Invalid XML characters are skipped during parsing/serialization.
  • DOM/JDom serialization bug fixes.
  • Namespaces found in source HTML are now handled properly (depending on omitXmlnsAttributes parameter).
  • Method HtmlCleaner.getAllTags() is removed, since this approach is incompatible with the newly introduced thread safety.
  • A few classes are renamed: ContentToken -> ContentNode, CommentToken -> CommentNode.
  • Parameter ignoreQuestAndExclam now has a default value of true.
  • Source code now has the standard Maven structure.
  • HtmlCleaner now depends on Java runtime 1.5+.
  • For the list of fixed bugs see Bug tracking at SourceForge.
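
To illustrate what the transResCharsToNCR parameter above describes, here is a minimal standalone sketch of replacing the five reserved XML characters with decimal numeric character references. It is not HtmlCleaner's own code: the class name ReservedCharNcrSketch and the method toNcr are hypothetical, and the sketch makes no attempt to detect entities that are already escaped in the input.

```java
final class ReservedCharNcrSketch {

    // Hypothetical helper (not part of HtmlCleaner): replaces each of
    // & " ' < > with its decimal numeric character reference.
    static String toNcr(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                case '"':
                case '\'':
                case '<':
                case '>':
                    // (int) c is the character's code, e.g. 60 for '<'
                    out.append("&#").append((int) c).append(';');
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNcr("a < b & c")); // prints "a &#60; b &#38; c"
    }
}
```

Because every ampersand is rewritten, feeding already-escaped text through this sketch would escape it a second time; a real serializer has to decide how to treat existing entities.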
Sep. 02, 2008: HtmlCleaner release 2.1
  • Parsing transformations are introduced to make it easy to skip or change specified tags or attributes during the cleanup process.
  • A few more constructors are added to class HtmlCleaner, making it possible to reuse the same cleaner properties with multiple cleaner instances.
  • Code cleanup.
Jul. 15, 2008: HtmlCleaner release 2.0
  • A complete code refactoring is done to better separate the roles of cleaner, cleaner properties, object model nodes and serializers. The API is not compatible with previous versions, though it is still very simple to use.
  • Post-cleaning node manipulation is enabled with a rich set of methods in the TagNode class. There is now no need to create DOM or JDom out of the HtmlCleaner object model in order to select, add or remove nodes or attributes.
  • Basic XPath is supported on the HtmlCleaner object model. Despite being a partial implementation, it should be powerful enough to find or collect nodes/attributes/text even with fairly complex criteria.
  • An already cleaned HtmlCleaner object model can be modified with HtmlCleaner.setInnerHtml(node, html), similar to the DHTML feature for setting the inner HTML of an object.
  • Creating a custom tag rule set is now much easier: define it in an XML configuration file.
  • New properties booleanAttributeValues and nodeByXPath for setting cleaner's behavior are introduced.
  • Test cases added to source code.
  • Memory leak problem in Java 1.4 fixed.
  • A number of bug fixes.
Dec. 26, 2007: HtmlCleaner release 1.6
  • New flag parameter ignoreQuestAndExclam is introduced offering control over special tags - <?TAGNAME....>, <!TAGNAME....>.
  • Bug fixes.
Sep. 27, 2007: HtmlCleaner release 1.55
  • Added Reader based HtmlCleaner constructors.
  • New parameter pruneTags is introduced, offering a way to remove undesired tags along with all their children from the XML tree after parsing and cleaning.
  • Bug fixes.
Sep. 8, 2007: HtmlCleaner release 1.5
  • Several bug fixes.
  • Added option to escape XML content in DOM serializer - HtmlCleaner.createDOM(boolean escapeXml)
Aug. 24, 2007: HtmlCleaner release 1.4
  • New flag allowHtmlInsideAttributes is introduced in order to give the parser flexibility in handling attribute values.
  • Several bug fixes.
Jul. 12, 2007: HtmlCleaner release 1.3
  • New browser-compact serializer added, which preserves a single whitespace character where multiple occur.
  • New flag namespacesAware is introduced in order to control namespace prefixes and namespace declarations. It should be used instead of omitXmlnsAttributes that existed in previous versions and had limited functionality.
  • New flag allowMultiWordAttributes is introduced giving HtmlCleaner's parser flexibility to (dis)allow tag attributes consisting of multiple words.
  • New flag useEmptyElementTags is introduced in order to control the output of tags with an empty body
    (<xxx/> vs <xxx></xxx>).
  • Several bug fixes.
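
The whitespace behavior described for the browser-compact serializer above can be sketched in isolation as follows. This is only an illustration of collapsing whitespace runs to a single space, not the serializer's actual implementation; the class name WhitespaceCollapseSketch and the method collapse are hypothetical.

```java
import java.util.regex.Pattern;

final class WhitespaceCollapseSketch {

    // One or more consecutive whitespace characters (space, tab, newline, ...)
    private static final Pattern WS_RUN = Pattern.compile("\\s+");

    // Hypothetical helper (not part of HtmlCleaner): replaces every run of
    // whitespace with exactly one space, so word separation is preserved.
    static String collapse(String text) {
        return WS_RUN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println("[" + collapse("Hello   \n\t world ") + "]"); // prints "[Hello world ]"
    }
}
```

Unlike fully stripping whitespace, this keeps one space at each boundary, which matters to browsers when the text sits between inline elements.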
May 05, 2007: HtmlCleaner release 1.2
  • Several bugs fixed.
  • New flags added to control behaviour of unknown/deprecated tags.
  • New flag added to optionally remove HTML envelope from resulting XML.
  • JDOM serializer added.
Apr. 13, 2007: HtmlCleaner release 1.13
  • Serialization of XML to Java DOM supported with createDOM() method of HtmlCleaner class.
Jan. 28, 2007: HtmlCleaner release 1.12
  • Hexadecimal entities escaping supported (e.g. &#x09;).
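
For readers unfamiliar with the notation, a hexadecimal entity such as &#x09; names a character by its code point in base 16 (here the TAB character). The following standalone sketch shows one way such references could be decoded. It is not taken from HtmlCleaner; the class name HexEntitySketch and the method decode are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HexEntitySketch {

    // Matches &#x...; or &#X...; with one to six hexadecimal digits
    private static final Pattern HEX_REF = Pattern.compile("&#[xX]([0-9a-fA-F]{1,6});");

    // Hypothetical helper (not part of HtmlCleaner): replaces each hexadecimal
    // character reference with the character it denotes. References outside
    // the Unicode range are left untouched.
    static String decode(String text) {
        Matcher m = HEX_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1), 16);
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            // quoteReplacement guards against '$' and '\' in the replacement
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#x41;&#x42;")); // prints "AB"
    }
}
```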
Jan. 11, 2007: HtmlCleaner release 1.1
  • Compact XML serializer improved.
  • Minor XML escaping bug fixed.
Jan. 02, 2007: HtmlCleaner release 1.0.5
  • An HTML tokenizing bug fixed.
  • Methods of the class TagNode made public in order to enable creating custom XML serializers.
  • Method writeXml(XmlSerializer) added to HtmlCleaner class in order to support creating custom XML serializers.
Dec. 23, 2006: HtmlCleaner release 1.0
  • Minor bug in advanced XML escaping fixed.
Dec. 05, 2006: HtmlCleaner release 0.9
  • HtmlCleaner Ant task added
  • XML compact serializer added - strips all unneeded whitespace from the result
  • A few minor bugs fixed
Nov. 27, 2006: HtmlCleaner initial release (version 0.8)