HtmlCleaner Project Home Page

Release notes

June. 19, 2023: HtmlCleaner release 2.29

Fix for CVE-2023-34624: Stack overflow with excessive nested tags
237 customizing javadocExecutable in pom.xml breaks the build

Thanks to niol, PoppingSnack, and Ralf Purnhagen for bug reports and contributions for this release!

Note the addition of a maxDepth cleaner property that defines the maximum nested tag depth. The default is 1000.
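To illustrate why a depth cap guards against this kind of stack overflow, here is a minimal sketch that counts tag nesting iteratively and rejects input deeper than a limit. It is not HtmlCleaner's implementation: the class DepthGuard, its method, and the naive tag pattern are assumptions made for illustration. Only the default of 1000 comes from the note above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a naive nesting-depth check, not HtmlCleaner code. */
class DepthGuard {
    // Matches simple opening, closing, and self-closing tags such as <div>, </div>, <br/>.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>");

    /** Returns the deepest nesting level seen, or throws once maxDepth is exceeded. */
    static int maxNesting(String html, int maxDepth) {
        int depth = 0;
        int deepest = 0;
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (selfClosing) {
                continue; // <br/> and similar do not open a new level
            }
            if (closing) {
                depth = Math.max(0, depth - 1);
            } else {
                depth++;
                if (depth > maxDepth) {
                    throw new IllegalStateException("Nesting deeper than " + maxDepth);
                }
                deepest = Math.max(deepest, depth);
            }
        }
        return deepest;
    }

    public static void main(String[] args) {
        // 1000 mirrors the default mentioned in the release note.
        System.out.println(maxNesting("<div><p><b>x</b></p></div>", 1000));
    }
}
```

Because the check is a loop with a counter rather than a recursive descent, hostile input fails fast with an exception and does not exhaust the call stack.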

April. 29, 2023: HtmlCleaner release 2.28

228 svg incorrectly not marked as phrasing content in HTML5
229 style-tag should not be allowed in body in HTML5
230 Div element wrongly filtered out from dl children when using HTML 5
231 SVG moved after <p> elements

Thanks to jlacour31, Simon Urli and Michael Hamann for bug reports and contributions for this release!

March. 24, 2023: HtmlCleaner release 2.27

Updates pom.xml to target JDK 1.8, as earlier versions are no longer supported.

Thanks to Ruoyu Zhong for the nudge!

January. 18, 2022: HtmlCleaner release 2.26

Updates JDOM to version 2.0.6.1 to mitigate CVE-2021-33813

Thanks to Rafal Sierkiewicz for the patch.

September. 24, 2021: HtmlCleaner release 2.25

221 Wrong parsing of html entities in case of using recognizeUnicodeChars
224 [Android] Unsupported Flag 256 for Pattern.java

Thanks to Simon Urli and allentown for their help with this release.

April. 29, 2020: HtmlCleaner release 2.24

220 Information is lost in case of double escaping in attributes
219 H3 closing tag incorrectly placed
217 StackOverflow in DomSerializer
216 elementNames(org.htmlcleaner.HtmlCleanerTest) test failure

A new serializer, TraversalDomSerializer, has been added. This is an experimental serializer that currently creates output that is not exactly the same as the regular DomSerializer, but may be useful where you need to reduce the memory footprint of HtmlCleaner for processing extremely large pages.

NOTE: As a side effect of implementing the new serializer, some lower-level interfaces (e.g. HtmlToken, BaseToken) have had to be refactored. This may affect some existing integrations if you interact with Token-level APIs - take care when upgrading to this version if you do.

Note that for issue 220 a change has been made to attribute processing: entities in attributes are normalised and escaped on serialisation rather than in TagNode. For example, double-escaped entities will appear when querying the TagNode interface, but will be normalised into standard HTML format using the serialiser.

Thanks to Simon Urli, Daniel Gonzalez, Sam Hutchins, and niol for their help with this release.

Sep. 6, 2019: HtmlCleaner release 2.23

215 INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified
212 Returned DOM Document instance should not contain escaped characters for attribute values
211 IOException when trying to reset unsuccessful CDATA end lookup
210 escapeXml ignored by DomSerializer
203 HC doesn't properly escape < and > characters anymore in attributes
201 no-break space not removed or converted when creating TagNode
65 no-break space entity replaced with Â

Thanks to Screaming Frog, Vincent Massol, Sam Hutchins, Tiris Valinlore, and Anthony Pessy for their help with this release.

Apr. 24, 2018: HtmlCleaner release 2.22

202 test suite fails with java9: Transformer changed behaviour
200 Adds null end of the DOCTYPE when there is no DOCTYPE
199 "whitespace: pre" CSS property not taken into account
198 Make XPath methods protected to allow extension
193 MathML equations without a namespace prefix are breaking paragraphs
192 Support HTML5 inline content without namespaces
191 Infinite loop on time b li
190 NullPointerException in HtmlCleaner.makeTree
189 Unclosed CDATA sections results in odd behaviour with large documents

Thanks to niol, Yusuf Figanioglu, Arkanosis, Nur, Dan, and Michael Ryan for their help with this release.

May. 11, 2017: HtmlCleaner release 2.21

Bug fix #188 Regression in 2.20, "prune"-tags do not get removed anymore

Thanks to Markus Schlegel for their help with this release.

May. 2, 2017: HtmlCleaner release 2.20

Enhancement - add ability to output to an Ant property from HtmlCleaner.
Enhancement 186 Add strict error checking flag as optional DomSerializer constructor
Bug fix 185 Unclosed CDATA can cause ArrayIndexOutOfBoundsException
Bug fix 175 HTMLCleaner generates invalid attribute names from bad HTML
Bug fix 125 Html elements are doubled sometimes
Bug fix 57 Use first attribute if duplicated
Bug fix 51 setUseCdataForScriptAndStyle should apply on HtmlSerializer

Thanks to Gintas Grigelionis, Michael Ryan, Philipp Jeitner, legrass and Ivan Bondarenko for their help with this release.

Note there is an algorithm change in this release (see bug 175) where we by default try to change attribute names into valid XML attribute names; you can change this behaviour with two new cleaner properties: allowInvalidAttributeNames and invalidAttributeNamePrefix.
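As a rough sketch of what turning attribute names into valid XML names can involve, the example below strips characters that are not legal in a simple XML name and adds a prefix when the result would not start with a letter or underscore. The property names allowInvalidAttributeNames and invalidAttributeNamePrefix come from the note above; the class AttributeNames and the exact rule it applies are assumptions, not the library's actual algorithm.

```java
/** Illustrative only: a hypothetical attribute-name sanitizer, not HtmlCleaner code. */
class AttributeNames {
    /**
     * Returns the name unchanged when invalid names are allowed. Otherwise drops
     * characters outside a simplified XML name alphabet, prepends the prefix if
     * the first remaining character cannot start a name, and returns null when
     * nothing usable remains (the caller would drop the attribute).
     */
    static String sanitize(String name, boolean allowInvalid, String prefix) {
        if (allowInvalid) {
            return name;
        }
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.' || c == ':') {
                sb.append(c);
            }
        }
        if (sb.length() == 0) {
            return null;
        }
        char first = sb.charAt(0);
        if (!Character.isLetter(first) && first != '_') {
            sb.insert(0, prefix);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("1abc", false, "_"));
        System.out.println(sanitize("on\"click", false, "_"));
    }
}
```

The sketch assumes the prefix itself begins with a character that may start an XML name.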

February. 7, 2017: HtmlCleaner release 2.19

Bug fix 183 Real world html causes clean() to eat all available memory
Bug fix 180 duplicate tags added at the end (script,body,html)
Bug fix 173 Infinite loop and OOM if uppercase P tag with xmlns
Bug fix 172 Infinite loop occurs when MathML tags are present
Bug fix 169 Several issues with CDATA blocks
Bug fix 168 DomSerializer doesn't seem to take into account the namespacesAware configuration
Enhancement 167 Make it easier to extend DomSerializer
Bug fix 166 Bad handling of <p> inside <ul> in HTML 5
Bug fix 164 Font tag is not known anymore
Enhancement 159 Add back in Utils.fullUrl()
Bug fix 158 NullPointerException in HtmlCleaner.saveToLastOpenTag

Thanks to Code Buddy, Haadar, Martin Denham, Tibor Dimitriu, Vincent Massol, Guillaume Delhumeau, and Rob Decker for their help with this release.

Note we have a small algorithm change in this release (see bug 166) to help make cleaning of lists and tables more sensible by inserting missing LIs and TDs in the first instance rather than moving invalid content outside; this should improve the quality of cleaned HTML, but YMMV. Please give your feedback on this change and report any bugs!

November. 2, 2016: HtmlCleaner release 2.18

Bug fix 179 java -jar option does not work

Thanks to Card Package for their help with this release.

October. 19, 2016: HtmlCleaner release 2.17

Bug fix 178 java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.htmlcleaner.TagNode
Bug fix 176 Crash: IllegalArgumentException in convertToUnicode
Bug fix 165 Invalid HTML characters are not removed

Thanks to Code Buddy and Matthew Fulgo for their help with this release.

December. 2, 2015: HtmlCleaner release 2.16

Bug fix 157 Infinite loop occurs
Bug fix 156 style element should not always be moved to head in HTML5
Bug fix 155 Memory and resource blowup on particular documents
Bug fix 154 single quote character must not get serialized as "&apos;" by html serializers
Bug fix 153 NullPointerException when DOCTYPE doesn't contain a qualifiedName
Bug fix 130 Apply recognizeUnicodeChars property when cleaning
Bug fix 118 HTML always translating special entities

Thanks to Andrey Krivonogov for the thread interruption patch, Seanster for the special entities patch, Kricket for the unicode patch, and Code Buddy, Lee Kyung Min, Oscar Scholten, Matthew Sharpe for their help squashing bugs in this release.

October. 1, 2015: HtmlCleaner release 2.15

New Feature FR20 Added useCdataFor parameter
Bug Fix 152 Destruction of Unicode characters above 65535

Thanks to Sebastian Paulus for reporting the unicode issue and diagnosing the fix for it, and thanks to Salvatore for the patch adding the useCdataFor option.
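For context on the Unicode issue, characters above 65535 occupy two Java char values (a surrogate pair), so code that processes text one char at a time can split them. The sketch below shows the general technique of iterating by code point when emitting numeric character references. It illustrates the pitfall in general terms and is not the patch applied in this release; the class CodePointEscaper is hypothetical.

```java
/** Illustrative only: code-point-aware escaping, not HtmlCleaner code. */
class CodePointEscaper {
    /** Replaces every non-ASCII code point with a decimal numeric character reference. */
    static String toNumericReferences(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i); // reads a full surrogate pair when present
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp); // advances by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is stored as the surrogate pair \uD83D\uDE00.
        System.out.println(toNumericReferences("a\uD83D\uDE00"));
    }
}
```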

August. 24, 2015: HtmlCleaner release 2.14

149 StackOverflowError
148 Giving mixed-case filenames doesn't work on case-sensitive filesystems
147 Correction of ul structure
146 2.13 does not correct table structure
144 schema.org elements such as meta and link are removed
140 CRITICAL: endless loop in some tags (ref #129, #126)
139 option tag displayed after optgroup
136 ClassCastException

Thanks to Wolfgang Koppenberger, Martin Denham, Code Buddy, James Le Cuirot, Haadar and Jeb42 for reporting these problems and helping fix them.

Note also that the cleaning algorithm is once again tweaked - this should align better with current browsers than the previous release, but in some cases means we're being slightly more lenient than the W3 spec.

July. 1, 2015: HtmlCleaner release 2.13

Fixed issue 129 Defining required parent for <legend> element causes out-of-memory error
Fixed issue 126 Infinite loop on HTML parsing
Fixed issue 138 label tags are removed (fatalTag problem 2)
Fixed issue 141 OutOfMemory error

Thanks to Rasifiel, Wolfgang Koppenberger and Martin Denham for reporting these problems and helping to diagnose a solution.

Note that the ordering of processes within the main cleaning algorithm is altered in this release; this avoids the potential for infinite loops, but some types of cleaning that were successful before may have different results - YMMV. Suggestions for improving the core engine are very welcome!

May. 15, 2015: HtmlCleaner release 2.12

Fixed issue 137 Options tags are removed inside select

Thanks to Wolfgang Koppenberger for reporting this problem and diagnosing the fix for it

May. 12, 2015: HtmlCleaner release 2.11

Feature 19: Support use of stdin and stdout for pipes on command line
Feature 10: Make OSGI-compatible bundle
Feature 15: Improved HTML5 support
Fixed issue 135: Some pages cause two different NullPointerExceptions
Fixed issue 134: Some pages cause IndexOutOfBoundsException
Fixed issue 133: Some pages cause NullPointerException
Fixed issue 132: ClassCastException: ArrayList cannot be cast to org.htmlcleaner.BaseToken

Thanks to Philokypros Ioulianou for the patches used in this release.

October. 31, 2014: HtmlCleaner release 2.10

Feature 16: Make DefaultTagProvider extendable
Fixed issue 128: Regression: legend tag is stripped
Fixed issue 127: DomSerializer loses all attributes of root node
Fixed issue 126: Infinite loop on HTML parsing

Thanks to Rasifiel for patches used in this release.

August. 25, 2014: HtmlCleaner release 2.9

Feature 14: Added "silent mode" feature. Use --quiet to turn off output.
Fixed Issue 124: Class cast exception
Fixed Issue 123: Endless loop in meta tags
Fixed Issue 121: Should be possible to exclude "meta" tag
Fixed Issue 119: Tag combination causes internal loop
Fixed Issue 117: Parsing of CSS content property incorrect
Fixed Issue 116: Result XML different between DomSerializer and XmlSerializer
Fixed Issue 115: Recognise and remove HTML namespaces
Fixed Issue 114: Odd behaviour when using namespaces
Fixed Issue 113: PATCH - limit the number of times identical tags can be copied forward
Fixed Issue 112: HTML5 tags missing from DefaultTagProvider
Fixed Issue 111: STRONG in DefaultTagProvider isn't used correctly when constructing CLOSE_BEFORE_COPY_INSIDE_TAGS
Fixed Issue 103: Attributes of HTML element are stripped under some circumstances
Applied Patch 16: Patch for deserializing entities when reading HTML

Thanks to Alexey Lukashev and Shaun Kalley for patches used in this release.

There is a known issue with DomSerializer and HTML5 documents: see Issue 108

March. 18, 2014: HtmlCleaner release 2.8

Fixed Issue 110: Performance problem take CPU to 100% [org.htmlcleaner.XPather]
Fixed Issue 109: Domserializer does not properly tag html ID attribute
Fixed Issue 107: Remove redundant escaping code from HtmlSerializer
Fixed Issue 106: JDOMSerializer fails unless you setUseCdataForScriptAndStyle=false
Fixed Issue 105: Element names in other namespaces than HTML should not be lowercased
Fixed Issue 104: svg:style rules should not be aggregated in the html:head section
Fixed Issue 82: Remove block-level restriction for tags

Thanks to Rafael Karst and Chris173 for patches used in this release.

There is a known issue with DomSerializer and HTML5 documents: see Issue 108

December. 10, 2013: HtmlCleaner release 2.7

Added a new desktop app for use with HtmlCleaner (thanks to Marton Szeles)
Fixed Issue 99: SVG elements in HTML are incorrectly modified
Fixed Issue 98: char sequence &; will be treated as SpecialEntity
Fixed Issue 97: BrowserCompactXmlSerializer erroneous whitespace handling for inline tags
Fixed Issue 95: NPE when trying to use JDomSerializer
Fixed Issue 93: Invalid cleaned HTML when empty DIV
Fixed Issue 89: Raw List type in tagnode.getElementList(condition, recursive);
Fixed Issue 88: Illegal character escaping in attributes values
Fixed Issue 87: Reinstate the HtmlSerializers
Fixed Issue 67: New line after XML declaration is wrongly taken into account
Fixed Issue 33: CDATA blocks are not recognized

September. 9, 2013: HtmlCleaner release 2.6.1

Fixed Issue 90: Re-instating the HtmlCleaner's public instance method clean(Reader)

August. 9, 2013: HtmlCleaner release 2.6

Fixed Issue 86: Thread safety
Fixed Issue 85: String.isEmpty not supported on Android 2.2 -> java.lang.NoSuchMethodError
Fixed Issue 84: HTMLCleaner 2.5 don't ignore anymore CDATA not in script/style elements
Fixed Issue 76: Make Ant dependency optional
Fixed Issue 27: DomSerializer ignores the doctype
Fixed Issue 81: ConfigFileTagProvider, DefaultTagProvider out of sync

May. 15, 2013: HtmlCleaner release 2.5

Fixed Issue 77: HeadlessTagNode Constructor Does Not Correctly Copy Wrapped TagNode's Children
Fixed Issue 69: leaking resources - connection not closed
Fixed Issue 67: New line after XML declaration is wrongly taken into account
Fixed Issue 58/62: xml: namespace error on DomSerializer
Fixed Issue 55: Doctype upper case, name and validation
Fixed Issue 52: Bad serialization of HTML5 DOCTYPE clauses
Fixed Issue 48: Multiple requests for the HTML page in 2.2
DocType handling has been significantly enhanced with support for parsing all currently valid DocTypes and providing additional information via the Java API.

Mar. 5, 2013: HtmlCleaner release 2.4

This is a major merge of the Github fork into the core HtmlCleaner code.

Feb. 8, 2013: HtmlCleaner release 2.2.1

An issue with Hex-based character encoding was fixed.

Dec. 22, 2010: HtmlCleaner release 2.2

HtmlCleaner is now thread-safe. A single instance can be used from multiple threads to parse multiple HTML sources safely. All serializers included in the package are thread-safe as well.
HTML-based serializers are introduced, intended to produce browser-friendly HTML. There are now two serializer flavors: XML (simple, pretty, compact) and HTML (simple, pretty, compact). HTML serializers don't strictly produce well-formed XML, but rather HTML for further browser consumption (for example, special entities like &Alpha; are preserved rather than escaped as &amp;Alpha;, and empty tags like script are serialized as <script></script> rather than <script />).
New parameter transResCharsToNCR is introduced, telling whether reserved XML characters (&, ", ', <, >) are serialized to their Numeric Character Representations (&#dd;)
New parameter transSpecialEntitiesToNCR is introduced, telling whether special HTML entities (&Alpha; for example) are serialized to their Numeric Character Representations (&#dd;)
Parameter omitHtmlEnvelope is deprecated; a new parameter omitEnvelope for the command line/Ant, along with an optional parameter in the XXXSerializer.writeToXXX() methods, is introduced instead, moving this logic to the right place. This way the whole body without enclosing tags is serialized, not only the first inner node as before.
The list of special HTML entities is extended with a number of new ones. The SpecialEntity class that holds them has a public method addEntity(entityName, entityCode) for defining any that are still missing.
TagNode has a number of new methods for easier node manipulation (see API docs).
Visitor concept is implemented in TagNode in order to easily traverse DOM tree and collect some data/update the document.
Pretty XML/HTML serializers have optional parameter in constructors specifying indentation string (default is TAB character).
Tag definitions updated (col, legend...) to be consistent with the browsers.
Invalid XML characters are skipped during parsing/serialization.
DOM/JDom serialization bug fixes.
Namespaces found in source HTML are now handled properly (depending on omitXmlnsAttributes parameter).
Method HtmlCleaner.getAllTags() is removed, since this approach is incompatible with the newly introduced thread-safety.
A few classes are renamed: ContentToken -> ContentNode, CommentToken -> CommentNode.
Parameter ignoreQuestAndExclam has now default value true.
Source code now has standard MAVEN structure.
HtmlCleaner now depends on Java runtime 1.5+.
For the list of fixed bugs see Bug tracking at SourceForge.
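As an illustration of what serializing reserved XML characters to Numeric Character Representations means (the behaviour that transResCharsToNCR switches on, per the list above), here is a minimal sketch. The class ReservedCharEscaper is hypothetical and is not part of HtmlCleaner; it only demonstrates the &#dd; form for the five reserved characters.

```java
/** Illustrative only: reserved XML characters to &#dd; form, not HtmlCleaner code. */
class ReservedCharEscaper {
    /** Replaces &, ", ', < and > with decimal numeric character references. */
    static String toNcr(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                case '"':
                case '\'':
                case '<':
                case '>':
                    // The cast yields the character's decimal code, e.g. 60 for '<'.
                    out.append("&#").append((int) c).append(';');
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNcr("<a href=\"x\">"));
    }
}
```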

Sep. 02, 2008: HtmlCleaner release 2.1

Parsing transformations were added so that specified tags or attributes can easily be skipped or changed during the cleanup process.
A few more constructors were added to the HtmlCleaner class, making it possible to reuse the same cleaner properties across multiple cleaner instances.
Code cleanup.

Jul. 15, 2008: HtmlCleaner release 2.0

Complete code refactoring was done to better separate the roles of the cleaner, cleaner properties, object model nodes and serializers. The API is not compatible with previous versions, though it is still very simple to use.
Post-cleaning node manipulation is enabled with a rich set of methods in the TagNode class. There is now no need to create DOM or JDom out of the HtmlCleaner object model in order to select, add or remove nodes or attributes.
Basic XPath is supported on the HtmlCleaner object model. Despite being a partial implementation, it should be powerful enough to find or collect nodes/attributes/text even with fairly complex criteria.
Modifying an already cleaned HtmlCleaner object model is enabled with HtmlCleaner.setInnerHtml(node, html), similar to the DHTML feature for setting the inner HTML of an object.
Creating a custom tag rule set is now much easier: it can be defined in an XML configuration file.
New properties booleanAttributeValues and nodeByXPath for setting cleaner's behavior are introduced.
Test cases added to source code.
Memory leak problem in Java 1.4 fixed.
A number of bug fixes.

Dec. 26, 2007: HtmlCleaner release 1.6

New flag parameter ignoreQuestAndExclam is introduced offering control over special tags - <?TAGNAME....>, <!TAGNAME....>.
Bug fixes.

Sep. 27, 2007: HtmlCleaner release 1.55

Added Reader based HtmlCleaner constructors.
New parameter pruneTags is introduced offering a way to remove undesired tags with all the children from XML tree after parsing and cleaning.
Bug fixes.
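The idea behind pruneTags, removing undesired tags together with all of their children, can be sketched on a toy tree as follows. The classes PruneSketch and Node are hypothetical stand-ins and do not reflect HtmlCleaner's own object model or how the parameter is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Illustrative only: pruning subtrees by tag name, not HtmlCleaner code. */
class PruneSketch {
    /** A toy element node holding a tag name and child elements. */
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            children.addAll(Arrays.asList(kids));
        }
    }

    /** Removes every descendant whose name is listed, along with its whole subtree. */
    static void prune(Node node, Set<String> pruneTags) {
        Iterator<Node> it = node.children.iterator();
        while (it.hasNext()) {
            Node child = it.next();
            if (pruneTags.contains(child.name)) {
                it.remove(); // the child's own children disappear with it
            } else {
                prune(child, pruneTags);
            }
        }
    }

    /** Counts the nodes in the tree rooted at the given node. */
    static int count(Node node) {
        int total = 1;
        for (Node child : node.children) {
            total += count(child);
        }
        return total;
    }

    public static void main(String[] args) {
        Node tree = new Node("html",
                new Node("body", new Node("script", new Node("b")), new Node("p")));
        prune(tree, new HashSet<>(Arrays.asList("script")));
        System.out.println(count(tree));
    }
}
```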

Sep. 8, 2007: HtmlCleaner release 1.5

Several bug fixes.
Added option to escape XML content in DOM serializer - HtmlCleaner.createDOM(boolean escapeXml)

Aug. 24, 2007: HtmlCleaner release 1.4

New flag allowHtmlInsideAttributes is introduced in order to give the parser flexibility in handling attribute values.
Several bug fixes.

Jul. 12, 2007: HtmlCleaner release 1.3

New browser-compact serializer added, which preserves a single whitespace character where multiple occur.
New flag namespacesAware is introduced in order to control namespace prefixes and namespace declarations. It should be used instead of omitXmlnsAttributes that existed in previous versions and had limited functionality.
New flag allowMultiWordAttributes is introduced giving HtmlCleaner's parser flexibility to (dis)allow tag attributes consisting of multiple words.
New flag useEmptyElementTags is introduced in order to control output of tags with empty body (<xxx/> vs <xxx></xxx>).
Several bug fixes.
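Two of the behaviours listed for this release are easy to picture with a small sketch: collapsing runs of whitespace to a single space, and choosing between the two forms of an empty tag. The class SerializerSketch and its methods are hypothetical; only the flag name useEmptyElementTags comes from the notes above, and the real serializers may handle more cases than this.

```java
/** Illustrative only: whitespace collapsing and empty-tag output, not HtmlCleaner code. */
class SerializerSketch {
    /** Collapses each run of whitespace characters into one space. */
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ");
    }

    /** Renders an element with an empty body in one of the two possible forms. */
    static String emptyTag(String name, boolean useEmptyElementTags) {
        return useEmptyElementTags
                ? "<" + name + "/>"
                : "<" + name + "></" + name + ">";
    }

    public static void main(String[] args) {
        System.out.println(collapseWhitespace("a  \n\t b"));
        System.out.println(emptyTag("xxx", true));
        System.out.println(emptyTag("xxx", false));
    }
}
```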

May. 05, 2007: HtmlCleaner release 1.2

Several bugs fixed.
New flags added to control behaviour of unknown/deprecated tags.
New flag added to optionally remove HTML envelope from resulting XML.
JDOM serializer added.

Apr. 13, 2007: HtmlCleaner release 1.13

Serialization of XML to Java DOM supported with createDOM() method of HtmlCleaner class.

Jan. 28, 2007: HtmlCleaner release 1.12

Hexadecimal entities escaping supported (e.g. &#x9;).
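The note does not describe the handling in detail. One plausible reading is that hexadecimal character references are recognised and decoded, which the sketch below demonstrates. The class HexEntityDecoder and its pattern are assumptions for illustration, not the code shipped in this release.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: decoding hexadecimal character references, not HtmlCleaner code. */
class HexEntityDecoder {
    // Matches references such as &#x41; with up to six hexadecimal digits.
    private static final Pattern HEX = Pattern.compile("&#[xX]([0-9a-fA-F]{1,6});");

    /** Replaces each valid hexadecimal reference with the character it denotes. */
    static String decode(String text) {
        Matcher m = HEX.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1), 16);
            // References outside the Unicode range are left untouched.
            String replacement = Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("A&#x42;C"));
    }
}
```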

Jan. 11, 2007: HtmlCleaner release 1.1

Compact XML serializer improved.
Minor XML escaping bug fixed.

Jan. 02, 2007: HtmlCleaner release 1.0.5

An HTML tokenizing bug fixed.
Methods of the class TagNode made public in order to enable creating custom XML serializers.
Method writeXml(XmlSerializer) added to HtmlCleaner class in order to support creating custom XML serializers.

Dec. 23, 2006: HtmlCleaner release 1.0

Minor bug in advanced XML escaping fixed.

Dec. 05, 2006: HtmlCleaner release 0.9

HtmlCleaner Ant task added
XML compact serializer added - strips all unneeded whitespace from the result
Few minor bugs fixed

Nov. 27, 2006: HtmlCleaner initial release (version 0.8)