Command line usage

HtmlCleaner can be called from the command line with the following syntax:

    java -jar htmlcleanerXX.jar [src = <url | file>] [incharset = <charset>] 
                                [dest = <file>] [outcharset = <charset>] 
                                [taginfofile = <file>] [options...]
where options include:

    outputtype = simple | compact | browser-compact | pretty | 
                 htmlsimple | htmlcompact | htmlpretty
    advancedxmlescape = true | false
    transrescharstoncr = true | false
    usecdata = true | false
    usecdatafor = ["script,style"]
    specialentities = true | false
    transspecialentitiestoncr = true | false
    unicodechars = true | false
    omitunknowntags = true | false
    treatunknowntagsascontent = true | false
    omitdeprtags = true | false
    treatdeprtagsascontent = true | false
    omitcomments = true | false
    omitxmldecl = true | false
    omitdoctypedecl = true | false
    useemptyelementtags = true | false
    allowmultiwordattributes = true | false
    allowhtmlinsideattributes = true | false
    ignoreqe = true | false
    namespacesaware = true | false
    hyphenreplacement = <string value>
    prunetags = <string value>
    booleanatts = self | empty | true
    nodebyxpath = <xpath expression>
    omitenvelope = true | false
    allowinvalidattributenames = true | false
    invalidattributenameprefix = [""]

Note: to distinguish URLs from files, URLs must begin with http:// or https://.
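As a rough illustration of the syntax above, the following Python sketch assembles an argument list for the command. It assumes each parameter is passed as a single name=value argument without surrounding spaces. The helper names is_url and build_command, the host example.com, and the file names are hypothetical and used only for illustration. The is_url check simply mirrors the note above; it is not HtmlCleaner's own code.

```python
def is_url(src: str) -> bool:
    """Mirror the rule in the note: only http:// or https:// marks a URL."""
    return src.startswith(("http://", "https://"))


def build_command(jar, src=None, dest=None, **options):
    """Build an argument list for the HtmlCleaner jar.

    Assumption: every parameter is one "name=value" argument.
    Boolean values are written as the literals true / false.
    """
    cmd = ["java", "-jar", jar]
    if src is not None:
        cmd.append(f"src={src}")
    if dest is not None:
        cmd.append(f"dest={dest}")
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        cmd.append(f"{name}={value}")
    return cmd


# Hypothetical source URL and destination file, for illustration only.
cmd = build_command(
    "htmlcleanerXX.jar",
    src="http://example.com/",
    dest="cleaned.xml",
    outputtype="pretty",
    omitcomments=True,
)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run as is, which avoids any shell quoting concerns.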

Pipelines and stdin

As of version 2.11, the src parameter is optional: you can instead pipe data to HtmlCleaner on stdin. For example:

    curl <url> | java -jar htmlcleaner-2.11.jar > cleaned.html
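The same pipeline can be driven from a script. The sketch below is one possible way to do it in Python: it writes a byte string to a command's stdin and collects its stdout. The function name clean_via_stdin is hypothetical, and STAND_IN is a placeholder command that just echoes its input so the sketch runs without Java installed.

```python
import subprocess
import sys


def clean_via_stdin(command, html: bytes) -> bytes:
    """Send html to the command's stdin and return what it writes to stdout."""
    result = subprocess.run(
        command,
        input=html,
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Placeholder command (an assumption for this sketch): copies stdin to stdout.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]

if __name__ == "__main__":
    print(clean_via_stdin(STAND_IN, b"<p>hello").decode())
```

To run HtmlCleaner itself, STAND_IN would be replaced with something like ["java", "-jar", "htmlcleaner-2.11.jar"], assuming the jar is in the current directory.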

TagInfo providers

The optional taginfofile parameter is the path to an XML file that describes all tags and their dependencies. It is used during cleaning instead of the default tag info set. See the description file of the default tag info set for reference.

Quiet mode

As of version 2.9, you can also supply the --quiet option to reduce the amount of log output HtmlCleaner produces.


Transformation parameters are prefixed with "t:". The transformations given in the example would be written on the command line as:

    t:cfoutput t:c:block=div,false t:font=span,true t:font.size t:font.face t:font.style=${style};font-family=${face};font-size=${size};
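Transformation arguments that contain ${...} templates include characters ($ and ;) that POSIX shells treat specially, so they generally need quoting when typed at a prompt. The sketch below uses Python's standard shlex module to print a shell-safe command line. The helper name shell_line is hypothetical, and the template argument is an illustrative value, not a recommended transformation.

```python
import shlex


def shell_line(args):
    """Join arguments into one line that a POSIX shell parses back unchanged."""
    return shlex.join(args)


args = [
    "java", "-jar", "htmlcleanerXX.jar",
    "t:cfoutput",
    "t:c:block=div,false",
    "t:font=span,true",
    "t:font.size",
    # Illustrative template value; $ and ; force quoting.
    "t:font.style=${style};font-size=${size};",
]

print(shell_line(args))
```

Only the last argument ends up wrapped in single quotes; the others contain no shell-special characters. When the command is started with subprocess.run and a list of arguments, no quoting is needed at all, because no shell is involved.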