This file is part of TagSoup and is Copyright 2002-2008 by John Cowan.

TagSoup is licensed under the Apache License, Version 2.0. You may obtain
a copy of this license at http://www.apache.org/licenses/LICENSE-2.0 . You
may also have additional legal rights not granted by this license.

TagSoup is distributed in the hope that it will be useful, but unless
required by applicable law or agreed to in writing, TagSoup is distributed
on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied; not even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.

TAGSOUP(1)                       User Commands                      TAGSOUP(1)
16
NAME
       tagsoup - convert nasty, ugly HTML to clean XHTML

SYNOPSIS
       java -jar tagsoup [ options ] [ files ]

DESCRIPTION
       Rectify arbitrary HTML into clean XHTML, using a tailored description
       of HTML. The output will be well-formed XML, but not necessarily valid
       XHTML.

       --files
              multiple input files should be processed into corresponding
              output files

       --encoding=encoding
              specifies the encoding of input files

       --output-encoding=encoding
              specifies the encoding of the output (if the encoding name
              begins with "utf", the output will not contain character
              entities; otherwise, all non-ASCII characters are represented
              as entities)

       --html output rectified HTML rather than XML, omitting the XML
              declaration and any namespace declarations

       --method=html
              output rectified HTML rather than XML (end-tags are omitted for
              empty elements, and no character escaping is done in script and
              style elements)

       --omit-xml-declaration
              omit the XML declaration

       --lexical
              output lexical features (specifically comments and any DOCTYPE
              declaration)

       --nons suppress namespaces in output

       --nobogons
              suppress unknown non-HTML elements in output

       --nodefaults
              suppress default attribute values

       --nocolons
              change explicit colons in element and attribute names to
              underscores

       --norestart
              don't restart any restartable elements

       --ignorable
              pass through ignorable whitespace (whitespace in element-only
              content) via the SAX handler method ignorableWhitespace

       --any  treat unknown non-HTML elements as allowing any content
              (default)

       --emptybogons
              treat unknown non-HTML elements as empty elements

       --norootbogons
              don't allow unknown non-HTML elements to be root elements

       --doctype-system=system-id
              force DOCTYPE declaration to be output with specified system
              identifier

       --doctype-public=public-id
              force DOCTYPE declaration to be output with specified public
              identifier

       --standalone=[yes|no]
              specify standalone pseudo-attribute in output XML declaration

       --version=version
              specify version pseudo-attribute in output XML declaration (does
              not affect actual version of XML output)

       --nocdata
              treat the CDATA-content elements script and style as ordinary
              elements (mostly for testing)

       --pyx  output PYX format rather than XML (mostly for testing)

       --pyxin
              input is PYX-format HTML (mostly for testing)

       --reuse
              reuse the same Parser object internally (for testing only)

       --help output basic help

       --version
              output version number
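
       The rule stated for --output-encoding can be illustrated with a small
       Python sketch. It is not TagSoup's own code: the function name, the
       case-insensitive match on "utf", and the use of decimal numeric
       character references as the entity form are all assumptions made for
       the example.

```python
def escape_for_encoding(text: str, encoding: str) -> str:
    """Mimic the documented --output-encoding rule (illustrative only).

    Assumption: the "utf" prefix test is case-insensitive.
    Assumption: entities are written as decimal references like &#233;.
    """
    if encoding.lower().startswith("utf"):
        # UTF-family output can carry every character directly.
        return text
    # Any other encoding: keep ASCII, turn everything else into an entity.
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


if __name__ == "__main__":
    print(escape_for_encoding("café", "utf-8"))
    print(escape_for_encoding("café", "iso-8859-1"))
```

       Escaping every non-ASCII character for non-UTF encodings keeps the
       output readable regardless of which characters the chosen encoding
       can actually represent.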

       TagSoup is a parser and reformatter for nasty, ugly HTML. Its normal
       processing mode is to accept HTML files on the command line, or from
       the standard input if none are given, and output them as clean XML to
       the standard output. By default, the input is assumed to be in the
       platform-local encoding and the output is written in UTF-8; the
       --encoding and --output-encoding options override these defaults.

       When the --files option is given, each input file is processed into an
       output file of the corresponding name, with the extension changed to
       xhtml. If the extension is already xhtml, it is changed to xhtml_.
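
       The naming rule for --files can be written out as a short Python
       sketch. The function name is hypothetical, and the handling of a file
       with no extension (appending .xhtml) is an assumption, since the text
       above does not cover that case.

```python
import os


def output_name(path: str) -> str:
    """Return the output file name described for the --files option."""
    root, ext = os.path.splitext(path)
    if ext == ".xhtml":
        # Already xhtml: avoid overwriting the input by using xhtml_.
        return root + ".xhtml_"
    # Any other extension (or none, by assumption) becomes xhtml.
    return root + ".xhtml"


if __name__ == "__main__":
    for name in ("index.html", "page.xhtml", "dir/a.htm"):
        print(name, "->", output_name(name))
```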

       TagSoup will repair, by whatever means necessary, violations of XML
       well-formedness. In particular, it will fix up malformed attribute
       names and supply missing attribute-value quotation marks. More
       significantly, it supplies end-tags where HTML allows them to be
       omitted, and sometimes where it doesn't. It will even supply
       start-tags where necessary; for example, if a document begins with a
       <li> tag, TagSoup will automatically prefix it with <html><body><ul>.
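
       The <li> example can be pictured as walking up a chain of required
       parents. The Python sketch below is only an illustration under that
       assumption: the three-entry parent table and the function name are
       hypothetical, and TagSoup's real behaviour is driven by its own
       description of HTML, not by this table.

```python
# Hypothetical table: each element maps to the parent it must appear in.
IMPLIED_PARENT = {"li": "ul", "ul": "body", "body": "html"}


def implied_prefix(first_tag: str) -> str:
    """Return the start-tags that would be supplied before first_tag."""
    chain = []
    parent = IMPLIED_PARENT.get(first_tag)
    while parent is not None:
        chain.append(parent)
        parent = IMPLIED_PARENT.get(parent)
    # The chain was collected innermost-first; emit it outermost-first.
    return "".join(f"<{name}>" for name in reversed(chain))


if __name__ == "__main__":
    print(implied_prefix("li") + "<li>")
```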

BUGS
       TagSoup can be fooled by missing close quotes after attribute values,
       and by incorrect character encodings (it does not contain an encoding
       guesser).

       TagSoup doesn't understand namespace declarations, which are not
       properly part of HTML. Instead, any element or attribute name
       beginning with a prefix such as foo: will be put into the artificial
       namespace urn:x-prefix:foo.

       For the same reason, namespace-qualified attributes like xml:space
       can't be returned as default values, though an explicit attribute in
       the xml namespace will be returned with the proper namespace URI.
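
       The prefix handling described above, together with the renaming done
       by the --nocolons option, can be sketched in Python as follows. The
       function names are hypothetical and this is not TagSoup's code; the
       URI used for the xml prefix is the standard XML namespace name.

```python
from typing import Optional

XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"


def artificial_namespace(name: str) -> Optional[str]:
    """Return the namespace a prefixed name would be placed in, else None."""
    prefix, sep, _local = name.partition(":")
    if not sep:
        return None  # no colon, so no prefix to map
    if prefix == "xml":
        return XML_NAMESPACE  # explicit xml: attributes keep the proper URI
    return "urn:x-prefix:" + prefix


def without_colons(name: str) -> str:
    """Apply the --nocolons renaming: colons become underscores."""
    return name.replace(":", "_")


if __name__ == "__main__":
    print(artificial_namespace("foo:bar"))
    print(without_colons("foo:bar"))
```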

AUTHOR
       John Cowan <cowan@ccil.org>

COPYRIGHT
       Copyright © 2002-2008 John Cowan
       TagSoup is free software; see the source for copying conditions. There
       is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
       PARTICULAR PURPOSE.



TagSoup 1.2.1                     January 2008                      TAGSOUP(1)