tagsoup(1)

TAGSOUP(1)                       User Commands                      TAGSOUP(1)

This file is part of TagSoup and is Copyright 2002-2008 by John Cowan.

TagSoup is licensed under the Apache License, Version 2.0.  You may obtain
a copy of this license at http://www.apache.org/licenses/LICENSE-2.0 .  You
may also have additional legal rights not granted by this license.

TagSoup is distributed in the hope that it will be useful, but unless
required by applicable law or agreed to in writing, TagSoup is distributed
on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied; not even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.

NAME

       tagsoup - convert nasty, ugly HTML to clean XHTML

SYNOPSIS

       java -jar tagsoup [ options ] [ files ]

DESCRIPTION

       Rectify arbitrary HTML into clean XHTML, using a tailored description
       of HTML.  The output will be well-formed XML, but not necessarily valid
       XHTML.

       --files
              multiple input files should be processed into corresponding
              output files

       --encoding=encoding
              specifies the encoding of input files

       --output-encoding=encoding
              specifies the encoding of the output (if the encoding name
              begins with "utf", the output will not contain character
              entities; otherwise, all non-ASCII characters are represented
              as entities)

       --html output rectified HTML rather than XML, omitting the XML
              declaration and any namespace declarations

       --method=html
              output rectified HTML rather than XML (end-tags are omitted for
              empty elements, and no character escaping is done in script and
              style elements)

       --omit-xml-declaration
              omit the XML declaration

       --lexical
              output lexical features (specifically comments and any DOCTYPE
              declaration)

       --nons suppress namespaces in output

       --nobogons
              suppress unknown non-HTML elements in output

       --nodefaults
              suppress default attribute values

       --nocolons
              change explicit colons in element and attribute names to
              underscores

       --norestart
              don't restart any restartable elements

       --ignorable
              pass through ignorable whitespace (whitespace in element-only
              content) via SAX method handler ignorableWhitespace

       --any  treat unknown non-HTML elements as allowing any content
              (default)

       --emptybogons
              treat unknown non-HTML elements as empty elements

       --norootbogons
              don't allow unknown non-HTML elements to be root elements

       --doctype-system=system-id
              force DOCTYPE declaration to be output with specified system
              identifier

       --doctype-public=public-id
              force DOCTYPE declaration to be output with specified public
              identifier

       --standalone=[yes|no]
              specify standalone pseudo-attribute in output XML declaration

       --version=version
              specify version pseudo-attribute in output XML declaration (does
              not affect actual version of XML output)

       --nocdata
              treat the CDATA-content elements script and style as ordinary
              elements (mostly for testing)

       --pyx  output PYX format rather than XML (mostly for testing)

       --pyxin
              input is PYX-format HTML (mostly for testing)

       --reuse
              reuse the same Parser object internally (for testing only)

       --help output basic help

       --version
              output version number

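       The rule given for --output-encoding can be pictured with the
       following minimal Java sketch.  It is not TagSoup's own code: the
       class and method names are hypothetical, the "utf" prefix test is
       assumed to ignore case, and "entities" is assumed to mean decimal
       numeric character references.  Escaping of markup characters such
       as < and & is left out.

```java
import java.util.Locale;

// Hypothetical illustration of the --output-encoding rule; not TagSoup code.
final class EntityEscapeSketch {

    // Assumption: the "utf" prefix check is case-insensitive.
    static boolean needsEntities(String outputEncoding) {
        return !outputEncoding.toLowerCase(Locale.ROOT).startsWith("utf");
    }

    // Assumption: non-ASCII characters become decimal numeric
    // character references such as &#233;.
    static String encodeText(String text, String outputEncoding) {
        if (!needsEntities(outputEncoding)) {
            return text; // UTF-style encodings can carry every character
        }
        StringBuilder out = new StringBuilder();
        // Iterate by code point so characters outside the BMP
        // yield a single reference.
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // prints caf&#233;
        System.out.println(encodeText("caf\u00e9", "ISO-8859-1"));
        // text is passed through unchanged for a UTF encoding
        System.out.println(encodeText("caf\u00e9", "UTF-8"));
    }
}
```

       Under these assumptions, only the choice of output encoding decides
       whether a non-ASCII character is written literally or as a
       reference; the document content is the same either way.
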
       TagSoup is a parser and reformatter for nasty, ugly HTML.  In its
       normal processing mode it reads the HTML files named on the command
       line, or the standard input if none are given, and writes them as
       clean XML to the standard output.  Unless the --encoding or
       --output-encoding option is given, the input is assumed to be in the
       platform-local encoding and the output is UTF-8.

       When the --files option is given, each input file is processed into an
       output file of the corresponding name, with the extension changed to
       xhtml.  If the extension is already xhtml, it is changed to xhtml_.

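       The naming rule for --files can be sketched in Java as follows.
       This is an illustration only, not TagSoup's own code: the class and
       method names are hypothetical, and the handling of a name with no
       extension (appending .xhtml) is an assumption, since that case is
       not described above.

```java
// Hypothetical illustration of the --files naming rule; not TagSoup code.
final class OutputNameSketch {

    static String outputNameFor(String inputName) {
        if (inputName.endsWith(".xhtml")) {
            // Already xhtml: the extension becomes xhtml_.
            return inputName + "_";
        }
        int dot = inputName.lastIndexOf('.');
        if (dot == -1) {
            // Assumption: a name without an extension gets .xhtml appended.
            return inputName + ".xhtml";
        }
        // Otherwise replace the existing extension with xhtml.
        return inputName.substring(0, dot) + ".xhtml";
    }

    public static void main(String[] args) {
        System.out.println(outputNameFor("index.html"));  // prints index.xhtml
        System.out.println(outputNameFor("page.xhtml"));  // prints page.xhtml_
    }
}
```

       The trailing underscore keeps an xhtml input file from being
       overwritten by its own output.
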
       TagSoup will repair, by whatever means necessary, violations of XML
       well-formedness.  In particular, it will fix up malformed attribute
       names and supply missing attribute-value quotation marks.  More
       significantly, it supplies end-tags where HTML allows them to be
       omitted, and sometimes where it doesn't.  It will even supply
       start-tags where necessary; for example, if a document begins with a
       <li> tag, TagSoup will automatically prefix it with <html><body><ul>.

BUGS

       TagSoup can be fooled by missing close quotes after attribute values,
       and by incorrect character encodings (it does not contain an encoding
       guesser).

       TagSoup doesn't understand namespace declarations, which are not
       properly part of HTML.  Instead, any element or attribute name
       beginning with foo: will be put into the artificial namespace
       urn:x-prefix:foo.

       For the same reason, namespace-qualified attributes like xml:space
       can't be returned as default values, though an explicit attribute in
       the xml namespace will be returned with the proper namespace URI.

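       The prefix handling described above can be pictured with this
       minimal Java sketch.  It is not TagSoup's own code: the class and
       method names are hypothetical, the xml prefix is assumed to be
       matched case-sensitively, and names without a prefix are simply
       reported as having no artificial namespace.

```java
// Hypothetical illustration of the prefix-to-namespace rule; not TagSoup code.
final class PrefixNamespaceSketch {

    // The namespace URI reserved for the xml prefix.
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    static String namespaceFor(String qName) {
        int colon = qName.indexOf(':');
        if (colon <= 0) {
            // No prefix: outside the rule sketched here.
            return "";
        }
        String prefix = qName.substring(0, colon);
        if (prefix.equals("xml")) {
            // Assumption: exact, case-sensitive match on "xml".
            return XML_NS;
        }
        // Any other prefix maps to an artificial namespace.
        return "urn:x-prefix:" + prefix;
    }

    public static void main(String[] args) {
        System.out.println(namespaceFor("foo:bar"));    // prints urn:x-prefix:foo
        System.out.println(namespaceFor("xml:space"));  // prints http://www.w3.org/XML/1998/namespace
    }
}
```

       Because the mapping depends only on the prefix text, two documents
       that bind the same prefix to different URIs end up in the same
       artificial namespace.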

AUTHOR

       John Cowan <cowan@ccil.org>

COPYRIGHT

       Copyright © 2002-2008 John Cowan
       TagSoup is free software; see the source for copying conditions.  There
       is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
       PARTICULAR PURPOSE.



TagSoup 1.2.1                    January 2008                       TAGSOUP(1)