HTML::HTML5::Parser(3)  User Contributed Perl Documentation  HTML::HTML5::Parser(3)

NAME
HTML::HTML5::Parser - parse HTML reliably

SYNOPSIS
    use HTML::HTML5::Parser;

    my $parser = HTML::HTML5::Parser->new;
    my $doc    = $parser->parse_string(<<'EOT');
    <!doctype html>
    <title>Foo</title>
    <p><b><i>Foo</b> bar</i>.
    <p>Baz</br>Quux.
    EOT

    my $fdoc  = $parser->parse_file( $html_file_name );
    my $fhdoc = $parser->parse_fh( $html_file_handle );

DESCRIPTION
This library is substantially the same as the non-CPAN module
Whatpm::HTML. Changes include:

• Provides an XML::LibXML-like DOM interface. If you usually use
  XML::LibXML's DOM parser, this should be a drop-in solution for
  tag soup HTML.

• Constructs an XML::LibXML::Document as the result of parsing.

• Removes external dependencies on non-CPAN packages, via bundling
  and modifications.
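
Because the result is an ordinary XML::LibXML::Document, the usual
XML::LibXML DOM methods can be applied to it. The following is a
minimal sketch rather than anything taken from the module itself; the
input markup and variable names are illustrative. It uses
getElementsByLocalName so that the lookup does not depend on which
namespace the parser assigns to elements.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

# Illustrative tag-soup input: the <p> elements are never closed.
my $html = '<!doctype html><title>Foo</title><p>One<p>Two';

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string($html);

# The result is a regular XML::LibXML DOM document.
print ref($doc), "\n";

# Standard XML::LibXML DOM calls work on it.
my @paras = $doc->getElementsByLocalName('p');
printf "%d paragraph(s)\n", scalar @paras;
print "  ", $_->textContent, "\n" for @paras;
```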

Constructor
"new"
    $parser = HTML::HTML5::Parser->new;
    # or
    $parser = HTML::HTML5::Parser->new(no_cache => 1);

The constructor takes a single optional flag, "no_cache => 1", which
disables the global element metadata cache. Disabling the cache is
handy for conserving memory if you parse a large number of
documents. However, with the cache disabled, methods such as
"source_line" will not work as class methods, and must be run
from an instance of this parser.

XML::LibXML-Compatible Methods
"parse_file", "parse_html_file"
    $doc = $parser->parse_file( $html_file_name [,\%opts] );

This function parses an HTML document from a file or network;
$html_file_name can be either a filename or a URL.

Options include 'encoding' to indicate file encoding (e.g.
'utf-8') and 'user_agent' which should be a blessed
"LWP::UserAgent" (or HTTP::Tiny) object to be used when retrieving
URLs.

If requesting a URL and the response Content-Type header indicates
an XML-based media type (such as XHTML), XML::LibXML::Parser will
be used automatically (instead of the tag soup parser). The XML
parser can be told to use a DTD catalogue by setting the option
'xml_catalogue' to the filename of the catalogue.

HTML (tag soup) parsing can be forced using the option
'force_html', even when an XML media type is returned. If an
options hashref was passed, parse_file will set
$options->{'parser_used'} to the name of the class used to parse
the URL, to allow the calling code to double-check which parser was
used afterwards.

If an options hashref was passed, parse_file will also set
$options->{'response'} to the HTTP::Response object obtained by
retrieving the URI.
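
As a rough illustration of passing an options hashref to parse_file,
the sketch below parses a temporary local file that stands in for a
real file or URL. The file contents and variable names are
illustrative. Whether 'parser_used' and 'response' are populated for
local files is not stated above, so the sketch only reports them when
they are defined.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::HTML5::Parser;

# Write a small HTML file to parse (stand-in for a real file or URL).
my ($fh, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<!doctype html><title>Local file</title><p>Hello";
close $fh or die "close failed: $!";

my $parser = HTML::HTML5::Parser->new;

# Options are passed as a hashref; parse_file may add keys to it.
my %opts = (encoding => 'utf-8');
my $doc  = $parser->parse_file($filename, \%opts);

my ($title) = $doc->getElementsByLocalName('title');
print "Title: ", $title->textContent, "\n";

# Report what parse_file recorded in the hashref, if anything.
print "Parser used: ", ($opts{parser_used} // '(not recorded)'), "\n";
print "Response recorded: ", (defined $opts{response} ? 'yes' : 'no'), "\n";
```

For a URL, the same call shape applies, optionally with a
'user_agent' entry in the hashref.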

"parse_fh", "parse_html_fh"
    $doc = $parser->parse_fh( $io_fh [,\%opts] );

parse_fh() parses an IOREF or a subclass of "IO::Handle".

Options include 'encoding' to indicate file encoding (e.g.
'utf-8').
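
A small sketch of parse_fh, assuming an ordinary Perl filehandle is
acceptable as the IOREF. To stay self-contained it opens the handle on
an in-memory string; a handle opened on a real file would be used the
same way. The markup and variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $html = "<!doctype html>\n<title>From a handle</title>\n<p>Hello\n";

# Open a read-only filehandle on the in-memory string.
open my $fh, '<', \$html or die "cannot open in-memory handle: $!";

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_fh($fh);
close $fh;

my ($title) = $doc->getElementsByLocalName('title');
print $title->textContent, "\n";
```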

"parse_string", "parse_html_string"
    $doc = $parser->parse_string( $html_string [,\%opts] );

This function is similar to parse_fh(), but it parses an HTML
document that is available as a single string in memory.

Options include 'encoding' to indicate file encoding (e.g.
'utf-8').

"load_xml", "load_html"
Wrappers for the parse_* functions. These should be roughly
compatible with the equivalently named functions in XML::LibXML.

Note that "load_xml" first attempts to parse as real XML, falling
back to HTML5 parsing; "load_html" just goes straight for HTML5.
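
The argument format of these wrappers is not spelled out above. The
sketch below assumes they follow XML::LibXML's named-argument
convention ("string", "location" or "IO"), which is what "roughly
compatible" suggests; treat that as an assumption to check against the
installed version.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# ASSUMPTION: named arguments as in XML::LibXML's load_html
# (string => ..., location => ..., IO => ...).
my $doc = $parser->load_html(
    string => '<!doctype html><title>Loaded</title><p>Hi',
);

my ($title) = $doc->getElementsByLocalName('title');
print $title->textContent, "\n";
```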

"parse_balanced_chunk"
    $fragment = $parser->parse_balanced_chunk( $string [,\%opts] );

This method is roughly equivalent to XML::LibXML's method of the
same name but, unlike XML::LibXML's and despite its name, it does
not require the chunk to be "balanced". This method is somewhat
black magic, but should work, and do the proper thing in most cases.
Of course, the proper thing might not be what you'd expect! I'll try
to keep this explanation as brief as possible...

Consider the following string:

    <b>Hello</b></td></tr> <i>World</i>

What is the proper way to parse that? If it were found in a
document like this:

    <html>
    <head><title>X</title></head>
    <body>
    <div>
    <b>Hello</b></td></tr> <i>World</i>
    </div>
    </body>
    </html>

Then the document would end up equivalent to the following XHTML:

    <html>
    <head><title>X</title></head>
    <body>
    <div>
    <b>Hello</b> <i>World</i>
    </div>
    </body>
    </html>

The superfluous "</td></tr>" is simply ignored. However, if it were
found in a document like this:

    <html>
    <head><title>X</title></head>
    <body>
    <table><tbody><tr><td>
    <b>Hello</b></td></tr> <i>World</i>
    </td></tr></tbody></table>
    </body>
    </html>

Then the result would be:

    <html>
    <head><title>X</title></head>
    <body>
    <i>World</i>
    <table><tbody><tr><td>
    <b>Hello</b></td></tr>
    </tbody></table>
    </body>
    </html>

Yes, "<i>World</i>" gets hoisted up before the "<table>". This is
weird, I know, but it's how browsers do it in real life.

So what should:

    $string = q{<b>Hello</b></td></tr> <i>World</i>};
    $fragment = $parser->parse_balanced_chunk($string);

actually return? Well, you can choose...

    $string = q{<b>Hello</b></td></tr> <i>World</i>};

    $frag1 = $parser->parse_balanced_chunk($string, {within=>'div'});
    say $frag1->toString; # <b>Hello</b> <i>World</i>

    $frag2 = $parser->parse_balanced_chunk($string, {within=>'td'});
    say $frag2->toString; # <i>World</i><b>Hello</b>

If you don't pass a "within" option, then the chunk is parsed as if
it were within a "<div>" element. This is often the most sensible
option. The method will croak if you pass something like
"{ within => "foobar" }" where "foobar" is not a real HTML element
name (as found in the HTML5 spec), if you pass the name of a void
element (e.g. "br" or "meta"), or if you pass one of a handful of
other unsupported elements (namely: "noscript", "noembed",
"noframes").

Note that the second time around, although we parsed the string "as
if it were within a "<td>" element", the "<i>World</i>" bit did not
strictly end up within the "<td>" element (not even within the
"<table>" element!) yet it still gets returned. We'll call things
such as this "outliers". There is a "force_within" option which
tells parse_balanced_chunk to ignore outliers:

    $frag3 = $parser->parse_balanced_chunk($string,
                                           {force_within=>'td'});
    say $frag3->toString; # <b>Hello</b>

There is a boolean option "mark_outliers" which marks each outlier
with an attribute ("data-perl-html-html5-parser-outlier") to
indicate its outlier status. Clearly, this is ignored when you use
"force_within" because no outliers are returned. Some outliers may
be XML::LibXML::Text nodes; text nodes don't have attributes, so
these will not be marked.

A last note is to mention what gets returned by this method.
Normally it's an XML::LibXML::DocumentFragment object, but if you
call the method in list context, a list of the individual nodes
is returned. Alternatively you can request the data to be
returned as an XML::LibXML::NodeList object:

    # Get an XML::LibXML::NodeList
    my $list = $parser->parse_balanced_chunk($str, {as=>'list'});
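
The list-context return and the "mark_outliers" option can be combined
to see which returned nodes are outliers. This is only a sketch built
from the options described above; the variable names are illustrative,
and the exact nodes returned may vary between versions.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};

# List context: individual nodes rather than a DocumentFragment.
my @nodes = $parser->parse_balanced_chunk(
    $string,
    { within => 'td', mark_outliers => 1 },
);

my $attr = 'data-perl-html-html5-parser-outlier';
for my $node (@nodes) {
    # Text nodes cannot carry the marker, so only elements are checked.
    my $is_outlier = $node->isa('XML::LibXML::Element')
                  && $node->hasAttribute($attr);
    printf "%-8s %s\n", $node->nodeName, ($is_outlier ? 'outlier' : '-');
}
```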

The exact implementation of this method may change from version to
version, but the long-term goal will be to approach how common
desktop browsers parse HTML fragments when implementing the setter
for DOM's "innerHTML" attribute.

The push parser and SAX-based parser are not supported. Trying to
change an option (such as recover_silently) will make
HTML::HTML5::Parser carp a warning. (But you can inspect the options.)

Error Handling
Error handling is obviously different to XML::LibXML, as errors are
(bugs notwithstanding) non-fatal.

"error_handler"
Get/set an error handling function. Must be set to a coderef or
undef.

The error handling function will be called with a single parameter,
an HTML::HTML5::Parser::Error object.

"errors"
Returns a list of errors that occurred during the last parse.

See HTML::HTML5::Parser::Error.
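
A minimal sketch of both mechanisms, assuming "error_handler" acts as
a setter when given a coderef, as described above. The handler simply
stores each error object; no methods of HTML::HTML5::Parser::Error are
used because they are not documented here. The sloppy markup is
illustrative.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Collect errors as they are reported; each argument is an
# HTML::HTML5::Parser::Error object.
my @seen;
$parser->error_handler(sub {
    my ($error) = @_;
    push @seen, $error;
});

# Deliberately sloppy markup: no doctype, mis-nested inline elements.
my $doc = $parser->parse_string('<p><b><i>Foo</b> bar</i>.');

# errors() lists the errors from the most recent parse.
my @errors = $parser->errors;
printf "handler saw %d error(s); errors() returned %d\n",
    scalar @seen, scalar @errors;

# Parsing still produced a document despite the errors.
print ref($doc), "\n";
```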

Additional Methods
The module provides a few methods to obtain additional, non-DOM data
from DOM nodes.

"dtd_public_id"
    $pubid = $parser->dtd_public_id( $doc );

For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the Public
Identifier of the DTD used (if any).

"dtd_system_id"
    $sysid = $parser->dtd_system_id( $doc );

For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the System
Identifier of the DTD used (if any).

"dtd_element"
    $element = $parser->dtd_element( $doc );

For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the root
element declared in the DTD used (if any). That is, if the document
has this doctype:

    <!doctype html>

... it will return "html".

This may return the empty string if a DTD was present but did not
contain a root element; or undef if no DTD was present.

"compat_mode"
    $mode = $parser->compat_mode( $doc );

Returns 'quirks', 'limited quirks' or undef (standards mode).

"charset"
    $charset = $parser->charset( $doc );

The character set apparently used by the document.
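
The methods above can be called together on one parsed document. The
sketch below uses an HTML 4.01 Strict doctype purely as an
illustrative input, so that the public and system identifiers are
non-empty. Each call is forced into scalar context because several of
these methods may return undef.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(<<'EOT');
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<title>Doctype demo</title>
<p>Hello
EOT

# scalar() keeps the hash well-formed even if a method returns nothing.
my %info = (
    dtd_element   => scalar $parser->dtd_element($doc),
    dtd_public_id => scalar $parser->dtd_public_id($doc),
    dtd_system_id => scalar $parser->dtd_system_id($doc),
    compat_mode   => scalar $parser->compat_mode($doc),
    charset       => scalar $parser->charset($doc),
);

# Print each value, substituting a placeholder for undef.
printf "%-14s %s\n", $_, $info{$_} // '(undef)' for sort keys %info;
```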

"source_line"
    ($line, $col) = $parser->source_line( $node );
    $line = $parser->source_line( $node );

In scalar context, "source_line" returns the line number of the
source code that started a particular node (element, attribute or
comment).

In list context, returns a tuple: $line, $column, $implicitness.
Tab characters count as one column, not eight.

$implicitness indicates that the node was not explicitly marked up
in the source code, but its existence was inferred by the parser.
For example, in the following markup, the HTML, TITLE and P
elements are explicit, but the HEAD and BODY elements are implicit.

    <html>
    <title>I have an implicit head</title>
    <p>And an implicit body too!</p>
    </html>

(Note that implicit elements do still have a line number and column
number.) The implicitness indicator is a new feature, and I'd
appreciate any bug reports where it gets things wrong.

XML::LibXML::Node has a "line_number" method. In general this will
return 0 and HTML::HTML5::Parser has no way of influencing
it. However, if you install XML::LibXML::Devel::SetLineNumber on
your system, the "line_number" method will start working (at least
for elements).
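
The list-context form can be exercised on the markup shown above. This
is a sketch only: it prints whatever the parser reports for each
element and makes no claim about the specific line and column values.
The loop and variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(<<'EOT');
<html>
<title>I have an implicit head</title>
<p>And an implicit body too!</p>
</html>
EOT

for my $name (qw(html head title body p)) {
    # Skip any element that is not present in the parsed tree.
    my ($node) = $doc->getElementsByLocalName($name) or next;

    # List context: line, column and the implicitness indicator.
    my ($line, $col, $implicit) = $parser->source_line($node);
    printf "%-6s line %s, column %s, %s\n",
        $name, $line // '?', $col // '?',
        ($implicit ? 'implicit' : 'explicit');
}
```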

SEE ALSO
<http://suika.fam.cx/www/markup/html/whatpm/Whatpm/HTML.html>.

HTML::HTML5::Writer, HTML::HTML5::Builder, XML::LibXML,
XML::LibXML::PrettyPrint, XML::LibXML::Devel::SetLineNumber.

AUTHOR
Toby Inkster, <tobyink@cpan.org>

COPYRIGHT AND LICENCE
Copyright (C) 2007-2011 by Wakaba

Copyright (C) 2009-2012 by Toby Inkster

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

DISCLAIMER OF WARRANTIES
THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

perl v5.36.0  2023-01-20  HTML::HTML5::Parser(3)