HTML::HTML5::Parser(3pm)

HTML::HTML5::Parser(3)  User Contributed Perl Documentation  HTML::HTML5::Parser(3)

NAME

       HTML::HTML5::Parser - parse HTML reliably

SYNOPSIS

         use HTML::HTML5::Parser;

         my $parser = HTML::HTML5::Parser->new;
         my $doc    = $parser->parse_string(<<'EOT');
         <!doctype html>
         <title>Foo</title>
         <p><b><i>Foo</b> bar</i>.
         <p>Baz</br>Quux.
         EOT

         my $fdoc   = $parser->parse_file( $html_file_name );
         my $fhdoc  = $parser->parse_fh( $html_file_handle );

DESCRIPTION

       This library is substantially the same as the non-CPAN module
       Whatpm::HTML.  Changes include:

       •       Provides an XML::LibXML-like DOM interface. If you usually use
               XML::LibXML's DOM parser, this should be a drop-in solution for
               tag soup HTML.

       •       Constructs an XML::LibXML::Document as the result of parsing.

       •       Via bundling and modifications, removed external dependencies
               on non-CPAN packages.

   Constructor
       "new"
                 $parser = HTML::HTML5::Parser->new;
                 # or
                 $parser = HTML::HTML5::Parser->new(no_cache => 1);

               The constructor takes a single optional flag, "no_cache => 1",
               which disables the global element metadata cache. Disabling the
               cache conserves memory when you parse a large number of
               documents. However, methods such as "source_line" then no
               longer work when called as class methods; they must be called
               on the parser instance that parsed the document.

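               A minimal sketch of both constructor forms, assuming the cache
               behaviour described above. The sample markup and the variable
               names are illustrative only.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

# Default: the global element metadata cache is enabled.
my $cached = HTML::HTML5::Parser->new;

# Cache disabled: keep hold of the instance, because metadata lookups such
# as source_line must then go through the object that did the parsing.
my $uncached = HTML::HTML5::Parser->new(no_cache => 1);

my $doc  = $uncached->parse_string("<!doctype html>\n<title>Demo</title>\n<p>Hi");
my ($p)  = $doc->getElementsByTagName('p');
my $line = $uncached->source_line($p);

print "The <p> element starts on line ", ($line // 'unknown'), "\n";
```
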
   XML::LibXML-Compatible Methods
       "parse_file", "parse_html_file"
             $doc = $parser->parse_file( $html_file_name [,\%opts] );

           This function parses an HTML document from a file or the network;
           $html_file_name can be either a filename or a URL.

           Options include 'encoding' to indicate the file encoding (e.g.
           'utf-8') and 'user_agent', which should be a blessed
           "LWP::UserAgent" (or HTTP::Tiny) object to be used when retrieving
           URLs.

           If requesting a URL and the response Content-Type header indicates
           an XML-based media type (such as XHTML), XML::LibXML::Parser will
           be used automatically (instead of the tag soup parser). The XML
           parser can be told to use a DTD catalogue by setting the option
           'xml_catalogue' to the filename of the catalogue.

           HTML (tag soup) parsing can be forced using the option
           'force_html', even when an XML media type is returned. If an
           options hashref was passed, parse_file will set
           $options->{'parser_used'} to the name of the class used to parse
           the URL, so that the calling code can check afterwards which
           parser was used.

           If an options hashref was passed, parse_file will also set
           $options->{'response'} to the HTTP::Response object obtained by
           retrieving the URI.

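           As a hedged illustration of the options hashref, the sketch below
           parses a local temporary file and then inspects the keys that
           parse_file writes back. The file contents are made up, and whether
           'parser_used' and 'response' are populated for a local file (as
           opposed to a URL) is an assumption, so the code guards against
           undef.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::HTML5::Parser;

# Write a small HTML document to a temporary file (illustrative content).
my ($fh, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<!doctype html>\n<title>Demo</title>\n<p>Hello\n";
close $fh or die "Cannot close $filename: $!";

my $parser = HTML::HTML5::Parser->new;

# Pass the options as a hashref so that parse_file can write back into it.
my %opts = (encoding => 'utf-8');
my $doc  = $parser->parse_file($filename, \%opts);

# Guard against undef in case a key is not set for this kind of source.
print "Parser used: ", ($opts{parser_used} // '(not set)'), "\n";
print "A response was recorded\n" if defined $opts{response};
print "Root element: ", $doc->documentElement->nodeName, "\n";
```

           For a URL, the same hashref could also carry 'user_agent',
           'force_html' or 'xml_catalogue' as described above.
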
       "parse_fh", "parse_html_fh"
             $doc = $parser->parse_fh( $io_fh [,\%opts] );

           "parse_fh()" parses a document read from an IOREF or from an
           "IO::Handle" object (or an object of a subclass).

           Options include 'encoding' to indicate the file encoding (e.g.
           'utf-8').

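           A small sketch of parse_fh, using an in-memory filehandle so that
           the example is self-contained. The markup is illustrative; a
           handle opened on a real file would be passed in the same way.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $html = "<!doctype html>\n<title>From a handle</title>\n<p>Hello\n";

# Open a read-only, in-memory filehandle over the string.
open my $fh, '<', \$html or die "Cannot open in-memory handle: $!";

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_fh($fh, { encoding => 'utf-8' });
close $fh;

print "Root element: ", $doc->documentElement->nodeName, "\n";
```
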
       "parse_string", "parse_html_string"
             $doc = $parser->parse_string( $html_string [,\%opts] );

           This function is similar to "parse_fh()", but it parses an HTML
           document that is available as a single string in memory.

           Options include 'encoding' to indicate the encoding of the string
           (e.g. 'utf-8').

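           To make the "tag soup" handling concrete, the sketch below feeds
           parse_string the mis-nested markup from the SYNOPSIS and then
           treats the result as an ordinary XML::LibXML document. The exact
           serialisation depends on the installed versions, so none is shown.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Mis-nested <b>/<i>, unclosed <p> elements and a stray </br>.
my $doc = $parser->parse_string(<<'EOT');
<!doctype html>
<title>Foo</title>
<p><b><i>Foo</b> bar</i>.
<p>Baz</br>Quux.
EOT

# The usual XML::LibXML DOM methods are available on the result.
my @paragraphs = $doc->getElementsByTagName('p');
print "Found ", scalar(@paragraphs), " paragraph element(s)\n";
print $doc->toString, "\n";
```
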
       "load_xml", "load_html"
           Wrappers for the parse_* functions. These should be roughly
           compatible with the equivalently named functions in XML::LibXML.

           Note that "load_xml" first attempts to parse as real XML, falling
           back to HTML5 parsing; "load_html" just goes straight for HTML5.

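           A tentative sketch of "load_html". It assumes that the wrapper
           accepts the same named arguments as XML::LibXML's "load_html"
           (here "string"); this page does not spell out the argument list,
           so treat that as an assumption to check against your installed
           version.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $html = '<!doctype html><title>Demo</title><p>Tag <b>soup';

my $parser = HTML::HTML5::Parser->new;

# Assumption: XML::LibXML-style named arguments are accepted.
my $doc = $parser->load_html(string => $html);

print "Root element: ", $doc->documentElement->nodeName, "\n";
```
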
       "parse_balanced_chunk"
             $fragment = $parser->parse_balanced_chunk( $string [,\%opts] );

           This method is roughly equivalent to XML::LibXML's method of the
           same name but, unlike XML::LibXML's and despite its name, it does
           not require the chunk to be "balanced". This method is somewhat
           black magic, but should work, and do the proper thing in most
           cases. Of course, the proper thing might not be what you'd expect!
           I'll try to keep this explanation as brief as possible...

           Consider the following string:

             <b>Hello</b></td></tr> <i>World</i>

           What is the proper way to parse that? If it were found in a
           document like this:

             <html>
               <head><title>X</title></head>
               <body>
                 <div>
                   <b>Hello</b></td></tr> <i>World</i>
                 </div>
               </body>
             </html>

           Then the document would end up equivalent to the following XHTML:

             <html>
               <head><title>X</title></head>
               <body>
                 <div>
                   <b>Hello</b> <i>World</i>
                 </div>
               </body>
             </html>

           The superfluous "</td></tr>" is simply ignored. However, if it were
           found in a document like this:

             <html>
               <head><title>X</title></head>
               <body>
                 <table><tbody><tr><td>
                   <b>Hello</b></td></tr> <i>World</i>
                 </td></tr></tbody></table>
               </body>
             </html>

           Then the result would be:

             <html>
               <head><title>X</title></head>
               <body>
                 <i>World</i>
                 <table><tbody><tr><td>
                   <b>Hello</b></td></tr>
                 </tbody></table>
               </body>
             </html>

           Yes, "<i>World</i>" gets hoisted up before the "<table>". This is
           weird, I know, but it's how browsers do it in real life.

           So what should:

             $string   = q{<b>Hello</b></td></tr> <i>World</i>};
             $fragment = $parser->parse_balanced_chunk($string);

           actually return? Well, you can choose...

             $string = q{<b>Hello</b></td></tr> <i>World</i>};

             $frag1  = $parser->parse_balanced_chunk($string, {within=>'div'});
             say $frag1->toString; # <b>Hello</b> <i>World</i>

             $frag2  = $parser->parse_balanced_chunk($string, {within=>'td'});
             say $frag2->toString; # <i>World</i><b>Hello</b>

           If you don't pass a "within" option, then the chunk is parsed as if
           it were within a "<div>" element. This is often the most sensible
           option. The method will croak if the "within" value is not a real
           HTML element name as found in the HTML5 spec (for example
           "{ within => "foobar" }"); if it is the name of a void element
           (e.g. "br" or "meta"); or if it is one of a handful of other
           unsupported elements (namely: "noscript", "noembed", "noframes").

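           Because an unsuitable "within" value makes the method croak, code
           that takes the context element from user input may want to trap
           the exception. A minimal sketch, with the %parsed_ok hash being a
           name invented for this example:

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};

# 'br' is a void element and 'foobar' is not an HTML element name, so both
# are documented to croak; 'div' is the default context and should succeed.
my %parsed_ok;
for my $context (qw(div br foobar)) {
    my $fragment = eval {
        $parser->parse_balanced_chunk($string, { within => $context });
    };
    $parsed_ok{$context} = defined $fragment ? 1 : 0;

    if ($parsed_ok{$context}) {
        print "$context: parsed\n";
    }
    else {
        print "$context: rejected: $@";
    }
}
```
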
           Note that the second time around, although we parsed the string "as
           if it were within a "<td>" element", the "<i>World</i>" bit did not
           strictly end up within the "<td>" element (not even within the
           "<table>" element!) yet it still gets returned.  We'll call things
           such as this "outliers". There is a "force_within" option which
           tells parse_balanced_chunk to ignore outliers:

             $frag3  = $parser->parse_balanced_chunk($string,
                                                     {force_within=>'td'});
             say $frag3->toString; # <b>Hello</b>

           There is a boolean option "mark_outliers" which marks each outlier
           with an attribute ("data-perl-html-html5-parser-outlier") to
           indicate its outlier status. Clearly, this is ignored when you use
           "force_within" because no outliers are returned. Some outliers may
           be XML::LibXML::Text nodes; text nodes don't have attributes, so
           these will not be marked.

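           The following sketch shows one way to tell outliers apart after
           parsing with "mark_outliers". It relies only on the attribute
           name given above; the report format is made up for the example,
           and text nodes are skipped because they cannot carry the marker.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};

# List context: the individual nodes are returned.
my @nodes = $parser->parse_balanced_chunk(
    $string,
    { within => 'td', mark_outliers => 1 },
);

for my $node (@nodes) {
    next unless $node->isa('XML::LibXML::Element');
    my $status = $node->hasAttribute('data-perl-html-html5-parser-outlier')
        ? 'outlier'
        : 'in context';
    printf "%-10s <%s>\n", $status, $node->nodeName;
}
```
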
           A last note is to mention what gets returned by this method.
           Normally it's an XML::LibXML::DocumentFragment object, but if you
           call the method in list context, a list of the individual nodes
           is returned. Alternatively you can request the data to be
           returned as an XML::LibXML::NodeList object:

            # Get an XML::LibXML::NodeList
            my $list = $parser->parse_balanced_chunk($str, {as=>'list'});

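           The three return forms side by side, as a short sketch (the input
           string is the one used throughout this section):

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};

# Scalar context: a document fragment.
my $fragment = $parser->parse_balanced_chunk($string);

# List context: the individual nodes.
my @nodes = $parser->parse_balanced_chunk($string);

# Scalar context with as => 'list': a node list object.
my $list = $parser->parse_balanced_chunk($string, { as => 'list' });

print "Scalar context: ", ref($fragment), "\n";
print "List context:   ", scalar(@nodes), " node(s)\n";
print "as => 'list':   ", ref($list), " holding ", $list->size, " node(s)\n";
```
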
           The exact implementation of this method may change from version to
           version, but the long-term goal will be to approach how common
           desktop browsers parse HTML fragments when implementing the setter
           for DOM's "innerHTML" attribute.

       The push parser and SAX-based parser are not supported. Trying to
       change an option (such as recover_silently) will make
       HTML::HTML5::Parser carp a warning. (But you can inspect the options.)

   Error Handling
       Error handling is obviously different to XML::LibXML, as errors are
       (bugs notwithstanding) non-fatal.

       "error_handler"
           Get/set an error handling function. Must be set to a coderef or
           undef.

           The error handling function will be called with a single parameter,
           an HTML::HTML5::Parser::Error object.

       "errors"
           Returns a list of errors that occurred during the last parse.

           See HTML::HTML5::Parser::Error.

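           A hedged sketch of both mechanisms: a handler that collects the
           error objects as they are reported, followed by a call to
           "errors" after the parse. The deliberately sloppy markup and the
           @seen array are illustrative, and nothing is assumed about the
           methods of the error objects beyond their class.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Collect each error object as the parser reports it.
my @seen;
$parser->error_handler(sub {
    my ($error) = @_;
    push @seen, $error;
});

# No doctype and mis-nested inline elements: parsing still succeeds,
# because errors are non-fatal.
my $doc = $parser->parse_string('<p><b><i>Foo</b> bar</i>.');

my @errors = $parser->errors;
printf "Handler was called %d time(s)\n", scalar @seen;
printf "errors() returned %d item(s)\n",  scalar @errors;
print  "Error objects are of class ", ref($errors[0]), "\n" if @errors;

# Remove the handler again by setting it to undef.
$parser->error_handler(undef);
```
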
   Additional Methods
       The module provides a few methods to obtain additional, non-DOM data
       from DOM nodes.

       "dtd_public_id"
             $pubid = $parser->dtd_public_id( $doc );

           For an XML::LibXML::Document which has been returned by
           HTML::HTML5::Parser, using this method will tell you the Public
           Identifier of the DTD used (if any).

       "dtd_system_id"
             $sysid = $parser->dtd_system_id( $doc );

           For an XML::LibXML::Document which has been returned by
           HTML::HTML5::Parser, using this method will tell you the System
           Identifier of the DTD used (if any).

       "dtd_element"
             $element = $parser->dtd_element( $doc );

           For an XML::LibXML::Document which has been returned by
           HTML::HTML5::Parser, using this method will tell you the root
           element declared in the DTD used (if any). That is, if the document
           has this doctype:

             <!doctype html>

           ... it will return "html".

           This may return the empty string if a DTD was present but did not
           contain a root element; or undef if no DTD was present.

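           The three DTD accessors can be exercised together. The sketch
           below uses an HTML 4.01 Strict doctype purely as sample input and
           prints whatever each method reports, showing undef explicitly.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

my $doc = $parser->parse_string(<<'EOT');
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<title>Demo</title>
<p>Hello
EOT

# Call each accessor by name and report its value.
my %dtd;
for my $method (qw(dtd_element dtd_public_id dtd_system_id)) {
    $dtd{$method} = $parser->$method($doc);
    printf "%-14s %s\n", $method,
        defined $dtd{$method} ? qq{"$dtd{$method}"} : 'undef';
}
```
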
       "compat_mode"
             $mode = $parser->compat_mode( $doc );

           Returns 'quirks', 'limited quirks' or undef (standards mode).

       "charset"
             $charset = $parser->charset( $doc );

           The character set apparently used by the document.

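           Since "compat_mode" returns undef for standards mode, callers
           usually need a defined-or fallback when printing it. A small
           sketch comparing a document with and without a doctype; the
           sample markup and the labels are illustrative:

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

my %samples = (
    'with doctype'    => '<!doctype html><title>A</title><p>Hi',
    'without doctype' => '<title>B</title><p>Hi',
);

my %report;
for my $label (sort keys %samples) {
    my $doc = $parser->parse_string($samples{$label});

    # undef means standards mode; substitute readable text for printing.
    my $mode    = $parser->compat_mode($doc) // 'standards (undef)';
    my $charset = $parser->charset($doc)     // '(not reported)';

    $report{$label} = "mode=$mode, charset=$charset";
    print "$label: $report{$label}\n";
}
```
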
       "source_line"
             ($line, $column, $implicitness) = $parser->source_line( $node );
             $line = $parser->source_line( $node );

           In scalar context, "source_line" returns the line number of the
           source code that started a particular node (element, attribute or
           comment).

           In list context, returns a tuple: $line, $column, $implicitness.
           Tab characters count as one column, not eight.

           $implicitness indicates that the node was not explicitly marked up
           in the source code, but its existence was inferred by the parser.
           For example, in the following markup, the HTML, TITLE and P
           elements are explicit, but the HEAD and BODY elements are implicit.

            <html>
             <title>I have an implicit head</title>
             <p>And an implicit body too!</p>
            </html>

           (Note that implicit elements do still have a line number and column
           number.) The implicitness indicator is a new feature, and I'd
           appreciate any bug reports where it gets things wrong.

           XML::LibXML::Node has a "line_number" method. Ordinarily this
           returns 0, and HTML::HTML5::Parser has no way of influencing
           it. However, if you install XML::LibXML::Devel::SetLineNumber on
           your system, the "line_number" method will start working (at least
           for elements).

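           As an illustration of the list-context form, the sketch below
           parses the markup shown above and reports the position and
           implicitness of every element. The output layout is invented for
           the example.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

my $doc = $parser->parse_string(<<'EOT');
<html>
 <title>I have an implicit head</title>
 <p>And an implicit body too!</p>
</html>
EOT

# Visit every element in document order.
my @elements = $doc->findnodes('//*');
for my $element (@elements) {
    my ($line, $column, $implicit) = $parser->source_line($element);
    printf "%-6s line %s, column %s%s\n",
        $element->nodeName,
        $line   // '?',
        $column // '?',
        $implicit ? ' (implicit)' : '';
}
```
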

AUTHOR

       Toby Inkster, <tobyink@cpan.org>

COPYRIGHT AND LICENCE

       Copyright (C) 2007-2011 by Wakaba

       Copyright (C) 2009-2012 by Toby Inkster

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

DISCLAIMER OF WARRANTIES

       THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
       WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
       MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

perl v5.36.0                      2022-07-22            HTML::HTML5::Parser(3)
