HTML::Gumbo(3)        User Contributed Perl Documentation       HTML::Gumbo(3)

NAME

       HTML::Gumbo - HTML5 parser based on gumbo C library

SYNOPSIS

           use feature 'say';
           use HTML::Gumbo;
           say HTML::Gumbo->new->parse('<div></div>');

           say HTML::Gumbo->new->parse('<h1>Hello</h1>', format => 'tree')->as_HTML;

DESCRIPTION

       Gumbo <https://github.com/google/gumbo-parser> is an implementation of
       the HTML5 parsing algorithm <http://www.w3.org/TR/html5/syntax.html>,
       written as a pure C99 library with no outside dependencies.

       Goals and features of the C library:

       •   Fully conformant with the HTML5 spec.

       •   Robust and resilient to bad input.

       •   Simple API that can be easily wrapped by other languages. (This
           module is one such wrapper.)

       •   Support for source locations and pointers back to the original
           text.  (Not exposed by this implementation at the moment.)

       •   Relatively lightweight, with no outside dependencies.

       •   Passes all html5lib-0.95 tests.

       •   Tested on over 2.5 billion pages from Google's index.

METHODS

   new
           my $parser = HTML::Gumbo->new;

       No options at the moment.

   parse
           my $res = $parser->parse(
               "<h1>hello world!</h1>",
               format => 'tree',
               input_is => 'string',
           );

       Takes an HTML string followed by pairs of named arguments:

       format
           Output format; the default is 'string'. See "SUPPORTED OUTPUT
           FORMATS".

       fragment_namespace
           Enables the fragment parsing algorithm. Pass 'HTML', 'SVG' or
           'MATHML' to enable it and set the namespace. Without this
           argument the input is parsed as an HTML document, so html, head,
           title and body tags are added if absent.

           Note that fragment_enclosing_tag is set to '<body>' and cannot be
           changed at the moment. Feel free to send patches implementing
           this part.

           See "SUPPORTED OUTPUT FORMATS" for additional details.

           Note that SVG and MATHML parsing is not tested; feel free to file
           bug reports with tests if it doesn't work.

       input_is
           Whether the HTML is a Perl 'string', 'octets' or 'utf8' (octets
           known to be UTF-8). See "CHARACTER ENCODING OF THE INPUT".

       encoding, encoding_content_type, encoding_tentative
           See "CHARACTER ENCODING OF THE INPUT".

       ... Some formatters may take additional arguments; see "SUPPORTED
           OUTPUT FORMATS".

       The return value depends on the chosen format.

SUPPORTED OUTPUT FORMATS

   string
       HTML is parsed and re-built from the tree, so tags are balanced (except
       for void elements, which have no end tag).

       This format takes no additional arguments.

           $html = HTML::Gumbo->new->parse( $html );

   callback
       An HTML::Parser-like interface. Pass a sub as the "callback" argument
       to the "parse" method and it will be called for every node in the
       document:

           HTML::Gumbo->new->parse( $html, format => 'callback', callback => sub {
               my ($event) = shift;
               if ( $event eq 'document start' ) {
                   my ($doctype) = @_;
               }
               elsif ( $event eq 'document end' ) {
               }
               elsif ( $event eq 'start' ) {
                   my ($tag, $attrs) = @_;
               }
               elsif ( $event eq 'end' ) {
                   my ($tag) = @_;
               }
               # text-like nodes all carry a single text argument
               elsif ( $event =~ /^(text|space|cdata|comment)$/ ) {
                   my ($text) = @_;
               }
               else {
                   die "Unknown event";
               }
           } );

       Note that 'end' events are not generated for void elements
       <http://www.w3.org/TR/html5/syntax.html#void-elements>, for example
       "hr", "br" and "img".

       No additional arguments other than the "callback" mentioned above.

       Fragment parsing still generates 'document start' and 'document end'
       events, which can be handy for initializing your parsing callback.

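       As a minimal sketch of how such a callback might be used, the handler
       below collects start tag names and concatenates text nodes. The names
       make_collector, $tags and $text are hypothetical and not part of
       HTML::Gumbo. So that the block runs without the module installed, it
       drives the handler with a hand-written event sequence that follows the
       event names and arguments described above; attribute data is left out
       because its exact shape is not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a callback plus the structures it fills.
sub make_collector {
    my @tags;
    my $text = '';
    my $callback = sub {
        my ( $event, @args ) = @_;
        if ( $event eq 'start' ) {
            my ($tag) = @args;
            push @tags, $tag;
        }
        elsif ( $event eq 'text' ) {
            $text .= $args[0];
        }
        # every other event ('document start', 'end', ...) is ignored here
    };
    return ( $callback, \@tags, \$text );
}

my ( $callback, $tags, $text ) = make_collector();

# Simulated events for '<p>hello <br>world</p>'.
# The void element "br" gets a 'start' event but no 'end' event.
my @events = (
    [ 'document start', undef ],
    [ 'start', 'p',  undef ],    # attributes omitted in this simulation
    [ 'text',  'hello ' ],
    [ 'start', 'br', undef ],
    [ 'text',  'world' ],
    [ 'end',   'p' ],
    [ 'document end' ],
);
$callback->(@$_) for @events;

print join( ',', @$tags ), "\n";    # p,br
print $$text, "\n";                 # hello world
```

       With the module installed, the same sub would be passed to "parse" as
       the "callback" argument together with format => 'callback'.
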
   tree
       Alpha stage.

       Produces a tree based on HTML::Element, like HTML::TreeBuilder.

       There is a major difference from HTML::TreeBuilder: this format
       produces a top-level element with the tag name 'document', which may
       have the doctype, comments and html tags as children.

       Fragment parsing still produces a top-level 'document' element, as a
       fragment can be a list of tags, for example: '<p>hello</p><p>world</p>'.

       It is not yet ready to use as a drop-in replacement for
       HTML::TreeBuilder. Patches are welcome, as I don't use this formatter
       at the moment. Note that it's hard to get rid of the top-level element
       because of the situations described above, so it might be a good idea
       to write an HTML::Gumbo::Document class that is either a subclass of
       HTML::Element or implements a small subset of its methods.

CHARACTER ENCODING OF THE INPUT

       The C parser works only with UTF-8, so there are several ways to make
       sure the input is UTF-8. First of all, set the "input_is" argument:

       string
           Input is a Perl string, for example one obtained from
           "decoded_content" in HTTP::Response.  This is the default.

               $gumbo->parse( decode_utf8($octets) );

       octets
           Input is octets. A partial implementation of the encoding sniffing
           algorithm
           <http://www.w3.org/TR/html5/syntax.html#encoding-sniffing-algorithm>
           is used. The first of the following that applies wins:

           "encoding" argument
               Use it to hardcode a specific encoding.

                   $gumbo->parse( $octets, input_is => 'octets', encoding => 'latin-1' );

           BOM UTF-8/UTF-16 BOMs are checked.

           "encoding_content_type" argument
               Encoding from the transport layer, i.e. the charset in the
               Content-Type header.

                   $gumbo->parse( $octets, input_is => 'octets', encoding_content_type => 'latin-1' );

           Prescan
               Not implemented; follow issue 58
               <https://github.com/google/gumbo-parser/issues/58>.

               HTML5 defines a prescan algorithm
               <http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding>
               that extracts the encoding from meta tags in the head.

               It would be cool to get it in the C library, but I will accept
               a patch that implements it in pure Perl.

           "encoding_tentative" argument
               The likely encoding for this page, e.g. based on the encoding
               of the page when it was last visited.

                   $gumbo->parse( $octets, input_is => 'octets', encoding_tentative => 'latin-1' );

           nested browsing context
               Not implemented. Fragment parsing with or without context is
               not implemented. The parser also has no origin information, so
               this step will not be implemented.

           autodetection
               Not implemented.

               Can be implemented using Encode::Detect::Detector. Patches are
               welcome.

           otherwise
               It dies.

       utf8
           Use 'utf8' as "input_is" when you are sure the input is UTF-8
           octets rather than a decoded Perl string. No pre-processing is
           done at all. Should only be used on trusted input or on input
           that has already been preprocessed.

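       The "first thing wins" order for 'octets' input can be sketched in
       plain Perl. This only illustrates the precedence documented above and
       is not HTML::Gumbo's actual code: pick_encoding is a hypothetical
       helper, its BOM check covers just UTF-8 and UTF-16, and the steps
       listed as not implemented are skipped.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper mirroring the documented order:
# "encoding", BOM, "encoding_content_type", "encoding_tentative", else die.
sub pick_encoding {
    my ( $octets, %args ) = @_;

    # An explicit "encoding" argument overrides everything else.
    return $args{encoding} if defined $args{encoding};

    # Byte order marks come next.
    return 'UTF-8'    if $octets =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $octets =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $octets =~ /\A\xFF\xFE/;

    # Then the transport-layer charset, then the tentative guess.
    for my $name (qw(encoding_content_type encoding_tentative)) {
        return $args{$name} if defined $args{$name};
    }

    die "cannot determine input encoding\n";
}

my $octets = "caf\xE9";    # "café" encoded as ISO-8859-1, no BOM
my $enc    = pick_encoding( $octets, encoding_tentative => 'iso-8859-1' );
my $string = decode( $enc, $octets );

print "$enc\n";                 # iso-8859-1
print length($string), "\n";    # 4
```

       Once decoded this way, the resulting Perl string is what the default
       input_is => 'string' mode expects.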

AUTHOR

       Ruslan Zakirov <ruz@bestpractical.com>

LICENSE

       Under the same terms as Perl itself.

perl v5.36.0                      2022-07-22                    HTML::Gumbo(3)