HTML::Gumbo(3)        User Contributed Perl Documentation       HTML::Gumbo(3)

NAME
       HTML::Gumbo - HTML5 parser based on the Gumbo C library

SYNOPSIS
           use feature 'say';
           use HTML::Gumbo;
           say HTML::Gumbo->new->parse('<div></div>');

           say HTML::Gumbo->new->parse('<h1>Hello</h1>', format => 'tree')->as_HTML;

DESCRIPTION
       Gumbo <https://github.com/google/gumbo-parser> is an implementation of
       the HTML5 parsing algorithm <http://www.w3.org/TR/html5/syntax.html>
       written as a pure C99 library with no outside dependencies.

       Goals and features of the C library:

       •   Fully conformant with the HTML5 spec.

       •   Robust and resilient to bad input.

       •   Simple API that can be easily wrapped by other languages. (This
           module is one such wrapper.)

       •   Support for source locations and pointers back to the original
           text. (Not exposed by this implementation at the moment.)

       •   Relatively lightweight, with no outside dependencies.

       •   Passes all html5lib-0.95 tests.

       •   Tested on over 2.5 billion pages from Google's index.

METHODS
   new
           my $parser = HTML::Gumbo->new;

       No options at the moment.

   parse
           my $res = $parser->parse(
               "<h1>hello world!</h1>",
               format   => 'tree',
               input_is => 'string',
           );

       Takes an HTML string followed by pairs of named arguments:

       format
           Output format; the default is 'string'. See "SUPPORTED OUTPUT
           FORMATS".

       fragment_namespace
           Enables the fragment parsing algorithm. Pass 'HTML', 'SVG' or
           'MATHML' to enable it and set the namespace. Without this
           argument the input is parsed as an HTML document, so html, head,
           title and body tags are added if absent.

           Note that fragment_enclosing_tag is set to '<body>' and cannot be
           changed at the moment. Patches implementing this part are
           welcome.

           See "SUPPORTED OUTPUT FORMATS" for additional details.

           Note that SVG and MATHML parsing is not tested; please file bug
           reports with tests if it doesn't work.

       input_is
           Whether the HTML is a Perl 'string', 'octets' or 'utf8' (octets
           known to be UTF-8). See "CHARACTER ENCODING OF THE INPUT".

       encoding, encoding_content_type, encoding_tentative
           See "CHARACTER ENCODING OF THE INPUT".

       ... Some formatters may have additional arguments, see "SUPPORTED
           OUTPUT FORMATS".

       The return value depends on the chosen format.
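
       As a minimal sketch of the difference between document and fragment
       parsing described above, the following parses the same input both
       ways using the default 'string' format. It uses only the arguments
       documented in this section; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Gumbo;

my $parser = HTML::Gumbo->new;

# Default: the input is treated as a whole document, so missing
# wrapper tags such as html, head and body are added.
my $document = $parser->parse('<p>hello</p>');

# Fragment mode: the input is parsed as a fragment in the HTML namespace.
my $fragment = $parser->parse( '<p>hello</p>', fragment_namespace => 'HTML' );

print "document: $document\n";
print "fragment: $fragment\n";
```

       Comparing the two printed strings is a quick way to check which mode
       suits a given input.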

SUPPORTED OUTPUT FORMATS
   string
       HTML is parsed and re-built from the tree, so tags are balanced (except
       void elements).

       No additional arguments specific to this format.

           $html = HTML::Gumbo->new->parse( $html );

   callback
       HTML::Parser-like interface. Pass a sub as the "callback" argument to
       the "parse" method and it will be called for every node in the
       document:

           HTML::Gumbo->new->parse( $html, format => 'callback', callback => sub {
               my $event = shift;
               if ( $event eq 'document start' ) {
                   my ($doctype) = @_;
               }
               elsif ( $event eq 'document end' ) {
               }
               elsif ( $event eq 'start' ) {
                   my ($tag, $attrs) = @_;
               }
               elsif ( $event eq 'end' ) {
                   my ($tag) = @_;
               }
               elsif ( $event =~ /^(text|space|cdata|comment)$/ ) {
                   my ($text) = @_;
               }
               else {
                   die "Unknown event";
               }
           } );

       Note that 'end' events are not generated for void elements
       <http://www.w3.org/TR/html5/syntax.html#void-elements>, for example
       "hr", "br" and "img".

       No additional arguments apart from "callback".

       Fragment parsing still generates 'document start' and 'document end'
       events, which can be handy for initializing your parsing callback.
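
       A short sketch of the callback format in use: it collects the
       contents of 'text' events and counts how often each event type
       fires. Only the event names and arguments listed above are relied
       on; the sample markup and the @text and %seen variables are
       illustrative.

```perl
use strict;
use warnings;
use HTML::Gumbo;

my $html = '<h1>Hello</h1><p>world<br>again</p>';

my @text;    # contents of text nodes, in document order
my %seen;    # number of times each event type fired

HTML::Gumbo->new->parse( $html, format => 'callback', callback => sub {
    my $event = shift;
    $seen{$event}++;

    # For 'text' events the remaining argument is the text itself.
    push @text, $_[0] if $event eq 'text';
} );

print join( ' ', @text ), "\n";
printf "%-15s %d\n", $_, $seen{$_} for sort keys %seen;
```

       Because "br" is a void element, the count of 'start' events should
       come out higher than the count of 'end' events for this input.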

   tree
       Alpha stage.

       Produces a tree based on HTML::Element, like HTML::TreeBuilder.

       There is a major difference from HTML::TreeBuilder: this format
       produces a top level element with tag name 'document', which may have
       doctype, comments and html tags as children.

       Fragment parsing still produces a top level 'document' element, as a
       fragment can be a list of tags, for example:
       '<p>hello</p><p>world</p>'.

       It is not ready to use as a drop-in replacement for HTML::TreeBuilder.
       Patches are welcome as I don't use this formatter at the moment. Note
       that it's hard to get rid of the top level element because of the
       situations described above. One reasonable approach would be to write
       an HTML::Gumbo::Document class that is either a subclass of
       HTML::Element or implements a small subset of the methods of
       HTML::Element.
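
       The following sketch walks a tree produced from a fragment. It
       assumes the returned object behaves like an ordinary HTML::Element,
       so that the standard "tag", "look_down" and "as_text" methods work
       on it; given the alpha status of this format, treat it as an
       illustration rather than a guarantee.

```perl
use strict;
use warnings;
use HTML::Gumbo;

my $tree = HTML::Gumbo->new->parse(
    '<p>hello</p><p>world</p>',
    format             => 'tree',
    fragment_namespace => 'HTML',
);

# The root is the synthetic 'document' element described above.
print "root tag: ", $tree->tag, "\n";

# look_down and as_text are standard HTML::Element methods.
for my $p ( $tree->look_down( _tag => 'p' ) ) {
    print "paragraph: ", $p->as_text, "\n";
}
```

       Code written for HTML::TreeBuilder usually expects an 'html' root, so
       it may need to descend one level before it works with this tree.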

CHARACTER ENCODING OF THE INPUT
       The C parser works only with UTF-8, so you have several options to make
       sure the input is UTF-8. First of all, define the "input_is" argument:

       string
           Input is a Perl string, for example obtained from
           "decoded_content" in HTTP::Response. Default value.

               $gumbo->parse( decode_utf8($octets) );

       octets
           Input is octets. A partial implementation of the encoding sniffing
           algorithm
           <http://www.w3.org/TR/html5/syntax.html#encoding-sniffing-algorithm>
           is used. First thing wins:

           "encoding" argument
               Use it to hardcode a specific encoding.

                   $gumbo->parse( $octets, input_is => 'octets', encoding => 'latin-1' );

           BOM UTF-8/UTF-16 BOMs are checked.

           "encoding_content_type" argument
               Encoding from the transport layer, that is, the charset in the
               Content-Type header.

                   $gumbo->parse( $octets, input_is => 'octets', encoding_content_type => 'latin-1' );

           Prescan
               Not implemented, follow issue 58
               <https://github.com/google/gumbo-parser/issues/58>.

               HTML5 defines a prescan algorithm
               <http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding>
               that extracts the encoding from meta tags in the head.

               It would be cool to get it in the C library, but I will accept
               a patch that implements it in pure Perl.

           "encoding_tentative" argument
               The likely encoding for this page, e.g. based on the encoding
               of the page when it was last visited.

                   $gumbo->parse( $octets, input_is => 'octets', encoding_tentative => 'latin-1' );

           nested browsing context
               Not implemented. Fragment parsing with or without context is
               not implemented. The parser also has no origin information,
               so this step won't be implemented.

           autodetection
               Not implemented.

               Can be implemented using Encode::Detect::Detector. Patches are
               welcome.

           otherwise
               It dies.

       utf8
           Use 'utf8' as input_is when you're sure the input is UTF-8 but is
           still octets. No pre-processing is done at all. Should only be
           used on trusted input or when it has already been preprocessed.
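
       The "first thing wins" order for octets can be sketched in pure Perl.
       The pick_encoding helper below is hypothetical and is not part of
       HTML::Gumbo; it only mirrors the implemented steps listed above
       (explicit encoding, BOM, Content-Type charset, tentative encoding,
       otherwise die) and then decodes with the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of HTML::Gumbo: returns the name of the
# encoding that the documented "first thing wins" order would select.
sub pick_encoding {
    my ( $octets, %args ) = @_;

    return $args{encoding} if defined $args{encoding};

    # Byte order marks, checked on the raw octets.
    return 'UTF-8'    if $octets =~ /^\xEF\xBB\xBF/;
    return 'UTF-16BE' if $octets =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $octets =~ /^\xFF\xFE/;

    return $args{encoding_content_type} if defined $args{encoding_content_type};
    return $args{encoding_tentative}    if defined $args{encoding_tentative};

    die "Cannot determine the input encoding\n";
}

my $octets = "caf\xE9";    # "cafe" with an acute accent, encoded as Latin-1
my $name   = pick_encoding( $octets, encoding_tentative => 'iso-8859-1' );
my $string = decode( $name, $octets );

printf "%s, %d characters\n", $name, length $string;
# prints "iso-8859-1, 4 characters"
```

       A string decoded this way is a Perl string, so it can be passed to
       "parse" with the default input_is value of 'string'.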

AUTHOR
       Ruslan Zakirov <ruz@bestpractical.com>

LICENSE
       Under the same terms as perl itself.

perl v5.34.1                      2022-04-01                    HTML::Gumbo(3)