HTML::Element(3)      User Contributed Perl Documentation      HTML::Element(3)

6 HTML::Element - Class for objects that represent HTML elements
7
9 This document describes version 5.03 of HTML::Element, released
10 September 22, 2012 as part of HTML-Tree.
11
13 use HTML::Element;
14 $a = HTML::Element->new('a', href => 'http://www.perl.com/');
15 $a->push_content("The Perl Homepage");
16
17 $tag = $a->tag;
18 print "$tag starts out as:", $a->starttag, "\n";
19 print "$tag ends as:", $a->endtag, "\n";
20 print "$tag\'s href attribute is: ", $a->attr('href'), "\n";
21
22 $links_r = $a->extract_links();
23 print "Hey, I found ", scalar(@$links_r), " links.\n";
24
25 print "And that, as HTML, is: ", $a->as_HTML, "\n";
26 $a = $a->delete;
27
29 (This class is part of the HTML::Tree dist.)
30
Objects of the HTML::Element class can be used to represent elements of
HTML document trees. These objects have attributes, notably attributes
that designate each element's parent and content. The content is an
array of text segments and other HTML::Element objects. A tree with
HTML::Element objects as nodes can represent the syntax tree for an HTML
document.
37
39 Consider this HTML document:
40
41 <html lang='en-US'>
42 <head>
43 <title>Stuff</title>
44 <meta name='author' content='Jojo'>
45 </head>
46 <body>
47 <h1>I like potatoes!</h1>
48 </body>
49 </html>
50
51 Building a syntax tree out of it makes a tree-structure in memory that
52 could be diagrammed as:
53
                     html (lang='en-US')
                      / \
                    /     \
                  /         \
                head        body
               /\               \
             /    \               \
           /        \               \
         title     meta              h1
          |       (name='author',     |
       "Stuff"    content='Jojo')    "I like potatoes"
65
66 This is the traditional way to diagram a tree, with the "root" at the
67 top, and it's this kind of diagram that people have in mind when they
68 say, for example, that "the meta element is under the head element
69 instead of under the body element". (The same is also said with
70 "inside" instead of "under" -- the use of "inside" makes more sense
71 when you're looking at the HTML source.)
72
73 Another way to represent the above tree is with indenting:
74
    html (attributes: lang='en-US')
      head
        title
          "Stuff"
        meta (attributes: name='author' content='Jojo')
      body
        h1
          "I like potatoes"
83
84 Incidentally, diagramming with indenting works much better for very
85 large trees, and is easier for a program to generate. The
86 "$tree->dump" method uses indentation just that way.
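
As a minimal sketch of the above, the example document can be built by
hand and dumped. This assumes the "new_from_lol" constructor (mentioned
under "push_content" below) accepts a hashref of attributes right after
the tag name; the variable name $tree is just for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

# Build the example document from a nested list-of-lists.
my $tree = HTML::Element->new_from_lol(
    ['html', { lang => 'en-US' },
        ['head',
            ['title', 'Stuff'],
            ['meta', { name => 'author', content => 'Jojo' }],
        ],
        ['body',
            ['h1', 'I like potatoes!'],
        ],
    ]
);

# Print the indented debugging view of the tree to STDOUT.
$tree->dump;
```

The exact layout that "dump" prints is meant for debugging only, so it
is best not to parse it.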
87
88 However you diagram the tree, it's stored the same in memory -- it's a
89 network of objects, each of which has attributes like so:
90
    element #1:  _tag: 'html'
                 _parent: none
                 _content: [element #2, element #5]
                 lang: 'en-US'

    element #2:  _tag: 'head'
                 _parent: element #1
                 _content: [element #3, element #4]

    element #3:  _tag: 'title'
                 _parent: element #2
                 _content: [text segment "Stuff"]

    element #4:  _tag: 'meta'
                 _parent: element #2
                 _content: none
                 name: author
                 content: Jojo

    element #5:  _tag: 'body'
                 _parent: element #1
                 _content: [element #6]

    element #6:  _tag: 'h1'
                 _parent: element #5
                 _content: [text segment "I like potatoes"]
117
118 The "treeness" of the tree-structure that these elements comprise is
119 not an aspect of any particular object, but is emergent from the
120 relatedness attributes (_parent and _content) of these element-objects
121 and from how you use them to get from element to element.
122
123 While you could access the content of a tree by writing code that says
124 "access the 'src' attribute of the root's first child's seventh child's
125 third child", you're more likely to have to scan the contents of a
126 tree, looking for whatever nodes, or kinds of nodes, you want to do
127 something with. The most straightforward way to look over a tree is to
128 "traverse" it; an HTML::Element method ("$h->traverse") is provided for
129 this purpose; and several other HTML::Element methods are based on it.
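
To make the idea of scanning a tree concrete, here is a hand-rolled
depth-first walk. It deliberately does not use "$h->traverse" itself;
it relies only on "tag" and "content_list", and the helper name
"outline" is made up for this sketch.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'Stuff']],
        ['body', ['h1', 'I like potatoes!']],
    ]
);

# Return one indented line per node, parents before children.
sub outline {
    my ($node, $depth) = @_;
    my $pad = '  ' x $depth;
    return $pad . qq{"$node"} unless ref $node;    # a text segment
    return ($pad . $node->tag,
            map { outline($_, $depth + 1) } $node->content_list);
}

my @lines = outline($tree, 0);
print "$_\n" for @lines;
```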
130
131 (For everything you ever wanted to know about trees, and then some, see
132 Niklaus Wirth's Algorithms + Data Structures = Programs or Donald
133 Knuth's The Art of Computer Programming, Volume 1.)
134
135 Weak References
136 TL;DR summary: "use HTML::TreeBuilder 5 -weak;" and forget about the
137 "delete" method (except for pruning a node from a tree).
138
139 Because HTML::Element stores a reference to the parent element, Perl's
140 reference-count garbage collection doesn't work properly with
141 HTML::Element trees. Starting with version 5.00, HTML::Element uses
142 weak references (if available) to prevent that problem. Weak
143 references were introduced in Perl 5.6.0, but you also need a version
144 of Scalar::Util that provides the "weaken" function.
145
146 Weak references are enabled by default. If you want to be certain
147 they're in use, you can say "use HTML::Element 5 -weak;". You must
148 include the version number; previous versions of HTML::Element ignored
149 the import list entirely.
150
151 To disable weak references, you can say "use HTML::Element -noweak;".
152 This is a global setting. This feature is deprecated and is provided
153 only as a quick fix for broken code. If your code does not work
154 properly with weak references, you should fix it immediately, as weak
155 references may become mandatory in a future version. Generally, all
156 you need to do is keep a reference to the root of the tree until you're
157 done working with it.
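
A small sketch of that advice, assuming weak references are available
on your Perl: as long as $root stays in scope, the parent links of the
nodes under it remain usable. The variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Element 5 -weak;

my $root = HTML::Element->new_from_lol(['div', ['p', 'hello']]);
my ($para) = $root->content_list;

# $root is still referenced here, so the child's weak link to its
# parent is intact and can be followed.
print $para->parent->tag, "\n";
```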
158
Because HTML::TreeBuilder is a subclass of HTML::Element, you can also
import "-weak" or "-noweak" from HTML::TreeBuilder: e.g.
"use HTML::TreeBuilder 5 -weak;".
162
164 new
165 $h = HTML::Element->new('tag', 'attrname' => 'value', ... );
166
167 This constructor method returns a new HTML::Element object. The tag
168 name is a required argument; it will be forced to lowercase.
169 Optionally, you can specify other initial attributes at object creation
170 time.
171
172 attr
173 $value = $h->attr('attr');
174 $old_value = $h->attr('attr', $new_value);
175
176 Returns (optionally sets) the value of the given attribute of $h. The
177 attribute name (but not the value, if provided) is forced to lowercase.
178 If trying to read the value of an attribute not present for this
179 element, the return value is undef. If setting a new value, the old
180 value of that attribute is returned.
181
If methods are provided for accessing an attribute (like "$h->tag" for
"_tag", "$h->content_list", etc. below), use those instead of calling
"$h->attr", whether for reading or setting.
185
186 Note that setting an attribute to "undef" (as opposed to "", the empty
187 string) actually deletes the attribute.
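
The reading, setting, and deleting behaviors described above can be
sketched as follows. The URL and variable names are arbitrary examples.

```perl
use strict;
use warnings;
use HTML::Element;

my $link = HTML::Element->new('a', href => 'http://www.perl.com/');

my $current = $link->attr('href');                        # read
my $old     = $link->attr('href', 'http://example.com/'); # set; returns the old value
my $missing = $link->attr('title');                       # not present, so undef

# Setting to undef deletes the attribute altogether.
$link->attr('href', undef);
my $remaining = $link->all_external_attr_names;           # count, in scalar context
```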
188
189 tag
190 $tagname = $h->tag();
191 $h->tag('tagname');
192
193 Returns (optionally sets) the tag name (also known as the generic
194 identifier) for the element $h. In setting, the tag name is always
195 converted to lower case.
196
197 There are four kinds of "pseudo-elements" that show up as HTML::Element
198 objects:
199
200 Comment pseudo-elements
201 These are element objects with a "$h->tag" value of "~comment", and
202 the content of the comment is stored in the "text" attribute
203 ("$h->attr("text")"). For example, parsing this code with
204 HTML::TreeBuilder...
205
206 <!-- I like Pie.
207 Pie is good
208 -->
209
210 produces an HTML::Element object with these attributes:
211
212 "_tag",
213 "~comment",
214 "text",
215 " I like Pie.\n Pie is good\n "
216
217 Declaration pseudo-elements
218 Declarations (rarely encountered) are represented as HTML::Element
219 objects with a tag name of "~declaration", and content in the
220 "text" attribute. For example, this:
221
222 <!DOCTYPE foo>
223
224 produces an element whose attributes include:
225
226 "_tag", "~declaration", "text", "DOCTYPE foo"
227
228 Processing instruction pseudo-elements
229 PIs (rarely encountered) are represented as HTML::Element objects
230 with a tag name of "~pi", and content in the "text" attribute. For
231 example, this:
232
233 <?stuff foo?>
234
235 produces an element whose attributes include:
236
237 "_tag", "~pi", "text", "stuff foo?"
238
239 (assuming a recent version of HTML::Parser)
240
241 ~literal pseudo-elements
242 These objects are not currently produced by HTML::TreeBuilder, but
243 can be used to represent a "super-literal" -- i.e., a literal you
244 want to be immune from escaping. (Yes, I just made that term up.)
245
246 That is, this is useful if you want to insert code into a tree that
247 you plan to dump out with "as_HTML", where you want, for some
248 reason, to suppress "as_HTML"'s normal behavior of amp-quoting text
249 segments.
250
251 For example, this:
252
253 my $literal = HTML::Element->new('~literal',
254 'text' => 'x < 4 & y > 7'
255 );
256 my $span = HTML::Element->new('span');
257 $span->push_content($literal);
258 print $span->as_HTML;
259
260 prints this:
261
262 <span>x < 4 & y > 7</span>
263
264 Whereas this:
265
    my $span = HTML::Element->new('span');
    $span->push_content('x < 4 & y > 7');   # normal text segment
    print $span->as_HTML;

prints this:

    <span>x &lt; 4 &amp; y &gt; 7</span>
274
275 Unless you're inserting lots of pre-cooked code into existing
276 trees, and dumping them out again, it's not likely that you'll find
277 "~literal" pseudo-elements useful.
278
279 parent
280 $parent = $h->parent();
281 $h->parent($new_parent);
282
283 Returns (optionally sets) the parent (aka "container") for this
284 element. The parent should either be undef, or should be another
285 element.
286
287 You should not use this to directly set the parent of an element.
288 Instead use any of the other methods under "Structure-Modifying
289 Methods", below.
290
291 Note that "not($h->parent)" is a simple test for whether $h is the root
292 of its subtree.
293
294 content_list
295 @content = $h->content_list();
296 $num_children = $h->content_list();
297
298 Returns a list of the child nodes of this element -- i.e., what nodes
299 (elements or text segments) are inside/under this element. (Note that
300 this may be an empty list.)
301
302 In a scalar context, this returns the count of the items, as you may
303 expect.
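
For instance, here is a sketch that tells text segments from elements
while looping over the children, and then counts them in scalar
context. The sample paragraph is made up.

```perl
use strict;
use warnings;
use HTML::Element;

my $para = HTML::Element->new_from_lol(
    ['p', 'I like ', ['b', 'potatoes'], '!']
);

# A child is either a plain string (text segment) or an element object.
my @kinds = map { ref($_) ? 'element <' . $_->tag . '>' : 'text' }
            $para->content_list;

my $count = $para->content_list;    # scalar context gives the count
print "$count children: @kinds\n";
```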
304
305 content
306 $content_array_ref = $h->content(); # may return undef
307
308 This somewhat deprecated method returns the content of this element;
309 but unlike content_list, this returns either undef (which you should
310 understand to mean no content), or a reference to the array of content
311 items, each of which is either a text segment (a string, i.e., a
312 defined non-reference scalar value), or an HTML::Element object. Note
313 that even if an arrayref is returned, it may be a reference to an empty
314 array.
315
316 While older code should feel free to continue to use "$h->content", new
317 code should use "$h->content_list" in almost all conceivable cases. It
318 is my experience that in most cases this leads to simpler code anyway,
319 since it means one can say:
320
321 @children = $h->content_list;
322
323 instead of the inelegant:
324
325 @children = @{$h->content || []};
326
327 If you do use "$h->content" (or "$h->content_array_ref"), you should
328 not use the reference returned by it (assuming it returned a reference,
329 and not undef) to directly set or change the content of an element or
330 text segment! Instead use content_refs_list or any of the other
331 methods under "Structure-Modifying Methods", below.
332
333 content_array_ref
334 $content_array_ref = $h->content_array_ref(); # never undef
335
336 This is like "content" (with all its caveats and deprecations) except
337 that it is guaranteed to return an array reference. That is, if the
338 given node has no "_content" attribute, the "content" method would
339 return that undef, but "content_array_ref" would set the given node's
340 "_content" value to "[]" (a reference to a new, empty array), and
341 return that.
342
343 content_refs_list
344 @content_refs = $h->content_refs_list;
345
346 This returns a list of scalar references to each element of $h's
347 content list. This is useful in case you want to in-place edit any
348 large text segments without having to get a copy of the current value
349 of that segment value, modify that copy, then use the "splice_content"
350 to replace the old with the new. Instead, here you can in-place edit:
351
352 foreach my $item_r ($h->content_refs_list) {
353 next if ref $$item_r;
354 $$item_r =~ s/honour/honor/g;
355 }
356
You could currently achieve the same effect with:

    foreach my $item (@{ $h->content_array_ref }) {   # deprecated!
        next if ref $item;
        $item =~ s/honour/honor/g;
    }
364
365 ...except that using the return value of "$h->content" or
366 "$h->content_array_ref" to do that is deprecated, and just might stop
367 working in the future.
368
369 implicit
370 $is_implicit = $h->implicit();
371 $h->implicit($make_implicit);
372
373 Returns (optionally sets) the "_implicit" attribute. This attribute is
374 a flag that's used for indicating that the element was not originally
375 present in the source, but was added to the parse tree (by
376 HTML::TreeBuilder, for example) in order to conform to the rules of
377 HTML structure.
378
379 pos
380 $pos = $h->pos();
381 $h->pos($element);
382
383 Returns (and optionally sets) the "_pos" (for "current position")
384 pointer of $h. This attribute is a pointer used during some parsing
385 operations, whose value is whatever HTML::Element element at or under
386 $h is currently "open", where "$h->insert_element(NEW)" will actually
387 insert a new element.
388
389 (This has nothing to do with the Perl function called "pos", for
390 controlling where regular expression matching starts.)
391
392 If you set "$h->pos($element)", be sure that $element is either $h, or
393 an element under $h.
394
395 If you've been modifying the tree under $h and are no longer sure
396 "$h->pos" is valid, you can enforce validity with:
397
398 $h->pos(undef) unless $h->pos->is_inside($h);
399
400 all_attr
401 %attr = $h->all_attr();
402
403 Returns all this element's attributes and values, as key-value pairs.
404 This will include any "internal" attributes (i.e., ones not present in
405 the original element, and which will not be represented if/when you
406 call "$h->as_HTML"). Internal attributes are distinguished by the fact
407 that the first character of their key (not value! key!) is an
408 underscore ("_").
409
Example output of "$h->all_attr()": '_parent', "[object_value]",
'_tag', 'em', 'lang', 'en-US', '_content', "[array-ref value]".
412
413 all_attr_names
414 @names = $h->all_attr_names();
415 $num_attrs = $h->all_attr_names();
416
417 Like "all_attr", but only returns the names of the attributes. In
418 scalar context, returns the number of attributes.
419
Example output of "$h->all_attr_names()": '_parent', '_tag', 'lang',
'_content'.
422
423 all_external_attr
424 %attr = $h->all_external_attr();
425
426 Like "all_attr", except that internal attributes are not present.
427
428 all_external_attr_names
429 @names = $h->all_external_attr_names();
430 $num_attrs = $h->all_external_attr_names();
431
432 Like "all_attr_names", except that internal attributes' names are not
433 present (or counted).
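
The difference between the four attribute-listing methods can be seen
in a short sketch like this one (the element is an arbitrary example):

```perl
use strict;
use warnings;
use HTML::Element;

my $em = HTML::Element->new('em', lang => 'en-US');
$em->push_content('potatoes');

my %all      = $em->all_attr;             # internal and external pairs
my %external = $em->all_external_attr;    # only the real HTML attributes

# Internal names are the ones starting with an underscore.
my @internal       = grep { /^_/ } $em->all_attr_names;
my $external_count = $em->all_external_attr_names;   # scalar context

print "internal: @internal\n";
print "external: ", join(', ', sort keys %external), "\n";
```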
434
435 id
436 $id = $h->id();
437 $h->id($string);
438
439 Returns (optionally sets to $string) the "id" attribute.
440 "$h->id(undef)" deletes the "id" attribute.
441
442 "$h->id(...)" is basically equivalent to "$h->attr('id', ...)", except
443 that when setting the attribute, this method returns the new value, not
444 the old value.
445
446 idf
447 $id = $h->idf();
448 $h->idf($string);
449
450 Just like the "id" method, except that if you call "$h->idf()" and no
451 "id" attribute is defined for this element, then it's set to a likely-
452 to-be-unique value, and returned. (The "f" is for "force".)
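
A brief sketch of "id" next to "idf". The id value 'main' is an
arbitrary example, and the generated id is whatever the module picks,
so the code does not assume its form.

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new('div');
my $set = $div->id('main');    # unlike attr(), returns the NEW value
my $got = $div->idf;           # an id already exists, so it is returned

my $anon      = HTML::Element->new('div');
my $generated = $anon->idf;    # no id yet, so one is made up and stored
print "generated id: $generated\n";
```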
453
STRUCTURE-MODIFYING METHODS

These methods are provided for modifying the content of trees by adding
or changing nodes as parents or children of other nodes.
457
458 push_content
459 $h->push_content($element_or_text, ...);
460
461 Adds the specified items to the end of the content list of the element
462 $h. The items of content to be added should each be either a text
463 segment (a string), an HTML::Element object, or an arrayref. Arrayrefs
464 are fed thru "$h->new_from_lol(that_arrayref)" to convert them into
elements, before being added to the content list of $h. This means you
can say concise things like:
467
468 $body->push_content(
469 ['br'],
470 ['ul',
471 map ['li', $_], qw(Peaches Apples Pears Mangos)
472 ]
473 );
474
475 See the "new_from_lol" method's documentation, far below, for more
476 explanation.
477
478 Returns $h (the element itself).
479
480 The push_content method will try to consolidate adjacent text segments
481 while adding to the content list. That's to say, if $h's
482 "content_list" is
483
484 ('foo bar ', $some_node, 'baz!')
485
486 and you call
487
488 $h->push_content('quack?');
489
490 then the resulting content list will be this:
491
492 ('foo bar ', $some_node, 'baz!quack?')
493
494 and not this:
495
496 ('foo bar ', $some_node, 'baz!', 'quack?')
497
If the latter is what you want, you'll have to override the feature of
consolidating text by using splice_content, as in:
500
501 $h->splice_content(scalar($h->content_list),0,'quack?');
502
Similarly, if you wanted to add 'Skronk' to the beginning of the
content list by calling this:

    $h->unshift_content('Skronk');

then the resulting content list would be this:

    ('Skronkfoo bar ', $some_node, 'baz!')

and not this:

    ('Skronk', 'foo bar ', $some_node, 'baz!')

What you'd do to get the latter is:
517
518 $h->splice_content(0,0,'Skronk');
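
The consolidation behavior can be checked with a small sketch like the
following, where the "b" element stands in for $some_node:

```perl
use strict;
use warnings;
use HTML::Element;

my $h = HTML::Element->new('p');
$h->push_content('foo bar ', ['b', 'node'], 'baz!');

# The new text is merged into the trailing text segment.
$h->push_content('quack?');
my $merged_count = $h->content_list;
my $last_item    = ($h->content_list)[-1];

# splice_content appends without consolidating.
$h->splice_content(scalar($h->content_list), 0, 'separate');
my $unmerged_count = $h->content_list;

print "$merged_count items, then $unmerged_count items\n";
```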
519
520 unshift_content
521 $h->unshift_content($element_or_text, ...)
522
523 Just like "push_content", but adds to the beginning of the $h element's
524 content list.
525
526 The items of content to be added should each be either a text segment
527 (a string), an HTML::Element object, or an arrayref (which is fed thru
528 "new_from_lol").
529
530 The unshift_content method will try to consolidate adjacent text
531 segments while adding to the content list. See above for a discussion
532 of this.
533
534 Returns $h (the element itself).
535
536 splice_content
537 @removed = $h->splice_content($offset, $length,
538 $element_or_text, ...);
539
540 Detaches the elements from $h's list of content-nodes, starting at
541 $offset and continuing for $length items, replacing them with the
542 elements of the following list, if any. Returns the elements (if any)
543 removed from the content-list. If $offset is negative, then it starts
544 that far from the end of the array, just like Perl's normal "splice"
function. If $length and the following list are omitted, removes
everything from $offset onward.
547
548 The items of content to be added (if any) should each be either a text
549 segment (a string), an arrayref (which is fed thru "new_from_lol"), or
550 an HTML::Element object that's not already a child of $h.
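
As an illustration, this sketch swaps out the middle item of a list
and keeps the removed node. The list contents are made up.

```perl
use strict;
use warnings;
use HTML::Element;

my $list = HTML::Element->new_from_lol(
    ['ul', ['li', 'a'], ['li', 'b'], ['li', 'c']]
);

# Replace one item, starting at offset 1, with a new "li" element
# given as an arrayref.
my @removed = $list->splice_content(1, 1, ['li', 'B']);

my $now = join '', map { $_->as_text } $list->content_list;
print "list now reads: $now\n";
print "removed: ", $removed[0]->as_text, "\n";
```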
551
552 detach
553 $old_parent = $h->detach();
554
This unlinks $h from its parent, by setting its "_parent" attribute to
undef, and by removing it from the content list of its parent (if it
had one). The return value is the parent that was detached from (or
undef, if $h had no parent to start with). Note that neither $h nor
its parent is explicitly destroyed.
560
561 detach_content
562 @old_content = $h->detach_content();
563
564 This unlinks all of $h's children from $h, and returns them. Note that
565 these are not explicitly destroyed; for that, you can just use
566 "$h->delete_content".
567
568 replace_with
569 $h->replace_with( $element_or_text, ... )
570
571 This replaces $h in its parent's content list with the nodes specified.
572 The element $h (which by then may have no parent) is returned. This
573 causes a fatal error if $h has no parent. The list of nodes to insert
574 may contain $h, but at most once. Aside from that possible exception,
575 the nodes to insert should not already be children of $h's parent.
576
577 Also, note that this method does not destroy $h if weak references are
578 turned off -- use "$h->replace_with(...)->delete" if you need that.
579
580 preinsert
581 $h->preinsert($element_or_text...);
582
583 Inserts the given nodes right BEFORE $h in $h's parent's content list.
584 This causes a fatal error if $h has no parent. None of the given nodes
585 should be $h or other children of $h. Returns $h.
586
587 postinsert
588 $h->postinsert($element_or_text...)
589
590 Inserts the given nodes right AFTER $h in $h's parent's content list.
591 This causes a fatal error if $h has no parent. None of the given nodes
592 should be $h or other children of $h. Returns $h.
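
A minimal sketch of "preinsert" and "postinsert" working together,
using freshly made elements so that none of them is already in the
tree. The names are illustrative.

```perl
use strict;
use warnings;
use HTML::Element;

my $list   = HTML::Element->new('ul');
my $item_b = HTML::Element->new('li')->push_content('b');
$list->push_content($item_b);

# push_content returns the element itself, so it can be chained.
$item_b->preinsert(  HTML::Element->new('li')->push_content('a') );
$item_b->postinsert( HTML::Element->new('li')->push_content('c') );

my $order = join '', map { $_->as_text } $list->content_list;
print "items in order: $order\n";
```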
593
594 replace_with_content
595 $h->replace_with_content();
596
597 This replaces $h in its parent's content list with its own content.
598 The element $h (which by then has no parent or content of its own) is
599 returned. This causes a fatal error if $h has no parent. Also, note
600 that this does not destroy $h if weak references are turned off -- use
601 "$h->replace_with_content->delete" if you need that.
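
One common use is "unwrapping" an element while keeping its text. The
sketch below does that for a "b" element and then tidies the parent
with "normalize_content" (described later); the sample text is made up.

```perl
use strict;
use warnings;
use HTML::Element;

my $para = HTML::Element->new_from_lol(
    ['p', 'I ', ['b', 'like'], ' potatoes']
);
my ($bold) = grep { ref } $para->content_list;

# The text of <b> moves up into <p>; $bold is left empty and parentless.
$bold->replace_with_content;

# Merge the now-adjacent text segments into one.
$para->normalize_content;
print $para->as_text, "\n";
```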
602
603 delete_content
604 $h->delete_content();
605 $h->destroy_content(); # alias
606
607 Clears the content of $h, calling "$h->delete" for each content
608 element. Compare with "$h->detach_content".
609
610 Returns $h.
611
612 "destroy_content" is an alias for this method.
613
614 delete
615 $h->delete();
616 $h->destroy(); # alias
617
618 Detaches this element from its parent (if it has one) and explicitly
619 destroys the element and all its descendants. The return value is the
620 empty list (or "undef" in scalar context).
621
622 Before version 5.00 of HTML::Element, you had to call "delete" when you
623 were finished with the tree, or your program would leak memory. This
624 is no longer necessary if weak references are enabled, see "Weak
625 References".
626
627 destroy
628 An alias for "delete".
629
630 destroy_content
631 An alias for "delete_content".
632
633 clone
634 $copy = $h->clone();
635
636 Returns a copy of the element (whose children are clones (recursively)
637 of the original's children, if any).
638
639 The returned element is parentless. Any '_pos' attributes present in
640 the source element/tree will be absent in the copy. For that and other
reasons, the clone of an HTML::TreeBuilder object that's in mid-parse
(i.e., the head of a tree that HTML::TreeBuilder is elaborating) cannot
(currently) be used to continue the parse.
644
645 You are free to clone HTML::TreeBuilder trees, just as long as: 1)
646 they're done being parsed, or 2) you don't expect to resume parsing
647 into the clone. (You can continue parsing into the original; it is
648 never affected.)
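
To illustrate that a clone is independent of its source and has no
parent, consider this sketch (names and content are arbitrary):

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new_from_lol(['div', ['p', 'hi']]);
my ($para) = $div->content_list;

my $copy = $para->clone;
$copy->push_content(' there');    # changes the copy only

print "original: ", $para->as_text, "\n";
print "copy:     ", $copy->as_text, "\n";
print "copy has no parent\n" unless defined $copy->parent;
```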
649
650 clone_list
651 @copies = HTML::Element->clone_list(...nodes...);
652
653 Returns a list consisting of a copy of each node given. Text segments
654 are simply copied; elements are cloned by calling "$it->clone" on each
655 of them.
656
657 Note that this must be called as a class method, not as an instance
658 method. "clone_list" will croak if called as an instance method. You
659 can also call it like so:
660
661 ref($h)->clone_list(...nodes...)
662
663 normalize_content
664 $h->normalize_content
665
666 Normalizes the content of $h -- i.e., concatenates any adjacent text
667 nodes. (Any undefined text segments are turned into empty-strings.)
668 Note that this does not recurse into $h's descendants.
669
670 delete_ignorable_whitespace
671 $h->delete_ignorable_whitespace()
672
673 This traverses under $h and deletes any text segments that are
674 ignorable whitespace. You should not use this if $h is under a "<pre>"
675 element.
676
677 insert_element
678 $h->insert_element($element, $implicit);
679
680 Inserts (via push_content) a new element under the element at
681 "$h->pos()". Then updates "$h->pos()" to point to the inserted
682 element, unless $element is a prototypically empty element like "<br>",
683 "<hr>", "<img>", etc. The new "$h->pos()" is returned. This method is
684 useful only if your particular tree task involves setting "$h->pos()".
685
687 dump
688 $h->dump()
689 $h->dump(*FH) ; # or *FH{IO} or $fh_obj
690
691 Prints the element and all its children to STDOUT (or to a specified
692 filehandle), in a format useful only for debugging. The structure of
693 the document is shown by indentation (no end tags).
694
695 as_HTML
696 $s = $h->as_HTML();
697 $s = $h->as_HTML($entities);
698 $s = $h->as_HTML($entities, $indent_char);
699 $s = $h->as_HTML($entities, $indent_char, \%optional_end_tags);
700
701 Returns a string representing in HTML the element and its descendants.
702 The optional argument $entities specifies a string of the entities to
703 encode. For compatibility with previous versions, specify '<>&' here.
704 If omitted or undef, all unsafe characters are encoded as HTML
705 entities. See HTML::Entities for details. If passed an empty string,
706 no entities are encoded.
707
If $indent_char is specified and defined, the HTML to be output is
indented, using the string you specify (which you probably should set
to "\t", or some number of spaces, if you specify it).
711
712 If "\%optional_end_tags" is specified and defined, it should be a
713 reference to a hash that holds a true value for every tag name whose
714 end tag is optional. Defaults to "\%HTML::Element::optionalEndTag",
715 which is an alias to %HTML::Tagset::optionalEndTag, which, at time of
716 writing, contains true values for "p, li, dt, dd". A useful value to
717 pass is an empty hashref, "{}", which means that no end-tags are
718 optional for this dump. Otherwise, possibly consider copying
719 %HTML::Tagset::optionalEndTag to a hash of your own, adding or deleting
720 values as you like, and passing a reference to that hash.
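
Here is a sketch of the three optional arguments used together:
encoding only '<>&', indenting with two spaces, and passing an empty
hashref so that every end tag is written out. The exact whitespace of
the result is not something this example relies on.

```perl
use strict;
use warnings;
use HTML::Element;

my $list = HTML::Element->new_from_lol(
    ['ul', ['li', 'x < 4'], ['li', 'y > 7']]
);

my $default  = $list->as_HTML;                     # all defaults
my $explicit = $list->as_HTML('<>&', '  ', {});    # no optional end tags

print $explicit, "\n";
```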
721
722 as_text
723 $s = $h->as_text();
724 $s = $h->as_text(skip_dels => 1);
725
726 Returns a string consisting of only the text parts of the element's
727 descendants. Any whitespace inside the element is included unchanged,
728 but whitespace not in the tree is never added. But remember that
729 whitespace may be ignored or compacted by HTML::TreeBuilder during
730 parsing (depending on the value of the "ignore_ignorable_whitespace"
731 and "no_space_compacting" attributes). Also, since whitespace is never
732 added during parsing,
733
734 HTML::TreeBuilder->new_from_content("<p>a</p><p>b</p>")
735 ->as_text;
736
737 returns "ab", not "a b" or "a\nb".
738
739 Text under "<script>" or "<style>" elements is never included in what's
740 returned. If "skip_dels" is true, then text content under "<del>"
741 nodes is not included in what's returned.
742
743 as_trimmed_text
744 $s = $h->as_trimmed_text(...);
$s = $h->as_trimmed_text(extra_chars => '\xA0'); # also remove non-breaking spaces
746 $s = $h->as_text_trimmed(...); # alias
747
748 This is just like "as_text(...)" except that leading and trailing
749 whitespace is deleted, and any internal whitespace is collapsed.
750
751 This will not remove non-breaking spaces, Unicode spaces, or any other
752 non-ASCII whitespace unless you supply the extra characters as a string
753 argument (e.g. "$h->as_trimmed_text(extra_chars => '\xA0')").
754 "extra_chars" may be any string that can appear inside a character
755 class, including ranges like "a-z", POSIX character classes like
756 "[:alpha:]", and character class escapes like "\p{Zs}".
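
The following sketch compares "as_text" with "as_trimmed_text", with
and without "extra_chars". The sample text, which ends in a
non-breaking space followed by an ordinary space, is made up.

```perl
use strict;
use warnings;
use HTML::Element;

my $para = HTML::Element->new_from_lol(
    ['p', "  Hello,\n  ", ['b', 'world'], "\xA0 "]
);

my $raw       = $para->as_text;            # whitespace kept as-is
my $trimmed   = $para->as_trimmed_text;    # ASCII whitespace only
my $nbsp_gone = $para->as_trimmed_text(extra_chars => '\xA0');

printf "%d, %d, and %d characters\n",
    length $raw, length $trimmed, length $nbsp_gone;
```

Note that '\xA0' is single-quoted on purpose: the string is placed
inside a character class, where the regex engine interprets the escape.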
757
758 as_XML
759 $s = $h->as_XML()
760
761 Returns a string representing in XML the element and its descendants.
762
763 The XML is not indented.
764
765 as_Lisp_form
766 $s = $h->as_Lisp_form();
767
768 Returns a string representing the element and its descendants as a Lisp
769 form. Unsafe characters are encoded as octal escapes.
770
771 The Lisp form is indented, and contains external ("href", etc.) as
772 well as internal attributes ("_tag", "_content", "_implicit", etc.),
773 except for "_parent", which is omitted.
774
775 Current example output for a given element:
776
777 ("_tag" "img" "border" "0" "src" "pie.png" "usemap" "#main.map")
778
779 format
780 $s = $h->format; # use HTML::FormatText
781 $s = $h->format($formatter);
782
Formats the element as text output.

The optional argument is a reference to a formatter; if it is omitted,
HTML::FormatText is used.
786
787 starttag
788 $start = $h->starttag();
789 $start = $h->starttag($entities);
790
791 Returns a string representing the complete start tag for the element.
792 I.e., leading "<", tag name, attributes, and trailing ">". All values
793 are surrounded with double-quotes, and appropriate characters are
794 encoded. If $entities is omitted or undef, all unsafe characters are
795 encoded as HTML entities. See HTML::Entities for details. If you
796 specify some value for $entities, remember to include the double-quote
797 character in it. (Previous versions of this module would basically
798 behave as if '&">' were specified for $entities.) If $entities is an
799 empty string, no entity is escaped.
800
801 starttag_XML
802 $start = $h->starttag_XML();
803
804 Returns a string representing the complete start tag for the element.
805
806 endtag
807 $end = $h->endtag();
808
809 Returns a string representing the complete end tag for this element.
810 I.e., "</", tag name, and ">".
811
812 endtag_XML
813 $end = $h->endtag_XML();
814
815 Returns a string representing the complete end tag for this element.
816 I.e., "</", tag name, and ">".
817
819 These methods all involve some structural aspect of the tree; either
820 they report some aspect of the tree's structure, or they involve
821 traversal down the tree, or walking up the tree.
822
823 is_inside
824 $inside = $h->is_inside('tag', $element, ...);
825
826 Returns true if the $h element is, or is contained anywhere inside an
827 element that is any of the ones listed, or whose tag name is any of the
828 tag names listed. You can use any mix of elements and tag names.
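
For example, with a small hand-built tree (names are illustrative),
tag names and element objects can be mixed in one call:

```perl
use strict;
use warnings;
use HTML::Element;

my $body = HTML::Element->new_from_lol(
    ['body', ['div', ['p', ['b', 'x']]]]
);
my ($div)  = $body->content_list;
my ($para) = $div->content_list;
my ($bold) = $para->content_list;

my $in_div   = $bold->is_inside('div');             # by tag name
my $in_body  = $bold->is_inside($body);             # by element object
my $in_table = $bold->is_inside('table', 'form');   # neither is an ancestor
my $is_self  = $bold->is_inside('b');               # an element counts as inside itself

print "inside a div\n"    if $in_div;
print "not in a table\n"  unless $in_table;
```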
829
830 is_empty
831 $empty = $h->is_empty();
832
833 Returns true if $h has no content, i.e., has no elements or text
834 segments under it. In other words, this returns true if $h is a leaf
835 node, AKA a terminal node. Do not confuse this sense of "empty" with
836 another sense that it can have in SGML/HTML/XML terminology, which
837 means that the element in question is of the type (like HTML's "<hr>",
838 "<br>", "<img>", etc.) that can't have any content.
839
840 That is, a particular "<p>" element may happen to have no content, so
841 $that_p_element->is_empty will be true -- even though the prototypical
842 "<p>" element isn't "empty" (not in the way that the prototypical
843 "<hr>" element is).
844
845 If you think this might make for potentially confusing code, consider
846 simply using the clearer exact equivalent: "not($h->content_list)".
847
848 pindex
849 $index = $h->pindex();
850
851 Return the index of the element in its parent's contents array, such
852 that $h would equal
853
854 $h->parent->content->[$h->pindex]
855 # or
856 ($h->parent->content_list)[$h->pindex]
857
858 assuming $h isn't root. If the element $h is root, then "$h->pindex"
859 returns "undef".
860
861 left
862 $element = $h->left();
863 @elements = $h->left();
864
865 In scalar context: returns the node that's the immediate left sibling
866 of $h. If $h is the leftmost (or only) child of its parent (or has no
867 parent), then this returns undef.
868
869 In list context: returns all the nodes that're the left siblings of $h
870 (starting with the leftmost). If $h is the leftmost (or only) child of
871 its parent (or has no parent), then this returns an empty list.
872
873 (See also "$h->preinsert(LIST)".)
874
875 right
876 $element = $h->right();
877 @elements = $h->right();
878
879 In scalar context: returns the node that's the immediate right sibling
880 of $h. If $h is the rightmost (or only) child of its parent (or has no
881 parent), then this returns "undef".
882
883 In list context: returns all the nodes that're the right siblings of
884 $h, starting with the leftmost. If $h is the rightmost (or only) child
885 of its parent (or has no parent), then this returns an empty list.
886
887 (See also "$h->postinsert(LIST)".)
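
As a rough illustration of pindex, left, and right together, the following sketch builds a small list and inspects sibling relationships. The tree and the variable names are assumptions made up for this example:

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new_from_lol(
    ['ul', ['li', 'one'], ['li', 'two'], ['li', 'three']]
);
my ($first, $second, $third) = $ul->content_list;

my $index      = $second->pindex;            # position in $ul's content
my $left_text  = $second->left->as_text;     # immediate left sibling
my $right_text = $second->right->as_text;    # immediate right sibling

# List context returns every sibling on that side, in document order.
my @all_left  = map { $_->as_text } $third->left;
my @all_right = map { $_->as_text } $first->right;

# Scalar context on the leftmost child yields undef.
my $no_sibling = $first->left;

print "index=$index left=$left_text right=$right_text\n";
print "left of third: @all_left\n";
print "right of first: @all_right\n";
```

Note that siblings can also be plain text segments, so real code may need to check ref() before calling methods on them.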
888
889 address
890 $address = $h->address();
891 $element_or_text = $h->address($address);
892
893 The first form (with no parameter) returns a string representing the
894 location of $h in the tree it is a member of. The address consists of
895 numbers joined by a '.', starting with '0', and followed by the
896 pindexes of the nodes in the tree that are ancestors of $h, starting
897 from the top.
898
899 So if the way to get to a node starting at the root is to go to child 2
900 of the root, then child 10 of that, and then child 0 of that, and then
901 you're there -- then that node's address is "0.2.10.0".
902
903 As a bit of a special case, the address of the root is simply "0".
904
905 I foresee this being used mainly for debugging, but you may find your
906 own uses for it.
907
908 $element_or_text = $h->address($address);
909
910 This form returns the node (whether element or text-segment) at the
911 given address in the tree that $h is a part of. (That is, the address
912 is resolved starting from "$h->root".)
913
914 If there is no node at the given address, this returns "undef".
915
916 You can specify "relative addressing" (i.e., that indexing is supposed
917 to start from $h and not from "$h->root") by having the address start
918 with a period -- e.g., "$h->address(".3.2")" will look at child 3 of
919 $h, and child 2 of that.
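
Both forms of address can be tried on a small tree. This is only a sketch; the tree mirrors the document used near the top of this manual, and the variable names are illustrative:

```perl
use strict;
use warnings;
use HTML::Element;

my $root = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'Stuff']],
        ['body', ['h1', 'I like potatoes!']],
    ]
);
my $body = $root->find_by_tag_name('body');
my $h1   = $root->find_by_tag_name('h1');

my $h1_address = $h1->address;                # root, child 1, child 0
my $title      = $root->address('0.0.0');     # an element
my $title_text = $root->address('0.0.0.0');   # a text segment (plain string)
my $relative   = $body->address('.0');        # child 0 of $body itself
my $missing    = $root->address('0.7');       # no such node

print "h1 lives at $h1_address\n";
print "0.0.0 is a <", $title->tag, "> containing '$title_text'\n";
print "relative .0 from body is a <", $relative->tag, ">\n";
print "0.7 is ", (defined $missing ? 'present' : 'absent'), "\n";
```

Because the resolving form can return either an element or a string, callers should test the result with ref() before treating it as an object.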
920
921 depth
922 $depth = $h->depth();
923
924 Returns a number expressing $h's depth within its tree, i.e., how many
925 steps away it is from the root. If $h has no parent (i.e., is root),
926 its depth is 0.
927
928 root
929 $root = $h->root();
930
931 Returns the element that's the top of $h's tree. If $h is root, this
932 just returns $h. (If you want to test whether $h is the root, instead
933 of asking what its root is, just test "not($h->parent)".)
934
935 lineage
936 @lineage = $h->lineage();
937
938 Returns the list of $h's ancestors, starting with its parent, and then
939 that parent's parent, and so on, up to the root. If $h is root, this
940 returns an empty list.
941
942 If you simply want a count of the number of elements in $h's lineage,
943 use "$h->depth".
944
945 lineage_tag_names
946 @names = $h->lineage_tag_names();
947
948 Returns the list of the tag names of $h's ancestors, starting with its
949 parent, and that parent's parent, and so on, up to the root. If $h is
950 root, this returns an empty list. Example output: "('em', 'td', 'tr',
951 'table', 'body', 'html')"
952
953 Equivalent to:
954
955 map { $_->tag } $h->lineage;
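
The four upward-looking methods (depth, root, lineage, lineage_tag_names) can be compared side by side. A minimal sketch, assuming a hand-built tree; none of the variable names come from the module:

```perl
use strict;
use warnings;
use HTML::Element;

my $root = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'Stuff']],
        ['body', ['h1', 'I like potatoes!']],
    ]
);
my $h1 = $root->find_by_tag_name('h1');

my $depth    = $h1->depth;                 # steps up to the root
my @lineage  = $h1->lineage;               # parent first, root last
my @names    = $h1->lineage_tag_names;     # same order, tag names only
my $top_tag  = $h1->root->tag;
my $is_root  = !$root->parent ? 1 : 0;     # the test suggested under "root"

print "depth=$depth ancestors=@names root=$top_tag\n";
print "lineage count matches depth\n" if @lineage == $depth;
```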
956
957 descendants
958 @descendants = $h->descendants();
959
960 In list context, returns the list of all $h's descendant elements,
961 listed in pre-order (i.e., an element appears before its content-
962 elements). Text segments DO NOT appear in the list. In scalar
963 context, returns a count of all such elements.
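
A short sketch of both calling contexts follows. The tree is an assumption for illustration; note that $root itself and all text segments are left out of the result:

```perl
use strict;
use warnings;
use HTML::Element;

my $root = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'Stuff']],
        ['body', 'loose text', ['h1', 'I like potatoes!']],
    ]
);

# List context: descendant elements in pre-order.
my @tags = map { $_->tag } $root->descendants;

# Scalar context: just the number of descendant elements.
my $count = $root->descendants;

print "descendants: @tags ($count in total)\n";
```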
964
965 descendents
966 This is just an alias to the "descendants" method, for people who can't
967 spell.
968
969 find_by_tag_name
970 @elements = $h->find_by_tag_name('tag', ...);
971 $first_match = $h->find_by_tag_name('tag', ...);
972
973 In list context, returns a list of elements at or under $h that have
974 any of the specified tag names. In scalar context, returns the first
975 (in pre-order traversal of the tree) such element found, or undef if
976 none.
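
The following sketch exercises find_by_tag_name in both contexts. The tree contents and variable names are made up for the example:

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['div',
        ['h2', 'Fruit'], ['p', 'Apples'],
        ['h2', 'Veg'],   ['p', 'Leeks'],
    ]
);

my @headings = $tree->find_by_tag_name('h2');        # list context
my $first_p  = $tree->find_by_tag_name('p');         # scalar: first match
my @mixed    = $tree->find_by_tag_name('h2', 'p');   # several tag names
my $none     = $tree->find_by_tag_name('table');     # nothing found
my @self     = $tree->find_by_tag_name('div');       # "at or under" $tree

print scalar(@headings), " headings; first paragraph: ",
    $first_p->as_text, "\n";
print "mixed order: ", join(', ', map { $_->as_text } @mixed), "\n";
```

The last call shows that the starting element itself is a candidate, not only its descendants.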
977
978 find
979 This is just an alias to "find_by_tag_name". (There was once going to
980 be a whole find_* family of methods, but then "look_down" filled that
981 niche, so there turned out not to be much reason for the verboseness of
982 the name "find_by_tag_name".)
983
984 find_by_attribute
985 @elements = $h->find_by_attribute('attribute', 'value');
986 $first_match = $h->find_by_attribute('attribute', 'value');
987
988 In a list context, returns a list of elements at or under $h that have
989 the specified attribute, and have the given value for that attribute.
990 In a scalar context, returns the first (in pre-order traversal of the
991 tree) such element found, or undef if none.
992
993 This method is deprecated in favor of the more expressive "look_down"
994 method, which new code should use instead.
995
996 look_down
997 @elements = $h->look_down( ...criteria... );
998 $first_match = $h->look_down( ...criteria... );
999
1000 This starts at $h and looks thru its element descendants (in pre-
1001 order), looking for elements matching the criteria you specify. In
1002 list context, returns all elements that match all the given criteria;
1003 in scalar context, returns the first such element (or undef, if nothing
1004 matched).
1005
1006 There are three kinds of criteria you can specify:
1007
1008 (attr_name, attr_value)
1009 This means you're looking for an element with that value for that
1010 attribute. Example: "alt", "pix!". Consider that you can search
1011 on internal attribute values too: "_tag", "p".
1012
1013 (attr_name, qr/.../)
1014 This means you're looking for an element whose value for that
1015 attribute matches the specified Regexp object.
1016
1017 a coderef
1018 This means you're looking for elements where
1019 coderef->(each_element) returns true. Example:
1020
1021 my @wide_pix_images = $h->look_down(
1022 _tag => "img",
1023 alt => "pix!",
1024 sub { ($_[0]->attr('width') || 0) > 350 }
1025 );
1026
1027 Note that "(attr_name, attr_value)" and "(attr_name, qr/.../)" criteria
1028 are almost always faster than coderef criteria, so should presumably be
1029 put before them in your list of criteria. That is, in the example
1030 above, the sub ref is called only for elements that have already passed
1031 the criteria of having a "_tag" attribute with value "img", and an
1032 "alt" attribute with value "pix!". If the coderef were first, it would
1033 be called on every element, and then what elements pass that criterion
1034 (i.e., elements for which the coderef returned true) would be checked
1035 for their "_tag" and "alt" attributes.
1036
1037 Note that comparison of string attribute-values against the string
1038 value in "(attr_name, attr_value)" is case-INsensitive! A criterion of
1039 "('align', 'right')" will match an element whose "align" value is
1040 "RIGHT", or "right" or "rIGhT", etc.
1041
1042 Note also that "look_down" considers "" (empty-string) and undef to be
1043 different things, in attribute values. So this:
1044
1045 $h->look_down("alt", "")
1046
1047 will find elements with an "alt" attribute, but where the value for the
1048 "alt" attribute is "". But this:
1049
1050 $h->look_down("alt", undef)
1051
1052 is the same as:
1053
1054 $h->look_down(sub { !defined($_[0]->attr('alt')) } )
1055
1056 That is, it finds elements that do not have an "alt" attribute at all
1057 (or that do have an "alt" attribute, but with a value of undef -- which
1058 is not normally possible).
1059
1060 Note that when you give several criteria, this is taken to mean you're
1061 looking for elements that match all your criteria, not just any of
1062 them. In other words, there is an implicit "and", not an "or". So if
1063 you wanted to express that you wanted to find elements with a "name"
1064 attribute with the value "foo" or with an "id" attribute with the value
1065 "baz", you'd have to do it like:
1066
1067 @them = $h->look_down(
1068 sub {
1069 # the lcs are to fold case
1070 lc($_[0]->attr('name') || '') eq 'foo'
1071 or lc($_[0]->attr('id') || '') eq 'baz'
1072 }
1073 );
1074
1075 Coderef criteria are more expressive than "(attr_name, attr_value)" and
1076 "(attr_name, qr/.../)" criteria, and all "(attr_name, attr_value)" and
1077 "(attr_name, qr/.../)" criteria could be expressed in terms of
1078 coderefs. However, "(attr_name, attr_value)" and "(attr_name,
1079 qr/.../)" criteria are a convenient shorthand. (In fact, "look_down"
1080 itself is basically "shorthand" too, since anything you can do with
1081 "look_down" you could do by traversing the tree, either with the
1082 "traverse" method or with a routine of your own. However, "look_down"
1083 often makes for very concise and clear code.)
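
The three kinds of criteria, and the difference between "" and undef, can be sketched on a small hand-built tree. The URLs, file names, and variable names below are placeholders invented for this example:

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['div',
        ['a',   { href => 'https://example.org/' }, 'secure'],
        ['a',   { href => 'http://example.org/'  }, 'plain'],
        ['img', { src => 'logo.png',  alt => '' }],
        ['img', { src => 'photo.png', width => 400 }],
    ]
);

# (attr_name, qr/.../): href must match the pattern.
my @secure = $tree->look_down(_tag => 'a', href => qr/^https:/);

# "" and undef are different: present-but-empty versus absent.
my @empty_alt = $tree->look_down(_tag => 'img', alt => '');
my @no_alt    = $tree->look_down(_tag => 'img', alt => undef);

# Cheap criteria first, coderef last; guard against a missing attribute.
my @wide = $tree->look_down(
    _tag => 'img',
    sub { ($_[0]->attr('width') || 0) > 350 },
);

printf "%d secure link(s), %d empty alt, %d missing alt, %d wide\n",
    scalar(@secure), scalar(@empty_alt), scalar(@no_alt), scalar(@wide);
```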
1084
1085 look_up
1086 @elements = $h->look_up( ...criteria... );
1087 $first_match = $h->look_up( ...criteria... );
1088
1089 This is identical to "$h->look_down", except that whereas
1090 "$h->look_down" basically scans over the list:
1091
1092 ($h, $h->descendants)
1093
1094 "$h->look_up" instead scans over the list
1095
1096 ($h, $h->lineage)
1097
1098 So, for example, this returns all ancestors of $h (possibly including
1099 $h itself) that are "<td>" elements with an "align" attribute with a
1100 value of "right" (or "RIGHT", etc.):
1101
1102 $h->look_up("_tag", "td", "align", "right");
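
A minimal sketch of look_up, assuming a small hand-built table; the variable names are illustrative. The root is kept in $table for the whole example so that the tree stays alive:

```perl
use strict;
use warnings;
use HTML::Element;

my $table = HTML::Element->new_from_lol(
    ['table',
        ['tr',
            ['td', { align => 'right' }, ['em', 'total']],
        ],
    ]
);
my $em = $table->find_by_tag_name('em');

# Scalar context: the nearest match, scanning $em first and then upward.
my $cell       = $em->look_up(_tag => 'td');
my $self_match = $em->look_up(_tag => 'em');

# A coderef that accepts everything shows the scan order.
my @chain = map { $_->tag } $em->look_up(sub { 1 });

print "enclosing cell is aligned ", $cell->attr('align'), "\n";
print "scan order: @chain\n";
```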
1103
1104 traverse
1105 $h->traverse(...options...)
1106
1107 Lengthy discussion of HTML::Element's unnecessary and confusing
1108 "traverse" method has been moved to a separate file:
1109 HTML::Element::traverse
1110
1111 attr_get_i
1112 @values = $h->attr_get_i('attribute');
1113 $first_value = $h->attr_get_i('attribute');
1114
1115 In list context, returns a list consisting of the values of the given
1116 attribute for $h and for all its ancestors starting from $h and working
1117 its way up. Nodes with no such attribute are skipped. ("attr_get_i"
1118 stands for "attribute get, with inheritance".) In scalar context,
1119 returns the first such value, or undef if none.
1120
1121 Consider a document consisting of:
1122
1123 <html lang='i-klingon'>
1124 <head><title>Pati Pata</title></head>
1125 <body>
1126 <h1 lang='la'>Stuff</h1>
1127 <p lang='es-MX' align='center'>
1128 Foo bar baz <cite>Quux</cite>.
1129 </p>
1130 <p>Hooboy.</p>
1131 </body>
1132 </html>
1133
1134 If $h is the "<cite>" element, "$h->attr_get_i("lang")" in list context
1135 will return the list "('es-MX', 'i-klingon')". In scalar context, it
1136 will return the value 'es-MX'.
1137
1138 If you call with multiple attribute names...
1139
1140 @values = $h->attr_get_i('a1', 'a2', 'a3');
1141 $first_value = $h->attr_get_i('a1', 'a2', 'a3');
1142
1143 ...in list context, this will return a list consisting of the values of
1144 these attributes which exist in $h and its ancestors. In scalar
1145 context, this returns the first value (i.e., the value of the first
1146 existing attribute from the first element that has any of the
1147 attributes listed). So, in the above example,
1148
1149 $h->attr_get_i('lang', 'align');
1150
1151 will return:
1152
1153 ('es-MX', 'center', 'i-klingon') # in list context
1154 or
1155 'es-MX' # in scalar context.
1156
1157 But note that this:
1158
1159 $h->attr_get_i('align', 'lang');
1160
1161 will return:
1162
1163 ('center', 'es-MX', 'i-klingon') # in list context
1164 or
1165 'center' # in scalar context.
1166
1167 tagname_map
1168 $hash_ref = $h->tagname_map();
1169
1170 Scans across $h and all its descendants, and makes a hash (a reference
1171 to which is returned) where each entry consists of a key that's a tag
1172 name, and a value that's a reference to a list of all elements that
1173 have that tag name. I.e., this method returns:
1174
1175 {
1176 # Across $h and all descendants...
1177 'a' => [ ...list of all <a> elements... ],
1178 'em' => [ ...list of all <em> elements... ],
1179 'img' => [ ...list of all <img> elements... ],
1180 }
1181
1182 (There are entries in the hash for only those tagnames that occur
1183 at/under $h -- so if there's no "<img>" elements, there'll be no "img"
1184 entry in the returned hashref.)
1185
1186 Example usage:
1187
1188 my $map_r = $h->tagname_map();
1189 my @heading_tags = sort grep m/^h\d$/s, keys %$map_r;
1190 if(@heading_tags) {
1191 print "Heading levels used: @heading_tags\n";
1192 } else {
1193 print "No headings.\n"
1194 }
1195
1196 extract_links
1197 $links_array_ref = $h->extract_links();
1198 $links_array_ref = $h->extract_links(@wantedTypes);
1199
1200 Returns links found by traversing the element and all of its children
1201 and looking for attributes (like "href" in an "<a>" element, or "src"
1202 in an "<img>" element) whose values represent links. The return value
1203 is a reference to an array. Each element of the array is a reference to
1204 an array with four items: the link-value, the element that has the
1205 attribute with that link-value, and the name of that attribute, and the
1206 tagname of that element. (Example: "['http://www.suck.com/', $elem_obj,
1207 'href', 'a']".) You may or may not end up using the
1208 element itself -- for some purposes, you may use only the link value.
1209
1210 You might specify that you want to extract links from just some kinds
1211 of elements (instead of the default, which is to extract links from all
1212 the kinds of elements known to have attributes whose values represent
1213 links). For instance, if you want to extract links from only "<a>" and
1214 "<img>" elements, you could code it like this:
1215
1216 for (@{ $h->extract_links('a', 'img') }) {
1217 my($link, $element, $attr, $tag) = @$_;
1218 print
1219 "Hey, there's a $tag that links to ",
1220 $link, ", in its $attr attribute, at ",
1221 $element->address(), ".\n";
1222 }
1223
1224 simplify_pres
1225 $h->simplify_pres();
1226
1227 In text bits under PRE elements that are at/under $h, this routine
1228 nativizes all newlines, and expands all tabs.
1229
1230 That is, if you read a file with lines delimited by "\cm\cj"'s, the
1231 text under PRE areas will have "\cm\cj"'s instead of "\n"'s. Calling
1232 "$h->simplify_pres" on such a tree will turn "\cm\cj"'s into "\n"'s.
1233
1234 Tabs are expanded to however many spaces it takes to get to the next
1235 8th column -- the usual way of expanding them.
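
As a rough sketch of the effect, the following builds a PRE element by hand with a CR-LF line ending and a tab, then simplifies it. The sample text is an assumption for illustration:

```perl
use strict;
use warnings;
use HTML::Element;

my $pre = HTML::Element->new('pre');

# One "\cm\cj" line break and one tab, as might come from a DOS-style file.
$pre->push_content("one\cm\cj" . "a\tb");

$pre->simplify_pres;

# The text segment is rewritten in place.
my ($text) = $pre->content_list;
my @lines  = split /\n/, $text;

printf "%d line(s); second line is %d characters wide\n",
    scalar(@lines), length $lines[1];
```

After the call, the tab in the second line has been replaced by spaces that pad it out to the next multiple of eight columns.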
1236
1237 same_as
1238 $equal = $h->same_as($i)
1239
1240 Returns true if $h and $i are both elements representing the same tree
1241 of elements, each with the same tag name, with the same explicit
1242 attributes (i.e., not counting attributes whose names start with "_"),
1243 and with the same content (textual, comments, etc.).
1244
1245 Sameness of descendant elements is tested, recursively, with
1246 "$child1->same_as($child_2)", and sameness of text segments is tested
1247 with "$segment1 eq $segment2".
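
The comparison rules can be sketched as follows. The helper make_para and the "_scratch" attribute are hypothetical names used only for this example:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical helper: builds the same small tree with a given class.
sub make_para {
    my ($class) = @_;
    return HTML::Element->new_from_lol(
        ['p', { class => $class }, 'Hello ', ['b', 'world']]
    );
}

my $one   = make_para('note');
my $two   = make_para('note');
my $three = make_para('warning');

# Attributes whose names start with "_" do not take part in the comparison.
$two->attr('_scratch', 'internal bookkeeping');

my $same      = $one->same_as($two)   ? 'same' : 'different';
my $different = $one->same_as($three) ? 'same' : 'different';

print "one vs two: $same\n";
print "one vs three: $different\n";
```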
1248
1249 new_from_lol
1250 $h = HTML::Element->new_from_lol($array_ref);
1251 @elements = HTML::Element->new_from_lol($array_ref, ...);
1252
1253 Recursively constructs a tree of nodes, based on the (non-cyclic) data
1254 structure represented by each $array_ref, where that is a reference to
1255 an array of arrays (of arrays (of arrays (etc.))).
1256
1257 In each arrayref in that structure, different kinds of values are
1258 treated as follows:
1259
1260 · Arrayrefs
1261
1262 Arrayrefs are considered to designate a sub-tree representing
1263 children for the node constructed from the current arrayref.
1264
1265 · Hashrefs
1266
1267 Hashrefs are considered to contain attribute-value pairs to add to
1268 the element to be constructed from the current arrayref.
1269
1270 · Text segments
1271
1272 Text segments at the start of any arrayref will be considered to
1273 specify the name of the element to be constructed from the current
1274 arrayref; all other text segments will be considered to specify
1275 text segments as children for the current arrayref.
1276
1277 · Elements
1278
1279 Existing element objects are either inserted into the treelet
1280 constructed, or clones of them are. That is, when the lol-tree is
1281 being traversed and elements constructed based on what's in it, if an
1282 existing element object is found, if it has no parent, then it is
1283 added directly to the treelet constructed; but if it has a parent,
1284 then "$that_node->clone" is added to the treelet at the appropriate
1285 place.
1286
1287 An example will hopefully make this more obvious:
1288
1289 my $h = HTML::Element->new_from_lol(
1290 ['html',
1291 ['head',
1292 [ 'title', 'I like stuff!' ],
1293 ],
1294 ['body',
1295 {'lang', 'en-JP', _implicit => 1},
1296 'stuff',
1297 ['p', 'um, p < 4!', {'class' => 'par123'}],
1298 ['div', {foo => 'bar'}, '123'],
1299 ]
1300 ]
1301 );
1302 $h->dump;
1303
1304 Will print this:
1305
1306 <html> @0
1307 <head> @0.0
1308 <title> @0.0.0
1309 "I like stuff!"
1310 <body lang="en-JP"> @0.1 (IMPLICIT)
1311 "stuff"
1312 <p class="par123"> @0.1.1
1313 "um, p < 4!"
1314 <div foo="bar"> @0.1.2
1315 "123"
1316
1317 And printing $h->as_HTML will give something like:
1318
1319 <html><head><title>I like stuff!</title></head>
1320 <body lang="en-JP">stuff<p class="par123">um, p &lt; 4!
1321 <div foo="bar">123</div></body></html>
1322
1323 You can even do fancy things with "map":
1324
1325 $body->push_content(
1326 # push_content implicitly calls new_from_lol on arrayrefs...
1327 ['br'],
1328 ['blockquote',
1329 ['h2', 'Pictures!'],
1330 map ['p', $_],
1331 $body2->look_down("_tag", "img"),
1332 # images, to be copied from that other tree.
1333 ],
1334 # and more stuff:
1335 ['ul',
1336 map ['li', ['a', {'href'=>"$_.png"}, $_ ] ],
1337 qw(Peaches Apples Pears Mangos)
1338 ],
1339 );
1340
1341 In scalar context, you must supply exactly one arrayref. In list
1342 context, you can pass a list of arrayrefs, and new_from_lol will return
1343 a list of elements, one for each arrayref.
1344
1345 @elements = HTML::Element->new_from_lol(
1346 ['hr'],
1347 ['p', 'And there, on the door, was a hook!'],
1348 );
1349 # constructs two elements.
1350
1351 objectify_text
1352 $h->objectify_text();
1353
1354 This turns any text nodes under $h from mere text segments (strings)
1355 into real objects, pseudo-elements with a tag-name of "~text", and the
1356 actual text content in an attribute called "text". (For a discussion
1357 of pseudo-elements, see the "tag" method, far above.) This method is
1358 provided because, for some purposes, it is convenient or necessary to
1359 be able, for a given text node, to ask what element is its parent; and
1360 clearly this is not possible if a node is just a text string.
1361
1362 Note that these "~text" objects are not recognized as text nodes by
1363 methods like "as_text". Presumably you will want to call
1364 "$h->objectify_text", perform whatever task that you needed that for,
1365 and then call "$h->deobjectify_text" before calling anything like
1366 "$h->as_text".
1367
1368 deobjectify_text
1369 $h->deobjectify_text();
1370
1371 This undoes the effect of "$h->objectify_text". That is, it takes any
1372 "~text" pseudo-elements in the tree at/under $h, and deletes each one,
1373 replacing each with the content of its "text" attribute.
1374
1375 Note that if $h itself is a "~text" pseudo-element, it will be
1376 destroyed -- a condition you may need to treat specially in your
1377 calling code (since it means you can't very well do anything with $h
1378 after that). So that you can detect that condition, if $h is itself a
1379 "~text" pseudo-element, then this method returns the value of the
1380 "text" attribute, which should be a defined value; in all other cases,
1381 it returns undef.
1382
1383 (This method assumes that no "~text" pseudo-element has any children.)
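
A round trip through objectify_text and deobjectify_text might look like the sketch below. The sample paragraph and variable names are illustrative; the example deliberately avoids keeping references to the "~text" objects after they are removed:

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(['p', 'Hello, ', ['b', 'world'], '!']);

$p->objectify_text;

# Every child is now an object, so it can be asked about its parent.
my @tags         = map { $_->tag } $p->content_list;
my $first_text   = ($p->content_list)[0]->attr('text');
my $first_parent = ($p->content_list)[0]->parent->tag;

# Turn the pseudo-elements back into strings before calling as_text.
$p->deobjectify_text;
my $plain = $p->as_text;

print "children while objectified: @tags\n";
print "first text '$first_text' sits under <$first_parent>\n";
print "restored text: $plain\n";
```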
1384
1385 number_lists
1386 $h->number_lists();
1387
1388 For every UL, OL, DIR, and MENU element at/under $h, this sets a
1389 "_bullet" attribute for every child LI element. For LI children of an
1390 OL, the "_bullet" attribute's value will be something like "4.", "d.",
1391 "D.", "IV.", or "iv.", depending on the OL element's "type" attribute.
1392 LI children of a UL, DIR, or MENU get their "_bullet" attribute set to
1393 "*". There should be no other LIs (i.e., except as children of OL, UL,
1394 DIR, or MENU elements), and if there are, they are unaffected.
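
A minimal sketch, assuming an OL with no "type" attribute (so plain decimal numbering) next to a UL; the tree is made up for the example:

```perl
use strict;
use warnings;
use HTML::Element;

my $doc = HTML::Element->new_from_lol(
    ['div',
        ['ol', ['li', 'first'], ['li', 'second']],
        ['ul', ['li', 'loose item']],
    ]
);

$doc->number_lists;

# "_bullet" is an internal attribute, readable through attr().
my @bullets = map { $_->attr('_bullet') } $doc->find_by_tag_name('li');

for my $li ($doc->find_by_tag_name('li')) {
    printf "%-3s %s\n", $li->attr('_bullet'), $li->as_text;
}
```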
1395
1396 has_insane_linkage
1397 $h->has_insane_linkage
1398
1399 This method is for testing whether this element or the elements under
1400 it have linkage attributes (_parent and _content) whose values are
1401 deeply aberrant: if there are undefs in a content list; if an element
1402 appears in the content lists of more than one element; if the _parent
1403 attribute of an element doesn't match its actual parent; or if an
1404 element appears as its own descendant (i.e., if there is a cyclicity in
1405 the tree).
1406
1407 This returns an empty list (or false, in scalar context) if the subtree's
1408 linkage attributes are sane; otherwise it returns two items (or true, in
1409 scalar context): the element where the error occurred, and a string
1410 describing the error.
1411
1412 This method is provided mainly for debugging and troubleshooting --
1413 it should be quite impossible for any document constructed via
1414 HTML::TreeBuilder to parse into a non-sane tree (since it's not the
1415 content of the tree per se that's in question, but whether the tree in
1416 memory was properly constructed); and it should be impossible for you
1417 to produce an insane tree just thru reasonable use of normal documented
1418 structure-modifying methods. But if you're constructing your own
1419 trees, and your program is going into infinite loops during calls to
1420 traverse() or any of the secondary structural methods, as part of
1421 debugging, consider calling "has_insane_linkage" on the tree.
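
A debugging check along these lines could be used; this is only a sketch, and the tree here is a trivially sane one built with new_from_lol:

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(['div', ['p', 'text']]);

# List context: empty when sane, otherwise (element, description).
my @problem = $tree->has_insane_linkage;

if (@problem) {
    my ($where, $why) = @problem;
    warn "Bad linkage at <", $where->tag, ">: $why\n";
}
else {
    print "Linkage looks sane.\n";
}
```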
1422
1423 element_class
1424 $classname = $h->element_class();
1425
1426 This method returns the class which will be used for new elements. It
1427 defaults to HTML::Element, but can be overridden by subclassing or
1428 esoteric means best left to those who will read the source and then
1429 not complain when those esoteric means change. (Just subclass.)
1430
1432 Use_Weak_Refs
1433 $enabled = HTML::Element->Use_Weak_Refs;
1434 HTML::Element->Use_Weak_Refs( $enabled );
1435
1436 This method allows you to check whether weak reference support is
1437 enabled, and to enable or disable it. For details, see "Weak
1438 References". $enabled is true if weak references are enabled.
1439
1440 You should not switch this in the middle of your program, and you
1441 probably shouldn't use it at all. Existing trees are not affected by
1442 this method (until you start modifying nodes in them).
1443
1444 Throws an exception if you attempt to enable weak references and your
1445 Perl or Scalar::Util does not support them.
1446
1447 Disabling weak reference support is deprecated.
1448
1450 Version
1451 This subroutine is deprecated. Please use the standard VERSION method
1452 (e.g. "HTML::Element->VERSION") instead.
1453
1454 ABORT OK PRUNE PRUNE_SOFTLY PRUNE_UP
1455 Constants for signalling back to the traverser
1456
1458 * If you want to free the memory associated with a tree built of
1459 HTML::Element nodes, and you have disabled weak references, then you
1460 will have to delete it explicitly using the "delete" method. See "Weak
1461 References".
1462
1463 * There's almost nothing to stop you from making a "tree" with
1464 cyclicities (loops) in it, which could, for example, make the traverse
1465 method go into an infinite loop. So don't make cyclicities! (If all
1466 you're doing is parsing HTML files, and looking at the resulting trees,
1467 this will never be a problem for you.)
1468
1469 * There's no way to represent comments or processing directives in a
1470 tree with HTML::Elements. Not yet, at least.
1471
1472 * There's (currently) nothing to stop you from using an undefined value
1473 as a text segment. If you're running under "perl -w", however, this
1474 may make HTML::Element's code produce a slew of warnings.
1475
1477 You are welcome to derive subclasses from HTML::Element, but you should
1478 be aware that the code in HTML::Element makes certain assumptions about
1479 elements (and I'm using "element" to mean ONLY an object of class
1480 HTML::Element, or of a subclass of HTML::Element):
1481
1482 * The value of an element's _parent attribute must either be undef or
1483 otherwise false, or must be an element.
1484
1485 * The value of an element's _content attribute must either be undef or
1486 otherwise false, or a reference to an (unblessed) array. The array may
1487 be empty; but if it has items, they must ALL be either mere strings
1488 (text segments), or elements.
1489
1490 * The value of an element's _tag attribute should, at least, be a
1491 string of printable characters.
1492
1493 Moreover, bear these rules in mind:
1494
1495 * Do not break encapsulation on objects. That is, access their
1496 contents only thru $obj->attr or more specific methods.
1497
1498 * You should think twice before completely overriding any of the
1499 methods that HTML::Element provides. (Overriding with a method that
1500 calls the superclass method is not so bad, though.)
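
The last rule can be illustrated with a small subclass whose override delegates to the superclass. The package name My::Element and its whitespace-collapsing behavior are hypothetical, chosen only for this sketch:

```perl
use strict;
use warnings;

package My::Element;
use parent 'HTML::Element';    # loads HTML::Element and sets up inheritance

# Override as_text, but let the superclass do the real work and only
# post-process its result. No direct access to the object's hash.
sub as_text {
    my ($self, @args) = @_;
    my $text = $self->SUPER::as_text(@args);
    $text =~ s/\s+/ /g;
    return $text;
}

package main;

my $p = My::Element->new('p');
$p->push_content("too    many\n spaces");

my $collapsed = $p->as_text;
print "$collapsed\n";
```

The override touches the element only through documented methods, in keeping with the encapsulation rule above.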
1501
1503 HTML::Tree; HTML::TreeBuilder; HTML::AsSubs; HTML::Tagset; and, for the
1504 morbidly curious, HTML::Element::traverse.
1505
1507 Thanks to Mark-Jason Dominus for a POD suggestion.
1508
1510 Current maintainers:
1511
1512 · Christopher J. Madsen "<perl AT cjmweb.net>"
1513
1514 · Jeff Fearn "<jfearn AT cpan.org>"
1515
1516 Original HTML-Tree author:
1517
1518 · Gisle Aas
1519
1520 Former maintainers:
1521
1522 · Sean M. Burke
1523
1524 · Andy Lester
1525
1526 · Pete Krawczyk "<petek AT cpan.org>"
1527
1528 You can follow or contribute to HTML-Tree's development at
1529 <http://github.com/madsen/HTML-Tree>.
1530
1532 Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy
1533 Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn, 2012 Christopher J.
1534 Madsen.
1535
1536 This library is free software; you can redistribute it and/or modify it
1537 under the same terms as Perl itself.
1538
1539 The programs in this library are distributed in the hope that they will
1540 be useful, but without any warranty; without even the implied warranty
1541 of merchantability or fitness for a particular purpose.
1542
1543
1544
1545perl v5.16.3 2014-06-10 HTML::Element(3)