HTML::Element(3)      User Contributed Perl Documentation     HTML::Element(3)

NAME

       HTML::Element - Class for objects that represent HTML elements

VERSION

       Version 3.23

SYNOPSIS

           use HTML::Element;
           $a = HTML::Element->new('a', href => 'http://www.perl.com/');
           $a->push_content("The Perl Homepage");

           $tag = $a->tag;
           print "$tag starts out as:",  $a->starttag, "\n";
           print "$tag ends as:",  $a->endtag, "\n";
           print "$tag\'s href attribute is: ", $a->attr('href'), "\n";

           $links_r = $a->extract_links();
           print "Hey, I found ", scalar(@$links_r), " links.\n";

           print "And that, as HTML, is: ", $a->as_HTML, "\n";
           $a = $a->delete;

DESCRIPTION

       (This class is part of the HTML::Tree dist.)

       Objects of the HTML::Element class can be used to represent elements of
       HTML document trees.  These objects have attributes, notably attributes
       that designate each element's parent and content.  The content is an
       array of text segments and other HTML::Element objects.  A tree with
       HTML::Element objects as nodes can represent the syntax tree for an HTML
       document.

HOW WE REPRESENT TREES

       Consider this HTML document:

         <html lang='en-US'>
           <head>
             <title>Stuff</title>
             <meta name='author' content='Jojo'>
           </head>
           <body>
             <h1>I like potatoes!</h1>
           </body>
         </html>

       Building a syntax tree out of it makes a tree-structure in memory that
       could be diagrammed as:

                            html (lang='en-US')
                             / \
                           /     \
                         /         \
                       head        body
                      /\               \
                    /    \               \
                  /        \               \
                title     meta              h1
                 |       (name='author',     |
              "Stuff"    content='Jojo')    "I like potatoes!"

       This is the traditional way to diagram a tree, with the "root" at the
       top, and it's this kind of diagram that people have in mind when they
       say, for example, that "the meta element is under the head element
       instead of under the body element".  (The same is also said with
       "inside" instead of "under" -- the use of "inside" makes more sense
       when you're looking at the HTML source.)

       Another way to represent the above tree is with indenting:

         html (attributes: lang='en-US')
           head
             title
               "Stuff"
             meta (attributes: name='author' content='Jojo')
           body
             h1
               "I like potatoes!"

       Incidentally, diagramming with indenting works much better for very
       large trees, and is easier for a program to generate.  The
       "$tree->dump" method uses indentation just that way.

       However you diagram the tree, it's stored the same in memory -- it's a
       network of objects, each of which has attributes like so:

         element #1:  _tag: 'html'
                      _parent: none
                      _content: [element #2, element #5]
                      lang: 'en-US'

         element #2:  _tag: 'head'
                      _parent: element #1
                      _content: [element #3, element #4]

         element #3:  _tag: 'title'
                      _parent: element #2
                      _content: [text segment "Stuff"]

         element #4:  _tag: 'meta'
                      _parent: element #2
                      _content: none
                      name: 'author'
                      content: 'Jojo'

         element #5:  _tag: 'body'
                      _parent: element #1
                      _content: [element #6]

         element #6:  _tag: 'h1'
                      _parent: element #5
                      _content: [text segment "I like potatoes!"]

       The "treeness" of the tree-structure that these elements comprise is
       not an aspect of any particular object, but is emergent from the
       relatedness attributes (_parent and _content) of these element-objects
       and from how you use them to get from element to element.

       While you could access the content of a tree by writing code that says
       "access the 'src' attribute of the root's first child's seventh child's
       third child", you're more likely to have to scan the contents of a
       tree, looking for whatever nodes, or kinds of nodes, you want to do
       something with.  The most straightforward way to look over a tree is to
       "traverse" it.  An HTML::Element method ("$h->traverse") is provided for
       this purpose, and several other HTML::Element methods are based on it.

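       As a rough sketch of such a traversal, the following builds the small
       example tree with "new_from_lol" and records one entry per node.  The
       callback arguments used here (the node, then a flag that is true on
       the way in) and the convention of returning a true value to keep
       going are assumptions about "$h->traverse"; check them against the
       version you have installed.

```perl
use strict;
use warnings;
use HTML::Element;

# Build the example tree from nested array refs.
my $tree = HTML::Element->new_from_lol(
    ['html', { lang => 'en-US' },
        ['head',
            ['title', 'Stuff'],
            ['meta', { name => 'author', content => 'Jojo' }],
        ],
        ['body',
            ['h1', 'I like potatoes!'],
        ],
    ]
);

my @visited;
$tree->traverse(sub {
    my ($node, $start) = @_;
    return 1 unless $start;                  # ignore the call made on the way out
    push @visited, ref $node ? $node->tag    # element: record its tag name
                             : qq{"$node"};  # text segment: record the text
    return 1;                                # assumed to mean "keep traversing"
});

print join(', ', @visited), "\n";
$tree->delete;
```

       Elements are distinguished from text segments with "ref", because a
       text segment is a plain string rather than an object.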
       (For everything you ever wanted to know about trees, and then some, see
       Niklaus Wirth's Algorithms + Data Structures = Programs or Donald
       Knuth's The Art of Computer Programming, Volume 1.)

BASIC METHODS

       $h = HTML::Element->new('tag', 'attrname' => 'value', ... )

       This constructor method returns a new HTML::Element object.  The tag
       name is a required argument; it will be forced to lowercase.
       Optionally, you can specify other initial attributes at object creation
       time.

       $h->attr('attr') or $h->attr('attr', 'value')

       Returns (optionally sets) the value of the given attribute of $h.  The
       attribute name (but not the value, if provided) is forced to lowercase.
       If trying to read the value of an attribute not present for this
       element, the return value is undef.  If setting a new value, the old
       value of that attribute is returned.

       If methods are provided for accessing an attribute (like "$h->tag" for
       "_tag", "$h->content_list", etc. below), use those instead of calling
       "$h->attr", whether for reading or setting.

       Note that setting an attribute to "undef" (as opposed to "", the empty
       string) actually deletes the attribute.

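       The three behaviours of "attr" described above (reading, setting with
       the old value returned, and deleting via undef) can be sketched as
       follows.  The variable names are only illustrative.

```perl
use strict;
use warnings;
use HTML::Element;

my $link = HTML::Element->new('a', href => 'http://www.perl.com/');

# Reading: the attribute name is folded to lowercase, so 'HREF' finds 'href'.
my $href = $link->attr('HREF');

# Setting: the previous value comes back as the return value.
my $old = $link->attr('href', 'http://www.cpan.org/');

# Deleting: passing undef removes the attribute, so a later read gives undef.
$link->attr('href', undef);
my $gone = !defined $link->attr('href');

print "first value: $href\n";
print "value before the change: $old\n";
print "href has been deleted\n" if $gone;
```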
       $h->tag() or $h->tag('tagname')

       Returns (optionally sets) the tag name (also known as the generic
       identifier) for the element $h.  In setting, the tag name is always
       converted to lower case.

       There are four kinds of "pseudo-elements" that show up as HTML::Element
       objects:

       Comment pseudo-elements
           These are element objects with a "$h->tag" value of "~comment", and
           the content of the comment is stored in the "text" attribute
           ("$h->attr("text")").  For example, parsing this code with
           HTML::TreeBuilder...

             <!-- I like Pie.
                Pie is good
             -->

           produces an HTML::Element object with these attributes:

             "_tag",
             "~comment",
             "text",
             " I like Pie.\n     Pie is good\n  "

       Declaration pseudo-elements
           Declarations (rarely encountered) are represented as HTML::Element
           objects with a tag name of "~declaration", and content in the
           "text" attribute.  For example, this:

             <!DOCTYPE foo>

           produces an element whose attributes include:

             "_tag", "~declaration", "text", "DOCTYPE foo"

       Processing instruction pseudo-elements
           PIs (rarely encountered) are represented as HTML::Element objects
           with a tag name of "~pi", and content in the "text" attribute.  For
           example, this:

             <?stuff foo?>

           produces an element whose attributes include:

             "_tag", "~pi", "text", "stuff foo?"

           (assuming a recent version of HTML::Parser)

       ~literal pseudo-elements
           These objects are not currently produced by HTML::TreeBuilder, but
           can be used to represent a "super-literal" -- i.e., a literal you
           want to be immune from escaping.  (Yes, I just made that term up.)

           That is, this is useful if you want to insert code into a tree that
           you plan to dump out with "as_HTML", where you want, for some
           reason, to suppress "as_HTML"'s normal behavior of amp-quoting text
           segments.

           For example, this:

             my $literal = HTML::Element->new('~literal',
               'text' => 'x < 4 & y > 7'
             );
             my $span = HTML::Element->new('span');
             $span->push_content($literal);
             print $span->as_HTML;

           prints this:

             <span>x < 4 & y > 7</span>

           Whereas this:

             my $span = HTML::Element->new('span');
             $span->push_content('x < 4 & y > 7');
               # normal text segment
             print $span->as_HTML;

           prints this:

             <span>x &lt; 4 &amp; y &gt; 7</span>

           Unless you're inserting lots of pre-cooked code into existing
           trees, and dumping them out again, it's not likely that you'll find
           "~literal" pseudo-elements useful.

       $h->parent() or $h->parent($new_parent)

       Returns (optionally sets) the parent (aka "container") for this
       element.  The parent should either be undef, or should be another
       element.

       You should not use this to directly set the parent of an element.
       Instead use any of the other methods under "Structure-Modifying
       Methods", below.

       Note that not($h->parent) is a simple test for whether $h is the root
       of its subtree.

       $h->content_list()

       Returns a list of the child nodes of this element -- i.e., what nodes
       (elements or text segments) are inside/under this element.  (Note that
       this may be an empty list.)

       In a scalar context, this returns the count of the items, as you may
       expect.

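       A minimal sketch of the two calling contexts of "content_list",
       together with the parent link of each child.  The element and variable
       names are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new('ul');
$ul->push_content(['li', 'Peaches'], 'loose text', ['li', 'Pears']);

# Scalar context: the number of child nodes.
my $count = $ul->content_list;

# List context: the child nodes themselves, elements and text alike.
my @kinds;
for my $child ($ul->content_list) {
    if (ref $child) {
        # An element; its parent link points back at $ul.
        push @kinds, $child->tag . ' under ' . $child->parent->tag;
    }
    else {
        # A text segment is a plain string and has no parent method.
        push @kinds, 'text';
    }
}

print "$count children: ", join(', ', @kinds), "\n";
```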
       $h->content()

       This somewhat deprecated method returns the content of this element;
       but unlike content_list, this returns either undef (which you should
       understand to mean no content), or a reference to the array of content
       items, each of which is either a text segment (a string, i.e., a
       defined non-reference scalar value), or an HTML::Element object.  Note
       that even if an arrayref is returned, it may be a reference to an empty
       array.

       While older code should feel free to continue to use "$h->content", new
       code should use "$h->content_list" in almost all conceivable cases.  It
       is my experience that in most cases this leads to simpler code anyway,
       since it means one can say:

           @children = $h->content_list;

       instead of the inelegant:

           @children = @{$h->content || []};

       If you do use "$h->content" (or "$h->content_array_ref"), you should
       not use the reference returned by it (assuming it returned a reference,
       and not undef) to directly set or change the content of an element or
       text segment!  Instead use content_refs_list or any of the other
       methods under "Structure-Modifying Methods", below.

       $h->content_array_ref()

       This is like "content" (with all its caveats and deprecations) except
       that it is guaranteed to return an array reference.  That is, if the
       given node has no "_content" attribute, the "content" method would
       return undef, but "content_array_ref" would set the given node's
       "_content" value to "[]" (a reference to a new, empty array), and
       return that.

       $h->content_refs_list

       This returns a list of scalar references to each element of $h's
       content list.  This is useful in case you want to in-place edit any
       large text segments without having to get a copy of the current value
       of that segment, modify that copy, then use "splice_content" to
       replace the old with the new.  Instead, here you can in-place edit:

           foreach my $item_r ($h->content_refs_list) {
               next if ref $$item_r;
               $$item_r =~ s/honour/honor/g;
           }

       You could currently achieve the same effect with:

           foreach my $item (@{ $h->content_array_ref }) {
               # deprecated!
               next if ref $item;
               $item =~ s/honour/honor/g;
           }

       ...except that using the return value of "$h->content" or
       "$h->content_array_ref" to do that is deprecated, and just might stop
       working in the future.

       $h->implicit() or $h->implicit($bool)

       Returns (optionally sets) the "_implicit" attribute.  This attribute is
       a flag that's used for indicating that the element was not originally
       present in the source, but was added to the parse tree (by
       HTML::TreeBuilder, for example) in order to conform to the rules of
       HTML structure.

       $h->pos() or $h->pos($element)

       Returns (and optionally sets) the "_pos" (for "current position")
       pointer of $h.  This attribute is a pointer used during some parsing
       operations, whose value is whatever HTML::Element element at or under
       $h is currently "open", where "$h->insert_element(NEW)" will actually
       insert a new element.

       (This has nothing to do with the Perl function called "pos", for
       controlling where regular expression matching starts.)

       If you set "$h->pos($element)", be sure that $element is either $h, or
       an element under $h.

       If you've been modifying the tree under $h and are no longer sure
       "$h->pos" is valid, you can enforce validity with:

           $h->pos(undef) unless $h->pos->is_inside($h);

       $h->all_attr()

       Returns all this element's attributes and values, as key-value pairs.
       This will include any "internal" attributes (i.e., ones not present in
       the original element, and which will not be represented if/when you
       call "$h->as_HTML").  Internal attributes are distinguished by the fact
       that the first character of their key (not value! key!) is an
       underscore ("_").

       Example output of "$h->all_attr()": '_parent', [object_value],
       '_tag', 'em', 'lang', 'en-US', '_content', [array-ref value].

       $h->all_attr_names()

       Like all_attr, but only returns the names of the attributes.

       Example output of "$h->all_attr_names()": '_parent', '_tag', 'lang',
       '_content'.

       $h->all_external_attr()

       Like "all_attr", except that internal attributes are not present.

       $h->all_external_attr_names()

       Like "all_attr_names", except that internal attributes' names are not
       present.

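       To make the internal/external distinction concrete, here is a small
       sketch that separates the two kinds of attribute on a freshly built
       element.  The "img" element and its attribute values are invented for
       the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $img = HTML::Element->new('img', src => 'pie.png', border => 0);

# Key-value pairs load naturally into a hash.
my %all      = $img->all_attr;
my %external = $img->all_external_attr;

# Internal attribute names are the ones that start with an underscore.
my @internal = grep { /^_/ } $img->all_attr_names;

print "external: ", join(', ', sort keys %external), "\n";
print "internal: ", join(', ', sort @internal), "\n";
```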
       $h->id() or $h->id($string)

       Returns (optionally sets to $string) the "id" attribute.
       "$h->id(undef)" deletes the "id" attribute.

       $h->idf() or $h->idf($string)

       Just like the "id" method, except that if you call "$h->idf()" and no
       "id" attribute is defined for this element, then it's set to a
       likely-to-be-unique value, and returned.  (The "f" is for "force".)

STRUCTURE-MODIFYING METHODS

       These methods are provided for modifying the content of trees by adding
       or changing nodes as parents or children of other nodes.

       $h->push_content($element_or_text, ...)

       Adds the specified items to the end of the content list of the element
       $h.  The items of content to be added should each be either a text
       segment (a string), an HTML::Element object, or an arrayref.  Arrayrefs
       are fed thru "$h->new_from_lol(that_arrayref)" to convert them into
       elements, before being added to the content list of $h.  This means you
       can say concise things like:

         $body->push_content(
           ['br'],
           ['ul',
             map ['li', $_], qw(Peaches Apples Pears Mangos)
           ]
         );

       See "new_from_lol" method's documentation, far below, for more
       explanation.

       The push_content method will try to consolidate adjacent text segments
       while adding to the content list.  That's to say, if $h's content_list
       is

         ('foo bar ', $some_node, 'baz!')

       and you call

         $h->push_content('quack?');

       then the resulting content list will be this:

         ('foo bar ', $some_node, 'baz!quack?')

       and not this:

         ('foo bar ', $some_node, 'baz!', 'quack?')

       If the latter is what you want, you'll have to override the feature of
       consolidating text by using splice_content, as in:

         $h->splice_content(scalar($h->content_list),0,'quack?');

       Similarly, if you wanted to add 'Skronk' to the beginning of the
       content list, and you call this:

         $h->unshift_content('Skronk');

       then the resulting content list will be this:

         ('Skronkfoo bar ', $some_node, 'baz!')

       and not this:

         ('Skronk', 'foo bar ', $some_node, 'baz!')

       What you'd do to get the latter is:

         $h->splice_content(0,0,'Skronk');

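       The fragments above can be put together into one runnable sketch that
       shows text being merged by "push_content" and kept separate by
       "splice_content".  The "br" element merely stands in for $some_node.

```perl
use strict;
use warnings;
use HTML::Element;

my $some_node = HTML::Element->new('br');
my $h         = HTML::Element->new('p');
$h->push_content('foo bar ', $some_node, 'baz!');

# push_content appends to the trailing text segment instead of adding an item.
$h->push_content('quack?');
my $after_push = $h->content_list;      # still 3 items

# splice_content at the end of the list adds a separate text segment.
$h->splice_content(scalar($h->content_list), 0, 'moo');
my $after_splice = $h->content_list;    # now 4 items

my @content = $h->content_list;
print "after push: $after_push items, last text so far: $content[2]\n";
print "after splice: $after_splice items, new item: $content[3]\n";
```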
       $h->unshift_content($element_or_text, ...)

       Just like "push_content", but adds to the beginning of the $h element's
       content list.

       The items of content to be added should each be either a text segment
       (a string), an HTML::Element object, or an arrayref (which is fed thru
       "new_from_lol").

       The unshift_content method will try to consolidate adjacent text
       segments while adding to the content list.  See above for a discussion
       of this.

       $h->splice_content($offset, $length, $element_or_text, ...)

       Detaches the elements from $h's list of content-nodes, starting at
       $offset and continuing for $length items, replacing them with the
       elements of the following list, if any.  Returns the elements (if any)
       removed from the content-list.  If $offset is negative, then it starts
       that far from the end of the array, just like Perl's normal "splice"
       function.  If $length and the following list are omitted, removes
       everything from $offset onward.

       The items of content to be added (if any) should each be either a text
       segment (a string), an arrayref (which is fed thru "new_from_lol"), or
       an HTML::Element object that's not already a child of $h.

       $h->detach()

       This unlinks $h from its parent, by setting its "_parent" attribute to
       undef, and by removing it from the content list of its parent (if it
       had one).  The return value is the parent that was detached from (or
       undef, if $h had no parent to start with).  Note that neither $h nor
       its parent is explicitly destroyed.

       $h->detach_content()

       This unlinks all of $h's children from $h, and returns them.  Note that
       these are not explicitly destroyed; for that, you can just use
       $h->delete_content.

       $h->replace_with( $element_or_text, ... )

       This replaces $h in its parent's content list with the nodes specified.
       The element $h (which by then may have no parent) is returned.  This
       causes a fatal error if $h has no parent.  The list of nodes to insert
       may contain $h, but at most once.  Aside from that possible exception,
       the nodes to insert should not already be children of $h's parent.

       Also, note that this method does not destroy $h -- use
       "$h->replace_with(...)->delete" if you need that.

       $h->preinsert($element_or_text...)

       Inserts the given nodes right BEFORE $h in $h's parent's content list.
       This causes a fatal error if $h has no parent.  None of the given nodes
       should be $h or other children of $h.  Returns $h.

       $h->postinsert($element_or_text...)

       Inserts the given nodes right AFTER $h in $h's parent's content list.
       This causes a fatal error if $h has no parent.  None of the given nodes
       should be $h or other children of $h.  Returns $h.

       $h->replace_with_content()

       This replaces $h in its parent's content list with its own content.
       The element $h (which by then has no parent or content of its own) is
       returned.  This causes a fatal error if $h has no parent.  Also, note
       that this does not destroy $h -- use "$h->replace_with_content->delete"
       if you need that.

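       One possible use of "replace_with_content" is to unwrap an inline
       element while keeping its text, sketched below.  The sample sentence
       and variable names are invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content('I ', ['b', 'really'], ' like pie.');

# Pick out the one child that is an element (the "b").
my ($bold) = grep { ref $_ } $p->content_list;

# Splice the children of the "b" into the "p", then dispose of the empty "b".
$bold->replace_with_content->delete;

my @elements_left = grep { ref $_ } $p->content_list;
my $text          = $p->as_text;
print scalar(@elements_left), " child elements left; text is: $text\n";
```

       Chaining "->delete" follows the advice above, since the detached
       element is otherwise left lying around.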
       $h->delete_content()

       Clears the content of $h, calling "$h->delete" for each content
       element.  Compare with "$h->detach_content".

       Returns $h.

       $h->delete()

       Detaches this element from its parent (if it has one) and explicitly
       destroys the element and all its descendants.  The return value is
       undef.

       Perl uses garbage collection based on reference counting; when no
       references to a data structure exist, it's implicitly destroyed --
       i.e., when no value anywhere points to a given object anymore, Perl
       knows it can free up the memory that the now-unused object occupies.

       But this fails with HTML::Element trees, because a parent element
       always holds references to its children, and its children elements hold
       references to the parent, so no element ever looks like it's not in
       use.  So, to destroy those elements, you need to call "$h->delete" on
       the parent.

       $h->clone()

       Returns a copy of the element (whose children are clones (recursively)
       of the original's children, if any).

       The returned element is parentless.  Any '_pos' attributes present in
       the source element/tree will be absent in the copy.  For that and other
       reasons, the clone of an HTML::TreeBuilder object that's in mid-parse
       (i.e., the head of a tree that HTML::TreeBuilder is elaborating) cannot
       (currently) be used to continue the parse.

       You are free to clone HTML::TreeBuilder trees, just as long as: 1)
       they're done being parsed, or 2) you don't expect to resume parsing
       into the clone.  (You can continue parsing into the original; it is
       never affected.)

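       A short sketch of the two properties just described: the clone has no
       parent, and changing it leaves the original alone.  The "div" wrapper
       exists only to give the original a parent.

```perl
use strict;
use warnings;
use HTML::Element;

my $orig = HTML::Element->new('ul');
$orig->push_content(['li', 'Peaches'], ['li', 'Apples']);

my $wrapper = HTML::Element->new('div');
$wrapper->push_content($orig);          # $orig now has a parent

my $copy = $orig->clone;
$copy->push_content(['li', 'Pears']);   # affects the copy only

my $orig_count = $orig->content_list;
my $copy_count = $copy->content_list;
print "original: $orig_count items, copy: $copy_count items\n";
print "copy is parentless\n" unless defined $copy->parent;
```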
       HTML::Element->clone_list(...nodes...)

       Returns a list consisting of a copy of each node given.  Text segments
       are simply copied; elements are cloned by calling $it->clone on each of
       them.

       Note that this must be called as a class method, not as an instance
       method.  "clone_list" will croak if called as an instance method.  You
       can also call it like so:

           ref($h)->clone_list(...nodes...)

       $h->normalize_content

       Normalizes the content of $h -- i.e., concatenates any adjacent text
       nodes.  (Any undefined text segments are turned into empty-strings.)
       Note that this does not recurse into $h's descendants.

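       As an illustration, adjacent text segments can arise when text is
       added with "splice_content", which does not merge it.  The sketch
       below creates that situation and then normalizes it.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content('foo');

# splice_content leaves the new text segments as separate items.
$p->splice_content(1, 0, 'bar', 'baz');
my $before = $p->content_list;

# normalize_content joins the run of adjacent text into one segment.
$p->normalize_content;
my @after = $p->content_list;

print "before: $before segments, after: ", scalar(@after), " segment\n";
```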
       $h->delete_ignorable_whitespace()

       This traverses under $h and deletes any text segments that are
       ignorable whitespace.  You should not use this if $h is under a 'pre'
       element.

       $h->insert_element($element, $implicit)

       Inserts (via push_content) a new element under the element at
       "$h->pos()".  Then updates "$h->pos()" to point to the inserted
       element, unless $element is a prototypically empty element like "br",
       "hr", "img", etc.  The new "$h->pos()" is returned.  This method is
       useful only if your particular tree task involves setting "$h->pos()".

DUMPING METHODS

       $h->dump()

       $h->dump(*FH)  ; # or *FH{IO} or $fh_obj

       Prints the element and all its children to STDOUT (or to a specified
       filehandle), in a format useful only for debugging.  The structure of
       the document is shown by indentation (no end tags).

       $h->as_HTML() or $h->as_HTML($entities)

       or $h->as_HTML($entities, $indent_char)

       or $h->as_HTML($entities, $indent_char, \%optional_end_tags)

       Returns a string representing in HTML the element and its descendants.
       The optional argument $entities specifies a string of the entities to
       encode.  For compatibility with previous versions, specify '<>&' here.
       If omitted or undef, all unsafe characters are encoded as HTML
       entities.  See HTML::Entities for details.  If passed an empty string,
       no entities are encoded.

       If $indent_char is specified and defined, the HTML to be output is
       indented, using the string you specify (which you probably should set
       to "\t", or some number of spaces, if you specify it).

       If "\%optional_end_tags" is specified and defined, it should be a
       reference to a hash that holds a true value for every tag name whose
       end tag is optional.  Defaults to "\%HTML::Element::optionalEndTag",
       which is an alias to %HTML::Tagset::optionalEndTag, which, at time of
       writing, contains true values for "p, li, dt, dd".  A useful value to
       pass is an empty hashref, "{}", which means that no end-tags are
       optional for this dump.  Otherwise, possibly consider copying
       %HTML::Tagset::optionalEndTag to a hash of your own, adding or deleting
       values as you like, and passing a reference to that hash.

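       The three arguments can be tried out with a sketch like the following.
       The exact whitespace and line breaks in the output differ between
       versions of the module, so no literal output is shown here.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new('ul');
$ul->push_content(map ['li', $_], 'Peaches & cream', 'Apples');

# Defaults: unsafe characters encoded, no indenting, optional end tags
# (such as the one for "li") may be left out.
my $terse = $ul->as_HTML;

# An empty hashref as third argument means no end tag is optional.
my $strict = $ul->as_HTML(undef, undef, {});

# Encode only <, > and &, indent with two spaces, keep every end tag.
my $pretty = $ul->as_HTML('<>&', '  ', {});

print "$terse\n$strict\n$pretty\n";
```

       Passing "{}" is a simple way to get output in which every element is
       explicitly closed.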
       $h->as_text()

       $h->as_text(skip_dels => 1)

       Returns a string consisting of only the text parts of the element's
       descendants.

       Text under 'script' or 'style' elements is never included in what's
       returned.  If "skip_dels" is true, then text content under "del" nodes
       is not included in what's returned.

       $h->as_trimmed_text(...)

       This is just like as_text(...) except that leading and trailing
       whitespace is deleted, and any internal whitespace is collapsed.

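       The difference between the plain, "skip_dels", and trimmed forms can
       be seen in a sketch such as this one, where the sample text is made up
       and deliberately has untidy whitespace.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content("  I  like\n", ['b', 'pie'], ['del', ' and cake'], "  ");

# All text, whitespace untouched, including the text under "del".
my $raw = $p->as_text;

# The same, but leaving out anything under a "del" element.
my $no_dels = $p->as_text(skip_dels => 1);

# Trimmed at both ends, with inner runs of whitespace collapsed.
my $trimmed = $p->as_trimmed_text(skip_dels => 1);

print "[$raw]\n[$no_dels]\n[$trimmed]\n";
```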
       $h->as_XML()

       Returns a string representing in XML the element and its descendants.

       The XML is not indented.

       $h->as_Lisp_form()

       Returns a string representing the element and its descendants as a Lisp
       form.  Unsafe characters are encoded as octal escapes.

       The Lisp form is indented, and contains external ("href", etc.) as
       well as internal attributes ("_tag", "_content", "_implicit", etc.),
       except for "_parent", which is omitted.

       Current example output for a given element:

         ("_tag" "img" "border" "0" "src" "pie.png" "usemap" "#main.map")

       $h->starttag() or $h->starttag($entities)

       Returns a string representing the complete start tag for the element.
       I.e., leading "<", tag name, attributes, and trailing ">".  All values
       are surrounded with double-quotes, and appropriate characters are
       encoded.  If $entities is omitted or undef, all unsafe characters are
       encoded as HTML entities.  See HTML::Entities for details.  If you
       specify some value for $entities, remember to include the double-quote
       character in it.  (Previous versions of this module would basically
       behave as if '&">' were specified for $entities.)  If $entities is an
       empty string, no entity is escaped.

       $h->endtag()

       Returns a string representing the complete end tag for this element.
       I.e., "</", tag name, and ">".

SECONDARY STRUCTURAL METHODS

       These methods all involve some structural aspect of the tree; either
       they report some aspect of the tree's structure, or they involve
       traversal down the tree, or walking up the tree.

       $h->is_inside('tag', ...) or $h->is_inside($element, ...)

       Returns true if the $h element is, or is contained anywhere inside, an
       element that is any of the ones listed, or whose tag name is any of the
       tag names listed.

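       A few calls on a tiny hand-built tree illustrate both argument forms,
       and the fact that an element counts as being inside itself.  The
       results are normalized to 1 or 0 only to make them easy to print.

```perl
use strict;
use warnings;
use HTML::Element;

my $body = HTML::Element->new('body');
$body->push_content(['div', ['p', 'Hello']]);
my ($div) = $body->content_list;
my ($p)   = $div->content_list;

my %result = (
    p_in_body_tag    => ($p->is_inside('body')         ? 1 : 0),
    p_in_table_or_div => ($p->is_inside('table', 'div') ? 1 : 0),
    p_in_body_object => ($p->is_inside($body)          ? 1 : 0),
    p_in_itself      => ($p->is_inside('p')            ? 1 : 0),
    body_in_p        => ($body->is_inside('p')         ? 1 : 0),
);

print "$_: $result{$_}\n" for sort keys %result;
```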
       $h->is_empty()

       Returns true if $h has no content, i.e., has no elements or text
       segments under it.  In other words, this returns true if $h is a leaf
       node, AKA a terminal node.  Do not confuse this sense of "empty" with
       another sense that it can have in SGML/HTML/XML terminology, which
       means that the element in question is of the type (like HTML's "hr",
       "br", "img", etc.) that can't have any content.

       That is, a particular "p" element may happen to have no content, so
       $that_p_element->is_empty will be true -- even though the prototypical
       "p" element isn't "empty" (not in the way that the prototypical "hr"
       element is).

       If you think this might make for potentially confusing code, consider
       simply using the clearer exact equivalent:  not($h->content_list)

       $h->pindex()

       Return the index of the element in its parent's contents array, such
       that $h would equal

         $h->parent->content->[$h->pindex]
         or
         ($h->parent->content_list)[$h->pindex]

       assuming $h isn't root.  If the element $h is root, then $h->pindex
       returns undef.

726       $h->left()
727
728       In scalar context: returns the node that's the immediate left sibling
729       of $h.  If $h is the leftmost (or only) child of its parent (or has no
730       parent), then this returns undef.
731
732       In list context: returns all the nodes that're the left siblings of $h
733       (starting with the leftmost).  If $h is the leftmost (or only) child of
734       its parent (or has no parent), then this returns empty-list.
735
736       (See also $h->preinsert(LIST).)
737
738       $h->right()
739
740       In scalar context: returns the node that's the immediate right sibling
741       of $h.  If $h is the rightmost (or only) child of its parent (or has no
742       parent), then this returns undef.
743
744       In list context: returns all the nodes that're the right siblings of
745       $h, starting with the leftmost.  If $h is the rightmost (or only) child
746       of its parent (or has no parent), then this returns empty-list.
747
748       (See also $h->postinsert(LIST).)
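
       The following is a minimal sketch of pindex, left, and right, assuming
       a small hand-built tree (the sample markup and variable names are
       illustrative, not taken from the module):

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical sample: <p>Some <em>tasty</em> text.</p>
my $p  = HTML::Element->new_from_lol(['p', 'Some ', ['em', 'tasty'], ' text.']);
my $em = $p->find_by_tag_name('em');

my $index = $em->pindex;                 # position of $em in $p's content
my $same  = ($p->content_list)[$index];  # the very same object as $em
my $left  = $em->left;                   # scalar context: immediate left sibling
my $right = $em->right;                  # scalar context: immediate right sibling
my @lefts = $em->left;                   # list context: all left siblings
my $none  = $p->left;                    # $p has no parent, so this is undef

print "em is item $index of <", $em->parent->tag, ">\n";
print "left: '$left', right: '$right'\n";
```

       Note that siblings may be plain text segments (strings) rather than
       element objects, so check ref() before calling methods on them.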
749
750       $h->address()
751
752       Returns a string representing the location of this node in the tree.
753       The address consists of numbers joined by a '.', starting with '0', and
754       followed by the pindexes of the nodes in the tree that are ancestors of
755       $h, starting from the top.
756
757       So if the way to get to a node starting at the root is to go to child 2
758       of the root, then child 10 of that, and then child 0 of that, and then
759       you're there -- then that node's address is "0.2.10.0".
760
761       As a bit of a special case, the address of the root is simply "0".
762
       I foresee this being used mainly for debugging, but you may find your
       own uses for it.
765
766       $h->address($address)
767
768       This returns the node (whether element or text-segment) at the given
769       address in the tree that $h is a part of.  (That is, the address is
770       resolved starting from $h->root.)
771
772       If there is no node at the given address, this returns undef.
773
774       You can specify "relative addressing" (i.e., that indexing is supposed
775       to start from $h and not from $h->root) by having the address start
776       with a period -- e.g., $h->address(".3.2") will look at child 3 of $h,
777       and child 2 of that.
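
       As a rough illustration of both forms of address, here is a sketch
       that assumes a small hand-built tree (the tree and variable names are
       made up for the example):

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical sample tree.
my $tree = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'Stuff']],
        ['body', ['h1', 'I like potatoes!'], ['p', 'Yum.']],
    ]
);
my $p = $tree->find_by_tag_name('p');

my $addr    = $p->address;              # root, then child 1 (body), then child 1 (p)
my $node    = $tree->address($addr);    # resolves back to the same element
my $h1      = $p->address('0.1.0');     # absolute: resolved from the root, not $p
my $text    = $p->address('.0');        # relative: child 0 of $p, a text segment
my $missing = $tree->address('0.7');    # no such node, so undef

print "p lives at $addr\n";
```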
778
779       $h->depth()
780
781       Returns a number expressing $h's depth within its tree, i.e., how many
782       steps away it is from the root.  If $h has no parent (i.e., is root),
783       its depth is 0.
784
785       $h->root()
786
787       Returns the element that's the top of $h's tree.  If $h is root, this
788       just returns $h.  (If you want to test whether $h is the root, instead
789       of asking what its root is, just test "not($h->parent)".)
790
791       $h->lineage()
792
793       Returns the list of $h's ancestors, starting with its parent, and then
794       that parent's parent, and so on, up to the root.  If $h is root, this
795       returns an empty list.
796
797       If you simply want a count of the number of elements in $h's lineage,
798       use $h->depth.
799
800       $h->lineage_tag_names()
801
802       Returns the list of the tag names of $h's ancestors, starting with its
803       parent, and that parent's parent, and so on, up to the root.  If $h is
804       root, this returns an empty list.  Example output: "('em', 'td', 'tr',
805       'table', 'body', 'html')"
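
       A short sketch tying depth, root, lineage, and lineage_tag_names
       together, assuming a deliberately deep hand-built tree (names are
       illustrative only):

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical sample: an <em> nested five levels below the root.
my $tree = HTML::Element->new_from_lol(
    ['html', ['body', ['table', ['tr', ['td', ['em', 'hi']]]]]]
);
my $em = $tree->find_by_tag_name('em');

my $depth     = $em->depth;               # number of steps up to the root
my @ancestors = $em->lineage;             # parent first, root last
my @names     = $em->lineage_tag_names;   # same order, as tag names
my $top       = $em->root;                # the html element
my $is_root   = !$tree->parent;           # true only for a root

print "depth $depth: @names\n";
```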
806
807       $h->descendants()
808
809       In list context, returns the list of all $h's descendant elements,
810       listed in pre-order (i.e., an element appears before its content-ele‐
811       ments).  Text segments DO NOT appear in the list.  In scalar context,
812       returns a count of all such elements.
813
814       $h->descendents()
815
816       This is just an alias to the "descendants" method.
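
       To make the context-dependence concrete, here is a small sketch,
       assuming a hand-built tree (the tree is made up for the example):

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical sample tree.
my $tree = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'Stuff']],
        ['body',
            ['h1', 'I like potatoes!'],
            ['p', 'Some ', ['em', 'tasty'], ' text.'],
        ],
    ]
);

# List context: element objects in pre-order; text segments are left out.
my @tags = map { $_->tag } $tree->descendants;

# Scalar context: just the count of descendant elements.
my $count = $tree->descendants;

print "$count descendants: @tags\n";
```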
817
818       $h->find_by_tag_name('tag', ...)
819
820       In list context, returns a list of elements at or under $h that have
821       any of the specified tag names.  In scalar context, returns the first
822       (in pre-order traversal of the tree) such element found, or undef if
823       none.
824
825       $h->find('tag', ...)
826
827       This is just an alias to "find_by_tag_name".  (There was once going to
828       be a whole find_* family of methods, but then look_down filled that
829       niche, so there turned out not to be much reason for the verboseness of
830       the name "find_by_tag_name".)
831
832       $h->find_by_attribute('attribute', 'value')
833
834       In a list context, returns a list of elements at or under $h that have
835       the specified attribute, and have the given value for that attribute.
836       In a scalar context, returns the first (in pre-order traversal of the
837       tree) such element found, or undef if none.
838
839       This method is deprecated in favor of the more expressive "look_down"
840       method, which new code should use instead.
841
842       $h->look_down( ...criteria... )
843
844       This starts at $h and looks thru its element descendants (in
845       pre-order), looking for elements matching the criteria you specify.  In
846       list context, returns all elements that match all the given criteria;
847       in scalar context, returns the first such element (or undef, if nothing
848       matched).
849
850       There are three kinds of criteria you can specify:
851
852       (attr_name, attr_value)
853           This means you're looking for an element with that value for that
854           attribute.  Example: "alt", "pix!".  Consider that you can search
855           on internal attribute values too: "_tag", "p".
856
857       (attr_name, qr/.../)
858           This means you're looking for an element whose value for that
859           attribute matches the specified Regexp object.
860
861       a coderef
           This means you're looking for elements where
           coderef->(each_element) returns true.  Example:
864
865             my @wide_pix_images
866               = $h->look_down(
867                               "_tag", "img",
868                               "alt", "pix!",
869                               sub { $_[0]->attr('width') > 350 }
870                              );
871
872       Note that "(attr_name, attr_value)" and "(attr_name, qr/.../)" criteria
873       are almost always faster than coderef criteria, so should presumably be
874       put before them in your list of criteria.  That is, in the example
875       above, the sub ref is called only for elements that have already passed
876       the criteria of having a "_tag" attribute with value "img", and an
877       "alt" attribute with value "pix!".  If the coderef were first, it would
878       be called on every element, and then what elements pass that criterion
879       (i.e., elements for which the coderef returned true) would be checked
880       for their "_tag" and "alt" attributes.
881
882       Note that comparison of string attribute-values against the string
883       value in "(attr_name, attr_value)" is case-INsensitive!  A criterion of
884       "('align', 'right')" will match an element whose "align" value is
885       "RIGHT", or "right" or "rIGhT", etc.
886
887       Note also that "look_down" considers "" (empty-string) and undef to be
888       different things, in attribute values.  So this:
889
890         $h->look_down("alt", "")
891
892       will find elements with an "alt" attribute, but where the value for the
893       "alt" attribute is "".  But this:
894
895         $h->look_down("alt", undef)
896
897       is the same as:
898
899         $h->look_down(sub { !defined($_[0]->attr('alt')) } )
900
901       That is, it finds elements that do not have an "alt" attribute at all
902       (or that do have an "alt" attribute, but with a value of undef -- which
903       is not normally possible).
904
       Note that when you give several criteria, this is taken to mean you're
       looking for elements that match all your criteria, not just any of
907       them.  In other words, there is an implicit "and", not an "or".  So if
908       you wanted to express that you wanted to find elements with a "name"
909       attribute with the value "foo" or with an "id" attribute with the value
910       "baz", you'd have to do it like:
911
         @them = $h->look_down(
           sub {
             # the lcs are to fold case; the "|| ''" avoids warnings
             # for elements that lack the attribute
             lc($_[0]->attr('name') || '') eq 'foo'
             or lc($_[0]->attr('id') || '') eq 'baz'
           }
         );
919
920       Coderef criteria are more expressive than "(attr_name, attr_value)" and
921       "(attr_name, qr/.../)" criteria, and all "(attr_name, attr_value)" and
922       "(attr_name, qr/.../)" criteria could be expressed in terms of
923       coderefs.  However, "(attr_name, attr_value)" and "(attr_name,
924       qr/.../)" criteria are a convenient shorthand.  (In fact, "look_down"
925       itself is basically "shorthand" too, since anything you can do with
       "look_down" you could do by traversing the tree, either with the
       "traverse" method or with a routine of your own.  However, "look_down"
928       often makes for very concise and clear code.)
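
       The three kinds of criteria can be combined as in the following
       sketch, which assumes a hand-built tree of "img" elements (the file
       names and attribute values are invented for the example):

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical sample data.
my $body = HTML::Element->new_from_lol(
    ['body',
        ['img', {src => 'a.png', alt => 'pix!',  width => 400}],
        ['img', {src => 'b.png', alt => 'pix!',  width => 100}],
        ['img', {src => 'c.gif', alt => 'other', width => 500}],
        ['img', {src => 'd.png'}],
    ]
);

# (attr_name, qr/.../): every img whose src ends in ".png".
my @png = $body->look_down(_tag => 'img', src => qr/\.png$/);

# Cheap criteria first, coderef last: it only sees imgs with alt "pix!".
my @wide = $body->look_down(
    _tag => 'img',
    alt  => 'pix!',
    sub { $_[0]->attr('width') > 350 },
);

# undef as the value: imgs that have no alt attribute at all.
my @no_alt = $body->look_down(_tag => 'img', alt => undef);

# Scalar context: only the first match in pre-order.
my $first = $body->look_down(_tag => 'img');

print scalar(@png), " png images; first image is ", $first->attr('src'), "\n";
```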
929
930       $h->look_up( ...criteria... )
931
932       This is identical to $h->look_down, except that whereas $h->look_down
933       basically scans over the list:
934
935          ($h, $h->descendants)
936
937       $h->look_up instead scans over the list
938
939          ($h, $h->lineage)
940
941       So, for example, this returns all ancestors of $h (possibly including
942       $h itself) that are "td" elements with an "align" attribute with a
943       value of "right" (or "RIGHT", etc.):
944
945          $h->look_up("_tag", "td", "align", "right");
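
       A brief sketch of look_up in both contexts, assuming a small
       hand-built table (the tree is illustrative only):

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical sample: an <em> inside a right-aligned table cell.
my $table = HTML::Element->new_from_lol(
    ['table', ['tr', ['td', {align => 'right'}, ['em', 'hi']]]]
);
my $em = $table->find_by_tag_name('em');

# Scalar context: the nearest match, scanning $em first, then upward.
my $cell = $em->look_up(_tag => 'td', align => 'right');
my $self = $em->look_up(_tag => 'em');      # $em itself qualifies
my $nope = $em->look_up(_tag => 'body');    # no such ancestor: undef

# List context: every match, nearest first.
my @up = $em->look_up(sub { $_[0]->tag =~ /^t[rd]$/ });

print "enclosing cell is <", $cell->tag, ">\n";
```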
946
947       $h->traverse(...options...)
948
       Lengthy discussion of HTML::Element's unnecessary and confusing
       "traverse" method has been moved to a separate file:
       HTML::Element::traverse
952
953       $h->attr_get_i('attribute')
954
955       In list context, returns a list consisting of the values of the given
956       attribute for $self and for all its ancestors starting from $self and
957       working its way up.  Nodes with no such attribute are skipped.
958       ("attr_get_i" stands for "attribute get, with inheritance".)  In scalar
959       context, returns the first such value, or undef if none.
960
961       Consider a document consisting of:
962
963          <html lang='i-klingon'>
964            <head><title>Pati Pata</title></head>
965            <body>
966              <h1 lang='la'>Stuff</h1>
967              <p lang='es-MX' align='center'>
968                Foo bar baz <cite>Quux</cite>.
969              </p>
970              <p>Hooboy.</p>
971            </body>
972          </html>
973
974       If $h is the "cite" element, $h->attr_get_i("lang") in list context
975       will return the list ('es-MX', 'i-klingon').  In scalar context, it
976       will return the value 'es-MX'.
977
978       If you call with multiple attribute names...
979
980       $h->attr_get_i('a1', 'a2', 'a3')
981
982       ...in list context, this will return a list consisting of the values of
983       these attributes which exist in $self and its ancestors.  In scalar
984       context, this returns the first value (i.e., the value of the first
985       existing attribute from the first element that has any of the
986       attributes listed).  So, in the above example,
987
988         $h->attr_get_i('lang', 'align');
989
990       will return:
991
992          ('es-MX', 'center', 'i-klingon') # in list context
993         or
994          'es-MX' # in scalar context.
995
996       But note that this:
997
998        $h->attr_get_i('align', 'lang');
999
1000       will return:
1001
1002          ('center', 'es-MX', 'i-klingon') # in list context
1003         or
1004          'center' # in scalar context.
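
       The document above can be rebuilt and queried as in this sketch,
       which assumes the tree is constructed with new_from_lol rather than
       parsed (the variable names are illustrative):

```perl
use strict;
use warnings;
use HTML::Element;

# The example document, built by hand instead of parsed.
my $tree = HTML::Element->new_from_lol(
    ['html', {lang => 'i-klingon'},
        ['head', ['title', 'Pati Pata']],
        ['body',
            ['h1', {lang => 'la'}, 'Stuff'],
            ['p', {lang => 'es-MX', align => 'center'},
                'Foo bar baz ', ['cite', 'Quux'], '.'],
            ['p', 'Hooboy.'],
        ],
    ]
);
my $cite = $tree->find_by_tag_name('cite');

my @langs = $cite->attr_get_i('lang');           # nearest value first
my $lang  = $cite->attr_get_i('lang');           # scalar: nearest value only
my @both  = $cite->attr_get_i('align', 'lang');  # per node, in the order asked

print "effective language of cite: $lang\n";
```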
1005
1006       $h->tagname_map()
1007
       Scans across $h and all its descendants, and makes a hash (a reference
       to which is returned) where each entry consists of a key that's a tag
       name, and a value that's a reference to a list of all elements that
       have that tag name.  I.e., this method returns:
1012
1013          {
1014            # Across $h and all descendants...
1015            'a'   => [ ...list of all 'a'   elements... ],
1016            'em'  => [ ...list of all 'em'  elements... ],
1017            'img' => [ ...list of all 'img' elements... ],
1018          }
1019
       (There are entries in the hash for only those tagnames that occur
       at/under $h -- so if there are no "img" elements, there'll be no "img"
       entry in the hashref returned.)
1023
1024       Example usage:
1025
1026           my $map_r = $h->tagname_map();
1027           my @heading_tags = sort grep m/^h\d$/s, keys %$map_r;
1028           if(@heading_tags) {
1029             print "Heading levels used: @heading_tags\n";
1030           } else {
1031             print "No headings.\n"
1032           }
1033
1034       $h->extract_links() or $h->extract_links(@wantedTypes)
1035
1036       Returns links found by traversing the element and all of its children
1037       and looking for attributes (like "href" in an "a" element, or "src" in
1038       an "img" element) whose values represent links.  The return value is a
       reference to an array.  Each element of the array is a reference to an
       array with four items: the link-value, the element that has the
       attribute with that link-value, the name of that attribute, and the
       tagname of that element.  (Example: "['http://www.suck.com/',
       $elem_obj, 'href', 'a']".)  You may or may not end up using the
       element itself -- for some purposes, you may use only the link value.
1045
1046       You might specify that you want to extract links from just some kinds
1047       of elements (instead of the default, which is to extract links from all
1048       the kinds of elements known to have attributes whose values represent
1049       links).  For instance, if you want to extract links from only "a" and
1050       "img" elements, you could code it like this:
1051
         for (@{  $h->extract_links('a', 'img')  }) {
             my($link, $element, $attr, $tag) = @$_;
             print
               "Hey, there's a $tag that links to ",
               $link, ", in its $attr attribute, at ",
               $element->address(), ".\n";
         }
1059
1060       $h->simplify_pres
1061
1062       In text bits under PRE elements that are at/under $h, this routine
1063       nativizes all newlines, and expands all tabs.
1064
       That is, if you read a file with lines delimited by "\cm\cj"'s, the
       text under PRE areas will have "\cm\cj"'s instead of "\n"'s. Calling
       $h->simplify_pres on such a tree will turn "\cm\cj"'s into "\n"'s.
1069
1070       Tabs are expanded to however many spaces it takes to get to the next
1071       8th column -- the usual way of expanding them.
1072
1073       $h->same_as($i)
1074
1075       Returns true if $h and $i are both elements representing the same tree
1076       of elements, each with the same tag name, with the same explicit
1077       attributes (i.e., not counting attributes whose names start with "_"),
1078       and with the same content (textual, comments, etc.).
1079
       Sameness of descendant elements is tested, recursively, with
       "$child1->same_as($child2)", and sameness of text segments is tested
       with "$segment1 eq $segment2".
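
       A minimal sketch of same_as, assuming a hypothetical helper
       (make_para, not part of the module) that builds structurally
       identical paragraphs:

```perl
use strict;
use warnings;
use HTML::Element;

# make_para() is a made-up helper for this example only.
sub make_para {
    my ($class) = @_;
    return HTML::Element->new_from_lol(
        ['p', {class => $class}, 'Hello ', ['b', 'world']]
    );
}

my $x = make_para('note');
my $y = make_para('note');      # distinct objects, same structure
my $z = make_para('warning');   # differs in one explicit attribute

my $alike     = $x->same_as($y);   # true: same tags, attributes, content
my $different = $x->same_as($z);   # false: "class" values differ

print "x and y are ", ($alike ? "the same" : "different"), "\n";
```

       Note that same_as compares structure, not identity: $x and $y above
       are separate objects in memory.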
1083
1084       $h = HTML::Element->new_from_lol(ARRAYREF)
1085
       Recursively constructs a tree of nodes, based on the (non-cyclic) data
1087       structure represented by ARRAYREF, where that is a reference to an
1088       array of arrays (of arrays (of arrays (etc.))).
1089
1090       In each arrayref in that structure, different kinds of values are
1091       treated as follows:
1092
1093       * Arrayrefs
1094           Arrayrefs are considered to designate a sub-tree representing chil‐
1095           dren for the node constructed from the current arrayref.
1096
1097       * Hashrefs
           Hashrefs are considered to contain attribute-value pairs to add to
           the element to be constructed from the current arrayref.

       * Text segments
           Text segments at the start of any arrayref will be considered to
           specify the name of the element to be constructed from the current
           arrayref; all other text segments will be considered to specify
           text segments as children for the current arrayref.
1106
1107       * Elements
           Existing element objects are either inserted into the treelet
           constructed, or clones of them are.  That is, when the lol-tree is
           being traversed and elements constructed based on what's in it, if
           an existing element object is found, if it has no parent, then it
           is added directly to the treelet constructed; but if it has a
           parent, then "$that_node->clone" is added to the treelet at the
           appropriate place.
1115
1116       An example will hopefully make this more obvious:
1117
1118         my $h = HTML::Element->new_from_lol(
1119           ['html',
1120             ['head',
1121               [ 'title', 'I like stuff!' ],
1122             ],
1123             ['body',
1124               {'lang', 'en-JP', _implicit => 1},
1125               'stuff',
1126               ['p', 'um, p < 4!', {'class' => 'par123'}],
1127               ['div', {foo => 'bar'}, '123'],
1128             ]
1129           ]
1130         );
1131         $h->dump;
1132
1133       Will print this:
1134
1135         <html> @0
1136           <head> @0.0
1137             <title> @0.0.0
1138               "I like stuff!"
1139           <body lang="en-JP"> @0.1 (IMPLICIT)
1140             "stuff"
1141             <p class="par123"> @0.1.1
1142               "um, p < 4!"
1143             <div foo="bar"> @0.1.2
1144               "123"
1145
1146       And printing $h->as_HTML will give something like:
1147
1148         <html><head><title>I like stuff!</title></head>
1149         <body lang="en-JP">stuff<p class="par123">um, p &lt; 4!
1150         <div foo="bar">123</div></body></html>
1151
1152       You can even do fancy things with "map":
1153
1154         $body->push_content(
1155           # push_content implicitly calls new_from_lol on arrayrefs...
1156           ['br'],
1157           ['blockquote',
1158             ['h2', 'Pictures!'],
1159             map ['p', $_],
1160             $body2->look_down("_tag", "img"),
1161               # images, to be copied from that other tree.
1162           ],
1163           # and more stuff:
1164           ['ul',
1165             map ['li', ['a', {'href'=>"$_.png"}, $_ ] ],
1166             qw(Peaches Apples Pears Mangos)
1167           ],
1168         );
1169
1170       @elements = HTML::Element->new_from_lol(ARRAYREFS)
1171
1172       Constructs several elements, by calling new_from_lol for every arrayref
1173       in the ARRAYREFS list.
1174
1175         @elements = HTML::Element->new_from_lol(
1176           ['hr'],
1177           ['p', 'And there, on the door, was a hook!'],
1178         );
1179          # constructs two elements.
1180
1181       $h->objectify_text()
1182
1183       This turns any text nodes under $h from mere text segments (strings)
1184       into real objects, pseudo-elements with a tag-name of "~text", and the
1185       actual text content in an attribute called "text".  (For a discussion
1186       of pseudo-elements, see the "tag" method, far above.)  This method is
1187       provided because, for some purposes, it is convenient or necessary to
1188       be able, for a given text node, to ask what element is its parent; and
1189       clearly this is not possible if a node is just a text string.
1190
       Note that these "~text" objects are not recognized as text nodes by
       methods like as_text.  Presumably you will want to call
       $h->objectify_text, perform whatever task that you needed that for, and
       then call $h->deobjectify_text before calling anything like
       $h->as_text.
1195
1196       $h->deobjectify_text()
1197
1198       This undoes the effect of $h->objectify_text.  That is, it takes any
1199       "~text" pseudo-elements in the tree at/under $h, and deletes each one,
1200       replacing each with the content of its "text" attribute.
1201
1202       Note that if $h itself is a "~text" pseudo-element, it will be
1203       destroyed -- a condition you may need to treat specially in your call‐
1204       ing code (since it means you can't very well do anything with $h after
1205       that).  So that you can detect that condition, if $h is itself a
1206       "~text" pseudo-element, then this method returns the value of the
1207       "text" attribute, which should be a defined value; in all other cases,
1208       it returns undef.
1209
1210       (This method assumes that no "~text" pseudo-element has any children.)
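
       The objectify/deobjectify round trip might look like the following
       sketch, which assumes a small hand-built paragraph (the sample text
       and variable names are illustrative):

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical sample: <p>Hello, <b>world</b>!</p>
my $p = HTML::Element->new_from_lol(['p', 'Hello, ', ['b', 'world'], '!']);

# Turn plain strings into "~text" pseudo-elements so each can be asked
# for its parent.
$p->objectify_text;
my @report;
for my $t ($p->look_down(_tag => '~text')) {
    push @report, sprintf("'%s' is under <%s>", $t->attr('text'), $t->parent->tag);
}

# Turn them back into plain strings before using methods like as_text.
$p->deobjectify_text;
my $text = $p->as_text;

print "$_\n" for @report;
print "text: $text\n";
```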
1211
1212       $h->number_lists()
1213
       For every UL, OL, DIR, and MENU element at/under $h, this sets a
       "_bullet" attribute for every child LI element.  For LI children of
       an OL, the "_bullet" attribute's value will be something like "4.",
       "d.", "D.", "IV.", or "iv.", depending on the OL element's "type"
       attribute.  LI children of a UL, DIR, or MENU get their "_bullet"
       attribute set to "*".  There should be no other LIs (i.e., except as
       children of OL, UL, DIR, or MENU elements), and if there are, they
       are unaffected.
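
       As a small sketch of the effect, assuming a hand-built body with one
       OL (no "type" attribute, so plain numbers) and one UL (the list items
       are invented for the example):

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical sample lists.
my $body = HTML::Element->new_from_lol(
    ['body',
        ['ol', ['li', 'first'], ['li', 'second']],
        ['ul', ['li', 'apple'], ['li', 'pear']],
    ]
);
$body->number_lists;

my $ol = $body->find_by_tag_name('ol');
my $ul = $body->find_by_tag_name('ul');
my @ol_bullets = map { $_->attr('_bullet') } $ol->content_list;
my @ul_bullets = map { $_->attr('_bullet') } $ul->content_list;

# Render a crude plain-text version of the ordered list.
for my $li ($ol->content_list) {
    print $li->attr('_bullet'), ' ', $li->as_text, "\n";
}
```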
1221
1222       $h->has_insane_linkage
1223
1224       This method is for testing whether this element or the elements under
1225       it have linkage attributes (_parent and _content) whose values are
1226       deeply aberrant: if there are undefs in a content list; if an element
1227       appears in the content lists of more than one element; if the _parent
1228       attribute of an element doesn't match its actual parent; or if an ele‐
1229       ment appears as its own descendant (i.e., if there is a cyclicity in
1230       the tree).
1231
       This returns empty list (or false, in scalar context) if the subtree's
       linkage attributes are sane; otherwise it returns two items (or true,
       in scalar context): the element where the error occurred, and a string
       describing the error.
1236
       This method is provided mainly for debugging and troubleshooting --
       it should be quite impossible for any document constructed via
       HTML::TreeBuilder to parse into a non-sane tree (since it's not the
       content of the tree per se that's in question, but whether the tree in
       memory was properly constructed); and it should be impossible for you
       to produce an insane tree just thru reasonable use of normal documented
       structure-modifying methods.  But if you're constructing your own
       trees, and your program is going into infinite loops during calls to
       traverse() or any of the secondary structural methods, as part of
       debugging, consider calling has_insane_linkage on the tree.
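
       A debugging session might use it roughly as in this sketch.  The
       corruption step is contrived for the example: it assumes that calling
       parent(undef) on a child clears its parent link while leaving it in
       the parent's content list, which is not something normal code should
       do.

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical sample list.
my $ul = HTML::Element->new_from_lol(['ul', ['li', 'one'], ['li', 'two']]);

# A freshly built tree should report no problems (empty list).
my @problem = $ul->has_insane_linkage;

# Contrived damage: the first li stays in $ul's content list, but its
# parent link is cleared.
my ($first_li) = $ul->content_list;
$first_li->parent(undef);

my ($where, $why) = $ul->has_insane_linkage;
print "Problem at <", $where->tag, ">: $why\n" if $where;

# Repair the link so the tree is usable again.
$first_li->parent($ul);
my @after_repair = $ul->has_insane_linkage;
```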
1247

BUGS

1249       * If you want to free the memory associated with a tree built of
1250       HTML::Element nodes, then you will have to delete it explicitly.  See
1251       the $h->delete method, above.
1252
1253       * There's almost nothing to stop you from making a "tree" with cyclici‐
1254       ties (loops) in it, which could, for example, make the traverse method
1255       go into an infinite loop.  So don't make cyclicities!  (If all you're
1256       doing is parsing HTML files, and looking at the resulting trees, this
1257       will never be a problem for you.)
1258
1259       * There's no way to represent comments or processing directives in a
1260       tree with HTML::Elements.  Not yet, at least.
1261
1262       * There's (currently) nothing to stop you from using an undefined value
1263       as a text segment.  If you're running under "perl -w", however, this
1264       may make HTML::Element's code produce a slew of warnings.
1265

NOTES ON SUBCLASSING

1267       You are welcome to derive subclasses from HTML::Element, but you should
1268       be aware that the code in HTML::Element makes certain assumptions about
1269       elements (and I'm using "element" to mean ONLY an object of class
1270       HTML::Element, or of a subclass of HTML::Element):
1271
1272       * The value of an element's _parent attribute must either be undef or
1273       otherwise false, or must be an element.
1274
1275       * The value of an element's _content attribute must either be undef or
1276       otherwise false, or a reference to an (unblessed) array.  The array may
1277       be empty; but if it has items, they must ALL be either mere strings
1278       (text segments), or elements.
1279
1280       * The value of an element's _tag attribute should, at least, be a
1281       string of printable characters.
1282
1283       Moreover, bear these rules in mind:
1284
1285       * Do not break encapsulation on objects.  That is, access their con‐
1286       tents only thru $obj->attr or more specific methods.
1287
1288       * You should think twice before completely overriding any of the meth‐
1289       ods that HTML::Element provides.  (Overriding with a method that calls
1290       the superclass method is not so bad, though.)
1291

SEE ALSO

1293       HTML::Tree; HTML::TreeBuilder; HTML::AsSubs; HTML::Tagset; and, for the
1294       morbidly curious, HTML::Element::traverse.
1295

COPYRIGHT

1297       Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy
1298       Lester, 2006 Pete Krawczyk.
1299
1300       This library is free software; you can redistribute it and/or modify it
1301       under the same terms as Perl itself.
1302
1303       This program is distributed in the hope that it will be useful, but
1304       without any warranty; without even the implied warranty of mer‐
1305       chantability or fitness for a particular purpose.
1306

AUTHOR

1308       Currently maintained by Pete Krawczyk "<petek@cpan.org>"
1309
1310       Original authors: Gisle Aas, Sean Burke and Andy Lester.
1311
1312       Thanks to Mark-Jason Dominus for a POD suggestion.
1313
1314
1315
1316perl v5.8.8                       2006-08-04                  HTML::Element(3)