HTML::Element(3pm)

HTML::Element(3)      User Contributed Perl Documentation     HTML::Element(3)

NAME

       HTML::Element - Class for objects that represent HTML elements

VERSION

       Version 3.23

SYNOPSIS

           use HTML::Element;

           # Build an <a> element and give it some text content.
           $a = HTML::Element->new('a', href => 'http://www.perl.com/');
           $a->push_content("The Perl Homepage");

           $tag = $a->tag;
           print "$tag starts out as: ",  $a->starttag, "\n";
           print "$tag ends as: ",  $a->endtag, "\n";
           print "$tag\'s href attribute is: ", $a->attr('href'), "\n";

           $links_r = $a->extract_links();
           print "Hey, I found ", scalar(@$links_r), " links.\n";

           print "And that, as HTML, is: ", $a->as_HTML, "\n";

           # Explicitly destroy the element when done; delete returns undef.
           $a = $a->delete;
26

DESCRIPTION

       (This class is part of the HTML::Tree distribution.)

       Objects of the HTML::Element class can be used to represent elements of
       HTML document trees.  These objects have attributes, notably attributes
       that designate each element's parent and content.  The content is an
       array of text segments and other HTML::Element objects.  A tree with
       HTML::Element objects as nodes can represent the syntax tree for an HTML
       document.
36

HOW WE REPRESENT TREES

       Consider this HTML document:

         <html lang='en-US'>
           <head>
             <title>Stuff</title>
             <meta name='author' content='Jojo'>
           </head>
           <body>
             <h1>I like potatoes</h1>
           </body>
         </html>
49
50       Building a syntax tree out of it makes a tree-structure in memory that
51       could be diagrammed as:
52
53                            html (lang='en-US')
54                             / \
55                           /     \
56                         /         \
57                       head        body
58                      /\               \
59                    /    \               \
60                  /        \               \
61                title     meta              h1
62                 |       (name='author',     |
63              "Stuff"    content='Jojo')    "I like potatoes"
64
65       This is the traditional way to diagram a tree, with the "root" at the
66       top, and it's this kind of diagram that people have in mind when they
67       say, for example, that "the meta element is under the head element
68       instead of under the body element".  (The same is also said with
69       "inside" instead of "under" -- the use of "inside" makes more sense
70       when you're looking at the HTML source.)
71
72       Another way to represent the above tree is with indenting:
73
74         html (attributes: lang='en-US')
75           head
76             title
77               "Stuff"
78             meta (attributes: name='author' content='Jojo')
79           body
80             h1
81               "I like potatoes"
82
83       Incidentally, diagramming with indenting works much better for very
84       large trees, and is easier for a program to generate.  The
85       "$tree->dump" method uses indentation just that way.
86
87       However you diagram the tree, it's stored the same in memory -- it's a
88       network of objects, each of which has attributes like so:
89
         element #1:  _tag: 'html'
                      _parent: none
                      _content: [element #2, element #5]
                      lang: 'en-US'

         element #2:  _tag: 'head'
                      _parent: element #1
                      _content: [element #3, element #4]

         element #3:  _tag: 'title'
                      _parent: element #2
                      _content: [text segment "Stuff"]

         element #4:  _tag: 'meta'
                      _parent: element #2
                      _content: none
                      name: 'author'
                      content: 'Jojo'

         element #5:  _tag: 'body'
                      _parent: element #1
                      _content: [element #6]

         element #6:  _tag: 'h1'
                      _parent: element #5
                      _content: [text segment "I like potatoes"]
116
117       The "treeness" of the tree-structure that these elements comprise is
118       not an aspect of any particular object, but is emergent from the
119       relatedness attributes (_parent and _content) of these element-objects
120       and from how you use them to get from element to element.
121
       While you could access the content of a tree by writing code that says
       "access the 'src' attribute of the root's first child's seventh child's
       third child", you're more likely to have to scan the contents of a
       tree, looking for whatever nodes, or kinds of nodes, you want to do
       something with.  The most straightforward way to look over a tree is to
       "traverse" it.  An HTML::Element method ("$h->traverse") is provided
       for this purpose, and several other HTML::Element methods are based on
       it.
129
130       (For everything you ever wanted to know about trees, and then some, see
131       Niklaus Wirth's Algorithms + Data Structures = Programs or Donald
132       Knuth's The Art of Computer Programming, Volume 1.)
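
       As a rough illustration of that "network of objects", the sketch
       below rebuilds the example tree from plain Perl hashes.  The helper
       names (make_node, add_child, outline) are invented for this sketch
       and are not part of HTML::Element; only the _tag, _parent and
       _content keys come from the description above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an element object: a hash with the
# relatedness attributes described in the text.
sub make_node {
    my ($tag, %attr) = @_;
    return { _tag => $tag, _parent => undef, _content => [], %attr };
}

# Link a child (a node hash or a plain text string) under a parent.
sub add_child {
    my ($parent, $child) = @_;
    $child->{_parent} = $parent if ref $child;   # text segments have no parent slot
    push @{ $parent->{_content} }, $child;
    return $child;
}

my $html  = make_node('html', lang => 'en-US');
my $head  = add_child($html, make_node('head'));
my $title = add_child($head, make_node('title'));
add_child($title, 'Stuff');
add_child($head, make_node('meta', name => 'author', content => 'Jojo'));
my $body  = add_child($html, make_node('body'));
my $h1    = add_child($body, make_node('h1'));
add_child($h1, 'I like potatoes');

# Produce the indented diagram by following _content downward.
sub outline {
    my ($node, $depth) = @_;
    my $pad = '  ' x $depth;
    return qq{$pad"$node"\n} unless ref $node;
    my $out = "$pad$node->{_tag}\n";
    $out .= outline($_, $depth + 1) for @{ $node->{_content} };
    return $out;
}

my $diagram = outline($html, 0);
print $diagram;
```

       Going upward works the same way in reverse: following _parent from
       the h1 node leads to body and then to html, whose _parent is undef.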
133

BASIC METHODS

135   $h = HTML::Element->new('tag', 'attrname' => 'value', ... )
136       This constructor method returns a new HTML::Element object.  The tag
137       name is a required argument; it will be forced to lowercase.
138       Optionally, you can specify other initial attributes at object creation
139       time.
140
141   $h->attr('attr') or $h->attr('attr', 'value')
142       Returns (optionally sets) the value of the given attribute of $h.  The
143       attribute name (but not the value, if provided) is forced to lowercase.
144       If trying to read the value of an attribute not present for this
145       element, the return value is undef.  If setting a new value, the old
146       value of that attribute is returned.
147
       If methods are provided for accessing an attribute (like "$h->tag" for
       "_tag", "$h->content_list", etc. below), use those instead of calling
       "$h->attr", whether for reading or setting.
151
152       Note that setting an attribute to "undef" (as opposed to "", the empty
153       string) actually deletes the attribute.
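
       The reading, setting and deleting behaviors described above can be
       sketched as follows.  The variable names and the second URL are
       only illustrative.

```perl
use strict;
use warnings;
use HTML::Element;

my $link = HTML::Element->new('a', href => 'http://www.perl.com/');

# The attribute name is folded to lowercase, so 'HREF' reads 'href'.
my $read = $link->attr('HREF');

# Setting a new value returns the old one.
my $old = $link->attr('href', 'http://www.cpan.org/');
my $new = $link->attr('href');

# Setting to undef deletes the attribute altogether.
$link->attr('href', undef);
my %remaining = $link->all_external_attr;

print "read: $read\nold: $old\nnew: $new\n";
print "href still present\n" if exists $remaining{href};
```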
154
155   $h->tag() or $h->tag('tagname')
156       Returns (optionally sets) the tag name (also known as the generic
157       identifier) for the element $h.  In setting, the tag name is always
158       converted to lower case.
159
160       There are four kinds of "pseudo-elements" that show up as HTML::Element
161       objects:
162
163       Comment pseudo-elements
164           These are element objects with a "$h->tag" value of "~comment", and
165           the content of the comment is stored in the "text" attribute
166           ("$h->attr("text")").  For example, parsing this code with
167           HTML::TreeBuilder...
168
169             <!-- I like Pie.
170                Pie is good
171             -->
172
173           produces an HTML::Element object with these attributes:
174
175             "_tag",
176             "~comment",
177             "text",
178             " I like Pie.\n     Pie is good\n  "
179
180       Declaration pseudo-elements
181           Declarations (rarely encountered) are represented as HTML::Element
182           objects with a tag name of "~declaration", and content in the
183           "text" attribute.  For example, this:
184
185             <!DOCTYPE foo>
186
187           produces an element whose attributes include:
188
189             "_tag", "~declaration", "text", "DOCTYPE foo"
190
191       Processing instruction pseudo-elements
192           PIs (rarely encountered) are represented as HTML::Element objects
193           with a tag name of "~pi", and content in the "text" attribute.  For
194           example, this:
195
196             <?stuff foo?>
197
198           produces an element whose attributes include:
199
200             "_tag", "~pi", "text", "stuff foo?"
201
202           (assuming a recent version of HTML::Parser)
203
204       ~literal pseudo-elements
205           These objects are not currently produced by HTML::TreeBuilder, but
206           can be used to represent a "super-literal" -- i.e., a literal you
207           want to be immune from escaping.  (Yes, I just made that term up.)
208
209           That is, this is useful if you want to insert code into a tree that
210           you plan to dump out with "as_HTML", where you want, for some
211           reason, to suppress "as_HTML"'s normal behavior of amp-quoting text
212           segments.
213
214           For example, this:
215
             my $literal = HTML::Element->new('~literal',
               'text' => 'x < 4 & y > 7'
             );
             my $span = HTML::Element->new('span');
             $span->push_content($literal);
             print $span->as_HTML;

           prints this:

             <span>x < 4 & y > 7</span>

           Whereas this:

             my $span = HTML::Element->new('span');
             $span->push_content('x < 4 & y > 7');   # normal text segment
             print $span->as_HTML;

           prints this:

             <span>x &lt; 4 &amp; y &gt; 7</span>
237
238           Unless you're inserting lots of pre-cooked code into existing
239           trees, and dumping them out again, it's not likely that you'll find
240           "~literal" pseudo-elements useful.
241
242   $h->parent() or $h->parent($new_parent)
243       Returns (optionally sets) the parent (aka "container") for this
244       element.  The parent should either be undef, or should be another
245       element.
246
247       You should not use this to directly set the parent of an element.
248       Instead use any of the other methods under "Structure-Modifying
249       Methods", below.
250
251       Note that not($h->parent) is a simple test for whether $h is the root
252       of its subtree.
253
254   $h->content_list()
255       Returns a list of the child nodes of this element -- i.e., what nodes
256       (elements or text segments) are inside/under this element. (Note that
257       this may be an empty list.)
258
259       In a scalar context, this returns the count of the items, as you may
260       expect.
261
262   $h->content()
263       This somewhat deprecated method returns the content of this element;
264       but unlike content_list, this returns either undef (which you should
265       understand to mean no content), or a reference to the array of content
266       items, each of which is either a text segment (a string, i.e., a
267       defined non-reference scalar value), or an HTML::Element object.  Note
268       that even if an arrayref is returned, it may be a reference to an empty
269       array.
270
271       While older code should feel free to continue to use "$h->content", new
272       code should use "$h->content_list" in almost all conceivable cases.  It
273       is my experience that in most cases this leads to simpler code anyway,
274       since it means one can say:
275
276           @children = $h->content_list;
277
278       instead of the inelegant:
279
280           @children = @{$h->content || []};
281
282       If you do use "$h->content" (or "$h->content_array_ref"), you should
283       not use the reference returned by it (assuming it returned a reference,
284       and not undef) to directly set or change the content of an element or
285       text segment!  Instead use content_refs_list or any of the other
286       methods under "Structure-Modifying Methods", below.
287
288   $h->content_array_ref()
       This is like "content" (with all its caveats and deprecations) except
       that it is guaranteed to return an array reference.  That is, if the
       given node has no "_content" attribute, the "content" method would
       return undef, but "content_array_ref" would set the given node's
       "_content" value to "[]" (a reference to a new, empty array), and
       return that.
295
296   $h->content_refs_list
       This returns a list of scalar references to each element of $h's
       content list.  This is useful in case you want to in-place edit any
       large text segments without having to get a copy of the current value
       of that segment, modify that copy, then use "splice_content" to
       replace the old with the new.  Instead, here you can in-place edit:

           foreach my $item_r ($h->content_refs_list) {
               next if ref $$item_r;
               $$item_r =~ s/honour/honor/g;
           }

       You could currently achieve the same effect with:

           foreach my $item (@{ $h->content_array_ref }) {
               # deprecated!
               next if ref $item;
               $item =~ s/honour/honor/g;
           }

       ...except that using the return value of "$h->content" or
       "$h->content_array_ref" to do that is deprecated, and just might stop
       working in the future.
319
320   $h->implicit() or $h->implicit($bool)
321       Returns (optionally sets) the "_implicit" attribute.  This attribute is
322       a flag that's used for indicating that the element was not originally
323       present in the source, but was added to the parse tree (by
324       HTML::TreeBuilder, for example) in order to conform to the rules of
325       HTML structure.
326
327   $h->pos() or $h->pos($element)
328       Returns (and optionally sets) the "_pos" (for "current position")
329       pointer of $h.  This attribute is a pointer used during some parsing
330       operations, whose value is whatever HTML::Element element at or under
331       $h is currently "open", where "$h->insert_element(NEW)" will actually
332       insert a new element.
333
334       (This has nothing to do with the Perl function called "pos", for
335       controlling where regular expression matching starts.)
336
337       If you set "$h->pos($element)", be sure that $element is either $h, or
338       an element under $h.
339
340       If you've been modifying the tree under $h and are no longer sure
341       "$h->pos" is valid, you can enforce validity with:
342
343           $h->pos(undef) unless $h->pos->is_inside($h);
344
345   $h->all_attr()
346       Returns all this element's attributes and values, as key-value pairs.
347       This will include any "internal" attributes (i.e., ones not present in
348       the original element, and which will not be represented if/when you
349       call "$h->as_HTML").  Internal attributes are distinguished by the fact
350       that the first character of their key (not value! key!) is an
351       underscore ("_").
352
       Example output of "$h->all_attr()": '_parent', [object value],
       '_tag', 'em', 'lang', 'en-US', '_content', [array-ref value].
355
356   $h->all_attr_names()
357       Like all_attr, but only returns the names of the attributes.
358
       Example output of "$h->all_attr_names()": '_parent', '_tag', 'lang',
       '_content'.
361
362   $h->all_external_attr()
363       Like "all_attr", except that internal attributes are not present.
364
365   $h->all_external_attr_names()
       Like "all_attr_names", except that internal attributes' names are not
       present.
368
369   $h->id() or $h->id($string)
370       Returns (optionally sets to $string) the "id" attribute.
371       "$h->id(undef)" deletes the "id" attribute.
372
373   $h->idf() or $h->idf($string)
374       Just like the "id" method, except that if you call "$h->idf()" and no
375       "id" attribute is defined for this element, then it's set to a likely-
376       to-be-unique value, and returned.  (The "f" is for "force".)
377

STRUCTURE-MODIFYING METHODS

379       These methods are provided for modifying the content of trees by adding
380       or changing nodes as parents or children of other nodes.
381
382   $h->push_content($element_or_text, ...)
       Adds the specified items to the end of the content list of the element
       $h.  The items of content to be added should each be either a text
       segment (a string), an HTML::Element object, or an arrayref.  Arrayrefs
       are fed thru "$h->new_from_lol(that_arrayref)" to convert them into
       elements, before being added to the content list of $h.  This means you
       can say concise things like:
389
390         $body->push_content(
391           ['br'],
392           ['ul',
393             map ['li', $_], qw(Peaches Apples Pears Mangos)
394           ]
395         );
396
397       See "new_from_lol" method's documentation, far below, for more
398       explanation.
399
400       The push_content method will try to consolidate adjacent text segments
401       while adding to the content list.  That's to say, if $h's content_list
402       is
403
404         ('foo bar ', $some_node, 'baz!')
405
406       and you call
407
408          $h->push_content('quack?');
409
410       then the resulting content list will be this:
411
412         ('foo bar ', $some_node, 'baz!quack?')
413
414       and not this:
415
416         ('foo bar ', $some_node, 'baz!', 'quack?')
417
       If the latter is what you want, you'll have to override the feature of
       consolidating text by using splice_content, as in:

         $h->splice_content(scalar($h->content_list),0,'quack?');

       Similarly, if you want to add 'Skronk' to the beginning of the
       content list and call this:

         $h->unshift_content('Skronk');

       then the resulting content list will be this:

         ('Skronkfoo bar ', $some_node, 'baz!')

       and not this:

         ('Skronk', 'foo bar ', $some_node, 'baz!')

       What you'd do to get the latter is:

         $h->splice_content(0,0,'Skronk');
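
       The difference between the consolidating methods and splice_content
       can be checked with a short script along these lines.  This is a
       minimal sketch; the 'p' and 'b' elements and the string 'honk' are
       arbitrary choices for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

my $h         = HTML::Element->new('p');
my $some_node = HTML::Element->new('b');
$h->push_content('foo bar ', $some_node, 'baz!');

# push_content merges new text into a trailing text segment.
$h->push_content('quack?');
my @after_push = $h->content_list;

# unshift_content merges new text into a leading text segment.
$h->unshift_content('Skronk');
my @after_unshift = $h->content_list;

# splice_content adds the text as a segment of its own.
$h->splice_content(scalar($h->content_list), 0, 'honk');
my @after_splice = $h->content_list;

printf "%d items after push, %d after unshift, %d after splice\n",
    scalar(@after_push), scalar(@after_unshift), scalar(@after_splice);
```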
439
440   $h->unshift_content($element_or_text, ...)
441       Just like "push_content", but adds to the beginning of the $h element's
442       content list.
443
444       The items of content to be added should each be either a text segment
445       (a string), an HTML::Element object, or an arrayref (which is fed thru
446       "new_from_lol").
447
448       The unshift_content method will try to consolidate adjacent text
449       segments while adding to the content list.  See above for a discussion
450       of this.
451
452   $h->splice_content($offset, $length, $element_or_text, ...)
       Detaches the elements from $h's list of content-nodes, starting at
       $offset and continuing for $length items, replacing them with the
       elements of the following list, if any.  Returns the elements (if any)
       removed from the content-list.  If $offset is negative, then it starts
       that far from the end of the array, just like Perl's normal "splice"
       function.  If $length and the following list are omitted, removes
       everything from $offset onward.
460
461       The items of content to be added (if any) should each be either a text
462       segment (a string), an arrayref (which is fed thru "new_from_lol"), or
463       an HTML::Element object that's not already a child of $h.
464
465   $h->detach()
       This unlinks $h from its parent, by setting its "_parent" attribute to
       undef, and by removing it from the content list of its parent (if it
       had one).  The return value is the parent that was detached from (or
       undef, if $h had no parent to start with).  Note that neither $h nor
       its parent is explicitly destroyed.
471
472   $h->detach_content()
473       This unlinks all of $h's children from $h, and returns them.  Note that
474       these are not explicitly destroyed; for that, you can just use
475       $h->delete_content.
476
477   $h->replace_with( $element_or_text, ... )
478       This replaces $h in its parent's content list with the nodes specified.
479       The element $h (which by then may have no parent) is returned.  This
480       causes a fatal error if $h has no parent.  The list of nodes to insert
481       may contain $h, but at most once.  Aside from that possible exception,
482       the nodes to insert should not already be children of $h's parent.
483
484       Also, note that this method does not destroy $h -- use
485       "$h->replace_with(...)->delete" if you need that.
486
487   $h->preinsert($element_or_text...)
       Inserts the given nodes right BEFORE $h in $h's parent's content list.
       This causes a fatal error if $h has no parent.  None of the given nodes
       should be $h or other children of $h's parent.  Returns $h.
491
492   $h->postinsert($element_or_text...)
       Inserts the given nodes right AFTER $h in $h's parent's content list.
       This causes a fatal error if $h has no parent.  None of the given nodes
       should be $h or other children of $h's parent.  Returns $h.
496
497   $h->replace_with_content()
498       This replaces $h in its parent's content list with its own content.
499       The element $h (which by then has no parent or content of its own) is
500       returned.  This causes a fatal error if $h has no parent.  Also, note
501       that this does not destroy $h -- use "$h->replace_with_content->delete"
502       if you need that.
503
504   $h->delete_content()
505       Clears the content of $h, calling "$h->delete" for each content
506       element.  Compare with "$h->detach_content".
507
508       Returns $h.
509
510   $h->delete()
511       Detaches this element from its parent (if it has one) and explicitly
512       destroys the element and all its descendants.  The return value is
513       undef.
514
515       Perl uses garbage collection based on reference counting; when no
516       references to a data structure exist, it's implicitly destroyed --
517       i.e., when no value anywhere points to a given object anymore, Perl
518       knows it can free up the memory that the now-unused object occupies.
519
520       But this fails with HTML::Element trees, because a parent element
521       always holds references to its children, and its children elements hold
522       references to the parent, so no element ever looks like it's not in
523       use.  So, to destroy those elements, you need to call "$h->delete" on
524       the parent.
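
       Because delete returns undef, a common idiom (the SYNOPSIS uses it
       too) is to assign the result back to the variable, so that no
       handle to the destroyed element is left lying around.  A minimal
       sketch, with an arbitrary two-item list as the tree:

```perl
use strict;
use warnings;
use HTML::Element;

my $list = HTML::Element->new('ul');
$list->push_content(['li', 'one'], ['li', 'two']);

# In scalar context, content_list gives the number of children.
my $count = $list->content_list;
print "children before delete: $count\n";

# Destroy the whole subtree and clear the variable in one step.
$list = $list->delete;
print "tree variable is now undef\n" unless defined $list;
```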
525
526   $h->clone()
527       Returns a copy of the element (whose children are clones (recursively)
528       of the original's children, if any).
529
       The returned element is parentless.  Any '_pos' attributes present in
       the source element/tree will be absent in the copy.  For that and other
       reasons, the clone of an HTML::TreeBuilder object that's in mid-parse
       (i.e., the head of a tree that HTML::TreeBuilder is elaborating) cannot
       (currently) be used to continue the parse.
535
536       You are free to clone HTML::TreeBuilder trees, just as long as: 1)
537       they're done being parsed, or 2) you don't expect to resume parsing
538       into the clone.  (You can continue parsing into the original; it is
539       never affected.)
540
541   HTML::Element->clone_list(...nodes...)
542       Returns a list consisting of a copy of each node given.  Text segments
543       are simply copied; elements are cloned by calling $it->clone on each of
544       them.
545
546       Note that this must be called as a class method, not as an instance
547       method.  "clone_list" will croak if called as an instance method.  You
548       can also call it like so:
549
550           ref($h)->clone_list(...nodes...)
551
552   $h->normalize_content
553       Normalizes the content of $h -- i.e., concatenates any adjacent text
554       nodes.  (Any undefined text segments are turned into empty-strings.)
555       Note that this does not recurse into $h's descendants.
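
       One place where normalize_content comes in handy is after
       replace_with_content, which leaves the promoted text segments as
       separate neighbors.  The sketch below unwraps a "b" element and
       then merges the resulting text; the sample markup is made up for
       illustration.

```perl
use strict;
use warnings;
use HTML::Element;

# <p>Hello <b>world</b>!</p>
my $p = HTML::Element->new_from_lol(['p', 'Hello ', ['b', 'world'], '!']);

# Find the one child that is an element rather than a text segment.
my ($bold) = grep { ref $_ } $p->content_list;

# Put the b element's content in its place, then destroy the empty shell.
$bold->replace_with_content->delete;
my $segments_before = $p->content_list;   # scalar context: item count

# Concatenate the adjacent text segments.
$p->normalize_content;
my $segments_after = $p->content_list;
my ($text) = $p->content_list;

print "$segments_before segments, then $segments_after: '$text'\n";
```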
556
557   $h->delete_ignorable_whitespace()
       This traverses under $h and deletes any text segments that are
       ignorable whitespace.  You should not use this if $h is under a 'pre'
       element.
561
562   $h->insert_element($element, $implicit)
563       Inserts (via push_content) a new element under the element at
564       "$h->pos()".  Then updates "$h->pos()" to point to the inserted
565       element, unless $element is a prototypically empty element like "br",
566       "hr", "img", etc.  The new "$h->pos()" is returned.  This method is
567       useful only if your particular tree task involves setting "$h->pos()".
568

DUMPING METHODS

570   $h->dump()
571   $h->dump(*FH)  ; # or *FH{IO} or $fh_obj
572       Prints the element and all its children to STDOUT (or to a specified
573       filehandle), in a format useful only for debugging.  The structure of
574       the document is shown by indentation (no end tags).
575
576   $h->as_HTML() or $h->as_HTML($entities)
577   or $h->as_HTML($entities, $indent_char)
578   or $h->as_HTML($entities, $indent_char, \%optional_end_tags)
579       Returns a string representing in HTML the element and its descendants.
580       The optional argument $entities specifies a string of the entities to
581       encode.  For compatibility with previous versions, specify '<>&' here.
582       If omitted or undef, all unsafe characters are encoded as HTML
583       entities.  See HTML::Entities for details.  If passed an empty string,
584       no entities are encoded.
585
       If $indent_char is specified and defined, the HTML to be output is
       indented, using the string you specify (which you probably should set
       to "\t", or some number of spaces, if you specify it).
589
590       If "\%optional_end_tags" is specified and defined, it should be a
591       reference to a hash that holds a true value for every tag name whose
592       end tag is optional.  Defaults to "\%HTML::Element::optionalEndTag",
593       which is an alias to %HTML::Tagset::optionalEndTag, which, at time of
594       writing, contains true values for "p, li, dt, dd".  A useful value to
595       pass is an empty hashref, "{}", which means that no end-tags are
596       optional for this dump.  Otherwise, possibly consider copying
597       %HTML::Tagset::optionalEndTag to a hash of your own, adding or deleting
598       values as you like, and passing a reference to that hash.
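
       To see the effect of the third argument, one can compare the
       default output with the output when an empty hashref is passed.
       This is only a sketch; the exact whitespace in the returned string
       may differ between versions of the module, so the comparison below
       looks only at whether "</li>" appears.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new_from_lol(
    ['ul', ['li', 'Peaches'], ['li', 'Apples']]
);

# Default: end tags listed as optional (such as </li>) are left out.
my $terse = $ul->as_HTML;

# Empty hashref: no end tag is treated as optional for this dump.
my $explicit = $ul->as_HTML(undef, undef, {});

print "default omits </li>\n"    if $terse    !~ m{</li>};
print "explicit includes </li>\n" if $explicit =~ m{</li>};
```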
599
600   $h->as_text()
601   $h->as_text(skip_dels => 1)
602       Returns a string consisting of only the text parts of the element's
603       descendants.
604
605       Text under 'script' or 'style' elements is never included in what's
606       returned.  If "skip_dels" is true, then text content under "del" nodes
607       is not included in what's returned.
608
609   $h->as_trimmed_text(...)
610       This is just like as_text(...) except that leading and trailing
611       whitespace is deleted, and any internal whitespace is collapsed.
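
       The three text-extraction variants can be compared side by side.
       In this sketch the sample paragraph (a struck-out price and a
       script element) is invented purely for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(
    ['p', '  Price: ', ['del', '$10'], ' $8  ', ['script', 'track();']]
);

# All text, including text under del; script content is never included.
my $all = $p->as_text;

# Same, but leaving out text under del elements.
my $current = $p->as_text(skip_dels => 1);

# Additionally trims the ends and collapses internal whitespace.
my $tidy = $p->as_trimmed_text(skip_dels => 1);

print "[$all]\n[$current]\n[$tidy]\n";
```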
612
613   $h->as_XML()
614       Returns a string representing in XML the element and its descendants.
615
616       The XML is not indented.
617
618   $h->as_Lisp_form()
619       Returns a string representing the element and its descendants as a Lisp
620       form.  Unsafe characters are encoded as octal escapes.
621
622       The Lisp form is indented, and contains external ("href", etc.)  as
623       well as internal attributes ("_tag", "_content", "_implicit", etc.),
624       except for "_parent", which is omitted.
625
626       Current example output for a given element:
627
628         ("_tag" "img" "border" "0" "src" "pie.png" "usemap" "#main.map")
629
630   $h->starttag() or $h->starttag($entities)
631       Returns a string representing the complete start tag for the element.
632       I.e., leading "<", tag name, attributes, and trailing ">".  All values
633       are surrounded with double-quotes, and appropriate characters are
634       encoded.  If $entities is omitted or undef, all unsafe characters are
635       encoded as HTML entities.  See HTML::Entities for details.  If you
636       specify some value for $entities, remember to include the double-quote
637       character in it.  (Previous versions of this module would basically
638       behave as if '&">' were specified for $entities.)  If $entities is an
639       empty string, no entity is escaped.
640
641   $h->endtag()
642       Returns a string representing the complete end tag for this element.
643       I.e., "</", tag name, and ">".
644

SECONDARY STRUCTURAL METHODS

646       These methods all involve some structural aspect of the tree; either
647       they report some aspect of the tree's structure, or they involve
648       traversal down the tree, or walking up the tree.
649
650   $h->is_inside('tag', ...) or $h->is_inside($element, ...)
       Returns true if the $h element is, or is contained anywhere inside, an
       element that is any of the ones listed, or whose tag name is any of the
       tag names listed.
654
655   $h->is_empty()
656       Returns true if $h has no content, i.e., has no elements or text
657       segments under it.  In other words, this returns true if $h is a leaf
658       node, AKA a terminal node.  Do not confuse this sense of "empty" with
659       another sense that it can have in SGML/HTML/XML terminology, which
660       means that the element in question is of the type (like HTML's "hr",
661       "br", "img", etc.) that can't have any content.
662
663       That is, a particular "p" element may happen to have no content, so
664       $that_p_element->is_empty will be true -- even though the prototypical
665       "p" element isn't "empty" (not in the way that the prototypical "hr"
666       element is).
667
668       If you think this might make for potentially confusing code, consider
669       simply using the clearer exact equivalent:  not($h->content_list)
670
671   $h->pindex()
       Returns the index of the element in its parent's contents array, such
       that $h would equal

         $h->parent->content->[$h->pindex]
         or
         ($h->parent->content_list)[$h->pindex]

       assuming $h isn't root.  If the element $h is root, then $h->pindex
       returns undef.
681
682   $h->left()
683       In scalar context: returns the node that's the immediate left sibling
684       of $h.  If $h is the leftmost (or only) child of its parent (or has no
685       parent), then this returns undef.
686
687       In list context: returns all the nodes that're the left siblings of $h
688       (starting with the leftmost).  If $h is the leftmost (or only) child of
689       its parent (or has no parent), then this returns empty-list.
690
691       (See also $h->preinsert(LIST).)
692
693   $h->right()
694       In scalar context: returns the node that's the immediate right sibling
695       of $h.  If $h is the rightmost (or only) child of its parent (or has no
696       parent), then this returns undef.
697
698       In list context: returns all the nodes that're the right siblings of
699       $h, starting with the leftmost.  If $h is the rightmost (or only) child
700       of its parent (or has no parent), then this returns empty-list.
701
702       (See also $h->postinsert(LIST).)
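
       The sibling-related methods above (pindex, left and right) can be
       tried out on a small list.  This is a minimal sketch with made-up
       content; the variable names are arbitrary.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new_from_lol(
    ['ul', ['li', 'one'], ['li', 'two'], ['li', 'three']]
);
my ($first, $second, $third) = $ul->content_list;

my $index = $second->pindex;   # position within the parent's content
my $prev  = $second->left;     # scalar context: immediate left sibling
my $next  = $second->right;    # scalar context: immediate right sibling
my $none  = $first->left;      # leftmost child: nothing to the left
my @after = $first->right;     # list context: all right siblings

print "index $index, between '", $prev->as_text, "' and '",
    $next->as_text, "'\n";
print scalar(@after), " siblings to the right of the first item\n";
```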
703
704   $h->address()
705       Returns a string representing the location of this node in the tree.
706       The address consists of numbers joined by a '.', starting with '0', and
707       followed by the pindexes of the nodes in the tree that are ancestors of
708       $h, starting from the top.
709
710       So if the way to get to a node starting at the root is to go to child 2
711       of the root, then child 10 of that, and then child 0 of that, and then
712       you're there -- then that node's address is "0.2.10.0".
713
714       As a bit of a special case, the address of the root is simply "0".
715
716       I foresee this being used mainly for debugging, but you may find your
717       own uses for it.
718
719   $h->address($address)
720       This returns the node (whether element or text-segment) at the given
721       address in the tree that $h is a part of.  (That is, the address is
722       resolved starting from $h->root.)
723
724       If there is no node at the given address, this returns undef.
725
726       You can specify "relative addressing" (i.e., that indexing is supposed
727       to start from $h and not from $h->root) by having the address start
728       with a period -- e.g., $h->address(".3.2") will look at child 3 of $h,
729       and child 2 of that.
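The two forms of address can be sketched as follows, assuming a small hand-built tree (the document structure and variable names are illustrative only):

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'T']],
        ['body',
            ['p', 'a', ['b', 'bold']],
            ['p', 'c'],
        ],
    ]
);
my $body = $tree->look_down(_tag => 'body');
my $bold = $tree->look_down(_tag => 'b');

my $addr     = $bold->address;             # path of pindexes from the root
my $same     = $tree->address($addr);      # absolute lookup, resolved from the root
my $text     = $tree->address('0.1.0.0');  # an address can name a text segment too
my $relative = $body->address('.0.1');     # leading period: start at $body
my $missing  = $tree->address('0.9');      # no such child: undef

print "address of <b>: $addr\n";
```

Because an address may resolve to a plain string (a text segment), check ref() on the result before calling methods on it.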
730
731   $h->depth()
732       Returns a number expressing $h's depth within its tree, i.e., how many
733       steps away it is from the root.  If $h has no parent (i.e., is root),
734       its depth is 0.
735
736   $h->root()
737       Returns the element that's the top of $h's tree.  If $h is root, this
738       just returns $h.  (If you want to test whether $h is the root, instead
739       of asking what its root is, just test "not($h->parent)".)
740
741   $h->lineage()
742       Returns the list of $h's ancestors, starting with its parent, and then
743       that parent's parent, and so on, up to the root.  If $h is root, this
744       returns an empty list.
745
746       If you simply want a count of the number of elements in $h's lineage,
747       use $h->depth.
748
749   $h->lineage_tag_names()
750       Returns the list of the tag names of $h's ancestors, starting with its
751       parent, and that parent's parent, and so on, up to the root.  If $h is
752       root, this returns an empty list.  Example output: "('em', 'td', 'tr',
753       'table', 'body', 'html')"
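A brief sketch of depth, root, lineage, and lineage_tag_names on one deeply nested element. The tree is built by hand for the example, so it contains only the elements listed.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html', ['body', ['table', ['tr', ['td', ['em', 'hi']]]]]]
);
my $em = $tree->look_down(_tag => 'em');

my $depth     = $em->depth;               # number of steps up to the root
my @ancestors = $em->lineage;             # parent first, root last
my @names     = $em->lineage_tag_names;   # same order, tag names only
my $root_tag  = $em->root->tag;
my $is_root   = $tree->parent ? 0 : 1;    # root-ness test: no parent

print "depth=$depth lineage=@names root=$root_tag\n";
```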
754
755   $h->descendants()
756       In list context, returns the list of all $h's descendant elements,
757       listed in pre-order (i.e., an element appears before its content-
758       elements).  Text segments DO NOT appear in the list.  In scalar
759       context, returns a count of all such elements.
760
761   $h->descendents()
762       This is just an alias to the "descendants" method.
763
764   $h->find_by_tag_name('tag', ...)
765       In list context, returns a list of elements at or under $h that have
766       any of the specified tag names.  In scalar context, returns the first
767       (in pre-order traversal of the tree) such element found, or undef if
768       none.
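The following sketch shows descendants and find_by_tag_name in both contexts, using an illustrative hand-built tree:

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'T']],
        ['body', ['p', 'a', ['b', 'bold']], ['p', 'c']],
    ]
);

my @tags  = map { $_->tag } $tree->descendants;  # pre-order, elements only
my $count = $tree->descendants;                  # scalar context: a count
my @paras = $tree->find_by_tag_name('p');        # every "p" at or under $tree
my $first = $tree->find_by_tag_name('b', 'p');   # scalar context: first hit in pre-order

print "descendants: @tags ($count)\n";
print "first of b/p is a ", $first->tag, "\n";
```

In the last call the first "p" precedes the "b" in pre-order, so the "p" is what comes back, regardless of the order in which the tag names were given.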
769
770   $h->find('tag', ...)
771       This is just an alias to "find_by_tag_name".  (There was once going to
772       be a whole find_* family of methods, but then look_down filled that
773       niche, so there turned out not to be much reason for the verboseness of
774       the name "find_by_tag_name".)
775
776   $h->find_by_attribute('attribute', 'value')
777       In a list context, returns a list of elements at or under $h that have
778       the specified attribute, and have the given value for that attribute.
779       In a scalar context, returns the first (in pre-order traversal of the
780       tree) such element found, or undef if none.
781
782       This method is deprecated in favor of the more expressive "look_down"
783       method, which new code should use instead.
784
785   $h->look_down( ...criteria... )
786       This starts at $h and looks thru its element descendants (in pre-
787       order), looking for elements matching the criteria you specify.  In
788       list context, returns all elements that match all the given criteria;
789       in scalar context, returns the first such element (or undef, if nothing
790       matched).
791
792       There are three kinds of criteria you can specify:
793
794       (attr_name, attr_value)
795           This means you're looking for an element with that value for that
796           attribute.  Example: "alt", "pix!".  Consider that you can search
797           on internal attribute values too: "_tag", "p".
798
799       (attr_name, qr/.../)
800           This means you're looking for an element whose value for that
801           attribute matches the specified Regexp object.
802
803       a coderef
804           This means you're looking for elements where
805           coderef->(each_element) returns true.  Example:
806
807             my @wide_pix_images
808               = $h->look_down(
809                               "_tag", "img",
810                               "alt", "pix!",
811                               sub { $_[0]->attr('width') > 350 }
812                              );
813
814       Note that "(attr_name, attr_value)" and "(attr_name, qr/.../)" criteria
815       are almost always faster than coderef criteria, so should presumably be
816       put before them in your list of criteria.  That is, in the example
817       above, the sub ref is called only for elements that have already passed
818       the criteria of having a "_tag" attribute with value "img", and an
819       "alt" attribute with value "pix!".  If the coderef were first, it would
820       be called on every element, and then what elements pass that criterion
821       (i.e., elements for which the coderef returned true) would be checked
822       for their "_tag" and "alt" attributes.
823
824       Note that comparison of string attribute-values against the string
825       value in "(attr_name, attr_value)" is case-INsensitive!  A criterion of
826       "('align', 'right')" will match an element whose "align" value is
827       "RIGHT", or "right" or "rIGhT", etc.
828
829       Note also that "look_down" considers "" (empty-string) and undef to be
830       different things, in attribute values.  So this:
831
832         $h->look_down("alt", "")
833
834       will find elements with an "alt" attribute, but where the value for the
835       "alt" attribute is "".  But this:
836
837         $h->look_down("alt", undef)
838
839       is the same as:
840
841         $h->look_down(sub { !defined($_[0]->attr('alt')) } )
842
843       That is, it finds elements that do not have an "alt" attribute at all
844       (or that do have an "alt" attribute, but with a value of undef -- which
845       is not normally possible).
846
847       Note that when you give several criteria, this is taken to mean you're
848       looking for elements that match all your criterion, not just any of
849       them.  In other words, there is an implicit "and", not an "or".  So if
850       you wanted to express that you wanted to find elements with a "name"
851       attribute with the value "foo" or with an "id" attribute with the value
852       "baz", you'd have to do it like:
853
854         @them = $h->look_down(
855           sub {
856             # the lcs are to fold case
857             lc($_[0]->attr('name')) eq 'foo'
858             or lc($_[0]->attr('id')) eq 'baz'
859           }
860         );
861
862       Coderef criteria are more expressive than "(attr_name, attr_value)" and
863       "(attr_name, qr/.../)" criteria, and all "(attr_name, attr_value)" and
864       "(attr_name, qr/.../)" criteria could be expressed in terms of
865       coderefs.  However, "(attr_name, attr_value)" and "(attr_name,
866       qr/.../)" criteria are a convenient shorthand.  (In fact, "look_down"
867       itself is basically "shorthand" too, since anything you can do with
868       "look_down" you could do by traversing the tree, either with the
869       "traverse" method or with a routine of your own.  However, "look_down"
870       often makes for very concise and clear code.)
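To tie the three kinds of criteria together, here is a hedged sketch that runs each one against a tiny tree. The image file names and attribute values are invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $body = HTML::Element->new_from_lol(
    ['body',
        ['img', {src => 'a.png', alt => 'pix!', width => 400}],
        ['img', {src => 'b.png', alt => '',     width => 100}],
        ['img', {src => 'c.gif'}],
    ]
);

# (attr_name, attr_value): exact string criteria
my @pix = $body->look_down(_tag => 'img', alt => 'pix!');

# (attr_name, qr/.../): regexp criterion
my @png = $body->look_down(_tag => 'img', src => qr/\.png$/);

# coderef criterion, placed last so it runs only on "img" elements;
# the "|| 0" guards against images with no width attribute
my @wide = $body->look_down(_tag => 'img',
                            sub { ($_[0]->attr('width') || 0) > 350 });

# empty string versus undef
my @empty_alt = $body->look_down(alt => '');
my @no_alt    = $body->look_down(_tag => 'img', alt => undef);

printf "%d pix, %d png, %d wide, %d empty alt, %d without alt\n",
    scalar @pix, scalar @png, scalar @wide, scalar @empty_alt, scalar @no_alt;
```

The "_tag" criterion in the last call matters: without it, the "body" element itself would also match, since it has no "alt" attribute either.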
871
872   $h->look_up( ...criteria... )
873       This is identical to $h->look_down, except that whereas $h->look_down
874       basically scans over the list:
875
876          ($h, $h->descendants)
877
878       $h->look_up instead scans over the list
879
880          ($h, $h->lineage)
881
882       So, for example, this returns all ancestors of $h (possibly including
883       $h itself) that are "td" elements with an "align" attribute with a
884       value of "right" (or "RIGHT", etc.):
885
886          $h->look_up("_tag", "td", "align", "right");
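A short sketch of look_up in scalar context, assuming a hand-built table (the "id" value is made up for the example):

```perl
use strict;
use warnings;
use HTML::Element;

my $table = HTML::Element->new_from_lol(
    ['table', {id => 't1'},
        ['tr', ['td', {align => 'right'}, ['em', 'hi']]],
    ]
);
my $em = $table->look_down(_tag => 'em');

my $cell    = $em->look_up(_tag => 'td', align => 'right');  # nearest matching ancestor
my $outer   = $em->look_up(_tag => 'table');
my $itself  = $em->look_up(_tag => 'em');    # the scan starts at $em itself
my $nothing = $em->look_up(_tag => 'form');  # undef: no such ancestor

print "enclosing cell: ", $cell->tag, ", table id: ", $outer->attr('id'), "\n";
```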
887
888   $h->traverse(...options...)
889       Lengthy discussion of HTML::Element's unnecessary and confusing
890       "traverse" method has been moved to a separate file:
891       HTML::Element::traverse
892
893   $h->attr_get_i('attribute')
894       In list context, returns a list consisting of the values of the given
895       attribute for $self and for all its ancestors starting from $self and
896       working its way up.  Nodes with no such attribute are skipped.
897       ("attr_get_i" stands for "attribute get, with inheritance".)  In scalar
898       context, returns the first such value, or undef if none.
899
900       Consider a document consisting of:
901
902          <html lang='i-klingon'>
903            <head><title>Pati Pata</title></head>
904            <body>
905              <h1 lang='la'>Stuff</h1>
906              <p lang='es-MX' align='center'>
907                Foo bar baz <cite>Quux</cite>.
908              </p>
909              <p>Hooboy.</p>
910            </body>
911          </html>
912
913       If $h is the "cite" element, $h->attr_get_i("lang") in list context
914       will return the list ('es-MX', 'i-klingon').  In scalar context, it
915       will return the value 'es-MX'.
916
917       If you call with multiple attribute names...
918
919   $h->attr_get_i('a1', 'a2', 'a3')
920       ...in list context, this will return a list consisting of the values of
921       these attributes which exist in $self and its ancestors.  In scalar
922       context, this returns the first value (i.e., the value of the first
923       existing attribute from the first element that has any of the
924       attributes listed).  So, in the above example,
925
926         $h->attr_get_i('lang', 'align');
927
928       will return:
929
930          ('es-MX', 'center', 'i-klingon') # in list context
931         or
932          'es-MX' # in scalar context.
933
934       But note that this:
935
936        $h->attr_get_i('align', 'lang');
937
938       will return:
939
940          ('center', 'es-MX', 'i-klingon') # in list context
941         or
942          'center' # in scalar context.
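The attr_get_i examples above can be reproduced with a runnable sketch. The tree below is a hand-built approximation of the sample document, constructed with new_from_lol rather than parsed.

```perl
use strict;
use warnings;
use HTML::Element;

my $html = HTML::Element->new_from_lol(
    ['html', {lang => 'i-klingon'},
        ['head', ['title', 'Pati Pata']],
        ['body',
            ['h1', {lang => 'la'}, 'Stuff'],
            ['p', {lang => 'es-MX', align => 'center'},
                'Foo bar baz ', ['cite', 'Quux'], '.'],
            ['p', 'Hooboy.'],
        ],
    ]
);
my $cite = $html->look_down(_tag => 'cite');

my @langs   = $cite->attr_get_i('lang');           # list context: every inherited value
my $lang    = $cite->attr_get_i('lang');           # scalar context: nearest value only
my @lang_al = $cite->attr_get_i('lang', 'align');  # per element, in the order asked
my @al_lang = $cite->attr_get_i('align', 'lang');

print "langs: @langs\nnearest: $lang\n";
print "lang,align: @lang_al\nalign,lang: @al_lang\n";
```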
943
944   $h->tagname_map()
945       Scans across $h and all its descendants, and makes a hash (a reference
946       to which is returned) where each entry consists of a key that's a tag
947       name, and a value that's a reference to a list to all elements that
948       have that tag name.  I.e., this method returns:
949
950          {
951            # Across $h and all descendants...
952            'a'   => [ ...list of all 'a'   elements... ],
953            'em'  => [ ...list of all 'em'  elements... ],
954            'img' => [ ...list of all 'img' elements... ],
955          }
956
957       (There are entries in the hash for only those tagnames that occur
958       at/under $h -- so if there's no "img" elements, there'll be no "img"
959       entry in the hashr(ref) returned.)
960
961       Example usage:
962
963           my $map_r = $h->tagname_map();
964           my @heading_tags = sort grep m/^h\d$/s, keys %$map_r;
965           if(@heading_tags) {
966             print "Heading levels used: @heading_tags\n";
967           } else {
968             print "No headings.\n"
969           }
970
971   $h->extract_links() or $h->extract_links(@wantedTypes)
972       Returns links found by traversing the element and all of its children
973       and looking for attributes (like "href" in an "a" element, or "src" in
974       an "img" element) whose values represent links.  The return value is a
975       reference to an array.  Each element of the array is a reference to an
976       array with four items: the link-value, the element that has the
977       attribute with that link-value, the name of that attribute, and the
978       tagname of that element.  (Example: "['http://www.suck.com/',
979       $elem_obj, 'href', 'a']".)  You may or may not end up using the
980       element itself -- for some purposes, you may use only the link value.
981
982       You might specify that you want to extract links from just some kinds
983       of elements (instead of the default, which is to extract links from all
984       the kinds of elements known to have attributes whose values represent
985       links).  For instance, if you want to extract links from only "a" and
986       "img" elements, you could code it like this:
987
988         for (@{  $e->extract_links('a', 'img')  }) {
989             my($link, $element, $attr, $tag) = @$_;
990             print
991               "Hey, there's a $tag that links to ",
992               $link, ", in its $attr attribute, at ",
993               $element->address(), ".\n";
994         }
995
996   $h->simplify_pres
997       In text bits under PRE elements that are at/under $h, this routine
998       nativizes all newlines, and expands all tabs.
999
1000       That is, if you read a file with lines delimited by "\cm\cj"'s, the
1001       text under PRE areas will have "\cm\cj"'s instead of "\n"'s. Calling
1002       $h->simplify_pres on such a tree will turn "\cm\cj"'s into
1003       "\n"'s.
1004
1005       Tabs are expanded to however many spaces it takes to get to the next
1006       8th column -- the usual way of expanding them.
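A minimal sketch of the effect, assuming a single PRE element whose text contains one tab and one "\cm\cj" line ending (the sample string is invented for the example):

```perl
use strict;
use warnings;
use HTML::Element;

# One text segment: "a", a tab, "b", a CR+LF pair, then "c"
my $pre = HTML::Element->new_from_lol(
    ['pre', "a\tb" . "\cm\cj" . "c"]
);

$pre->simplify_pres;

my ($text) = $pre->content_list;
(my $visible = $text) =~ s/ /./g;    # show spaces as dots for display only
print "$visible\n";
```

Here the tab follows a single character, so it should expand to seven spaces to reach column 8, and the "\cm\cj" pair should become a single "\n".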
1007
1008   $h->same_as($i)
1009       Returns true if $h and $i are both elements representing the same tree
1010       of elements, each with the same tag name, with the same explicit
1011       attributes (i.e., not counting attributes whose names start with "_"),
1012       and with the same content (textual, comments, etc.).
1013
1014       Sameness of descendant elements is tested, recursively, with
1015       "$child1->same_as($child_2)", and sameness of text segments is tested
1016       with "$segment1 eq $segment2".
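As a sketch of what same_as does and does not compare, the following clones a small treelet and then alters the clone. The attribute names "class" and "_seen" are arbitrary choices for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $x = HTML::Element->new_from_lol(
    ['p', {class => 'note'}, 'Hi ', ['b', 'there']]
);
my $y = $x->clone;

my $equal_clone = $x->same_as($y) ? 1 : 0;     # a clone starts out the same

$y->attr('_seen', 1);                          # internal attribute: ignored
my $equal_internal = $x->same_as($y) ? 1 : 0;

$y->attr('class', 'warning');                  # explicit attribute: significant
my $equal_changed = $x->same_as($y) ? 1 : 0;

print "clone=$equal_clone internal=$equal_internal changed=$equal_changed\n";
```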
1017
1018   $h = HTML::Element->new_from_lol(ARRAYREF)
1019       Recursively constructs a tree of nodes, based on the (non-cyclic) data
1020       structure represented by ARRAYREF, where that is a reference to an
1021       array of arrays (of arrays (of arrays (etc.))).
1022
1023       In each arrayref in that structure, different kinds of values are
1024       treated as follows:
1025
1026       ·   Arrayrefs
1027
1028           Arrayrefs are considered to designate a sub-tree representing
1029           children for the node constructed from the current arrayref.
1030
1031       ·   Hashrefs
1032
1033           Hashrefs are considered to contain attribute-value pairs to add to
1034           the element to be constructed from the current arrayref.
1035
1036       ·   Text segments
1037
1038           Text segments at the start of any arrayref will be considered to
1039           specify the name of the element to be constructed from the current
1040           arrayref; all other text segments will be considered to specify
1041           text segments as children for the current arrayref.
1042
1043       ·   Elements
1044
1045           Existing element objects are either inserted into the treelet
1046           constructed, or clones of them are.  That is, when the lol-tree is
1047           being traversed and elements constructed based on what's in it, if an
1048           existing element object is found, if it has no parent, then it is
1049           added directly to the treelet constructed; but if it has a parent,
1050           then "$that_node->clone" is added to the treelet at the appropriate
1051           place.
1052
1053       An example will hopefully make this more obvious:
1054
1055         my $h = HTML::Element->new_from_lol(
1056           ['html',
1057             ['head',
1058               [ 'title', 'I like stuff!' ],
1059             ],
1060             ['body',
1061               {'lang', 'en-JP', _implicit => 1},
1062               'stuff',
1063               ['p', 'um, p < 4!', {'class' => 'par123'}],
1064               ['div', {foo => 'bar'}, '123'],
1065             ]
1066           ]
1067         );
1068         $h->dump;
1069
1070       Will print this:
1071
1072         <html> @0
1073           <head> @0.0
1074             <title> @0.0.0
1075               "I like stuff!"
1076           <body lang="en-JP"> @0.1 (IMPLICIT)
1077             "stuff"
1078             <p class="par123"> @0.1.1
1079               "um, p < 4!"
1080             <div foo="bar"> @0.1.2
1081               "123"
1082
1083       And printing $h->as_HTML will give something like:
1084
1085         <html><head><title>I like stuff!</title></head>
1086         <body lang="en-JP">stuff<p class="par123">um, p &lt; 4!
1087         <div foo="bar">123</div></body></html>
1088
1089       You can even do fancy things with "map":
1090
1091         $body->push_content(
1092           # push_content implicitly calls new_from_lol on arrayrefs...
1093           ['br'],
1094           ['blockquote',
1095             ['h2', 'Pictures!'],
1096             map ['p', $_],
1097             $body2->look_down("_tag", "img"),
1098               # images, to be copied from that other tree.
1099           ],
1100           # and more stuff:
1101           ['ul',
1102             map ['li', ['a', {'href'=>"$_.png"}, $_ ] ],
1103             qw(Peaches Apples Pears Mangos)
1104           ],
1105         );
1106
1107   @elements = HTML::Element->new_from_lol(ARRAYREFS)
1108       Constructs several elements, by calling new_from_lol for every arrayref
1109       in the ARRAYREFS list.
1110
1111         @elements = HTML::Element->new_from_lol(
1112           ['hr'],
1113           ['p', 'And there, on the door, was a hook!'],
1114         );
1115          # constructs two elements.
1116
1117   $h->objectify_text()
1118       This turns any text nodes under $h from mere text segments (strings)
1119       into real objects, pseudo-elements with a tag-name of "~text", and the
1120       actual text content in an attribute called "text".  (For a discussion
1121       of pseudo-elements, see the "tag" method, far above.)  This method is
1122       provided because, for some purposes, it is convenient or necessary to
1123       be able, for a given text node, to ask what element is its parent; and
1124       clearly this is not possible if a node is just a text string.
1125
1126       Note that these "~text" objects are not recognized as text nodes by
1127       methods like as_text.  Presumably you will want to call
1128       $h->objectify_text, perform whatever task that you needed that for, and
1129       then call $h->deobjectify_text before calling anything like
1130       $h->as_text.
1131
1132   $h->deobjectify_text()
1133       This undoes the effect of $h->objectify_text.  That is, it takes any
1134       "~text" pseudo-elements in the tree at/under $h, and deletes each one,
1135       replacing each with the content of its "text" attribute.
1136
1137       Note that if $h itself is a "~text" pseudo-element, it will be
1138       destroyed -- a condition you may need to treat specially in your
1139       calling code (since it means you can't very well do anything with $h
1140       after that).  So that you can detect that condition, if $h is itself a
1141       "~text" pseudo-element, then this method returns the value of the
1142       "text" attribute, which should be a defined value; in all other cases,
1143       it returns undef.
1144
1145       (This method assumes that no "~text" pseudo-element has any children.)
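A round trip through objectify_text and deobjectify_text might look like the following sketch, which uses the temporary "~text" objects to ask each text node for its parent. The sample paragraph is invented.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(
    ['p', 'Hello ', ['b', 'big'], ' world']
);

$p->objectify_text;

# Text segments are now "~text" pseudo-elements, so they have parents.
my @text_nodes  = $p->look_down(_tag => '~text');
my @parent_tags = map { $_->parent->tag } @text_nodes;
my @texts       = map { $_->attr('text') } @text_nodes;

$p->deobjectify_text;    # back to plain strings before calling as_text

my $plain = $p->as_text;
print "parents: @parent_tags\ntext: $plain\n";
```

Don't keep using the objects in @text_nodes after calling deobjectify_text, since that call deletes the pseudo-elements.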
1146
1147   $h->number_lists()
1148       For every UL, OL, DIR, and MENU element at/under $h, this sets a
1149       "_bullet" attribute for every child LI element.  For LI children of an
1150       OL, the "_bullet" attribute's value will be something like "4.", "d.",
1151       "D.", "IV.", or "iv.", depending on the OL element's "type" attribute.
1152       LI children of a UL, DIR, or MENU get their "_bullet" attribute set to
1153       "*".  There should be no other LIs (i.e., except as children of OL, UL,
1154       DIR, or MENU elements), and if there are, they are unaffected.
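A small sketch of number_lists on one OL (with no "type" attribute, so plain numbers are assumed) and one UL; the list items are made up:

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new_from_lol(
    ['div',
        ['ol', ['li', 'first'], ['li', 'second']],
        ['ul', ['li', 'apple'], ['li', 'pear']],
    ]
);

$div->number_lists;

# Collect the "_bullet" of every LI, in document order.
my @bullets = map { $_->attr('_bullet') } $div->look_down(_tag => 'li');
print "@bullets\n";
```

Since "_bullet" is an internal attribute (its name starts with "_"), it isn't written out as an HTML attribute by as_HTML; it is there for your own formatting code to read.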
1155
1156   $h->has_insane_linkage
1157       This method is for testing whether this element or the elements under
1158       it have linkage attributes (_parent and _content) whose values are
1159       deeply aberrant: if there are undefs in a content list; if an element
1160       appears in the content lists of more than one element; if the _parent
1161       attribute of an element doesn't match its actual parent; or if an
1162       element appears as its own descendant (i.e., if there is a cyclicity in
1163       the tree).
1164
1165       This returns empty list (or false, in scalar context) if the subtree's
1166       linkage attributes are sane; otherwise it returns two items (or true, in
1167       scalar context): the element where the error occurred, and a string
1168       describing the error.
1169
1170       This method is provided mainly for debugging and troubleshooting --
1171       it should be quite impossible for any document constructed via
1172       HTML::TreeBuilder to parse into a non-sane tree (since it's not the
1173       content of the tree per se that's in question, but whether the tree in
1174       memory was properly constructed); and it should be impossible for you
1175       to produce an insane tree just thru reasonable use of normal documented
1176       structure-modifying methods.  But if you're constructing your own
1177       trees, and your program is going into infinite loops during calls to
1178       traverse() or any of the secondary structural methods, then as part of
1179       debugging, consider calling has_insane_linkage on the tree.
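A debugging sketch, assuming a tiny hand-built tree: it checks a sane tree first, then deliberately corrupts the content list (something normal code should never do) to show the two-item report.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(['div', ['p', 'ok']]);

my @sane = $tree->has_insane_linkage;    # empty list when linkage is fine

# Deliberate corruption, for demonstration only: an undef in a content list.
push @{ $tree->content }, undef;

my ($where, $why) = $tree->has_insane_linkage;
print "Problem at <", $where->tag, ">: $why\n" if $where;
```

The exact wording of the error string is not specified in this document, so treat it as a human-readable diagnostic rather than something to match against.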
1180

BUGS

1182       * If you want to free the memory associated with a tree built of
1183       HTML::Element nodes, then you will have to delete it explicitly.  See
1184       the $h->delete method, above.
1185
1186       * There's almost nothing to stop you from making a "tree" with
1187       cyclicities (loops) in it, which could, for example, make the traverse
1188       method go into an infinite loop.  So don't make cyclicities!  (If all
1189       you're doing is parsing HTML files, and looking at the resulting trees,
1190       this will never be a problem for you.)
1191
1192       * There's no way to represent comments or processing directives in a
1193       tree with HTML::Elements.  Not yet, at least.
1194
1195       * There's (currently) nothing to stop you from using an undefined value
1196       as a text segment.  If you're running under "perl -w", however, this
1197       may make HTML::Element's code produce a slew of warnings.
1198

NOTES ON SUBCLASSING

1200       You are welcome to derive subclasses from HTML::Element, but you should
1201       be aware that the code in HTML::Element makes certain assumptions about
1202       elements (and I'm using "element" to mean ONLY an object of class
1203       HTML::Element, or of a subclass of HTML::Element):
1204
1205       * The value of an element's _parent attribute must either be undef or
1206       otherwise false, or must be an element.
1207
1208       * The value of an element's _content attribute must either be undef or
1209       otherwise false, or a reference to an (unblessed) array.  The array may
1210       be empty; but if it has items, they must ALL be either mere strings
1211       (text segments), or elements.
1212
1213       * The value of an element's _tag attribute should, at least, be a
1214       string of printable characters.
1215
1216       Moreover, bear these rules in mind:
1217
1218       * Do not break encapsulation on objects.  That is, access their
1219       contents only thru $obj->attr or more specific methods.
1220
1221       * You should think twice before completely overriding any of the
1222       methods that HTML::Element provides.  (Overriding with a method that
1223       calls the superclass method is not so bad, though.)
1224

COPYRIGHT

1230       Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy
1231       Lester, 2006 Pete Krawczyk.
1232
1233       This library is free software; you can redistribute it and/or modify it
1234       under the same terms as Perl itself.
1235
1236       This program is distributed in the hope that it will be useful, but
1237       without any warranty; without even the implied warranty of
1238       merchantability or fitness for a particular purpose.
1239

AUTHOR

1241       Currently maintained by Pete Krawczyk <petek@cpan.org>.
1242
1243       Original authors: Gisle Aas, Sean Burke and Andy Lester.
1244
1245       Thanks to Mark-Jason Dominus for a POD suggestion.
1246
1247
1248
1249perl v5.10.1                      2010-11-12                  HTML::Element(3)