HTML::Element(3)      User Contributed Perl Documentation     HTML::Element(3)

NAME

       HTML::Element - Class for objects that represent HTML elements

VERSION

       Version 4.1

SYNOPSIS

           use HTML::Element;
           $a = HTML::Element->new('a', href => 'http://www.perl.com/');
           $a->push_content("The Perl Homepage");

           $tag = $a->tag;
           print "$tag starts out as:",  $a->starttag, "\n";
           print "$tag ends as:",  $a->endtag, "\n";
           print "$tag\'s href attribute is: ", $a->attr('href'), "\n";

           $links_r = $a->extract_links();
           print "Hey, I found ", scalar(@$links_r), " links.\n";

           print "And that, as HTML, is: ", $a->as_HTML, "\n";
           $a = $a->delete;

DESCRIPTION

       (This class is part of the HTML::Tree dist.)

       Objects of the HTML::Element class can be used to represent elements of
       HTML document trees.  These objects have attributes, notably attributes
       that designate each element's parent and content.  The content is an
       array of text segments and other HTML::Element objects.  A tree with
       HTML::Element objects as nodes can represent the syntax tree for an
       HTML document.

HOW WE REPRESENT TREES

       Consider this HTML document:

         <html lang='en-US'>
           <head>
             <title>Stuff</title>
             <meta name='author' content='Jojo'>
           </head>
           <body>
             <h1>I like potatoes!</h1>
           </body>
         </html>
49
50       Building a syntax tree out of it makes a tree-structure in memory that
51       could be diagrammed as:
52
53                            html (lang='en-US')
54                             / \
55                           /     \
56                         /         \
57                       head        body
58                      /\               \
59                    /    \               \
60                  /        \               \
61                title     meta              h1
62                 |       (name='author',     |
63              "Stuff"    content='Jojo')    "I like potatoes"
64
65       This is the traditional way to diagram a tree, with the "root" at the
66       top, and it's this kind of diagram that people have in mind when they
67       say, for example, that "the meta element is under the head element
68       instead of under the body element".  (The same is also said with
69       "inside" instead of "under" -- the use of "inside" makes more sense
70       when you're looking at the HTML source.)
71
72       Another way to represent the above tree is with indenting:
73
74         html (attributes: lang='en-US')
75           head
76             title
77               "Stuff"
78             meta (attributes: name='author' content='Jojo')
79           body
80             h1
81               "I like potatoes"
82
83       Incidentally, diagramming with indenting works much better for very
84       large trees, and is easier for a program to generate.  The
85       "$tree->dump" method uses indentation just that way.
86
87       However you diagram the tree, it's stored the same in memory -- it's a
88       network of objects, each of which has attributes like so:
89
         element #1:  _tag: 'html'
                      _parent: none
                      _content: [element #2, element #5]
                      lang: 'en-US'

         element #2:  _tag: 'head'
                      _parent: element #1
                      _content: [element #3, element #4]

         element #3:  _tag: 'title'
                      _parent: element #2
                      _content: [text segment "Stuff"]

         element #4:  _tag: 'meta'
                      _parent: element #2
                      _content: none
                      name: 'author'
                      content: 'Jojo'

         element #5:  _tag: 'body'
                      _parent: element #1
                      _content: [element #6]

         element #6:  _tag: 'h1'
                      _parent: element #5
                      _content: [text segment "I like potatoes"]
116
117       The "treeness" of the tree-structure that these elements comprise is
118       not an aspect of any particular object, but is emergent from the
119       relatedness attributes (_parent and _content) of these element-objects
120       and from how you use them to get from element to element.
121
122       While you could access the content of a tree by writing code that says
123       "access the 'src' attribute of the root's first child's seventh child's
124       third child", you're more likely to have to scan the contents of a
125       tree, looking for whatever nodes, or kinds of nodes, you want to do
126       something with.  The most straightforward way to look over a tree is to
127       "traverse" it; an HTML::Element method ("$h->traverse") is provided for
128       this purpose; and several other HTML::Element methods are based on it.
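
       As a rough illustration of how "_parent" and "_content" links make up
       a tree, the sketch below builds the sample document by hand and prints
       it in the indented form shown earlier.  It assumes the HTML::Tree
       distribution is installed; "outline" is a hypothetical helper written
       for this example, not a method of HTML::Element.

```perl
use strict;
use warnings;
use HTML::Element;

# Build the sample document node by node.
my $html  = HTML::Element->new('html', lang => 'en-US');
my $head  = HTML::Element->new('head');
my $title = HTML::Element->new('title');
my $meta  = HTML::Element->new('meta', name => 'author', content => 'Jojo');
my $body  = HTML::Element->new('body');
my $h1    = HTML::Element->new('h1');

$title->push_content('Stuff');
$h1->push_content('I like potatoes!');
$head->push_content($title, $meta);
$body->push_content($h1);
$html->push_content($head, $body);

# outline() is a hypothetical helper, not part of HTML::Element.
# It recurses through content_list and indents by depth.
sub outline {
    my ($node, $depth) = @_;
    my $indent = '  ' x $depth;

    # Text segments are plain strings; elements are objects.
    return qq{$indent"$node"\n} unless ref $node;

    return join '', $indent . $node->tag . "\n",
        map { outline($_, $depth + 1) } $node->content_list;
}

my $outline = outline($html, 0);
print $outline;
```

       Each child was linked to its parent by "push_content", so
       "$h1->parent" is $body without any explicit bookkeeping.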
129
130       (For everything you ever wanted to know about trees, and then some, see
131       Niklaus Wirth's Algorithms + Data Structures = Programs or Donald
132       Knuth's The Art of Computer Programming, Volume 1.)
133
134   Version
135       Why is this a sub?
136
137   ABORT OK PRUNE PRUNE_SOFTLY PRUNE_UP
138       Constants for signalling back to the traverser
139

BASIC METHODS

141   $h = HTML::Element->new('tag', 'attrname' => 'value', ... )
142       This constructor method returns a new HTML::Element object.  The tag
143       name is a required argument; it will be forced to lowercase.
144       Optionally, you can specify other initial attributes at object creation
145       time.
146
147   $h->attr('attr') or $h->attr('attr', 'value')
148       Returns (optionally sets) the value of the given attribute of $h.  The
149       attribute name (but not the value, if provided) is forced to lowercase.
150       If trying to read the value of an attribute not present for this
151       element, the return value is undef.  If setting a new value, the old
152       value of that attribute is returned.
153
       If methods are provided for accessing an attribute (like "$h->tag" for
       "_tag", "$h->content_list", etc. below), use those instead of calling
       "$h->attr", whether for reading or setting.
157
158       Note that setting an attribute to "undef" (as opposed to "", the empty
159       string) actually deletes the attribute.
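
       The reading, setting and deleting behaviors described above can be
       sketched as follows (a minimal example, assuming HTML::Tree is
       installed; the URLs and variable names are illustrative only):

```perl
use strict;
use warnings;
use HTML::Element;

my $link = HTML::Element->new('a', href => 'http://www.perl.com/');

# Attribute names are forced to lowercase, so 'HREF' reads 'href'.
my $href = $link->attr('HREF');

# Setting a new value returns the old one.
my $old = $link->attr('href', 'http://www.cpan.org/');

# Reading an attribute that is not present yields undef.
my $missing = $link->attr('title');

# Setting an attribute to undef deletes it.
$link->attr('href', undef);
my $after_delete = $link->attr('href');
```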
160
161   $h->tag() or $h->tag('tagname')
162       Returns (optionally sets) the tag name (also known as the generic
163       identifier) for the element $h.  In setting, the tag name is always
164       converted to lower case.
165
166       There are four kinds of "pseudo-elements" that show up as HTML::Element
167       objects:
168
169       Comment pseudo-elements
170           These are element objects with a "$h->tag" value of "~comment", and
171           the content of the comment is stored in the "text" attribute
172           ("$h->attr("text")").  For example, parsing this code with
173           HTML::TreeBuilder...
174
175             <!-- I like Pie.
176                Pie is good
177             -->
178
179           produces an HTML::Element object with these attributes:
180
181             "_tag",
182             "~comment",
183             "text",
184             " I like Pie.\n     Pie is good\n  "
185
186       Declaration pseudo-elements
187           Declarations (rarely encountered) are represented as HTML::Element
188           objects with a tag name of "~declaration", and content in the
189           "text" attribute.  For example, this:
190
191             <!DOCTYPE foo>
192
193           produces an element whose attributes include:
194
195             "_tag", "~declaration", "text", "DOCTYPE foo"
196
197       Processing instruction pseudo-elements
198           PIs (rarely encountered) are represented as HTML::Element objects
199           with a tag name of "~pi", and content in the "text" attribute.  For
200           example, this:
201
202             <?stuff foo?>
203
204           produces an element whose attributes include:
205
206             "_tag", "~pi", "text", "stuff foo?"
207
208           (assuming a recent version of HTML::Parser)
209
210       ~literal pseudo-elements
211           These objects are not currently produced by HTML::TreeBuilder, but
212           can be used to represent a "super-literal" -- i.e., a literal you
213           want to be immune from escaping.  (Yes, I just made that term up.)
214
215           That is, this is useful if you want to insert code into a tree that
216           you plan to dump out with "as_HTML", where you want, for some
217           reason, to suppress "as_HTML"'s normal behavior of amp-quoting text
218           segments.
219
220           For example, this:
221
222             my $literal = HTML::Element->new('~literal',
223               'text' => 'x < 4 & y > 7'
224             );
225             my $span = HTML::Element->new('span');
226             $span->push_content($literal);
227             print $span->as_HTML;
228
229           prints this:
230
231             <span>x < 4 & y > 7</span>
232
233           Whereas this:
234
235             my $span = HTML::Element->new('span');
236             $span->push_content('x < 4 & y > 7');
237               # normal text segment
238             print $span->as_HTML;
239
240           prints this:
241
242             <span>x &lt; 4 &amp; y &gt; 7</span>
243
244           Unless you're inserting lots of pre-cooked code into existing
245           trees, and dumping them out again, it's not likely that you'll find
246           "~literal" pseudo-elements useful.
247
248   $h->parent() or $h->parent($new_parent)
249       Returns (optionally sets) the parent (aka "container") for this
250       element.  The parent should either be undef, or should be another
251       element.
252
253       You should not use this to directly set the parent of an element.
254       Instead use any of the other methods under "Structure-Modifying
255       Methods", below.
256
257       Note that not($h->parent) is a simple test for whether $h is the root
258       of its subtree.
259
260   $h->content_list()
261       Returns a list of the child nodes of this element -- i.e., what nodes
262       (elements or text segments) are inside/under this element. (Note that
263       this may be an empty list.)
264
265       In a scalar context, this returns the count of the items, as you may
266       expect.
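
       A small sketch of the two calling contexts, assuming HTML::Tree is
       installed (the element and variable names are made up for the
       example):

```perl
use strict;
use warnings;
use HTML::Element;

my $p  = HTML::Element->new('p');
my $em = HTML::Element->new('em');
$em->push_content('two');
$p->push_content('one ', $em, ' three');

# List context: the child nodes themselves.
my @children = $p->content_list;

# Text segments are plain strings, elements are references.
my @texts = grep { !ref $_ } @children;

# Scalar context: the number of child nodes.
my $count = $p->content_list;

# An element with no content gives an empty list, so a count of 0.
my $empty_count = HTML::Element->new('br')->content_list;
```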
267
268   $h->content()
269       This somewhat deprecated method returns the content of this element;
270       but unlike content_list, this returns either undef (which you should
271       understand to mean no content), or a reference to the array of content
272       items, each of which is either a text segment (a string, i.e., a
273       defined non-reference scalar value), or an HTML::Element object.  Note
274       that even if an arrayref is returned, it may be a reference to an empty
275       array.
276
277       While older code should feel free to continue to use "$h->content", new
278       code should use "$h->content_list" in almost all conceivable cases.  It
279       is my experience that in most cases this leads to simpler code anyway,
280       since it means one can say:
281
282           @children = $h->content_list;
283
284       instead of the inelegant:
285
286           @children = @{$h->content || []};
287
288       If you do use "$h->content" (or "$h->content_array_ref"), you should
289       not use the reference returned by it (assuming it returned a reference,
290       and not undef) to directly set or change the content of an element or
291       text segment!  Instead use content_refs_list or any of the other
292       methods under "Structure-Modifying Methods", below.
293
294   $h->content_array_ref()
295       This is like "content" (with all its caveats and deprecations) except
296       that it is guaranteed to return an array reference.  That is, if the
297       given node has no "_content" attribute, the "content" method would
298       return that undef, but "content_array_ref" would set the given node's
299       "_content" value to "[]" (a reference to a new, empty array), and
300       return that.
301
302   $h->content_refs_list
303       This returns a list of scalar references to each element of $h's
304       content list.  This is useful in case you want to in-place edit any
305       large text segments without having to get a copy of the current value
306       of that segment value, modify that copy, then use the "splice_content"
307       to replace the old with the new.  Instead, here you can in-place edit:
308
309           foreach my $item_r ($h->content_refs_list) {
310               next if ref $$item_r;
311               $$item_r =~ s/honour/honor/g;
312           }
313
       You could currently achieve the same effect with:
315
316           foreach my $item (@{ $h->content_array_ref }) {
317               # deprecated!
318               next if ref $item;
319               $item =~ s/honour/honor/g;
320           }
321
322       ...except that using the return value of "$h->content" or
323       "$h->content_array_ref" to do that is deprecated, and just might stop
324       working in the future.
325
326   $h->implicit() or $h->implicit($bool)
327       Returns (optionally sets) the "_implicit" attribute.  This attribute is
328       a flag that's used for indicating that the element was not originally
329       present in the source, but was added to the parse tree (by
330       HTML::TreeBuilder, for example) in order to conform to the rules of
331       HTML structure.
332
333   $h->pos() or $h->pos($element)
334       Returns (and optionally sets) the "_pos" (for "current position")
335       pointer of $h.  This attribute is a pointer used during some parsing
336       operations, whose value is whatever HTML::Element element at or under
337       $h is currently "open", where "$h->insert_element(NEW)" will actually
338       insert a new element.
339
340       (This has nothing to do with the Perl function called "pos", for
341       controlling where regular expression matching starts.)
342
343       If you set "$h->pos($element)", be sure that $element is either $h, or
344       an element under $h.
345
346       If you've been modifying the tree under $h and are no longer sure
347       "$h->pos" is valid, you can enforce validity with:
348
349           $h->pos(undef) unless $h->pos->is_inside($h);
350
351   $h->all_attr()
352       Returns all this element's attributes and values, as key-value pairs.
353       This will include any "internal" attributes (i.e., ones not present in
354       the original element, and which will not be represented if/when you
355       call "$h->as_HTML").  Internal attributes are distinguished by the fact
356       that the first character of their key (not value! key!) is an
357       underscore ("_").
358
       Example output of "$h->all_attr()": '_parent', [object value],
       '_tag', 'em', 'lang', 'en-US', '_content', [array-ref value].
361
362   $h->all_attr_names()
363       Like all_attr, but only returns the names of the attributes.
364
       Example output of "$h->all_attr_names()": '_parent', '_tag', 'lang',
       '_content'.
367
368   $h->all_external_attr()
369       Like "all_attr", except that internal attributes are not present.
370
371   $h->all_external_attr_names()
       Like "all_attr_names", except that internal attributes' names are not
       present.
374
375   $h->id() or $h->id($string)
376       Returns (optionally sets to $string) the "id" attribute.
377       "$h->id(undef)" deletes the "id" attribute.
378
379   $h->idf() or $h->idf($string)
380       Just like the "id" method, except that if you call "$h->idf()" and no
381       "id" attribute is defined for this element, then it's set to a likely-
382       to-be-unique value, and returned.  (The "f" is for "force".)
383

STRUCTURE-MODIFYING METHODS

385       These methods are provided for modifying the content of trees by adding
386       or changing nodes as parents or children of other nodes.
387
388   $h->push_content($element_or_text, ...)
       Adds the specified items to the end of the content list of the element
       $h.  The items of content to be added should each be either a text
       segment (a string), an HTML::Element object, or an arrayref.  Arrayrefs
       are fed through "$h->new_from_lol(that_arrayref)" to convert them into
       elements, before being added to the content list of $h.  This means you
       can say concise things like:
395
396         $body->push_content(
397           ['br'],
398           ['ul',
399             map ['li', $_], qw(Peaches Apples Pears Mangos)
400           ]
401         );
402
403       See "new_from_lol" method's documentation, far below, for more
404       explanation.
405
406       The push_content method will try to consolidate adjacent text segments
407       while adding to the content list.  That's to say, if $h's content_list
408       is
409
410         ('foo bar ', $some_node, 'baz!')
411
412       and you call
413
414          $h->push_content('quack?');
415
416       then the resulting content list will be this:
417
418         ('foo bar ', $some_node, 'baz!quack?')
419
420       and not this:
421
422         ('foo bar ', $some_node, 'baz!', 'quack?')
423
424       If that latter is what you want, you'll have to override the feature of
425       consolidating text by using splice_content, as in:
426
427         $h->splice_content(scalar($h->content_list),0,'quack?');
428
       Similarly, if you wanted to add 'Skronk' to the beginning of the
       content list, and called this:

          $h->unshift_content('Skronk');

       then the resulting content list would be this:
435
436         ('Skronkfoo bar ', $some_node, 'baz!')
437
438       and not this:
439
440         ('Skronk', 'foo bar ', $some_node, 'baz!')
441
       What you'd do to get the latter is:
443
444         $h->splice_content(0,0,'Skronk');
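
       The difference between the two approaches can be checked with a short
       script.  This is only a sketch, assuming HTML::Tree is installed; the
       'p' and 'br' elements and the text 'moo' are arbitrary choices.

```perl
use strict;
use warnings;
use HTML::Element;

my $h    = HTML::Element->new('p');
my $node = HTML::Element->new('br');
$h->push_content('foo bar ', $node, 'baz!');

# push_content merges the new text into the trailing text segment.
$h->push_content('quack?');
my $merged_count = $h->content_list;
my $last_segment = ($h->content_list)[-1];

# splice_content at the end of the list adds a separate segment.
$h->splice_content(scalar($h->content_list), 0, 'moo');
my $split_count = $h->content_list;
```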
445
446   $h->unshift_content($element_or_text, ...)
447       Just like "push_content", but adds to the beginning of the $h element's
448       content list.
449
450       The items of content to be added should each be either a text segment
451       (a string), an HTML::Element object, or an arrayref (which is fed thru
452       "new_from_lol").
453
454       The unshift_content method will try to consolidate adjacent text
455       segments while adding to the content list.  See above for a discussion
456       of this.
457
458   $h->splice_content($offset, $length, $element_or_text, ...)
       Detaches the elements from $h's list of content-nodes, starting at
       $offset and continuing for $length items, replacing them with the
       elements of the following list, if any.  Returns the elements (if any)
       removed from the content-list.  If $offset is negative, then it starts
       that far from the end of the array, just like Perl's normal "splice"
       function.  If $length and the following list are omitted, removes
       everything from $offset onward.
466
467       The items of content to be added (if any) should each be either a text
468       segment (a string), an arrayref (which is fed thru "new_from_lol"), or
469       an HTML::Element object that's not already a child of $h.
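
       For instance, the following sketch replaces two list items with one
       and then removes the last item with a negative offset.  It assumes
       HTML::Tree is installed; the list contents are illustrative.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new('ul');
$ul->push_content(map ['li', $_], qw(Peaches Apples Pears Mangos));

# Replace items 1 and 2 with a single new item; the removed
# nodes are returned and no longer have a parent.
my @removed   = $ul->splice_content(1, 2, ['li', 'Plums']);
my @remaining = map { $_->as_text } $ul->content_list;

# With only a (negative) offset, everything from there onward is removed.
my @tail = $ul->splice_content(-1);
```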
470
471   $h->detach()
       This unlinks $h from its parent, by setting its "_parent" attribute to
       undef, and by removing it from the content list of its parent (if it
       had one).  The return value is the parent that was detached from (or
       undef, if $h had no parent to start with).  Note that neither $h nor
       its parent is explicitly destroyed.
477
478   $h->detach_content()
479       This unlinks all of $h's children from $h, and returns them.  Note that
480       these are not explicitly destroyed; for that, you can just use
481       $h->delete_content.
482
483   $h->replace_with( $element_or_text, ... )
484       This replaces $h in its parent's content list with the nodes specified.
485       The element $h (which by then may have no parent) is returned.  This
486       causes a fatal error if $h has no parent.  The list of nodes to insert
487       may contain $h, but at most once.  Aside from that possible exception,
488       the nodes to insert should not already be children of $h's parent.
489
490       Also, note that this method does not destroy $h -- use
491       "$h->replace_with(...)->delete" if you need that.
492
493   $h->preinsert($element_or_text...)
494       Inserts the given nodes right BEFORE $h in $h's parent's content list.
495       This causes a fatal error if $h has no parent.  None of the given nodes
496       should be $h or other children of $h.  Returns $h.
497
498   $h->postinsert($element_or_text...)
499       Inserts the given nodes right AFTER $h in $h's parent's content list.
500       This causes a fatal error if $h has no parent.  None of the given nodes
501       should be $h or other children of $h.  Returns $h.
502
503   $h->replace_with_content()
504       This replaces $h in its parent's content list with its own content.
505       The element $h (which by then has no parent or content of its own) is
506       returned.  This causes a fatal error if $h has no parent.  Also, note
507       that this does not destroy $h -- use "$h->replace_with_content->delete"
508       if you need that.
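
       A typical use is "unwrapping" an element while keeping its text.  The
       sketch below assumes HTML::Tree is installed; the 'p' and 'b' elements
       are just an example.

```perl
use strict;
use warnings;
use HTML::Element;

my $p    = HTML::Element->new('p');
my $bold = HTML::Element->new('b');
$bold->push_content('world');
$p->push_content('Hello ', $bold, '!');

# Replace the <b> element with its own content.
my $returned = $bold->replace_with_content;

my $text          = $p->as_text;
my $element_count = grep { ref $_ } $p->content_list;
my $same_object   = $returned == $bold;
my $is_orphan     = !defined $bold->parent;
```

       The unwrapped element is returned intact but empty and parentless, so
       it can be reused or deleted as needed.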
509
510   $h->delete_content()
511       Clears the content of $h, calling "$h->delete" for each content
512       element.  Compare with "$h->detach_content".
513
514       Returns $h.
515
516   $h->delete() destroy destroy_content
517       Detaches this element from its parent (if it has one) and explicitly
518       destroys the element and all its descendants.  The return value is
519       undef.
520
521       Perl uses garbage collection based on reference counting; when no
522       references to a data structure exist, it's implicitly destroyed --
523       i.e., when no value anywhere points to a given object anymore, Perl
524       knows it can free up the memory that the now-unused object occupies.
525
526       But this fails with HTML::Element trees, because a parent element
527       always holds references to its children, and its children elements hold
528       references to the parent, so no element ever looks like it's not in
529       use.  So, to destroy those elements, you need to call "$h->delete" on
530       the parent.
531
532   $h->clone()
533       Returns a copy of the element (whose children are clones (recursively)
534       of the original's children, if any).
535
       The returned element is parentless.  Any '_pos' attributes present in
       the source element/tree will be absent in the copy.  For that and other
       reasons, the clone of an HTML::TreeBuilder object that's in mid-parse
       (i.e., the head of a tree that HTML::TreeBuilder is elaborating) cannot
       (currently) be used to continue the parse.
541
542       You are free to clone HTML::TreeBuilder trees, just as long as: 1)
543       they're done being parsed, or 2) you don't expect to resume parsing
544       into the clone.  (You can continue parsing into the original; it is
545       never affected.)
546
547   HTML::Element->clone_list(...nodes...)
548       Returns a list consisting of a copy of each node given.  Text segments
549       are simply copied; elements are cloned by calling $it->clone on each of
550       them.
551
552       Note that this must be called as a class method, not as an instance
553       method.  "clone_list" will croak if called as an instance method.  You
554       can also call it like so:
555
556           ref($h)->clone_list(...nodes...)
557
558   $h->normalize_content
559       Normalizes the content of $h -- i.e., concatenates any adjacent text
560       nodes.  (Any undefined text segments are turned into empty-strings.)
561       Note that this does not recurse into $h's descendants.
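
       As a sketch of when this matters: "splice_content" does not merge
       text, so it can leave adjacent text segments that
       "normalize_content" then joins.  This assumes HTML::Tree is installed.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content('foo');

# splice_content adds 'bar' as a separate text segment.
$p->splice_content(1, 0, 'bar');
my $before = $p->content_list;

# normalize_content concatenates the adjacent text segments.
$p->normalize_content;
my $after  = $p->content_list;
my ($only) = $p->content_list;
```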
562
563   $h->delete_ignorable_whitespace()
       This traverses under $h and deletes any text segments that are
       ignorable whitespace.  You should not use this if $h is under a 'pre'
       element.
567
568   $h->insert_element($element, $implicit)
569       Inserts (via push_content) a new element under the element at
570       "$h->pos()".  Then updates "$h->pos()" to point to the inserted
571       element, unless $element is a prototypically empty element like "br",
572       "hr", "img", etc.  The new "$h->pos()" is returned.  This method is
573       useful only if your particular tree task involves setting "$h->pos()".
574

DUMPING METHODS

576   $h->dump()
577   $h->dump(*FH)  ; # or *FH{IO} or $fh_obj
578       Prints the element and all its children to STDOUT (or to a specified
579       filehandle), in a format useful only for debugging.  The structure of
580       the document is shown by indentation (no end tags).
581
582   $h->as_HTML() or $h->as_HTML($entities)
583   or $h->as_HTML($entities, $indent_char)
584   or $h->as_HTML($entities, $indent_char, \%optional_end_tags)
585       Returns a string representing in HTML the element and its descendants.
586       The optional argument $entities specifies a string of the entities to
587       encode.  For compatibility with previous versions, specify '<>&' here.
588       If omitted or undef, all unsafe characters are encoded as HTML
589       entities.  See HTML::Entities for details.  If passed an empty string,
590       no entities are encoded.
591
       If $indent_char is specified and defined, the HTML to be output is
       indented, using the string you specify (which you probably should set
       to "\t", or some number of spaces, if you specify it).
595
596       If "\%optional_end_tags" is specified and defined, it should be a
597       reference to a hash that holds a true value for every tag name whose
598       end tag is optional.  Defaults to "\%HTML::Element::optionalEndTag",
599       which is an alias to %HTML::Tagset::optionalEndTag, which, at time of
600       writing, contains true values for "p, li, dt, dd".  A useful value to
601       pass is an empty hashref, "{}", which means that no end-tags are
602       optional for this dump.  Otherwise, possibly consider copying
603       %HTML::Tagset::optionalEndTag to a hash of your own, adding or deleting
604       values as you like, and passing a reference to that hash.
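
       The three arguments can be combined as in the following sketch, which
       assumes HTML::Tree is installed.  The exact whitespace of the output
       may vary between versions, so only the relevant fragments are
       checked.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content('x < 4 & y > 7');

# Encode only <, > and &; no indenting; treat no end tag as optional.
my $strict = $p->as_HTML('<>&', undef, {});

# An empty $entities string turns entity encoding off.
my $raw = $p->as_HTML('', undef, {});

print $strict, "\n", $raw, "\n";
```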
605
606   $h->as_text()
607   $h->as_text(skip_dels => 1, extra_chars => '\xA0')
608       Returns a string consisting of only the text parts of the element's
609       descendants.
610
611       Text under 'script' or 'style' elements is never included in what's
612       returned.  If "skip_dels" is true, then text content under "del" nodes
613       is not included in what's returned.
614
615   $h->as_trimmed_text(...) as_text_trimmed
616       This is just like as_text(...) except that leading and trailing
617       whitespace is deleted, and any internal whitespace is collapsed.
618
       This will not remove hard spaces, Unicode spaces, or any other
       non-ASCII whitespace unless you supply the extra characters as a
       string argument, e.g. "$h->as_trimmed_text(extra_chars => '\xA0')".
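
       To compare the text-dumping methods side by side, here is a small
       sketch (assuming HTML::Tree is installed; the content is invented for
       the example):

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new('div');
$div->push_content(
    "  Hello,\n   ",
    ['script', 'var hidden = 1;'],   # never included in the text
    ['del',    'cruel '],            # skipped only with skip_dels
    ['b',      'world'],
    "  \n",
);

my $text    = $div->as_text;
my $no_dels = $div->as_text(skip_dels => 1);
my $trimmed = $div->as_trimmed_text(skip_dels => 1);
```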
622
623   $h->as_XML()
624       Returns a string representing in XML the element and its descendants.
625
626       The XML is not indented.
627
628   $h->as_Lisp_form()
629       Returns a string representing the element and its descendants as a Lisp
630       form.  Unsafe characters are encoded as octal escapes.
631
632       The Lisp form is indented, and contains external ("href", etc.)  as
633       well as internal attributes ("_tag", "_content", "_implicit", etc.),
634       except for "_parent", which is omitted.
635
636       Current example output for a given element:
637
638         ("_tag" "img" "border" "0" "src" "pie.png" "usemap" "#main.map")
639
640   format
641       Formats text output. Defaults to HTML::FormatText.
642
643       Takes a second argument that is a reference to a formatter.
644
645   $h->starttag() or $h->starttag($entities)
646       Returns a string representing the complete start tag for the element.
647       I.e., leading "<", tag name, attributes, and trailing ">".  All values
648       are surrounded with double-quotes, and appropriate characters are
649       encoded.  If $entities is omitted or undef, all unsafe characters are
650       encoded as HTML entities.  See HTML::Entities for details.  If you
651       specify some value for $entities, remember to include the double-quote
652       character in it.  (Previous versions of this module would basically
653       behave as if '&">' were specified for $entities.)  If $entities is an
654       empty string, no entity is escaped.
655
656   starttag_XML
657       Returns a string representing the complete start tag for the element.
658
659   $h->endtag() || endtag_XML
660       Returns a string representing the complete end tag for this element.
661       I.e., "</", tag name, and ">".
662

SECONDARY STRUCTURAL METHODS

664       These methods all involve some structural aspect of the tree; either
665       they report some aspect of the tree's structure, or they involve
666       traversal down the tree, or walking up the tree.
667
668   $h->is_inside('tag', ...) or $h->is_inside($element, ...)
669       Returns true if the $h element is, or is contained anywhere inside an
670       element that is any of the ones listed, or whose tag name is any of the
671       tag names listed.
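
       A brief sketch of both argument styles, assuming HTML::Tree is
       installed (the table structure is arbitrary):

```perl
use strict;
use warnings;
use HTML::Element;

my $table = HTML::Element->new_from_lol(
    ['table', ['tr', ['td', 'cell']]]
);
my ($tr) = $table->content_list;
my ($td) = $tr->content_list;

my $in_table = $td->is_inside('table');        # by tag name
my $in_row   = $td->is_inside($tr);            # by element object
my $in_list  = $td->is_inside('ul', 'ol');     # any of several tags
```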
672
673   $h->is_empty()
674       Returns true if $h has no content, i.e., has no elements or text
675       segments under it.  In other words, this returns true if $h is a leaf
676       node, AKA a terminal node.  Do not confuse this sense of "empty" with
677       another sense that it can have in SGML/HTML/XML terminology, which
678       means that the element in question is of the type (like HTML's "hr",
679       "br", "img", etc.) that can't have any content.
680
681       That is, a particular "p" element may happen to have no content, so
682       $that_p_element->is_empty will be true -- even though the prototypical
683       "p" element isn't "empty" (not in the way that the prototypical "hr"
684       element is).
685
686       If you think this might make for potentially confusing code, consider
687       simply using the clearer exact equivalent:  not($h->content_list)
688
689   $h->pindex()
690       Return the index of the element in its parent's contents array, such
691       that $h would equal
692
693         $h->parent->content->[$h->pindex]
694         or
695         ($h->parent->content_list)[$h->pindex]
696
697       assuming $h isn't root.  If the element $h is root, then $h->pindex
698       returns undef.
699
700   $h->left()
701       In scalar context: returns the node that's the immediate left sibling
702       of $h.  If $h is the leftmost (or only) child of its parent (or has no
703       parent), then this returns undef.
704
705       In list context: returns all the nodes that're the left siblings of $h
706       (starting with the leftmost).  If $h is the leftmost (or only) child of
707       its parent (or has no parent), then this returns empty-list.
708
709       (See also $h->preinsert(LIST).)
710
711   $h->right()
712       In scalar context: returns the node that's the immediate right sibling
713       of $h.  If $h is the rightmost (or only) child of its parent (or has no
714       parent), then this returns undef.
715
716       In list context: returns all the nodes that're the right siblings of
717       $h, starting with the leftmost.  If $h is the rightmost (or only) child
718       of its parent (or has no parent), then this returns empty-list.
719
720       (See also $h->postinsert(LIST).)
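
As a rough illustration of pindex, left and right, the following sketch builds a small list with new_from_lol and inspects the siblings. The list items are made-up sample data:

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new_from_lol(
    ['ul', ['li', 'one'], ['li', 'two'], ['li', 'three']]
);
my ($first, $second, $third) = $ul->content_list;

my $index    = $second->pindex;   # position among the ul's children
my $left     = $second->left;     # scalar context: immediate left sibling
my $right    = $second->right;    # scalar context: immediate right sibling
my @all_left = $third->left;      # list context: every left sibling
my $none     = $first->left;      # leftmost child, so undef

print "index=$index left=", $left->as_text,
      " right=", $right->as_text, "\n";
print "left of third: ",
      join(', ', map { $_->as_text } @all_left), "\n";
print "first has no left sibling\n" unless defined $none;
```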
721
722   $h->address()
723       Returns a string representing the location of this node in the tree.
724       The address consists of numbers joined by a '.', starting with '0', and
725       followed by the pindexes of the nodes in the tree that are ancestors of
726       $h, starting from the top.
727
728       So if the way to get to a node starting at the root is to go to child 2
729       of the root, then child 10 of that, and then child 0 of that, and then
730       you're there -- then that node's address is "0.2.10.0".
731
732       As a bit of a special case, the address of the root is simply "0".
733
734       I foresee this being used mainly for debugging, but you may find your
735       own uses for it.
736
737   $h->address($address)
738       This returns the node (whether element or text-segment) at the given
739       address in the tree that $h is a part of.  (That is, the address is
740       resolved starting from $h->root.)
741
742       If there is no node at the given address, this returns undef.
743
744       You can specify "relative addressing" (i.e., that indexing is supposed
745       to start from $h and not from $h->root) by having the address start
746       with a period -- e.g., $h->address(".3.2") will look at child 3 of $h,
747       and child 2 of that.
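
Both forms of address can be tried on a small tree. This is a sketch only; the tree and the addresses are chosen for illustration:

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'Demo']],
        ['body', ['p', 'first'], ['p', 'second']],
    ]
);

my $body     = $tree->address('0.1');      # absolute: child 1 of the root
my $second_p = $body->address('.1');       # relative: child 1 of $body
my $text     = $tree->address('0.1.1.0');  # text segments come back as strings
my $missing  = $tree->address('0.5');      # no such node, so undef

print $body->tag,     " is at ", $body->address,     "\n";
print $second_p->tag, " is at ", $second_p->address, "\n";
print "text node: $text\n";
print "root is at ", $tree->address, "\n";
print "nothing at 0.5\n" unless defined $missing;
```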
748
749   $h->depth()
750       Returns a number expressing $h's depth within its tree, i.e., how many
751       steps away it is from the root.  If $h has no parent (i.e., is root),
752       its depth is 0.
753
754   $h->root()
755       Returns the element that's the top of $h's tree.  If $h is root, this
756       just returns $h.  (If you want to test whether $h is the root, instead
757       of asking what its root is, just test "not($h->parent)".)
758
759   $h->lineage()
760       Returns the list of $h's ancestors, starting with its parent, and then
761       that parent's parent, and so on, up to the root.  If $h is root, this
762       returns an empty list.
763
764       If you simply want a count of the number of elements in $h's lineage,
765       use $h->depth.
766
767   $h->lineage_tag_names()
768       Returns the list of the tag names of $h's ancestors, starting with its
769       parent, and that parent's parent, and so on, up to the root.  If $h is
770       root, this returns an empty list.  Example output: "('em', 'td', 'tr',
771       'table', 'body', 'html')"
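
A short sketch of depth, root, lineage and lineage_tag_names on a small hand-built tree (the tree itself is just sample data):

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html', ['body', ['p', 'Some ', ['em', 'emphasis'], '.']]]
);
my $em = $tree->find_by_tag_name('em');   # scalar context: first match

my $depth     = $em->depth;               # steps up to the root
my $root      = $em->root;                # the html element
my @ancestors = $em->lineage;             # parent first, root last
my @tags      = $em->lineage_tag_names;   # same order, as tag names
my $is_root   = !$tree->parent;           # true only for the root

print "depth=$depth root=", $root->tag, " lineage=@tags\n";
```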
772
773   $h->descendants()
774       In list context, returns the list of all $h's descendant elements,
775       listed in pre-order (i.e., an element appears before its content-
776       elements).  Text segments DO NOT appear in the list.  In scalar
777       context, returns a count of all such elements.
778
779   $h->descendents()
780       This is just an alias to the "descendants" method.
781
782   $h->find_by_tag_name('tag', ...)
783       In list context, returns a list of elements at or under $h that have
784       any of the specified tag names.  In scalar context, returns the first
785       (in pre-order traversal of the tree) such element found, or undef if
786       none.
787
788   $h->find('tag', ...)
789       This is just an alias to "find_by_tag_name".  (There was once going to
790       be a whole find_* family of methods, but then look_down filled that
791       niche, so there turned out not to be much reason for the verboseness of
792       the name "find_by_tag_name".)
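
The context-sensitive behavior of descendants and find_by_tag_name can be sketched as follows; the tree contents are invented for the example:

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new_from_lol(
    ['div',
        ['p', 'alpha ', ['em', 'beta']],
        ['p', 'gamma'],
    ]
);

my @all   = $div->descendants;             # list context: elements, pre-order
my $count = $div->descendants;             # scalar context: just the count
my @paras = $div->find_by_tag_name('p');   # list context: every "p"
my $first = $div->find('p');               # scalar context: first "p"
my @mixed = $div->find('em', 'div');       # "at or under": includes $div

print "descendants: ", join(' ', map { $_->tag } @all), " ($count)\n";
print "paragraphs: ", scalar(@paras), "; first: ", $first->as_text, "\n";
print "em or div: ", join(' ', map { $_->tag } @mixed), "\n";
```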
793
794   $h->find_by_attribute('attribute', 'value')
795       In a list context, returns a list of elements at or under $h that have
796       the specified attribute, and have the given value for that attribute.
797       In a scalar context, returns the first (in pre-order traversal of the
798       tree) such element found, or undef if none.
799
800       This method is deprecated in favor of the more expressive "look_down"
801       method, which new code should use instead.
802
803   $h->look_down( ...criteria... )
804       This starts at $h and looks thru its element descendants (in pre-
805       order), looking for elements matching the criteria you specify.  In
806       list context, returns all elements that match all the given criteria;
807       in scalar context, returns the first such element (or undef, if nothing
808       matched).
809
810       There are three kinds of criteria you can specify:
811
812       (attr_name, attr_value)
813           This means you're looking for an element with that value for that
814           attribute.  Example: "alt", "pix!".  Consider that you can search
815           on internal attribute values too: "_tag", "p".
816
817       (attr_name, qr/.../)
818           This means you're looking for an element whose value for that
819           attribute matches the specified Regexp object.
820
821       a coderef
822           This means you're looking for elements where
823           coderef->(each_element) returns true.  Example:
824
825             my @wide_pix_images
826               = $h->look_down(
827                               "_tag", "img",
828                               "alt", "pix!",
829                               sub { ($_[0]->attr('width') || 0) > 350 }
830                              );
831
832       Note that "(attr_name, attr_value)" and "(attr_name, qr/.../)" criteria
833       are almost always faster than coderef criteria, so should presumably be
834       put before them in your list of criteria.  That is, in the example
835       above, the sub ref is called only for elements that have already passed
836       the criteria of having a "_tag" attribute with value "img", and an
837       "alt" attribute with value "pix!".  If the coderef were first, it would
838       be called on every element, and then what elements pass that criterion
839       (i.e., elements for which the coderef returned true) would be checked
840       for their "_tag" and "alt" attributes.
841
842       Note that comparison of string attribute-values against the string
843       value in "(attr_name, attr_value)" is case-INsensitive!  A criterion of
844       "('align', 'right')" will match an element whose "align" value is
845       "RIGHT", or "right" or "rIGhT", etc.
846
847       Note also that "look_down" considers "" (empty-string) and undef to be
848       different things, in attribute values.  So this:
849
850         $h->look_down("alt", "")
851
852       will find elements with an "alt" attribute, but where the value for the
853       "alt" attribute is "".  But this:
854
855         $h->look_down("alt", undef)
856
857       is the same as:
858
859         $h->look_down(sub { !defined($_[0]->attr('alt')) } )
860
861       That is, it finds elements that do not have an "alt" attribute at all
862       (or that do have an "alt" attribute, but with a value of undef -- which
863       is not normally possible).
864
865       Note that when you give several criteria, this is taken to mean you're
866       looking for elements that match all your criteria, not just any of
867       them.  In other words, there is an implicit "and", not an "or".  So if
868       you wanted to express that you wanted to find elements with a "name"
869       attribute with the value "foo" or with an "id" attribute with the value
870       "baz", you'd have to do it like:
871
872         @them = $h->look_down(
873           sub {
874             # the lcs are to fold case
875             lc($_[0]->attr('name') || '') eq 'foo'
876             or lc($_[0]->attr('id') || '') eq 'baz'
877           }
878         );
879
880       Coderef criteria are more expressive than "(attr_name, attr_value)" and
881       "(attr_name, qr/.../)" criteria, and all "(attr_name, attr_value)" and
882       "(attr_name, qr/.../)" criteria could be expressed in terms of
883       coderefs.  However, "(attr_name, attr_value)" and "(attr_name,
884       qr/.../)" criteria are a convenient shorthand.  (In fact, "look_down"
885       itself is basically "shorthand" too, since anything you can do with
886       "look_down" you could do by traversing the tree, either with the
887       "traverse" method or with a routine of your own.  However, "look_down"
888       often makes for very concise and clear code.)
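
To tie the three kinds of criteria together, here is a small self-contained sketch. The image file names and widths are invented sample data, and the coderef guards against a missing width attribute:

```perl
use strict;
use warnings;
use HTML::Element;

my $body = HTML::Element->new_from_lol(
    ['body',
        ['img', {src => 'a.png', alt => 'Logo', width => 400}],
        ['img', {src => 'b.png', alt => 'logo', width => 100}],
        ['img', {src => 'c.jpg', width => 500}],
        ['p', 'No pictures here.'],
    ]
);

# (attr_name, attr_value) pair first, then the slower coderef.
my @wide = $body->look_down(
    _tag => 'img',
    sub { ($_[0]->attr('width') || 0) > 350 },
);

# (attr_name, qr/.../): a Regexp object as the value.
my @png = $body->look_down(_tag => 'img', src => qr/\.png$/);

# undef as the value: elements lacking the attribute.
my @no_alt = $body->look_down(_tag => 'img', alt => undef);

# Scalar context: the first match only.
my $first = $body->look_down(_tag => 'img');

print "wide:   ", join(' ', map { $_->attr('src') } @wide),   "\n";
print "png:    ", join(' ', map { $_->attr('src') } @png),    "\n";
print "no alt: ", join(' ', map { $_->attr('src') } @no_alt), "\n";
print "first:  ", $first->attr('src'), "\n";
```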
889
890   $h->look_up( ...criteria... )
891       This is identical to $h->look_down, except that whereas $h->look_down
892       basically scans over the list:
893
894          ($h, $h->descendants)
895
896       $h->look_up instead scans over the list
897
898          ($h, $h->lineage)
899
900       So, for example, this returns all ancestors of $h (possibly including
901       $h itself) that are "td" elements with an "align" attribute with a
902       value of "right" (or "RIGHT", etc.):
903
904          $h->look_up("_tag", "td", "align", "right");
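
A runnable sketch of the same idea, using a tiny made-up table. In scalar context look_up returns the nearest match; in list context it returns every match, starting from $h and working upward:

```perl
use strict;
use warnings;
use HTML::Element;

my $table = HTML::Element->new_from_lol(
    ['table', ['tr', ['td', {align => 'right'}, ['em', 'hi']]]]
);
my $em = $table->look_down(_tag => 'em');

# Nearest enclosing right-aligned cell (or $em itself, if it matched).
my $cell = $em->look_up(_tag => 'td', align => 'right');

# A coderef that accepts everything shows the scan order.
my @chain = $em->look_up(sub { 1 });

print "cell: ", $cell->tag, "\n";
print "scan order: ", join(' ', map { $_->tag } @chain), "\n";
```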
905
906   $h->traverse(...options...)
907       Lengthy discussion of HTML::Element's unnecessary and confusing
908       "traverse" method has been moved to a separate file:
909       HTML::Element::traverse
910
911   $h->attr_get_i('attribute')
912       In list context, returns a list consisting of the values of the given
913       attribute for $self and for all its ancestors starting from $self and
914       working its way up.  Nodes with no such attribute are skipped.
915       ("attr_get_i" stands for "attribute get, with inheritance".)  In scalar
916       context, returns the first such value, or undef if none.
917
918       Consider a document consisting of:
919
920          <html lang='i-klingon'>
921            <head><title>Pati Pata</title></head>
922            <body>
923              <h1 lang='la'>Stuff</h1>
924              <p lang='es-MX' align='center'>
925                Foo bar baz <cite>Quux</cite>.
926              </p>
927              <p>Hooboy.</p>
928            </body>
929          </html>
930
931       If $h is the "cite" element, $h->attr_get_i("lang") in list context
932       will return the list ('es-MX', 'i-klingon').  In scalar context, it
933       will return the value 'es-MX'.
934
935       If you call with multiple attribute names...
936
937   $h->attr_get_i('a1', 'a2', 'a3')
938       ...in list context, this will return a list consisting of the values of
939       these attributes which exist in $self and its ancestors.  In scalar
940       context, this returns the first value (i.e., the value of the first
941       existing attribute from the first element that has any of the
942       attributes listed).  So, in the above example,
943
944         $h->attr_get_i('lang', 'align');
945
946       will return:
947
948          ('es-MX', 'center', 'i-klingon') # in list context
949         or
950          'es-MX' # in scalar context.
951
952       But note that this:
953
954        $h->attr_get_i('align', 'lang');
955
956       will return:
957
958          ('center', 'es-MX', 'i-klingon') # in list context
959         or
960          'center' # in scalar context.
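
The document above can be rebuilt with new_from_lol to check these results. This is a sketch; only the variable names are new:

```perl
use strict;
use warnings;
use HTML::Element;

my $html = HTML::Element->new_from_lol(
    ['html', {lang => 'i-klingon'},
        ['head', ['title', 'Pati Pata']],
        ['body',
            ['h1', {lang => 'la'}, 'Stuff'],
            ['p', {lang => 'es-MX', align => 'center'},
                'Foo bar baz ', ['cite', 'Quux'], '.'],
            ['p', 'Hooboy.'],
        ],
    ]
);
my $cite = $html->find_by_tag_name('cite');

my @langs     = $cite->attr_get_i('lang');            # list context
my $lang      = $cite->attr_get_i('lang');            # scalar context
my @lang_1st  = $cite->attr_get_i('lang', 'align');   # order of names matters
my @align_1st = $cite->attr_get_i('align', 'lang');

print "langs: @langs\n";
print "lang:  $lang\n";
print "lang, align: @lang_1st\n";
print "align, lang: @align_1st\n";
```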
961
962   $h->tagname_map()
963       Scans across $h and all its descendants, and makes a hash (a reference
964       to which is returned) where each entry consists of a key that's a tag
965       name, and a value that's a reference to a list of all elements that
966       have that tag name.  I.e., this method returns:
967
968          {
969            # Across $h and all descendants...
970            'a'   => [ ...list of all 'a'   elements... ],
971            'em'  => [ ...list of all 'em'  elements... ],
972            'img' => [ ...list of all 'img' elements... ],
973          }
974
975       (There are entries in the hash for only those tagnames that occur
976       at/under $h -- so if there are no "img" elements, there'll be no "img"
977       entry in the hashref returned.)
978
979       Example usage:
980
981           my $map_r = $h->tagname_map();
982           my @heading_tags = sort grep m/^h\d$/s, keys %$map_r;
983           if(@heading_tags) {
984             print "Heading levels used: @heading_tags\n";
985           } else {
986             print "No headings.\n"
987           }
988
989   $h->extract_links() or $h->extract_links(@wantedTypes)
990       Returns links found by traversing the element and all of its children
991       and looking for attributes (like "href" in an "a" element, or "src" in
992       an "img" element) whose values represent links.  The return value is a
993       reference to an array.  Each element of the array is a reference to an
994       array with four items: the link-value, the element that has the
995       attribute with that link-value, the name of that attribute, and the
996       tagname of that element.  (Example: "['http://www.suck.com/',
997       $elem_obj, 'href', 'a']".)  You may or may not end up using the
998       element itself -- for some purposes, you may use only the link value.
999
1000       You might specify that you want to extract links from just some kinds
1001       of elements (instead of the default, which is to extract links from all
1002       the kinds of elements known to have attributes whose values represent
1003       links).  For instance, if you want to extract links from only "a" and
1004       "img" elements, you could code it like this:
1005
1006         for (@{  $e->extract_links('a', 'img')  }) {
1007             my($link, $element, $attr, $tag) = @$_;
1008             print
1009               "Hey, there's a $tag that links to ",
1010               $link, ", in its $attr attribute, at ",
1011               $element->address(), ".\n";
1012         }
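
For a complete script to experiment with, here is a sketch that builds a small tree by hand and extracts its links. The link values are sample data; which attributes count as links is decided by HTML::Element (via HTML::Tagset), not by this code:

```perl
use strict;
use warnings;
use HTML::Element;

my $body = HTML::Element->new_from_lol(
    ['body',
        ['a',   {href => 'http://www.perl.com/'}, 'Perl'],
        ['img', {src  => 'logo.png', alt => 'logo'}],
        ['a',   {name => 'top'}, 'an anchor with no link'],
    ]
);

my $all  = $body->extract_links;          # every kind of link element
my $imgs = $body->extract_links('img');   # only links in "img" elements

for my $entry (@$all) {
    my ($link, $element, $attr, $tag) = @$entry;
    printf "%s (%s attribute of %s at %s)\n",
        $link, $attr, $tag, $element->address;
}
print scalar(@$imgs), " link(s) in img elements\n";
```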
1013
1014   $h->simplify_pres
1015       In text bits under PRE elements that are at/under $h, this routine
1016       nativizes all newlines, and expands all tabs.
1017
1018       That is, if you read a file with lines delimited by "\cm\cj"'s, the
1019       text under PRE areas will have "\cm\cj"'s instead of "\n"'s. Calling
1020       $h->simplify_pres on such a tree will turn "\cm\cj"'s into
1021       "\n"'s.
1022
1023       Tabs are expanded to however many spaces it takes to get to the next
1024       8th column -- the usual way of expanding them.
1025
1026   $h->same_as($i)
1027       Returns true if $h and $i are both elements representing the same tree
1028       of elements, each with the same tag name, with the same explicit
1029       attributes (i.e., not counting attributes whose names start with "_"),
1030       and with the same content (textual, comments, etc.).
1031
1032       Sameness of descendant elements is tested, recursively, with
1033       "$child1->same_as($child_2)", and sameness of text segments is tested
1034       with "$segment1 eq $segment2".
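
A brief sketch of what same_as does and does not compare. The make_p helper and the "_note" attribute are hypothetical, introduced only for this example:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical helper: builds a fresh "p" treelet around the given word.
sub make_p {
    my ($word) = @_;
    return HTML::Element->new_from_lol(
        ['p', {class => 'note'}, 'Hello ', ['b', $word]]
    );
}

my $x = make_p('world');
my $y = make_p('world');
my $z = make_p('World');          # differs only in the case of one text segment

$y->attr('_note', 'ignored');     # internal attribute: not compared

my $x_same_y = $x->same_as($y);   # true: same tags, attributes, content
my $x_same_z = $x->same_as($z);   # false: text segments are compared with eq

print "x/y: ", ($x_same_y ? 'same' : 'different'), "\n";
print "x/z: ", ($x_same_z ? 'same' : 'different'), "\n";
```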
1035
1036   $h = HTML::Element->new_from_lol(ARRAYREF)
1037       Recursively constructs a tree of nodes, based on the (non-cyclic) data
1038       structure represented by ARRAYREF, where that is a reference to an
1039       array of arrays (of arrays (of arrays (etc.))).
1040
1041       In each arrayref in that structure, different kinds of values are
1042       treated as follows:
1043
1044       ·   Arrayrefs
1045
1046           Arrayrefs are considered to designate a sub-tree representing
1047           children for the node constructed from the current arrayref.
1048
1049       ·   Hashrefs
1050
1051           Hashrefs are considered to contain attribute-value pairs to add to
1052           the element to be constructed from the current arrayref.
1053
1054       ·   Text segments
1055
1056           Text segments at the start of any arrayref will be considered to
1057           specify the name of the element to be constructed from the current
1058           arrayref; all other text segments will be considered to specify
1059           text segments as children for the current arrayref.
1060
1061       ·   Elements
1062
1063           Existing element objects are either inserted into the treelet
1064           constructed, or clones of them are.  That is, when the lol-tree is
1065           being traversed and elements constructed based on what's in it, if an
1066           existing element object is found, if it has no parent, then it is
1067           added directly to the treelet constructed; but if it has a parent,
1068           then "$that_node->clone" is added to the treelet at the appropriate
1069           place.
1070
1071       An example will hopefully make this more obvious:
1072
1073         my $h = HTML::Element->new_from_lol(
1074           ['html',
1075             ['head',
1076               [ 'title', 'I like stuff!' ],
1077             ],
1078             ['body',
1079               {'lang', 'en-JP', _implicit => 1},
1080               'stuff',
1081               ['p', 'um, p < 4!', {'class' => 'par123'}],
1082               ['div', {foo => 'bar'}, '123'],
1083             ]
1084           ]
1085         );
1086         $h->dump;
1087
1088       Will print this:
1089
1090         <html> @0
1091           <head> @0.0
1092             <title> @0.0.0
1093               "I like stuff!"
1094           <body lang="en-JP"> @0.1 (IMPLICIT)
1095             "stuff"
1096             <p class="par123"> @0.1.1
1097               "um, p < 4!"
1098             <div foo="bar"> @0.1.2
1099               "123"
1100
1101       And printing $h->as_HTML will give something like:
1102
1103         <html><head><title>I like stuff!</title></head>
1104         <body lang="en-JP">stuff<p class="par123">um, p &lt; 4!
1105         <div foo="bar">123</div></body></html>
1106
1107       You can even do fancy things with "map":
1108
1109         $body->push_content(
1110           # push_content implicitly calls new_from_lol on arrayrefs...
1111           ['br'],
1112           ['blockquote',
1113             ['h2', 'Pictures!'],
1114             map ['p', $_],
1115             $body2->look_down("_tag", "img"),
1116               # images, to be copied from that other tree.
1117           ],
1118           # and more stuff:
1119           ['ul',
1120             map ['li', ['a', {'href'=>"$_.png"}, $_ ] ],
1121             qw(Peaches Apples Pears Mangos)
1122           ],
1123         );
1124
1125   @elements = HTML::Element->new_from_lol(ARRAYREFS)
1126       Constructs several elements, by calling new_from_lol for every arrayref
1127       in the ARRAYREFS list.
1128
1129         @elements = HTML::Element->new_from_lol(
1130           ['hr'],
1131           ['p', 'And there, on the door, was a hook!'],
1132         );
1133          # constructs two elements.
1134
1135   $h->objectify_text()
1136       This turns any text nodes under $h from mere text segments (strings)
1137       into real objects, pseudo-elements with a tag-name of "~text", and the
1138       actual text content in an attribute called "text".  (For a discussion
1139       of pseudo-elements, see the "tag" method, far above.)  This method is
1140       provided because, for some purposes, it is convenient or necessary to
1141       be able, for a given text node, to ask what element is its parent; and
1142       clearly this is not possible if a node is just a text string.
1143
1144       Note that these "~text" objects are not recognized as text nodes by
1145       methods like as_text.  Presumably you will want to call
1146       $h->objectify_text, perform whatever task that you needed that for, and
1147       then call $h->deobjectify_text before calling anything like
1148       $h->as_text.
1149
1150   $h->deobjectify_text()
1151       This undoes the effect of $h->objectify_text.  That is, it takes any
1152       "~text" pseudo-elements in the tree at/under $h, and deletes each one,
1153       replacing each with the content of its "text" attribute.
1154
1155       Note that if $h itself is a "~text" pseudo-element, it will be
1156       destroyed -- a condition you may need to treat specially in your
1157       calling code (since it means you can't very well do anything with $h
1158       after that).  So that you can detect that condition, if $h is itself a
1159       "~text" pseudo-element, then this method returns the value of the
1160       "text" attribute, which should be a defined value; in all other cases,
1161       it returns undef.
1162
1163       (This method assumes that no "~text" pseudo-element has any children.)
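
The round trip might look like the following sketch, which uses objectify_text to find the parent of each text node and then restores the plain strings. The sample sentence is made up:

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(
    ['p', 'Hello ', ['b', 'world'], '!']
);

$p->objectify_text;

# Text nodes are now "~text" pseudo-elements, so they can be searched
# for and asked about their parent.
my @report;
for my $t ($p->look_down(_tag => '~text')) {
    push @report, $t->parent->tag . ':' . $t->attr('text');
}

$p->deobjectify_text;             # back to plain strings before as_text
my $text = $p->as_text;

print join(' | ', @report), "\n";
print "$text\n";
```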
1164
1165   $h->number_lists()
1166       For every UL, OL, DIR, and MENU element at/under $h, this sets a
1167       "_bullet" attribute for every child LI element.  For LI children of an
1168       OL, the "_bullet" attribute's value will be something like "4.", "d.",
1169       "D.", "IV.", or "iv.", depending on the OL element's "type" attribute.
1170       LI children of a UL, DIR, or MENU get their "_bullet" attribute set to
1171       "*".  There should be no other LIs (i.e., except as children of OL, UL,
1172       DIR, or MENU elements), and if there are, they are unaffected.
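
A minimal sketch, assuming an OL with no "type" attribute (so plain decimal numbering is expected) next to a UL. The list contents are sample data:

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new_from_lol(
    ['div',
        ['ol', ['li', 'first'], ['li', 'second'], ['li', 'third']],
        ['ul', ['li', 'this'],  ['li', 'that']],
    ]
);

$div->number_lists;

# Collect the "_bullet" values per list, keyed by the list's tag name.
my %bullets;
for my $list ($div->content_list) {
    $bullets{ $list->tag }
        = join ' ', map { $_->attr('_bullet') } $list->content_list;
}

print "ol: $bullets{ol}\n";
print "ul: $bullets{ul}\n";
```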
1173
1174   $h->has_insane_linkage
1175       This method is for testing whether this element or the elements under
1176       it have linkage attributes (_parent and _content) whose values are
1177       deeply aberrant: if there are undefs in a content list; if an element
1178       appears in the content lists of more than one element; if the _parent
1179       attribute of an element doesn't match its actual parent; or if an
1180       element appears as its own descendant (i.e., if there is a cyclicity in
1181       the tree).
1182
1183       This returns empty list (or false, in scalar context) if the subtree's
1184       linkage attributes are sane; otherwise it returns two items (or true, in
1185       scalar context): the element where the error occurred, and a string
1186       describing the error.
1187
1188       This method is provided mainly for debugging and troubleshooting --
1189       it should be quite impossible for any document constructed via
1190       HTML::TreeBuilder to parse into a non-sane tree (since it's not the
1191       content of the tree per se that's in question, but whether the tree in
1192       memory was properly constructed); and it should be impossible for you
1193       to produce an insane tree just thru reasonable use of normal documented
1194       structure-modifying methods.  But if you're constructing your own
1195       trees, and your program is going into infinite loops during calls to
1196       traverse() or any of the secondary structural methods, as part of
1197       debugging, consider calling has_insane_linkage on the tree.
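
To see the check report something, the sketch below corrupts a tree on purpose by pushing undef onto a content list through content_array_ref. That is not something normal code should do; it is done here only to provoke a diagnosis:

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(['div', ['p', 'fine']]);

my @before = $tree->has_insane_linkage;   # empty list: linkage is sane

# Deliberate damage, for demonstration only: an undef in a content list.
push @{ $tree->content_array_ref }, undef;
my ($where, $why) = $tree->has_insane_linkage;
print "Problem at <", $where->tag, ">: $why\n" if $where;

# Undo the damage; the check should come back clean again.
pop @{ $tree->content_array_ref };
my @after = $tree->has_insane_linkage;
print "Sane again\n" unless @after;
```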
1198
1199   $h->element_class
1200       This method returns the class which will be used for new elements.  It
1201       defaults to HTML::Element, but can be overridden by subclassing or
1202       esoteric means best left to those who will read the source and then
1203       not complain when those esoteric means change.  (Just subclass.)
1204

BUGS

1206       * If you want to free the memory associated with a tree built of
1207       HTML::Element nodes, then you will have to delete it explicitly.  See
1208       the $h->delete method, above.
1209
1210       * There's almost nothing to stop you from making a "tree" with
1211       cyclicities (loops) in it, which could, for example, make the traverse
1212       method go into an infinite loop.  So don't make cyclicities!  (If all
1213       you're doing is parsing HTML files, and looking at the resulting trees,
1214       this will never be a problem for you.)
1215
1216       * There's no way to represent comments or processing directives in a
1217       tree with HTML::Elements.  Not yet, at least.
1218
1219       * There's (currently) nothing to stop you from using an undefined value
1220       as a text segment.  If you're running under "perl -w", however, this
1221       may make HTML::Element's code produce a slew of warnings.
1222

NOTES ON SUBCLASSING

1224       You are welcome to derive subclasses from HTML::Element, but you should
1225       be aware that the code in HTML::Element makes certain assumptions about
1226       elements (and I'm using "element" to mean ONLY an object of class
1227       HTML::Element, or of a subclass of HTML::Element):
1228
1229       * The value of an element's _parent attribute must either be undef or
1230       otherwise false, or must be an element.
1231
1232       * The value of an element's _content attribute must either be undef or
1233       otherwise false, or a reference to an (unblessed) array.  The array may
1234       be empty; but if it has items, they must ALL be either mere strings
1235       (text segments), or elements.
1236
1237       * The value of an element's _tag attribute should, at least, be a
1238       string of printable characters.
1239
1240       Moreover, bear these rules in mind:
1241
1242       * Do not break encapsulation on objects.  That is, access their
1243       contents only thru $obj->attr or more specific methods.
1244
1245       * You should think twice before completely overriding any of the
1246       methods that HTML::Element provides.  (Overriding with a method that
1247       calls the superclass method is not so bad, though.)
1248

SEE ALSO

1250       HTML::Tree; HTML::TreeBuilder; HTML::AsSubs; HTML::Tagset; and, for the
1251       morbidly curious, HTML::Element::traverse.
1252

COPYRIGHT

1254       Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy
1255       Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn.
1256
1257       This library is free software; you can redistribute it and/or modify it
1258       under the same terms as Perl itself.
1259
1260       This program is distributed in the hope that it will be useful, but
1261       without any warranty; without even the implied warranty of
1262       merchantability or fitness for a particular purpose.
1263

AUTHOR

1265       Currently maintained by Pete Krawczyk "<petek@cpan.org>"
1266
1267       Original authors: Gisle Aas, Sean Burke and Andy Lester.
1268
1269       Thanks to Mark-Jason Dominus for a POD suggestion.
1270
1271
1272
1273perl v5.12.2                      2010-12-20                  HTML::Element(3)