HTML::Element(3)      User Contributed Perl Documentation      HTML::Element(3)

6 HTML::Element - Class for objects that represent HTML elements
7
9 Version 3.23
10
    use HTML::Element;
    $a = HTML::Element->new('a', href => 'http://www.perl.com/');
    $a->push_content("The Perl Homepage");

    $tag = $a->tag;
    print "$tag starts out as:", $a->starttag, "\n";
    print "$tag ends as:", $a->endtag, "\n";
    print "$tag\'s href attribute is: ", $a->attr('href'), "\n";

    $links_r = $a->extract_links();
    print "Hey, I found ", scalar(@$links_r), " links.\n";

    print "And that, as HTML, is: ", $a->as_HTML, "\n";
    $a = $a->delete;
26
28 (This class is part of the HTML::Tree dist.)
29
Objects of the HTML::Element class can be used to represent elements of
HTML document trees. These objects have attributes, notably attributes
that designate each element's parent and content. The content is an
array of text segments and other HTML::Element objects. A tree with
HTML::Element objects as nodes can represent the syntax tree for an HTML
document.
36
38 Consider this HTML document:
39
40 <html lang='en-US'>
41 <head>
42 <title>Stuff</title>
43 <meta name='author' content='Jojo'>
44 </head>
45 <body>
46 <h1>I like potatoes!</h1>
47 </body>
48 </html>
49
50 Building a syntax tree out of it makes a tree-structure in memory that
51 could be diagrammed as:
52
                     html (lang='en-US')
                      / \
                    /     \
                  /         \
                head        body
               /\               \
             /    \               \
           /        \               \
         title     meta              h1
          |       (name='author',     |
       "Stuff"    content='Jojo')    "I like potatoes"
64
65 This is the traditional way to diagram a tree, with the "root" at the
66 top, and it's this kind of diagram that people have in mind when they
67 say, for example, that "the meta element is under the head element
68 instead of under the body element". (The same is also said with
69 "inside" instead of "under" -- the use of "inside" makes more sense
70 when you're looking at the HTML source.)
71
72 Another way to represent the above tree is with indenting:
73
    html (attributes: lang='en-US')
      head
        title
          "Stuff"
        meta (attributes: name='author' content='Jojo')
      body
        h1
          "I like potatoes"
82
83 Incidentally, diagramming with indenting works much better for very
84 large trees, and is easier for a program to generate. The
85 "$tree->dump" method uses indentation just that way.
86
87 However you diagram the tree, it's stored the same in memory -- it's a
88 network of objects, each of which has attributes like so:
89
    element #1:  _tag: 'html'
                 _parent: none
                 _content: [element #2, element #5]
                 lang: 'en-US'

    element #2:  _tag: 'head'
                 _parent: element #1
                 _content: [element #3, element #4]

    element #3:  _tag: 'title'
                 _parent: element #2
                 _content: [text segment "Stuff"]

    element #4:  _tag: 'meta'
                 _parent: element #2
                 _content: none
                 name: author
                 content: Jojo

    element #5:  _tag: 'body'
                 _parent: element #1
                 _content: [element #6]

    element #6:  _tag: 'h1'
                 _parent: element #5
                 _content: [text segment "I like potatoes"]
116
The "treeness" of the tree-structure that these elements comprise is
not an aspect of any particular object, but is emergent from the
relatedness attributes (_parent and _content) of these element-objects
and from how you use them to get from element to element.
121
122 While you could access the content of a tree by writing code that says
123 "access the 'src' attribute of the root's first child's seventh child's
124 third child", you're more likely to have to scan the contents of a
125 tree, looking for whatever nodes, or kinds of nodes, you want to do
126 something with. The most straightforward way to look over a tree is to
127 "traverse" it; an HTML::Element method ("$h->traverse") is provided for
128 this purpose; and several other HTML::Element methods are based on it.
129
130 (For everything you ever wanted to know about trees, and then some, see
131 Niklaus Wirth's Algorithms + Data Structures = Programs or Donald
132 Knuth's The Art of Computer Programming, Volume 1.)
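
As a rough sketch of the structure described above, the example tree can
be built by hand and then walked through its content and parent links.
This assumes the HTML::Tree distribution is installed; it borrows the
"new_from_lol" constructor (documented much later in this manual) purely
as a convenient way to create the nodes, and the variable names are
illustrative only.

```perl
use strict;
use warnings;
use HTML::Element;

# Build the example tree; a hashref inside a list holds attributes.
my $html = HTML::Element->new_from_lol(
    ['html', { lang => 'en-US' },
        ['head',
            ['title', 'Stuff'],
            ['meta', { name => 'author', content => 'Jojo' }],
        ],
        ['body',
            ['h1', 'I like potatoes!'],
        ],
    ]
);

# Walk down through the content lists...
my ($head, $body) = $html->content_list;
my ($h1)          = $body->content_list;
my ($h1_text)     = $h1->content_list;     # a plain string, not an object

# ...and back up through the parent links.
my $h1_parent_tag = $h1->parent->tag;
my $root_is_html  = ($h1->root == $html);  # same object?

print "h1 is under $h1_parent_tag and contains \"$h1_text\"\n";

$html->delete;    # free the tree once you are done with it
```

Nothing in any single object says "I am a tree"; the code only ever
follows "_content" downward and "_parent" upward.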
133
135 $h = HTML::Element->new('tag', 'attrname' => 'value', ... )
136
This constructor method returns a new HTML::Element object. The tag
name is a required argument; it will be forced to lowercase.
Optionally, you can specify other initial attributes at object creation
time.
140
141 $h->attr('attr') or $h->attr('attr', 'value')
142
Returns (optionally sets) the value of the given attribute of $h. The
attribute name (but not the value, if provided) is forced to lowercase.
If trying to read the value of an attribute not present for this
element, the return value is undef. If setting a new value, the old
value of that attribute is returned.

If methods are provided for accessing an attribute (like "$h->tag" for
"_tag", "$h->content_list", etc. below), use those instead of calling
"$h->attr", whether for reading or setting.
152
153 Note that setting an attribute to "undef" (as opposed to "", the empty
154 string) actually deletes the attribute.
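
A minimal sketch of the three behaviors just described (reading,
setting, and deleting via undef). The element and URLs are made up for
illustration.

```perl
use strict;
use warnings;
use HTML::Element;

my $link = HTML::Element->new('a', href => 'http://www.perl.com/');

# Setting returns the OLD value.
my $old = $link->attr('href', 'http://www.cpan.org/');

# The attribute name is lowercased, so 'HREF' reads the same attribute.
my $now = $link->attr('HREF');

# Setting to undef deletes the attribute; reading it then gives undef.
$link->attr('href', undef);
my $gone = $link->attr('href');

print "was $old, became $now, now ",
      (defined $gone ? $gone : 'deleted'), "\n";

$link->delete;
```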
155
156 $h->tag() or $h->tag('tagname')
157
Returns (optionally sets) the tag name (also known as the generic
identifier) for the element $h. In setting, the tag name is always
converted to lower case.
161
162 There are four kinds of "pseudo-elements" that show up as HTML::Element
163 objects:
164
165 Comment pseudo-elements
166 These are element objects with a "$h->tag" value of "~comment", and
167 the content of the comment is stored in the "text" attribute
168 ("$h->attr("text")"). For example, parsing this code with
169 HTML::TreeBuilder...
170
171 <!-- I like Pie.
172 Pie is good
173 -->
174
175 produces an HTML::Element object with these attributes:
176
177 "_tag",
178 "~comment",
179 "text",
180 " I like Pie.\n Pie is good\n "
181
182 Declaration pseudo-elements
183 Declarations (rarely encountered) are represented as HTML::Element
184 objects with a tag name of "~declaration", and content in the
185 "text" attribute. For example, this:
186
187 <!DOCTYPE foo>
188
189 produces an element whose attributes include:
190
191 "_tag", "~declaration", "text", "DOCTYPE foo"
192
193 Processing instruction pseudo-elements
194 PIs (rarely encountered) are represented as HTML::Element objects
195 with a tag name of "~pi", and content in the "text" attribute. For
196 example, this:
197
198 <?stuff foo?>
199
200 produces an element whose attributes include:
201
202 "_tag", "~pi", "text", "stuff foo?"
203
204 (assuming a recent version of HTML::Parser)
205
206 ~literal pseudo-elements
207 These objects are not currently produced by HTML::TreeBuilder, but
208 can be used to represent a "super-literal" -- i.e., a literal you
209 want to be immune from escaping. (Yes, I just made that term up.)
210
That is, this is useful if you want to insert code into a tree that
you plan to dump out with "as_HTML", where you want, for some reason,
to suppress "as_HTML"'s normal behavior of amp-quoting text segments.
215
216 For example, this:
217
218 my $literal = HTML::Element->new('~literal',
219 'text' => 'x < 4 & y > 7'
220 );
221 my $span = HTML::Element->new('span');
222 $span->push_content($literal);
223 print $span->as_HTML;
224
225 prints this:
226
227 <span>x < 4 & y > 7</span>
228
229 Whereas this:
230
    my $span = HTML::Element->new('span');
    $span->push_content('x < 4 & y > 7');  # normal text segment
    print $span->as_HTML;

prints this:

    <span>x &lt; 4 &amp; y &gt; 7</span>
239
240 Unless you're inserting lots of pre-cooked code into existing
241 trees, and dumping them out again, it's not likely that you'll find
242 "~literal" pseudo-elements useful.
243
244 $h->parent() or $h->parent($new_parent)
245
Returns (optionally sets) the parent (aka "container") for this
element. The parent should either be undef, or should be another
element.

You should not use this to directly set the parent of an element.
Instead use any of the other methods under "Structure-Modifying
Methods", below.
252
253 Note that not($h->parent) is a simple test for whether $h is the root
254 of its subtree.
255
256 $h->content_list()
257
258 Returns a list of the child nodes of this element -- i.e., what nodes
259 (elements or text segments) are inside/under this element. (Note that
260 this may be an empty list.)
261
262 In a scalar context, this returns the count of the items, as you may
263 expect.
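
For instance, the following sketch tells text segments from element
children with "ref", and uses scalar context to count them. The
paragraph content is invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content('I like ', ['b', 'potatoes'], '!');

# Text segments are plain strings; element children are objects.
my @kinds;
foreach my $node ($p->content_list) {
    push @kinds, ref($node)
        ? 'element <' . $node->tag . '>'
        : "text \"$node\"";
}

my $count = $p->content_list;    # scalar context: number of children

print "$count children: ", join(', ', @kinds), "\n";

$p->delete;
```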
264
265 $h->content()
266
267 This somewhat deprecated method returns the content of this element;
268 but unlike content_list, this returns either undef (which you should
269 understand to mean no content), or a reference to the array of content
270 items, each of which is either a text segment (a string, i.e., a
271 defined non-reference scalar value), or an HTML::Element object. Note
272 that even if an arrayref is returned, it may be a reference to an empty
273 array.
274
275 While older code should feel free to continue to use "$h->content", new
276 code should use "$h->content_list" in almost all conceivable cases. It
277 is my experience that in most cases this leads to simpler code anyway,
278 since it means one can say:
279
280 @children = $h->content_list;
281
282 instead of the inelegant:
283
    @children = @{$h->content || []};
285
If you do use "$h->content" (or "$h->content_array_ref"), you should
not use the reference returned by it (assuming it returned a reference,
and not undef) to directly set or change the content of an element or
text segment! Instead use content_refs_list or any of the other
methods under "Structure-Modifying Methods", below.
291
292 $h->content_array_ref()
293
294 This is like "content" (with all its caveats and deprecations) except
295 that it is guaranteed to return an array reference. That is, if the
296 given node has no "_content" attribute, the "content" method would
297 return that undef, but "content_array_ref" would set the given node's
298 "_content" value to "[]" (a reference to a new, empty array), and
299 return that.
300
301 $h->content_refs_list
302
This returns a list of scalar references to each element of $h's
content list. This is useful in case you want to in-place edit any
large text segments without having to get a copy of the current value
of that segment, modify that copy, then use "splice_content" to
replace the old with the new. Instead, here you can in-place edit:
308
309 foreach my $item_r ($h->content_refs_list) {
310 next if ref $$item_r;
311 $$item_r =~ s/honour/honor/g;
312 }
313
You could currently achieve the same effect with:

    foreach my $item (@{ $h->content_array_ref }) {  # deprecated!
        next if ref $item;
        $item =~ s/honour/honor/g;
    }

...except that using the return value of "$h->content" or
"$h->content_array_ref" to do that is deprecated, and just might stop
working in the future.
325
326 $h->implicit() or $h->implicit($bool)
327
Returns (optionally sets) the "_implicit" attribute. This attribute is
a flag that's used for indicating that the element was not originally
present in the source, but was added to the parse tree (by
HTML::TreeBuilder, for example) in order to conform to the rules of
HTML structure.
333
334 $h->pos() or $h->pos($element)
335
336 Returns (and optionally sets) the "_pos" (for "current position")
337 pointer of $h. This attribute is a pointer used during some parsing
338 operations, whose value is whatever HTML::Element element at or under
339 $h is currently "open", where "$h->insert_element(NEW)" will actually
340 insert a new element.
341
(This has nothing to do with the Perl function called "pos", for
controlling where regular expression matching starts.)
344
345 If you set "$h->pos($element)", be sure that $element is either $h, or
346 an element under $h.
347
348 If you've been modifying the tree under $h and are no longer sure
349 "$h->pos" is valid, you can enforce validity with:
350
351 $h->pos(undef) unless $h->pos->is_inside($h);
352
353 $h->all_attr()
354
Returns all this element's attributes and values, as key-value pairs.
This will include any "internal" attributes (i.e., ones not present in
the original element, and which will not be represented if/when you
call "$h->as_HTML"). Internal attributes are distinguished by the fact
that the first character of their key (not value! key!) is an
underscore ("_").

Example output of "$h->all_attr()": "'_parent', [object_value],
'_tag', 'em', 'lang', 'en-US', '_content', [array-ref value]".
364
365 $h->all_attr_names()
366
367 Like all_attr, but only returns the names of the attributes.
368
Example output of "$h->all_attr_names()": "'_parent', '_tag', 'lang',
'_content'".
371
372 $h->all_external_attr()
373
374 Like "all_attr", except that internal attributes are not present.
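
A short sketch of the difference, assuming a freshly constructed element
(so the only internal attributes are whatever the constructor itself
sets, such as "_tag"):

```perl
use strict;
use warnings;
use HTML::Element;

my $em = HTML::Element->new('em', lang => 'en-US');

my %all      = $em->all_attr;            # internal and external
my %external = $em->all_external_attr;   # external only

my @external_names = sort keys %external;
my @internal_names = grep { /^_/ } sort keys %all;

print "external: @external_names\n";
print "internal: @internal_names\n";

$em->delete;
```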
375
376 $h->all_external_attr_names()
377
Like "all_attr_names", except that internal attributes' names are not
present.
380
381 $h->id() or $h->id($string)
382
383 Returns (optionally sets to $string) the "id" attribute.
384 "$h->id(undef)" deletes the "id" attribute.
385
386 $h->idf() or $h->idf($string)
387
Just like the "id" method, except that if you call "$h->idf()" and no
"id" attribute is defined for this element, then it's set to a
likely-to-be-unique value, and returned. (The "f" is for "force".)
391
393 These methods are provided for modifying the content of trees by adding
394 or changing nodes as parents or children of other nodes.
395
396 $h->push_content($element_or_text, ...)
397
Adds the specified items to the end of the content list of the element
$h. The items of content to be added should each be either a text
segment (a string), an HTML::Element object, or an arrayref. Arrayrefs
are fed thru "$h->new_from_lol(that_arrayref)" to convert them into
elements, before being added to the content list of $h. This means you
can say concise things like:
404
405 $body->push_content(
406 ['br'],
407 ['ul',
408 map ['li', $_], qw(Peaches Apples Pears Mangos)
409 ]
410 );
411
See the "new_from_lol" method's documentation, far below, for more
explanation.
414
415 The push_content method will try to consolidate adjacent text segments
416 while adding to the content list. That's to say, if $h's content_list
417 is
418
419 ('foo bar ', $some_node, 'baz!')
420
421 and you call
422
423 $h->push_content('quack?');
424
425 then the resulting content list will be this:
426
427 ('foo bar ', $some_node, 'baz!quack?')
428
429 and not this:
430
431 ('foo bar ', $some_node, 'baz!', 'quack?')
432
If the latter is what you want, you'll have to override the feature of
consolidating text by using splice_content, as in:
435
436 $h->splice_content(scalar($h->content_list),0,'quack?');
437
Similarly, if you wanted to add 'Skronk' to the beginning of the
content list, and called this:

    $h->unshift_content('Skronk');

then the resulting content list would be this:

    ('Skronkfoo bar ', $some_node, 'baz!')

and not this:

    ('Skronk', 'foo bar ', $some_node, 'baz!')

What you'd do to get the latter is:
452
453 $h->splice_content(0,0,'Skronk');
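
The consolidation behavior can be checked with a small sketch like the
following, which counts content items after each call. The element and
strings are made up for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $h = HTML::Element->new('p');

# Two pushes of text end up as ONE text segment.
$h->push_content('baz!');
$h->push_content('quack?');
my $after_push = $h->content_list;          # item count
my ($merged)   = $h->content_list;          # the merged string

# splice_content does not consolidate, so this adds a separate segment
# at the end...
$h->splice_content(scalar($h->content_list), 0, 'moo');
my $after_splice_end = $h->content_list;

# ...and this adds a separate segment at the beginning.
$h->splice_content(0, 0, 'Skronk');
my @final = $h->content_list;

print "after push: $after_push item(s), \"$merged\"\n";
print "final: ", join(' | ', @final), "\n";

$h->delete;
```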
454
455 $h->unshift_content($element_or_text, ...)
456
457 Just like "push_content", but adds to the beginning of the $h element's
458 content list.
459
460 The items of content to be added should each be either a text segment
461 (a string), an HTML::Element object, or an arrayref (which is fed thru
462 "new_from_lol").
463
The unshift_content method will try to consolidate adjacent text
segments while adding to the content list. See above for a discussion
of this.
467
468 $h->splice_content($offset, $length, $element_or_text, ...)
469
Detaches the elements from $h's list of content-nodes, starting at
$offset and continuing for $length items, replacing them with the
elements of the following list, if any. Returns the elements (if any)
removed from the content-list. If $offset is negative, then it starts
that far from the end of the array, just like Perl's normal "splice"
function. If $length and the following list are omitted, removes
everything from $offset onward.
477
478 The items of content to be added (if any) should each be either a text
479 segment (a string), an arrayref (which is fed thru "new_from_lol"), or
480 an HTML::Element object that's not already a child of $h.
481
482 $h->detach()
483
484 This unlinks $h from its parent, by setting its 'parent' attribute to
485 undef, and by removing it from the content list of its parent (if it
486 had one). The return value is the parent that was detached from (or
487 undef, if $h had no parent to start with). Note that neither $h nor
488 its parent are explicitly destroyed.
489
490 $h->detach_content()
491
492 This unlinks all of $h's children from $h, and returns them. Note that
493 these are not explicitly destroyed; for that, you can just use
494 $h->delete_content.
495
496 $h->replace_with( $element_or_text, ... )
497
498 This replaces $h in its parent's content list with the nodes specified.
499 The element $h (which by then may have no parent) is returned. This
500 causes a fatal error if $h has no parent. The list of nodes to insert
501 may contain $h, but at most once. Aside from that possible exception,
502 the nodes to insert should not already be children of $h's parent.
503
504 Also, note that this method does not destroy $h -- use
505 "$h->replace_with(...)->delete" if you need that.
506
507 $h->preinsert($element_or_text...)
508
509 Inserts the given nodes right BEFORE $h in $h's parent's content list.
510 This causes a fatal error if $h has no parent. None of the given nodes
511 should be $h or other children of $h. Returns $h.
512
513 $h->postinsert($element_or_text...)
514
515 Inserts the given nodes right AFTER $h in $h's parent's content list.
516 This causes a fatal error if $h has no parent. None of the given nodes
517 should be $h or other children of $h. Returns $h.
518
519 $h->replace_with_content()
520
521 This replaces $h in its parent's content list with its own content.
522 The element $h (which by then has no parent or content of its own) is
523 returned. This causes a fatal error if $h has no parent. Also, note
524 that this does not destroy $h -- use "$h->replace_with_content->delete"
525 if you need that.
526
527 $h->delete_content()
528
Clears the content of $h, calling "$h->delete" for each content
element. Compare with "$h->detach_content".
531
532 Returns $h.
533
534 $h->delete()
535
536 Detaches this element from its parent (if it has one) and explicitly
537 destroys the element and all its descendants. The return value is
538 undef.
539
Perl uses garbage collection based on reference counting; when no
references to a data structure exist, it's implicitly destroyed --
i.e., when no value anywhere points to a given object anymore, Perl
knows it can free up the memory that the now-unused object occupies.
544
545 But this fails with HTML::Element trees, because a parent element
546 always holds references to its children, and its children elements hold
547 references to the parent, so no element ever looks like it's not in
548 use. So, to destroy those elements, you need to call "$h->delete" on
549 the parent.
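
To contrast "detach" with "delete", here is a hedged sketch: a detached
node remains a usable, parentless subtree, while "delete" is what
actually tears a tree down. The list and its items are invented for the
example.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new_from_lol(['ul', ['li', 'one'], ['li', 'two']]);
my ($first, $second) = $ul->content_list;

# detach returns the parent it was removed from.
my $old_parent      = $first->detach;
my $returned_parent = ($old_parent == $ul);

# $first is now the root of its own one-element tree, and still usable.
my $first_is_root = !$first->parent;
my $first_text    = $first->as_text;
my $remaining     = $ul->content_list;    # count of children left in $ul

print "detached \"$first_text\"; $remaining item(s) left in the list\n";

# Each tree has to be deleted separately once it is no longer needed.
$ul->delete;       # destroys $ul and the item still under it
$first->delete;    # destroys the detached item
```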
550
551 $h->clone()
552
553 Returns a copy of the element (whose children are clones (recursively)
554 of the original's children, if any).
555
The returned element is parentless. Any '_pos' attributes present in
the source element/tree will be absent in the copy. For that and other
reasons, the clone of an HTML::TreeBuilder object that's in mid-parse
(i.e., the head of a tree that HTML::TreeBuilder is elaborating) cannot
(currently) be used to continue the parse.
561
562 You are free to clone HTML::TreeBuilder trees, just as long as: 1)
563 they're done being parsed, or 2) you don't expect to resume parsing
564 into the clone. (You can continue parsing into the original; it is
565 never affected.)
566
567 HTML::Element->clone_list(...nodes...)
568
569 Returns a list consisting of a copy of each node given. Text segments
570 are simply copied; elements are cloned by calling $it->clone on each of
571 them.
572
573 Note that this must be called as a class method, not as an instance
574 method. "clone_list" will croak if called as an instance method. You
575 can also call it like so:
576
577 ref($h)->clone_list(...nodes...)
578
579 $h->normalize_content
580
581 Normalizes the content of $h -- i.e., concatenates any adjacent text
582 nodes. (Any undefined text segments are turned into empty-strings.)
583 Note that this does not recurse into $h's descendants.
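
The effect is easiest to see after an operation that leaves adjacent
text segments behind. The sketch below uses "replace_with_content"
(described above) to unwrap an element, then normalizes the parent; the
markup is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(
    ['p', 'I ', ['font', 'like'], ' potatoes']
);

# Unwrap the font element, leaving its text in place.
my ($font) = $p->find_by_tag_name('font');
$font->replace_with_content->delete;

my $before = $p->content_list;    # three separate text segments
$p->normalize_content;
my $after  = $p->content_list;    # merged into one
my $text   = $p->as_text;

print "$before segment(s) before, $after after: \"$text\"\n";

$p->delete;
```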
584
585 $h->delete_ignorable_whitespace()
586
This traverses under $h and deletes any text segments that are
ignorable whitespace. You should not use this if $h is under a 'pre'
element.
589
590 $h->insert_element($element, $implicit)
591
Inserts (via push_content) a new element under the element at
"$h->pos()". Then updates "$h->pos()" to point to the inserted
element, unless $element is a prototypically empty element like "br",
"hr", "img", etc. The new "$h->pos()" is returned. This method is
useful only if your particular tree task involves setting "$h->pos()".
597
599 $h->dump()
600
601 $h->dump(*FH) ; # or *FH{IO} or $fh_obj
602
603 Prints the element and all its children to STDOUT (or to a specified
604 filehandle), in a format useful only for debugging. The structure of
605 the document is shown by indentation (no end tags).
606
607 $h->as_HTML() or $h->as_HTML($entities)
608
609 or $h->as_HTML($entities, $indent_char)
610
611 or $h->as_HTML($entities, $indent_char, \%optional_end_tags)
612
Returns a string representing in HTML the element and its descendants.
The optional argument $entities specifies a string of the entities to
encode. For compatibility with previous versions, specify '<>&' here.
If omitted or undef, all unsafe characters are encoded as HTML
entities. See HTML::Entities for details. If passed an empty string,
no entities are encoded.
619
If $indent_char is specified and defined, the HTML to be output is
indented, using the string you specify (which you probably should set
to "\t", or some number of spaces, if you specify it).
623
If "\%optional_end_tags" is specified and defined, it should be a
reference to a hash that holds a true value for every tag name whose
end tag is optional. Defaults to "\%HTML::Element::optionalEndTag",
which is an alias to %HTML::Tagset::optionalEndTag, which, at time of
writing, contains true values for "p, li, dt, dd". A useful value to
pass is an empty hashref, "{}", which means that no end-tags are
optional for this dump. Otherwise, possibly consider copying
%HTML::Tagset::optionalEndTag to a hash of your own, adding or deleting
values as you like, and passing a reference to that hash.
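
As an illustrative sketch of how the arguments combine (the paragraph
text is made up, and the exact whitespace of the output may differ
between versions of this module):

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(['p', 'x < 4 & y > 7']);

# Default: all unsafe characters encoded, optional end tags per the
# module's default table.
my $default = $p->as_HTML;

# Encode only <, > and &, as older versions did.
my $classic = $p->as_HTML('<>&');

# Same entities, no indenting, and an empty hashref so that no end
# tag is treated as optional.
my $closed = $p->as_HTML('<>&', undef, {});

print $default, "\n", $classic, "\n", $closed, "\n";

$p->delete;
```

Passing "{}" as the third argument is the simplest way to be sure that
every element you dump gets its end tag written out.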
633
634 $h->as_text()
635
636 $h->as_text(skip_dels => 1)
637
638 Returns a string consisting of only the text parts of the element's
639 descendants.
640
641 Text under 'script' or 'style' elements is never included in what's
642 returned. If "skip_dels" is true, then text content under "del" nodes
643 is not included in what's returned.
644
645 $h->as_trimmed_text(...)
646
This is just like as_text(...) except that leading and trailing
whitespace is deleted, and any internal whitespace is collapsed.
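
A small sketch comparing the two methods and the "skip_dels" option,
using invented content with deliberately untidy whitespace:

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new_from_lol(
    ['div', '  Hello ', ['del', 'cruel '], ['b', 'world'], "  \n"]
);

my $text          = $div->as_text;                          # whitespace kept
my $text_no_dels  = $div->as_text(skip_dels => 1);          # 'del' text dropped
my $trimmed       = $div->as_trimmed_text;                  # trimmed, collapsed
my $trimmed_clean = $div->as_trimmed_text(skip_dels => 1);

print "[$text]\n[$text_no_dels]\n[$trimmed]\n[$trimmed_clean]\n";

$div->delete;
```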
649
650 $h->as_XML()
651
652 Returns a string representing in XML the element and its descendants.
653
654 The XML is not indented.
655
656 $h->as_Lisp_form()
657
658 Returns a string representing the element and its descendants as a Lisp
659 form. Unsafe characters are encoded as octal escapes.
660
661 The Lisp form is indented, and contains external ("href", etc.) as
662 well as internal attributes ("_tag", "_content", "_implicit", etc.),
663 except for "_parent", which is omitted.
664
665 Current example output for a given element:
666
667 ("_tag" "img" "border" "0" "src" "pie.png" "usemap" "#main.map")
668
669 $h->starttag() or $h->starttag($entities)
670
671 Returns a string representing the complete start tag for the element.
672 I.e., leading "<", tag name, attributes, and trailing ">". All values
673 are surrounded with double-quotes, and appropriate characters are
674 encoded. If $entities is omitted or undef, all unsafe characters are
675 encoded as HTML entities. See HTML::Entities for details. If you
676 specify some value for $entities, remember to include the double-quote
677 character in it. (Previous versions of this module would basically
678 behave as if '&">' were specified for $entities.) If $entities is an
679 empty string, no entity is escaped.
680
681 $h->endtag()
682
683 Returns a string representing the complete end tag for this element.
684 I.e., "</", tag name, and ">".
685
These methods all involve some structural aspect of the tree; either
they report some aspect of the tree's structure, or they involve
traversal down the tree, or walking up the tree.
690
691 $h->is_inside('tag', ...) or $h->is_inside($element, ...)
692
693 Returns true if the $h element is, or is contained anywhere inside an
694 element that is any of the ones listed, or whose tag name is any of the
695 tag names listed.
696
697 $h->is_empty()
698
Returns true if $h has no content, i.e., has no elements or text
segments under it. In other words, this returns true if $h is a leaf
node, AKA a terminal node. Do not confuse this sense of "empty" with
another sense that it can have in SGML/HTML/XML terminology, which
means that the element in question is of the type (like HTML's "hr",
"br", "img", etc.) that can't have any content.
705
706 That is, a particular "p" element may happen to have no content, so
707 $that_p_element->is_empty will be true -- even though the prototypical
708 "p" element isn't "empty" (not in the way that the prototypical "hr"
709 element is).
710
711 If you think this might make for potentially confusing code, consider
712 simply using the clearer exact equivalent: not($h->content_list)
713
714 $h->pindex()
715
716 Return the index of the element in its parent's contents array, such
717 that $h would equal
718
719 $h->parent->content->[$h->pindex]
720 or
721 ($h->parent->content_list)[$h->pindex]
722
723 assuming $h isn't root. If the element $h is root, then $h->pindex
724 returns undef.
725
726 $h->left()
727
728 In scalar context: returns the node that's the immediate left sibling
729 of $h. If $h is the leftmost (or only) child of its parent (or has no
730 parent), then this returns undef.
731
732 In list context: returns all the nodes that're the left siblings of $h
733 (starting with the leftmost). If $h is the leftmost (or only) child of
734 its parent (or has no parent), then this returns empty-list.
735
736 (See also $h->preinsert(LIST).)
737
738 $h->right()
739
740 In scalar context: returns the node that's the immediate right sibling
741 of $h. If $h is the rightmost (or only) child of its parent (or has no
742 parent), then this returns undef.
743
744 In list context: returns all the nodes that're the right siblings of
745 $h, starting with the leftmost. If $h is the rightmost (or only) child
746 of its parent (or has no parent), then this returns empty-list.
747
748 (See also $h->postinsert(LIST).)
749
750 $h->address()
751
752 Returns a string representing the location of this node in the tree.
753 The address consists of numbers joined by a '.', starting with '0', and
754 followed by the pindexes of the nodes in the tree that are ancestors of
755 $h, starting from the top.
756
757 So if the way to get to a node starting at the root is to go to child 2
758 of the root, then child 10 of that, and then child 0 of that, and then
759 you're there -- then that node's address is "0.2.10.0".
760
761 As a bit of a special case, the address of the root is simply "0".
762
I foresee this being used mainly for debugging, but you may find your
own uses for it.
765
766 $h->address($address)
767
768 This returns the node (whether element or text-segment) at the given
769 address in the tree that $h is a part of. (That is, the address is
770 resolved starting from $h->root.)
771
772 If there is no node at the given address, this returns undef.
773
774 You can specify "relative addressing" (i.e., that indexing is supposed
775 to start from $h and not from $h->root) by having the address start
776 with a period -- e.g., $h->address(".3.2") will look at child 3 of $h,
777 and child 2 of that.
778
779 $h->depth()
780
781 Returns a number expressing $h's depth within its tree, i.e., how many
782 steps away it is from the root. If $h has no parent (i.e., is root),
783 its depth is 0.
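
The positional methods above ("pindex", "left", "right", "address" and
"depth") can be tried together on a small tree. This is only a sketch
with an invented three-item list:

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new_from_lol(
    ['ul', ['li', 'a'], ['li', 'b'], ['li', 'c']]
);
my @items  = $ul->content_list;
my $middle = $items[1];

my $index   = $middle->pindex;     # position in the parent's content
my $left    = $middle->left;       # scalar context: immediate left sibling
my @right   = $middle->right;      # list context: all right siblings
my $address = $middle->address;    # root is "0", then the pindexes
my $depth   = $middle->depth;      # steps from the root

# Looking nodes up by address: absolute, then relative to $ul.
my $third = $ul->address('0.2');
my $text  = $ul->address('.2.0');  # child 2 of $ul, then its child 0

print "index $index, address $address, depth $depth, text \"$text\"\n";

$ul->delete;
```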
784
785 $h->root()
786
787 Returns the element that's the top of $h's tree. If $h is root, this
788 just returns $h. (If you want to test whether $h is the root, instead
789 of asking what its root is, just test "not($h->parent)".)
790
791 $h->lineage()
792
793 Returns the list of $h's ancestors, starting with its parent, and then
794 that parent's parent, and so on, up to the root. If $h is root, this
795 returns an empty list.
796
797 If you simply want a count of the number of elements in $h's lineage,
798 use $h->depth.
799
800 $h->lineage_tag_names()
801
802 Returns the list of the tag names of $h's ancestors, starting with its
803 parent, and that parent's parent, and so on, up to the root. If $h is
804 root, this returns an empty list. Example output: "('em', 'td', 'tr',
805 'table', 'body', 'html')"
806
807 $h->descendants()
808
In list context, returns the list of all $h's descendant elements,
listed in pre-order (i.e., an element appears before its
content-elements). Text segments DO NOT appear in the list. In scalar
context, returns a count of all such elements.
813
814 $h->descendents()
815
816 This is just an alias to the "descendants" method.
817
818 $h->find_by_tag_name('tag', ...)
819
820 In list context, returns a list of elements at or under $h that have
821 any of the specified tag names. In scalar context, returns the first
822 (in pre-order traversal of the tree) such element found, or undef if
823 none.
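
A sketch of the list-context and scalar-context behavior, including the
"at or under" rule (the starting element itself can match). The nested
markup is invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['div', ['p', 'one'], ['div', ['p', 'two']]]
);

my @paragraphs = $tree->find_by_tag_name('p');      # all matches
my $first      = $tree->find_by_tag_name('p');      # first in pre-order
my @divs       = $tree->find_by_tag_name('div');    # includes $tree itself
my $none       = $tree->find_by_tag_name('table');  # undef: no match

my $first_text = $first->as_text;
print scalar(@paragraphs), " p elements, first says \"$first_text\"; ",
      scalar(@divs), " div elements\n";

my $tree_matches_itself = ($divs[0] == $tree);

$tree->delete;
```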
824
825 $h->find('tag', ...)
826
827 This is just an alias to "find_by_tag_name". (There was once going to
828 be a whole find_* family of methods, but then look_down filled that
829 niche, so there turned out not to be much reason for the verboseness of
830 the name "find_by_tag_name".)
831
832 $h->find_by_attribute('attribute', 'value')
833
834 In a list context, returns a list of elements at or under $h that have
835 the specified attribute, and have the given value for that attribute.
836 In a scalar context, returns the first (in pre-order traversal of the
837 tree) such element found, or undef if none.
838
839 This method is deprecated in favor of the more expressive "look_down"
840 method, which new code should use instead.
841
842 $h->look_down( ...criteria... )
843
844 This starts at $h and looks thru its element descendants (in
845 pre-order), looking for elements matching the criteria you specify. In
846 list context, returns all elements that match all the given criteria;
847 in scalar context, returns the first such element (or undef, if nothing
848 matched).
849
850 There are three kinds of criteria you can specify:
851
852 (attr_name, attr_value)
853 This means you're looking for an element with that value for that
854 attribute. Example: "alt", "pix!". Consider that you can search
855 on internal attribute values too: "_tag", "p".
856
857 (attr_name, qr/.../)
858 This means you're looking for an element whose value for that
859 attribute matches the specified Regexp object.
860
861 a coderef
862 This means you're looking for elements where
863 coderef->(each_element) returns true. Example:
864
865 my @wide_pix_images
866 = $h->look_down(
867 "_tag", "img",
868 "alt", "pix!",
869 sub { $_[0]->attr('width') > 350 }
870 );
871
872 Note that "(attr_name, attr_value)" and "(attr_name, qr/.../)" criteria
873 are almost always faster than coderef criteria, so should presumably be
874 put before them in your list of criteria. That is, in the example
875 above, the sub ref is called only for elements that have already passed
876 the criteria of having a "_tag" attribute with value "img", and an
877 "alt" attribute with value "pix!". If the coderef were first, it would
878 be called on every element, and then what elements pass that criterion
879 (i.e., elements for which the coderef returned true) would be checked
880 for their "_tag" and "alt" attributes.
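
The ordering advice above can be sketched as follows. The tree, file names, and widths are invented for illustration, and the $calls counter is only there to show how often the coderef runs:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical tree with three images.
my $h = HTML::Element->new_from_lol(
  ['div',
    ['img', {src => 'a.png', alt => 'pix!', width => 400}],
    ['img', {src => 'b.jpg', alt => 'pix!', width => 100}],
    ['img', {src => 'c.png', alt => 'logo', width => 500}],
  ]
);

# The (attr_name, attr_value) and (attr_name, qr/.../) criteria come
# first, so the coderef runs only on elements that already passed them.
my $calls = 0;
my @wide_pngs = $h->look_down(
  '_tag', 'img',
  'src',  qr/\.png$/,
  sub { $calls++; $_[0]->attr('width') > 350 },
);
my @srcs = map { $_->attr('src') } @wide_pngs;
print "matched: @srcs (coderef ran $calls times)\n";

# Scalar context returns just the first match (or undef).
my $first     = $h->look_down('_tag', 'img', 'alt', 'pix!');
my $first_src = $first->attr('src');
print "first 'pix!' image: $first_src\n";

$h->delete;
```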
881
882 Note that comparison of string attribute-values against the string
883 value in "(attr_name, attr_value)" is case-INsensitive! A criterion of
884 "('align', 'right')" will match an element whose "align" value is
885 "RIGHT", or "right" or "rIGhT", etc.
886
887 Note also that "look_down" considers "" (empty-string) and undef to be
888 different things, in attribute values. So this:
889
890 $h->look_down("alt", "")
891
892 will find elements with an "alt" attribute, but where the value for the
893 "alt" attribute is "". But this:
894
895 $h->look_down("alt", undef)
896
897 is the same as:
898
899 $h->look_down(sub { !defined($_[0]->attr('alt')) } )
900
901 That is, it finds elements that do not have an "alt" attribute at all
902 (or that do have an "alt" attribute, but with a value of undef -- which
903 is not normally possible).
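
To make the distinction concrete, here is a small sketch over an invented tree in which one image has an empty "alt", one has none, and one has a real value:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical tree; the src values exist only to tell the images apart.
my $h = HTML::Element->new_from_lol(
  ['p',
    ['img', {src => 'x.png', alt => ''}],         # alt present, but empty
    ['img', {src => 'y.png'}],                    # no alt at all
    ['img', {src => 'z.png', alt => 'A photo'}],  # ordinary alt text
  ]
);

# "" matches only elements whose alt attribute is the empty string.
my @empty_alt = map { $_->attr('src') }
                $h->look_down('_tag', 'img', 'alt', '');

# undef matches only elements with no (defined) alt attribute.
my @no_alt    = map { $_->attr('src') }
                $h->look_down('_tag', 'img', 'alt', undef);

print "empty alt: @empty_alt\n";
print "no alt:    @no_alt\n";

$h->delete;
```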
904
905 Note that when you give several criteria, this is taken to mean you're
906 looking for elements that match all your criteria, not just any of
907 them. In other words, there is an implicit "and", not an "or". So if
908 you wanted to express that you wanted to find elements with a "name"
909 attribute with the value "foo" or with an "id" attribute with the value
910 "baz", you'd have to do it like:
911
912 @them = $h->look_down(
913 sub {
914 # the lcs are to fold case; the "|| ''" avoids warnings on undef
915 lc($_[0]->attr('name') || '') eq 'foo'
916 or lc($_[0]->attr('id') || '') eq 'baz'
917 }
918 );
919
920 Coderef criteria are more expressive than "(attr_name, attr_value)" and
921 "(attr_name, qr/.../)" criteria, and all "(attr_name, attr_value)" and
922 "(attr_name, qr/.../)" criteria could be expressed in terms of
923 coderefs. However, "(attr_name, attr_value)" and "(attr_name,
924 qr/.../)" criteria are a convenient shorthand. (In fact, "look_down"
925 itself is basically "shorthand" too, since anything you can do with
926 "look_down" you could do by traversing the tree, either with the
927 "traverse" method or with a routine of your own. However, "look_down"
928 often makes for very concise and clear code.)
929
930 $h->look_up( ...criteria... )
931
932 This is identical to $h->look_down, except that whereas $h->look_down
933 basically scans over the list:
934
935 ($h, $h->descendants)
936
937 $h->look_up instead scans over the list
938
939 ($h, $h->lineage)
940
941 So, for example, this returns all ancestors of $h (possibly including
942 $h itself) that are "td" elements with an "align" attribute with a
943 value of "right" (or "RIGHT", etc.):
944
945 $h->look_up("_tag", "td", "align", "right");
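
As a rough illustration, the sketch below starts from a deeply nested element and looks upward. The table structure is invented, and the always-true coderef is just a way to list everything look_up scans:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical tree: an "i" element nested inside a right-aligned cell.
my $table = HTML::Element->new_from_lol(
  ['table',
    ['tr',
      ['td', {align => 'RIGHT'},
        ['b', ['i', 'deep']],
      ],
    ],
  ]
);
my $i = $table->look_down('_tag', 'i');

# Scalar context: the nearest match, starting at $i and working upward.
my $cell     = $i->look_up('_tag', 'td', 'align', 'right');
my $cell_tag = defined $cell ? $cell->tag : '(none)';
print "enclosing cell: $cell_tag\n";

# List context with an always-true coderef shows the scan order:
# $i itself first, then each ancestor in turn.
my @scanned = map { $_->tag } $i->look_up(sub { 1 });
print "scan order: @scanned\n";

$table->delete;
```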
946
947 $h->traverse(...options...)
948
949 Lengthy discussion of HTML::Element's unnecessary and confusing
950 "traverse" method has been moved to a separate file:
951 HTML::Element::traverse
952
953 $h->attr_get_i('attribute')
954
955 In list context, returns a list consisting of the values of the given
956 attribute for $self and for all its ancestors starting from $self and
957 working its way up. Nodes with no such attribute are skipped.
958 ("attr_get_i" stands for "attribute get, with inheritance".) In scalar
959 context, returns the first such value, or undef if none.
960
961 Consider a document consisting of:
962
963 <html lang='i-klingon'>
964 <head><title>Pati Pata</title></head>
965 <body>
966 <h1 lang='la'>Stuff</h1>
967 <p lang='es-MX' align='center'>
968 Foo bar baz <cite>Quux</cite>.
969 </p>
970 <p>Hooboy.</p>
971 </body>
972 </html>
973
974 If $h is the "cite" element, $h->attr_get_i("lang") in list context
975 will return the list ('es-MX', 'i-klingon'). In scalar context, it
976 will return the value 'es-MX'.
977
978 If you call with multiple attribute names...
979
980 $h->attr_get_i('a1', 'a2', 'a3')
981
982 ...in list context, this will return a list consisting of the values of
983 these attributes which exist in $self and its ancestors. In scalar
984 context, this returns the first value (i.e., the value of the first
985 existing attribute from the first element that has any of the
986 attributes listed). So, in the above example,
987
988 $h->attr_get_i('lang', 'align');
989
990 will return:
991
992 ('es-MX', 'center', 'i-klingon') # in list context
993 or
994 'es-MX' # in scalar context.
995
996 But note that this:
997
998 $h->attr_get_i('align', 'lang');
999
1000 will return:
1001
1002 ('center', 'es-MX', 'i-klingon') # in list context
1003 or
1004 'center' # in scalar context.
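
The calls above can be tried against the sample document by rebuilding it by hand. This is only a sketch: the tree is constructed with new_from_lol rather than parsed, and the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Element;

# The sample document from above, rebuilt as a list-of-lists.
my $doc = HTML::Element->new_from_lol(
  ['html', {lang => 'i-klingon'},
    ['head', ['title', 'Pati Pata']],
    ['body',
      ['h1', {lang => 'la'}, 'Stuff'],
      ['p', {lang => 'es-MX', align => 'center'},
        'Foo bar baz ', ['cite', 'Quux'], '.'],
      ['p', 'Hooboy.'],
    ],
  ]
);
my $cite = $doc->find_by_tag_name('cite');

# One attribute name: values from $cite upward, skipping elements
# (such as "body") that lack the attribute.
my @langs = $cite->attr_get_i('lang');
my $lang  = $cite->attr_get_i('lang');    # scalar context: nearest value

# Several names: within each element, values come out in the order
# in which the names were given.
my @lang_align = $cite->attr_get_i('lang', 'align');
my @align_lang = $cite->attr_get_i('align', 'lang');

print "langs:      @langs\n";
print "nearest:    $lang\n";
print "lang,align: @lang_align\n";
print "align,lang: @align_lang\n";

$doc->delete;
```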
1005
1006 $h->tagname_map()
1007
1008 Scans across $h and all its descendants, and makes a hash (a reference
1009 to which is returned) where each entry consists of a key that's a tag
1010 name, and a value that's a reference to a list of all elements that
1011 have that tag name. I.e., this method returns:
1012
1013 {
1014 # Across $h and all descendants...
1015 'a' => [ ...list of all 'a' elements... ],
1016 'em' => [ ...list of all 'em' elements... ],
1017 'img' => [ ...list of all 'img' elements... ],
1018 }
1019
1020 (There are entries in the hash for only those tagnames that occur
1021 at/under $h -- so if there are no "img" elements, there'll be no "img"
1022 entry in the hash(ref) returned.)
1023
1024 Example usage:
1025
1026 my $map_r = $h->tagname_map();
1027 my @heading_tags = sort grep m/^h\d$/s, keys %$map_r;
1028 if(@heading_tags) {
1029 print "Heading levels used: @heading_tags\n";
1030 } else {
1031 print "No headings.\n"
1032 }
1033
1034 $h->extract_links() or $h->extract_links(@wantedTypes)
1035
1036 Returns links found by traversing the element and all of its descendants
1037 and looking for attributes (like "href" in an "a" element, or "src" in
1038 an "img" element) whose values represent links. The return value is a
1039 reference to an array. Each element of the array is a reference to an
1040 array with four items: the link-value, the element that has the
1041 attribute with that link-value, the name of that attribute, and the
1042 tagname of that element. (Example: ['http://www.suck.com/', $elem_obj,
1043 'href', 'a'].) You may or may not end up using the element itself --
1044 for some purposes, you may use only the link value.
1045
1046 You might specify that you want to extract links from just some kinds
1047 of elements (instead of the default, which is to extract links from all
1048 the kinds of elements known to have attributes whose values represent
1049 links). For instance, if you want to extract links from only "a" and
1050 "img" elements, you could code it like this:
1051
1052 for (@{ $e->extract_links('a', 'img') }) {
1053 my($link, $element, $attr, $tag) = @$_;
1054 print
1055 "Hey, there's a $tag that links to ",
1056 $link, ", in its $attr attribute, at ",
1057 $element->address(), ".\n";
1058 }
1059
1060 $h->simplify_pres
1061
1062 In text bits under PRE elements that are at/under $h, this routine
1063 nativizes all newlines, and expands all tabs.
1064
1065 That is, if you read a file with lines delimited by "\cm\cj"'s, the
1066 text under PRE areas will have "\cm\cj"'s instead of "\n"'s. Calling
1067 $h->simplify_pres on such a tree will turn "\cm\cj"'s into
1068 "\n"'s.
1069
1070 Tabs are expanded to however many spaces it takes to get to the next
1071 8th column -- the usual way of expanding them.
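
A small sketch of both effects, assuming a PRE element whose text was read with "\cm\cj" line endings and contains tabs (the sample text is invented):

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical PRE element: two lines joined by "\cm\cj", each with a tab.
my $pre = HTML::Element->new_from_lol(
  ['pre', "a\tb\cm\cjcd\tE"]
);

$pre->simplify_pres;

# The single text segment is modified in place.
my ($text) = $pre->content_list;

# Show the result with spaces made visible as dots.
(my $visible = $text) =~ tr/ /./;
print "$visible\n";

$pre->delete;
```

After the call, the text should contain neither tabs nor "\cm" characters, and the characters after each tab should start at the next multiple of eight columns.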
1072
1073 $h->same_as($i)
1074
1075 Returns true if $h and $i are both elements representing the same tree
1076 of elements, each with the same tag name, with the same explicit
1077 attributes (i.e., not counting attributes whose names start with "_"),
1078 and with the same content (textual, comments, etc.).
1079
1080 Sameness of descendant elements is tested, recursively, with
1081 "$child1->same_as($child2)", and sameness of text segments is tested
1082 with "$segment1 eq $segment2".
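
As an illustration, the sketch below builds two separate but identical trees and compares them before and after changes. The helper make_tree and the attribute names are hypothetical:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical helper: returns a fresh, independent tree on each call.
sub make_tree {
  return HTML::Element->new_from_lol(
    ['p', {class => 'note'}, 'Hello ', ['b', 'world']]
  );
}

my $x = make_tree();
my $y = make_tree();

# Different objects, same structure, attributes, and text.
my $same_at_start = $x->same_as($y) ? 1 : 0;

# Attributes whose names start with "_" are not compared.
$y->attr('_scratch', 'anything');
my $same_with_internal = $x->same_as($y) ? 1 : 0;

# An explicit attribute that differs makes the trees unequal.
$y->attr('class', 'warning');
my $same_after_change = $x->same_as($y) ? 1 : 0;

print "start: $same_at_start, internal attr: $same_with_internal, ",
      "changed class: $same_after_change\n";

$_->delete for $x, $y;
```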
1083
1084 $h = HTML::Element->new_from_lol(ARRAYREF)
1085
1086 Recursively constructs a tree of nodes, based on the (non-cyclic) data
1087 structure represented by ARRAYREF, where that is a reference to an
1088 array of arrays (of arrays (of arrays (etc.))).
1089
1090 In each arrayref in that structure, different kinds of values are
1091 treated as follows:
1092
1093 * Arrayrefs
1094 Arrayrefs are considered to designate a sub-tree representing chil‐
1095 dren for the node constructed from the current arrayref.
1096
1097 * Hashrefs
1098 Hashrefs are considered to contain attribute-value pairs to add to
1099 the element to be constructed from the current arrayref.
1100
1101 * Text segments
1102 Text segments at the start of any arrayref will be considered to
1103 specify the name of the element to be constructed from the current
1104 arrayref; all other text segments will be considered to specify
1105 text segments as children for the current arrayref.
1106
1107 * Elements
1108 Existing element objects are either inserted into the treelet con‐
1109 structed, or clones of them are. That is, when the lol-tree is
1110 being traversed and elements constructed based on what's in it, if an
1111 existing element object is found, if it has no parent, then it is
1112 added directly to the treelet constructed; but if it has a parent,
1113 then "$that_node->clone" is added to the treelet at the appropriate
1114 place.
1115
1116 An example will hopefully make this more obvious:
1117
1118 my $h = HTML::Element->new_from_lol(
1119 ['html',
1120 ['head',
1121 [ 'title', 'I like stuff!' ],
1122 ],
1123 ['body',
1124 {'lang', 'en-JP', _implicit => 1},
1125 'stuff',
1126 ['p', 'um, p < 4!', {'class' => 'par123'}],
1127 ['div', {foo => 'bar'}, '123'],
1128 ]
1129 ]
1130 );
1131 $h->dump;
1132
1133 Will print this:
1134
1135 <html> @0
1136 <head> @0.0
1137 <title> @0.0.0
1138 "I like stuff!"
1139 <body lang="en-JP"> @0.1 (IMPLICIT)
1140 "stuff"
1141 <p class="par123"> @0.1.1
1142 "um, p < 4!"
1143 <div foo="bar"> @0.1.2
1144 "123"
1145
1146 And printing $h->as_HTML will give something like:
1147
1148 <html><head><title>I like stuff!</title></head>
1149 <body lang="en-JP">stuff<p class="par123">um, p &lt; 4!
1150 <div foo="bar">123</div></body></html>
1151
1152 You can even do fancy things with "map":
1153
1154 $body->push_content(
1155 # push_content implicitly calls new_from_lol on arrayrefs...
1156 ['br'],
1157 ['blockquote',
1158 ['h2', 'Pictures!'],
1159 map ['p', $_],
1160 $body2->look_down("_tag", "img"),
1161 # images, to be copied from that other tree.
1162 ],
1163 # and more stuff:
1164 ['ul',
1165 map ['li', ['a', {'href'=>"$_.png"}, $_ ] ],
1166 qw(Peaches Apples Pears Mangos)
1167 ],
1168 );
1169
1170 @elements = HTML::Element->new_from_lol(ARRAYREFS)
1171
1172 Constructs several elements, by calling new_from_lol for every arrayref
1173 in the ARRAYREFS list.
1174
1175 @elements = HTML::Element->new_from_lol(
1176 ['hr'],
1177 ['p', 'And there, on the door, was a hook!'],
1178 );
1179 # constructs two elements.
1180
1181 $h->objectify_text()
1182
1183 This turns any text nodes under $h from mere text segments (strings)
1184 into real objects, pseudo-elements with a tag-name of "~text", and the
1185 actual text content in an attribute called "text". (For a discussion
1186 of pseudo-elements, see the "tag" method, far above.) This method is
1187 provided because, for some purposes, it is convenient or necessary to
1188 be able, for a given text node, to ask what element is its parent; and
1189 clearly this is not possible if a node is just a text string.
1190
1191 Note that these "~text" objects are not recognized as text nodes by
1192 methods like as_text. Presumably you will want to call
1193 $h->objectify_text, perform whatever task you needed it for, and then
1194 call $h->deobjectify_text before calling anything like $h->as_text.
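
That round trip might look like the following sketch, which uses the temporary "~text" objects to report the parent of each text segment. The tree and the @report array are illustrative only:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical paragraph with text at two different depths.
my $h = HTML::Element->new_from_lol(
  ['p', 'Some ', ['em', 'emphasized'], ' text.']
);

# Turn plain strings into "~text" pseudo-elements...
$h->objectify_text;

# ...so that each text node can be asked for its parent.
my @report;
for my $t ($h->look_down('_tag', '~text')) {
  push @report, $t->parent->tag . ': ' . $t->attr('text');
}
print "$_\n" for @report;

# Turn them back into plain strings before calling as_text.
$h->deobjectify_text;
my $plain = $h->as_text;
print "$plain\n";

$h->delete;
```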
1195
1196 $h->deobjectify_text()
1197
1198 This undoes the effect of $h->objectify_text. That is, it takes any
1199 "~text" pseudo-elements in the tree at/under $h, and deletes each one,
1200 replacing each with the content of its "text" attribute.
1201
1202 Note that if $h itself is a "~text" pseudo-element, it will be
1203 destroyed -- a condition you may need to treat specially in your call‐
1204 ing code (since it means you can't very well do anything with $h after
1205 that). So that you can detect that condition, if $h is itself a
1206 "~text" pseudo-element, then this method returns the value of the
1207 "text" attribute, which should be a defined value; in all other cases,
1208 it returns undef.
1209
1210 (This method assumes that no "~text" pseudo-element has any children.)
1211
1212 $h->number_lists()
1213
1214 For every UL, OL, DIR, and MENU element at/under $h, this sets a
1215 "_bullet" attribute for every child LI element. For LI children of an OL,
1216 the "_bullet" attribute's value will be something like "4.", "d.",
1217 "D.", "IV.", or "iv.", depending on the OL element's "type" attribute.
1218 LI children of a UL, DIR, or MENU get their "_bullet" attribute set to
1219 "*". There should be no other LIs (i.e., except as children of OL, UL,
1220 DIR, or MENU elements), and if there are, they are unaffected.
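
A brief sketch of the effect on an invented tree holding a default OL, an OL with type "a", and a UL:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical tree with three lists.
my $h = HTML::Element->new_from_lol(
  ['div',
    ['ol', ['li', 'one'], ['li', 'two']],
    ['ol', {type => 'a'}, ['li', 'alpha'], ['li', 'beta']],
    ['ul', ['li', 'dot']],
  ]
);

$h->number_lists;

# Read back the "_bullet" attribute that was set on each LI.
my @bullets = map { $_->attr('_bullet') } $h->find_by_tag_name('li');
print "bullets: @bullets\n";

$h->delete;
```

Because "_bullet" is an internal attribute (its name starts with "_"), it is there for your own rendering code to read rather than being an HTML attribute.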
1221
1222 $h->has_insane_linkage
1223
1224 This method is for testing whether this element or the elements under
1225 it have linkage attributes (_parent and _content) whose values are
1226 deeply aberrant: if there are undefs in a content list; if an element
1227 appears in the content lists of more than one element; if the _parent
1228 attribute of an element doesn't match its actual parent; or if an ele‐
1229 ment appears as its own descendant (i.e., if there is a cyclicity in
1230 the tree).
1231
1232 This returns an empty list (or false, in scalar context) if the subtree's
1233 linkage attributes are sane; otherwise it returns two items (or true, in
1234 scalar context): the element where the error occurred, and a string
1235 describing the error.
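
A minimal sketch of how the check might be called in both contexts, here on a small, correctly built tree (so no problem is expected to be reported):

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical tree, built only through documented methods.
my $h = HTML::Element->new_from_lol(
  ['ul', ['li', 'one'], ['li', 'two']]
);

# List context: (element, description) on error, empty list otherwise.
my ($where, $why) = $h->has_insane_linkage;
if (defined $where) {
  print "Problem at ", $where->address, ": $why\n";
} else {
  print "Linkage looks sane.\n";
}

# Scalar context: a plain true/false answer.
my $insane = $h->has_insane_linkage ? 1 : 0;

$h->delete;
```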
1236
1237 This method is provided mainly for debugging and troubleshooting --
1238 it should be quite impossible for any document constructed via
1239 HTML::TreeBuilder to parse into a non-sane tree (since it's not the
1240 content of the tree per se that's in question, but whether the tree in
1241 memory was properly constructed); and it should be impossible for you
1242 to produce an insane tree just thru reasonable use of normal documented
1243 structure-modifying methods. But if you're constructing your own
1244 trees, and your program is going into infinite loops during calls to
1245 traverse() or any of the secondary structural methods, as part of
1246 debugging, consider calling has_insane_linkage on the tree.
1247
1249 * If you want to free the memory associated with a tree built of
1250 HTML::Element nodes, then you will have to delete it explicitly. See
1251 the $h->delete method, above.
1252
1253 * There's almost nothing to stop you from making a "tree" with cyclici‐
1254 ties (loops) in it, which could, for example, make the traverse method
1255 go into an infinite loop. So don't make cyclicities! (If all you're
1256 doing is parsing HTML files, and looking at the resulting trees, this
1257 will never be a problem for you.)
1258
1259 * There's no way to represent comments or processing directives in a
1260 tree with HTML::Elements. Not yet, at least.
1261
1262 * There's (currently) nothing to stop you from using an undefined value
1263 as a text segment. If you're running under "perl -w", however, this
1264 may make HTML::Element's code produce a slew of warnings.
1265
1267 You are welcome to derive subclasses from HTML::Element, but you should
1268 be aware that the code in HTML::Element makes certain assumptions about
1269 elements (and I'm using "element" to mean ONLY an object of class
1270 HTML::Element, or of a subclass of HTML::Element):
1271
1272 * The value of an element's _parent attribute must either be undef or
1273 otherwise false, or must be an element.
1274
1275 * The value of an element's _content attribute must either be undef or
1276 otherwise false, or a reference to an (unblessed) array. The array may
1277 be empty; but if it has items, they must ALL be either mere strings
1278 (text segments), or elements.
1279
1280 * The value of an element's _tag attribute should, at least, be a
1281 string of printable characters.
1282
1283 Moreover, bear these rules in mind:
1284
1285 * Do not break encapsulation on objects. That is, access their con‐
1286 tents only thru $obj->attr or more specific methods.
1287
1288 * You should think twice before completely overriding any of the meth‐
1289 ods that HTML::Element provides. (Overriding with a method that calls
1290 the superclass method is not so bad, though.)
1291
1293 HTML::Tree; HTML::TreeBuilder; HTML::AsSubs; HTML::Tagset; and, for the
1294 morbidly curious, HTML::Element::traverse.
1295
1297 Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy
1298 Lester, 2006 Pete Krawczyk.
1299
1300 This library is free software; you can redistribute it and/or modify it
1301 under the same terms as Perl itself.
1302
1303 This program is distributed in the hope that it will be useful, but
1304 without any warranty; without even the implied warranty of mer‐
1305 chantability or fitness for a particular purpose.
1306
1308 Currently maintained by Pete Krawczyk <petek@cpan.org>.
1309
1310 Original authors: Gisle Aas, Sean Burke and Andy Lester.
1311
1312 Thanks to Mark-Jason Dominus for a POD suggestion.
1313
1314
1315
1316perl v5.8.8 2006-08-04 HTML::Element(3)