HTML::Element(3) User Contributed Perl Documentation HTML::Element(3)

NAME
HTML::Element - Class for objects that represent HTML elements

VERSION
Version 3.23
10
SYNOPSIS
    use HTML::Element;
    $a = HTML::Element->new('a', href => 'http://www.perl.com/');
    $a->push_content("The Perl Homepage");

    $tag = $a->tag;
    print "$tag starts out as:", $a->starttag, "\n";
    print "$tag ends as:", $a->endtag, "\n";
    print "$tag\'s href attribute is: ", $a->attr('href'), "\n";

    $links_r = $a->extract_links();
    print "Hey, I found ", scalar(@$links_r), " links.\n";

    print "And that, as HTML, is: ", $a->as_HTML, "\n";
    $a = $a->delete;
26
DESCRIPTION
(This class is part of the HTML::Tree distribution.)

Objects of the HTML::Element class can be used to represent elements of
HTML document trees. These objects have attributes, notably attributes
that designate each element's parent and content. The content is an
array of text segments and other HTML::Element objects. A tree with
HTML::Element objects as nodes can represent the syntax tree for an HTML
document.
36
HOW WE REPRESENT TREES
Consider this HTML document:

    <html lang='en-US'>
      <head>
        <title>Stuff</title>
        <meta name='author' content='Jojo'>
      </head>
      <body>
        <h1>I like potatoes!</h1>
      </body>
    </html>
49
50 Building a syntax tree out of it makes a tree-structure in memory that
51 could be diagrammed as:
52
                 html (lang='en-US')
                  / \
                /     \
              /         \
            head        body
           /\               \
         /    \               \
       /        \               \
     title     meta              h1
      |       (name='author',     |
   "Stuff"    content='Jojo')    "I like potatoes"
64
65 This is the traditional way to diagram a tree, with the "root" at the
66 top, and it's this kind of diagram that people have in mind when they
67 say, for example, that "the meta element is under the head element
68 instead of under the body element". (The same is also said with
69 "inside" instead of "under" -- the use of "inside" makes more sense
70 when you're looking at the HTML source.)
71
72 Another way to represent the above tree is with indenting:
73
    html (attributes: lang='en-US')
      head
        title
          "Stuff"
        meta (attributes: name='author' content='Jojo')
      body
        h1
          "I like potatoes"
82
83 Incidentally, diagramming with indenting works much better for very
84 large trees, and is easier for a program to generate. The
85 "$tree->dump" method uses indentation just that way.
86
87 However you diagram the tree, it's stored the same in memory -- it's a
88 network of objects, each of which has attributes like so:
89
    element #1:  _tag: 'html'
                 _parent: none
                 _content: [element #2, element #5]
                 lang: 'en-US'

    element #2:  _tag: 'head'
                 _parent: element #1
                 _content: [element #3, element #4]

    element #3:  _tag: 'title'
                 _parent: element #2
                 _content: [text segment "Stuff"]

    element #4:  _tag: 'meta'
                 _parent: element #2
                 _content: none
                 name: author
                 content: Jojo

    element #5:  _tag: 'body'
                 _parent: element #1
                 _content: [element #6]

    element #6:  _tag: 'h1'
                 _parent: element #5
                 _content: [text segment "I like potatoes"]
116
117 The "treeness" of the tree-structure that these elements comprise is
118 not an aspect of any particular object, but is emergent from the
119 relatedness attributes (_parent and _content) of these element-objects
120 and from how you use them to get from element to element.
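
As a rough sketch of how those relatedness attributes get used, the following builds the example tree by hand and then moves between elements with "parent" and "content_list". The variable names are made up for illustration; a real program would more likely get its tree from HTML::TreeBuilder.

```perl
use strict;
use warnings;
use HTML::Element;

# Build the six elements from the example document.
my $html  = HTML::Element->new('html', lang => 'en-US');
my $head  = HTML::Element->new('head');
my $title = HTML::Element->new('title');
my $meta  = HTML::Element->new('meta', name => 'author', content => 'Jojo');
my $body  = HTML::Element->new('body');
my $h1    = HTML::Element->new('h1');

# push_content records each child in the parent's content list and
# points the child's parent link back at that parent.
$title->push_content('Stuff');
$h1->push_content('I like potatoes!');
$head->push_content($title, $meta);
$body->push_content($h1);
$html->push_content($head, $body);

# Walk up from h1 through the parent links.
my $grandparent_tag = $h1->parent->parent->tag;

# Walk down from the root through the content list.
my @top_level_tags = map { $_->tag } $html->content_list;

print "h1's grandparent: $grandparent_tag\n";
print "children of html: @top_level_tags\n";

$html->delete;    # explicitly destroy the tree when done with it
```

Nothing in this code names "the tree" as a thing of its own; the structure exists only in the links between the six objects.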
121
122 While you could access the content of a tree by writing code that says
123 "access the 'src' attribute of the root's first child's seventh child's
124 third child", you're more likely to have to scan the contents of a
125 tree, looking for whatever nodes, or kinds of nodes, you want to do
126 something with. The most straightforward way to look over a tree is to
127 "traverse" it; an HTML::Element method ("$h->traverse") is provided for
128 this purpose; and several other HTML::Element methods are based on it.
129
130 (For everything you ever wanted to know about trees, and then some, see
131 Niklaus Wirth's Algorithms + Data Structures = Programs or Donald
132 Knuth's The Art of Computer Programming, Volume 1.)
133
BASIC METHODS
$h = HTML::Element->new('tag', 'attrname' => 'value', ... )
136 This constructor method returns a new HTML::Element object. The tag
137 name is a required argument; it will be forced to lowercase.
138 Optionally, you can specify other initial attributes at object creation
139 time.
140
141 $h->attr('attr') or $h->attr('attr', 'value')
142 Returns (optionally sets) the value of the given attribute of $h. The
143 attribute name (but not the value, if provided) is forced to lowercase.
144 If trying to read the value of an attribute not present for this
145 element, the return value is undef. If setting a new value, the old
146 value of that attribute is returned.
147
If methods are provided for accessing an attribute (like "$h->tag" for
"_tag", "$h->content_list", etc. below), use those instead of calling
"$h->attr", whether for reading or setting.
151
152 Note that setting an attribute to "undef" (as opposed to "", the empty
153 string) actually deletes the attribute.
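
A small sketch of the read, set, and delete behaviors described above. The element and the URLs are only illustrative.

```perl
use strict;
use warnings;
use HTML::Element;

my $link = HTML::Element->new('a', href => 'http://www.perl.com/');

my $before = $link->attr('href');                          # read
my $old    = $link->attr('href', 'http://www.perl.org/');  # set; old value comes back
my $after  = $link->attr('href');

# An empty string is still a value; undef removes the attribute.
$link->attr('title', '');
my $empty_is_defined = defined $link->attr('title') ? 1 : 0;

$link->attr('title', undef);
my $title_remaining = grep { $_ eq 'title' } $link->all_external_attr_names;

print "was: $old, now: $after\n";
print "title attributes left: $title_remaining\n";

$link->delete;
```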
154
155 $h->tag() or $h->tag('tagname')
156 Returns (optionally sets) the tag name (also known as the generic
157 identifier) for the element $h. In setting, the tag name is always
158 converted to lower case.
159
160 There are four kinds of "pseudo-elements" that show up as HTML::Element
161 objects:
162
163 Comment pseudo-elements
164 These are element objects with a "$h->tag" value of "~comment", and
165 the content of the comment is stored in the "text" attribute
166 ("$h->attr("text")"). For example, parsing this code with
167 HTML::TreeBuilder...
168
169 <!-- I like Pie.
170 Pie is good
171 -->
172
173 produces an HTML::Element object with these attributes:
174
175 "_tag",
176 "~comment",
177 "text",
178 " I like Pie.\n Pie is good\n "
179
180 Declaration pseudo-elements
181 Declarations (rarely encountered) are represented as HTML::Element
182 objects with a tag name of "~declaration", and content in the
183 "text" attribute. For example, this:
184
185 <!DOCTYPE foo>
186
187 produces an element whose attributes include:
188
189 "_tag", "~declaration", "text", "DOCTYPE foo"
190
191 Processing instruction pseudo-elements
192 PIs (rarely encountered) are represented as HTML::Element objects
193 with a tag name of "~pi", and content in the "text" attribute. For
194 example, this:
195
196 <?stuff foo?>
197
198 produces an element whose attributes include:
199
200 "_tag", "~pi", "text", "stuff foo?"
201
202 (assuming a recent version of HTML::Parser)
203
204 ~literal pseudo-elements
205 These objects are not currently produced by HTML::TreeBuilder, but
206 can be used to represent a "super-literal" -- i.e., a literal you
207 want to be immune from escaping. (Yes, I just made that term up.)
208
209 That is, this is useful if you want to insert code into a tree that
210 you plan to dump out with "as_HTML", where you want, for some
211 reason, to suppress "as_HTML"'s normal behavior of amp-quoting text
212 segments.
213
214 For example, this:
215
216 my $literal = HTML::Element->new('~literal',
217 'text' => 'x < 4 & y > 7'
218 );
219 my $span = HTML::Element->new('span');
220 $span->push_content($literal);
221 print $span->as_HTML;
222
223 prints this:
224
225 <span>x < 4 & y > 7</span>
226
227 Whereas this:
228
229 my $span = HTML::Element->new('span');
230 $span->push_content('x < 4 & y > 7');
231 # normal text segment
232 print $span->as_HTML;
233
234 prints this:
235
    <span>x &lt; 4 &amp; y &gt; 7</span>
237
238 Unless you're inserting lots of pre-cooked code into existing
239 trees, and dumping them out again, it's not likely that you'll find
240 "~literal" pseudo-elements useful.
241
242 $h->parent() or $h->parent($new_parent)
243 Returns (optionally sets) the parent (aka "container") for this
244 element. The parent should either be undef, or should be another
245 element.
246
247 You should not use this to directly set the parent of an element.
248 Instead use any of the other methods under "Structure-Modifying
249 Methods", below.
250
251 Note that not($h->parent) is a simple test for whether $h is the root
252 of its subtree.
253
254 $h->content_list()
255 Returns a list of the child nodes of this element -- i.e., what nodes
256 (elements or text segments) are inside/under this element. (Note that
257 this may be an empty list.)
258
259 In a scalar context, this returns the count of the items, as you may
260 expect.
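
For instance, a sketch (with made-up content) of calling "content_list" in both contexts, and of telling text segments apart from elements with "ref":

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content('Hello, ', ['b', 'world'], '!');

my @children = $p->content_list;    # list context: the child nodes
my $count    = $p->content_list;    # scalar context: how many there are

for my $node (@children) {
    if (ref $node) {
        print "element: ", $node->tag, "\n";    # an HTML::Element object
    }
    else {
        print "text: '$node'\n";                # a plain string
    }
}
print "$count child nodes\n";

$p->delete;
```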
261
262 $h->content()
263 This somewhat deprecated method returns the content of this element;
264 but unlike content_list, this returns either undef (which you should
265 understand to mean no content), or a reference to the array of content
266 items, each of which is either a text segment (a string, i.e., a
267 defined non-reference scalar value), or an HTML::Element object. Note
268 that even if an arrayref is returned, it may be a reference to an empty
269 array.
270
271 While older code should feel free to continue to use "$h->content", new
272 code should use "$h->content_list" in almost all conceivable cases. It
273 is my experience that in most cases this leads to simpler code anyway,
274 since it means one can say:
275
276 @children = $h->content_list;
277
278 instead of the inelegant:
279
280 @children = @{$h->content || []};
281
282 If you do use "$h->content" (or "$h->content_array_ref"), you should
283 not use the reference returned by it (assuming it returned a reference,
284 and not undef) to directly set or change the content of an element or
285 text segment! Instead use content_refs_list or any of the other
286 methods under "Structure-Modifying Methods", below.
287
288 $h->content_array_ref()
289 This is like "content" (with all its caveats and deprecations) except
290 that it is guaranteed to return an array reference. That is, if the
291 given node has no "_content" attribute, the "content" method would
292 return that undef, but "content_array_ref" would set the given node's
293 "_content" value to "[]" (a reference to a new, empty array), and
294 return that.
295
296 $h->content_refs_list
297 This returns a list of scalar references to each element of $h's
298 content list. This is useful in case you want to in-place edit any
299 large text segments without having to get a copy of the current value
300 of that segment value, modify that copy, then use the "splice_content"
301 to replace the old with the new. Instead, here you can in-place edit:
302
303 foreach my $item_r ($h->content_refs_list) {
304 next if ref $$item_r;
305 $$item_r =~ s/honour/honor/g;
306 }
307
You could currently achieve the same effect with:
309
310 foreach my $item (@{ $h->content_array_ref }) {
311 # deprecated!
312 next if ref $item;
313 $item =~ s/honour/honor/g;
314 }
315
316 ...except that using the return value of "$h->content" or
317 "$h->content_array_ref" to do that is deprecated, and just might stop
318 working in the future.
319
320 $h->implicit() or $h->implicit($bool)
321 Returns (optionally sets) the "_implicit" attribute. This attribute is
322 a flag that's used for indicating that the element was not originally
323 present in the source, but was added to the parse tree (by
324 HTML::TreeBuilder, for example) in order to conform to the rules of
325 HTML structure.
326
327 $h->pos() or $h->pos($element)
328 Returns (and optionally sets) the "_pos" (for "current position")
329 pointer of $h. This attribute is a pointer used during some parsing
330 operations, whose value is whatever HTML::Element element at or under
331 $h is currently "open", where "$h->insert_element(NEW)" will actually
332 insert a new element.
333
334 (This has nothing to do with the Perl function called "pos", for
335 controlling where regular expression matching starts.)
336
337 If you set "$h->pos($element)", be sure that $element is either $h, or
338 an element under $h.
339
340 If you've been modifying the tree under $h and are no longer sure
341 "$h->pos" is valid, you can enforce validity with:
342
343 $h->pos(undef) unless $h->pos->is_inside($h);
344
345 $h->all_attr()
346 Returns all this element's attributes and values, as key-value pairs.
347 This will include any "internal" attributes (i.e., ones not present in
348 the original element, and which will not be represented if/when you
349 call "$h->as_HTML"). Internal attributes are distinguished by the fact
350 that the first character of their key (not value! key!) is an
351 underscore ("_").
352
Example output of "$h->all_attr()": "'_parent', [object value], '_tag',
'em', 'lang', 'en-US', '_content', [array-ref value]".
355
356 $h->all_attr_names()
357 Like all_attr, but only returns the names of the attributes.
358
Example output of "$h->all_attr_names()": "'_parent', '_tag', 'lang',
'_content'".
361
362 $h->all_external_attr()
363 Like "all_attr", except that internal attributes are not present.
364
365 $h->all_external_attr_names()
Like "all_attr_names", except that internal attributes' names are not
present.
368
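To make the internal/external distinction concrete, here is a sketch using a freshly made element. Exactly which internal attributes exist depends on the element's state; a new, unattached element is assumed here to carry at least "_tag".

```perl
use strict;
use warnings;
use HTML::Element;

my $em = HTML::Element->new('em', lang => 'en-US');

my %all      = $em->all_attr;                          # internal and external
my @external = sort $em->all_external_attr_names;      # no leading underscore
my @internal = sort grep { /^_/ } $em->all_attr_names; # leading underscore

print "external: @external\n";
print "internal: @internal\n";
print "tag as stored: $all{_tag}\n";

$em->delete;
```
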
369 $h->id() or $h->id($string)
370 Returns (optionally sets to $string) the "id" attribute.
371 "$h->id(undef)" deletes the "id" attribute.
372
373 $h->idf() or $h->idf($string)
374 Just like the "id" method, except that if you call "$h->idf()" and no
375 "id" attribute is defined for this element, then it's set to a likely-
376 to-be-unique value, and returned. (The "f" is for "force".)
377
STRUCTURE-MODIFYING METHODS
These methods are provided for modifying the content of trees by adding
or changing nodes as parents or children of other nodes.
381
382 $h->push_content($element_or_text, ...)
Adds the specified items to the end of the content list of the element
$h. The items of content to be added should each be either a text
segment (a string), an HTML::Element object, or an arrayref. Arrayrefs
are fed thru "$h->new_from_lol(that_arrayref)" to convert them into
elements, before being added to the content list of $h. This means you
can say concise things like:
389
390 $body->push_content(
391 ['br'],
392 ['ul',
393 map ['li', $_], qw(Peaches Apples Pears Mangos)
394 ]
395 );
396
397 See "new_from_lol" method's documentation, far below, for more
398 explanation.
399
400 The push_content method will try to consolidate adjacent text segments
401 while adding to the content list. That's to say, if $h's content_list
402 is
403
404 ('foo bar ', $some_node, 'baz!')
405
406 and you call
407
408 $h->push_content('quack?');
409
410 then the resulting content list will be this:
411
412 ('foo bar ', $some_node, 'baz!quack?')
413
414 and not this:
415
416 ('foo bar ', $some_node, 'baz!', 'quack?')
417
418 If that latter is what you want, you'll have to override the feature of
419 consolidating text by using splice_content, as in:
420
421 $h->splice_content(scalar($h->content_list),0,'quack?');
422
Similarly, if you wanted to add 'Skronk' to the beginning of the
content list, and called this:

    $h->unshift_content('Skronk');

then the resulting content list would be this:
429
430 ('Skronkfoo bar ', $some_node, 'baz!')
431
432 and not this:
433
434 ('Skronk', 'foo bar ', $some_node, 'baz!')
435
What you'd do to get the latter is:
437
438 $h->splice_content(0,0,'Skronk');
439
440 $h->unshift_content($element_or_text, ...)
441 Just like "push_content", but adds to the beginning of the $h element's
442 content list.
443
444 The items of content to be added should each be either a text segment
445 (a string), an HTML::Element object, or an arrayref (which is fed thru
446 "new_from_lol").
447
448 The unshift_content method will try to consolidate adjacent text
449 segments while adding to the content list. See above for a discussion
450 of this.
451
452 $h->splice_content($offset, $length, $element_or_text, ...)
Detaches the elements from $h's list of content-nodes, starting at
$offset and continuing for $length items, replacing them with the
elements of the following list, if any. Returns the elements (if any)
removed from the content-list. If $offset is negative, then it starts
that far from the end of the array, just like Perl's normal "splice"
function. If $length and the following list are omitted, removes
everything from $offset onward.
460
461 The items of content to be added (if any) should each be either a text
462 segment (a string), an arrayref (which is fed thru "new_from_lol"), or
463 an HTML::Element object that's not already a child of $h.
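
A sketch of "splice_content" replacing one child with two, using an illustrative list of fruit. The arrayrefs are turned into "li" elements via "new_from_lol", as described above.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new('ul');
$ul->push_content(map { ['li', $_] } qw(Peaches Apples Pears));

# Starting at offset 1, remove 1 item and put two new ones in its place.
my @removed = $ul->splice_content(1, 1, ['li', 'Mangos'], ['li', 'Plums']);

my @now  = map { $_->as_text } $ul->content_list;
my $gone = $removed[0]->as_text;

print "list is now: @now\n";
print "removed: $gone\n";

# The removed element is only detached, not destroyed.
$_->delete for @removed;
$ul->delete;
```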
464
465 $h->detach()
This unlinks $h from its parent, by setting its "_parent" attribute to
undef, and by removing it from the content list of its parent (if it
had one). The return value is the parent that was detached from (or
undef, if $h had no parent to start with). Note that neither $h nor
its parent is explicitly destroyed.
471
472 $h->detach_content()
473 This unlinks all of $h's children from $h, and returns them. Note that
474 these are not explicitly destroyed; for that, you can just use
475 $h->delete_content.
476
477 $h->replace_with( $element_or_text, ... )
478 This replaces $h in its parent's content list with the nodes specified.
479 The element $h (which by then may have no parent) is returned. This
480 causes a fatal error if $h has no parent. The list of nodes to insert
481 may contain $h, but at most once. Aside from that possible exception,
482 the nodes to insert should not already be children of $h's parent.
483
484 Also, note that this method does not destroy $h -- use
485 "$h->replace_with(...)->delete" if you need that.
486
487 $h->preinsert($element_or_text...)
Inserts the given nodes right BEFORE $h in $h's parent's content list.
This causes a fatal error if $h has no parent. None of the given nodes
should be $h or other children of $h's parent. Returns $h.
491
492 $h->postinsert($element_or_text...)
Inserts the given nodes right AFTER $h in $h's parent's content list.
This causes a fatal error if $h has no parent. None of the given nodes
should be $h or other children of $h's parent. Returns $h.
496
497 $h->replace_with_content()
498 This replaces $h in its parent's content list with its own content.
499 The element $h (which by then has no parent or content of its own) is
500 returned. This causes a fatal error if $h has no parent. Also, note
501 that this does not destroy $h -- use "$h->replace_with_content->delete"
502 if you need that.
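
One plausible use is "unwrapping" an element you want rid of while keeping what's inside it. This sketch strips a "font" element from some made-up content; "look_down" is used only to find the element.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content('I ', ['font', 'really'], ' like pie');

my $font = $p->look_down(_tag => 'font');

# Hoist the font element's content into $p, then destroy the empty shell.
$font->replace_with_content->delete;

my $text       = $p->as_text;
my @fonts_left = $p->look_down(_tag => 'font');

print "$text\n";
print scalar(@fonts_left), " font elements left\n";

$p->delete;
```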
503
504 $h->delete_content()
505 Clears the content of $h, calling "$h->delete" for each content
506 element. Compare with "$h->detach_content".
507
508 Returns $h.
509
510 $h->delete()
511 Detaches this element from its parent (if it has one) and explicitly
512 destroys the element and all its descendants. The return value is
513 undef.
514
515 Perl uses garbage collection based on reference counting; when no
516 references to a data structure exist, it's implicitly destroyed --
517 i.e., when no value anywhere points to a given object anymore, Perl
518 knows it can free up the memory that the now-unused object occupies.
519
520 But this fails with HTML::Element trees, because a parent element
521 always holds references to its children, and its children elements hold
522 references to the parent, so no element ever looks like it's not in
523 use. So, to destroy those elements, you need to call "$h->delete" on
524 the parent.
525
526 $h->clone()
527 Returns a copy of the element (whose children are clones (recursively)
528 of the original's children, if any).
529
The returned element is parentless. Any '_pos' attributes present in
the source element/tree will be absent in the copy. For that and other
reasons, the clone of an HTML::TreeBuilder object that's in mid-parse
(i.e., the head of a tree that HTML::TreeBuilder is elaborating) cannot
(currently) be used to continue the parse.
535
536 You are free to clone HTML::TreeBuilder trees, just as long as: 1)
537 they're done being parsed, or 2) you don't expect to resume parsing
538 into the clone. (You can continue parsing into the original; it is
539 never affected.)
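
A sketch showing that a clone is an independent, parentless copy. The element and attribute values are invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new('div');
$div->push_content(['p', { class => 'note' }, 'Hi ', ['b', 'there']]);
my $orig = $div->look_down(_tag => 'p');

my $copy = $orig->clone;

# Changing the copy leaves the original untouched.
$copy->attr('class', 'changed');
$copy->push_content('!');

my $orig_class  = $orig->attr('class');
my $orig_text   = $orig->as_text;
my $copy_text   = $copy->as_text;
my $copy_rooted = $copy->parent ? 0 : 1;    # the clone has no parent

print "original: $orig_text ($orig_class)\n";
print "copy:     $copy_text\n";

$copy->delete;
$div->delete;
```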
540
541 HTML::Element->clone_list(...nodes...)
542 Returns a list consisting of a copy of each node given. Text segments
543 are simply copied; elements are cloned by calling $it->clone on each of
544 them.
545
546 Note that this must be called as a class method, not as an instance
547 method. "clone_list" will croak if called as an instance method. You
548 can also call it like so:
549
550 ref($h)->clone_list(...nodes...)
551
552 $h->normalize_content
553 Normalizes the content of $h -- i.e., concatenates any adjacent text
554 nodes. (Any undefined text segments are turned into empty-strings.)
555 Note that this does not recurse into $h's descendants.
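
A sketch of the case "normalize_content" is for: "splice_content" leaves two text segments side by side (it doesn't consolidate text the way "push_content" does), and normalizing joins them.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content('foo');
$p->splice_content(1, 0, 'bar');    # appended as a separate text segment

my $count_before = $p->content_list;

$p->normalize_content;

my $count_after = $p->content_list;
my ($joined)    = $p->content_list;

print "$count_before segments before, $count_after after: '$joined'\n";

$p->delete;
```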
556
557 $h->delete_ignorable_whitespace()
This traverses under $h and deletes any text segments that are
ignorable whitespace. You should not use this if $h is under a 'pre'
element.
561
562 $h->insert_element($element, $implicit)
563 Inserts (via push_content) a new element under the element at
564 "$h->pos()". Then updates "$h->pos()" to point to the inserted
565 element, unless $element is a prototypically empty element like "br",
566 "hr", "img", etc. The new "$h->pos()" is returned. This method is
567 useful only if your particular tree task involves setting "$h->pos()".
568
DUMPING METHODS
$h->dump()
571 $h->dump(*FH) ; # or *FH{IO} or $fh_obj
572 Prints the element and all its children to STDOUT (or to a specified
573 filehandle), in a format useful only for debugging. The structure of
574 the document is shown by indentation (no end tags).
575
576 $h->as_HTML() or $h->as_HTML($entities)
577 or $h->as_HTML($entities, $indent_char)
578 or $h->as_HTML($entities, $indent_char, \%optional_end_tags)
579 Returns a string representing in HTML the element and its descendants.
580 The optional argument $entities specifies a string of the entities to
581 encode. For compatibility with previous versions, specify '<>&' here.
582 If omitted or undef, all unsafe characters are encoded as HTML
583 entities. See HTML::Entities for details. If passed an empty string,
584 no entities are encoded.
585
If $indent_char is specified and defined, the HTML to be output is
indented, using the string you specify (which you probably should set
to "\t", or some number of spaces, if you specify it).
589
590 If "\%optional_end_tags" is specified and defined, it should be a
591 reference to a hash that holds a true value for every tag name whose
592 end tag is optional. Defaults to "\%HTML::Element::optionalEndTag",
593 which is an alias to %HTML::Tagset::optionalEndTag, which, at time of
594 writing, contains true values for "p, li, dt, dd". A useful value to
595 pass is an empty hashref, "{}", which means that no end-tags are
596 optional for this dump. Otherwise, possibly consider copying
597 %HTML::Tagset::optionalEndTag to a hash of your own, adding or deleting
598 values as you like, and passing a reference to that hash.
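
A sketch of the three arguments together, on a made-up list. The exact whitespace and line breaks in the output may vary between versions, so none is shown here.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new('ul');
$ul->push_content(['li', 'Fish & chips'], ['li', 'Pie']);

# Defaults: all unsafe characters encoded, no indenting, and end tags
# listed as optional (such as </li>) may be left out.
my $terse = $ul->as_HTML;

# Encode only <, > and &; indent with two spaces; treat no end tag as
# optional, so every element is explicitly closed.
my $explicit = $ul->as_HTML('<>&', '  ', {});

print $terse,    "\n";
print $explicit, "\n";

$ul->delete;
```

Passing the empty hashref is the part most likely to matter in practice, since some consumers of the output expect every element to be closed.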
599
600 $h->as_text()
601 $h->as_text(skip_dels => 1)
602 Returns a string consisting of only the text parts of the element's
603 descendants.
604
605 Text under 'script' or 'style' elements is never included in what's
606 returned. If "skip_dels" is true, then text content under "del" nodes
607 is not included in what's returned.
608
609 $h->as_trimmed_text(...)
610 This is just like as_text(...) except that leading and trailing
611 whitespace is deleted, and any internal whitespace is collapsed.
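
A sketch contrasting the two methods on some made-up content with stray whitespace and a "script" element:

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content(
    '  Some ',
    ['b', 'bold'],
    '   text  ',
    ['script', 'var x = 1;'],    # text under script is never included
);

my $raw     = $p->as_text;            # whitespace kept exactly as stored
my $trimmed = $p->as_trimmed_text;    # ends trimmed, inner runs collapsed

print "as_text:         [$raw]\n";
print "as_trimmed_text: [$trimmed]\n";

$p->delete;
```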
612
613 $h->as_XML()
614 Returns a string representing in XML the element and its descendants.
615
616 The XML is not indented.
617
618 $h->as_Lisp_form()
619 Returns a string representing the element and its descendants as a Lisp
620 form. Unsafe characters are encoded as octal escapes.
621
622 The Lisp form is indented, and contains external ("href", etc.) as
623 well as internal attributes ("_tag", "_content", "_implicit", etc.),
624 except for "_parent", which is omitted.
625
626 Current example output for a given element:
627
628 ("_tag" "img" "border" "0" "src" "pie.png" "usemap" "#main.map")
629
630 $h->starttag() or $h->starttag($entities)
631 Returns a string representing the complete start tag for the element.
632 I.e., leading "<", tag name, attributes, and trailing ">". All values
633 are surrounded with double-quotes, and appropriate characters are
634 encoded. If $entities is omitted or undef, all unsafe characters are
635 encoded as HTML entities. See HTML::Entities for details. If you
636 specify some value for $entities, remember to include the double-quote
637 character in it. (Previous versions of this module would basically
638 behave as if '&">' were specified for $entities.) If $entities is an
639 empty string, no entity is escaped.
640
641 $h->endtag()
642 Returns a string representing the complete end tag for this element.
643 I.e., "</", tag name, and ">".
644
SECONDARY STRUCTURAL METHODS
These methods all involve some structural aspect of the tree; either
they report some aspect of the tree's structure, or they involve
traversal down the tree, or walking up the tree.
649
650 $h->is_inside('tag', ...) or $h->is_inside($element, ...)
651 Returns true if the $h element is, or is contained anywhere inside an
652 element that is any of the ones listed, or whose tag name is any of the
653 tag names listed.
654
655 $h->is_empty()
656 Returns true if $h has no content, i.e., has no elements or text
657 segments under it. In other words, this returns true if $h is a leaf
658 node, AKA a terminal node. Do not confuse this sense of "empty" with
659 another sense that it can have in SGML/HTML/XML terminology, which
660 means that the element in question is of the type (like HTML's "hr",
661 "br", "img", etc.) that can't have any content.
662
663 That is, a particular "p" element may happen to have no content, so
664 $that_p_element->is_empty will be true -- even though the prototypical
665 "p" element isn't "empty" (not in the way that the prototypical "hr"
666 element is).
667
668 If you think this might make for potentially confusing code, consider
669 simply using the clearer exact equivalent: not($h->content_list)
670
671 $h->pindex()
672 Return the index of the element in its parent's contents array, such
673 that $h would equal
674
675 $h->parent->content->[$h->pindex]
676 or
677 ($h->parent->content_list)[$h->pindex]
678
679 assuming $h isn't root. If the element $h is root, then $h->pindex
680 returns undef.
681
682 $h->left()
683 In scalar context: returns the node that's the immediate left sibling
684 of $h. If $h is the leftmost (or only) child of its parent (or has no
685 parent), then this returns undef.
686
687 In list context: returns all the nodes that're the left siblings of $h
688 (starting with the leftmost). If $h is the leftmost (or only) child of
689 its parent (or has no parent), then this returns empty-list.
690
691 (See also $h->preinsert(LIST).)
692
693 $h->right()
694 In scalar context: returns the node that's the immediate right sibling
695 of $h. If $h is the rightmost (or only) child of its parent (or has no
696 parent), then this returns undef.
697
698 In list context: returns all the nodes that're the right siblings of
699 $h, starting with the leftmost. If $h is the rightmost (or only) child
700 of its parent (or has no parent), then this returns empty-list.
701
702 (See also $h->postinsert(LIST).)
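
A sketch of "left" and "right" in both contexts, using three invented list items:

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new('ul');
$ul->push_content(map { ['li', $_] } qw(one two three));
my @items = $ul->content_list;

# Scalar context: the immediate sibling, or undef at either end.
my $prev = $items[1]->left;
my $next = $items[1]->right;
my $none = $items[0]->left;

# List context: all siblings on that side, leftmost first.
my @before_last = $items[2]->left;

print "left of 'two':  ", $prev->as_text, "\n";
print "right of 'two': ", $next->as_text, "\n";
print "left of 'three': ", join(', ', map { $_->as_text } @before_last), "\n";

$ul->delete;
```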
703
704 $h->address()
705 Returns a string representing the location of this node in the tree.
706 The address consists of numbers joined by a '.', starting with '0', and
707 followed by the pindexes of the nodes in the tree that are ancestors of
708 $h, starting from the top.
709
710 So if the way to get to a node starting at the root is to go to child 2
711 of the root, then child 10 of that, and then child 0 of that, and then
712 you're there -- then that node's address is "0.2.10.0".
713
714 As a bit of a special case, the address of the root is simply "0".
715
I foresee this being used mainly for debugging, but you may find your
own uses for it.
718
719 $h->address($address)
720 This returns the node (whether element or text-segment) at the given
721 address in the tree that $h is a part of. (That is, the address is
722 resolved starting from $h->root.)
723
724 If there is no node at the given address, this returns undef.
725
726 You can specify "relative addressing" (i.e., that indexing is supposed
727 to start from $h and not from $h->root) by having the address start
728 with a period -- e.g., $h->address(".3.2") will look at child 3 of $h,
729 and child 2 of that.
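
A sketch of both forms of "address" on a small hand-built tree. Here "body" is child 1 of the root and "h1" is child 0 of "body".

```perl
use strict;
use warnings;
use HTML::Element;

my $html = HTML::Element->new('html');
$html->push_content(
    ['head', ['title', 'Stuff']],
    ['body', ['h1', 'I like potatoes!']],
);
my $h1   = $html->look_down(_tag => 'h1');
my $body = $h1->parent;

my $addr     = $h1->address;               # where is $h1?
my $same     = $html->address($addr);      # absolute: resolved from the root
my $relative = $body->address('.0');       # relative: child 0 of $body
my $text     = $html->address("$addr.0");  # may land on a text segment
my $missing  = $html->address('0.7');      # no such node

print "h1 lives at $addr\n";
print "text there: $text\n";

$html->delete;
```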
730
731 $h->depth()
732 Returns a number expressing $h's depth within its tree, i.e., how many
733 steps away it is from the root. If $h has no parent (i.e., is root),
734 its depth is 0.
735
736 $h->root()
737 Returns the element that's the top of $h's tree. If $h is root, this
738 just returns $h. (If you want to test whether $h is the root, instead
739 of asking what its root is, just test "not($h->parent)".)
740
741 $h->lineage()
742 Returns the list of $h's ancestors, starting with its parent, and then
743 that parent's parent, and so on, up to the root. If $h is root, this
744 returns an empty list.
745
746 If you simply want a count of the number of elements in $h's lineage,
747 use $h->depth.
748
749 $h->lineage_tag_names()
750 Returns the list of the tag names of $h's ancestors, starting with its
751 parent, and that parent's parent, and so on, up to the root. If $h is
752 root, this returns an empty list. Example output: "('em', 'td', 'tr',
753 'table', 'body', 'html')"
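
A sketch tying "depth", "root", "lineage", and "lineage_tag_names" together on a small invented tree:

```perl
use strict;
use warnings;
use HTML::Element;

my $html = HTML::Element->new('html');
$html->push_content(['body', ['p', ['em', 'Hi']]]);
my $em = $html->look_down(_tag => 'em');

my $depth     = $em->depth;                # steps up to the root
my @ancestors = $em->lineage;              # parent first, root last
my @tags      = $em->lineage_tag_names;
my $root_tag  = $em->root->tag;
my $is_root   = $html->parent ? 0 : 1;     # the not($h->parent) test

print "em is $depth levels down, under: @tags\n";

$html->delete;
```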
754
755 $h->descendants()
756 In list context, returns the list of all $h's descendant elements,
757 listed in pre-order (i.e., an element appears before its content-
758 elements). Text segments DO NOT appear in the list. In scalar
759 context, returns a count of all such elements.
760
761 $h->descendents()
762 This is just an alias to the "descendants" method.
763
764 $h->find_by_tag_name('tag', ...)
765 In list context, returns a list of elements at or under $h that have
766 any of the specified tag names. In scalar context, returns the first
767 (in pre-order traversal of the tree) such element found, or undef if
768 none.
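
A sketch of how context changes what "find_by_tag_name" returns, on made-up content:

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new('div');
$div->push_content(['h1', 'Title'], ['p', 'One'], ['p', 'Two']);

# List context: every match at or under $div, in pre-order.
my @found = $div->find_by_tag_name('h1', 'p');

# Scalar context: only the first match, or undef if there is none.
my $first_p = $div->find_by_tag_name('p');
my $nothing = $div->find_by_tag_name('table');

print "found: ", join(', ', map { $_->as_text } @found), "\n";
print "first p: ", $first_p->as_text, "\n";
print "no table here\n" unless defined $nothing;

$div->delete;
```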
769
770 $h->find('tag', ...)
771 This is just an alias to "find_by_tag_name". (There was once going to
772 be a whole find_* family of methods, but then look_down filled that
773 niche, so there turned out not to be much reason for the verboseness of
774 the name "find_by_tag_name".)
775
776 $h->find_by_attribute('attribute', 'value')
777 In a list context, returns a list of elements at or under $h that have
778 the specified attribute, and have the given value for that attribute.
779 In a scalar context, returns the first (in pre-order traversal of the
780 tree) such element found, or undef if none.
781
782 This method is deprecated in favor of the more expressive "look_down"
783 method, which new code should use instead.
784
785 $h->look_down( ...criteria... )
786 This starts at $h and looks thru its element descendants (in pre-
787 order), looking for elements matching the criteria you specify. In
788 list context, returns all elements that match all the given criteria;
789 in scalar context, returns the first such element (or undef, if nothing
790 matched).
791
792 There are three kinds of criteria you can specify:
793
794 (attr_name, attr_value)
795 This means you're looking for an element with that value for that
796 attribute. Example: "alt", "pix!". Consider that you can search
797 on internal attribute values too: "_tag", "p".
798
799 (attr_name, qr/.../)
800 This means you're looking for an element whose value for that
801 attribute matches the specified Regexp object.
802
803 a coderef
804 This means you're looking for elements where
805 coderef->(each_element) returns true. Example:
806
    my @wide_pix_images
      = $h->look_down(
                      "_tag", "img",
                      "alt", "pix!",
                      # "|| 0" avoids a warning when width is absent
                      sub { ($_[0]->attr('width') || 0) > 350 }
                     );
813
814 Note that "(attr_name, attr_value)" and "(attr_name, qr/.../)" criteria
815 are almost always faster than coderef criteria, so should presumably be
816 put before them in your list of criteria. That is, in the example
817 above, the sub ref is called only for elements that have already passed
818 the criteria of having a "_tag" attribute with value "img", and an
819 "alt" attribute with value "pix!". If the coderef were first, it would
820 be called on every element, and then what elements pass that criterion
821 (i.e., elements for which the coderef returned true) would be checked
822 for their "_tag" and "alt" attributes.
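
Here is a minimal sketch that combines all three kinds of criteria, with the cheap ones first as suggested above. The file names and widths are invented for the example:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical content; only the shape matters.
my $body = HTML::Element->new_from_lol(
    ['body',
        ['img', { src => 'logo.png',   alt => 'Logo',  width => '120' }],
        ['img', { src => 'photo1.jpg', alt => 'Beach', width => '640' }],
        ['img', { src => 'photo2.jpg', alt => 'Hills', width => '200' }],
        ['a',   { href => 'photo1.jpg' }, 'full size'],
    ]
);

# Tag name first, then a Regexp on "src"; the coderef runs only for
# elements that have already passed both of those tests.
my @wide_photos = $body->look_down(
    '_tag', 'img',
    'src',  qr/\.jpe?g$/i,
    sub { ($_[0]->attr('width') || 0) > 350 },
);

print $_->attr('alt'), "\n" for @wide_photos;
```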
823
824 Note that comparison of string attribute-values against the string
825 value in "(attr_name, attr_value)" is case-INsensitive! A criterion of
826 "('align', 'right')" will match an element whose "align" value is
827 "RIGHT", or "right" or "rIGhT", etc.
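
A short sketch of that behavior, using an invented table row. The second search assumes that a Regexp criterion is matched with the regexp's own flags, so it stays case-sensitive unless you add /i:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical row with differently-cased "align" values.
my $row = HTML::Element->new_from_lol(
    ['tr',
        ['td', { align => 'RIGHT' }, '1'],
        ['td', { align => 'right' }, '2'],
        ['td', { align => 'left'  }, '3'],
    ]
);

# Plain string value: compared without regard to case.
my @right = $row->look_down('align', 'right');

# Regexp value (assumed to follow the regexp's own flags).
my @exact = $row->look_down('align', qr/^right$/);

print scalar(@right), " cells matched the string criterion\n";
print scalar(@exact), " cells matched the regexp criterion\n";
```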
828
829 Note also that "look_down" considers "" (empty-string) and undef to be
830 different things, in attribute values. So this:
831
832 $h->look_down("alt", "")
833
834 will find elements with an "alt" attribute, but where the value for the
835 "alt" attribute is "". But this:
836
837 $h->look_down("alt", undef)
838
839 is the same as:
840
841 $h->look_down(sub { !defined($_[0]->attr('alt')) } )
842
843 That is, it finds elements that do not have an "alt" attribute at all
844 (or that do have an "alt" attribute, but with a value of undef -- which
845 is not normally possible).
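
The difference can be seen in a small sketch like this one, where the three images are invented for the purpose:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical images: empty alt, non-empty alt, and no alt at all.
my $div = HTML::Element->new_from_lol(
    ['div',
        ['img', { src => 'a.png', alt => ''      }],
        ['img', { src => 'b.png', alt => 'B pic' }],
        ['img', { src => 'c.png'                 }],
    ]
);

# Elements that have an "alt" attribute whose value is "".
my @empty_alt = $div->look_down('_tag', 'img', 'alt', '');

# Elements that have no "alt" attribute.
my @no_alt = $div->look_down('_tag', 'img', 'alt', undef);

print "empty alt: ", $empty_alt[0]->attr('src'), "\n";
print "no alt:    ", $no_alt[0]->attr('src'), "\n";
```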
846
847 Note that when you give several criteria, this is taken to mean you're
848 looking for elements that match all your criteria, not just any of
849 them. In other words, there is an implicit "and", not an "or". So if
850 you wanted to express that you wanted to find elements with a "name"
851 attribute with the value "foo" or with an "id" attribute with the value
852 "baz", you'd have to do it like:
853
854 @them = $h->look_down(
855 sub {
856 # the lcs are to fold case
857 lc($_[0]->attr('name') || '') eq 'foo'
858 or lc($_[0]->attr('id') || '') eq 'baz'
859 }
860 );
861
862 Coderef criteria are more expressive than "(attr_name, attr_value)" and
863 "(attr_name, qr/.../)" criteria, and all "(attr_name, attr_value)" and
864 "(attr_name, qr/.../)" criteria could be expressed in terms of
865 coderefs. However, "(attr_name, attr_value)" and "(attr_name,
866 qr/.../)" criteria are a convenient shorthand. (In fact, "look_down"
867 itself is basically "shorthand" too, since anything you can do with
868 "look_down" you could do by traversing the tree, either with the
869 "traverse" method or with a routine of your own. However, "look_down"
870 often makes for very concise and clear code.)
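
To illustrate the "routine of your own" alternative, here is a sketch of a hand-written pre-order walk next to the equivalent "look_down" call. The helper name find_matching and the sample tree are hypothetical:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical helper: pre-order walk collecting elements that pass $test.
sub find_matching {
    my ($node, $test) = @_;
    my @hits = $test->($node) ? ($node) : ();
    for my $child ($node->content_list) {
        # Text segments are plain strings, so recurse into elements only.
        push @hits, find_matching($child, $test) if ref $child;
    }
    return @hits;
}

my $tree = HTML::Element->new_from_lol(
    ['form',
        ['input', { name => 'FOO' }],
        ['div',   { id => 'baz' },
            ['span', { name => 'bar' }, 'text'],
        ],
    ]
);

my $is_target = sub {
    lc($_[0]->attr('name') || '') eq 'foo'
        or lc($_[0]->attr('id') || '') eq 'baz';
};

my @by_hand      = find_matching($tree, $is_target);
my @by_look_down = $tree->look_down($is_target);

print join(',', map { $_->tag } @by_hand), "\n";
print join(',', map { $_->tag } @by_look_down), "\n";
```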
871
872 $h->look_up( ...criteria... )
873 This is identical to $h->look_down, except that whereas $h->look_down
874 basically scans over the list:
875
876 ($h, $h->descendants)
877
878 $h->look_up instead scans over the list
879
880 ($h, $h->lineage)
881
882 So, for example, this returns all ancestors of $h (possibly including
883 $h itself) that are "td" elements with an "align" attribute with a
884 value of "right" (or "RIGHT", etc.):
885
886 $h->look_up("_tag", "td", "align", "right");
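
A small sketch of searching upward from a nested element. The table is an invented example:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical table with one right-aligned cell.
my $table = HTML::Element->new_from_lol(
    ['table',
        ['tr',
            ['td', { align => 'RIGHT' },
                ['b', 'total'],
            ],
        ],
    ]
);

my $bold = $table->look_down('_tag', 'b');

# Scalar context: the nearest matching element, starting at $bold itself.
my $cell = $bold->look_up('_tag', 'td', 'align', 'right');
my $row  = $bold->look_up('_tag', 'tr');
my $form = $bold->look_up('_tag', 'form');   # no such ancestor

print "enclosing cell: ", $cell->tag, "\n";
print "enclosing row:  ", $row->tag, "\n";
print "not inside a form\n" unless defined $form;
```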
887
888 $h->traverse(...options...)
889 Lengthy discussion of HTML::Element's unnecessary and confusing
890 "traverse" method has been moved to a separate file:
891 HTML::Element::traverse
892
893 $h->attr_get_i('attribute')
894 In list context, returns a list consisting of the values of the given
895 attribute for $h and for all its ancestors, starting from $h and
896 working its way up. Nodes with no such attribute are skipped.
897 ("attr_get_i" stands for "attribute get, with inheritance".) In scalar
898 context, returns the first such value, or undef if none.
899
900 Consider a document consisting of:
901
902 <html lang='i-klingon'>
903 <head><title>Pati Pata</title></head>
904 <body>
905 <h1 lang='la'>Stuff</h1>
906 <p lang='es-MX' align='center'>
907 Foo bar baz <cite>Quux</cite>.
908 </p>
909 <p>Hooboy.</p>
910 </body>
911 </html>
912
913 If $h is the "cite" element, $h->attr_get_i("lang") in list context
914 will return the list ('es-MX', 'i-klingon'). In scalar context, it
915 will return the value 'es-MX'.
916
917 If you call with multiple attribute names...
918
919 $h->attr_get_i('a1', 'a2', 'a3')
920 ...in list context, this will return a list consisting of the values of
921 these attributes which exist in $h and its ancestors. In scalar
922 context, this returns the first value (i.e., the value of the first
923 existing attribute from the first element that has any of the
924 attributes listed). So, in the above example,
925
926 $h->attr_get_i('lang', 'align');
927
928 will return:
929
930 ('es-MX', 'center', 'i-klingon') # in list context
931 or
932 'es-MX' # in scalar context.
933
934 But note that this:
935
936 $h->attr_get_i('align', 'lang');
937
938 will return:
939
940 ('center', 'es-MX', 'i-klingon') # in list context
941 or
942 'center' # in scalar context.
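
The document above can be rebuilt with new_from_lol to try these calls out. This is only a sketch, and the variable names are arbitrary:

```perl
use strict;
use warnings;
use HTML::Element;

# The sample document from above, expressed as a list of lists.
my $html = HTML::Element->new_from_lol(
    ['html', { lang => 'i-klingon' },
        ['head', ['title', 'Pati Pata']],
        ['body',
            ['h1', { lang => 'la' }, 'Stuff'],
            ['p',  { lang => 'es-MX', align => 'center' },
                'Foo bar baz ', ['cite', 'Quux'], '.',
            ],
            ['p', 'Hooboy.'],
        ],
    ]
);

my $cite = $html->look_down('_tag', 'cite');

my @langs = $cite->attr_get_i('lang');            # list context
my $lang  = $cite->attr_get_i('lang');            # scalar context
my @mixed = $cite->attr_get_i('align', 'lang');   # several names

print "langs: @langs\n";
print "lang:  $lang\n";
print "mixed: @mixed\n";
```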
943
944 $h->tagname_map()
945 Scans across $h and all its descendants, and makes a hash (a reference
946 to which is returned) where each entry consists of a key that's a tag
947 name, and a value that's a reference to a list of all elements that
948 have that tag name. I.e., this method returns:
949
950 {
951 # Across $h and all descendants...
952 'a' => [ ...list of all 'a' elements... ],
953 'em' => [ ...list of all 'em' elements... ],
954 'img' => [ ...list of all 'img' elements... ],
955 }
956
957 (There are entries in the hash for only those tagnames that occur
958 at/under $h -- so if there's no "img" elements, there'll be no "img"
959 entry in the hash(ref) returned.)
960
961 Example usage:
962
963 my $map_r = $h->tagname_map();
964 my @heading_tags = sort grep m/^h\d$/s, keys %$map_r;
965 if(@heading_tags) {
966 print "Heading levels used: @heading_tags\n";
967 } else {
968 print "No headings.\n"
969 }
970
971 $h->extract_links() or $h->extract_links(@wantedTypes)
972 Returns links found by traversing the element and all of its children
973 and looking for attributes (like "href" in an "a" element, or "src" in
974 an "img" element) whose values represent links. The return value is a
975 reference to an array. Each element of the array is a reference to an
976 array with four items: the link-value, the element that has the
977 attribute with that link-value, the name of that attribute, and the
978 tagname of that element. (Example: "['http://www.suck.com/',
979 $elem_obj, 'href', 'a']".) You may or may not end up using the
980 element itself -- for some purposes, you may use only the link value.
981
982 You might specify that you want to extract links from just some kinds
983 of elements (instead of the default, which is to extract links from all
984 the kinds of elements known to have attributes whose values represent
985 links). For instance, if you want to extract links from only "a" and
986 "img" elements, you could code it like this:
987
988 for (@{ $e->extract_links('a', 'img') }) {
989 my($link, $element, $attr, $tag) = @$_;
990 print
991 "Hey, there's a $tag that links to ",
992 $link, ", in its $attr attribute, at ",
993 $element->address(), ".\n";
994 }
995
996 $h->simplify_pres
997 In text bits under PRE elements that are at/under $h, this routine
998 nativizes all newlines, and expands all tabs.
999
1000 That is, if you read a file with lines delimited by "\cm\cj"'s, the
1001 text under PRE areas will have "\cm\cj"'s instead of "\n"'s. Calling
1002 $h->simplify_pres on such a tree will turn "\cm\cj"'s into
1003 "\n"'s.
1004
1005 Tabs are expanded to however many spaces it takes to get to the next
1006 8th column -- the usual way of expanding them.
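
A minimal sketch of both effects on a single hand-made PRE element. The text content is invented:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical PRE whose text has a tab and a "\cm\cj" line ending.
my $pre = HTML::Element->new('pre');
$pre->push_content("a\tb\cm\cjc");

$pre->simplify_pres;

# The text segment is now the first (and only) content item.
my ($text) = $pre->content_list;
(my $visible = $text) =~ s/ /./g;   # show spaces as dots for display
print "$visible\n";
```

Here the tab follows one character, so it should expand to seven spaces to reach column 8.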
1007
1008 $h->same_as($i)
1009 Returns true if $h and $i are both elements representing the same tree
1010 of elements, each with the same tag name, with the same explicit
1011 attributes (i.e., not counting attributes whose names start with "_"),
1012 and with the same content (textual, comments, etc.).
1013
1014 Sameness of descendant elements is tested, recursively, with
1015 "$child1->same_as($child2)", and sameness of text segments is tested
1016 with "$segment1 eq $segment2".
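
The rules above can be sketched like this. The helper make_note and the "_scratch" attribute name are hypothetical, chosen only for the example:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical helper that builds a fresh little tree on each call.
sub make_note {
    my ($text) = @_;
    return HTML::Element->new_from_lol(
        ['p', { class => 'note' }, $text, ['b', 'world']]
    );
}

my $one   = make_note('Hello ');
my $two   = make_note('Hello ');
my $three = make_note('Goodbye ');

my $same_trees     = $one->same_as($two);     # separate objects, same shape
my $different_text = $one->same_as($three);   # text segments differ

# Attributes whose names start with "_" are not counted.
$two->attr('_scratch', 1);
my $still_same = $one->same_as($two);

# An explicit attribute that differs does count.
$two->attr('class', 'warning');
my $after_change = $one->same_as($two);

print "same trees:     ", ($same_trees     ? "yes" : "no"), "\n";
print "different text: ", ($different_text ? "yes" : "no"), "\n";
print "still same:     ", ($still_same     ? "yes" : "no"), "\n";
print "after change:   ", ($after_change   ? "yes" : "no"), "\n";
```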
1017
1018 $h = HTML::Element->new_from_lol(ARRAYREF)
1019 Recursively constructs a tree of nodes, based on the (non-cyclic) data
1020 structure represented by ARRAYREF, where that is a reference to an
1021 array of arrays (of arrays (of arrays (etc.))).
1022
1023 In each arrayref in that structure, different kinds of values are
1024 treated as follows:
1025
1026 · Arrayrefs
1027
1028 Arrayrefs are considered to designate a sub-tree representing
1029 children for the node constructed from the current arrayref.
1030
1031 · Hashrefs
1032
1033 Hashrefs are considered to contain attribute-value pairs to add to
1034 the element to be constructed from the current arrayref.
1035
1036 · Text segments
1037
1038 Text segments at the start of any arrayref will be considered to
1039 specify the name of the element to be constructed from the current
1040 arrayref; all other text segments will be considered to specify
1041 text segments as children for the current arrayref.
1042
1043 · Elements
1044
1045 Existing element objects are either inserted into the treelet
1046 constructed, or clones of them are. That is, when the lol-tree is
1047 being traversed and elements constructed based on what's in it, if an
1048 existing element object is found, if it has no parent, then it is
1049 added directly to the treelet constructed; but if it has a parent,
1050 then "$that_node->clone" is added to the treelet at the appropriate
1051 place.
1052
1053 An example will hopefully make this more obvious:
1054
1055 my $h = HTML::Element->new_from_lol(
1056 ['html',
1057 ['head',
1058 [ 'title', 'I like stuff!' ],
1059 ],
1060 ['body',
1061 {'lang', 'en-JP', _implicit => 1},
1062 'stuff',
1063 ['p', 'um, p < 4!', {'class' => 'par123'}],
1064 ['div', {foo => 'bar'}, '123'],
1065 ]
1066 ]
1067 );
1068 $h->dump;
1069
1070 Will print this:
1071
1072 <html> @0
1073 <head> @0.0
1074 <title> @0.0.0
1075 "I like stuff!"
1076 <body lang="en-JP"> @0.1 (IMPLICIT)
1077 "stuff"
1078 <p class="par123"> @0.1.1
1079 "um, p < 4!"
1080 <div foo="bar"> @0.1.2
1081 "123"
1082
1083 And printing $h->as_HTML will give something like:
1084
1085 <html><head><title>I like stuff!</title></head>
1086 <body lang="en-JP">stuff<p class="par123">um, p &lt; 4!
1087 <div foo="bar">123</div></body></html>
1088
1089 You can even do fancy things with "map":
1090
1091 $body->push_content(
1092 # push_content implicitly calls new_from_lol on arrayrefs...
1093 ['br'],
1094 ['blockquote',
1095 ['h2', 'Pictures!'],
1096 map ['p', $_],
1097 $body2->look_down("_tag", "img"),
1098 # images, to be copied from that other tree.
1099 ],
1100 # and more stuff:
1101 ['ul',
1102 map ['li', ['a', {'href'=>"$_.png"}, $_ ] ],
1103 qw(Peaches Apples Pears Mangos)
1104 ],
1105 );
1106
1107 @elements = HTML::Element->new_from_lol(ARRAYREFS)
1108 Constructs several elements, by calling new_from_lol for every arrayref
1109 in the ARRAYREFS list.
1110
1111 @elements = HTML::Element->new_from_lol(
1112 ['hr'],
1113 ['p', 'And there, on the door, was a hook!'],
1114 );
1115 # constructs two elements.
1116
1117 $h->objectify_text()
1118 This turns any text nodes under $h from mere text segments (strings)
1119 into real objects, pseudo-elements with a tag-name of "~text", and the
1120 actual text content in an attribute called "text". (For a discussion
1121 of pseudo-elements, see the "tag" method, far above.) This method is
1122 provided because, for some purposes, it is convenient or necessary to
1123 be able, for a given text node, to ask what element is its parent; and
1124 clearly this is not possible if a node is just a text string.
1125
1126 Note that these "~text" objects are not recognized as text nodes by
1127 methods like as_text. Presumably you will want to call
1128 $h->objectify_text, perform whatever task that you needed that for, and
1129 then call $h->deobjectify_text before calling anything like
1130 $h->as_text.
1131
1132 $h->deobjectify_text()
1133 This undoes the effect of $h->objectify_text. That is, it takes any
1134 "~text" pseudo-elements in the tree at/under $h, and deletes each one,
1135 replacing each with the content of its "text" attribute.
1136
1137 Note that if $h itself is a "~text" pseudo-element, it will be
1138 destroyed -- a condition you may need to treat specially in your
1139 calling code (since it means you can't very well do anything with $h
1140 after that). So that you can detect that condition, if $h is itself a
1141 "~text" pseudo-element, then this method returns the value of the
1142 "text" attribute, which should be a defined value; in all other cases,
1143 it returns undef.
1144
1145 (This method assumes that no "~text" pseudo-element has any children.)
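
A round trip through objectify_text and deobjectify_text might look like the following sketch. The paragraph is a made-up example:

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(['p', 'Hello ', ['b', 'world']]);

# Turn the text segments into "~text" pseudo-elements...
$p->objectify_text;

# ...so that each piece of text can be asked for its parent.
my @report;
for my $node ($p->look_down('_tag', '~text')) {
    push @report, $node->parent->tag . ': ' . $node->attr('text');
}
print "$_\n" for @report;

# Put the plain strings back before calling anything like as_text.
my $returned = $p->deobjectify_text;
print "as_text: ", $p->as_text, "\n";
```

Since $p is not itself a "~text" pseudo-element, $returned is undef here.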
1146
1147 $h->number_lists()
1148 For every UL, OL, DIR, and MENU element at/under $h, this sets a
1149 "_bullet" attribute for every child LI element. For LI children of an
1150 OL, the "_bullet" attribute's value will be something like "4.", "d.",
1151 "D.", "IV.", or "iv.", depending on the OL element's "type" attribute.
1152 LI children of a UL, DIR, or MENU get their "_bullet" attribute set to
1153 "*". There should be no other LIs (i.e., except as children of OL, UL,
1154 DIR, or MENU elements), and if there are, they are unaffected.
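
A brief sketch using an OL with no "type" attribute (assumed here to give plain decimal numbering) and a UL. The list items are invented:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical body with one ordered and one unordered list.
my $body = HTML::Element->new_from_lol(
    ['body',
        ['ol', ['li', 'first'], ['li', 'second']],
        ['ul', ['li', 'bullet']],
    ]
);

$body->number_lists;

# "_bullet" is an internal attribute, readable with attr() like any other.
my @bullets = map { $_->attr('_bullet') } $body->look_down('_tag', 'li');
print "@bullets\n";
```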
1155
1156 $h->has_insane_linkage
1157 This method is for testing whether this element or the elements under
1158 it have linkage attributes (_parent and _content) whose values are
1159 deeply aberrant: if there are undefs in a content list; if an element
1160 appears in the content lists of more than one element; if the _parent
1161 attribute of an element doesn't match its actual parent; or if an
1162 element appears as its own descendant (i.e., if there is a cyclicity in
1163 the tree).
1164
1165 This returns an empty list (or false, in scalar context) if the subtree's
1166 linkage attributes are sane; otherwise it returns two items (or true, in
1167 scalar context): the element where the error occurred, and a string
1168 describing the error.
1169
1170 This method is provided mainly for debugging and troubleshooting --
1171 it should be quite impossible for any document constructed via
1172 HTML::TreeBuilder to parse into a non-sane tree (since it's not the
1173 content of the tree per se that's in question, but whether the tree in
1174 memory was properly constructed); and it should be impossible for you
1175 to produce an insane tree just thru reasonable use of normal documented
1176 structure-modifying methods. But if you're constructing your own
1177 trees, and your program is going into infinite loops during calls to
1178 traverse() or any of the secondary structural methods, then, as part of
1179 debugging, consider calling has_insane_linkage on the tree.
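
A sketch of such a debugging check, run here on a small hand-built tree that is expected to be sane:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical tree built through normal documented methods.
my $tree = HTML::Element->new_from_lol(['div', ['p', 'fine']]);

# List context: empty if sane, otherwise (element, description).
my ($bad_node, $why) = $tree->has_insane_linkage;

if (defined $bad_node) {
    warn "Bad linkage at ", $bad_node->address, ": $why\n";
} else {
    print "Linkage looks sane\n";
}
```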
1180
1182 * If you want to free the memory associated with a tree built of
1183 HTML::Element nodes, then you will have to delete it explicitly. See
1184 the $h->delete method, above.
1185
1186 * There's almost nothing to stop you from making a "tree" with
1187 cyclicities (loops) in it, which could, for example, make the traverse
1188 method go into an infinite loop. So don't make cyclicities! (If all
1189 you're doing is parsing HTML files, and looking at the resulting trees,
1190 this will never be a problem for you.)
1191
1192 * There's no way to represent comments or processing directives in a
1193 tree with HTML::Elements. Not yet, at least.
1194
1195 * There's (currently) nothing to stop you from using an undefined value
1196 as a text segment. If you're running under "perl -w", however, this
1197 may make HTML::Element's code produce a slew of warnings.
1198
1200 You are welcome to derive subclasses from HTML::Element, but you should
1201 be aware that the code in HTML::Element makes certain assumptions about
1202 elements (and I'm using "element" to mean ONLY an object of class
1203 HTML::Element, or of a subclass of HTML::Element):
1204
1205 * The value of an element's _parent attribute must either be undef or
1206 otherwise false, or must be an element.
1207
1208 * The value of an element's _content attribute must either be undef or
1209 otherwise false, or a reference to an (unblessed) array. The array may
1210 be empty; but if it has items, they must ALL be either mere strings
1211 (text segments), or elements.
1212
1213 * The value of an element's _tag attribute should, at least, be a
1214 string of printable characters.
1215
1216 Moreover, bear these rules in mind:
1217
1218 * Do not break encapsulation on objects. That is, access their
1219 contents only thru $obj->attr or more specific methods.
1220
1221 * You should think twice before completely overriding any of the
1222 methods that HTML::Element provides. (Overriding with a method that
1223 calls the superclass method is not so bad, though.)
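
As a sketch of a subclass that follows these rules, the class below extends "as_text" by calling the superclass method and then tidying whitespace. The package name My::Element is hypothetical:

```perl
package My::Element;   # hypothetical subclass name
use strict;
use warnings;
use HTML::Element;
our @ISA = ('HTML::Element');

# Extend as_text rather than replace it: delegate, then collapse whitespace.
sub as_text {
    my $self = shift;
    my $text = $self->SUPER::as_text(@_);
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

package main;

my $p = My::Element->new('p', class => 'note');
$p->push_content("  Hello,\n   world  ");

# Contents are reached only through methods, never through the hash.
my $tidy  = $p->as_text;
my $class = $p->attr('class');
print "$tidy ($class)\n";
```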
1224
1226 HTML::Tree; HTML::TreeBuilder; HTML::AsSubs; HTML::Tagset; and, for the
1227 morbidly curious, HTML::Element::traverse.
1228
1230 Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy
1231 Lester, 2006 Pete Krawczyk.
1232
1233 This library is free software; you can redistribute it and/or modify it
1234 under the same terms as Perl itself.
1235
1236 This program is distributed in the hope that it will be useful, but
1237 without any warranty; without even the implied warranty of
1238 merchantability or fitness for a particular purpose.
1239
1241 Currently maintained by Pete Krawczyk "<petek@cpan.org>"
1242
1243 Original authors: Gisle Aas, Sean Burke and Andy Lester.
1244
1245 Thanks to Mark-Jason Dominus for a POD suggestion.
1246
1247
1248
1249perl v5.10.1 2010-11-12 HTML::Element(3)