1HTML::Element(3) User Contributed Perl Documentation HTML::Element(3)
2
3
4
6 HTML::Element - Class for objects that represent HTML elements
7
9 Version 4.1
10
12 use HTML::Element;
13 $a = HTML::Element->new('a', href => 'http://www.perl.com/');
14 $a->push_content("The Perl Homepage");
15
16 $tag = $a->tag;
17 print "$tag starts out as:", $a->starttag, "\n";
18 print "$tag ends as:", $a->endtag, "\n";
19 print "$tag\'s href attribute is: ", $a->attr('href'), "\n";
20
21 $links_r = $a->extract_links();
22 print "Hey, I found ", scalar(@$links_r), " links.\n";
23
24 print "And that, as HTML, is: ", $a->as_HTML, "\n";
25 $a = $a->delete;
26
28 (This class is part of the HTML::Tree dist.)
29
Objects of the HTML::Element class can be used to represent elements of
HTML document trees. These objects have attributes, notably attributes
that designate each element's parent and content. The content is an
array of text segments and other HTML::Element objects. A tree with
HTML::Element objects as nodes can represent the syntax tree for an HTML
document.
36
38 Consider this HTML document:
39
40 <html lang='en-US'>
41 <head>
42 <title>Stuff</title>
43 <meta name='author' content='Jojo'>
44 </head>
45 <body>
46 <h1>I like potatoes!</h1>
47 </body>
48 </html>
49
50 Building a syntax tree out of it makes a tree-structure in memory that
51 could be diagrammed as:
52
                 html (lang='en-US')
                  / \
                /     \
              /         \
            head        body
           /\               \
         /    \               \
       /        \               \
     title     meta              h1
      |       (name='author',     |
   "Stuff"    content='Jojo')    "I like potatoes!"
64
65 This is the traditional way to diagram a tree, with the "root" at the
66 top, and it's this kind of diagram that people have in mind when they
67 say, for example, that "the meta element is under the head element
68 instead of under the body element". (The same is also said with
69 "inside" instead of "under" -- the use of "inside" makes more sense
70 when you're looking at the HTML source.)
71
72 Another way to represent the above tree is with indenting:
73
html (attributes: lang='en-US')
  head
    title
      "Stuff"
    meta (attributes: name='author' content='Jojo')
  body
    h1
      "I like potatoes!"
82
83 Incidentally, diagramming with indenting works much better for very
84 large trees, and is easier for a program to generate. The
85 "$tree->dump" method uses indentation just that way.
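As a rough illustration, the example tree can be built by hand and dumped. This is only a sketch: it assumes the "new_from_lol" constructor (mentioned under "push_content" below) accepts nested arrayrefs of the form [tag, {attributes}, children...], and the variable name $tree is purely illustrative.

```perl
use strict;
use warnings;
use HTML::Element;

# Build the example document as nested arrayrefs:
# [tag, {optional attributes}, children...]
my $tree = HTML::Element->new_from_lol(
    ['html', { lang => 'en-US' },
        ['head',
            ['title', 'Stuff'],
            ['meta', { name => 'author', content => 'Jojo' }],
        ],
        ['body',
            ['h1', 'I like potatoes!'],
        ],
    ]
);

# Print an indented outline of the tree to STDOUT (debugging format).
$tree->dump;
```

When you are finished with such a tree, "$tree->delete" releases it.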
86
87 However you diagram the tree, it's stored the same in memory -- it's a
88 network of objects, each of which has attributes like so:
89
element #1:  _tag: 'html'
             _parent: none
             _content: [element #2, element #5]
             lang: 'en-US'

element #2:  _tag: 'head'
             _parent: element #1
             _content: [element #3, element #4]

element #3:  _tag: 'title'
             _parent: element #2
             _content: [text segment "Stuff"]

element #4:  _tag: 'meta'
             _parent: element #2
             _content: none
             name: 'author'
             content: 'Jojo'

element #5:  _tag: 'body'
             _parent: element #1
             _content: [element #6]

element #6:  _tag: 'h1'
             _parent: element #5
             _content: [text segment "I like potatoes!"]
116
117 The "treeness" of the tree-structure that these elements comprise is
118 not an aspect of any particular object, but is emergent from the
119 relatedness attributes (_parent and _content) of these element-objects
120 and from how you use them to get from element to element.
121
122 While you could access the content of a tree by writing code that says
123 "access the 'src' attribute of the root's first child's seventh child's
124 third child", you're more likely to have to scan the contents of a
125 tree, looking for whatever nodes, or kinds of nodes, you want to do
126 something with. The most straightforward way to look over a tree is to
127 "traverse" it; an HTML::Element method ("$h->traverse") is provided for
128 this purpose; and several other HTML::Element methods are based on it.
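To make the idea of walking from node to node concrete, here is a minimal hand-rolled walk that uses only "$h->tag" and "$h->content_list". It is a sketch, not a substitute for "$h->traverse"; the helper name "outline" is hypothetical and not part of HTML::Element.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'Stuff']],
        ['body', ['h1', 'I like potatoes!']],
    ]
);

# Hypothetical helper (not part of HTML::Element): returns one outline
# line per node, indenting children two spaces deeper than their parent.
sub outline {
    my ($node, $depth) = @_;
    my $pad = '  ' x $depth;
    return qq{$pad"$node"} unless ref $node;    # text segment
    return (
        $pad . $node->tag,                      # element: its tag name...
        map { outline($_, $depth + 1) } $node->content_list,   # ...then its children
    );
}

my @lines = outline($tree, 0);
print "$_\n" for @lines;
```

Text segments are plain strings, so "ref" is enough to tell them apart from element objects.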
129
130 (For everything you ever wanted to know about trees, and then some, see
131 Niklaus Wirth's Algorithms + Data Structures = Programs or Donald
132 Knuth's The Art of Computer Programming, Volume 1.)
133
134 Version
135 Why is this a sub?
136
137 ABORT OK PRUNE PRUNE_SOFTLY PRUNE_UP
138 Constants for signalling back to the traverser
139
141 $h = HTML::Element->new('tag', 'attrname' => 'value', ... )
142 This constructor method returns a new HTML::Element object. The tag
143 name is a required argument; it will be forced to lowercase.
144 Optionally, you can specify other initial attributes at object creation
145 time.
146
147 $h->attr('attr') or $h->attr('attr', 'value')
148 Returns (optionally sets) the value of the given attribute of $h. The
149 attribute name (but not the value, if provided) is forced to lowercase.
150 If trying to read the value of an attribute not present for this
151 element, the return value is undef. If setting a new value, the old
152 value of that attribute is returned.
153
If methods are provided for accessing an attribute (like "$h->tag" for
"_tag", "$h->content_list", etc. below), use those instead of calling
"$h->attr", whether for reading or setting.
157
158 Note that setting an attribute to "undef" (as opposed to "", the empty
159 string) actually deletes the attribute.
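The reading, setting, and deleting behaviors described above can be seen together in a short sketch. The element and attribute values here are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

my $img = HTML::Element->new('img', src => 'pie.png');

my $missing = $img->attr('alt');               # undef: attribute not present
my $old     = $img->attr('src', 'cake.png');   # sets; returns the old value
$img->attr('ALT', 'A Cake');                   # name is lowercased, value is not
my $alt     = $img->attr('alt');

$img->attr('alt', undef);                      # undef deletes the attribute
my $gone    = !defined $img->attr('alt');

print "old src: $old, new src: ", $img->attr('src'), "\n";
```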
160
161 $h->tag() or $h->tag('tagname')
162 Returns (optionally sets) the tag name (also known as the generic
163 identifier) for the element $h. In setting, the tag name is always
164 converted to lower case.
165
166 There are four kinds of "pseudo-elements" that show up as HTML::Element
167 objects:
168
169 Comment pseudo-elements
170 These are element objects with a "$h->tag" value of "~comment", and
171 the content of the comment is stored in the "text" attribute
172 ("$h->attr("text")"). For example, parsing this code with
173 HTML::TreeBuilder...
174
175 <!-- I like Pie.
176 Pie is good
177 -->
178
179 produces an HTML::Element object with these attributes:
180
181 "_tag",
182 "~comment",
183 "text",
184 " I like Pie.\n Pie is good\n "
185
186 Declaration pseudo-elements
187 Declarations (rarely encountered) are represented as HTML::Element
188 objects with a tag name of "~declaration", and content in the
189 "text" attribute. For example, this:
190
191 <!DOCTYPE foo>
192
193 produces an element whose attributes include:
194
195 "_tag", "~declaration", "text", "DOCTYPE foo"
196
197 Processing instruction pseudo-elements
198 PIs (rarely encountered) are represented as HTML::Element objects
199 with a tag name of "~pi", and content in the "text" attribute. For
200 example, this:
201
202 <?stuff foo?>
203
204 produces an element whose attributes include:
205
206 "_tag", "~pi", "text", "stuff foo?"
207
208 (assuming a recent version of HTML::Parser)
209
210 ~literal pseudo-elements
211 These objects are not currently produced by HTML::TreeBuilder, but
212 can be used to represent a "super-literal" -- i.e., a literal you
213 want to be immune from escaping. (Yes, I just made that term up.)
214
215 That is, this is useful if you want to insert code into a tree that
216 you plan to dump out with "as_HTML", where you want, for some
217 reason, to suppress "as_HTML"'s normal behavior of amp-quoting text
218 segments.
219
220 For example, this:
221
222 my $literal = HTML::Element->new('~literal',
223 'text' => 'x < 4 & y > 7'
224 );
225 my $span = HTML::Element->new('span');
226 $span->push_content($literal);
227 print $span->as_HTML;
228
229 prints this:
230
231 <span>x < 4 & y > 7</span>
232
233 Whereas this:
234
235 my $span = HTML::Element->new('span');
236 $span->push_content('x < 4 & y > 7');
237 # normal text segment
238 print $span->as_HTML;
239
240 prints this:
241
<span>x &lt; 4 &amp; y &gt; 7</span>
243
244 Unless you're inserting lots of pre-cooked code into existing
245 trees, and dumping them out again, it's not likely that you'll find
246 "~literal" pseudo-elements useful.
247
248 $h->parent() or $h->parent($new_parent)
249 Returns (optionally sets) the parent (aka "container") for this
250 element. The parent should either be undef, or should be another
251 element.
252
253 You should not use this to directly set the parent of an element.
254 Instead use any of the other methods under "Structure-Modifying
255 Methods", below.
256
257 Note that not($h->parent) is a simple test for whether $h is the root
258 of its subtree.
259
260 $h->content_list()
261 Returns a list of the child nodes of this element -- i.e., what nodes
262 (elements or text segments) are inside/under this element. (Note that
263 this may be an empty list.)
264
265 In a scalar context, this returns the count of the items, as you may
266 expect.
267
268 $h->content()
269 This somewhat deprecated method returns the content of this element;
270 but unlike content_list, this returns either undef (which you should
271 understand to mean no content), or a reference to the array of content
272 items, each of which is either a text segment (a string, i.e., a
273 defined non-reference scalar value), or an HTML::Element object. Note
274 that even if an arrayref is returned, it may be a reference to an empty
275 array.
276
277 While older code should feel free to continue to use "$h->content", new
278 code should use "$h->content_list" in almost all conceivable cases. It
279 is my experience that in most cases this leads to simpler code anyway,
280 since it means one can say:
281
282 @children = $h->content_list;
283
284 instead of the inelegant:
285
286 @children = @{$h->content || []};
287
288 If you do use "$h->content" (or "$h->content_array_ref"), you should
289 not use the reference returned by it (assuming it returned a reference,
290 and not undef) to directly set or change the content of an element or
291 text segment! Instead use content_refs_list or any of the other
292 methods under "Structure-Modifying Methods", below.
293
294 $h->content_array_ref()
295 This is like "content" (with all its caveats and deprecations) except
296 that it is guaranteed to return an array reference. That is, if the
297 given node has no "_content" attribute, the "content" method would
298 return that undef, but "content_array_ref" would set the given node's
299 "_content" value to "[]" (a reference to a new, empty array), and
300 return that.
301
302 $h->content_refs_list
This returns a list of scalar references to each element of $h's
content list. This is useful when you want to edit large text segments
in place, without having to copy the segment's current value, modify
the copy, and then use "splice_content" to replace the old with the
new. Instead, here you can in-place edit:
308
309 foreach my $item_r ($h->content_refs_list) {
310 next if ref $$item_r;
311 $$item_r =~ s/honour/honor/g;
312 }
313
You could currently achieve the same effect with:
315
316 foreach my $item (@{ $h->content_array_ref }) {
317 # deprecated!
318 next if ref $item;
319 $item =~ s/honour/honor/g;
320 }
321
322 ...except that using the return value of "$h->content" or
323 "$h->content_array_ref" to do that is deprecated, and just might stop
324 working in the future.
325
326 $h->implicit() or $h->implicit($bool)
327 Returns (optionally sets) the "_implicit" attribute. This attribute is
328 a flag that's used for indicating that the element was not originally
329 present in the source, but was added to the parse tree (by
330 HTML::TreeBuilder, for example) in order to conform to the rules of
331 HTML structure.
332
333 $h->pos() or $h->pos($element)
334 Returns (and optionally sets) the "_pos" (for "current position")
335 pointer of $h. This attribute is a pointer used during some parsing
336 operations, whose value is whatever HTML::Element element at or under
337 $h is currently "open", where "$h->insert_element(NEW)" will actually
338 insert a new element.
339
340 (This has nothing to do with the Perl function called "pos", for
341 controlling where regular expression matching starts.)
342
343 If you set "$h->pos($element)", be sure that $element is either $h, or
344 an element under $h.
345
346 If you've been modifying the tree under $h and are no longer sure
347 "$h->pos" is valid, you can enforce validity with:
348
349 $h->pos(undef) unless $h->pos->is_inside($h);
350
351 $h->all_attr()
352 Returns all this element's attributes and values, as key-value pairs.
353 This will include any "internal" attributes (i.e., ones not present in
354 the original element, and which will not be represented if/when you
355 call "$h->as_HTML"). Internal attributes are distinguished by the fact
356 that the first character of their key (not value! key!) is an
357 underscore ("_").
358
Example output of "$h->all_attr()": '_parent', [object_value],
'_tag', 'em', 'lang', 'en-US', '_content', [array-ref value].
361
362 $h->all_attr_names()
363 Like all_attr, but only returns the names of the attributes.
364
Example output of "$h->all_attr_names()": '_parent', '_tag', 'lang',
'_content'.
367
368 $h->all_external_attr()
369 Like "all_attr", except that internal attributes are not present.
370
371 $h->all_external_attr_names()
Like "all_attr_names", except that internal attributes' names
are not present.
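The difference between the "all" and "external" variants is easiest to see side by side. This sketch assumes a freshly constructed element, so the only internal attribute checked for is "_tag".

```perl
use strict;
use warnings;
use HTML::Element;

my $em = HTML::Element->new('em', lang => 'en-US');

my %all      = $em->all_attr;            # includes internal keys such as "_tag"
my %external = $em->all_external_attr;   # only keys without a leading underscore

my @names = $em->all_external_attr_names;
@names = sort @names;

print "external attribute names: @names\n";
```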
374
375 $h->id() or $h->id($string)
376 Returns (optionally sets to $string) the "id" attribute.
377 "$h->id(undef)" deletes the "id" attribute.
378
379 $h->idf() or $h->idf($string)
380 Just like the "id" method, except that if you call "$h->idf()" and no
381 "id" attribute is defined for this element, then it's set to a likely-
382 to-be-unique value, and returned. (The "f" is for "force".)
383
385 These methods are provided for modifying the content of trees by adding
386 or changing nodes as parents or children of other nodes.
387
388 $h->push_content($element_or_text, ...)
Adds the specified items to the end of the content list of the element
$h. The items of content to be added should each be either a text
segment (a string), an HTML::Element object, or an arrayref. Arrayrefs
are fed thru "$h->new_from_lol(that_arrayref)" to convert them into
elements, before being added to the content list of $h. This means you
can say concise things like:
395
396 $body->push_content(
397 ['br'],
398 ['ul',
399 map ['li', $_], qw(Peaches Apples Pears Mangos)
400 ]
401 );
402
403 See "new_from_lol" method's documentation, far below, for more
404 explanation.
405
406 The push_content method will try to consolidate adjacent text segments
407 while adding to the content list. That's to say, if $h's content_list
408 is
409
410 ('foo bar ', $some_node, 'baz!')
411
412 and you call
413
414 $h->push_content('quack?');
415
416 then the resulting content list will be this:
417
418 ('foo bar ', $some_node, 'baz!quack?')
419
420 and not this:
421
422 ('foo bar ', $some_node, 'baz!', 'quack?')
423
424 If that latter is what you want, you'll have to override the feature of
425 consolidating text by using splice_content, as in:
426
427 $h->splice_content(scalar($h->content_list),0,'quack?');
428
Similarly, if you want to add 'Skronk' to the beginning of the
content list and call this:
431
432 $h->unshift_content('Skronk');
433
434 then the resulting content list will be this:
435
436 ('Skronkfoo bar ', $some_node, 'baz!')
437
438 and not this:
439
440 ('Skronk', 'foo bar ', $some_node, 'baz!')
441
What you'd do to get the latter is:
443
444 $h->splice_content(0,0,'Skronk');
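The consolidation behavior discussed above can be checked with a small runnable sketch. The 'b' element here merely stands in for $some_node, and 'moo' is an arbitrary extra text segment.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content('foo bar ', ['b', 'node'], 'baz!');

# push_content merges new text into a trailing text segment.
$p->push_content('quack?');
my $after_push = scalar($p->content_list);

# splice_content at the end adds a separate text segment instead.
$p->splice_content(scalar($p->content_list), 0, 'moo');
my $after_splice = scalar($p->content_list);

my @content = $p->content_list;
print "items after push: $after_push, after splice: $after_splice\n";
```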
445
446 $h->unshift_content($element_or_text, ...)
447 Just like "push_content", but adds to the beginning of the $h element's
448 content list.
449
450 The items of content to be added should each be either a text segment
451 (a string), an HTML::Element object, or an arrayref (which is fed thru
452 "new_from_lol").
453
454 The unshift_content method will try to consolidate adjacent text
455 segments while adding to the content list. See above for a discussion
456 of this.
457
458 $h->splice_content($offset, $length, $element_or_text, ...)
Detaches the elements from $h's list of content-nodes, starting at
$offset and continuing for $length items, replacing them with the
elements of the following list, if any. Returns the elements (if any)
removed from the content-list. If $offset is negative, then it starts
that far from the end of the array, just like Perl's normal "splice"
function. If $length and the following list are omitted, removes
everything from $offset onward.
466
467 The items of content to be added (if any) should each be either a text
468 segment (a string), an arrayref (which is fed thru "new_from_lol"), or
469 an HTML::Element object that's not already a child of $h.
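A brief sketch of both the replacing and the removing forms follows. The list items are invented for illustration, and "as_text" (described further below) is used only to show the result.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new_from_lol(
    ['ul', ['li', 'one'], ['li', 'two'], ['li', 'three']]
);

# Replace the second item; the detached node comes back as the return value.
my @removed = $ul->splice_content(1, 1, ['li', 'TWO']);

# A negative offset counts from the end, as with Perl's own splice.
my @last = $ul->splice_content(-1, 1);

my @texts = map { $_->as_text } $ul->content_list;
print "kept: @texts\n";
```

The removed nodes are detached, not destroyed, so call "delete" on them if they are no longer needed.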
470
471 $h->detach()
This unlinks $h from its parent, by setting its "_parent" attribute to
undef, and by removing it from the content list of its parent (if it
had one). The return value is the parent that was detached from (or
undef, if $h had no parent to start with). Note that neither $h nor
its parent is explicitly destroyed.
477
478 $h->detach_content()
479 This unlinks all of $h's children from $h, and returns them. Note that
480 these are not explicitly destroyed; for that, you can just use
481 $h->delete_content.
482
483 $h->replace_with( $element_or_text, ... )
484 This replaces $h in its parent's content list with the nodes specified.
485 The element $h (which by then may have no parent) is returned. This
486 causes a fatal error if $h has no parent. The list of nodes to insert
487 may contain $h, but at most once. Aside from that possible exception,
488 the nodes to insert should not already be children of $h's parent.
489
490 Also, note that this method does not destroy $h -- use
491 "$h->replace_with(...)->delete" if you need that.
492
493 $h->preinsert($element_or_text...)
494 Inserts the given nodes right BEFORE $h in $h's parent's content list.
495 This causes a fatal error if $h has no parent. None of the given nodes
496 should be $h or other children of $h. Returns $h.
497
498 $h->postinsert($element_or_text...)
499 Inserts the given nodes right AFTER $h in $h's parent's content list.
500 This causes a fatal error if $h has no parent. None of the given nodes
501 should be $h or other children of $h. Returns $h.
502
503 $h->replace_with_content()
504 This replaces $h in its parent's content list with its own content.
505 The element $h (which by then has no parent or content of its own) is
506 returned. This causes a fatal error if $h has no parent. Also, note
507 that this does not destroy $h -- use "$h->replace_with_content->delete"
508 if you need that.
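The insertion and replacement methods above combine naturally. The following sketch, with made-up content, inserts siblings around a "b" element and then unwraps it.

```perl
use strict;
use warnings;
use HTML::Element;

my $p    = HTML::Element->new_from_lol(['p', 'A ', ['b', 'bold'], ' move']);
my $bold = ($p->content_list)[1];

$bold->preinsert('very ');                     # text goes right before <b>
$bold->postinsert(HTML::Element->new('br'));   # element goes right after <b>
my $count = scalar($p->content_list);          # 'A ', 'very ', <b>, <br>, ' move'

# Unwrap <b>: its content takes its place, then the empty shell is destroyed.
$bold->replace_with_content->delete;

print $p->as_text, "\n";
```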
509
510 $h->delete_content()
511 Clears the content of $h, calling "$h->delete" for each content
512 element. Compare with "$h->detach_content".
513
514 Returns $h.
515
516 $h->delete() destroy destroy_content
517 Detaches this element from its parent (if it has one) and explicitly
518 destroys the element and all its descendants. The return value is
519 undef.
520
521 Perl uses garbage collection based on reference counting; when no
522 references to a data structure exist, it's implicitly destroyed --
523 i.e., when no value anywhere points to a given object anymore, Perl
524 knows it can free up the memory that the now-unused object occupies.
525
526 But this fails with HTML::Element trees, because a parent element
527 always holds references to its children, and its children elements hold
528 references to the parent, so no element ever looks like it's not in
529 use. So, to destroy those elements, you need to call "$h->delete" on
530 the parent.
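To contrast "detach" with "delete", here is a small sketch in which one child is rescued before its former parent is destroyed. Scalar::Util's "refaddr" is used only to compare object identities; the content is invented.

```perl
use strict;
use warnings;
use HTML::Element;
use Scalar::Util qw(refaddr);

my $div = HTML::Element->new_from_lol(
    ['div', ['p', 'keep me'], ['p', 'drop me']]
);
my ($keep) = $div->content_list;

# detach returns the parent that the node was unlinked from.
my $detached_from_div = refaddr($keep->detach) == refaddr($div);

# delete destroys $div and everything still under it, and returns undef.
$div = $div->delete;

print "survivor says: ", $keep->as_text, "\n";
```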
531
532 $h->clone()
533 Returns a copy of the element (whose children are clones (recursively)
534 of the original's children, if any).
535
The returned element is parentless. Any '_pos' attributes present in
the source element/tree will be absent in the copy. For that and other
reasons, the clone of an HTML::TreeBuilder object that's in mid-parse
(i.e., the head of a tree that HTML::TreeBuilder is elaborating) cannot
(currently) be used to continue the parse.
541
542 You are free to clone HTML::TreeBuilder trees, just as long as: 1)
543 they're done being parsed, or 2) you don't expect to resume parsing
544 into the clone. (You can continue parsing into the original; it is
545 never affected.)
546
547 HTML::Element->clone_list(...nodes...)
548 Returns a list consisting of a copy of each node given. Text segments
549 are simply copied; elements are cloned by calling $it->clone on each of
550 them.
551
552 Note that this must be called as a class method, not as an instance
553 method. "clone_list" will croak if called as an instance method. You
554 can also call it like so:
555
556 ref($h)->clone_list(...nodes...)
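A short sketch of "clone" and "clone_list" follows, showing that changes to the copy leave the original alone. The content and variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::Element;
use Scalar::Util qw(refaddr);

my $orig = HTML::Element->new_from_lol(
    ['div', ['p', 'Hello ', ['b', 'world']]]
);
my $para = ($orig->content_list)[0];

my $copy = $para->clone;     # deep copy; the copy has no parent
$copy->push_content('!');    # changes the copy only

# Class-method form: text segments are copied, elements are cloned.
my @copies = HTML::Element->clone_list('plain text', $para);

print "original: ", $para->as_text, "\n";
print "copy:     ", $copy->as_text, "\n";
```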
557
558 $h->normalize_content
559 Normalizes the content of $h -- i.e., concatenates any adjacent text
560 nodes. (Any undefined text segments are turned into empty-strings.)
561 Note that this does not recurse into $h's descendants.
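For instance, adjacent text segments created with "splice_content" (which does not consolidate text) can be merged afterwards. This is a minimal sketch with made-up strings.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->splice_content(0, 0, 'foo', 'bar', 'baz');   # three separate text segments
my $before = scalar($p->content_list);

$p->normalize_content;                           # concatenate adjacent text
my @after = $p->content_list;

print "segments before: $before, after: ", scalar(@after), "\n";
```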
562
563 $h->delete_ignorable_whitespace()
This traverses under $h and deletes any text segments that are
ignorable whitespace. You should not use this if $h is under a 'pre'
element.
567
568 $h->insert_element($element, $implicit)
569 Inserts (via push_content) a new element under the element at
570 "$h->pos()". Then updates "$h->pos()" to point to the inserted
571 element, unless $element is a prototypically empty element like "br",
572 "hr", "img", etc. The new "$h->pos()" is returned. This method is
573 useful only if your particular tree task involves setting "$h->pos()".
574
576 $h->dump()
577 $h->dump(*FH) ; # or *FH{IO} or $fh_obj
578 Prints the element and all its children to STDOUT (or to a specified
579 filehandle), in a format useful only for debugging. The structure of
580 the document is shown by indentation (no end tags).
581
582 $h->as_HTML() or $h->as_HTML($entities)
583 or $h->as_HTML($entities, $indent_char)
584 or $h->as_HTML($entities, $indent_char, \%optional_end_tags)
585 Returns a string representing in HTML the element and its descendants.
586 The optional argument $entities specifies a string of the entities to
587 encode. For compatibility with previous versions, specify '<>&' here.
588 If omitted or undef, all unsafe characters are encoded as HTML
589 entities. See HTML::Entities for details. If passed an empty string,
590 no entities are encoded.
591
If $indent_char is specified and defined, the HTML to be output is
indented, using the string you specify (which you probably should set
to "\t", or some number of spaces, if you specify it).
595
596 If "\%optional_end_tags" is specified and defined, it should be a
597 reference to a hash that holds a true value for every tag name whose
598 end tag is optional. Defaults to "\%HTML::Element::optionalEndTag",
599 which is an alias to %HTML::Tagset::optionalEndTag, which, at time of
600 writing, contains true values for "p, li, dt, dd". A useful value to
601 pass is an empty hashref, "{}", which means that no end-tags are
602 optional for this dump. Otherwise, possibly consider copying
603 %HTML::Tagset::optionalEndTag to a hash of your own, adding or deleting
604 values as you like, and passing a reference to that hash.
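The three arguments can be combined as in the sketch below. The list content is invented; the exact whitespace of the indented form is not shown here because it may vary between versions.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new_from_lol(
    ['ul', ['li', 'x < 4 & y > 7'], ['li', 'plain']]
);

my $default  = $ul->as_HTML;                     # optional end tags may be omitted
my $explicit = $ul->as_HTML('<>&', undef, {});   # encode only < > &; no end tag is optional
my $indented = $ul->as_HTML('<>&', '  ', {});    # same, indented two spaces per level

print $explicit, "\n", $indented, "\n";
```

Passing the empty hashref is a common way to force "</li>" and "</p>" into the output.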
605
606 $h->as_text()
607 $h->as_text(skip_dels => 1, extra_chars => '\xA0')
608 Returns a string consisting of only the text parts of the element's
609 descendants.
610
611 Text under 'script' or 'style' elements is never included in what's
612 returned. If "skip_dels" is true, then text content under "del" nodes
613 is not included in what's returned.
614
615 $h->as_trimmed_text(...) as_text_trimmed
616 This is just like as_text(...) except that leading and trailing
617 whitespace is deleted, and any internal whitespace is collapsed.
618
This will not remove hard spaces, Unicode spaces, or any other
non-ASCII whitespace unless you supply the extra characters as a string
argument, e.g. $h->as_trimmed_text(extra_chars => '\xA0').
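The difference between the text methods shows up with a paragraph that has stray whitespace and a "del" element. The content below is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(
    ['p', '  Hello,   ', ['del', 'cruel '], ['b', 'world'], " \n"]
);

my $raw     = $p->as_text;                          # whitespace kept as-is
my $trimmed = $p->as_trimmed_text;                  # trimmed and collapsed
my $no_dels = $p->as_trimmed_text(skip_dels => 1);  # also skips text under <del>

print "[$trimmed]\n[$no_dels]\n";
```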
622
623 $h->as_XML()
624 Returns a string representing in XML the element and its descendants.
625
626 The XML is not indented.
627
628 $h->as_Lisp_form()
629 Returns a string representing the element and its descendants as a Lisp
630 form. Unsafe characters are encoded as octal escapes.
631
632 The Lisp form is indented, and contains external ("href", etc.) as
633 well as internal attributes ("_tag", "_content", "_implicit", etc.),
634 except for "_parent", which is omitted.
635
636 Current example output for a given element:
637
638 ("_tag" "img" "border" "0" "src" "pie.png" "usemap" "#main.map")
639
640 format
641 Formats text output. Defaults to HTML::FormatText.
642
643 Takes a second argument that is a reference to a formatter.
644
645 $h->starttag() or $h->starttag($entities)
646 Returns a string representing the complete start tag for the element.
647 I.e., leading "<", tag name, attributes, and trailing ">". All values
648 are surrounded with double-quotes, and appropriate characters are
649 encoded. If $entities is omitted or undef, all unsafe characters are
650 encoded as HTML entities. See HTML::Entities for details. If you
651 specify some value for $entities, remember to include the double-quote
652 character in it. (Previous versions of this module would basically
653 behave as if '&">' were specified for $entities.) If $entities is an
654 empty string, no entity is escaped.
655
656 starttag_XML
657 Returns a string representing the complete start tag for the element.
658
659 $h->endtag() || endtag_XML
660 Returns a string representing the complete end tag for this element.
661 I.e., "</", tag name, and ">".
662
664 These methods all involve some structural aspect of the tree; either
665 they report some aspect of the tree's structure, or they involve
666 traversal down the tree, or walking up the tree.
667
668 $h->is_inside('tag', ...) or $h->is_inside($element, ...)
669 Returns true if the $h element is, or is contained anywhere inside an
670 element that is any of the ones listed, or whose tag name is any of the
671 tag names listed.
672
673 $h->is_empty()
674 Returns true if $h has no content, i.e., has no elements or text
675 segments under it. In other words, this returns true if $h is a leaf
676 node, AKA a terminal node. Do not confuse this sense of "empty" with
677 another sense that it can have in SGML/HTML/XML terminology, which
678 means that the element in question is of the type (like HTML's "hr",
679 "br", "img", etc.) that can't have any content.
680
681 That is, a particular "p" element may happen to have no content, so
682 $that_p_element->is_empty will be true -- even though the prototypical
683 "p" element isn't "empty" (not in the way that the prototypical "hr"
684 element is).
685
686 If you think this might make for potentially confusing code, consider
687 simply using the clearer exact equivalent: not($h->content_list)
688
689 $h->pindex()
690 Return the index of the element in its parent's contents array, such
691 that $h would equal
692
693 $h->parent->content->[$h->pindex]
694 or
695 ($h->parent->content_list)[$h->pindex]
696
697 assuming $h isn't root. If the element $h is root, then $h->pindex
698 returns undef.
699
700 $h->left()
701 In scalar context: returns the node that's the immediate left sibling
702 of $h. If $h is the leftmost (or only) child of its parent (or has no
703 parent), then this returns undef.
704
705 In list context: returns all the nodes that're the left siblings of $h
706 (starting with the leftmost). If $h is the leftmost (or only) child of
707 its parent (or has no parent), then this returns empty-list.
708
709 (See also $h->preinsert(LIST).)
710
711 $h->right()
712 In scalar context: returns the node that's the immediate right sibling
713 of $h. If $h is the rightmost (or only) child of its parent (or has no
714 parent), then this returns undef.
715
716 In list context: returns all the nodes that're the right siblings of
717 $h, starting with the leftmost. If $h is the rightmost (or only) child
718 of its parent (or has no parent), then this returns empty-list.
719
720 (See also $h->postinsert(LIST).)
721
722 $h->address()
723 Returns a string representing the location of this node in the tree.
724 The address consists of numbers joined by a '.', starting with '0', and
725 followed by the pindexes of the nodes in the tree that are ancestors of
726 $h, starting from the top.
727
728 So if the way to get to a node starting at the root is to go to child 2
729 of the root, then child 10 of that, and then child 0 of that, and then
730 you're there -- then that node's address is "0.2.10.0".
731
732 As a bit of a special case, the address of the root is simply "0".
733
I foresee this being used mainly for debugging, but you may find your
own uses for it.
736
737 $h->address($address)
738 This returns the node (whether element or text-segment) at the given
739 address in the tree that $h is a part of. (That is, the address is
740 resolved starting from $h->root.)
741
742 If there is no node at the given address, this returns undef.
743
744 You can specify "relative addressing" (i.e., that indexing is supposed
745 to start from $h and not from $h->root) by having the address start
746 with a period -- e.g., $h->address(".3.2") will look at child 3 of $h,
747 and child 2 of that.
748
749 $h->depth()
750 Returns a number expressing $h's depth within its tree, i.e., how many
751 steps away it is from the root. If $h has no parent (i.e., is root),
752 its depth is 0.
753
754 $h->root()
755 Returns the element that's the top of $h's tree. If $h is root, this
756 just returns $h. (If you want to test whether $h is the root, instead
757 of asking what its root is, just test "not($h->parent)".)
758
759 $h->lineage()
760 Returns the list of $h's ancestors, starting with its parent, and then
761 that parent's parent, and so on, up to the root. If $h is root, this
762 returns an empty list.
763
764 If you simply want a count of the number of elements in $h's lineage,
765 use $h->depth.
766
767 $h->lineage_tag_names()
768 Returns the list of the tag names of $h's ancestors, starting with its
769 parent, and that parent's parent, and so on, up to the root. If $h is
770 root, this returns an empty list. Example output: "('em', 'td', 'tr',
771 'table', 'body', 'html')"
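The positional methods above can be tried on a small tree. This sketch navigates with "content_list" only; the tree itself is a cut-down version of the potato document, and the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html',
        ['head', ['title', 'Stuff'], ['meta', { name => 'author' }]],
        ['body', ['h1', 'I like potatoes!']],
    ]
);
my ($head, $body) = $tree->content_list;
my $meta = ($head->content_list)[1];
my $h1   = ($body->content_list)[0];

my $addr   = $h1->address;               # root, then child 1, then child 0
my $depth  = $h1->depth;                 # steps up to the root
my @up     = $h1->lineage_tag_names;     # parent first, root last
my $found  = $tree->address('0.0.1');    # absolute address: the meta element
my $text   = $h1->address('.0');         # relative address: h1's first child
my $inside = $h1->is_inside('body') ? 1 : 0;

print "h1 is at $addr, depth $depth, under: @up\n";
```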
772
773 $h->descendants()
774 In list context, returns the list of all $h's descendant elements,
775 listed in pre-order (i.e., an element appears before its content-
776 elements). Text segments DO NOT appear in the list. In scalar
777 context, returns a count of all such elements.
778
779 $h->descendents()
780 This is just an alias to the "descendants" method.
781
782 $h->find_by_tag_name('tag', ...)
783 In list context, returns a list of elements at or under $h that have
784 any of the specified tag names. In scalar context, returns the first
785 (in pre-order traversal of the tree) such element found, or undef if
786 none.
787
788 $h->find('tag', ...)
789 This is just an alias to "find_by_tag_name". (There was once going to
790 be a whole find_* family of methods, but then look_down filled that
791 niche, so there turned out not to be much reason for the verboseness of
792 the name "find_by_tag_name".)
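Here is a small sketch of how context affects "descendants" and "find_by_tag_name" (or its alias "find"). The tree is invented for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new_from_lol(
    ['div',
        ['p', 'one ', ['em', 'two']],
        ['p', 'three'],
    ]
);

my @elements = $div->descendants;            # elements only, in pre-order; not $div itself
my $count    = $div->descendants;            # scalar context: a count
my @paras    = $div->find_by_tag_name('p');  # list context: every match
my $first_p  = $div->find('p');              # scalar context: the first match
my @both     = $div->find('div', 'em');      # "at or under" $h, so $div itself matches

print join(' ', map { $_->tag } @elements), "\n";
```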
793
794 $h->find_by_attribute('attribute', 'value')
795 In a list context, returns a list of elements at or under $h that have
796 the specified attribute, and have the given value for that attribute.
797 In a scalar context, returns the first (in pre-order traversal of the
798 tree) such element found, or undef if none.
799
800 This method is deprecated in favor of the more expressive "look_down"
801 method, which new code should use instead.
802
803 $h->look_down( ...criteria... )
804 This starts at $h and looks thru its element descendants (in pre-
805 order), looking for elements matching the criteria you specify. In
806 list context, returns all elements that match all the given criteria;
807 in scalar context, returns the first such element (or undef, if nothing
808 matched).
809
810 There are three kinds of criteria you can specify:
811
812 (attr_name, attr_value)
813 This means you're looking for an element with that value for that
814 attribute. Example: "alt", "pix!". Consider that you can search
815 on internal attribute values too: "_tag", "p".
816
817 (attr_name, qr/.../)
818 This means you're looking for an element whose value for that
819 attribute matches the specified Regexp object.
820
821 a coderef
822 This means you're looking for elements where
823 coderef->(each_element) returns true. Example:
824
825 my @wide_pix_images
826 = $h->look_down(
827 "_tag", "img",
828 "alt", "pix!",
829 sub { $_[0]->attr('width') > 350 }
830 );
831
832 Note that "(attr_name, attr_value)" and "(attr_name, qr/.../)" criteria
833 are almost always faster than coderef criteria, so should presumably be
834 put before them in your list of criteria. That is, in the example
835 above, the sub ref is called only for elements that have already passed
836 the criteria of having a "_tag" attribute with value "img", and an
837 "alt" attribute with value "pix!". If the coderef were first, it would
838 be called on every element, and then what elements pass that criterion
839 (i.e., elements for which the coderef returned true) would be checked
840 for their "_tag" and "alt" attributes.
841
842 Note that comparison of string attribute-values against the string
843 value in "(attr_name, attr_value)" is case-INsensitive! A criterion of
844 "('align', 'right')" will match an element whose "align" value is
845 "RIGHT", or "right" or "rIGhT", etc.
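
A short sketch of both points, the case-insensitive string comparison and the Regexp form of a criterion, might look like the following. The tree and its attribute values are invented for the example:

```perl
use strict;
use warnings;
use HTML::Element;

# Illustrative tree; ids and align values are assumptions of this example.
my $tree = HTML::Element->new_from_lol(
    ['body',
        ['p',   {align => 'RIGHT', id => 'intro'},  'first'],
        ['p',   {align => 'left',  id => 'note-1'}, 'second'],
        ['div', {align => 'right', id => 'note-2'}, 'third'],
    ]
);

# String criterion: compared case-insensitively, so RIGHT and right both match.
my @right_tags = map { $_->tag } $tree->look_down('align', 'right');

# Regexp criterion: the attribute value is matched against the pattern as given.
my @note_ids = map { $_->attr('id') } $tree->look_down('id', qr/^note-\d+$/);

print "@right_tags\n";    # p div
print "@note_ids\n";      # note-1 note-2

$tree->delete;
```

If a case-sensitive comparison is needed, a Regexp criterion such as qr/^right$/ is one way to get it.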
846
847 Note also that "look_down" considers "" (empty-string) and undef to be
848 different things, in attribute values. So this:
849
850 $h->look_down("alt", "")
851
852 will find elements with an "alt" attribute, but where the value for the
853 "alt" attribute is "". But this:
854
855 $h->look_down("alt", undef)
856
857 is the same as:
858
859 $h->look_down(sub { !defined($_[0]->attr('alt')) } )
860
861 That is, it finds elements that do not have an "alt" attribute at all
862 (or that do have an "alt" attribute, but with a value of undef -- which
863 is not normally possible).
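
To make the distinction concrete, here is a small sketch with three hypothetical "img" elements: one with an empty "alt", one with no "alt" at all, and one with a real value.

```perl
use strict;
use warnings;
use HTML::Element;

# Illustrative tree; the file names are made up.
my $tree = HTML::Element->new_from_lol(
    ['div',
        ['img', {src => 'a.png', alt => ''}],
        ['img', {src => 'b.png'}],
        ['img', {src => 'c.png', alt => 'A chart'}],
    ]
);

# Empty string: the attribute is present, but its value is "".
my @empty_alt = map { $_->attr('src') }
    $tree->look_down('_tag', 'img', 'alt', '');

# undef: the attribute is absent.
my @no_alt = map { $_->attr('src') }
    $tree->look_down('_tag', 'img', 'alt', undef);

print "@empty_alt\n";    # a.png
print "@no_alt\n";       # b.png

$tree->delete;
```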
864
865 Note that when you give several criteria, this is taken to mean you're
866 looking for elements that match all your criteria, not just any of
867 them. In other words, there is an implicit "and", not an "or". So if
868 you wanted to express that you wanted to find elements with a "name"
869 attribute with the value "foo" or with an "id" attribute with the value
870 "baz", you'd have to do it like:
871
872 @them = $h->look_down(
873 sub {
874 # the lcs are to fold case; the "|| ''" avoids warnings on undef
875 lc($_[0]->attr('name') || '') eq 'foo'
876 or lc($_[0]->attr('id') || '') eq 'baz'
877 }
878 );
879
880 Coderef criteria are more expressive than "(attr_name, attr_value)" and
881 "(attr_name, qr/.../)" criteria, and all "(attr_name, attr_value)" and
882 "(attr_name, qr/.../)" criteria could be expressed in terms of
883 coderefs. However, "(attr_name, attr_value)" and "(attr_name,
884 qr/.../)" criteria are a convenient shorthand. (In fact, "look_down"
885 itself is basically "shorthand" too, since anything you can do with
886 "look_down" you could do by traversing the tree, either with the
887 "traverse" method or with a routine of your own. However, "look_down"
888 often makes for very concise and clear code.)
889
890 $h->look_up( ...criteria... )
891 This is identical to $h->look_down, except that whereas $h->look_down
892 basically scans over the list:
893
894 ($h, $h->descendants)
895
896 $h->look_up instead scans over the list
897
898 ($h, $h->lineage)
899
900 So, for example, this returns all ancestors of $h (possibly including
901 $h itself) that are "td" elements with an "align" attribute with a
902 value of "right" (or "RIGHT", etc.):
903
904 $h->look_up("_tag", "td", "align", "right");
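
As a rough sketch (with an invented table fragment), this is how look_up and lineage_tag_names relate when starting from a deeply nested element:

```perl
use strict;
use warnings;
use HTML::Element;

# Illustrative tree only.
my $tree = HTML::Element->new_from_lol(
    ['table',
        ['tr',
            ['td', {align => 'RIGHT'}, ['em', 'total']],
        ],
    ]
);

# Start from the innermost element.
my $em = $tree->look_down('_tag', 'em');

# Tag names of the ancestors, nearest first.
my @names = $em->lineage_tag_names;

# Scalar context: the nearest element (starting with $em itself) that
# matches all criteria. The string comparison ignores case.
my $cell     = $em->look_up('_tag', 'td', 'align', 'right');
my $cell_tag = $cell->tag;

print "@names\n";       # td tr table
print "$cell_tag\n";    # td

$tree->delete;
```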
905
906 $h->traverse(...options...)
907 Lengthy discussion of HTML::Element's unnecessary and confusing
908 "traverse" method has been moved to a separate file:
909 HTML::Element::traverse
910
911 $h->attr_get_i('attribute')
912 In list context, returns a list consisting of the values of the given
913 attribute for $self and for all its ancestors starting from $self and
914 working its way up. Nodes with no such attribute are skipped.
915 ("attr_get_i" stands for "attribute get, with inheritance".) In scalar
916 context, returns the first such value, or undef if none.
917
918 Consider a document consisting of:
919
920 <html lang='i-klingon'>
921 <head><title>Pati Pata</title></head>
922 <body>
923 <h1 lang='la'>Stuff</h1>
924 <p lang='es-MX' align='center'>
925 Foo bar baz <cite>Quux</cite>.
926 </p>
927 <p>Hooboy.</p>
928 </body>
929 </html>
930
931 If $h is the "cite" element, $h->attr_get_i("lang") in list context
932 will return the list ('es-MX', 'i-klingon'). In scalar context, it
933 will return the value 'es-MX'.
934
935 If you call with multiple attribute names...
936
937 $h->attr_get_i('a1', 'a2', 'a3')
938 ...in list context, this will return a list consisting of the values of
939 these attributes which exist in $self and its ancestors. In scalar
940 context, this returns the first value (i.e., the value of the first
941 existing attribute from the first element that has any of the
942 attributes listed). So, in the above example,
943
944 $h->attr_get_i('lang', 'align');
945
946 will return:
947
948 ('es-MX', 'center', 'i-klingon') # in list context
949 or
950 'es-MX' # in scalar context.
951
952 But note that this:
953
954 $h->attr_get_i('align', 'lang');
955
956 will return:
957
958 ('center', 'es-MX', 'i-klingon') # in list context
959 or
960 'center' # in scalar context.
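
The example document above can be reproduced without a parser. The following sketch builds it with new_from_lol (described further down) and repeats the calls discussed:

```perl
use strict;
use warnings;
use HTML::Element;

# The same document as above, built by hand instead of parsed.
my $tree = HTML::Element->new_from_lol(
    ['html', {lang => 'i-klingon'},
        ['head', ['title', 'Pati Pata']],
        ['body',
            ['h1', {lang => 'la'}, 'Stuff'],
            ['p', {lang => 'es-MX', align => 'center'},
                'Foo bar baz ', ['cite', 'Quux'], '.'],
            ['p', 'Hooboy.'],
        ],
    ]
);

my $cite = $tree->look_down('_tag', 'cite');

my @langs = $cite->attr_get_i('lang');             # list context
my $lang  = $cite->attr_get_i('lang');             # scalar context
my @both  = $cite->attr_get_i('align', 'lang');    # several names

print "@langs\n";    # es-MX i-klingon
print "$lang\n";     # es-MX
print "@both\n";     # center es-MX i-klingon

$tree->delete;
```

The "h1" element's lang value never appears, since "h1" is a sibling of the paragraph, not an ancestor of the "cite" element.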
961
962 $h->tagname_map()
963 Scans across $h and all its descendants, and makes a hash (a reference
964 to which is returned) where each entry consists of a key that's a tag
965 name, and a value that's a reference to a list of all elements that
966 have that tag name. I.e., this method returns:
967
968 {
969 # Across $h and all descendants...
970 'a' => [ ...list of all 'a' elements... ],
971 'em' => [ ...list of all 'em' elements... ],
972 'img' => [ ...list of all 'img' elements... ],
973 }
974
975 (There are entries in the hash for only those tagnames that occur
976 at/under $h -- so if there are no "img" elements, there'll be no "img"
977 entry in the hash(ref) returned.)
978
979 Example usage:
980
981 my $map_r = $h->tagname_map();
982 my @heading_tags = sort grep m/^h\d$/s, keys %$map_r;
983 if(@heading_tags) {
984 print "Heading levels used: @heading_tags\n";
985 } else {
986 print "No headings.\n"
987 }
988
989 $h->extract_links() or $h->extract_links(@wantedTypes)
990 Returns links found by traversing the element and all of its children
991 and looking for attributes (like "href" in an "a" element, or "src" in
992 an "img" element) whose values represent links. The return value is a
993 reference to an array. Each element of the array is a reference to an
994 array with four items: the link-value, the element that has the
995 attribute with that link-value, the name of that attribute, and the
996 tagname of that element. (Example: "['http://www.suck.com/',
997 $elem_obj, 'href', 'a']".) You may or may not end up using the
998 element itself -- for some purposes, you may use only the link value.
999
1000 You might specify that you want to extract links from just some kinds
1001 of elements (instead of the default, which is to extract links from all
1002 the kinds of elements known to have attributes whose values represent
1003 links). For instance, if you want to extract links from only "a" and
1004 "img" elements, you could code it like this:
1005
1006 for (@{ $e->extract_links('a', 'img') }) {
1007 my($link, $element, $attr, $tag) = @$_;
1008 print
1009 "Hey, there's a $tag that links to ",
1010 $link, ", in its $attr attribute, at ",
1011 $element->address(), ".\n";
1012 }
1013
1014 $h->simplify_pres
1015 In text bits under PRE elements that are at/under $h, this routine
1016 nativizes all newlines, and expands all tabs.
1017
1018 That is, if you read a file with lines delimited by "\cm\cj"'s, the
1019 text under PRE areas will have "\cm\cj"'s instead of "\n"'s. Calling
1020 $h->simplify_pres on such a tree will turn "\cm\cj"'s into
1021 "\n"'s.
1022
1023 Tabs are expanded to however many spaces it takes to get to the next
1024 8th column -- the usual way of expanding them.
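
A hedged sketch of the effect, using made-up text: only the text under the PRE element should change, while the same tab in an ordinary paragraph is left alone.

```perl
use strict;
use warnings;
use HTML::Element;

# Illustrative tree; "\cm\cj" is a CR LF pair, "\t" a tab.
my $tree = HTML::Element->new_from_lol(
    ['div',
        ['pre', "ab\tc\cm\cjd"],
        ['p',   "ab\tc"],
    ]
);

$tree->simplify_pres;

my $pre = $tree->look_down('_tag', 'pre');
my $p   = $tree->look_down('_tag', 'p');

# Each element has a single text segment as its content.
my ($pre_text) = $pre->content_list;
my ($p_text)   = $p->content_list;

# Under PRE: the tab after "ab" is padded out to column 8, and the
# CR LF pair becomes "\n". Under P: the text is untouched.
print $pre_text eq "ab" . (" " x 6) . "c\nd" ? "pre simplified\n" : "pre unchanged\n";
print $p_text   eq "ab\tc"                   ? "p untouched\n"    : "p changed\n";

$tree->delete;
```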
1025
1026 $h->same_as($i)
1027 Returns true if $h and $i are both elements representing the same tree
1028 of elements, each with the same tag name, with the same explicit
1029 attributes (i.e., not counting attributes whose names start with "_"),
1030 and with the same content (textual, comments, etc.).
1031
1032 Sameness of descendant elements is tested, recursively, with
1033 "$child1->same_as($child2)", and sameness of text segments is tested
1034 with "$segment1 eq $segment2".
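
A brief sketch of what does and does not count toward sameness. The "_note" attribute is a hypothetical internal attribute used only for this illustration:

```perl
use strict;
use warnings;
use HTML::Element;

my $x = HTML::Element->new_from_lol(
    ['p', {class => 'a'}, 'Hi ', ['b', 'there']]
);
my $y = $x->clone;

# A fresh clone has the same tags, attributes, and content.
my $same_clone = $x->same_as($y) ? 1 : 0;

# Attributes whose names start with "_" are not compared.
$y->attr('_note', 'scratch');
my $same_internal = $x->same_as($y) ? 1 : 0;

# An explicit attribute with a different value breaks sameness.
$y->attr('class', 'b');
my $same_changed = $x->same_as($y) ? 1 : 0;

print "$same_clone $same_internal $same_changed\n";    # 1 1 0

$x->delete;
$y->delete;
```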
1035
1036 $h = HTML::Element->new_from_lol(ARRAYREF)
1037 Recursively constructs a tree of nodes, based on the (non-cyclic) data
1038 structure represented by ARRAYREF, where that is a reference to an
1039 array of arrays (of arrays (of arrays (etc.))).
1040
1041 In each arrayref in that structure, different kinds of values are
1042 treated as follows:
1043
1044 · Arrayrefs
1045
1046 Arrayrefs are considered to designate a sub-tree representing
1047 children for the node constructed from the current arrayref.
1048
1049 · Hashrefs
1050
1051 Hashrefs are considered to contain attribute-value pairs to add to
1052 the element to be constructed from the current arrayref.
1053
1054 · Text segments
1055
1056 Text segments at the start of any arrayref will be considered to
1057 specify the name of the element to be constructed from the current
1058 arrayref; all other text segments will be considered to specify
1059 text segments as children for the current arrayref.
1060
1061 · Elements
1062
1063 Existing element objects are either inserted into the treelet
1064 constructed, or clones of them are. That is, when the lol-tree is
1065 being traversed and elements constructed based on what's in it, if an
1066 existing element object is found, if it has no parent, then it is
1067 added directly to the treelet constructed; but if it has a parent,
1068 then "$that_node->clone" is added to the treelet at the appropriate
1069 place.
1070
1071 An example will hopefully make this more obvious:
1072
1073 my $h = HTML::Element->new_from_lol(
1074 ['html',
1075 ['head',
1076 [ 'title', 'I like stuff!' ],
1077 ],
1078 ['body',
1079 {'lang', 'en-JP', _implicit => 1},
1080 'stuff',
1081 ['p', 'um, p < 4!', {'class' => 'par123'}],
1082 ['div', {foo => 'bar'}, '123'],
1083 ]
1084 ]
1085 );
1086 $h->dump;
1087
1088 Will print this:
1089
1090 <html> @0
1091 <head> @0.0
1092 <title> @0.0.0
1093 "I like stuff!"
1094 <body lang="en-JP"> @0.1 (IMPLICIT)
1095 "stuff"
1096 <p class="par123"> @0.1.1
1097 "um, p < 4!"
1098 <div foo="bar"> @0.1.2
1099 "123"
1100
1101 And printing $h->as_HTML will give something like:
1102
1103 <html><head><title>I like stuff!</title></head>
1104 <body lang="en-JP">stuff<p class="par123">um, p &lt; 4!
1105 <div foo="bar">123</div></body></html>
1106
1107 You can even do fancy things with "map":
1108
1109 $body->push_content(
1110 # push_content implicitly calls new_from_lol on arrayrefs...
1111 ['br'],
1112 ['blockquote',
1113 ['h2', 'Pictures!'],
1114 map ['p', $_],
1115 $body2->look_down("_tag", "img"),
1116 # images, to be copied from that other tree.
1117 ],
1118 # and more stuff:
1119 ['ul',
1120 map ['li', ['a', {'href'=>"$_.png"}, $_ ] ],
1121 qw(Peaches Apples Pears Mangos)
1122 ],
1123 );
1124
1125 @elements = HTML::Element->new_from_lol(ARRAYREFS)
1126 Constructs several elements, by calling new_from_lol for every arrayref
1127 in the ARRAYREFS list.
1128
1129 @elements = HTML::Element->new_from_lol(
1130 ['hr'],
1131 ['p', 'And there, on the door, was a hook!'],
1132 );
1133 # constructs two elements.
1134
1135 $h->objectify_text()
1136 This turns any text nodes under $h from mere text segments (strings)
1137 into real objects, pseudo-elements with a tag-name of "~text", and the
1138 actual text content in an attribute called "text". (For a discussion
1139 of pseudo-elements, see the "tag" method, far above.) This method is
1140 provided because, for some purposes, it is convenient or necessary to
1141 be able, for a given text node, to ask what element is its parent; and
1142 clearly this is not possible if a node is just a text string.
1143
1144 Note that these "~text" objects are not recognized as text nodes by
1145 methods like as_text. Presumably you will want to call
1146 $h->objectify_text, perform whatever task that you needed that for, and
1147 then call $h->deobjectify_text before calling anything like
1148 $h->as_text.
1149
1150 $h->deobjectify_text()
1151 This undoes the effect of $h->objectify_text. That is, it takes any
1152 "~text" pseudo-elements in the tree at/under $h, and deletes each one,
1153 replacing each with the content of its "text" attribute.
1154
1155 Note that if $h itself is a "~text" pseudo-element, it will be
1156 destroyed -- a condition you may need to treat specially in your
1157 calling code (since it means you can't very well do anything with $h
1158 after that). So that you can detect that condition, if $h is itself a
1159 "~text" pseudo-element, then this method returns the value of the
1160 "text" attribute, which should be a defined value; in all other cases,
1161 it returns undef.
1162
1163 (This method assumes that no "~text" pseudo-element has any children.)
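
The round trip through objectify_text and deobjectify_text could be sketched as follows, with an invented paragraph. While the text is objectified, each text node can be asked for its parent:

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['p', 'Hello, ', ['b', 'world'], '!']
);

$tree->objectify_text;

# Text segments are now "~text" pseudo-elements, findable with look_down.
my @text_nodes = $tree->look_down('_tag', '~text');
my @parents    = map { $_->parent->tag } @text_nodes;
my @texts      = map { $_->attr('text') } @text_nodes;

# Turn them back into plain strings before calling as_text.
$tree->deobjectify_text;
my $plain = $tree->as_text;

print "@parents\n";             # p b p
print join('|', @texts), "\n";  # Hello, |world|!
print "$plain\n";               # Hello, world!

$tree->delete;
```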
1164
1165 $h->number_lists()
1166 For every UL, OL, DIR, and MENU element at/under $h, this sets a
1167 "_bullet" attribute for every child LI element. For LI children of an
1168 OL, the "_bullet" attribute's value will be something like "4.", "d.",
1169 "D.", "IV.", or "iv.", depending on the OL element's "type" attribute.
1170 LI children of a UL, DIR, or MENU get their "_bullet" attribute set to
1171 "*". There should be no other LIs (i.e., except as children of OL, UL,
1172 DIR, or MENU elements), and if there are, they are unaffected.
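
A minimal sketch with one OL and one UL (both invented, and the OL left without a "type" attribute so that plain numbers are used):

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['div',
        ['ol', ['li', 'x'], ['li', 'y'], ['li', 'z']],
        ['ul', ['li', 'p'], ['li', 'q']],
    ]
);

$tree->number_lists;

# Collect the "_bullet" attribute of every LI, in document order.
my @bullets = map { $_->attr('_bullet') } $tree->look_down('_tag', 'li');

print "@bullets\n";    # 1. 2. 3. * *

$tree->delete;
```

A renderer of your own could then prefix each list item's text with its "_bullet" value.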
1173
1174 $h->has_insane_linkage
1175 This method is for testing whether this element or the elements under
1176 it have linkage attributes (_parent and _content) whose values are
1177 deeply aberrant: if there are undefs in a content list; if an element
1178 appears in the content lists of more than one element; if the _parent
1179 attribute of an element doesn't match its actual parent; or if an
1180 element appears as its own descendant (i.e., if there is a cyclicity in
1181 the tree).
1182
1183 This returns empty list (or false, in scalar context) if the subtree's
1184 linkage attributes are sane; otherwise it returns two items (or true, in
1185 scalar context): the element where the error occurred, and a string
1186 describing the error.
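
The sketch below shows the two kinds of return value. To provoke an error it reaches directly into the object's hash and wipes a "_parent" attribute, which is exactly the kind of encapsulation-breaking that normal code should never do; it is done here only to simulate a damaged tree.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(['div', ['p', 'ok']]);

# A freshly built tree: empty list.
my @before = $tree->has_insane_linkage;

# FOR DEMONSTRATION ONLY: corrupt the linkage by hand.
my $p = $tree->look_down('_tag', 'p');
$p->{'_parent'} = undef;

# Now the method reports the offending element and a description.
my @after = $tree->has_insane_linkage;

print scalar(@before), "\n";    # 0
if (@after) {
    my ($where, $why) = @after;
    print "Problem at <", $where->tag, ">: $why\n";
}
```

The exact wording of the description string is not something calling code should rely on.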
1187
1188 This method is provided mainly for debugging and troubleshooting --
1189 it should be quite impossible for any document constructed via
1190 HTML::TreeBuilder to parse into a non-sane tree (since it's not the
1191 content of the tree per se that's in question, but whether the tree in
1192 memory was properly constructed); and it should be impossible for you
1193 to produce an insane tree just thru reasonable use of normal documented
1194 structure-modifying methods. But if you're constructing your own
1195 trees, and your program is going into infinite loops during calls to
1196 traverse() or any of the secondary structural methods, as part of
1197 debugging, consider calling has_insane_linkage on the tree.
1198
1199 $h->element_class
1200 This method returns the class which will be used for new elements. It
1201 defaults to HTML::Element, but can be overridden by subclassing or
1202 esoteric means best left to those who will read the source and then
1203 not complain when those esoteric means change. (Just subclass.)
1204
1206 * If you want to free the memory associated with a tree built of
1207 HTML::Element nodes, then you will have to delete it explicitly. See
1208 the $h->delete method, above.
1209
1210 * There's almost nothing to stop you from making a "tree" with
1211 cyclicities (loops) in it, which could, for example, make the traverse
1212 method go into an infinite loop. So don't make cyclicities! (If all
1213 you're doing is parsing HTML files, and looking at the resulting trees,
1214 this will never be a problem for you.)
1215
1216 * There's no way to represent comments or processing directives in a
1217 tree with HTML::Elements. Not yet, at least.
1218
1219 * There's (currently) nothing to stop you from using an undefined value
1220 as a text segment. If you're running under "perl -w", however, this
1221 may make HTML::Element's code produce a slew of warnings.
1222
1224 You are welcome to derive subclasses from HTML::Element, but you should
1225 be aware that the code in HTML::Element makes certain assumptions about
1226 elements (and I'm using "element" to mean ONLY an object of class
1227 HTML::Element, or of a subclass of HTML::Element):
1228
1229 * The value of an element's _parent attribute must either be undef or
1230 otherwise false, or must be an element.
1231
1232 * The value of an element's _content attribute must either be undef or
1233 otherwise false, or a reference to an (unblessed) array. The array may
1234 be empty; but if it has items, they must ALL be either mere strings
1235 (text segments), or elements.
1236
1237 * The value of an element's _tag attribute should, at least, be a
1238 string of printable characters.
1239
1240 Moreover, bear these rules in mind:
1241
1242 * Do not break encapsulation on objects. That is, access their
1243 contents only thru $obj->attr or more specific methods.
1244
1245 * You should think twice before completely overriding any of the
1246 methods that HTML::Element provides. (Overriding with a method that
1247 calls the superclass method is not so bad, though.)
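
As one possible illustration of these rules, here is a sketch of a subclass that overrides as_text by wrapping the superclass method and goes through documented methods only. The class name My::Element and the whitespace-collapsing behaviour are assumptions made up for this example:

```perl
use strict;
use warnings;

package My::Element;    # hypothetical subclass name
use HTML::Element;
our @ISA = ('HTML::Element');

# Override as_text, but delegate the real work to HTML::Element.
sub as_text {
    my ($self, @args) = @_;
    my $text = $self->SUPER::as_text(@args);
    $text =~ s/\s+/ /g;      # collapse runs of whitespace
    $text =~ s/^ | $//g;     # trim the ends
    return $text;
}

package main;

# Build the element through the inherited constructor and methods.
my $p = My::Element->new('p');
$p->push_content("  Hello,\n   world  ");

my $text = $p->as_text;
print "$text\n";    # Hello, world

$p->delete;
```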
1248
1250 HTML::Tree; HTML::TreeBuilder; HTML::AsSubs; HTML::Tagset; and, for the
1251 morbidly curious, HTML::Element::traverse.
1252
1254 Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke, 2005 Andy
1255 Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn.
1256
1257 This library is free software; you can redistribute it and/or modify it
1258 under the same terms as Perl itself.
1259
1260 This program is distributed in the hope that it will be useful, but
1261 without any warranty; without even the implied warranty of
1262 merchantability or fitness for a particular purpose.
1263
1265 Currently maintained by Pete Krawczyk "<petek@cpan.org>"
1266
1267 Original authors: Gisle Aas, Sean Burke and Andy Lester.
1268
1269 Thanks to Mark-Jason Dominus for a POD suggestion.
1270
1271
1272
1273perl v5.12.2 2010-12-20 HTML::Element(3)