XML::Rules(3) User Contributed Perl Documentation XML::Rules(3)

6 XML::Rules - parse XML and specify what and how to keep/process for
7 individual tags
8
10 Version 1.16
11
13 use XML::Rules;
14
15 $xml = <<'*END*';
16 <doc>
17 <person>
18 <fname>...</fname>
19 <lname>...</lname>
20 <email>...</email>
21 <address>
22 <street>...</street>
23 <city>...</city>
24 <country>...</country>
25 <bogus>...</bogus>
26 </address>
27 <phones>
28 <phone type="home">123-456-7890</phone>
29 <phone type="office">663-486-7890</phone>
30 <phone type="fax">663-486-7000</phone>
31 </phones>
32 </person>
33 <person>
34 <fname>...</fname>
35 <lname>...</lname>
36 <email>...</email>
37 <address>
38 <street>...</street>
39 <city>...</city>
40 <country>...</country>
41 <bogus>...</bogus>
42 </address>
43 <phones>
44 <phone type="office">663-486-7891</phone>
45 </phones>
46 </person>
47 </doc>
48 *END*
49
50 @rules = (
51 _default => sub {$_[0] => $_[1]->{_content}},
52 # by default I'm only interested in the content of the tag, not the attributes
53 bogus => undef,
54 # let's ignore this tag and all inner ones as well
55 address => sub {address => "$_[1]->{street}, $_[1]->{city} ($_[1]->{country})"},
56 # merge the address into a single string
57 phone => sub {$_[1]->{type} => $_[1]->{_content}},
58 # let's use the "type" attribute as the key and the content as the value
59 phones => sub {delete $_[1]->{_content}; %{$_[1]}},
60 # remove the text content and pass along the type => content from the child nodes
person => sub { # let's print the values; all the data is readily available in the attributes
62 print "$_[1]->{lname}, $_[1]->{fname} <$_[1]->{email}>\n";
63 print "Home phone: $_[1]->{home}\n" if $_[1]->{home};
64 print "Office phone: $_[1]->{office}\n" if $_[1]->{office};
65 print "Fax: $_[1]->{fax}\n" if $_[1]->{fax};
66 print "$_[1]->{address}\n\n";
67 return; # the <person> tag is processed, no need to remember what it contained
68 },
69 );
70 $parser = XML::Rules->new(rules => \@rules);
71 $parser->parse( $xml);
72
74 There are several ways to extract data from XML. One that's often used
75 is to read the whole file and transform it into a huge maze of objects
76 and then write code like
77
78 foreach my $obj ($XML->forTheLifeOfMyMotherGiveMeTheFirstChildNamed("Peter")->pleaseBeSoKindAndGiveMeAllChildrenNamedSomethingLike("Jane")) {
79 my $obj2 = $obj->sorryToKeepBotheringButINeedTheChildNamed("Theophile");
80 my $birth = $obj2->whatsTheValueOfAttribute("BirthDate");
81 print "Theophile was born at $birth\n";
82 }
83
I'm exaggerating of course, but you probably know what I mean. You can
85 of course shorten the path and call just one method ... that is if you
86 spend the time to learn one more "cool" thing starting with X. XPath.
87
You can also use XML::Simple and generate an almost equally huge maze of
89 hashes and arrays ... which may make the code more or less complex. In
90 either case you need to have enough memory to store all that data, even
91 if you only need a piece here and there.
92
93 Another way to parse the XML is to create some subroutines that handle
94 the start and end tags and the text and whatever else may appear in the
95 XML. Some modules will let you specify just one for start tag, one for
96 text and one for end tag, others will let you install different
97 handlers for different tags. The catch is that you have to build your
98 data structures yourself, you have to know where you are, what tag is
99 just open and what is the parent and its parent etc. so that you could
100 add the attributes and especially the text to the right place. And the
handlers have to do everything as a side effect. Does anyone remember
what they say about side effects? They make the code hard to debug and
tend to turn it into a maze of interdependent snippets.
105
So what's different about the way XML::Rules works? At first glance,
not much. You can also specify subroutines to be called for the tags
encountered while parsing the XML, just like with the other event-based
XML parsers. The difference is that you do not have to rely on side
effects if all you want is to store the value of a tag. You simply
return whatever you need from the current tag and the module will add
it at the right place in the data structure it builds and will provide
it to the handlers for the parent tag. And if the parent tag returns
that data again, it will be passed to its parent and so forth, until we
get to the level at which it's convenient to handle all the data we
accumulated from the twig.
117
118 Do we want to keep just the content and access it in the parent tag
119 handler under a specific name?
120
121 foo => sub {return 'foo' => $_[1]->{_content}}
122
123 Do we want to ornament the content a bit and add it to the parent tag's
124 content?
125
126 u => sub {return '_' . $_[1]->{_content} . '_'}
127 strong => sub {return '*' . $_[1]->{_content} . '*'}
128 uc => sub {return uc($_[1]->{_content})}
129
130 Do we want to merge the attributes into a string and access the string
131 from the parent tag under a specified name?
132
133 address => sub {return 'Address' => "Street: $_[1]->{street} $_[1]->{bldngNo}\nCity: $_[1]->{city}\nCountry: $_[1]->{country}\nPostal code: $_[1]->{zip}"}
134
and in this case the $_[1]->{street} may either be an attribute of the
<address> tag or it may be the result of the handler (rule)
137
138 street => sub {return 'street' => $_[1]->{_content}}
139
140 and thus come from a child tag <street>. You may also use the rules to
141 convert codes to values
142
143 our %states = (
144 AL => 'Alabama',
145 AK => 'Alaska',
146 ...
147 );
148 ...
149 state => sub {return 'state' => $states{$_[1]->{_content}}; }
150
151 or
152
153 address => sub {
154 if (exists $_[1]->{id}) {
155 $sthFetchAddress->execute($_[1]->{id});
156 my $addr = $sthFetchAddress->fetchrow_hashref();
157 $sthFetchAddress->finish();
158 return 'address' => $addr;
159 } else {
160 return 'address' => $_[1];
161 }
162 }
163
164 so that you do not have to care whether there was
165
166 <address id="147"/>
167
168 or
169
170 <address><street>Larry Wall's St.</street><streetno>478</streetno><city>Core</city><country>The Programming Republic of Perl</country></address>
171
172 And if you do not like to end up with a datastructure of plain old
173 arrays and hashes, you can create application specific objects in the
174 rules
175
176 address => sub {
177 my $type = lc(delete $_[1]->{type});
178 $type.'Address' => MyApp::Address->new(%{$_[1]})
179 },
180 person => sub {
181 '@person' => MyApp::Person->new(
182 firstname => $_[1]->{fname},
183 lastname => $_[1]->{lname},
184 deliveryAddress => $_[1]->{deliveryAddress},
185 billingAddress => $_[1]->{billingAddress},
186 phone => $_[1]->{phone},
187 )
188 }
189
190 At each level in the tree structure serialized as XML you can decide
191 what to keep, what to throw away, what to transform and then just
192 return the stuff you care about and it will be available to the handler
193 at the next level.
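The flow described above can be sketched with a tiny document. This is only an illustration: the tag names, the sample data and the @people array are invented for the example, and every tag is given an explicit rule so the sketch does not depend on any default behaviour.

```perl
use strict;
use warnings;
use XML::Rules;

my @people;    # hypothetical collector filled by the <person> rule

my $parser = XML::Rules->new(
    rules => [
        # leaf tags: pass the text content up under the tag's own name
        'fname,lname' => sub { $_[0] => $_[1]->{_content} },
        # by now the parent's hash holds whatever the child rules returned
        person => sub {
            push @people, "$_[1]->{lname}, $_[1]->{fname}";
            return;    # nothing needs to travel further up
        },
        # root tag: the data was already handled one level below
        doc => sub { return },
    ],
);

$parser->parse('<doc><person><fname>Jane</fname><lname>Doe</lname></person></doc>');
print "$_\n" for @people;
```

The <person> rule never walks a tree; it simply reads the keys that the rules of its child tags returned.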
194
196 my $parser = XML::Rules->new(
197 rules => \@rules,
198 [ start_rules => \@start_rules, ]
199 [ stripspaces => 0 / 1 / 2 / 3 + 0 / 4 + 0 / 8, ]
200 [ normalisespaces => 0 / 1, ]
201 [ style => 'parser' / 'filter', ]
202 [ ident => ' ', [reformat_all => 0 / 1] ],
203 [ encode => 'encoding specification', ]
204 [ output_encoding => 'encoding specification', ]
205 [ namespaces => \%namespace2alias_mapping, ]
206 [ handlers => \%additional_expat_handlers, ]
# and optionally parameters passed to XML::Parser::Expat
208 );
209
210 Options passed to XML::Parser::Expat : ProtocolEncoding Namespaces
211 NoExpand Stream_Delimiter ErrorContext ParseParamEnt Base
212
The "stripspaces" option controls the handling of whitespace. Please see
"Whitespace handling" below.
215
The "style" specifies whether you want to build a parser used to
extract stuff from the XML or to filter/modify the XML. If you specify
style => 'filter' then all tags for which you do not specify a
subroutine rule, and which do not occur inside a tag that has one, are
copied to the output filehandle passed to the ->filter() or
->filterfile() methods.

The "ident" specifies what character(s) to use to indent the tags when
filtering; by default the tags are not formatted in any way. If
"reformat_all" is not set then this affects only the tags that have a
rule and their subtags. And in the case of subtags, only those that were
added into the attribute hash by their rules, not those left in the
_content array!
228
The "warnoverwrite" instructs XML::Rules to issue a warning whenever
a rule causes a key in a tag's hash to be overwritten by new data
produced by the rule of a subtag. This happens e.g. if a tag is repeated
and its rule doesn't expect it.
233
234 The "encode" allows you to ask the module to run all data through
235 Encode::encode( 'encoding_specification', ...) before being passed to
236 the rules. Otherwise all data comes as UTF8.
237
238 The "output_encoding" on the other hand specifies in what encoding is
239 the resulting data going to be, the default is again UTF8. This means
240 that if you specify
241
242 encode => 'windows-1250',
243 output_encoding => 'utf8',
244
245 and the XML is in ISO-8859-2 (Latin2) then the filter will 1) convert
246 the content and attributes of the tags you are not interested in from
247 Latin2 directly to utf8 and output and 2) convert the content and
248 attributes of the tags you want to process from Latin2 to Windows-1250,
249 let you mangle the data and then convert the results to utf8 for the
250 output.
251
The "encode" and "output_encoding" options also affect
"$parser->toXML(...)"; if they are different then the data are
converted from one encoding to the other.
255
The "handlers" allow you to set additional handlers for
XML::Parser::Expat->setHandlers. Your Start, End, Char and XMLDecl
handlers are evaluated before the ones installed by XML::Rules and may
modify the values in @_, but you should be very careful with that.
Consider that experimental, and if you do make it work the way you
need, please let me know so that I know what it was good for and can
make sure it doesn't break in a new version.
263
264 The Rules
265 The rules option may be either an arrayref or a hashref, the module
266 doesn't care, but if you want to use regexps to specify the groups of
267 tags to be handled by the same rule you should use the array ref. The
268 rules array/hash is made of pairs in form
269
270 tagspecification => action
271
where the tagspecification may be either the name of a tag, a string
containing a comma or pipe ( "|" ) delimited list of tag names, a
string containing a regexp enclosed in // optionally followed by
regular expression modifiers, or a qr// compiled regular expression.
The tag names and tag name lists take precedence over the regexps; the
regexps are (in the case of arrayrefs only!!!) tested in the order in
which they are specified.
279
These rules are evaluated/executed whenever a tag is fully parsed,
including all the content and child tags, and they may access the
content and attributes of the specified tag plus the stuff produced by
the rules evaluated for the child tags.
284
285 The action may be either
286
287 - an undef or empty string = ignore the tag and all its children
288 - a subroutine reference = the subroutine will be called to handle the tag data&contents
sub { my ($tagname, $attrHash, $contextArray, $parentDataArray, $parser) = @_; ...}
290 - one of the built in rules below
291
292 Custom rules
293
294 The subroutines in the rules specification receive five parameters:
295
296 $rule->( $tag_name, \%attrs, \@context, \@parent_data, $parser)
297
298 It's OK to destroy the first two parameters, but you should treat the
299 other three as read only or at least treat them with care!
300
301 $tag_name = string containing the tag name
302 \%attrs = hash containing the attributes of the tag plus the _content key
303 containing the text content of the tag. If it's not a leaf tag it may
304 also contain the data returned by the rules invoked for the child tags.
305 \@context = an array containing the names of the tags enclosing the current
306 one. The parent tag name is the last element of the array. (READONLY!)
307 \@parent_data = an array containing the hashes with the attributes
308 and content read&produced for the enclosing tags so far.
309 You may need to access this for example to find out the version
310 of the format specified as an attribute of the root tag. You may
311 safely add, change or delete attributes in the hashes, but all bets
312 are off if you change the number or type of elements of this array!
313 $parser = the parser object
314 you may use $parser->{pad} or $parser->{parameters} to store any data
315 you need. The first is never touched by XML::Rules, the second is set to
316 the last argument of parse() or filter() methods and reset to undef
317 before those methods exit.
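As a small illustration of the fifth parameter, the sketch below reads a value from {parameters} and collects results in {pad}. The "prefix" key, the "labels" key and the tag names are assumptions made up for this example; {pad} is initialised by hand before parsing so the rule does not rely on its initial state.

```perl
use strict;
use warnings;
use XML::Rules;

my $parser = XML::Rules->new(
    rules => [
        item => sub {
            my ($tag, $attrs, $context, $parent_data, $self) = @_;
            # {parameters} holds the extra argument given to parse()
            my $label = $self->{parameters}{prefix} . $attrs->{_content};
            # {pad} is scratch space owned by the caller
            push @{ $self->{pad}{labels} }, $label;
            return;
        },
        list => sub { return },    # root tag: nothing to return
    ],
);

$parser->{pad} = { labels => [] };
$parser->parse( '<list><item>a</item><item>b</item></list>', { prefix => 'item-' } );

my @labels = @{ $parser->{pad}{labels} };
print "@labels\n";
```

Because {pad} survives the call to parse(), it is a convenient place for results that the rules consume instead of returning.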
318
319 The subroutine may decide to handle the data and return nothing or
320 tweak the data as necessary and return just the relevant bits. It may
321 also load more information from elsewhere based on the ids found in the
322 XML and provide it to the rules of the ancestor tags as if it was part
323 of the XML.
324
325 The possible return values of the subroutines are:
326
327 1) nothing or undef or "" - nothing gets added to the parent tag's hash
328
329 2) a single string - if the parent's _content is a string then the one
330 produced by this rule is appended to the parent's _content. If the
331 parent's _content is an array, then the string is push()ed to the
332 array.
333
3) a single reference - if the parent's _content is a string then it's
changed to an array containing the original string and this reference.
If the parent's _content is an array, then the reference is push()ed to
the array.
338
339 4) an even numbered list - it's a list of key & value pairs to be added
340 to the parent's hash.
341
342 The handling of the attributes may be changed by adding '@', '%', '+',
343 '*' or '.' before the attribute name.
344
345 Without any "sigil" the key & value is added to the hash overwriting
346 any previous values.
347
348 The values for the keys starting with '@' are push()ed to the arrays
349 referenced by the key name without the @. If there already is an
350 attribute of the same name then the value will be preserved and will
351 become the first element in the array.
352
353 The values for the keys starting with '%' have to be either hash or
354 array references. The key&value pairs in the referenced hash or array
355 will be added to the hash referenced by the key. This is nice for rows
356 of tags like this:
357
358 <field name="foo" value="12"/>
359 <field name="bar" value="24"/>
360
361 if you specify the rule as
362
363 field => sub { '%fields' => [$_[1]->{name} => $_[1]->{value}]}
364
then the parent tag's hash will contain
366
367 fields => {
368 foo => 12,
369 bar => 24,
370 }
371
372 The values for the keys starting with '+' are added to the current
373 value, the ones starting with '.' are appended to the current value and
374 the ones starting with '*' multiply the current value.
375
376 5) an odd numbered list - the last element is appended or push()ed to
377 the parent's _content, the rest is handled as in the previous case.
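The key prefixes from case 4 can be tried out with a short sketch. The tag names, the sample values and the $record variable are invented for the illustration; the rule for the parent tag just keeps a reference to its own hash so the result can be inspected afterwards.

```perl
use strict;
use warnings;
use XML::Rules;

my $record;    # hypothetical variable capturing the parent's hash

my $parser = XML::Rules->new(
    rules => [
        # '@': push the value onto an array
        phone  => sub { '@phones' => $_[1]->{_content} },
        # '%': merge the name/value pair into a hash
        field  => sub { '%fields' => [ $_[1]->{name} => $_[1]->{value} ] },
        # '+': add the value to a running sum
        amount => sub { '+total' => $_[1]->{_content} },
        record => sub { $record = $_[1]; return },
    ],
);

$parser->parse(
    '<record>'
  . '<phone>111</phone><phone>222</phone>'
  . '<field name="foo" value="12"/><field name="bar" value="24"/>'
  . '<amount>3</amount><amount>4</amount>'
  . '</record>'
);

print "phones: @{ $record->{phones} }\n";
print "foo: $record->{fields}{foo}, total: $record->{total}\n";
```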
378
379 Builtin rules
380
381 'content' = only the content of the tag is preserved and added to
382 the parent tag's hash as an attribute named after the tag. Equivalent to:
383 sub { $_[0] => $_[1]->{_content}}
384 'content trim' = only the content of the tag is preserved, trimmed and added to
385 the parent tag's hash as an attribute named after the tag
386 sub { s/^\s+//,s/\s+$// for ($_[1]->{_content}); $_[0] => $_[1]->{_content}}
387 'content array' = only the content of the tag is preserved and pushed
388 to the array pointed to by the attribute
389 sub { '@' . $_[0] => $_[1]->{_content}}
390 'as is' = the tag's hash is added to the parent tag's hash
391 as an attribute named after the tag
392 sub { $_[0] => $_[1]}
393 'as is trim' = the tag's hash is added to the parent tag's hash
394 as an attribute named after the tag, the content is trimmed
395 sub { $_[0] => $_[1]}
396 'as array' = the tag's hash is pushed to the attribute named after the tag
397 in the parent tag's hash
398 sub { '@'.$_[0] => $_[1]}
399 'as array trim' = the tag's hash is pushed to the attribute named after the tag
400 in the parent tag's hash, the content is trimmed
401 sub { '@'.$_[0] => $_[1]}
402 'no content' = the _content is removed from the tag's hash and the hash
403 is added to the parent's hash into the attribute named after the tag
404 sub { delete $_[1]->{_content}; $_[0] => $_[1]}
405 'no content array' = similar to 'no content' except the hash is pushed
406 into the array referenced by the attribute
407 'as array no content' = same as 'no content array'
408 'pass' = the tag's hash is dissolved into the parent's hash,
409 that is all tag's attributes become the parent's attributes.
410 The _content is appended to the parent's _content.
411 sub { %{$_[1]}}
412 'pass no content' = the _content is removed and the hash is dissolved
413 into the parent's hash.
414 sub { delete $_[1]->{_content}; %{$_[1]}}
415 'pass without content' = same as 'pass no content'
416 'raw' = the [tagname => attrs] is pushed to the parent tag's _content.
417 You would use this style if you wanted to be able to print
418 the parent tag as XML preserving the whitespace or other textual content
419 sub { [$_[0] => $_[1]]}
420 'raw extended' = the [tagname => attrs] is pushed to the parent tag's _content
421 and the attrs are added to the parent's attribute hash with ":$tagname" as the key
422 sub { (':'.$_[0] => $_[1], [$_[0] => $_[1]])};
423 'raw extended array' = the [tagname => attrs] is pushed to the parent tag's _content
424 and the attrs are pushed to the parent's attribute hash with ":$tagname" as the key
425 sub { ('@:'.$_[0] => $_[1], [$_[0] => $_[1]])};
426 'by <attrname>' = uses the value of the specified attribute as the key when adding the
427 attribute hash into the parent tag's hash. You can specify more names, in that case
428 the first found is used.
429 sub {delete($_[1]->{name}) => $_[1]}
430 'content by <attrname>' = uses the value of the specified attribute as the key when adding the
431 tags content into the parent tag's hash. You can specify more names, in that case
432 the first found is used.
433 sub {$_[1]->{name} => $_[1]->{_content}}
434 'no content by <attrname>' = uses the value of the specified attribute as the key when adding the
435 attribute hash into the parent tag's hash. The content is dropped. You can specify more names,
436 in that case the first found is used.
437 sub {delete($_[1]->{_content}); delete($_[1]->{name}) => $_[1]}
438 '==...' = replace the tag by the specified string. That is the string will be added to
439 the parent tag's _content
440 sub { return '...' }
441 '=...' = replace the tag contents by the specified string and forget the attributes.
442 sub { return $_[0] => '...' }
443 '' = forget the tag's contents (after processing the rules for subtags)
444 sub { return };
445
446 I include the unnamed subroutines that would be equivalent to the
447 builtin rule in case you need to add some tests and then behave as if
448 one of the builtins was used.
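A minimal sketch mixing two of the builtin rules above with one custom rule may help here. The document and the $doc variable are made up for the example; the rule for the root tag only captures its hash so that the structure built by the builtins can be inspected.

```perl
use strict;
use warnings;
use XML::Rules;

my $doc;    # hypothetical variable capturing the root tag's hash

my $parser = XML::Rules->new(
    rules => [
        'fname,lname' => 'content',     # text stored under the tag's name
        person        => 'as array',    # each <person> hash pushed onto an array
        doc           => sub { $doc = $_[1]; return },
    ],
);

$parser->parse(
    '<doc>'
  . '<person><fname>Jane</fname><lname>Doe</lname></person>'
  . '<person><fname>Rick</fname><lname>Roe</lname></person>'
  . '</doc>'
);

print "$_->{fname} $_->{lname}\n" for @{ $doc->{person} };
```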
449
450 Builtin rule modifiers
451
452 You can add these modifiers to most rules, just add them to the string
453 literal, at the end, separated from the base rule by a space.
454
455 no xmlns = strip the namespace alias from the $_[0] (tag name)
456 remove(list,of,attributes) = remove all specified attributes (or keys produced by child tag rules) from the tag data
457 only(list,of,attributes) = filter the hash of attributes and keys+values produced by child tag rules in the tag data
458 to only include those specified here. In case you need to include the tag content do not forget to include
459 _content in the list!
460
461 Not all modifiers make sense for all rules. For example if the rule is
462 'content', it's pointless to filter the attributes, because the only
463 one used will be the content anyway.
464
465 The behaviour of the combination of the 'raw...' rules and the rule
466 modifiers is UNDEFINED!
467
468 Different rules for different paths to tags
469
470 Since 0.19 it's possible to specify several actions for a tag if you
471 need to do something different based on the path to the tag like this:
472
473 tagname => [
474 'tag/path' => action,
475 '/root/tag/path' => action,
476 '/root/*/path' => action,
477 qr{^root/ns:[^/]+/par$} => action,
478 default_action
479 ],
480
481 The path is matched against the list of parent tags joined by slashes.
482
If you need to use more complex conditions to select the actions you
have to use a single subroutine rule and implement the conditions
within that subroutine. You have access both to the list of enclosing
tags and to their attribute hashes (including the data obtained from
the rules of the already closed subtags of the enclosing tags).
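The single-subroutine approach could look roughly like the sketch below, which picks a behaviour based on the name of the parent tag taken from the context array. The tag names and the %seen hash are assumptions made up for this illustration.

```perl
use strict;
use warnings;
use XML::Rules;

my %seen;    # hypothetical collector keyed by the parent's tag name

my $parser = XML::Rules->new(
    rules => [
        name => sub {
            my ($tag, $attrs, $context) = @_;
            my $parent = $context->[-1];    # the immediately enclosing tag
            push @{ $seen{$parent} }, $attrs->{_content};
            return;
        },
        # the wrappers carry no data of their own in this sketch
        'doc,person,company' => sub { return },
    ],
);

$parser->parse(
    '<doc>'
  . '<person><name>Jane</name></person>'
  . '<company><name>Acme</name></company>'
  . '</doc>'
);

print "$_: @{ $seen{$_} }\n" for sort keys %seen;
```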
488
489 The Start Rules
Apart from the normal rules that get invoked once the tag is fully
parsed, including the contents and child tags, you may want to attach
some code to the start tag to (optionally) skip whole branches of XML or
set up attributes and variables. You may set up the start rules either
in a separate parameter to the constructor or in the rules=> by
prepending ^ to the tag name(s).
496
497 These rules are in form
498
499 tagspecification => undef / '' / 'skip' --> skip the element, including child tags
500 tagspecification => 1 / 'handle' --> handle the element, may be needed
501 if you specify the _default rule.
502 tagspecification => \&subroutine
503
504 The subroutines receive the same parameters as for the "end tag" rules
505 except of course the _content, but their return value is treated
506 differently. If the subroutine returns a false value then the whole
507 branch enclosed by the current tag is skipped, no data are stored and
508 no rules are executed. You may modify the hash referenced by $attr.
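As an illustration of skipping a branch, the following sketch uses a start rule that returns a false value for articles marked as drafts. The "status" attribute, the tag names and the @titles array are invented for the example.

```perl
use strict;
use warnings;
use XML::Rules;

my @titles;    # hypothetical collector for the articles that were not skipped

my $parser = XML::Rules->new(
    start_rules => [
        # a false return value skips the whole <article> branch
        article => sub { ( $_[1]->{status} || '' ) ne 'draft' },
    ],
    rules => [
        title   => 'content',
        article => sub { push @titles, $_[1]->{title}; return },
        feed    => sub { return },    # root tag: nothing to return
    ],
);

$parser->parse(
    '<feed>'
  . '<article status="draft"><title>Old</title></article>'
  . '<article status="live"><title>New</title></article>'
  . '</feed>'
);

print "@titles\n";
```

The start rule sees only the attributes of the opening tag, so the decision to skip has to be based on them or on data stored earlier.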
509
You may even tie() the hash referenced by $attr, for example in case
you want to store the parsed data in a DBM::Deep. In such a case all
the data returned by the immediate subtags of this tag will be stored
in the DBM::Deep. Make sure you do not overwrite the data by data from
another occurrence of the same tag if you return $_[1]/$attr from the
rule!
516
517 YourHugeTag => sub {
518 my %temp = %{$_[1]};
519 tie %{$_[1]}, 'DBM::Deep', $filename;
520 %{$_[1]} = %temp;
521 1;
522 }
523
524 Both types of rules are free to store any data they want in
525 $parser->{pad}. This property is NOT emptied after the parsing!
526
527 Whitespace handling
528 There are two options that affect the whitespace handling: stripspaces
529 and normalisespaces. The normalisespaces is a simple flag that controls
530 whether multiple spaces/tabs/newlines are collapsed into a single space
531 or not. The stripspaces is more complex, it's a bit-mask, an ORed
532 combination of the following options:
533
534 0 - don't remove whitespace around tags
535 (around tags means before the opening tag and after the closing tag, not in the tag's content!)
536 1 - remove whitespace before tags whose rules did not return any text content
537 (the rule specified for the tag caused the data of the tag to be ignored,
538 processed them already or added them as attributes to parent's \%attr)
539 2 - remove whitespace around tags whose rules did not return any text content
540 3 - remove whitespace around all tags
541
542 0 - remove only whitespace-only content
543 (that is remove the whitespace around <foo/> in this case "<bar> <foo/> </bar>"
544 but not this one "<bar>blah <foo/> blah</bar>")
545 4 - remove trailing/leading whitespace
546 (remove the whitespace in both cases above)
547
548 0 - don't trim content
549 8 - do trim content
550 (That is for "<foo> blah </foo>" only pass to the rule {_content => 'blah'})
551
552 That is if you have a data oriented XML in which each tag contains
553 either text content or subtags, but not both, you want to use
554 stripspaces => 3 or stripspaces => 3|4. This will not only make sure
555 you don't need to bother with the whitespace-only _content of the tags
556 with subtags, but will also make sure you do not keep on wasting memory
557 while parsing a huge XML and processing the "twigs". Without that
558 option the parent tag of the repeated tag would keep on accumulating
559 unneeded whitespace in its _content.
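To see one of the bits of the mask in action, the sketch below parses the "<foo> blah </foo>" snippet from the list above twice, once with bit 8 set and once with no whitespace handling. The content_with() helper is invented for this example.

```perl
use strict;
use warnings;
use XML::Rules;

# Parse the same snippet with the given stripspaces value and return the
# _content that the <foo> rule received (helper made up for this sketch).
sub content_with {
    my ($stripspaces) = @_;
    my $content;
    my $parser = XML::Rules->new(
        stripspaces => $stripspaces,
        rules       => [ foo => sub { $content = $_[1]->{_content}; return } ],
    );
    $parser->parse('<foo> blah </foo>');
    return $content;
}

my $trimmed = content_with(8);    # bit 8: trim the content
my $raw     = content_with(0);    # no whitespace handling requested
printf "[%s] [%s]\n", $raw, $trimmed;
```

The bits can be combined with |, so 3|4|8 would strip the whitespace around tags and trim the content at the same time.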
560
562 parse
563 $parser->parse( $string [, $parameters]);
564 $parser->parse( $IOhandle [, $parameters]);
565
566 Parses the XML in the string or reads and parses the XML from the
567 opened IO handle, executes the rules as it encounters the closing tags
568 and returns the resulting structure.
569
570 The scalar or reference passed as the second parameter to the parse()
571 method is assigned to $parser->{parameters} for the parsing of the file
572 or string. Once the XML is parsed the key is deleted. This means that
573 the $parser does not retain a reference to the $parameters after the
574 parsing.
575
576 parsestring
577 $parser->parsestring( $string [, $parameters]);
578
579 Just an alias to ->parse().
580
581 parse_string
582 $parser->parse_string( $string [, $parameters]);
583
584 Just an alias to ->parse().
585
586 parsefile
587 $parser->parsefile( $filename [, $parameters]);
588
589 Opens the specified file and parses the XML and executes the rules as
590 it encounters the closing tags and returns the resulting structure.
591
592 parse_file
593 $parser->parse_file( $filename [, $parameters]);
594
595 Just an alias to ->parsefile().
596
597 parse_chunk
598 while (my $chunk = read_chunk_of_data()) {
599 $parser->parse_chunk($chunk);
600 }
601 my $data = $parser->last_chunk();
602
603 This method allows you to process the XML in chunks as you receive
604 them. The chunks do not need to be in any way valid ... it's fine if
605 the chunk ends in the middle of a tag or attribute.
606
607 If you need to set the $parser->{parameters}, pass it to the first call
608 to parse_chunk() the same way you would to parse(). The first chunk
609 may be empty so if you need to set up the parameters, but read the
610 chunks in a loop or in a callback, you can do this:
611
612 $parser->parse_chunk('', {foo => 15, bar => "Hello World!"});
613 while (my $chunk = read_chunk_of_data()) {
614 $parser->parse_chunk($chunk);
615 }
616 my $data = $parser->last_chunk();
617
618 or
619
620 $parser->parse_chunk('', {foo => 15, bar => "Hello World!"});
621 $ua->get($url, ':content_cb' => sub { my($data, $response, $protocol) = @_; $parser->parse_chunk($data); return 1 });
622 my $data = $parser->last_chunk();
623
The parse_chunk() returns 1 or dies; to get the accumulated data, you
need to call last_chunk(). You will want to either aggressively trim the
data remembered or handle parts of the file using custom rules as the
XML is being parsed.
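A self-contained variant of the loops above is sketched below. Instead of the read_chunk_of_data() placeholder it cuts a string into 7-character pieces, so chunks end in the middle of tags and text. The tag names and the @names array are made up for the example.

```perl
use strict;
use warnings;
use XML::Rules;

my @names;    # hypothetical collector

my $parser = XML::Rules->new(
    rules => [
        name => sub { push @names, $_[1]->{_content}; return },
        list => sub { return },    # root tag: nothing to return
    ],
);

my $xml = '<list><name>alpha</name><name>beta</name></list>';

# The 4-argument substr() removes each piece from $xml as it is fed in.
while ( length $xml ) {
    $parser->parse_chunk( substr( $xml, 0, 7, '' ) );
}
$parser->last_chunk();

print "@names\n";
```

The rule for <name> handles each item as soon as its closing tag arrives, which is the "handle parts of the file as it is being parsed" strategy mentioned above.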
628
629 filter
630 $parser->filter( $string);
631 $parser->filter( $string, $LexicalOutputIOhandle [, $parameters]);
632 $parser->filter( $LexicalInputIOhandle, $LexicalOutputIOhandle [, $parameters]);
633 $parser->filter( $string, \*OutputIOhandle [, $parameters]);
634 $parser->filter( $LexicalInputIOhandle, \*OutputIOhandle [, $parameters]);
635 $parser->filter( $string, $OutputFilename [, $parameters]);
636 $parser->filter( $LexicalInputIOhandle, $OutputFilename [, $parameters]);
637 $parser->filter( $string, $StringReference [, $parameters]);
638 $parser->filter( $LexicalInputIOhandle, $StringReference [, $parameters]);
639
Parses the XML in the string or reads and parses the XML from the
opened IO handle, copies the tags that do not have a subroutine rule
specified and do not occur under such a tag, executes the specified
rules and prints the results to the select()ed filehandle,
$OutputFilename or $OutputIOhandle or stores them in the scalar
referenced by $StringReference using the ->ToXML() method.
646
647 The scalar or reference passed as the third parameter to the filter()
648 method is assigned to $parser->{parameters} for the parsing of the file
649 or string. Once the XML is parsed the key is deleted. This means that
650 the $parser does not retain a reference to the $parameters after the
651 parsing.
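A minimal filtering sketch, writing the result into a string reference, might look as follows. It assumes that returning the tag name and its hash from a rule writes the (modified) tag back to the output; the tag names and the doubling of the price are invented for the illustration.

```perl
use strict;
use warnings;
use XML::Rules;

my $parser = XML::Rules->new(
    style => 'filter',
    rules => [
        # only <price> has a subroutine rule; other tags are copied through
        price => sub {
            my ($tag, $attrs) = @_;
            $attrs->{_content} *= 2;    # double the price
            return $tag => $attrs;      # assumed idiom for writing the tag back out
        },
    ],
);

my $out = '';
$parser->filter( '<shop><item><price>10</price></item></shop>', \$out );
print "$out\n";
```

Only the data of the <price> tags is held in memory while its rule runs; the surrounding tags are streamed to the output.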
652
653 filterstring
654 $parser->filterstring( ...);
655
656 Just an alias to ->filter().
657
658 filter_string
659 $parser->filter_string( ...);
660
661 Just an alias to ->filter().
662
663 filterfile
664 $parser->filterfile( $filename);
665 $parser->filterfile( $filename, $LexicalOutputIOhandle [, $parameters]);
666 $parser->filterfile( $filename, \*OutputIOhandle [, $parameters]);
667 $parser->filterfile( $filename, $OutputFilename [, $parameters]);
668
Parses the XML in the specified file, copies the tags that do not have
a subroutine rule specified and do not occur under such a tag,
executes the specified rules and prints the results to the select()ed
filehandle, $OutputFilename or $OutputIOhandle or stores them in the
scalar referenced by $StringReference.
674
The scalar or reference passed as the third parameter to the filterfile()
676 method is assigned to $parser->{parameters} for the parsing of the file
677 or string. Once the XML is parsed the key is deleted. This means that
678 the $parser does not retain a reference to the $parameters after the
679 parsing.
680
681 filter_file
682 Just an alias to ->filterfile().
683
684 filter_chunk
685 while (my $chunk = read_chunk_of_data()) {
686 $parser->filter_chunk($chunk);
687 }
688 $parser->last_chunk();
689
690 This method allows you to process the XML in chunks as you receive
691 them. The chunks do not need to be in any way valid ... it's fine if
692 the chunk ends in the middle of a tag or attribute.
693
694 If you need to set the file to store the result to (default is the
695 select()ed filehandle) or set the $parser->{parameters}, pass it to the
696 first call to filter_chunk() the same way you would to filter(). The
697 first chunk may be empty so if you need to set up the parameters, but
698 read the chunks in a loop or in a callback, you can do this:
699
700 $parser->filter_chunk('', "the-filtered.xml", {foo => 15, bar => "Hello World!"});
701 while (my $chunk = read_chunk_of_data()) {
702 $parser->filter_chunk($chunk);
703 }
704 $parser->last_chunk();
705
706 or
707
708 $parser->filter_chunk('', "the_filtered.xml", {foo => 15, bar => "Hello World!"});
709 $ua->get($url, ':content_cb' => sub { my($data, $response, $protocol) = @_; $parser->filter_chunk($data); return 1 });
$parser->last_chunk();
711
The filter_chunk() returns 1 or dies; you need to call last_chunk() to
signal the end of the data, close the filehandles and clean up the
parser status. Make sure you do not set a rule for the root tag or
another tag containing way too much data. Keep in mind that even if the
parser works as a filter, the data for a custom rule must be kept in
memory for the rule to execute!
718
719 last_chunk
720 my $data = $parser->last_chunk();
721 my $data = $parser->last_chunk($the_last_chunk_contents);
722
Finishes the processing of an XML document fed to the parser in chunks. In case
724 of the parser style, returns the accumulated data. In case of the
725 filter style, flushes and closes the output file. You can pass the last
726 piece of the XML to this method or call it without parameters if all
727 the data was passed to parse_chunk()/filter_chunk().
728
729 You HAVE to execute this method after call(s) to parse_chunk() or
730 filter_chunk()! Until you do, the parser will refuse to process full
731 documents and expect another call to parse_chunk()/filter_chunk()!
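
As a minimal sketch of the parser-style counterpart, the following feeds
a document to parse_chunk() in pieces and collects the result from
last_chunk(). The chunk contents, the rule set and the variable names
are illustrative assumptions, not part of the module; the chunks stand
in for data read from a socket or a file.

```perl
use strict;
use warnings;
use XML::Rules;

my $parser = XML::Rules->new(
    rules => {
        _default => 'content',     # keep only the text of the leaf tags
        person   => 'no content',  # keep the child data, drop stray text
    },
);

# Hypothetical chunks; the splits deliberately fall in the middle of a tag.
my @chunks = (
    '<person><fna',
    'me>Jane</fname><lname>Doe</lna',
    'me></person>',
);

$parser->parse_chunk($_) for @chunks;

# last_chunk() must be called to finish the document and fetch the data.
my $data = $parser->last_chunk();

print "$data->{person}{lname}, $data->{person}{fname}\n";
```

After last_chunk() returns, the same $parser object can be used for
another document.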
732
733 escape_value
734 $parser->escape_value( $data [, $numericescape])
735
This method escapes the $data for inclusion in XML. The $numericescape
parameter may be 0, 1 or 2 and controls whether 'high' (non-ASCII)
characters are converted to numeric XML entities.
739
740 0 - default: no numeric escaping (OK if you're writing out UTF8)
741
742 1 - only characters above 0xFF are escaped (ie: characters in the
743 0x80-FF range are not escaped), possibly useful with ISO8859-1 output
744
745 2 - all characters above 0x7F are escaped (good for plain ASCII output)
746
747 You can also specify the default value in the constructor
748
749 my $parser = XML::Rules->new(
750 ...
751 NumericEscape => 2,
752 );
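
A small sketch of calling escape_value() on text that contains markup
characters. The sample string and the surrounding <title> tag are
assumptions made for illustration; only the default level (no numeric
escaping) is exercised here.

```perl
use strict;
use warnings;
use XML::Rules;

# The rules option is required by the constructor, but may be empty
# when only the helper methods are needed.
my $parser = XML::Rules->new(rules => {});

my $text    = 'Tom & Jerry <cartoon>';
my $escaped = $parser->escape_value($text);

# The escaped value can be placed inside hand-built XML.
print "<title>$escaped</title>\n";
```

Passing 1 or 2 as the second argument overrides the constructor default
for that single call.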
753
754 toXML / ToXML
755 $xml = $parser->toXML( $tagname, \%attrs[, $do_not_close, $ident, $base])
756
757 You may use this method to convert the datastructures created by
758 parsing the XML into the XML format. Not all data structures may be
759 printed! I'll add more docs later, for now please do experiment.
760
The $ident and $base, if defined, turn on and control the
pretty-printing. The $ident specifies the character(s) used for one
level of indentation, the $base contains the indentation of the current
tag. That is, if you want to include the data inside of
765
766 <data>
767 <some>
768 <subtag>$here</subtag>
769 </some>
770 </data>
771
772 you will call
773
774 $parser->toXML( $tagname, \%attrs, 0, "\t", "\t\t\t");
775
The method does NOT validate that the $ident and $base are whitespace
only, but of course if they are not you end up with invalid XML.
Newlines are added only before the start tag and (if the tag has only
child tags and no content) before the closing tag, but not after the
closing tag! Newlines are added even if the $ident is an empty string.
781
782 parentsToXML
783 $xml = $parser->parentsToXML( [$level])
784
785 Prints all or only the topmost $level ancestor tags, including the
786 attributes and content (parsed so far), but without the closing tags.
787 You may use this to print the header of the file you are parsing,
788 followed by calling toXML() on a structure you build and then by
789 closeParentsToXML() to close the tags left opened by parentsToXML().
790 You most likely want to use the style => 'filter' option for the
791 constructor instead.
792
793 closeParentsToXML
794 $xml = $parser->closeParentsToXML( [$level])
795
796 Prints the closing tags for all or the topmost $level ancestor tags of
797 the one currently processed.
798
799 paths2rules
800 my $parser = XML::Rules->new(
801 rules => paths2rules {
802 '/root/subtag/tag' => sub { ...},
803 '/root/othertag/tag' => sub {...},
804 'tag' => sub{ ... the default code for this tag ...},
805 ...
806 }
807 );
808
809 This helper function converts a hash of "somewhat xpath-like" paths and
810 subs/rules into the format required by the module. Due to backwards
811 compatibility and efficiency I can't directly support paths in the
812 rules and the direct syntax for their specification is a bit awkward.
813 So if you need the paths and not the regexps, you may use this helper
814 instead of:
815
816 my $parser = XML::Rules->new(
817 rules => {
818 'tag' => [
819 '/root/subtag' => sub { ...},
820 '/root/othertag' => sub {...},
821 sub{ ... the default code for this tag ...},
822 ],
823 ...
824 }
825 );
826
827 return_nothing
828 Stop parsing the XML, forget any data we already have and return from
829 the $parser->parse(). This is only supposed to be used within rules
830 and may be called both as a method and as an ordinary function (it's
831 not exported).
832
833 return_this
834 Stop parsing the XML, forget any data we already have and return the
835 attributes passed to this subroutine from the $parser->parse(). This is
836 only supposed to be used within rules and may be called both as a
837 method and as an ordinary function (it's not exported).
838
839 skip_rest
840 Stop parsing the XML and return whatever data we already have from the
841 $parser->parse(). The rules for the currently opened tags are
842 evaluated as if the XML contained all the closing tags in the right
843 order.
844
These three work by raising an exception. The exception is caught
within the $parser->parse() and does not propagate outside. It is also
safe to raise any other exception within the rules: the exception will
be caught as well, the internal state of the $parser object will be
cleaned up and the exception rethrown.
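
The early-exit helpers can be sketched as follows. This example uses
return_this() to stop at the first interesting tag; the sample XML, the
id test and the variable names are assumptions for illustration. The
fifth argument passed to a rule is the parser object, which is used
here to call the method.

```perl
use strict;
use warnings;
use XML::Rules;

my $xml = '<list>'
        . '<item id="1">apple</item>'
        . '<item id="2">pear</item>'
        . '<item id="3">plum</item>'
        . '</list>';

my $parser = XML::Rules->new(
    rules => {
        item => sub {
            my ($tag, $attrs, $context, $parent_data, $p) = @_;
            # Stop parsing and hand this item's text back to the caller.
            $p->return_this($attrs->{_content}) if $attrs->{id} == 2;
            return;    # other items are not remembered
        },
    },
);

# parse() returns as soon as return_this() is called; the third
# <item> is never processed.
my $found = $parser->parse($xml);

print "Found: $found\n";
```

A single value is passed to return_this() here so that the result is
the same in scalar and list context.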
850
852 parse
853 When called as a class method, parse() accepts the same parameters as
854 new(), instantiates a new parser object and returns a subroutine
855 reference that calls the parse() method on that instance.
856
857 my $parser = XML::Rules->new(rules => \%rules);
858 my $data = $parser->parse($xml);
859
860 becomes
861
862 my $read_data = XML::Rules->parse(rules => \%rules);
863 my $data = $read_data->($xml);
864
865 or
866
867 sub read_data;
868 *read_data = XML::Rules->parse(rules => \%rules);
869 my $data = read_data($xml);
870
871 parsestring, parsefile, parse_file, filter, filterstring, filter_string,
872 filterfile, filter_file
873 All these methods work the same way as parse() when used as a class
874 method. They accept the same parameters as new(), instantiate a new
875 object and return a subroutine reference that calls the respective
876 method.
877
878 inferRulesFromExample
    Dumper(XML::Rules::inferRulesFromExample( $fileOrXML, $fileOrXML, $fileOrXML, ...))
    Dumper(XML::Rules->inferRulesFromExample( $fileOrXML, $fileOrXML, $fileOrXML, ...))
    Dumper($parser->inferRulesFromExample( $fileOrXML, $fileOrXML, $fileOrXML, ...))
882
883 The subroutine parses the listed files and infers the rules that would
884 produce the minimal, but complete datastructure. It finds out what
885 tags may be repeated, whether they contain text content, attributes
886 etc. You may want to give the subroutine several examples to make sure
887 it knows about all possibilities. You should use this once and store
888 the generated rules in your script or even take this as the basis of a
889 more specific set of rules.
890
891 inferRulesFromDTD
892 Dumper(XML::Rules::inferRulesFromDTD( $DTDorDTDfile, [$enableExtended]))
893 Dumper(XML::Rules->inferRulesFromDTD( $DTDorDTDfile, [$enableExtended]))
894 Dumper($parser->inferRulesFromDTD( $DTDorDTDfile, [$enableExtended]))
895
896 The subroutine parses the DTD and infers the rules that would produce
897 the minimal, but complete datastructure. It finds out what tags may be
898 repeated, whether they contain text content, attributes etc. You may
899 use this each time you are about to parse the XML or once and store the
900 generated rules in your script or even take this as the basis of a more
901 specific set of rules.
902
With the second parameter set to a true value, the tags included in a
mixed content will use the "raw extended" or "raw extended array" types
instead of just "raw". This makes sure the tag data both stay at the
right place in the content and are easily accessible from the parent
tag's attribute hash.
908
909 This subroutine requires the XML::DTDParser module!
910
911 toXML / ToXML
The ToXML() method may be called as a class/static method as well. In
that case the default indentation is two spaces and the output encoding
is utf8.
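
A minimal sketch of the class-method form, assuming a hypothetical
<person> tag with one attribute and some text content. Scalar values in
the hash become attributes and the _content key becomes the text of the
tag.

```perl
use strict;
use warnings;
use XML::Rules;

# No parser object is needed when ToXML() is called on the class.
my $xml = XML::Rules->ToXML(
    person => { id => 42, _content => 'Jane Doe' },
);

print "$xml\n";
```

Because pretty-printing is on by default in this form, the returned
string may begin with a newline.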
915
917 parameters
You can pass a parameter (scalar or reference) to the parse...() or
filter...() methods. This parameter is then available to the rules as
$parser->{parameters}. The module never uses this parameter for any
other purpose, so you are free to use it however you like, as long as
you expect it to be reset by each call to parse...() or filter...():
first to the passed value and then, after the parsing is complete, to
undef.
925
926 pad
The $parser->{pad} key is specifically reserved by the module as a
place where the module users can store their data. The module doesn't
and will not use this key in any way and doesn't set or reset it under
any circumstances. If you need to share some data between the rules and
do not want to use the structure built by applying the rules, you are
free to use this key.
933
934 You should refrain from modifying or accessing other properties of the
935 XML::Rules object!
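
The difference between the two properties can be sketched like this.
The prefix parameter, the items list kept in the pad and the sample XML
are all assumptions made for the example. The parameters are set per
call, while the pad keeps its contents between calls.

```perl
use strict;
use warnings;
use XML::Rules;

my $parser = XML::Rules->new(
    rules => {
        item => sub {
            my ($tag, $attrs, $context, $parent_data, $p) = @_;
            # {parameters} holds whatever was passed to this parse() call.
            my $prefix = $p->{parameters}{prefix};
            # {pad} is never touched by the module, so it accumulates.
            push @{ $p->{pad}{items} }, $prefix . $attrs->{_content};
            return;
        },
        list => sub { return },
    },
);

my $xml = '<list><item>a</item><item>b</item></list>';

$parser->parse($xml, { prefix => 'x-' });
$parser->parse($xml, { prefix => 'y-' });

print join(',', @{ $parser->{pad}{items} }), "\n";
print "parameters are reset\n" unless defined $parser->{parameters};
```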
936
938 When used without parameters, the module does not export anything into
939 the caller's namespace. When used with parameters it either infers and
940 prints a set of rules from a DTD or example(s) or instantiates a parser
941 and exports a subroutine calling the specified method similar to the
942 parse() and other methods when called as class methods:
943
944 use XML::Rules inferRules => 'c:\temp\example.xml';
945 use XML::Rules inferRules => 'c:\temp\ourOwn.dtd';
    use XML::Rules inferRules => ['c:\temp\example.xml', 'c:\temp\other.xml'];
947 use XML::Rules
948 read_data => {
949 method => 'parse',
950 rules => { ... },
951 ...
952 };
953 use XML::Rules ToXML => {
954 method => 'ToXML',
955 rules => {}, # the option is required, but may be empty
956 ident => ' '
957 };
958 ...
    my $data = read_data($xml);
960 print ToXML(
961 rootTag => {
962 thing => [
963 {Name => "english", child => [7480], otherChild => ['Hello world']},
964 {Name => "espanol", child => [7440], otherChild => ['Hola mundo']},
965 ]
966 });
967
968 Please keep in mind that the use statement is executed at "compile
969 time", which means that the variables declared and assigned above the
970 statement do not have the value yet! This is wrong!
971
    my %rules = ( _default => 'content', foo => 'as is', ...);
973 use XML::Rules
974 read_data => {
975 method => 'parse',
976 rules => \%rules,
977 ...
978 };
979
980 If you do not specify the method, then the method named the same as the
981 import is assumed. You also do not have to specify the rules option for
982 the ToXML method as it is not used anyway:
983
984 use XML::Rules ToXML => { ident => ' ' };
985 use XML::Rules parse => {stripspaces => 7, rules => { ... }};
986
You can use inferRules from the command line like this:
988
989 perl -e "use XML::Rules inferRules => 'c:\temp\example.xml'"
990
991 or this:
992
993 perl -MXML::Rules=inferRules,c:\temp\example.xml -e 1
994
995 or use the included xml2XMLRules.pl and dtd2XMLRules.pl scripts.
996
998 By default the module doesn't handle namespaces in any way, it doesn't
999 do anything special with xmlns or xmlns:alias attributes and it doesn't
1000 strip or mangle the namespace aliases in tag or attribute names. This
1001 means that if you know for sure what namespace aliases will be used you
1002 can set up rules for tags including the aliases and unless someone
1003 decides to use a different alias or makes use of the default namespace
1004 your script will work without turning the namespace support on.
1005
1006 If you do specify any namespace to alias mapping in the constructor it
1007 does start processing the namespace stuff. The xmlns and xmlns:alias
1008 attributes for the known namespaces are stripped from the
1009 datastructures and the aliases are transformed from whatever the XML
1010 author decided to use to whatever your namespace mapping specifies.
1011 Aliases are also added to all tags that belong to a default namespace.
1012
1013 Assuming the constructor parameters contain
1014
1015 namespaces => {
1016 'http://my.namespaces.com/foo' => 'foo',
1017 'http://my.namespaces.com/bar' => 'bar',
1018 }
1019
1020 and the XML looks like this:
1021
1022 <root>
1023 <Foo xmlns="http://my.namespaces.com/foo">
        <subFoo>Hello world</subFoo>
1025 </Foo>
1026 <other xmlns:b="http://my.namespaces.com/bar">
1027 <b:pub>
1028 <b:name>NaRuzku</b:name>
1029 <b:address>at any crossroads</b:address>
1030 <b:desc>Fakt <b>desnej</b> pajzl.</b:desc>
1031 </b:pub>
1032 </other>
1033 </root>
1034
then the rules will be called as if the XML looked like this and the
namespace support were turned off:
1037
1038 <root>
1039 <foo:Foo>
        <foo:subFoo>Hello world</foo:subFoo>
1041 </foo:Foo>
1042 <other>
1043 <bar:pub>
1044 <bar:name>NaRuzku</bar:name>
1045 <bar:address>at any crossroads</bar:address>
1046 <bar:desc>Fakt <b>desnej</b> pajzl.</bar:desc>
1047 </bar:pub>
1048 </other>
1049 </root>
1050
1051 This means that the namespace handling will normalize the aliases used
1052 so that you can use them in the rules.
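
A short sketch of the alias normalization described above, under the
assumption that only the bar namespace is mapped and that the input is
a cut-down version of the sample document. The rule is keyed by the
alias from the mapping (bar:), even though the document itself uses b:.

```perl
use strict;
use warnings;
use XML::Rules;

my @names;    # collected by the rule below

my $parser = XML::Rules->new(
    namespaces => {
        'http://my.namespaces.com/bar' => 'bar',
    },
    rules => {
        # Matches <b:name> in the input, because b: maps to the same URI.
        'bar:name' => sub { push @names, $_[1]->{_content}; return },
        _default   => sub { return },
    },
);

$parser->parse(<<'XML');
<other xmlns:b="http://my.namespaces.com/bar"><b:pub><b:name>NaRuzku</b:name></b:pub></other>
XML

print "Pub: $_\n" for @names;
```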
1053
It is possible to specify an empty alias, so e.g. if you are processing
a SOAP XML and know that the tags defined by SOAP do not collide with
the tags in the enclosed XML, you may simplify the parsing by removing
all namespace aliases.
1058
You can control the behaviour with respect to the namespaces that you
did not include in your mapping by setting the "alias" for the special
pseudonamespace '*'. The possible values of the "alias" are: "warn"
(default), "keep", "strip", "" and "die".
1063
warn: whenever an unknown namespace is encountered, XML::Rules prints a
warning. The xmlns:XX attributes and the XX: aliases are retained for
these namespaces. If the alias clashes with one specified by your
mapping, it will be changed in all places. Any xmlns="..." attributes
referencing an unexpected namespace are changed to xmlns:nsN and the
alias is added to the names of the tags included.
1070
1071 keep: this works just like the "warn" except for the warning.
1072
1073 strip: all attributes and tags in the unknown namespaces are stripped.
1074 If a tag in such a namespace contains a tag from a known namespace,
1075 then the child tag is retained.
1076
"": all the xmlns attributes and the aliases for the unexpected
namespaces are removed, the tags and normal attributes are retained
without any alias.
1080
1081 die: as soon as any unexpected namespace is encountered, XML::Rules
1082 croak()s.
1083
You may view the module as an XML::Simple on steroids and use it to
build a data structure similar to the one produced by XML::Simple, with
the added benefit of being able to specify what tags or attributes to
ignore, when to take just the content, what to store as an array etc.
1090
1091 You could also view it as yet another event based XML parser that
1092 differs from all the others only in one thing. It stores the data for
1093 you so that you do not have to use globals or closures and wonder where
1094 to attach the snippet of data you just received onto the structure you
1095 are building.
1096
1097 You can use it in a way similar to XML::Twig with simplify(): specify
1098 the rules to transform the lower level tags into a XML::Simple like
1099 (simplify()ed) structure and then handle the structure in the rule for
1100 the tag(s) you'd specify in XML::Twig's twig_roots.
1101
If you need to parse an XML file without the root tag (something that
each and any sane person would allow, but the XML committee did not),
you can parse
1106
1107 <!DOCTYPE doc [<!ENTITY real_doc SYSTEM "$the_file_name">]><doc>&real_doc;</doc>
1108
1109 instead.
1110
1112 Jan Krynicky, "<Jenda at CPAN.org>"
1113
1115 Please report any bugs or feature requests to "bug-xml-rules at
1116 rt.cpan.org", or through the web interface at
1117 <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=XML-Rules>. I will be
1118 notified, and then you'll automatically be notified of progress on your
1119 bug as I make changes.
1120
1122 You can find documentation for this module with the perldoc command.
1123
1124 perldoc XML::Rules
1125
1126 You can also look for information at:
1127
1128 • AnnoCPAN: Annotated CPAN documentation
1129
1130 <http://annocpan.org/dist/XML-Rules>
1131
1132 • CPAN Ratings
1133
1134 <http://cpanratings.perl.org/d/XML-Rules>
1135
1136 • RT: CPAN's request tracker
1137
1138 <http://rt.cpan.org/NoAuth/Bugs.html?Dist=XML-Rules>
1139
1140 • Search CPAN
1141
1142 <http://search.cpan.org/dist/XML-Rules>
1143
1144 • PerlMonks
1145
1146 Please see <http://www.perlmonks.org/?node_id=581313> or
1147 <http://www.perlmonks.org/?node=XML::Rules> for discussion.
1148
1150 XML::Twig, XML::LibXML, XML::Pastor
1151
1153 The escape_value() method is taken with minor changes from XML::Simple.
1154
1156 Copyright 2006-2012 Jan Krynicky, all rights reserved.
1157
1158 This program is free software; you can redistribute it and/or modify it
1159 under the same terms as Perl itself.
1160
1161
1162
1163perl v5.32.1 2021-01-27 XML::Rules(3)