XML::Rules(3)         User Contributed Perl Documentation        XML::Rules(3)

NAME

       XML::Rules - parse XML and specify what and how to keep/process for
       individual tags

VERSION

       Version 1.16

SYNOPSIS

               use XML::Rules;

               $xml = <<'*END*';
               <doc>
                <person>
                 <fname>...</fname>
                 <lname>...</lname>
                 <email>...</email>
                 <address>
                  <street>...</street>
                  <city>...</city>
                  <country>...</country>
                  <bogus>...</bogus>
                 </address>
                 <phones>
                  <phone type="home">123-456-7890</phone>
                  <phone type="office">663-486-7890</phone>
                  <phone type="fax">663-486-7000</phone>
                 </phones>
                </person>
                <person>
                 <fname>...</fname>
                 <lname>...</lname>
                 <email>...</email>
                 <address>
                  <street>...</street>
                  <city>...</city>
                  <country>...</country>
                  <bogus>...</bogus>
                 </address>
                 <phones>
                  <phone type="office">663-486-7891</phone>
                 </phones>
                </person>
               </doc>
               *END*

               @rules = (
                       _default => sub {$_[0] => $_[1]->{_content}},
                               # by default I'm only interested in the content of the tag, not the attributes
                       bogus => undef,
                               # let's ignore this tag and all inner ones as well
                       address => sub {address => "$_[1]->{street}, $_[1]->{city} ($_[1]->{country})"},
                               # merge the address into a single string
                       phone => sub {$_[1]->{type} => $_[1]->{_content}},
                               # let's use the "type" attribute as the key and the content as the value
                       phones => sub {delete $_[1]->{_content}; %{$_[1]}},
                               # remove the text content and pass along the type => content from the child nodes
                       person => sub { # let's print the values, all the data is readily available in the attributes
                               print "$_[1]->{lname}, $_[1]->{fname} <$_[1]->{email}>\n";
                               print "Home phone: $_[1]->{home}\n" if $_[1]->{home};
                               print "Office phone: $_[1]->{office}\n" if $_[1]->{office};
                               print "Fax: $_[1]->{fax}\n" if $_[1]->{fax};
                               print "$_[1]->{address}\n\n";
                               return; # the <person> tag is processed, no need to remember what it contained
                       },
               );
               $parser = XML::Rules->new(rules => \@rules);
               $parser->parse( $xml);

INTRODUCTION

       There are several ways to extract data from XML. One that's often used
       is to read the whole file and transform it into a huge maze of objects
       and then write code like

               foreach my $obj ($XML->forTheLifeOfMyMotherGiveMeTheFirstChildNamed("Peter")->pleaseBeSoKindAndGiveMeAllChildrenNamedSomethingLike("Jane")) {
                       my $obj2 = $obj->sorryToKeepBotheringButINeedTheChildNamed("Theophile");
                       my $birth = $obj2->whatsTheValueOfAttribute("BirthDate");
                       print "Theophile was born at $birth\n";
               }

       I'm exaggerating of course, but you probably know what I mean. You can
       of course shorten the path and call just one method ... that is if you
       spend the time to learn one more "cool" thing starting with X. XPath.

       You can also use XML::Simple and generate an almost equally huge maze of
       hashes and arrays ... which may make the code more or less complex. In
       either case you need to have enough memory to store all that data, even
       if you only need a piece here and there.

       Another way to parse the XML is to create some subroutines that handle
       the start and end tags and the text and whatever else may appear in the
       XML. Some modules will let you specify just one for start tag, one for
       text and one for end tag, others will let you install different
       handlers for different tags. The catch is that you have to build your
       data structures yourself, you have to know where you are, what tag is
       currently open and what is the parent and its parent etc. so that you
       can add the attributes and especially the text to the right place. And
       the handlers have to do everything as their side effect. Does anyone
       remember what they say about side effects? They make the code hard to
       debug, they tend to change the code into a maze of interdependent
       snippets of code.

       So what's the difference in the way XML::Rules works? At first
       glance, not much. You can also specify subroutines to be called for the
       tags encountered while parsing the XML, just like the other event based
       XML parsers. The difference is that you do not have to rely on side
       effects if all you want is to store the value of a tag. You simply
       return whatever you need from the current tag and the module will add
       it at the right place in the data structure it builds and will provide
       it to the handlers for the parent tag. And if the parent tag does
       return that data again it will be passed to its parent and so forth,
       until we get to the level at which it's convenient to handle all the
       data we accumulated from the twig.

       Do we want to keep just the content and access it in the parent tag
       handler under a specific name?

               foo => sub {return 'foo' => $_[1]->{_content}}

       Do we want to ornament the content a bit and add it to the parent tag's
       content?

               u => sub {return '_' . $_[1]->{_content} . '_'}
               strong =>  sub {return '*' . $_[1]->{_content} . '*'}
               uc =>  sub {return uc($_[1]->{_content})}

       Do we want to merge the attributes into a string and access the string
       from the parent tag under a specified name?

               address => sub {return 'Address' => "Street: $_[1]->{street} $_[1]->{bldngNo}\nCity: $_[1]->{city}\nCountry: $_[1]->{country}\nPostal code: $_[1]->{zip}"}

       and in this case the $_[1]->{street} may either be an attribute of the
       <address> tag or it may be the result of the handler (rule)

               street => sub {return 'street' => $_[1]->{_content}}

       and thus come from a child tag <street>. You may also use the rules to
       convert codes to values

               our %states = (
                 AL => 'Alabama',
                 AK => 'Alaska',
                 ...
               );
               ...
               state => sub {return 'state' => $states{$_[1]->{_content}}; }

       or

               address => sub {
                       if (exists $_[1]->{id}) {
                               $sthFetchAddress->execute($_[1]->{id});
                               my $addr = $sthFetchAddress->fetchrow_hashref();
                               $sthFetchAddress->finish();
                               return 'address' => $addr;
                       } else {
                               return 'address' => $_[1];
                       }
               }

       so that you do not have to care whether there was

               <address id="147"/>

       or

               <address><street>Larry Wall's St.</street><streetno>478</streetno><city>Core</city><country>The Programming Republic of Perl</country></address>

       And if you do not like to end up with a data structure of plain old
       arrays and hashes, you can create application specific objects in the
       rules

               address => sub {
                       my $type = lc(delete $_[1]->{type});
                       $type.'Address' => MyApp::Address->new(%{$_[1]})
               },
               person => sub {
                       '@person' => MyApp::Person->new(
                               firstname => $_[1]->{fname},
                               lastname => $_[1]->{lname},
                               deliveryAddress => $_[1]->{deliveryAddress},
                               billingAddress => $_[1]->{billingAddress},
                               phone => $_[1]->{phone},
                       )
               }

       At each level in the tree structure serialized as XML you can decide
       what to keep, what to throw away, what to transform and then just
       return the stuff you care about and it will be available to the handler
       at the next level.

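       As a rough end-to-end sketch of this idea, the script below lets data
       travel up through three levels. The <order>, <item>, <name> and <qty>
       tags and the @lines variable are invented for this illustration; only
       new(), parse(), the comma-delimited tag list and the '@' key prefix
       come from this document.

```perl
use strict;
use warnings;
use XML::Rules;

# Hypothetical input invented for this example.
my $xml = <<'XML';
<order>
 <item><name>Widget</name><qty>3</qty></item>
 <item><name>Gadget</name><qty>1</qty></item>
</order>
XML

my @lines;
my $parser = XML::Rules->new(
    rules => [
        # Leaf tags: pass the text content to the parent under the tag's own name.
        'name,qty' => sub { $_[0] => $_[1]->{_content} },
        # <item>: the children are now plain keys in $_[1]; merge them into one
        # string and push it onto the "items" array in the parent's hash.
        item => sub { '@items' => "$_[1]->{qty} x $_[1]->{name}" },
        # <order>: everything the <item> rules returned is available here.
        order => sub { @lines = @{ $_[1]->{items} }; return },
    ],
);
$parser->parse($xml);
print "$_\n" for @lines;
```

       No rule here needs to know where in the document it is; each one only
       looks at its own hash and returns what the next level should see.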

CONSTRUCTOR

               my $parser = XML::Rules->new(
                       rules => \@rules,
                       [ start_rules => \@start_rules, ]
                       [ stripspaces => 0 / 1 / 2 / 3   +   0 / 4   +   0 / 8, ]
                       [ normalisespaces => 0 / 1, ]
                       [ style => 'parser' / 'filter', ]
                       [ ident => '  ', [reformat_all => 0 / 1] ],
                       [ encode => 'encoding specification', ]
                       [ output_encoding => 'encoding specification', ]
                       [ namespaces => \%namespace2alias_mapping, ]
                       [ handlers => \%additional_expat_handlers, ]
                       # and optionally parameters passed to XML::Parser::Expat
               );

       Options passed to XML::Parser::Expat: ProtocolEncoding, Namespaces,
       NoExpand, Stream_Delimiter, ErrorContext, ParseParamEnt, Base

       The "stripspaces" option controls the handling of whitespace. Please see
       "Whitespace handling" below.

       The "style" option specifies whether you want to build a parser used to
       extract stuff from the XML or to filter/modify the XML. If you specify
       style => 'filter' then all tags for which you do not specify a
       subroutine rule or that occur inside such a tag are copied to the
       output filehandle passed to the ->filter() or ->filterfile() methods.

       The "ident" option specifies what character(s) to use to indent the
       tags when filtering; by default the tags are not formatted in any way.
       If "reformat_all" is not set then this affects only the tags that have
       a rule and their subtags. And in case of subtags only those that were
       added into the attribute hash by their rules, not those left in the
       _content array!

       The "warnoverwrite" option instructs XML::Rules to issue a warning
       whenever a rule causes a key in a tag's hash to be overwritten by new
       data produced by the rule of a subtag. This happens e.g. if a tag is
       repeated and its rule doesn't expect it.

       The "encode" option allows you to ask the module to run all data
       through Encode::encode( 'encoding_specification', ...) before it is
       passed to the rules. Otherwise all data comes as UTF8.

       The "output_encoding" option on the other hand specifies the encoding
       of the resulting data; the default is again UTF8. This means that if
       you specify

               encode => 'windows-1250',
               output_encoding => 'utf8',

       and the XML is in ISO-8859-2 (Latin2) then the filter will 1) convert
       the content and attributes of the tags you are not interested in from
       Latin2 directly to utf8 and output them and 2) convert the content and
       attributes of the tags you want to process from Latin2 to Windows-1250,
       let you mangle the data and then convert the results to utf8 for the
       output.

       The "encode" and "output_encoding" options also affect
       "$parser->toXML(...)": if they are different then the data are
       converted from one encoding to the other.

       The "handlers" option allows you to set additional handlers for
       XML::Parser::Expat->setHandlers.  Your Start, End, Char and XMLDecl
       handlers are evaluated before the ones installed by XML::Rules and may
       modify the values in @_, but you should be very careful with that.
       Consider that experimental and if you do make that work the way you
       needed, please let me know so that I know what it was good for and can
       make sure it doesn't break in a new version.

   The Rules
       The rules option may be either an arrayref or a hashref, the module
       doesn't care, but if you want to use regexps to specify the groups of
       tags to be handled by the same rule you should use the array ref. The
       rules array/hash is made of pairs in the form

               tagspecification => action

       where the tagspecification may be either the name of a tag, a string
       containing a comma or pipe ( "|" ) delimited list of tag names, a
       string containing a regexp enclosed in // optionally followed by
       regular expression modifiers, or a qr// compiled regular expression.
       The tag names and tag name lists take precedence over the regexps; the
       regexps are (in case of arrayrefs only!!!) tested in the order in which
       they are specified.

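       The three kinds of tag specification can be mixed in one array
       reference. The following is only a sketch; the <rec>, <fname>, <lname>
       and <phone_...> tags and the %rec variable are made up for the example.

```perl
use strict;
use warnings;
use XML::Rules;

# Hypothetical record; the tag names are invented for this example.
my $xml = '<rec><fname>Ann</fname><lname>Lee</lname>'
        . '<phone_home>555-0100</phone_home><phone_work>555-0199</phone_work></rec>';

my %rec;
my $parser = XML::Rules->new(
    # An array reference, so that regexp specifications are tested in order.
    rules => [
        # A pipe-delimited list of tag names sharing one rule.
        'fname|lname' => sub { $_[0] => $_[1]->{_content} },
        # A compiled regexp covering a whole family of tags.
        qr/^phone_/ => sub { '@phones' => $_[1]->{_content} },
        # A plain tag name.
        rec => sub { %rec = %{ $_[1] }; return },
    ],
);
$parser->parse($xml);
print "$rec{fname} $rec{lname}: @{ $rec{phones} }\n";
```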
       These rules are evaluated/executed whenever a tag is fully parsed,
       including all the content and child tags, and they may access the
       content and attributes of the specified tag plus the stuff produced by
       the rules evaluated for the child tags.

       The action may be either

               - an undef or empty string = ignore the tag and all its children
               - a subroutine reference = the subroutine will be called to handle the tag data&contents
                       sub { my ($tagname, $attrHash, $contextArray, $parentDataArray, $parser) = @_; ...}
               - one of the built in rules below

       Custom rules

       The subroutines in the rules specification receive five parameters:

               $rule->( $tag_name, \%attrs, \@context, \@parent_data, $parser)

       It's OK to destroy the first two parameters, but you should treat the
       other three as read only or at least treat them with care!

               $tag_name = string containing the tag name
               \%attrs = hash containing the attributes of the tag plus the _content key
                       containing the text content of the tag. If it's not a leaf tag it may
                       also contain the data returned by the rules invoked for the child tags.
               \@context = an array containing the names of the tags enclosing the current
                       one. The parent tag name is the last element of the array. (READONLY!)
               \@parent_data = an array containing the hashes with the attributes
                       and content read&produced for the enclosing tags so far.
                       You may need to access this for example to find out the version
                       of the format specified as an attribute of the root tag. You may
                       safely add, change or delete attributes in the hashes, but all bets
                       are off if you change the number or type of elements of this array!
               $parser = the parser object
                       you may use $parser->{pad} or $parser->{parameters} to store any data
                       you need. The first is never touched by XML::Rules, the second is set to
                       the last argument of parse() or filter() methods and reset to undef
                       before those methods exit.

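       A minimal sketch of a rule that looks at its surroundings follows. It
       assumes that, just as the parent's name is the last element of
       \@context, the parent's hash is the last element of \@parent_data.
       The <item> and <price> tags, the currency attribute and @seen are
       hypothetical names chosen for the example.

```perl
use strict;
use warnings;
use XML::Rules;

# Hypothetical input; tag and attribute names are invented for this example.
my $xml = '<item currency="EUR"><price>5</price></item>';

my @seen;
my $parser = XML::Rules->new(
    rules => [
        price => sub {
            my ($tag_name, $attrs, $context, $parent_data) = @_;
            my $parent_name  = $context->[-1];      # name of the enclosing tag
            my $parent_attrs = $parent_data->[-1];  # assumed: hash of that tag
            push @seen,
                "$tag_name in $parent_name: $attrs->{_content} $parent_attrs->{currency}";
            return;    # nothing is added to the parent's hash
        },
        item => sub { return },
    ],
);
$parser->parse($xml);
print "$_\n" for @seen;
```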
       The subroutine may decide to handle the data and return nothing or
       tweak the data as necessary and return just the relevant bits. It may
       also load more information from elsewhere based on the ids found in the
       XML and provide it to the rules of the ancestor tags as if it were part
       of the XML.

       The possible return values of the subroutines are:

       1) nothing or undef or "" - nothing gets added to the parent tag's hash

       2) a single string - if the parent's _content is a string then the one
       produced by this rule is appended to the parent's _content.  If the
       parent's _content is an array, then the string is push()ed to the
       array.

       3) a single reference - if the parent's _content is a string then it's
       changed to an array containing the original string and this reference.
       If the parent's _content is an array, then the reference is push()ed to
       the array.

       4) an even numbered list - it's a list of key & value pairs to be added
       to the parent's hash.

       The handling of the attributes may be changed by adding '@', '%', '+',
       '*' or '.' before the attribute name.

       Without any "sigil" the key & value is added to the hash overwriting
       any previous values.

       The values for the keys starting with '@' are push()ed to the arrays
       referenced by the key name without the @. If there already is an
       attribute of the same name then the value will be preserved and will
       become the first element in the array.

       The values for the keys starting with '%' have to be either hash or
       array references. The key&value pairs in the referenced hash or array
       will be added to the hash referenced by the key. This is nice for rows
       of tags like this:

         <field name="foo" value="12"/>
         <field name="bar" value="24"/>

       if you specify the rule as

         field => sub { '%fields' => [$_[1]->{name} => $_[1]->{value}]}

       then the parent tag's hash will contain

         fields => {
           foo => 12,
           bar => 24,
         }

       The values for the keys starting with '+' are added to the current
       value, the ones starting with '.' are appended to the current value and
       the ones starting with '*' multiply the current value.

       5) an odd numbered list - the last element is appended or push()ed to
       the parent's _content, the rest is handled as in the previous case.

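       To see several of the key prefixes side by side, here is a small
       sketch. The <cart> and <line> tags, their attributes and the key names
       (skus, qty, total, codes) are invented for the illustration, and it
       assumes that '+' and '.' start from an empty value when the key does
       not exist yet.

```perl
use strict;
use warnings;
use XML::Rules;

# Hypothetical input invented for this example.
my $xml = '<cart><line sku="A1" qty="2"/><line sku="B7" qty="5"/></cart>';

my %cart;
my $parser = XML::Rules->new(
    rules => [
        line => sub {
            my $attrs = $_[1];
            return (
                '@skus'  => $attrs->{sku},                       # push onto an array
                '%qty'   => [ $attrs->{sku} => $attrs->{qty} ],  # merge pairs into a hash
                '+total' => $attrs->{qty},                       # add to a running sum
                '.codes' => "$attrs->{sku};",                    # append to a string
            );
        },
        # Copy what the <line> rules accumulated in the <cart> hash.
        cart => sub { %cart = %{ $_[1] }; return },
    ],
);
$parser->parse($xml);
print "skus: @{ $cart{skus} }, total: $cart{total}, codes: $cart{codes}\n";
```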
       Builtin rules

               'content' = only the content of the tag is preserved and added to
                       the parent tag's hash as an attribute named after the tag. Equivalent to:
                       sub { $_[0] => $_[1]->{_content}}
               'content trim' = only the content of the tag is preserved, trimmed and added to
                       the parent tag's hash as an attribute named after the tag
                       sub { s/^\s+//,s/\s+$// for ($_[1]->{_content}); $_[0] => $_[1]->{_content}}
               'content array' = only the content of the tag is preserved and pushed
                       to the array pointed to by the attribute
                       sub { '@' . $_[0] => $_[1]->{_content}}
               'as is' = the tag's hash is added to the parent tag's hash
                       as an attribute named after the tag
                       sub { $_[0] => $_[1]}
               'as is trim' = the tag's hash is added to the parent tag's hash
                       as an attribute named after the tag, the content is trimmed
                       sub { $_[0] => $_[1]}
               'as array' = the tag's hash is pushed to the attribute named after the tag
                       in the parent tag's hash
                       sub { '@'.$_[0] => $_[1]}
               'as array trim' = the tag's hash is pushed to the attribute named after the tag
                       in the parent tag's hash, the content is trimmed
                       sub { '@'.$_[0] => $_[1]}
               'no content' = the _content is removed from the tag's hash and the hash
                       is added to the parent's hash into the attribute named after the tag
                       sub { delete $_[1]->{_content}; $_[0] => $_[1]}
               'no content array' = similar to 'no content' except the hash is pushed
                       into the array referenced by the attribute
               'as array no content' = same as 'no content array'
               'pass' = the tag's hash is dissolved into the parent's hash,
                       that is all tag's attributes become the parent's attributes.
                       The _content is appended to the parent's _content.
                       sub { %{$_[1]}}
               'pass no content' = the _content is removed and the hash is dissolved
                       into the parent's hash.
                       sub { delete $_[1]->{_content}; %{$_[1]}}
               'pass without content' = same as 'pass no content'
               'raw' = the [tagname => attrs] is pushed to the parent tag's _content.
                       You would use this style if you wanted to be able to print
                       the parent tag as XML preserving the whitespace or other textual content
                       sub { [$_[0] => $_[1]]}
               'raw extended' = the [tagname => attrs] is pushed to the parent tag's _content
                       and the attrs are added to the parent's attribute hash with ":$tagname" as the key
                       sub { (':'.$_[0] => $_[1], [$_[0] => $_[1]])};
               'raw extended array' = the [tagname => attrs] is pushed to the parent tag's _content
                       and the attrs are pushed to the parent's attribute hash with ":$tagname" as the key
                       sub { ('@:'.$_[0] => $_[1], [$_[0] => $_[1]])};
               'by <attrname>' = uses the value of the specified attribute as the key when adding the
                       attribute hash into the parent tag's hash. You can specify more names, in that case
                       the first found is used.
                       sub {delete($_[1]->{name}) => $_[1]}
               'content by <attrname>' = uses the value of the specified attribute as the key when adding the
                       tag's content into the parent tag's hash. You can specify more names, in that case
                       the first found is used.
                       sub {$_[1]->{name} => $_[1]->{_content}}
               'no content by <attrname>' = uses the value of the specified attribute as the key when adding the
                       attribute hash into the parent tag's hash. The content is dropped. You can specify more names,
                       in that case the first found is used.
                       sub {delete($_[1]->{_content}); delete($_[1]->{name}) => $_[1]}
               '==...' = replace the tag by the specified string. That is the string will be added to
                       the parent tag's _content
                       sub { return '...' }
               '=...' = replace the tag contents by the specified string and forget the attributes.
                       sub { return $_[0] => '...' }
               '' = forget the tag's contents (after processing the rules for subtags)
                       sub { return };

       I include the unnamed subroutines that would be equivalent to the
       builtin rule in case you need to add some tests and then behave as if
       one of the builtins was used.

       Builtin rule modifiers

       You can add these modifiers to most rules, just add them to the string
       literal, at the end, separated from the base rule by a space.

               no xmlns        = strip the namespace alias from the $_[0] (tag name)
               remove(list,of,attributes) = remove all specified attributes (or keys produced by child tag rules) from the tag data
               only(list,of,attributes) = filter the hash of attributes and keys+values produced by child tag rules in the tag data
                       to only include those specified here. In case you need to include the tag content do not forget to include
                       _content in the list!

       Not all modifiers make sense for all rules. For example if the rule is
       'content', it's pointless to filter the attributes, because the only
       one used will be the content anyway.

       The behaviour of the combination of the 'raw...' rules and the rule
       modifiers is UNDEFINED!

       Different rules for different paths to tags

       Since 0.19 it's possible to specify several actions for a tag if you
       need to do something different based on the path to the tag like this:

               tagname => [
                       'tag/path' => action,
                       '/root/tag/path' => action,
                       '/root/*/path' => action,
                       qr{^root/ns:[^/]+/par$} => action,
                       default_action
               ],

       The path is matched against the list of parent tags joined by slashes.

       If you need to use more complex conditions to select the actions you
       have to use a single subroutine rule and implement the conditions
       within that subroutine. You have access both to the list of enclosing
       tags and their attribute hashes (including the data obtained from the
       rules of the already closed subtags of the enclosing tags).

   The Start Rules
       Apart from the normal rules that get invoked once the tag is fully
       parsed, including the contents and child tags, you may want to attach
       some code to the start tag to (optionally) skip whole branches of XML
       or set up attributes and variables. You may set up the start rules
       either in a separate parameter to the constructor or in the rules=> by
       prepending the tag name(s) by ^.

       These rules are in the form

               tagspecification => undef / '' / 'skip' --> skip the element, including child tags
               tagspecification => 1 / 'handle'        --> handle the element, may be needed
                       if you specify the _default rule.
               tagspecification => \&subroutine

       The subroutines receive the same parameters as for the "end tag" rules
       except of course the _content, but their return value is treated
       differently.  If the subroutine returns a false value then the whole
       branch enclosed by the current tag is skipped, no data are stored and
       no rules are executed. You may modify the hash referenced by $attr.

       You may even tie() the hash referenced by $attr, for example in case
       you want to store the parsed data in a DBM::Deep.  In such a case all
       the data returned by the immediate subtags of this tag will be stored
       in the DBM::Deep.  Make sure you do not overwrite the data by data from
       another occurrence of the same tag if you return $_[1]/$attr from the
       rule!

               YourHugeTag => sub {
                       my %temp = %{$_[1]};
                       tie %{$_[1]}, 'DBM::Deep', $filename;
                       %{$_[1]} = %temp;
                       1;
               }

       Both types of rules are free to store any data they want in
       $parser->{pad}. This property is NOT emptied after the parsing!

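       A start rule that prunes branches based on an attribute might look
       roughly like this. The <people>, <person> and <name> tags, the
       "active" attribute and @names are hypothetical, chosen only for the
       example.

```perl
use strict;
use warnings;
use XML::Rules;

# Hypothetical input invented for this example.
my $xml = '<people>'
        . '<person active="no"><name>Bob</name></person>'
        . '<person active="yes"><name>Ann</name></person>'
        . '</people>';

my @names;
my $parser = XML::Rules->new(
    start_rules => [
        # Called when <person ...> opens; a false return value skips the branch.
        person => sub { ($_[1]->{active} // '') eq 'yes' },
    ],
    rules => [
        name   => sub { $_[0] => $_[1]->{_content} },
        # Only reached for the <person> tags the start rule let through.
        person => sub { push @names, $_[1]->{name}; return },
        people => sub { return },
    ],
);
$parser->parse($xml);
print "@names\n";
```

       Because the skipped branch never runs its rules, the <name> of the
       inactive person is not even stored temporarily.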
   Whitespace handling
       There are two options that affect the whitespace handling: stripspaces
       and normalisespaces. The normalisespaces is a simple flag that controls
       whether multiple spaces/tabs/newlines are collapsed into a single space
       or not. The stripspaces is more complex, it's a bit-mask, an ORed
       combination of the following options:

               0 - don't remove whitespace around tags
                   (around tags means before the opening tag and after the closing tag, not in the tag's content!)
               1 - remove whitespace before tags whose rules did not return any text content
                   (the rule specified for the tag caused the data of the tag to be ignored,
                       processed them already or added them as attributes to parent's \%attr)
               2 - remove whitespace around tags whose rules did not return any text content
               3 - remove whitespace around all tags

               0 - remove only whitespace-only content
                   (that is remove the whitespace around <foo/> in this case "<bar>   <foo/>   </bar>"
                       but not this one "<bar>blah   <foo/>  blah</bar>")
               4 - remove trailing/leading whitespace
                   (remove the whitespace in both cases above)

               0 - don't trim content
               8 - do trim content
                       (That is for "<foo>  blah   </foo>" only pass to the rule {_content => 'blah'})

       That is, if you have a data oriented XML in which each tag contains
       either text content or subtags, but not both, you want to use
       stripspaces => 3 or stripspaces => 3|4. This will not only make sure
       you don't need to bother with the whitespace-only _content of the tags
       with subtags, but will also make sure you do not keep on wasting memory
       while parsing a huge XML and processing the "twigs". Without that
       option the parent tag of the repeated tag would keep on accumulating
       unneeded whitespace in its _content.

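       The effect can be observed by parsing the same document twice. This is
       only a sketch; the <list> and <i> tags and the list_content() helper
       are invented for the example, and the helper treats a missing _content
       as an empty string.

```perl
use strict;
use warnings;
use XML::Rules;

# Hypothetical pretty-printed input invented for this example.
my $xml = <<'XML';
<list>
  <i>a</i>
  <i>b</i>
</list>
XML

# Parse $xml with the given stripspaces value and report the text
# that the <list> rule finds in its own _content.
sub list_content {
    my ($stripspaces) = @_;
    my $content;
    my $parser = XML::Rules->new(
        stripspaces => $stripspaces,
        rules => [
            i    => sub { '@i' => $_[1]->{_content} },
            list => sub { $content = $_[1]->{_content}; return },
        ],
    );
    $parser->parse($xml);
    return defined $content ? $content : '';
}

my $kept     = list_content(0);        # whitespace between the <i> tags piles up
my $stripped = list_content(3 | 4);    # whitespace around the tags is dropped
printf "stripspaces => 0: %d characters, stripspaces => 3|4: %d characters\n",
    length $kept, length $stripped;
```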

INSTANCE METHODS

   parse
               $parser->parse( $string [, $parameters]);
               $parser->parse( $IOhandle [, $parameters]);

       Parses the XML in the string or reads and parses the XML from the
       opened IO handle, executes the rules as it encounters the closing tags
       and returns the resulting structure.

       The scalar or reference passed as the second parameter to the parse()
       method is assigned to $parser->{parameters} for the parsing of the file
       or string. Once the XML is parsed the key is deleted. This means that
       the $parser does not retain a reference to the $parameters after the
       parsing.

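       For instance, a rule can pick up per-call settings from the parser
       object. In this sketch the <prices> and <price> tags, the "rate" key
       and @converted are hypothetical names used only for illustration.

```perl
use strict;
use warnings;
use XML::Rules;

my @converted;
my $parser = XML::Rules->new(
    rules => [
        price => sub {
            my ($tag, $attrs, $context, $parent_data, $self) = @_;
            # $self->{parameters} holds the second argument of parse().
            push @converted, $attrs->{_content} * $self->{parameters}{rate};
            return;
        },
        prices => sub { return },
    ],
);

# The hash with the hypothetical "rate" key is only visible during this call.
$parser->parse('<prices><price>10</price><price>2.5</price></prices>', { rate => 2 });
print "@converted\n";
print "parameters are gone after parse()\n" unless defined $parser->{parameters};
```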
   parsestring
               $parser->parsestring( $string [, $parameters]);

       Just an alias to ->parse().

   parse_string
               $parser->parse_string( $string [, $parameters]);

       Just an alias to ->parse().

   parsefile
               $parser->parsefile( $filename [, $parameters]);

       Opens the specified file and parses the XML and executes the rules as
       it encounters the closing tags and returns the resulting structure.

   parse_file
               $parser->parse_file( $filename [, $parameters]);

       Just an alias to ->parsefile().

597   parse_chunk
598               while (my $chunk = read_chunk_of_data()) {
599                       $parser->parse_chunk($chunk);
600               }
601               my $data = $parser->last_chunk();
602
603       This method allows you to process the XML in chunks as you receive
604       them. The chunks do not need to be in any way valid ... it's fine if
605       the chunk ends in the middle of a tag or attribute.
606
607       If you need to set the $parser->{parameters}, pass it to the first call
608       to parse_chunk() the same way you would to parse().  The first chunk
609       may be empty so if you need to set up the parameters, but read the
610       chunks in a loop or in a callback, you can do this:
611
612               $parser->parse_chunk('', {foo => 15, bar => "Hello World!"});
613               while (my $chunk = read_chunk_of_data()) {
614                       $parser->parse_chunk($chunk);
615               }
616               my $data = $parser->last_chunk();
617
618       or
619
620               $parser->parse_chunk('', {foo => 15, bar => "Hello World!"});
621               $ua->get($url, ':content_cb' => sub { my($data, $response, $protocol) = @_; $parser->parse_chunk($data); return 1 });
622               my $data = $parser->last_chunk();
623
624       parse_chunk() returns 1 or dies; to get the accumulated data, you
625       need to call last_chunk(). You will want to either aggressively trim the
626       data remembered or handle parts of the file using custom rules as the
627       XML is being parsed.
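
       The following sketch feeds a document to the parser in arbitrary
       seven-character pieces, so that chunk boundaries fall inside tags and
       attributes, and collects data in a rule while the chunks arrive. The
       sample XML, the chunk size and the @errors array are assumptions made
       for this example only.

```perl
use strict;
use warnings;
use XML::Rules;

my $xml = '<log><entry level="info">started</entry>'
        . '<entry level="error">disk full</entry></log>';

my @errors;    # filled as the closing </entry> tags are seen

my $parser = XML::Rules->new(
    rules => {
        entry => sub {
            my ($tag, $attrs) = @_;
            push @errors, $attrs->{_content} if $attrs->{level} eq 'error';
            return;    # keep nothing, so memory use stays flat
        },
        log => sub { return },
    },
);

# Deliberately small chunks that cut through tags and attributes.
my $chunk_size = 7;
for (my $pos = 0; $pos < length($xml); $pos += $chunk_size) {
    $parser->parse_chunk(substr($xml, $pos, $chunk_size));
}
my $data = $parser->last_chunk();    # always finish the chunked parse

print "error: $_\n" for @errors;
```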
628
629   filter
630               $parser->filter( $string);
631               $parser->filter( $string, $LexicalOutputIOhandle [, $parameters]);
632               $parser->filter( $LexicalInputIOhandle, $LexicalOutputIOhandle [, $parameters]);
633               $parser->filter( $string, \*OutputIOhandle [, $parameters]);
634               $parser->filter( $LexicalInputIOhandle, \*OutputIOhandle [, $parameters]);
635               $parser->filter( $string, $OutputFilename [, $parameters]);
636               $parser->filter( $LexicalInputIOhandle, $OutputFilename [, $parameters]);
637               $parser->filter( $string, $StringReference [, $parameters]);
638               $parser->filter( $LexicalInputIOhandle, $StringReference [, $parameters]);
639
640       Parses the XML in the string or reads and parses the XML from the
641       opened IO handle, copies the tags that do not have a subroutine rule
642       specified and do not occur under such a tag, executes the specified
643       rules and prints the results to select()ed filehandle, $OutputFilename
644       or $OutputIOhandle or stores them in the scalar referenced by
645       $StringReference using the ->ToXML() method.
646
647       The scalar or reference passed as the third parameter to the filter()
648       method is assigned to $parser->{parameters} for the parsing of the file
649       or string. Once the XML is parsed the key is deleted. This means that
650       the $parser does not retain a reference to the $parameters after the
651       parsing.
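
       A hedged sketch of the string-reference form of filter(). It assumes
       the parser is created with style => 'filter' and that a rule
       returning a (tag name, attribute hash) pair has that pair written
       back through toXML(). The tag names and the masking value are
       invented for the example.

```perl
use strict;
use warnings;
use XML::Rules;

my $parser = XML::Rules->new(
    style => 'filter',
    rules => {
        # Only <password> has a subroutine rule; other tags are copied.
        password => sub {
            my ($tag, $attrs) = @_;
            return $tag => { %$attrs, _content => '***' };
        },
    },
);

my $xml = '<users><user><name>alice</name>'
        . '<password>secret</password></user></users>';

my $filtered = '';
$parser->filter($xml, \$filtered);    # output goes into $filtered

print "$filtered\n";
```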
652
653   filterstring
654               $parser->filterstring( ...);
655
656       Just an alias to ->filter().
657
658   filter_string
659               $parser->filter_string( ...);
660
661       Just an alias to ->filter().
662
663   filterfile
664               $parser->filterfile( $filename);
665               $parser->filterfile( $filename, $LexicalOutputIOhandle [, $parameters]);
666               $parser->filterfile( $filename, \*OutputIOhandle [, $parameters]);
667               $parser->filterfile( $filename, $OutputFilename [, $parameters]);
668
669       Parses the XML in the specified file, copies the tags that do not have
670       a subroutine rule specified and do not occur under such a tag,
671       executes the specified rules and prints the results to select()ed
672       filehandle, $OutputFilename or $OutputIOhandle or stores them in the
673       scalar referenced by $StringReference.
674
675       The scalar or reference passed as the third parameter to the
676       filterfile() method is assigned to $parser->{parameters} for the
677       parsing of the file. Once the XML is parsed the key is deleted. This
678       means that the $parser does not retain a reference to the $parameters
679       after the parsing.
680
681   filter_file
682       Just an alias to ->filterfile().
683
684   filter_chunk
685               while (my $chunk = read_chunk_of_data()) {
686                       $parser->filter_chunk($chunk);
687               }
688               $parser->last_chunk();
689
690       This method allows you to process the XML in chunks as you receive
691       them. The chunks do not need to be in any way valid ... it's fine if
692       the chunk ends in the middle of a tag or attribute.
693
694       If you need to set the file to store the result to (default is the
695       select()ed filehandle) or set the $parser->{parameters}, pass it to the
696       first call to filter_chunk() the same way you would to filter().  The
697       first chunk may be empty so if you need to set up the parameters, but
698       read the chunks in a loop or in a callback, you can do this:
699
700               $parser->filter_chunk('', "the-filtered.xml", {foo => 15, bar => "Hello World!"});
701               while (my $chunk = read_chunk_of_data()) {
702                       $parser->filter_chunk($chunk);
703               }
704               $parser->last_chunk();
705
706       or
707
708               $parser->filter_chunk('', "the_filtered.xml", {foo => 15, bar => "Hello World!"});
709               $ua->get($url, ':content_cb' => sub { my($data, $response, $protocol) = @_; $parser->filter_chunk($data); return 1 });
710               $parser->last_chunk();
711
712       filter_chunk() returns 1 or dies; you need to call last_chunk() to
713       signal the end of the data, close the filehandles and clean the parser
714       status.  Make sure you do not set a rule for the root tag or other tag
715       containing way too much data. Keep in mind that even if the parser
716       works as a filter, the data for a custom rule must be kept in memory
717       for the rule to execute!
718
719   last_chunk
720               my $data = $parser->last_chunk();
721               my $data = $parser->last_chunk($the_last_chunk_contents);
722
723       Finishes the processing of an XML fed to the parser in chunks. In case
724       of the parser style, returns the accumulated data. In case of the
725       filter style, flushes and closes the output file. You can pass the last
726       piece of the XML to this method or call it without parameters if all
727       the data was passed to parse_chunk()/filter_chunk().
728
729       You HAVE to execute this method after call(s) to parse_chunk() or
730       filter_chunk()! Until you do, the parser will refuse to process full
731       documents and expect another call to parse_chunk()/filter_chunk()!
732
733   escape_value
734               $parser->escape_value( $data [, $numericescape])
735
736       This method escapes the $data for inclusion in XML, the $numericescape
737       may be 0, 1 or 2 and controls whether to convert 'high' (non ASCII)
738       characters to XML entities.
739
740       0 - default: no numeric escaping (OK if you're writing out UTF8)
741
742       1 - only characters above 0xFF are escaped (ie: characters in the
743       0x80-FF range are not escaped), possibly useful with ISO8859-1 output
744
745       2 - all characters above 0x7F are escaped (good for plain ASCII output)
746
747       You can also specify the default value in the constructor
748
749               my $parser = XML::Rules->new(
750                       ...
751                       NumericEscape => 2,
752               );
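
       A short sketch of calling escape_value() directly. The sample strings
       are arbitrary, and the exact form of the numeric entity produced at
       level 2 is not shown because it is not specified above.

```perl
use strict;
use warnings;
use XML::Rules;

# An empty rule set is enough when only the helper method is needed.
my $parser = XML::Rules->new(rules => {});

# Default level: only the XML metacharacters are escaped.
my $plain = $parser->escape_value('a < b & c');
print "$plain\n";    # a &lt; b &amp; c

# Level 2: every character above 0x7F becomes a numeric entity,
# so the result is safe for plain ASCII output.
my $ascii = $parser->escape_value("caf\x{e9}", 2);
print "$ascii\n";
```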
753
754   toXML / ToXML
755               $xml = $parser->toXML( $tagname, \%attrs[, $do_not_close, $ident, $base])
756
757       You may use this method to convert the datastructures created by
758       parsing the XML into the XML format.  Not all data structures may be
759       printed! I'll add more docs later, for now please do experiment.
760
761       The $ident and $base, if defined, turn on and control the pretty-
762       printing. The $ident specifies the character(s) used for one level of
763       indentation, the $base contains the indentation of the current tag. That
764       is if you want to include the data inside of
765
766               <data>
767                       <some>
768                               <subtag>$here</subtag>
769                       </some>
770               </data>
771
772       you will call
773
774               $parser->toXML( $tagname, \%attrs, 0, "\t", "\t\t\t");
775
776       The method does NOT validate that the $ident and $base are whitespace
777       only, but of course if it's not you end up with invalid XML. Newlines
778       are added only before the start tag and (if the tag has only child tags
779       and no content) before the closing tag, but not after the closing tag!
780       Newlines are added even if the $ident is an empty string.
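
       As a rough illustration of the experiment suggested above, the sketch
       below serializes a small hash. It assumes, in line with the ToXML()
       example in the IMPORTS section, that scalar values become attributes
       and array references become child tags; the tag and key names are
       made up.

```perl
use strict;
use warnings;
use XML::Rules;

my $parser = XML::Rules->new(rules => {});

# "id" is a plain scalar, "name" is an array reference.
my $xml = $parser->toXML('person', { id => 7, name => ['Jane'] });

print "$xml\n";
```

       Passing $ident and $base as the fourth and fifth arguments switches
       on the pretty-printing described above.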
781
782   parentsToXML
783               $xml = $parser->parentsToXML( [$level])
784
785       Prints all or only the topmost $level ancestor tags, including the
786       attributes and content (parsed so far), but without the closing tags.
787       You may use this to print the header of the file you are parsing,
788       followed by calling toXML() on a structure you build and then by
789       closeParentsToXML() to close the tags left opened by parentsToXML().
790       You most likely want to use the style => 'filter' option for the
791       constructor instead.
792
793   closeParentsToXML
794               $xml = $parser->closeParentsToXML( [$level])
795
796       Prints the closing tags for all or the topmost $level ancestor tags of
797       the one currently processed.
798
799   paths2rules
800               my $parser = XML::Rules->new(
801                       rules => paths2rules {
802                               '/root/subtag/tag' => sub { ...},
803                               '/root/othertag/tag' => sub {...},
804                               'tag' => sub{ ... the default code for this tag ...},
805                               ...
806                       }
807               );
808
809       This helper function converts a hash of "somewhat xpath-like" paths and
810       subs/rules into the format required by the module.  Due to backwards
811       compatibility and efficiency I can't directly support paths in the
812       rules and the direct syntax for their specification is a bit awkward.
813       So if you need the paths and not the regexps, you may use this helper
814       instead of:
815
816               my $parser = XML::Rules->new(
817                       rules => {
818                               'tag' => [
819                                       '/root/subtag' => sub { ...},
820                                       '/root/othertag' => sub {...},
821                                       sub{ ... the default code for this tag ...},
822                               ],
823                               ...
824                       }
825               );
826
827   return_nothing
828       Stop parsing the XML, forget any data we already have and return from
829       the $parser->parse().  This is only supposed to be used within rules
830       and may be called both as a method and as an ordinary function (it's
831       not exported).
832
833   return_this
834       Stop parsing the XML, forget any data we already have and return the
835       attributes passed to this subroutine from the $parser->parse(). This is
836       only supposed to be used within rules and may be called both as a
837       method and as an ordinary function (it's not exported).
838
839   skip_rest
840       Stop parsing the XML and return whatever data we already have from the
841       $parser->parse().  The rules for the currently opened tags are
842       evaluated as if the XML contained all the closing tags in the right
843       order.
844
845       These three work via raising an exception, the exception is caught
846       within the $parser->parse() and does not propagate outside.  It's also
847       safe to raise any other exception within the rules, the exception will
848       be caught as well, the internal state of the $parser object will be
849       cleaned and the exception rethrown.
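
       A minimal sketch of stopping a parse early from within a rule. It
       calls return_nothing() as a method on the parser object received as
       the fifth rule argument and keeps the interesting value in a closure
       variable; the log format and the variable names are assumptions for
       this example.

```perl
use strict;
use warnings;
use XML::Rules;

my $first_error;      # set by the rule before the parse is aborted
my $rules_run = 0;    # counts how many <entry> tags were processed

my $parser = XML::Rules->new(
    rules => {
        entry => sub {
            my ($tag, $attrs, $context, $parent_data, $self) = @_;
            $rules_run++;
            if ($attrs->{level} eq 'error') {
                $first_error = $attrs->{_content};
                $self->return_nothing;    # exception is caught inside parse()
            }
            return;
        },
        log => sub { return },
    },
);

my $result = $parser->parse(
    '<log><entry level="info">started</entry>'
  . '<entry level="error">disk full</entry>'
  . '<entry level="error">no memory</entry></log>'
);

print "first error: $first_error (after $rules_run entries)\n";
```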
850

CLASS METHODS

852   parse
853       When called as a class method, parse() accepts the same parameters as
854       new(), instantiates a new parser object and returns a subroutine
855       reference that calls the parse() method on that instance.
856
857         my $parser = XML::Rules->new(rules => \%rules);
858         my $data = $parser->parse($xml);
859
860       becomes
861
862         my $read_data = XML::Rules->parse(rules => \%rules);
863         my $data = $read_data->($xml);
864
865       or
866
867         sub read_data;
868         *read_data = XML::Rules->parse(rules => \%rules);
869         my $data = read_data($xml);
870
871   parsestring, parsefile, parse_file, filter, filterstring, filter_string,
872       filterfile, filter_file
873       All these methods work the same way as parse() when used as a class
874       method. They accept the same parameters as new(), instantiate a new
875       object and return a subroutine reference that calls the respective
876       method.
877
878   inferRulesFromExample
879               Dumper(XML::Rules::inferRulesFromExample( $fileOrXML, $fileOrXML, $fileOrXML, ...))
880               Dumper(XML::Rules->inferRulesFromExample( $fileOrXML, $fileOrXML, $fileOrXML, ...))
881               Dumper($parser->inferRulesFromExample( $fileOrXML, $fileOrXML, $fileOrXML, ...))
882
883       The subroutine parses the listed files and infers the rules that would
884       produce the minimal, but complete datastructure.  It finds out what
885       tags may be repeated, whether they contain text content, attributes
886       etc. You may want to give the subroutine several examples to make sure
887       it knows about all possibilities. You should use this once and store
888       the generated rules in your script or even take this as the basis of a
889       more specific set of rules.
890
891   inferRulesFromDTD
892               Dumper(XML::Rules::inferRulesFromDTD( $DTDorDTDfile, [$enableExtended]))
893               Dumper(XML::Rules->inferRulesFromDTD( $DTDorDTDfile, [$enableExtended]))
894               Dumper($parser->inferRulesFromDTD( $DTDorDTDfile, [$enableExtended]))
895
896       The subroutine parses the DTD and infers the rules that would produce
897       the minimal, but complete datastructure.  It finds out what tags may be
898       repeated, whether they contain text content, attributes etc. You may
899       use this each time you are about to parse the XML or once and store the
900       generated rules in your script or even take this as the basis of a more
901       specific set of rules.
902
903       With the second parameter set to a true value, the tags included in a
904       mixed content will use the "raw extended" or "raw extended array" types
905       instead of just "raw". This makes sure the tag data both stay at the
906       right place in the content and are accessible easily from the parent
907       tag's attribute hash.
908
909       This subroutine requires the XML::DTDParser module!
910
911   toXML / ToXML
912       The ToXML() method may be called as a class/static method as well. In
913       that case the default indentation is two spaces and the output encoding
914       is utf8.
915

PROPERTIES

917   parameters
918       You can pass a parameter (scalar or reference) to the parse...() or
919       filter...() methods, this parameter is later available to the rules as
920       $parser->{parameters}. The module will never use this parameter for any
921       other purpose so you are free to use it for any purposes provided that
922       you expect it to be reset by each call to parse...() or filter...()
923       first to the passed value and then, after the parsing is complete, to
924       undef.
925
926   pad
927       The $parser->{pad} key is specifically reserved by the module as a place
928       where the module users can store their data. The module doesn't and
929       will not use this key in any way, doesn't set or reset it under any
930       circumstances. If you need to share some data between the rules and do
931       not want to use the structure built by applying the rules you are free
932       to use this key.
933
934       You should refrain from modifying or accessing other properties of the
935       XML::Rules object!
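
       A small sketch of sharing data between rule invocations through
       $parser->{pad}. The order document and the "count" and "total" keys
       are invented for the example.

```perl
use strict;
use warnings;
use XML::Rules;

my $parser = XML::Rules->new(
    rules => {
        item => sub {
            my ($tag, $attrs, $context, $parent_data, $self) = @_;
            # The pad is left alone by the module, so it can hold running totals.
            $self->{pad}{count}++;
            $self->{pad}{total} += $attrs->{price};
            return;    # nothing is stored in the parsed structure
        },
        order => sub { return },
    },
);

my $result = $parser->parse('<order><item price="5"/><item price="7"/></order>');

# The pad survives the parse, unlike $parser->{parameters}.
printf "%d items, total %d\n", $parser->{pad}{count}, $parser->{pad}{total};
# prints: 2 items, total 12
```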
936

IMPORTS

938       When used without parameters, the module does not export anything into
939       the caller's namespace. When used with parameters it either infers and
940       prints a set of rules from a DTD or example(s) or instantiates a parser
941       and exports a subroutine calling the specified method similar to the
942       parse() and other methods when called as class methods:
943
944         use XML::Rules inferRules => 'c:\temp\example.xml';
945         use XML::Rules inferRules => 'c:\temp\ourOwn.dtd';
946         use XML::Rules inferRules => ['c:\temp\example.xml', 'c:\temp\other.xml'];
947         use XML::Rules
948           read_data => {
949                 method => 'parse',
950                 rules => { ... },
951                 ...
952               };
953         use XML::Rules ToXML => {
954           method => 'ToXML',
955           rules => {}, # the option is required, but may be empty
956           ident => '   '
957         };
958         ...
959         my $data = read_data($xml);
960         print ToXML(
961           rootTag => {
962             thing => [
963               {Name => "english", child => [7480], otherChild => ['Hello world']},
964               {Name => "espanol", child => [7440], otherChild => ['Hola mundo']},
965           ]
966         });
967
968       Please keep in mind that the use statement is executed at "compile
969       time", which means that the variables declared and assigned above the
970       statement do not have the value yet! This is wrong!
971
972         my %rules = ( _default => 'content', foo => 'as is', ...);
973         use XML::Rules
974           read_data => {
975             method => 'parse',
976             rules => \%rules,
977             ...
978           };
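
       One possible way around the compile-time problem, sketched under the
       assumption that the rules really do have to live in a variable, is to
       fill that variable inside a BEGIN block placed before the use
       statement. The rule set and the sample document are illustrative
       only.

```perl
use strict;
use warnings;

my %rules;
BEGIN {
    # Runs at compile time, before the "use" below is processed.
    %rules = ( _default => 'content', doc => 'as is' );
}

use XML::Rules read_data => {
    method => 'parse',
    rules  => \%rules,
};

my $data = read_data('<doc><a>1</a></doc>');
```

       Writing the rules inline in the use statement, as in the first
       examples of this section, avoids the issue altogether.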
979
980       If you do not specify the method, then the method named the same as the
981       import is assumed. You also do not have to specify the rules option for
982       the ToXML method as it is not used anyway:
983
984         use XML::Rules ToXML => { ident => '  ' };
985         use XML::Rules parse => {stripspaces => 7, rules => { ... }};
986
987       You can use the inferRules from the command line like this:
988
989         perl -e "use XML::Rules inferRules => 'c:\temp\example.xml'"
990
991       or this:
992
993         perl -MXML::Rules=inferRules,c:\temp\example.xml -e 1
994
995       or use the included xml2XMLRules.pl and dtd2XMLRules.pl scripts.
996

Namespace support

998       By default the module doesn't handle namespaces in any way, it doesn't
999       do anything special with xmlns or xmlns:alias attributes and it doesn't
1000       strip or mangle the namespace aliases in tag or attribute names. This
1001       means that if you know for sure what namespace aliases will be used you
1002       can set up rules for tags including the aliases and unless someone
1003       decides to use a different alias or makes use of the default namespace
1004       your script will work without turning the namespace support on.
1005
1006       If you do specify any namespace to alias mapping in the constructor it
1007       does start processing the namespace stuff. The xmlns and xmlns:alias
1008       attributes for the known namespaces are stripped from the
1009       datastructures and the aliases are transformed from whatever the XML
1010       author decided to use to whatever your namespace mapping specifies.
1011       Aliases are also added to all tags that belong to a default namespace.
1012
1013       Assuming the constructor parameters contain
1014
1015               namespaces => {
1016                       'http://my.namespaces.com/foo' => 'foo',
1017                       'http://my.namespaces.com/bar' => 'bar',
1018               }
1019
1020       and the XML looks like this:
1021
1022               <root>
1023                       <Foo xmlns="http://my.namespaces.com/foo">
1024                               <subFoo>Hello world</subFoo>
1025                       </Foo>
1026                       <other xmlns:b="http://my.namespaces.com/bar">
1027                               <b:pub>
1028                                       <b:name>NaRuzku</b:name>
1029                                       <b:address>at any crossroads</b:address>
1030                                       <b:desc>Fakt <b>desnej</b> pajzl.</b:desc>
1031                               </b:pub>
1032                       </other>
1033               </root>
1034
1035       then the rules will be called as if the XML looked like this while the
1036       namespace support is turned off:
1037
1038               <root>
1039                       <foo:Foo>
1040                               <foo:subFoo>Hello world</foo:subFoo>
1041                       </foo:Foo>
1042                       <other>
1043                               <bar:pub>
1044                                       <bar:name>NaRuzku</bar:name>
1045                                       <bar:address>at any crossroads</bar:address>
1046                                       <bar:desc>Fakt <b>desnej</b> pajzl.</bar:desc>
1047                               </bar:pub>
1048                       </other>
1049               </root>
1050
1051       This means that the namespace handling will normalize the aliases used
1052       so that you can use them in the rules.
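
       A sketch of the alias normalization in practice: the document uses
       the prefix "b", the mapping asks for "bar", and the rule is therefore
       registered under 'bar:name'. The trimmed-down XML and the @names
       array are assumptions made for this example.

```perl
use strict;
use warnings;
use XML::Rules;

my @names;    # filled by the 'bar:name' rule

my $parser = XML::Rules->new(
    namespaces => {
        'http://my.namespaces.com/bar' => 'bar',
    },
    rules => {
        # The rule uses the alias from the mapping, not the one in the XML.
        'bar:name' => sub { push @names, $_[1]{_content}; return },
        _default   => sub { return },
    },
);

my $result = $parser->parse(
    '<root xmlns:b="http://my.namespaces.com/bar">'
  . '<b:pub><b:name>NaRuzku</b:name></b:pub></root>'
);

print "pub: $_\n" for @names;
```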
1053
1054       It is possible to specify an empty alias, so eg. in case you are
1055       processing a SOAP XML and know the tags defined by SOAP do not collide
1056       with the tags in the enclosed XML you may simplify the parsing by
1057       removing all namespace aliases.
1058
1059       You can control the behaviour with respect to the namespaces that you
1060       did not include in your mapping by setting the "alias" for the special
1061       pseudonamespace '*'. The possible values of the "alias" are: "warn"
1062       (default), "keep", "strip", "" and "die".
1063
1064       warn: whenever an unknown namespace is encountered, XML::Rules prints a
1065       warning.  The xmlns:XX attributes and the XX: aliases are retained for
1066       these namespaces.  If the alias clashes with one specified by your
1067       mapping it will be changed in all places, the xmlns="..." referencing
1068       an unexpected namespace are changed to xmlns:nsN and the alias is added
1069       to the tag names included.
1070
1071       keep: this works just like the "warn" except for the warning.
1072
1073       strip: all attributes and tags in the unknown namespaces are stripped.
1074       If a tag in such a namespace contains a tag from a known namespace,
1075       then the child tag is retained.
1076
1077       "": all the xmlns attributes and the aliases for the unexpected
1078       namespaces are removed, the tags and normal attributes are retained
1079       without any alias.
1080
1081       die: as soon as any unexpected namespace is encountered, XML::Rules
1082       croak()s.
1083

HOW TO USE

1085       You may view the module either as an XML::Simple on steroids and use it
1086       to build a data structure similar to the one produced by XML::Simple
1087       with the added benefit of being able to specify what tags or attributes
1088       to ignore, when to take just the content, what to store as an array
1089       etc.
1090
1091       You could also view it as yet another event based XML parser that
1092       differs from all the others only in one thing.  It stores the data for
1093       you so that you do not have to use globals or closures and wonder where
1094       to attach the snippet of data you just received onto the structure you
1095       are building.
1096
1097       You can use it in a way similar to XML::Twig with simplify(): specify
1098       the rules to transform the lower level tags into a XML::Simple like
1099       (simplify()ed) structure and then handle the structure in the rule for
1100       the tag(s) you'd specify in XML::Twig's twig_roots.
1101

Unrelated tricks

1103       If you need to parse an XML file without the root tag (something that
1104       each and any sane person would allow, but the XML committee did not), you
1105       can parse
1106
1107         <!DOCTYPE doc [<!ENTITY real_doc SYSTEM "$the_file_name">]><doc>&real_doc;</doc>
1108
1109       instead.
1110

AUTHOR

1112       Jan Krynicky, "<Jenda at CPAN.org>"
1113

BUGS

1115       Please report any bugs or feature requests to "bug-xml-rules at
1116       rt.cpan.org", or through the web interface at
1117       <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=XML-Rules>.  I will be
1118       notified, and then you'll automatically be notified of progress on your
1119       bug as I make changes.
1120

SUPPORT

1122       You can find documentation for this module with the perldoc command.
1123
1124           perldoc XML::Rules
1125
1126       You can also look for information at:
1127
1128       ·   AnnoCPAN: Annotated CPAN documentation
1129
1130           <http://annocpan.org/dist/XML-Rules>
1131
1132       ·   CPAN Ratings
1133
1134           <http://cpanratings.perl.org/d/XML-Rules>
1135
1136       ·   RT: CPAN's request tracker
1137
1138           <http://rt.cpan.org/NoAuth/Bugs.html?Dist=XML-Rules>
1139
1140       ·   Search CPAN
1141
1142           <http://search.cpan.org/dist/XML-Rules>
1143
1144       ·   PerlMonks
1145
1146           Please see <http://www.perlmonks.org/?node_id=581313> or
1147           <http://www.perlmonks.org/?node=XML::Rules> for discussion.
1148

SEE ALSO

1150       XML::Twig, XML::LibXML, XML::Pastor
1151

ACKNOWLEDGEMENTS

1153       The escape_value() method is taken with minor changes from XML::Simple.
1154

COPYRIGHT & LICENSE

1156       Copyright 2006-2012 Jan Krynicky, all rights reserved.
1157
1158       This program is free software; you can redistribute it and/or modify it
1159       under the same terms as Perl itself.
1160
1161
1162
1163perl v5.32.0                      2020-07-28                     XML::Rules(3)