XML::Tidy(3) User Contributed Perl Documentation XML::Tidy(3)

XML::Tidy - tidy indenting of XML documents

This documentation refers to version 1.20 of XML::Tidy, which was
released on Sun Jul 9 09:43:30:08 -0500 2017.

    #!/usr/bin/perl
    use strict; use warnings;
    use utf8;   use XML::Tidy;

    # create new XML::Tidy object by loading: MainFile.xml
    my $tidy_obj = XML::Tidy->new('filename' => 'MainFile.xml');

    # tidy up the indenting
    $tidy_obj->tidy();

    # write out changes back to MainFile.xml
    $tidy_obj->write();

This module creates XML document objects (with inheritance from
XML::XPath) to tidy mixed-content (i.e., non-data) text node indenting.
There are also some other handy member functions to compress and expand
your XML document object (into either a compact XML representation or a
binary one).

new()
This is the standard Tidy object constructor. Except for the added
'binary' option, it can take the same parameters as an XML::XPath
object constructor to initialize the XML document object. These can be
any one of:

    'filename' => 'SomeFile.xml'
    'binary'   => 'SomeBinaryFile.xtb'
    'xml'      => $variable_which_holds_a_bunch_of_XML_data
    'ioref'    => $file_InputOutput_reference
    'context'  => $existing_node_at_specified_context_to_become_new_obj

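As a minimal sketch of the 'xml' option, an object can be built from a string held in memory rather than from a file. The sample document and variable names below are illustrative assumptions, not part of the original documentation:

```perl
#!/usr/bin/perl
use strict; use warnings;
use XML::Tidy;

# Hypothetical sample document held in a scalar
my $xml_data = '<root><item id="1">first</item><item id="2">second</item></root>';

# Build the Tidy object from the string instead of from a file
my $tidy_obj = XML::Tidy->new('xml' => $xml_data);

# Re-indent and fetch the result as a string; nothing is written to disk
$tidy_obj->tidy();
my $xml_out = $tidy_obj->toString();
print "$xml_out\n";
```

Because no 'filename' was supplied, a later write() call on this object would need an explicit filename.
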
reload()
The reload() member function causes the latest data contained in a Tidy
object to be re-parsed (which re-indexes all nodes).

This can be necessary after modifications have been made to nodes which
impact the tree node hierarchy because XML::XPath's find() member
preserves state information which can get out-of-sync.

reload() is probably rarely useful by itself but it is needed by
strip() and prune() so it is exposed as a method in case it comes in
handy for other uses.

strip()
The strip() member function searches the Tidy object for all mixed-
content (i.e., non-data) text nodes and empties them out. This
effectively removes any markup indenting.

strip() is used by compress() and tidy() but it is exposed because it
is also worthwhile by itself.

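A short sketch of strip() used on its own follows. The input string is an illustrative assumption; the point is only that indenting between elements is emptied while data text is kept:

```perl
use strict; use warnings;
use XML::Tidy;

# Illustrative input with hand-made indenting (an assumption for this sketch)
my $indented = "<root>\n  <list>\n    <item>one</item>\n    <item>two</item>\n  </list>\n</root>";

my $tidy_obj = XML::Tidy->new('xml' => $indented);
$tidy_obj->strip();                 # empty the whitespace-only text nodes between elements
my $flat = $tidy_obj->toString();   # data text such as "one" and "two" is kept
print "$flat\n";
```
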
tidy()
The tidy() member function can take a single optional parameter as the
string that should be inserted for each indent level. Some examples:

    # Tidy up indenting with default two (2) spaces per indent level
    $tidy_obj->tidy();

    # Tidy up indenting with four (4) spaces per indent level
    $tidy_obj->tidy('    ');

    # Tidy up indenting with one (1) tab per indent level
    $tidy_obj->tidy('tab');

    # Tidy up indenting with two (2) tabs per indent level
    $tidy_obj->tidy("\t\t");

The default behavior is to use two (2) spaces for each indent level.
The Tidy object gets all mixed-content (i.e., non-data) text nodes
reformatted to appropriate indent levels according to tree nesting
depth.

NOTE: tidy() disturbs some XML escapes in whatever ways XML::XPath
does. It has been brought to my attention that these modules also strip
CDATA tags from XML files / data they operate on. Even though CDATA
tags don't seem very common, I would very much like for them to work
smoothly too. Hopefully the vast majority of files will work fine and
future support for any of the more rare types can be added later.

Additionally, please take notice that every call to tidy() (as well as
reload(), strip(), and most other XML::Tidy functions) leaks some memory
due to its usage of XPath's findnodes command. This issue was described
helpfully at <HTTPS://RT.CPAN.Org/Ticket/Display.html?id=120296>.
Thanks to Jozef!

compress()
The compress() member function calls strip() on the Tidy object then
creates an encoded comment which contains the names of elements and
attributes as they occurred in the original document. Their respective
element and attribute names are replaced with just the appropriate
index throughout the document.

compress() can accept a parameter describing which node types to
attempt to shrink down as abbreviations. This parameter should be a
string of just the first letters of each node type you wish to include
as in the following mapping:

    e = elements
    a = attribute keys
    v = attribute values *EXPERIMENTAL*
    t = text nodes       *EXPERIMENTAL*
    c = comment nodes    *EXPERIMENTAL*
    n = namespace nodes  *not-yet-implemented*

Attribute values ('v') and text nodes ('t') both seem to work fine with
current tokenization. I've still labeled them EXPERIMENTAL because they
seem more likely to cause problems than valid element or attribute key
names. I have some bugs in the comment node compression which I haven't
been able to find yet so that one should be avoided for now. Since
these three node types ('vtc') all require tokenization, they are not
included in default compression ('ea'). An example call which includes
values and text would be:

    $tidy_obj->compress('eavt');

The original document structure (i.e., node hierarchy) is preserved.
compress() significantly reduces the file size of most XML documents
for when size matters more than immediate human readability. expand()
performs the opposite conversion.

expand()
The expand() member function reads any XML::Tidy::compress comments
from the Tidy object and uses them to reconstruct the document that was
passed to compress().

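The compress() and expand() pair can be exercised as a round trip. This is only a sketch under assumptions: the sample catalog document is made up, and the default 'ea' node types are passed explicitly for clarity:

```perl
use strict; use warnings;
use XML::Tidy;

# Made-up sample document (an assumption for this sketch)
my $xml_data = '<catalog><book title="A"><price>1</price></book>'
             . '<book title="B"><price>2</price></book></catalog>';
my $tidy_obj = XML::Tidy->new('xml' => $xml_data);

$tidy_obj->compress('ea');              # abbreviate element and attribute key names only
my $compact = $tidy_obj->toString();

$tidy_obj->expand();                    # rebuild the names from the encoded comment
$tidy_obj->tidy();                      # indenting was stripped by compress(), so re-indent
my $restored = $tidy_obj->toString();

printf("compact: %d bytes, restored: %d bytes\n", length($compact), length($restored));
```

On a document this small the encoded comment may outweigh the savings; the size benefit described above should show up on larger documents with many repeated names.
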
bcompress('BinaryOutputFilename.xtb')
The bcompress() member function stores a binary representation of any
Tidy object. The format consists of:

    0) a null-terminated version string
    1) a byte specifying how many bytes later indices will be
    2) the number of bytes from 1 above to designate the total string count
    3) the number of null-terminated strings from 2 above
    4) the number of bytes from 1 above to designate the total integer count
    5) the number of 4-byte integers from 4 above
    6) the number of bytes from 1 above to designate the total float count
    7) the number of 8-byte (double-precision) floats from 6 above
    8) node index sets until the end of the file

Normal node index sets consist of two values. The first is an index
(again the number of bytes long comes from 1) into the three lists as
if they were all linear. The second is a single-byte integer
identifying the node type (using standard DOM node type enumerations).

A few special cases exist in node index sets though. If the index is
null (zero), it is interpreted as a close-element tag (so no
accompanying type value is read). On the other end, when the index is
non-zero, the type value is always read. In the event that the type
corresponds to an attribute or a processing instruction, the next index
is read (without another accompanying type value) in order to complete
the data fields required by those node types.

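To make items 0 through 7 of the layout concrete, here is a rough sketch that packs such a header with Perl's core pack(). It is not the module's own code: the helper names pack_index() and pack_header() are hypothetical, and big-endian byte order and the contents of the version string are assumptions, since the description above does not state them.

```perl
use strict; use warnings;

# Encode an unsigned count or index in $width bytes (1 to 4).
# Big-endian byte order is an assumption of this sketch.
sub pack_index {
    my ($number, $width) = @_;
    return substr(pack('N', $number), 4 - $width);
}

# Build items 0 through 7 of the layout from three value lists.
sub pack_header {
    my ($version, $width, $strings, $ints, $floats) = @_;
    my $data = pack('Z*', $version);                   # 0) null-terminated version string
    $data   .= pack('C', $width);                      # 1) bytes per later index
    $data   .= pack_index(scalar(@$strings), $width);  # 2) total string count
    $data   .= pack('Z*', $_) for @$strings;           # 3) null-terminated strings
    $data   .= pack_index(scalar(@$ints), $width);     # 4) total integer count
    $data   .= pack('l>', $_) for @$ints;              # 5) signed 4-byte integers
    $data   .= pack_index(scalar(@$floats), $width);   # 6) total float count
    $data   .= pack('d>', $_) for @$floats;            # 7) 8-byte doubles
    return $data;
}

my $header = pack_header('1.20', 1, ['friend', 'name', 'goodguy'], [15], [31.255]);
printf("%d bytes: %s\n", length($header), unpack('H*', $header));
```

The node index sets of item 8 would follow this header; they are left out here because their exact encoding depends on details not given above.
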
NOTE: Please bear in mind that the encoding of binary integers and
floats only works properly if the values are not surrounded by spaces
or other delimiters and each is contained in its own single node. This
is necessary to enable thorough reconstruction of whitespace from the
original document. I recommend storing every numerical value as an
isolated attribute value or text node without any surrounding
whitespace.

    # Examples which encode all numbers as binary:
    <friend name="goodguy" category="15">
      <hitpoints>31.255</hitpoints>
      <location>
        <x>-15.65535</x>
        <y>16383.7</y>
        <z>-1023.63</z>
      </location>
    </friend>

    # Examples which encode all numbers as strings:
    <enemy name="badguy" category=" 666 ">
      <hitpoints> 2.0 </hitpoints>
      <location> 4.0 -2.0 4.0 </location>
    </enemy>

The default file extension is .xtb (for XML::Tidy Binary).

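The rule in the NOTE above can be approximated with a small classifier. This is a sketch only: classify_value() is a hypothetical helper, and the exact patterns and the signed 32-bit bound are assumptions inferred from the "4-byte integers" in the format description, not the module's actual tests.

```perl
use strict; use warnings;

# Hypothetical helper: guess how a node value would be stored under the rule above.
sub classify_value {
    my ($value) = @_;
    # Integers: optional sign, digits only, and small enough for a signed 4-byte slot
    if ($value =~ /\A-?\d+\z/ && $value >= -2**31 && $value <= 2**31 - 1) {
        return 'int';
    }
    # Floats: optional sign, digits, a decimal point, more digits
    return 'float' if $value =~ /\A-?\d+\.\d+\z/;
    # Anything padded with spaces or holding several numbers stays a string
    return 'string';
}

for my $value ('15', '31.255', '-15.65535', ' 666 ', ' 2.0 ', ' 4.0 -2.0 4.0 ') {
    printf("%-16s => %s\n", "'$value'", classify_value($value));
}
```

Under these assumptions the values from the friend example classify as numbers, while every value from the enemy example falls through to 'string' because of its surrounding spaces.
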
bexpand('BinaryInputFilename.xtb')
The bexpand() member function reads a binary file which was previously
written from bcompress(). bexpand() is an XML::Tidy object constructor
like new() so it can be called like:

    my $xtbo = XML::Tidy->bexpand('BinaryInputFilename.xtb');

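A binary round trip might look like the following sketch. The document content and the Friend.xtb file name are illustrative assumptions:

```perl
use strict; use warnings;
use XML::Tidy;

# Illustrative document and file name (assumptions for this sketch)
my $tidy_obj = XML::Tidy->new('xml' =>
    '<friend name="goodguy" category="15"><hitpoints>31.255</hitpoints></friend>');
$tidy_obj->bcompress('Friend.xtb');                 # write the binary form to disk

my $xtbo = XML::Tidy->bexpand('Friend.xtb');        # constructor: read it back
# The 'binary' option of new() is the other documented way in:
# my $xtbo = XML::Tidy->new('binary' => 'Friend.xtb');

$xtbo->tidy();
my $xml_out = $xtbo->toString();
print "$xml_out\n";
```
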
prune()
The prune() member function takes an XPath location to remove (along
with all attributes and child nodes) from the Tidy object. For example,
to remove all comments:

    $tidy_obj->prune('//comment()');

or to remove the third baz (XPath indexing is 1-based):

    $tidy_obj->prune('/foo/bar/baz[3]');

Pruning your XML tree is a form of tidying too so it snuck in here. =)

write()
The write() member function can take an optional filename parameter to
write out any changes to the Tidy object. If no parameters are given,
write() overwrites the original XML document file (if a 'filename'
parameter was given to the constructor).

write() will croak() if no filename can be found to write to.

write() can also take a secondary parameter which specifies an XPath
location to be written out as the new root element instead of the Tidy
object's root. Only the first matching element is written.

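The two write() parameters could be used as in this sketch. It assumes they are passed positionally (filename first, XPath location second), which is how the description above reads; the file names and sample document are made up:

```perl
use strict; use warnings;
use XML::Tidy;

# Made-up sample document (an assumption for this sketch)
my $tidy_obj = XML::Tidy->new('xml' =>
    '<friend><location><x>1</x><y>2</y></location></friend>');
$tidy_obj->tidy();

# No 'filename' was given to new(), so write() needs an explicit target here
$tidy_obj->write('Friend.xml');

# Assumed positional form: write only the first /friend/location element as the root
$tidy_obj->write('Location.xml', '/friend/location');
```
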
toString()
The toString() member function is almost identical to write() except
that it takes no parameters and simply returns the equivalent XML
string as a scalar. It is a little weird because normally only
XML::XPath::Node objects have a toString() member but I figure it makes
sense to extend the same syntax to the parent object as well, since it
is a useful option.

The following are just aliases to Node constructors. They'll work with
just the unique portion of the node type as the member function name.

e() or el() or elem() or createElement()
    wrapper for XML::XPath::Node::Element->new()

a() or at() or attr() or createAttribute()
    wrapper for XML::XPath::Node::Attribute->new()

c() or cm() or cmnt() or createComment()
    wrapper for XML::XPath::Node::Comment->new()

t() or tx() or text() or createTextNode()
    wrapper for XML::XPath::Node::Text->new()

p() or pi() or proc() or createProcessingInstruction()
    wrapper for XML::XPath::Node::PI->new()

n() or ns() or nspc() or createNamespace()
    wrapper for XML::XPath::Node::Namespace->new()

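As a sketch of how these wrappers might be combined, the following builds a small element and attaches it to an existing document. It assumes the wrappers are called as methods on the Tidy object and pass their arguments straight through to the XML::XPath::Node constructors; the element, attribute, and text values are made up:

```perl
use strict; use warnings;
use XML::Tidy;

my $tidy_obj = XML::Tidy->new('xml' => '<root/>');
my ($root)   = $tidy_obj->findnodes('/root');        # findnodes() is inherited from XML::XPath

my $item = $tidy_obj->e('item');                     # new <item> element
$item->appendAttribute($tidy_obj->a('id', '7'));     # id="7"
$item->appendChild($tidy_obj->t('hello'));           # text content
$root->appendChild($item);                           # attach under <root>

my $xml_out = $tidy_obj->toString();
print "$xml_out\n";
```
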
Since they are sometimes needed to compare against, XML::Tidy also
exports the same node constants as XML::XPath::Node (which correspond
to DOM values). These include:

    UNKNOWN_NODE
    ELEMENT_NODE
    ATTRIBUTE_NODE
    TEXT_NODE
    CDATA_SECTION_NODE
    ENTITY_REFERENCE_NODE
    ENTITY_NODE
    PROCESSING_INSTRUCTION_NODE
    COMMENT_NODE
    DOCUMENT_NODE
    DOCUMENT_TYPE_NODE
    DOCUMENT_FRAGMENT_NODE
    NOTATION_NODE
    ELEMENT_DECL_NODE
    ATT_DEF_NODE
    XML_DECL_NODE
    ATTLIST_DECL_NODE
    NAMESPACE_NODE

XML::Tidy also exports:

    STANDARD_XML_DECL

which returns a reasonable default XML declaration string (assuming
typical "utf-8" encoding).

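One use for the exported constants is comparing them against a node's type. The sketch below tallies node types in a made-up document; it assumes the constants are exported by default with a plain "use XML::Tidy;":

```perl
use strict; use warnings;
use XML::Tidy;

# Made-up sample document (an assumption for this sketch)
my $tidy_obj = XML::Tidy->new('xml' => '<root><!-- note --><a>text</a></root>');

my %count = (element => 0, text => 0, comment => 0);
for my $node ($tidy_obj->findnodes('//node()')) {
    my $type = $node->getNodeType();                 # method from XML::XPath::Node
    if    ($type == ELEMENT_NODE) { $count{element}++ }
    elsif ($type == TEXT_NODE)    { $count{text}++    }
    elsif ($type == COMMENT_NODE) { $count{comment}++ }
}
printf("%s: %d\n", $_, $count{$_}) for sort keys %count;
```
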
- fix reload() from messing up Unicode escaped &XYZ; components like
  Copyright © and Registered ® (probably needs pre and post
  processing)
- write many better UTF-8 tests
- support namespaces
- handle CDATA

Revision history for Perl extension XML::Tidy:

- 1.20 H79M9hU8 Sun Jul 9 09:43:30:08 -0500 2017
  * removed broken Build.PL to resolve
    <HTTPS://RT.CPAN.Org/Ticket/Display.html?id=122406>. (Thank you,
    Slaven.)

- 1.18 H78M5qm1 Sat Jul 8 05:52:48:01 -0500 2017
  * fixed new() to check file or xml to detect standalone in
    declaration, from <HTTPS://RT.CPAN.Org/Ticket/Display.html?id=122389>
    (Thanks Alex!)

  * traced tidy() memory leak from
    <HTTPS://RT.CPAN.Org/Ticket/Display.html?id=120296> (Thanks Jozef!)
    which seems to come from every XPath->findnodes() call

  * aligned synopsis comments

  * updated write() to use output encoding UTF-8 since that's what
    almost all XML should rely on (with thanks to RJBS for teaching me
    much from his great talk at
    <HTTPS://YouTube.Com/watch?v=TmTeXcEixEg>)

  * collapsed trailing curly braces on code blocks

  * added croak for any failed file open attempt

- 1.16 G6LM4EST Tue Jun 21 04:14:28:29 -0500 2016
  * stopped using my old fragile package generation and manually
    updated all distribution files (though Dist::Zilla should let me
    generate much again)

  * updated license to GPLv3+

  * fixed 00pod.t and 01podc.t to eval the Test modules from issue and
    patch: <HTTPS://RT.CPAN.Org/Public/Bug/Display.html?id=85592> (Thanks
    again MichielB.)

  * replaced all old '&&' with 'and' in POD

- 1.14 G6JMERCY Sun Jun 19 14:27:12:34 -0500 2016
  * separated old PT from VERSION to fix non-numeric issue:
    <HTTPS://RT.CPAN.Org/Public/Bug/Display.html?id=56073> (Thanks to
    Slaven.)

  * removed Unicode from POD but added encoding utf8 anyway to pass
    tests and resolve issues:
    <HTTPS://RT.CPAN.Org/Public/Bug/Display.html?id=92434> and
    <HTTPS://RT.CPAN.Org/Public/Bug/Display.html?id=85592> (Thanks to
    Sudhanshu and MichielB.)

- 1.12.B55J2qn Thu May 5 19:02:52:49 2011
  * made "1.0" float binarize as float again, rather than just "1" int

  * cleaned up POD and fixed EXPORTED CONSTANTS heads blocking together

- 1.10.B52FpLx Mon May 2 15:51:21:59 2011
  * added tests for undefined non-standard XML declaration to suppress
    warnings

- 1.8.B2AMvdl Thu Feb 10 22:57:39:47 2011
  * aligned .t code

  * added test for newline before -r to try to resolve:
    <HTTPS://RT.CPAN.Org/Ticket/Display.html?id=65471> (Thanks, Leandro.)

  * fixed off-by-one error when new gets a readable (non-newline)
    filename (that's not "filename" without a pre-'filename' param) to
    resolve: <HTTPS://RT.CPAN.Org/Ticket/Display.html?id=65151> (Thanks,
    Simone.)

- 1.6.A7RJKwl Tue Jul 27 19:20:58:47 2010
  * added head2 POD for EXPORTED CONSTANTS to try to pass t/00podc.t

- 1.4.A7QCvHw Mon Jul 26 12:57:17:58 2010
  * hacked a little test for non-UTF-8 decl str to resolve FrankGoss'
    need for ISO-8859-1 decl encoding to persist through tidying

  * made sure META.yml is being generated correctly for the CPAN

  * updated license to GPLv3

- 1.2.75BACCB Fri May 11 10:12:12:11 2007
  * made "1.0" float binarize as just "1" int

  * made ints signed and bounds checked

  * added new('binary' => 'BinFilename.xtb') option

- 1.2.54HJnFa Sun Apr 17 19:49:15:36 2005
  * fixed tidy() processing instruction stripping problem

  * added support for binary ints and floats in bcompress()

  * tightened up binary format and added pod

- 1.2.54HDR1G Sun Apr 17 13:27:01:16 2005
  * added bcompress() and bexpand()

  * added compress() and expand()

  * added toString()

- 1.2.4CKBHxt Mon Dec 20 11:17:59:55 2004
  * added exporting of XML::XPath::Node (DOM) constants

  * added node object creation wrappers (like LibXML)

- 1.2.4CCJW4G Sun Dec 12 19:32:04:16 2004
  * added optional 'xpath_loc' => to prune()

- 1.0.4CAJna1 Fri Dec 10 19:49:36:01 2004
  * added optional 'filename' => to write()

- 1.0.4CAAf5B Fri Dec 10 10:41:05:11 2004
  * removed 2nd param from tidy() so that 1st param is just indent
    string

  * fixed pod errors

- 1.0.4C9JpoP Thu Dec 9 19:51:50:25 2004
  * added xplc option to write()

  * added prune()

- 1.0.4C8K1Ah Wed Dec 8 20:01:10:43 2004
  * inherited from XPath so that those methods can be called directly

  * original version (separating Tidy.pm from Merge.pm)

From the command shell, please run:

    `perl -MCPAN -e "install XML::Tidy"`

or uncompress the package and run the standard:

    `perl Makefile.PL; make; make test; make install`

XML::Tidy requires:

    Carp                   to allow errors to croak() from calling sub

    XML::XPath             to use XPath statements to query and update XML

    XML::XPath::XMLParser  to parse XML documents into XPath objects

    Math::BaseCnv          to handle base-64 indexing for compress() and
                           expand()

Please report any bugs or feature requests to bug-XML-Tidy
at RT.CPAN.Org, or through the web interface at
<HTTPS://RT.CPAN.Org/NoAuth/ReportBug.html?Queue=XML-Tidy>.
I will be notified, and then you can be kept updated on progress on
your bug as I address fixes.

You can find documentation for this module (after it is installed) with
the perldoc command.

    `perldoc XML::Tidy`

You can also look for information at:

    RT: CPAN's Request Tracker

        HTTPS://RT.CPAN.Org/NoAuth/Bugs.html?Dist=XML-Tidy

    AnnoCPAN: Annotated CPAN documentation

        HTTP://AnnoCPAN.Org/dist/XML-Tidy

    CPAN Ratings

        HTTPS://CPANRatings.Perl.Org/d/XML-Tidy

    Search CPAN

        HTTP://Search.CPAN.Org/dist/XML-Tidy

Most source code should be Free! Code I have lawful authority over is
and shall be!

Copyright: (c) 2004-2017, Pip Stuart.
Copyleft : This software is licensed under the GNU General Public License
(version 3 or later). Please consult <HTTPS://GNU.Org/licenses/gpl-3.0.txt>
for important information about your freedom. This is Free Software: you
are free to change and redistribute it. There is NO WARRANTY, to the
extent permitted by law. See <HTTPS://FSF.Org> for further information.

Pip Stuart <Pip@CPAN.Org>

perl v5.32.1 2021-01-27 XML::Tidy(3)