1Text::CSV_PP(3) User Contributed Perl Documentation Text::CSV_PP(3)
2
3
4
6 Text::CSV_PP - Text::CSV_XS compatible pure-Perl module
7
9 use Text::CSV_PP;
10
11 $csv = Text::CSV_PP->new(); # create a new object
12 # If you want to handle non-ascii char.
13 $csv = Text::CSV_PP->new({binary => 1});
14
15 $status = $csv->combine(@columns); # combine columns into a string
16 $line = $csv->string(); # get the combined string
17
18 $status = $csv->parse($line); # parse a CSV string into fields
19 @columns = $csv->fields(); # get the parsed fields
20
21 $status = $csv->status (); # get the most recent status
22 $bad_argument = $csv->error_input (); # get the most recent bad argument
$diag = $csv->error_diag (); # if an error occurred, explains WHY
24
25 $status = $csv->print ($io, $colref); # Write an array of fields
26 # immediately to a file $io
27 $colref = $csv->getline ($io); # Read a line from file $io,
28 # parse it and return an array
29 # ref of fields
30 $csv->column_names (@names); # Set column names for getline_hr ()
31 $ref = $csv->getline_hr ($io); # getline (), but returns a hashref
32 $eof = $csv->eof (); # Indicate if last parse or
33 # getline () hit End Of File
34
35 $csv->types(\@t_array); # Set column types
36
Text::CSV_PP offers almost the same functions as Text::CSV_XS, which
provides facilities for the composition and decomposition of
comma-separated values. As its name suggests, Text::CSV_XS is an XS
module and Text::CSV_PP is a Pure Perl one.
42
44 1.29
45
46 This module is compatible with Text::CSV_XS 0.80 and later.
47
48 Unicode (UTF8)
On parsing (both for "getline ()" and "parse ()"), if the source is
marked as UTF8, then all fields that are marked binary will also be
marked UTF8.
52
53 On combining ("print ()" and "combine ()"), if any of the combining
54 fields was marked UTF8, the resulting string will be marked UTF8.
55
These methods are almost the same as those of Text::CSV_XS. Most of the
documentation was shamelessly copied and adapted from Text::CSV_XS.

See also Text::CSV_XS.
61
62 version ()
63 (Class method) Returns the current backend module version. If you want
64 the module version, you can use the "VERSION" method,
65
66 print Text::CSV->VERSION; # This module version
67 print Text::CSV->version; # The version of the worker module
68 # same as Text::CSV->backend->version
69
70 new (\%attr)
(Class method) Returns a new instance of Text::CSV_PP. The object's
attributes are described by the (optional) hash ref "\%attr".
73 Currently the following attributes are available:
74
75 eol An end-of-line string to add to rows. "undef" is replaced with an
76 empty string. The default is "$\". Common values for "eol" are
77 "\012" (Line Feed) or "\015\012" (Carriage Return, Line Feed).
78 Cannot be longer than 7 (ASCII) characters.
79
If both $/ and "eol" equal "\015", lines that end in only a
Carriage Return without a Line Feed will be parsed correctly. Line
endings, whether in $/ or "eol", other than "undef", "\n", "\r\n",
or "\r" are not (yet) supported for parsing.
84
85 sep_char
The char used for separating fields, by default a comma (",").
87 Limited to a single-byte character, usually in the range from 0x20
88 (space) to 0x7e (tilde).
89
90 The separation character can not be equal to the quote character.
91 The separation character can not be equal to the escape character.
92
93 See also "CAVEATS" in Text::CSV_XS
94
95 allow_whitespace
When this option is set to true, whitespace (TABs and SPACEs)
97 surrounding the separation character is removed when parsing. If
98 either TAB or SPACE is one of the three major characters
99 "sep_char", "quote_char", or "escape_char" it will not be
100 considered whitespace.
101
102 So lines like:
103
104 1 , "foo" , bar , 3 , zapp
105
are now correctly parsed, even though they violate the CSV specs.
107
108 Note that all whitespace is stripped from start and end of each
109 field. That would make it more a feature than a way to be able to
110 parse bad CSV lines, as
111
112 1, 2.0, 3, ape , monkey
113
114 will now be parsed as
115
116 ("1", "2.0", "3", "ape", "monkey")
117
118 even if the original line was perfectly sane CSV.
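
As a minimal sketch of the behaviour described above, the following parses the sample line from this section with "allow_whitespace" enabled. The variable names are illustrative only and the script assumes Text::CSV_PP is installed.

```perl
use strict;
use warnings;
use Text::CSV_PP;

# allow_whitespace strips blanks around separators and quoted fields
my $csv = Text::CSV_PP->new ({ allow_whitespace => 1 })
    or die "" . Text::CSV_PP->error_diag ();

# Sample line taken from the text above
my $line = q{1 , "foo" , bar , 3 , zapp};
$csv->parse ($line) or die "parse failed on: " . $csv->error_input ();

my @fields = $csv->fields ();
print join ("|", @fields), "\n";
```

Joining with "|" makes it easy to see that no leading or trailing blanks survive in any field.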
119
120 blank_is_undef
121 Under normal circumstances, CSV data makes no distinction between
122 quoted- and unquoted empty fields. They both end up in an empty
123 string field once read, so
124
125 1,"",," ",2
126
127 is read as
128
129 ("1", "", "", " ", "2")
130
131 When writing CSV files with "always_quote" set, the unquoted empty
132 field is the result of an undefined value. To make it possible to
133 also make this distinction when reading CSV data, the
134 "blank_is_undef" option will cause unquoted empty fields to be set
135 to undef, causing the above to be parsed as
136
137 ("1", "", undef, " ", "2")
138
139 empty_is_undef
140 Going one step further than "blank_is_undef", this attribute
141 converts all empty fields to undef, so
142
143 1,"",," ",2
144
145 is read as
146
("1", undef, undef, " ", "2")
148
Note that this only affects fields that are really empty, not fields
that are empty after stripping allowed whitespace. YMMV.
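
The difference between "blank_is_undef" and "empty_is_undef" can be sketched as follows, using the sample line from the text. The helper "show" is a hypothetical convenience for making undef visible and is not part of the module.

```perl
use strict;
use warnings;
use Text::CSV_PP;

my $line = q{1,"",," ",2};

# Only unquoted empty fields become undef
my $blank = Text::CSV_PP->new ({ blank_is_undef => 1 });
$blank->parse ($line) or die "parse failed";
my @b = $blank->fields ();

# Quoted and unquoted empty fields both become undef
my $empty = Text::CSV_PP->new ({ empty_is_undef => 1 });
$empty->parse ($line) or die "parse failed";
my @e = $empty->fields ();

# Hypothetical helper: render undef as the bare word "undef"
sub show { join ",", map { defined $_ ? "'$_'" : "undef" } @_ }

print show (@b), "\n";
print show (@e), "\n";
```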
151
152 quote_char
153 The char used for quoting fields containing blanks, by default the
154 double quote character ("""). A value of undef suppresses quote
155 chars. (For simple cases only). Limited to a single-byte
156 character, usually in the range from 0x20 (space) to 0x7e (tilde).
157
158 The quote character can not be equal to the separation character.
159
160 allow_loose_quotes
161 By default, parsing fields that have "quote_char" characters inside
162 an unquoted field, like
163
164 1,foo "bar" baz,42
165
166 would result in a parse error. Though it is still bad practice to
167 allow this format, we cannot help there are some vendors that make
168 their applications spit out lines styled like this.
169
170 In case there is really bad CSV data, like
171
172 1,"foo "bar" baz",42
173
174 or
175
176 1,""foo bar baz"",42
177
178 there is a way to get that parsed, and leave the quotes inside the
179 quoted field as-is. This can be achieved by setting
180 "allow_loose_quotes" AND making sure that the "escape_char" is not
181 equal to "quote_char".
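
A small sketch of the first case above: the same line is handed to a default parser and to one created with "allow_loose_quotes". Only the unquoted-field case is shown; the variable names are illustrative.

```perl
use strict;
use warnings;
use Text::CSV_PP;

my $line = q{1,foo "bar" baz,42};

# Default settings: the text above says this line is rejected
my $strict    = Text::CSV_PP->new ();
my $strict_ok = $strict->parse ($line);

# With allow_loose_quotes the quotes are kept as ordinary characters
my $loose = Text::CSV_PP->new ({ allow_loose_quotes => 1 });
$loose->parse ($line) or die "parse failed: " . $loose->error_diag ();
my @fields = $loose->fields ();

print $strict_ok ? "strict: ok\n" : "strict: rejected\n";
print "loose:  $fields[1]\n";
```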
182
183 escape_char
184 The character used for escaping certain characters inside quoted
185 fields. Limited to a single-byte character, usually in the range
186 from 0x20 (space) to 0x7e (tilde).
187
188 The "escape_char" defaults to being the literal double-quote mark
189 (""") in other words, the same as the default "quote_char". This
190 means that doubling the quote mark in a field escapes it:
191
192 "foo","bar","Escape ""quote mark"" with two ""quote marks""","baz"
193
194 If you change the default quote_char without changing the default
195 escape_char, the escape_char will still be the quote mark. If
196 instead you want to escape the quote_char by doubling it, you will
197 need to change the escape_char to be the same as what you changed
198 the quote_char to.
199
200 The escape character can not be equal to the separation character.
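
To illustrate the default doubling of the quote mark, this sketch combines a field containing quotes and parses the result back. It assumes the default "quote_char" and "escape_char"; the sample values are made up.

```perl
use strict;
use warnings;
use Text::CSV_PP;

my $csv = Text::CSV_PP->new ();

# The embedded quote marks are escaped by doubling them
$csv->combine ("foo", 'say "hi"') or die "combine failed";
my $line = $csv->string ();
print "$line\n";

# Parsing the line again restores the original field
$csv->parse ($line) or die "parse failed";
my @back = $csv->fields ();
print "$back[1]\n";
```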
201
202 allow_loose_escapes
203 By default, parsing fields that have "escape_char" characters that
204 escape characters that do not need to be escaped, like:
205
206 my $csv = Text::CSV->new ({ escape_char => "\\" });
207 $csv->parse (qq{1,"my bar\'s",baz,42});
208
209 would result in a parse error. Though it is still bad practice to
210 allow this format, this option enables you to treat all escape
211 character sequences equal.
212
213 binary
214 If this attribute is TRUE, you may use binary characters in quoted
215 fields, including line feeds, carriage returns and NULL bytes. (The
216 latter must be escaped as ""0".) By default this feature is off.
217
218 If a string is marked UTF8, binary will be turned on automatically
219 when binary characters other than CR or NL are encountered. Note
220 that a simple string like "\x{00a0}" might still be binary, but not
221 marked UTF8, so setting "{ binary => 1 }" is still a wise option.
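
As a brief sketch, a quoted field with an embedded line feed can be parsed once "binary" is set. The input string is an invented example.

```perl
use strict;
use warnings;
use Text::CSV_PP;

# binary => 1 permits line feeds (and other binary data) in quoted fields
my $csv = Text::CSV_PP->new ({ binary => 1 });

my $line = qq{1,"two\nlines",3};
$csv->parse ($line) or die "parse failed: " . $csv->error_diag ();

my @fields = $csv->fields ();
printf "%d fields, second field has %d characters\n",
    scalar @fields, length $fields[1];
```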
222
223 types
224 A set of column types; this attribute is immediately passed to the
225 types method below. You must not set this attribute otherwise,
226 except for using the types method. For details see the description
227 of the types method below.
228
229 always_quote
By default the generated fields are quoted only if they need to be,
for example if they contain the separator. If you set this
232 attribute to a TRUE value, then all defined fields will be quoted.
233 This is typically easier to handle in external applications.
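
A minimal sketch of "always_quote": defined fields are quoted, while an undefined value is written as an unquoted empty field. The sample row is made up for illustration.

```perl
use strict;
use warnings;
use Text::CSV_PP;

my $csv = Text::CSV_PP->new ({ always_quote => 1 });

# Third value is undef, fourth is a defined empty string
$csv->combine (1, "a", undef, "") or die "combine failed";
my $line = $csv->string ();
print "$line\n";
```

Reading such a line back with "blank_is_undef" set lets you recover which fields were undefined when written.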
234
235 quote_space
By default, a space in a field triggers quotation. As no rule in
CSV requires this, nor the opposite, the default is true for
safety. You can exclude the space from this trigger by setting this
option to 0.
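
The effect of "quote_space" can be sketched by combining the same row with two objects. The row contents are illustrative.

```perl
use strict;
use warnings;
use Text::CSV_PP;

my @row = ("a b", "c");

# Default (quote_space => 1): the field containing a space is quoted
my $default = Text::CSV_PP->new ();
$default->combine (@row) or die "combine failed";
my $quoted = $default->string ();

# quote_space => 0: a space alone no longer triggers quotation
my $bare = Text::CSV_PP->new ({ quote_space => 0 });
$bare->combine (@row) or die "combine failed";
my $unquoted = $bare->string ();

print "$quoted\n$unquoted\n";
```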
240
241 quote_null
242 By default, a NULL byte in a field would be escaped. This attribute
243 enables you to treat the NULL byte as a simple binary character in
244 binary mode (the "{ binary => 1 }" is set). The default is true.
245 You can prevent NULL escapes by setting this attribute to 0.
246
247 keep_meta_info
248 By default, the parsing of input lines is as simple and fast as
249 possible. However, some parsing information - like quotation of the
250 original field - is lost in that process. Set this flag to true to
251 be able to retrieve that information after parsing with the methods
252 "meta_info ()", "is_quoted ()", and "is_binary ()" described below.
253 Default is false.
254
255 verbatim
256 This is a quite controversial attribute to set, but it makes hard
257 things possible.
258
259 The basic thought behind this is to tell the parser that the
260 normally special characters newline (NL) and Carriage Return (CR)
261 will not be special when this flag is set, and be dealt with as
262 being ordinary binary characters. This will ease working with data
263 with embedded newlines.
264
When "verbatim" is used with "getline ()", "getline ()" auto-chomps
every line.
267
268 Imagine a file format like
269
270 M^^Hans^Janssen^Klas 2\n2A^Ja^11-06-2007#\r\n
271
where the line ending is a very specific "#\r\n", and the sep_char
273 is a ^ (caret). None of the fields is quoted, but embedded binary
274 data is likely to be present. With the specific line ending, that
275 shouldn't be too hard to detect.
276
By default, however, Text::CSV's parse function only knows "\n" and
"\r" as legal line endings, so it has to treat the embedded newline
as a real end-of-line, and it can only scan on into the next line if
binary is true and the newline is inside a quoted field. With this
attribute, however, we can tell parse () to parse the line as if \n
is nothing more than a binary character.
284
285 For parse () this means that the parser has no idea about line
286 ending anymore, and getline () chomps line endings on reading.
287
288 auto_diag
Setting this to true will cause "error_diag ()" to be called
automatically in void context upon errors.

If set to a value greater than 1, it will die on errors instead of
warning.
294
For future plans and the differences from the XS version, please see
"auto_diag" in Text::CSV_XS.
297
298 To sum it up,
299
300 $csv = Text::CSV_PP->new ();
301
302 is equivalent to
303
304 $csv = Text::CSV_PP->new ({
305 quote_char => '"',
306 escape_char => '"',
307 sep_char => ',',
308 eol => $\,
309 always_quote => 0,
310 quote_space => 1,
311 quote_null => 1,
312 binary => 0,
313 keep_meta_info => 0,
314 allow_loose_quotes => 0,
315 allow_loose_escapes => 0,
316 allow_whitespace => 0,
317 blank_is_undef => 0,
318 empty_is_undef => 0,
319 verbatim => 0,
320 auto_diag => 0,
321 });
322
For all of the above mentioned flags, there is an accessor method
available with which you can inquire the current value or change the
value:
326
327 my $quote = $csv->quote_char;
328 $csv->binary (1);
329
330 It is unwise to change these settings halfway through writing CSV data
331 to a stream. If however, you want to create a new stream using the
332 available CSV object, there is no harm in changing them.
333
If the "new ()" constructor call fails, it returns "undef" and makes
the reason for the failure available through the "error_diag ()" method.
336
337 $csv = Text::CSV->new ({ ecs_char => 1 }) or
338 die "" . Text::CSV->error_diag ();
339
340 "error_diag ()" will return a string like
341
342 "INI - Unknown attribute 'ecs_char'"
343
344 print
345 $status = $csv->print ($io, $colref);
346
347 Similar to "combine () + string () + print", but more efficient. It
348 expects an array ref as input (not an array!) and the resulting string
349 is not really created (XS version), but immediately written to the $io
350 object, typically an IO handle or any other object that offers a print
351 method. Note, this implies that the following is wrong in perl 5.005_xx
352 and older:
353
354 open FILE, ">", "whatever";
355 $status = $csv->print (\*FILE, $colref);
356
357 as in perl 5.005 and older, the glob "\*FILE" is not an object, thus it
358 doesn't have a print method. The solution is to use an IO::File object
359 or to hide the glob behind an IO::Wrap object. See IO::File and
360 IO::Wrap for details.
361
362 For performance reasons the print method doesn't create a result
363 string. (If its backend is PP version, result strings are created
internally.) In particular the $csv->string (), $csv->status (),
$csv->fields () and $csv->error_input () methods are meaningless after
executing this method.
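
The following sketch writes two rows with "print ()". To stay self-contained it writes to an in-memory file handle rather than a real file, and it loads IO::Handle so that the lexical handle offers a print method on older perls; both choices are assumptions of this example, not requirements stated above.

```perl
use strict;
use warnings;
use IO::Handle;      # gives plain file handles a print method
use Text::CSV_PP;

# eol is set so that every row is terminated by a newline
my $csv = Text::CSV_PP->new ({ eol => "\n" });

my $buffer = "";
open my $io, ">", \$buffer or die "open: $!";

# print () expects an array ref per row, not a list
for my $row ([ "code", "name" ], [ 1, "apple" ]) {
    $csv->print ($io, $row) or die "print failed";
}
close $io or die "close: $!";

print $buffer;
```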
367
368 combine
369 $status = $csv->combine (@columns);
370
371 This object function constructs a CSV string from the arguments,
372 returning success or failure. Failure can result from lack of
373 arguments or an argument containing an invalid character. Upon
374 success, "string ()" can be called to retrieve the resultant CSV
375 string. Upon failure, the value returned by "string ()" is undefined
376 and "error_input ()" can be called to retrieve an invalid argument.
377
378 string
379 $line = $csv->string ();
380
381 This object function returns the input to "parse ()" or the resultant
382 CSV string of "combine ()", whichever was called more recently.
383
384 getline
385 $colref = $csv->getline ($io);
386
387 This is the counterpart to print, like parse is the counterpart to
388 combine: It reads a row from the IO object $io using $io->getline ()
389 and parses this row into an array ref. This array ref is returned by
390 the function or undef for failure.
391
392 When fields are bound with "bind_columns ()", the return value is a
393 reference to an empty list.
394
395 The $csv->string (), $csv->fields () and $csv->status () methods are
396 meaningless, again.
397
398 getline_all
399 $arrayref = $csv->getline_all ($io);
400 $arrayref = $csv->getline_all ($io, $offset);
401 $arrayref = $csv->getline_all ($io, $offset, $length);
402
403 This will return a reference to a list of "getline ($io)" results. In
404 this call, "keep_meta_info" is disabled. If $offset is negative, as
405 with "splice ()", only the last "abs ($offset)" records of $io are
406 taken into consideration.
407
408 Given a CSV file with 10 lines:
409
410 lines call
411 ----- ---------------------------------------------------------
412 0..9 $csv->getline_all ($io) # all
413 0..9 $csv->getline_all ($io, 0) # all
414 8..9 $csv->getline_all ($io, 8) # start at 8
415 - $csv->getline_all ($io, 0, 0) # start at 0 first 0 rows
416 0..4 $csv->getline_all ($io, 0, 5) # start at 0 first 5 rows
417 4..5 $csv->getline_all ($io, 4, 2) # start at 4 first 2 rows
418 8..9 $csv->getline_all ($io, -2) # last 2 rows
419 6..7 $csv->getline_all ($io, -4, 2) # first 2 of last 4 rows
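
As a sketch of one row of the table above, this reads a ten-line in-memory CSV and asks for two rows starting at offset 4. The generated data and the use of an in-memory handle are assumptions made for the example.

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_PP;

# Ten lines: "row0,0" .. "row9,9"
my $data = join "", map { "row$_,$_\n" } 0 .. 9;
open my $io, "<", \$data or die "open: $!";

my $csv  = Text::CSV_PP->new ();
my $rows = $csv->getline_all ($io, 4, 2);   # start at 4, first 2 rows
close $io;

# Collect the second column of every returned row
my @ids = map { $_->[1] } @$rows;
print "@ids\n";
```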
420
421 parse
422 $status = $csv->parse ($line);
423
This object function decomposes a CSV string into fields, returning
success or failure. Failure can result from a missing argument or from
an improperly formatted CSV string. Upon success, "fields ()"
can be called to retrieve the decomposed fields. Upon failure, the
value returned by "fields ()" is undefined and "error_input ()" can be
called to retrieve the invalid argument.
430
431 You may use the types () method for setting column types. See the
432 description below.
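
A minimal sketch of checking the status of "parse ()" before using "fields ()". The two input lines are invented; the second has an unterminated quoted field and is expected to fail.

```perl
use strict;
use warnings;
use Text::CSV_PP;

my $csv = Text::CSV_PP->new ();
my @results;

for my $line (q{a,b,c}, q{a,"unterminated}) {
    if ($csv->parse ($line)) {
        # Success: fields () holds the decomposed fields
        push @results, [ ok => $csv->fields () ];
    }
    else {
        # Failure: error_input () holds the offending argument
        push @results, [ bad => $csv->error_input () ];
    }
}

print "$_->[0]\n" for @results;
```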
433
434 getline_hr
435 The "getline_hr ()" and "column_names ()" methods work together to
436 allow you to have rows returned as hashrefs. You must call
437 "column_names ()" first to declare your column names.
438
439 $csv->column_names (qw( code name price description ));
440 $hr = $csv->getline_hr ($io);
441 print "Price for $hr->{name} is $hr->{price} EUR\n";
442
443 "getline_hr ()" will croak if called before "column_names ()".
444
445 getline_hr_all
446 $arrayref = $csv->getline_hr_all ($io);
447
448 This will return a reference to a list of "getline_hr ($io)" results.
449 In this call, "keep_meta_info" is disabled.
450
451 column_names
452 Set the keys that will be used in the "getline_hr ()" calls. If no keys
453 (column names) are passed, it'll return the current setting.
454
"column_names ()" accepts a list of scalars (the column names) or a
single array_ref, so you can pass the result of "getline ()":
457
458 $csv->column_names ($csv->getline ($io));
459
460 "column_names ()" does no checking on duplicates at all, which might
461 lead to unwanted results. Undefined entries will be replaced with the
462 string "\cAUNDEF\cA", so
463
464 $csv->column_names (undef, "", "name", "name");
465 $hr = $csv->getline_hr ($io);
466
will set "$hr->{"\cAUNDEF\cA"}" to the 1st field, "$hr->{""}" to the
2nd field, and "$hr->{name}" to the 4th field, discarding the 3rd
field.
470
471 "column_names ()" croaks on invalid arguments.
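
Putting "column_names ()" and "getline_hr ()" together, this sketch takes the column names from the first line of the input and then reads every remaining row as a hashref. The data is made up and is read from an in-memory handle to keep the example self-contained.

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_PP;

my $data = "code,name,price\n1,apple,0.50\n2,pear,0.75\n";
open my $io, "<", \$data or die "open: $!";

my $csv = Text::CSV_PP->new ();

# The header row supplies the hash keys
$csv->column_names ($csv->getline ($io));

my @seen;
while (my $hr = $csv->getline_hr ($io)) {
    push @seen, "$hr->{name}=$hr->{price}";
}
close $io;

print "@seen\n";
```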
472
473 bind_columns
Takes a list of references to scalars in which to store the fields
fetched by "getline ()". When you don't pass enough references to store the
476 fetched fields in, "getline ()" will fail. If you pass more than there
477 are fields to return, the remaining references are left untouched.
478
479 $csv->bind_columns (\$code, \$name, \$price, \$description);
480 while ($csv->getline ($io)) {
481 print "The price of a $name is \x{20ac} $price\n";
482 }
483
484 eof
485 $eof = $csv->eof ();
486
487 If "parse ()" or "getline ()" was used with an IO stream, this method
488 will return true (1) if the last call hit end of file, otherwise it
489 will return false (''). This is useful to see the difference between a
490 failure and end of file.
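
One way to use "eof ()" is to check it after a read loop ends, so that a parse failure is not mistaken for the end of the input. This is a sketch with invented data read from an in-memory handle.

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_PP;

my $data = "1,2\n3,4\n";
open my $io, "<", \$data or die "open: $!";

my $csv   = Text::CSV_PP->new ();
my $count = 0;
while (my $row = $csv->getline ($io)) {
    $count++;
}

# getline () returned undef: find out whether that was EOF or an error
my $at_eof = $csv->eof ();
unless ($at_eof) {
    die "stopped on error: " . $csv->error_diag ();
}
close $io;

print "read $count rows\n";
```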
491
492 types
493 $csv->types (\@tref);
494
495 This method is used to force that columns are of a given type. For
496 example, if you have an integer column, two double columns and a string
497 column, then you might do a
498
499 $csv->types ([Text::CSV_PP::IV (),
500 Text::CSV_PP::NV (),
501 Text::CSV_PP::NV (),
502 Text::CSV_PP::PV ()]);
503
504 Column types are used only for decoding columns, in other words by the
505 parse () and getline () methods.
506
507 You can unset column types by doing a
508
509 $csv->types (undef);
510
511 or fetch the current type settings with
512
513 $types = $csv->types ();
514
515 IV Set field type to integer.
516
517 NV Set field type to numeric/float.
518
519 PV Set field type to string.
520
521 fields
522 @columns = $csv->fields ();
523
This object function returns the input to "combine ()" or the resultant
decomposed fields of a successful "parse ()", whichever was called more
recently.
527
528 Note that the return value is undefined after using "getline ()", which
529 does not fill the data structures returned by "parse ()".
530
531 meta_info
532 @flags = $csv->meta_info ();
533
534 This object function returns the flags of the input to "combine ()" or
535 the flags of the resultant decomposed fields of "parse ()", whichever
536 was called more recently.
537
538 For each field, a meta_info field will hold flags that tell something
539 about the field returned by the "fields ()" method or passed to the
540 "combine ()" method. The flags are bitwise-or'd like:
541
542 0x0001
543 The field was quoted.
544
545 0x0002
546 The field was binary.
547
548 See the "is_*** ()" methods below.
549
550 is_quoted
551 my $quoted = $csv->is_quoted ($column_idx);
552
553 Where $column_idx is the (zero-based) index of the column in the last
554 result of "parse ()".
555
556 This returns a true value if the data in the indicated column was
557 enclosed in "quote_char" quotes. This might be important for data where
558 ",20070108," is to be treated as a numeric value, and where
559 ","20070108"," is explicitly marked as character string data.
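
The case described above can be sketched like this. Note that "keep_meta_info" must be set for "is_quoted ()" to have anything to report; the input line is illustrative.

```perl
use strict;
use warnings;
use Text::CSV_PP;

# Meta information is only retained when keep_meta_info is true
my $csv = Text::CSV_PP->new ({ keep_meta_info => 1 });

$csv->parse (q{20070108,"20070108"}) or die "parse failed";

# 1 for quoted columns, 0 for unquoted ones
my @quoted = map { $csv->is_quoted ($_) ? 1 : 0 } 0 .. 1;
print "@quoted\n";
```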
560
561 is_binary
562 my $binary = $csv->is_binary ($column_idx);
563
564 Where $column_idx is the (zero-based) index of the column in the last
565 result of "parse ()".
566
567 This returns a true value if the data in the indicated column contained
any byte in the range [\x00-\x08,\x10-\x1F,\x7F-\xFF].
569
570 status
571 $status = $csv->status ();
572
573 This object function returns success (or failure) of "combine ()" or
574 "parse ()", whichever was called more recently.
575
576 error_input
577 $bad_argument = $csv->error_input ();
578
579 This object function returns the erroneous argument (if it exists) of
580 "combine ()" or "parse ()", whichever was called more recently.
581
582 error_diag
583 Text::CSV_PP->error_diag ();
584 $csv->error_diag ();
585 $error_code = 0 + $csv->error_diag ();
586 $error_str = "" . $csv->error_diag ();
587 ($cde, $str, $pos) = $csv->error_diag ();
588
If (and only if) an error occurred, this function returns the
diagnostics of that error.
591
592 If called in void context, it will print the internal error code and
593 the associated error message to STDERR.
594
595 If called in list context, it will return the error code and the error
596 message in that order. If the last error was from parsing, the third
597 value returned is the best guess at the location within the line that
was being parsed. Its value is 1-based.
599
600 Note: $pos does not show the error point in many cases. It is for
601 conscience's sake.
602
603 If called in scalar context, it will return the diagnostics in a single
604 scalar, a-la $!. It will contain the error code in numeric context, and
605 the diagnostics message in string context.
606
To achieve this behavior with CSV_PP, the returned diagnostics is a
blessed object.
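
The list and scalar contexts described above can be sketched as follows. The failing input (an unterminated quoted field) is an invented example, and no particular error code is assumed.

```perl
use strict;
use warnings;
use Text::CSV_PP;

my $csv = Text::CSV_PP->new ();
my $ok  = $csv->parse (q{1,"unterminated});

# List context: code, message and best-guess position
my ($code, $message, $pos) = $csv->error_diag ();

# Scalar context: one value that acts as a number or as a string
my $diag      = $csv->error_diag ();
my $as_number = 0 + $diag;
my $as_string = "" . $diag;

print "parse ", ($ok ? "succeeded" : "failed"), "\n";
print "code $as_number: $as_string\n";
```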
609
610 SetDiag
611 $csv->SetDiag (0);
612
613 Use to reset the diagnostics if you are dealing with errors.
614
If an error occurred, $csv->error_diag () can be used to get more
information on the cause of the failure. Note that for speed reasons,
the internal value is never cleared on success, so using the value
returned by error_diag () in normal cases - when no error occurred - may
cause unexpected results.
621
Note: CSV_PP's diagnostics differ from CSV_XS's:

Text::CSV_XS parses CSV strings one character at a time, while
Text::CSV_PP uses regular expressions. Because of that difference, the
reported cause of a failure may differ between the two.
627
628 Currently these errors are available:
629
630 1001 "sep_char is equal to quote_char or escape_char"
631 The separation character cannot be equal to either the quotation
632 character or the escape character, as that will invalidate all
633 parsing rules.
634
635 1002 "INI - allow_whitespace with escape_char or quote_char SP or TAB"
636 Using "allow_whitespace" when either "escape_char" or "quote_char" is
637 equal to SPACE or TAB is too ambiguous to allow.
638
639 1003 "INI - \r or \n in main attr not allowed"
640 Using default "eol" characters in either "sep_char", "quote_char", or
641 "escape_char" is not allowed.
642
643 2010 "ECR - QUO char inside quotes followed by CR not part of EOL"
644 2011 "ECR - Characters after end of quoted field"
645 2021 "EIQ - NL char inside quotes, binary off"
646 2022 "EIQ - CR char inside quotes, binary off"
647 2025 "EIQ - Loose unescaped escape"
648 2026 "EIQ - Binary character inside quoted field, binary off"
649 2027 "EIQ - Quoted field not terminated"
650 2030 "EIF - NL char inside unquoted verbatim, binary off"
2031 "EIF - CR char is first char of field, not part of EOL"
2032 "EIF - CR char inside unquoted, not part of EOL"
2034 "EIF - Loose unescaped quote"
2037 "EIF - Binary character in unquoted field, binary off"
655 2110 "ECB - Binary character in Combine, binary off"
656 2200 "EIO - print to IO failed. See errno"
657 4002 "EIQ - Unescaped ESC in quoted field"
658 4003 "EIF - ESC CR"
659 4004 "EUF - "
660 3001 "EHR - Unsupported syntax for column_names ()"
661 3002 "EHR - getline_hr () called before column_names ()"
662 3003 "EHR - bind_columns () and column_names () fields count mismatch"
663 3004 "EHR - bind_columns () only accepts refs to scalars"
664 3006 "EHR - bind_columns () did not pass enough refs for parsed fields"
665 3007 "EHR - bind_columns needs refs to writable scalars"
666 3008 "EHR - unexpected error in bound fields"
667
669 Makamaka Hannyaharamitu, <makamaka[at]cpan.org>
670
671 Text::CSV_XS was written by <joe[at]ispsoft.de> and maintained by
672 <h.m.brand[at]xs4all.nl>.
673
674 Text::CSV was written by <alan[at]mfgrtl.com>.
675
677 Copyright 2005-2010 by Makamaka Hannyaharamitu, <makamaka[at]cpan.org>
678
679 This library is free software; you can redistribute it and/or modify it
680 under the same terms as Perl itself.
681
683 Text::CSV_XS, Text::CSV
684
685 I got many regexp bases from <http://www.din.or.jp/~ohzaki/perl.htm>
686
687
688
689perl v5.12.3 2010-12-27 Text::CSV_PP(3)