Text::CSV_PP(3pm)

Text::CSV_PP(3)       User Contributed Perl Documentation      Text::CSV_PP(3)

NAME

       Text::CSV_PP - Text::CSV_XS compatible pure-Perl module

SYNOPSIS

        use Text::CSV_PP;

        $csv = Text::CSV_PP->new();     # create a new object
        # If you want to handle non-ASCII characters:
        $csv = Text::CSV_PP->new({binary => 1});

        $status = $csv->combine(@columns);    # combine columns into a string
        $line   = $csv->string();             # get the combined string

        $status  = $csv->parse($line);        # parse a CSV string into fields
        @columns = $csv->fields();            # get the parsed fields

        $status       = $csv->status ();      # get the most recent status
        $bad_argument = $csv->error_input (); # get the most recent bad argument
        $diag         = $csv->error_diag ();  # if an error occurred, explains WHY

        $status = $csv->print ($io, $colref); # Write an array of fields
                                              # immediately to a file $io
        $colref = $csv->getline ($io);        # Read a line from file $io,
                                              # parse it and return an array
                                              # ref of fields
        $csv->column_names (@names);          # Set column names for getline_hr ()
        $ref = $csv->getline_hr ($io);        # getline (), but returns a hashref
        $eof = $csv->eof ();                  # Indicate if last parse or
                                              # getline () hit End Of File

        $csv->types(\@t_array);               # Set column types

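       The calls above can be tied together in a small round trip.  This is
       only a sketch: the sample row is made up for illustration, and the
       error handling is one possible style, not the only one.

```perl
use strict;
use warnings;
use Text::CSV_PP;

# binary => 1 allows non-ASCII characters and embedded newlines in fields.
my $csv = Text::CSV_PP->new ({ binary => 1 })
    or die "Cannot create CSV object: " . Text::CSV_PP->error_diag ();

# Hypothetical sample row: a plain value, one with a space, one with quotes.
my @columns = ("1", "foo bar", 'say "hi"');

# Fields -> CSV string.
$csv->combine (@columns)
    or die "combine () failed on: " . $csv->error_input ();
my $line = $csv->string ();

# CSV string -> fields.
$csv->parse ($line)
    or die "parse () failed on: " . $csv->error_input ();
my @fields = $csv->fields ();

print "$line\n";                    # 1,"foo bar","say ""hi"""
print join ("|", @fields), "\n";    # 1|foo bar|say "hi"
```

       With the default settings the field containing a space is quoted and
       embedded quotes are doubled, so the parsed fields match the input.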

DESCRIPTION

       Text::CSV_PP provides almost the same functions as Text::CSV_XS, which
       offers facilities for the composition and decomposition of
       comma-separated values. As the names suggest, Text::CSV_XS is an XS
       module and Text::CSV_PP is a pure-Perl one.

VERSION

           1.29

       This module is compatible with Text::CSV_XS 0.80 and later.

   Unicode (UTF8)
       On parsing (both for "getline ()" and "parse ()"), if the source is
       marked as UTF8, then all fields that are marked binary will also be
       marked UTF8.

       On combining ("print ()" and "combine ()"), if any of the combined
       fields was marked UTF8, the resulting string will be marked UTF8.

FUNCTIONS

       These methods are almost the same as those of Text::CSV_XS.  Most of
       the documentation was shamelessly copied and adapted from
       Text::CSV_XS.

       See Text::CSV_XS.

   version ()
       (Class method) Returns the current backend module version.  If you want
       the module version, you can use the "VERSION" method:

        print Text::CSV->VERSION;      # This module version
        print Text::CSV->version;      # The version of the worker module
                                       # same as Text::CSV->backend->version

   new (\%attr)
       (Class method) Returns a new instance of Text::CSV_PP. The object's
       attributes are described by the (optional) hash ref "\%attr".
       Currently the following attributes are available:

       eol An end-of-line string to add to rows. "undef" is replaced with an
           empty string. The default is "$\". Common values for "eol" are
           "\012" (Line Feed) or "\015\012" (Carriage Return, Line Feed).
           Cannot be longer than 7 (ASCII) characters.

           If both $/ and "eol" equal "\015", lines that end with only a
           Carriage Return without a Line Feed will be parsed correctly.
           Line endings, whether in $/ or "eol", other than "undef", "\n",
           "\r\n", or "\r" are not (yet) supported for parsing.

       sep_char
           The char used for separating fields, by default a comma (",").
           Limited to a single-byte character, usually in the range from 0x20
           (space) to 0x7e (tilde).

           The separation character cannot be equal to the quote character.
           The separation character cannot be equal to the escape character.

           See also "CAVEATS" in Text::CSV_XS

       allow_whitespace
           When this option is set to true, whitespace (TABs and SPACEs)
           surrounding the separation character is removed when parsing. If
           either TAB or SPACE is one of the three major characters
           "sep_char", "quote_char", or "escape_char" it will not be
           considered whitespace.

           So lines like:

             1 , "foo" , bar , 3 , zapp

           are now correctly parsed, even though they violate the CSV specs.

           Note that all whitespace is stripped from the start and end of
           each field. That makes it more of a feature than a way to parse
           bad CSV lines, as

            1,   2.0,  3,   ape  , monkey

           will now be parsed as

            ("1", "2.0", "3", "ape", "monkey")

           even if the original line was perfectly sane CSV.

       blank_is_undef
           Under normal circumstances, CSV data makes no distinction between
           quoted and unquoted empty fields. They both end up as an empty
           string field once read, so

            1,"",," ",2

           is read as

            ("1", "", "", " ", "2")

           When writing CSV files with "always_quote" set, the unquoted empty
           field is the result of an undefined value. To make it possible to
           also make this distinction when reading CSV data, the
           "blank_is_undef" option will cause unquoted empty fields to be set
           to undef, causing the above to be parsed as

            ("1", "", undef, " ", "2")

       empty_is_undef
           Going one step further than "blank_is_undef", this attribute
           converts all empty fields to undef, so

            1,"",," ",2

           is read as

            (1, undef, undef, " ", 2)

           Note that this only affects fields that are really empty, not
           fields that are empty after stripping allowed whitespace. YMMV.

       quote_char
           The char used for quoting fields containing blanks, by default the
           double quote character ("""). A value of undef suppresses quote
           chars (for simple cases only).  Limited to a single-byte
           character, usually in the range from 0x20 (space) to 0x7e (tilde).

           The quote character cannot be equal to the separation character.

       allow_loose_quotes
           By default, parsing fields that have "quote_char" characters inside
           an unquoted field, like

            1,foo "bar" baz,42

           would result in a parse error. Though it is still bad practice to
           allow this format, we cannot help the fact that some vendors make
           their applications spit out lines styled like this.

           In case there is really bad CSV data, like

            1,"foo "bar" baz",42

           or

            1,""foo bar baz"",42

           there is a way to get that parsed, and leave the quotes inside the
           quoted field as-is. This can be achieved by setting
           "allow_loose_quotes" AND making sure that the "escape_char" is not
           equal to "quote_char".

       escape_char
           The character used for escaping certain characters inside quoted
           fields.  Limited to a single-byte character, usually in the range
           from 0x20 (space) to 0x7e (tilde).

           The "escape_char" defaults to being the literal double-quote mark
           ("""), in other words, the same as the default "quote_char". This
           means that doubling the quote mark in a field escapes it:

             "foo","bar","Escape ""quote mark"" with two ""quote marks""","baz"

           If you change the default quote_char without changing the default
           escape_char, the escape_char will still be the quote mark.  If
           instead you want to escape the quote_char by doubling it, you will
           need to change the escape_char to be the same as what you changed
           the quote_char to.

           The escape character cannot be equal to the separation character.

       allow_loose_escapes
           By default, parsing fields that have "escape_char" characters that
           escape characters that do not need to be escaped, like:

            my $csv = Text::CSV->new ({ escape_char => "\\" });
            $csv->parse (qq{1,"my bar\'s",baz,42});

           would result in a parse error. Though it is still bad practice to
           allow this format, this option enables you to treat all escape
           character sequences equally.

       binary
           If this attribute is TRUE, you may use binary characters in quoted
           fields, including line feeds, carriage returns and NULL bytes. (The
           latter must be escaped as ""0".) By default this feature is off.

           If a string is marked UTF8, binary will be turned on automatically
           when binary characters other than CR or NL are encountered. Note
           that a simple string like "\x{00a0}" might still be binary, but not
           marked UTF8, so setting "{ binary => 1 }" is still a wise option.

       types
           A set of column types; this attribute is immediately passed to the
           types method below. You must not set this attribute otherwise,
           except for using the types method. For details see the description
           of the types method below.

       always_quote
           By default the generated fields are quoted only if they need to
           be, for example, if they contain the separator. If you set this
           attribute to a TRUE value, then all defined fields will be quoted.
           This is typically easier to handle in external applications.

       quote_space
           By default, a space in a field triggers quotation. As no rule in
           CSV requires this, nor any forbids it, the default is true for
           safety. You can exclude the space from this trigger by setting
           this option to 0.

       quote_null
           By default, a NULL byte in a field would be escaped. This attribute
           enables you to treat the NULL byte as a simple binary character in
           binary mode ("{ binary => 1 }" is set). The default is true.
           You can prevent NULL escapes by setting this attribute to 0.

       keep_meta_info
           By default, the parsing of input lines is as simple and fast as
           possible. However, some parsing information - like quotation of the
           original field - is lost in that process. Set this flag to true to
           be able to retrieve that information after parsing with the methods
           "meta_info ()", "is_quoted ()", and "is_binary ()" described below.
           Default is false.

       verbatim
           This is a quite controversial attribute to set, but it makes hard
           things possible.

           The basic thought behind this is to tell the parser that the
           normally special characters newline (NL) and Carriage Return (CR)
           will not be special when this flag is set, and be dealt with as
           being ordinary binary characters. This will ease working with data
           with embedded newlines.

           When "verbatim" is used with "getline ()", "getline ()"
           auto-chomps every line.

           Imagine a file format like

             M^^Hans^Janssen^Klas 2\n2A^Ja^11-06-2007#\r\n

           where the line ending is a very specific "#\r\n", and the sep_char
           is a ^ (caret). None of the fields is quoted, but embedded binary
           data is likely to be present. With the specific line ending, that
           shouldn't be too hard to detect.

           By default, however, Text::CSV's parse function only knows about
           "\n" and "\r" as legal line endings, and so has to treat the
           embedded newline as a real end-of-line, scanning the next line
           only if binary is true and the newline is inside a quoted field.
           With this attribute, however, we can tell parse () to parse the
           line as if \n is nothing more than a binary character.

           For parse () this means that the parser has no idea about line
           endings anymore, and getline () chomps line endings on reading.

       auto_diag
           Setting this to true will cause "error_diag ()" to be
           automatically called in void context upon errors.

           If set to a value greater than 1, it will die on errors instead of
           warn.

           To check future plans and a difference in the XS version, please
           see "auto_diag" in Text::CSV_XS.

       To sum it up,

        $csv = Text::CSV_PP->new ();

       is equivalent to

        $csv = Text::CSV_PP->new ({
            quote_char          => '"',
            escape_char         => '"',
            sep_char            => ',',
            eol                 => $\,
            always_quote        => 0,
            quote_space         => 1,
            quote_null          => 1,
            binary              => 0,
            keep_meta_info      => 0,
            allow_loose_quotes  => 0,
            allow_loose_escapes => 0,
            allow_whitespace    => 0,
            blank_is_undef      => 0,
            empty_is_undef      => 0,
            verbatim            => 0,
            auto_diag           => 0,
            });

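       Overriding a few of these defaults changes how the same input line is
       read. The sketch below parses the sample lines from the
       "blank_is_undef", "empty_is_undef" and "allow_whitespace" descriptions
       above; the helper "show_fields" is a hypothetical name introduced
       only for this illustration.

```perl
use strict;
use warnings;
use Text::CSV_PP;

# Hypothetical helper: parse one line with the given attributes and render
# the fields, showing undef as the bare word "undef" and every defined
# field in square brackets.
sub show_fields {
    my ($attr, $line) = @_;
    my $csv = Text::CSV_PP->new ($attr)
        or die "" . Text::CSV_PP->error_diag ();
    $csv->parse ($line)
        or die "parse () failed on: " . $csv->error_input ();
    return join " ", map { defined $_ ? "[$_]" : "undef" } $csv->fields ();
}

my $line = q{1,"",," ",2};

my $default = show_fields ({},                      $line);
my $blank   = show_fields ({ blank_is_undef => 1 }, $line);
my $empty   = show_fields ({ empty_is_undef => 1 }, $line);
my $spaced  = show_fields ({ allow_whitespace => 1 },
                           q{1 , "foo" , bar , 3 , zapp});

print "default:          $default\n";   # [1] [] [] [ ] [2]
print "blank_is_undef:   $blank\n";     # [1] [] undef [ ] [2]
print "empty_is_undef:   $empty\n";     # [1] undef undef [ ] [2]
print "allow_whitespace: $spaced\n";    # [1] [foo] [bar] [3] [zapp]
```
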
       For all of the above mentioned flags, there is an accessor method
       available where you can inquire for the current value, or change the
       value:

        my $quote = $csv->quote_char;
        $csv->binary (1);

       It is unwise to change these settings halfway through writing CSV data
       to a stream. If however, you want to create a new stream using the
       available CSV object, there is no harm in changing them.

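       As an illustration of the accessors, the sketch below reuses one
       object and changes its output-related settings between independent
       "combine ()" calls. The row contents are made up; the expected strings
       in the comments follow from the attribute descriptions above.

```perl
use strict;
use warnings;
use Text::CSV_PP;

my $csv = Text::CSV_PP->new ()
    or die "" . Text::CSV_PP->error_diag ();

# Hypothetical row: a number, a field with a space, an undefined value.
my @row = (1, "a b", undef);

# Defaults: only the field containing a space is quoted.
$csv->combine (@row) or die "combine () failed";
my $default = $csv->string ();          # 1,"a b",

# always_quote: every defined field is quoted; undef stays empty.
$csv->always_quote (1);
$csv->combine (@row) or die "combine () failed";
my $always = $csv->string ();           # "1","a b",

# quote_space off: a space alone no longer triggers quoting.
$csv->always_quote (0);
$csv->quote_space (0);
$csv->combine (@row) or die "combine () failed";
my $no_space = $csv->string ();         # 1,a b,

# Change quote_char and escape_char together so the new quote character
# is escaped by doubling it.
$csv->quote_char ("'");
$csv->escape_char ("'");
$csv->combine ("it's", "a b") or die "combine () failed";
my $single = $csv->string ();           # 'it''s',a b

print "$_\n" for $default, $always, $no_space, $single;
```
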
       If the "new ()" constructor call fails, it returns "undef", and makes
       the fail reason available through the "error_diag ()" method.

        $csv = Text::CSV->new ({ ecs_char => 1 }) or
            die "" . Text::CSV->error_diag ();

       "error_diag ()" will return a string like

        "INI - Unknown attribute 'ecs_char'"

   print
        $status = $csv->print ($io, $colref);

       Similar to "combine () + string () + print", but more efficient. It
       expects an array ref as input (not an array!) and the resulting string
       is not really created (XS version), but immediately written to the $io
       object, typically an IO handle or any other object that offers a print
       method. Note, this implies that the following is wrong in perl 5.005_xx
       and older:

        open FILE, ">", "whatever";
        $status = $csv->print (\*FILE, $colref);

       as in perl 5.005 and older, the glob "\*FILE" is not an object, thus it
       doesn't have a print method. The solution is to use an IO::File object
       or to hide the glob behind an IO::Wrap object. See IO::File and
       IO::Wrap for details.

       For performance reasons the print method doesn't create a result
       string.  (With the PP backend, result strings are created
       internally.)  In particular the $csv->string (), $csv->status (),
       $csv->fields () and $csv->error_input () methods are meaningless after
       executing this method.

   combine
        $status = $csv->combine (@columns);

       This object function constructs a CSV string from the arguments,
       returning success or failure.  Failure can result from lack of
       arguments or an argument containing an invalid character.  Upon
       success, "string ()" can be called to retrieve the resultant CSV
       string.  Upon failure, the value returned by "string ()" is undefined
       and "error_input ()" can be called to retrieve an invalid argument.

   string
        $line = $csv->string ();

       This object function returns the input to "parse ()" or the resultant
       CSV string of "combine ()", whichever was called more recently.

   getline
        $colref = $csv->getline ($io);

       This is the counterpart to print, like parse is the counterpart to
       combine: It reads a row from the IO object $io using $io->getline ()
       and parses this row into an array ref. This array ref is returned by
       the function or undef for failure.

       When fields are bound with "bind_columns ()", the return value is a
       reference to an empty list.

       The $csv->string (), $csv->fields () and $csv->status () methods are
       meaningless, again.

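       A minimal sketch of "print ()" and "getline ()" working as a pair. It
       assumes in-memory filehandles as a stand-in for real files, and the
       rows are invented for the example; IO::Handle is loaded so that
       lexical filehandles offer the print and getline methods on older
       perls.

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_PP;

# eol => "\n" makes print () terminate every row with a newline.
my $csv = Text::CSV_PP->new ({ binary => 1, eol => "\n" })
    or die "" . Text::CSV_PP->error_diag ();

# Write a header row and one data row to an in-memory "file".
my $buffer = "";
open my $out, ">", \$buffer or die "open for write: $!";
for my $row ([qw( code name price )], ["A1", "Garden hose", "12.50"]) {
    $csv->print ($out, $row) or die "print () failed";
}
close $out;

# Read it back: the first row names the columns, the second becomes a hash.
open my $in, "<", \$buffer or die "open for read: $!";
$csv->column_names ($csv->getline ($in));
my $hr = $csv->getline_hr ($in);
close $in;

print $buffer;
print "Price for $hr->{name} is $hr->{price} EUR\n";
```
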
   getline_all
        $arrayref = $csv->getline_all ($io);
        $arrayref = $csv->getline_all ($io, $offset);
        $arrayref = $csv->getline_all ($io, $offset, $length);

       This will return a reference to a list of "getline ($io)" results.  In
       this call, "keep_meta_info" is disabled. If $offset is negative, as
       with "splice ()", only the last "abs ($offset)" records of $io are
       taken into consideration.

       Given a CSV file with 10 lines:

        lines call
        ----- ---------------------------------------------------------
        0..9  $csv->getline_all ($io)         # all
        0..9  $csv->getline_all ($io,  0)     # all
        8..9  $csv->getline_all ($io,  8)     # start at 8
        -     $csv->getline_all ($io,  0,  0) # start at 0 first 0 rows
        0..4  $csv->getline_all ($io,  0,  5) # start at 0 first 5 rows
        4..5  $csv->getline_all ($io,  4,  2) # start at 4 first 2 rows
        8..9  $csv->getline_all ($io, -2)     # last 2 rows
        6..7  $csv->getline_all ($io, -4,  2) # first 2 of last  4 rows

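       The table can be checked with a small script. This is a sketch that
       assumes a ten-line in-memory file whose single field is "row0" to
       "row9"; "slice_rows" is a hypothetical helper that reopens the data
       for every call so each call starts at the first line.

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_PP;

my $csv = Text::CSV_PP->new ({ binary => 1 })
    or die "" . Text::CSV_PP->error_diag ();

# Ten one-field lines: row0 .. row9.
my $data = join "", map { "row$_\n" } 0 .. 9;

# Hypothetical helper: run getline_all () with the given offset/length on a
# fresh handle and return the first field of every selected row.
sub slice_rows {
    my ($parser, $text, @args) = @_;
    open my $io, "<", \$text or die "open: $!";
    my $rows = $parser->getline_all ($io, @args);
    close $io;
    return join ",", map { $_->[0] } @$rows;
}

my $all        = slice_rows ($csv, $data);
my $from_8     = slice_rows ($csv, $data, 8);
my $four_two   = slice_rows ($csv, $data, 4, 2);
my $last_two   = slice_rows ($csv, $data, -2);
my $tail_slice = slice_rows ($csv, $data, -4, 2);

print "all:    $all\n";
print "8:      $from_8\n";        # row8,row9
print "4, 2:   $four_two\n";      # row4,row5
print "-2:     $last_two\n";      # row8,row9
print "-4, 2:  $tail_slice\n";    # row6,row7
```
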
   parse
        $status = $csv->parse ($line);

       This object function decomposes a CSV string into fields, returning
       success or failure.  Failure can result from a lack of argument or
       from an improperly formatted CSV string.  Upon success, "fields ()"
       can be called to retrieve the decomposed fields.  Upon failure, the
       value returned by "fields ()" is undefined and "error_input ()" can be
       called to retrieve the invalid argument.

       You may use the types () method for setting column types. See the
       description below.

   getline_hr
       The "getline_hr ()" and "column_names ()" methods work together to
       allow you to have rows returned as hashrefs. You must call
       "column_names ()" first to declare your column names.

        $csv->column_names (qw( code name price description ));
        $hr = $csv->getline_hr ($io);
        print "Price for $hr->{name} is $hr->{price} EUR\n";

       "getline_hr ()" will croak if called before "column_names ()".

   getline_hr_all
        $arrayref = $csv->getline_hr_all ($io);

       This will return a reference to a list of "getline_hr ($io)" results.
       In this call, "keep_meta_info" is disabled.

   column_names
       Set the keys that will be used in the "getline_hr ()" calls. If no keys
       (column names) are passed, it'll return the current setting.

       "column_names ()" accepts a list of scalars (the column names) or a
       single array_ref, so you can pass the result of "getline ()":

         $csv->column_names ($csv->getline ($io));

       "column_names ()" does no checking on duplicates at all, which might
       lead to unwanted results. Undefined entries will be replaced with the
       string "\cAUNDEF\cA", so

         $csv->column_names (undef, "", "name", "name");
         $hr = $csv->getline_hr ($io);

       will set "$hr->{"\cAUNDEF\cA"}" to the 1st field, "$hr->{""}" to the
       2nd field, and "$hr->{name}" to the 4th field, discarding the 3rd
       field.

       "column_names ()" croaks on invalid arguments.

   bind_columns
       Takes a list of references to scalars in which to store the fields
       fetched by "getline ()". When you don't pass enough references to
       store the fetched fields in, "getline ()" will fail. If you pass more
       than there are fields to return, the remaining references are left
       untouched.

         $csv->bind_columns (\$code, \$name, \$price, \$description);
         while ($csv->getline ($io)) {
             print "The price of a $name is \x{20ac} $price\n";
             }

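       A self-contained variant of the loop above, as a sketch only: the two
       data rows and the in-memory filehandle are assumptions made for the
       illustration. Each successful "getline ()" overwrites the bound
       scalars, so they are copied before the next call.

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_PP;

my $csv = Text::CSV_PP->new ({ binary => 1 })
    or die "" . Text::CSV_PP->error_diag ();

# Hypothetical input with exactly three fields per row.
my $data = "A1,hose,12.50\nB2,rake,7.25\n";
open my $io, "<", \$data or die "open: $!";

# Bind one scalar per column; getline () fills them on every call.
my ($code, $name, $price);
$csv->bind_columns (\$code, \$name, \$price);

my @seen;
while ($csv->getline ($io)) {
    push @seen, "$code:$name:$price";
}
close $io;

print "$_\n" for @seen;     # A1:hose:12.50 and B2:rake:7.25
```
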
   eof
        $eof = $csv->eof ();

       If "parse ()" or "getline ()" was used with an IO stream, this method
       will return true (1) if the last call hit end of file, otherwise it
       will return false (''). This is useful to see the difference between a
       failure and end of file.

   types
        $csv->types (\@tref);

       This method is used to force columns to be of a given type. For
       example, if you have an integer column, two double columns and a string
       column, then you might do a

        $csv->types ([Text::CSV_PP::IV (),
                      Text::CSV_PP::NV (),
                      Text::CSV_PP::NV (),
                      Text::CSV_PP::PV ()]);

       Column types are used only for decoding columns, in other words by the
       parse () and getline () methods.

       You can unset column types by doing a

        $csv->types (undef);

       or fetch the current type settings with

        $types = $csv->types ();

       IV  Set field type to integer.

       NV  Set field type to numeric/float.

       PV  Set field type to string.

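       The effect of column types can be sketched as follows. The input line
       is made up, and the exact string form of a converted value may differ
       between backends, so the comments describe the numeric value rather
       than a guaranteed spelling.

```perl
use strict;
use warnings;
use Text::CSV_PP;

my $csv = Text::CSV_PP->new ()
    or die "" . Text::CSV_PP->error_diag ();

my $line = "3.9,2.50,007";

# One integer column, one float column, one string column.
$csv->types ([Text::CSV_PP::IV (),
              Text::CSV_PP::NV (),
              Text::CSV_PP::PV ()]);
$csv->parse ($line) or die "parse () failed on: " . $csv->error_input ();
my @typed = $csv->fields ();    # integer 3, number 2.5, string "007"

# Without types every field comes back as the original string.
$csv->types (undef);
$csv->parse ($line) or die "parse () failed on: " . $csv->error_input ();
my @raw = $csv->fields ();      # "3.9", "2.50", "007"

print join ("|", @typed), "\n";
print join ("|", @raw),   "\n";
```
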
   fields
        @columns = $csv->fields ();

       This object function returns the input to "combine ()" or the resultant
       decomposed fields of a successful "parse ()", whichever was called more
       recently.

       Note that the return value is undefined after using "getline ()", which
       does not fill the data structures returned by "parse ()".

   meta_info
        @flags = $csv->meta_info ();

       This object function returns the flags of the input to "combine ()" or
       the flags of the resultant decomposed fields of "parse ()", whichever
       was called more recently.

       For each field, a meta_info field will hold flags that tell something
       about the field returned by the "fields ()" method or passed to the
       "combine ()" method. The flags are bitwise-or'd like:

       0x0001
           The field was quoted.

       0x0002
           The field was binary.

       See the "is_*** ()" methods below.

   is_quoted
         my $quoted = $csv->is_quoted ($column_idx);

       Where $column_idx is the (zero-based) index of the column in the last
       result of "parse ()".

       This returns a true value if the data in the indicated column was
       enclosed in "quote_char" quotes. This might be important for data where
       ",20070108," is to be treated as a numeric value, and where
       ","20070108"," is explicitly marked as character string data.

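       A short sketch of that distinction. It relies on "keep_meta_info"
       being set, since the quotation information is otherwise discarded
       during parsing; the input line is invented for the example.

```perl
use strict;
use warnings;
use Text::CSV_PP;

# keep_meta_info must be true, otherwise quoting information is lost.
my $csv = Text::CSV_PP->new ({ keep_meta_info => 1 })
    or die "" . Text::CSV_PP->error_diag ();

$csv->parse (q{1,"20070108",20070108})
    or die "parse () failed on: " . $csv->error_input ();

my @fields = $csv->fields ();

# 1 for every column that was enclosed in quotes, 0 otherwise.
my @quoted = map { $csv->is_quoted ($_) ? 1 : 0 } 0 .. $#fields;

for my $i (0 .. $#fields) {
    printf "%-10s %s\n", $fields[$i],
        $quoted[$i] ? "string data" : "possibly numeric";
}
```
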
   is_binary
         my $binary = $csv->is_binary ($column_idx);

       Where $column_idx is the (zero-based) index of the column in the last
       result of "parse ()".

       This returns a true value if the data in the indicated column contained
       any byte in the range [\x00-\x08,\x10-\x1F,\x7F-\xFF].

   status
        $status = $csv->status ();

       This object function returns success (or failure) of "combine ()" or
       "parse ()", whichever was called more recently.

   error_input
        $bad_argument = $csv->error_input ();

       This object function returns the erroneous argument (if it exists) of
       "combine ()" or "parse ()", whichever was called more recently.

   error_diag
        Text::CSV_PP->error_diag ();
        $csv->error_diag ();
        $error_code   = 0  + $csv->error_diag ();
        $error_str    = "" . $csv->error_diag ();
        ($cde, $str, $pos) = $csv->error_diag ();

       If (and only if) an error occurred, this function returns the
       diagnostics of that error.

       If called in void context, it will print the internal error code and
       the associated error message to STDERR.

       If called in list context, it will return the error code and the error
       message in that order. If the last error was from parsing, the third
       value returned is the best guess at the location within the line that
       was being parsed. Its value is 1-based.

       Note: in many cases $pos does not point at the actual error location;
       it is provided on a best-effort basis.

       If called in scalar context, it will return the diagnostics in a single
       scalar, a-la $!. It will contain the error code in numeric context, and
       the diagnostics message in string context.

       To achieve this behavior with CSV_PP, the returned diagnostics is a
       blessed object.

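       The calling contexts can be sketched as below. The malformed input is
       an assumption chosen to make "parse ()" fail; the exact code and
       message reported may differ between the PP and XS backends, so the
       example prints them rather than relying on specific values.

```perl
use strict;
use warnings;
use Text::CSV_PP;

my $csv = Text::CSV_PP->new ()
    or die "" . Text::CSV_PP->error_diag ();

# A quoted field that is never closed: parse () is expected to fail.
my $ok = $csv->parse (q{1,"unterminated});

# List context: code, message and best-guess position.
my ($code, $message, $pos) = $csv->error_diag ();

# Scalar context: one value that is numeric or string depending on use.
my $diag = $csv->error_diag ();
my $num  = 0  + $diag;
my $str  = "" . $diag;

unless ($ok) {
    print "parse failed: $message (code $code)\n";
    print "offending input: ", $csv->error_input (), "\n";
}
```
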
   SetDiag
        $csv->SetDiag (0);

       Use to reset the diagnostics if you are dealing with errors.

DIAGNOSTICS

       If an error occurred, $csv->error_diag () can be used to get more
       information on the cause of the failure. Note that for speed reasons,
       the internal value is never cleared on success, so using the value
       returned by error_diag () in normal cases - when no error occurred -
       may cause unexpected results.

       Note: CSV_PP's diagnostics differ from CSV_XS's:

       Text::CSV_XS parses CSV strings one character at a time, while
       Text::CSV_PP uses regular expressions.  This difference can lead to a
       different reported cause of failure.

       Currently these errors are available:

       1001 "sep_char is equal to quote_char or escape_char"
         The separation character cannot be equal to either the quotation
         character or the escape character, as that will invalidate all
         parsing rules.

       1002 "INI - allow_whitespace with escape_char or quote_char SP or TAB"
         Using "allow_whitespace" when either "escape_char" or "quote_char" is
         equal to SPACE or TAB is too ambiguous to allow.

       1003 "INI - \r or \n in main attr not allowed"
         Using default "eol" characters in either "sep_char", "quote_char", or
         "escape_char" is not allowed.

       2010 "ECR - QUO char inside quotes followed by CR not part of EOL"
       2011 "ECR - Characters after end of quoted field"
       2021 "EIQ - NL char inside quotes, binary off"
       2022 "EIQ - CR char inside quotes, binary off"
       2025 "EIQ - Loose unescaped escape"
       2026 "EIQ - Binary character inside quoted field, binary off"
       2027 "EIQ - Quoted field not terminated"
       2030 "EIF - NL char inside unquoted verbatim, binary off"
       2031 "EIF - CR char is first char of field, not part of EOL"
       2032 "EIF - CR char inside unquoted, not part of EOL"
       2034 "EIF - Loose unescaped quote"
       2037 "EIF - Binary character in unquoted field, binary off"
       2110 "ECB - Binary character in Combine, binary off"
       2200 "EIO - print to IO failed. See errno"
       4002 "EIQ - Unescaped ESC in quoted field"
       4003 "EIF - ESC CR"
       4004 "EUF - "
       3001 "EHR - Unsupported syntax for column_names ()"
       3002 "EHR - getline_hr () called before column_names ()"
       3003 "EHR - bind_columns () and column_names () fields count mismatch"
       3004 "EHR - bind_columns () only accepts refs to scalars"
       3006 "EHR - bind_columns () did not pass enough refs for parsed fields"
       3007 "EHR - bind_columns needs refs to writable scalars"
       3008 "EHR - unexpected error in bound fields"

AUTHOR

       Makamaka Hannyaharamitu, <makamaka[at]cpan.org>

       Text::CSV_XS was written by <joe[at]ispsoft.de> and maintained by
       <h.m.brand[at]xs4all.nl>.

       Text::CSV was written by <alan[at]mfgrtl.com>.

COPYRIGHT AND LICENSE

       Copyright 2005-2010 by Makamaka Hannyaharamitu, <makamaka[at]cpan.org>

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.