Text::CSV_XS(3pm)

CSV_XS(3)             User Contributed Perl Documentation            CSV_XS(3)

NAME

6       Text::CSV_XS - comma-separated values manipulation routines
7

SYNOPSIS

        use Text::CSV_XS;

        my @rows;
        my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
        open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
        while (my $row = $csv->getline ($fh)) {
            $row->[2] =~ m/pattern/ or next; # 3rd field should match
            push @rows, $row;
            }
        close $fh;

        $csv->eol ("\r\n");
        open $fh, ">:encoding(utf8)", "new.csv" or die "new.csv: $!";
        $csv->print ($fh, $_) for @rows;
        close $fh or die "new.csv: $!";

DESCRIPTION

26       Text::CSV_XS provides facilities for the composition and decomposition
27       of comma-separated values. An instance of the Text::CSV_XS class will
28       combine fields into a CSV string and parse a CSV string into fields.
29
       The module accepts either strings or files as input and supports the
       use of user-specified characters for delimiters, separators, and
       escapes.
32
33   Embedded newlines
34       Important Note: The default behavior is to accept only ASCII characters
35       in the range from 0x20 (space) to 0x7E (tilde).  This means that fields
36       can not contain newlines. If your data contains newlines embedded in
37       fields, or characters above 0x7e (tilde), or binary data, you must set
38       "binary => 1" in the call to "new". To cover the widest range of
39       parsing options, you will always want to set binary.
40
       That still leaves the problem of passing a complete record to the
       "parse" method, which is harder than the usual idiom suggests:
44
45        my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
46        while (<>) {           #  WRONG!
47            $csv->parse ($_);
48            my @fields = $csv->fields ();
49
       will break, because the "while" reads one physical line at a time
       without regard to quoting and so may hand "parse" only part of a
       record. If you need to support embedded newlines, do not pass "eol"
       to the parser (it accepts "\n", "\r", and "\r\n" by default) and let
       "getline" read from the file handle instead:
54
55        my $csv = Text::CSV_XS->new ({ binary => 1 });
56        open my $io, "<", $file or die "$file: $!";
57        while (my $row = $csv->getline ($io)) {
58            my @fields = @$row;
59
60       The old(er) way of using global file handles is still supported
61
62        while (my $row = $csv->getline (*ARGV)) {
63
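       As a minimal, self-contained sketch of the "getline" approach, the
       following reads two records from an in-memory file handle. The
       sample data and variable names are assumptions made up for this
       illustration:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical sample data: the second field of the first record
# contains an embedded newline inside quotes.
my $data = qq{1,"line one\nline two",3\n4,five,6\n};

my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
open my $io, "<", \$data or die "in-memory handle: $!";

# getline () keeps reading until the quoted field is closed, so the
# embedded newline does not split the record.
my @rows;
while (my $row = $csv->getline ($io)) {
    push @rows, $row;
    }
close $io;

printf "%d records read\n", scalar @rows;
```

       Reading the same data with "while (<>)" and "parse" would hand the
       parser the first record in two pieces.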
64   Unicode
65       Unicode is only tested to work with perl-5.8.2 and up.
66
       On parsing (both for "getline" and "parse"), if the source is marked
       as being UTF8, then all fields that are marked binary will also be
       marked UTF8.
70
71       For complete control over encoding, please use Text::CSV::Encoded:
72
73        use Text::CSV::Encoded;
74        my $csv = Text::CSV::Encoded->new ({
75            encoding_in  => "iso-8859-1", # the encoding comes into   Perl
76            encoding_out => "cp1252",     # the encoding comes out of Perl
77            });
78
79        $csv = Text::CSV::Encoded->new ({ encoding  => "utf8" });
80        # combine () and print () accept *literally* utf8 encoded data
81        # parse () and getline () return *literally* utf8 encoded data
82
83        $csv = Text::CSV::Encoded->new ({ encoding  => undef }); # default
84        # combine () and print () accept UTF8 marked data
85        # parse () and getline () return UTF8 marked data
86
       On combining ("print" and "combine"), if any of the combining fields
       was marked UTF8, the resulting string will be marked UTF8. Note,
       however, that any field before the first UTF8-marked field that
       contains 8-bit characters and was not upgraded to UTF8 will end up as
       raw bytes in the resulting string, causing errors. If you pass data
       of different encodings, or you do not know whether the encodings
       differ, force all fields to be upgraded before you pass them on:
94
95        $csv->print ($fh, [ map { utf8::upgrade (my $x = $_); $x } @data ]);
96
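       The same upgrade idiom can be tried with "combine". This is only a
       sketch; the two sample fields are assumptions chosen so that one is
       an 8-bit string and the other contains a wide character:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical mixed data: the first field is an 8-bit (Latin-1) string
# that is not marked UTF8, the second contains a wide character.
my @data = ("caf\xe9", "\x{20ac} 5");

my $csv = Text::CSV_XS->new ({ binary => 1 });

# Upgrade a copy of every field so all fields share one internal encoding.
my @upgraded = map { utf8::upgrade (my $x = $_); $x } @data;
$csv->combine (@upgraded) or die "combine failed: " . $csv->error_input ();

my $line = $csv->string ();
print utf8::is_utf8 ($line) ? "marked UTF8\n" : "not marked UTF8\n";
```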

SPECIFICATION

98       While no formal specification for CSV exists, RFC 4180 1) describes a
99       common format and establishes "text/csv" as the MIME type registered
100       with the IANA.
101
102       Many informal documents exist that describe the CSV format. How To: The
103       Comma Separated Value (CSV) File Format 2) provides an overview of the
104       CSV format in the most widely used applications and explains how it can
105       best be used and supported.
106
107        1) http://tools.ietf.org/html/rfc4180
108        2) http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
109
110       The basic rules are as follows:
111
112       CSV is a delimited data format that has fields/columns separated by the
113       comma character and records/rows separated by newlines. Fields that
114       contain a special character (comma, newline, or double quote), must be
115       enclosed in double quotes.  However, if a line contains a single entry
116       that is the empty string, it may be enclosed in double quotes. If a
117       field's value contains a double quote character it is escaped by
118       placing another double quote character next to it. The CSV file format
119       does not require a specific character encoding, byte order, or line
120       terminator format.
121
122       · Each record is a single line ended by a line feed (ASCII/LF=0x0A) or
123         a carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A), however,
124         line-breaks may be embedded.
125
126       · Fields are separated by commas.
127
128       · Allowable characters within a CSV field include 0x09 (tab) and the
129         inclusive range of 0x20 (space) through 0x7E (tilde). In binary mode
130         all characters are accepted, at least in quoted fields.
131
       · A field within CSV must be surrounded by double-quotes to contain
         the separator character (comma).
134
135       Though this is the most clear and restrictive definition, Text::CSV_XS
136       is way more liberal than this, and allows extension:
137
138       · Line termination by a single carriage return is accepted by default
139
       · The separation-, quote-, and escape- characters can be any ASCII
         character in the range from 0x20 (space) to 0x7E (tilde). Characters
         outside this range may or may not work as expected. Multibyte
         characters, like U+060c (ARABIC COMMA), U+FF0C (FULLWIDTH COMMA),
         U+241B (SYMBOL FOR ESCAPE), U+2424 (SYMBOL FOR NEWLINE), U+FF02
         (FULLWIDTH QUOTATION MARK), and U+201C (LEFT DOUBLE QUOTATION MARK)
         (to give some examples of what might look promising) are therefore
         not allowed.
148
149         If you use perl-5.8.2 or higher, these three attributes are
150         utf8-decoded, to increase the likelihood of success. This way U+00FE
151         will be allowed as a quote character.
152
153       · A field within CSV must be surrounded by double-quotes to contain an
154         embedded double-quote, represented by a pair of consecutive double-
155         quotes.  In binary mode you may additionally use the sequence ""0"
156         for representation of a NULL byte.
157
158       · Several violations of the above specification may be allowed by
159         passing options to the object creator.
160
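       The quoting and escaping rules above can be observed with a short
       round trip through "combine" and "parse". This is a sketch only;
       the sample fields are assumptions picked to exercise the comma,
       double-quote, and embedded-newline cases:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1 });

# One plain field, then fields containing a comma, a double quote,
# and an embedded newline.
my @in = ("plain", "has,comma", 'say "hi"', "two\nlines");

$csv->combine (@in) or die "combine failed: " . $csv->error_input ();
my $line = $csv->string ();
print "$line\n";

# Parsing the generated line should give back the original fields.
$csv->parse ($line) or die "parse failed: " . $csv->error_input ();
my @out = $csv->fields ();
```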

FUNCTIONS

162   version
163       (Class method) Returns the current module version.
164
165   new
       (Class method) Returns a new instance of Text::CSV_XS. The object's
       attributes are described by the (optional) hash ref "\%attr".
168
169        my $csv = Text::CSV_XS->new ({ attributes ... });
170
171       The following attributes are available:
172
173       eol An end-of-line string to add to rows.
174
           When not passed in a parser instance, the default behavior is to
           accept "\n", "\r", and "\r\n", so it is probably safer to not
           specify "eol" at all. Passing "undef" and passing the empty
           string behave the same.
179
180           Common values for "eol" are "\012" ("\n" or Line Feed), "\015\012"
181           ("\r\n" or Carriage Return, Line Feed), and "\015" ("\r" or
182           Carriage Return). The "eol" attribute cannot exceed 7 (ASCII)
183           characters.
184
           If both $/ and "eol" equal "\015", lines that end in only a
           Carriage Return, without a Line Feed, will be parsed correctly.
187
188       sep_char
           The character used to separate fields, by default a comma (",").
           Limited to a single-byte character, usually in the range from 0x20
           (space) to 0x7e (tilde).
192
193           The separation character can not be equal to the quote character.
194           The separation character can not be equal to the escape character.
195
196           See also "CAVEATS"
197
198       allow_whitespace
           When this option is set to true, whitespace (TABs and SPACEs)
           surrounding the separation character is removed when parsing. If
           either TAB or SPACE is one of the three major characters
           "sep_char", "quote_char", or "escape_char", it will not be
           considered whitespace.
204
205           Now lines like:
206
207            1 , "foo" , bar , 3 , zapp
208
209           are correctly parsed, even though it violates the CSV specs.
210
211           Note that all whitespace is stripped from start and end of each
212           field.  That would make it more a feature than a way to enable
213           parsing bad CSV lines, as
214
215            1,   2.0,  3,   ape  , monkey
216
217           will now be parsed as
218
219            ("1", "2.0", "3", "ape", "monkey")
220
221           even if the original line was perfectly sane CSV.
222
223       blank_is_undef
224           Under normal circumstances, CSV data makes no distinction between
225           quoted- and unquoted empty fields. These both end up in an empty
226           string field once read, thus
227
228            1,"",," ",2
229
230           is read as
231
232            ("1", "", "", " ", "2")
233
234           When writing CSV files with "always_quote" set, the unquoted empty
235           field is the result of an undefined value. To make it possible to
236           also make this distinction when reading CSV data, the
237           "blank_is_undef" option will cause unquoted empty fields to be set
238           to undef, causing the above to be parsed as
239
240            ("1", "", undef, " ", "2")
241
242       empty_is_undef
243           Going one step further than "blank_is_undef", this attribute
244           converts all empty fields to undef, so
245
246            1,"",," ",2
247
248           is read as
249
250            (1, undef, undef, " ", 2)
251
           Note that this affects only fields that are really empty, not
           fields that are empty after stripping allowed whitespace. YMMV.
254
255       quote_char
256           The character to quote fields containing blanks, by default the
257           double quote character ("""). A value of undef suppresses quote
258           chars (for simple cases only).  Limited to a single-byte character,
259           usually in the range from 0x20 (space) to 0x7e (tilde).
260
261           The quote character can not be equal to the separation character.
262
263       allow_loose_quotes
264           By default, parsing fields that have "quote_char" characters inside
265           an unquoted field, like
266
267            1,foo "bar" baz,42
268
269           would result in a parse error. Though it is still bad practice to
270           allow this format, we cannot help the fact some vendors make their
271           applications spit out lines styled that way.
272
273           If there is really bad CSV data, like
274
275            1,"foo "bar" baz",42
276
277           or
278
279            1,""foo bar baz"",42
280
281           there is a way to get that parsed, and leave the quotes inside the
282           quoted field as-is. This can be achieved by setting
283           "allow_loose_quotes" AND making sure that the "escape_char" is not
284           equal to "quote_char".
285
286       escape_char
287           The character to escape certain characters inside quoted fields.
288           Limited to a single-byte character, usually in the range from 0x20
289           (space) to 0x7e (tilde).
290
291           The "escape_char" defaults to being the literal double-quote mark
292           (""") in other words, the same as the default "quote_char". This
293           means that doubling the quote mark in a field escapes it:
294
295            "foo","bar","Escape ""quote mark"" with two ""quote marks""","baz"
296
297           If you change the default quote_char without changing the default
298           escape_char, the escape_char will still be the quote mark.  If
299           instead you want to escape the quote_char by doubling it, you will
300           need to change the escape_char to be the same as what you changed
301           the quote_char to.
302
303           The escape character can not be equal to the separation character.
304
305       allow_loose_escapes
306           By default, parsing fields that have "escape_char" characters that
307           escape characters that do not need to be escaped, like:
308
309            my $csv = Text::CSV_XS->new ({ escape_char => "\\" });
310            $csv->parse (qq{1,"my bar\'s",baz,42});
311
312           would result in a parse error. Though it is still bad practice to
313           allow this format, this option enables you to treat all escape
314           character sequences equal.
315
316       allow_unquoted_escape
317           There is a backward compatibility issue in that the escape
318           character, when differing from the quotation character, cannot be
319           on the first position of a field. e.g. with "quote_char" equal to
320           the default """ and "escape_char" set to "\", this would be
321           illegal:
322
323            1,\0,2
324
325           To overcome issues with backward compatibility, you can allow this
326           by setting this attribute to 1.
327
328       binary
329           If this attribute is TRUE, you may use binary characters in quoted
330           fields, including line feeds, carriage returns and NULL bytes. (The
331           latter must be escaped as ""0".) By default this feature is off.
332
           If a string is marked UTF8, binary will be turned on automatically
           when binary characters other than CR or NL are encountered. Note
           that a simple string like "\x{00a0}" might still be binary, but not
           marked UTF8, so setting "{ binary => 1 }" is still a wise option.
337
338       types
339           A set of column types; this attribute is immediately passed to the
340           "types" method. You must not set this attribute otherwise, except
341           for using the "types" method.
342
343       always_quote
344           By default the generated fields are quoted only if they need to be.
345           For example, if they contain the separator character. If you set
346           this attribute to a TRUE value, then all defined fields will be
           quoted. ("undef" fields are not quoted, see "blank_is_undef".)
348           This is typically easier to handle in external applications. (Poor
349           creatures who are not using Text::CSV_XS. :-)
350
351       quote_space
           By default, a space in a field triggers quotation. As no CSV rule
           requires this, nor its opposite, the default is true for safety.
           You can exclude the space from this trigger by setting this
           attribute to 0.
356
357       quote_null
358           By default, a NULL byte in a field would be escaped. This attribute
359           enables you to treat the NULL byte as a simple binary character in
360           binary mode (the "{ binary => 1 }" is set). The default is true.
361           You can prevent NULL escapes by setting this attribute to 0.
362
363       quote_binary
364           By default,  all "unsafe" bytes inside a string cause the combined
365           field to be quoted. By setting this attribute to 0, you can disable
366           that trigger for bytes >= 0x7f.
367
368       keep_meta_info
369           By default, the parsing of input lines is as simple and fast as
370           possible.  However, some parsing information - like quotation of
371           the original field - is lost in that process. Set this flag to true
372           to enable retrieving that information after parsing with the
373           methods "meta_info", "is_quoted", and "is_binary" described below.
374           Default is false.
375
376       verbatim
377           This is a quite controversial attribute to set, but it makes hard
378           things possible.
379
380           The basic thought behind this is to tell the parser that the
381           normally special characters newline (NL) and Carriage Return (CR)
382           will not be special when this flag is set, and be dealt with as
383           being ordinary binary characters. This will ease working with data
384           with embedded newlines.
385
386           When "verbatim" is used with "getline", "getline" auto-chomp's
387           every line.
388
389           Imagine a file format like
390
391            M^^Hans^Janssen^Klas 2\n2A^Ja^11-06-2007#\r\n
392
393           where, the line ending is a very specific "#\r\n", and the sep_char
394           is a ^ (caret). None of the fields is quoted, but embedded binary
395           data is likely to be present. With the specific line ending, that
396           should not be too hard to detect.
397
           By default, Text::CSV_XS' parse function only knows "\n" and "\r"
           as legal line endings, and so has to treat the embedded newline
           as a real end-of-line; it only scans on to the next line if
           binary is true and the newline is inside a quoted field.  With
           this attribute, we tell parse () to parse the line as if "\n" is
           nothing more than a binary character.
404
405           For parse () this means that the parser has no idea about line
406           ending anymore, and getline () chomps line endings on reading.
407
408       auto_diag
           Setting this to a true number between 1 and 9 causes "error_diag"
           to be called automatically in void context upon errors.
411
412           In case of error "2012 - EOF", this call will be void.
413
           If set to a value greater than 1, it will die on errors instead of
           warning.  If set to anything unsupported, it will be silently
           ignored.

           Future extensions to this feature will include more reliable
           auto-detection of the "autodie" module being enabled, which will
           raise the value of "auto_diag" by 1 at the moment the error is
           detected.
421
422       diag_verbose
423           Set the verbosity of the "auto_diag" output. Currently only adds
424           the current input line (if known) to the diagnostic output with an
425           indication of the position of the error.
426
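       Several of the parsing attributes above can be compared side by
       side. The following is a sketch only: "parse_with" is a
       hypothetical helper written for this illustration, not part of the
       module, and the input lines are the ones used in the attribute
       descriptions:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical helper: parse one line with the given attributes and
# return the fields as an array ref.
sub parse_with {
    my ($attr, $str) = @_;
    my $csv = Text::CSV_XS->new ($attr)
        or die "" . Text::CSV_XS->error_diag ();
    $csv->parse ($str) or die "parse failed: " . $csv->error_input ();
    return [ $csv->fields () ];
    }

my $line  = q{1,"",," ",2};
my $plain = parse_with ({},                      $line);
my $blank = parse_with ({ blank_is_undef   => 1 }, $line);
my $empty = parse_with ({ empty_is_undef   => 1 }, $line);
my $white = parse_with ({ allow_whitespace => 1 },
                        q{1 , "foo" , bar , 3 , zapp});

# Show undef fields explicitly so the differences are visible.
for ([ default => $plain ], [ blank_is_undef   => $blank ],
     [ empty_is_undef => $empty ], [ allow_whitespace => $white ]) {
    my ($name, $fields) = @$_;
    printf "%-17s %s\n", $name,
        join ", ", map { defined $_ ? qq{"$_"} : "undef" } @$fields;
    }
```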
427       To sum it up,
428
429        $csv = Text::CSV_XS->new ();
430
431       is equivalent to
432
433        $csv = Text::CSV_XS->new ({
434            quote_char            => '"',
435            escape_char           => '"',
436            sep_char              => ',',
437            eol                   => $\,
438            always_quote          => 0,
439            quote_space           => 1,
440            quote_null            => 1,
441            quote_binary          => 1,
442            binary                => 0,
443            keep_meta_info        => 0,
444            allow_loose_quotes    => 0,
445            allow_loose_escapes   => 0,
446            allow_unquoted_escape => 0,
447            allow_whitespace      => 0,
448            blank_is_undef        => 0,
449            empty_is_undef        => 0,
450            verbatim              => 0,
451            auto_diag             => 0,
452            diag_verbose          => 0,
453            });
454
455       For all of the above mentioned flags, an accessor method is available
456       where you can inquire the current value, or change the value
457
458        my $quote = $csv->quote_char;
459        $csv->binary (1);
460
461       It is unwise to change these settings halfway through writing CSV data
462       to a stream. If however, you want to create a new stream using the
463       available CSV object, there is no harm in changing them.
464
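       A sketch of using the accessors between two independent outputs.
       The sample fields are assumptions, and "combine" stands in for
       writing to two separate streams:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1 });

# First output: default separator.
$csv->combine (qw( a b c )) or die "combine failed";
my $comma = $csv->string ();

# A new output is about to start, so changing the separator is safe here.
$csv->sep_char (";");
$csv->combine (qw( a b c )) or die "combine failed";
my $semi = $csv->string ();

printf "sep_char is now '%s'\n", $csv->sep_char;
print "$comma\n$semi\n";
```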
465       If the "new" constructor call fails, it returns "undef", and makes the
466       fail reason available through the "error_diag" method.
467
468        $csv = Text::CSV_XS->new ({ ecs_char => 1 }) or
469            die "".Text::CSV_XS->error_diag ();
470
471       "error_diag" will return a string like
472
473        "INI - Unknown attribute 'ecs_char'"
474
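       If dying is not wanted, the failure can also be handled in place.
       A minimal sketch, reusing the misspelled attribute from the example
       above:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# "ecs_char" is a deliberate misspelling of "escape_char".
my $csv  = Text::CSV_XS->new ({ ecs_char => 1 });
my $diag = "";

unless (defined $csv) {
    # In string context, the class method returns the diagnostic text.
    $diag = "" . Text::CSV_XS->error_diag ();
    print "constructor failed: $diag\n";
    }
```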
475   print
476        $status = $csv->print ($io, $colref);
477
478       Similar to "combine" + "string" + "print", but way more efficient. It
479       expects an array ref as input (not an array!) and the resulting string
480       is not really created, but immediately written to the $io object,
481       typically an IO handle or any other object that offers a "print"
482       method.
483
484       For performance reasons the print method does not create a result
485       string.  In particular the "string", "status", "fields", and
486       "error_input" methods are meaningless after executing this method.
487
488       If $colref is "undef" (explicit, not through a variable argument) and
489       "bind_columns" was used to specify fields to be printed, it is possible
490       to make performance improvements, as otherwise data would have to be
491       copied as arguments to the method call:
492
493        $csv->bind_columns (\($foo, $bar));
494        $status = $csv->print ($fh, undef);
495
496       A short benchmark
497
498        my @data = ("aa" .. "zz");
499        $csv->bind_columns (\(@data));
500
501        $csv->print ($io, [ @data ]);   # 10800 recs/sec
502        $csv->print ($io,  \@data  );   # 57100 recs/sec
503        $csv->print ($io,   undef  );   # 50500 recs/sec
504
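       Both calling styles can be tried against an in-memory file handle.
       This is a sketch under assumptions: the column values are made up,
       and "eol" is set so that each record ends in a newline:

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1, eol => "\n" });

my $out = "";
open my $fh, ">", \$out or die "in-memory handle: $!";

# Style 1: pass an array ref.
$csv->print ($fh, [ "code", "name" ]) or die "print failed";

# Style 2: bind the variables once, then pass an explicit undef.
my ($code, $name) = (1, "apple pie");
$csv->bind_columns (\($code, $name));
$csv->print ($fh, undef) or die "print failed";

($code, $name) = (2, "tea");
$csv->print ($fh, undef) or die "print failed";

close $fh or die "close: $!";
print $out;
```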
505   combine
506        $status = $csv->combine (@columns);
507
508       This object function constructs a CSV string from the arguments,
509       returning success or failure.  Failure can result from lack of
510       arguments or an argument containing an invalid character.  Upon
511       success, "string" can be called to retrieve the resultant CSV string.
512       Upon failure, the value returned by "string" is undefined and
513       "error_input" can be called to retrieve an invalid argument.
514
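       A sketch of a failing "combine": with "binary" off (the default), a
       field with an embedded newline is rejected and "error_input"
       reports the offending argument. The sample fields are assumptions:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my @fields = ("abc", "def\n", "ghi");

# binary is off by default, so the embedded newline is not acceptable.
my $strict = Text::CSV_XS->new ();
my $ok     = $strict->combine (@fields);
my $bad    = $ok ? undef : $strict->error_input ();
print "combine rejected a field\n" unless $ok;

# With binary on, the same fields combine fine.
my $binary = Text::CSV_XS->new ({ binary => 1 });
$binary->combine (@fields) or die "combine failed unexpectedly";
my $line = $binary->string ();
```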
515   string
516        $line = $csv->string ();
517
518       This object function returns the input to "parse" or the resultant CSV
519       string of "combine", whichever was called more recently.
520
521   getline
522        $colref = $csv->getline ($io);
523
524       This is the counterpart to "print", as "parse" is the counterpart to
525       "combine": It reads a row from the IO object using "$io->getline" and
526       parses this row into an array ref. This array ref is returned by the
527       function or undef for failure.
528
529       When fields are bound with "bind_columns", the return value is a
530       reference to an empty list.
531
532       The "string", "fields", and "status" methods are meaningless, again.
533
534   getline_all
535        $arrayref = $csv->getline_all ($io);
536        $arrayref = $csv->getline_all ($io, $offset);
537        $arrayref = $csv->getline_all ($io, $offset, $length);
538
539       This will return a reference to a list of getline ($io) results.  In
540       this call, "keep_meta_info" is disabled. If $offset is negative, as
541       with "splice", only the last "abs ($offset)" records of $io are taken
542       into consideration.
543
544       Given a CSV file with 10 lines:
545
546        lines call
547        ----- ---------------------------------------------------------
548        0..9  $csv->getline_all ($io)         # all
549        0..9  $csv->getline_all ($io,  0)     # all
550        8..9  $csv->getline_all ($io,  8)     # start at 8
551        -     $csv->getline_all ($io,  0,  0) # start at 0 first 0 rows
552        0..4  $csv->getline_all ($io,  0,  5) # start at 0 first 5 rows
553        4..5  $csv->getline_all ($io,  4,  2) # start at 4 first 2 rows
554        8..9  $csv->getline_all ($io, -2)     # last 2 rows
555        6..7  $csv->getline_all ($io, -4,  2) # first 2 of last  4 rows
556
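       The table can be reproduced with a small script. This is a sketch:
       "slice" is a hypothetical helper that opens a fresh in-memory
       handle for every call, and the ten-line data set is generated on
       the fly:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Ten records; the second field holds the zero-based line number.
my $data = join "", map { "row$_,$_\n" } 0 .. 9;

# Hypothetical helper: run getline_all with the given offset/length on
# a fresh handle and return the line numbers that came back.
sub slice {
    my @args = @_;
    my $csv  = Text::CSV_XS->new ({ binary => 1 });
    open my $io, "<", \$data or die "in-memory handle: $!";
    my $rows = $csv->getline_all ($io, @args);
    close $io;
    return join ",", map { $_->[1] } @$rows;
    }

print "all:        ", slice (),      "\n";
print "offset 8:   ", slice (8),     "\n";
print "0, len 5:   ", slice (0, 5),  "\n";
print "4, len 2:   ", slice (4, 2),  "\n";
print "last 2:     ", slice (-2),    "\n";
print "-4, len 2:  ", slice (-4, 2), "\n";
```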
557   parse
558        $status = $csv->parse ($line);
559
       This object function decomposes a CSV string into fields, returning
       success or failure.  Failure can result from a missing argument or
       from an improperly formatted CSV string.  Upon success, "fields" can
       be called to retrieve the decomposed fields.  Upon failure, the value
       returned by "fields" is undefined and "error_input" can be called to
       retrieve the invalid argument.
566
       You may use the "types" method for setting column types. See the
       description of "types" below.
569
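       A sketch of checking the return value of "parse". The two input
       lines are assumptions: the first is valid CSV, the second has
       quotes inside an unquoted field, which is rejected by default:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv  = Text::CSV_XS->new ({ binary => 1 });
my $good = q{1,"two, with comma",3};
my $bad  = q{1,foo "bar" baz,42};

my (@fields, $rejected);
for my $line ($good, $bad) {
    if ($csv->parse ($line)) {
        @fields = $csv->fields ();
        printf "parsed %d fields\n", scalar @fields;
        }
    else {
        # error_input () returns the line that could not be parsed.
        $rejected = $csv->error_input ();
        print "rejected: $rejected\n";
        }
    }
```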
570   getline_hr
571       The "getline_hr" and "column_names" methods work together to allow you
572       to have rows returned as hashrefs. You must call "column_names" first
573       to declare your column names.
574
575        $csv->column_names (qw( code name price description ));
576        $hr = $csv->getline_hr ($io);
577        print "Price for $hr->{name} is $hr->{price} EUR\n";
578
579       "getline_hr" will croak if called before "column_names".
580
       Note that "getline_hr" creates a hashref for every row and will be much
       slower than the combined use of "bind_columns" and "getline", which
       still offers the same ease of use of a hashref inside the loop:
584
585        my @cols = @{$csv->getline ($io)};
586        $csv->column_names (@cols);
587        while (my $row = $csv->getline_hr ($io)) {
588            print $row->{price};
589            }
590
591       Could easily be rewritten to the much faster:
592
593        my @cols = @{$csv->getline ($io)};
594        my $row = {};
595        $csv->bind_columns (\@{$row}{@cols});
596        while ($csv->getline ($io)) {
597            print $row->{price};
598            }
599
       Your mileage may vary with the size of the data and the number of rows.
       With perl-5.14.2 the comparison for a 100_000 line file with 14
       columns:
602
603                   Rate hashrefs getlines
604        hashrefs 1.00/s       --     -76%
605        getlines 4.15/s     313%       --
606
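       The two variants can be run side by side on a small in-memory data
       set to confirm that they yield the same values. This is a sketch;
       the data and the "open_data" helper are assumptions made for this
       illustration:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $data = "code,name,price\n1,tea,2.50\n2,pie,4.00\n";

# Hypothetical helper: fresh parser plus fresh in-memory handle.
sub open_data {
    my $csv = Text::CSV_XS->new ({ binary => 1 });
    open my $io, "<", \$data or die "in-memory handle: $!";
    return ($csv, $io);
    }

# Variant 1: a new hashref per row via getline_hr.
my @via_hr;
{   my ($csv, $io) = open_data ();
    $csv->column_names ($csv->getline ($io));
    while (my $hr = $csv->getline_hr ($io)) {
        push @via_hr, $hr->{price};
        }
    }

# Variant 2: one hash, refilled on every getline via bind_columns.
my @via_bind;
{   my ($csv, $io) = open_data ();
    my @cols = @{$csv->getline ($io)};
    my $row  = {};
    $csv->bind_columns (\@{$row}{@cols});
    while ($csv->getline ($io)) {
        push @via_bind, $row->{price};
        }
    }

print "getline_hr:   @via_hr\n";
print "bind_columns: @via_bind\n";
```

       Note that in the second variant $row is reused, so values must be
       copied out before the next call to "getline".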
607   getline_hr_all
608        $arrayref = $csv->getline_hr_all ($io);
609        $arrayref = $csv->getline_hr_all ($io, $offset);
610        $arrayref = $csv->getline_hr_all ($io, $offset, $length);
611
612       This will return a reference to a list of getline_hr ($io) results.  In
613       this call, "keep_meta_info" is disabled.
614
615   print_hr
616        $csv->print_hr ($io, $ref);
617
618       Provides an easy way to print a $ref as fetched with getline_hr
619       provided the column names are set with column_names.
620
621       It is just a wrapper method with basic parameter checks over
622
623        $csv->print ($io, [ map { $ref->{$_} } $csv->column_names ]);
624
625   column_names
626       Set the keys that will be used in the "getline_hr" calls. If no keys
627       (column names) are passed, it'll return the current setting.
628
       "column_names" accepts a list of scalars (the column names) or a single
       array_ref, so you can pass the return value of "getline":
631
632        $csv->column_names ($csv->getline ($io));
633
634       "column_names" does no checking on duplicates at all, which might lead
635       to unwanted results. Undefined entries will be replaced with the string
636       "\cAUNDEF\cA", so
637
638        $csv->column_names (undef, "", "name", "name");
639        $hr = $csv->getline_hr ($io);
640
641       Will set "$hr->{"\cAUNDEF\cA"}" to the 1st field, "$hr->{""}" to the
642       2nd field, and "$hr->{name}" to the 4th field, discarding the 3rd
643       field.
644
645       "column_names" croaks on invalid arguments.
646
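       The effect of undefined and duplicate names can be checked with a
       single record. A sketch, with made-up one-letter field values:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $data = "a,b,c,d\n";
my $csv  = Text::CSV_XS->new ({ binary => 1 });
open my $io, "<", \$data or die "in-memory handle: $!";

# One undefined name, one empty name, and a duplicate.
$csv->column_names (undef, "", "name", "name");
my $hr = $csv->getline_hr ($io);
close $io;

# Control characters in the key are made visible for printing.
for my $key (sort keys %$hr) {
    (my $shown = $key) =~ s/\cA/^A/g;
    print "'$shown' => '$hr->{$key}'\n";
    }
```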
647   bind_columns
648       Takes a list of references to scalars to be printed with "print" or to
649       store the fields fetched by "getline" in. When you don't pass enough
650       references to store the fetched fields in, "getline" will fail. If you
651       pass more than there are fields to return, the remaining references are
652       left untouched.
653
654        $csv->bind_columns (\$code, \$name, \$price, \$description);
655        while ($csv->getline ($io)) {
656            print "The price of a $name is \x{20ac} $price\n";
657            }
658
659       To reset or clear all column binding, call "bind_columns" with a single
660       argument "undef". This will also clear column names.
661
662        $csv->bind_columns (undef);
663
       If no arguments are passed at all, "bind_columns" will return the list
       of current bindings or "undef" if no binds are active.
666
667   eof
668        $eof = $csv->eof ();
669
670       If "parse" or "getline" was used with an IO stream, this method will
671       return true (1) if the last call hit end of file, otherwise it will
672       return false (''). This is useful to see the difference between a
673       failure and end of file.
674
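       A sketch of telling the two cases apart after a read loop.
       "read_all" is a hypothetical helper and both data sets are
       assumptions; the second contains a line with quotes inside an
       unquoted field, which stops the loop early:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical helper: read until getline fails, then report how many
# records were read and whether the parser stopped at end of file.
sub read_all {
    my ($data) = @_;
    my $csv = Text::CSV_XS->new ({ binary => 1 });
    open my $io, "<", \$data or die "in-memory handle: $!";
    my $n = 0;
    $n++ while $csv->getline ($io);
    my $at_eof = $csv->eof () ? 1 : 0;
    close $io;
    return ($n, $at_eof);
    }

my ($good_n, $good_eof) = read_all (qq{1,2\n3,4\n});
my ($bad_n,  $bad_eof)  = read_all (qq{1,2\na,b "c" d\n5,6\n});

print "good data: $good_n records, eof=$good_eof\n";
print "bad data:  $bad_n records, eof=$bad_eof\n";
```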
675   types
676        $csv->types (\@tref);
677
       This method forces columns to be of a given type. For example, if you
       have an integer column, two double columns and a string column, then
       you might do a
681
682        $csv->types ([Text::CSV_XS::IV (),
683                      Text::CSV_XS::NV (),
684                      Text::CSV_XS::NV (),
685                      Text::CSV_XS::PV ()]);
686
687       Column types are used only for decoding columns, in other words by the
688       "parse" and "getline" methods.
689
690       You can unset column types by doing a
691
692        $csv->types (undef);
693
694       or fetch the current type settings with
695
696        $types = $csv->types ();
697
698       IV  Set field type to integer.
699
700       NV  Set field type to numeric/float.
701
702       PV  Set field type to string.
703
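       A sketch of "types" combined with "parse". The input line is an
       assumption; the third column is declared as a string so that its
       leading zeros are kept:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1 });

# Integer, float, and string column, in that order.
$csv->types ([ Text::CSV_XS::IV (),
               Text::CSV_XS::NV (),
               Text::CSV_XS::PV () ]);

$csv->parse ("42,3.50,007") or die "parse failed: " . $csv->error_input ();
my @f = $csv->fields ();

printf "int=%d float=%.2f string=%s\n", @f;
```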
704   fields
705        @columns = $csv->fields ();
706
707       This object function returns the input to "combine" or the resultant
708       decomposed fields of a successful "parse", whichever was called more
709       recently.
710
711       Note that the return value is undefined after using "getline", which
712       does not fill the data structures returned by "parse".
713
714   meta_info
715        @flags = $csv->meta_info ();
716
717       This object function returns the flags of the input to "combine" or the
718       flags of the resultant decomposed fields of "parse", whichever was
719       called more recently.
720
721       For each field, a meta_info field will hold flags that tell something
722       about the field returned by the "fields" method or passed to the
723       "combine" method. The flags are bit-wise-or'd like:
724
       0x0001
         The field was quoted.

       0x0002
         The field was binary.
730
731       See the "is_***" methods below.
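
       The flags can be tested with a bit-wise and. The sketch below
       assumes that "keep_meta_info" is enabled in the constructor, as in
       the EXAMPLES section, and uses a made-up input line.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# keep_meta_info is assumed to be required for the flags to be kept.
my $csv = Text::CSV_XS->new ({ binary => 1, keep_meta_info => 1 });

$csv->parse (qq{1,"two",3}) or die "parse failed";
my @flags = $csv->meta_info ();

# Test the 0x0001 bit (field was quoted) for every field.
my @quoted = map { $_ & 0x0001 ? 1 : 0 } @flags;

for my $idx (0 .. $#flags) {
    printf "field %d: quoted=%d\n", $idx, $quoted[$idx];
    }
```

       The same information for a single column is available through
       "is_quoted".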
732
733   is_quoted
734        my $quoted = $csv->is_quoted ($column_idx);
735
736       Where $column_idx is the (zero-based) index of the column in the last
737       result of "parse".
738
739       This returns a true value if the data in the indicated column was
740       enclosed in "quote_char" quotes. This might be important for data where
741       ",20070108," is to be treated as a numeric value, and where
742       ","20070108"," is explicitly marked as character string data.
743
744   is_binary
745        my $binary = $csv->is_binary ($column_idx);
746
747       Where $column_idx is the (zero-based) index of the column in the last
748       result of "parse".
749
750       This returns a true value if the data in the indicated column contained
751       any byte in the range "[\x00-\x08,\x10-\x1F,\x7F-\xFF]".
752
753   is_missing
754        my $missing = $csv->is_missing ($column_idx);
755
756       Where $column_idx is the (zero-based) index of the column in the last
757       result of "getline_hr".
758
759        while (my $hr = $csv->getline_hr ($fh)) {
760            $csv->is_missing (0) and next; # This was an empty line
761            }
762
763       When using "getline_hr" for parsing, it is impossible to tell if the
764       fields are "undef" because they were not filled in the CSV stream or
765       because they were not read at all, as all the fields defined by
766       "column_names" are set in the hash-ref. If you still need to know if
767       all fields in each row are provided, you should enable "keep_meta_info"
768       so you can check the flags.
769
770   status
771        $status = $csv->status ();
772
773       This object function returns success (or failure) of "combine" or
774       "parse", whichever was called more recently.
775
776   error_input
777        $bad_argument = $csv->error_input ();
778
779       This object function returns the erroneous argument (if it exists) of
780       "combine" or "parse", whichever was called more recently. If the last
781       call was successful, "error_input" will return "undef".
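
       A sketch of how "status" and "error_input" can be combined after a
       failing "combine". It relies on the default of "binary" being off,
       so that a field with an embedded newline is rejected. The sample
       data is made up.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ();    # binary is off by default

# The second field holds a newline, which combine rejects without binary.
my $ok     = $csv->combine ("plain", "embedded\nnewline");
my $status = $csv->status ();
my $bad    = $csv->error_input ();

unless ($status) {
    (my $shown = $bad) =~ s/\n/\\n/g;
    print STDERR "combine () failed on argument: $shown\n";
    }
```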
782
783   error_diag
784        Text::CSV_XS->error_diag ();
785        $csv->error_diag ();
786        $error_code           = 0  + $csv->error_diag ();
787        $error_str            = "" . $csv->error_diag ();
788        ($cde, $str, $pos, $recno) = $csv->error_diag ();
789
790       If (and only if) an error occurred, this function returns the
791       diagnostics of that error.
792
793       If called in void context, it will print the internal error code and
794       the associated error message to STDERR.
795
796       If called in list context, it will return the error code and the error
797       message in that order. If the last error was from parsing, the third
798       value returned is a best guess at the location within the line that was
799       being parsed. Its value is 1-based. The fourth value represents the
800       record count parsed by this csv object. See examples/csv-check for how
801       this can be used.
802
803       If called in scalar context, it will return the diagnostics in a single
804       scalar, a-la $!. It will contain the error code in numeric context, and
805       the diagnostics message in string context.
806
807       When called as a class method or a direct function call, the error
808       diagnostics are those of the last "new" call.
809
810   record_number
811        $recno = $csv->record_number ();
812
813       Returns the number of records parsed by this csv instance. This value
814       should be more accurate than $. when embedded newlines come into play.
815       Records written by this instance are not counted.
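
       The sketch below reads two records from an in-memory file handle,
       the second of which spans two physical lines, and stores the value
       of "record_number" after each successful "getline". The data is
       made up for illustration.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Two records on three physical lines: the second has an embedded newline.
my $data = qq{id,note\n1,"line one\nline two"\n};
open my $fh, "<", \$data or die "in-memory open: $!";

my $csv = Text::CSV_XS->new ({ binary => 1 });
my (@rows, @recnos);
while (my $row = $csv->getline ($fh)) {
    push @rows,   $row;
    push @recnos, $csv->record_number ();   # count after this record
    }
close $fh;

print "record numbers seen: @recnos\n";
```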
816
817   SetDiag
818        $csv->SetDiag (0);
819
820       Use to reset the diagnostics if you are dealing with errors.
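
       As a sketch of how the diagnostics can be read and then reset, the
       following provokes a parse failure with an unterminated quoted
       field, reads "error_diag" in list and in scalar context, and
       clears the state with "SetDiag". The input line is made up.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ();

# A quoted field that is never closed makes parse fail.
$csv->parse (qq{1,"unterminated}) and die "expected a parse failure";

# List context: code, message, position and record number.
my ($code, $str, $pos, $recno) = $csv->error_diag ();

# Scalar context: one value that is numeric and string at the same time.
my $diag = $csv->error_diag ();
my $num  = 0  + $diag;
my $text = "" . $diag;

# Reset the diagnostics once the error has been dealt with.
$csv->SetDiag (0);
my $after_reset = 0 + $csv->error_diag ();

print STDERR "error $code: $str\n";
```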
821

INTERNALS

823       Combine (...)
824       Parse (...)
825
826       The arguments to these two internal functions are deliberately not
827       described or documented in order to enable the module author(s) to
828       change them when they feel the need. Using them is highly
829       discouraged as the API may change in future releases.
830

EXAMPLES

832   Reading a CSV file line by line:
833        my $csv = Text::CSV_XS->new ({ binary => 1 });
834        open my $fh, "<", "file.csv" or die "file.csv: $!";
835        while (my $row = $csv->getline ($fh)) {
836            # do something with @$row
837            }
838        $csv->eof or $csv->error_diag;
839        close $fh or die "file.csv: $!";
840
841   Parsing CSV strings:
842        my $csv = Text::CSV_XS->new ({ keep_meta_info => 1, binary => 1 });
843
844        my $sample_input_string =
845            qq{"I said, ""Hi!""",Yes,"",2.34,,"1.09","\x{20ac}",};
846        if ($csv->parse ($sample_input_string)) {
847            my @field = $csv->fields;
848            foreach my $col (0 .. $#field) {
849                my $quo = $csv->is_quoted ($col) ? $csv->{quote_char} : "";
850                printf "%2d: %s%s%s\n", $col, $quo, $field[$col], $quo;
851                }
852            }
853        else {
854            print STDERR "parse () failed on argument: ",
855                $csv->error_input, "\n";
856            $csv->error_diag ();
857            }
858
859   Printing CSV data
860       The fast way: using "print"
861
862       An example for creating CSV files using the "print" method, like in
863       dumping the content of a database ($dbh) table ($tbl) to CSV:
864
865        my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
866        open my $fh, ">", "$tbl.csv" or die "$tbl.csv: $!";
867        my $sth = $dbh->prepare ("select * from $tbl");
868        $sth->execute;
869        $csv->print ($fh, $sth->{NAME_lc});
870        while (my $row = $sth->fetch) {
871            $csv->print ($fh, $row) or $csv->error_diag;
872            }
873        close $fh or die "$tbl.csv: $!";
874
875       The slow way: using "combine" and "string"
876
877       or using the slower "combine" and "string" methods:
878
879        my $csv = Text::CSV_XS->new;
880
881        open my $csv_fh, ">", "hello.csv" or die "hello.csv: $!";
882
883        my @sample_input_fields = (
884            'You said, "Hello!"',   5.67,
885            '"Surely"',   '',   '3.14159');
886        if ($csv->combine (@sample_input_fields)) {
887            print $csv_fh $csv->string, "\n";
888            }
889        else {
890            print "combine () failed on argument: ",
891                $csv->error_input, "\n";
892            }
893        close $csv_fh or die "hello.csv: $!";
894
895   The examples folder
896       For more extended examples, see the examples/ (1) sub-directory in the
897       original distribution or the git repository (2).
898
899        1. http://repo.or.cz/w/Text-CSV_XS.git?a=tree;f=examples
900        2. http://repo.or.cz/w/Text-CSV_XS.git
901
902       The following files can be found there:
903
904       parser-xs.pl
905         This can be used as a boilerplate to `fix' bad CSV and parse beyond
906         errors.
907
908          $ perl examples/parser-xs.pl bad.csv >good.csv
909
910       csv-check
911         This is a command-line tool that uses parser-xs.pl techniques to
912         check the CSV file and report on its content.
913
914          $ csv-check files/utf8.csv
915          Checked with examples/csv-check 1.5 using Text::CSV_XS 0.81
916          OK: rows: 1, columns: 2
917              sep = <,>, quo = <">, bin = <1>
918
919       csv2xls
920         A script to convert CSV to Microsoft Excel. This requires Date::Calc
921         and Spreadsheet::WriteExcel. The converter accepts various options
922         and can produce UTF-8 Excel files.
923
924       csvdiff
925         A script that provides colorized diff on sorted CSV files, assuming
926         first line is header and first field is the key. Output options
927         include colorized ANSI escape codes or HTML.
928
929          $ csvdiff --html --output=diff.html file1.csv file2.csv
930

CAVEATS

932       "Text::CSV_XS" is not designed to detect the characters used to quote
933       and separate fields. The parsing is done using predefined settings. In
934       the examples sub-directory, you can find scripts that demonstrate how
935       you can try to detect these characters yourself.
936
937   Microsoft Excel
938       The import/export from Microsoft Excel is a risky task, according to
939       the documentation in "Text::CSV::Separator". Microsoft uses the
940       system's default list separator defined in the regional settings, which
941       happens to be a semicolon for Dutch, German and Spanish (and probably
942       some others as well).  For the English locale, the default is a comma.
943       In Windows however, the user is free to choose a predefined locale, and
944       then change every individual setting in it, so checking the locale is
945       no solution.
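
       When the separator of an export is already known, it can be set
       explicitly instead of being detected. A minimal sketch, assuming a
       semicolon-separated line as produced under one of the locales
       mentioned above (the sample line itself is made up):

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Assumption: the file is known to use a semicolon as list separator.
my $csv = Text::CSV_XS->new ({ binary => 1, sep_char => ";" });

$csv->parse (qq{1;"Amsterdam; NL";3,14}) or die "parse failed";
my @cells = $csv->fields ();

print join ("|", @cells), "\n";
```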
946

TODO

948       More Errors & Warnings
949         New extensions ought to be clear and concise in reporting what error
950         occurred where and why, and possibly also tell a remedy to the
951         problem.  error_diag is a (very) good start, but there is more work
952         to be done here.
953
954         Basic calls should croak or warn on illegal parameters. Errors should
955         be documented.
956
957       setting meta info
958         Future extensions might include extending the "meta_info",
959         "is_quoted", and "is_binary" to accept setting these flags for
960         fields, so you can specify which fields are quoted in the
961         "combine"/"string" combination.
962
963          $csv->meta_info (0, 1, 1, 3, 0, 0);
964          $csv->is_quoted (3, 1);
965
966       Parse the whole file at once
967         Implement new methods that enable parsing of a complete file at once,
968         returning a list of hashes. Possible extension to this could be to
969         enable a column selection on the call:
970
971          my @AoH = $csv->parse_file ($filename, { cols => [ 1, 4..8, 12 ]});
972
973         Returning something like
974
975          [ { fields => [ 1, 2, "foo", 4.5, undef, "", 8 ],
976              flags  => [ ... ],
977              },
978            { fields => [ ... ],
979              .
980              },
981            ]
982
983         Note that "getline_all" already returns all rows for an open stream,
984         but this will not return flags.
985
986   NOT TODO
987       combined methods
988         Requests for adding means (methods) that combine "combine" and
989         "string" in a single call will not be honored. Likewise for "parse"
990         and "fields". Given the trouble with embedded newlines, using
991         "getline" and "print" instead is the preferred way to go.
992
993   Release plan
994       No guarantees, but this is what I had in mind some time ago:
995
996       next
997          - This might very well be 1.00
998          - DIAGNOSTICS section in pod to *describe* the errors (see below)
999          - croak / carp
1000
1001       next + 1
1002          - csv2csv - a script to regenerate a CSV file to follow standards
1003

EBCDIC

1005       The hard-coding of characters and character ranges makes this module
1006       unusable on EBCDIC systems.
1007
1008       Opening EBCDIC encoded files on ASCII+ systems is likely to succeed
1009       using Encode's cp37, cp1047, or posix-bc:
1010
1011        open my $fh, "<:encoding(cp1047)", "ebcdic_file.csv" or die "...";
1012

DIAGNOSTICS

1014       Still under construction ...
1015
1016       If an error occurred, "$csv->error_diag" can be used to get more
1017       information on the cause of the failure. Note that for speed reasons,
1018       the internal value is never cleared on success, so using the value
1019       returned by "error_diag" in normal cases - when no error occurred - may
1020       cause unexpected results.
1021
1022       If the constructor failed, the cause can be found using "error_diag" as
1023       a class method, like "Text::CSV_XS->error_diag".
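
       A sketch of checking a failed constructor call. It assumes that
       "new" returns "undef" when the attributes are rejected, and uses a
       newline as "sep_char" to provoke an initialization error, which
       should correspond to one of the INI codes listed in this section.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# A newline is not allowed as sep_char, so the constructor should fail.
my $csv = Text::CSV_XS->new ({ sep_char => "\n" });

# Called as a class method, error_diag reports on the last new () call.
my ($code, $msg) = Text::CSV_XS->error_diag ();

defined $csv or print STDERR "new () failed: $code - $msg\n";
```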
1024
1025       "$csv->error_diag" is automatically called upon error when the
1026       constructor was called with "auto_diag" set to 1 or 2, or when "autodie"
1027       is in effect.  When set to 1, this will cause a "warn" with the error
1028       message, when set to 2, it will "die". "2012 - EOF" is excluded from
1029       "auto_diag" reports.
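
       With "auto_diag" set to 2, a failure can be trapped with "eval".
       The following is a sketch under that assumption, using a made-up
       input line with an unterminated quoted field.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# auto_diag => 2 turns errors into exceptions.
my $csv = Text::CSV_XS->new ({ auto_diag => 2 });

# The block returns 1 only if parse did not die.
my $ok   = eval { $csv->parse (qq{1,"unterminated}); 1 };
my $died = $ok ? "" : $@;

$ok or print STDERR "caught: $died";
```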
1030
1031       The errors as described below are available. I have tried to make the
1032       error itself explanatory enough, but more descriptions will be added.
1033       For most of these errors, the first three capitals describe the error
1034       category:
1035
1036       · INI
1037
1038         Initialization error or option conflict.
1039
1040       · ECR
1041
1042         Carriage-Return related parse error.
1043
1044       · EOF
1045
1046         End-Of-File related parse error.
1047
1048       · EIQ
1049
1050         Parse error inside quotation.
1051
1052       · EIF
1053
1054         Parse error inside field.
1055
1056       · ECB
1057
1058         Combine error.
1059
1060       · EHR
1061
1062         HashRef parse related error.
1063
1064       And below should be the complete list of error codes that can be
1065       returned:
1066
1067       · 1001 "INI - sep_char is equal to quote_char or escape_char"
1068
1069         The separation character cannot be equal to either the quotation
1070         character or the escape character, as that will invalidate all
1071         parsing rules.
1072
1073       · 1002 "INI - allow_whitespace with escape_char or quote_char SP or
1074         TAB"
1075
1076         Using "allow_whitespace" when either "escape_char" or "quote_char" is
1077         equal to SPACE or TAB is too ambiguous to allow.
1078
1079       · 1003 "INI - \r or \n in main attr not allowed"
1080
1081         Using default "eol" characters in either "sep_char", "quote_char", or
1082         "escape_char" is not allowed.
1083
1084       · 2010 "ECR - QUO char inside quotes followed by CR not part of EOL"
1085
1086         When "eol" has been set to something specific, other than the
1087         default, like "\r\t\n", and the "\r" is following the second
1088         (closing) "quote_char", where the characters following the "\r" do
1089         not make up the "eol" sequence, this is an error.
1090
1091       · 2011 "ECR - Characters after end of quoted field"
1092
1093         Sequences like "1,foo,"bar"baz,2" are not allowed. "bar" is a quoted
1094         field, and after the closing quote, there should be either a new-line
1095         sequence or a separation character.
1096
1097       · 2012 "EOF - End of data in parsing input stream"
1098
1099         Self-explanatory. End-of-file was hit while parsing a stream. Can
1100         happen only when reading from streams with "getline", as using
1101         "parse" is done on strings that are not required to have a trailing
1102         "eol".
1103
1104       · 2021 "EIQ - NL char inside quotes, binary off"
1105
1106         Sequences like "1,"foo\nbar",2" are allowed only when the binary
1107         option has been selected with the constructor.
1108
1109       · 2022 "EIQ - CR char inside quotes, binary off"
1110
1111         Sequences like "1,"foo\rbar",2" are allowed only when the binary
1112         option has been selected with the constructor.
1113
1114       · 2023 "EIQ - QUO character not allowed"
1115
1116         Sequences like ""foo "bar" baz",quux" and "2023,",2008-04-05,"Foo,
1117         Bar",\n" will cause this error.
1118
1119       · 2024 "EIQ - EOF cannot be escaped, not even inside quotes"
1120
1121         The escape character is not allowed as last character in an input
1122         stream.
1123
1124       · 2025 "EIQ - Loose unescaped escape"
1125
1126         An escape character should escape only characters that need escaping.
1127         Allowing the escape for other characters is possible with the
1128         "allow_loose_escape" attribute.
1129
1130       · 2026 "EIQ - Binary character inside quoted field, binary off"
1131
1132         Binary characters are not allowed by default. Exceptions are fields
1133         that contain valid UTF-8, which will automatically be upgraded if the
1134         content is valid UTF-8. Pass the "binary" attribute with a true value
1135         to accept binary characters.
1136
1137       · 2027 "EIQ - Quoted field not terminated"
1138
1139         When parsing a field that started with a quotation character, the
1140         field is expected to be closed with a quotation character. When the
1141         parsed line is exhausted before the quote is found, that field is not
1142         terminated.
1143
1144       · 2030 "EIF - NL char inside unquoted verbatim, binary off"
1145
1146       · 2031 "EIF - CR char is first char of field, not part of EOL"
1147
1148       · 2032 "EIF - CR char inside unquoted, not part of EOL"
1149
1150       · 2034 "EIF - Loose unescaped quote"
1151
1152       · 2035 "EIF - Escaped EOF in unquoted field"
1153
1154       · 2036 "EIF - ESC error"
1155
1156       · 2037 "EIF - Binary character in unquoted field, binary off"
1157
1158       · 2110 "ECB - Binary character in Combine, binary off"
1159
1160       · 2200 "EIO - print to IO failed. See errno"
1161
1162       · 3001 "EHR - Unsupported syntax for column_names ()"
1163
1164       · 3002 "EHR - getline_hr () called before column_names ()"
1165
1166       · 3003 "EHR - bind_columns () and column_names () fields count
1167         mismatch"
1168
1169       · 3004 "EHR - bind_columns () only accepts refs to scalars"
1170
1171       · 3006 "EHR - bind_columns () did not pass enough refs for parsed
1172         fields"
1173
1174       · 3007 "EHR - bind_columns needs refs to writable scalars"
1175
1176       · 3008 "EHR - unexpected error in bound fields"
1177
1178       · 3009 "EHR - print_hr () called before column_names ()"
1179
1180       · 3010 "EHR - print_hr () called with invalid arguments"
1181

AUTHORS and MAINTAINERS

1187       Alan Citterman <alan@mfgrtl.com> wrote the original Perl module.
1188       Please don't send mail concerning Text::CSV_XS to Alan, as he's not
1189       involved in the C part that is now the main part of the module.
1190
1191       Jochen Wiedmann <joe@ispsoft.de> rewrote the encoding and decoding in C
1192       by implementing a simple finite-state machine and added the variable
1193       quote, escape and separator characters, the binary mode and the print
1194       and getline methods. See ChangeLog releases 0.10 through 0.23.
1195
1196       H.Merijn Brand <h.m.brand@xs4all.nl> cleaned up the code, added the
1197       field flags methods, wrote the major part of the test suite, completed
1198       the documentation, fixed some RT bugs and added all the allow flags.
1199       See ChangeLog releases 0.25 and on.
1200

COPYRIGHT AND LICENSE

1202        Copyright (C) 2007-2013 H.Merijn Brand.  All rights reserved.
1203        Copyright (C) 1998-2001 Jochen Wiedmann. All rights reserved.
1204        Copyright (C) 1997      Alan Citterman.  All rights reserved.
1205
1206       This library is free software; you can redistribute it and/or modify it
1207       under the same terms as Perl itself.
1208
1209
1210
1211perl v5.16.3                      2013-06-13                         CSV_XS(3)