CSV_XS(3)          User Contributed Perl Documentation          CSV_XS(3)

Text::CSV_XS - comma-separated values manipulation routines

    use Text::CSV_XS;

    my @rows;
    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
    open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
    while (my $row = $csv->getline ($fh)) {
        $row->[2] =~ m/pattern/ or next; # 3rd field should match
        push @rows, $row;
    }
    close $fh;

    $csv->eol ("\r\n");
    open $fh, ">:encoding(utf8)", "new.csv" or die "new.csv: $!";
    $csv->print ($fh, $_) for @rows;
    close $fh or die "new.csv: $!";
24
26 Text::CSV_XS provides facilities for the composition and decomposition
27 of comma-separated values. An instance of the Text::CSV_XS class will
28 combine fields into a CSV string and parse a CSV string into fields.
29
The module accepts either strings or files as input and supports the use
of user-specified characters for delimiters, separators, and escapes.
32
33 Embedded newlines
34 Important Note: The default behavior is to accept only ASCII characters
35 in the range from 0x20 (space) to 0x7E (tilde). This means that fields
36 can not contain newlines. If your data contains newlines embedded in
37 fields, or characters above 0x7e (tilde), or binary data, you must set
38 "binary => 1" in the call to "new". To cover the widest range of
39 parsing options, you will always want to set binary.
40
Even then, you still have to pass a complete record to the "parse"
method, which is harder than the usual idiom suggests:

    my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
    while (<>) {                # WRONG!
        $csv->parse ($_);
        my @fields = $csv->fields ();
    }

will break, because "while (<>)" splits the input on every newline
without regard to quoting, so a field with an embedded newline reaches
"parse" as broken lines. If you need to support embedded newlines, the
way to go is to not pass "eol" in the parser (it accepts "\n", "\r", and
"\r\n" by default) and then

    my $csv = Text::CSV_XS->new ({ binary => 1 });
    open my $io, "<", $file or die "$file: $!";
    while (my $row = $csv->getline ($io)) {
        my @fields = @$row;
    }
59
60 The old(er) way of using global file handles is still supported
61
62 while (my $row = $csv->getline (*ARGV)) {
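
As an illustration of the point above, the following minimal sketch reads
a record with an embedded newline through "getline". The sample data and
the in-memory file handle are assumptions made for the example; a real
program would open a file instead.

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_XS;

# Hypothetical sample data: the second field of the first record
# contains an embedded newline inside double quotes.
my $data = qq{1,"line one\nline two",3\n4,5,6\n};
open my $io, "<", \$data or die "in-memory handle: $!";

# binary => 1 is required to accept the newline inside the quoted field.
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
my @rows;
while (my $row = $csv->getline ($io)) {
    push @rows, $row;
}
close $io;

printf "%d records read from 3 physical lines\n", scalar @rows;
```

Because "getline" does the reading itself, the parser decides where a
record ends, not the line-oriented read operator.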
63
64 Unicode
65 Unicode is only tested to work with perl-5.8.2 and up.
66
On parsing (both for "getline" and "parse"), if the source is marked
as UTF8, then all fields that are marked binary will also be marked
UTF8.
70
71 For complete control over encoding, please use Text::CSV::Encoded:
72
73 use Text::CSV::Encoded;
74 my $csv = Text::CSV::Encoded->new ({
75 encoding_in => "iso-8859-1", # the encoding comes into Perl
76 encoding_out => "cp1252", # the encoding comes out of Perl
77 });
78
79 $csv = Text::CSV::Encoded->new ({ encoding => "utf8" });
80 # combine () and print () accept *literally* utf8 encoded data
81 # parse () and getline () return *literally* utf8 encoded data
82
83 $csv = Text::CSV::Encoded->new ({ encoding => undef }); # default
84 # combine () and print () accept UTF8 marked data
85 # parse () and getline () return UTF8 marked data
86
On combining ("print" and "combine"), if any of the combining fields
was marked UTF8, the resulting string will be marked UTF8. Note,
however, that fields that precede the first UTF8-marked field and that
contain 8-bit characters not upgraded to UTF8 will remain bytes in the
resulting string, causing errors. If you pass data of different
encodings, or you do not know whether the encodings differ, force the
data to be upgraded before you pass it on:
94
95 $csv->print ($fh, [ map { utf8::upgrade (my $x = $_); $x } @data ]);
96
98 While no formal specification for CSV exists, RFC 4180 1) describes a
99 common format and establishes "text/csv" as the MIME type registered
100 with the IANA.
101
102 Many informal documents exist that describe the CSV format. How To: The
103 Comma Separated Value (CSV) File Format 2) provides an overview of the
104 CSV format in the most widely used applications and explains how it can
105 best be used and supported.
106
107 1) http://tools.ietf.org/html/rfc4180
108 2) http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
109
110 The basic rules are as follows:
111
112 CSV is a delimited data format that has fields/columns separated by the
113 comma character and records/rows separated by newlines. Fields that
114 contain a special character (comma, newline, or double quote), must be
115 enclosed in double quotes. However, if a line contains a single entry
116 that is the empty string, it may be enclosed in double quotes. If a
117 field's value contains a double quote character it is escaped by
118 placing another double quote character next to it. The CSV file format
119 does not require a specific character encoding, byte order, or line
120 terminator format.
121
122 · Each record is a single line ended by a line feed (ASCII/LF=0x0A) or
123 a carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A), however,
124 line-breaks may be embedded.
125
126 · Fields are separated by commas.
127
128 · Allowable characters within a CSV field include 0x09 (tab) and the
129 inclusive range of 0x20 (space) through 0x7E (tilde). In binary mode
130 all characters are accepted, at least in quoted fields.
131
· A field within CSV must be surrounded by double-quotes to contain
the separator character (comma).
134
135 Though this is the most clear and restrictive definition, Text::CSV_XS
136 is way more liberal than this, and allows extension:
137
138 · Line termination by a single carriage return is accepted by default
139
· The separation-, quote-, and escape- characters can be any ASCII
character in the range from 0x20 (space) to 0x7E (tilde). Characters
outside this range may or may not work as expected. Multibyte
characters, like U+060C (ARABIC COMMA), U+FF0C (FULLWIDTH COMMA),
U+241B (SYMBOL FOR ESCAPE), U+2424 (SYMBOL FOR NEWLINE), U+FF02
(FULLWIDTH QUOTATION MARK), and U+201C (LEFT DOUBLE QUOTATION MARK)
(to give some examples of what might look promising) are therefore not
allowed.
148
149 If you use perl-5.8.2 or higher, these three attributes are
150 utf8-decoded, to increase the likelihood of success. This way U+00FE
151 will be allowed as a quote character.
152
153 · A field within CSV must be surrounded by double-quotes to contain an
154 embedded double-quote, represented by a pair of consecutive double-
155 quotes. In binary mode you may additionally use the sequence ""0"
156 for representation of a NULL byte.
157
158 · Several violations of the above specification may be allowed by
159 passing options to the object creator.
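
The quoting rules listed above can be sketched with a small round trip
through "combine" and "parse". The field values are made up for the
example; the object uses the default attributes.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ();

# One plain field, one containing the separator, one containing quotes.
my @fields = ("plain", "with,comma", 'say "hi"');

$csv->combine (@fields) or die "combine failed on: " . $csv->error_input;
my $line = $csv->string;
print "$line\n";

# Parsing the generated line should give back the original fields.
$csv->parse ($line) or die "parse failed on: " . $csv->error_input;
my @back = $csv->fields;
```

Only the fields that need it are quoted, and the embedded double quotes
are doubled.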
160
162 version
163 (Class method) Returns the current module version.
164
165 new
(Class method) Returns a new instance of Text::CSV_XS. The object's
attributes are described by the (optional) hash ref "\%attr".
168
169 my $csv = Text::CSV_XS->new ({ attributes ... });
170
171 The following attributes are available:
172
173 eol An end-of-line string to add to rows.
174
When not passed in a parser instance, the default behavior is to
accept "\n", "\r", and "\r\n", so it is probably safer to not
specify "eol" at all. Passing "undef" or the empty string has the
same effect.
179
180 Common values for "eol" are "\012" ("\n" or Line Feed), "\015\012"
181 ("\r\n" or Carriage Return, Line Feed), and "\015" ("\r" or
182 Carriage Return). The "eol" attribute cannot exceed 7 (ASCII)
183 characters.
184
If both $/ and "eol" equal "\015", lines that end in only a Carriage
Return without Line Feed will be parsed correctly.
187
188 sep_char
The character used to separate fields, by default a comma (",").
Limited to a single-byte character, usually in the range from 0x20
(space) to 0x7e (tilde).
192
193 The separation character can not be equal to the quote character.
194 The separation character can not be equal to the escape character.
195
196 See also "CAVEATS"
197
198 allow_whitespace
When this option is set to true, whitespace (TABs and SPACEs)
surrounding the separation character is removed when parsing. If
either TAB or SPACE is one of the three major characters
"sep_char", "quote_char", or "escape_char", it will not be
considered whitespace.
204
205 Now lines like:
206
207 1 , "foo" , bar , 3 , zapp
208
209 are correctly parsed, even though it violates the CSV specs.
210
211 Note that all whitespace is stripped from start and end of each
212 field. That would make it more a feature than a way to enable
213 parsing bad CSV lines, as
214
215 1, 2.0, 3, ape , monkey
216
217 will now be parsed as
218
219 ("1", "2.0", "3", "ape", "monkey")
220
221 even if the original line was perfectly sane CSV.
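
A minimal sketch of this attribute, using the sample line shown above:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ allow_whitespace => 1 });

# Whitespace around the separators is stripped while parsing.
$csv->parse (q{1 , "foo" , bar , 3 , zapp})
    or die "parse failed on: " . $csv->error_input;
my @fields = $csv->fields;

print join ("|", @fields), "\n";
```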
222
223 blank_is_undef
224 Under normal circumstances, CSV data makes no distinction between
225 quoted- and unquoted empty fields. These both end up in an empty
226 string field once read, thus
227
228 1,"",," ",2
229
230 is read as
231
232 ("1", "", "", " ", "2")
233
234 When writing CSV files with "always_quote" set, the unquoted empty
235 field is the result of an undefined value. To make it possible to
236 also make this distinction when reading CSV data, the
237 "blank_is_undef" option will cause unquoted empty fields to be set
238 to undef, causing the above to be parsed as
239
240 ("1", "", undef, " ", "2")
241
242 empty_is_undef
243 Going one step further than "blank_is_undef", this attribute
244 converts all empty fields to undef, so
245
246 1,"",," ",2
247
248 is read as
249
250 (1, undef, undef, " ", 2)
251
Note that this affects only fields that are really empty, not
fields that are empty after stripping allowed whitespace. YMMV.
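
The difference between "blank_is_undef" and "empty_is_undef" can be
sketched by parsing the same line with each attribute. The helper "show"
is hypothetical and exists only to make undefined values visible.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $line = q{1,"",," ",2};

# Hypothetical helper: render undefined fields as the word undef.
sub show { join ", ", map { defined $_ ? "'$_'" : "undef" } @_ }

my $blank = Text::CSV_XS->new ({ blank_is_undef => 1 });
$blank->parse ($line) or die "parse failed";
my @b = $blank->fields;     # only the unquoted empty field is undef

my $empty = Text::CSV_XS->new ({ empty_is_undef => 1 });
$empty->parse ($line) or die "parse failed";
my @e = $empty->fields;     # quoted and unquoted empty fields are undef

print "blank_is_undef: ", show (@b), "\n";
print "empty_is_undef: ", show (@e), "\n";
```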
254
255 quote_char
256 The character to quote fields containing blanks, by default the
257 double quote character ("""). A value of undef suppresses quote
258 chars (for simple cases only). Limited to a single-byte character,
259 usually in the range from 0x20 (space) to 0x7e (tilde).
260
261 The quote character can not be equal to the separation character.
262
263 allow_loose_quotes
264 By default, parsing fields that have "quote_char" characters inside
265 an unquoted field, like
266
267 1,foo "bar" baz,42
268
269 would result in a parse error. Though it is still bad practice to
270 allow this format, we cannot help the fact some vendors make their
271 applications spit out lines styled that way.
272
273 If there is really bad CSV data, like
274
275 1,"foo "bar" baz",42
276
277 or
278
279 1,""foo bar baz"",42
280
281 there is a way to get that parsed, and leave the quotes inside the
282 quoted field as-is. This can be achieved by setting
283 "allow_loose_quotes" AND making sure that the "escape_char" is not
284 equal to "quote_char".
285
286 escape_char
287 The character to escape certain characters inside quoted fields.
288 Limited to a single-byte character, usually in the range from 0x20
289 (space) to 0x7e (tilde).
290
291 The "escape_char" defaults to being the literal double-quote mark
292 (""") in other words, the same as the default "quote_char". This
293 means that doubling the quote mark in a field escapes it:
294
295 "foo","bar","Escape ""quote mark"" with two ""quote marks""","baz"
296
297 If you change the default quote_char without changing the default
298 escape_char, the escape_char will still be the quote mark. If
299 instead you want to escape the quote_char by doubling it, you will
300 need to change the escape_char to be the same as what you changed
301 the quote_char to.
302
303 The escape character can not be equal to the separation character.
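
As a sketch of a non-default escape character, the example below uses a
backslash for "escape_char" while leaving "quote_char" at its default.
The sample line is made up for the example.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ escape_char => "\\" });

# Inside the quoted field, \" stands for a literal double quote.
$csv->parse (q{1,"say \"hi\"",2})
    or die "parse failed on: " . $csv->error_input;
my @fields = $csv->fields;

# Writing the fields back uses the same escape character.
$csv->combine (@fields) or die "combine failed";
my $line = $csv->string;

print "$fields[1]\n";
print "$line\n";
```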
304
305 allow_loose_escapes
306 By default, parsing fields that have "escape_char" characters that
307 escape characters that do not need to be escaped, like:
308
309 my $csv = Text::CSV_XS->new ({ escape_char => "\\" });
310 $csv->parse (qq{1,"my bar\'s",baz,42});
311
312 would result in a parse error. Though it is still bad practice to
313 allow this format, this option enables you to treat all escape
314 character sequences equal.
315
316 allow_unquoted_escape
317 There is a backward compatibility issue in that the escape
318 character, when differing from the quotation character, cannot be
319 on the first position of a field. e.g. with "quote_char" equal to
320 the default """ and "escape_char" set to "\", this would be
321 illegal:
322
323 1,\0,2
324
325 To overcome issues with backward compatibility, you can allow this
326 by setting this attribute to 1.
327
328 binary
329 If this attribute is TRUE, you may use binary characters in quoted
330 fields, including line feeds, carriage returns and NULL bytes. (The
331 latter must be escaped as ""0".) By default this feature is off.
332
If a string is marked UTF8, binary will be turned on automatically
when binary characters other than CR or NL are encountered. Note
that a simple string like "\x{00a0}" might still be binary, but not
marked UTF8, so setting "{ binary => 1 }" is still a wise option.
337
338 types
339 A set of column types; this attribute is immediately passed to the
340 "types" method. You must not set this attribute otherwise, except
341 for using the "types" method.
342
343 always_quote
By default the generated fields are quoted only if they need to be,
for example, if they contain the separator character. If you set
this attribute to a TRUE value, then all defined fields will be
quoted. ("undef" fields are not quoted, see "blank_is_undef".)
This is typically easier to handle in external applications. (Poor
creatures who are not using Text::CSV_XS. :-)
350
351 quote_space
By default, a space in a field triggers quotation. As no rule in CSV
either requires or forbids this, the default is true for safety. You
can exclude the space from this trigger by setting this attribute
to 0.
356
357 quote_null
358 By default, a NULL byte in a field would be escaped. This attribute
359 enables you to treat the NULL byte as a simple binary character in
360 binary mode (the "{ binary => 1 }" is set). The default is true.
361 You can prevent NULL escapes by setting this attribute to 0.
362
363 quote_binary
364 By default, all "unsafe" bytes inside a string cause the combined
365 field to be quoted. By setting this attribute to 0, you can disable
366 that trigger for bytes >= 0x7f.
367
368 keep_meta_info
369 By default, the parsing of input lines is as simple and fast as
370 possible. However, some parsing information - like quotation of
371 the original field - is lost in that process. Set this flag to true
372 to enable retrieving that information after parsing with the
373 methods "meta_info", "is_quoted", and "is_binary" described below.
374 Default is false.
375
376 verbatim
377 This is a quite controversial attribute to set, but it makes hard
378 things possible.
379
380 The basic thought behind this is to tell the parser that the
381 normally special characters newline (NL) and Carriage Return (CR)
382 will not be special when this flag is set, and be dealt with as
383 being ordinary binary characters. This will ease working with data
384 with embedded newlines.
385
386 When "verbatim" is used with "getline", "getline" auto-chomp's
387 every line.
388
389 Imagine a file format like
390
391 M^^Hans^Janssen^Klas 2\n2A^Ja^11-06-2007#\r\n
392
393 where, the line ending is a very specific "#\r\n", and the sep_char
394 is a ^ (caret). None of the fields is quoted, but embedded binary
395 data is likely to be present. With the specific line ending, that
396 should not be too hard to detect.
397
398 By default, Text::CSV_XS' parse function is instructed to only know
399 about "\n" and "\r" to be legal line endings, and so has to deal
400 with the embedded newline as a real end-of-line, so it can scan the
401 next line if binary is true, and the newline is inside a quoted
402 field. With this attribute, we tell parse () to parse the line as
403 if "\n" is just nothing more than a binary character.
404
405 For parse () this means that the parser has no idea about line
406 ending anymore, and getline () chomps line endings on reading.
407
408 auto_diag
Setting this to a true number between 1 and 9 will cause "error_diag"
to be called automatically in void context upon errors.
411
412 In case of error "2012 - EOF", this call will be void.
413
414 If set to a value greater than 1, it will die on errors instead of
415 warn. If set to anything unsupported, it will be silently ignored.
416
Future extensions to this feature will include more reliable
auto-detection of the "autodie" module being enabled, which will raise
the value of "auto_diag" by 1 at the moment the error is detected.
421
422 diag_verbose
423 Set the verbosity of the "auto_diag" output. Currently only adds
424 the current input line (if known) to the diagnostic output with an
425 indication of the position of the error.
426
427 To sum it up,
428
429 $csv = Text::CSV_XS->new ();
430
431 is equivalent to
432
433 $csv = Text::CSV_XS->new ({
434 quote_char => '"',
435 escape_char => '"',
436 sep_char => ',',
437 eol => $\,
438 always_quote => 0,
439 quote_space => 1,
440 quote_null => 1,
441 quote_binary => 1,
442 binary => 0,
443 keep_meta_info => 0,
444 allow_loose_quotes => 0,
445 allow_loose_escapes => 0,
446 allow_unquoted_escape => 0,
447 allow_whitespace => 0,
448 blank_is_undef => 0,
449 empty_is_undef => 0,
450 verbatim => 0,
451 auto_diag => 0,
452 diag_verbose => 0,
453 });
454
455 For all of the above mentioned flags, an accessor method is available
456 where you can inquire the current value, or change the value
457
458 my $quote = $csv->quote_char;
459 $csv->binary (1);
460
461 It is unwise to change these settings halfway through writing CSV data
462 to a stream. If however, you want to create a new stream using the
463 available CSV object, there is no harm in changing them.
464
465 If the "new" constructor call fails, it returns "undef", and makes the
466 fail reason available through the "error_diag" method.
467
468 $csv = Text::CSV_XS->new ({ ecs_char => 1 }) or
469 die "".Text::CSV_XS->error_diag ();
470
471 "error_diag" will return a string like
472
473 "INI - Unknown attribute 'ecs_char'"
474
475 print
476 $status = $csv->print ($io, $colref);
477
478 Similar to "combine" + "string" + "print", but way more efficient. It
479 expects an array ref as input (not an array!) and the resulting string
480 is not really created, but immediately written to the $io object,
481 typically an IO handle or any other object that offers a "print"
482 method.
483
484 For performance reasons the print method does not create a result
485 string. In particular the "string", "status", "fields", and
486 "error_input" methods are meaningless after executing this method.
487
488 If $colref is "undef" (explicit, not through a variable argument) and
489 "bind_columns" was used to specify fields to be printed, it is possible
490 to make performance improvements, as otherwise data would have to be
491 copied as arguments to the method call:
492
493 $csv->bind_columns (\($foo, $bar));
494 $status = $csv->print ($fh, undef);
495
496 A short benchmark
497
498 my @data = ("aa" .. "zz");
499 $csv->bind_columns (\(@data));
500
501 $csv->print ($io, [ @data ]); # 10800 recs/sec
502 $csv->print ($io, \@data ); # 57100 recs/sec
503 $csv->print ($io, undef ); # 50500 recs/sec
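
A minimal, self-contained sketch of "print" with array refs follows. It
writes to an in-memory file handle so the result can be inspected; the
records are made up for the example, and "eol" is set explicitly because
no line ending is added otherwise.

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_XS;

my $out = "";
open my $fh, ">", \$out or die "in-memory handle: $!";

my $csv = Text::CSV_XS->new ({ binary => 1, eol => "\n" });

# print () takes an array ref, not a list.
$csv->print ($fh, [ "P1", "red pear", 3 ]) or die "print failed";
$csv->print ($fh, [ "P2", "apple",    5 ]) or die "print failed";
close $fh or die "close: $!";

print $out;
```

With the default "quote_space", the field containing a space is quoted
and the others are written as-is.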
504
505 combine
506 $status = $csv->combine (@columns);
507
508 This object function constructs a CSV string from the arguments,
509 returning success or failure. Failure can result from lack of
510 arguments or an argument containing an invalid character. Upon
511 success, "string" can be called to retrieve the resultant CSV string.
512 Upon failure, the value returned by "string" is undefined and
513 "error_input" can be called to retrieve an invalid argument.
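
The failure case described above can be sketched as follows. The example
assumes the default attributes (binary mode off), under which a newline
inside a field counts as an invalid character.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Without binary mode, a field with a newline is rejected.
my $strict = Text::CSV_XS->new ();
my $ok     = $strict->combine ("ok", "two\nlines");
my $bad    = $strict->error_input;   # the offending argument

# With binary mode, the same field is accepted and quoted.
my $binary = Text::CSV_XS->new ({ binary => 1 });
$binary->combine ("ok", "two\nlines") or die "combine failed";
my $line = $binary->string;

print $ok ? "combined\n" : "combine () failed\n";
```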
514
515 string
516 $line = $csv->string ();
517
518 This object function returns the input to "parse" or the resultant CSV
519 string of "combine", whichever was called more recently.
520
521 getline
522 $colref = $csv->getline ($io);
523
524 This is the counterpart to "print", as "parse" is the counterpart to
525 "combine": It reads a row from the IO object using "$io->getline" and
526 parses this row into an array ref. This array ref is returned by the
527 function or undef for failure.
528
529 When fields are bound with "bind_columns", the return value is a
530 reference to an empty list.
531
532 The "string", "fields", and "status" methods are meaningless, again.
533
534 getline_all
535 $arrayref = $csv->getline_all ($io);
536 $arrayref = $csv->getline_all ($io, $offset);
537 $arrayref = $csv->getline_all ($io, $offset, $length);
538
539 This will return a reference to a list of getline ($io) results. In
540 this call, "keep_meta_info" is disabled. If $offset is negative, as
541 with "splice", only the last "abs ($offset)" records of $io are taken
542 into consideration.
543
544 Given a CSV file with 10 lines:
545
    lines call
    ----- ---------------------------------------------------------
    0..9  $csv->getline_all ($io)         # all
    0..9  $csv->getline_all ($io,  0)     # all
    8..9  $csv->getline_all ($io,  8)     # start at 8
    -     $csv->getline_all ($io,  0,  0) # start at 0 first 0 rows
    0..4  $csv->getline_all ($io,  0,  5) # start at 0 first 5 rows
    4..5  $csv->getline_all ($io,  4,  2) # start at 4 first 2 rows
    8..9  $csv->getline_all ($io, -2)     # last 2 rows
    6..7  $csv->getline_all ($io, -4,  2) # first 2 of last 4 rows
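
The offset and length arguments can be tried out with a sketch like the
one below. The ten-line sample data and the helper "rows_for" are
assumptions made for the example; the helper reopens the data for every
call so that each call starts at the first line.

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_XS;

# Hypothetical sample: ten one-field records, row0 .. row9.
my $data = join "", map { "row$_\n" } 0 .. 9;

# Hypothetical helper: run getline_all with the given offset/length
# and return the first field of every selected record.
sub rows_for {
    my @args = @_;
    open my $io, "<", \$data or die "in-memory handle: $!";
    my $csv  = Text::CSV_XS->new ({ binary => 1 });
    my $rows = $csv->getline_all ($io, @args);
    close $io;
    return join ",", map { $_->[0] } @$rows;
}

print rows_for (4, 2),  "\n";   # start at 4, first 2 rows
print rows_for (-2),    "\n";   # last 2 rows
print rows_for (-4, 2), "\n";   # first 2 of the last 4 rows
```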
556
557 parse
558 $status = $csv->parse ($line);
559
This object function decomposes a CSV string into fields, returning
success or failure. Failure can result from a lack of argument or from
an improperly formatted CSV string. Upon success, "fields" can be
called to retrieve the decomposed fields. Upon failure, the value
returned by "fields" is undefined and "error_input" can be called to
retrieve the invalid argument.

You may use the "types" method for setting column types. See the
description of "types" below.
569
570 getline_hr
571 The "getline_hr" and "column_names" methods work together to allow you
572 to have rows returned as hashrefs. You must call "column_names" first
573 to declare your column names.
574
575 $csv->column_names (qw( code name price description ));
576 $hr = $csv->getline_hr ($io);
577 print "Price for $hr->{name} is $hr->{price} EUR\n";
578
579 "getline_hr" will croak if called before "column_names".
580
Note that "getline_hr" creates a hashref for every row and will be much
slower than the combined use of "bind_columns" and "getline", which
still offers the same ease of use of a hashref inside the loop:
584
585 my @cols = @{$csv->getline ($io)};
586 $csv->column_names (@cols);
587 while (my $row = $csv->getline_hr ($io)) {
588 print $row->{price};
589 }
590
591 Could easily be rewritten to the much faster:
592
593 my @cols = @{$csv->getline ($io)};
594 my $row = {};
595 $csv->bind_columns (\@{$row}{@cols});
596 while ($csv->getline ($io)) {
597 print $row->{price};
598 }
599
Your mileage may vary for the size of the data and the number of rows.
With perl-5.14.2 the comparison for a 100_000 line file with 14 columns:

                 Rate hashrefs getlines
    hashrefs 1.00/s       --     -76%
    getlines 4.15/s     313%       --
606
607 getline_hr_all
608 $arrayref = $csv->getline_hr_all ($io);
609 $arrayref = $csv->getline_hr_all ($io, $offset);
610 $arrayref = $csv->getline_hr_all ($io, $offset, $length);
611
612 This will return a reference to a list of getline_hr ($io) results. In
613 this call, "keep_meta_info" is disabled.
614
615 print_hr
616 $csv->print_hr ($io, $ref);
617
618 Provides an easy way to print a $ref as fetched with getline_hr
619 provided the column names are set with column_names.
620
621 It is just a wrapper method with basic parameter checks over
622
623 $csv->print ($io, [ map { $ref->{$_} } $csv->column_names ]);
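
A short sketch of "print_hr", assuming made-up column names and an
in-memory file handle so the output can be inspected:

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_XS;

my $out = "";
open my $fh, ">", \$out or die "in-memory handle: $!";

my $csv = Text::CSV_XS->new ({ binary => 1, eol => "\n" });

# The column names determine the order in which the hash values are written.
$csv->column_names (qw( code name price ));
$csv->print_hr ($fh, { name => "pear", price => 2, code => "P1" });
close $fh or die "close: $!";

print $out;
```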
624
625 column_names
626 Set the keys that will be used in the "getline_hr" calls. If no keys
627 (column names) are passed, it'll return the current setting.
628
"column_names" accepts a list of scalars (the column names) or a single
array_ref, so you can pass the return value of "getline" directly:
631
632 $csv->column_names ($csv->getline ($io));
633
634 "column_names" does no checking on duplicates at all, which might lead
635 to unwanted results. Undefined entries will be replaced with the string
636 "\cAUNDEF\cA", so
637
638 $csv->column_names (undef, "", "name", "name");
639 $hr = $csv->getline_hr ($io);
640
641 Will set "$hr->{"\cAUNDEF\cA"}" to the 1st field, "$hr->{""}" to the
642 2nd field, and "$hr->{name}" to the 4th field, discarding the 3rd
643 field.
644
645 "column_names" croaks on invalid arguments.
646
647 bind_columns
648 Takes a list of references to scalars to be printed with "print" or to
649 store the fields fetched by "getline" in. When you don't pass enough
650 references to store the fetched fields in, "getline" will fail. If you
651 pass more than there are fields to return, the remaining references are
652 left untouched.
653
654 $csv->bind_columns (\$code, \$name, \$price, \$description);
655 while ($csv->getline ($io)) {
656 print "The price of a $name is \x{20ac} $price\n";
657 }
658
659 To reset or clear all column binding, call "bind_columns" with a single
660 argument "undef". This will also clear column names.
661
662 $csv->bind_columns (undef);
663
If no arguments are passed at all, "bind_columns" will return the list
of current bindings or "undef" if no binds are active.
666
667 eof
668 $eof = $csv->eof ();
669
670 If "parse" or "getline" was used with an IO stream, this method will
671 return true (1) if the last call hit end of file, otherwise it will
672 return false (''). This is useful to see the difference between a
673 failure and end of file.
674
675 types
676 $csv->types (\@tref);
677
678 This method is used to force that columns are of a given type. For
679 example, if you have an integer column, two double columns and a string
680 column, then you might do a
681
682 $csv->types ([Text::CSV_XS::IV (),
683 Text::CSV_XS::NV (),
684 Text::CSV_XS::NV (),
685 Text::CSV_XS::PV ()]);
686
687 Column types are used only for decoding columns, in other words by the
688 "parse" and "getline" methods.
689
690 You can unset column types by doing a
691
692 $csv->types (undef);
693
694 or fetch the current type settings with
695
696 $types = $csv->types ();
697
698 IV Set field type to integer.
699
700 NV Set field type to numeric/float.
701
702 PV Set field type to string.
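
The effect of column types on parsing can be sketched as below. The input
line is made up for the example, with leading and trailing zeros chosen
so that the conversion is visible.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ();
$csv->types ([
    Text::CSV_XS::IV (),    # integer
    Text::CSV_XS::NV (),    # numeric/float
    Text::CSV_XS::PV (),    # string, left as-is
    ]);

$csv->parse ("042,1.50,007") or die "parse failed on: " . $csv->error_input;
my @fields = $csv->fields;

print "@fields\n";
```

The first two fields come back as numbers, so the padding zeros are gone;
the string column keeps them.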
703
704 fields
705 @columns = $csv->fields ();
706
707 This object function returns the input to "combine" or the resultant
708 decomposed fields of a successful "parse", whichever was called more
709 recently.
710
711 Note that the return value is undefined after using "getline", which
712 does not fill the data structures returned by "parse".
713
714 meta_info
715 @flags = $csv->meta_info ();
716
717 This object function returns the flags of the input to "combine" or the
718 flags of the resultant decomposed fields of "parse", whichever was
719 called more recently.
720
721 For each field, a meta_info field will hold flags that tell something
722 about the field returned by the "fields" method or passed to the
723 "combine" method. The flags are bit-wise-or'd like:
724
0x0001
    The field was quoted.

0x0002
    The field was binary.
730
731 See the "is_***" methods below.
732
733 is_quoted
734 my $quoted = $csv->is_quoted ($column_idx);
735
736 Where $column_idx is the (zero-based) index of the column in the last
737 result of "parse".
738
739 This returns a true value if the data in the indicated column was
740 enclosed in "quote_char" quotes. This might be important for data where
741 ",20070108," is to be treated as a numeric value, and where
742 ","20070108"," is explicitly marked as character string data.
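
A minimal sketch of "is_quoted", using the two spellings of the value
mentioned above. "keep_meta_info" has to be enabled for the quotation
state to be retained.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ keep_meta_info => 1 });
$csv->parse (q{20070108,"20070108"})
    or die "parse failed on: " . $csv->error_input;

my @fields = $csv->fields;
for my $idx (0 .. $#fields) {
    printf "%d: %s (%s)\n", $idx, $fields[$idx],
        $csv->is_quoted ($idx) ? "quoted" : "unquoted";
}
```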
743
744 is_binary
745 my $binary = $csv->is_binary ($column_idx);
746
747 Where $column_idx is the (zero-based) index of the column in the last
748 result of "parse".
749
750 This returns a true value if the data in the indicated column contained
751 any byte in the range "[\x00-\x08,\x10-\x1F,\x7F-\xFF]".
752
753 is_missing
754 my $missing = $csv->is_missing ($column_idx);
755
756 Where $column_idx is the (zero-based) index of the column in the last
757 result of "getline_hr".
758
759 while (my $hr = $csv->getline_hr ($fh)) {
760 $csv->is_missing (0) and next; # This was an empty line
761 }
762
When using "getline_hr" for parsing, it is impossible to tell if the
fields are "undef" because they were not filled in the CSV stream or
because they were not read at all, as all the fields defined by
"column_names" are set in the hash-ref. If you still need to know if
all fields in each row are provided, you should enable "keep_meta_info"
so you can check the flags.
769
770 status
771 $status = $csv->status ();
772
773 This object function returns success (or failure) of "combine" or
774 "parse", whichever was called more recently.
775
776 error_input
777 $bad_argument = $csv->error_input ();
778
779 This object function returns the erroneous argument (if it exists) of
780 "combine" or "parse", whichever was called more recently. If the last
781 call was successful, "error_input" will return "undef".
782
783 error_diag
784 Text::CSV_XS->error_diag ();
785 $csv->error_diag ();
786 $error_code = 0 + $csv->error_diag ();
787 $error_str = "" . $csv->error_diag ();
788 ($cde, $str, $pos, $recno) = $csv->error_diag ();
789
790 If (and only if) an error occurred, this function returns the
791 diagnostics of that error.
792
793 If called in void context, it will print the internal error code and
794 the associated error message to STDERR.
795
If called in list context, it will return the error code and the error
message in that order. If the last error was from parsing, the third
value returned is a best guess at the location within the line that was
being parsed. Its value is 1-based. The fourth value represents the
record count parsed by this csv object. See examples/csv-check for how
this can be used.
802
803 If called in scalar context, it will return the diagnostics in a single
804 scalar, a-la $!. It will contain the error code in numeric context, and
805 the diagnostics message in string context.
806
807 When called as a class method or a direct function call, the error
808 diagnostics is that of the last "new" call.
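
The calling contexts described above can be sketched with a deliberately
broken input line. The line is an assumption made for the example: its
quoted field is never closed, so "parse" is expected to fail.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ();
my $ok  = $csv->parse (q{1,"unterminated,3});

# List context: code, message, and position within the line.
my ($code, $msg, $pos) = $csv->error_diag;

# Scalar context: a single value that is numeric or string on demand.
my $as_number = 0  + $csv->error_diag;
my $as_string = "" . $csv->error_diag;

$ok or print "parse failed with error $code: $msg\n";
```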
809
810 record_number
811 $recno = $csv->record_number ();
812
Returns the number of records parsed by this csv instance. This value
should be more accurate than $. when embedded newlines come into play.
Records written by this instance are not counted.
816
817 SetDiag
818 $csv->SetDiag (0);
819
820 Use to reset the diagnostics if you are dealing with errors.
821
823 Combine (...)
824 Parse (...)
825
826 The arguments to these two internal functions are deliberately not
827 described or documented in order to enable the module author(s) to
828 change it when they feel the need for it. Using them is highly
829 discouraged as the API may change in future releases.
830
Reading a CSV file line by line:
    my $csv = Text::CSV_XS->new ({ binary => 1 });
    open my $fh, "<", "file.csv" or die "file.csv: $!";
    while (my $row = $csv->getline ($fh)) {
        # do something with @$row
    }
    $csv->eof or $csv->error_diag;
    close $fh or die "file.csv: $!";

Parsing CSV strings:
    my $csv = Text::CSV_XS->new ({ keep_meta_info => 1, binary => 1 });

    my $sample_input_string =
        qq{"I said, ""Hi!""",Yes,"",2.34,,"1.09","\x{20ac}",};
    if ($csv->parse ($sample_input_string)) {
        my @field = $csv->fields;
        foreach my $col (0 .. $#field) {
            my $quo = $csv->is_quoted ($col) ? $csv->{quote_char} : "";
            printf "%2d: %s%s%s\n", $col, $quo, $field[$col], $quo;
        }
    }
    else {
        print STDERR "parse () failed on argument: ",
            $csv->error_input, "\n";
        $csv->error_diag ();
    }
858
859 Printing CSV data
860 The fast way: using "print"
861
An example of creating CSV files using the "print" method, such as
dumping the content of a database ($dbh) table ($tbl) to CSV:
864
865 my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
866 open my $fh, ">", "$tbl.csv" or die "$tbl.csv: $!";
867 my $sth = $dbh->prepare ("select * from $tbl");
868 $sth->execute;
869 $csv->print ($fh, $sth->{NAME_lc});
870 while (my $row = $sth->fetch) {
871 $csv->print ($fh, $row) or $csv->error_diag;
872 }
873 close $fh or die "$tbl.csv: $!";
874
875 The slow way: using "combine" and "string"
876
Alternatively, use the slower "combine" and "string" methods:
878
879 my $csv = Text::CSV_XS->new;
880
881 open my $csv_fh, ">", "hello.csv" or die "hello.csv: $!";
882
883 my @sample_input_fields = (
884 'You said, "Hello!"', 5.67,
885 '"Surely"', '', '3.14159');
886 if ($csv->combine (@sample_input_fields)) {
887 print $csv_fh $csv->string, "\n";
888 }
889 else {
890 print "combine () failed on argument: ",
891 $csv->error_input, "\n";
892 }
893 close $csv_fh or die "hello.csv: $!";
894
895 The examples folder
For more extended examples, see the examples/ (1) sub-directory in the
original distribution or the git repository (2).
898
899 1. http://repo.or.cz/w/Text-CSV_XS.git?a=tree;f=examples
900 2. http://repo.or.cz/w/Text-CSV_XS.git
901
902 The following files can be found there:
903
904 parser-xs.pl
905 This can be used as a boilerplate to `fix' bad CSV and parse beyond
906 errors.
907
908 $ perl examples/parser-xs.pl bad.csv >good.csv
909
910 csv-check
911 This is a command-line tool that uses parser-xs.pl techniques to
912 check the CSV file and report on its content.
913
914 $ csv-check files/utf8.csv
915 Checked with examples/csv-check 1.5 using Text::CSV_XS 0.81
916 OK: rows: 1, columns: 2
917 sep = <,>, quo = <">, bin = <1>
918
919 csv2xls
920 A script to convert CSV to Microsoft Excel. This requires Date::Calc
921 and Spreadsheet::WriteExcel. The converter accepts various options
922 and can produce UTF-8 Excel files.
923
924 csvdiff
925 A script that provides colorized diff on sorted CSV files, assuming
926 first line is header and first field is the key. Output options
927 include colorized ANSI escape codes or HTML.
928
929 $ csvdiff --html --output=diff.html file1.csv file2.csv
930
932 "Text::CSV_XS" is not designed to detect the characters used to quote
933 and separate fields. The parsing is done using predefined settings. In
934 the examples sub-directory, you can find scripts that demonstrate how
935 you can try to detect these characters yourself.
936
937 Microsoft Excel
938 The import/export from Microsoft Excel is a risky task, according to
939 the documentation in "Text::CSV::Separator". Microsoft uses the
940 system's default list separator defined in the regional settings, which
941 happens to be a semicolon for Dutch, German and Spanish (and probably
942 some others as well). For the English locale, the default is a comma.
In Windows, however, the user is free to choose a predefined locale and
then change every individual setting in it, so checking the locale is
no solution.
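
If the separator of an exported file is already known, it can be passed explicitly. The following is a sketch only, assuming a semicolon-separated export and a made-up sample line; it does not detect the separator:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Assumption: the file was exported with ";" as the list separator
my $csv = Text::CSV_XS->new ({ binary => 1, sep_char => ";" });

# Commas are now ordinary data; a quoted ";" is data as well
my $line = q{"Amsterdam, NL";1,5;"x;y"};
$csv->parse ($line) or die "parse failed: " . $csv->error_diag ();

my @fields = $csv->fields ();
print scalar @fields, " fields\n";
print "  <$_>\n" for @fields;
```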
946
948 More Errors & Warnings
New extensions ought to be clear and concise in reporting what error
occurred where and why, and possibly also suggest a remedy to the
problem. error_diag is a (very) good start, but there is more work
to be done here.
953
954 Basic calls should croak or warn on illegal parameters. Errors should
955 be documented.
956
957 setting meta info
958 Future extensions might include extending the "meta_info",
959 "is_quoted", and "is_binary" to accept setting these flags for
960 fields, so you can specify which fields are quoted in the
961 "combine"/"string" combination.
962
963 $csv->meta_info (0, 1, 1, 3, 0, 0);
964 $csv->is_quoted (3, 1);
965
966 Parse the whole file at once
967 Implement new methods that enable parsing of a complete file at once,
968 returning a list of hashes. Possible extension to this could be to
969 enable a column selection on the call:
970
971 my @AoH = $csv->parse_file ($filename, { cols => [ 1, 4..8, 12 ]});
972
973 Returning something like
974
975 [ { fields => [ 1, 2, "foo", 4.5, undef, "", 8 ],
976 flags => [ ... ],
977 },
978 { fields => [ ... ],
979 .
980 },
981 ]
982
983 Note that "getline_all" already returns all rows for an open stream,
984 but this will not return flags.
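
Until such a method exists, something close to a list of hashes can be built by hand on top of "getline_all". This is a rough sketch with made-up data, assuming the first row holds the column names; it does not provide the flags mentioned above:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $data = qq{name,qty\napple,3\npear,5\n};
open my $fh, "<", \$data or die "in-memory handle: $!";

my $csv  = Text::CSV_XS->new ({ binary => 1 });
my $rows = $csv->getline_all ($fh);    # reference to a list of array refs
close $fh;

# Treat the first row as header and turn the rest into hashes
my $header = shift @$rows;
my @AoH;
for my $row (@$rows) {
    my %rec;
    @rec{@$header} = @$row;
    push @AoH, \%rec;
}
print "$_->{name}: $_->{qty}\n" for @AoH;
```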
985
986 NOT TODO
987 combined methods
988 Requests for adding means (methods) that combine "combine" and
989 "string" in a single call will not be honored. Likewise for "parse"
990 and "fields". Given the trouble with embedded newlines, using
991 "getline" and "print" instead is the preferred way to go.
992
993 Release plan
994 No guarantees, but this is what I had in mind some time ago:
995
996 next
997 - This might very well be 1.00
- DIAGNOSTICS section in pod to *describe* the errors (see below)
999 - croak / carp
1000
1001 next + 1
1002 - csv2csv - a script to regenerate a CSV file to follow standards
1003
1005 The hard-coding of characters and character ranges makes this module
1006 unusable on EBCDIC systems.
1007
1008 Opening EBCDIC encoded files on ASCII+ systems is likely to succeed
1009 using Encode's cp37, cp1047, or posix-bc:
1010
1011 open my $fh, "<:encoding(cp1047)", "ebcdic_file.csv" or die "...";
1012
1014 Still under construction ...
1015
If an error occurred, "$csv->error_diag" can be used to get more
1017 information on the cause of the failure. Note that for speed reasons,
1018 the internal value is never cleared on success, so using the value
1019 returned by "error_diag" in normal cases - when no error occurred - may
1020 cause unexpected results.
1021
If the constructor failed, the cause can be found using "error_diag" as
a class method, like "Text::CSV_XS->error_diag".
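
A sketch of a failing constructor call, using the option conflict documented as error 1001 in the list of error codes (a "sep_char" equal to the default "quote_char"); the variable names are illustrative:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# sep_char equals the default quote_char, which is an option conflict
my $csv = Text::CSV_XS->new ({ sep_char => q{"} });

my ($code, $msg) = (0, "");
unless (defined $csv) {
    # No object exists, so ask the class for the diagnostics
    ($code, $msg) = Text::CSV_XS->error_diag ();
    print "new () failed: $code - $msg\n";
}
```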
1024
"$csv->error_diag" is automatically called upon error when the
constructor was called with "auto_diag" set to 1 or 2, or when "autodie"
is in effect. When set to 1, this will cause a "warn" with the error
message; when set to 2, it will "die". "2012 - EOF" is excluded from
"auto_diag" reports.
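
The following sketch shows "auto_diag" set to 2 turning a parse failure into an exception that can be caught with "eval". The input string is made up, and the exact wording of the message may differ between releases:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ auto_diag => 2 });

# The unterminated quote triggers error_diag, which dies at level 2
my $ok  = eval { $csv->parse (q{1,"foo}); 1 };
my $err = $@;

$ok or print "parse died: $err";
```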
1030
1031 The errors as described below are available. I have tried to make the
1032 error itself explanatory enough, but more descriptions will be added.
1033 For most of these errors, the first three capitals describe the error
1034 category:
1035
1036 · INI
1037
1038 Initialization error or option conflict.
1039
1040 · ECR
1041
1042 Carriage-Return related parse error.
1043
1044 · EOF
1045
1046 End-Of-File related parse error.
1047
1048 · EIQ
1049
1050 Parse error inside quotation.
1051
1052 · EIF
1053
1054 Parse error inside field.
1055
1056 · ECB
1057
1058 Combine error.
1059
1060 · EHR
1061
1062 HashRef parse related error.
1063
1064 And below should be the complete list of error codes that can be
1065 returned:
1066
1067 · 1001 "INI - sep_char is equal to quote_char or escape_char"
1068
1069 The separation character cannot be equal to either the quotation
1070 character or the escape character, as that will invalidate all
1071 parsing rules.
1072
1073 · 1002 "INI - allow_whitespace with escape_char or quote_char SP or
1074 TAB"
1075
1076 Using "allow_whitespace" when either "escape_char" or "quote_char" is
1077 equal to SPACE or TAB is too ambiguous to allow.
1078
1079 · 1003 "INI - \r or \n in main attr not allowed"
1080
1081 Using default "eol" characters in either "sep_char", "quote_char", or
1082 "escape_char" is not allowed.
1083
1084 · 2010 "ECR - QUO char inside quotes followed by CR not part of EOL"
1085
1086 When "eol" has been set to something specific, other than the
1087 default, like "\r\t\n", and the "\r" is following the second
1088 (closing) "quote_char", where the characters following the "\r" do
1089 not make up the "eol" sequence, this is an error.
1090
1091 · 2011 "ECR - Characters after end of quoted field"
1092
1093 Sequences like "1,foo,"bar"baz,2" are not allowed. "bar" is a quoted
1094 field, and after the closing quote, there should be either a new-line
1095 sequence or a separation character.
1096
1097 · 2012 "EOF - End of data in parsing input stream"
1098
Self-explanatory. End-of-file was reached while parsing a stream. Can
happen only when reading from streams with "getline", as using
"parse" is done on strings that are not required to have a trailing
"eol".
1103
1104 · 2021 "EIQ - NL char inside quotes, binary off"
1105
1106 Sequences like "1,"foo\nbar",2" are allowed only when the binary
1107 option has been selected with the constructor.
1108
1109 · 2022 "EIQ - CR char inside quotes, binary off"
1110
1111 Sequences like "1,"foo\rbar",2" are allowed only when the binary
1112 option has been selected with the constructor.
1113
1114 · 2023 "EIQ - QUO character not allowed"
1115
1116 Sequences like ""foo "bar" baz",quux" and "2023,",2008-04-05,"Foo,
1117 Bar",\n" will cause this error.
1118
1119 · 2024 "EIQ - EOF cannot be escaped, not even inside quotes"
1120
1121 The escape character is not allowed as last character in an input
1122 stream.
1123
1124 · 2025 "EIQ - Loose unescaped escape"
1125
1126 An escape character should escape only characters that need escaping.
1127 Allowing the escape for other characters is possible with the
1128 "allow_loose_escape" attribute.
1129
1130 · 2026 "EIQ - Binary character inside quoted field, binary off"
1131
Binary characters are not allowed by default. Exceptions are fields
that contain valid UTF-8, which will automatically be upgraded. Pass
the "binary" attribute with a true value to accept binary characters.
1136
1137 · 2027 "EIQ - Quoted field not terminated"
1138
1139 When parsing a field that started with a quotation character, the
1140 field is expected to be closed with a quotation character. When the
1141 parsed line is exhausted before the quote is found, that field is not
1142 terminated.
1143
1144 · 2030 "EIF - NL char inside unquoted verbatim, binary off"
1145
1146 · 2031 "EIF - CR char is first char of field, not part of EOL"
1147
1148 · 2032 "EIF - CR char inside unquoted, not part of EOL"
1149
1150 · 2034 "EIF - Loose unescaped quote"
1151
1152 · 2035 "EIF - Escaped EOF in unquoted field"
1153
1154 · 2036 "EIF - ESC error"
1155
1156 · 2037 "EIF - Binary character in unquoted field, binary off"
1157
1158 · 2110 "ECB - Binary character in Combine, binary off"
1159
1160 · 2200 "EIO - print to IO failed. See errno"
1161
1162 · 3001 "EHR - Unsupported syntax for column_names ()"
1163
1164 · 3002 "EHR - getline_hr () called before column_names ()"
1165
1166 · 3003 "EHR - bind_columns () and column_names () fields count
1167 mismatch"
1168
1169 · 3004 "EHR - bind_columns () only accepts refs to scalars"
1170
1171 · 3006 "EHR - bind_columns () did not pass enough refs for parsed
1172 fields"
1173
1174 · 3007 "EHR - bind_columns needs refs to writable scalars"
1175
1176 · 3008 "EHR - unexpected error in bound fields"
1177
1178 · 3009 "EHR - print_hr () called before column_names ()"
1179
1180 · 3010 "EHR - print_hr () called with invalid arguments"
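
To see one of the parse errors from the list in practice, the sketch below parses the sample sequence given for error 2021 once without and once with the "binary" attribute. The %result hash is only there for illustration:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# A quoted field with an embedded newline
my $line = qq{1,"foo\nbar",2};

my %result;
for my $binary (0, 1) {
    my $csv = Text::CSV_XS->new ({ binary => $binary });
    if ($csv->parse ($line)) {
        my @fields = $csv->fields ();
        $result{$binary} = scalar @fields;
        printf "binary=%d: parsed %d fields\n", $binary, scalar @fields;
    }
    else {
        my ($code, $msg) = $csv->error_diag ();
        $result{$binary} = 0;
        printf "binary=%d: error %d (%s)\n", $binary, $code, $msg;
    }
}
```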
1181
1183 perl, IO::File, IO::Handle, IO::Wrap, Text::CSV, Text::CSV_PP,
1184 Text::CSV::Encoded, Text::CSV::Separator, and Spreadsheet::Read.
1185
1187 Alan Citterman <alan@mfgrtl.com> wrote the original Perl module.
1188 Please don't send mail concerning Text::CSV_XS to Alan, as he's not
1189 involved in the C part that is now the main part of the module.
1190
1191 Jochen Wiedmann <joe@ispsoft.de> rewrote the encoding and decoding in C
1192 by implementing a simple finite-state machine and added the variable
1193 quote, escape and separator characters, the binary mode and the print
1194 and getline methods. See ChangeLog releases 0.10 through 0.23.
1195
1196 H.Merijn Brand <h.m.brand@xs4all.nl> cleaned up the code, added the
1197 field flags methods, wrote the major part of the test suite, completed
1198 the documentation, fixed some RT bugs and added all the allow flags.
1199 See ChangeLog releases 0.25 and on.
1200
1202 Copyright (C) 2007-2013 H.Merijn Brand. All rights reserved.
1203 Copyright (C) 1998-2001 Jochen Wiedmann. All rights reserved.
1204 Copyright (C) 1997 Alan Citterman. All rights reserved.
1205
1206 This library is free software; you can redistribute it and/or modify it
1207 under the same terms as Perl itself.
1208
1209
1210
1211perl v5.16.3 2013-06-13 CSV_XS(3)