1 CSV_XS(3) User Contributed Perl Documentation CSV_XS(3)
2
3
4
5 NAME
6 Text::CSV_XS - comma-separated values manipulation routines
7
8 SYNOPSIS
9 # Functional interface
10 use Text::CSV_XS qw( csv );
11
12 # Read whole file in memory
13 my $aoa = csv (in => "data.csv"); # as array of array
14 my $aoh = csv (in => "data.csv",
15 headers => "auto"); # as array of hash
16
17 # Write array of arrays as csv file
18 csv (in => $aoa, out => "file.csv", sep_char => ";");
19
20 # Only show lines where "code" is odd
21 csv (in => "data.csv", filter => { code => sub { $_ % 2 }});
22
23
24 # Object interface
25 use Text::CSV_XS;
26
27 my @rows;
28 # Read/parse CSV
29 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
30 open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
31 while (my $row = $csv->getline ($fh)) {
32 $row->[2] =~ m/pattern/ or next; # 3rd field should match
33 push @rows, $row;
34 }
35 close $fh;
36
37 # and write as CSV
38 open $fh, ">:encoding(utf8)", "new.csv" or die "new.csv: $!";
39 $csv->say ($fh, $_) for @rows;
40 close $fh or die "new.csv: $!";
41
42 DESCRIPTION
43 Text::CSV_XS provides facilities for the composition and
44 decomposition of comma-separated values. An instance of the
45 Text::CSV_XS class will combine fields into a "CSV" string and parse a
46 "CSV" string into fields.
47
48 The module accepts either strings or files as input and supports the
49 use of user-specified characters for delimiters, separators, and
50 escapes.
51
52 Embedded newlines
53 Important Note: The default behavior is to accept only ASCII
54 characters in the range from 0x20 (space) to 0x7E (tilde). This means
55 that the fields can not contain newlines. If your data contains
56 newlines embedded in fields, or characters above 0x7E (tilde), or
57 binary data, you must set "binary => 1" in the call to "new". To cover
58 the widest range of parsing options, you will always want to set
59 binary.
60
61 But you still have the problem that you have to pass a correct line to
62 the "parse" method, which is more complicated from the usual point of
63 usage:
64
65 my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
66 while (<>) { # WRONG!
67 $csv->parse ($_);
68 my @fields = $csv->fields ();
69 }
70
71 This will break, as the "while" might read partial records: it does not
72 care about the quoting. If you need to support embedded newlines, the
73 way to go is to not pass "eol" in the parser (it accepts "\n", "\r",
74 and "\r\n" by default) and then
75
76 my $csv = Text::CSV_XS->new ({ binary => 1 });
77 open my $fh, "<", $file or die "$file: $!";
78 while (my $row = $csv->getline ($fh)) {
79 my @fields = @$row;
80 }
81
82 The old(er) way of using global file handles is still supported
83
84 while (my $row = $csv->getline (*ARGV)) { ... }
85
86 Unicode
87 Unicode is only tested to work with perl-5.8.2 and up.
88
89 See also "BOM".
90
91 The simplest way to ensure the correct encoding is used for in- and
92 output is by either setting layers on the filehandles, or setting the
93 "encoding" argument for "csv".
94
95 open my $fh, "<:encoding(UTF-8)", "in.csv" or die "in.csv: $!";
96 or
97 my $aoa = csv (in => "in.csv", encoding => "UTF-8");
98
99 open my $fh, ">:encoding(UTF-8)", "out.csv" or die "out.csv: $!";
100 or
101 csv (in => $aoa, out => "out.csv", encoding => "UTF-8");
102
103 On parsing (both for "getline" and "parse"), if the source is marked
104 as being UTF8, then all fields that are marked binary will also be marked
105 UTF8.
106
107 On combining ("print" and "combine"): if any of the combining fields
108 was marked UTF8, the resulting string will be marked as UTF8. Note
109 however that any fields coming before the first field marked UTF8
110 which contain 8-bit characters that were not upgraded to UTF8 will
111 remain "bytes" in the resulting string, possibly causing unexpected
112 errors. If you pass data in different encodings, or do not know whether
113 the encodings differ, force the data to be upgraded before you pass
114 it on:
115
116 $csv->print ($fh, [ map { utf8::upgrade (my $x = $_); $x } @data ]);
117
118 For complete control over encoding, please use Text::CSV::Encoded:
119
120 use Text::CSV::Encoded;
121 my $csv = Text::CSV::Encoded->new ({
122 encoding_in => "iso-8859-1", # the encoding comes into Perl
123 encoding_out => "cp1252", # the encoding comes out of Perl
124 });
125
126 $csv = Text::CSV::Encoded->new ({ encoding => "utf8" });
127 # combine () and print () accept *literally* utf8 encoded data
128 # parse () and getline () return *literally* utf8 encoded data
129
130 $csv = Text::CSV::Encoded->new ({ encoding => undef }); # default
131 # combine () and print () accept UTF8 marked data
132 # parse () and getline () return UTF8 marked data
133
134 BOM
135 BOM (or Byte Order Mark) handling is available only inside the
136 "header" method. This method supports the following encodings:
137 "utf-8", "utf-1", "utf-32be", "utf-32le", "utf-16be", "utf-16le",
138 "utf-ebcdic", "scsu", "bocu-1", and "gb-18030". See Wikipedia
139 <https://en.wikipedia.org/wiki/Byte_order_mark>.
140
141 If a file has a BOM, the easiest way to deal with that is
142
143 my $aoh = csv (in => $file, detect_bom => 1);
144
145 All records will be encoded based on the detected BOM.
146
147 This implies a call to the "header" method, which by default also
148 sets the "column_names". So this is not the same as
149
150 my $aoh = csv (in => $file, headers => "auto");
151
152 which only reads the first record to set "column_names" but ignores
153 any BOM that might be present.
154
155 SPECIFICATION
156 While no formal specification for CSV exists, RFC 4180
157 <https://datatracker.ietf.org/doc/html/rfc4180> (1) describes the
158 common format and establishes "text/csv" as the MIME type registered
159 with the IANA. RFC 7111 <https://datatracker.ietf.org/doc/html/rfc7111>
160 (2) adds fragments to CSV.
161
162 Many informal documents exist that describe the "CSV" format. "How
163 To: The Comma Separated Value (CSV) File Format"
164 <http://creativyst.com/Doc/Articles/CSV/CSV01.shtml> (3) provides an
165 overview of the "CSV" format in the most widely used applications and
166 explains how it can best be used and supported.
167
168 1) https://datatracker.ietf.org/doc/html/rfc4180
169 2) https://datatracker.ietf.org/doc/html/rfc7111
170 3) http://creativyst.com/Doc/Articles/CSV/CSV01.shtml
171
172 The basic rules are as follows:
173
174 CSV is a delimited data format that has fields/columns separated by
175 the comma character and records/rows separated by newlines. Fields that
176 contain a special character (comma, newline, or double quote), must be
177 enclosed in double quotes. However, if a line contains a single entry
178 that is the empty string, it may be enclosed in double quotes. If a
179 field's value contains a double quote character it is escaped by
180 placing another double quote character next to it. The "CSV" file
181 format does not require a specific character encoding, byte order, or
182 line terminator format.
183
184 • Each record is a single line ended by a line feed (ASCII/"LF"=0x0A)
185 or a carriage return and line feed pair (ASCII/"CRLF"="0x0D 0x0A"),
186 however, line-breaks may be embedded.
187
188 • Fields are separated by commas.
189
190 • Allowable characters within a "CSV" field include 0x09 ("TAB") and
191 the inclusive range of 0x20 (space) through 0x7E (tilde). In binary
192 mode all characters are accepted, at least in quoted fields.
193
194 • A field within "CSV" must be surrounded by double-quotes to
195 contain a separator character (comma).
196
197 Though this is the most clear and restrictive definition, Text::CSV_XS
198 is way more liberal than this, and allows extension:
199
200 • Line termination by a single carriage return is accepted by default
201
202 • The separation-, quotation-, and escape-characters can be any ASCII
203 character in the range from 0x20 (space) to 0x7E (tilde).
204 Characters outside this range may or may not work as expected.
205 Multibyte characters, like UTF "U+060C" (ARABIC COMMA), "U+FF0C"
206 (FULLWIDTH COMMA), "U+241B" (SYMBOL FOR ESCAPE), "U+2424" (SYMBOL
207 FOR NEWLINE), "U+FF02" (FULLWIDTH QUOTATION MARK), and "U+201C" (LEFT
208 DOUBLE QUOTATION MARK) (to give some examples of what might look
209 promising) work for newer versions of perl for "sep_char" and
210 "quote_char" but not for "escape_char".
211
212 If you use perl-5.8.2 or higher these three attributes are
213 utf8-decoded, to increase the likelihood of success. This way
214 "U+00FE" will be allowed as a quote character.
215
216 • A field in "CSV" must be surrounded by double-quotes to make an
217 embedded double-quote, represented by a pair of consecutive double-
218 quotes, valid. In binary mode you may additionally use the sequence
219 ""0" for representation of a NULL byte. Using 0x00 in binary mode is
220 just as valid.
221
222 • Several violations of the above specification may be lifted by
223 passing some options as attributes to the object constructor.
224
225 METHODS
226 version
227 (Class method) Returns the current module version.
228
229 new
230 (Class method) Returns a new instance of class Text::CSV_XS. The
231 attributes are described by the (optional) hash ref "\%attr".
232
233 my $csv = Text::CSV_XS->new ({ attributes ... });
234
235 The following attributes are available:
236
237 eol
238
239 my $csv = Text::CSV_XS->new ({ eol => $/ });
240 $csv->eol (undef);
241 my $eol = $csv->eol;
242
243 The end-of-line string to add to rows for "print" or the record
244 separator for "getline".
245
246 When not passed in a parser instance, the default behavior is to
247 accept "\n", "\r", and "\r\n", so it is probably safer to not specify
248 "eol" at all. Passing "undef" or the empty string behave the same.
249
250 When not passed in a generating instance, records are not terminated
251 at all, so it is probably wise to pass something you expect. A safe
252 choice for "eol" on output is either $/ or "\r\n".
253
254 Common values for "eol" are "\012" ("\n" or Line Feed), "\015\012"
255 ("\r\n" or Carriage Return, Line Feed), and "\015" ("\r" or Carriage
256 Return). The "eol" attribute cannot exceed 7 (ASCII) characters.
257
258 If both $/ and "eol" equal "\015", parsing lines that end on only a
259 Carriage Return without Line Feed will be "parse"d correctly.
260
261 sep_char
262
263 my $csv = Text::CSV_XS->new ({ sep_char => ";" });
264 $csv->sep_char (";");
265 my $c = $csv->sep_char;
266
267 The char used to separate fields, by default a comma (","). Limited
268 to a single-byte character, usually in the range from 0x20 (space) to
269 0x7E (tilde). When longer sequences are required, use "sep".
270
271 The separation character can not be equal to the quote character or to
272 the escape character.
273
274 See also "CAVEATS"
275
276 sep
277
278 my $csv = Text::CSV_XS->new ({ sep => "\N{FULLWIDTH COMMA}" });
279 $csv->sep (";");
280 my $sep = $csv->sep;
281
282 The chars used to separate fields, by default undefined. Limited to 8
283 bytes.
284
285 When set, overrules "sep_char". If its length is one byte it acts as
286 an alias to "sep_char".
287
288 See also "CAVEATS"
289
290 quote_char
291
292 my $csv = Text::CSV_XS->new ({ quote_char => "'" });
293 $csv->quote_char (undef);
294 my $c = $csv->quote_char;
295
296 The character to quote fields containing blanks or binary data, by
297 default the double quote character ("""). A value of undef suppresses
298 quote chars (for simple cases only). Limited to a single-byte
299 character, usually in the range from 0x20 (space) to 0x7E (tilde).
300 When longer sequences are required, use "quote".
301
302 "quote_char" can not be equal to "sep_char".
303
304 quote
305
306 my $csv = Text::CSV_XS->new ({ quote => "\N{FULLWIDTH QUOTATION MARK}" });
307 $csv->quote ("'");
308 my $quote = $csv->quote;
309
310 The chars used to quote fields, by default undefined. Limited to 8
311 bytes.
312
313 When set, overrules "quote_char". If its length is one byte it acts as
314 an alias to "quote_char".
315
316 This method does not support "undef". Use "quote_char" to disable
317 quotation.
318
319 See also "CAVEATS"
320
321 escape_char
322
323 my $csv = Text::CSV_XS->new ({ escape_char => "\\" });
324 $csv->escape_char (":");
325 my $c = $csv->escape_char;
326
327 The character to escape certain characters inside quoted fields.
328 This is limited to a single-byte character, usually in the range
329 from 0x20 (space) to 0x7E (tilde).
330
331 The "escape_char" defaults to being the double-quote mark ("""). In
332 other words the same as the default "quote_char". This means that
333 doubling the quote mark in a field escapes it:
334
335 "foo","bar","Escape ""quote mark"" with two ""quote marks""","baz"
336
337 If you change the "quote_char" without changing the
338 "escape_char", the "escape_char" will still be the double-quote
339 ("""). If instead you want to escape the "quote_char" by doubling it
340 you will need to also change the "escape_char" to be the same as what
341 you have changed the "quote_char" to.
342
343 Setting "escape_char" to "undef" or "" will completely disable escapes
344 and is greatly discouraged. This will also disable "escape_null".
345
346 The escape character can not be equal to the separation character.
347
348 binary
349
350 my $csv = Text::CSV_XS->new ({ binary => 1 });
351 $csv->binary (0);
352 my $f = $csv->binary;
353
354 If this attribute is 1, you may use binary characters in quoted
355 fields, including line feeds, carriage returns and "NULL" bytes. (The
356 latter could be escaped as ""0".) By default this feature is off.
357
358 If a string is marked UTF8, "binary" will be turned on automatically
359 when binary characters other than "CR" and "NL" are encountered. Note
360 that a simple string like "\x{00a0}" might still be binary, but not
361 marked UTF8, so setting "{ binary => 1 }" is still a wise option.
362
363 strict
364
365 my $csv = Text::CSV_XS->new ({ strict => 1 });
366 $csv->strict (0);
367 my $f = $csv->strict;
368
369 If this attribute is set to 1, any row that parses to a different
370 number of fields than the previous row will cause the parser to throw
371 error 2014.
372
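A minimal illustrative sketch (the in-memory handle and data are made up
for this example) of how a ragged row surfaces when "strict" is set:

 use Text::CSV_XS;
 my $csv = Text::CSV_XS->new ({ strict => 1 });
 open my $io, "<", \"a,b,c\n1,2\n" or die $!;
 my $ok_row  = $csv->getline ($io); # three fields, parses fine
 my $bad_row = $csv->getline ($io); # undef: only two fields, error 2014
 defined $bad_row or print scalar $csv->error_diag, "\n";
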
373 skip_empty_rows
374
375 my $csv = Text::CSV_XS->new ({ skip_empty_rows => 1 });
376 $csv->skip_empty_rows ("eof");
377 my $f = $csv->skip_empty_rows;
378
379 This attribute defines the behavior for empty rows: an "eol"
380 immediately following the start of line. Default behavior is to return
381 one single empty field.
382
383 This attribute is only used in parsing. This attribute is ineffective
384 when using "parse" and "fields".
385
386 Possible values for this attribute are
387
388 0 | undef
389 my $csv = Text::CSV_XS->new ({ skip_empty_rows => 0 });
390 $csv->skip_empty_rows (undef);
391
392 No special action is taken. The result will be one single empty
393 field.
394
395 1 | "skip"
396 my $csv = Text::CSV_XS->new ({ skip_empty_rows => 1 });
397 $csv->skip_empty_rows ("skip");
398
399 The row will be skipped.
400
401 2 | "eof" | "stop"
402 my $csv = Text::CSV_XS->new ({ skip_empty_rows => 2 });
403 $csv->skip_empty_rows ("eof");
404
405 The parsing will stop as if an "eof" was detected.
406
407 3 | "die"
408 my $csv = Text::CSV_XS->new ({ skip_empty_rows => 3 });
409 $csv->skip_empty_rows ("die");
410
411 The parsing will stop. The internal error code will be set to 2015
412 and the parser will "die".
413
414 4 | "croak"
415 my $csv = Text::CSV_XS->new ({ skip_empty_rows => 4 });
416 $csv->skip_empty_rows ("croak");
417
418 The parsing will stop. The internal error code will be set to 2015
419 and the parser will "croak".
420
421 5 | "error"
422 my $csv = Text::CSV_XS->new ({ skip_empty_rows => 5 });
423 $csv->skip_empty_rows ("error");
424
425 The parsing will fail. The internal error code will be set to 2015.
426
427 callback
428 my $csv = Text::CSV_XS->new ({ skip_empty_rows => sub { [] } });
429 $csv->skip_empty_rows (sub { [ 42, $., undef, "empty" ] });
430
431 The callback is invoked and its result used instead. If you want the
432 parse to stop after the callback, make sure to return a false value.
433
434 The returned value from the callback should be an array-ref. Any
435 other type will cause the parse to stop, so these are equivalent in
436 behavior:
437
438 csv (in => $fh, skip_empty_rows => "stop");
439 csv (in => $fh, skip_empty_rows => sub { 0; });
440
441 Without arguments, the current value is returned: 0, 1, "eof", "die",
442 "croak" or the callback.
443
444 formula_handling
445
446 Alias for "formula"
447
448 formula
449
450 my $csv = Text::CSV_XS->new ({ formula => "none" });
451 $csv->formula ("none");
452 my $f = $csv->formula;
453
454 This defines the behavior of fields containing formulas. As formulas
455 are considered dangerous in spreadsheets, this attribute can define an
456 optional action to be taken if a field starts with an equal sign ("=").
457
458 For the purpose of code readability, this can also be written as
459
460 my $csv = Text::CSV_XS->new ({ formula_handling => "none" });
461 $csv->formula_handling ("none");
462 my $f = $csv->formula_handling;
463
464 Possible values for this attribute are
465
466 none
467 Take no specific action. This is the default.
468
469 $csv->formula ("none");
470
471 die
472 Cause the process to "die" whenever a leading "=" is encountered.
473
474 $csv->formula ("die");
475
476 croak
477 Cause the process to "croak" whenever a leading "=" is encountered.
478 (See Carp)
479
480 $csv->formula ("croak");
481
482 diag
483 Report position and content of the field whenever a leading "=" is
484 found. The value of the field is unchanged.
485
486 $csv->formula ("diag");
487
488 empty
489 Replace the content of fields that start with a "=" with the empty
490 string.
491
492 $csv->formula ("empty");
493 $csv->formula ("");
494
495 undef
496 Replace the content of fields that start with a "=" with "undef".
497
498 $csv->formula ("undef");
499 $csv->formula (undef);
500
501 a callback
502 Modify the content of fields that start with a "=" with the return-
503 value of the callback. The original content of the field is
504 available inside the callback as $_;
505
506 # Replace all formulas with 42
507 $csv->formula (sub { 42; });
508
509 # same as $csv->formula ("empty") but slower
510 $csv->formula (sub { "" });
511
512 # Allow =4+12
513 $csv->formula (sub { s/^=(\d+\+\d+)$/$1/eer });
514
515 # Allow more complex calculations
516 $csv->formula (sub { eval { s{^=([-+*/0-9()]+)$}{$1}ee }; $_ });
517
518 All other values will give a warning and then fallback to "diag".
519
520 decode_utf8
521
522 my $csv = Text::CSV_XS->new ({ decode_utf8 => 1 });
523 $csv->decode_utf8 (0);
524 my $f = $csv->decode_utf8;
525
526 This attribute defaults to TRUE.
527
528 While parsing, fields that are valid UTF-8, are automatically set to
529 be UTF-8, so that
530
531 $csv->parse ("\xC4\xA8\n");
532
533 results in
534
535 PV("\304\250"\0) [UTF8 "\x{128}"]
536
537 Sometimes this might not be the desired action. To prevent those upgrades,
538 set this attribute to false, and the result will be
539
540 PV("\304\250"\0)
541
542 auto_diag
543
544 my $csv = Text::CSV_XS->new ({ auto_diag => 1 });
545 $csv->auto_diag (2);
546 my $l = $csv->auto_diag;
547
548 Setting this attribute to a number between 1 and 9 causes "error_diag" to
549 be automatically called in void context upon errors.
550
551 In case of error "2012 - EOF", this call will be void.
552
553 If "auto_diag" is set to a numeric value greater than 1, it will "die"
554 on errors instead of "warn". If set to anything unrecognized, it will
555 be silently ignored.
556
557 Future extensions to this feature will include more reliable auto-
558 detection of "autodie" being active in the scope in which the error
559 occurred, which will increment the value of "auto_diag" by 1 the
560 moment the error is detected.
561
562 diag_verbose
563
564 my $csv = Text::CSV_XS->new ({ diag_verbose => 1 });
565 $csv->diag_verbose (2);
566 my $l = $csv->diag_verbose;
567
568 Set the verbosity of the output triggered by "auto_diag". Currently
569 only adds the current input-record-number (if known) to the
570 diagnostic output with an indication of the position of the error.
571
572 blank_is_undef
573
574 my $csv = Text::CSV_XS->new ({ blank_is_undef => 1 });
575 $csv->blank_is_undef (0);
576 my $f = $csv->blank_is_undef;
577
578 Under normal circumstances, "CSV" data makes no distinction between
579 quoted- and unquoted empty fields. These both end up in an empty
580 string field once read, thus
581
582 1,"",," ",2
583
584 is read as
585
586 ("1", "", "", " ", "2")
587
588 When writing "CSV" files with either "always_quote" or "quote_empty"
589 set, the unquoted empty field is the result of an undefined value.
590 To enable this distinction when reading "CSV" data, the
591 "blank_is_undef" attribute will cause unquoted empty fields to be set
592 to "undef", causing the above to be parsed as
593
594 ("1", "", undef, " ", "2")
595
596 Note that this is specifically important when loading "CSV" fields
597 into a database that allows "NULL" values, as the perl equivalent for
598 "NULL" is "undef" in DBI land.
599
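For illustration, a minimal sketch (assuming the module is loaded; the
data is the example line above):

 use Text::CSV_XS;
 my $csv = Text::CSV_XS->new ({ blank_is_undef => 1 });
 $csv->parse (q{1,"",," ",2}) or die "" . $csv->error_diag;
 my @fields = $csv->fields; # ("1", "", undef, " ", "2")
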
600 empty_is_undef
601
602 my $csv = Text::CSV_XS->new ({ empty_is_undef => 1 });
603 $csv->empty_is_undef (0);
604 my $f = $csv->empty_is_undef;
605
606 Going one step further than "blank_is_undef", this attribute
607 converts all empty fields to "undef", so
608
609 1,"",," ",2
610
611 is read as
612
613 (1, undef, undef, " ", 2)
614
615 Note that this affects only fields that are originally empty, not
616 fields that are empty after stripping allowed whitespace. YMMV.
617
618 allow_whitespace
619
620 my $csv = Text::CSV_XS->new ({ allow_whitespace => 1 });
621 $csv->allow_whitespace (0);
622 my $f = $csv->allow_whitespace;
623
624 When this option is set to true, the whitespace ("TAB"'s and
625 "SPACE"'s) surrounding the separation character is removed when
626 parsing. If either "TAB" or "SPACE" is one of the three characters
627 "sep_char", "quote_char", or "escape_char" it will not be considered
628 whitespace.
629
630 Now lines like:
631
632 1 , "foo" , bar , 3 , zapp
633
634 are parsed as valid "CSV", even though it violates the "CSV" specs.
635
636 Note that all whitespace is stripped from both start and end of
637 each field. That makes it more than just a feature for parsing
638 bad "CSV" lines, as
639
640 1, 2.0, 3, ape , monkey
641
642 will now be parsed as
643
644 ("1", "2.0", "3", "ape", "monkey")
645
646 even if the original line was perfectly acceptable "CSV".
647
648 allow_loose_quotes
649
650 my $csv = Text::CSV_XS->new ({ allow_loose_quotes => 1 });
651 $csv->allow_loose_quotes (0);
652 my $f = $csv->allow_loose_quotes;
653
654 By default, parsing unquoted fields containing "quote_char" characters
655 like
656
657 1,foo "bar" baz,42
658
659 would result in parse error 2034. Though it is still bad practice to
660 allow this format, we cannot help the fact that some vendors
661 make their applications spit out lines styled this way.
662
663 If there is really bad "CSV" data, like
664
665 1,"foo "bar" baz",42
666
667 or
668
669 1,""foo bar baz"",42
670
671 there is a way to get this data-line parsed and leave the quotes inside
672 the quoted field as-is. This can be achieved by setting
673 "allow_loose_quotes" AND making sure that the "escape_char" is not
674 equal to "quote_char".
675
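A minimal illustrative sketch of that combination (the escape character
chosen here is just an example):

 use Text::CSV_XS;
 my $csv = Text::CSV_XS->new ({
     allow_loose_quotes => 1,
     escape_char        => "\\", # anything that differs from quote_char
     });
 $csv->parse (q{1,"foo "bar" baz",42}) or die "" . $csv->error_diag;
 my @fields = $csv->fields; # ('1', 'foo "bar" baz', '42')
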
676 allow_loose_escapes
677
678 my $csv = Text::CSV_XS->new ({ allow_loose_escapes => 1 });
679 $csv->allow_loose_escapes (0);
680 my $f = $csv->allow_loose_escapes;
681
682 Parsing fields that have "escape_char" characters that escape
683 characters that do not need to be escaped, like:
684
685 my $csv = Text::CSV_XS->new ({ escape_char => "\\" });
686 $csv->parse (qq{1,"my bar\'s",baz,42});
687
688 would result in parse error 2025. Though it is bad practice to allow
689 this format, this attribute enables you to treat all escape character
690 sequences equally.
691
692 allow_unquoted_escape
693
694 my $csv = Text::CSV_XS->new ({ allow_unquoted_escape => 1 });
695 $csv->allow_unquoted_escape (0);
696 my $f = $csv->allow_unquoted_escape;
697
698 A backward compatibility issue where "escape_char" differs from
699 "quote_char" prevents "escape_char" to be in the first position of a
700 field. If "quote_char" is equal to the default """ and "escape_char"
701 is set to "\", this would be illegal:
702
703 1,\0,2
704
705 Setting this attribute to 1 might help to overcome issues with
706 backward compatibility and allow this style.
707
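For illustration only, a minimal sketch of that style being accepted (the
attribute values are made up for this example):

 use Text::CSV_XS;
 my $csv = Text::CSV_XS->new ({
     binary                => 1,
     escape_char           => "\\",
     allow_unquoted_escape => 1,
     });
 # without allow_unquoted_escape the leading escape would be rejected
 $csv->parse (q{1,\0,2}) or die "" . $csv->error_diag;
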
708 always_quote
709
710 my $csv = Text::CSV_XS->new ({ always_quote => 1 });
711 $csv->always_quote (0);
712 my $f = $csv->always_quote;
713
714 By default the generated fields are quoted only if they need to be.
715 For example, if they contain the separator character. If you set this
716 attribute to 1 then all defined fields will be quoted. ("undef" fields
717 are not quoted, see "blank_is_undef"). This makes it quite often easier
718 to handle exported data in external applications. (Poor creatures who
719 would be better off using Text::CSV_XS. :)
720
721 quote_space
722
723 my $csv = Text::CSV_XS->new ({ quote_space => 1 });
724 $csv->quote_space (0);
725 my $f = $csv->quote_space;
726
727 By default, a space in a field would trigger quotation. As no rule
728 exists that requires this in "CSV", nor any for the opposite, the
729 default is true for safety. You can exclude the space from this
730 trigger by setting this attribute to 0.
731
732 quote_empty
733
734 my $csv = Text::CSV_XS->new ({ quote_empty => 1 });
735 $csv->quote_empty (0);
736 my $f = $csv->quote_empty;
737
738 By default the generated fields are quoted only if they need to be.
739 An empty (defined) field does not need quotation. If you set this
740 attribute to 1 then empty defined fields will be quoted. ("undef"
741 fields are not quoted, see "blank_is_undef"). See also "always_quote".
742
743 quote_binary
744
745 my $csv = Text::CSV_XS->new ({ quote_binary => 1 });
746 $csv->quote_binary (0);
747 my $f = $csv->quote_binary;
748
749 By default, all "unsafe" bytes inside a string cause the combined
750 field to be quoted. By setting this attribute to 0, you can disable
751 that trigger for bytes ">= 0x7F".
752
753 escape_null
754
755 my $csv = Text::CSV_XS->new ({ escape_null => 1 });
756 $csv->escape_null (0);
757 my $f = $csv->escape_null;
758
759 By default, a "NULL" byte in a field would be escaped. This option
760 enables you to treat the "NULL" byte as a simple binary character in
761 binary mode (when "{ binary => 1 }" is set). The default is true. You
762 can prevent "NULL" escapes by setting this attribute to 0.
763
764 When the "escape_char" attribute is set to undefined, this attribute
765 will be set to false.
766
767 The default setting will encode "=\x00=" as
768
769 "="0="
770
771 With "escape_null" set, this will result in
772
773 "=\x00="
774
775 The default when using the "csv" function is "false".
776
777 For backward compatibility reasons, the deprecated old name
778 "quote_null" is still recognized.
779
780 keep_meta_info
781
782 my $csv = Text::CSV_XS->new ({ keep_meta_info => 1 });
783 $csv->keep_meta_info (0);
784 my $f = $csv->keep_meta_info;
785
786 By default, the parsing of input records is as simple and fast as
787 possible. However, some parsing information - like quotation of the
788 original field - is lost in that process. Setting this flag to true
789 enables retrieving that information after parsing with the methods
790 "meta_info", "is_quoted", and "is_binary" described below. Default is
791 false for performance.
792
793 If you set this attribute to a value greater than 9, then you can
794 control the output quotation style as it was used in the input of the
795 last parsed record (unless quotation was added because of other
796 reasons).
797
798 my $csv = Text::CSV_XS->new ({
799 binary => 1,
800 keep_meta_info => 1,
801 quote_space => 0,
802 });
803
804 $csv->parse (q{1,,"", ," ",f,"g","h""h",help,"help"});
805 my @row = $csv->fields ();
806 $csv->print (*STDOUT, \@row);
807 # 1,,, , ,f,g,"h""h",help,help
808 $csv->keep_meta_info (11);
809 $csv->print (*STDOUT, \@row);
810 # 1,,"", ," ",f,"g","h""h",help,"help"
811
812 undef_str
813
814 my $csv = Text::CSV_XS->new ({ undef_str => "\\N" });
815 $csv->undef_str (undef);
816 my $s = $csv->undef_str;
817
818 This attribute optionally defines the output of undefined fields. The
819 value passed is not changed at all, so if it needs quotation, the
820 quotation needs to be included in the value of the attribute. Use with
821 caution, as passing a value like ",",,,,""" will for sure mess up
822 your output. The default for this attribute is "undef", meaning no
823 special treatment.
824
825 This attribute is useful when exporting CSV data to be imported in
826 custom loaders, like for MySQL, that recognize special sequences for
827 "NULL" data.
828
829 This attribute has no meaning when parsing CSV data.
830
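A minimal illustrative sketch (the MySQL-style "\N" marker is just an
example value):

 use Text::CSV_XS;
 my $csv = Text::CSV_XS->new ({ undef_str => "\\N" });
 $csv->combine (1, undef, "foo") or die "" . $csv->error_diag;
 print $csv->string, "\n"; # 1,\N,foo
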
831 comment_str
832
833 my $csv = Text::CSV_XS->new ({ comment_str => "#" });
834 $csv->comment_str (undef);
835 my $s = $csv->comment_str;
836
837 This attribute optionally defines a string to be recognized as a comment.
838 If this attribute is defined, all lines starting with this sequence
839 will not be parsed as CSV but skipped as comments.
840
841 This attribute has no meaning when generating CSV.
842
843 Comment strings that start with any of the special characters/sequences
844 are not supported (so it cannot start with any of "sep_char",
845 "quote_char", "escape_char", "sep", "quote", or "eol").
846
847 For convenience, "comment" is an alias for "comment_str".
848
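For illustration, a minimal sketch (the in-memory data is made up):

 use Text::CSV_XS;
 my $csv = Text::CSV_XS->new ({ comment_str => "#" });
 open my $io, "<", \"# produced by export\n1,2,3\n" or die $!;
 my $row = $csv->getline ($io); # ["1", "2", "3"] - the "#" line is skipped
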
849 verbatim
850
851 my $csv = Text::CSV_XS->new ({ verbatim => 1 });
852 $csv->verbatim (0);
853 my $f = $csv->verbatim;
854
855 This is a quite controversial attribute to set, but makes some hard
856 things possible.
857
858 The rationale behind this attribute is to tell the parser that the
859 normally special characters newline ("NL") and Carriage Return ("CR")
860 will not be special when this flag is set, and will be dealt with as
861 ordinary binary characters. This will ease working with data with
862 embedded newlines.
863
864 When "verbatim" is used with "getline", "getline" auto-"chomp"'s
865 every line.
866
867 Imagine a file format like
868
869 M^^Hans^Janssen^Klas 2\n2A^Ja^11-06-2007#\r\n
870
871 where the line ending is a very specific "#\r\n", and the sep_char is
872 a "^" (caret). None of the fields is quoted, but embedded binary
873 data is likely to be present. With the specific line ending, this
874 should not be too hard to detect.
875
876 By default, Text::CSV_XS' parse function is instructed to only know
877 about "\n" and "\r" as legal line endings, and so has to deal with
878 the embedded newline as a real "end-of-line", so it can scan the next
879 line if binary is true and the newline is inside a quoted field. With
880 this option, we tell "parse" to treat "\n" as
881 nothing more than a binary character.
882
883 For "parse" this means that the parser has no more idea about line
884 ending and "getline" "chomp"s line endings on reading.
885
886 types
887
888 A set of column types; the attribute is immediately passed to the
889 "types" method.
890
891 callbacks
892
893 See the "Callbacks" section below.
894
895 accessors
896
897 To sum it up,
898
899 $csv = Text::CSV_XS->new ();
900
901 is equivalent to
902
903 $csv = Text::CSV_XS->new ({
904 eol => undef, # \r, \n, or \r\n
905 sep_char => ',',
906 sep => undef,
907 quote_char => '"',
908 quote => undef,
909 escape_char => '"',
910 binary => 0,
911 decode_utf8 => 1,
912 auto_diag => 0,
913 diag_verbose => 0,
914 blank_is_undef => 0,
915 empty_is_undef => 0,
916 allow_whitespace => 0,
917 allow_loose_quotes => 0,
918 allow_loose_escapes => 0,
919 allow_unquoted_escape => 0,
920 always_quote => 0,
921 quote_empty => 0,
922 quote_space => 1,
923 escape_null => 1,
924 quote_binary => 1,
925 keep_meta_info => 0,
926 strict => 0,
927 skip_empty_rows => 0,
928 formula => 0,
929 verbatim => 0,
930 undef_str => undef,
931 comment_str => undef,
932 types => undef,
933 callbacks => undef,
934 });
935
936 For all of the above mentioned flags, an accessor method is available
937 where you can query the current value, or change the value
938
939 my $quote = $csv->quote_char;
940 $csv->binary (1);
941
942 It is not wise to change these settings halfway through writing "CSV"
943 data to a stream. If however you want to create a new stream using the
944 available "CSV" object, there is no harm in changing them.
945
946 If the "new" constructor call fails, it returns "undef", and makes
947 the fail reason available through the "error_diag" method.
948
949 $csv = Text::CSV_XS->new ({ ecs_char => 1 }) or
950 die "".Text::CSV_XS->error_diag ();
951
952 "error_diag" will return a string like
953
954 "INI - Unknown attribute 'ecs_char'"
955
956 known_attributes
957 @attr = Text::CSV_XS->known_attributes;
958 @attr = Text::CSV_XS::known_attributes;
959 @attr = $csv->known_attributes;
960
961 This method will return an ordered list of all the supported
962 attributes as described above. This can be useful for knowing what
963 attributes are valid in classes that use or extend Text::CSV_XS.
964
965 print
966 $status = $csv->print ($fh, $colref);
967
968 Similar to "combine" + "string" + "print", but much more efficient.
969 It expects an array ref as input (not an array!) and the resulting
970 string is not really created, but immediately written to the $fh
971 object, typically an IO handle or any other object that offers a
972 "print" method.
973
974 For performance reasons "print" does not create a result string, so
975 all "string", "status", "fields", and "error_input" methods will return
976 undefined information after executing this method.
977
978 If $colref is "undef" (explicit, not through a variable argument) and
979 "bind_columns" was used to specify fields to be printed, it is
980 possible to make performance improvements, as otherwise data would have
981 to be copied as arguments to the method call:
982
983 $csv->bind_columns (\($foo, $bar));
984 $status = $csv->print ($fh, undef);
985
986 A short benchmark
987
988 my @data = ("aa" .. "zz");
989 $csv->bind_columns (\(@data));
990
991 $csv->print ($fh, [ @data ]); # 11800 recs/sec
992 $csv->print ($fh, \@data ); # 57600 recs/sec
993 $csv->print ($fh, undef ); # 48500 recs/sec
994
995 say
996 $status = $csv->say ($fh, $colref);
997
998 Like "print", but "eol" defaults to "$\".
999
1000 print_hr
1001 $csv->print_hr ($fh, $ref);
1002
1003 Provides an easy way to print a $ref (as fetched with "getline_hr")
1004 provided the column names are set with "column_names".
1005
1006 It is just a wrapper method with basic parameter checks over
1007
1008 $csv->print ($fh, [ map { $ref->{$_} } $csv->column_names ]);
1009
1010 combine
1011 $status = $csv->combine (@fields);
1012
1013 This method constructs a "CSV" record from @fields, returning success
1014 or failure. Failure can result from lack of arguments or an argument
1015 that contains an invalid character. Upon success, "string" can be
1016 called to retrieve the resultant "CSV" string. Upon failure, the
1017 value returned by "string" is undefined and "error_input" could be
1018 called to retrieve the invalid argument.
1019
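A minimal illustrative sketch of the combine/string pair (the field
values are made up):

 use Text::CSV_XS;
 my $csv = Text::CSV_XS->new ();
 if ($csv->combine ("foo", "bar,baz", "qux")) {
     print $csv->string, "\n"; # foo,"bar,baz",qux
     }
 else {
     warn "combine () failed on: ", $csv->error_input, "\n";
     }
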
1020 string
1021 $line = $csv->string ();
1022
1023 This method returns the input to "parse" or the resultant "CSV"
1024 string of "combine", whichever was called more recently.
1025
1026 getline
1027 $colref = $csv->getline ($fh);
1028
1029 This is the counterpart to "print", as "parse" is the counterpart to
1030 "combine": it parses a row from the $fh handle using the "getline"
1031 method associated with $fh and parses this row into an array ref.
1032 This array ref is returned by the function or "undef" for failure.
1033 When $fh does not support "getline", you are likely to hit errors.
1034
1035 When fields are bound with "bind_columns" the return value is a
1036 reference to an empty list.
1037
1038 The "string", "fields", and "status" methods are meaningless again.
1039
1040 getline_all
1041 $arrayref = $csv->getline_all ($fh);
1042 $arrayref = $csv->getline_all ($fh, $offset);
1043 $arrayref = $csv->getline_all ($fh, $offset, $length);
1044
1045 This will return a reference to a list of getline ($fh) results. In
1046 this call, "keep_meta_info" is disabled. If $offset is negative, as
1047 with "splice", only the last "abs ($offset)" records of $fh are taken
1048 into consideration. Parameters $offset and $length are expected to be
1049 integers. Non-integer values are interpreted as integer without check.
1050
1051 Given a CSV file with 10 lines:
1052
1053 lines call
1054 ----- ---------------------------------------------------------
1055 0..9 $csv->getline_all ($fh) # all
1056 0..9 $csv->getline_all ($fh, 0) # all
1057 8..9 $csv->getline_all ($fh, 8) # start at 8
1058 - $csv->getline_all ($fh, 0, 0) # start at 0 first 0 rows
1059 0..4 $csv->getline_all ($fh, 0, 5) # start at 0 first 5 rows
1060 4..5 $csv->getline_all ($fh, 4, 2) # start at 4 first 2 rows
1061 8..9 $csv->getline_all ($fh, -2) # last 2 rows
1062 6..7 $csv->getline_all ($fh, -4, 2) # first 2 of last 4 rows
1063
1064 getline_hr
1065 The "getline_hr" and "column_names" methods work together to allow you
1066 to have rows returned as hashrefs. You must call "column_names" first
1067 to declare your column names.
1068
1069 $csv->column_names (qw( code name price description ));
1070 $hr = $csv->getline_hr ($fh);
1071 print "Price for $hr->{name} is $hr->{price} EUR\n";
1072
1073 "getline_hr" will croak if called before "column_names".
1074
1075 Note that "getline_hr" creates a hashref for every row and will be
1076 much slower than the combined use of "bind_columns" and "getline", while
1077 still offering the same easy-to-use hashref inside the loop:
1078
1079 my @cols = @{$csv->getline ($fh)};
1080 $csv->column_names (@cols);
1081 while (my $row = $csv->getline_hr ($fh)) {
1082 print $row->{price};
1083 }
1084
1085 This could easily be rewritten to the much faster:
1086
1087 my @cols = @{$csv->getline ($fh)};
1088 my $row = {};
1089 $csv->bind_columns (\@{$row}{@cols});
1090 while ($csv->getline ($fh)) {
1091 print $row->{price};
1092 }
1093
1094 Your mileage may vary for the size of the data and the number of rows.
1095 With perl-5.14.2 the comparison for a 100_000 line file with 14
1096 columns:
1097
1098 Rate hashrefs getlines
1099 hashrefs 1.00/s -- -76%
1100 getlines 4.15/s 313% --
1101
1102 getline_hr_all
1103 $arrayref = $csv->getline_hr_all ($fh);
1104 $arrayref = $csv->getline_hr_all ($fh, $offset);
1105 $arrayref = $csv->getline_hr_all ($fh, $offset, $length);
1106
1107 This will return a reference to a list of getline_hr ($fh) results.
1108 In this call, "keep_meta_info" is disabled.
1109
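For illustration, a minimal sketch (the in-memory data and column names
are made up) that uses the first record as the hash keys:

 use Text::CSV_XS;
 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
 open my $io, "<", \"id,name\n1,foo\n2,bar\n" or die $!;
 $csv->column_names ($csv->getline ($io));
 my $rows = $csv->getline_hr_all ($io); # [{ id => "1", name => "foo" }, ...]
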
1110 parse
1111 $status = $csv->parse ($line);
1112
1113 This method decomposes a "CSV" string into fields, returning success
1114 or failure. Failure can result from a lack of argument or the given
1115 "CSV" string is improperly formatted. Upon success, "fields" can be
1116 called to retrieve the decomposed fields. Upon failure calling "fields"
1117 will return undefined data and "error_input" can be called to
1118 retrieve the invalid argument.
1119
1120 You may use the "types" method for setting column types. See "types"'
1121 description below.
1122
1123 The $line argument is supposed to be a simple scalar. Everything else
1124 is supposed to croak and set error 1500.
1125
1126 fragment
1127 This function tries to implement RFC7111 (URI Fragment Identifiers for
1128 the text/csv Media Type) -
1129 https://datatracker.ietf.org/doc/html/rfc7111
1130
1131 my $AoA = $csv->fragment ($fh, $spec);
1132
1133 In specifications, "*" is used to specify the last item, a dash ("-")
1134 to indicate a range. All indices are 1-based: the first row or
1135 column has index 1. Selections can be combined with the semi-colon
1136 (";").
1137
1138 When using this method in combination with "column_names", the
1139 returned reference will point to a list of hashes instead of a list
1140 of lists. A disjointed cell-based combined selection might return
1141 rows with different numbers of columns, making the use of hashes
1142 unpredictable.
1143
1144 $csv->column_names ("Name", "Age");
1145 my $AoH = $csv->fragment ($fh, "col=3;8");
1146
1147 If the "after_parse" callback is active, it is also called on every
1148 line parsed and skipped before the fragment.
1149
1150 row
1151 row=4
1152 row=5-7
1153 row=6-*
1154 row=1-2;4;6-*
1155
1156 col
1157 col=2
1158 col=1-3
1159 col=4-*
1160 col=1-2;4;7-*
1161
1162 cell
1163 In cell-based selection, the comma (",") is used to pair row and
1164 column
1165
1166 cell=4,1
1167
1168 The range operator ("-") using "cell"s can be used to define top-left
1169 and bottom-right "cell" location
1170
1171 cell=3,1-4,6
1172
1173 The "*" is only allowed in the second part of a pair
1174
1175 cell=3,2-*,2 # row 3 till end, only column 2
1176 cell=3,2-3,* # column 2 till end, only row 3
1177 cell=3,2-*,* # strip row 1 and 2, and column 1
1178
1179 Cells and cell ranges may be combined with ";", possibly resulting in
1180 rows with different numbers of columns
1181
1182 cell=1,1-2,2;3,3-4,4;1,4;4,1
1183
1184 Disjointed selections will only return selected cells. The cells
1185 that are not specified will not be included in the returned
1186 set, not even as "undef". As an example given a "CSV" like
1187
1188 11,12,13,...19
1189 21,22,...28,29
1190 : :
1191 91,...97,98,99
1192
1193 with "cell=1,1-2,2;3,3-4,4;1,4;4,1" will return:
1194
1195 11,12,14
1196 21,22
1197 33,34
1198 41,43,44
1199
1200 Overlapping cell-specs will return those cells only once, so
1201 "cell=1,1-3,3;2,2-4,4;2,3;4,2" will return:
1202
1203 11,12,13
1204 21,22,23,24
1205 31,32,33,34
1206 42,43,44
1207
1208 RFC7111 <https://datatracker.ietf.org/doc/html/rfc7111> does not
1209 allow different types of specs to be combined (either "row" or "col"
1210 or "cell"). Passing an invalid fragment specification will croak and
1211 set error 2013.
1212
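A minimal illustrative sketch (the in-memory data and the spec are made
up for this example):

 use Text::CSV_XS;
 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
 open my $io, "<", \"a,b\n1,2\n3,4\n5,6\n" or die $!;
 my $rows = $csv->fragment ($io, "row=2-3"); # [["1","2"], ["3","4"]]
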
1213 column_names
1214 Set the "keys" that will be used in the "getline_hr" calls. If no
1215 keys (column names) are passed, it will return the current setting as a
1216 list.
1217
1218 "column_names" accepts a list of scalars (the column names) or a
1219 single array_ref, so you can pass the return value from "getline" too:
1220
1221 $csv->column_names ($csv->getline ($fh));
1222
1223 "column_names" does no checking on duplicates at all, which might lead
1224 to unexpected results. Undefined entries will be replaced with the
1225 string "\cAUNDEF\cA", so
1226
1227 $csv->column_names (undef, "", "name", "name");
1228 $hr = $csv->getline_hr ($fh);
1229
1230 will set "$hr->{"\cAUNDEF\cA"}" to the 1st field, "$hr->{""}" to the
1231 2nd field, and "$hr->{name}" to the 4th field, discarding the 3rd
1232 field.
1233
1234 "column_names" croaks on invalid arguments.
1235
1236 header
1237 This method does NOT work in perl-5.6.x
1238
1239 Parse the CSV header and set "sep", column_names and encoding.
1240
1241 my @hdr = $csv->header ($fh);
1242 $csv->header ($fh, { sep_set => [ ";", ",", "|", "\t" ] });
1243 $csv->header ($fh, { detect_bom => 1, munge_column_names => "lc" });
1244
1245 The first argument should be a file handle.
1246
1247 This method resets some object properties, as it is supposed to be
1248 invoked only once per file or stream. It will leave attributes
1249 "column_names" and "bound_columns" alone if setting column names is
1250 disabled. Reading headers on previously processed objects might fail on
1251 perl-5.8.0 and older.
1252
1253 Assuming that the file opened for parsing has a header, and the header
1254 does not contain problematic characters like embedded newlines, read
1255 the first line from the open handle then auto-detect whether the header
1256 separates the column names with a character from the allowed separator
1257 list.
1258
1259 If any of the allowed separators matches, and none of the other
1260 allowed separators match, set "sep" to that separator for the
1261 current CSV_XS instance and use it to parse the first line, map those
1262 to lowercase, and use that to set the instance "column_names":
1263
1264 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
1265 open my $fh, "<", "file.csv";
1266 binmode $fh; # for Windows
1267 $csv->header ($fh);
1268 while (my $row = $csv->getline_hr ($fh)) {
1269 ...
1270 }
1271
1272 If the header is empty, contains more than one unique separator out of
1273 the allowed set, contains empty fields, or contains identical fields
1274 (after folding), it will croak with error 1010, 1011, 1012, or 1013
1275 respectively.
1276
1277 If the header contains embedded newlines or is not valid CSV in any
1278 other way, this method will croak and leave the parse error untouched.
1279
1280 A successful call to "header" will always set the "sep" of the $csv
1281 object. This behavior can not be disabled.
1282
1283 return value
1284
1285 On error this method will croak.
1286
1287 In list context, the headers will be returned whether they are used to
1288 set "column_names" or not.
1289
1290 In scalar context, the instance itself is returned. Note: the values
1291 as found in the header will effectively be lost if "set_column_names"
1292 is false.
1293
1294 Options
1295
1296 sep_set
1297 $csv->header ($fh, { sep_set => [ ";", ",", "|", "\t" ] });
1298
1299 The list of legal separators defaults to "[ ";", "," ]" and can be
1300 changed by this option. As this is probably the most often used
1301 option, it can be passed on its own as an unnamed argument:
1302
1303 $csv->header ($fh, [ ";", ",", "|", "\t", "::", "\x{2063}" ]);
1304
1305 Multi-byte sequences are allowed, both multi-character and
1306 Unicode. See "sep".
1307
1308 detect_bom
1309 $csv->header ($fh, { detect_bom => 1 });
1310
1311 The default behavior is to detect if the header line starts with a
1312 BOM. If the header has a BOM, use that to set the encoding of $fh.
1313 This default behavior can be disabled by passing a false value to
1314 "detect_bom".
1315
1316 Supported encodings from BOM are: UTF-8, UTF-16BE, UTF-16LE,
1317 UTF-32BE, and UTF-32LE. BOM also supports UTF-1, UTF-EBCDIC, SCSU,
1318 BOCU-1, and GB-18030 but Encode does not (yet). UTF-7 is not
1319 supported.
1320
1321 If a supported BOM was detected as start of the stream, it is stored
1322 in the object attribute "ENCODING".
1323
1324 my $enc = $csv->{ENCODING};
1325
1326 The encoding is used with "binmode" on $fh.
1327
1328 If the handle was opened in a (correct) encoding, this method will
1329 not alter the encoding, as it checks the leading bytes of the first
1330 line. In case the stream starts with a decoded BOM ("U+FEFF"),
1331 "{ENCODING}" will be "" (empty) instead of the default "undef".
1332
1333 munge_column_names
1334 This option offers the means to modify the column names into
1335 something that is most useful to the application. The default is to
1336 map all column names to lower case.
1337
1338 $csv->header ($fh, { munge_column_names => "lc" });
1339
1340 The following values are available:
1341
1342 lc - lower case
1343 uc - upper case
1344 db - valid DB field names
1345 none - do not change
1346 \%hash - supply a mapping
1347 \&cb - supply a callback
1348
1349 Lower case
1350 $csv->header ($fh, { munge_column_names => "lc" });
1351
1352 The header is changed to all lower-case
1353
1354 $_ = lc;
1355
1356 Upper case
1357 $csv->header ($fh, { munge_column_names => "uc" });
1358
1359 The header is changed to all upper-case
1360
1361 $_ = uc;
1362
1363 Literal
1364 $csv->header ($fh, { munge_column_names => "none" });
1365
1366 Hash
1367 $csv->header ($fh, { munge_column_names => { foo => "sombrero" } });
1368
1369 If a value does not exist, the original value is used unchanged.
1370
1371 Database
1372 $csv->header ($fh, { munge_column_names => "db" });
1373
1374 - lower-case
1375
1376 - all sequences of non-word characters are replaced with an
1377 underscore
1378
1379 - all leading underscores are removed
1380
1381 $_ = lc (s/\W+/_/gr =~ s/^_+//r);
1382
1383 Callback
1384 $csv->header ($fh, { munge_column_names => sub { fc } });
1385 $csv->header ($fh, { munge_column_names => sub { "column_".$col++ } });
1386 $csv->header ($fh, { munge_column_names => sub { lc (s/\W+/_/gr) } });
1387
1388 As this callback is called in a "map", you can use $_ directly.
1389
1390 set_column_names
1391 $csv->header ($fh, { set_column_names => 1 });
1392
1393 The default is to set the instance's column names using
1394 "column_names" if the method is successful, so subsequent calls to
1395 "getline_hr" can return a hash. Setting the column names can be
1396 disabled by passing a false value for this option.
1397
1398 As described in "return value" above, content is lost in scalar
1399 context.
1400
1401 Validation
1402
1403 When receiving CSV files from external sources, this method can be
1404 used to protect against changes in the layout by restricting to known
1405 headers (and typos in the header fields).
1406
1407 my %known = (
1408 "record key" => "c_rec",
1409 "rec id" => "c_rec",
1410 "id_rec" => "c_rec",
1411 "kode" => "code",
1412 "code" => "code",
1413 "vaule" => "value",
1414 "value" => "value",
1415 );
1416 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
1417 open my $fh, "<", $source or die "$source: $!";
1418 $csv->header ($fh, { munge_column_names => sub {
1419 s/\s+$//;
1420 s/^\s+//;
1421 $known{lc $_} or die "Unknown column '$_' in $source";
1422 }});
1423 while (my $row = $csv->getline_hr ($fh)) {
1424 say join "\t", $row->{c_rec}, $row->{code}, $row->{value};
1425 }
1426
1427 bind_columns
1428 Takes a list of scalar references to be used for output with "print"
1429 or to store in the fields fetched by "getline". When you do not pass
1430 enough references to store the fetched fields in, "getline" will fail
1431 with error 3006. If you pass more than there are fields to return,
1432 the content of the remaining references is left untouched.
1433
1434 $csv->bind_columns (\$code, \$name, \$price, \$description);
1435 while ($csv->getline ($fh)) {
1436 print "The price of a $name is \x{20ac} $price\n";
1437 }
1438
1439 To reset or clear all column binding, call "bind_columns" with the
1440 single argument "undef". This will also clear column names.
1441
1442 $csv->bind_columns (undef);
1443
1444 If no arguments are passed at all, "bind_columns" will return the list
1445 of current bindings or "undef" if no binds are active.
1446
1447 Note that in parsing with "bind_columns", the fields are set on the
1448 fly. That implies that if the third field of a row causes an error
1449 (or this row has just two fields where the previous row had more), the
1450 first two fields already have been assigned the values of the current
1451 row, while the rest of the fields will still hold the values of the
1452 previous row. If you want the parser to fail in these cases, use the
1453 "strict" attribute.
1454
1455 eof
1456 $eof = $csv->eof ();
1457
1458 If "parse" or "getline" was used with an IO stream, this method will
1459 return true (1) if the last call hit end of file, otherwise it will
1460 return false (''). This is useful to see the difference between a
1461 failure and end of file.
1462
1463 Note that if the parsing of the last line caused an error, "eof" is
1464 still true. That means that if you are not using "auto_diag", an idiom
1465 like
1466
1467 while (my $row = $csv->getline ($fh)) {
1468 # ...
1469 }
1470 $csv->eof or $csv->error_diag;
1471
1472 will not report the error. You would have to change that to
1473
1474 while (my $row = $csv->getline ($fh)) {
1475 # ...
1476 }
1477 +$csv->error_diag and $csv->error_diag;
1478
1479 types
1480 $csv->types (\@tref);
1481
1482 This method is used to force that (all) columns are of a given type.
1483 For example, if you have an integer column, two columns with
1484 doubles and a string column, then you might do a
1485
1486 $csv->types ([Text::CSV_XS::IV (),
1487 Text::CSV_XS::NV (),
1488 Text::CSV_XS::NV (),
1489 Text::CSV_XS::PV ()]);
1490
1491 Column types are used only for decoding columns while parsing, in
1492 other words by the "parse" and "getline" methods.
1493
1494 You can unset column types by doing a
1495
1496 $csv->types (undef);
1497
1498 or fetch the current type settings with
1499
1500 $types = $csv->types ();
1501
1502 IV
1503 CSV_TYPE_IV
1504 Set field type to integer.
1505
1506 NV
1507 CSV_TYPE_NV
1508 Set field type to numeric/float.
1509
1510 PV
1511 CSV_TYPE_PV
1512 Set field type to string.
1513
1514 fields
1515 @columns = $csv->fields ();
1516
1517 This method returns the input to "combine" or the resultant
1518 decomposed fields of a successful "parse", whichever was called more
1519 recently.
1520
1521 Note that the return value is undefined after using "getline", which
1522 does not fill the data structures returned by "parse".
1523
1524 meta_info
1525 @flags = $csv->meta_info ();
1526
1527 This method returns the "flags" of the input to "combine" or the flags
1528 of the resultant decomposed fields of "parse", whichever was called
1529 more recently.
1530
1531 For each field, a meta_info field will hold flags that inform
1532 something about the field returned by the "fields" method or
1533 passed to the "combine" method. The flags are bit-wise-"or"'d like:
1534
1535 0x0001
1536 "CSV_FLAGS_IS_QUOTED"
1537 The field was quoted.
1538
1539 0x0002
1540 "CSV_FLAGS_IS_BINARY"
1541 The field was binary.
1542
1543 0x0004
1544 "CSV_FLAGS_ERROR_IN_FIELD"
1545 The field was invalid.
1546
1547 Currently only used when "allow_loose_quotes" is active.
1548
1549 0x0010
1550 "CSV_FLAGS_IS_MISSING"
1551 The field was missing.
1552
1553 See the "is_***" methods below.
1554
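For illustration, a minimal sketch (the input line is made up) that reads
the flags back after parsing:

 use Text::CSV_XS;
 my $csv = Text::CSV_XS->new ({ keep_meta_info => 1 });
 $csv->parse (q{1,"2",3}) or die "" . $csv->error_diag;
 my @flags = $csv->meta_info; # (0, 1, 0): only the second field was quoted
 print "field 2 was quoted\n" if $csv->is_quoted (1);
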
1555 is_quoted
1556 my $quoted = $csv->is_quoted ($column_idx);
1557
1558 where $column_idx is the (zero-based) index of the column in the
1559 last result of "parse".
1560
1561 This returns a true value if the data in the indicated column was
1562 enclosed in "quote_char" quotes. This might be important for fields
1563 where content ",20070108," is to be treated as a numeric value, and
1564 where ","20070108"," is explicitly marked as character string data.
1565
1566 This method is only valid when "keep_meta_info" is set to a true value.
1567
1568 is_binary
1569 my $binary = $csv->is_binary ($column_idx);
1570
1571 where $column_idx is the (zero-based) index of the column in the
1572 last result of "parse".
1573
1574 This returns a true value if the data in the indicated column contained
1575 any byte in the range "[\x00-\x08,\x10-\x1F,\x7F-\xFF]".
1576
1577 This method is only valid when "keep_meta_info" is set to a true value.
1578
1579 is_missing
1580 my $missing = $csv->is_missing ($column_idx);
1581
1582 where $column_idx is the (zero-based) index of the column in the
1583 last result of "getline_hr".
1584
1585 $csv->keep_meta_info (1);
1586 while (my $hr = $csv->getline_hr ($fh)) {
1587 $csv->is_missing (0) and next; # This was an empty line
1588 }
1589
1590 When using "getline_hr", it is impossible to tell if the parsed
1591 fields are "undef" because they were not filled in the "CSV" stream
1592 or because they were not read at all, as all the fields defined by
1593 "column_names" are set in the hash-ref. If you still need to know if
1594 all fields in each row are provided, you should enable "keep_meta_info"
1595 so you can check the flags.
1596
1597 If "keep_meta_info" is "false", "is_missing" will always return
1598 "undef", regardless of $column_idx being valid or not. If this
1599 attribute is "true" it will return either 0 (the field is present) or 1
1600 (the field is missing).
1601
1602 A special case is the empty line. If the line is completely empty -
1603 after dealing with the flags - this is still a valid CSV line: it is a
1604 record of just one single empty field. However, if "keep_meta_info" is
1605 set, invoking "is_missing" with index 0 will now return true.
1606
1607 status
1608 $status = $csv->status ();
1609
1610 This method returns the status of the last invoked "combine" or "parse"
1611 call. Status is success (true: 1) or failure (false: "undef" or 0).
1612
1613 Note that as this only keeps track of the status of above mentioned
1614 methods, you are probably looking for "error_diag" instead.
1615
1616 error_input
1617 $bad_argument = $csv->error_input ();
1618
1619 This method returns the erroneous argument (if it exists) of "combine"
1620 or "parse", whichever was called more recently. If the last
1621 invocation was successful, "error_input" will return "undef".
1622
1623 Depending on the type of error, it might also hold the data for the
1624 last error-input of "getline".
1625
1626 error_diag
1627 Text::CSV_XS->error_diag ();
1628 $csv->error_diag ();
1629 $error_code = 0 + $csv->error_diag ();
1630 $error_str = "" . $csv->error_diag ();
1631 ($cde, $str, $pos, $rec, $fld) = $csv->error_diag ();
1632
1633 If (and only if) an error occurred, this function returns the
1634 diagnostics of that error.
1635
1636 If called in void context, this will print the internal error code and
1637 the associated error message to STDERR.
1638
1639 If called in list context, this will return the error code and the
1640 error message in that order. If the last error was from parsing, the
1641 rest of the values returned are a best guess at the location within
1642 the line that was being parsed. Their values are 1-based. The
       position currently is the index of the byte at which the parsing
       failed in the current record. It might change to be the index of the
       current character in a later release. The record is the index of
       the record parsed by the csv instance. The field number is the index of the field
1647 the parser thinks it is currently trying to parse. See
1648 examples/csv-check for how this can be used.
1649
1650 If called in scalar context, it will return the diagnostics in a
1651 single scalar, a-la $!. It will contain the error code in numeric
1652 context, and the diagnostics message in string context.
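
       For example (a minimal sketch; $line is a hypothetical input
       string), the dual-valued scalar can be used to test for a specific
       error code:

           unless ($csv->parse ($line)) {
               my $err = $csv->error_diag;   # error code in numeric context
               warn "Quoted field not terminated\n" if $err == 2027;
               warn "Parse error: $err\n";   # message in string context
               }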
1653
1654 When called as a class method or a direct function call, the
1655 diagnostics are that of the last "new" call.
1656
1657 record_number
1658 $recno = $csv->record_number ();
1659
       Returns the number of records parsed by this csv instance. This value should be
1661 more accurate than $. when embedded newlines come in play. Records
1662 written by this instance are not counted.
1663
1664 SetDiag
1665 $csv->SetDiag (0);
1666
       Use this method to reset the diagnostics if you are dealing with errors.
1668
1670 By default none of these are exported.
1671
1672 csv
1673 use Text::CSV_XS qw( csv );
1674
       Import the "csv" function. See below.
1676
1677 :CONSTANTS
1678 use Text::CSV_XS qw( :CONSTANTS );
1679
1680 Import module constants "CSV_FLAGS_IS_QUOTED",
1681 "CSV_FLAGS_IS_BINARY", "CSV_FLAGS_ERROR_IN_FIELD",
1682 "CSV_FLAGS_IS_MISSING", "CSV_TYPE_PV", "CSV_TYPE_IV", and
1683 "CSV_TYPE_NV". Each can be imported alone
1684
           use Text::CSV_XS qw( CSV_FLAGS_IS_BINARY CSV_TYPE_NV );
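
       As a hedged illustration (a minimal sketch, assuming a parser
       created with "keep_meta_info"), the flag constants can be tested
       against the values returned by "meta_info":

           use Text::CSV_XS qw( :CONSTANTS );

           my $csv = Text::CSV_XS->new ({ keep_meta_info => 1 });
           $csv->parse (qq{1,"2",3});
           my @flags = $csv->meta_info;
           print "second field was quoted\n"
               if $flags[1] & CSV_FLAGS_IS_QUOTED;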
1686
1688 csv
1689 This function is not exported by default and should be explicitly
1690 requested:
1691
1692 use Text::CSV_XS qw( csv );
1693
1694 This is a high-level function that aims at simple (user) interfaces.
1695 This can be used to read/parse a "CSV" file or stream (the default
1696 behavior) or to produce a file or write to a stream (define the "out"
1697 attribute). It returns an array- or hash-reference on parsing (or
1698 "undef" on fail) or the numeric value of "error_diag" on writing.
1699 When this function fails you can get to the error using the class call
1700 to "error_diag"
1701
1702 my $aoa = csv (in => "test.csv") or
1703 die Text::CSV_XS->error_diag;
1704
1705 This function takes the arguments as key-value pairs. This can be
1706 passed as a list or as an anonymous hash:
1707
1708 my $aoa = csv ( in => "test.csv", sep_char => ";");
1709 my $aoh = csv ({ in => $fh, headers => "auto" });
1710
1711 The arguments passed consist of two parts: the arguments to "csv"
1712 itself and the optional attributes to the "CSV" object used inside
1713 the function as enumerated and explained in "new".
1714
1715 If not overridden, the default option used for CSV is
1716
1717 auto_diag => 1
1718 escape_null => 0
1719
1720 The option that is always set and cannot be altered is
1721
1722 binary => 1
1723
1724 As this function will likely be used in one-liners, it allows "quote"
1725 to be abbreviated as "quo", and "escape_char" to be abbreviated as
1726 "esc" or "escape".
1727
1728 Alternative invocations:
1729
1730 my $aoa = Text::CSV_XS::csv (in => "file.csv");
1731
1732 my $csv = Text::CSV_XS->new ();
1733 my $aoa = $csv->csv (in => "file.csv");
1734
1735 In the latter case, the object attributes are used from the existing
1736 object and the attribute arguments in the function call are ignored:
1737
1738 my $csv = Text::CSV_XS->new ({ sep_char => ";" });
1739 my $aoh = $csv->csv (in => "file.csv", bom => 1);
1740
1741 will parse using ";" as "sep_char", not ",".
1742
1743 in
1744
1745 Used to specify the source. "in" can be a file name (e.g. "file.csv"),
1746 which will be opened for reading and closed when finished, a file
1747 handle (e.g. $fh or "FH"), a reference to a glob (e.g. "\*ARGV"),
1748 the glob itself (e.g. *STDIN), or a reference to a scalar (e.g.
1749 "\q{1,2,"csv"}").
1750
1751 When used with "out", "in" should be a reference to a CSV structure
1752 (AoA or AoH) or a CODE-ref that returns an array-reference or a hash-
1753 reference. The code-ref will be invoked with no arguments.
1754
1755 my $aoa = csv (in => "file.csv");
1756
1757 open my $fh, "<", "file.csv";
1758 my $aoa = csv (in => $fh);
1759
1760 my $csv = [ [qw( Foo Bar )], [ 1, 2 ], [ 2, 3 ]];
1761 my $err = csv (in => $csv, out => "file.csv");
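
       As a minimal sketch (the queue and the file name below are made up),
       a code-ref can feed "csv" one record per invocation; a false return
       value ends the output:

           my @queue = ([ 1, "one" ], [ 2, "two" ]);
           csv (in => sub { shift @queue }, out => "numbers.csv");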
1762
1763 If called in void context without the "out" attribute, the resulting
1764 ref will be used as input to a subsequent call to csv:
1765
1766 csv (in => "file.csv", filter => { 2 => sub { length > 2 }})
1767
1768 will be a shortcut to
1769
1770 csv (in => csv (in => "file.csv", filter => { 2 => sub { length > 2 }}))
1771
1772 where, in the absence of the "out" attribute, this is a shortcut to
1773
1774 csv (in => csv (in => "file.csv", filter => { 2 => sub { length > 2 }}),
1775 out => *STDOUT)
1776
1777 out
1778
1779 csv (in => $aoa, out => "file.csv");
1780 csv (in => $aoa, out => $fh);
1781 csv (in => $aoa, out => STDOUT);
1782 csv (in => $aoa, out => *STDOUT);
1783 csv (in => $aoa, out => \*STDOUT);
1784 csv (in => $aoa, out => \my $data);
1785 csv (in => $aoa, out => undef);
1786 csv (in => $aoa, out => \"skip");
1787
1788 csv (in => $fh, out => \@aoa);
1789 csv (in => $fh, out => \@aoh, bom => 1);
1790 csv (in => $fh, out => \%hsh, key => "key");
1791
1792 In output mode, the default CSV options when producing CSV are
1793
1794 eol => "\r\n"
1795
1796 The "fragment" attribute is ignored in output mode.
1797
1798 "out" can be a file name (e.g. "file.csv"), which will be opened for
1799 writing and closed when finished, a file handle (e.g. $fh or "FH"), a
1800 reference to a glob (e.g. "\*STDOUT"), the glob itself (e.g. *STDOUT),
1801 or a reference to a scalar (e.g. "\my $data").
1802
1803 csv (in => sub { $sth->fetch }, out => "dump.csv");
1804 csv (in => sub { $sth->fetchrow_hashref }, out => "dump.csv",
1805 headers => $sth->{NAME_lc});
1806
1807 When a code-ref is used for "in", the output is generated per
1808 invocation, so no buffering is involved. This implies that there is no
1809 size restriction on the number of records. The "csv" function ends when
1810 the coderef returns a false value.
1811
1812 If "out" is set to a reference of the literal string "skip", the output
1813 will be suppressed completely, which might be useful in combination
1814 with a filter for side effects only.
1815
1816 my %cache;
1817 csv (in => "dump.csv",
1818 out => \"skip",
1819 on_in => sub { $cache{$_[1][1]}++ });
1820
1821 Currently, setting "out" to any false value ("undef", "", 0) will be
1822 equivalent to "\"skip"".
1823
1824 If the "in" argument point to something to parse, and the "out" is set
1825 to a reference to an "ARRAY" or a "HASH", the output is appended to the
1826 data in the existing reference. The result of the parse should match
1827 what exists in the reference passed. This might come handy when you
1828 have to parse a set of files with similar content (like data stored per
1829 period) and you want to collect that into a single data structure:
1830
1831 my %hash;
1832 csv (in => $_, out => \%hash, key => "id") for sort glob "foo-[0-9]*.csv";
1833
1834 my @list; # List of arrays
1835 csv (in => $_, out => \@list) for sort glob "foo-[0-9]*.csv";
1836
1837 my @list; # List of hashes
1838 csv (in => $_, out => \@list, bom => 1) for sort glob "foo-[0-9]*.csv";
1839
1840 encoding
1841
1842 If passed, it should be an encoding accepted by the :encoding()
1843 option to "open". There is no default value. This attribute does not
1844 work in perl 5.6.x. "encoding" can be abbreviated to "enc" for ease of
1845 use in command line invocations.
1846
1847 If "encoding" is set to the literal value "auto", the method "header"
1848 will be invoked on the opened stream to check if there is a BOM and set
1849 the encoding accordingly. This is equal to passing a true value in
1850 the option "detect_bom".
1851
1852 Encodings can be stacked, as supported by "binmode":
1853
1854 # Using PerlIO::via::gzip
1855 csv (in => \@csv,
1856 out => "test.csv:via.gz",
1857 encoding => ":via(gzip):encoding(utf-8)",
1858 );
1859 $aoa = csv (in => "test.csv:via.gz", encoding => ":via(gzip)");
1860
1861 # Using PerlIO::gzip
1862 csv (in => \@csv,
            out      => "test.csv:gzip.gz",
1864 encoding => ":gzip:encoding(utf-8)",
1865 );
1866 $aoa = csv (in => "test.csv:gzip.gz", encoding => ":gzip");
1867
1868 detect_bom
1869
1870 If "detect_bom" is given, the method "header" will be invoked on
1871 the opened stream to check if there is a BOM and set the encoding
1872 accordingly.
1873
1874 "detect_bom" can be abbreviated to "bom".
1875
1876 This is the same as setting "encoding" to "auto".
1877
1878 Note that as the method "header" is invoked, its default is to also
1879 set the headers.
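
       A minimal sketch (the file name is hypothetical):

           my $aoh = csv (in => "file.csv", detect_bom => 1);
           my $aoh = csv (in => "file.csv", bom => 1);   # same, abbreviated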
1880
1881 headers
1882
1883 If this attribute is not given, the default behavior is to produce an
1884 array of arrays.
1885
1886 If "headers" is supplied, it should be an anonymous list of column
1887 names, an anonymous hashref, a coderef, or a literal flag: "auto",
1888 "lc", "uc", or "skip".
1889
1890 skip
1891 When "skip" is used, the header will not be included in the output.
1892
1893 my $aoa = csv (in => $fh, headers => "skip");
1894
1895 "skip" is invalid/ignored in combinations with "detect_bom".
1896
1897 auto
1898 If "auto" is used, the first line of the "CSV" source will be read as
1899 the list of field headers and used to produce an array of hashes.
1900
1901 my $aoh = csv (in => $fh, headers => "auto");
1902
1903 lc
1904 If "lc" is used, the first line of the "CSV" source will be read as
1905 the list of field headers mapped to lower case and used to produce
1906 an array of hashes. This is a variation of "auto".
1907
1908 my $aoh = csv (in => $fh, headers => "lc");
1909
1910 uc
1911 If "uc" is used, the first line of the "CSV" source will be read as
1912 the list of field headers mapped to upper case and used to produce
1913 an array of hashes. This is a variation of "auto".
1914
1915 my $aoh = csv (in => $fh, headers => "uc");
1916
1917 CODE
1918 If a coderef is used, the first line of the "CSV" source will be
1919 read as the list of mangled field headers in which each field is
1920 passed as the only argument to the coderef. This list is used to
1921 produce an array of hashes.
1922
1923 my $aoh = csv (in => $fh,
1924 headers => sub { lc ($_[0]) =~ s/kode/code/gr });
1925
1926 this example is a variation of using "lc" where all occurrences of
1927 "kode" are replaced with "code".
1928
1929 ARRAY
1930 If "headers" is an anonymous list, the entries in the list will be
1931 used as field names. The first line is considered data instead of
1932 headers.
1933
1934 my $aoh = csv (in => $fh, headers => [qw( Foo Bar )]);
1935 csv (in => $aoa, out => $fh, headers => [qw( code description price )]);
1936
1937 HASH
1938 If "headers" is a hash reference, this implies "auto", but header
1939 fields that exist as key in the hashref will be replaced by the value
1940 for that key. Given a CSV file like
1941
1942 post-kode,city,name,id number,fubble
1943 1234AA,Duckstad,Donald,13,"X313DF"
1944
1945 using
1946
1947 csv (headers => { "post-kode" => "pc", "id number" => "ID" }, ...
1948
1949 will return an entry like
1950
1951 { pc => "1234AA",
1952 city => "Duckstad",
1953 name => "Donald",
1954 ID => "13",
1955 fubble => "X313DF",
1956 }
1957
1958 See also "munge_column_names" and "set_column_names".
1959
1960 munge_column_names
1961
1962 If "munge_column_names" is set, the method "header" is invoked on
1963 the opened stream with all matching arguments to detect and set the
1964 headers.
1965
1966 "munge_column_names" can be abbreviated to "munge".
1967
1968 key
1969
1970 If passed, will default "headers" to "auto" and return a hashref
1971 instead of an array of hashes. Allowed values are simple scalars or
1972 array-references where the first element is the joiner and the rest are
1973 the fields to join to combine the key.
1974
1975 my $ref = csv (in => "test.csv", key => "code");
1976 my $ref = csv (in => "test.csv", key => [ ":" => "code", "color" ]);
1977
1978 with test.csv like
1979
1980 code,product,price,color
1981 1,pc,850,gray
1982 2,keyboard,12,white
1983 3,mouse,5,black
1984
1985 the first example will return
1986
1987 { 1 => {
1988 code => 1,
1989 color => 'gray',
1990 price => 850,
1991 product => 'pc'
1992 },
1993 2 => {
1994 code => 2,
1995 color => 'white',
1996 price => 12,
1997 product => 'keyboard'
1998 },
1999 3 => {
2000 code => 3,
2001 color => 'black',
2002 price => 5,
2003 product => 'mouse'
2004 }
2005 }
2006
2007 the second example will return
2008
2009 { "1:gray" => {
2010 code => 1,
2011 color => 'gray',
2012 price => 850,
2013 product => 'pc'
2014 },
2015 "2:white" => {
2016 code => 2,
2017 color => 'white',
2018 price => 12,
2019 product => 'keyboard'
2020 },
2021 "3:black" => {
2022 code => 3,
2023 color => 'black',
2024 price => 5,
2025 product => 'mouse'
2026 }
2027 }
2028
2029 The "key" attribute can be combined with "headers" for "CSV" date that
2030 has no header line, like
2031
2032 my $ref = csv (
2033 in => "foo.csv",
2034 headers => [qw( c_foo foo bar description stock )],
2035 key => "c_foo",
2036 );
2037
2038 value
2039
2040 Used to create key-value hashes.
2041
2042 Only allowed when "key" is valid. A "value" can be either a single
2043 column label or an anonymous list of column labels. In the first case,
       the value will be a simple scalar value; in the latter case, it will
       be a hashref.
2046
2047 my $ref = csv (in => "test.csv", key => "code",
2048 value => "price");
2049 my $ref = csv (in => "test.csv", key => "code",
2050 value => [ "product", "price" ]);
2051 my $ref = csv (in => "test.csv", key => [ ":" => "code", "color" ],
2052 value => "price");
2053 my $ref = csv (in => "test.csv", key => [ ":" => "code", "color" ],
2054 value => [ "product", "price" ]);
2055
2056 with test.csv like
2057
2058 code,product,price,color
2059 1,pc,850,gray
2060 2,keyboard,12,white
2061 3,mouse,5,black
2062
2063 the first example will return
2064
2065 { 1 => 850,
2066 2 => 12,
2067 3 => 5,
2068 }
2069
2070 the second example will return
2071
2072 { 1 => {
2073 price => 850,
2074 product => 'pc'
2075 },
2076 2 => {
2077 price => 12,
2078 product => 'keyboard'
2079 },
2080 3 => {
2081 price => 5,
2082 product => 'mouse'
2083 }
2084 }
2085
2086 the third example will return
2087
2088 { "1:gray" => 850,
2089 "2:white" => 12,
2090 "3:black" => 5,
2091 }
2092
2093 the fourth example will return
2094
2095 { "1:gray" => {
2096 price => 850,
2097 product => 'pc'
2098 },
2099 "2:white" => {
2100 price => 12,
2101 product => 'keyboard'
2102 },
2103 "3:black" => {
2104 price => 5,
2105 product => 'mouse'
2106 }
2107 }
2108
2109 keep_headers
2110
       When using hashes, store the column names in the arrayref passed, so
       all headers are available after the call, in the original order.
2113
2114 my $aoh = csv (in => "file.csv", keep_headers => \my @hdr);
2115
2116 This attribute can be abbreviated to "kh" or passed as
2117 "keep_column_names".
2118
2119 This attribute implies a default of "auto" for the "headers" attribute.
2120
2121 The headers can also be kept internally to keep stable header order:
2122
2123 csv (in => csv (in => "file.csv", kh => "internal"),
2124 out => "new.csv",
2125 kh => "internal");
2126
2127 where "internal" can also be 1, "yes", or "true". This is similar to
2128
2129 my @h;
2130 csv (in => csv (in => "file.csv", kh => \@h),
2131 out => "new.csv",
2132 headers => \@h);
2133
2134 fragment
2135
2136 Only output the fragment as defined in the "fragment" method. This
2137 option is ignored when generating "CSV". See "out".
2138
2139 Combining all of them could give something like
2140
2141 use Text::CSV_XS qw( csv );
2142 my $aoh = csv (
2143 in => "test.txt",
2144 encoding => "utf-8",
2145 headers => "auto",
2146 sep_char => "|",
2147 fragment => "row=3;6-9;15-*",
2148 );
2149 say $aoh->[15]{Foo};
2150
2151 sep_set
2152
2153 If "sep_set" is set, the method "header" is invoked on the opened
2154 stream to detect and set "sep_char" with the given set.
2155
2156 "sep_set" can be abbreviated to "seps". If neither "sep_set" not "seps"
2157 is given, but "sep" is defined, "sep_set" defaults to "[ sep ]". This
2158 is only supported for perl version 5.10 and up.
2159
2160 Note that as the "header" method is invoked, its default is to also
2161 set the headers.
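
       A minimal sketch (the file name is hypothetical):

           my $aoh = csv (in => "file.csv",
                          sep_set => [ ";", ",", "|", "\t" ]);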
2162
2163 set_column_names
2164
2165 If "set_column_names" is passed, the method "header" is invoked on
2166 the opened stream with all arguments meant for "header".
2167
2168 If "set_column_names" is passed as a false value, the content of the
2169 first row is only preserved if the output is AoA:
2170
2171 With an input-file like
2172
2173 bAr,foo
2174 1,2
2175 3,4,5
2176
2177 This call
2178
2179 my $aoa = csv (in => $file, set_column_names => 0);
2180
2181 will result in
2182
2183 [[ "bar", "foo" ],
2184 [ "1", "2" ],
2185 [ "3", "4", "5" ]]
2186
2187 and
2188
2189 my $aoa = csv (in => $file, set_column_names => 0, munge => "none");
2190
2191 will result in
2192
2193 [[ "bAr", "foo" ],
2194 [ "1", "2" ],
2195 [ "3", "4", "5" ]]
2196
2197 Callbacks
2198 Callbacks enable actions triggered from the inside of Text::CSV_XS.
2199
       While most of what this enables can easily be done in an unrolled
       loop as described in the "SYNOPSIS", callbacks can be used to meet
       special demands or to enhance the "csv" function.
2203
2204 error
2205 $csv->callbacks (error => sub { $csv->SetDiag (0) });
2206
2207 the "error" callback is invoked when an error occurs, but only
2208 when "auto_diag" is set to a true value. A callback is invoked with
2209 the values returned by "error_diag":
2210
2211 my ($c, $s);
2212
2213 sub ignore3006 {
2214 my ($err, $msg, $pos, $recno, $fldno) = @_;
2215 if ($err == 3006) {
2216 # ignore this error
2217 ($c, $s) = (undef, undef);
2218 Text::CSV_XS->SetDiag (0);
2219 }
2220 # Any other error
2221 return;
2222 } # ignore3006
2223
2224 $csv->callbacks (error => \&ignore3006);
2225 $csv->bind_columns (\$c, \$s);
2226 while ($csv->getline ($fh)) {
2227 # Error 3006 will not stop the loop
2228 }
2229
2230 after_parse
2231 $csv->callbacks (after_parse => sub { push @{$_[1]}, "NEW" });
2232 while (my $row = $csv->getline ($fh)) {
2233 $row->[-1] eq "NEW";
2234 }
2235
2236 This callback is invoked after parsing with "getline" only if no
2237 error occurred. The callback is invoked with two arguments: the
2238 current "CSV" parser object and an array reference to the fields
2239 parsed.
2240
2241 The return code of the callback is ignored unless it is a reference
2242 to the string "skip", in which case the record will be skipped in
2243 "getline_all".
2244
2245 sub add_from_db {
2246 my ($csv, $row) = @_;
2247 $sth->execute ($row->[4]);
2248 push @$row, $sth->fetchrow_array;
2249 } # add_from_db
2250
2251 my $aoa = csv (in => "file.csv", callbacks => {
2252 after_parse => \&add_from_db });
2253
2254 This hook can be used for validation:
2255
2256 FAIL
2257 Die if any of the records does not validate a rule:
2258
2259 after_parse => sub {
2260 $_[1][4] =~ m/^[0-9]{4}\s?[A-Z]{2}$/ or
2261 die "5th field does not have a valid Dutch zipcode";
2262 }
2263
2264 DEFAULT
2265 Replace invalid fields with a default value:
2266
2267 after_parse => sub { $_[1][2] =~ m/^\d+$/ or $_[1][2] = 0 }
2268
2269 SKIP
2270 Skip records that have invalid fields (only applies to
2271 "getline_all"):
2272
2273 after_parse => sub { $_[1][0] =~ m/^\d+$/ or return \"skip"; }
2274
2275 before_print
2276 my $idx = 1;
2277 $csv->callbacks (before_print => sub { $_[1][0] = $idx++ });
2278 $csv->print (*STDOUT, [ 0, $_ ]) for @members;
2279
2280 This callback is invoked before printing with "print" only if no
2281 error occurred. The callback is invoked with two arguments: the
2282 current "CSV" parser object and an array reference to the fields
2283 passed.
2284
2285 The return code of the callback is ignored.
2286
2287 sub max_4_fields {
2288 my ($csv, $row) = @_;
2289 @$row > 4 and splice @$row, 4;
2290 } # max_4_fields
2291
2292 csv (in => csv (in => "file.csv"), out => *STDOUT,
2293 callbacks => { before_print => \&max_4_fields });
2294
2295 This callback is not active for "combine".
2296
2297 Callbacks for csv ()
2298
2299 The "csv" allows for some callbacks that do not integrate in XS
2300 internals but only feature the "csv" function.
2301
2302 csv (in => "file.csv",
2303 callbacks => {
2304 filter => { 6 => sub { $_ > 15 } }, # first
2305 after_parse => sub { say "AFTER PARSE"; }, # first
2306 after_in => sub { say "AFTER IN"; }, # second
2307 on_in => sub { say "ON IN"; }, # third
2308 },
2309 );
2310
2311 csv (in => $aoh,
2312 out => "file.csv",
2313 callbacks => {
2314 on_in => sub { say "ON IN"; }, # first
2315 before_out => sub { say "BEFORE OUT"; }, # second
2316 before_print => sub { say "BEFORE PRINT"; }, # third
2317 },
2318 );
2319
2320 filter
2321 This callback can be used to filter records. It is called just after
2322 a new record has been scanned. The callback accepts a:
2323
2324 hashref
2325 The keys are the index to the row (the field name or field number,
2326 1-based) and the values are subs to return a true or false value.
2327
2328 csv (in => "file.csv", filter => {
2329 3 => sub { m/a/ }, # third field should contain an "a"
           5 => sub { length > 4 }, # 5th field should be at least 5 characters long
2331 });
2332
2333 csv (in => "file.csv", filter => { foo => sub { $_ > 4 }});
2334
2335 If the keys to the filter hash contain any character that is not a
2336 digit it will also implicitly set "headers" to "auto" unless
2337 "headers" was already passed as argument. When headers are
2338 active, returning an array of hashes, the filter is not applicable
2339 to the header itself.
2340
2341 All sub results should match, as in AND.
2342
2343 The context of the callback sets $_ localized to the field
2344 indicated by the filter. The two arguments are as with all other
2345 callbacks, so the other fields in the current row can be seen:
2346
2347 filter => { 3 => sub { $_ > 100 ? $_[1][1] =~ m/A/ : $_[1][6] =~ m/B/ }}
2348
2349 If the context is set to return a list of hashes ("headers" is
2350 defined), the current record will also be available in the
2351 localized %_:
2352
2353 filter => { 3 => sub { $_ > 100 && $_{foo} =~ m/A/ && $_{bar} < 1000 }}
2354
2355 If the filter is used to alter the content by changing $_, make
2356 sure that the sub returns true in order not to have that record
2357 skipped:
2358
2359 filter => { 2 => sub { $_ = uc }}
2360
2361 will upper-case the second field, and then skip it if the resulting
2362 content evaluates to false. To always accept, end with truth:
2363
2364 filter => { 2 => sub { $_ = uc; 1 }}
2365
2366 coderef
2367 csv (in => "file.csv", filter => sub { $n++; 0; });
2368
2369 If the argument to "filter" is a coderef, it is an alias or
2370 shortcut to a filter on column 0:
2371
2372 csv (filter => sub { $n++; 0 });
2373
2374 is equal to
2375
           csv (filter => { 0 => sub { $n++; 0 }});
2377
2378 filter-name
2379 csv (in => "file.csv", filter => "not_blank");
2380 csv (in => "file.csv", filter => "not_empty");
2381 csv (in => "file.csv", filter => "filled");
2382
           These are predefined filters.
2384
2385 Given a file like (line numbers prefixed for doc purpose only):
2386
2387 1:1,2,3
2388 2:
2389 3:,
2390 4:""
2391 5:,,
2392 6:, ,
2393 7:"",
2394 8:" "
2395 9:4,5,6
2396
2397 not_blank
2398 Filter out the blank lines
2399
2400 This filter is a shortcut for
2401
2402 filter => { 0 => sub { @{$_[1]} > 1 or
2403 defined $_[1][0] && $_[1][0] ne "" } }
2404
2405 Due to the implementation, it is currently impossible to also
           filter lines that consist only of a quoted empty field. These
2407 lines are also considered blank lines.
2408
2409 With the given example, lines 2 and 4 will be skipped.
2410
2411 not_empty
2412 Filter out lines where all the fields are empty.
2413
2414 This filter is a shortcut for
2415
2416 filter => { 0 => sub { grep { defined && $_ ne "" } @{$_[1]} } }
2417
           A space is not regarded as being empty, so given the example data,
2419 lines 2, 3, 4, 5, and 7 are skipped.
2420
2421 filled
2422 Filter out lines that have no visible data
2423
2424 This filter is a shortcut for
2425
2426 filter => { 0 => sub { grep { defined && m/\S/ } @{$_[1]} } }
2427
           This filter rejects all lines that do not have at least one
           field containing non-whitespace data.
2430
2431 With the given example data, this filter would skip lines 2
2432 through 8.
2433
2434 One could also use modules like Types::Standard:
2435
2436 use Types::Standard -types;
2437
2438 my $type = Tuple[Str, Str, Int, Bool, Optional[Num]];
2439 my $check = $type->compiled_check;
2440
2441 # filter with compiled check and warnings
2442 my $aoa = csv (
2443 in => \$data,
2444 filter => {
2445 0 => sub {
2446 my $ok = $check->($_[1]) or
2447 warn $type->get_message ($_[1]), "\n";
2448 return $ok;
2449 },
2450 },
2451 );
2452
2453 after_in
2454 This callback is invoked for each record after all records have been
2455 parsed but before returning the reference to the caller. The hook is
2456 invoked with two arguments: the current "CSV" parser object and a
2457 reference to the record. The reference can be a reference to a
2458 HASH or a reference to an ARRAY as determined by the arguments.
2459
2460 This callback can also be passed as an attribute without the
2461 "callbacks" wrapper.
2462
2463 before_out
2464 This callback is invoked for each record before the record is
2465 printed. The hook is invoked with two arguments: the current "CSV"
2466 parser object and a reference to the record. The reference can be a
2467 reference to a HASH or a reference to an ARRAY as determined by the
2468 arguments.
2469
2470 This callback can also be passed as an attribute without the
2471 "callbacks" wrapper.
2472
2473 This callback makes the row available in %_ if the row is a hashref.
2474 In this case %_ is writable and will change the original row.
2475
2476 on_in
2477 This callback acts exactly as the "after_in" or the "before_out"
2478 hooks.
2479
2480 This callback can also be passed as an attribute without the
2481 "callbacks" wrapper.
2482
2483 This callback makes the row available in %_ if the row is a hashref.
2484 In this case %_ is writable and will change the original row. So e.g.
2485 with
2486
2487 my $aoh = csv (
2488 in => \"foo\n1\n2\n",
2489 headers => "auto",
2490 on_in => sub { $_{bar} = 2; },
2491 );
2492
2493 $aoh will be:
2494
2495 [ { foo => 1,
2496 bar => 2,
2497 }
2498 { foo => 2,
2499 bar => 2,
2500 }
2501 ]
2502
2503 csv
2504 The function "csv" can also be called as a method or with an
       existing Text::CSV_XS object. This could help if the function is to
       be invoked many times, as passing an existing instance avoids the
       overhead of creating a new object internally on every call.
2509
2510 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
2511
2512 my $aoa = $csv->csv (in => $fh);
2513 my $aoa = csv (in => $fh, csv => $csv);
2514
       both act the same. Running this 20000 times on a 20-line CSV file
       showed a 53% speedup.
2517
2519 Combine (...)
2520 Parse (...)
2521
       The arguments to these internal functions are deliberately not
       described or documented in order to allow the module authors to
       change them when they feel the need. Using them is highly
       discouraged, as the API may change in future releases.
2526
2528 Reading a CSV file line by line:
2529 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
2530 open my $fh, "<", "file.csv" or die "file.csv: $!";
2531 while (my $row = $csv->getline ($fh)) {
2532 # do something with @$row
2533 }
2534 close $fh or die "file.csv: $!";
2535
2536 or
2537
2538 my $aoa = csv (in => "file.csv", on_in => sub {
2539 # do something with %_
2540 });
2541
2542 Reading only a single column
2543
2544 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
2545 open my $fh, "<", "file.csv" or die "file.csv: $!";
2546 # get only the 4th column
2547 my @column = map { $_->[3] } @{$csv->getline_all ($fh)};
2548 close $fh or die "file.csv: $!";
2549
2550 with "csv", you could do
2551
2552 my @column = map { $_->[0] }
2553 @{csv (in => "file.csv", fragment => "col=4")};
2554
2555 Parsing CSV strings:
2556 my $csv = Text::CSV_XS->new ({ keep_meta_info => 1, binary => 1 });
2557
2558 my $sample_input_string =
2559 qq{"I said, ""Hi!""",Yes,"",2.34,,"1.09","\x{20ac}",};
2560 if ($csv->parse ($sample_input_string)) {
2561 my @field = $csv->fields;
2562 foreach my $col (0 .. $#field) {
2563 my $quo = $csv->is_quoted ($col) ? $csv->{quote_char} : "";
2564 printf "%2d: %s%s%s\n", $col, $quo, $field[$col], $quo;
2565 }
2566 }
2567 else {
2568 print STDERR "parse () failed on argument: ",
2569 $csv->error_input, "\n";
2570 $csv->error_diag ();
2571 }
2572
2573 Parsing CSV from memory
2574
2575 Given a complete CSV data-set in scalar $data, generate a list of
2576 lists to represent the rows and fields
2577
2578 # The data
2579 my $data = join "\r\n" => map { join "," => 0 .. 5 } 0 .. 5;
2580
2581 # in a loop
2582 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
2583 open my $fh, "<", \$data;
2584 my @foo;
2585 while (my $row = $csv->getline ($fh)) {
2586 push @foo, $row;
2587 }
2588 close $fh;
2589
2590 # a single call
2591 my $foo = csv (in => \$data);
2592
2593 Printing CSV data
2594 The fast way: using "print"
2595
2596 An example for creating "CSV" files using the "print" method:
2597
2598 my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
2599 open my $fh, ">", "foo.csv" or die "foo.csv: $!";
2600 for (1 .. 10) {
2601 $csv->print ($fh, [ $_, "$_" ]) or $csv->error_diag;
2602 }
       close $fh or die "foo.csv: $!";
2604
2605 The slow way: using "combine" and "string"
2606
2607 or using the slower "combine" and "string" methods:
2608
2609 my $csv = Text::CSV_XS->new;
2610
2611 open my $csv_fh, ">", "hello.csv" or die "hello.csv: $!";
2612
2613 my @sample_input_fields = (
2614 'You said, "Hello!"', 5.67,
2615 '"Surely"', '', '3.14159');
2616 if ($csv->combine (@sample_input_fields)) {
2617 print $csv_fh $csv->string, "\n";
2618 }
2619 else {
2620 print "combine () failed on argument: ",
2621 $csv->error_input, "\n";
2622 }
2623 close $csv_fh or die "hello.csv: $!";
2624
2625 Generating CSV into memory
2626
2627 Format a data-set (@foo) into a scalar value in memory ($data):
2628
2629 # The data
2630 my @foo = map { [ 0 .. 5 ] } 0 .. 3;
2631
2632 # in a loop
2633 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, eol => "\r\n" });
2634 open my $fh, ">", \my $data;
2635 $csv->print ($fh, $_) for @foo;
2636 close $fh;
2637
2638 # a single call
2639 csv (in => \@foo, out => \my $data);
2640
2641 Rewriting CSV
2642 Rewrite "CSV" files with ";" as separator character to well-formed
2643 "CSV":
2644
2645 use Text::CSV_XS qw( csv );
2646 csv (in => csv (in => "bad.csv", sep_char => ";"), out => *STDOUT);
2647
2648 As "STDOUT" is now default in "csv", a one-liner converting a UTF-16
2649 CSV file with BOM and TAB-separation to valid UTF-8 CSV could be:
2650
2651 $ perl -C3 -MText::CSV_XS=csv -we\
2652 'csv(in=>"utf16tab.csv",encoding=>"utf16",sep=>"\t")' >utf8.csv
2653
2654 Dumping database tables to CSV
       Dumping a database table can be as simple as this (TIMTOWTDI):
2656
2657 my $dbh = DBI->connect (...);
2658 my $sql = "select * from foo";
2659
2660 # using your own loop
2661 open my $fh, ">", "foo.csv" or die "foo.csv: $!\n";
2662 my $csv = Text::CSV_XS->new ({ binary => 1, eol => "\r\n" });
2663 my $sth = $dbh->prepare ($sql); $sth->execute;
2664 $csv->print ($fh, $sth->{NAME_lc});
2665 while (my $row = $sth->fetch) {
2666 $csv->print ($fh, $row);
2667 }
2668
2669 # using the csv function, all in memory
2670 csv (out => "foo.csv", in => $dbh->selectall_arrayref ($sql));
2671
2672 # using the csv function, streaming with callbacks
2673 my $sth = $dbh->prepare ($sql); $sth->execute;
2674 csv (out => "foo.csv", in => sub { $sth->fetch });
2675 csv (out => "foo.csv", in => sub { $sth->fetchrow_hashref });
2676
2677 Note that this does not discriminate between "empty" values and NULL-
2678 values from the database, as both will be the same empty field in CSV.
2679 To enable distinction between the two, use "quote_empty".
2680
2681 csv (out => "foo.csv", in => sub { $sth->fetch }, quote_empty => 1);
2682
2683 If the database import utility supports special sequences to insert
2684 "NULL" values into the database, like MySQL/MariaDB supports "\N",
2685 use a filter or a map
2686
2687 csv (out => "foo.csv", in => sub { $sth->fetch },
2688 on_in => sub { $_ //= "\\N" for @{$_[1]} });
2689
2690 while (my $row = $sth->fetch) {
2691 $csv->print ($fh, [ map { $_ // "\\N" } @$row ]);
2692 }
2693
2694 Note that this will not work as expected when choosing the backslash
2695 ("\") as "escape_char", as that will cause the "\" to need to be
2696 escaped by yet another "\", which will cause the field to need
2697 quotation and thus ending up as "\\N" instead of "\N". See also
2698 "undef_str".
2699
2700 csv (out => "foo.csv", in => sub { $sth->fetch }, undef_str => "\\N");
2701
2702 These special sequences are not recognized by Text::CSV_XS on parsing
2703 the CSV generated like this, but map and filter are your friends again
2704
2705 while (my $row = $csv->getline ($fh)) {
2706 $sth->execute (map { $_ eq "\\N" ? undef : $_ } @$row);
2707 }
2708
2709 csv (in => "foo.csv", filter => { 1 => sub {
2710 $sth->execute (map { $_ eq "\\N" ? undef : $_ } @{$_[1]}); 0; }});
2711
2712 Converting CSV to JSON
2713 use Text::CSV_XS qw( csv );
2714 use JSON; # or Cpanel::JSON::XS for better performance
2715
2716 # AoA (no header interpretation)
2717 say encode_json (csv (in => "file.csv"));
2718
2719 # AoH (convert to structures)
2720 say encode_json (csv (in => "file.csv", bom => 1));
2721
2722 Yes, it is that simple.
2723
2724 The examples folder
2725 For more extended examples, see the examples/ 1. sub-directory in the
2726 original distribution or the git repository 2.
2727
2728 1. https://github.com/Tux/Text-CSV_XS/tree/master/examples
2729 2. https://github.com/Tux/Text-CSV_XS
2730
2731 The following files can be found there:
2732
2733 parser-xs.pl
           This can be used as a boilerplate to parse invalid "CSV" and
           parse beyond (expected) errors, as an alternative to using the
           "error" callback.
2736
2737 $ perl examples/parser-xs.pl bad.csv >good.csv
2738
2739 csv-check
2740 This is a command-line tool that uses parser-xs.pl techniques to
2741 check the "CSV" file and report on its content.
2742
2743 $ csv-check files/utf8.csv
2744 Checked files/utf8.csv with csv-check 1.9
2745 using Text::CSV_XS 1.32 with perl 5.26.0 and Unicode 9.0.0
2746 OK: rows: 1, columns: 2
2747 sep = <,>, quo = <">, bin = <1>, eol = <"\n">
2748
2749 csv-split
2750 This command splits "CSV" files into smaller files, keeping (part
2751 of) the header. Options include maximum number of (data) rows per
2752 file and maximum number of columns per file or a combination of the
2753 two.
2754
2755 csv2xls
2756 A script to convert "CSV" to Microsoft Excel ("XLS"). This requires
2757 extra modules Date::Calc and Spreadsheet::WriteExcel. The converter
2758 accepts various options and can produce UTF-8 compliant Excel files.
2759
2760 csv2xlsx
2761 A script to convert "CSV" to Microsoft Excel ("XLSX"). This requires
           the modules Date::Calc and Excel::Writer::XLSX. The converter
           accepts various options, including merging several "CSV" files
           into a single Excel file.
2765
2766 csvdiff
2767 A script that provides colorized diff on sorted CSV files, assuming
2768 first line is header and first field is the key. Output options
2769 include colorized ANSI escape codes or HTML.
2770
2771 $ csvdiff --html --output=diff.html file1.csv file2.csv
2772
2773 rewrite.pl
           A script to rewrite (in)valid CSV into valid CSV files. The
           script has options to generate confusing CSV files or CSV files
           that conform to Dutch MS-Excel exports (using ";" as separator).
2777
           By default, the script honors a BOM and auto-detects the
           separator, converting the input to standard CSV with "," as
           separator.
2780
2782 Text::CSV_XS is not designed to detect the characters used to quote
2783 and separate fields. The parsing is done using predefined (default)
2784 settings. In the examples sub-directory, you can find scripts that
2785 demonstrate how you could try to detect these characters yourself.
2786
2787 Microsoft Excel
2788 The import/export from Microsoft Excel is a risky task, according to
2789 the documentation in "Text::CSV::Separator". Microsoft uses the
2790 system's list separator defined in the regional settings, which happens
2791 to be a semicolon for Dutch, German and Spanish (and probably some
2792 others as well). For the English locale, the default is a comma.
2793 In Windows however, the user is free to choose a predefined locale,
2794 and then change every individual setting in it, so checking the
2795 locale is no solution.
2796
2797 As of version 1.17, a lone first line with just
2798
2799 sep=;
2800
2801 will be recognized and honored when parsing with "getline".
2802
2804 More Errors & Warnings
2805 New extensions ought to be clear and concise in reporting what
2806 error has occurred where and why, and maybe also offer a remedy to
2807 the problem.
2808
2809 "error_diag" is a (very) good start, but there is more work to be
2810 done in this area.
2811
2812 Basic calls should croak or warn on illegal parameters. Errors
2813 should be documented.
2814
2815 setting meta info
2816 Future extensions might include extending the "meta_info",
2817 "is_quoted", and "is_binary" to accept setting these flags for
2818 fields, so you can specify which fields are quoted in the
2819 "combine"/"string" combination.
2820
2821 $csv->meta_info (0, 1, 1, 3, 0, 0);
2822 $csv->is_quoted (3, 1);
2823
2824 Metadata Vocabulary for Tabular Data
2825 <http://w3c.github.io/csvw/metadata/> (a W3C editor's draft) could be
2826 an example for supporting more metadata.
2827
2828 Parse the whole file at once
2829 Implement new methods or functions that enable parsing of a
2830 complete file at once, returning a list of hashes. Possible extension
2831 to this could be to enable a column selection on the call:
2832
2833 my @AoH = $csv->parse_file ($filename, { cols => [ 1, 4..8, 12 ]});
2834
2835 returning something like
2836
2837 [ { fields => [ 1, 2, "foo", 4.5, undef, "", 8 ],
2838 flags => [ ... ],
2839 },
2840 { fields => [ ... ],
2841 .
2842 },
2843 ]
2844
2845 Note that the "csv" function already supports most of this, but does
2846 not return flags. "getline_all" returns all rows for an open stream,
2847 but this will not return flags either. "fragment" can reduce the
2848 required rows or columns, but cannot combine them.
2849
2850 provider
2851 csv (in => $fh) vs csv (provider => sub { get_line });
2852
       Whatever the attribute name ends up being, this should make it
2854 easier to add input providers for parsing. Currently most special
2855 variations for the "in" attribute are aimed at CSV generation: e.g. a
2856 callback is defined to return a reference to a record. This new
2857 attribute should enable passing data to parse, like getline.
2858
2859 Suggested by Johan Vromans.
2860
2861 Cookbook
2862 Write a document that has recipes for most known non-standard (and
2863 maybe some standard) "CSV" formats, including formats that use
2864 "TAB", ";", "|", or other non-comma separators.
2865
2866 Examples could be taken from W3C's CSV on the Web: Use Cases and
2867 Requirements <http://w3c.github.io/csvw/use-cases-and-
2868 requirements/index.html>
2869
2870 Steal
2871 Steal good new ideas and features from PapaParse
2872 <http://papaparse.com> or csvkit <http://csvkit.readthedocs.org>.
2873
2874 Raku support
2875 Raku support can be found here <https://github.com/Tux/CSV>. The
2876 interface is richer in support than the Perl5 API, as Raku supports
2877 more types.
2878
2879 The Raku version does not (yet) support pure binary CSV datasets.
2880
2881 NOT TODO
2882 combined methods
2883 Requests for adding means (methods) that combine "combine" and
2884 "string" in a single call will not be honored (use "print" instead).
2885 Likewise for "parse" and "fields" (use "getline" instead), given the
2886 problems with embedded newlines.
2887
2888 Release plan
2889 No guarantees, but this is what I had in mind some time ago:
2890
2891 • DIAGNOSTICS section in pod to *describe* the errors (see below)
2892
2894 Everything should now work on native EBCDIC systems. As the test does
2895 not cover all possible codepoints and Encode does not support
2896 "utf-ebcdic", there is no guarantee that all handling of Unicode is
       done correctly.
2898
2899 Opening "EBCDIC" encoded files on "ASCII"+ systems is likely to
2900 succeed using Encode's "cp37", "cp1047", or "posix-bc":
2901
2902 open my $fh, "<:encoding(cp1047)", "ebcdic_file.csv" or die "...";
2903
2905 Still under construction ...
2906
2907 If an error occurs, "$csv->error_diag" can be used to get information
2908 on the cause of the failure. Note that for speed reasons the internal
2909 value is never cleared on success, so using the value returned by
2910 "error_diag" in normal cases - when no error occurred - may cause
2911 unexpected results.
2912
2913 If the constructor failed, the cause can be found using "error_diag" as
2914 a class method, like "Text::CSV_XS->error_diag".
2915
2916 The "$csv->error_diag" method is automatically invoked upon error when
       the constructor was called with "auto_diag" set to 1 or 2, or when
2918 autodie is in effect. When set to 1, this will cause a "warn" with the
2919 error message, when set to 2, it will "die". "2012 - EOF" is excluded
2920 from "auto_diag" reports.
2921
2922 Errors can be (individually) caught using the "error" callback.
2923
2924 The errors as described below are available. I have tried to make the
2925 error itself explanatory enough, but more descriptions will be added.
2926 For most of these errors, the first three capitals describe the error
2927 category:
2928
2929 • INI
2930
2931 Initialization error or option conflict.
2932
2933 • ECR
2934
2935 Carriage-Return related parse error.
2936
2937 • EOF
2938
2939 End-Of-File related parse error.
2940
2941 • EIQ
2942
2943 Parse error inside quotation.
2944
2945 • EIF
2946
2947 Parse error inside field.
2948
2949 • ECB
2950
2951 Combine error.
2952
2953 • EHR
2954
2955 HashRef parse related error.
2956
2957 And below should be the complete list of error codes that can be
2958 returned:
2959
2960 • 1001 "INI - sep_char is equal to quote_char or escape_char"
2961
2962 The separation character cannot be equal to the quotation
2963 character or to the escape character, as this would invalidate all
2964 parsing rules.
2965
2966 • 1002 "INI - allow_whitespace with escape_char or quote_char SP or
2967 TAB"
2968
2969 Using the "allow_whitespace" attribute when either "quote_char" or
2970 "escape_char" is equal to "SPACE" or "TAB" is too ambiguous to
2971 allow.
2972
2973 • 1003 "INI - \r or \n in main attr not allowed"
2974
2975 Using default "eol" characters in either "sep_char", "quote_char",
2976 or "escape_char" is not allowed.
2977
2978 • 1004 "INI - callbacks should be undef or a hashref"
2979
2980 The "callbacks" attribute only allows one to be "undef" or a hash
2981 reference.
2982
2983 • 1005 "INI - EOL too long"
2984
2985 The value passed for EOL is exceeding its maximum length (16).
2986
2987 • 1006 "INI - SEP too long"
2988
2989 The value passed for SEP is exceeding its maximum length (16).
2990
2991 • 1007 "INI - QUOTE too long"
2992
2993 The value passed for QUOTE is exceeding its maximum length (16).
2994
2995 • 1008 "INI - SEP undefined"
2996
2997 The value passed for SEP should be defined and not empty.
2998
2999 • 1010 "INI - the header is empty"
3000
3001 The header line parsed in the "header" is empty.
3002
3003 • 1011 "INI - the header contains more than one valid separator"
3004
3005 The header line parsed in the "header" contains more than one
3006 (unique) separator character out of the allowed set of separators.
3007
3008 • 1012 "INI - the header contains an empty field"
3009
3010 The header line parsed in the "header" contains an empty field.
3011
       • 1013 "INI - the header contains non-unique fields"
3013
3014 The header line parsed in the "header" contains at least two
3015 identical fields.
3016
3017 • 1014 "INI - header called on undefined stream"
3018
3019 The header line cannot be parsed from an undefined source.
3020
3021 • 1500 "PRM - Invalid/unsupported argument(s)"
3022
3023 Function or method called with invalid argument(s) or parameter(s).
3024
3025 • 1501 "PRM - The key attribute is passed as an unsupported type"
3026
3027 The "key" attribute is of an unsupported type.
3028
3029 • 1502 "PRM - The value attribute is passed without the key attribute"
3030
3031 The "value" attribute is only allowed when a valid key is given.
3032
3033 • 1503 "PRM - The value attribute is passed as an unsupported type"
3034
3035 The "value" attribute is of an unsupported type.
3036
3037 • 2010 "ECR - QUO char inside quotes followed by CR not part of EOL"
3038
3039 When "eol" has been set to anything but the default, like
3040 "\r\t\n", and the "\r" is following the second (closing)
3041 "quote_char", where the characters following the "\r" do not make up
3042 the "eol" sequence, this is an error.
3043
3044 • 2011 "ECR - Characters after end of quoted field"
3045
3046 Sequences like "1,foo,"bar"baz,22,1" are not allowed. "bar" is a
3047 quoted field and after the closing double-quote, there should be
3048 either a new-line sequence or a separation character.
3049
3050 • 2012 "EOF - End of data in parsing input stream"
3051
         Self-explanatory. End-of-file while inside parsing a stream. Can
3053 happen only when reading from streams with "getline", as using
3054 "parse" is done on strings that are not required to have a trailing
3055 "eol".
3056
3057 • 2013 "INI - Specification error for fragments RFC7111"
3058
3059 Invalid specification for URI "fragment" specification.
3060
3061 • 2014 "ENF - Inconsistent number of fields"
3062
3063 Inconsistent number of fields under strict parsing.
3064
3065 • 2015 "ERW - Empty row"
3066
3067 An empty row was not allowed.
3068
3069 • 2021 "EIQ - NL char inside quotes, binary off"
3070
3071 Sequences like "1,"foo\nbar",22,1" are allowed only when the binary
3072 option has been selected with the constructor.
3073
3074 • 2022 "EIQ - CR char inside quotes, binary off"
3075
3076 Sequences like "1,"foo\rbar",22,1" are allowed only when the binary
3077 option has been selected with the constructor.
3078
3079 • 2023 "EIQ - QUO character not allowed"
3080
3081 Sequences like ""foo "bar" baz",qu" and "2023,",2008-04-05,"Foo,
3082 Bar",\n" will cause this error.
3083
3084 • 2024 "EIQ - EOF cannot be escaped, not even inside quotes"
3085
3086 The escape character is not allowed as last character in an input
3087 stream.
3088
3089 • 2025 "EIQ - Loose unescaped escape"
3090
3091 An escape character should escape only characters that need escaping.
3092
3093 Allowing the escape for other characters is possible with the
3094 attribute "allow_loose_escapes".
3095
3096 • 2026 "EIQ - Binary character inside quoted field, binary off"
3097
         Binary characters are not allowed by default. Exceptions are
         fields that contain valid UTF-8, which will automatically be
         upgraded. Set "binary" to 1 to accept binary data.
3102
3103 • 2027 "EIQ - Quoted field not terminated"
3104
3105 When parsing a field that started with a quotation character, the
3106 field is expected to be closed with a quotation character. When the
3107 parsed line is exhausted before the quote is found, that field is not
3108 terminated.
3109
3110 • 2030 "EIF - NL char inside unquoted verbatim, binary off"
3111
3112 • 2031 "EIF - CR char is first char of field, not part of EOL"
3113
3114 • 2032 "EIF - CR char inside unquoted, not part of EOL"
3115
3116 • 2034 "EIF - Loose unescaped quote"
3117
3118 • 2035 "EIF - Escaped EOF in unquoted field"
3119
3120 • 2036 "EIF - ESC error"
3121
3122 • 2037 "EIF - Binary character in unquoted field, binary off"
3123
3124 • 2110 "ECB - Binary character in Combine, binary off"
3125
3126 • 2200 "EIO - print to IO failed. See errno"
3127
3128 • 3001 "EHR - Unsupported syntax for column_names ()"
3129
3130 • 3002 "EHR - getline_hr () called before column_names ()"
3131
3132 • 3003 "EHR - bind_columns () and column_names () fields count
3133 mismatch"
3134
3135 • 3004 "EHR - bind_columns () only accepts refs to scalars"
3136
3137 • 3006 "EHR - bind_columns () did not pass enough refs for parsed
3138 fields"
3139
3140 • 3007 "EHR - bind_columns needs refs to writable scalars"
3141
3142 • 3008 "EHR - unexpected error in bound fields"
3143
3144 • 3009 "EHR - print_hr () called before column_names ()"
3145
3146 • 3010 "EHR - print_hr () called with invalid arguments"
3147
3149 IO::File, IO::Handle, IO::Wrap, Text::CSV, Text::CSV_PP,
3150 Text::CSV::Encoded, Text::CSV::Separator, Text::CSV::Slurp,
3151 Spreadsheet::CSV and Spreadsheet::Read, and of course perl.
3152
3153 If you are using Raku, have a look at "Text::CSV" in the Raku
3154 ecosystem, offering the same features.
3155
3156 non-perl
3157
3158 A CSV parser in JavaScript, also used by W3C <http://www.w3.org>, is
3159 the multi-threaded in-browser PapaParse <http://papaparse.com/>.
3160
3161 csvkit <http://csvkit.readthedocs.org> is a python CSV parsing toolkit.
3162
3164 Alan Citterman <alan@mfgrtl.com> wrote the original Perl module.
3165 Please don't send mail concerning Text::CSV_XS to Alan, who is not
3166 involved in the C/XS part that is now the main part of the module.
3167
3168 Jochen Wiedmann <joe@ispsoft.de> rewrote the en- and decoding in C by
3169 implementing a simple finite-state machine. He added variable quote,
3170 escape and separator characters, the binary mode and the print and
3171 getline methods. See ChangeLog releases 0.10 through 0.23.
3172
3173 H.Merijn Brand <hmbrand@cpan.org> cleaned up the code, added the field
3174 flags methods, wrote the major part of the test suite, completed the
3175 documentation, fixed most RT bugs, added all the allow flags and the
3176 "csv" function. See ChangeLog releases 0.25 and on.
3177
3179 Copyright (C) 2007-2023 H.Merijn Brand. All rights reserved.
3180 Copyright (C) 1998-2001 Jochen Wiedmann. All rights reserved.
3181 Copyright (C) 1997 Alan Citterman. All rights reserved.
3182
3183 This library is free software; you can redistribute and/or modify it
3184 under the same terms as Perl itself.
3185
3186
3187
3188perl v5.38.0 2023-09-21 CSV_XS(3)