Text::CSV::Separator(3pm)

Text::CSV::Separator(3)  User Contributed Perl Documentation  Text::CSV::Separator(3)

NAME

       Text::CSV::Separator - Determine the field separator of a CSV file

VERSION

       Version 0.20 - November 2, 2008

SYNOPSIS

           use Text::CSV::Separator qw(get_separator);

           my @char_list = get_separator(
                                           path    => $csv_path,
                                           exclude => $array1_ref, # optional
                                           include => $array2_ref, # optional
                                           echo    => 1,           # optional
                                        );

           my $separator;
           if (@char_list) {
               if (@char_list == 1) {           # successful detection
                   $separator = $char_list[0];
               } else {                         # several candidates passed the tests
                   # Some code here
               }
           } else {                             # no candidate passed the tests
               # Some code here
           }


           # "I'm Feeling Lucky" alternative interface
           # Don't forget to include the 'lucky' parameter

           my $separator = get_separator(
                                           path    => $csv_path,
                                           lucky   => 1,
                                           exclude => $array1_ref, # optional
                                           include => $array2_ref, # optional
                                           echo    => 1,           # optional
                                        );

DESCRIPTION

       This module provides fast detection of the field separator character
       (also called field delimiter) of a CSV file, or more generally, of a
       character-separated text file (also called delimited text file), and
       returns it ready to use in a CSV parser (e.g., Text::CSV_XS,
       Tie::CSV_File, or Text::CSV::Simple).  This may be useful to the
       vulnerable (and often ignored) population of programmers who need to
       automatically process CSV files from different sources.

       The default set of candidates contains the following characters:
       ','  ';'  ':'  '|'  '\t'

       The only required parameter is the CSV file path. Optionally, the user
       can specify characters to be excluded from or included in the list of
       candidates.

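       As an illustration of how the include and exclude options could
       shape the candidate list, here is a minimal sketch. The helper
       candidate_list is hypothetical and is not part of
       Text::CSV::Separator; the order in which the two options are applied
       is an assumption.

```perl
use strict;
use warnings;

# Default candidates as listed in this document.
my @DEFAULT = (',', ';', ':', '|', "\t");

# Hypothetical helper (not the module's code): builds the candidate list
# from the defaults plus any included characters, minus any excluded ones.
# Assumption: exclusion wins if a character appears in both lists.
sub candidate_list {
    my (%opt) = @_;
    my %excluded = map { $_ => 1 } @{ $opt{exclude} || [] };
    my %seen;
    # Defaults first, then extras; drop duplicates and excluded characters.
    return grep { !$excluded{$_} && !$seen{$_}++ }
           (@DEFAULT, @{ $opt{include} || [] });
}

my @list = candidate_list(exclude => [':', '|'], include => ['#']);

# Show the tab as '\t' so the output stays readable.
print join(' ', map { $_ eq "\t" ? '\t' : $_ } @list), "\n";
```
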
       The routine returns an array containing the list of candidates that
       passed the tests. If it succeeds, this array will contain only one
       value: the field separator we are looking for. On the other hand, if no
       candidate survives the tests, it will return an empty list.

       The technique used is based on the following principle:

       ·       For every line in the file, the number of instances of the
               separator character acting as separators must be an integer
               constant > 0, although a line may also contain some instances
               of that character as literal characters.

       ·       Most of the other candidates won't appear in a typical CSV
               line.

       As soon as a candidate misses a line, it will be removed from the
       candidates list.

       This is the first test applied to the CSV file. In most cases, it will
       detect the separator after processing the first few lines. In
       particular, if the file contains a header line, one line will probably
       be enough to get the job done.  Processing will stop and return control
       to the caller as soon as a single candidate remains (or none is left).

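       The first pass described above can be sketched as follows. This is a
       simplified illustration, not the module's actual code: the helper
       surviving_candidates is hypothetical, it works on lines already held
       in memory instead of reading a file, and it only checks whether each
       candidate is present in a line.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::CSV::Separator).
# Takes a reference to an array of lines and returns the candidates
# that appear at least once in every line examined.
sub surviving_candidates {
    my ($lines_ref) = @_;
    my @candidates = (',', ';', ':', '|', "\t");   # default set from the text

    for my $line (@$lines_ref) {
        # Drop every candidate that is absent from this line.
        @candidates = grep { index($line, $_) >= 0 } @candidates;

        # Stop as soon as one candidate (or none) is left.
        last if @candidates <= 1;
    }
    return @candidates;
}

my @lines = (
    "id;name;start",          # header line: only ';' appears
    "1;Smith, John;09:30",
    "2;Doe, Jane;10:15",
);
my @left = surviving_candidates(\@lines);
print "Survivors: @left\n";
```

       In this sample the header line alone settles the question, because
       the comma and the colon only show up in later data lines.
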
       If the routine cannot determine the separator in the first pass, it
       will do a second pass based on several heuristic techniques. It checks
       whether the file has columns consisting of time values, comma-separated
       decimal numbers, or numbers containing a comma as the group separator,
       which can lead to false positives in files that don't have a header
       row. It also measures the variability of the remaining candidates.  Of
       course, you can always create a CSV file capable of resisting the
       siege, but this approach will work correctly in many cases. The
       possibility of excluding some of the default candidates may help to
       resolve cases with several possible winners.  The resulting array
       contains the list of possible separators sorted by their likelihood,
       with the most probable separator as the first item.

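       The text does not spell out how variability is measured. One
       plausible reading, shown here purely as an assumption, is to count
       each remaining candidate on every line and to prefer the candidate
       whose counts vary least. The helper rank_by_variability is
       hypothetical and uses the population variance of the per-line counts.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper (an assumption, not the module's algorithm).
# Ranks candidates by the variance of their per-line counts,
# lowest variance (most regular) first.
sub rank_by_variability {
    my ($lines_ref, @candidates) = @_;
    return @candidates unless @$lines_ref;

    my %spread;
    for my $char (@candidates) {
        # Count occurrences of the candidate in each line.
        my @counts = map { my $n = () = /\Q$char\E/g; $n } @$lines_ref;
        my $mean   = sum(@counts) / @counts;
        # Population variance of the per-line counts.
        $spread{$char} = sum(map { ($_ - $mean) ** 2 } @counts) / @counts;
    }
    return sort { $spread{$a} <=> $spread{$b} || $a cmp $b } @candidates;
}

# No header row; some fields hold numbers with a decimal comma.
my @lines = (
    "1,5;2,7;x",
    "3;4,1;y",
    "10,25;8;z",
);
my @ranked = rank_by_variability(\@lines, ',', ';');
print "Most likely separator: $ranked[0]\n";
```

       Both characters appear on every line, so a presence check alone
       cannot decide; the semicolon wins here because it occurs exactly
       twice per line while the comma count changes.
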
       The module also provides an alternative interface with a simpler
       syntax, which can be handy if you think that the files your program
       will have to deal with aren't too exotic. To use it you only have to
       add the lucky => 1 key-value pair to the parameters hash and the
       routine will return a single value, so you can assign it directly to a
       scalar variable.  If no candidate survives the first pass, it will
       return "undef".  The code skips the second pass, which is usually
       unnecessary, so the program won't store counts and won't check for
       regularities. Hence, it will run faster and will require less
       memory. This approach should be enough in most cases.

FUNCTIONS

       "get_separator(%options)"
           Returns an array containing the field separator character (or
           characters, if more than one candidate passed the tests) of a CSV
           file. If no candidate passes the tests, it returns an empty
           list.

           The available parameters are:

           ·       "path"

                   Required. The path to the CSV file.

           ·       "exclude"

                   Optional. Reference to an array containing characters to
                   be excluded from the candidates list.

           ·       "include"

                   Optional. Reference to an array containing characters to
                   be included in the candidates list.

           ·       "lucky"

                   Optional. If set, get_separator will return one single
                   character, or "undef" if no separator is detected. Off
                   by default.

           ·       "echo"

                   Optional. Writes messages describing the actions
                   performed to the standard output. Off by default.  This
                   is useful to keep track of what's going on, especially
                   for debugging purposes.

EXPORT

       None by default.

EXAMPLE

       Consider the following scenario: Your program must process a batch of
       CSV files, and you know that the separator could be a comma, a
       semicolon or a tab.  You also know that one of the fields contains time
       values. This field will provide a fixed number of colons that could
       mislead the detection code.  In this case, you should exclude the colon
       (and you can also exclude the other default candidate not considered,
       the pipe character):

           my @char_list = get_separator(
                                           path    => $csv_path,
                                           exclude => [':', '|'],
                                        );

           if (@char_list) {
               my $separator;
               if (@char_list == 1) {
                   $separator = $char_list[0];
               } else {
                   # Some code here
               }
           }


           # Using the "I'm Feeling Lucky" interface:

           my $separator = get_separator(
                                           path    => $csv_path,
                                           lucky   => 1,
                                           exclude => [':', '|'],
                                        );

MOTIVATION

       Despite the popularity of XML, the CSV file format is still widely used
       for data exchange between applications, because of its much lower
       overhead: It requires much less bandwidth and storage space than XML,
       and it also performs better under compression (see the References
       below).

       Unfortunately, there is no formal specification of the CSV format.  The
       Microsoft Excel implementation is the most widely used and it has
       become a de facto standard, but the variations are almost endless.

       One of the biggest annoyances of this format is that in most cases you
       don't know a priori which field separator character a file uses.
       CSV stands for "comma-separated values", but most spreadsheet
       applications let the user select the field delimiter from a
       list of several different characters when saving or exporting data to a
       CSV file.  Furthermore, on a Windows system, when you save a
       spreadsheet in Excel as a CSV file, Excel will use as the field
       delimiter the default list separator of your system's locale, which
       happens to be a semicolon for several European languages. You can even
       customize this setting and use the list separator you like. For these
       and other reasons, automating the processing of CSV files is a risky
       task.

       This module can be used to determine the separator character of a
       delimited text file of any kind, but since the aforementioned ambiguity
       problems occur mainly in CSV files, I decided to use the Text::CSV::
       namespace.

REFERENCES

       <http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm>

       <http://www.xml.com/pub/a/2004/12/15/deviant.html>

ACKNOWLEDGEMENTS

       Many thanks to Xavier Noria for wise suggestions.  The author is also
       grateful to Thomas Zahreddin, Benjamin Erhart, Ferdinand Gassauer, and
       Mario Krauss for valuable comments and bug reports.

AUTHOR

       Enrique Nell, <perl_nell@telefonica.net>

COPYRIGHT AND LICENSE

       Copyright (C) 2006 by Enrique Nell.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.28.0                      2008-11-02           Text::CSV::Separator(3)