1Text::CSV::Separator(3)User Contributed Perl DocumentatioTnext::CSV::Separator(3)
2
3
4
6 Text::CSV::Separator - Determine the field separator of a CSV file
7
9 Version 0.19 - December 4, 2007
10
12 use Text::CSV::Separator qw(get_separator);
13
14 my @char_list = get_separator(
15 path => $csv_path,
16 exclude => $array1_ref, # optional
17 include => $array2_ref, # optional
18 echo => 1, # optional
19 );
20
21 my $separator;
22 if (@char_list) {
23 if (@char_list == 1) { # successful detection
24 $separator = $char_list[0];
25 } else { # several candidates passed the tests
26 # Some code here
27 } else { # no candidate passed the tests
28 # Some code here
29 }
30
31
32 # "I'm Feeling Lucky" alternative interface
33 # Don't forget to include the 'lucky' parameter
34
35 my $separator = get_separator(
36 path => $csv_path,
37 lucky => 1,
38 exclude => $array1_ref, # optional
39 include => $array2_ref, # optional
40 echo => 1, # optional
41 );
42
44 This module provides a fast detection of the field separator character
45 (also called field delimiter) of a CSV file, or more generally, of a
46 character separated text file (also called delimited text file), and
47 returns it ready to use in a CSV parser (e.g., Text::CSV_XS,
48 Tie::CSV_File, or Text::CSV::Simple). This may be useful to the
49 vulnerable -and often ignored- population of programmers who need to
50 process automatically CSV files from different sources.
51
52 The default set of candidates contains the following characters: ','
53 ';' ':' '|' '\t'
54
55 The only required parameter is the CSV file path. Optionally, the user
56 can specify characters to be excluded or included in the list of
57 candidates.
58
59 The routine returns an array containing the list of candidates that
60 passed the tests. If it succeeds, this array will contain only one
61 value: the field separator we are looking for. On the other hand, if no
62 candidate survives the tests, it will return an empty list.
63
64 The technique used is based on the following principle:
65
66 · For every line in the file, the number of instances of the
67 separator character acting as separators must be an integer
68 constant > 0 , although a line may also contain some instances
69 of that character as literal characters.
70
71 · Most of the other candidates won't appear in a typical CSV
72 line.
73
74 As soon as a candidate misses a line, it will be removed from the
75 candidates list.
76
77 This is the first test done to the CSV file. In most cases, it will
78 detect the separator after processing the first few lines. In
79 particular, if the file contains a header line, one line will probably
80 be enough to get the job done. Processing will stop and return control
81 to the caller as soon as the program reaches a status of 1 single
82 candidate (or 0 candidates left).
83
84 If the routine cannot determine the separator in the first pass, it
85 will do a second pass based on several heuristic techniques. It checks
86 whether the file has columns consisting of time values, comma-separated
87 decimal numbers, or numbers containing a comma as the group separator,
88 which can lead to false positives in files that don't have a header
89 row. It also measures the variability of the remaining candidates. Of
90 course, you can always create a CSV file capable of resisting the
91 siege, but this approach will work correctly in many cases. The
92 possibility of excluding some of the default candidates may help to
93 resolve cases with several possible winners. The resulting array
94 contains the list of possible separators sorted by their likelihood,
95 being the first array item the most probable separator.
96
97 The module also provides an alternative interface with a simpler
98 syntax, which can be handy if you think that the files your program
99 will have to deal with aren't too exotic. To use it you only have to
100 add the lucky => 1 key-value pair to the parameters hash and the
101 routine will return a single value, so you can assign it directly to a
102 scalar variable. If no candidate survives the first pass, it will
103 return "undef". The code skips the 2nd pass, which is usually
104 unnecessary, so the program won't store counts and won't check any
105 existing regularities. Hence, it will run faster and will require less
106 memory. This approach should be enough in most cases.
107
109 get_separator(%options)
110 Returns an array containing the field separator character (or
111 characters, if more than one candidate passed the tests) of a CSV
112 file. In case no candidate passes the tests, it returns an empty
113 list.
114
115 The available parameters are:
116
117 path Required. The path to the CSV file.
118
119 exclude Optional. Array containing characters to be excluded from
120 the candidates list.
121
122 include Optional. Array containing characters to be included in the
123 candidates list.
124
125 lucky Optional. If selected, get_separator will return one single
126 character, or "undef" in case no separator is detected. Off
127 by default.
128
129 echo Optional. Writes to the standard output messages describing
130 the actions performed. Off by default. This is useful to
131 keep track of what's going on, especially for debugging
132 purposes.
133
135 None by default.
136
138 Consider the following scenario: Your program must process a batch of
139 csv files, and you know that the separator could be a comma, a
140 semicolon or a tab. You also know that one of the fields contains time
141 values. This field will provide a fixed number of colons that could
142 mislead the detection code. In this case, you should exclude the colon
143 (and you can also exclude the other default candidate not considered,
144 the pipe character):
145
146 my @char_list = get_separator(
147 path => $csv_path,
148 exclude => [':', '|'],
149 );
150
151 if (@char_list) {
152 my $separator;
153 if (@char_list == 1) {
154 $separator = $char_list[0];
155 } else {
156 # Some code here
157 }
158 }
159
160
161 # Using the "I'm Feeling Lucky" interface:
162
163 my $separator = get_separator(
164 path => $csv_path,
165 lucky => 1,
166 exclude => [':', '|'],
167 );
168
170 Despite the popularity of XML, the CSV file format is still widely used
171 for data exchange between applications, because of its much lower
172 overhead: It requires much less bandwidth and storage space than XML,
173 and it also has a better performance under compression (see the
174 References below).
175
176 Unfortunately, there is no formal specification of the CSV format. The
177 Microsoft Excel implementation is the most widely used and it has
178 become a de facto standard, but the variations are almost endless.
179
180 One of the biggest annoyances of this format is that in most cases you
181 don't know a priori what is the field separator character used in a
182 file. CSV stands for "comma-separated values", but most of the
183 spreadsheet applications let the user select the field delimiter from a
184 list of several different characters when saving or exporting data to a
185 CSV file. Furthermore, in a Windows system, when you save a
186 spreadsheet in Excel as a CSV file, Excel will use as the field
187 delimiter the default list separator of your system's locale, which
188 happens to be a semicolon for several European languages. You can even
189 customize this setting and use the list separator you like. For these
190 and other reasons, automating the processing of CSV files is a risky
191 task.
192
193 This module can be used to determine the separator character of a
194 delimited text file of any kind, but since the aforementioned ambiguity
195 problems occur mainly in CSV files, I decided to use the Text::CSV::
196 namespace.
197
199 <http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm>
200
201 <http://www.xml.com/pub/a/2004/12/15/deviant.html>
202
204 There's another module in CPAN for this task,
205 Text::CSV::DetectSeparator, which follows a different approach.
206
208 Many thanks to Xavier Noria for wise suggestions. The author is also
209 grateful to Thomas Zahreddin, Benjamin Erhart, Ferdinand Gassauer, and
210 Mario Krauss for valuable comments and bug reports.
211
213 Enrique Nell, <perl_nell@telefonica.net>
214
216 Copyright (C) 2006 by Enrique Nell.
217
218 This library is free software; you can redistribute it and/or modify it
219 under the same terms as Perl itself.
220
221
222
223perl v5.12.0 2007-12-04 Text::CSV::Separator(3)