1Text::CSV::Separator(3)User Contributed Perl DocumentatioTnext::CSV::Separator(3)
2
3
4

NAME

6       Text::CSV::Separator - Determine the field separator of a CSV file
7

VERSION

9       Version 0.19 - December 4, 2007
10

SYNOPSIS

12           use Text::CSV::Separator qw(get_separator);
13
14           my @char_list = get_separator(
15                                           path    => $csv_path,
16                                           exclude => $array1_ref, # optional
17                                           include => $array2_ref, # optional
18                                           echo    => 1,           # optional
19                                        );
20
21           my $separator;
22           if (@char_list) {
23               if (@char_list == 1) {           # successful detection
24                   $separator = $char_list[0];
25               } else {                         # several candidates passed the tests
26                   # Some code here
27           } else {                             # no candidate passed the tests
28                   # Some code here
29           }
30
31
32           # "I'm Feeling Lucky" alternative interface
33           # Don't forget to include the 'lucky' parameter
34
35           my $separator = get_separator(
36                                           path    => $csv_path,
37                                           lucky   => 1,
38                                           exclude => $array1_ref, # optional
39                                           include => $array2_ref, # optional
40                                           echo    => 1,           # optional
41                                        );
42

DESCRIPTION

44       This module provides a fast detection of the field separator character
45       (also called field delimiter) of a CSV file, or more generally, of a
46       character separated text file (also called delimited text file), and
47       returns it ready to use in a CSV parser (e.g., Text::CSV_XS,
48       Tie::CSV_File, or Text::CSV::Simple).  This may be useful to the
49       vulnerable -and often ignored- population of programmers who need to
50       process automatically CSV files from different sources.
51
52       The default set of candidates contains the following characters: ','
53       ';'  ':'  '|'  '\t'
54
55       The only required parameter is the CSV file path. Optionally, the user
56       can specify characters to be excluded or included in the list of
57       candidates.
58
59       The routine returns an array containing the list of candidates that
60       passed the tests. If it succeeds, this array will contain only one
61       value: the field separator we are looking for. On the other hand, if no
62       candidate survives the tests, it will return an empty list.
63
64       The technique used is based on the following principle:
65
66       ·       For every line in the file, the number of instances of the
67               separator character acting as separators must be an integer
68               constant > 0 , although a line may also contain some instances
69               of that character as literal characters.
70
71       ·       Most of the other candidates won't appear in a typical CSV
72               line.
73
74       As soon as a candidate misses a line, it will be removed from the
75       candidates list.
76
77       This is the first test done to the CSV file. In most cases, it will
78       detect the separator after processing the first few lines. In
79       particular, if the file contains a header line, one line will probably
80       be enough to get the job done.  Processing will stop and return control
81       to the caller as soon as the program reaches a status of 1 single
82       candidate (or 0 candidates left).
83
84       If the routine cannot determine the separator in the first pass, it
85       will do a second pass based on several heuristic techniques. It checks
86       whether the file has columns consisting of time values, comma-separated
87       decimal numbers, or numbers containing a comma as the group separator,
88       which can lead to false positives in files that don't have a header
89       row. It also measures the variability of the remaining candidates.  Of
90       course, you can always create a CSV file capable of resisting the
91       siege, but this approach will work correctly in many cases. The
92       possibility of excluding some of the default candidates may help to
93       resolve cases with several possible winners.  The resulting array
94       contains the list of possible separators sorted by their likelihood,
95       being the first array item the most probable separator.
96
97       The module also provides an alternative interface with a simpler
98       syntax, which can be handy if you think that the files your program
99       will have to deal with aren't too exotic. To use it you only have to
100       add the lucky => 1 key-value pair to the parameters hash and the
101       routine will return a single value, so you can assign it directly to a
102       scalar variable.  If no candidate survives the first pass, it will
103       return "undef".  The code skips the 2nd pass, which is usually
104       unnecessary, so the program won't store counts and won't check any
105       existing regularities. Hence, it will run faster and will require less
106       memory. This approach should be enough in most cases.
107

FUNCTIONS

109       get_separator(%options)
110           Returns an array containing the field separator character (or
111           characters, if more than one candidate passed the tests) of a CSV
112           file. In case no candidate passes the tests, it returns an empty
113           list.
114
115           The available parameters are:
116
117           path    Required. The path to the CSV file.
118
119           exclude Optional. Array containing characters to be excluded from
120                   the candidates list.
121
122           include Optional. Array containing characters to be included in the
123                   candidates list.
124
125           lucky   Optional. If selected, get_separator will return one single
126                   character, or "undef" in case no separator is detected. Off
127                   by default.
128
129           echo    Optional. Writes to the standard output messages describing
130                   the actions performed. Off by default.  This is useful to
131                   keep track of what's going on, especially for debugging
132                   purposes.
133

EXPORT

135       None by default.
136

EXAMPLE

138       Consider the following scenario: Your program must process a batch of
139       csv files, and you know that the separator could be a comma, a
140       semicolon or a tab.  You also know that one of the fields contains time
141       values. This field will provide a fixed number of colons that could
142       mislead the detection code.  In this case, you should exclude the colon
143       (and you can also exclude the other default candidate not considered,
144       the pipe character):
145
146           my @char_list = get_separator(
147                                           path    => $csv_path,
148                                           exclude => [':', '|'],
149                                        );
150
151           if (@char_list) {
152               my $separator;
153               if (@char_list == 1) {
154                   $separator = $char_list[0];
155               } else {
156                   # Some code here
157               }
158           }
159
160
161           # Using the "I'm Feeling Lucky" interface:
162
163           my $separator = get_separator(
164                                           path    => $csv_path,
165                                           lucky   => 1,
166                                           exclude => [':', '|'],
167                                         );
168

MOTIVATION

170       Despite the popularity of XML, the CSV file format is still widely used
171       for data exchange between applications, because of its much lower
172       overhead: It requires much less bandwidth and storage space than XML,
173       and it also has a better performance under compression (see the
174       References below).
175
176       Unfortunately, there is no formal specification of the CSV format.  The
177       Microsoft Excel implementation is the most widely used and it has
178       become a de facto standard, but the variations are almost endless.
179
180       One of the biggest annoyances of this format is that in most cases you
181       don't know a priori what is the field separator character used in a
182       file.  CSV stands for "comma-separated values", but most of the
183       spreadsheet applications let the user select the field delimiter from a
184       list of several different characters when saving or exporting data to a
185       CSV file.  Furthermore, in a Windows system, when you save a
186       spreadsheet in Excel as a CSV file, Excel will use as the field
187       delimiter the default list separator of your system's locale, which
188       happens to be a semicolon for several European languages. You can even
189       customize this setting and use the list separator you like. For these
190       and other reasons, automating the processing of CSV files is a risky
191       task.
192
193       This module can be used to determine the separator character of a
194       delimited text file of any kind, but since the aforementioned ambiguity
195       problems occur mainly in CSV files, I decided to use the Text::CSV::
196       namespace.
197

REFERENCES

199       <http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm>
200
201       <http://www.xml.com/pub/a/2004/12/15/deviant.html>
202

SEE ALSO

204       There's another module in CPAN for this task,
205       Text::CSV::DetectSeparator, which follows a different approach.
206

ACKNOWLEDGEMENTS

208       Many thanks to Xavier Noria for wise suggestions.  The author is also
209       grateful to Thomas Zahreddin, Benjamin Erhart, Ferdinand Gassauer, and
210       Mario Krauss for valuable comments and bug reports.
211

AUTHOR

213       Enrique Nell, <perl_nell@telefonica.net>
214
216       Copyright (C) 2006 by Enrique Nell.
217
218       This library is free software; you can redistribute it and/or modify it
219       under the same terms as Perl itself.
220
221
222
223perl v5.12.0                      2007-12-04           Text::CSV::Separator(3)
Impressum