Text::xSV(3)          User Contributed Perl Documentation         Text::xSV(3)

NAME
       Text::xSV - read character separated files

SYNOPSIS
         use Text::xSV;
         my $csv = Text::xSV->new;
         $csv->open_file("foo.csv");
         $csv->read_header();
         # Make the headers case insensitive
         foreach my $field ($csv->get_fields) {
           if (lc($field) ne $field) {
             $csv->alias($field, lc($field));
           }
         }

         $csv->add_compute("message", sub {
           my $csv = shift;
           my ($name, $age) = $csv->extract(qw(name age));
           return "$name is $age years old\n";
         });

         while ($csv->get_row()) {
           my ($name, $age) = $csv->extract(qw(name age));
           print "$name is $age years old\n";
           # Same as
           #   print $csv->extract("message");
         }

         # The file above could have been created with:
         my $csv = Text::xSV->new(
           filename => "foo.csv",
           header   => ["Name", "Age", "Sex"],
         );
         $csv->print_header();
         $csv->print_row("Ben Tilly", 34, "M");
         # Same thing.
         $csv->print_data(
           Age  => 34,
           Name => "Ben Tilly",
           Sex  => "M",
         );

DESCRIPTION
       This module is for reading and writing a common variation of character
       separated data. The most common example is comma-separated. However
       that is far from the only possibility; the same basic format is
       exported by Microsoft products using tabs, colons, or other characters.

       The format is a series of rows separated by returns. Within each row
       you have a series of fields separated by your character separator.
       Fields may either be unquoted, in which case they do not contain a
       double-quote, separator, or return, or they are quoted, in which case
       they may contain anything, and will encode double-quotes by pairing
       them. In Microsoft products, quoted fields are strings and unquoted
       fields can be interpreted as being of various datatypes based on a set
       of heuristics. By and large this fact is irrelevant in Perl because
       Perl is largely untyped. The one exception that this module handles
       is that empty unquoted fields are treated as nulls, which are
       represented in Perl as undefined values. If you want a zero-length
       string, quote it.
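
The quoting rules just described can be sketched in a few lines of plain Perl. This is only an illustration of the format, not the code Text::xSV uses, and the module may choose to quote more fields than this does. The helper name format_fields is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::xSV): format one row following
# the rules in the text above.
sub format_fields {
    my ($sep, @fields) = @_;
    my @out;
    for my $field (@fields) {
        if (!defined $field) {
            # A null is written as an empty, unquoted field.
            push @out, '';
        }
        elsif ($field eq '' or $field =~ /["\r\n]/ or index($field, $sep) >= 0) {
            # Zero-length strings and fields holding a double-quote, a
            # return, or the separator must be quoted; embedded
            # double-quotes are encoded by pairing them.
            (my $quoted = $field) =~ s/"/""/g;
            push @out, qq("$quoted");
        }
        else {
            push @out, $field;
        }
    }
    return join($sep, @out) . "\n";
}

# Prints: Ben Tilly,,"","say ""hi""","a,b"
print format_fields(',', 'Ben Tilly', undef, '', 'say "hi"', 'a,b');
```

Note how the undefined value and the zero-length string come out differently: the first is an empty unquoted field, the second is a pair of double-quotes.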

       People usually naively solve this with split. A next step up is to
       read a line and parse it. Unfortunately this choice of interface
       (which is made by Text::CSV on CPAN) makes it difficult to handle
       returns embedded in a field. (Earlier versions of this document
       claimed this was impossible. That is false. But the calling code has
       to supply the logic to add lines until you have a valid row. To the
       extent that you don't do this consistently, your code will be buggy.)
       Therefore it is good for the parsing logic to have access to the whole
       file.

       This module solves the problem by creating an xSV object with access
       to the filehandle. If in parsing it notices that a new line is needed,
       it can read at will.
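
To make the problem concrete, here is a rough sketch in plain Perl of the "add lines until you have a valid row" logic that line-at-a-time callers would have to supply themselves. It is not how Text::xSV is implemented; it assumes well-formed input, where a complete row always contains an even number of double-quotes. The helper name read_logical_row and the sample data are hypothetical.

```perl
use strict;
use warnings;

# The naive approach: split breaks a quoted field that contains the separator.
my @naive = split /,/, '"Tilly, Ben",34';    # 3 pieces, though the row has 2 fields

# Hypothetical helper: keep appending physical lines while the count of
# double-quotes is odd, i.e. while we are still inside a quoted field.
sub read_logical_row {
    my ($fh) = @_;
    my $row = <$fh>;
    return unless defined $row;
    while (($row =~ tr/"//) % 2) {
        my $more = <$fh>;
        last unless defined $more;    # truncated file: give back what we have
        $row .= $more;
    }
    return $row;
}

# Four physical lines, three logical rows (the second has an embedded return).
my $data = qq(name,comment\n"Ben","line one\nline two"\nAnn,ok\n);
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @rows;
while (defined(my $row = read_logical_row($fh))) {
    push @rows, $row;
}
print scalar(@rows), " logical rows\n";
```

Every caller of a line-based parser needs something like this loop, which is the argument for giving the parser the filehandle instead.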

USAGE
       First you set up and initialize an object, then you read the xSV file
       through it. Creation can also perform several initializations at
       once. Here are the available methods:

       "new"
           This is the constructor. It takes a hash of optional arguments.
           They correspond to the following set_* methods without the set_
           prefix. For instance if you pass filename=>... in, then
           set_filename will be called.

           "set_sep"
               Sets the one character separator that divides fields.
               Defaults to a comma.

           "set_filename"
               The filename of the xSV file that you are reading. Used
               heavily in error reporting. If fh is not set and filename
               is, then fh will be set to the result of calling open on
               filename.

           "set_fh"
               Sets the fh that this Text::xSV object will read from or
               write to. If it is not set, it will be set to the result
               of opening filename if that is set, otherwise it will
               default to ARGV (i.e. acts like <>) or STDOUT, depending on
               whether you first try to read or write. The old default
               used to be STDIN.

           "set_header"
               Sets the internal header array of fields that is referred
               to in arranging data on the *_data output methods. If
               "bind_fields" has not been called, also calls that on the
               assumption that the fields that you want to output match
               the fields that you will provide.

               The return from this function is inconsistent and should
               not be relied on to be anything useful.

           "set_headers"
               An alias to "set_header".

           "set_error_handler"
               The error handler is an anonymous function which is
               expected to take an error message and do something useful
               with it. The default error handler is Carp::confess.
               Error handlers that do not trip exceptions (e.g. with die)
               are less tested and may not work perfectly in all
               circumstances.

           "set_warning_handler"
               The warning handler is an anonymous function which is
               expected to take a warning and do something useful with it.
               If no warning handler is supplied, the error handler is
               wrapped with "eval" and the trapped error is warned.

           "set_filter"
               The filter is an anonymous function which is expected to
               accept a line of input, and return a filtered line of
               output. The default filter removes \r so that Windows
               files can be read under Unix. This could also be used to,
               e.g., strip out Microsoft smart quotes.

           "set_quote_all"
               The quote_all option simply puts every output field into
               double quotation marks. This can't be set if "dont_quote"
               is.

           "set_dont_quote"
               The dont_quote option turns off the otherwise mandatory
               quotation marks that bracket the data fields when there are
               separator characters, spaces or other non-printable
               characters in the data field. This is perhaps a bit
               antithetical to the idea of safely enclosing data fields in
               quotation marks, but some applications, for instance
               Microsoft SQL Server's BULK INSERT, can't handle them.
               This can't be set if "quote_all" is.

           "set_row_size"
               The number of elements that you expect to see in each row.
               It defaults to the size of the first row read or set. If
               row_size_warning is true and the size of the row read or
               formatted does not match, then a warning is issued.

           "set_row_size_warning"
               Determines whether or not to issue warnings when the row
               read or set has a number of fields different from the
               expected number. Defaults to true. Whether or not this is
               on, missing fields are always read as undef, and extra
               fields are ignored.

           "set_close_fh"
               Whether or not to close fh when the object is DESTROYed.
               Defaults to false if fh was passed in, or true if the
               object has to open its own fh. (This may be removed in a
               future version.)

           "set_strict"
               In strict mode a single " within a quoted field is an
               error. In non-strict mode it is a warning. The default is
               strict.

       "open_file"
           Takes the name of a file, opens it, then sets the filename and fh.

       "bind_fields"
           Takes an array of fieldnames, memorizes the field positions for
           later use. "read_header" is preferred.

       "read_header"
           Reads a row from the file as a header line and memorizes the
           positions of the fields for later use. File formats that carry
           field information tend to be far more robust than ones which do
           not, so this is the preferred function.

       "read_headers"
           An alias for "read_header". (If I'm going to keep on typing the
           plural, I'll just make it work...)

       "bind_header"
           Another alias for "read_header" maintained for backwards
           compatibility. Deprecated because the name doesn't distinguish it
           well enough from the unrelated "set_header".

       "get_row"
           Reads a row from the file. Returns an array or reference to an
           array depending on context. Will also store the row in the row
           property for later access.

       "extract"
           Extracts a list of fields out of the last row read. In list
           context returns the list, in scalar context returns an anonymous
           array.

       "extract_hash"
           Extracts fields into a hash. If a list of fields is passed, that
           is the list of fields that go into the hash. If no list, it
           extracts all fields that it knows about. In list context returns
           the hash. In scalar context returns a reference to the hash.

       "fetchrow_hash"
           Combines "get_row" and "extract_hash" to fetch the next row and
           return a hash or hashref depending on context.

       "alias"
           Makes an existing field available under a new name.

             $csv->alias($old_name, $new_name);

       "get_fields"
           Returns a list of all known fields in no particular order.

       "add_compute"
           Adds an arbitrary compute. A compute is an arbitrary anonymous
           function. When the computed field is extracted, Text::xSV will
           call the compute in scalar context with the Text::xSV object as the
           only argument.

           Text::xSV caches results in case computes call other computes. It
           will also catch infinite recursion with a hopefully useful message.

       "format_row"
           Takes a list of fields, and returns them quoted as necessary,
           joined with sep, with a newline at the end.

       "format_header"
           Returns the formatted header row based on what was submitted with
           "set_header". Will cause an error if "set_header" was not called.

       "format_headers"
           Continuing the meme, an alias for format_header.

       "format_data"
           Takes a hash of data. Sets internal data, and then formats the
           result of "extract"ing out the fields corresponding to the headers.
           Note that if you called "bind_fields" and then defined some more
           fields with "add_compute", computes would be done for you on the
           fly.

       "print"
           Prints the arguments directly to fh. If fh is not supplied but
           filename is, first sets fh to the result of opening filename.
           Otherwise it defaults fh to STDOUT. You probably don't want to use
           this directly. Instead use one of the other print methods.

       "print_row"
           Does a "print" of "format_row". Convenient when you wish to
           maintain your knowledge of the field order.

       "print_header"
           Does a "print" of "format_header". Makes sense when you will be
           using print_data for your actual data because the field order is
           guaranteed to match up.

       "print_headers"
           An alias to "print_header".

       "print_data"
           Does a "print" of "format_data". Relieves you from having to
           synchronize field order in your code.
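
As a rough illustration of how the methods above fit together, the sketch below writes a small tab-separated file and reads it back, using only calls documented here. It assumes Text::xSV is installed. The in-memory filehandles, the field names, and the sample values are my own choices for the example; a real script would more likely pass filename instead of fh.

```perl
use strict;
use warnings;
use Text::xSV;

# Write two rows into an in-memory buffer (assumption: an in-memory
# filehandle stands in for a real file here).
my $buffer = '';
open my $out, '>', \$buffer or die "Cannot open in-memory handle: $!";
my $writer = Text::xSV->new(
    fh     => $out,
    sep    => "\t",
    header => [qw(name age)],
);
$writer->print_header();
$writer->print_row('Ben Tilly', 34);                        # in header order
$writer->print_data(age => 27, name => 'Ann "A" Smith');    # by field name
close $out or die "Cannot close in-memory handle: $!";

# Read the same data back and add a computed field.
open my $in, '<', \$buffer or die "Cannot open in-memory handle: $!";
my $reader = Text::xSV->new(fh => $in, sep => "\t");
$reader->read_header();
$reader->add_compute(message => sub {
    my $csv = shift;
    my ($name, $age) = $csv->extract(qw(name age));
    return "$name is $age years old";
});

my @messages;
while (my $row = $reader->fetchrow_hash(qw(name message))) {
    push @messages, $row->{message};
}
print "$_\n" for @messages;
```

The point of print_data here is that the caller never has to remember the column order, and the point of the compute is that the loop body only asks for the fields it wants.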

TODO
       Add utility interfaces. (Suggested by Ken Clark.)

       Offer an option for working around the broken tab-delimited output that
       some versions of Excel present for cut-and-paste.

       Add tests for the output half of the module.

BUGS
       When I say single character separator, I mean it.

       Performance could be better. That is largely because the API was
       chosen for simplicity of a "proof of concept", rather than for
       performance. One idea to speed it up would be to provide an API
       where you bind the requested fields once and then fetch many times
       rather than binding the request for every row.

       Also note that should you ever play around with the special variables
       $`, $&, or $', you will find that it can get much, much slower. The
       cause of this problem is that Perl only calculates those variables if
       it has ever seen one of them. This module does many, many matches, and
       calculating those variables for each match is slow.

       I need to find out what conversions are done by Microsoft products that
       Perl won't do on the fly upon trying to use the values.

ACKNOWLEDGEMENTS
       My thanks to people who have given me feedback on how they would like
       to use this module, and particularly to Klaus Weidner for his patch
       fixing a nasty segmentation fault from a stack overflow in the regular
       expression engine on large fields.

       Rob Kinyon (dragonchild) motivated me to do the writing interface, and
       gave me useful feedback on what it should look like. I'm not sure that
       he likes the result, but it is how I understood what he said...

       Jess Robinson (castaway) convinced me that ARGV was a better default
       input handle than STDIN. I hope that switching that default doesn't
       inconvenience anyone.

       Gyepi SAM noticed that fetchrow_hash complained about missing data at
       the end of the loop and sent a patch. Applied.

       shotgunefx noticed that bind_header changed its return between
       versions. It is actually worse than that: it changes its return if you
       call it twice. Documented that its return should not be relied upon.

       Fred Steinberg found that writes did not happen promptly upon closing
       the object. This turned out to be a self-reference causing a DESTROY
       bug. I fixed it.

       Carey Drake and Steve Caldwell noticed that the default warning_handler
       expected different arguments than it got. Both suggested the same fix,
       which I implemented.

       Geoff Gariepy suggested adding dont_quote and quote_all. Then found a
       silly bug in my first implementation.

       Ryan Martin improved read performance over 75% with a small patch.

       Bauernhaus Panoramablick and Geoff Gariepy convinced me to add the
       ability to get non-strict mode.

AUTHOR AND COPYRIGHT
       Ben Tilly (btilly@gmail.com). Originally posted at
       http://www.perlmonks.org/node_id=65094.

       Copyright 2001-2009. This may be modified and distributed on the same
       terms as Perl.

perl v5.32.1                      2021-01-27                      Text::xSV(3)