Text::xSV(3pm)

Text::xSV(3)          User Contributed Perl Documentation         Text::xSV(3)

NAME

       Text::xSV - read character separated files

SYNOPSIS

         use Text::xSV;
         my $csv = Text::xSV->new;
         $csv->open_file("foo.csv");
         $csv->read_header();
         # Make the headers case insensitive
         foreach my $field ($csv->get_fields) {
           if (lc($field) ne $field) {
             $csv->alias($field, lc($field));
           }
         }

         $csv->add_compute("message", sub {
           my $csv = shift;
           my ($name, $age) = $csv->extract(qw(name age));
           return "$name is $age years old\n";
         });

         while ($csv->get_row()) {
           my ($name, $age) = $csv->extract(qw(name age));
           print "$name is $age years old\n";
           # Same as
           #   print $csv->extract("message");
         }

         # The file above could have been created with:
         my $out = Text::xSV->new(
           filename => "foo.csv",
           header   => ["Name", "Age", "Sex"],
         );
         $out->print_header();
         $out->print_row("Ben Tilly", 34, "M");
         # Same thing.
         $out->print_data(
           Age  => 34,
           Name => "Ben Tilly",
           Sex  => "M",
         );

DESCRIPTION

       This module is for reading and writing a common variation of character
       separated data.  The most common example is comma-separated.  However,
       that is far from the only possibility; the same basic format is
       exported by Microsoft products using tabs, colons, or other characters.

       The format is a series of rows separated by returns.  Within each row
       you have a series of fields separated by your character separator.
       Fields may either be unquoted, in which case they do not contain a
       double-quote, separator, or return, or they are quoted, in which case
       they may contain anything, and will encode double-quotes by pairing
       them.  In Microsoft products, quoted fields are strings and unquoted
       fields can be interpreted as being of various datatypes based on a set
       of heuristics.  By and large this fact is irrelevant in Perl because
       Perl is largely untyped.  The one exception that this module handles
       is that empty unquoted fields are treated as nulls, which are
       represented in Perl as undefined values.  If you want a zero-length
       string, quote it.

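       The null-versus-empty rule can be checked with a short sketch.  The
       sample data and field names (id, nickname, comment) are made up for
       illustration, and the input is read from an in-memory filehandle
       rather than a real file.

```perl
use strict;
use warnings;
use Text::xSV;

# Hypothetical sample data: in the data row the second field is empty
# and unquoted, while the third is an explicitly quoted empty string.
my $data = qq{id,nickname,comment\n1,,""\n};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $csv = Text::xSV->new(fh => $fh);
$csv->read_header();
$csv->get_row();
my ($id, $nickname, $comment) = $csv->extract(qw(id nickname comment));

# Per the description above, the unquoted empty field should come back
# as undef and the quoted one as a zero-length string.
for my $pair ([nickname => $nickname], [comment => $comment]) {
  my ($field, $value) = @$pair;
  printf "%-8s %s\n", $field,
    defined $value ? "defined, length " . length($value) : "undef";
}
```

       Code that needs to tell "no value" from "empty value" can therefore
       test with defined rather than comparing against the empty string.
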
       People usually naively solve this with split.  A next step up is to
       read a line and parse it.  Unfortunately this choice of interface
       (which is made by Text::CSV on CPAN) makes it difficult to handle
       returns embedded in a field.  (Earlier versions of this document
       claimed this was impossible.  That is false.  But the calling code has
       to supply the logic to add lines until you have a valid row.  To the
       extent that you don't do this consistently, your code will be buggy.)
       Therefore it is good for the parsing logic to have access to the whole
       file.

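       A minimal sketch of where the naive approaches go wrong, using only
       core Perl.  The sample strings are invented for illustration.

```perl
use strict;
use warnings;

# A quoted field containing the separator: split cuts it in two.
my $line  = qq{"Tilly, Ben",34,M};
my @naive = split /,/, $line;
print scalar(@naive), " pieces instead of 3 fields\n";

# A quoted field containing a return: reading line by line sees two
# physical lines where the data holds a single logical row.
my $file = qq{"line one\nline two",x\n};
open my $fh, '<', \$file or die "Cannot open in-memory file: $!";
my @physical = <$fh>;
close $fh;
print scalar(@physical), " physical lines for 1 logical row\n";
```

       In the second case a line-at-a-time parser has to notice the
       unterminated quote and ask its caller for more input, which is the
       bookkeeping described above.
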
       This module solves the problem by creating an xSV object with access to
       the filehandle; if in parsing it notices that a new line is needed, it
       can read at will.

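       As a rough illustration, the sketch below feeds the module a quoted
       field that spans two physical lines.  The data and field names are
       hypothetical, and the exact form in which the embedded return is
       preserved is not spelled out here, so the code only reports how many
       rows were seen.

```perl
use strict;
use warnings;
use Text::xSV;

# Hypothetical data: the first data row has a note spanning two lines.
my $data = qq{name,note\nBen,"line one\nline two"\nAnn,short\n};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $csv = Text::xSV->new(fh => $fh);
$csv->read_header();

my @notes;
while ($csv->get_row()) {
  # extract returns a list in list context; take the single field.
  my ($note) = $csv->extract("note");
  push @notes, $note;
}

# Three physical lines of data should have been read as two rows.
print scalar(@notes), " rows read\n";
```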

USAGE

       First you set up and initialize an object, then you read the xSV file
       through it.  The constructor can also perform several initializations
       at once.  Here are the available methods:

       "new"
           This is the constructor.  It takes a hash of optional arguments.
           They correspond to the following set_* methods without the set_
           prefix.  For instance if you pass filename=>... in, then
           set_filename will be called.

           "set_sep"
                   Sets the one character separator that divides fields.
                   Defaults to a comma.

           "set_filename"
                   The filename of the xSV file that you are reading.  Used
                   heavily in error reporting.  If fh is not set and filename
                   is, then fh will be set to the result of calling open on
                   filename.

           "set_fh"
                   Sets the fh that this Text::xSV object will read from or
                   write to.  If it is not set, it will be set to the result
                   of opening filename if that is set, otherwise it will
                   default to ARGV (i.e. acts like <>) or STDOUT, depending on
                   whether you first try to read or write.  The old default
                   used to be STDIN.

           "set_header"
                   Sets the internal header array of fields that is referred
                   to in arranging data on the *_data output methods.  If
                   "bind_fields" has not been called, also calls that on the
                   assumption that the fields that you want to output match
                   the fields that you will provide.

                   The return from this function is inconsistent and should
                   not be relied on to be anything useful.

           "set_headers"
                   An alias to "set_header".

           "set_error_handler"
                   The error handler is an anonymous function which is
                   expected to take an error message and do something useful
                   with it.  The default error handler is Carp::confess.
                   Error handlers that do not trip exceptions (e.g. with die)
                   are less tested and may not work perfectly in all
                   circumstances.

           "set_warning_handler"
                   The warning handler is an anonymous function which is
                   expected to take a warning and do something useful with it.
                   If no warning handler is supplied, the error handler is
                   wrapped with "eval" and the trapped error is warned.

           "set_filter"
                   The filter is an anonymous function which is expected to
                   accept a line of input, and return a filtered line of
                   output.  The default filter removes \r so that Windows
                   files can be read under Unix.  This could also be used to,
                   e.g., strip out Microsoft smart quotes.

           "set_quote_all"
                   The quote_all option simply puts every output field into
                   double quotation marks.  This can't be set if "dont_quote"
                   is.

           "set_dont_quote"
                   The dont_quote option turns off the otherwise mandatory
                   quotation marks that bracket the data fields when there are
                   separator characters, spaces, or non-printable characters
                   in the data field.  This is perhaps a bit antithetical to
                   the idea of safely enclosing data fields in quotation
                   marks, but some applications, for instance Microsoft SQL
                   Server's BULK INSERT, can't handle them.  This can't be set
                   if "quote_all" is.

           "set_row_size"
                   The number of elements that you expect to see in each row.
                   It defaults to the size of the first row read or set.  If
                   row_size_warning is true and the size of the row read or
                   formatted does not match, then a warning is issued.

           "set_row_size_warning"
                   Determines whether or not to issue warnings when the row
                   read or set has a number of fields different than the
                   expected number.  Defaults to true.  Whether or not this is
                   on, missing fields are always read as undef, and extra
                   fields are ignored.

           "set_close_fh"
                   Whether or not to close fh when the object is DESTROYed.
                   Defaults to false if fh was passed in, or true if the
                   object has to open its own fh.  (This may be removed in a
                   future version.)

           "set_strict"
                   In strict mode a single " within a quoted field is an
                   error.  In non-strict mode it is a warning.  The default is
                   strict.

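           A sketch of passing several of these options to the constructor
           at once.  The tab-separated sample data and the wording of the
           custom error message are assumptions made for illustration; the
           example relies on the documented behaviour that "format_header"
           raises an error when no header has been set.

```perl
use strict;
use warnings;
use Text::xSV;

# Hypothetical tab-separated input held in memory.
my $data = "name\tage\nBen\t34\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

# Each key corresponds to a set_* method: set_fh, set_sep and
# set_error_handler.
my $tsv = Text::xSV->new(
  fh            => $fh,
  sep           => "\t",
  # Raise a plain exception instead of the default Carp::confess.
  error_handler => sub { die "xSV problem: @_\n" },
);

$tsv->read_header();
my %row = $tsv->fetchrow_hash();
print "$row{name} is $row{age}\n";

# No header was set with set_header, so format_header should report an
# error through the custom handler.
my $ok    = eval { $tsv->format_header(); 1 };
my $error = $ok ? "" : $@;
print "caught: $error" if $error;
```

           Using a handler that dies keeps the module on its better-tested
           path while dropping the stack trace that Carp::confess adds.
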
       "open_file"
           Takes the name of a file, opens it, then sets the filename and fh.

       "bind_fields"
           Takes an array of fieldnames, memorizes the field positions for
           later use.  "read_header" is preferred.

       "read_header"
           Reads a row from the file as a header line and memorizes the
           positions of the fields for later use.  File formats that carry
           field information tend to be far more robust than ones which do
           not, so this is the preferred function.

       "read_headers"
           An alias for "read_header".  (If I'm going to keep on typing the
           plural, I'll just make it work...)

       "bind_header"
           Another alias for "read_header" maintained for backwards
           compatibility.  Deprecated because the name doesn't distinguish it
           well enough from the unrelated "set_header".

       "get_row"
           Reads a row from the file.  Returns an array or reference to an
           array depending on context.  Will also store the row in the row
           property for later access.

       "extract"
           Extracts a list of fields out of the last row read.  In list
           context returns the list, in scalar context returns an anonymous
           array.

       "extract_hash"
           Extracts fields into a hash.  If a list of fields is passed, that
           is the list of fields that go into the hash.  If no list, it
           extracts all fields that it knows about.  In list context returns
           the hash.  In scalar context returns a reference to the hash.

       "fetchrow_hash"
           Combines "get_row" and "extract_hash" to fetch the next row and
           return a hash or hashref depending on context.

       "alias"
           Makes an existing field available under a new name.

             $csv->alias($old_name, $new_name);

       "get_fields"
           Returns a list of all known fields in no particular order.

       "add_compute"
           Adds an arbitrary compute.  A compute is an arbitrary anonymous
           function.  When the computed field is extracted, Text::xSV will
           call the compute in scalar context with the Text::xSV object as the
           only argument.

           Text::xSV caches results in case computes call other computes.  It
           will also catch infinite recursion with a hopefully useful message.

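           The reading methods above can be combined roughly as follows.
           The sample data is invented, and the compute name "message"
           simply follows the SYNOPSIS.

```perl
use strict;
use warnings;
use Text::xSV;

# Hypothetical input with mixed-case headers.
my $data = qq{Name,Age\n"Ben Tilly",34\n};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $csv = Text::xSV->new(fh => $fh);
$csv->read_header();
$csv->alias("Name", "name");
$csv->alias("Age",  "age");

# The compute receives the Text::xSV object as its only argument, so it
# does not need to close over $csv.
$csv->add_compute("message", sub {
  my $self = shift;
  my ($name, $age) = $self->extract(qw(name age));
  return "$name is $age years old";
});

$csv->get_row();
my ($message) = $csv->extract("message");       # computed field
my %person    = $csv->extract_hash(qw(name age)); # aliased fields
print "$message\n";
```

           Working through the object passed to the compute, rather than
           capturing $csv in the closure, avoids making the object hold a
           reference to itself.
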
       "format_row"
           Takes a list of fields, and returns them quoted as necessary,
           joined with sep, with a newline at the end.

       "format_header"
           Returns the formatted header row based on what was submitted with
           "set_header".  Will cause an error if "set_header" was not called.

       "format_headers"
           Continuing the meme, an alias for format_header.

       "format_data"
           Takes a hash of data.  Sets internal data, and then formats the
           result of "extract"ing out the fields corresponding to the headers.
           Note that if you called "bind_fields" and then defined some more
           fields with "add_compute", computes would be done for you on the
           fly.

       "print"
           Prints the arguments directly to fh.  If fh is not supplied but
           filename is, first sets fh to the result of opening filename.
           Otherwise it defaults fh to STDOUT.  You probably don't want to use
           this directly.  Instead use one of the other print methods.

       "print_row"
           Does a "print" of "format_row".  Convenient when you wish to
           maintain your knowledge of the field order.

       "print_header"
           Does a "print" of "format_header".  Makes sense when you will be
           using print_data for your actual data because the field order is
           guaranteed to match up.

       "print_headers"
           An alias to "print_header".

       "print_data"
           Does a "print" of "format_data".  Relieves you from having to
           synchronize field order in your code.

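       A sketch of the output half, using the format_* methods so that the
       results can be inspected as strings before anything is printed.  The
       names and values are sample data, and the example assumes that an
       object built with only a header can format rows without a filehandle.

```perl
use strict;
use warnings;
use Text::xSV;

# No fh or filename is set; the format_* methods only return strings.
my $out = Text::xSV->new(header => ["Name", "Age", "Sex"]);

my $header = $out->format_header();

# Fields given by position, in header order.  The embedded double
# quotes are expected to be paired and the field quoted.
my $by_position = $out->format_row('Robert "Bob" Smith', 40, "M");

# Fields given by name, in any order.
my $by_name = $out->format_data(Sex => "M", Name => "Ben Tilly", Age => 34);

print $header, $by_position, $by_name;
```

       The print_* methods take the same arguments as their format_*
       counterparts and write the resulting line to fh instead of returning
       it.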

TODO

       Add utility interfaces.  (Suggested by Ken Clark.)

       Offer an option for working around the broken tab-delimited output that
       some versions of Excel present for cut-and-paste.

       Add tests for the output half of the module.

BUGS

       When I say single character separator, I mean it.

       Performance could be better.  That is largely because the API was
       chosen for simplicity of a "proof of concept", rather than for
       performance.  One idea to speed it up would be to provide an API
       where you bind the requested fields once and then fetch many times
       rather than binding the request for every row.

       Also note that should you ever play around with the special variables
       $`, $&, or $', you will find that it can get much, much slower.  The
       cause of this problem is that Perl only calculates those if it has ever
       seen one of those.  This module does many, many matches and calculating
       those is slow.

       I need to find out what conversions are done by Microsoft products that
       Perl won't do on the fly upon trying to use the values.

ACKNOWLEDGEMENTS

       My thanks to people who have given me feedback on how they would like
       to use this module, and particularly to Klaus Weidner for his patch
       fixing a nasty segmentation fault from a stack overflow in the regular
       expression engine on large fields.

       Rob Kinyon (dragonchild) motivated me to do the writing interface, and
       gave me useful feedback on what it should look like.  I'm not sure that
       he likes the result, but it is how I understood what he said...

       Jess Robinson (castaway) convinced me that ARGV was a better default
       input handle than STDIN.  I hope that switching that default doesn't
       inconvenience anyone.

       Gyepi SAM noticed that fetchrow_hash complained about missing data at
       the end of the loop and sent a patch.  Applied.

       shotgunefx noticed that bind_header changed its return between
       versions.  It is actually worse than that; it changes its return if you
       call it twice.  Documented that its return should not be relied upon.

       Fred Steinberg found that writes did not happen promptly upon closing
       the object.  This turned out to be a self-reference causing a DESTROY
       bug.  I fixed it.

       Carey Drake and Steve Caldwell noticed that the default warning_handler
       expected different arguments than it got.  Both suggested the same fix
       that I implemented.

       Geoff Gariepy suggested adding dont_quote and quote_all.  Then found a
       silly bug in my first implementation.

       Ryan Martin improved read performance over 75% with a small patch.

       Bauernhaus Panoramablick and Geoff Gariepy convinced me to add the
       ability to get non-strict mode.

AUTHOR AND COPYRIGHT

       Ben Tilly (btilly@gmail.com).  Originally posted at
       http://www.perlmonks.org/node_id=65094.

       Copyright 2001-2009.  This may be modified and distributed on the same
       terms as Perl.

perl v5.30.1                      2020-01-30                      Text::xSV(3)