AnyData(3)        User Contributed Perl Documentation        AnyData(3)

6 AnyData -- easy access to data in many formats
7
    $table = adTie( 'CSV','my_db.csv','o',            # create a table
                    {col_names=>'name,country,sex'}
                  );
    $table->{Sue} = {country=>'de',sex=>'f'};         # insert a row
    delete $table->{Tom};                             # delete a single row
    $str = $table->{Sue}->{country};                  # select a single value
    while ( my $row = each %$table ) {                # loop through table
        print $row->{name} if $row->{sex} eq 'f';
    }
    $rows = $table->{{age=>'> 25'}};                  # select multiple rows
    delete $table->{{country=>qr/us|mx|ca/}};         # delete multiple rows
    $table->{{country=>'Nz'}}={country=>'nz'};        # update multiple rows
    my $num = adRows( $table, age=>'< 25' );          # count matching rows
    my @names = adNames( $table );                    # get column names
    my @cars = adColumn( $table, 'cars' );            # group a column
    my @formats = adFormats();                        # list available parsers
    adExport( $table, $format, $file, $flags );       # save in specified format
    print adExport( $table, $format, $flags );        # print to screen in format
    print adDump($table);                             # dump table to screen
    undef $table;                                     # close the table

    adConvert( $format1, $file1, $format2, $file2 );  # convert between formats
    print adConvert( $format1, $file1, $format2 );    # convert to screen
32
The rather wacky idea behind this module and its sister module
DBD::AnyData is that any data, regardless of source or format, should be
accessible and modifiable with the same simple set of methods. This
module provides a multi-dimensional tied hash interface to data in a
dozen different formats. The DBD::AnyData module adds a DBI/SQL
interface for those same formats.
40
41 Both modules provide built-in protections including appropriate
42 flocking() for all I/O and (in most cases) record-at-a-time access to
43 files rather than slurping of entire files.
44
45 Currently supported formats include general format flatfiles (CSV,
46 Fixed Length, etc.), specific formats (passwd files, httpd logs, etc.),
47 and a variety of other kinds of formats (XML, Mp3, HTML tables). The
48 number of supported formats will continue to grow rapidly since there
49 is an open API making it easy for any author to create additional
50 format parsers which can be plugged in to AnyData itself and thereby be
51 accessible by either the tiedhash or DBI/SQL interface.
52
54 The AnyData.pm module itself is pure Perl and does not depend on
55 anything other than modules that come standard with Perl. Some formats
56 and some advanced features require additional modules: to use the
57 remote ftp/http features, you must have the LWP bundle installed; to
58 use the XML format, you must have XML::Parser and XML::Twig installed;
59 to use the HTMLtable format for reading, you must have HTML::Parser and
60 HTML::TableExtract installed but you can use the HTMLtable for writing
61 with just the standard CGI module. To use DBI/SQL commands, you must
62 have DBI, DBD::AnyData, SQL::Statement and DBD::File installed.
63
65 The AnyData module imports eight methods (functions):
66
67 adTie() -- create a new table or open an existing table
68 adExport() -- save an existing table in a specified format
69 adConvert() -- convert data in one format into another format
70 adFormats() -- list available formats
71 adNames() -- get the column names of a table
72 adRows() -- get the number of rows in a table or query
73 adDump() -- display the data formatted as an array of rows
74 adColumn() -- group values in a single column
75
The adTie() command returns a special tied hash. The tied hash can
then be used to access and/or modify data. See below for details.
78
79 With the exception of the XML, HTMLtable, and ARRAY formats, the
80 adTie() command saves all modifications of the data directly to file
81 as they are made. With XML and HTMLtable, you must make your
82 modifications in memory and then explicitly save them to file with
83 adExport().
84
85 adTie()
86 my $table = adTie( $format, $data, $open_mode, $flags );
87
88 The adTie() command creates a reference to a multi-dimensional tied
89 hash. In its simplest form, it simply reads a file in a specified
90 format into the tied hash:
91
92 my $table = adTie( $format, $file );
93
$format is the name of any supported format: 'CSV', 'Fixed', 'Passwd', etc.
$file is a relative or absolute path to a local file.

e.g. my $table = adTie( 'CSV', '/usr/me/myfile.csv' );

This creates a tied hash called $table by reading data in the
CSV (comma separated values) format from the file 'myfile.csv'.
101
102 The hash reference resulting from adTie() can be accessed and modified
103 as follows:
104
105 use AnyData;
106 my $table = adTie( $format, $file );
107 $table->{$key}->{$column} # select a value
108 $table->{$key} = {$col1=>$val1,$col2=>$val2...} # update a row
109 delete $table->{$key} # delete a row
110 while(my $row = each %$table) { # loop through rows
111 print $row->{$col1} if $row->{$col2} ne 'baz';
112 }
113
114 The thing returned by adTie ($table in the example) is not an object,
115 it is a reference to a tied hash. This means that hash operations such
116 as exists, values, keys, may be used, keeping in mind that this is a
117 *reference* to a tied hash so the syntax would be
118
119 for( keys %$table ) {...}
120 for( values %$table ) {...}
121
Also keep in mind that if the table is really large, you probably do
not want to use keys and values because they create arrays in memory
containing data from every row in the table. Instead use 'each' as
shown above since that cycles through the file one record at a time and
never puts the entire table into memory.
127
128 It is also possible to use more advanced searching on the hash, see
129 "Multiple Row Operations" below.
130
131 In addition to the simple adTie($format,$file), there are other ways to
132 specify additional information in the adTie() command. The full syntax
133 is:
134
135 my $table = adTie( $format, $data, $open_mode, $flags );
136
The $data parameter allows you to read data from remote files accessible by
http or ftp, see "Using Remote Files" below. It also allows you to treat
strings and arrays as data sources without needing a file at all, see
"Using Strings and Arrays" below.
141
The optional $open_mode parameter defaults to 'r' if none is supplied;
otherwise it must be one of
144
145 'r' read # read only access
146 'u' update # read/write access
147 'c' create # create a new file unless it already exists
148 'o' overwrite # create a new file, overwriting any that already exist
149
150 The $flags parameter allows you to specify additional information such
151 as column names. See the sections in "Further Details" below.
152
159 adConvert()
160 adConvert( $format1, $data1, $format2, $file2, $flags1, $flags2 );
161
162 or
163
164 print adConvert( $format1, $data1, $format2, undef, $flags1, $flags2 );
165
166 or
167
168 my $aryref = adConvert( $format1, $data1, 'ARRAY', undef, $flags1 );
169
170 This method converts data in any supported format into any other supported
171 format. The resulting data may either be saved to a file (if $file2 is
172 supplied as a parameter) or sent back as a string to e.g. print the data
173 to the screen in the new format (if no $file2 is supplied), or sent back
174 as an array reference if $format2 is 'ARRAY'.
175
176 Some examples:
177
178 # convert a CSV file into an XML file
179 #
180 adConvert('CSV','foo.csv','XML','foo.xml');
181
182 # convert a CSV file into an HTML table and print it to the screen
183 #
184 print adConvert('CSV','foo.csv','HTMLtable');
185
186 # convert an XML string into a CSV file
187 #
188 adConvert('XML', ["<x><motto id='perl'>TIMTOWTDI</motto></x>"],
189 'CSV','foo.csv'
190 );
191
192 # convert an array reference into an XML file
193 #
194 adConvert('ARRAY', [['id','motto'],['perl','TIMTOWTDI']],
195 'XML','foo.xml'
196 );
197
198 # convert an XML file into an array reference
199 #
200 my $aryref = adConvert('XML','foo.xml','ARRAY');
201
See the section "Using Strings and Arrays" below for details.
203
204 adExport()
205 adExport( $table, $format, $file, $flags );
206
207 or
208
209 print adExport( $table, $format );
210
211 or
212
213 my $aryref = adExport( $table, 'ARRAY' );
214
215 This method converts an existing tied hash into another format and/or
216 saves the tied hash as a file in the specified format.
217
218 Some examples:
219
All of these examples assume a previous call to my $table = adTie(...);

    # export table to an XML file
    #
    adExport($table,'XML','foo.xml');
225
226 # export table to an HTML string and print it to the screen
227 #
228 print adExport($table,'HTMLtable');
229
230 # export the table to an array reference
231 #
232 my $aryref = adExport($table,'ARRAY');
233
See the section "Using Strings and Arrays" below for details.
235
236 adNames()
237 my $table = adTie(...);
238 my @column_names = adNames($table);
239
240 This method returns an array of the column names for the specified
241 table.
242
243 adRows()
244 my $table = adTie(...);
245 adRows( $table, %search_hash );
246
247 This method takes an AnyData tied hash created with adTie() and counts
248 the rows in the table that match the search hash.
249
For example, this snippet returns a count of the rows in the file that
contain the specified page in the request column:
252
253 my $hits = adTie( 'Weblog', 'access.log');
254 print adRows( $hits , request => 'mypage.html' );
255
The search hash may contain multiple search criteria; see "Multiple Row
Operations" below.

If the search hash is omitted, adRows() returns a count of all rows.
260
261 adColumn()
262 my @col_vals = adColumn( $table, $column_name, $distinct_flag );
263
264 This method returns an array of values taken from the specified column.
265 If there is a distinct_flag parameter, duplicates will be eliminated
266 from the list.
267
268 For example, this snippet returns a unique list of the values in the
269 'player' column of the table.
270
271 my $game = adTie( 'Pipe','games.db' );
272 my @players = adColumn( $game, 'player', 1 );
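
The effect of a true $distinct_flag can be pictured with a common Perl
idiom. The sketch below is an illustration only, not AnyData's code: it
works on a made-up in-memory list and assumes that the first occurrence
of each value is kept in its original order, which the text above does
not promise.

```perl
use strict;
use warnings;

# Made-up values standing in for the 'player' column.
my @players = qw(ann bob ann cy bob);

# Keep a value only the first time it is seen.
my %seen;
my @distinct = grep { !$seen{$_}++ } @players;

print "@distinct\n";    # prints "ann bob cy"
```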
273
274 adDump()
275 my $table = adTie(...);
276 print adDump($table);
277
This method prints the raw data in the table. Column names are printed
inside angle brackets and separated by colons on the first line, then
each row is printed as a list of values inside square brackets.
281
282 adFormats()
    print "$_\n" for adFormats();
284
This method shows the available format parsers, e.g. 'CSV', 'XML', etc.
It looks in your @INC for the .../AnyData/Format directory and prints
the names of format parsing files there. If the parser requires
further modules (e.g. XML requires XML::Parser) and you do not have the
additional modules installed, the format will not work even if listed by
this command. Otherwise, all formats should work as described in this
documentation.
292
FURTHER DETAILS
294 Column Names
295 Column names may be assigned in three ways:
296
297 * pre -- The format parser pre-assigns column
298 names (e.g. Passwd files automatically have
299 columns named 'username', 'homedir', 'GID', etc.).
300
301 * user -- The user specifies the column names as a comma
302 separated string associated with the key 'cols':
303
304 my $table = adTie( $format,
305 $file,
306 $mode,
307 {cols=>'name,age,gender'}
308 );
309
* auto -- If there is no pre-assigned list of column names
and none defined by the user, the first line of
the file is treated as a list of column names;
the line is parsed according to the specific
format (e.g. CSV column names are a comma-separated
list, Tab column names are a tab-separated list).
316
317 When creating a new file in a format that does not pre-assign column
318 names, the user *must* manually assign them as shown above.
319
320 Some formats have special rules for assigning column names
321 (XML,Fixed,HTMLtable), see the sections below on those formats.
322
323 Key Columns
324 The AnyData modules support tables that have a single key column that
325 uniquely identifies each row as well as tables that do not have such
326 keys. For tables where there is a unique key, that key may be assigned
327 in three ways:
328
329 * pre -- The format parser automatically pre-assigns the
330 key column name e.g. Passwd files automatically
331 have 'username' as the key column.
332
333 * user -- The user specifies the key column name:
334
335 my $table = adTie( $format,
336 $file,
337 $mode,
338 {key=>'country'}
339 );
340
* auto -- If there is no pre-assigned key column and the user
does not define one, the first column becomes the
default key column.
344
345 Format Specific Details
346 For full details, see the documentation for AnyData::Format::Foo
347 where Foo is any of the formats listed in the adFormats() command
348 e.g. 'CSV', 'XML', etc.
349
350 Included below are only some of the more important details of the
351 specific parsers.
352
353 Fixed Format
354 When using the Fixed format for fixed length records you must
355 always specify a pattern indicating the lengths of the fields.
356 This should be a string as would be passed to the unpack() function
357 to unpack the records in your Fixed length definition:
358
359 my $t = adTie( 'Fixed', $file, 'r', {pattern=>'A3 A7 A9'} );
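
To see what a pattern such as 'A3 A7 A9' does to one record, here is a
small sketch that uses Perl's built-in unpack() directly. The record
contents and the variable names are invented for illustration; only the
pattern comes from the example above.

```perl
use strict;
use warnings;

# Invented 19-character record: 3 + 7 + 9 characters, space padded.
my $record = 'Joe' . 'Boise  ' . 'Analyst  ';

# 'A' fields are fixed width and have trailing spaces stripped.
my ( $name, $city, $job ) = unpack( 'A3 A7 A9', $record );

print join( '|', $name, $city, $job ), "\n";    # prints "Joe|Boise|Analyst"
```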
360
361 If you want the column names to appear on the first line of a Fixed
362 file, they should be in comma-separated format, not in Fixed
363 format. This is different from other formats which use their own
364 format to display the column names on the first line. This is
365 necessary because the name of the column might be longer than the
366 length of the column.
367
368 XML Format
369 The XML format does not allow you to specify column names as a flag,
370 rather you specify a "record_tag" and the column names are determined
371 from the contents of the tag. If no record_tag is specified, the
372 record tag will be assumed to be the first child of the root of the
373 XML tree. That child and its structure will be determined from the
374 DTD if there is one, or from the first occurring record if there is
375 no DTD.
376
377 For simple XML, no flags are necessary:
378
379 <table>
380 <row row_id="1"><name>Joe</name><location>Seattle</location></row>
381 <row row_id="2"><name>Sue</name><location>Portland</location></row>
382 </table>
383
384 The record_tag will default to the first child, namely "row". The
385 column names will be generated from the attributes of the record
386 tag and all of the tags included under the record tag, so the
387 column names in this example will be "row_id","name","location".
388
389 If the record_tag is not the first child, you will need to specify
390 it. For example:
391
392 <db>
393 <table table_id="1">
394 <row row_id="1"><name>Joe</name><location>Seattle</location></row>
395 <row row_id="2"><name>Sue</name><location>Portland</location></row>
396 </table>
397 <table table_id="2">
398 <row row_id="1"><name>Bob</name><location>Boise</location></row>
399 <row row_id="2"><name>Bev</name><location>Billings</location></row>
400 </table>
401 </db>
402
403 In this case you will need to specify "row" as the record_tag since
404 it is not the first child of the tree. The column names will be
405 generated from the attributes of row's parent (if the parent is not
406 the root), from row's attributes and sub tags, i.e.
407 "table_id","row_id","name","location".
408
409 When exporting XML, you can specify a DTD to control the output.
410 For example, if you import a table from CSV or from an Array, you
411 can output as XML and specify which of the columns become tags and
412 which become attributes and also specify the nesting of the tags in
413 your DTD.
414
The XML format parser is built on top of Michel Rodriguez's
excellent XML::Twig, which is itself based on XML::Parser.
Parameters to either of those modules may be passed in the flags
for adTie() and the other commands, including the "prettyPrint" flag
to specify how the output XML is displayed and things like
ProtocolEncoding. ProtocolEncoding defaults to 'ISO-8859-1'; all
other flags keep the defaults of XML::Twig and XML::Parser. See
the documentation of those modules for details.
423
424 CAUTION: Unlike other formats, the XML format does not save changes to
425 the file as they are entered, but only saves the changes when you explicitly
426 request them to be saved with the adExport() command.
427
428 HTMLtable Format
This format is based on Matt Sisk's excellent HTML::TableExtract.
430
431 It can be used to read an existing table from an html page, or to
432 create a new HTML table from any data source.
433
434 You may control which table in an HTML page is used with the column_names,
435 depth and count flags.
436
437 If a column_names flag is passed, the first table that contains those names
438 as the cells in a row will be selected.
439
If depth and/or count parameters are passed, it will look for tables as
specified in the HTML::TableExtract documentation.
442
443 If none of column_names, depth, or count flags are passed, the first table
444 encountered in the file will be the table selected and its first row will
445 be used to determine the column names for the table.
446
447 When exporting to an HTMLtable, you may pass flags to specify properties
448 of the whole table (table_flags), the top row containing the column names
449 (top_row_flags), and the data rows (data_row_flags). These flags follow
450 the syntax of CGI.pm table constructors, e.g.:
451
    print adExport( $table, 'HTMLtable', {
        table_flags    => {Border=>3,bgColor=>'blue'},
        top_row_flags  => {bgColor=>'red'},
        data_row_flags => {valign=>'top'},
    });
457
458 The table_flags will default to {Border=>1,bgColor=>'white'} if none
459 are specified.
460
The top_row_flags will default to {bgColor=>'#c0c0c0'} if none are
specified.
463
464 The data_row_flags will be empty if none are specified.
465
466 In other words, if no flags are specified the table will print out with
467 a border of 1, the column headings in gray, and the data rows in white.
468
469 CAUTION: This module will *not* preserve anything in the html file except
470 the selected table so if your file contains more than the selected table,
471 you will want to use adTie() to read the table and then adExport() to write
472 the table to a different file. When using the HTMLtable format, this is the
473 only way to preserve changes to the data, the adTie() command will *not*
474 write to a file.
475
476 Multiple Row Operations
477 The AnyData hash returned by adTie() may use either single values as
478 keys, or a reference to a hash of comparisons as a key. If the key to
479 the hash is a single value, the hash operates on a single row but if
480 the key to the hash is itself a hash reference, the hash operates on a
481 group of rows.
482
483 my $num_deleted = delete $table->{Sue};
484
485 This example deletes a single row where the key column has the value
486 'Sue'. If multiple rows have the value 'Sue' in that column, only the
487 first is deleted. It uses a simple string as a key, therefore it
488 operates on only a single row.
489
490 my $num_deleted = delete $table->{ {name=>'Sue'} };
491
492 This example deletes all rows where the column 'name' is equal to
493 'Sue'. It uses a hashref as a key and therefore operates on multiple
494 rows.
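
The distinction between a plain key and a hashref key is possible
because a tied hash receives the key it is given, reference included,
in its FETCH method. The following sketch shows the general mechanism
with a tiny in-memory class. TinyTable, its 'name' key column, and the
sample rows are all hypothetical and are not AnyData's implementation;
only exact-match comparisons are handled.

```perl
use strict;
use warnings;

package TinyTable;    # hypothetical demo class, not part of AnyData

sub TIEHASH {
    my ( $class, @rows ) = @_;
    return bless { rows => [@rows] }, $class;
}

# True if every column named in %$search equals the row's value.
sub _matches {
    my ( $row, $search ) = @_;
    for my $col ( keys %$search ) {
        return 0 if $row->{$col} ne $search->{$col};
    }
    return 1;
}

sub FETCH {
    my ( $self, $key ) = @_;
    if ( ref $key eq 'HASH' ) {
        # hashref key: a reference to an array of all matching rows
        return [ grep { _matches( $_, $key ) } @{ $self->{rows} } ];
    }
    # plain key: the first row whose 'name' column equals the key
    my ($row) = grep { $_->{name} eq $key } @{ $self->{rows} };
    return $row;
}

package main;

tie my %data, 'TinyTable',
    { name => 'Sue', gender => 'f' },
    { name => 'Tom', gender => 'm' },
    { name => 'Bob', gender => 'm' };
my $table = \%data;

my $sue = $table->{Sue};                    # single row
my $men = $table->{ { gender => 'm' } };    # group of rows

print $sue->{gender}, "\n";    # prints "f"
print scalar(@$men), "\n";     # prints "2"
```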
495
The hashref used in this example is a single column comparison but the
hashref could also include multiple column comparisons. This deletes
all rows where the values listed for the country, gender, and age
columns are equal to those specified:

    my $num_deleted = delete $table->{{ country => 'us',
                                        gender  => 'm',
                                        age     => '25'
                                     }};
505
506 In addition to simple strings, the values may be specified as regular
507 expressions or as numeric or alphabetic comparisons. This will delete
508 all North American males under the age of 25:
509
    my $num_deleted = delete $table->{{ country => qr/mx|us|ca/,
                                        gender  => 'm',
                                        age     => '< 25'
                                     }};
514
515 If numeric or alphabetic comparisons are used, they should be a string
516 with the comparison operator separated from the value by a space, e.g.
517 '> 4' or 'lt b'.
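
As a rough picture of how such search values might be interpreted,
here is a self-contained sketch that tests in-memory rows against the
three kinds of value described above: plain strings, regular
expressions, and "operator value" strings. The function names, the
operator table, and the sample rows are assumptions made for
illustration; this is not AnyData's own matching code.

```perl
use strict;
use warnings;

# Hypothetical operator table; only '<', '>' and 'lt' appear in the
# text above, the others are assumed by analogy.
my %ops = (
    '<'  => sub { $_[0] <  $_[1] },
    '>'  => sub { $_[0] >  $_[1] },
    '<=' => sub { $_[0] <= $_[1] },
    '>=' => sub { $_[0] >= $_[1] },
    'lt' => sub { $_[0] lt $_[1] },
    'gt' => sub { $_[0] gt $_[1] },
);

sub value_matches {
    my ( $value, $test ) = @_;
    return $value =~ $test ? 1 : 0 if ref $test eq 'Regexp';

    # "operator value": operator and value separated by one space
    my ( $op, $operand ) = $test =~ /^(\S+) (.+)$/;
    return $ops{$op}->( $value, $operand ) ? 1 : 0
        if defined $op && $ops{$op};

    return $value eq $test ? 1 : 0;    # plain string: exact match
}

# A row matches only if every column in the search hash matches.
sub row_matches {
    my ( $row, $search ) = @_;
    for my $col ( keys %$search ) {
        return 0 unless value_matches( $row->{$col}, $search->{$col} );
    }
    return 1;
}

my @rows = (
    { name => 'Al', country => 'us', gender => 'm', age => 19 },
    { name => 'Bo', country => 'de', gender => 'm', age => 22 },
    { name => 'Cy', country => 'mx', gender => 'm', age => 31 },
);
my $search = { country => qr/mx|us|ca/, gender => 'm', age => '< 25' };

my @hits = map { $_->{name} } grep { row_matches( $_, $search ) } @rows;
print "@hits\n";    # prints "Al"
```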
518
519 This kind of search hashref can be used not only to delete multiple
520 rows, but also to update rows. In fact you *must* use a hashref key in
521 order to update your table. Updating is the only operation that can
522 not be done with a single string key.
523
524 The search hashref can be used with a select statement, in which case
525 it returns a reference to an array of rows matching the criteria:
526
527 my $male_players = $table->{{gender=>'m'}};
528 for my $player( @$male_players ) { print $player->{name},"\n" }
529
530 This should be used with caution with a large table since it gathers
531 all of the selected rows into an array in memory. Again, 'each' is a
532 much better way for large tables. This accomplishes the same thing as
533 the example above, but without ever pulling more than a row into memory
534 at a time:
535
    while( my $row = each %$table ) {
        print $row->{name}, "\n" if $row->{gender} eq 'm';
    }
539
540 Search criteria for multiple rows can also be used with the adRows()
541 function:
542
543 my $num_of_women = adRows( $table, gender => 'w' );
544
That does *not* pull the entire table into memory; it counts the rows a
record at a time.
547
548 Using Remote Files
549 If the first file parameter of adTie() or adConvert() begins with
550 "http://" or "ftp://", the file is treated as a remote URL and the LWP
551 module is called behind the scenes to fetch the file. If the files are
552 in an area that requires authentication, that may be supplied in the
553 $flags parameter.
554
555 For example:
556
557 # read a remote file and access it via a tied hash
558 #
559 my $table = adTie( 'XML', 'http://www.foo.edu/bar.xml' );
560
561 # same with username/password
562 #
    my $table = adTie( 'XML', 'ftp://www.foo.edu/pub/bar.xml', 'r',
                       { user => 'me', pass => 'x7dy4' }
                     );
566
567 # read a remote file, convert it to an HTML table, and print it
568 #
569 print adConvert( 'XML', 'ftp://www.foo.edu/pub/bar.xml', 'HTMLtable' );
570
571 Using Strings and Arrays
Strings and arrays may be used as either the source of data input or as
the target of data output. Strings should be passed as the only
element of an array reference (in other words, inside square brackets).
Arrays should be a reference to an array whose first element is a
reference to an array of column names and whose succeeding elements are
references to arrays of row values.
578
579 For example:
580
581 my $table = adTie( 'XML', ["<x><motto id='perl'>TIMTOWTDI</motto></x>"] );
582
583 This uses the XML format to parse the supplied string and returns a tied
584 hash to the resulting table.
585
586
587 my $table = adTie( 'ARRAY', [['id','motto'],['perl','TIMTOWTDI']] );
588
589 This uses the column names "id" and "motto" and the supplied row values
590 and returns a tied hash to the resulting table.
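
The array layout can be made concrete with plain Perl. This sketch
turns such an array reference into rows keyed by their first column,
which is the default key column described under "Key Columns". It uses
no AnyData calls; the %by_key hash and the second data row are invented
for illustration.

```perl
use strict;
use warnings;

# First element: column names. Remaining elements: row values.
my $data = [
    [ 'id',   'motto' ],
    [ 'perl', 'TIMTOWTDI' ],
    [ 'unix', 'Do one thing well' ],    # invented extra row
];

my ( $names, @rows ) = @$data;

my %by_key;
for my $row (@rows) {
    my %record;
    @record{@$names} = @$row;           # pair each name with its value
    $by_key{ $row->[0] } = \%record;    # first column acts as the key
}

print $by_key{perl}{motto}, "\n";    # prints "TIMTOWTDI"
```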
591
592 It is also possible to use an empty array to create a new empty tied
593 hash in any format, for example:
594
595 my $table = adTie('XML',[],'c');
596
This creates a new, empty tied hash.
598
599 See adConvert() and adExport() for further examples of using strings
600 and arrays.
601
602 Ties, Flocks, I/O, and Atomicity
603 AnyData provides flocking which works under the limitations of flock --
604 that it only works if other processes accessing the files are also
605 using flock and only on platforms that support flock. See the flock()
606 man page for details.
607
608 Here is what the user supplied open modes actually do:
609
610 r = read only (LOCK_SH) O_RDONLY
611 u = update (LOCK_EX) O_RDWR
612 c = create (LOCK_EX) O_CREAT | O_RDWR | O_EXCL
613 o = overwrite (LOCK_EX) O_CREAT | O_RDWR | O_TRUNC
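
The same flag and lock combinations can be tried out with core Perl.
The sketch below copies the table above into a lookup hash and opens a
scratch file in 'c' mode twice; the second attempt is refused because
O_EXCL will not create a file that already exists. The open_locked()
helper and the scratch file name are hypothetical, and this is not the
code AnyData itself runs.

```perl
use strict;
use warnings;
use Fcntl qw(:DEFAULT :flock);
use File::Temp qw(tempdir);

# Open flags and lock type for each mode, copied from the table above.
my %open_modes = (
    r => [ O_RDONLY,                   LOCK_SH ],
    u => [ O_RDWR,                     LOCK_EX ],
    c => [ O_CREAT | O_RDWR | O_EXCL,  LOCK_EX ],
    o => [ O_CREAT | O_RDWR | O_TRUNC, LOCK_EX ],
);

# Hypothetical helper: open a file in the given mode and lock it.
sub open_locked {
    my ( $file, $mode ) = @_;
    my $spec = $open_modes{$mode} or die "unknown mode '$mode'";
    my ( $flags, $lock ) = @$spec;
    sysopen( my $fh, $file, $flags ) or return;
    flock( $fh, $lock ) or die "cannot lock $file: $!";
    return $fh;
}

my $file = tempdir( CLEANUP => 1 ) . '/demo.db';

my $first = open_locked( $file, 'c' );     # file did not exist: created
close $first;                              # closing releases the lock
my $second = open_locked( $file, 'c' );    # file exists: open is refused

print defined $second ? "created again\n" : "create refused\n";
```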
614
615 When you use something like "my $table = adTie(...)", it opens the file
616 with a lock and leaves the file and lock open until 1) the hash
617 variable ($table) goes out of scope or 2) the hash is undefined (e.g.
618 "undef $table") or 3) the hash is re-assigned to another tie. In all
619 cases the file is closed and the lock released.
620
621 If adTie is called without creating a tied hash variable, the file is
622 closed and the lock released immediately after the call to adTie.
623
For example:

    print adTie('XML','foo.xml')->{main_office}->{phone};
625
626 That obtains a shared lock, opens the file, retrieves the one value
627 requested, closes the file and releases the lock.
628
629 These two examples accomplish the same thing but the first example
630 opens the file once, does all of the deletions, keeping the exclusive
631 lock in place until they are all done, then closes the file. The
632 second example opens and closes the file three times, once for each
633 deletion and releases the exclusive lock between each deletion:
634
635 1. my $t = adTie('Pipe','games.db','u');
636 delete $t->{"user$_"} for (0..3);
637 undef $t; # closes file and releases lock
638
639 2. delete adTie('Pipe','games.db','u')->{"user$_"} for (0..3);
640 # no undef needed since no hash variable created
641
642 Deletions and Packing
643 In order to save time and to prevent having to do writes anywhere
644 except at the end of the file, deletions and updates are *not* done at
645 the time of issuing a delete command. Rather when the user does a
646 delete, the position of the deleted record is stored in a hash and when
647 the file is saved to disk, the deletions are only then physically
648 removed by packing the entire database. Updates are done by inserting
649 the new record at the end of the file and marking the old record for
650 deletion. In the normal course of events, all of this should be
651 transparent and you'll never need to worry about it. However, if your
652 server goes down after you've made updates or deletions but before
653 you've saved the file, then the deleted rows will remain in the
database and for updates there will be duplicate rows: the old,
un-updated row and the new, updated row. If you are worried about this
kind of event, then use atomic deletes and updates as shown in the
section above. There's still a very small possibility of a crash in
between the deletion and the save, but in this case it should impact at
659 most a single row. (BIG thanks to Matthew Wickline for suggestions on
660 handling deletes)
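
The mark-then-pack idea can be sketched in memory. Here an array stands
in for the records of a file, a hash records the positions marked for
deletion, an update appends the new record and marks the old one, and
packing happens only at "save" time. All names and data are invented;
this models the description above rather than AnyData's actual code.

```perl
use strict;
use warnings;

my @records = ( 'Sue,de', 'Tom,us', 'Ann,nz' );    # stands in for file lines
my %deleted;                                       # positions marked for deletion

sub mark_deleted {
    my ($pos) = @_;
    $deleted{$pos} = 1;
}

# An update appends the new record and marks the old one for deletion.
sub update_record {
    my ( $pos, $new ) = @_;
    push @records, $new;
    $deleted{$pos} = 1;
}

# Packing physically removes every marked record in one pass.
sub pack_records {
    my @kept = map { $records[$_] } grep { !$deleted{$_} } 0 .. $#records;
    @records = @kept;
    %deleted = ();
}

mark_deleted(1);                   # delete Tom
update_record( 0, 'Sue,fr' );      # update Sue

my $before = scalar @records;      # old and new Sue rows both present
pack_records();

print "$before records before packing\n";    # prints "4 records before packing"
print join( '|', @records ), "\n";           # prints "Ann,nz|Sue,fr"
```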
661
663 See the README file and the test.pl included with the module for
664 further examples.
665
666 See the AnyData/Format/*.pm PODs for further details of specific
667 formats.
668
669 For further support, please use comp.lang.perl.modules
670
672 Special thanks to Andy Duncan, Tom Lowery, Randal Schwartz, Michel
673 Rodriguez, Jochen Wiedmann, Tim Bunce, Aligator Descartes, Mathew
674 Persico, Chris Nandor, Malcom Cook and to many others on the DBI
675 mailing lists and the clp* newsgroups.
676
678 Jeff Zucker <jeff@vpservices.com>
679
680 This module is copyright (c), 2000 by Jeff Zucker.
681 It may be freely distributed under the same terms as Perl itself.

perl v5.12.0                 2004-04-19                 AnyData(3)