Fsdb(3pm)

Fsdb(3)               User Contributed Perl Documentation              Fsdb(3)

NAME

6       Fsdb - a flat-text database for shell scripting
7

SYNOPSIS

       Fsdb, the flatfile streaming database, is a package of commands for
       manipulating flat-ASCII databases from shell scripts.  Fsdb is useful
       to process medium amounts of data (with very little data you'd do it by
       hand, with megabytes you might want a real database).  Fsdb was known
       as Jdb from 1991 to Oct. 2008.
14
15       Fsdb is very good at doing things like:
16
17       ·   extracting measurements from experimental output
18
19       ·   examining data to address different hypotheses
20
21       ·   joining data from different experiments
22
23       ·   eliminating/detecting outliers
24
25       ·   computing statistics on data (mean, confidence intervals,
26           correlations, histograms)
27
28       ·   reformatting data for graphing programs
29
30       Fsdb is built around the idea of a flat text file as a database.  Fsdb
31       files (by convention, with the extension .fsdb), have a header
32       documenting the schema (what the columns mean), and then each line
33       represents a database record (or row).
34
35       For example:
36
37               #fsdb experiment duration
38               ufs_mab_sys 37.2
39               ufs_mab_sys 37.3
40               ufs_rcp_real 264.5
41               ufs_rcp_real 277.9
42
       This is a simple file with four experiments (the rows), each with a
       description and a run time in the first and second columns.
46
       Rather than hand-code scripts to do each special case, Fsdb provides
       higher-level functions.  Although it's often easy to throw together a
       custom script to do any single task, I believe that there are several
       advantages to using Fsdb:
51
52       ·   these programs provide a higher level interface than plain Perl, so
53
54           **  Fewer lines of simpler code:
55
56                   dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
57
58               Picks out just one type of experiment and computes statistics
59               on it, rather than:
60
                   while (<>) { @F = split; $sum+=$F[1]; $ss+=$F[1]**2; $n++; }
62                   $mean = $sum / $n; $std_dev = ...
63
64               in dozens of places.
65
66       ·   the library uses names for columns, so
67
68           **  No more $F[1], use "_duration".
69
70           **  New or different order columns?  No changes to your scripts!
71
72           Thus if your experiment gets more complicated with a size
73           parameter, so your log changes to:
74
75                   #fsdb experiment size duration
76                   ufs_mab_sys 1024 37.2
77                   ufs_mab_sys 1024 37.3
78                   ufs_rcp_real 1024 264.5
79                   ufs_rcp_real 1024 277.9
80                   ufs_mab_sys 2048 45.3
81                   ufs_mab_sys 2048 44.2
82
83           Then the previous scripts still work, even though duration is now
84           the third column, not the second.
85
86       ·   A series of actions are self-documenting (each program records what
87           it does).
88
89           **  No more wondering what hacks were used to compute the final
90               data, just look at the comments at the end of the output.
91
92           For example, the commands
93
94               dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
95
96           add to the end of the output the lines
97               #    | dbrow _experiment eq "ufs_mab_sys"
98               #    | dbcolstats duration
99
100       ·   The library is mature, supporting large datasets (more than 100GB),
101           corner cases, error handling, backed by an automated test suite.
102
103           **  No more puzzling about bad output because your custom script
104               skimped on error checking.
105
106           **  No more memory thrashing when you try to sort ten million
107               records.
108
109       ·   Fsdb-2.x supports Perl scripting (in addition to shell scripting),
110           with libraries to do Fsdb input and output, and easy support for
111           pipelines.  The shell script
112
113               dbcol name test1 | dbroweval '_test1 += 5;'
114
115           can be written in perl as:
116
117               dbpipeline(dbcol(qw(name test1)), dbroweval('_test1 += 5;'));
118
119       (The disadvantage is that you need to learn what functions Fsdb
120       provides.)
121
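       The column-naming advantage in the list above can be illustrated
       without Fsdb itself.  The following is only a sketch in plain sh and
       awk, not how Fsdb is implemented; the function name mean_by_name is
       hypothetical.  It looks up a column by the name given on the "#fsdb"
       header line, so the same call works on both the two-column and the
       three-column layouts shown earlier.

```shell
# Hypothetical helper, not part of Fsdb: mean of a named column over the
# rows whose first column equals a given key.
mean_by_name() {
    # $1 = column name, $2 = value to match in column 1; table on stdin
    awk -v want="$1" -v key="$2" '
        NR == 1 {                       # header: "#fsdb name1 name2 ..."
            for (i = 2; i <= NF; i++) if ($i == want) col = i - 1
            next
        }
        /^#/ { next }                   # skip comment lines
        $1 == key { sum += $col; n++ }
        END { if (n) printf "%.2f\n", sum / n }
    '
}

# Old layout: duration is the second column.
old=$(printf '%s\n' '#fsdb experiment duration' \
    'ufs_mab_sys 37.2' 'ufs_mab_sys 37.3' 'ufs_rcp_real 264.5' |
    mean_by_name duration ufs_mab_sys)

# New layout: duration has moved to the third column.
new=$(printf '%s\n' '#fsdb experiment size duration' \
    'ufs_mab_sys 1024 37.2' 'ufs_mab_sys 1024 37.3' 'ufs_rcp_real 1024 264.5' |
    mean_by_name duration ufs_mab_sys)

echo "$old $new"
```

       Both calls give the same mean because the lookup is by name, not by
       position.  The sketch assumes the named column exists and that fields
       are separated by whitespace.
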
122       Fsdb is built on flat-ASCII databases.  By storing data in simple text
123       files and processing it with pipelines it is easy to experiment (in the
124       shell) and look at the output.  To the best of my knowledge, the
125       original implementation of this idea was "/rdb", a commercial product
126       described in the book UNIX relational database management: application
127       development in the UNIX environment by Rod Manis, Evan Schaffer, and
128       Robert Jorgensen (and also at the web page <http://www.rdb.com/>).
129       Fsdb is an incompatible re-implementation of their idea without any
130       accelerated indexing or forms support.  (But it's free, and probably
131       has better statistics!).
132
133       Fsdb-2.x will exploit multiple processors or cores, and provides Perl-
134       level support for input, output, and threaded-pipelines.  (As of
135       Fsdb-2.44 it no longer uses Perl threading, just processes, since they
136       are faster.)
137
138       Installation instructions follow at the end of this document.  Fsdb-2.x
139       requires Perl 5.8 to run.  All commands have manual pages and provide
140       usage with the "--help" option.  All commands are backed by an
141       automated test suite.
142
143       The most recent version of Fsdb is available on the web at
144       <http://www.isi.edu/~johnh/SOFTWARE/FSDB/index.html>.
145

WHAT'S NEW

147   2.68, 2019-09-19 All programs now support automatic decompression based on
148       file extension.
149       IMPROVEMENT
           The omitted-possible-error test case for dbfilepivot now has an
           alternative output that I saw on some BSD-running systems (thanks
           to CPAN).
153
154       IMPROVEMENT
155           dbmerge and dbmerge2 now support "--header".  dbmerge2 now gives
156           better error messages when presented the wrong number of inputs.
157
158       BUG FIX
159           dbsort now works with "--header" even when the file is big (due to
160           fixes to dbmerge).
161
162       IMPROVEMENT
           csv_to_db now processes data with the "binary" option, allowing it
           to handle newlines embedded in quoted fields.
165
166       IMPROVEMENT
           All programs now transparently decompress input files that are
           given as filename arguments and end with a standard extension
           (.gz, .bz2, or .xz).
170
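       The extension-based decompression described above can be pictured
       with a small shell sketch.  This is an illustration only: the source
       does not say how Fsdb implements it, the function name open_input is
       hypothetical, and the sketch assumes the gzip, bzip2, and xz
       command-line tools are installed.

```shell
# Hypothetical helper: write a file to stdout, decompressing it when its
# name ends in one of the three extensions listed above.
open_input() {
    case "$1" in
        *.gz)  gzip -dc "$1" ;;
        *.bz2) bzip2 -dc "$1" ;;
        *.xz)  xz -dc "$1" ;;
        *)     cat "$1" ;;
    esac
}

# Try it on an uncompressed table in a scratch directory.
tmp=$(mktemp -d)
printf '#fsdb experiment duration\nufs_mab_sys 37.2\n' > "$tmp/plain.fsdb"
plain=$(open_input "$tmp/plain.fsdb")
echo "$plain"
```

       The choice is made purely from the file name, which matches the
       "based on file extension" wording of the release note.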

README CONTENTS

172       executive summary
173       what's new
174       README CONTENTS
175       installation
176       basic data format
177       basic data manipulation
178       list of commands
179       another example
180       a gradebook example
181       a password example
182       history
183       related work
184       release notes
185       copyright
186       comments
187

INSTALLATION

       Fsdb now uses the standard Perl build and installation from
       ExtUtils::MakeMaker(3), so the quick answer to installation is to type:
191
192           perl Makefile.PL
193           make
194           make test
195           make install
196
197       Or, if you want to install it somewhere else, change the first line to
198
199           perl Makefile.PL PREFIX=$HOME
200
       and it will go in your home directory's bin, etc.  (See
       ExtUtils::MakeMaker(3) for more details.)
203
204       Fsdb requires perl 5.8 or later.
205
206       A test-suite is available, run it with
207
208           make test
209
       A FreeBSD port of Fsdb is available; see
       <http://www.freshports.org/databases/fsdb/>.
212
213       A Fink (MacOS X) port is available, see
214       <http://pdb.finkproject.org/pdb/package.php/fsdb>.  (Thanks to Lars
215       Eggert for maintaining this port.)
216

BASIC DATA FORMAT

       These programs are based on the idea of storing data in simple ASCII
       files.  A database is a file with one header line and then data or
       comment lines.  For example:
221
222               #fsdb account passwd uid gid fullname homedir shell
223               johnh * 2274 134 John_Heidemann /home/johnh /bin/bash
224               greg * 2275 134 Greg_Johnson /home/greg /bin/bash
225               root * 0 0 Root /root /bin/bash
226               # this is a simple database
227
       The header line must be first and begins with "#fsdb".  There are rows
       (records) and columns (fields), just like in a normal database.
       Comment lines begin with "#".  Column names are any string not
       containing spaces or single quotes (although it is prudent to keep them
       alphanumeric with underscores).
233
234       By default, columns are delimited by whitespace.  With this default
235       configuration, the contents of a field cannot contain whitespace.
236       However, this limitation can be relaxed by changing the field separator
237       as described below.
238
239       The big advantage of simple flat-text databases is that it is usually
240       easy to massage data into this format, and it's reasonably easy to take
241       data out of this format into other (text-based) programs, like gnuplot,
242       jgraph, and LaTeX.  Think Unix.  Think pipes.  (Or even output to Excel
243       and HTML if you prefer.)
244
245       Since no-whitespace in columns was a problem for some applications,
246       there's an option which relaxes this rule.  You can specify the field
247       separator in the table header with "-F x" where "x" is a code for the
248       new field separator.  A full list of codes is at dbfilealter(1), but
       two common special values are "-F t", which is a separator of a single
       tab character, and "-F S", a separator of two spaces.  Both allow
       (single) spaces in fields.  An example:
252
253               #fsdb -F S account passwd uid gid fullname homedir shell
254               johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
255               greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash
256               root  *  0  0  Root  /root  /bin/bash
257               # this is a simple database
258
259       See dbfilealter(1) for more details.  Regardless of what the column
260       separator is for the body of the data, it's always whitespace in the
261       header.
262
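       To see why a two-space separator leaves single spaces inside a field
       intact, here is a minimal sketch using plain awk rather than Fsdb.  It
       hard-codes the separator and the position of fullname (the fifth
       column in this header) instead of reading them from the header, which
       a real Fsdb tool would do.

```shell
# Sketch only: split the body of a "-F S" table on exactly two spaces.
# awk treats a multi-character field separator as a regular expression,
# so -F'  ' (two spaces) does not split on the single space in a name.
fullnames=$(printf '%s\n' \
    '#fsdb -F S account passwd uid gid fullname homedir shell' \
    'johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash' \
    'greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash' \
    '# this is a simple database' |
    awk -F'  ' '!/^#/ { print $5 }')    # skip header and comment lines
echo "$fullnames"
```

       Each printed name keeps its embedded space, which the default
       whitespace-delimited format could not represent.
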
       There's also a third format: a "list".  Because it's often hard to see
       what's in columns past the first two, in list format each "column" is
       on a separate line.  The programs dblistize and dbcolize convert to and
       from this format, and all programs work with either format.  The
       command
267
268           dbfilealter -R C  < DATA/passwd.fsdb
269
270       outputs:
271
272               #fsdb -R C account passwd uid gid fullname homedir shell
273               account:  johnh
274               passwd:   *
275               uid:      2274
276               gid:      134
277               fullname: John_Heidemann
278               homedir:  /home/johnh
279               shell:    /bin/bash
280
281               account:  greg
282               passwd:   *
283               uid:      2275
284               gid:      134
285               fullname: Greg_Johnson
286               homedir:  /home/greg
287               shell:    /bin/bash
288
289               account:  root
290               passwd:   *
291               uid:      0
292               gid:      0
293               fullname: Root
294               homedir:  /root
295               shell:    /bin/bash
296
297               # this is a simple database
298               #  | dblistize
299
300       See dbfilealter(1) for more details.
301
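       The row-to-list conversion can be approximated in a few lines of awk.
       This sketch is not dbfilealter: it does not rewrite the header with
       "-R C" or align the values, and the three-column sample table is made
       up for illustration.  It only shows the idea of pairing each value
       with its column name.

```shell
# Sketch only: print each row as "name: value" lines with a blank line
# after every record, taking the names from the "#fsdb" header.
listed=$(printf '%s\n' \
    '#fsdb account uid shell' \
    'johnh 2274 /bin/bash' \
    'root 0 /bin/bash' |
    awk '
        NR == 1 { for (i = 2; i <= NF; i++) name[i - 1] = $i; next }
        /^#/    { next }                # drop comment lines
        { for (i = 1; i <= NF; i++) print name[i] ": " $i; print "" }
    ')
echo "$listed"
```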

BASIC DATA MANIPULATION

303       A number of programs exist to manipulate databases.  Complex functions
304       can be made by stringing together commands with shell pipelines.  For
305       example, to print the home directories of everyone with ``john'' in
306       their names, you would do:
307
308               cat DATA/passwd | dbrow '_fullname =~ /John/' | dbcol homedir
309
310       The output might be:
311
312               #fsdb homedir
313               /home/johnh
314               /home/greg
315               # this is a simple database
316               #  | dbrow _fullname =~ /John/
317               #  | dbcol homedir
318
319       (Notice that comments are appended to the output listing each command,
320       providing an automatic audit log.)
321
322       In addition to typical database functions (select, join, etc.) there
323       are also a number of statistical functions.
324
325       The real power of Fsdb is that one can apply arbitrary code to rows to
326       do powerful things.
327
328               cat DATA/passwd | dbroweval '_fullname =~ s/(\w+)_(\w+)/$2,_$1/'
329
330       converts "John_Heidemann" into "Heidemann,_John".  Not too much more
331       work could split fullname into firstname and lastname fields.
332
       Or:

               cat DATA/passwd | dbcolcreate sort | dbroweval -b 'use Fsdb::Support' \
                       '_sort = _fullname; _sort =~ s/_/ /g; _sort = fullname_to_sort(_sort);'
337

TALKING ABOUT COLUMNS

339       An advantage of Fsdb is that you can talk about columns by name
340       (symbolically) rather than simply by their positions.  So in the above
341       example, "dbcol homedir" pulled out the home directory column, and
342       "dbrow '_fullname =~ /John/'" matched against column fullname.
343
344       In general, you can use the name of the column listed on the "#fsdb"
345       line to identify it in most programs, and _name to identify it in code.
346
347       Some alternatives for flexibility:
348
349       ·   Numeric values identify columns positionally, numbering from 0.  So
350           0 or _0 is the first column, 1 is the second, etc.
351
       ·   In code, _last_columnname gets the value from columnname's previous
           row.
354
355       See dbroweval(1) for more details about writing code.
356
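       The "value from the previous row" idea can be sketched in awk with a
       variable that is updated after each row.  This is an illustration of
       the concept, not of dbroweval; the table and the column name delta are
       made up, and "-" is used for the missing first difference, following
       this document's use of "-" as an empty value.

```shell
# Sketch only: add a column holding the change in "clock" since the
# previous row, remembering the previous value in an awk variable.
deltas=$(printf '%s\n' \
    '#fsdb event clock' \
    'start 10' \
    'middle 25' \
    'end 45' |
    awk '
        NR == 1 { print $0 " delta"; next }     # extend the header
        /^#/    { print; next }                 # pass comments through
        {
            d = (seen ? $2 - last : "-")        # "-" means no previous row
            print $0, d
            last = $2; seen = 1
        }
    ')
echo "$deltas"
```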

LIST OF COMMANDS

       Enough said.  I'll summarize the commands, and then you can experiment.
       For a summary of any command, run it with the argument "--help" (or
       "-?" if you prefer).  Full manual pages can be found by running the
       command with the argument "--man", or by running the Unix command
       "man dbcol" (or whatever program you want).
363
364   TABLE CREATION
365       dbcolcreate
366           add columns to a database
367
368       dbcoldefine
369           set the column headings for a non-Fsdb file
370
371   TABLE MANIPULATION
372       dbcol
373           select columns from a table
374
375       dbrow
376           select rows from a table
377
378       dbsort
379           sort rows based on a set of columns
380
381       dbjoin
382           compute the natural join of two tables
383
384       dbcolrename
385           rename a column
386
387       dbcolmerge
388           merge two columns into one
389
390       dbcolsplittocols
391           split one column into two or more columns
392
393       dbcolsplittorows
394           split one column into multiple rows
395
396       dbfilepivot
397           "pivots" a file, converting multiple rows corresponding to the same
398           entity into a single row with multiple columns.
399
400       dbfilevalidate
401           check that db file doesn't have some common errors
402
403   COMPUTATION AND STATISTICS
404       dbcolstats
           compute statistics over a column (mean, etc., optionally median)
406
407       dbmultistats
408           group rows by some key value, then compute stats (mean, etc.) over
409           each group (equivalent to dbmapreduce with dbcolstats as the
410           reducer)
411
412       dbmapreduce
413           group rows (map) and then apply an arbitrary function to each group
414           (reduce)
415
416       dbrvstatdiff
           compare two sample distributions (mean/conf interval/T-test)
418
419       dbcolmovingstats
           compute moving statistics over a column of data
421
422       dbcolstatscores
423           compute Z-scores and T-scores over one column of data
424
425       dbcolpercentile
426           compute the rank or percentile of a column
427
428       dbcolhisto
429           compute histograms over a column of data
430
431       dbcolscorrelate
432           compute the coefficient of correlation over several columns
433
434       dbcolsregression
435           compute linear regression and correlation for two columns
436
437       dbrowaccumulate
438           compute a running sum over a column of data
439
440       dbrowcount
           count the number of rows (a subset of dbcolstats)
442
443       dbrowdiff
           compute row-by-row differences of a column in a table
445
446       dbrowenumerate
447           number each row
448
449       dbroweval
450           run arbitrary Perl code on each row
451
452       dbrowuniq
453           count/eliminate identical rows (like Unix uniq(1))
454
455       dbfilediff
456           compare fields on rows of a file (something like Unix diff(1))
457
458   OUTPUT CONTROL
459       dbcolneaten
460           pretty-print columns
461
462       dbfilealter
463           convert between column or list format, or change the column
464           separator
465
466       dbfilestripcomments
467           remove comments from a table
468
469       dbformmail
470           generate a script that sends form mail based on each row
471
472   CONVERSIONS
473       (These programs convert data into fsdb.  See their web pages for
474       details.)
475
476       cgi_to_db
477           <http://stein.cshl.org/boulder/>
478
479       combined_log_format_to_db
480           <http://httpd.apache.org/docs/2.0/logs.html>
481
482       html_table_to_db
483           HTML tables to fsdb (assuming they're reasonably formatted).
484
485       kitrace_to_db
486           <http://ficus-www.cs.ucla.edu/ficus-members/geoff/kitrace.html>
487
488       ns_to_db
489           <http://mash-www.cs.berkeley.edu/ns/>
490
491       sqlselect_to_db
492           the output of SQL SELECT tables to db
493
494       tabdelim_to_db
495           spreadsheet tab-delimited files to db
496
497       tcpdump_to_db
498           (see man tcpdump(8) on any reasonable system)
499
500       xml_to_db
501           XML input to fsdb, assuming they're very regular
502
503       (And out of fsdb:)
504
505       db_to_csv
506           Comma-separated-value format from fsdb.
507
508       db_to_html_table
509           simple conversion of Fsdb to html tables
510
511   STANDARD OPTIONS
512       Many programs have common options:
513
514       -? or --help
515           Show basic usage.
516
       -N or --new-name
518           When a command creates a new column like dbrowaccumulate's "accum",
519           this option lets one override the default name of that new column.
520
521       -T TmpDir
522           where to put tmp files.  Also uses environment variable TMPDIR, if
523           -T is not specified.  Default is /tmp.
524
527       -c FRACTION or --confidence FRACTION
528           Specify confidence interval FRACTION (dbcolstats, dbmultistats,
529           etc.)
530
531       -C S or "--element-separator S"
532           Specify column separator S (dbcolsplittocols, dbcolmerge).
533
534       -d or --debug
535           Enable debugging (may be repeated for greater effect in some
536           cases).
537
538       -a or --include-non-numeric
539           Compute stats over all data (treating non-numbers as zeros).  (By
540           default, things that can't be treated as numbers are ignored for
541           stats purposes)
542
543       -S or --pre-sorted
544           Assume the data is pre-sorted.  May be repeated to disable
545           verification (saving a small amount of work).
546
547       -e E or --empty E
548           give value E as the value for empty (null) records
549
550       -i I or --input I
551           Input data from file I.
552
553       -o O or --output O
554           Write data out to file O.
555
556       --header H
           Use H as the full Fsdb header, rather than reading a header from
           the input.  This option is particularly useful when using Fsdb
           under Hadoop, where split files don't have headers.
560
       --nolog
562           Skip logging the program in a trailing comment.
563
       When giving Perl code (in dbrow and dbroweval), column names can be
       embedded if preceded by underscores.  Look at dbrow(1) or dbroweval(1)
       for examples.
567
568       Most programs run in constant memory and use temporary files if
569       necessary.  Exceptions are dbcolneaten, dbcolpercentile, dbmapreduce,
570       dbmultistats, dbrowsplituniq.
571

ANOTHER EXAMPLE

       Take the raw data in "DATA/http_bandwidth", put a header on it
       ("dbcoldefine size bw"), take statistics of each category
       ("dbmultistats -k size bw"), pick out the relevant fields ("dbcol size
       mean stddev pct_rsd"), and you get:
577
578               #fsdb size mean stddev pct_rsd
579               1024    1.4962e+06      2.8497e+05      19.047
580               10240   5.0286e+06      6.0103e+05      11.952
581               102400  4.9216e+06      3.0939e+05      6.2863
582               #  | dbcoldefine size bw
583               #  | /home/johnh/BIN/DB/dbmultistats -k size bw
584               #  | /home/johnh/BIN/DB/dbcol size mean stddev pct_rsd
585
586       (The whole command was:
587
588               cat DATA/http_bandwidth |
               dbcoldefine size bw |
590               dbmultistats -k size bw |
591               dbcol size mean stddev pct_rsd
592
593       all on one line.)
594
595       Then post-process them to get rid of the exponential notation by adding
596       this to the end of the pipeline:
597
598           dbroweval '_mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev);'
599
600       (Actually, this step is no longer required since dbcolstats now uses a
601       different default format.)
602
603       giving:
604
605               #fsdb      size    mean    stddev  pct_rsd
606               1024     1496200          284970        19.047
607               10240    5028600          601030        11.952
608               102400   4921600          309390        6.2863
609               #  | dbcoldefine size bw
610               #  | dbmultistats -k size bw
611               #  | dbcol size mean stddev pct_rsd
612               #  | dbroweval   { _mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev); }
613
614       In a few lines, raw data is transformed to processed output.
615
616       Suppose you expect there is an odd distribution of results of one
617       datapoint.  Fsdb can easily produce a CDF (cumulative distribution
618       function) of the data, suitable for graphing:
619
620           cat DB/DATA/http_bandwidth | \
621               dbcoldefine size bw | \
622               dbrow '_size == 102400' | \
623               dbcol bw | \
624               dbsort -n bw | \
625               dbrowenumerate | \
626               dbcolpercentile count | \
627               dbcol bw percentile | \
628               xgraph
629
       The steps, roughly:

       1.  get the raw input data and turn it into fsdb format,

       2.  pick out just the relevant column (for efficiency) and sort it,

       3.  for each data point, assign a CDF percentage to it,

       4.  pick out the two columns to graph and show them.
634
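       The CDF steps above can be mimicked with sort and awk to see what the
       pipeline computes.  This sketch assumes the CDF value of the i-th
       smallest of n points is i/n; dbcolpercentile may define percentiles
       differently, so treat the numbers as illustrative.  The four input
       values are made up.

```shell
# Sketch only: empirical CDF of one numeric column.
# sort orders the values; awk then assigns rank/n to each one.
# Unlike a streaming filter, this buffers every value in memory.
cdf=$(printf '%s\n' 40 10 30 20 |
    sort -n |
    awk '{ v[NR] = $1 }
         END { for (i = 1; i <= NR; i++) printf "%s %.2f\n", v[i], i / NR }')
echo "$cdf"
```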

A GRADEBOOK EXAMPLE

636       The first commercial program I wrote was a gradebook, so here's how to
637       do it with Fsdb.
638
639       Format your data like DATA/grades.
640
641               #fsdb name email id test1
642               a a@ucla.example.edu 1 80
643               b b@usc.example.edu 2 70
644               c c@isi.example.edu 3 65
645               d d@lmu.example.edu 4 90
646               e e@caltech.example.edu 5 70
647               f f@oxy.example.edu 6 90
648
649       Or if your students have spaces in their names, use "-F S" and two
650       spaces to separate each column:
651
652               #fsdb -F S name email id test1
653               alfred aho  a@ucla.example.edu  1  80
654               butler lampson  b@usc.example.edu  2  70
655               david clark  c@isi.example.edu  3  65
656               constantine drovolis  d@lmu.example.edu  4  90
               deborah estrin  e@caltech.example.edu  5  70
658               sally floyd  f@oxy.example.edu  6  90
659
660       To compute statistics on an exam, do
661
662               cat DATA/grades | dbstats test1 |dblistize
663
664       giving
665
666               #fsdb -R C  ...
667               mean:        77.5
668               stddev:      10.84
669               pct_rsd:     13.987
670               conf_range:  11.377
671               conf_low:    66.123
672               conf_high:   88.877
673               conf_pct:    0.95
674               sum:         465
675               sum_squared: 36625
676               min:         65
677               max:         90
678               n:           6
679               ...
680
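       Several figures in the output above can be checked by hand from the
       six test1 scores.  The awk sketch below accumulates n, the sum, and
       the sum of squares in one pass, then derives the mean and the sample
       standard deviation (dividing by n - 1, which is the convention that
       reproduces the stddev shown).  It does not attempt the confidence
       interval fields.

```shell
# Sketch only: one-pass mean and sample standard deviation.
stats=$(printf '%s\n' 80 70 65 90 70 90 |
    awk '{ n++; sum += $1; ss += $1 * $1 }
         END {
             mean = sum / n
             # sample standard deviation: divide by n - 1, not n
             sd = sqrt((ss - n * mean * mean) / (n - 1))
             printf "n=%d sum=%d sum_squared=%d mean=%.1f stddev=%.2f\n",
                    n, sum, ss, mean, sd
         }')
echo "$stats"
```
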
681       To do a histogram:
682
683               cat DATA/grades | dbcolhisto -n 5 -g test1
684
685       giving
686
687               #fsdb low histogram
688               65      *
689               70      **
690               75
691               80      *
692               85
693               90      **
694               #  | /home/johnh/BIN/DB/dbhistogram -n 5 -g test1
695
696       Now you want to send out grades to the students by e-mail.  Create a
697       form-letter (in the file test1.txt):
698
699               To: _email (_name)
700               From: J. Random Professor <jrp@usc.example.edu>
701               Subject: test1 scores
702
703               _name, your score on test1 was _test1.
704               86+   A
705               75-85 B
706               70-74 C
707               0-69  F
708
709       Generate the shell script that will send the mail out:
710
711               cat DATA/grades | dbformmail test1.txt > test1.sh
712
713       And run it:
714
715               sh <test1.sh
716
717       The last two steps can be combined:
718
719               cat DATA/grades | dbformmail test1.txt | sh
720
721       but I like to keep a copy of exactly what I send.
722
723       At the end of the semester you'll want to compute grade totals and
724       assign letter grades.  Both fall out of dbroweval.  For example, to
725       compute weighted total grades with a 40% midterm/60% final where the
726       midterm is 84 possible points and the final 100:
727
728               dbcol -rv total |
729               dbcolcreate total - |
730               dbroweval '
731                       _total = .40 * _midterm/84.0 + .60 * _final/100.0;
732                       _total = sprintf("%4.2f", _total);
733                       if (_final eq "-" || ( _name =~ /^_/)) { _total = "-"; };' |
734               dbcolneaten
735
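       The weighting formula above can be tried on a tiny table with awk
       alone.  This is a sketch, not a replacement for the dbroweval
       pipeline: the column positions are hard-coded, the two sample rows
       are made up, and the check on names beginning with "_" is omitted.

```shell
# Sketch only: 40% midterm (out of 84) plus 60% final (out of 100).
totals=$(printf '%s\n' \
    '#fsdb name midterm final' \
    'a 63 80' \
    'b 84 -' |
    awk '
        NR == 1 { print $0 " total"; next }     # extend the header
        /^#/    { print; next }                 # pass comments through
        $3 == "-" { print $0, "-"; next }       # no final yet: empty total
        { printf "%s %4.2f\n", $0, .40 * $2 / 84.0 + .60 * $3 / 100.0 }
    ')
echo "$totals"
```

       For student a this is .40 * 63/84 + .60 * 80/100 = 0.30 + 0.48 = 0.78.
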
736       If you got the data originally from a spreadsheet, save it in "tab-
737       delimited" format and convert it with tabdelim_to_db (run
738       tabdelim_to_db -? for examples).
739

A PASSWORD EXAMPLE

741       To convert the Unix password file to db:
742
743               cat /etc/passwd | sed 's/:/  /g'| \
744                       dbcoldefine -F S login password uid gid gecos home shell \
745                       >passwd.fsdb
746
747       To convert the group file
748
749               cat /etc/group | sed 's/:/  /g' | \
750                       dbcoldefine -F S group password gid members \
751                       >group.fsdb
752
753       To show the names of the groups that div7-members are in (assuming DIV7
754       is in the gecos field):
755
756               cat passwd.fsdb | dbrow '_gecos =~ /DIV7/' | dbcol login gid | \
757                       dbjoin -i - -i group.fsdb gid | dbcol login group
758

SHORT EXAMPLES

760       Which Fsdb programs are the most complicated (based on number of test
761       cases)?
762
763               ls TEST/*.cmd | \
764                       dbcoldefine test | \
765                       dbroweval '_test =~ s@^TEST/([^_]+).*$@$1@' | \
766                       dbrowuniq -c | \
767                       dbsort -nr count | \
768                       dbcolneaten
769
770       (Answer: dbmapreduce, then dbcolstats, dbfilealter and dbjoin.)
771
772       Stats on an exam (in $FILE, where $COLUMN is the name of the exam)?
773
               cat $FILE | dbcolstats -q 4 $COLUMN | dblistize | dbfilestripcomments

               cat $FILE | dbcolhisto -g -n 20 $COLUMN | dbcolneaten | dbfilestripcomments
777
       Merging the hw1 column from file hw1.fsdb into grades.fsdb, assuming
       there's a common student id in column "id":
780
781               dbcol id hw1 <hw1.fsdb >t.fsdb
782
783               dbjoin -a -e - grades.fsdb t.fsdb id | \
784                   dbsort  name | \
785                   dbcolneaten >new_grades.fsdb
786
       Merging two fsdb files with the same columns:
788
789               cat file1.fsdb file2.fsdb >output.fsdb
790
791       or if you want to clean things up a bit
792
793               cat file1.fsdb file2.fsdb | dbstripextraheaders >output.fsdb
794
795       or if you want to know where the data came from
796
797               for i in 1 2
798               do
799                       dbcolcreate source $i < file$i.fsdb
800               done >output.fsdb
801
802       (assumes you're using a Bourne-shell compatible shell, not csh).
803

WARNINGS

805       As with any tool, one should (which means must) understand the limits
806       of the tool.
807
808       All Fsdb tools should run in constant memory.  In some cases (such as
809       dbcolstats with quartiles, where the whole input must be re-read),
810       programs will spool data to disk if necessary.
811
       Most tools buffer one or a few lines of data, so memory will scale with
       the size of each line.  (So lines with many columns, or columns that
       hold lots of data, may cause large memory consumption.)
815
816       All Fsdb tools should run in constant or at worst "n log n" time.
817
       All Fsdb tools use normal Perl math routines for computation.  I make
       every attempt to choose numerically stable algorithms (and I welcome
       feedback and suggestions for improvement), but normal rounding due to
       computer floating point approximations can result in inaccuracies when
       data spans a large range of precision.  (See for example the
       dbcolstats_extrema test cases.)
824
       Any requirements and limitations of each Fsdb tool are documented on
       its manual page.
827
828       If any Fsdb program violates these assumptions, that is a bug that
829       should be documented on the tool's manual page or ideally fixed.
830
831       Fsdb does depend on Perl's correctness, and Perl (and Fsdb) have some
832       bugs.  Fsdb should work on perl from version 5.10 onward.
833

HISTORY

835       There have been three versions of Fsdb; fsdb 1.0 is a complete re-write
836       of the pre-1995 versions, and was distributed from 1995 to 2007.  Fsdb
837       2.0 is a significant re-write of the 1.x versions for reasons described
838       below.
839
840       Fsdb (in its various forms) has been used extensively by its author
841       since 1991.  Since 1995 it's been used by two other researchers at UCLA
842       and several at ISI.  In February 1998 it was announced to the Internet.
843       Since then it has found a few users, some outside where I work.
844
845       Major changes:
846
847       1.0 1997-07-22: first public release.
848       2.0 2008-01-25: rewrite to use a common library, and starting to use
849       threads.
850       2.12 2008-10-16: completion of the rewrite, and first RPM package.
851       2.44 2013-10-02: abandoning threads for improved performance
852
853   Fsdb 2.0 Rationale
854       I've thought about fsdb-2.0 for many years, but it was started in
855       earnest in 2007.  Fsdb-2.0 has the following goals:
856
857       in-one-process processing
858           While fsdb is great on the Unix command line as a pipeline between
859           programs, it should also be possible to set it up to run in a
860           single process.  And if it does so, it should be able to avoid
861           serializing and deserializing (converting to and from text) data
862           between each module.  (Accomplished in fsdb-2.0: see dbpipeline,
863           although still needs tuning.)
864
865       clean IO API
866           Fsdb's roots go back to perl4 and 1991, so the fsdb-1.x library is
867           very, very crufty.  More than just being ugly (but it was that
868           too), this made things like reading from one format file and writing to
869           another the application's job, when it should be the library's.
870           (Accomplished in fsdb-1.15 and improved in 2.0: see Fsdb::IO.)
871
872       normalized module APIs
873           Because fsdb modules were added as needed over 10 years, sometimes
874           the module APIs became inconsistent.  (For example, the 1.x
875           "dbcolcreate" required an empty value following the name of the new
876           column, but other programs specify empty values with the "-e"
877           argument.)  We should smooth over these inconsistencies.
878           (Accomplished as each module was ported in 2.0 through 2.7.)
879
880       everyone handles all input formats
881           Given a clean IO API, the distinction between "colized" and
882           "listized" fsdb files should go away.  Any program should be able
883           to read and write files in any format.  (Accomplished in fsdb-2.1.)
884
885       Fsdb-2.0 preserves backwards compatibility where possible, but breaks
886       it where necessary to accomplish the above goals.  In August 2008,
887       Fsdb-2.7 was declared preferred over the 1.x versions.  Benchmarking in
888       2013 showed that threading performed much worse than just using pipes,
889       so Fsdb-2.44 uses threading "style", but implemented with processes
890       (via my "Freds" library).
891
892   Contributors
893       Fsdb includes code ported from Geoff Kuenning
894       ("Fsdb::Support::TDistribution").
895
896       Fsdb contributors: Ashvin Goel goel@cse.oge.edu, Geoff Kuenning
897       geoff@fmg.cs.ucla.edu, Vikram Visweswariah visweswa@isi.edu, Kannan
898       Varadahan kannan@isi.edu, Lars Eggert larse@isi.edu, Arkadi Gelfond
899       arkadig@dyna.com, David Graff graff@ldc.upenn.edu, Haobo Yu
900       haoboy@packetdesign.com, Pavlin Radoslavov pavlin@catarina.usc.edu,
901       Graham Phillips, Yuri Pradkin, Alefiya Hussain, Ya Xu, Michael
902       Schwendt, Fabio Silva fabio@isi.edu, Jerry Zhao zhaoy@isi.edu, Ning Xu
903       nxu@aludra.usc.edu, Martin Lukac mlukac@lecs.cs.ucla.edu, Xue Cai,
904       Michael McQuaid, Christopher Meng, Calvin Ardi, H. Merijn Brand, Lan
905       Wei, Hang Guo.
906
907       Fsdb includes datasets contributed from NIST (DATA/nist_zarr13.fsdb),
908       from
909       <http://www.itl.nist.gov/div898/handbook/eda/section4/eda4281.htm>, the
910       NIST/SEMATECH e-Handbook of Statistical Methods, section 1.4.2.8.1.
911       Background and Data.  The source is public domain, and reproduced with
912       permission.
913

RELATED WORK

915       As stated in the introduction, Fsdb is an incompatible reimplementation
916       of the ideas found in "/rdb".  By storing data in simple text files and
917       processing it with pipelines it is easy to experiment (in the shell)
918       and look at the output.  The original implementation of this idea was
919       /rdb, a commercial product described in the book UNIX relational
920       database management: application development in the UNIX environment by
921       Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web
922       page <http://www.rdb.com/>).
923
924       While Fsdb is inspired by Rdb, it includes no code from it, and Fsdb
925       makes several different design choices.  In particular: rdb attempts to
926       be closer to a "real" database, with provision for locking and file
927       indexing.  Fsdb focuses on single user use and so eschews these
928       choices.  Rdb also has some support for interactive editing.  Fsdb
929       leaves editing to text editors like emacs or vi.
930
931       In August, 2002 I found out Carlo Strozzi extended RDB with his package
932       NoSQL <http://www.linux.it/~carlos/nosql/>.  According to Mr. Strozzi,
933       he implemented NoSQL in awk to avoid the Perl start-up of RDB.
934       Although I haven't found Perl startup overhead to be a big problem on
935       my platforms (from old Sparcstation IPCs to 2GHz Pentium-4s), you may
936       want to evaluate his system.  The Linux Journal has a description of
937       NoSQL at <http://www.linuxjournal.com/article/3294>.  It seems quite
938       similar to Fsdb.  Like /rdb, NoSQL supports indexing (not present in
939       Fsdb).  Fsdb appears to have richer support for statistics, and, as of
940       Fsdb-2.x, its support for Perl threading may support faster performance
941       (one-process, less serialization and deserialization).
942

RELEASE NOTES

944       Versions prior to 1.0 were released informally on my web page but were
945       not announced.
946
947   0.0 1991
948       started for my own research use
949
950   0.1 26-May-94
951       first check-in to RCS
952
953   0.2 15-Mar-95
954       parts now require perl5
955
956   1.0, 22-Jul-97
957       adds autoconf support and a test script.
958
959   1.1, 20-Jan-98
960       support for double space field separators, better tests
961
962   1.2, 11-Feb-98
963       minor changes and release on comp.lang.perl.announce
964
965   1.3, 17-Mar-98
966       ·   adds median and quartile options to dbstats
967
968       ·   adds dmalloc_to_db converter
969
970       ·   fixes some warnings
971
972       ·   dbjoin now can run on unsorted input
973
974       ·   fixes a dbjoin bug
975
976       ·   some more tests in the test suite
977
978   1.4, 27-Mar-98
979       ·   improves error messages (all should now report the program that
980           makes the error)
981
982       ·   fixed a bug in dbstats output when the mean is zero
983
984   1.5, 25-Jun-98
985       BUG FIX dbcolhisto, dbcolpercentile now handles non-numeric values like
986       dbstats
987       NEW dbcolstats computes zscores and tscores over a column
988       NEW dbcolscorrelate computes correlation coefficients between two
989       columns
990       INTERNAL ficus_getopt.pl has been replaced by DbGetopt.pm
991       BUG FIX all tests are now ``portable'' (previously some tests ran only
992       on my system)
993       BUG FIX you no longer need to have the db programs in your path (fix
994       arose from a discussion with Arkadi Gelfond)
995       BUG FIX installation no longer uses cp -f (to work on SunOS 4)
996
997   1.6, 24-May-99
998       NEW dbsort, dbstats, dbmultistats now run in constant memory (using tmp
999       files if necessary)
1000       NEW dbcolmovingstats does moving means over a series of data
1001       NEW dbcol has a -v option to get all columns except those listed
1002       NEW dbmultistats does quartiles and medians
1003       NEW dbstripextraheaders now also cleans up bogus comments before the
1004       first header
1005       BUG FIX dbcolneaten works better with double-space-separated data
1006
1007   1.7,  5-Jan-00
1008       NEW dbcolize now detects and rejects lines that contain embedded copies
1009       of the field separator
1010       NEW configure tries harder to prevent people from improperly
1011       configuring/installing fsdb
1012       NEW tcpdump_to_db converter (incomplete)
1013       NEW tabdelim_to_db converter:  from spreadsheet tab-delimited files to
1014       db
1015       NEW mailing lists for fsdb are "fsdb-announce@heidemann.la.ca.us" and
1016       "fsdb-talk@heidemann.la.ca.us"
1017           To subscribe to either, send mail to
1018           "fsdb-announce-request@heidemann.la.ca.us" or
1019           "fsdb-talk-request@heidemann.la.ca.us" with "subscribe" in the
1020           BODY of the message.
1021
1022       BUG FIX dbjoin used to produce incorrect output if there were extra,
1023       unmatched values in the 2nd table. Thanks to Graham Phillips for
1024       providing a test case.
1025       BUG FIX the sample commands in the usage strings now all should
1026       explicitly include the source of data (typically from "cat foo.fsdb
1027       |").  Thanks to Ya Xu for pointing out this documentation deficiency.
1028       BUG FIX (DOCUMENTATION) dbcolmovingstats had incorrect sample output.
1029
1030   1.8, 28-Jun-00
1031       BUG FIX header options are now preserved when writing with dblistize
1032       NEW dbrowuniq now optionally checks for uniqueness only on certain
1033       fields
1034       NEW dbrowsplituniq makes one pass through a file and splits it into
1035       separate files based on the given fields
1036       NEW converter for "crl" format network traces
1037       NEW anywhere you use arbitrary code (like dbroweval), _last_foo now
1038       maps to the last row's value for field _foo.
1039       OPTIMIZATION comment processing slightly changed so that dbmultistats
1040       now is much faster on files with lots of comments (for example, ~100k
1041       lines of comments and 700 lines of data!) (Thanks to Graham Phillips
1042       for pointing out this performance problem.)
1043       BUG FIX dbstats with median/quartiles now correctly handles singleton
1044       data points.
1045
1046   1.9,  6-Nov-00
1047       NEW dbfilesplit, split a single input file into multiple output files
1048       (based on code contributed by Pavlin Radoslavov).
1049       BUG FIX dbsort now works with perl-5.6
1050
1051   1.10, 10-Apr-01
1052       BUG FIX dbstats now handles the case where there are more n-tiles than
1053       data
1054       NEW dbstats now includes a -S option to optimize work on pre-sorted
1055       data (inspired by code contributed by Haobo Yu)
1056       BUG FIX dbsort now has a better estimate of memory usage when run on
1057       data with very short records (problem detected by Haobo Yu)
1058       BUG FIX cleanup of temporary files is slightly better
1059
1060   1.11,  2-Nov-01
1061       BUG FIX dbcolneaten now runs in constant memory
1062       NEW dbcolneaten now supports "field specifiers" that allow some control
1063       over how wide columns should be
1064       OPTIMIZATION dbsort now tries hard to be filesystem cache-friendly
1065       (inspired by "Information and Control in Gray-box Systems" by the
1066       Arpaci-Dusseaus at SOSP 2001)
1067       INTERNAL t_distr now ported to perl5 module DbTDistr
1068
1069   1.12,  30-Oct-02
1070       BUG FIX dbmultistats documentation typo fixed
1071       NEW dbcolmultiscale
1072       NEW dbcol has -r option for "relaxed error checking"
1073       NEW dbcolneaten has new -e option to strip end-of-line spaces
1074       NEW dbrow finally has a -v option to negate the test
1075       BUG FIX math bug in dbcoldiff fixed by Ashvin Goel (need to check
1076       Scheaffer test cases)
1077       BUG FIX some patches to run with Perl 5.8. Note: some programs
1078       (dbcolmultiscale, dbmultistats, dbrowsplituniq) generate warnings like:
1079       "Use of uninitialized value in concatenation (.)" or "string at
1080       /usr/lib/perl5/5.8.0/FileCache.pm line 98, <STDIN> line 2". Please
1081       ignore this until I figure out how to suppress it. (Thanks to Jerry
1082       Zhao for noticing perl-5.8 problems.)
1083       BUG FIX fixed an autoconf problem where configure would fail to find a
1084       reasonable prefix (thanks to Fabio Silva for reporting the problem)
1085       NEW db_to_html_table: simple conversion to html tables (NO fancy stuff)
1086       NEW dblib now has a function dblib_text2html() that will do simple
1087       conversion of iso-8859-1 to HTML
1088
1089   1.13,  4-Feb-04
1090       NEW fsdb added to the freebsd ports tree
1091       <http://www.freshports.org/databases/fsdb/>.  Maintainer:
1092       "larse@isi.edu"
1093       BUG FIX properly handle trailing spaces when data must be numeric (ex.
1094       dbstats with -FS, see test dbstats_trailing_spaces). Fix from Ning Xu
1095       "nxu@aludra.usc.edu".
1096       NEW dbcolize error message improved (bug report from Terrence Brannon),
1097       and list format documented in the README.
1098       NEW cgi_to_db converts CGI.pm-format storage to fsdb list format
1099       BUG FIX handle numeric synonyms for column names in dbcol properly
1100       ENHANCEMENT "talking about columns" section added to README. Lack of
1101       documentation pointed out by Lars Eggert.
1102       CHANGE dbformmail now defaults to using Mail ("Berkeley Mail") to send
1103       mail, rather than sendmail (sendmail is still an option, but mail
1104       doesn't require running as root)
1105       NEW on platforms that support it (i.e., with perl 5.8), fsdb works fine
1106       with unicode
1107       NEW dbfilevalidate: check a db file for some common errors
1108
1109   1.14,  24-Aug-06
1110       ENHANCEMENT README cleanup
1111       INCOMPATIBLE CHANGE dbcolsplit renamed dbcolsplittocols
1112       NEW dbcolsplittorows  split one column into multiple rows
1113       NEW dbcolsregression compute linear regression and correlation for two
1114       columns
1115       ENHANCEMENT csv_to_db: better error handling, normalize field names,
1116       skip blank lines
1117       ENHANCEMENT dbjoin now detects (and fails) if non-joined files have
1118       duplicate names
1119       BUG FIX minor bug fixed in calculation of Student t-distributions
1120       (doesn't change any test output, but may have caused small errors)
1121
1122   1.15, 12-Nov-07
1123       NEW fsdb-1.14 added to the MacOS Fink system
1124       <http://pdb.finkproject.org/pdb/package.php/fsdb>. (Thanks to Lars
1125       Eggert for maintaining this port.)
1126       NEW Fsdb::IO::Reader and Fsdb::IO::Writer now provide reasonably clean
1127       OO I/O interfaces to Fsdb files.  Highly recommended if you use fsdb
1128       directly from perl.  In the fullness of time I expect to reimplement
1129       the entire thing using these APIs to replace the current dblib.pl which
1130       is still hobbled by its roots in perl4.
1131       NEW dbmapreduce now implements a Google-style map/reduce abstraction,
1132       generalizing dbmultistats.
1133       ENHANCEMENT fsdb now uses the Perl build system (Makefile.PL, etc.),
1134       instead of autoconf.  This change paves the way to better perl-5-style
1135       modularization, proper manual pages, input of both listize and colize
1136       format for every program, and world peace.
1137       ENHANCEMENT dblib.pl is now moved to Fsdb::Old.pm.
1138       BUG FIX dbmultistats now propagates its format argument (-f). Bug and
1139       fix from Martin Lukac (thanks!).
1140       ENHANCEMENT dbformmail documentation now is clearer that it doesn't
1141       send the mail, you have to run the shell script it writes.  (Problem
1142       observed by Unkyu Park.)
1143       ENHANCEMENT adapted to autoconf-2.61 (and then these changes were
1144       discarded in favor of The Perl Way).
1145       BUG FIX dbmultistats memory usage corrected (O(# tags), not O(1))
1146       ENHANCEMENT dbmultistats can now optionally run with pre-grouped input
1147       in O(1) memory
1148       ENHANCEMENT dbroweval -N was finally implemented (eat comments)
1149
1150   2.0, 25-Jan-08
1151       2.0, 25-Jan-08 --- a quiet 2.0 release (gearing up towards complete)
1152
1153       ENHANCEMENT: shifting old programs to Perl modules, with the front-end
1154       program as just a wrapper. In the short-term, this change just means
1155       programs have real man pages. In the long-run, it will mean that one
1156       can run a pipeline in a single Perl program. So far: dbcol, dbroweval,
1157       the new dbrowcount, dbsort, the new dbmerge, the old "dbstats" (renamed
1158       dbcolstats), dbcolrename, dbcolcreate.
1159       NEW: Fsdb::Filter::dbpipeline is an internal-only module that lets one
1160       use fsdb commands from within perl (via threads).
1161           It also provides perl function aliases for the internal modules, so
1162           a string of fsdb commands in perl are nearly as terse as in the
1163           shell:
1164
1165               use Fsdb::Filter::dbpipeline qw(:all);
1166               dbpipeline(
1167                   dbrow(qw(name test1)),
1168                   dbroweval('_test1 += 5;')
1169               );
1170
1171       INCOMPATIBLE CHANGE: The old dbcolstats has been renamed
1172       dbcolstatscores. The new dbcolstats does the same thing as the old
1173       dbstats. This incompatibility is unfortunate but normalizes program
1174       names.
1175       CHANGE: The new dbcolstats program always outputs "-" (the default
1176       empty value) for statistics it cannot compute (for example, standard
1177       deviation if there is only one row), instead of the old mix of "-" and
1178       "na".
1179       INCOMPATIBLE CHANGE: The old dbcolstats program, now called
1180       dbcolstatscores, also has different arguments.  The "-t mean,stddev"
1181       option is now "--tmean mean --tstddev stddev".  See dbcolstatscores for
1182       details.
1183       INCOMPATIBLE CHANGE: dbcolcreate now assumes all new columns get the
1184       default value rather than requiring each column to have an initial
1185       constant value. To change the initial value, use the new "-e" option.
1186       NEW: dbrowcount counts rows, an almost-subset of dbcolstats's "n"
1187       output (except without differentiating numeric/non-numeric input), or
1188       the equivalent of "dbstripcomments | wc -l".
1189       NEW: dbmerge merges two sorted files. This functionality was previously
1190       embedded in dbsort.
1191       INCOMPATIBLE CHANGE: dbjoin's "-i" option to include non-matches is now
1192       renamed "-a", so as to not conflict with the new standard option "-i"
1193       for input file.
1194
1195   2.1,  6-Apr-08
1196       2.1,  6-Apr-08 --- another alpha 2.0, but now all converted programs
1197       understand both listize and colize format
1198
1199       ENHANCEMENT: shifting more old programs to Perl modules. New in 2.1:
1200       dbcolneaten, dbcoldefine, dbcolhisto, dblistize, dbcolize, dbrecolize
1201       ENHANCEMENT dbmerge now handles an arbitrary number of input files, not
1202       just exactly two.
1203       NEW dbmerge2 is an internal routine that handles merging exactly two
1204       files.
1205       INCOMPATIBLE CHANGE dbjoin now specifies inputs like dbmerge2, rather
1206       than assuming the first two arguments were tables (as in fsdb-1).
1207           The old dbjoin argument "-i" is now "-a" or "--type=outer".
1208
1209           A minor change: comments in the source files for dbjoin are now
1210           intermixed with output rather than being delayed until the end.
1211
1212       ENHANCEMENT dbsort now no longer produces warnings when null values are
1213       passed to numeric comparisons.
1214       BUG FIX dbroweval now once again works with code that lacks a trailing
1215       semicolon. (This bug fixes a regression from 1.15.)
1216       INCOMPATIBLE CHANGE dbcolneaten's old "-e" option (to avoid end-of-line
1217       spaces) is now "-E" to avoid conflicts with the standard empty field
1218       argument.
1219       INCOMPATIBLE CHANGE dbcolhisto's old "-e" option is now "-E" to avoid
1220       conflicts. And its "-n", "-s", and "-w" are now "-N", "-S", and "-W" to
1221       correspond.
1222       NEW dbfilealter replaces dbrecolize, dblistize, and dbcolize, but with
1223       different options.
1224       ENHANCEMENT The library routines "Fsdb::IO" now understand both list-
1225       format and column-format data, so all converted programs can now
1226       automatically read either format.  This capability was one of the
1227       milestone goals for 2.0, so yea!
1228
1229   2.2, 23-May-08
1230       Release 2.2 is another 2.x alpha release.  Now most of the commands are
1231       ported, but a few remain, and I plan one last incompatible change (to
1232       the file header) before 2.x final.
1233
1234       ENHANCEMENT
1235           shifting more old programs to Perl modules.  New in 2.2:
1236           dbrowaccumulate, dbformmail, dbcolmovingstats, dbrowuniq,
1237           dbrowdiff, dbcolmerge, dbcolsplittocols, dbcolsplittorows,
1238           dbmapreduce, dbmultistats, dbrvstatdiff.  Also dbrowenumerate
1239           exists only as a front-end (command-line) program.
1240
1241       INCOMPATIBLE CHANGE
1242           The following programs have been dropped from fsdb-2.x:
1243           dbcoltighten, dbfilesplit, dbstripextraheaders,
1244           dbstripleadingspace.
1245
1246       NEW combined_log_format_to_db to convert Apache logfiles
1247
1248       INCOMPATIBLE CHANGE
1249           Options to dbrowdiff are now -B and -I, not -a and -i.
1250
1251       INCOMPATIBLE CHANGE
1252           dbstripcomments is now dbfilestripcomments.
1253
1254       BUG FIXES
1255           dbcolneaten better handles empty columns; dbcolhisto warning
1256           suppressed (actually a bug in high-bucket handling).
1257
1258       INCOMPATIBLE CHANGE
1259           dbmultistats now requires a "-k" option in front of the key (tag)
1260           field, or if none is given, it will group by the first field (both
1261           like dbmapreduce).
1262
1263       KNOWN BUG
1264           dbmultistats with quantile option doesn't work currently.
1265
1266       INCOMPATIBLE CHANGE
1267           dbcoldiff is renamed dbrvstatdiff.
1268
1269       BUG FIXES
1270           dbformmail was leaving its log message as a  command, not a
1271           comment.  Oops.  No longer.
1272
1273   2.3, 27-May-08 (alpha)
1274       Another alpha release, this one just to fix the critical dbjoin bug
1275       listed below (that happens to have blocked my MP3 jukebox :-).
1276
1277       BUG FIX
1278           Dbsort no longer hangs if given an input file with no rows.
1279
1280       BUG FIX
1281           Dbjoin now works with unsorted input coming from a pipeline (like
1282           stdin).  Perl-5.8.8 has a bug (?) that was making this case
1283           fail---opening stdin in one thread, reading some, then reading more
1284           in a different thread caused an lseek which works on files, but
1285           fails on pipes like stdin.  Go figure.
1286
1287       BUG FIX / KNOWN BUG
1288           The dbjoin fix also fixed dbmultistats -q (it now gives the right
1289           answer).  Although a new bug appeared, messages like:
1290               Attempt to free unreferenced scalar: SV 0xa9dd0c4, Perl
1291           interpreter: 0xa8350b8 during global destruction.  So the
1292           dbmultistats_quartile test is still disabled.
1293
1294   2.4, 18-Jun-08
1295       Another alpha release, mostly to fix minor usability problems in
1296       dbmapreduce and client functions.
1297
1298       ENHANCEMENT
1299           dbrow now defaults to running user supplied code without warnings
1300           (as with fsdb-1.x).  Use "--warnings" or "-w" to turn them back on.
1301
1302       ENHANCEMENT
1303           dbroweval can now write different format output than the input,
1304           using the "-m" option.
1305
1306       KNOWN BUG
1307           dbmapreduce emits warnings on perl 5.10.0 about "Unbalanced string
1308           table refcount" and "Scalars leaked" when run with an external
1309           program as a reducer.
1310
1311           dbmultistats emits the warning "Attempt to free unreferenced
1312           scalar" when run with quartiles.
1313
1314           In each case the output is correct.  I believe these can be
1315           ignored.
1316
1317       CHANGE
1318           dbmapreduce no longer logs a line for each reducer that is invoked.
1319
1320   2.5, 24-Jun-08
1321       Another alpha release, fixing more minor bugs in "dbmapreduce" and
1322       lossage in "Fsdb::IO".
1323
1324       ENHANCEMENT
1325           dbmapreduce can now tolerate non-map-aware reducers that pass back
1326           the key column in put.  It also passes the current key as the last
1327           argument to external reducers.
1328
1329       BUG FIX
1330           Fsdb::IO::Reader, correctly handle "-header" option again.  (Broken
1331           since fsdb-2.3.)
1332
1333   2.6, 11-Jul-08
1334       Another alpha release, needed to fix DaGronk.  One new port, small bug
1335       fixes, and important fix to dbmapreduce.
1336
1337       ENHANCEMENT
1338           shifting more old programs to Perl modules.  New in 2.6:
1339           dbcolpercentile.
1340
1341       INCOMPATIBLE CHANGE and ENHANCEMENTS dbcolpercentile arguments changed,
1342       use "--rank" to require ranking instead of "-r". Also, "--ascending"
1343       and "--descending" can now be specified separately, both for
1344       "--percentile" and "--rank".
1345       BUG FIX
1346           Sigh, the sense of the --warnings option in dbrow was inverted.  No
1347           longer.
1348
1349       BUG FIX
1350           I found and fixed the string leaks (errors like "Unbalanced string
1351           table refcount" and "Scalars leaked") in dbmapreduce and
1352           dbmultistats.  (All "IO::Handle"s in threads must be manually
1353           destroyed.)
1354
1355       BUG FIX
1356           The "-C" option to specify the column separator in dbcolsplittorows
1357           now works again (broken since it was ported).
1358
1359   2.7, 30-Jul-08 beta
1360
1361       The beta release of fsdb-2.x.  Finally, all programs are ported.  As
1362       statistics, the number of lines of non-library code doubled from 7.5k
1363       to 15.5k.  The libraries are much more complete, going from 866 to 5164
1364       lines.  The overall number of programs is about the same, although 19
1365       were dropped and 11 were added.  The number of test cases has grown
1366       from 116 to 175.  All programs are now in perl-5, no more shell scripts
1367       or perl-4.  All programs now have manual pages.
1368
1369       Although this is a major step forward, I still expect to rename "jdb"
1370       to "fsdb".
1371
1372       ENHANCEMENT
1373           shifting more old programs to Perl modules.  New in 2.7:
1374           dbcolscorrelate, dbcolsregression, cgi_to_db, dbfilevalidate,
1375           db_to_csv, csv_to_db, db_to_html_table, kitrace_to_db,
1376           tcpdump_to_db, tabdelim_to_db, ns_to_db.
1377
1378       INCOMPATIBLE CHANGE
1379           The following programs have been dropped from fsdb-2.x: db2dcliff,
1380           dbcolmultiscale, crl_to_db, ipchain_logs_to_db.  They may come
1381           back, but seemed overly specialized.  The program
1382           dbrowsplituniq was dropped because it is superseded by dbmapreduce.
1383           dmalloc_to_db was dropped pending test cases and examples.
1384
1385       ENHANCEMENT
1386           dbfilevalidate now has a "-c" option to correct errors.
1387
1388       NEW html_table_to_db provides the inverse of db_to_html_table.
1389
1390   2.8,  5-Aug-08
1391       Change header format, preserving forwards compatibility.
1392
1393       BUG FIX
1394           Complete editing pass over the manual, making sure it aligns with
1395           fsdb-2.x.
1396
1397       SEMI-COMPATIBLE CHANGE
1398           The header of fsdb files has changed: it is now #fsdb, not #h (or
1399           #L), and parsing of -F and -R is also different.  See dbfilealter
1400           for the new specification.  The v1 file format will be read,
1401           compatibly, but not written.
1402
1403       BUG FIX
1404           dbmapreduce now tolerates comments that precede the first key,
1405           instead of failing with an error message.
1406
1407   2.9, 6-Aug-08
1408       Still in beta; just a quick bug-fix for dbmapreduce.
1409
1410       ENHANCEMENT
1411           dbmapreduce now generates plausible output when given no rows of
1412           input.
1413
1414   2.10, 23-Sep-08
1415       Still in beta, but picking up some bug fixes.
1416
1417       ENHANCEMENT
1418           dbmapreduce now generates plausible output when given no rows of
1419           input.
1420
1421       ENHANCEMENT
1422           dbroweval the warnings option was backwards; now corrected.  As a
1423           result, warnings in user code now default off (like in fsdb-1.x).
1424
1425       BUG FIX
1426           dbcolpercentile now defaults to assuming the target column is
1427           numeric.  The new option "-N" allows selection of a non-numeric
1428           target.
1429
1430       BUG FIX
1431           dbcolscorrelate now includes "--sample" and "--nosample" options to
1432           compute the sample or full population correlation coefficients.
1433           Thanks to Xue Cai for finding this bug.
1434
1435   2.11, 14-Oct-08
1436       Still in beta, but picking up some bug fixes.
1437
1438       ENHANCEMENT
1439           html_table_to_db is now more aggressive about filling in empty
1440           cells with the official empty value, rather than leaving them blank
1441           or as whitespace.
1442
1443       ENHANCEMENT
1444           dbpipeline now catches failures during pipeline element setup and
1445           exits reasonably gracefully.
1446
1447       BUG FIX
1448           dbsubprocess now reaps child processes, thus avoiding running out
1449           of processes when used a lot.
1450
1451   2.12, 16-Oct-08
1452       Finally, a full (non-beta) 2.x release!
1453
1454       INCOMPATIBLE CHANGE
1455           Jdb has been renamed Fsdb, the flatfile-streaming database.  This
1456           change affects all internal Perl APIs, but no shell command-level
1457           APIs.  While Jdb served well for more than ten years, it is easily
1458           confused with the Java debugger (even though Jdb was there first!).
1459           It also is too generic to work well in web search engines.
1460           Finally, Jdb stands for ``John's database'', and we're a bit beyond
1461           that.  (However, some call me the ``file-system guy'', so one could
1462           argue it retains that meaning.)
1463
1464           If you just used the shell commands, this change should not affect
1465           you.  If you used the Perl-level libraries directly in your code,
1466           you should be able to rename "Jdb" to "Fsdb" to move to 2.12.
1467
1468           The jdb-announce list has not yet been renamed, but it will be shortly.
1469
1470           With this release I've accomplished everything I wanted to in
1471           fsdb-2.x.  I therefore expect to return to boring, bugfix releases.
1472
1473   2.13, 30-Oct-08
1474       BUG FIX
1475           dbrowaccumulate now treats non-numeric data as zero by default.
1476
1477       BUG FIX
1478           Fixed a perl-5.10ism in dbmapreduce that breaks that program under
1479           5.8.  Thanks to Martin Lukac for reporting the bug.
1480
1481   2.14, 26-Nov-08
1482       BUG FIX
1483           Improved documentation for dbmapreduce's "-f" option.
1484
1485       ENHANCEMENT
1486           dbcolmovingstats now computes a moving standard deviation in
1487           addition to a moving mean.
1488
1489   2.15, 13-Apr-09
1490       BUG FIX
1491           Fix a make install bug reported by Shalindra Fernando.
1492
1493   2.16, 14-Apr-09
1494       BUG FIX
1495           Another minor release bug: on some systems programize_module loses
1496           executable permissions.  Again reported by Shalindra Fernando.
1497
1498   2.17, 25-Jun-09
1499       TYPO FIXES
1500           Typo in the dbroweval manual fixed.
1501
1502       IMPROVEMENT
1503           There is no longer a comment line to label columns in dbcolneaten,
1504           instead the header line is tweaked to line up.  This change
1505           restores the Jdb-1.x behavior, and means that repeated runs of
1506           dbcolneaten no longer add comment lines each time.
1507
1508       BUG FIX
1509           It turns out dbcolneaten was not correctly handling trailing
1510           spaces when given the "-E" option to suppress them.  This
1511           regression is now fixed.
1512
1513       EXTENSION
1514           dbroweval(1) can now handle direct references to the last row via
1515           $lfref, a dubious but now documented feature.
1516
1517       BUG FIXES
1518           Separators set with "-C" in dbcolmerge and dbcolsplittocols were
1519           not properly setting the heading, and null fields were not
1520           recognized.  The first bug was reported by Martin Lukac.
1521
1522   2.18,  1-Jul-09  A minor release
1523       IMPROVEMENT
1524           Documentation for Fsdb::IO::Reader has been improved.
1525
1526       IMPROVEMENT
1527           The package should now be PGP-signed.
1528
1529   2.19,  10-Jul-09
1530       BUG FIX
1531           Internal improvements to debugging output and robustness of
1532           dbmapreduce and dbpipeline.  TEST/dbpipeline_first_fails.cmd re-
1533           enabled.
1534
1535   2.20, 30-Nov-09 (A collection of minor bugfixes, plus a build against
1536       Fedora 12.)
1537       BUG FIX
1538           Logging for dbmapreduce with code refs is now stable (it no longer
1539           includes a hex pointer to the code reference).
1540
1541       BUG FIX
1542           Better handling of mixed blank lines in Fsdb::IO::Reader (see test
1543           case dbcolize_blank_lines.cmd).
1544
1545       BUG FIX
1546           html_table_to_db now handles multi-line input better, and handles
1547           tables with COLSPAN.
1548
1549       BUG FIX
1550           dbpipeline now cleans up threads in an "eval" to prevent "cannot
1551           detach a joined thread" errors that popped up in perl-5.10.
1552           Hopefully this prevents a race condition that causes the test
1553           suites to hang about 20% of the time (in dbpipeline_first_fails).
1554
1555       IMPROVEMENT
1556           dbmapreduce now detects and correctly fails when the input and
1557           reducer have incompatible field separators.
1558
1559       IMPROVEMENT
1560           dbcolstats, dbcolhisto, dbcolscorrelate, dbcolsregression, and
1561           dbrowcount now all take an "-F" option to let one specify the
1562           output field separator (so they work better with dbmapreduce).
1563
1564       BUG FIX
1565           An omitted "-k" from the manual page of dbmultistats is now there.
1566           Bug reported by Unkyu Park.
1567
1568   2.21, 17-Apr-10 bug fix release
1569       BUG FIX
1570           Fsdb::IO::Writer now no longer fails with -outputheader => never
1571           (an obscure bug).
1572
1573       IMPROVEMENT
1574           Fsdb (in the warnings section) and dbcolstats now more carefully
1575           document how they handle (and do not handle) numerical precision
1576           problems, and other general limits.  Thanks to Yuri Pradkin for
1577           prompting this documentation.
1578
1579       IMPROVEMENT
1580           "Fsdb::Support::fullname_to_sortkey" is now restored from "Jdb".
1581
1582       IMPROVEMENT
1583           Documentation for multiple styles of input approaches (including
1584           performance description) added to Fsdb::IO.
1585
1586   2.22, 2010-10-31 One new tool dbcolcopylast and several bug fixes for Perl
1587       5.10.
1588       BUG FIX
1589           dbmerge now correctly handles n-way merges.  Bug reported by Yuri
1590           Pradkin.
1591
1592       INCOMPATIBLE CHANGE
1593           dbcolneaten now defaults to not padding the last column.
1594
1595       ADDITION
1596           dbrowenumerate now takes -N NewColumn to give the new column a name
1597           other than "count".  Feature requested by Mike Rouch in January
1598           2005.
1599
1600       ADDITION
1601           New program dbcolcopylast copies the last value of a column into a
1602           new column copylast_column of the next row.  New program requested
1603           by Fabio Silva; useful for converting dbmultistats output into
1604           dbrvstatdiff input.
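
The behavior described for dbcolcopylast can be illustrated with a small Python sketch.  This is not the Fsdb implementation: the function name copy_last, the dict-per-row representation, and the use of "-" as the empty value for the first row are assumptions made for illustration.

```python
def copy_last(rows, column, empty="-"):
    """Add a copylast_<column> field holding the previous row's value.

    rows is a list of dicts (a stand-in for Fsdb rows); the first row
    has no predecessor, so it gets the assumed empty value.
    """
    out = []
    last = empty
    for row in rows:
        new_row = dict(row)              # do not modify the caller's rows
        new_row["copylast_" + column] = last
        out.append(new_row)
        last = row[column]               # remember this value for the next row
    return out
```

Each output row then carries both the current and the previous value of the column, which is the shape a row-wise comparison needs.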
1605
1606       BUG FIX
1607           Several tools (particularly dbmapreduce and dbmultistats) would
1608           report errors like "Unbalanced string table refcount: (1) for
1609           "STDOUT" during global destruction" on exit, at least on certain
1610           versions of Perl (for me on 5.10.1), but similar errors have been
1611           off-and-on for several Perl releases.  Although I think my code
1612           looked OK, I worked around this problem with a different way of
1613           handling standard IO redirection.
1614
1615   2.23, 2011-03-10 Several small portability bugfixes; improved dbcolstats
1616       for large datasets
1617       IMPROVEMENT
1618           Documentation to dbrvstatdiff was changed to use "sd" to refer to
1619           standard deviation, not "ss" (which might be confused with sum-of-
1620           squares).
1621
1622       BUG FIX
1623           This documentation about dbmultistats was missing the -k option in
1624           some cases.
1625
1626       BUG FIX
1627           dbmapreduce was failing on MacOS-10.6.3 for some tests with the
1628           error
1629
1630               dbmapreduce: cannot run external dbmapreduce reduce program (perl TEST/dbmapreduce_external_with_key.pl)
1631
1632           The problem seemed to be only in the error, not in operation.  On
1633           MacOS, the error is now suppressed.  Thanks to Alefiya Hussain for
1634           providing access to a Mac system that allowed debugging of this
1635           problem.
1636
1637       IMPROVEMENT
1638           The csv_to_db command requires an external Perl library
1639           (Text::CSV_XS).  On computers that lack this optional library,
1640           previously Fsdb would configure with a warning and then test cases
1641           would fail.  Now those test cases are skipped with an additional
1642           warning.
1643
1644       BUG FIX
1645           The test suite now supports alternative valid output, as a hack to
1646           account for last-digit floating point differences.  (Not very
1647           satisfying :-( )
1648
1649       BUG FIX
1650           dbcolstats output for confidence intervals on very large datasets
1651           has changed.  Previously it failed for more than 2^31-1 records,
1652           and handling of T-Distributions with thousands of rows was a bit
1653           dubious.  Now datasets with more than 10000 records are considered
1654           infinitely large and hopefully correctly handled.
1655
1656   2.24, 2011-04-15 Improvements to fix an old bug in dbmapreduce with
1657       different field separators
1658       IMPROVEMENT
1659           The dbfilealter command had a "--correct" option to work around
1660           incompatible field separators, but it did nothing.  Now it
1661           does the correct but sad, data-losing thing.
1662
1663       IMPROVEMENT
1664           The dbmultistats command previously failed with an error message
1665           when invoked on input with a non-default field separator.  The root
1666           cause was the underlying dbmapreduce that did not handle the case
1667           of reducers that generated output with a different field separator
1668           than the input.  We now detect and repair incompatible field
1669           separators.  This change corrects a problem originally documented
1670           and detected in Fsdb-2.20.  Bug re-reported by Unkyu Park.
1671
1672   2.25, 2011-08-07 Two new tools, xml_to_db and dbfilepivot, and a bugfix for
1673       two people.
1674       IMPROVEMENT
1675           kitrace_to_db now supports a --utc option, which also fixes this
1676           test case for users outside of the Pacific time zone.  Bug reported
1677           by David Graff, and also by Peter Desnoyers (within a week of each
1678           other :-) )
1679
1680       NEW xml_to_db can convert simple, very regular XML files into Fsdb.
1681
1682       NEW dbfilepivot "pivots" a file, converting multiple rows corresponding
1683           to the same entity into a single row with multiple columns.
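
As a rough illustration of what "pivoting" means here, the following Python sketch turns several rows per entity into one row per entity.  It is not dbfilepivot itself: the function name pivot, its parameters, and the "-" empty value for missing cells are assumptions for illustration.

```python
def pivot(rows, key, tag, value, empty="-"):
    """Collapse rows sharing the same key into one row with a column per tag.

    rows is a list of dicts; key names the entity column, tag names the
    column whose values become new column names, and value names the
    column that supplies the cell contents.
    """
    tags = []        # tag names in order of first appearance
    by_key = {}      # entity -> {tag: value}; dicts keep insertion order
    for row in rows:
        if row[tag] not in tags:
            tags.append(row[tag])
        by_key.setdefault(row[key], {})[row[tag]] = row[value]
    return [
        {key: k, **{t: cells.get(t, empty) for t in tags}}
        for k, cells in by_key.items()
    ]
```

For example, three input rows (a, cpu, 1), (a, mem, 2), (b, cpu, 3) become two output rows, one for a and one for b, each with cpu and mem columns.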
1684
1685   2.26, 2011-12-12 Bug fixes, particularly for perl-5.14.2.
1686       BUG FIX
1687           Bugs fixed in Fsdb::IO::Reader(3) manual page.
1688
1689       BUG FIX
1690           Fixed problems where dbcolstats was truncating floating point
1691           numbers when sorting.  This strange behavior happens as of
1692           perl-5.14.2 and it seems like a Perl bug.  I've worked around it
1693           for the test suites, but I'm a bit nervous.
1694
1695   2.27, 2012-11-15 Accumulated bug fixes.
1696       IMPROVEMENT
1697           csv_to_db now reports errors in CSV input with real diagnostics.
1698
1699       IMPROVEMENT
1700           dbcolmovingstats can now compute median, when given the "-m"
1701           option.
1702
1703       BUG FIX
1704           dbcolmovingstats non-numeric handling (the "-a" option) now works
1705           properly.
1706
1707       DOCUMENTATION
1708           The internal t/test_command.t test framework is now documented.
1709
1710       BUG FIX
1711           dbrowuniq now correctly handles the case where there is no input
1712           (previously it output a blank line, which is a malformed fsdb
1713           file).  Thanks to Yuri Pradkin for reporting this bug.
1714
1715   2.28, 2012-11-15 A quick release to fix most rpmlint errors.
1716       BUG FIX
1717           Fixed a number of minor release problems (wrong permissions, old
1718           FSF address, etc.) found by rpmlint.
1719
1720   2.29, 2012-11-20 a quick release for CPAN testing
1721       IMPROVEMENT
1722           Tweaked the RPM spec.
1723
1724       IMPROVEMENT
1725           Modified Makefile.PL to fail gracefully on Perl installations that
1726           lack threads.  (Without this fix, I get massive failures in the
1727           non-ithreads test system.)
1728
1729   2.30, 2012-11-25 improvements to perl portability
1730       BUG FIX
1731           Removed unicode character in documentation of dbcolscorrelate so pod
1732           tests will pass.  (Sigh, that should work :-( )
1733
1734       BUG FIX
1735           Fixed test suite failures on 5 tests (dbcolcreate_double_creation
1736           was the first) due to Carp's addition of a period.  This problem
1737           was breaking Fsdb on perl-5.17.  Thanks to Michael McQuaid for
1738           helping diagnose this problem.
1739
1740       IMPROVEMENT
1741           The test suite now prints out the names of tests it tries.
1742
1743   2.31, 2012-11-28 A release with actual improvements to dbfilepivot and
1744       dbrowuniq.
1745       BUG FIX
1746           Documentation fixes: typos in dbcolscorrelate, bugs in
1747           dbfilepivot, clarification for comment handling in
1748           Fsdb::IO::Reader.
1749
1750       IMPROVEMENT
1751           Previously dbfilepivot assumed the input was grouped by keys and
1752           didn't verify that pre-condition.  Now there is no pre-condition (it
1753           will sort the input by default), and it checks if the invariant is
1754           violated.
1755
1756       BUG FIX
1757           Previously dbfilepivot failed if the input had comments (oops :-);
1758           no longer.
1759
1760       IMPROVEMENT
1761           Now dbrowuniq has the "-L" option to preserve the last unique row
1762           (instead of the first), a common idiom.
1763
1764   2.32, 2012-12-21 Test suites should now be more numerically robust.
1765       NEW New dbfilediff does fsdb-aware file differencing.  It does not do
1766           smart intuition of add/removes like Unix diff(1), but it does know
1767           about columns, and with "-E", it does numeric-aware differences.
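
The idea of a numeric-aware comparison can be sketched in Python as follows.  This is only an illustration of the concept, not how dbfilediff is implemented: the function name fields_equal and the relative tolerance of 1e-6 are assumptions.

```python
import math


def fields_equal(a, b, rel_tol=1e-6):
    """Compare two field strings, tolerating small numeric differences.

    Identical strings always match.  Otherwise both fields must parse as
    numbers and agree to within rel_tol (an assumed tolerance).
    """
    if a == b:
        return True
    try:
        return math.isclose(float(a), float(b), rel_tol=rel_tol)
    except ValueError:
        # At least one field is not numeric, so the strings simply differ.
        return False
```

A plain textual diff would flag "37.2" against "37.20" as a change; a comparison like this one treats them as the same value.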
1768
1769       IMPROVEMENT
1770           Test suites that are numeric now use dbfilediff to do numeric-aware
1771           comparisons, so the test suite should now be robust to slightly
1772           different computers and operating systems and compilers than
1773           exactly what I use.
1774
1775   2.33, 2012-12-23 Minor fixes to some test cases.
1776       IMPROVEMENT
1777           dbfilediff and dbrowuniq now support the "-N" option to give the
1778           new column a different name.  (And test cases where this
1779           duplication mattered have been fixed.)
1780
1781       IMPROVEMENT
1782           dbrvstatdiff now shows the t-test breakpoint with a reasonable
1783           number of floating point digits.
1784
1785       BUG FIX
1786           Fixed a numerical stability problem in the dbroweval_last test
1787           case.
1788

WHAT'S NEW

1790   2.34, 2013-02-10 Parallelism in dbmerge.
1791       IMPROVEMENT
1792           Documentation for dbjoin now includes resource requirements.
1793
1794       IMPROVEMENT
1795           Default memory usage for dbsort is now about 256MB.  (The world
1796           keeps moving forward.)
1797
1798       IMPROVEMENT
1799           dbmerge now does merging in parallel.  As a side-effect, dbsort
1800           should be faster when input overflows memory.  The level of
1801           parallelism can be limited with the "--parallelism" option.  (There
1802           is more work to do here, but we're off to a start.)
1803
1804   2.35, 2013-02-23 Improvements to dbmerge parallelism
1805       BUG FIX
1806           Fsdb temporary files are now created more securely (with
1807           File::Temp).
1808
1809       IMPROVEMENT
1810           Programs that sort or merge on fields (dbmerge2, dbmerge, dbsort,
1811           dbjoin) now report an error if no fields on which to join or merge
1812           are given.
1813
1814       IMPROVEMENT
1815           Parallelism in dbmerge should now be more consistent, with less
1816           starting and stopping.
1817
1818       IMPROVEMENT In dbmerge, the "--xargs" option lets one give input
1819       filenames on standard input, rather than the command line. This feature
1820       paves the way for faster dbsort for large inputs (by pipelining sorting
1821       and merging), expected in the next release.
1822
1823   2.36, 2013-02-25 dbsort pipelines with dbmerge
1824       IMPROVEMENT For large inputs, dbsort now pipelines sorting and merging,
1825       allowing earlier processing.
1826       BUG FIX Since 2.35, dbmerge delayed cleanup of intermediate files,
1827       thereby requiring extra disk space.
1828
1829   2.37, 2013-02-26 quick bugfix to support parallel sort and merge from
1830       recent releases
1831       BUG FIX Since 2.35, dbmerge delayed removal of input files given by
1832       "--xargs".  This problem is now fixed.
1833
1834   2.38, 2013-04-29 minor bug fixes
1835       CLARIFICATION
1836           Configure now rejects Windows since tests seem to hang on some
1837           versions of Windows.  (I would love help from a Windows developer
1838           to get this problem fixed, but I cannot do it.)  See
1839           https://rt.cpan.org/Ticket/Display.html?id=84201.
1840
1841       IMPROVEMENT
1842           All programs that use temporary files (dbcolpercentile,
1843           dbcolscorrelate, dbcolstats, dbcolstatscores) now take the "-T"
1844           option and set the temporary directory consistently.
1845
1846           In addition, error messages are better when the temporary directory
1847           has problems.  Problem reported by Liang Zhu.
1848
1849       BUG FIX
1850           dbmapreduce was failing with external, map-reduce aware reducers
1851           (when invoked with -M and an external program).  (Sigh, did this
1852           case ever work?)  This case should now work.  Thanks to Yuri
1853           Pradkin for reporting this bug (in 2011).
1854
1855       BUG FIX
1856           Fixed perl-5.10 problem with dbmerge.  Thanks to Yuri Pradkin for
1857           reporting this bug (in 2013).
1858
1859   2.39, 2013-05-31 quick release for the dbrowuniq extension
1860       BUG FIX
1861           Actually in 2.38, the Fedora .spec got cleaner dependencies.
1862           Suggestion from Christopher Meng via
1863           <https://bugzilla.redhat.com/show_bug.cgi?id=877096>.
1864
1865       ENHANCEMENT
1866           Fsdb files are now explicitly set into UTF-8 encoding, unless one
1867           specifies "-encoding" to "Fsdb::IO".
1868
1869       ENHANCEMENT
1870           dbrowuniq now supports "-I" for incremental counting.
1871
1872   2.40, 2013-07-13 small bug fixes
1873       BUG FIX
1874           dbsort now has more respect for a user-given temporary directory;
1875           it no longer is ignored for merging.
1876
1877       IMPROVEMENT
1878           dbrowuniq now has options to output the first, last, and both first
1879           and last rows of a run ("-F", "-L", and "-B").
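
The three output choices for a run of identical keys can be sketched in Python.  This is an illustration of the semantics only, not dbrowuniq's code: the function name uniq_runs and the mode strings are assumptions, and the "both" branch emits a singleton run once, matching the behavior the 2.41 notes describe for "-B".

```python
from itertools import groupby


def uniq_runs(rows, key, mode="first"):
    """Reduce each run of adjacent rows with equal key(row) to selected rows.

    mode is "first", "last", or "both" (loosely mirroring -F, -L, -B).
    """
    out = []
    for _, run in groupby(rows, key=key):
        run = list(run)
        if mode == "first":
            out.append(run[0])
        elif mode == "last":
            out.append(run[-1])
        elif mode == "both":
            out.append(run[0])
            if len(run) > 1:         # a singleton run is emitted only once
                out.append(run[-1])
        else:
            raise ValueError("unknown mode: " + mode)
    return out
```

Like uniq(1), this only collapses adjacent rows, so input is assumed to be grouped by the key already.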
1880
1881       BUG FIX
1882           dbrowuniq now correctly handles "-N".  Sigh, it didn't work before.
1883
1884   2.41, 2013-07-29 small bug and packaging fixes
1885       ENHANCEMENT
1886           Documentation to dbrvstatdiff improved (inspired by questions from
1887           Qian Kun).
1888
1889       BUG FIX
1890           dbrowuniq no longer duplicates singleton unique lines when
1891           outputting both (with "-B").
1892
1893       BUG FIX
1894           Add missing "XML::Simple" dependency to Makefile.PL.
1895
1896       ENHANCEMENT
1897           Tests now show the diff of the failing output if run with "make
1898           test TEST_VERBOSE=1".
1899
1900       ENHANCEMENT
1901           dbroweval now includes documentation for how to output extra rows.
1902           Suggestion from Yuri Pradkin.
1903
1904       BUG FIX
1905           Several improvements to the Fedora package from Michael Schwendt
1906           via <https://bugzilla.redhat.com/show_bug.cgi?id=877096>, and from
1907           the harsh master that is rpmlint.  (I am stymied at teaching it
1908           that "outliers" is spelled correctly.  Maybe I should send it
1909           Schneier's book.  And an unresolvable invalid-spec-name lurks in
1910           the SRPM.)
1911
1912   2.42, 2013-07-31 A bug fix and packaging release.
1913       ENHANCEMENT
1914           Documentation to dbjoin improved to better describe memory usage.  (Based on
1915           problem report by Lin Quan.)
1916
1917       BUG FIX
1918           The .spec is now perl-Fsdb.spec to satisfy rpmlint.  Thanks to
1919           Christopher Meng for a specific bug report.
1920
1921       BUG FIX
1922           Test dbroweval_last.cmd no longer has a column that caused failures
1923           because of numerical instability.
1924
1925       BUG FIX
1926           Some tests now better handle bugs in old versions of perl (5.10,
1927           5.12).  Thanks to Calvin Ardi for help debugging this on a Mac with
1928           perl-5.12, but the fix should affect other platforms.
1929
1930   2.43, 2013-08-27 Adds in-file compression.
1931       BUG FIX
1932           Changed the sort on TEST/dbsort_merge.cmd to strings (from
1933           numerics) so we're less susceptible to false test-failures due to
1934           floating point IO differences.
1935
1936       EXPERIMENTAL ENHANCEMENT
1937           Yet more parallelism in dbmerge: new "endgame-mode" builds a merge
1938           tree of processes at the end of large merge tasks to get maximal
1939           parallelism.  Currently this feature is off by default because it
1940           can hang for some inputs.  Enable this experimental feature with
1941           "--endgame".
1942
1943       ENHANCEMENT
1944           "Fsdb::IO" now handles being given "IO::Pipe" objects (as exercised
1945           by dbmerge).
1946
1947       BUG FIX
1948           Handling of NamedTmpfiles now supports concurrency.  This fix will
1949           hopefully fix occasional "Use of uninitialized value $_ in string
1950           ne at ...NamedTmpfile.pm line 93."  errors.
1951
1952       BUG FIX
1953           Fsdb now requires perl 5.10.  This is a bug fix because some test
1954           cases used to require it, but this fact was not properly
1955           documented.  (Back-porting to 5.008 would require removing all "//"
1956           operators.)
1957
1958       ENHANCEMENT
1959           Fsdb now handles automatic compression of file contents.  Enable
1960           compression with "dbfilealter -Z xz" (or "gz" or "bz2").  All
1961           programs should operate on compressed files and leave the output
1962           with the same level of compression.  "xz" is recommended as fastest
1963           and most efficient.  "gz" produces unrepeatable output (and so
1964           has no output test); it seems to insist on adding a timestamp.
1965
1966   2.44, 2013-10-02 A major change--all threads are gone.
1967       ENHANCEMENT
1968           Fsdb is now thread free and only uses processes for parallelism.
1969           This change is a big change--the entire motivation for Fsdb-2 was
1970           to exploit parallelism via threading.  Parallelism--good, but perl
1971           threading--bad for performance.  Horribly bad for performance.
1972           About 20x worse than pipes on my box.  (See perl bug #119445 for
1973           the discussion.)
1974
1975       NEW "Fsdb::Support::Freds" provides a thread-like abstraction over
1976           forking, with some nice support for callbacks in the parent upon
1977           child termination.
1978
1979       ENHANCEMENT
1980           Details about removing threads: "dbpipeline" is thread free, and
1981           new tests verify each of its parts.  The easy cases are
1982           "dbcolpercentile", "dbcolstats", "dbfilepivot", "dbjoin", and
1983           "dbcolstatscores", each of which uses it in simple ways
1984           (2013-09-09).  "dbmerge" is now thread free (2013-09-13), but was a
1985           significant rewrite, which brought "dbsort" along.  "dbmapreduce"
1986           is partly thread free (2013-09-21), again as a rewrite, and it
1987           brings "dbmultistats" along.  Full "dbmapreduce" support took much
1988           longer (2013-10-02).
1989
1990       BUG FIX
1991           When running with user-only output ("-n"), dbroweval now resets the
1992           output vector $ofref after it has been output.
1993
1994       NEW dbcolcreate will create all columns at the head of each row with
1995           the "--first" option.
1996
1997       NEW dbfilecat will concatenate two files, verifying that they have the
1998           same schema.
1999
2000       ENHANCEMENT
2001           dbmapreduce now passes comments through, rather than eating them as
2002           before.
2003
2004           Also, dbmapreduce now supports a "--" option to prevent
2005           misinterpreting sub-program parameters as for dbmapreduce.
2006
2007       INCOMPATIBLE CHANGE
2008           dbmapreduce no longer figures out if it needs to add the key to the
2009           output.  For multi-key-aware reducers, it never does (and cannot).
2010           For non-multi-key-aware reducers, it defaults to add the key and
2011           will now fail if the reducer adds the key (with error "dbcolcreate:
2012           attempt to create pre-existing column...").  In such cases, one
2013           must disable adding the key with the new option "--no-prepend-key".
2014
2015       INCOMPATIBLE CHANGE
2016           dbmapreduce no longer copies the input field separator by default.
2017           For multi-key-aware reducers, it never does (and cannot).  For non-
2018           multi-key-aware reducers, it defaults to not copying the field
2019           separator, but it will copy it (the old default) with the
2020           "--copy-fs" option.
2021
2022   2.45, 2013-10-07 cleanup from de-thread-ification
2023       BUG FIX
2024           Corrected a fast busy-wait in dbmerge.
2025
2026       ENHANCEMENT
2027           Endgame mode enabled in dbmerge; it (and also large cases of
2028           dbsort) should now exploit greater parallelism.
2029
2030       BUG FIX
2031           Test case with "Fsdb::BoundedQueue" (gone since 2.44) now removed.
2032
2033   2.46, 2013-10-08 continuing cleanup of our no-threads version
2034       BUG FIX
2035           Fixed some packaging details.  (Really, threads are no longer
2036           required, missing tests in the MANIFEST.)
2037
2038       IMPROVEMENT
2039           dbsort now better communicates with the merge process to avoid
2040           bursty parallelism.
2041
2042           Fsdb::IO::Writer now can take "-autoflush => 1" for line-buffered
2043           IO.
2044
2045   2.47, 2013-10-12 test suite cleanup for non-threaded perls
2046       BUG FIX
2047           Removed some stray "use threads" in some test cases.  We didn't
2048           need them, and these were breaking non-threaded perls.
2049
2050       BUG FIX
2051           Better handling of Fred cleanup; should fix intermittent
2052           dbmapreduce failures on BSD.
2053
2054       ENHANCEMENT
2055           Improved test framework to show output when tests fail.  (This
2056           time, for real.)
2057
2058   2.48, 2014-01-03 small bugfixes and improved release engineering
2059       ENHANCEMENT
2060           Test suites now skip tests for libraries that are missing.  (Patch
2061           for missing "IO::Compress::Xz" contributed by Calvin Ardi.)
2062
2063       ENHANCEMENT
2064           Removed references to Jdb in the package specification.  Since the
2065           name was changed in 2008, there's no longer a huge need for
2066           backwards compatibility.  (Suggestion from Petr Šabata.)
2067
2068       ENHANCEMENT
2069           Test suites now invoke the perl using the path from
2070           $Config{perlpath}.  Hopefully this helps testing in environments
2071           where there are multiple installed perls and the default perl is
2072           not the same as the perl-under-test (as happens in
2073           cpantesters.org).
2074
2075       BUG FIX
2076           Added specific encoding to this manpage to account for Unicode.
2077           Required to build correctly against perl-5.18.
2078
2079   2.49, 2014-01-04 bugfix to unicode handling in Fsdb IO (plus minor
2080       packaging fixes)
2081       BUG FIX
2082           Restored a line in the .spec to chmod g-s.
2083
2084       BUG FIX
2085           Unicode decoding is now handled correctly for programs that read
2086           from standard input.  (Also: New test scripts cover unicode input
2087           and output.)
2088
2089       BUG FIX
2090           Fix to Fsdb documentation encoding line.  Addresses test failure in
2091           perl-5.16 and earlier.  (Who knew "encoding" had to be followed by
2092           a blank line.)
2093

WHAT'S NEW

2095   2.50, 2014-05-27 a quick release for spec tweaks
2096       ENHANCEMENT
2097           In dbroweval, the "-N" (no output, even comments) option now
2098           implies "-n", and it now suppresses the header and trailer.
2099
2100       BUG FIX
2101           A few more tweaks to the perl-Fsdb.spec from Petr Šabata.
2102
2103       BUG FIX
2104           Fixed 3 uses of "use v5.10" in test suites that were causing test
2105           failures (due to warnings, not real failures) on some platforms.
2106
2107   2.51, 2014-09-05 Feature enhancements to dbcolmovingstats, dbcolcreate,
2108       dbmapreduce, and new sqlselect_to_db
2109       ENHANCEMENT
2110           dbcolcreate now has a "--no-recreate-fatal" that causes it to
2111           ignore creation of existing columns (instead of failing).
2112
2113       ENHANCEMENT
2114           dbmapreduce once again is robust to reducers that output the key;
2115           "--no-prepend-key" is no longer mandatory.
2116
2117       ENHANCEMENT
2118           dbcolsplittorows can now enumerate the output rows with "-E".
2119
2120       BUG FIX
2121           dbcolmovingstats is more mathematically robust.  Previously for
2122           some inputs and some platforms, floating point rounding could
2123           sometimes cause square roots of negative numbers.
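
The rounding problem described above is easy to reproduce with a sum-of-squares variance formula, and one common guard is to clamp the variance at zero before taking the square root.  The Python sketch below shows that guard; it is an assumption about the general technique, not a description of what dbcolmovingstats actually does, and the function name moving_stats is hypothetical.

```python
import math


def moving_stats(values, window):
    """Return (mean, sample stddev) for each full window over values."""
    out = []
    for i in range(window - 1, len(values)):
        w = values[i - window + 1 : i + 1]
        n = len(w)
        s = sum(w)
        ss = sum(v * v for v in w)
        mean = s / n
        # The sum-of-squares form can come out slightly negative when all
        # values are (nearly) equal, purely from floating point rounding.
        var = (ss - s * s / n) / (n - 1) if n > 1 else 0.0
        var = max(var, 0.0)          # clamp so sqrt never sees a negative
        out.append((mean, math.sqrt(var)))
    return out
```

Without the clamp, a window of identical values can raise a math domain error on some inputs; with it, the standard deviation is reported as zero or a value very close to zero.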
2124
2125       NEW sqlselect_to_db converts the output of the MySQL or MariaDB select
2126           command into fsdb format.
2127
2128       INCOMPATIBLE CHANGE
2129           dbfilediff now outputs the second row when doing sloppy numeric
2130           comparisons, to better support test suites.
2131
2132   2.52, 2014-11-03 Fixing the test suite for line number changes.
2133       ENHANCEMENT
2134           Test suites changed to be robust to exact line numbers of failures,
2135           since different Perl releases fail on different lines.
2136           <https://bugzilla.redhat.com/show_bug.cgi?id=1158380>
2137
2138   2.53, 2014-11-26 bug fixes and stability improvements to dbmapreduce
2139       ENHANCEMENT
2140           dbfilediff now supports a "--quiet" option.
2141
2142       ENHANCEMENT
2143           Better documentation of dbpipeline_filter.
2144
2145       BUGFIX
2146           Added groff-base and perl-podlators to the Fedora package spec.
2147           Fixes <https://bugzilla.redhat.com/show_bug.cgi?id=1163149>.  (Also
2148           in package 2.52-2.)
2149
2150       BUGFIX
2151           An important stability improvement to dbmapreduce.  It, plus
2152           dbmultistats, and dbcolstats now support controlled parallelism
2153           with the "--parallelism=N" option.  They default to run with the
2154           number of available CPUs.  dbmapreduce also moderates its level of
2155           parallelism.  Previously it would create reducers as needed,
2156           causing CPU thrashing if reducers ran much slower than data
2157           production.
2158
2159       BUGFIX
2160           The combination of dbmapreduce with dbrowenumerate now works as it
2161           should.  (The obscure bug was an interaction with dbcolcreate with
2162           non-multi-key reducers that output their own key.  dbmapreduce has
2163           too many useful corner cases.)
2164
2165   2.54, 2014-11-28 fix for the test suite to correct failing tests on not-my-
2166       platform
2167       BUGFIX
2168           Sigh, the test suite now has a test suite.  Because, yes, I broke
2169           it, causing many incorrect failures at cpantesters.  Now fixed.
2170
2171   2.55, 2015-01-05 many spelling fixes and dbcolmovingstats tests are more
2172       robust to different numeric precision
2173       ENHANCEMENT
2174           dbfilediff now can be extra quiet, as I continue to try to track
2175           down a numeric difference on FreeBSD AMD boxes.
2176
2177       ENHANCEMENT
2178           dbcolmovingstats gave different test output (just reflecting
2179           rounding error) when stddev approaches zero.  We now detect and
2180           handle this case.  See
2181           <https://rt.cpan.org/Public/Bug/Display.html?id=101220> and thanks
2182           to H. Merijn Brand for the bug report.
2183
2184       BUG FIX
2185           Many, many spelling bugs found by H. Merijn Brand; thanks for the
2186           bug report.
2187
2188       INCOMPATIBLE CHANGE
2189           A number of programs had misspelled "separator" in
2190           "--fieldseparator" and "--columnseparator" options as "seperator".
2191           These are now correctly spelled.
2192
2193   2.56, 2015-02-03 fix against Getopt::Long-2.43's stricter error checking
2194       BUG FIX
2195           Internal argument parsing uses Getopt::Long, but mixed pass-through
2196           and <>.  Bug reported by Petr Pisar at
2197           <https://bugzilla.redhat.com/show_bug.cgi?id=1188538>.
2198
2199       BUG FIX
2200           Added missing BuildRequires for "XML::Simple".
2201
2202   2.57, 2015-04-29 Minor changes, with better performance from dbmultistats.
2203       BUG FIX
2204           dbfilecat now honors "--remove-inputs" (previously it didn't).
2205           This omission meant that dbmapreduce (and dbmultistats) would
2206           accumulate files in /tmp when running.  Bad news for inputs with 4M
2207           keys.
2208
2209       ENHANCEMENT
2210           dbmultistats should be faster with lots of small keys.  dbcolstats
2211           now supports "-k" to get some of the functionality of dbmultistats
2212           (if data is pre-sorted and median/quartiles are not required).
2218
2219   2.58, 2015-04-30 Bugfix in dbmerge
2220       BUG FIX
2221           Fixed a case where dbmerge suffered mojibake in endgame mode.  This
2222           bug surfaced when dbsort was applied to large files (big enough to
2223           require merging) with Unicode in them; the symptom was something
2224           like:
2225             Wide character in print at /usr/lib64/perl5/IO/Handle.pm line
2226           420, <GEN12> line 111.
2227
2228   2.59, 2016-09-01 Collect a few small bug fixes and documentation
2229       improvements.
2230       BUG FIX
2231           More IO is explicitly marked UTF-8 to avoid Perl's tendency to
2232           mojibake on otherwise valid unicode input.  This change helps
2233           html_table_to_db.
2234
2235       ENHANCEMENT
2236           dbcolscorrelate now cross-references dbcolsregression.
2237
2238       ENHANCEMENT
2239           Documentation for dbrowdiff now clarifies that the default is
2240           baseline mode.
2241
2242       BUG FIX
2243           dbjoin now propagates "-T" into the sorting process (if it is
2244           required).  Thanks to Lan Wei for reporting this bug.
2245
2246   2.60, 2016-09-04 Adds support for hash joins.
2247       ENHANCEMENT
2248           dbjoin now supports hash joins with "-t lefthash" and "-t
2249           righthash".  Hash joins cache a table in memory, but do not require
2250           that the other table be sorted.  They are ideal when joining a
2251           large table against a small one.
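
               As a rough illustration of the idea (not Fsdb's own code), the
               sketch below does a hash join with plain awk.  The function name
               hash_join, the two-column layout, and the assumption that keys
               are unique in the cached table are all hypothetical choices made
               for this example.

```shell
# Conceptual hash join in awk; an illustration only, not how dbjoin is written.
# Usage: hash_join CACHED_FILE STREAMED_FILE
# Assumes whitespace-separated rows with the join key in column 1 and
# one value in column 2, and unique keys in CACHED_FILE (hypothetical layout).
hash_join() {
    awk '
        # Skip the "#fsdb" header and any other comment lines.
        /^#/ { next }
        # First file: cache each row in memory, indexed by its key.
        FILENAME == ARGV[1] { cache[$1] = $2; next }
        # Second file: rows may arrive in any order; probe the cache.
        $1 in cache { print $1, $2, cache[$1] }
    ' "$1" "$2"
}
```

               Only the first file is held in memory, so it should be the small
               table.  The second file is streamed once without sorting, and
               output follows its row order.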
2252
2253   2.61, 2016-09-05 Support left and right outer joins.
2254       ENHANCEMENT
2255           dbjoin now handles left and right outer joins with "-t left" and
2256           "-t right".
2257
2258       ENHANCEMENT
2259           dbjoin hash joins are now selected with "-m lefthash" and "-m
2260           righthash" (not the short-lived "-t righthash" option).
2261           (Technically this change is incompatible with Fsdb-2.60, but no one
2262           but me ever used that version.)
2263
2264   2.62, 2016-11-29 A new yaml_to_db and other minor improvements.
2265       ENHANCEMENT
2266           Documentation for xml_to_db now includes sample output.
2267
2268       NEW yaml_to_db converts a specific form of YAML to fsdb.
2269
2270       BUG FIX
2271           The test suite now uses "diff -c -b" rather than "diff -cb" to make
2272           OpenBSD-5.9 happier, I hope.
2273
2274       ENHANCEMENT
2275           Comments that log operations at the end of each file now do simple
2276           quoting of spaces.  (It is not guaranteed to be fully shell-
2277           compliant.)
2278
2279       ENHANCEMENT
2280           There is a new standard option, "--header", allowing one to specify
2281           an Fsdb header for inputs that lack it.  Currently it is supported
2282           by dbcoldefine, dbrowuniq, dbmapreduce, dbmultistats, dbsort,
2283           dbpipeline.
2284
2285       ENHANCEMENT
2286           dbfilepivot now allows the --possible-pivots option, and if it is
2287           provided processes the data in one pass.
2288
2289       ENHANCEMENT
2290           dbroweval logs are now quoted.
2291
2292   2.63, 2017-02-03 Re-add some features supposedly in 2.62 but not, and add
2293       more --header options.
2294       ENHANCEMENT
2295           The option -j is now a synonym for --parallelism.  (And several
2296           documentation bugs about this option are fixed.)
2297
2298       ENHANCEMENT
2299           Additional support for "--header" in dbcolmerge, dbcol, dbrow, and
2300           dbroweval.
2301
2302       BUG FIX
2303           Version 2.62 was supposed to have this improvement, but did not
2304           (and now does): dbfilepivot now allows the --possible-pivots
2305           option, and if it is provided processes the data in one pass.
2306
2307       BUG FIX
2308           Version 2.62 was supposed to have this improvement, but did not
2309           (and now does): dbroweval logs are now quoted.
2310
2311   2.64, 2017-11-20 several small bugfixes and enhancements
2312       BUG FIX
2313           In dbroweval, the "next row" option previously did not correctly
2314           set up "_last_fieldname".  It now does.
2315
2316       ENHANCEMENT
2317           The csv_to_db converter now has an optional "-F x" option to set
2318           the field separator.
2319
2320       ENHANCEMENT
2321           Finally dbcolsplittocols has a "--header" option, and a new "-N"
2322           option to give the list of resulting output columns.
2323
2324       INCOMPATIBLE CHANGE
2325           dbcolstats and dbmultistats now produce only a schema (no output
2326           rows) when given a schema but no input rows.  Previously they gave
2327           a null row of output.  The "--output-on-no-input" and
2328           "--no-output-on-no-input" options can control this behavior.
2329
2330   2.65, 2018-02-16 Minor release, bug fix and -F option.
2331       ENHANCEMENT
2332           dbmultistats and dbmapreduce now both take a "-F x" option to set
2333           the field separator.
2334
2335       BUG FIX
2336           Fixed missing "use Carp" in dbcolstats.  Also went back and cleaned
2337           up all uses of "croak()".  Thanks to Zefram for the bug report.
2338
2339   2.66, 2018-12-20 Critical bug fix in dbjoin.
2340       BUG FIX
2341           Removed old tests from MANIFEST.  (Thanks to Hang Guo for reporting
2342           this bug.)
2343
2344       IMPROVEMENT
2345           Errors for non-existing input files now include the bad filename
2346           (before: "cannot setup filehandle", now: "cannot open input: cannot
2347           open TEST/bad_filename").
2348
2349       BUG FIX
2350           Hash joins with three identical rows were failing with the
2351           assertion failure "internal error: confused about overflow" due to
2352           a now-fixed bug.
2353
2354   2.67, 2019-07-10 add support for reading and writing hdfs
2355       IMPROVEMENT
2356           dbformmail now has an "mh" mechanism that writes messages to
2357           individual files (an mh-style mailbox).
2358
2359       BUG FIX
2360           dbrow failed to include the Carp library, leading to failures on
2361           croak.
2362
2363       BUG FIX
2364           The dbjoin error message for an unsorted right stream was
2365           incorrect (it said left); this is now fixed.
2366
2367       IMPROVEMENT
2368           All Fsdb programs can now read from and write to HDFS, when files
2369           that start with "hdfs:" are given to -i and -o options.
2370

AUTHOR

2372       John Heidemann, "johnh@isi.edu"
2373
2374       See "Contributors" for the many people who have contributed bug reports
2375       and fixes.
2376

COPYRIGHT

2378       Fsdb is Copyright (C) 1991-2016 by John Heidemann <johnh@isi.edu>.
2379
2380       This program is free software; you can redistribute it and/or modify it
2381       under the terms of version 2 of the GNU General Public License as
2382       published by the Free Software Foundation.
2383
2384       This program is distributed in the hope that it will be useful, but
2385       WITHOUT ANY WARRANTY; without even the implied warranty of
2386       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
2387       General Public License for more details.
2388
2389       You should have received a copy of the GNU General Public License along
2390       with this program; if not, write to the Free Software Foundation, Inc.,
2391       675 Mass Ave, Cambridge, MA 02139, USA.
2392
2393       A copy of the GNU General Public License can be found in the file
2394       ``COPYING''.
2395

COMMENTS and BUG REPORTS

2397       Any comments about these programs should be sent to John Heidemann
2398       "johnh@isi.edu".
2399
2400
2401
2402perl v5.30.0                      2019-09-19                           Fsdb(3)