Fsdb(3pm)

Fsdb(3)               User Contributed Perl Documentation              Fsdb(3)

NAME

6       Fsdb - a flat-text database for shell scripting
7

SYNOPSIS

       Fsdb, the flatfile streaming database, is a package of commands for
       manipulating flat-ASCII databases from shell scripts.  Fsdb is useful
       to process medium amounts of data (with very little data you'd do it by
       hand, with megabytes you might want a real database).  Fsdb was known
       as Jdb from 1991 to Oct. 2008.
14
15       Fsdb is very good at doing things like:
16
17       •   extracting measurements from experimental output
18
19       •   examining data to address different hypotheses
20
21       •   joining data from different experiments
22
23       •   eliminating/detecting outliers
24
25       •   computing statistics on data (mean, confidence intervals,
26           correlations, histograms)
27
28       •   reformatting data for graphing programs
29
30       Fsdb is built around the idea of a flat text file as a database.  Fsdb
31       files (by convention, with the extension .fsdb), have a header
32       documenting the schema (what the columns mean), and then each line
33       represents a database record (or row).
34
35       For example:
36
37               #fsdb experiment duration
38               ufs_mab_sys 37.2
39               ufs_mab_sys 37.3
40               ufs_rcp_real 264.5
41               ufs_rcp_real 277.9
42
       is a simple file with four experiments (the rows), each with a
       description and a run time in the first and second columns.
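       As a rough illustration of the format (and not of how Fsdb itself is
       implemented), the sketch below uses plain POSIX awk to read the
       "#fsdb" header, find a column by name, and average it for one
       experiment.  The helper name fsdb_mean and its output format are
       assumptions made for this example, and it assumes a header with no
       options before the column names.

```shell
# Hypothetical helper (not part of Fsdb): average the column named $2 over
# the rows whose first column equals $1, reading an fsdb-style table on stdin.
fsdb_mean() {
    awk -v key="$1" -v want="$2" '
        NR == 1 {                        # header: "#fsdb name1 name2 ..."
            for (i = 2; i <= NF; i++)
                if ($i == want) col = i - 1
            if (!col) exit 1             # unknown column name
            next
        }
        /^#/ { next }                    # skip comment lines
        $1 == key { sum += $col; n++ }   # look the value up by position found above
        END { if (n) printf "%.2f\n", sum / n }
    '
}

# Usage: mean duration of the ufs_mab_sys rows from the example table.
fsdb_mean ufs_mab_sys duration <<'EOF'
#fsdb experiment duration
ufs_mab_sys 37.2
ufs_mab_sys 37.3
ufs_rcp_real 264.5
ufs_rcp_real 277.9
EOF
```

       Because the column is found by name, the same call keeps working if
       another column is later added before duration.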
46
       Rather than hand-code scripts to do each special case, Fsdb provides
       higher-level functions.  Although it's often easy to throw together a
       custom script to do any single task, I believe that there are several
       advantages to using Fsdb:
51
52       •   these programs provide a higher level interface than plain Perl, so
53
54           **  Fewer lines of simpler code:
55
56                   dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
57
58               Picks out just one type of experiment and computes statistics
59               on it, rather than:
60
                   while (<>) { @F = split; $sum+=$F[1]; $ss+=$F[1]**2; $n++; }
                   $mean = $sum / $n; $std_dev = ...
63
64               in dozens of places.
65
66       •   the library uses names for columns, so
67
68           **  No more $F[1], use "_duration".
69
70           **  New or different order columns?  No changes to your scripts!
71
72           Thus if your experiment gets more complicated with a size
73           parameter, so your log changes to:
74
75                   #fsdb experiment size duration
76                   ufs_mab_sys 1024 37.2
77                   ufs_mab_sys 1024 37.3
78                   ufs_rcp_real 1024 264.5
79                   ufs_rcp_real 1024 277.9
80                   ufs_mab_sys 2048 45.3
81                   ufs_mab_sys 2048 44.2
82
83           Then the previous scripts still work, even though duration is now
84           the third column, not the second.
85
       •   A series of actions is self-documenting (each program records what
           it does).
88
89           **  No more wondering what hacks were used to compute the final
90               data, just look at the comments at the end of the output.
91
92           For example, the commands
93
94               dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
95
96           add to the end of the output the lines
97               #    | dbrow _experiment eq "ufs_mab_sys"
98               #    | dbcolstats duration
99
       •   The library is mature: it supports large datasets (more than 100GB),
           handles corner cases and errors, and is backed by an automated test
           suite.
102
103           **  No more puzzling about bad output because your custom script
104               skimped on error checking.
105
106           **  No more memory thrashing when you try to sort ten million
107               records.
108
109       •   Fsdb-2.x supports Perl scripting (in addition to shell scripting),
110           with libraries to do Fsdb input and output, and easy support for
111           pipelines.  The shell script
112
113               dbcol name test1 | dbroweval '_test1 += 5;'
114
115           can be written in perl as:
116
117               dbpipeline(dbcol(qw(name test1)), dbroweval('_test1 += 5;'));
118
119       (The disadvantage is that you need to learn what functions Fsdb
120       provides.)
121
122       Fsdb is built on flat-ASCII databases.  By storing data in simple text
123       files and processing it with pipelines it is easy to experiment (in the
124       shell) and look at the output.  To the best of my knowledge, the
125       original implementation of this idea was "/rdb", a commercial product
126       described in the book UNIX relational database management: application
127       development in the UNIX environment by Rod Manis, Evan Schaffer, and
128       Robert Jorgensen (and also at the web page <http://www.rdb.com/>).
129       Fsdb is an incompatible re-implementation of their idea without any
130       accelerated indexing or forms support.  (But it's free, and probably
131       has better statistics!).
132
133       Fsdb-2.x will exploit multiple processors or cores, and provides Perl-
134       level support for input, output, and threaded-pipelines.  (As of
135       Fsdb-2.44 it no longer uses Perl threading, just processes, since they
136       are faster.)
137
138       Installation instructions follow at the end of this document.  Fsdb-2.x
139       requires Perl 5.8 to run.  All commands have manual pages and provide
140       usage with the "--help" option.  All commands are backed by an
141       automated test suite.
142
143       The most recent version of Fsdb is available on the web at
144       <http://www.isi.edu/~johnh/SOFTWARE/FSDB/index.html>.
145

WHAT'S NEW

147   2.72, 2020-12-01 A small bug and a packaging improvement.
148       BUG FIX
149           dbcolhisto now handles the degenerate case where everything has the
150           same value (previously it would throw "illegal division by zero").
151
152       ENHANCEMENT
153           The spec for Fedora now includes "make" as BuildRequires, something
154           required for Fedora 34.
155

README CONTENTS

157       executive summary
158       what's new
159       README CONTENTS
160       installation
161       basic data format
162       basic data manipulation
163       list of commands
164       another example
165       a gradebook example
166       a password example
167       history
168       related work
169       release notes
170       copyright
171       comments
172

INSTALLATION

       Fsdb now uses the standard Perl build and installation from
       ExtUtils::MakeMaker(3), so the quick answer to installation is to type:
176
177           perl Makefile.PL
178           make
179           make test
180           make install
181
182       Or, if you want to install it somewhere else, change the first line to
183
184           perl Makefile.PL PREFIX=$HOME
185
       and it will go in your home directory's bin, etc.  (See
       ExtUtils::MakeMaker(3) for more details.)
188
189       Fsdb requires perl 5.8 or later.
190
191       A test-suite is available, run it with
192
193           make test
194
       In the past, ports existed for FreeBSD and MacOS.  If someone running
       one of those OSes wants to contribute a new port, please let me know.
198

BASIC DATA FORMAT

       These programs are based on the idea of storing data in simple ASCII
       files.  A database is a file with one header line and then data or
       comment lines.  For example:
203
204               #fsdb account passwd uid gid fullname homedir shell
205               johnh * 2274 134 John_Heidemann /home/johnh /bin/bash
206               greg * 2275 134 Greg_Johnson /home/greg /bin/bash
207               root * 0 0 Root /root /bin/bash
208               # this is a simple database
209
       The header line must be first and begins with "#fsdb".  There are rows
       (records) and columns (fields), just like in a normal database.
       Comment lines begin with "#".  Column names are any string not
       containing spaces or single quotes (although it is prudent to keep them
       alphanumeric with underscore).
215
216       By default, columns are delimited by whitespace.  With this default
217       configuration, the contents of a field cannot contain whitespace.
218       However, this limitation can be relaxed by changing the field separator
219       as described below.
220
221       The big advantage of simple flat-text databases is that it is usually
222       easy to massage data into this format, and it's reasonably easy to take
223       data out of this format into other (text-based) programs, like gnuplot,
224       jgraph, and LaTeX.  Think Unix.  Think pipes.  (Or even output to Excel
225       and HTML if you prefer.)
226
227       Since no-whitespace in columns was a problem for some applications,
228       there's an option which relaxes this rule.  You can specify the field
229       separator in the table header with "-F x" where "x" is a code for the
230       new field separator.  A full list of codes is at dbfilealter(1), but
231       two common special values are "-F t" which is a separator of a single
       tab character, and "-F S", a separator of two spaces.  Both allow
       (single) spaces in fields.  An example:
234
235               #fsdb -F S account passwd uid gid fullname homedir shell
236               johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
237               greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash
238               root  *  0  0  Root  /root  /bin/bash
239               # this is a simple database
240
241       See dbfilealter(1) for more details.  Regardless of what the column
242       separator is for the body of the data, it's always whitespace in the
243       header.
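       To make the separator rule concrete, here is a minimal sketch in POSIX
       shell and awk that reads the header, picks the row separator from an
       "-F t" or "-F S" code, and prints one named column.  The helper name
       fsdb_col is hypothetical, only the two codes mentioned above are
       handled, and "-F" is assumed to be the only header option; the real
       tools accept more (see dbfilealter(1)).

```shell
# Hypothetical helper (not part of Fsdb): print the column named $1,
# honoring the "-F t" (tab) and "-F S" (two spaces) header codes.
fsdb_col() {
    awk -v want="$1" '
        NR == 1 {
            sep = "[ \t]+"                   # default: any whitespace
            first = 2                        # header field holding the first column name
            if ($2 == "-F") {
                if ($3 == "t") sep = "\t"
                else if ($3 == "S") sep = "  "
                first = 4
            }
            # The header itself is always whitespace-separated.
            for (i = first; i <= NF; i++)
                if ($i == want) col = i - first + 1
            if (!col) exit 1                 # unknown column name
            next
        }
        /^#/ { next }                        # skip comment lines
        { split($0, f, sep); print f[col] }  # split the row on the chosen separator
    '
}

# Usage: full names may now contain single spaces.
fsdb_col fullname <<'EOF'
#fsdb -F S account passwd uid gid fullname homedir shell
johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash
# this is a simple database
EOF
```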
244
       There's also a third format: a "list".  Because it's often hard to see
       which column is which past the first two, in list format each "column"
       is on a separate line.  The programs dblistize and dbcolize convert to
       and from this format, and all programs work with either format.  The
       command
249
250           dbfilealter -R C  < DATA/passwd.fsdb
251
252       outputs:
253
254               #fsdb -R C account passwd uid gid fullname homedir shell
255               account:  johnh
256               passwd:   *
257               uid:      2274
258               gid:      134
259               fullname: John_Heidemann
260               homedir:  /home/johnh
261               shell:    /bin/bash
262
263               account:  greg
264               passwd:   *
265               uid:      2275
266               gid:      134
267               fullname: Greg_Johnson
268               homedir:  /home/greg
269               shell:    /bin/bash
270
271               account:  root
272               passwd:   *
273               uid:      0
274               gid:      0
275               fullname: Root
276               homedir:  /root
277               shell:    /bin/bash
278
279               # this is a simple database
280               #  | dblistize
281
282       See dbfilealter(1) for more details.
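       The conversion itself is simple enough to sketch in a few lines of
       awk.  This is only an illustration of the list layout, not a
       replacement for dbfilealter: the helper name fsdb_to_list is
       hypothetical, it assumes a whitespace-separated input whose header
       has no options, and it does not align the values or append a log
       comment as the real tool does.

```shell
# Hypothetical helper (not part of Fsdb): rewrite a column-format table on
# stdin as "name: value" lines, one record per paragraph.
fsdb_to_list() {
    awk '
        NR == 1 {
            n = NF - 1
            for (i = 2; i <= NF; i++) name[i - 1] = $i   # remember column names
            sub(/^#fsdb/, "#fsdb -R C")                  # mark the output as list format
            print
            next
        }
        /^#/ { print; next }                             # comments pass through
        {
            for (i = 1; i <= n; i++) print name[i] ": " $i
            print ""                                     # blank line ends the record
        }
    '
}

# Usage:
fsdb_to_list <<'EOF'
#fsdb account uid
johnh 2274
root 0
EOF
```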
283

BASIC DATA MANIPULATION

285       A number of programs exist to manipulate databases.  Complex functions
286       can be made by stringing together commands with shell pipelines.  For
       example, to print the home directories of everyone with "John" in
       their names, you would do:
289
290               cat DATA/passwd | dbrow '_fullname =~ /John/' | dbcol homedir
291
292       The output might be:
293
294               #fsdb homedir
295               /home/johnh
296               /home/greg
297               # this is a simple database
298               #  | dbrow _fullname =~ /John/
299               #  | dbcol homedir
300
301       (Notice that comments are appended to the output listing each command,
302       providing an automatic audit log.)
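       The audit-log idea can be imitated by any filter: pass the header and
       existing comments through untouched, and append one comment naming
       yourself at the end.  The sketch below does this in awk.  The helper
       name fsdb_grep_logged is hypothetical, and the exact text Fsdb
       programs write in their trailing comments may differ from what is
       shown here.

```shell
# Hypothetical helper (not part of Fsdb): keep rows matching regex $1 and
# record the step as a trailing comment, so pipelines document themselves.
fsdb_grep_logged() {
    awk -v pat="$1" '
        NR == 1 || /^#/ { print; next }   # header and earlier comments pass through
        $0 ~ pat { print }                # keep matching data rows
        END { print "#  | fsdb_grep_logged " pat }
    '
}

# Usage: two chained steps leave two log lines, in order.
fsdb_grep_logged john <<'EOF' | fsdb_grep_logged h
#fsdb account
johnh
greg
# this is a simple database
EOF
```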
303
304       In addition to typical database functions (select, join, etc.) there
305       are also a number of statistical functions.
306
307       The real power of Fsdb is that one can apply arbitrary code to rows to
308       do powerful things.
309
310               cat DATA/passwd | dbroweval '_fullname =~ s/(\w+)_(\w+)/$2,_$1/'
311
312       converts "John_Heidemann" into "Heidemann,_John".  Not too much more
313       work could split fullname into firstname and lastname fields.
314
       Or:

               cat DATA/passwd | dbcolcreate sort | dbroweval -b 'use Fsdb::Support' \
                       '_sort = _fullname; _sort =~ s/_/ /g; _sort = fullname_to_sort(_sort);'
319

TALKING ABOUT COLUMNS

321       An advantage of Fsdb is that you can talk about columns by name
322       (symbolically) rather than simply by their positions.  So in the above
323       example, "dbcol homedir" pulled out the home directory column, and
324       "dbrow '_fullname =~ /John/'" matched against column fullname.
325
326       In general, you can use the name of the column listed on the "#fsdb"
327       line to identify it in most programs, and _name to identify it in code.
328
329       Some alternatives for flexibility:
330
331       •   Numeric values identify columns positionally, numbering from 0.  So
332           0 or _0 is the first column, 1 is the second, etc.
333
       •   In code, _last_columnname gets the value from columnname's previous
           row.
336
337       See dbroweval(1) for more details about writing code.
338

LIST OF COMMANDS

       Enough said.  I'll summarize the commands, and then you can experiment.
       For a short summary of any command, run it with the argument "--help"
       (or "-?" if you prefer).  Full manual pages can be found by running the
       command with the argument "--man", or by running the Unix command
       "man dbcol" (or whatever program you want).
345
346   TABLE CREATION
347       dbcolcreate
348           add columns to a database
349
350       dbcoldefine
351           set the column headings for a non-Fsdb file
352
353   TABLE MANIPULATION
354       dbcol
355           select columns from a table
356
357       dbrow
358           select rows from a table
359
360       dbsort
361           sort rows based on a set of columns
362
363       dbjoin
364           compute the natural join of two tables
365
366       dbcolrename
367           rename a column
368
369       dbcolmerge
370           merge two columns into one
371
372       dbcolsplittocols
373           split one column into two or more columns
374
375       dbcolsplittorows
376           split one column into multiple rows
377
378       dbfilepivot
379           "pivots" a file, converting multiple rows corresponding to the same
380           entity into a single row with multiple columns.
381
382       dbfilevalidate
383           check that db file doesn't have some common errors
384
385   COMPUTATION AND STATISTICS
386       dbcolstats
           compute statistics over a column (mean, etc., optionally median)
388
389       dbmultistats
390           group rows by some key value, then compute stats (mean, etc.) over
391           each group (equivalent to dbmapreduce with dbcolstats as the
392           reducer)
393
394       dbmapreduce
395           group rows (map) and then apply an arbitrary function to each group
396           (reduce)
397
398       dbrvstatdiff
           compare two sample distributions (mean/conf interval/T-test)
400
401       dbcolmovingstats
           compute moving statistics over a column of data
403
404       dbcolstatscores
405           compute Z-scores and T-scores over one column of data
406
407       dbcolpercentile
408           compute the rank or percentile of a column
409
410       dbcolhisto
411           compute histograms over a column of data
412
413       dbcolscorrelate
414           compute the coefficient of correlation over several columns
415
416       dbcolsregression
417           compute linear regression and correlation for two columns
418
419       dbrowaccumulate
420           compute a running sum over a column of data
421
422       dbrowcount
           count the number of rows (a subset of dbcolstats)
424
425       dbrowdiff
           compute differences in a column across the rows of a table
427
428       dbrowenumerate
429           number each row
430
431       dbroweval
432           run arbitrary Perl code on each row
433
434       dbrowuniq
435           count/eliminate identical rows (like Unix uniq(1))
436
437       dbfilediff
438           compare fields on rows of a file (something like Unix diff(1))
439
440   OUTPUT CONTROL
441       dbcolneaten
442           pretty-print columns
443
444       dbfilealter
445           convert between column or list format, or change the column
446           separator
447
448       dbfilestripcomments
449           remove comments from a table
450
451       dbformmail
452           generate a script that sends form mail based on each row
453
454   CONVERSIONS
455       (These programs convert data into fsdb.  See their web pages for
456       details.)
457
458       cgi_to_db
459           <http://stein.cshl.org/boulder/>
460
461       combined_log_format_to_db
462           <http://httpd.apache.org/docs/2.0/logs.html>
463
464       html_table_to_db
465           HTML tables to fsdb (assuming they're reasonably formatted).
466
467       kitrace_to_db
468           <http://ficus-www.cs.ucla.edu/ficus-members/geoff/kitrace.html>
469
470       ns_to_db
471           <http://mash-www.cs.berkeley.edu/ns/>
472
473       sqlselect_to_db
474           the output of SQL SELECT tables to db
475
476       tabdelim_to_db
477           spreadsheet tab-delimited files to db
478
479       tcpdump_to_db
480           (see man tcpdump(8) on any reasonable system)
481
482       xml_to_db
           XML input to fsdb, assuming it's very regular
484
485       (And out of fsdb:)
486
487       db_to_csv
488           Comma-separated-value format from fsdb.
489
490       db_to_html_table
491           simple conversion of Fsdb to html tables
492
493   STANDARD OPTIONS
494       Many programs have common options:
495
496       -? or --help
497           Show basic usage.
498
       -N or --new-name
500           When a command creates a new column like dbrowaccumulate's "accum",
501           this option lets one override the default name of that new column.
502
503       -T TmpDir
504           where to put tmp files.  Also uses environment variable TMPDIR, if
505           -T is not specified.  Default is /tmp.
508
509       -c FRACTION or --confidence FRACTION
510           Specify confidence interval FRACTION (dbcolstats, dbmultistats,
511           etc.)
512
513       -C S or "--element-separator S"
514           Specify column separator S (dbcolsplittocols, dbcolmerge).
515
516       -d or --debug
517           Enable debugging (may be repeated for greater effect in some
518           cases).
519
520       -a or --include-non-numeric
521           Compute stats over all data (treating non-numbers as zeros).  (By
522           default, things that can't be treated as numbers are ignored for
523           stats purposes)
524
525       -S or --pre-sorted
526           Assume the data is pre-sorted.  May be repeated to disable
527           verification (saving a small amount of work).
528
529       -e E or --empty E
530           give value E as the value for empty (null) records
531
532       -i I or --input I
533           Input data from file I.
534
535       -o O or --output O
536           Write data out to file O.
537
538       --header H
539           Use H as the full Fsdb header, rather than reading a header from
540           then input.  This option is particularly useful when using Fsdb
541           under Hadoop, where split files don't have heades.
542
543       --nolog.
544           Skip logging the program in a trailing comment.
545
546       When giving Perl code (in dbrow and dbroweval) column names can be
547       embedded if preceded by underscores.  Look at dbrow(1) or dbroweval(1)
548       for examples.)
549
550       Most programs run in constant memory and use temporary files if
551       necessary.  Exceptions are dbcolneaten, dbcolpercentile, dbmapreduce,
552       dbmultistats, dbrowsplituniq.
553

ANOTHER EXAMPLE

       Take the raw data in "DATA/http_bandwidth", put a header on it
       ("dbcoldefine size bw"), take statistics of each category
       ("dbmultistats -k size bw"), pick out the relevant fields ("dbcol size
       mean stddev pct_rsd"), and you get:
559
560               #fsdb size mean stddev pct_rsd
561               1024    1.4962e+06      2.8497e+05      19.047
562               10240   5.0286e+06      6.0103e+05      11.952
563               102400  4.9216e+06      3.0939e+05      6.2863
564               #  | dbcoldefine size bw
565               #  | /home/johnh/BIN/DB/dbmultistats -k size bw
566               #  | /home/johnh/BIN/DB/dbcol size mean stddev pct_rsd
567
568       (The whole command was:
569
570               cat DATA/http_bandwidth |
               dbcoldefine size bw |
572               dbmultistats -k size bw |
573               dbcol size mean stddev pct_rsd
574
575       all on one line.)
576
577       Then post-process them to get rid of the exponential notation by adding
578       this to the end of the pipeline:
579
580           dbroweval '_mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev);'
581
582       (Actually, this step is no longer required since dbcolstats now uses a
583       different default format.)
584
585       giving:
586
587               #fsdb      size    mean    stddev  pct_rsd
588               1024     1496200          284970        19.047
589               10240    5028600          601030        11.952
590               102400   4921600          309390        6.2863
591               #  | dbcoldefine size bw
592               #  | dbmultistats -k size bw
593               #  | dbcol size mean stddev pct_rsd
594               #  | dbroweval   { _mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev); }
595
596       In a few lines, raw data is transformed to processed output.
597
598       Suppose you expect there is an odd distribution of results of one
599       datapoint.  Fsdb can easily produce a CDF (cumulative distribution
600       function) of the data, suitable for graphing:
601
602           cat DB/DATA/http_bandwidth | \
603               dbcoldefine size bw | \
604               dbrow '_size == 102400' | \
605               dbcol bw | \
606               dbsort -n bw | \
607               dbrowenumerate | \
608               dbcolpercentile count | \
609               dbcol bw percentile | \
610               xgraph
611
       The steps, roughly:

       1.  get the raw input data and turn it into fsdb format,

       2.  pick out just the relevant column (for efficiency) and sort it,

       3.  for each data point, assign a CDF percentage to it,

       4.  pick out the two columns to graph and show them.
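       The heart of that pipeline (sort, number the rows, turn each row
       number into a fraction of the total) can be sketched without Fsdb as
       well.  The helper below is an assumption-laden illustration: the name
       empirical_cdf is hypothetical, it defines the CDF value of the i-th
       smallest of n points as i/n, and dbcolpercentile may use a different
       definition or output format.  Unlike most Fsdb tools it also holds
       all values in memory.

```shell
# Hypothetical helper (not part of Fsdb): read one number per line and print
# "value fraction" pairs, where fraction = rank / count after sorting.
empirical_cdf() {
    sort -n | awk '
        { v[NR] = $1 }                       # remember values in sorted order
        END {
            for (i = 1; i <= NR; i++)
                printf "%s %.2f\n", v[i], i / NR
        }
    '
}

# Usage:
printf '30\n10\n20\n40\n' | empirical_cdf
```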
616

A GRADEBOOK EXAMPLE

618       The first commercial program I wrote was a gradebook, so here's how to
619       do it with Fsdb.
620
621       Format your data like DATA/grades.
622
623               #fsdb name email id test1
624               a a@ucla.example.edu 1 80
625               b b@usc.example.edu 2 70
626               c c@isi.example.edu 3 65
627               d d@lmu.example.edu 4 90
628               e e@caltech.example.edu 5 70
629               f f@oxy.example.edu 6 90
630
631       Or if your students have spaces in their names, use "-F S" and two
632       spaces to separate each column:
633
634               #fsdb -F S name email id test1
635               alfred aho  a@ucla.example.edu  1  80
636               butler lampson  b@usc.example.edu  2  70
637               david clark  c@isi.example.edu  3  65
638               constantine drovolis  d@lmu.example.edu  4  90
639               debrorah estrin  e@caltech.example.edu  5  70
640               sally floyd  f@oxy.example.edu  6  90
641
642       To compute statistics on an exam, do
643
               cat DATA/grades | dbcolstats test1 | dblistize
645
646       giving
647
648               #fsdb -R C  ...
649               mean:        77.5
650               stddev:      10.84
651               pct_rsd:     13.987
652               conf_range:  11.377
653               conf_low:    66.123
654               conf_high:   88.877
655               conf_pct:    0.95
656               sum:         465
657               sum_squared: 36625
658               min:         65
659               max:         90
660               n:           6
661               ...
662
663       To do a histogram:
664
665               cat DATA/grades | dbcolhisto -n 5 -g test1
666
667       giving
668
669               #fsdb low histogram
670               65      *
671               70      **
672               75
673               80      *
674               85
675               90      **
676               #  | /home/johnh/BIN/DB/dbhistogram -n 5 -g test1
677
678       Now you want to send out grades to the students by e-mail.  Create a
679       form-letter (in the file test1.txt):
680
681               To: _email (_name)
682               From: J. Random Professor <jrp@usc.example.edu>
683               Subject: test1 scores
684
685               _name, your score on test1 was _test1.
686               86+   A
687               75-85 B
688               70-74 C
689               0-69  F
690
691       Generate the shell script that will send the mail out:
692
693               cat DATA/grades | dbformmail test1.txt > test1.sh
694
695       And run it:
696
697               sh <test1.sh
698
699       The last two steps can be combined:
700
701               cat DATA/grades | dbformmail test1.txt | sh
702
703       but I like to keep a copy of exactly what I send.
704
705       At the end of the semester you'll want to compute grade totals and
706       assign letter grades.  Both fall out of dbroweval.  For example, to
707       compute weighted total grades with a 40% midterm/60% final where the
708       midterm is 84 possible points and the final 100:
709
710               dbcol -rv total |
711               dbcolcreate total - |
712               dbroweval '
713                       _total = .40 * _midterm/84.0 + .60 * _final/100.0;
714                       _total = sprintf("%4.2f", _total);
715                       if (_final eq "-" || ( _name =~ /^_/)) { _total = "-"; };' |
716               dbcolneaten
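       To check the weighting by hand, the same formula can be evaluated for
       a single student.  This is just the arithmetic from the dbroweval
       code above wrapped in a small shell function; the name weighted_total
       and the sample scores are made up for illustration.

```shell
# Hypothetical helper: weighted total for one student.
# $1 = midterm points (out of 84), $2 = final points (out of 100)
weighted_total() {
    awk -v m="$1" -v f="$2" \
        'BEGIN { printf "%4.2f\n", .40 * m / 84.0 + .60 * f / 100.0 }'
}

# A midterm of 63/84 (75%) and a final of 90/100 (90%):
#   0.40 * 0.75 + 0.60 * 0.90 = 0.30 + 0.54 = 0.84
weighted_total 63 90
```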
717
718       If you got the data originally from a spreadsheet, save it in "tab-
719       delimited" format and convert it with tabdelim_to_db (run
720       tabdelim_to_db -? for examples).
721

A PASSWORD EXAMPLE

723       To convert the Unix password file to db:
724
725               cat /etc/passwd | sed 's/:/  /g'| \
726                       dbcoldefine -F S login password uid gid gecos home shell \
727                       >passwd.fsdb
728
729       To convert the group file
730
731               cat /etc/group | sed 's/:/  /g' | \
732                       dbcoldefine -F S group password gid members \
733                       >group.fsdb
734
735       To show the names of the groups that div7-members are in (assuming DIV7
736       is in the gecos field):
737
738               cat passwd.fsdb | dbrow '_gecos =~ /DIV7/' | dbcol login gid | \
739                       dbjoin -i - -i group.fsdb gid | dbcol login group
740

SHORT EXAMPLES

742       Which Fsdb programs are the most complicated (based on number of test
743       cases)?
744
745               ls TEST/*.cmd | \
746                       dbcoldefine test | \
747                       dbroweval '_test =~ s@^TEST/([^_]+).*$@$1@' | \
748                       dbrowuniq -c | \
749                       dbsort -nr count | \
750                       dbcolneaten
751
752       (Answer: dbmapreduce, then dbcolstats, dbfilealter and dbjoin.)
753
754       Stats on an exam (in $FILE, where $COLUMN is the name of the exam)?
755
               cat $FILE | dbcolstats -q 4 $COLUMN | dblistize | dbfilestripcomments
757
               cat $FILE | dbcolhisto -g -n 20 $COLUMN | dbcolneaten | dbfilestripcomments
759
       Merging the hw1 column from file hw1.fsdb into grades.fsdb, assuming
       there's a common student id in column "id":
762
763               dbcol id hw1 <hw1.fsdb >t.fsdb
764
765               dbjoin -a -e - grades.fsdb t.fsdb id | \
766                   dbsort  name | \
767                   dbcolneaten >new_grades.fsdb
768
       Merging two fsdb files with the same columns:
770
771               cat file1.fsdb file2.fsdb >output.fsdb
772
773       or if you want to clean things up a bit
774
775               cat file1.fsdb file2.fsdb | dbstripextraheaders >output.fsdb
776
777       or if you want to know where the data came from
778
779               for i in 1 2
780               do
781                       dbcolcreate source $i < file$i.fsdb
782               done >output.fsdb
783
784       (assumes you're using a Bourne-shell compatible shell, not csh).
785

WARNINGS

787       As with any tool, one should (which means must) understand the limits
788       of the tool.
789
790       All Fsdb tools should run in constant memory.  In some cases (such as
791       dbcolstats with quartiles, where the whole input must be re-read),
792       programs will spool data to disk if necessary.
793
794       Most tools buffer one or a few lines of data, so memory will scale with
       the size of each line.  (So lines with many columns, or columns that
       hold lots of data, may cause large memory consumption.)
797
798       All Fsdb tools should run in constant or at worst "n log n" time.
799
800       All Fsdb tools use normal Perl math routines for computation.  Although
801       I make every attempt to choose numerically stable algorithms (although
802       I also welcome feedback and suggestions for improvement), normal
803       rounding due to computer floating point approximations can result in
804       inaccuracies when data spans a large range of precision.  (See for
805       example the dbcolstats_extrema test cases.)
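       As an illustration of what "numerically stable" means here, the sketch
       below computes a sample variance with Welford's one-pass update
       instead of the textbook sum-of-squares formula.  This is a swapped-in
       example technique, not necessarily the algorithm dbcolstats uses, and
       the helper name stable_variance is hypothetical.

```shell
# Hypothetical helper (not part of Fsdb): sample variance of one number per
# line, using Welford's running update to avoid subtracting huge sums.
stable_variance() {
    awk '
        {
            n++
            delta = $1 - mean
            mean += delta / n               # running mean
            m2 += delta * ($1 - mean)       # running sum of squared deviations
        }
        END { if (n > 1) printf "%.6f\n", m2 / (n - 1) }
    '
}

# Values with a large common offset and a small spread; the sum of their
# squares is too large to be held exactly in a double, which is where the
# naive formula loses precision.
printf '1000000001\n1000000002\n1000000003\n' | stable_variance
```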
806
       Any requirements and limitations of each Fsdb tool are documented on
       its manual page.
809
810       If any Fsdb program violates these assumptions, that is a bug that
811       should be documented on the tool's manual page or ideally fixed.
812
813       Fsdb does depend on Perl's correctness, and Perl (and Fsdb) have some
814       bugs.  Fsdb should work on perl from version 5.10 onward.
815

HISTORY

817       There have been three versions of Fsdb; fsdb 1.0 is a complete re-write
818       of the pre-1995 versions, and was distributed from 1995 to 2007.  Fsdb
819       2.0 is a significant re-write of the 1.x versions for reasons described
820       below.
821
822       Fsdb (in its various forms) has been used extensively by its author
823       since 1991.  Since 1995 it's been used by two other researchers at UCLA
824       and several at ISI.  In February 1998 it was announced to the Internet.
825       Since then it has found a few users, some outside where I work.
826
827       Major changes:
828
829       1.0 1997-07-22: first public release.
830       2.0 2008-01-25: rewrite to use a common library, and starting to use
831       threads.
832       2.12 2008-10-16: completion of the rewrite, and first RPM package.
833       2.44 2013-10-02: abandoning threads for improved performance
834
835   Fsdb 2.0 Rationale
836       I've thought about fsdb-2.0 for many years, but it was started in
837       earnest in 2007.  Fsdb-2.0 has the following goals:
838
839       in-one-process processing
840           While fsdb is great on the Unix command line as a pipeline between
841           programs, it should also be possible to set it up to run in a
842           single process.  And if it does so, it should be able to avoid
843           serializing and deserializing (converting to and from text) data
844           between each module.  (Accomplished in fsdb-2.0: see dbpipeline,
845           although still needs tuning.)
846
847       clean IO API
848           Fsdb's roots go back to perl4 and 1991, so the fsdb-1.x library is
849           very, very crufty.  More than just being ugly (but it was that
850           too), this made things like reading from one file format and writing
851           to another the application's job, when it should be the library's.
852           (Accomplished in fsdb-1.15 and improved in 2.0: see Fsdb::IO.)
853
854       normalized module APIs
855           Because fsdb modules were added as needed over 10 years, sometimes
856           the module APIs became inconsistent.  (For example, the 1.x
857           "dbcolcreate" required an empty value following the name of the new
858           column, but other programs specify empty values with the "-e"
859           argument.)  We should smooth over these inconsistencies.
860           (Accomplished as each module was ported in 2.0 through 2.7.)
861
862       everyone handles all input formats
863           Given a clean IO API, the distinction between "colized" and
864           "listized" fsdb files should go away.  Any program should be able
865           to read and write files in any format.  (Accomplished in fsdb-2.1.)
866
867       Fsdb-2.0 preserves backwards compatibility where possible, but breaks
868       it where necessary to accomplish the above goals.  In August 2008,
869       Fsdb-2.7 was declared preferred over the 1.x versions.  Benchmarking in
870       2013 showed that threading performed much worse than just using pipes,
871       so Fsdb-2.44 keeps the threading "style", but implements it with
872       processes (via my "Freds" library).
873
874   Contributors
875       Fsdb includes code ported from Geoff Kuenning
876       ("Fsdb::Support::TDistribution").
877
878       Fsdb contributors: Ashvin Goel goel@cse.oge.edu, Geoff Kuenning
879       geoff@fmg.cs.ucla.edu, Vikram Visweswariah visweswa@isi.edu, Kannan
880       Varadahan kannan@isi.edu, Lars Eggert larse@isi.edu, Arkadi Gelfond
881       arkadig@dyna.com, David Graff graff@ldc.upenn.edu, Haobo Yu
882       haoboy@packetdesign.com, Pavlin Radoslavov pavlin@catarina.usc.edu,
883       Graham Phillips, Yuri Pradkin, Alefiya Hussain, Ya Xu, Michael
884       Schwendt, Fabio Silva fabio@isi.edu, Jerry Zhao zhaoy@isi.edu, Ning Xu
885       nxu@aludra.usc.edu, Martin Lukac mlukac@lecs.cs.ucla.edu, Xue Cai,
886       Michael McQuaid, Christopher Meng, Calvin Ardi, H. Merijn Brand, Lan
887       Wei, Hang Guo.
888
889       Fsdb includes datasets contributed from NIST (DATA/nist_zarr13.fsdb),
890       from
891       <http://www.itl.nist.gov/div898/handbook/eda/section4/eda4281.htm>, the
892       NIST/SEMATECH e-Handbook of Statistical Methods, section 1.4.2.8.1.
893       Background and Data.  The source is public domain, and reproduced with
894       permission.
895

RELATED WORK

897       As stated in the introduction, Fsdb is an incompatible reimplementation
898       of the ideas found in "/rdb".  By storing data in simple text files and
899       processing it with pipelines it is easy to experiment (in the shell)
900       and look at the output.  The original implementation of this idea was
901       /rdb, a commercial product described in the book UNIX relational
902       database management: application development in the UNIX environment by
903       Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web
904       page <http://www.rdb.com/>).
905
906       While Fsdb is inspired by Rdb, it includes no code from it, and Fsdb
907       makes several different design choices.  In particular: rdb attempts to
908       be closer to a "real" database, with provision for locking and file
909       indexing.  Fsdb focuses on single-user use and so eschews these
910       features.  Rdb also has some support for interactive editing.  Fsdb
911       leaves editing to text editors like emacs or vi.
912
913       In August 2002 I found out that Carlo Strozzi extended RDB with his package
914       NoSQL <http://www.linux.it/~carlos/nosql/>.  According to Mr. Strozzi,
915       he implemented NoSQL in awk to avoid the Perl start-up overhead of RDB.
916       Although I haven't found Perl startup overhead to be a big problem on
917       my platforms (from old Sparcstation IPCs to 2GHz Pentium-4s), you may
918       want to evaluate his system.  The Linux Journal has a description of
919       NoSQL at <http://www.linuxjournal.com/article/3294>.  It seems quite
920       similar to Fsdb.  Like /rdb, NoSQL supports indexing (not present in
921       Fsdb).  Fsdb appears to have richer support for statistics, and, as of
922       Fsdb-2.x, its support for Perl threading may allow faster performance
923       (one-process, less serialization and deserialization).
924

RELEASE NOTES

926       Versions prior to 1.0 were released informally on my web page but were
927       not announced.
928
929   0.0 1991
930       started for my own research use
931
932   0.1 26-May-94
933       first check-in to RCS
934
935   0.2 15-Mar-95
936       parts now require perl5
937
938   1.0, 22-Jul-97
939       adds autoconf support and a test script.
940
941   1.1, 20-Jan-98
942       support for double space field separators, better tests
943
944   1.2, 11-Feb-98
945       minor changes and release on comp.lang.perl.announce
946
947   1.3, 17-Mar-98
948       •   adds median and quartile options to dbstats
949
950       •   adds dmalloc_to_db converter
951
952       •   fixes some warnings
953
954       •   dbjoin now can run on unsorted input
955
956       •   fixes a dbjoin bug
957
958       •   some more tests in the test suite
959
960   1.4, 27-Mar-98
961       •   improves error messages (all should now report the program that
962           makes the error)
963
964       •   fixed a bug in dbstats output when the mean is zero
965
966   1.5, 25-Jun-98
967       BUG FIX dbcolhisto, dbcolpercentile now handle non-numeric values like
968       dbstats
969       NEW dbcolstats computes zscores and tscores over a column
970       NEW dbcolscorrelate computes correlation coefficients between two
971       columns
972       INTERNAL ficus_getopt.pl has been replaced by DbGetopt.pm
973       BUG FIX all tests are now ``portable'' (previously some tests ran only
974       on my system)
975       BUG FIX you no longer need to have the db programs in your path (fix
976       arose from a discussion with Arkadi Gelfond)
977       BUG FIX installation no longer uses cp -f (to work on SunOS 4)
978
979   1.6, 24-May-99
980       NEW dbsort, dbstats, dbmultistats now run in constant memory (using tmp
981       files if necessary)
982       NEW dbcolmovingstats does moving means over a series of data
983       NEW dbcol has a -v option to get all columns except those listed
984       NEW dbmultistats does quartiles and medians
985       NEW dbstripextraheaders now also cleans up bogus comments before the
986       first header
987       BUG FIX dbcolneaten works better with double-space-separated data
988
989   1.7,  5-Jan-00
990       NEW dbcolize now detects and rejects lines that contain embedded copies
991       of the field separator
992       NEW configure tries harder to prevent people from improperly
993       configuring/installing fsdb
994       NEW tcpdump_to_db converter (incomplete)
995       NEW tabdelim_to_db converter:  from spreadsheet tab-delimited files to
996       db
997       NEW mailing lists for fsdb are "fsdb-announce@heidemann.la.ca.us"
998       and "fsdb-talk@heidemann.la.ca.us"
999           To subscribe to either, send mail
1000           to "fsdb-announce-request@heidemann.la.ca.us" or
1001           "fsdb-talk-request@heidemann.la.ca.us" with "subscribe" in the
1002           BODY of the message.
1003
1004       BUG FIX dbjoin used to produce incorrect output if there were extra,
1005       unmatched values in the 2nd table. Thanks to Graham Phillips for
1006       providing a test case.
1007       BUG FIX the sample commands in the usage strings now all should
1008       explicitly include the source of data (typically from "cat foo.fsdb
1009       |").  Thanks to Ya Xu for pointing out this documentation deficiency.
1010       BUG FIX (DOCUMENTATION) dbcolmovingstats had incorrect sample output.
1011
1012   1.8, 28-Jun-00
1013       BUG FIX header options are now preserved when writing with dblistize
1014       NEW dbrowuniq now optionally checks for uniqueness only on certain
1015       fields
1016       NEW dbrowsplituniq makes one pass through a file and splits it into
1017       separate files based on the given fields
1018       NEW converter for "crl" format network traces
1019       NEW anywhere you use arbitrary code (like dbroweval), _last_foo now
1020       maps to the last row's value for field _foo.
1021       OPTIMIZATION comment processing slightly changed so that dbmultistats
1022       now is much faster on files with lots of comments (for example, ~100k
1023       lines of comments and 700 lines of data!) (Thanks to Graham Phillips
1024       for pointing out this performance problem.)
1025       BUG FIX dbstats with median/quartiles now correctly handles singleton
1026       data points.
1027
1028   1.9,  6-Nov-00
1029       NEW dbfilesplit, split a single input file into multiple output files
1030       (based on code contributed by Pavlin Radoslavov).
1031       BUG FIX dbsort now works with perl-5.6
1032
1033   1.10, 10-Apr-01
1034       BUG FIX dbstats now handles the case where there are more n-tiles than
1035       data
1036       NEW dbstats now includes a -S option to optimize work on pre-sorted
1037       data (inspired by code contributed by Haobo Yu)
1038       BUG FIX dbsort now has a better estimate of memory usage when run on
1039       data with very short records (problem detected by Haobo Yu)
1040       BUG FIX cleanup of temporary files is slightly better
1041
1042   1.11,  2-Nov-01
1043       BUG FIX dbcolneaten now runs in constant memory
1044       NEW dbcolneaten now supports "field specifiers" that allow some control
1045       over how wide columns should be
1046       OPTIMIZATION dbsort now tries hard to be filesystem cache-friendly
1047       (inspired by "Information and Control in Gray-box Systems" by the
1048       Arpaci-Dusseaus at SOSP 2001)
1049       INTERNAL t_distr now ported to perl5 module DbTDistr
1050
1051   1.12,  30-Oct-02
1052       BUG FIX dbmultistats documentation typo fixed
1053       NEW dbcolmultiscale
1054       NEW dbcol has -r option for "relaxed error checking"
1055       NEW dbcolneaten has new -e option to strip end-of-line spaces
1056       NEW dbrow finally has a -v option to negate the test
1057       BUG FIX math bug in dbcoldiff fixed by Ashvin Goel (need to check
1058       Scheaffer test cases)
1059       BUG FIX some patches to run with Perl 5.8. Note: some programs
1060       (dbcolmultiscale, dbmultistats, dbrowsplituniq) generate warnings like:
1061       "Use of uninitialized value in concatenation (.) or string at
1062       /usr/lib/perl5/5.8.0/FileCache.pm line 98, <STDIN> line 2". Please
1063       ignore this until I figure out how to suppress it. (Thanks to Jerry
1064       Zhao for noticing perl-5.8 problems.)
1065       BUG FIX fixed an autoconf problem where configure would fail to find a
1066       reasonable prefix (thanks to Fabio Silva for reporting the problem)
1067       NEW db_to_html_table: simple conversion to html tables (NO fancy stuff)
1068       NEW dblib now has a function dblib_text2html() that will do simple
1069       conversion of iso-8859-1 to HTML
1070
1071   1.13,  4-Feb-04
1072       NEW fsdb added to the freebsd ports tree
1073       <http://www.freshports.org/databases/fsdb/>.  Maintainer:
1074       "larse@isi.edu"
1075       BUG FIX properly handle trailing spaces when data must be numeric (ex.
1076       dbstats with -FS, see test dbstats_trailing_spaces). Fix from Ning Xu
1077       "nxu@aludra.usc.edu".
1078       NEW dbcolize error message improved (bug report from Terrence Brannon),
1079       and list format documented in the README.
1080       NEW cgi_to_db converts CGI.pm-format storage to fsdb list format
1081       BUG FIX handle numeric synonyms for column names in dbcol properly
1082       ENHANCEMENT "talking about columns" section added to README. Lack of
1083       documentation pointed out by Lars Eggert.
1084       CHANGE dbformmail now defaults to using Mail ("Berkeley Mail") to send
1085       mail, rather than sendmail (sendmail is still an option, but mail
1086       doesn't require running as root)
1087       NEW on platforms that support it (i.e., with perl 5.8), fsdb works fine
1088       with unicode
1089       NEW dbfilevalidate: check a db file for some common errors
1090
1091   1.14,  24-Aug-06
1092       ENHANCEMENT README cleanup
1093       INCOMPATIBLE CHANGE dbcolsplit renamed dbcolsplittocols
1094       NEW dbcolsplittorows  split one column into multiple rows
1095       NEW dbcolsregression compute linear regression and correlation for two
1096       columns
1097       ENHANCEMENT csv_to_db: better error handling, normalize field names,
1098       skip blank lines
1099       ENHANCEMENT dbjoin now detects (and fails) if non-joined files have
1100       duplicate names
1101       BUG FIX minor bug fixed in calculation of Student t-distributions
1102       (doesn't change any test output, but may have caused small errors)
1103
1104   1.15, 12-Nov-07
1105       NEW fsdb-1.14 added to the MacOS Fink system
1106       <http://pdb.finkproject.org/pdb/package.php/fsdb>. (Thanks to Lars
1107       Eggert for maintaining this port.)
1108       NEW Fsdb::IO::Reader and Fsdb::IO::Writer now provide reasonably clean
1109       OO I/O interfaces to Fsdb files.  Highly recommended if you use fsdb
1110       directly from perl.  In the fullness of time I expect to reimplement
1111       the entire thing using these APIs to replace the current dblib.pl which
1112       is still hobbled by its roots in perl4.
1113       NEW dbmapreduce now implements a Google-style map/reduce abstraction,
1114       generalizing dbmultistats.
1115       ENHANCEMENT fsdb now uses the Perl build system (Makefile.PL, etc.),
1116       instead of autoconf.  This change paves the way to better perl-5-style
1117       modularization, proper manual pages, input of both listize and colize
1118       format for every program, and world peace.
1119       ENHANCEMENT dblib.pl is now moved to Fsdb::Old.pm.
1120       BUG FIX dbmultistats now propagates its format argument (-f). Bug and
1121       fix from Martin Lukac (thanks!).
1122       ENHANCEMENT dbformmail documentation now is clearer that it doesn't
1123       send the mail, you have to run the shell script it writes.  (Problem
1124       observed by Unkyu Park.)
1125       ENHANCEMENT adapted to autoconf-2.61 (and then these changes were
1126       discarded in favor of The Perl Way).
1127       BUG FIX dbmultistats memory usage corrected (O(# tags), not O(1))
1128       ENHANCEMENT dbmultistats can now optionally run with pre-grouped input
1129       in O(1) memory
1130       ENHANCEMENT dbroweval -N was finally implemented (eat comments)
1131
1132   2.0, 25-Jan-08
1133       2.0, 25-Jan-08 --- a quiet 2.0 release (gearing up towards complete)
1134
1135       ENHANCEMENT: shifting old programs to Perl modules, with the front-end
1136       program as just a wrapper. In the short-term, this change just means
1137       programs have real man pages. In the long-run, it will mean that one
1138       can run a pipeline in a single Perl program. So far: dbcol, dbroweval,
1139       the new dbrowcount, dbsort, the new dbmerge, the old "dbstats" (renamed
1140       dbcolstats), dbcolrename, dbcolcreate.
1141       NEW: Fsdb::Filter::dbpipeline is an internal-only module that lets one
1142       use fsdb commands from within perl (via threads).
1143           It also provides perl function aliases for the internal modules, so
1144           a string of fsdb commands in perl is nearly as terse as in the
1145           shell:
1146
1147               use Fsdb::Filter::dbpipeline qw(:all);
1148               dbpipeline(
1149                   dbrow(qw(name test1)),
1150                   dbroweval('_test1 += 5;')
1151               );
1152
1153       INCOMPATIBLE CHANGE: The old dbcolstats has been renamed
1154       dbcolstatscores. The new dbcolstats does the same thing as the old
1155       dbstats. This incompatibility is unfortunate but normalizes program
1156       names.
1157       CHANGE: The new dbcolstats program always outputs "-" (the default
1158       empty value) for statistics it cannot compute (for example, standard
1159       deviation if there is only one row), instead of the old mix of "-" and
1160       "na".
1161       INCOMPATIBLE CHANGE: The old dbcolstats program, now called
1162       dbcolstatscores, also has different arguments.  The "-t mean,stddev"
1163       option is now "--tmean mean --tstddev stddev".  See dbcolstatscores for
1164       details.
1165       INCOMPATIBLE CHANGE: dbcolcreate now assumes all new columns get the
1166       default value rather than requiring each column to have an initial
1167       constant value. To change the initial value, use the new "-e" option.
1168       NEW: dbrowcount counts rows, an almost-subset of dbcolstats's "n"
1169       output (except without differentiating numeric/non-numeric input), or
1170       the equivalent of "dbstripcomments | wc -l".
1171       NEW: dbmerge merges two sorted files. This functionality was previously
1172       embedded in dbsort.
1173       INCOMPATIBLE CHANGE: dbjoin's "-i" option to include non-matches is now
1174       renamed "-a", so as to not conflict with the new standard option "-i"
1175       for input file.
1176
1177   2.1,  6-Apr-08
1178       2.1,  6-Apr-08 --- another alpha 2.0, but now all converted programs
1179       understand both listize and colize format
1180
1181       ENHANCEMENT: shifting more old programs to Perl modules. New in 2.1:
1182       dbcolneaten, dbcoldefine, dbcolhisto, dblistize, dbcolize, dbrecolize
1183       ENHANCEMENT dbmerge now handles an arbitrary number of input files, not
1184       just exactly two.
1185       NEW dbmerge2 is an internal routine that handles merging exactly two
1186       files.
1187       INCOMPATIBLE CHANGE dbjoin now specifies inputs like dbmerge2, rather
1188       than assuming the first two arguments were tables (as in fsdb-1).
1189           The old dbjoin argument "-i" is now "-a" or "--type=outer".
1190
1191           A minor change: comments in the source files for dbjoin are now
1192           intermixed with output rather than being delayed until the end.
1193
1194       ENHANCEMENT dbsort now no longer produces warnings when null values are
1195       passed to numeric comparisons.
1196       BUG FIX dbroweval now once again works with code that lacks a trailing
1197       semicolon. (This fixes a regression from 1.15.)
1198       INCOMPATIBLE CHANGE dbcolneaten's old "-e" option (to avoid end-of-line
1199       spaces) is now "-E" to avoid conflicts with the standard empty field
1200       argument.
1201       INCOMPATIBLE CHANGE dbcolhisto's old "-e" option is now "-E" to avoid
1202       conflicts. And its "-n", "-s", and "-w" are now "-N", "-S", and "-W" to
1203       correspond.
1204       NEW dbfilealter replaces dbrecolize, dblistize, and dbcolize, but with
1205       different options.
1206       ENHANCEMENT The library routines "Fsdb::IO" now understand both list-
1207       format and column-format data, so all converted programs can now
1208       automatically read either format.  This capability was one of the
1209       milestone goals for 2.0, so yea!
1210
1211   2.2, 23-May-08
1212       Release 2.2 is another 2.x alpha release.  Now most of the commands are
1213       ported, but a few remain, and I plan one last incompatible change (to
1214       the file header) before 2.x final.
1215
1216       ENHANCEMENT
1217           shifting more old programs to Perl modules.  New in 2.2:
1218           dbrowaccumulate, dbformmail, dbcolmovingstats, dbrowuniq,
1219           dbrowdiff, dbcolmerge, dbcolsplittocols, dbcolsplittorows,
1220           dbmapreduce, dbmultistats, dbrvstatdiff.  Also dbrowenumerate
1221           exists only as a front-end (command-line) program.
1222
1223       INCOMPATIBLE CHANGE
1224           The following programs have been dropped from fsdb-2.x:
1225           dbcoltighten, dbfilesplit, dbstripextraheaders,
1226           dbstripleadingspace.
1227
1228       NEW combined_log_format_to_db to convert Apache logfiles
1229
1230       INCOMPATIBLE CHANGE
1231           Options to dbrowdiff are now -B and -I, not -a and -i.
1232
1233       INCOMPATIBLE CHANGE
1234           dbstripcomments is now dbfilestripcomments.
1235
1236       BUG FIXES
1237           dbcolneaten better handles empty columns; dbcolhisto warning
1238           suppressed (actually a bug in high-bucket handling).
1239
1240       INCOMPATIBLE CHANGE
1241           dbmultistats now requires a "-k" option in front of the key (tag)
1242           field, or if none is given, it will group by the first field (both
1243           like dbmapreduce).
1244
1245       KNOWN BUG
1246           dbmultistats with quantile option doesn't work currently.
1247
1248       INCOMPATIBLE CHANGE
1249           dbcoldiff is renamed dbrvstatdiff.
1250
1251       BUG FIXES
1252           dbformmail was leaving its log message as a command, not a
1253           comment.  Oops.  No longer.
1254
1255   2.3, 27-May-08 (alpha)
1256       Another alpha release, this one just to fix the critical dbjoin bug
1257       listed below (that happens to have blocked my MP3 jukebox :-).
1258
1259       BUG FIX
1260           Dbsort no longer hangs if given an input file with no rows.
1261
1262       BUG FIX
1263           Dbjoin now works with unsorted input coming from a pipeline (like
1264           stdin).  Perl-5.8.8 has a bug (?) that was making this case
1265           fail---opening stdin in one thread, reading some, then reading more
1266           in a different thread caused an lseek which works on files, but
1267           fails on pipes like stdin.  Go figure.
1268
1269       BUG FIX / KNOWN BUG
1270           The dbjoin fix also fixed dbmultistats -q (it now gives the right
1271           answer).  However, a new bug appeared, with messages like:
1272               Attempt to free unreferenced scalar: SV 0xa9dd0c4, Perl
1273               interpreter: 0xa8350b8 during global destruction.
1274           So the dbmultistats_quartile test is still disabled.
1275
1276   2.4, 18-Jun-08
1277       Another alpha release, mostly to fix minor usability problems in
1278       dbmapreduce and client functions.
1279
1280       ENHANCEMENT
1281           dbrow now defaults to running user supplied code without warnings
1282           (as with fsdb-1.x).  Use "--warnings" or "-w" to turn them back on.
1283
1284       ENHANCEMENT
1285           dbroweval can now write different format output than the input,
1286           using the "-m" option.
1287
1288       KNOWN BUG
1289           dbmapreduce emits warnings on perl 5.10.0 about "Unbalanced string
1290           table refcount" and "Scalars leaked" when run with an external
1291           program as a reducer.
1292
1293           dbmultistats emits the warning "Attempt to free unreferenced
1294           scalar" when run with quartiles.
1295
1296           In each case the output is correct.  I believe these can be
1297           ignored.
1298
1299       CHANGE
1300           dbmapreduce no longer logs a line for each reducer that is invoked.
1301
1302   2.5, 24-Jun-08
1303       Another alpha release, fixing more minor bugs in "dbmapreduce" and
1304       lossage in "Fsdb::IO".
1305
1306       ENHANCEMENT
1307           dbmapreduce can now tolerate non-map-aware reducers that pass back
1308           the key column in output.  It also passes the current key as the last
1309           argument to external reducers.
1310
1311       BUG FIX
1312           Fsdb::IO::Reader, correctly handle "-header" option again.  (Broken
1313           since fsdb-2.3.)
1314
1315   2.6, 11-Jul-08
1316       Another alpha release, needed to fix DaGronk.  One new port, small bug
1317       fixes, and an important fix to dbmapreduce.
1318
1319       ENHANCEMENT
1320           shifting more old programs to Perl modules.  New in 2.6:
1321           dbcolpercentile.
1322
1323       INCOMPATIBLE CHANGE and ENHANCEMENTS dbcolpercentile arguments changed,
1324       use "--rank" to require ranking instead of "-r". Also, "--ascending"
1325       and "--descending" can now be specified separately, both for
1326       "--percentile" and "--rank".
1327       BUG FIX
1328           Sigh, the sense of the --warnings option in dbrow was inverted.  No
1329           longer.
1330
1331       BUG FIX
1332           I found and fixed the string leaks (errors like "Unbalanced string
1333           table refcount" and "Scalars leaked") in dbmapreduce and
1334           dbmultistats.  (All "IO::Handle"s in threads must be manually
1335           destroyed.)
1336
1337       BUG FIX
1338           The "-C" option to specify the column separator in dbcolsplittorows
1339           now works again (broken since it was ported).
1340
1341   2.7, 30-Jul-08 beta
1342
1343       The beta release of fsdb-2.x.  Finally, all programs are ported.  As
1344       statistics, the number of lines of non-library code doubled from 7.5k
1345       to 15.5k.  The libraries are much more complete, going from 866 to 5164
1346       lines.  The overall number of programs is about the same, although 19
1347       were dropped and 11 were added.  The number of test cases has grown
1348       from 116 to 175.  All programs are now in perl-5, no more shell scripts
1349       or perl-4.  All programs now have manual pages.
1350
1351       Although this is a major step forward, I still expect to rename "jdb"
1352       to "fsdb".
1353
1354       ENHANCEMENT
1355           shifting more old programs to Perl modules.  New in 2.7:
1356           dbcolscorrelate, dbcolsregression, cgi_to_db, dbfilevalidate,
1357           db_to_csv, csv_to_db, db_to_html_table, kitrace_to_db,
1358           tcpdump_to_db, tabdelim_to_db, ns_to_db.
1359
1360       INCOMPATIBLE CHANGE
1361           The following programs have been dropped from fsdb-2.x: db2dcliff,
1362           dbcolmultiscale, crl_to_db, ipchain_logs_to_db.  They may come
1363           back, but seemed overly specialized.  The program
1364           dbrowsplituniq was dropped because it is superseded by dbmapreduce.
1365           dmalloc_to_db was dropped pending test cases and examples.
1366
1367       ENHANCEMENT
1368           dbfilevalidate now has a "-c" option to correct errors.
1369
1370       NEW html_table_to_db provides the inverse of db_to_html_table.
1371
1372   2.8,  5-Aug-08
1373       Change header format, preserving forwards compatibility.
1374
1375       BUG FIX
1376           Complete editing pass over the manual, making sure it aligns with
1377           fsdb-2.x.
1378
1379       SEMI-COMPATIBLE CHANGE
1380           The header of fsdb files has changed: it is now #fsdb, not #h (or
1381           #L), and parsing of -F and -R is also different.  See dbfilealter
1382           for the new specification.  The v1 file format will be read,
1383           compatibly, but not written.
1384
1385       BUG FIX
1386           dbmapreduce now tolerates comments that precede the first key,
1387           instead of failing with an error message.
1388
1389   2.9, 6-Aug-08
1390       Still in beta; just a quick bug-fix for dbmapreduce.
1391
1392       ENHANCEMENT
1393           dbmapreduce now generates plausible output when given no rows of
1394           input.
1395
1396   2.10, 23-Sep-08
1397       Still in beta, but picking up some bug fixes.
1398
1399       ENHANCEMENT
1400           dbmapreduce now generates plausible output when given no rows of
1401           input.
1402
1403       ENHANCEMENT
1404           dbroweval the warnings option was backwards; now corrected.  As a
1405           result, warnings in user code now default off (like in fsdb-1.x).
1406
1407       BUG FIX
1408           dbcolpercentile now defaults to assuming the target column is
1409           numeric.  The new option "-N" allows selection of a non-numeric
1410           target.
1411
1412       BUG FIX
1413           dbcolscorrelate now includes "--sample" and "--nosample" options to
1414           compute the sample or full population correlation coefficients.
1415           Thanks to Xue Cai for finding this bug.
1416
1417   2.11, 14-Oct-08
1418       Still in beta, but picking up some bug fixes.
1419
1420       ENHANCEMENT
1421           html_table_to_db is now more aggressive about filling in empty
1422           cells with the official empty value, rather than leaving them blank
1423           or as whitespace.
1424
1425       ENHANCEMENT
1426           dbpipeline now catches failures during pipeline element setup and
1427           exits reasonably gracefully.
1428
1429       BUG FIX
1430           dbsubprocess now reaps child processes, thus avoiding running out
1431           of processes when used a lot.
1432
1433   2.12, 16-Oct-08
1434       Finally, a full (non-beta) 2.x release!
1435
1436       INCOMPATIBLE CHANGE
1437           Jdb has been renamed Fsdb, the flatfile-streaming database.  This
1438           change affects all internal Perl APIs, but no shell command-level
1439           APIs.  While Jdb served well for more than ten years, it is easily
1440           confused with the Java debugger (even though Jdb was there first!).
1441           It also is too generic to work well in web search engines.
1442           Finally, Jdb stands for ``John's database'', and we're a bit beyond
1443           that.  (However, some call me the ``file-system guy'', so one could
1444           argue it retains that meaning.)
1445
1446           If you just used the shell commands, this change should not affect
1447           you.  If you used the Perl-level libraries directly in your code,
1448           you should be able to rename "Jdb" to "Fsdb" to move to 2.12.
1449
1450           The jdb-announce list has not yet been renamed, but it will be shortly.
1451
1452           With this release I've accomplished everything I wanted to in
1453           fsdb-2.x.  I therefore expect to return to boring, bugfix releases.
1454
1455   2.13, 30-Oct-08
1456       BUG FIX
1457           dbrowaccumulate now treats non-numeric data as zero by default.
1458
1459       BUG FIX
1460           Fixed a perl-5.10ism in dbmapreduce that breaks that program under
1461           5.8.  Thanks to Martin Lukac for reporting the bug.
1462
1463   2.14, 26-Nov-08
1464       BUG FIX
1465           Improved documentation for dbmapreduce's "-f" option.
1466
1467       ENHANCEMENT
1468           dbcolmovingstats now computes a moving standard deviation in
1469           addition to a moving mean.
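
               For readers unfamiliar with the idea, a moving mean and moving
               standard deviation can be sketched in Python as below.  This is
               an illustration only: the trailing window, the use of the
               sample standard deviation, and the 0.0 reported for a
               one-element window are assumptions, not a description of how
               dbcolmovingstats behaves.

```python
from collections import deque
from statistics import mean, stdev


def moving_stats(values, window):
    """Yield (mean, stddev) over the trailing `window` values."""
    recent = deque(maxlen=window)  # old values fall off automatically
    for v in values:
        recent.append(v)
        current = list(recent)
        # Assumption: a sample standard deviation needs two points,
        # so a one-element window reports 0.0.
        sd = stdev(current) if len(current) > 1 else 0.0
        yield mean(current), sd


for m, sd in moving_stats([1.0, 2.0, 3.0, 4.0], window=2):
    print(m, sd)
```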
1470
1471   2.15, 13-Apr-09
1472       BUG FIX
1473           Fix a make install bug reported by Shalindra Fernando.
1474
1475   2.16, 14-Apr-09
1476       BUG FIX
1477           Another minor release bug: on some systems programize_module loses
1478           executable permissions.  Again reported by Shalindra Fernando.
1479
1480   2.17, 25-Jun-09
1481       TYPO FIXES
1482           Typo in the dbroweval manual fixed.
1483
1484       IMPROVEMENT
1485           There is no longer a comment line to label columns in dbcolneaten,
1486           instead the header line is tweaked to line up.  This change
1487           restores the Jdb-1.x behavior, and means that repeated runs of
1488           dbcolneaten no longer add comment lines each time.
1489
1490       BUG FIX
1491           It turns out dbcolneaten was not correctly handling trailing
1492           spaces when given the "-E" option to suppress them.  This
1493           regression is now fixed.
1494
1495       EXTENSION
1496           dbroweval(1) can now handle direct references to the last row via
1497           $lfref, a dubious but now documented feature.
1498
1499       BUG FIXES
1500           Separators set with "-C" in dbcolmerge and dbcolsplittocols were
1501           not properly setting the heading, and null fields were not
1502           recognized.  The first bug was reported by Martin Lukac.
1503
1504   2.18,  1-Jul-09  A minor release
1505       IMPROVEMENT
1506           Documentation for Fsdb::IO::Reader has been improved.
1507
1508       IMPROVEMENT
1509           The package should now be PGP-signed.
1510
1511   2.19,  10-Jul-09
1512       BUG FIX
1513           Internal improvements to debugging output and robustness of
1514           dbmapreduce and dbpipeline.  TEST/dbpipeline_first_fails.cmd re-
1515           enabled.
1516
1517   2.20, 30-Nov-09 (A collection of minor bugfixes, plus a build against
1518       Fedora 12.)
1519       BUG FIX
1520           Logging for dbmapreduce with code refs is now stable (it no longer
1521           includes a hex pointer to the code reference).
1522
1523       BUG FIX
1524           Better handling of mixed blank lines in Fsdb::IO::Reader (see test
1525           case dbcolize_blank_lines.cmd).
1526
1527       BUG FIX
1528           html_table_to_db now handles multi-line input better, and handles
1529           tables with COLSPAN.
1530
1531       BUG FIX
1532           dbpipeline now cleans up threads in an "eval" to prevent "cannot
1533           detach a joined thread" errors that popped up in perl-5.10.
1534           Hopefully this prevents a race condition that causes the test
1535           suites to hang about 20% of the time (in dbpipeline_first_fails).
1536
1537       IMPROVEMENT
1538           dbmapreduce now detects and correctly fails when the input and
1539           reducer have incompatible field separators.
1540
1541       IMPROVEMENT
1542           dbcolstats, dbcolhisto, dbcolscorrelate, dbcolsregression, and
1543           dbrowcount now all take an "-F" option to let one specify the
1544           output field separator (so they work better with dbmapreduce).
1545
1546       BUG FIX
1547           An omitted "-k" from the manual page of dbmultistats is now there.
1548           Bug reported by Unkyu Park.
1549
1550   2.21, 17-Apr-10 bug fix release
1551       BUG FIX
1552           Fsdb::IO::Writer now no longer fails with -outputheader => never
1553           (an obscure bug).
1554
1555       IMPROVEMENT
1556           Fsdb (in the warnings section) and dbcolstats now more carefully
1557           document how they handle (and do not handle) numerical precision
1558           problems, and other general limits.  Thanks to Yuri Pradkin for
1559           prompting this documentation.
1560
1561       IMPROVEMENT
1562           "Fsdb::Support::fullname_to_sortkey" is now restored from "Jdb".
1563
1564       IMPROVEMENT
1565           Documentation for multiple styles of input approaches (including
1566           performance description) added to Fsdb::IO.
1567
1568   2.22, 2010-10-31 One new tool dbcolcopylast and several bug fixes for Perl
1569       5.10.
1570       BUG FIX
1571           dbmerge now correctly handles n-way merges.  Bug reported by Yuri
1572           Pradkin.
1573
1574       INCOMPATIBLE CHANGE
1575           dbcolneaten now defaults to not padding the last column.
1576
1577       ADDITION
1578           dbrowenumerate now takes -N NewColumn to give the new column a name
1579           other than "count".  Feature requested by Mike Rouch in January
1580           2005.
1581
1582       ADDITION
1583           New program dbcolcopylast copies the last value of a column into a
1584           new column copylast_column of the next row.  New program requested
1585           by Fabio Silva; useful for converting dbmultistats output into
1586           dbrvstatdiff input.
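
               The effect can be pictured with a short Python sketch.  It is
               one reading of the description above, not dbcolcopylast's code:
               the function name is hypothetical, and using "-" for the first
               row (which has no previous row) is an assumption.

```python
def copy_last(rows, column, empty="-"):
    """Add copylast_<column>, holding the previous row's <column> value."""
    previous = empty  # assumption: the first row gets an empty marker
    out = []
    for row in rows:
        new_row = dict(row)
        new_row["copylast_" + column] = previous
        previous = row[column]
        out.append(new_row)
    return out


for row in copy_last([{"t": "1"}, {"t": "5"}, {"t": "9"}], "t"):
    print(row)
```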
1587
1588       BUG FIX
1589           Several tools (particularly dbmapreduce and dbmultistats) would
1590           report errors like "Unbalanced string table refcount: (1) for
1591           "STDOUT" during global destruction" on exit, at least on certain
1592           versions of Perl (for me on 5.10.1), but similar errors have been
1593           off-and-on for several Perl releases.  Although I think my code
1594           looked OK, I worked around this problem with a different way of
1595           handling standard IO redirection.
1596
1597   2.23, 2011-03-10 Several small portability bugfixes; improved dbcolstats
1598       for large datasets
1599       IMPROVEMENT
1600           Documentation to dbrvstatdiff was changed to use "sd" to refer to
1601           standard deviation, not "ss" (which might be confused with sum-of-
1602           squares).
1603
1604       BUG FIX
1605           This documentation about dbmultistats was missing the -k option in
1606           some cases.
1607
1608       BUG FIX
1609           dbmapreduce was failing on MacOS-10.6.3 for some tests with the
1610           error
1611
1612               dbmapreduce: cannot run external dbmapreduce reduce program (perl TEST/dbmapreduce_external_with_key.pl)
1613
1614           The problem seemed to be only in the error, not in operation.  On
1615           MacOS, the error is now suppressed.  Thanks to Alefiya Hussain for
1616           providing access to a Mac system that allowed debugging of this
1617           problem.
1618
1619       IMPROVEMENT
1620           The csv_to_db command requires an external Perl library
1621           (Text::CSV_XS).  On computers that lack this optional library,
1622           previously Fsdb would configure with a warning and then test cases
1623           would fail.  Now those test cases are skipped with an additional
1624           warning.
1625
1626       BUG FIX
1627           The test suite now supports alternative valid output, as a hack to
1628           account for last-digit floating point differences.  (Not very
1629           satisfying :-(
1630
1631       BUG FIX
1632           dbcolstats output for confidence intervals on very large datasets
1633           has changed.  Previously it failed for more than 2^31-1 records,
1634           and handling of T-Distributions with thousands of rows was a bit
1635           dubious.  Now datasets with more than 10000 rows are considered
1636           infinitely large and hopefully correctly handled.
1637
1638   2.24, 2011-04-15 Improvements to fix an old bug in dbmapreduce with
1639       different field separators
1640       IMPROVEMENT
1641           The dbfilealter command had a "--correct" option to work around
1642           incompatible field separators, but it did nothing.  Now it
1643           does the correct but sad, data-losing thing.
1644
1645       IMPROVEMENT
1646           The dbmultistats command previously failed with an error message
1647           when invoked on input with a non-default field separator.  The root
1648           cause was the underlying dbmapreduce that did not handle the case
1649           of reducers that generated output with a different field separator
1650           than the input.  We now detect and repair incompatible field
1651           separators.  This change corrects a problem originally documented
1652           and detected in Fsdb-2.20.  Bug re-reported by Unkyu Park.
1653
1654   2.25, 2011-08-07 Two new tools, xml_to_db and dbfilepivot, and a bugfix for
1655       two people.
1656       IMPROVEMENT
1657           kitrace_to_db now supports a --utc option, which also fixes this
1658           test case for users outside of the Pacific time zone.  Bug reported
1659           by David Graff, and also by Peter Desnoyers (within a week of each
1660           other :-)
1661
1662       NEW xml_to_db can convert simple, very regular XML files into Fsdb.
1663
1664       NEW dbfilepivot "pivots" a file, converting multiple rows corresponding
1665           to the same entity into a single row with multiple columns.
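
           The idea of pivoting can be sketched in Python as follows.  This is
           a conceptual illustration only; the function and the column names
           in the example are hypothetical and say nothing about dbfilepivot's
           options or implementation.

```python
def pivot(rows, key, tag, value):
    """Turn (key, tag, value) rows into one wide row per key."""
    wide = {}
    for row in rows:
        # One output row per entity, created on first sight.
        entity = wide.setdefault(row[key], {key: row[key]})
        # Each tag becomes a column holding the matching value.
        entity[row[tag]] = row[value]
    return list(wide.values())


long_rows = [
    {"name": "a", "attr": "x", "val": "1"},
    {"name": "a", "attr": "y", "val": "2"},
    {"name": "b", "attr": "x", "val": "3"},
]
for row in pivot(long_rows, "name", "attr", "val"):
    print(row)
```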
1666
1667   2.26, 2011-12-12 Bug fixes, particularly for perl-5.14.2.
1668       BUG FIX
1669           Bugs fixed in Fsdb::IO::Reader(3) manual page.
1670
1671       BUG FIX
1672           Fixed problems where dbcolstats was truncating floating point
1673           numbers when sorting.  This strange behavior happens as of
1674           perl-5.14.2 and it seems like a Perl bug.  I've worked around it
1675           for the test suites, but I'm a bit nervous.
1676
1677   2.27, 2012-11-15 Accumulated bug fixes.
1678       IMPROVEMENT
1679           csv_to_db now reports errors in CSV input with real diagnostics.
1680
1681       IMPROVEMENT
1682           dbcolmovingstats can now compute median, when given the "-m"
1683           option.
1684
1685       BUG FIX
1686           dbcolmovingstats non-numeric handling (the "-a" option) now works
1687           properly.
1688
1689       DOCUMENTATION
1690           The internal t/test_command.t test framework is now documented.
1691
1692       BUG FIX
1693           dbrowuniq now correctly handles the case where there is no input
1694           (previously it output a blank line, which is a malformed fsdb
1695           file).  Thanks to Yuri Pradkin for reporting this bug.
1696
1697   2.28, 2012-11-15 A quick release to fix most rpmlint errors.
1698       BUG FIX
1699           Fixed a number of minor release problems (wrong permissions, old
1700           FSF address, etc.) found by rpmlint.
1701
1702   2.29, 2012-11-20 a quick release for CPAN testing
1703       IMPROVEMENT
1704           Tweaked the RPM spec.
1705
1706       IMPROVEMENT
1707           Modified Makefile.PL to fail gracefully on Perl installations that
1708           lack threads.  (Without this fix, I get massive failures in the
1709           non-ithreads test system.)
1710
1711   2.30, 2012-11-25 improvements to perl portability
1712       BUG FIX
1713           Removed unicode character in documentation of dbcolscorrelate so pod
1714           tests will pass.  (Sigh, that should work :-( )
1715
1716       BUG FIX
1717           Fixed test suite failures on 5 tests (dbcolcreate_double_creation
1718           was the first) due to Carp's addition of a period.  This problem
1719           was breaking Fsdb on perl-5.17.  Thanks to Michael McQuaid for
1720           helping diagnose this problem.
1721
1722       IMPROVEMENT
1723           The test suite now prints out the names of tests it tries.
1724
1725   2.31, 2012-11-28 A release with actual improvements to dbfilepivot and
1726       dbrowuniq.
1727       BUG FIX
1728           Documentation fixes: typos in dbcolscorrelate, bugs in
1729           dbfilepivot, clarification for comment handling in
1730           Fsdb::IO::Reader.
1731
1732       IMPROVEMENT
1733           Previously dbfilepivot assumed the input was grouped by keys and
1734           didn't verify that pre-condition.  Now there is no pre-condition (it
1735           will sort the input by default), and it checks if the invariant is
1736           violated.
1737
1738       BUG FIX
1739           Previously dbfilepivot failed if the input had comments (oops :-);
1740           no longer.
1741
1742       IMPROVEMENT
1743           Now dbrowuniq has the "-L" option to preserve the last unique row
1744           (instead of the first), a common idiom.
1745
1746   2.32, 2012-12-21 Test suites should now be more numerically robust.
1747       NEW New dbfilediff does fsdb-aware file differencing.  It does not do
1748           smart intuition of add/removes like Unix diff(1), but it does know
1749           about columns, and with "-E", it does numeric-aware differences.
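
           A numeric-aware comparison of this kind can be sketched in Python.
           The sketch below is an assumption-laden illustration, not
           dbfilediff's algorithm: the function names and the relative
           tolerance of 1e-6 are hypothetical, and rows are compared pairwise
           with no detection of added or removed rows.

```python
import math


def fields_differ(a, b, rel_tol=1e-6):
    """Compare two fields, numerically when both parse as numbers."""
    try:
        return not math.isclose(float(a), float(b), rel_tol=rel_tol)
    except ValueError:
        # At least one side is not a number: fall back to text equality.
        return a != b


def diff_rows(left, right, rel_tol=1e-6):
    """Return indexes of row pairs that differ in any column."""
    return [
        i
        for i, (l, r) in enumerate(zip(left, right))
        if any(fields_differ(x, y, rel_tol) for x, y in zip(l, r))
    ]


left = [["a", "1.0000001"], ["b", "2.0"]]
right = [["a", "1.0000002"], ["b", "2.5"]]
print(diff_rows(left, right))
```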
1750
1751       IMPROVEMENT
1752           Test suites that are numeric now use dbfilediff to do numeric-aware
1753           comparisons, so the test suite should now be robust to slightly
1754           different computers and operating systems and compilers than
1755           exactly what I use.
1756
1757   2.33, 2012-12-23 Minor fixes to some test cases.
1758       IMPROVEMENT
1759           dbfilediff and dbrowuniq now support the "-N" option to give the
1760           new column a different name.  (And test cases where this
1761           duplication mattered have been fixed.)
1762
1763       IMPROVEMENT
1764           dbrvstatdiff now shows the t-test breakpoint with a reasonable
1765           number of floating point digits.
1766
1767       BUG FIX
1768           Fixed a numerical stability problem in the dbroweval_last test
1769           case.
1770

WHAT'S NEW

1772   2.34, 2013-02-10 Parallelism in dbmerge.
1773       IMPROVEMENT
1774           Documentation for dbjoin now includes resource requirements.
1775
1776       IMPROVEMENT
1777           Default memory usage for dbsort is now about 256MB.  (The world
1778           keeps moving forward.)
1779
1780       IMPROVEMENT
1781           dbmerge now does merging in parallel.  As a side-effect, dbsort
1782           should be faster when input overflows memory.  The level of
1783           parallelism can be limited with the "--parallelism" option.  (There
1784           is more work to do here, but we're off to a start.)
1785
1786   2.35, 2013-02-23 Improvements to dbmerge parallelism
1787       BUG FIX
1788           Fsdb temporary files are now created more securely (with
1789           File::Temp).
1790
1791       IMPROVEMENT
1792           Programs that sort or merge on fields (dbmerge2, dbmerge, dbsort,
1793           dbjoin) now report an error if no fields on which to join or merge
1794           are given.
1795
1796       IMPROVEMENT
1797           Parallelism in dbmerge should now be more consistent, with less
1798           starting and stopping.
1799
1800       IMPROVEMENT In dbmerge, the "--xargs" option lets one give input
1801       filenames on standard input, rather than the command line. This feature
1802       paves the way for faster dbsort for large inputs (by pipelining sorting
1803       and merging), expected in the next release.
1804
1805   2.36, 2013-02-25 dbsort pipelines with dbmerge
1806       IMPROVEMENT For large inputs, dbsort now pipelines sorting and merging,
1807       allowing earlier processing.
1808       BUG FIX Since 2.35, dbmerge delayed cleanup of intermediate files,
1809       thereby requiring extra disk space.
1810
1811   2.37, 2013-02-26 quick bugfix to support parallel sort and merge from
1812       recent releases
1813       BUG FIX Since 2.35, dbmerge delayed removal of input files given by
1814       "--xargs".  This problem is now fixed.
1815
1816   2.38, 2013-04-29 minor bug fixes
1817       CLARIFICATION
1818           Configure now rejects Windows since tests seem to hang on some
1819           versions of Windows.  (I would love help from a Windows developer
1820           to get this problem fixed, but I cannot do it.)  See
1821           https://rt.cpan.org/Ticket/Display.html?id=84201.
1822
1823       IMPROVEMENT
1824           All programs that use temporary files (dbcolpercentile,
1825           dbcolscorrelate, dbcolstats, dbcolstatscores) now take the "-T"
1826           option and set the temporary directory consistently.
1827
1828           In addition, error messages are better when the temporary directory
1829           has problems.  Problem reported by Liang Zhu.
1830
1831       BUG FIX
1832           dbmapreduce was failing with external, map-reduce aware reducers
1833           (when invoked with -M and an external program).  (Sigh, did this
1834           case ever work?)  This case should now work.  Thanks to Yuri
1835           Pradkin for reporting this bug (in 2011).
1836
1837       BUG FIX
1838           Fixed perl-5.10 problem with dbmerge.  Thanks to Yuri Pradkin for
1839           reporting this bug (in 2013).
1840
1841   2.39, 2013-05-31 quick release for the dbrowuniq extension
1842       BUG FIX
1843           Actually in 2.38, the Fedora .spec got cleaner dependencies.
1844           Suggestion from Christopher Meng via
1845           <https://bugzilla.redhat.com/show_bug.cgi?id=877096>.
1846
1847       ENHANCEMENT
1848           Fsdb files are now explicitly set into UTF-8 encoding, unless one
1849           specifies "-encoding" to "Fsdb::IO".
1850
1851       ENHANCEMENT
1852           dbrowuniq now supports "-I" for incremental counting.
1853
1854   2.40, 2013-07-13 small bug fixes
1855       BUG FIX
1856           dbsort now has more respect for a user-given temporary directory;
1857           it no longer is ignored for merging.
1858
1859       IMPROVEMENT
1860           dbrowuniq now has options to output the first, last, and both first
1861           and last rows of a run ("-F", "-L", and "-B").
1862
1863       BUG FIX
1864           dbrowuniq now correctly handles "-N".  Sigh, it didn't work before.
1865
1866   2.41, 2013-07-29 small bug and packaging fixes
1867       ENHANCEMENT
1868           Documentation to dbrvstatdiff improved (inspired by questions from
1869           Qian Kun).
1870
1871       BUG FIX
1872           dbrowuniq no longer duplicates singleton unique lines when
1873           outputting both (with "-B").
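
               The first/last/both behavior, including the singleton case
               fixed here, can be sketched in Python.  This is an illustration
               of the semantics described in these notes, not dbrowuniq's
               code; the function and its mode names ("first", "last", "both",
               loosely standing in for "-F", "-L", and "-B") are hypothetical.

```python
from itertools import groupby


def uniq_rows(rows, key, mode="first"):
    """Keep the first, last, or both rows of each run of equal keys."""
    out = []
    for _, run in groupby(rows, key=key):
        run = list(run)
        if mode == "first":
            out.append(run[0])
        elif mode == "last":
            out.append(run[-1])
        elif mode == "both":
            out.append(run[0])
            # A run of one row is emitted once, not twice.
            if len(run) > 1:
                out.append(run[-1])
        else:
            raise ValueError("mode must be first, last, or both")
    return out


rows = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("c", 5), ("c", 6)]
print(uniq_rows(rows, key=lambda r: r[0], mode="both"))
```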
1874
1875       BUG FIX
1876           Add missing "XML::Simple" dependency to Makefile.PL.
1877
1878       ENHANCEMENT
1879           Tests now show the diff of the failing output if run with "make
1880           test TEST_VERBOSE=1".
1881
1882       ENHANCEMENT
1883           dbroweval now includes documentation for how to output extra rows.
1884           Suggestion from Yuri Pradkin.
1885
1886       BUG FIX
1887           Several improvements to the Fedora package from Michael Schwendt
1888           via <https://bugzilla.redhat.com/show_bug.cgi?id=877096>, and from
1889           the harsh master that is rpmlint.  (I am stymied at teaching it
1890           that "outliers" is spelled correctly.  Maybe I should send it
1891           Schneier's book.  And an unresolvable invalid-spec-name lurks in
1892           the SRPM.)
1893
1894   2.42, 2013-07-31 A bug fix and packaging release.
1895       ENHANCEMENT
1896           Documentation to dbjoin improved to better describe memory usage.
1897           (Based on problem report by Lin Quan.)
1898
1899       BUG FIX
1900           The .spec is now perl-Fsdb.spec to satisfy rpmlint.  Thanks to
1901           Christopher Meng for a specific bug report.
1902
1903       BUG FIX
1904           Test dbroweval_last.cmd no longer has a column that caused failures
1905           because of numerical instability.
1906
1907       BUG FIX
1908           Some tests now better handle bugs in old versions of perl (5.10,
1909           5.12).  Thanks to Calvin Ardi for help debugging this on a Mac with
1910           perl-5.12, but the fix should affect other platforms.
1911
1912   2.43, 2013-08-27 Adds in-file compression.
1913       BUG FIX
1914           Changed the sort on TEST/dbsort_merge.cmd to strings (from
1915           numerics) so we're less susceptible to false test-failures due to
1916           floating point IO differences.
1917
1918       EXPERIMENTAL ENHANCEMENT
1919           Yet more parallelism in dbmerge: new "endgame-mode" builds a merge
1920           tree of processes at the end of large merge tasks to get maximal
1921           parallelism.  Currently this feature is off by default because it
1922           can hang for some inputs.  Enable this experimental feature with
1923           "--endgame".
1924
1925       ENHANCEMENT
1926           "Fsdb::IO" now handles being given "IO::Pipe" objects (as exercised
1927           by dbmerge).
1928
1929       BUG FIX
1930           Handling of NamedTmpfiles now supports concurrency.  This fix will
1931           hopefully fix occasional "Use of uninitialized value $_ in string
1932           ne at ...NamedTmpfile.pm line 93."  errors.
1933
1934       BUG FIX
1935           Fsdb now requires perl 5.10.  This is a bug fix because some test
1936           cases used to require it, but this fact was not properly
1937           documented.  (Back-porting to 5.008 would require removing all "//"
1938           operators.)
1939
1940       ENHANCEMENT
1941           Fsdb now handles automatic compression of file contents.  Enable
1942           compression with "dbfilealter -Z xz" (or "gz" or "bz2").  All
1943           programs should operate on compressed files and leave the output
1944           with the same level of compression.  "xz" is recommended as fastest
1945           and most efficient.  "gz" produces unrepeatable output (and so
1946           has no output test); it seems to insist on adding a timestamp.
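
               The timestamp effect is a property of the gzip format itself
               and can be reproduced with Python's standard gzip module.  The
               sketch below only illustrates why gzip output can differ from
               run to run; it does not use Fsdb, and it is not a claim about
               how Fsdb invokes compression.

```python
import gzip

data = b"#fsdb experiment duration\nufs_mab_sys 37.2\n"

# The gzip header carries a modification time, so the same input
# compressed with different timestamps yields different bytes.
early = gzip.compress(data, mtime=1)
later = gzip.compress(data, mtime=2)
print(early == later)

# Pinning the timestamp makes the compressed output repeatable.
print(gzip.compress(data, mtime=0) == gzip.compress(data, mtime=0))
```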
1947
1948   2.44, 2013-10-02 A major change--all threads are gone.
1949       ENHANCEMENT
1950           Fsdb is now thread free and only uses processes for parallelism.
1951           This change is a big change--the entire motivation for Fsdb-2 was
1952           to exploit parallelism via threading.  Parallelism--good, but perl
1953           threading--bad for performance.  Horribly bad for performance.
1954           About 20x worse than pipes on my box.  (See perl bug #119445 for
1955           the discussion.)
1956
1957       NEW "Fsdb::Support::Freds" provides a thread-like abstraction over
1958           forking, with some nice support for callbacks in the parent upon
1959           child termination.
1960
1961       ENHANCEMENT
1962           Details about removing threads: "dbpipeline" is thread free, and
1963           new tests verify each of its parts.  The easy cases are
1964           "dbcolpercentile", "dbcolstats", "dbfilepivot", "dbjoin", and
1965           "dbcolstatscores", each of which uses it in simple ways
1966           (2013-09-09).  "dbmerge" is now thread free (2013-09-13), but was a
1967           significant rewrite, which brought "dbsort" along.  "dbmapreduce"
1968           is partly thread free (2013-09-21), again as a rewrite, and it
1969           brings "dbmultistats" along.  Full "dbmapreduce" support took much
1970           longer (2013-10-02).
1971
1972       BUG FIX
1973           When running with user-only output ("-n"), dbroweval now resets the
1974           output vector $ofref after it has been output.
1975
1976       NEW dbcolcreate will create all columns at the head of each row with
1977           the "--first" option.
1978
1979       NEW dbfilecat will concatenate two files, verifying that they have the
1980           same schema.
1981
1982       ENHANCEMENT
1983           dbmapreduce now passes comments through, rather than eating them as
1984           before.
1985
1986           Also, dbmapreduce now supports a "--" option to prevent
1987           misinterpreting sub-program parameters as for dbmapreduce.
1988
1989       INCOMPATIBLE CHANGE
1990           dbmapreduce no longer figures out if it needs to add the key to the
1991           output.  For multi-key-aware reducers, it never does (and cannot).
1992           For non-multi-key-aware reducers, it defaults to add the key and
1993           will now fail if the reducer adds the key (with error "dbcolcreate:
1994           attempt to create pre-existing column...").  In such cases, one
1995           must disable adding the key with the new option "--no-prepend-key".
1996
1997       INCOMPATIBLE CHANGE
1998           dbmapreduce no longer copies the input field separator by default.
1999           For multi-key-aware reducers, it never does (and cannot).  For non-
2000           multi-key-aware reducers, it defaults to not copying the field
2001           separator, but it will copy it (the old default) with the
2002           "--copy-fs" option.
2003
2004   2.45, 2013-10-07 cleanup from de-thread-ification
2005       BUG FIX
2006           Corrected a fast busy-wait in dbmerge.
2007
2008       ENHANCEMENT
2009           Endgame mode enabled in dbmerge; it (and also large cases of
2010           dbsort) should now exploit greater parallelism.
2011
2012       BUG FIX
2013           Test case with "Fsdb::BoundedQueue" (gone since 2.44) now removed.
2014
2015   2.46, 2013-10-08 continuing cleanup of our no-threads version
2016       BUG FIX
2017           Fixed some packaging details.  (Really, threads are no longer
2018           required, missing tests in the MANIFEST.)
2019
2020       IMPROVEMENT
2021           dbsort now better communicates with the merge process to avoid
2022           bursty parallelism.
2023
2024           Fsdb::IO::Writer now can take "-autoflush => 1" for line-buffered
2025           IO.
2026
2027   2.47, 2013-10-12 test suite cleanup for non-threaded perls
2028       BUG FIX
2029           Removed some stray "use threads" in some test cases.  We didn't
2030           need them, and these were breaking non-threaded perls.
2031
2032       BUG FIX
2033           Better handling of Fred cleanup; should fix intermittent
2034           dbmapreduce failures on BSD.
2035
2036       ENHANCEMENT
2037           Improved test framework to show output when tests fail.  (This
2038           time, for real.)
2039
2040   2.48, 2014-01-03 small bugfixes and improved release engineering
2041       ENHANCEMENT
2042           Test suites now skip tests for libraries that are missing.  (Patch
2043           for missing "IO::Compress::Xz" contributed by Calvin Ardi.)
2044
2045       ENHANCEMENT
2046           Removed references to Jdb in the package specification.  Since the
2047           name was changed in 2008, there's no longer a huge need for
2048           backwards compatibility.  (Suggestion from Petr Šabata.)
2049
2050       ENHANCEMENT
2051           Test suites now invoke the perl using the path from
2052           $Config{perlpath}.  Hopefully this helps testing in environments
2053           where there are multiple installed perls and the default perl is
2054           not the same as the perl-under-test (as happens in
2055           cpantesters.org).
2056
2057       BUG FIX
2058           Added specific encoding to this manpage to account for Unicode.
2059           Required to build correctly against perl-5.18.
2060
2061   2.49, 2014-01-04 bugfix to unicode handling in Fsdb IO (plus minor
2062       packaging fixes)
2063       BUG FIX
2064           Restored a line in the .spec to chmod g-s.
2065
2066       BUG FIX
2067           Unicode decoding is now handled correctly for programs that read
2068           from standard input.  (Also: New test scripts cover unicode input
2069           and output.)
2070
2071       BUG FIX
2072           Fix to Fsdb documentation encoding line.  Addresses test failure in
2073           perl-5.16 and earlier.  (Who knew "encoding" had to be followed by
2074           a blank line.)
2075

WHAT'S NEW

2077   2.50, 2014-05-27 a quick release for spec tweaks
2078       ENHANCEMENT
2079           In dbroweval, the "-N" (no output, even comments) option now
2080           implies "-n", and it now suppresses the header and trailer.
2081
2082       BUG FIX
2083           A few more tweaks to the perl-Fsdb.spec from Petr Šabata.
2084
2085       BUG FIX
2086           Fixed 3 uses of "use v5.10" in test suites that were causing test
2087           failures (due to warnings, not real failures) on some platforms.
2088
2089   2.51, 2014-09-05 Feature enhancements to dbcolmovingstats, dbcolcreate,
2090       dbmapreduce, and new sqlselect_to_db
2091       ENHANCEMENT
2092           dbcolcreate now has a "--no-recreate-fatal" that causes it to
2093           ignore creation of existing columns (instead of failing).
2094
2095       ENHANCEMENT
2096           dbmapreduce once again is robust to reducers that output the key;
2097           "--no-prepend-key" is no longer mandatory.
2098
2099       ENHANCEMENT
2100           dbcolsplittorows can now enumerate the output rows with "-E".
2101
2102       BUG FIX
2103           dbcolmovingstats is more mathematically robust.  Previously for
2104           some inputs and some platforms, floating point rounding could
2105           sometimes cause square roots of negative numbers.
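
               One common guard against this class of problem is to clamp a
               slightly negative variance to zero before taking the square
               root.  The Python sketch below shows that technique under
               stated assumptions; the function name is hypothetical, and
               these notes do not say which fix dbcolmovingstats adopted.

```python
import math


def stddev_from_sums(n, total, total_sq):
    """Sample standard deviation from n, sum(x), and sum(x*x)."""
    if n < 2:
        return 0.0
    variance = (total_sq - total * total / n) / (n - 1)
    # Rounding can push a near-zero variance slightly below zero;
    # clamp it so math.sqrt never sees a negative number.
    return math.sqrt(max(variance, 0.0))


# Sums for the values 1 and 2.
print(stddev_from_sums(2, 3.0, 5.0))
# Sums perturbed as if by rounding: the raw variance is negative.
print(stddev_from_sums(3, 3.0, 2.9999999999))
```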
2106
2107       NEW sqlselect_to_db converts the output of the MySQL or MariaDB select
2108           command into fsdb format.
2109
2110       INCOMPATIBLE CHANGE
2111           dbfilediff now outputs the second row when doing sloppy numeric
2112           comparisons, to better support test suites.
2113
2114   2.52, 2014-11-03 Fixing the test suite for line number changes.
2115       ENHANCEMENT
2116           Test suites changed to be robust to exact line numbers of failures,
2117           since different Perl releases fail on different lines.
2118           <https://bugzilla.redhat.com/show_bug.cgi?id=1158380>
2119
2120   2.53, 2014-11-26 bug fixes and stability improvements to dbmapreduce
2121       ENHANCEMENT
2122           dbfilediff now supports a "--quiet" option.
2123
2124       ENHANCEMENT
2125           Better documentation of dbpipeline_filter.
2126
2127       BUGFIX
2128           Added groff-base and perl-podlators to the Fedora package spec.
2129           Fixes <https://bugzilla.redhat.com/show_bug.cgi?id=1163149>.  (Also
2130           in package 2.52-2.)
2131
2132       BUGFIX
2133           An important stability improvement to dbmapreduce.  It, plus
2134           dbmultistats, and dbcolstats now support controlled parallelism
2135           with the "--parallelism=N" option.  They default to run with the
2136           number of available CPUs.  dbmapreduce also moderates its level of
2137           parallelism.  Previously it would create reducers as needed,
2138           causing CPU thrashing if reducers ran much slower than data
2139           production.
2140
2141       BUGFIX
2142           The combination of dbmapreduce with dbrowenumerate now works as it
2143           should.  (The obscure bug was an interaction with dbcolcreate with
2144           non-multi-key reducers that output their own key.  dbmapreduce has
2145           too many useful corner cases.)
2146
2147   2.54, 2014-11-28 fix for the test suite to correct failing tests on not-my-
2148       platform
2149       BUGFIX
2150           Sigh, the test suite now has a test suite.  Because, yes, I broke
2151           it, causing many incorrect failures at cpantesters.  Now fixed.
2152
2153   2.55, 2015-01-05 many spelling fixes and dbcolmovingstats tests are more
2154       robust to different numeric precision
2155       ENHANCEMENT
2156           dbfilediff now can be extra quiet, as I continue to try to track
2157           down a numeric difference on FreeBSD AMD boxes.
2158
2159       ENHANCEMENT
2160           dbcolmovingstats gave different test output (just reflecting
2161           rounding error) when stddev approaches zero.  We now detect and
2162           handle this case.  See
2163           <https://rt.cpan.org/Public/Bug/Display.html?id=101220> and thanks
2164           to H. Merijn Brand for the bug report.
2165
2166       BUG FIX
2167           Many, many spelling bugs found by H. Merijn Brand; thanks for the
2168           bug report.
2169
2170       INCOMPATIBLE CHANGE
2171           A number of programs had misspelled "separator" in
2172           "--fieldseparator" and "--columnseparator" options as "seperator".
2173           These are now correctly spelled.
2174
2175   2.56, 2015-02-03 fix against Getopt::Long-2.43's stricter error checking
2176       BUG FIX
2177           Internal argument parsing uses Getopt::Long, but mixed pass-through
2178           and <>.  Bug reported by Petr Pisar at
2179           <https://bugzilla.redhat.com/show_bug.cgi?id=1188538>.
2180
2181       BUG FIX
2182           Added missing BuildRequires for "XML::Simple".
2183
2184   2.57, 2015-04-29 Minor changes, with better performance from dbmultistats.
2185       BUG FIX
2186           dbfilecat now honors "--remove-inputs" (previously it didn't).
2187           This omission meant that dbmapreduce (and dbmultistats) would
2188           accumulate files in /tmp when running.  Bad news for inputs with 4M
2189           keys.
2190
2191       ENHANCEMENT
2192           dbmultistats should be faster with lots of small keys.  dbcolstats
2193           now supports "-k" to get some of the functionality of dbmultistats
2194           (if data is pre-sorted and median/quartiles are not required).
2200
2201   2.58, 2015-04-30 Bugfix in dbmerge
2202       BUG FIX
2203           Fixed a case where dbmerge suffered mojibake in endgame mode.  This
2204           bug surfaced when dbsort was applied to large files (big enough to
2205           require merging) with unicode in them; the symptom was something
2206           like:
2207             Wide character in print at /usr/lib64/perl5/IO/Handle.pm line
2208           420, <GEN12> line 111.
2209
2210   2.59, 2016-09-01 Collect a few small bug fixes and documentation
2211       improvements.
2212       BUG FIX
2213           More IO is explicitly marked UTF-8 to avoid Perl's tendency to
2214           mojibake on otherwise valid unicode input.  This change helps
2215           html_table_to_db.
2216
2217       ENHANCEMENT
2218           dbcolscorrelate now cross-references dbcolsregression.
2219
2220       ENHANCEMENT
2221           Documentation for dbrowdiff now clarifies that the default is
2222           baseline mode.
2223
2224       BUG FIX
2225           dbjoin now propagates "-T" into the sorting process (if it is
2226           required).  Thanks to Lan Wei for reporting this bug.
2227
2228   2.60, 2016-09-04 Adds support for hash joins.
2229       ENHANCEMENT
2230           dbjoin now supports hash joins with "-t lefthash" and "-t
2231           righthash".  Hash joins cache a table in memory, but do not require
2232           that the other table be sorted.  They are ideal when joining a
2233           large table against a small one.
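
           The idea behind a hash join can be sketched as below.  This is an
           illustrative Python sketch, not the dbjoin implementation; the
           function name hash_join, the "fs" column, and the sample rows are
           hypothetical.

```python
def hash_join(small_rows, big_rows, key):
    """Inner-join two lists of dict rows on `key`, caching small_rows in memory."""
    # Build phase: index the small table by its join key (the cached table).
    index = {}
    for row in small_rows:
        index.setdefault(row[key], []).append(row)

    # Probe phase: stream the big table in whatever order it arrives;
    # no sorting of either input is needed.
    joined = []
    for row in big_rows:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined

small = [
    {"experiment": "ufs_mab_sys", "fs": "ufs"},
    {"experiment": "ufs_rcp_real", "fs": "ufs"},
]
big = [
    {"experiment": "ufs_rcp_real", "duration": 264.5},
    {"experiment": "ufs_mab_sys", "duration": 37.2},
    {"experiment": "ufs_rcp_real", "duration": 277.9},
]
result = hash_join(small, big, "experiment")
```

           Memory use grows with the cached table only, which is why the
           approach suits a large table joined against a small one.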
2234
2235   2.61, 2016-09-05 Support left and right outer joins.
2236       ENHANCEMENT
2237           dbjoin now handles left and right outer joins with "-t left" and
2238           "-t right".
2239
2240       ENHANCEMENT
2241           dbjoin hash joins are now selected with "-m lefthash" and "-m
2242           righthash" (not the short-lived "-t righthash" option).
2243           (Technically this change is incompatible with Fsdb-2.60, but no one
2244           but me ever used that version.)
2245
2246   2.62, 2016-11-29 A new yaml_to_db and other minor improvements.
2247       ENHANCEMENT
2248           Documentation for xml_to_db now includes sample output.
2249
2250       NEW yaml_to_db converts a specific form of YAML to fsdb.
2251
2252       BUG FIX
2253           The test suite now uses "diff -c -b" rather than "diff -cb" to make
2254           OpenBSD-5.9 happier, I hope.
2255
2256       ENHANCEMENT
2257           Comments that log operations at the end of each file now do simple
2258           quoting of spaces.  (It is not guaranteed to be fully shell-
2259           compliant.)
2260
2261       ENHANCEMENT
2262           There is a new standard option, "--header", allowing one to specify
2263           an Fsdb header for inputs that lack it.  Currently it is supported
2264           by dbcoldefine, dbrowuniq, dbmapreduce, dbmultistats, dbsort,
2265           dbpipeline.
2266
2267       ENHANCEMENT
2268           dbfilepivot now allows the --possible-pivots option, and if it is
2269           provided processes the data in one pass.
2270
2271       ENHANCEMENT
2272           dbroweval logs are now quoted.
2273
2274   2.63, 2017-02-03 Re-add some features supposedly in 2.62 but not, and add
2275       more --header options.
2276       ENHANCEMENT
2277           The option -j is now a synonym for --parallelism.  (And several
2278           documentation bugs about this option are fixed.)
2279
2280       ENHANCEMENT
2281           Additional support for "--header" in dbcolmerge, dbcol, dbrow, and
2282           dbroweval.
2283
2284       BUG FIX
2285           Version 2.62 was supposed to have this improvement, but did not
2286           (and now does): dbfilepivot now allows the --possible-pivots
2287           option, and if it is provided processes the data in one pass.
2288
2289       BUG FIX
2290           Version 2.62 was supposed to have this improvement, but did not
2291           (and now does): dbroweval logs are now quoted.
2292
2293   2.64, 2017-11-20 several small bugfixes and enhancements
2294       BUG FIX
2295           In dbroweval, the "next row" option previously did not correctly
2296           set up "_last_fieldname".  It now does.
2297
2298       ENHANCEMENT
2299           The csv_to_db converter now has an optional "-F x" option to set
2300           the field separator.
2301
2302       ENHANCEMENT
2303           Finally dbcolsplittocols has a "--header" option, and a new "-N"
2304           option to give the list of resulting output columns.
2305
2306       INCOMPATIBLE CHANGE
2307           Now dbcolstats and dbmultistats produce no output (but a schema)
2308           when given no input but a schema.  Previously they gave a null row
2309           of output.  The "--output-on-no-input" and
2310           "--no-output-on-no-input" options can control this behavior.
2311
2312   2.65, 2018-02-16 Minor release, bug fix and -F option.
2313       ENHANCEMENT
2314           dbmultistats and dbmapreduce now both take a "-F x" option to set
2315           the field separator.
2316
2317       BUG FIX
2318           Fixed missing "use Carp" in dbcolstats.  Also went back and cleaned
2319           up all uses of "croak()".  Thanks to Zefram for the bug report.
2320
2321   2.66, 2018-12-20 Critical bug fix in dbjoin.
2322       BUG FIX
2323           Removed old tests from MANIFEST.  (Thanks to Hang Guo for reporting
2324           this bug.)
2325
2326       IMPROVEMENT
2327           Errors for non-existing input files now include the bad filename
2328           (before: "cannot setup filehandle", now: "cannot open input: cannot
2329           open TEST/bad_filename").
2330
2331       BUG FIX
2332           Hash joins with three identical rows were failing with the
2333           assertion failure "internal error: confused about overflow" due to
2334           a now-fixed bug.
2335
2336   2.67, 2019-07-10 add support for reading and writing hdfs
2337       IMPROVEMENT
2338           dbformmail now has an "mh" mechanism that writes messages to
2339           individual files (an mh-style mailbox).
2340
2341       BUG FIX
2342           dbrow failed to include the Carp library, leading to failures on
2343           croak.
2344
2345       BUG FIX
2346           The dbjoin error message for an unsorted right stream was
2347           incorrect (it said left); this is now fixed.
2348
2349       IMPROVEMENT
2350           All Fsdb programs can now read from and write to HDFS, when files
2351           that start with "hdfs:" are given to -i and -o options.
2352
2353   2.68, 2019-09-19 All programs now support automatic decompression based on
2354       file extension.
2355       IMPROVEMENT
2356           The omitted-possible-error test case for dbfilepivot now has an
2357           altnerative output that I saw on some BSD-running systems (thanks
2358           to CPAN).
2359
2360       IMPROVEMENT
2361           dbmerge and dbmerge2 now support "--header".  dbmerge2 now gives
2362           better error messages when presented the wrong number of inputs.
2363
2364       BUG FIX
2365           dbsort now works with "--header" even when the file is big (due to
2366           fixes to dbmerge).
2367
2368       IMPROVEMENT
2369           cvs_to_db now processes data with the "binary" option, allowing it
2370           to handle newlines embedded in quoted fields.
2371
2372       IMPROVEMENT
2373           All programs now will transparently decompress input files, if they
2374           are listed as a filename as an input argument that extends with a
2375           standard extension (.gz, .bz2, and .xz).
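
           Choosing a decompressor from the file extension can be sketched
           as below.  This is an illustrative Python sketch, not the Fsdb
           implementation; the names OPENERS and open_input and the sample
           file are hypothetical.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Map each recognized extension to a function that opens that format.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}

def open_input(path):
    """Open `path` for reading text, decompressing based on its extension."""
    for extension, opener in OPENERS.items():
        if path.endswith(extension):
            return opener(path, "rt", encoding="utf-8")
    # No known extension: treat the file as plain text.
    return open(path, "r", encoding="utf-8")

# Demonstration: write a small gzip-compressed file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.fsdb.gz")
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write("#fsdb experiment duration\nufs_mab_sys 37.2\n")
    with open_input(path) as handle:
        lines = handle.read().splitlines()
```

           The caller reads the returned handle the same way whether or not
           the file was compressed.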
2376
2377   2.69, 2019-11-22 a small bugfix in dbcolstats
2378       BUG FIX
2379           Filled in the the test case for autodecompress, which was missing
2380           for the 2.68 release.
2381
2382       ENHANCEMENT
2383           The groff program is required for build, and the "Makefile.PL"
2384           fails if groff is missing at build time.  Thanks to Chris Williams
2385           for suggesting this check, and the CPAN auto-building system for
2386           trying many platforms.
2387
2388       BUG FIX
2389           The dbcolstats program had numerical instability that sometimes
2390           results in failing with a square-root of a negative number when
2391           many values varied right at the edge of floating-point precision.
2392           We now detect and report that case as 0 stddev.  Thanks to Hang Guo
2393           for providing a test case.
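
           This kind of failure and its guard can be sketched as below.  This
           is an illustrative Python sketch that assumes a sum-of-squares
           variance formula; it is not the dbcolstats code, and the function
           name sample_stddev is hypothetical.

```python
import math

def sample_stddev(values):
    """Sample standard deviation that reports 0 when rounding makes variance negative."""
    n = len(values)
    if n < 2:
        return 0.0
    total = sum(values)
    total_sq = sum(v * v for v in values)
    # Algebraically this is never negative, but when all values are nearly
    # equal the two terms almost cancel and rounding can push it below zero.
    variance = (total_sq - total * total / n) / (n - 1)
    if variance < 0.0:
        return 0.0  # report 0 instead of failing in sqrt
    return math.sqrt(variance)

spread = sample_stddev([1.0, 2.0, 3.0, 4.0])
flat = sample_stddev([0.1, 0.1, 0.1])
```

           Without the guard, math.sqrt would raise an error on a negative
           argument, which corresponds to the failure described above.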
2394
2395   2.70, 2020-11-12 Some small quality-of-life enhancements and corner-case
2396       bugfixes.
2397       ENHANCEMENT
2398           dbcol can now take an option "-a" to include all columns, allowing
2399           reordering of certain columns while passing the rest through.
2400
2401       ENHANCEMENT
2402           dbrowuniq and dbmerge now buffer comments in a way that the last
2403           row of data output is no longer in the last block of comments.
2404           (The data is identical, but for humans looking at output, this
2405           change makes it less likely to lose the last row.)
2406
2407       BUG FIX
2408           dbmultistats and dbpipeline documentation now indicates that they
2409           support "--header" (something they did since version 2.62 in
2410           2016-11-29, but now documented.
2411
2412       ENHANCEMENT
2413           dbcolcreate now supports "--header".
2414
2415       BUG FIX
2416           Fixed several spelling errors in deprecated programs and removed
2417           information about the no-longer existing FreeBSD and MacOS ports.
2418           Thanks to Calvin Ardi for the patch.
2419
2420       BUG FIX
2421           dbmerge now handles --xargs when only one file is provided (and
2422           passes the file through unchanged).  It also throws a clean error
2423           with --xargs if zero files are provided.  (To support dbmerge,
2424           dbcol now has an internal "--saveoutput" option.)  Thanks to Yuri
2425           Pradkin for reporting the unhandled corner-case.
2426
2427   2.71, 2020-11-16 Fix a race condition breaking test suites.
2428       BUG FIX
2429           Suppress a race condition in dbcolmerge was sometimes throwing the
2430           error "Fsdb::Support::Freds: ending, but running process:
2431           dbmerge:xargs" in the dbmerge_0_xargs test case, on exit.
2432

AUTHOR

2434       John Heidemann, "johnh@isi.edu"
2435
2436       See "Contributors" for the many people who have contributed bug reports
2437       and fixes.
2438

COPYRIGHT

2440       Fsdb is Copyright (C) 1991-2020 by John Heidemann <johnh@isi.edu>.
2441
2442       This program is free software; you can redistribute it and/or modify it
2443       under the terms of version 2 of the GNU General Public License as
2444       published by the Free Software Foundation.
2445
2446       This program is distributed in the hope that it will be useful, but
2447       WITHOUT ANY WARRANTY; without even the implied warranty of
2448       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
2449       General Public License for more details.
2450
2451       You should have received a copy of the GNU General Public License along
2452       with this program; if not, write to the Free Software Foundation, Inc.,
2453       675 Mass Ave, Cambridge, MA 02139, USA.
2454
2455       A copy of the GNU General Public License can be found in the file
2456       ``COPYING''.
2457

COMMENTS and BUG REPORTS

2459       Any comments about these programs should be sent to John Heidemann
2460       "johnh@isi.edu".
2461
2462
2463
2464perl v5.32.1                      2021-01-27                           Fsdb(3)