Fsdb(3pm)

Fsdb(3)               User Contributed Perl Documentation              Fsdb(3)

NAME

6       Fsdb - a flat-text database for shell scripting
7

SYNOPSIS

       Fsdb, the flatfile streaming database, is a package of commands for
       manipulating flat-ASCII databases from shell scripts.  Fsdb is useful
       to process medium amounts of data (with very little data you'd do it by
       hand, with megabytes you might want a real database).  Fsdb was known
       as Jdb from 1991 to Oct. 2008.
14
15       Fsdb is very good at doing things like:
16
17       ·   extracting measurements from experimental output
18
19       ·   examining data to address different hypotheses
20
21       ·   joining data from different experiments
22
23       ·   eliminating/detecting outliers
24
25       ·   computing statistics on data (mean, confidence intervals,
26           correlations, histograms)
27
28       ·   reformatting data for graphing programs
29
30       Fsdb is built around the idea of a flat text file as a database.  Fsdb
31       files (by convention, with the extension .fsdb), have a header
32       documenting the schema (what the columns mean), and then each line
33       represents a database record (or row).
34
35       For example:
36
37               #fsdb experiment duration
38               ufs_mab_sys 37.2
39               ufs_mab_sys 37.3
40               ufs_rcp_real 264.5
41               ufs_rcp_real 277.9
42
       is a simple file with four experiments (the rows), each with a
       description and run time in the first and second columns.
46
       Rather than hand-code scripts to do each special case, Fsdb provides
       higher-level functions.  Although it's often easy to throw together a
       custom script to do any single task, I believe that there are several
       advantages to using Fsdb:
51
52       ·   these programs provide a higher level interface than plain Perl, so
53
54           **  Fewer lines of simpler code:
55
56                   dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
57
58               Picks out just one type of experiment and computes statistics
59               on it, rather than:
60
                   while (<>) { @F = split; $sum+=$F[1]; $ss+=$F[1]**2; $n++; }
                   $mean = $sum / $n; $std_dev = ...
63
64               in dozens of places.
65
66       ·   the library uses names for columns, so
67
68           **  No more $F[1], use "_duration".
69
70           **  New or different order columns?  No changes to your scripts!
71
72           Thus if your experiment gets more complicated with a size
73           parameter, so your log changes to:
74
75                   #fsdb experiment size duration
76                   ufs_mab_sys 1024 37.2
77                   ufs_mab_sys 1024 37.3
78                   ufs_rcp_real 1024 264.5
79                   ufs_rcp_real 1024 277.9
80                   ufs_mab_sys 2048 45.3
81                   ufs_mab_sys 2048 44.2
82
83           Then the previous scripts still work, even though duration is now
84           the third column, not the second.
85
86       ·   A series of actions are self-documenting (each program records what
87           it does).
88
89           **  No more wondering what hacks were used to compute the final
90               data, just look at the comments at the end of the output.
91
92           For example, the commands
93
94               dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
95
96           add to the end of the output the lines
97               #    | dbrow _experiment eq "ufs_mab_sys"
98               #    | dbcolstats duration
99
100       ·   The library is mature, supporting large datasets (more than 100GB),
101           corner cases, error handling, backed by an automated test suite.
102
103           **  No more puzzling about bad output because your custom script
104               skimped on error checking.
105
106           **  No more memory thrashing when you try to sort ten million
107               records.
108
109       ·   Fsdb-2.x supports Perl scripting (in addition to shell scripting),
110           with libraries to do Fsdb input and output, and easy support for
111           pipelines.  The shell script
112
113               dbcol name test1 | dbroweval '_test1 += 5;'
114
115           can be written in perl as:
116
117               dbpipeline(dbcol(qw(name test1)), dbroweval('_test1 += 5;'));
118
119       (The disadvantage is that you need to learn what functions Fsdb
120       provides.)
121
122       Fsdb is built on flat-ASCII databases.  By storing data in simple text
123       files and processing it with pipelines it is easy to experiment (in the
124       shell) and look at the output.  To the best of my knowledge, the
125       original implementation of this idea was "/rdb", a commercial product
126       described in the book UNIX relational database management: application
127       development in the UNIX environment by Rod Manis, Evan Schaffer, and
128       Robert Jorgensen (and also at the web page <http://www.rdb.com/>).
129       Fsdb is an incompatible re-implementation of their idea without any
130       accelerated indexing or forms support.  (But it's free, and probably
131       has better statistics!).
132
133       Fsdb-2.x will exploit multiple processors or cores, and provides Perl-
134       level support for input, output, and threaded-pipelines.  (As of
135       Fsdb-2.44 it no longer uses Perl threading, just processes, since they
136       are faster.)
137
138       Installation instructions follow at the end of this document.  Fsdb-2.x
139       requires Perl 5.8 to run.  All commands have manual pages and provide
140       usage with the "--help" option.  All commands are backed by an
141       automated test suite.
142
143       The most recent version of Fsdb is available on the web at
144       <http://www.isi.edu/~johnh/SOFTWARE/FSDB/index.html>.
145

WHAT'S NEW

147   2.71, 2020-11-16 Fix a race condition breaking test suites.
148       BUG FIX
           Suppress a race condition in dbcolmerge that was sometimes throwing
           the error "Fsdb::Support::Freds: ending, but running process:
           dbmerge:xargs" in the dbmerge_0_xargs test case, on exit.
152
153       ENHANCEMENT
154           dbcolcreate now supports "--header".
155
156       BUG FIX
157           Fixed several spelling errors in deprecated programs and removed
158           information about the no-longer existing FreeBSD and MacOS ports.
159           Thanks to Calvin Ardi for the patch.
160
161       BUG FIX
162           dbmerge now handles --xargs when only one file is provided (and
163           passes the file through unchanged).  It also throws a clean error
164           with --xargs if zero files are provided.  (To support dbmerge,
165           dbcol now has an internal "--saveoutput" option.)  Thanks to Yuri
166           Pradkin for reporting the unhandled corner-case.
167

README CONTENTS

169       executive summary
170       what's new
171       README CONTENTS
172       installation
173       basic data format
174       basic data manipulation
175       list of commands
176       another example
177       a gradebook example
178       a password example
179       history
180       related work
181       release notes
182       copyright
183       comments
184

INSTALLATION

       Fsdb now uses the standard Perl build and installation from
       ExtUtils::MakeMaker(3), so the quick answer to installation is to type:
188
189           perl Makefile.PL
190           make
191           make test
192           make install
193
194       Or, if you want to install it somewhere else, change the first line to
195
196           perl Makefile.PL PREFIX=$HOME
197
       and it will go in your home directory's bin, etc.  (See
       ExtUtils::MakeMaker(3) for more details.)
200
201       Fsdb requires perl 5.8 or later.
202
203       A test-suite is available, run it with
204
205           make test
206
       In the past, ports existed for FreeBSD and MacOS.  If someone
       running one of those OSes wants to contribute a new port, please let me
       know.
210

BASIC DATA FORMAT

       These programs are based on the idea of storing data in simple ASCII
       files.  A database is a file with one header line and then data or
       comment lines.  For example:
215
216               #fsdb account passwd uid gid fullname homedir shell
217               johnh * 2274 134 John_Heidemann /home/johnh /bin/bash
218               greg * 2275 134 Greg_Johnson /home/greg /bin/bash
219               root * 0 0 Root /root /bin/bash
220               # this is a simple database
221
       The header line must be first and begins with "#fsdb".  There are rows
       (records) and columns (fields), just like in a normal database.
       Comment lines begin with "#".  Column names are any string not
       containing spaces or single quotes (although it is prudent to keep them
       alphanumeric with underscore).
227
228       By default, columns are delimited by whitespace.  With this default
229       configuration, the contents of a field cannot contain whitespace.
230       However, this limitation can be relaxed by changing the field separator
231       as described below.
232
233       The big advantage of simple flat-text databases is that it is usually
234       easy to massage data into this format, and it's reasonably easy to take
235       data out of this format into other (text-based) programs, like gnuplot,
236       jgraph, and LaTeX.  Think Unix.  Think pipes.  (Or even output to Excel
237       and HTML if you prefer.)
238
       Since no-whitespace in columns was a problem for some applications,
       there's an option which relaxes this rule.  You can specify the field
       separator in the table header with "-F x" where "x" is a code for the
       new field separator.  A full list of codes is at dbfilealter(1), but
       two common special values are "-F t" which is a separator of a single
       tab character, and "-F S", a separator of two spaces.  Both allow
       (single) spaces in fields.  An example:
246
247               #fsdb -F S account passwd uid gid fullname homedir shell
248               johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
249               greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash
250               root  *  0  0  Root  /root  /bin/bash
251               # this is a simple database
252
253       See dbfilealter(1) for more details.  Regardless of what the column
254       separator is for the body of the data, it's always whitespace in the
255       header.
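
       As a rough illustration of the format described above, the following
       Python sketch reads a header and splits each row on the separator the
       header names.  It is not part of Fsdb: parse_fsdb is a hypothetical
       helper, it understands only the default separator and the two codes
       mentioned here ("t" and "S"), and it assumes "S" means a literal
       two-space delimiter and that every header option takes one argument.

```python
def parse_fsdb(text):
    """Parse Fsdb-style text into (columns, rows, comments).

    Hypothetical helper for illustration; handles only the default
    whitespace separator plus the "-F t" and "-F S" codes.
    """
    lines = text.splitlines()
    header = lines[0].split()          # the header is always whitespace-separated
    if not header or header[0] != "#fsdb":
        raise ValueError("first line must be a #fsdb header")
    sep_code = "D"                     # assumed label for the default separator
    i = 1
    while i < len(header) and header[i].startswith("-"):
        if header[i] == "-F":
            sep_code = header[i + 1]
        i += 2                         # assumption: each option takes one argument
    columns = header[i:]
    split = {"D": lambda s: s.split(),
             "t": lambda s: s.split("\t"),
             "S": lambda s: s.split("  ")}[sep_code]
    rows, comments = [], []
    for line in lines[1:]:
        if line.startswith("#"):
            comments.append(line)      # comment lines are kept separately
        elif line.strip():
            rows.append(dict(zip(columns, split(line))))
    return columns, rows, comments


example = """#fsdb -F S account passwd uid gid fullname homedir shell
johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
root  *  0  0  Root  /root  /bin/bash
# this is a simple database"""

columns, rows, comments = parse_fsdb(example)
print(rows[0]["fullname"])   # John Heidemann
```

       Note that the single space inside "John Heidemann" survives because
       only the two-space sequence is treated as a separator.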
256
       There's also a third format: a "list".  Because it's often hard to see
       which column is which past the first two, in list format each "column"
       is on a separate line.  The programs dblistize and dbcolize convert to
       and from this format, and all programs work with either format.  The
       command
261
262           dbfilealter -R C  < DATA/passwd.fsdb
263
264       outputs:
265
266               #fsdb -R C account passwd uid gid fullname homedir shell
267               account:  johnh
268               passwd:   *
269               uid:      2274
270               gid:      134
271               fullname: John_Heidemann
272               homedir:  /home/johnh
273               shell:    /bin/bash
274
275               account:  greg
276               passwd:   *
277               uid:      2275
278               gid:      134
279               fullname: Greg_Johnson
280               homedir:  /home/greg
281               shell:    /bin/bash
282
283               account:  root
284               passwd:   *
285               uid:      0
286               gid:      0
287               fullname: Root
288               homedir:  /root
289               shell:    /bin/bash
290
291               # this is a simple database
292               #  | dblistize
293
294       See dbfilealter(1) for more details.
295

BASIC DATA MANIPULATION

       A number of programs exist to manipulate databases.  Complex functions
       can be made by stringing together commands with shell pipelines.  For
       example, to print the home directories of everyone with "John" in
       their names, you would do:
301
302               cat DATA/passwd | dbrow '_fullname =~ /John/' | dbcol homedir
303
304       The output might be:
305
306               #fsdb homedir
307               /home/johnh
308               /home/greg
309               # this is a simple database
310               #  | dbrow _fullname =~ /John/
311               #  | dbcol homedir
312
313       (Notice that comments are appended to the output listing each command,
314       providing an automatic audit log.)
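
       To make the select-then-project idea concrete, here is a small Python
       sketch of what the pipeline above does conceptually.  It is only an
       illustration under simplifying assumptions: dbrow_dbcol is a
       hypothetical function, it handles whitespace-separated data only, and
       it moves all comments to the end of the output.

```python
def dbrow_dbcol(lines, predicate, keep, log):
    """Filter rows with `predicate`, keep the `keep` columns, append `log` comments."""
    columns = lines[0].split()[1:]              # column names follow "#fsdb"
    out = ["#fsdb " + " ".join(keep)]
    trailing = []
    for line in lines[1:]:
        if line.startswith("#"):
            trailing.append(line)               # carry existing comments along
            continue
        row = dict(zip(columns, line.split()))
        if predicate(row):
            out.append(" ".join(row[c] for c in keep))
    # Each processing step is recorded as a trailing comment (the audit log).
    return out + trailing + ["#  | " + entry for entry in log]


passwd = [
    "#fsdb account passwd uid gid fullname homedir shell",
    "johnh * 2274 134 John_Heidemann /home/johnh /bin/bash",
    "greg * 2275 134 Greg_Johnson /home/greg /bin/bash",
    "root * 0 0 Root /root /bin/bash",
    "# this is a simple database",
]
result = dbrow_dbcol(passwd,
                     lambda row: "John" in row["fullname"],
                     ["homedir"],
                     ["dbrow _fullname =~ /John/", "dbcol homedir"])
print("\n".join(result))
```

       With the sample passwd data this reproduces the output shown above,
       including greg's row, because "Greg_Johnson" also contains "John".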
315
316       In addition to typical database functions (select, join, etc.) there
317       are also a number of statistical functions.
318
319       The real power of Fsdb is that one can apply arbitrary code to rows to
320       do powerful things.
321
322               cat DATA/passwd | dbroweval '_fullname =~ s/(\w+)_(\w+)/$2,_$1/'
323
324       converts "John_Heidemann" into "Heidemann,_John".  Not too much more
325       work could split fullname into firstname and lastname fields.
326
       Or:

               cat DATA/passwd | dbcolcreate sort | dbroweval -b 'use Fsdb::Support' \
                       '_sort = _fullname; _sort =~ s/_/ /g; _sort = fullname_to_sort(_sort);'
331
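
       The fullname rewrite above, and the split into first and last names
       that the text mentions, can be sketched in Python as follows.  Both
       function names are hypothetical, and the sketch assumes names have
       exactly the First_Last shape used in the sample data.

```python
import re


def last_first(fullname):
    """Swap First_Last into Last,_First, mirroring the dbroweval substitution."""
    return re.sub(r"(\w+)_(\w+)", r"\2,_\1", fullname)


def split_fullname(fullname):
    """Return (firstname, lastname), or None when there is no underscore to split on."""
    match = re.fullmatch(r"(\w+?)_(\w+)", fullname)
    return match.groups() if match else None


print(last_first("John_Heidemann"))      # Heidemann,_John
print(split_fullname("John_Heidemann"))  # ('John', 'Heidemann')
print(split_fullname("Root"))            # None
```

       A value such as "Root" has no underscore, so it passes through
       last_first unchanged and split_fullname reports no match.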

TALKING ABOUT COLUMNS

333       An advantage of Fsdb is that you can talk about columns by name
334       (symbolically) rather than simply by their positions.  So in the above
335       example, "dbcol homedir" pulled out the home directory column, and
336       "dbrow '_fullname =~ /John/'" matched against column fullname.
337
338       In general, you can use the name of the column listed on the "#fsdb"
339       line to identify it in most programs, and _name to identify it in code.
340
341       Some alternatives for flexibility:
342
343       ·   Numeric values identify columns positionally, numbering from 0.  So
344           0 or _0 is the first column, 1 is the second, etc.
345
       ·   In code, _last_columnname gets the value from columnname's previous
           row.
348
349       See dbroweval(1) for more details about writing code.
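
       The naming rules above can be pictured with a short Python sketch.
       The helper names are hypothetical, and returning None for the
       "previous" value on the first row is an assumption of this sketch,
       not a statement about what Fsdb does there.

```python
def column_index(spec, columns):
    """Resolve a column given by name or by zero-based position."""
    if isinstance(spec, int) or str(spec).isdigit():
        return int(spec)
    return columns.index(spec)


def with_previous(rows, columns, name):
    """Yield (current, previous) values of one column, row by row."""
    idx = column_index(name, columns)
    previous = None                      # assumption: nothing precedes row one
    for row in rows:
        yield row[idx], previous
        previous = row[idx]


old_layout = ["experiment", "duration"]
new_layout = ["experiment", "size", "duration"]
print(column_index("duration", old_layout))   # 1
print(column_index("duration", new_layout))   # 2
print(column_index("0", new_layout))          # 0

rows = [["ufs_mab_sys", "37.2"], ["ufs_mab_sys", "37.3"]]
print(list(with_previous(rows, old_layout, "duration")))
```

       Looking columns up by name is what lets the same script keep working
       when a new column is inserted ahead of the one it uses.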
350

LIST OF COMMANDS

352       Enough said.  I'll summarize the commands, and then you can experiment.
353       For a detailed description of each command, see a summary by running it
354       with the argument "--help" (or "-?" if you prefer.)  Full manual pages
355       can be found by running the command with the argument "--man", or
356       running the Unix command "man dbcol" or whatever program you want.
357
358   TABLE CREATION
359       dbcolcreate
360           add columns to a database
361
362       dbcoldefine
363           set the column headings for a non-Fsdb file
364
365   TABLE MANIPULATION
366       dbcol
367           select columns from a table
368
369       dbrow
370           select rows from a table
371
372       dbsort
373           sort rows based on a set of columns
374
375       dbjoin
376           compute the natural join of two tables
377
378       dbcolrename
379           rename a column
380
381       dbcolmerge
382           merge two columns into one
383
384       dbcolsplittocols
385           split one column into two or more columns
386
387       dbcolsplittorows
388           split one column into multiple rows
389
390       dbfilepivot
391           "pivots" a file, converting multiple rows corresponding to the same
392           entity into a single row with multiple columns.
393
394       dbfilevalidate
395           check that db file doesn't have some common errors
396
397   COMPUTATION AND STATISTICS
398       dbcolstats
           compute statistics over a column (mean, etc., optionally median)
400
401       dbmultistats
402           group rows by some key value, then compute stats (mean, etc.) over
403           each group (equivalent to dbmapreduce with dbcolstats as the
404           reducer)
405
406       dbmapreduce
407           group rows (map) and then apply an arbitrary function to each group
408           (reduce)
409
410       dbrvstatdiff
411           compare two samples distributions (mean/conf interval/T-test)
412
413       dbcolmovingstats
414           computing moving statistics over a column of data
415
416       dbcolstatscores
417           compute Z-scores and T-scores over one column of data
418
419       dbcolpercentile
420           compute the rank or percentile of a column
421
422       dbcolhisto
423           compute histograms over a column of data
424
425       dbcolscorrelate
426           compute the coefficient of correlation over several columns
427
428       dbcolsregression
429           compute linear regression and correlation for two columns
430
431       dbrowaccumulate
432           compute a running sum over a column of data
433
434       dbrowcount
435           count the number of rows (a subset of dbstats)
436
437       dbrowdiff
438           compute differences between a columns in each row of a table
439
440       dbrowenumerate
441           number each row
442
443       dbroweval
444           run arbitrary Perl code on each row
445
446       dbrowuniq
447           count/eliminate identical rows (like Unix uniq(1))
448
449       dbfilediff
450           compare fields on rows of a file (something like Unix diff(1))
451
452   OUTPUT CONTROL
453       dbcolneaten
454           pretty-print columns
455
456       dbfilealter
457           convert between column or list format, or change the column
458           separator
459
460       dbfilestripcomments
461           remove comments from a table
462
463       dbformmail
464           generate a script that sends form mail based on each row
465
466   CONVERSIONS
467       (These programs convert data into fsdb.  See their web pages for
468       details.)
469
470       cgi_to_db
471           <http://stein.cshl.org/boulder/>
472
473       combined_log_format_to_db
474           <http://httpd.apache.org/docs/2.0/logs.html>
475
476       html_table_to_db
477           HTML tables to fsdb (assuming they're reasonably formatted).
478
479       kitrace_to_db
480           <http://ficus-www.cs.ucla.edu/ficus-members/geoff/kitrace.html>
481
482       ns_to_db
483           <http://mash-www.cs.berkeley.edu/ns/>
484
485       sqlselect_to_db
486           the output of SQL SELECT tables to db
487
488       tabdelim_to_db
489           spreadsheet tab-delimited files to db
490
491       tcpdump_to_db
492           (see man tcpdump(8) on any reasonable system)
493
494       xml_to_db
           XML input to fsdb, assuming it is very regular
496
497       (And out of fsdb:)
498
499       db_to_csv
500           Comma-separated-value format from fsdb.
501
502       db_to_html_table
503           simple conversion of Fsdb to html tables
504
505   STANDARD OPTIONS
506       Many programs have common options:
507
508       -? or --help
509           Show basic usage.
510
       -N or --new-name
512           When a command creates a new column like dbrowaccumulate's "accum",
513           this option lets one override the default name of that new column.
514
515       -T TmpDir
516           where to put tmp files.  Also uses environment variable TMPDIR, if
517           -T is not specified.  Default is /tmp.
518
521       -c FRACTION or --confidence FRACTION
522           Specify confidence interval FRACTION (dbcolstats, dbmultistats,
523           etc.)
524
       -C S or --element-separator S
526           Specify column separator S (dbcolsplittocols, dbcolmerge).
527
528       -d or --debug
529           Enable debugging (may be repeated for greater effect in some
530           cases).
531
532       -a or --include-non-numeric
533           Compute stats over all data (treating non-numbers as zeros).  (By
534           default, things that can't be treated as numbers are ignored for
535           stats purposes)
536
537       -S or --pre-sorted
538           Assume the data is pre-sorted.  May be repeated to disable
539           verification (saving a small amount of work).
540
541       -e E or --empty E
542           give value E as the value for empty (null) records
543
544       -i I or --input I
545           Input data from file I.
546
547       -o O or --output O
548           Write data out to file O.
549
550       --header H
551           Use H as the full Fsdb header, rather than reading a header from
552           then input.  This option is particularly useful when using Fsdb
553           under Hadoop, where split files don't have heades.
554
555       --nolog.
556           Skip logging the program in a trailing comment.
557
558       When giving Perl code (in dbrow and dbroweval) column names can be
559       embedded if preceded by underscores.  Look at dbrow(1) or dbroweval(1)
560       for examples.)
561
562       Most programs run in constant memory and use temporary files if
563       necessary.  Exceptions are dbcolneaten, dbcolpercentile, dbmapreduce,
564       dbmultistats, dbrowsplituniq.
565

ANOTHER EXAMPLE

       Take the raw data in "DATA/http_bandwidth", put a header on it
       ("dbcoldefine size bw"), take statistics of each category
       ("dbmultistats -k size bw"), pick out the relevant fields ("dbcol size
       mean stddev pct_rsd"), and you get:
571
572               #fsdb size mean stddev pct_rsd
573               1024    1.4962e+06      2.8497e+05      19.047
574               10240   5.0286e+06      6.0103e+05      11.952
575               102400  4.9216e+06      3.0939e+05      6.2863
576               #  | dbcoldefine size bw
577               #  | /home/johnh/BIN/DB/dbmultistats -k size bw
578               #  | /home/johnh/BIN/DB/dbcol size mean stddev pct_rsd
579
580       (The whole command was:
581
               cat DATA/http_bandwidth |
               dbcoldefine size bw |
               dbmultistats -k size bw |
               dbcol size mean stddev pct_rsd
586
587       all on one line.)
588
589       Then post-process them to get rid of the exponential notation by adding
590       this to the end of the pipeline:
591
592           dbroweval '_mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev);'
593
594       (Actually, this step is no longer required since dbcolstats now uses a
595       different default format.)
596
597       giving:
598
599               #fsdb      size    mean    stddev  pct_rsd
600               1024     1496200          284970        19.047
601               10240    5028600          601030        11.952
602               102400   4921600          309390        6.2863
603               #  | dbcoldefine size bw
604               #  | dbmultistats -k size bw
605               #  | dbcol size mean stddev pct_rsd
606               #  | dbroweval   { _mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev); }
607
608       In a few lines, raw data is transformed to processed output.
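
       For readers who want to see what the per-category statistics amount
       to, here is a minimal Python sketch of grouping by a key and computing
       mean, standard deviation, and pct_rsd.  It is an illustration, not
       Fsdb's implementation: multistats is a hypothetical function, the
       sample values are invented, and pct_rsd is taken to be
       100 * stddev / mean, which is consistent with the figures in the
       table above.

```python
from collections import defaultdict
from statistics import mean, stdev


def multistats(pairs):
    """Group (key, value) pairs by key and return {key: (mean, stddev, pct_rsd)}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    result = {}
    for key, values in groups.items():
        m = mean(values)
        s = stdev(values)                 # sample standard deviation (n - 1)
        result[key] = (m, s, 100.0 * s / m)
    return result


# Hypothetical measurements, not the contents of DATA/http_bandwidth.
samples = [(1024, 10.0), (1024, 14.0), (10240, 40.0), (10240, 60.0)]
for size, (m, s, rsd) in sorted(multistats(samples).items()):
    print(size, m, round(s, 3), round(rsd, 3))
```

       Each group needs at least two values here, since the sample standard
       deviation is undefined for a single point.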
609
610       Suppose you expect there is an odd distribution of results of one
611       datapoint.  Fsdb can easily produce a CDF (cumulative distribution
612       function) of the data, suitable for graphing:
613
614           cat DB/DATA/http_bandwidth | \
615               dbcoldefine size bw | \
616               dbrow '_size == 102400' | \
617               dbcol bw | \
618               dbsort -n bw | \
619               dbrowenumerate | \
620               dbcolpercentile count | \
621               dbcol bw percentile | \
622               xgraph
623
624       The steps, roughly: 1. get the raw input data and turn it into fsdb
625       format, 2. pick out just the relevant column (for efficiency) and sort
626       it, 3. for each data point, assign a CDF percentage to it, 4. pick out
627       the two columns to graph and show them
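
       The sort, enumerate, and percentile steps above boil down to an
       empirical CDF.  The Python sketch below shows the idea under one
       stated assumption: the fraction assigned to each point is rank / n,
       which may differ in detail from what dbcolpercentile reports.  The
       function name is hypothetical.

```python
def empirical_cdf(values):
    """Return (value, fraction) pairs, where fraction is rank / n after sorting."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (rank + 1) / n) for rank, v in enumerate(ordered)]


# Hypothetical bandwidth samples for a single size category.
for value, fraction in empirical_cdf([30, 10, 20, 40]):
    print(value, fraction)
```

       Plotting the first element of each pair against the second gives the
       kind of curve the pipeline hands to the graphing program.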
628

A GRADEBOOK EXAMPLE

630       The first commercial program I wrote was a gradebook, so here's how to
631       do it with Fsdb.
632
633       Format your data like DATA/grades.
634
635               #fsdb name email id test1
636               a a@ucla.example.edu 1 80
637               b b@usc.example.edu 2 70
638               c c@isi.example.edu 3 65
639               d d@lmu.example.edu 4 90
640               e e@caltech.example.edu 5 70
641               f f@oxy.example.edu 6 90
642
643       Or if your students have spaces in their names, use "-F S" and two
644       spaces to separate each column:
645
646               #fsdb -F S name email id test1
647               alfred aho  a@ucla.example.edu  1  80
648               butler lampson  b@usc.example.edu  2  70
649               david clark  c@isi.example.edu  3  65
650               constantine drovolis  d@lmu.example.edu  4  90
651               debrorah estrin  e@caltech.example.edu  5  70
652               sally floyd  f@oxy.example.edu  6  90
653
654       To compute statistics on an exam, do
655
656               cat DATA/grades | dbstats test1 |dblistize
657
658       giving
659
660               #fsdb -R C  ...
661               mean:        77.5
662               stddev:      10.84
663               pct_rsd:     13.987
664               conf_range:  11.377
665               conf_low:    66.123
666               conf_high:   88.877
667               conf_pct:    0.95
668               sum:         465
669               sum_squared: 36625
670               min:         65
671               max:         90
672               n:           6
673               ...
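
       The figures above can be checked by hand.  The Python sketch below
       recomputes them from the six test1 scores; the confidence range
       formula (t * stddev / sqrt(n), with a tabulated t value of 2.571 for
       95% and 5 degrees of freedom) is an assumption that happens to
       reproduce the listed numbers, not a quote from the Fsdb source.

```python
from math import sqrt

scores = [80, 70, 65, 90, 70, 90]        # the test1 column from DATA/grades
n = len(scores)
total = sum(scores)
sum_squared = sum(x * x for x in scores)
mean = total / n
# Sample standard deviation, computed from the sum and the sum of squares.
stddev = sqrt((sum_squared - total * total / n) / (n - 1))
pct_rsd = 100.0 * stddev / mean
T_95_DF5 = 2.571                         # Student's t, 95% two-sided, 5 d.o.f.
conf_range = T_95_DF5 * stddev / sqrt(n)

print(mean, round(stddev, 2), round(pct_rsd, 3), round(conf_range, 3))
# prints: 77.5 10.84 13.987 11.377
```

       Subtracting and adding conf_range to the mean gives the conf_low and
       conf_high values shown in the listing.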
674
675       To do a histogram:
676
677               cat DATA/grades | dbcolhisto -n 5 -g test1
678
679       giving
680
681               #fsdb low histogram
682               65      *
683               70      **
684               75
685               80      *
686               85
687               90      **
688               #  | /home/johnh/BIN/DB/dbhistogram -n 5 -g test1
689
690       Now you want to send out grades to the students by e-mail.  Create a
691       form-letter (in the file test1.txt):
692
693               To: _email (_name)
694               From: J. Random Professor <jrp@usc.example.edu>
695               Subject: test1 scores
696
697               _name, your score on test1 was _test1.
698               86+   A
699               75-85 B
700               70-74 C
701               0-69  F
702
703       Generate the shell script that will send the mail out:
704
705               cat DATA/grades | dbformmail test1.txt > test1.sh
706
707       And run it:
708
709               sh <test1.sh
710
711       The last two steps can be combined:
712
713               cat DATA/grades | dbformmail test1.txt | sh
714
715       but I like to keep a copy of exactly what I send.
716
717       At the end of the semester you'll want to compute grade totals and
718       assign letter grades.  Both fall out of dbroweval.  For example, to
719       compute weighted total grades with a 40% midterm/60% final where the
720       midterm is 84 possible points and the final 100:
721
722               dbcol -rv total |
723               dbcolcreate total - |
724               dbroweval '
725                       _total = .40 * _midterm/84.0 + .60 * _final/100.0;
726                       _total = sprintf("%4.2f", _total);
727                       if (_final eq "-" || ( _name =~ /^_/)) { _total = "-"; };' |
728               dbcolneaten
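
       The weighting arithmetic in that dbroweval step can be written out as
       a small Python function.  This is a sketch with a hypothetical name;
       unlike the Perl code, it checks for the "-" placeholder before doing
       any arithmetic, since Python will not silently treat "-" as a number.

```python
def weighted_total(midterm, final, name=""):
    """Return the 40%/60% weighted total as a string, or "-" when none applies."""
    if final == "-" or name.startswith("_"):
        return "-"                       # missing final, or a special "_" row
    total = 0.40 * float(midterm) / 84.0 + 0.60 * float(final) / 100.0
    return "%4.2f" % total


print(weighted_total("84", "100"))       # 1.00
print(weighted_total("42", "50"))        # 0.50
print(weighted_total("70", "-"))         # -
```

       The result is a fraction between 0 and 1 because each exam score is
       first divided by its maximum possible points.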
729
730       If you got the data originally from a spreadsheet, save it in "tab-
731       delimited" format and convert it with tabdelim_to_db (run
732       tabdelim_to_db -? for examples).
733

A PASSWORD EXAMPLE

735       To convert the Unix password file to db:
736
737               cat /etc/passwd | sed 's/:/  /g'| \
738                       dbcoldefine -F S login password uid gid gecos home shell \
739                       >passwd.fsdb
740
741       To convert the group file
742
743               cat /etc/group | sed 's/:/  /g' | \
744                       dbcoldefine -F S group password gid members \
745                       >group.fsdb
746
747       To show the names of the groups that div7-members are in (assuming DIV7
748       is in the gecos field):
749
750               cat passwd.fsdb | dbrow '_gecos =~ /DIV7/' | dbcol login gid | \
751                       dbjoin -i - -i group.fsdb gid | dbcol login group
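
       The join in that last pipeline can be pictured with the following
       Python sketch.  It is an illustration only: join_on is a hypothetical
       function, and the rows are invented stand-ins for passwd.fsdb and
       group.fsdb.

```python
def join_on(left, right, key):
    """Inner-join two lists of dicts on `key`, merging columns of matching rows."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical rows standing in for passwd.fsdb and group.fsdb.
passwd = [{"login": "johnh", "gid": "134", "gecos": "John Heidemann DIV7"},
          {"login": "root", "gid": "0", "gecos": "Root"}]
group = [{"group": "div7", "gid": "134"}, {"group": "wheel", "gid": "0"}]

# Select the DIV7 rows and keep only login and gid, then join on gid.
div7 = [{"login": p["login"], "gid": p["gid"]}
        for p in passwd if "DIV7" in p["gecos"]]
for row in join_on(div7, group, "gid"):
    print(row["login"], row["group"])    # johnh div7
```

       Rows whose gid has no match on the other side simply drop out, which
       is the usual behavior of an inner join.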
752

SHORT EXAMPLES

754       Which Fsdb programs are the most complicated (based on number of test
755       cases)?
756
757               ls TEST/*.cmd | \
758                       dbcoldefine test | \
759                       dbroweval '_test =~ s@^TEST/([^_]+).*$@$1@' | \
760                       dbrowuniq -c | \
761                       dbsort -nr count | \
762                       dbcolneaten
763
764       (Answer: dbmapreduce, then dbcolstats, dbfilealter and dbjoin.)
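
       The same counting idea, sketched in Python for comparison.  The file
       names below are hypothetical; real ones would come from listing
       TEST/*.cmd, and count_by_program is not an Fsdb function.

```python
import re
from collections import Counter


def count_by_program(paths):
    """Count test files per program name, most frequent first."""
    # Keep only the text between "TEST/" and the first underscore.
    names = (re.sub(r"^TEST/([^_]+).*$", r"\1", p) for p in paths)
    return Counter(names).most_common()


paths = ["TEST/dbcol_a.cmd", "TEST/dbjoin_a.cmd", "TEST/dbjoin_b.cmd"]
print(count_by_program(paths))   # [('dbjoin', 2), ('dbcol', 1)]
```

       The regular expression is the one from the dbroweval step; the counter
       plays the role of dbrowuniq -c followed by dbsort -nr count.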
765
766       Stats on an exam (in $FILE, where $COLUMN is the name of the exam)?
767
               cat $FILE | dbcolstats -q 4 $COLUMN | dblistize | dbstripcomments
769
770               cat $FILE | dbcolhisto -g -n 20 $COLUMN | dbcolneaten | dbstripcomments
771
       Merging the hw1 column from file hw1.fsdb into grades.fsdb assuming
       there's a common student id in column "id":
774
775               dbcol id hw1 <hw1.fsdb >t.fsdb
776
777               dbjoin -a -e - grades.fsdb t.fsdb id | \
778                   dbsort  name | \
779                   dbcolneaten >new_grades.fsdb
780
       Merging two fsdb files with the same columns:
782
783               cat file1.fsdb file2.fsdb >output.fsdb
784
785       or if you want to clean things up a bit
786
787               cat file1.fsdb file2.fsdb | dbstripextraheaders >output.fsdb
788
789       or if you want to know where the data came from
790
791               for i in 1 2
792               do
793                       dbcolcreate source $i < file$i.fsdb
794               done >output.fsdb
795
796       (assumes you're using a Bourne-shell compatible shell, not csh).
797

WARNINGS

799       As with any tool, one should (which means must) understand the limits
800       of the tool.
801
802       All Fsdb tools should run in constant memory.  In some cases (such as
803       dbcolstats with quartiles, where the whole input must be re-read),
804       programs will spool data to disk if necessary.
805
       Most tools buffer one or a few lines of data, so memory will scale with
       the size of each line.  (So lines with many columns, or when columns
       have lots of data, may cause large memory consumption.)
809
810       All Fsdb tools should run in constant or at worst "n log n" time.
811
812       All Fsdb tools use normal Perl math routines for computation.  Although
813       I make every attempt to choose numerically stable algorithms (although
814       I also welcome feedback and suggestions for improvement), normal
815       rounding due to computer floating point approximations can result in
816       inaccuracies when data spans a large range of precision.  (See for
817       example the dbcolstats_extrema test cases.)
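
       To see why floating point range matters, consider the Python sketch
       below.  It compares a textbook sum-of-squares variance with Welford's
       one-pass update on values that are large relative to their spread.
       This is a generic illustration of the rounding hazard; it makes no
       claim about which algorithm any Fsdb tool uses, and both function
       names are hypothetical.

```python
def naive_variance(values):
    """Sample variance from the sum and sum of squares (prone to cancellation)."""
    n = len(values)
    s = sum(values)
    ss = sum(x * x for x in values)
    return (ss - s * s / n) / (n - 1)


def welford_variance(values):
    """Sample variance by Welford's one-pass update."""
    mean, m2 = 0.0, 0.0
    for count, x in enumerate(values, start=1):
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return m2 / (len(values) - 1)


data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]   # true sample variance is 30
print(naive_variance(data), welford_variance(data))
```

       The squares of these values are near 1e18, beyond the range where a
       double can hold every integer exactly, so the naive result lands far
       from 30 while the one-pass update recovers it.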
818
       Any requirements and limitations of each Fsdb tool are documented on its
       manual page.
821
822       If any Fsdb program violates these assumptions, that is a bug that
823       should be documented on the tool's manual page or ideally fixed.
824
825       Fsdb does depend on Perl's correctness, and Perl (and Fsdb) have some
826       bugs.  Fsdb should work on perl from version 5.10 onward.
827

HISTORY

829       There have been three versions of Fsdb; fsdb 1.0 is a complete re-write
830       of the pre-1995 versions, and was distributed from 1995 to 2007.  Fsdb
831       2.0 is a significant re-write of the 1.x versions for reasons described
832       below.
833
834       Fsdb (in its various forms) has been used extensively by its author
835       since 1991.  Since 1995 it's been used by two other researchers at UCLA
836       and several at ISI.  In February 1998 it was announced to the Internet.
837       Since then it has found a few users, some outside where I work.
838
839       Major changes:
840
841       1.0 1997-07-22: first public release.
842       2.0 2008-01-25: rewrite to use a common library, and starting to use
843       threads.
844       2.12 2008-10-16: completion of the rewrite, and first RPM package.
845       2.44 2013-10-02: abandoning threads for improved performance
846
847   Fsdb 2.0 Rationale
848       I've thought about fsdb-2.0 for many years, but it was started in
849       earnest in 2007.  Fsdb-2.0 has the following goals:
850
851       in-one-process processing
852           While fsdb is great on the Unix command line as a pipeline between
853           programs, it should also be possible to set it up to run in a
854           single process.  And if it does so, it should be able to avoid
855           serializing and deserializing (converting to and from text) data
856           between each module.  (Accomplished in fsdb-2.0: see dbpipeline,
857           although it still needs tuning.)
858
859       clean IO API
860           Fsdb's roots go back to perl4 and 1991, so the fsdb-1.x library is
861           very, very crufty.  More than just being ugly (but it was that too),
862           this made tasks like reading a file in one format and writing it in
863           another the application's job, when they should be the library's.
864           (Accomplished in fsdb-1.15 and improved in 2.0: see Fsdb::IO.)
865
866       normalized module APIs
867           Because fsdb modules were added as needed over 10 years, sometimes
868           the module APIs became inconsistent.  (For example, the 1.x
869           "dbcolcreate" required an empty value following the name of the new
870           column, but other programs specify empty values with the "-e"
871           argument.)  We should smooth over these inconsistencies.
872           (Accomplished as each module was ported in 2.0 through 2.7.)
873
874       everyone handles all input formats
875           Given a clean IO API, the distinction between "colized" and
876           "listized" fsdb files should go away.  Any program should be able
877           to read and write files in any format.  (Accomplished in fsdb-2.1.)
878
879       Fsdb-2.0 preserves backwards compatibility where possible, but breaks
880       it where necessary to accomplish the above goals.  In August 2008,
881       Fsdb-2.7 was declared preferred over the 1.x versions.  Benchmarking in
882       2013 showed that threading performed much worse than just using pipes,
883       so Fsdb-2.44 uses threading "style", but implemented with processes
884       (via my "Freds" library).
885
886   Contributors
887       Fsdb includes code ported from Geoff Kuenning
888       ("Fsdb::Support::TDistribution").
889
890       Fsdb contributors: Ashvin Goel goel@cse.oge.edu, Geoff Kuenning
891       geoff@fmg.cs.ucla.edu, Vikram Visweswariah visweswa@isi.edu, Kannan
892       Varadahan kannan@isi.edu, Lars Eggert larse@isi.edu, Arkadi Gelfond
893       arkadig@dyna.com, David Graff graff@ldc.upenn.edu, Haobo Yu
894       haoboy@packetdesign.com, Pavlin Radoslavov pavlin@catarina.usc.edu,
895       Graham Phillips, Yuri Pradkin, Alefiya Hussain, Ya Xu, Michael
896       Schwendt, Fabio Silva fabio@isi.edu, Jerry Zhao zhaoy@isi.edu, Ning Xu
897       nxu@aludra.usc.edu, Martin Lukac mlukac@lecs.cs.ucla.edu, Xue Cai,
898       Michael McQuaid, Christopher Meng, Calvin Ardi, H. Merijn Brand, Lan
899       Wei, Hang Guo.
900
901       Fsdb includes datasets contributed from NIST (DATA/nist_zarr13.fsdb),
902       from
903       <http://www.itl.nist.gov/div898/handbook/eda/section4/eda4281.htm>, the
904       NIST/SEMATECH e-Handbook of Statistical Methods, section 1.4.2.8.1.
905       Background and Data.  The source is public domain, and reproduced with
906       permission.
907

RELATED WORK

909       As stated in the introduction, Fsdb is an incompatible reimplementation
910       of the ideas found in "/rdb".  By storing data in simple text files and
911       processing it with pipelines it is easy to experiment (in the shell)
912       and look at the output.  The original implementation of this idea was
913       /rdb, a commercial product described in the book UNIX relational
914       database management: application development in the UNIX environment by
915       Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web
916       page <http://www.rdb.com/>).
917
918       While Fsdb is inspired by Rdb, it includes no code from it, and Fsdb
919       makes several different design choices.  In particular: rdb attempts to
920       be closer to a "real" database, with provision for locking and file
921       indexing.  Fsdb focuses on single-user use and so eschews these
922       choices.  Rdb also has some support for interactive editing.  Fsdb
923       leaves editing to text editors like emacs or vi.
924
925       In August, 2002 I found out Carlo Strozzi extended RDB with his package
926       NoSQL <http://www.linux.it/~carlos/nosql/>.  According to Mr. Strozzi,
927       he implemented NoSQL in awk to avoid the Perl start-up of RDB.
928       Although I haven't found Perl startup overhead to be a big problem on
929       my platforms (from old Sparcstation IPCs to 2GHz Pentium-4s), you may
930       want to evaluate his system.  The Linux Journal has a description of
931       NoSQL at <http://www.linuxjournal.com/article/3294>.  It seems quite
932       similar to Fsdb.  Like /rdb, NoSQL supports indexing (not present in
933       Fsdb).  Fsdb appears to have richer support for statistics, and, as of
934       Fsdb-2.x, its support for Perl threading may support faster performance
935       (one-process, less serialization and deserialization).
936

RELEASE NOTES

938       Versions prior to 1.0 were released informally on my web page but were
939       not announced.
940
941   0.0 1991
942       started for my own research use
943
944   0.1 26-May-94
945       first check-in to RCS
946
947   0.2 15-Mar-95
948       parts now require perl5
949
950   1.0, 22-Jul-97
951       adds autoconf support and a test script.
952
953   1.1, 20-Jan-98
954       support for double space field separators, better tests
955
956   1.2, 11-Feb-98
957       minor changes and release on comp.lang.perl.announce
958
959   1.3, 17-Mar-98
960       ·   adds median and quartile options to dbstats
961
962       ·   adds dmalloc_to_db converter
963
964       ·   fixes some warnings
965
966       ·   dbjoin now can run on unsorted input
967
968       ·   fixes a dbjoin bug
969
970       ·   some more tests in the test suite
971
972   1.4, 27-Mar-98
973       ·   improves error messages (all should now report the program that
974           makes the error)
975
976       ·   fixed a bug in dbstats output when the mean is zero
977
978   1.5, 25-Jun-98
979       BUG FIX dbcolhisto, dbcolpercentile now handle non-numeric values like
980       dbstats
981       NEW dbcolstats computes zscores and tscores over a column
982       NEW dbcolscorrelate computes correlation coefficients between two
983       columns
984       INTERNAL ficus_getopt.pl has been replaced by DbGetopt.pm
985       BUG FIX all tests are now ``portable'' (previously some tests ran only
986       on my system)
987       BUG FIX you no longer need to have the db programs in your path (fix
988       arose from a discussion with Arkadi Gelfond)
989       BUG FIX installation no longer uses cp -f (to work on SunOS 4)
990
991   1.6, 24-May-99
992       NEW dbsort, dbstats, dbmultistats now run in constant memory (using tmp
993       files if necessary)
994       NEW dbcolmovingstats does moving means over a series of data
995       NEW dbcol has a -v option to get all columns except those listed
996       NEW dbmultistats does quartiles and medians
997       NEW dbstripextraheaders now also cleans up bogus comments before the
998       first header
999       BUG FIX dbcolneaten works better with double-space-separated data
1000
1001   1.7,  5-Jan-00
1002       NEW dbcolize now detects and rejects lines that contain embedded copies
1003       of the field separator
1004       NEW configure tries harder to prevent people from improperly
1005       configuring/installing fsdb
1006       NEW tcpdump_to_db converter (incomplete)
1007       NEW tabdelim_to_db converter:  from spreadsheet tab-delimited files to
1008       db
1009       NEW mailing lists for fsdb are "fsdb-announce@heidemann.la.ca.us"
1010       and "fsdb-talk@heidemann.la.ca.us"
1011           To subscribe to either, send mail to
1012           "fsdb-announce-request@heidemann.la.ca.us" or
1013           "fsdb-talk-request@heidemann.la.ca.us" with "subscribe" in the
1014           BODY of the message.
1015
1016       BUG FIX dbjoin used to produce incorrect output if there were extra,
1017       unmatched values in the 2nd table. Thanks to Graham Phillips for
1018       providing a test case.
1019       BUG FIX the sample commands in the usage strings now all should
1020       explicitly include the source of data (typically from "cat foo.fsdb
1021       |").  Thanks to Ya Xu for pointing out this documentation deficiency.
1022       BUG FIX (DOCUMENTATION) dbcolmovingstats had incorrect sample output.
1023
1024   1.8, 28-Jun-00
1025       BUG FIX header options are now preserved when writing with dblistize
1026       NEW dbrowuniq now optionally checks for uniqueness only on certain
1027       fields
1028       NEW dbrowsplituniq makes one pass through a file and splits it into
1029       separate files based on the given fields
1030       NEW converter for "crl" format network traces
1031       NEW anywhere you use arbitrary code (like dbroweval), _last_foo now
1032       maps to the last row's value for field _foo.
1033       OPTIMIZATION comment processing slightly changed so that dbmultistats
1034       now is much faster on files with lots of comments (for example, ~100k
1035       lines of comments and 700 lines of data!) (Thanks to Graham Phillips
1036       for pointing out this performance problem.)
1037       BUG FIX dbstats with median/quartiles now correctly handles singleton
1038       data points.
1039
1040   1.9,  6-Nov-00
1041       NEW dbfilesplit, split a single input file into multiple output files
1042       (based on code contributed by Pavlin Radoslavov).
1043       BUG FIX dbsort now works with perl-5.6
1044
1045   1.10, 10-Apr-01
1046       BUG FIX dbstats now handles the case where there are more n-tiles than
1047       data
1048       NEW dbstats now includes a -S option to optimize work on pre-sorted
1049       data (inspired by code contributed by Haobo Yu)
1050       BUG FIX dbsort now has a better estimate of memory usage when run on
1051       data with very short records (problem detected by Haobo Yu)
1052       BUG FIX cleanup of temporary files is slightly better
1053
1054   1.11,  2-Nov-01
1055       BUG FIX dbcolneaten now runs in constant memory
1056       NEW dbcolneaten now supports "field specifiers" that allow some control
1057       over how wide columns should be
1058       OPTIMIZATION dbsort now tries hard to be filesystem cache-friendly
1059       (inspired by "Information and Control in Gray-box Systems" by the
1060       Arpaci-Dusseaus at SOSP 2001)
1061       INTERNAL t_distr now ported to perl5 module DbTDistr
1062
1063   1.12,  30-Oct-02
1064       BUG FIX dbmultistats documentation typo fixed
1065       NEW dbcolmultiscale
1066       NEW dbcol has -r option for "relaxed error checking"
1067       NEW dbcolneaten has new -e option to strip end-of-line spaces
1068       NEW dbrow finally has a -v option to negate the test
1069       BUG FIX math bug in dbcoldiff fixed by Ashvin Goel (need to check
1070       Scheaffer test cases)
1071       BUG FIX some patches to run with Perl 5.8. Note: some programs
1072       (dbcolmultiscale, dbmultistats, dbrowsplituniq) generate warnings like:
1073       "Use of uninitialized value in concatenation (.) or string at
1074       /usr/lib/perl5/5.8.0/FileCache.pm line 98, <STDIN> line 2". Please
1075       ignore this until I figure out how to suppress it. (Thanks to Jerry
1076       Zhao for noticing perl-5.8 problems.)
1077       BUG FIX fixed an autoconf problem where configure would fail to find a
1078       reasonable prefix (thanks to Fabio Silva for reporting the problem)
1079       NEW db_to_html_table: simple conversion to html tables (NO fancy stuff)
1080       NEW dblib now has a function dblib_text2html() that will do simple
1081       conversion of iso-8859-1 to HTML
1082
1083   1.13,  4-Feb-04
1084       NEW fsdb added to the freebsd ports tree
1085       <http://www.freshports.org/databases/fsdb/>.  Maintainer:
1086       "larse@isi.edu"
1087       BUG FIX properly handle trailing spaces when data must be numeric (ex.
1088       dbstats with -FS, see test dbstats_trailing_spaces). Fix from Ning Xu
1089       "nxu@aludra.usc.edu".
1090       NEW dbcolize error message improved (bug report from Terrence Brannon),
1091       and list format documented in the README.
1092       NEW cgi_to_db converts CGI.pm-format storage to fsdb list format
1093       BUG FIX handle numeric synonyms for column names in dbcol properly
1094       ENHANCEMENT "talking about columns" section added to README. Lack of
1095       documentation pointed out by Lars Eggert.
1096       CHANGE dbformmail now defaults to using Mail ("Berkeley Mail") to send
1097       mail, rather than sendmail (sendmail is still an option, but mail
1098       doesn't require running as root)
1099       NEW on platforms that support it (i.e., with perl 5.8), fsdb works fine
1100       with unicode
1101       NEW dbfilevalidate: check a db file for some common errors
1102
1103   1.14,  24-Aug-06
1104       ENHANCEMENT README cleanup
1105       INCOMPATIBLE CHANGE dbcolsplit renamed dbcolsplittocols
1106       NEW dbcolsplittorows  split one column into multiple rows
1107       NEW dbcolsregression compute linear regression and correlation for two
1108       columns
1109       ENHANCEMENT csv_to_db: better error handling, normalize field names,
1110       skip blank lines
1111       ENHANCEMENT dbjoin now detects (and fails) if non-joined files have
1112       duplicate names
1113       BUG FIX minor bug fixed in calculation of Student t-distributions
1114       (doesn't change any test output, but may have caused small errors)
1115
1116   1.15, 12-Nov-07
1117       NEW fsdb-1.14 added to the MacOS Fink system
1118       <http://pdb.finkproject.org/pdb/package.php/fsdb>. (Thanks to Lars
1119       Eggert for maintaining this port.)
1120       NEW Fsdb::IO::Reader and Fsdb::IO::Writer now provide reasonably clean
1121       OO I/O interfaces to Fsdb files.  Highly recommended if you use fsdb
1122       directly from perl.  In the fullness of time I expect to reimplement
1123       the entire thing using these APIs to replace the current dblib.pl which
1124       is still hobbled by its roots in perl4.
1125       NEW dbmapreduce now implements a Google-style map/reduce abstraction,
1126       generalizing dbmultistats.
1127       ENHANCEMENT fsdb now uses the Perl build system (Makefile.PL, etc.),
1128       instead of autoconf.  This change paves the way to better perl-5-style
1129       modularization, proper manual pages, input of both listize and colize
1130       format for every program, and world peace.
1131       ENHANCEMENT dblib.pl is now moved to Fsdb::Old.pm.
1132       BUG FIX dbmultistats now propagates its format argument (-f). Bug and
1133       fix from Martin Lukac (thanks!).
1134       ENHANCEMENT dbformmail documentation now is clearer that it doesn't
1135       send the mail, you have to run the shell script it writes.  (Problem
1136       observed by Unkyu Park.)
1137       ENHANCEMENT adapted to autoconf-2.61 (and then these changes were
1138       discarded in favor of The Perl Way).
1139       BUG FIX dbmultistats memory usage corrected (O(# tags), not O(1))
1140       ENHANCEMENT dbmultistats can now optionally run with pre-grouped input
1141       in O(1) memory
1142       ENHANCEMENT dbroweval -N was finally implemented (eat comments)
1143
1144   2.0, 25-Jan-08
1145       A quiet 2.0 release (gearing up towards complete).
1146
1147       ENHANCEMENT: shifting old programs to Perl modules, with the front-end
1148       program as just a wrapper. In the short-term, this change just means
1149       programs have real man pages. In the long-run, it will mean that one
1150       can run a pipeline in a single Perl program. So far: dbcol, dbroweval,
1151       the new dbrowcount, dbsort, the new dbmerge, the old "dbstats" (renamed
1152       dbcolstats), dbcolrename, dbcolcreate.
1153       NEW: Fsdb::Filter::dbpipeline is an internal-only module that lets one
1154       use fsdb commands from within perl (via threads).
1155           It also provides perl function aliases for the internal modules, so
1156           a string of fsdb commands in perl is nearly as terse as in the
1157           shell:
1158
1159               use Fsdb::Filter::dbpipeline qw(:all);
1160               dbpipeline(
1161                   dbrow(qw(name test1)),
1162                   dbroweval('_test1 += 5;')
1163               );
1164
1165       INCOMPATIBLE CHANGE: The old dbcolstats has been renamed
1166       dbcolstatscores. The new dbcolstats does the same thing as the old
1167       dbstats. This incompatibility is unfortunate but normalizes program
1168       names.
1169       CHANGE: The new dbcolstats program always outputs "-" (the default
1170       empty value) for statistics it cannot compute (for example, standard
1171       deviation if there is only one row), instead of the old mix of "-" and
1172       "na".
1173       INCOMPATIBLE CHANGE: The old dbcolstats program, now called
1174       dbcolstatscores, also has different arguments.  The "-t mean,stddev"
1175       option is now "--tmean mean --tstddev stddev".  See dbcolstatscores for
1176       details.
1177       INCOMPATIBLE CHANGE: dbcolcreate now assumes all new columns get the
1178       default value rather than requiring each column to have an initial
1179       constant value. To change the initial value, use the new "-e" option.
1180       NEW: dbrowcount counts rows, an almost-subset of dbcolstats's "n"
1181       output (except without differentiating numeric/non-numeric input), or
1182       the equivalent of "dbstripcomments | wc -l".
1183       NEW: dbmerge merges two sorted files. This functionality was previously
1184       embedded in dbsort.
1185       INCOMPATIBLE CHANGE: dbjoin's "-i" option to include non-matches is now
1186       renamed "-a", so as to not conflict with the new standard option "-i"
1187       for input file.
1188
1189   2.1,  6-Apr-08
1190       Another alpha 2.0, but now all converted programs understand both
1191       listize and colize format.
1192
1193       ENHANCEMENT: shifting more old programs to Perl modules. New in 2.1:
1194       dbcolneaten, dbcoldefine, dbcolhisto, dblistize, dbcolize, dbrecolize
1195       ENHANCEMENT dbmerge now handles an arbitrary number of input files, not
1196       just exactly two.
1197       NEW dbmerge2 is an internal routine that handles merging exactly two
1198       files.
1199       INCOMPATIBLE CHANGE dbjoin now specifies inputs like dbmerge2, rather
1200       than assuming the first two arguments were tables (as in fsdb-1).
1201           The old dbjoin argument "-i" is now "-a" or "--type=outer".
1202
1203           A minor change: comments in the source files for dbjoin are now
1204           intermixed with output rather than being delayed until the end.
1205
1206       ENHANCEMENT dbsort now no longer produces warnings when null values are
1207       passed to numeric comparisons.
1208       BUG FIX dbroweval now once again works with code that lacks a trailing
1209       semicolon. (This bug fixes a regression from 1.15.)
1210       INCOMPATIBLE CHANGE dbcolneaten's old "-e" option (to avoid end-of-line
1211       spaces) is now "-E" to avoid conflicts with the standard empty field
1212       argument.
1213       INCOMPATIBLE CHANGE dbcolhisto's old "-e" option is now "-E" to avoid
1214       conflicts. And its "-n", "-s", and "-w" are now "-N", "-S", and "-W" to
1215       correspond.
1216       NEW dbfilealter replaces dbrecolize, dblistize, and dbcolize, but with
1217       different options.
1218       ENHANCEMENT The library routines "Fsdb::IO" now understand both list-
1219       format and column-format data, so all converted programs can now
1220       automatically read either format.  This capability was one of the
1221       milestone goals for 2.0, so yea!
1222
1223   2.2, 23-May-08
1224       Release 2.2 is another 2.x alpha release.  Now most of the commands are
1225       ported, but a few remain, and I plan one last incompatible change (to
1226       the file header) before 2.x final.
1227
1228       ENHANCEMENT
1229           shifting more old programs to Perl modules.  New in 2.2:
1230           dbrowaccumulate, dbformmail, dbcolmovingstats, dbrowuniq,
1231           dbrowdiff, dbcolmerge, dbcolsplittocols, dbcolsplittorows,
1232           dbmapreduce, dbmultistats, dbrvstatdiff.  Also dbrowenumerate
1233           exists only as a front-end (command-line) program.
1234
1235       INCOMPATIBLE CHANGE
1236           The following programs have been dropped from fsdb-2.x:
1237           dbcoltighten, dbfilesplit, dbstripextraheaders,
1238           dbstripleadingspace.
1239
1240       NEW combined_log_format_to_db to convert Apache logfiles
1241
1242       INCOMPATIBLE CHANGE
1243           Options to dbrowdiff are now -B and -I, not -a and -i.
1244
1245       INCOMPATIBLE CHANGE
1246           dbstripcomments is now dbfilestripcomments.
1247
1248       BUG FIXES
1249           dbcolneaten better handles empty columns; dbcolhisto warning
1250           suppressed (actually a bug in high-bucket handling).
1251
1252       INCOMPATIBLE CHANGE
1253           dbmultistats now requires a "-k" option in front of the key (tag)
1254           field, or if none is given, it will group by the first field (both
1255           like dbmapreduce).
1256
1257       KNOWN BUG
1258           dbmultistats with quantile option doesn't work currently.
1259
1260       INCOMPATIBLE CHANGE
1261           dbcoldiff is renamed dbrvstatdiff.
1262
1263       BUG FIXES
1264           dbformmail was leaving its log message as a command, not a
1265           comment.  Oops.  No longer.
1266
1267   2.3, 27-May-08 (alpha)
1268       Another alpha release, this one just to fix the critical dbjoin bug
1269       listed below (that happens to have blocked my MP3 jukebox :-).
1270
1271       BUG FIX
1272           Dbsort no longer hangs if given an input file with no rows.
1273
1274       BUG FIX
1275           Dbjoin now works with unsorted input coming from a pipeline (like
1276           stdin).  Perl-5.8.8 has a bug (?) that was making this case
1277           fail---opening stdin in one thread, reading some, then reading more
1278           in a different thread caused an lseek which works on files, but
1279           fails on pipes like stdin.  Go figure.
1280
1281       BUG FIX / KNOWN BUG
1282           The dbjoin fix also fixed dbmultistats -q (it now gives the right
1283           answer).  However, a new bug appeared, producing messages like:
1284               Attempt to free unreferenced scalar: SV 0xa9dd0c4, Perl
1285               interpreter: 0xa8350b8 during global destruction.
1286           So the dbmultistats_quartile test is still disabled.
1287
1288   2.4, 18-Jun-08
1289       Another alpha release, mostly to fix minor usability problems in
1290       dbmapreduce and client functions.
1291
1292       ENHANCEMENT
1293           dbrow now defaults to running user supplied code without warnings
1294           (as with fsdb-1.x).  Use "--warnings" or "-w" to turn them back on.
1295
1296       ENHANCEMENT
1297           dbroweval can now write different format output than the input,
1298           using the "-m" option.
1299
1300       KNOWN BUG
1301           dbmapreduce emits warnings on perl 5.10.0 about "Unbalanced string
1302           table refcount" and "Scalars leaked" when run with an external
1303           program as a reducer.
1304
1305           dbmultistats emits the warning "Attempt to free unreferenced
1306           scalar" when run with quartiles.
1307
1308           In each case the output is correct.  I believe these can be
1309           ignored.
1310
1311       CHANGE
1312           dbmapreduce no longer logs a line for each reducer that is invoked.
1313
1314   2.5, 24-Jun-08
1315       Another alpha release, fixing more minor bugs in "dbmapreduce" and
1316       lossage in "Fsdb::IO".
1317
1318       ENHANCEMENT
1319           dbmapreduce can now tolerate non-map-aware reducers that pass back
1320           the key column in their output.  It also passes the current key as
1321           the last argument to external reducers.
1322
1323       BUG FIX
1324           Fsdb::IO::Reader, correctly handle "-header" option again.  (Broken
1325           since fsdb-2.3.)
1326
1327   2.6, 11-Jul-08
1328       Another alpha release, needed to fix DaGronk.  One new port, small bug
1329       fixes, and important fix to dbmapreduce.
1330
1331       ENHANCEMENT
1332           shifting more old programs to Perl modules.  New in 2.6:
1333           dbcolpercentile.
1334
1335       INCOMPATIBLE CHANGE and ENHANCEMENTS
1336           dbcolpercentile arguments changed: use "--rank" to require ranking
1337           instead of "-r".  Also, "--ascending" and "--descending" can now be
1338           specified separately, both for "--percentile" and "--rank".
1339       BUG FIX
1340           Sigh, the sense of the --warnings option in dbrow was inverted.  No
1341           longer.
1342
1343       BUG FIX
1344           I found and fixed the string leaks (errors like "Unbalanced string
1345           table refcount" and "Scalars leaked") in dbmapreduce and
1346           dbmultistats.  (All "IO::Handle"s in threads must be manually
1347           destroyed.)
1348
1349       BUG FIX
1350           The "-C" option to specify the column separator in dbcolsplittorows
1351           now works again (broken since it was ported).
1352
1353   2.7, 30-Jul-08 (beta)
1354
1355       The beta release of fsdb-2.x.  Finally, all programs are ported.  Some
1356       statistics: the number of lines of non-library code doubled from 7.5k
1357       to 15.5k.  The libraries are much more complete, going from 866 to 5164
1358       lines.  The overall number of programs is about the same, although 19
1359       were dropped and 11 were added.  The number of test cases has grown
1360       from 116 to 175.  All programs are now in perl-5, no more shell scripts
1361       or perl-4.  All programs now have manual pages.
1362
1363       Although this is a major step forward, I still expect to rename "jdb"
1364       to "fsdb".
1365
1366       ENHANCEMENT
1367           shifting more old programs to Perl modules.  New in 2.7:
1368           dbcolscorrelate, dbcolsregression, cgi_to_db, dbfilevalidate,
1369           db_to_csv, csv_to_db, db_to_html_table, kitrace_to_db,
1370           tcpdump_to_db, tabdelim_to_db, ns_to_db.
1371
1372       INCOMPATIBLE CHANGE
1373           The following programs have been dropped from fsdb-2.x: db2dcliff,
1374           dbcolmultiscale, crl_to_db, ipchain_logs_to_db.  They may come
1375           back, but seemed overly specialized.  The program dbrowsplituniq
1376           was dropped because it is superseded by dbmapreduce.
1377           dmalloc_to_db was dropped pending test cases and examples.
1378
1379       ENHANCEMENT
1380           dbfilevalidate now has a "-c" option to correct errors.
1381
1382       NEW html_table_to_db provides the inverse of db_to_html_table.
1383
1384   2.8,  5-Aug-08
1385       Change header format, preserving forwards compatibility.
1386
1387       BUG FIX
1388           Complete editing pass over the manual, making sure it aligns with
1389           fsdb-2.x.
1390
1391       SEMI-COMPATIBLE CHANGE
1392           The header of fsdb files has changed; it is now #fsdb, not #h (or
1393           #L), and parsing of -F and -R is also different.  See dbfilealter
1394           for the new specification.  The v1 file format will be read,
1395           compatibly, but not written.
1396
1397       BUG FIX
1398           dbmapreduce now tolerates comments that precede the first key,
1399           instead of failing with an error message.
1400
1401   2.9, 6-Aug-08
1402       Still in beta; just a quick bug-fix for dbmapreduce.
1403
1404       ENHANCEMENT
1405           dbmapreduce now generates plausible output when given no rows of
1406           input.
1407
1408   2.10, 23-Sep-08
1409       Still in beta, but picking up some bug fixes.
1410
1411       ENHANCEMENT
1412           dbmapreduce now generates plausible output when given no rows of
1413           input.
1414
1415       ENHANCEMENT
1416           dbroweval's warnings option was backwards; now corrected.  As a
1417           result, warnings in user code now default off (like in fsdb-1.x).
1418
1419       BUG FIX
1420           dbcolpercentile now defaults to assuming the target column is
1421           numeric.  The new option "-N" allows selection of a non-numeric
1422           target.
1423
1424       BUG FIX
1425           dbcolscorrelate now includes "--sample" and "--nosample" options to
1426           compute the sample or full population correlation coefficients.
1427           Thanks to Xue Cai for finding this bug.
1428
1429   2.11, 14-Oct-08
1430       Still in beta, but picking up some bug fixes.
1431
1432       ENHANCEMENT
1433           html_table_to_db is now more aggressive about filling in empty
1434           cells with the official empty value, rather than leaving them blank
1435           or as whitespace.
1436
1437       ENHANCEMENT
1438           dbpipeline now catches failures during pipeline element setup and
1439           exits reasonably gracefully.
1440
1441       BUG FIX
1442           dbsubprocess now reaps child processes, thus avoiding running out
1443           of processes when used a lot.
1444
1445   2.12, 16-Oct-08
1446       Finally, a full (non-beta) 2.x release!
1447
1448       INCOMPATIBLE CHANGE
1449           Jdb has been renamed Fsdb, the flatfile-streaming database.  This
1450           change affects all internal Perl APIs, but no shell command-level
1451           APIs.  While Jdb served well for more than ten years, it is easily
1452           confused with the Java debugger (even though Jdb was there first!).
1453           It also is too generic to work well in web search engines.
1454           Finally, Jdb stands for ``John's database'', and we're a bit beyond
1455           that.  (However, some call me the ``file-system guy'', so one could
1456           argue it retains that meaning.)
1457
1458           If you just used the shell commands, this change should not affect
1459           you.  If you used the Perl-level libraries directly in your code,
1460           you should be able to rename "Jdb" to "Fsdb" to move to 2.12.
1461
1462           The jdb-announce list has not yet been renamed, but will be shortly.
1463
1464           With this release I've accomplished everything I wanted to in
1465           fsdb-2.x.  I therefore expect to return to boring, bugfix releases.
1466
1467   2.13, 30-Oct-08
1468       BUG FIX
1469           dbrowaccumulate now treats non-numeric data as zero by default.
1470
1471       BUG FIX
1472           Fixed a perl-5.10ism in dbmapreduce that breaks that program under
1473           5.8.  Thanks to Martin Lukac for reporting the bug.
1474
1475   2.14, 26-Nov-08
1476       BUG FIX
1477           Improved documentation for dbmapreduce's "-f" option.
1478
1479       ENHANCEMENT
1480           dbcolmovingstats now computes a moving standard deviation in
1481           addition to a moving mean.
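
A moving mean and moving standard deviation can be sketched as follows. This is only an illustration in Python, not dbcolmovingstats' code: the trailing window, the use of the sample (n-1) standard deviation, and the name moving_stats are all assumptions.

```python
import math
from collections import deque

def moving_stats(values, window):
    """Yield (mean, sample_stddev) over a trailing window (assumed behavior)."""
    recent = deque(maxlen=window)  # keeps only the last `window` values
    for v in values:
        recent.append(v)
        n = len(recent)
        mean = sum(recent) / n
        if n > 1:
            var = sum((x - mean) ** 2 for x in recent) / (n - 1)
            sd = math.sqrt(var)
        else:
            sd = None  # standard deviation is undefined for one sample
        yield mean, sd

rows = list(moving_stats([1.0, 2.0, 3.0, 4.0], window=2))
```

Each output pair corresponds to one input row; the first row has no standard deviation because its window holds a single value.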
1482
1483   2.15, 13-Apr-09
1484       BUG FIX
1485           Fix a make install bug reported by Shalindra Fernando.
1486
1487   2.16, 14-Apr-09
1488       BUG FIX
1489           Another minor release bug: on some systems programize_module loses
1490           executable permissions.  Again reported by Shalindra Fernando.
1491
1492   2.17, 25-Jun-09
1493       TYPO FIXES
1494           Typo in the dbroweval manual fixed.
1495
1496       IMPROVEMENT
1497           There is no longer a comment line to label columns in dbcolneaten,
1498           instead the header line is tweaked to line up.  This change
1499           restores the Jdb-1.x behavior, and means that repeated runs of
1500           dbcolneaten no longer add comment lines each time.
1501
1502       BUG FIX
1503           It turns out dbcolneaten was not correctly handling trailing
1504           spaces when given the "-E" option to suppress them.  This
1505           regression is now fixed.
1506
1507       EXTENSION
1508           dbroweval(1) can now handle direct references to the last row via
1509           $lfref, a dubious but now documented feature.
1510
1511       BUG FIXES
1512           Separators set with "-C" in dbcolmerge and dbcolsplittocols were
1513           not properly setting the heading, and null fields were not
1514           recognized.  The first bug was reported by Martin Lukac.
1515
1516   2.18,  1-Jul-09  A minor release
1517       IMPROVEMENT
1518           Documentation for Fsdb::IO::Reader has been improved.
1519
1520       IMPROVEMENT
1521           The package should now be PGP-signed.
1522
1523   2.19,  10-Jul-09
1524       BUG FIX
1525           Internal improvements to debugging output and robustness of
1526           dbmapreduce and dbpipeline.  TEST/dbpipeline_first_fails.cmd re-
1527           enabled.
1528
1529   2.20, 30-Nov-09 (A collection of minor bugfixes, plus a build against
1530       Fedora 12.)
1531       BUG FIX
1532           Logging for dbmapreduce with code refs is now stable (it no longer
1533           includes a hex pointer to the code reference).
1534
1535       BUG FIX
1536           Better handling of mixed blank lines in Fsdb::IO::Reader (see test
1537           case dbcolize_blank_lines.cmd).
1538
1539       BUG FIX
1540           html_table_to_db now handles multi-line input better, and handles
1541           tables with COLSPAN.
1542
1543       BUG FIX
1544           dbpipeline now cleans up threads in an "eval" to prevent "cannot
1545           detach a joined thread" errors that popped up in perl-5.10.
1546           Hopefully this prevents a race condition that causes the test
1547           suites to hang about 20% of the time (in dbpipeline_first_fails).
1548
1549       IMPROVEMENT
1550           dbmapreduce now detects and correctly fails when the input and
1551           reducer have incompatible field separators.
1552
1553       IMPROVEMENT
1554           dbcolstats, dbcolhisto, dbcolscorrelate, dbcolsregression, and
1555           dbrowcount now all take an "-F" option to let one specify the
1556           output field separator (so they work better with dbmapreduce).
1557
1558       BUG FIX
1559           An omitted "-k" from the manual page of dbmultistats is now there.
1560           Bug reported by Unkyu Park.
1561
1562   2.21, 17-Apr-10 bug fix release
1563       BUG FIX
1564           Fsdb::IO::Writer now no longer fails with -outputheader => never
1565           (an obscure bug).
1566
1567       IMPROVEMENT
1568           Fsdb (in the warnings section) and dbcolstats now more carefully
1569           document how they handle (and do not handle) numerical precision
1570           problems, and other general limits.  Thanks to Yuri Pradkin for
1571           prompting this documentation.
1572
1573       IMPROVEMENT
1574           "Fsdb::Support::fullname_to_sortkey" is now restored from "Jdb".
1575
1576       IMPROVEMENT
1577           Documentation for multiple styles of input approaches (including
1578           performance description) added to Fsdb::IO.
1579
1580   2.22, 2010-10-31 One new tool dbcolcopylast and several bug fixes for Perl
1581       5.10.
1582       BUG FIX
1583           dbmerge now correctly handles n-way merges.  Bug reported by Yuri
1584           Pradkin.
1585
1586       INCOMPATIBLE CHANGE
1587           dbcolneaten now defaults to not padding the last column.
1588
1589       ADDITION
1590           dbrowenumerate now takes -N NewColumn to give the new column a name
1591           other than "count".  Feature requested by Mike Rouch in January
1592           2005.
1593
1594       ADDITION
1595           New program dbcolcopylast copies the last value of a column into a
1596           new column copylast_column of the next row.  New program requested
1597           by Fabio Silva; useful for converting dbmultistats output into
1598           dbrvstatdiff input.
1599
1600       BUG FIX
1601           Several tools (particularly dbmapreduce and dbmultistats) would
1602           report errors like "Unbalanced string table refcount: (1) for
1603           "STDOUT" during global destruction" on exit, at least on certain
1604           versions of Perl (for me on 5.10.1), but similar errors have been
1605           off-and-on for several Perl releases.  Although I think my code
1606           looked OK, I worked around this problem with a different way of
1607           handling standard IO redirection.
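
As a rough illustration of what dbcolcopylast (added in this release) does, the sketch below carries each row's value forward into a new copylast_ column on the next row. It is written in Python and is not the tool's implementation; the function name copy_last and the use of "-" as the empty value for the first row are assumptions.

```python
def copy_last(rows, column, empty="-"):
    """Add copylast_<column>, holding the previous row's value of <column>."""
    new_col = "copylast_" + column
    previous = empty  # assumed placeholder for the first row
    out = []
    for row in rows:
        out.append({**row, new_col: previous})
        previous = row[column]
    return out

rows = [{"mean": "37.2"}, {"mean": "37.3"}, {"mean": "264.5"}]
result = copy_last(rows, "mean")
```

Having the previous row's value beside the current one is what makes row-to-row comparisons possible in a single pass.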
1608
1609   2.23, 2011-03-10 Several small portability bugfixes; improved dbcolstats
1610       for large datasets
1611       IMPROVEMENT
1612           Documentation to dbrvstatdiff was changed to use "sd" to refer to
1613           standard deviation, not "ss" (which might be confused with sum-of-
1614           squares).
1615
1616       BUG FIX
1617           This documentation about dbmultistats was missing the -k option in
1618           some cases.
1619
1620       BUG FIX
1621           dbmapreduce was failing on MacOS-10.6.3 for some tests with the
1622           error
1623
1624               dbmapreduce: cannot run external dbmapreduce reduce program (perl TEST/dbmapreduce_external_with_key.pl)
1625
1626           The problem seemed to be only in the error, not in operation.  On
1627           MacOS, the error is now suppressed.  Thanks to Alefiya Hussain for
1628           providing access to a Mac system that allowed debugging of this
1629           problem.
1630
1631       IMPROVEMENT
1632           The csv_to_db command requires an external Perl library
1633           (Text::CSV_XS).  On computers that lack this optional library,
1634           previously Fsdb would configure with a warning and then test cases
1635           would fail.  Now those test cases are skipped with an additional
1636           warning.
1637
1638       BUG FIX
1639           The test suite now supports alternative valid output, as a hack to
1640           account for last-digit floating point differences.  (Not very
1641           satisfying :-( )
1642
1643       BUG FIX
1644           dbcolstats output for confidence intervals on very large datasets
1645           has changed.  Previously it failed for more than 2^31-1 records,
1646           and handling of T-Distributions with thousands of rows was a bit
1647           dubious.  Now datasets with more than 10000 records are considered
1648           infinitely large and hopefully correctly handled.
1649
1650   2.24, 2011-04-15 Improvements to fix an old bug in dbmapreduce with
1651       different field separators
1652       IMPROVEMENT
1653           The dbfilealter command had a "--correct" option to work around
1654           incompatible field separators, but it did nothing.  Now it
1655           does the correct but sad, data-losing thing.
1656
1657       IMPROVEMENT
1658           The dbmultistats command previously failed with an error message
1659           when invoked on input with a non-default field separator.  The root
1660           cause was the underlying dbmapreduce that did not handle the case
1661           of reducers that generated output with a different field separator
1662           than the input.  We now detect and repair incompatible field
1663           separators.  This change corrects a problem originally documented
1664           and detected in Fsdb-2.20.  Bug re-reported by Unkyu Park.
1665
1666   2.25, 2011-08-07 Two new tools, xml_to_db and dbfilepivot, and a bugfix for
1667       two people.
1668       IMPROVEMENT
1669           kitrace_to_db now supports a --utc option, which also fixes this
1670           test case for users outside of the Pacific time zone.  Bug reported
1671           by David Graff, and also by Peter Desnoyers (within a week of each
1672           other :-)
1673
1674       NEW xml_to_db can convert simple, very regular XML files into Fsdb.
1675
1676       NEW dbfilepivot "pivots" a file, converting multiple rows corresponding
1677           to the same entity into a single row with multiple columns.
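
The pivot operation described for dbfilepivot can be sketched in Python as below. This is a minimal illustration, not the tool's code: the function name pivot, the "-" empty value for missing cells, and the sample "size" row are hypothetical.

```python
def pivot(rows, key, attribute, value, empty="-"):
    """Turn (key, attribute, value) rows into one wide row per key."""
    columns = []  # attribute names, in first-seen order
    by_key = {}   # key -> {attribute: value}
    for row in rows:
        if row[attribute] not in columns:
            columns.append(row[attribute])
        by_key.setdefault(row[key], {})[row[attribute]] = row[value]
    # Fill cells that a key never supplied with the empty value.
    return [
        {key: k, **{c: attrs.get(c, empty) for c in columns}}
        for k, attrs in by_key.items()
    ]

long_rows = [
    {"experiment": "ufs_mab_sys", "metric": "duration", "value": "37.2"},
    {"experiment": "ufs_mab_sys", "metric": "size", "value": "1024"},
    {"experiment": "ufs_rcp_real", "metric": "duration", "value": "264.5"},
]
wide_rows = pivot(long_rows, "experiment", "metric", "value")
```

Here three long rows become two wide rows, one per experiment, with a "duration" and a "size" column each.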
1678
1679   2.26, 2011-12-12 Bug fixes, particularly for perl-5.14.2.
1680       BUG FIX
1681           Bugs fixed in Fsdb::IO::Reader(3) manual page.
1682
1683       BUG FIX
1684           Fixed problems where dbcolstats was truncating floating point
1685           numbers when sorting.  This strange behavior happens as of
1686           perl-5.14.2 and it seems like a Perl bug.  I've worked around it
1687           for the test suites, but I'm a bit nervous.
1688
1689   2.27, 2012-11-15 Accumulated bug fixes.
1690       IMPROVEMENT
1691           csv_to_db now reports errors in CSV input with real diagnostics.
1692
1693       IMPROVEMENT
1694           dbcolmovingstats can now compute median, when given the "-m"
1695           option.
1696
1697       BUG FIX
1698           dbcolmovingstats non-numeric handling (the "-a" option) now works
1699           properly.
1700
1701       DOCUMENTATION
1702           The internal t/test_command.t test framework is now documented.
1703
1704       BUG FIX
1705           dbrowuniq now correctly handles the case where there is no input
1706           (previously it output a blank line, which is a malformed fsdb
1707           file).  Thanks to Yuri Pradkin for reporting this bug.
1708
1709   2.28, 2012-11-15 A quick release to fix most rpmlint errors.
1710       BUG FIX
1711           Fixed a number of minor release problems (wrong permissions, old
1712           FSF address, etc.) found by rpmlint.
1713
1714   2.29, 2012-11-20 a quick release for CPAN testing
1715       IMPROVEMENT
1716           Tweaked the RPM spec.
1717
1718       IMPROVEMENT
1719           Modified Makefile.PL to fail gracefully on Perl installations that
1720           lack threads.  (Without this fix, I get massive failures in the
1721           non-ithreads test system.)
1722
1723   2.30, 2012-11-25 improvements to perl portability
1724       BUG FIX
1725           Removed unicode character in documentation of dbcolscorrelated so pod
1726           tests will pass.  (Sigh, that should work :-( )
1727
1728       BUG FIX
1729           Fixed test suite failures on 5 tests (dbcolcreate_double_creation
1730           was the first) due to Carp's addition of a period.  This problem
1731           was breaking Fsdb on perl-5.17.  Thanks to Michael McQuaid for
1732           helping diagnose this problem.
1733
1734       IMPROVEMENT
1735           The test suite now prints out the names of tests it tries.
1736
1737   2.31, 2012-11-28 A release with actual improvements to dbfilepivot and
1738       dbrowuniq.
1739       BUG FIX
1740           Documentation fixes: typos in dbcolscorrelated, bugs in
1741           dbfilepivot, clarification for comment handling in
1742           Fsdb::IO::Reader.
1743
1744       IMPROVEMENT
1745           Previously dbfilepivot assumed the input was grouped by keys and
1746           didn't verify that pre-condition.  Now there is no pre-condition (it
1747           will sort the input by default), and it checks if the invariant is
1748           violated.
1749
1750       BUG FIX
1751           Previously dbfilepivot failed if the input had comments (oops :-);
1752           no longer.
1753
1754       IMPROVEMENT
1755           Now dbrowuniq has the "-L" option to preserve the last unique row
1756           (instead of the first), a common idiom.
1757
1758   2.32, 2012-12-21 Test suites should now be more numerically robust.
1759       NEW New dbfilediff does fsdb-aware file differencing.  It does not do
1760           smart intuition of add/removes like Unix diff(1), but it does know
1761           about columns, and with "-E", it does numeric-aware differences.
1762
1763       IMPROVEMENT
1764           Test suites that are numeric now use dbfilediff to do numeric-aware
1765           comparisons, so the test suite should now be robust to slightly
1766           different computers and operating systems and compilers than
1767           exactly what I use.
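
The idea of a numeric-aware comparison can be sketched as follows. This Python fragment is an assumption-laden illustration, not dbfilediff's algorithm: the relative tolerance of 1e-6 and the names fields_equal and rows_differ are invented for the example.

```python
import math

def fields_equal(a, b, rel_tol=1e-6):
    """Compare two fields numerically when both parse as floats."""
    try:
        return math.isclose(float(a), float(b), rel_tol=rel_tol)
    except ValueError:
        return a == b  # non-numeric fields must match exactly

def rows_differ(row1, row2, rel_tol=1e-6):
    """True if the rows differ in length or in any field."""
    return len(row1) != len(row2) or not all(
        fields_equal(a, b, rel_tol) for a, b in zip(row1, row2)
    )

tiny_diff = rows_differ(["ufs_mab_sys", "37.2"], ["ufs_mab_sys", "37.2000001"])
changed = rows_differ(["ufs_mab_sys", "37.2"], ["ufs_mab_sys", "37.3"])
```

A last-digit floating point difference is treated as equal, while a real change in the value is still reported.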
1768
1769   2.33, 2012-12-23 Minor fixes to some test cases.
1770       IMPROVEMENT
1771           dbfilediff and dbrowuniq now support the "-N" option to give the
1772           new column a different name.  (And test cases where this
1773           duplication mattered have been fixed.)
1774
1775       IMPROVEMENT
1776           dbrvstatdiff now shows the t-test breakpoint with a reasonable
1777           number of floating point digits.
1778
1779       BUG FIX
1780           Fixed a numerical stability problem in the dbroweval_last test
1781           case.
1782

WHAT'S NEW

1784   2.34, 2013-02-10 Parallelism in dbmerge.
1785       IMPROVEMENT
1786           Documentation for dbjoin now includes resource requirements.
1787
1788       IMPROVEMENT
1789           Default memory usage for dbsort is now about 256MB.  (The world
1790           keeps moving forward.)
1791
1792       IMPROVEMENT
1793           dbmerge now does merging in parallel.  As a side-effect, dbsort
1794           should be faster when input overflows memory.  The level of
1795           parallelism can be limited with the "--parallelism" option.  (There
1796           is more work to do here, but we're off to a start.)
1797
1798   2.35, 2013-02-23 Improvements to dbmerge parallelism
1799       BUG FIX
1800           Fsdb temporary files are now created more securely (with
1801           File::Temp).
1802
1803       IMPROVEMENT
1804           Programs that sort or merge on fields (dbmerge2, dbmerge, dbsort,
1805           dbjoin) now report an error if no fields on which to join or merge
1806           are given.
1807
1808       IMPROVEMENT
1809           Parallelism in dbmerge should now be more consistent, with less
1810           starting and stopping.
1811
1812       IMPROVEMENT In dbmerge, the "--xargs" option lets one give input
1813       filenames on standard input, rather than the command line. This feature
1814       paves the way for faster dbsort for large inputs (by pipelining sorting
1815       and merging), expected in the next release.
1816
1817   2.36, 2013-02-25 dbsort pipelines with dbmerge
1818       IMPROVEMENT For large inputs, dbsort now pipelines sorting and merging,
1819       allowing earlier processing.
1820       BUG FIX Since 2.35, dbmerge delayed cleanup of intermediate files,
1821       thereby requiring extra disk space.
1822
1823   2.37, 2013-02-26 quick bugfix to support parallel sort and merge from
1824       recent releases
1825       BUG FIX Since 2.35, dbmerge delayed removal of input files given by
1826       "--xargs".  This problem is now fixed.
1827
1828   2.38, 2013-04-29 minor bug fixes
1829       CLARIFICATION
1830           Configure now rejects Windows since tests seem to hang on some
1831           versions of Windows.  (I would love help from a Windows developer
1832           to get this problem fixed, but I cannot do it.)  See
1833           https://rt.cpan.org/Ticket/Display.html?id=84201.
1834
1835       IMPROVEMENT
1836           All programs that use temporary files (dbcolpercentile,
1837           dbcolscorrelate, dbcolstats, dbcolstatscores) now take the "-T"
1838           option and set the temporary directory consistently.
1839
1840           In addition, error messages are better when the temporary directory
1841           has problems.  Problem reported by Liang Zhu.
1842
1843       BUG FIX
1844           dbmapreduce was failing with external, map-reduce aware reducers
1845           (when invoked with -M and an external program).  (Sigh, did this
1846           case ever work?)  This case should now work.  Thanks to Yuri
1847           Pradkin for reporting this bug (in 2011).
1848
1849       BUG FIX
1850           Fixed perl-5.10 problem with dbmerge.  Thanks to Yuri Pradkin for
1851           reporting this bug (in 2013).
1852
1853   2.39, 2013-05-31 quick release for the dbrowuniq extension
1854       BUG FIX
1855           Actually in 2.38, the Fedora .spec got cleaner dependencies.
1856           Suggestion from Christopher Meng via
1857           <https://bugzilla.redhat.com/show_bug.cgi?id=877096>.
1858
1859       ENHANCEMENT
1860           Fsdb files are now explicitly set into UTF-8 encoding, unless one
1861           specifies "-encoding" to "Fsdb::IO".
1862
1863       ENHANCEMENT
1864           dbrowuniq now supports "-I" for incremental counting.
1865
1866   2.40, 2013-07-13 small bug fixes
1867       BUG FIX
1868           dbsort now has more respect for a user-given temporary directory;
1869           it no longer is ignored for merging.
1870
1871       IMPROVEMENT
1872           dbrowuniq now has options to output the first, last, and both first
1873           and last rows of a run ("-F", "-L", and "-B").
1874
1875       BUG FIX
1876           dbrowuniq now correctly handles "-N".  Sigh, it didn't work before.
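
The first, last, and both-ends choices for a run of identical rows (the "-F", "-L", and "-B" options of dbrowuniq) can be sketched in Python as below. This is an illustration only: the function uniq_runs and its mode names are hypothetical, and emitting a single-row run only once in "both" mode is the behavior this sketch assumes.

```python
from itertools import groupby

def uniq_runs(rows, key, mode="first"):
    """Collapse runs of rows with equal key: 'first', 'last', or 'both'."""
    out = []
    for _, run in groupby(rows, key=key):
        run = list(run)
        if mode == "first":
            out.append(run[0])
        elif mode == "last":
            out.append(run[-1])
        elif mode == "both":
            out.append(run[0])
            if len(run) > 1:  # a singleton run is emitted only once
                out.append(run[-1])
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out

rows = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]
name = lambda row: row[0]
firsts = uniq_runs(rows, name, "first")
lasts = uniq_runs(rows, name, "last")
both = uniq_runs(rows, name, "both")
```

As with uniq(1), only adjacent rows are grouped, so the input is expected to be sorted or otherwise grouped on the key.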
1877
1878   2.41, 2013-07-29 small bug and packaging fixes
1879       ENHANCEMENT
1880           Documentation to dbrvstatdiff improved (inspired by questions from
1881           Qian Kun).
1882
1883       BUG FIX
1884           dbrowuniq no longer duplicates singleton unique lines when
1885           outputting both (with "-B").
1886
1887       BUG FIX
1888           Add missing "XML::Simple" dependency to Makefile.PL.
1889
1890       ENHANCEMENT
1891           Tests now show the diff of the failing output if run with "make
1892           test TEST_VERBOSE=1".
1893
1894       ENHANCEMENT
1895           dbroweval now includes documentation for how to output extra rows.
1896           Suggestion from Yuri Pradkin.
1897
1898       BUG FIX
1899           Several improvements to the Fedora package from Michael Schwendt
1900           via <https://bugzilla.redhat.com/show_bug.cgi?id=877096>, and from
1901           the harsh master that is rpmlint.  (I am stymied at teaching it
1902           that "outliers" is spelled correctly.  Maybe I should send it
1903           Schneier's book.  And an unresolvable invalid-spec-name lurks in
1904           the SRPM.)
1905
1906   2.42, 2013-07-31 A bug fix and packaging release.
1907       ENHANCEMENT
1908           Documentation to dbjoin improved to better describe memory
1909           usage.  (Based on problem report by Lin Quan.)
1910
1911       BUG FIX
1912           The .spec is now perl-Fsdb.spec to satisfy rpmlint.  Thanks to
1913           Christopher Meng for a specific bug report.
1914
1915       BUG FIX
1916           Test dbroweval_last.cmd no longer has a column that caused failures
1917           because of numerical instability.
1918
1919       BUG FIX
1920           Some tests now better handle bugs in old versions of perl (5.10,
1921           5.12).  Thanks to Calvin Ardi for help debugging this on a Mac with
1922           perl-5.12, but the fix should affect other platforms.
1923
1924   2.43, 2013-08-27 Adds in-file compression.
1925       BUG FIX
1926           Changed the sort on TEST/dbsort_merge.cmd to strings (from
1927           numerics) so we're less susceptible to false test-failures due to
1928           floating point IO differences.
1929
1930       EXPERIMENTAL ENHANCEMENT
1931           Yet more parallelism in dbmerge: new "endgame-mode" builds a merge
1932           tree of processes at the end of large merge tasks to get maximal
1933           parallelism.  Currently this feature is off by default because it
1934           can hang for some inputs.  Enable this experimental feature with
1935           "--endgame".
1936
1937       ENHANCEMENT
1938           "Fsdb::IO" now handles being given "IO::Pipe" objects (as exercised
1939           by dbmerge).
1940
1941       BUG FIX
1942           Handling of NamedTmpfiles now supports concurrency.  This fix will
1943           hopefully fix occasional "Use of uninitialized value $_ in string
1944           ne at ...NamedTmpfile.pm line 93."  errors.
1945
1946       BUG FIX
1947           Fsdb now requires perl 5.10.  This is a bug fix because some test
1948           cases used to require it, but this fact was not properly
1949           documented.  (Back-porting to 5.008 would require removing all "//"
1950           operators.)
1951
1952       ENHANCEMENT
1953           Fsdb now handles automatic compression of file contents.  Enable
1954           compression with "dbfilealter -Z xz" (or "gz" or "bz2").  All
1955           programs should operate on compressed files and leave the output
1956           with the same level of compression.  "xz" is recommended as fastest
1957           and most efficient.  "gz" produces unrepeatable output (and so
1958           has no output test); it seems to insist on adding a timestamp.
1959
1960   2.44, 2013-10-02 A major change--all threads are gone.
1961       ENHANCEMENT
1962           Fsdb is now thread free and only uses processes for parallelism.
1963           This change is a big change--the entire motivation for Fsdb-2 was
1964           to exploit parallelism via threading.  Parallelism--good, but perl
1965           threading--bad for performance.  Horribly bad for performance.
1966           About 20x worse than pipes on my box.  (See perl bug #119445 for
1967           the discussion.)
1968
1969       NEW "Fsdb::Support::Freds" provides a thread-like abstraction over
1970           forking, with some nice support for callbacks in the parent upon
1971           child termination.
1972
1973       ENHANCEMENT
1974           Details about removing threads: "dbpipeline" is thread free, and
1975           new tests verify each of its parts.  The easy cases are
1976           "dbcolpercentile", "dbcolstats", "dbfilepivot", "dbjoin", and
1977           "dbcolstatscores", each of which uses it in simple ways
1978           (2013-09-09).  "dbmerge" is now thread free (2013-09-13), but was a
1979           significant rewrite, which brought "dbsort" along.  "dbmapreduce"
1980           is partly thread free (2013-09-21), again as a rewrite, and it
1981           brings "dbmultistats" along.  Full "dbmapreduce" support took much
1982           longer (2013-10-02).
1983
1984       BUG FIX
1985           When running with user-only output ("-n"), dbroweval now resets the
1986           output vector $ofref after it has been output.
1987
1988       NEW dbcolcreate will create all columns at the head of each row with
1989           the "--first" option.
1990
1991       NEW dbfilecat will concatenate two files, verifying that they have the
1992           same schema.
1993
1994       ENHANCEMENT
1995           dbmapreduce now passes comments through, rather than eating them as
1996           before.
1997
1998           Also, dbmapreduce now supports a "--" option to prevent
1999           misinterpreting sub-program parameters as for dbmapreduce.
2000
2001       INCOMPATIBLE CHANGE
2002           dbmapreduce no longer figures out if it needs to add the key to the
2003           output.  For multi-key-aware reducers, it never does (and cannot).
2004           For non-multi-key-aware reducers, it defaults to add the key and
2005           will now fail if the reducer adds the key (with error "dbcolcreate:
2006           attempt to create pre-existing column...").  In such cases, one
2007           must disable adding the key with the new option "--no-prepend-key".
2008
2009       INCOMPATIBLE CHANGE
2010           dbmapreduce no longer copies the input field separator by default.
2011           For multi-key-aware reducers, it never does (and cannot).  For non-
2012           multi-key-aware reducers, it defaults to not copying the field
2013           separator, but it will copy it (the old default) with the
2014           "--copy-fs" option.
2015
2016   2.45, 2013-10-07 cleanup from de-thread-ification
2017       BUG FIX
2018           Corrected a fast busy-wait in dbmerge.
2019
2020       ENHANCEMENT
2021           Endgame mode enabled in dbmerge; it (and also large cases of
2022           dbsort) should now exploit greater parallelism.
2023
2024       BUG FIX
2025           Test case with "Fsdb::BoundedQueue" (gone since 2.44) now removed.
2026
2027   2.46, 2013-10-08 continuing cleanup of our no-threads version
2028       BUG FIX
2029           Fixed some packaging details.  (Really, threads are no longer
2030           required, missing tests in the MANIFEST.)
2031
2032       IMPROVEMENT
2033           dbsort now better communicates with the merge process to avoid
2034           bursty parallelism.
2035
2036           Fsdb::IO::Writer now can take "-autoflush => 1" for line-buffered
2037           IO.
2038
2039   2.47, 2013-10-12 test suite cleanup for non-threaded perls
2040       BUG FIX
2041           Removed some stray "use threads" in some test cases.  We didn't
2042           need them, and these were breaking non-threaded perls.
2043
2044       BUG FIX
2045           Better handling of Fred cleanup; should fix intermittent
2046           dbmapreduce failures on BSD.
2047
2048       ENHANCEMENT
2049           Improved test framework to show output when tests fail.  (This
2050           time, for real.)
2051
2052   2.48, 2014-01-03 small bugfixes and improved release engineering
2053       ENHANCEMENT
2054           Test suites now skip tests for libraries that are missing.  (Patch
2055           for missing "IO::Compress::Xz" contributed by Calvin Ardi.)
2056
2057       ENHANCEMENT
2058           Removed references to Jdb in the package specification.  Since the
2059           name was changed in 2008, there's no longer a huge need for
2060           backwards compatibility.  (Suggestion from Petr Šabata.)
2061
2062       ENHANCEMENT
2063           Test suites now invoke the perl using the path from
2064           $Config{perlpath}.  Hopefully this helps testing in environments
2065           where there are multiple installed perls and the default perl is
2066           not the same as the perl-under-test (as happens in
2067           cpantesters.org).
2068
2069       BUG FIX
2070           Added specific encoding to this manpage to account for Unicode.
2071           Required to build correctly against perl-5.18.
2072
2073   2.49, 2014-01-04 bugfix to unicode handling in Fsdb IO (plus minor
2074       packaging fixes)
2075       BUG FIX
2076           Restored a line in the .spec to chmod g-s.
2077
2078       BUG FIX
2079           Unicode decoding is now handled correctly for programs that read
2080           from standard input.  (Also: New test scripts cover unicode input
2081           and output.)
2082
2083       BUG FIX
2084           Fix to Fsdb documentation encoding line.  Addresses test failure in
2085           perl-5.16 and earlier.  (Who knew "encoding" had to be followed by
2086           a blank line.)
2087

WHAT'S NEW

2089   2.50, 2014-05-27 a quick release for spec tweaks
2090       ENHANCEMENT
2091           In dbroweval, the "-N" (no output, even comments) option now
2092           implies "-n", and it now suppresses the header and trailer.
2093
2094       BUG FIX
2095           A few more tweaks to the perl-Fsdb.spec from Petr Šabata.
2096
2097       BUG FIX
2098           Fixed 3 uses of "use v5.10" in test suites that were causing test
2099           failures (due to warnings, not real failures) on some platforms.
2100
2101   2.51, 2014-09-05 Feature enhancements to dbcolmovingstats, dbcolcreate,
2102       dbmapreduce, and new sqlselect_to_db
2103       ENHANCEMENT
2104           dbcolcreate now has a "--no-recreate-fatal" that causes it to
2105           ignore creation of existing columns (instead of failing).
2106
2107       ENHANCEMENT
2108           dbmapreduce once again is robust to reducers that output the key;
2109           "--no-prepend-key" is no longer mandatory.
2110
2111       ENHANCEMENT
2112           dbcolsplittorows can now enumerate the output rows with "-E".
2113
2114       BUG FIX
2115           dbcolmovingstats is more mathematically robust.  Previously for
2116           some inputs and some platforms, floating point rounding could
2117           sometimes cause square roots of negative numbers.
2118
2119       NEW sqlselect_to_db converts the output of the MySQL or MariaDB select
2120           command into fsdb format.
2121
2122       INCOMPATIBLE CHANGE
2123           dbfilediff now outputs the second row when doing sloppy numeric
2124           comparisons, to better support test suites.
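
The dbcolmovingstats fix in this release concerns floating point rounding that can push a computed variance slightly below zero. The Python sketch below shows one common guard, clamping the variance at zero before taking the square root. Whether dbcolmovingstats uses running sums or this exact guard is an assumption; stddev_from_sums is a hypothetical name.

```python
import math

def stddev_from_sums(n, total, total_sq):
    """Sample stddev from running sums; clamps tiny negative variance to 0."""
    if n < 2:
        return None
    variance = (total_sq - total * total / n) / (n - 1)
    # Rounding can make `variance` a tiny negative number when all values
    # are (nearly) equal; math.sqrt would then raise ValueError.
    return math.sqrt(max(variance, 0.0))

values = [0.1] * 10
sd = stddev_from_sums(len(values), sum(values), sum(v * v for v in values))
```

For the constant input above the result is zero or very nearly zero, and the call cannot fail regardless of how the rounding falls.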
2125
2126   2.52, 2014-11-03 Fixing the test suite for line number changes.
2127       ENHANCEMENT
2128           Test suites changed to be robust to exact line numbers of failures,
2129           since different Perl releases fail on different lines.
2130           <https://bugzilla.redhat.com/show_bug.cgi?id=1158380>
2131
2132   2.53, 2014-11-26 bug fixes and stability improvements to dbmapreduce
2133       ENHANCEMENT
2134           dbfilediff now supports a "--quiet" option.
2135
2136       ENHANCEMENT
2137           Better documentation of dbpipeline_filter.
2138
2139       BUGFIX
2140           Added groff-base and perl-podlators to the Fedora package spec.
2141           Fixes <https://bugzilla.redhat.com/show_bug.cgi?id=1163149>.  (Also
2142           in package 2.52-2.)
2143
2144       BUGFIX
2145           An important stability improvement to dbmapreduce.  It, plus
2146           dbmultistats, and dbcolstats now support controlled parallelism
2147           with the "--parallelism=N" option.  They default to run with the
2148           number of available CPUs.  dbmapreduce also moderates its level of
2149           parallelism.  Previously it would create reducers as needed,
2150           causing CPU thrashing if reducers ran much slower than data
2151           production.
2152
2153       BUGFIX
2154           The combination of dbmapreduce with dbrowenumerate now works as it
2155           should.  (The obscure bug was an interaction with dbcolcreate with
2156           non-multi-key reducers that output their own key.  dbmapreduce has
2157           too many useful corner cases.)
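
The controlled parallelism described for dbmapreduce in this release (at most N reducers at once, defaulting to the number of CPUs) can be sketched as below. This Python fragment is only an illustration: it uses a thread pool for brevity, whereas Fsdb itself uses processes, and reduce_groups is a hypothetical name.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def reduce_groups(groups, reducer, parallelism=None):
    """Run reducer over each key's rows with at most `parallelism` workers."""
    workers = parallelism or os.cpu_count() or 1  # default: available CPUs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with the keys.
        results = pool.map(reducer, groups.values())
        return dict(zip(groups.keys(), results))

groups = {"ufs_mab_sys": [37.2, 37.3], "ufs_rcp_real": [264.5, 277.9]}
means = reduce_groups(groups, lambda xs: sum(xs) / len(xs), parallelism=2)
```

Capping the worker count keeps slow reducers from piling up faster than they finish, which is the thrashing problem the entry describes.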
2158
2159   2.54, 2014-11-28 fix for the test suite to correct failing tests on not-my-
2160       platform
2161       BUGFIX
2162           Sigh, the test suite now has a test suite.  Because, yes, I broke
2163           it, causing many incorrect failures at cpantesters.  Now fixed.
2164
2165   2.55, 2015-01-05 many spelling fixes and dbcolmovingstats tests are more
2166       robust to different numeric precision
2167       ENHANCEMENT
2168           dbfilediff now can be extra quiet, as I continue to try to track
2169           down a numeric difference on FreeBSD AMD boxes.
2170
2171       ENHANCEMENT
2172           dbcolmovingstats gave different test output (just reflecting
2173           rounding error) when stddev approaches zero.  We now detect and
2174           handle this case.  See
2175           <https://rt.cpan.org/Public/Bug/Display.html?id=101220> and thanks
2176           to H. Merijn Brand for the bug report.
2177
2178       BUG FIX
2179           Many, many spelling bugs found by H. Merijn Brand; thanks for the
2180           bug report.
2181
2182       INCOMPATIBLE CHANGE
2183           A number of programs had misspelled "separator" in
2184           "--fieldseparator" and "--columnseparator" options as "seperator".
2185           These are now correctly spelled.
2186
2187   2.56, 2015-02-03 fix against Getopt::Long-2.43's stricter error checking
2188       BUG FIX
2189           Internal argument parsing uses Getopt::Long, but mixed pass-through
2190           and <>.  Bug reported by Petr Pisar at
2191           <https://bugzilla.redhat.com/show_bug.cgi?id=1188538>.
2192
2193       BUG FIX
2194           Added missing BuildRequires for "XML::Simple".
2195
2196   2.57, 2015-04-29 Minor changes, with better performance from dbmultistats.
2197       BUG FIX
2198           dbfilecat now honors "--remove-inputs" (previously it didn't).
2199           This omission meant that dbmapreduce (and dbmultistats) would
2200           accumulate files in /tmp when running.  Bad news for inputs with 4M
2201           keys.
2202
2203       ENHANCEMENT
2204           dbmultistats should be faster with lots of small keys.  dbcolstats
2205           now supports "-k" to get some of the functionality of dbmultistats
2206           (if data is pre-sorted and median/quartiles are not required).
2207
2213   2.58, 2015-04-30 Bugfix in dbmerge
2214       BUG FIX
2215           Fixed a case where dbmerge suffered mojibake in endgame mode.  This
2216           bug surfaced when dbsort was applied to large files (big enough to
2217           require merging) with unicode in them; the symptom was something
2218           like:
2219             Wide character in print at /usr/lib64/perl5/IO/Handle.pm line
2220           420, <GEN12> line 111.
2221
2222   2.59, 2016-09-01 Collect a few small bug fixes and documentation
2223       improvements.
2224       BUG FIX
2225           More IO is explicitly marked UTF-8 to avoid Perl's tendency to
2226           mojibake on otherwise valid unicode input.  This change helps
2227           html_table_to_db.
2228
2229       ENHANCEMENT
2230           dbcolscorrelate now cross-references dbcolsregression.
2231
2232       ENHANCEMENT
2233           Documentation for dbrowdiff now clarifies that the default is
2234           baseline mode.
2235
2236       BUG FIX
2237           dbjoin now propagates "-T" into the sorting process (if it is
2238           required).  Thanks to Lan Wei for reporting this bug.
2239
2240   2.60, 2016-09-04 Adds support for hash joins.
2241       ENHANCEMENT
2242           dbjoin now supports hash joins with "-t lefthash" and "-t
2243           righthash".  Hash joins cache a table in memory, but do not require
2244           that the other table be sorted.  They are ideal when joining a
2245           large table against a small one.
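
               The idea behind a hash join can be sketched in a few lines of
               Python.  This is an illustration of the general technique, not
               dbjoin's Perl code; the function name hash_join and the
               dict-per-row representation are assumptions made for the
               example.

```python
def hash_join(small_rows, large_rows, key):
    """Inner-join two sequences of dict rows on the column `key`.

    The small table is cached in a dict keyed by the join column; the
    large table is then streamed once, in whatever order it arrives,
    so neither input has to be sorted.
    """
    cache = {}
    for row in small_rows:
        cache.setdefault(row[key], []).append(row)

    for row in large_rows:
        for match in cache.get(row[key], []):
            merged = dict(match)
            merged.update(row)
            yield merged


if __name__ == "__main__":
    small = [{"id": "a", "name": "x"}, {"id": "b", "name": "y"}]
    large = [{"id": "b", "v": 1}, {"id": "c", "v": 2}, {"id": "a", "v": 3}]
    for joined in hash_join(small, large, "id"):
        print(joined)
```

               Memory use grows with the cached table only, which is why the
               approach suits a large table joined against a small one.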
2246
2247   2.61, 2016-09-05 Support left and right outer joins.
2248       ENHANCEMENT
2249           dbjoin now handles left and right outer joins with "-t left" and
2250           "-t right".
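
               As a rough illustration of what a left outer join produces,
               the Python sketch below keeps every left row and pads the
               right-hand columns when no match exists.  It is not dbjoin's
               implementation: the function name, the dict-per-row layout,
               and the use of "-" as the fill value are assumptions for the
               example.  A right outer join is the same operation with the
               two inputs swapped.

```python
def left_outer_join(left_rows, right_rows, key, right_columns, empty="-"):
    """Keep every left row; fill `right_columns` with `empty` when unmatched."""
    by_key = {}
    for row in right_rows:
        by_key.setdefault(row[key], []).append(row)

    joined = []
    for row in left_rows:
        matches = by_key.get(row[key])
        if matches:
            for match in matches:
                joined.append({**match, **row})
        else:
            # No right-hand partner: emit the left row with placeholders.
            padding = {col: empty for col in right_columns}
            joined.append({**padding, **row})
    return joined


if __name__ == "__main__":
    left = [{"id": "a", "x": 1}, {"id": "b", "x": 2}]
    right = [{"id": "a", "y": 9}]
    for row in left_outer_join(left, right, "id", ["y"]):
        print(row)
```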
2251
2252       ENHANCEMENT
2253           dbjoin hash joins are now selected with "-m lefthash" and "-m
2254           righthash" (not the short-lived "-t righthash" option).
2255           (Technically this change is incompatible with Fsdb-2.60, but no one
2256           but me ever used that version.)
2257
2258   2.62, 2016-11-29 A new yaml_to_db and other minor improvements.
2259       ENHANCEMENT
2260           Documentation for xml_to_db now includes sample output.
2261
2262       NEW yaml_to_db converts a specific form of YAML to fsdb.
2263
2264       BUG FIX
2265           The test suite now uses "diff -c -b" rather than "diff -cb" to make
2266           OpenBSD-5.9 happier, I hope.
2267
2268       ENHANCEMENT
2269           Comments that log operations at the end of each file now do simple
2270           quoting of spaces.  (It is not guaranteed to be fully shell-
2271           compliant.)
2272
2273       ENHANCEMENT
2274           There is a new standard option, "--header", allowing one to specify
2275           an Fsdb header for inputs that lack it.  Currently it is supported
2276           by dbcoldefine, dbrowuniq, dbmapreduce, dbmultistats, dbsort,
2277           dbpipeline.
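
               The effect of supplying a header externally can be sketched as
               follows.  This Python reader is a simplified illustration, not
               Fsdb's parser: the name read_fsdb is hypothetical, fields are
               assumed to be whitespace-separated, and header options (such as
               a field-separator flag) are not handled.

```python
def read_fsdb(lines, header=None):
    """Yield dict rows from Fsdb-like text.

    If `header` is given (for example "#fsdb experiment duration") it is
    used as the schema and every non-comment line is treated as data;
    otherwise the first line must itself be a "#fsdb" header.
    """
    lines = iter(lines)
    if header is None:
        header = next(lines, "")
    fields = header.split()
    if not fields or fields[0] != "#fsdb":
        raise ValueError("missing #fsdb header")
    columns = fields[1:]
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        yield dict(zip(columns, line.split()))


if __name__ == "__main__":
    raw = ["ufs_mab_sys 37.2\n", "ufs_rcp_real 264.5\n"]
    for row in read_fsdb(raw, header="#fsdb experiment duration"):
        print(row)
```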
2278
2279       ENHANCEMENT
2280           dbfilepivot now allows the --possible-pivots option, and if it is
2281           provided processes the data in one pass.
2282
2283       ENHANCEMENT
2284           dbroweval logs are now quoted.
2285
2286   2.63, 2017-02-03 Re-add some features supposedly in 2.62 but not, and add
2287       more --header options.
2288       ENHANCEMENT
2289           The option -j is now a synonym for --parallelism.  (And several
2290           documentation bugs about this option are fixed.)
2291
2292       ENHANCEMENT
2293           Additional support for "--header" in dbcolmerge, dbcol, dbrow, and
2294           dbroweval.
2295
2296       BUG FIX
2297           Version 2.62 was supposed to have this improvement, but did not
2298           (and now does): dbfilepivot now allows the --possible-pivots
2299           option, and if it is provided processes the data in one pass.
2300
2301       BUG FIX
2302           Version 2.62 was supposed to have this improvement, but did not
2303           (and now does): dbroweval logs are now quoted.
2304
2305   2.64, 2017-11-20 several small bugfixes and enhancements
2306       BUG FIX
2307           In dbroweval, the "next row" option previously did not correctly
2308           set up "_last_fieldname".  It now does.
2309
2310       ENHANCEMENT
2311           The csv_to_db converter now has an optional "-F x" option to set
2312           the field separator.
2313
2314       ENHANCEMENT
2315           Finally dbcolsplittocols has a "--header" option, and a new "-N"
2316           option to give the list of resulting output columns.
2317
2318       INCOMPATIBLE CHANGE
2319           Now dbcolstats and dbmultistats produce no output (but a schema)
2320           when given no input but a schema.  Previously they gave a null row
2321           of output.  The "--output-on-no-input" and
2322           "--no-output-on-no-input" options can control this behavior.
2323
2324   2.65, 2018-02-16 Minor release, bug fix and -F option.
2325       ENHANCEMENT
2326           dbmultistats and dbmapreduce now both take a "-F x" option to set
2327           the field separator.
2328
2329       BUG FIX
2330           Fixed missing "use Carp" in dbcolstats.  Also went back and cleaned
2331           up all uses of "croak()".  Thanks to Zefram for the bug report.
2332
2333   2.66, 2018-12-20 Critical bug fix in dbjoin.
2334       BUG FIX
2335           Removed old tests from MANIFEST.  (Thanks to Hang Guo for reporting
2336           this bug.)
2337
2338       IMPROVEMENT
2339           Errors for non-existing input files now include the bad filename
2340           (before: "cannot setup filehandle", now: "cannot open input: cannot
2341           open TEST/bad_filename").
2342
2343       BUG FIX
2344           Hash joins with three identical rows were failing with the
2345           assertion failure "internal error: confused about overflow" due to
2346           a now-fixed bug.
2347
2348   2.67, 2019-07-10 add support for reading and writing hdfs
2349       IMPROVEMENT
2350           dbformmail now has an "mh" mechanism that writes messages to
2351           individual files (an mh-style mailbox).
2352
2353       BUG FIX
2354           dbrow failed to include the Carp library, leading to failures on
2355           croak.
2356
2357       BUG FIX
2358           Fixed the dbjoin error message for an unsorted right stream, which
2359           was incorrect (it said left).
2360
2361       IMPROVEMENT
2362           All Fsdb programs can now read from and write to HDFS, when files
2363           that start with "hdfs:" are given to -i and -o options.
2364
2365   2.68, 2019-09-19 All programs now support automatic decompression based on
2366       file extension.
2367       IMPROVEMENT
2368           The omitted-possible-error test case for dbfilepivot now has an
2369           alternative output that I saw on some BSD-running systems (thanks
2370           to CPAN).
2371
2372       IMPROVEMENT
2373           dbmerge and dbmerge2 now support "--header".  dbmerge2 now gives
2374           better error messages when presented the wrong number of inputs.
2375
2376       BUG FIX
2377           dbsort now works with "--header" even when the file is big (due to
2378           fixes to dbmerge).
2379
2380       IMPROVEMENT
2381           csv_to_db now processes data with the "binary" option, allowing it
2382           to handle newlines embedded in quoted fields.
2383
2384       IMPROVEMENT
2385           All programs now transparently decompress input files when the
2386           filename given as an input argument ends with a standard
2387           extension (.gz, .bz2, or .xz).
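
               Choosing a decompressor from the file extension can be sketched
               as below.  This is a Python illustration of the idea only, not
               how Fsdb's Perl I/O layer does it; the helper name open_input is
               hypothetical.

```python
import bz2
import gzip
import lzma

# Extension -> opener; each accepts (path, mode, encoding=...) in text mode.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_input(path):
    """Open `path` for text reading, decompressing by file extension.

    Files ending in .gz, .bz2, or .xz are decompressed on the fly;
    anything else is opened as plain text.
    """
    for ext, opener in _OPENERS.items():
        if path.endswith(ext):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

               A caller reads from the returned handle the same way in either
               case, for example "with open_input(name) as f: ...".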
2388
2389   2.69, 2019-11-22 a small bugfix in dbcolstats
2390       BUG FIX
2391           Filled in the test case for autodecompress, which was missing
2392           for the 2.68 release.
2393
2394       ENHANCEMENT
2395           The groff program is required for build, and the "Makefile.PL"
2396           fails if groff is missing at build time.  Thanks to Chris Williams
2397           for suggesting this check, and the CPAN auto-building system for
2398           trying many platforms.
2399
2400       BUG FIX
2401           The dbcolstats program had numerical instability that sometimes
2402           resulted in failing with a square-root of a negative number when
2403           many values varied right at the edge of floating-point precision.
2404           We now detect and report that case as 0 stddev.  Thanks to Hang Guo
2405           for providing a test case.
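
               The kind of failure described here, and the clamp-to-zero
               remedy, can be sketched in Python.  The one-pass
               sum-of-squares formula below is an assumption chosen to show
               how rounding can push a variance slightly negative; it is not
               taken from dbcolstats' source, and sample_stddev is a
               hypothetical name.

```python
import math


def sample_stddev(values):
    """Sample standard deviation from running sums, clamped at zero.

    The one-pass formula (sum of squares minus squared sum over n) can
    come out slightly negative from rounding when all values are nearly
    equal; treating that as zero avoids a square root of a negative.
    """
    n = len(values)
    if n < 2:
        return 0.0
    total = sum(values)
    total_sq = sum(v * v for v in values)
    variance = (total_sq - total * total / n) / (n - 1)
    return math.sqrt(max(variance, 0.0))


if __name__ == "__main__":
    print(sample_stddev([37.2, 37.3]))
    print(sample_stddev([0.1] * 1000))
```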
2406
2407   2.70, 2020-11-12 Some small quality-of-life enhancements and corner-case
2408       bugfixes.
2409       ENHANCEMENT
2410           dbcol can now take an option "-a" to include all columns, allowing
2411           reordering of certain columns while passing the rest through.
2412
2413       ENHANCEMENT
2414           dbrowuniq and dbmerge now buffer comments in a way that the last
2415           row of data output is no longer in the last block of comments.
2416           (The data is identical, but for humans looking at output, this
2417           change makes it less likely to lose the last row.)
2418
2419       BUG FIX
2420           dbmultistats and dbpipeline documentation now indicates that they
2421           support "--header" (something they have done since version 2.62 of
2422           2016-11-29, but which is only now documented).
2423
2424       ENHANCEMENT
2425           dbcolcreate now supports "--header".
2426
2427       BUG FIX
2428           Fixed several spelling errors in deprecated programs and removed
2429           information about the no-longer existing FreeBSD and MacOS ports.
2430           Thanks to Calvin Ardi for the patch.
2431
2432       BUG FIX
2433           dbmerge now handles --xargs when only one file is provided (and
2434           passes the file through unchanged).  It also throws a clean error
2435           with --xargs if zero files are provided.  (To support dbmerge,
2436           dbcol now has an internal "--saveoutput" option.)  Thanks to Yuri
2437           Pradkin for reporting the unhandled corner-case.
2438

AUTHOR

2440       John Heidemann, "johnh@isi.edu"
2441
2442       See "Contributors" for the many people who have contributed bug reports
2443       and fixes.
2444

COPYRIGHT

2446       Fsdb is Copyright (C) 1991-2020 by John Heidemann <johnh@isi.edu>.
2447
2448       This program is free software; you can redistribute it and/or modify it
2449       under the terms of version 2 of the GNU General Public License as
2450       published by the Free Software Foundation.
2451
2452       This program is distributed in the hope that it will be useful, but
2453       WITHOUT ANY WARRANTY; without even the implied warranty of
2454       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
2455       General Public License for more details.
2456
2457       You should have received a copy of the GNU General Public License along
2458       with this program; if not, write to the Free Software Foundation, Inc.,
2459       675 Mass Ave, Cambridge, MA 02139, USA.
2460
2461       A copy of the GNU General Public License can be found in the file
2462       ``COPYING''.
2463

COMMENTS and BUG REPORTS

2465       Any comments about these programs should be sent to John Heidemann
2466       "johnh@isi.edu".
2467
2468
2469
2470perl v5.32.0                      2020-11-16                           Fsdb(3)