Fsdb(3)               User Contributed Perl Documentation              Fsdb(3)

NAME

6       Fsdb - a flat-text database for shell scripting
7

SYNOPSIS

       Fsdb, the flatfile streaming database, is a package of commands for
       manipulating flat-ASCII databases from shell scripts.  Fsdb is useful
       for processing medium amounts of data (with very little data you'd do
       it by hand, with megabytes you might want a real database).  Fsdb was
       known as Jdb from 1991 to Oct. 2008.
14
15       Fsdb is very good at doing things like:
16
17       ·   extracting measurements from experimental output
18
19       ·   examining data to address different hypotheses
20
21       ·   joining data from different experiments
22
23       ·   eliminating/detecting outliers
24
25       ·   computing statistics on data (mean, confidence intervals,
26           correlations, histograms)
27
28       ·   reformatting data for graphing programs
29
30       Fsdb is built around the idea of a flat text file as a database.  Fsdb
31       files (by convention, with the extension .fsdb), have a header
32       documenting the schema (what the columns mean), and then each line
33       represents a database record (or row).
34
35       For example:
36
37               #fsdb experiment duration
38               ufs_mab_sys 37.2
39               ufs_mab_sys 37.3
40               ufs_rcp_real 264.5
41               ufs_rcp_real 277.9
42
       This is a simple file with four experiments (the rows), each with a
       description and run time in the first and second columns.
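
       As a rough illustration of how a header line lets a tool find a
       column by name, the following plain-shell sketch (using awk, not Fsdb
       itself) looks up a column's position from the "#fsdb" line.  The
       helper name col_by_name is made up for this example, and header
       options such as "-F" are ignored.

```shell
# Sample data copied from the example above.
data='#fsdb experiment duration
ufs_mab_sys 37.2
ufs_mab_sys 37.3
ufs_rcp_real 264.5
ufs_rcp_real 277.9'

# col_by_name NAME: print the values of column NAME (hypothetical helper).
col_by_name() {
    awk -v want="$1" '
        NR == 1 {                       # header: "#fsdb name1 name2 ..."
            for (i = 2; i <= NF; i++)
                if ($i == want) idx = i - 1
            next
        }
        /^#/ { next }                   # skip comment lines
        idx  { print $idx }             # data row: print the wanted field
    '
}

durations=$(printf '%s\n' "$data" | col_by_name duration)
printf '%s\n' "$durations"
```

       The real tools do much more (field separators, list format, error
       checking); this only shows the name-to-position idea.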
46
       Rather than hand-code scripts to do each special case, Fsdb provides
       higher-level functions.  Although it's often easy to throw together a
       custom script to do any single task, I believe that there are several
       advantages to using Fsdb:
51
52       ·   these programs provide a higher level interface than plain Perl, so
53
54           **  Fewer lines of simpler code:
55
56                   dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
57
58               Picks out just one type of experiment and computes statistics
59               on it, rather than:
60
61                   while (<>) { split; $sum+=$F[1]; $ss+=$F[1]**2; $n++; }
62                   $mean = $sum / $n; $std_dev = ...
63
64               in dozens of places.
65
66       ·   the library uses names for columns, so
67
68           **  No more $F[1], use "_duration".
69
70           **  New or different order columns?  No changes to your scripts!
71
72           Thus if your experiment gets more complicated with a size
73           parameter, so your log changes to:
74
75                   #fsdb experiment size duration
76                   ufs_mab_sys 1024 37.2
77                   ufs_mab_sys 1024 37.3
78                   ufs_rcp_real 1024 264.5
79                   ufs_rcp_real 1024 277.9
80                   ufs_mab_sys 2048 45.3
81                   ufs_mab_sys 2048 44.2
82
83           Then the previous scripts still work, even though duration is now
84           the third column, not the second.
85
       ·   A series of actions is self-documenting (each program records what
           it does).
88
89           **  No more wondering what hacks were used to compute the final
90               data, just look at the comments at the end of the output.
91
92           For example, the commands
93
94               dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
95
96           add to the end of the output the lines
97               #    | dbrow _experiment eq "ufs_mab_sys"
98               #    | dbcolstats duration
99
100       ·   The library is mature, supporting large datasets (more than 100GB),
101           corner cases, error handling, backed by an automated test suite.
102
103           **  No more puzzling about bad output because your custom script
104               skimped on error checking.
105
106           **  No more memory thrashing when you try to sort ten million
107               records.
108
109       ·   Fsdb-2.x supports Perl scripting (in addition to shell scripting),
110           with libraries to do Fsdb input and output, and easy support for
111           pipelines.  The shell script
112
113               dbcol name test1 | dbroweval '_test1 += 5;'
114
           can be written in Perl as:
116
117               dbpipeline(dbcol(qw(name test1)), dbroweval('_test1 += 5;'));
118
119       (The disadvantage is that you need to learn what functions Fsdb
120       provides.)
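
       To make the column-name and audit-trail points above concrete, here
       is a small sketch in plain shell and awk (not Fsdb code).  The helper
       mean_of is hypothetical: it finds a column by its header name,
       averages it, and appends a comment recording what it did.  The same
       call works on both layouts shown earlier.

```shell
# Hypothetical helper: mean_of NAME averages column NAME and logs itself.
mean_of() {
    awk -v want="$1" '
        NR == 1 {                       # find the column by name
            for (i = 2; i <= NF; i++)
                if ($i == want) idx = i - 1
            next
        }
        /^#/ { next }                   # ignore comment lines
        idx  { sum += $idx; n++ }
        END {
            print "#fsdb mean"
            if (n) printf "%.2f\n", sum / n
            print "#  | mean_of " want  # trailing audit comment
        }
    '
}

old_layout='#fsdb experiment duration
ufs_mab_sys 37.2
ufs_mab_sys 37.3'

new_layout='#fsdb experiment size duration
ufs_mab_sys 1024 37.2
ufs_mab_sys 1024 37.3'

old_result=$(printf '%s\n' "$old_layout" | mean_of duration)
new_result=$(printf '%s\n' "$new_layout" | mean_of duration)
printf '%s\n' "$new_result"
```

       Both results are identical even though duration moved from the second
       column to the third, which is the point of naming columns.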
121
122       Fsdb is built on flat-ASCII databases.  By storing data in simple text
123       files and processing it with pipelines it is easy to experiment (in the
124       shell) and look at the output.  To the best of my knowledge, the
125       original implementation of this idea was "/rdb", a commercial product
126       described in the book UNIX relational database management: application
127       development in the UNIX environment by Rod Manis, Evan Schaffer, and
128       Robert Jorgensen (and also at the web page <http://www.rdb.com/>).
129       Fsdb is an incompatible re-implementation of their idea without any
130       accelerated indexing or forms support.  (But it's free, and probably
131       has better statistics!).
132
133       Fsdb-2.x will exploit multiple processors or cores, and provides Perl-
134       level support for input, output, and threaded-pipelines.  (As of
135       Fsdb-2.44 it no longer uses Perl threading, just processes, since they
136       are faster.)
137
138       Installation instructions follow at the end of this document.  Fsdb-2.x
139       requires Perl 5.8 to run.  All commands have manual pages and provide
140       usage with the "--help" option.  All commands are backed by an
141       automated test suite.
142
143       The most recent version of Fsdb is available on the web at
144       <http://www.isi.edu/~johnh/SOFTWARE/FSDB/index.html>.
145

WHAT'S NEW

147   2.66, 2018-12-20 Critical bug fix in dbjoin.
148       BUG FIX
149           Removed old tests from MANIFEST.  (Thanks to Hang Guo for reporting
150           this bug.)
151
152       IMPROVEMENT
153           Errors for non-existing input files now include the bad filename
154           (before: "cannot setup filehandle", now: "cannot open input: cannot
155           open TEST/bad_filename").
156
157       BUG FIX
158           Hash joins with three identical rows were failing with the
159           assertion failure "internal error: confused about overflow" due to
160           a now-fixed bug.
161

README CONTENTS

163       executive summary
164       what's new
165       README CONTENTS
166       installation
167       basic data format
168       basic data manipulation
169       list of commands
170       another example
171       a gradebook example
172       a password example
173       history
174       related work
175       release notes
176       copyright
177       comments
178

INSTALLATION

       Fsdb now uses the standard Perl build and installation from
       ExtUtils::MakeMaker(3), so the quick answer to installation is to type:
182
183           perl Makefile.PL
184           make
185           make test
186           make install
187
188       Or, if you want to install it somewhere else, change the first line to
189
190           perl Makefile.PL PREFIX=$HOME
191
       and it will go in your home directory's bin, etc.  (See
       ExtUtils::MakeMaker(3) for more details.)
194
195       Fsdb requires perl 5.8 or later.
196
197       A test-suite is available, run it with
198
199           make test
200
       A FreeBSD port of Fsdb is available; see
       <http://www.freshports.org/databases/fsdb/>.
203
204       A Fink (MacOS X) port is available, see
205       <http://pdb.finkproject.org/pdb/package.php/fsdb>.  (Thanks to Lars
206       Eggert for maintaining this port.)
207

BASIC DATA FORMAT

       These programs are based on the idea of storing data in simple ASCII
       files.  A database is a file with one header line and then data or
       comment lines.  For example:
212
213               #fsdb account passwd uid gid fullname homedir shell
214               johnh * 2274 134 John_Heidemann /home/johnh /bin/bash
215               greg * 2275 134 Greg_Johnson /home/greg /bin/bash
216               root * 0 0 Root /root /bin/bash
217               # this is a simple database
218
       The header line must be first and begins with "#fsdb".  There are rows
       (records) and columns (fields), just like in a normal database.
       Comment lines begin with "#".  Column names are any string not
       containing spaces or single quotes (although it is prudent to keep
       them alphanumeric with underscore).
224
225       By default, columns are delimited by whitespace.  With this default
226       configuration, the contents of a field cannot contain whitespace.
227       However, this limitation can be relaxed by changing the field separator
228       as described below.
229
230       The big advantage of simple flat-text databases is that it is usually
231       easy to massage data into this format, and it's reasonably easy to take
232       data out of this format into other (text-based) programs, like gnuplot,
233       jgraph, and LaTeX.  Think Unix.  Think pipes.  (Or even output to Excel
234       and HTML if you prefer.)
235
236       Since no-whitespace in columns was a problem for some applications,
237       there's an option which relaxes this rule.  You can specify the field
238       separator in the table header with "-F x" where "x" is a code for the
       new field separator.  A full list of codes is at dbfilealter(1), but
       two common special values are "-F t", which is a separator of a single
       tab character, and "-F S", a separator of two spaces.  Both allow
       (single) spaces in fields.  An example:
243
244               #fsdb -F S account passwd uid gid fullname homedir shell
245               johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
246               greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash
247               root  *  0  0  Root  /root  /bin/bash
248               # this is a simple database
249
250       See dbfilealter(1) for more details.  Regardless of what the column
251       separator is for the body of the data, it's always whitespace in the
252       header.
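
       The rule that the header is always whitespace-separated while the
       body follows the "-F" code can be sketched as follows.  This is an
       illustrative awk fragment, not Fsdb's parser, and it only understands
       the "S" (two spaces) code; anything else is left at awk's default.

```shell
passwd_db='#fsdb -F S account passwd uid gid fullname homedir shell
johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash'

fullnames=$(printf '%s\n' "$passwd_db" | awk '
    NR == 1 {
        # The header itself is always split on whitespace.
        n = split($0, h, " ")
        first = 2                       # h[1] is "#fsdb"
        if (h[2] == "-F") {             # optional separator code
            if (h[3] == "S") FS = "  "  # body fields: two spaces
            first = 4
        }
        for (i = first; i <= n; i++) pos[h[i]] = i - first + 1
        next
    }
    /^#/ { next }                       # skip comments
    { print $(pos["fullname"]) }        # may contain single spaces
')
printf '%s\n' "$fullnames"
```

       Changing FS inside the header rule only affects later records, so
       the header is still parsed with the default whitespace splitting.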
253
       There's also a third format: a "list".  Because it's often hard to see
       which column is which past the first two, in list format each "column"
       is on a separate line.  The programs dblistize and dbcolize convert to
       and from this format, and all programs work with either format.  The
       command
258
259           dbfilealter -R C  < DATA/passwd.fsdb
260
261       outputs:
262
263               #fsdb -R C account passwd uid gid fullname homedir shell
264               account:  johnh
265               passwd:   *
266               uid:      2274
267               gid:      134
268               fullname: John_Heidemann
269               homedir:  /home/johnh
270               shell:    /bin/bash
271
272               account:  greg
273               passwd:   *
274               uid:      2275
275               gid:      134
276               fullname: Greg_Johnson
277               homedir:  /home/greg
278               shell:    /bin/bash
279
280               account:  root
281               passwd:   *
282               uid:      0
283               gid:      0
284               fullname: Root
285               homedir:  /root
286               shell:    /bin/bash
287
288               # this is a simple database
289               #  | dblistize
290
291       See dbfilealter(1) for more details.
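
       For intuition only, the row-to-list conversion can be approximated in
       a few lines of awk.  The helper to_list is hypothetical; unlike the
       real tools it does not align the values, and it assumes the default
       whitespace separator.

```shell
# to_list: rewrite whitespace-separated rows as "name: value" lines.
to_list() {
    awk '
        NR == 1 {
            ncols = NF - 1
            for (i = 2; i <= NF; i++) name[i - 1] = $i
            printf "#fsdb -R C"         # -R C marks list format
            for (i = 1; i <= ncols; i++) printf " %s", name[i]
            printf "\n"
            next
        }
        /^#/ { print; next }            # keep comments as they are
        {
            for (i = 1; i <= ncols; i++) printf "%s: %s\n", name[i], $i
            print ""                    # blank line ends each record
        }
    '
}

listed=$(printf '%s\n' '#fsdb account uid' 'johnh 2274' 'root 0' | to_list)
printf '%s\n' "$listed"
```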
292

BASIC DATA MANIPULATION

       A number of programs exist to manipulate databases.  Complex functions
       can be made by stringing together commands with shell pipelines.  For
       example, to print the home directories of everyone with "John" in
       their names, you would do:
298
299               cat DATA/passwd | dbrow '_fullname =~ /John/' | dbcol homedir
300
301       The output might be:
302
303               #fsdb homedir
304               /home/johnh
305               /home/greg
306               # this is a simple database
307               #  | dbrow _fullname =~ /John/
308               #  | dbcol homedir
309
310       (Notice that comments are appended to the output listing each command,
311       providing an automatic audit log.)
312
313       In addition to typical database functions (select, join, etc.) there
314       are also a number of statistical functions.
315
316       The real power of Fsdb is that one can apply arbitrary code to rows to
317       do powerful things.
318
319               cat DATA/passwd | dbroweval '_fullname =~ s/(\w+)_(\w+)/$2,_$1/'
320
321       converts "John_Heidemann" into "Heidemann,_John".  Not too much more
322       work could split fullname into firstname and lastname fields.
323

TALKING ABOUT COLUMNS

325       An advantage of Fsdb is that you can talk about columns by name
326       (symbolically) rather than simply by their positions.  So in the above
327       example, "dbcol homedir" pulled out the home directory column, and
328       "dbrow '_fullname =~ /John/'" matched against column fullname.
329
330       In general, you can use the name of the column listed on the "#fsdb"
331       line to identify it in most programs, and _name to identify it in code.
332
333       Some alternatives for flexibility:
334
335       ·   Numeric values identify columns positionally, numbering from 0.  So
336           0 or _0 is the first column, 1 is the second, etc.
337
       ·   In code, _last_columnname gets the value from columnname's previous
           row.
340
341       See dbroweval(1) for more details about writing code.
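
       The idea of referring to the previous row's value can be pictured
       with this awk sketch, which adds a "delta" column holding the change
       since the prior row.  It is not Fsdb code; in particular, using 0 for
       the first row is an assumption of this example, not documented Fsdb
       behavior.

```shell
samples='#fsdb time
10
13
19'

deltas=$(printf '%s\n' "$samples" | awk '
    NR == 1 { print $0 " delta"; next } # extend the header
    /^#/    { print; next }
    {
        d = (seen ? $1 - last : 0)      # first row has no predecessor
        print $1, d
        last = $1; seen = 1             # remember this row for the next
    }
')
printf '%s\n' "$deltas"
```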
342

LIST OF COMMANDS

344       Enough said.  I'll summarize the commands, and then you can experiment.
345       For a detailed description of each command, see a summary by running it
346       with the argument "--help" (or "-?" if you prefer.)  Full manual pages
347       can be found by running the command with the argument "--man", or
348       running the Unix command "man dbcol" or whatever program you want.
349
350   TABLE CREATION
351       dbcolcreate
352           add columns to a database
353
354       dbcoldefine
355           set the column headings for a non-Fsdb file
356
357   TABLE MANIPULATION
358       dbcol
359           select columns from a table
360
361       dbrow
362           select rows from a table
363
364       dbsort
365           sort rows based on a set of columns
366
367       dbjoin
368           compute the natural join of two tables
369
370       dbcolrename
371           rename a column
372
373       dbcolmerge
374           merge two columns into one
375
376       dbcolsplittocols
377           split one column into two or more columns
378
379       dbcolsplittorows
380           split one column into multiple rows
381
382       dbfilepivot
383           "pivots" a file, converting multiple rows corresponding to the same
384           entity into a single row with multiple columns.
385
386       dbfilevalidate
387           check that db file doesn't have some common errors
388
389   COMPUTATION AND STATISTICS
390       dbcolstats
391           compute statistics over a column (mean,etc.,optionally median)
392
393       dbmultistats
394           group rows by some key value, then compute stats (mean, etc.) over
395           each group (equivalent to dbmapreduce with dbcolstats as the
396           reducer)
397
398       dbmapreduce
399           group rows (map) and then apply an arbitrary function to each group
400           (reduce)
401
402       dbrvstatdiff
           compare two sample distributions (mean/conf interval/T-test)
404
405       dbcolmovingstats
           compute moving statistics over a column of data
407
408       dbcolstatscores
409           compute Z-scores and T-scores over one column of data
410
411       dbcolpercentile
412           compute the rank or percentile of a column
413
414       dbcolhisto
415           compute histograms over a column of data
416
417       dbcolscorrelate
418           compute the coefficient of correlation over several columns
419
420       dbcolsregression
421           compute linear regression and correlation for two columns
422
423       dbrowaccumulate
424           compute a running sum over a column of data
425
426       dbrowcount
427           count the number of rows (a subset of dbstats)
428
429       dbrowdiff
           compute differences in a column between rows of a table
431
432       dbrowenumerate
433           number each row
434
435       dbroweval
436           run arbitrary Perl code on each row
437
438       dbrowuniq
439           count/eliminate identical rows (like Unix uniq(1))
440
441       dbfilediff
442           compare fields on rows of a file (something like Unix diff(1))
443
444   OUTPUT CONTROL
445       dbcolneaten
446           pretty-print columns
447
448       dbfilealter
449           convert between column or list format, or change the column
450           separator
451
452       dbfilestripcomments
453           remove comments from a table
454
455       dbformmail
456           generate a script that sends form mail based on each row
457
458   CONVERSIONS
459       (These programs convert data into fsdb.  See their web pages for
460       details.)
461
462       cgi_to_db
463           <http://stein.cshl.org/boulder/>
464
465       combined_log_format_to_db
466           <http://httpd.apache.org/docs/2.0/logs.html>
467
468       html_table_to_db
469           HTML tables to fsdb (assuming they're reasonably formatted).
470
471       kitrace_to_db
472           <http://ficus-www.cs.ucla.edu/ficus-members/geoff/kitrace.html>
473
474       ns_to_db
475           <http://mash-www.cs.berkeley.edu/ns/>
476
477       sqlselect_to_db
478           the output of SQL SELECT tables to db
479
480       tabdelim_to_db
481           spreadsheet tab-delimited files to db
482
483       tcpdump_to_db
484           (see man tcpdump(8) on any reasonable system)
485
486       xml_to_db
487           XML input to fsdb, assuming they're very regular
488
489       (And out of fsdb:)
490
491       db_to_csv
492           Comma-separated-value format from fsdb.
493
494       db_to_html_table
495           simple conversion of Fsdb to html tables
496
497   STANDARD OPTIONS
498       Many programs have common options:
499
500       -? or --help
501           Show basic usage.
502
       -N or --new-name
504           When a command creates a new column like dbrowaccumulate's "accum",
505           this option lets one override the default name of that new column.
506
507       -T TmpDir
           Where to put temporary files.  Also uses the environment variable
           TMPDIR, if -T is not specified.  Default is /tmp.
510
513       -c FRACTION or --confidence FRACTION
514           Specify confidence interval FRACTION (dbcolstats, dbmultistats,
515           etc.)
516
517       -C S or "--element-separator S"
518           Specify column separator S (dbcolsplittocols, dbcolmerge).
519
520       -d or --debug
521           Enable debugging (may be repeated for greater effect in some
522           cases).
523
524       -a or --include-non-numeric
525           Compute stats over all data (treating non-numbers as zeros).  (By
526           default, things that can't be treated as numbers are ignored for
527           stats purposes)
528
529       -S or --pre-sorted
530           Assume the data is pre-sorted.  May be repeated to disable
531           verification (saving a small amount of work).
532
533       -e E or --empty E
534           give value E as the value for empty (null) records
535
536       -i I or --input I
537           Input data from file I.
538
539       -o O or --output O
540           Write data out to file O.
541
542       --header H
           Use H as the full Fsdb header, rather than reading a header from
           the input.  This option is particularly useful when using Fsdb
           under Hadoop, where split files don't have headers.
546
       --nolog
548           Skip logging the program in a trailing comment.
549
       When giving Perl code (in dbrow and dbroweval), column names can be
       embedded if preceded by underscores.  (Look at dbrow(1) or
       dbroweval(1) for examples.)
553
554       Most programs run in constant memory and use temporary files if
555       necessary.  Exceptions are dbcolneaten, dbcolpercentile, dbmapreduce,
556       dbmultistats, dbrowsplituniq.
557

ANOTHER EXAMPLE

       Take the raw data in "DATA/http_bandwidth", put a header on it
       ("dbcoldefine size bw"), take statistics of each category
       ("dbmultistats -k size bw"), pick out the relevant fields ("dbcol size
       mean stddev pct_rsd"), and you get:
563
564               #fsdb size mean stddev pct_rsd
565               1024    1.4962e+06      2.8497e+05      19.047
566               10240   5.0286e+06      6.0103e+05      11.952
567               102400  4.9216e+06      3.0939e+05      6.2863
568               #  | dbcoldefine size bw
569               #  | /home/johnh/BIN/DB/dbmultistats -k size bw
570               #  | /home/johnh/BIN/DB/dbcol size mean stddev pct_rsd
571
572       (The whole command was:
573
               cat DATA/http_bandwidth |
               dbcoldefine size bw |
               dbmultistats -k size bw |
               dbcol size mean stddev pct_rsd
578
579       all on one line.)
580
581       Then post-process them to get rid of the exponential notation by adding
582       this to the end of the pipeline:
583
584           dbroweval '_mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev);'
585
586       (Actually, this step is no longer required since dbcolstats now uses a
587       different default format.)
588
589       giving:
590
591               #fsdb      size    mean    stddev  pct_rsd
592               1024     1496200          284970        19.047
593               10240    5028600          601030        11.952
594               102400   4921600          309390        6.2863
595               #  | dbcoldefine size bw
596               #  | dbmultistats -k size bw
597               #  | dbcol size mean stddev pct_rsd
598               #  | dbroweval   { _mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev); }
599
600       In a few lines, raw data is transformed to processed output.
601
602       Suppose you expect there is an odd distribution of results of one
603       datapoint.  Fsdb can easily produce a CDF (cumulative distribution
604       function) of the data, suitable for graphing:
605
606           cat DB/DATA/http_bandwidth | \
607               dbcoldefine size bw | \
608               dbrow '_size == 102400' | \
609               dbcol bw | \
610               dbsort -n bw | \
611               dbrowenumerate | \
612               dbcolpercentile count | \
613               dbcol bw percentile | \
614               xgraph
615
       The steps, roughly:

       1.  get the raw input data and turn it into fsdb format,

       2.  pick out just the relevant column (for efficiency) and sort it,

       3.  for each data point, assign a CDF percentage to it,

       4.  pick out the two columns to graph and show them.
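
       The sort-then-rank idea behind these steps can be tried without Fsdb
       using standard tools.  In this sketch the sample values are made up,
       and the percentile is simply rank divided by count, expressed as a
       fraction; the exact convention used by dbcolpercentile may differ.

```shell
bw='#fsdb bw
40
10
30
20'

cdf=$(printf '%s\n' "$bw" | grep -v '^#' | sort -n | awk '
    { v[NR] = $1 }                      # values arrive in sorted order
    END {
        print "#fsdb bw percentile"
        for (i = 1; i <= NR; i++)
            printf "%s %.2f\n", v[i], i / NR   # fraction of points <= v[i]
    }
')
printf '%s\n' "$cdf"
```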
620

A GRADEBOOK EXAMPLE

622       The first commercial program I wrote was a gradebook, so here's how to
623       do it with Fsdb.
624
625       Format your data like DATA/grades.
626
627               #fsdb name email id test1
628               a a@ucla.example.edu 1 80
629               b b@usc.example.edu 2 70
630               c c@isi.example.edu 3 65
631               d d@lmu.example.edu 4 90
632               e e@caltech.example.edu 5 70
633               f f@oxy.example.edu 6 90
634
635       Or if your students have spaces in their names, use "-F S" and two
636       spaces to separate each column:
637
               #fsdb -F S name email id test1
               alfred aho  a@ucla.example.edu  1  80
               butler lampson  b@usc.example.edu  2  70
               david clark  c@isi.example.edu  3  65
               constantine drovolis  d@lmu.example.edu  4  90
               deborah estrin  e@caltech.example.edu  5  70
               sally floyd  f@oxy.example.edu  6  90
645
646       To compute statistics on an exam, do
647
               cat DATA/grades | dbcolstats test1 | dblistize
649
650       giving
651
652               #fsdb -R C  ...
653               mean:        77.5
654               stddev:      10.84
655               pct_rsd:     13.987
656               conf_range:  11.377
657               conf_low:    66.123
658               conf_high:   88.877
659               conf_pct:    0.95
660               sum:         465
661               sum_squared: 36625
662               min:         65
663               max:         90
664               n:           6
665               ...
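
       Several of these figures can be cross-checked by hand.  The awk
       sketch below recomputes them from the six test1 scores using the
       usual sample standard deviation formula; it is an independent check,
       not how dbcolstats is implemented.

```shell
scores='80 70 65 90 70 90'              # the test1 column from DATA/grades

stats=$(printf '%s\n' $scores | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
        mean = sum / n
        sd = sqrt((sumsq - sum * sum / n) / (n - 1))  # sample std. dev.
        printf "mean: %g\n", mean
        printf "stddev: %.2f\n", sd
        printf "pct_rsd: %.3f\n", 100 * sd / mean
        printf "sum: %d\n", sum
        printf "sum_squared: %d\n", sumsq
        printf "n: %d\n", n
    }
')
printf '%s\n' "$stats"
```

       The results agree with the mean, stddev, pct_rsd, sum, sum_squared,
       and n lines shown above.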
666
667       To do a histogram:
668
669               cat DATA/grades | dbcolhisto -n 5 -g test1
670
671       giving
672
673               #fsdb low histogram
674               65      *
675               70      **
676               75
677               80      *
678               85
679               90      **
680               #  | /home/johnh/BIN/DB/dbhistogram -n 5 -g test1
681
682       Now you want to send out grades to the students by e-mail.  Create a
683       form-letter (in the file test1.txt):
684
685               To: _email (_name)
686               From: J. Random Professor <jrp@usc.example.edu>
687               Subject: test1 scores
688
689               _name, your score on test1 was _test1.
690               86+   A
691               75-85 B
692               70-74 C
693               0-69  F
694
695       Generate the shell script that will send the mail out:
696
697               cat DATA/grades | dbformmail test1.txt > test1.sh
698
699       And run it:
700
701               sh <test1.sh
702
703       The last two steps can be combined:
704
705               cat DATA/grades | dbformmail test1.txt | sh
706
707       but I like to keep a copy of exactly what I send.
708
709       At the end of the semester you'll want to compute grade totals and
710       assign letter grades.  Both fall out of dbroweval.  For example, to
711       compute weighted total grades with a 40% midterm/60% final where the
712       midterm is 84 possible points and the final 100:
713
               dbcol -rv total |
               dbcolcreate -e - total |
               dbroweval '
                       _total = .40 * _midterm/84.0 + .60 * _final/100.0;
                       _total = sprintf("%4.2f", _total);
                       if (_final eq "-" || ( _name =~ /^_/)) { _total = "-"; };' |
               dbcolneaten
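
       To see what the weighting formula produces, here is the same
       arithmetic in plain awk on two made-up students.  Column positions
       are hard-coded (midterm second, final third) purely for brevity, which
       is exactly what the Fsdb version avoids.

```shell
grades='#fsdb name midterm final
a 63 80
b 84 -'

totals=$(printf '%s\n' "$grades" | awk '
    NR == 1 { print $0 " total"; next } # extend the header
    /^#/    { print; next }
    {
        if ($3 == "-" || $1 ~ /^_/)
            total = "-"                 # no final score: leave total empty
        else
            total = sprintf("%4.2f", 0.40 * $2 / 84.0 + 0.60 * $3 / 100.0)
        print $0, total
    }
')
printf '%s\n' "$totals"
```

       Student a scores 63/84 on the midterm and 80/100 on the final, so the
       total is 0.40 * 0.75 + 0.60 * 0.80 = 0.78.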
721
722       If you got the data originally from a spreadsheet, save it in "tab-
723       delimited" format and convert it with tabdelim_to_db (run
724       tabdelim_to_db -? for examples).
725

A PASSWORD EXAMPLE

727       To convert the Unix password file to db:
728
729               cat /etc/passwd | sed 's/:/  /g'| \
730                       dbcoldefine -F S login password uid gid gecos home shell \
731                       >passwd.fsdb
732
733       To convert the group file
734
735               cat /etc/group | sed 's/:/  /g' | \
736                       dbcoldefine -F S group password gid members \
737                       >group.fsdb
738
739       To show the names of the groups that div7-members are in (assuming DIV7
740       is in the gecos field):
741
742               cat passwd.fsdb | dbrow '_gecos =~ /DIV7/' | dbcol login gid | \
743                       dbjoin -i - -i group.fsdb gid | dbcol login group
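
       Conceptually, the join on gid works like the following awk sketch,
       which loads one table into a lookup array and then streams the other
       past it.  The file contents are made up, column positions are
       hard-coded, and this is not a description of how dbjoin itself is
       implemented.

```shell
tmp=$(mktemp -d)
printf '%s\n' '#fsdb login gid' 'johnh 134' 'root 0' > "$tmp/users.fsdb"
printf '%s\n' '#fsdb group gid' 'staff 134' 'wheel 0' > "$tmp/groups.fsdb"

joined=$(awk '
    /^#/      { next }                      # skip headers and comments
    NR == FNR { group[$2] = $1; next }      # first file: gid -> group name
    ($2 in group) { print $1, group[$2] }   # second file: look up each gid
' "$tmp/groups.fsdb" "$tmp/users.fsdb")
rm -rf "$tmp"

printf '%s\n' "$joined"
```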
744

SHORT EXAMPLES

746       Which Fsdb programs are the most complicated (based on number of test
747       cases)?
748
749               ls TEST/*.cmd | \
750                       dbcoldefine test | \
751                       dbroweval '_test =~ s@^TEST/([^_]+).*$@$1@' | \
752                       dbrowuniq -c | \
753                       dbsort -nr count | \
754                       dbcolneaten
755
756       (Answer: dbmapreduce, then dbcolstats, dbfilealter and dbjoin.)
757
758       Stats on an exam (in $FILE, where $COLUMN is the name of the exam)?
759
               cat $FILE | dbcolstats -q 4 $COLUMN | dblistize | dbfilestripcomments

               cat $FILE | dbcolhisto -g -n 20 $COLUMN | dbcolneaten | dbfilestripcomments
763
       Merging the hw1 column from file hw1.fsdb into grades.fsdb assuming
       there's a common student id in column "id":
766
767               dbcol id hw1 <hw1.fsdb >t.fsdb
768
769               dbjoin -a -e - grades.fsdb t.fsdb id | \
770                   dbsort  name | \
771                   dbcolneaten >new_grades.fsdb
772
       Merging two fsdb files with the same columns:
774
775               cat file1.fsdb file2.fsdb >output.fsdb
776
777       or if you want to clean things up a bit
778
779               cat file1.fsdb file2.fsdb | dbstripextraheaders >output.fsdb
780
781       or if you want to know where the data came from
782
               for i in 1 2
               do
                       dbcolcreate -e $i source < file$i.fsdb
               done >output.fsdb
787
788       (assumes you're using a Bourne-shell compatible shell, not csh).
789

WARNINGS

791       As with any tool, one should (which means must) understand the limits
792       of the tool.
793
794       All Fsdb tools should run in constant memory.  In some cases (such as
795       dbcolstats with quartiles, where the whole input must be re-read),
796       programs will spool data to disk if necessary.
797
       Most tools buffer one or a few lines of data, so memory will scale with
       the size of each line.  (So lines with many columns, or columns with
       lots of data, may cause large memory consumption.)
801
802       All Fsdb tools should run in constant or at worst "n log n" time.
803
       All Fsdb tools use normal Perl math routines for computation.  I make
       every attempt to choose numerically stable algorithms (and I welcome
       feedback and suggestions for improvement), but normal rounding due to
       computer floating point approximations can result in inaccuracies when
       data spans a large range of precision.  (See for example the
       dbcolstats_extrema test cases.)
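
       As one example of what "numerically stable" can mean, the sketch
       below computes a sample variance with Welford's one-pass update, which
       avoids subtracting two huge, nearly equal sums.  It is offered only
       as an illustration; nothing here claims that Fsdb uses this
       particular method.

```shell
# Four large values whose spread is tiny compared to their magnitude.
stable_var=$(printf '%s\n' 1000000004 1000000007 1000000013 1000000016 | awk '
    {
        n++
        delta = $1 - mean               # distance from the old mean
        mean += delta / n               # update the running mean
        m2 += delta * ($1 - mean)       # uses the updated mean
    }
    END { printf "%.1f\n", m2 / (n - 1) }   # sample variance
')
printf '%s\n' "$stable_var"
```

       The naive formula based on the sum of squares works with
       intermediate values near 4e18, where double-precision floating point
       can no longer resolve differences this small.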
810
       Any requirements and limitations of each Fsdb tool are documented on
       its manual page.
813
814       If any Fsdb program violates these assumptions, that is a bug that
815       should be documented on the tool's manual page or ideally fixed.
816
817       Fsdb does depend on Perl's correctness, and Perl (and Fsdb) have some
818       bugs.  Fsdb should work on perl from version 5.10 onward.
819

HISTORY

821       There have been three versions of Fsdb; fsdb 1.0 is a complete re-write
822       of the pre-1995 versions, and was distributed from 1995 to 2007.  Fsdb
823       2.0 is a significant re-write of the 1.x versions for reasons described
824       below.
825
826       Fsdb (in its various forms) has been used extensively by its author
827       since 1991.  Since 1995 it's been used by two other researchers at UCLA
828       and several at ISI.  In February 1998 it was announced to the Internet.
829       Since then it has found a few users, some outside where I work.
830
831       Major changes:
832
833       1.0 1997-07-22: first public release.
834       2.0 2008-01-25: rewrite to use a common library, and starting to use
835       threads.
836       2.12 2008-10-16: completion of the rewrite, and first RPM package.
837       2.44 2013-10-02: abandoning threads for improved performance
838
839   Fsdb 2.0 Rationale
840       I've thought about fsdb-2.0 for many years, but it was started in
841       earnest in 2007.  Fsdb-2.0 has the following goals:
842
843       in-one-process processing
844           While fsdb is great on the Unix command line as a pipeline between
845           programs, it should also be possible to set it up to run in a
846           single process.  And if it does so, it should be able to avoid
847           serializing and deserializing (converting to and from text) data
848           between each module.  (Accomplished in fsdb-2.0: see dbpipeline,
849           although it still needs tuning.)
850
851       clean IO API
852           Fsdb's roots go back to perl4 and 1991, so the fsdb-1.x library is
853           very, very crufty.  More than just being ugly (but it was that
854           too), this made tasks like reading one file format and writing
855           another the application's job, when it should be the library's.
856           (Accomplished in fsdb-1.15 and improved in 2.0: see Fsdb::IO.)
857
858       normalized module APIs
859           Because fsdb modules were added as needed over 10 years, sometimes
860           the module APIs became inconsistent.  (For example, the 1.x
861           "dbcolcreate" required an empty value following the name of the new
862           column, but other programs specify empty values with the "-e"
863           argument.)  We should smooth over these inconsistencies.
864           (Accomplished as each module was ported in 2.0 through 2.7.)
865
866       everyone handles all input formats
867           Given a clean IO API, the distinction between "colized" and
868           "listized" fsdb files should go away.  Any program should be able
869           to read and write files in any format.  (Accomplished in fsdb-2.1.)
870
871       Fsdb-2.0 preserves backwards compatibility where possible, but breaks
872       it where necessary to accomplish the above goals.  In August 2008,
873       Fsdb-2.7 was declared preferred over the 1.x versions.  Benchmarking in
874       2013 showed that threading performed much worse than just using pipes,
875       so Fsdb-2.44 uses threading "style", but implemented with processes
876       (via my "Freds" library).
877
878   Contributors
879       Fsdb includes code ported from Geoff Kuenning
880       ("Fsdb::Support::TDistribution").
881
882       Fsdb contributors: Ashvin Goel goel@cse.oge.edu, Geoff Kuenning
883       geoff@fmg.cs.ucla.edu, Vikram Visweswariah visweswa@isi.edu, Kannan
884       Varadahan kannan@isi.edu, Lars Eggert larse@isi.edu, Arkadi Gelfond
885       arkadig@dyna.com, David Graff graff@ldc.upenn.edu, Haobo Yu
886       haoboy@packetdesign.com, Pavlin Radoslavov pavlin@catarina.usc.edu,
887       Graham Phillips, Yuri Pradkin, Alefiya Hussain, Ya Xu, Michael
888       Schwendt, Fabio Silva fabio@isi.edu, Jerry Zhao zhaoy@isi.edu, Ning Xu
889       nxu@aludra.usc.edu, Martin Lukac mlukac@lecs.cs.ucla.edu, Xue Cai,
890       Michael McQuaid, Christopher Meng, Calvin Ardi, H. Merijn Brand, Lan
891       Wei, Hang Guo.
892
893       Fsdb includes datasets contributed from NIST (DATA/nist_zarr13.fsdb),
894       from
895       <http://www.itl.nist.gov/div898/handbook/eda/section4/eda4281.htm>, the
896       NIST/SEMATECH e-Handbook of Statistical Methods, section 1.4.2.8.1.
897       Background and Data.  The source is public domain, and reproduced with
898       permission.
899

RELATED WORK

901       As stated in the introduction, Fsdb is an incompatible reimplementation
902       of the ideas found in "/rdb".  By storing data in simple text files and
903       processing it with pipelines it is easy to experiment (in the shell)
904       and look at the output.  The original implementation of this idea was
905       /rdb, a commercial product described in the book UNIX relational
906       database management: application development in the UNIX environment by
907       Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web
908       page <http://www.rdb.com/>).
909
910       While Fsdb is inspired by Rdb, it includes no code from it, and Fsdb
911       makes several different design choices.  In particular: rdb attempts to
912       be closer to a "real" database, with provisions for locking and file
913       indexing.  Fsdb focuses on single-user use and so eschews these
914       features.  Rdb also has some support for interactive editing.  Fsdb
915       leaves editing to text editors like emacs or vi.
916
917       In August, 2002 I found out Carlo Strozzi extended RDB with his package
918       NoSQL <http://www.linux.it/~carlos/nosql/>.  According to Mr. Strozzi,
919       he implemented NoSQL in awk to avoid the Perl start-up of RDB.
920       Although I haven't found Perl startup overhead to be a big problem on
921       my platforms (from old Sparcstation IPCs to 2GHz Pentium-4s), you may
922       want to evaluate his system.  The Linux Journal has a description of
923       NoSQL at <http://www.linuxjournal.com/article/3294>.  It seems quite
924       similar to Fsdb.  Like /rdb, NoSQL supports indexing (not present in
925       Fsdb).  Fsdb appears to have richer support for statistics, and, as of
926       Fsdb-2.x, its support for Perl threading may allow faster performance
927       (one-process, less serialization and deserialization).
928

RELEASE NOTES

930       Versions prior to 1.0 were released informally on my web page but were
931       not announced.
932
933   0.0 1991
934       started for my own research use
935
936   0.1 26-May-94
937       first check-in to RCS
938
939   0.2 15-Mar-95
940       parts now require perl5
941
942   1.0, 22-Jul-97
943       adds autoconf support and a test script.
944
945   1.1, 20-Jan-98
946       support for double space field separators, better tests
947
948   1.2, 11-Feb-98
949       minor changes and release on comp.lang.perl.announce
950
951   1.3, 17-Mar-98
952       ·   adds median and quartile options to dbstats
953
954       ·   adds dmalloc_to_db converter
955
956       ·   fixes some warnings
957
958       ·   dbjoin now can run on unsorted input
959
960       ·   fixes a dbjoin bug
961
962       ·   some more tests in the test suite
963
964   1.4, 27-Mar-98
965       ·   improves error messages (all should now report the program that
966           makes the error)
967
968       ·   fixed a bug in dbstats output when the mean is zero
969
970   1.5, 25-Jun-98
971       BUG FIX dbcolhisto, dbcolpercentile now handles non-numeric values like
972       dbstats
973       NEW dbcolstats computes zscores and tscores over a column
974       NEW dbcolscorrelate computes correlation coefficients between two
975       columns
976       INTERNAL ficus_getopt.pl has been replaced by DbGetopt.pm
977       BUG FIX all tests are now ``portable'' (previously some tests ran only
978       on my system)
979       BUG FIX you no longer need to have the db programs in your path (fix
980       arose from a discussion with Arkadi Gelfond)
981       BUG FIX installation no longer uses cp -f (to work on SunOS 4)
982
983   1.6, 24-May-99
984       NEW dbsort, dbstats, dbmultistats now run in constant memory (using tmp
985       files if necessary)
986       NEW dbcolmovingstats does moving means over a series of data
987       NEW dbcol has a -v option to get all columns except those listed
988       NEW dbmultistats does quartiles and medians
989       NEW dbstripextraheaders now also cleans up bogus comments before the
990       first header
991       BUG FIX dbcolneaten works better with double-space-separated data
992
993   1.7,  5-Jan-00
994       NEW dbcolize now detects and rejects lines that contain embedded copies
995       of the field separator
996       NEW configure tries harder to prevent people from improperly
997       configuring/installing fsdb
998       NEW tcpdump_to_db converter (incomplete)
999       NEW tabdelim_to_db converter:  from spreadsheet tab-delimited files to
1000       db
1001       NEW mailing lists for fsdb are "fsdb-announce@heidemann.la.ca.us" and
1002       "fsdb-talk@heidemann.la.ca.us"
1003           To subscribe to either, send mail to
1004           "fsdb-announce-request@heidemann.la.ca.us" or
1005           "fsdb-talk-request@heidemann.la.ca.us" with "subscribe" in the
1006           BODY of the message.
1007
1008       BUG FIX dbjoin used to produce incorrect output if there were extra,
1009       unmatched values in the 2nd table. Thanks to Graham Phillips for
1010       providing a test case.
1011       BUG FIX the sample commands in the usage strings now all should
1012       explicitly include the source of data (typically from "cat foo.fsdb
1013       |").  Thanks to Ya Xu for pointing out this documentation deficiency.
1014       BUG FIX (DOCUMENTATION) dbcolmovingstats had incorrect sample output.
1015
1016   1.8, 28-Jun-00
1017       BUG FIX header options are now preserved when writing with dblistize
1018       NEW dbrowuniq now optionally checks for uniqueness only on certain
1019       fields
1020       NEW dbrowsplituniq makes one pass through a file and splits it into
1021       separate files based on the given fields
1022       NEW converter for "crl" format network traces
1023       NEW anywhere you use arbitrary code (like dbroweval), _last_foo now
1024       maps to the last row's value for field _foo.
1025       OPTIMIZATION comment processing slightly changed so that dbmultistats
1026       now is much faster on files with lots of comments (for example, ~100k
1027       lines of comments and 700 lines of data!) (Thanks to Graham Phillips
1028       for pointing out this performance problem.)
1029       BUG FIX dbstats with median/quartiles now correctly handles singleton
1030       data points.
1031
1032   1.9,  6-Nov-00
1033       NEW dbfilesplit, split a single input file into multiple output files
1034       (based on code contributed by Pavlin Radoslavov).
1035       BUG FIX dbsort now works with perl-5.6
1036
1037   1.10, 10-Apr-01
1038       BUG FIX dbstats now handles the case where there are more n-tiles than
1039       data
1040       NEW dbstats now includes a -S option to optimize work on pre-sorted
1041       data (inspired by code contributed by Haobo Yu)
1042       BUG FIX dbsort now has a better estimate of memory usage when run on
1043       data with very short records (problem detected by Haobo Yu)
1044       BUG FIX cleanup of temporary files is slightly better
1045
1046   1.11,  2-Nov-01
1047       BUG FIX dbcolneaten now runs in constant memory
1048       NEW dbcolneaten now supports "field specifiers" that allow some control
1049       over how wide columns should be
1050       OPTIMIZATION dbsort now tries hard to be filesystem cache-friendly
1051       (inspired by "Information and Control in Gray-box Systems" by the
1052       Arpaci-Dusseaus at SOSP 2001)
1053       INTERNAL t_distr now ported to perl5 module DbTDistr
1054
1055   1.12,  30-Oct-02
1056       BUG FIX dbmultistats documentation typo fixed
1057       NEW dbcolmultiscale
1058       NEW dbcol has -r option for "relaxed error checking"
1059       NEW dbcolneaten has new -e option to strip end-of-line spaces
1060       NEW dbrow finally has a -v option to negate the test
1061       BUG FIX math bug in dbcoldiff fixed by Ashvin Goel (need to check
1062       Scheaffer test cases)
1063       BUG FIX some patches to run with Perl 5.8. Note: some programs
1064       (dbcolmultiscale, dbmultistats, dbrowsplituniq) generate warnings like:
1065       "Use of uninitialized value in concatenation (.) or string at
1066       /usr/lib/perl5/5.8.0/FileCache.pm line 98, <STDIN> line 2". Please
1067       ignore this until I figure out how to suppress it. (Thanks to Jerry
1068       Zhao for noticing perl-5.8 problems.)
1069       BUG FIX fixed an autoconf problem where configure would fail to find a
1070       reasonable prefix (thanks to Fabio Silva for reporting the problem)
1071       NEW db_to_html_table: simple conversion to html tables (NO fancy stuff)
1072       NEW dblib now has a function dblib_text2html() that will do simple
1073       conversion of iso-8859-1 to HTML
1074
1075   1.13,  4-Feb-04
1076       NEW fsdb added to the freebsd ports tree
1077       <http://www.freshports.org/databases/fsdb/>.  Maintainer:
1078       "larse@isi.edu"
1079       BUG FIX properly handle trailing spaces when data must be numeric (ex.
1080       dbstats with -FS, see test dbstats_trailing_spaces). Fix from Ning Xu
1081       "nxu@aludra.usc.edu".
1082       NEW dbcolize error message improved (bug report from Terrence Brannon),
1083       and list format documented in the README.
1084       NEW cgi_to_db converts CGI.pm-format storage to fsdb list format
1085       BUG FIX handle numeric synonyms for column names in dbcol properly
1086       ENHANCEMENT "talking about columns" section added to README. Lack of
1087       documentation pointed out by Lars Eggert.
1088       CHANGE dbformmail now defaults to using Mail ("Berkeley Mail") to send
1089       mail, rather than sendmail (sendmail is still an option, but mail
1090       doesn't require running as root)
1091       NEW on platforms that support it (i.e., with perl 5.8), fsdb works fine
1092       with unicode
1093       NEW dbfilevalidate: check a db file for some common errors
1094
1095   1.14,  24-Aug-06
1096       ENHANCEMENT README cleanup
1097       INCOMPATIBLE CHANGE dbcolsplit renamed dbcolsplittocols
1098       NEW dbcolsplittorows  split one column into multiple rows
1099       NEW dbcolsregression compute linear regression and correlation for two
1100       columns
1101       ENHANCEMENT csv_to_db: better error handling, normalize field names,
1102       skip blank lines
1103       ENHANCEMENT dbjoin now detects (and fails) if non-joined files have
1104       duplicate names
1105       BUG FIX minor bug fixed in calculation of Student t-distributions
1106       (doesn't change any test output, but may have caused small errors)
1107
1108   1.15, 12-Nov-07
1109       NEW fsdb-1.14 added to the MacOS Fink system
1110       <http://pdb.finkproject.org/pdb/package.php/fsdb>. (Thanks to Lars
1111       Eggert for maintaining this port.)
1112       NEW Fsdb::IO::Reader and Fsdb::IO::Writer now provide reasonably clean
1113       OO I/O interfaces to Fsdb files.  Highly recommended if you use fsdb
1114       directly from perl.  In the fullness of time I expect to reimplement
1115       the entire thing using these APIs to replace the current dblib.pl which
1116       is still hobbled by its roots in perl4.
1117       NEW dbmapreduce now implements a Google-style map/reduce abstraction,
1118       generalizing dbmultistats.
1119       ENHANCEMENT fsdb now uses the Perl build system (Makefile.PL, etc.),
1120       instead of autoconf.  This change paves the way to better perl-5-style
1121       modularization, proper manual pages, input of both listize and colize
1122       format for every program, and world peace.
1123       ENHANCEMENT dblib.pl is now moved to Fsdb::Old.pm.
1124       BUG FIX dbmultistats now propagates its format argument (-f). Bug and
1125       fix from Martin Lukac (thanks!).
1126       ENHANCEMENT dbformmail documentation now is clearer that it doesn't
1127       send the mail, you have to run the shell script it writes.  (Problem
1128       observed by Unkyu Park.)
1129       ENHANCEMENT adapted to autoconf-2.61 (and then these changes were
1130       discarded in favor of The Perl Way).
1131       BUG FIX dbmultistats memory usage corrected (O(# tags), not O(1))
1132       ENHANCEMENT dbmultistats can now optionally run with pre-grouped input
1133       in O(1) memory
1134       ENHANCEMENT dbroweval -N was finally implemented (eat comments)
1135
1136   2.0, 25-Jan-08
1137       A quiet 2.0 release (gearing up towards complete).
1138
1139       ENHANCEMENT: shifting old programs to Perl modules, with the front-end
1140       program as just a wrapper. In the short-term, this change just means
1141       programs have real man pages. In the long-run, it will mean that one
1142       can run a pipeline in a single Perl program. So far: dbcol, dbroweval,
1143       the new dbrowcount, dbsort, the new dbmerge, the old "dbstats" (renamed
1144       dbcolstats), dbcolrename, and dbcolcreate.
1145       NEW: Fsdb::Filter::dbpipeline is an internal-only module that lets one
1146       use fsdb commands from within perl (via threads).
1147           It also provides perl function aliases for the internal modules, so
1148           a string of fsdb commands in perl is nearly as terse as in the
1149           shell:
1150
1151               use Fsdb::Filter::dbpipeline qw(:all);
1152               dbpipeline(
1153                   dbrow(qw(name test1)),
1154                   dbroweval('_test1 += 5;')
1155               );
1156
1157       INCOMPATIBLE CHANGE: The old dbcolstats has been renamed
1158       dbcolstatscores. The new dbcolstats does the same thing as the old
1159       dbstats. This incompatibility is unfortunate but normalizes program
1160       names.
1161       CHANGE: The new dbcolstats program always outputs "-" (the default
1162       empty value) for statistics it cannot compute (for example, standard
1163       deviation if there is only one row), instead of the old mix of "-" and
1164       "na".
1165       INCOMPATIBLE CHANGE: The old dbcolstats program, now called
1166       dbcolstatscores, also has different arguments.  The "-t mean,stddev"
1167       option is now "--tmean mean --tstddev stddev".  See dbcolstatscores for
1168       details.
1169       INCOMPATIBLE CHANGE: dbcolcreate now assumes all new columns get the
1170       default value rather than requiring each column to have an initial
1171       constant value. To change the initial value, use the new "-e" option.
1172       NEW: dbrowcount counts rows, an almost-subset of dbcolstats's "n"
1173       output (except without differentiating numeric/non-numeric input), or
1174       the equivalent of "dbstripcomments | wc -l".
1175       NEW: dbmerge merges two sorted files. This functionality was previously
1176       embedded in dbsort.
1177       INCOMPATIBLE CHANGE: dbjoin's "-i" option to include non-matches is now
1178       renamed "-a", so as to not conflict with the new standard option "-i"
1179       for input file.
1180
1181   2.1,  6-Apr-08
1182       Another alpha 2.0, but now all converted programs understand both
1183       listize and colize format.
1184
1185       ENHANCEMENT: shifting more old programs to Perl modules. New in 2.1:
1186       dbcolneaten, dbcoldefine, dbcolhisto, dblistize, dbcolize, dbrecolize
1187       ENHANCEMENT dbmerge now handles an arbitrary number of input files, not
1188       just exactly two.
1189       NEW dbmerge2 is an internal routine that handles merging exactly two
1190       files.
1191       INCOMPATIBLE CHANGE dbjoin now specifies inputs like dbmerge2, rather
1192       than assuming the first two arguments were tables (as in fsdb-1).
1193           The old dbjoin argument "-i" is now "-a" or "--type=outer".
1194
1195           A minor change: comments in the source files for dbjoin are now
1196           intermixed with output rather than being delayed until the end.
1197
1198       ENHANCEMENT dbsort now no longer produces warnings when null values are
1199       passed to numeric comparisons.
1200       BUG FIX dbroweval now once again works with code that lacks a trailing
1201       semicolon. (This bug fixes a regression from 1.15.)
1202       INCOMPATIBLE CHANGE dbcolneaten's old "-e" option (to avoid end-of-line
1203       spaces) is now "-E" to avoid conflicts with the standard empty field
1204       argument.
1205       INCOMPATIBLE CHANGE dbcolhisto's old "-e" option is now "-E" to avoid
1206       conflicts. And its "-n", "-s", and "-w" are now "-N", "-S", and "-W" to
1207       correspond.
1208       NEW dbfilealter replaces dbrecolize, dblistize, and dbcolize, but with
1209       different options.
1210       ENHANCEMENT The library routines "Fsdb::IO" now understand both list-
1211       format and column-format data, so all converted programs can now
1212       automatically read either format.  This capability was one of the
1213       milestone goals for 2.0, so yea!
1214
1215   2.2, 23-May-08
1216       Release 2.2 is another 2.x alpha release.  Now most of the commands are
1217       ported, but a few remain, and I plan one last incompatible change (to
1218       the file header) before 2.x final.
1219
1220       ENHANCEMENT
1221           shifting more old programs to Perl modules.  New in 2.2:
1222           dbrowaccumulate, dbformmail, dbcolmovingstats, dbrowuniq,
1223           dbrowdiff, dbcolmerge, dbcolsplittocols, dbcolsplittorows,
1224           dbmapreduce, dbmultistats, dbrvstatdiff.  Also dbrowenumerate
1225           exists only as a front-end (command-line) program.
1226
1227       INCOMPATIBLE CHANGE
1228           The following programs have been dropped from fsdb-2.x:
1229           dbcoltighten, dbfilesplit, dbstripextraheaders,
1230           dbstripleadingspace.
1231
1232       NEW combined_log_format_to_db to convert Apache logfiles
1233
1234       INCOMPATIBLE CHANGE
1235           Options to dbrowdiff are now -B and -I, not -a and -i.
1236
1237       INCOMPATIBLE CHANGE
1238           dbstripcomments is now dbfilestripcomments.
1239
1240       BUG FIXES
1241           dbcolneaten better handles empty columns; dbcolhisto warning
1242           suppressed (actually a bug in high-bucket handling).
1243
1244       INCOMPATIBLE CHANGE
1245           dbmultistats now requires a "-k" option in front of the key (tag)
1246           field, or if none is given, it will group by the first field (both
1247           like dbmapreduce).
1248
1249       KNOWN BUG
1250           dbmultistats with quantile option doesn't work currently.
1251
1252       INCOMPATIBLE CHANGE
1253           dbcoldiff is renamed dbrvstatdiff.
1254
1255       BUG FIXES
1256           dbformmail was leaving its log message as a command, not a
1257           comment.  Oops.  No longer.
1258
1259   2.3, 27-May-08 (alpha)
1260       Another alpha release, this one just to fix the critical dbjoin bug
1261       listed below (that happens to have blocked my MP3 jukebox :-).
1262
1263       BUG FIX
1264           Dbsort no longer hangs if given an input file with no rows.
1265
1266       BUG FIX
1267           Dbjoin now works with unsorted input coming from a pipeline (like
1268           stdin).  Perl-5.8.8 has a bug (?) that was making this case
1269           fail---opening stdin in one thread, reading some, then reading more
1270           in a different thread caused an lseek which works on files, but
1271           fails on pipes like stdin.  Go figure.
1272
1273       BUG FIX / KNOWN BUG
1274           The dbjoin fix also fixed dbmultistats -q (it now gives the right
1275           answer).  However, a new bug appeared, with messages like:
1276               Attempt to free unreferenced scalar: SV 0xa9dd0c4, Perl
1277               interpreter: 0xa8350b8 during global destruction.
1278           So the dbmultistats_quartile test is still disabled.
1279
1280   2.4, 18-Jun-08
1281       Another alpha release, mostly to fix minor usability problems in
1282       dbmapreduce and client functions.
1283
1284       ENHANCEMENT
1285           dbrow now defaults to running user supplied code without warnings
1286           (as with fsdb-1.x).  Use "--warnings" or "-w" to turn them back on.
1287
1288       ENHANCEMENT
1289           dbroweval can now write different format output than the input,
1290           using the "-m" option.
1291
1292       KNOWN BUG
1293           dbmapreduce emits warnings on perl 5.10.0 about "Unbalanced string
1294           table refcount" and "Scalars leaked" when run with an external
1295           program as a reducer.
1296
1297           dbmultistats emits the warning "Attempt to free unreferenced
1298           scalar" when run with quartiles.
1299
1300           In each case the output is correct.  I believe these can be
1301           ignored.
1302
1303       CHANGE
1304           dbmapreduce no longer logs a line for each reducer that is invoked.
1305
1306   2.5, 24-Jun-08
1307       Another alpha release, fixing more minor bugs in "dbmapreduce" and
1308       lossage in "Fsdb::IO".
1309
1310       ENHANCEMENT
1311           dbmapreduce can now tolerate non-map-aware reducers that pass back
1312           the key column in their output.  It also passes the current key as
1313           the last argument to external reducers.
1314
1315       BUG FIX
1316           Fsdb::IO::Reader correctly handles the "-header" option again.
1317           (Broken since fsdb-2.3.)
1318
1319   2.6, 11-Jul-08
1320       Another alpha release, needed to fix DaGronk.  One new port, small bug
1321       fixes, and an important fix to dbmapreduce.
1322
1323       ENHANCEMENT
1324           shifting more old programs to Perl modules.  New in 2.6:
1325           dbcolpercentile.
1326
1327       INCOMPATIBLE CHANGE and ENHANCEMENTS dbcolpercentile arguments changed,
1328       use "--rank" to require ranking instead of "-r". Also, "--ascending"
1329       and "--descending" can now be specified separately, both for
1330       "--percentile" and "--rank".
1331       BUG FIX
1332           Sigh, the sense of the --warnings option in dbrow was inverted.  No
1333           longer.
1334
1335       BUG FIX
1336           I found and fixed the string leaks (errors like "Unbalanced string
1337           table refcount" and "Scalars leaked") in dbmapreduce and
1338           dbmultistats.  (All "IO::Handle"s in threads must be manually
1339           destroyed.)
1340
1341       BUG FIX
1342           The "-C" option to specify the column separator in dbcolsplittorows
1343           now works again (broken since it was ported).
1344
1345   2.7, 30-Jul-08 beta
1346
1347       The beta release of fsdb-2.x.  Finally, all programs are ported.  As
1348       statistics, the number of lines of non-library code doubled from 7.5k
1349       to 15.5k.  The libraries are much more complete, going from 866 to 5164
1350       lines.  The overall number of programs is about the same, although 19
1351       were dropped and 11 were added.  The number of test cases has grown
1352       from 116 to 175.  All programs are now in perl-5, no more shell scripts
1353       or perl-4.  All programs now have manual pages.
1354
1355       Although this is a major step forward, I still expect to rename "jdb"
1356       to "fsdb".
1357
1358       ENHANCEMENT
1359           shifting more old programs to Perl modules.  New in 2.7:
1360           dbcolscorrelate, dbcolsregression, cgi_to_db, dbfilevalidate,
1361           db_to_csv, csv_to_db, db_to_html_table, kitrace_to_db,
1362           tcpdump_to_db, tabdelim_to_db, ns_to_db.
1363
1364       INCOMPATIBLE CHANGE
1365           The following programs have been dropped from fsdb-2.x: db2dcliff,
1366           dbcolmultiscale, crl_to_db, ipchain_logs_to_db.  They may come
1367           back, but seemed overly specialized.  The program dbrowsplituniq
1368           was dropped because it is superseded by dbmapreduce.
1369           dmalloc_to_db was dropped pending test cases and examples.
1370
1371       ENHANCEMENT
1372           dbfilevalidate now has a "-c" option to correct errors.
1373
1374       NEW html_table_to_db provides the inverse of db_to_html_table.
1375
1376   2.8,  5-Aug-08
1377       Change header format, preserving forwards compatibility.
1378
1379       BUG FIX
1380           Complete editing pass over the manual, making sure it aligns with
1381           fsdb-2.x.
1382
1383       SEMI-COMPATIBLE CHANGE
1384           The header of fsdb files has changed; it is now #fsdb, not #h (or
1385           #L), and parsing of -F and -R is also different.  See dbfilealter
1386           for the new specification.  The v1 file format will be read,
1387           compatibly, but not written.
1388
1389       BUG FIX
1390           dbmapreduce now tolerates comments that precede the first key,
1391           instead of failing with an error message.
1392
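       The header change in 2.8 can be illustrated with a small Perl sketch
       that classifies a line by its leading marker.  The sub name
       header_version is hypothetical and is not part of Fsdb; it looks only
       at the "#fsdb", "#h", and "#L" markers named above and deliberately
       does not parse -F or -R, whose specification is in dbfilealter.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Fsdb): report which header
# generation a line uses, judging only by its leading marker.
sub header_version {
    my ($line) = @_;
    return 2 if $line =~ /^#fsdb(?:\s|$)/;   # header written by 2.8 and later
    return 1 if $line =~ /^#[hL](?:\s|$)/;   # v1 markers named in the text
    return 0;                                # not a header line
}

for my $line ('#fsdb experiment duration',
              '#h experiment duration',
              'ufs_mab_sys 37.2') {
    printf "%-28s => %d\n", $line, header_version($line);
}
```

       A reader that accepts both generations but writes only the new one
       would branch on a check like this before handing the rest of the line
       to the real option parser.
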
1393   2.9, 6-Aug-08
1394       Still in beta; just a quick bug-fix for dbmapreduce.
1395
1396       ENHANCEMENT
1397           dbmapreduce now generates plausible output when given no rows of
1398           input.
1399
1400   2.10, 23-Sep-08
1401       Still in beta, but picking up some bug fixes.
1402
1403       ENHANCEMENT
1404           dbmapreduce now generates plausible output when given no rows of
1405           input.
1406
1407       ENHANCEMENT
1408           dbroweval: the warnings option was backwards; now corrected.  As a
1409           result, warnings in user code now default off (like in fsdb-1.x).
1410
1411       BUG FIX
1412           dbcolpercentile now defaults to assuming the target column is
1413           numeric.  The new option "-N" allows selection of a non-numeric
1414           target.
1415
1416       BUG FIX
1417           dbcolscorrelate now includes "--sample" and "--nosample" options to
1418           compute the sample or full population correlation coefficients.
1419           Thanks to Xue Cai for finding this bug.
1420
1421   2.11, 14-Oct-08
1422       Still in beta, but picking up some bug fixes.
1423
1424       ENHANCEMENT
1425           html_table_to_db is now more aggressive about filling in empty
1426           cells with the official empty value, rather than leaving them blank
1427           or as whitespace.
1428
1429       ENHANCEMENT
1430           dbpipeline now catches failures during pipeline element setup and
1431           exits reasonably gracefully.
1432
1433       BUG FIX
1434           dbsubprocess now reaps child processes, thus avoiding running out
1435           of processes when used a lot.
1436
1437   2.12, 16-Oct-08
1438       Finally, a full (non-beta) 2.x release!
1439
1440       INCOMPATIBLE CHANGE
1441           Jdb has been renamed Fsdb, the flatfile-streaming database.  This
1442           change affects all internal Perl APIs, but no shell command-level
1443           APIs.  While Jdb served well for more than ten years, it is easily
1444           confused with the Java debugger (even though Jdb was there first!).
1445           It also is too generic to work well in web search engines.
1446           Finally, Jdb stands for ``John's database'', and we're a bit beyond
1447           that.  (However, some call me the ``file-system guy'', so one could
1448           argue it retains that meaning.)
1449
1450           If you just used the shell commands, this change should not affect
1451           you.  If you used the Perl-level libraries directly in your code,
1452           you should be able to rename "Jdb" to "Fsdb" to move to 2.12.
1453
1454           The jdb-announce list has not yet been renamed, but it will be shortly.
1455
1456           With this release I've accomplished everything I wanted to in
1457           fsdb-2.x.  I therefore expect to return to boring, bugfix releases.
1458
1459   2.13, 30-Oct-08
1460       BUG FIX
1461           dbrowaccumulate now treats non-numeric data as zero by default.
1462
1463       BUG FIX
1464           Fixed a perl-5.10ism in dbmapreduce that breaks that program under
1465           5.8.  Thanks to Martin Lukac for reporting the bug.
1466
1467   2.14, 26-Nov-08
1468       BUG FIX
1469           Improved documentation for dbmapreduce's "-f" option.
1470
1471       ENHANCEMENT
1472           dbcolmovingstats now computes a moving standard deviation in
1473           addition to a moving mean.
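
A minimal Python sketch of what a moving mean and moving standard deviation
look like, for readers unfamiliar with the idea.  This is not dbcolmovingstats'
code: the function name, the trailing-window semantics, and the choice of the
sample (n-1) standard deviation are all assumptions made for illustration.

```python
from collections import deque
from math import sqrt


def moving_stats(values, window):
    """Yield (mean, stddev) over the trailing `window` values.

    Assumption: the window holds the most recent `window` inputs, and the
    standard deviation is the sample (n-1) form.  It is None while only one
    value has been seen, since it is undefined for a single sample.
    """
    recent = deque(maxlen=window)  # old values fall off automatically
    for v in values:
        recent.append(float(v))
        n = len(recent)
        mean = sum(recent) / n
        if n > 1:
            variance = sum((x - mean) ** 2 for x in recent) / (n - 1)
            sd = sqrt(variance)
        else:
            sd = None
        yield mean, sd
```

With a window of 2 over the values 1, 2, 3, 4, the means are 1.0, 1.5, 2.5,
and 3.5.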
1474
1475   2.15, 13-Apr-09
1476       BUG FIX
1477           Fix a make install bug reported by Shalindra Fernando.
1478
1479   2.16, 14-Apr-09
1480       BUG FIX
1481           Another minor release bug: on some systems programize_module loses
1482           executable permissions.  Again reported by Shalindra Fernando.
1483
1484   2.17, 25-Jun-09
1485       TYPO FIXES
1486           Typo in the dbroweval manual fixed.
1487
1488       IMPROVEMENT
1489           There is no longer a comment line to label columns in dbcolneaten,
1490           instead the header line is tweaked to line up.  This change
1491           restores the Jdb-1.x behavior, and means that repeated runs of
1492           dbcolneaten no longer add comment lines each time.
1493
1494       BUG FIX
1495           It turns out dbcolneaten was not correctly handling trailing
1496           spaces when given the "-E" option to suppress them.  This
1497           regression is now fixed.
1498
1499       EXTENSION
1500           dbroweval(1) can now handle direct references to the last row via
1501           $lfref, a dubious but now documented feature.
1502
1503       BUG FIXES
1504           Separators set with "-C" in dbcolmerge and dbcolsplittocols were
1505           not properly setting the heading, and null fields were not
1506           recognized.  The first bug was reported by Martin Lukac.
1507
1508   2.18,  1-Jul-09  A minor release
1509       IMPROVEMENT
1510           Documentation for Fsdb::IO::Reader has been improved.
1511
1512       IMPROVEMENT
1513           The package should now be PGP-signed.
1514
1515   2.19,  10-Jul-09
1516       BUG FIX
1517           Internal improvements to debugging output and robustness of
1518           dbmapreduce and dbpipeline.  TEST/dbpipeline_first_fails.cmd re-
1519           enabled.
1520
1521   2.20, 30-Nov-09 (A collection of minor bugfixes, plus a build against
1522       Fedora 12.)
1523       BUG FIX
1524           Logging for dbmapreduce with code refs is now stable (it no longer
1525           includes a hex pointer to the code reference).
1526
1527       BUG FIX
1528           Better handling of mixed blank lines in Fsdb::IO::Reader (see test
1529           case dbcolize_blank_lines.cmd).
1530
1531       BUG FIX
1532           html_table_to_db now handles multi-line input better, and handles
1533           tables with COLSPAN.
1534
1535       BUG FIX
1536           dbpipeline now cleans up threads in an "eval" to prevent "cannot
1537           detach a joined thread" errors that popped up in perl-5.10.
1538           Hopefully this prevents a race condition that causes the test
1539           suites to hang about 20% of the time (in dbpipeline_first_fails).
1540
1541       IMPROVEMENT
1542           dbmapreduce now detects and correctly fails when the input and
1543           reducer have incompatible field separators.
1544
1545       IMPROVEMENT
1546           dbcolstats, dbcolhisto, dbcolscorrelate, dbcolsregression, and
1547           dbrowcount now all take an "-F" option to let one specify the
1548           output field separator (so they work better with dbmapreduce).
1549
1550       BUG FIX
1551           An omitted "-k" from the manual page of dbmultistats is now there.
1552           Bug reported by Unkyu Park.
1553
1554   2.21, 17-Apr-10 bug fix release
1555       BUG FIX
1556           Fsdb::IO::Writer now no longer fails with -outputheader => never
1557           (an obscure bug).
1558
1559       IMPROVEMENT
1560           Fsdb (in the warnings section) and dbcolstats now more carefully
1561           document how they handle (and do not handle) numerical precision
1562           problems, and other general limits.  Thanks to Yuri Pradkin for
1563           prompting this documentation.
1564
1565       IMPROVEMENT
1566           "Fsdb::Support::fullname_to_sortkey" is now restored from "Jdb".
1567
1568       IMPROVEMENT
1569           Documentation for multiple styles of input approaches (including
1570           performance description) added to Fsdb::IO.
1571
1572   2.22, 2010-10-31 One new tool dbcolcopylast and several bug fixes for Perl
1573       5.10.
1574       BUG FIX
1575           dbmerge now correctly handles n-way merges.  Bug reported by Yuri
1576           Pradkin.
1577
1578       INCOMPATIBLE CHANGE
1579           dbcolneaten now defaults to not padding the last column.
1580
1581       ADDITION
1582           dbrowenumerate now takes -N NewColumn to give the new column a name
1583           other than "count".  Feature requested by Mike Rouch in January
1584           2005.
1585
1586       ADDITION
1587           New program dbcolcopylast copies the last value of a column into a
1588           new column copylast_column of the next row.  New program requested
1589           by Fabio Silva; useful for converting dbmultistats output into
1590           dbrvstatdiff input.
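
The behavior described above can be sketched in Python as follows.  This is an
illustration only, not dbcolcopylast's implementation: rows are modeled as
dictionaries, and using "-" as the value for the first row (which has no
predecessor) is an assumption.

```python
def copy_last(rows, column, empty="-"):
    """Return copies of `rows` with a copylast_<column> field added.

    Each output row carries the value that `column` had in the previous
    row.  Assumption: the first row gets `empty`, since nothing precedes it.
    """
    out = []
    previous = empty
    for row in rows:
        new_row = dict(row)                      # do not modify the input
        new_row["copylast_" + column] = previous
        previous = row[column]
        out.append(new_row)
    return out
```

Having the previous row's value alongside the current one is what allows a
later row-by-row comparison of adjacent rows.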
1591
1592       BUG FIX
1593           Several tools (particularly dbmapreduce and dbmultistats) would
1594           report errors like "Unbalanced string table refcount: (1) for
1595           "STDOUT" during global destruction" on exit, at least on certain
1596           versions of Perl (for me on 5.10.1), but similar errors have been
1597           off-and-on for several Perl releases.  Although I think my code
1598           looked OK, I worked around this problem with a different way of
1599           handling standard IO redirection.
1600
1601   2.23, 2011-03-10 Several small portability bugfixes; improved dbcolstats
1602       for large datasets
1603       IMPROVEMENT
1604           Documentation to dbrvstatdiff was changed to use "sd" to refer to
1605           standard deviation, not "ss" (which might be confused with sum-of-
1606           squares).
1607
1608       BUG FIX
1609           This documentation about dbmultistats was missing the -k option in
1610           some cases.
1611
1612       BUG FIX
1613           dbmapreduce was failing on MacOS-10.6.3 for some tests with the
1614           error
1615
1616               dbmapreduce: cannot run external dbmapreduce reduce program (perl TEST/dbmapreduce_external_with_key.pl)
1617
1618           The problem seemed to be only in the error, not in operation.  On
1619           MacOS, the error is now suppressed.  Thanks to Alefiya Hussain for
1620           providing access to a Mac system that allowed debugging of this
1621           problem.
1622
1623       IMPROVEMENT
1624           The csv_to_db command requires an external Perl library
1625           (Text::CSV_XS).  On computers that lack this optional library,
1626           previously Fsdb would configure with a warning and then test cases
1627           would fail.  Now those test cases are skipped with an additional
1628           warning.
1629
1630       BUG FIX
1631           The test suite now supports alternative valid output, as a hack to
1632           account for last-digit floating point differences.  (Not very
1633           satisfying :-( )
1634
1635       BUG FIX
1636           dbcolstats output for confidence intervals on very large datasets
1637           has changed.  Previously it failed for more than 2^31-1 records,
1638           and handling of T-Distributions with thousands of rows was a bit
1639           dubious.  Now datasets with more than 10000 rows are considered
1640           infinitely large and hopefully correctly handled.
1641
1642   2.24, 2011-04-15 Improvements to fix an old bug in dbmapreduce with
1643       different field separators
1644       IMPROVEMENT
1645           The dbfilealter command had a "--correct" option to work around
1646           incompatible field separators, but it did nothing.  Now it
1647           does the correct but sad, data-losing thing.
1648
1649       IMPROVEMENT
1650           The dbmultistats command previously failed with an error message
1651           when invoked on input with a non-default field separator.  The root
1652           cause was the underlying dbmapreduce that did not handle the case
1653           of reducers that generated output with a different field separator
1654           than the input.  We now detect and repair incompatible field
1655           separators.  This change corrects a problem originally documented
1656           and detected in Fsdb-2.20.  Bug re-reported by Unkyu Park.
1657
1658   2.25, 2011-08-07 Two new tools, xml_to_db and dbfilepivot, and a bugfix for
1659       two people.
1660       IMPROVEMENT
1661           kitrace_to_db now supports a --utc option, which also fixes this
1662           test case for users outside of the Pacific time zone.  Bug reported
1663           by David Graff, and also by Peter Desnoyers (within a week of each
1664           other :-)
1665
1666       NEW xml_to_db can convert simple, very regular XML files into Fsdb.
1667
1668       NEW dbfilepivot "pivots" a file, converting multiple rows corresponding
1669           to the same entity into a single row with multiple columns.
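
As a rough illustration of what "pivoting" means here, the Python sketch below
turns (entity, attribute, value) rows into one row per entity.  It is not
dbfilepivot's code or interface: the function and parameter names are
hypothetical, and filling missing cells with "-" is an assumption.

```python
def pivot(rows, key, attribute, value, empty="-"):
    """Collapse many rows per entity into one row with a column per attribute.

    `rows` is a list of dicts.  Columns appear in the order their attribute
    names are first seen.  Assumption: cells with no input are `empty`.
    """
    columns = []   # attribute names, in first-seen order
    by_key = {}    # entity -> {attribute: value}
    for row in rows:
        if row[attribute] not in columns:
            columns.append(row[attribute])
        by_key.setdefault(row[key], {})[row[attribute]] = row[value]
    return [
        {key: k, **{c: cells.get(c, empty) for c in columns}}
        for k, cells in by_key.items()
    ]
```

For example, three input rows describing the height and width of entity "a"
and the height of entity "b" become two output rows, with the width of "b"
left as the empty value.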
1670
1671   2.26, 2011-12-12 Bug fixes, particularly for perl-5.14.2.
1672       BUG FIX
1673           Bugs fixed in Fsdb::IO::Reader(3) manual page.
1674
1675       BUG FIX
1676           Fixed problems where dbcolstats was truncating floating point
1677           numbers when sorting.  This strange behavior happens as of
1678           perl-5.14.2 and it seems like a Perl bug.  I've worked around it
1679           for the test suites, but I'm a bit nervous.
1680
1681   2.27, 2012-11-15 Accumulated bug fixes.
1682       IMPROVEMENT
1683           csv_to_db now reports errors in CSV input with real diagnostics.
1684
1685       IMPROVEMENT
1686           dbcolmovingstats can now compute median, when given the "-m"
1687           option.
1688
1689       BUG FIX
1690           dbcolmovingstats non-numeric handling (the "-a" option) now works
1691           properly.
1692
1693       DOCUMENTATION
1694           The internal t/test_command.t test framework is now documented.
1695
1696       BUG FIX
1697           dbrowuniq now correctly handles the case where there is no input
1698           (previously it output a blank line, which is a malformed fsdb
1699           file).  Thanks to Yuri Pradkin for reporting this bug.
1700
1701   2.28, 2012-11-15 A quick release to fix most rpmlint errors.
1702       BUG FIX
1703           Fixed a number of minor release problems (wrong permissions, old
1704           FSF address, etc.) found by rpmlint.
1705
1706   2.29, 2012-11-20 a quick release for CPAN testing
1707       IMPROVEMENT
1708           Tweaked the RPM spec.
1709
1710       IMPROVEMENT
1711           Modified Makefile.PL to fail gracefully on Perl installations that
1712           lack threads.  (Without this fix, I get massive failures in the
1713           non-ithreads test system.)
1714
1715   2.30, 2012-11-25 improvements to perl portability
1716       BUG FIX
1717           Removed unicode character in documentation of dbcolscorrelate so pod
1718           tests will pass.  (Sigh, that should work :-( )
1719
1720       BUG FIX
1721           Fixed test suite failures on 5 tests (dbcolcreate_double_creation
1722           was the first) due to Carp's addition of a period.  This problem
1723           was breaking Fsdb on perl-5.17.  Thanks to Michael McQuaid for
1724           helping diagnose this problem.
1725
1726       IMPROVEMENT
1727           The test suite now prints out the names of tests it tries.
1728
1729   2.31, 2012-11-28 A release with actual improvements to dbfilepivot and
1730       dbrowuniq.
1731       BUG FIX
1732           Documentation fixes: typos in dbcolscorrelate, bugs in
1733           dbfilepivot, clarification for comment handling in
1734           Fsdb::IO::Reader.
1735
1736       IMPROVEMENT
1737           Previously dbfilepivot assumed the input was grouped by keys and
1738           didn't verify that pre-condition.  Now there is no pre-condition (it
1739           will sort the input by default), and it checks if the invariant is
1740           violated.
1741
1742       BUG FIX
1743           Previously dbfilepivot failed if the input had comments (oops :-);
1744           no longer.
1745
1746       IMPROVEMENT
1747           Now dbrowuniq has the "-L" option to preserve the last unique row
1748           (instead of the first), a common idiom.
1749
1750   2.32, 2012-12-21 Test suites should now be more numerically robust.
1751       NEW New dbfilediff does fsdb-aware file differencing.  It does not do
1752           smart intuition of add/removes like Unix diff(1), but it does know
1753           about columns, and with "-E", it does numeric-aware differences.
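
To make "numeric-aware differences" concrete, here is a small Python sketch of
the general idea: cells that both parse as numbers are compared with a
tolerance, and everything else is compared as text.  This is not dbfilediff's
algorithm; the helper names and the relative tolerance of 1e-6 are assumptions
chosen for the example.

```python
from math import isclose


def cells_differ(a, b, rel_tol=1e-6):
    """Compare two cells numerically if both parse as floats, else as text."""
    try:
        return not isclose(float(a), float(b), rel_tol=rel_tol)
    except ValueError:
        return a != b


def diff_rows(left, right, rel_tol=1e-6):
    """Return the indexes of rows in which at least one column differs.

    Rows are lists of strings.  zip() stops at the shorter input, so this
    sketch does not report added or removed rows.
    """
    return [
        i
        for i, (l, r) in enumerate(zip(left, right))
        if any(cells_differ(x, y, rel_tol) for x, y in zip(l, r))
    ]
```

Under this sketch, "1.0000000" and "1.0000001" count as equal, while "2.0" and
"2.5" do not.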
1754
1755       IMPROVEMENT
1756           Test suites that are numeric now use dbfilediff to do numeric-aware
1757           comparisons, so the test suite should now be robust to slightly
1758           different computers and operating systems and compilers than
1759           exactly what I use.
1760
1761   2.33, 2012-12-23 Minor fixes to some test cases.
1762       IMPROVEMENT
1763           dbfilediff and dbrowuniq now support the "-N" option to give the
1764           new column a different name.  (And test cases where this
1765           duplication mattered have been fixed.)
1766
1767       IMPROVEMENT
1768           dbrvstatdiff now shows the t-test breakpoint with a reasonable
1769           number of floating point digits.
1770
1771       BUG FIX
1772           Fixed a numerical stability problem in the dbroweval_last test
1773           case.
1774

WHAT'S NEW

1776   2.34, 2013-02-10 Parallelism in dbmerge.
1777       IMPROVEMENT
1778           Documentation for dbjoin now includes resource requirements.
1779
1780       IMPROVEMENT
1781           Default memory usage for dbsort is now about 256MB.  (The world
1782           keeps moving forward.)
1783
1784       IMPROVEMENT
1785           dbmerge now does merging in parallel.  As a side-effect, dbsort
1786           should be faster when input overflows memory.  The level of
1787           parallelism can be limited with the "--parallelism" option.  (There
1788           is more work to do here, but we're off to a start.)
1789
1790   2.35, 2013-02-23 Improvements to dbmerge parallelism
1791       BUG FIX
1792           Fsdb temporary files are now created more securely (with
1793           File::Temp).
1794
1795       IMPROVEMENT
1796           Programs that sort or merge on fields (dbmerge2, dbmerge, dbsort,
1797           dbjoin) now report an error if no fields on which to join or merge
1798           are given.
1799
1800       IMPROVEMENT
1801           Parallelism in dbmerge should now be more consistent, with less
1802           starting and stopping.
1803
1804       IMPROVEMENT In dbmerge, the "--xargs" option lets one give input
1805       filenames on standard input, rather than the command line. This feature
1806       paves the way for faster dbsort for large inputs (by pipelining sorting
1807       and merging), expected in the next release.
1808
1809   2.36, 2013-02-25 dbsort pipelines with dbmerge
1810       IMPROVEMENT For large inputs, dbsort now pipelines sorting and merging,
1811       allowing earlier processing.
1812       BUG FIX Since 2.35, dbmerge delayed cleanup of intermediate files,
1813       thereby requiring extra disk space.
1814
1815   2.37, 2013-02-26 quick bugfix to support parallel sort and merge from
1816       recent releases
1817       BUG FIX Since 2.35, dbmerge delayed removal of input files given by
1818       "--xargs".  This problem is now fixed.
1819
1820   2.38, 2013-04-29 minor bug fixes
1821       CLARIFICATION
1822           Configure now rejects Windows since tests seem to hang on some
1823           versions of Windows.  (I would love help from a Windows developer
1824           to get this problem fixed, but I cannot do it.)  See
1825           https://rt.cpan.org/Ticket/Display.html?id=84201.
1826
1827       IMPROVEMENT
1828           All programs that use temporary files (dbcolpercentile,
1829           dbcolscorrelate, dbcolstats, dbcolstatscores) now take the "-T"
1830           option and set the temporary directory consistently.
1831
1832           In addition, error messages are better when the temporary directory
1833           has problems.  Problem reported by Liang Zhu.
1834
1835       BUG FIX
1836           dbmapreduce was failing with external, map-reduce aware reducers
1837           (when invoked with -M and an external program).  (Sigh, did this
1838           case ever work?)  This case should now work.  Thanks to Yuri
1839           Pradkin for reporting this bug (in 2011).
1840
1841       BUG FIX
1842           Fixed perl-5.10 problem with dbmerge.  Thanks to Yuri Pradkin for
1843           reporting this bug (in 2013).
1844
1845   2.39, 2013-05-31 quick release for the dbrowuniq extension
1846       BUG FIX
1847           Actually in 2.38, the Fedora .spec got cleaner dependencies.
1848           Suggestion from Christopher Meng via
1849           <https://bugzilla.redhat.com/show_bug.cgi?id=877096>.
1850
1851       ENHANCEMENT
1852           Fsdb files are now explicitly set into UTF-8 encoding, unless one
1853           specifies "-encoding" to "Fsdb::IO".
1854
1855       ENHANCEMENT
1856           dbrowuniq now supports "-I" for incremental counting.
1857
1858   2.40, 2013-07-13 small bug fixes
1859       BUG FIX
1860           dbsort now has more respect for a user-given temporary directory;
1861           it no longer is ignored for merging.
1862
1863       IMPROVEMENT
1864           dbrowuniq now has options to output the first, last, and both first
1865           and last rows of a run ("-F", "-L", and "-B").
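
The difference between keeping the first, last, or both rows of a run can be
sketched in Python as below.  This is an illustration of the concept, not
dbrowuniq's code: the function name and the mode strings are hypothetical, and
a "run" is taken to mean consecutive rows with an equal key.

```python
from itertools import groupby


def uniq_runs(rows, key, mode="first"):
    """Keep the first, last, or both end rows of each run of equal keys.

    A run is a maximal group of consecutive rows for which key(row) is
    equal.  In "both" mode a run of length one is emitted only once.
    """
    out = []
    for _, run in groupby(rows, key=key):
        run = list(run)
        if mode == "first":
            out.append(run[0])
        elif mode == "last":
            out.append(run[-1])
        elif mode == "both":
            out.append(run[0])
            if len(run) > 1:
                out.append(run[-1])
        else:
            raise ValueError("mode must be 'first', 'last', or 'both'")
    return out
```

Because only consecutive rows are grouped, a key that reappears later in
unsorted input starts a new run.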
1866
1867       BUG FIX
1868           dbrowuniq now correctly handles "-N".  Sigh, it didn't work before.
1869
1870   2.41, 2013-07-29 small bug and packaging fixes
1871       ENHANCEMENT
1872           Documentation to dbrvstatdiff improved (inspired by questions from
1873           Qian Kun).
1874
1875       BUG FIX
1876           dbrowuniq no longer duplicates singleton unique lines when
1877           outputting both (with "-B").
1878
1879       BUG FIX
1880           Add missing "XML::Simple" dependency to Makefile.PL.
1881
1882       ENHANCEMENT
1883           Tests now show the diff of the failing output if run with "make
1884           test TEST_VERBOSE=1".
1885
1886       ENHANCEMENT
1887           dbroweval now includes documentation for how to output extra rows.
1888           Suggestion from Yuri Pradkin.
1889
1890       BUG FIX
1891           Several improvements to the Fedora package from Michael Schwendt
1892           via <https://bugzilla.redhat.com/show_bug.cgi?id=877096>, and from
1893           the harsh master that is rpmlint.  (I am stymied at teaching it
1894           that "outliers" is spelled correctly.  Maybe I should send it
1895           Schneier's book.  And an unresolvable invalid-spec-name lurks in
1896           the SRPM.)
1897
1898   2.42, 2013-07-31 A bug fix and packaging release.
1899       ENHANCEMENT
1900           Documentation to dbjoin improved to better describe memory usage.  (Based on
1901           problem report by Lin Quan.)
1902
1903       BUG FIX
1904           The .spec is now perl-Fsdb.spec to satisfy rpmlint.  Thanks to
1905           Christopher Meng for a specific bug report.
1906
1907       BUG FIX
1908           Test dbroweval_last.cmd no longer has a column that caused failures
1909           because of numerical instability.
1910
1911       BUG FIX
1912           Some tests now better handle bugs in old versions of perl (5.10,
1913           5.12).  Thanks to Calvin Ardi for help debugging this on a Mac with
1914           perl-5.12, but the fix should affect other platforms.
1915
1916   2.43, 2013-08-27 Adds in-file compression.
1917       BUG FIX
1918           Changed the sort on TEST/dbsort_merge.cmd to strings (from
1919           numerics) so we're less susceptible to false test-failures due to
1920           floating point IO differences.
1921
1922       EXPERIMENTAL ENHANCEMENT
1923           Yet more parallelism in dbmerge: new "endgame-mode" builds a merge
1924           tree of processes at the end of large merge tasks to get maximal
1925           parallelism.  Currently this feature is off by default because it
1926           can hang for some inputs.  Enable this experimental feature with
1927           "--endgame".
1928
1929       ENHANCEMENT
1930           "Fsdb::IO" now handles being given "IO::Pipe" objects (as exercised
1931           by dbmerge).
1932
1933       BUG FIX
1934           Handling of NamedTmpfiles now supports concurrency.  This fix will
1935           hopefully fix occasional "Use of uninitialized value $_ in string
1936           ne at ...NamedTmpfile.pm line 93."  errors.
1937
1938       BUG FIX
1939           Fsdb now requires perl 5.10.  This is a bug fix because some test
1940           cases used to require it, but this fact was not properly
1941           documented.  (Back-porting to 5.008 would require removing all "//"
1942           operators.)
1943
1944       ENHANCEMENT
1945           Fsdb now handles automatic compression of file contents.  Enable
1946           compression with "dbfilealter -Z xz" (or "gz" or "bz2").  All
1947           programs should operate on compressed files and leave the output
1948           with the same level of compression.  "xz" is recommended as fastest
1949           and most efficient.  "gz" produces unrepeatable output (and so
1950           has no output test); it seems to insist on adding a timestamp.
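
The gzip timestamp mentioned above is part of the file format: the header
carries a four-byte modification time.  The Python sketch below, using only
the standard gzip module, shows that identical data compressed with different
timestamps yields different bytes.  It illustrates the format, not how Fsdb
invokes compression; the helper names are made up for the example.

```python
import gzip


def compress_gz(data, mtime):
    """Compress `data` as gzip with an explicit header timestamp."""
    return gzip.compress(data, mtime=mtime)


def gz_mtime(blob):
    """Read the little-endian MTIME field from bytes 4..7 of a gzip header."""
    return int.from_bytes(blob[4:8], "little")
```

Fixing the timestamp (for example to 0) makes the output repeatable; leaving
it at the current time does not.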
1951
1952   2.44, 2013-10-02 A major change--all threads are gone.
1953       ENHANCEMENT
1954           Fsdb is now thread free and only uses processes for parallelism.
1955           This change is a big change--the entire motivation for Fsdb-2 was
1956           to exploit parallelism via threading.  Parallelism--good, but perl
1957           threading--bad for performance.  Horribly bad for performance.
1958           About 20x worse than pipes on my box.  (See perl bug #119445 for
1959           the discussion.)
1960
1961       NEW "Fsdb::Support::Freds" provides a thread-like abstraction over
1962           forking, with some nice support for callbacks in the parent upon
1963           child termination.
1964
1965       ENHANCEMENT
1966           Details about removing threads: "dbpipeline" is thread free, and
1967           new tests verify each of its parts.  The easy cases are
1968           "dbcolpercentile", "dbcolstats", "dbfilepivot", "dbjoin", and
1969           "dbcolstatscores", each of which uses it in simple ways
1970           (2013-09-09).  "dbmerge" is now thread free (2013-09-13), but was a
1971           significant rewrite, which brought "dbsort" along.  "dbmapreduce"
1972           is partly thread free (2013-09-21), again as a rewrite, and it
1973           brings "dbmultistats" along.  Full "dbmapreduce" support took much
1974           longer (2013-10-02).
1975
1976       BUG FIX
1977           When running with user-only output ("-n"), dbroweval now resets the
1978           output vector $ofref after it has been output.
1979
1980       NEW dbcolcreate will create all columns at the head of each row with
1981           the "--first" option.
1982
1983       NEW dbfilecat will concatenate two files, verifying that they have the
1984           same schema.
1985
1986       ENHANCEMENT
1987           dbmapreduce now passes comments through, rather than eating them as
1988           before.
1989
1990           Also, dbmapreduce now supports a "--" option to prevent
1991           misinterpreting sub-program parameters as for dbmapreduce.
1992
1993       INCOMPATIBLE CHANGE
1994           dbmapreduce no longer figures out if it needs to add the key to the
1995           output.  For multi-key-aware reducers, it never does (and cannot).
1996           For non-multi-key-aware reducers, it defaults to add the key and
1997           will now fail if the reducer adds the key (with error "dbcolcreate:
1998           attempt to create pre-existing column...").  In such cases, one
1999           must disable adding the key with the new option "--no-prepend-key".
2000
2001       INCOMPATIBLE CHANGE
2002           dbmapreduce no longer copies the input field separator by default.
2003           For multi-key-aware reducers, it never does (and cannot).  For non-
2004           multi-key-aware reducers, it defaults to not copying the field
2005           separator, but it will copy it (the old default) with the
2006           "--copy-fs" option.
2007
2008   2.45, 2013-10-07 cleanup from de-thread-ification
2009       BUG FIX
2010           Corrected a fast busy-wait in dbmerge.
2011
2012       ENHANCEMENT
2013           Endgame mode enabled in dbmerge; it (and also large cases of
2014           dbsort) should now exploit greater parallelism.
2015
2016       BUG FIX
2017           Test case with "Fsdb::BoundedQueue" (gone since 2.44) now removed.
2018
2019   2.46, 2013-10-08 continuing cleanup of our no-threads version
2020       BUG FIX
2021           Fixed some packaging details.  (Really, threads are no longer
2022           required, missing tests in the MANIFEST.)
2023
2024       IMPROVEMENT
2025           dbsort now better communicates with the merge process to avoid
2026           bursty parallelism.
2027
2028           Fsdb::IO::Writer now can take "-autoflush => 1" for line-buffered
2029           IO.
2030
2031   2.47, 2013-10-12 test suite cleanup for non-threaded perls
2032       BUG FIX
2033           Removed some stray "use threads" in some test cases.  We didn't
2034           need them, and these were breaking non-threaded perls.
2035
2036       BUG FIX
2037           Better handling of Fred cleanup; should fix intermittent
2038           dbmapreduce failures on BSD.
2039
2040       ENHANCEMENT
2041           Improved test framework to show output when tests fail.  (This
2042           time, for real.)
2043
2044   2.48, 2014-01-03 small bugfixes and improved release engineering
2045       ENHANCEMENT
2046           Test suites now skip tests for libraries that are missing.  (Patch
2047           for missing "IO::Compress::Xz" contributed by Calvin Ardi.)
2048
2049       ENHANCEMENT
2050           Removed references to Jdb in the package specification.  Since the
2051           name was changed in 2008, there's no longer a huge need for
2052           backwards compatibility.  (Suggestion from Petr Šabata.)
2053
2054       ENHANCEMENT
2055           Test suites now invoke the perl using the path from
2056           $Config{perlpath}.  Hopefully this helps testing in environments
2057           where there are multiple installed perls and the default perl is
2058           not the same as the perl-under-test (as happens in
2059           cpantesters.org).
2060
2061       BUG FIX
2062           Added specific encoding to this manpage to account for Unicode.
2063           Required to build correctly against perl-5.18.
2064
2065   2.49, 2014-01-04 bugfix to unicode handling in Fsdb IO (plus minor
2066       packaging fixes)
2067       BUG FIX
2068           Restored a line in the .spec to chmod g-s.
2069
2070       BUG FIX
2071           Unicode decoding is now handled correctly for programs that read
2072           from standard input.  (Also: New test scripts cover unicode input
2073           and output.)
2074
2075       BUG FIX
2076           Fix to Fsdb documentation encoding line.  Addresses test failure in
2077           perl-5.16 and earlier.  (Who knew "encoding" had to be followed by
2078           a blank line.)
2079

WHAT'S NEW

2081   2.50, 2014-05-27 a quick release for spec tweaks
2082       ENHANCEMENT
2083           In dbroweval, the "-N" (no output, even comments) option now
2084           implies "-n", and it now suppresses the header and trailer.
2085
2086       BUG FIX
2087           A few more tweaks to the perl-Fsdb.spec from Petr Šabata.
2088
2089       BUG FIX
2090           Fixed 3 uses of "use v5.10" in test suites that were causing test
2091           failures (due to warnings, not real failures) on some platforms.
2092
2093   2.51, 2014-09-05 Feature enhancements to dbcolmovingstats, dbcolcreate,
2094       dbmapreduce, and new sqlselect_to_db
2095       ENHANCEMENT
2096           dbcolcreate now has a "--no-recreate-fatal" that causes it to
2097           ignore creation of existing columns (instead of failing).
2098
2099       ENHANCEMENT
2100           dbmapreduce once again is robust to reducers that output the key;
2101           "--no-prepend-key" is no longer mandatory.
2102
2103       ENHANCEMENT
2104           dbcolsplittorows can now enumerate the output rows with "-E".
2105
2106       BUG FIX
2107           dbcolmovingstats is more mathematically robust.  Previously for
2108           some inputs and some platforms, floating point rounding could
2109           sometimes cause square roots of negative numbers.
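
One common way this kind of failure arises, and one common guard against it,
is sketched below in Python.  Whether dbcolmovingstats uses this formula or
this fix is not stated here; the sum-of-squares formula and the clamp at zero
are assumptions made for illustration.

```python
from math import sqrt


def stddev_from_sums(n, total, total_sq):
    """Sample standard deviation from n, sum(x), and sum(x*x).

    When all values are (nearly) equal, rounding can make the computed
    variance a tiny negative number, and sqrt() of that raises an error.
    Clamping the variance at zero avoids the failure.
    """
    if n < 2:
        return None  # undefined for fewer than two samples
    variance = (total_sq - total * total / n) / (n - 1)
    return sqrt(max(variance, 0.0))
```

For the values 1, 2, 3 (n = 3, sum = 6, sum of squares = 14) the result is
1.0; for inputs whose rounded variance comes out slightly below zero, the
result is 0.0 instead of an exception.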
2110
2111       NEW sqlselect_to_db converts the output of the MySQL or MariaDB select
2112           command into fsdb format.
2113
2114       INCOMPATIBLE CHANGE
2115           dbfilediff now outputs the second row when doing sloppy numeric
2116           comparisons, to better support test suites.
2117
2118   2.52, 2014-11-03 Fixing the test suite for line number changes.
2119       ENHANCEMENT
2120           Test suites changed to be robust to exact line numbers of failures,
2121           since different Perl releases fail on different lines.
2122           <https://bugzilla.redhat.com/show_bug.cgi?id=1158380>
2123
2124   2.53, 2014-11-26 bug fixes and stability improvements to dbmapreduce
2125       ENHANCEMENT
2126           The dbfilediff command now supports a "--quiet" option.
2127
2128       ENHANCEMENT
2129           Better documentation of dbpipeline_filter.
2130
2131       BUGFIX
2132           Added groff-base and perl-podlators to the Fedora package spec.
2133           Fixes <https://bugzilla.redhat.com/show_bug.cgi?id=1163149>.  (Also
2134           in package 2.52-2.)
2135
2136       BUGFIX
2137           An important stability improvement to dbmapreduce.  It, plus
2138           dbmultistats, and dbcolstats now support controlled parallelism
2139           with the "--parallelism=N" option.  They default to run with the
2140           number of available CPUs.  dbmapreduce also moderates its level of
2141           parallelism.  Previously it would create reducers as needed,
2142           causing CPU thrashing if reducers ran much slower than data
2143           production.
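
The idea of capping the number of concurrent reducers, rather than starting
one per key as data arrives, can be sketched in Python as below.  This is not
dbmapreduce's implementation: the function name is hypothetical, and a thread
pool is used only to keep the example short (Fsdb itself uses processes as of
2.44).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def reduce_groups(groups, reducer, parallelism=None):
    """Apply `reducer` to each group with at most `parallelism` at once.

    `groups` maps a key to a list of values.  Assumption: when no limit is
    given, the number of available CPUs is used.
    """
    workers = parallelism or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The pool queues extra groups instead of starting more workers.
        results = pool.map(lambda kv: (kv[0], reducer(kv[1])), groups.items())
        return dict(results)
```

With a fixed pool, slow reducers cause work to queue up rather than causing
ever more reducers to compete for the CPUs.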
2144
2145       BUGFIX
2146           The combination of dbmapreduce with dbrowenumerate now works as it
2147           should.  (The obscure bug was an interaction with dbcolcreate with
2148           non-multi-key reducers that output their own key.  dbmapreduce has
2149           too many useful corner cases.)
2150
2151   2.54, 2014-11-28 fix for the test suite to correct failing tests on not-my-
2152       platform
2153       BUGFIX
2154           Sigh, the test suite now has a test suite.  Because, yes, I broke
2155           it, causing many incorrect failures at cpantesters.  Now fixed.
2156
2157   2.55, 2015-01-05 many spelling fixes and dbcolmovingstats tests are more
2158       robust to different numeric precision
2159       ENHANCEMENT
2160           dbfilediff now can be extra quiet, as I continue to try to track
2161           down a numeric difference on FreeBSD AMD boxes.
2162
2163       ENHANCEMENT
2164           dbcolmovingstats gave different test output (just reflecting
2165           rounding error) when stddev approaches zero.  We now detect and
2166           handle this case.  See
2167           <https://rt.cpan.org/Public/Bug/Display.html?id=101220> and thanks
2168           to H. Merijn Brand for the bug report.
2169
2170       BUG FIX
2171           Many, many spelling bugs found by H. Merijn Brand; thanks for the
2172           bug report.
2173
2174       INCOMPATIBLE CHANGE
2175           A number of programs had misspelled "separator" in
2176           "--fieldseparator" and "--columnseparator" options as "seperator".
2177           These are now correctly spelled.
2178
2179   2.56, 2015-02-03 fix against Getopt::Long-2.43's stricter error checking
2180       BUG FIX
2181           Internal argument parsing uses Getopt::Long, but mixed pass-through
2182           and <>.  Bug reported by Petr Pisar at
2183           <https://bugzilla.redhat.com/show_bug.cgi?id=1188538>.
2184
2185       BUG FIX
2186           Added missing BuildRequires for "XML::Simple".
2187
2188   2.57, 2015-04-29 Minor changes, with better performance from dbmultistats.
2189       BUG FIX
2190           dbfilecat now honors "--remove-inputs" (previously it didn't).
2191           This omission meant that dbmapreduce (and dbmultistats) would
2192           accumulate files in /tmp when running.  Bad news for inputs with 4M
2193           keys.
2194
2195       ENHANCEMENT
2196           dbmultistats should be faster with lots of small keys.  dbcolstats
2197           now supports "-k" to get some of the functionality of dbmultistats
2198           (if data is pre-sorted and median/quartiles are not required).
2199
2205   2.58, 2015-04-30 Bugfix in dbmerge
2206       BUG FIX
2207           Fixed a case where dbmerge suffered mojibake in endgame mode.  This
2208           bug surfaced when dbsort was applied to large files (big enough to
2209           require merging) with unicode in them; the symptom was something
2210           like:
2211             Wide character in print at /usr/lib64/perl5/IO/Handle.pm line
2212           420, <GEN12> line 111.
2213
2214   2.59, 2016-09-01 Collect a few small bug fixes and documentation
2215       improvements.
2216       BUG FIX
2217           More IO is explicitly marked UTF-8 to avoid Perl's tendency to
2218           mojibake on otherwise valid unicode input.  This change helps
2219           html_table_to_db.
2220
2221       ENHANCEMENT
2222           dbcolscorrelate now cross-references dbcolsregression.
2223
2224       ENHANCEMENT
2225           Documentation for dbrowdiff now clarifies that the default is
2226           baseline mode.
2227
2228       BUG FIX
2229           dbjoin now propagates "-T" into the sorting process (if it is
2230           required).  Thanks to Lan Wei for reporting this bug.
2231
2232   2.60, 2016-09-04 Adds support for hash joins.
2233       ENHANCEMENT
2234           dbjoin now supports hash joins with "-t lefthash" and "-t
2235           righthash".  Hash joins cache a table in memory, but do not require
2236           that the other table be sorted.  They are ideal when joining a
2237           large table against a small one.
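
           The idea behind a hash join can be sketched in a few lines of shell
           and awk.  This is only an illustration of the technique, not
           dbjoin's implementation: the function name hash_join, the
           two-column layout, and the choice to drop "#" lines rather than
           emit a merged header are all assumptions made for the example.

```shell
# Illustrative only: a hash join in awk, not how dbjoin is implemented.
# Usage: hash_join SMALL LARGE
# Both files are whitespace-separated, with the join key in column 1 and one
# value in column 2.  Lines starting with "#" (Fsdb headers and trailing
# comments) are skipped, so the output carries no header.
hash_join() {
    awk '
        # Skip header and comment lines in both inputs.
        /^#/ { next }
        # First file: cache key -> value in memory (a later duplicate key wins).
        FILENAME == ARGV[1] { cache[$1] = $2; next }
        # Second file: stream rows in any order and probe the cache.
        ($1 in cache) { print $1, $2, cache[$1] }
    ' "$1" "$2"
}
```

           Only the first (small) file is held in memory; the second is read
           once, in whatever order it arrives, which is why no sort is needed.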
2238
2239   2.61, 2016-09-05 Support left and right outer joins.
2240       ENHANCEMENT
2241           dbjoin now handles left and right outer joins with "-t left" and
2242           "-t right".
2243
2244       ENHANCEMENT
2245           dbjoin hash joins are now selected with "-m lefthash" and "-m
2246           righthash" (not the short-lived "-t righthash" option).
2247           (Technically this change is incompatible with Fsdb-2.60, but no one
2248           but me ever used that version.)
2249
2250   2.62, 2016-11-29 A new yaml_to_db and other minor improvements.
2251       ENHANCEMENT
2252           Documentation for xml_to_db now includes sample output.
2253
2254       NEW yaml_to_db converts a specific form of YAML to fsdb.
2255
2256       BUG FIX
2257           The test suite now uses "diff -c -b" rather than "diff -cb" to make
2258           OpenBSD-5.9 happier, I hope.
2259
2260       ENHANCEMENT
2261           Comments that log operations at the end of each file now do simple
2262           quoting of spaces.  (It is not guaranteed to be fully shell-
2263           compliant.)
2264
2265       ENHANCEMENT
2266           There is a new standard option, "--header", allowing one to specify
2267           an Fsdb header for inputs that lack it.  Currently it is supported
2268           by dbcoldefine, dbrowuniq, dbmapreduce, dbmultistats, dbsort,
2269           dbpipeline.
2270
2271       ENHANCEMENT
2272           dbfilepivot now allows the --possible-pivots option, and if it is
2273           provided processes the data in one pass.
2274
2275       ENHANCEMENT
2276           dbroweval logs are now quoted.
2277
2278   2.63, 2017-02-03 Re-add some features supposedly in 2.62 but not, and add
2279       more --header options.
2280       ENHANCEMENT
2281           The option -j is now a synonym for --parallelism.  (And several
2282           documentation bugs about this option are fixed.)
2283
2284       ENHANCEMENT
2285           Additional support for "--header" in dbcolmerge, dbcol, dbrow, and
2286           dbroweval.
2287
2288       BUG FIX
2289           Version 2.62 was supposed to have this improvement, but did not
2290           (and now does): dbfilepivot now allows the --possible-pivots
2291           option, and if it is provided processes the data in one pass.
2292
2293       BUG FIX
2294           Version 2.62 was supposed to have this improvement, but did not
2295           (and now does): dbroweval logs are now quoted.
2296
2297   2.64, 2017-11-20 several small bugfixes and enhancements
2298       BUG FIX
2299           In dbroweval, the "next row" option previously did not correctly
2300           set up "_last_fieldname".  It now does.
2301
2302       ENHANCEMENT
2303           The csv_to_db converter now has an optional "-F x" option to set
2304           the field separator.
2305
2306       ENHANCEMENT
2307           Finally dbcolsplittocols has a "--header" option, and a new "-N"
2308           option to give the list of resulting output columns.
2309
2310       INCOMPATIBLE CHANGE
2311           Now dbcolstats and dbmultistats produce no output (but a schema)
2312           when given no input but a schema.  Previously they gave a null row
2313           of output.  The "--output-on-no-input" and
2314           "--no-output-on-no-input" options can control this behavior.
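
           The difference between the two behaviors can be illustrated with a
           small awk sketch.  It is not dbcolstats itself: the function
           col_mean, its "always" argument, the output columns, and the use of
           "-" as the placeholder in the null row are assumptions chosen for
           the example.

```shell
# Illustrative only: a one-column mean that mimics the two behaviors.
# Usage: col_mean [always] < input
# With no argument, empty input yields just the schema line (new behavior).
# With "always", empty input also yields a null row (old behavior); the
# "- 0" row format is a placeholder chosen for this sketch.
col_mean() {
    awk -v always="${1:-}" '
        BEGIN { print "#fsdb mean n" }    # the schema is always written
        /^#/  { next }                    # skip input header and comments
        { sum += $1; n++ }
        END {
            if (n > 0)
                printf "%.2f %d\n", sum / n, n
            else if (always == "always")
                print "- 0"
        }'
}
```

           A downstream script that counts data rows sees zero rows in the
           first mode and one row in the second, which is why the change is
           flagged as incompatible.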
2315
2316   2.65, 2018-02-16 Minor release, bug fix and -F option.
2317       ENHANCEMENT
2318           dbmultistats and dbmapreduce now both take a "-F x" option to set
2319           the field separator.
2320
2321       BUG FIX
2322           Fixed missing "use Carp" in dbcolstats.  Also went back and cleaned
2323           up all uses of "croak()".  Thanks to Zefram for the bug report.
2324

AUTHOR

2326       John Heidemann, "johnh@isi.edu"
2327
2328       See "Contributors" for the many people who have contributed bug reports
2329       and fixes.
2330

COPYRIGHT

2332       Fsdb is Copyright (C) 1991-2016 by John Heidemann <johnh@isi.edu>.
2333
2334       This program is free software; you can redistribute it and/or modify it
2335       under the terms of version 2 of the GNU General Public License as
2336       published by the Free Software Foundation.
2337
2338       This program is distributed in the hope that it will be useful, but
2339       WITHOUT ANY WARRANTY; without even the implied warranty of
2340       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
2341       General Public License for more details.
2342
2343       You should have received a copy of the GNU General Public License along
2344       with this program; if not, write to the Free Software Foundation, Inc.,
2345       675 Mass Ave, Cambridge, MA 02139, USA.
2346
2347       A copy of the GNU General Public License can be found in the file
2348       ``COPYING''.
2349

COMMENTS and BUG REPORTS

2351       Any comments about these programs should be sent to John Heidemann
2352       "johnh@isi.edu".
2353
2354
2355
2356perl v5.28.1                      2018-12-21                           Fsdb(3)