Fsdb(3) User Contributed Perl Documentation Fsdb(3)

Fsdb - a flat-text database for shell scripting

Fsdb, the flatfile streaming database, is a package of commands for
manipulating flat-ASCII databases from shell scripts. Fsdb is useful
for processing medium amounts of data (with very little data you'd do
it by hand, with megabytes you might want a real database). Fsdb was
known as Jdb from 1991 to Oct. 2008.
14
15 Fsdb is very good at doing things like:
16
17 · extracting measurements from experimental output
18
19 · examining data to address different hypotheses
20
21 · joining data from different experiments
22
23 · eliminating/detecting outliers
24
25 · computing statistics on data (mean, confidence intervals,
26 correlations, histograms)
27
28 · reformatting data for graphing programs
29
30 Fsdb is built around the idea of a flat text file as a database. Fsdb
31 files (by convention, with the extension .fsdb), have a header
32 documenting the schema (what the columns mean), and then each line
33 represents a database record (or row).
34
35 For example:
36
37 #fsdb experiment duration
38 ufs_mab_sys 37.2
39 ufs_mab_sys 37.3
40 ufs_rcp_real 264.5
41 ufs_rcp_real 277.9
42
This is a simple file with four experiments (the rows), each with a
description and a run time in the first and second columns.
46
Rather than hand-code scripts to do each special case, Fsdb provides
higher-level functions. Although it's often easy to throw together a
custom script to do any single task, I believe that there are several
advantages to using Fsdb:
51
52 · these programs provide a higher level interface than plain Perl, so
53
54 ** Fewer lines of simpler code:
55
    dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
57
58 Picks out just one type of experiment and computes statistics
59 on it, rather than:
60
    while (<>) { split; $sum+=$F[1]; $ss+=$F[1]**2; $n++; }
    $mean = $sum / $n; $std_dev = ...
63
64 in dozens of places.
65
66 · the library uses names for columns, so
67
68 ** No more $F[1], use "_duration".
69
70 ** New or different order columns? No changes to your scripts!
71
72 Thus if your experiment gets more complicated with a size
73 parameter, so your log changes to:
74
75 #fsdb experiment size duration
76 ufs_mab_sys 1024 37.2
77 ufs_mab_sys 1024 37.3
78 ufs_rcp_real 1024 264.5
79 ufs_rcp_real 1024 277.9
80 ufs_mab_sys 2048 45.3
81 ufs_mab_sys 2048 44.2
82
83 Then the previous scripts still work, even though duration is now
84 the third column, not the second.
85
· A series of actions is self-documenting (each program records what
it does).
88
89 ** No more wondering what hacks were used to compute the final
90 data, just look at the comments at the end of the output.
91
For example, the commands

    dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration

add to the end of the output the lines

    # | dbrow _experiment eq "ufs_mab_sys"
    # | dbcolstats duration
99
100 · The library is mature, supporting large datasets (more than 100GB),
101 corner cases, error handling, backed by an automated test suite.
102
103 ** No more puzzling about bad output because your custom script
104 skimped on error checking.
105
106 ** No more memory thrashing when you try to sort ten million
107 records.
108
109 · Fsdb-2.x supports Perl scripting (in addition to shell scripting),
110 with libraries to do Fsdb input and output, and easy support for
111 pipelines. The shell script
112
    dbcol name test1 | dbroweval '_test1 += 5;'

can be written in Perl as:

    dbpipeline(dbcol(qw(name test1)), dbroweval('_test1 += 5;'));
118
119 (The disadvantage is that you need to learn what functions Fsdb
120 provides.)
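
To make the point about named columns concrete, here is a rough sketch in plain sh and awk. It is not how Fsdb is implemented, and col_mean is a hypothetical helper. It assumes whitespace-separated fields and a header of the form "#fsdb col1 col2 ..." with no options. The column is found by name in the header, so the same call works whether duration is the second or the third column.

```sh
# col_mean NAME: print the mean of column NAME from an Fsdb-style table on stdin.
# Hypothetical helper for illustration only; real work should use dbcolstats.
col_mean() {
    awk -v want="$1" '
        NR == 1 {                          # header line: map the name to a position
            for (i = 2; i <= NF; i++) if ($i == want) idx = i - 1
            next
        }
        /^#/ { next }                      # skip comment lines
        idx  { sum += $idx; n++ }          # accumulate only if the column exists
        END  { if (n) printf "%g\n", sum / n }
    '
}

# Same call, two layouts: duration is column 2 in the first, column 3 in the second.
printf '#fsdb experiment duration\nufs_mab_sys 37.2\nufs_mab_sys 37.3\n' | col_mean duration
printf '#fsdb experiment size duration\nufs_mab_sys 1024 37.2\nufs_mab_sys 1024 37.3\n' | col_mean duration
```

Both calls print the same mean, because the lookup depends on the header and not on the column position.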
121
122 Fsdb is built on flat-ASCII databases. By storing data in simple text
123 files and processing it with pipelines it is easy to experiment (in the
124 shell) and look at the output. To the best of my knowledge, the
125 original implementation of this idea was "/rdb", a commercial product
126 described in the book UNIX relational database management: application
127 development in the UNIX environment by Rod Manis, Evan Schaffer, and
128 Robert Jorgensen (and also at the web page <http://www.rdb.com/>).
129 Fsdb is an incompatible re-implementation of their idea without any
130 accelerated indexing or forms support. (But it's free, and probably
131 has better statistics!).
132
133 Fsdb-2.x will exploit multiple processors or cores, and provides Perl-
134 level support for input, output, and threaded-pipelines. (As of
135 Fsdb-2.44 it no longer uses Perl threading, just processes, since they
136 are faster.)
137
138 Installation instructions follow at the end of this document. Fsdb-2.x
139 requires Perl 5.8 to run. All commands have manual pages and provide
140 usage with the "--help" option. All commands are backed by an
141 automated test suite.
142
143 The most recent version of Fsdb is available on the web at
144 <http://www.isi.edu/~johnh/SOFTWARE/FSDB/index.html>.
145
147 2.69, 2019-11-22 a small bugfix in dbcolstats
148 BUG FIX
Filled in the test case for autodecompress, which was missing
for the 2.68 release.
151
152 ENHANCEMENT
153 The groff program is required for build, and the "Makefile.PL"
154 fails if groff is missing at build time. Thanks to Chris Williams
155 for suggesting this check, and the CPAN auto-building system for
156 trying many platforms.
157
158 BUG FIX
159 The dbcolstats program had numerical instability that sometimes
160 results in failing with a square-root of a negative number when
161 many values varied right at the edge of floating-point precision.
162 We now detect and report that case as 0 stddev. Thanks to Hang Guo
163 for providing a test case.
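
As an illustration of how a square root of a negative number can arise, the sketch below computes a sample standard deviation from a running sum and sum of squares, and clamps a slightly negative variance to zero. This is a common one-pass formula used here as an assumption; it is not taken from the dbcolstats source, and stddev_from_sums is a hypothetical name.

```sh
# stddev_from_sums N SUM SUMSQ: sample standard deviation from running totals.
# Illustrative only; the formula is an assumption, not the dbcolstats code.
stddev_from_sums() {
    awk -v n="$1" -v sum="$2" -v ss="$3" 'BEGIN {
        if (n < 2) { print 0; exit }
        var = (ss - sum * sum / n) / (n - 1)
        if (var < 0) var = 0      # rounding error can push this just below zero
        printf "%.5g\n", sqrt(var)
    }'
}

stddev_from_sums 6 465 36625      # six values with sum 465 and sum of squares 36625
stddev_from_sums 3 3 2.9999999    # nearly identical values: variance rounds negative
```

The second call would otherwise take the square root of a small negative number; with the clamp it reports 0.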
164
166 executive summary
167 what's new
168 README CONTENTS
169 installation
170 basic data format
171 basic data manipulation
172 list of commands
173 another example
174 a gradebook example
175 a password example
176 history
177 related work
178 release notes
179 copyright
180 comments
181
Fsdb now uses the standard Perl build and installation from
ExtUtils::MakeMaker(3), so the quick answer to installation is to type:

    perl Makefile.PL
    make
    make test
    make install

Or, if you want to install it somewhere else, change the first line to

    perl Makefile.PL PREFIX=$HOME

and it will go in your home directory's bin, etc. (See
ExtUtils::MakeMaker(3) for more details.)
197
198 Fsdb requires perl 5.8 or later.
199
A test suite is available; run it with

    make test
203
A FreeBSD port of Fsdb is available; see
<http://www.freshports.org/databases/fsdb/>.
206
207 A Fink (MacOS X) port is available, see
208 <http://pdb.finkproject.org/pdb/package.php/fsdb>. (Thanks to Lars
209 Eggert for maintaining this port.)
210
These programs are based on the idea of storing data in simple ASCII
files. A database is a file with one header line and then data or
comment lines. For example:
215
216 #fsdb account passwd uid gid fullname homedir shell
217 johnh * 2274 134 John_Heidemann /home/johnh /bin/bash
218 greg * 2275 134 Greg_Johnson /home/greg /bin/bash
219 root * 0 0 Root /root /bin/bash
220 # this is a simple database
221
The header line must be first and begins with "#fsdb". There are rows
(records) and columns (fields), just like in a normal database.
Comment lines begin with "#". Column names are any string not
containing spaces or single quotes (although it is prudent to keep them
alphanumeric with underscore).
227
228 By default, columns are delimited by whitespace. With this default
229 configuration, the contents of a field cannot contain whitespace.
230 However, this limitation can be relaxed by changing the field separator
231 as described below.
232
233 The big advantage of simple flat-text databases is that it is usually
234 easy to massage data into this format, and it's reasonably easy to take
235 data out of this format into other (text-based) programs, like gnuplot,
236 jgraph, and LaTeX. Think Unix. Think pipes. (Or even output to Excel
237 and HTML if you prefer.)
238
Since no-whitespace in columns was a problem for some applications,
there's an option which relaxes this rule. You can specify the field
separator in the table header with "-F x" where "x" is a code for the
new field separator. A full list of codes is at dbfilealter(1), but
two common special values are "-F t", which is a separator of a single
tab character, and "-F S", a separator of two spaces. Both allow
(single) spaces in fields. An example:

    #fsdb -F S account passwd uid gid fullname homedir shell
    johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
    greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash
    root  *  0  0  Root  /root  /bin/bash
    # this is a simple database
252
253 See dbfilealter(1) for more details. Regardless of what the column
254 separator is for the body of the data, it's always whitespace in the
255 header.
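
To see why a two-space separator permits single spaces inside a field, consider this sketch in plain awk (not an Fsdb tool; field_S is a hypothetical helper). A multi-character awk field separator is treated as a regular expression, so two literal spaces split the row while a single space stays inside the field.

```sh
# field_S N: print field N of each data row, splitting on exactly two spaces.
# Hypothetical helper; it only mimics the effect of "-F S" on the data rows.
field_S() {
    awk -F '  ' -v n="$1" '/^#/ { next } { print $n }'
}

printf '%s\n' \
    '#fsdb -F S account passwd uid gid fullname homedir shell' \
    'johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash' |
    field_S 5
```

The fifth field comes out as the full name, single space included.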
256
There's also a third format: a "list". Because it's often hard to see
which column is which past the first two, in list format each "column"
is on a separate line. The programs dblistize and dbcolize convert to
and from this format, and all programs work with either format. The
command

    dbfilealter -R C < DATA/passwd.fsdb
263
264 outputs:
265
266 #fsdb -R C account passwd uid gid fullname homedir shell
267 account: johnh
268 passwd: *
269 uid: 2274
270 gid: 134
271 fullname: John_Heidemann
272 homedir: /home/johnh
273 shell: /bin/bash
274
275 account: greg
276 passwd: *
277 uid: 2275
278 gid: 134
279 fullname: Greg_Johnson
280 homedir: /home/greg
281 shell: /bin/bash
282
283 account: root
284 passwd: *
285 uid: 0
286 gid: 0
287 fullname: Root
288 homedir: /root
289 shell: /bin/bash
290
291 # this is a simple database
292 # | dblistize
293
294 See dbfilealter(1) for more details.
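
The conversion to list format can be pictured with a small awk sketch. This is an illustration under simplifying assumptions (whitespace-separated input, a header with no options), not the dblistize or dbfilealter code; to_list is a hypothetical name.

```sh
# to_list: convert a whitespace-separated Fsdb-style table on stdin to list form.
# Illustrative only; assumes the header is "#fsdb col1 col2 ..." with no options.
to_list() {
    awk '
        NR == 1 {                          # header: remember the column names
            printf "#fsdb -R C"
            for (i = 2; i <= NF; i++) { name[i - 1] = $i; printf " %s", $i }
            printf "\n"
            next
        }
        /^#/ { print; next }               # pass comments through unchanged
        {
            for (i = 1; i <= NF; i++) printf "%s: %s\n", name[i], $i
            print ""                       # a blank line ends each record
        }
    '
}

printf '#fsdb account uid\njohnh 2274\ngreg 2275\n' | to_list
```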
295
297 A number of programs exist to manipulate databases. Complex functions
298 can be made by stringing together commands with shell pipelines. For
299 example, to print the home directories of everyone with ``john'' in
300 their names, you would do:
301
    cat DATA/passwd | dbrow '_fullname =~ /John/' | dbcol homedir
303
304 The output might be:
305
306 #fsdb homedir
307 /home/johnh
308 /home/greg
309 # this is a simple database
310 # | dbrow _fullname =~ /John/
311 # | dbcol homedir
312
313 (Notice that comments are appended to the output listing each command,
314 providing an automatic audit log.)
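
The audit-log idea is easy to imitate, which may help show what each Fsdb command does to its input. The sketch below is plain awk, not dbrow; rows_matching is a hypothetical helper that assumes whitespace-separated fields and a header without options. It keeps the header and earlier comments, filters rows by a named column, and appends one trailing comment describing itself.

```sh
# rows_matching COLUMN REGEX: keep rows whose COLUMN matches REGEX and
# append a trailing comment recording the step. Illustration only.
rows_matching() {
    awk -v col="$1" -v re="$2" '
        NR == 1 { for (i = 2; i <= NF; i++) if ($i == col) idx = i - 1; print; next }
        /^#/    { print; next }            # keep comments, including earlier log lines
        idx && $idx ~ re { print }
        END     { printf "# | rows_matching %s %s\n", col, re }
    '
}

printf '%s\n' \
    '#fsdb account fullname' \
    'johnh John_Heidemann' \
    'root Root' \
    '# this is a simple database' |
    rows_matching fullname John
```

Chaining two such steps would leave two trailing comments, one per step, in the order the steps ran.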
315
316 In addition to typical database functions (select, join, etc.) there
317 are also a number of statistical functions.
318
319 The real power of Fsdb is that one can apply arbitrary code to rows to
320 do powerful things.
321
    cat DATA/passwd | dbroweval '_fullname =~ s/(\w+)_(\w+)/$2,_$1/'
323
324 converts "John_Heidemann" into "Heidemann,_John". Not too much more
325 work could split fullname into firstname and lastname fields.
326
Or:

    cat DATA/passwd | dbcolcreate sort | dbroweval -b 'use Fsdb::Support' \
        '_sort = _fullname; _sort =~ s/_/ /g; _sort = fullname_to_sort(_sort);'
331
333 An advantage of Fsdb is that you can talk about columns by name
334 (symbolically) rather than simply by their positions. So in the above
335 example, "dbcol homedir" pulled out the home directory column, and
336 "dbrow '_fullname =~ /John/'" matched against column fullname.
337
338 In general, you can use the name of the column listed on the "#fsdb"
339 line to identify it in most programs, and _name to identify it in code.
340
341 Some alternatives for flexibility:
342
343 · Numeric values identify columns positionally, numbering from 0. So
344 0 or _0 is the first column, 1 is the second, etc.
345
· In code, _last_columnname gets the value from columnname's previous
row.
348
349 See dbroweval(1) for more details about writing code.
350
Enough said. I'll summarize the commands, and then you can experiment.
For a short summary of each command, run it with the argument "--help"
(or "-?" if you prefer). Full manual pages can be found by running the
command with the argument "--man", or by running the Unix command
"man dbcol" (or whatever program you want).
357
358 TABLE CREATION
359 dbcolcreate
360 add columns to a database
361
362 dbcoldefine
363 set the column headings for a non-Fsdb file
364
365 TABLE MANIPULATION
366 dbcol
367 select columns from a table
368
369 dbrow
370 select rows from a table
371
372 dbsort
373 sort rows based on a set of columns
374
375 dbjoin
376 compute the natural join of two tables
377
378 dbcolrename
379 rename a column
380
381 dbcolmerge
382 merge two columns into one
383
384 dbcolsplittocols
385 split one column into two or more columns
386
387 dbcolsplittorows
388 split one column into multiple rows
389
390 dbfilepivot
391 "pivots" a file, converting multiple rows corresponding to the same
392 entity into a single row with multiple columns.
393
394 dbfilevalidate
395 check that db file doesn't have some common errors
396
397 COMPUTATION AND STATISTICS
398 dbcolstats
compute statistics over a column (mean, etc., optionally median)
400
401 dbmultistats
402 group rows by some key value, then compute stats (mean, etc.) over
403 each group (equivalent to dbmapreduce with dbcolstats as the
404 reducer)
405
406 dbmapreduce
407 group rows (map) and then apply an arbitrary function to each group
408 (reduce)
409
410 dbrvstatdiff
compare two sample distributions (mean/conf interval/T-test)
412
413 dbcolmovingstats
compute moving statistics over a column of data
415
416 dbcolstatscores
417 compute Z-scores and T-scores over one column of data
418
419 dbcolpercentile
420 compute the rank or percentile of a column
421
422 dbcolhisto
423 compute histograms over a column of data
424
425 dbcolscorrelate
426 compute the coefficient of correlation over several columns
427
428 dbcolsregression
429 compute linear regression and correlation for two columns
430
431 dbrowaccumulate
432 compute a running sum over a column of data
433
434 dbrowcount
435 count the number of rows (a subset of dbstats)
436
437 dbrowdiff
compute differences in a column between rows of a table
439
440 dbrowenumerate
441 number each row
442
443 dbroweval
444 run arbitrary Perl code on each row
445
446 dbrowuniq
447 count/eliminate identical rows (like Unix uniq(1))
448
449 dbfilediff
450 compare fields on rows of a file (something like Unix diff(1))
451
452 OUTPUT CONTROL
453 dbcolneaten
454 pretty-print columns
455
456 dbfilealter
457 convert between column or list format, or change the column
458 separator
459
460 dbfilestripcomments
461 remove comments from a table
462
463 dbformmail
464 generate a script that sends form mail based on each row
465
466 CONVERSIONS
467 (These programs convert data into fsdb. See their web pages for
468 details.)
469
470 cgi_to_db
471 <http://stein.cshl.org/boulder/>
472
473 combined_log_format_to_db
474 <http://httpd.apache.org/docs/2.0/logs.html>
475
476 html_table_to_db
477 HTML tables to fsdb (assuming they're reasonably formatted).
478
479 kitrace_to_db
480 <http://ficus-www.cs.ucla.edu/ficus-members/geoff/kitrace.html>
481
482 ns_to_db
483 <http://mash-www.cs.berkeley.edu/ns/>
484
485 sqlselect_to_db
486 the output of SQL SELECT tables to db
487
488 tabdelim_to_db
489 spreadsheet tab-delimited files to db
490
491 tcpdump_to_db
492 (see man tcpdump(8) on any reasonable system)
493
494 xml_to_db
495 XML input to fsdb, assuming they're very regular
496
497 (And out of fsdb:)
498
499 db_to_csv
500 Comma-separated-value format from fsdb.
501
502 db_to_html_table
503 simple conversion of Fsdb to html tables
504
505 STANDARD OPTIONS
506 Many programs have common options:
507
508 -? or --help
509 Show basic usage.
510
-N or --new-name
512 When a command creates a new column like dbrowaccumulate's "accum",
513 this option lets one override the default name of that new column.
514
-T TmpDir
    Where to put tmp files. Also uses environment variable TMPDIR, if
    -T is not specified. Default is /tmp.
520
521 -c FRACTION or --confidence FRACTION
522 Specify confidence interval FRACTION (dbcolstats, dbmultistats,
523 etc.)
524
525 -C S or "--element-separator S"
526 Specify column separator S (dbcolsplittocols, dbcolmerge).
527
528 -d or --debug
529 Enable debugging (may be repeated for greater effect in some
530 cases).
531
532 -a or --include-non-numeric
533 Compute stats over all data (treating non-numbers as zeros). (By
534 default, things that can't be treated as numbers are ignored for
535 stats purposes)
536
537 -S or --pre-sorted
538 Assume the data is pre-sorted. May be repeated to disable
539 verification (saving a small amount of work).
540
541 -e E or --empty E
542 give value E as the value for empty (null) records
543
544 -i I or --input I
545 Input data from file I.
546
547 -o O or --output O
548 Write data out to file O.
549
--header H
    Use H as the full Fsdb header, rather than reading a header from
    the input. This option is particularly useful when using Fsdb
    under Hadoop, where split files don't have headers.

--nolog
    Skip logging the program in a trailing comment.

When giving Perl code (in dbrow and dbroweval) column names can be
embedded if preceded by underscores. Look at dbrow(1) or dbroweval(1)
for examples.
561
562 Most programs run in constant memory and use temporary files if
563 necessary. Exceptions are dbcolneaten, dbcolpercentile, dbmapreduce,
564 dbmultistats, dbrowsplituniq.
565
Take the raw data in "DATA/http_bandwidth", put a header on it
("dbcoldefine size bw"), take statistics of each category
("dbmultistats -k size bw"), pick out the relevant fields ("dbcol size
mean stddev pct_rsd"), and you get:
571
572 #fsdb size mean stddev pct_rsd
573 1024 1.4962e+06 2.8497e+05 19.047
574 10240 5.0286e+06 6.0103e+05 11.952
575 102400 4.9216e+06 3.0939e+05 6.2863
576 # | dbcoldefine size bw
577 # | /home/johnh/BIN/DB/dbmultistats -k size bw
578 # | /home/johnh/BIN/DB/dbcol size mean stddev pct_rsd
579
(The whole command was:

    cat DATA/http_bandwidth |
    dbcoldefine size bw |
    dbmultistats -k size bw |
    dbcol size mean stddev pct_rsd

all on one line.)
588
589 Then post-process them to get rid of the exponential notation by adding
590 this to the end of the pipeline:
591
    dbroweval '_mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev);'
593
594 (Actually, this step is no longer required since dbcolstats now uses a
595 different default format.)
596
597 giving:
598
599 #fsdb size mean stddev pct_rsd
600 1024 1496200 284970 19.047
601 10240 5028600 601030 11.952
602 102400 4921600 309390 6.2863
603 # | dbcoldefine size bw
604 # | dbmultistats -k size bw
605 # | dbcol size mean stddev pct_rsd
606 # | dbroweval { _mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev); }
607
608 In a few lines, raw data is transformed to processed output.
609
Suppose you suspect that the results for one data point have an odd
distribution. Fsdb can easily produce a CDF (cumulative distribution
function) of the data, suitable for graphing:
613
    cat DB/DATA/http_bandwidth | \
    dbcoldefine size bw | \
    dbrow '_size == 102400' | \
    dbcol bw | \
    dbsort -n bw | \
    dbrowenumerate | \
    dbcolpercentile count | \
    dbcol bw percentile | \
    xgraph

The steps, roughly:

1. get the raw input data and turn it into fsdb format,

2. pick out just the relevant column (for efficiency) and sort it,

3. for each data point, assign a CDF percentage to it,

4. pick out the two columns to graph and show them
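
The sort, enumerate, and percentile steps can be imitated in a few lines of sh and awk, which may help in checking what the pipeline produces. This sketch is not the Fsdb implementation. It assumes the CDF value of a point is its rank divided by the number of points; the exact definition used by dbcolpercentile may differ. The name cdf is hypothetical.

```sh
# cdf: read one number per line and print "value fraction" pairs in sorted order.
# Assumption: fraction = rank / n. Illustration only.
cdf() {
    sort -n | awk '
        { v[NR] = $1 }
        END { for (i = 1; i <= NR; i++) printf "%s %.2f\n", v[i], i / NR }
    '
}

printf '%s\n' 30 10 20 40 | cdf
```

Each output line pairs a value with the fraction of points at or below it, ready for a plotting program.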
628
630 The first commercial program I wrote was a gradebook, so here's how to
631 do it with Fsdb.
632
633 Format your data like DATA/grades.
634
635 #fsdb name email id test1
636 a a@ucla.example.edu 1 80
637 b b@usc.example.edu 2 70
638 c c@isi.example.edu 3 65
639 d d@lmu.example.edu 4 90
640 e e@caltech.example.edu 5 70
641 f f@oxy.example.edu 6 90
642
Or if your students have spaces in their names, use "-F S" and two
spaces to separate each column:

    #fsdb -F S name email id test1
    alfred aho  a@ucla.example.edu  1  80
    butler lampson  b@usc.example.edu  2  70
    david clark  c@isi.example.edu  3  65
    constantine drovolis  d@lmu.example.edu  4  90
    deborah estrin  e@caltech.example.edu  5  70
    sally floyd  f@oxy.example.edu  6  90
653
654 To compute statistics on an exam, do
655
    cat DATA/grades | dbstats test1 | dblistize
657
658 giving
659
660 #fsdb -R C ...
661 mean: 77.5
662 stddev: 10.84
663 pct_rsd: 13.987
664 conf_range: 11.377
665 conf_low: 66.123
666 conf_high: 88.877
667 conf_pct: 0.95
668 sum: 465
669 sum_squared: 36625
670 min: 65
671 max: 90
672 n: 6
673 ...
674
675 To do a histogram:
676
    cat DATA/grades | dbcolhisto -n 5 -g test1
678
679 giving
680
681 #fsdb low histogram
682 65 *
683 70 **
684 75
685 80 *
686 85
687 90 **
688 # | /home/johnh/BIN/DB/dbhistogram -n 5 -g test1
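
For readers who want to see how such a histogram comes about, here is a sketch in sh and awk. It is not dbcolhisto. It assumes buckets five points wide that start at the smallest score, which is what the output above suggests. The name histo is hypothetical.

```sh
# histo WIDTH: read one number per line and print a star histogram whose
# buckets are WIDTH wide and start at the smallest value. Illustration only.
histo() {
    sort -n | awk -v w="$1" '
        NR == 1 { min = $1 }
        { b = int(($1 - min) / w); count[b]++; if (b > maxb) maxb = b }
        END {
            if (!NR) exit                  # no input, no output
            for (b = 0; b <= maxb; b++) {
                stars = ""
                for (j = 0; j < count[b]; j++) stars = stars "*"
                printf "%d%s\n", min + b * w, (stars == "" ? "" : " " stars)
            }
        }
    '
}

# The six test1 scores from the gradebook example.
printf '%s\n' 80 70 65 90 70 90 | histo 5
```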
689
690 Now you want to send out grades to the students by e-mail. Create a
691 form-letter (in the file test1.txt):
692
693 To: _email (_name)
694 From: J. Random Professor <jrp@usc.example.edu>
695 Subject: test1 scores
696
697 _name, your score on test1 was _test1.
698 86+ A
699 75-85 B
700 70-74 C
701 0-69 F
702
703 Generate the shell script that will send the mail out:
704
    cat DATA/grades | dbformmail test1.txt > test1.sh

And run it:

    sh <test1.sh

The last two steps can be combined:

    cat DATA/grades | dbformmail test1.txt | sh
714
715 but I like to keep a copy of exactly what I send.
716
717 At the end of the semester you'll want to compute grade totals and
718 assign letter grades. Both fall out of dbroweval. For example, to
719 compute weighted total grades with a 40% midterm/60% final where the
720 midterm is 84 possible points and the final 100:
721
    dbcol -rv total |
    dbcolcreate total - |
    dbroweval '
        _total = .40 * _midterm/84.0 + .60 * _final/100.0;
        _total = sprintf("%4.2f", _total);
        if (_final eq "-" || ( _name =~ /^_/)) { _total = "-"; };' |
    dbcolneaten
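
As a quick check of the weighting arithmetic, the same formula can be evaluated for a single student with awk. The scores below are hypothetical, and weighted_total is not an Fsdb command.

```sh
# weighted_total MIDTERM FINAL: 40% midterm (out of 84) plus 60% final (out of 100).
# Hypothetical helper that mirrors the formula in the dbroweval code.
weighted_total() {
    awk -v m="$1" -v f="$2" 'BEGIN { printf "%4.2f\n", .40 * m / 84.0 + .60 * f / 100.0 }'
}

weighted_total 63 80     # 0.40 * 0.75 + 0.60 * 0.80
weighted_total 84 100    # a perfect score on both exams
```

The result is a fraction between 0 and 1, so letter-grade cutoffs would be expressed on that scale.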
729
730 If you got the data originally from a spreadsheet, save it in "tab-
731 delimited" format and convert it with tabdelim_to_db (run
732 tabdelim_to_db -? for examples).
733
735 To convert the Unix password file to db:
736
    cat /etc/passwd | sed 's/:/  /g' | \
    dbcoldefine -F S login password uid gid gecos home shell \
    >passwd.fsdb

To convert the group file:

    cat /etc/group | sed 's/:/  /g' | \
    dbcoldefine -F S group password gid members \
    >group.fsdb

To show the names of the groups that div7-members are in (assuming DIV7
is in the gecos field):

    cat passwd.fsdb | dbrow '_gecos =~ /DIV7/' | dbcol login gid | \
    dbjoin -i - -i group.fsdb gid | dbcol login group
752
754 Which Fsdb programs are the most complicated (based on number of test
755 cases)?
756
    ls TEST/*.cmd | \
    dbcoldefine test | \
    dbroweval '_test =~ s@^TEST/([^_]+).*$@$1@' | \
    dbrowuniq -c | \
    dbsort -nr count | \
    dbcolneaten
763
764 (Answer: dbmapreduce, then dbcolstats, dbfilealter and dbjoin.)
765
766 Stats on an exam (in $FILE, where $COLUMN is the name of the exam)?
767
    cat $FILE | dbcolstats -q 4 $COLUMN | dblistize | dbstripcomments

    cat $FILE | dbcolhisto -g -n 20 $COLUMN | dbcolneaten | dbstripcomments
771
Merging the hw1 column from file hw1.fsdb into grades.fsdb, assuming
there's a common student id in column "id":

    dbcol id hw1 <hw1.fsdb >t.fsdb

    dbjoin -a -e - grades.fsdb t.fsdb id | \
    dbsort name | \
    dbcolneaten >new_grades.fsdb
780
Merging two fsdb files with the same columns:
782
    cat file1.fsdb file2.fsdb >output.fsdb

or if you want to clean things up a bit

    cat file1.fsdb file2.fsdb | dbstripextraheaders >output.fsdb

or if you want to know where the data came from

    for i in 1 2
    do
        dbcolcreate source $i < file$i.fsdb
    done >output.fsdb
795
796 (assumes you're using a Bourne-shell compatible shell, not csh).
797
799 As with any tool, one should (which means must) understand the limits
800 of the tool.
801
802 All Fsdb tools should run in constant memory. In some cases (such as
803 dbcolstats with quartiles, where the whole input must be re-read),
804 programs will spool data to disk if necessary.
805
806 Most tools buffer one or a few lines of data, so memory will scale with
the size of each line. (So lines with many columns, or columns that
hold lots of data, may cause large memory consumption.)
809
810 All Fsdb tools should run in constant or at worst "n log n" time.
811
812 All Fsdb tools use normal Perl math routines for computation. Although
813 I make every attempt to choose numerically stable algorithms (although
814 I also welcome feedback and suggestions for improvement), normal
815 rounding due to computer floating point approximations can result in
816 inaccuracies when data spans a large range of precision. (See for
817 example the dbcolstats_extrema test cases.)
818
Any requirements and limitations of each Fsdb tool are documented on
its manual page.
821
822 If any Fsdb program violates these assumptions, that is a bug that
823 should be documented on the tool's manual page or ideally fixed.
824
825 Fsdb does depend on Perl's correctness, and Perl (and Fsdb) have some
826 bugs. Fsdb should work on perl from version 5.10 onward.
827
829 There have been three versions of Fsdb; fsdb 1.0 is a complete re-write
830 of the pre-1995 versions, and was distributed from 1995 to 2007. Fsdb
831 2.0 is a significant re-write of the 1.x versions for reasons described
832 below.
833
834 Fsdb (in its various forms) has been used extensively by its author
835 since 1991. Since 1995 it's been used by two other researchers at UCLA
836 and several at ISI. In February 1998 it was announced to the Internet.
837 Since then it has found a few users, some outside where I work.
838
839 Major changes:
840
841 1.0 1997-07-22: first public release.
842 2.0 2008-01-25: rewrite to use a common library, and starting to use
843 threads.
844 2.12 2008-10-16: completion of the rewrite, and first RPM package.
845 2.44 2013-10-02: abandoning threads for improved performance
846
847 Fsdb 2.0 Rationale
848 I've thought about fsdb-2.0 for many years, but it was started in
849 earnest in 2007. Fsdb-2.0 has the following goals:
850
851 in-one-process processing
852 While fsdb is great on the Unix command line as a pipeline between
853 programs, it should also be possible to set it up to run in a
854 single process. And if it does so, it should be able to avoid
855 serializing and deserializing (converting to and from text) data
856 between each module. (Accomplished in fsdb-2.0: see dbpipeline,
857 although still needs tuning.)
858
859 clean IO API
860 Fsdb's roots go back to perl4 and 1991, so the fsdb-1.x library is
very, very crufty. More than just being ugly (but it was that
too), this made things like reading from one file format and writing
to another the application's job, when it should be the library's.
864 (Accomplished in fsdb-1.15 and improved in 2.0: see Fsdb::IO.)
865
866 normalized module APIs
867 Because fsdb modules were added as needed over 10 years, sometimes
868 the module APIs became inconsistent. (For example, the 1.x
869 "dbcolcreate" required an empty value following the name of the new
870 column, but other programs specify empty values with the "-e"
871 argument.) We should smooth over these inconsistencies.
872 (Accomplished as each module was ported in 2.0 through 2.7.)
873
874 everyone handles all input formats
875 Given a clean IO API, the distinction between "colized" and
876 "listized" fsdb files should go away. Any program should be able
877 to read and write files in any format. (Accomplished in fsdb-2.1.)
878
879 Fsdb-2.0 preserves backwards compatibility where possible, but breaks
880 it where necessary to accomplish the above goals. In August 2008,
881 Fsdb-2.7 was declared preferred over the 1.x versions. Benchmarking in
882 2013 showed that threading performed much worse than just using pipes,
883 so Fsdb-2.44 uses threading "style", but implemented with processes
884 (via my "Freds" library).
885
886 Contributors
887 Fsdb includes code ported from Geoff Kuenning
888 ("Fsdb::Support::TDistribution").
889
890 Fsdb contributors: Ashvin Goel goel@cse.oge.edu, Geoff Kuenning
891 geoff@fmg.cs.ucla.edu, Vikram Visweswariah visweswa@isi.edu, Kannan
892 Varadahan kannan@isi.edu, Lars Eggert larse@isi.edu, Arkadi Gelfond
893 arkadig@dyna.com, David Graff graff@ldc.upenn.edu, Haobo Yu
894 haoboy@packetdesign.com, Pavlin Radoslavov pavlin@catarina.usc.edu,
895 Graham Phillips, Yuri Pradkin, Alefiya Hussain, Ya Xu, Michael
896 Schwendt, Fabio Silva fabio@isi.edu, Jerry Zhao zhaoy@isi.edu, Ning Xu
897 nxu@aludra.usc.edu, Martin Lukac mlukac@lecs.cs.ucla.edu, Xue Cai,
898 Michael McQuaid, Christopher Meng, Calvin Ardi, H. Merijn Brand, Lan
899 Wei, Hang Guo.
900
901 Fsdb includes datasets contributed from NIST (DATA/nist_zarr13.fsdb),
902 from
903 <http://www.itl.nist.gov/div898/handbook/eda/section4/eda4281.htm>, the
904 NIST/SEMATECH e-Handbook of Statistical Methods, section 1.4.2.8.1.
905 Background and Data. The source is public domain, and reproduced with
906 permission.
907
909 As stated in the introduction, Fsdb is an incompatible reimplementation
910 of the ideas found in "/rdb". By storing data in simple text files and
911 processing it with pipelines it is easy to experiment (in the shell)
912 and look at the output. The original implementation of this idea was
913 /rdb, a commercial product described in the book UNIX relational
914 database management: application development in the UNIX environment by
915 Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web
916 page <http://www.rdb.com/>).
917
918 While Fsdb is inspired by Rdb, it includes no code from it, and Fsdb
makes several different design choices. In particular: rdb attempts to
be closer to a "real" database, with provision for locking and file
indexing. Fsdb focuses on single user use and so eschews these
922 choices. Rdb also has some support for interactive editing. Fsdb
923 leaves editing to text editors like emacs or vi.
924
925 In August, 2002 I found out Carlo Strozzi extended RDB with his package
926 NoSQL <http://www.linux.it/~carlos/nosql/>. According to Mr. Strozzi,
927 he implemented NoSQL in awk to avoid the Perl start-up of RDB.
928 Although I haven't found Perl startup overhead to be a big problem on
929 my platforms (from old Sparcstation IPCs to 2GHz Pentium-4s), you may
930 want to evaluate his system. The Linux Journal has a description of
931 NoSQL at <http://www.linuxjournal.com/article/3294>. It seems quite
932 similar to Fsdb. Like /rdb, NoSQL supports indexing (not present in
933 Fsdb). Fsdb appears to have richer support for statistics, and, as of
934 Fsdb-2.x, its support for Perl threading may support faster performance
935 (one-process, less serialization and deserialization).
936
938 Versions prior to 1.0 were released informally on my web page but were
939 not announced.
940
941 0.0 1991
942 started for my own research use
943
944 0.1 26-May-94
945 first check-in to RCS
946
947 0.2 15-Mar-95
948 parts now require perl5
949
950 1.0, 22-Jul-97
951 adds autoconf support and a test script.
952
953 1.1, 20-Jan-98
954 support for double space field separators, better tests
955
956 1.2, 11-Feb-98
957 minor changes and release on comp.lang.perl.announce
958
959 1.3, 17-Mar-98
960 · adds median and quartile options to dbstats
961
962 · adds dmalloc_to_db converter
963
964 · fixes some warnings
965
966 · dbjoin now can run on unsorted input
967
968 · fixes a dbjoin bug
969
970 · some more tests in the test suite
971
972 1.4, 27-Mar-98
973 · improves error messages (all should now report the program that
974 makes the error)
975
976 · fixed a bug in dbstats output when the mean is zero
977
978 1.5, 25-Jun-98
979 BUG FIX dbcolhisto, dbcolpercentile now handle non-numeric values like
980 dbstats
981 NEW dbcolstats computes zscores and tscores over a column
982 NEW dbcolscorrelate computes correlation coefficients between two
983 columns
984 INTERNAL ficus_getopt.pl has been replaced by DbGetopt.pm
985 BUG FIX all tests are now ``portable'' (previously some tests ran only
986 on my system)
987 BUG FIX you no longer need to have the db programs in your path (fix
988 arose from a discussion with Arkadi Gelfond)
989 BUG FIX installation no longer uses cp -f (to work on SunOS 4)
990
991 1.6, 24-May-99
992 NEW dbsort, dbstats, dbmultistats now run in constant memory (using tmp
993 files if necessary)
994 NEW dbcolmovingstats does moving means over a series of data
995 NEW dbcol has a -v option to get all columns except those listed
996 NEW dbmultistats does quartiles and medians
997 NEW dbstripextraheaders now also cleans up bogus comments before the
998 first header
999 BUG FIX dbcolneaten works better with double-space-separated data
1000
1001 1.7, 5-Jan-00
1002 NEW dbcolize now detects and rejects lines that contain embedded copies
1003 of the field separator
1004 NEW configure tries harder to prevent people from improperly
1005 configuring/installing fsdb
1006 NEW tcpdump_to_db converter (incomplete)
1007 NEW tabdelim_to_db converter: from spreadsheet tab-delimited files to
1008 db
1009 NEW mailing lists for fsdb are "fsdb-announce@heidemann.la.ca.us"
1010 and "fsdb-talk@heidemann.la.ca.us"
1011 To subscribe to either, send mail
1012 to "fsdb-announce-request@heidemann.la.ca.us" or
1013 "fsdb-talk-request@heidemann.la.ca.us" with "subscribe" in the
1014 BODY of the message.
1015
1016 BUG FIX dbjoin used to produce incorrect output if there were extra,
1017 unmatched values in the 2nd table. Thanks to Graham Phillips for
1018 providing a test case.
1019 BUG FIX the sample commands in the usage strings now all should
1020 explicitly include the source of data (typically from "cat foo.fsdb
1021 |"). Thanks to Ya Xu for pointing out this documentation deficiency.
1022 BUG FIX (DOCUMENTATION) dbcolmovingstats had incorrect sample output.
1023
1024 1.8, 28-Jun-00
1025 BUG FIX header options are now preserved when writing with dblistize
1026 NEW dbrowuniq now optionally checks for uniqueness only on certain
1027 fields
1028 NEW dbrowsplituniq makes one pass through a file and splits it into
1029 separate files based on the given fields
1030 NEW converter for "crl" format network traces
1031 NEW anywhere you use arbitrary code (like dbroweval), _last_foo now
1032 maps to the last row's value for field _foo.
1033 OPTIMIZATION comment processing slightly changed so that dbmultistats
1034 now is much faster on files with lots of comments (for example, ~100k
1035 lines of comments and 700 lines of data!) (Thanks to Graham Phillips
1036 for pointing out this performance problem.)
1037 BUG FIX dbstats with median/quartiles now correctly handles singleton
1038 data points.
1039
1040 1.9, 6-Nov-00
1041 NEW dbfilesplit, split a single input file into multiple output files
1042 (based on code contributed by Pavlin Radoslavov).
1043 BUG FIX dbsort now works with perl-5.6
1044
1045 1.10, 10-Apr-01
1046 BUG FIX dbstats now handles the case where there are more n-tiles than
1047 data
1048 NEW dbstats now includes a -S option to optimize work on pre-sorted
1049 data (inspired by code contributed by Haobo Yu)
1050 BUG FIX dbsort now has a better estimate of memory usage when run on
1051 data with very short records (problem detected by Haobo Yu)
1052 BUG FIX cleanup of temporary files is slightly better
1053
1054 1.11, 2-Nov-01
1055 BUG FIX dbcolneaten now runs in constant memory
1056 NEW dbcolneaten now supports "field specifiers" that allow some control
1057 over how wide columns should be
1058 OPTIMIZATION dbsort now tries hard to be filesystem cache-friendly
1059 (inspired by "Information and Control in Gray-box Systems" by the
1060 Arpaci-Dusseau's at SOSP 2001)
1061 INTERNAL t_distr now ported to perl5 module DbTDistr
1062
1063 1.12, 30-Oct-02
1064 BUG FIX dbmultistats documentation typo fixed
1065 NEW dbcolmultiscale
1066 NEW dbcol has -r option for "relaxed error checking"
1067 NEW dbcolneaten has new -e option to strip end-of-line spaces
1068 NEW dbrow finally has a -v option to negate the test
1069 BUG FIX math bug in dbcoldiff fixed by Ashvin Goel (need to check
1070 Scheaffer test cases)
1071 BUG FIX some patches to run with Perl 5.8. Note: some programs
1072 (dbcolmultiscale, dbmultistats, dbrowsplituniq) generate warnings like:
1073 "Use of uninitialized value in concatenation (.)" or "string at
1074 /usr/lib/perl5/5.8.0/FileCache.pm line 98, <STDIN> line 2". Please
1075 ignore this until I figure out how to suppress it. (Thanks to Jerry
1076 Zhao for noticing perl-5.8 problems.)
1077 BUG FIX fixed an autoconf problem where configure would fail to find a
1078 reasonable prefix (thanks to Fabio Silva for reporting the problem)
1079 NEW db_to_html_table: simple conversion to html tables (NO fancy stuff)
1080 NEW dblib now has a function dblib_text2html() that will do simple
1081 conversion of iso-8859-1 to HTML
1082
1083 1.13, 4-Feb-04
1084 NEW fsdb added to the freebsd ports tree
1085 <http://www.freshports.org/databases/fsdb/>. Maintainer:
1086 "larse@isi.edu"
1087 BUG FIX properly handle trailing spaces when data must be numeric (ex.
1088 dbstats with -FS, see test dbstats_trailing_spaces). Fix from Ning Xu
1089 "nxu@aludra.usc.edu".
1090 NEW dbcolize error message improved (bug report from Terrence Brannon),
1091 and list format documented in the README.
1092 NEW cgi_to_db converts CGI.pm-format storage to fsdb list format
1093 BUG FIX handle numeric synonyms for column names in dbcol properly
1094 ENHANCEMENT "talking about columns" section added to README. Lack of
1095 documentation pointed out by Lars Eggert.
1096 CHANGE dbformmail now defaults to using Mail ("Berkeley Mail") to send
1097 mail, rather than sendmail (sendmail is still an option, but mail
1098 doesn't require running as root)
1099 NEW on platforms that support it (i.e., with perl 5.8), fsdb works fine
1100 with unicode
1101 NEW dbfilevalidate: check a db file for some common errors
1102
1103 1.14, 24-Aug-06
1104 ENHANCEMENT README cleanup
1105 INCOMPATIBLE CHANGE dbcolsplit renamed dbcolsplittocols
1106 NEW dbcolsplittorows split one column into multiple rows
1107 NEW dbcolsregression compute linear regression and correlation for two
1108 columns
1109 ENHANCEMENT csv_to_db: better error handling, normalize field names,
1110 skip blank lines
1111 ENHANCEMENT dbjoin now detects (and fails) if non-joined files have
1112 duplicate names
1113 BUG FIX minor bug fixed in calculation of Student t-distributions
1114 (doesn't change any test output, but may have caused small errors)
1115
1116 1.15, 12-Nov-07
1117 NEW fsdb-1.14 added to the MacOS Fink system
1118 <http://pdb.finkproject.org/pdb/package.php/fsdb>. (Thanks to Lars
1119 Eggert for maintaining this port.)
1120 NEW Fsdb::IO::Reader and Fsdb::IO::Writer now provide reasonably clean
1121 OO I/O interfaces to Fsdb files. Highly recommended if you use fsdb
1122 directly from perl. In the fullness of time I expect to reimplement
1123 the entire thing using these APIs to replace the current dblib.pl which
1124 is still hobbled by its roots in perl4.
1125 NEW dbmapreduce now implements a Google-style map/reduce abstraction,
1126 generalizing dbmultistats.
1127 ENHANCEMENT fsdb now uses the Perl build system (Makefile.PL, etc.),
1128 instead of autoconf. This change paves the way to better perl-5-style
1129 modularization, proper manual pages, input of both listize and colize
1130 format for every program, and world peace.
1131 ENHANCEMENT dblib.pl is now moved to Fsdb::Old.pm.
1132 BUG FIX dbmultistats now propagates its format argument (-f). Bug and
1133 fix from Martin Lukac (thanks!).
1134 ENHANCEMENT dbformmail documentation now is clearer that it doesn't
1135 send the mail, you have to run the shell script it writes. (Problem
1136 observed by Unkyu Park.)
1137 ENHANCEMENT adapted to autoconf-2.61 (and then these changes were
1138 discarded in favor of The Perl Way).
1139 BUG FIX dbmultistats memory usage corrected (O(# tags), not O(1))
1140 ENHANCEMENT dbmultistats can now optionally run with pre-grouped input
1141 in O(1) memory
1142 ENHANCEMENT dbroweval -N was finally implemented (eat comments)
1143
1144 2.0, 25-Jan-08
1145 A quiet 2.0 release (gearing up towards completion).
1146
1147 ENHANCEMENT: shifting old programs to Perl modules, with the front-end
1148 program as just a wrapper. In the short-term, this change just means
1149 programs have real man pages. In the long-run, it will mean that one
1150 can run a pipeline in a single Perl program. So far: dbcol, dbroweval,
1151 the new dbrowcount, dbsort, the new dbmerge, the old "dbstats" (renamed
1152 dbcolstats), dbcolrename, and dbcolcreate.
1153 NEW: Fsdb::Filter::dbpipeline is an internal-only module that lets one
1154 use fsdb commands from within perl (via threads).
1155 It also provides perl function aliases for the internal modules, so
1156 a string of fsdb commands in perl is nearly as terse as in the
1157 shell:
1158
1159 use Fsdb::Filter::dbpipeline qw(:all);
1160 dbpipeline(
1161 dbrow(qw(name test1)),
1162 dbroweval('_test1 += 5;')
1163 );
1164
1165 INCOMPATIBLE CHANGE: The old dbcolstats has been renamed
1166 dbcolstatscores. The new dbcolstats does the same thing as the old
1167 dbstats. This incompatibility is unfortunate but normalizes program
1168 names.
1169 CHANGE: The new dbcolstats program always outputs "-" (the default
1170 empty value) for statistics it cannot compute (for example, standard
1171 deviation if there is only one row), instead of the old mix of "-" and
1172 "na".
1173 INCOMPATIBLE CHANGE: The old dbcolstats program, now called
1174 dbcolstatscores, also has different arguments. The "-t mean,stddev"
1175 option is now "--tmean mean --tstddev stddev". See dbcolstatscores for
1176 details.
1177 INCOMPATIBLE CHANGE: dbcolcreate now assumes all new columns get the
1178 default value rather than requiring each column to have an initial
1179 constant value. To change the initial value, use the new "-e" option.
1180 NEW: dbrowcount counts rows, an almost-subset of dbcolstats's "n"
1181 output (except without differentiating numeric/non-numeric input), or
1182 the equivalent of "dbstripcomments | wc -l".
1183 NEW: dbmerge merges two sorted files. This functionality was previously
1184 embedded in dbsort.
1185 INCOMPATIBLE CHANGE: dbjoin's "-i" option to include non-matches is now
1186 renamed "-a", so as to not conflict with the new standard option "-i"
1187 for input file.
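
The row count that dbrowcount reports can be approximated with standard tools, which may help when checking a result by hand. The following is a minimal sketch, not Fsdb code: it assumes a column-format file in which the header and every comment start with "#" in the first column, and the helper name count_rows and the sample data are hypothetical.

```shell
#!/bin/sh
# Count data rows in a column-format file: every line that does not
# start with "#". Assumption: header and comments begin with "#".
count_rows() {
    grep -vc '^#' "$1"
}

# Hypothetical sample data, modeled on the example at the top of this manual.
sample=$(mktemp)
printf '#fsdb experiment duration\n' > "$sample"
printf 'ufs_mab_sys\t37.2\nufs_mab_sys\t37.3\n' >> "$sample"
printf 'ufs_rcp_real\t264.5\nufs_rcp_real\t277.9\n' >> "$sample"
printf '# a trailing comment\n' >> "$sample"

count_rows "$sample"    # prints 4
rm -f "$sample"
```

This sketch ignores list-format files and does not distinguish numeric from non-numeric input, so it is only a rough cross-check.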
1188
1189 2.1, 6-Apr-08
1190 Another alpha 2.0, but now all converted programs
1191 understand both listize and colize format.
1192
1193 ENHANCEMENT: shifting more old programs to Perl modules. New in 2.1:
1194 dbcolneaten, dbcoldefine, dbcolhisto, dblistize, dbcolize, dbrecolize
1195 ENHANCEMENT dbmerge now handles an arbitrary number of input files, not
1196 just exactly two.
1197 NEW dbmerge2 is an internal routine that handles merging exactly two
1198 files.
1199 INCOMPATIBLE CHANGE dbjoin now specifies inputs like dbmerge2, rather
1200 than assuming the first two arguments were tables (as in fsdb-1).
1201 The old dbjoin argument "-i" is now "-a" or "--type=outer".
1202
1203 A minor change: comments in the source files for dbjoin are now
1204 intermixed with output rather than being delayed until the end.
1205
1206 ENHANCEMENT dbsort now no longer produces warnings when null values are
1207 passed to numeric comparisons.
1208 BUG FIX dbroweval now once again works with code that lacks a trailing
1209 semicolon. (This bug fixes a regression from 1.15.)
1210 INCOMPATIBLE CHANGE dbcolneaten's old "-e" option (to avoid end-of-line
1211 spaces) is now "-E" to avoid conflicts with the standard empty field
1212 argument.
1213 INCOMPATIBLE CHANGE dbcolhisto's old "-e" option is now "-E" to avoid
1214 conflicts. And its "-n", "-s", and "-w" are now "-N", "-S", and "-W" to
1215 correspond.
1216 NEW dbfilealter replaces dbrecolize, dblistize, and dbcolize, but with
1217 different options.
1218 ENHANCEMENT The library routines "Fsdb::IO" now understand both list-
1219 format and column-format data, so all converted programs can now
1220 automatically read either format. This capability was one of the
1221 milestone goals for 2.0, so yea!
1222
1223 2.2, 23-May-08
1224 Release 2.2 is another 2.x alpha release. Now most of the commands are
1225 ported, but a few remain, and I plan one last incompatible change (to
1226 the file header) before 2.x final.
1227
1228 ENHANCEMENT
1229 shifting more old programs to Perl modules. New in 2.2:
1230 dbrowaccumulate, dbformmail, dbcolmovingstats, dbrowuniq,
1231 dbrowdiff, dbcolmerge, dbcolsplittocols, dbcolsplittorows,
1232 dbmapreduce, dbmultistats, dbrvstatdiff. Also dbrowenumerate
1233 exists only as a front-end (command-line) program.
1234
1235 INCOMPATIBLE CHANGE
1236 The following programs have been dropped from fsdb-2.x:
1237 dbcoltighten, dbfilesplit, dbstripextraheaders,
1238 dbstripleadingspace.
1239
1240 NEW combined_log_format_to_db to convert Apache logfiles
1241
1242 INCOMPATIBLE CHANGE
1243 Options to dbrowdiff are now -B and -I, not -a and -i.
1244
1245 INCOMPATIBLE CHANGE
1246 dbstripcomments is now dbfilestripcomments.
1247
1248 BUG FIXES
1249 dbcolneaten better handles empty columns; dbcolhisto warning
1250 suppressed (actually a bug in high-bucket handling).
1251
1252 INCOMPATIBLE CHANGE
1253 dbmultistats now requires a "-k" option in front of the key (tag)
1254 field, or if none is given, it will group by the first field (both
1255 like dbmapreduce).
1256
1257 KNOWN BUG
1258 dbmultistats with quantile option doesn't work currently.
1259
1260 INCOMPATIBLE CHANGE
1261 dbcoldiff is renamed dbrvstatdiff.
1262
1263 BUG FIXES
1264 dbformmail was leaving its log message as a command, not a
1265 comment. Oops. No longer.
1266
1267 2.3, 27-May-08 (alpha)
1268 Another alpha release, this one just to fix the critical dbjoin bug
1269 listed below (that happens to have blocked my MP3 jukebox :-).
1270
1271 BUG FIX
1272 Dbsort no longer hangs if given an input file with no rows.
1273
1274 BUG FIX
1275 Dbjoin now works with unsorted input coming from a pipeline (like
1276 stdin). Perl-5.8.8 has a bug (?) that was making this case
1277 fail---opening stdin in one thread, reading some, then reading more
1278 in a different thread caused an lseek which works on files, but
1279 fails on pipes like stdin. Go figure.
1280
1281 BUG FIX / KNOWN BUG
1282 The dbjoin fix also fixed dbmultistats -q (it now gives the right
1283 answer). Although a new bug appeared, messages like:
1284 Attempt to free unreferenced scalar: SV 0xa9dd0c4, Perl
1285 interpreter: 0xa8350b8 during global destruction. So the
1286 dbmultistats_quartile test is still disabled.
1287
1288 2.4, 18-Jun-08
1289 Another alpha release, mostly to fix minor usability problems in
1290 dbmapreduce and client functions.
1291
1292 ENHANCEMENT
1293 dbrow now defaults to running user supplied code without warnings
1294 (as with fsdb-1.x). Use "--warnings" or "-w" to turn them back on.
1295
1296 ENHANCEMENT
1297 dbroweval can now write different format output than the input,
1298 using the "-m" option.
1299
1300 KNOWN BUG
1301 dbmapreduce emits warnings on perl 5.10.0 about "Unbalanced string
1302 table refcount" and "Scalars leaked" when run with an external
1303 program as a reducer.
1304
1305 dbmultistats emits the warning "Attempt to free unreferenced
1306 scalar" when run with quartiles.
1307
1308 In each case the output is correct. I believe these can be
1309 ignored.
1310
1311 CHANGE
1312 dbmapreduce no longer logs a line for each reducer that is invoked.
1313
1314 2.5, 24-Jun-08
1315 Another alpha release, fixing more minor bugs in "dbmapreduce" and
1316 lossage in "Fsdb::IO".
1317
1318 ENHANCEMENT
1319 dbmapreduce can now tolerate non-map-aware reducers that pass back
1320 the key column in output. It also passes the current key as the last
1321 argument to external reducers.
1322
1323 BUG FIX
1324 Fsdb::IO::Reader now correctly handles the "-header" option again. (Broken
1325 since fsdb-2.3.)
1326
1327 2.6, 11-Jul-08
1328 Another alpha release, needed to fix DaGronk. One new port, small bug
1329 fixes, and an important fix to dbmapreduce.
1330
1331 ENHANCEMENT
1332 shifting more old programs to Perl modules. New in 2.6:
1333 dbcolpercentile.
1334
1335 INCOMPATIBLE CHANGE and ENHANCEMENTS dbcolpercentile arguments changed,
1336 use "--rank" to require ranking instead of "-r". Also, "--ascending"
1337 and "--descending" can now be specified separately, both for
1338 "--percentile" and "--rank".
1339 BUG FIX
1340 Sigh, the sense of the --warnings option in dbrow was inverted. No
1341 longer.
1342
1343 BUG FIX
1344 I found and fixed the string leaks (errors like "Unbalanced string
1345 table refcount" and "Scalars leaked") in dbmapreduce and
1346 dbmultistats. (All "IO::Handle"s in threads must be manually
1347 destroyed.)
1348
1349 BUG FIX
1350 The "-C" option to specify the column separator in dbcolsplittorows
1351 now works again (broken since it was ported).
1352
1353 2.7, 30-Jul-08 beta
1354
1355 The beta release of fsdb-2.x. Finally, all programs are ported. As
1356 statistics, the number of lines of non-library code doubled from 7.5k
1357 to 15.5k. The libraries are much more complete, going from 866 to 5164
1358 lines. The overall number of programs is about the same, although 19
1359 were dropped and 11 were added. The number of test cases has grown
1360 from 116 to 175. All programs are now in perl-5, no more shell scripts
1361 or perl-4. All programs now have manual pages.
1362
1363 Although this is a major step forward, I still expect to rename "jdb"
1364 to "fsdb".
1365
1366 ENHANCEMENT
1367 shifting more old programs to Perl modules. New in 2.7:
1368 dbcolscorrelate, dbcolsregression, cgi_to_db, dbfilevalidate,
1369 db_to_csv, csv_to_db, db_to_html_table, kitrace_to_db,
1370 tcpdump_to_db, tabdelim_to_db, ns_to_db.
1371
1372 INCOMPATIBLE CHANGE
1373 The following programs have been dropped from fsdb-2.x: db2dcliff,
1374 dbcolmultiscale, crl_to_db, ipchain_logs_to_db. They may come
1375 back, but seemed overly specialized. The program
1376 dbrowsplituniq was dropped because it is superseded by dbmapreduce.
1377 dmalloc_to_db was dropped pending test cases and examples.
1378
1379 ENHANCEMENT
1380 dbfilevalidate now has a "-c" option to correct errors.
1381
1382 NEW html_table_to_db provides the inverse of db_to_html_table.
1383
1384 2.8, 5-Aug-08
1385 Change header format, preserving forwards compatibility.
1386
1387 BUG FIX
1388 Complete editing pass over the manual, making sure it aligns with
1389 fsdb-2.x.
1390
1391 SEMI-COMPATIBLE CHANGE
1392 The header of fsdb files has changed: it is now #fsdb, not #h (or
1393 #L), and parsing of -F and -R is also different. See dbfilealter
1394 for the new specification. The v1 file format will be read,
1395 compatibly, but not written.
1396
1397 BUG FIX
1398 dbmapreduce now tolerates comments that precede the first key,
1399 instead of failing with an error message.
1400
1401 2.9, 6-Aug-08
1402 Still in beta; just a quick bug-fix for dbmapreduce.
1403
1404 ENHANCEMENT
1405 dbmapreduce now generates plausible output when given no rows of
1406 input.
1407
1408 2.10, 23-Sep-08
1409 Still in beta, but picking up some bug fixes.
1410
1411 ENHANCEMENT
1412 dbmapreduce now generates plausible output when given no rows of
1413 input.
1414
1415 ENHANCEMENT
1416 The dbroweval warnings option was backwards; now corrected. As a
1417 result, warnings in user code now default off (like in fsdb-1.x).
1418
1419 BUG FIX
1420 dbcolpercentile now defaults to assuming the target column is
1421 numeric. The new option "-N" allows selection of a non-numeric
1422 target.
1423
1424 BUG FIX
1425 dbcolscorrelate now includes "--sample" and "--nosample" options to
1426 compute the sample or full population correlation coefficients.
1427 Thanks to Xue Cai for finding this bug.
1428
1429 2.11, 14-Oct-08
1430 Still in beta, but picking up some bug fixes.
1431
1432 ENHANCEMENT
1433 html_table_to_db is now more aggressive about filling in empty
1434 cells with the official empty value, rather than leaving them blank
1435 or as whitespace.
1436
1437 ENHANCEMENT
1438 dbpipeline now catches failures during pipeline element setup and
1439 exits reasonably gracefully.
1440
1441 BUG FIX
1442 dbsubprocess now reaps child processes, thus avoiding running out
1443 of processes when used a lot.
1444
1445 2.12, 16-Oct-08
1446 Finally, a full (non-beta) 2.x release!
1447
1448 INCOMPATIBLE CHANGE
1449 Jdb has been renamed Fsdb, the flatfile-streaming database. This
1450 change affects all internal Perl APIs, but no shell command-level
1451 APIs. While Jdb served well for more than ten years, it is easily
1452 confused with the Java debugger (even though Jdb was there first!).
1453 It also is too generic to work well in web search engines.
1454 Finally, Jdb stands for ``John's database'', and we're a bit beyond
1455 that. (However, some call me the ``file-system guy'', so one could
1456 argue it retains that meaning.)
1457
1458 If you just used the shell commands, this change should not affect
1459 you. If you used the Perl-level libraries directly in your code,
1460 you should be able to rename "Jdb" to "Fsdb" to move to 2.12.
1461
1462 The jdb-announce list has not yet been renamed, but it will be shortly.
1463
1464 With this release I've accomplished everything I wanted to in
1465 fsdb-2.x. I therefore expect to return to boring, bugfix releases.
1466
1467 2.13, 30-Oct-08
1468 BUG FIX
1469 dbrowaccumulate now treats non-numeric data as zero by default.
1470
1471 BUG FIX
1472 Fixed a perl-5.10ism in dbmapreduce that breaks that program under
1473 5.8. Thanks to Martin Lukac for reporting the bug.
1474
1475 2.14, 26-Nov-08
1476 BUG FIX
1477 Improved documentation for dbmapreduce's "-f" option.
1478
1479 ENHANCEMENT
1480 dbcolmovingstats now computes a moving standard deviation in
1481 addition to a moving mean.
1482
1483 2.15, 13-Apr-09
1484 BUG FIX
1485 Fix a make install bug reported by Shalindra Fernando.
1486
1487 2.16, 14-Apr-09
1488 BUG FIX
1489 Another minor release bug: on some systems programize_module loses
1490 executable permissions. Again reported by Shalindra Fernando.
1491
1492 2.17, 25-Jun-09
1493 TYPO FIXES
1494 Typo in the dbroweval manual fixed.
1495
1496 IMPROVEMENT
1497 There is no longer a comment line to label columns in dbcolneaten,
1498 instead the header line is tweaked to line up. This change
1499 restores the Jdb-1.x behavior, and means that repeated runs of
1500 dbcolneaten no longer add comment lines each time.
1501
1502 BUG FIX
1503 It turns out dbcolneaten was not correctly handling trailing
1504 spaces when given the "-E" option to suppress them. This
1505 regression is now fixed.
1506
1507 EXTENSION
1508 dbroweval(1) can now handle direct references to the last row via
1509 $lfref, a dubious but now documented feature.
1510
1511 BUG FIXES
1512 Separators set with "-C" in dbcolmerge and dbcolsplittocols were
1513 not properly setting the heading, and null fields were not
1514 recognized. The first bug was reported by Martin Lukac.
1515
1516 2.18, 1-Jul-09 A minor release
1517 IMPROVEMENT
1518 Documentation for Fsdb::IO::Reader has been improved.
1519
1520 IMPROVEMENT
1521 The package should now be PGP-signed.
1522
1523 2.19, 10-Jul-09
1524 BUG FIX
1525 Internal improvements to debugging output and robustness of
1526 dbmapreduce and dbpipeline. TEST/dbpipeline_first_fails.cmd re-
1527 enabled.
1528
1529 2.20, 30-Nov-09 (A collection of minor bugfixes, plus a build against
1530 Fedora 12.)
1531 BUG FIX
1532 Logging for dbmapreduce with code refs is now stable (it no longer
1533 includes a hex pointer to the code reference).
1534
1535 BUG FIX
1536 Better handling of mixed blank lines in Fsdb::IO::Reader (see test
1537 case dbcolize_blank_lines.cmd).
1538
1539 BUG FIX
1540 html_table_to_db now handles multi-line input better, and handles
1541 tables with COLSPAN.
1542
1543 BUG FIX
1544 dbpipeline now cleans up threads in an "eval" to prevent "cannot
1545 detach a joined thread" errors that popped up in perl-5.10.
1546 Hopefully this prevents a race condition that causes the test
1547 suites to hang about 20% of the time (in dbpipeline_first_fails).
1548
1549 IMPROVEMENT
1550 dbmapreduce now detects and correctly fails when the input and
1551 reducer have incompatible field separators.
1552
1553 IMPROVEMENT
1554 dbcolstats, dbcolhisto, dbcolscorrelate, dbcolsregression, and
1555 dbrowcount now all take an "-F" option to let one specify the
1556 output field separator (so they work better with dbmapreduce).
1557
1558 BUG FIX
1559 An omitted "-k" from the manual page of dbmultistats is now there.
1560 Bug reported by Unkyu Park.
1561
1562 2.21, 17-Apr-10 bug fix release
1563 BUG FIX
1564 Fsdb::IO::Writer now no longer fails with -outputheader => never
1565 (an obscure bug).
1566
1567 IMPROVEMENT
1568 Fsdb (in the warnings section) and dbcolstats now more carefully
1569 document how they handle (and do not handle) numerical precision
1570 problems, and other general limits. Thanks to Yuri Pradkin for
1571 prompting this documentation.
1572
1573 IMPROVEMENT
1574 "Fsdb::Support::fullname_to_sortkey" is now restored from "Jdb".
1575
1576 IMPROVEMENT
1577 Documentation for multiple styles of input approaches (including
1578 performance description) added to Fsdb::IO.
1579
1580 2.22, 2010-10-31 One new tool dbcolcopylast and several bug fixes for Perl
1581 5.10.
1582 BUG FIX
1583 dbmerge now correctly handles n-way merges. Bug reported by Yuri
1584 Pradkin.
1585
1586 INCOMPATIBLE CHANGE
1587 dbcolneaten now defaults to not padding the last column.
1588
1589 ADDITION
1590 dbrowenumerate now takes -N NewColumn to give the new column a name
1591 other than "count". Feature requested by Mike Rouch in January
1592 2005.
1593
1594 ADDITION
1595 New program dbcolcopylast copies the last value of a column into a
1596 new column copylast_column of the next row. New program requested
1597 by Fabio Silva; useful for converting dbmultistats output into
1598 dbrvstatdiff input.
1599
1600 BUG FIX
1601 Several tools (particularly dbmapreduce and dbmultistats) would
1602 report errors like "Unbalanced string table refcount: (1) for
1603 "STDOUT" during global destruction" on exit, at least on certain
1604 versions of Perl (for me on 5.10.1), but similar errors have been
1605 off-and-on for several Perl releases. Although I think my code
1606 looked OK, I worked around this problem with a different way of
1607 handling standard IO redirection.
1608
1609 2.23, 2011-03-10 Several small portability bugfixes; improved dbcolstats
1610 for large datasets
1611 IMPROVEMENT
1612 Documentation to dbrvstatdiff was changed to use "sd" to refer to
1613 standard deviation, not "ss" (which might be confused with sum-of-
1614 squares).
1615
1616 BUG FIX
1617 This documentation about dbmultistats was missing the -k option in
1618 some cases.
1619
1620 BUG FIX
1621 dbmapreduce was failing on MacOS-10.6.3 for some tests with the
1622 error
1623
1624 dbmapreduce: cannot run external dbmapreduce reduce program (perl TEST/dbmapreduce_external_with_key.pl)
1625
1626 The problem seemed to be only in the error, not in operation. On
1627 MacOS, the error is now suppressed. Thanks to Alefiya Hussain for
1628 providing access to a Mac system that allowed debugging of this
1629 problem.
1630
1631 IMPROVEMENT
1632 The csv_to_db command requires an external Perl library
1633 (Text::CSV_XS). On computers that lack this optional library,
1634 previously Fsdb would configure with a warning and then test cases
1635 would fail. Now those test cases are skipped with an additional
1636 warning.
1637
1638 BUG FIX
1639 The test suite now supports alternative valid output, as a hack to
1640 account for last-digit floating point differences. (Not very
1641 satisfying :-( )
1642
1643 BUG FIX
1644 dbcolstats output for confidence intervals on very large datasets
1645 has changed. Previously it failed for more than 2^31-1 records,
1646 and handling of T-Distributions with thousands of rows was a bit
1647 dubious. Now datasets with more than 10000 rows are considered
1648 infinitely large and hopefully correctly handled.
1649
1650 2.24, 2011-04-15 Improvements to fix an old bug in dbmapreduce with
1651 different field separators
1652 IMPROVEMENT
1653 The dbfilealter command had a "--correct" option to work around
1654 incompatible field separators, but it did nothing. Now it
1655 does the correct but sad, data-losing thing.
1656
1657 IMPROVEMENT
1658 The dbmultistats command previously failed with an error message
1659 when invoked on input with a non-default field separator. The root
1660 cause was the underlying dbmapreduce that did not handle the case
1661 of reducers that generated output with a different field separator
1662 than the input. We now detect and repair incompatible field
1663 separators. This change corrects a problem originally documented
1664 and detected in Fsdb-2.20. Bug re-reported by Unkyu Park.
1665
1666 2.25, 2011-08-07 Two new tools, xml_to_db and dbfilepivot, and a bugfix for
1667 two people.
1668 IMPROVEMENT
1669 kitrace_to_db now supports a --utc option, which also fixes this
1670 test case for users outside of the Pacific time zone. Bug reported
1671 by David Graff, and also by Peter Desnoyers (within a week of each
1672 other :-)
1673
1674 NEW xml_to_db can convert simple, very regular XML files into Fsdb.
1675
1676 NEW dbfilepivot "pivots" a file, converting multiple rows corresponding
1677 to the same entity into a single row with multiple columns.
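
The idea of pivoting can be illustrated with standard tools. The following is a conceptual sketch only, not the dbfilepivot implementation: it assumes tab-separated input with three fields (entity, attribute, value), holds everything in memory, prints a plain column list rather than a real Fsdb header, and uses "-" (the default empty value) for missing cells. The function name pivot and the sample data are hypothetical.

```shell
#!/bin/sh
# Pivot rows of (entity, attribute, value) into one row per entity.
pivot() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }    # skip the header and any comments
        {
            # remember entities and attributes in first-seen order
            if (!($1 in seen_key)) { seen_key[$1] = 1; keys[++nk] = $1 }
            if (!($2 in seen_col)) { seen_col[$2] = 1; cols[++nc] = $2 }
            val[$1, $2] = $3
        }
        END {
            line = "key"
            for (j = 1; j <= nc; j++) line = line OFS cols[j]
            print line
            for (i = 1; i <= nk; i++) {
                line = keys[i]
                for (j = 1; j <= nc; j++) {
                    cell = "-"    # placeholder for a missing value
                    if ((keys[i], cols[j]) in val) cell = val[keys[i], cols[j]]
                    line = line OFS cell
                }
                print line
            }
        }
    ' "$1"
}

# Hypothetical sample: two students, two tests, one missing score.
sample=$(mktemp)
printf '#fsdb student test score\n' > "$sample"
printf 'alice\ttest1\t90\nalice\ttest2\t85\nbob\ttest1\t70\n' >> "$sample"
pivot "$sample"
rm -f "$sample"
```

For the sample above this prints a "key test1 test2" line followed by "alice 90 85" and "bob 70 -" (tab-separated).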
1678
1679 2.26, 2011-12-12 Bug fixes, particularly for perl-5.14.2.
1680 BUG FIX
1681 Bugs fixed in Fsdb::IO::Reader(3) manual page.
1682
1683 BUG FIX
1684 Fixed problems where dbcolstats was truncating floating point
1685 numbers when sorting. This strange behavior happens as of
1686 perl-5.14.2 and it seems like a Perl bug. I've worked around it
1687 for the test suites, but I'm a bit nervous.
1688
1689 2.27, 2012-11-15 Accumulated bug fixes.
1690 IMPROVEMENT
1691 csv_to_db now reports errors in CSV input with real diagnostics.
1692
1693 IMPROVEMENT
1694 dbcolmovingstats can now compute median, when given the "-m"
1695 option.
1696
1697 BUG FIX
1698 dbcolmovingstats non-numeric handling (the "-a" option) now works
1699 properly.
1700
1701 DOCUMENTATION
1702 The internal t/test_command.t test framework is now documented.
1703
1704 BUG FIX
1705 dbrowuniq now correctly handles the case where there is no input
1706 (previously it output a blank line, which is a malformed fsdb
1707 file). Thanks to Yuri Pradkin for reporting this bug.
1708
1709 2.28, 2012-11-15 A quick release to fix most rpmlint errors.
1710 BUG FIX
1711 Fixed a number of minor release problems (wrong permissions, old
1712 FSF address, etc.) found by rpmlint.
1713
1714 2.29, 2012-11-20 a quick release for CPAN testing
1715 IMPROVEMENT
1716 Tweaked the RPM spec.
1717
1718 IMPROVEMENT
1719 Modified Makefile.PL to fail gracefully on Perl installations that
1720 lack threads. (Without this fix, I get massive failures in the
1721 non-ithreads test system.)
1722
1723 2.30, 2012-11-25 improvements to perl portability
1724 BUG FIX
1725 Removed unicode character in documentation of dbcolscorrelate so pod
1726 tests will pass. (Sigh, that should work :-( )
1727
1728 BUG FIX
1729 Fixed test suite failures on 5 tests (dbcolcreate_double_creation
1730 was the first) due to Carp's addition of a period. This problem
1731 was breaking Fsdb on perl-5.17. Thanks to Michael McQuaid for
1732 helping diagnose this problem.
1733
1734 IMPROVEMENT
1735 The test suite now prints out the names of tests it tries.
1736
1737 2.31, 2012-11-28 A release with actual improvements to dbfilepivot and
1738 dbrowuniq.
1739 BUG FIX
1740 Documentation fixes: typos in dbcolscorrelate, bugs in
1741 dbfilepivot, clarification for comment handling in
1742 Fsdb::IO::Reader.
1743
1744 IMPROVEMENT
1745 Previously dbfilepivot assumed the input was grouped by keys and
1746 didn't verify that pre-condition. Now there is no pre-condition (it
1747 will sort the input by default), and it checks if the invariant is
1748 violated.
1749
1750 BUG FIX
1751 Previously dbfilepivot failed if the input had comments (oops :-);
1752 no longer.
1753
1754 IMPROVEMENT
1755 Now dbrowuniq has the "-L" option to preserve the last unique row
1756 (instead of the first), a common idiom.
1757
1758 2.32, 2012-12-21 Test suites should now be more numerically robust.
1759 NEW New dbfilediff does fsdb-aware file differencing. It does not do
1760 smart intuition of add/removes like Unix diff(1), but it does know
1761 about columns, and with "-E", it does numeric-aware differences.
1762
1763 IMPROVEMENT
1764 Test suites that are numeric now use dbfilediff to do numeric-aware
1765 comparisons, so the test suite should now be robust to slightly
1766 different computers and operating systems and compilers than
1767 exactly what I use.
1768
1769 2.33, 2012-12-23 Minor fixes to some test cases.
1770 IMPROVEMENT
1771 dbfilediff and dbrowuniq now support the "-N" option to give the
1772 new column a different name. (And test cases where this
1773 duplication mattered have been fixed.)
1774
1775 IMPROVEMENT
1776 dbrvstatdiff now shows the t-test breakpoint with a reasonable
1777 number of floating point digits.
1778
1779 BUG FIX
1780 Fixed a numerical stability problem in the dbroweval_last test
1781 case.
1782
1784 2.34, 2013-02-10 Parallelism in dbmerge.
1785 IMPROVEMENT
1786 Documentation for dbjoin now includes resource requirements.
1787
1788 IMPROVEMENT
1789 Default memory usage for dbsort is now about 256MB. (The world
1790 keeps moving forward.)
1791
1792 IMPROVEMENT
1793 dbmerge now does merging in parallel. As a side-effect, dbsort
1794 should be faster when input overflows memory. The level of
1795 parallelism can be limited with the "--parallelism" option. (There
1796 is more work to do here, but we're off to a start.)
1797
1798 2.35, 2013-02-23 Improvements to dbmerge parallelism
1799 BUG FIX
1800 Fsdb temporary files are now created more securely (with
1801 File::Temp).
1802
1803 IMPROVEMENT
1804 Programs that sort or merge on fields (dbmerge2, dbmerge, dbsort,
1805 dbjoin) now report an error if no fields on which to join or merge
1806 are given.
1807
1808 IMPROVEMENT
1809 Parallelism in dbmerge should now be more consistent, with less
1810 starting and stopping.
1811
1812 IMPROVEMENT In dbmerge, the "--xargs" option lets one give input
1813 filenames on standard input, rather than the command line. This feature
1814 paves the way for faster dbsort for large inputs (by pipelining sorting
1815 and merging), expected in the next release.
1816
1817 2.36, 2013-02-25 dbsort pipelines with dbmerge
1818 IMPROVEMENT For large inputs, dbsort now pipelines sorting and merging,
1819 allowing earlier processing.
1820 BUG FIX Since 2.35, dbmerge delayed cleanup of intermediate files,
1821 thereby requiring extra disk space.
1822
1823 2.37, 2013-02-26 quick bugfix to support parallel sort and merge from
1824 recent releases
1825 BUG FIX Since 2.35, dbmerge delayed removal of input files given by
1826 "--xargs". This problem is now fixed.
1827
1828 2.38, 2013-04-29 minor bug fixes
1829 CLARIFICATION
1830 Configure now rejects Windows since tests seem to hang on some
1831 versions of Windows. (I would love help from a Windows developer
1832 to get this problem fixed, but I cannot do it.) See
1833 https://rt.cpan.org/Ticket/Display.html?id=84201.
1834
1835 IMPROVEMENT
1836 All programs that use temporary files (dbcolpercentile,
1837 dbcolscorrelate, dbcolstats, dbcolstatscores) now take the "-T"
1838 option and set the temporary directory consistently.
1839
1840 In addition, error messages are better when the temporary directory
1841 has problems. Problem reported by Liang Zhu.
1842
1843 BUG FIX
1844 dbmapreduce was failing with external, map-reduce aware reducers
1845 (when invoked with -M and an external program). (Sigh, did this
1846 case ever work?) This case should now work. Thanks to Yuri
1847 Pradkin for reporting this bug (in 2011).
1848
1849 BUG FIX
1850 Fixed perl-5.10 problem with dbmerge. Thanks to Yuri Pradkin for
1851 reporting this bug (in 2013).
1852
1853 2.39, 2013-05-31 quick release for the dbrowuniq extension
1854 BUG FIX
1855 Actually in 2.38, the Fedora .spec got cleaner dependencies.
1856 Suggestion from Christopher Meng via
1857 <https://bugzilla.redhat.com/show_bug.cgi?id=877096>.
1858
1859 ENHANCEMENT
1860 Fsdb files are now explicitly set into UTF-8 encoding, unless one
1861 specifies "-encoding" to "Fsdb::IO".
1862
1863 ENHANCEMENT
1864 dbrowuniq now supports "-I" for incremental counting.
1865
1866 2.40, 2013-07-13 small bug fixes
1867 BUG FIX
1868 dbsort now has more respect for a user-given temporary directory;
1869 it no longer is ignored for merging.
1870
1871 IMPROVEMENT
1872 dbrowuniq now has options to output the first, last, and both first
1873 and last rows of a run ("-F", "-L", and "-B").
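
The idea of keeping the first, last, or both rows of each run can be sketched in Python as follows. This only illustrates the concept and is not how dbrowuniq is written; the function name, the mode strings, and the sample rows are hypothetical, and the choice to emit a one-row run only once in "both" mode is an assumption of the sketch.

```python
from itertools import groupby


def uniq_runs(rows, key, mode="first"):
    """Reduce each run of adjacent rows sharing a key to its first row,
    its last row, or both. Hypothetical helper, not part of Fsdb."""
    out = []
    for _, run in groupby(rows, key=key):
        run = list(run)
        if mode == "first":
            out.append(run[0])
        elif mode == "last":
            out.append(run[-1])
        elif mode == "both":
            out.append(run[0])
            if len(run) > 1:  # a one-row run is emitted only once
                out.append(run[-1])
        else:
            raise ValueError("unknown mode: %r" % (mode,))
    return out


# Made-up input: (name, sequence number) pairs, already grouped by name.
rows = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("c", 5), ("c", 6)]


def by_name(row):
    return row[0]


first = uniq_runs(rows, by_name, "first")
last = uniq_runs(rows, by_name, "last")
both = uniq_runs(rows, by_name, "both")
```

Like any run-based approach, the sketch only compares adjacent rows, so the input must already be grouped by the key.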
1874
1875 BUG FIX
1876 dbrowuniq now correctly handles "-N". Sigh, it didn't work before.
1877
1878 2.41, 2013-07-29 small bug and packaging fixes
1879 ENHANCEMENT
1880 Documentation to dbrvstatdiff improved (inspired by questions from
1881 Qian Kun).
1882
1883 BUG FIX
1884 dbrowuniq no longer duplicates singleton unique lines when
1885 outputting both (with "-B").
1886
1887 BUG FIX
1888 Add missing "XML::Simple" dependency to Makefile.PL.
1889
1890 ENHANCEMENT
1891 Tests now show the diff of the failing output if run with "make
1892 test TEST_VERBOSE=1".
1893
1894 ENHANCEMENT
1895 dbroweval now includes documentation for how to output extra rows.
1896 Suggestion from Yuri Pradkin.
1897
1898 BUG FIX
1899 Several improvements to the Fedora package from Michael Schwendt
1900 via <https://bugzilla.redhat.com/show_bug.cgi?id=877096>, and from
1901 the harsh master that is rpmlint. (I am stymied at teaching it
1902 that "outliers" is spelled correctly. Maybe I should send it
1903 Schneier's book. And an unresolvable invalid-spec-name lurks in
1904 the SRPM.)
1905
1906 2.42, 2013-07-31 A bug fix and packaging release.
1907 ENHANCEMENT
1908 Documentation to dbjoin improved to better describe memory usage. (Based on
1909 problem report by Lin Quan.)
1910
1911 BUG FIX
1912 The .spec is now perl-Fsdb.spec to satisfy rpmlint. Thanks to
1913 Christopher Meng for a specific bug report.
1914
1915 BUG FIX
1916 Test dbroweval_last.cmd no longer has a column that caused failures
1917 because of numerical instability.
1918
1919 BUG FIX
1920 Some tests now better handle bugs in old versions of perl (5.10,
1921 5.12). Thanks to Calvin Ardi for help debugging this on a Mac with
1922 perl-5.12, but the fix should affect other platforms.
1923
1924 2.43, 2013-08-27 Adds in-file compression.
1925 BUG FIX
1926 Changed the sort on TEST/dbsort_merge.cmd to strings (from
1927 numerics) so we're less susceptible to false test-failures due to
1928 floating point IO differences.
1929
1930 EXPERIMENTAL ENHANCEMENT
1931 Yet more parallelism in dbmerge: new "endgame-mode" builds a merge
1932 tree of processes at the end of large merge tasks to get maximal
1933 parallelism. Currently this feature is off by default because it
1934 can hang for some inputs. Enable this experimental feature with
1935 "--endgame".
1936
1937 ENHANCEMENT
1938 "Fsdb::IO" now handles being given "IO::Pipe" objects (as exercised
1939 by dbmerge).
1940
1941 BUG FIX
1942 Handling of NamedTmpfiles now supports concurrency. This fix will
1943 hopefully fix occasional "Use of uninitialized value $_ in string
1944 ne at ...NamedTmpfile.pm line 93." errors.
1945
1946 BUG FIX
1947 Fsdb now requires perl 5.10. This is a bug fix because some test
1948 cases used to require it, but this fact was not properly
1949 documented. (Back-porting to 5.008 would require removing all "//"
1950 operators.)
1951
1952 ENHANCEMENT
1953 Fsdb now handles automatic compression of file contents. Enable
1954 compression with "dbfilealter -Z xz" (or "gz" or "bz2"). All
1955 programs should operate on compressed files and leave the output
1956 with the same level of compression. "xz" is recommended as fastest
1957 and most efficient. "gz" produces unrepeatable output (and so
1958 has no output test); it seems to insist on adding a timestamp.
1959
1960 2.44, 2013-10-02 A major change--all threads are gone.
1961 ENHANCEMENT
1962 Fsdb is now thread free and only uses processes for parallelism.
1963 This change is a big change--the entire motivation for Fsdb-2 was
1964 to exploit parallelism via threading. Parallelism--good, but perl
1965 threading--bad for performance. Horribly bad for performance.
1966 About 20x worse than pipes on my box. (See perl bug #119445 for
1967 the discussion.)
1968
1969 NEW "Fsdb::Support::Freds" provides a thread-like abstraction over
1970 forking, with some nice support for callbacks in the parent upon
1971 child termination.
1972
1973 ENHANCEMENT
1974 Details about removing threads: "dbpipeline" is thread free, and
1975 new tests to verify each of its parts. The easy cases are
1976 "dbcolpercentile", "dbcolstats", "dbfilepivot", "dbjoin", and
1977 "dbcolstatscores", each of which uses it in simple ways
1978 (2013-09-09). "dbmerge" is now thread free (2013-09-13), but was a
1979 significant rewrite, which brought "dbsort" along. "dbmapreduce"
1980 is partly thread free (2013-09-21), again as a rewrite, and it
1981 brings "dbmultistats" along. Full "dbmapreduce" support took much
1982 longer (2013-10-02).
1983
1984 BUG FIX
1985 When running with user-only output ("-n"), dbroweval now resets the
1986 output vector $ofref after it has been output.
1987
1988 NEW dbcolcreate will create all columns at the head of each row with
1989 the "--first" option.
1990
1991 NEW dbfilecat will concatenate two files, verifying that they have the
1992 same schema.
1993
1994 ENHANCEMENT
1995 dbmapreduce now passes comments through, rather than eating them as
1996 before.
1997
1998 Also, dbmapreduce now supports a "--" option to prevent
1999 misinterpreting sub-program parameters as for dbmapreduce.
2000
2001 INCOMPATIBLE CHANGE
2002 dbmapreduce no longer figures out if it needs to add the key to the
2003 output. For multi-key-aware reducers, it never does (and cannot).
2004 For non-multi-key-aware reducers, it defaults to add the key and
2005 will now fail if the reducer adds the key (with error "dbcolcreate:
2006 attempt to create pre-existing column..."). In such cases, one
2007 must disable adding the key with the new option "--no-prepend-key".
2008
2009 INCOMPATIBLE CHANGE
2010 dbmapreduce no longer copies the input field separator by default.
2011 For multi-key-aware reducers, it never does (and cannot). For non-
2012 multi-key-aware reducers, it defaults to not copying the field
2013 separator, but it will copy it (the old default) with the
2014 "--copy-fs" option
2015
2016 2.45, 2013-10-07 cleanup from de-thread-ification
2017 BUG FIX
2018 Corrected a fast busy-wait in dbmerge.
2019
2020 ENHANCEMENT
2021 Endgame mode enabled in dbmerge; it (and also large cases of
2022 dbsort) should now exploit greater parallelism.
2023
2024 BUG FIX
2025 Test case with "Fsdb::BoundedQueue" (gone since 2.44) now removed.
2026
2027 2.46, 2013-10-08 continuing cleanup of our no-threads version
2028 BUG FIX
2029 Fixed some packaging details. (Really, threads are no longer
2030 required, missing tests in the MANIFEST.)
2031
2032 IMPROVEMENT
2033 dbsort now better communicates with the merge process to avoid
2034 bursty parallelism.
2035
2036 Fsdb::IO::Writer now can take "-autoflush => 1" for line-buffered
2037 IO.
2038
2039 2.47, 2013-10-12 test suite cleanup for non-threaded perls
2040 BUG FIX
2041 Removed some stray "use threads" in some test cases. We didn't
2042 need them, and these were breaking non-threaded perls.
2043
2044 BUG FIX
2045 Better handling of Fred cleanup; should fix intermittent
2046 dbmapreduce failures on BSD.
2047
2048 ENHANCEMENT
2049 Improved test framework to show output when tests fail. (This
2050 time, for real.)
2051
2052 2.48, 2014-01-03 small bugfixes and improved release engineering
2053 ENHANCEMENT
2054 Test suites now skip tests for libraries that are missing. (Patch
2055 for missing "IO::Compress::Xz" contributed by Calvin Ardi.)
2056
2057 ENHANCEMENT
2058 Removed references to Jdb in the package specification. Since the
2059 name was changed in 2008, there's no longer a huge need for
2060 backwards compatibility. (Suggestion from Petr Šabata.)
2061
2062 ENHANCEMENT
2063 Test suites now invoke the perl using the path from
2064 $Config{perlpath}. Hopefully this helps testing in environments
2065 where there are multiple installed perls and the default perl is
2066 not the same as the perl-under-test (as happens in
2067 cpantesters.org).
2068
2069 BUG FIX
2070 Added specific encoding to this manpage to account for Unicode.
2071 Required to build correctly against perl-5.18.
2072
2073 2.49, 2014-01-04 bugfix to unicode handling in Fsdb IO (plus minor
2074 packaging fixes)
2075 BUG FIX
2076 Restored a line in the .spec to chmod g-s.
2077
2078 BUG FIX
2079 Unicode decoding is now handled correctly for programs that read
2080 from standard input. (Also: New test scripts cover unicode input
2081 and output.)
2082
2083 BUG FIX
2084 Fix to Fsdb documentation encoding line. Addresses test failure in
2085 perl-5.16 and earlier. (Who knew "encoding" had to be followed by
2086 a blank line.)
2087
2089 2.50, 2014-05-27 a quick release for spec tweaks
2090 ENHANCEMENT
2091 In dbroweval, the "-N" (no output, even comments) option now
2092 implies "-n", and it now suppresses the header and trailer.
2093
2094 BUG FIX
2095 A few more tweaks to the perl-Fsdb.spec from Petr Šabata.
2096
2097 BUG FIX
2098 Fixed 3 uses of "use v5.10" in test suites that were causing test
2099 failures (due to warnings, not real failures) on some platforms.
2100
2101 2.51, 2014-09-05 Feature enhancements to dbcolmovingstats, dbcolcreate,
2102 dbmapreduce, and new sqlselect_to_db
2103 ENHANCEMENT
2104 dbcolcreate now has a "--no-recreate-fatal" that causes it to
2105 ignore creation of existing columns (instead of failing).
2106
2107 ENHANCEMENT
2108 dbmapreduce once again is robust to reducers that output the key;
2109 "--no-prepend-key" is no longer mandatory.
2110
2111 ENHANCEMENT
2112 dbcolsplittorows can now enumerate the output rows with "-E".
2113
2114 BUG FIX
2115 dbcolmovingstats is more mathematically robust. Previously for
2116 some inputs and some platforms, floating point rounding could
2117 sometimes cause square roots of negative numbers.
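
To see how rounding can lead to the square root of a negative number, consider the sum-of-squares form of the variance. The Python sketch below guards against it by clamping the variance at zero. This is one common defence and is only assumed here for illustration; the source does not say how dbcolmovingstats itself was fixed, and the function name is hypothetical.

```python
import math


def sample_stddev(values):
    """Sample standard deviation via running sums, clamped so rounding
    error can never produce sqrt() of a negative number."""
    n = len(values)
    if n < 2:
        return 0.0
    total = sum(values)
    total_sq = sum(v * v for v in values)
    # For near-constant input the two terms nearly cancel, and the
    # difference can come out as a tiny negative number.
    variance = (total_sq - total * total / n) / (n - 1)
    return math.sqrt(max(variance, 0.0))


flat = sample_stddev([0.1] * 10)        # constant input: essentially zero
spread = sample_stddev([1.0, 2.0, 3.0])
```

Without the max() call, near-constant input could make math.sqrt raise ValueError on platforms where the cancellation rounds below zero.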
2118
2119 NEW sqlselect_to_db converts the output of the MySQL or MariaDB select
2120 command into fsdb format.
2121
2122 INCOMPATIBLE CHANGE
2123 dbfilediff now outputs the second row when doing sloppy numeric
2124 comparisons, to better support test suites.
2125
2126 2.52, 2014-11-03 Fixing the test suite for line number changes.
2127 ENHANCEMENT
2128 Test suites are now robust to exact line numbers of failures,
2129 since different Perl releases fail on different lines.
2130 <https://bugzilla.redhat.com/show_bug.cgi?id=1158380>
2131
2132 2.53, 2014-11-26 bug fixes and stability improvements to dbmapreduce
2133 ENHANCEMENT
2134 dbfilediff now supports a "--quiet" option.
2135
2136 ENHANCEMENT
2137 Better documentation of dbpipeline_filter.
2138
2139 BUGFIX
2140 Added groff-base and perl-podlators to the Fedora package spec.
2141 Fixes <https://bugzilla.redhat.com/show_bug.cgi?id=1163149>. (Also
2142 in package 2.52-2.)
2143
2144 BUGFIX
2145 An important stability improvement to dbmapreduce. It, plus
2146 dbmultistats, and dbcolstats now support controlled parallelism
2147 with the "--parallelism=N" option. They default to run with the
2148 number of available CPUs. dbmapreduce also moderates its level of
2149 parallelism. Previously it would create reducers as needed,
2150 causing CPU thrashing if reducers ran much slower than data
2151 production.
2152
2153 BUGFIX
2154 The combination of dbmapreduce with dbrowenumerate now works as it
2155 should. (The obscure bug was an interaction with dbcolcreate with
2156 non-multi-key reducers that output their own key. dbmapreduce has
2157 too many useful corner cases.)
2158
2159 2.54, 2014-11-28 fix for the test suite to correct failing tests on not-my-
2160 platform
2161 BUGFIX
2162 Sigh, the test suite now has a test suite. Because, yes, I broke
2163 it, causing many incorrect failures at cpantesters. Now fixed.
2164
2165 2.55, 2015-01-05 many spelling fixes and dbcolmovingstats tests are more
2166 robust to different numeric precision
2167 ENHANCEMENT
2168 dbfilediff now can be extra quiet, as I continue to try to track
2169 down a numeric difference on FreeBSD AMD boxes.
2170
2171 ENHANCEMENT
2172 dbcolmovingstats gave different test output (just reflecting
2173 rounding error) when stddev approaches zero. We now detect and
2174 handle this case. See
2175 <https://rt.cpan.org/Public/Bug/Display.html?id=101220> and thanks
2176 to H. Merijn Brand for the bug report.
2177
2178 BUG FIX
2179 Many, many spelling bugs found by H. Merijn Brand; thanks for the
2180 bug report.
2181
2182 INCOMPATIBLE CHANGE
2183 A number of programs had misspelled "separator" in
2184 "--fieldseparator" and "--columnseparator" options as "seperator".
2185 These are now correctly spelled.
2186
2187 2.56, 2015-02-03 fix against Getopt::Long-2.43's stricter error checking
2188 BUG FIX
2189 Internal argument parsing uses Getopt::Long, but mixed pass-through
2190 and <>. Bug reported by Petr Pisar at
2191 <https://bugzilla.redhat.com/show_bug.cgi?id=1188538>.
2192
2193 BUG FIX
2194 Added missing BuildRequires for "XML::Simple".
2195
2196 2.57, 2015-04-29 Minor changes, with better performance from dbmultistats.
2197 BUG FIX
2198 dbfilecat now honors "--remove-inputs" (previously it didn't).
2199 This omission meant that dbmapreduce (and dbmultistats) would
2200 accumulate files in /tmp when running. Bad news for inputs with 4M
2201 keys.
2202
2203 ENHANCEMENT
2204 dbmultistats should be faster with lots of small keys. dbcolstats
2205 now supports "-k" to get some of the functionality of dbmultistats
2206 (if data is pre-sorted and median/quartiles are not required).
2212
2213 2.58, 2015-04-30 Bugfix in dbmerge
2214 BUG FIX
2215 Fixed a case where dbmerge suffered mojibake in endgame mode. This
2216 bug surfaced when dbsort was applied to large files (big enough to
2217 require merging) with unicode in them; the symptom was something
2218 like:
2219 Wide character in print at /usr/lib64/perl5/IO/Handle.pm line
2220 420, <GEN12> line 111.
2221
2222 2.59, 2016-09-01 Collect a few small bug fixes and documentation
2223 improvements.
2224 BUG FIX
2225 More IO is explicitly marked UTF-8 to avoid Perl's tendency to
2226 mojibake on otherwise valid unicode input. This change helps
2227 html_table_to_db.
2228
2229 ENHANCEMENT
2230 dbcolscorrelate now cross-references dbcolsregression.
2231
2232 ENHANCEMENT
2233 Documentation for dbrowdiff now clarifies that the default is
2234 baseline mode.
2235
2236 BUG FIX
2237 dbjoin now propagates "-T" into the sorting process (if it is
2238 required). Thanks to Lan Wei for reporting this bug.
2239
2240 2.60, 2016-09-04 Adds support for hash joins.
2241 ENHANCEMENT
2242 dbjoin now supports hash joins with "-t lefthash" and "-t
2243 righthash". Hash joins cache a table in memory, but do not require
2244 that the other table be sorted. They are ideal when joining a
2245 large table against a small one.
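
The trade-off described above (one table cached in memory, the other streamed in any order) can be sketched in a few lines of Python. This illustrates the general hash-join technique under assumed names and made-up data; it is not dbjoin's code.

```python
def hash_join(small, large, key):
    """Inner join: cache `small` in a dict keyed on `key`, then stream
    `large` once in whatever order it arrives.

    Hypothetical helper for illustration; not part of Fsdb.
    """
    index = {}
    for row in small:
        index.setdefault(row[key], []).append(row)
    for row in large:
        for match in index.get(row[key], ()):
            merged = dict(match)
            merged.update(row)
            yield merged


# Made-up tables: a small lookup table and a larger, unsorted stream.
sites = [{"host": "a", "site": "lab"}, {"host": "b", "site": "dc"}]
probes = [
    {"host": "b", "rtt": "12"},
    {"host": "a", "rtt": "7"},
    {"host": "z", "rtt": "3"},   # no match in sites, so it is dropped
    {"host": "a", "rtt": "9"},
]
joined = list(hash_join(sites, probes, "host"))
```

Memory use grows with the cached table only, which is why the approach suits a large table joined against a small one.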
2246
2247 2.61, 2016-09-05 Support left and right outer joins.
2248 ENHANCEMENT
2249 dbjoin now handles left and right outer joins with "-t left" and
2250 "-t right".
2251
2252 ENHANCEMENT
2253 dbjoin hash joins are now selected with "-m lefthash" and "-m
2254 righthash" (not the shortlived "-t righthash" option).
2255 (Technically this change is incompatible with Fsdb-2.60, but no one
2256 but me ever used that version.)
2257
2258 2.62, 2016-11-29 A new yaml_to_db and other minor improvements.
2259 ENHANCEMENT
2260 Documentation for xml_to_db now includes sample output.
2261
2262 NEW yaml_to_db converts a specific form of YAML to fsdb.
2263
2264 BUG FIX
2265 The test suite now uses "diff -c -b" rather than "diff -cb" to make
2266 OpenBSD-5.9 happier, I hope.
2267
2268 ENHANCEMENT
2269 Comments that log operations at the end of each file now do simple
2270 quoting of spaces. (It is not guaranteed to be fully shell-
2271 compliant.)
2272
2273 ENHANCEMENT
2274 There is a new standard option, "--header", allowing one to specify
2275 an Fsdb header for inputs that lack it. Currently it is supported
2276 by dbcoldefine, dbrowuniq, dbmapreduce, dbmultistats, dbsort,
2277 dbpipeline.
2278
2279 ENHANCEMENT
2280 dbfilepivot now allows the --possible-pivots option, and if it is
2281 provided processes the data in one pass.
2282
2283 ENHANCEMENT
2284 dbroweval logs are now quoted.
2285
2286 2.63, 2017-02-03 Re-add some features supposedly in 2.62 but not, and add
2287 more --header options.
2288 ENHANCEMENT
2289 The option -j is now a synonym for --parallelism. (And several
2290 documentation bugs about this option are fixed.)
2291
2292 ENHANCEMENT
2293 Additional support for "--header" in dbcolmerge, dbcol, dbrow, and
2294 dbroweval.
2295
2296 BUG FIX
2297 Version 2.62 was supposed to have this improvement, but did not
2298 (and now does): dbfilepivot now allows the --possible-pivots
2299 option, and if it is provided processes the data in one pass.
2300
2301 BUG FIX
2302 Version 2.62 was supposed to have this improvement, but did not
2303 (and now does): dbroweval logs are now quoted.
2304
2305 2.64, 2017-11-20 several small bugfixes and enhancements
2306 BUG FIX
2307 In dbroweval, the "next row" option previously did not correctly
2308 set up "_last_fieldname". It now does.
2309
2310 ENHANCEMENT
2311 The csv_to_db converter now has an optional "-F x" option to set
2312 the field separator.
2313
2314 ENHANCEMENT
2315 Finally dbcolsplittocols has a "--header" option, and a new "-N"
2316 option to give the list of resulting output columns.
2317
2318 INCOMPATIBLE CHANGE
2319 Now dbcolstats and dbmultistats produce no output (but a schema)
2320 when given no input but a schema. Previously they gave a null row
2321 of output. The "--output-on-no-input" and
2322 "--no-output-on-no-input" options can control this behavior.
2323
2324 2.65, 2018-02-16 Minor release, bug fix and -F option.
2325 ENHANCEMENT
2326 dbmultistats and dbmapreduce now both take a "-F x" option to set
2327 the field separator.
2328
2329 BUG FIX
2330 Fixed missing "use Carp" in dbcolstats. Also went back and cleaned
2331 up all uses of "croak()". Thanks to Zefram for the bug report.
2332
2333 2.66, 2018-12-20 Critical bug fix in dbjoin.
2334 BUG FIX
2335 Removed old tests from MANIFEST. (Thanks to Hang Guo for reporting
2336 this bug.)
2337
2338 IMPROVEMENT
2339 Errors for non-existing input files now include the bad filename
2340 (before: "cannot setup filehandle", now: "cannot open input: cannot
2341 open TEST/bad_filename").
2342
2343 BUG FIX
2344 Hash joins with three identical rows were failing with the
2345 assertion failure "internal error: confused about overflow" due to
2346 a now-fixed bug.
2347
2348 2.67, 2019-07-10 add support for reading and writing hdfs
2349 IMPROVEMENT
2350 dbformmail now has an "mh" mechanism that writes messages to
2351 individual files (an mh-style mailbox).
2352
2353 BUG FIX
2354 dbrow failed to include the Carp library, leading to failures on
2355 croak.
2356
2357 BUG FIX
2358 Fixed the dbjoin error message for an unsorted right stream, which was
2359 incorrect (it said left).
2360
2361 IMPROVEMENT
2362 All Fsdb programs can now read from and write to HDFS, when files
2363 that start with "hdfs:" are given to -i and -o options.
2364
2365 2.68, 2019-09-19 All programs now support automatic decompression based on
2366 file extension.
2367 IMPROVEMENT
2368 The omitted-possible-error test case for dbfilepivot now has an
2369 alternative output that I saw on some BSD-running systems (thanks
2370 to CPAN).
2371
2372 IMPROVEMENT
2373 dbmerge and dbmerge2 now support "--header". dbmerge2 now gives
2374 better error messages when presented the wrong number of inputs.
2375
2376 BUG FIX
2377 dbsort now works with "--header" even when the file is big (due to
2378 fixes to dbmerge).
2379
2380 IMPROVEMENT
2381 csv_to_db now processes data with the "binary" option, allowing it
2382 to handle newlines embedded in quoted fields.
2383
2384 IMPROVEMENT
2385 All programs now transparently decompress input files when the
2386 filename given as an input argument ends with a standard
2387 extension (.gz, .bz2, or .xz).
2388
2390 John Heidemann, "johnh@isi.edu"
2391
2392 See "Contributors" for the many people who have contributed bug reports
2393 and fixes.
2394
2396 Fsdb is Copyright (C) 1991-2016 by John Heidemann <johnh@isi.edu>.
2397
2398 This program is free software; you can redistribute it and/or modify it
2399 under the terms of version 2 of the GNU General Public License as
2400 published by the Free Software Foundation.
2401
2402 This program is distributed in the hope that it will be useful, but
2403 WITHOUT ANY WARRANTY; without even the implied warranty of
2404 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
2405 General Public License for more details.
2406
2407 You should have received a copy of the GNU General Public License along
2408 with this program; if not, write to the Free Software Foundation, Inc.,
2409 675 Mass Ave, Cambridge, MA 02139, USA.
2410
2411 A copy of the GNU General Public License can be found in the file
2412 ``COPYING''.
2413
2415 Any comments about these programs should be sent to John Heidemann
2416 "johnh@isi.edu".
2417
2418
2419
2420perl v5.30.1 2020-01-30 Fsdb(3)