1Fsdb(3) User Contributed Perl Documentation Fsdb(3)
2
3
4
6 Fsdb - a flat-text database for shell scripting
7
Fsdb, the flatfile streaming database, is a package of commands for
manipulating flat-ASCII databases from shell scripts. Fsdb is useful
to process medium amounts of data (with very little data you'd do it by
hand, with megabytes you might want a real database). Fsdb was known
as Jdb from 1991 to Oct. 2008.
14
15 Fsdb is very good at doing things like:
16
17 • extracting measurements from experimental output
18
19 • examining data to address different hypotheses
20
21 • joining data from different experiments
22
23 • eliminating/detecting outliers
24
25 • computing statistics on data (mean, confidence intervals,
26 correlations, histograms)
27
28 • reformatting data for graphing programs
29
30 Fsdb is built around the idea of a flat text file as a database. Fsdb
31 files (by convention, with the extension .fsdb), have a header
32 documenting the schema (what the columns mean), and then each line
33 represents a database record (or row).
34
35 For example:
36
37 #fsdb experiment duration
38 ufs_mab_sys 37.2
39 ufs_mab_sys 37.3
40 ufs_rcp_real 264.5
41 ufs_rcp_real 277.9
42
This is a simple file with four experiments (the rows), each with a
description and a run time in the first and second columns.
46
47 Rather than hand-code scripts to do each special case, Fsdb provides
higher-level functions. Although it's often easy to throw together a
49 custom script to do any single task, I believe that there are several
50 advantages to using Fsdb:
51
52 • these programs provide a higher level interface than plain Perl, so
53
54 ** Fewer lines of simpler code:
55
56 dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
57
58 Picks out just one type of experiment and computes statistics
59 on it, rather than:
60
while (<>) { @F = split; $sum+=$F[1]; $ss+=$F[1]**2; $n++; }
62 $mean = $sum / $n; $std_dev = ...
63
64 in dozens of places.
65
66 • the library uses names for columns, so
67
68 ** No more $F[1], use "_duration".
69
70 ** New or different order columns? No changes to your scripts!
71
72 Thus if your experiment gets more complicated with a size
73 parameter, so your log changes to:
74
75 #fsdb experiment size duration
76 ufs_mab_sys 1024 37.2
77 ufs_mab_sys 1024 37.3
78 ufs_rcp_real 1024 264.5
79 ufs_rcp_real 1024 277.9
80 ufs_mab_sys 2048 45.3
81 ufs_mab_sys 2048 44.2
82
83 Then the previous scripts still work, even though duration is now
84 the third column, not the second.
85
• A series of actions is self-documenting (the provenance of
processing done to produce each output is recorded in comments).
88
89 ** No more wondering what hacks were used to compute the final
90 data, just look at the comments at the end of the output.
91
92 For example, the commands
93
94 dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration
95
96 add to the end of the output the lines
97 # | dbrow _experiment eq "ufs_mab_sys"
98 # | dbcolstats duration
99
100 • The library is mature, supporting large datasets (more than 100GB),
101 corner cases, error handling, backed by an automated test suite.
102
103 ** No more puzzling about bad output because your custom script
104 skimped on error checking.
105
106 ** No more memory thrashing when you try to sort ten million
107 records.
108
109 • Fsdb-2.x supports Perl scripting (in addition to shell scripting),
110 with libraries to do Fsdb input and output, and easy support for
111 pipelines. The shell script
112
113 dbcol name test1 | dbroweval '_test1 += 5;'
114
can be written in Perl as:
116
117 dbpipeline(dbcol(qw(name test1)), dbroweval('_test1 += 5;'));
118
119 (The disadvantage is that you need to learn what functions Fsdb
120 provides.)
121
122 Fsdb is built on flat-ASCII databases. By storing data in simple text
123 files and processing it with pipelines it is easy to experiment (in the
124 shell) and look at the output. To the best of my knowledge, the
125 original implementation of this idea was "/rdb", a commercial product
126 described in the book UNIX relational database management: application
127 development in the UNIX environment by Rod Manis, Evan Schaffer, and
128 Robert Jorgensen (1988 by Prentice Hall, and also at the web page
129 <http://www.rdb.com/>). Fsdb is an incompatible re-implementation of
130 their idea without any accelerated indexing or forms support. (But
131 it's free, and probably has better statistics!).
132
133 Fsdb-2.x will exploit multiple processors or cores, and provides Perl-
134 level support for input, output, and threaded-pipelines. (As of
135 Fsdb-2.44 it no longer uses Perl threading, just processes, since they
136 are faster.)
137
138 Installation instructions follow at the end of this document. Fsdb-2.x
139 requires Perl 5.8 to run. All commands have manual pages and provide
140 usage with the "--help" option. All commands are backed by an
141 automated test suite.
142
143 The most recent version of Fsdb is available on the web at
144 <http://www.isi.edu/~johnh/SOFTWARE/FSDB/index.html>.
145
147 2.74, 2021-06-23 More ipv6.
148 ENHANCEMENT
149 Fsdb::Support::IPv6 package includes ipv6_fullhex to rewrite ipv6
150 print addresses as full, 128-bit hex values.
151
153 executive summary
154 what's new
155 README CONTENTS
156 installation
157 basic data format
158 basic data manipulation
159 list of commands
160 another example
161 a gradebook example
162 a password example
163 history
164 related work
165 release notes
166 copyright
167 comments
168
Fsdb now uses the standard Perl build and installation from
ExtUtils::MakeMaker(3), so the quick answer to installation is to type:
172
173 perl Makefile.PL
174 make
175 make test
176 make install
177
178 Or, if you want to install it somewhere else, change the first line to
179
180 perl Makefile.PL PREFIX=$HOME
181
and it will go in your home directory's bin, etc. (See
ExtUtils::MakeMaker(3) for more details.)
184
185 Fsdb requires perl 5.8 or later.
186
A test suite is available; run it with
188
189 make test
190
In the past, ports existed for FreeBSD and MacOS. If someone
running one of those OSes wants to contribute a new port, please let me
know.
194
These programs are based on the idea of storing data in simple ASCII
files. A database is a file with one header line and then data or
comment lines. For example:
199
200 #fsdb account passwd uid gid fullname homedir shell
201 johnh * 2274 134 John_Heidemann /home/johnh /bin/bash
202 greg * 2275 134 Greg_Johnson /home/greg /bin/bash
203 root * 0 0 Root /root /bin/bash
204 # this is a simple database
205
The header line must be first and begins with "#fsdb", as in the
example above. There are rows (records) and columns (fields), just
like in a normal database. Comment lines begin with "#". Column names
are any string not containing spaces or single quotes (although it is
prudent to keep them alphanumeric with underscore).
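
To make the three kinds of line concrete, here is a rough sketch in
plain awk that classifies each line of a file in this format. The
helper name fsdb_summary is made up for this illustration and is not
part of Fsdb. The sketch assumes the default whitespace separator and
a header with no options, and it does far less checking than the real
tools.

```shell
# Hypothetical helper (not part of Fsdb): summarize an Fsdb-style file on stdin.
# Assumes the default whitespace separator and a header without options.
fsdb_summary() {
  awk '
    NR == 1 {
      if ($1 != "#fsdb") { bad = 1; exit }  # header must be the first line
      ncols = NF - 1                        # names after "#fsdb" are the columns
      next
    }
    /^#/    { comments++; next }            # comment lines begin with "#"
    NF == 0 { next }                        # ignore blank lines
    { rows++ }                              # anything else is a record
    END {
      if (bad) exit 1
      printf "columns=%d rows=%d comments=%d\n", ncols, rows, comments
    }
  '
}
```

Run on the password example above, it should report seven columns,
three rows, and one comment.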
211
212 By default, columns are delimited by whitespace. With this default
213 configuration, the contents of a field cannot contain whitespace.
214 However, this limitation can be relaxed by changing the field separator
215 as described below.
216
217 The big advantage of simple flat-text databases is that it is usually
218 easy to massage data into this format, and it's reasonably easy to take
219 data out of this format into other (text-based) programs, like gnuplot,
220 jgraph, and LaTeX. Think Unix. Think pipes. (Or even output to Excel
221 and HTML if you prefer.)
222
Since the no-whitespace rule for columns was a problem for some
applications, there's an option which relaxes it. You can specify the
field separator in the table header with "-F x" where "x" is a code
for the new field separator. A full list of codes is at
dbfilealter(1), but two common special values are "-F t" which is a
separator of a single tab character, and "-F S", a separator of two
spaces. Both allow (single) spaces in fields. An example:
230
#fsdb -F S account passwd uid gid fullname homedir shell
johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash
root  *  0  0  Root  /root  /bin/bash
# this is a simple database
236
237 See dbfilealter(1) for more details. Regardless of what the column
238 separator is for the body of the data, it's always whitespace in the
239 header.
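
As a rough illustration of why a two-space separator lets single spaces
survive inside a field, the sketch below splits a "-F S" style record
with plain awk. The helper name fsdb_field_S is invented for this
example, and this is not how Fsdb itself parses files. It assumes
fields are separated by exactly two spaces.

```shell
# Hypothetical helper (not part of Fsdb): print field N of a "-F S" style
# body read from stdin, where fields are separated by exactly two spaces.
fsdb_field_S() {
  # A two-character FS is treated by awk as a regular expression,
  # so a single space inside a field does not split it.
  awk -F'  ' -v n="$1" '/^#/ { next } { print $n }'
}

printf '%s\n' \
  '#fsdb -F S account passwd uid gid fullname homedir shell' \
  'johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash' |
  fsdb_field_S 5
# prints: John Heidemann
```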
240
There's also a third format: a "list". Because it's often hard to see
which column is which past the first two, in list format each "column"
is on a separate line. The programs dblistize and dbcolize convert to
and from this format, and all programs work with either format. The
command
245
246 dbfilealter -R C < DATA/passwd.fsdb
247
248 outputs:
249
250 #fsdb -R C account passwd uid gid fullname homedir shell
251 account: johnh
252 passwd: *
253 uid: 2274
254 gid: 134
255 fullname: John_Heidemann
256 homedir: /home/johnh
257 shell: /bin/bash
258
259 account: greg
260 passwd: *
261 uid: 2275
262 gid: 134
263 fullname: Greg_Johnson
264 homedir: /home/greg
265 shell: /bin/bash
266
267 account: root
268 passwd: *
269 uid: 0
270 gid: 0
271 fullname: Root
272 homedir: /root
273 shell: /bin/bash
274
275 # this is a simple database
276 # | dblistize
277
278 See dbfilealter(1) for more details.
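
The shape of the list format can be sketched with a few lines of awk.
This is only an illustration: fsdb_to_list is a made-up name, the
sketch assumes the default whitespace separator, and it does not write
the "#fsdb -R C" header that dbfilealter produces.

```shell
# Hypothetical helper (not part of Fsdb): rewrite a whitespace-separated
# Fsdb-style table on stdin as "name: value" lines, one record per paragraph.
# It does not emit a new "#fsdb -R C ..." header.
fsdb_to_list() {
  awk '
    NR == 1 {                       # remember the column names from the header
      for (i = 2; i <= NF; i++) name[i - 1] = $i
      ncols = NF - 1
      next
    }
    /^#/ { print; next }            # pass comments through unchanged
    {
      for (i = 1; i <= ncols; i++) print name[i] ": " $i
      print ""                      # a blank line ends each record
    }
  '
}
```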
279
281 A number of programs exist to manipulate databases. Complex functions
282 can be made by stringing together commands with shell pipelines. For
283 example, to print the home directories of everyone with ``john'' in
284 their names, you would do:
285
286 cat DATA/passwd | dbrow '_fullname =~ /John/' | dbcol homedir
287
288 The output might be:
289
290 #fsdb homedir
291 /home/johnh
292 /home/greg
293 # this is a simple database
294 # | dbrow _fullname =~ /John/
295 # | dbcol homedir
296
297 (Notice that comments are appended to the output listing each command,
298 providing an automatic audit log.)
299
300 In addition to typical database functions (select, join, etc.) there
301 are also a number of statistical functions.
302
303 The real power of Fsdb is that one can apply arbitrary code to rows to
304 do powerful things.
305
306 cat DATA/passwd | dbroweval '_fullname =~ s/(\w+)_(\w+)/$2,_$1/'
307
308 converts "John_Heidemann" into "Heidemann,_John". Not too much more
309 work could split fullname into firstname and lastname fields.
310
(Or:

cat DATA/passwd | dbcolcreate sort | dbroweval -b 'use Fsdb::Support' \
'_sort = _fullname; _sort =~ s/_/ /g; _sort = fullname_to_sort(_sort);'

)
315
317 An advantage of Fsdb is that you can talk about columns by name
318 (symbolically) rather than simply by their positions. So in the above
319 example, "dbcol homedir" pulled out the home directory column, and
320 "dbrow '_fullname =~ /John/'" matched against column fullname.
321
322 In general, you can use the name of the column listed on the "#fsdb"
323 line to identify it in most programs, and _name to identify it in code.
324
325 Some alternatives for flexibility:
326
327 • Numeric values identify columns positionally, numbering from 0. So
328 0 or _0 is the first column, 1 is the second, etc.
329
• In code, _last_columnname gets the value from columnname's previous
row.
332
333 See dbroweval(1) for more details about writing code.
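
To see why naming columns keeps scripts stable, here is a minimal
sketch of the lookup in plain awk: read the names from the header
once, then address the data by the position the header implies. The
helper fsdb_col is hypothetical and only mimics a small part of what
dbcol does. It assumes the default whitespace separator and does not
handle numeric (positional) column names.

```shell
# Hypothetical helper (not part of Fsdb): print the column called NAME
# from an Fsdb-style file on stdin (default whitespace separator assumed).
fsdb_col() {
  awk -v want="$1" '
    NR == 1 {
      # $1 is "#fsdb", so header word i names data column i - 1
      for (i = 2; i <= NF; i++) if ($i == want) col = i - 1
      if (!col) { missing = 1; exit }
      next
    }
    /^#/ { next }                   # skip comment lines
    { print $col }
    END { if (missing) exit 1 }     # unknown column name: fail
  '
}
```

Called as "fsdb_col duration", it gives the same answer whether
duration is the second column or the third, because the position is
taken from the header and not written into the script.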
334
336 Enough said. I'll summarize the commands, and then you can experiment.
337 For a detailed description of each command, see a summary by running it
with the argument "--help" (or "-?" if you prefer). Full manual pages
339 can be found by running the command with the argument "--man", or
340 running the Unix command "man dbcol" or whatever program you want.
341
342 TABLE CREATION
343 dbcolcreate
344 add columns to a database
345
346 dbcoldefine
347 set the column headings for a non-Fsdb file
348
349 TABLE MANIPULATION
350 dbcol
351 select columns from a table
352
353 dbrow
354 select rows from a table
355
356 dbsort
357 sort rows based on a set of columns
358
359 dbjoin
360 compute the natural join of two tables
361
362 dbcolrename
363 rename a column
364
365 dbcolmerge
366 merge two columns into one
367
368 dbcolsplittocols
369 split one column into two or more columns
370
371 dbcolsplittorows
372 split one column into multiple rows
373
374 dbfilepivot
375 "pivots" a file, converting multiple rows corresponding to the same
376 entity into a single row with multiple columns.
377
378 dbfilevalidate
379 check that db file doesn't have some common errors
380
381 COMPUTATION AND STATISTICS
382 dbcolstats
compute statistics over a column (mean, etc., optionally median)
384
385 dbmultistats
386 group rows by some key value, then compute stats (mean, etc.) over
387 each group (equivalent to dbmapreduce with dbcolstats as the
388 reducer)
389
390 dbmapreduce
391 group rows (map) and then apply an arbitrary function to each group
392 (reduce)
393
394 dbrvstatdiff
compare two sample distributions (mean/conf interval/T-test)
396
397 dbcolmovingstats
compute moving statistics over a column of data
399
400 dbcolstatscores
401 compute Z-scores and T-scores over one column of data
402
403 dbcolpercentile
404 compute the rank or percentile of a column
405
406 dbcolhisto
407 compute histograms over a column of data
408
409 dbcolscorrelate
410 compute the coefficient of correlation over several columns
411
412 dbcolsregression
413 compute linear regression and correlation for two columns
414
415 dbrowaccumulate
416 compute a running sum over a column of data
417
418 dbrowcount
419 count the number of rows (a subset of dbstats)
420
421 dbrowdiff
compute differences in a column between rows of a table
423
424 dbrowenumerate
425 number each row
426
427 dbroweval
428 run arbitrary Perl code on each row
429
430 dbrowuniq
431 count/eliminate identical rows (like Unix uniq(1))
432
433 dbfilediff
434 compare fields on rows of a file (something like Unix diff(1))
435
436 OUTPUT CONTROL
437 dbcolneaten
438 pretty-print columns
439
440 dbfilealter
441 convert between column or list format, or change the column
442 separator
443
444 dbfilestripcomments
445 remove comments from a table
446
447 dbformmail
448 generate a script that sends form mail based on each row
449
450 CONVERSIONS
451 (These programs convert data into fsdb. See their web pages for
452 details.)
453
454 cgi_to_db
455 <http://stein.cshl.org/boulder/>
456
457 combined_log_format_to_db
458 <http://httpd.apache.org/docs/2.0/logs.html>
459
460 html_table_to_db
461 HTML tables to fsdb (assuming they're reasonably formatted).
462
463 kitrace_to_db
464 <http://ficus-www.cs.ucla.edu/ficus-members/geoff/kitrace.html>
465
466 ns_to_db
467 <http://mash-www.cs.berkeley.edu/ns/>
468
469 sqlselect_to_db
470 the output of SQL SELECT tables to db
471
472 tabdelim_to_db
473 spreadsheet tab-delimited files to db
474
475 tcpdump_to_db
476 (see man tcpdump(8) on any reasonable system)
477
478 xml_to_db
479 XML input to fsdb, assuming they're very regular
480
481 (And out of fsdb:)
482
483 db_to_csv
484 Comma-separated-value format from fsdb.
485
486 db_to_html_table
487 simple conversion of Fsdb to html tables
488
489 STANDARD OPTIONS
490 Many programs have common options:
491
492 -? or --help
493 Show basic usage.
494
-N or --new-name
496 When a command creates a new column like dbrowaccumulate's "accum",
497 this option lets one override the default name of that new column.
498
499 -T TmpDir
500 where to put tmp files. Also uses environment variable TMPDIR, if
501 -T is not specified. Default is /tmp.
502
505 -c FRACTION or --confidence FRACTION
506 Specify confidence interval FRACTION (dbcolstats, dbmultistats,
507 etc.)
508
509 -C S or "--element-separator S"
510 Specify column separator S (dbcolsplittocols, dbcolmerge).
511
512 -d or --debug
513 Enable debugging (may be repeated for greater effect in some
514 cases).
515
516 -a or --include-non-numeric
517 Compute stats over all data (treating non-numbers as zeros). (By
518 default, things that can't be treated as numbers are ignored for
519 stats purposes)
520
521 -S or --pre-sorted
522 Assume the data is pre-sorted. May be repeated to disable
523 verification (saving a small amount of work).
524
525 -e E or --empty E
526 give value E as the value for empty (null) records
527
528 -i I or --input I
529 Input data from file I.
530
531 -o O or --output O
532 Write data out to file O.
533
534 --header H
Use H as the full Fsdb header, rather than reading a header from
the input. This option is particularly useful when using Fsdb
under Hadoop, where split files don't have headers.
538
--nolog
540 Skip logging the program in a trailing comment.
541
When giving Perl code (in dbrow and dbroweval) column names can be
embedded if preceded by underscores. Look at dbrow(1) or dbroweval(1)
for examples.
545
546 Most programs run in constant memory and use temporary files if
547 necessary. Exceptions are dbcolneaten, dbcolpercentile, dbmapreduce,
548 dbmultistats, dbrowsplituniq.
549
Take the raw data in "DATA/http_bandwidth", put a header on it
("dbcoldefine size bw"), take statistics of each category
("dbmultistats -k size bw"), pick out the relevant fields ("dbcol size
mean stddev pct_rsd"), and you get:
555
556 #fsdb size mean stddev pct_rsd
557 1024 1.4962e+06 2.8497e+05 19.047
558 10240 5.0286e+06 6.0103e+05 11.952
559 102400 4.9216e+06 3.0939e+05 6.2863
560 # | dbcoldefine size bw
561 # | /home/johnh/BIN/DB/dbmultistats -k size bw
562 # | /home/johnh/BIN/DB/dbcol size mean stddev pct_rsd
563
564 (The whole command was:
565
566 cat DATA/http_bandwidth |
dbcoldefine size bw |
568 dbmultistats -k size bw |
569 dbcol size mean stddev pct_rsd
570
571 all on one line.)
572
573 Then post-process them to get rid of the exponential notation by adding
574 this to the end of the pipeline:
575
576 dbroweval '_mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev);'
577
578 (Actually, this step is no longer required since dbcolstats now uses a
579 different default format.)
580
581 giving:
582
583 #fsdb size mean stddev pct_rsd
584 1024 1496200 284970 19.047
585 10240 5028600 601030 11.952
586 102400 4921600 309390 6.2863
587 # | dbcoldefine size bw
588 # | dbmultistats -k size bw
589 # | dbcol size mean stddev pct_rsd
590 # | dbroweval { _mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev); }
591
592 In a few lines, raw data is transformed to processed output.
593
594 Suppose you expect there is an odd distribution of results of one
595 datapoint. Fsdb can easily produce a CDF (cumulative distribution
596 function) of the data, suitable for graphing:
597
598 cat DB/DATA/http_bandwidth | \
599 dbcoldefine size bw | \
600 dbrow '_size == 102400' | \
601 dbcol bw | \
602 dbsort -n bw | \
603 dbrowenumerate | \
604 dbcolpercentile count | \
605 dbcol bw percentile | \
606 xgraph
607
608 The steps, roughly: 1. get the raw input data and turn it into fsdb
609 format, 2. pick out just the relevant column (for efficiency) and sort
610 it, 3. for each data point, assign a CDF percentage to it, 4. pick out
the two columns to graph and show them.
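
The sort-and-rank idea behind these steps can be sketched without Fsdb
using sort and awk. This is an illustration only: empirical_cdf is a
made-up name, the fraction is simply rank divided by the number of
values (the exact definition used by dbcolpercentile may differ), and
unlike the Fsdb tools it holds every value in memory.

```shell
# Hypothetical helper (not part of Fsdb): read one number per line on stdin
# and print "value fraction", where fraction is the rank of the value in
# sorted order divided by the total count.
empirical_cdf() {
  sort -n | awk '
    { v[NR] = $1 }                  # keep every value (memory grows with input)
    END { for (i = 1; i <= NR; i++) printf "%s %.2f\n", v[i], i / NR }
  '
}

printf '%s\n' 30 10 20 40 | empirical_cdf
# prints:
# 10 0.25
# 20 0.50
# 30 0.75
# 40 1.00
```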
612
614 The first commercial program I wrote was a gradebook, so here's how to
615 do it with Fsdb.
616
617 Format your data like DATA/grades.
618
619 #fsdb name email id test1
620 a a@ucla.example.edu 1 80
621 b b@usc.example.edu 2 70
622 c c@isi.example.edu 3 65
623 d d@lmu.example.edu 4 90
624 e e@caltech.example.edu 5 70
625 f f@oxy.example.edu 6 90
626
627 Or if your students have spaces in their names, use "-F S" and two
628 spaces to separate each column:
629
#fsdb -F S name email id test1
alfred aho  a@ucla.example.edu  1  80
butler lampson  b@usc.example.edu  2  70
david clark  c@isi.example.edu  3  65
constantine drovolis  d@lmu.example.edu  4  90
deborah estrin  e@caltech.example.edu  5  70
sally floyd  f@oxy.example.edu  6  90
637
638 To compute statistics on an exam, do
639
640 cat DATA/grades | dbstats test1 |dblistize
641
642 giving
643
644 #fsdb -R C ...
645 mean: 77.5
646 stddev: 10.84
647 pct_rsd: 13.987
648 conf_range: 11.377
649 conf_low: 66.123
650 conf_high: 88.877
651 conf_pct: 0.95
652 sum: 465
653 sum_squared: 36625
654 min: 65
655 max: 90
656 n: 6
657 ...
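
As a cross-check on the mean and stddev values above, they can be
recomputed from the running sum and sum of squares, the same
quantities reported as sum and sum_squared. The sketch below is plain
awk with a made-up helper name (mean_stddev). It assumes the sample
standard deviation (dividing by n - 1), which is what reproduces the
10.84 shown here, and it needs at least two values.

```shell
# Hypothetical helper (not part of Fsdb): mean and sample standard deviation
# of one number per line on stdin, from the running sum and sum of squares.
mean_stddev() {
  awk '
    { n++; sum += $1; sumsq += $1 * $1 }
    END {
      mean = sum / n
      var = (sumsq - sum * sum / n) / (n - 1)   # sample variance, needs n >= 2
      printf "mean: %.1f\nstddev: %.2f\n", mean, sqrt(var)
    }
  '
}

# The six test1 scores from DATA/grades
printf '%s\n' 80 70 65 90 70 90 | mean_stddev
# prints:
# mean: 77.5
# stddev: 10.84
```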
658
659 To do a histogram:
660
661 cat DATA/grades | dbcolhisto -n 5 -g test1
662
663 giving
664
665 #fsdb low histogram
666 65 *
667 70 **
668 75
669 80 *
670 85
671 90 **
672 # | /home/johnh/BIN/DB/dbhistogram -n 5 -g test1
673
674 Now you want to send out grades to the students by e-mail. Create a
675 form-letter (in the file test1.txt):
676
677 To: _email (_name)
678 From: J. Random Professor <jrp@usc.example.edu>
679 Subject: test1 scores
680
681 _name, your score on test1 was _test1.
682 86+ A
683 75-85 B
684 70-74 C
685 0-69 F
686
687 Generate the shell script that will send the mail out:
688
689 cat DATA/grades | dbformmail test1.txt > test1.sh
690
691 And run it:
692
693 sh <test1.sh
694
695 The last two steps can be combined:
696
697 cat DATA/grades | dbformmail test1.txt | sh
698
699 but I like to keep a copy of exactly what I send.
700
701 At the end of the semester you'll want to compute grade totals and
702 assign letter grades. Both fall out of dbroweval. For example, to
703 compute weighted total grades with a 40% midterm/60% final where the
704 midterm is 84 possible points and the final 100:
705
706 dbcol -rv total |
707 dbcolcreate total - |
708 dbroweval '
709 _total = .40 * _midterm/84.0 + .60 * _final/100.0;
710 _total = sprintf("%4.2f", _total);
711 if (_final eq "-" || ( _name =~ /^_/)) { _total = "-"; };' |
712 dbcolneaten
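
The weighting formula in that dbroweval code can be checked on its own
with a small awk sketch. The helper name weighted_total is invented
for this example. It only repeats the arithmetic (40% midterm out of
84 points, 60% final out of 100) and does not handle the "-"
placeholder for missing grades.

```shell
# Hypothetical helper (not part of Fsdb): weighted course total from a
# midterm score (out of 84) and a final score (out of 100).
weighted_total() {
  awk -v m="$1" -v f="$2" \
    'BEGIN { printf "%4.2f\n", .40 * m / 84.0 + .60 * f / 100.0 }'
}

weighted_total 84 100   # full marks on both exams: prints 1.00
weighted_total 42 50    # half marks on both exams: prints 0.50
```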
713
714 If you got the data originally from a spreadsheet, save it in "tab-
715 delimited" format and convert it with tabdelim_to_db (run
716 tabdelim_to_db -? for examples).
717
719 To convert the Unix password file to db:
720
cat /etc/passwd | sed 's/:/  /g' | \
722 dbcoldefine -F S login password uid gid gecos home shell \
723 >passwd.fsdb
724
725 To convert the group file
726
cat /etc/group | sed 's/:/  /g' | \
728 dbcoldefine -F S group password gid members \
729 >group.fsdb
730
731 To show the names of the groups that div7-members are in (assuming DIV7
732 is in the gecos field):
733
734 cat passwd.fsdb | dbrow '_gecos =~ /DIV7/' | dbcol login gid | \
735 dbjoin -i - -i group.fsdb gid | dbcol login group
736
738 Which Fsdb programs are the most complicated (based on number of test
739 cases)?
740
741 ls TEST/*.cmd | \
742 dbcoldefine test | \
743 dbroweval '_test =~ s@^TEST/([^_]+).*$@$1@' | \
744 dbrowuniq -c | \
745 dbsort -nr count | \
746 dbcolneaten
747
748 (Answer: dbmapreduce, then dbcolstats, dbfilealter and dbjoin.)
749
750 Stats on an exam (in $FILE, where $COLUMN is the name of the exam)?
751
cat $FILE | dbcolstats -q 4 $COLUMN | dblistize | dbstripcomments
753
754 cat $FILE | dbcolhisto -g -n 20 $COLUMN | dbcolneaten | dbstripcomments
755
Merging the hw1 column from file hw1.fsdb into grades.fsdb assuming
there's a common student id in column "id":
758
759 dbcol id hw1 <hw1.fsdb >t.fsdb
760
761 dbjoin -a -e - grades.fsdb t.fsdb id | \
762 dbsort name | \
763 dbcolneaten >new_grades.fsdb
764
765 Merging two fsdb files with the same rows:
766
767 cat file1.fsdb file2.fsdb >output.fsdb
768
769 or if you want to clean things up a bit
770
771 cat file1.fsdb file2.fsdb | dbstripextraheaders >output.fsdb
772
773 or if you want to know where the data came from
774
775 for i in 1 2
776 do
777 dbcolcreate source $i < file$i.fsdb
778 done >output.fsdb
779
780 (assumes you're using a Bourne-shell compatible shell, not csh).
781
783 As with any tool, one should (which means must) understand the limits
784 of the tool.
785
786 All Fsdb tools should run in constant memory. In some cases (such as
787 dbcolstats with quartiles, where the whole input must be re-read),
788 programs will spool data to disk if necessary.
789
Most tools buffer one or a few lines of data, so memory will scale with
the size of each line. (So lines with many columns, or columns that
hold lots of data, may cause large memory consumption.)
793
794 All Fsdb tools should run in constant or at worst "n log n" time.
795
796 All Fsdb tools use normal Perl math routines for computation. Although
797 I make every attempt to choose numerically stable algorithms (although
798 I also welcome feedback and suggestions for improvement), normal
799 rounding due to computer floating point approximations can result in
800 inaccuracies when data spans a large range of precision. (See for
801 example the dbcolstats_extrema test cases.)
802
Any requirements and limitations of each Fsdb tool are documented on
its manual page.
805
806 If any Fsdb program violates these assumptions, that is a bug that
807 should be documented on the tool's manual page or ideally fixed.
808
809 Fsdb does depend on Perl's correctness, and Perl (and Fsdb) have some
810 bugs. Fsdb should work on perl from version 5.10 onward.
811
813 There have been three versions of Fsdb; fsdb 1.0 is a complete re-write
814 of the pre-1995 versions, and was distributed from 1995 to 2007. Fsdb
815 2.0 is a significant re-write of the 1.x versions for reasons described
816 below.
817
818 Fsdb (in its various forms) has been used extensively by its author
819 since 1991. Since 1995 it's been used by two other researchers at UCLA
820 and several at ISI. In February 1998 it was announced to the Internet.
821 Since then it has found a few users, some outside where I work.
822
823 Major changes:
824
825 1.0 1997-07-22: first public release.
826 2.0 2008-01-25: rewrite to use a common library, and starting to use
827 threads.
828 2.12 2008-10-16: completion of the rewrite, and first RPM package.
829 2.44 2013-10-02: abandoning threads for improved performance
830
831 Fsdb 2.0 Rationale
832 I've thought about fsdb-2.0 for many years, but it was started in
833 earnest in 2007. Fsdb-2.0 has the following goals:
834
835 in-one-process processing
836 While fsdb is great on the Unix command line as a pipeline between
837 programs, it should also be possible to set it up to run in a
838 single process. And if it does so, it should be able to avoid
839 serializing and deserializing (converting to and from text) data
840 between each module. (Accomplished in fsdb-2.0: see dbpipeline,
841 although still needs tuning.)
842
843 clean IO API
Fsdb's roots go back to perl4 and 1991, so the fsdb-1.x library is
very, very crufty. More than just being ugly (but it was that
too), this made things like reading from one format file and writing
to another the application's job, when it should be the library's.
(Accomplished in fsdb-1.15 and improved in 2.0: see Fsdb::IO.)
849
850 normalized module APIs
851 Because fsdb modules were added as needed over 10 years, sometimes
852 the module APIs became inconsistent. (For example, the 1.x
853 "dbcolcreate" required an empty value following the name of the new
854 column, but other programs specify empty values with the "-e"
855 argument.) We should smooth over these inconsistencies.
856 (Accomplished as each module was ported in 2.0 through 2.7.)
857
858 everyone handles all input formats
859 Given a clean IO API, the distinction between "colized" and
860 "listized" fsdb files should go away. Any program should be able
861 to read and write files in any format. (Accomplished in fsdb-2.1.)
862
863 Fsdb-2.0 preserves backwards compatibility where possible, but breaks
864 it where necessary to accomplish the above goals. In August 2008,
865 Fsdb-2.7 was declared preferred over the 1.x versions. Benchmarking in
866 2013 showed that threading performed much worse than just using pipes,
867 so Fsdb-2.44 uses threading "style", but implemented with processes
868 (via my "Freds" library).
869
870 Contributors
871 Fsdb includes code ported from Geoff Kuenning
872 ("Fsdb::Support::TDistribution").
873
874 Fsdb contributors: Ashvin Goel goel@cse.oge.edu, Geoff Kuenning
875 geoff@fmg.cs.ucla.edu, Vikram Visweswariah visweswa@isi.edu, Kannan
876 Varadahan kannan@isi.edu, Lars Eggert larse@isi.edu, Arkadi Gelfond
877 arkadig@dyna.com, David Graff graff@ldc.upenn.edu, Haobo Yu
878 haoboy@packetdesign.com, Pavlin Radoslavov pavlin@catarina.usc.edu,
879 Graham Phillips, Yuri Pradkin, Alefiya Hussain, Ya Xu, Michael
880 Schwendt, Fabio Silva fabio@isi.edu, Jerry Zhao zhaoy@isi.edu, Ning Xu
881 nxu@aludra.usc.edu, Martin Lukac mlukac@lecs.cs.ucla.edu, Xue Cai,
882 Michael McQuaid, Christopher Meng, Calvin Ardi, H. Merijn Brand, Lan
883 Wei, Hang Guo.
884
885 Fsdb includes datasets contributed from NIST (DATA/nist_zarr13.fsdb),
886 from
887 <http://www.itl.nist.gov/div898/handbook/eda/section4/eda4281.htm>, the
888 NIST/SEMATECH e-Handbook of Statistical Methods, section 1.4.2.8.1.
889 Background and Data. The source is public domain, and reproduced with
890 permission.
891
893 As stated in the introduction, Fsdb is an incompatible reimplementation
894 of the ideas found in "/rdb". By storing data in simple text files and
895 processing it with pipelines it is easy to experiment (in the shell)
896 and look at the output. The original implementation of this idea was
897 /rdb, a commercial product described in the book UNIX relational
898 database management: application development in the UNIX environment by
899 Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web
900 page <http://www.rdb.com/>).
901
902 While Fsdb is inspired by Rdb, it includes no code from it, and Fsdb
makes several different design choices. In particular: rdb attempts to
be closer to a "real" database, with provision for locking and file
indexing. Fsdb focuses on single user use and so eschews these
906 choices. Rdb also has some support for interactive editing. Fsdb
907 leaves editing to text editors like emacs or vi.
908
909 In August, 2002 I found out Carlo Strozzi extended RDB with his package
910 NoSQL <http://www.linux.it/~carlos/nosql/>. According to Mr. Strozzi,
911 he implemented NoSQL in awk to avoid the Perl start-up of RDB.
912 Although I haven't found Perl startup overhead to be a big problem on
913 my platforms (from old Sparcstation IPCs to 2GHz Pentium-4s), you may
914 want to evaluate his system. The Linux Journal has a description of
915 NoSQL at <http://www.linuxjournal.com/article/3294>. It seems quite
916 similar to Fsdb. Like /rdb, NoSQL supports indexing (not present in
917 Fsdb). Fsdb appears to have richer support for statistics, and, as of
918 Fsdb-2.x, its support for Perl threading may support faster performance
919 (one-process, less serialization and deserialization).
920
922 Versions prior to 1.0 were released informally on my web page but were
923 not announced.
924
925 0.0 1991
926 started for my own research use
927
928 0.1 26-May-94
929 first check-in to RCS
930
931 0.2 15-Mar-95
932 parts now require perl5
933
934 1.0, 22-Jul-97
935 adds autoconf support and a test script.
936
937 1.1, 20-Jan-98
938 support for double space field separators, better tests
939
940 1.2, 11-Feb-98
941 minor changes and release on comp.lang.perl.announce
942
943 1.3, 17-Mar-98
944 • adds median and quartile options to dbstats
945
946 • adds dmalloc_to_db converter
947
948 • fixes some warnings
949
950 • dbjoin now can run on unsorted input
951
952 • fixes a dbjoin bug
953
954 • some more tests in the test suite
955
956 1.4, 27-Mar-98
957 • improves error messages (all should now report the program that
958 makes the error)
959
960 • fixed a bug in dbstats output when the mean is zero
961
962 1.5, 25-Jun-98
BUG FIX dbcolhisto, dbcolpercentile now handle non-numeric values like
dbstats
965 NEW dbcolstats computes zscores and tscores over a column
966 NEW dbcolscorrelate computes correlation coefficients between two
967 columns
968 INTERNAL ficus_getopt.pl has been replaced by DbGetopt.pm
969 BUG FIX all tests are now ``portable'' (previously some tests ran only
970 on my system)
971 BUG FIX you no longer need to have the db programs in your path (fix
972 arose from a discussion with Arkadi Gelfond)
973 BUG FIX installation no longer uses cp -f (to work on SunOS 4)
974
975 1.6, 24-May-99
976 NEW dbsort, dbstats, dbmultistats now run in constant memory (using tmp
977 files if necessary)
978 NEW dbcolmovingstats does moving means over a series of data
979 NEW dbcol has a -v option to get all columns except those listed
980 NEW dbmultistats does quartiles and medians
981 NEW dbstripextraheaders now also cleans up bogus comments before the
982 first header
983 BUG FIX dbcolneaten works better with double-space-separated data
984
985 1.7, 5-Jan-00
986 NEW dbcolize now detects and rejects lines that contain embedded copies
987 of the field separator
988 NEW configure tries harder to prevent people from improperly
989 configuring/installing fsdb
990 NEW tcpdump_to_db converter (incomplete)
991 NEW tabdelim_to_db converter: from spreadsheet tab-delimited files to
992 db
993 NEW mailing lists for fsdb are "fsdb-announce@heidemann.la.ca.us"
994 and "fsdb-talk@heidemann.la.ca.us"
995 To subscribe to either, send mail
996 to "fsdb-announce-request@heidemann.la.ca.us" or
997 "fsdb-talk-request@heidemann.la.ca.us" with "subscribe" in the
998 BODY of the message.
999
1000 BUG FIX dbjoin used to produce incorrect output if there were extra,
1001 unmatched values in the 2nd table. Thanks to Graham Phillips for
1002 providing a test case.
1003 BUG FIX the sample commands in the usage strings now all should
1004 explicitly include the source of data (typically from "cat foo.fsdb
1005 |"). Thanks to Ya Xu for pointing out this documentation deficiency.
1006 BUG FIX (DOCUMENTATION) dbcolmovingstats had incorrect sample output.
1007
1008 1.8, 28-Jun-00
1009 BUG FIX header options are now preserved when writing with dblistize
1010 NEW dbrowuniq now optionally checks for uniqueness only on certain
1011 fields
1012 NEW dbrowsplituniq makes one pass through a file and splits it into
1013 separate files based on the given fields
1014 NEW converter for "crl" format network traces
1015 NEW anywhere you use arbitrary code (like dbroweval), _last_foo now
1016 maps to the last row's value for field _foo.
1017 OPTIMIZATION comment processing slightly changed so that dbmultistats
1018 now is much faster on files with lots of comments (for example, ~100k
1019 lines of comments and 700 lines of data!) (Thanks to Graham Phillips
1020 for pointing out this performance problem.)
1021 BUG FIX dbstats with median/quartiles now correctly handles singleton
1022 data points.
1023
1024 1.9, 6-Nov-00
1025 NEW dbfilesplit, split a single input file into multiple output files
1026 (based on code contributed by Pavlin Radoslavov).
1027 BUG FIX dbsort now works with perl-5.6
1028
1029 1.10, 10-Apr-01
1030 BUG FIX dbstats now handles the case where there are more n-tiles than
1031 data
1032 NEW dbstats now includes a -S option to optimize work on pre-sorted
1033 data (inspired by code contributed by Haobo Yu)
1034 BUG FIX dbsort now has a better estimate of memory usage when run on
1035 data with very short records (problem detected by Haobo Yu)
1036 BUG FIX cleanup of temporary files is slightly better
1037
1038 1.11, 2-Nov-01
1039 BUG FIX dbcolneaten now runs in constant memory
1040 NEW dbcolneaten now supports "field specifiers" that allow some control
1041 over how wide columns should be
1042 OPTIMIZATION dbsort now tries hard to be filesystem cache-friendly
1043 (inspired by "Information and Control in Gray-box Systems" by the
1044 Arpaci-Dusseaus at SOSP 2001)
1045 INTERNAL t_distr now ported to perl5 module DbTDistr
1046
1047 1.12, 30-Oct-02
1048 BUG FIX dbmultistats documentation typo fixed
1049 NEW dbcolmultiscale
1050 NEW dbcol has -r option for "relaxed error checking"
1051 NEW dbcolneaten has new -e option to strip end-of-line spaces
1052 NEW dbrow finally has a -v option to negate the test
1053 BUG FIX math bug in dbcoldiff fixed by Ashvin Goel (need to check
1054 Scheaffer test cases)
1055 BUG FIX some patches to run with Perl 5.8. Note: some programs
1056 (dbcolmultiscale, dbmultistats, dbrowsplituniq) generate warnings like:
1057 "Use of uninitialized value in concatenation (.)" or "string at
1058 /usr/lib/perl5/5.8.0/FileCache.pm line 98, <STDIN> line 2". Please
1059 ignore this until I figure out how to suppress it. (Thanks to Jerry
1060 Zhao for noticing perl-5.8 problems.)
1061 BUG FIX fixed an autoconf problem where configure would fail to find a
1062 reasonable prefix (thanks to Fabio Silva for reporting the problem)
1063 NEW db_to_html_table: simple conversion to html tables (NO fancy stuff)
1064 NEW dblib now has a function dblib_text2html() that will do simple
1065 conversion of iso-8859-1 to HTML
1066
1067 1.13, 4-Feb-04
1068 NEW fsdb added to the freebsd ports tree
1069 <http://www.freshports.org/databases/fsdb/>. Maintainer:
1070 "larse@isi.edu"
1071 BUG FIX properly handle trailing spaces when data must be numeric (ex.
1072 dbstats with -FS, see test dbstats_trailing_spaces). Fix from Ning Xu
1073 "nxu@aludra.usc.edu".
1074 NEW dbcolize error message improved (bug report from Terrence Brannon),
1075 and list format documented in the README.
1076 NEW cgi_to_db converts CGI.pm-format storage to fsdb list format
1077 BUG FIX handle numeric synonyms for column names in dbcol properly
1078 ENHANCEMENT "talking about columns" section added to README. Lack of
1079 documentation pointed out by Lars Eggert.
1080 CHANGE dbformmail now defaults to using Mail ("Berkeley Mail") to send
1081 mail, rather than sendmail (sendmail is still an option, but mail
1082 doesn't require running as root)
1083 NEW on platforms that support it (i.e., with perl 5.8), fsdb works fine
1084 with unicode
1085 NEW dbfilevalidate: check a db file for some common errors
1086
1087 1.14, 24-Aug-06
1088 ENHANCEMENT README cleanup
1089 INCOMPATIBLE CHANGE dbcolsplit renamed dbcolsplittocols
1090 NEW dbcolsplittorows split one column into multiple rows
1091 NEW dbcolsregression compute linear regression and correlation for two
1092 columns
1093 ENHANCEMENT csv_to_db: better error handling, normalize field names,
1094 skip blank lines
1095 ENHANCEMENT dbjoin now detects (and fails) if non-joined files have
1096 duplicate names
1097 BUG FIX minor bug fixed in calculation of Student t-distributions
1098 (doesn't change any test output, but may have caused small errors)
1099
1100 1.15, 12-Nov-07
1101 NEW fsdb-1.14 added to the MacOS Fink system
1102 <http://pdb.finkproject.org/pdb/package.php/fsdb>. (Thanks to Lars
1103 Eggert for maintaining this port.)
1104 NEW Fsdb::IO::Reader and Fsdb::IO::Writer now provide reasonably clean
1105 OO I/O interfaces to Fsdb files. Highly recommended if you use fsdb
1106 directly from perl. In the fullness of time I expect to reimplement
1107 the entire thing using these APIs to replace the current dblib.pl which
1108 is still hobbled by its roots in perl4.
1109 NEW dbmapreduce now implements a Google-style map/reduce abstraction,
1110 generalizing dbmultistats.
1111 ENHANCEMENT fsdb now uses the Perl build system (Makefile.PL, etc.),
1112 instead of autoconf. This change paves the way to better perl-5-style
1113 modularization, proper manual pages, input of both listize and colize
1114 format for every program, and world peace.
1115 ENHANCEMENT dblib.pl is now moved to Fsdb::Old.pm.
1116 BUG FIX dbmultistats now propagates its format argument (-f). Bug and
1117 fix from Martin Lukac (thanks!).
1118 ENHANCEMENT dbformmail documentation now is clearer that it doesn't
1119 send the mail, you have to run the shell script it writes. (Problem
1120 observed by Unkyu Park.)
1121 ENHANCEMENT adapted to autoconf-2.61 (and then these changes were
1122 discarded in favor of The Perl Way).
1123 BUG FIX dbmultistats memory usage corrected (O(# tags), not O(1))
1124 ENHANCEMENT dbmultistats can now optionally run with pre-grouped input
1125 in O(1) memory
1126 ENHANCEMENT dbroweval -N was finally implemented (eat comments)
1127
1128 2.0, 25-Jan-08
1129 2.0, 25-Jan-08 --- a quiet 2.0 release (gearing up towards complete)
1130
1131 ENHANCEMENT: shifting old programs to Perl modules, with the front-end
1132 program as just a wrapper. In the short-term, this change just means
1133 programs have real man pages. In the long-run, it will mean that one
1134 can run a pipeline in a single Perl program. So far: dbcol, dbroweval,
1135 the new dbrowcount, dbsort, the new dbmerge, the old "dbstats" (renamed
1136 dbcolstats), dbcolrename, dbcolcreate.
1137 NEW: Fsdb::Filter::dbpipeline is an internal-only module that lets one
1138 use fsdb commands from within perl (via threads).
1139 It also provides perl function aliases for the internal modules, so
1140 a string of fsdb commands in perl is nearly as terse as in the
1141 shell:
1142
    use Fsdb::Filter::dbpipeline qw(:all);
    dbpipeline(
        dbrow(qw(name test1)),
        dbroweval('_test1 += 5;')
    );
1148
1149 INCOMPATIBLE CHANGE: The old dbcolstats has been renamed
1150 dbcolstatscores. The new dbcolstats does the same thing as the old
1151 dbstats. This incompatibility is unfortunate but normalizes program
1152 names.
1153 CHANGE: The new dbcolstats program always outputs "-" (the default
1154 empty value) for statistics it cannot compute (for example, standard
1155 deviation if there is only one row), instead of the old mix of "-" and
1156 "na".
1157 INCOMPATIBLE CHANGE: The old dbcolstats program, now called
1158 dbcolstatscores, also has different arguments. The "-t mean,stddev"
1159 option is now "--tmean mean --tstddev stddev". See dbcolstatscores for
1160 details.
1161 INCOMPATIBLE CHANGE: dbcolcreate now assumes all new columns get the
1162 default value rather than requiring each column to have an initial
1163 constant value. To change the initial value, use the new "-e" option.
1164 NEW: dbrowcount counts rows, an almost-subset of dbcolstats's "n"
1165 output (except without differentiating numeric/non-numeric input), or
1166 the equivalent of "dbstripcomments | wc -l".
1167 NEW: dbmerge merges two sorted files. This functionality was previously
1168 embedded in dbsort.
1169 INCOMPATIBLE CHANGE: dbjoin's "-i" option to include non-matches is now
1170 renamed "-a", so as to not conflict with the new standard option "-i"
1171 for input file.
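
The dbrowcount entry above describes the program as the equivalent of
stripping comments and counting lines. A minimal sketch of that idea in
plain shell, without Fsdb installed, might look like the following. It
assumes that header and comment lines are exactly the lines that begin
with "#"; the function name fsdb_count_rows is hypothetical and is not
part of Fsdb.

```shell
# Count the data rows in an fsdb stream read from standard input.
# Assumption: every header or comment line starts with "#".
fsdb_count_rows() {
    # grep -c prints the count; -v inverts the match so "#" lines are skipped.
    # "|| true" keeps the exit status zero when the count is 0.
    grep -vc '^#' || true
}

# Usage: two data rows follow the header, so this prints 2.
printf '#fsdb experiment duration\nufs_mab_sys\t37.2\nufs_mab_sys\t37.3\n' | fsdb_count_rows
```

Unlike dbcolstats, this sketch does not distinguish numeric from
non-numeric values; it only counts rows.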
1172
1173 2.1, 6-Apr-08
1174 2.1, 6-Apr-08 --- another alpha 2.0, but now all converted programs
1175 understand both listize and colize format
1176
1177 ENHANCEMENT: shifting more old programs to Perl modules. New in 2.1:
1178 dbcolneaten, dbcoldefine, dbcolhisto, dblistize, dbcolize, dbrecolize
1179 ENHANCEMENT dbmerge now handles an arbitrary number of input files, not
1180 just exactly two.
1181 NEW dbmerge2 is an internal routine that handles merging exactly two
1182 files.
1183 INCOMPATIBLE CHANGE dbjoin now specifies inputs like dbmerge2, rather
1184 than assuming the first two arguments were tables (as in fsdb-1).
1185 The old dbjoin argument "-i" is now "-a" or "--type=outer".
1186
1187 A minor change: comments in the source files for dbjoin are now
1188 intermixed with output rather than being delayed until the end.
1189
1190 ENHANCEMENT dbsort now no longer produces warnings when null values are
1191 passed to numeric comparisons.
1192 BUG FIX dbroweval now once again works with code that lacks a trailing
1193 semicolon. (This bug fixes a regression from 1.15.)
1194 INCOMPATIBLE CHANGE dbcolneaten's old "-e" option (to avoid end-of-line
1195 spaces) is now "-E" to avoid conflicts with the standard empty field
1196 argument.
1197 INCOMPATIBLE CHANGE dbcolhisto's old "-e" option is now "-E" to avoid
1198 conflicts. And its "-n", "-s", and "-w" are now "-N", "-S", and "-W" to
1199 correspond.
1200 NEW dbfilealter replaces dbrecolize, dblistize, and dbcolize, but with
1201 different options.
1202 ENHANCEMENT The library routines "Fsdb::IO" now understand both list-
1203 format and column-format data, so all converted programs can now
1204 automatically read either format. This capability was one of the
1205 milestone goals for 2.0, so yea!
1206
1207 2.2, 23-May-08
1208 Release 2.2 is another 2.x alpha release. Now most of the commands are
1209 ported, but a few remain, and I plan one last incompatible change (to
1210 the file header) before 2.x final.
1211
1212 ENHANCEMENT
1213 shifting more old programs to Perl modules. New in 2.2:
1214 dbrowaccumulate, dbformmail, dbcolmovingstats, dbrowuniq,
1215 dbrowdiff, dbcolmerge, dbcolsplittocols, dbcolsplittorows,
1216 dbmapreduce, dbmultistats, dbrvstatdiff. Also dbrowenumerate
1217 exists only as a front-end (command-line) program.
1218
1219 INCOMPATIBLE CHANGE
1220 The following programs have been dropped from fsdb-2.x:
1221 dbcoltighten, dbfilesplit, dbstripextraheaders,
1222 dbstripleadingspace.
1223
1224 NEW combined_log_format_to_db to convert Apache logfiles
1225
1226 INCOMPATIBLE CHANGE
1227 Options to dbrowdiff are now -B and -I, not -a and -i.
1228
1229 INCOMPATIBLE CHANGE
1230 dbstripcomments is now dbfilestripcomments.
1231
1232 BUG FIXES
1233 dbcolneaten better handles empty columns; dbcolhisto warning
1234 suppressed (actually a bug in high-bucket handling).
1235
1236 INCOMPATIBLE CHANGE
1237 dbmultistats now requires a "-k" option in front of the key (tag)
1238 field, or if none is given, it will group by the first field (both
1239 like dbmapreduce).
1240
1241 KNOWN BUG
1242 dbmultistats with quantile option doesn't work currently.
1243
1244 INCOMPATIBLE CHANGE
1245 dbcoldiff is renamed dbrvstatdiff.
1246
1247 BUG FIXES
1248 dbformmail was leaving its log message as a command, not a
1249 comment. Oops. No longer.
1250
1251 2.3, 27-May-08 (alpha)
1252 Another alpha release, this one just to fix the critical dbjoin bug
1253 listed below (that happens to have blocked my MP3 jukebox :-).
1254
1255 BUG FIX
1256 Dbsort no longer hangs if given an input file with no rows.
1257
1258 BUG FIX
1259 Dbjoin now works with unsorted input coming from a pipeline (like
1260 stdin). Perl-5.8.8 has a bug (?) that was making this case
1261 fail---opening stdin in one thread, reading some, then reading more
1262 in a different thread caused an lseek which works on files, but
1263 fails on pipes like stdin. Go figure.
1264
1265 BUG FIX / KNOWN BUG
1266 The dbjoin fix also fixed dbmultistats -q (it now gives the right
1267 answer). However, a new bug appeared, producing messages like:
1268 Attempt to free unreferenced scalar: SV 0xa9dd0c4, Perl
1269 interpreter: 0xa8350b8 during global destruction. So the
1270 dbmultistats_quartile test is still disabled.
1271
1272 2.4, 18-Jun-08
1273 Another alpha release, mostly to fix minor usability problems in
1274 dbmapreduce and client functions.
1275
1276 ENHANCEMENT
1277 dbrow now defaults to running user supplied code without warnings
1278 (as with fsdb-1.x). Use "--warnings" or "-w" to turn them back on.
1279
1280 ENHANCEMENT
1281 dbroweval can now write different format output than the input,
1282 using the "-m" option.
1283
1284 KNOWN BUG
1285 dbmapreduce emits warnings on perl 5.10.0 about "Unbalanced string
1286 table refcount" and "Scalars leaked" when run with an external
1287 program as a reducer.
1288
1289 dbmultistats emits the warning "Attempt to free unreferenced
1290 scalar" when run with quartiles.
1291
1292 In each case the output is correct. I believe these can be
1293 ignored.
1294
1295 CHANGE
1296 dbmapreduce no longer logs a line for each reducer that is invoked.
1297
1298 2.5, 24-Jun-08
1299 Another alpha release, fixing more minor bugs in "dbmapreduce" and
1300 lossage in "Fsdb::IO".
1301
1302 ENHANCEMENT
1303 dbmapreduce can now tolerate non-map-aware reducers that pass back
1304 the key column in their output. It also passes the current key as the last
1305 argument to external reducers.
1306
1307 BUG FIX
1308 Fsdb::IO::Reader, correctly handle "-header" option again. (Broken
1309 since fsdb-2.3.)
1310
1311 2.6, 11-Jul-08
1312 Another alpha release, needed to fix DaGronk. One new port, small bug
1313 fixes, and an important fix to dbmapreduce.
1314
1315 ENHANCEMENT
1316 shifting more old programs to Perl modules. New in 2.6:
1317 dbcolpercentile.
1318
1319 INCOMPATIBLE CHANGE and ENHANCEMENTS dbcolpercentile arguments changed,
1320 use "--rank" to require ranking instead of "-r". Also, "--ascending"
1321 and "--descending" can now be specified separately, both for
1322 "--percentile" and "--rank".
1323 BUG FIX
1324 Sigh, the sense of the --warnings option in dbrow was inverted. No
1325 longer.
1326
1327 BUG FIX
1328 I found and fixed the string leaks (errors like "Unbalanced string
1329 table refcount" and "Scalars leaked") in dbmapreduce and
1330 dbmultistats. (All "IO::Handle"s in threads must be manually
1331 destroyed.)
1332
1333 BUG FIX
1334 The "-C" option to specify the column separator in dbcolsplittorows
1335 now works again (broken since it was ported).
1336
1337 2.7, 30-Jul-08 beta
1338
1339 The beta release of fsdb-2.x. Finally, all programs are ported. As
1340 statistics, the number of lines of non-library code doubled from 7.5k
1341 to 15.5k. The libraries are much more complete, going from 866 to 5164
1342 lines. The overall number of programs is about the same, although 19
1343 were dropped and 11 were added. The number of test cases has grown
1344 from 116 to 175. All programs are now in perl-5, no more shell scripts
1345 or perl-4. All programs now have manual pages.
1346
1347 Although this is a major step forward, I still expect to rename "jdb"
1348 to "fsdb".
1349
1350 ENHANCEMENT
1351 shifting more old programs to Perl modules. New in 2.7:
1352 dbcolscorrelate, dbcolsregression, cgi_to_db, dbfilevalidate,
1353 db_to_csv, csv_to_db, db_to_html_table, kitrace_to_db,
1354 tcpdump_to_db, tabdelim_to_db, ns_to_db.
1355
1356 INCOMPATIBLE CHANGE
1357 The following programs have been dropped from fsdb-2.x: db2dcliff,
1358 dbcolmultiscale, crl_to_db, ipchain_logs_to_db. They may come
1359 back, but seemed overly specialized. The program
1360 dbrowsplituniq was dropped because it is superseded by dbmapreduce.
1361 dmalloc_to_db was dropped pending test cases and examples.
1362
1363 ENHANCEMENT
1364 dbfilevalidate now has a "-c" option to correct errors.
1365
1366 NEW html_table_to_db provides the inverse of db_to_html_table.
1367
1368 2.8, 5-Aug-08
1369 Change header format, preserving forwards compatibility.
1370
1371 BUG FIX
1372 Complete editing pass over the manual, making sure it aligns with
1373 fsdb-2.x.
1374
1375 SEMI-COMPATIBLE CHANGE
1376 The header of fsdb files has changed: it is now #fsdb, not #h (or
1377 #L), and parsing of -F and -R is also different. See dbfilealter
1378 for the new specification. The v1 file format will be read,
1379 compatibly, but not written.
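
Because v1 files are still read but no longer written, it can help to
check which header style a file carries before passing it to older
tools. The sketch below classifies a stream by its first line only. It
assumes the header keyword is followed by a space; the function name
fsdb_header_version is hypothetical, and the sketch does not parse the
-F or -R options (see dbfilealter for the actual specification).

```shell
# Report the header style of an fsdb stream read from standard input.
# Prints "v2" for a "#fsdb" header, "v1" for the older "#h" or "#L"
# headers, and "unknown" for anything else (including empty input).
# Assumption: the header keyword is followed by a single space.
fsdb_header_version() {
    first_line=''
    # Read only the first line; tolerate input with no trailing newline.
    IFS= read -r first_line || true
    case "$first_line" in
        '#fsdb '*)      echo v2 ;;
        '#h '*|'#L '*)  echo v1 ;;
        *)              echo unknown ;;
    esac
}

# Usage: classify a small in-line example.
printf '#fsdb experiment duration\nufs_mab_sys\t37.2\n' | fsdb_header_version
```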
1380
1381 BUG FIX
1382 dbmapreduce now tolerates comments that precede the first key,
1383 instead of failing with an error message.
1384
1385 2.9, 6-Aug-08
1386 Still in beta; just a quick bug-fix for dbmapreduce.
1387
1388 ENHANCEMENT
1389 dbmapreduce now generates plausible output when given no rows of
1390 input.
1391
1392 2.10, 23-Sep-08
1393 Still in beta, but picking up some bug fixes.
1394
1395 ENHANCEMENT
1396 dbmapreduce now generates plausible output when given no rows of
1397 input.
1398
1399 ENHANCEMENT
1400 dbroweval the warnings option was backwards; now corrected. As a
1401 result, warnings in user code now default off (like in fsdb-1.x).
1402
1403 BUG FIX
1404 dbcolpercentile now defaults to assuming the target column is
1405 numeric. The new option "-N" allows selection of a non-numeric
1406 target.
1407
1408 BUG FIX
1409 dbcolscorrelate now includes "--sample" and "--nosample" options to
1410 compute the sample or full population correlation coefficients.
1411 Thanks to Xue Cai for finding this bug.
1412
1413 2.11, 14-Oct-08
1414 Still in beta, but picking up some bug fixes.
1415
1416 ENHANCEMENT
1417 html_table_to_db is now more aggressive about filling in empty
1418 cells with the official empty value, rather than leaving them blank
1419 or as whitespace.
1420
1421 ENHANCEMENT
1422 dbpipeline now catches failures during pipeline element setup and
1423 exits reasonably gracefully.
1424
1425 BUG FIX
1426 dbsubprocess now reaps child processes, thus avoiding running out
1427 of processes when used a lot.
1428
1429 2.12, 16-Oct-08
1430 Finally, a full (non-beta) 2.x release!
1431
1432 INCOMPATIBLE CHANGE
1433 Jdb has been renamed Fsdb, the flatfile-streaming database. This
1434 change affects all internal Perl APIs, but no shell command-level
1435 APIs. While Jdb served well for more than ten years, it is easily
1436 confused with the Java debugger (even though Jdb was there first!).
1437 It also is too generic to work well in web search engines.
1438 Finally, Jdb stands for ``John's database'', and we're a bit beyond
1439 that. (However, some call me the ``file-system guy'', so one could
1440 argue it retains that meaning.)
1441
1442 If you just used the shell commands, this change should not affect
1443 you. If you used the Perl-level libraries directly in your code,
1444 you should be able to rename "Jdb" to "Fsdb" to move to 2.12.
1445
1446 The jdb-announce list has not yet been renamed, but it will be shortly.
1447
1448 With this release I've accomplished everything I wanted to in
1449 fsdb-2.x. I therefore expect to return to boring, bugfix releases.
1450
1451 2.13, 30-Oct-08
1452 BUG FIX
1453 dbrowaccumulate now treats non-numeric data as zero by default.
1454
1455 BUG FIX
1456 Fixed a perl-5.10ism in dbmapreduce that breaks that program under
1457 5.8. Thanks to Martin Lukac for reporting the bug.
1458
1459 2.14, 26-Nov-08
1460 BUG FIX
1461 Improved documentation for dbmapreduce's "-f" option.
1462
1463 ENHANCEMENT
1464 dbcolmovingstats now computes a moving standard deviation in
1465 addition to a moving mean.
1466
1467 2.15, 13-Apr-09
1468 BUG FIX
1469 Fix a make install bug reported by Shalindra Fernando.
1470
1471 2.16, 14-Apr-09
1472 BUG FIX
1473 Another minor release bug: on some systems programize_module loses
1474 executable permissions. Again reported by Shalindra Fernando.
1475
1476 2.17, 25-Jun-09
1477 TYPO FIXES
1478 Typo in the dbroweval manual fixed.
1479
1480 IMPROVEMENT
1481 There is no longer a comment line to label columns in dbcolneaten,
1482 instead the header line is tweaked to line up. This change
1483 restores the Jdb-1.x behavior, and means that repeated runs of
1484 dbcolneaten no longer add comment lines each time.
1485
1486 BUG FIX
1487 It turns out dbcolneaten was not correctly handling trailing
1488 spaces when given the "-E" option to suppress them. This
1489 regression is now fixed.
1490
1491 EXTENSION
1492 dbroweval(1) can now handle direct references to the last row via
1493 $lfref, a dubious but now documented feature.
1494
1495 BUG FIXES
1496 Separators set with "-C" in dbcolmerge and dbcolsplittocols were
1497 not properly setting the heading, and null fields were not
1498 recognized. The first bug was reported by Martin Lukac.
1499
1500 2.18, 1-Jul-09 A minor release
1501 IMPROVEMENT
1502 Documentation for Fsdb::IO::Reader has been improved.
1503
1504 IMPROVEMENT
1505 The package should now be PGP-signed.
1506
1507 2.19, 10-Jul-09
1508 BUG FIX
1509 Internal improvements to debugging output and robustness of
1510 dbmapreduce and dbpipeline. TEST/dbpipeline_first_fails.cmd re-
1511 enabled.
1512
1513 2.20, 30-Nov-09 (A collection of minor bugfixes, plus a build against
1514 Fedora 12.)
1515 BUG FIX
1516 Logging for dbmapreduce with code refs is now stable (it no longer
1517 includes a hex pointer to the code reference).
1518
1519 BUG FIX
1520 Better handling of mixed blank lines in Fsdb::IO::Reader (see test
1521 case dbcolize_blank_lines.cmd).
1522
1523 BUG FIX
1524 html_table_to_db now handles multi-line input better, and handles
1525 tables with COLSPAN.
1526
1527 BUG FIX
1528 dbpipeline now cleans up threads in an "eval" to prevent "cannot
1529 detach a joined thread" errors that popped up in perl-5.10.
1530 Hopefully this prevents a race condition that causes the test
1531 suites to hang about 20% of the time (in dbpipeline_first_fails).
1532
1533 IMPROVEMENT
1534 dbmapreduce now detects and correctly fails when the input and
1535 reducer have incompatible field separators.
1536
1537 IMPROVEMENT
1538 dbcolstats, dbcolhisto, dbcolscorrelate, dbcolsregression, and
1539 dbrowcount now all take an "-F" option to let one specify the
1540 output field separator (so they work better with dbmapreduce).
1541
1542 BUG FIX
1543 An omitted "-k" from the manual page of dbmultistats is now there.
1544 Bug reported by Unkyu Park.
1545
1546 2.21, 17-Apr-10 bug fix release
1547 BUG FIX
1548 Fsdb::IO::Writer now no longer fails with -outputheader => never
1549 (an obscure bug).
1550
1551 IMPROVEMENT
1552 Fsdb (in the warnings section) and dbcolstats now more carefully
1553 document how they handle (and do not handle) numerical precision
1554 problems, and other general limits. Thanks to Yuri Pradkin for
1555 prompting this documentation.
1556
1557 IMPROVEMENT
1558 "Fsdb::Support::fullname_to_sortkey" is now restored from "Jdb".
1559
1560 IMPROVEMENT
1561 Documentation for multiple styles of input approaches (including
1562 performance description) added to Fsdb::IO.
1563
1564 2.22, 2010-10-31 One new tool dbcolcopylast and several bug fixes for Perl
1565 5.10.
1566 BUG FIX
1567 dbmerge now correctly handles n-way merges. Bug reported by Yuri
1568 Pradkin.
1569
1570 INCOMPATIBLE CHANGE
1571 dbcolneaten now defaults to not padding the last column.
1572
1573 ADDITION
1574 dbrowenumerate now takes -N NewColumn to give the new column a name
1575 other than "count". Feature requested by Mike Rouch in January
1576 2005.
1577
1578 ADDITION
1579 New program dbcolcopylast copies the last value of a column into a
1580 new column copylast_column of the next row. New program requested
1581 by Fabio Silva; useful for converting dbmultistats output into
1582 dbrvstatdiff input.
1583
1584 BUG FIX
1585 Several tools (particularly dbmapreduce and dbmultistats) would
1586 report errors like "Unbalanced string table refcount: (1) for
1587 "STDOUT" during global destruction" on exit, at least on certain
1588 versions of Perl (for me on 5.10.1), but similar errors have been
1589 off-and-on for several Perl releases. Although I think my code
1590 looked OK, I worked around this problem with a different way of
1591 handling standard IO redirection.
1592
1593 2.23, 2011-03-10 Several small portability bugfixes; improved dbcolstats
1594 for large datasets
1595 IMPROVEMENT
1596 Documentation to dbrvstatdiff was changed to use "sd" to refer to
1597 standard deviation, not "ss" (which might be confused with sum-of-
1598 squares).
1599
1600 BUG FIX
1601 This documentation about dbmultistats was missing the -k option in
1602 some cases.
1603
1604 BUG FIX
1605 dbmapreduce was failing on MacOS-10.6.3 for some tests with the
1606 error
1607
1608 dbmapreduce: cannot run external dbmapreduce reduce program (perl TEST/dbmapreduce_external_with_key.pl)
1609
1610 The problem seemed to be only in the error, not in operation. On
1611 MacOS, the error is now suppressed. Thanks to Alefiya Hussain for
1612 providing access to a Mac system that allowed debugging of this
1613 problem.
1614
1615 IMPROVEMENT
1616 The csv_to_db command requires an external Perl library
1617 (Text::CSV_XS). On computers that lack this optional library,
1618 previously Fsdb would configure with a warning and then test cases
1619 would fail. Now those test cases are skipped with an additional
1620 warning.
1621
1622 BUG FIX
1623 The test suite now supports alternative valid output, as a hack to
1624 account for last-digit floating point differences. (Not very
1625 satisfying :-( )
1626
1627 BUG FIX
1628 dbcolstats output for confidence intervals on very large datasets
1629 has changed. Previously it failed for more than 2^31-1 records,
1630 and handling of T-Distributions with thousands of rows was a bit
1631 dubious. Now datasets with more than 10000 records are considered
1632 infinitely large and hopefully correctly handled.
1633
1634 2.24, 2011-04-15 Improvements to fix an old bug in dbmapreduce with
1635 different field separators
1636 IMPROVEMENT
1637 The dbfilealter command had a "--correct" option to work around
1638 incompatible field-separators, but it did nothing. Now it
1639 does the correct but sad, data-losing thing.
1640
1641 IMPROVEMENT
1642 The dbmultistats command previously failed with an error message
1643 when invoked on input with a non-default field separator. The root
1644 cause was the underlying dbmapreduce that did not handle the case
1645 of reducers that generated output with a different field separator
1646 than the input. We now detect and repair incompatible field
1647 separators. This change corrects a problem originally documented
1648 and detected in Fsdb-2.20. Bug re-reported by Unkyu Park.
1649
1650 2.25, 2011-08-07 Two new tools, xml_to_db and dbfilepivot, and a bugfix for
1651 two people.
1652 IMPROVEMENT
1653 kitrace_to_db now supports a --utc option, which also fixes this
1654 test case for users outside of the Pacific time zone. Bug reported
1655 by David Graff, and also by Peter Desnoyers (within a week of each
1656 other :-)
1657
1658 NEW xml_to_db can convert simple, very regular XML files into Fsdb.
1659
1660 NEW dbfilepivot "pivots" a file, converting multiple rows corresponding
1661 to the same entity into a single row with multiple columns.
1662
1663 2.26, 2011-12-12 Bug fixes, particularly for perl-5.14.2.
1664 BUG FIX
1665 Bugs fixed in Fsdb::IO::Reader(3) manual page.
1666
1667 BUG FIX
1668 Fixed problems where dbcolstats was truncating floating point
1669 numbers when sorting. This strange behavior happens as of
1670 perl-5.14.2 and it seems like a Perl bug. I've worked around it
1671 for the test suites, but I'm a bit nervous.
1672
1673 2.27, 2012-11-15 Accumulated bug fixes.
1674 IMPROVEMENT
1675 csv_to_db now reports errors in CSV input with real diagnostics.
1676
1677 IMPROVEMENT
1678 dbcolmovingstats can now compute median, when given the "-m"
1679 option.
1680
1681 BUG FIX
1682 dbcolmovingstats non-numeric handling (the "-a" option) now works
1683 properly.
1684
1685 DOCUMENTATION
1686 The internal t/test_command.t test framework is now documented.
1687
1688 BUG FIX
1689 dbrowuniq now correctly handles the case where there is no input
1690 (previously it output a blank line, which is a malformed fsdb
1691 file). Thanks to Yuri Pradkin for reporting this bug.
1692
1693 2.28, 2012-11-15 A quick release to fix most rpmlint errors.
1694 BUG FIX
1695 Fixed a number of minor release problems (wrong permissions, old
1696 FSF address, etc.) found by rpmlint.
1697
1698 2.29, 2012-11-20 a quick release for CPAN testing
1699 IMPROVEMENT
1700 Tweaked the RPM spec.
1701
1702 IMPROVEMENT
1703 Modified Makefile.PL to fail gracefully on Perl installations that
1704 lack threads. (Without this fix, I get massive failures in the
1705 non-ithreads test system.)
1706
1707 2.30, 2012-11-25 improvements to perl portability
1708 BUG FIX
1709 Removed unicode character in documentation of dbcolscorrelate so pod
1710 tests will pass. (Sigh, that should work :-( )
1711
1712 BUG FIX
1713 Fixed test suite failures on 5 tests (dbcolcreate_double_creation
1714 was the first) due to Carp's addition of a period. This problem
1715 was breaking Fsdb on perl-5.17. Thanks to Michael McQuaid for
1716 helping diagnose this problem.
1717
1718 IMPROVEMENT
1719 The test suite now prints out the names of tests it tries.
1720
1721 2.31, 2012-11-28 A release with actual improvements to dbfilepivot and
1722 dbrowuniq.
1723 BUG FIX
1724 Documentation fixes: typos in dbcolscorrelate, bugs in
1725 dbfilepivot, clarification for comment handling in
1726 Fsdb::IO::Reader.
1727
1728 IMPROVEMENT
1729 Previously dbfilepivot assumed the input was grouped by keys and
1730 didn't verify that pre-condition. Now there is no pre-condition (it
1731 will sort the input by default), and it checks if the invariant is
1732 violated.
1733
1734 BUG FIX
1735 Previously dbfilepivot failed if the input had comments (oops :-);
1736 no longer.
1737
1738 IMPROVEMENT
1739 Now dbrowuniq has the "-L" option to preserve the last unique row
1740 (instead of the first), a common idiom.
1741
1742 2.32, 2012-12-21 Test suites should now be more numerically robust.
1743 NEW New dbfilediff does fsdb-aware file differencing. It does not do
1744 smart intuition of add/removes like Unix diff(1), but it does know
1745 about columns, and with "-E", it does numeric-aware differences.
1746
1747 IMPROVEMENT
1748 Test suites that are numeric now use dbfilediff to do numeric-aware
1749 comparisons, so the test suite should now be robust to slightly
1750 different computers and operating systems and compilers than
1751 exactly what I use.
1752
1753 2.33, 2012-12-23 Minor fixes to some test cases.
1754 IMPROVEMENT
1755 dbfilediff and dbrowuniq now support the "-N" option to give the
1756 new column a different name. (And test cases where this
1757 duplication mattered have been fixed.)
1758
1759 IMPROVEMENT
1760 dbrvstatdiff now shows the t-test breakpoint with a reasonable
1761 number of floating point digits.
1762
1763 BUG FIX
1764 Fixed a numerical stability problem in the dbroweval_last test
1765 case.
1766
1768 2.34, 2013-02-10 Parallelism in dbmerge.
1769 IMPROVEMENT
1770 Documentation for dbjoin now includes resource requirements.
1771
1772 IMPROVEMENT
1773 Default memory usage for dbsort is now about 256MB. (The world
1774 keeps moving forward.)
1775
1776 IMPROVEMENT
1777 dbmerge now does merging in parallel. As a side-effect, dbsort
1778 should be faster when input overflows memory. The level of
1779 parallelism can be limited with the "--parallelism" option. (There
1780 is more work to do here, but we're off to a start.)
1781
1782 2.35, 2013-02-23 Improvements to dbmerge parallelism
1783 BUG FIX
1784 Fsdb temporary files are now created more securely (with
1785 File::Temp).
1786
1787 IMPROVEMENT
1788 Programs that sort or merge on fields (dbmerge2, dbmerge, dbsort,
1789 dbjoin) now report an error if no fields on which to join or merge
1790 are given.
1791
1792 IMPROVEMENT
1793 Parallelism in dbmerge should now be more consistent, with less
1794 starting and stopping.
1795
1796 IMPROVEMENT In dbmerge, the "--xargs" option lets one give input
1797 filenames on standard input, rather than the command line. This feature
1798 paves the way for faster dbsort for large inputs (by pipelining sorting
1799 and merging), expected in the next release.
1800
1801 2.36, 2013-02-25 dbsort pipelines with dbmerge
1802 IMPROVEMENT For large inputs, dbsort now pipelines sorting and merging,
1803 allowing earlier processing.
1804 BUG FIX Since 2.35, dbmerge delayed cleanup of intermediate files,
1805 thereby requiring extra disk space.
1806
1807 2.37, 2013-02-26 quick bugfix to support parallel sort and merge from
1808 recent releases
1809 BUG FIX Since 2.35, dbmerge delayed removal of input files given by
1810 "--xargs". This problem is now fixed.
1811
1812 2.38, 2013-04-29 minor bug fixes
1813 CLARIFICATION
1814 Configure now rejects Windows since tests seem to hang on some
1815 versions of Windows. (I would love help from a Windows developer
1816 to get this problem fixed, but I cannot do it.) See
1817 https://rt.cpan.org/Ticket/Display.html?id=84201.
1818
1819 IMPROVEMENT
1820 All programs that use temporary files (dbcolpercentile,
1821 dbcolscorrelate, dbcolstats, dbcolstatscores) now take the "-T"
1822 option and set the temporary directory consistently.
1823
1824 In addition, error messages are better when the temporary directory
1825 has problems. Problem reported by Liang Zhu.
1826
1827 BUG FIX
1828 dbmapreduce was failing with external, map-reduce aware reducers
1829 (when invoked with -M and an external program). (Sigh, did this
1830 case ever work?) This case should now work. Thanks to Yuri
1831 Pradkin for reporting this bug (in 2011).
1832
1833 BUG FIX
1834 Fixed perl-5.10 problem with dbmerge. Thanks to Yuri Pradkin for
1835 reporting this bug (in 2013).
1836
1837 2.39, 2013-05-31 quick release for the dbrowuniq extension
1838 BUG FIX
1839 Actually in 2.38, the Fedora .spec got cleaner dependencies.
1840 Suggestion from Christopher Meng via
1841 <https://bugzilla.redhat.com/show_bug.cgi?id=877096>.
1842
1843 ENHANCEMENT
1844 Fsdb files are now explicitly set into UTF-8 encoding, unless one
1845 specifies "-encoding" to "Fsdb::IO".
1846
1847 ENHANCEMENT
1848 dbrowuniq now supports "-I" for incremental counting.
1849
1850 2.40, 2013-07-13 small bug fixes
1851 BUG FIX
1852 dbsort now has more respect for a user-given temporary directory;
1853 it no longer is ignored for merging.
1854
1855 IMPROVEMENT
1856 dbrowuniq now has options to output the first, last, and both first
1857 and last rows of a run ("-F", "-L", and "-B").
1858
1859 BUG FIX
1860 dbrowuniq now correctly handles "-N". Sigh, it didn't work before.
1861
1862 2.41, 2013-07-29 small bug and packaging fixes
1863 ENHANCEMENT
1864 Documentation to dbrvstatdiff improved (inspired by questions from
1865 Qian Kun).
1866
1867 BUG FIX
1868 dbrowuniq no longer duplicates singleton unique lines when
1869 outputting both (with "-B").
1870
1871 BUG FIX
1872 Add missing "XML::Simple" dependency to Makefile.PL.
1873
1874 ENHANCEMENT
1875 Tests now show the diff of the failing output if run with "make
1876 test TEST_VERBOSE=1".
1877
1878 ENHANCEMENT
1879 dbroweval now includes documentation for how to output extra rows.
1880 Suggestion from Yuri Pradkin.
1881
1882 BUG FIX
1883 Several improvements to the Fedora package from Michael Schwendt
1884 via <https://bugzilla.redhat.com/show_bug.cgi?id=877096>, and from
1885 the harsh master that is rpmlint. (I am stymied at teaching it
1886 that "outliers" is spelled correctly. Maybe I should send it
1887 Schneier's book. And an unresolvable invalid-spec-name lurks in
1888 the SRPM.)
1889
1890 2.42, 2013-07-31 A bug fix and packaging release.
1891 ENHANCEMENT
1892 Documentation to dbjoin improved to better describe memory usage. (Based on
1893 problem report by Lin Quan.)
1894
1895 BUG FIX
1896 The .spec is now perl-Fsdb.spec to satisfy rpmlint. Thanks to
1897 Christopher Meng for a specific bug report.
1898
1899 BUG FIX
1900 Test dbroweval_last.cmd no longer has a column that caused failures
1901 because of numerical instability.
1902
1903 BUG FIX
1904 Some tests now better handle bugs in old versions of perl (5.10,
1905 5.12). Thanks to Calvin Ardi for help debugging this on a Mac with
1906 perl-5.12, but the fix should affect other platforms.
1907
1908 2.43, 2013-08-27 Adds in-file compression.
1909 BUG FIX
1910 Changed the sort on TEST/dbsort_merge.cmd to strings (from
1911 numerics) so we're less susceptible to false test-failures due to
1912 floating point IO differences.
1913
1914 EXPERIMENTAL ENHANCEMENT
1915 Yet more parallelism in dbmerge: new "endgame-mode" builds a merge
1916 tree of processes at the end of large merge tasks to get maximal
1917 parallelism. Currently this feature is off by default because it
1918 can hang for some inputs. Enable this experimental feature with
1919 "--endgame".
1920
1921 ENHANCEMENT
1922 "Fsdb::IO" now handles being given "IO::Pipe" objects (as exercised
1923 by dbmerge).
1924
1925 BUG FIX
1926 Handling of NamedTmpfiles now supports concurrency. This fix will
1927 hopefully fix occasional "Use of uninitialized value $_ in string
1928 ne at ...NamedTmpfile.pm line 93." errors.
1929
1930 BUG FIX
1931 Fsdb now requires perl 5.10. This is a bug fix because some test
1932 cases used to require it, but this fact was not properly
1933 documented. (Back-porting to 5.008 would require removing all "//"
1934 operators.)
1935
1936 ENHANCEMENT
1937 Fsdb now handles automatic compression of file contents. Enable
1938 compression with "dbfilealter -Z xz" (or "gz" or "bz2"). All
1939 programs should operate on compressed files and leave the output
1940 with the same level of compression. "xz" is recommended as fastest
1941 and most efficient. "gz" produces unrepeatable output (and so
1942 has no output test); it seems to insist on adding a timestamp.
1943
1944 2.44, 2013-10-02 A major change--all threads are gone.
1945 ENHANCEMENT
1946 Fsdb is now thread free and only uses processes for parallelism.
1947 This change is a big change--the entire motivation for Fsdb-2 was
1948 to exploit parallelism via threading. Parallelism--good, but perl
1949 threading--bad for performance. Horribly bad for performance.
1950 About 20x worse than pipes on my box. (See perl bug #119445 for
1951 the discussion.)
1952
1953 NEW "Fsdb::Support::Freds" provides a thread-like abstraction over
1954 forking, with some nice support for callbacks in the parent upon
1955 child termination.
1956
1957 ENHANCEMENT
1958 Details about removing threads: "dbpipeline" is thread free, and
1959 new tests verify each of its parts. The easy cases are
1960 "dbcolpercentile", "dbcolstats", "dbfilepivot", "dbjoin", and
1961 "dbcolstatscores", each of which uses it in simple ways
1962 (2013-09-09). "dbmerge" is now thread free (2013-09-13), but was a
1963 significant rewrite, which brought "dbsort" along. "dbmapreduce"
1964 is partly thread free (2013-09-21), again as a rewrite, and it
1965 brings "dbmultistats" along. Full "dbmapreduce" support took much
1966 longer (2013-10-02).
1967
1968 BUG FIX
1969 When running with user-only output ("-n"), dbroweval now resets the
1970 output vector $ofref after it has been output.
1971
1972 NEW dbcolcreate will create all columns at the head of each row with
1973 the "--first" option.
1974
1975 NEW dbfilecat will concatenate two files, verifying that they have the
1976 same schema.
1977
1978 ENHANCEMENT
1979 dbmapreduce now passes comments through, rather than eating them as
1980 before.
1981
1982 Also, dbmapreduce now supports a "--" option to prevent
1983 misinterpreting sub-program parameters as for dbmapreduce.
1984
1985 INCOMPATIBLE CHANGE
1986 dbmapreduce no longer figures out if it needs to add the key to the
1987 output. For multi-key-aware reducers, it never does (and cannot).
1988 For non-multi-key-aware reducers, it defaults to add the key and
1989 will now fail if the reducer adds the key (with error "dbcolcreate:
1990 attempt to create pre-existing column..."). In such cases, one
1991 must disable adding the key with the new option "--no-prepend-key".
1992
1993 INCOMPATIBLE CHANGE
1994 dbmapreduce no longer copies the input field separator by default.
1995 For multi-key-aware reducers, it never does (and cannot). For non-
1996 multi-key-aware reducers, it defaults to not copying the field
1997 separator, but it will copy it (the old default) with the
1998 "--copy-fs" option.
1999
2000 2.45, 2013-10-07 cleanup from de-thread-ification
2001 BUG FIX
2002 Corrected a fast busy-wait in dbmerge.
2003
2004 ENHANCEMENT
2005 Endgame mode enabled in dbmerge; it (and also large cases of
2006 dbsort) should now exploit greater parallelism.
2007
2008 BUG FIX
2009 Test case with "Fsdb::BoundedQueue" (gone since 2.44) now removed.
2010
2011 2.46, 2013-10-08 continuing cleanup of our no-threads version
2012 BUG FIX
2013 Fixed some packaging details. (Really, threads are no longer
2014 required, missing tests in the MANIFEST.)
2015
2016 IMPROVEMENT
2017 dbsort now better communicates with the merge process to avoid
2018 bursty parallelism.
2019
2020 Fsdb::IO::Writer now can take "-autoflush => 1" for line-buffered
2021 IO.
2022
2023 2.47, 2013-10-12 test suite cleanup for non-threaded perls
2024 BUG FIX
2025 Removed some stray "use threads" in some test cases. We didn't
2026 need them, and these were breaking non-threaded perls.
2027
2028 BUG FIX
2029 Better handling of Fred cleanup; should fix intermittent
2030 dbmapreduce failures on BSD.
2031
2032 ENHANCEMENT
2033 Improved test framework to show output when tests fail. (This
2034 time, for real.)
2035
2036 2.48, 2014-01-03 small bugfixes and improved release engineering
2037 ENHANCEMENT
2038 Test suites now skip tests for libraries that are missing. (Patch
2039 for missing "IO::Compress::Xz" contributed by Calvin Ardi.)
2040
2041 ENHANCEMENT
2042 Removed references to Jdb in the package specification. Since the
2043 name was changed in 2008, there's no longer a huge need for
2044 backwards compatibility. (Suggestion from Petr Šabata.)
2045
2046 ENHANCEMENT
2047 Test suites now invoke the perl using the path from
2048 $Config{perlpath}. Hopefully this helps testing in environments
2049 where there are multiple installed perls and the default perl is
2050 not the same as the perl-under-test (as happens in
2051 cpantesters.org).
2052
2053 BUG FIX
2054 Added specific encoding to this manpage to account for Unicode.
2055 Required to build correctly against perl-5.18.
2056
2057 2.49, 2014-01-04 bugfix to unicode handling in Fsdb IO (plus minor
2058 packaging fixes)
2059 BUG FIX
2060 Restored a line in the .spec to chmod g-s.
2061
2062 BUG FIX
2063 Unicode decoding is now handled correctly for programs that read
2064 from standard input. (Also: New test scripts cover unicode input
2065 and output.)
2066
2067 BUG FIX
2068 Fix to Fsdb documentation encoding line. Addresses test failure in
2069 perl-5.16 and earlier. (Who knew "encoding" had to be followed by
2070 a blank line.)
2071
2073 2.50, 2014-05-27 a quick release for spec tweaks
2074 ENHANCEMENT
2075 In dbroweval, the "-N" (no output, even comments) option now
2076 implies "-n", and it now suppresses the header and trailer.
2077
2078 BUG FIX
2079 A few more tweaks to the perl-Fsdb.spec from Petr Šabata.
2080
2081 BUG FIX
2082 Fixed 3 uses of "use v5.10" in test suites that were causing test
2083 failures (due to warnings, not real failures) on some platforms.
2084
2085 2.51, 2014-09-05 Feature enhancements to dbcolmovingstats, dbcolcreate,
2086 dbmapreduce, and new sqlselect_to_db
2087 ENHANCEMENT
2088 dbcolcreate now has a "--no-recreate-fatal" that causes it to
2089 ignore creation of existing columns (instead of failing).
2090
2091 ENHANCEMENT
2092 dbmapreduce once again is robust to reducers that output the key;
2093 "--no-prepend-key" is no longer mandatory.
2094
2095 ENHANCEMENT
2096 dbcolsplittorows can now enumerate the output rows with "-E".
2097
2098 BUG FIX
2099 dbcolmovingstats is more mathematically robust. Previously for
2100 some inputs and some platforms, floating point rounding could
2101 sometimes cause square roots of negative numbers.
2102
2103 NEW sqlselect_to_db converts the output of the MySQL or MariaDB select
2104 command into fsdb format.
2105
2106 INCOMPATIBLE CHANGE
2107 dbfilediff now outputs the second row when doing sloppy numeric
2108 comparisons, to better support test suites.
2109
2110 2.52, 2014-11-03 Fixing the test suite for line number changes.
2111 ENHANCEMENT
2112 Test suites changed to be robust to exact line numbers of failures,
2113 since different Perl releases fail on different lines.
2114 <https://bugzilla.redhat.com/show_bug.cgi?id=1158380>
2115
2116 2.53, 2014-11-26 bug fixes and stability improvements to dbmapreduce
2117 ENHANCEMENT
2118 dbfilediff now supports a "--quiet" option.
2119
2120 ENHANCEMENT
2121 Better documentation of dbpipeline_filter.
2122
2123 BUGFIX
2124 Added groff-base and perl-podlators to the Fedora package spec.
2125 Fixes <https://bugzilla.redhat.com/show_bug.cgi?id=1163149>. (Also
2126 in package 2.52-2.)
2127
2128 BUGFIX
2129 An important stability improvement to dbmapreduce. It, plus
2130 dbmultistats, and dbcolstats now support controlled parallelism
2131 with the "--parallelism=N" option. They default to run with the
2132 number of available CPUs. dbmapreduce also moderates its level of
2133 parallelism. Previously it would create reducers as needed,
2134 causing CPU thrashing if reducers ran much slower than data
2135 production.
2136
2137 BUGFIX
2138 The combination of dbmapreduce with dbrowenumerate now works as it
2139 should. (The obscure bug was an interaction with dbcolcreate with
2140 non-multi-key reducers that output their own key. dbmapreduce has
2141 too many useful corner cases.)
2142
2143 2.54, 2014-11-28 fix for the test suite to correct failing tests on not-my-
2144 platform
2145 BUGFIX
2146 Sigh, the test suite now has a test suite. Because, yes, I broke
2147 it, causing many incorrect failures at cpantesters. Now fixed.
2148
2149 2.55, 2015-01-05 many spelling fixes and dbcolmovingstats tests are more
2150 robust to different numeric precision
2151 ENHANCEMENT
2152 dbfilediff now can be extra quiet, as I continue to try to track
2153 down a numeric difference on FreeBSD AMD boxes.
2154
2155 ENHANCEMENT
2156 dbcolmovingstats gave different test output (just reflecting
2157 rounding error) when stddev approaches zero. We now detect and
2158 handle this case. See
2159 <https://rt.cpan.org/Public/Bug/Display.html?id=101220> and thanks
2160 to H. Merijn Brand for the bug report.
2161
2162 BUG FIX
2163 Many, many spelling bugs found by H. Merijn Brand; thanks for the
2164 bug report.
2165
2166 INCOMPATIBLE CHANGE
2167 A number of programs had misspelled "separator" in
2168 "--fieldseparator" and "--columnseparator" options as "seperator".
2169 These are now correctly spelled.
2170
2171 2.56, 2015-02-03 fix against Getopt::Long-2.43's stricter error checking
2172 BUG FIX
2173 Internal argument parsing uses Getopt::Long, but mixed pass-through
2174 and <>. Bug reported by Petr Pisar at
2175 <https://bugzilla.redhat.com/show_bug.cgi?id=1188538>.
2176
2177 BUG FIX
2178 Added missing BuildRequires for "XML::Simple".
2179
2180 2.57, 2015-04-29 Minor changes, with better performance from dbmultistats.
2181 BUG FIX
2182 dbfilecat now honors "--remove-inputs" (previously it didn't).
2183 This omission meant that dbmapreduce (and dbmultistats) would
2184 accumulate files in /tmp when running. Bad news for inputs with 4M
2185 keys.
2186
2187 ENHANCEMENT
2188 dbmultistats should be faster with lots of small keys. dbcolstats
2189 now supports "-k" to get some of the functionality of dbmultistats
2190 (if data is pre-sorted and median/quartiles are not required).
2196
2197 2.58, 2015-04-30 Bugfix in dbmerge
2198 BUG FIX
2199 Fixed a case where dbmerge suffered mojibake in endgame mode. This
2200 bug surfaced when dbsort was applied to large files (big enough to
2201 require merging) with unicode in them; the symptom was something
2202 like:
2203 Wide character in print at /usr/lib64/perl5/IO/Handle.pm line
2204 420, <GEN12> line 111.
2205
2206 2.59, 2016-09-01 Collect a few small bug fixes and documentation
2207 improvements.
2208 BUG FIX
2209 More IO is explicitly marked UTF-8 to avoid Perl's tendency to
2210 mojibake on otherwise valid unicode input. This change helps
2211 html_table_to_db.
2212
2213 ENHANCEMENT
2214 dbcolscorrelate now cross-references dbcolsregression.
2215
2216 ENHANCEMENT
2217 Documentation for dbrowdiff now clarifies that the default is
2218 baseline mode.
2219
2220 BUG FIX
2221 dbjoin now propagates "-T" into the sorting process (if it is
2222 required). Thanks to Lan Wei for reporting this bug.
2223
2224 2.60, 2016-09-04 Adds support for hash joins.
2225 ENHANCEMENT
2226 dbjoin now supports hash joins with "-t lefthash" and "-t
2227 righthash". Hash joins cache a table in memory, but do not require
2228 that the other table be sorted. They are ideal when joining a
2229 large table against a small one.
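
As an illustration only, the idea behind a hash join can be sketched in plain shell and awk. This is not dbjoin's implementation: the hash_join function below is hypothetical, and it assumes two whitespace-separated tables keyed on their first column. It caches the first table in memory and streams the second in whatever order it arrives.

```shell
#!/bin/sh
# Hypothetical helper (not part of Fsdb): inner hash join on column 1.
# $1 = small table (held in memory), $2 = large table (streamed, unsorted).
hash_join() {
    awk '
        /^#/        { next }                  # skip fsdb header and comment lines
        FNR == NR   { cache[$1] = $2; next }  # first file: build the in-memory hash
        $1 in cache { print $1 "\t" $2 "\t" cache[$1] }  # second file: probe the hash
    ' "$1" "$2"
}

# Demo with two tiny tables; the large one is deliberately unsorted.
small=$(mktemp)
large=$(mktemp)
printf '#fsdb name id\na\t1\nb\t2\n' > "$small"
printf '#fsdb name value\nb\tx\nc\ty\na\tz\n' > "$large"
hash_join "$small" "$large"
rm -f "$small" "$large"
```

Only the first table is cached, so memory use in this sketch grows with the smaller input, while the larger input is read once and never sorted.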
2230
2231 2.61, 2016-09-05 Support left and right outer joins.
2232 ENHANCEMENT
2233 dbjoin now handles left and right outer joins with "-t left" and
2234 "-t right".
2235
2236 ENHANCEMENT
2237 dbjoin hash joins are now selected with "-m lefthash" and "-m
2238 righthash" (not the shortlived "-t righthash" option).
2239 (Technically this change is incompatible with Fsdb-2.60, but no one
2240 but me ever used that version.)
2241
2242 2.62, 2016-11-29 A new yaml_to_db and other minor improvements.
2243 ENHANCEMENT
2244 Documentation for xml_to_db now includes sample output.
2245
2246 NEW yaml_to_db converts a specific form of YAML to fsdb.
2247
2248 BUG FIX
2249 The test suite now uses "diff -c -b" rather than "diff -cb" to make
2250 OpenBSD-5.9 happier, I hope.
2251
2252 ENHANCEMENT
2253 Comments that log operations at the end of each file now do simple
2254 quoting of spaces. (It is not guaranteed to be fully shell-
2255 compliant.)
2256
2257 ENHANCEMENT
2258 There is a new standard option, "--header", allowing one to specify
2259 an Fsdb header for inputs that lack it. Currently it is supported
2260 by dbcoldefine, dbrowuniq, dbmapreduce, dbmultistats, dbsort,
2261 dbpipeline.
2262
2263 ENHANCEMENT
2264 dbfilepivot now allows the --possible-pivots option, and if it is
2265 provided processes the data in one pass.
2266
2267 ENHANCEMENT
2268 dbroweval logs are now quoted.
2269
2270 2.63, 2017-02-03 Re-add some features supposedly in 2.62 but not, and add
2271 more --header options.
2272 ENHANCEMENT
2273 The option -j is now a synonym for --parallelism. (And several
2274 documentation bugs about this option are fixed.)
2275
2276 ENHANCEMENT
2277 Additional support for "--header" in dbcolmerge, dbcol, dbrow, and
2278 dbroweval.
2279
2280 BUG FIX
2281 Version 2.62 was supposed to have this improvement, but did not
2282 (and now does): dbfilepivot now allows the --possible-pivots
2283 option, and if it is provided processes the data in one pass.
2284
2285 BUG FIX
2286 Version 2.62 was supposed to have this improvement, but did not
2287 (and now does): dbroweval logs are now quoted.
2288
2289 2.64, 2017-11-20 several small bugfixes and enhancements
2290 BUG FIX
2291 In dbroweval, the "next row" option previously did not correctly
2292 set up "_last_fieldname". It now does.
2293
2294 ENHANCEMENT
2295 The csv_to_db converter now has an optional "-F x" option to set
2296 the field separator.
2297
2298 ENHANCEMENT
2299 Finally dbcolsplittocols has a "--header" option, and a new "-N"
2300 option to give the list of resulting output columns.
2301
2302 INCOMPATIBLE CHANGE
2303 Now dbcolstats and dbmultistats produce no output (but a schema)
2304 when given no input but a schema. Previously they gave a null row
2305 of output. The "--output-on-no-input" and
2306 "--no-output-on-no-input" options can control this behavior.
2307
2308 2.65, 2018-02-16 Minor release, bug fix and -F option.
2309 ENHANCEMENT
2310 dbmultistats and dbmapreduce now both take a "-F x" option to set
2311 the field separator.
2312
2313 BUG FIX
2314 Fixed missing "use Carp" in dbcolstats. Also went back and cleaned
2315 up all uses of "croak()". Thanks to Zefram for the bug report.
2316
2317 2.66, 2018-12-20 Critical bug fix in dbjoin.
2318 BUG FIX
2319 Removed old tests from MANIFEST. (Thanks to Hang Guo for reporting
2320 this bug.)
2321
2322 IMPROVEMENT
2323 Errors for non-existing input files now include the bad filename
2324 (before: "cannot setup filehandle", now: "cannot open input: cannot
2325 open TEST/bad_filename").
2326
2327 BUG FIX
2328 Hash joins with three identical rows were failing with the
2329 assertion failure "internal error: confused about overflow" due to
2330 a now-fixed bug.
2331
2332 2.67, 2019-07-10 add support for reading and writing hdfs
2333 IMPROVEMENT
2334 dbformmail now has an "mh" mechanism that writes messages to
2335 individual files (an mh-style mailbox).
2336
2337 BUG FIX
2338 dbrow failed to include the Carp library, leading to failures on
2339 croak.
2340
2341 BUG FIX
2342 Fixed the dbjoin error message for an unsorted right stream, which was
2343 incorrect (it said left).
2344
2345 IMPROVEMENT
2346 All Fsdb programs can now read from and write to HDFS, when files
2347 that start with "hdfs:" are given to -i and -o options.
2348
2349 2.68, 2019-09-19 All programs now support automatic decompression based on
2350 file extension.
2351 IMPROVEMENT
2352 The omitted-possible-error test case for dbfilepivot now has an
2353 alternative output that I saw on some BSD-running systems (thanks
2354 to CPAN).
2355
2356 IMPROVEMENT
2357 dbmerge and dbmerge2 now support "--header". dbmerge2 now gives
2358 better error messages when presented the wrong number of inputs.
2359
2360 BUG FIX
2361 dbsort now works with "--header" even when the file is big (due to
2362 fixes to dbmerge).
2363
2364 IMPROVEMENT
2365 csv_to_db now processes data with the "binary" option, allowing it
2366 to handle newlines embedded in quoted fields.
2367
2368 IMPROVEMENT
2369 All programs now will transparently decompress input files, if they
2370 are given as an input filename argument that ends with a
2371 standard extension (.gz, .bz2, or .xz).
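
The extension-based dispatch can be pictured with a small shell sketch. The decompressor_for function is hypothetical and only illustrates the mapping from extension to decompressor. Fsdb handles this inside its own IO layer, and the external commands named here are an assumption made for the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of Fsdb): choose a decompression command
# from the filename extension, falling back to cat for plain files.
decompressor_for() {
    case "$1" in
        *.gz)  echo "gzip -dc" ;;
        *.bz2) echo "bzip2 -dc" ;;
        *.xz)  echo "xz -dc" ;;
        *)     echo "cat" ;;
    esac
}

# Demo: show which command would be used for a few filenames.
for f in data.fsdb data.fsdb.gz data.fsdb.bz2 data.fsdb.xz; do
    printf '%s -> %s\n' "$f" "$(decompressor_for "$f")"
done
```

A caller could then run the chosen command on the file and pipe the result into any filter that expects uncompressed input.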
2372
2373 2.69, 2019-11-22 a small bugfix in dbcolstats
2374 BUG FIX
2375 Filled in the test case for autodecompress, which was missing
2376 for the 2.68 release.
2377
2378 ENHANCEMENT
2379 The groff program is required for build, and the "Makefile.PL"
2380 fails if groff is missing at build time. Thanks to Chris Williams
2381 for suggesting this check, and the CPAN auto-building system for
2382 trying many platforms.
2383
2384 BUG FIX
2385 The dbcolstats program had numerical instability that sometimes
2386 results in failing with a square-root of a negative number when
2387 many values varied right at the edge of floating-point precision.
2388 We now detect and report that case as 0 stddev. Thanks to Hang Guo
2389 for providing a test case.
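
To make the failure mode concrete, here is a minimal sketch of the guard, assuming the one-pass sum and sum-of-squares formula for variance. Whether dbcolstats uses exactly this formula is an assumption, and the stddev_clamped function is hypothetical. With nearly identical values, rounding can leave the computed variance slightly below zero, so the sketch clamps it to zero before taking the square root.

```shell
#!/bin/sh
# Hypothetical helper (not part of Fsdb): sample standard deviation of the
# first column on standard input, clamping a negative variance to zero.
stddev_clamped() {
    awk '
        /^#/ { next }                          # skip header and comment lines
        { n++; sum += $1; sumsq += $1 * $1 }   # one-pass accumulation
        END {
            if (n < 2) { print 0; exit }       # too few values: report 0
            var = (sumsq - sum * sum / n) / (n - 1)
            if (var < 0) var = 0               # rounding went below zero: report 0
            printf "%.6g\n", sqrt(var)
        }
    '
}

# Demo: identical values have zero standard deviation.
printf '5\n5\n5\n' | stddev_clamped
```

Without the clamp, a variance that rounds to a tiny negative number would reach sqrt() and fail, which is the symptom described above.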
2390
2391 2.70, 2020-11-12 Some small quality-of-life enhancements and corner-case
2392 bugfixes.
2393 ENHANCEMENT
2394 dbcol can now take an option "-a" to include all columns, allowing
2395 reordering of certain columns while passing the rest through.
2396
2397 ENHANCEMENT
2398 dbrowuniq and dbmerge now buffer comments in a way that the last
2399 row of data output is no longer in the last block of comments.
2400 (The data is identical, but for humans looking at output, this
2401 change makes it less likely to lose the last row.)
2402
2403 BUG FIX
2404 dbmultistats and dbpipeline documentation now indicates that they
2405 support "--header" (something they have done since version 2.62 of
2406 2016-11-29, but that is only now documented).
2407
2408 ENHANCEMENT
2409 dbcolcreate now supports "--header".
2410
2411 BUG FIX
2412 Fixed several spelling errors in deprecated programs and removed
2413 information about the no-longer existing FreeBSD and MacOS ports.
2414 Thanks to Calvin Ardi for the patch.
2415
2416 BUG FIX
2417 dbmerge now handles --xargs when only one file is provided (and
2418 passes the file through unchanged). It also throws a clean error
2419 with --xargs if zero files are provided. (To support dbmerge,
2420 dbcol now has an internal "--saveoutput" option.) Thanks to Yuri
2421 Pradkin for reporting the unhandled corner-case.
2422
2423 2.71, 2020-11-16 Fix a race condition breaking test suites.
2424 BUG FIX
2425 Suppressed a race condition in dbcolmerge that was sometimes throwing the
2426 error "Fsdb::Support::Freds: ending, but running process:
2427 dbmerge:xargs" in the dbmerge_0_xargs test case, on exit.
2428
2429 2.72, 2020-12-01 A small bug and a packaging improvement.
2430 BUG FIX
2431 dbcolhisto now handles the degenerate case where everything has the
2432 same value (previously it would throw "illegal division by zero").
2433
2434 ENHANCEMENT
2435 The spec for Fedora now includes "make" as BuildRequires, something
2436 required for Fedora 34.
2437
2438 2.73, 2021-05-18 Updates dbcolpercentile with "--weighted", and with more
2439 ipv6.
2440 ENHANCEMENT
2441 dbcolpercentile now has a "--weighted" option.
2442
2443 ENHANCEMENT
2444 The new Fsdb::Support::IPv6 package includes ipv6_normalize,
2445 ipv6_zeroize to rewrite ipv6 print addresses in IPv6 normal form,
2446 with a 0 in each 4-nybble field.
2447
2449 John Heidemann, "johnh@isi.edu"
2450
2451 See "Contributors" for the many people who have contributed bug reports
2452 and fixes.
2453
2455 Fsdb is Copyright (C) 1991-2020 by John Heidemann <johnh@isi.edu>.
2456
2457 This program is free software; you can redistribute it and/or modify it
2458 under the terms of version 2 of the GNU General Public License as
2459 published by the Free Software Foundation.
2460
2461 This program is distributed in the hope that it will be useful, but
2462 WITHOUT ANY WARRANTY; without even the implied warranty of
2463 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
2464 General Public License for more details.
2465
2466 You should have received a copy of the GNU General Public License along
2467 with this program; if not, write to the Free Software Foundation, Inc.,
2468 675 Mass Ave, Cambridge, MA 02139, USA.
2469
2470 A copy of the GNU General Public License can be found in the file
2471 ``COPYING''.
2472
2474 Any comments about these programs should be sent to John Heidemann
2475 "johnh@isi.edu".
2476
2477
2478
2479perl v5.34.0 2021-07-22 Fsdb(3)