DATAMASH(1)                      User Commands                     DATAMASH(1)



NAME

       datamash - command-line calculations


SYNOPSIS

       datamash [OPTION] op [fld] [op fld ...]


DESCRIPTION

       Performs numeric/string operations on input from stdin.

       'op' is the operation to perform.  If a primary operation is used, it
       must be listed first, optionally followed by other operations.  'fld'
       is the input field to use.  'fld' can be a number (1=first field), or
       a field name when using the -H or --header-in options.  Multiple
       fields can be listed with a comma (e.g. 1,6,8).  A range of fields
       can be listed with a dash (e.g. 2-8).  Use colons for operations
       which require a pair of fields (e.g. 'pcov 2:6').

   Primary operations:
              groupby, crosstab, transpose, reverse, check

   Line-Filtering operations:
              rmdup

   Per-Line operations:
              base64, debase64, md5, sha1, sha224, sha256, sha384, sha512,
              bin, strbin, round, floor, ceil, trunc, frac, dirname,
              basename, barename, extname, getnum, cut

   Numeric Grouping operations:
              sum, min, max, absmin, absmax, range

   Textual/Numeric Grouping operations:
              count, first, last, rand, unique, collapse, countunique

   Statistical Grouping operations:
              mean, geomean, harmmean, trimmean, median, q1, q3, iqr, perc,
              mode, antimode, pstdev, sstdev, pvar, svar, mad, madraw,
              pskew, sskew, pkurt, skurt, dpo, jarque, scov, pcov,
              spearson, ppearson

   Grouping Options:
       -C, --skip-comments
              skip comment lines (starting with '#' or ';' and optional
              whitespace)

       -f, --full
              print entire input line before op results (default: print only
              the grouped keys)

       -g, --group=X[,Y,Z]
              group via fields X,[Y,Z]; equivalent to primary operation
              'groupby'

       --header-in
              first input line is column headers

       --header-out
              print column headers as first line

       -H, --headers
              same as '--header-in --header-out'

       -i, --ignore-case
              ignore upper/lower case when comparing text; this affects
              grouping and string operations

       -s, --sort
              sort the input before grouping; this removes the need to
              manually pipe the input through 'sort'

   File Operation Options:
       --no-strict
              allow lines with varying number of fields

       --filler=X
              fill missing values with X (default: N/A)

   General Options:
       -t, --field-separator=X
              use X instead of TAB as field delimiter

       --format=FORMAT
              print numeric values with printf-style floating-point FORMAT

       --output-delimiter=X
              use X as output field delimiter (default: use same delimiter
              as -t/-W)

       --narm skip NA/NaN values

       -R, --round=N
              round numeric output to N decimal places

       -W, --whitespace
              use whitespace (one or more spaces and/or tabs) for field
              delimiters

       -z, --zero-terminated
              end lines with 0 byte, not newline

       --help display this help and exit

       --version
              output version information and exit
108

AVAILABLE OPERATIONS

   Primary Operations
       Primary operations affect the way the file is processed. If used, the
       primary operation must be listed first. Some operations require field
       numbers (groupby, crosstab) while others do not (reverse, check,
       transpose).  If no primary operation is listed, the entire file is
       processed - either line-by-line (for 'per-line' operations) or all
       lines as one group (for grouping operations). See Examples section
       below.

       groupby X,Y,... op fld ...
                   group the file by given fields. Equivalent to option '-g'.
                   For each group perform operation op on field fld.


       crosstab X,Y [op fld ...]
                   cross-tabulate a file by two fields (cross-tabulation is
                   also known as pivot tables). If no operation is
                   specified, counts how many times each X,Y combination
                   appears.


       transpose   transpose rows, columns of the input file


       reverse     reverse field order in each line


       check [N lines] [N fields]
                   verify the input file has the same number of fields in
                   all lines, or the expected number of lines/fields.  The
                   number of lines and fields are printed to STDOUT.  Exits
                   with non-zero code and prints the offending line if there
                   is a mismatch in the number of lines/fields.
   Line-Filtering operations
       rmdup       remove lines with duplicated key value

   Per-Line operations
       base64      Encode the field as base64


       debase64    Decode the field as base64, exit with error if invalid
                   base64 string


       md5/sha1/sha224/sha256/sha384/sha512
                   Calculate md5/sha1/sha224/sha256/sha384/sha512 hash of
                   the field value


       bin[:BUCKET-SIZE]
                   bin numeric values into buckets of size BUCKET-SIZE
                   (defaults to 100).


       strbin[:BUCKET-SIZE]
                   hashes the input and returns a numeric integer value
                   between zero and BUCKET-SIZE (defaults to 10).


       round/floor/ceil/trunc/frac
                   numeric rounding operations. round (round half away from
                   zero), floor (round down), ceil (ceiling, round up),
                   trunc (truncate, round towards zero), frac (return the
                   fractional part of a decimal-point value).


       dirname/basename
                   extract the directory name and the base file name from a
                   given string (same as dirname(1) and basename(1)).


       extname     extract the extension of the file name (without the '.').


       barename    extract the base file name without the extension.


       getnum[:TYPE]
                   extract a number from the field. TYPE is an optional
                   single-letter option n/i/d/p/h/o (see examples below).


       cut         copy input field to output field (similar to cut(1)).

   Numeric Grouping operations
       sum         sum of the values


       min         minimum value


       max         maximum value


       absmin      minimum of the absolute values


       absmax      maximum of the absolute values


       range       the values range (max-min)

   Textual/Numeric Grouping operations
       count       count number of elements in the group


       first       the first value of the group


       last        the last value of the group


       rand        one random value from the group


       unique      comma-separated sorted list of unique values


       collapse    comma-separated list of all input values


       countunique number of unique/distinct values

   Statistical Grouping operations
       A p/s prefix indicates the variant: population or sample.  Typically,
       the sample variant is equivalent to GNU R's built-in functions (e.g.
       datamash's sstdev operation is equivalent to R's sd() function).
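
       The population/sample distinction is only the divisor: population
       operations divide by n, sample operations by n-1 (Bessel's
       correction).  A minimal Python sketch of pvar/svar (illustrative
       only, not datamash's implementation):

```python
from statistics import mean

def pvar(xs):
    # population variance: divide by n
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def svar(xs):
    # sample variance: divide by n - 1, matching R's var()
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
```

       pstdev and sstdev are the square roots of these two quantities.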

       mean        mean of the values


       geomean     geometric mean of the values


       harmmean    harmonic mean of the values


       trimmean[:PERCENT]
                   trimmed mean of the values. PERCENT should be between 0
                   and 0.5.  (trimmean:0 is equivalent to mean;
                   trimmean:0.5 is equivalent to median.)
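
       The trimming rule can be sketched in Python (R-style behavior;
       datamash's exact rounding of the trim count may differ):

```python
from statistics import median

def trimmean(xs, percent=0):
    # drop the lowest and highest floor(n*percent) values, average the
    # rest; when nothing is left (percent=0.5), fall back to the median
    xs = sorted(xs)
    k = int(len(xs) * percent)
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept) if kept else median(xs)
```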


       median      median value


       q1          1st quartile value


       q3          3rd quartile value


       iqr         inter-quartile range


       perc[:PERCENTILE]
                   percentile value PERCENTILE (defaults to 95).


       mode        mode value (most common value)


       antimode    anti-mode value (least common value)


       pstdev/sstdev
                   population/sample standard deviation


       pvar/svar   population/sample variance


       mad         median absolute deviation, scaled by the constant 1.4826
                   for normal distributions


       madraw      median absolute deviation, unscaled
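
       The mad/madraw relationship is just the 1.4826 scale factor, which
       makes the estimator consistent with the standard deviation for
       normally distributed data.  A sketch (not datamash's code):

```python
from statistics import median

def mad(xs, scale=1.4826):
    # median absolute deviation from the median; scale=1 gives madraw
    m = median(xs)
    return scale * median(abs(x - m) for x in xs)
```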


       pskew/sskew skewness of the group
                     values x reported by 'sskew' and 'pskew' operations:
                             x > 0       -  positively skewed / skewed right
                         0 > x           -  negatively skewed / skewed left
                             x > 1       -  highly skewed right
                         1 > x >  0.5    -  moderately skewed right
                       0.5 > x > -0.5    -  approximately symmetric
                      -0.5 > x > -1      -  moderately skewed left
                        -1 > x           -  highly skewed left


       pkurt/skurt excess kurtosis of the group


       jarque/dpo  p-value of the Jarque-Bera (jarque) and
                   D'Agostino-Pearson Omnibus (dpo) tests for normality:
                      null hypothesis is normality;
                      low p-values indicate non-normal data;
                      high p-values indicate the null hypothesis cannot be
                      rejected.
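
       The Jarque-Bera statistic combines skewness and excess kurtosis; a
       rough Python sketch follows (datamash may apply finite-sample
       adjustments, so treat this as illustrative):

```python
import math

def jarque_bera_p(xs):
    # JB = n/6 * (skew^2 + ekurt^2/4); under the null of normality JB is
    # approximately chi-square(2), whose survival function is exp(-x/2)
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    ekurt = m4 / m2 ** 2 - 3
    jb = n / 6.0 * (skew ** 2 + ekurt ** 2 / 4)
    return math.exp(-jb / 2)  # small p => data look non-normal
```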


       pcov/scov [X:Y]
                   covariance of fields X and Y


       ppearson/spearson [X:Y]
                   Pearson product-moment correlation coefficient [Pearson's
                   R] of fields X and Y
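
       As with variance, scov uses the n-1 divisor and pcov uses n.  A
       Python sketch of the sample covariance and Pearson's r (illustrative
       only; note the divisors cancel in r itself):

```python
def scov(xs, ys):
    # sample covariance (n-1 divisor); pcov divides by n instead
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def pearson(xs, ys):
    # Pearson product-moment correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5
```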

EXAMPLES

   Basic usage
       Print the sum and the mean of values from field 1:

              $ seq 10 | datamash sum 1 mean 1
              55  5.5

       Group input based on field 1, and sum values (per group) on field 2:

              $ cat example.txt
              A  10
              A  5
              B  9
              B  11

              $ datamash -g 1 sum 2 < example.txt
              A  15
              B  20

              $ datamash groupby 1 sum 2 < example.txt
              A  15
              B  20

       Unsorted input must be sorted (with '-s'):

              $ cat example.txt
              A  10
              C  4
              B  9
              C  1
              A  5
              B  11

              $ datamash -s -g1 sum 2 < example.txt
              A  15
              B  20
              C  5

       Which is equivalent to:

              $ cat example.txt | sort -k1,1 | datamash -g 1 sum 2

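       What '-g 1 sum 2' computes can be sketched in Python (unlike
       datamash, a dictionary does not require the input to be sorted or
       pre-grouped):

```python
from collections import defaultdict

def groupby_sum(lines):
    # group on field 1, sum field 2
    totals = defaultdict(float)
    for line in lines:
        key, value = line.split()
        totals[key] += float(value)
    return dict(totals)
```
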
   Header lines
       Use -H (--headers) if the input file has a header line:

              # Given a file with student name, major, test score...
              $ head -n5 scores_h.txt
              Name           Major            Score
              Shawn          Engineering      47
              Caleb          Business         87
              Christian      Business         88
              Derek          Arts             60

              # Calculate the mean and standard deviation for each major
              $ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt

               (or use short form)

              $ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt

               (or use named fields)

              $ datamash -sH -g Major mean Score pstdev Score < scores_h.txt
              GroupBy(Major)    mean(Score)   pstdev(Score)
              Arts              68.9          10.1
              Business          87.3           4.9
              Engineering       66.5          19.1
              Health-Medicine   90.6           8.8
              Life-Sciences     55.3          19.7
              Social-Sciences   60.2          16.6

       Field names must be escaped with a backslash if they start with a
       digit or contain special characters (dash/minus, colons, commas).
       Note the interplay between escaping with backslash and shell
       quoting.  The following equivalent commands sum the values of a
       field named "FOO-BAR":

              $ datamash -H sum FOO\\-BAR < input.txt
              $ datamash -H sum 'FOO\-BAR' < input.txt
              $ datamash -H sum "FOO\\-BAR" < input.txt

   Skipping comment lines
       Use -C (--skip-comments) to skip lines starting with '#' or ';'
       characters (and optional whitespace before them):

              $ cat in.txt
               #foo   3
              bar     5
              ;baz    7

              $ datamash sum 2 < in.txt
              15

              $ datamash -C sum 2 < in.txt
              5

   Multiple fields
       Use comma or dash to specify multiple fields. The following are
       equivalent:
              $ seq 9 | paste - - -
              1   2   3
              4   5   6
              7   8   9

              $ seq 9 | paste - - - | datamash sum 1 sum 2 sum 3
              12  15  18

              $ seq 9 | paste - - - | datamash sum 1,2,3
              12  15  18

              $ seq 9 | paste - - - | datamash sum 1-3
              12  15  18

   Rounding
       The following demonstrates the different rounding operations:
       $ ( echo X ; seq -1.25 0.25 1.25 ) \
             | datamash --full -H round 1 ceil 1 floor 1 trunc 1 frac 1

         X     round(X)  ceil(X)  floor(X)  trunc(X)   frac(X)
       -1.25   -1        -1       -2        -1         -0.25
       -1.00   -1        -1       -1        -1          0
       -0.75   -1         0       -1         0         -0.75
       -0.50   -1         0       -1         0         -0.5
       -0.25    0         0       -1         0         -0.25
        0.00    0         0        0         0          0
        0.25    0         1        0         0          0.25
        0.50    1         1        0         0          0.5
        0.75    1         1        0         0          0.75
        1.00    1         1        1         1          0
        1.25    1         2        1         1          0.25
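
       Note that 'round' rounds halves away from zero.  Python's built-in
       round() uses banker's rounding instead, so an equivalent sketch
       needs a helper:

```python
import math

def round_half_away(x):
    # halves go away from zero: 0.5 -> 1, -0.5 -> -1
    return math.copysign(math.floor(abs(x) + 0.5), x)

def frac(x):
    # fractional part, sign preserved: -0.75 -> -0.75, 1.25 -> 0.25
    return x - math.trunc(x)
```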

   Reversing fields
              $ seq 6 | paste - - | datamash reverse
              2    1
              4    3
              6    5

   Transposing a file
              $ seq 6 | paste - - | datamash transpose
              1    3    5
              2    4    6
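
       Both transformations are one-liners in Python (a sketch of the
       semantics, assuming a rectangular table):

```python
def transpose(rows):
    # swap rows and columns
    return [list(col) for col in zip(*rows)]

def reverse_fields(row):
    # reverse the field order within one line
    return row[::-1]
```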

   Removing Duplicated lines
       Remove lines with a duplicated key value in field 1 (unlike the
       first/last operations, rmdup is much faster and does not require
       sorting the file with -s):

              # Given a list of files and sample IDs:
              $ cat INPUT
              SampleID  File
              2         cc.txt
              3         dd.txt
              1         ab.txt
              2         ee.txt
              3         ff.txt

              # Remove lines with duplicated Sample-ID (field 1):
              $ datamash rmdup 1 < INPUT

              # or use named field:
              $ datamash -H rmdup SampleID < INPUT
              SampleID  File
              2         cc.txt
              3         dd.txt
              1         ab.txt
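
       As the output above shows, the first line seen for each key wins and
       input order is preserved.  A Python sketch of that rule:

```python
def rmdup(lines, field=0):
    # keep the first line seen for each key, preserving input order
    seen, out = set(), []
    for line in lines:
        key = line.split()[field]
        if key not in seen:
            seen.add(key)
            out.append(line)
    return out
```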

   Checksums
       Calculate the sha1 hash of each file name in the sha1sum output
       (field 2), keeping the full input line with -f; sha1sum itself has
       already hashed each file's content:

              $ sha1sum *.txt | datamash -Wf sha1 2
   Check file structure
       Check the structure of the input file: ensure all lines have the
       same number of fields, or the expected number of lines/fields:

              $ seq 10 | paste - - | datamash check && echo ok || echo fail
              5 lines, 2 fields
              ok

              $ seq 13 | paste - - - | datamash check && echo ok || echo fail
              line 4 (3 fields):
                10  11  12
              line 5 (2 fields):
                13
              datamash: check failed: line 5 has 2 fields (previous line had 3)
              fail

              $ seq 10 | paste - - | datamash check 2 fields 5 lines
              5 lines, 2 fields

              $ seq 10 | paste - - | datamash check 4 fields
              line 1 (2 fields):
                1     2
              datamash: check failed: line 1 has 2 fields (expecting 4)

              $ seq 10 | paste - - | datamash check 7 lines
              datamash: check failed: input had 5 lines (expecting 7)

   Cross-Tabulation
       Cross-tabulation compares the relationship between two fields.
       Given the following input file:
              $ cat input.txt
              a    x    3
              a    y    7
              b    x    21
              a    x    40

       Show cross-tabulation between the first field (a/b) and the second
       field (x/y) - counting how many times each pair appears (note:
       sorting is required):

              $ datamash -s crosstab 1,2 < input.txt
                   x    y
              a    2    1
              b    1    N/A

       An optional grouping operation can be used instead of counting:

              $ datamash -s crosstab 1,2 sum 3 < input.txt
                   x    y
              a    43   7
              b    21   N/A

              $ datamash -s crosstab 1,2 unique 3 < input.txt
                   x    y
              a    3,40 7
              b    21   N/A
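
       The default counting behavior can be sketched in Python (illustrative
       only; datamash formats the result as a table):

```python
from collections import Counter

def crosstab_count(pairs):
    # count each (row, col) pair; missing combinations become N/A
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return {r: {c: counts.get((r, c), "N/A") for c in cols} for r in rows}
```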

   Binning numeric values
       Bin input values into buckets of size 5:

              $  ( echo X ; seq -10 2.5 10 ) \
                    | datamash -H --full bin:5 1
                  X  bin(X)
              -10.0    -15
               -7.5    -10
               -5.0    -10
               -2.5     -5
                0.0      0
                2.5      0
                5.0      5
                7.5      5
               10.0     10

   Binning string values
       Hash any input value into a numeric integer.  A typical usage would
       be to split an input file into N chunks, ensuring that all values of
       a certain key will be stored in the same chunk:

              $ cat input.txt
              PatientA   10
              PatientB   11
              PatientC   12
              PatientA   14
              PatientC   15

       Each patient ID is hashed into a bin between 0 and 9 and printed in
       the last field:

              $ datamash --full strbin 1 < input.txt
              PatientA   10    5
              PatientB   11    6
              PatientC   12    7
              PatientA   14    5
              PatientC   15    7

       Splitting the input into chunks can be done with awk:

              $ cat input.txt \
                  | datamash --full strbin 1 \
                  | awk '{print > $NF ".txt"}'

   Extracting numbers with getnum
       The 'getnum' operation extracts a numeric value from the field:

              $ echo zoom-123.45xyz | datamash getnum 1
              123.45

       getnum accepts an optional single-letter TYPE option:
           getnum:n - natural numbers (positive integers, including zero)
           getnum:i - integers
           getnum:d - decimal point numbers
           getnum:p - positive decimal point numbers (this is the default)
           getnum:h - hex numbers
           getnum:o - octal numbers

       Examples:
              $ echo zoom-123.45xyz | datamash getnum 1
              123.45

              $ echo zoom-123.45xyz | datamash getnum:n 1
              123

              $ echo zoom-123.45xyz | datamash getnum:i 1
              -123

              $ echo zoom-123.45xyz | datamash getnum:d 1
              -123.45

              $ echo zoom-123.45xyz | datamash getnum:p 1
              123.45

              # Hex 0x123 = 291 Decimal
              $ echo zoom-123.45xyz | datamash getnum:h 1
              291

              # Octal 0123 = 83 Decimal
              $ echo zoom-123.45xyz | datamash getnum:o 1
              83
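
       The TYPE variants above can be approximated with regular expressions
       in Python (a rough illustration only -- not datamash's actual
       parser):

```python
import re

# Regex approximations of getnum's TYPE variants:
PATTERNS = {
    "n": r"\d+",              # natural numbers
    "i": r"-?\d+",            # integers
    "d": r"-?\d+(?:\.\d+)?",  # decimal point numbers
    "p": r"\d+(?:\.\d+)?",    # positive decimal numbers (the default)
}

def getnum(s, kind="p"):
    # return the first number of the requested type found in the string
    text = re.search(PATTERNS[kind], s).group()
    return float(text) if "." in text else int(text)
```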

ADDITIONAL INFORMATION

       See GNU Datamash Website (https://www.gnu.org/software/datamash)


AUTHOR

       Written by Assaf Gordon.

       Copyright © 2020 Assaf Gordon.  License GPLv3+: GNU GPL version 3 or
       later <https://gnu.org/licenses/gpl.html>.
       This is free software: you are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.


SEE ALSO

       The full documentation for datamash is maintained as a Texinfo
       manual.  If the info and datamash programs are properly installed at
       your site, the command

              info datamash

       should give you access to the complete manual.



datamash 1.7                      April 2020                       DATAMASH(1)