DATAMASH(1)                      User Commands                     DATAMASH(1)

NAME
       datamash - command-line calculations

SYNOPSIS
       datamash [OPTION] op [fld] [op fld ...]

DESCRIPTION
       Performs numeric/string operations on input from stdin.

       'op' is the operation to perform.  If a primary operation is used, it
       must be listed first, optionally followed by other operations.  'fld'
       is the input field to use.  'fld' can be a number (1=first field), or
       a field name when using the -H or --header-in options.  Multiple
       fields can be listed with a comma (e.g. 1,6,8).  A range of fields
       can be listed with a dash (e.g. 2-8).  Use colons for operations
       which require a pair of fields (e.g. 'pcov 2:6').

       Primary operations:
        groupby, crosstab, transpose, reverse, check

       Line-Filtering operations:
        rmdup

       Per-Line operations:
        base64, debase64, md5, sha1, sha256, sha512, bin, strbin, round,
        floor, ceil, trunc, frac, dirname, basename, barename, extname,
        getnum, cut

       Numeric Grouping operations:
        sum, min, max, absmin, absmax, range

       Textual/Numeric Grouping operations:
        count, first, last, rand, unique, collapse, countunique

       Statistical Grouping operations:
        mean, trimmean, median, q1, q3, iqr, perc, mode, antimode, pstdev,
        sstdev, pvar, svar, mad, madraw, pskew, sskew, pkurt, skurt, dpo,
        jarque, scov, pcov, spearson, ppearson

       Grouping Options:
       -C, --skip-comments
              skip comment lines (starting with '#' or ';' and optional
              whitespace)

       -f, --full
              print entire input line before op results (default: print
              only the grouped keys)

       -g, --group=X[,Y,Z]
              group via fields X,[Y,Z]; equivalent to primary operation
              'groupby'

       --header-in
              first input line is column headers

       --header-out
              print column headers as first line

       -H, --headers
              same as '--header-in --header-out'

       -i, --ignore-case
              ignore upper/lower case when comparing text; this affects
              grouping and string operations

       -s, --sort
              sort the input before grouping; this removes the need to
              manually pipe the input through 'sort'

       File Operation Options:
       --no-strict
              allow lines with varying number of fields

       --filler=X
              fill missing values with X (default: N/A)

       General Options:
       -t, --field-separator=X
              use X instead of TAB as field delimiter

       --format=FORMAT
              print numeric values with printf-style floating-point FORMAT

       --output-delimiter=X
              use X as the output field delimiter (default: use same
              delimiter as -t/-W)

       --narm skip NA/NaN values

       -R, --round=N
              round numeric output to N decimal places

       -W, --whitespace
              use whitespace (one or more spaces and/or tabs) for field
              delimiters

       -z, --zero-terminated
              end lines with 0 byte, not newline

       --help display this help and exit

       --version
              output version information and exit

OPTIONS
   Primary Operations
       Primary operations affect the way the file is processed.  If used,
       the primary operation must be listed first.  Some operations require
       field numbers (groupby, crosstab) while others do not (reverse,
       check, transpose).  If a primary operation is not listed, the entire
       file is processed - either line-by-line (for 'per-line' operations)
       or all lines as one group (for grouping operations).  See the
       EXAMPLES section below.

       groupby X,Y,... op fld ...
                   group the file by the given fields.  Equivalent to
                   option '-g'.  For each group, perform operation op on
                   field fld.

       crosstab X,Y [op fld ...]
                   cross-tabulate the file by two fields (cross-tabulation
                   is also known as a pivot table).  If no operation is
                   specified, counts how many occurrences of each X,Y pair
                   exist.

       transpose   transpose rows, columns of the input file

       reverse     reverse field order in each line

       check [N lines] [N fields]
                   verify the input file has the same number of fields in
                   all lines, or the expected number of lines/fields.  The
                   number of lines and fields is printed to STDOUT.  Exits
                   with a non-zero code and prints the offending line if
                   there is a mismatch in the number of lines/fields.

   Line-Filtering operations
       rmdup       remove lines with duplicated key value

   Per-Line operations
       base64      Encode the field as base64

       debase64    Decode the field as base64; exit with an error if the
                   field is not a valid base64 string

       md5/sha1/sha256/sha512
                   Calculate md5/sha1/sha256/sha512 hash of the field value

       bin[:BUCKET-SIZE]
                   bin numeric values into buckets of size BUCKET-SIZE
                   (defaults to 100)

       strbin[:BUCKET-SIZE]
                   hashes the input and returns a numeric integer value
                   between zero and BUCKET-SIZE-1 (BUCKET-SIZE defaults
                   to 10)

       round/floor/ceil/trunc/frac
                   numeric rounding operations: round (round half away
                   from zero), floor (round down), ceil (ceiling, round
                   up), trunc (truncate, round towards zero), frac
                   (fraction, return the fractional part of a
                   decimal-point value)

       dirname/basename
                   extract the directory name and the base file name from
                   a given string (same as dirname(1) and basename(1))

       extname     extract the extension of the file name (without the '.')

       barename    extract the base file name without the extension

       getnum[:TYPE]
                   extract a number from the field.  TYPE is an optional
                   single-letter option n/i/d/p/h/o (see examples below)

       cut         copy input field to output field (similar to cut(1))

   Numeric Grouping operations
       sum         sum of the values

       min         minimum value

       max         maximum value

       absmin      minimum of the absolute values

       absmax      maximum of the absolute values

       range       the range of the values (max-min)

   Textual/Numeric Grouping operations
       count       count number of elements in the group

       first       the first value of the group

       last        the last value of the group

       rand        one random value from the group

       unique      comma-separated sorted list of unique values

       collapse    comma-separated list of all input values

       countunique number of unique/distinct values

   Statistical Grouping operations
       A p/s prefix indicates the variant: population or sample.
       Typically, the sample variant is equivalent to GNU R's internal
       functions (e.g. datamash's sstdev operation is equivalent to R's
       sd() function).

       mean        mean of the values

       trimmean[:PERCENT]
                   trimmed mean of the values.  PERCENT should be between
                   0 and 0.5.  (trimmean:0 is equivalent to mean;
                   trimmean:0.5 is equivalent to median.)

       median      median value

       q1          1st quartile value

       q3          3rd quartile value

       iqr         inter-quartile range

       perc[:PERCENTILE]
                   percentile value PERCENTILE (defaults to 95)

       mode        mode value (most common value)

       antimode    anti-mode value (least common value)

       pstdev/sstdev
                   population/sample standard deviation

       pvar/svar   population/sample variance

       mad         median absolute deviation, scaled by constant 1.4826
                   for normal distributions

       madraw      median absolute deviation, unscaled

       pskew/sskew skewness of the group
                   values x reported by 'sskew' and 'pskew' operations:
                          x > 0          -  positively skewed / skewed right
                          0 > x          -  negatively skewed / skewed left
                          x > 1          -  highly skewed right
                          1 > x > 0.5    -  moderately skewed right
                          0.5 > x > -0.5 -  approximately symmetric
                          -0.5 > x > -1  -  moderately skewed left
                          -1 > x         -  highly skewed left

       pkurt/skurt excess kurtosis of the group

       jarque/dpo  p-value of the Jarque-Bera (jarque) and
                   D'Agostino-Pearson Omnibus (dpo) tests for normality:
                          null hypothesis is normality;
                          low p-values indicate non-normal data;
                          high p-values indicate the null hypothesis
                          cannot be rejected.

       pcov/scov [X:Y]
                   covariance of fields X and Y

       ppearson/spearson [X:Y]
                   Pearson product-moment correlation coefficient
                   [Pearson's R] of fields X and Y

EXAMPLES
   Basic usage
       Print the sum and the mean of values from field 1:

           $ seq 10 | datamash sum 1 mean 1
           55    5.5

       Group input based on field 1, and sum values (per group) on field 2:

           $ cat example.txt
           A     10
           A     5
           B     9
           B     11

           $ datamash -g 1 sum 2 < example.txt
           A     15
           B     20

           $ datamash groupby 1 sum 2 < example.txt
           A     15
           B     20

       Unsorted input must be sorted (with '-s'):

           $ cat example.txt
           A     10
           C     4
           B     9
           C     1
           A     5
           B     11

           $ datamash -s -g1 sum 2 < example.txt
           A     15
           B     20
           C     5

       Which is equivalent to:

           $ cat example.txt | sort -k1,1 | datamash -g 1 sum 2

   Header lines
       Use -H (--headers) if the input file has a header line:

           # Given a file with student name, major, test score...
           $ head -n5 scores_h.txt
           Name           Major            Score
           Shawn          Engineering      47
           Caleb          Business         87
           Christian      Business         88
           Derek          Arts             60

           # Calculate the mean and standard deviation for each major
           $ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt

       (or use short form)

           $ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt

       (or use named fields)

           $ datamash -sH -g Major mean Score pstdev Score < scores_h.txt
           GroupBy(Major)    mean(Score)    pstdev(Score)
           Arts              68.9           10.1
           Business          87.3            4.9
           Engineering       66.5           19.1
           Health-Medicine   90.6            8.8
           Life-Sciences     55.3           19.7
           Social-Sciences   60.2           16.6

       Field names must be escaped with a backslash if they start with a
       digit or contain special characters (dash/minus, colons, commas).
       Note the interplay between escaping with backslash and shell
       quoting.  The following equivalent commands sum the values of a
       field named "FOO-BAR":

           $ datamash -H sum FOO\\-BAR < input.txt
           $ datamash -H sum 'FOO\-BAR' < input.txt
           $ datamash -H sum "FOO\\-BAR" < input.txt

   Skipping comment lines
       Use -C (--skip-comments) to skip lines starting with '#' or ';'
       characters (and optional whitespace before them):

           $ cat in.txt
            #foo   3
           bar     5
           ;baz    7

           $ datamash sum 2 < in.txt
           15

           $ datamash -C sum 2 < in.txt
           5

   Multiple fields
       Use comma or dash to specify multiple fields.  The following are
       equivalent:

           $ seq 9 | paste - - -
           1     2     3
           4     5     6
           7     8     9

           $ seq 9 | paste - - - | datamash sum 1 sum 2 sum 3
           12    15    18

           $ seq 9 | paste - - - | datamash sum 1,2,3
           12    15    18

           $ seq 9 | paste - - - | datamash sum 1-3
           12    15    18

   Rounding
       The following demonstrate the different rounding operations:

           $ ( echo X ; seq -1.25 0.25 1.25 ) \
                 | datamash --full -H round 1 ceil 1 floor 1 trunc 1 frac 1

             X     round(X)  ceil(X)  floor(X)  trunc(X)  frac(X)
           -1.25   -1        -1       -2        -1        -0.25
           -1.00   -1        -1       -1        -1         0
           -0.75   -1         0       -1         0        -0.75
           -0.50   -1         0       -1         0        -0.5
           -0.25    0         0       -1         0        -0.25
            0.00    0         0        0         0         0
            0.25    0         1        0         0         0.25
            0.50    1         1        0         0         0.5
            0.75    1         1        0         0         0.75
            1.00    1         1        1         1         0
            1.25    1         2        1         1         0.25

   Reversing fields
           $ seq 6 | paste - - | datamash reverse
           2     1
           4     3
           6     5

   Transposing a file
           $ seq 6 | paste - - | datamash transpose
           1     3     5
           2     4     6

   Removing Duplicated lines
       Remove lines with a duplicate key value from field 1 (unlike the
       first/last operations, rmdup is much faster and does not require
       sorting the file with -s):

           # Given a list of files and sample IDs:
           $ cat INPUT
           SampleID   File
           2          cc.txt
           3          dd.txt
           1          ab.txt
           2          ee.txt
           3          ff.txt

           # Remove lines with duplicated Sample-ID (field 1):
           $ datamash rmdup 1 < INPUT

           # or use named field:
           $ datamash -H rmdup SampleID < INPUT
           SampleID   File
           2          cc.txt
           3          dd.txt
           1          ab.txt

   Checksums
       Calculate the sha1 hash value of each TXT file, after calculating
       the sha1 hash value of each file's content:

           $ sha1sum *.txt | datamash -Wf sha1 2

   Check file structure
       Check the structure of the input file: ensure all lines have the
       same number of fields, or the expected number of lines/fields:

           $ seq 10 | paste - - | datamash check && echo ok || echo fail
           5 lines, 2 fields
           ok

           $ seq 13 | paste - - - | datamash check && echo ok || echo fail
           line 4 (3 fields):
             10    11    12
           line 5 (2 fields):
             13
           datamash: check failed: line 5 has 2 fields (previous line had 3)
           fail

           $ seq 10 | paste - - | datamash check 2 fields 5 lines
           5 lines, 2 fields

           $ seq 10 | paste - - | datamash check 4 fields
           line 1 (2 fields):
             1     2
           datamash: check failed: line 1 has 2 fields (expecting 4)

           $ seq 10 | paste - - | datamash check 7 lines
           datamash: check failed: input had 5 lines (expecting 7)

   Cross-Tabulation
       Cross-tabulation compares the relationship between two fields.
       Given the following input file:

           $ cat input.txt
           a    x    3
           a    y    7
           b    x    21
           a    x    40

       Show cross-tabulation between the first field (a/b) and the second
       field (x/y) - counting how many times each pair appears (note:
       sorting is required):

           $ datamash -s crosstab 1,2 < input.txt
                x    y
           a    2    1
           b    1    N/A

       An optional grouping operation can be used instead of counting:

           $ datamash -s crosstab 1,2 sum 3 < input.txt
                x    y
           a    43   7
           b    21   N/A

           $ datamash -s crosstab 1,2 unique 3 < input.txt
                x     y
           a    3,40  7
           b    21    N/A

   Binning numeric values
       Bin input values into buckets of size 5:

           $ ( echo X ; seq -10 2.5 10 ) \
                 | datamash -H --full bin:5 1
             X     bin(X)
           -10.0   -15
            -7.5   -10
            -5.0   -10
            -2.5   -5
             0.0    0
             2.5    0
             5.0    5
             7.5    5
            10.0   10

   Binning string values
       Hash any input value into a numeric integer.  A typical usage
       would be to split an input file into N chunks, ensuring that all
       values of a certain key will be stored in the same chunk:

           $ cat input.txt
           PatientA   10
           PatientB   11
           PatientC   12
           PatientA   14
           PatientC   15

       Each patient ID is hashed into a bin between 0 and 9 and printed
       in the last field:

           $ datamash --full strbin 1 < input.txt
           PatientA   10   5
           PatientB   11   6
           PatientC   12   7
           PatientA   14   5
           PatientC   15   7

       Splitting the input into chunks can be done with awk:

           $ cat input.txt \
               | datamash --full strbin 1 \
               | awk '{print > $NF ".txt"}'

   Extracting numbers with getnum
       The 'getnum' operation extracts a numeric value from the field:

           $ echo zoom-123.45xyz | datamash getnum 1
           123.45

       getnum accepts an optional single-letter TYPE option:
           getnum:n - natural numbers (positive integers, including zero)
           getnum:i - integers
           getnum:d - decimal point numbers
           getnum:p - positive decimal point numbers (this is the default)
           getnum:h - hex numbers
           getnum:o - octal numbers

       Examples:
           $ echo zoom-123.45xyz | datamash getnum 1
           123.45

           $ echo zoom-123.45xyz | datamash getnum:n 1
           123

           $ echo zoom-123.45xyz | datamash getnum:i 1
           -123

           $ echo zoom-123.45xyz | datamash getnum:d 1
           -123.45

           $ echo zoom-123.45xyz | datamash getnum:p 1
           123.45

           # Hex 0x123 = 291 Decimal
           $ echo zoom-123.45xyz | datamash getnum:h 1
           291

           # Octal 0123 = 83 Decimal
           $ echo zoom-123.45xyz | datamash getnum:o 1
           83

ADDITIONAL INFORMATION
       See GNU Datamash Website (https://www.gnu.org/software/datamash)

AUTHOR
       Written by Assaf Gordon.

COPYRIGHT
       Copyright © 2019 Assaf Gordon.  License GPLv3+: GNU GPL version 3
       or later <https://gnu.org/licenses/gpl.html>.
       This is free software: you are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       The full documentation for datamash is maintained as a Texinfo
       manual.  If the info and datamash programs are properly installed
       at your site, the command

              info datamash

       should give you access to the complete manual.

datamash 1.5                     September 2019                    DATAMASH(1)