DATAMASH(1)                      User Commands                     DATAMASH(1)



NAME
       datamash - command-line calculations

SYNOPSIS
       datamash [OPTION] op [fld] [op fld ...]

DESCRIPTION
       Performs numeric/string operations on input from stdin.

       'op' is the operation to perform. If a primary operation is used, it
       must be listed first, optionally followed by other operations. 'fld'
       is the input field to use. 'fld' can be a number (1=first field), or
       a field name when using the -H or --header-in options. Multiple
       fields can be listed with a comma (e.g. 1,6,8). A range of fields
       can be listed with a dash (e.g. 2-8). Use colons for operations
       which require a pair of fields (e.g. 'pcov 2:6').
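
       For example, computing the Pearson correlation of fields 1 and 2
       uses the colon pair syntax (an illustrative sketch; perfectly linear
       input yields 1, though the exact output formatting may differ):

              $ printf '1\t2\n2\t4\n3\t6\n' | datamash ppearson 1:2
              1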

       Primary operations:
              groupby, crosstab, transpose, reverse, check

       Line-Filtering operations:
              rmdup

       Per-Line operations:
              base64, debase64, md5, sha1, sha256, sha512, bin, strbin,
              round, floor, ceil, trunc, frac

       Numeric Grouping operations:
              sum, min, max, absmin, absmax, range

       Textual/Numeric Grouping operations:
              count, first, last, rand, unique, collapse, countunique

       Statistical Grouping operations:
              mean, trimmean, median, q1, q3, iqr, perc, mode, antimode,
              pstdev, sstdev, pvar, svar, mad, madraw, pskew, sskew, pkurt,
              skurt, dpo, jarque, scov, pcov, spearson, ppearson

       Grouping Options:
              -f, --full
                     print entire input line before op results (default:
                     print only the grouped keys)

              -g, --group=X[,Y,Z]
                     group via fields X,[Y,Z]; equivalent to primary
                     operation 'groupby'

              --header-in
                     first input line is column headers

              --header-out
                     print column headers as first line

              -H, --headers
                     same as '--header-in --header-out'

              -i, --ignore-case
                     ignore upper/lower case when comparing text; this
                     affects grouping and string operations

              -s, --sort
                     sort the input before grouping; this removes the need
                     to manually pipe the input through 'sort'

       File Operation Options:
              --no-strict
                     allow lines with varying number of fields

              --filler=X
                     fill missing values with X (default: N/A)
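
       For example, transposing a ragged file with --no-strict fills the
       missing cells with the filler value (an illustrative sketch):

              $ printf 'A\tB\tC\nA\tB\n' | datamash --no-strict transpose
              A     A
              B     B
              C     N/A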

       General Options:
              -t, --field-separator=X
                     use X instead of TAB as field delimiter

              --format=FORMAT
                     print numeric values with printf-style floating-point
                     FORMAT

              --output-delimiter=X
                     use X as the output field delimiter (default: use the
                     same delimiter as -t/-W)

              --narm skip NA/NaN values

              -R, --round=N
                     round numeric output to N decimal places

              -W, --whitespace
                     use whitespace (one or more spaces and/or tabs) for
                     field delimiters

              -z, --zero-terminated
                     end lines with 0 byte, not newline

              --help display this help and exit

              --version
                     output version information and exit
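
       For example, --format and -R control how numeric results are printed
       (an illustrative sketch; -R is assumed here to print a fixed number
       of decimal places):

              $ seq 3 | datamash --format "%.3f" sum 1
              6.000

              $ seq 3 | datamash -R 2 mean 1
              2.00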

OPTIONS
   Primary Operations
       Primary operations affect the way the file is processed. If used,
       the primary operation must be listed first. Some operations require
       field numbers (groupby, crosstab) while others do not (reverse,
       check, transpose). If no primary operation is listed, the entire
       file is processed - either line-by-line (for 'per-line' operations)
       or all lines as one group (for grouping operations). See the
       EXAMPLES section below.
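
       For example, with no primary operation a grouping operation treats
       the whole file as one group, while a per-line operation produces one
       result per input line (an illustrative sketch):

              $ seq 4 | datamash sum 1
              10

              $ seq 4 | datamash ceil 1
              1
              2
              3
              4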

       groupby X,Y,... op fld ...
              group the file by the given fields. Equivalent to option
              '-g'. For each group, perform operation op on field fld.

       crosstab X,Y [op fld ...]
              cross-tabulate a file by two fields (cross-tabulation is
              also known as pivot tables). If no operation is specified,
              counts how many times each pair of X,Y values appears.

       transpose
              transpose the rows and columns of the input file

       reverse
              reverse the field order in each line

       check [N lines] [N fields]
              verify that the input file has the same number of fields in
              all lines, or the expected number of lines/fields. The
              number of lines and fields is printed to STDOUT. Exits with
              a non-zero code and prints the offending line if there is a
              mismatch in the number of lines/fields.

   Line-Filtering operations
       rmdup  remove lines with a duplicated key value

   Per-Line operations
       base64 Encode the field as base64 (see example below)

       debase64
              Decode the field as base64; exit with an error if the field
              is not a valid base64 string

       md5/sha1/sha256/sha512
              Calculate the md5/sha1/sha256/sha512 hash of the field value

       bin[:BUCKET-SIZE]
              bin numeric values into buckets of size BUCKET-SIZE
              (defaults to 100)

       strbin[:BUCKET-SIZE]
              hash the input and return a numeric integer value between
              zero and BUCKET-SIZE (defaults to 10)

       round/floor/ceil/trunc/frac
              numeric rounding operations: round (round half away from
              zero), floor (round down), ceil (ceiling, round up), trunc
              (truncate, round towards zero), frac (fraction, return the
              fractional part of a decimal-point value)
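
       For example, base64 and debase64 are inverses of each other (an
       illustrative sketch, assuming the field value is encoded as-is,
       without a trailing newline):

              $ echo 'hello' | datamash base64 1
              aGVsbG8=

              $ echo 'aGVsbG8=' | datamash debase64 1
              hello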

   Numeric Grouping operations
       sum    sum of the values

       min    minimum value

       max    maximum value

       absmin minimum of the absolute values

       absmax maximum of the absolute values

       range  the range of the values (max-min)

   Textual/Numeric Grouping operations
       count  count the number of elements in the group

       first  the first value of the group

       last   the last value of the group

       rand   one random value from the group

       unique comma-separated sorted list of unique values

       collapse
              comma-separated list of all input values

       countunique
              number of unique/distinct values
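
       For example, on input that is already grouped by field 1 (an
       illustrative sketch; the output fields are tab-separated):

              $ printf 'A\t10\nA\t10\nA\t5\nB\t9\n' \
                    | datamash -g 1 collapse 2 countunique 2
              A     10,10,5     2
              B     9           1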

   Statistical Grouping operations
       A p/s prefix indicates the variant: population or sample.
       Typically, the sample variant is equivalent to GNU R's built-in
       functions (e.g. datamash's sstdev operation is equivalent to R's
       sd() function).

       mean   mean of the values

       trimmean[:PERCENT]
              trimmed mean of the values. PERCENT should be between 0 and
              0.5. (trimmean:0 is equivalent to mean. trimmean:0.5 is
              equivalent to median.)
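
              For example, trimming before averaging discards the extreme
              values (an illustrative sketch, assuming trimmean:0.2
              removes floor(5 x 0.2) = 1 value from each end of the sorted
              group, matching R's mean(x, trim=0.2)):

                     $ printf '1\n2\n3\n4\n100\n' | datamash trimmean:0.2 1
                     3

                     $ printf '1\n2\n3\n4\n100\n' | datamash mean 1
                     22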

       median median value

       q1     1st quartile value

       q3     3rd quartile value

       iqr    inter-quartile range

       perc[:PERCENTILE]
              percentile value PERCENTILE (defaults to 95)

       mode   mode value (most common value)

       antimode
              anti-mode value (least common value)
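
              For example, with no ties between value counts (an
              illustrative sketch; the output fields are tab-separated):

                     $ printf '1\n2\n2\n3\n3\n3\n' | datamash mode 1 antimode 1
                     3     1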

       pstdev/sstdev
              population/sample standard deviation

       pvar/svar
              population/sample variance

       mad    median absolute deviation, scaled by the constant 1.4826
              for normal distributions

       madraw median absolute deviation, unscaled

       pskew/sskew
              skewness of the group
              values x reported by the 'sskew' and 'pskew' operations:
                     x > 0           -  positively skewed / skewed right
                     0 > x           -  negatively skewed / skewed left
                     x > 1           -  highly skewed right
                     1 > x > 0.5     -  moderately skewed right
                     0.5 > x > -0.5  -  approximately symmetric
                     -0.5 > x > -1   -  moderately skewed left
                     -1 > x          -  highly skewed left

       pkurt/skurt
              excess kurtosis of the group

       jarque/dpo
              p-value of the Jarque-Bera (jarque) and D'Agostino-Pearson
              Omnibus (dpo) tests for normality:
                     null hypothesis is normality;
                     low p-values indicate non-normal data;
                     high p-values indicate the null hypothesis cannot be
                     rejected.

       pcov/scov [X:Y]
              covariance of fields X and Y

       ppearson/spearson [X:Y]
              Pearson product-moment correlation coefficient [Pearson's R]
              of fields X and Y

EXAMPLES
   Basic usage
       Print the sum and the mean of values from field 1:

              $ seq 10 | datamash sum 1 mean 1
              55    5.5

       Group input based on field 1, and sum values (per group) on field 2:

              $ cat example.txt
              A     10
              A     5
              B     9
              B     11

              $ datamash -g 1 sum 2 < example.txt
              A     15
              B     20

              $ datamash groupby 1 sum 2 < example.txt
              A     15
              B     20

       If the input is not sorted, grouping requires the '-s' option:

              $ cat example.txt
              A     10
              C     4
              B     9
              C     1
              A     5
              B     11

              $ datamash -s -g1 sum 2 < example.txt
              A     15
              B     20
              C     5

       Which is equivalent to:

              $ cat example.txt | sort -k1,1 | datamash -g 1 sum 2

   Header lines
       Use -H (--headers) if the input file has a header line:

              # Given a file with student name, major, test score...
              $ head -n5 scores_h.txt
              Name           Major            Score
              Shawn          Engineering      47
              Caleb          Business         87
              Christian      Business         88
              Derek          Arts             60

              # Calculate the mean and standard deviation for each major
              $ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt

              (or use the short form)

              $ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt

              (or use named fields)

              $ datamash -sH -g Major mean Score pstdev Score < scores_h.txt
              GroupBy(Major)    mean(Score)    pstdev(Score)
              Arts              68.9           10.1
              Business          87.3            4.9
              Engineering       66.5           19.1
              Health-Medicine   90.6            8.8
              Life-Sciences     55.3           19.7
              Social-Sciences   60.2           16.6

   Multiple fields
       Use comma or dash to specify multiple fields. The following are
       equivalent:

              $ seq 9 | paste - - -
              1     2     3
              4     5     6
              7     8     9

              $ seq 9 | paste - - - | datamash sum 1 sum 2 sum 3
              12    15    18

              $ seq 9 | paste - - - | datamash sum 1,2,3
              12    15    18

              $ seq 9 | paste - - - | datamash sum 1-3
              12    15    18

   Rounding
       The following demonstrates the different rounding operations:

              $ ( echo X ; seq -1.25 0.25 1.25 ) \
                    | datamash --full -H round 1 ceil 1 floor 1 trunc 1 frac 1

              X      round(X)  ceil(X)  floor(X)  trunc(X)  frac(X)
              -1.25  -1        -1       -2        -1        -0.25
              -1.00  -1        -1       -1        -1         0
              -0.75  -1         0       -1         0        -0.75
              -0.50  -1         0       -1         0        -0.5
              -0.25   0         0       -1         0        -0.25
              0.00    0         0        0         0         0
              0.25    0         1        0         0         0.25
              0.50    1         1        0         0         0.5
              0.75    1         1        0         0         0.75
              1.00    1         1        1         1         0
              1.25    1         2        1         1         0.25

   Reversing fields
              $ seq 6 | paste - - | datamash reverse
              2     1
              4     3
              6     5

   Transposing a file
              $ seq 6 | paste - - | datamash transpose
              1     3     5
              2     4     6

   Removing duplicated lines
       Remove lines with a duplicate key value in field 1. Unlike the
       first/last operations, rmdup is much faster and does not require
       sorting the file with -s:

              # Given a list of files and sample IDs:
              $ cat INPUT
              SampleID   File
              2          cc.txt
              3          dd.txt
              1          ab.txt
              2          ee.txt
              3          ff.txt

              # Remove lines with duplicated Sample-ID (field 1):
              $ datamash rmdup 1 < INPUT

              # or use the named field:
              $ datamash -H rmdup SampleID < INPUT
              SampleID   File
              2          cc.txt
              3          dd.txt
              1          ab.txt

   Checksums
       Calculate the sha1 hash value of each TXT file name (the second
       field of sha1sum's output), after sha1sum calculates the sha1 value
       of each file's content:

              $ sha1sum *.txt | datamash -Wf sha1 2
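
       Hashes can also be computed for arbitrary field values. An
       illustrative sketch (assuming the field value is hashed as-is,
       without a trailing newline):

              $ echo 'hello' | datamash md5 1
              5d41402abc4b2a76b9719d911017c592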

   Check file structure
       Check the structure of the input file: ensure all lines have the
       same number of fields, or the expected number of lines/fields:

              $ seq 10 | paste - - | datamash check && echo ok || echo fail
              5 lines, 2 fields
              ok

              $ seq 13 | paste - - - | datamash check && echo ok || echo fail
              line 4 (3 fields):
                10      11      12
              line 5 (2 fields):
                13
              datamash: check failed: line 5 has 2 fields (previous line had 3)
              fail

              $ seq 10 | paste - - | datamash check 2 fields 5 lines
              5 lines, 2 fields

              $ seq 10 | paste - - | datamash check 4 fields
              line 1 (2 fields):
                1     2
              datamash: check failed: line 1 has 2 fields (expecting 4)

              $ seq 10 | paste - - | datamash check 7 lines
              datamash: check failed: input had 5 lines (expecting 7)

   Cross-tabulation
       Cross-tabulation compares the relationship between two fields.
       Given the following input file:

              $ cat input.txt
              a     x     3
              a     y     7
              b     x     21
              a     x     40

       Show cross-tabulation between the first field (a/b) and the second
       field (x/y) - counting how many times each pair appears (note:
       sorting is required):

              $ datamash -s crosstab 1,2 < input.txt
                    x     y
              a     2     1
              b     1     N/A

       An optional grouping operation can be used instead of counting:

              $ datamash -s crosstab 1,2 sum 3 < input.txt
                    x     y
              a     43    7
              b     21    N/A

              $ datamash -s crosstab 1,2 unique 3 < input.txt
                    x     y
              a     3,40  7
              b     21    N/A

   Binning numeric values
       Bin input values into buckets of size 5:

              $ ( echo X ; seq -10 2.5 10 ) \
                    | datamash -H --full bin:5 1
              X      bin(X)
              -10.0  -15
              -7.5   -10
              -5.0   -10
              -2.5    -5
              0.0      0
              2.5      0
              5.0      5
              7.5      5
              10.0    10

   Binning string values
       Hash any input value into a numeric integer. A typical usage would
       be to split an input file into N chunks, ensuring that all values
       of a certain key will be stored in the same chunk:

              $ cat input.txt
              PatientA   10
              PatientB   11
              PatientC   12
              PatientA   14
              PatientC   15

       Each patient ID is hashed into a bin between 0 and 9 and printed in
       the last field:

              $ datamash --full strbin 1 < input.txt
              PatientA   10   5
              PatientB   11   6
              PatientC   12   7
              PatientA   14   5
              PatientC   15   7

       Splitting the input into chunks can be done with awk:

              $ cat input.txt \
                    | datamash --full strbin 1 \
                    | awk '{print > $NF ".txt"}'

ADDITIONAL INFORMATION
       See the GNU Datamash website (https://www.gnu.org/software/datamash)

AUTHOR
       Written by Assaf Gordon.

COPYRIGHT
       Copyright © 2018 Assaf Gordon. License GPLv3+: GNU GPL version 3 or
       later <https://gnu.org/licenses/gpl.html>.
       This is free software: you are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       The full documentation for datamash is maintained as a Texinfo
       manual. If the info and datamash programs are properly installed at
       your site, the command

              info datamash

       should give you access to the complete manual.



datamash 1.3                      March 2018                      DATAMASH(1)