DATAMASH(1)                      User Commands                     DATAMASH(1)

NAME
       datamash - command-line calculations

SYNOPSIS
       datamash [OPTION] op [fld] [op fld ...]

DESCRIPTION
       Performs numeric/string operations on input from stdin.

       'op' is the operation to perform.  If a primary operation is used, it
       must be listed first, optionally followed by other operations.  'fld'
       is the input field to use.  'fld' can be a number (1=first field), or
       a field name when using the -H or --header-in options.  Multiple
       fields can be listed with a comma (e.g. 1,6,8).  A range of fields
       can be listed with a dash (e.g. 2-8).  Use colons for operations
       which require a pair of fields (e.g. 'pcov 2:6').

       Primary operations:
              groupby, crosstab, transpose, reverse, check

       Line-Filtering operations:
              rmdup

       Per-Line operations:
              base64, debase64, md5, sha1, sha224, sha256, sha384, sha512,
              bin, strbin, round, floor, ceil, trunc, frac, dirname,
              basename, barename, extname, getnum, cut

       Numeric Grouping operations:
              sum, min, max, absmin, absmax, range

       Textual/Numeric Grouping operations:
              count, first, last, rand, unique, collapse, countunique

       Statistical Grouping operations:
              mean, geomean, harmmean, trimmean, median, q1, q3, iqr, perc,
              mode, antimode, pstdev, sstdev, pvar, svar, ms, rms, mad,
              madraw, pskew, sskew, pkurt, skurt, dpo, jarque, scov, pcov,
              spearson, ppearson

OPTIONS
   Grouping Options:
       -C, --skip-comments
              skip comment lines (starting with '#' or ';' and optional
              whitespace)

       -f, --full
              print entire input line before op results (default: print
              only the grouped keys)

              This option is only sensible for line-wise operations.  Other
              uses are deprecated and will be removed in a future version
              of GNU Datamash.

       -g, --group=X[,Y,Z]
              group via fields X,[Y,Z]; equivalent to primary operation
              'groupby'

       --header-in
              first input line is column headers

       --header-out
              print column headers as first line

       -H, --headers
              same as '--header-in --header-out'

       -i, --ignore-case
              ignore upper/lower case when comparing text; this affects
              grouping and string operations

       -s, --sort
              sort the input before grouping; this removes the need to
              manually pipe the input through 'sort'

       -c, --collapse-delimiter=X
              use X to separate elements in collapse and unique lists
              (default: comma)
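
       A brief sketch of this option (the sample values below are invented
       for illustration):

       ```shell
       # Collapse all values of field 1 into one ';'-separated list
       # instead of the default comma-separated one.
       printf 'A\nB\nA\n' | datamash -c ';' collapse 1
       ```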

   File Operation Options:
       --no-strict
              allow lines with varying number of fields

       --filler=X
              fill missing values with X (default: N/A)

   General Options:
       -t, --field-separator=X
              use X instead of TAB as field delimiter

       --format=FORMAT
              print numeric values with printf-style floating-point FORMAT
       --output-delimiter=X
              use X as the output field delimiter (default: use the same
              delimiter as -t/-W)

       --narm skip NA/NaN values
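
       For instance (a sketch with invented values; without --narm the NA
       value would propagate into the result):

       ```shell
       # NA is skipped, so the mean is computed over 1 and 3 only.
       printf '1\nNA\n3\n' | datamash --narm mean 1
       ```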

       -R, --round=N
              round numeric output to N decimal places

       -W, --whitespace
              use whitespace (one or more spaces and/or tabs) as field
              delimiter

       -z, --zero-terminated
              end lines with a 0 byte, not newline

       --sort-cmd=/path/to/sort
              alternative sort(1) to use

       --help display this help and exit

       --version
              output version information and exit

AVAILABLE OPERATIONS
   Primary Operations
       Primary operations affect the way the file is processed.  If used,
       the primary operation must be listed first.  If no primary operation
       is listed, the entire file is processed - either line-by-line (for
       'per-line' operations) or all lines as one group (for grouping
       operations).  See the EXAMPLES section below.

       groupby X,Y,... op fld ...
              group the file by the given fields.  Equivalent to option
              '-g'.  For each group, perform operation op on field fld.

       crosstab X,Y [op fld ...]
              cross-tabulate a file by two fields (cross-tabulation is also
              known as pivot tables).  If no operation is specified, counts
              how many incidents exist of X,Y.

       transpose
              transpose rows, columns of the input file

       reverse
              reverse field order in each line

       check [N lines] [N fields]
              verify the input file has the same number of fields in all
              lines, or the expected number of lines/fields.  The number of
              lines and fields is printed to STDOUT.  Exits with a non-zero
              code and prints the offending line if there is a mismatch in
              the number of lines/fields.

   Line-Filtering operations
       rmdup  remove lines with a duplicated key value

   Per-Line operations
       base64 Encode the field as base64

       debase64
              Decode the field as base64; exit with an error if the field
              is not a valid base64 string
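
       A round-trip sketch of these two operations (the value "hello" is an
       arbitrary example):

       ```shell
       # Encode field 1 as base64, then decode the result back.
       echo hello | datamash base64 1
       echo aGVsbG8= | datamash debase64 1
       ```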

       md5/sha1/sha224/sha256/sha384/sha512
              Calculate the md5/sha1/sha224/sha256/sha384/sha512 hash of
              the field value
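
       For example (the field value itself is hashed, without a trailing
       newline):

       ```shell
       # md5 of the literal field value "hello".
       echo hello | datamash md5 1
       ```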

       bin[:BUCKET-SIZE]
              bin numeric values into buckets of size BUCKET-SIZE (defaults
              to 100)

       strbin[:BUCKET-SIZE]
              hash the input and return a numeric integer value between
              zero and BUCKET-SIZE (defaults to 10)

       round/floor/ceil/trunc/frac
              numeric rounding operations: round (round half away from
              zero), floor (round down), ceil (ceiling, round up), trunc
              (truncate, round towards zero), frac (fraction, return the
              fractional part of a decimal-point value)

       dirname/basename
              extract the directory name and the base file name from a
              given string (same as dirname(1) and basename(1))

       extname
              extract the extension of the file name (without the '.')

       barename
              extract the base file name without the extension

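       A sketch combining the four path operations on one invented path
       (output fields are tab-separated by default):

       ```shell
       # Split one path into directory, file name, bare name and extension.
       echo /tmp/report.txt | datamash dirname 1 basename 1 barename 1 extname 1
       ```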

       getnum[:TYPE]
              extract a number from the field.  TYPE is an optional single
              letter: n/i/d/p/h/o (see examples below)

       cut/echo
              copy the input field to the output field (similar to cut(1)).
              The echo command is simply an alias for cut.

   Numeric Grouping operations
       sum    sum of the values

       min    minimum value

       max    maximum value

       absmin minimum of the absolute values

       absmax maximum of the absolute values

       range  the range of the values (max-min)
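
       A sketch of several numeric grouping operations at once; with no
       grouping specified, the whole input forms a single group:

       ```shell
       # sum, min, max and range of the values 1..10 (tab-separated output).
       seq 10 | datamash sum 1 min 1 max 1 range 1
       ```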

   Textual/Numeric Grouping operations
       count  count the number of elements in the group

       first  the first value of the group

       last   the last value of the group

       rand   one random value from the group

       unique/uniq
              comma-separated sorted list of unique values.  The uniq
              command is simply an alias for unique.

       collapse
              comma-separated list of all input values

       countunique
              number of unique/distinct values
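
       A sketch with invented values, treating the whole input as one
       group:

       ```shell
       # count all elements, list the distinct ones, and count the distinct ones.
       printf 'A\nB\nA\n' | datamash count 1 unique 1 countunique 1
       ```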

   Statistical Grouping operations
       A p/s prefix indicates the variant: population or sample.
       Typically, the sample variant is equivalent to GNU R's internal
       functions (e.g. datamash's sstdev operation is equivalent to R's
       sd() function).

       mean   mean of the values

       geomean
              geometric mean of the values

       harmmean
              harmonic mean of the values

       trimmean[:PERCENT]
              trimmed mean of the values.  PERCENT should be between 0 and
              0.5 (trimmean:0 is equivalent to mean; trimmean:0.5 is
              equivalent to median)
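
       A sketch of the two boundary cases stated above (no output is shown,
       since how the trimmed count is rounded for other PERCENT values is
       implementation-defined):

       ```shell
       # trimmean:0 behaves like mean; trimmean:0.5 behaves like median.
       seq 10 | datamash trimmean:0 1
       seq 10 | datamash trimmean:0.5 1
       ```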

       ms     mean square of the values

       rms    root mean square of the values

       median median value

       q1     1st quartile value

       q3     3rd quartile value

       iqr    inter-quartile range

       perc[:PERCENTILE]
              percentile value PERCENTILE (defaults to 95)

       mode   mode value (most common value)

       antimode
              anti-mode value (least common value)

       pstdev/sstdev
              population/sample standard deviation

       pvar/svar
              population/sample variance

       mad    median absolute deviation, scaled by the constant 1.4826 for
              normal distributions

       madraw median absolute deviation, unscaled

       pskew/sskew
              skewness of the group
              values x reported by the 'sskew' and 'pskew' operations:
                     x > 0          -  positively skewed / skewed right
                     0 > x          -  negatively skewed / skewed left
                     x > 1          -  highly skewed right
                     1 > x > 0.5    -  moderately skewed right
                     0.5 > x > -0.5 -  approximately symmetric
                     -0.5 > x > -1  -  moderately skewed left
                     -1 > x         -  highly skewed left

       pkurt/skurt
              excess kurtosis of the group

       jarque/dpo
              p-value of the Jarque-Bera (jarque) and D'Agostino-Pearson
              Omnibus (dpo) tests for normality:
                     the null hypothesis is normality;
                     low p-values indicate non-normal data;
                     high p-values indicate the null hypothesis cannot be
                     rejected.

       pcov/scov [X:Y]
              covariance of fields X and Y

       ppearson/spearson [X:Y]
              Pearson product-moment correlation coefficient [Pearson's R]
              of fields X and Y
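
       A sketch with invented data: field 2 is exactly twice field 1, so
       the correlation is mathematically perfect, and the population and
       sample variants of Pearson's R coincide:

       ```shell
       # Perfectly linear pair of fields; ppearson and spearson agree here.
       printf '1\t2\n2\t4\n3\t6\n' | datamash ppearson 1:2
       ```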

EXAMPLES
   Basic usage
       Print the sum and the mean of values from field 1:

       $ seq 10 | datamash sum 1 mean 1
       55  5.5

       Group input based on field 1, and sum values (per group) on field 2:

       $ cat example.txt
       A  10
       A  5
       B  9
       B  11

       $ datamash -g 1 sum 2 < example.txt
       A  15
       B  20

       $ datamash groupby 1 sum 2 < example.txt
       A  15
       B  20

       Unsorted input must be sorted (with '-s'):

       $ cat example.txt
       A  10
       C  4
       B  9
       C  1
       A  5
       B  11

       $ datamash -s -g1 sum 2 < example.txt
       A  15
       B  20
       C  5

       Which is equivalent to:

       $ cat example.txt | sort -k1,1 | datamash -g 1 sum 2

   Header lines
       Use -H (--headers) if the input file has a header line:

       # Given a file with student name, major, test score...
       $ head -n5 scores_h.txt
       Name        Major        Score
       Shawn       Engineering  47
       Caleb       Business     87
       Christian   Business     88
       Derek       Arts         60

       # Calculate the mean and standard deviation for each major
       $ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt

       (or use the short form)

       $ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt

       (or use named fields)

       $ datamash -sH -g Major mean Score pstdev Score < scores_h.txt
       GroupBy(Major)    mean(Score)   pstdev(Score)
       Arts              68.9          10.1
       Business          87.3          4.9
       Engineering       66.5          19.1
       Health-Medicine   90.6          8.8
       Life-Sciences     55.3          19.7
       Social-Sciences   60.2          16.6

       Field names must be escaped with a backslash if they start with a
       digit or contain special characters (dash/minus, colons, commas).
       Note the interplay between escaping with backslash and shell
       quoting.  The following equivalent commands sum the values of a
       field named "FOO-BAR":

       $ datamash -H sum FOO\\-BAR < input.txt
       $ datamash -H sum 'FOO\-BAR' < input.txt
       $ datamash -H sum "FOO\\-BAR" < input.txt

   Skipping comment lines
       Use -C (--skip-comments) to skip lines starting with '#' or ';'
       characters (and optional whitespace before them):

       $ cat in.txt
        #foo   3
       bar     5
       ;baz    7

       $ datamash sum 2 < in.txt
       15

       $ datamash -C sum 2 < in.txt
       5

   Multiple fields
       Use comma or dash to specify multiple fields.  The following are
       equivalent:

       $ seq 9 | paste - - -
       1   2   3
       4   5   6
       7   8   9

       $ seq 9 | paste - - - | datamash sum 1 sum 2 sum 3
       12  15  18

       $ seq 9 | paste - - - | datamash sum 1,2,3
       12  15  18

       $ seq 9 | paste - - - | datamash sum 1-3
       12  15  18

   Rounding
       The following demonstrates the different rounding operations:

       $ ( echo X ; seq -1.25 0.25 1.25 ) \
             | datamash --full -H round 1 ceil 1 floor 1 trunc 1 frac 1

       X      round(X)  ceil(X)  floor(X)  trunc(X)  frac(X)
       -1.25  -1        -1       -2        -1        -0.25
       -1.00  -1        -1       -1        -1        0
       -0.75  -1        0        -1        0         -0.75
       -0.50  -1        0        -1        0         -0.5
       -0.25  0         0        -1        0         -0.25
       0.00   0         0        0         0         0
       0.25   0         1        0         0         0.25
       0.50   1         1        0         0         0.5
       0.75   1         1        0         0         0.75
       1.00   1         1        1         1         0
       1.25   1         2        1         1         0.25

   Reversing fields
       $ seq 6 | paste - - | datamash reverse
       2  1
       4  3
       6  5

   Transposing a file
       $ seq 6 | paste - - | datamash transpose
       1  3  5
       2  4  6

   Removing duplicated lines
       Remove lines with a duplicate key value in field 1 (unlike the
       first/last operations, rmdup is much faster and does not require
       sorting the file with -s):

       # Given a list of files and sample IDs:
       $ cat INPUT
       SampleID  File
       2         cc.txt
       3         dd.txt
       1         ab.txt
       2         ee.txt
       3         ff.txt

       # Remove lines with a duplicated Sample-ID (field 1):
       $ datamash rmdup 1 < INPUT

       # or use the named field:
       $ datamash -H rmdup SampleID < INPUT
       SampleID  File
       2         cc.txt
       3         dd.txt
       1         ab.txt

   Checksums
       Calculate the sha1 hash of each TXT file's name, after sha1sum(1)
       has calculated the sha1 value of each file's content:

       $ sha1sum *.txt | datamash -Wf sha1 2

   Check file structure
       Check the structure of the input file: ensure all lines have the
       same number of fields, or the expected number of lines/fields:

       $ seq 10 | paste - - | datamash check && echo ok || echo fail
       5 lines, 2 fields
       ok

       $ seq 13 | paste - - - | datamash check && echo ok || echo fail
       line 4 (3 fields):
         10  11  12
       line 5 (2 fields):
         13
       datamash: check failed: line 5 has 2 fields (previous line had 3)
       fail

       $ seq 10 | paste - - | datamash check 2 fields 5 lines
       5 lines, 2 fields

       $ seq 10 | paste - - | datamash check 4 fields
       line 1 (2 fields):
         1  2
       datamash: check failed: line 1 has 2 fields (expecting 4)

       $ seq 10 | paste - - | datamash check 7 lines
       datamash: check failed: input had 5 lines (expecting 7)

   Cross-tabulation
       Cross-tabulation compares the relationship between two fields.
       Given the following input file:

       $ cat input.txt
       a  x  3
       a  y  7
       b  x  21
       a  x  40

       Show the cross-tabulation between the first field (a/b) and the
       second field (x/y) - counting how many times each pair appears
       (note: sorting is required):

       $ datamash -s crosstab 1,2 < input.txt
          x  y
       a  2  1
       b  1  N/A

       An optional grouping operation can be used instead of counting:

       $ datamash -s crosstab 1,2 sum 3 < input.txt
          x   y
       a  43  7
       b  21  N/A

       $ datamash -s crosstab 1,2 unique 3 < input.txt
          x     y
       a  3,40  7
       b  21    N/A

   Binning numeric values
       Bin input values into buckets of size 5:

       $ ( echo X ; seq -10 2.5 10 ) \
             | datamash -H --full bin:5 1
       X      bin(X)
       -10.0  -10
       -7.5   -10
       -5.0   -5
       -2.5   -5
       0.0    0
       2.5    0
       5.0    5
       7.5    5
       10.0   10

   Binning string values
       Hash any input value into a numeric integer.  A typical usage would
       be to split an input file into N chunks, ensuring that all values
       of a certain key will be stored in the same chunk:

       $ cat input.txt
       PatientA  10
       PatientB  11
       PatientC  12
       PatientA  14
       PatientC  15

       Each patient ID is hashed into a bin between 0 and 9 and printed in
       the last field:

       $ datamash --full strbin 1 < input.txt
       PatientA  10  5
       PatientB  11  6
       PatientC  12  7
       PatientA  14  5
       PatientC  15  7

       Splitting the input into chunks can be done with awk:

       $ cat input.txt \
           | datamash --full strbin 1 \
           | awk '{print > $NF ".txt"}'

   Extracting numbers with getnum
       The 'getnum' operation extracts a numeric value from the field:

       $ echo zoom-123.45xyz | datamash getnum 1
       123.45

       getnum accepts an optional single-letter TYPE option:
       getnum:n - natural numbers (positive integers, including zero)
       getnum:i - integers
       getnum:d - decimal point numbers
       getnum:p - positive decimal point numbers (this is the default)
       getnum:h - hex numbers
       getnum:o - octal numbers

       Examples:
       $ echo zoom-123.45xyz | datamash getnum 1
       123.45

       $ echo zoom-123.45xyz | datamash getnum:n 1
       123

       $ echo zoom-123.45xyz | datamash getnum:i 1
       -123

       $ echo zoom-123.45xyz | datamash getnum:d 1
       -123.45

       $ echo zoom-123.45xyz | datamash getnum:p 1
       123.45

       # Hex 0x123 = 291 decimal
       $ echo zoom-123.45xyz | datamash getnum:h 1
       291

       # Octal 0123 = 83 decimal
       $ echo zoom-123.45xyz | datamash getnum:o 1
       83

ADDITIONAL INFORMATION
       See the GNU Datamash website (https://www.gnu.org/software/datamash)

ENVIRONMENT
       LC_NUMERIC
              decimal-point character and thousands separator

AUTHOR
       Written by Assaf Gordon, Tim Rice, Shawn Wagner, Erik Auerswald.

COPYRIGHT
       Copyright © 2022 Assaf Gordon and Tim Rice.  License GPLv3+: GNU GPL
       version 3 or later <https://gnu.org/licenses/gpl.html>.
       This is free software: you are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       The full documentation for datamash is maintained as a Texinfo
       manual.  If the info and datamash programs are properly installed
       at your site, the command

              info datamash

       should give you access to the complete manual.

datamash 1.8-dirty                July 2022                        DATAMASH(1)