DATAMASH(1)                      User Commands                     DATAMASH(1)

NAME
       datamash - command-line calculations

SYNOPSIS
       datamash [OPTION] op [fld] [op fld ...]

DESCRIPTION
       Performs numeric/string operations on input from stdin.

       'op' is the operation to perform.  If a primary operation is used, it
       must be listed first, optionally followed by other operations.  'fld'
       is the input field to use.  'fld' can be a number (1=first field), or
       a field name when using the -H or --header-in options.  Multiple
       fields can be listed with a comma (e.g. 1,6,8).  A range of fields
       can be listed with a dash (e.g. 2-8).  Use colons for operations
       which require a pair of fields (e.g. 'pcov 2:6').

       Primary operations:
              groupby, crosstab, transpose, reverse, check

       Line-Filtering operations:
              rmdup

       Per-Line operations:
              base64, debase64, md5, sha1, sha224, sha256, sha384, sha512,
              bin, strbin, round, floor, ceil, trunc, frac, dirname,
              basename, barename, extname, getnum, cut

       Numeric Grouping operations:
              sum, min, max, absmin, absmax, range

       Textual/Numeric Grouping operations:
              count, first, last, rand, unique, collapse, countunique

       Statistical Grouping operations:
              mean, geomean, harmmean, trimmean, median, q1, q3, iqr, perc,
              mode, antimode, pstdev, sstdev, pvar, svar, ms, rms, mad,
              madraw, pskew, sskew, pkurt, skurt, dpo, jarque, scov, pcov,
              spearson, ppearson

OPTIONS
   Grouping Options:
       -C, --skip-comments
              skip comment lines (starting with '#' or ';' and optional
              whitespace)

       -f, --full
              print entire input line before op results (default: print
              only the grouped keys)

              This option is only sensible for line-wise operations.  Other
              uses are deprecated and will be removed in a future version
              of GNU Datamash.

       -g, --group=X[,Y,Z]
              group via fields X,[Y,Z]; equivalent to primary operation
              'groupby'

       --header-in
              first input line is column headers

       --header-out
              print column headers as first line

       -H, --headers
              same as '--header-in --header-out'

       -i, --ignore-case
              ignore upper/lower case when comparing text; this affects
              grouping and string operations

       -s, --sort
              sort the input before grouping; this removes the need to
              manually pipe the input through 'sort'

       -c, --collapse-delimiter=X
              use X to separate elements in collapse and unique lists
              (default: comma)
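
       A brief sketch of this option (the sample values below are invented
       for illustration):

       ```shell
       # Collapse all values of field 1 into one ';'-separated list
       # instead of the default comma-separated one.
       printf 'A\nB\nA\n' | datamash -c ';' collapse 1
       ```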

   File Operation Options:
       --no-strict
              allow lines with varying number of fields

       --filler=X
              fill missing values with X (default: N/A)

   General Options:
       -t, --field-separator=X
              use X instead of TAB as field delimiter

       --format=FORMAT
              print numeric values with printf-style floating-point FORMAT
       --output-delimiter=X
              use X as the output field delimiter (default: use the same
              delimiter as -t/-W)

       --narm skip NA/NaN values
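
       For instance (a sketch with invented values; without --narm the NA
       value would propagate into the result):

       ```shell
       # NA is skipped, so the mean is computed over 1 and 3 only.
       printf '1\nNA\n3\n' | datamash --narm mean 1
       ```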

       -R, --round=N
              round numeric output to N decimal places

       -W, --whitespace
              use whitespace (one or more spaces and/or tabs) as field
              delimiter

       -z, --zero-terminated
              end lines with a 0 byte, not newline

       --sort-cmd=/path/to/sort
              alternative sort(1) to use

       --help display this help and exit

       --version
              output version information and exit

AVAILABLE OPERATIONS
   Primary Operations
       Primary operations affect the way the file is processed.  If used,
       the primary operation must be listed first.  If no primary operation
       is listed, the entire file is processed - either line-by-line (for
       'per-line' operations) or all lines as one group (for grouping
       operations).  See the EXAMPLES section below.

       groupby X,Y,... op fld ...
              group the file by the given fields.  Equivalent to option
              '-g'.  For each group, perform operation op on field fld.

       crosstab X,Y [op fld ...]
              cross-tabulate a file by two fields (cross-tabulation is also
              known as pivot tables).  If no operation is specified, counts
              how many incidents exist of X,Y.

       transpose
              transpose rows, columns of the input file

       reverse
              reverse field order in each line

       check [N lines] [N fields]
              verify the input file has the same number of fields in all
              lines, or the expected number of lines/fields.  The number of
              lines and fields is printed to STDOUT.  Exits with a non-zero
              code and prints the offending line if there is a mismatch in
              the number of lines/fields.

   Line-Filtering operations
       rmdup  remove lines with a duplicated key value

   Per-Line operations
       base64 Encode the field as base64

       debase64
              Decode the field as base64; exit with an error if the field
              is not a valid base64 string
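
       A round-trip sketch of these two operations (the value "hello" is an
       arbitrary example):

       ```shell
       # Encode field 1 as base64, then decode the result back.
       echo hello | datamash base64 1
       echo aGVsbG8= | datamash debase64 1
       ```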

       md5/sha1/sha224/sha256/sha384/sha512
              Calculate the md5/sha1/sha224/sha256/sha384/sha512 hash of
              the field value
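
       For example (the field value itself is hashed, without a trailing
       newline):

       ```shell
       # md5 of the literal field value "hello".
       echo hello | datamash md5 1
       ```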

       bin[:BUCKET-SIZE]
              bin numeric values into buckets of size BUCKET-SIZE (defaults
              to 100)

       strbin[:BUCKET-SIZE]
              hash the input and return a numeric integer value between
              zero and BUCKET-SIZE (defaults to 10)

       round/floor/ceil/trunc/frac
              numeric rounding operations: round (round half away from
              zero), floor (round down), ceil (ceiling, round up), trunc
              (truncate, round towards zero), frac (fraction, return the
              fractional part of a decimal-point value)

       dirname/basename
              extract the directory name and the base file name from a
              given string (same as dirname(1) and basename(1))

       extname
              extract the extension of the file name (without the '.')

       barename
              extract the base file name without the extension

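       A sketch combining the four path operations on one invented path
       (output fields are tab-separated by default):

       ```shell
       # Split one path into directory, file name, bare name and extension.
       echo /tmp/report.txt | datamash dirname 1 basename 1 barename 1 extname 1
       ```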

       getnum[:TYPE]
              extract a number from the field.  TYPE is an optional single
              letter: n/i/d/p/h/o (see examples below)

       cut/echo
              copy the input field to the output field (similar to cut(1)).
              The echo command is simply an alias for cut.

   Numeric Grouping operations
       sum    sum of the values

       min    minimum value

       max    maximum value

       absmin minimum of the absolute values

       absmax maximum of the absolute values

       range  the range of the values (max-min)
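
       A sketch of several numeric grouping operations at once; with no
       grouping specified, the whole input forms a single group:

       ```shell
       # sum, min, max and range of the values 1..10 (tab-separated output).
       seq 10 | datamash sum 1 min 1 max 1 range 1
       ```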

   Textual/Numeric Grouping operations
       count  count the number of elements in the group

       first  the first value of the group

       last   the last value of the group

       rand   one random value from the group

       unique/uniq
              comma-separated sorted list of unique values.  The uniq
              command is simply an alias for unique.

       collapse
              comma-separated list of all input values

       countunique
              number of unique/distinct values
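
       A sketch with invented values, treating the whole input as one
       group:

       ```shell
       # count all elements, list the distinct ones, and count the distinct ones.
       printf 'A\nB\nA\n' | datamash count 1 unique 1 countunique 1
       ```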

   Statistical Grouping operations
       A p/s prefix indicates the variant: population or sample.
       Typically, the sample variant is equivalent to GNU R's internal
       functions (e.g. datamash's sstdev operation is equivalent to R's
       sd() function).

       mean   mean of the values

       geomean
              geometric mean of the values

       harmmean
              harmonic mean of the values

       trimmean[:PERCENT]
              trimmed mean of the values.  PERCENT should be between 0 and
              0.5 (trimmean:0 is equivalent to mean; trimmean:0.5 is
              equivalent to median)
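
       A sketch of the two boundary cases stated above (no output is shown,
       since how the trimmed count is rounded for other PERCENT values is
       implementation-defined):

       ```shell
       # trimmean:0 behaves like mean; trimmean:0.5 behaves like median.
       seq 10 | datamash trimmean:0 1
       seq 10 | datamash trimmean:0.5 1
       ```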

       ms     mean square of the values

       rms    root mean square of the values

       median median value

       q1     1st quartile value

       q3     3rd quartile value

       iqr    inter-quartile range

       perc[:PERCENTILE]
              percentile value PERCENTILE (defaults to 95)

       mode   mode value (most common value)

       antimode
              anti-mode value (least common value)

       pstdev/sstdev
              population/sample standard deviation

       pvar/svar
              population/sample variance

       mad    median absolute deviation, scaled by the constant 1.4826 for
              normal distributions

       madraw median absolute deviation, unscaled

       pskew/sskew
              skewness of the group
              values x reported by the 'sskew' and 'pskew' operations:
                     x > 0          -  positively skewed / skewed right
                     0 > x          -  negatively skewed / skewed left
                     x > 1          -  highly skewed right
                     1 > x > 0.5    -  moderately skewed right
                     0.5 > x > -0.5 -  approximately symmetric
                     -0.5 > x > -1  -  moderately skewed left
                     -1 > x         -  highly skewed left

       pkurt/skurt
              excess kurtosis of the group

       jarque/dpo
              p-value of the Jarque-Bera (jarque) and D'Agostino-Pearson
              Omnibus (dpo) tests for normality:
                     the null hypothesis is normality;
                     low p-values indicate non-normal data;
                     high p-values indicate the null hypothesis cannot be
                     rejected.

       pcov/scov [X:Y]
              covariance of fields X and Y

       ppearson/spearson [X:Y]
              Pearson product-moment correlation coefficient [Pearson's R]
              of fields X and Y
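
       A sketch with invented data: field 2 is exactly twice field 1, so
       the correlation is mathematically perfect, and the population and
       sample variants of Pearson's R coincide:

       ```shell
       # Perfectly linear pair of fields; ppearson and spearson agree here.
       printf '1\t2\n2\t4\n3\t6\n' | datamash ppearson 1:2
       ```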

EXAMPLES
   Basic usage
       Print the sum and the mean of values from field 1:

       $ seq 10 | datamash sum 1 mean 1
       55  5.5

       Group input based on field 1, and sum values (per group) on field 2:

       $ cat example.txt
       A  10
       A  5
       B  9
       B  11

       $ datamash -g 1 sum 2 < example.txt
       A  15
       B  20

       $ datamash groupby 1 sum 2 < example.txt
       A  15
       B  20

       Unsorted input must be sorted (with '-s'):

       $ cat example.txt
       A  10
       C  4
       B  9
       C  1
       A  5
       B  11

       $ datamash -s -g1 sum 2 < example.txt
       A  15
       B  20
       C  5

       Which is equivalent to:

       $ cat example.txt | sort -k1,1 | datamash -g 1 sum 2

   Header lines
       Use -H (--headers) if the input file has a header line:

       # Given a file with student name, major, test score...
       $ head -n5 scores_h.txt
       Name        Major        Score
       Shawn       Engineering  47
       Caleb       Business     87
       Christian   Business     88
       Derek       Arts         60

       # Calculate the mean and standard deviation for each major
       $ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt

       (or use the short form)

       $ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt

       (or use named fields)

       $ datamash -sH -g Major mean Score pstdev Score < scores_h.txt
       GroupBy(Major)    mean(Score)   pstdev(Score)
       Arts              68.9          10.1
       Business          87.3          4.9
       Engineering       66.5          19.1
       Health-Medicine   90.6          8.8
       Life-Sciences     55.3          19.7
       Social-Sciences   60.2          16.6

       Field names must be escaped with a backslash if they start with a
       digit or contain special characters (dash/minus, colons, commas).
       Note the interplay between escaping with backslash and shell
       quoting.  The following equivalent commands sum the values of a
       field named "FOO-BAR":

       $ datamash -H sum FOO\\-BAR < input.txt
       $ datamash -H sum 'FOO\-BAR' < input.txt
       $ datamash -H sum "FOO\\-BAR" < input.txt

   Skipping comment lines
       Use -C (--skip-comments) to skip lines starting with '#' or ';'
       characters (and optional whitespace before them):

       $ cat in.txt
        #foo   3
       bar     5
       ;baz    7

       $ datamash sum 2 < in.txt
       15

       $ datamash -C sum 2 < in.txt
       5

   Multiple fields
       Use comma or dash to specify multiple fields.  The following are
       equivalent:

       $ seq 9 | paste - - -
       1   2   3
       4   5   6
       7   8   9

       $ seq 9 | paste - - - | datamash sum 1 sum 2 sum 3
       12  15  18

       $ seq 9 | paste - - - | datamash sum 1,2,3
       12  15  18

       $ seq 9 | paste - - - | datamash sum 1-3
       12  15  18

   Rounding
       The following demonstrates the different rounding operations:

       $ ( echo X ; seq -1.25 0.25 1.25 ) \
             | datamash --full -H round 1 ceil 1 floor 1 trunc 1 frac 1

       X      round(X)  ceil(X)  floor(X)  trunc(X)  frac(X)
       -1.25  -1        -1       -2        -1        -0.25
       -1.00  -1        -1       -1        -1        0
       -0.75  -1        0        -1        0         -0.75
       -0.50  -1        0        -1        0         -0.5
       -0.25  0         0        -1        0         -0.25
       0.00   0         0        0         0         0
       0.25   0         1        0         0         0.25
       0.50   1         1        0         0         0.5
       0.75   1         1        0         0         0.75
       1.00   1         1        1         1         0
       1.25   1         2        1         1         0.25

   Reversing fields
       $ seq 6 | paste - - | datamash reverse
       2  1
       4  3
       6  5

   Transposing a file
       $ seq 6 | paste - - | datamash transpose
       1  3  5
       2  4  6

   Removing duplicated lines
       Remove lines with a duplicate key value in field 1 (unlike the
       first/last operations, rmdup is much faster and does not require
       sorting the file with -s):

       # Given a list of files and sample IDs:
       $ cat INPUT
       SampleID  File
       2         cc.txt
       3         dd.txt
       1         ab.txt
       2         ee.txt
       3         ff.txt

       # Remove lines with a duplicated Sample-ID (field 1):
       $ datamash rmdup 1 < INPUT

       # or use the named field:
       $ datamash -H rmdup SampleID < INPUT
       SampleID  File
       2         cc.txt
       3         dd.txt
       1         ab.txt

   Checksums
       Calculate the sha1 hash of each TXT file's name, after sha1sum(1)
       has calculated the sha1 value of each file's content:

       $ sha1sum *.txt | datamash -Wf sha1 2

   Check file structure
       Check the structure of the input file: ensure all lines have the
       same number of fields, or the expected number of lines/fields:

       $ seq 10 | paste - - | datamash check && echo ok || echo fail
       5 lines, 2 fields
       ok

       $ seq 13 | paste - - - | datamash check && echo ok || echo fail
       line 4 (3 fields):
         10  11  12
       line 5 (2 fields):
         13
       datamash: check failed: line 5 has 2 fields (previous line had 3)
       fail

       $ seq 10 | paste - - | datamash check 2 fields 5 lines
       5 lines, 2 fields

       $ seq 10 | paste - - | datamash check 4 fields
       line 1 (2 fields):
         1  2
       datamash: check failed: line 1 has 2 fields (expecting 4)

       $ seq 10 | paste - - | datamash check 7 lines
       datamash: check failed: input had 5 lines (expecting 7)

   Cross-tabulation
       Cross-tabulation compares the relationship between two fields.
       Given the following input file:

       $ cat input.txt
       a  x  3
       a  y  7
       b  x  21
       a  x  40

       Show the cross-tabulation between the first field (a/b) and the
       second field (x/y) - counting how many times each pair appears
       (note: sorting is required):

       $ datamash -s crosstab 1,2 < input.txt
          x  y
       a  2  1
       b  1  N/A

       An optional grouping operation can be used instead of counting:

       $ datamash -s crosstab 1,2 sum 3 < input.txt
          x   y
       a  43  7
       b  21  N/A

       $ datamash -s crosstab 1,2 unique 3 < input.txt
          x     y
       a  3,40  7
       b  21    N/A

   Binning numeric values
       Bin input values into buckets of size 5:

       $ ( echo X ; seq -10 2.5 10 ) \
             | datamash -H --full bin:5 1
       X      bin(X)
       -10.0  -10
       -7.5   -10
       -5.0   -5
       -2.5   -5
       0.0    0
       2.5    0
       5.0    5
       7.5    5
       10.0   10

   Binning string values
       Hash any input value into a numeric integer.  A typical usage would
       be to split an input file into N chunks, ensuring that all values
       of a certain key will be stored in the same chunk:

       $ cat input.txt
       PatientA  10
       PatientB  11
       PatientC  12
       PatientA  14
       PatientC  15

       Each patient ID is hashed into a bin between 0 and 9 and printed in
       the last field:

       $ datamash --full strbin 1 < input.txt
       PatientA  10  5
       PatientB  11  6
       PatientC  12  7
       PatientA  14  5
       PatientC  15  7

       Splitting the input into chunks can be done with awk:

       $ cat input.txt \
           | datamash --full strbin 1 \
           | awk '{print > $NF ".txt"}'

   Extracting numbers with getnum
       The 'getnum' operation extracts a numeric value from the field:

       $ echo zoom-123.45xyz | datamash getnum 1
       123.45

       getnum accepts an optional single-letter TYPE option:
       getnum:n - natural numbers (positive integers, including zero)
       getnum:i - integers
       getnum:d - decimal point numbers
       getnum:p - positive decimal point numbers (this is the default)
       getnum:h - hex numbers
       getnum:o - octal numbers

       Examples:
       $ echo zoom-123.45xyz | datamash getnum 1
       123.45

       $ echo zoom-123.45xyz | datamash getnum:n 1
       123

       $ echo zoom-123.45xyz | datamash getnum:i 1
       -123

       $ echo zoom-123.45xyz | datamash getnum:d 1
       -123.45

       $ echo zoom-123.45xyz | datamash getnum:p 1
       123.45

       # Hex 0x123 = 291 decimal
       $ echo zoom-123.45xyz | datamash getnum:h 1
       291

       # Octal 0123 = 83 decimal
       $ echo zoom-123.45xyz | datamash getnum:o 1
       83

ADDITIONAL INFORMATION
       See the GNU Datamash website (https://www.gnu.org/software/datamash)

ENVIRONMENT
       LC_NUMERIC
              decimal-point character and thousands separator

AUTHOR
       Written by Assaf Gordon, Tim Rice, Shawn Wagner, Erik Auerswald.

COPYRIGHT
       Copyright © 2022 Assaf Gordon and Tim Rice.  License GPLv3+: GNU GPL
       version 3 or later <https://gnu.org/licenses/gpl.html>.
       This is free software: you are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       The full documentation for datamash is maintained as a Texinfo
       manual.  If the info and datamash programs are properly installed
       at your site, the command

              info datamash

       should give you access to the complete manual.

datamash 1.8-dirty                July 2022                        DATAMASH(1)