DATAMASH(1)                      User Commands                     DATAMASH(1)



NAME
       datamash - command-line calculations

SYNOPSIS
       datamash [OPTION] op [fld] [op fld ...]

DESCRIPTION
       Performs numeric/string operations on input from stdin.

       'op' is the operation to perform.  If a primary operation is used, it
       must be listed first, optionally followed by other operations.  'fld'
       is the input field to use.  'fld' can be a number (1=first field), or
       a field name when using the -H or --header-in options.  Multiple
       fields can be listed with a comma (e.g. 1,6,8).  A range of fields
       can be listed with a dash (e.g. 2-8).  Use colons for operations
       which require a pair of fields (e.g. 'pcov 2:6').

       Primary operations:
              groupby, crosstab, transpose, reverse, check

       Line-Filtering operations:
              rmdup

       Per-Line operations:
              base64, debase64, md5, sha1, sha224, sha256, sha384, sha512,
              bin, strbin, round, floor, ceil, trunc, frac, dirname,
              basename, barename, extname, getnum, cut

       Numeric Grouping operations:
              sum, min, max, absmin, absmax, range

       Textual/Numeric Grouping operations:
              count, first, last, rand, unique, collapse, countunique

       Statistical Grouping operations:
              mean, geomean, harmmean, trimmean, median, q1, q3, iqr, perc,
              mode, antimode, pstdev, sstdev, pvar, svar, mad, madraw,
              pskew, sskew, pkurt, skurt, dpo, jarque, scov, pcov,
              spearson, ppearson

   Grouping Options:
       -C, --skip-comments
              skip comment lines (starting with '#' or ';' and optional
              whitespace)

       -f, --full
              print entire input line before op results (default: print
              only the grouped keys)

       -g, --group=X[,Y,Z]
              group via fields X,[Y,Z]; equivalent to primary operation
              'groupby'

       --header-in
              first input line is column headers

       --header-out
              print column headers as first line

       -H, --headers
              same as '--header-in --header-out'

       -i, --ignore-case
              ignore upper/lower case when comparing text; this affects
              grouping and string operations

       -s, --sort
              sort the input before grouping; this removes the need to
              manually pipe the input through 'sort'

   File Operation Options:
       --no-strict
              allow lines with varying number of fields

       --filler=X
              fill missing values with X (default: N/A)

   General Options:
       -t, --field-separator=X
              use X instead of TAB as field delimiter

       --format=FORMAT
              print numeric values with printf-style floating-point FORMAT
              (see the combined example below)

       --output-delimiter=X
              use X as output field delimiter (default: use same delimiter
              as -t/-W)

       --narm skip NA/NaN values

       -R, --round=N
              round numeric output to N decimal places

       -W, --whitespace
              use whitespace (one or more spaces and/or tabs) for field
              delimiters

       -z, --zero-terminated
              end lines with 0 byte, not newline

       --help display this help and exit

       --version
              output version information and exit

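
       As a combined illustration of -W, --narm and --format (a minimal
       sketch; the two-line input below is made up for demonstration):

       $ printf '1.0 2\nNA 4\n' | datamash -W --narm --format "%.2f" sum 1 sum 2
       1.00    6.00
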
AVAILABLE OPERATIONS
   Primary Operations
       Primary operations affect the way the file is processed.  If used,
       the primary operation must be listed first.  Some operations require
       field numbers (groupby, crosstab) while others do not (reverse,
       check, transpose).  If a primary operation is not listed, the entire
       file is processed - either line-by-line (for 'per-line' operations)
       or all lines as one group (for grouping operations).  See the
       Examples section below.

       groupby X,Y,... op fld ...
                   group the file by the given fields.  Equivalent to
                   option '-g'.  For each group perform operation op on
                   field fld.


       crosstab X,Y [op fld ...]
                   cross-tabulate a file by two fields (cross-tabulation is
                   also known as pivot tables).  If no operation is
                   specified, counts how many times each X,Y pair occurs.


       transpose   transpose rows and columns of the input file


       reverse     reverse field order in each line


       check [N lines] [N fields]
                   verify the input file has the same number of fields in
                   all lines, or the expected number of lines/fields.
                   Number of lines and fields are printed to STDOUT.  Exits
                   with non-zero code and prints the offending line if
                   there's a mismatch in the number of lines/fields.

   Line-Filtering operations
       rmdup       remove lines with duplicated key value

   Per-Line operations
       base64      Encode the field as base64


       debase64    Decode the field as base64; exit with an error if the
                   field is not a valid base64 string

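
       For example (a sketch; 'aGVsbG8=' is simply the base64 encoding of
       'hello'):

       $ echo hello | datamash base64 1
       aGVsbG8=

       $ echo aGVsbG8= | datamash debase64 1
       hello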

       md5/sha1/sha224/sha256/sha384/sha512
                   Calculate the md5/sha1/sha224/sha256/sha384/sha512 hash
                   of the field value

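
       For example, the well-known md5 digest of the string 'hello' (a
       sketch):

       $ echo hello | datamash md5 1
       5d41402abc4b2a76b9719d911017c592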

       bin[:BUCKET-SIZE]
                   bin numeric values into buckets of size BUCKET-SIZE
                   (defaults to 100).


       strbin[:BUCKET-SIZE]
                   hashes the input and returns a numeric integer value
                   between zero and BUCKET-SIZE (defaults to 10).


       round/floor/ceil/trunc/frac
                   numeric rounding operations: round (round half away
                   from zero), floor (round down), ceil (ceiling, round
                   up), trunc (truncate, round towards zero), frac
                   (fraction, return the fraction part of a decimal-point
                   value).


       dirname/basename
                   extract the directory name and the base file name from a
                   given string (same as dirname(1) and basename(1)).


       extname     extract the extension of the file name (without the '.').


       barename    extract the base file name without the extension.

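
       For example (a sketch; the path below is arbitrary), the four path
       operations can be combined on a single input field:

       $ echo /usr/share/doc/readme.txt \
           | datamash dirname 1 basename 1 extname 1 barename 1
       /usr/share/doc   readme.txt   txt   readme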

       getnum[:TYPE]
                   extract a number from the field.  TYPE is an optional
                   single-letter option n/i/d/p/h/o (see examples below).


       cut         copy input field to output field (similar to cut(1)).

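
       A minimal sketch (the three-field input line is made up); unlike
       cut(1), the fields are printed in the order they are listed:

       $ printf 'a\tb\tc\n' | datamash cut 3,1
       c   a
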
   Numeric Grouping operations
       sum         sum the values


       min         minimum value


       max         maximum value


       absmin      minimum of the absolute values


       absmax      maximum of the absolute values


       range       the values range (max-min)

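
       For example, with no grouping all input lines are treated as a
       single group:

       $ seq 10 | datamash sum 1 min 1 max 1 range 1
       55   1   10   9
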
   Textual/Numeric Grouping operations
       count       count number of elements in the group


       first       the first value of the group


       last        the last value of the group


       rand        one random value from the group


       unique      comma-separated sorted list of unique values


       collapse    comma-separated list of all input values


       countunique number of unique/distinct values

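
       For example, grouping a small made-up whitespace-separated input by
       field 1 (a sketch):

       $ printf 'A 10\nA 5\nB 9\nB 11\n' | datamash -W -g 1 count 2 collapse 2
       A   2   10,5
       B   2   9,11
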
   Statistical Grouping operations
       A p/s prefix indicates the variant: population or sample.
       Typically, the sample variant is equivalent to GNU R's built-in
       functions (e.g., datamash's sstdev operation is equivalent to R's
       sd() function).

       mean        mean of the values


       geomean     geometric mean of the values


       harmmean    harmonic mean of the values

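
       For example (a sketch; --format is used only to make the output
       width predictable):

       $ printf '1\n2\n4\n' | datamash --format "%.3f" mean 1 geomean 1 harmmean 1
       2.333   2.000   1.714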

       trimmean[:PERCENT]
                   trimmed mean of the values.  PERCENT should be between 0
                   and 0.5.  (trimmean:0 is equivalent to mean.
                   trimmean:0.5 is equivalent to median.)

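
       For example, trimming 20% of the values from each end of five
       made-up values discards the extreme value 100 (a sketch):

       $ printf '1\n2\n3\n4\n100\n' | datamash trimmean:0.2 1 mean 1
       3   22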

       median      median value


       q1          1st quartile value


       q3          3rd quartile value


       iqr         inter-quartile range


       perc[:PERCENTILE]
                   percentile value PERCENTILE (defaults to 95).


       mode        mode value (most common value)


       antimode    anti-mode value (least common value)

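
       For example (a sketch with a small made-up input):

       $ printf '1\n2\n2\n3\n3\n3\n' | datamash mode 1 antimode 1
       3   1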

       pstdev/sstdev
                   population/sample standard deviation


       pvar/svar   population/sample variance

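
       For example (a sketch; --format keeps the output to four decimals):

       $ seq 10 | datamash --format "%.4f" pstdev 1 sstdev 1 pvar 1 svar 1
       2.8723   3.0277   8.2500   9.1667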

       mad         median absolute deviation, scaled by constant 1.4826 for
                   normal distributions


       madraw      median absolute deviation, unscaled

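
       For example, the unscaled MAD of the values 1..10 is 2.5, and mad
       multiplies it by 1.4826 (a sketch):

       $ seq 10 | datamash --format "%.4f" madraw 1 mad 1
       2.5000   3.7065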

       pskew/sskew skewness of the group
                   values x reported by the 'sskew' and 'pskew' operations:
                          x > 0           - positively skewed / skewed right
                          0 > x           - negatively skewed / skewed left
                          x > 1           - highly skewed right
                          1 > x > 0.5     - moderately skewed right
                          0.5 > x > -0.5  - approximately symmetric
                          -0.5 > x > -1   - moderately skewed left
                          -1 > x          - highly skewed left


       pkurt/skurt excess kurtosis of the group


       jarque/dpo  p-value of the Jarque-Bera (jarque) and
                   D'Agostino-Pearson Omnibus (dpo) tests for normality:
                          null hypothesis is normality;
                          low p-values indicate non-normal data;
                          high p-values indicate the null hypothesis cannot
                          be rejected.


       pcov/scov [X:Y]
                   covariance of fields X and Y


       ppearson/spearson [X:Y]
                   Pearson product-moment correlation coefficient
                   [Pearson's R] of fields X and Y

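
       For example, on a small made-up tab-separated input where field 2 is
       exactly twice field 1 (a sketch):

       $ printf '1\t2\n2\t4\n3\t6\n' | datamash ppearson 1:2 scov 1:2
       1   2
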
EXAMPLES
   Basic usage
       Print the sum and the mean of values from field 1:

       $ seq 10 | datamash sum 1 mean 1
       55    5.5

       Group input based on field 1, and sum values (per group) on field 2:

       $ cat example.txt
       A  10
       A  5
       B  9
       B  11

       $ datamash -g 1 sum 2 < example.txt
       A  15
       B  20

       $ datamash groupby 1 sum 2 < example.txt
       A  15
       B  20

       Unsorted input must be sorted (with '-s'):

       $ cat example.txt
       A  10
       C  4
       B  9
       C  1
       A  5
       B  11

       $ datamash -s -g1 sum 2 < example.txt
       A  15
       B  20
       C  5

       Which is equivalent to:

       $ cat example.txt | sort -k1,1 | datamash -g 1 sum 2


   Header lines
       Use -H (--headers) if the input file has a header line:

       # Given a file with student name, major, test score...
       $ head -n5 scores_h.txt
       Name           Major            Score
       Shawn          Engineering      47
       Caleb          Business         87
       Christian      Business         88
       Derek          Arts             60

       # Calculate the mean and standard deviation for each major
       $ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt

       (or use the short form)

       $ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt

       (or use named fields)

       $ datamash -sH -g Major mean Score pstdev Score < scores_h.txt
       GroupBy(Major)     mean(Score)   pstdev(Score)
       Arts               68.9          10.1
       Business           87.3           4.9
       Engineering        66.5          19.1
       Health-Medicine    90.6           8.8
       Life-Sciences      55.3          19.7
       Social-Sciences    60.2          16.6

       Field names must be escaped with a backslash if they start with a
       digit or contain special characters (dash/minus, colons, commas).
       Note the interplay between escaping with backslash and shell
       quoting.  The following equivalent commands sum the values of a
       field named "FOO-BAR":

       $ datamash -H sum FOO\\-BAR < input.txt
       $ datamash -H sum 'FOO\-BAR' < input.txt
       $ datamash -H sum "FOO\\-BAR" < input.txt

   Skipping comment lines
       Use -C (--skip-comments) to skip lines starting with '#' or ';'
       characters (and optional whitespace before them):

       $ cat in.txt
       #foo   3
       bar    5
       ;baz   7

       $ datamash sum 2 < in.txt
       15

       $ datamash -C sum 2 < in.txt
       5

   Multiple fields
       Use comma or dash to specify multiple fields.  The following are
       equivalent:

       $ seq 9 | paste - - -
       1  2  3
       4  5  6
       7  8  9

       $ seq 9 | paste - - - | datamash sum 1 sum 2 sum 3
       12  15  18

       $ seq 9 | paste - - - | datamash sum 1,2,3
       12  15  18

       $ seq 9 | paste - - - | datamash sum 1-3
       12  15  18

   Rounding
       The following demonstrates the different rounding operations:

       $ ( echo X ; seq -1.25 0.25 1.25 ) \
             | datamash --full -H round 1 ceil 1 floor 1 trunc 1 frac 1

         X     round(X)  ceil(X)  floor(X)  trunc(X)  frac(X)
       -1.25     -1        -1       -2        -1       -0.25
       -1.00     -1        -1       -1        -1        0
       -0.75     -1         0       -1         0       -0.75
       -0.50     -1         0       -1         0       -0.5
       -0.25      0         0       -1         0       -0.25
        0.00      0         0        0         0        0
        0.25      0         1        0         0        0.25
        0.50      1         1        0         0        0.5
        0.75      1         1        0         0        0.75
        1.00      1         1        1         1        0
        1.25      1         2        1         1        0.25

   Reversing fields
       $ seq 6 | paste - - | datamash reverse
       2  1
       4  3
       6  5

   Transposing a file
       $ seq 6 | paste - - | datamash transpose
       1  3  5
       2  4  6

   Removing Duplicated lines
       Remove lines with a duplicate key value in field 1 (unlike the
       first/last operations, rmdup is much faster and does not require
       sorting the file with -s):

       # Given a list of files and sample IDs:
       $ cat INPUT
       SampleID   File
       2          cc.txt
       3          dd.txt
       1          ab.txt
       2          ee.txt
       3          ff.txt

       # Remove lines with duplicated Sample-ID (field 1):
       $ datamash rmdup 1 < INPUT

       # or use the named field:
       $ datamash -H rmdup SampleID < INPUT
       SampleID   File
       2          cc.txt
       3          dd.txt
       1          ab.txt

   Checksums
       Calculate the sha1 hash of each TXT file name (field 2 of sha1sum's
       output), keeping the sha1 hash of each file's content (field 1):

       $ sha1sum *.txt | datamash -Wf sha1 2

   Check file structure
       Check the structure of the input file: ensure all lines have the
       same number of fields, or the expected number of lines/fields:

       $ seq 10 | paste - - | datamash check && echo ok || echo fail
       5 lines, 2 fields
       ok

       $ seq 13 | paste - - - | datamash check && echo ok || echo fail
       line 4 (3 fields):
       10  11  12
       line 5 (2 fields):
       13
       datamash: check failed: line 5 has 2 fields (previous line had 3)
       fail

       $ seq 10 | paste - - | datamash check 2 fields 5 lines
       5 lines, 2 fields

       $ seq 10 | paste - - | datamash check 4 fields
       line 1 (2 fields):
       1  2
       datamash: check failed: line 1 has 2 fields (expecting 4)

       $ seq 10 | paste - - | datamash check 7 lines
       datamash: check failed: input had 5 lines (expecting 7)

   Cross-Tabulation
       Cross-tabulation compares the relationship between two fields.
       Given the following input file:

       $ cat input.txt
       a  x  3
       a  y  7
       b  x  21
       a  x  40

       Show cross-tabulation between the first field (a/b) and the second
       field (x/y) - counting how many times each pair appears (note:
       sorting is required):

       $ datamash -s crosstab 1,2 < input.txt
            x    y
       a    2    1
       b    1    N/A

       An optional grouping operation can be used instead of counting:

       $ datamash -s crosstab 1,2 sum 3 < input.txt
            x     y
       a    43    7
       b    21    N/A

       $ datamash -s crosstab 1,2 unique 3 < input.txt
            x       y
       a    3,40    7
       b    21      N/A

   Binning numeric values
       Bin input values into buckets of size 5:

       $ ( echo X ; seq -10 2.5 10 ) \
             | datamash -H --full bin:5 1
         X     bin(X)
       -10.0   -15
        -7.5   -10
        -5.0   -10
        -2.5   -5
         0.0    0
         2.5    0
         5.0    5
         7.5    5
        10.0    10

   Binning string values
       Hash any input value into a numeric integer.  A typical usage would
       be to split an input file into N chunks, ensuring that all values of
       a certain key will be stored in the same chunk:

       $ cat input.txt
       PatientA   10
       PatientB   11
       PatientC   12
       PatientA   14
       PatientC   15

       Each patient ID is hashed into a bin between 0 and 9 and printed in
       the last field:

       $ datamash --full strbin 1 < input.txt
       PatientA   10   5
       PatientB   11   6
       PatientC   12   7
       PatientA   14   5
       PatientC   15   7

       Splitting the input into chunks can be done with awk:

       $ cat input.txt \
           | datamash --full strbin 1 \
           | awk '{print > $NF ".txt"}'

   Extracting numbers with getnum
       The 'getnum' operation extracts a numeric value from the field:

       $ echo zoom-123.45xyz | datamash getnum 1
       123.45

       getnum accepts an optional single-letter TYPE option:
           getnum:n - natural numbers (positive integers, including zero)
           getnum:i - integers
           getnum:d - decimal point numbers
           getnum:p - positive decimal point numbers (this is the default)
           getnum:h - hex numbers
           getnum:o - octal numbers

       Examples:
       $ echo zoom-123.45xyz | datamash getnum 1
       123.45

       $ echo zoom-123.45xyz | datamash getnum:n 1
       123

       $ echo zoom-123.45xyz | datamash getnum:i 1
       -123

       $ echo zoom-123.45xyz | datamash getnum:d 1
       -123.45

       $ echo zoom-123.45xyz | datamash getnum:p 1
       123.45

       # Hex 0x123 = 291 Decimal
       $ echo zoom-123.45xyz | datamash getnum:h 1
       291

       # Octal 0123 = 83 Decimal
       $ echo zoom-123.45xyz | datamash getnum:o 1
       83

ADDITIONAL INFORMATION
       See GNU Datamash Website (https://www.gnu.org/software/datamash)

AUTHOR
       Written by Assaf Gordon.

COPYRIGHT
       Copyright © 2020 Assaf Gordon.  License GPLv3+: GNU GPL version 3 or
       later <https://gnu.org/licenses/gpl.html>.
       This is free software: you are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       The full documentation for datamash is maintained as a Texinfo
       manual.  If the info and datamash programs are properly installed at
       your site, the command

              info datamash

       should give you access to the complete manual.



datamash 1.7                       April 2020                      DATAMASH(1)