DATAMASH(1)                      User Commands                     DATAMASH(1)

NAME
       datamash - command-line calculations

SYNOPSIS
       datamash [OPTION] op [fld] [op fld ...]

DESCRIPTION
       Performs numeric/string operations on input from stdin.

       'op' is the operation to perform.  If a primary operation is used, it
       must be listed first, optionally followed by other operations.  'fld'
       is the input field to use.  'fld' can be a number (1=first field), or
       a field name when using the -H or --header-in options.  Multiple
       fields can be listed with a comma (e.g. 1,6,8).  A range of fields
       can be listed with a dash (e.g. 2-8).  Use colons for operations
       which require a pair of fields (e.g. 'pcov 2:6').

       Primary operations:
              groupby, crosstab, transpose, reverse, check

       Line-Filtering operations:
              rmdup

       Per-Line operations:
              base64, debase64, md5, sha1, sha256, sha512, bin, strbin,
              round, floor, ceil, trunc, frac, dirname, basename, barename,
              extname, getnum, cut

       Numeric Grouping operations:
              sum, min, max, absmin, absmax, range

       Textual/Numeric Grouping operations:
              count, first, last, rand, unique, collapse, countunique

       Statistical Grouping operations:
              mean, trimmean, median, q1, q3, iqr, perc, mode, antimode,
              pstdev, sstdev, pvar, svar, mad, madraw, pskew, sskew, pkurt,
              skurt, dpo, jarque, scov, pcov, spearson, ppearson

       Grouping Options:
       -C, --skip-comments
              skip comment lines (starting with '#' or ';' and optional
              whitespace)

       -f, --full
              print entire input line before op results (default: print
              only the grouped keys)

       -g, --group=X[,Y,Z]
              group via fields X,[Y,Z]; equivalent to primary operation
              'groupby'

       --header-in
              first input line is column headers

       --header-out
              print column headers as first line

       -H, --headers
              same as '--header-in --header-out'

       -i, --ignore-case
              ignore upper/lower case when comparing text; this affects
              grouping and string operations

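              For example, a minimal illustration counting distinct values,
              first case-sensitively and then with -i:

                     $ printf 'A\na\nb\n' | datamash countunique 1
                     3

                     $ printf 'A\na\nb\n' | datamash -i countunique 1
                     2
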
       -s, --sort
              sort the input before grouping; this removes the need to
              manually pipe the input through 'sort'

       File Operation Options:
       --no-strict
              allow lines with varying number of fields

       --filler=X
              fill missing values with X (default: N/A)

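              For example, a minimal sketch transposing a ragged input;
              with --no-strict the short row is padded with the filler
              (output shown is indicative):

                     $ printf '1\t2\t3\n4\t5\n' \
                         | datamash --no-strict --filler X transpose
                     1  4
                     2  5
                     3  X
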
       General Options:
       -t, --field-separator=X
              use X instead of TAB as field delimiter

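              For example, a minimal illustration grouping and summing
              comma-separated input:

                     $ printf 'A,10\nA,5\nB,7\n' | datamash -t, -g1 sum 2
                     A,15
                     B,7
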
       --format=FORMAT
              print numeric values with printf style floating-point FORMAT.

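              For example, printing the mean of 1..10 with three decimal
              digits (a minimal illustration):

                     $ seq 10 | datamash --format "%.3f" mean 1
                     5.500
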
       --output-delimiter=X
              use X as the output field delimiter (default: use the same
              delimiter as set by -t/-W)

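              For example, reading TAB-separated input but joining the
              results with commas (a minimal illustration):

                     $ printf '1\t2\n' \
                         | datamash --output-delimiter=, sum 1 sum 2
                     1,2
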
       --narm skip NA/NaN values

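              For example, summing a column while skipping an NA value
              (a minimal illustration):

                     $ printf '1\nNA\n3\n' | datamash --narm sum 1
                     4
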
       -R, --round=N
              round numeric output to N decimal places

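              For example, rounding a mean to two decimal places (a minimal
              illustration; mean of 1, 2 and 4 is 2.333...):

                     $ printf '%s\n' 1 2 4 | datamash -R 2 mean 1
                     2.33
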
       -W, --whitespace
              use whitespace (one or more spaces and/or tabs) for field
              delimiters

       -z, --zero-terminated
              end lines with 0 byte, not newline

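              For example, a minimal sketch of processing NUL-terminated
              records; 'tr' is used only to make the output printable:

                     $ printf '10\0002\000' | datamash -z sum 1 | tr '\0' '\n'
                     12
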
       --help display this help and exit

       --version
              output version information and exit

OPTIONS
   Primary Operations
       Primary operations affect the way the file is processed.  If used,
       the primary operation must be listed first.  Some operations require
       field numbers (groupby, crosstab) while others do not (reverse,
       check, transpose).  If no primary operation is listed, the entire
       file is processed - either line-by-line (for 'per-line' operations)
       or all lines as one group (for grouping operations).  See the
       EXAMPLES section below.

       groupby X,Y,... op fld ...
                 group the file by given fields.  Equivalent to option
                 '-g'.  For each group perform operation op on field fld.

       crosstab X,Y [op fld ...]
                 cross-tabulate a file by two fields (cross-tabulation is
                 also known as pivot tables).  If no operation is
                 specified, counts how many incidents exist of X,Y.

       transpose transpose rows, columns of the input file

       reverse   reverse field order in each line

       check [N lines] [N fields]
                 verify that the input file has the same number of fields
                 in all lines, or the expected number of lines/fields.
                 The number of lines and fields is printed to STDOUT.
                 Exits with non-zero code and prints the offending line if
                 there is a mismatch in the number of lines/fields.

   Line-Filtering operations
       rmdup     remove lines with duplicated key value

   Per-Line operations
       base64    Encode the field as base64

       debase64  Decode the field as base64; exit with an error if the
                 field is not a valid base64 string

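       For example, a minimal illustration of encoding and decoding a
       field:

              $ echo hello | datamash base64 1
              aGVsbG8=

              $ echo aGVsbG8= | datamash debase64 1
              hello
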
       md5/sha1/sha256/sha512
                 Calculate md5/sha1/sha256/sha512 hash of the field value

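       For example, the md5 hash of the literal string 'hello' (a minimal
       illustration):

              $ echo hello | datamash md5 1
              5d41402abc4b2a76b9719d911017c592
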
       bin[:BUCKET-SIZE]
                 bin numeric values into buckets of size BUCKET-SIZE
                 (defaults to 100).

       strbin[:BUCKET-SIZE]
                 hashes the input and returns a numeric integer value
                 between zero and BUCKET-SIZE (defaults to 10).

       round/floor/ceil/trunc/frac
                 numeric rounding operations.  round (round half away from
                 zero), floor (round down), ceil (ceiling, round up),
                 trunc (truncate, round towards zero), frac (fraction,
                 return fraction part of a decimal-point value).

       dirname/basename
                 extract the directory name and the base file name from a
                 given string (same as dirname(1) and basename(1)).

       extname   extract the extension of the file name (without the '.').

       barename  extract the base file name without the extension.

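       For example, a minimal illustration of the four path operations:

              $ echo /home/user/hello.txt | datamash dirname 1 basename 1
              /home/user  hello.txt

              $ echo /home/user/hello.txt | datamash extname 1 barename 1
              txt  hello
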
       getnum[:TYPE]
                 extract a number from the field.  TYPE is an optional
                 single-letter option n/i/d/p/h/o (see examples below).

       cut       copy input field to output field (similar to cut(1)).

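       For example, selecting and reordering fields with 'cut' (a minimal
       illustration):

              $ printf '1\t2\t3\n' | datamash cut 3,1
              3  1
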
   Numeric Grouping operations
       sum       sum the values

       min       minimum value

       max       maximum value

       absmin    minimum of the absolute values

       absmax    maximum of the absolute values

       range     the values range (max-min)

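       For example, treating all input lines as one group (a minimal
       illustration):

              $ printf '%s\n' 3 -1 4 | datamash min 1 max 1 range 1
              -1  4  5
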
   Textual/Numeric Grouping operations
       count     count number of elements in the group

       first     the first value of the group

       last      the last value of the group

       rand      one random value from the group

       unique    comma-separated sorted list of unique values

       collapse  comma-separated list of all input values

       countunique
                 number of unique/distinct values

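       For example, a minimal illustration contrasting unique, collapse
       and countunique:

              $ printf '%s\n' A B A | datamash unique 1 collapse 1 countunique 1
              A,B  A,B,A  2
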
   Statistical Grouping operations
       A p/s prefix indicates the variant: population or sample.
       Typically, the sample variant is equivalent to GNU R's built-in
       functions (e.g. datamash's sstdev operation is equivalent to R's
       sd() function).

       mean      mean of the values

       trimmean[:PERCENT]
                 trimmed mean of the values.  PERCENT should be between 0
                 and 0.5.  (trimmean:0 is equivalent to mean; trimmean:0.5
                 is equivalent to median.)

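       For example, an outlier-resistant mean (a minimal sketch, assuming
       trimmean:0.2 discards the lowest and highest 20% of the values
       before averaging):

              $ printf '%s\n' 1 2 3 4 100 | datamash mean 1 trimmean:0.2 1
              22  3
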
       median    median value

       q1        1st quartile value

       q3        3rd quartile value

       iqr       inter-quartile range

       perc[:PERCENTILE]
                 percentile value PERCENTILE (defaults to 95).

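       For example, the 50th percentile of 1..5 (a minimal illustration;
       for an odd-sized input this equals the median):

              $ seq 5 | datamash perc:50 1
              3
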
       mode      mode value (most common value)

       antimode  anti-mode value (least common value)

       pstdev/sstdev
                 population/sample standard deviation

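       For example, a population standard deviation (a minimal
       illustration; these eight values have mean 5 and population
       variance 4):

              $ printf '%s\n' 2 4 4 4 5 5 7 9 | datamash pstdev 1
              2
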
       pvar/svar population/sample variance

       mad       median absolute deviation, scaled by constant 1.4826 for
                 normal distributions

       madraw    median absolute deviation, unscaled

       pskew/sskew
                 skewness of the group
                 values x reported by 'sskew' and 'pskew' operations:
                        x > 0          -  positively skewed / skewed right
                        0 > x          -  negatively skewed / skewed left
                        x > 1          -  highly skewed right
                        1 > x > 0.5    -  moderately skewed right
                        0.5 > x > -0.5 -  approximately symmetric
                        -0.5 > x > -1  -  moderately skewed left
                        -1 > x         -  highly skewed left

       pkurt/skurt
                 excess kurtosis of the group

       jarque/dpo
                 p-value of the Jarque-Bera (jarque) and D'Agostino-Pearson
                 Omnibus (dpo) tests for normality:
                        null hypothesis is normality;
                        low p-values indicate non-normal data;
                        high p-values indicate the null hypothesis cannot
                        be rejected.

       pcov/scov [X:Y]
                 covariance of fields X and Y

       ppearson/spearson [X:Y]
                 Pearson product-moment correlation coefficient [Pearson's
                 R] of fields X and Y

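       For example, a perfectly linear relationship between two fields
       yields a correlation of 1 (a minimal illustration):

              $ printf '1\t2\n2\t4\n3\t6\n' | datamash ppearson 1:2
              1
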
EXAMPLES
   Basic usage
       Print the sum and the mean of values from field 1:

              $ seq 10 | datamash sum 1 mean 1
              55  5.5

       Group input based on field 1, and sum values (per group) on field 2:

              $ cat example.txt
              A  10
              A  5
              B  9
              B  11

              $ datamash -g 1 sum 2 < example.txt
              A  15
              B  20

              $ datamash groupby 1 sum 2 < example.txt
              A  15
              B  20

       Unsorted input must be sorted (with '-s'):

              $ cat example.txt
              A  10
              C  4
              B  9
              C  1
              A  5
              B  11

              $ datamash -s -g1 sum 2 < example.txt
              A  15
              B  20
              C  5

       Which is equivalent to:

              $ cat example.txt | sort -k1,1 | datamash -g 1 sum 2

   Header lines
       Use -H (--headers) if the input file has a header line:

              # Given a file with student name, major, test score...
              $ head -n5 scores_h.txt
              Name        Major        Score
              Shawn       Engineering  47
              Caleb       Business     87
              Christian   Business     88
              Derek       Arts         60

              # Calculate the mean and standard deviation for each major
              $ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt

       (or use short form)

              $ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt

       (or use named fields)

              $ datamash -sH -g Major mean Score pstdev Score < scores_h.txt
              GroupBy(Major)   mean(Score)  pstdev(Score)
              Arts             68.9         10.1
              Business         87.3         4.9
              Engineering      66.5         19.1
              Health-Medicine  90.6         8.8
              Life-Sciences    55.3         19.7
              Social-Sciences  60.2         16.6

       Field names must be escaped with a backslash if they start with a
       digit or contain special characters (dash/minus, colons, commas).
       Note the interplay between escaping with backslash and shell
       quoting.  The following equivalent commands sum the values of a
       field named "FOO-BAR":

              $ datamash -H sum FOO\\-BAR < input.txt
              $ datamash -H sum 'FOO\-BAR' < input.txt
              $ datamash -H sum "FOO\\-BAR" < input.txt

   Skipping comment lines
       Use -C (--skip-comments) to skip lines starting with '#' or ';'
       characters (and optional whitespace before them):

              $ cat in.txt
              #foo  3
              bar   5
              ;baz  7

              $ datamash sum 2 < in.txt
              15

              $ datamash -C sum 2 < in.txt
              5

   Multiple fields
       Use comma or dash to specify multiple fields.  The following are
       equivalent:

              $ seq 9 | paste - - -
              1  2  3
              4  5  6
              7  8  9

              $ seq 9 | paste - - - | datamash sum 1 sum 2 sum 3
              12  15  18

              $ seq 9 | paste - - - | datamash sum 1,2,3
              12  15  18

              $ seq 9 | paste - - - | datamash sum 1-3
              12  15  18

   Rounding
       The following demonstrate the different rounding operations:

              $ ( echo X ; seq -1.25 0.25 1.25 ) \
                    | datamash --full -H round 1 ceil 1 floor 1 trunc 1 frac 1

              X      round(X)  ceil(X)  floor(X)  trunc(X)  frac(X)
              -1.25  -1        -1       -2        -1        -0.25
              -1.00  -1        -1       -1        -1        0
              -0.75  -1        0        -1        0         -0.75
              -0.50  -1        0        -1        0         -0.5
              -0.25  0         0        -1        0         -0.25
              0.00   0         0        0         0         0
              0.25   0         1        0         0         0.25
              0.50   1         1        0         0         0.5
              0.75   1         1        0         0         0.75
              1.00   1         1        1         1         0
              1.25   1         2        1         1         0.25

   Reversing fields
       $ seq 6 | paste - - | datamash reverse
       2  1
       4  3
       6  5

   Transposing a file
       $ seq 6 | paste - - | datamash transpose
       1  3  5
       2  4  6

   Removing Duplicated lines
       Remove lines with duplicate key value from field 1 (unlike the
       'first' and 'last' operations, rmdup is much faster and does not
       require sorting the file with -s):

              # Given a list of files and sample IDs:
              $ cat INPUT
              SampleID  File
              2         cc.txt
              3         dd.txt
              1         ab.txt
              2         ee.txt
              3         ff.txt

              # Remove lines with duplicated Sample-ID (field 1):
              $ datamash rmdup 1 < INPUT

              # or use a named field:
              $ datamash -H rmdup SampleID < INPUT
              SampleID  File
              2         cc.txt
              3         dd.txt
              1         ab.txt

   Checksums
       Calculate the sha1 hash of each TXT file's name (field 2), alongside
       the sha1 hash of each file's content as computed by sha1sum(1):

              $ sha1sum *.txt | datamash -Wf sha1 2

   Check file structure
       Check the structure of the input file: ensure all lines have the
       same number of fields, or the expected number of lines/fields:

              $ seq 10 | paste - - | datamash check && echo ok || echo fail
              5 lines, 2 fields
              ok

              $ seq 13 | paste - - - | datamash check && echo ok || echo fail
              line 4 (3 fields):
              10  11  12
              line 5 (2 fields):
              13
              datamash: check failed: line 5 has 2 fields (previous line had 3)
              fail

              $ seq 10 | paste - - | datamash check 2 fields 5 lines
              5 lines, 2 fields

              $ seq 10 | paste - - | datamash check 4 fields
              line 1 (2 fields):
              1  2
              datamash: check failed: line 1 has 2 fields (expecting 4)

              $ seq 10 | paste - - | datamash check 7 lines
              datamash: check failed: input had 5 lines (expecting 7)

   Cross-Tabulation
       Cross-tabulation compares the relationship between two fields.
       Given the following input file:

              $ cat input.txt
              a  x  3
              a  y  7
              b  x  21
              a  x  40

       Show cross-tabulation between the first field (a/b) and the second
       field (x/y) - counting how many times each pair appears (note:
       sorting is required):

              $ datamash -s crosstab 1,2 < input.txt
                 x     y
              a  2     1
              b  1     N/A

       An optional grouping operation can be used instead of counting:

              $ datamash -s crosstab 1,2 sum 3 < input.txt
                 x     y
              a  43    7
              b  21    N/A

              $ datamash -s crosstab 1,2 unique 3 < input.txt
                 x     y
              a  3,40  7
              b  21    N/A

   Binning numeric values
       Bin input values into buckets of size 5:

              $ ( echo X ; seq -10 2.5 10 ) \
                    | datamash -H --full bin:5 1
              X      bin(X)
              -10.0  -15
              -7.5   -10
              -5.0   -10
              -2.5   -5
              0.0    0
              2.5    0
              5.0    5
              7.5    5
              10.0   10

   Binning string values
       Hash any input value into a numeric integer.  A typical usage would
       be to split an input file into N chunks, ensuring that all values
       of a certain key will be stored in the same chunk:

              $ cat input.txt
              PatientA  10
              PatientB  11
              PatientC  12
              PatientA  14
              PatientC  15

       Each patient ID is hashed into a bin between 0 and 9 and printed in
       the last field:

              $ datamash --full strbin 1 < input.txt
              PatientA  10  5
              PatientB  11  6
              PatientC  12  7
              PatientA  14  5
              PatientC  15  7

       Splitting the input into chunks can be done with awk:

              $ cat input.txt \
                    | datamash --full strbin 1 \
                    | awk '{print > $NF ".txt"}'

   Extracting numbers with getnum
       The 'getnum' operation extracts a numeric value from the field:

              $ echo zoom-123.45xyz | datamash getnum 1
              123.45

       getnum accepts an optional single-letter TYPE option:
              getnum:n - natural numbers (positive integers, including zero)
              getnum:i - integers
              getnum:d - decimal point numbers
              getnum:p - positive decimal point numbers (this is the default)
              getnum:h - hex numbers
              getnum:o - octal numbers

       Examples:

              $ echo zoom-123.45xyz | datamash getnum 1
              123.45

              $ echo zoom-123.45xyz | datamash getnum:n 1
              123

              $ echo zoom-123.45xyz | datamash getnum:i 1
              -123

              $ echo zoom-123.45xyz | datamash getnum:d 1
              -123.45

              $ echo zoom-123.45xyz | datamash getnum:p 1
              123.45

              # Hex 0x123 = 291 Decimal
              $ echo zoom-123.45xyz | datamash getnum:h 1
              291

              # Octal 0123 = 83 Decimal
              $ echo zoom-123.45xyz | datamash getnum:o 1
              83

ADDITIONAL INFORMATION
       See the GNU Datamash website (https://www.gnu.org/software/datamash)

AUTHOR
       Written by Assaf Gordon.

COPYRIGHT
       Copyright © 2020 Assaf Gordon.  License GPLv3+: GNU GPL version 3
       or later <https://gnu.org/licenses/gpl.html>.
       This is free software: you are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       The full documentation for datamash is maintained as a Texinfo
       manual.  If the info and datamash programs are properly installed
       at your site, the command

              info datamash

       should give you access to the complete manual.

datamash 1.6                     February 2020                     DATAMASH(1)