DATAMASH(1)                      User Commands                     DATAMASH(1)



NAME
       datamash - command-line calculations

SYNOPSIS
       datamash [OPTION] op [fld] [op fld ...]

DESCRIPTION
       Performs numeric/string operations on input from stdin.

       'op' is the operation to perform. If a primary operation is used, it
       must be listed first, optionally followed by other operations. 'fld'
       is the input field to use. 'fld' can be a number (1=first field), or
       a field name when using the -H or --header-in options. Multiple
       fields can be listed with a comma (e.g. 1,6,8). A range of fields
       can be listed with a dash (e.g. 2-8). Use colons for operations
       which require a pair of fields (e.g. 'pcov 2:6').
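
       For example, computing the Pearson correlation of fields 1 and 2
       uses the colon pair syntax (an illustrative sketch; perfectly linear
       input yields 1, though the exact output formatting may differ):

              $ printf '1\t2\n2\t4\n3\t6\n' | datamash ppearson 1:2
              1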

       Primary operations:
              groupby, crosstab, transpose, reverse, check

       Line-Filtering operations:
              rmdup

       Per-Line operations:
              base64, debase64, md5, sha1, sha256, sha512, bin, strbin,
              round, floor, ceil, trunc, frac

       Numeric Grouping operations:
              sum, min, max, absmin, absmax, range

       Textual/Numeric Grouping operations:
              count, first, last, rand, unique, collapse, countunique

       Statistical Grouping operations:
              mean, trimmean, median, q1, q3, iqr, perc, mode, antimode,
              pstdev, sstdev, pvar, svar, mad, madraw, pskew, sskew, pkurt,
              skurt, dpo, jarque, scov, pcov, spearson, ppearson

       Grouping Options:
              -f, --full
                     print entire input line before op results (default:
                     print only the grouped keys)

              -g, --group=X[,Y,Z]
                     group via fields X,[Y,Z]; equivalent to primary
                     operation 'groupby'

              --header-in
                     first input line is column headers

              --header-out
                     print column headers as first line

              -H, --headers
                     same as '--header-in --header-out'

              -i, --ignore-case
                     ignore upper/lower case when comparing text; this
                     affects grouping and string operations

              -s, --sort
                     sort the input before grouping; this removes the need
                     to manually pipe the input through 'sort'

       File Operation Options:
              --no-strict
                     allow lines with varying number of fields

              --filler=X
                     fill missing values with X (default: N/A)
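
       For example, transposing a ragged file with --no-strict fills the
       missing cells with the filler value (an illustrative sketch):

              $ printf 'A\tB\tC\nA\tB\n' | datamash --no-strict transpose
              A     A
              B     B
              C     N/A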

       General Options:
              -t, --field-separator=X
                     use X instead of TAB as field delimiter

              --format=FORMAT
                     print numeric values with printf-style floating-point
                     FORMAT

              --output-delimiter=X
                     use X as the output field delimiter (default: use the
                     same delimiter as -t/-W)

              --narm skip NA/NaN values

              -R, --round=N
                     round numeric output to N decimal places

              -W, --whitespace
                     use whitespace (one or more spaces and/or tabs) for
                     field delimiters

              -z, --zero-terminated
                     end lines with 0 byte, not newline

              --help display this help and exit

              --version
                     output version information and exit
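
       For example, --format and -R control how numeric results are printed
       (an illustrative sketch; -R is assumed here to print a fixed number
       of decimal places):

              $ seq 3 | datamash --format "%.3f" sum 1
              6.000

              $ seq 3 | datamash -R 2 mean 1
              2.00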

OPTIONS
   Primary Operations
       Primary operations affect the way the file is processed. If used,
       the primary operation must be listed first. Some operations require
       field numbers (groupby, crosstab) while others do not (reverse,
       check, transpose). If no primary operation is listed, the entire
       file is processed - either line-by-line (for 'per-line' operations)
       or all lines as one group (for grouping operations). See the
       EXAMPLES section below.
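
       For example, with no primary operation a grouping operation treats
       the whole file as one group, while a per-line operation produces one
       result per input line (an illustrative sketch):

              $ seq 4 | datamash sum 1
              10

              $ seq 4 | datamash ceil 1
              1
              2
              3
              4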

       groupby X,Y,... op fld ...
              group the file by the given fields. Equivalent to option
              '-g'. For each group, perform operation op on field fld.

       crosstab X,Y [op fld ...]
              cross-tabulate a file by two fields (cross-tabulation is
              also known as pivot tables). If no operation is specified,
              counts how many times each pair of X,Y values appears.

       transpose
              transpose the rows and columns of the input file

       reverse
              reverse the field order in each line

       check [N lines] [N fields]
              verify that the input file has the same number of fields in
              all lines, or the expected number of lines/fields. The
              number of lines and fields is printed to STDOUT. Exits with
              a non-zero code and prints the offending line if there is a
              mismatch in the number of lines/fields.

   Line-Filtering operations
       rmdup  remove lines with a duplicated key value

   Per-Line operations
       base64 Encode the field as base64 (see example below)

       debase64
              Decode the field as base64; exit with an error if the field
              is not a valid base64 string

       md5/sha1/sha256/sha512
              Calculate the md5/sha1/sha256/sha512 hash of the field value

       bin[:BUCKET-SIZE]
              bin numeric values into buckets of size BUCKET-SIZE
              (defaults to 100)

       strbin[:BUCKET-SIZE]
              hash the input and return a numeric integer value between
              zero and BUCKET-SIZE (defaults to 10)

       round/floor/ceil/trunc/frac
              numeric rounding operations: round (round half away from
              zero), floor (round down), ceil (ceiling, round up), trunc
              (truncate, round towards zero), frac (fraction, return the
              fractional part of a decimal-point value)
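
       For example, base64 and debase64 are inverses of each other (an
       illustrative sketch, assuming the field value is encoded as-is,
       without a trailing newline):

              $ echo 'hello' | datamash base64 1
              aGVsbG8=

              $ echo 'aGVsbG8=' | datamash debase64 1
              hello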

   Numeric Grouping operations
       sum    sum of the values

       min    minimum value

       max    maximum value

       absmin minimum of the absolute values

       absmax maximum of the absolute values

       range  the range of the values (max-min)

   Textual/Numeric Grouping operations
       count  count the number of elements in the group

       first  the first value of the group

       last   the last value of the group

       rand   one random value from the group

       unique comma-separated sorted list of unique values

       collapse
              comma-separated list of all input values

       countunique
              number of unique/distinct values
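
       For example, on input that is already grouped by field 1 (an
       illustrative sketch; the output fields are tab-separated):

              $ printf 'A\t10\nA\t10\nA\t5\nB\t9\n' \
                    | datamash -g 1 collapse 2 countunique 2
              A     10,10,5     2
              B     9           1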

   Statistical Grouping operations
       A p/s prefix indicates the variant: population or sample.
       Typically, the sample variant is equivalent to GNU R's built-in
       functions (e.g. datamash's sstdev operation is equivalent to R's
       sd() function).

       mean   mean of the values

       trimmean[:PERCENT]
              trimmed mean of the values. PERCENT should be between 0 and
              0.5. (trimmean:0 is equivalent to mean. trimmean:0.5 is
              equivalent to median.)
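
              For example, trimming before averaging discards the extreme
              values (an illustrative sketch, assuming trimmean:0.2
              removes floor(5 x 0.2) = 1 value from each end of the sorted
              group, matching R's mean(x, trim=0.2)):

                     $ printf '1\n2\n3\n4\n100\n' | datamash trimmean:0.2 1
                     3

                     $ printf '1\n2\n3\n4\n100\n' | datamash mean 1
                     22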

       median median value

       q1     1st quartile value

       q3     3rd quartile value

       iqr    inter-quartile range

       perc[:PERCENTILE]
              percentile value PERCENTILE (defaults to 95)

       mode   mode value (most common value)

       antimode
              anti-mode value (least common value)
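
              For example, with no ties between value counts (an
              illustrative sketch; the output fields are tab-separated):

                     $ printf '1\n2\n2\n3\n3\n3\n' | datamash mode 1 antimode 1
                     3     1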

       pstdev/sstdev
              population/sample standard deviation

       pvar/svar
              population/sample variance

       mad    median absolute deviation, scaled by the constant 1.4826
              for normal distributions

       madraw median absolute deviation, unscaled

       pskew/sskew
              skewness of the group
              values x reported by the 'sskew' and 'pskew' operations:
                     x > 0           -  positively skewed / skewed right
                     0 > x           -  negatively skewed / skewed left
                     x > 1           -  highly skewed right
                     1 > x > 0.5     -  moderately skewed right
                     0.5 > x > -0.5  -  approximately symmetric
                     -0.5 > x > -1   -  moderately skewed left
                     -1 > x          -  highly skewed left

       pkurt/skurt
              excess kurtosis of the group

       jarque/dpo
              p-value of the Jarque-Bera (jarque) and D'Agostino-Pearson
              Omnibus (dpo) tests for normality:
                     null hypothesis is normality;
                     low p-values indicate non-normal data;
                     high p-values indicate the null hypothesis cannot be
                     rejected.

       pcov/scov [X:Y]
              covariance of fields X and Y

       ppearson/spearson [X:Y]
              Pearson product-moment correlation coefficient [Pearson's R]
              of fields X and Y

EXAMPLES
   Basic usage
       Print the sum and the mean of values from field 1:

              $ seq 10 | datamash sum 1 mean 1
              55    5.5

       Group input based on field 1, and sum values (per group) on field 2:

              $ cat example.txt
              A     10
              A     5
              B     9
              B     11

              $ datamash -g 1 sum 2 < example.txt
              A     15
              B     20

              $ datamash groupby 1 sum 2 < example.txt
              A     15
              B     20

       If the input is not sorted, grouping requires the '-s' option:

              $ cat example.txt
              A     10
              C     4
              B     9
              C     1
              A     5
              B     11

              $ datamash -s -g1 sum 2 < example.txt
              A     15
              B     20
              C     5

       Which is equivalent to:

              $ cat example.txt | sort -k1,1 | datamash -g 1 sum 2

   Header lines
       Use -H (--headers) if the input file has a header line:

              # Given a file with student name, major, test score...
              $ head -n5 scores_h.txt
              Name           Major            Score
              Shawn          Engineering      47
              Caleb          Business         87
              Christian      Business         88
              Derek          Arts             60

              # Calculate the mean and standard deviation for each major
              $ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt

              (or use the short form)

              $ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt

              (or use named fields)

              $ datamash -sH -g Major mean Score pstdev Score < scores_h.txt
              GroupBy(Major)    mean(Score)    pstdev(Score)
              Arts              68.9           10.1
              Business          87.3            4.9
              Engineering       66.5           19.1
              Health-Medicine   90.6            8.8
              Life-Sciences     55.3           19.7
              Social-Sciences   60.2           16.6

   Multiple fields
       Use comma or dash to specify multiple fields. The following are
       equivalent:

              $ seq 9 | paste - - -
              1     2     3
              4     5     6
              7     8     9

              $ seq 9 | paste - - - | datamash sum 1 sum 2 sum 3
              12    15    18

              $ seq 9 | paste - - - | datamash sum 1,2,3
              12    15    18

              $ seq 9 | paste - - - | datamash sum 1-3
              12    15    18

   Rounding
       The following demonstrates the different rounding operations:

              $ ( echo X ; seq -1.25 0.25 1.25 ) \
                    | datamash --full -H round 1 ceil 1 floor 1 trunc 1 frac 1

              X      round(X)  ceil(X)  floor(X)  trunc(X)  frac(X)
              -1.25  -1        -1       -2        -1        -0.25
              -1.00  -1        -1       -1        -1         0
              -0.75  -1         0       -1         0        -0.75
              -0.50  -1         0       -1         0        -0.5
              -0.25   0         0       -1         0        -0.25
              0.00    0         0        0         0         0
              0.25    0         1        0         0         0.25
              0.50    1         1        0         0         0.5
              0.75    1         1        0         0         0.75
              1.00    1         1        1         1         0
              1.25    1         2        1         1         0.25

   Reversing fields
              $ seq 6 | paste - - | datamash reverse
              2     1
              4     3
              6     5

   Transposing a file
              $ seq 6 | paste - - | datamash transpose
              1     3     5
              2     4     6

   Removing duplicated lines
       Remove lines with a duplicate key value in field 1. Unlike the
       first/last operations, rmdup is much faster and does not require
       sorting the file with -s:

              # Given a list of files and sample IDs:
              $ cat INPUT
              SampleID   File
              2          cc.txt
              3          dd.txt
              1          ab.txt
              2          ee.txt
              3          ff.txt

              # Remove lines with duplicated Sample-ID (field 1):
              $ datamash rmdup 1 < INPUT

              # or use the named field:
              $ datamash -H rmdup SampleID < INPUT
              SampleID   File
              2          cc.txt
              3          dd.txt
              1          ab.txt

   Checksums
       Calculate the sha1 hash value of each TXT file name (the second
       field of sha1sum's output), after sha1sum calculates the sha1 value
       of each file's content:

              $ sha1sum *.txt | datamash -Wf sha1 2
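
       Hashes can also be computed for arbitrary field values. An
       illustrative sketch (assuming the field value is hashed as-is,
       without a trailing newline):

              $ echo 'hello' | datamash md5 1
              5d41402abc4b2a76b9719d911017c592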

   Check file structure
       Check the structure of the input file: ensure all lines have the
       same number of fields, or the expected number of lines/fields:

              $ seq 10 | paste - - | datamash check && echo ok || echo fail
              5 lines, 2 fields
              ok

              $ seq 13 | paste - - - | datamash check && echo ok || echo fail
              line 4 (3 fields):
                10      11      12
              line 5 (2 fields):
                13
              datamash: check failed: line 5 has 2 fields (previous line had 3)
              fail

              $ seq 10 | paste - - | datamash check 2 fields 5 lines
              5 lines, 2 fields

              $ seq 10 | paste - - | datamash check 4 fields
              line 1 (2 fields):
                1     2
              datamash: check failed: line 1 has 2 fields (expecting 4)

              $ seq 10 | paste - - | datamash check 7 lines
              datamash: check failed: input had 5 lines (expecting 7)

   Cross-tabulation
       Cross-tabulation compares the relationship between two fields.
       Given the following input file:

              $ cat input.txt
              a     x     3
              a     y     7
              b     x     21
              a     x     40

       Show cross-tabulation between the first field (a/b) and the second
       field (x/y) - counting how many times each pair appears (note:
       sorting is required):

              $ datamash -s crosstab 1,2 < input.txt
                    x     y
              a     2     1
              b     1     N/A

       An optional grouping operation can be used instead of counting:

              $ datamash -s crosstab 1,2 sum 3 < input.txt
                    x     y
              a     43    7
              b     21    N/A

              $ datamash -s crosstab 1,2 unique 3 < input.txt
                    x     y
              a     3,40  7
              b     21    N/A

   Binning numeric values
       Bin input values into buckets of size 5:

              $ ( echo X ; seq -10 2.5 10 ) \
                    | datamash -H --full bin:5 1
              X      bin(X)
              -10.0  -15
              -7.5   -10
              -5.0   -10
              -2.5    -5
              0.0      0
              2.5      0
              5.0      5
              7.5      5
              10.0    10

   Binning string values
       Hash any input value into a numeric integer. A typical usage would
       be to split an input file into N chunks, ensuring that all values
       of a certain key will be stored in the same chunk:

              $ cat input.txt
              PatientA   10
              PatientB   11
              PatientC   12
              PatientA   14
              PatientC   15

       Each patient ID is hashed into a bin between 0 and 9 and printed in
       the last field:

              $ datamash --full strbin 1 < input.txt
              PatientA   10   5
              PatientB   11   6
              PatientC   12   7
              PatientA   14   5
              PatientC   15   7

       Splitting the input into chunks can be done with awk:

              $ cat input.txt \
                    | datamash --full strbin 1 \
                    | awk '{print > $NF ".txt"}'

ADDITIONAL INFORMATION
       See the GNU Datamash website (https://www.gnu.org/software/datamash)

AUTHOR
       Written by Assaf Gordon.

COPYRIGHT
       Copyright © 2018 Assaf Gordon. License GPLv3+: GNU GPL version 3 or
       later <https://gnu.org/licenses/gpl.html>.
       This is free software: you are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       The full documentation for datamash is maintained as a Texinfo
       manual. If the info and datamash programs are properly installed at
       your site, the command

              info datamash

       should give you access to the complete manual.



datamash 1.3                      March 2018                      DATAMASH(1)