DATAMASH(1) | User Commands | DATAMASH(1) |
datamash - command-line calculations
datamash [OPTION] op [fld] [op fld ...]
Performs numeric/string operations on input from stdin.
'op' is the operation to perform. If a primary operation is used, it must be listed first, optionally followed by other operations. 'fld' is the input field to use. 'fld' can be a number (1=first field), or a field name when using the -H or --header-in options. Multiple fields can be listed with a comma (e.g. 1,6,8). A range of fields can be listed with a dash (e.g. 2-8). Use colons for operations which require a pair of fields (e.g. 'pcov 2:6').
Primary operations affect the way the file is processed. If used, the primary operation must be listed first. If primary operation is not listed the entire file is processed - either line-by-line (for 'per-line' operations) or all lines as one group (for grouping operations). See Examples section below.
A p/s prefix indicates the variant: population or sample. Typically, the sample variant is equivalent with GNU R's internal functions (e.g datamash's sstdev operation is equivalent to R's sd() function).
x > 0 - positively skewed / skewed right 0 > x - negatively skewed / skewed left x > 1 - highly skewed right 1 > x > 0.5 - moderately skewed right 0.5 > x > -0.5 - approximately symmetric -0.5 > x > -1 - moderately skewed left -1 > x - highly skewed left
Print the sum and the mean of values from field 1:
$ seq 10 | datamash sum 1 mean 1 55 5.5
Group input based on field 1, and sum values (per group) on field 2:
$ cat example.txt A 10 A 5 B 9 B 11 $ datamash -g 1 sum 2 < example.txt A 15 B 20 $ datamash groupby 1 sum 2 < example.txt A 15 B 20
Unsorted input must be sorted (with '-s'):
$ cat example.txt A 10 C 4 B 9 C 1 A 5 B 11 $ datamash -s -g1 sum 2 < example.txt A 15 B 20 C 5
Which is equivalent to:
$ cat example.txt | sort -k1,1 | datamash -g 1 sum 2
Use -H (--headers) if the input file has a header line:
# Given a file with student name, field, test score... $ head -n5 scores_h.txt Name Major Score Shawn Engineering 47 Caleb Business 87 Christian Business 88 Derek Arts 60 # Calculate the mean and standard deviation for each major $ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt (or use short form) $ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt (or use named fields) $ datamash -sH -g Major mean Score pstdev Score < scores_h.txt GroupBy(Major) mean(Score) pstdev(Score) Arts 68.9 10.1 Business 87.3 4.9 Engineering 66.5 19.1 Health-Medicine 90.6 8.8 Life-Sciences 55.3 19.7 Social-Sciences 60.2 16.6
Field names must be escaped with a backslash if they start with a digit or contain special characters (dash/minus, colons, commas). Note the interplay between escaping with backslash and shell quoting. The following equivalent command sum the values of a field named "FOO-BAR":
$ datamash -H sum FOO\\-BAR < input.txt $ datamash -H sum 'FOO\-BAR' < input.txt $ datamash -H sum "FOO\\-BAR" < input.txt
Use -C (--skip-comments) to skip lines starting with '#' or ';' characters (and optional whitespace before them):
$ cat in.txt #foo 3 bar 5 ;baz 7 $ datamash sum 2 < in.txt 15 $ datamash -C sum 2 < in.txt 5
Use comma or dash to specify multiple fields. The following are equivalent:
$ seq 9 | paste - - - 1 2 3 4 5 6 7 8 9 $ seq 9 | paste - - - | datamash sum 1 sum 2 sum 3 12 15 18 $ seq 9 | paste - - - | datamash sum 1,2,3 12 15 18 $ seq 9 | paste - - - | datamash sum 1-3 12 15 18
The following demonstrate the different rounding operations:
$ ( echo X ; seq -1.25 0.25 1.25 ) \ | datamash --full -H round 1 ceil 1 floor 1 trunc 1 frac 1 X round(X) ceil(X) floor(X) trunc(X) frac(X) -1.25 -1 -1 -2 -1 -0.25 -1.00 -1 -1 -1 -1 0 -0.75 -1 0 -1 0 -0.75 -0.50 -1 0 -1 0 -0.5 -0.25 0 0 -1 0 -0.25 0.00 0 0 0 0 0 0.25 0 1 0 0 0.25 0.50 1 1 0 0 0.5 0.75 1 1 0 0 0.75 1.00 1 1 1 1 0 1.25 1 2 1 1 0.25
$ seq 6 | paste - - | datamash reverse 2 1 4 3 6 5
$ seq 6 | paste - - | datamash transpose 1 3 5 2 4 6
Remove lines with duplicate key value from field 1 (Unlike first,last operations, rmdup is much faster and does not require sorting the file with -s):
# Given a list of files and sample IDs: $ cat INPUT SampleID File 2 cc.txt 3 dd.txt 1 ab.txt 2 ee.txt 3 ff.txt # Remove lines with duplicated Sample-ID (field 1): $ datamash rmdup 1 < INPUT # or use named field: $ datamash -H rmdup SampleID < INPUT SampleID File 2 cc.txt 3 dd.txt 1 ab.txt
Calculate the sha1 hash value of each TXT file, after calculating the sha1 value of each file's content:
$ sha1sum *.txt | datamash -Wf sha1 2
Check the structure of the input file: ensure all lines have the same number of fields, or expected number of lines/fields:
$ seq 10 | paste - - | datamash check && echo ok || echo fail 5 lines, 2 fields ok $ seq 13 | paste - - - | datamash check && echo ok || echo fail line 4 (3 fields): 10 11 12 line 5 (2 fields): 13 datamash: check failed: line 5 has 2 fields (previous line had 3) fail $ seq 10 | paste - - | datamash check 2 fields 5 lines 5 lines, 2 fields $ seq 10 | paste - - | datamash check 4 fields line 1 (2 fields): 1 2 datamash: check failed: line 1 has 2 fields (expecting 4) $ seq 10 | paste - - | datamash check 7 lines datamash: check failed: input had 5 lines (expecting 7)
Cross-tabulation compares the relationship between two fields. Given the following input file:
$ cat input.txt a x 3 a y 7 b x 21 a x 40
Show cross-tabulation between the first field (a/b) and the second field (x/y) - counting how many times each pair appears (note: sorting is required):
$ datamash -s crosstab 1,2 < input.txt x y a 2 1 b 1 N/A
An optional grouping operation can be used instead of counting:
$ datamash -s crosstab 1,2 sum 3 < input.txt x y a 43 7 b 21 N/A $ datamash -s crosstab 1,2 unique 3 < input.txt x y a 3,40 7 b 21 N/A
Bin input values into buckets of size 5:
$ ( echo X ; seq -10 2.5 10 ) \ | datamash -H --full bin:5 1 X bin(X) -10.0 -10 -7.5 -10 -5.0 -5 -2.5 -5 0.0 0 2.5 0 5.0 5 7.5 5 10.0 10
Hash any input value into a numeric integer. A typical usage would be to split an input file into N chunks, ensuring that all values of a certain key will be stored in the same chunk:
$ cat input.txt PatientA 10 PatientB 11 PatientC 12 PatientA 14 PatientC 15
Each patient ID is hashed into a bin between 0 and 9 and printed in the last field:
$ datamash --full strbin 1 < input.txt PatientA 10 5 PatientB 11 6 PatientC 12 7 PatientA 14 5 PatientC 15 7
Splitting the input into chunks can be done with awk:
$ cat input.txt \ | datamash --full strbin 1 \ | awk '{print > $NF ".txt"}'
The 'getnum' operation extracts a numeric value from the field:
$ echo zoom-123.45xyz | datamash getnum 1 123.45
getnum accepts an optional single-letter TYPE option:
getnum:n - natural numbers (positive integers, including zero) getnum:i - integers getnum:d - decimal point numbers getnum:p - positive decimal point numbers (this is the default) getnum:h - hex numbers getnum:o - octal numbers
Examples:
$ echo zoom-123.45xyz | datamash getnum 1 123.45 $ echo zoom-123.45xyz | datamash getnum:n 1 123 $ echo zoom-123.45xyz | datamash getnum:i 1 -123 $ echo zoom-123.45xyz | datamash getnum:d 1 123.45 $ echo zoom-123.45xyz | datamash getnum:p 1 -123.45 # Hex 0x123 = 291 Decimal $ echo zoom-123.45xyz | datamash getnum:h 1 291 # Octal 0123 = 83 Decimal $ echo zoom-123.45xyz | datamash getnum:o 1 83
See GNU Datamash Website (https://www.gnu.org/software/datamash)
Written by Assaf Gordon, Tim Rice, Shawn Wagner, Erik Auerswald.
Copyright © 2022 Assaf Gordon and Tim Rice License GPLv3+:
GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it. There is NO
WARRANTY, to the extent permitted by law.
The full documentation for datamash is maintained as a Texinfo manual. If the info and datamash programs are properly installed at your site, the command
should give you access to the complete manual.
July 2022 | datamash 1.8-dirty |