MAWK(1)                          USER COMMANDS                          MAWK(1)

NAME
6 mawk - pattern scanning and text processing language
7
SYNOPSIS
       mawk [-W option] [-F value] [-v var=value] [--] 'program text' [file ...]
       mawk [-W option] [-F value] [-v var=value] [-f program-file] [--] [file ...]
13
DESCRIPTION
mawk is an interpreter for the AWK Programming Language. The AWK
language is useful for manipulation of data files, text retrieval and
processing, and for prototyping and experimenting with algorithms. mawk
is a new awk, meaning it implements the AWK language as defined in Aho,
Kernighan and Weinberger, The AWK Programming Language, Addison-Wesley
Publishing, 1988 (hereafter referred to as the AWK book). mawk conforms
to the POSIX 1003.2 (draft 11.3) definition of the AWK language,
which contains a few features not described in the AWK book, and mawk
provides a small number of extensions.
24
25 An AWK program is a sequence of pattern {action} pairs and function
26 definitions. Short programs are entered on the command line usually
27 enclosed in ' ' to avoid shell interpretation. Longer programs can be
28 read in from a file with the -f option. Data input is read from the
29 list of files on the command line or from standard input when the list
30 is empty. The input is broken into records as determined by the record
31 separator variable, RS. Initially, RS = “\n” and records are synony‐
32 mous with lines. Each record is compared against each pattern and if
33 it matches, the program text for {action} is executed.
34
OPTIONS
36 -F value sets the field separator, FS, to value.
37
38 -f file Program text is read from file instead of from the com‐
39 mand line. Multiple -f options are allowed.
40
41 -v var=value assigns value to program variable var.
42
43 -- indicates the unambiguous end of options.
44
45 The above options will be available with any POSIX compatible implemen‐
46 tation of AWK. Implementation specific options are prefaced with -W.
47 mawk provides these:
48
49 -W dump writes an assembler like listing of the internal repre‐
50 sentation of the program to stdout and exits 0 (on suc‐
51 cessful compilation).
52
53 -W exec file Program text is read from file and this is the last
54 option.
55
56 This is a useful alternative to -f on systems that sup‐
57 port the #! “magic number” convention for executable
58 scripts. Those implicitly pass the pathname of the
59 script itself as the final parameter, and expect no more
60 than one “-” option on the #! line. Because mawk can
61 combine multiple -W options separated by commas, you can
62 use this option when an additional -W option is needed.
63
64 -W help prints a usage message to stderr and exits (same as
65 “-W usage”).
66
67 -W interactive sets unbuffered writes to stdout and line buffered reads
68 from stdin. Records from stdin are lines regardless of
69 the value of RS.
70
71 -W posix_space forces mawk not to consider '\n' to be space.
72
73 -W random=num calls srand with the given parameter (and overrides the
74 auto-seeding behavior).
75
76 -W sprintf=num adjusts the size of mawk's internal sprintf buffer to
77 num bytes. More than rare use of this option indicates
78 mawk should be recompiled.
79
80 -W usage prints a usage message to stderr and exits (same as
81 “-W help”).
82
83 -W version mawk writes its version and copyright to stdout and com‐
84 piled limits to stderr and exits 0.
85
86 mawk accepts abbreviations for any of these options, e.g., “-W v” and
87 “-Wv” both tell mawk to show its version.
88
89 mawk allows multiple -W options to be combined by separating the
90 options with commas, e.g., -Wsprint=2000,posix. This is useful for
91 executable #! “magic number” invocations in which only one argument is
92 supported, e.g., -Winteractive,exec.
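As a minimal sketch of the POSIX options described in this section, the following sets the field separator with -F and passes a value into the program with -v. The sample input, the variable name min and the AWK shell variable (which falls back to any awk on PATH when mawk is not installed) are assumptions made for illustration.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

# -F: sets FS to ":" and -v min=1 assigns min before the program starts.
out=$(printf 'root:x:0\nuser:x:1000\n' |
      "$AWK" -F: -v min=1 '$3 >= min { print $1 }')
printf '%s\n' "$out"
```

Only the second record has a third field of at least min, so only its first field is printed.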
93
THE AWK LANGUAGE
95 1. Program structure
96 An AWK program is a sequence of pattern {action} pairs and user func‐
97 tion definitions.
98
99 A pattern can be:
100 BEGIN
101 END
102 expression
103 expression , expression
104
105 One, but not both, of pattern {action} can be omitted. If {action} is
106 omitted it is implicitly { print }. If pattern is omitted, then it is
107 implicitly matched. BEGIN and END patterns require an action.
108
109 Statements are terminated by newlines, semi-colons or both. Groups of
110 statements such as actions or loop bodies are blocked via { ... } as in
111 C. The last statement in a block doesn't need a terminator. Blank
112 lines have no meaning; an empty statement is terminated with a semi-
113 colon. Long statements can be continued with a backslash, \. A state‐
114 ment can be broken without a backslash after a comma, left brace, &&,
115 ||, do, else, the right parenthesis of an if, while or for statement,
and the right parenthesis of a function definition. A comment starts
with # and extends to, but does not include, the end of the line.
118
119 The following statements control program flow inside blocks.
120
121 if ( expr ) statement
122
123 if ( expr ) statement else statement
124
125 while ( expr ) statement
126
127 do statement while ( expr )
128
129 for ( opt_expr ; opt_expr ; opt_expr ) statement
130
131 for ( var in array ) statement
132
133 continue
134
135 break
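The control-flow statements listed above can be exercised in one short BEGIN action. This is an illustrative sketch only; the variable names (sum, n, k, verdict) and the AWK shell variable are assumptions, not part of mawk.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    # for loop with continue: add up the even numbers from 1 to 6
    for (i = 1; i <= 6; i++) {
        if (i % 2) continue
        sum += i
    }
    # do/while always runs its body at least once
    do { n++ } while (n < 3)
    # while loop left through break
    while (1) { if (++k >= 4) break }
    # if/else
    if (n == 3) {
        verdict = "three"
    } else {
        verdict = "other"
    }
    print sum, n, k, verdict
}')
printf '%s\n' "$out"
```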
136
137 2. Data types, conversion and comparison
138 There are two basic data types, numeric and string. Numeric constants
139 can be integer like -2, decimal like 1.08, or in scientific notation
140 like -1.1e4 or .28E-3. All numbers are represented internally and all
141 computations are done in floating point arithmetic. So for example,
142 the expression 0.2e2 == 20 is true and true is represented as 1.0.
143
144 String constants are enclosed in double quotes.
145
146 "This is a string with a newline at the end.\n"
147
148 Strings can be continued across a line by escaping (\) the newline.
149 The following escape sequences are recognized.
150
            \\        \
            \"        "
            \a        alert, ascii 7
            \b        backspace, ascii 8
            \t        tab, ascii 9
            \n        newline, ascii 10
            \v        vertical tab, ascii 11
            \f        formfeed, ascii 12
            \r        carriage return, ascii 13
            \ddd      1, 2 or 3 octal digits for ascii ddd
            \xhh      1 or 2 hex digits for ascii hh
162
163 If you escape any other character \c, you get \c, i.e., mawk ignores
164 the escape.
165
166 There are really three basic data types; the third is number and string
167 which has both a numeric value and a string value at the same time.
168 User defined variables come into existence when first referenced and
169 are initialized to null, a number and string value which has numeric
170 value 0 and string value "". Non-trivial number and string typed data
171 come from input and are typically stored in fields. (See section 4).
172
The type of an expression is determined by its context and automatic
type conversion occurs if needed. For example, consider the statements

        y = x + 2 ; z = x "hello"

The value stored in variable y will be typed numeric. If x is not
numeric, the value read from x is converted to numeric before it is
added to 2 and stored in y. The value stored in variable z will be
typed string, and the value of x will be converted to string if
necessary and concatenated with "hello". (Of course, the value and type
stored in x are not changed by any conversions.) A string expression is
converted to numeric using its longest numeric prefix as with atof(3).
A numeric expression is converted to string by replacing expr with
sprintf(CONVFMT, expr), unless expr can be represented on the host
machine as an exact integer, in which case it is converted with
sprintf("%d", expr). sprintf() is an AWK built-in that duplicates the
functionality of sprintf(3), and CONVFMT is a built-in variable used for
internal conversion from number to string and initialized to "%.6g".
Explicit type conversions can be forced: expr "" is string and expr+0
is numeric.
194
To evaluate expr1 rel-op expr2: if both operands are numeric or number
and string, the comparison is numeric; if both operands are string,
the comparison is string; if one operand is string, the non-string
operand is converted and the comparison is string. The result is
numeric, 1 or 0.

In boolean contexts such as if ( expr ) statement, a string expression
evaluates true if and only if it is not the empty string ""; a numeric
expression evaluates true if and only if it is not numerically zero.
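The conversion and comparison rules of this section can be checked with a few constants. The sketch below is illustrative; the variable names and the AWK shell variable are assumptions.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    print (0.2e2 == 20)        # numeric comparison, true is 1
    x = "3 apples"
    print x + 2                # longest numeric prefix of x is 3
    a = (10 < 9)               # both numeric: numeric comparison
    b = ("10" < "9")           # both string: string comparison
    print a, b
    y = 0.1 + 0.2
    print (y "")               # number to string through CONVFMT
    if ("0") print "a non-empty string is true"
    if (!0)  print "numeric zero is false"
}')
printf '%s\n' "$out"
```

The string constant "0" is true because it is not empty, while the number 0 is false.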
204
205 3. Regular expressions
206 In the AWK language, records, fields and strings are often tested for
207 matching a regular expression. Regular expressions are enclosed in
208 slashes, and
209
210 expr ~ /r/
211
212 is an AWK expression that evaluates to 1 if expr “matches” r, which
213 means a substring of expr is in the set of strings defined by r. With
214 no match the expression evaluates to 0; replacing ~ with the “not
215 match” operator, !~ , reverses the meaning. As pattern-action pairs,
216
217 /r/ { action } and $0 ~ /r/ { action }
218
219 are the same, and for each input record that matches r, action is exe‐
220 cuted. In fact, /r/ is an AWK expression that is equivalent to ($0 ~
221 /r/) anywhere except when on the right side of a match operator or
222 passed as an argument to a built-in function that expects a regular
223 expression argument.
224
225 AWK uses extended regular expressions as with the -E option of grep(1).
226 The regular expression metacharacters, i.e., those with special meaning
227 in regular expressions are
228
229 \ ^ $ . [ ] | ( ) * + ?
230
231 Regular expressions are built up from characters as follows:
232
233 c matches any non-metacharacter c.
234
235 \c matches a character defined by the same escape
236 sequences used in string constants or the literal
237 character c if \c is not an escape sequence.
238
239 . matches any character (including newline).
240
241 ^ matches the front of a string.
242
243 $ matches the back of a string.
244
245 [c1c2c3...] matches any character in the class c1c2c3... . An
246 interval of characters is denoted c1-c2 inside a
247 class [...].
248
249 [^c1c2c3...] matches any character not in the class c1c2c3...
250
251 Regular expressions are built up from other regular expressions as fol‐
252 lows:
253
254 r1r2 matches r1 followed immediately by r2 (concatena‐
255 tion).
256
257 r1 | r2 matches r1 or r2 (alternation).
258
259 r* matches r repeated zero or more times.
260
261 r+ matches r repeated one or more times.
262
263 r? matches r zero or once.
264
265 (r) matches r, providing grouping.
266
267 The increasing precedence of operators is alternation, concatenation
268 and unary (*, + or ?).
269
270 For example,
271
272 /^[_a-zA-Z][_a-zA-Z0-9]*$/ and
273 /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/
274
275 are matched by AWK identifiers and AWK numeric constants respectively.
276 Note that “.” has to be escaped to be recognized as a decimal point,
277 and that metacharacters are not special inside character classes.
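To see the two regular expressions above at work, the following sketch classifies a handful of sample strings. The sample input and the AWK shell variable are assumptions chosen for illustration.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$(printf 'foo_1\n9lives\n_x\n3.5e-2\n.5\n1e\n' | "$AWK" '
/^[_a-zA-Z][_a-zA-Z0-9]*$/ { print $0 ": identifier"; next }
/^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/ { print $0 ": number"; next }
{ print $0 ": neither" }')
printf '%s\n' "$out"
```

"9lives" starts with a digit and "1e" lacks exponent digits, so neither pattern accepts them.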
278
279 Any expression can be used on the right hand side of the ~ or !~ opera‐
280 tors or passed to a built-in that expects a regular expression. If
281 needed, it is converted to string, and then interpreted as a regular
282 expression. For example,
283
284 BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }
285
286 $0 ~ "^" identifier
287
288 prints all lines that start with an AWK identifier.
289
290 mawk recognizes the empty regular expression, //, which matches the
291 empty string and hence is matched by any string at the front, back and
292 between every character. For example,
293
        echo abc | mawk '{ gsub(//, "X") ; print }'
        XaXbXcX
296
297
298 4. Records and fields
299 Records are read in one at a time, and stored in the field variable $0.
300 The record is split into fields which are stored in $1, $2, ..., $NF.
301 The built-in variable NF is set to the number of fields, and NR and FNR
302 are incremented by 1. Fields above $NF are set to "".
303
Assignment to $0 causes the fields and NF to be recomputed. Assignment
to NF or to a field causes $0 to be reconstructed by concatenating the
$i's separated by OFS. Assignment to a field with index greater than
NF increases NF and causes $0 to be reconstructed.
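A small sketch of the reconstruction rules just described: assigning a field beyond NF extends the record, and assigning $0 splits it again. The sample record, the choice of OFS = "-" and the AWK shell variable are assumptions.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$(printf 'a b c\n' | "$AWK" 'BEGIN { OFS = "-" }
{
    print NF          # three fields on input
    $5 = "e"          # index above NF: NF grows and $4 is empty
    print NF
    print             # $0 was rebuilt with OFS between fields
    $0 = "x y"        # assigning $0 recomputes the fields and NF
    print NF, $2
}')
printf '%s\n' "$out"
```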
308
309 Data input stored in fields is string, unless the entire field has
310 numeric form and then the type is number and string. For example,
311
312 echo 24 24E |
313 mawk '{ print($1>100, $1>"100", $2>100, $2>"100") }'
314 0 1 1 1
315
316 $0 and $2 are string and $1 is number and string. The first comparison
317 is numeric, the second is string, the third is string (100 is converted
318 to "100"), and the last is string.
319
320 5. Expressions and operators
321 The expression syntax is similar to C. Primary expressions are numeric
322 constants, string constants, variables, fields, arrays and function
323 calls. The identifier for a variable, array or function can be a
324 sequence of letters, digits and underscores, that does not start with a
325 digit. Variables are not declared; they exist when first referenced
326 and are initialized to null.
327
New expressions are composed with the following operators in order of
increasing precedence.

        assignment          =  +=  -=  *=  /=  %=  ^=
        conditional         ?  :
        logical or          ||
        logical and         &&
        array membership    in
        matching            ~  !~
        relational          <  >  <=  >=  ==  !=
        concatenation       (no explicit operator)
        add ops             +  -
        mul ops             *  /  %
        unary               +  -
        logical not         !
        exponentiation      ^
        inc and dec         ++  --  (both post and pre)
        field               $
346
347 Assignment, conditional and exponentiation associate right to left; the
348 other operators associate left to right. Any expression can be paren‐
349 thesized.
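A few expressions make the precedence and associativity rules concrete. This is a sketch with an assumed sample record of three numbers; the AWK shell variable is likewise an assumption.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$(printf '10 20 30\n' | "$AWK" '{
    print 2 ^ 3 ^ 2             # right associative: 2 ^ (3 ^ 2)
    print 2 + 3 * 4             # mul ops bind tighter than add ops
    print 1 " " 2 + 3           # add ops bind tighter than concatenation
    print $NF - 1, $(NF - 1)    # $ binds tightest: ($NF) - 1 versus field NF-1
}')
printf '%s\n' "$out"
```

When in doubt, parenthesizing the expression, as in $(NF - 1), avoids relying on the table.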
350
351 6. Arrays
352 Awk provides one-dimensional arrays. Array elements are expressed as
353 array[expr]. Expr is internally converted to string type, so, for
354 example, A[1] and A["1"] are the same element and the actual index is
355 "1". Arrays indexed by strings are called associative arrays. Ini‐
356 tially an array is empty; elements exist when first accessed. An
357 expression, expr in array evaluates to 1 if array[expr] exists, else to
358 0.
359
360 There is a form of the for statement that loops over each index of an
361 array.
362
363 for ( var in array ) statement
364
sets var to each index of array and executes statement. The order in
which var traverses the indices of array is not defined.
367
368 The statement, delete array[expr], causes array[expr] not to exist.
369 mawk supports an extension, delete array, which deletes all elements of
370 array.
371
372 Multidimensional arrays are synthesized with concatenation using the
373 built-in variable SUBSEP. array[expr1,expr2] is equivalent to
374 array[expr1 SUBSEP expr2]. Testing for a multidimensional element uses
375 a parenthesized index, such as
376
377 if ( (i, j) in A ) print A[i, j]
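The array behavior described in this section can be sketched as follows: string indices, creation on first reference, the in test, delete, and SUBSEP-based subscripts. Array and variable names here are illustrative assumptions, as is the AWK shell variable.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    A[1] = "one"
    print A["1"]                 # A[1] and A["1"] are the same element
    print ("x" in A)             # the in test does not create A["x"]
    if (A["y"] == "") seen = 1   # merely referencing A["y"] creates it
    print ("y" in A)
    delete A["y"]
    print ("y" in A)
    M[2, 3] = "cell"             # stored under the index 2 SUBSEP 3
    if ((2, 3) in M) print M[2, 3]
    for (k in M) { split(k, part, SUBSEP); print part[1], part[2] }
}')
printf '%s\n' "$out"
```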
378
379
7. Built-in variables
381 The following variables are built-in and initialized before program
382 execution.
383
384 ARGC number of command line arguments.
385
386 ARGV array of command line arguments, 0..ARGC-1.
387
388 CONVFMT format for internal conversion of numbers to string,
389 initially = "%.6g".
390
391 ENVIRON array indexed by environment variables. An environment
392 string, var=value is stored as ENVIRON[var] = value.
393
394 FILENAME name of the current input file.
395
396 FNR current record number in FILENAME.
397
398 FS splits records into fields as a regular expression.
399
400 NF number of fields in the current record.
401
402 NR current record number in the total input stream.
403
404 OFMT format for printing numbers; initially = "%.6g".
405
406 OFS inserted between fields on output, initially = " ".
407
408 ORS terminates each record on output, initially = "\n".
409
410 RLENGTH length set by the last call to the built-in function,
411 match().
412
413 RS input record separator, initially = "\n".
414
415 RSTART index set by the last call to match().
416
417 SUBSEP used to build multiple array subscripts, initially =
418 "\034".
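As an illustration of several of these variables together, the sketch below prints NR, NF and the last field of each record with a custom OFS, then contrasts OFMT (used by print) with CONVFMT (used for internal conversion). The sample data, the format values and the AWK shell variable are assumptions.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$(printf 'a:b\nc:d:e\n' | "$AWK" -F: '
BEGIN { OFS = "-"; OFMT = "%.2f"; CONVFMT = "%.3f" }
      { print NR, NF, $NF }
END   { x = 3.14159
        print x          # print formats non-integer numbers with OFMT
        print (x "")     # concatenation converts with CONVFMT
      }')
printf '%s\n' "$out"
```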
419
420 8. Built-in functions
421 String functions
422
423 gsub(r,s,t) gsub(r,s)
424 Global substitution, every match of regular expression r in
425 variable t is replaced by string s. The number of replace‐
426 ments is returned. If t is omitted, $0 is used. An & in
427 the replacement string s is replaced by the matched sub‐
428 string of t. \& and \\ put literal & and \, respectively,
429 in the replacement string.
430
431 index(s,t)
432 If t is a substring of s, then the position where t starts
433 is returned, else 0 is returned. The first character of s
434 is in position 1.
435
436 length(s)
              Returns the length of string or array s.
438
439 match(s,r)
440 Returns the index of the first longest match of regular
441 expression r in string s. Returns 0 if no match. As a
442 side effect, RSTART is set to the return value. RLENGTH is
443 set to the length of the match or -1 if no match. If the
444 empty string is matched, RLENGTH is set to 0, and 1 is
445 returned if the match is at the front, and length(s)+1 is
446 returned if the match is at the back.
447
448 split(s,A,r) split(s,A)
449 String s is split into fields by regular expression r and
450 the fields are loaded into array A. The number of fields
451 is returned. See section 11 below for more detail. If r
452 is omitted, FS is used.
453
454 sprintf(format,expr-list)
455 Returns a string constructed from expr-list according to
456 format. See the description of printf() below.
457
458 sub(r,s,t) sub(r,s)
459 Single substitution, same as gsub() except at most one sub‐
460 stitution.
461
462 substr(s,i,n) substr(s,i)
463 Returns the substring of string s, starting at index i, of
464 length n. If n is omitted, the suffix of s, starting at i
465 is returned.
466
467 tolower(s)
468 Returns a copy of s with all upper case characters con‐
469 verted to lower case.
470
471 toupper(s)
472 Returns a copy of s with all lower case characters con‐
473 verted to upper case.
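The string functions above can be tried on small constants. This is an illustrative sketch; the sample strings, variable names and the AWK shell variable are assumptions.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    s = "hello world"
    n = gsub(/o/, "[&]", s)        # & stands for the matched text
    print n, s
    print index("banana", "nan")
    pos = match("foobar", /o+/)    # sets RSTART and RLENGTH as a side effect
    print pos, RSTART, RLENGTH
    print substr("foobar", 4), substr("foobar", 2, 3)
    cnt = split("a,b,c", parts, ",")
    print cnt, parts[3]
    print toupper("abc") tolower("DEF")
    print length("mawk")
}')
printf '%s\n' "$out"
```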
474
475 Time functions
476
477 These are available on systems which support the corresponding C mktime
478 and strftime functions:
479
480 mktime(specification)
481 converts a date specification to a timestamp with the same
482 units as systime. The date specification is a string con‐
483 taining the components of the date as decimal integers:
484
485 YYYY
486 the year, e.g., 2012
487
488 MM the month of the year starting at 1
489
490 DD the day of the month starting at 1
491
492 HH hour (0-23)
493
494 MM minute (0-59)
495
496 SS seconds (0-59)
497
498 DST
499 tells how to treat timezone versus daylight savings
500 time:
501
502 positive
503 DST is in effect
504
505 zero (default)
506 DST is not in effect
507
508 negative
509 mktime() should (use timezone information and sys‐
510 tem databases to) attempt to determine whether DST
511 is in effect at the specified time.
512
513 strftime([format [, timestamp [, utc ]]])
514 formats the given timestamp using the format (passed to the
515 C strftime function):
516
517 · If the format parameter is missing, "%c" is used.
518
519 · If the timestamp parameter is missing, the current
520 value from systime is used.
521
522 · If the utc parameter is present and nonzero, the result
523 is in UTC. Otherwise local time is used.
524
525 systime()
526 returns the current time of day as the number of seconds
527 since the Epoch (1970-01-01 00:00:00 UTC on POSIX systems).
528
529 Arithmetic functions
530
531 atan2(y,x) Arctan of y/x between -pi and pi.
532
533 cos(x) Cosine function, x in radians.
534
535 exp(x) Exponential function.
536
537 int(x) Returns x truncated towards zero.
538
539 log(x) Natural logarithm.
540
541 rand() Returns a random number between zero and one.
542
543 sin(x) Sine function, x in radians.
544
545 sqrt(x) Returns square root of x.
546
547 srand(expr) srand()
548 Seeds the random number generator, using the clock if expr
549 is omitted, and returns the value of the previous seed.
550 Srand(expr) is useful for repeating pseudo random
551 sequences.
552
553 Note: mawk is normally configured to seed the random number
554 generator from the clock at startup, making it unnecessary
555 to call srand(). This feature can be suppressed via condi‐
556 tional compile, or overridden using the -Wrandom option.
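A short sketch of the arithmetic functions, including the use of srand(expr) to repeat a pseudo random sequence. The seed value 42, the variable names and the AWK shell variable are arbitrary assumptions.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    print int(3.7), int(-3.7)        # truncation towards zero
    print sqrt(16), exp(0), log(1)
    printf "%.4f\n", atan2(0, -1)    # pi
    srand(42); a = rand()
    srand(42); b = rand()            # same seed, same sequence
    same = (a == b)
    inrange = (a >= 0 && a < 1)
    print same, inrange
}')
printf '%s\n' "$out"
```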
557
558 9. Input and output
559 There are two output statements, print and printf.
560
561 print writes $0 ORS to standard output.
562
563 print expr1, expr2, ..., exprn
564 writes expr1 OFS expr2 OFS ... exprn ORS to standard out‐
565 put. Numeric expressions are converted to string with
566 OFMT.
567
568 printf format, expr-list
569 duplicates the printf C library function writing to stan‐
570 dard output. The complete ANSI C format specifications are
571 recognized with conversions %c, %d, %e, %E, %f, %g, %G, %i,
572 %o, %s, %u, %x, %X and %%, and conversion qualifiers h and
573 l.
574
575 The argument list to print or printf can optionally be enclosed in
576 parentheses. Print formats numbers using OFMT or "%d" for exact inte‐
577 gers. "%c" with a numeric argument prints the corresponding 8 bit
578 character, with a string argument it prints the first character of the
579 string. The output of print and printf can be redirected to a file or
580 command by appending > file, >> file or | command to the end of the
581 print statement. Redirection opens file or command only once, subse‐
582 quent redirections append to the already open stream. By convention,
583 mawk associates the filename
584
585 · "/dev/stderr" with stderr,
586
587 · "/dev/stdout" with stdout,
588
589 · "-" and "/dev/stdin" with stdin.
590
591 The association with stderr is especially useful because it allows
592 print and printf to be redirected to stderr. These names can also be
593 passed to functions.
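The following sketch shows a printf call with several conversions, and print redirected to "/dev/stderr" so that a diagnostic stays out of standard output. The format string, the message text and the AWK shell variable are assumptions for illustration.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

# Capture stdout only; the stderr message is discarded here.
out=$("$AWK" 'BEGIN {
    printf "%5.2f|%-5s|%03d|%x|%c\n", 3.14159, "ab", 7, 255, 65
    print "warning: sample message" > "/dev/stderr"
}' 2>/dev/null)

# Capture stderr only, to show where the redirected print went.
err=$("$AWK" 'BEGIN { print "oops" > "/dev/stderr" }' 2>&1 >/dev/null)

printf '%s\n%s\n' "$out" "$err"
```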
594
595 The input function getline has the following variations.
596
597 getline
598 reads into $0, updates the fields, NF, NR and FNR.
599
600 getline < file
601 reads into $0 from file, updates the fields and NF.
602
603 getline var
604 reads the next record into var, updates NR and FNR.
605
606 getline var < file
607 reads the next record of file into var.
608
609 command | getline
610 pipes a record from command into $0 and updates the fields
611 and NF.
612
613 command | getline var
614 pipes a record from command into var.
615
616 Getline returns 0 on end-of-file, -1 on error, otherwise 1.
617
618 Commands on the end of pipes are executed by /bin/sh.
619
620 The function close(expr) closes the file or pipe associated with expr.
621 Close returns 0 if expr is an open file, the exit status if expr is a
622 piped command, and -1 otherwise. Close is used to reread a file or
623 command, make sure the other end of an output pipe is finished or con‐
624 serve file resources.
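As a sketch of the command | getline var form together with close(), the loop below reads every record from a command, then closes the pipe so the same command can be run again from the start. The command string and the AWK shell variable are assumptions.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0)   # 1 per record, 0 at end-of-file
        seen = seen line ","
    close(cmd)                         # closing allows the command to be rerun
    cmd | getline first                # reads the first record again
    close(cmd)
    print seen, first
}')
printf '%s\n' "$out"
```

Comparing the return value against 0 explicitly, as in the while condition, keeps the loop from spinning forever when getline returns -1 on error.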
625
626 The function fflush(expr) flushes the output file or pipe associated
627 with expr. Fflush returns 0 if expr is an open output stream else -1.
628 Fflush without an argument flushes stdout. Fflush with an empty argu‐
629 ment ("") flushes all open output.
630
631 The function system(expr) uses the C runtime system call to execute
632 expr and returns the corresponding wait status of the command as fol‐
633 lows:
634
635 · if the system call failed, setting the status to -1, mawk returns
636 that value.
637
638 · if the command exited normally, mawk returns its exit-status.
639
640 · if the command exited due to a signal such as SIGHUP, mawk returns
641 the signal number plus 256.
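For the normal-exit case in the list above, a minimal sketch: the commands "exit 3" and "true" are assumed examples run through the shell, and the AWK shell variable is also an assumption.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    rc = system("exit 3")     # command exits normally with status 3
    print "status", rc
    rc = system("true")       # successful command
    print "status", rc
}')
printf '%s\n' "$out"
```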
642
643 Changes made to the ENVIRON array are not passed to commands executed
644 with system or pipes.
645
646 10. User defined functions
647 The syntax for a user defined function is
648
649 function name( args ) { statements }
650
651 The function body can contain a return statement
652
653 return opt_expr
654
655 A return statement is not required. Function calls may be nested or
656 recursive. Functions are passed expressions by value and arrays by
657 reference. Extra arguments serve as local variables and are initial‐
658 ized to null. For example, csplit(s,A) puts each character of s into
659 array A and returns the length of s.
660
661 function csplit(s, A, n, i)
662 {
663 n = length(s)
664 for( i = 1 ; i <= n ; i++ ) A[i] = substr(s, i, 1)
665 return n
666 }
667
668 Putting extra space between passed arguments and local variables is
669 conventional. Functions can be referenced before they are defined, but
670 the function name and the '(' of the arguments must touch to avoid con‐
671 fusion with concatenation.
672
673 A function parameter is normally a scalar value (number or string). If
674 there is a forward reference to a function using an array as a parame‐
675 ter, the function's corresponding parameter will be treated as an
676 array.
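The sketch below calls the csplit() function from this section and adds a second, hypothetical function bump() to contrast passing an array by reference with passing a scalar by value. The names bump, chars, x and y, and the AWK shell variable, are assumptions for illustration.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" '
function csplit(s, A,    n, i)     # n and i are local variables
{
    n = length(s)
    for (i = 1; i <= n; i++) A[i] = substr(s, i, 1)
    return n
}
# bump() is a hypothetical helper: v is a copy of the argument
function bump(v) { v = v + 1; return v }
BEGIN {
    len = csplit("abc", chars)     # chars is passed by reference and filled in
    print len, chars[1], chars[3]
    x = 1; y = bump(x)
    print x, y                     # x in the caller is unchanged
}')
printf '%s\n' "$out"
```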
677
678 11. Splitting strings, records and files
679 Awk programs use the same algorithm to split strings into arrays with
680 split(), and records into fields on FS. mawk uses essentially the same
681 algorithm to split files into records on RS.
682
683 Split(expr,A,sep) works as follows:
684
685 (1) If sep is omitted, it is replaced by FS. Sep can be an expres‐
686 sion or regular expression. If it is an expression of non-
687 string type, it is converted to string.
688
689 (2) If sep = " " (a single space), then <SPACE> is trimmed from the
690 front and back of expr, and sep becomes <SPACE>. mawk defines
691 <SPACE> as the regular expression /[ \t\n]+/. Otherwise sep is
692 treated as a regular expression, except that meta-characters
693 are ignored for a string of length 1, e.g., split(x, A, "*")
694 and split(x, A, /\*/) are the same.
695
696 (3) If expr is not string, it is converted to string. If expr is
697 then the empty string "", split() returns 0 and A is set empty.
698 Otherwise, all non-overlapping, non-null and longest matches of
699 sep in expr, separate expr into fields which are loaded into A.
700 The fields are placed in A[1], A[2], ..., A[n] and split()
701 returns n, the number of fields which is the number of matches
702 plus one. Data placed in A that looks numeric is typed number
703 and string.
704
705 Splitting records into fields works the same except the pieces are
706 loaded into $1, $2,..., $NF. If $0 is empty, NF is set to 0 and all $i
707 to "".
708
709 mawk splits files into records by the same algorithm, but with the
710 slight difference that RS is really a terminator instead of a separa‐
711 tor. (ORS is really a terminator too).
712
713 E.g., if FS = “:+” and $0 = “a::b:” , then NF = 3 and $1 = “a”, $2
714 = “b” and $3 = "", but if “a::b:” is the contents of an input file
715 and RS = “:+”, then there are two records “a” and “b”.
716
717 RS = " " is not special.
718
719 If FS = "", then mawk breaks the record into individual characters,
720 and, similarly, split(s,A,"") places the individual characters of s
721 into A.
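The splitting rules in steps (1) to (3) can be sketched with split() on a few constant strings. The sample strings, array names and the AWK shell variable are assumptions.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    n = split("  a  b ", A, " ")    # single space: trim, then split on <SPACE>
    print n, A[1], A[2]
    n = split("a*b*c", B, "*")      # a separator of length 1 is taken literally
    print n, B[2]
    n = split("a::b:", C, ":+")     # regular expression separator
    print n, "[" C[3] "]"           # the trailing match leaves an empty last field
    n = split("", D, ",")           # empty string: no fields
    print n
}')
printf '%s\n' "$out"
```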
722
723 12. Multi-line records
724 Since mawk interprets RS as a regular expression, multi-line records
725 are easy. Setting RS = "\n\n+", makes one or more blank lines separate
726 records. If FS = " " (the default), then single newlines, by the rules
727 for <SPACE> above, become space and single newlines are field separa‐
728 tors.
729
730 For example, if
731
732 · a file is "a b\nc\n\n",
733
734 · RS = "\n\n+" and
735
736 · FS = " ",
737
738 then there is one record “a b\nc” with three fields “a”, “b” and
739 “c”:
740
741 · Changing FS = “\n”, gives two fields “a b” and “c”;
742
743 · changing FS = “”, gives one field identical to the record.
744
745 If you want lines with spaces or tabs to be considered blank, set RS =
746 “\n([ \t]*\n)+”. For compatibility with other awks, setting RS = ""
747 has the same effect as if blank lines are stripped from the front and
748 back of files and then records are determined as if RS = “\n\n+”.
749 POSIX requires that “\n” always separates records when RS = "" regard‐
750 less of the value of FS. mawk does not support this convention,
751 because defining “\n” as <SPACE> makes it unnecessary.
752
753 Most of the time when you change RS for multi-line records, you will
754 also want to change ORS to “\n\n” so the record spacing is preserved on
755 output.
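A minimal sketch of multi-line records: blank lines separate records and, with the default FS, newlines inside a record act as field separators. It uses RS = "" rather than a regular expression so that it also runs under awks without regular-expression RS; the sample input and the AWK shell variable are assumptions.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$(printf 'a b\nc\n\nd e\n' | "$AWK" 'BEGIN { RS = "" }
{ print NR ": " NF " fields, last = " $NF }')
printf '%s\n' "$out"
```

The first record is "a b\nc" with three fields, matching the worked example above.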
756
757 13. Program execution
This section describes the order of program execution. First ARGC is
set to the total number of command line arguments passed to the
execution phase of the program. ARGV[0] is set to the name of the AWK
interpreter and ARGV[1] ... ARGV[ARGC-1] hold the remaining command line
arguments exclusive of options and program source. For example with
763
764 mawk -f prog v=1 A t=hello B
765
766 ARGC = 5 with ARGV[0] = "mawk", ARGV[1] = "v=1", ARGV[2] = "A", ARGV[3]
767 = "t=hello" and ARGV[4] = "B".
768
769 Next, each BEGIN block is executed in order. If the program consists
770 entirely of BEGIN blocks, then execution terminates, else an input
771 stream is opened and execution continues. If ARGC equals 1, the input
772 stream is set to stdin, else the command line arguments ARGV[1] ...
773 ARGV[ARGC-1] are examined for a file argument.
774
775 The command line arguments divide into three sets: file arguments,
776 assignment arguments and empty strings "". An assignment has the form
777 var=string. When an ARGV[i] is examined as a possible file argument,
778 if it is empty it is skipped; if it is an assignment argument, the
779 assignment to var takes place and i skips to the next argument; else
780 ARGV[i] is opened for input. If it fails to open, execution terminates
781 with exit code 2. If no command line argument is a file argument, then
782 input comes from stdin. Getline in a BEGIN action opens input. “-” as
783 a file argument denotes stdin.
784
Once an input stream is open, each input record is tested against each
pattern, and if it matches, the associated action is executed. An
expression pattern matches if it is boolean true (see the end of
section 2). A BEGIN pattern matches before any input has been read, and
an END pattern matches after all input has been read. A range pattern,
expr1,expr2 , matches every record between the match of expr1 and the
match of expr2 inclusively.
792
793 When end of file occurs on the input stream, the remaining command line
794 arguments are examined for a file argument, and if there is one it is
795 opened, else the END pattern is considered matched and all END actions
796 are executed.
797
798 In the example, the assignment v=1 takes place after the BEGIN actions
799 are executed, and the data placed in v is typed number and string.
800 Input is then read from file A. On end of file A, t is set to the
801 string "hello", and B is opened for input. On end of file B, the END
802 actions are executed.
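The timing of command line assignments can be observed directly. In this sketch the assignment v=1 is the only argument, so input comes from stdin; the sample input and the AWK shell variable are assumptions.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

out=$(printf 'data\n' | "$AWK" '
BEGIN { print "BEGIN sees [" v "], ARGC = " ARGC }
      { print "main sees [" v "]" }' v=1)
printf '%s\n' "$out"
```

The BEGIN action runs before ARGV is examined, so v is still null there. Use -v var=value when a value is needed inside BEGIN.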
803
804 Program flow at the pattern {action} level can be changed with the
805
806 next
807 nextfile
808 exit opt_expr
809
810 statements:
811
812 · A next statement causes the next input record to be read and pat‐
813 tern testing to restart with the first pattern {action} pair in the
814 program.
815
816 · A nextfile statement tells mawk to stop processing the current
817 input file. It then updates FILENAME to the next file listed on
818 the command line, and resets FNR to 1.
819
820 · An exit statement causes immediate execution of the END actions or
821 program termination if there are none or if the exit occurs in an
822 END action. The opt_expr sets the exit value of the program unless
823 overridden by a later exit or subsequent error.
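A sketch of next and exit in a small program: comment lines are skipped with next, and exit in a main action jumps to the END actions while setting the exit value. The sample input, the status value 7 and the AWK shell variable are assumptions.

```sh
# Prefer mawk; fall back to any awk on PATH (an assumption for portability).
AWK=$(command -v mawk || command -v awk)

status=0
out=$(printf '# comment\na\nb\nc\n' | "$AWK" '
/^#/      { next }        # skip to the next record, no further patterns tested
          { n++ }
$1 == "b" { exit 7 }      # run the END actions, then exit with status 7
END       { print n }') || status=$?
printf '%s (exit status %s)\n' "$out" "$status"
```

The record "c" is never read, so END reports two counted records.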
824
EXAMPLES
826 1. emulate cat.
827
828 { print }
829
830 2. emulate wc.
831
832 { chars += length($0) + 1 # add one for the \n
833 words += NF
834 }
835
836 END{ print NR, words, chars }
837
838 3. count the number of unique “real words”.
839
840 BEGIN { FS = "[^A-Za-z]+" }
841
842 { for(i = 1 ; i <= NF ; i++) word[$i] = "" }
843
844 END { delete word[""]
845 for ( i in word ) cnt++
846 print cnt
847 }
848
849 4. sum the second field of every record based on the first field.
850
851 $1 ~ /credit|gain/ { sum += $2 }
852 $1 ~ /debit|loss/ { sum -= $2 }
853
854 END { print sum }
855
5. sort a file, comparing lines as strings
857
858 { line[NR] = $0 "" } # make sure of comparison type
859 # in case some lines look numeric
860
861 END { isort(line, NR)
862 for(i = 1 ; i <= NR ; i++) print line[i]
863 }
864
865 #insertion sort of A[1..n]
866 function isort( A, n, i, j, hold)
867 {
868 for( i = 2 ; i <= n ; i++)
869 {
870 hold = A[j = i]
871 while ( A[j-1] > hold )
872 { j-- ; A[j+1] = A[j] }
873 A[j] = hold
874 }
875 # sentinel A[0] = "" will be created if needed
876 }
877
878
880 MAWK 1.3.3 versus POSIX 1003.2 Draft 11.3
The POSIX 1003.2 (draft 11.3) definition of the AWK language is AWK as
described in the AWK book with a few extensions that appeared in
SystemVR4 nawk. The extensions are:
884
885 · New functions: toupper() and tolower().
886
887 · New variables: ENVIRON[] and CONVFMT.
888
889 · ANSI C conversion specifications for printf() and sprintf().
890
891 · New command options: -v var=value, multiple -f options and
892 implementation options as arguments to -W.
893
894 · For systems (MS-DOS or Windows) which provide a setmode func‐
895 tion, an environment variable MAWKBINMODE and a built-in vari‐
896 able BINMODE. The bits of the BINMODE value tell mawk how to
897 modify the RS and ORS variables:
898
899 0 set standard input to binary mode, and if BIT-2 is unset, set
900 RS to "\r\n" (CR/LF) rather than "\n" (LF).
901
902 1 set standard output to binary mode, and if BIT-2 is unset,
903 set ORS to "\r\n" (CR/LF) rather than "\n" (LF).
904
905 2 suppress the assignment to RS and ORS of CR/LF, making it
906 possible to run scripts and generate output compatible with
907 Unix line-endings.
908
POSIX AWK is oriented to operate on files a line at a time. RS can be
changed from "\n" to another single character, but it is hard to find
any use for this; there are no examples in the AWK book. By convention,
RS = "" makes one or more blank lines separate records, allowing
multi-line records. When RS = "", "\n" is always a field separator
regardless of the value in FS.
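
A minimal sketch of the RS = "" convention follows. The names and numbers in the sample input are hypothetical. FS is left at its default so that the newline inside each record separates fields in every awk; awk is assumed to be on the PATH.

```sh
# Blank lines separate records; within a record, the newline acts as a
# field separator (here through the default whitespace splitting).
out=$(printf 'Ann\n555-0100\n\n\nBob\n555-0199\n' | awk '
BEGIN { RS = "" }
      { print NR ": " $1 " -> " $2 " (NF=" NF ")" }')
printf '%s\n' "$out"
```

Two blank lines between the entries still produce only two records, since one or more blank lines count as a single separator.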
915
916 mawk, on the other hand, allows RS to be a regular expression. When
917 "\n" appears in records, it is treated as space, and FS always deter‐
918 mines fields.
919
920 Removing the line at a time paradigm can make some programs simpler and
921 can often improve performance. For example, redoing example 3 from
922 above,
923
924 BEGIN { RS = "[^A-Za-z]+" }
925
926 { word[ $0 ] = "" }
927
928 END { delete word[ "" ]
929 for( i in word ) cnt++
930 print cnt
931 }
932
933 counts the number of unique words by making each word a record. On
934 moderate size files, mawk executes twice as fast, because of the sim‐
935 plified inner loop.
936
937 The following program replaces each comment by a single space in a C
938 program file,
939
940 BEGIN {
941 RS = "/\*([^*]|\*+[^/*])*\*+/"
942 # comment is record separator
943 ORS = " "
944 getline hold
945 }
946
947 { print hold ; hold = $0 }
948
949 END { printf "%s" , hold }
950
951 Buffering one record is needed to avoid terminating the last record
952 with a space.
953
954 With mawk, the following are all equivalent,
955
956 x ~ /a\+b/ x ~ "a\+b" x ~ "a\\+b"
957
Strings used as patterns are scanned twice, once as a string and once
as a regular expression. On the string scan, mawk leaves the backslash
in place before characters that are not string escapes, while the AWK
book advocates that \c be recognized as c, which necessitates double
escaping of metacharacters in strings. POSIX explicitly declines to
define the behavior, which passively forces programs that must run
under a variety of awks to use the more portable but less readable
double escape.
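
As an illustration of the portable form, the sketch below uses the double escape to match a literal plus sign. The sample input is hypothetical and awk is assumed to be on the PATH.

```sh
# "a\\+b" becomes a\+b after the string scan, and the regular expression
# scan then treats \+ as a literal plus sign.
out=$(printf 'a+b\naab\nab\n' |
      awk '$0 ~ "a\\+b" { print "literal plus: " $0 }')
printf '%s\n' "$out"
```

Only the line a+b is reported; aab and ab would match if the plus sign were read as a repetition operator.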
965
966 POSIX AWK does not recognize "/dev/std{in,out,err}". Some systems pro‐
967 vide an actual device for this, allowing AWKs which do not implement
968 the feature directly to support it.
969
POSIX AWK does not recognize \x hex escape sequences in strings.
Unlike ANSI C, mawk limits the number of digits that follow \x to two,
as the current implementation only supports 8 bit characters. The
built-in fflush first appeared in a recent (1993) AT&T awk released to
netlib, and is not part of the POSIX standard. Aggregate deletion with
delete array is not part of the POSIX standard.
976
977 POSIX explicitly leaves the behavior of FS = "" undefined, and mentions
978 splitting the record into characters as a possible interpretation, but
979 currently this use is not portable across implementations.
980
981 Random numbers
982 POSIX does not prescribe a method for initializing random numbers at
983 startup.
984
985 In practice, most implementations do nothing special, which makes srand
986 and rand follow the C runtime library, making the initial seed value 1.
987 Some implementations (Solaris XPG4 and Tru64) return 0 from the first
988 call to srand, although the results from rand behave as if the initial
989 seed is 1. Other implementations return 1.
990
991 While mawk can call srand at startup with no parameter (initializing
992 random numbers from the clock), this feature may be suppressed using
993 conditional compilation.
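
Because the startup seed differs between implementations, a program that needs a repeatable sequence can seed explicitly. The following is a sketch only; the seed value 42 is an arbitrary, hypothetical choice and awk is assumed to be on the PATH.

```sh
# Seeding twice with the same value restarts the same sequence, so the
# first rand() after each srand(42) returns the same number.
out=$(awk 'BEGIN {
    srand(42); a = rand()
    srand(42); b = rand()
    result = (a == b) ? "same" : "different"
    print result
}')
printf '%s\n' "$out"
```

The actual numbers returned by rand are not shown, since they depend on the generator each implementation uses.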
994
Extensions added for compatibility with GAWK and BWK
Nextfile is a gawk extension (also implemented by BWK awk). It is not
yet part of the POSIX standard (as of October 2012), although it has
been accepted for the next revision of the standard.
999
1000 Mktime, strftime and systime are gawk extensions.
1001
1002 The "/dev/stdin" feature was added to mawk after 1.3.4, for compatibil‐
1003 ity with gawk and BWK awk. The corresponding "-" (alias for
1004 /dev/stdin) was present in mawk 1.3.3.
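
The "-" alias can be sketched as follows; the two input lines are hypothetical and awk is assumed to be on the PATH.

```sh
# "-" in the file list names standard input.
out=$(printf 'one\ntwo\n' | awk 'END { print NR }' -)
printf '%s\n' "$out"
```

The form with "-" is the one to prefer in scripts that must also run under awks without the "/dev/stdin" feature.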
1005
1006 Subtle Differences not in POSIX or the AWK Book
1007 Finally, here is how mawk handles exceptional cases not discussed in
1008 the AWK book or the POSIX draft. It is unsafe to assume consistency
1009 across awks and safe to skip to the next section.
1010
1011 · substr(s, i, n) returns the characters of s in the intersection
1012 of the closed interval [1, length(s)] and the half-open interval
1013 [i, i+n). When this intersection is empty, the empty string is
1014 returned; so substr("ABC", 1, 0) = "" and substr("ABC", -4, 6) =
1015 "A".
1016
· Every string, including the empty string, matches the empty
  string at the front, so s ~ // and s ~ "" are always 1, as are
  match(s, //) and match(s, ""). The last two set RLENGTH to 0.
1020
· index(s, t) is always the same as match(s, t1) where t1 is the
  same as t with metacharacters escaped. Hence consistency with
  match requires that index(s, "") always returns 1. Also the
  condition, index(s,t) != 0 if and only if t is a substring of s,
  requires index("","") = 1.
1026
· If getline encounters end of file, getline var leaves var
  unchanged. Similarly, on entry to the END actions, $0, the
  fields and NF keep their values from the last record.
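
The getline case in the last item can be checked with a short sketch. It assumes /dev/null exists and reads as an empty file, and that awk is on the PATH; the variable names are hypothetical.

```sh
# getline var returns 0 at end of file and leaves var as it was.
out=$(awk 'BEGIN {
    v = "unchanged"
    r = (getline v < "/dev/null")   # empty file: end of file at once
    print r, v
}')
printf '%s\n' "$out"
```

A loop of the form while ((getline v < file) > 0) therefore ends with v still holding the last line that was read.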
1030
1032 Mawk recognizes these variables:
1033
1034 MAWKBINMODE
1035 (see COMPATIBILITY ISSUES)
1036
1037 MAWK_LONG_OPTIONS
1038 If this is set, mawk uses its value to decide what to do with
1039 GNU-style long options:
1040
1041 allow Mawk allows the option to be checked against the (small)
1042 set of long options it recognizes.
1043
1044 error Mawk prints an error message and exits. This is the
1045 default.
1046
1047 ignore Mawk ignores the option.
1048
warn Mawk prints a warning message and otherwise ignores the
option.
1051
1052 If the variable is unset, mawk prints an error message and exits.
1053
1054 WHINY_USERS
1055 This is an undocumented gawk feature. It tells mawk to sort
1056 array indices before it starts to iterate over the elements of an
1057 array.
1058
1060 grep(1)
1061
1062 Aho, Kernighan and Weinberger, The AWK Programming Language, Addison-
1063 Wesley Publishing, 1988, (the AWK book), defines the language, opening
1064 with a tutorial and advancing to many interesting programs that delve
1065 into issues of software design and analysis relevant to programming in
1066 any language.
1067
1068 The GAWK Manual, The Free Software Foundation, 1991, is a tutorial and
1069 language reference that does not attempt the depth of the AWK book and
1070 assumes the reader may be a novice programmer. The section on AWK
1071 arrays is excellent. It also discusses POSIX requirements for AWK.
1072
1074 mawk implements printf() and sprintf() using the C library functions,
1075 printf and sprintf, so full ANSI compatibility requires an ANSI C
1076 library. In practice this means the h conversion qualifier may not be
1077 available. Also mawk inherits any bugs or limitations of the library
1078 functions.
1079
1080 Implementors of the AWK language have shown a consistent lack of imagi‐
1081 nation when naming their programs.
1082
1084 Mike Brennan (brennan@whidbey.com).
1085 Thomas E. Dickey <dickey@invisible-island.net>.
1086
1087
1088
1089Version 1.3.4 2019-12-31 MAWK(1)