MAWK(1) USER COMMANDS MAWK(1)
2
3
4
NAME
6 mawk - pattern scanning and text processing language
7
SYNOPSIS
9 mawk [-W option] [-F value] [-v var=value] [--] 'program text' [file
10 ...]
11 mawk [-W option] [-F value] [-v var=value] [-f program-file] [--] [file
12 ...]
13
DESCRIPTION
15 mawk is an interpreter for the AWK Programming Language. The AWK lan‐
16 guage is useful for manipulation of data files, text retrieval and pro‐
17 cessing, and for prototyping and experimenting with algorithms. mawk
is a new awk, meaning it implements the AWK language as defined in Aho,
19 Kernighan and Weinberger, The AWK Programming Language, Addison-Wesley
20 Publishing, 1988 (hereafter referred to as the AWK book.) mawk con‐
21 forms to the POSIX 1003.2 (draft 11.3) definition of the AWK language
22 which contains a few features not described in the AWK book, and mawk
23 provides a small number of extensions.
24
25 An AWK program is a sequence of pattern {action} pairs and function
26 definitions. Short programs are entered on the command line usually
27 enclosed in ' ' to avoid shell interpretation. Longer programs can be
28 read in from a file with the -f option. Data input is read from the
29 list of files on the command line or from standard input when the list
30 is empty. The input is broken into records as determined by the record
31 separator variable, RS. Initially, RS = "\n" and records are synony‐
32 mous with lines. Each record is compared against each pattern and if
33 it matches, the program text for {action} is executed.
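
As a minimal sketch of the pattern {action} model described above, the following reads three records from standard input and runs one rule against each. The AWK shell variable and the sample data are assumptions of this example, not part of mawk.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

# Each line is one record; the pattern $2 > 10 selects records, and the
# action prints the first field of every selected record.
selected=$(printf 'apple 5\npear 12\nplum 30\n' | "$AWK" '$2 > 10 { print $1 }')
printf '%s\n' "$selected"
```

Only the records whose second field exceeds 10 reach the action.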
34
OPTIONS
36 -F value sets the field separator, FS, to value.
37
38 -f file Program text is read from file instead of from the com‐
39 mand line. Multiple -f options are allowed.
40
41 -v var=value assigns value to program variable var.
42
43 -- indicates the unambiguous end of options.
44
45 The above options will be available with any POSIX compatible implemen‐
46 tation of AWK. Implementation specific options are prefaced with -W.
47 mawk provides these:
48
49 -W dump writes an assembler like listing of the internal repre‐
50 sentation of the program to stdout and exits 0 (on suc‐
51 cessful compilation).
52
53 -W exec file Program text is read from file and this is the last
54 option.
55
56 This is a useful alternative to -f on systems that sup‐
57 port the #! "magic number" convention for executable
58 scripts. Those implicitly pass the pathname of the
59 script itself as the final parameter, and expect no more
60 than one "-" option on the #! line. Because mawk can
61 combine multiple -W options separated by commas, you can
62 use this option when an additional -W option is needed.
63
64 -W help prints a usage message to stderr and exits (same as
65 “-W usage”).
66
67 -W interactive sets unbuffered writes to stdout and line buffered reads
68 from stdin. Records from stdin are lines regardless of
69 the value of RS.
70
71 -W posix_space forces mawk not to consider '\n' to be space.
72
73 -W random=num calls srand with the given parameter (and overrides the
74 auto-seeding behavior).
75
76 -W sprintf=num adjusts the size of mawk's internal sprintf buffer to
77 num bytes. More than rare use of this option indicates
78 mawk should be recompiled.
79
80 -W usage prints a usage message to stderr and exits (same as
81 “-W help”).
82
83 -W version mawk writes its version and copyright to stdout and com‐
84 piled limits to stderr and exits 0.
85
86 mawk accepts abbreviations for any of these options, e.g., “-W v” and
87 “-Wv” both tell mawk to show its version.
88
89 mawk allows multiple -W options to be combined by separating the
90 options with commas, e.g., -Wsprint=2000,posix. This is useful for
91 executable #! "magic number" invocations in which only one argument is
92 supported, e.g., -Winteractive,exec.
93
THE AWK LANGUAGE
95 1. Program structure
96 An AWK program is a sequence of pattern {action} pairs and user func‐
97 tion definitions.
98
99 A pattern can be:
100 BEGIN
101 END
102 expression
103 expression , expression
104
105 One, but not both, of pattern {action} can be omitted. If {action} is
106 omitted it is implicitly { print }. If pattern is omitted, then it is
107 implicitly matched. BEGIN and END patterns require an action.
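
The rules for omitted patterns and actions can be illustrated with three short programs. This is a sketch only; the input text and the shell variable names are made up for the example.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)
input=$(printf 'alpha\nbeta\ngamma')

# Pattern without action: the action defaults to { print }.
only_pattern=$(printf '%s\n' "$input" | "$AWK" '/^b/')

# Action without pattern: every record matches.
only_action=$(printf '%s\n' "$input" | "$AWK" '{ print NR ": " $0 }')

# BEGIN and END always carry an explicit action.
begin_end=$(printf '%s\n' "$input" | "$AWK" 'BEGIN { print "start" } END { print NR " records" }')

printf '%s\n' "$only_pattern" "$only_action" "$begin_end"
```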
108
109 Statements are terminated by newlines, semi-colons or both. Groups of
110 statements such as actions or loop bodies are blocked via { ... } as in
111 C. The last statement in a block doesn't need a terminator. Blank
112 lines have no meaning; an empty statement is terminated with a semi-
113 colon. Long statements can be continued with a backslash, \. A state‐
114 ment can be broken without a backslash after a comma, left brace, &&,
115 ||, do, else, the right parenthesis of an if, while or for statement,
and the right parenthesis of a function definition. A comment starts
with # and extends to, but does not include, the end of line.
118
119 The following statements control program flow inside blocks.
120
121 if ( expr ) statement
122
123 if ( expr ) statement else statement
124
125 while ( expr ) statement
126
127 do statement while ( expr )
128
129 for ( opt_expr ; opt_expr ; opt_expr ) statement
130
131 for ( var in array ) statement
132
133 continue
134
135 break
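
A small sketch that exercises several of the flow-control statements listed above in a single BEGIN action. The variable names (out, sep, n) are arbitrary choices for the example.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

flow=$("$AWK" 'BEGIN {
    for (i = 1; i <= 10; i++) {
        if (i % 2 == 0) continue      # skip even values
        if (i > 7) break              # leave the loop when i reaches 9
        sep = (out == "" ? "" : " ")
        out = out sep i
    }
    n = 3
    do { out = out " n" n; n-- } while (n > 0)   # body runs before the test
    print out
}')
printf '%s\n' "$flow"
```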
136
137 2. Data types, conversion and comparison
138 There are two basic data types, numeric and string. Numeric constants
139 can be integer like -2, decimal like 1.08, or in scientific notation
140 like -1.1e4 or .28E-3. All numbers are represented internally and all
141 computations are done in floating point arithmetic. So for example,
142 the expression 0.2e2 == 20 is true and true is represented as 1.0.
143
144 String constants are enclosed in double quotes.
145
146 "This is a string with a newline at the end.\n"
147
148 Strings can be continued across a line by escaping (\) the newline.
149 The following escape sequences are recognized.
150
151 \\ \
152 \" "
153 \a alert, ascii 7
154 \b backspace, ascii 8
155 \t tab, ascii 9
156 \n newline, ascii 10
157 \v vertical tab, ascii 11
158 \f formfeed, ascii 12
159 \r carriage return, ascii 13
160 \ddd 1, 2 or 3 octal digits for ascii ddd
161 \xhh 1 or 2 hex digits for ascii hh
162
163 If you escape any other character \c, you get \c, i.e., mawk ignores
164 the escape.
165
166 There are really three basic data types; the third is number and string
167 which has both a numeric value and a string value at the same time.
168 User defined variables come into existence when first referenced and
169 are initialized to null, a number and string value which has numeric
170 value 0 and string value "". Non-trivial number and string typed data
171 come from input and are typically stored in fields. (See section 4).
172
173 The type of an expression is determined by its context and automatic
174 type conversion occurs if needed. For example, to evaluate the state‐
175 ments
176
177 y = x + 2 ; z = x "hello"
178
179 The value stored in variable y will be typed numeric. If x is not
180 numeric, the value read from x is converted to numeric before it is
181 added to 2 and stored in y. The value stored in variable z will be
182 typed string, and the value of x will be converted to string if neces‐
183 sary and concatenated with "hello". (Of course, the value and type
184 stored in x is not changed by any conversions.) A string expression is
185 converted to numeric using its longest numeric prefix as with atof(3).
A numeric expression is converted to string by replacing expr with
sprintf(CONVFMT, expr), unless expr can be represented on the host
machine as an exact integer, in which case it is converted with
sprintf("%d", expr). sprintf() is an AWK built-in that duplicates the
functionality of sprintf(3), and CONVFMT is a built-in variable used for
internal conversion from number to string and initialized to "%.6g".
Explicit type conversions can be forced: expr "" is string and expr+0 is
numeric.
194
To evaluate expr1 rel-op expr2: if both operands are numeric or number
and string, the comparison is numeric; if both operands are string,
the comparison is string; if only one operand is string, the other
operand is converted to string and the comparison is string. The result
is numeric, 1 or 0.
200
In boolean contexts such as if ( expr ) statement, a string expression
evaluates true if and only if it is not the empty string ""; a numeric
expression evaluates true if and only if it is not numerically zero.
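
The conversion and comparison rules of this section can be tried out with a short BEGIN-only program. This is an illustrative sketch; the values were chosen for the example and are not taken from the manual.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

types=$("$AWK" 'BEGIN {
    x = "3 apples"
    y = x + 2                 # numeric context: the numeric prefix of x is 3
    z = x "hello"             # string context: plain concatenation
    print y
    print z
    s = (0.1 + 0.2) ""        # non-integer value, converted with CONVFMT
    print s
    big = 1e3 ""              # exact integer, converted as with "%d"
    print big
    s_cmp = ("10" < "9")      # both operands string: string comparison
    n_cmp = (10 < 9)          # both operands numeric: numeric comparison
    print s_cmp, n_cmp
    if ("0") print "string zero is true"
    if (0) { print "unreachable" } else { print "numeric zero is false" }
}')
printf '%s\n' "$types"
```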
204
205 3. Regular expressions
206 In the AWK language, records, fields and strings are often tested for
207 matching a regular expression. Regular expressions are enclosed in
208 slashes, and
209
210 expr ~ /r/
211
212 is an AWK expression that evaluates to 1 if expr "matches" r, which
213 means a substring of expr is in the set of strings defined by r. With
214 no match the expression evaluates to 0; replacing ~ with the "not
215 match" operator, !~ , reverses the meaning. As pattern-action pairs,
216
217 /r/ { action } and $0 ~ /r/ { action }
218
219 are the same, and for each input record that matches r, action is exe‐
220 cuted. In fact, /r/ is an AWK expression that is equivalent to ($0 ~
221 /r/) anywhere except when on the right side of a match operator or
222 passed as an argument to a built-in function that expects a regular
223 expression argument.
224
225 AWK uses extended regular expressions as with egrep(1). The regular
226 expression metacharacters, i.e., those with special meaning in regular
227 expressions are
228
229 ^ $ . [ ] | ( ) * + ?
230
231 Regular expressions are built up from characters as follows:
232
233 c matches any non-metacharacter c.
234
235 \c matches a character defined by the same escape
236 sequences used in string constants or the literal
237 character c if \c is not an escape sequence.
238
239 . matches any character (including newline).
240
241 ^ matches the front of a string.
242
243 $ matches the back of a string.
244
245 [c1c2c3...] matches any character in the class c1c2c3... . An
246 interval of characters is denoted c1-c2 inside a
247 class [...].
248
249 [^c1c2c3...] matches any character not in the class c1c2c3...
250
251 Regular expressions are built up from other regular expressions as fol‐
252 lows:
253
254 r1r2 matches r1 followed immediately by r2 (concatena‐
255 tion).
256
257 r1 | r2 matches r1 or r2 (alternation).
258
259 r* matches r repeated zero or more times.
260
261 r+ matches r repeated one or more times.
262
263 r? matches r zero or once.
264
265 (r) matches r, providing grouping.
266
267 The increasing precedence of operators is alternation, concatenation
268 and unary (*, + or ?).
269
270 For example,
271
272 /^[_a-zA-Z][_a-zA-Z0-9]*$/ and
273 /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/
274
275 are matched by AWK identifiers and AWK numeric constants respectively.
276 Note that “.” has to be escaped to be recognized as a decimal point,
277 and that metacharacters are not special inside character classes.
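
As a sketch, the two regular expressions above can be used as patterns to classify input lines. The sample inputs and the labels are assumptions made for this example.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

kinds=$(printf '%s\n' foo_1 1foo -1.1e4 .28E-3 1e | "$AWK" '
/^[_a-zA-Z][_a-zA-Z0-9]*$/                           { print $0, "identifier"; next }
/^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/ { print $0, "number"; next }
                                                     { print $0, "neither" }')
printf '%s\n' "$kinds"
```

The anchors ^ and $ force the whole record to match, so "1foo" and "1e" fall through to the last rule.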
278
279 Any expression can be used on the right hand side of the ~ or !~ opera‐
280 tors or passed to a built-in that expects a regular expression. If
281 needed, it is converted to string, and then interpreted as a regular
282 expression. For example,
283
284 BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }
285
286 $0 ~ "^" identifier
287
288 prints all lines that start with an AWK identifier.
289
290 mawk recognizes the empty regular expression, //, which matches the
291 empty string and hence is matched by any string at the front, back and
292 between every character. For example,
293
echo abc | mawk '{ gsub(//, "X") ; print }'
XaXbXcX
296
297
298 4. Records and fields
299 Records are read in one at a time, and stored in the field variable $0.
300 The record is split into fields which are stored in $1, $2, ..., $NF.
301 The built-in variable NF is set to the number of fields, and NR and FNR
302 are incremented by 1. Fields above $NF are set to "".
303
304 Assignment to $0 causes the fields and NF to be recomputed. Assignment
305 to NF or to a field causes $0 to be reconstructed by concatenating the
306 $i's separated by OFS. Assignment to a field with index greater than
307 NF, increases NF and causes $0 to be reconstructed.
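
The following sketch shows how assigning to a field rebuilds $0 with OFS and how assigning past $NF extends the record. The input line is an arbitrary example.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

fields=$(echo 'a b c' | "$AWK" '{
    $2 = "X"; print             # $0 is rebuilt with the default OFS
    OFS = "-"; $1 = $1; print   # any field assignment rebuilds $0 with the new OFS
    $5 = "e"; print NF; print   # NF grows to 5; $4 is created empty
}')
printf '%s\n' "$fields"
```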
308
Data input stored in fields is string, unless the entire field has
numeric form, in which case the type is number and string. For example,
311
312 echo 24 24E |
313 mawk '{ print($1>100, $1>"100", $2>100, $2>"100") }'
314 0 1 1 1
315
316 $0 and $2 are string and $1 is number and string. The first comparison
317 is numeric, the second is string, the third is string (100 is converted
318 to "100"), and the last is string.
319
320 5. Expressions and operators
321 The expression syntax is similar to C. Primary expressions are numeric
322 constants, string constants, variables, fields, arrays and function
323 calls. The identifier for a variable, array or function can be a
324 sequence of letters, digits and underscores, that does not start with a
325 digit. Variables are not declared; they exist when first referenced
326 and are initialized to null.
327
328 New expressions are composed with the following operators in order of
329 increasing precedence.
330
331 assignment = += -= *= /= %= ^=
332 conditional ? :
333 logical or ||
334 logical and &&
335 array membership in
336 matching ~ !~
337 relational < > <= >= == !=
338 concatenation (no explicit operator)
339 add ops + -
340 mul ops * / %
341 unary + -
342 logical not !
343 exponentiation ^
344 inc and dec ++ -- (both post and pre)
345 field $
346
347 Assignment, conditional and exponentiation associate right to left; the
348 other operators associate left to right. Any expression can be paren‐
349 thesized.
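
A few expressions make the precedence and associativity rules above concrete. This is only a sketch with values picked for the example.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

prec=$("$AWK" 'BEGIN {
    print 2 ^ 3 ^ 2        # right associative: 2 ^ (3 ^ 2)
    print 2 + 3 * 4        # mul ops bind tighter than add ops
    print 1 " " 2 + 3      # add ops bind tighter than concatenation
    a = b = 4              # assignment associates right to left
    print a, b
}')
printf '%s\n' "$prec"
```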
350
351 6. Arrays
352 Awk provides one-dimensional arrays. Array elements are expressed as
353 array[expr]. Expr is internally converted to string type, so, for
354 example, A[1] and A["1"] are the same element and the actual index is
355 "1". Arrays indexed by strings are called associative arrays. Ini‐
356 tially an array is empty; elements exist when first accessed. An
357 expression, expr in array evaluates to 1 if array[expr] exists, else to
358 0.
359
360 There is a form of the for statement that loops over each index of an
361 array.
362
363 for ( var in array ) statement
364
sets var to each index of array and executes statement. The order in which
var traverses the indices of array is not defined.
367
368 The statement, delete array[expr], causes array[expr] not to exist.
369 mawk supports an extension, delete array, which deletes all elements of
370 array.
371
372 Multidimensional arrays are synthesized with concatenation using the
373 built-in variable SUBSEP. array[expr1,expr2] is equivalent to
374 array[expr1 SUBSEP expr2]. Testing for a multidimensional element uses
375 a parenthesized index, such as
376
377 if ( (i, j) in A ) print A[i, j]
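
The array rules of this section (string indices, the in operator, delete, and SUBSEP subscripts) can be sketched as follows. Array names A and B are arbitrary.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

arrays=$("$AWK" 'BEGIN {
    A[1] = "one"
    print A["1"]                        # same element as A[1]
    has_x = ("x" in A); has_1 = (1 in A)
    print has_x, has_1                  # testing with "in" does not create A["x"]
    delete A[1]
    still = (1 in A)
    print still
    B["r", "c"] = 7                     # stored under "r" SUBSEP "c"
    if (("r", "c") in B) print B["r" SUBSEP "c"]
    n = 0
    for (k in B) n++                    # B has exactly one element
    print n
}')
printf '%s\n' "$arrays"
```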
378
379
7. Built-in variables
381 The following variables are built-in and initialized before program
382 execution.
383
384 ARGC number of command line arguments.
385
386 ARGV array of command line arguments, 0..ARGC-1.
387
388 CONVFMT format for internal conversion of numbers to string,
389 initially = "%.6g".
390
391 ENVIRON array indexed by environment variables. An environ‐
392 ment string, var=value is stored as ENVIRON[var] =
393 value.
394
395 FILENAME name of the current input file.
396
397 FNR current record number in FILENAME.
398
399 FS splits records into fields as a regular expression.
400
401 NF number of fields in the current record.
402
403 NR current record number in the total input stream.
404
405 OFMT format for printing numbers; initially = "%.6g".
406
407 OFS inserted between fields on output, initially = " ".
408
409 ORS terminates each record on output, initially = "\n".
410
411 RLENGTH length set by the last call to the built-in function,
412 match().
413
414 RS input record separator, initially = "\n".
415
416 RSTART index set by the last call to match().
417
418 SUBSEP used to build multiple array subscripts, initially =
419 "\034".
420
421 8. Built-in functions
422 String functions
423
424 gsub(r,s,t) gsub(r,s)
425 Global substitution, every match of regular expression r
426 in variable t is replaced by string s. The number of
427 replacements is returned. If t is omitted, $0 is used.
428 An & in the replacement string s is replaced by the
429 matched substring of t. \& and \\ put literal & and \,
430 respectively, in the replacement string.
431
432 index(s,t)
433 If t is a substring of s, then the position where t
434 starts is returned, else 0 is returned. The first char‐
435 acter of s is in position 1.
436
437 length(s)
Returns the length of the string or array s.
439
440 match(s,r)
441 Returns the index of the first longest match of regular
442 expression r in string s. Returns 0 if no match. As a
443 side effect, RSTART is set to the return value. RLENGTH
444 is set to the length of the match or -1 if no match. If
445 the empty string is matched, RLENGTH is set to 0, and 1
446 is returned if the match is at the front, and length(s)+1
447 is returned if the match is at the back.
448
449 split(s,A,r) split(s,A)
450 String s is split into fields by regular expression r and
451 the fields are loaded into array A. The number of fields
452 is returned. See section 11 below for more detail. If r
453 is omitted, FS is used.
454
455 sprintf(format,expr-list)
456 Returns a string constructed from expr-list according to
457 format. See the description of printf() below.
458
459 sub(r,s,t) sub(r,s)
460 Single substitution, same as gsub() except at most one
461 substitution.
462
463 substr(s,i,n) substr(s,i)
464 Returns the substring of string s, starting at index i,
465 of length n. If n is omitted, the suffix of s, starting
466 at i is returned.
467
468 tolower(s)
469 Returns a copy of s with all upper case characters con‐
470 verted to lower case.
471
472 toupper(s)
473 Returns a copy of s with all lower case characters con‐
474 verted to upper case.
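
A short sketch exercising several of the string functions described above. The strings are sample data chosen for the example.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

strings=$("$AWK" 'BEGIN {
    pos = match("foobar", /o+/)
    print pos, RSTART, RLENGTH       # leftmost longest match "oo" starts at 2
    pos = match("foobar", /z/)
    print pos, RLENGTH               # no match: 0 and -1
    s = "hello"
    n = gsub(/l/, "[&]", s)          # & stands for the matched text
    print n, s
    print index("foobar", "bar"), substr("foobar", 4), substr("foobar", 2, 3)
    print toupper("foo") tolower("BAR")
}')
printf '%s\n' "$strings"
```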
475
476 Time functions
477
478 These are available on systems which support the corresponding C mktime
479 and strftime functions:
480
481 mktime(specification)
482 converts a date specification to a timestamp with the
483 same units as systime. The date specification is a
484 string containing the components of the date as decimal
485 integers:
486
487 YYYY
488 the year, e.g., 2012
489
490 MM the month of the year starting at 1
491
492 DD the day of the month starting at 1
493
494 HH hour (0-23)
495
496 MM minute (0-59)
497
498 SS seconds (0-59)
499
500 DST
501 tells how to treat timezone versus daylight savings
502 time:
503
504 positive
505 DST is in effect
506
507 zero (default)
508 DST is not in effect
509
510 negative
511 mktime() should (use timezone information and sys‐
512 tem databases to) attempt to determine whether DST
513 is in effect at the specified time.
514
515 strftime([format [, timestamp [, utc ]]])
516 formats the given timestamp using the format (passed to
517 the C strftime function):
518
519 · If the format parameter is missing, "%c" is used.
520
521 · If the timestamp parameter is missing, the current
522 value from systime is used.
523
524 · If the utc parameter is present and nonzero, the
525 result is in UTC. Otherwise local time is used.
526
527 systime()
528 returns the current time of day as the number of seconds
529 since the Epoch (1970-01-01 00:00:00 UTC on POSIX sys‐
530 tems).
531
532 Arithmetic functions
533
534 atan2(y,x) Arctan of y/x between -pi and pi.
535
536 cos(x) Cosine function, x in radians.
537
538 exp(x) Exponential function.
539
540 int(x) Returns x truncated towards zero.
541
542 log(x) Natural logarithm.
543
544 rand() Returns a random number between zero and one.
545
546 sin(x) Sine function, x in radians.
547
548 sqrt(x) Returns square root of x.
549
550 srand(expr) srand()
551 Seeds the random number generator, using the clock if
552 expr is omitted, and returns the value of the previous
553 seed. Srand(expr) is useful for repeating pseudo random
554 sequences.
555
556 Note: mawk is normally configured to seed the random num‐
557 ber generator from the clock at startup, making it unnec‐
558 essary to call srand(). This feature can be suppressed
559 via conditional compile, or overridden using the -Wrandom
560 option.
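
As a sketch of repeating a pseudo random sequence, the program below seeds the generator twice with the same value and compares the draws. The seed 42 is arbitrary.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

repeat=$("$AWK" 'BEGIN {
    srand(42); a = rand(); b = rand()
    srand(42); c = rand(); d = rand()     # same seed, same sequence
    same = (a == c && b == d)
    in_range = (a >= 0 && a < 1)
    print same, in_range
}')
printf '%s\n' "$repeat"
```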
561
562 9. Input and output
563 There are two output statements, print and printf.
564
565 print writes $0 ORS to standard output.
566
567 print expr1, expr2, ..., exprn
568 writes expr1 OFS expr2 OFS ... exprn ORS to standard out‐
569 put. Numeric expressions are converted to string with
570 OFMT.
571
572 printf format, expr-list
573 duplicates the printf C library function writing to stan‐
574 dard output. The complete ANSI C format specifications
575 are recognized with conversions %c, %d, %e, %E, %f, %g,
576 %G, %i, %o, %s, %u, %x, %X and %%, and conversion quali‐
577 fiers h and l.
578
579 The argument list to print or printf can optionally be enclosed in
580 parentheses. Print formats numbers using OFMT or "%d" for exact inte‐
581 gers. "%c" with a numeric argument prints the corresponding 8 bit
582 character, with a string argument it prints the first character of the
583 string. The output of print and printf can be redirected to a file or
584 command by appending > file, >> file or | command to the end of the
585 print statement. Redirection opens file or command only once, subse‐
586 quent redirections append to the already open stream. By convention,
587 mawk associates the filename
588
589 · "/dev/stderr" with stderr,
590
591 · "/dev/stdout" with stdout,
592
593 · "-" and "/dev/stdin" with stdin.
594
595 The association with stderr is especially useful because it allows
596 print and printf to be redirected to stderr. These names can also be
597 passed to functions.
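
The "/dev/stderr" association can be checked from the shell by capturing each stream separately. This is a sketch; the messages are placeholders.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)
prog='BEGIN { print "to stdout"; print "to stderr" > "/dev/stderr" }'

out=$("$AWK" "$prog" 2>/dev/null)        # keep stdout, discard stderr
err=$("$AWK" "$prog" 2>&1 >/dev/null)    # keep stderr, discard stdout
printf 'stdout: %s\nstderr: %s\n' "$out" "$err"
```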
598
599 The input function getline has the following variations.
600
601 getline
602 reads into $0, updates the fields, NF, NR and FNR.
603
604 getline < file
605 reads into $0 from file, updates the fields and NF.
606
607 getline var
608 reads the next record into var, updates NR and FNR.
609
610 getline var < file
611 reads the next record of file into var.
612
613 command | getline
614 pipes a record from command into $0 and updates the
615 fields and NF.
616
617 command | getline var
618 pipes a record from command into var.
619
620 Getline returns 0 on end-of-file, -1 on error, otherwise 1.
621
622 Commands on the end of pipes are executed by /bin/sh.
623
624 The function close(expr) closes the file or pipe associated with expr.
625 Close returns 0 if expr is an open file, the exit status if expr is a
626 piped command, and -1 otherwise. Close is used to reread a file or
627 command, make sure the other end of an output pipe is finished or con‐
628 serve file resources.
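
The sketch below reads every record from a command with getline, then uses close() so that the same command can be run and read again. The command string is an arbitrary example.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

piped=$("$AWK" 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0) n++   # 1 per record, 0 at end-of-file
    close(cmd)                             # the next read restarts the command
    cmd | getline first
    print n, first
}')
printf '%s\n' "$piped"
```

The parentheses around cmd | getline line keep the comparison with 0 from being parsed as part of the getline expression.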
629
630 The function fflush(expr) flushes the output file or pipe associated
631 with expr. Fflush returns 0 if expr is an open output stream else -1.
632 Fflush without an argument flushes stdout. Fflush with an empty argu‐
633 ment ("") flushes all open output.
634
635 The function system(expr) uses the C runtime system call to execute
636 expr and returns the corresponding wait status of the command as fol‐
637 lows:
638
639 · if the system call failed, setting the status to -1, mawk returns
640 that value.
641
642 · if the command exited normally, mawk returns its exit-status.
643
644 · if the command exited due to a signal such as SIGHUP, mawk returns
645 the signal number plus 256.
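
For the normal-exit case, the return value of system() can be observed directly. This is a minimal sketch; the exit code 3 is arbitrary.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

sysrc=$("$AWK" 'BEGIN { rc = system("exit 3"); print "status", rc }')
printf '%s\n' "$sysrc"
```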
646
647 Changes made to the ENVIRON array are not passed to commands executed
648 with system or pipes.
649
650 10. User defined functions
651 The syntax for a user defined function is
652
653 function name( args ) { statements }
654
655 The function body can contain a return statement
656
657 return opt_expr
658
659 A return statement is not required. Function calls may be nested or
660 recursive. Functions are passed expressions by value and arrays by
661 reference. Extra arguments serve as local variables and are initial‐
662 ized to null. For example, csplit(s,A) puts each character of s into
663 array A and returns the length of s.
664
665 function csplit(s, A, n, i)
666 {
667 n = length(s)
668 for( i = 1 ; i <= n ; i++ ) A[i] = substr(s, i, 1)
669 return n
670 }
671
672 Putting extra space between passed arguments and local variables is
673 conventional. Functions can be referenced before they are defined, but
674 the function name and the '(' of the arguments must touch to avoid con‐
675 fusion with concatenation.
676
677 A function parameter is normally a scalar value (number or string). If
678 there is a forward reference to a function using an array as a parame‐
679 ter, the function's corresponding parameter will be treated as an
680 array.
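
The passing rules (scalars by value, arrays by reference, extra parameters as locals) can be sketched with a hypothetical helper named fill(), which is not part of mawk.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

calls=$("$AWK" '
function fill(A, n,    i) {       # i is a local variable
    for (i = 1; i <= n; i++) A[i] = i * i
    n = 0                         # changes only the local copy of n
    return i - 1                  # number of elements stored
}
BEGIN {
    n = 3
    r = fill(sq, n)               # sq is filled in place
    untouched = (i == "")         # the global i was never set
    print n, r, sq[3], untouched
}')
printf '%s\n' "$calls"
```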
681
682 11. Splitting strings, records and files
683 Awk programs use the same algorithm to split strings into arrays with
684 split(), and records into fields on FS. mawk uses essentially the same
685 algorithm to split files into records on RS.
686
687 Split(expr,A,sep) works as follows:
688
689 (1) If sep is omitted, it is replaced by FS. Sep can be an expres‐
690 sion or regular expression. If it is an expression of non-
691 string type, it is converted to string.
692
693 (2) If sep = " " (a single space), then <SPACE> is trimmed from the
694 front and back of expr, and sep becomes <SPACE>. mawk defines
695 <SPACE> as the regular expression /[ \t\n]+/. Otherwise sep is
696 treated as a regular expression, except that meta-characters
697 are ignored for a string of length 1, e.g., split(x, A, "*")
698 and split(x, A, /\*/) are the same.
699
700 (3) If expr is not string, it is converted to string. If expr is
701 then the empty string "", split() returns 0 and A is set empty.
702 Otherwise, all non-overlapping, non-null and longest matches of
703 sep in expr, separate expr into fields which are loaded into A.
704 The fields are placed in A[1], A[2], ..., A[n] and split()
705 returns n, the number of fields which is the number of matches
706 plus one. Data placed in A that looks numeric is typed number
707 and string.
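
The three steps above can be sketched with a few split() calls. The input strings are sample data, and the brackets around A[3] only make the empty field visible.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

splits=$("$AWK" 'BEGIN {
    n = split("a::b:", A, ":+");  print n, A[1], A[2], "[" A[3] "]"  # trailing empty field
    n = split("  x  y ", B, " "); print n, B[1], B[2]                # <SPACE> is trimmed first
    n = split("p*q", C, "*");     print n, C[1], C[2]                # one-character sep is literal
    n = split("", D, ",");        print n                            # empty string: 0 fields
}')
printf '%s\n' "$splits"
```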
708
709 Splitting records into fields works the same except the pieces are
710 loaded into $1, $2,..., $NF. If $0 is empty, NF is set to 0 and all $i
711 to "".
712
713 mawk splits files into records by the same algorithm, but with the
714 slight difference that RS is really a terminator instead of a separa‐
715 tor. (ORS is really a terminator too).
716
717 E.g., if FS = ":+" and $0 = "a::b:" , then NF = 3 and $1 = "a",
718 $2 = "b" and $3 = "", but if "a::b:" is the contents of an input
719 file and RS = ":+", then there are two records "a" and "b".
720
721 RS = " " is not special.
722
723 If FS = "", then mawk breaks the record into individual characters,
724 and, similarly, split(s,A,"") places the individual characters of s
725 into A.
726
727 12. Multi-line records
Since mawk interprets RS as a regular expression, multi-line records
are easy. Setting RS = "\n\n+" makes one or more blank lines separate
records. If FS = " " (the default), then single newlines, by the rules
for <SPACE> above, become space, so single newlines are field
separators.
733
734 For example, if
735
736 · a file is "a b\nc\n\n",
737
738 · RS = "\n\n+" and
739
740 · FS = " ",
741
742 then there is one record "a b\nc" with three fields "a", "b" and
743 "c":
744
745 · Changing FS = "\n", gives two fields "a b" and "c";
746
747 · changing FS = "", gives one field identical to the record.
748
749 If you want lines with spaces or tabs to be considered blank, set RS =
750 "\n([ \t]*\n)+". For compatibility with other awks, setting RS = ""
751 has the same effect as if blank lines are stripped from the front and
752 back of files and then records are determined as if RS = "\n\n+".
753 POSIX requires that "\n" always separates records when RS = "" regard‐
754 less of the value of FS. mawk does not support this convention,
755 because defining "\n" as <SPACE> makes it unnecessary.
756
757 Most of the time when you change RS for multi-line records, you will
758 also want to change ORS to "\n\n" so the record spacing is preserved on
759 output.
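
A minimal sketch of multi-line records, using RS = "" because that setting is also understood by other awks. The sample input holds two paragraphs separated by blank lines.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

paras=$(printf 'a b\nc\n\n\nd e\n' | "$AWK" 'BEGIN { RS = "" } { print NR ": " NF }')
printf '%s\n' "$paras"
```

With the default FS, the newline inside the first paragraph separates fields, so that record has three fields.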
760
761 13. Program execution
This section describes the order of program execution. First, ARGC is
set to the total number of command line arguments passed to the
execution phase of the program. ARGV[0] is set to the name of the AWK
interpreter and ARGV[1] ... ARGV[ARGC-1] hold the remaining command line
arguments exclusive of options and program source. For example, with
767
768 mawk -f prog v=1 A t=hello B
769
770 ARGC = 5 with ARGV[0] = "mawk", ARGV[1] = "v=1", ARGV[2] = "A", ARGV[3]
771 = "t=hello" and ARGV[4] = "B".
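
The ARGC and ARGV values for this command line can be printed from a BEGIN action. This sketch passes the program text inline instead of with -f prog and skips ARGV[0], whose exact value depends on how the interpreter was invoked.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

# A BEGIN-only program never opens input, so A and B need not exist.
args=$("$AWK" 'BEGIN { print ARGC; for (i = 1; i < ARGC; i++) print i, ARGV[i] }' v=1 A t=hello B)
printf '%s\n' "$args"
```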
772
773 Next, each BEGIN block is executed in order. If the program consists
774 entirely of BEGIN blocks, then execution terminates, else an input
775 stream is opened and execution continues. If ARGC equals 1, the input
776 stream is set to stdin, else the command line arguments ARGV[1] ...
777 ARGV[ARGC-1] are examined for a file argument.
778
779 The command line arguments divide into three sets: file arguments,
780 assignment arguments and empty strings "". An assignment has the form
781 var=string. When an ARGV[i] is examined as a possible file argument,
782 if it is empty it is skipped; if it is an assignment argument, the
783 assignment to var takes place and i skips to the next argument; else
784 ARGV[i] is opened for input. If it fails to open, execution terminates
785 with exit code 2. If no command line argument is a file argument, then
786 input comes from stdin. Getline in a BEGIN action opens input. “-” as
787 a file argument denotes stdin.
788
789 Once an input stream is open, each input record is tested against each
790 pattern, and if it matches, the associated action is executed. An
791 expression pattern matches if it is boolean true (see the end of sec‐
792 tion 2). A BEGIN pattern matches before any input has been read, and
an END pattern matches after all input has been read. A range pattern,
expr1,expr2 , matches every record from the one that matches expr1
through the one that matches expr2, inclusive.
796
797 When end of file occurs on the input stream, the remaining command line
798 arguments are examined for a file argument, and if there is one it is
799 opened, else the END pattern is considered matched and all END actions
800 are executed.
801
802 In the example, the assignment v=1 takes place after the BEGIN actions
803 are executed, and the data placed in v is typed number and string.
804 Input is then read from file A. On end of file A, t is set to the
805 string "hello", and B is opened for input. On end of file B, the END
806 actions are executed.
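
The timing of command line assignments can be observed with two small temporary files standing in for A and B. The temporary directory and file contents are assumptions of this sketch.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)
tmp=$(mktemp -d)
printf 'a1\n' > "$tmp/A"
printf 'b1\n' > "$tmp/B"

timing=$("$AWK" '
BEGIN { print "BEGIN v=[" v "]" }            # v=1 has not been assigned yet
      { print "v=" v, "t=[" t "]", $0 }
END   { print "END t=" t }' v=1 "$tmp/A" t=hello "$tmp/B")
rm -rf "$tmp"
printf '%s\n' "$timing"
```

Each assignment takes effect only when the argument scan reaches it, which is why t is still empty while A is being read.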
807
808 Program flow at the pattern {action} level can be changed with the
809
810 next
811 nextfile
812 exit opt_expr
813
814 statements:
815
816 · A next statement causes the next input record to be read and pat‐
817 tern testing to restart with the first pattern {action} pair in the
818 program.
819
820 · A nextfile statement tells mawk to stop processing the current
821 input file. It then updates FILENAME to the next file listed on
822 the command line, and resets FNR to 1.
823
824 · An exit statement causes immediate execution of the END actions or
825 program termination if there are none or if the exit occurs in an
826 END action. The opt_expr sets the exit value of the program unless
827 overridden by a later exit or subsequent error.
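
A sketch of next and exit acting at the pattern {action} level. The input numbers and the exit value 7 are arbitrary choices for the example.

```shell
# Assumption: prefer mawk, fall back to any POSIX awk found on PATH.
AWK=$(command -v mawk || command -v awk)

status=0
seen=$(printf '1\n2\n3\n4\n' | "$AWK" '
$1 == 2 { next }          # skip this record; later rules are not tried
$1 == 4 { exit 7 }        # run the END actions, then exit with value 7
        { print "saw", $1 }
END     { print "done" }') || status=$?
printf '%s\nexit value: %s\n' "$seen" "$status"
```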
828
EXAMPLES
830 1. emulate cat.
831
832 { print }
833
834 2. emulate wc.
835
836 { chars += length($0) + 1 # add one for the \n
837 words += NF
838 }
839
840 END{ print NR, words, chars }
841
842 3. count the number of unique "real words".
843
844 BEGIN { FS = "[^A-Za-z]+" }
845
846 { for(i = 1 ; i <= NF ; i++) word[$i] = "" }
847
848 END { delete word[""]
849 for ( i in word ) cnt++
850 print cnt
851 }
852
853 4. sum the second field of every record based on the first field.
854
855 $1 ~ /credit|gain/ { sum += $2 }
856 $1 ~ /debit|loss/ { sum -= $2 }
857
858 END { print sum }
859
860 5. sort a file, comparing as string
861
862 { line[NR] = $0 "" } # make sure of comparison type
863 # in case some lines look numeric
864
865 END { isort(line, NR)
866 for(i = 1 ; i <= NR ; i++) print line[i]
867 }
868
869 #insertion sort of A[1..n]
870 function isort( A, n, i, j, hold)
871 {
872 for( i = 2 ; i <= n ; i++)
873 {
874 hold = A[j = i]
875 while ( A[j-1] > hold )
876 { j-- ; A[j+1] = A[j] }
877 A[j] = hold
878 }
879 # sentinel A[0] = "" will be created if needed
880 }

MAWK 1.3.3 versus POSIX 1003.2 Draft 11.3
The POSIX 1003.2 (draft 11.3) definition of the AWK language is AWK as
described in the AWK book with a few extensions that appeared in
SystemVR4 nawk. The extensions are:

·  New functions: toupper() and tolower().

·  New variables: ENVIRON[] and CONVFMT.

·  ANSI C conversion specifications for printf() and sprintf().

·  New command options: -v var=value, multiple -f options and
   implementation options as arguments to -W.

·  For systems (MS-DOS or Windows) which provide a setmode function,
   an environment variable MAWKBINMODE and a built-in variable
   BINMODE. The bits of the BINMODE value tell mawk how to modify the
   RS and ORS variables:

   0  set standard input to binary mode, and if BIT-2 is unset, set
      RS to "\r\n" (CR/LF) rather than "\n" (LF).

   1  set standard output to binary mode, and if BIT-2 is unset, set
      ORS to "\r\n" (CR/LF) rather than "\n" (LF).

   2  suppress the assignment to RS and ORS of CR/LF, making it
      possible to run scripts and generate output compatible with
      Unix line-endings.

POSIX AWK is oriented to operate on files a line at a time. RS can be
changed from "\n" to another single character, but it is hard to find
any use for this; there are no examples in the AWK book. By
convention, RS = "" makes one or more blank lines separate records,
allowing multi-line records. When RS = "", "\n" is always a field
separator regardless of the value in FS.
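
A minimal sketch of the RS = "" convention. The helper name
count_paragraph_fields and the sample text are invented. FS is left at
its default, so the newline inside a record counts as white space under
any awk, and the example does not depend on how a non-default FS
interacts with RS = "".

```sh
# Hypothetical helper name; the sample text is invented.
count_paragraph_fields() {
    awk 'BEGIN { RS = "" }      # blank lines now separate records
         { printf "record %d: %d fields, first is %s\n", NR, NF, $1 }'
}

printf 'alpha beta\ngamma\n\n\ndelta\n' | count_paragraph_fields
```

The two blank lines act as one separator, so the input yields two
records: the first spans two lines and has three fields, the second has
one.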

mawk, on the other hand, allows RS to be a regular expression. When
"\n" appears in records, it is treated as space, and FS always
determines fields.

Removing the line-at-a-time paradigm can make some programs simpler and
can often improve performance. For example, redoing example 3 from
above,

     BEGIN { RS = "[^A-Za-z]+" }

     { word[ $0 ] = "" }

     END { delete  word[ "" ]
       for( i in word )  cnt++
       print cnt
     }

counts the number of unique words by making each word a record. On
moderate-size files, mawk executes twice as fast, because of the
simplified inner loop.

The following program replaces each comment by a single space in a C
program file,

     BEGIN {
       RS = "/\*([^*]|\*+[^/*])*\*+/"
            # comment is record separator
       ORS = " "
       getline  hold
     }

     { print hold ; hold = $0 }

     END { printf "%s" , hold }

Buffering one record is needed to avoid terminating the last record
with a space.

With mawk, the following are all equivalent,

     x ~ /a\+b/    x ~ "a\+b"     x ~ "a\\+b"

The strings get scanned twice, once as string and once as regular
expression. On the string scan, mawk ignores the escape on non-escape
characters while the AWK book advocates \c be recognized as c which
necessitates the double escaping of meta-characters in strings. POSIX
explicitly declines to define the behavior which passively forces
programs that must run under a variety of awks to use the more
portable but less readable, double escape.
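
A short sketch of the portable double-escape form. The helper name
literal_plus and the sample input are invented for the demo.

```sh
# Hypothetical helper; prints the input lines that contain a literal "a+b".
literal_plus() {
    awk '$0 ~ "a\\+b" { print }'   # string "a\\+b" becomes the regex a\+b
}

printf 'a+b\naab\nxa+by\n' | literal_plus
```

The string scan turns \\ into a single backslash, so the regular
expression scan sees an escaped plus sign. Only the lines a+b and xa+by
are printed; aab would match only if the plus were left unescaped.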

POSIX AWK does not recognize "/dev/std{in,out,err}". Some systems
provide an actual device for this, allowing AWKs which do not implement
the feature directly to support it.

POSIX AWK does not recognize \x hex escape sequences in strings.
Unlike ANSI C, mawk limits the number of digits that follow \x to two
as the current implementation only supports 8 bit characters. The
built-in fflush first appeared in a recent (1993) AT&T awk released to
netlib, and is not part of the POSIX standard. Aggregate deletion with
delete array is not part of the POSIX standard.
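
Where delete array cannot be relied on, a commonly used substitute is
split("", array), because split discards the existing elements of its
target before storing new ones. This is a sketch of that idiom, not
something this page prescribes. The names clear_array_demo and seen are
invented.

```sh
# Hypothetical demo of emptying an array without "delete array".
clear_array_demo() {
    awk 'BEGIN {
        before = split("x y z", seen)   # seen[1], seen[2], seen[3]
        split("", seen)                 # empties seen without "delete seen"
        after = 0
        for (k in seen) after++
        print before, after
    }'
}

clear_array_demo
```

The output is "3 0": three elements before the second split, none
after it.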

POSIX explicitly leaves the behavior of FS = "" undefined, and mentions
splitting the record into characters as a possible interpretation, but
currently this use is not portable across implementations.

Random numbers
POSIX does not prescribe a method for initializing random numbers at
startup.

In practice, most implementations do nothing special, which makes srand
and rand follow the C runtime library, making the initial seed value 1.
Some implementations (Solaris XPG4 and Tru64) return 0 from the first
call to srand, although the results from rand behave as if the initial
seed is 1. Other implementations return 1.

While mawk can call srand at startup with no parameter (initializing
random numbers from the clock), this feature may be suppressed using
conditional compilation.
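
Because the startup seed differs between implementations, a program
that needs a repeatable sequence can seed the generator itself. This is
a minimal sketch. The seed 42 and the name repeatable_rand are
arbitrary choices for the demo.

```sh
# Hypothetical demo: an explicit seed makes rand() repeatable.
repeatable_rand() {
    awk 'BEGIN {
        srand(42); a = rand()     # 42 is an arbitrary seed
        srand(42); b = rand()     # same seed, so the sequence restarts
        print ((a == b) ? "repeatable" : "different")
    }'
}

repeatable_rand
```

This prints "repeatable". The actual values of rand() are deliberately
not shown, since they depend on the implementation.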

Extensions added for compatibility for GAWK and BWK
Nextfile is a gawk extension (also implemented by BWK awk). It is not
yet part of the POSIX standard (as of October 2012), although it has
been accepted for the next revision of the standard.

Mktime, strftime and systime are gawk extensions.

The "/dev/stdin" feature was added to mawk after 1.3.4, for
compatibility with gawk and BWK awk. The corresponding "-" (alias for
/dev/stdin) was present in mawk 1.3.3.

Subtle Differences not in POSIX or the AWK Book
Finally, here is how mawk handles exceptional cases not discussed in
the AWK book or the POSIX draft. It is unsafe to assume consistency
across awks and safe to skip to the next section.

·  substr(s, i, n) returns the characters of s in the intersection
   of the closed interval [1, length(s)] and the half-open interval
   [i, i+n). When this intersection is empty, the empty string is
   returned; so substr("ABC", 1, 0) = "" and substr("ABC", -4, 6) =
   "A".

·  Every string, including the empty string, matches the empty
   string at the front so, s ~ // and s ~ "", are always 1 as is
   match(s, //) and match(s, ""). The last two set RLENGTH to 0.

·  index(s, t) is always the same as match(s, t1) where t1 is the
   same as t with metacharacters escaped. Hence consistency with
   match requires that index(s, "") always returns 1. Also the
   condition, index(s,t) != 0 if and only if t is a substring of s,
   requires index("","") = 1.

·  If getline encounters end of file, getline var, leaves var
   unchanged. Similarly, on entry to the END actions, $0, the
   fields and NF have their value unaltered from the last record.
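
The substr and index rules above can be probed with a few calls. This
is a sketch only, and the name substr_edges is invented. As the text
warns, the last two results are mawk's and other awks may print
something else, so no fixed output is claimed for them.

```sh
# Hypothetical probe of the edge cases listed above.
substr_edges() {
    awk 'BEGIN {
        printf "[%s]\n", substr("ABC", 1, 0)    # [1,1) is empty
        printf "[%s]\n", substr("ABC", 2, 5)    # [2,7) meets [1,3] in 2..3
        printf "[%s]\n", substr("ABC", -4, 6)   # [-4,2) meets [1,3] in 1..1
        print index("abc", "")                  # 1 under the rule above
    }'
}

substr_edges
```

The first two lines are [] and [BC]. Under the interval rule the third
is [A] and the fourth is 1.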

Mawk recognizes these variables:

   MAWKBINMODE
      (see COMPATIBILITY ISSUES)

   MAWK_LONG_OPTIONS
      If this is set, mawk uses its value to decide what to do with
      GNU-style long options:

      allow    Mawk allows the option to be checked against the
               (small) set of long options it recognizes.

      error    Mawk prints an error message and exits. This is the
               default.

      ignore   Mawk ignores the option.

      warn     Mawk prints a warning message and otherwise ignores the
               option.

      If the variable is unset, mawk prints an error message and exits.

   WHINY_USERS
      This is an undocumented gawk feature. It tells mawk to sort
      array indices before it starts to iterate over the elements of an
      array.

egrep(1)

Aho, Kernighan and Weinberger, The AWK Programming Language,
Addison-Wesley Publishing, 1988, (the AWK book), defines the language,
opening with a tutorial and advancing to many interesting programs that
delve into issues of software design and analysis relevant to
programming in any language.

The GAWK Manual, The Free Software Foundation, 1991, is a tutorial and
language reference that does not attempt the depth of the AWK book and
assumes the reader may be a novice programmer. The section on AWK
arrays is excellent. It also discusses POSIX requirements for AWK.

mawk implements printf() and sprintf() using the C library functions,
printf and sprintf, so full ANSI compatibility requires an ANSI C
library. In practice this means the h conversion qualifier may not be
available. Also mawk inherits any bugs or limitations of the library
functions.

Implementors of the AWK language have shown a consistent lack of
imagination when naming their programs.

Mike Brennan (brennan@whidbey.com).
Thomas E. Dickey <dickey@invisible-island.net>.

Version 1.3.4                     2016-09-18                     MAWK(1)