MAWK(1)                   General Commands Manual                  MAWK(1)

NAME
mawk - pattern scanning and text processing language

SYNOPSIS
mawk [-W option] [-F value] [-v var=value] [--] 'program text' [file ...]
mawk [-W option] [-F value] [-v var=value] [-f program-file] [--] [file ...]

DESCRIPTION
mawk is an interpreter for the AWK Programming Language. The AWK
language is useful for manipulation of data files, text retrieval and
processing, and for prototyping and experimenting with algorithms.
mawk is a new awk, meaning it implements the AWK language as defined in
Aho, Kernighan and Weinberger, The AWK Programming Language,
Addison-Wesley Publishing, 1988 (hereafter referred to as the AWK
book). mawk conforms to the Posix 1003.2 (draft 11.3) definition of the
AWK language, which contains a few features not described in the AWK
book, and mawk provides a small number of extensions.
24
25 An AWK program is a sequence of pattern {action} pairs and function
26 definitions. Short programs are entered on the command line usually
27 enclosed in ' ' to avoid shell interpretation. Longer programs can be
28 read in from a file with the -f option. Data input is read from the
29 list of files on the command line or from standard input when the list
30 is empty. The input is broken into records as determined by the record
31 separator variable, RS. Initially, RS = "\n" and records are synony‐
32 mous with lines. Each record is compared against each pattern and if
33 it matches, the program text for {action} is executed.

OPTIONS
36 -F value sets the field separator, FS, to value.
37
38 -f file Program text is read from file instead of from the com‐
39 mand line. Multiple -f options are allowed.
40
41 -v var=value assigns value to program variable var.
42
43 -- indicates the unambiguous end of options.
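
As a sketch of how these options combine (the input line and the
variable name col are invented for illustration, and awk stands for
any Posix awk such as mawk):

```shell
# -F sets FS to ":" and -v assigns col before the program starts.
out=$(printf 'root:x:0:0\n' | awk -F: -v col=3 '{ print $1, $col }')
printf '%s\n' "$out"
```

This should print "root 0": the first field, then the field selected
by col.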
44
45 The above options will be available with any Posix compatible implemen‐
46 tation of AWK, and implementation specific options are prefaced with
47 -W. mawk provides six:
48
49 -W version mawk writes its version and copyright to stdout and com‐
50 piled limits to stderr and exits 0.
51
52 -W dump writes an assembler like listing of the internal repre‐
53 sentation of the program to stdout and exits 0 (on suc‐
54 cessful compilation).
55
56 -W interactive sets unbuffered writes to stdout and line buffered reads
57 from stdin. Records from stdin are lines regardless of
58 the value of RS.
59
60 -W exec file Program text is read from file and this is the last
61 option. Useful on systems that support the #! "magic
62 number" convention for executable scripts.
63
64 -W sprintf=num adjusts the size of mawk's internal sprintf buffer to
65 num bytes. More than rare use of this option indicates
66 mawk should be recompiled.
67
68 -W posix_space forces mawk not to consider '\n' to be space.
69
70 The short forms -W[vdiesp] are recognized and on some systems -We is
71 mandatory to avoid command line length limitations.
72
73 mawk allows multiple -W options to be combined by separating the
74 options with commas, e.g., -Wsprint=2000,posix.

THE AWK LANGUAGE
77 1. Program structure
78 An AWK program is a sequence of pattern {action} pairs and user func‐
79 tion definitions.
80
81 A pattern can be:
82 BEGIN
83 END
84 expression
85 expression , expression
86
87 One, but not both, of pattern {action} can be omitted. If {action} is
88 omitted it is implicitly { print }. If pattern is omitted, then it is
89 implicitly matched. BEGIN and END patterns require an action.
90
91 Statements are terminated by newlines, semi-colons or both. Groups of
92 statements such as actions or loop bodies are blocked via { ... } as in
93 C. The last statement in a block doesn't need a terminator. Blank
94 lines have no meaning; an empty statement is terminated with a semi-
95 colon. Long statements can be continued with a backslash, \. A state‐
96 ment can be broken without a backslash after a comma, left brace, &&,
97 ||, do, else, the right parenthesis of an if, while or for statement,
and the right parenthesis of a function definition. A comment starts
with # and extends to, but does not include, the end of the line.
100
101 The following statements control program flow inside blocks.
102
103 if ( expr ) statement
104
105 if ( expr ) statement else statement
106
107 while ( expr ) statement
108
109 do statement while ( expr )
110
111 for ( opt_expr ; opt_expr ; opt_expr ) statement
112
113 for ( var in array ) statement
114
115 continue
116
117 break
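
A minimal sketch of these flow statements working together. The loop
bounds and the variable names are arbitrary choices for illustration.

```shell
# Sum the even numbers from 1 to 10, stopping once i exceeds 6.
out=$(awk 'BEGIN {
    for (i = 1; i <= 10; i++) {
        if (i > 6) break          # leave the loop entirely
        if (i % 2) continue       # skip odd values
        sum += i
    }
    print sum
}')
printf '%s\n' "$out"
```

Only 2, 4 and 6 are added, so the program prints 12.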
118
119 2. Data types, conversion and comparison
120 There are two basic data types, numeric and string. Numeric constants
121 can be integer like -2, decimal like 1.08, or in scientific notation
122 like -1.1e4 or .28E-3. All numbers are represented internally and all
123 computations are done in floating point arithmetic. So for example,
124 the expression 0.2e2 == 20 is true and true is represented as 1.0.
125
126 String constants are enclosed in double quotes.
127
128 "This is a string with a newline at the end.\n"
129
130 Strings can be continued across a line by escaping (\) the newline.
131 The following escape sequences are recognized.
132
133 \\ \
134 \" "
135 \a alert, ascii 7
136 \b backspace, ascii 8
137 \t tab, ascii 9
138 \n newline, ascii 10
139 \v vertical tab, ascii 11
140 \f formfeed, ascii 12
141 \r carriage return, ascii 13
142 \ddd 1, 2 or 3 octal digits for ascii ddd
143 \xhh 1 or 2 hex digits for ascii hh
144
145 If you escape any other character \c, you get \c, i.e., mawk ignores
146 the escape.
147
148 There are really three basic data types; the third is number and string
149 which has both a numeric value and a string value at the same time.
150 User defined variables come into existence when first referenced and
151 are initialized to null, a number and string value which has numeric
152 value 0 and string value "". Non-trivial number and string typed data
153 come from input and are typically stored in fields. (See section 4).
154
The type of an expression is determined by its context, and automatic
type conversion occurs if needed. For example, consider the statements
158
159 y = x + 2 ; z = x "hello"
160
161 The value stored in variable y will be typed numeric. If x is not
162 numeric, the value read from x is converted to numeric before it is
163 added to 2 and stored in y. The value stored in variable z will be
164 typed string, and the value of x will be converted to string if neces‐
165 sary and concatenated with "hello". (Of course, the value and type
166 stored in x is not changed by any conversions.) A string expression is
167 converted to numeric using its longest numeric prefix as with atof(3).
A numeric expression is converted to string by replacing expr with
sprintf(CONVFMT, expr), unless expr can be represented on the host
machine as an exact integer, in which case it is converted with
sprintf("%d", expr). sprintf() is an AWK built-in that duplicates the
functionality of sprintf(3), and CONVFMT is a built-in variable used
for internal conversion from number to string and initialized to
"%.6g". Explicit type conversions can be forced: expr "" is string and
expr+0 is numeric.
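
The conversions described above can be observed directly. This is a
small sketch with made-up values; awk stands for any Posix awk.

```shell
out=$(awk 'BEGIN {
    x = "3abc"
    y = x + 2          # numeric context: the longest numeric prefix "3" is used
    z = x "hello"      # string context: plain concatenation
    w = 3.14159265 ""  # number to string goes through CONVFMT ("%.6g")
    k = 17 ""          # exact integers are converted with "%d"
    print y; print z; print w; print k
}')
printf '%s\n' "$out"
```

The four lines printed are 5, 3abchello, 3.14159 and 17.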
176
To evaluate expr1 rel-op expr2: if both operands are numeric or number
and string, the comparison is numeric; if both operands are string,
the comparison is string; if one operand is string, the non-string
operand is converted and the comparison is string. The result is
numeric, 1 or 0.
182
In boolean contexts such as if ( expr ) statement, a string expression
evaluates true if and only if it is not the empty string ""; a numeric
value evaluates true if and only if it is not numerically zero.
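
A short sketch of the comparison and truth rules, using an invented
input line. The variable names are for illustration only.

```shell
out=$(echo '10 9' | awk '{
    num = ($1 > $2)                 # both fields look numeric: 10 > 9
    str = ($1 "" > $2 "")           # forced to string: "10" sorts before "9"
    t1 = ("0" ? "true" : "false")   # non-empty string constant
    t2 = (0 ? "true" : "false")     # numeric zero
    print num, str, t1, t2
}')
printf '%s\n' "$out"
```

The output is "1 0 true false". Note that the string constant "0" is
true because it is not empty, while the number 0 is false.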
186
187 3. Regular expressions
188 In the AWK language, records, fields and strings are often tested for
189 matching a regular expression. Regular expressions are enclosed in
190 slashes, and
191
192 expr ~ /r/
193
194 is an AWK expression that evaluates to 1 if expr "matches" r, which
195 means a substring of expr is in the set of strings defined by r. With
196 no match the expression evaluates to 0; replacing ~ with the "not
197 match" operator, !~ , reverses the meaning. As pattern-action pairs,
198
199 /r/ { action } and $0 ~ /r/ { action }
200
201 are the same, and for each input record that matches r, action is exe‐
202 cuted. In fact, /r/ is an AWK expression that is equivalent to ($0 ~
203 /r/) anywhere except when on the right side of a match operator or
204 passed as an argument to a built-in function that expects a regular
205 expression argument.
206
207 AWK uses extended regular expressions as with egrep(1). The regular
208 expression metacharacters, i.e., those with special meaning in regular
209 expressions are
210
211 ^ $ . [ ] | ( ) * + ?
212
213 Regular expressions are built up from characters as follows:
214
215 c matches any non-metacharacter c.
216
217 \c matches a character defined by the same escape
218 sequences used in string constants or the literal
219 character c if \c is not an escape sequence.
220
221 . matches any character (including newline).
222
223 ^ matches the front of a string.
224
225 $ matches the back of a string.
226
227 [c1c2c3...] matches any character in the class c1c2c3... . An
228 interval of characters is denoted c1-c2 inside a
229 class [...].
230
231 [^c1c2c3...] matches any character not in the class c1c2c3...
232
233 Regular expressions are built up from other regular expressions as fol‐
234 lows:
235
236 r1r2 matches r1 followed immediately by r2 (concatena‐
237 tion).
238
239 r1 | r2 matches r1 or r2 (alternation).
240
241 r* matches r repeated zero or more times.
242
243 r+ matches r repeated one or more times.
244
245 r? matches r zero or once.
246
247 (r) matches r, providing grouping.
248
249 The increasing precedence of operators is alternation, concatenation
250 and unary (*, + or ?).
251
252 For example,
253
254 /^[_a-zA-Z][_a-zA-Z0-9]*$/ and
255 /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/
256
257 are matched by AWK identifiers and AWK numeric constants respectively.
258 Note that . has to be escaped to be recognized as a decimal point, and
259 that metacharacters are not special inside character classes.
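
As a sketch, the two expressions can be used as patterns to classify
input lines. The sample inputs are invented for illustration.

```shell
out=$(printf '%s\n' foo_1 1foo -1.1e4 .28E-3 1.e | awk '
    /^[_a-zA-Z][_a-zA-Z0-9]*$/                            { print $0 ": identifier"; next }
    /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/  { print $0 ": number"; next }
                                                          { print $0 ": neither" }')
printf '%s\n' "$out"
```

Here foo_1 is reported as an identifier, -1.1e4 and .28E-3 as numbers,
and 1foo and 1.e as neither (the exponent in 1.e has no digits).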
260
261 Any expression can be used on the right hand side of the ~ or !~ opera‐
262 tors or passed to a built-in that expects a regular expression. If
263 needed, it is converted to string, and then interpreted as a regular
264 expression. For example,
265
266 BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }
267
268 $0 ~ "^" identifier
269
270 prints all lines that start with an AWK identifier.
271
272 mawk recognizes the empty regular expression, //, which matches the
273 empty string and hence is matched by any string at the front, back and
274 between every character. For example,
275
echo abc | mawk '{ gsub(//, "X") ; print }'
XaXbXcX
278
279
280 4. Records and fields
281 Records are read in one at a time, and stored in the field variable $0.
282 The record is split into fields which are stored in $1, $2, ..., $NF.
283 The built-in variable NF is set to the number of fields, and NR and FNR
284 are incremented by 1. Fields above $NF are set to "".
285
286 Assignment to $0 causes the fields and NF to be recomputed. Assignment
287 to NF or to a field causes $0 to be reconstructed by concatenating the
288 $i's separated by OFS. Assignment to a field with index greater than
289 NF, increases NF and causes $0 to be reconstructed.
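
A minimal sketch of the reconstruction rule. The input line and the
choice of OFS = "-" are illustrative only.

```shell
out=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } {
    $5 = "e"        # index above NF: NF grows to 5 and $4 is created empty
    print NF
    print           # $0 was rebuilt from the fields joined by OFS
}')
printf '%s\n' "$out"
```

This prints 5 and then a-b-c--e; the doubled separator marks the empty
fourth field.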
290
291 Data input stored in fields is string, unless the entire field has
292 numeric form and then the type is number and string. For example,
293
294 echo 24 24E |
295 mawk '{ print($1>100, $1>"100", $2>100, $2>"100") }'
296 0 1 1 1
297
298 $0 and $2 are string and $1 is number and string. The first comparison
299 is numeric, the second is string, the third is string (100 is converted
300 to "100"), and the last is string.
301
302 5. Expressions and operators
303 The expression syntax is similar to C. Primary expressions are numeric
304 constants, string constants, variables, fields, arrays and function
305 calls. The identifier for a variable, array or function can be a
306 sequence of letters, digits and underscores, that does not start with a
307 digit. Variables are not declared; they exist when first referenced
308 and are initialized to null.
309
310 New expressions are composed with the following operators in order of
311 increasing precedence.
312
313 assignment = += -= *= /= %= ^=
314 conditional ? :
315 logical or ||
316 logical and &&
317 array membership in
318 matching ~ !~
319 relational < > <= >= == !=
320 concatenation (no explicit operator)
321 add ops + -
322 mul ops * / %
323 unary + -
324 logical not !
325 exponentiation ^
326 inc and dec ++ -- (both post and pre)
327 field $
328
329 Assignment, conditional and exponentiation associate right to left; the
330 other operators associate left to right. Any expression can be paren‐
331 thesized.
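
A few one-line checks of the table, offered as a sketch (the
expressions are arbitrary):

```shell
out=$(awk 'BEGIN {
    print 2 ^ 3 ^ 2       # right associative: 2 ^ (3 ^ 2)
    print -2 ^ 2          # ^ binds tighter than unary minus: -(2 ^ 2)
    x = 1 " " 2 + 3       # + binds tighter than concatenation: 1 " " (2 + 3)
    print x
}')
printf '%s\n' "$out"
```

The output is 512, -4 and "1 5".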
332
333 6. Arrays
334 Awk provides one-dimensional arrays. Array elements are expressed as
335 array[expr]. Expr is internally converted to string type, so, for
336 example, A[1] and A["1"] are the same element and the actual index is
337 "1". Arrays indexed by strings are called associative arrays. Ini‐
338 tially an array is empty; elements exist when first accessed. An
339 expression, expr in array evaluates to 1 if array[expr] exists, else to
340 0.
341
342 There is a form of the for statement that loops over each index of an
343 array.
344
345 for ( var in array ) statement
346
sets var to each index of array and executes statement. The order in
which var traverses the indices of array is not defined.
349
350 The statement, delete array[expr], causes array[expr] not to exist.
351 mawk supports an extension, delete array, which deletes all elements of
352 array.
353
354 Multidimensional arrays are synthesized with concatenation using the
355 built-in variable SUBSEP. array[expr1,expr2] is equivalent to
356 array[expr1 SUBSEP expr2]. Testing for a multidimensional element uses
357 a parenthesized index, such as
358
359 if ( (i, j) in A ) print A[i, j]
360
361
7. Built-in variables
363 The following variables are built-in and initialized before program
364 execution.
365
366 ARGC number of command line arguments.
367
368 ARGV array of command line arguments, 0..ARGC-1.
369
370 CONVFMT format for internal conversion of numbers to string,
371 initially = "%.6g".
372
373 ENVIRON array indexed by environment variables. An environ‐
374 ment string, var=value is stored as ENVIRON[var] =
375 value.
376
377 FILENAME name of the current input file.
378
379 FNR current record number in FILENAME.
380
381 FS splits records into fields as a regular expression.
382
383 NF number of fields in the current record.
384
385 NR current record number in the total input stream.
386
387 OFMT format for printing numbers; initially = "%.6g".
388
389 OFS inserted between fields on output, initially = " ".
390
391 ORS terminates each record on output, initially = "\n".
392
393 RLENGTH length set by the last call to the built-in function,
394 match().
395
396 RS input record separator, initially = "\n".
397
398 RSTART index set by the last call to match().
399
400 SUBSEP used to build multiple array subscripts, initially =
401 "\034".
402
403 8. Built-in functions
404 String functions
405
406 gsub(r,s,t) gsub(r,s)
407 Global substitution, every match of regular expression r
408 in variable t is replaced by string s. The number of
409 replacements is returned. If t is omitted, $0 is used.
410 An & in the replacement string s is replaced by the
411 matched substring of t. \& and \\ put literal & and \,
412 respectively, in the replacement string.
413
414 index(s,t)
415 If t is a substring of s, then the position where t
416 starts is returned, else 0 is returned. The first char‐
417 acter of s is in position 1.
418
419 length(s)
420 Returns the length of string s.
421
422 match(s,r)
423 Returns the index of the first longest match of regular
424 expression r in string s. Returns 0 if no match. As a
425 side effect, RSTART is set to the return value. RLENGTH
426 is set to the length of the match or -1 if no match. If
427 the empty string is matched, RLENGTH is set to 0, and 1
428 is returned if the match is at the front, and length(s)+1
429 is returned if the match is at the back.
430
431 split(s,A,r) split(s,A)
432 String s is split into fields by regular expression r and
433 the fields are loaded into array A. The number of fields
434 is returned. See section 11 below for more detail. If r
435 is omitted, FS is used.
436
437 sprintf(format,expr-list)
438 Returns a string constructed from expr-list according to
439 format. See the description of printf() below.
440
441 sub(r,s,t) sub(r,s)
442 Single substitution, same as gsub() except at most one
443 substitution.
444
445 substr(s,i,n) substr(s,i)
446 Returns the substring of string s, starting at index i,
447 of length n. If n is omitted, the suffix of s, starting
448 at i is returned.
449
450 tolower(s)
451 Returns a copy of s with all upper case characters con‐
452 verted to lower case.
453
454 toupper(s)
455 Returns a copy of s with all lower case characters con‐
456 verted to upper case.
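
A sketch exercising several of the string functions on made-up
strings:

```shell
out=$(awk 'BEGIN {
    s = "hello world"
    n = gsub(/o/, "[&]", s)          # & stands for the matched text
    print n, s
    print index(s, "w")              # position in the modified s
    p = match("foobar", /o+/)        # also sets RSTART and RLENGTH
    print p, RSTART, RLENGTH
    print substr("foobar", 4), substr("foobar", 2, 3)
    print toupper("abc") tolower("DEF")
}')
printf '%s\n' "$out"
```

The output is "2 hell[o] w[o]rld", then 9, "2 2 2", "bar oob" and
ABCdef.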
457
458 Arithmetic functions
459
460 atan2(y,x) Arctan of y/x between -pi and pi.
461
462 cos(x) Cosine function, x in radians.
463
464 exp(x) Exponential function.
465
466 int(x) Returns x truncated towards zero.
467
468 log(x) Natural logarithm.
469
470 rand() Returns a random number between zero and one.
471
472 sin(x) Sine function, x in radians.
473
474 sqrt(x) Returns square root of x.
475
476 srand(expr) srand()
477 Seeds the random number generator, using the clock if
478 expr is omitted, and returns the value of the previous
479 seed. mawk seeds the random number generator from the
480 clock at startup so there is no real need to call
481 srand(). Srand(expr) is useful for repeating pseudo ran‐
482 dom sequences.
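
A brief sketch of the arithmetic functions. The seed value 1 is an
arbitrary choice.

```shell
out=$(awk 'BEGIN {
    print int(-3.7)                     # truncates toward zero
    print sqrt(16)
    printf "%.5f\n", atan2(0, -1)       # pi
    srand(1); a = rand()
    srand(1); b = rand()                # same seed, same sequence
    print (a == b)
}')
printf '%s\n' "$out"
```

The output is -3, 4, 3.14159 and 1; the last line confirms that
reseeding with the same value repeats the sequence.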
483
484 9. Input and output
485 There are two output statements, print and printf.
486
487 print writes $0 ORS to standard output.
488
489 print expr1, expr2, ..., exprn
490 writes expr1 OFS expr2 OFS ... exprn ORS to standard out‐
491 put. Numeric expressions are converted to string with
492 OFMT.
493
494 printf format, expr-list
495 duplicates the printf C library function writing to stan‐
496 dard output. The complete ANSI C format specifications
497 are recognized with conversions %c, %d, %e, %E, %f, %g,
498 %G, %i, %o, %s, %u, %x, %X and %%, and conversion quali‐
499 fiers h and l.
500
501 The argument list to print or printf can optionally be enclosed in
502 parentheses. Print formats numbers using OFMT or "%d" for exact inte‐
503 gers. "%c" with a numeric argument prints the corresponding 8 bit
504 character, with a string argument it prints the first character of the
505 string. The output of print and printf can be redirected to a file or
506 command by appending > file, >> file or | command to the end of the
507 print statement. Redirection opens file or command only once, subse‐
508 quent redirections append to the already open stream. By convention,
509 mawk associates the filename "/dev/stderr" with stderr which allows
510 print and printf to be redirected to stderr. mawk also associates "-"
511 and "/dev/stdout" with stdin and stdout which allows these streams to
512 be passed to functions.
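
A small sketch of print, printf and redirection to stderr. The values
and the format string are illustrative.

```shell
out=$(awk 'BEGIN {
    OFS = "-"
    print "a", "b"                               # joined with OFS, ended with ORS
    printf "%5.2f|%-4s|%d\n", 3.14159, "ab", 42  # printf adds no ORS itself
    print 0.1 + 0.2                              # non-integer: formatted with OFMT
    print "diagnostic" > "/dev/stderr"           # goes to stderr, not stdout
}' 2>/dev/null)
printf '%s\n' "$out"
```

Standard output receives three lines: a-b, then " 3.14|ab  |42" (with
the padding requested by the format), then 0.3.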
513
514 The input function getline has the following variations.
515
516 getline
517 reads into $0, updates the fields, NF, NR and FNR.
518
519 getline < file
520 reads into $0 from file, updates the fields and NF.
521
522 getline var
523 reads the next record into var, updates NR and FNR.
524
525 getline var < file
526 reads the next record of file into var.
527
528 command | getline
529 pipes a record from command into $0 and updates the
530 fields and NF.
531
532 command | getline var
533 pipes a record from command into var.
534
535 Getline returns 0 on end-of-file, -1 on error, otherwise 1.
536
537 Commands on the end of pipes are executed by /bin/sh.
538
539 The function close(expr) closes the file or pipe associated with expr.
540 Close returns 0 if expr is an open file, the exit status if expr is a
541 piped command, and -1 otherwise. Close is used to reread a file or
542 command, make sure the other end of an output pipe is finished or con‐
543 serve file resources.
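
A sketch of reading from a command with getline and rereading it after
close. The command string is an arbitrary example.

```shell
out=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0) n++    # 1 per record, 0 at end of input
    print n, line
    close(cmd)                              # without this the pipe stays at EOF
    cmd | getline first                     # the command runs again
    print first
}')
printf '%s\n' "$out"
```

This prints "2 two" and then "one". Comparing the return value with
> 0 keeps the loop from spinning if getline returns -1 on error.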
544
545 The function fflush(expr) flushes the output file or pipe associated
546 with expr. Fflush returns 0 if expr is an open output stream else -1.
547 Fflush without an argument flushes stdout. Fflush with an empty argu‐
548 ment ("") flushes all open output.
549
550 The function system(expr) uses /bin/sh to execute expr and returns the
551 exit status of the command expr. Changes made to the ENVIRON array are
552 not passed to commands executed with system or pipes.
553
554 10. User defined functions
555 The syntax for a user defined function is
556
557 function name( args ) { statements }
558
559 The function body can contain a return statement
560
561 return opt_expr
562
563 A return statement is not required. Function calls may be nested or
564 recursive. Functions are passed expressions by value and arrays by
565 reference. Extra arguments serve as local variables and are initial‐
566 ized to null. For example, csplit(s,A) puts each character of s into
567 array A and returns the length of s.
568
     function csplit(s, A,    n, i)
     {
       n = length(s)
       for( i = 1 ; i <= n ; i++ ) A[i] = substr(s, i, 1)
       return n
     }
575
576 Putting extra space between passed arguments and local variables is
577 conventional. Functions can be referenced before they are defined, but
578 the function name and the '(' of the arguments must touch to avoid con‐
579 fusion with concatenation.
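
A further sketch of the calling rules; the function names fact and
fill are invented for illustration.

```shell
out=$(awk '
function fact(n) { return n <= 1 ? 1 : n * fact(n - 1) }

# i is an extra parameter used as a local; A is passed by reference
function fill(A, n,    i) { for (i = 1; i <= n; i++) A[i] = i * i }

BEGIN {
    fill(sq, 3)
    print fact(5), sq[3], (i == "")
}')
printf '%s\n' "$out"
```

The output is "120 9 1": the recursion works, the caller sees the
elements stored through A, and the global i is still null because the
i inside fill is local.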
580
581 11. Splitting strings, records and files
582 Awk programs use the same algorithm to split strings into arrays with
583 split(), and records into fields on FS. mawk uses essentially the same
584 algorithm to split files into records on RS.
585
586 Split(expr,A,sep) works as follows:
587
588 (1) If sep is omitted, it is replaced by FS. Sep can be an
589 expression or regular expression. If it is an expression
590 of non-string type, it is converted to string.
591
592 (2) If sep = " " (a single space), then <SPACE> is trimmed
593 from the front and back of expr, and sep becomes <SPACE>.
594 mawk defines <SPACE> as the regular expression
595 /[ \t\n]+/. Otherwise sep is treated as a regular
596 expression, except that meta-characters are ignored for a
597 string of length 1, e.g., split(x, A, "*") and split(x,
598 A, /\*/) are the same.
599
600 (3) If expr is not string, it is converted to string. If
601 expr is then the empty string "", split() returns 0 and A
602 is set empty. Otherwise, all non-overlapping, non-null
603 and longest matches of sep in expr, separate expr into
604 fields which are loaded into A. The fields are placed in
605 A[1], A[2], ..., A[n] and split() returns n, the number
606 of fields which is the number of matches plus one. Data
607 placed in A that looks numeric is typed number and
608 string.
609
610 Splitting records into fields works the same except the pieces are
611 loaded into $1, $2,..., $NF. If $0 is empty, NF is set to 0 and all $i
612 to "".
613
614 mawk splits files into records by the same algorithm, but with the
615 slight difference that RS is really a terminator instead of a separa‐
616 tor. (ORS is really a terminator too).
617
618 E.g., if FS = ":+" and $0 = "a::b:" , then NF = 3 and $1 = "a",
619 $2 = "b" and $3 = "", but if "a::b:" is the contents of an input
620 file and RS = ":+", then there are two records "a" and "b".
621
622 RS = " " is not special.
623
624 If FS = "", then mawk breaks the record into individual characters,
625 and, similarly, split(s,A,"") places the individual characters of s
626 into A.
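
The split() rules can be checked with a short sketch. The strings are
made up, and the first one mirrors the FS example above.

```shell
out=$(awk 'BEGIN {
    n = split("a::b:", A, ":+")            # two matches of :+, so three fields
    print n, A[1], A[2], "[" A[3] "]"
    print split("a*b", B, "*")             # single character: taken literally
    print split("  x  y ", C, " ")         # single space: trim, then split on blanks
}')
printf '%s\n' "$out"
```

The output is "3 a b []", then 2 and 2.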
627
628 12. Multi-line records
629 Since mawk interprets RS as a regular expression, multi-line records
630 are easy. Setting RS = "\n\n+", makes one or more blank lines separate
631 records. If FS = " " (the default), then single newlines, by the rules
632 for <SPACE> above, become space and single newlines are field separa‐
633 tors.
634
635 For example, if a file is "a b\nc\n\n", RS = "\n\n+" and FS =
636 " ", then there is one record "a b\nc" with three fields "a",
637 "b" and "c". Changing FS = "\n", gives two fields "a b" and
638 "c"; changing FS = "", gives one field identical to the record.
639
640 If you want lines with spaces or tabs to be considered blank, set RS =
641 "\n([ \t]*\n)+". For compatibility with other awks, setting RS = ""
642 has the same effect as if blank lines are stripped from the front and
643 back of files and then records are determined as if RS = "\n\n+".
644 Posix requires that "\n" always separates records when RS = "" regard‐
645 less of the value of FS. mawk does not support this convention,
646 because defining "\n" as <SPACE> makes it unnecessary.
647
648 Most of the time when you change RS for multi-line records, you will
649 also want to change ORS to "\n\n" so the record spacing is preserved on
650 output.
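
A minimal sketch of multi-line records. It uses the compatibility
form RS = "" rather than a regular expression so that it also runs
under other awks; the input is invented.

```shell
# Two records separated by a blank line; FS keeps its default " ".
out=$(printf 'a b\nc\n\nd\n' | awk 'BEGIN { RS = "" } { print NR, NF }')
printf '%s\n' "$out"
```

The first record is "a b\nc" with three fields and the second is "d"
with one, so the output is "1 3" and "2 1".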
651
652 13. Program execution
This section describes the order of program execution. First ARGC is
set to the total number of command line arguments passed to the
execution phase of the program. ARGV[0] is set to the name of the AWK
interpreter and ARGV[1] ... ARGV[ARGC-1] hold the remaining command
line arguments exclusive of options and program source. For example,
with
658
659 mawk -f prog v=1 A t=hello B
660
661 ARGC = 5 with ARGV[0] = "mawk", ARGV[1] = "v=1", ARGV[2] = "A", ARGV[3]
662 = "t=hello" and ARGV[4] = "B".
663
664 Next, each BEGIN block is executed in order. If the program consists
665 entirely of BEGIN blocks, then execution terminates, else an input
666 stream is opened and execution continues. If ARGC equals 1, the input
667 stream is set to stdin, else the command line arguments ARGV[1] ...
668 ARGV[ARGC-1] are examined for a file argument.
669
670 The command line arguments divide into three sets: file arguments,
671 assignment arguments and empty strings "". An assignment has the form
672 var=string. When an ARGV[i] is examined as a possible file argument,
673 if it is empty it is skipped; if it is an assignment argument, the
674 assignment to var takes place and i skips to the next argument; else
675 ARGV[i] is opened for input. If it fails to open, execution terminates
676 with exit code 2. If no command line argument is a file argument, then
677 input comes from stdin. Getline in a BEGIN action opens input. "-" as
678 a file argument denotes stdin.
679
680 Once an input stream is open, each input record is tested against each
681 pattern, and if it matches, the associated action is executed. An
682 expression pattern matches if it is boolean true (see the end of sec‐
683 tion 2). A BEGIN pattern matches before any input has been read, and
an END pattern matches after all input has been read. A range pattern,
expr1,expr2, matches every record from the one that matches expr1
through the one that matches expr2, inclusive.
687
688 When end of file occurs on the input stream, the remaining command line
689 arguments are examined for a file argument, and if there is one it is
690 opened, else the END pattern is considered matched and all END actions
691 are executed.
692
693 In the example, the assignment v=1 takes place after the BEGIN actions
694 are executed, and the data placed in v is typed number and string.
695 Input is then read from file A. On end of file A, t is set to the
696 string "hello", and B is opened for input. On end of file B, the END
697 actions are executed.
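
The timing of assignment arguments can be seen in a sketch that
creates two throwaway files. The temporary directory and the file
contents are assumptions made for the demonstration.

```shell
tmp=$(mktemp -d)
printf 'x\n' > "$tmp/A"
printf 'y\n' > "$tmp/B"
# v is still null in BEGIN; each assignment happens when its argument is reached.
out=$(awk 'BEGIN { print "begin:" v } { print v, $0 }' \
      v=1 "$tmp/A" v=hello "$tmp/B")
rm -r "$tmp"
printf '%s\n' "$out"
```

The output is "begin:", then "1 x", then "hello y".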
698
699 Program flow at the pattern {action} level can be changed with the
700
701 next
702 exit opt_expr
703
704 statements. A next statement causes the next input record to be read
705 and pattern testing to restart with the first pattern {action} pair in
706 the program. An exit statement causes immediate execution of the END
707 actions or program termination if there are none or if the exit occurs
708 in an END action. The opt_expr sets the exit value of the program
709 unless overridden by a later exit or subsequent error.
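
A sketch of next and exit on a four-line input. The exit value 7 is
arbitrary.

```shell
status=0
out=$(printf '1\n2\n3\n4\n' | awk '
    $1 == 1 { next }          # read the next record; later rules are not tried
    $1 == 3 { exit 7 }        # go straight to the END actions, keep status 7
            { print }
    END     { print "end" }') || status=$?
printf '%s\nexit status: %s\n' "$out" "$status"
```

Record 1 is skipped, record 2 is printed, record 3 triggers the exit
and record 4 is never read. The END action still runs, so the output
is 2 and end, with exit status 7.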

EXAMPLES
1. emulate cat.

     { print }

2. emulate wc.

     { chars += length($0) + 1  # add one for the \n
       words += NF
     }

     END{ print NR, words, chars }

3. count the number of unique "real words".

     BEGIN { FS = "[^A-Za-z]+" }

     { for(i = 1 ; i <= NF ; i++)  word[$i] = "" }

     END { delete word[""]
           for ( i in word )  cnt++
           print cnt
     }

4. sum the second field of every record based on the first field.

     $1 ~ /credit|gain/ { sum += $2 }
     $1 ~ /debit|loss/  { sum -= $2 }

     END { print sum }

5. sort a file, comparing as string

     { line[NR] = $0 "" }   # make sure of comparison type
                            # in case some lines look numeric

     END {  isort(line, NR)
       for(i = 1 ; i <= NR ; i++) print line[i]
     }

     #insertion sort of A[1..n]
     function isort( A, n,   i, j, hold)
     {
       for( i = 2 ; i <= n ; i++)
       {
         hold = A[j = i]
         while ( A[j-1] > hold )
         { j-- ; A[j+1] = A[j] }
         A[j] = hold
       }
       # sentinel A[0] = "" will be created if needed
     }


COMPATIBILITY ISSUES
The Posix 1003.2 (draft 11.3) definition of the AWK language is AWK as
described in the AWK book with a few extensions that appeared in
System V Release 4 nawk. The extensions are:
769
770 New functions: toupper() and tolower().
771
772 New variables: ENVIRON[] and CONVFMT.
773
774 ANSI C conversion specifications for printf() and sprintf().
775
776 New command options: -v var=value, multiple -f options and
777 implementation options as arguments to -W.
778
779 For systems (MS-DOS or Windows) which provide a setmode func‐
780 tion, an environment variable MAWKBINMODE and a built-in vari‐
781 able BINMODE. The bits of the BINMODE value tell mawk how to
782 modify the RS and ORS variables:
783
784 0 set standard input to binary mode, and if BIT-2 is unset,
785 set RS to "\r\n" (CR/LF) rather than "\n" (LF).
786
787 1 set standard output to binary mode, and if BIT-2 is unset,
788 set ORS to "\r\n" (CR/LF) rather than "\n" (LF).
789
790 2 suppress the assignment to RS and ORS of CR/LF, making it
791 possible to run scripts and generate output compatible
792 with Unix line-endings.
793
794 Posix AWK is oriented to operate on files a line at a time. RS can be
795 changed from "\n" to another single character, but it is hard to find
796 any use for this — there are no examples in the AWK book. By conven‐
797 tion, RS = "", makes one or more blank lines separate records, allowing
798 multi-line records. When RS = "", "\n" is always a field separator
799 regardless of the value in FS.
800
801 mawk, on the other hand, allows RS to be a regular expression. When
802 "\n" appears in records, it is treated as space, and FS always deter‐
803 mines fields.
804
805 Removing the line at a time paradigm can make some programs simpler and
806 can often improve performance. For example, redoing example 3 from
807 above,
808
     BEGIN { RS = "[^A-Za-z]+" }

     { word[ $0 ] = "" }

     END { delete word[ "" ]
       for( i in word )  cnt++
       print cnt
     }
817
818 counts the number of unique words by making each word a record. On
819 moderate size files, mawk executes twice as fast, because of the sim‐
820 plified inner loop.
821
822 The following program replaces each comment by a single space in a C
823 program file,
824
     BEGIN {
       RS = "/\*([^*]|\*+[^/*])*\*+/"
            # comment is record separator
       ORS = " "
       getline  hold
     }

     { print hold ; hold = $0 }

     END { printf "%s" , hold }
835
836 Buffering one record is needed to avoid terminating the last record
837 with a space.
838
839 With mawk, the following are all equivalent,
840
841 x ~ /a\+b/ x ~ "a\+b" x ~ "a\\+b"
842
843 The strings get scanned twice, once as string and once as regular
844 expression. On the string scan, mawk ignores the escape on non-escape
845 characters while the AWK book advocates \c be recognized as c which
846 necessitates the double escaping of meta-characters in strings. Posix
847 explicitly declines to define the behavior which passively forces pro‐
848 grams that must run under a variety of awks to use the more portable
849 but less readable, double escape.
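
A sketch of the portable double-escape form; the two input lines are
invented.

```shell
# The string "a\\+b" becomes the regular expression a\+b: a, a literal +, b.
out=$(printf 'a+b\naab\n' | awk '$0 ~ "a\\+b"')
printf '%s\n' "$out"
```

Only the line a+b is printed; aab does not match because the + is
taken literally rather than as a repetition operator.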
850
851 Posix AWK does not recognize "/dev/std{out,err}" or \x hex escape
852 sequences in strings. Unlike ANSI C, mawk limits the number of digits
853 that follows \x to two as the current implementation only supports 8
854 bit characters. The built-in fflush first appeared in a recent (1993)
855 AT&T awk released to netlib, and is not part of the posix standard.
856 Aggregate deletion with delete array is not part of the posix standard.
857
858 Posix explicitly leaves the behavior of FS = "" undefined, and mentions
859 splitting the record into characters as a possible interpretation, but
860 currently this use is not portable across implementations.
861
862 Finally, here is how mawk handles exceptional cases not discussed in
863 the AWK book or the Posix draft. It is unsafe to assume consistency
864 across awks and safe to skip to the next section.
865
866 substr(s, i, n) returns the characters of s in the intersection
867 of the closed interval [1, length(s)] and the half-open interval
868 [i, i+n). When this intersection is empty, the empty string is
869 returned; so substr("ABC", 1, 0) = "" and substr("ABC", -4, 6) =
870 "A".
871
872 Every string, including the empty string, matches the empty
873 string at the front so, s ~ // and s ~ "", are always 1 as is
874 match(s, //) and match(s, ""). The last two set RLENGTH to 0.
875
index(s, t) is always the same as match(s, t1) where t1 is the
same as t with metacharacters escaped. Hence consistency with
match requires that index(s, "") always returns 1. Also the
condition, index(s,t) != 0 if and only if t is a substring of s,
requires index("","") = 1.
881
882 If getline encounters end of file, getline var, leaves var
883 unchanged. Similarly, on entry to the END actions, $0, the
884 fields and NF have their value unaltered from the last record.

SEE ALSO
887 egrep(1)
888
889 Aho, Kernighan and Weinberger, The AWK Programming Language, Addison-
890 Wesley Publishing, 1988, (the AWK book), defines the language, opening
891 with a tutorial and advancing to many interesting programs that delve
892 into issues of software design and analysis relevant to programming in
893 any language.
894
895 The GAWK Manual, The Free Software Foundation, 1991, is a tutorial and
896 language reference that does not attempt the depth of the AWK book and
897 assumes the reader may be a novice programmer. The section on AWK
898 arrays is excellent. It also discusses Posix requirements for AWK.

BUGS
901 mawk implements printf() and sprintf() using the C library functions,
902 printf and sprintf, so full ANSI compatibility requires an ANSI C
903 library. In practice this means the h conversion qualifier may not be
904 available. Also mawk inherits any bugs or limitations of the library
905 functions.
906
907 Implementors of the AWK language have shown a consistent lack of imagi‐
908 nation when naming their programs.

AUTHORS
911 Mike Brennan (brennan@whidbey.com).
912 Thomas E. Dickey <dickey@invisible-island.net>.