AWK(1P)                 POSIX Programmer's Manual                AWK(1P)

PROLOG
This manual page is part of the POSIX Programmer's Manual. The Linux
implementation of this interface may differ (consult the corresponding
Linux manual page for details of Linux behavior), or the interface may
not be implemented on Linux.
10
11
NAME
awk - pattern scanning and processing language
14
SYNOPSIS
awk [−F sepstring] [−v assignment]... program [argument...]

awk [−F sepstring] −f progfile [−f progfile]... [−v assignment]...
    [argument...]
20
DESCRIPTION
The awk utility shall execute programs written in the awk programming
language, which is specialized for textual data manipulation. An awk
program is a sequence of patterns and corresponding actions. When input
is read that matches a pattern, the action associated with that pattern
is carried out.
27
28 Input shall be interpreted as a sequence of records. By default, a
29 record is a line, less its terminating <newline>, but this can be
30 changed by using the RS built-in variable. Each record of input shall
31 be matched in turn against each pattern in the program. For each pat‐
32 tern matched, the associated action shall be executed.
33
34 The awk utility shall interpret each input record as a sequence of
35 fields where, by default, a field is a string of non-<blank> non-<new‐
36 line> characters. This default <blank> and <newline> field delimiter
37 can be changed by using the FS built-in variable or the −F sepstring
38 option. The awk utility shall denote the first field in a record $1,
39 the second $2, and so on. The symbol $0 shall refer to the entire
40 record; setting any other field causes the re-evaluation of $0. Assign‐
41 ing to $0 shall reset the values of all other fields and the NF built-
42 in variable.
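
The field model described above can be sketched as follows. This is only an illustration: the sample input and the shell variable name out are assumptions made for the example.

```shell
# Split one record into fields, then assign to $0 and observe that
# the fields and NF are recomputed from the new record text.
out=$(printf 'alpha beta gamma\n' | awk '{
    print NF ": " $2          # three fields; $2 is "beta"
    $0 = "one two"            # assigning to $0 resets the fields and NF
    print NF ": " $2          # now two fields; $2 is "two"
}')
printf '%s\n' "$out"
```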
43
OPTIONS
The awk utility shall conform to the Base Definitions volume of
POSIX.1‐2008, Section 12.2, Utility Syntax Guidelines.
47
48 The following options shall be supported:
49
50 −F sepstring
51 Define the input field separator. This option shall be equiv‐
52 alent to:
53
54 -v FS=sepstring
55
56 except that if −F sepstring and −v FS=sepstring are both
57 used, it is unspecified whether the FS assignment resulting
58 from −F sepstring is processed in command line order or is
59 processed after the last −v FS=sepstring. See the descrip‐
60 tion of the FS built-in variable, and how it is used, in the
61 EXTENDED DESCRIPTION section.
62
63 −f progfile
64 Specify the pathname of the file progfile containing an awk
65 program. A pathname of '−' shall denote the standard input.
66 If multiple instances of this option are specified, the con‐
67 catenation of the files specified as progfile in the order
68 specified shall be the awk program. The awk program can
69 alternatively be specified in the command line as a single
70 argument.
71
72 −v assignment
73 The application shall ensure that the assignment argument is
74 in the same form as an assignment operand. The specified
75 variable assignment shall occur prior to executing the awk
76 program, including the actions associated with BEGIN patterns
77 (if any). Multiple occurrences of this option can be speci‐
78 fied.
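
A short sketch of the −F and −v options, assuming a passwd-style sample line. The shell variable names are hypothetical. The first two commands are expected to behave identically, since −F sepstring is defined as equivalent to −v FS=sepstring when only one of them is used.

```shell
# -F: and -v FS=: select the same input field separator.
line='root:x:0:0:root:/root:/bin/sh'
out_F=$(printf '%s\n' "$line" | awk -F: '{ print $1, $7 }')
out_v=$(printf '%s\n' "$line" | awk -v FS=: '{ print $1, $7 }')

# A -v assignment is already in effect inside a BEGIN action.
out_begin=$(awk -v n=3 'BEGIN { print n * 2 }')

printf '%s\n' "$out_F" "$out_v" "$out_begin"
```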
79
OPERANDS
The following operands shall be supported:
82
83 program If no −f option is specified, the first operand to awk shall
84 be the text of the awk program. The application shall supply
85 the program operand as a single argument to awk. If the text
86 does not end in a <newline>, awk shall interpret the text as
87 if it did.
88
89 argument Either of the following two types of argument can be inter‐
90 mixed:
91
92 file A pathname of a file that contains the input to be
93 read, which is matched against the set of patterns
94 in the program. If no file operands are specified,
95 or if a file operand is '−', the standard input
96 shall be used.
97
98 assignment
99 An operand that begins with an <underscore> or
100 alphabetic character from the portable character
101 set (see the table in the Base Definitions volume
102 of POSIX.1‐2008, Section 6.1, Portable Character
103 Set), followed by a sequence of underscores, dig‐
104 its, and alphabetics from the portable character
105 set, followed by the '=' character, shall specify a
106 variable assignment rather than a pathname. The
107 characters before the '=' represent the name of an
108 awk variable; if that name is an awk reserved word
109 (see Grammar) the behavior is undefined. The char‐
110 acters following the <equals-sign> shall be inter‐
111 preted as if they appeared in the awk program pre‐
112 ceded and followed by a double-quote ('"') charac‐
113 ter, as a STRING token (see Grammar), except that
114 if the last character is an unescaped <backslash>,
115 it shall be interpreted as a literal <backslash>
116 rather than as the first character of the sequence
"\"". The variable shall be assigned the value of
that STRING token and, if appropriate, shall be
considered a numeric string (see Expressions in
awk), in which case the variable shall also be
assigned its numeric value. Each such variable assignment shall
122 occur just prior to the processing of the following
123 file, if any. Thus, an assignment before the first
124 file argument shall be executed after the BEGIN
125 actions (if any), while an assignment after the
126 last file argument shall occur before the END
127 actions (if any). If there are no file arguments,
128 assignments shall be executed before processing the
129 standard input.
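
The timing of assignment operands can be illustrated with a minimal sketch. The variable name v and the one-line input are assumptions for the example. Because there are no file operands, the assignment takes effect after the BEGIN action and before standard input is read.

```shell
# v=1 is an assignment operand, not a -v option: it is not yet set
# while BEGIN runs, but it is set before the first record is read.
out=$(printf 'x\n' | awk 'BEGIN { print "begin:" v } { print "main:" v }' v=1)
printf '%s\n' "$out"
```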
130
STDIN
The standard input shall be used only if no file operands are
specified, or if a file operand is '−', or if a progfile option-argument
is '−'; see the INPUT FILES section. If the awk program contains no
actions and no patterns, but is otherwise a valid awk program, standard
input and any file operands shall not be read and awk shall exit with a
return status of zero.
138
INPUT FILES
Input files to the awk program from any of the following sources shall
be text files:
142
143 * Any file operands or their equivalents, achieved by modifying the
144 awk variables ARGV and ARGC
145
146 * Standard input in the absence of any file operands
147
148 * Arguments to the getline function
149
150 Whether the variable RS is set to a value other than a <newline> or
151 not, for these files, implementations shall support records terminated
152 with the specified separator up to {LINE_MAX} bytes and may support
153 longer records.
154
155 If −f progfile is specified, the application shall ensure that the
156 files named by each of the progfile option-arguments are text files and
157 their concatenation, in the same order as they appear in the arguments,
158 is an awk program.
159
ENVIRONMENT VARIABLES
The following environment variables shall affect the execution of awk:
162
163 LANG Provide a default value for the internationalization vari‐
164 ables that are unset or null. (See the Base Definitions vol‐
165 ume of POSIX.1‐2008, Section 8.2, Internationalization Vari‐
166 ables for the precedence of internationalization variables
167 used to determine the values of locale categories.)
168
169 LC_ALL If set to a non-empty string value, override the values of
170 all the other internationalization variables.
171
172 LC_COLLATE
173 Determine the locale for the behavior of ranges, equivalence
174 classes, and multi-character collating elements within regu‐
175 lar expressions and in comparisons of string values.
176
177 LC_CTYPE Determine the locale for the interpretation of sequences of
178 bytes of text data as characters (for example, single-byte as
179 opposed to multi-byte characters in arguments and input
180 files), the behavior of character classes within regular
181 expressions, the identification of characters as letters, and
182 the mapping of uppercase and lowercase characters for the
183 toupper and tolower functions.
184
185 LC_MESSAGES
186 Determine the locale that should be used to affect the format
187 and contents of diagnostic messages written to standard
188 error.
189
190 LC_NUMERIC
191 Determine the radix character used when interpreting numeric
192 input, performing conversions between numeric and string val‐
193 ues, and formatting numeric output. Regardless of locale, the
194 <period> character (the decimal-point character of the POSIX
195 locale) is the decimal-point character recognized in process‐
196 ing awk programs (including assignments in command line argu‐
197 ments).
198
199 NLSPATH Determine the location of message catalogs for the processing
200 of LC_MESSAGES.
201
202 PATH Determine the search path when looking for commands executed
203 by system(expr), or input and output pipes; see the Base Def‐
204 initions volume of POSIX.1‐2008, Chapter 8, Environment Vari‐
205 ables.
206
207 In addition, all environment variables shall be visible via the awk
208 variable ENVIRON.
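
As a small sketch of ENVIRON, assuming a hypothetical environment variable named DEMO_COUNT that is set only for this one command:

```shell
# The environment variable is visible as ENVIRON["DEMO_COUNT"];
# adding 1 uses its numeric value.
out=$(DEMO_COUNT=12 awk 'BEGIN { print ENVIRON["DEMO_COUNT"] + 1 }')
printf '%s\n' "$out"
```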
209
ASYNCHRONOUS EVENTS
Default.
212
STDOUT
The nature of the output files depends on the awk program.
215
STDERR
The standard error shall be used only for diagnostic messages.
218
OUTPUT FILES
The nature of the output files depends on the awk program.
221
EXTENDED DESCRIPTION
Overall Program Structure
An awk program is composed of pairs of the form:
225
226 pattern { action }
227
228 Either the pattern or the action (including the enclosing brace charac‐
229 ters) can be omitted.
230
231 A missing pattern shall match any record of input, and a missing action
232 shall be equivalent to:
233
234 { print }
235
236 Execution of the awk program shall start by first executing the actions
237 associated with all BEGIN patterns in the order they occur in the pro‐
238 gram. Then each file operand (or standard input if no files were speci‐
239 fied) shall be processed in turn by reading data from the file until a
240 record separator is seen (<newline> by default). Before the first ref‐
241 erence to a field in the record is evaluated, the record shall be split
242 into fields, according to the rules in Regular Expressions, using the
243 value of FS that was current at the time the record was read. Each pat‐
244 tern in the program then shall be evaluated in the order of occurrence,
245 and the action associated with each pattern that matches the current
246 record executed. The action for a matching pattern shall be executed
247 before evaluating subsequent patterns. Finally, the actions associated
248 with all END patterns shall be executed in the order they occur in the
249 program.
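
The execution order described above can be sketched with a four-rule program. The sample records and the variable name total are assumptions for the example.

```shell
out=$(printf 'apple 3\nbanana 5\ncherry 2\n' | awk '
BEGIN   { print "start" }            # runs before any input is read
$2 > 2  { total += $2 }              # action runs only for matching records
/an/                                 # pattern only: default action { print }
END     { print "total=" total }     # runs after all input is consumed
')
printf '%s\n' "$out"
```

Only "banana 5" matches /an/, and the second fields of the first two records (3 and 5) are summed.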
250
251 Expressions in awk
252 Expressions describe computations used in patterns and actions. In the
253 following table, valid expression operations are given in groups from
254 highest precedence first to lowest precedence last, with equal-prece‐
255 dence operators grouped between horizontal lines. In expression evalua‐
256 tion, where the grammar is formally ambiguous, higher precedence opera‐
257 tors shall be evaluated before lower precedence operators. In this ta‐
258 ble expr, expr1, expr2, and expr3 represent any expression, while
259 lvalue represents any entity that can be assigned to (that is, on the
260 left side of an assignment operator). The precise syntax of expres‐
261 sions is given in Grammar.
262
263 Table 4-1: Expressions in Decreasing Precedence in awk
264
265 ┌─────────────────────┬─────────────────────────┬────────────────┬──────────────┐
266 │ Syntax │ Name │ Type of Result │Associativity │
267 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
268 │( expr ) │Grouping │Type of expr │N/A │
269 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
270 │$expr │Field reference │String │N/A │
271 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
272 │lvalue ++ │Post-increment │Numeric │N/A │
273 │lvalue −− │Post-decrement │Numeric │N/A │
274 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
275 │++ lvalue │Pre-increment │Numeric │N/A │
276 │−− lvalue │Pre-decrement │Numeric │N/A │
277 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
278 │expr ^ expr │Exponentiation │Numeric │Right │
279 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
280 │! expr │Logical not │Numeric │N/A │
281 │+ expr │Unary plus │Numeric │N/A │
282 │− expr │Unary minus │Numeric │N/A │
283 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
284 │expr * expr │Multiplication │Numeric │Left │
285 │expr / expr │Division │Numeric │Left │
286 │expr % expr │Modulus │Numeric │Left │
287 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
288 │expr + expr │Addition │Numeric │Left │
289 │expr − expr │Subtraction │Numeric │Left │
290 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
291 │expr expr │String concatenation │String │Left │
292 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
293 │expr < expr │Less than │Numeric │None │
294 │expr <= expr │Less than or equal to │Numeric │None │
295 │expr != expr │Not equal to │Numeric │None │
296 │expr == expr │Equal to │Numeric │None │
297 │expr > expr │Greater than │Numeric │None │
298 │expr >= expr │Greater than or equal to │Numeric │None │
299 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
300 │expr ~ expr │ERE match │Numeric │None │
301 │expr !~ expr │ERE non-match │Numeric │None │
302 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
303 │expr in array │Array membership │Numeric │Left │
304 │( index ) in array │Multi-dimension array │Numeric │Left │
305 │ │membership │ │ │
306 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
307 │expr && expr │Logical AND │Numeric │Left │
308 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
309 │expr || expr │Logical OR │Numeric │Left │
310 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
311 │expr1 ? expr2 : expr3│Conditional expression │Type of selected│Right │
312 │ │ │expr2 or expr3 │ │
313 ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
314 │lvalue ^= expr │Exponentiation assignment│Numeric │Right │
315 │lvalue %= expr │Modulus assignment │Numeric │Right │
316 │lvalue *= expr │Multiplication assignment│Numeric │Right │
317 │lvalue /= expr │Division assignment │Numeric │Right │
318 │lvalue += expr │Addition assignment │Numeric │Right │
319 │lvalue −= expr │Subtraction assignment │Numeric │Right │
320 │lvalue = expr │Assignment │Type of expr │Right │
321 └─────────────────────┴─────────────────────────┴────────────────┴──────────────┘
322 Each expression shall have either a string value, a numeric value, or
323 both. Except as stated for specific contexts, the value of an expres‐
324 sion shall be implicitly converted to the type needed for the context
325 in which it is used. A string value shall be converted to a numeric
326 value either by the equivalent of the following calls to functions
327 defined by the ISO C standard:
328
329 setlocale(LC_NUMERIC, "");
330 numeric_value = atof(string_value);
331
332 or by converting the initial portion of the string to type double rep‐
333 resentation as follows:
334
335 The input string is decomposed into two parts: an initial, pos‐
336 sibly empty, sequence of white-space characters (as specified by
337 isspace()) and a subject sequence interpreted as a floating-
338 point constant.
339
340 The expected form of the subject sequence is an optional '+' or
341 '−' sign, then a non-empty sequence of digits optionally con‐
342 taining a <period>, then an optional exponent part. An exponent
343 part consists of 'e' or 'E', followed by an optional sign, fol‐
344 lowed by one or more decimal digits.
345
346 The sequence starting with the first digit or the <period>
347 (whichever occurs first) is interpreted as a floating constant
348 of the C language, and if neither an exponent part nor a
349 <period> appears, a <period> is assumed to follow the last digit
350 in the string. If the subject sequence begins with a minus-sign,
351 the value resulting from the conversion is negated.
352
353 A numeric value that is exactly equal to the value of an integer (see
354 Section 1.1.2, Concepts Derived from the ISO C Standard) shall be con‐
355 verted to a string by the equivalent of a call to the sprintf function
356 (see String Functions) with the string "%d" as the fmt argument and the
357 numeric value being converted as the first and only expr argument. Any
358 other numeric value shall be converted to a string by the equivalent of
359 a call to the sprintf function with the value of the variable CONVFMT
360 as the fmt argument and the numeric value being converted as the first
361 and only expr argument. The result of the conversion is unspecified if
362 the value of CONVFMT is not a floating-point format specification. This
363 volume of POSIX.1‐2008 specifies no explicit conversions between num‐
364 bers and strings. An application can force an expression to be treated
365 as a number by adding zero to it, or can force it to be treated as a
366 string by concatenating the null string ("") to it.
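
The conversion rules above might be exercised as in the following sketch. The variable names and the CONVFMT value "%.2f" are chosen only for illustration.

```shell
out=$(awk 'BEGIN {
    CONVFMT = "%.2f"
    a = 17;      s1 = a ""      # integer value: converted as if by "%d"
    b = 3.14159; s2 = b ""      # non-integer value: converted using CONVFMT
    c = "42abc"; n  = c + 0     # adding zero forces numeric interpretation
    print s1, s2, n
}')
printf '%s\n' "$out"
```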
367
368 A string value shall be considered a numeric string if it comes from
369 one of the following:
370
371 1. Field variables
372
373 2. Input from the getline() function
374
375 3. FILENAME
376
377 4. ARGV array elements
378
379 5. ENVIRON array elements
380
381 6. Array elements created by the split() function
382
383 7. A command line variable assignment
384
385 8. Variable assignment from another numeric string variable
386
387 and an implementation-dependent condition corresponding to either case
388 (a) or (b) below is met.
389
390 a. After the equivalent of the following calls to functions defined by
391 the ISO C standard, string_value_end would differ from
392 string_value, and any characters before the terminating null char‐
393 acter in string_value_end would be <blank> characters:
394
395 char *string_value_end;
396 setlocale(LC_NUMERIC, "");
397 numeric_value = strtod (string_value, &string_value_end);
398
399 b. After all the following conversions have been applied, the result‐
400 ing string would lexically be recognized as a NUMBER token as
401 described by the lexical conventions in Grammar:
402
403 -- All leading and trailing <blank> characters are discarded.
404
405 -- If the first non-<blank> is '+' or '−', it is discarded.
406
-- Each occurrence of the decimal point character from the current
   locale is changed to a <period>.

In case (a) the numeric value of the numeric string shall be the value
that would be returned by the strtod() call. In case (b) if the first
non-<blank> is '−', the numeric value of the numeric string shall be
the negation of the numeric value of the recognized NUMBER token;
otherwise, the numeric value of the numeric string shall be the numeric
value of the recognized NUMBER token. Whether or not a string is a
numeric string shall be relevant only in contexts where that term is
used in this section.
417
418 When an expression is used in a Boolean context, if it has a numeric
419 value, a value of zero shall be treated as false and any other value
420 shall be treated as true. Otherwise, a string value of the null string
421 shall be treated as false and any other value shall be treated as true.
422 A Boolean context shall be one of the following:
423
424 * The first subexpression of a conditional expression
425
426 * An expression operated on by logical NOT, logical AND, or logical
427 OR
428
429 * The second expression of a for statement
430
431 * The expression of an if statement
432
433 * The expression of the while clause in either a while or do...while
434 statement
435
436 * An expression used as a pattern (as in Overall Program Structure)
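
A brief sketch of the Boolean-context rules, using if statements. The variable names r and unset are hypothetical, and unset is deliberately never assigned.

```shell
out=$(awk 'BEGIN {
    if (0)      r = r "a"      # numeric zero: false
    if (2.5)    r = r "b"      # nonzero number: true
    if ("")     r = r "c"      # null string: false
    if ("0")    r = r "d"      # non-null string constant: true, even "0"
    if (!unset) r = r "e"      # uninitialized value is false, so !unset is true
    print r
}')
printf '%s\n' "$out"
```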
437
438 All arithmetic shall follow the semantics of floating-point arithmetic
439 as specified by the ISO C standard (see Section 1.1.2, Concepts Derived
440 from the ISO C Standard).
441
442 The value of the expression:
443
444 expr1 ^ expr2
445
446 shall be equivalent to the value returned by the ISO C standard func‐
447 tion call:
448
449 pow(expr1, expr2)
450
451 The expression:
452
453 lvalue ^= expr
454
455 shall be equivalent to the ISO C standard expression:
456
457 lvalue = pow(lvalue, expr)
458
459 except that lvalue shall be evaluated only once. The value of the
460 expression:
461
462 expr1 % expr2
463
464 shall be equivalent to the value returned by the ISO C standard func‐
465 tion call:
466
467 fmod(expr1, expr2)
468
469 The expression:
470
471 lvalue %= expr
472
473 shall be equivalent to the ISO C standard expression:
474
475 lvalue = fmod(lvalue, expr)
476
477 except that lvalue shall be evaluated only once.
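
The pow() and fmod() equivalences can be checked with a short sketch. The operand values are arbitrary. Note that fmod() keeps the sign of its first argument.

```shell
out=$(awk 'BEGIN {
    x = 2
    x ^= 10                     # same as x = pow(x, 10)
    print x, -7 % 3, 7.5 % 2    # fmod(-7, 3) is -1; fmod(7.5, 2) is 1.5
}')
printf '%s\n' "$out"
```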
478
479 Variables and fields shall be set by the assignment statement:
480
481 lvalue = expression
482
483 and the type of expression shall determine the resulting variable type.
484 The assignment includes the arithmetic assignments ("+=", "−=", "*=",
485 "/=", "%=", "^=", "++", "−−") all of which shall produce a numeric
486 result. The left-hand side of an assignment and the target of increment
487 and decrement operators can be one of a variable, an array with index,
488 or a field selector.
489
490 The awk language supplies arrays that are used for storing numbers or
491 strings. Arrays need not be declared. They shall initially be empty,
492 and their sizes shall change dynamically. The subscripts, or element
493 identifiers, are strings, providing a type of associative array capa‐
494 bility. An array name followed by a subscript within square brackets
495 can be used as an lvalue and thus as an expression, as described in the
496 grammar; see Grammar. Unsubscripted array names can be used in only
497 the following contexts:
498
499 * A parameter in a function definition or function call
500
501 * The NAME token following any use of the keyword in as specified in
502 the grammar (see Grammar); if the name used in this context is not
503 an array name, the behavior is undefined
504
505 A valid array index shall consist of one or more <comma>-separated
506 expressions, similar to the way in which multi-dimensional arrays are
507 indexed in some programming languages. Because awk arrays are really
508 one-dimensional, such a <comma>-separated list shall be converted to a
509 single string by concatenating the string values of the separate
510 expressions, each separated from the other by the value of the SUBSEP
511 variable. Thus, the following two index operations shall be equivalent:
512
513 var[expr1, expr2, ... exprn]
514
515 var[expr1 SUBSEP expr2 SUBSEP ... SUBSEP exprn]
516
517 The application shall ensure that a multi-dimensioned index used with
518 the in operator is parenthesized. The in operator, which tests for the
519 existence of a particular array element, shall not cause that element
520 to exist. Any other reference to a nonexistent array element shall
521 automatically create it.
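
The array rules above can be sketched as follows. The array name cell and its subscripts are assumptions for the example.

```shell
out=$(awk 'BEGIN {
    cell["r1", "c2"] = "x"                       # stored under "r1" SUBSEP "c2"
    print cell["r1" SUBSEP "c2"]                 # the equivalent single subscript
    if (("r1", "c2") in cell)    print "found"   # parenthesized index with "in"
    if (!(("r9", "c9") in cell)) print "absent"
    n = 0
    for (k in cell) n++                          # the "in" tests created nothing
    print n
}')
printf '%s\n' "$out"
```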
522
523 Comparisons (with the '<', "<=", "!=", "==", '>', and ">=" operators)
524 shall be made numerically if both operands are numeric, if one is
525 numeric and the other has a string value that is a numeric string, or
526 if one is numeric and the other has the uninitialized value. Other‐
527 wise, operands shall be converted to strings as required and a string
528 comparison shall be made using the locale-specific collation sequence.
529 The value of the comparison expression shall be 1 if the relation is
530 true, or 0 if the relation is false.
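
As an illustration of the comparison rules, the sketch below compares the same two fields numerically and then as strings. The input and variable names are assumptions. Concatenating the null string is used to force a string comparison.

```shell
out=$(printf '10 9\n' | awk '{
    numeric = ($1 > $2)            # both fields are numeric strings: 10 > 9
    textual = ($1 "" > $2 "")      # strings: "10" sorts before "9"
    print numeric, textual
}')
printf '%s\n' "$out"
```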
531
532 Variables and Special Variables
533 Variables can be used in an awk program by referencing them. With the
534 exception of function parameters (see User-Defined Functions), they are
535 not explicitly declared. Function parameter names shall be local to the
536 function; all other variable names shall be global. The same name shall
537 not be used as both a function parameter name and as the name of a
538 function or a special awk variable. The same name shall not be used
539 both as a variable name with global scope and as the name of a func‐
540 tion. The same name shall not be used within the same scope both as a
541 scalar variable and as an array. Uninitialized variables, including
542 scalar variables, array elements, and field variables, shall have an
543 uninitialized value. An uninitialized value shall have both a numeric
544 value of zero and a string value of the empty string. Evaluation of
545 variables with an uninitialized value, to either string or numeric,
546 shall be determined by the context in which they are used.
547
548 Field variables shall be designated by a '$' followed by a number or
549 numerical expression. The effect of the field number expression evalu‐
550 ating to anything other than a non-negative integer is unspecified;
551 uninitialized variables or string values need not be converted to
552 numeric values in this context. New field variables can be created by
assigning a value to them. References to nonexistent fields (that is,
fields after $NF) shall evaluate to the uninitialized value. Such
references shall not create new fields. However, assigning to a
nonexistent field (for example, $(NF+2)=5) shall increase the value of NF;
557 create any intervening fields with the uninitialized value; and cause
558 the value of $0 to be recomputed, with the fields being separated by
559 the value of OFS. Each field variable shall have a string value or an
560 uninitialized value when created. Field variables shall have the unini‐
561 tialized value when created from $0 using FS and the variable does not
562 contain any characters. If appropriate, the field variable shall be
563 considered a numeric string (see Expressions in awk).
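
The difference between referencing and assigning a nonexistent field can be sketched as follows. The input record and the OFS value "-" are chosen only to make the rebuilt $0 easy to read.

```shell
out=$(printf 'a b c\n' | awk 'BEGIN { OFS = "-" } {
    x = $(NF + 1)              # reference only: no field is created
    n1 = NF                    # still 3
    $(NF + 2) = "e"            # creates an empty $4 and $5, rebuilds $0 with OFS
    print n1, NF
    print $0
}')
printf '%s\n' "$out"
```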
564
565 Implementations shall support the following other special variables
566 that are set by awk:
567
568 ARGC The number of elements in the ARGV array.
569
570 ARGV An array of command line arguments, excluding options and the
571 program argument, numbered from zero to ARGC−1.
572
573 The arguments in ARGV can be modified or added to; ARGC can
574 be altered. As each input file ends, awk shall treat the next
575 non-null element of ARGV, up to the current value of ARGC−1,
576 inclusive, as the name of the next input file. Thus, setting
577 an element of ARGV to null means that it shall not be treated
578 as an input file. The name '−' indicates the standard input.
579 If an argument matches the format of an assignment operand,
580 this argument shall be treated as an assignment rather than a
581 file argument.
582
583 CONVFMT The printf format for converting numbers to strings (except
584 for output statements, where OFMT is used); "%.6g" by
585 default.
586
587 ENVIRON An array representing the value of the environment, as
588 described in the exec functions defined in the System Inter‐
589 faces volume of POSIX.1‐2008. The indices of the array shall
590 be strings consisting of the names of the environment vari‐
591 ables, and the value of each array element shall be a string
592 consisting of the value of that variable. If appropriate, the
593 environment variable shall be considered a numeric string
594 (see Expressions in awk); the array element shall also have
595 its numeric value.
596
597 In all cases where the behavior of awk is affected by envi‐
598 ronment variables (including the environment of any commands
599 that awk executes via the system function or via pipeline
600 redirections with the print statement, the printf statement,
601 or the getline function), the environment used shall be the
602 environment at the time awk began executing; it is implemen‐
603 tation-defined whether any modification of ENVIRON affects
604 this environment.
605
606 FILENAME A pathname of the current input file. Inside a BEGIN action
607 the value is undefined. Inside an END action the value shall
608 be the name of the last input file processed.
609
610 FNR The ordinal number of the current record in the current file.
611 Inside a BEGIN action the value shall be zero. Inside an END
612 action the value shall be the number of the last record pro‐
613 cessed in the last file processed.
614
615 FS Input field separator regular expression; a <space> by
616 default.
617
618 NF The number of fields in the current record. Inside a BEGIN
619 action, the use of NF is undefined unless a getline function
620 without a var argument is executed previously. Inside an END
621 action, NF shall retain the value it had for the last record
622 read, unless a subsequent, redirected, getline function with‐
623 out a var argument is performed prior to entering the END
624 action.
625
626 NR The ordinal number of the current record from the start of
627 input. Inside a BEGIN action the value shall be zero. Inside
628 an END action the value shall be the number of the last
629 record processed.
630
631 OFMT The printf format for converting numbers to strings in output
632 statements (see Output Statements); "%.6g" by default. The
633 result of the conversion is unspecified if the value of OFMT
634 is not a floating-point format specification.
635
636 OFS The print statement output field separator; <space> by
637 default.
638
639 ORS The print statement output record separator; a <newline> by
640 default.
641
642 RLENGTH The length of the string matched by the match function.
643
644 RS The first character of the string value of RS shall be the
645 input record separator; a <newline> by default. If RS con‐
646 tains more than one character, the results are unspecified.
647 If RS is null, then records are separated by sequences con‐
648 sisting of a <newline> plus one or more blank lines, leading
649 or trailing blank lines shall not result in empty records at
650 the beginning or end of the input, and a <newline> shall
651 always be a field separator, no matter what the value of FS
652 is.
653
654 RSTART The starting position of the string matched by the match
655 function, numbering from 1. This shall always be equivalent
656 to the return value of the match function.
657
658 SUBSEP The subscript separator string for multi-dimensional arrays;
659 the default value is implementation-defined.
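
Several of the special variables above can be observed with a sketch like the following. The operand names (first.txt, v=1), the sample input, and the shell variable names are assumptions for the example. The default FS is kept so that the output does not depend on any other setting.

```shell
# ARGC/ARGV: options and the program text are excluded. Only a BEGIN
# action is present, so the named operands are never opened.
out_args=$(awk 'BEGIN {
    print ARGC
    for (i = 1; i < ARGC; i++) print i ": " ARGV[i]
}' first.txt v=1)

# RS = "": paragraphs separated by blank lines become records, and a
# leading blank line does not produce an empty record.
out_para=$(printf '\nname Ann\ncity Oslo\n\n\nname Bob\ncity Rome\n' |
    awk 'BEGIN { RS = "" } { print NR, NF, $4 }')

# RSTART and RLENGTH are set as a side effect of match().
out_match=$(awk 'BEGIN {
    pos = match("foobar", /o+/)
    print pos, RSTART, RLENGTH
}')

printf '%s\n' "$out_args" "$out_para" "$out_match"
```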
660
661 Regular Expressions
662 The awk utility shall make use of the extended regular expression nota‐
663 tion (see the Base Definitions volume of POSIX.1‐2008, Section 9.4,
664 Extended Regular Expressions) except that it shall allow the use of C-
665 language conventions for escaping special characters within the EREs,
666 as specified in the table in the Base Definitions volume of
667 POSIX.1‐2008, Chapter 5, File Format Notation ('\\', '\a', '\b', '\f',
668 '\n', '\r', '\t', '\v') and the following table; these escape sequences
669 shall be recognized both inside and outside bracket expressions. Note
670 that records need not be separated by <newline> characters and string
671 constants can contain <newline> characters, so even the "\n" sequence
672 is valid in awk EREs. Using a <slash> character within an ERE requires
673 the escaping shown in the following table.
674
675 Table 4-2: Escape Sequences in awk
676
677 ┌─────────┬────────────────────────────────────┬────────────────────────────────────┐
678 │ Escape │ │ │
679 │Sequence │ Description │ Meaning │
680 ├─────────┼────────────────────────────────────┼────────────────────────────────────┤
681 │\" │ <backslash> <quotation-mark> │ <quotation-mark> character │
682 ├─────────┼────────────────────────────────────┼────────────────────────────────────┤
683 │\/ │ <backslash> <slash> │ <slash> character │
684 ├─────────┼────────────────────────────────────┼────────────────────────────────────┤
685 │\ddd │ A <backslash> character followed │ The character whose encoding is │
686 │ │ by the longest sequence of one, │ represented by the one, two, or │
687 │ │ two, or three octal-digit charac‐ │ three-digit octal integer. Multi- │
688 │ │ ters (01234567). If all of the │ byte characters require multiple, │
689 │ │ digits are 0 (that is, representa‐ │ concatenated escape sequences of │
690 │ │ tion of the NUL character), the │ this type, including the leading │
691 │ │ behavior is undefined. │ <backslash> for each byte. │
692 ├─────────┼────────────────────────────────────┼────────────────────────────────────┤
693 │\c │ A <backslash> character followed │ Undefined │
694 │ │ by any character not described in │ │
695 │ │ this table or in the table in the │ │
696 │ │ Base Definitions volume of │ │
697 │ │ POSIX.1‐2008, Chapter 5, File For‐ │ │
698 │ │ mat Notation ('\\', '\a', '\b', │ │
699 │ │ '\f', '\n', '\r', '\t', '\v'). │ │
700 └─────────┴────────────────────────────────────┴────────────────────────────────────┘
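
As an illustrative sketch (not part of the standard text), the following shell fragment exercises the \/ and \ddd escapes from the table above. The function name esc_demo is hypothetical, and the octal value 101 is assumed to encode "A", which holds only in an ASCII-compatible character set.

```shell
# esc_demo is a hypothetical helper name used only for this example.
esc_demo() {
    # \/ matches a literal slash inside the ERE; "\101" is octal for "A"
    # in an ASCII-compatible codeset (an assumption of this sketch).
    printf '/usr/bin\n/opt/bin\n' |
        awk '/^\/usr\// { print $0 "\t" "\101" }'
}
esc_demo    # prints "/usr/bin", a <tab>, then "A"
```

Only the first input line is printed, because the second does not begin with /usr/.
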
701 A regular expression can be matched against a specific field or string
702 by using one of the two regular expression matching operators, '~' and
703 "!~". These operators shall interpret their right-hand operand as a
704 regular expression and their left-hand operand as a string. If the reg‐
705 ular expression matches the string, the '~' expression shall evaluate
706 to a value of 1, and the "!~" expression shall evaluate to a value of
707 0. (The regular expression matching operation is as defined by the term
708 matched in the Base Definitions volume of POSIX.1‐2008, Section 9.1,
709 Regular Expression Definitions, where a match occurs on any part of the
710 string unless the regular expression is limited with the <circumflex>
711 or <dollar-sign> special characters.) If the regular expression does
712 not match the string, the '~' expression shall evaluate to a value of
713 0, and the "!~" expression shall evaluate to a value of 1. If the
714 right-hand operand is any expression other than the lexical token ERE,
715 the string value of the expression shall be interpreted as an extended
716 regular expression, including the escape conventions described above.
717 Note that these same escape conventions shall also be applied in deter‐
718 mining the value of a string literal (the lexical token STRING), and
719 thus shall be applied a second time when a string literal is used in
720 this context.
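
The double application of escapes is easy to get wrong, so a small sketch may help. It compares an ERE token with string literals used as the right-hand operand; match_demo is a hypothetical helper name and the input words are made up for the example.

```shell
# match_demo is a hypothetical helper name used only for this example.
match_demo() {
    echo 'foo.bar fooxbar' | awk '{
        a = ($1 ~ /o\.b/)      # ERE token: \. is a literal period
        b = ($1 !~ /o\.b/)     # negated form of the same test
        c = ($2 ~ "o\\.b")     # string literal: escapes are processed twice
        d = ($2 ~ "o.b")       # unescaped period matches any character
        print a, b, c, d
    }'
}
match_demo    # prints: 1 0 0 1
```

In the string-literal case the first <backslash> is consumed when the STRING token is evaluated, leaving \. for the regular expression itself.
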
721
722 When an ERE token appears as an expression in any context other than as
723 the right-hand of the '~' or "!~" operator or as one of the built-in
724 function arguments described below, the value of the resulting expres‐
725 sion shall be the equivalent of:
726
727 $0 "~" /ere/
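
A brief sketch of this rule, under the assumption that the input contains the two made-up records shown: a bare ERE behaves like a match against $0, whether it is used as a pattern or as an ordinary expression. The name ere_demo is hypothetical.

```shell
# ere_demo is a hypothetical helper name used only for this example.
ere_demo() {
    printf 'error: disk\nok\n' | awk '
        /error/          { bare++ }       # bare ERE used as a pattern
        $0 ~ /error/     { explicit++ }   # the equivalent explicit form
        { hit = /ok/; flags = flags hit } # ERE as an ordinary expression
        END { print bare, explicit, flags }'
}
ere_demo    # prints: 1 1 01
```
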
728
729 The ere argument to the gsub, match, and sub functions, and the fs argument
730 to the split function (see String Functions), shall be interpreted as
731 extended regular expressions. These can be either ERE tokens or arbi‐
732 trary expressions, and shall be interpreted in the same manner as the
733 right-hand side of the '~' or "!~" operator.
734
735 An extended regular expression can be used to separate fields by
736 assigning a string containing the expression to the built-in variable
737 FS, either directly or as a consequence of using the −F sepstring
738 option. The default value of the FS variable shall be a single
739 <space>. The following describes FS behavior:
740
741 1. If FS is a null string, the behavior is unspecified.
742
743 2. If FS is a single character:
744
745 a. If FS is <space>, skip leading and trailing <blank> and <new‐
746 line> characters; fields shall be delimited by sets of one or
747 more <blank> or <newline> characters.
748
749 b. Otherwise, if FS is any other character c, fields shall be
750 delimited by each single occurrence of c.
751
752 3. Otherwise, the string value of FS shall be considered to be an
753 extended regular expression. Each occurrence of a sequence matching
754 the extended regular expression shall delimit fields.
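
The three defined FS cases above can be sketched as follows. This is only an illustration with made-up input; fs_demo is a hypothetical helper name.

```shell
# fs_demo is a hypothetical helper name used only for this example.
fs_demo() {
    # Default FS (<space>): runs of blanks delimit, edges are skipped.
    printf '  a   b\tc \n' | awk '{ print NF ":" $1 "," $3 }'
    # Single character other than <space>: every occurrence delimits,
    # so the empty string between the two colons is a field of its own.
    printf 'a::b\n' | awk -F: '{ print NF }'
    # Anything longer than one character is treated as an ERE.
    printf 'a1b22c\n' | awk -F'[0-9]+' '{ print $1 $2 $3 }'
}
fs_demo    # prints 3:a,c then 3 then abc
```
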
755
756 Except for the '~' and "!~" operators, and in the gsub, match, split,
757 and sub built-in functions, ERE matching shall be based on input
758 records; that is, record separator characters (the first character of
759 the value of the variable RS, <newline> by default) cannot be embedded
760 in the expression, and no expression shall match the record separator
761 character. If the record separator is not <newline>, <newline> charac‐
762 ters embedded in the expression can be matched. For the '~' and "!~"
763 operators, and in those four built-in functions, ERE matching shall be
764 based on text strings; that is, any character (including <newline> and
765 the record separator) can be embedded in the pattern, and an appropri‐
766 ate pattern shall match any character. However, in all awk ERE match‐
767 ing, the use of one or more NUL characters in the pattern, input
768 record, or text string produces undefined results.
769
770 Patterns
771 A pattern is any valid expression, a range specified by two expressions
772 separated by a comma, or one of the two special patterns BEGIN or END.
773
774 Special Patterns
775 The awk utility shall recognize two special patterns, BEGIN and END.
776 Each BEGIN pattern shall be matched once and its associated action exe‐
777 cuted before the first record of input is read—except possibly by use
778 of the getline function (see Input/Output and General Functions) in a
779 prior BEGIN action—and before command line assignment is done. Each END
780 pattern shall be matched once and its associated action executed after
781 the last record of input has been read. These two patterns shall have
782 associated actions.
783
784 BEGIN and END shall not combine with other patterns. Multiple BEGIN and
785 END patterns shall be allowed. The actions associated with the BEGIN
786 patterns shall be executed in the order specified in the program, as
787 are the END actions. An END pattern can precede a BEGIN pattern in a
788 program.
789
790 If an awk program consists of only actions with the pattern BEGIN, and
791 the BEGIN action contains no getline function, awk shall exit without
792 reading its input when the last statement in the last BEGIN action is
793 executed. If an awk program consists of only actions with the pattern
794 END or only actions with the patterns BEGIN and END, the input shall be
795 read before the statements in the END actions are executed.
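
A minimal sketch of the ordering rules for the special patterns, assuming two lines of made-up input. The name begin_end_demo is hypothetical, and v=1 is an ordinary command line assignment operand.

```shell
# begin_end_demo is a hypothetical helper name used only for this example.
begin_end_demo() {
    # END may precede BEGIN in the source; BEGIN still runs first.
    printf 'a\nb\n' |
        awk 'END { print "end", NR } BEGIN { print "begin", NR }'
    # BEGIN runs before the command line assignment v=1 takes effect.
    awk 'BEGIN { print "[" v "]" } END { print "[" v "]" }' v=1 /dev/null
}
begin_end_demo    # prints "begin 0", "end 2", "[]", "[1]"
```
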
796
797 Expression Patterns
798 An expression pattern shall be evaluated as if it were an expression in
799 a Boolean context. If the result is true, the pattern shall be consid‐
800 ered to match, and the associated action (if any) shall be executed. If
801 the result is false, the action shall not be executed.
802
803 Pattern Ranges
804 A pattern range consists of two expressions separated by a comma; in
805 this case, the action shall be performed for all records between a
806 match of the first expression and the following match of the second
807 expression, inclusive. At this point, the pattern range can be repeated
808 starting at input records subsequent to the end of the matched range.
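
As a sketch of an inclusive range that repeats, the following uses a made-up input with two start/stop pairs; range_demo is a hypothetical helper name.

```shell
# range_demo is a hypothetical helper name used only for this example.
range_demo() {
    printf '1\nstart\n2\nstop\n3\nstart\n4\nstop\n5\n' |
        awk '/start/,/stop/'
}
range_demo    # prints start, 2, stop, start, 4, stop (one per line)
```

Records 1, 3, and 5 fall outside both ranges and are not printed; with no action given, matching records are printed unchanged.
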
809
810 Actions
811 An action is a sequence of statements as shown in the grammar in Gram‐
812 mar. Any single statement can be replaced by a statement list enclosed
813 in curly braces. The application shall ensure that statements in a
814 statement list are separated by <newline> or <semicolon> characters.
815 Statements in a statement list shall be executed sequentially in the
816 order that they appear.
817
818 The expression acting as the conditional in an if statement shall be
819 evaluated and if it is non-zero or non-null, the following statement
820 shall be executed; otherwise, if else is present, the statement follow‐
821 ing the else shall be executed.
822
823 The if, while, do...while, for, break, and continue statements are
824 based on the ISO C standard (see Section 1.1.2, Concepts Derived from
825 the ISO C Standard), except that the Boolean expressions shall be
826 treated as described in Expressions in awk, and except in the case of:
827
828 for (variable in array)
829
830 which shall iterate, assigning each index of array to variable in an
831 unspecified order. The results of adding new elements to array within
832 such a for loop are undefined. If a break or continue statement occurs
833 outside of a loop, the behavior is undefined.
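
Because the iteration order is unspecified, a script that needs stable output has to impose an order itself. One possible sketch, with the hypothetical helper name forin_demo, pipes the result through sort:

```shell
# forin_demo is a hypothetical helper name used only for this example.
forin_demo() {
    printf 'b\na\nb\n' |
        awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' |
        sort    # for (k in count) visits indices in an unspecified order
}
forin_demo    # prints "a 1" then "b 2"
```
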
834
835 The delete statement shall remove an individual array element. Thus,
836 the following code deletes an entire array:
837
838 for (index in array)
839 delete array[index]
840
841 The next statement shall cause all further processing of the current
842 input record to be abandoned. The behavior is undefined if a next
843 statement appears or is invoked in a BEGIN or END action.
844
845 The exit statement shall invoke all END actions in the order in which
846 they occur in the program source and then terminate the program without
847 reading further input. An exit statement inside an END action shall
848 terminate the program without further execution of END actions. If an
849 expression is specified in an exit statement, its numeric value shall
850 be the exit status of awk, unless subsequent errors are encountered or
851 a subsequent exit statement with an expression is executed.
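
The interaction between exit, END actions, and the exit status can be sketched as below. The helper name exit_demo and the status value 7 are arbitrary choices for the example.

```shell
# exit_demo is a hypothetical helper name used only for this example.
exit_demo() {
    # exit stops input processing at record 2, but END still runs.
    printf '1\n2\n3\n' |
        awk '$1 == 2 { exit 7 } { print } END { print "END ran" }'
}
status=0
exit_demo || status=$?      # prints "1" then "END ran"
echo "status=$status"       # prints: status=7
```
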
852
853 Output Statements
854 Both print and printf statements shall write to standard output by
855 default. The output shall be written to the location specified by out‐
856 put_redirection if one is supplied, as follows:
857
858 > expression
859 >> expression
860 | expression
861
862 In all cases, the expression shall be evaluated to produce a string
863 that is used as a pathname into which to write (for '>' or ">>") or as
864 a command to be executed (for '|'). Using the first two forms, if the
865 file of that name is not currently open, it shall be opened, creating
866 it if necessary and using the first form, truncating the file. The out‐
867 put then shall be appended to the file. As long as the file remains
868 open, subsequent calls in which expression evaluates to the same string
869 value shall simply append output to the file. The file remains open
870 until the close function (see Input/Output and General Functions) is
871 called with an expression that evaluates to the same string value.
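
A small sketch of the truncate-once behavior of '>' follows. It assumes the mktemp utility is available (it is not specified by POSIX), and redirect_demo is a hypothetical helper name. The pathname is built in a variable first so that the redirection target is a single unambiguous expression.

```shell
# redirect_demo is a hypothetical helper name used only for this example.
redirect_demo() {
    dir=$(mktemp -d) || return 1    # mktemp is an assumption of this sketch
    printf 'a 1\nb 2\na 3\n' | awk -v dir="$dir" '{
        file = dir "/" $1 ".txt"   # build the pathname first
        print $2 > file            # truncated only when first opened
    }
    END { close(dir "/a.txt"); close(dir "/b.txt") }'
    cat "$dir/a.txt" "$dir/b.txt"
    rm -rf "$dir"
}
redirect_demo    # prints 1, 3, 2 on separate lines
```

The second write to a.txt appends rather than truncating, because the file was still open under the same string value.
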
872
873 The third form shall write output onto a stream piped to the input of a
874 command. The stream shall be created if no stream is currently open
875 with the value of expression as its command name. The stream created
876 shall be equivalent to one created by a call to the popen() function
877 defined in the System Interfaces volume of POSIX.1‐2008 with the value
878 of expression as the command argument and a value of w as the mode
879 argument. As long as the stream remains open, subsequent calls in which
880 expression evaluates to the same string value shall write output to the
881 existing stream. The stream shall remain open until the close function
882 (see Input/Output and General Functions) is called with an expression
883 that evaluates to the same string value. At that time, the stream
884 shall be closed as if by a call to the pclose() function defined in the
885 System Interfaces volume of POSIX.1‐2008.
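
For the pipe form, a minimal sketch (hypothetical helper name pipe_demo) sends every record to a single sort process and closes the stream with the same string that opened it:

```shell
# pipe_demo is a hypothetical helper name used only for this example.
pipe_demo() {
    # All three print statements write to the same "sort" stream;
    # close("sort") flushes it and waits for the command to finish.
    printf 'c\na\nb\n' | awk '{ print | "sort" } END { close("sort") }'
}
pipe_demo    # prints a, b, c on separate lines
```
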
886
887 As described in detail by the grammar in Grammar, these output state‐
888 ments shall take a <comma>-separated list of expressions referred to in
889 the grammar by the non-terminal symbols expr_list, print_expr_list, or
890 print_expr_list_opt. This list is referred to here as the expression
891 list, and each member is referred to as an expression argument.
892
893 The print statement shall write the value of each expression argument
894 onto the indicated output stream separated by the current output field
895 separator (see variable OFS above), and terminated by the output record
896 separator (see variable ORS above). All expression arguments shall be
897 taken as strings, being converted if necessary; this conversion shall
898 be as described in Expressions in awk, with the exception that the
899 printf format in OFMT shall be used instead of the value in CONVFMT.
900 An empty expression list shall stand for the whole input record ($0).
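
The roles of OFS, ORS, and OFMT in print can be illustrated with the sketch below, which also contrasts OFMT with the CONVFMT conversion used for concatenation. The separator values and the name print_demo are arbitrary choices for the example.

```shell
# print_demo is a hypothetical helper name used only for this example.
print_demo() {
    awk 'BEGIN {
        OFS = "-"; ORS = "!\n"; OFMT = "%.2f"
        print "a", 3.14159, 42    # non-integer number goes through OFMT
        s = 3.14159 ""            # concatenation converts via CONVFMT
        print s                   # already a string, so OFMT is not used
    }'
}
print_demo    # prints "a-3.14-42!" then "3.14159!"
```
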
901
902 The printf statement shall produce output based on a notation similar
903 to the File Format Notation used to describe file formats in this vol‐
904 ume of POSIX.1‐2008 (see the Base Definitions volume of POSIX.1‐2008,
905 Chapter 5, File Format Notation). Output shall be produced as speci‐
906 fied with the first expression argument as the string format and subse‐
907 quent expression arguments as the strings arg1 to argn, inclusive, with
908 the following exceptions:
909
910 1. The format shall be an actual character string rather than a graph‐
911 ical representation. Therefore, it cannot contain empty character
912 positions. The <space> in the format string, in any context other
913 than a flag of a conversion specification, shall be treated as an
914 ordinary character that is copied to the output.
915
916 2. If the character set contains a 'Δ' character and that character
917 appears in the format string, it shall be treated as an ordinary
918 character that is copied to the output.
919
920 3. The escape sequences beginning with a <backslash> character shall
921 be treated as sequences of ordinary characters that are copied to
922 the output. Note that these same sequences shall be interpreted
923 lexically by awk when they appear in literal strings, but they
924 shall not be treated specially by the printf statement.
925
926 4. A field width or precision can be specified as the '*' character
927 instead of a digit string. In this case the next argument from the
928 expression list shall be fetched and its numeric value taken as the
929 field width or precision.
930
931 5. The implementation shall not precede or follow output from the d or
932 u conversion specifier characters with <blank> characters not spec‐
933 ified by the format string.
934
935 6. The implementation shall not precede output from the o conversion
936 specifier character with leading zeros not specified by the format
937 string.
938
939 7. For the c conversion specifier character: if the argument has a
940 numeric value, the character whose encoding is that value shall be
941 output. If the value is zero or is not the encoding of any charac‐
942 ter in the character set, the behavior is undefined. If the argu‐
943 ment does not have a numeric value, the first character of the
944 string value shall be output; if the string does not contain any
945 characters, the behavior is undefined.
946
947 8. For each conversion specification that consumes an argument, the
948 next expression argument shall be evaluated. With the exception of
949 the c conversion specifier character, the value shall be converted
950 (according to the rules specified in Expressions in awk) to the
951 appropriate type for the conversion specification.
952
953 9. If there are insufficient expression arguments to satisfy all the
954 conversion specifications in the format string, the behavior is
955 undefined.
956
957 10. If any character sequence in the format string begins with a '%'
958 character, but does not form a valid conversion specification, the
959 behavior is unspecified.
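
A short sketch of a few of the conversions listed above, including both cases of the c conversion specifier. The name printf_demo is hypothetical, and the value 65 is assumed to encode "A" (true in ASCII-compatible character sets).

```shell
# printf_demo is a hypothetical helper name used only for this example.
printf_demo() {
    awk 'BEGIN {
        # %c with a number prints the character with that encoding;
        # %c with a string prints the first character of the string.
        printf "[%5d] [%-5s] [%c%c] [%.2f]\n", 42, "ab", 65, "hello", 2.5
    }'
}
printf_demo    # prints: [   42] [ab   ] [Ah] [2.50]
```

Unlike print, printf does not append ORS, so the <newline> is written explicitly in the format string.
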
960
961 Both print and printf can output at least {LINE_MAX} bytes.
962
963 Functions
964 The awk language has a variety of built-in functions: arithmetic,
965 string, input/output, and general.
966
967 Arithmetic Functions
968 The arithmetic functions, except for int, shall be based on the ISO C
969 standard (see Section 1.1.2, Concepts Derived from the ISO C Standard).
970 The behavior is undefined in cases where the ISO C standard specifies
971 that an error be returned or that the behavior is undefined. Although
972 the grammar (see Grammar) permits built-in functions to appear with no
973 arguments or parentheses, unless the argument or parentheses are indi‐
974 cated as optional in the following list (by displaying them within the
975 "[]" brackets), such use is undefined.
976
977 atan2(y,x)
978 Return arctangent of y/x in radians in the range [−π,π].
979
980 cos(x) Return cosine of x, where x is in radians.
981
982 sin(x) Return sine of x, where x is in radians.
983
984 exp(x) Return the exponential function of x.
985
986 log(x) Return the natural logarithm of x.
987
988 sqrt(x) Return the square root of x.
989
990 int(x) Return the argument truncated to an integer. Truncation shall
991 be toward 0 when x>0.
992
993 rand() Return a random number n, such that 0≤n<1.
994
995 srand([expr])
996 Set the seed value for rand to expr or use the time of day if
997 expr is omitted. The previous seed value shall be returned.
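
The following sketch exercises several of the arithmetic functions. The name arith_demo is hypothetical; the seed value 1 is arbitrary, and the actual number returned by rand is implementation-dependent, so only its range is checked.

```shell
# arith_demo is a hypothetical helper name used only for this example.
arith_demo() {
    awk 'BEGIN {
        pi = atan2(0, -1)    # arctangent of 0/-1 is pi
        printf "%d %d %.4f %d %.3f\n",
            int(3.9), int("12abc"), pi, sqrt(16), exp(log(2))
        srand(1)             # fixed seed; the sequence itself is not portable
        r = rand()
        in_range = (r >= 0 && r < 1)
        print in_range
    }'
}
arith_demo    # prints "3 12 3.1416 4 2.000" then "1"
```
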
998
999 String Functions
1000 The string functions in the following list shall be supported.
1001 Although the grammar (see Grammar) permits built-in functions to appear
1002 with no arguments or parentheses, unless the argument or parentheses
1003 are indicated as optional in the following list (by displaying them
1004 within the "[]" brackets), such use is undefined.
1005
1006 gsub(ere, repl[, in])
1007 Behave like sub (see below), except that it shall replace all
1008 occurrences of the regular expression (like the ed utility
1009 global substitute) in $0 or in the in argument, when speci‐
1010 fied.
1011
1012 index(s, t)
1013 Return the position, in characters, numbering from 1, in
1014 string s where string t first occurs, or zero if it does not
1015 occur at all.
1016
1017 length[([s])]
1018 Return the length, in characters, of its argument taken as a
1019 string, or of the whole record, $0, if there is no argument.
1020
1021 match(s, ere)
1022 Return the position, in characters, numbering from 1, in
1023 string s where the extended regular expression ere occurs, or
1024 zero if it does not occur at all. RSTART shall be set to the
1025 starting position (which is the same as the returned value),
1026 zero if no match is found; RLENGTH shall be set to the length
1027 of the matched string, −1 if no match is found.
1028
1029 split(s, a[, fs ])
1030 Split the string s into array elements a[1], a[2], ..., a[n],
1031 and return n. All elements of the array shall be deleted
1032 before the split is performed. The separation shall be done
1033 with the ERE fs or with the field separator FS if fs is not
1034 given. Each array element shall have a string value when cre‐
1035 ated and, if appropriate, the array element shall be consid‐
1036 ered a numeric string (see Expressions in awk). The effect
1037 of a null string as the value of fs is unspecified.
1038
1039 sprintf(fmt, expr, expr, ...)
1040 Format the expressions according to the printf format given
1041 by fmt and return the resulting string.
1042
1043 sub(ere, repl[, in ])
1044 Substitute the string repl in place of the first instance of
1045 the extended regular expression ERE in string in and return
1046 the number of substitutions. An <ampersand> ('&') appearing
1047 in the string repl shall be replaced by the string from in
1048 that matches the ERE. An <ampersand> preceded with a <back‐
1049 slash> shall be interpreted as the literal <ampersand> char‐
1050 acter. An occurrence of two consecutive <backslash> charac‐
1051 ters shall be interpreted as just a single literal <back‐
1052 slash> character. Any other occurrence of a <backslash> (for
1053 example, preceding any other character) shall be treated as a
1054 literal <backslash> character. Note that if repl is a string
1055 literal (the lexical token STRING; see Grammar), the handling
1056 of the <ampersand> character occurs after any lexical pro‐
1057 cessing, including any lexical <backslash>-escape sequence
1058 processing. If in is specified and it is not an lvalue (see
1059 Expressions in awk), the behavior is undefined. If in is
1060 omitted, awk shall use the current record ($0) in its place.
1061
1062 substr(s, m[, n ])
1063 Return the at most n-character substring of s that begins at
1064 position m, numbering from 1. If n is omitted, or if n speci‐
1065 fies more characters than are left in the string, the length
1066 of the substring shall be limited by the length of the string
1067 s.
1068
1069 tolower(s)
1070 Return a string based on the string s. Each character in s
1071 that is an uppercase letter specified to have a tolower map‐
1072 ping by the LC_CTYPE category of the current locale shall be
1073 replaced in the returned string by the lowercase letter spec‐
1074 ified by the mapping. Other characters in s shall be
1075 unchanged in the returned string.
1076
1077 toupper(s)
1078 Return a string based on the string s. Each character in s
1079 that is a lowercase letter specified to have a toupper map‐
1080 ping by the LC_CTYPE category of the current locale is
1081 replaced in the returned string by the uppercase letter spec‐
1082 ified by the mapping. Other characters in s are unchanged in
1083 the returned string.
1084
1085 All of the preceding functions that take ERE as a parameter expect a
1086 pattern or a string valued expression that is a regular expression as
1087 defined in Regular Expressions.
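
As a hedged illustration of the string functions described above, the sketch below runs a few of them on made-up strings. The name string_demo is hypothetical.

```shell
# string_demo is a hypothetical helper name used only for this example.
string_demo() {
    awk 'BEGIN {
        s = "hello world"
        n = gsub(/o/, "[&]", s)      # & stands for the matched text
        print n, s
        u = "cost"
        sub(/cost/, "\\&=&", u)      # \\& in a literal yields a plain &
        print u
        pos = match("foobar", /o+/)  # sets RSTART and RLENGTH
        print pos, RSTART, RLENGTH
        k = split("a:b:c", parts, ":")
        print k, parts[1], parts[3]
        print substr("hello", 2, 3), index("hello", "l"),
              length("hello"), toupper("abc")
    }'
}
string_demo
# prints:
#   2 hell[o] w[o]rld
#   &=cost
#   2 2 2
#   3 a c
#   ell 3 5 ABC
```
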
1088
1089 Input/Output and General Functions
1090 The input/output and general functions are:
1091
1092 close(expression)
1093 Close the file or pipe opened by a print or printf statement
1094 or a call to getline with the same string-valued expression.
1095 The limit on the number of open expression arguments is
1096 implementation-defined. If the close was successful, the
1097 function shall return zero; otherwise, it shall return non-
1098 zero.
1099
1100 expression | getline [var]
1101 Read a record of input from a stream piped from the output of
1102 a command. The stream shall be created if no stream is cur‐
1103 rently open with the value of expression as its command name.
1104 The stream created shall be equivalent to one created by a
1105 call to the popen() function with the value of expression as
1106 the command argument and a value of r as the mode argument.
1107 As long as the stream remains open, subsequent calls in which
1108 expression evaluates to the same string value shall read sub‐
1109 sequent records from the stream. The stream shall remain open
1110 until the close function is called with an expression that
1111 evaluates to the same string value. At that time, the stream
1112 shall be closed as if by a call to the pclose() function. If
1113 var is omitted, $0 and NF shall be set; otherwise, var shall
1114 be set and, if appropriate, it shall be considered a numeric
1115 string (see Expressions in awk).
1116
1117 The getline operator can form ambiguous constructs when there
1118 are unparenthesized operators (including concatenate) to the
1119 left of the '|' (to the beginning of the expression contain‐
1120 ing getline). In the context of the '$' operator, '|' shall
1121 behave as if it had a lower precedence than '$'. The result
1122 of evaluating other operators is unspecified, and conforming
1123 applications shall parenthesize properly all such usages.
1124
1125 getline Set $0 to the next input record from the current input file.
1126 This form of getline shall set the NF, NR, and FNR variables.
1127
1128 getline var
1129 Set variable var to the next input record from the current
1130 input file and, if appropriate, var shall be considered a
1131 numeric string (see Expressions in awk). This form of get‐
1132 line shall set the FNR and NR variables.
1133
1134 getline [var] < expression
1135 Read the next record of input from a named file. The expres‐
1136 sion shall be evaluated to produce a string that is used as a
1137 pathname. If the file of that name is not currently open, it
1138 shall be opened. As long as the stream remains open, subse‐
1139 quent calls in which expression evaluates to the same string
1140 value shall read subsequent records from the file. The file
1141 shall remain open until the close function is called with an
1142 expression that evaluates to the same string value. If var is
1143 omitted, $0 and NF shall be set; otherwise, var shall be set
1144 and, if appropriate, it shall be considered a numeric string
1145 (see Expressions in awk).
1146
1147 The getline operator can form ambiguous constructs when there
1148 are unparenthesized binary operators (including concatenate)
1149 to the right of the '<' (up to the end of the expression con‐
1150 taining the getline). The result of evaluating such a con‐
1151 struct is unspecified, and conforming applications shall
1152 parenthesize properly all such usages.
1153
1154 system(expression)
1155 Execute the command given by expression in a manner equiva‐
1156 lent to the system() function defined in the System Inter‐
1157 faces volume of POSIX.1‐2008 and return the exit status of
1158 the command.
1159
1160 All forms of getline shall return 1 for successful input, zero for end-
1161 of-file, and −1 for an error.
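
The return values of getline lend themselves to a loop of the form while ((... getline ...) > 0). The sketch below shows the pipe and file forms and the error case. It assumes the mktemp utility is available; getline_demo and the pathname /nonexistent/hypothetical are made-up names for the example.

```shell
# getline_demo is a hypothetical helper name used only for this example.
getline_demo() {
    tmp=$(mktemp) || return 1    # mktemp is an assumption of this sketch
    printf 'one\ntwo\n' > "$tmp"
    awk -v file="$tmp" 'BEGIN {
        cmd = "echo x; echo y"
        while ((cmd | getline line) > 0) n++     # 1 per record, 0 at EOF
        close(cmd)
        while ((getline line < file) > 0) last = line
        close(file)
        # A file that cannot be opened yields -1 rather than 0.
        missing = (getline line < "/nonexistent/hypothetical")
        print n, last, missing
    }'
    rm -f "$tmp"
}
getline_demo    # prints: 2 two -1
```

Testing for "> 0" rather than for a non-zero value matters here: a loop that only checks for truth would spin forever on the error return of −1.
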
1162
1163 Where strings are used as the name of a file or pipeline, the applica‐
1164 tion shall ensure that the strings are textually identical. The termi‐
1165 nology "same string value" implies that "equivalent strings", even
1166 those that differ only by <space> characters, represent different
1167 files.
1168
1169 User-Defined Functions
1170 The awk language also provides user-defined functions. Such functions
1171 can be defined as:
1172
1173 function name([parameter, ...]) { statements }
1174
1175 A function can be referred to anywhere in an awk program; in particu‐
1176 lar, its use can precede its definition. The scope of a function is
1177 global.
1178
1179 Function parameters, if present, can be either scalars or arrays; the
1180 behavior is undefined if an array name is passed as a parameter that
1181 the function uses as a scalar, or if a scalar expression is passed as a
1182 parameter that the function uses as an array. Function parameters shall
1183 be passed by value if scalar and by reference if array name.
1184
1185 The number of parameters in the function definition need not match the
1186 number of parameters in the function call. Excess formal parameters can
1187 be used as local variables. If fewer arguments are supplied in a func‐
1188 tion call than are in the function definition, the extra parameters
1189 that are used in the function body as scalars shall evaluate to the
1190 uninitialized value until they are otherwise initialized, and the extra
1191 parameters that are used in the function body as arrays shall be
1192 treated as uninitialized arrays where each element evaluates to the
1193 uninitialized value until otherwise initialized.
1194
1195 When invoking a function, no white space can be placed between the
1196 function name and the opening parenthesis. Function calls can be nested
1197 and recursive calls can be made upon functions. Upon return from any
1198 nested or recursive function call, the values of all of the calling
1199 function's parameters shall be unchanged, except for array parameters
1200 passed by reference. The return statement can be used to return a
1201 value. If a return statement appears outside of a function definition,
1202 the behavior is undefined.
1203
1204 In the function definition, <newline> characters shall be optional
1205 before the opening brace and after the closing brace. Function defini‐
1206 tions can appear anywhere in the program where a pattern-action pair is
1207 allowed.
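
A compact sketch of the parameter-passing rules described in this section: recursion, an array passed by reference, a scalar passed by value, and an excess formal parameter used as a local variable. The names func_demo, fact, and fill are hypothetical.

```shell
# func_demo is a hypothetical helper name used only for this example.
func_demo() {
    awk '
    function fact(n) { return n <= 1 ? 1 : n * fact(n - 1) }
    # i is an extra formal parameter used as a local variable.
    function fill(arr, v,    i) {
        for (i = 1; i <= 3; i++) arr[i] = v   # array: changes are visible
        v = "changed"    # scalar passed by value: caller is unaffected
    }
    BEGIN {
        x = "orig"
        fill(a, x)
        untouched = (i == "")    # the global i was never assigned
        print fact(5), a[2], x, untouched
    }'
}
func_demo    # prints: 120 orig orig 1
```

The extra spaces before i in the parameter list are only a common convention for marking locals; the language does not require them.
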
1208
1209 Grammar
1210 The grammar in this section and the lexical conventions in the follow‐
1211 ing section shall together describe the syntax for awk programs. The
1212 general conventions for this style of grammar are described in Section
1213 1.3, Grammar Conventions. A valid program can be represented as the
1214 non-terminal symbol program in the grammar. This formal syntax shall
1215 take precedence over the preceding text syntax description.
1216
1217 %token NAME NUMBER STRING ERE
1218 %token FUNC_NAME /* Name followed by '(' without white space. */
1219
1220 /* Keywords */
1221 %token Begin End
1222 /* 'BEGIN' 'END' */
1223
1224 %token Break Continue Delete Do Else
1225 /* 'break' 'continue' 'delete' 'do' 'else' */
1226
1227 %token Exit For Function If In
1228 /* 'exit' 'for' 'function' 'if' 'in' */
1229
1230 %token Next Print Printf Return While
1231 /* 'next' 'print' 'printf' 'return' 'while' */
1232
1233 /* Reserved function names */
1234 %token BUILTIN_FUNC_NAME
1235 /* One token for the following:
1236 * atan2 cos sin exp log sqrt int rand srand
1237 * gsub index length match split sprintf sub
1238 * substr tolower toupper close system
1239 */
1240 %token GETLINE
1241 /* Syntactically different from other built-ins. */
1242
1243 /* Two-character tokens. */
1244 %token ADD_ASSIGN SUB_ASSIGN MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN POW_ASSIGN
1245 /* '+=' '−=' '*=' '/=' '%=' '^=' */
1246
1247 %token OR AND NO_MATCH EQ LE GE NE INCR DECR APPEND
1248 /* '||' '&&' '!~' '==' '<=' '>=' '!=' '++' '−−' '>>' */
1249
1250 /* One-character tokens. */
1251 %token '{' '}' '(' ')' '[' ']' ',' ';' NEWLINE
1252 %token '+' '−' '*' '%' '^' '!' '>' '<' '|' '?' ':' '~' '$' '='
1253
1254 %start program
1255 %%
1256
1257 program : item_list
1258 | actionless_item_list
1259 ;
1260
1261 item_list : newline_opt
1262 | actionless_item_list item terminator
1263 | item_list item terminator
1264 | item_list action terminator
1265 ;
1266
1267 actionless_item_list : item_list pattern terminator
1268 | actionless_item_list pattern terminator
1269 ;
1270
1271 item : pattern action
1272 | Function NAME '(' param_list_opt ')'
1273 newline_opt action
1274 | Function FUNC_NAME '(' param_list_opt ')'
1275 newline_opt action
1276 ;
1277
1278 param_list_opt : /* empty */
1279 | param_list
1280 ;
1281
1282 param_list : NAME
1283 | param_list ',' NAME
1284 ;
1285
1286 pattern : Begin
1287 | End
1288 | expr
1289 | expr ',' newline_opt expr
1290 ;
1291
1292 action : '{' newline_opt '}'
1293 | '{' newline_opt terminated_statement_list '}'
1294 | '{' newline_opt unterminated_statement_list '}'
1295 ;
1296
1297 terminator : terminator ';'
1298 | terminator NEWLINE
1299 | ';'
1300 | NEWLINE
1301 ;
1302
1303 terminated_statement_list : terminated_statement
1304 | terminated_statement_list terminated_statement
1305 ;
1306
1307 unterminated_statement_list : unterminated_statement
1308 | terminated_statement_list unterminated_statement
1309 ;
1310
1311 terminated_statement : action newline_opt
1312 | If '(' expr ')' newline_opt terminated_statement
1313 | If '(' expr ')' newline_opt terminated_statement
1314 Else newline_opt terminated_statement
1315 | While '(' expr ')' newline_opt terminated_statement
1316 | For '(' simple_statement_opt ';'
1317 expr_opt ';' simple_statement_opt ')' newline_opt
1318 terminated_statement
1319 | For '(' NAME In NAME ')' newline_opt
1320 terminated_statement
1321 | ';' newline_opt
1322 | terminatable_statement NEWLINE newline_opt
1323 | terminatable_statement ';' newline_opt
1324 ;
1325
1326 unterminated_statement : terminatable_statement
1327 | If '(' expr ')' newline_opt unterminated_statement
1328 | If '(' expr ')' newline_opt terminated_statement
1329 Else newline_opt unterminated_statement
1330 | While '(' expr ')' newline_opt unterminated_statement
1331 | For '(' simple_statement_opt ';'
1332 expr_opt ';' simple_statement_opt ')' newline_opt
1333 unterminated_statement
1334 | For '(' NAME In NAME ')' newline_opt
1335 unterminated_statement
1336 ;
1337
1338 terminatable_statement : simple_statement
1339 | Break
1340 | Continue
1341 | Next
1342 | Exit expr_opt
1343 | Return expr_opt
1344 | Do newline_opt terminated_statement While '(' expr ')'
1345 ;
1346
1347 simple_statement_opt : /* empty */
1348 | simple_statement
1349 ;
1350
1351 simple_statement : Delete NAME '[' expr_list ']'
1352 | expr
1353 | print_statement
1354 ;
1355
1356 print_statement : simple_print_statement
1357 | simple_print_statement output_redirection
1358 ;
1359
1360 simple_print_statement : Print print_expr_list_opt
1361 | Print '(' multiple_expr_list ')'
1362 | Printf print_expr_list
1363 | Printf '(' multiple_expr_list ')'
1364 ;
1365
1366 output_redirection : '>' expr
1367 | APPEND expr
1368 | '|' expr
1369 ;
1370
1371 expr_list_opt : /* empty */
1372 | expr_list
1373 ;
1374
1375 expr_list : expr
1376 | multiple_expr_list
1377 ;
1378
1379 multiple_expr_list : expr ',' newline_opt expr
1380 | multiple_expr_list ',' newline_opt expr
1381 ;
1382
1383 expr_opt : /* empty */
1384 | expr
1385 ;
1386
1387 expr : unary_expr
1388 | non_unary_expr
1389 ;
1390
1391 unary_expr : '+' expr
1392 | '-' expr
1393 | unary_expr '^' expr
1394 | unary_expr '*' expr
1395 | unary_expr '/' expr
1396 | unary_expr '%' expr
1397 | unary_expr '+' expr
1398 | unary_expr '-' expr
1399 | unary_expr non_unary_expr
1400 | unary_expr '<' expr
1401 | unary_expr LE expr
1402 | unary_expr NE expr
1403 | unary_expr EQ expr
1404 | unary_expr '>' expr
1405 | unary_expr GE expr
1406 | unary_expr '~' expr
1407 | unary_expr NO_MATCH expr
1408 | unary_expr In NAME
1409 | unary_expr AND newline_opt expr
1410 | unary_expr OR newline_opt expr
1411 | unary_expr '?' expr ':' expr
1412 | unary_input_function
1413 ;
1414
1415 non_unary_expr : '(' expr ')'
1416 | '!' expr
1417 | non_unary_expr '^' expr
1418 | non_unary_expr '*' expr
1419 | non_unary_expr '/' expr
1420 | non_unary_expr '%' expr
1421 | non_unary_expr '+' expr
1422 | non_unary_expr '-' expr
1423 | non_unary_expr non_unary_expr
1424 | non_unary_expr '<' expr
1425 | non_unary_expr LE expr
1426 | non_unary_expr NE expr
1427 | non_unary_expr EQ expr
1428 | non_unary_expr '>' expr
1429 | non_unary_expr GE expr
1430 | non_unary_expr '~' expr
1431 | non_unary_expr NO_MATCH expr
1432 | non_unary_expr In NAME
1433 | '(' multiple_expr_list ')' In NAME
1434 | non_unary_expr AND newline_opt expr
1435 | non_unary_expr OR newline_opt expr
1436 | non_unary_expr '?' expr ':' expr
1437 | NUMBER
1438 | STRING
1439 | lvalue
1440 | ERE
1441 | lvalue INCR
1442 | lvalue DECR
1443 | INCR lvalue
1444 | DECR lvalue
1445 | lvalue POW_ASSIGN expr
1446 | lvalue MOD_ASSIGN expr
1447 | lvalue MUL_ASSIGN expr
1448 | lvalue DIV_ASSIGN expr
1449 | lvalue ADD_ASSIGN expr
1450 | lvalue SUB_ASSIGN expr
1451 | lvalue '=' expr
1452 | FUNC_NAME '(' expr_list_opt ')'
1453 /* no white space allowed before '(' */
1454 | BUILTIN_FUNC_NAME '(' expr_list_opt ')'
1455 | BUILTIN_FUNC_NAME
1456 | non_unary_input_function
1457 ;
1458
1459 print_expr_list_opt : /* empty */
1460 | print_expr_list
1461 ;
1462
1463 print_expr_list : print_expr
1464 | print_expr_list ',' newline_opt print_expr
1465 ;
1466
1467 print_expr : unary_print_expr
1468 | non_unary_print_expr
1469 ;
1470
1471 unary_print_expr : '+' print_expr
1472 | '-' print_expr
1473 | unary_print_expr '^' print_expr
1474 | unary_print_expr '*' print_expr
1475 | unary_print_expr '/' print_expr
1476 | unary_print_expr '%' print_expr
1477 | unary_print_expr '+' print_expr
1478 | unary_print_expr '-' print_expr
1479 | unary_print_expr non_unary_print_expr
1480 | unary_print_expr '~' print_expr
1481 | unary_print_expr NO_MATCH print_expr
1482 | unary_print_expr In NAME
1483 | unary_print_expr AND newline_opt print_expr
1484 | unary_print_expr OR newline_opt print_expr
1485 | unary_print_expr '?' print_expr ':' print_expr
1486 ;
1487
1488 non_unary_print_expr : '(' expr ')'
1489 | '!' print_expr
1490 | non_unary_print_expr '^' print_expr
1491 | non_unary_print_expr '*' print_expr
1492 | non_unary_print_expr '/' print_expr
1493 | non_unary_print_expr '%' print_expr
1494 | non_unary_print_expr '+' print_expr
1495 | non_unary_print_expr '-' print_expr
1496 | non_unary_print_expr non_unary_print_expr
1497 | non_unary_print_expr '~' print_expr
1498 | non_unary_print_expr NO_MATCH print_expr
1499 | non_unary_print_expr In NAME
1500 | '(' multiple_expr_list ')' In NAME
1501 | non_unary_print_expr AND newline_opt print_expr
1502 | non_unary_print_expr OR newline_opt print_expr
1503 | non_unary_print_expr '?' print_expr ':' print_expr
1504 | NUMBER
1505 | STRING
1506 | lvalue
1507 | ERE
1508 | lvalue INCR
1509 | lvalue DECR
1510 | INCR lvalue
1511 | DECR lvalue
1512 | lvalue POW_ASSIGN print_expr
1513 | lvalue MOD_ASSIGN print_expr
1514 | lvalue MUL_ASSIGN print_expr
1515 | lvalue DIV_ASSIGN print_expr
1516 | lvalue ADD_ASSIGN print_expr
1517 | lvalue SUB_ASSIGN print_expr
1518 | lvalue '=' print_expr
1519 | FUNC_NAME '(' expr_list_opt ')'
1520 /* no white space allowed before '(' */
1521 | BUILTIN_FUNC_NAME '(' expr_list_opt ')'
1522 | BUILTIN_FUNC_NAME
1523 ;
1524
1525 lvalue : NAME
1526 | NAME '[' expr_list ']'
1527 | '$' expr
1528 ;
1529
1530 non_unary_input_function : simple_get
1531 | simple_get '<' expr
1532 | non_unary_expr '|' simple_get
1533 ;
1534
1535 unary_input_function : unary_expr '|' simple_get
1536 ;
1537
1538 simple_get : GETLINE
1539 | GETLINE lvalue
1540 ;
1541
1542 newline_opt : /* empty */
1543 | newline_opt NEWLINE
1544 ;
1545
1546 This grammar has several ambiguities that shall be resolved as follows:
1547
1548 * Operator precedence and associativity shall be as described in Ta‐
1549 ble 4-1, Expressions in Decreasing Precedence in awk.
1550
1551 * In case of ambiguity, an else shall be associated with the most
1552 immediately preceding if that would satisfy the grammar.
1553
1554 * In some contexts, a <slash> ('/') that is used to surround an ERE
1555 could also be the division operator. This shall be resolved in
1556 such a way that wherever the division operator could appear, a
1557 <slash> is assumed to be the division operator. (There is no unary
1558 division operator.)
1559
1560 Each expression in an awk program shall conform to the precedence and
1561 associativity rules, even when this is not needed to resolve an ambigu‐
1562 ity. For example, because '$' has higher precedence than '++', the
1563 string "$x++--" is not a valid awk expression, even though it is
1564 unambiguously parsed by the grammar as "$(x++)--".
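The precedence rule can be observed with a short test. This is only an illustrative sketch: the input values and the variable name i are arbitrary choices, not part of the standard text. Because '$' binds more tightly than '++', $i++ groups as ($i)++, so the field is incremented and i is left alone.

```shell
# '$' has higher precedence than '++', so $i++ means ($i)++:
# field 1 is incremented and i itself keeps its value.
out=$(echo "5 7" | awk '{ i = 1; $i++; print $1, i }')
echo "$out"
```

To increment the index instead, the grouping has to be written explicitly as $(i++).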
1565
1566 One convention that might not be obvious from the formal grammar is
1567 where <newline> characters are acceptable. There are several obvious
1568 placements such as terminating a statement, and a <backslash> can be
1569 used to escape <newline> characters between any lexical tokens. In
1570 addition, <newline> characters without <backslash> characters can fol‐
1571 low a comma, an open brace, logical AND operator ("&&"), logical OR
1572 operator ("||"), the do keyword, the else keyword, and the closing
1573 parenthesis of an if, for, or while statement. For example:
1574
1575 { print $1,
1576 $2 }
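A slightly larger sketch of the same convention follows. The input data is made up for illustration. It places unescaped <newline> characters after an open brace, after "&&", after the closing parenthesis of an if, after a comma, and after else.

```shell
# Unescaped newlines after "{", "&&", ")" of an if, "," and "else".
out=$(printf '3 4\n1 9\n' | awk '
{
    if ($1 > 2 &&
        $2 > 3)
        print $1,
              $2
    else
        print "skip"
}')
echo "$out"
```

The first record satisfies both conditions and is printed; the second is reported as skip.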
1577
1578 Lexical Conventions
1579 The lexical conventions for awk programs, with respect to the preceding
1580 grammar, shall be as follows:
1581
1582 1. Except as noted, awk shall recognize the longest possible token or
1583 delimiter beginning at a given point.
1584
1585 2. A comment shall consist of any characters beginning with the <num‐
1586 ber-sign> character and terminated by, but excluding the next
1587 occurrence of, a <newline>. Comments shall have no effect, except
1588 to delimit lexical tokens.
1589
1590 3. The <newline> shall be recognized as the token NEWLINE.
1591
1592 4. A <backslash> character immediately followed by a <newline> shall
1593 have no effect.
1594
1595 5. The token STRING shall represent a string constant. A string con‐
1596 stant shall begin with the character '"'. Within a string con‐
1597 stant, a <backslash> character shall be considered to begin an
1598 escape sequence as specified in the table in the Base Definitions
1599 volume of POSIX.1‐2008, Chapter 5, File Format Notation ('\\',
1600 '\a', '\b', '\f', '\n', '\r', '\t', '\v'). In addition, the escape
1601 sequences in Table 4-2, Escape Sequences in awk shall be recog‐
1602 nized. A <newline> shall not occur within a string constant. A
1603 string constant shall be terminated by the first unescaped occur‐
1604 rence of the character '"' after the one that begins the string
1605 constant. The value of the string shall be the sequence of all
1606 unescaped characters and values of escape sequences between, but
1607 not including, the two delimiting '"' characters.
1608
1609 6. The token ERE represents an extended regular expression constant.
1610 An ERE constant shall begin with the <slash> character. Within an
1611 ERE constant, a <backslash> character shall be considered to begin
1612 an escape sequence as specified in the table in the Base Defini‐
1613 tions volume of POSIX.1‐2008, Chapter 5, File Format Notation. In
1614 addition, the escape sequences in Table 4-2, Escape Sequences in
1615 awk shall be recognized. The application shall ensure that a <new‐
1616 line> does not occur within an ERE constant. An ERE constant shall
1617 be terminated by the first unescaped occurrence of the <slash>
1618 character after the one that begins the ERE constant. The extended
1619 regular expression represented by the ERE constant shall be the
1620 sequence of all unescaped characters and values of escape sequences
1621 between, but not including, the two delimiting <slash> characters.
1622
1623 7. A <blank> shall have no effect, except to delimit lexical tokens or
1624 within STRING or ERE tokens.
1625
1626 8. The token NUMBER shall represent a numeric constant. Its form and
1627 numeric value shall either be equivalent to the decimal-floating-
1628 constant token as specified by the ISO C standard, or it shall be a
1629 sequence of decimal digits and shall be evaluated as an integer
1630 constant in decimal. In addition, implementations may accept
1631 numeric constants with the form and numeric value equivalent to the
1632 hexadecimal-constant and hexadecimal-floating-constant tokens as
1633 specified by the ISO C standard.
1634
1635 If the value is too large or too small to be representable (see
1636 Section 1.1.2, Concepts Derived from the ISO C Standard), the
1637 behavior is undefined.
1638
1639 9. A sequence of underscores, digits, and alphabetics from the porta‐
1640 ble character set (see the Base Definitions volume of POSIX.1‐2008,
1641 Section 6.1, Portable Character Set), beginning with an <under‐
1642 score> or alphabetic character, shall be considered a word.
1643
1644 10. The following words are keywords that shall be recognized as indi‐
1645 vidual tokens; the name of the token is the same as the keyword:
1646
1647 BEGIN delete END function in printf
1648 break do exit getline next return
1649 continue else for if print while
1650
1651 11. The following words are names of built-in functions and shall be
1652 recognized as the token BUILTIN_FUNC_NAME:
1653
1654 atan2 gsub log split sub toupper
1655 close index match sprintf substr
1656 cos int rand sqrt system
1657 exp length sin srand tolower
1658
1659 The above-listed keywords and names of built-in functions are con‐
1660 sidered reserved words.
1661
1662 12. The token NAME shall consist of a word that is not a keyword or a
1663 name of a built-in function and is not followed immediately (with‐
1664 out any delimiters) by the '(' character.
1665
1666 13. The token FUNC_NAME shall consist of a word that is not a keyword
1667 or a name of a built-in function, followed immediately (without any
1668 delimiters) by the '(' character. The '(' character shall not be
1669 included as part of the token.
1670
1671 14. The following two-character sequences shall be recognized as the
1672 named tokens:
1673
1674 ┌───────────┬──────────┬────────────┬──────────┐
1675 │Token Name │ Sequence │ Token Name │ Sequence │
1676 ├───────────┼──────────┼────────────┼──────────┤
1677 │ADD_ASSIGN │ += │ NO_MATCH │ !~ │
1678 │SUB_ASSIGN │ -= │ EQ │ == │
1679 │MUL_ASSIGN │ *= │ LE │ <= │
1680 │DIV_ASSIGN │ /= │ GE │ >= │
1681 │MOD_ASSIGN │ %= │ NE │ != │
1682 │POW_ASSIGN │ ^= │ INCR │ ++ │
1683 │OR │ || │ DECR │ -- │
1684 │AND │ && │ APPEND │ >> │
1685 └───────────┴──────────┴────────────┴──────────┘
1686 15. The following single characters shall be recognized as tokens whose
1687 names are the character:
1688
1689 <newline> { } ( ) [ ] , ; + - * % ^ ! > < | ? : ~ $ =
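Several of these conventions can be seen together in one small program. The following is a sketch only; the function name twice and the string contents are hypothetical and chosen purely for illustration. It shows a comment, a FUNC_NAME (the name is followed directly by its parenthesis), a <backslash>-<newline> pair between tokens, and two escape sequences inside STRING tokens (an escaped double-quote and the octal escape \101, which is the letter A in ASCII).

```shell
out=$(awk '
# A comment runs from the number sign to the end of the line.
function twice(x) { return 2 * x }   # hypothetical helper name
BEGIN {
    # The name is followed directly by its parenthesis: a FUNC_NAME.
    n = twice(4)
    # Backslash-newline between two tokens has no effect.
    s = "say \"hi\"" \
        " \101"
    print n, s
}')
echo "$out"
```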
1690
1691 There is a lexical ambiguity between the token ERE and the tokens '/'
1692 and DIV_ASSIGN. When an input sequence begins with a <slash> character
1693 in any syntactic context where the token '/' or DIV_ASSIGN could appear
1694 as the next token in a valid program, the longer of those two tokens
1695 that can be recognized shall be recognized. In any other syntactic con‐
1696 text where the token ERE could appear as the next token in a valid pro‐
1697 gram, the token ERE shall be recognized.
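The resolution of this ambiguity can be illustrated with a pattern that uses a <slash> in both roles. This is a sketch with arbitrary input values. After $1 a division operator could appear, so the <slash> is taken as division; after "&&" it could not, so the <slash> begins an ERE token.

```shell
# "$1 / $2" uses "/" as division; "/8/" after "&&" is an ERE token.
out=$(echo "8 2" | awk '$1 / $2 == 4 && /8/ { print $1 / $2 / 2 }')
echo "$out"
```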
1698
1700 The following exit values shall be returned:
1701
1702 0 All input files were processed successfully.
1703
1704 >0 An error occurred.
1705
1706 The exit status can be altered within the program by using an exit
1707 expression.
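As a minimal sketch of altering the exit status from within the program (the value 3 is an arbitrary choice for illustration):

```shell
# The expression given to exit becomes the exit status of awk.
status=0
awk 'BEGIN { exit 3 }' || status=$?
echo "$status"
```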
1708
1710 If any file operand is specified and the named file cannot be accessed,
1711 awk shall write a diagnostic message to standard error and terminate
1712 without any further action.
1713
1714 If the program specified by either the program operand or a progfile
1715 operand is not a valid awk program (as specified in the EXTENDED
1716 DESCRIPTION section), the behavior is undefined.
1717
1718 The following sections are informative.
1719
1721 The index, length, match, and substr functions should not be confused
1722 with similar functions in the ISO C standard; the awk versions deal
1723 with characters, while the ISO C standard deals with bytes.
1724
1725 Because the concatenation operation is represented by adjacent expres‐
1726 sions rather than an explicit operator, it is often necessary to use
1727 parentheses to enforce the proper evaluation precedence.
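The effect of parentheses on concatenation can be sketched as follows; the operands are arbitrary. Addition binds more tightly than concatenation, so without parentheses the sum is formed first.

```shell
out=$(awk 'BEGIN {
    a = 2 " " 3 + 1      # groups as 2 " " (3 + 1)
    b = (2 " " 3) + 1    # concatenate first; "2 3" then converts to 2
    print a
    print b
}')
echo "$out"
```

The first line of output is the string "2 4" and the second is the number 3.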
1728
1730 The awk program specified in the command line is most easily specified
1731 within single-quotes (for example, 'program') for applications using
1732 sh, because awk programs commonly contain characters that are special
1733 to the shell, including double-quotes. In the cases where an awk pro‐
1734 gram contains single-quote characters, it is usually easiest to specify
1735 most of the program as strings within single-quotes concatenated by the
1736 shell with quoted single-quote characters. For example:
1737
1738 awk '/'\''/ { print "quote:", $0 }'
1739
1740 prints all lines from the standard input containing a single-quote
1741 character, prefixed with quote:.
1742
1743 The following are examples of simple awk programs:
1744
1745 1. Write to the standard output all input lines for which field 3 is
1746 greater than 5:
1747
1748 $3 > 5
1749
1750 2. Write every tenth line:
1751
1752 (NR % 10) == 0
1753
1754 3. Write any line with a substring matching the regular expression:
1755
1756 /(G|D)(2[0-9][[:alpha:]]*)/
1757
1758 4. Print any line with a substring containing a 'G' or 'D', followed
1759 by a sequence of digits and characters. This example uses character
1760 classes digit and alpha to match language-independent digit and
1761 alphabetic characters respectively:
1762
1763 /(G|D)([[:digit:][:alpha:]]*)/
1764
1765 5. Write any line in which the second field matches the regular
1766 expression and the fourth field does not:
1767
1768 $2 ~ /xyz/ && $4 !~ /xyz/
1769
1770 6. Write any line in which the second field contains a <backslash>:
1771
1772 $2 ~ /\\/
1773
1774 7. Write any line in which the second field contains a <backslash>.
1775 Note that <backslash>-escapes are interpreted twice; once in lexi‐
1776 cal processing of the string and once in processing the regular
1777 expression:
1778
1779 $2 ~ "\\\\"
1780
1781 8. Write the second to the last and the last field in each line. Sepa‐
1782 rate the fields by a <colon>:
1783
1784 {OFS=":";print $(NF-1), $NF}
1785
1786 9. Write the line number and number of fields in each line. The three
1787 strings representing the line number, the <colon>, and the number
1788 of fields are concatenated and that string is written to standard
1789 output:
1790
1791 {print NR ":" NF}
1792
1793 10. Write lines longer than 72 characters:
1794
1795 length($0) > 72
1796
1797 11. Write the first two fields in opposite order separated by OFS:
1798
1799 { print $2, $1 }
1800
1801 12. Same, with input fields separated by a <comma> or <space> and <tab>
1802 characters, or both:
1803
1804 BEGIN { FS = ",[ \t]*|[ \t]+" }
1805 { print $2, $1 }
1806
1807 13. Add up the first column, print sum, and average:
1808
1809 {s += $1 }
1810 END {print "sum is ", s, " average is", s/NR}
1811
1812 14. Write fields in reverse order, one per line (many lines out for
1813 each line in):
1814
1815 { for (i = NF; i > 0; --i) print $i }
1816
1817 15. Write all lines between occurrences of the strings start and stop:
1818
1819 /start/, /stop/
1820
1821 16. Write all lines whose first field is different from the previous
1822 one:
1823
1824 $1 != prev { print; prev = $1 }
1825
1826 17. Simulate echo:
1827
1828 BEGIN {
1829 for (i = 1; i < ARGC; ++i)
1830 printf("%s%s", ARGV[i], i==ARGC-1?"\n":" ")
1831 }
1832
1833 18. Write the path prefixes contained in the PATH environment variable,
1834 one per line:
1835
1836 BEGIN {
1837 n = split (ENVIRON["PATH"], path, ":")
1838 for (i = 1; i <= n; ++i)
1839 print path[i]
1840 }
1841
1842 19. If there is a file named input containing page headers of the form:
1843 Page #
1844
1845 and a file named program that contains:
1846
1847 /Page/ { $2 = n++; }
1848 { print }
1849
1850 then the command line:
1851
1852 awk -f program n=5 input
1853
1854 prints the file input, filling in page numbers starting at 5.
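The last example can be tried end to end with a short script. This is a sketch under stated assumptions: the temporary directory and the sample contents of input (two page headers around one line of text) are invented for illustration.

```shell
# Build the two files in a scratch directory, then run the command line.
dir=$(mktemp -d)
printf 'Page #\nsome text\nPage #\n' > "$dir/input"
printf '/Page/ { $2 = n++; }\n{ print }\n' > "$dir/program"
out=$(cd "$dir" && awk -f program n=5 input)
echo "$out"
rm -rf "$dir"
```

The operand n=5 is processed as an assignment before input is opened, so the first header becomes Page 5 and the second Page 6.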
1855
1857 This description is based on the new awk, ``nawk'', (see the referenced
1858 The AWK Programming Language), which introduced a number of new fea‐
1859 tures to the historical awk:
1860
1861 1. New keywords: delete, do, function, return
1862
1863 2. New built-in functions: atan2, close, cos, gsub, match, rand, sin,
1864 srand, sub, system
1865
1866 3. New predefined variables: FNR, ARGC, ARGV, RSTART, RLENGTH, SUBSEP
1867
1868 4. New expression operators: ?, :, ,, ^
1869
1870 5. The FS variable and the third argument to split, now treated as
1871 extended regular expressions.
1872
1873 6. The operator precedence, changed to more closely match the C lan‐
1874 guage. Two examples of code that operate differently are:
1875
1876 while ( n /= 10 > 1) ...
1877 if (!"wk" ~ /bwk/) ...
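The first of these two examples can be checked directly. The sketch below uses an arbitrary starting value. Under the precedence table, the relational operator binds more tightly than the assignment operator, so the statement divides n by the result of the comparison (which is 1) rather than by 10.

```shell
out=$(awk 'BEGIN {
    n = 100
    n /= 10 > 1     # groups as n /= (10 > 1), that is n /= 1
    print n
}')
echo "$out"
```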
1878
1879 Several features have been added based on newer implementations of awk:
1880
1881 * Multiple instances of -f progfile are permitted.
1882
1883 * The new option -v assignment.
1884
1885 * The new predefined variable ENVIRON.
1886
1887 * New built-in functions toupper and tolower.
1888
1889 * More formatting capabilities are added to printf to match the ISO C
1890 standard.
1891
1892 The overall awk syntax has always been based on the C language, with a
1893 few features from the shell command language and other sources. Because
1894 of this, it is not completely compatible with any other language, which
1895 has caused confusion for some users. It is not the intent of the stan‐
1896 dard developers to address such issues. A few relatively minor changes
1897 toward making the language more compatible with the ISO C standard were
1898 made; most of these changes are based on similar changes in recent
1899 implementations, as described above. There remain several C-language
1900 conventions that are not in awk. One of the notable ones is the
1901 <comma> operator, which is commonly used to specify multiple expres‐
1902 sions in the C language for statement. Also, there are various places
1903 where awk is more restrictive than the C language regarding the type of
1904 expression that can be used in a given context. These limitations are
1905 due to the different features that the awk language does provide.
1906
1907 Regular expressions in awk have been extended somewhat from historical
1908 implementations to make them a pure superset of extended regular
1909 expressions, as defined by POSIX.1‐2008 (see the Base Definitions vol‐
1910 ume of POSIX.1‐2008, Section 9.4, Extended Regular Expressions). The
1911 main extensions are internationalization features and interval expres‐
1912 sions. Historical implementations of awk have long supported <back‐
1913 slash>-escape sequences as an extension to extended regular expres‐
1914 sions, and this extension has been retained despite inconsistency with
1915 other utilities. The number of escape sequences recognized in both
1916 extended regular expressions and strings has varied (generally increas‐
1917 ing with time) among implementations. The set specified by POSIX.1‐2008
1918 includes most sequences known to be supported by popular implementa‐
1919 tions and by the ISO C standard. One sequence that is not supported is
1920 hexadecimal value escapes beginning with '\x'. This would allow values
1921 expressed in more than 9 bits to be used within awk as in the ISO C
1922 standard. However, because this syntax has a non-deterministic length,
1923 it does not permit the subsequent character to be a hexadecimal digit.
1924 This limitation can be dealt with in the C language by the use of lexi‐
1925 cal string concatenation. In the awk language, concatenation could also
1926 be a solution for strings, but not for extended regular expressions
1927 (either lexical ERE tokens or strings used dynamically as regular
1928 expressions). Because of this limitation, the feature has not been
1929 added to POSIX.1‐2008.
1930
1931 When a string variable is used in a context where an extended regular
1932 expression normally appears (where the lexical token ERE is used in the
1933 grammar) the string does not contain the literal <slash> characters.
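A minimal sketch of a string used dynamically as a regular expression follows; the variable name re and the sample input are illustrative assumptions. The string holds only the expression itself, with no surrounding <slash> characters.

```shell
# The string value is the ERE itself; no delimiting slashes are written.
out=$(printf 'aaa\nabc\n' | awk 'BEGIN { re = "^a+$" } $0 ~ re { print "match:", $0 }')
echo "$out"
```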
1934
1935 Some versions of awk allow the form:
1936
1937 func name(args, ... ) { statements }
1938
1939 This has been deprecated by the authors of the language, who asked that
1940 it not be specified.
1941
1942 Historical implementations of awk produce an error if a next statement
1943 is executed in a BEGIN action, and cause awk to terminate if a next
1944 statement is executed in an END action. This behavior has not been doc‐
1945 umented, and it was not believed that it was necessary to standardize
1946 it.
1947
1948 The specification of conversions between string and numeric values is
1949 much more detailed than in the documentation of historical implementa‐
1950 tions or in the referenced The AWK Programming Language. Although most
1951 of the behavior is designed to be intuitive, the details are necessary
1952 to ensure compatible behavior from different implementations. This is
1953 especially important in relational expressions since the types of the
1954 operands determine whether a string or numeric comparison is performed.
1955 From the perspective of an application developer, it is usually suffi‐
1956 cient to expect intuitive behavior and to force conversions (by adding
1957 zero or concatenating a null string) when the type of an expression
1958 does not obviously match what is needed. The intent has been to specify
1959 historical practice in almost all cases. The one exception is that, in
1960 historical implementations, variables and constants maintain both
1961 string and numeric values after their original value is converted by
1962 any use. This means that referencing a variable or constant can have
1963 unexpected side-effects. For example, with historical implementations
1964 the following program:
1965
1966 {
1967 a = "+2"
1968 b = 2
1969 if (NR % 2)
1970 c = a + b
1971 if (a == b)
1972 print "numeric comparison"
1973 else
1974 print "string comparison"
1975 }
1976
1977 would perform a numeric comparison (and output numeric comparison) for
1978 each odd-numbered line, but perform a string comparison (and output
1979 string comparison) for each even-numbered line. POSIX.1‐2008 ensures
1980 that comparisons will be numeric if necessary. With historical imple‐
1981 mentations, the following program:
1982
1983 BEGIN {
1984 OFMT = "%e"
1985 print 3.14
1986 OFMT = "%f"
1987 print 3.14
1988 }
1989
1990 would output "3.140000e+00" twice, because in the second print state‐
1991 ment the constant "3.14" would have a string value from the previous
1992 conversion. POSIX.1‐2008 requires that the output of the second print
1993 statement be "3.140000". The behavior of historical implementations
1994 was seen as too unintuitive and unpredictable.
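The advice about forcing conversions can be sketched as follows. The variable names r1, r2, and r3 are hypothetical, and the values are taken from the first program above. A string constant compared with a number is compared as a string; adding zero forces a numeric comparison, and concatenating a null string forces a string comparison.

```shell
out=$(awk 'BEGIN {
    a = "+2"; b = 2
    r1 = (a == b)     ? "equal" : "different"   # string comparison
    r2 = (a + 0 == b) ? "equal" : "different"   # forced numeric
    r3 = (a == b "")  ? "equal" : "different"   # forced string
    print r1, r2, r3
}')
echo "$out"
```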
1995
1996 It was pointed out that with the rules contained in early drafts, the
1997 following script would print nothing:
1998
1999 BEGIN {
2000 y[1.5] = 1
2001 OFMT = "%e"
2002 print y[1.5]
2003 }
2004
2005 Therefore, a new variable, CONVFMT, was introduced. The OFMT variable
2006 is now restricted to affecting output conversions of numbers to strings
2007 and CONVFMT is used for internal conversions, such as comparisons or
2008 array indexing. The default value is the same as that for OFMT, so
2009 unless a program changes CONVFMT (which no historical program would
2010 do), it will receive the historical behavior associated with internal
2011 string conversions.
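The division of labor between the two variables can be sketched as follows. The formats and values are arbitrary choices for illustration. The subscript lookup is unaffected by OFMT, internal conversion to a string follows CONVFMT, and print follows OFMT.

```shell
out=$(awk 'BEGIN {
    y[1.5] = 1
    OFMT = "%e"
    print y[1.5]        # subscript conversion uses CONVFMT, not OFMT
    x = 3.14159
    CONVFMT = "%.2f"
    s = x ""            # internal number-to-string conversion
    OFMT = "%.4f"
    print s
    print x             # output conversion uses OFMT
}')
echo "$out"
```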
2012
2013 The POSIX awk lexical and syntactic conventions are specified more for‐
2014 mally than in other sources. Again the intent has been to specify his‐
2015 torical practice. One convention that may not be obvious from the for‐
2016 mal grammar as in other verbal descriptions is where <newline> charac‐
2017 ters are acceptable. There are several obvious placements such as ter‐
2018 minating a statement, and a <backslash> can be used to escape <newline>
2019 characters between any lexical tokens. In addition, <newline> charac‐
2020 ters without <backslash> characters can follow a comma, an open brace,
2021 a logical AND operator ("&&"), a logical OR operator ("||"), the do
2022 keyword, the else keyword, and the closing parenthesis of an if, for,
2023 or while statement. For example:
2024
2025 { print $1,
2026 $2 }
2027
2028 The requirement that awk add a trailing <newline> to the program argu‐
2029 ment text is to simplify the grammar, making it match a text file in
2030 form. There is no way for an application or test suite to determine
2031 whether a literal <newline> is added or whether awk simply acts as if
2032 it did.
2033
2034 POSIX.1‐2008 requires several changes from historical implementations
2035 in order to support internationalization. Probably the most subtle of
2036 these is the use of the decimal-point character, defined by the
2037 LC_NUMERIC category of the locale, in representations of floating-point
2038 numbers. This locale-specific character is used in recognizing numeric
2039 input, in converting between strings and numeric values, and in format‐
2040 ting output. However, regardless of locale, the <period> character (the
2041 decimal-point character of the POSIX locale) is the decimal-point char‐
2042 acter recognized in processing awk programs (including assignments in
2043 command line arguments). This is essentially the same convention as the
2044 one used in the ISO C standard. The difference is that the C language
2045 includes the setlocale() function, which permits an application to mod‐
2046 ify its locale. Because of this capability, a C application begins exe‐
2047 cuting with its locale set to the C locale, and only executes in the
2048 environment-specified locale after an explicit call to setlocale().
2049 However, adding such an elaborate new feature to the awk language was
2050 seen as inappropriate for POSIX.1‐2008. It is possible to execute an
2051 awk program explicitly in any desired locale by setting the environment
2052 in the shell.
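For instance, a program can be run in the POSIX locale regardless of the caller's settings. This is only a sketch; the choice of the C locale and of the value printed is an assumption made for illustration.

```shell
# LC_ALL in the environment of this one command selects the locale.
out=$(LC_ALL=C awk 'BEGIN { printf "%.2f\n", 3 / 2 }')
echo "$out"
```

In the C locale the decimal-point character used for output is the <period>.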
2053
2054 The undefined behavior resulting from NULs in extended regular expres‐
2055 sions allows future extensions for the GNU gawk program to process
2056 binary data.
2057
2058 The behavior in the case of invalid awk programs (including lexical,
2059 syntactic, and semantic errors) is undefined because it was considered
2060 overly limiting on implementations to specify. In most cases such
2061 errors can be expected to produce a diagnostic and a non-zero exit sta‐
2062 tus. However, some implementations may choose to extend the language in
2063 ways that make use of certain invalid constructs. Other invalid con‐
2064 structs might be deemed worthy of a warning, but otherwise cause some
2065 reasonable behavior. Still other constructs may be very difficult to
2066 detect in some implementations. Also, different implementations might
2067 detect a given error during an initial parsing of the program (before
2068 reading any input files) while others might detect it when executing
2069 the program after reading some input. Implementors should be aware that
2070 diagnosing errors as early as possible and producing useful diagnostics
2071 can ease debugging of applications, and thus make an implementation
2072 more usable.
2073
2074 The unspecified behavior from using multi-character RS values is to
2075 allow possible future extensions based on extended regular expressions
2076 used for record separators. Historical implementations take the first
2077 character of the string and ignore the others.
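A single-character RS, by contrast, has well-defined behavior. The following sketch uses a <semicolon> as the record separator; the input is invented for illustration.

```shell
# Each ";" ends a record; the final record needs no terminator.
out=$(printf 'a;b;c' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')
echo "$out"
```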
2078
2079 Unspecified behavior when split(string,array,<null>) is used is to
2080 allow a proposed future extension that would split up a string into an
2081 array of individual characters.
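Until such an extension is available, a program that needs an array of individual characters can build one with substr. This is a sketch of one possible approach, not a required idiom; the array name chars is hypothetical.

```shell
out=$(awk 'BEGIN {
    s = "abc"
    n = length(s)
    # Copy each character into its own array element.
    for (i = 1; i <= n; i++)
        chars[i] = substr(s, i, 1)
    for (i = 1; i <= n; i++)
        printf "%s%s", chars[i], (i < n ? " " : "\n")
}')
echo "$out"
```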
2082
2083 In the context of the getline function, equally good arguments for dif‐
2084 ferent precedences of the | and < operators can be made. Historical
2085 practice has been that:
2086
2087 getline < "a" "b"
2088
2089 is parsed as:
2090
2091 ( getline < "a" ) "b"
2092
2093 although many would argue that the intent was that the file ab should
2094 be read. However:
2095
2096 getline < "x" + 1
2097
2098 parses as:
2099
2100 getline < ( "x" + 1 )
2101
2102 Similar problems occur with the | version of getline, particularly in
2103 combination with $. For example:
2104
2105 $"echo hi" | getline
2106
2107 (This situation is particularly problematic when used in a print state‐
2108 ment, where the |getline part might be a redirection of the print.)
2109
2110 Since in most cases such constructs are not (or at least should not be)
2111 used (because they have a natural ambiguity for which there is no con‐
2112 ventional parsing), the meaning of these constructs has been made
2113 explicitly unspecified. (The effect is that a conforming application
2114 that runs into the problem must parenthesize to resolve the ambiguity.)
2115 There appeared to be few if any actual uses of such constructs.
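A sketch of the parenthesized forms follows. The file name ab, the temporary directory, and the variable names are assumptions made for illustration. Building the file name or command in a variable, and parenthesizing the whole getline expression, leaves nothing for an implementation to interpret differently.

```shell
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/ab"
out=$(awk -v dir="$dir" 'BEGIN {
    file = dir "/" "ab"               # concatenate before redirecting
    while ((getline line < file) > 0)
        n++
    close(file)
    cmd = "echo hi"
    status = (cmd | getline word)     # parenthesized pipe form
    close(cmd)
    print n, word
}')
echo "$out"
rm -rf "$dir"
```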
2116
2117 Grammars can be written that would cause an error under these circum‐
2118 stances. Where backwards-compatibility is not a large consideration,
2119 implementors may wish to use such grammars.
2120
2121 Some historical implementations have allowed some built-in functions to
2122 be called without an argument list, the result being a default argument
2123 list chosen in some ``reasonable'' way. Use of length as a synonym for
2124 length($0) is the only one of these forms that is thought to be widely
2125 known or widely used; this particular form is documented in various
2126 places (for example, most historical awk reference pages, although not
2127 in the referenced The AWK Programming Language) as legitimate practice.
2128 With this exception, default argument lists have always been undocu‐
2129 mented and vaguely defined, and it is not at all clear how (or if) they
2130 should be generalized to user-defined functions. They add no useful
2131 functionality and preclude possible future extensions that might need
2132 to name functions without calling them. Not standardizing them seems
2133 the simplest course. The standard developers considered that length
2134 merited special treatment, however, since it has been documented in the
2135 past and sees possibly substantial use in historical programs. Accord‐
2136 ingly, this usage has been made legitimate, but Issue 5 removed the
2137 obsolescent marking for XSI-conforming implementations and many other‐
2138 wise conforming applications depend on this feature.
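The permitted short form can be sketched as follows, using an arbitrary input line:

```shell
# length with no argument list is a synonym for length($0).
out=$(echo "hello" | awk '{ n = length; print n, length($0) }')
echo "$out"
```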
2139
2140 In sub and gsub, if repl is a string literal (the lexical token
2141 STRING), then two consecutive <backslash> characters should be used in
2142 the string to ensure a single <backslash> will precede the <ampersand>
2143 when the resultant string is passed to the function. (For example, to
2144 specify one literal <ampersand> in the replacement string, use
2145 gsub(ERE, "\\&").)
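
The doubling rule can be tried from the shell as sketched below. The
sample input and the variable names lit_out and match_out are
illustrative only; the awk program is single-quoted so that the shell
passes the <backslash> characters through unchanged.

```sh
# The string literal "\\&" is reduced to \& by lexical analysis, and
# gsub() then treats \& as a literal ampersand.
lit_out=$(printf 'abc\n' | awk '{ gsub(/b/, "\\&"); print }')
# An unescaped & in the replacement stands for the matched text.
match_out=$(printf 'abc\n' | awk '{ gsub(/b/, "[&]"); print }')
printf '%s\n%s\n' "$lit_out" "$match_out"
```

The first command should yield a&c and the second a[b]c.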
2146
2147 Historically, the only special character in the repl argument of sub
2148 and gsub string functions was the <ampersand> ('&') character and pre‐
2149 ceding it with the <backslash> character was used to turn off its spe‐
2150 cial meaning.
2151
2152 The description in the ISO POSIX‐2:1993 standard introduced behavior
2153 such that the <backslash> character was another special character and
2154 it was unspecified whether there were any other special characters.
2155 This description introduced several portability problems, some of which
2156 are described below, and so it has been replaced with the more histori‐
2157 cal description. Some of the problems include:
2158
2159 * Historically, to create the replacement string, a script could use
2160 gsub(ERE, "\\&"), but with the ISO POSIX‐2:1993 standard wording,
2161 it was necessary to use gsub(ERE, "\\\\&"). The <backslash> char‐
2162 acters are doubled here because all string literals are subject to
2163 lexical analysis, which would reduce each pair of <backslash> char‐
2164 acters to a single <backslash> before being passed to gsub.
2165
2166 * Since it was unspecified what the special characters were, for por‐
2167 table scripts to guarantee that characters are printed literally,
2168 each character had to be preceded with a <backslash>. (For exam‐
2169 ple, a portable script had to use gsub(ERE, "\\h\\i") to produce a
2170 replacement string of "hi".)
2171
2172 The description for comparisons in the ISO POSIX‐2:1993 standard did
2173 not properly describe historical practice because of the way numeric
2174 strings are compared as numbers. The rules of that standard cause the following
2175 code:
2176
2177 if (0 == "000")
2178 print "strange, but true"
2179 else
2180 print "not true"
2181
2182 to do a numeric comparison, causing the if to succeed. It should be
2183 intuitively obvious that this is incorrect behavior, and indeed, no
2184 historical implementation of awk actually behaves this way.
2185
2186 To fix this problem, the definition of numeric string was enhanced to
2187 include only those values obtained from specific circumstances (mostly
2188 external sources) where it is not possible to determine unambiguously
2189 whether the value is intended to be a string or a numeric.
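
Under the narrowed definition, a string constant is compared as a
string, while the same characters read from input form a numeric
string. The sketch below contrasts the two cases; the variable names
const_out and field_out are illustrative.

```sh
# A string constant stays a string, so 0 == "000" compares the
# strings "0" and "000" and the test fails.
const_out=$(awk 'BEGIN { if (0 == "000") print "strange, but true"; else print "not true" }')
# A field read from input is a numeric string, so it is compared
# as the number zero.
field_out=$(printf '000\n' | awk '{ if ($1 == 0) print "numeric"; else print "string" }')
printf '%s\n%s\n' "$const_out" "$field_out"
```

The expected results are "not true" for the constant and "numeric" for
the field.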
2190
2191 Variables that are assigned to a numeric string shall also be treated
2192 as a numeric string. (That is, the notion of a numeric string can
2193 be propagated across assignments.) In comparisons, all variables having
2194 the uninitialized value are to be treated as a numeric operand evaluat‐
2195 ing to the numeric value zero.
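
A short sketch of both rules follows. The variable names x, u, a, b,
prop_out, and unset_out are chosen for illustration and do not come
from the standard.

```sh
# x takes on the numeric-string nature of $1, so the input "010"
# compares equal to the number 10.
prop_out=$(printf '010\n' | awk '{ x = $1; r = (x == 10); print r }')
# The uninitialized variable u compares equal to zero, and also to
# the empty string.
unset_out=$(awk 'BEGIN { a = (u == 0); b = (u == ""); print a, b }')
printf '%s\n%s\n' "$prop_out" "$unset_out"
```

The first command should print 1 and the second 1 1.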
2196
2197 Uninitialized variables include all types of variables including
2198 scalars, array elements, and fields. The definition of an uninitialized
2199 value in Variables and Special Variables is necessary to describe the
2200 value placed on uninitialized variables and on fields that are valid
2201 (for example, < $NF) but have no characters in them and to describe how
2202 these variables are to be used in comparisons. A valid field, such as
2203 $1, that has no characters in it can be obtained from an input line of
2204 "\t\t" when FS='\t'. Historically, the comparison ($1<10) was done
2205 numerically after evaluating $1 to the value zero.
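
The empty-field case can be reproduced as sketched below, assuming FS
is set to a single <tab> in a BEGIN action. The names lt and empty_out
are illustrative.

```sh
# Two tabs delimit three fields, each valid but with no characters.
# The comparison ($1 < 10) holds for such a field.
empty_out=$(printf '\t\t\n' |
    awk 'BEGIN { FS = "\t" } { lt = ($1 < 10); print NF, length($1), lt }')
printf '%s\n' "$empty_out"
```

This should report three fields, a length of 0 for $1, and a true (1)
comparison.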
2206
2207 The phrase ``... also shall have the numeric value of the numeric
2208 string'' was removed from several sections of the ISO POSIX‐2:1993
2209 standard because it specifies an unnecessary implementation detail. It
2210 is not necessary for POSIX.1‐2008 to specify that these objects be
2211 assigned two different values. It is only necessary to specify that
2212 these objects may evaluate to two different values depending on con‐
2213 text.
2214
2215 Historical implementations of awk did not parse hexadecimal integer or
2216 floating constants like "0xa" and "0xap0". Due to an oversight, the
2217 2001 through 2004 editions of this standard required support for hexa‐
2218 decimal floating constants. This was due to the reference to atof().
2219 This version of the standard allows but does not require implementa‐
2220 tions to use atof() and includes a description of how floating-point
2221 numbers are recognized as an alternative to match historic behavior.
2222 The intent of this change is to allow implementations to recognize
2223 floating-point constants according to either the ISO/IEC 9899:1990
2224 standard or ISO/IEC 9899:1999 standard, and to allow (but not require)
2225 implementations to recognize hexadecimal integer constants.
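
Because recognition of hexadecimal values is allowed but not required,
the result of the probe below depends on the implementation. It is
offered only as a quick way to see which behavior a given awk has; the
variable name hex_out is illustrative.

```sh
# Yields 10 if the implementation recognizes hexadecimal numeric
# strings, or 0 if conversion stops at the "x" as it did historically.
hex_out=$(awk 'BEGIN { print "0xa" + 0 }')
printf '%s\n' "$hex_out"
```

A portable script should not rely on either result.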
2226
2227 Historical implementations of awk did not support floating-point
2228 infinities and NaNs in numeric strings; e.g., "-INF" and "NaN".
2229 However, implementations that use the atof() or strtod() functions to do
2230 the conversion picked up support for these values if they used an
2231 ISO/IEC 9899:1999 standard version of the function instead of an
2232 ISO/IEC 9899:1990 standard version. Due to an oversight, the 2001
2233 through 2004 editions of this standard did not allow support for
2234 infinities and NaNs, but in this revision support is allowed (but not
2235 required). This is a silent change to the behavior of awk programs; for
2236 example, in the POSIX locale the expression:
2237
2238 ("-INF" + 0 < 0)
2239
2240 formerly had the value 0 because "-INF" converted to 0, but now it may
2241 have the value 0 or 1.
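
The expression can be evaluated directly to see which of the two
permitted results a particular implementation gives. This is a sketch
only, and the variable name inf_out is illustrative.

```sh
# Yields 1 if "-INF" converts to negative infinity, or 0 if it
# converts to zero as it did in historical implementations.
inf_out=$(awk 'BEGIN { print ("-INF" + 0 < 0) }')
printf '%s\n' "$inf_out"
```

Either value is conforming, so a portable script should not depend on
the outcome.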
2242
2243 FUTURE DIRECTIONS
2244 None.
2245
2246 SEE ALSO
2247 Section 1.3, Grammar Conventions, grep, lex, sed
2248
2249 The Base Definitions volume of POSIX.1‐2008, Chapter 5, File Format
2250 Notation, Section 6.1, Portable Character Set, Chapter 8, Environment
2251 Variables, Chapter 9, Regular Expressions, Section 12.2, Utility Syntax
2252 Guidelines
2253
2254 The System Interfaces volume of POSIX.1‐2008, atof(), exec, isspace(),
2255 popen(), setlocale(), strtod()
2256
2257 COPYRIGHT
2258 Portions of this text are reprinted and reproduced in electronic form
2259 from IEEE Std 1003.1, 2013 Edition, Standard for Information Technology
2260 -- Portable Operating System Interface (POSIX), The Open Group Base
2261 Specifications Issue 7, Copyright (C) 2013 by the Institute of Electri‐
2262 cal and Electronics Engineers, Inc and The Open Group. (This is
2263 POSIX.1-2008 with the 2013 Technical Corrigendum 1 applied.) In the
2264 event of any discrepancy between this version and the original IEEE and
2265 The Open Group Standard, the original IEEE and The Open Group Standard
2266 is the referee document. The original Standard can be obtained online
2267 at http://www.unix.org/online.html .
2268
2269 Any typographical or formatting errors that appear in this page are
2270 most likely to have been introduced during the conversion of the source
2271 files to man page format. To report such errors, see
2272 https://www.kernel.org/doc/man-pages/reporting_bugs.html .
2273
2274
2275
2276IEEE/The Open Group 2013 AWK(1P)