AWK(1P)                    POSIX Programmer's Manual                   AWK(1P)

PROLOG
       This manual page is part of the POSIX Programmer's Manual. The Linux
       implementation of this interface may differ (consult the corresponding
       Linux manual page for details of Linux behavior), or the interface may
       not be implemented on Linux.

NAME
       awk - pattern scanning and processing language

SYNOPSIS
       awk [-F sepstring] [-v assignment]... program [argument...]

       awk [-F sepstring] -f progfile [-f progfile]... [-v assignment]...
            [argument...]

DESCRIPTION
       The awk utility shall execute programs written in the awk programming
       language, which is specialized for textual data manipulation. An awk
       program is a sequence of patterns and corresponding actions. When input
       is read that matches a pattern, the action associated with that pattern
       is carried out.

       Input shall be interpreted as a sequence of records. By default, a
       record is a line, less its terminating <newline>, but this can be
       changed by using the RS built-in variable. Each record of input shall
       be matched in turn against each pattern in the program. For each
       pattern matched, the associated action shall be executed.

       The awk utility shall interpret each input record as a sequence of
       fields where, by default, a field is a string of non-<blank>
       non-<newline> characters. This default <blank> and <newline> field
       delimiter can be changed by using the FS built-in variable or the
       -F sepstring option. The awk utility shall denote the first field in a
       record $1, the second $2, and so on. The symbol $0 shall refer to the
       entire record; setting any other field causes the re-evaluation of $0.
       Assigning to $0 shall reset the values of all other fields and the NF
       built-in variable.

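       As an illustration only (not part of the standard text), the
       record and field model described above can be sketched as follows.
       The sample input and the shell variable name fields_out are
       hypothetical.

```shell
# Hypothetical sample input: two records with whitespace-separated fields.
fields_out=$(printf 'alpha beta gamma\none two\n' |
    awk '{ print NF ": first=" $1 ", last=" $NF }')
printf '%s\n' "$fields_out"
```

       Each input line is one record; NF is recomputed for each record, so
       $NF names the last field whatever the field count is.
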
OPTIONS
       The awk utility shall conform to the Base Definitions volume of
       POSIX.1‐2017, Section 12.2, Utility Syntax Guidelines.

       The following options shall be supported:

       -F sepstring
                 Define the input field separator. This option shall be
                 equivalent to:

                     -v FS=sepstring

                 except that if -F sepstring and -v FS=sepstring are both
                 used, it is unspecified whether the FS assignment resulting
                 from -F sepstring is processed in command line order or is
                 processed after the last -v FS=sepstring. See the
                 description of the FS built-in variable, and how it is used,
                 in the EXTENDED DESCRIPTION section.

       -f progfile
                 Specify the pathname of the file progfile containing an awk
                 program. A pathname of '-' shall denote the standard input.
                 If multiple instances of this option are specified, the
                 concatenation of the files specified as progfile in the order
                 specified shall be the awk program. The awk program can
                 alternatively be specified in the command line as a single
                 argument.

       -v assignment
                 The application shall ensure that the assignment argument is
                 in the same form as an assignment operand. The specified
                 variable assignment shall occur prior to executing the awk
                 program, including the actions associated with BEGIN patterns
                 (if any). Multiple occurrences of this option can be
                 specified.

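       A minimal sketch of the -F and -v options, offered as an
       illustration rather than as normative text. The record contents
       and the names record, with_F, with_v, limit, and begin_out are
       assumptions made for the example.

```shell
# Hypothetical colon-separated record, in the style of a passwd entry.
record='daemon:x:1:1'

# -F sepstring and -v FS=sepstring both set the input field separator.
with_F=$(printf '%s\n' "$record" | awk -F: '{ print $1, $3 }')
with_v=$(printf '%s\n' "$record" | awk -v FS=: '{ print $1, $3 }')

# A -v assignment is already in effect inside a BEGIN action.
begin_out=$(awk -v limit=5 'BEGIN { print limit + 1 }')

printf '%s\n%s\n%s\n' "$with_F" "$with_v" "$begin_out"
```

       The example avoids combining -F with -v FS=..., because the order
       in which the two take effect is unspecified.
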
OPERANDS
       The following operands shall be supported:

       program   If no -f option is specified, the first operand to awk shall
                 be the text of the awk program. The application shall supply
                 the program operand as a single argument to awk. If the text
                 does not end in a <newline>, awk shall interpret the text as
                 if it did.

       argument  Either of the following two types of argument can be
                 intermixed:

                 file      A pathname of a file that contains the input to be
                           read, which is matched against the set of patterns
                           in the program. If no file operands are specified,
                           or if a file operand is '-', the standard input
                           shall be used.

                 assignment
                           An operand that begins with an <underscore> or
                           alphabetic character from the portable character
                           set (see the table in the Base Definitions volume
                           of POSIX.1‐2017, Section 6.1, Portable Character
                           Set), followed by a sequence of underscores,
                           digits, and alphabetics from the portable character
                           set, followed by the '=' character, shall specify a
                           variable assignment rather than a pathname. The
                           characters before the '=' represent the name of an
                           awk variable; if that name is an awk reserved word
                           (see Grammar) the behavior is undefined. The
                           characters following the <equals-sign> shall be
                           interpreted as if they appeared in the awk program
                           preceded and followed by a double-quote ('"')
                           character, as a STRING token (see Grammar), except
                           that if the last character is an unescaped
                           <backslash>, it shall be interpreted as a literal
                           <backslash> rather than as the first character of
                           the sequence "\"". The variable shall be assigned
                           the value of that STRING token; if appropriate, it
                           shall be considered a numeric string (see
                           Expressions in awk) and the variable shall also be
                           assigned its numeric value. Each such variable
                           assignment shall occur just prior to the processing
                           of the following file, if any. Thus, an assignment
                           before the first file argument shall be executed
                           after the BEGIN actions (if any), while an
                           assignment after the last file argument shall occur
                           before the END actions (if any). If there are no
                           file arguments, assignments shall be executed
                           before processing the standard input.

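       The timing of assignment operands can be sketched as below. This
       is an illustrative example, not normative text; the variable x, its
       values, and the shell variable timing_out are hypothetical. The
       file operand '-' stands for the standard input.

```shell
# x=first precedes the only file operand (-); x=last follows it.
timing_out=$(printf 'record\n' | awk '
    BEGIN { print "BEGIN: x=[" x "]" }
          { print "main:  x=[" x "]" }
    END   { print "END:   x=[" x "]" }
' x=first - x=last)
printf '%s\n' "$timing_out"
```

       The BEGIN action sees x uninitialized, the main action sees the
       value assigned just before the file is processed, and the END
       action sees the value assigned after the last file.
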
STDIN
       The standard input shall be used only if no file operands are
       specified, or if a file operand is '-', or if a progfile
       option-argument is '-'; see the INPUT FILES section. If the awk
       program contains no actions and no patterns, but is otherwise a valid
       awk program, standard input and any file operands shall not be read
       and awk shall exit with a return status of zero.

INPUT FILES
       Input files to the awk program from any of the following sources shall
       be text files:

        *  Any file operands or their equivalents, achieved by modifying the
           awk variables ARGV and ARGC

        *  Standard input in the absence of any file operands

        *  Arguments to the getline function

       Whether the variable RS is set to a value other than a <newline> or
       not, for these files, implementations shall support records terminated
       with the specified separator up to {LINE_MAX} bytes and may support
       longer records.

       If -f progfile is specified, the application shall ensure that the
       files named by each of the progfile option-arguments are text files and
       their concatenation, in the same order as they appear in the arguments,
       is an awk program.

ENVIRONMENT VARIABLES
       The following environment variables shall affect the execution of awk:

       LANG      Provide a default value for the internationalization
                 variables that are unset or null. (See the Base Definitions
                 volume of POSIX.1‐2017, Section 8.2, Internationalization
                 Variables for the precedence of internationalization
                 variables used to determine the values of locale categories.)

       LC_ALL    If set to a non-empty string value, override the values of
                 all the other internationalization variables.

       LC_COLLATE
                 Determine the locale for the behavior of ranges, equivalence
                 classes, and multi-character collating elements within
                 regular expressions and in comparisons of string values.

       LC_CTYPE  Determine the locale for the interpretation of sequences of
                 bytes of text data as characters (for example, single-byte as
                 opposed to multi-byte characters in arguments and input
                 files), the behavior of character classes within regular
                 expressions, the identification of characters as letters, and
                 the mapping of uppercase and lowercase characters for the
                 toupper and tolower functions.

       LC_MESSAGES
                 Determine the locale that should be used to affect the format
                 and contents of diagnostic messages written to standard
                 error.

       LC_NUMERIC
                 Determine the radix character used when interpreting numeric
                 input, performing conversions between numeric and string
                 values, and formatting numeric output. Regardless of locale,
                 the <period> character (the decimal-point character of the
                 POSIX locale) is the decimal-point character recognized in
                 processing awk programs (including assignments in command
                 line arguments).

       NLSPATH   Determine the location of message catalogs for the processing
                 of LC_MESSAGES.

       PATH      Determine the search path when looking for commands executed
                 by system(expr), or input and output pipes; see the Base
                 Definitions volume of POSIX.1‐2017, Chapter 8, Environment
                 Variables.

       In addition, all environment variables shall be visible via the awk
       variable ENVIRON.

ASYNCHRONOUS EVENTS
       Default.

STDOUT
       The nature of the output files depends on the awk program.

STDERR
       The standard error shall be used only for diagnostic messages.

OUTPUT FILES
       The nature of the output files depends on the awk program.

EXTENDED DESCRIPTION
   Overall Program Structure
       An awk program is composed of pairs of the form:

           pattern { action }

       Either the pattern or the action (including the enclosing brace
       characters) can be omitted.

       A missing pattern shall match any record of input, and a missing action
       shall be equivalent to:

           { print }

       Execution of the awk program shall start by first executing the actions
       associated with all BEGIN patterns in the order they occur in the
       program. Then each file operand (or standard input if no files were
       specified) shall be processed in turn by reading data from the file
       until a record separator is seen (<newline> by default). Before the
       first reference to a field in the record is evaluated, the record shall
       be split into fields, according to the rules in Regular Expressions,
       using the value of FS that was current at the time the record was read.
       Each pattern in the program then shall be evaluated in the order of
       occurrence, and the action associated with each pattern that matches
       the current record executed. The action for a matching pattern shall be
       executed before evaluating subsequent patterns. Finally, the actions
       associated with all END patterns shall be executed in the order they
       occur in the program.

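       The execution order described above can be sketched with a small
       program. This is an illustration under assumed input; the sample
       records and the names count and structure_out are hypothetical.

```shell
structure_out=$(printf 'a 1\nb 2\nc 3\n' | awk '
    BEGIN  { print "start" }    # runs before any input is read
    $2 > 1                      # no action: same as { print }
           { count++ }          # no pattern: runs for every record
    END    { print count " records" }
')
printf '%s\n' "$structure_out"
```

       The pattern-only rule prints the second and third records, the
       action-only rule counts all three, and the END action reports the
       count after the input is exhausted.
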
   Expressions in awk
       Expressions describe computations used in patterns and actions. In the
       following table, valid expression operations are given in groups from
       highest precedence first to lowest precedence last, with
       equal-precedence operators grouped between horizontal lines. In
       expression evaluation, where the grammar is formally ambiguous, higher
       precedence operators shall be evaluated before lower precedence
       operators. In this table expr, expr1, expr2, and expr3 represent any
       expression, while lvalue represents any entity that can be assigned to
       (that is, on the left side of an assignment operator). The precise
       syntax of expressions is given in Grammar.

                  Table 4-1: Expressions in Decreasing Precedence in awk

       ┌─────────────────────┬─────────────────────────┬────────────────┬──────────────┐
       │       Syntax        │          Name           │ Type of Result │Associativity │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │( expr )             │Grouping                 │Type of expr    │N/A           │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │$expr                │Field reference          │String          │N/A           │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │lvalue ++            │Post-increment           │Numeric         │N/A           │
       │lvalue --            │Post-decrement           │Numeric         │N/A           │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │++ lvalue            │Pre-increment            │Numeric         │N/A           │
       │-- lvalue            │Pre-decrement            │Numeric         │N/A           │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │expr ^ expr          │Exponentiation           │Numeric         │Right         │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │! expr               │Logical not              │Numeric         │N/A           │
       │+ expr               │Unary plus               │Numeric         │N/A           │
       │- expr               │Unary minus              │Numeric         │N/A           │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │expr * expr          │Multiplication           │Numeric         │Left          │
       │expr / expr          │Division                 │Numeric         │Left          │
       │expr % expr          │Modulus                  │Numeric         │Left          │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │expr + expr          │Addition                 │Numeric         │Left          │
       │expr - expr          │Subtraction              │Numeric         │Left          │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │expr expr            │String concatenation     │String          │Left          │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │expr < expr          │Less than                │Numeric         │None          │
       │expr <= expr         │Less than or equal to    │Numeric         │None          │
       │expr != expr         │Not equal to             │Numeric         │None          │
       │expr == expr         │Equal to                 │Numeric         │None          │
       │expr > expr          │Greater than             │Numeric         │None          │
       │expr >= expr         │Greater than or equal to │Numeric         │None          │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │expr ~ expr          │ERE match                │Numeric         │None          │
       │expr !~ expr         │ERE non-match            │Numeric         │None          │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │expr in array        │Array membership         │Numeric         │Left          │
       │( index ) in array   │Multi-dimension array    │Numeric         │Left          │
       │                     │membership               │                │              │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │expr && expr         │Logical AND              │Numeric         │Left          │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │expr || expr         │Logical OR               │Numeric         │Left          │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │expr1 ? expr2 : expr3│Conditional expression   │Type of selected│Right         │
       │                     │                         │expr2 or expr3  │              │
       ├─────────────────────┼─────────────────────────┼────────────────┼──────────────┤
       │lvalue ^= expr       │Exponentiation assignment│Numeric         │Right         │
       │lvalue %= expr       │Modulus assignment       │Numeric         │Right         │
       │lvalue *= expr       │Multiplication assignment│Numeric         │Right         │
       │lvalue /= expr       │Division assignment      │Numeric         │Right         │
       │lvalue += expr       │Addition assignment      │Numeric         │Right         │
       │lvalue -= expr       │Subtraction assignment   │Numeric         │Right         │
       │lvalue = expr        │Assignment               │Type of expr    │Right         │
       └─────────────────────┴─────────────────────────┴────────────────┴──────────────┘
       Each expression shall have either a string value, a numeric value, or
       both. Except as stated for specific contexts, the value of an
       expression shall be implicitly converted to the type needed for the
       context in which it is used. A string value shall be converted to a
       numeric value either by the equivalent of the following calls to
       functions defined by the ISO C standard:

           setlocale(LC_NUMERIC, "");
           numeric_value = atof(string_value);

       or by converting the initial portion of the string to type double
       representation as follows:

              The input string is decomposed into two parts: an initial,
              possibly empty, sequence of white-space characters (as specified
              by isspace()) and a subject sequence interpreted as a
              floating-point constant.

              The expected form of the subject sequence is an optional '+' or
              '-' sign, then a non-empty sequence of digits optionally
              containing a <period>, then an optional exponent part. An
              exponent part consists of 'e' or 'E', followed by an optional
              sign, followed by one or more decimal digits.

              The sequence starting with the first digit or the <period>
              (whichever occurs first) is interpreted as a floating constant
              of the C language, and if neither an exponent part nor a
              <period> appears, a <period> is assumed to follow the last digit
              in the string. If the subject sequence begins with a
              <hyphen-minus>, the value resulting from the conversion is
              negated.

       A numeric value that is exactly equal to the value of an integer (see
       Section 1.1.2, Concepts Derived from the ISO C Standard) shall be
       converted to a string by the equivalent of a call to the sprintf
       function (see String Functions) with the string "%d" as the fmt
       argument and the numeric value being converted as the first and only
       expr argument. Any other numeric value shall be converted to a string
       by the equivalent of a call to the sprintf function with the value of
       the variable CONVFMT as the fmt argument and the numeric value being
       converted as the first and only expr argument. The result of the
       conversion is unspecified if the value of CONVFMT is not a
       floating-point format specification. This volume of POSIX.1‐2017
       specifies no explicit conversions between numbers and strings. An
       application can force an expression to be treated as a number by adding
       zero to it, or can force it to be treated as a string by concatenating
       the null string ("") to it.

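       The conversion rules above can be sketched as follows. This is an
       illustration, not normative text; the names third and conv_out are
       hypothetical, and LC_ALL=C is set as an assumption so that the
       radix character is a <period>.

```shell
conv_out=$(LC_ALL=C awk 'BEGIN {
    CONVFMT = "%.2f"
    third = 1 / 3
    print (third "")      # non-integer value: converted using CONVFMT
    print (12 "")         # integer value: converted as if by "%d"
    print ("3.0" + 0)     # adding zero forces a numeric result
    print ("3x" + 0)      # only the leading numeric portion is converted
}')
printf '%s\n' "$conv_out"
```

       Concatenating the null string forces a string; adding zero forces
       a number, which print then writes as an integer when the value is
       integral.
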
       A string value shall be considered a numeric string if it comes from
       one of the following:

        1. Field variables

        2. Input from the getline() function

        3. FILENAME

        4. ARGV array elements

        5. ENVIRON array elements

        6. Array elements created by the split() function

        7. A command line variable assignment

        8. Variable assignment from another numeric string variable

       and an implementation-dependent condition corresponding to either case
       (a) or (b) below is met.

        a. After the equivalent of the following calls to functions defined by
           the ISO C standard, string_value_end would differ from
           string_value, and any characters before the terminating null
           character in string_value_end would be <blank> characters:

               char *string_value_end;
               setlocale(LC_NUMERIC, "");
               numeric_value = strtod (string_value, &string_value_end);

        b. After all the following conversions have been applied, the
           resulting string would lexically be recognized as a NUMBER token as
           described by the lexical conventions in Grammar:

           --  All leading and trailing <blank> characters are discarded.

           --  If the first non-<blank> is '+' or '-', it is discarded.

           --  Each occurrence of the decimal point character from the current
               locale is changed to a <period>.

       In case (a) the numeric value of the numeric string shall be the value
       that would be returned by the strtod() call. In case (b) if the first
       non-<blank> is '-', the numeric value of the numeric string shall be
       the negation of the numeric value of the recognized NUMBER token;
       otherwise, the numeric value of the numeric string shall be the numeric
       value of the recognized NUMBER token. Whether or not a string is a
       numeric string shall be relevant only in contexts where that term is
       used in this section.

       When an expression is used in a Boolean context, if it has a numeric
       value, a value of zero shall be treated as false and any other value
       shall be treated as true. Otherwise, a string value of the null string
       shall be treated as false and any other value shall be treated as true.
       A Boolean context shall be one of the following:

        *  The first subexpression of a conditional expression

        *  An expression operated on by logical NOT, logical AND, or logical
           OR

        *  The second expression of a for statement

        *  The expression of an if statement

        *  The expression of the while clause in either a while or do...while
           statement

        *  An expression used as a pattern (as in Overall Program Structure)

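       The truth rules above can be sketched using the first
       subexpression of a conditional expression as the Boolean context.
       This is an illustrative example; the names unset and bool_out are
       hypothetical.

```shell
bool_out=$(awk 'BEGIN {
    print (0     ? "true" : "false")    # numeric zero
    print (0.5   ? "true" : "false")    # any other numeric value
    print (""    ? "true" : "false")    # null string
    print ("0"   ? "true" : "false")    # non-null string constant
    print (unset ? "true" : "false")    # uninitialized value
}')
printf '%s\n' "$bool_out"
```

       The string constant "0" has only a string value, so it is tested
       against the null string and treated as true, whereas the number 0
       is false.
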
       All arithmetic shall follow the semantics of floating-point arithmetic
       as specified by the ISO C standard (see Section 1.1.2, Concepts Derived
       from the ISO C Standard).

       The value of the expression:

           expr1 ^ expr2

       shall be equivalent to the value returned by the ISO C standard
       function call:

           pow(expr1, expr2)

       The expression:

           lvalue ^= expr

       shall be equivalent to the ISO C standard expression:

           lvalue = pow(lvalue, expr)

       except that lvalue shall be evaluated only once. The value of the
       expression:

           expr1 % expr2

       shall be equivalent to the value returned by the ISO C standard
       function call:

           fmod(expr1, expr2)

       The expression:

           lvalue %= expr

       shall be equivalent to the ISO C standard expression:

           lvalue = fmod(lvalue, expr)

       except that lvalue shall be evaluated only once.

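       A short sketch of the exponentiation and modulus operators,
       offered as an illustration. The names x and arith_out are
       hypothetical, and LC_ALL=C is set as an assumption so that numeric
       output uses a <period>.

```shell
arith_out=$(LC_ALL=C awk 'BEGIN {
    print 2 ^ 10        # pow(2, 10)
    print 2 ^ 3 ^ 2     # right associative: 2 ^ (3 ^ 2)
    print 7.5 % 2       # fmod(7.5, 2): operands need not be integers
    print -7 % 3        # fmod(-7, 3): result has the sign of the dividend
    x = 3; x ^= 2; print x
}')
printf '%s\n' "$arith_out"
```

       Because the modulus operator follows fmod(), it accepts
       non-integer operands and keeps the sign of its left operand.
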
       Variables and fields shall be set by the assignment statement:

           lvalue = expression

       and the type of expression shall determine the resulting variable type.
       The assignment includes the arithmetic assignments ("+=", "-=", "*=",
       "/=", "%=", "^=", "++", "--") all of which shall produce a numeric
       result. The left-hand side of an assignment and the target of increment
       and decrement operators can be one of a variable, an array with index,
       or a field selector.

       The awk language supplies arrays that are used for storing numbers or
       strings. Arrays need not be declared. They shall initially be empty,
       and their sizes shall change dynamically. The subscripts, or element
       identifiers, are strings, providing a type of associative array
       capability. An array name followed by a subscript within square
       brackets can be used as an lvalue and thus as an expression, as
       described in the grammar; see Grammar. Unsubscripted array names can be
       used in only the following contexts:

        *  A parameter in a function definition or function call

        *  The NAME token following any use of the keyword in as specified in
           the grammar (see Grammar); if the name used in this context is not
           an array name, the behavior is undefined

       A valid array index shall consist of one or more <comma>-separated
       expressions, similar to the way in which multi-dimensional arrays are
       indexed in some programming languages. Because awk arrays are really
       one-dimensional, such a <comma>-separated list shall be converted to a
       single string by concatenating the string values of the separate
       expressions, each separated from the other by the value of the SUBSEP
       variable. Thus, the following two index operations shall be equivalent:

           var[expr1, expr2, ... exprn]

           var[expr1 SUBSEP expr2 SUBSEP ... SUBSEP exprn]

       The application shall ensure that a multi-dimensioned index used with
       the in operator is parenthesized. The in operator, which tests for the
       existence of a particular array element, shall not cause that element
       to exist. Any other reference to a nonexistent array element shall
       automatically create it.

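       The array rules above can be sketched as follows. This is an
       illustration only; the array name grid and the names key, n,
       unused, and array_out are hypothetical.

```shell
array_out=$(awk 'BEGIN {
    grid[1, 2] = "x"
    print ((1, 2) in grid)          # stored under the key 1 SUBSEP 2
    print ((1 SUBSEP 2) in grid)    # the same key, built by hand
    print ((3, 4) in grid)          # "in" does not create the element

    n = 0; for (key in grid) n++
    print n                         # still one element

    unused = grid[3, 4]             # any other reference creates it
    n = 0; for (key in grid) n++
    print n                         # now two elements
}')
printf '%s\n' "$array_out"
```

       Testing with in before reading an element avoids growing the array
       as a side effect of the lookup.
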
       Comparisons (with the '<', "<=", "!=", "==", '>', and ">=" operators)
       shall be made numerically if both operands are numeric, if one is
       numeric and the other has a string value that is a numeric string, or
       if one is numeric and the other has the uninitialized value.
       Otherwise, operands shall be converted to strings as required and a
       string comparison shall be made as follows:

        *  For the "!=" and "==" operators, the strings should be compared to
           check if they are identical but may be compared using the
           locale-specific collation sequence to check if they collate
           equally.

        *  For the other operators, the strings shall be compared using the
           locale-specific collation sequence.

       The value of the comparison expression shall be 1 if the relation is
       true, or 0 if the relation is false.

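       The difference between numeric and string comparison can be
       sketched with a field that is a numeric string. This is an
       illustrative example; the input value and the name cmp_out are
       hypothetical, and LC_ALL=C is set as an assumption so that the
       collation sequence is that of the POSIX locale.

```shell
cmp_out=$(printf '10\n' | LC_ALL=C awk '{
    print ($1 < 9)                # numeric string vs number: 10 < 9 is false
    print (($1 "") < (9 ""))      # both strings: "10" collates before "9"
    print ($1 == 10.0)            # numeric string vs number: numeric
    print ($1 == "10.0")          # string constant: string comparison
}')
printf '%s\n' "$cmp_out"
```

       A string constant on either side forces a string comparison, so
       "10" and "10.0" differ even though their numeric values are equal.
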
   Variables and Special Variables
       Variables can be used in an awk program by referencing them. With the
       exception of function parameters (see User-Defined Functions), they are
       not explicitly declared. Function parameter names shall be local to the
       function; all other variable names shall be global. The same name shall
       not be used as both a function parameter name and as the name of a
       function or a special awk variable. The same name shall not be used
       both as a variable name with global scope and as the name of a
       function. The same name shall not be used within the same scope both as
       a scalar variable and as an array. Uninitialized variables, including
       scalar variables, array elements, and field variables, shall have an
       uninitialized value. An uninitialized value shall have both a numeric
       value of zero and a string value of the empty string. Evaluation of
       variables with an uninitialized value, to either string or numeric,
       shall be determined by the context in which they are used.

       Field variables shall be designated by a '$' followed by a number or
       numerical expression. The effect of the field number expression
       evaluating to anything other than a non-negative integer is
       unspecified; uninitialized variables or string values need not be
       converted to numeric values in this context. New field variables can be
       created by assigning a value to them. References to nonexistent fields
       (that is, fields after $NF), shall evaluate to the uninitialized value.
       Such references shall not create new fields. However, assigning to a
       nonexistent field (for example, $(NF+2)=5) shall increase the value of
       NF; create any intervening fields with the uninitialized value; and
       cause the value of $0 to be recomputed, with the fields being separated
       by the value of OFS. Each field variable shall have a string value or
       an uninitialized value when created. Field variables shall have the
       uninitialized value when created from $0 using FS and the variable does
       not contain any characters. If appropriate, the field variable shall be
       considered a numeric string (see Expressions in awk).

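       The effect of reading and assigning nonexistent fields can be
       sketched as below. This is an illustration under assumed input;
       the sample record, the OFS value, and the name field_out are
       hypothetical.

```shell
field_out=$(printf 'a b\n' | awk -v OFS=- '{
    print NF, ($5 == "")    # reading $5 yields the uninitialized value
    print NF                # and does not create any field
    $(NF + 2) = 5           # assigning to $4 also creates an empty $3
    print NF
    print                   # $0 is rebuilt with OFS between the fields
}')
printf '%s\n' "$field_out"
```

       Setting OFS to a visible character makes the empty intervening
       field apparent in the rebuilt record.
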
       Implementations shall support the following other special variables
       that are set by awk:

       ARGC      The number of elements in the ARGV array.

       ARGV      An array of command line arguments, excluding options and the
                 program argument, numbered from zero to ARGC-1.

                 The arguments in ARGV can be modified or added to; ARGC can
                 be altered. As each input file ends, awk shall treat the next
                 non-null element of ARGV, up to the current value of ARGC-1,
                 inclusive, as the name of the next input file. Thus, setting
                 an element of ARGV to null means that it shall not be treated
                 as an input file. The name '-' indicates the standard input.
                 If an argument matches the format of an assignment operand,
                 this argument shall be treated as an assignment rather than a
                 file argument.

       CONVFMT   The printf format for converting numbers to strings (except
                 for output statements, where OFMT is used); "%.6g" by
                 default.

       ENVIRON   An array representing the value of the environment, as
                 described in the exec functions defined in the System
                 Interfaces volume of POSIX.1‐2017. The indices of the array
                 shall be strings consisting of the names of the environment
                 variables, and the value of each array element shall be a
                 string consisting of the value of that variable. If
                 appropriate, the environment variable shall be considered a
                 numeric string (see Expressions in awk); the array element
                 shall also have its numeric value.

                 In all cases where the behavior of awk is affected by
                 environment variables (including the environment of any
                 commands that awk executes via the system function or via
                 pipeline redirections with the print statement, the printf
                 statement, or the getline function), the environment used
                 shall be the environment at the time awk began executing; it
                 is implementation-defined whether any modification of ENVIRON
                 affects this environment.

       FILENAME  A pathname of the current input file. Inside a BEGIN action
                 the value is undefined. Inside an END action the value shall
                 be the name of the last input file processed.

       FNR       The ordinal number of the current record in the current file.
                 Inside a BEGIN action the value shall be zero. Inside an END
                 action the value shall be the number of the last record
                 processed in the last file processed.

       FS        Input field separator regular expression; a <space> by
                 default.

       NF        The number of fields in the current record. Inside a BEGIN
                 action, the use of NF is undefined unless a getline function
                 without a var argument is executed previously. Inside an END
                 action, NF shall retain the value it had for the last record
                 read, unless a subsequent, redirected, getline function
                 without a var argument is performed prior to entering the END
                 action.

       NR        The ordinal number of the current record from the start of
                 input. Inside a BEGIN action the value shall be zero. Inside
                 an END action the value shall be the number of the last
                 record processed.

       OFMT      The printf format for converting numbers to strings in output
                 statements (see Output Statements); "%.6g" by default. The
                 result of the conversion is unspecified if the value of OFMT
                 is not a floating-point format specification.

       OFS       The print statement output field separator; <space> by
                 default.

       ORS       The print statement output record separator; a <newline> by
                 default.

       RLENGTH   The length of the string matched by the match function.

       RS        The first character of the string value of RS shall be the
                 input record separator; a <newline> by default. If RS
                 contains more than one character, the results are
                 unspecified. If RS is null, then records are separated by
                 sequences consisting of a <newline> plus one or more blank
                 lines, leading or trailing blank lines shall not result in
                 empty records at the beginning or end of the input, and a
                 <newline> shall always be a field separator, no matter what
                 the value of FS is.

       RSTART    The starting position of the string matched by the match
                 function, numbering from 1. This shall always be equivalent
                 to the return value of the match function.

       SUBSEP    The subscript separator string for multi-dimensional arrays;
                 the default value is implementation-defined.

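       Several of the special variables listed above can be sketched
       together. This is an illustration, not normative text: the file
       names, sample data, and shell variable names are hypothetical, and
       mktemp -d is assumed to be available although it is not specified
       by POSIX.

```shell
# ARGV holds the operands; element 0 is the command name, which varies.
argv_out=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) print i, ARGV[i] }' one two)

# NR counts records overall; FNR restarts for each input file.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/first"
printf 'c\n' > "$dir/second"
nr_out=$(awk '{ print NR, FNR, $0 }' "$dir/first" "$dir/second")
rm -r "$dir"

# A null RS makes blank lines separate records, and <newline> then
# separates fields as well.
para_out=$(printf 'a b\nc\n\n\nd\n' | awk 'BEGIN { RS = "" } { print NR, NF }')

# match() returns the start position and sets RSTART and RLENGTH.
match_out=$(awk 'BEGIN { print match("foobar", /o+/), RSTART, RLENGTH }')

# OFMT applies to numbers written by print; CONVFMT to other conversions.
ofmt_out=$(LC_ALL=C awk 'BEGIN { OFMT = "%.2f"; x = 3.14159; print x; print x "" }')

printf '%s\n' "$argv_out" "$nr_out" "$para_out" "$match_out" "$ofmt_out"
```

       The first command contains only a BEGIN action, so the operands
       one and two are never opened as files.
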
   Regular Expressions
       The awk utility shall make use of the extended regular expression
       notation (see the Base Definitions volume of POSIX.1‐2017, Section 9.4,
       Extended Regular Expressions) except that it shall allow the use of
       C-language conventions for escaping special characters within the EREs,
       as specified in the table in the Base Definitions volume of
       POSIX.1‐2017, Chapter 5, File Format Notation ('\\', '\a', '\b', '\f',
       '\n', '\r', '\t', '\v') and the following table; these escape sequences
       shall be recognized both inside and outside bracket expressions. Note
       that records need not be separated by <newline> characters and string
       constants can contain <newline> characters, so even the "\n" sequence
       is valid in awk EREs. Using a <slash> character within an ERE requires
       the escaping shown in the following table.

                            Table 4-2: Escape Sequences in awk

       ┌─────────┬────────────────────────────────────┬────────────────────────────────────┐
       │ Escape  │                                    │                                    │
       │Sequence │            Description             │              Meaning               │
       ├─────────┼────────────────────────────────────┼────────────────────────────────────┤
       │\"       │ <backslash> <quotation-mark>       │ <quotation-mark> character         │
       ├─────────┼────────────────────────────────────┼────────────────────────────────────┤
       │\/       │ <backslash> <slash>                │ <slash> character                  │
       ├─────────┼────────────────────────────────────┼────────────────────────────────────┤
       │\ddd     │ A <backslash> character followed   │ The character whose encoding is    │
       │         │ by the longest sequence of one,    │ represented by the one, two, or    │
       │         │ two, or three octal-digit          │ three-digit octal integer.         │
       │         │ characters (01234567). If all of   │ Multi-byte characters require      │
       │         │ the digits are 0 (that is,         │ multiple, concatenated escape      │
       │         │ representation of the NUL          │ sequences of this type, including  │
       │         │ character), the behavior is        │ the leading <backslash> for each   │
       │         │ undefined.                         │ byte.                              │
       ├─────────┼────────────────────────────────────┼────────────────────────────────────┤
       │\c       │ A <backslash> character followed   │ Undefined                          │
       │         │ by any character not described in  │                                    │
       │         │ this table or in the table in the  │                                    │
       │         │ Base Definitions volume of         │                                    │
       │         │ POSIX.1‐2017, Chapter 5, File      │                                    │
       │         │ Format Notation ('\\', '\a',       │                                    │
       │         │ '\b', '\f', '\n', '\r', '\t',      │                                    │
       │         │ '\v').                             │                                    │
       └─────────┴────────────────────────────────────┴────────────────────────────────────┘
723 A regular expression can be matched against a specific field or string
724 by using one of the two regular expression matching operators, '~' and
725 "!~". These operators shall interpret their right-hand operand as a
726 regular expression and their left-hand operand as a string. If the reg‐
727 ular expression matches the string, the '~' expression shall evaluate
728 to a value of 1, and the "!~" expression shall evaluate to a value of
729 0. (The regular expression matching operation is as defined by the term
730 matched in the Base Definitions volume of POSIX.1‐2017, Section 9.1,
731 Regular Expression Definitions, where a match occurs on any part of the
732 string unless the regular expression is limited with the <circumflex>
733 or <dollar-sign> special characters.) If the regular expression does
734 not match the string, the '~' expression shall evaluate to a value of
735 0, and the "!~" expression shall evaluate to a value of 1. If the
736 right-hand operand is any expression other than the lexical token ERE,
737 the string value of the expression shall be interpreted as an extended
738 regular expression, including the escape conventions described above.
739 Note that these same escape conventions shall also be applied in deter‐
740 mining the value of a string literal (the lexical token STRING), and
741 thus shall be applied a second time when a string literal is used in
742 this context.
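
The matching operators and the double application of escape conventions can be sketched as follows. The sample input and the variable name match_out are assumptions for illustration only.

```sh
# Each output line shows: record, (~ with an ERE token), (!~ with the same
# ERE token), (~ with a string literal used as a dynamic regular expression).
match_out=$(printf 'a.b\naxb\n' | awk '
    { print $0, ($0 ~ /a\.b/), ($0 !~ /a\.b/), ($0 ~ "a\\.b") }')
printf '%s\n' "$match_out"
```

In the string form the <backslash> is doubled: string-literal processing reduces "a\\.b" to a\.b, and that value is then interpreted as an ERE. Writing "a\.b" would rely on the undefined \c escape from Table 4-2.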
743
744 When an ERE token appears as an expression in any context other than as
745 the right-hand of the '~' or "!~" operator or as one of the built-in
746 function arguments described below, the value of the resulting expres‐
747 sion shall be the equivalent of:
748
749
750 $0 ~ /ere/
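
A small sketch of this equivalence, using made-up input and the hypothetical names hit and bare_out: an ERE token used as an ordinary expression yields the 1 or 0 result of matching it against $0.

```sh
bare_out=$(printf 'foo\nbar\n' |
    awk '{ hit = /foo/; print $0, hit, (/a/ && /r$/) }')
printf '%s\n' "$bare_out"
```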
751
752 The ere argument to the gsub, match, sub functions, and the fs argument
753 to the split function (see String Functions) shall be interpreted as
754 extended regular expressions. These can be either ERE tokens or arbi‐
755 trary expressions, and shall be interpreted in the same manner as the
756 right-hand side of the '~' or "!~" operator.
757
758 An extended regular expression can be used to separate fields by
759 assigning a string containing the expression to the built-in variable
760 FS, either directly or as a consequence of using the -F sepstring
761 option. The default value of the FS variable shall be a single
762 <space>. The following describes FS behavior:
763
764 1. If FS is a null string, the behavior is unspecified.
765
766 2. If FS is a single character:
767
768 a. If FS is <space>, skip leading and trailing <blank> and <new‐
769 line> characters; fields shall be delimited by sets of one or
770 more <blank> or <newline> characters.
771
772 b. Otherwise, if FS is any other character c, fields shall be
773 delimited by each single occurrence of c.
774
775 3. Otherwise, the string value of FS shall be considered to be an
776 extended regular expression. Each occurrence of a sequence matching
777 the extended regular expression shall delimit fields.
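
The three defined cases of FS behavior can be sketched as follows. The input strings and the variable names fs_space, fs_char, and fs_ere are chosen for this example only.

```sh
# Case 2a: FS is a single <space>; leading and trailing blanks are skipped
# and runs of blanks delimit fields.
fs_space=$(printf '  a \t b:c  \n' | awk '{ print NF "|" $1 "|" $2 }')

# Case 2b: any other single character; every occurrence delimits a field,
# so the two adjacent colons enclose an empty field.
fs_char=$(printf 'a::b\n' | awk -F: '{ print NF "|" $2 "|" $3 }')

# Case 3: a longer FS value is treated as an extended regular expression.
fs_ere=$(printf 'a1b22c\n' | awk -F'[0-9]+' '{ print NF "|" $1 "|" $2 "|" $3 }')

printf '%s\n' "$fs_space" "$fs_char" "$fs_ere"
```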
778
779 Except for the '~' and "!~" operators, and in the gsub, match, split,
780 and sub built-in functions, ERE matching shall be based on input
781 records; that is, record separator characters (the first character of
782 the value of the variable RS, <newline> by default) cannot be embedded
783 in the expression, and no expression shall match the record separator
784 character. If the record separator is not <newline>, <newline> charac‐
785 ters embedded in the expression can be matched. For the '~' and "!~"
786 operators, and in those four built-in functions, ERE matching shall be
787 based on text strings; that is, any character (including <newline> and
788 the record separator) can be embedded in the pattern, and an appropri‐
789 ate pattern shall match any character. However, in all awk ERE match‐
790 ing, the use of one or more NUL characters in the pattern, input
791 record, or text string produces undefined results.
792
793 Patterns
794 A pattern is any valid expression, a range specified by two expressions
795 separated by a comma, or one of the two special patterns BEGIN or END.
796
797 Special Patterns
798 The awk utility shall recognize two special patterns, BEGIN and END.
799 Each BEGIN pattern shall be matched once and its associated action exe‐
800 cuted before the first record of input is read—except possibly by use
801 of the getline function (see Input/Output and General Functions) in a
802 prior BEGIN action—and before command line assignment is done. Each END
803 pattern shall be matched once and its associated action executed after
804 the last record of input has been read. These two patterns shall have
805 associated actions.
806
807 BEGIN and END shall not combine with other patterns. Multiple BEGIN and
808 END patterns shall be allowed. The actions associated with the BEGIN
809 patterns shall be executed in the order specified in the program, as
810 are the END actions. An END pattern can precede a BEGIN pattern in a
811 program.
812
813 If an awk program consists of only actions with the pattern BEGIN, and
814 the BEGIN action contains no getline function, awk shall exit without
815 reading its input when the last statement in the last BEGIN action is
816 executed. If an awk program consists of only actions with the pattern
817 END or only actions with the patterns BEGIN and END, the input shall be
818 read before the statements in the END actions are executed.
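
The ordering rules for the special patterns can be sketched as follows. The messages and the variable names order_out and begin_only are made up for this example.

```sh
# BEGIN actions run in program order before any input; END actions run in
# program order after the last record, even if an END appears first.
order_out=$(printf 'x\ny\n' | awk '
    END   { print "end 1, NR=" NR }
    BEGIN { print "begin 1" }
    BEGIN { print "begin 2" }
    END   { print "end 2" }')
printf '%s\n' "$order_out"

# A program with only BEGIN actions (and no getline) exits without reading input.
begin_only=$(awk 'BEGIN { print "no input needed" }')
printf '%s\n' "$begin_only"
```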
819
820 Expression Patterns
821 An expression pattern shall be evaluated as if it were an expression in
822 a Boolean context. If the result is true, the pattern shall be consid‐
823 ered to match, and the associated action (if any) shall be executed. If
824 the result is false, the action shall not be executed.
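
As a brief sketch with made-up input (expr_out is a hypothetical variable name), the first rule below runs its action only when the expression is true, and the second rule has no action, so a matching record is simply printed.

```sh
expr_out=$(printf '1 a\n2 b c\n3\n' | awk '
    NF > 1 && $1 % 2 { print "odd id with payload:", $0 }
    NF == 1')
printf '%s\n' "$expr_out"
```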
825
826 Pattern Ranges
827 A pattern range consists of two expressions separated by a comma; in
828 this case, the action shall be performed for all records between a
829 match of the first expression and the following match of the second
830 expression, inclusive. At this point, the pattern range can be repeated
831 starting at input records subsequent to the end of the matched range.
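
A sketch of a pattern range that matches twice in the same input follows. The marker words start and stop and the variable name range_out are assumptions for illustration.

```sh
range_out=$(printf 'a\nstart\nb\nstop\nc\nstart\nd\nstop\ne\n' |
    awk '/start/,/stop/ { print NR ": " $0 }')
printf '%s\n' "$range_out"
```

Records 2 to 4 and 6 to 8 are selected, inclusive of both end points. Records 1, 5, and 9 fall outside any range.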
832
833 Actions
834 An action is a sequence of statements as shown in the grammar in Gram‐
835 mar. Any single statement can be replaced by a statement list enclosed
836 in curly braces. The application shall ensure that statements in a
837 statement list are separated by <newline> or <semicolon> characters.
838 Statements in a statement list shall be executed sequentially in the
839 order that they appear.
840
841 The expression acting as the conditional in an if statement shall be
842 evaluated and if it is non-zero or non-null, the following statement
843 shall be executed; otherwise, if else is present, the statement follow‐
844 ing the else shall be executed.
845
846 The if, while, do...while, for, break, and continue statements are
847 based on the ISO C standard (see Section 1.1.2, Concepts Derived from
848 the ISO C Standard), except that the Boolean expressions shall be
849 treated as described in Expressions in awk, and except in the case of:
850
851
852 for (variable in array)
853
854 which shall iterate, assigning each index of array to variable in an
855 unspecified order. The results of adding new elements to array within
856 such a for loop are undefined. If a break or continue statement occurs
857 outside of a loop, the behavior is undefined.
858
859 The delete statement shall remove an individual array element. Thus,
860 the following code deletes an entire array:
861
862
863 for (index in array)
864 delete array[index]
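
The iteration and deletion idioms above can be exercised with a short sketch. The array name count and the variable name loop_out are hypothetical. The example totals the elements rather than printing them one by one because the visiting order is unspecified.

```sh
loop_out=$(awk 'BEGIN {
    count["x"] = 1; count["y"] = 2; count["z"] = 3
    n = 0; total = 0
    for (k in count) { n++; total += count[k] }   # order of k is unspecified
    for (k in count) delete count[k]              # empties the array
    left = 0
    for (k in count) left++
    print n, total, left
}')
printf '%s\n' "$loop_out"
```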
865
866 The next statement shall cause all further processing of the current
867 input record to be abandoned. The behavior is undefined if a next
868 statement appears or is invoked in a BEGIN or END action.
869
870 The exit statement shall invoke all END actions in the order in which
871 they occur in the program source and then terminate the program without
872 reading further input. An exit statement inside an END action shall
873 terminate the program without further execution of END actions. If an
874 expression is specified in an exit statement, its numeric value shall
875 be the exit status of awk, unless subsequent errors are encountered or
876 a subsequent exit statement with an expression is executed.
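
A sketch of next and exit working together, with made-up input and the hypothetical shell variable names exit_out and status:

```sh
status=0
exit_out=$(printf '# comment\none\ntwo\nthree\n' | awk '
    /^#/        { next }        # abandon this record; later rules are skipped
    $0 == "two" { exit 3 }      # stop reading input; END actions still run
                { print "data:", $0 }
    END         { print "records read:", NR }') || status=$?
printf '%s\n(exit status %s)\n' "$exit_out" "$status"
```

The record "three" is never read, the END action still reports the three records read up to that point, and the expression given to exit becomes the exit status of awk.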
877
878 Output Statements
879 Both print and printf statements shall write to standard output by
880 default. The output shall be written to the location specified by out‐
881 put_redirection if one is supplied, as follows:
882
883
884 > expression
885 >> expression
886 | expression
887
888 In all cases, the expression shall be evaluated to produce a string
889 that is used as a pathname into which to write (for '>' or ">>") or as
890 a command to be executed (for '|'). Using the first two forms, if the
891 file of that name is not currently open, it shall be opened, creating
892 it if necessary and using the first form, truncating the file. The out‐
893 put then shall be appended to the file. As long as the file remains
894 open, subsequent calls in which expression evaluates to the same string
895 value shall simply append output to the file. The file remains open
896 until the close function (see Input/Output and General Functions) is
897 called with an expression that evaluates to the same string value.
898
899 The third form shall write output onto a stream piped to the input of a
900 command. The stream shall be created if no stream is currently open
901 with the value of expression as its command name. The stream created
902 shall be equivalent to one created by a call to the popen() function
903 defined in the System Interfaces volume of POSIX.1‐2017 with the value
904 of expression as the command argument and a value of w as the mode
905 argument. As long as the stream remains open, subsequent calls in which
906 expression evaluates to the same string value shall write output to the
907 existing stream. The stream shall remain open until the close function
908 (see Input/Output and General Functions) is called with an expression
909 that evaluates to the same string value. At that time, the stream
910 shall be closed as if by a call to the pclose() function defined in the
911 System Interfaces volume of POSIX.1‐2017.
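
The three redirection forms can be sketched as follows. This is an illustration only: it assumes the mktemp and sort utilities are available, and the names tmp, f, sorter, file_out, and pipe_out are made up for the example.

```sh
tmp=$(mktemp)
awk -v f="$tmp" 'BEGIN {
    print "one" > f        # first use opens the file and truncates it
    print "two" > f        # file is still open, so this line is appended
    close(f)
    print "three" >> f     # reopened in append mode after close
    close(f)
}'
file_out=$(cat "$tmp")
rm -f "$tmp"

pipe_out=$(awk 'BEGIN {
    sorter = "sort"
    print "b" | sorter     # the stream is created on first use
    print "a" | sorter     # same string value, so the same stream
    close(sorter)          # sort finishes and emits its output here
    print "sorted above"
}')
printf '%s\n' "$file_out" "$pipe_out"
```

Holding the command in a variable (sorter here) is one way to guarantee that close is called with exactly the same string value that opened the stream.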
912
913 As described in detail by the grammar in Grammar, these output state‐
914 ments shall take a <comma>-separated list of expressions referred to in
915 the grammar by the non-terminal symbols expr_list, print_expr_list, or
916 print_expr_list_opt. This list is referred to here as the expression
917 list, and each member is referred to as an expression argument.
918
919 The print statement shall write the value of each expression argument
920 onto the indicated output stream separated by the current output field
921 separator (see variable OFS above), and terminated by the output record
922 separator (see variable ORS above). All expression arguments shall be
923 taken as strings, being converted if necessary; this conversion shall
924 be as described in Expressions in awk, with the exception that the
925 printf format in OFMT shall be used instead of the value in CONVFMT.
926 An empty expression list shall stand for the whole input record ($0).
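
A sketch of how OFS, ORS, and OFMT affect print follows. The separator values and the variable name print_out are chosen for illustration.

```sh
print_out=$(printf 'x y\n' | awk '
    BEGIN { OFS = "-"; ORS = "!\n"; OFMT = "%.2f" }
    {
        print $1, $2          # arguments joined by OFS, terminated by ORS
        print 3.14159, 17     # non-integer values are converted using OFMT
        print                 # empty expression list: the whole record $0
    }')
printf '%s\n' "$print_out"
```

The last statement prints the record unchanged, with its original <space>, because OFS separates expression arguments and $0 here is a single argument that has not been rebuilt.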
927
928 The printf statement shall produce output based on a notation similar
929 to the File Format Notation used to describe file formats in this vol‐
930 ume of POSIX.1‐2017 (see the Base Definitions volume of POSIX.1‐2017,
931 Chapter 5, File Format Notation). Output shall be produced as speci‐
932 fied with the first expression argument as the string format and subse‐
933 quent expression arguments as the strings arg1 to argn, inclusive, with
934 the following exceptions:
935
936 1. The format shall be an actual character string rather than a graph‐
937 ical representation. Therefore, it cannot contain empty character
938 positions. The <space> in the format string, in any context other
939 than a flag of a conversion specification, shall be treated as an
940 ordinary character that is copied to the output.
941
2. If the character set contains a 'Δ' character and that character
appears in the format string, it shall be treated as an ordinary
character that is copied to the output.
945
946 3. The escape sequences beginning with a <backslash> character shall
947 be treated as sequences of ordinary characters that are copied to
948 the output. Note that these same sequences shall be interpreted
949 lexically by awk when they appear in literal strings, but they
950 shall not be treated specially by the printf statement.
951
952 4. A field width or precision can be specified as the '*' character
953 instead of a digit string. In this case the next argument from the
954 expression list shall be fetched and its numeric value taken as the
955 field width or precision.
956
957 5. The implementation shall not precede or follow output from the d or
958 u conversion specifier characters with <blank> characters not spec‐
959 ified by the format string.
960
961 6. The implementation shall not precede output from the o conversion
962 specifier character with leading zeros not specified by the format
963 string.
964
965 7. For the c conversion specifier character: if the argument has a
966 numeric value, the character whose encoding is that value shall be
967 output. If the value is zero or is not the encoding of any charac‐
968 ter in the character set, the behavior is undefined. If the argu‐
969 ment does not have a numeric value, the first character of the
970 string value shall be output; if the string does not contain any
971 characters, the behavior is undefined.
972
973 8. For each conversion specification that consumes an argument, the
974 next expression argument shall be evaluated. With the exception of
975 the c conversion specifier character, the value shall be converted
976 (according to the rules specified in Expressions in awk) to the
977 appropriate type for the conversion specification.
978
979 9. If there are insufficient expression arguments to satisfy all the
980 conversion specifications in the format string, the behavior is
981 undefined.
982
983 10. If any character sequence in the format string begins with a '%'
984 character, but does not form a valid conversion specification, the
985 behavior is unspecified.
986
987 Both print and printf can output at least {LINE_MAX} bytes.
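
A few of the printf rules above can be sketched as follows. The example assumes an ASCII-compatible character set, so that the encoding 65 is the letter A. The variable name printf_out is made up.

```sh
printf_out=$(awk 'BEGIN {
    # %c with a numeric argument outputs the character with that encoding;
    # with a string argument it outputs the first character of the string.
    printf "%5d|%-5s|%c%c|%o|%.3f", 42, "ab", 65, "xyz", 8, 2 / 3
    # printf does not append ORS; the newline must be written explicitly.
    printf "\n"
}')
printf '%s\n' "$printf_out"
```

The "\n" in the second statement is converted to a <newline> when awk processes the string literal, not by printf itself, which matches exception 3 above.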
988
989 Functions
990 The awk language has a variety of built-in functions: arithmetic,
991 string, input/output, and general.
992
993 Arithmetic Functions
994 The arithmetic functions, except for int, shall be based on the ISO C
995 standard (see Section 1.1.2, Concepts Derived from the ISO C Standard).
996 The behavior is undefined in cases where the ISO C standard specifies
997 that an error be returned or that the behavior is undefined. Although
998 the grammar (see Grammar) permits built-in functions to appear with no
999 arguments or parentheses, unless the argument or parentheses are indi‐
1000 cated as optional in the following list (by displaying them within the
1001 "[]" brackets), such use is undefined.
1002
1003 atan2(y,x)
1004 Return arctangent of y/x in radians in the range [-π,π].
1005
1006 cos(x) Return cosine of x, where x is in radians.
1007
1008 sin(x) Return sine of x, where x is in radians.
1009
1010 exp(x) Return the exponential function of x.
1011
1012 log(x) Return the natural logarithm of x.
1013
1014 sqrt(x) Return the square root of x.
1015
1016 int(x) Return the argument truncated to an integer. Truncation shall
1017 be toward 0 when x>0.
1018
1019 rand() Return a random number n, such that 0≤n<1.
1020
1021 srand([expr])
1022 Set the seed value for rand to expr or use the time of day if
1023 expr is omitted. The previous seed value shall be returned.
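
The arithmetic functions can be exercised with a short sketch. The variable names pi, prev, r, and arith_out are hypothetical. The value returned by rand is not printed because it depends on the implementation; only its range is checked.

```sh
arith_out=$(awk 'BEGIN {
    pi = atan2(0, -1)                 # arctangent of 0/-1 is pi
    printf "%.4f %.4f %.1f\n", pi, exp(1), sin(pi / 2) + cos(0)
    print int(3.9), sqrt(16), log(exp(2))
    srand(5)
    prev = srand(7)                   # returns the previous seed
    r = rand()
    print prev, (r >= 0 && r < 1)
}')
printf '%s\n' "$arith_out"
```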
1024
1025 String Functions
1026 The string functions in the following list shall be supported.
1027 Although the grammar (see Grammar) permits built-in functions to appear
1028 with no arguments or parentheses, unless the argument or parentheses
1029 are indicated as optional in the following list (by displaying them
1030 within the "[]" brackets), such use is undefined.
1031
1032 gsub(ere, repl[, in])
1033 Behave like sub (see below), except that it shall replace all
1034 occurrences of the regular expression (like the ed utility
1035 global substitute) in $0 or in the in argument, when speci‐
1036 fied.
1037
1038 index(s, t)
1039 Return the position, in characters, numbering from 1, in
1040 string s where string t first occurs, or zero if it does not
1041 occur at all.
1042
1043 length[([s])]
1044 Return the length, in characters, of its argument taken as a
1045 string, or of the whole record, $0, if there is no argument.
1046
1047 match(s, ere)
1048 Return the position, in characters, numbering from 1, in
1049 string s where the extended regular expression ere occurs, or
1050 zero if it does not occur at all. RSTART shall be set to the
1051 starting position (which is the same as the returned value),
1052 zero if no match is found; RLENGTH shall be set to the length
1053 of the matched string, -1 if no match is found.
1054
1055 split(s, a[, fs ])
1056 Split the string s into array elements a[1], a[2], ..., a[n],
1057 and return n. All elements of the array shall be deleted
1058 before the split is performed. The separation shall be done
1059 with the ERE fs or with the field separator FS if fs is not
1060 given. Each array element shall have a string value when cre‐
1061 ated and, if appropriate, the array element shall be consid‐
1062 ered a numeric string (see Expressions in awk). The effect
1063 of a null string as the value of fs is unspecified.
1064
1065 sprintf(fmt, expr, expr, ...)
1066 Format the expressions according to the printf format given
1067 by fmt and return the resulting string.
1068
1069 sub(ere, repl[, in ])
1070 Substitute the string repl in place of the first instance of
1071 the extended regular expression ERE in string in and return
1072 the number of substitutions. An <ampersand> ('&') appearing
1073 in the string repl shall be replaced by the string from in
1074 that matches the ERE. An <ampersand> preceded with a <back‐
1075 slash> shall be interpreted as the literal <ampersand> char‐
1076 acter. An occurrence of two consecutive <backslash> charac‐
1077 ters shall be interpreted as just a single literal <back‐
1078 slash> character. Any other occurrence of a <backslash> (for
1079 example, preceding any other character) shall be treated as a
1080 literal <backslash> character. Note that if repl is a string
1081 literal (the lexical token STRING; see Grammar), the handling
1082 of the <ampersand> character occurs after any lexical pro‐
1083 cessing, including any lexical <backslash>-escape sequence
1084 processing. If in is specified and it is not an lvalue (see
1085 Expressions in awk), the behavior is undefined. If in is
1086 omitted, awk shall use the current record ($0) in its place.
1087
1088 substr(s, m[, n ])
1089 Return the at most n-character substring of s that begins at
1090 position m, numbering from 1. If n is omitted, or if n speci‐
1091 fies more characters than are left in the string, the length
1092 of the substring shall be limited by the length of the string
1093 s.
1094
1095 tolower(s)
1096 Return a string based on the string s. Each character in s
1097 that is an uppercase letter specified to have a tolower map‐
1098 ping by the LC_CTYPE category of the current locale shall be
1099 replaced in the returned string by the lowercase letter spec‐
1100 ified by the mapping. Other characters in s shall be
1101 unchanged in the returned string.
1102
1103 toupper(s)
1104 Return a string based on the string s. Each character in s
1105 that is a lowercase letter specified to have a toupper map‐
1106 ping by the LC_CTYPE category of the current locale is
1107 replaced in the returned string by the uppercase letter spec‐
1108 ified by the mapping. Other characters in s are unchanged in
1109 the returned string.
1110
1111 All of the preceding functions that take ERE as a parameter expect a
1112 pattern or a string valued expression that is a regular expression as
1113 defined in Regular Expressions.
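
The string functions can be sketched together in one short program. The sample strings and the names s, t, u, v, parts, and str_out are assumptions for this example.

```sh
str_out=$(awk 'BEGIN {
    s = "foo=bar;baz=qux"
    print index(s, "bar"), length(s)            # position 5, 15 characters
    if (match(s, /[a-z]+;/))                    # leftmost match is "bar;"
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    n = split(s, parts, ";")
    print n, parts[1], parts[2]
    t = s
    c = gsub(/=/, " -> ", t)                    # every occurrence; returns the count
    print c, t
    u = "hello"; sub(/l+/, "[&]", u); print u   # & stands for the matched text
    v = "hello"; sub(/l+/, "\\&", v); print v   # \& is a literal ampersand
    print substr(s, 9), substr(s, 1, 3)
    print toupper("abc") tolower("DEF")
}')
printf '%s\n' "$str_out"
```

In the second sub call the replacement is written "\\&" because string-literal processing first reduces it to \&, which sub then interprets as a literal <ampersand>.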
1114
1115 Input/Output and General Functions
1116 The input/output and general functions are:
1117
1118 close(expression)
1119 Close the file or pipe opened by a print or printf statement
1120 or a call to getline with the same string-valued expression.
1121 The limit on the number of open expression arguments is
1122 implementation-defined. If the close was successful, the
1123 function shall return zero; otherwise, it shall return non-
1124 zero.
1125
1126 expression | getline [var]
1127 Read a record of input from a stream piped from the output of
1128 a command. The stream shall be created if no stream is cur‐
1129 rently open with the value of expression as its command name.
1130 The stream created shall be equivalent to one created by a
1131 call to the popen() function with the value of expression as
1132 the command argument and a value of r as the mode argument.
1133 As long as the stream remains open, subsequent calls in which
1134 expression evaluates to the same string value shall read sub‐
1135 sequent records from the stream. The stream shall remain open
1136 until the close function is called with an expression that
1137 evaluates to the same string value. At that time, the stream
1138 shall be closed as if by a call to the pclose() function. If
1139 var is omitted, $0 and NF shall be set; otherwise, var shall
1140 be set and, if appropriate, it shall be considered a numeric
1141 string (see Expressions in awk).
1142
1143 The getline operator can form ambiguous constructs when there
1144 are unparenthesized operators (including concatenate) to the
1145 left of the '|' (to the beginning of the expression contain‐
1146 ing getline). In the context of the '$' operator, '|' shall
1147 behave as if it had a lower precedence than '$'. The result
1148 of evaluating other operators is unspecified, and conforming
1149 applications shall parenthesize properly all such usages.
1150
1151 getline Set $0 to the next input record from the current input file.
1152 This form of getline shall set the NF, NR, and FNR variables.
1153
1154 getline var
1155 Set variable var to the next input record from the current
1156 input file and, if appropriate, var shall be considered a
1157 numeric string (see Expressions in awk). This form of get‐
1158 line shall set the FNR and NR variables.
1159
1160 getline [var] < expression
1161 Read the next record of input from a named file. The expres‐
1162 sion shall be evaluated to produce a string that is used as a
1163 pathname. If the file of that name is not currently open, it
1164 shall be opened. As long as the stream remains open, subse‐
1165 quent calls in which expression evaluates to the same string
1166 value shall read subsequent records from the file. The file
1167 shall remain open until the close function is called with an
1168 expression that evaluates to the same string value. If var is
1169 omitted, $0 and NF shall be set; otherwise, var shall be set
1170 and, if appropriate, it shall be considered a numeric string
1171 (see Expressions in awk).
1172
1173 The getline operator can form ambiguous constructs when there
1174 are unparenthesized binary operators (including concatenate)
1175 to the right of the '<' (up to the end of the expression con‐
1176 taining the getline). The result of evaluating such a con‐
1177 struct is unspecified, and conforming applications shall
1178 parenthesize properly all such usages.
1179
1180 system(expression)
1181 Execute the command given by expression in a manner equiva‐
1182 lent to the system() function defined in the System Inter‐
1183 faces volume of POSIX.1‐2017 and return the exit status of
1184 the command.
1185
1186 All forms of getline shall return 1 for successful input, zero for end-
1187 of-file, and -1 for an error.
1188
Where strings are used as the name of a file or pipeline, the application
shall ensure that the strings are textually identical. The terminology
"same string value" implies that "equivalent strings", even those that
differ only by <space> characters, represent different files.
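
The getline forms, close, and system can be sketched as follows. The example assumes the mktemp utility is available, and the names tmp, f, cmd, line, word, sum, rc, and io_out are made up for illustration.

```sh
tmp=$(mktemp)
printf '1\n2\n3\n' > "$tmp"
io_out=$(awk -v f="$tmp" 'BEGIN {
    # Parenthesize the getline expression, then test for a result above 0
    # so that both end-of-file (0) and an error (-1) end the loop.
    while ((getline line < f) > 0)
        sum += line
    close(f)

    cmd = "echo hello"
    if ((cmd | getline word) > 0)
        print word
    close(cmd)                 # same string value that opened the stream

    print sum
    rc = system("exit 3")      # returns the exit status of the command
    print rc
}')
rm -f "$tmp"
printf '%s\n' "$io_out"
```

Testing for "> 0" rather than for a non-zero result matters: a loop written as while (getline line < f) would never terminate on an unreadable file, because -1 is treated as true.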
1194
1195 User-Defined Functions
1196 The awk language also provides user-defined functions. Such functions
1197 can be defined as:
1198
1199
1200 function name([parameter, ...]) { statements }
1201
1202 A function can be referred to anywhere in an awk program; in particu‐
1203 lar, its use can precede its definition. The scope of a function is
1204 global.
1205
1206 Function parameters, if present, can be either scalars or arrays; the
1207 behavior is undefined if an array name is passed as a parameter that
1208 the function uses as a scalar, or if a scalar expression is passed as a
1209 parameter that the function uses as an array. Function parameters shall
1210 be passed by value if scalar and by reference if array name.
1211
1212 The number of parameters in the function definition need not match the
1213 number of parameters in the function call. Excess formal parameters can
1214 be used as local variables. If fewer arguments are supplied in a func‐
1215 tion call than are in the function definition, the extra parameters
1216 that are used in the function body as scalars shall evaluate to the
1217 uninitialized value until they are otherwise initialized, and the extra
1218 parameters that are used in the function body as arrays shall be
1219 treated as uninitialized arrays where each element evaluates to the
1220 uninitialized value until otherwise initialized.
1221
1222 When invoking a function, no white space can be placed between the
1223 function name and the opening parenthesis. Function calls can be nested
1224 and recursive calls can be made upon functions. Upon return from any
1225 nested or recursive function call, the values of all of the calling
1226 function's parameters shall be unchanged, except for array parameters
1227 passed by reference. The return statement can be used to return a
1228 value. If a return statement appears outside of a function definition,
1229 the behavior is undefined.
1230
1231 In the function definition, <newline> characters shall be optional
1232 before the opening brace and after the closing brace. Function defini‐
1233 tions can appear anywhere in the program where a pattern-action pair is
1234 allowed.
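
The parameter-passing rules above can be sketched as follows. The function names fact, fill, and bump, and the variable names in the example, are hypothetical.

```sh
func_out=$(awk '
    BEGIN {
        fill(sq, 3)            # use may precede the definition
        v = 1
        w = bump(v)
        print fact(5), sq[3], v, w, (i == "")
    }
    function fact(n) { return n <= 1 ? 1 : n * fact(n - 1) }   # recursion
    # The extra formal parameter i is never passed; it serves as a local.
    function fill(arr, n,    i) { for (i = 1; i <= n; i++) arr[i] = i * i }
    function bump(x) { x++; return x }   # scalar passed by value
')
printf '%s\n' "$func_out"
```

The array sq is modified through the reference passed to fill, the scalar v keeps its value after bump returns, and the global i is still uninitialized because the i inside fill is local to that function.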
1235
1236 Grammar
1237 The grammar in this section and the lexical conventions in the follow‐
1238 ing section shall together describe the syntax for awk programs. The
1239 general conventions for this style of grammar are described in Section
1240 1.3, Grammar Conventions. A valid program can be represented as the
1241 non-terminal symbol program in the grammar. This formal syntax shall
1242 take precedence over the preceding text syntax description.
1243
1244
1245 %token NAME NUMBER STRING ERE
1246 %token FUNC_NAME /* Name followed by '(' without white space. */
1247
1248 /* Keywords */
1249 %token Begin End
1250 /* 'BEGIN' 'END' */
1251
1252 %token Break Continue Delete Do Else
1253 /* 'break' 'continue' 'delete' 'do' 'else' */
1254
1255 %token Exit For Function If In
1256 /* 'exit' 'for' 'function' 'if' 'in' */
1257
1258 %token Next Print Printf Return While
1259 /* 'next' 'print' 'printf' 'return' 'while' */
1260
1261 /* Reserved function names */
1262 %token BUILTIN_FUNC_NAME
1263 /* One token for the following:
1264 * atan2 cos sin exp log sqrt int rand srand
1265 * gsub index length match split sprintf sub
1266 * substr tolower toupper close system
1267 */
1268 %token GETLINE
1269 /* Syntactically different from other built-ins. */
1270
1271 /* Two-character tokens. */
1272 %token ADD_ASSIGN SUB_ASSIGN MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN POW_ASSIGN
1273 /* '+=' '-=' '*=' '/=' '%=' '^=' */
1274
1275 %token OR AND NO_MATCH EQ LE GE NE INCR DECR APPEND
1276 /* '||' '&&' '!~' '==' '<=' '>=' '!=' '++' '--' '>>' */
1277
1278 /* One-character tokens. */
1279 %token '{' '}' '(' ')' '[' ']' ',' ';' NEWLINE
1280 %token '+' '-' '*' '%' '^' '!' '>' '<' '|' '?' ':' '~' '$' '='
1281
1282 %start program
1283 %%
1284
1285 program : item_list
1286 | item_list item
1287 ;
1288
1289 item_list : /* empty */
1290 | item_list item terminator
1291 ;
1292
1293 item : action
1294 | pattern action
1295 | normal_pattern
1296 | Function NAME '(' param_list_opt ')'
1297 newline_opt action
1298 | Function FUNC_NAME '(' param_list_opt ')'
1299 newline_opt action
1300 ;
1301
1302 param_list_opt : /* empty */
1303 | param_list
1304 ;
1305
1306 param_list : NAME
1307 | param_list ',' NAME
1308 ;
1309
1310 pattern : normal_pattern
1311 | special_pattern
1312 ;
1313
1314 normal_pattern : expr
1315 | expr ',' newline_opt expr
1316 ;
1317
1318 special_pattern : Begin
1319 | End
1320 ;
1321
1322 action : '{' newline_opt '}'
1323 | '{' newline_opt terminated_statement_list '}'
1324 | '{' newline_opt unterminated_statement_list '}'
1325 ;
1326
1327 terminator : terminator NEWLINE
1328 | ';'
1329 | NEWLINE
1330 ;
1331
1332 terminated_statement_list : terminated_statement
1333 | terminated_statement_list terminated_statement
1334 ;
1335
1336 unterminated_statement_list : unterminated_statement
1337 | terminated_statement_list unterminated_statement
1338 ;
1339
1340 terminated_statement : action newline_opt
1341 | If '(' expr ')' newline_opt terminated_statement
1342 | If '(' expr ')' newline_opt terminated_statement
1343 Else newline_opt terminated_statement
1344 | While '(' expr ')' newline_opt terminated_statement
1345 | For '(' simple_statement_opt ';'
1346 expr_opt ';' simple_statement_opt ')' newline_opt
1347 terminated_statement
1348 | For '(' NAME In NAME ')' newline_opt
1349 terminated_statement
1350 | ';' newline_opt
1351 | terminatable_statement NEWLINE newline_opt
1352 | terminatable_statement ';' newline_opt
1353 ;
1354
1355 unterminated_statement : terminatable_statement
1356 | If '(' expr ')' newline_opt unterminated_statement
1357 | If '(' expr ')' newline_opt terminated_statement
1358 Else newline_opt unterminated_statement
1359 | While '(' expr ')' newline_opt unterminated_statement
1360 | For '(' simple_statement_opt ';'
1361 expr_opt ';' simple_statement_opt ')' newline_opt
1362 unterminated_statement
1363 | For '(' NAME In NAME ')' newline_opt
1364 unterminated_statement
1365 ;
1366
1367 terminatable_statement : simple_statement
1368 | Break
1369 | Continue
1370 | Next
1371 | Exit expr_opt
1372 | Return expr_opt
1373 | Do newline_opt terminated_statement While '(' expr ')'
1374 ;
1375
1376 simple_statement_opt : /* empty */
1377 | simple_statement
1378 ;
1379
1380 simple_statement : Delete NAME '[' expr_list ']'
1381 | expr
1382 | print_statement
1383 ;
1384
1385 print_statement : simple_print_statement
1386 | simple_print_statement output_redirection
1387 ;
1388
1389 simple_print_statement : Print print_expr_list_opt
1390 | Print '(' multiple_expr_list ')'
1391 | Printf print_expr_list
1392 | Printf '(' multiple_expr_list ')'
1393 ;
1394
1395 output_redirection : '>' expr
1396 | APPEND expr
1397 | '|' expr
1398 ;
1399
1400 expr_list_opt : /* empty */
1401 | expr_list
1402 ;
1403
1404 expr_list : expr
1405 | multiple_expr_list
1406 ;
1407
1408 multiple_expr_list : expr ',' newline_opt expr
1409 | multiple_expr_list ',' newline_opt expr
1410 ;
1411
1412 expr_opt : /* empty */
1413 | expr
1414 ;
1415
1416 expr : unary_expr
1417 | non_unary_expr
1418 ;
1419
1420 unary_expr : '+' expr
1421 | '-' expr
1422 | unary_expr '^' expr
1423 | unary_expr '*' expr
1424 | unary_expr '/' expr
1425 | unary_expr '%' expr
1426 | unary_expr '+' expr
1427 | unary_expr '-' expr
1428 | unary_expr non_unary_expr
1429 | unary_expr '<' expr
1430 | unary_expr LE expr
1431 | unary_expr NE expr
1432 | unary_expr EQ expr
1433 | unary_expr '>' expr
1434 | unary_expr GE expr
1435 | unary_expr '~' expr
1436 | unary_expr NO_MATCH expr
1437 | unary_expr In NAME
1438 | unary_expr AND newline_opt expr
1439 | unary_expr OR newline_opt expr
1440 | unary_expr '?' expr ':' expr
1441 | unary_input_function
1442 ;
1443
1444 non_unary_expr : '(' expr ')'
1445 | '!' expr
1446 | non_unary_expr '^' expr
1447 | non_unary_expr '*' expr
1448 | non_unary_expr '/' expr
1449 | non_unary_expr '%' expr
1450 | non_unary_expr '+' expr
1451 | non_unary_expr '-' expr
1452 | non_unary_expr non_unary_expr
1453 | non_unary_expr '<' expr
1454 | non_unary_expr LE expr
1455 | non_unary_expr NE expr
1456 | non_unary_expr EQ expr
1457 | non_unary_expr '>' expr
1458 | non_unary_expr GE expr
1459 | non_unary_expr '~' expr
1460 | non_unary_expr NO_MATCH expr
1461 | non_unary_expr In NAME
1462 | '(' multiple_expr_list ')' In NAME
1463 | non_unary_expr AND newline_opt expr
1464 | non_unary_expr OR newline_opt expr
1465 | non_unary_expr '?' expr ':' expr
1466 | NUMBER
1467 | STRING
1468 | lvalue
1469 | ERE
1470 | lvalue INCR
1471 | lvalue DECR
1472 | INCR lvalue
1473 | DECR lvalue
1474 | lvalue POW_ASSIGN expr
1475 | lvalue MOD_ASSIGN expr
1476 | lvalue MUL_ASSIGN expr
1477 | lvalue DIV_ASSIGN expr
1478 | lvalue ADD_ASSIGN expr
1479 | lvalue SUB_ASSIGN expr
1480 | lvalue '=' expr
1481 | FUNC_NAME '(' expr_list_opt ')'
1482 /* no white space allowed before '(' */
1483 | BUILTIN_FUNC_NAME '(' expr_list_opt ')'
1484 | BUILTIN_FUNC_NAME
1485 | non_unary_input_function
1486 ;
1487
1488 print_expr_list_opt : /* empty */
1489 | print_expr_list
1490 ;
1491
1492 print_expr_list : print_expr
1493 | print_expr_list ',' newline_opt print_expr
1494 ;
1495
1496 print_expr : unary_print_expr
1497 | non_unary_print_expr
1498 ;
1499
1500 unary_print_expr : '+' print_expr
1501 | '-' print_expr
1502 | unary_print_expr '^' print_expr
1503 | unary_print_expr '*' print_expr
1504 | unary_print_expr '/' print_expr
1505 | unary_print_expr '%' print_expr
1506 | unary_print_expr '+' print_expr
1507 | unary_print_expr '-' print_expr
1508 | unary_print_expr non_unary_print_expr
1509 | unary_print_expr '~' print_expr
1510 | unary_print_expr NO_MATCH print_expr
1511 | unary_print_expr In NAME
1512 | unary_print_expr AND newline_opt print_expr
1513 | unary_print_expr OR newline_opt print_expr
1514 | unary_print_expr '?' print_expr ':' print_expr
1515 ;
1516
1517 non_unary_print_expr : '(' expr ')'
1518 | '!' print_expr
1519 | non_unary_print_expr '^' print_expr
1520 | non_unary_print_expr '*' print_expr
1521 | non_unary_print_expr '/' print_expr
1522 | non_unary_print_expr '%' print_expr
1523 | non_unary_print_expr '+' print_expr
1524 | non_unary_print_expr '-' print_expr
1525 | non_unary_print_expr non_unary_print_expr
1526 | non_unary_print_expr '~' print_expr
1527 | non_unary_print_expr NO_MATCH print_expr
1528 | non_unary_print_expr In NAME
1529 | '(' multiple_expr_list ')' In NAME
1530 | non_unary_print_expr AND newline_opt print_expr
1531 | non_unary_print_expr OR newline_opt print_expr
1532 | non_unary_print_expr '?' print_expr ':' print_expr
1533 | NUMBER
1534 | STRING
1535 | lvalue
1536 | ERE
1537 | lvalue INCR
1538 | lvalue DECR
1539 | INCR lvalue
1540 | DECR lvalue
1541 | lvalue POW_ASSIGN print_expr
1542 | lvalue MOD_ASSIGN print_expr
1543 | lvalue MUL_ASSIGN print_expr
1544 | lvalue DIV_ASSIGN print_expr
1545 | lvalue ADD_ASSIGN print_expr
1546 | lvalue SUB_ASSIGN print_expr
1547 | lvalue '=' print_expr
1548 | FUNC_NAME '(' expr_list_opt ')'
1549 /* no white space allowed before '(' */
1550 | BUILTIN_FUNC_NAME '(' expr_list_opt ')'
1551 | BUILTIN_FUNC_NAME
1552 ;
1553
1554 lvalue : NAME
1555 | NAME '[' expr_list ']'
1556 | '$' expr
1557 ;
1558
1559 non_unary_input_function : simple_get
1560 | simple_get '<' expr
1561 | non_unary_expr '|' simple_get
1562 ;
1563
1564 unary_input_function : unary_expr '|' simple_get
1565 ;
1566
1567 simple_get : GETLINE
1568 | GETLINE lvalue
1569 ;
1570
1571 newline_opt : /* empty */
1572 | newline_opt NEWLINE
1573 ;
1574
1575 This grammar has several ambiguities that shall be resolved as follows:
1576
1577 * Operator precedence and associativity shall be as described in Ta‐
1578 ble 4-1, Expressions in Decreasing Precedence in awk.
1579
1580 * In case of ambiguity, an else shall be associated with the most
1581 immediately preceding if that would satisfy the grammar.
1582
1583 * In some contexts, a <slash> ('/') that is used to surround an ERE
1584 could also be the division operator. This shall be resolved in
1585 such a way that wherever the division operator could appear, a
1586 <slash> is assumed to be the division operator. (There is no unary
1587 division operator.)
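
The else rule above can be sketched with a nested if written without braces. This is a minimal illustration, not part of the standard; the variable names x, y, and out are hypothetical. If the else were attached to the outer if, nothing would be written, because x is true.

```shell
# Dangling else: the else pairs with the nearest preceding if.
# x, y and out are hypothetical names chosen for illustration.
out=$(awk 'BEGIN {
    x = 1; y = 0
    if (x)
        if (y) print "inner-then"
        else print "inner-else"
}')
echo "$out"
```

Adding braces around the inner if is the usual way to attach the else to the outer if instead.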
1588
1589 Each expression in an awk program shall conform to the precedence and
1590 associativity rules, even when this is not needed to resolve an ambigu‐
1591 ity. For example, because '$' has higher precedence than '++', the
1592 string "$x++--" is not a valid awk expression, even though it is unam‐
1593 biguously parsed by the grammar as "$(x++)--".
1594
1595 One convention that might not be obvious from the formal grammar is
1596 where <newline> characters are acceptable. There are several obvious
1597 placements such as terminating a statement, and a <backslash> can be
1598 used to escape <newline> characters between any lexical tokens. In
1599 addition, <newline> characters without <backslash> characters can follow
1600 a comma, an open brace, a logical AND operator ("&&"), a logical OR
1601 operator ("||"), the do keyword, the else keyword, and the closing
1602 parenthesis of an if, for, or while statement. For example:
1603
1604
1605 { print $1,
1606 $2 }
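
A slightly longer sketch of the same convention, assuming a POSIX shell and illustrative variable names (a, b, out): unescaped newlines follow an open brace, "&&", the closing parenthesis of an if, a comma, and else, while a <backslash> joins the two halves of the addition.

```shell
# Newlines after "{", "&&", ")" of an if, a comma, and else need no
# backslash; the break inside "a + b" does need one.
out=$(awk 'BEGIN {
    a = 1; b = 2
    if (a == 1 &&
        b == 2)
        print "sum",
              a + \
              b
    else
        print "no match"
}')
echo "$out"
```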
1607
1608 Lexical Conventions
1609 The lexical conventions for awk programs, with respect to the preceding
1610 grammar, shall be as follows:
1611
1612 1. Except as noted, awk shall recognize the longest possible token or
1613 delimiter beginning at a given point.
1614
1615 2. A comment shall consist of any characters beginning with the <num‐
1616 ber-sign> character and terminated by, but excluding the next
1617 occurrence of, a <newline>. Comments shall have no effect, except
1618 to delimit lexical tokens.
1619
1620 3. The <newline> shall be recognized as the token NEWLINE.
1621
1622 4. A <backslash> character immediately followed by a <newline> shall
1623 have no effect.
1624
1625 5. The token STRING shall represent a string constant. A string con‐
1626 stant shall begin with the character '"'. Within a string con‐
1627 stant, a <backslash> character shall be considered to begin an
1628 escape sequence as specified in the table in the Base Definitions
1629 volume of POSIX.1‐2017, Chapter 5, File Format Notation ('\\',
1630 '\a', '\b', '\f', '\n', '\r', '\t', '\v'). In addition, the escape
1631 sequences in Table 4-2, Escape Sequences in awk shall be recog‐
1632 nized. A <newline> shall not occur within a string constant. A
1633 string constant shall be terminated by the first unescaped occur‐
1634 rence of the character '"' after the one that begins the string
1635 constant. The value of the string shall be the sequence of all
1636 unescaped characters and values of escape sequences between, but
1637 not including, the two delimiting '"' characters.
1638
1639 6. The token ERE represents an extended regular expression constant.
1640 An ERE constant shall begin with the <slash> character. Within an
1641 ERE constant, a <backslash> character shall be considered to begin
1642 an escape sequence as specified in the table in the Base Defini‐
1643 tions volume of POSIX.1‐2017, Chapter 5, File Format Notation. In
1644 addition, the escape sequences in Table 4-2, Escape Sequences in
1645 awk shall be recognized. The application shall ensure that a <new‐
1646 line> does not occur within an ERE constant. An ERE constant shall
1647 be terminated by the first unescaped occurrence of the <slash>
1648 character after the one that begins the ERE constant. The extended
1649 regular expression represented by the ERE constant shall be the
1650 sequence of all unescaped characters and values of escape sequences
1651 between, but not including, the two delimiting <slash> characters.
1652
1653 7. A <blank> shall have no effect, except to delimit lexical tokens or
1654 within STRING or ERE tokens.
1655
1656 8. The token NUMBER shall represent a numeric constant. Its form and
1657 numeric value shall either be equivalent to the decimal-floating-
1658 constant token as specified by the ISO C standard, or it shall be a
1659 sequence of decimal digits and shall be evaluated as an integer
1660 constant in decimal. In addition, implementations may accept
1661 numeric constants with the form and numeric value equivalent to the
1662 hexadecimal-constant and hexadecimal-floating-constant tokens as
1663 specified by the ISO C standard.
1664
1665 If the value is too large or too small to be representable (see
1666 Section 1.1.2, Concepts Derived from the ISO C Standard), the
1667 behavior is undefined.
1668
1669 9. A sequence of underscores, digits, and alphabetics from the porta‐
1670 ble character set (see the Base Definitions volume of POSIX.1‐2017,
1671 Section 6.1, Portable Character Set), beginning with an <under‐
1672 score> or alphabetic character, shall be considered a word.
1673
1674 10. The following words are keywords that shall be recognized as indi‐
1675 vidual tokens; the name of the token is the same as the keyword:
1676
1677 BEGIN delete END function in printf
1678 break do exit getline next return
1679 continue else for if print while
1680
1681 11. The following words are names of built-in functions and shall be
1682 recognized as the token BUILTIN_FUNC_NAME:
1683
1684 atan2 gsub log split sub toupper
1685 close index match sprintf substr
1686 cos int rand sqrt system
1687 exp length sin srand tolower
1688
1689 The above-listed keywords and names of built-in functions are con‐
1690 sidered reserved words.
1691
1692 12. The token NAME shall consist of a word that is not a keyword or a
1693 name of a built-in function and is not followed immediately (with‐
1694 out any delimiters) by the '(' character.
1695
1696 13. The token FUNC_NAME shall consist of a word that is not a keyword
1697 or a name of a built-in function, followed immediately (without any
1698 delimiters) by the '(' character. The '(' character shall not be
1699 included as part of the token.
1700
1701 14. The following two-character sequences shall be recognized as the
1702 named tokens:
1703
1704 ┌───────────┬──────────┬────────────┬──────────┐
1705 │Token Name │ Sequence │ Token Name │ Sequence │
1706 ├───────────┼──────────┼────────────┼──────────┤
1707 │ADD_ASSIGN │ += │ NO_MATCH │ !~ │
1708 │SUB_ASSIGN │ -= │ EQ │ == │
1709 │MUL_ASSIGN │ *= │ LE │ <= │
1710 │DIV_ASSIGN │ /= │ GE │ >= │
1711 │MOD_ASSIGN │ %= │ NE │ != │
1712 │POW_ASSIGN │ ^= │ INCR │ ++ │
1713 │OR │ || │ DECR │ -- │
1714 │AND │ && │ APPEND │ >> │
1715 └───────────┴──────────┴────────────┴──────────┘
1716 15. The following single characters shall be recognized as tokens whose
1717 names are the character:
1718
1719
1720 <newline> { } ( ) [ ] , ; + - * % ^ ! > < | ? : ~ $ =
1721
1722 There is a lexical ambiguity between the token ERE and the tokens '/'
1723 and DIV_ASSIGN. When an input sequence begins with a <slash> character
1724 in any syntactic context where the token '/' or DIV_ASSIGN could appear
1725 as the next token in a valid program, the longer of those two tokens
1726 that can be recognized shall be recognized. In any other syntactic con‐
1727 text where the token ERE could appear as the next token in a valid pro‐
1728 gram, the token ERE shall be recognized.
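
The two readings of a <slash> can be sketched as follows; the input data and the variable names quotient and matched are made up for illustration. After an operand such as $1 a division operator could appear, so the <slash> is taken as division; at the start of a pattern it could not, so the <slash> begins an ERE.

```shell
# After an operand, a slash is the division operator: 8 / 2 / 2.
quotient=$(echo "8 2" | awk '{ print $1 / $2 / 2 }')
# At the start of a pattern no operand precedes, so /2/ is an ERE.
matched=$(printf '12\n34\n' | awk '/2/ { print }')
echo "$quotient $matched"
```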
1729
1731 The following exit values shall be returned:
1732
1733 0 All input files were processed successfully.
1734
1735 >0 An error occurred.
1736
1737 The exit status can be altered within the program by using an exit
1738 expression.
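
As a minimal sketch of altering the exit status from within the program, the following fragment (the value 3 and the variable name status are arbitrary choices) captures the status that awk returns to the shell:

```shell
# exit with an expression sets the exit status of awk.
status=0
awk 'BEGIN { exit 3 }' || status=$?
echo "$status"
```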
1739
1741 If any file operand is specified and the named file cannot be accessed,
1742 awk shall write a diagnostic message to standard error and terminate
1743 without any further action.
1744
1745 If the program specified by either the program operand or a progfile
1746 operand is not a valid awk program (as specified in the EXTENDED
1747 DESCRIPTION section), the behavior is undefined.
1748
1749 The following sections are informative.
1750
1752 The index, length, match, and substr functions should not be confused
1753 with similar functions in the ISO C standard; the awk versions deal
1754 with characters, while the ISO C functions deal with bytes.
1755
1756 Because the concatenation operation is represented by adjacent expres‐
1757 sions rather than an explicit operator, it is often necessary to use
1758 parentheses to enforce the proper evaluation precedence.
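
The effect of parentheses on concatenation can be sketched as follows (the operands and variable names are illustrative only). Addition binds more tightly than concatenation, so without parentheses the sum is formed first; with parentheses the concatenated string "1 2" is formed first and is then converted to the number 1 for the addition.

```shell
# Without parentheses: 1, a space, and (2 + 3) are concatenated.
unparenthesized=$(awk 'BEGIN { print 1 " " 2 + 3 }')
# With parentheses: the string "1 2" is built first, then used as a number.
parenthesized=$(awk 'BEGIN { print (1 " " 2) + 3 }')
echo "$unparenthesized"
echo "$parenthesized"
```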
1759
1760 When using awk to process pathnames, it is recommended that LC_ALL, or
1761 at least LC_CTYPE and LC_COLLATE, are set to POSIX or C in the environ‐
1762 ment, since pathnames can contain byte sequences that do not form valid
1763 characters in some locales, in which case the utility's behavior would
1764 be undefined. In the POSIX locale each byte is a valid single-byte
1765 character, and therefore this problem is avoided.
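
One way to follow this recommendation, sketched under the assumption of a POSIX shell, is to set LC_ALL for the awk invocation only. The pathname below is a made-up example.

```shell
# Run awk in the C (POSIX) locale so that every byte of a pathname is a
# valid single-byte character. The pathname is hypothetical.
base=$(printf '%s\n' '/tmp/example dir/report.txt' |
    LC_ALL=C awk -F/ '{ print $NF }')
echo "$base"
```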
1766
1767 On implementations where the "==" operator checks if strings collate
1768 equally, applications needing to check whether strings are identical
1769 can use:
1770
1771
1772 length(a) == length(b) && index(a,b) == 1
1773
1774 On implementations where the "==" operator checks if strings are iden‐
1775 tical, applications needing to check whether strings collate equally
1776 can use:
1777
1778
1779 a <= b && a >= b
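
The identity test can be wrapped in a user-defined function, as in the sketch below. The function name identical is hypothetical, and the example is run in the C locale only to keep the result predictable.

```shell
# identical() is a hypothetical helper wrapping the length/index test.
out=$(LC_ALL=C awk '
function identical(a, b) {
    return length(a) == length(b) && index(a, b) == 1
}
BEGIN {
    print identical("abc", "abc"), identical("abc", "abd"), identical("abc", "ab")
}')
echo "$out"
```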
1780
1782 The awk program specified in the command line is most easily specified
1783 within single-quotes (for example, 'program') for applications using
1784 sh, because awk programs commonly contain characters that are special
1785 to the shell, including double-quotes. In the cases where an awk pro‐
1786 gram contains single-quote characters, it is usually easiest to specify
1787 most of the program as strings within single-quotes concatenated by the
1788 shell with quoted single-quote characters. For example:
1789
1790
1791 awk '/'\''/ { print "quote:", $0 }'
1792
1793 prints all lines from the standard input containing a single-quote
1794 character, prefixed with quote:.
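
The command can be tried with a small amount of input, as in this sketch (the two input lines are made up). The shell joins the three quoted pieces into one program text before awk sees it.

```shell
# Only the first input line contains a single-quote character.
out=$(printf '%s\n' "it's here" "plain line" |
    awk '/'\''/ { print "quote:", $0 }')
echo "$out"
```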
1795
1796 The following are examples of simple awk programs:
1797
1798 1. Write to the standard output all input lines for which field 3 is
1799 greater than 5:
1800
1801
1802 $3 > 5
1803
1804 2. Write every tenth line:
1805
1806
1807 (NR % 10) == 0
1808
1809 3. Write any line with a substring matching the regular expression:
1810
1811
1812 /(G|D)(2[0-9][[:alpha:]]*)/
1813
1814 4. Write any line with a substring containing a 'G' or 'D', followed
1815 by a sequence of digits and alphabetic characters. This example uses character
1816 classes digit and alpha to match language-independent digit and
1817 alphabetic characters respectively:
1818
1819
1820 /(G|D)([[:digit:][:alpha:]]*)/
1821
1822 5. Write any line in which the second field matches the regular
1823 expression and the fourth field does not:
1824
1825
1826 $2 ~ /xyz/ && $4 !~ /xyz/
1827
1828 6. Write any line in which the second field contains a <backslash>:
1829
1830
1831 $2 ~ /\\/
1832
1833 7. Write any line in which the second field contains a <backslash>.
1834 Note that <backslash>-escapes are interpreted twice; once in lexi‐
1835 cal processing of the string and once in processing the regular
1836 expression:
1837
1838
1839 $2 ~ "\\\\"
1840
1841 8. Write the second-to-last and the last field in each line.
1842 Separate the fields with a <colon>:
1843
1844
1845 {OFS=":";print $(NF-1), $NF}
1846
1847 9. Write the line number and number of fields in each line. The three
1848 strings representing the line number, the <colon>, and the number
1849 of fields are concatenated and that string is written to standard
1850 output:
1851
1852
1853 {print NR ":" NF}
1854
1855 10. Write lines longer than 72 characters:
1856
1857
1858 length($0) > 72
1859
1860 11. Write the first two fields in opposite order separated by OFS:
1861
1862
1863 { print $2, $1 }
1864
1865 12. Same, with input fields separated by a <comma>, by <space> and <tab>
1866 characters, or by both:
1867
1868
1869 BEGIN { FS = ",[ \t]*|[ \t]+" }
1870 { print $2, $1 }
1871
1872 13. Add up the first column, print sum, and average:
1873
1874
1875 {s += $1 }
1876 END {print "sum is", s, "average is", s/NR}
1877
1878 14. Write fields in reverse order, one per line (many lines out for
1879 each line in):
1880
1881
1882 { for (i = NF; i > 0; --i) print $i }
1883
1884 15. Write all lines between occurrences of the strings start and stop:
1885
1886
1887 /start/, /stop/
1888
1889 16. Write all lines whose first field is different from the previous
1890 one:
1891
1892
1893 $1 != prev { print; prev = $1 }
1894
1895 17. Simulate echo:
1896
1897
1898 BEGIN {
1899 for (i = 1; i < ARGC; ++i)
1900 printf("%s%s", ARGV[i], i==ARGC-1?"\n":" ")
1901 }
1902
1903 18. Write the path prefixes contained in the PATH environment variable,
1904 one per line:
1905
1906
1907 BEGIN {
1908 n = split (ENVIRON["PATH"], path, ":")
1909 for (i = 1; i <= n; ++i)
1910 print path[i]
1911 }
1912
1913 19. If there is a file named input containing page headers of the form:
1914 Page #
1915
1916 and a file named program that contains:
1917
1918
1919 /Page/ { $2 = n++; }
1920 { print }
1921
1922 then the command line:
1923
1924
1925 awk -f program n=5 input
1926
1927 prints the file input, filling in page numbers starting at 5.
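
The last example can be reproduced with scratch files, as sketched below. The file names input and program come from the example; the temporary directory, the body lines, and the variable names dir and out are assumptions made for illustration.

```shell
# Scratch files; the names mirror the example, the contents are made up.
dir=$(mktemp -d)
printf '%s\n' 'Page #' 'first body line' 'Page #' 'second body line' > "$dir/input"
printf '%s\n' '/Page/ { $2 = n++; }' '{ print }' > "$dir/program"

# n=5 is an assignment operand; it takes effect before input is read.
out=$(cd "$dir" && awk -f program n=5 input)
rm -rf "$dir"
echo "$out"
```

Each header line has its second field replaced by the current value of n, which is then incremented; all other lines are written unchanged.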
1928
1930 This description is based on the new awk, "nawk" (see the referenced
1931 The AWK Programming Language), which introduced a number of new fea‐
1932 tures to the historical awk:
1933
1934 1. New keywords: delete, do, function, return
1935
1936 2. New built-in functions: atan2, close, cos, gsub, match, rand, sin,
1937 srand, sub, system
1938
1939 3. New predefined variables: FNR, ARGC, ARGV, RSTART, RLENGTH, SUBSEP
1940
1941 4. New expression operators: ?, :, ,, ^
1942
1943 5. The FS variable and the third argument to split, now treated as
1944 extended regular expressions.
1945
1946 6. The operator precedence, changed to more closely match the C lan‐
1947 guage. Two examples of code that operate differently are:
1948
1949
1950 while ( n /= 10 > 1) ...
1951 if (!"wk" ~ /bwk/) ...
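
The two possible groupings of each example can be made explicit with parentheses, as in the sketch below. The first grouping of each pair follows the current precedence rules, in which assignment has the lowest precedence and unary '!' binds more tightly than '~'. The second grouping is shown for contrast and is an assumption about how the historical awk read these lines. The variable names r, m1, and m2 are illustrative.

```shell
out=$(awk 'BEGIN {
    # Current grouping: 10 > 1 yields 1, so n is divided by 1.
    n = 100; r = (n /= (10 > 1)); print n
    # Contrasting grouping: n is divided by 10, then compared with 1.
    n = 100; r = ((n /= 10) > 1); print n
    # Current grouping: the negated string, 0, is matched against the ERE.
    m1 = ((!"wk") ~ /bwk/); print m1
    # Contrasting grouping: the result of the match is negated.
    m2 = !("wk" ~ /bwk/); print m2
}')
echo "$out"
```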
1952
1953 Several features have been added based on newer implementations of awk:
1954
1955 * Multiple instances of -f progfile are permitted.
1956
1957 * The new option -v assignment.
1958
1959 * The new predefined variable ENVIRON.
1960
1961 * New built-in functions toupper and tolower.
1962
1963 * More formatting capabilities are added to printf to match the ISO C
1964 standard.
1965
1966 Earlier versions of this standard required implementations to support
1967 multiple adjacent <semicolon>s, lines with one or more <semicolon>
1968 before a rule (pattern-action pairs), and lines with only <semi‐
1969 colon>(s). These are not required by this standard and are considered
1970 poor programming practice, but can be accepted by an implementation of
1971 awk as an extension.
1972
1973 The overall awk syntax has always been based on the C language, with a
1974 few features from the shell command language and other sources. Because
1975 of this, it is not completely compatible with any other language, which
1976 has caused confusion for some users. It is not the intent of the stan‐
1977 dard developers to address such issues. A few relatively minor changes
1978 toward making the language more compatible with the ISO C standard were
1979 made; most of these changes are based on similar changes in recent
1980 implementations, as described above. There remain several C-language
1981 conventions that are not in awk. One of the notable ones is the
1982 <comma> operator, which is commonly used to specify multiple expres‐
1983 sions in the C language for statement. Also, there are various places
1984 where awk is more restrictive than the C language regarding the type of
1985 expression that can be used in a given context. These limitations are
1986 due to the different features that the awk language does provide.
1987
1988 Regular expressions in awk have been extended somewhat from historical
1989 implementations to make them a pure superset of extended regular
1990 expressions, as defined by POSIX.1‐2008 (see the Base Definitions vol‐
1991 ume of POSIX.1‐2017, Section 9.4, Extended Regular Expressions). The
1992 main extensions are internationalization features and interval expres‐
1993 sions. Historical implementations of awk have long supported <back‐
1994 slash>-escape sequences as an extension to extended regular expres‐
1995 sions, and this extension has been retained despite inconsistency with
1996 other utilities. The number of escape sequences recognized in both
1997 extended regular expressions and strings has varied (generally increas‐
1998 ing with time) among implementations. The set specified by POSIX.1‐2008
1999 includes most sequences known to be supported by popular implementa‐
2000 tions and by the ISO C standard. One sequence that is not supported is
2001 hexadecimal value escapes beginning with '\x'. This would allow values
2002 expressed in more than 9 bits to be used within awk as in the ISO C
2003 standard. However, because this syntax has a non-deterministic length,
2004 it does not permit the subsequent character to be a hexadecimal digit.
2005 This limitation can be dealt with in the C language by the use of lexi‐
2006 cal string concatenation. In the awk language, concatenation could also
2007 be a solution for strings, but not for extended regular expressions
2008 (either lexical ERE tokens or strings used dynamically as regular
2009 expressions). Because of this limitation, the feature has not been
2010 added to POSIX.1‐2008.
2011
2012 When a string variable is used in a context where an extended regular
2013 expression normally appears (where the lexical token ERE is used in the
2014 grammar), the string does not include the delimiting <slash> characters.
2015
2016 Some versions of awk allow the form:
2017
2018
2019 func name(args, ... ) { statements }
2020
2021 This has been deprecated by the authors of the language, who asked that
2022 it not be specified.
2023
2024 Historical implementations of awk produce an error if a next statement
2025 is executed in a BEGIN action, and cause awk to terminate if a next
2026 statement is executed in an END action. This behavior has not been doc‐
2027 umented, and it was not believed that it was necessary to standardize
2028 it.
2029
2030 The specification of conversions between string and numeric values is
2031 much more detailed than in the documentation of historical implementa‐
2032 tions or in the referenced The AWK Programming Language. Although most
2033 of the behavior is designed to be intuitive, the details are necessary
2034 to ensure compatible behavior from different implementations. This is
2035 especially important in relational expressions since the types of the
2036 operands determine whether a string or numeric comparison is performed.
2037 From the perspective of an application developer, it is usually suffi‐
2038 cient to expect intuitive behavior and to force conversions (by adding
2039 zero or concatenating a null string) when the type of an expression
2040 does not obviously match what is needed. The intent has been to specify
2041 historical practice in almost all cases. The one exception is that, in
2042 historical implementations, variables and constants maintain both
2043 string and numeric values after their original value is converted by
2044 any use. This means that referencing a variable or constant can have
2045 unexpected side-effects. For example, with historical implementations
2046 the following program:
2047
2048
2049 {
2050 a = "+2"
2051 b = 2
2052 if (NR % 2)
2053 c = a + b
2054 if (a == b)
2055 print "numeric comparison"
2056 else
2057 print "string comparison"
2058 }
2059
2060 would perform a numeric comparison (and output numeric comparison) for
2061 each odd-numbered line, but perform a string comparison (and output
2062 string comparison) for each even-numbered line. POSIX.1‐2008 ensures
2063 that comparisons will be numeric if necessary. With historical imple‐
2064 mentations, the following program:
2065
2066
2067 BEGIN {
2068 OFMT = "%e"
2069 print 3.14
2070 OFMT = "%f"
2071 print 3.14
2072 }
2073
2074 would output "3.140000e+00" twice, because in the second print state‐
2075 ment the constant "3.14" would have a string value from the previous
2076 conversion. POSIX.1‐2008 requires that the output of the second print
2077 statement be "3.140000". The behavior of historical implementations
2078 was seen as too unintuitive and unpredictable.
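
The advice to force conversions by adding zero or concatenating a null string can be sketched as follows, reusing the values from the first program above. The result names r1, r2, and r3 are hypothetical. Because a is assigned from a string constant, the plain comparison is a string comparison; adding zero makes it numeric, and concatenating a null string to b keeps it a string comparison explicitly.

```shell
out=$(awk 'BEGIN {
    a = "+2"                # string constant
    b = 2                   # numeric constant
    r1 = (a == b)           # string comparison: "+2" against "2"
    r2 = (a + 0 == b)       # forced numeric comparison: 2 against 2
    r3 = (a == b "")        # forced string comparison
    print r1, r2, r3
}')
echo "$out"
```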
2079
2080 It was pointed out that with the rules contained in early drafts, the
2081 following script would print nothing:
2082
2083
2084 BEGIN {
2085 y[1.5] = 1
2086 OFMT = "%e"
2087 print y[1.5]
2088 }
2089
2090 Therefore, a new variable, CONVFMT, was introduced. The OFMT variable
2091 is now restricted to affecting output conversions of numbers to strings
2092 and CONVFMT is used for internal conversions, such as comparisons or
2093 array indexing. The default value of CONVFMT is the same as that for OFMT, so
2094 unless a program changes CONVFMT (which no historical program would
2095 do), it will receive the historical behavior associated with internal
2096 string conversions.
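
The split between the two variables can be sketched as follows; the second half of the example, including the value "%.2g" and the names x and s, is an illustration and not taken from the standard. Changing OFMT no longer affects the array subscript, while changing CONVFMT does affect a conversion made by concatenation.

```shell
out=$(awk 'BEGIN {
    y[1.5] = 1
    OFMT = "%e"             # affects output conversion only
    print y[1.5]            # the subscript is still converted with CONVFMT
    CONVFMT = "%.2g"        # affects internal number-to-string conversion
    x = 3.14159
    s = x ""                # concatenation converts x using CONVFMT
    print s
}')
echo "$out"
```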
2097
2098 The POSIX awk lexical and syntactic conventions are specified more formally
2099 than in other sources. Again, the intent has been to specify historical
2100 practice. One convention that may not be obvious from the formal grammar,
2101 or from other verbal descriptions, is where <newline> characters are
2102 acceptable. There are several obvious placements, such as terminating a
2103 statement, and a <backslash> can be used to escape <newline> characters
2104 between any lexical tokens. In addition, <newline> characters without
2105 <backslash> characters can follow a comma, an open brace, a logical AND
2106 operator ("&&"), a logical OR operator ("||"), the do keyword, the else
2107 keyword, and the closing parenthesis of an if, for, or while statement. For
2108 example:
2109
2110
2111 { print $1,
2112 $2 }
2113
2114 The requirement that awk add a trailing <newline> to the program argu‐
2115 ment text is to simplify the grammar, making it match a text file in
2116 form. There is no way for an application or test suite to determine
2117 whether a literal <newline> is added or whether awk simply acts as if
2118 it did.
2119
2120 POSIX.1‐2008 requires several changes from historical implementations
2121 in order to support internationalization. Probably the most subtle of
2122 these is the use of the decimal-point character, defined by the
2123 LC_NUMERIC category of the locale, in representations of floating-point
2124 numbers. This locale-specific character is used in recognizing numeric
2125 input, in converting between strings and numeric values, and in format‐
2126 ting output. However, regardless of locale, the <period> character (the
2127 decimal-point character of the POSIX locale) is the decimal-point char‐
2128 acter recognized in processing awk programs (including assignments in
2129 command line arguments). This is essentially the same convention as the
2130 one used in the ISO C standard. The difference is that the C language
2131 includes the setlocale() function, which permits an application to mod‐
2132 ify its locale. Because of this capability, a C application begins exe‐
2133 cuting with its locale set to the C locale, and only executes in the
2134 environment-specified locale after an explicit call to setlocale().
2135 However, adding such an elaborate new feature to the awk language was
2136 seen as inappropriate for POSIX.1‐2008. It is possible to execute an
2137 awk program explicitly in any desired locale by setting the environment
2138 in the shell.
2139
2140 The undefined behavior resulting from NULs in extended regular expres‐
2141 sions allows future extensions for the GNU gawk program to process
2142 binary data.
2143
2144 The behavior in the case of invalid awk programs (including lexical,
2145 syntactic, and semantic errors) is undefined because it was considered
2146 overly limiting on implementations to specify. In most cases such
2147 errors can be expected to produce a diagnostic and a non-zero exit sta‐
2148 tus. However, some implementations may choose to extend the language in
2149 ways that make use of certain invalid constructs. Other invalid con‐
2150 structs might be deemed worthy of a warning, but otherwise cause some
2151 reasonable behavior. Still other constructs may be very difficult to
2152 detect in some implementations. Also, different implementations might
2153 detect a given error during an initial parsing of the program (before
2154 reading any input files) while others might detect it when executing
2155 the program after reading some input. Implementors should be aware that
2156 diagnosing errors as early as possible and producing useful diagnostics
2157 can ease debugging of applications, and thus make an implementation
2158 more usable.
2159
2160 The unspecified behavior from using multi-character RS values is to
2161 allow possible future extensions based on extended regular expressions
2162 used for record separators. Historical implementations take the first
2163 character of the string and ignore the others.
2164
2165 The behavior when split(string,array,<null>) is used is unspecified, to
2166 allow a proposed future extension that would split up a string into an
2167 array of individual characters.
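
Until such an extension can be relied upon, a string can be split into characters portably with substr, as in this minimal sketch (the names s, n, and chars are hypothetical):

```shell
# Portable alternative to split() with a null separator.
out=$(awk 'BEGIN {
    s = "awk"
    n = length(s)
    for (i = 1; i <= n; ++i)
        chars[i] = substr(s, i, 1)
    print n, chars[1], chars[2], chars[3]
}')
echo "$out"
```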
2168
2169 In the context of the getline function, equally good arguments for dif‐
2170 ferent precedences of the | and < operators can be made. Historical
2171 practice has been that:
2172
2173
2174 getline < "a" "b"
2175
2176 is parsed as:
2177
2178
2179 ( getline < "a" ) "b"
2180
2181 although many would argue that the intent was that the file ab should
2182 be read. However:
2183
2184
2185 getline < "x" + 1
2186
2187 parses as:
2188
2189
2190 getline < ( "x" + 1 )
2191
2192 Similar problems occur with the | version of getline, particularly in
2193 combination with $. For example:
2194
2195
2196 $"echo hi" | getline
2197
2198 (This situation is particularly problematic when used in a print state‐
2199 ment, where the |getline part might be a redirection of the print.)
2200
2201 Since in most cases such constructs are not (or at least should not be)
2202 used (because they have a natural ambiguity for which there is no con‐
2203 ventional parsing), the meaning of these constructs has been made
2204 explicitly unspecified. (The effect is that a conforming application
2205 that runs into the problem must parenthesize to resolve the ambiguity.)
2206 There appeared to be few if any actual uses of such constructs.
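
Parenthesizing as described might look like the following sketch. The scratch directory, the file name ab, and the variable names d, line, and word are assumptions made for illustration.

```shell
dir=$(mktemp -d)
echo "from file" > "$dir/ab"
out=$(awk -v d="$dir" 'BEGIN {
    # Parenthesize the concatenation so the file name is unambiguous.
    if ((getline line < (d "/" "ab")) > 0)
        print line
    # Parenthesize the command string before piping it to getline.
    if ((("echo hi") | getline word) > 0)
        print word
}')
rm -rf "$dir"
echo "$out"
```

Wrapping the whole getline expression in parentheses before comparing its return value avoids the same kind of ambiguity with the '>' operator.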
2207
2208 Grammars can be written that would cause an error under these circum‐
2209 stances. Where backwards-compatibility is not a large consideration,
2210 implementors may wish to use such grammars.
2211
2212 Some historical implementations have allowed some built-in functions to
2213 be called without an argument list, the result being a default argument
2214 list chosen in some ``reasonable'' way. Use of length as a synonym for
2215 length($0) is the only one of these forms that is thought to be widely
2216 known or widely used; this particular form is documented in various
2217 places (for example, most historical awk reference pages, although not
2218 in the referenced The AWK Programming Language) as legitimate practice.
2219 With this exception, default argument lists have always been undocu‐
2220 mented and vaguely defined, and it is not at all clear how (or if) they
2221 should be generalized to user-defined functions. They add no useful
2222 functionality and preclude possible future extensions that might need
2223 to name functions without calling them. Not standardizing them seems
2224 the simplest course. The standard developers considered that length
2225 merited special treatment, however, since it has been documented in the
2226 past and sees possibly substantial use in historical programs. Accord‐
2227 ingly, this usage has been made legitimate, but Issue 5 removed the
2228 obsolescent marking for XSI-conforming implementations and many other‐
2229 wise conforming applications depend on this feature.
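
The equivalence of the two forms can be checked directly. A minimal sketch, assuming a hypothetical helper named record_length:

```shell
# Hypothetical helper: print the length of the input record twice, once
# with the bare keyword and once with an explicit argument list.
record_length() {
    printf '%s\n' "$1" | awk '{ print length; print length($0) }'
}

record_length hello
```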
2230
2231 In sub and gsub, if repl is a string literal (the lexical token
2232 STRING), then two consecutive <backslash> characters should be used in
2233 the string to ensure a single <backslash> will precede the <ampersand>
2234 when the resultant string is passed to the function. (For example, to
2235 specify one literal <ampersand> in the replacement string, use
2236 gsub(ERE, "\\&").)
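
The difference between an escaped and an unescaped <ampersand> can be sketched as follows. The helper name escape_demo is hypothetical, and the awk program is single-quoted so that the shell passes the <backslash> characters through unchanged:

```shell
# Hypothetical helper: replace each period in the argument, first with a
# literal ampersand and then with the matched text in brackets.
escape_demo() {
    printf '%s\n' "$1" | awk '{
        literal = $0
        matched = $0
        gsub(/\./, "\\&", literal)   # "\\&" lexes to \& : a literal ampersand
        gsub(/\./, "[&]", matched)   # a bare & stands for the matched text
        print literal
        print matched
    }'
}

escape_demo a.b
```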
2237
2238 Historically, the only special character in the repl argument of sub
2239 and gsub string functions was the <ampersand> ('&') character and pre‐
2240 ceding it with the <backslash> character was used to turn off its spe‐
2241 cial meaning.
2242
2243 The description in the ISO POSIX‐2:1993 standard introduced behavior
2244 such that the <backslash> character was another special character and
2245 it was unspecified whether there were any other special characters.
2246 This description introduced several portability problems, some of which
2247 are described below, and so it has been replaced with the more histori‐
2248 cal description. Some of the problems include:
2249
2250 * Historically, a script could create a replacement string of one
2251 literal <ampersand> with gsub(ERE, "\\&"), but the ISO POSIX‐2:1993
2252 standard wording made it necessary to use gsub(ERE, "\\\\&"). The
2253 <backslash> characters are doubled here because all string literals
2254 are subject to lexical analysis, which reduces each pair of <backslash>
2255 characters to a single <backslash> before the string reaches gsub.
2256
2257 * Since it was unspecified what the special characters were, for por‐
2258 table scripts to guarantee that characters are printed literally,
2259 each character had to be preceded with a <backslash>. (For exam‐
2260 ple, a portable script had to use gsub(ERE, "\\h\\i") to produce a
2261 replacement string of "hi".)
2262
2263 The description for comparisons in the ISO POSIX‐2:1993 standard did
2264 not properly describe historical practice because of the way numeric
2265 strings are compared as numbers. The rules in that standard cause the following
2266 code:
2267
2268
2269 if (0 == "000")
2270     print "strange, but true"
2271 else
2272     print "not true"
2273
2274 to do a numeric comparison, causing the if to succeed. It should be
2275 intuitively obvious that this is incorrect behavior, and indeed, no
2276 historical implementation of awk actually behaves this way.
2277
2278 To fix this problem, the definition of numeric string was enhanced to
2279 include only those values obtained from specific circumstances (mostly
2280 external sources) where it is not possible to determine unambiguously
2281 whether the value is intended to be a string or a numeric.
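
Under this definition a string constant is never a numeric string, whereas a field read from input can be. A minimal sketch contrasting the two cases, assuming a hypothetical helper named compare_demo:

```shell
# Hypothetical helper: compare 0 with a string constant and with a field
# that holds the same characters.
compare_demo() {
    printf '%s\n' "$1" | awk '{
        # "000" is a string constant, so this is a string comparison.
        if (0 == "000")
            print "constant: equal"
        else
            print "constant: not equal"
        # $1 comes from input and looks numeric, so it is a numeric string.
        if (0 == $1)
            print "field: equal"
        else
            print "field: not equal"
    }'
}

compare_demo 000
```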
2282
2283 Variables that are assigned to a numeric string shall also be treated
2284 as a numeric string. (That is, the notion of a numeric string can
2285 be propagated across assignments.) In comparisons, all variables having
2286 the uninitialized value are to be treated as a numeric operand evaluat‐
2287 ing to the numeric value zero.
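
Both rules can be observed in a short program. This is a sketch only; the helper name propagate_demo and the variable names x and never_set are hypothetical:

```shell
# Hypothetical helper: show that a numeric string keeps its nature across
# an assignment and that an uninitialized variable compares equal to both
# zero and the empty string.
propagate_demo() {
    printf '%s\n' "$1" | awk '{
        x = $1                     # x inherits the numeric-string nature of $1
        print (x == 0)             # numeric comparison
        print (never_set == 0)     # uninitialized value as a number
        print (never_set == "")    # uninitialized value as a string
    }'
}

propagate_demo 000
```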
2288
2289 Uninitialized variables include all types of variables including
2290 scalars, array elements, and fields. The definition of an uninitialized
2291 value in Variables and Special Variables is necessary to describe the
2292 value placed on uninitialized variables and on fields that are valid
2293 (for example, < $NF) but have no characters in them and to describe how
2294 these variables are to be used in comparisons. A valid field, such as
2295 $1, that has no characters in it can be obtained from an input line of
2296 "\t\t" when FS='\t'. Historically, the comparison ($1<10) was done
2297 numerically after evaluating $1 to the value zero.
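
The empty-field case can be reproduced as follows. A minimal sketch, assuming a hypothetical helper named empty_field_demo; FS is set in a BEGIN action rather than with -F only to keep the quoting simple:

```shell
# Hypothetical helper: feed awk a line consisting of two <tab> characters
# and report the field count, the length of $1, and the result of $1 < 10.
empty_field_demo() {
    printf '\t\t\n' | awk 'BEGIN { FS = "\t" } {
        print NF
        print length($1)
        print ($1 < 10)
    }'
}

empty_field_demo
```

Two separators delimit three fields, all of them empty, and the comparison yields 1.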
2298
2299 The phrase ``... also shall have the numeric value of the numeric
2300 string'' was removed from several sections of the ISO POSIX‐2:1993
2301 standard because it specifies an unnecessary implementation detail. It
2302 is not necessary for POSIX.1‐2008 to specify that these objects be
2303 assigned two different values. It is only necessary to specify that
2304 these objects may evaluate to two different values depending on con‐
2305 text.
2306
2307 Historical implementations of awk did not parse hexadecimal integer or
2308 floating constants like "0xa" and "0xap0". Due to an oversight, the
2309 2001 through 2004 editions of this standard required support for hexa‐
2310 decimal floating constants. This was due to the reference to atof().
2311 This version of the standard allows but does not require implementa‐
2312 tions to use atof() and includes a description of how floating-point
2313 numbers are recognized as an alternative to match historic behavior.
2314 The intent of this change is to allow implementations to recognize
2315 floating-point constants according to either the ISO/IEC 9899:1990
2316 standard or ISO/IEC 9899:1999 standard, and to allow (but not require)
2317 implementations to recognize hexadecimal integer constants.
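
Since recognition of hexadecimal strings is allowed but not required, a portable script can probe the behavior instead of assuming it. A minimal sketch, assuming a hypothetical helper named hex_probe; it is expected to print 10 where the string "0xa" is converted as hexadecimal and 0 where conversion stops after the leading zero:

```shell
# Hypothetical helper: report how this awk converts the string "0xa"
# when it is used in a numeric context.
hex_probe() {
    awk 'BEGIN { print "0xa" + 0 }'
}

hex_probe
```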
2318
2319 Historical implementations of awk did not support floating-point
2320 infinities and NaNs in numeric strings; e.g., "-INF" and "NaN". How‐
2321 ever, implementations that use the atof() or strtod() functions to do
2322 the conversion picked up support for these values if they used an
2323 ISO/IEC 9899:1999 standard version of the function instead of an
2324 ISO/IEC 9899:1990 standard version. Due to an oversight, the 2001
2325 through 2004 editions of this standard did not allow support for
2326 infinities and NaNs, but in this revision support is allowed (but not
2327 required). This is a silent change to the behavior of awk programs; for
2328 example, in the POSIX locale the expression:
2329
2330
2331 ("-INF" + 0 < 0)
2332
2333 formerly had the value 0 because "-INF" converted to 0, but now it may
2334 have the value 0 or 1.
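
A script that depends on one of the two outcomes can test for it at run time. A minimal sketch, assuming a hypothetical helper named inf_probe; either result is conforming:

```shell
# Hypothetical helper: print 1 if this awk converts "-INF" to negative
# infinity, or 0 if it converts the string to zero.
inf_probe() {
    awk 'BEGIN { print ("-INF" + 0 < 0) }'
}

inf_probe
```

The comparison is parenthesized so that < cannot be taken as an input redirection.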
2335
2337 A future version of this standard may require the "!=" and "==" opera‐
2338 tors to perform string comparisons by checking if the strings are iden‐
2339 tical (and not by checking if they collate equally).
2340
2342 Section 1.3, Grammar Conventions, grep, lex, sed
2343
2344 The Base Definitions volume of POSIX.1‐2017, Chapter 5, File Format
2345 Notation, Section 6.1, Portable Character Set, Chapter 8, Environment
2346 Variables, Chapter 9, Regular Expressions, Section 12.2, Utility Syntax
2347 Guidelines
2348
2349 The System Interfaces volume of POSIX.1‐2017, atof(), exec, isspace(),
2350 popen(), setlocale(), strtod()
2351
2353 Portions of this text are reprinted and reproduced in electronic form
2354 from IEEE Std 1003.1-2017, Standard for Information Technology -- Por‐
2355 table Operating System Interface (POSIX), The Open Group Base Specifi‐
2356 cations Issue 7, 2018 Edition, Copyright (C) 2018 by the Institute of
2357 Electrical and Electronics Engineers, Inc and The Open Group. In the
2358 event of any discrepancy between this version and the original IEEE and
2359 The Open Group Standard, the original IEEE and The Open Group Standard
2360 is the referee document. The original Standard can be obtained online
2361 at http://www.opengroup.org/unix/online.html .
2362
2363 Any typographical or formatting errors that appear in this page are
2364 most likely to have been introduced during the conversion of the source
2365 files to man page format. To report such errors, see
2366 https://www.kernel.org/doc/man-pages/reporting_bugs.html .
2367
2368
2369
2370IEEE/The Open Group 2017 AWK(1P)