AWK(1P)                 POSIX Programmer's Manual                 AWK(1P)

PROLOG
This manual page is part of the POSIX Programmer's Manual. The Linux
implementation of this interface may differ (consult the corresponding
Linux manual page for details of Linux behavior), or the interface may
not be implemented on Linux.

NAME
awk - pattern scanning and processing language

SYNOPSIS
awk [-F ERE] [-v assignment] ... program [argument ...]

awk [-F ERE] -f progfile ... [-v assignment] ... [argument ...]

DESCRIPTION
21 The awk utility shall execute programs written in the awk programming
22 language, which is specialized for textual data manipulation. An awk
23 program is a sequence of patterns and corresponding actions. When input
24 is read that matches a pattern, the action associated with that pattern
25 is carried out.
26
27 Input shall be interpreted as a sequence of records. By default, a
28 record is a line, less its terminating <newline>, but this can be
29 changed by using the RS built-in variable. Each record of input shall
30 be matched in turn against each pattern in the program. For each pat‐
31 tern matched, the associated action shall be executed.
32
33 The awk utility shall interpret each input record as a sequence of
fields where, by default, a field is a string of non-<blank>s. This
35 default white-space field delimiter can be changed by using the FS
36 built-in variable or -F ERE. The awk utility shall denote the first
37 field in a record $1, the second $2, and so on. The symbol $0 shall
38 refer to the entire record; setting any other field causes the re-eval‐
39 uation of $0. Assigning to $0 shall reset the values of all other
40 fields and the NF built-in variable.
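
The field behavior described above can be illustrated with a minimal sketch. The function name fields_demo and the sample input are illustrative only and are not part of the standard.

```sh
# Illustrative helper (hypothetical name): shows field access and how
# assigning to a field causes $0 to be re-evaluated.
fields_demo() {
    printf 'alpha beta gamma\n' | awk '{
        print $2        # second field of the record
        $2 = "X"        # assigning a field rebuilds $0
        print $0
        print NF        # field count is unchanged
    }'
}
fields_demo
```

With the default separators this is expected to print the second field, then the rebuilt record, then the field count.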
41
OPTIONS
The awk utility shall conform to the Base Definitions volume of
IEEE Std 1003.1-2001, Section 12.2, Utility Syntax Guidelines.
45
46 The following options shall be supported:
47
48 -F ERE
Define the input field separator to be the extended regular
expression ERE, before any input is read; see Regular Expressions.
52
53 -f progfile
54 Specify the pathname of the file progfile containing an awk pro‐
55 gram. If multiple instances of this option are specified, the
56 concatenation of the files specified as progfile in the order
57 specified shall be the awk program. The awk program can alterna‐
58 tively be specified in the command line as a single argument.
59
60 -v assignment
61 The application shall ensure that the assignment argument is in
62 the same form as an assignment operand. The specified variable
63 assignment shall occur prior to executing the awk program,
64 including the actions associated with BEGIN patterns (if any).
65 Multiple occurrences of this option can be specified.
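
As a hedged sketch of the -F and -v options, the following uses a colon as the field separator and reads a -v variable inside a BEGIN action. The function name opts_demo, the variable name label, and the input line are assumptions made for illustration.

```sh
# Illustrative helper (hypothetical name): -F sets the field separator
# before input is read; -v assigns before BEGIN actions run.
opts_demo() {
    printf 'root:x:0\n' |
        awk -F: -v label=uid 'BEGIN { printf "%s=", label } { print $3 }'
}
opts_demo
```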
66
67
OPERANDS
The following operands shall be supported:
70
71 program
72 If no -f option is specified, the first operand to awk shall be
73 the text of the awk program. The application shall supply the
74 program operand as a single argument to awk. If the text does
75 not end in a <newline>, awk shall interpret the text as if it
76 did.
77
78 argument
79 Either of the following two types of argument can be intermixed:
80
81 file
82 A pathname of a file that contains the input to be read, which
83 is matched against the set of patterns in the program. If no
84 file operands are specified, or if a file operand is '-', the
85 standard input shall be used.
86
87 assignment
88 An operand that begins with an underscore or alphabetic charac‐
89 ter from the portable character set (see the table in the Base
90 Definitions volume of IEEE Std 1003.1-2001, Section 6.1, Porta‐
91 ble Character Set), followed by a sequence of underscores, dig‐
92 its, and alphabetics from the portable character set, followed
93 by the '=' character, shall specify a variable assignment rather
94 than a pathname. The characters before the '=' represent the
95 name of an awk variable; if that name is an awk reserved word
(see Grammar) the behavior is undefined. The characters following
the equal sign shall be interpreted as if they appeared in the awk
program preceded and followed by a double-quote ('"') character, as
a STRING token (see Grammar), except that if the last character is
an unescaped backslash, it shall be interpreted as a literal
backslash rather than as the first character of the sequence "\"".
The variable shall be assigned the value of that STRING token. If
appropriate, the value shall be considered a numeric string (see
Expressions in awk), and the variable shall also be assigned its
numeric value. Each such variable assignment shall occur just prior
to the processing of the following file, if any. Thus, an assignment
before the first file argument shall be executed after the BEGIN
actions (if any), while an assignment after the last file argument
shall occur before the END actions (if any). If there are no file
arguments, assignments shall be executed before processing the
standard input.
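
The timing difference between a -v assignment and an assignment operand can be sketched as follows. The function name assign_demo and the variable names a and b are illustrative assumptions; the '-' operand stands for standard input.

```sh
# Illustrative helper (hypothetical name): a is set with -v and is
# visible in BEGIN; b is an assignment operand and is only set just
# before the following file operand ('-') is processed.
assign_demo() {
    printf 'x\n' | awk -v a=1 '
        BEGIN { print "BEGIN: a=" a ", b=[" b "]" }
              { print "record: b=" b }
    ' b=2 -
}
assign_demo
```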
112
113
114
STDIN
The standard input shall be used only if no file operands are specified,
or if a file operand is '-'; see the INPUT FILES section. If the awk
program contains no actions and no patterns, but is otherwise a valid
awk program, standard input and any file operands shall not be read and
awk shall exit with a return status of zero.
121
INPUT FILES
Input files to the awk program from any of the following sources shall
be text files:
125
126 * Any file operands or their equivalents, achieved by modifying the
127 awk variables ARGV and ARGC
128
129 * Standard input in the absence of any file operands
130
131 * Arguments to the getline function
132
For these files, whether or not the variable RS is set to a value other
than a <newline>, implementations shall support records terminated with
the specified separator up to {LINE_MAX} bytes and may support longer
records.
137
138 If -f progfile is specified, the application shall ensure that the
139 files named by each of the progfile option-arguments are text files and
140 their concatenation, in the same order as they appear in the arguments,
141 is an awk program.
142
ENVIRONMENT VARIABLES
The following environment variables shall affect the execution of awk:
145
146 LANG Provide a default value for the internationalization variables
147 that are unset or null. (See the Base Definitions volume of
148 IEEE Std 1003.1-2001, Section 8.2, Internationalization Vari‐
149 ables for the precedence of internationalization variables used
150 to determine the values of locale categories.)
151
152 LC_ALL If set to a non-empty string value, override the values of all
153 the other internationalization variables.
154
155 LC_COLLATE
156 Determine the locale for the behavior of ranges, equivalence
157 classes, and multi-character collating elements within regular
158 expressions and in comparisons of string values.
159
160 LC_CTYPE
161 Determine the locale for the interpretation of sequences of
162 bytes of text data as characters (for example, single-byte as
163 opposed to multi-byte characters in arguments and input files),
164 the behavior of character classes within regular expressions,
165 the identification of characters as letters, and the mapping of
166 uppercase and lowercase characters for the toupper and tolower
167 functions.
168
169 LC_MESSAGES
170 Determine the locale that should be used to affect the format
171 and contents of diagnostic messages written to standard error.
172
173 LC_NUMERIC
174 Determine the radix character used when interpreting numeric
175 input, performing conversions between numeric and string values,
176 and formatting numeric output. Regardless of locale, the period
177 character (the decimal-point character of the POSIX locale) is
178 the decimal-point character recognized in processing awk pro‐
179 grams (including assignments in command line arguments).
180
NLSPATH
Determine the location of message catalogs for the processing of
LC_MESSAGES.
184
185 PATH Determine the search path when looking for commands executed by
186 system(expr), or input and output pipes; see the Base Defini‐
187 tions volume of IEEE Std 1003.1-2001, Chapter 8, Environment
188 Variables.
189
190
191 In addition, all environment variables shall be visible via the awk
192 variable ENVIRON.
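
A minimal sketch of reading the environment through ENVIRON follows. The function name env_demo and the environment variable GREETING are hypothetical and chosen only for the example.

```sh
# Illustrative helper (hypothetical name): GREETING is placed in the
# environment of this one awk invocation and read back via ENVIRON.
env_demo() {
    GREETING=hello awk 'BEGIN { print ENVIRON["GREETING"] }'
}
env_demo
```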
193
ASYNCHRONOUS EVENTS
Default.

STDOUT
The nature of the output files depends on the awk program.

STDERR
The standard error shall be used only for diagnostic messages.

OUTPUT FILES
The nature of the output files depends on the awk program.
205
EXTENDED DESCRIPTION
Overall Program Structure
An awk program is composed of pairs of the form:
209
210
211 pattern { action }
212
213 Either the pattern or the action (including the enclosing brace charac‐
214 ters) can be omitted.
215
216 A missing pattern shall match any record of input, and a missing action
217 shall be equivalent to:
218
219
220 { print }
221
222 Execution of the awk program shall start by first executing the actions
223 associated with all BEGIN patterns in the order they occur in the pro‐
224 gram. Then each file operand (or standard input if no files were speci‐
225 fied) shall be processed in turn by reading data from the file until a
226 record separator is seen ( <newline> by default). Before the first ref‐
227 erence to a field in the record is evaluated, the record shall be split
228 into fields, according to the rules in Regular Expressions, using the
229 value of FS that was current at the time the record was read. Each pat‐
230 tern in the program then shall be evaluated in the order of occurrence,
231 and the action associated with each pattern that matches the current
232 record executed. The action for a matching pattern shall be executed
233 before evaluating subsequent patterns. Finally, the actions associated
234 with all END patterns shall be executed in the order they occur in the
235 program.
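
The execution order described above can be sketched with a small program in which the END rule deliberately appears first. The function name structure_demo and the three input lines are illustrative assumptions.

```sh
# Illustrative helper (hypothetical name): BEGIN actions run first,
# then each record is tested against each pattern in program order,
# then END actions run.
structure_demo() {
    printf 'one\ntwo\nthree\n' | awk '
        END   { print "total " NR }     # END may precede BEGIN in the text
        BEGIN { print "start" }
        /t/   { print "match " $0 }
        NR == 1                         # no action: equivalent to { print }
    '
}
structure_demo
```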
236
237 Expressions in awk
238 Expressions describe computations used in patterns and actions. In the
239 following table, valid expression operations are given in groups from
240 highest precedence first to lowest precedence last, with equal-prece‐
241 dence operators grouped between horizontal lines. In expression evalua‐
242 tion, where the grammar is formally ambiguous, higher precedence opera‐
243 tors shall be evaluated before lower precedence operators. In this ta‐
244 ble expr, expr1, expr2, and expr3 represent any expression, while
245 lvalue represents any entity that can be assigned to (that is, on the
246 left side of an assignment operator). The precise syntax of expressions
247 is given in Grammar .
248
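Table: Expressions in Decreasing Precedence in awk

Syntax                  Name                       Type of Result    Associativity
----------------------------------------------------------------------------------
( expr )                Grouping                   Type of expr      N/A
----------------------------------------------------------------------------------
$expr                   Field reference            String            N/A
----------------------------------------------------------------------------------
++ lvalue               Pre-increment              Numeric           N/A
-- lvalue               Pre-decrement              Numeric           N/A
lvalue ++               Post-increment             Numeric           N/A
lvalue --               Post-decrement             Numeric           N/A
----------------------------------------------------------------------------------
expr ^ expr             Exponentiation             Numeric           Right
----------------------------------------------------------------------------------
! expr                  Logical not                Numeric           N/A
+ expr                  Unary plus                 Numeric           N/A
- expr                  Unary minus                Numeric           N/A
----------------------------------------------------------------------------------
expr * expr             Multiplication             Numeric           Left
expr / expr             Division                   Numeric           Left
expr % expr             Modulus                    Numeric           Left
----------------------------------------------------------------------------------
expr + expr             Addition                   Numeric           Left
expr - expr             Subtraction                Numeric           Left
----------------------------------------------------------------------------------
expr expr               String concatenation       String            Left
----------------------------------------------------------------------------------
expr < expr             Less than                  Numeric           None
expr <= expr            Less than or equal to      Numeric           None
expr != expr            Not equal to               Numeric           None
expr == expr            Equal to                   Numeric           None
expr > expr             Greater than               Numeric           None
expr >= expr            Greater than or equal to   Numeric           None
----------------------------------------------------------------------------------
expr ~ expr             ERE match                  Numeric           None
expr !~ expr            ERE non-match              Numeric           None
----------------------------------------------------------------------------------
expr in array           Array membership           Numeric           Left
( index ) in array      Multi-dimension array      Numeric           Left
                        membership
----------------------------------------------------------------------------------
expr && expr            Logical AND                Numeric           Left
----------------------------------------------------------------------------------
expr || expr            Logical OR                 Numeric           Left
----------------------------------------------------------------------------------
expr1 ? expr2 : expr3   Conditional expression     Type of selected  Right
                                                   expr2 or expr3
----------------------------------------------------------------------------------
lvalue ^= expr          Exponentiation assignment  Numeric           Right
lvalue %= expr          Modulus assignment         Numeric           Right
lvalue *= expr          Multiplication assignment  Numeric           Right
lvalue /= expr          Division assignment        Numeric           Right
lvalue += expr          Addition assignment        Numeric           Right
lvalue -= expr          Subtraction assignment     Numeric           Right
lvalue = expr           Assignment                 Type of expr      Right
----------------------------------------------------------------------------------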
291
292 Each expression shall have either a string value, a numeric value, or
293 both. Except as stated for specific contexts, the value of an expres‐
294 sion shall be implicitly converted to the type needed for the context
295 in which it is used. A string value shall be converted to a numeric
296 value by the equivalent of the following calls to functions defined by
297 the ISO C standard:
298
299
300 setlocale(LC_NUMERIC, "");
301 numeric_value = atof(string_value);
302
303 A numeric value that is exactly equal to the value of an integer (see
304 Concepts Derived from the ISO C Standard ) shall be converted to a
305 string by the equivalent of a call to the sprintf function (see String
306 Functions ) with the string "%d" as the fmt argument and the numeric
307 value being converted as the first and only expr argument. Any other
308 numeric value shall be converted to a string by the equivalent of a
309 call to the sprintf function with the value of the variable CONVFMT as
310 the fmt argument and the numeric value being converted as the first and
311 only expr argument. The result of the conversion is unspecified if the
312 value of CONVFMT is not a floating-point format specification. This
313 volume of IEEE Std 1003.1-2001 specifies no explicit conversions
314 between numbers and strings. An application can force an expression to
315 be treated as a number by adding zero to it, or can force it to be
316 treated as a string by concatenating the null string ( "" ) to it.
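
The conversion rules above can be sketched as follows, assuming the POSIX locale and the default CONVFMT of "%.6g". The function name convert_demo and the variable names are illustrative only.

```sh
# Illustrative helper (hypothetical name): number-to-string conversion
# via concatenation with "", and string-to-number conversion via + 0.
convert_demo() {
    awk 'BEGIN {
        x = 17              # integer value: converted as if by "%d"
        y = 3.14159265      # other values: converted using CONVFMT
        s = x ""; t = y ""
        print s " " t
        CONVFMT = "%.2f"    # changes later number-to-string conversions
        t = y ""
        print t
        n = "3x" + 0        # atof-style conversion uses the numeric prefix
        print n
    }'
}
convert_demo
```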
317
318 A string value shall be considered a numeric string if it comes from
319 one of the following:
320
321 1. Field variables
322
323 2. Input from the getline() function
324
325 3. FILENAME
326
327 4. ARGV array elements
328
329 5. ENVIRON array elements
330
331 6. Array elements created by the split() function
332
333 7. A command line variable assignment
334
335 8. Variable assignment from another numeric string variable
336
337 and after all the following conversions have been applied, the result‐
338 ing string would lexically be recognized as a NUMBER token as described
339 by the lexical conventions in Grammar :
340
341 * All leading and trailing <blank>s are discarded.
342
343 * If the first non- <blank> is '+' or '-', it is discarded.
344
* Each occurrence of the decimal point character from the current
  locale is changed to a period.
347
348 If a '-' character is ignored in the preceding description, the numeric
349 value of the numeric string shall be the negation of the numeric value
350 of the recognized NUMBER token. Otherwise, the numeric value of the
351 numeric string shall be the numeric value of the recognized NUMBER
352 token. Whether or not a string is a numeric string shall be relevant
353 only in contexts where that term is used in this section.
354
355 When an expression is used in a Boolean context, if it has a numeric
356 value, a value of zero shall be treated as false and any other value
357 shall be treated as true. Otherwise, a string value of the null string
358 shall be treated as false and any other value shall be treated as true.
359 A Boolean context shall be one of the following:
360
361 * The first subexpression of a conditional expression
362
363 * An expression operated on by logical NOT, logical AND, or logical OR
364
365 * The second expression of a for statement
366
367 * The expression of an if statement
368
* The expression of the while clause in either a while or do ... while
  statement
371
372 * An expression used as a pattern (as in Overall Program Structure)
373
374 All arithmetic shall follow the semantics of floating-point arithmetic
375 as specified by the ISO C standard (see Concepts Derived from the ISO C
376 Standard ).
377
378 The value of the expression:
379
380
381 expr1 ^ expr2
382
383 shall be equivalent to the value returned by the ISO C standard func‐
384 tion call:
385
386
387 pow(expr1, expr2)
388
389 The expression:
390
391
392 lvalue ^= expr
393
394 shall be equivalent to the ISO C standard expression:
395
396
397 lvalue = pow(lvalue, expr)
398
399 except that lvalue shall be evaluated only once. The value of the
400 expression:
401
402
403 expr1 % expr2
404
405 shall be equivalent to the value returned by the ISO C standard func‐
406 tion call:
407
408
409 fmod(expr1, expr2)
410
411 The expression:
412
413
414 lvalue %= expr
415
416 shall be equivalent to the ISO C standard expression:
417
418
419 lvalue = fmod(lvalue, expr)
420
421 except that lvalue shall be evaluated only once.
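
A short sketch of the exponentiation and modulus operators, which follow pow() and fmod() as stated above. The function name arith_demo is illustrative only.

```sh
# Illustrative helper (hypothetical name): ^ is right associative and
# % behaves like fmod, so it accepts non-integer operands and keeps
# the sign of the dividend.
arith_demo() {
    awk 'BEGIN {
        print 2 ^ 3 ^ 2     # evaluated as 2 ^ (3 ^ 2)
        print -7 % 3
        print 7.5 % 2
    }'
}
arith_demo
```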
422
423 Variables and fields shall be set by the assignment statement:
424
425
426 lvalue = expression
427
428 and the type of expression shall determine the resulting variable type.
429 The assignment includes the arithmetic assignments ( "+=", "-=", "*=",
430 "/=", "%=", "^=", "++", "--" ) all of which shall produce a numeric
431 result. The left-hand side of an assignment and the target of increment
432 and decrement operators can be one of a variable, an array with index,
433 or a field selector.
434
435 The awk language supplies arrays that are used for storing numbers or
436 strings. Arrays need not be declared. They shall initially be empty,
437 and their sizes shall change dynamically. The subscripts, or element
438 identifiers, are strings, providing a type of associative array capa‐
439 bility. An array name followed by a subscript within square brackets
440 can be used as an lvalue and thus as an expression, as described in the
441 grammar; see Grammar . Unsubscripted array names can be used in only
442 the following contexts:
443
444 * A parameter in a function definition or function call
445
446 * The NAME token following any use of the keyword in as specified in
447 the grammar (see Grammar ); if the name used in this context is not
448 an array name, the behavior is undefined
449
450 A valid array index shall consist of one or more comma-separated
451 expressions, similar to the way in which multi-dimensional arrays are
452 indexed in some programming languages. Because awk arrays are really
453 one-dimensional, such a comma-separated list shall be converted to a
454 single string by concatenating the string values of the separate
455 expressions, each separated from the other by the value of the SUBSEP
456 variable. Thus, the following two index operations shall be equiva‐
457 lent:
458
459
460 var[expr1, expr2, ... exprn]
461
462
463 var[expr1 SUBSEP expr2 SUBSEP ... SUBSEP exprn]
464
465 The application shall ensure that a multi-dimensioned index used with
466 the in operator is parenthesized. The in operator, which tests for the
467 existence of a particular array element, shall not cause that element
468 to exist. Any other reference to a nonexistent array element shall
469 automatically create it.
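
The array rules above can be sketched as follows. The function name array_demo and the array names grid and seen are illustrative assumptions.

```sh
# Illustrative helper (hypothetical name): multi-dimensional subscripts
# are joined with SUBSEP, and only the in operator tests membership
# without creating the element.
array_demo() {
    awk 'BEGIN {
        grid[1, 2] = "x"
        print ((1, 2) in grid)      # parenthesized multi-dimensional index
        key = 1 SUBSEP 2
        print (key in grid)         # same element, explicit SUBSEP
        print ("nope" in seen)      # membership test creates nothing
        tmp = seen["nope"]          # any other reference creates the element
        print ("nope" in seen)
    }'
}
array_demo
```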
470
471 Comparisons (with the '<', "<=", "!=", "==", '>', and ">=" operators)
472 shall be made numerically if both operands are numeric, if one is
473 numeric and the other has a string value that is a numeric string, or
474 if one is numeric and the other has the uninitialized value. Otherwise,
475 operands shall be converted to strings as required and a string compar‐
476 ison shall be made using the locale-specific collation sequence. The
477 value of the comparison expression shall be 1 if the relation is true,
478 or 0 if the relation is false.
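
A hedged sketch of the comparison rules: a field that looks numeric is a numeric string and compares numerically against a number, while the same text forced to a plain string compares as a string. The function name compare_demo and the input are illustrative only.

```sh
# Illustrative helper (hypothetical name): numeric versus string
# comparison of the same text against the number 9.
compare_demo() {
    printf '10 9\n' | awk '{
        print ($1 < 9)      # numeric string vs number: numeric comparison
        a = $1 ""           # concatenation yields a plain string
        print (a < 9)       # string vs number: string comparison
    }'
}
compare_demo
```

In the string comparison, "10" collates before "9" because the comparison proceeds character by character.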
479
480 Variables and Special Variables
481 Variables can be used in an awk program by referencing them. With the
482 exception of function parameters (see User-Defined Functions ), they
483 are not explicitly declared. Function parameter names shall be local to
484 the function; all other variable names shall be global. The same name
485 shall not be used as both a function parameter name and as the name of
486 a function or a special awk variable. The same name shall not be used
487 both as a variable name with global scope and as the name of a func‐
488 tion. The same name shall not be used within the same scope both as a
489 scalar variable and as an array. Uninitialized variables, including
490 scalar variables, array elements, and field variables, shall have an
491 uninitialized value. An uninitialized value shall have both a numeric
492 value of zero and a string value of the empty string. Evaluation of
493 variables with an uninitialized value, to either string or numeric,
494 shall be determined by the context in which they are used.
495
496 Field variables shall be designated by a '$' followed by a number or
497 numerical expression. The effect of the field number expression evalu‐
498 ating to anything other than a non-negative integer is unspecified;
499 uninitialized variables or string values need not be converted to
500 numeric values in this context. New field variables can be created by
501 assigning a value to them. References to nonexistent fields (that is,
502 fields after $NF), shall evaluate to the uninitialized value. Such ref‐
503 erences shall not create new fields. However, assigning to a nonexis‐
504 tent field (for example, $(NF+2)=5) shall increase the value of NF;
505 create any intervening fields with the uninitialized value; and cause
506 the value of $0 to be recomputed, with the fields being separated by
507 the value of OFS. Each field variable shall have a string value or an
508 uninitialized value when created. Field variables shall have the
509 uninitialized value when created from $0 using FS and the variable does
510 not contain any characters. If appropriate, the field variable shall be
511 considered a numeric string (see Expressions in awk ).
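
The effect of referencing and assigning nonexistent fields can be sketched as follows. The function name nf_demo is illustrative, and OFS is set to "-" only to make the recomputed $0 easy to read.

```sh
# Illustrative helper (hypothetical name): reading past NF leaves NF
# alone; assigning past NF grows NF and rebuilds $0 using OFS.
nf_demo() {
    printf 'a b\n' | awk 'BEGIN { OFS = "-" } {
        x = $5              # reference only: no new fields are created
        print NF
        $(NF + 2) = "d"     # creates $3 (empty) and $4
        print NF
        print $0
    }'
}
nf_demo
```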
512
513 Implementations shall support the following other special variables
514 that are set by awk:
515
516 ARGC The number of elements in the ARGV array.
517
518 ARGV An array of command line arguments, excluding options and the
519 program argument, numbered from zero to ARGC-1.
520
521 The arguments in ARGV can be modified or added to; ARGC can be altered.
522 As each input file ends, awk shall treat the next non-null element of
523 ARGV, up to the current value of ARGC-1, inclusive, as the name of the
524 next input file. Thus, setting an element of ARGV to null means that it
525 shall not be treated as an input file. The name '-' indicates the stan‐
526 dard input. If an argument matches the format of an assignment operand,
527 this argument shall be treated as an assignment rather than a file
528 argument.
529
530 CONVFMT
531 The printf format for converting numbers to strings (except for
532 output statements, where OFMT is used); "%.6g" by default.
533
534 ENVIRON
535 An array representing the value of the environment, as described
536 in the exec functions defined in the System Interfaces volume of
537 IEEE Std 1003.1-2001. The indices of the array shall be strings
538 consisting of the names of the environment variables, and the
539 value of each array element shall be a string consisting of the
540 value of that variable. If appropriate, the environment variable
541 shall be considered a numeric string (see Expressions in awk );
542 the array element shall also have its numeric value.
543
544 In all cases where the behavior of awk is affected by environment vari‐
545 ables (including the environment of any commands that awk executes via
546 the system function or via pipeline redirections with the print state‐
547 ment, the printf statement, or the getline function), the environment
548 used shall be the environment at the time awk began executing; it is
549 implementation-defined whether any modification of ENVIRON affects this
550 environment.
551
552 FILENAME
553 A pathname of the current input file. Inside a BEGIN action the
554 value is undefined. Inside an END action the value shall be the
555 name of the last input file processed.
556
557 FNR The ordinal number of the current record in the current file.
558 Inside a BEGIN action the value shall be zero. Inside an END
559 action the value shall be the number of the last record pro‐
560 cessed in the last file processed.
561
562 FS Input field separator regular expression; a <space> by default.
563
564 NF The number of fields in the current record. Inside a BEGIN
565 action, the use of NF is undefined unless a getline function
566 without a var argument is executed previously. Inside an END
567 action, NF shall retain the value it had for the last record
568 read, unless a subsequent, redirected, getline function without
569 a var argument is performed prior to entering the END action.
570
571 NR The ordinal number of the current record from the start of
572 input. Inside a BEGIN action the value shall be zero. Inside an
573 END action the value shall be the number of the last record pro‐
574 cessed.
575
576 OFMT The printf format for converting numbers to strings in output
577 statements (see Output Statements ); "%.6g" by default. The
578 result of the conversion is unspecified if the value of OFMT is
579 not a floating-point format specification.
580
OFS The print statement output field separator; <space> by default.
582
583 ORS The print statement output record separator; a <newline> by
584 default.
585
586 RLENGTH
587 The length of the string matched by the match function.
588
RS The first character of the string value of RS shall be the input
record separator; a <newline> by default. If RS contains more
than one character, the results are unspecified. If RS is null,
then records are separated by sequences consisting of a
<newline> plus one or more blank lines; leading or trailing blank
lines shall not result in empty records at the beginning or end
of the input; and a <newline> shall always be a field separator,
no matter what the value of FS is.
597
598 RSTART The starting position of the string matched by the match func‐
599 tion, numbering from 1. This shall always be equivalent to the
600 return value of the match function.
601
602 SUBSEP The subscript separator string for multi-dimensional arrays; the
603 default value is implementation-defined.
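
Several of the special variables above can be sketched together. The function name special_vars_demo, the operand names first and second, and the sample input are illustrative assumptions; the operands are never opened because the first program consists only of a BEGIN action.

```sh
# Illustrative helper (hypothetical name): ARGV/ARGC, RSTART/RLENGTH,
# and a null RS (records separated by blank lines).
special_vars_demo() {
    # ARGV holds the operands; index 0 is skipped because its value
    # depends on how awk was invoked.
    awk 'BEGIN { for (i = 1; i < ARGC; i++) print i ": " ARGV[i] }' first second

    # match sets RSTART and RLENGTH for the matched substring.
    awk 'BEGIN { if (match("foobar", /o+/)) print RSTART, RLENGTH }'

    # With RS null, <newline> also separates fields within a record.
    printf 'a b\nc\n\n\nd\n' |
        awk 'BEGIN { RS = "" } { print "record " NR " has " NF " fields" }'
}
special_vars_demo
```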
604
605
606 Regular Expressions
607 The awk utility shall make use of the extended regular expression nota‐
608 tion (see the Base Definitions volume of IEEE Std 1003.1-2001, Section
609 9.4, Extended Regular Expressions) except that it shall allow the use
610 of C-language conventions for escaping special characters within the
611 EREs, as specified in the table in the Base Definitions volume of
612 IEEE Std 1003.1-2001, Chapter 5, File Format Notation ( '\\', '\a',
613 '\b', '\f', '\n', '\r', '\t', '\v' ) and the following table; these
614 escape sequences shall be recognized both inside and outside bracket
615 expressions. Note that records need not be separated by <newline>s and
616 string constants can contain <newline>s, so even the "\n" sequence is
617 valid in awk EREs. Using a slash character within an ERE requires the
618 escaping shown in the following table.
619
Table: Escape Sequences in awk

Escape
Sequence  Description                      Meaning
\"        Backslash quotation-mark         Quotation-mark character
\/        Backslash slash                  Slash character
\ddd      A backslash character followed   The character whose encoding
          by the longest sequence of       is represented by the one,
          one, two, or three octal-digit   two, or three-digit octal
          characters (01234567). If all    integer. Multi-byte characters
          of the digits are 0 (that is,    require multiple, concatenated
          representation of the NUL        escape sequences of this type,
          character), the behavior is      including the leading '\' for
          undefined.                       each byte.
\c        A backslash character followed   Undefined
          by any character not described
          in this table or in the table
          in the Base Definitions volume
          of IEEE Std 1003.1-2001,
          Chapter 5, File Format Notation
          ('\\', '\a', '\b', '\f', '\n',
          '\r', '\t', '\v').
642
643 A regular expression can be matched against a specific field or string
644 by using one of the two regular expression matching operators, '~' and
645 "!~" . These operators shall interpret their right-hand operand as a
646 regular expression and their left-hand operand as a string. If the reg‐
647 ular expression matches the string, the '~' expression shall evaluate
648 to a value of 1, and the "!~" expression shall evaluate to a value of
649 0. (The regular expression matching operation is as defined by the term
650 matched in the Base Definitions volume of IEEE Std 1003.1-2001, Section
651 9.1, Regular Expression Definitions, where a match occurs on any part
652 of the string unless the regular expression is limited with the circum‐
653 flex or dollar sign special characters.) If the regular expression does
654 not match the string, the '~' expression shall evaluate to a value of
655 0, and the "!~" expression shall evaluate to a value of 1. If the
656 right-hand operand is any expression other than the lexical token ERE,
657 the string value of the expression shall be interpreted as an extended
658 regular expression, including the escape conventions described above.
659 Note that these same escape conventions shall also be applied in deter‐
660 mining the value of a string literal (the lexical token STRING), and
661 thus shall be applied a second time when a string literal is used in
662 this context.
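
The double application of escape conventions can be sketched by matching a literal period once with an ERE token and once with a string. The function name regex_demo and the input lines are illustrative only.

```sh
# Illustrative helper (hypothetical name): in a string used as a
# regular expression the backslash must be doubled, because the
# string literal is unescaped once before the ERE is interpreted.
regex_demo() {
    printf 'a.b\naxb\n' | awk '
        $0 ~ /a\.b/   { print "ere token: " $0 }
        $0 ~ "a\\.b"  { print "string: " $0 }
    '
}
regex_demo
```

Only the line containing a literal period is expected to match in either form.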
663
When an ERE token appears as an expression in any context other than as
the right-hand side of the '~' or "!~" operator or as one of the built-in
function arguments described below, the value of the resulting
expression shall be the equivalent of:
668
669
670 $0 ~ /ere/
671
The ere argument to the gsub, match, and sub functions, and the fs argument
673 to the split function (see String Functions ) shall be interpreted as
674 extended regular expressions. These can be either ERE tokens or arbi‐
675 trary expressions, and shall be interpreted in the same manner as the
676 right-hand side of the '~' or "!~" operator.
677
678 An extended regular expression can be used to separate fields by using
679 the -F ERE option or by assigning a string containing the expression to
680 the built-in variable FS. The default value of the FS variable shall be
681 a single <space>. The following describes FS behavior:
682
683 1. If FS is a null string, the behavior is unspecified.
684
685 2. If FS is a single character:
686
687 a. If FS is <space>, skip leading and trailing <blank>s; fields
688 shall be delimited by sets of one or more <blank>s.
689
690 b. Otherwise, if FS is any other character c, fields shall be
691 delimited by each single occurrence of c.
692
693 3. Otherwise, the string value of FS shall be considered to be an
694 extended regular expression. Each occurrence of a sequence matching
695 the extended regular expression shall delimit fields.
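
The three FS cases that have specified behavior can be sketched as follows. The function name fs_demo and the input lines are illustrative assumptions.

```sh
# Illustrative helper (hypothetical name): default <space> FS, a
# single-character FS, and an ERE FS.
fs_demo() {
    # FS is <space>: leading and trailing blanks are skipped.
    printf '  a   b  \n' | awk '{ print NF }'
    # FS is a single other character: each occurrence delimits a field,
    # so the empty string between the two colons is a field.
    printf 'a::b\n' | awk -F: '{ print NF }'
    # FS is an ERE: each matching sequence delimits fields.
    printf 'a1b22c\n' | awk -F'[0-9]+' '{ print $1 $2 $3 }'
}
fs_demo
```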
696
697 Except for the '~' and "!~" operators, and in the gsub, match, split,
698 and sub built-in functions, ERE matching shall be based on input
699 records; that is, record separator characters (the first character of
700 the value of the variable RS, <newline> by default) cannot be embedded
701 in the expression, and no expression shall match the record separator
702 character. If the record separator is not <newline>, <newline>s embed‐
703 ded in the expression can be matched. For the '~' and "!~" operators,
704 and in those four built-in functions, ERE matching shall be based on
705 text strings; that is, any character (including <newline> and the
706 record separator) can be embedded in the pattern, and an appropriate
707 pattern shall match any character. However, in all awk ERE matching,
708 the use of one or more NUL characters in the pattern, input record, or
709 text string produces undefined results.
710
711 Patterns
712 A pattern is any valid expression, a range specified by two expressions
713 separated by a comma, or one of the two special patterns BEGIN or END.
714
715 Special Patterns
716 The awk utility shall recognize two special patterns, BEGIN and END.
717 Each BEGIN pattern shall be matched once and its associated action
718 executed before the first record of input is read (except possibly by
719 use of the getline function, described in Input/Output and General
720 Functions, in a prior BEGIN action) and before command line assignment is done. Each
721 END pattern shall be matched once and its associated action executed
722 after the last record of input has been read. These two patterns shall
723 have associated actions.
724
725 BEGIN and END shall not combine with other patterns. Multiple BEGIN and
726 END patterns shall be allowed. The actions associated with the BEGIN
727 patterns shall be executed in the order specified in the program, as
728 are the END actions. An END pattern can precede a BEGIN pattern in a
729 program.
730
731 If an awk program consists of only actions with the pattern BEGIN, and
732 the BEGIN action contains no getline function, awk shall exit without
733 reading its input when the last statement in the last BEGIN action is
734 executed. If an awk program consists of only actions with the pattern
735 END or only actions with the patterns BEGIN and END, the input shall be
736 read before the statements in the END actions are executed.
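As a brief illustration of these rules (function names and input are made up for the example): a program with only BEGIN actions finishes without consuming input, and BEGIN actions run before END actions regardless of their order in the source.

```shell
# BEGIN-only program: awk exits without reading its input.
begin_only() {
    awk 'BEGIN { print "start" }' < /dev/null
}
# END is written first here, but the BEGIN action still runs first.
# In the END action, NR holds the number of records that were read.
begin_end() {
    printf 'x\ny\nz\n' | awk 'END { print "records: " NR } BEGIN { print "header" }'
}
begin_only
begin_end
```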
737
738 Expression Patterns
739 An expression pattern shall be evaluated as if it were an expression in
740 a Boolean context. If the result is true, the pattern shall be consid‐
741 ered to match, and the associated action (if any) shall be executed. If
742 the result is false, the action shall not be executed.
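A small sketch of expression patterns follows; the input data and the function name expr_patterns are assumptions for illustration. The first pattern has no action, so a matching record is simply printed.

```shell
expr_patterns() {
    printf '1 apple\n0 pear\n5 fig\n' | awk '
        $1                                  # true when the first field is non-zero
        $2 ~ /^p/ { print "p-word: " $2 }   # true when the second field starts with p
    '
}
expr_patterns
```

The record "0 pear" is not printed by the first pattern because its first field has the numeric value zero, which is false in a Boolean context.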
743
744 Pattern Ranges
745 A pattern range consists of two expressions separated by a comma; in
746 this case, the action shall be performed for all records between a
747 match of the first expression and the following match of the second
748 expression, inclusive. At this point, the pattern range can be repeated
749 starting at input records subsequent to the end of the matched range.
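The following sketch shows a range that matches twice in one input; the START and END markers are arbitrary sample text, not keywords.

```shell
range_demo() {
    printf 'a\nSTART\nb\nEND\nc\nSTART\nd\nEND\ne\n' |
        awk '/START/, /END/ { print NR ": " $0 }'
}
range_demo
```

Records 2 to 4 and 6 to 8 are printed; records 1, 5, and 9 fall outside both ranges.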
750
751 Actions
752 An action is a sequence of statements as shown in the grammar in
753 Grammar. Any single statement can be replaced by a statement list enclosed
754 in braces. The application shall ensure that statements in a statement
755 list are separated by <newline>s or semicolons. Statements in a state‐
756 ment list shall be executed sequentially in the order that they appear.
757
758 The expression acting as the conditional in an if statement shall be
759 evaluated and if it is non-zero or non-null, the following statement
760 shall be executed; otherwise, if else is present, the statement follow‐
761 ing the else shall be executed.
762
763 The if, while, do ... while, for, break, and continue statements are
764 based on the ISO C standard (see Concepts Derived from the ISO C
765 Standard), except that the Boolean expressions shall be treated as
766 described in Expressions in awk, and except in the case of:
767
768
769 for (variable in array)
770
771 which shall iterate, assigning each index of array to variable in an
772 unspecified order. The results of adding new elements to array within
773 such a for loop are undefined. If a break or continue statement occurs
774 outside of a loop, the behavior is undefined.
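Because the iteration order is unspecified, a portable program should not depend on it. The sketch below (names and data are illustrative) accumulates results inside the loop and prints only after it, so the output is the same whatever order the implementation chooses.

```shell
for_in_demo() {
    printf 'a 1\nb 2\na 3\n' | awk '
        { total[$1] += $2 }
        END {
            # Count the keys and add up the values; neither depends on order.
            for (key in total) { n++; sum += total[key] }
            print n " keys, sum " sum
        }
    '
}
for_in_demo
```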
775
776 The delete statement shall remove an individual array element. Thus,
777 the following code deletes an entire array:
778
779
780 for (i in array)
781     delete array[i]
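A runnable sketch of both uses of delete, with hypothetical names (arr, delete_demo): one element is removed first, then the remaining elements are removed with the loop form shown above.

```shell
delete_demo() {
    awk 'BEGIN {
        split("x y z", arr, " ")
        delete arr[2]                      # remove a single element
        print ((2 in arr) ? "present" : "deleted")
        for (i in arr) delete arr[i]       # remove whatever is left
        n = 0
        for (i in arr) n++
        print n " elements left"
    }'
}
delete_demo
```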
782
783 The next statement shall cause all further processing of the current
784 input record to be abandoned. The behavior is undefined if a next
785 statement appears or is invoked in a BEGIN or END action.
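A common use of next is to discard records early so that later rules never see them. In this sketch (sample input assumed), comment lines and empty lines are skipped.

```shell
next_demo() {
    printf '# comment\ndata 1\n\ndata 2\n' | awk '
        /^#/ || NF == 0 { next }     # skip comments and blank lines
        { print "row: " $0 }
    '
}
next_demo
```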
786
787 The exit statement shall invoke all END actions in the order in which
788 they occur in the program source and then terminate the program without
789 reading further input. An exit statement inside an END action shall
790 terminate the program without further execution of END actions. If an
791 expression is specified in an exit statement, its numeric value shall
792 be the exit status of awk, unless subsequent errors are encountered or
793 a subsequent exit statement with an expression is executed.
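The sketch below illustrates that exit stops the reading of input, still runs the END actions, and sets the exit status. The value 3 and the input are arbitrary choices for the example.

```shell
exit_demo() {
    printf 'ok\nbad\nnever seen\n' | awk '
        $1 == "bad" { exit 3 }       # stop reading input; END actions still run
        { print "line: " $0 }
        END { print "END ran after " NR " records" }
    ' || echo "status: $?"
}
exit_demo
```

The third record is never processed, and the shell sees the exit status 3 given in the exit statement.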
794
795 Output Statements
796 Both print and printf statements shall write to standard output by
797 default. The output shall be written to the location specified by out‐
798 put_redirection if one is supplied, as follows:
799
800
801 > expression    >> expression    | expression
802
803 In all cases, the expression shall be evaluated to produce a string
804 that is used as a pathname into which to write (for '>' or ">>" ) or as
805 a command to be executed (for '|' ). Using the first two forms, if the
806 file of that name is not currently open, it shall be opened, creating
807 it if necessary and using the first form, truncating the file. The out‐
808 put then shall be appended to the file. As long as the file remains
809 open, subsequent calls in which expression evaluates to the same string
810 value shall simply append output to the file. The file remains open
811 until the close function (see Input/Output and General Functions) is
812 called with an expression that evaluates to the same string value.
813
814 The third form shall write output onto a stream piped to the input of a
815 command. The stream shall be created if no stream is currently open
816 with the value of expression as its command name. The stream created
817 shall be equivalent to one created by a call to the popen() function
818 defined in the System Interfaces volume of IEEE Std 1003.1-2001 with
819 the value of expression as the command argument and a value of w as the
820 mode argument. As long as the stream remains open, subsequent calls in
821 which expression evaluates to the same string value shall write output
822 to the existing stream. The stream shall remain open until the close
823 function (see Input/Output and General Functions) is called with an
824 expression that evaluates to the same string value. At that time, the
825 stream shall be closed as if by a call to the pclose() function defined
826 in the System Interfaces volume of IEEE Std 1003.1-2001.
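All three redirection forms are exercised in the following sketch. It assumes the mktemp utility is available (it is not required by this standard) and uses sort as the piped command; the variable names out and cmd are illustrative.

```shell
redirect_demo() {
    tmp=$(mktemp) || return 1
    awk -v out="$tmp" 'BEGIN {
        print "first"  > out     # opens the file named by out, truncating it
        print "second" > out     # same string value: appended to the open file
        close(out)
        print "third" >> out     # reopened for appending, not truncated
        close(out)
        cmd = "sort"
        print "b" | cmd          # starts one sort process
        print "a" | cmd          # same string value: same stream
        close(cmd)               # sort reaches end-of-file and writes its output
    }'
    cat "$tmp"
    rm -f "$tmp"
}
redirect_demo
```

The sorted lines appear first because close waits for the command to finish before awk continues; the file contents are printed afterwards by cat.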
827
828 As described in detail by the grammar in Grammar, these output
829 statements shall take a comma-separated list of expressions referred to in
830 the grammar by the non-terminal symbols expr_list, print_expr_list, or
831 print_expr_list_opt. This list is referred to here as the expression
832 list, and each member is referred to as an expression argument.
833
834 The print statement shall write the value of each expression argument
835 onto the indicated output stream separated by the current output field
836 separator (see variable OFS above), and terminated by the output record
837 separator (see variable ORS above). All expression arguments shall be
838 taken as strings, being converted if necessary; this conversion shall
839 be as described in Expressions in awk, with the exception that the
840 printf format in OFMT shall be used instead of the value in CONVFMT. An
841 empty expression list shall stand for the whole input record ($0).
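The roles of OFS, ORS, OFMT, and CONVFMT in print can be seen in this sketch; the separator values are arbitrary and chosen only to be visible in the output.

```shell
print_demo() {
    awk 'BEGIN {
        OFS = "-"; ORS = "|\n"
        x = 3.14159265
        print "a", "b"       # arguments joined by OFS, terminated by ORS
        OFMT = "%.2f"
        print x, 17          # non-integer value converted with OFMT
        print x ""           # concatenation converts with CONVFMT instead
    }'
}
print_demo
```

With the default CONVFMT of "%.6g", the last statement prints 3.14159, while the statement before it prints 3.14 because print itself applies OFMT. The integer 17 is printed as an integer in both cases.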
842
843 The printf statement shall produce output based on a notation similar
844 to the File Format Notation used to describe file formats in this vol‐
845 ume of IEEE Std 1003.1-2001 (see the Base Definitions volume of
846 IEEE Std 1003.1-2001, Chapter 5, File Format Notation). Output shall
847 be produced as specified with the first expression argument as the
848 string format and subsequent expression arguments as the strings arg1
849 to argn, inclusive, with the following exceptions:
850
851 1. The format shall be an actual character string rather than a graph‐
852 ical representation. Therefore, it cannot contain empty character
853 positions. The <space> in the format string, in any context other
854 than a flag of a conversion specification, shall be treated as an
855 ordinary character that is copied to the output.
856
857 2. If the character set contains a 'Δ' character and that character
858 appears in the format string, it shall be treated as an ordinary
859 character that is copied to the output.
860
861 3. The escape sequences beginning with a backslash character shall be
862 treated as sequences of ordinary characters that are copied to the
863 output. Note that these same sequences shall be interpreted lexi‐
864 cally by awk when they appear in literal strings, but they shall
865 not be treated specially by the printf statement.
866
867 4. A field width or precision can be specified as the '*' character
868 instead of a digit string. In this case the next argument from the
869 expression list shall be fetched and its numeric value taken as the
870 field width or precision.
871
872 5. The implementation shall not precede or follow output from the d or
873 u conversion specifier characters with <blank>s not specified by
874 the format string.
875
876 6. The implementation shall not precede output from the o conversion
877 specifier character with leading zeros not specified by the format
878 string.
879
880 7. For the c conversion specifier character: if the argument has a
881 numeric value, the character whose encoding is that value shall be
882 output. If the value is zero or is not the encoding of any charac‐
883 ter in the character set, the behavior is undefined. If the argu‐
884 ment does not have a numeric value, the first character of the
885 string value shall be output; if the string does not contain any
886 characters, the behavior is undefined.
887
888 8. For each conversion specification that consumes an argument, the
889 next expression argument shall be evaluated. With the exception of
890 the c conversion specifier character, the value shall be converted
891 (according to the rules specified in Expressions in awk) to the
892 appropriate type for the conversion specification.
893
894 9. If there are insufficient expression arguments to satisfy all the
895 conversion specifications in the format string, the behavior is
896 undefined.
897
898 10. If any character sequence in the format string begins with a '%'
899 character, but does not form a valid conversion specification, the
900 behavior is unspecified.
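Several of the rules above are shown in the following sketch. The values are arbitrary, and the %c line assumes a character set in which the value 65 encodes the letter A, as in ASCII.

```shell
printf_demo() {
    awk 'BEGIN {
        printf "[%5s][%-5s]\n", "ab", "cd"            # widths given as digit strings
        printf "[%*d][%.*f]\n", 4, 42, 2, 3.14159     # width and precision taken from arguments
        printf "%c%c\n", 65, "hello"                  # numeric value, then first character of a string
        printf "%d %o %x\n", 255, 8, 255
    }'
}
printf_demo
```

Note that printf, unlike print, adds no output record separator; each format string here ends with an explicit \n that awk converts to a <newline> when it reads the string literal.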
901
902 Both print and printf can output at least {LINE_MAX} bytes.
903
904 Functions
905 The awk language has a variety of built-in functions: arithmetic,
906 string, input/output, and general.
907
908 Arithmetic Functions
909 The arithmetic functions, except for int, shall be based on the ISO C
910 standard (see Concepts Derived from the ISO C Standard). The behavior
911 is undefined in cases where the ISO C standard specifies that an error
912 be returned or that the behavior is undefined. Although the grammar
913 (see Grammar) permits built-in functions to appear with no arguments
914 or parentheses, unless the argument or parentheses are indicated as
915 optional in the following list (by displaying them within the "[]"
916 brackets), such use is undefined.
917
918 atan2(y,x)
919 Return arctangent of y/x in radians in the range [-pi,pi].
920
921 cos(x) Return cosine of x, where x is in radians.
922
923 sin(x) Return sine of x, where x is in radians.
924
925 exp(x) Return the exponential function of x.
926
927 log(x) Return the natural logarithm of x.
928
929 sqrt(x)
930 Return the square root of x.
931
932 int(x) Return the argument truncated to an integer. Truncation shall be
933 toward 0 when x>0.
934
935 rand() Return a random number n, such that 0<=n<1.
936
937 srand([expr])
938 Set the seed value for rand to expr or use the time of day if
939 expr is omitted. The previous seed value shall be returned.
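A short sketch of the arithmetic functions follows. The seed value 1 is arbitrary; the example relies only on the fact that seeding twice with the same value restarts the same sequence, not on any particular random value.

```shell
arith_demo() {
    awk 'BEGIN {
        print int(3.9), int("42abc")      # truncation; the string is converted to a number first
        printf "%.4f\n", atan2(0, -1)     # pi
        printf "%.4f %.4f\n", exp(1), log(exp(2))
        print sqrt(16), 2 ^ 10
        srand(1); a = rand()
        srand(1); b = rand()              # same seed, same sequence
        same = (a == b)
        inrange = (a >= 0 && a < 1)
        print same, inrange
    }'
}
arith_demo
```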
940
941
942 String Functions
943 The string functions in the following list shall be supported. Although
944 the grammar (see Grammar) permits built-in functions to appear with no
945 arguments or parentheses, unless the argument or parentheses are indi‐
946 cated as optional in the following list (by displaying them within the
947 "[]" brackets), such use is undefined.
948
949 gsub(ere, repl[, in])
950 Behave like sub (see below), except that it shall replace all
951 occurrences of the regular expression (like the ed utility
952 global substitute) in $0 or in the in argument, when specified.
953
954 index(s, t)
955 Return the position, in characters, numbering from 1, in string
956 s where string t first occurs, or zero if it does not occur at
957 all.
958
959 length[([s])]
960 Return the length, in characters, of its argument taken as a
961 string, or of the whole record, $0, if there is no argument.
962
963 match(s, ere)
964 Return the position, in characters, numbering from 1, in string
965 s where the extended regular expression ere occurs, or zero if
966 it does not occur at all. RSTART shall be set to the starting
967 position (which is the same as the returned value), zero if no
968 match is found; RLENGTH shall be set to the length of the
969 matched string, -1 if no match is found.
970
971 split(s, a[, fs ])
972 Split the string s into array elements a[1], a[2], ..., a[n],
973 and return n. All elements of the array shall be deleted before
974 the split is performed. The separation shall be done with the
975 ERE fs or with the field separator FS if fs is not given. Each
976 array element shall have a string value when created and, if
977 appropriate, the array element shall be considered a numeric
978 string (see Expressions in awk). The effect of a null string as
979 the value of fs is unspecified.
980
981 sprintf(fmt, expr, expr, ...)
982 Format the expressions according to the printf format given by
983 fmt and return the resulting string.
984
985 sub(ere, repl[, in ])
986 Substitute the string repl in place of the first instance of the
987 extended regular expression ERE in string in and return the num‐
988 ber of substitutions. An ampersand ( '&' ) appearing in the
989 string repl shall be replaced by the string from in that matches
990 the ERE. An ampersand preceded with a backslash ( '\' ) shall be
991 interpreted as the literal ampersand character. An occurrence of
992 two consecutive backslashes shall be interpreted as just a sin‐
993 gle literal backslash character. Any other occurrence of a back‐
994 slash (for example, preceding any other character) shall be
995 treated as a literal backslash character. Note that if repl is a
996 string literal (the lexical token STRING; see Grammar ), the
997 handling of the ampersand character occurs after any lexical
998 processing, including any lexical backslash escape sequence pro‐
999 cessing. If in is specified and it is not an lvalue (see Expres‐
1000 sions in awk ), the behavior is undefined. If in is omitted, awk
1001 shall use the current record ($0) in its place.
1002
1003 substr(s, m[, n ])
1004 Return the at most n-character substring of s that begins at
1005 position m, numbering from 1. If n is omitted, or if n specifies
1006 more characters than are left in the string, the length of the
1007 substring shall be limited by the length of the string s.
1008
1009 tolower(s)
1010 Return a string based on the string s. Each character in s that
1011 is an uppercase letter specified to have a tolower mapping by
1012 the LC_CTYPE category of the current locale shall be replaced in
1013 the returned string by the lowercase letter specified by the
1014 mapping. Other characters in s shall be unchanged in the
1015 returned string.
1016
1017 toupper(s)
1018 Return a string based on the string s. Each character in s that
1019 is a lowercase letter specified to have a toupper mapping by the
1020 LC_CTYPE category of the current locale is replaced in the
1021 returned string by the uppercase letter specified by the map‐
1022 ping. Other characters in s are unchanged in the returned
1023 string.
1024
1025
1026 All of the preceding functions that take ERE as a parameter expect a
1027 pattern or a string valued expression that is a regular expression as
1028 defined in Regular Expressions.
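The string functions can be tried together in a sketch such as the following; all strings and variable names are sample values chosen for the example.

```shell
string_demo() {
    awk 'BEGIN {
        s = "hello world"
        print length(s), index(s, "o"), substr(s, 7), substr(s, 1, 5)
        print toupper(s), tolower("ABC")

        n = split("2024-01-15", part, "-")
        print n, part[1], part[3]

        if (match(s, /o+ w/)) print RSTART, RLENGTH
        match(s, /xyz/); print RSTART, RLENGTH     # no match: 0 and -1

        t = s
        n = sub(/o/, "[&]", t); print n, t         # & stands for the matched text
        n = gsub(/o/, "0", t);  print n, t         # every occurrence is replaced
        u = "a-b"
        gsub(/-/, "\\&", u); print u               # backslash-ampersand yields a literal &
    }'
}
string_demo
```

In the last call the string literal "\\&" is reduced to \& by lexical processing, and sub and gsub then treat that as a literal ampersand rather than as the matched text.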
1029
1030 Input/Output and General Functions
1031 The input/output and general functions are:
1032
1033 close(expression)
1034 Close the file or pipe opened by a print or printf statement or
1035 a call to getline with the same string-valued expression. The
1036 limit on the number of open expression arguments is implementa‐
1037 tion-defined. If the close was successful, the function shall
1038 return zero; otherwise, it shall return non-zero.
1039
1040 expression | getline [var]
1041 Read a record of input from a stream piped from the output of a
1042 command. The stream shall be created if no stream is currently
1043 open with the value of expression as its command name. The
1044 stream created shall be equivalent to one created by a call to
1045 the popen() function with the value of expression as the command
1046 argument and a value of r as the mode argument. As long as the
1047 stream remains open, subsequent calls in which expression evalu‐
1048 ates to the same string value shall read subsequent records from
1049 the stream. The stream shall remain open until the close func‐
1050 tion is called with an expression that evaluates to the same
1051 string value. At that time, the stream shall be closed as if by
1052 a call to the pclose() function. If var is omitted, $0 and NF
1053 shall be set; otherwise, var shall be set and, if appropriate,
1054 it shall be considered a numeric string (see Expressions in
1055 awk).
1056
1057 The getline operator can form ambiguous constructs when there are
1058 unparenthesized operators (including concatenate) to the left of the
1059 '|' (to the beginning of the expression containing getline). In the
1060 context of the '$' operator, '|' shall behave as if it had a lower
1061 precedence than '$' . The result of evaluating other operators is
1062 unspecified, and conforming applications shall parenthesize properly
1063 all such usages.
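A minimal sketch of reading from a command, with the getline expression parenthesized as recommended above. The commands are simple echo invocations chosen for the example.

```shell
pipe_getline_demo() {
    awk 'BEGIN {
        cmd = "echo one; echo two"
        while ((cmd | getline line) > 0)    # one record per call until end-of-file
            print "got: " line
        close(cmd)
        "echo a b c" | getline              # no var: sets $0 and NF
        print NF, $2
    }'
}
pipe_getline_demo
```

Storing the command in a variable keeps the string passed to close identical to the one used with getline.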
1064
1065 getline
1066 Set $0 to the next input record from the current input file.
1067 This form of getline shall set the NF, NR, and FNR variables.
1068
1069 getline var
1070 Set variable var to the next input record from the current input
1071 file and, if appropriate, var shall be considered a numeric
1072 string (see Expressions in awk). This form of getline shall set
1073 the FNR and NR variables.
1074
1075 getline [var] < expression
1076 Read the next record of input from a named file. The expression
1077 shall be evaluated to produce a string that is used as a path‐
1078 name. If the file of that name is not currently open, it shall
1079 be opened. As long as the stream remains open, subsequent calls
1080 in which expression evaluates to the same string value shall
1081 read subsequent records from the file. The file shall remain
1082 open until the close function is called with an expression that
1083 evaluates to the same string value. If var is omitted, $0 and NF
1084 shall be set; otherwise, var shall be set and, if appropriate,
1085 it shall be considered a numeric string (see Expressions in
1086 awk).
1087
1088 The getline operator can form ambiguous constructs when there are
1089 unparenthesized binary operators (including concatenate) to the right
1090 of the '<' (up to the end of the expression containing the getline).
1091 The result of evaluating such a construct is unspecified, and conform‐
1092 ing applications shall parenthesize properly all such usages.
1093
1094 system(expression)
1095 Execute the command given by expression in a manner equivalent
1096 to the system() function defined in the System Interfaces volume
1097 of IEEE Std 1003.1-2001 and return the exit status of the com‐
1098 mand.
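As a small illustration of system (the commands are arbitrary examples), the return value is the exit status of the command:

```shell
system_demo() {
    awk 'BEGIN {
        status = system("exit 7")      # the command line is run by the shell
        print "status " status
        if (system("test -d /") == 0)
            print "/ is a directory"
    }'
}
system_demo
```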
1099
1100
1101 All forms of getline shall return 1 for successful input, zero for end-
1102 of-file, and -1 for an error.
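The three return values can be observed with the file form of getline. This sketch assumes the mktemp utility is available (it is not required by this standard); the variable names file and missing are illustrative.

```shell
getline_file_demo() {
    tmp=$(mktemp) || return 1
    printf 'alpha\nbeta\n' > "$tmp"
    awk -v file="$tmp" 'BEGIN {
        while ((getline line < file) > 0)   # 1 for each record, 0 at end-of-file
            n++
        print n " records, last: " line
        close(file)
        r = (getline line < file)           # reopened: reads the first record again
        print r, line
        missing = file ".does-not-exist"
        r = (getline line < missing)        # -1: the file cannot be opened
        print r
    }'
    rm -f "$tmp"
}
getline_file_demo
```

Testing the result with > 0 rather than using it directly as a condition matters: a bare while (getline line < file) would loop forever on an unreadable file, because -1 is true in a Boolean context.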
1103
1104 Where strings are used as the name of a file or pipeline, the applica‐
1105 tion shall ensure that the strings are textually identical. The termi‐
1106 nology "same string value" implies that "equivalent strings", even
1107 those that differ only by <space>s, represent different files.
1108
1109 User-Defined Functions
1110 The awk language also provides user-defined functions. Such functions
1111 can be defined as:
1112
1113
1114 function name([parameter, ...]) { statements }
1115
1116 A function can be referred to anywhere in an awk program; in particu‐
1117 lar, its use can precede its definition. The scope of a function is
1118 global.
1119
1120 Function parameters, if present, can be either scalars or arrays; the
1121 behavior is undefined if an array name is passed as a parameter that
1122 the function uses as a scalar, or if a scalar expression is passed as a
1123 parameter that the function uses as an array. Function parameters shall
1124 be passed by value if scalar and by reference if array name.
1125
1126 The number of parameters in the function definition need not match the
1127 number of parameters in the function call. Excess formal parameters can
1128 be used as local variables. If fewer arguments are supplied in a func‐
1129 tion call than are in the function definition, the extra parameters
1130 that are used in the function body as scalars shall evaluate to the
1131 uninitialized value until they are otherwise initialized, and the extra
1132 parameters that are used in the function body as arrays shall be
1133 treated as uninitialized arrays where each element evaluates to the
1134 uninitialized value until otherwise initialized.
1135
1136 When invoking a function, no white space can be placed between the
1137 function name and the opening parenthesis. Function calls can be nested
1138 and recursive calls can be made upon functions. Upon return from any
1139 nested or recursive function call, the values of all of the calling
1140 function's parameters shall be unchanged, except for array parameters
1141 passed by reference. The return statement can be used to return a
1142 value. If a return statement appears outside of a function definition,
1143 the behavior is undefined.
1144
1145 In the function definition, <newline>s shall be optional before the
1146 opening brace and after the closing brace. Function definitions can
1147 appear anywhere in the program where a pattern-action pair is allowed.
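The rules for user-defined functions are illustrated by the sketch below. All function and variable names are hypothetical. Writing the extra formal parameters after a wider gap is only a common convention for marking them as locals; the language does not require it.

```shell
function_demo() {
    awk '
        BEGIN {
            print fact(5)                 # use can precede the definition
            fill(sq, 4)                   # sq is created as an array by the callee
            print sum(sq, 4)
            v = 10; w = bump(v); print v, w
        }
        # i and total are not supplied by callers, so they act as locals.
        function sum(arr, n,    i, total) {
            for (i = 1; i <= n; i++) total += arr[i]
            return total
        }
        function fill(arr, n,    i) {     # arrays are passed by reference
            for (i = 1; i <= n; i++) arr[i] = i * i
        }
        function bump(x) { x++; return x }    # scalars are passed by value
        function fact(k) { return (k <= 1) ? 1 : k * fact(k - 1) }
    '
}
function_demo
```

The caller's v is still 10 after bump returns, whereas the elements stored by fill are visible to the caller through sq.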
1148
1149 Grammar
1150 The grammar in this section and the lexical conventions in the follow‐
1151 ing section shall together describe the syntax for awk programs. The
1152 general conventions for this style of grammar are described in Grammar
1153 Conventions. A valid program can be represented as the non-terminal
1154 symbol program in the grammar. This formal syntax shall take precedence
1155 over the preceding text syntax description.
1156
1157
1158 %token NAME NUMBER STRING ERE
1159 %token FUNC_NAME /* Name followed by '(' without white space. */
1160
1161
1162 /* Keywords */
1163 %token Begin End
1164 /* 'BEGIN' 'END' */
1165
1166
1167 %token Break Continue Delete Do Else
1168 /* 'break' 'continue' 'delete' 'do' 'else' */
1169
1170
1171 %token Exit For Function If In
1172 /* 'exit' 'for' 'function' 'if' 'in' */
1173
1174
1175 %token Next Print Printf Return While
1176 /* 'next' 'print' 'printf' 'return' 'while' */
1177
1178
1179 /* Reserved function names */
1180 %token BUILTIN_FUNC_NAME
1181 /* One token for the following:
1182 * atan2 cos sin exp log sqrt int rand srand
1183 * gsub index length match split sprintf sub
1184 * substr tolower toupper close system
1185 */
1186 %token GETLINE
1187 /* Syntactically different from other built-ins. */
1188
1189
1190 /* Two-character tokens. */
1191 %token ADD_ASSIGN SUB_ASSIGN MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN POW_ASSIGN
1192 /* '+=' '-=' '*=' '/=' '%=' '^=' */
1193
1194
1195 %token OR AND NO_MATCH EQ LE GE NE INCR DECR APPEND
1196 /* '||' '&&' '!~' '==' '<=' '>=' '!=' '++' '--' '>>' */
1197
1198
1199 /* One-character tokens. */
1200 %token '{' '}' '(' ')' '[' ']' ',' ';' NEWLINE
1201 %token '+' '-' '*' '%' '^' '!' '>' '<' '|' '?' ':' '~' '$' '='
1202
1203
1204 %start program
1205 %%
1206
1207
1208 program : item_list
1209 | actionless_item_list
1210 ;
1211
1212
1213 item_list : newline_opt
1214 | actionless_item_list item terminator
1215 | item_list item terminator
1216 | item_list action terminator
1217 ;
1218
1219
1220 actionless_item_list : item_list pattern terminator
1221 | actionless_item_list pattern terminator
1222 ;
1223
1224
1225 item : pattern action
1226 | Function NAME '(' param_list_opt ')'
1227 newline_opt action
1228 | Function FUNC_NAME '(' param_list_opt ')'
1229 newline_opt action
1230 ;
1231
1232
1233 param_list_opt : /* empty */
1234 | param_list
1235 ;
1236
1237
1238 param_list : NAME
1239 | param_list ',' NAME
1240 ;
1241
1242
1243 pattern : Begin
1244 | End
1245 | expr
1246 | expr ',' newline_opt expr
1247 ;
1248
1249
1250 action : '{' newline_opt '}'
1251 | '{' newline_opt terminated_statement_list '}'
1252 | '{' newline_opt unterminated_statement_list '}'
1253 ;
1254
1255
1256 terminator : terminator ';'
1257 | terminator NEWLINE
1258 | ';'
1259 | NEWLINE
1260 ;
1261
1262
1263 terminated_statement_list : terminated_statement
1264 | terminated_statement_list terminated_statement
1265 ;
1266
1267
1268 unterminated_statement_list : unterminated_statement
1269 | terminated_statement_list unterminated_statement
1270 ;
1271
1272
1273 terminated_statement : action newline_opt
1274 | If '(' expr ')' newline_opt terminated_statement
1275 | If '(' expr ')' newline_opt terminated_statement
1276 Else newline_opt terminated_statement
1277 | While '(' expr ')' newline_opt terminated_statement
1278 | For '(' simple_statement_opt ';'
1279 expr_opt ';' simple_statement_opt ')' newline_opt
1280 terminated_statement
1281 | For '(' NAME In NAME ')' newline_opt
1282 terminated_statement
1283 | ';' newline_opt
1284 | terminatable_statement NEWLINE newline_opt
1285 | terminatable_statement ';' newline_opt
1286 ;
1287
1288
1289 unterminated_statement : terminatable_statement
1290 | If '(' expr ')' newline_opt unterminated_statement
1291 | If '(' expr ')' newline_opt terminated_statement
1292 Else newline_opt unterminated_statement
1293 | While '(' expr ')' newline_opt unterminated_statement
1294 | For '(' simple_statement_opt ';'
1295 expr_opt ';' simple_statement_opt ')' newline_opt
1296 unterminated_statement
1297 | For '(' NAME In NAME ')' newline_opt
1298 unterminated_statement
1299 ;
1300
1301
1302 terminatable_statement : simple_statement
1303 | Break
1304 | Continue
1305 | Next
1306 | Exit expr_opt
1307 | Return expr_opt
1308 | Do newline_opt terminated_statement While '(' expr ')'
1309 ;
1310
1311
1312 simple_statement_opt : /* empty */
1313 | simple_statement
1314 ;
1315
1316
1317 simple_statement : Delete NAME '[' expr_list ']'
1318 | expr
1319 | print_statement
1320 ;
1321
1322
1323 print_statement : simple_print_statement
1324 | simple_print_statement output_redirection
1325 ;
1326
1327
1328 simple_print_statement : Print print_expr_list_opt
1329 | Print '(' multiple_expr_list ')'
1330 | Printf print_expr_list
1331 | Printf '(' multiple_expr_list ')'
1332 ;
1333
1334
1335 output_redirection : '>' expr
1336 | APPEND expr
1337 | '|' expr
1338 ;
1339
1340
1341 expr_list_opt : /* empty */
1342 | expr_list
1343 ;
1344
1345
1346 expr_list : expr
1347 | multiple_expr_list
1348 ;
1349
1350
1351 multiple_expr_list : expr ',' newline_opt expr
1352 | multiple_expr_list ',' newline_opt expr
1353 ;
1354
1355
1356 expr_opt : /* empty */
1357 | expr
1358 ;
1359
1360
1361 expr : unary_expr
1362 | non_unary_expr
1363 ;
1364
1365
1366 unary_expr : '+' expr
1367 | '-' expr
1368 | unary_expr '^' expr
1369 | unary_expr '*' expr
1370 | unary_expr '/' expr
1371 | unary_expr '%' expr
1372 | unary_expr '+' expr
1373 | unary_expr '-' expr
1374 | unary_expr non_unary_expr
1375 | unary_expr '<' expr
1376 | unary_expr LE expr
1377 | unary_expr NE expr
1378 | unary_expr EQ expr
1379 | unary_expr '>' expr
1380 | unary_expr GE expr
1381 | unary_expr '~' expr
1382 | unary_expr NO_MATCH expr
1383 | unary_expr In NAME
1384 | unary_expr AND newline_opt expr
1385 | unary_expr OR newline_opt expr
1386 | unary_expr '?' expr ':' expr
1387 | unary_input_function
1388 ;
1389
1390
1391 non_unary_expr : '(' expr ')'
1392 | '!' expr
1393 | non_unary_expr '^' expr
1394 | non_unary_expr '*' expr
1395 | non_unary_expr '/' expr
1396 | non_unary_expr '%' expr
1397 | non_unary_expr '+' expr
1398 | non_unary_expr '-' expr
1399 | non_unary_expr non_unary_expr
1400 | non_unary_expr '<' expr
1401 | non_unary_expr LE expr
1402 | non_unary_expr NE expr
1403 | non_unary_expr EQ expr
1404 | non_unary_expr '>' expr
1405 | non_unary_expr GE expr
1406 | non_unary_expr '~' expr
1407 | non_unary_expr NO_MATCH expr
1408 | non_unary_expr In NAME
1409 | '(' multiple_expr_list ')' In NAME
1410 | non_unary_expr AND newline_opt expr
1411 | non_unary_expr OR newline_opt expr
1412 | non_unary_expr '?' expr ':' expr
1413 | NUMBER
1414 | STRING
1415 | lvalue
1416 | ERE
1417 | lvalue INCR
1418 | lvalue DECR
1419 | INCR lvalue
1420 | DECR lvalue
1421 | lvalue POW_ASSIGN expr
1422 | lvalue MOD_ASSIGN expr
1423 | lvalue MUL_ASSIGN expr
1424 | lvalue DIV_ASSIGN expr
1425 | lvalue ADD_ASSIGN expr
1426 | lvalue SUB_ASSIGN expr
1427 | lvalue '=' expr
1428 | FUNC_NAME '(' expr_list_opt ')'
1429 /* no white space allowed before '(' */
1430 | BUILTIN_FUNC_NAME '(' expr_list_opt ')'
1431 | BUILTIN_FUNC_NAME
1432 | non_unary_input_function
1433 ;
1434
1435
1436 print_expr_list_opt : /* empty */
1437 | print_expr_list
1438 ;
1439
1440
1441 print_expr_list : print_expr
1442 | print_expr_list ',' newline_opt print_expr
1443 ;
1444
1445
1446 print_expr : unary_print_expr
1447 | non_unary_print_expr
1448 ;
1449
1450
1451 unary_print_expr : '+' print_expr
1452 | '-' print_expr
1453 | unary_print_expr '^' print_expr
1454 | unary_print_expr '*' print_expr
1455 | unary_print_expr '/' print_expr
                 | unary_print_expr '%' print_expr
                 | unary_print_expr '+' print_expr
                 | unary_print_expr '-' print_expr
                 | unary_print_expr non_unary_print_expr
                 | unary_print_expr '~' print_expr
                 | unary_print_expr NO_MATCH print_expr
                 | unary_print_expr In NAME
                 | unary_print_expr AND newline_opt print_expr
                 | unary_print_expr OR newline_opt print_expr
                 | unary_print_expr '?' print_expr ':' print_expr
                 ;


non_unary_print_expr : '(' expr ')'
                 | '!' print_expr
                 | non_unary_print_expr '^' print_expr
                 | non_unary_print_expr '*' print_expr
                 | non_unary_print_expr '/' print_expr
                 | non_unary_print_expr '%' print_expr
                 | non_unary_print_expr '+' print_expr
                 | non_unary_print_expr '-' print_expr
                 | non_unary_print_expr non_unary_print_expr
                 | non_unary_print_expr '~' print_expr
                 | non_unary_print_expr NO_MATCH print_expr
                 | non_unary_print_expr In NAME
                 | '(' multiple_expr_list ')' In NAME
                 | non_unary_print_expr AND newline_opt print_expr
                 | non_unary_print_expr OR newline_opt print_expr
                 | non_unary_print_expr '?' print_expr ':' print_expr
                 | NUMBER
                 | STRING
                 | lvalue
                 | ERE
                 | lvalue INCR
                 | lvalue DECR
                 | INCR lvalue
                 | DECR lvalue
                 | lvalue POW_ASSIGN print_expr
                 | lvalue MOD_ASSIGN print_expr
                 | lvalue MUL_ASSIGN print_expr
                 | lvalue DIV_ASSIGN print_expr
                 | lvalue ADD_ASSIGN print_expr
                 | lvalue SUB_ASSIGN print_expr
                 | lvalue '=' print_expr
                 | FUNC_NAME '(' expr_list_opt ')'
                      /* no white space allowed before '(' */
                 | BUILTIN_FUNC_NAME '(' expr_list_opt ')'
                 | BUILTIN_FUNC_NAME
                 ;


lvalue           : NAME
                 | NAME '[' expr_list ']'
                 | '$' expr
                 ;


non_unary_input_function : simple_get
                 | simple_get '<' expr
                 | non_unary_expr '|' simple_get
                 ;


unary_input_function : unary_expr '|' simple_get
                 ;


simple_get       : GETLINE
                 | GETLINE lvalue
                 ;


newline_opt      : /* empty */
                 | newline_opt NEWLINE
                 ;
1531
1532 This grammar has several ambiguities that shall be resolved as follows:
1533
 * Operator precedence and associativity shall be as described in
   Expressions in Decreasing Precedence in awk.
1536
1537 * In case of ambiguity, an else shall be associated with the most
1538 immediately preceding if that would satisfy the grammar.
1539
1540 * In some contexts, a slash ( '/' ) that is used to surround an ERE
1541 could also be the division operator. This shall be resolved in such
1542 a way that wherever the division operator could appear, a slash is
1543 assumed to be the division operator. (There is no unary division
1544 operator.)
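
The else rule above can be illustrated with a minimal sketch (the
variable names x and y are arbitrary and not part of the standard
text). Because the else attaches to the nearest preceding if, it runs
when x is true and y is false:

```sh
# The else pairs with the inner if, not the outer one.
out=$(awk 'BEGIN {
    x = 1; y = 0
    if (x)
        if (y)
            print "inner if"
        else
            print "inner else"
}')
echo "$out"
```

If the else belonged to the outer if, nothing would be printed for
these values.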
1545
One convention that might not be obvious from the formal grammar is
where <newline>s are acceptable. There are several obvious placements
such as terminating a statement, and a backslash can be used to escape
<newline>s between any lexical tokens. In addition, <newline>s without
backslashes can follow a comma, an open brace, a logical AND operator
("&&"), a logical OR operator ("||"), the do keyword, the else keyword,
and the closing parenthesis of an if, for, or while statement. For
example:
1554
1555
1556 { print $1,
1557 $2 }
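
A slightly larger sketch of the same convention, run from the shell
(the sample input is made up for illustration). It breaks lines after
"&&", after an open brace, and after a comma, with no backslashes:

```sh
# Unescaped newlines after &&, after "{", and after a comma.
out=$(printf 'a b\nc d\n' | awk '
$1 == "a" &&
$2 == "b" {
    print $1,
          $2
}')
echo "$out"
```

Only the first input line satisfies both conditions, so a single line
is written.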
1558
1559 Lexical Conventions
1560 The lexical conventions for awk programs, with respect to the preceding
1561 grammar, shall be as follows:
1562
1563 1. Except as noted, awk shall recognize the longest possible token or
1564 delimiter beginning at a given point.
1565
1566 2. A comment shall consist of any characters beginning with the number
1567 sign character and terminated by, but excluding the next occurrence
1568 of, a <newline>. Comments shall have no effect, except to delimit
1569 lexical tokens.
1570
1571 3. The <newline> shall be recognized as the token NEWLINE.
1572
1573 4. A backslash character immediately followed by a <newline> shall
1574 have no effect.
1575
 5. The token STRING shall represent a string constant. A string
    constant shall begin with the character '"'. Within a string
    constant, a backslash character shall be considered to begin an
    escape sequence as specified in the table in the Base Definitions
    volume of IEEE Std 1003.1-2001, Chapter 5, File Format Notation
    ('\\', '\a', '\b', '\f', '\n', '\r', '\t', '\v'). In addition, the
    escape sequences in Escape Sequences in awk shall be recognized. A
    <newline> shall not occur within a string constant. A string
    constant shall be terminated by the first unescaped occurrence of
    the character '"' after the one that begins the string constant.
    The value of the string shall be the sequence of all unescaped
    characters and values of escape sequences between, but not
    including, the two delimiting '"' characters.
1589
 6. The token ERE represents an extended regular expression constant.
    An ERE constant shall begin with the slash character. Within an ERE
    constant, a backslash character shall be considered to begin an
    escape sequence as specified in the table in the Base Definitions
    volume of IEEE Std 1003.1-2001, Chapter 5, File Format Notation. In
    addition, the escape sequences in Escape Sequences in awk shall be
    recognized. The application shall ensure that a <newline> does not
    occur within an ERE constant. An ERE constant shall be terminated
    by the first unescaped occurrence of the slash character after the
    one that begins the ERE constant. The extended regular expression
    represented by the ERE constant shall be the sequence of all
    unescaped characters and values of escape sequences between, but
    not including, the two delimiting slash characters.
1603
1604 7. A <blank> shall have no effect, except to delimit lexical tokens or
1605 within STRING or ERE tokens.
1606
 8. The token NUMBER shall represent a numeric constant. Its form and
    numeric value shall be equivalent to either of the tokens
    floating-constant or integer-constant as specified by the ISO C
    standard, with the following exceptions:

     a. An integer constant cannot begin with 0x or include the
        hexadecimal digits 'a', 'b', 'c', 'd', 'e', 'f', 'A', 'B', 'C',
        'D', 'E', or 'F'.

     b. The value of an integer constant beginning with 0 shall be
        taken in decimal rather than octal.

     c. An integer constant cannot include a suffix ('u', 'U', 'l', or
        'L').

     d. A floating constant cannot include a suffix ('f', 'F', 'l', or
        'L').

    If the value is too large or too small to be representable (see
    Concepts Derived from the ISO C Standard), the behavior is
    undefined.
1627
1628 9. A sequence of underscores, digits, and alphabetics from the porta‐
1629 ble character set (see the Base Definitions volume of
1630 IEEE Std 1003.1-2001, Section 6.1, Portable Character Set), begin‐
1631 ning with an underscore or alphabetic, shall be considered a word.
1632
1633 10. The following words are keywords that shall be recognized as indi‐
1634 vidual tokens; the name of the token is the same as the keyword:
1635

       BEGIN      delete   END    function   in      printf
       break      do       exit   getline    next    return
       continue   else     for    if         print   while

1641
1642 11. The following words are names of built-in functions and shall be
1643 recognized as the token BUILTIN_FUNC_NAME:
1644
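
       atan2   gsub     log     split     sub       toupper
       close   index    match   sprintf   substr
       cos     int      rand    sqrt      system
       exp     length   sin     srand     tolower
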
1651
1652 The above-listed keywords and names of built-in functions are consid‐
1653 ered reserved words.
1654
1655 12. The token NAME shall consist of a word that is not a keyword or a
1656 name of a built-in function and is not followed immediately (with‐
1657 out any delimiters) by the '(' character.
1658
1659 13. The token FUNC_NAME shall consist of a word that is not a keyword
1660 or a name of a built-in function, followed immediately (without any
1661 delimiters) by the '(' character. The '(' character shall not be
1662 included as part of the token.
1663
1664 14. The following two-character sequences shall be recognized as the
1665 named tokens:
1666
1667 Token Name Sequence Token Name Sequence
1668 ADD_ASSIGN += NO_MATCH !~
1669 SUB_ASSIGN -= EQ ==
1670 MUL_ASSIGN *= LE <=
1671 DIV_ASSIGN /= GE >=
1672 MOD_ASSIGN %= NE !=
1673 POW_ASSIGN ^= INCR ++
1674 OR || DECR --
1675 AND && APPEND >>
1676
1677 15. The following single characters shall be recognized as tokens whose
1678 names are the character:
1679
1680
1681 <newline> { } ( ) [ ] , ; + - * % ^ ! > < | ? : ~ $ =
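
Several of the lexical rules above can be checked from the shell. The
following is a minimal sketch, not normative text; the function name
add and the sample strings are assumptions made for illustration:

```sh
# STRING: \t, \" and \\ each stand for a single character.
len=$(awk 'BEGIN { print length("a\tb\"c\\d") }')

# ERE: \/ lets a slash appear between the delimiting slashes.
slash=$(printf 'a/b\nab\n' | awk '/\// { print }')

# NUMBER: C-style floating constants, without suffixes.
num=$(awk 'BEGIN { print 1e3 + .5 }')

# FUNC_NAME: "(" follows the user-defined function name directly.
sum=$(awk 'function add(a, b) { return a + b } BEGIN { print add(2, 3) }')

echo "$len $slash $num $sum"
```

The string constant holds seven characters, the ERE selects only the
line containing a slash, and the call to add is recognized because no
white space separates the name from the parenthesis.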
1682
1683 There is a lexical ambiguity between the token ERE and the tokens '/'
1684 and DIV_ASSIGN. When an input sequence begins with a slash character in
1685 any syntactic context where the token '/' or DIV_ASSIGN could appear as
1686 the next token in a valid program, the longer of those two tokens that
1687 can be recognized shall be recognized. In any other syntactic context
1688 where the token ERE could appear as the next token in a valid program,
1689 the token ERE shall be recognized.
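
The effect of this rule can be sketched as follows (the variable names
and input are illustrative assumptions). After an operand a slash can
be the division operator, so it is taken as one. At the start of a
pattern no division is possible, so the slash opens an ERE:

```sh
# "/b/" after an operand is two divisions: (12 / 3) / 2.
div=$(awk 'BEGIN { a = 12; b = 3; c = 2; print a /b/ c }')

# At the start of a pattern the same characters form an ERE.
ere=$(printf 'abc\nxyz\n' | awk '/b/ { print }')

echo "$div $ere"
```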
1690
EXIT STATUS
1692 The following exit values shall be returned:
1693
1694 0 All input files were processed successfully.
1695
1696 >0 An error occurred.
1697
1698
1699 The exit status can be altered within the program by using an exit
1700 expression.
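
A minimal sketch of setting the exit status from within a program (the
value 3 is arbitrary):

```sh
# Capture the status without aborting a script that uses "set -e".
status=0
awk 'BEGIN { exit 3 }' || status=$?
echo "$status"
```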
1701
CONSEQUENCES OF ERRORS
1703 If any file operand is specified and the named file cannot be accessed,
1704 awk shall write a diagnostic message to standard error and terminate
1705 without any further action.
1706
1707 If the program specified by either the program operand or a progfile
1708 operand is not a valid awk program (as specified in the EXTENDED
1709 DESCRIPTION section), the behavior is undefined.
1710
1711 The following sections are informative.
1712
APPLICATION USAGE
1714 The index, length, match, and substr functions should not be confused
1715 with similar functions in the ISO C standard; the awk versions deal
1716 with characters, while the ISO C standard deals with bytes.
1717
1718 Because the concatenation operation is represented by adjacent expres‐
1719 sions rather than an explicit operator, it is often necessary to use
1720 parentheses to enforce the proper evaluation precedence.
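
As a sketch of why the parentheses matter (variable names are
illustrative): addition binds more tightly than concatenation, so the
two assignments below give different results.

```sh
out=$(awk 'BEGIN {
    a = 2; b = 3
    x = "a+b=" a + b      # parsed as "a+b=" (a + b)
    y = ("a+b=" a) + b    # the string "a+b=2" has numeric value 0
    print x
    print y
}')
echo "$out"
```

The first line written is the label followed by the sum; the second is
just b, because the concatenated string converts to zero.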
1721
EXAMPLES
The awk program specified in the command line is most easily specified
within single-quotes, because programs commonly contain characters
that are special to the shell, including double-quotes. In the cases
where an awk program contains single-quote characters, it is usually
easiest to specify most of the program as strings within single-quotes
concatenated by the shell with quoted single-quote characters. For
example:
1730
1731
1732 awk '/'\''/ { print "quote:", $0 }'
1733
1734 prints all lines from the standard input containing a single-quote
1735 character, prefixed with quote:.
1736
1737 The following are examples of simple awk programs:
1738
1739 1. Write to the standard output all input lines for which field 3 is
1740 greater than 5:
1741
1742
1743 $3 > 5
1744
1745 2. Write every tenth line:
1746
1747
1748 (NR % 10) == 0
1749
1750 3. Write any line with a substring matching the regular expression:
1751
1752
1753 /(G|D)(2[0-9][[:alpha:]]*)/
1754
1755 4. Print any line with a substring containing a 'G' or 'D', followed
1756 by a sequence of digits and characters. This example uses charac‐
1757 ter classes digit and alpha to match language-independent digit and
1758 alphabetic characters respectively:
1759
1760
1761 /(G|D)([[:digit:][:alpha:]]*)/
1762
1763 5. Write any line in which the second field matches the regular
1764 expression and the fourth field does not:
1765
1766
1767 $2 ~ /xyz/ && $4 !~ /xyz/
1768
1769 6. Write any line in which the second field contains a backslash:
1770
1771
1772 $2 ~ /\\/
1773
1774 7. Write any line in which the second field contains a backslash. Note
1775 that backslash escapes are interpreted twice; once in lexical pro‐
1776 cessing of the string and once in processing the regular expres‐
1777 sion:
1778
1779
1780 $2 ~ "\\\\"
1781
1782 8. Write the second to the last and the last field in each line. Sepa‐
1783 rate the fields by a colon:
1784
1785
1786 {OFS=":";print $(NF-1), $NF}
1787
1788 9. Write the line number and number of fields in each line. The three
1789 strings representing the line number, the colon, and the number of
1790 fields are concatenated and that string is written to standard out‐
1791 put:
1792
1793
1794 {print NR ":" NF}
1795
1796 10. Write lines longer than 72 characters:
1797
1798
1799 length($0) > 72
1800
1801 11. Write the first two fields in opposite order separated by OFS:
1802
1803
1804 { print $2, $1 }
1805
1806 12. Same, with input fields separated by a comma or <space>s and
1807 <tab>s, or both:
1808
1809
1810 BEGIN { FS = ",[ \t]*|[ \t]+" }
1811 { print $2, $1 }
1812
1813 13. Add up the first column, print sum, and average:
1814
1815
1816 {s += $1 }
1817 END {print "sum is ", s, " average is", s/NR}
1818
1819 14. Write fields in reverse order, one per line (many lines out for
1820 each line in):
1821
1822
1823 { for (i = NF; i > 0; --i) print $i }
1824
1825 15. Write all lines between occurrences of the strings start and stop:
1826
1827
1828 /start/, /stop/
1829
1830 16. Write all lines whose first field is different from the previous
1831 one:
1832
1833
1834 $1 != prev { print; prev = $1 }
1835
1836 17. Simulate echo:
1837
1838
1839 BEGIN {
1840 for (i = 1; i < ARGC; ++i)
1841 printf("%s%s", ARGV[i], i==ARGC-1?"\n":" ")
1842 }
1843
1844 18. Write the path prefixes contained in the PATH environment variable,
1845 one per line:
1846
1847
1848 BEGIN {
1849 n = split (ENVIRON["PATH"], path, ":")
1850 for (i = 1; i <= n; ++i)
1851 print path[i]
1852 }
1853
1854 19. If there is a file named input containing page headers of the form:
1855
1856
1857 Page #
1858
1859 and a file named program that contains:
1860
1861
1862 /Page/ { $2 = n++; }
1863 { print }
1864
1865 then the command line:
1866
1867
1868 awk -f program n=5 input
1869
1870 prints the file input, filling in page numbers starting at 5.
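
The last example can be tried end to end with a short shell sketch.
The temporary directory and the sample file contents are assumptions
made for illustration, and mktemp is widely available but not required
by this standard:

```sh
dir=$(mktemp -d)
printf 'Page #\nsome text\nPage #\n' > "$dir/input"
printf '%s\n' '/Page/ { $2 = n++; }' '{ print }' > "$dir/program"

# n=5 is an assignment operand, processed before the file input is read.
out=$(cd "$dir" && awk -f program n=5 input)
echo "$out"
rm -rf "$dir"
```

Each header line has its second field replaced by the current value of
n, which is then incremented; other lines pass through unchanged.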
1871
RATIONALE
1873 This description is based on the new awk, "nawk", (see the referenced
1874 The AWK Programming Language), which introduced a number of new fea‐
1875 tures to the historical awk:
1876
1877 1. New keywords: delete, do, function, return
1878
1879 2. New built-in functions: atan2, close, cos, gsub, match, rand, sin,
1880 srand, sub, system
1881
1882 3. New predefined variables: FNR, ARGC, ARGV, RSTART, RLENGTH, SUBSEP
1883
1884 4. New expression operators: ?, :, ,, ^
1885
1886 5. The FS variable and the third argument to split, now treated as
1887 extended regular expressions.
1888
1889 6. The operator precedence, changed to more closely match the C lan‐
1890 guage. Two examples of code that operate differently are:
1891
1892
1893 while ( n /= 10 > 1) ...
1894 if (!"wk" ~ /bwk/) ...
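
The first of these can be sketched as follows (the starting value 100
is arbitrary). With the current precedence, the relational operator
binds more tightly than the assignment, so n is divided by the result
of the comparison (1) rather than by 10:

```sh
# Parsed as n /= (10 > 1), that is n /= 1.
out=$(awk 'BEGIN { n = 100; n /= 10 > 1; print n }')
echo "$out"
```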
1895
1896 Several features have been added based on newer implementations of awk:
1897
1898 * Multiple instances of -f progfile are permitted.
1899
1900 * The new option -v assignment.
1901
1902 * The new predefined variable ENVIRON.
1903
1904 * New built-in functions toupper and tolower.
1905
1906 * More formatting capabilities are added to printf to match the ISO C
1907 standard.
1908
1909 The overall awk syntax has always been based on the C language, with a
1910 few features from the shell command language and other sources. Because
1911 of this, it is not completely compatible with any other language, which
1912 has caused confusion for some users. It is not the intent of the stan‐
1913 dard developers to address such issues. A few relatively minor changes
1914 toward making the language more compatible with the ISO C standard were
1915 made; most of these changes are based on similar changes in recent
1916 implementations, as described above. There remain several C-language
1917 conventions that are not in awk. One of the notable ones is the comma
1918 operator, which is commonly used to specify multiple expressions in the
1919 C language for statement. Also, there are various places where awk is
1920 more restrictive than the C language regarding the type of expression
1921 that can be used in a given context. These limitations are due to the
1922 different features that the awk language does provide.
1923
1924 Regular expressions in awk have been extended somewhat from historical
1925 implementations to make them a pure superset of extended regular
1926 expressions, as defined by IEEE Std 1003.1-2001 (see the Base Defini‐
1927 tions volume of IEEE Std 1003.1-2001, Section 9.4, Extended Regular
1928 Expressions). The main extensions are internationalization features
1929 and interval expressions. Historical implementations of awk have long
1930 supported backslash escape sequences as an extension to extended regu‐
1931 lar expressions, and this extension has been retained despite inconsis‐
1932 tency with other utilities. The number of escape sequences recognized
1933 in both extended regular expressions and strings has varied (generally
1934 increasing with time) among implementations. The set specified by
1935 IEEE Std 1003.1-2001 includes most sequences known to be supported by
1936 popular implementations and by the ISO C standard. One sequence that is
1937 not supported is hexadecimal value escapes beginning with '\x' . This
1938 would allow values expressed in more than 9 bits to be used within awk
1939 as in the ISO C standard. However, because this syntax has a non-deter‐
1940 ministic length, it does not permit the subsequent character to be a
1941 hexadecimal digit. This limitation can be dealt with in the C language
1942 by the use of lexical string concatenation. In the awk language, con‐
1943 catenation could also be a solution for strings, but not for extended
1944 regular expressions (either lexical ERE tokens or strings used dynami‐
1945 cally as regular expressions). Because of this limitation, the feature
1946 has not been added to IEEE Std 1003.1-2001.
1947
1948 When a string variable is used in a context where an extended regular
1949 expression normally appears (where the lexical token ERE is used in the
1950 grammar) the string does not contain the literal slashes.
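
A minimal sketch of this point (the variable name re and the test
strings are illustrative): the string used as a regular expression is
written without the delimiting slashes, and any slashes it does
contain are matched literally.

```sh
out=$(awk 'BEGIN {
    re = "^a.c$"                              # no surrounding slashes
    if ("abc" ~ re) print "match"
    if ("abc" ~ "/a.c/") print "unexpected"   # slashes are literal here
}')
echo "$out"
```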
1951
1952 Some versions of awk allow the form:
1953
1954
1955 func name(args, ... ) { statements }
1956
1957 This has been deprecated by the authors of the language, who asked that
1958 it not be specified.
1959
1960 Historical implementations of awk produce an error if a next statement
1961 is executed in a BEGIN action, and cause awk to terminate if a next
1962 statement is executed in an END action. This behavior has not been doc‐
1963 umented, and it was not believed that it was necessary to standardize
1964 it.
1965
1966 The specification of conversions between string and numeric values is
1967 much more detailed than in the documentation of historical implementa‐
1968 tions or in the referenced The AWK Programming Language. Although most
1969 of the behavior is designed to be intuitive, the details are necessary
1970 to ensure compatible behavior from different implementations. This is
1971 especially important in relational expressions since the types of the
1972 operands determine whether a string or numeric comparison is performed.
1973 From the perspective of an application writer, it is usually sufficient
1974 to expect intuitive behavior and to force conversions (by adding zero
1975 or concatenating a null string) when the type of an expression does not
1976 obviously match what is needed. The intent has been to specify histori‐
1977 cal practice in almost all cases. The one exception is that, in histor‐
1978 ical implementations, variables and constants maintain both string and
1979 numeric values after their original value is converted by any use. This
1980 means that referencing a variable or constant can have unexpected side
1981 effects. For example, with historical implementations the following
1982 program:
1983
1984
1985 {
1986 a = "+2"
1987 b = 2
1988 if (NR % 2)
1989 c = a + b
1990 if (a == b)
1991 print "numeric comparison"
1992 else
1993 print "string comparison"
1994 }
1995
1996 would perform a numeric comparison (and output numeric comparison) for
1997 each odd-numbered line, but perform a string comparison (and output
1998 string comparison) for each even-numbered line. IEEE Std 1003.1-2001
1999 ensures that comparisons will be numeric if necessary. With historical
2000 implementations, the following program:
2001
2002
2003 BEGIN {
2004 OFMT = "%e"
2005 print 3.14
2006 OFMT = "%f"
2007 print 3.14
2008 }
2009
would output "3.140000e+00" twice, because in the second print
statement the constant "3.14" would have a string value from the
previous conversion. IEEE Std 1003.1-2001 requires that the output of
the second print statement be "3.140000". The behavior of historical
implementations was seen as too unintuitive and unpredictable.
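
Both programs can be run against a conforming implementation with a
sketch like the following. The first program is reduced here to a
BEGIN action so that it needs no input, which is an adaptation made
for illustration:

```sh
# OFMT is consulted at each print, so the two outputs differ.
fmt=$(awk 'BEGIN { OFMT = "%e"; print 3.14; OFMT = "%f"; print 3.14 }')

# A string constant compared with a number is compared as a string,
# regardless of any earlier numeric use of the variable.
cmp=$(awk 'BEGIN {
    a = "+2"; b = 2
    c = a + b
    if (a == b) print "numeric comparison"
    else print "string comparison"
}')

echo "$fmt"
echo "$cmp"
```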
2015
2016 It was pointed out that with the rules contained in early drafts, the
2017 following script would print nothing:
2018
2019
2020 BEGIN {
2021 y[1.5] = 1
2022 OFMT = "%e"
2023 print y[1.5]
2024 }
2025
2026 Therefore, a new variable, CONVFMT, was introduced. The OFMT variable
2027 is now restricted to affecting output conversions of numbers to strings
2028 and CONVFMT is used for internal conversions, such as comparisons or
2029 array indexing. The default value is the same as that for OFMT, so
2030 unless a program changes CONVFMT (which no historical program would
2031 do), it will receive the historical behavior associated with internal
2032 string conversions.
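
The division of labor between OFMT and CONVFMT can be sketched as
follows (the values are illustrative). The subscript 1.5 is converted
with CONVFMT, so changing OFMT no longer changes which array element
is addressed:

```sh
# Array subscripts use CONVFMT, so the element is still found.
idx=$(awk 'BEGIN { y[1.5] = 1; OFMT = "%e"; print y[1.5] }')

# Concatenation is an internal conversion and also uses CONVFMT.
conv=$(awk 'BEGIN { CONVFMT = "%.2f"; x = 3.14159; s = x ""; print s }')

echo "$idx $conv"
```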
2033
The POSIX awk lexical and syntactic conventions are specified more
formally than in other sources. Again the intent has been to specify
historical practice. One convention that may not be obvious from the
formal grammar (as is also the case in other verbal descriptions) is
where <newline>s are acceptable. There are several obvious placements
such as terminating a statement, and a backslash can be used to escape
<newline>s between any lexical tokens. In addition, <newline>s without
backslashes can follow a comma, an open brace, a logical AND operator
("&&"), a logical OR operator ("||"), the do keyword, the else keyword,
and the closing parenthesis of an if, for, or while statement. For
example:
2044
2045
2046 { print $1,
2047 $2 }
2048
2049 The requirement that awk add a trailing <newline> to the program argu‐
2050 ment text is to simplify the grammar, making it match a text file in
2051 form. There is no way for an application or test suite to determine
2052 whether a literal <newline> is added or whether awk simply acts as if
2053 it did.
2054
2055 IEEE Std 1003.1-2001 requires several changes from historical implemen‐
2056 tations in order to support internationalization. Probably the most
2057 subtle of these is the use of the decimal-point character, defined by
2058 the LC_NUMERIC category of the locale, in representations of floating-
2059 point numbers. This locale-specific character is used in recognizing
2060 numeric input, in converting between strings and numeric values, and in
2061 formatting output. However, regardless of locale, the period character
2062 (the decimal-point character of the POSIX locale) is the decimal-point
2063 character recognized in processing awk programs (including assignments
2064 in command line arguments). This is essentially the same convention as
2065 the one used in the ISO C standard. The difference is that the C lan‐
2066 guage includes the setlocale() function, which permits an application
2067 to modify its locale. Because of this capability, a C application
2068 begins executing with its locale set to the C locale, and only executes
2069 in the environment-specified locale after an explicit call to setlo‐
2070 cale(). However, adding such an elaborate new feature to the awk lan‐
2071 guage was seen as inappropriate for IEEE Std 1003.1-2001. It is possi‐
2072 ble to execute an awk program explicitly in any desired locale by set‐
2073 ting the environment in the shell.
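
A minimal sketch of selecting the locale from the shell for a single
invocation. The POSIX locale is chosen here so that the decimal-point
character in the output is a period:

```sh
# The assignment applies only to this awk process.
out=$(LC_ALL=C awk 'BEGIN { printf "%.2f\n", 1 / 4 }')
echo "$out"
```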
2074
2075 The undefined behavior resulting from NULs in extended regular expres‐
2076 sions allows future extensions for the GNU gawk program to process
2077 binary data.
2078
2079 The behavior in the case of invalid awk programs (including lexical,
2080 syntactic, and semantic errors) is undefined because it was considered
2081 overly limiting on implementations to specify. In most cases such
2082 errors can be expected to produce a diagnostic and a non-zero exit sta‐
2083 tus. However, some implementations may choose to extend the language in
2084 ways that make use of certain invalid constructs. Other invalid con‐
2085 structs might be deemed worthy of a warning, but otherwise cause some
2086 reasonable behavior. Still other constructs may be very difficult to
2087 detect in some implementations. Also, different implementations might
2088 detect a given error during an initial parsing of the program (before
2089 reading any input files) while others might detect it when executing
2090 the program after reading some input. Implementors should be aware that
2091 diagnosing errors as early as possible and producing useful diagnostics
2092 can ease debugging of applications, and thus make an implementation
2093 more usable.
2094
2095 The unspecified behavior from using multi-character RS values is to
2096 allow possible future extensions based on extended regular expressions
2097 used for record separators. Historical implementations take the first
2098 character of the string and ignore the others.
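
For contrast, a single-character RS has fully specified behavior. A
sketch with made-up input, using a semicolon as the record separator:

```sh
out=$(printf 'a;b;c' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')
echo "$out"
```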
2099
2100 Unspecified behavior when split( string, array, <null>) is used is to
2101 allow a proposed future extension that would split up a string into an
2102 array of individual characters.
2103
2104 In the context of the getline function, equally good arguments for dif‐
2105 ferent precedences of the | and < operators can be made. Historical
2106 practice has been that:
2107
2108
2109 getline < "a" "b"
2110
2111 is parsed as:
2112
2113
2114 ( getline < "a" ) "b"
2115
2116 although many would argue that the intent was that the file ab should
2117 be read. However:
2118
2119
2120 getline < "x" + 1
2121
2122 parses as:
2123
2124
2125 getline < ( "x" + 1 )
2126
2127 Similar problems occur with the | version of getline, particularly in
2128 combination with $. For example:
2129
2130
2131 $"echo hi" | getline
2132
2133 (This situation is particularly problematic when used in a print state‐
2134 ment, where the |getline part might be a redirection of the print.)
2135
Since in most cases such constructs are not (or at least should not
be) used (because they have a natural ambiguity for which there is no
conventional parsing), the meaning of these constructs has been made
explicitly unspecified. (The effect is that a conforming application
that runs into the problem must parenthesize to resolve the ambiguity.)
There appeared to be few if any actual uses of such constructs.
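
A sketch of the parenthesized forms (the temporary file, the variable
names, and the command string are assumptions made for illustration).
Grouping the whole getline expression makes the intended operands
explicit:

```sh
file=$(mktemp)
printf 'first line\n' > "$file"

out=$(awk -v f="$file" 'BEGIN {
    if ((getline line < f) > 0) print line     # read from the file named by f
    cmd = "echo hi"
    if ((cmd | getline word) > 0) print word   # read from a command
    close(cmd)
}')
echo "$out"
rm -f "$file"
```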
2142
2143 Grammars can be written that would cause an error under these circum‐
2144 stances. Where backwards-compatibility is not a large consideration,
2145 implementors may wish to use such grammars.
2146
2147 Some historical implementations have allowed some built-in functions to
2148 be called without an argument list, the result being a default argument
2149 list chosen in some "reasonable" way. Use of length as a synonym for
2150 length($0) is the only one of these forms that is thought to be widely
2151 known or widely used; this particular form is documented in various
2152 places (for example, most historical awk reference pages, although not
2153 in the referenced The AWK Programming Language) as legitimate practice.
2154 With this exception, default argument lists have always been undocu‐
2155 mented and vaguely defined, and it is not at all clear how (or if) they
2156 should be generalized to user-defined functions. They add no useful
2157 functionality and preclude possible future extensions that might need
to name functions without calling them. Not standardizing them seems
the simplest course. The standard developers considered that length
merited special treatment, however, since it has been documented in
the past and sees possibly substantial use in historical programs.
Accordingly, this usage has been made legitimate. Issue 5 removed the
obsolescent marking for XSI-conforming implementations, and many
otherwise conforming applications depend on this feature.
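
A minimal sketch showing the two spellings side by side (the input
word is arbitrary):

```sh
# length with no argument list is a synonym for length($0).
out=$(printf 'hello\n' | awk '{ print length, length($0) }')
echo "$out"
```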
2165
2166 In sub and gsub, if repl is a string literal (the lexical token
2167 STRING), then two consecutive backslash characters should be used in
2168 the string to ensure a single backslash will precede the ampersand when
2169 the resultant string is passed to the function. (For example, to spec‐
2170 ify one literal ampersand in the replacement string, use gsub( ERE,
2171 "\\&" ).)
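
The difference between a literal and a special ampersand can be
sketched as follows (the input string is made up for illustration):

```sh
# "\\&" reaches gsub as \& and yields a literal ampersand.
lit=$(printf 'a.b\n' | awk '{ gsub(/\./, "\\&"); print }')

# A bare & stands for the matched text.
rep=$(printf 'a.b\n' | awk '{ gsub(/b/, "[&]"); print }')

echo "$lit $rep"
```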
2172
2173 Historically the only special character in the repl argument of sub and
2174 gsub string functions was the ampersand ( '&' ) character and preceding
2175 it with the backslash character was used to turn off its special mean‐
2176 ing.
2177
2178 The description in the ISO POSIX-2:1993 standard introduced behavior
2179 such that the backslash character was another special character and it
2180 was unspecified whether there were any other special characters. This
2181 description introduced several portability problems, some of which are
2182 described below, and so it has been replaced with the more historical
2183 description. Some of the problems include:
2184
2185 * Historically, to create the replacement string, a script could use
2186 gsub( ERE, "\\&" ), but with the ISO POSIX-2:1993 standard wording,
2187 it was necessary to use gsub( ERE, "\\\\&" ). Backslash characters
2188 are doubled here because all string literals are subject to lexical
2189 analysis, which would reduce each pair of backslash characters to a
2190 single backslash before being passed to gsub.
2191
2192 * Since it was unspecified what the special characters were, for por‐
2193 table scripts to guarantee that characters are printed literally,
2194 each character had to be preceded with a backslash. (For example, a
2195 portable script had to use gsub( ERE, "\\h\\i" ) to produce a
2196 replacement string of "hi" .)
2197
The description for comparisons in the ISO POSIX-2:1993 standard did
not properly describe historical practice because of the way numeric
strings are compared as numbers. The rules in that standard cause the
following code:
2202
2203
2204 if (0 == "000")
2205 print "strange, but true"
2206 else
2207 print "not true"
2208
2209 to do a numeric comparison, causing the if to succeed. It should be
2210 intuitively obvious that this is incorrect behavior, and indeed, no
2211 historical implementation of awk actually behaves this way.
2212
2213 To fix this problem, the definition of numeric string was enhanced to
2214 include only those values obtained from specific circumstances (mostly
2215 external sources) where it is not possible to determine unambiguously
2216 whether the value is intended to be a string or a numeric.
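
The resulting distinction can be sketched as follows (the input value
is illustrative). A string constant in the program text is not a
numeric string, whereas the same characters read from input are:

```sh
# String constant: compared as a string, so "0" and "000" differ.
const=$(awk 'BEGIN { if (0 == "000") print "strange, but true"; else print "not true" }')

# Field from input: a numeric string, so the comparison is numeric.
input=$(printf '000\n' | awk '{ if ($1 == 0) print "numeric"; else print "string" }')

echo "$const / $input"
```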
2217
Variables that are assigned to a numeric string shall also be treated
as a numeric string. (That is, the notion of a numeric string can be
propagated across assignments.) In comparisons, all variables having
the uninitialized value are to be treated as a numeric operand
evaluating to the numeric value zero.
2223
Uninitialized variables include all types of variables including
scalars, array elements, and fields. The definition of an uninitialized
value in Variables and Special Variables is necessary to describe the
value placed on uninitialized variables and on fields that are valid
(for example, < $NF) but have no characters in them and to describe how
these variables are to be used in comparisons. A valid field, such as
$1, that has no characters in it can be obtained from an input line of
"\t\t" when FS = "\t". Historically, the comparison ($1 < 10) was done
numerically after evaluating $1 to the value zero.
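
The dual nature of the uninitialized value can be sketched with a
variable that is never assigned (the name x is arbitrary):

```sh
# An uninitialized variable equals both 0 and the empty string.
out=$(awk 'BEGIN { if (x == 0 && x == "") print "zero and empty" }')
echo "$out"
```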
2233
The phrase "... also shall have the numeric value of the numeric
string" was removed from several sections of the ISO POSIX-2:1993
standard because it specifies an unnecessary implementation detail. It
is not necessary for IEEE Std 1003.1-2001 to specify that these objects
be assigned two different values. It is only necessary to specify that
these objects may evaluate to two different values depending on
context.
2241
2242 The description of numeric string processing is based on the behavior
2243 of the atof() function in the ISO C standard. While it is not a
2244 requirement for an implementation to use this function, many historical
2245 implementations of awk do. In the ISO C standard, floating-point con‐
2246 stants use a period as a decimal point character for the language
2247 itself, independent of the current locale, but the atof() function and
2248 the associated strtod() function use the decimal point character of the
2249 current locale when converting strings to numeric values. Similarly in
2250 awk, floating-point constants in an awk script use a period independent
2251 of the locale, but input strings use the decimal point character of the
2252 locale.
2253
FUTURE DIRECTIONS
2255 None.
2256
SEE ALSO
2258 Grammar Conventions, grep, lex, sed, the System Interfaces volume of
2259 IEEE Std 1003.1-2001, atof(), exec, popen(), setlocale(), strtod()
2260
COPYRIGHT
2262 Portions of this text are reprinted and reproduced in electronic form
2263 from IEEE Std 1003.1, 2003 Edition, Standard for Information Technology
2264 -- Portable Operating System Interface (POSIX), The Open Group Base
2265 Specifications Issue 6, Copyright (C) 2001-2003 by the Institute of
2266 Electrical and Electronics Engineers, Inc and The Open Group. In the
2267 event of any discrepancy between this version and the original IEEE and
2268 The Open Group Standard, the original IEEE and The Open Group Standard
2269 is the referee document. The original Standard can be obtained online
2270 at http://www.opengroup.org/unix/online.html .
2271
2272
2273
IEEE/The Open Group                  2003                          AWK(1P)