
AWK(1P)                    POSIX Programmer's Manual                   AWK(1P)

PROLOG

       This  manual  page is part of the POSIX Programmer's Manual.  The Linux
       implementation of this interface may differ (consult the  corresponding
       Linux  manual page for details of Linux behavior), or the interface may
       not be implemented on Linux.

NAME

       awk - pattern scanning and processing language

SYNOPSIS

       awk [-F ERE] [-v assignment] ... program [argument ...]

       awk [-F ERE] -f progfile ... [-v assignment] ... [argument ...]

DESCRIPTION

       The awk utility shall execute programs written in the awk programming
       language, which is specialized for textual data manipulation. An awk
       program is a sequence of patterns and corresponding actions. When input
       is read that matches a pattern, the action associated with that pattern
       is carried out.

       Input shall be interpreted as a sequence of records. By default, a
       record is a line, less its terminating <newline>, but this can be
       changed by using the RS built-in variable. Each record of input shall
       be matched in turn against each pattern in the program. For each
       pattern matched, the associated action shall be executed.

       The awk utility shall interpret each input record as a sequence of
       fields where, by default, a field is a string of non-<blank>s. This
       default white-space field delimiter can be changed by using the FS
       built-in variable or -F ERE. The awk utility shall denote the first
       field in a record $1, the second $2, and so on. The symbol $0 shall
       refer to the entire record; setting any other field causes the
       re-evaluation of $0. Assigning to $0 shall reset the values of all
       other fields and the NF built-in variable.
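
       The field model described above can be tried with a short sketch.
       The sample record and the shell variable names rec, second, and
       rebuilt are illustrative assumptions, not part of this manual page.

```shell
# Sample record (illustrative input only).
rec='alpha beta gamma'

# $2 is the second whitespace-delimited field of the record.
second=$(printf '%s\n' "$rec" | awk '{ print $2 }')

# Assigning to a field causes $0 to be re-evaluated from the fields.
rebuilt=$(printf '%s\n' "$rec" | awk '{ $2 = "X"; print $0 }')

printf '%s\n%s\n' "$second" "$rebuilt"
```

       This should print "beta" followed by "alpha X gamma".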

OPTIONS

       The awk utility shall conform to the Base Definitions volume of
       IEEE Std 1003.1-2001, Section 12.2, Utility Syntax Guidelines.

       The following options shall be supported:

       -F  ERE
              Define the input field separator to be the extended regular
              expression ERE, before any input is read; see Regular
              Expressions.

       -f  progfile
              Specify the pathname of the file progfile containing an awk
              program. If multiple instances of this option are specified, the
              concatenation of the files specified as progfile in the order
              specified shall be the awk program. The awk program can
              alternatively be specified in the command line as a single
              argument.

       -v  assignment
              The application shall ensure that the assignment argument is in
              the same form as an assignment operand. The specified variable
              assignment shall occur prior to executing the awk program,
              including the actions associated with BEGIN patterns (if any).
              Multiple occurrences of this option can be specified.
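
       As a minimal sketch of the -F and -v options: the sample input line
       and the variable name n are hypothetical, chosen only for
       illustration.

```shell
# -F sets the input field separator before any input is read.
user=$(printf 'root:x:0:0\n' | awk -F: '{ print $1 }')

# -v assigns before the program starts, so BEGIN actions can see the value.
sum=$(awk -v n=3 'BEGIN { print n + 1 }')

printf '%s\n%s\n' "$user" "$sum"
```

       This should print "root" and then "4".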

OPERANDS

       The following operands shall be supported:

       program
              If no -f option is specified, the first operand to awk shall be
              the text of the awk program. The application shall supply the
              program operand as a single argument to awk. If the text does
              not end in a <newline>, awk shall interpret the text as if it
              did.

       argument
              Either of the following two types of argument can be intermixed:

       file
              A pathname of a file that contains the input to be read, which
              is matched against the set of patterns in the program. If no
              file operands are specified, or if a file operand is '-', the
              standard input shall be used.

       assignment
              An operand that begins with an underscore or alphabetic
              character from the portable character set (see the table in the
              Base Definitions volume of IEEE Std 1003.1-2001, Section 6.1,
              Portable Character Set), followed by a sequence of underscores,
              digits, and alphabetics from the portable character set,
              followed by the '=' character, shall specify a variable
              assignment rather than a pathname. The characters before the
              '=' represent the name of an awk variable; if that name is an
              awk reserved word (see Grammar) the behavior is undefined. The
              characters following the equal sign shall be interpreted as if
              they appeared in the awk program preceded and followed by a
              double-quote ('"') character, as a STRING token (see Grammar),
              except that if the last character is an unescaped backslash, it
              shall be interpreted as a literal backslash rather than as the
              first character of the sequence "\"". The variable shall be
              assigned the value of that STRING token and, if appropriate,
              shall be considered a numeric string (see Expressions in awk),
              in which case the variable shall also be assigned its numeric
              value. Each such variable assignment shall occur just prior to
              the processing of the following file, if any. Thus, an
              assignment before the first file argument shall be executed
              after the BEGIN actions (if any), while an assignment after the
              last file argument shall occur before the END actions (if any).
              If there are no file arguments, assignments shall be executed
              before processing the standard input.
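
       The timing of assignment operands can be observed with a small
       sketch. The variable name x and the use of /dev/null as an empty
       input file are assumptions made for illustration.

```shell
# x=5 is an assignment operand placed before the file operand /dev/null,
# so it takes effect after BEGIN has run but before that file is read.
out=$(awk 'BEGIN { print "BEGIN sees [" x "]" }
           END   { print "END sees [" x "]" }' x=5 /dev/null)
printf '%s\n' "$out"
```

       The BEGIN line should show an empty value and the END line the value
       5, in contrast to -v, which assigns before BEGIN.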

STDIN

116       The standard input shall be used only if no file  operands  are  speci‐
117       fied, or if a file operand is '-' ; see the INPUT FILES section. If the
118       awk program contains no actions and no patterns,  but  is  otherwise  a
119       valid  awk  program,  standard input and any file operands shall not be
120       read and awk shall exit with a return status of zero.
121

INPUT FILES

123       Input files to the awk program from any of the following sources  shall
124       be text files:
125
126        * Any  file  operands  or their equivalents, achieved by modifying the
127          awk variables ARGV and ARGC
128
129        * Standard input in the absence of any file operands
130
131        * Arguments to the getline function
132
133       Whether the variable RS is set to a value other  than  a  <newline>  or
134       not,  for these files, implementations shall support records terminated
135       with the specified separator up to {LINE_MAX}  bytes  and  may  support
136       longer records.
137
138       If  -f  progfile  is  specified,  the application shall ensure that the
139       files named by each of the progfile option-arguments are text files and
140       their concatenation, in the same order as they appear in the arguments,
141       is an awk program.
142

ENVIRONMENT VARIABLES

144       The following environment variables shall affect the execution of awk:
145
146       LANG   Provide a default value for the  internationalization  variables
147              that  are  unset  or  null.  (See the Base Definitions volume of
148              IEEE Std 1003.1-2001, Section  8.2,  Internationalization  Vari‐
149              ables  for the precedence of internationalization variables used
150              to determine the values of locale categories.)
151
152       LC_ALL If set to a non-empty string value, override the values  of  all
153              the other internationalization variables.
154
155       LC_COLLATE
156              Determine  the  locale  for  the behavior of ranges, equivalence
157              classes, and multi-character collating elements  within  regular
158              expressions and in comparisons of string values.
159
160       LC_CTYPE
161              Determine  the  locale  for  the  interpretation of sequences of
162              bytes of text data as characters (for  example,  single-byte  as
163              opposed  to multi-byte characters in arguments and input files),
164              the behavior of character classes  within  regular  expressions,
165              the  identification of characters as letters, and the mapping of
166              uppercase and lowercase characters for the toupper  and  tolower
167              functions.
168
169       LC_MESSAGES
170              Determine  the  locale  that should be used to affect the format
171              and contents of diagnostic messages written to standard error.
172
173       LC_NUMERIC
174              Determine the radix character  used  when  interpreting  numeric
175              input, performing conversions between numeric and string values,
176              and formatting numeric output. Regardless of locale, the  period
177              character  (the  decimal-point character of the POSIX locale) is
178              the decimal-point character recognized in  processing  awk  pro‐
179              grams (including assignments in command line arguments).
180
181       NLSPATH
182              Determine the location of message catalogs for the processing of
183              LC_MESSAGES .
184
185       PATH   Determine the search path when looking for commands executed  by
186              system(expr),  or  input  and output pipes; see the Base Defini‐
187              tions volume of  IEEE Std 1003.1-2001,  Chapter  8,  Environment
188              Variables.
189
190
191       In  addition,  all  environment  variables shall be visible via the awk
192       variable ENVIRON.
193

ASYNCHRONOUS EVENTS

195       Default.
196

STDOUT

198       The nature of the output files depends on the awk program.
199

STDERR

201       The standard error shall be used only for diagnostic messages.
202

OUTPUT FILES

204       The nature of the output files depends on the awk program.
205

EXTENDED DESCRIPTION

207   Overall Program Structure
208       An awk program is composed of pairs of the form:
209
210
211              pattern { action }
212
213       Either the pattern or the action (including the enclosing brace charac‐
214       ters) can be omitted.
215
216       A missing pattern shall match any record of input, and a missing action
217       shall be equivalent to:
218
219
220              { print }
221
222       Execution of the awk program shall start by first executing the actions
223       associated  with all BEGIN patterns in the order they occur in the pro‐
224       gram. Then each file operand (or standard input if no files were speci‐
225       fied)  shall be processed in turn by reading data from the file until a
226       record separator is seen ( <newline> by default). Before the first ref‐
227       erence to a field in the record is evaluated, the record shall be split
228       into fields, according to the rules in Regular Expressions,  using  the
229       value of FS that was current at the time the record was read. Each pat‐
230       tern in the program then shall be evaluated in the order of occurrence,
231       and  the  action  associated with each pattern that matches the current
232       record executed. The action for a matching pattern  shall  be  executed
233       before  evaluating subsequent patterns. Finally, the actions associated
234       with all END patterns shall be executed in the order they occur in  the
235       program.
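
       The execution order described above can be sketched as follows. The
       input values and the counter name n are illustrative assumptions.

```shell
out=$(printf '1\n5\n9\n' | awk '
    BEGIN  { print "start" }       # runs before any input is read
    $1 > 3 { n++ }                 # runs for each record matching the pattern
    $1 > 8                         # missing action: equivalent to { print }
    END    { print "matched", n }  # runs after all input is consumed
')
printf '%s\n' "$out"
```

       This should print "start", then the record "9", then "matched 2".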
236
237   Expressions in awk
238       Expressions describe computations used in patterns and actions.  In the
239       following table, valid expression operations are given in  groups  from
240       highest  precedence  first to lowest precedence last, with equal-prece‐
241       dence operators grouped between horizontal lines. In expression evalua‐
242       tion, where the grammar is formally ambiguous, higher precedence opera‐
243       tors shall be evaluated before lower precedence operators. In this  ta‐
244       ble  expr,  expr1,  expr2,  and  expr3  represent any expression, while
245       lvalue represents any entity that can be assigned to (that is,  on  the
246       left side of an assignment operator). The precise syntax of expressions
247       is given in Grammar .
248
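                 Table: Expressions in Decreasing Precedence in awk

    Syntax                Name                      Type of Result   Associativity
    ------------------------------------------------------------------------------
    ( expr )              Grouping                  Type of expr     N/A
    ------------------------------------------------------------------------------
    $expr                 Field reference           String           N/A
    ------------------------------------------------------------------------------
    ++ lvalue             Pre-increment             Numeric          N/A
    -- lvalue             Pre-decrement             Numeric          N/A
    lvalue ++             Post-increment            Numeric          N/A
    lvalue --             Post-decrement            Numeric          N/A
    ------------------------------------------------------------------------------
    expr ^ expr           Exponentiation            Numeric          Right
    ------------------------------------------------------------------------------
    ! expr                Logical not               Numeric          N/A
    + expr                Unary plus                Numeric          N/A
    - expr                Unary minus               Numeric          N/A
    ------------------------------------------------------------------------------
    expr * expr           Multiplication            Numeric          Left
    expr / expr           Division                  Numeric          Left
    expr % expr           Modulus                   Numeric          Left
    ------------------------------------------------------------------------------
    expr + expr           Addition                  Numeric          Left
    expr - expr           Subtraction               Numeric          Left
    ------------------------------------------------------------------------------
    expr expr             String concatenation      String           Left
    ------------------------------------------------------------------------------
    expr < expr           Less than                 Numeric          None
    expr <= expr          Less than or equal to     Numeric          None
    expr != expr          Not equal to              Numeric          None
    expr == expr          Equal to                  Numeric          None
    expr > expr           Greater than              Numeric          None
    expr >= expr          Greater than or equal to  Numeric          None
    ------------------------------------------------------------------------------
    expr ~ expr           ERE match                 Numeric          None
    expr !~ expr          ERE non-match             Numeric          None
    ------------------------------------------------------------------------------
    expr in array         Array membership          Numeric          Left
    ( index ) in array    Multi-dimension array     Numeric          Left
                          membership
    ------------------------------------------------------------------------------
    expr && expr          Logical AND               Numeric          Left
    ------------------------------------------------------------------------------
    expr || expr          Logical OR                Numeric          Left
    ------------------------------------------------------------------------------
    expr1 ? expr2 : expr3 Conditional expression    Type of selected Right
                                                    expr2 or expr3
    ------------------------------------------------------------------------------
    lvalue ^= expr        Exponentiation assignment Numeric          Right
    lvalue %= expr        Modulus assignment        Numeric          Right
    lvalue *= expr        Multiplication assignment Numeric          Right
    lvalue /= expr        Division assignment       Numeric          Right
    lvalue += expr        Addition assignment       Numeric          Right
    lvalue -= expr        Subtraction assignment    Numeric          Right
    lvalue = expr         Assignment                Type of expr     Right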
291
292       Each expression shall have either a string value, a numeric  value,  or
293       both.  Except  as stated for specific contexts, the value of an expres‐
294       sion shall be implicitly converted to the type needed for  the  context
295       in  which  it  is  used. A string value shall be converted to a numeric
296       value by the equivalent of the following calls to functions defined  by
297       the ISO C standard:
298
299
300              setlocale(LC_NUMERIC, "");
301              numeric_value = atof(string_value);
302
303       A  numeric  value that is exactly equal to the value of an integer (see
304       Concepts Derived from the ISO C Standard )  shall  be  converted  to  a
305       string  by the equivalent of a call to the sprintf function (see String
306       Functions ) with the string "%d" as the fmt argument  and  the  numeric
307       value  being  converted  as the first and only expr argument. Any other
308       numeric value shall be converted to a string by  the  equivalent  of  a
309       call  to the sprintf function with the value of the variable CONVFMT as
310       the fmt argument and the numeric value being converted as the first and
311       only  expr argument. The result of the conversion is unspecified if the
312       value of CONVFMT is not a  floating-point  format  specification.  This
313       volume   of  IEEE Std 1003.1-2001  specifies  no  explicit  conversions
314       between numbers and strings. An application can force an expression  to
315       be  treated  as  a  number  by adding zero to it, or can force it to be
316       treated as a string by concatenating the null string ( "" ) to it.
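
       The conversion rules above can be illustrated with a brief sketch.
       The variable names and the CONVFMT value "%.2f" are assumptions
       chosen for the example.

```shell
out=$(awk 'BEGIN {
    CONVFMT = "%.2f"
    i = 17;      s1 = i ""    # integer value: converted as if by "%d"
    f = 3.14159; s2 = f ""    # non-integer value: converted using CONVFMT
    n = "42" + 0              # adding zero forces numeric treatment
    print s1; print s2; print n
}')
printf '%s\n' "$out"
```

       This should print "17", "3.14", and "42"; the integer is unaffected
       by CONVFMT.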
317
318       A string value shall be considered a numeric string if  it  comes  from
319       one of the following:
320
321        1. Field variables
322
323        2. Input from the getline() function
324
325        3. FILENAME
326
327        4. ARGV array elements
328
329        5. ENVIRON array elements
330
331        6. Array elements created by the split() function
332
333        7. A command line variable assignment
334
335        8. Variable assignment from another numeric string variable
336
337       and  after all the following conversions have been applied, the result‐
338       ing string would lexically be recognized as a NUMBER token as described
339       by the lexical conventions in Grammar :
340
        * All leading and trailing <blank>s are discarded.

        * If the first non-<blank> is '+' or '-', it is discarded.

        * Each occurrence of the decimal point character from the current
          locale is changed to a period.
347
348       If a '-' character is ignored in the preceding description, the numeric
349       value  of the numeric string shall be the negation of the numeric value
350       of the recognized NUMBER token.  Otherwise, the numeric  value  of  the
351       numeric  string  shall  be  the  numeric value of the recognized NUMBER
352       token. Whether or not a string is a numeric string  shall  be  relevant
353       only in contexts where that term is used in this section.
354
355       When  an  expression  is used in a Boolean context, if it has a numeric
356       value, a value of zero shall be treated as false and  any  other  value
357       shall  be treated as true. Otherwise, a string value of the null string
358       shall be treated as false and any other value shall be treated as true.
359       A Boolean context shall be one of the following:
360
361        * The first subexpression of a conditional expression
362
363        * An expression operated on by logical NOT, logical AND, or logical OR
364
365        * The second expression of a for statement
366
367        * The expression of an if statement
368
        * The expression of the while clause in either a while or do...while
          statement
371
372        * An expression used as a pattern (as in Overall Program Structure)
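
       A minimal sketch of truth values in a Boolean context, using only
       constants. The messages are illustrative.

```shell
out=$(awk 'BEGIN {
    if (0)   print "numeric zero is true"        # not printed
    if ("")  print "null string is true"         # not printed
    if ("0") print "string constant 0 is true"   # printed: non-null string
    if (2)   print "nonzero number is true"      # printed
}')
printf '%s\n' "$out"
```

       Only the last two messages should appear: the string constant "0"
       has no numeric value, so it is tested as a non-null string.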
373
374       All arithmetic shall follow the semantics of floating-point  arithmetic
375       as specified by the ISO C standard (see Concepts Derived from the ISO C
376       Standard ).
377
378       The value of the expression:
379
380
381              expr1 ^ expr2
382
383       shall be equivalent to the value returned by the ISO C  standard  func‐
384       tion call:
385
386
387              pow(expr1, expr2)
388
389       The expression:
390
391
392              lvalue ^= expr
393
394       shall be equivalent to the ISO C standard expression:
395
396
397              lvalue = pow(lvalue, expr)
398
399       except  that  lvalue  shall  be  evaluated  only once. The value of the
400       expression:
401
402
403              expr1 % expr2
404
405       shall be equivalent to the value returned by the ISO C  standard  func‐
406       tion call:
407
408
409              fmod(expr1, expr2)
410
411       The expression:
412
413
414              lvalue %= expr
415
416       shall be equivalent to the ISO C standard expression:
417
418
419              lvalue = fmod(lvalue, expr)
420
421       except that lvalue shall be evaluated only once.
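
       The pow and fmod equivalences above imply the following results; the
       operand values are arbitrary examples.

```shell
out=$(awk 'BEGIN {
    print 2 ^ 10      # pow(2, 10)
    print 7 % 3       # fmod(7, 3)
    print -7 % 3      # fmod(-7, 3): the result takes the sign of the dividend
    print 5.5 % 2     # fmod(5.5, 2): operands need not be integers
    x = 3; x ^= 2     # same as x = pow(x, 2)
    print x
}')
printf '%s\n' "$out"
```

       This should print 1024, 1, -1, 1.5, and 9, one per line.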
422
423       Variables and fields shall be set by the assignment statement:
424
425
426              lvalue = expression
427
428       and the type of expression shall determine the resulting variable type.
429       The assignment includes the arithmetic assignments ( "+=", "-=",  "*=",
430       "/=",  "%=",  "^=",  "++",  "--" ) all of which shall produce a numeric
431       result. The left-hand side of an assignment and the target of increment
432       and  decrement operators can be one of a variable, an array with index,
433       or a field selector.
434
435       The awk language supplies arrays that are used for storing  numbers  or
436       strings.  Arrays  need  not be declared. They shall initially be empty,
437       and their sizes shall change dynamically. The  subscripts,  or  element
438       identifiers,  are  strings, providing a type of associative array capa‐
439       bility. An array name followed by a subscript  within  square  brackets
440       can be used as an lvalue and thus as an expression, as described in the
441       grammar; see Grammar . Unsubscripted array names can be  used  in  only
442       the following contexts:
443
444        * A parameter in a function definition or function call
445
446        * The  NAME  token following any use of the keyword in as specified in
447          the grammar (see Grammar ); if the name used in this context is  not
448          an array name, the behavior is undefined
449
450       A  valid  array  index  shall  consist  of  one or more comma-separated
451       expressions, similar to the way in which multi-dimensional  arrays  are
452       indexed  in  some programming languages.  Because awk arrays are really
453       one-dimensional, such a comma-separated list shall be  converted  to  a
454       single  string  by  concatenating  the  string  values  of the separate
455       expressions, each separated from the other by the value of  the  SUBSEP
456       variable.   Thus,  the  following two index operations shall be equiva‐
457       lent:
458
459
460              var[expr1, expr2, ... exprn]
461
462
463              var[expr1 SUBSEP expr2 SUBSEP ... SUBSEP exprn]
464
465       The application shall ensure that a multi-dimensioned index  used  with
466       the  in operator is parenthesized. The in operator, which tests for the
467       existence of a particular array element, shall not cause  that  element
468       to  exist.  Any  other  reference  to a nonexistent array element shall
469       automatically create it.
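
       The array rules above can be sketched as follows. The array name a
       and its keys are hypothetical.

```shell
out=$(awk 'BEGIN {
    a["x", "y"] = 1
    # A comma-separated index is the same key as one joined with SUBSEP.
    if (("x" SUBSEP "y") in a) print "same key"
    # A multi-dimensional index used with in must be parenthesized.
    if (("x", "y") in a) print "found"
    # Testing with in does not create the element ...
    if (!("missing" in a)) print "absent"
    n = 0; for (k in a) n++; print n
    # ... but any other reference does.
    v = a["missing"]
    n = 0; for (k in a) n++; print n
}')
printf '%s\n' "$out"
```

       The element count should go from 1 to 2 only after a["missing"] is
       referenced outside the in operator.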
470
471       Comparisons (with the '<', "<=", "!=", "==", '>', and  ">="  operators)
472       shall  be  made  numerically  if  both  operands are numeric, if one is
473       numeric and the other has a string value that is a numeric  string,  or
474       if one is numeric and the other has the uninitialized value. Otherwise,
475       operands shall be converted to strings as required and a string compar‐
476       ison  shall  be  made using the locale-specific collation sequence. The
477       value of the comparison expression shall be 1 if the relation is  true,
478       or 0 if the relation is false.
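
       The difference between numeric and string comparison shows up when
       the same characters arrive as fields and as string constants. This
       is a sketch; the values 10 and 9 are arbitrary.

```shell
# Fields that look like numbers are numeric strings, so they compare
# numerically; two string constants are compared as strings.
fields=$(printf '10 9\n' | awk '{ print ($1 > $2) }')
consts=$(awk 'BEGIN { print ("10" > "9") }')
printf '%s\n%s\n' "$fields" "$consts"
```

       This should print 1 for the fields and 0 for the constants. The
       parentheses are needed because an unparenthesized > in a print
       statement is an output redirection.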
479
480   Variables and Special Variables
481       Variables  can be used in an awk program by referencing them.  With the
482       exception of function parameters (see User-Defined  Functions  ),  they
483       are not explicitly declared. Function parameter names shall be local to
484       the function; all other variable names shall be global. The  same  name
485       shall  not be used as both a function parameter name and as the name of
486       a function or a special awk variable. The same name shall not  be  used
487       both  as  a  variable name with global scope and as the name of a func‐
488       tion. The same name shall not be used within the same scope both  as  a
489       scalar  variable  and  as an array.  Uninitialized variables, including
490       scalar variables, array elements, and field variables,  shall  have  an
491       uninitialized  value.  An uninitialized value shall have both a numeric
492       value of zero and a string value of the  empty  string.  Evaluation  of
493       variables  with  an  uninitialized  value, to either string or numeric,
494       shall be determined by the context in which they are used.
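
       A short sketch of the uninitialized value; the variable name u is
       hypothetical and is never assigned.

```shell
out=$(awk 'BEGIN {
    print u + 0                  # numeric context: zero
    print "[" u "]"              # string context: empty string
    print (u == 0), (u == "")    # equal to both zero and the empty string
}')
printf '%s\n' "$out"
```

       This should print "0", "[]", and "1 1".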
495
496       Field variables shall be designated by a '$' followed by  a  number  or
497       numerical  expression. The effect of the field number expression evalu‐
498       ating to anything other than a  non-negative  integer  is  unspecified;
499       uninitialized  variables  or  string  values  need  not be converted to
500       numeric values in this context. New field variables can be  created  by
501       assigning  a value to them.  References to nonexistent fields (that is,
502       fields after $NF), shall evaluate to the uninitialized value. Such ref‐
503       erences  shall  not create new fields. However, assigning to a nonexis‐
504       tent field (for example, $(NF+2)=5) shall increase  the  value  of  NF;
505       create  any  intervening fields with the uninitialized value; and cause
506       the value of $0 to be recomputed, with the fields  being  separated  by
507       the  value  of OFS. Each field variable shall have a string value or an
508       uninitialized value when  created.   Field  variables  shall  have  the
509       uninitialized value when created from $0 using FS and the variable does
510       not contain any characters. If appropriate, the field variable shall be
511       considered a numeric string (see Expressions in awk ).
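
       The effect of reading versus assigning a nonexistent field can be
       sketched as follows. The input record and the OFS value "-" are
       assumptions chosen to make the recomputed $0 visible.

```shell
out=$(printf 'a b\n' | awk -v OFS=- '{
    before = NF
    x = $(NF + 3)      # reading a nonexistent field does not create it
    after_read = NF
    $(NF + 2) = 5      # assigning creates $3 (empty) and $4, and rebuilds $0
    print before, after_read, NF
    print $0
}')
printf '%s\n' "$out"
```

       This should print "2-2-4" and then "a-b--5".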
512
513       Implementations  shall  support  the  following other special variables
514       that are set by awk:
515
516       ARGC   The number of elements in the ARGV array.
517
518       ARGV   An array of command line arguments, excluding  options  and  the
519              program argument, numbered from zero to ARGC-1.
520
521       The arguments in ARGV can be modified or added to; ARGC can be altered.
522       As each input file ends, awk shall treat the next non-null  element  of
523       ARGV,  up to the current value of ARGC-1, inclusive, as the name of the
524       next input file. Thus, setting an element of ARGV to null means that it
525       shall not be treated as an input file. The name '-' indicates the stan‐
526       dard input. If an argument matches the format of an assignment operand,
527       this  argument  shall  be  treated  as an assignment rather than a file
528       argument.
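
       A minimal sketch that lists ARGV from within a BEGIN action. The
       operands first and second are placeholders; because the program has
       only a BEGIN action, they are never opened as files.

```shell
out=$(awk 'BEGIN {
    # ARGV[0] typically holds the command name, so the loop starts at 1.
    for (i = 1; i < ARGC; i++) print i, ARGV[i]
}' first second)
printf '%s\n' "$out"
```

       This should print "1 first" and "2 second".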
529
530       CONVFMT
531              The printf format for converting numbers to strings (except  for
532              output statements, where OFMT is used); "%.6g" by default.
533
534       ENVIRON
535              An array representing the value of the environment, as described
536              in the exec functions defined in the System Interfaces volume of
537              IEEE Std 1003.1-2001.  The indices of the array shall be strings
538              consisting of the names of the environment  variables,  and  the
539              value  of each array element shall be a string consisting of the
540              value of that variable. If appropriate, the environment variable
541              shall  be considered a numeric string (see Expressions in awk );
542              the array element shall also have its numeric value.
543
544       In all cases where the behavior of awk is affected by environment vari‐
545       ables  (including the environment of any commands that awk executes via
546       the system function or via pipeline redirections with the print  state‐
547       ment,  the  printf statement, or the getline function), the environment
548       used shall be the environment at the time awk began  executing;  it  is
549       implementation-defined whether any modification of ENVIRON affects this
550       environment.
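
       As a brief sketch, an environment variable set for the awk process
       is visible through ENVIRON. The name GREETING is hypothetical.

```shell
# GREETING is placed in the environment of this one awk invocation.
out=$(GREETING=hello awk 'BEGIN { print ENVIRON["GREETING"] }')
printf '%s\n' "$out"
```

       This should print "hello".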
551
552       FILENAME
553              A pathname of the current input file. Inside a BEGIN action  the
554              value  is undefined. Inside an END action the value shall be the
555              name of the last input file processed.
556
557       FNR    The ordinal number of the current record in  the  current  file.
558              Inside  a  BEGIN  action  the value shall be zero. Inside an END
559              action the value shall be the number of  the  last  record  pro‐
560              cessed in the last file processed.
561
562       FS     Input field separator regular expression; a <space> by default.
563
564       NF     The  number  of  fields  in  the  current record. Inside a BEGIN
565              action, the use of NF is undefined  unless  a  getline  function
566              without  a  var  argument is executed previously.  Inside an END
567              action, NF shall retain the value it had  for  the  last  record
568              read,  unless a subsequent, redirected, getline function without
569              a var argument is performed prior to entering the END action.
570
571       NR     The ordinal number of the  current  record  from  the  start  of
572              input.  Inside a BEGIN action the value shall be zero. Inside an
573              END action the value shall be the number of the last record pro‐
574              cessed.
575
576       OFMT   The  printf  format  for converting numbers to strings in output
577              statements (see Output Statements  );  "%.6g"  by  default.  The
578              result  of the conversion is unspecified if the value of OFMT is
579              not a floating-point format specification.
580
       OFS    The print statement output field separator; a <space> by default.
582
583       ORS    The print statement output  record  separator;  a  <newline>  by
584              default.
585
586       RLENGTH
587              The length of the string matched by the match function.
588
       RS     The first character of the string value of RS shall be the input
              record separator; a <newline> by default. If RS contains more
              than one character, the results are unspecified. If RS is null,
              then records are separated by sequences consisting of a
              <newline> plus one or more blank lines; leading or trailing
              blank lines shall not result in empty records at the beginning
              or end of the input; and a <newline> shall always be a field
              separator, no matter what the value of FS is.
597
598       RSTART The starting position of the string matched by the  match  func‐
599              tion,  numbering  from 1. This shall always be equivalent to the
600              return value of the match function.
601
602       SUBSEP The subscript separator string for multi-dimensional arrays; the
603              default value is implementation-defined.
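
       Several of the special variables above can be observed together in
       one sketch. The file names one and two, the sample data, and the use
       of mktemp -d (widely available, but an assumption about the
       environment) are illustrative only.

```shell
# NR counts records across all input; FNR restarts for each file.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one"
printf 'c\n' > "$dir/two"
counts=$(cd "$dir" && awk '{ print FILENAME, FNR, NR }' one two)
rm -rf "$dir"

# A null RS selects paragraph mode: blank lines separate records and a
# <newline> always separates fields.
paras=$(printf 'a b\nc\n\nd\n' | awk 'BEGIN { RS = "" } { print NR, NF }')

# The match function sets RSTART and RLENGTH as a side effect.
pos=$(awk 'BEGIN { if (match("foobar", /o+/)) print RSTART, RLENGTH }')

printf '%s\n%s\n%s\n' "$counts" "$paras" "$pos"
```

       The first command should print "one 1 1", "one 2 2", and "two 1 3";
       the second "1 3" and "2 1"; the third "2 2".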
604
605
606   Regular Expressions
607       The awk utility shall make use of the extended regular expression nota‐
608       tion (see the Base Definitions volume of IEEE Std 1003.1-2001,  Section
609       9.4,  Extended  Regular Expressions) except that it shall allow the use
610       of C-language conventions for escaping special  characters  within  the
611       EREs,  as  specified  in  the  table  in the Base Definitions volume of
612       IEEE Std 1003.1-2001, Chapter 5, File Format  Notation  (  '\\',  '\a',
613       '\b',  '\f',  '\n',  '\r',  '\t', '\v' ) and the following table; these
614       escape sequences shall be recognized both inside  and  outside  bracket
615       expressions.  Note that records need not be separated by <newline>s and
616       string constants can contain <newline>s, so even the "\n"  sequence  is
617       valid  in  awk EREs. Using a slash character within an ERE requires the
618       escaping shown in the following table.
619
620                           Table: Escape Sequences in awk
621
622       Escape
623       Sequence Description                    Meaning
624       \"       Backslash quotation-mark       Quotation-mark character
625       \/       Backslash slash                Slash character
626       \ddd     A backslash character followed The character whose encoding
627                by the longest sequence of     is represented by the one,
628                one, two, or three octal-digit two, or three-digit octal
629                characters (01234567). If all  integer. Multi-byte characters
630                of the digits are 0 (that is,  require multiple, concatenated
631                representation of the NUL      escape sequences of this type,
632                character), the behavior is    including the leading '\' for
633                undefined.                     each byte.
634       \c       A backslash character followed Undefined
635                by any character not described
636                in this table or in the table
637                in the Base Definitions volume
638                of IEEE Std 1003.1-2001, Chap‐
639                ter 5, File Format Notation (
640                '\\', '\a', '\b', '\f', '\n',
641                '\r', '\t', '\v' ).
642
643       A  regular expression can be matched against a specific field or string
644       by using one of the two regular expression matching operators, '~'  and
645       "!~". These operators shall interpret their right-hand operand  as  a
646       regular expression and their left-hand operand as a string. If the reg‐
647       ular  expression  matches the string, the '~' expression shall evaluate
648       to a value of 1, and the "!~" expression shall evaluate to a  value  of
649       0. (The regular expression matching operation is as defined by the term
650       matched in the Base Definitions volume of IEEE Std 1003.1-2001, Section
651       9.1,  Regular  Expression Definitions, where a match occurs on any part
652       of the string unless the regular expression is limited with the circum‐
653       flex or dollar sign special characters.) If the regular expression does
654       not match the string, the '~' expression shall evaluate to a  value  of
655       0,  and  the  "!~"  expression  shall  evaluate to a value of 1. If the
656       right-hand operand is any expression other than the lexical token  ERE,
657       the  string value of the expression shall be interpreted as an extended
658       regular expression, including the escape conventions  described  above.
659       Note that these same escape conventions shall also be applied in deter‐
660       mining the value of a string literal (the lexical  token  STRING),  and
661       thus  shall  be  applied a second time when a string literal is used in
662       this context.
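
       The matching operators and the double application of escape conventions
       to string operands can be sketched as follows. This is an illustration
       under the assumption of a POSIX shell; the names a, b, c, d, and matches
       are hypothetical.

```shell
matches=$(awk 'BEGIN {
    a = ("a.b" ~ /a\.b/)     # ERE token: \. matches a literal period
    b = ("axb" ~ "a\\.b")    # string: \\ becomes \ lexically, leaving the ERE a\.b
    c = ("axb" ~ "a.b")      # unescaped period matches any character
    d = ("axb" !~ /^x/)      # anchored ERE does not match, so !~ yields 1
    print a, b, c, d
}')
printf '%s\n' "$matches"
```

       Writing the backslash twice in the string operand is what carries a
       single backslash through to the regular expression.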
663
664       When an ERE token appears as an expression in any context other than as
665       the  right-hand  of  the '~' or "!~" operator or as one of the built-in
666       function arguments described below, the value of the resulting  expres‐
667       sion shall be the equivalent of:
668
669
670              $0 ~ /ere/
671
672       The ere argument to the gsub, match, and sub functions, and the fs
673       argument to the split function (see String Functions), shall be
674       interpreted as extended regular expressions. These can be either ERE
675       tokens or arbitrary expressions, and shall be interpreted in the same
676       manner as the right-hand side of the '~' or "!~" operator.
677
678       An  extended regular expression can be used to separate fields by using
679       the -F ERE option or by assigning a string containing the expression to
680       the built-in variable FS. The default value of the FS variable shall be
681       a single <space>. The following describes FS behavior:
682
683        1. If FS is a null string, the behavior is unspecified.
684
685        2. If FS is a single character:
686
687            a. If FS is <space>, skip leading and  trailing  <blank>s;  fields
688               shall be delimited by sets of one or more <blank>s.
689
690            b. Otherwise,  if  FS  is  any  other character c, fields shall be
691               delimited by each single occurrence of c.
692
693        3. Otherwise, the string value of FS shall  be  considered  to  be  an
694           extended regular expression. Each occurrence of a sequence matching
695           the extended regular expression shall delimit fields.
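
       The three FS cases listed above can be tried with a sketch like the
       following, assuming a POSIX shell; the variable names and sample inputs
       are illustrative.

```shell
# FS is a single <space> (the default): leading and trailing blanks are
# skipped and runs of blanks delimit fields.
n_default=$(printf '  a   b \n' | awk '{ print NF }')
# FS is any other single character: every occurrence delimits a field.
n_colon=$(printf 'a::b\n' | awk -F: '{ print NF }')
# FS is longer than one character: it is treated as an ERE.
joined=$(printf 'a1b22c\n' | awk -F'[0-9]+' '{ print $1 "-" $2 "-" $3 }')
printf '%s %s %s\n' "$n_default" "$n_colon" "$joined"
```

       Note that the two adjacent colons produce an empty middle field, whereas
       runs of blanks under the default FS do not.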
696
697       Except for the '~' and "!~" operators, and in the gsub,  match,  split,
698       and  sub  built-in  functions,  ERE  matching  shall  be based on input
699       records; that is, record separator characters (the first  character  of
700       the  value of the variable RS, <newline> by default) cannot be embedded
701       in the expression, and no expression shall match the  record  separator
702       character.  If the record separator is not <newline>, <newline>s embed‐
703       ded in the expression can be matched. For the '~' and  "!~"  operators,
704       and  in  those  four built-in functions, ERE matching shall be based on
705       text strings; that is,  any  character  (including  <newline>  and  the
706       record  separator)  can  be embedded in the pattern, and an appropriate
707       pattern shall match any character. However, in all  awk  ERE  matching,
708       the  use of one or more NUL characters in the pattern, input record, or
709       text string produces undefined results.
710
711   Patterns
712       A pattern is any valid expression, a range specified by two expressions
713       separated by a comma, or one of the two special patterns BEGIN or END.
714
715   Special Patterns
716       The  awk  utility  shall recognize two special patterns, BEGIN and END.
717       Each BEGIN pattern shall be matched once and its associated action
718       executed before the first record of input is read (except possibly by
719       use of the getline function in a prior BEGIN action; see Input/Output
720       and General Functions) and before command line assignment is done.
721       Each END pattern shall be matched once and its associated action
722       executed after the last record of input has been read. These two
723       patterns shall have associated actions.
724
725       BEGIN and END shall not combine with other patterns. Multiple BEGIN and
726       END  patterns  shall  be allowed. The actions associated with the BEGIN
727       patterns shall be executed in the order specified in  the  program,  as
728       are  the  END  actions. An END pattern can precede a BEGIN pattern in a
729       program.
730
731       If an awk program consists of only actions with the pattern BEGIN,  and
732       the  BEGIN  action contains no getline function, awk shall exit without
733       reading its input when the last statement in the last BEGIN  action  is
734       executed.  If  an awk program consists of only actions with the pattern
735       END or only actions with the patterns BEGIN and END, the input shall be
736       read before the statements in the END actions are executed.
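
       A small sketch of the special patterns, assuming a POSIX shell; the
       variable name summary is illustrative. The END pattern is deliberately
       written before BEGIN to show that source order between the two kinds
       does not matter.

```shell
summary=$(printf 'x\ny\nz\n' |
    awk 'END { print "records:", NR } BEGIN { print "start" }')
printf '%s\n' "$summary"
```

       The BEGIN action runs before any input is read, and the END action sees
       NR set to the number of the last record processed.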
737
738   Expression Patterns
739       An expression pattern shall be evaluated as if it were an expression in
740       a Boolean context. If the result is true, the pattern shall be  consid‐
741       ered to match, and the associated action (if any) shall be executed. If
742       the result is false, the action shall not be executed.
743
744   Pattern Ranges
745       A pattern range consists of two expressions separated by  a  comma;  in
746       this  case,  the  action  shall  be performed for all records between a
747       match of the first expression and the following  match  of  the  second
748       expression, inclusive. At this point, the pattern range can be repeated
749       starting at input records subsequent to the end of the matched range.
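
       The inclusive and repeating nature of a pattern range can be sketched as
       follows, assuming a POSIX shell; the marker words start and stop and the
       variable name ranged are illustrative.

```shell
# With no action given, each record in a matched range is printed.
ranged=$(printf '1\nstart\n2\nstop\n3\nstart\n4\n' | awk '/start/,/stop/')
printf '%s\n' "$ranged"
```

       The second range never meets a closing match, so it extends to the end
       of the input.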
750
751   Actions
752       An action is a sequence of statements as shown in the grammar in Grammar.
753       Any single statement can be replaced by a statement list enclosed in
754       braces. The application shall ensure that statements in a statement list
755       are separated by <newline>s or semicolons. Statements in a statement
756       list shall be executed sequentially in the order that they appear.
757
758       The expression acting as the conditional in an if  statement  shall  be
759       evaluated  and  if  it is non-zero or non-null, the following statement
760       shall be executed; otherwise, if else is present, the statement follow‐
761       ing the else shall be executed.
762
763       The if, while, do... while, for, break, and continue statements are
764       based on the ISO C standard (see Concepts Derived from the ISO C
765       Standard), except that the Boolean expressions shall be treated as
766       described in Expressions in awk, and except in the case of:
767
768
769              for (variable in array)
770
771       which shall iterate, assigning each index of array to  variable  in  an
772       unspecified  order.  The results of adding new elements to array within
773       such a for loop are undefined. If a break or continue statement  occurs
774       outside of a loop, the behavior is undefined.
775
776       The  delete  statement shall remove an individual array element.  Thus,
777       the following code deletes an entire array:
778
779
780              for (index in array)
781                  delete array[index]
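
       The for (variable in array) loop and the delete idiom above can be
       exercised with a sketch such as this one, assuming a POSIX shell; the
       array name seen and the counters are illustrative. Because iteration
       order is unspecified, the example only counts elements.

```shell
counts=$(awk 'BEGIN {
    seen["a"] = 1; seen["b"] = 1; seen["c"] = 1
    before = 0; for (k in seen) before++     # count elements, order unspecified
    for (k in seen) delete seen[k]           # remove every element
    after = 0; for (k in seen) after++
    print before, after
}')
printf '%s\n' "$counts"
```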
782
783       The next statement shall cause all further processing  of  the  current
784       input  record  to  be  abandoned.  The  behavior is undefined if a next
785       statement appears or is invoked in a BEGIN or END action.
786
787       The exit statement shall invoke all END actions in the order  in  which
788       they occur in the program source and then terminate the program without
789       reading further input. An exit statement inside  an  END  action  shall
790       terminate  the  program without further execution of END actions. If an
791       expression is specified in an exit statement, its numeric  value  shall
792       be  the exit status of awk, unless subsequent errors are encountered or
793       a subsequent exit statement with an expression is executed.
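
       A sketch of exit, assuming a POSIX shell; the names out and status are
       illustrative. The exit statement stops input processing at the second
       record, the END action still runs, and the expression becomes the exit
       status.

```shell
status=0
out=$(printf 'a\nb\nc\n' |
    awk 'NR == 2 { exit 3 } { print } END { print "end ran" }') || status=$?
printf '%s\nstatus=%s\n' "$out" "$status"
```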
794
795   Output Statements
796       Both print and printf statements shall  write  to  standard  output  by
797       default.  The output shall be written to the location specified by out‐
798       put_redirection if one is supplied, as follows:
799
800
801              > expression
                 >> expression
                 | expression
802
803       In all cases, the expression shall be evaluated to produce a string that
804       is used as a pathname into which to write (for '>' or ">>") or as a
805       command to be executed (for '|'). With the first two forms, if the file
806       of that name is not currently open, it shall be opened, creating it if
807       necessary; with the first form, the file shall also be truncated. The
808       output then shall be appended to the file. As long as the file remains
809       open, subsequent calls in which expression evaluates to the same string
810       value shall simply append output to the file. The file remains open
811       until the close function (see Input/Output and General Functions) is
812       called with an expression that evaluates to the same string value.
813
814       The third form shall write output onto a stream piped to the input of a
815       command. The stream shall be created if no  stream  is  currently  open
816       with  the  value of expression as its command name.  The stream created
817       shall be equivalent to one created by a call to  the  popen()  function
818       defined  in  the  System Interfaces volume of IEEE Std 1003.1-2001 with
819       the value of expression as the command argument and a value of w as the
820       mode  argument. As long as the stream remains open, subsequent calls in
821       which expression evaluates to the same string value shall write  output
822       to  the  existing  stream. The stream shall remain open until the close
823       function (see Input/Output and General Functions) is called  with  an
824       expression  that evaluates to the same string value.  At that time, the
825       stream shall be closed as if by a call to the pclose() function defined
826       in the System Interfaces volume of IEEE Std 1003.1-2001.
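
       The piped form can be sketched as follows, assuming a POSIX shell with
       the sort utility available; the variable name sorted is illustrative.
       Every print uses the same command string, so all records go to a single
       stream, and close() is called with that same string.

```shell
sorted=$(printf 'b\na\nc\n' |
    awk '{ print | "sort" } END { close("sort"); print "done" }')
printf '%s\n' "$sorted"
```

       Calling close() before the final print lets the command finish and emit
       its output first; without it, the relative order of the two outputs
       would depend on buffering.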
827
828       As described in detail by the grammar in Grammar, these output
829       statements shall take a comma-separated list of expressions referred to in
830       the  grammar by the non-terminal symbols expr_list, print_expr_list, or
831       print_expr_list_opt. This list is referred to here  as  the  expression
832       list, and each member is referred to as an expression argument.
833
834       The  print  statement shall write the value of each expression argument
835       onto the indicated output stream separated by the current output  field
836       separator (see variable OFS above), and terminated by the output record
837       separator (see variable ORS above). All expression arguments  shall  be
838       taken  as  strings, being converted if necessary; this conversion shall
839       be as described in Expressions in awk, with the  exception  that  the
840       printf format in OFMT shall be used instead of the value in CONVFMT. An
841       empty expression list shall stand for the whole input record ($0).
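
       The roles of OFS, ORS, and OFMT in print can be sketched as follows,
       assuming a POSIX shell; the separator values and the variable name
       printed are illustrative choices.

```shell
printed=$(awk 'BEGIN {
    OFS = "-"; ORS = "!\n"
    print "a", "b"          # arguments joined by OFS, terminated by ORS
    OFMT = "%.2f"
    print 3.14159           # non-integer number converted with OFMT
    print 42                # integer values are output as integers
}')
printf '%s\n' "$printed"
```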
842
843       The printf statement shall produce output based on a  notation  similar
844       to  the File Format Notation used to describe file formats in this vol‐
845       ume  of  IEEE Std 1003.1-2001  (see  the  Base  Definitions  volume  of
846       IEEE Std 1003.1-2001,  Chapter  5, File Format Notation).  Output shall
847       be produced as specified with the  first  expression  argument  as  the
848       string  format  and subsequent expression arguments as the strings arg1
849       to argn, inclusive, with the following exceptions:
850
851        1. The format shall be an actual character string rather than a graph‐
852           ical  representation.  Therefore, it cannot contain empty character
853           positions. The <space> in the format string, in any  context  other
854           than  a  flag of a conversion specification, shall be treated as an
855           ordinary character that is copied to the output.
856
857        2. If the character set contains a 'Δ' character  and  that  character
858           appears  in  the  format string, it shall be treated as an ordinary
859           character that is copied to the output.
860
861        3. The escape sequences beginning with a backslash character shall  be
862           treated  as sequences of ordinary characters that are copied to the
863           output. Note that these same sequences shall be  interpreted  lexi‐
864           cally  by  awk  when they appear in literal strings, but they shall
865           not be treated specially by the printf statement.
866
867        4. A field width or precision can be specified as  the  '*'  character
868           instead  of a digit string. In this case the next argument from the
869           expression list shall be fetched and its numeric value taken as the
870           field width or precision.
871
872        5. The implementation shall not precede or follow output from the d or
873           u conversion specifier characters with <blank>s  not  specified  by
874           the format string.
875
876        6. The  implementation  shall not precede output from the o conversion
877           specifier character with leading zeros not specified by the  format
878           string.
879
880        7. For  the  c  conversion  specifier character: if the argument has a
881           numeric value, the character whose encoding is that value shall  be
882           output.  If the value is zero or is not the encoding of any charac‐
883           ter in the character set, the behavior is undefined. If  the  argu‐
884           ment  does  not  have  a  numeric value, the first character of the
885           string value shall be output; if the string does  not  contain  any
886           characters, the behavior is undefined.
887
888        8. For  each  conversion  specification that consumes an argument, the
889           next expression argument shall be evaluated. With the exception  of
890           the  c conversion specifier character, the value shall be converted
891           (according to the rules specified in Expressions in awk) to the
892           appropriate type for the conversion specification.
893
894        9. If  there  are insufficient expression arguments to satisfy all the
895           conversion specifications in the format  string,  the  behavior  is
896           undefined.
897
898       10. If  any  character  sequence in the format string begins with a '%'
899           character, but does not form a valid conversion specification,  the
900           behavior is unspecified.
901
902       Both print and printf can output at least {LINE_MAX} bytes.
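
       A brief printf sketch, assuming a POSIX shell and an ASCII-compatible
       character set (so that the encoding 65 is the letter A); the variable
       name formatted is illustrative. It shows field widths, left
       justification, and both behaviors of the c conversion specifier.

```shell
formatted=$(awk 'BEGIN {
    # %c with a numeric argument outputs the character with that encoding;
    # with a string argument it outputs the first character of the string.
    printf "%5d|%-5s|%c%c|%.3f\n", 42, "ab", 65, "xyz", 2 / 3
}')
printf '%s\n' "$formatted"
```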
903
904   Functions
905       The  awk  language  has  a  variety  of built-in functions: arithmetic,
906       string, input/output, and general.
907
908   Arithmetic Functions
909       The arithmetic functions, except for int, shall be based on  the  ISO C
910       standard (see Concepts Derived from the ISO C Standard). The  behavior
911       is undefined in cases where the ISO C standard specifies that an  error
912       be  returned  or  that  the behavior is undefined. Although the grammar
913       (see Grammar) permits built-in functions to appear with  no  arguments
914       or  parentheses,  unless  the  argument or parentheses are indicated as
915       optional in the following list (by  displaying  them  within  the  "[]"
916       brackets), such use is undefined.
917
918       atan2(y,x)
919              Return arctangent of y/x in radians in the range [-pi,pi].
920
921       cos(x) Return cosine of x, where x is in radians.
922
923       sin(x) Return sine of x, where x is in radians.
924
925       exp(x) Return the exponential function of x.
926
927       log(x) Return the natural logarithm of x.
928
929       sqrt(x)
930              Return the square root of x.
931
932       int(x) Return the argument truncated to an integer. Truncation shall be
933              toward 0 when x>0.
934
935       rand() Return a random number n, such that 0<=n<1.
936
937       srand([expr])
938              Set the seed value for rand to expr or use the time  of  day  if
939              expr is omitted. The previous seed value shall be returned.
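
       A sketch of several arithmetic functions, assuming a POSIX shell; the
       variable names arith and r are illustrative. Only a positive argument is
       passed to int, since that is the case for which truncation is specified
       above, and rand is checked only against its documented range.

```shell
arith=$(awk 'BEGIN {
    r = rand()
    # atan2(0, -1) is one common way to obtain pi.
    printf "%d %.4f %d %d %d\n", int(3.9), atan2(0, -1), sqrt(16), exp(0),
        (r >= 0 && r < 1)
}')
printf '%s\n' "$arith"
```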
940
941
942   String Functions
943       The string functions in the following list shall be supported. Although
944       the grammar (see Grammar) permits built-in functions to appear with no
945       arguments  or parentheses, unless the argument or parentheses are indi‐
946       cated as optional in the following list (by displaying them within  the
947       "[]" brackets), such use is undefined.
948
949       gsub(ere, repl[, in])
950              Behave  like  sub  (see below), except that it shall replace all
951              occurrences of the  regular  expression  (like  the  ed  utility
952              global substitute) in $0 or in the in argument, when specified.
953
954       index(s, t)
955              Return  the position, in characters, numbering from 1, in string
956              s where string t first occurs, or zero if it does not  occur  at
957              all.
958
959       length[([s])]
960              Return  the  length,  in  characters, of its argument taken as a
961              string, or of the whole record, $0, if there is no argument.
962
963       match(s, ere)
964              Return the position, in characters, numbering from 1, in  string
965              s  where  the extended regular expression ere occurs, or zero if
966              it does not occur at all. RSTART shall be set  to  the  starting
967              position  (which  is the same as the returned value), zero if no
968              match is found; RLENGTH shall  be  set  to  the  length  of  the
969              matched string, -1 if no match is found.
970
971       split(s, a[, fs])
972              Split  the  string  s into array elements a[1], a[2], ..., a[n],
973              and return n. All elements of the array shall be deleted  before
974              the  split  is  performed. The separation shall be done with the
975              ERE fs or with the field separator FS if fs is not  given.  Each
976              array  element  shall  have  a string value when created and, if
977              appropriate, the array element shall  be  considered  a  numeric
978              string (see Expressions in awk). The effect of a null string as
979              the value of fs is unspecified.
980
981       sprintf(fmt, expr, expr, ...)
982              Format the expressions according to the printf format  given  by
983              fmt and return the resulting string.
984
985       sub(ere, repl[, in])
986              Substitute the string repl in place of the first instance of the
987              extended regular expression ERE in string in and return the num‐
988              ber  of  substitutions.  An  ampersand  ( '&' ) appearing in the
989              string repl shall be replaced by the string from in that matches
990              the ERE. An ampersand preceded with a backslash ( '\' ) shall be
991              interpreted as the literal ampersand character. An occurrence of
992              two  consecutive backslashes shall be interpreted as just a sin‐
993              gle literal backslash character. Any other occurrence of a back‐
994              slash  (for  example,  preceding  any  other character) shall be
995              treated as a literal backslash character. Note that if repl is a
996              string literal (the lexical token STRING; see Grammar), the
997              handling of the ampersand character occurs after any lexical
998              processing, including any lexical backslash escape sequence
999              processing. If in is specified and it is not an lvalue (see
1000              Expressions in awk), the behavior is undefined. If in is omitted,
1001              awk shall use the current record ($0) in its place.
1002
1003       substr(s, m[, n])
1004              Return the at most n-character substring of  s  that  begins  at
1005              position m, numbering from 1. If n is omitted, or if n specifies
1006              more characters than are left in the string, the length  of  the
1007              substring shall be limited by the length of the string s.
1008
1009       tolower(s)
1010              Return  a string based on the string s. Each character in s that
1011              is an uppercase letter specified to have a  tolower  mapping  by
1012              the LC_CTYPE category of the current locale shall be replaced in
1013              the returned string by the lowercase  letter  specified  by  the
1014              mapping.  Other  characters  in  s  shall  be  unchanged  in the
1015              returned string.
1016
1017       toupper(s)
1018              Return a string based on the string s. Each character in s  that
1019              is a lowercase letter specified to have a toupper mapping by the
1020              LC_CTYPE category of the  current  locale  is  replaced  in  the
1021              returned  string  by  the uppercase letter specified by the map‐
1022              ping. Other characters  in  s  are  unchanged  in  the  returned
1023              string.
1024
1025
1026       All  of  the  preceding functions that take ERE as a parameter expect a
1027       pattern or a string valued expression that is a regular  expression  as
1028       defined in Regular Expressions.
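
       Several of the string functions above can be exercised together in a
       sketch like this one, assuming a POSIX shell; the names s, parts, and
       strings are illustrative. The return value of match is stored before
       RSTART and RLENGTH are printed so that no evaluation order is assumed.

```shell
strings=$(awk 'BEGIN {
    s = "hello world"
    n = gsub(/o/, "[&]", s)             # & stands for the matched text
    print n, s
    pos = match("foobar", /o+/)         # also sets RSTART and RLENGTH
    print pos, RSTART, RLENGTH
    k = split("a:b:c", parts, ":")      # returns the number of elements
    print k, parts[1], parts[3]
    print substr("abcdef", 2, 3), index("abcdef", "cd"), length("abc"), toupper("abc")
}')
printf '%s\n' "$strings"
```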
1029
1030   Input/Output and General Functions
1031       The input/output and general functions are:
1032
1033       close(expression)
1034              Close  the file or pipe opened by a print or printf statement or
1035              a call to getline with the same  string-valued  expression.  The
1036              limit  on the number of open expression arguments is implementa‐
1037              tion-defined. If the close was successful,  the  function  shall
1038              return zero; otherwise, it shall return non-zero.
1039
1040       expression | getline [var]
1041              Read  a record of input from a stream piped from the output of a
1042              command.  The stream shall be created if no stream is  currently
1043              open  with  the  value  of  expression  as its command name. The
1044              stream created shall be equivalent to one created by a  call  to
1045              the popen() function with the value of expression as the command
1046              argument and a value of r as the mode argument. As long  as  the
1047              stream remains open, subsequent calls in which expression evalu‐
1048              ates to the same string value shall read subsequent records from
1049              the  stream.  The stream shall remain open until the close func‐
1050              tion is called with an expression that  evaluates  to  the  same
1051              string  value. At that time, the stream shall be closed as if by
1052              a call to the pclose() function. If var is omitted,  $0  and  NF
1053              shall  be  set; otherwise, var shall be set and, if appropriate,
1054              it shall be considered a numeric string (see Expressions in
1055              awk).
1056
1057       The  getline  operator  can  form  ambiguous  constructs when there are
1058       unparenthesized operators (including concatenate) to the  left  of  the
1059       '|'  (to  the  beginning  of the expression containing getline). In the
1060       context of the '$' operator, '|' shall behave as  if  it  had  a  lower
1061       precedence than '$'. The result of  evaluating  other  operators  is
1062       unspecified, and conforming applications  shall  parenthesize  properly
1063       all such usages.
1064
1065       getline
1066              Set  $0  to  the  next input record from the current input file.
1067              This form of getline shall set the NF, NR, and FNR variables.
1068
1069       getline var
1070              Set variable var to the next input record from the current input
1071              file  and,  if  appropriate,  var  shall be considered a numeric
1072              string (see Expressions in awk). This form of getline shall set
1073              the FNR and NR variables.
1074
1075       getline [var] < expression
1076              Read  the next record of input from a named file. The expression
1077              shall be evaluated to produce a string that is used as  a  path‐
1078              name.  If  the file of that name is not currently open, it shall
1079              be opened. As long as the stream remains open, subsequent  calls
1080              in  which  expression  evaluates  to the same string value shall
1081              read subsequent records from the file.  The  file  shall  remain
1082              open  until the close function is called with an expression that
1083              evaluates to the same string value. If var is omitted, $0 and NF
1084              shall  be  set; otherwise, var shall be set and, if appropriate,
1085              it shall be considered a numeric string (see Expressions in
1086              awk).
1087
1088       The  getline  operator  can  form  ambiguous  constructs when there are
1089       unparenthesized binary operators (including concatenate) to  the  right
1090       of  the  '<'  (up to the end of the expression containing the getline).
1091       The result of evaluating such a construct is unspecified, and  conform‐
1092       ing applications shall parenthesize properly all such usages.
1093
1094       system(expression)
1095              Execute  the  command given by expression in a manner equivalent
1096              to the system() function defined in the System Interfaces volume
1097              of  IEEE Std 1003.1-2001  and return the exit status of the com‐
1098              mand.
1099
1100
1101       All forms of getline shall return 1 for successful input, zero for end-
1102       of-file, and -1 for an error.
1103
1104       Where  strings are used as the name of a file or pipeline, the applica‐
1105       tion shall ensure that the strings are textually identical.  The termi‐
1106       nology  "same  string  value"  implies  that "equivalent strings", even
1107       those that differ only by <space>s, represent different files.
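
       Reading from a command with getline can be sketched as follows, assuming
       a POSIX shell; the command string and the names cmd, line, last, and
       read_back are illustrative. The getline expression is parenthesized, as
       recommended above, and its result is compared with zero so that both
       end-of-file (0) and an error (-1) end the loop.

```shell
read_back=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    n = 0
    while ((cmd | getline line) > 0) {
        n++
        last = line
    }
    close(cmd)      # same string value as was used to open the stream
    print n, last
}')
printf '%s\n' "$read_back"
```

       Keeping the command in one variable is a simple way to guarantee that
       the string given to close() is textually identical to the one used with
       getline.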
1108
1109   User-Defined Functions
1110       The awk language also provides user-defined functions.  Such  functions
1111       can be defined as:
1112
1113
1114              function name([parameter, ...]) { statements }
1115
1116       A  function  can be referred to anywhere in an awk program; in particu‐
1117       lar, its use can precede its definition. The scope  of  a  function  is
1118       global.
1119
1120       Function  parameters,  if present, can be either scalars or arrays; the
1121       behavior is undefined if an array name is passed as  a  parameter  that
1122       the function uses as a scalar, or if a scalar expression is passed as a
1123       parameter that the function uses as an array. Function parameters shall
1124       be passed by value if scalar and by reference if array name.
1125
1126       The  number of parameters in the function definition need not match the
1127       number of parameters in the function call. Excess formal parameters can
1128       be  used as local variables. If fewer arguments are supplied in a func‐
1129       tion call than are in the function  definition,  the  extra  parameters
1130       that  are  used  in  the function body as scalars shall evaluate to the
1131       uninitialized value until they are otherwise initialized, and the extra
1132       parameters  that  are  used  in  the  function  body as arrays shall be
1133       treated as uninitialized arrays where each  element  evaluates  to  the
1134       uninitialized value until otherwise initialized.
1135
1136       When  invoking  a  function,  no  white space can be placed between the
1137       function name and the opening parenthesis. Function calls can be nested
1138       and  recursive  calls  can be made upon functions. Upon return from any
1139       nested or recursive function call, the values of  all  of  the  calling
1140       function's  parameters  shall be unchanged, except for array parameters
1141       passed by reference. The return statement  can  be  used  to  return  a
1142       value.  If a return statement appears outside of a function definition,
1143       the behavior is undefined.
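
       A recursive call with a return value might look like the
       following sketch (the function name fact is an assumption made
       for illustration). Note that no white space separates the
       function name from the opening parenthesis in the calls:

```shell
# Each nested call gets its own copy of the scalar parameter n,
# so n still holds its value when the nested call returns.
out=$(awk '
function fact(n) {
    if (n <= 1)
        return 1
    return n * fact(n - 1)
}
BEGIN { print fact(5) }')
printf '%s\n' "$out"
```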
1144
1145       In the function definition, <newline>s shall  be  optional  before  the
1146       opening  brace  and  after  the closing brace. Function definitions can
1147       appear anywhere in the program where a pattern-action pair is allowed.
1148
1149   Grammar
       The grammar in this section and the lexical conventions in the
       following section shall together describe the syntax for awk
       programs. The general conventions for this style of grammar are
       described in Grammar Conventions. A valid program can be represented
       as the non-terminal symbol program in the grammar. This formal syntax
       shall take precedence over the preceding text syntax description.
1156
1157
1158              %token NAME NUMBER STRING ERE
1159              %token FUNC_NAME   /* Name followed by '(' without white space. */
1160
1161
1162              /* Keywords  */
1163              %token       Begin   End
1164              /*          'BEGIN' 'END'                            */
1165
1166
1167              %token       Break   Continue   Delete   Do   Else
1168              /*          'break' 'continue' 'delete' 'do' 'else'  */
1169
1170
1171              %token       Exit   For   Function   If   In
1172              /*          'exit' 'for' 'function' 'if' 'in'        */
1173
1174
1175              %token       Next   Print   Printf   Return   While
1176              /*          'next' 'print' 'printf' 'return' 'while' */
1177
1178
1179              /* Reserved function names */
1180              %token BUILTIN_FUNC_NAME
1181                          /* One token for the following:
1182                           * atan2 cos sin exp log sqrt int rand srand
1183                           * gsub index length match split sprintf sub
1184                           * substr tolower toupper close system
1185                           */
1186              %token GETLINE
1187                          /* Syntactically different from other built-ins. */
1188
1189
1190              /* Two-character tokens. */
1191              %token ADD_ASSIGN SUB_ASSIGN MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN POW_ASSIGN
1192              /*     '+='       '-='       '*='       '/='       '%='       '^=' */
1193
1194
1195              %token OR   AND  NO_MATCH   EQ   LE   GE   NE   INCR  DECR  APPEND
1196              /*     '||' '&&' '!~' '==' '<=' '>=' '!=' '++'  '--'  '>>'   */
1197
1198
1199              /* One-character tokens. */
1200              %token '{' '}' '(' ')' '[' ']' ',' ';' NEWLINE
1201              %token '+' '-' '*' '%' '^' '!' '>' '<' '|' '?' ':' '~' '$' '='
1202
1203
1204              %start program
1205              %%
1206
1207
1208              program          : item_list
1209                               | actionless_item_list
1210                               ;
1211
1212
1213              item_list        : newline_opt
1214                               | actionless_item_list item terminator
1215                               | item_list            item terminator
1216                               | item_list          action terminator
1217                               ;
1218
1219
1220              actionless_item_list : item_list            pattern terminator
1221                               | actionless_item_list pattern terminator
1222                               ;
1223
1224
1225              item             : pattern action
1226                               | Function NAME      '(' param_list_opt ')'
1227                                     newline_opt action
1228                               | Function FUNC_NAME '(' param_list_opt ')'
1229                                     newline_opt action
1230                               ;
1231
1232
1233              param_list_opt   : /* empty */
1234                               | param_list
1235                               ;
1236
1237
1238              param_list       : NAME
1239                               | param_list ',' NAME
1240                               ;
1241
1242
1243              pattern          : Begin
1244                               | End
1245                               | expr
1246                               | expr ',' newline_opt expr
1247                               ;
1248
1249
1250              action           : '{' newline_opt                             '}'
1251                               | '{' newline_opt terminated_statement_list   '}'
1252                               | '{' newline_opt unterminated_statement_list '}'
1253                               ;
1254
1255
1256              terminator       : terminator ';'
1257                               | terminator NEWLINE
1258                               |            ';'
1259                               |            NEWLINE
1260                               ;
1261
1262
1263              terminated_statement_list : terminated_statement
1264                               | terminated_statement_list terminated_statement
1265                               ;
1266
1267
1268              unterminated_statement_list : unterminated_statement
1269                               | terminated_statement_list unterminated_statement
1270                               ;
1271
1272
1273              terminated_statement : action newline_opt
1274                               | If '(' expr ')' newline_opt terminated_statement
1275                               | If '(' expr ')' newline_opt terminated_statement
1276                                     Else newline_opt terminated_statement
1277                               | While '(' expr ')' newline_opt terminated_statement
1278                               | For '(' simple_statement_opt ';'
1279                                    expr_opt ';' simple_statement_opt ')' newline_opt
1280                                    terminated_statement
1281                               | For '(' NAME In NAME ')' newline_opt
1282                                    terminated_statement
1283                               | ';' newline_opt
1284                               | terminatable_statement NEWLINE newline_opt
1285                               | terminatable_statement ';'     newline_opt
1286                               ;
1287
1288
1289              unterminated_statement : terminatable_statement
1290                               | If '(' expr ')' newline_opt unterminated_statement
1291                               | If '(' expr ')' newline_opt terminated_statement
1292                                    Else newline_opt unterminated_statement
1293                               | While '(' expr ')' newline_opt unterminated_statement
1294                               | For '(' simple_statement_opt ';'
1295                                expr_opt ';' simple_statement_opt ')' newline_opt
1296                                    unterminated_statement
1297                               | For '(' NAME In NAME ')' newline_opt
1298                                    unterminated_statement
1299                               ;
1300
1301
1302              terminatable_statement : simple_statement
1303                               | Break
1304                               | Continue
1305                               | Next
1306                               | Exit expr_opt
1307                               | Return expr_opt
1308                               | Do newline_opt terminated_statement While '(' expr ')'
1309                               ;
1310
1311
1312              simple_statement_opt : /* empty */
1313                               | simple_statement
1314                               ;
1315
1316
1317              simple_statement : Delete NAME '[' expr_list ']'
1318                               | expr
1319                               | print_statement
1320                               ;
1321
1322
1323              print_statement  : simple_print_statement
1324                               | simple_print_statement output_redirection
1325                               ;
1326
1327
1328              simple_print_statement : Print  print_expr_list_opt
1329                               | Print  '(' multiple_expr_list ')'
1330                               | Printf print_expr_list
1331                               | Printf '(' multiple_expr_list ')'
1332                               ;
1333
1334
1335              output_redirection : '>'    expr
1336                               | APPEND expr
1337                               | '|'    expr
1338                               ;
1339
1340
1341              expr_list_opt    : /* empty */
1342                               | expr_list
1343                               ;
1344
1345
1346              expr_list        : expr
1347                               | multiple_expr_list
1348                               ;
1349
1350
1351              multiple_expr_list : expr ',' newline_opt expr
1352                               | multiple_expr_list ',' newline_opt expr
1353                               ;
1354
1355
1356              expr_opt         : /* empty */
1357                               | expr
1358                               ;
1359
1360
1361              expr             : unary_expr
1362                               | non_unary_expr
1363                               ;
1364
1365
1366              unary_expr       : '+' expr
1367                               | '-' expr
1368                               | unary_expr '^'      expr
1369                               | unary_expr '*'      expr
1370                               | unary_expr '/'      expr
1371                               | unary_expr '%'      expr
1372                               | unary_expr '+'      expr
1373                               | unary_expr '-'      expr
1374                               | unary_expr          non_unary_expr
1375                               | unary_expr '<'      expr
1376                               | unary_expr LE       expr
1377                               | unary_expr NE       expr
1378                               | unary_expr EQ       expr
1379                               | unary_expr '>'      expr
1380                               | unary_expr GE       expr
1381                               | unary_expr '~'      expr
1382                               | unary_expr NO_MATCH expr
1383                               | unary_expr In NAME
1384                               | unary_expr AND newline_opt expr
1385                               | unary_expr OR  newline_opt expr
1386                               | unary_expr '?' expr ':' expr
1387                               | unary_input_function
1388                               ;
1389
1390
1391              non_unary_expr   : '(' expr ')'
1392                               | '!' expr
1393                               | non_unary_expr '^'      expr
1394                               | non_unary_expr '*'      expr
1395                               | non_unary_expr '/'      expr
1396                               | non_unary_expr '%'      expr
1397                               | non_unary_expr '+'      expr
1398                               | non_unary_expr '-'      expr
1399                               | non_unary_expr          non_unary_expr
1400                               | non_unary_expr '<'      expr
1401                               | non_unary_expr LE       expr
1402                               | non_unary_expr NE       expr
1403                               | non_unary_expr EQ       expr
1404                               | non_unary_expr '>'      expr
1405                               | non_unary_expr GE       expr
1406                               | non_unary_expr '~'      expr
1407                               | non_unary_expr NO_MATCH expr
1408                               | non_unary_expr In NAME
1409                               | '(' multiple_expr_list ')' In NAME
1410                               | non_unary_expr AND newline_opt expr
1411                               | non_unary_expr OR  newline_opt expr
1412                               | non_unary_expr '?' expr ':' expr
1413                               | NUMBER
1414                               | STRING
1415                               | lvalue
1416                               | ERE
1417                               | lvalue INCR
1418                               | lvalue DECR
1419                               | INCR lvalue
1420                               | DECR lvalue
1421                               | lvalue POW_ASSIGN expr
1422                               | lvalue MOD_ASSIGN expr
1423                               | lvalue MUL_ASSIGN expr
1424                               | lvalue DIV_ASSIGN expr
1425                               | lvalue ADD_ASSIGN expr
1426                               | lvalue SUB_ASSIGN expr
1427                               | lvalue '=' expr
1428                               | FUNC_NAME '(' expr_list_opt ')'
1429                                    /* no white space allowed before '(' */
1430                               | BUILTIN_FUNC_NAME '(' expr_list_opt ')'
1431                               | BUILTIN_FUNC_NAME
1432                               | non_unary_input_function
1433                               ;
1434
1435
1436              print_expr_list_opt : /* empty */
1437                               | print_expr_list
1438                               ;
1439
1440
1441              print_expr_list  : print_expr
1442                               | print_expr_list ',' newline_opt print_expr
1443                               ;
1444
1445
1446              print_expr       : unary_print_expr
1447                               | non_unary_print_expr
1448                               ;
1449
1450
1451              unary_print_expr : '+' print_expr
1452                               | '-' print_expr
1453                               | unary_print_expr '^'      print_expr
1454                               | unary_print_expr '*'      print_expr
1455                               | unary_print_expr '/'      print_expr
1456                               | unary_print_expr '%'      print_expr
1457                               | unary_print_expr '+'      print_expr
1458                               | unary_print_expr '-'      print_expr
1459                               | unary_print_expr          non_unary_print_expr
1460                               | unary_print_expr '~'      print_expr
1461                               | unary_print_expr NO_MATCH print_expr
1462                               | unary_print_expr In NAME
1463                               | unary_print_expr AND newline_opt print_expr
1464                               | unary_print_expr OR  newline_opt print_expr
1465                               | unary_print_expr '?' print_expr ':' print_expr
1466                               ;
1467
1468
1469              non_unary_print_expr : '(' expr ')'
1470                               | '!' print_expr
1471                               | non_unary_print_expr '^'      print_expr
1472                               | non_unary_print_expr '*'      print_expr
1473                               | non_unary_print_expr '/'      print_expr
1474                               | non_unary_print_expr '%'      print_expr
1475                               | non_unary_print_expr '+'      print_expr
1476                               | non_unary_print_expr '-'      print_expr
1477                               | non_unary_print_expr          non_unary_print_expr
1478                               | non_unary_print_expr '~'      print_expr
1479                               | non_unary_print_expr NO_MATCH print_expr
1480                               | non_unary_print_expr In NAME
1481                               | '(' multiple_expr_list ')' In NAME
1482                               | non_unary_print_expr AND newline_opt print_expr
1483                               | non_unary_print_expr OR  newline_opt print_expr
1484                               | non_unary_print_expr '?' print_expr ':' print_expr
1485                               | NUMBER
1486                               | STRING
1487                               | lvalue
1488                               | ERE
1489                               | lvalue INCR
1490                               | lvalue DECR
1491                               | INCR lvalue
1492                               | DECR lvalue
1493                               | lvalue POW_ASSIGN print_expr
1494                               | lvalue MOD_ASSIGN print_expr
1495                               | lvalue MUL_ASSIGN print_expr
1496                               | lvalue DIV_ASSIGN print_expr
1497                               | lvalue ADD_ASSIGN print_expr
1498                               | lvalue SUB_ASSIGN print_expr
1499                               | lvalue '=' print_expr
1500                               | FUNC_NAME '(' expr_list_opt ')'
1501                                   /* no white space allowed before '(' */
1502                               | BUILTIN_FUNC_NAME '(' expr_list_opt ')'
1503                               | BUILTIN_FUNC_NAME
1504                               ;
1505
1506
1507              lvalue           : NAME
1508                               | NAME '[' expr_list ']'
1509                               | '$' expr
1510                               ;
1511
1512
1513              non_unary_input_function : simple_get
1514                               | simple_get '<' expr
1515                               | non_unary_expr '|' simple_get
1516                               ;
1517
1518
1519              unary_input_function : unary_expr '|' simple_get
1520                               ;
1521
1522
1523              simple_get       : GETLINE
1524                               | GETLINE lvalue
1525                               ;
1526
1527
1528              newline_opt      : /* empty */
1529                               | newline_opt NEWLINE
1530                               ;
1531
1532       This grammar has several ambiguities that shall be resolved as follows:
1533
        * Operator precedence and associativity shall be as described in
          Expressions in Decreasing Precedence in awk.
1536
1537        * In case of ambiguity, an else shall  be  associated  with  the  most
1538          immediately preceding if that would satisfy the grammar.
1539
1540        * In  some  contexts,  a slash ( '/' ) that is used to surround an ERE
1541          could also be the division operator. This shall be resolved in  such
1542          a  way  that wherever the division operator could appear, a slash is
1543          assumed to be the division operator. (There  is  no  unary  division
1544          operator.)
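
       Two of these resolutions can be observed with a short sketch (the
       variable names and output strings are illustrative only). The else
       binds to the nearest if, and a slash that appears where division is
       possible is taken as division:

```shell
out=$(awk 'BEGIN {
    x = 1; y = 0
    # If the else belonged to the outer if, nothing would be printed here.
    if (x)
        if (y) print "inner-true"
        else   print "else-belongs-to-inner-if"
    # Both slashes follow an operand, so both are division operators.
    print 8 / 2 / 2
}')
printf '%s\n' "$out"
```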
1545
1546       One  convention  that  might  not be obvious from the formal grammar is
1547       where <newline>s are acceptable. There are several  obvious  placements
1548       such  as terminating a statement, and a backslash can be used to escape
1549       <newline>s between any lexical tokens. In addition, <newline>s  without
1550       backslashes  can  follow a comma, an open brace, logical AND operator (
1551       "&&" ), logical OR operator ( "||" ), the do keyword, the else keyword,
1552       and  the  closing  parenthesis  of  an if, for, or while statement. For
1553       example:
1554
1555
1556              { print  $1,
1557                       $2 }
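
       As a further sketch of the same convention (the input line and the
       messages are assumptions for this example), unescaped <newline>s
       are placed after an open brace, after "&&", after the closing
       parenthesis of an if statement, and after else:

```shell
out=$(printf '3 4\n' | awk '{
    if ($1 > 0 &&
        $2 > 0)
        print "both positive"
    else
        print "not both"
}')
printf '%s\n' "$out"
```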
1558
1559   Lexical Conventions
1560       The lexical conventions for awk programs, with respect to the preceding
1561       grammar, shall be as follows:
1562
1563        1. Except  as noted, awk shall recognize the longest possible token or
1564           delimiter beginning at a given point.
1565
1566        2. A comment shall consist of any characters beginning with the number
1567           sign character and terminated by, but excluding the next occurrence
1568           of, a <newline>. Comments shall have no effect, except  to  delimit
1569           lexical tokens.
1570
1571        3. The <newline> shall be recognized as the token NEWLINE.
1572
1573        4. A  backslash  character  immediately  followed by a <newline> shall
1574           have no effect.
1575
        5. The token STRING shall represent a string constant. A string
           constant shall begin with the character '"'. Within a string
           constant, a backslash character shall be considered to begin an
           escape sequence as specified in the table in the Base Definitions
           volume of IEEE Std 1003.1-2001, Chapter 5, File Format Notation
           ( '\\', '\a', '\b', '\f', '\n', '\r', '\t', '\v' ). In addition,
           the escape sequences in Expressions in Decreasing Precedence in
           awk shall be recognized. A <newline> shall not occur within a
           string constant. A string constant shall be terminated by the
           first unescaped occurrence of the character '"' after the one
           that begins the string constant. The value of the string shall
           be the sequence of all unescaped characters and values of escape
           sequences between, but not including, the two delimiting '"'
           characters.
1589
1590        6. The token ERE represents an extended regular  expression  constant.
1591           An  ERE  constant  shall begin with the slash character.  Within an
1592           ERE constant, a backslash character shall be considered to begin an
1593           escape  sequence  as specified in the table in the Base Definitions
1594           volume of IEEE Std 1003.1-2001, Chapter 5, File Format Notation. In
1595           addition,  the escape sequences in Expressions in Decreasing Prece‐
1596           dence in awk shall be recognized. The application shall ensure that
1597           a  <newline> does not occur within an ERE constant. An ERE constant
1598           shall be terminated by the first unescaped occurrence of the  slash
1599           character  after the one that begins the ERE constant. The extended
1600           regular expression represented by the ERE  constant  shall  be  the
1601           sequence of all unescaped characters and values of escape sequences
1602           between, but not including, the two delimiting slash characters.
1603
1604        7. A <blank> shall have no effect, except to delimit lexical tokens or
1605           within STRING or ERE tokens.
1606
1607        8. The  token  NUMBER shall represent a numeric constant. Its form and
1608           numeric value shall be equivalent to either of the tokens floating-
1609           constant  or  integer-constant  as specified by the ISO C standard,
1610           with the following exceptions:
1611
1612            a. An integer constant cannot begin with 0x or include  the  hexa‐
1613               decimal  digits  'a',  'b',  'c', 'd', 'e', 'f', 'A', 'B', 'C',
1614               'D', 'E', or 'F' .
1615
1616            b. The value of an integer constant  beginning  with  0  shall  be
1617               taken in decimal rather than octal.
1618
1619            c. An integer constant cannot include a suffix ( 'u', 'U', 'l', or
1620               'L' ).
1621
1622            d. A floating constant cannot include a suffix ( 'f', 'F', 'l', or
1623               'L' ).
1624
           If the value is too large or too small to be representable (see
           Concepts Derived from the ISO C Standard), the behavior is
           undefined.
1627
1628        9. A sequence of underscores, digits, and alphabetics from the  porta‐
1629           ble   character   set   (see   the   Base   Definitions  volume  of
1630           IEEE Std 1003.1-2001, Section 6.1, Portable Character Set),  begin‐
1631           ning with an underscore or alphabetic, shall be considered a word.
1632
1633       10. The  following words are keywords that shall be recognized as indi‐
1634           vidual tokens; the name of the token is the same as the keyword:
1635

           BEGIN      delete   END    function   in      printf
           break      do       exit   getline    next    return
           continue   else     for    if         print   while

1642       11. The following words are names of built-in functions  and  shall  be
1643           recognized as the token BUILTIN_FUNC_NAME:
1644

           atan2   gsub     log     split     sub      toupper
           close   index    match   sprintf   substr
           cos     int      rand    sqrt      system
           exp     length   sin     srand     tolower

1652       The  above-listed  keywords and names of built-in functions are consid‐
1653       ered reserved words.
1654
1655       12. The token NAME shall consist of a word that is not a keyword  or  a
1656           name  of a built-in function and is not followed immediately (with‐
1657           out any delimiters) by the '(' character.
1658
1659       13. The token FUNC_NAME shall consist of a word that is not  a  keyword
1660           or a name of a built-in function, followed immediately (without any
1661           delimiters) by the '(' character. The '(' character  shall  not  be
1662           included as part of the token.
1663
1664       14. The  following  two-character  sequences shall be recognized as the
1665           named tokens:
1666
1667                      Token Name   Sequence   Token Name   Sequence
1668                      ADD_ASSIGN   +=         NO_MATCH     !~
1669                      SUB_ASSIGN   -=         EQ           ==
1670                      MUL_ASSIGN   *=         LE           <=
1671                      DIV_ASSIGN   /=         GE           >=
1672                      MOD_ASSIGN   %=         NE           !=
1673                      POW_ASSIGN   ^=         INCR         ++
1674                      OR           ||         DECR         --
1675                      AND          &&         APPEND       >>
1676
1677       15. The following single characters shall be recognized as tokens whose
1678           names are the character:
1679
1680
1681           <newline> { } ( ) [ ] , ; + - * % ^ ! > < | ? : ~ $ =
1682
1683       There  is  a lexical ambiguity between the token ERE and the tokens '/'
1684       and DIV_ASSIGN. When an input sequence begins with a slash character in
1685       any syntactic context where the token '/' or DIV_ASSIGN could appear as
1686       the next token in a valid program, the longer of those two tokens  that
1687       can  be  recognized shall be recognized. In any other syntactic context
1688       where the token ERE could appear as the next token in a valid  program,
1689       the token ERE shall be recognized.
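
       The two contexts can be seen side by side in the following sketch
       (the input line is an assumption for this example). After the "~"
       operator no division is possible, so the slash begins an ERE token;
       after the operand $2 a division operator can appear, so the slash
       is recognized as '/':

```shell
out=$(printf 'a 10 2\n' | awk '
# /a/ is an ERE token; the slash between $2 and $3 is the division operator.
$1 ~ /a/ { print $2 / $3 }
')
printf '%s\n' "$out"
```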
1690

EXIT STATUS

1692       The following exit values shall be returned:
1693
1694        0     All input files were processed successfully.
1695
1696       >0     An error occurred.
1697
1698
1699       The  exit  status  can  be  altered within the program by using an exit
1700       expression.
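
       A minimal sketch of setting the exit status from within a program
       (the value 3 and the shell variable name status are arbitrary
       choices for illustration):

```shell
# The expression given to exit becomes the exit status of awk.
status=0
awk 'BEGIN { exit 3 }' || status=$?
printf '%s\n' "$status"
```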
1701

CONSEQUENCES OF ERRORS

1703       If any file operand is specified and the named file cannot be accessed,
1704       awk  shall  write  a diagnostic message to standard error and terminate
1705       without any further action.
1706
1707       If the program specified by either the program operand  or  a  progfile
1708       operand  is  not  a  valid  awk  program  (as specified in the EXTENDED
1709       DESCRIPTION section), the behavior is undefined.
1710
1711       The following sections are informative.
1712

APPLICATION USAGE

1714       The index, length, match, and substr functions should not  be  confused
1715       with  similar  functions  in  the ISO C standard; the awk versions deal
1716       with characters, while the ISO C standard deals with bytes.
1717
1718       Because the concatenation operation is represented by adjacent  expres‐
1719       sions  rather  than  an explicit operator, it is often necessary to use
1720       parentheses to enforce the proper evaluation precedence.
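
       The effect of parentheses on concatenation can be sketched as
       follows (the variable names are illustrative only). Multiplication
       binds more tightly than concatenation, so the two assignments
       produce different values:

```shell
out=$(awk 'BEGIN {
    n = 2
    a = "1" n * 2      # parsed as "1" (n * 2): the string "14"
    b = ("1" n) * 2    # concatenation first ("12"), then multiplied: 24
    print a, b
}')
printf '%s\n' "$out"
```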
1721

EXAMPLES

       The awk program specified in the command line is most easily specified
       within single-quotes (for example, 'program'), because awk programs
       commonly contain characters that are special to the shell, including
       double-quotes. In the cases where an awk program contains single-quote
       characters, it is usually easiest to specify most of the program as
       strings within single-quotes concatenated by the shell with quoted
       single-quote characters. For example:
1730
1731
1732              awk '/'\''/ { print "quote:", $0 }'
1733
1734       prints all lines from the  standard  input  containing  a  single-quote
1735       character, prefixed with quote:.
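
       The quoting technique can be tried with a small sketch; the two
       input lines supplied through printf are assumptions made for this
       example:

```shell
# The shell joins '/' + \' + '/ { ... }' into the awk program:  /'/ { ... }
out=$(printf "it's here\nplain line\n" | awk '/'\''/ { print "quote:", $0 }')
printf '%s\n' "$out"
```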
1736
1737       The following are examples of simple awk programs:
1738
1739        1. Write  to  the standard output all input lines for which field 3 is
1740           greater than 5:
1741
1742
1743           $3 > 5
1744
1745        2. Write every tenth line:
1746
1747
1748           (NR % 10) == 0
1749
1750        3. Write any line with a substring matching the regular expression:
1751
1752
1753           /(G|D)(2[0-9][[:alpha:]]*)/
1754
1755        4. Print any line with a substring containing a 'G' or  'D',  followed
1756           by  a sequence of digits and characters.  This example uses charac‐
1757           ter classes digit and alpha to match language-independent digit and
1758           alphabetic characters respectively:
1759
1760
1761           /(G|D)([[:digit:][:alpha:]]*)/
1762
1763        5. Write  any  line  in  which  the  second  field matches the regular
1764           expression and the fourth field does not:
1765
1766
1767           $2 ~ /xyz/ && $4 !~ /xyz/
1768
1769        6. Write any line in which the second field contains a backslash:
1770
1771
1772           $2 ~ /\\/
1773
1774        7. Write any line in which the second field contains a backslash. Note
1775           that  backslash escapes are interpreted twice; once in lexical pro‐
1776           cessing of the string and once in processing  the  regular  expres‐
1777           sion:
1778
1779
1780           $2 ~ "\\\\"
1781
1782        8. Write the second to the last and the last field in each line. Sepa‐
1783           rate the fields by a colon:
1784
1785
1786           {OFS=":";print $(NF-1), $NF}
1787
1788        9. Write the line number and number of fields in each line. The  three
1789           strings  representing the line number, the colon, and the number of
1790           fields are concatenated and that string is written to standard out‐
1791           put:
1792
1793
1794           {print NR ":" NF}
1795
1796       10. Write lines longer than 72 characters:
1797
1798
1799           length($0) > 72
1800
1801       11. Write the first two fields in opposite order separated by OFS:
1802
1803
1804           { print $2, $1 }
1805
1806       12. Same,  with  input  fields  separated  by  a  comma or <space>s and
1807           <tab>s, or both:
1808
1809
1810           BEGIN { FS = ",[ \t]*|[ \t]+" }
1811                 { print $2, $1 }
1812
1813       13. Add up the first column, print sum, and average:
1814
1815
1816                {s += $1 }
1817           END   {print "sum is ", s, " average is", s/NR}
1818
1819       14. Write fields in reverse order, one per line  (many  lines  out  for
1820           each line in):
1821
1822
1823           { for (i = NF; i > 0; --i) print $i }
1824
1825       15. Write all lines between occurrences of the strings start and stop:
1826
1827
1828           /start/, /stop/
1829
1830       16. Write  all  lines  whose first field is different from the previous
1831           one:
1832
1833
1834           $1 != prev { print; prev = $1 }
1835
1836       17. Simulate echo:
1837
1838
           BEGIN  {
                   for (i = 1; i < ARGC; ++i)
                           printf("%s%s", ARGV[i], i==ARGC-1?"\n":" ")
           }
1843
1844       18. Write the path prefixes contained in the PATH environment variable,
1845           one per line:
1846
1847
           BEGIN  {
                   n = split (ENVIRON["PATH"], path, ":")
                   for (i = 1; i <= n; ++i)
                           print path[i]
           }
1853
1854       19. If there is a file named input containing page headers of the form:
1855
1856
1857           Page #
1858
1859       and a file named program that contains:
1860
1861
1862              /Page/   { $2 = n++; }
1863                       { print }
1864
1865       then the command line:
1866
1867
1868              awk -f program n=5 input
1869
1870       prints the file input, filling in page numbers starting at 5.
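
       The last example can be tried end to end with a short shell
       session. This is only a sketch: the scratch directory created with
       mktemp and the three-line sample input are assumptions made for the
       demonstration, not part of the standard text.

```shell
# Build the two files from example 19 in a scratch directory (assumed setup).
pg_dir=$(mktemp -d)
printf 'Page #\nsome text\nPage #\n' > "$pg_dir/input"
printf '%s\n' '/Page/   { $2 = n++; }' '         { print }' > "$pg_dir/program"

# n=5 is an assignment operand: it is processed before "input" is opened.
pages=$(cd "$pg_dir" && awk -f program n=5 input)
printf '%s\n' "$pages"
rm -rf "$pg_dir"
```

       With this sample input the two header lines come out as "Page 5"
       and "Page 6", and the line without a header is copied unchanged.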
1871

RATIONALE

1873       This  description  is based on the new awk, "nawk", (see the referenced
1874       The AWK Programming Language), which introduced a number  of  new  fea‐
1875       tures to the historical awk:
1876
1877        1. New keywords: delete, do, function, return
1878
1879        2. New  built-in functions: atan2, close, cos, gsub, match, rand, sin,
1880           srand, sub, system
1881
1882        3. New predefined variables: FNR, ARGC, ARGV, RSTART, RLENGTH, SUBSEP
1883
1884        4. New expression operators: ?, :, ,, ^
1885
1886        5. The FS variable and the third argument to  split,  now  treated  as
1887           extended regular expressions.
1888
1889        6. The  operator  precedence, changed to more closely match the C lan‐
1890           guage.  Two examples of code that operate differently are:
1891
1892
1893           while ( n /= 10 > 1) ...
1894           if (!"wk" ~ /bwk/) ...
1895
1896       Several features have been added based on newer implementations of awk:
1897
1898        * Multiple instances of -f progfile are permitted.
1899
1900        * The new option -v assignment.
1901
1902        * The new predefined variable ENVIRON.
1903
1904        * New built-in functions toupper and tolower.
1905
1906        * More formatting capabilities are added to printf to match the  ISO C
1907          standard.
1908
1909       The  overall awk syntax has always been based on the C language, with a
1910       few features from the shell command language and other sources. Because
1911       of this, it is not completely compatible with any other language, which
1912       has caused confusion for some users.  It is not the intent of the stan‐
1913       dard developers to address such issues.  A few relatively minor changes
1914       toward making the language more compatible with the ISO C standard were
1915       made;  most  of  these  changes  are based on similar changes in recent
1916       implementations, as described above. There  remain  several  C-language
1917       conventions  that  are not in awk. One of the notable ones is the comma
1918       operator, which is commonly used to specify multiple expressions in the
1919       C  language  for statement. Also, there are various places where awk is
1920       more restrictive than the C language regarding the type  of  expression
1921       that  can  be used in a given context. These limitations are due to the
1922       different features that the awk language does provide.
1923
1924       Regular expressions in awk have been extended somewhat from  historical
1925       implementations  to  make  them  a  pure  superset  of extended regular
1926       expressions, as defined by IEEE Std 1003.1-2001 (see the  Base  Defini‐
1927       tions  volume  of  IEEE Std 1003.1-2001,  Section 9.4, Extended Regular
1928       Expressions).  The main extensions  are  internationalization  features
1929       and  interval expressions.  Historical implementations of awk have long
1930       supported backslash escape sequences as an extension to extended  regu‐
1931       lar expressions, and this extension has been retained despite inconsis‐
1932       tency with other utilities. The number of escape  sequences  recognized
1933       in  both extended regular expressions and strings has varied (generally
1934       increasing with time)  among  implementations.  The  set  specified  by
1935       IEEE Std 1003.1-2001  includes  most sequences known to be supported by
       popular implementations and by the ISO C standard. One sequence that is
       not supported is the hexadecimal value escape beginning with '\x'. This
1938       would allow values expressed in more than 9 bits to be used within  awk
1939       as in the ISO C standard. However, because this syntax has a non-deter‐
1940       ministic length, it does not permit the subsequent character  to  be  a
1941       hexadecimal  digit. This limitation can be dealt with in the C language
1942       by the use of lexical string concatenation. In the awk  language,  con‐
1943       catenation  could  also be a solution for strings, but not for extended
1944       regular expressions (either lexical ERE tokens or strings used  dynami‐
1945       cally  as regular expressions). Because of this limitation, the feature
1946       has not been added to IEEE Std 1003.1-2001.
1947
1948       When a string variable is used in a context where an  extended  regular
1949       expression normally appears (where the lexical token ERE is used in the
1950       grammar) the string does not contain the literal slashes.
1951
1952       Some versions of awk allow the form:
1953
1954
1955              func name(args, ... ) { statements }
1956
1957       This has been deprecated by the authors of the language, who asked that
1958       it not be specified.
1959
1960       Historical  implementations of awk produce an error if a next statement
1961       is executed in a BEGIN action, and cause awk to  terminate  if  a  next
1962       statement is executed in an END action. This behavior has not been doc‐
1963       umented, and it was not believed that it was necessary  to  standardize
1964       it.
1965
1966       The  specification  of conversions between string and numeric values is
1967       much more detailed than in the documentation of historical  implementa‐
1968       tions or in the referenced The AWK Programming Language.  Although most
1969       of the behavior is designed to be intuitive, the details are  necessary
1970       to  ensure  compatible behavior from different implementations. This is
1971       especially important in relational expressions since the types  of  the
1972       operands determine whether a string or numeric comparison is performed.
1973       From the perspective of an application writer, it is usually sufficient
1974       to  expect  intuitive behavior and to force conversions (by adding zero
1975       or concatenating a null string) when the type of an expression does not
1976       obviously match what is needed. The intent has been to specify histori‐
1977       cal practice in almost all cases. The one exception is that, in histor‐
1978       ical  implementations, variables and constants maintain both string and
1979       numeric values after their original value is converted by any use. This
1980       means  that referencing a variable or constant can have unexpected side
1981       effects. For example, with  historical  implementations  the  following
1982       program:
1983
1984
1985              {
1986                  a = "+2"
1987                  b = 2
1988                  if (NR % 2)
1989                      c = a + b
1990                  if (a == b)
1991                      print "numeric comparison"
1992                  else
1993                      print "string comparison"
1994              }
1995
1996       would  perform a numeric comparison (and output numeric comparison) for
1997       each odd-numbered line, but perform a  string  comparison  (and  output
1998       string  comparison)  for  each even-numbered line. IEEE Std 1003.1-2001
1999       ensures that comparisons will be numeric if necessary. With  historical
2000       implementations, the following program:
2001
2002
2003              BEGIN {
2004                  OFMT = "%e"
2005                  print 3.14
2006                  OFMT = "%f"
2007                  print 3.14
2008              }
2009
       would output "3.140000e+00" twice, because in the second print
       statement the constant "3.14" would have a string value from the
       previous conversion. IEEE Std 1003.1-2001 requires that the output of
       the second print statement be "3.140000". The behavior of historical
       implementations was seen as too unintuitive and unpredictable.
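
       Both programs above can be run against a current implementation to
       see the standardized behavior. The sketch below assumes an awk that
       follows these rules; the shell variable names and the two-line
       dummy input are illustrative only.

```shell
# First program: "+2" is a string constant, so a == b is a string comparison
# on every line, whether or not a was used in arithmetic beforehand.
cmp_prog='{
    a = "+2"
    b = 2
    if (NR % 2)
        c = a + b
    if (a == b)
        print "numeric comparison"
    else
        print "string comparison"
}'
printf 'x\ny\n' | awk "$cmp_prog"

# Second program: each print converts 3.14 with the OFMT in effect at that time.
ofmt_prog='BEGIN {
    OFMT = "%e"
    print 3.14
    OFMT = "%f"
    print 3.14
}'
awk "$ofmt_prog"
```

       On a conforming awk the first command should report a string
       comparison for both input lines, and the second should print
       3.140000e+00 followed by 3.140000.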
2015
2016       It  was  pointed out that with the rules contained in early drafts, the
2017       following script would print nothing:
2018
2019
2020              BEGIN {
2021                  y[1.5] = 1
2022                  OFMT = "%e"
2023                  print y[1.5]
2024              }
2025
2026       Therefore, a new variable, CONVFMT, was introduced. The  OFMT  variable
2027       is now restricted to affecting output conversions of numbers to strings
2028       and CONVFMT is used for internal conversions, such  as  comparisons  or
2029       array  indexing.  The  default  value  is the same as that for OFMT, so
2030       unless a program changes CONVFMT (which  no  historical  program  would
2031       do),  it  will receive the historical behavior associated with internal
2032       string conversions.
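
       The split between OFMT and CONVFMT can be observed directly. The
       following is a minimal sketch; the variable names x and s and the
       "%.2f" format are chosen only for illustration.

```shell
convfmt_prog='BEGIN {
    y[1.5] = 1
    OFMT = "%e"          # affects numbers written by print only
    print y[1.5]         # the subscript is still converted with CONVFMT

    CONVFMT = "%.2f"
    x = 3.14159
    s = x ""             # number-to-string conversion uses CONVFMT
    print s
    print 17 ""          # integral values convert as integers regardless
}'
awk "$convfmt_prog"
```

       The array lookup succeeds and prints 1, the concatenation yields
       3.14, and the integer 17 is unaffected by CONVFMT.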
2033
       The POSIX awk lexical and syntactic conventions are specified more
       formally than in other sources. Again, the intent has been to specify
       historical practice. One convention that may not be obvious from the
       formal grammar, as in other verbal descriptions, is where <newline>s
       are acceptable. There are several obvious placements such as
       terminating a
2039       statement, and a backslash can be used to escape <newline>s between any
2040       lexical tokens. In addition, <newline>s without backslashes can  follow
2041       a  comma,  an open brace, a logical AND operator ( "&&" ), a logical OR
2042       operator ( "||" ), the do keyword, the else keyword,  and  the  closing
2043       parenthesis of an if, for, or while statement. For example:
2044
2045
2046              { print $1,
2047                      $2 }
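
       Several of the unescaped <newline> positions listed above can be
       combined in one action. The program below is a sketch with made-up
       sample input; it breaks lines after "&&", after the closing
       parenthesis of if, after a comma, and after else.

```shell
nl_prog='{
    if ($1 > 0 &&
        $2 > 0)
        print $1,
              $2
    else
        print "skip"
}'
printf '1 2\n0 5\n' | awk "$nl_prog"
```

       The first input line satisfies both tests and is echoed as "1 2";
       the second prints "skip".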
2048
2049       The  requirement that awk add a trailing <newline> to the program argu‐
2050       ment text is to simplify the grammar, making it match a  text  file  in
2051       form.  There  is  no  way for an application or test suite to determine
2052       whether a literal <newline> is added or whether awk simply acts  as  if
2053       it did.
2054
2055       IEEE Std 1003.1-2001 requires several changes from historical implemen‐
2056       tations in order to support  internationalization.  Probably  the  most
2057       subtle  of  these is the use of the decimal-point character, defined by
2058       the LC_NUMERIC category of the locale, in representations of  floating-
2059       point  numbers.   This locale-specific character is used in recognizing
2060       numeric input, in converting between strings and numeric values, and in
2061       formatting  output. However, regardless of locale, the period character
2062       (the decimal-point character of the POSIX locale) is the  decimal-point
2063       character  recognized in processing awk programs (including assignments
2064       in command line arguments). This is essentially the same convention  as
2065       the  one  used in the ISO C standard. The difference is that the C lan‐
2066       guage includes the setlocale() function, which permits  an  application
2067       to  modify  its  locale.  Because  of  this capability, a C application
2068       begins executing with its locale set to the C locale, and only executes
2069       in  the  environment-specified  locale after an explicit call to setlo‐
2070       cale(). However, adding such an elaborate new feature to the  awk  lan‐
2071       guage  was seen as inappropriate for IEEE Std 1003.1-2001. It is possi‐
2072       ble to execute an awk program explicitly in any desired locale by  set‐
2073       ting the environment in the shell.
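
       Setting the locale from the shell needs nothing more than an
       environment assignment on the command line. A minimal sketch,
       using the POSIX locale so that the result is predictable:

```shell
# The constant in the program text always uses a period, whatever the locale.
num_prog='BEGIN { printf "%.2f\n", 3.14159 }'

# Run awk with every locale category forced to the POSIX locale.
LC_ALL=POSIX awk "$num_prog"
```

       In the POSIX locale the output is 3.14; under a locale with a
       different decimal-point character the formatted output may differ,
       while the program text stays the same.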
2074
2075       The  undefined behavior resulting from NULs in extended regular expres‐
2076       sions allows future extensions for the  GNU  gawk  program  to  process
2077       binary data.
2078
2079       The  behavior  in  the case of invalid awk programs (including lexical,
2080       syntactic, and semantic errors) is undefined because it was  considered
2081       overly  limiting  on  implementations  to  specify.  In most cases such
2082       errors can be expected to produce a diagnostic and a non-zero exit sta‐
2083       tus. However, some implementations may choose to extend the language in
2084       ways that make use of certain invalid constructs.  Other  invalid  con‐
2085       structs  might  be deemed worthy of a warning, but otherwise cause some
2086       reasonable behavior.  Still other constructs may be very  difficult  to
2087       detect  in some implementations.  Also, different implementations might
2088       detect a given error during an initial parsing of the  program  (before
2089       reading  any  input  files) while others might detect it when executing
2090       the program after reading some input. Implementors should be aware that
2091       diagnosing errors as early as possible and producing useful diagnostics
2092       can ease debugging of applications, and  thus  make  an  implementation
2093       more usable.
2094
2095       The  unspecified  behavior  from  using multi-character RS values is to
2096       allow possible future extensions based on extended regular  expressions
2097       used  for  record separators. Historical implementations take the first
2098       character of the string and ignore the others.
2099
2100       Unspecified behavior when split( string, array, <null>) is used  is  to
2101       allow  a proposed future extension that would split up a string into an
2102       array of individual characters.
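
       Until such an extension is specified, a string can be broken into
       characters portably with length and substr. This is a sketch of
       one way to do it, not a replacement for the proposed split
       behavior; the array name chars is an assumption.

```shell
chars_prog='BEGIN {
    s = "awk"
    n = length(s)
    # Copy each character into its own array element, indexed from 1.
    for (i = 1; i <= n; ++i)
        chars[i] = substr(s, i, 1)
    for (i = 1; i <= n; ++i)
        print i, chars[i]
}'
awk "$chars_prog"
```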
2103
2104       In the context of the getline function, equally good arguments for dif‐
2105       ferent  precedences  of  the  | and < operators can be made. Historical
2106       practice has been that:
2107
2108
2109              getline < "a" "b"
2110
2111       is parsed as:
2112
2113
2114              ( getline < "a" ) "b"
2115
2116       although many would argue that the intent was that the file  ab  should
2117       be read. However:
2118
2119
2120              getline < "x" + 1
2121
2122       parses as:
2123
2124
2125              getline < ( "x" + 1 )
2126
2127       Similar  problems  occur with the | version of getline, particularly in
2128       combination with $. For example:
2129
2130
2131              $"echo hi" | getline
2132
2133       (This situation is particularly problematic when used in a print state‐
2134       ment, where the |getline part might be a redirection of the print.)
2135
       Since in most cases such constructs are not used (or at least should
       not be), because they have a natural ambiguity for which there is no
       conventional parsing, the meaning of these constructs has been made
2139       explicitly unspecified. (The effect is that  a  conforming  application
2140       that runs into the problem must parenthesize to resolve the ambiguity.)
2141       There appeared to be few if any actual uses of such constructs.
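
       Parenthesizing removes the ambiguity in both the "<" and the "|"
       forms. The sketch below assumes a scratch directory holding a file
       named ab; the variable names line, cmd, and reply are illustrative.

```shell
# Scratch directory and file contents are assumptions for the demonstration.
gl_dir=$(mktemp -d)
printf 'from ab\n' > "$gl_dir/ab"

getline_prog='BEGIN {
    # Parenthesize the concatenation so the file name is unambiguously "ab".
    if ((getline line < ("a" "b")) > 0)
        print line

    # Keep the command in a variable and parenthesize the whole getline.
    cmd = "echo hi"
    if ((cmd | getline reply) > 0)
        print reply
    close(cmd)
}'
getline_out=$(cd "$gl_dir" && awk "$getline_prog")
printf '%s\n' "$getline_out"
rm -rf "$gl_dir"
```

       Written this way, the program reads one line from the file ab and
       one line from the command, with no reliance on unspecified parsing.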
2142
2143       Grammars can be written that would cause an error under  these  circum‐
2144       stances.   Where  backwards-compatibility is not a large consideration,
2145       implementors may wish to use such grammars.
2146
2147       Some historical implementations have allowed some built-in functions to
2148       be called without an argument list, the result being a default argument
2149       list chosen in some "reasonable" way. Use of length as  a  synonym  for
2150       length($0)  is the only one of these forms that is thought to be widely
2151       known or widely used; this particular form  is  documented  in  various
2152       places  (for example, most historical awk reference pages, although not
2153       in the referenced The AWK Programming Language) as legitimate practice.
2154       With  this  exception,  default argument lists have always been undocu‐
2155       mented and vaguely defined, and it is not at all clear how (or if) they
2156       should  be  generalized  to user-defined functions.  They add no useful
2157       functionality and preclude possible future extensions that  might  need
2158       to  name  functions without calling them.  Not standardizing them seems
2159       the simplest course. The standard  developers  considered  that  length
2160       merited special treatment, however, since it has been documented in the
2161       past and sees possibly substantial use in historical programs.  Accord‐
2162       ingly,  this  usage  has  been made legitimate, but Issue 5 removed the
2163       obsolescent marking for XSI-conforming implementations and many  other‐
2164       wise conforming applications depend on this feature.
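
       The two spellings can be compared side by side. A minimal sketch
       with a one-word sample input:

```shell
# length with no argument list is shorthand for length($0).
len_prog='{ print length, length($0) }'
printf 'hello\n' | awk "$len_prog"
```

       Both expressions report 5 for this input.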
2165
2166       In  sub  and  gsub,  if  repl  is  a  string literal (the lexical token
2167       STRING), then two consecutive backslash characters should  be  used  in
2168       the string to ensure a single backslash will precede the ampersand when
2169       the resultant string is passed to the function. (For example, to  spec‐
2170       ify  one  literal  ampersand  in the replacement string, use gsub( ERE,
2171       "\\&" ).)
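
       The difference between a bare ampersand and the escaped form can
       be sketched as follows; the input string and the bracket markers
       are chosen only to make the substitution visible.

```shell
amp_prog='{
    s = $0
    gsub(/b/, "[&]", s)      # a bare & stands for the matched text
    print s

    t = $0
    gsub(/b/, "[\\&]", t)    # two backslashes in the literal leave \& for gsub
    print t
}'
printf 'abc\n' | awk "$amp_prog"
```

       The first substitution gives a[b]c and the second gives a[&]c.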
2172
2173       Historically the only special character in the repl argument of sub and
2174       gsub string functions was the ampersand ( '&' ) character and preceding
2175       it with the backslash character was used to turn off its special  mean‐
2176       ing.
2177
2178       The  description  in  the ISO POSIX-2:1993 standard introduced behavior
2179       such that the backslash character was another special character and  it
2180       was  unspecified  whether there were any other special characters. This
2181       description introduced several portability problems, some of which  are
2182       described  below,  and so it has been replaced with the more historical
2183       description. Some of the problems include:
2184
2185        * Historically, to create the replacement string, a script  could  use
2186          gsub(  ERE, "\\&" ), but with the ISO POSIX-2:1993 standard wording,
2187          it was necessary to use gsub( ERE, "\\\\&" ).  Backslash  characters
2188          are  doubled here because all string literals are subject to lexical
2189          analysis, which would reduce each pair of backslash characters to  a
2190          single backslash before being passed to gsub.
2191
2192        * Since  it was unspecified what the special characters were, for por‐
2193          table scripts to guarantee that characters  are  printed  literally,
2194          each  character had to be preceded with a backslash. (For example, a
2195          portable script had to use  gsub(  ERE,  "\\h\\i"  )  to  produce  a
          replacement string of "hi".)
2197
2198       The  description  for  comparisons in the ISO POSIX-2:1993 standard did
2199       not properly describe historical practice because of  the  way  numeric
2200       strings  are compared as numbers. The current rules cause the following
2201       code:
2202
2203
2204              if (0 == "000")
2205                  print "strange, but true"
2206              else
2207                  print "not true"
2208
2209       to do a numeric comparison, causing the if to  succeed.  It  should  be
2210       intuitively  obvious  that  this  is incorrect behavior, and indeed, no
2211       historical implementation of awk actually behaves this way.
2212
2213       To fix this problem, the definition of numeric string was  enhanced  to
2214       include  only those values obtained from specific circumstances (mostly
2215       external sources) where it is not possible to  determine  unambiguously
2216       whether the value is intended to be a string or a numeric.
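
       The effect of the narrowed definition can be seen by comparing a
       string constant with a field read from input. This is a sketch
       under the assumption that the implementation follows the rules
       described here; the output labels are illustrative.

```shell
numstr_prog='{
    # "000" is a string constant, so this is a string comparison.
    if (0 == "000")
        print "constant: equal"
    else
        print "constant: not equal"

    # $1 comes from input and looks numeric, so this is a numeric comparison.
    if (0 == $1)
        print "field: equal"
    else
        print "field: not equal"
}'
printf '000\n' | awk "$numstr_prog"
```

       With the input line 000, the constant compares unequal while the
       field compares equal.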
2217
       A variable that is assigned a numeric string shall also be treated
       as a numeric string. (That is, the notion of a numeric string can
       be propagated across assignments.) In comparisons, all variables having
2221       the uninitialized value are to be treated as a numeric operand evaluat‐
2222       ing to the numeric value zero.
2223
2224       Uninitialized  variables  include  all  types  of  variables  including
2225       scalars, array elements, and fields. The definition of an uninitialized
2226       value  in  Variables and Special Variables is necessary to describe the
2227       value placed on uninitialized variables and on fields  that  are  valid
2228       (for example, < $NF) but have no characters in them and to describe how
2229       these variables are to be used in comparisons. A valid field,  such  as
2230       $1,  that has no characters in it can be obtained from an input line of
       "\t\t" when FS='\t'. Historically, the comparison ($1<10) was done
2232       numerically after evaluating $1 to the value zero.
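
       The dual nature of the uninitialized value, and the empty-field
       case from the text, can be sketched as follows. The variable x is
       deliberately never assigned; the messages are illustrative.

```shell
# An uninitialized variable compares equal to both 0 and the empty string.
uninit_prog='BEGIN {
    if (x == 0)
        print "x equals 0"
    if (x == "")
        print "x equals the empty string"
}'
awk "$uninit_prog"

# The input line "\t\t" with FS set to a tab gives a valid but empty $1.
empty_prog='BEGIN { FS = "\t" }
{
    if ($1 < 10)
        print "empty $1 is less than 10"
}'
printf '\t\t\n' | awk "$empty_prog"
```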
2233
2234       The  phrase  "...  also  shall  have  the  numeric value of the numeric
       string" was removed from several sections of the ISO POSIX-2:1993
       standard because it specifies an unnecessary implementation detail. It is
2237       not necessary for IEEE Std 1003.1-2001 to specify that these objects be
2238       assigned  two  different  values.  It is only necessary to specify that
2239       these objects may evaluate to two different values  depending  on  con‐
2240       text.
2241
2242       The  description  of numeric string processing is based on the behavior
2243       of the atof() function in  the  ISO C  standard.  While  it  is  not  a
2244       requirement for an implementation to use this function, many historical
2245       implementations of awk do. In the ISO C standard,  floating-point  con‐
2246       stants  use  a  period  as  a  decimal point character for the language
2247       itself, independent of the current locale, but the atof() function  and
2248       the associated strtod() function use the decimal point character of the
2249       current locale when converting strings to numeric values. Similarly  in
2250       awk, floating-point constants in an awk script use a period independent
2251       of the locale, but input strings use the decimal point character of the
2252       locale.
2253

FUTURE DIRECTIONS

2255       None.
2256

COPYRIGHT

2262       Portions of this text are reprinted and reproduced in  electronic  form
2263       from IEEE Std 1003.1, 2003 Edition, Standard for Information Technology
2264       -- Portable Operating System Interface (POSIX),  The  Open  Group  Base
2265       Specifications  Issue  6,  Copyright  (C) 2001-2003 by the Institute of
2266       Electrical and Electronics Engineers, Inc and The Open  Group.  In  the
2267       event of any discrepancy between this version and the original IEEE and
2268       The Open Group Standard, the original IEEE and The Open Group  Standard
2269       is  the  referee document. The original Standard can be obtained online
2270       at http://www.opengroup.org/unix/online.html .
2271
2272
2273
IEEE/The Open Group                  2003                              AWK(1P)