MAWK(1)                          User commands                         MAWK(1)

NAME

       mawk - pattern scanning and text processing language

SYNOPSIS

       mawk [-W option] [-F value] [-v var=value] [--] 'program text' [file
       ...]
       mawk [-W option] [-F value] [-v var=value] [-f program-file] [--] [file
       ...]

DESCRIPTION

       mawk is an interpreter for the AWK Programming Language.  The AWK
       language is useful for manipulation of data files, text retrieval and
       processing, and for prototyping and experimenting with algorithms.
       mawk is a new awk, meaning it implements the AWK language as defined in
       Aho, Kernighan and Weinberger, The AWK Programming Language,
       Addison-Wesley Publishing, 1988 (hereafter referred to as the AWK
       book).  mawk conforms to the POSIX 1003.2 (draft 11.3) definition of
       the AWK language, which contains a few features not described in the
       AWK book, and mawk provides a small number of extensions.

       An AWK program is a sequence of pattern {action} pairs and function
       definitions.  Short programs are entered on the command line, usually
       enclosed in ' ' to avoid shell interpretation.  Longer programs can be
       read in from a file with the -f option.  Data input is read from the
       list of files on the command line or from standard input when the list
       is empty.  The input is broken into records as determined by the record
       separator variable, RS.  Initially, RS = “\n” and records are
       synonymous with lines.  Each record is compared against each pattern
       and if it matches, the program text for {action} is executed.
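
       As a minimal sketch of the record, pattern and action cycle described
       above, the following prints the record number and text of each input
       line that starts with b.  The sample input is made up for illustration,
       and the command name awk is an assumption; substitute mawk where it is
       installed under that name.

```shell
# Two input records; only the second one matches the pattern /^b/.
out=$(printf 'alpha\nbeta\n' | awk '/^b/ { print NR ": " $0 }')
echo "$out"
```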

OPTIONS

       -F value       sets the field separator, FS, to value.

       -f file        Program text is read from file instead of from the com‐
                      mand line.  Multiple -f options are allowed.

       -v var=value   assigns value to program variable var.

       --             indicates the unambiguous end of options.

       The above options will be available with any POSIX compatible implemen‐
       tation of AWK.  Implementation specific options are prefaced with -W.
       mawk provides these:

       -W dump
              writes an assembler like listing of the internal representation
              of the program to stdout and exits 0 (on successful compila‐
              tion).

       -W exec file
              Program text is read from file and this is the last option.

              This is a useful alternative to -f on systems that support the
              #!  “magic number” convention for executable scripts.  Those im‐
              plicitly pass the pathname of the script itself as the final pa‐
              rameter, and expect no more than one “-” option on the #! line.
              Because mawk can combine multiple -W options separated by com‐
              mas, you can use this option when an additional -W option is
              needed.

       -W help
              prints a usage message to stderr and exits (same as “-W usage”).

       -W interactive
              sets unbuffered writes to stdout and line buffered reads from
              stdin.  Records from stdin are lines regardless of the value of
              RS.

       -W posix
              modifies mawk's behavior to be more POSIX-compliant:

              •   forces mawk not to consider '\n' to be space.

              The original “posix_space” is recognized, but deprecated.

       -W random=num
              calls srand with the given parameter (and overrides the auto-
              seeding behavior).

       -W sprintf=num
              adjusts the size of mawk's internal sprintf buffer to num bytes.
              More than rare use of this option indicates mawk should be re‐
              compiled.

       -W traditional
              Omit features such as interval expressions which were not sup‐
              ported by traditional awk.

       -W usage
              prints a usage message to stderr and exits (same as “-W help”).

       -W version
              mawk writes its version and copyright to stdout and compiled
              limits to stderr and exits 0.

       mawk accepts abbreviations for any of these options, e.g., “-W v” and
       “-Wv” both tell mawk to show its version.

       mawk allows multiple -W options to be combined by separating the op‐
       tions with commas, e.g., -Wsprint=2000,posix.  This is useful for exe‐
       cutable #!  “magic number” invocations in which only one argument is
       supported, e.g., -Winteractive,exec.

THE AWK LANGUAGE

   1. Program structure
       An AWK program is a sequence of pattern {action} pairs and user func‐
       tion definitions.

       A pattern can be:
            BEGIN
            END
            expression
            expression , expression

       One, but not both, of pattern {action} can be omitted.  If {action} is
       omitted it is implicitly { print }.  If pattern is omitted, then it is
       implicitly matched.  BEGIN and END patterns require an action.
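
       The two omission rules can be sketched as follows.  The input values
       and the shell variable names are illustrative only, and awk stands in
       for whichever implementation is installed.

```shell
# Pattern without an action: matching records are printed (implicit { print }).
no_action=$(printf '1\n2\n3\n' | awk '$1 > 1')

# Action without a pattern: the action runs for every record.
no_pattern=$(printf '1\n2\n3\n' | awk '{ sum += $1 } END { print sum }')

echo "$no_action"
echo "$no_pattern"
```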

       Statements are terminated by newlines, semicolons or both.  Groups of
       statements such as actions or loop bodies are blocked via { ... } as in
       C.  The last statement in a block doesn't need a terminator.  Blank
       lines have no meaning; an empty statement is terminated with a
       semicolon.  Long statements can be continued with a backslash, \.  A
       statement can be broken without a backslash after a comma, left brace,
       &&, ||, do, else, the right parenthesis of an if, while or for
       statement, and the right parenthesis of a function definition.  A
       comment starts with # and extends to, but does not include, the end of
       line.

       The following statements control program flow inside blocks.

            if ( expr ) statement

            if ( expr ) statement else statement

            while ( expr ) statement

            do statement while ( expr )

            for ( opt_expr ; opt_expr ; opt_expr ) statement

            for ( var in array ) statement

            continue

            break

   2. Data types, conversion and comparison
       There are two basic data types, numeric and string.  Numeric constants
       can be integer like -2, decimal like 1.08, or in scientific notation
       like -1.1e4 or .28E-3.  All numbers are represented internally and all
       computations are done in floating point arithmetic.  So for example,
       the expression 0.2e2 == 20 is true and true is represented as 1.0.

       String constants are enclosed in double quotes.

                   "This is a string with a newline at the end.\n"

       Strings can be continued across a line by escaping (\) the newline.
       The following escape sequences are recognized.

            \\        \
            \"        "
            \a        alert, ascii 7
            \b        backspace, ascii 8
            \t        tab, ascii 9
            \n        newline, ascii 10
            \v        vertical tab, ascii 11
            \f        formfeed, ascii 12
            \r        carriage return, ascii 13
            \ddd      1, 2 or 3 octal digits for ascii ddd
            \xhh      1 or 2 hex digits for ascii hh

       If you escape any other character \c, you get \c, i.e., mawk ignores
       the escape.

       There are really three basic data types; the third is number and string
       which has both a numeric value and a string value at the same time.
       User defined variables come into existence when first referenced and
       are initialized to null, a number and string value which has numeric
       value 0 and string value "".  Non-trivial number and string typed data
       come from input and are typically stored in fields.  (See section 4).

       The type of an expression is determined by its context and automatic
       type conversion occurs if needed.  For example, consider the
       statements

            y = x + 2  ;  z = x  "hello"

       The value stored in variable y will be typed numeric.  If x is not
       numeric, the value read from x is converted to numeric before it is
       added to 2 and stored in y.  The value stored in variable z will be
       typed string, and the value of x will be converted to string if
       necessary and concatenated with "hello".  (Of course, the value and
       type stored in x are not changed by any conversions.)  A string
       expression is converted to numeric using its longest numeric prefix as
       with atof(3).  A numeric expression is converted to string by replacing
       expr with sprintf(CONVFMT, expr), unless expr can be represented on the
       host machine as an exact integer, in which case it is converted with
       sprintf("%d", expr).  Sprintf() is an AWK built-in that duplicates the
       functionality of sprintf(3), and CONVFMT is a built-in variable used
       for internal conversion from number to string and initialized to
       "%.6g".  Explicit type conversions can be forced: expr "" is string
       and expr+0 is numeric.
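
       The conversion rules above can be observed with a short sketch.  The
       value "3abc" and the alternative CONVFMT setting "%.2f" are arbitrary
       choices made for illustration, and awk stands in for the installed
       implementation.

```shell
out=$(awk 'BEGIN {
    x = "3abc"
    print x + 2        # numeric context: the longest numeric prefix, 3, is used
    print 10 / 2 ""    # exact integer: converted as with "%d"
    print 1 / 3 ""     # not an integer: converted with CONVFMT ("%.6g")
    CONVFMT = "%.2f"
    print 1 / 3 ""     # the new CONVFMT now governs the conversion
}')
echo "$out"
```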

       To evaluate expr1 rel-op expr2: if both operands are numeric or number
       and string, then the comparison is numeric; if both operands are
       string, the comparison is string; if one operand is string, the
       non-string operand is converted and the comparison is string.  The
       result is numeric, 1 or 0.

       In boolean contexts such as if ( expr ) statement, a string expression
       evaluates true if and only if it is not the empty string ""; a numeric
       expression evaluates true if and only if it is not numerically zero.

   3. Regular expressions
       In the AWK language, records, fields and strings are often tested for
       matching a regular expression.  Regular expressions are enclosed in
       slashes, and

            expr ~ /r/

       is an AWK expression that evaluates to 1 if expr “matches” r, which
       means a substring of expr is in the set of strings defined by r.  With
       no match the expression evaluates to 0; replacing ~ with the “not
       match” operator, !~ , reverses the meaning.  As pattern-action pairs,

            /r/ { action }   and   $0 ~ /r/ { action }

       are the same, and for each input record that matches r, action is exe‐
       cuted.  In fact, /r/ is an AWK expression that is equivalent to ($0 ~
       /r/) anywhere except when on the right side of a match operator or
       passed as an argument to a built-in function that expects a regular ex‐
       pression argument.

       AWK uses extended regular expressions as with the -E option of grep(1).
       The regular expression metacharacters, i.e., those with special meaning
       in regular expressions are

            \ ^ $ . [ ] | ( ) * + ? { }

       If the command line option -W traditional is used, the braces

            { }

       are omitted from this list and match themselves.  Otherwise they are
       also regular expression metacharacters and must be escaped to be
       treated as literal characters.

       Regular expressions are built up from characters as follows:

            c            matches any non-metacharacter c.

            \c           matches a character defined by the same escape
                         sequences used in string constants or the literal
                         character c if \c is not an escape sequence.

            .            matches any character (including newline).

            ^            matches the front of a string.

            $            matches the back of a string.

            [c1c2c3...]  matches any character in the class c1c2c3... .
                         An interval of characters is denoted c1-c2 inside
                         a class [...].

            [^c1c2c3...] matches any character not in the class c1c2c3...

       Regular expressions are built up from other regular expressions as
       follows:

            r1r2         matches r1 followed immediately by r2
                         (concatenation).

            r1 | r2      matches r1 or r2 (alternation).

            r*           matches r repeated zero or more times.

            r+           matches r repeated one or more times.

            r?           matches r zero times or once.
                         (repetition).

            (r)          matches r (grouping).

            r{n}         matches r exactly n times.

            r{n,}        matches r repeated n or more times.

            r{n,m}       matches r repeated n to m (inclusive) times.

            r{,m}        matches r repeated 0 to m times (a non-standard
                         option).

       The increasing precedence of operators is:

       alternation concatenation repetition grouping

       For example,

            /^[_a-zA-Z][_a-zA-Z0-9]*$/  and
            /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/

       are matched by AWK identifiers and AWK numeric constants respectively.
       Note that “.” has to be escaped to be recognized as a decimal point,
       and that metacharacters are not special inside character classes.
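
       As a quick check of the numeric-constant expression quoted above, the
       following sketch classifies a few sample strings.  The sample inputs
       and the labels "numeric" and "other" are made up for illustration.

```shell
# Each input line is tested against the numeric-constant regular expression.
out=$(printf '%s\n' 12 -1.5e3 .5 abc 1e | awk '
    /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/ { print $0, "numeric"; next }
    { print $0, "other" }')
echo "$out"
```

       The string 1e is rejected because the exponent part requires at least
       one digit after the e.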

       Any expression can be used on the right hand side of the ~ or !~ opera‐
       tors or passed to a built-in that expects a regular expression.  If
       needed, it is converted to string, and then interpreted as a regular
       expression.  For example,

            BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }

            $0 ~ "^" identifier

       prints all lines that start with an AWK identifier.

       mawk recognizes the empty regular expression, //, which matches the
       empty string and hence is matched by any string at the front, back and
       between every character.  For example,

            echo abc | mawk '{ gsub(//, "X") ; print }'
            XaXbXcX

   4. Records and fields
       Records are read in one at a time, and stored in the field variable $0.
       The record is split into fields which are stored in $1, $2, ..., $NF.
       The built-in variable NF is set to the number of fields, and NR and FNR
       are incremented by 1.  Fields above $NF are set to "".

       Assignment to $0 causes the fields and NF to be recomputed.  Assignment
       to NF or to a field causes $0 to be reconstructed by concatenating the
       $i's separated by OFS.  Assignment to a field with index greater than
       NF, increases NF and causes $0 to be reconstructed.
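
       The reconstruction of $0 can be sketched as follows.  The input line
       and the choice of "-" for OFS are illustrative only.

```shell
out=$(echo 'a b c' | awk -v OFS=- '{
    $2 = "X"; print        # $0 is rebuilt with OFS between the fields
    $5 = "e"; print NF     # assigning past NF extends the record
    print                  # the new field $4 is empty
}')
echo "$out"
```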

       Data input stored in fields is string, unless the entire field has nu‐
       meric form and then the type is number and string.  For example,

            echo 24 24E |
            mawk '{ print($1>100, $1>"100", $2>100, $2>"100") }'
            0 1 1 1

       $0 and $2 are string and $1 is number and string.  The first comparison
       is numeric, the second is string, the third is string (100 is converted
       to "100"), and the last is string.

   5. Expressions and operators
       The expression syntax is similar to C.  Primary expressions are numeric
       constants, string constants, variables, fields, arrays and function
       calls.  The identifier for a variable, array or function can be a se‐
       quence of letters, digits and underscores, that does not start with a
       digit.  Variables are not declared; they exist when first referenced
       and are initialized to null.

       New expressions are composed with the following operators in order of
       increasing precedence.

            assignment          =  +=  -=  *=  /=  %=  ^=
            conditional         ?  :
            logical or          ||
            logical and         &&
            array membership    in
            matching            ~   !~
            relational          <  >   <=  >=  ==  !=
            concatenation       (no explicit operator)
            add ops             +  -
            mul ops             *  /  %
            unary               +  -
            logical not         !
            exponentiation      ^
            inc and dec         ++ -- (both post and pre)
            field               $

       Assignment, conditional and exponentiation associate right to left; the
       other operators associate left to right.  Any expression can be paren‐
       thesized.
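
       A few consequences of the precedence table, shown as a small sketch
       with arbitrarily chosen operands:

```shell
out=$(awk 'BEGIN {
    print 2 ^ 3 ^ 2      # right associative: 2 ^ (3 ^ 2)
    print -2 ^ 2         # ^ binds tighter than unary minus: -(2 ^ 2)
    print 1 " " 2 + 3    # + binds tighter than concatenation: 1 " " (2 + 3)
}')
echo "$out"
```

       When in doubt, parenthesizing the intended grouping avoids relying on
       these rules.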

   6. Arrays
       Awk provides one-dimensional arrays.  Array elements are expressed as
       array[expr].  Expr is internally converted to string type, so, for ex‐
       ample, A[1] and A["1"] are the same element and the actual index is
       "1".  Arrays indexed by strings are called associative arrays.  Ini‐
       tially an array is empty; elements exist when first accessed.  An ex‐
       pression, expr in array evaluates to 1 if array[expr] exists, else to
       0.

       There is a form of the for statement that loops over each index of an
       array.

            for ( var in array ) statement

       sets var to each index of array and executes statement.  The order in
       which var traverses the indices of array is not defined.

       The statement, delete array[expr], causes array[expr] not to exist.
       mawk supports the delete array feature, which deletes all elements of
       array.

       Multidimensional arrays are synthesized with concatenation using the
       built-in variable SUBSEP.  array[expr1,expr2] is equivalent to ar‐
       ray[expr1 SUBSEP expr2].  Testing for a multidimensional element uses a
       parenthesized index, such as

            if ( (i, j) in A )  print A[i, j]
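
       The array rules above can be exercised with a short sketch.  The array
       names A, B and part and the stored values are hypothetical.

```shell
out=$(awk 'BEGIN {
    A[1] = "x"
    print ("1" in A)              # 1: A[1] and A["1"] are the same element
    delete A[1]
    print (1 in A)                # 0: the element no longer exists
    B["r", "c"] = 7               # stored under the index "r" SUBSEP "c"
    print (("r", "c") in B)       # 1
    for (k in B) {
        split(k, part, SUBSEP)    # recover the two subscripts
        print part[1], part[2]
    }
}')
echo "$out"
```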

   7. Built-in variables
       The following variables are built-in and initialized before program ex‐
       ecution.

            ARGC   number of command line arguments.

            ARGV   array of command line arguments, 0..ARGC-1.

            CONVFMT
                   format for internal conversion of numbers to string, ini‐
                   tially = "%.6g".

            ENVIRON
                   array indexed by environment variables.  An environment
                   string, var=value is stored as ENVIRON[var] = value.

            FILENAME
                   name of the current input file.

            FNR    current record number in FILENAME.

            FS     splits records into fields as a regular expression.

            NF     number of fields in the current record.

            NR     current record number in the total input stream.

            OFMT   format for printing numbers; initially = "%.6g".

            OFS    inserted between fields on output, initially = " ".

            ORS    terminates each record on output, initially = "\n".

            RLENGTH
                   length set by the last call to the built-in function,
                   match().

            RS     input record separator, initially = "\n".

            RSTART index set by the last call to match().

            SUBSEP used to build multiple array subscripts, initially =
                   "\034".

   8. Built-in functions
       String functions

            gsub(r,s,t)  gsub(r,s)
                   Global substitution, every match of regular expression r in
                   variable t is replaced by string s.  The number of replace‐
                   ments is returned.  If t is omitted, $0 is used.  An & in
                   the replacement string s is replaced by the matched sub‐
                   string of t.  \& and \\ put literal & and \, respectively,
                   in the replacement string.

            index(s,t)
                   If t is a substring of s, then the position where t starts
                   is returned, else 0 is returned.  The first character of s
                   is in position 1.

            length(s)
                   Returns the length of string or array s.

            match(s,r)
                   Returns the index of the first longest match of regular ex‐
                   pression r in string s.  Returns 0 if no match.  As a side
                   effect, RSTART is set to the return value.  RLENGTH is set
                   to the length of the match or -1 if no match.  If the empty
                   string is matched, RLENGTH is set to 0, and 1 is returned
                   if the match is at the front, and length(s)+1 is returned
                   if the match is at the back.

            split(s,A,r)  split(s,A)
                   String s is split into fields by regular expression r and
                   the fields are loaded into array A.  The number of fields
                   is returned.  See section 11 below for more detail.  If r
                   is omitted, FS is used.

            sprintf(format,expr-list)
                   Returns a string constructed from expr-list according to
                   format.  See the description of printf() below.

            sub(r,s,t)  sub(r,s)
                   Single substitution, same as gsub() except at most one sub‐
                   stitution.

            substr(s,i,n)  substr(s,i)
                   Returns the substring of string s, starting at index i, of
                   length n.  If n is omitted, the suffix of s, starting at i
                   is returned.

            tolower(s)
                   Returns a copy of s with all upper case characters con‐
                   verted to lower case.

            toupper(s)
                   Returns a copy of s with all lower case characters con‐
                   verted to upper case.
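
       Several of the string functions above are combined in the following
       sketch.  All strings and variable names are made up for illustration.

```shell
out=$(awk 'BEGIN {
    s = "hello world"
    n = gsub(/o/, "[&]", s)            # & stands for the matched text
    print n, s
    print index("banana", "nan")       # positions are counted from 1
    t = "foo123bar"
    if (match(t, /[0-9]+/))            # sets RSTART and RLENGTH
        print RSTART, RLENGTH, substr(t, RSTART, RLENGTH)
    print toupper(substr("mawk", 1, 1)) substr("mawk", 2)
}')
echo "$out"
```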

       Time functions

       These are available on systems which support the corresponding C mktime
       and strftime functions:

            mktime(specification)
                   converts a date specification to a timestamp with the same
                   units as systime.  The date specification is a string con‐
                   taining the components of the date as decimal integers:

                   YYYY
                      the year, e.g., 2012

                   MM the month of the year starting at 1

                   DD the day of the month starting at 1

                   HH hour (0-23)

                   MM minute (0-59)

                   SS seconds (0-59)

                   DST
                      tells how to treat timezone versus daylight savings
                      time:

                        positive
                           DST is in effect

                        zero (default)
                           DST is not in effect

                        negative
                           mktime() should (use timezone information and sys‐
                           tem databases to) attempt to determine whether DST
                           is in effect at the specified time.

            strftime([format [, timestamp [, utc ]]])
                   formats the given timestamp using the format (passed to the
                   C strftime function):

                   •   If the format parameter is missing, "%c" is used.

                   •   If the timestamp parameter is missing, the current
                       value from systime is used.

                   •   If the utc parameter is present and nonzero, the result
                       is in UTC.  Otherwise local time is used.

            systime()
                   returns the current time of day as the number of seconds
                   since the Epoch (1970-01-01 00:00:00 UTC on POSIX systems).

       Arithmetic functions

            atan2(y,x)
                   Arctan of y/x between -pi and pi.

            cos(x) Cosine function, x in radians.

            exp(x) Exponential function.

            int(x) Returns x truncated towards zero.

            log(x) Natural logarithm.

            rand() Returns a random number between zero and one.

            sin(x) Sine function, x in radians.

            sqrt(x)
                   Returns square root of x.

            srand(expr)

            srand()
                   Seeds the random number generator, using the clock if expr
                   is omitted, and returns the value of the previous seed.
                   Srand(expr) is useful for repeating pseudo random se‐
                   quences.

                   Note: mawk is normally configured to seed the random number
                   generator from the clock at startup, making it unnecessary
                   to call srand().  This feature can be suppressed via condi‐
                   tional compile, or overridden using the -Wrandom option.
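
       The use of srand(expr) to repeat a pseudo random sequence can be
       sketched as follows.  The seed 42 is an arbitrary choice, and the
       second value printed only checks that rand() stayed below one.

```shell
out=$(awk 'BEGIN {
    srand(42); a = rand()    # first draw after seeding
    srand(42); b = rand()    # same seed, so the same first draw
    print (a == b), (a >= 0 && a < 1)
}')
echo "$out"
```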

   9. Input and output
       There are two output statements, print and printf.

            print  writes $0 ORS to standard output.

            print expr1, expr2, ..., exprn
                   writes expr1 OFS expr2 OFS ... exprn ORS to standard out‐
                   put.  Numeric expressions are converted to string with
                   OFMT.

            printf format, expr-list
                   duplicates the printf C library function writing to stan‐
                   dard output.  The complete ANSI C format specifications are
                   recognized with conversions %c, %d, %e, %E, %f, %g, %G, %i,
                   %o, %s, %u, %x, %X and %%, and conversion qualifiers h and
                   l.

       The argument list to print or printf can optionally be enclosed in
       parentheses.  Print formats numbers using OFMT or "%d" for exact inte‐
       gers.  "%c" with a numeric argument prints the corresponding 8 bit
       character, with a string argument it prints the first character of the
       string.  The output of print and printf can be redirected to a file or
       command by appending > file, >> file or | command to the end of the
       print statement.  Redirection opens file or command only once, subse‐
       quent redirections append to the already open stream.  By convention,
       mawk associates the filename

          •   "/dev/stderr" with stderr,

          •   "/dev/stdout" with stdout,

          •   "-" and "/dev/stdin" with stdin.

       The association with stderr is especially useful because it allows
       print and printf to be redirected to stderr.  These names can also be
       passed to functions.
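
       A minimal sketch of printf formatting and of redirection to stderr
       follows.  The format string and the message text are made up; the
       shell discards stderr here so that only standard output is captured.

```shell
out=$(awk 'BEGIN {
    printf "%5.2f|%-4s|%c\n", 3.14159, "ab", 65    # %c with 65 prints A
    print "this line goes to stderr" > "/dev/stderr"
}' 2>/dev/null)
echo "$out"
```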

       The input function getline has the following variations.

            getline
                   reads into $0, updates the fields, NF, NR and FNR.

            getline < file
                   reads into $0 from file, updates the fields and NF.

            getline var
                   reads the next record into var, updates NR and FNR.

            getline var < file
                   reads the next record of file into var.

            command | getline
                   pipes a record from command into $0 and updates the fields
                   and NF.

            command | getline var
                   pipes a record from command into var.

       Getline returns 0 on end-of-file, -1 on error, otherwise 1.

       Commands on the end of pipes are executed by /bin/sh.

       The function close(expr) closes the file or pipe associated with expr.
       Close returns 0 if expr is an open file, the exit status if expr is a
       piped command, and -1 otherwise.  Close is used to reread a file or
       command, make sure the other end of an output pipe is finished or con‐
       serve file resources.
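
       Reading from a command and rereading it after close() can be sketched
       as follows.  The command string and the variable names cmd, line,
       first and n are hypothetical.

```shell
out=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0)   # 1 per record, 0 at end of input
        n++
    close(cmd)                         # lets the same command be run again
    cmd | getline first                # reads the first record once more
    print n, line, first
}')
echo "$out"
```

       Testing the return value with > 0 rather than using it directly as a
       loop condition keeps the loop from spinning on the error value -1.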

       The function fflush(expr) flushes the output file or pipe associated
       with expr.  Fflush returns 0 if expr is an open output stream else -1.
       Fflush without an argument flushes stdout.  Fflush with an empty argu‐
       ment ("") flushes all open output.

       The function system(expr) uses the C runtime system call to execute
       expr and returns the corresponding wait status of the command as fol‐
       lows:

       •   if the system call failed, setting the status to -1, mawk returns
           that value.

       •   if the command exited normally, mawk returns its exit-status.

       •   if the command exited due to a signal such as SIGHUP, mawk returns
           the signal number plus 256.
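
       The normal-exit case can be sketched as follows.  The two commands are
       arbitrary, and the signal case is left out because the value reported
       for it may differ between awk implementations.

```shell
out=$(awk 'BEGIN {
    print system("exit 3")    # normal exit: the exit status of the command
    print system("true")      # a successful command returns 0
}')
echo "$out"
```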
695
696       Changes  made  to the ENVIRON array are not passed to commands executed
697       with system or pipes.
698
699   10. User defined functions
700       The syntax for a user defined function is
701
702            function name( args ) { statements }
703
704       The function body can contain a return statement
705
706            return opt_expr
707
708       A return statement is not required.  Function calls may  be  nested  or
709       recursive.   Functions  are  passed  expressions by value and arrays by
710       reference.  Extra arguments serve as local variables and  are  initial‐
711       ized  to  null.  For example, csplit(s,A) puts each character of s into
712       array A and returns the length of s.
713
714            function csplit(s, A,    n, i)
715            {
716              n = length(s)
717              for( i = 1 ; i <= n ; i++ ) A[i] = substr(s, i, 1)
718              return n
719            }
720
721       Putting extra space between passed arguments  and  local  variables  is
722       conventional.  Functions can be referenced before they are defined, but
723       the function name and the '(' of the arguments must touch to avoid con‐
724       fusion with concatenation.
725
726       A function parameter is normally a scalar value (number or string).  If
727       there is a forward reference to a function using an array as a  parame‐
728       ter,  the  function's corresponding parameter will be treated as an ar‐
729       ray.
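
       The following sketch shows both  passing  conventions  at  work.   It
       reuses csplit() from above; bump() and the variable names are hypo‐
       thetical and exist only for this illustration.

```sh
# Sketch only: scalars are copied, arrays are shared with the caller.
out=$(awk '
function csplit(s, A,    n, i)
{
    n = length(s)
    for (i = 1; i <= n; i++) A[i] = substr(s, i, 1)
    return n
}

# bump() is a hypothetical helper: x is a copy of the argument
function bump(x) { x = x + 1; return x }

BEGIN {
    v = 1
    r = bump(v)                 # v stays 1, r becomes 2
    n = csplit("abc", chars)    # chars is filled in by the function
    print v, r, n, chars[1] chars[2] chars[3]
}')
echo "$out"    # prints: 1 2 3 abc
```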
730
731   11. Splitting strings, records and files
732       Awk programs use the same algorithm to split strings into  arrays  with
733       split(), and records into fields on FS.  mawk uses essentially the same
734       algorithm to split files into records on RS.
735
736       Split(expr,A,sep) works as follows:
737
738          (1)  If sep is omitted, it is replaced by FS.  Sep can be an expres‐
739               sion  or  regular  expression.   If it is an expression of non-
740               string type, it is converted to string.
741
742          (2)  If sep = " " (a single space), then <SPACE> is trimmed from the
743               front  and back of expr, and sep becomes <SPACE>.  mawk defines
744               <SPACE> as the regular expression /[ \t\n]+/.  Otherwise sep is
745               treated  as  a  regular expression, except that meta-characters
746               are ignored for a string of length 1, e.g.,  split(x,  A,  "*")
747               and split(x, A, /\*/) are the same.
748
749          (3)  If  expr  is not string, it is converted to string.  If expr is
750               then the empty string "", split() returns 0 and A is set empty.
751               Otherwise, all non-overlapping, non-null and longest matches of
752               sep in expr, separate expr into fields which are loaded into A.
753               The  fields are placed in A[1], A[2], ..., A[n] and split() re‐
754               turns n, the number of fields which is the  number  of  matches
755               plus  one.  Data placed in A that looks numeric is typed number
756               and string.
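
       The rules above can be exercised with a few calls.  This is a  sketch
       that  assumes  an  awk on the PATH; the array and variable names are
       illustrative.

```sh
# Sketch only: one call per rule of the split() description.
out=$(awk 'BEGIN {
    a = split("a::b:", A, ":+")     # regular expression: "a", "b", ""
    b = split("  x  y ", B, " ")    # single space: trim, then split
    c = split("p*q", C, "*")        # length-1 string is taken literally
    d = split("", D)                # empty string: no fields
    print a, b, c, d
    print "[" A[1] "][" A[2] "][" A[3] "]"
}')
echo "$out"
```

       The first line of output is "3 2 2 0" and the second is "[a][b][]".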
757
758       Splitting records into fields works the  same  except  the  pieces  are
759       loaded into $1, $2,..., $NF.  If $0 is empty, NF is set to 0 and all $i
760       to "".
761
762       mawk splits files into records by the  same  algorithm,  but  with  the
763       slight  difference  that RS is really a terminator instead of a separa‐
764       tor.  (ORS is really a terminator too).
765
766            E.g., if FS = “:+” and $0 = “a::b:” , then NF = 3 and $1 = “a”, $2
767            = “b” and $3 = "", but if “a::b:” is the contents of an input file
768            and RS = “:+”, then there are two records “a” and “b”.
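
       The difference between a separator and a terminator can be seen  from
       the  shell.  This sketch assumes an awk that accepts a regular expres‐
       sion for RS (mawk does; an awk that uses only the first character  of
       RS will count differently).  The variable names are illustrative.

```sh
# Sketch only: the same text split as fields, then as records.
fields=$(printf 'a::b:\n' | awk 'BEGIN { FS = ":+" } { print NF }')
records=$(printf 'a::b:' | awk 'BEGIN { RS = ":+" } END { print NR }')
echo "$fields $records"    # prints: 3 2
```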
769
770       RS = " " is not special.
771
772       If FS = "", then mawk breaks the  record  into  individual  characters,
773       and,  similarly,  split(s,A,"")  places  the individual characters of s
774       into A.
775
776   12. Multi-line records
       Since mawk interprets RS as a regular  expression,  multi-line  records
       are easy.  Setting RS = "\n\n+" makes one or more blank lines separate
       records.  If FS = " " (the default), then single newlines, by the rules
       for <SPACE> above, are treated as space and so act as field separators.
782
783            For example, if
784
            •   a file is "a b\nc\n\n",

            •   RS = "\n\n+" and

            •   FS = " ",
790
791            then there is one record “a b\nc” with three fields “a”,  “b”  and
792            “c”:
793
794            •   using FS = “\n”, gives two fields “a b” and “c”;
795
796            •   using FS = “”, gives one field identical to the record.
797
798       If  you want lines with spaces or tabs to be considered blank, set RS =
799       “\n([ \t]*\n)+”.  For compatibility with other awks, setting  RS  =  ""
800       has  the  same effect as if blank lines are stripped from the front and
801       back of files and then records are  determined  as  if  RS  =  “\n\n+”.
802       POSIX  requires that “\n” always separates records when RS = "" regard‐
803       less of the value of FS.  mawk does not support  this  convention,  be‐
804       cause defining “\n” as <SPACE> makes it unnecessary.
805
806       Most  of  the  time when you change RS for multi-line records, you will
807       also want to change ORS to “\n\n” so the record spacing is preserved on
808       output.
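
       The example from this section can be reproduced as follows.  This  is
       a  sketch  that  assumes  an awk accepting a regular expression for RS
       (such as mawk); the variable names are illustrative.

```sh
# Sketch only: the file "a b\nc\n\n" read as one multi-line record.
dflt=$(printf 'a b\nc\n\n' |
    awk 'BEGIN { RS = "\n\n+" } { print NR, NF, $3 }')
byline=$(printf 'a b\nc\n\n' |
    awk 'BEGIN { RS = "\n\n+"; FS = "\n" } { print NF ": " $1 }')
echo "$dflt"      # prints: 1 3 c
echo "$byline"    # prints: 2: a b
```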
809
810   13. Program execution
811       This  section  describes the order of program execution.  First ARGC is
812       set to the total number of command line arguments passed to the  execu‐
813       tion phase of the program.
814
       •   ARGV[0] is set to the name of the AWK interpreter and

       •   ARGV[1] ... ARGV[ARGC-1] holds the remaining command line arguments
           exclusive of options and program source.
819
820       For example, with
821
822            mawk  -f  prog  v=1  A  t=hello  B
823
824       ARGC = 5 with
825              ARGV[0] = "mawk",
826              ARGV[1] = "v=1",
827              ARGV[2] = "A",
828              ARGV[3] = "t=hello" and
829              ARGV[4] = "B".
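
       A program made only of BEGIN blocks never opens its  file  arguments,
       so the values above can be inspected without files named A or B exist‐
       ing.  The sketch below assumes an awk on the PATH and skips  ARGV[0],
       whose value depends on how the interpreter was invoked.

```sh
# Sketch only: print ARGC and ARGV[1] ... ARGV[ARGC-1].
out=$(awk 'BEGIN {
    printf "%d:", ARGC
    for (i = 1; i < ARGC; i++) printf " %s", ARGV[i]
    print ""
}' v=1 A t=hello B)
echo "$out"    # prints: 5: v=1 A t=hello B
```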
830
831       Next, each BEGIN block is executed in order.  If the  program  consists
832       entirely  of  BEGIN  blocks,  then  execution terminates, else an input
833       stream is opened and execution continues.  If ARGC equals 1, the  input
834       stream  is  set  to stdin, else  the command line arguments ARGV[1] ...
835       ARGV[ARGC-1] are examined for a file argument.
836
837       The command line arguments divide into three sets: file arguments,  as‐
838       signment  arguments  and  empty strings "".  An assignment has the form
839       var=string.  When an ARGV[i] is examined as a possible  file  argument,
840       if  it is empty it is skipped; if it is an assignment argument, the as‐
841       signment to var takes place and i skips  to  the  next  argument;  else
842       ARGV[i] is opened for input.  If it fails to open, execution terminates
843       with exit code 2.  If no command line argument is a file argument, then
844       input comes from stdin.  Getline in a BEGIN action opens input.  “-” as
845       a file argument denotes stdin.
846
847       Once an input stream is open, each input record is tested against  each
848       pattern,  and if it matches, the associated action is executed.  An ex‐
849       pression pattern matches if it is boolean true (see the end of  section
850       2).  A BEGIN pattern matches before any input has been read, and an END
       pattern matches after all  input  has  been  read.   A  range  pattern,
       expr1,expr2, matches every record between the match of expr1 and the
       match of expr2, inclusive.
854
855       When end of file occurs on the input stream, the remaining command line
856       arguments  are  examined for a file argument, and if there is one it is
857       opened, else the END pattern is considered matched and all END  actions
858       are executed.
859
860       In  the example, the assignment v=1 takes place after the BEGIN actions
861       are executed, and the data placed in v is typed number and string.  In‐
862       put is then read from file A.  On end of file A, t is set to the string
863       "hello", and B is opened for input.  On end of file B, the END  actions
864       are executed.
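
       The timing of assignment arguments can be observed with a  sketch  in
       which  standard input, named by "-", stands in for a data file.  The
       variable names and the one-line input are illustrative.

```sh
# Sketch only: v is assigned after BEGIN, t after the input is exhausted.
out=$(printf 'x\n' | awk '
BEGIN { print "BEGIN v=[" v "]" }
      { print "main  v=[" v "] t=[" t "]" }
END   { print "END   t=[" t "]" }
' v=1 - t=hello)
echo "$out"
```

       The three lines printed are "BEGIN v=[]", "main  v=[1] t=[]" and "END
       t=[hello]" (with three spaces after END).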
865
866       Program flow at the pattern {action} level can be changed with the
867
868            next
869            nextfile
870            exit  opt_expr
871
872       statements:
873
874       •   A  next  statement causes the next input record to be read and pat‐
875           tern testing to restart with the first pattern {action} pair in the
876           program.
877
878       •   A  nextfile statement tells mawk to stop processing the current in‐
879           put file.  It then updates FILENAME to the next file listed on  the
880           command line, and resets FNR to 1.
881
882       •   An  exit statement causes immediate execution of the END actions or
883           program termination if there are none or if the exit occurs  in  an
884           END action.  The opt_expr sets the exit value of the program unless
885           overridden by a later exit or subsequent error.
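
       A short sketch of next and exit follows.  It assumes an  awk  on  the
       PATH;  the  input  and the exit value 7 are arbitrary.  nextfile is
       not shown because it needs more than one input file.

```sh
# Sketch only: next skips a record, exit runs END and sets the status.
out=$(printf '1\n2\n3\n4\n5\n' | awk '
$1 % 2  { next }       # odd: skip the remaining rules for this record
$1 == 4 { exit 7 }     # stop reading input; END actions still run
        { print }
END     { print "done" }
')
status=$?
echo "$out"
echo "exit status: $status"
```

       Only the record "2" reaches the print rule, END then prints  "done",
       and the exit status is 7.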
886

EXAMPLES

888       1. emulate cat.
889
890            { print }
891
892       2. emulate wc.
893
894            { chars += length($0) + 1  # add one for the \n
895              words += NF
896            }
897
898            END{ print NR, words, chars }
899
900       3. count the number of unique “real words”.
901
902            BEGIN { FS = "[^A-Za-z]+" }
903
904            { for(i = 1 ; i <= NF ; i++)  word[$i] = "" }
905
906            END { delete word[""]
907                  for ( i in word )  cnt++
908                  print cnt
909            }
910
911       4. sum the second field of every record based on the first field.
912
913            $1 ~ /credit|gain/ { sum += $2 }
914            $1 ~ /debit|loss/  { sum -= $2 }
915
916            END { print sum }
917
918       5. sort a file, comparing as string
919
920            { line[NR] = $0 "" }  # make sure of comparison type
921                            # in case some lines look numeric
922
923            END {  isort(line, NR)
924              for(i = 1 ; i <= NR ; i++) print line[i]
925            }
926
927            #insertion sort of A[1..n]
928            function isort( A, n,    i, j, hold)
929            {
930              for( i = 2 ; i <= n ; i++)
931              {
932                hold = A[j = i]
933                while ( A[j-1] > hold )
934                { j-- ; A[j+1] = A[j] }
935                A[j] = hold
936              }
937              # sentinel A[0] = "" will be created if needed
938            }
939
940

COMPATIBILITY ISSUES

942   MAWK 1.3.3 versus POSIX 1003.2 Draft 11.3
943       The POSIX 1003.2(draft 11.3) definition of the AWK language is  AWK  as
944       described  in  the AWK book with a few extensions that appeared in Sys‐
945       temVR4 nawk.  The extensions are:
946
947          •   New functions: toupper() and tolower().
948
949          •   New variables: ENVIRON[] and CONVFMT.
950
951          •   ANSI C conversion specifications for printf() and sprintf().
952
953          •   New command options:  -v var=value, multiple -f options and  im‐
954              plementation options as arguments to -W.
955
956          •   For  systems  (MS-DOS  or Windows) which provide a setmode func‐
957              tion, an environment variable MAWKBINMODE and a  built-in  vari‐
958              able  BINMODE.   The bits of the BINMODE value tell mawk  how to
959              modify the RS and ORS variables:
960
961              0  set standard input to binary mode, and if BIT-2 is unset, set
962                 RS to "\r\n" (CR/LF) rather than "\n" (LF).
963
964              1  set  standard  output  to binary mode, and if BIT-2 is unset,
965                 set ORS to "\r\n" (CR/LF) rather than "\n" (LF).
966
967              2  suppress the assignment to RS and ORS  of  CR/LF,  making  it
968                 possible  to  run scripts and generate output compatible with
969                 Unix line-endings.
970
971       POSIX AWK is oriented to operate on files a line at a time.  RS can  be
972       changed  from  "\n" to another single character, but it is hard to find
973       any use for this — there are no examples in the AWK book.   By  conven‐
974       tion, RS = "", makes one or more blank lines separate records, allowing
975       multi-line records.  When RS = "", "\n" is always a field separator re‐
976       gardless of the value in FS.
977
978       mawk,  on  the  other hand, allows RS to be a regular expression.  When
979       "\n" appears in records, it is treated as space, and FS  always  deter‐
980       mines fields.
981
982       Removing the line at a time paradigm can make some programs simpler and
983       can often improve performance.  For example,  redoing  example  3  from
984       above,
985
986            BEGIN { RS = "[^A-Za-z]+" }
987
988            { word[ $0 ] = "" }
989
990            END { delete  word[ "" ]
991              for( i in word )  cnt++
992              print cnt
993            }
994
995       counts  the  number  of  unique words by making each word a record.  On
996       moderate size files, mawk executes twice as fast, because of  the  sim‐
997       plified inner loop.
998
999       The  following  program  replaces each comment by a single space in a C
1000       program file,
1001
1002            BEGIN {
1003              RS = "/\*([^*]|\*+[^/*])*\*+/"
1004                 # comment is record separator
1005              ORS = " "
1006              getline  hold
1007              }
1008
1009              { print hold ; hold = $0 }
1010
1011              END { printf "%s" , hold }
1012
1013       Buffering one record is needed to avoid  terminating  the  last  record
1014       with a space.
1015
1016       With mawk, the following are all equivalent,
1017
1018            x ~ /a\+b/    x ~ "a\+b"     x ~ "a\\+b"
1019
1020       The  strings  get scanned twice, once as string and once as regular ex‐
1021       pression.  On the string scan, mawk ignores the  escape  on  non-escape
1022       characters while the AWK book advocates \c be recognized as c which ne‐
1023       cessitates the double escaping of meta-characters  in  strings.   POSIX
1024       explicitly  declines to define the behavior which passively forces pro‐
1025       grams that must run under a variety of awks to use  the  more  portable
1026       but less readable, double escape.
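
       As a sketch of the portable form, the  following  uses  only  a  regex
       literal  and  the  double-escaped  string,  which  do not depend on how
       an implementation treats "\+" in a string.  The variable  names  are
       illustrative.

```sh
# Sketch only: match a literal plus sign without relying on "\+".
out=$(awk 'BEGIN {
    r1 = ("a+b" ~ /a\+b/)      # regex literal
    r2 = ("a+b" ~ "a\\+b")     # string: \\ becomes \ before regex scan
    r3 = ("aab" ~ "a\\+b")     # no literal "+" in the subject
    print r1, r2, r3
}')
echo "$out"    # prints: 1 1 0
```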
1027
1028       POSIX AWK does not recognize "/dev/std{in,out,err}".  Some systems pro‐
1029       vide an actual device for this, allowing AWKs which  do  not  implement
1030       the feature directly to support it.
1031
1032       POSIX  AWK  does not recognize \x hex escape sequences in strings.  Un‐
1033       like ANSI C, mawk limits the number of digits that follows \x to two as
1034       the current implementation only supports 8 bit characters.
1035
1036       POSIX explicitly leaves the behavior of FS = "" undefined, and mentions
1037       splitting the record into characters as a possible interpretation,  but
1038       currently this use is not portable across implementations.
1039
       Some  features  were  not  part  of the POSIX standard until long after
       their introduction in mawk and other implementations.  These have  been
       approved but, as of July 2020, are not yet part of a published
       standard:
1044
1045       •   The built-in fflush first appeared in a 1993 AT&T awk  released  to
1046           netlib.  It was approved for the POSIX standard in 2012.
1047
1048       •   Aggregate deletion with delete array was approved in 2018.
1049
1050   Random numbers
1051       POSIX  does  not  prescribe a method for initializing random numbers at
1052       startup.
1053
1054       In practice, most implementations do nothing special, which makes srand
1055       and rand follow the C runtime library, making the initial seed value 1.
1056       Some implementations (Solaris XPG4 and Tru64) return 0 from  the  first
1057       call  to srand, although the results from rand behave as if the initial
1058       seed is 1.  Other implementations return 1.
1059
1060       While mawk can call srand at startup with  no  parameter  (initializing
1061       random  numbers  from  the clock), this feature may be suppressed using
1062       conditional compilation.
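
       Because the startup seed differs between implementations and  builds,
       a script that needs repeatable numbers can seed explicitly.  The fol‐
       lowing is a sketch; the seed 42 and the variable names are arbitrary.

```sh
# Sketch only: the same explicit seed gives the same first value.
out=$(awk 'BEGIN {
    srand(42); a = rand()
    srand(42); b = rand()
    same    = (a == b)              # identical seed, identical value
    inrange = (a >= 0 && a < 1)     # rand() returns a value in [0, 1)
    print same, inrange
}')
echo "$out"    # prints: 1 1
```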
1063
   Extensions added for compatibility with GAWK and BWK
1065       Nextfile is a gawk extension (also implemented by BWK awk).  It was ap‐
1066       proved  for the POSIX standard in September 2012, and is expected to be
1067       part of the next revision of the standard.
1068
1069       Mktime, strftime and systime are gawk extensions.
1070
1071       The "/dev/stdin" feature was added to mawk after 1.3.4, for compatibil‐
1072       ity   with  gawk  and  BWK  awk.   The  corresponding  "-"  (alias  for
1073       /dev/stdin) was present in mawk 1.3.3.
1074
1075       Interval expressions, e.g., a range {m,n} in Extended  Regular  Expres‐
1076       sions (EREs), were not supported in awk (or even the original “nawk”):
1077
1078       •   Gawk provided this feature in 1991 (and later, in 1998, options for
1079           turning it off, for compatibility with “traditional awk”).
1080
       •   Interval expressions were introduced into awk regular  expressions
           in IEEE 1003.1-2001 (also known as Unix 03), along with some
           internationalization features.
1084
1085       •   Apple modified its copy of the original awk in April  2006,  making
1086           this version of awk support interval expressions.
1087
1088           The  updated  source provides for compatibility with older “legacy”
1089           versions using an environment variable,  making  this  “Unix  2003”
1090           feature (perhaps meant as Unix 03) the default.
1091
1092       •   NetBSD  developers copied this change in January 2018, omitting the
1093           compatibility option, and then applied it to BWK awk.
1094
1095       •   The interval expression implementation in mawk is based on  changes
1096           proposed by James Parkinson in April 2016.
1097
1098       Mawk  also  recognizes  a  few  gawk-specific  command line options for
1099       script compatibility:
1100
1101            --help, --posix, -r, --re-interval, --traditional, --version
1102
1103   Subtle Differences not in POSIX or the AWK Book
1104       Finally, here is how mawk handles exceptional cases  not  discussed  in
1105       the  AWK  book  or the POSIX draft.  It is unsafe to assume consistency
1106       across awks and safe to skip to the next section.
1107
1108          •   substr(s, i, n) returns the characters of s in the  intersection
1109              of the closed interval [1, length(s)] and the half-open interval
1110              [i, i+n).  When this intersection is empty, the empty string  is
1111              returned; so substr("ABC", 1, 0) = "" and substr("ABC", -4, 6) =
1112              "A".
1113
          •   Every string, including the empty string, matches the empty
              string at the front, so s ~ // and s ~ "" are always 1, as are
              match(s, //) and match(s, "").  The last two set RLENGTH to 0.
1117
1118          •   index(s, t) is always the same as match(s, t1) where t1  is  the
1119              same  as  t with metacharacters escaped.  Hence consistency with
              match requires that index(s, "") always  returns  1.   Also  the
              condition, index(s,t) != 0 if and only if t is a substring of s,
              requires index("","") = 1.
1123
          •   If getline encounters end of file, getline var leaves var
              unchanged.  Similarly, on entry to the END actions, $0, the fields
              and NF have their value unaltered from the last record.
1127

ENVIRONMENT VARIABLES

1129       Mawk recognizes these variables:
1130
1131          MAWKBINMODE
1132             (see COMPATIBILITY ISSUES)
1133
1134          MAWK_LONG_OPTIONS
1135             If this is set, mawk uses its value to decide  what  to  do  with
1136             GNU-style long options:
1137
1138               allow  Mawk allows the option to be checked against the (small)
1139                      set of long options it recognizes.
1140
1141                      The long names from the -W option are recognized,  e.g.,
1142                      --version is derived from -Wversion.
1143
1144               error  Mawk prints an error message and exits.  This is the de‐
1145                      fault.
1146
               ignore Mawk ignores the option, unless it happens to be one  of
                      the ones it recognizes.

               warn   Mawk prints a warning message and otherwise ignores the
                      option.
1152
1153             If the variable is unset, mawk prints an error message and exits.
1154
1155          WHINY_USERS
1156             This is a gawk 3.1.0 feature, removed in the 4.0.0  release.   It
1157             tells mawk to sort array indices before it starts to iterate over
1158             the elements of an array.
1159

SEE ALSO

1161       grep(1)
1162
1163       Aho, Kernighan and Weinberger, The AWK Programming  Language,  Addison-
1164       Wesley  Publishing, 1988, (the AWK book), defines the language, opening
1165       with a tutorial and advancing to many interesting programs  that  delve
1166       into  issues of software design and analysis relevant to programming in
1167       any language.
1168
1169       The GAWK Manual, The Free Software Foundation, 1991, is a tutorial  and
1170       language  reference that does not attempt the depth of the AWK book and
1171       assumes the reader may be a novice programmer.  The section on AWK  ar‐
1172       rays is excellent.  It also discusses POSIX requirements for AWK.
1173
1174       mawk-arrays(7) discusses mawk's implementation of arrays.
1175
1176       mawk-code(7) gives more information on the -W dump option.
1177

BUGS

1179       mawk  implements  printf() and sprintf() using the C library functions,
1180       printf and sprintf, so full ANSI compatibility requires an ANSI  C  li‐
1181       brary.   In  practice  this means the h conversion qualifier may not be
1182       available.
1183
1184       Also mawk inherits any bugs or limitations of the library functions.
1185
1186       Implementors of the AWK language have shown a consistent lack of imagi‐
1187       nation when naming their programs.
1188

AUTHOR

1190       Mike Brennan (brennan@whidbey.com).
1191       Thomas E. Dickey <dickey@invisible-island.net>.
1192
1193
1194
1195Version 1.3.4                     2023-04-04                           MAWK(1)