MAWK(1)                          USER COMMANDS                         MAWK(1)

NAME

6       mawk - pattern scanning and text processing language
7

SYNOPSIS

9       mawk  [-W  option]  [-F value] [-v var=value] [--] 'program text' [file
10       ...]
11       mawk [-W option] [-F value] [-v var=value] [-f program-file] [--] [file
12       ...]
13

DESCRIPTION

15       mawk  is an interpreter for the AWK Programming Language.  The AWK lan‐
16       guage is useful for manipulation of data files, text retrieval and pro‐
17       cessing,  and  for prototyping and experimenting with algorithms.  mawk
18       is a new awk meaning it implements the AWK language as defined in  Aho,
19       Kernighan  and Weinberger, The AWK Programming Language, Addison-Wesley
20       Publishing, 1988 (hereafter referred to as the AWK  book.)   mawk  con‐
21       forms  to  the POSIX 1003.2 (draft 11.3) definition of the AWK language
22       which contains a few features not described in the AWK book,  and  mawk
23       provides a small number of extensions.
24
25       An  AWK  program  is  a sequence of pattern {action} pairs and function
26       definitions.  Short programs are entered on the  command  line  usually
27       enclosed  in ' ' to avoid shell interpretation.  Longer programs can be
28       read in from a file with the -f option.  Data  input is read  from  the
29       list  of files on the command line or from standard input when the list
30       is empty.  The input is broken into records as determined by the record
31       separator  variable,  RS.  Initially, RS = “\n” and records are synony‐
32       mous with lines.  Each record is compared against each pattern  and  if
33       it matches, the program text for {action} is executed.
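
       As a minimal sketch of the record and pattern {action} model described
       above, the following filters three made-up input lines.  The AWK
       variable is an assumption of this sketch: it prefers mawk and falls
       back to any awk found on PATH.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

# Each input line is one record.  The pattern $2 > 5 selects records and
# the action prints the first field of each selected record.
out=$(printf 'apple 3\nbanana 7\ncherry 12\n' | "$AWK" '$2 > 5 { print $1 }')
printf '%s\n' "$out"
```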
34

OPTIONS

36       -F value       sets the field separator, FS, to value.
37
38       -f file        Program  text is read from file instead of from the com‐
39                      mand line.  Multiple -f options are allowed.
40
41       -v var=value   assigns value to program variable var.
42
43       --             indicates the unambiguous end of options.
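
       A short sketch of the -F and -v options, using made-up colon-separated
       input.  The variable name min and the AWK fallback variable are
       assumptions of the sketch, not part of mawk.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

# -F: sets FS to ":" and -v min=1000 assigns the program variable min
# before the program starts.
out=$(printf 'root:x:0\nalice:x:1000\n' |
      "$AWK" -F: -v min=1000 '$3 >= min { print $1 }')
printf '%s\n' "$out"
```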
44
45       The above options will be available with any POSIX compatible implemen‐
46       tation  of  AWK.  Implementation specific options are prefaced with -W.
47       mawk provides these:
48
49       -W dump        writes an assembler like listing of the internal  repre‐
50                      sentation  of the program to stdout and exits 0 (on suc‐
51                      cessful compilation).
52
53       -W exec file   Program text is read from file  and  this  is  the  last
54                      option.
55
56                      This  is a useful alternative to -f on systems that sup‐
57                      port the #!  “magic number”  convention  for  executable
58                      scripts.   Those  implicitly  pass  the  pathname of the
59                      script itself as the final parameter, and expect no more
60                      than  one  “-”  option on the #! line.  Because mawk can
61                      combine multiple -W options separated by commas, you can
62                      use this option when an additional -W option is needed.
63
64       -W help        prints  a  usage  message  to  stderr and exits (same as
65                      “-W usage”).
66
67       -W interactive sets unbuffered writes to stdout and line buffered reads
68                      from  stdin.  Records from stdin are lines regardless of
69                      the value of RS.
70
71       -W posix_space forces mawk not to consider '\n' to be space.
72
73       -W random=num  calls srand with the given parameter (and overrides  the
74                      auto-seeding behavior).
75
76       -W sprintf=num adjusts  the  size  of mawk's internal sprintf buffer to
77                      num bytes.  More than rare use of this option  indicates
78                      mawk should be recompiled.
79
80       -W usage       prints  a  usage  message  to  stderr and exits (same as
81                      “-W help”).
82
83       -W version     mawk writes its version and copyright to stdout and com‐
84                      piled limits to stderr and exits 0.
85
86       mawk  accepts  abbreviations for any of these options, e.g., “-W v” and
87       “-Wv” both tell mawk to show its version.
88
89       mawk allows multiple -W  options  to  be  combined  by  separating  the
90       options  with  commas,  e.g.,  -Wsprint=2000,posix.  This is useful for
91       executable #!  “magic number” invocations in which only one argument is
92       supported, e.g., -Winteractive,exec.
93

THE AWK LANGUAGE

95   1. Program structure
96       An  AWK  program is a sequence of pattern {action} pairs and user func‐
97       tion definitions.
98
99       A pattern can be:
100            BEGIN
101            END
102            expression
103            expression , expression
104
105       One, but not both, of pattern {action} can be omitted.  If {action}  is
106       omitted  it is implicitly { print }.  If pattern is omitted, then it is
107       implicitly matched.  BEGIN and END patterns require an action.
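
       The following sketch, run on made-up input, combines a range pattern
       with no action, an action with no pattern, and an END block.  The AWK
       fallback variable is an assumption of the sketch.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$(printf 'a\nstart\nb\nstop\nc\n' | "$AWK" '
# Pattern only: the action defaults to { print }.
/start/, /stop/
# Action only: it runs for every record.
{ n++ }
END { print n " records" }
')
printf '%s\n' "$out"
```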
108
       Statements are terminated by newlines, semicolons or both.  Groups of
       statements such as actions or loop bodies are blocked via { ... } as in
       C.  The last statement in a block doesn't need a terminator.  Blank
       lines have no meaning; an empty statement is terminated with a
       semicolon.  Long statements can be continued with a backslash, \.  A
       statement can be broken without a backslash after a comma, left brace,
       &&, ||, do, else, the right parenthesis of an if, while or for
       statement, and the right parenthesis of a function definition.  A
       comment starts with # and extends to, but does not include, the end of
       the line.
118
119       The following statements control program flow inside blocks.
120
121            if ( expr ) statement
122
123            if ( expr ) statement else statement
124
125            while ( expr ) statement
126
127            do statement while ( expr )
128
129            for ( opt_expr ; opt_expr ; opt_expr ) statement
130
131            for ( var in array ) statement
132
133            continue
134
135            break
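
       A small sketch exercising several of the flow-control statements
       listed above.  The variable names and the AWK fallback variable are
       illustrative assumptions.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    # for with continue: sum the even numbers from 1 to 10
    for (i = 1; i <= 10; i++) {
        if (i % 2) continue
        sum += i
    }
    # while with break: first power of 2 above 100
    p = 1
    while (1) {
        if (p > 100) break
        p *= 2
    }
    # do ... while runs its body at least once
    do { k++ } while (k < 3)
    print sum, p, k
}')
printf '%s\n' "$out"
```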
136
137   2. Data types, conversion and comparison
138       There are two basic data types, numeric and string.  Numeric  constants
139       can  be  integer  like -2, decimal like 1.08, or in scientific notation
140       like -1.1e4 or .28E-3.  All numbers are represented internally and  all
141       computations  are  done  in floating point arithmetic.  So for example,
142       the expression 0.2e2 == 20 is true and true is represented as 1.0.
143
144       String constants are enclosed in double quotes.
145
146                   "This is a string with a newline at the end.\n"
147
148       Strings can be continued across a line by  escaping  (\)  the  newline.
149       The following escape sequences are recognized.
150
151            \\        \
152            \"        "
153            \a        alert, ascii 7
154            \b        backspace, ascii 8
155            \t        tab, ascii 9
156            \n        newline, ascii 10
157            \v        vertical tab, ascii 11
158            \f        formfeed, ascii 12
159            \r        carriage return, ascii 13
160            \ddd      1, 2 or 3 octal digits for ascii ddd
161            \xhh      1 or 2 hex digits for ascii  hh
162
163       If  you  escape  any other character \c, you get \c, i.e., mawk ignores
164       the escape.
165
166       There are really three basic data types; the third is number and string
167       which  has  both  a  numeric value and a string value at the same time.
168       User defined variables come into existence when  first  referenced  and
169       are  initialized  to  null, a number and string value which has numeric
170       value 0 and string value "".  Non-trivial number and string typed  data
171       come from input and are typically stored in fields.  (See section 4).
172
       The type of an expression is determined by its context and automatic
       type conversion occurs if needed.  For example, consider the
       statements
176
177            y = x + 2  ;  z = x  "hello"
178
       The value stored in variable y will be typed numeric.  If x is not
       numeric, the value read from x is converted to numeric before it is
       added to 2 and stored in y.  The value stored in variable z will be
       typed string, and the value of x will be converted to string if
       necessary and concatenated with "hello".  (Of course, the value and
       type stored in x are not changed by any conversions.)  A string
       expression is converted to numeric using its longest numeric prefix as
       with atof(3).  A numeric expression is converted to string by replacing
       expr with sprintf(CONVFMT, expr), unless expr can be represented on the
       host machine as an exact integer, in which case it is converted with
       sprintf("%d", expr).  Sprintf() is an AWK built-in that duplicates the
       functionality of sprintf(3), and CONVFMT is a built-in variable used
       for internal conversion from number to string and initialized to
       "%.6g".  Explicit type conversions can be forced: expr "" is string and
       expr+0 is numeric.
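
       The conversion rules can be checked with a short sketch.  The values
       are made up for illustration and the AWK fallback variable is an
       assumption of the sketch.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    x = "3x"
    y = x + 2          # the numeric prefix "3" is used, so y is 5
    z = x "hello"      # string concatenation
    print y
    print z
    pi = 3.14159265
    print (pi "")      # a non-integer goes through CONVFMT ("%.6g")
    print (17 "")      # an exact integer is converted with "%d"
    print (0.2e2 == 20)
}')
printf '%s\n' "$out"
```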
194
195       To evaluate, expr1 rel-op expr2, if both operands are numeric or number
196       and string then the comparison is numeric; if both operands are  string
197       the  comparison is string; if one operand is string, the non-string op‐
198       erand is converted  and  the  comparison  is  string.   The  result  is
199       numeric, 1 or 0.
200
       In boolean contexts such as if ( expr ) statement, a string expression
       evaluates true if and only if it is not the empty string ""; a numeric
       expression evaluates true if and only if it is not numerically zero.
204
205   3. Regular expressions
206       In  the  AWK language, records, fields and strings are often tested for
207       matching a regular expression.  Regular  expressions  are  enclosed  in
208       slashes, and
209
210            expr ~ /r/
211
212       is  an  AWK  expression  that evaluates to 1 if expr “matches” r, which
213       means a substring of expr is in the set of strings defined by r.   With
214       no  match  the  expression  evaluates  to  0; replacing ~ with the “not
215       match” operator, !~ , reverses the meaning.  As  pattern-action pairs,
216
217            /r/ { action }   and   $0 ~ /r/ { action }
218
219       are the same, and for each input record that matches r, action is  exe‐
220       cuted.   In  fact, /r/ is an AWK expression that is equivalent to ($0 ~
221       /r/) anywhere except when on the right side  of  a  match  operator  or
222       passed  as  an  argument  to a built-in function that expects a regular
223       expression argument.
224
225       AWK uses extended regular expressions as with the -E option of grep(1).
226       The regular expression metacharacters, i.e., those with special meaning
227       in regular expressions are
228
229            \ ^ $ . [ ] | ( ) * + ?
230
231       Regular expressions are built up from characters as follows:
232
233            c            matches any non-metacharacter c.
234
235            \c           matches  a  character  defined  by  the  same  escape
236                         sequences  used  in  string  constants or the literal
237                         character c if \c is not an escape sequence.
238
239            .            matches any character (including newline).
240
241            ^            matches the front of a string.
242
243            $            matches the back of a string.
244
245            [c1c2c3...]  matches any character in the  class  c1c2c3... .   An
246                         interval  of  characters  is  denoted  c1-c2 inside a
247                         class [...].
248
249            [^c1c2c3...] matches any character not in the class c1c2c3...
250
251       Regular expressions are built up from other regular expressions as fol‐
252       lows:
253
254            r1r2         matches  r1  followed  immediately  by r2 (concatena‐
255                         tion).
256
257            r1 | r2      matches r1 or r2 (alternation).
258
259            r*           matches r repeated zero or more times.
260
261            r+           matches r repeated one or more times.
262
263            r?           matches r zero or once.
264
265            (r)          matches r, providing grouping.
266
267       The increasing precedence of operators  is  alternation,  concatenation
268       and unary (*, + or ?).
269
270       For example,
271
272            /^[_a-zA-Z][_a-zA-Z0-9]*$/  and
273            /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/
274
275       are  matched by AWK identifiers and AWK numeric constants respectively.
276       Note that “.” has to be escaped to be recognized as  a  decimal  point,
277       and that metacharacters are not special inside character classes.
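
       As a rough check of the two expressions above, this sketch classifies
       a handful of made-up tokens.  The output labels and the AWK fallback
       variable are assumptions of the sketch.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$(printf '%s\n' 3.14 abc -1e5 .5 1e x9 | "$AWK" '
/^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/ { print "number: " $0 }
/^[_a-zA-Z][_a-zA-Z0-9]*$/                           { print "identifier: " $0 }
')
printf '%s\n' "$out"
```

       The token 1e matches neither expression: it starts with a digit, and
       its exponent has no digits.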
278
279       Any expression can be used on the right hand side of the ~ or !~ opera‐
280       tors or passed to a built-in that expects  a  regular  expression.   If
281       needed,  it  is  converted to string, and then interpreted as a regular
282       expression.  For example,
283
284            BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }
285
286            $0 ~ "^" identifier
287
288       prints all lines that start with an AWK identifier.
289
290       mawk recognizes the empty regular expression,  //,  which  matches  the
291       empty  string and hence is matched by any string at the front, back and
292       between every character.  For example,
293
            echo  abc | mawk '{ gsub(//, "X") ; print }'
            XaXbXcX
296
297
298   4. Records and fields
299       Records are read in one at a time, and stored in the field variable $0.
300       The  record  is split into fields which are stored in $1, $2, ..., $NF.
301       The built-in variable NF is set to the number of fields, and NR and FNR
302       are incremented by 1.  Fields above $NF are set to "".
303
304       Assignment to $0 causes the fields and NF to be recomputed.  Assignment
305       to NF or to a field causes $0 to be reconstructed by concatenating  the
306       $i's  separated  by OFS.  Assignment to a field with index greater than
307       NF, increases NF and causes $0 to be reconstructed.
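
       A minimal sketch of how field assignment rebuilds $0, using a made-up
       record.  The AWK fallback variable is an assumption of the sketch.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$(echo 'a b c' | "$AWK" 'BEGIN { OFS = "-" }
{
    $2 = "X"; print       # $0 is rebuilt with OFS between the fields
    $5 = "E"; print NF    # assigning past NF extends the record
    print                 # the new field $4 is empty
}')
printf '%s\n' "$out"
```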
308
309       Data input stored in fields is string,  unless  the  entire  field  has
310       numeric form and then the type is number and string.  For example,
311
312            echo 24 24E |
313            mawk '{ print($1>100, $1>"100", $2>100, $2>"100") }'
314            0 1 1 1
315
316       $0 and $2 are string and $1 is number and string.  The first comparison
317       is numeric, the second is string, the third is string (100 is converted
318       to "100"), and the last is string.
319
320   5. Expressions and operators
321       The expression syntax is similar to C.  Primary expressions are numeric
322       constants, string constants, variables,  fields,  arrays  and  function
323       calls.   The  identifier  for  a  variable,  array or function can be a
324       sequence of letters, digits and underscores, that does not start with a
325       digit.   Variables  are  not declared; they exist when first referenced
326       and are initialized to null.
327
328       New expressions are composed with the following operators in  order  of
329       increasing precedence.
330
            assignment          =  +=  -=  *=  /=  %=  ^=
            conditional         ?  :
            logical or          ||
            logical and         &&
            array membership    in
            matching            ~   !~
            relational          <  >   <=  >=  ==  !=
            concatenation       (no explicit operator)
            add ops             +  -
            mul ops             *  /  %
            unary               +  -
            logical not         !
            exponentiation      ^
            inc and dec         ++ -- (both post and pre)
            field               $
346
347       Assignment, conditional and exponentiation associate right to left; the
348       other operators associate left to right.  Any expression can be  paren‐
349       thesized.
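
       A few consequences of the precedence and associativity rules, as a
       sketch.  The expressions are chosen for illustration and the AWK
       fallback variable is an assumption.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    print 2 ^ 3 ^ 2        # right associative: 2 ^ (3 ^ 2)
    x = -2 ^ 2             # ^ binds tighter than unary minus
    print x
    print 1 " " 2 + 3      # + binds tighter than concatenation
    y = z = 4              # assignment is right associative
    print y, z
}')
printf '%s\n' "$out"
```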
350
351   6. Arrays
352       Awk  provides  one-dimensional arrays.  Array elements are expressed as
353       array[expr].  Expr is internally converted  to  string  type,  so,  for
354       example,  A[1]  and A["1"] are the same element and the actual index is
355       "1".  Arrays indexed by strings are called  associative  arrays.   Ini‐
356       tially  an  array  is  empty;  elements  exist when first accessed.  An
357       expression, expr in array evaluates to 1 if array[expr] exists, else to
358       0.
359
360       There  is  a form of the for statement that loops over each index of an
361       array.
362
363            for ( var in array ) statement
364
       sets var to each index of array and executes statement.  The order in
       which var traverses the indices of array is not defined.
367
368       The  statement,  delete  array[expr],  causes array[expr] not to exist.
369       mawk supports an extension, delete array, which deletes all elements of
370       array.
371
372       Multidimensional  arrays  are  synthesized with concatenation using the
373       built-in  variable  SUBSEP.   array[expr1,expr2]   is   equivalent   to
374       array[expr1 SUBSEP expr2].  Testing for a multidimensional element uses
375       a parenthesized index, such as
376
377            if ( (i, j) in A )  print A[i, j]
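
       The array rules above can be sketched as follows.  The array name and
       contents are made up, and the AWK fallback variable is an assumption.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    A[1] = "one"
    print A["1"]                     # same element as A[1]
    if (!("2" in A)) print "no 2"    # the in test does not create A["2"]
    A["x", "y"] = 5                  # stored under "x" SUBSEP "y"
    if (("x", "y") in A) print "found", A["x" SUBSEP "y"]
    n = 0
    for (k in A) n++                 # visiting order is not defined
    print n
    delete A[1]
    if (!(1 in A)) print "deleted"
}')
printf '%s\n' "$out"
```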
378
379
   7. Built-in variables
381       The following variables are built-in  and  initialized  before  program
382       execution.
383
384            ARGC      number of command line arguments.
385
386            ARGV      array of command line arguments, 0..ARGC-1.
387
388            CONVFMT   format  for  internal  conversion  of numbers to string,
389                      initially = "%.6g".
390
391            ENVIRON   array indexed by environment variables.  An  environment
392                      string, var=value is stored as ENVIRON[var] = value.
393
394            FILENAME  name of the current input file.
395
396            FNR       current record number in FILENAME.
397
398            FS        splits records into fields as a regular expression.
399
400            NF        number of fields in the current record.
401
402            NR        current record number in the total input stream.
403
404            OFMT      format for printing numbers; initially = "%.6g".
405
406            OFS       inserted between fields on output, initially = " ".
407
408            ORS       terminates each record on output, initially = "\n".
409
410            RLENGTH   length  set  by  the last call to the built-in function,
411                      match().
412
413            RS        input record separator, initially = "\n".
414
415            RSTART    index set by the last call to match().
416
417            SUBSEP    used to build multiple  array  subscripts,  initially  =
418                      "\034".
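
       A short sketch that reads a few of these variables.  The input and the
       environment variable GREETING are made up, and the AWK fallback
       variable is an assumption.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$(printf 'a b c\nd e\n' | GREETING=hi "$AWK" '
BEGIN { OFS = ":"; print ENVIRON["GREETING"] }
{ print NR, NF, $NF }        # OFS is inserted between the printed items
END { print "total", NR }
')
printf '%s\n' "$out"
```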
419
420   8. Built-in functions
421       String functions
422
423            gsub(r,s,t)  gsub(r,s)
424                   Global substitution, every match of regular expression r in
425                   variable t is replaced by string s.  The number of replace‐
426                   ments  is  returned.  If t is omitted, $0 is used.  An & in
427                   the replacement string s is replaced by  the  matched  sub‐
428                   string of t.  \& and \\ put  literal & and \, respectively,
429                   in the replacement string.
430
431            index(s,t)
432                   If t is a substring of s, then the position where t  starts
433                   is  returned, else 0 is returned.  The first character of s
434                   is in position 1.
435
436            length(s)
437                   Returns the length of string or array.  s.
438
439            match(s,r)
440                   Returns the index of the first  longest  match  of  regular
441                   expression  r  in  string  s.  Returns 0 if no match.  As a
442                   side effect, RSTART is set to the return value.  RLENGTH is
443                   set  to  the length of the match or -1 if no match.  If the
444                   empty string is matched, RLENGTH is set  to  0,  and  1  is
445                   returned  if  the match is at the front, and length(s)+1 is
446                   returned if the match is at the back.
447
448            split(s,A,r)  split(s,A)
449                   String s is split into fields by regular expression  r  and
450                   the  fields  are loaded into array A.  The number of fields
451                   is returned.  See section 11 below for more detail.   If  r
452                   is omitted, FS is used.
453
454            sprintf(format,expr-list)
455                   Returns  a  string  constructed from expr-list according to
456                   format.  See the description of printf() below.
457
458            sub(r,s,t)  sub(r,s)
459                   Single substitution, same as gsub() except at most one sub‐
460                   stitution.
461
462            substr(s,i,n)  substr(s,i)
463                   Returns  the substring of string s, starting at index i, of
464                   length n.  If n is omitted, the suffix of s, starting at  i
465                   is returned.
466
467            tolower(s)
468                   Returns  a  copy  of  s with all upper case characters con‐
469                   verted to lower case.
470
471            toupper(s)
472                   Returns a copy of s with all  lower  case  characters  con‐
473                   verted to upper case.
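
       The string functions can be tried together in one sketch.  The sample
       string is made up and the AWK fallback variable is an assumption.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    s = "hello world"
    print index(s, "wor")
    print length(s)
    print substr(s, 7)
    print substr(s, 1, 5)
    print toupper(s)
    if (match(s, /o+/)) print RSTART, RLENGTH
    n = gsub(/o/, "[&]", s)      # & stands for the matched text
    print n, s
}')
printf '%s\n' "$out"
```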
474
475       Time functions
476
477       These are available on systems which support the corresponding C mktime
478       and strftime functions:
479
480            mktime(specification)
481                   converts a date specification to a timestamp with the  same
482                   units  as systime.  The date specification is a string con‐
483                   taining the components of the date as decimal integers:
484
485                   YYYY
486                      the year, e.g., 2012
487
488                   MM the month of the year starting at 1
489
490                   DD the day of the month starting at 1
491
492                   HH hour (0-23)
493
494                   MM minute (0-59)
495
496                   SS seconds (0-59)
497
498                   DST
499                      tells how to  treat  timezone  versus  daylight  savings
500                      time:
501
502                        positive
503                           DST is in effect
504
505                        zero (default)
506                           DST is not in effect
507
508                        negative
509                           mktime()  should (use timezone information and sys‐
510                           tem databases to) attempt  to determine whether DST
511                           is in effect at the specified time.
512
513            strftime([format [, timestamp [, utc ]]])
514                   formats the given timestamp using the format (passed to the
515                   C strftime function):
516
517                   ·   If the format parameter is missing, "%c" is used.
518
519                   ·   If the timestamp  parameter  is  missing,  the  current
520                       value from systime is used.
521
522                   ·   If the utc parameter is present and nonzero, the result
523                       is in UTC.  Otherwise local time is used.
524
525            systime()
526                   returns the current time of day as the  number  of  seconds
527                   since the Epoch (1970-01-01 00:00:00 UTC on POSIX systems).
528
529       Arithmetic functions
530
531            atan2(y,x)     Arctan of y/x between -pi and pi.
532
533            cos(x)         Cosine function, x in radians.
534
535            exp(x)         Exponential function.
536
537            int(x)         Returns x truncated towards zero.
538
539            log(x)         Natural logarithm.
540
541            rand()         Returns a random number between zero and one.
542
543            sin(x)         Sine function, x in radians.
544
545            sqrt(x)        Returns square root of x.
546
547            srand(expr)  srand()
548                   Seeds  the random number generator, using the clock if expr
549                   is omitted, and returns the value  of  the  previous  seed.
550                   Srand(expr)   is   useful   for   repeating  pseudo  random
551                   sequences.
552
553                   Note: mawk is normally configured to seed the random number
554                   generator  from the clock at startup, making it unnecessary
555                   to call srand().  This feature can be suppressed via condi‐
556                   tional compile, or overridden using the -Wrandom option.
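
       A sketch of the arithmetic functions, including the use of a fixed
       seed to repeat a pseudo random sequence.  The seed value 42 is
       arbitrary and the AWK fallback variable is an assumption.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    print int(3.7), int(-3.7)      # truncation toward zero
    print sqrt(16), exp(0), log(1)
    print atan2(0, -1)             # pi, printed through OFMT
    srand(42); a = rand()
    srand(42); b = rand()
    print (a == b)                 # same seed, same sequence
}')
printf '%s\n' "$out"
```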
557
558   9. Input and output
559       There are two output statements, print and printf.
560
561            print  writes $0  ORS to standard output.
562
563            print expr1, expr2, ..., exprn
564                   writes  expr1  OFS expr2 OFS ... exprn ORS to standard out‐
565                   put.  Numeric expressions  are  converted  to  string  with
566                   OFMT.
567
568            printf format, expr-list
569                   duplicates  the  printf C library function writing to stan‐
570                   dard output.  The complete ANSI C format specifications are
571                   recognized with conversions %c, %d, %e, %E, %f, %g, %G, %i,
572                   %o, %s, %u, %x, %X and %%, and conversion qualifiers h  and
573                   l.
574
575       The  argument  list  to  print  or printf can optionally be enclosed in
576       parentheses.  Print formats numbers using OFMT or "%d" for exact  inte‐
577       gers.   "%c"  with  a  numeric  argument prints the corresponding 8 bit
578       character, with a string argument it prints the first character of  the
579       string.   The output of print and printf can be redirected to a file or
580       command by appending > file, >> file or | command to  the  end  of  the
581       print  statement.   Redirection opens file or command only once, subse‐
582       quent redirections append to the already open stream.   By  convention,
583       mawk associates the filename
584
585          ·   "/dev/stderr" with stderr,
586
587          ·   "/dev/stdout" with stdout,
588
589          ·   "-" and "/dev/stdin" with stdin.
590
591       The  association  with  stderr  is  especially useful because it allows
592       print and printf to be redirected to stderr.  These names can  also  be
593       passed to functions.
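
       The output rules can be sketched as below.  The values are made up;
       the warning line goes to stderr, which this sketch discards.  The AWK
       fallback variable is an assumption.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    # %c prints character 65 for a number and the first character of a string
    printf "%5.2f|%-4s|%03d|%c%c\n", 3.14159, "ab", 7, 65, "xyz"
    OFMT = "%.2f"
    print 3.14159, 42              # OFMT for non-integers, "%d" for integers
    print "warning" > "/dev/stderr"
}' 2>/dev/null)
printf '%s\n' "$out"
```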
594
595       The input function getline has the following variations.
596
597            getline
598                   reads into $0, updates the fields, NF, NR and FNR.
599
600            getline < file
601                   reads into $0 from file, updates the fields and NF.
602
603            getline var
604                   reads the next record into var, updates NR and FNR.
605
606            getline var < file
607                   reads the next record of file into var.
608
609            command | getline
610                   pipes  a record from command into $0 and updates the fields
611                   and NF.
612
613            command | getline var
614                   pipes a record from command into var.
615
616       Getline returns 0 on end-of-file, -1 on error, otherwise 1.
617
618       Commands on the end of pipes are executed by /bin/sh.
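
       A minimal sketch of the command | getline var form, looping until
       getline stops returning 1.  The command string is made up and the AWK
       fallback variable is an assumption.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    cmd = "echo one; echo two"
    # getline returns 1 per record, 0 at end-of-file and -1 on error
    while ((cmd | getline line) > 0)
        print "got " line
    print "close: " close(cmd)
}')
printf '%s\n' "$out"
```

       Testing for > 0 rather than for a nonzero value keeps the loop from
       spinning forever when getline returns -1.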
619
620       The function close(expr) closes the file or pipe associated with  expr.
621       Close  returns  0 if expr is an open file, the exit status if expr is a
622       piped command, and -1 otherwise.  Close is used to  reread  a  file  or
623       command,  make sure the other end of an output pipe is finished or con‐
624       serve file resources.
625
626       The function fflush(expr) flushes the output file  or  pipe  associated
627       with  expr.  Fflush returns 0 if expr is an open output stream else -1.
628       Fflush without an argument flushes stdout.  Fflush with an empty  argu‐
629       ment ("") flushes all open output.
630
631       The  function  system(expr)  uses  the C runtime system call to execute
632       expr and returns the corresponding wait status of the command  as  fol‐
633       lows:
634
635       ·   if  the  system call failed, setting the status to -1, mawk returns
636           that value.
637
638       ·   if the command exited normally, mawk returns its exit-status.
639
640       ·   if the command exited due to a signal such as SIGHUP, mawk  returns
641           the signal number plus 256.
642
643       Changes  made  to the ENVIRON array are not passed to commands executed
644       with system or pipes.
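
       The sketch below uses close() to wait for an output pipe and then
       reads the exit status returned by system().  The commands are chosen
       for illustration and the AWK fallback variable is an assumption.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    print "b\na" | "sort"
    close("sort")                 # wait for sort to finish writing
    rc = system("exit 3")         # the command exits normally with status 3
    print "rc " rc
}')
printf '%s\n' "$out"
```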
645
646   10. User defined functions
647       The syntax for a user defined function is
648
649            function name( args ) { statements }
650
651       The function body can contain a return statement
652
653            return opt_expr
654
655       A return statement is not required.  Function calls may  be  nested  or
656       recursive.   Functions  are  passed  expressions by value and arrays by
657       reference.  Extra arguments serve as local variables and  are  initial‐
658       ized  to  null.  For example, csplit(s,A) puts each character of s into
659       array A and returns the length of s.
660
661            function csplit(s, A,    n, i)
662            {
663              n = length(s)
664              for( i = 1 ; i <= n ; i++ ) A[i] = substr(s, i, 1)
665              return n
666            }
667
668       Putting extra space between passed arguments  and  local  variables  is
669       conventional.  Functions can be referenced before they are defined, but
670       the function name and the '(' of the arguments must touch to avoid con‐
671       fusion with concatenation.
672
673       A function parameter is normally a scalar value (number or string).  If
674       there is a forward reference to a function using an array as a  parame‐
675       ter,  the  function's  corresponding  parameter  will  be treated as an
676       array.
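
       A short sketch of recursion, pass by value and pass by reference.  The
       function names fact, squares and bump are hypothetical, and the AWK
       fallback variable is an assumption.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" '
function fact(n) { return n <= 1 ? 1 : n * fact(n - 1) }

# A is passed by reference; i is a local variable.
function squares(A, n,    i) {
    for (i = 1; i <= n; i++) A[i] = i * i
}

# A scalar is passed by value, so the caller does not see the change.
function bump(v) { v++ }

BEGIN {
    print fact(5)
    squares(sq, 3)
    print sq[1], sq[2], sq[3]
    x = 1; bump(x); print x
}')
printf '%s\n' "$out"
```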
677
678   11. Splitting strings, records and files
679       Awk programs use the same algorithm to split strings into  arrays  with
680       split(), and records into fields on FS.  mawk uses essentially the same
681       algorithm to split files into records on RS.
682
683       Split(expr,A,sep) works as follows:
684
685          (1)  If sep is omitted, it is replaced by FS.  Sep can be an expres‐
686               sion  or  regular  expression.   If it is an expression of non-
687               string type, it is converted to string.
688
689          (2)  If sep = " " (a single space), then <SPACE> is trimmed from the
690               front  and back of expr, and sep becomes <SPACE>.  mawk defines
691               <SPACE> as the regular expression /[ \t\n]+/.  Otherwise sep is
692               treated  as  a  regular expression, except that meta-characters
693               are ignored for a string of length 1, e.g.,  split(x,  A,  "*")
694               and split(x, A, /\*/) are the same.
695
696          (3)  If  expr  is not string, it is converted to string.  If expr is
697               then the empty string "", split() returns 0 and A is set empty.
698               Otherwise, all non-overlapping, non-null and longest matches of
699               sep in expr, separate expr into fields which are loaded into A.
700               The  fields  are  placed  in  A[1], A[2], ..., A[n] and split()
701               returns n, the number of fields which is the number of  matches
702               plus  one.  Data placed in A that looks numeric is typed number
703               and string.
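
       The three rules can be tried out with a few calls.  This is only a
       sketch: the input strings and the array names A, B, C and D are
       invented for the example, and the shell variable AWK is an assumption
       that falls back to awk when mawk is not installed.

```shell
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    # sep = " ": <SPACE> is trimmed from both ends, and runs of blanks,
    # tabs and newlines separate the fields.
    n1 = split("  a  b\tc\n", A, " ")
    # A one-character string is taken literally, so "*" is not an operator.
    n2 = split("x*y*z", B, "*")
    # A longer string is a regular expression: two matches, three fields,
    # the last of them empty.
    n3 = split("a::b:", C, ":+")
    # The empty string yields no fields at all.
    n4 = split("", D, ":")
    print n1, A[1], A[3]
    print n2, B[2]
    print n3, "[" C[3] "]"
    print n4
}')
printf '%s\n' "$out"
```

       The four lines printed should be "3 a c", "3 y", "3 []" and "0".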
704
705       Splitting records into fields works the  same  except  the  pieces  are
706       loaded into $1, $2,..., $NF.  If $0 is empty, NF is set to 0 and all $i
707       to "".
708
709       mawk splits files into records by the  same  algorithm,  but  with  the
710       slight  difference  that RS is really a terminator instead of a separa‐
711       tor.  (ORS is really a terminator too).
712
713            E.g., if FS = “:+” and $0 = “a::b:” , then NF = 3 and $1 = “a”, $2
714            = “b” and $3 = "", but if “a::b:” is the contents of an input file
715            and RS = “:+”, then there are two records “a” and “b”.
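
       The separator versus terminator distinction can be checked from the
       shell.  A minimal sketch, assuming the interpreter found through the
       (hypothetical) AWK variable accepts a regular expression for RS, as
       mawk does:

```shell
AWK=$(command -v mawk || command -v awk)

# As a field separator, the trailing ":" leaves an empty third field.
fields=$(printf 'a::b:\n' | "$AWK" -F ':+' '{ print NF }')

# As a record terminator, the trailing ":" only ends the second record.
records=$(printf 'a::b:' | "$AWK" 'BEGIN { RS = ":+" } END { print NR }')

echo "fields=$fields records=$records"
```

       The expected result is "fields=3 records=2".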
716
717       RS = " " is not special.
718
719       If FS = "", then mawk breaks the  record  into  individual  characters,
720       and,  similarly,  split(s,A,"")  places  the individual characters of s
721       into A.
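
       A short sketch of the empty separator, for both field splitting and
       split().  The input "abc" and the array name A are made up for the
       example.  Other awks may treat FS = "" differently, so the result is
       only claimed for interpreters that follow the rule described here.

```shell
AWK=$(command -v mawk || command -v awk)

# NF and $2 come from splitting the record; n and A[3] come from split().
out=$(echo 'abc' | "$AWK" 'BEGIN { FS = "" }
{ n = split($0, A, ""); print NF, $2, n, A[3] }')

echo "$out"
```

       With character splitting in effect this prints "3 b 3 c".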
722
723   12. Multi-line records
724       Since mawk interprets RS as a regular  expression,  multi-line  records
725       are easy.  Setting RS = "\n\n+", makes one or more blank lines separate
726       records.  If FS = " " (the default), then single newlines, by the rules
727       for  <SPACE>  above, become space and single newlines are field separa‐
728       tors.
729
730            For example, if
731
732            ·   a file is "a b\nc\n\n",
733
734            ·   RS = "\n\n+" and
735
736            ·   FS = " ",
737
738            then there is one record “a b\nc” with three fields “a”,  “b”  and
739            “c”:
740
741            ·   Changing FS = “\n”, gives two fields “a b” and “c”;
742
            ·   changing FS = “” gives five fields, one per character of the
                record, because an empty FS splits the record into individual
                characters (see section 11).
744
745       If  you want lines with spaces or tabs to be considered blank, set RS =
746       “\n([ \t]*\n)+”.  For compatibility with other awks, setting  RS  =  ""
747       has  the  same effect as if blank lines are stripped from the front and
748       back of files and then records are  determined  as  if  RS  =  “\n\n+”.
       POSIX requires that “\n” always separates fields when RS = "",
       regardless of the value of FS.  mawk does not support this convention,
       because defining “\n” as <SPACE> makes it unnecessary.
752
753       Most  of  the  time when you change RS for multi-line records, you will
754       also want to change ORS to “\n\n” so the record spacing is preserved on
755       output.
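
       The worked example from this section can be reproduced with a pipe.
       The sketch below feeds the file contents "a b\nc\n\n" to the
       interpreter twice, once with the default FS and once with FS = "\n".
       The variable names default_fs and newline_fs are invented for the
       example.

```shell
AWK=$(command -v mawk || command -v awk)

# One or more blank lines end a record.  With the default FS the embedded
# newline counts as <SPACE>, so the record "a b\nc" has three fields.
default_fs=$(printf 'a b\nc\n\n' |
    "$AWK" 'BEGIN { RS = "\n\n+" } { print NF }')

# With FS = "\n" each line of the record becomes one field.
newline_fs=$(printf 'a b\nc\n\n' |
    "$AWK" 'BEGIN { RS = "\n\n+"; FS = "\n" } { print NF ": " $1 }')

echo "$default_fs"
echo "$newline_fs"
```

       The first command should report 3 fields, and the second "2: a b".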
756
757   13. Program execution
       This section describes the order of program execution.  First, ARGC is
       set to the total number of command line arguments passed to the
       execution phase of the program.  ARGV[0] is set to the name of the AWK
       interpreter, and ARGV[1] ... ARGV[ARGC-1] hold the remaining command
       line arguments, exclusive of options and program source.  For example,
       with

            mawk -f prog v=1 A t=hello B

       ARGC = 5 with ARGV[0] = "mawk", ARGV[1] = "v=1", ARGV[2] = "A", ARGV[3]
       = "t=hello" and ARGV[4] = "B".
768
769       Next, each BEGIN block is executed in order.  If the  program  consists
770       entirely  of  BEGIN  blocks,  then  execution terminates, else an input
771       stream is opened and execution continues.  If ARGC equals 1, the  input
772       stream  is  set  to stdin, else  the command line arguments ARGV[1] ...
773       ARGV[ARGC-1] are examined for a file argument.
774
775       The command line arguments divide  into  three  sets:  file  arguments,
776       assignment  arguments and empty strings "".  An assignment has the form
777       var=string.  When an ARGV[i] is examined as a possible  file  argument,
778       if  it  is  empty  it  is skipped; if it is an assignment argument, the
779       assignment to var takes place and i skips to the  next  argument;  else
780       ARGV[i] is opened for input.  If it fails to open, execution terminates
781       with exit code 2.  If no command line argument is a file argument, then
782       input comes from stdin.  Getline in a BEGIN action opens input.  “-” as
783       a file argument denotes stdin.
784
785       Once an input stream is open, each input record is tested against  each
786       pattern,  and  if  it  matches,  the associated action is executed.  An
787       expression pattern matches if it is boolean true (see the end  of  sec‐
788       tion  2).   A BEGIN pattern matches before any input has been read, and
       an END pattern matches after all input has been read.  A range pattern,
       expr1,expr2, matches every record from the match of expr1 through the
       match of expr2, inclusive.
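
       A small sketch of range patterns, using the numbers 1 to 5 as made-up
       input.  The second command shows the case where expr2 already matches
       on the record that matched expr1, so the range covers that one record
       only.

```shell
AWK=$(command -v mawk || command -v awk)

# Records from the first "2" through the next "4", both included.
range=$(printf '1\n2\n3\n4\n5\n' | "$AWK" '$1 == 2, $1 == 4')

# Start and end conditions are both true on record 3.
single=$(printf '1\n2\n3\n4\n5\n' | "$AWK" '$1 == 3, $1 >= 1')

printf '%s\n' "$range"
printf '%s\n' "$single"
```

       The first command prints the lines 2, 3 and 4; the second prints only
       3.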
792
793       When end of file occurs on the input stream, the remaining command line
794       arguments  are  examined for a file argument, and if there is one it is
795       opened, else the END pattern is considered matched and all END  actions
796       are executed.
797
798       In  the example, the assignment v=1 takes place after the BEGIN actions
799       are executed, and the data placed in v  is  typed  number  and  string.
800       Input  is  then  read  from  file A.  On end of file A, t is set to the
801       string "hello", and B is opened for input.  On end of file B,  the  END
802       actions are executed.
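
       The sequence can be observed with two throwaway files.  The sketch
       below is an illustration only: it creates files named A and B in a
       temporary directory (the contents "x" and "y" are invented) and passes
       the program on the command line rather than with -f, which leaves ARGC
       and ARGV the same.

```shell
AWK=$(command -v mawk || command -v awk)
dir=$(mktemp -d)
printf 'x\n' > "$dir/A"
printf 'y\n' > "$dir/B"

out=$(cd "$dir" && "$AWK" '
BEGIN { print "BEGIN: ARGC=" ARGC " v=[" v "]" }   # v=1 not applied yet
      { print FILENAME ": " $0 " v=[" v "] t=[" t "]" }
END   { print "END: t=[" t "]" }
' v=1 A t=hello B)

rm -rf "$dir"
printf '%s\n' "$out"
```

       The output should show v empty in BEGIN, v set while A is read, and t
       set only once B is opened:

            BEGIN: ARGC=5 v=[]
            A: x v=[1] t=[]
            B: y v=[1] t=[hello]
            END: t=[hello]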
803
804       Program flow at the pattern {action} level can be changed with the
805
806            next
807            nextfile
808            exit  opt_expr
809
810       statements:
811
812       ·   A  next  statement causes the next input record to be read and pat‐
813           tern testing to restart with the first pattern {action} pair in the
814           program.
815
816       ·   A  nextfile  statement  tells  mawk  to stop processing the current
817           input file.  It then updates FILENAME to the next  file  listed  on
818           the command line, and resets FNR to 1.
819
820       ·   An  exit statement causes immediate execution of the END actions or
821           program termination if there are none or if the exit occurs  in  an
822           END action.  The opt_expr sets the exit value of the program unless
823           overridden by a later exit or subsequent error.
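
       A sketch of next and exit working together, on made-up input.
       nextfile is left out because it is an extension that not every awk
       provides.  The variable names out and status belong to the example
       only.

```shell
AWK=$(command -v mawk || command -v awk)

status=0
out=$(printf '# comment\n1\n2\nstop\n40\n' | "$AWK" '
/^#/   { next }        # read the next record; later patterns are not tried
/stop/ { exit 3 }      # run the END actions, then terminate with value 3
       { sum += $1 }
END    { print "sum=" sum }
') || status=$?

echo "$out status=$status"
```

       The comment line is skipped, 40 is never read, and the END action
       still runs, so this prints "sum=3 status=3".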
824

EXAMPLES

       1. emulate cat.

            { print }

       2. emulate wc.

            { chars += length($0) + 1  # add one for the \n
              words += NF
            }

            END{ print NR, words, chars }

       3. count the number of unique “real words”.

            BEGIN { FS = "[^A-Za-z]+" }

            { for(i = 1 ; i <= NF ; i++)  word[$i] = "" }

            END { delete word[""]
                  for ( i in word )  cnt++
                  print cnt
            }

       4. sum the second field of every record based on the first field.

            $1 ~ /credit|gain/ { sum += $2 }
            $1 ~ /debit|loss/  { sum -= $2 }

            END { print sum }

       5. sort a file, comparing as string

            { line[NR] = $0 "" }  # make sure of comparison type
                            # in case some lines look numeric

            END {  isort(line, NR)
              for(i = 1 ; i <= NR ; i++) print line[i]
            }

            #insertion sort of A[1..n]
            function isort( A, n,    i, j, hold)
            {
              for( i = 2 ; i <= n ; i++)
              {
                hold = A[j = i]
                while ( A[j-1] > hold )
                { j-- ; A[j+1] = A[j] }
                A[j] = hold
              }
              # sentinel A[0] = "" will be created if needed
            }
877
878

COMPATIBILITY ISSUES

880   MAWK 1.3.3 versus POSIX 1003.2 Draft 11.3
       The POSIX 1003.2 (draft 11.3) definition of the AWK language is AWK as
       described in the AWK book with a few extensions that appeared in
       System V Release 4 nawk.  The extensions are:
884
885          ·   New functions: toupper() and tolower().
886
887          ·   New variables: ENVIRON[] and CONVFMT.
888
889          ·   ANSI C conversion specifications for printf() and sprintf().
890
891          ·   New command options:  -v  var=value,  multiple  -f  options  and
892              implementation options as arguments to -W.
893
894          ·   For  systems  (MS-DOS  or Windows) which provide a setmode func‐
895              tion, an environment variable MAWKBINMODE and a  built-in  vari‐
896              able  BINMODE.   The  bits of the BINMODE value tell mawk how to
897              modify the RS and ORS variables:
898
899              0  set standard input to binary mode, and if BIT-2 is unset, set
900                 RS to "\r\n" (CR/LF) rather than "\n" (LF).
901
902              1  set  standard  output  to binary mode, and if BIT-2 is unset,
903                 set ORS to "\r\n" (CR/LF) rather than "\n" (LF).
904
905              2  suppress the assignment to RS and ORS  of  CR/LF,  making  it
906                 possible  to  run scripts and generate output compatible with
907                 Unix line-endings.
908
909       POSIX AWK is oriented to operate on files a line at a time.  RS can  be
910       changed  from  "\n" to another single character, but it is hard to find
911       any use for this — there are no examples in the AWK book.   By  conven‐
912       tion, RS = "", makes one or more blank lines separate records, allowing
913       multi-line records.  When RS = "", "\n" is  always  a  field  separator
914       regardless of the value in FS.
915
916       mawk,  on  the  other hand, allows RS to be a regular expression.  When
917       "\n" appears in records, it is treated as space, and FS  always  deter‐
918       mines fields.
919
920       Removing the line at a time paradigm can make some programs simpler and
921       can often improve performance.  For example,  redoing  example  3  from
922       above,
923
            BEGIN { RS = "[^A-Za-z]+" }

            { word[ $0 ] = "" }

            END { delete  word[ "" ]
              for( i in word )  cnt++
              print cnt
            }
932
933       counts  the  number  of  unique words by making each word a record.  On
934       moderate size files, mawk executes twice as fast, because of  the  sim‐
935       plified inner loop.
936
937       The  following  program  replaces each comment by a single space in a C
938       program file,
939
            BEGIN {
              RS = "/\*([^*]|\*+[^/*])*\*+/"
                 # comment is record separator
              ORS = " "
              getline  hold
            }

            { print hold ; hold = $0 }

            END { printf "%s" , hold }
950
951       Buffering one record is needed to avoid  terminating  the  last  record
952       with a space.
953
954       With mawk, the following are all equivalent,
955
956            x ~ /a\+b/    x ~ "a\+b"     x ~ "a\\+b"
957
958       The  strings  get  scanned  twice,  once  as string and once as regular
959       expression.  On the string scan, mawk ignores the escape on  non-escape
960       characters  while  the  AWK  book advocates \c be recognized as c which
961       necessitates the double escaping of meta-characters in strings.   POSIX
962       explicitly  declines to define the behavior which passively forces pro‐
963       grams that must run under a variety of awks to use  the  more  portable
964       but less readable, double escape.
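
       The portable double-escape form can be compared with a regular
       expression literal as follows.  This sketch avoids the single-escape
       string "a\+b" on purpose, since its meaning varies between awks; the
       variable names m1 to m5 are invented for the example.

```shell
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    # Both forms mean: a, then a literal plus sign, then b.
    m1 = ("a+b" ~ /a\+b/)
    m2 = ("a+b" ~ "a\\+b")
    # Neither treats + as a repetition operator, so "aab" is rejected.
    m3 = ("aab" ~ /a\+b/)
    m4 = ("aab" ~ "a\\+b")
    # Unescaped, + repeats the preceding a.
    m5 = ("aab" ~ "a+b")
    print m1, m2, m3, m4, m5
}')

echo "$out"
```

       This should print "1 1 0 0 1".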
965
966       POSIX AWK does not recognize "/dev/std{in,out,err}".  Some systems pro‐
967       vide an actual device for this, allowing AWKs which  do  not  implement
968       the feature directly to support it.
969
       POSIX AWK does not recognize \x hex escape sequences in strings.
       Unlike ANSI C, mawk limits the number of digits that follow \x to two,
       as the current implementation only supports 8 bit characters.

       The built-in fflush first appeared in a recent (1993) AT&T awk released
       to netlib, and is not part of the POSIX standard.  Aggregate deletion
       with delete array is not part of the POSIX standard.
976
977       POSIX explicitly leaves the behavior of FS = "" undefined, and mentions
978       splitting  the record into characters as a possible interpretation, but
979       currently this use is not portable across implementations.
980
981   Random numbers
982       POSIX does not prescribe a method for initializing  random  numbers  at
983       startup.
984
985       In practice, most implementations do nothing special, which makes srand
986       and rand follow the C runtime library, making the initial seed value 1.
987       Some  implementations  (Solaris XPG4 and Tru64) return 0 from the first
988       call to srand, although the results from rand behave as if the  initial
989       seed is 1.  Other implementations return 1.
990
991       While  mawk  can  call srand at startup with no parameter (initializing
992       random numbers from the clock), this feature may  be  suppressed  using
993       conditional compilation.
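
       Because the startup seed differs between implementations, a program
       that needs a repeatable sequence can seed explicitly.  A minimal
       sketch, in which the seed value 42 is arbitrary:

```shell
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    srand(42); a = rand()
    srand(42); b = rand()       # same seed, so the sequence starts over
    same = (a == b) ? "repeatable" : "different"
    inrange = (a >= 0 && a < 1) ? "in range" : "out of range"
    print same ", " inrange
}')

echo "$out"
```

       This should print "repeatable, in range" whatever the implementation
       does at startup.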
994
   Extensions added for compatibility with GAWK and BWK
       Nextfile is a gawk extension (also implemented by BWK awk).  It is not
       yet part of the POSIX standard (as of October 2012), although it has
       been accepted for the next revision of the standard.
999
1000       Mktime, strftime and systime are gawk extensions.
1001
       The "/dev/stdin" feature was added to mawk after 1.3.3, for
       compatibility with gawk and BWK awk.  The corresponding "-" (alias for
       /dev/stdin) was present in mawk 1.3.3.
1005
1006   Subtle Differences not in POSIX or the AWK Book
1007       Finally,  here  is  how mawk handles exceptional cases not discussed in
1008       the AWK book or the POSIX draft.  It is unsafe  to  assume  consistency
1009       across awks and safe to skip to the next section.
1010
1011          ·   substr(s,  i, n) returns the characters of s in the intersection
1012              of the closed interval [1, length(s)] and the half-open interval
1013              [i,  i+n).  When this intersection is empty, the empty string is
1014              returned; so substr("ABC", 1, 0) = "" and substr("ABC", -4, 6) =
1015              "A".
1016
1017          ·   Every  string,  including  the  empty  string, matches the empty
1018              string at the front so, s ~ // and s ~ "", are always  1  as  is
1019              match(s, //) and match(s, "").  The last two set RLENGTH to 0.
1020
          ·   index(s, t) is always the same as match(s, t1) where t1 is the
              same as t with metacharacters escaped.  Hence consistency with
              match requires that index(s, "") always returns 1.  Also, the
              condition index(s,t) != 0 if and only if t is a substring of s
              requires index("","") = 1.
1026
1027          ·   If  getline  encounters  end  of  file,  getline var, leaves var
1028              unchanged.  Similarly, on entry to  the  END  actions,  $0,  the
1029              fields and NF have their value unaltered from the last record.
1030

ENVIRONMENT VARIABLES

1032       Mawk recognizes these variables:
1033
1034          MAWKBINMODE
1035             (see COMPATIBILITY ISSUES)
1036
1037          MAWK_LONG_OPTIONS
1038             If  this  is  set,  mawk uses its value to decide what to do with
1039             GNU-style long options:
1040
1041               allow  Mawk allows the option to be checked against the (small)
1042                      set of long options it recognizes.
1043
1044               error  Mawk  prints  an  error  message and exits.  This is the
1045                      default.
1046
1047               ignore Mawk ignores the option.
1048
               warn   Mawk prints a warning message and otherwise ignores the
                      option.
1051
1052             If the variable is unset, mawk prints an error message and exits.
1053
1054          WHINY_USERS
1055             This  is  an  undocumented  gawk  feature.  It tells mawk to sort
1056             array indices before it starts to iterate over the elements of an
1057             array.
1058

SEE ALSO

1060       grep(1)
1061
1062       Aho,  Kernighan  and Weinberger, The AWK Programming Language, Addison-
1063       Wesley Publishing, 1988, (the AWK book), defines the language,  opening
1064       with  a  tutorial and advancing to many interesting programs that delve
1065       into issues of software design and analysis relevant to programming  in
1066       any language.
1067
1068       The  GAWK Manual, The Free Software Foundation, 1991, is a tutorial and
1069       language reference that does not attempt the depth of the AWK book  and
1070       assumes  the  reader  may  be  a novice programmer.  The section on AWK
1071       arrays is excellent.  It also discusses POSIX requirements for AWK.
1072

BUGS

1074       mawk implements printf() and sprintf() using the C  library  functions,
1075       printf  and  sprintf,  so  full  ANSI  compatibility requires an ANSI C
1076       library.  In practice this means the h conversion qualifier may not  be
1077       available.   Also  mawk inherits any bugs or limitations of the library
1078       functions.
1079
1080       Implementors of the AWK language have shown a consistent lack of imagi‐
1081       nation when naming their programs.
1082

AUTHOR

1084       Mike Brennan (brennan@whidbey.com).
1085       Thomas E. Dickey <dickey@invisible-island.net>.
1086
1087
1088
1089Version 1.3.4                     2019-12-31                           MAWK(1)