MAWK(1)                          USER COMMANDS                         MAWK(1)

NAME
       mawk - pattern scanning and text processing language

SYNOPSIS
       mawk  [-W  option]  [-F value] [-v var=value] [--] 'program text' [file
       ...]
       mawk [-W option] [-F value] [-v var=value] [-f program-file] [--] [file
       ...]

DESCRIPTION
       mawk  is an interpreter for the AWK Programming Language.  The AWK lan‐
       guage is useful for manipulation of data files, text retrieval and pro‐
       cessing,  and  for prototyping and experimenting with algorithms.  mawk
       is a new awk, meaning it implements the AWK language as defined in Aho,
       Kernighan  and Weinberger, The AWK Programming Language, Addison-Wesley
       Publishing, 1988 (hereafter referred to as the AWK  book).   mawk  con‐
       forms  to  the POSIX 1003.2 (draft 11.3) definition of the AWK language
       which contains a few features not described in the AWK book,  and  mawk
       provides a small number of extensions.

       An  AWK  program  is  a sequence of pattern {action} pairs and function
       definitions.  Short programs are entered on the  command  line  usually
       enclosed  in ' ' to avoid shell interpretation.  Longer programs can be
       read in from a file with the -f option.  Data  input is read  from  the
       list  of files on the command line or from standard input when the list
       is empty.  The input is broken into records as determined by the record
       separator  variable,  RS.  Initially, RS = "\n" and records are synony‐
       mous with lines.  Each record is compared against each pattern  and  if
       it matches, the program text for {action} is executed.
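
       As a minimal sketch of this model (the input lines and the threshold
       are made up for illustration), the one-line program below consists of
       a single pattern {action} pair.  Each input line is one record, and
       the action runs only for records whose second field exceeds 2:

```sh
# Records are lines (RS = "\n"); the pattern $2 > 2 selects records,
# and the action prints the first field of each selected record.
printf 'alpha 1\nbeta 22\ngamma 3\n' | mawk '$2 > 2 { print $1 }'
# prints:
# beta
# gamma
```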

OPTIONS
       -F value       sets the field separator, FS, to value.

       -f file        Program  text is read from file instead of from the com‐
                      mand line.  Multiple -f options are allowed.

       -v var=value   assigns value to program variable var.

       --             indicates the unambiguous end of options.

       The above options will be available with any POSIX compatible implemen‐
       tation  of  AWK.  Implementation specific options are prefaced with -W.
       mawk provides these:

       -W dump        writes an assembler like listing of the internal  repre‐
                      sentation  of the program to stdout and exits 0 (on suc‐
                      cessful compilation).

       -W exec file   Program text is read from file  and  this  is  the  last
                      option.

                      This  is a useful alternative to -f on systems that sup‐
                      port the #!  "magic number"  convention  for  executable
                      scripts.   Those  implicitly  pass  the  pathname of the
                      script itself as the final parameter, and expect no more
                      than  one  "-"  option on the #! line.  Because mawk can
                      combine multiple -W options separated by commas, you can
                      use this option when an additional -W option is needed.

       -W help        prints  a  usage  message  to  stderr and exits (same as
                      “-W usage”).

       -W interactive sets unbuffered writes to stdout and line buffered reads
                      from  stdin.  Records from stdin are lines regardless of
                      the value of RS.

       -W posix_space forces mawk not to consider '\n' to be space.

       -W random=num  calls srand with the given parameter (and overrides  the
                      auto-seeding behavior).

       -W sprintf=num adjusts  the  size  of mawk's internal sprintf buffer to
                      num bytes.  More than rare use of this option  indicates
                      mawk should be recompiled.

       -W usage       prints  a  usage  message  to  stderr and exits (same as
                      “-W help”).

       -W version     mawk writes its version and copyright to stdout and com‐
                      piled limits to stderr and exits 0.

       mawk  accepts  abbreviations for any of these options, e.g., “-W v” and
       “-Wv” both tell mawk to show its version.

       mawk allows multiple -W  options  to  be  combined  by  separating  the
       options  with  commas,  e.g.,  -Wsprint=2000,posix.  This is useful for
       executable #!  "magic number" invocations in which only one argument is
       supported, e.g., -Winteractive,exec.
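
       A short sketch of the POSIX options -F and -v used together; the input
       data and the variable name col are hypothetical and chosen only for
       illustration:

```sh
# -F: sets FS to ":" and -v assigns col=3 before the program starts,
# so $col refers to the third colon-separated field.
printf 'root:x:0\nbin:x:1\n' | mawk -F: -v col=3 '{ print $1, $col }'
# prints:
# root 0
# bin 1
```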

THE AWK LANGUAGE
   1. Program structure
       An  AWK  program is a sequence of pattern {action} pairs and user func‐
       tion definitions.

       A pattern can be:
              BEGIN
              END
              expression
              expression , expression

       One, but not both, of pattern {action} can be omitted.  If {action}  is
       omitted  it is implicitly { print }.  If pattern is omitted, then it is
       implicitly matched.  BEGIN and END patterns require an action.
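
       The two kinds of omission can be sketched as follows.  The inputs are
       invented for the example, and the first program relies on the usual
       AWK meaning of a range pattern (records from one matching the first
       expression through one matching the second):

```sh
# Action omitted: it defaults to { print }, so every record in the
# range from $1 == 2 through $1 == 4 is printed.
printf '1\n2\n3\n4\n5\n' | mawk '$1 == 2, $1 == 4'
# prints 2, 3 and 4, one per line

# Pattern omitted: the action runs for every record.
printf 'a\nb\n' | mawk '{ n++ } END { print n }'
# prints 2
```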

       Statements are terminated by newlines, semi-colons or both.  Groups  of
       statements such as actions or loop bodies are blocked via { ... } as in
       C.  The last statement in a block doesn't  need  a  terminator.   Blank
       lines  have  no  meaning; an empty statement is terminated with a semi-
       colon.  Long statements can be continued with a backslash, \.  A state‐
       ment  can  be broken without a backslash after a comma, left brace, &&,
       ||, do, else, the right parenthesis of an if, while or  for  statement,
       and  the  right parenthesis of a function definition.  A comment starts
       with # and extends to, but does not include, the end of the line.

       The following statements control program flow inside blocks.

              if ( expr ) statement

              if ( expr ) statement else statement

              while ( expr ) statement

              do statement while ( expr )

              for ( opt_expr ; opt_expr ; opt_expr ) statement

              for ( var in array ) statement

              continue

              break
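
       As an illustrative sketch (the variable names i and total are made
       up), the program below combines several of these statements:

```sh
mawk 'BEGIN {
    for (i = 1; i <= 5; i++) {
        if (i == 2) continue    # skip 2
        if (i == 5) break       # leave the loop before adding 5
        total += i              # adds 1, 3 and 4, giving 8
    }
    do { total-- } while (total > 6)    # runs twice: 8 -> 7 -> 6
    print total
}'
# prints 6
```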

   2. Data types, conversion and comparison
       There are two basic data types, numeric and string.  Numeric  constants
       can  be  integer  like -2, decimal like 1.08, or in scientific notation
       like -1.1e4 or .28E-3.  All numbers are represented internally and  all
       computations  are  done  in floating point arithmetic.  So for example,
       the expression 0.2e2 == 20 is true and true is represented as 1.0.

       String constants are enclosed in double quotes.

                   "This is a string with a newline at the end.\n"

       Strings can be continued across a line by  escaping  (\)  the  newline.
       The following escape sequences are recognized.

            \\        \
            \"        "
            \a        alert, ascii 7
            \b        backspace, ascii 8
            \t        tab, ascii 9
            \n        newline, ascii 10
            \v        vertical tab, ascii 11
            \f        formfeed, ascii 12
            \r        carriage return, ascii 13
            \ddd      1, 2 or 3 octal digits for ascii ddd
            \xhh      1 or 2 hex digits for ascii  hh

       If  you  escape  any other character \c, you get \c, i.e., mawk ignores
       the escape.

       There are really three basic data types; the third is number and string
       which  has  both  a  numeric value and a string value at the same time.
       User defined variables come into existence when  first  referenced  and
       are  initialized  to  null, a number and string value which has numeric
       value 0 and string value "".  Non-trivial number and string typed  data
       come from input and are typically stored in fields.  (See section 4).

       The  type  of  an expression is determined by its context and automatic
       type conversion occurs if needed.  For example, consider the statements

            y = x + 2  ;  z = x  "hello"

       The  value  stored  in  variable  y will be typed numeric.  If x is not
       numeric, the value read from x is converted to  numeric  before  it  is
       added  to  2  and  stored in y.  The value stored in variable z will be
       typed string, and the value of x will be converted to string if  neces‐
       sary  and  concatenated  with  "hello".  (Of course, the value and type
       stored in x is not changed by any conversions.)  A string expression is
       converted  to numeric using its longest numeric prefix as with atof(3).
       A numeric expression is converted to  string  by  replacing  expr  with
       sprintf(CONVFMT,  expr),  unless  expr  can  be represented on the host
       machine as an exact integer, in which case it is converted with
       sprintf("%d", expr).  Sprintf() is an AWK built-in that duplicates the
       functionality of sprintf(3), and CONVFMT is a built-in variable used
       for internal conversion from number to string and initialized to
       "%.6g".  Explicit type conversions can be forced: expr "" is string
       and expr+0 is numeric.
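
       The conversion rules above can be sketched as follows; the variable
       names and sample values are hypothetical:

```sh
mawk 'BEGIN {
    x = "12abc"
    n = x + 0           # string to number: longest numeric prefix, 12
    w = 3.0
    s = w ""            # exact integer, converted with "%d": "3"
    y = 0.1 + 0.2
    t = y ""            # converted with CONVFMT = "%.6g": "0.3"
    CONVFMT = "%.2f"
    u = y ""            # converted with the new CONVFMT: "0.30"
    print n, s, t, u
}'
# prints: 12 3 0.3 0.30
```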

       To evaluate, expr1 rel-op expr2, if both operands are numeric or number
       and string then the comparison is numeric; if both operands are  string
       the  comparison is string; if one operand is string, the non-string op‐
       erand is converted  and  the  comparison  is  string.   The  result  is
       numeric, 1 or 0.

       In boolean contexts such as, if ( expr ) statement, a string expression
       evaluates true if and only if it is not the empty string ""; a numeric
       value evaluates true if and only if it is not numerically zero.

   3. Regular expressions
       In  the  AWK language, records, fields and strings are often tested for
       matching a regular expression.  Regular  expressions  are  enclosed  in
       slashes, and

            expr ~ /r/

       is  an  AWK  expression  that evaluates to 1 if expr "matches" r, which
       means a substring of expr is in the set of strings defined by r.   With
       no  match  the  expression  evaluates  to  0; replacing ~ with the "not
       match" operator, !~ , reverses the meaning.  As  pattern-action pairs,

            /r/ { action }   and   $0 ~ /r/ { action }

       are the same, and for each input record that matches r, action is  exe‐
       cuted.   In  fact, /r/ is an AWK expression that is equivalent to ($0 ~
       /r/) anywhere except when on the right side  of  a  match  operator  or
       passed  as  an  argument  to a built-in function that expects a regular
       expression argument.

       AWK uses extended regular expressions as with  egrep(1).   The  regular
       expression  metacharacters, i.e., those with special meaning in regular
       expressions are

             ^ $ . [ ] | ( ) * + ?

       Regular expressions are built up from characters as follows:

              c            matches any non-metacharacter c.

              \c           matches a character  defined  by  the  same  escape
                           sequences  used  in string constants or the literal
                           character c if \c is not an escape sequence.

              .            matches any character (including newline).

              ^            matches the front of a string.

              $            matches the back of a string.

              [c1c2c3...]  matches any character in the class c1c2c3... .   An
                           interval  of  characters  is denoted c1-c2 inside a
                           class [...].

              [^c1c2c3...] matches any character not in the class c1c2c3...

       Regular expressions are built up from other regular expressions as fol‐
       lows:

              r1r2         matches  r1  followed immediately by r2 (concatena‐
                           tion).

              r1 | r2      matches r1 or r2 (alternation).

              r*           matches r repeated zero or more times.

              r+           matches r repeated one or more times.

              r?           matches r zero or once.

              (r)          matches r, providing grouping.

       The increasing precedence of operators  is  alternation,  concatenation
       and unary (*, + or ?).

       For example,

            /^[_a-zA-Z][_a-zA-Z0-9]*$/  and
            /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/

       are  matched by AWK identifiers and AWK numeric constants respectively.
       Note that “.” has to be escaped to be recognized as  a  decimal  point,
       and that metacharacters are not special inside character classes.

       Any expression can be used on the right hand side of the ~ or !~ opera‐
       tors or passed to a built-in that expects  a  regular  expression.   If
       needed,  it  is  converted to string, and then interpreted as a regular
       expression.  For example,

            BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }

            $0 ~ "^" identifier

       prints all lines that start with an AWK identifier.
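
       Run as a complete command, the program above might be used like this;
       the three input lines are invented for the example:

```sh
# The string "^" identifier is built at run time and then interpreted
# as a regular expression on the right hand side of ~.
printf 'x1 = 2\n9lives\n_tmp\n' |
mawk 'BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }
      $0 ~ "^" identifier'
# prints:
# x1 = 2
# _tmp
```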

       mawk recognizes the empty regular expression,  //,  which  matches  the
       empty  string and hence is matched by any string at the front, back and
       between every character.  For example,

            echo  abc | mawk '{ gsub(//, "X") ; print }'
            XaXbXcX

   4. Records and fields
       Records are read in one at a time, and stored in the field variable $0.
       The  record  is split into fields which are stored in $1, $2, ..., $NF.
       The built-in variable NF is set to the number of fields, and NR and FNR
       are incremented by 1.  Fields above $NF are set to "".

       Assignment to $0 causes the fields and NF to be recomputed.  Assignment
       to NF or to a field causes $0 to be reconstructed by concatenating  the
       $i's  separated  by OFS.  Assignment to a field with index greater than
       NF, increases NF and causes $0 to be reconstructed.
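
       A small sketch of these assignment rules, using an arbitrary input
       line and OFS set to "-" so the reconstruction is visible:

```sh
echo 'a b c' | mawk 'BEGIN { OFS = "-" }
{
    $2 = "X"; print       # $0 is rebuilt with OFS: a-X-c
    $5 = "e"; print NF    # assigning past NF raises NF to 5
    print                 # the new $4 is empty: a-X-c--e
}'
# prints:
# a-X-c
# 5
# a-X-c--e
```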

       Data input stored in fields is string, unless the entire field has
       numeric form, in which case the type is number and string.  For
       example,

            echo 24 24E |
            mawk '{ print($1>100, $1>"100", $2>100, $2>"100") }'
            0 1 1 1

       $0 and $2 are string and $1 is number and string.  The first comparison
       is numeric, the second is string, the third is string (100 is converted
       to "100"), and the last is string.

   5. Expressions and operators
       The expression syntax is similar to C.  Primary expressions are numeric
       constants, string constants, variables,  fields,  arrays  and  function
       calls.   The  identifier  for  a  variable,  array or function can be a
       sequence of letters, digits and underscores, that does not start with a
       digit.   Variables  are  not declared; they exist when first referenced
       and are initialized to null.

       New expressions are composed with the following operators in  order  of
       increasing precedence.

              assignment          =  +=  -=  *=  /=  %=  ^=
              conditional         ?  :
              logical or          ||
              logical and         &&
              array membership    in
              matching            ~  !~
              relational          <  >   <=  >=  ==  !=
              concatenation       (no explicit operator)
              add ops             +  -
              mul ops             *  /  %
              unary               +  -
              logical not         !
              exponentiation      ^
              inc and dec         ++ -- (both post and pre)
              field               $

       Assignment, conditional and exponentiation associate right to left; the
       other operators associate left to right.  Any expression can be  paren‐
       thesized.
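
       A few consequences of the precedence and associativity rules above,
       sketched with arbitrary constants:

```sh
mawk 'BEGIN {
    print 2 ^ 3 ^ 2          # right to left: 2 ^ (3 ^ 2) = 512
    print -2 ^ 2             # ^ binds tighter than unary minus: -4
    print 1 + 2 " " 3 * 4    # arithmetic before concatenation: 3 12
}'
# prints:
# 512
# -4
# 3 12
```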

   6. Arrays
       Awk  provides  one-dimensional arrays.  Array elements are expressed as
       array[expr].  Expr is internally converted  to  string  type,  so,  for
       example,  A[1]  and A["1"] are the same element and the actual index is
       "1".  Arrays indexed by strings are called  associative  arrays.   Ini‐
       tially  an  array  is  empty;  elements  exist when first accessed.  An
       expression, expr in array evaluates to 1 if array[expr] exists, else to
       0.

       There  is  a form of the for statement that loops over each index of an
       array.

            for ( var in array ) statement

       sets var to each index of array and executes statement.  The order that
       var traverses the indices of array is not defined.

       The  statement,  delete  array[expr],  causes array[expr] not to exist.
       mawk supports an extension, delete array, which deletes all elements of
       array.

       Multidimensional  arrays  are  synthesized with concatenation using the
       built-in  variable  SUBSEP.   array[expr1,expr2]   is   equivalent   to
       array[expr1 SUBSEP expr2].  Testing for a multidimensional element uses
       a parenthesized index, such as

            if ( (i, j) in A )  print A[i, j]
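
       The array operations of this section can be sketched together as
       follows; the array A and the result variable names are hypothetical:

```sh
mawk 'BEGIN {
    A[1] = "one"
    A[2, 3] = "pair"            # stored under the index 2 SUBSEP 3
    has_one  = ("1" in A)       # 1: A[1] and A["1"] are the same element
    has_pair = ((2, 3) in A)    # 1: parenthesized multidimensional index
    delete A[1]
    gone = (1 in A)             # 0: the element no longer exists
    n = 0
    for (k in A) n++            # only the (2, 3) element is left
    print has_one, has_pair, gone, n
}'
# prints: 1 1 0 1
```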

   7. Built-in variables
       The following variables are built-in  and  initialized  before  program
       execution.

              ARGC      number of command line arguments.

              ARGV      array of command line arguments, 0..ARGC-1.

              CONVFMT   format  for  internal conversion of numbers to string,
                        initially = "%.6g".

              ENVIRON   array indexed by environment variables.   An  environ‐
                        ment  string,  var=value  is  stored as ENVIRON[var] =
                        value.

              FILENAME  name of the current input file.

              FNR       current record number in FILENAME.

              FS        splits records into fields as a regular expression.

              NF        number of fields in the current record.

              NR        current record number in the total input stream.

              OFMT      format for printing numbers; initially = "%.6g".

              OFS       inserted between fields on output, initially = " ".

              ORS       terminates each record on output, initially = "\n".

              RLENGTH   length set by the last call to the built-in  function,
                        match().

              RS        input record separator, initially = "\n".

              RSTART    index set by the last call to match().

              SUBSEP    used  to  build multiple array subscripts, initially =
                        "\034".
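
       For instance, NR, NF and $NF can be inspected per record with a
       one-line program (the two input lines are invented):

```sh
# NR is the running record number, NF the field count of the current
# record, and $NF its last field.
printf 'a b\nc d e\n' | mawk '{ print NR, NF, $NF } END { print NR }'
# prints:
# 1 2 b
# 2 3 e
# 2
```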

   8. Built-in functions
       String functions

              gsub(r,s,t)  gsub(r,s)
                     Global substitution, every match of regular expression  r
                     in  variable  t  is  replaced by string s.  The number of
                     replacements is returned.  If t is omitted, $0  is  used.
                     An  &  in  the  replacement  string  s is replaced by the
                     matched substring of t.  \& and \\ put  literal & and  \,
                     respectively, in the replacement string.

              index(s,t)
                     If  t  is  a  substring  of  s, then the position where t
                     starts is returned, else 0 is returned.  The first  char‐
                     acter of s is in position 1.

              length(s)
                     Returns the length of string or array s.

              match(s,r)
                     Returns  the  index of the first longest match of regular
                     expression r in string s.  Returns 0 if no match.   As  a
                     side  effect, RSTART is set to the return value.  RLENGTH
                     is set to the length of the match or -1 if no match.   If
                     the  empty  string is matched, RLENGTH is set to 0, and 1
                     is returned if the match is at the front, and length(s)+1
                     is returned if the match is at the back.

              split(s,A,r)  split(s,A)
                     String s is split into fields by regular expression r and
                     the fields are loaded into array A.  The number of fields
                     is returned.  See section 11 below for more detail.  If r
                     is omitted, FS is used.

              sprintf(format,expr-list)
                     Returns a string constructed from expr-list according  to
                     format.  See the description of printf() below.

              sub(r,s,t)  sub(r,s)
                     Single  substitution,  same  as gsub() except at most one
                     substitution.

              substr(s,i,n)  substr(s,i)
                     Returns the substring of string s, starting at  index  i,
                     of  length n.  If n is omitted, the suffix of s, starting
                     at i is returned.

              tolower(s)
                     Returns a copy of s with all upper case  characters  con‐
                     verted to lower case.

              toupper(s)
                     Returns  a  copy of s with all lower case characters con‐
                     verted to upper case.
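
       A brief sketch exercising several of the string functions above on
       arbitrary sample strings:

```sh
mawk 'BEGIN {
    s = "hello world"
    n = gsub(/o/, "[&]", s)          # & stands for the matched text
    print n, s                       # 2 hell[o] w[o]rld
    print index("banana", "nan")     # 3
    pos = match("xxabcabc", /(abc)+/)
    print pos, RSTART, RLENGTH       # 3 3 6
    print substr("abcdef", 2, 3)     # bcd
    print toupper("mawk")            # MAWK
}'
```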

       Time functions

       These are available on systems which support the corresponding C mktime
       and strftime functions:

              mktime(specification)
                     converts  a  date  specification  to a timestamp with the
                     same units as  systime.   The  date  specification  is  a
                     string  containing  the components of the date as decimal
                     integers:

                     YYYY
                        the year, e.g., 2012

                     MM the month of the year starting at 1

                     DD the day of the month starting at 1

                     HH hour (0-23)

                     MM minute (0-59)

                     SS seconds (0-59)

                     DST
                        tells how to treat timezone  versus  daylight  savings
                        time:

                        positive
                           DST is in effect

                        zero (default)
                           DST is not in effect

                        negative
                           mktime()  should (use timezone information and sys‐
                           tem databases to) attempt  to determine whether DST
                           is in effect at the specified time.

              strftime([format [, timestamp [, utc ]]])
                     formats  the  given timestamp using the format (passed to
                     the C strftime function):

                     ·   If the format parameter is missing, "%c" is used.

                     ·   If the timestamp parameter is  missing,  the  current
                         value from systime is used.

                     ·   If  the  utc  parameter  is  present and nonzero, the
                         result is in UTC.  Otherwise local time is used.

              systime()
                     returns the current time of day as the number of  seconds
                     since  the  Epoch  (1970-01-01 00:00:00 UTC on POSIX sys‐
                     tems).

       Arithmetic functions

              atan2(y,x)     Arctan of y/x between -pi and pi.

              cos(x)         Cosine function, x in radians.

              exp(x)         Exponential function.

              int(x)         Returns x truncated towards zero.

              log(x)         Natural logarithm.

              rand()         Returns a random number between zero and one.

              sin(x)         Sine function, x in radians.

              sqrt(x)        Returns square root of x.

              srand(expr)  srand()
                     Seeds the random number generator,  using  the  clock  if
                     expr  is  omitted,  and returns the value of the previous
                     seed.  Srand(expr) is useful for repeating pseudo  random
                     sequences.

                     Note: mawk is normally configured to seed the random num‐
                     ber generator from the clock at startup, making it unnec‐
                     essary  to  call srand().  This feature can be suppressed
                     via conditional compile, or overridden using the -Wrandom
                     option.
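
       A short sketch of some arithmetic functions; the seed value 42 is
       arbitrary:

```sh
mawk 'BEGIN {
    print int(3.7), int(-3.7)    # truncation towards zero: 3 -3
    print atan2(0, -1)           # pi, printed with OFMT: 3.14159
    srand(42); a = rand()
    srand(42); b = rand()        # same seed, so the sequence repeats
    same = (a == b)
    print same                   # 1
}'
```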

   9. Input and output
       There are two output statements, print and printf.

              print  writes $0  ORS to standard output.

              print expr1, expr2, ..., exprn
                     writes expr1 OFS expr2 OFS ... exprn ORS to standard out‐
                     put.  Numeric expressions are converted  to  string  with
                     OFMT.

              printf format, expr-list
                     duplicates the printf C library function writing to stan‐
                     dard output.  The complete ANSI C  format  specifications
                     are  recognized  with conversions %c, %d, %e, %E, %f, %g,
                     %G, %i, %o, %s, %u, %x, %X and %%, and conversion  quali‐
                     fiers h and l.

       The  argument  list  to  print  or printf can optionally be enclosed in
       parentheses.  Print formats numbers using OFMT or "%d" for exact  inte‐
       gers.   "%c"  with  a  numeric  argument prints the corresponding 8 bit
       character, with a string argument it prints the first character of  the
       string.   The output of print and printf can be redirected to a file or
       command by appending > file, >> file or | command to  the  end  of  the
       print  statement.   Redirection opens file or command only once, subse‐
       quent redirections append to the already open stream.   By  convention,
       mawk associates the filename

          ·   "/dev/stderr" with stderr,

          ·   "/dev/stdout" with stdout,

          ·   "-" and "/dev/stdin" with stdin.

       The  association  with  stderr  is  especially useful because it allows
       print and printf to be redirected to stderr.  These names can  also  be
       passed to functions.
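
       The output statements can be sketched as follows; the values and the
       warning text are made up, and the last statement writes to stderr
       rather than stdout:

```sh
mawk 'BEGIN {
    printf "%5.2f|%-4s|%c\n", 3.14159, "ab", 65    # %c with 65 prints A
    print 0.1 + 0.2, 7                  # OFMT gives 0.3; 7 is an exact integer
    print "warning: sample message" > "/dev/stderr"
}'
# stdout receives:
#  3.14|ab  |A
# 0.3 7
```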

       The input function getline has the following variations.

              getline
                     reads into $0, updates the fields, NF, NR and FNR.

              getline < file
                     reads into $0 from file, updates the fields and NF.

              getline var
                     reads the next record into var, updates NR and FNR.

              getline var < file
                     reads the next record of file into var.

              command | getline
                     pipes  a  record  from  command  into  $0 and updates the
                     fields and NF.

              command | getline var
                     pipes a record from command into var.

       Getline returns 0 on end-of-file, -1 on error, otherwise 1.

       Commands on the end of pipes are executed by /bin/sh.

       The function close(expr) closes the file or pipe associated with  expr.
       Close  returns  0 if expr is an open file, the exit status if expr is a
       piped command, and -1 otherwise.  Close is used to  reread  a  file  or
       command,  make sure the other end of an output pipe is finished or con‐
       serve file resources.
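
       As a sketch of reading from a command and rereading it after close(),
       with a hypothetical command string held in cmd:

```sh
mawk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0) n++    # 1 per record, 0 at end-of-file
    close(cmd)                              # allows the command to be rerun
    cmd | getline first                     # starts again at the first record
    print n, line, first
}'
# prints: 2 two one
```

       The parentheses around cmd | getline line keep the comparison with 0
       from being parsed as part of the getline expression.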

       The function fflush(expr) flushes the output file  or  pipe  associated
       with  expr.  Fflush returns 0 if expr is an open output stream else -1.
       Fflush without an argument flushes stdout.  Fflush with an empty  argu‐
       ment ("") flushes all open output.

       The  function  system(expr)  uses  the C runtime system call to execute
       expr and returns the corresponding wait status of the command  as  fol‐
       lows:

       ·   if  the  system call failed, setting the status to -1, mawk returns
           that value.

       ·   if the command exited normally, mawk returns its exit-status.

       ·   if the command exited due to a signal such as SIGHUP, mawk  returns
           the signal number plus 256.

       Changes  made  to the ENVIRON array are not passed to commands executed
       with system or pipes.
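
       A minimal sketch of the normal-exit case and of the ENVIRON rule.  The
       variable name DEMO_VAR is hypothetical and is assumed not to be set in
       the calling environment:

```sh
mawk 'BEGIN {
    rc = system("exit 3")               # command exits normally with status 3
    ENVIRON["DEMO_VAR"] = "changed"     # not exported to child commands
    "echo ${DEMO_VAR:-unset}" | getline seen
    print rc, seen
}'
# prints: 3 unset
```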

   10. User defined functions
       The syntax for a user defined function is

            function name( args ) { statements }

       The function body can contain a return statement

            return opt_expr

       A return statement is not required.  Function calls may  be  nested  or
       recursive.   Functions  are  passed  expressions by value and arrays by
       reference.  Extra arguments serve as local variables and  are  initial‐
       ized  to  null.  For example, csplit(s,A) puts each character of s into
       array A and returns the length of s.

            function csplit(s, A,    n, i)
            {
              n = length(s)
              for( i = 1 ; i <= n ; i++ ) A[i] = substr(s, i, 1)
              return n
            }

       Putting extra space between passed arguments  and  local  variables  is
       conventional.  Functions can be referenced before they are defined, but
       the function name and the '(' of the arguments must touch to avoid con‐
       fusion with concatenation.

       A function parameter is normally a scalar value (number or string).  If
       there is a forward reference to a function using an array as a  parame‐
       ter,  the  function's  corresponding  parameter  will  be treated as an
       array.
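
       A possible way to call csplit() from a complete program; the names
       count and chars are chosen only for this sketch:

```sh
mawk '
function csplit(s, A,    n, i)
{
    n = length(s)
    for (i = 1; i <= n; i++) A[i] = substr(s, i, 1)
    return n
}
BEGIN {
    count = csplit("awk", chars)    # chars is passed by reference
    # n is local to csplit, so the global n is still null here
    print count, chars[1], chars[3], "[" n "]"
}'
# prints: 3 a k []
```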
681
682   11. Splitting strings, records and files
683       Awk programs use the same algorithm to split strings into  arrays  with
684       split(), and records into fields on FS.  mawk uses essentially the same
685       algorithm to split files into records on RS.
686
687       Split(expr,A,sep) works as follows:
688
689          (1)  If sep is omitted, it is replaced by FS.  Sep can be an expres‐
690               sion  or  regular  expression.   If it is an expression of non-
691               string type, it is converted to string.
692
693          (2)  If sep = " " (a single space), then <SPACE> is trimmed from the
694               front  and back of expr, and sep becomes <SPACE>.  mawk defines
695               <SPACE> as the regular expression /[ \t\n]+/.  Otherwise sep is
696               treated  as  a  regular expression, except that meta-characters
697               are ignored for a string of length 1, e.g.,  split(x,  A,  "*")
698               and split(x, A, /\*/) are the same.
699
700          (3)  If  expr  is not string, it is converted to string.  If expr is
701               then the empty string "", split() returns 0 and A is set empty.
702               Otherwise, all non-overlapping, non-null and longest matches of
703               sep in expr, separate expr into fields which are loaded into A.
704               The  fields  are  placed  in  A[1], A[2], ..., A[n] and split()
705               returns n, the number of fields which is the number of  matches
706               plus  one.  Data placed in A that looks numeric is typed number
707               and string.
708
709       Splitting records into fields works the  same  except  the  pieces  are
710       loaded into $1, $2,..., $NF.  If $0 is empty, NF is set to 0 and all $i
711       to "".
712
713       mawk splits files into records by the  same  algorithm,  but  with  the
714       slight  difference  that RS is really a terminator instead of a separa‐
715       tor.  (ORS is really a terminator too).
716
717              E.g., if FS = ":+" and $0 = "a::b:" , then NF = 3 and $1 =  "a",
718              $2 = "b" and $3 = "", but if "a::b:" is the contents of an input
719              file and RS = ":+", then there are two records "a" and "b".
720
721       RS = " " is not special.
722
723       If FS = "", then mawk breaks the  record  into  individual  characters,
724       and,  similarly,  split(s,A,"")  places  the individual characters of s
725       into A.
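
       The splitting rules of this section can be exercised with a few calls
       to split().  This is a sketch under the assumption that mawk, or
       another awk with the same splitting behavior, is installed; the array
       names and the AWK shell variable are illustrative only.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    n1 = split("a::b:", A, ":+")     # regular expression separator
    n2 = split("  a b  ", B, " ")    # single space: trim, then split on <SPACE>
    n3 = split("a*b*c", C, "*")      # string of length 1: "*" is taken literally
    n4 = split("", D, ":")           # empty string: returns 0 and D is emptied
    print n1, n2, n3, n4
    print "[" A[3] "]", B[1] B[2], C[3]
}')
echo "$out"

# The same algorithm applied to a record, with FS set through -F.
nf=$(printf 'a::b:\n' | "$AWK" -F ':+' '{ print NF }')
echo "$nf"
```

       The first line of output is "3 2 3 0" and the second is "[] ab c";
       the record case reports NF = 3, matching the example in the text.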
726
727   12. Multi-line records
       Since mawk interprets RS as a regular expression, multi-line records
       are easy.  Setting RS = "\n\n+" makes one or more blank lines separate
       records.  If FS = " " (the default), then single newlines, by the rules
       for <SPACE> above, are treated as space and so act as field separators.
733
734              For example, if
735
736              ·   a file is "a b\nc\n\n",
737
738              ·   RS = "\n\n+" and
739
740              ·   FS = " ",
741
742              then there is one record "a b\nc" with three fields "a", "b" and
743              "c":
744
745              ·   Changing FS = "\n", gives two fields "a b" and "c";
746
747              ·   changing FS = "", gives one field identical to the record.
748
749       If  you want lines with spaces or tabs to be considered blank, set RS =
750       "\n([ \t]*\n)+".  For compatibility with other awks, setting  RS  =  ""
751       has  the  same effect as if blank lines are stripped from the front and
752       back of files and then records are  determined  as  if  RS  =  "\n\n+".
753       POSIX  requires that "\n" always separates records when RS = "" regard‐
754       less of the value of  FS.   mawk  does  not  support  this  convention,
755       because defining "\n" as <SPACE> makes it unnecessary.
756
757       Most  of  the  time when you change RS for multi-line records, you will
758       also want to change ORS to "\n\n" so the record spacing is preserved on
759       output.
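
       The worked example of this section can be reproduced by feeding the
       sample file on standard input.  This is a sketch that assumes an awk
       accepting a regular expression in RS (mawk does, as described above);
       the shell variable names are illustrative only.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

# The sample "file" from the text, with the default FS and then FS = "\n".
default_fs=$(printf 'a b\nc\n\n' |
    "$AWK" 'BEGIN { RS = "\n\n+" } { print NR, NF }')
newline_fs=$(printf 'a b\nc\n\n' |
    "$AWK" 'BEGIN { RS = "\n\n+"; FS = "\n" } { print NR, NF, $1 }')

echo "$default_fs"    # one record with three fields
echo "$newline_fs"    # one record with two fields, the first being "a b"
```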
760
761   13. Program execution
       This section describes the order of program execution.  First ARGC is
       set to the total number of command line arguments passed to the
       execution phase of the program.  ARGV[0] is set to the name of the AWK
       interpreter and ARGV[1] ... ARGV[ARGC-1] hold the remaining command
       line arguments exclusive of options and program source.  For example,
       with

            mawk  -f  prog  v=1  A  t=hello  B

       ARGC = 5 with ARGV[0] = "mawk", ARGV[1] = "v=1", ARGV[2] = "A", ARGV[3]
       = "t=hello" and ARGV[4] = "B".
772
773       Next, each BEGIN block is executed in order.  If the  program  consists
774       entirely  of  BEGIN  blocks,  then  execution terminates, else an input
775       stream is opened and execution continues.  If ARGC equals 1, the  input
776       stream  is  set  to stdin, else  the command line arguments ARGV[1] ...
777       ARGV[ARGC-1] are examined for a file argument.
778
779       The command line arguments divide  into  three  sets:  file  arguments,
780       assignment  arguments and empty strings "".  An assignment has the form
781       var=string.  When an ARGV[i] is examined as a possible  file  argument,
782       if  it  is  empty  it  is skipped; if it is an assignment argument, the
783       assignment to var takes place and i skips to the  next  argument;  else
784       ARGV[i] is opened for input.  If it fails to open, execution terminates
785       with exit code 2.  If no command line argument is a file argument, then
786       input comes from stdin.  Getline in a BEGIN action opens input.  “-” as
787       a file argument denotes stdin.
788
       Once an input stream is open, each input record is tested against each
       pattern, and if it matches, the associated action is executed.  An
       expression pattern matches if it is boolean true (see the end of
       section 2).  A BEGIN pattern matches before any input has been read, and
       an END pattern matches after all input has been read.  A range pattern,
       expr1,expr2, matches every record between the match of expr1 and the
       match of expr2 inclusively.
796
797       When end of file occurs on the input stream, the remaining command line
798       arguments  are  examined for a file argument, and if there is one it is
799       opened, else the END pattern is considered matched and all END  actions
800       are executed.
801
802       In  the example, the assignment v=1 takes place after the BEGIN actions
803       are executed, and the data placed in v  is  typed  number  and  string.
804       Input  is  then  read  from  file A.  On end of file A, t is set to the
805       string "hello", and B is opened for input.  On end of file B,  the  END
806       actions are executed.
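
       The argument handling described above can be observed directly.  This
       is a sketch with illustrative shell variable names; it passes the
       program text on the command line instead of using -f prog, and it
       avoids printing ARGV[0], since that depends on how the interpreter was
       invoked.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

# A BEGIN-only program never opens its operands, so A and B need not exist.
args=$("$AWK" 'BEGIN {
    printf "%d:", ARGC
    for (i = 1; i < ARGC; i++) printf " %s", ARGV[i]
    print ""
}' v=1 A t=hello B)
echo "$args"

# The assignment v=1 takes place after BEGIN, when the argument is examined.
# With no file argument left, input then comes from standard input.
timing=$(echo data |
    "$AWK" 'BEGIN { print "BEGIN sees [" v "]" } { print "main sees [" v "]" }' v=1)
echo "$timing"
```

       The first command prints "5: v=1 A t=hello B".  The second prints
       "BEGIN sees []" followed by "main sees [1]".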
807
       Program flow at the pattern {action} level can be changed with the

            next
            nextfile
            exit  opt_expr

       statements:
815
816       ·   A  next  statement causes the next input record to be read and pat‐
817           tern testing to restart with the first pattern {action} pair in the
818           program.
819
820       ·   A  nextfile  statement  tells  mawk  to stop processing the current
821           input file.  It then updates FILENAME to the next  file  listed  on
822           the command line, and resets FNR to 1.
823
824       ·   An  exit statement causes immediate execution of the END actions or
825           program termination if there are none or if the exit occurs  in  an
826           END action.  The opt_expr sets the exit value of the program unless
827           overridden by a later exit or subsequent error.
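
       The effect of next and exit can be seen in a small pipeline.  This is
       a sketch with made-up input; the shell variable names are illustrative
       assumptions.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

status=0
out=$(printf '1\n2\n3\n4\n5\n' | "$AWK" '
$1 % 2  { next }              # odd records: skip the remaining rules
$1 == 4 { exit 3 }            # run the END actions, then exit with value 3
        { print "even", $1 }
END     { print "END ran after", NR, "records" }') || status=$?
echo "$out"
echo "exit status: $status"
```

       Only record 2 reaches the print rule.  The exit on record 4 stops
       input, so record 5 is never read, the END action reports 4 records,
       and the exit status is 3.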
828

EXAMPLES

       1. emulate cat.

            { print }

       2. emulate wc.

            { chars += length($0) + 1  # add one for the \n
              words += NF
            }

            END{ print NR, words, chars }
841
       3. count the number of unique "real words".

            BEGIN { FS = "[^A-Za-z]+" }

            { for(i = 1 ; i <= NF ; i++)  word[$i] = "" }

            END { delete word[""]
                  for ( i in word )  cnt++
                  print cnt
            }

       4. sum the second field of every record based on the first field.

            $1 ~ /credit|gain/ { sum += $2 }
            $1 ~ /debit|loss/  { sum -= $2 }

            END { print sum }
859
       5. sort a file, comparing as string

            { line[NR] = $0 "" }  # make sure of comparison type
                            # in case some lines look numeric

            END {  isort(line, NR)
              for(i = 1 ; i <= NR ; i++) print line[i]
            }

            #insertion sort of A[1..n]
            function isort( A, n,    i, j, hold)
            {
              for( i = 2 ; i <= n ; i++)
              {
                hold = A[j = i]
                while ( A[j-1] > hold )
                { j-- ; A[j+1] = A[j] }
                A[j] = hold
              }
              # sentinel A[0] = "" will be created if needed
            }
881
882

COMPATIBILITY ISSUES

884   MAWK 1.3.3 versus POSIX 1003.2 Draft 11.3
885       The POSIX 1003.2(draft 11.3) definition of the AWK language is  AWK  as
886       described  in  the AWK book with a few extensions that appeared in Sys‐
887       temVR4 nawk.  The extensions are:
888
889          ·   New functions: toupper() and tolower().
890
891          ·   New variables: ENVIRON[] and CONVFMT.
892
893          ·   ANSI C conversion specifications for printf() and sprintf().
894
895          ·   New command options:  -v  var=value,  multiple  -f  options  and
896              implementation options as arguments to -W.
897
898          ·   For  systems  (MS-DOS  or Windows) which provide a setmode func‐
899              tion, an environment variable MAWKBINMODE and a  built-in  vari‐
900              able  BINMODE.   The  bits of the BINMODE value tell mawk how to
901              modify the RS and ORS variables:
902
903             0  set standard input to binary mode, and if BIT-2 is unset,  set
904                RS to "\r\n" (CR/LF) rather than "\n" (LF).
905
906             1  set standard output to binary mode, and if BIT-2 is unset, set
907                ORS to "\r\n" (CR/LF) rather than "\n" (LF).
908
909             2  suppress the assignment to RS and ORS of CR/LF, making it pos‐
910                sible  to run scripts and generate output compatible with Unix
911                line-endings.
912
913       POSIX AWK is oriented to operate on files a line at a time.  RS can  be
914       changed  from  "\n" to another single character, but it is hard to find
915       any use for this — there are no examples in the AWK book.   By  conven‐
916       tion, RS = "", makes one or more blank lines separate records, allowing
917       multi-line records.  When RS = "", "\n" is  always  a  field  separator
918       regardless of the value in FS.
919
920       mawk,  on  the  other hand, allows RS to be a regular expression.  When
921       "\n" appears in records, it is treated as space, and FS  always  deter‐
922       mines fields.
923
924       Removing the line at a time paradigm can make some programs simpler and
925       can often improve performance.  For example,  redoing  example  3  from
926       above,
927
            BEGIN { RS = "[^A-Za-z]+" }

            { word[ $0 ] = "" }

            END { delete  word[ "" ]
              for( i in word )  cnt++
              print cnt
            }
936
937       counts  the  number  of  unique words by making each word a record.  On
938       moderate size files, mawk executes twice as fast, because of  the  sim‐
939       plified inner loop.
940
941       The  following  program  replaces each comment by a single space in a C
942       program file,
943
            BEGIN {
              RS = "/\*([^*]|\*+[^/*])*\*+/"
                 # comment is record separator
              ORS = " "
              getline  hold
              }

              { print hold ; hold = $0 }

              END { printf "%s" , hold }
954
955       Buffering one record is needed to avoid  terminating  the  last  record
956       with a space.
957
958       With mawk, the following are all equivalent,
959
            x ~ /a\+b/    x ~ "a\+b"     x ~ "a\\+b"
961
962       The  strings  get  scanned  twice,  once  as string and once as regular
963       expression.  On the string scan, mawk ignores the escape on  non-escape
964       characters  while  the  AWK  book advocates \c be recognized as c which
965       necessitates the double escaping of meta-characters in strings.   POSIX
966       explicitly  declines to define the behavior which passively forces pro‐
967       grams that must run under a variety of awks to use  the  more  portable
968       but less readable, double escape.
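
       The portable double escape can be tried as follows.  This is a sketch
       with illustrative variable names.  It leaves out the single-escape
       string form "a\+b", because, as noted above, its behavior is not
       defined by POSIX and differs between awks.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    literal = ("a+b" ~ "a\\+b")    # string scan yields a\+b, so + is literal
    repeat  = ("aab" ~ "a\\+b")    # "aab" holds no literal "+", so no match
    slashes = ("a+b" ~ /a\+b/)     # regular expression constant, scanned once
    print literal, repeat, slashes
}')
echo "$out"    # prints: 1 0 1
```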
969
970       POSIX AWK does not recognize "/dev/std{in,out,err}".  Some systems pro‐
971       vide an actual device for this, allowing AWKs which  do  not  implement
972       the feature directly to support it.
973
974       POSIX  AWK  does  not  recognize  \x  hex  escape sequences in strings.
975       Unlike ANSI C, mawk limits the number of digits that follows \x to  two
976       as  the  current  implementation  only  supports 8 bit characters.  The
977       built-in fflush first appeared in a recent (1993) AT&T awk released  to
978       netlib, and is not part of the POSIX standard.  Aggregate deletion with
979       delete array is not part of the POSIX standard.
980
981       POSIX explicitly leaves the behavior of FS = "" undefined, and mentions
982       splitting  the record into characters as a possible interpretation, but
983       currently this use is not portable across implementations.
984
985   Random numbers
986       POSIX does not prescribe a method for initializing  random  numbers  at
987       startup.
988
989       In practice, most implementations do nothing special, which makes srand
990       and rand follow the C runtime library, making the initial seed value 1.
991       Some  implementations  (Solaris XPG4 and Tru64) return 0 from the first
992       call to srand, although the results from rand behave as if the  initial
993       seed is 1.  Other implementations return 1.
994
995       While  mawk  can  call srand at startup with no parameter (initializing
996       random numbers from the clock), this feature may  be  suppressed  using
997       conditional compilation.
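
       Because startup seeding differs between implementations and builds, a
       script that needs a repeatable sequence can seed explicitly.  This is
       a sketch; the seed value 42 and the variable names are arbitrary
       choices made for illustration.

```shell
# Assumption: prefer mawk, fall back to any awk found on PATH.
AWK=$(command -v mawk || command -v awk)

out=$("$AWK" 'BEGIN {
    srand(42); a = rand()
    srand(42); b = rand()              # same seed, so the same first value
    same    = (a == b)
    inrange = (a >= 0 && a < 1)        # rand() lies in [0, 1)
    print same, inrange
}')
echo "$out"    # prints: 1 1
```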
998
   Extensions added for compatibility with GAWK and BWK
       Nextfile is a gawk extension (also implemented by BWK awk).  It is not
       yet part of the POSIX standard (as of October 2012), although it has
       been accepted for the next revision of the standard.
1003
1004       Mktime, strftime and systime are gawk extensions.
1005
1006       The "/dev/stdin" feature was added to mawk after 1.3.4, for compatibil‐
1007       ity  with  gawk  and  BWK  awk.   The  corresponding  "-"  (alias   for
1008       /dev/stdin) was present in mawk 1.3.3.
1009
1010   Subtle Differences not in POSIX or the AWK Book
1011       Finally,  here  is  how mawk handles exceptional cases not discussed in
1012       the AWK book or the POSIX draft.  It is unsafe  to  assume  consistency
1013       across awks and safe to skip to the next section.
1014
1015          ·   substr(s,  i, n) returns the characters of s in the intersection
1016              of the closed interval [1, length(s)] and the half-open interval
1017              [i,  i+n).  When this intersection is empty, the empty string is
1018              returned; so substr("ABC", 1, 0) = "" and substr("ABC", -4, 6) =
1019              "A".
1020
1021          ·   Every  string,  including  the  empty  string, matches the empty
1022              string at the front so, s ~ // and s ~ "", are always  1  as  is
1023              match(s, //) and match(s, "").  The last two set RLENGTH to 0.
1024
          ·   index(s, t) is always the same as match(s, t1) where t1 is the
              same as t with metacharacters escaped.  Hence consistency with
              match requires that index(s, "") always returns 1.  Also the
              condition, index(s,t) != 0 if and only if t is a substring of s,
              requires index("","") = 1.
1030
1031          ·   If  getline  encounters  end  of  file,  getline var, leaves var
1032              unchanged.  Similarly, on entry to  the  END  actions,  $0,  the
1033              fields and NF have their value unaltered from the last record.
1034

ENVIRONMENT VARIABLES

1036       Mawk recognizes these variables:
1037
1038          MAWKBINMODE
1039             (see COMPATIBILITY ISSUES)
1040
1041          MAWK_LONG_OPTIONS
1042             If  this  is  set,  mawk uses its value to decide what to do with
1043             GNU-style long options:
1044
1045             allow  Mawk allows the option to be checked against  the  (small)
1046                    set of long options it recognizes.
1047
1048             error  Mawk  prints  an  error  message  and  exits.  This is the
1049                    default.
1050
1051             ignore Mawk ignores the option.
1052
             warn   Mawk prints a warning message and otherwise ignores the option.
1054
1055             If the variable is unset, mawk prints an error message and exits.
1056
1057          WHINY_USERS
1058             This is an undocumented gawk feature.   It  tells  mawk  to  sort
1059             array indices before it starts to iterate over the elements of an
1060             array.
1061

SEE ALSO

1063       egrep(1)
1064
1065       Aho, Kernighan and Weinberger, The AWK Programming  Language,  Addison-
1066       Wesley  Publishing, 1988, (the AWK book), defines the language, opening
1067       with a tutorial and advancing to many interesting programs  that  delve
1068       into  issues of software design and analysis relevant to programming in
1069       any language.
1070
1071       The GAWK Manual, The Free Software Foundation, 1991, is a tutorial  and
1072       language  reference that does not attempt the depth of the AWK book and
1073       assumes the reader may be a novice  programmer.   The  section  on  AWK
1074       arrays is excellent.  It also discusses POSIX requirements for AWK.
1075

BUGS

1077       mawk  implements  printf() and sprintf() using the C library functions,
1078       printf and sprintf, so full  ANSI  compatibility  requires  an  ANSI  C
1079       library.   In practice this means the h conversion qualifier may not be
1080       available.  Also mawk inherits any bugs or limitations of  the  library
1081       functions.
1082
1083       Implementors of the AWK language have shown a consistent lack of imagi‐
1084       nation when naming their programs.
1085

AUTHOR

1087       Mike Brennan (brennan@whidbey.com).
1088       Thomas E. Dickey <dickey@invisible-island.net>.
1089
1090
1091
Version 1.3.4                     2016-09-18                           MAWK(1)