LMAWK(1)                         USER COMMANDS                        LMAWK(1)

NAME

       lmawk - pattern scanning and text processing language

SYNOPSIS

       lmawk [-W option] [-F value] [-v var=value] [--] 'program text'
       [file ...]
       lmawk [-W option] [-F value] [-v var=value] [-f program-file] [--]
       [file ...]

DESCRIPTION

       lmawk is an interpreter for the AWK Programming Language derived from
       mawk.  The AWK language is useful for manipulation of data files, text
       retrieval and processing, and for prototyping and experimenting with
       algorithms.  lmawk is a new awk, meaning it implements the AWK language
       as defined in Aho, Kernighan and Weinberger, The AWK Programming
       Language, Addison-Wesley Publishing, 1988.  (Hereafter referred to as
       the AWK book.)  mawk conforms to the POSIX 1003.2 (draft 11.3)
       definition of the AWK language, which contains a few features not
       described in the AWK book, and mawk provides a small number of
       extensions.

       An AWK program is a sequence of pattern {action} pairs and function
       definitions.  Short programs are entered on the command line, usually
       enclosed in ' ' to avoid shell interpretation.  Longer programs can be
       read in from a file with the -f option.  Data input is read from the
       list of files on the command line or from standard input when the list
       is empty.  The input is broken into records as determined by the record
       separator variable, RS.  Initially, RS = "\n" and records are
       synonymous with lines.  Each record is compared against each pattern
       and if it matches, the program text for {action} is executed.
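
       As a minimal sketch of the model described above, the following
       command line program reads records from standard input and prints
       the first field of each record whose second field exceeds 1.  The
       shell variable AWK is an assumption of this example (it defaults to
       the system awk; set AWK=lmawk to try lmawk).

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

# Each input line is one record (RS = "\n"); the pattern $2 > 1 selects
# records and the action prints their first field.
out=$(printf 'a 1\nb 2\nc 3\n' | "$AWK" '$2 > 1 { print $1 }')
printf '%s\n' "$out"
```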

OPTIONS

       -F value       sets the field separator, FS, to value.

       -f file        Program text is read from file instead of from the
                      command line.  Multiple -f options are allowed.  As a
                      libmawk extension, if the file name starts with a plus
                      ('+'), the file is not loaded if it has already been
                      loaded by a previous -f or by an include from any of
                      the scripts already loaded.

       -b file        Program bytecode is read from file.  Multiple -b options
                      are allowed.  Bytecode can be generated using -Wcompile.
                      Libmawk may refuse to load bytecode generated on a
                      different system if byte order, type sizes or dump
                      version differs.

       -v var=value   assigns value to program variable var.

       --             indicates the unambiguous end of options.

       The above options will be available with any POSIX compatible
       implementation of AWK, and implementation specific options are
       prefaced with -W.  lmawk provides the following:

       -W version     lmawk writes its version and copyright to stdout and
                      compiled limits to stderr and exits 0.

       -W debug       includes location info in the compiled code; location
                      information is visible in the dump and when debugging
                      libmawk.

       -W dump        writes an assembler-like listing of the internal
                      representation of the program to stdout and exits 0 (on
                      successful compilation).

       -W dumpsym     writes a list of global symbols to stdout and exits 0
                      (on successful compilation).

       -W compile     writes a binary dump of the bytecode to stdout.  This
                      bytecode can be loaded using the -b switch.

       -W interactive sets unbuffered writes to stdout and line buffered reads
                      from stdin.  Records from stdin are lines regardless of
                      the value of RS.

       -W maxmem=num  limits dynamic memory allocation during compilation and
                      execution to num bytes and exits with an out-of-memory
                      error if more memory is to be allocated.  Optional
                      suffixes are k for kilobyte and m for megabyte.  0 means
                      unlimited, which is also the default.

       -W exec file   Program text is read from file and this is the last
                      option.  Useful on systems that support the #! "magic
                      number" convention for executable scripts.

       -W sprintf=num adjusts the size of lmawk's internal sprintf buffer to
                      num bytes.  More than rare use of this option indicates
                      lmawk should be recompiled.

       -W posix_space forces lmawk not to consider '\n' to be space.

       The short forms -W[vdiesp] are recognized and on some systems -We is
       mandatory to avoid command line length limitations.
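
       A short sketch of the standard options -F and -v: the field
       separator is set to a colon and the program variable min is assigned
       before the program runs.  The variable name min, the sample input and
       the shell variable AWK (default: the system awk; set AWK=lmawk to try
       lmawk) are assumptions of this example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

# -F: makes ":" the field separator; -v min=1000 sets min before execution.
out=$(printf 'root:x:0\nuser:x:1000\n' |
      "$AWK" -F: -v min=1000 '$3 >= min { print $1 }')
printf '%s\n' "$out"
```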

THE AWK LANGUAGE

   1. Program structure
       An AWK program is a sequence of pattern {action} pairs and user
       function definitions.

       A pattern can be:
              BEGIN
              END
              expression
              expression , expression

       One, but not both, of pattern {action} can be omitted.  If {action} is
       omitted it is implicitly { print }.  If pattern is omitted, then it is
       implicitly matched.  BEGIN and END patterns require an action.
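
       The pattern forms above can be sketched together in one program: a
       BEGIN action, a range pattern with the action omitted (so matching
       records are printed), and an END action.  The sample input and the
       shell variable AWK (default: the system awk; set AWK=lmawk to try
       lmawk) are assumptions of this example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$(printf 'one\nstart\ntwo\nstop\nthree\n' | "$AWK" '
BEGIN { print "begin" }          # runs before any input is read
/start/,/stop/                   # range pattern, action omitted: { print }
END   { print NR " records" }    # runs after the last record
')
printf '%s\n' "$out"
```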

       Statements are terminated by newlines, semi-colons or both.  Groups of
       statements such as actions or loop bodies are blocked via { ... } as in
       C.  The last statement in a block doesn't need a terminator.  Blank
       lines have no meaning; an empty statement is terminated with a
       semi-colon.  Long statements can be continued with a backslash, \.  A
       statement can be broken without a backslash after a comma, left brace,
       &&, ||, do, else, the right parenthesis of an if, while or for
       statement, and the right parenthesis of a function definition.  A
       comment starts with # and extends to, but does not include, the end of
       line.

       The following statements control program flow inside blocks.

              if ( expr ) statement

              if ( expr ) statement else statement

              while ( expr ) statement

              do statement while ( expr )

              for ( opt_expr ; opt_expr ; opt_expr ) statement

              for ( var in array ) statement

              continue

              break
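
       A small sketch of the flow control statements listed above, showing
       continue and break inside a for loop and a do loop whose body runs
       before its test.  The variable names and the shell variable AWK
       (default: the system awk; set AWK=lmawk to try lmawk) are assumptions
       of this example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$("$AWK" 'BEGIN {
    for (i = 1; i <= 10; i++) {
        if (i % 2) continue        # skip odd numbers
        if (i > 6) break           # leave the loop once i exceeds 6
        s = s " " i                # collect 2, 4 and 6
    }
    do { n++ } while (n < 3)       # body executes before the test
    print s, n
}')
printf '%s\n' "$out"
```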

   2. Data types, conversion and comparison
       There are two basic data types, numeric and string.  Numeric constants
       can be integer like -2, decimal like 1.08, or in scientific notation
       like -1.1e4 or .28E-3.  All numbers are represented internally and all
       computations are done in floating point arithmetic.  So for example,
       the expression 0.2e2 == 20 is true and true is represented as 1.0.

       String constants are enclosed in double quotes.

                   "This is a string with a newline at the end.\n"

       Strings can be continued across a line by escaping (\) the newline.
       The following escape sequences are recognized.

            \\        \
            \"        "
            \a        alert, ascii 7
            \b        backspace, ascii 8
            \t        tab, ascii 9
            \n        newline, ascii 10
            \v        vertical tab, ascii 11
            \f        formfeed, ascii 12
            \r        carriage return, ascii 13
            \ddd      1, 2 or 3 octal digits for ascii ddd
            \xhh      1 or 2 hex digits for ascii hh

       If you escape any other character \c, you get \c, i.e., lmawk ignores
       the escape.

       There are really three basic data types; the third is number and string
       which has both a numeric value and a string value at the same time.
       User defined variables come into existence when first referenced and
       are initialized to null, a number and string value which has numeric
       value 0 and string value "".  Non-trivial number and string typed data
       come from input and are typically stored in fields.  (See section 4).

       The type of an expression is determined by its context and automatic
       type conversion occurs if needed.  For example, in the statements

            y = x + 2  ;  z = x  "hello"

       the value stored in variable y will be typed numeric.  If x is not
       numeric, the value read from x is converted to numeric before it is
       added to 2 and stored in y.  The value stored in variable z will be
       typed string, and the value of x will be converted to string if
       necessary and concatenated with "hello".  (Of course, the value and
       type stored in x are not changed by any conversions.)  A string
       expression is converted to numeric using its longest numeric prefix as
       with atof(3).  A numeric expression is converted to string by replacing
       expr with sprintf(CONVFMT, expr), unless expr can be represented on the
       host machine as an exact integer, in which case it is converted with
       sprintf("%d", expr).  Sprintf() is an AWK built-in that duplicates the
       functionality of sprintf(3), and CONVFMT is a built-in variable used
       for internal conversion from number to string and initialized to
       "%.6g".  Explicit type conversions can be forced: expr "" is string
       and expr+0 is numeric.
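
       The conversion rules above can be sketched as follows.  The variable
       names and the shell variable AWK (default: the system awk; set
       AWK=lmawk to try lmawk) are assumptions of this example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$("$AWK" 'BEGIN {
    x = "3.5abc"
    y = x + 2            # longest numeric prefix "3.5" is used
    s = (10 / 4) ""      # not an integer: converted with CONVFMT
    t = (8 / 4) ""       # exact integer: converted as with "%d"
    CONVFMT = "%.2f"
    u = (1 / 3) ""       # now converted with two decimals
    print y; print s; print t; print u
}')
printf '%s\n' "$out"
```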

       To evaluate expr1 rel-op expr2, if both operands are numeric or number
       and string then the comparison is numeric; if both operands are string
       the comparison is string; if one operand is string, the non-string
       operand is converted and the comparison is string.  The result is
       numeric, 1 or 0.

       In boolean contexts such as if ( expr ) statement, a string expression
       evaluates true if and only if it is not the empty string ""; a numeric
       value is true if and only if it is not numerically zero.
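
       A brief sketch of the comparison and truth rules above, using
       constants only.  The variable names and the shell variable AWK
       (default: the system awk; set AWK=lmawk to try lmawk) are assumptions
       of this example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$("$AWK" 'BEGIN {
    a = (10 < 9)          # both numeric: numeric comparison
    b = ("10" < "9")      # both string: string comparison
    c = (10 < "9")        # one string operand: 10 becomes "10"
    print a, b, c
    t = ("0" ? "true" : "false")    # non-empty string
    f = (0   ? "true" : "false")    # numeric zero
    e = (""  ? "true" : "false")    # empty string
    print t, f, e
}')
printf '%s\n' "$out"
```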

   3. Regular expressions
       In the AWK language, records, fields and strings are often tested for
       matching a regular expression.  Regular expressions are enclosed in
       slashes, and

            expr ~ /r/

       is an AWK expression that evaluates to 1 if expr "matches" r, which
       means a substring of expr is in the set of strings defined by r.  With
       no match the expression evaluates to 0; replacing ~ with the "not
       match" operator, !~ , reverses the meaning.  As pattern-action pairs,

            /r/ { action }   and   $0 ~ /r/ { action }

       are the same, and for each input record that matches r, action is
       executed.  In fact, /r/ is an AWK expression that is equivalent to
       ($0 ~ /r/) anywhere except when on the right side of a match operator
       or passed as an argument to a built-in function that expects a regular
       expression argument.

       AWK uses extended regular expressions as with egrep(1).  The regular
       expression metacharacters, i.e., those with special meaning in regular
       expressions, are

            \ ^ $ . [ ] | ( ) * + ?

       Regular expressions are built up from characters as follows:

              c            matches any non-metacharacter c.

              \c           matches a character defined by the same escape
                           sequences used in string constants or the literal
                           character c if \c is not an escape sequence.

              .            matches any character (including newline).

              ^            matches the front of a string.

              $            matches the back of a string.

              [c1c2c3...]  matches any character in the class c1c2c3... .  An
                           interval of characters is denoted c1-c2 inside a
                           class [...].

              [^c1c2c3...] matches any character not in the class c1c2c3...

       Regular expressions are built up from other regular expressions as
       follows:

              r1r2         matches r1 followed immediately by r2
                           (concatenation).

              r1 | r2      matches r1 or r2 (alternation).

              r*           matches r repeated zero or more times.

              r+           matches r repeated one or more times.

              r?           matches r zero times or once.

              (r)          matches r, providing grouping.

       The increasing precedence of operators is alternation, concatenation
       and unary (*, + or ?).

       For example,

            /^[_a-zA-Z][_a-zA-Z0-9]*$/  and
            /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/

       are matched by AWK identifiers and AWK numeric constants respectively.
       Note that . has to be escaped to be recognized as a decimal point, and
       that metacharacters are not special inside character classes.

       Any expression can be used on the right hand side of the ~ or !~
       operators or passed to a built-in that expects a regular expression.
       If needed, it is converted to string, and then interpreted as a regular
       expression.  For example,

            BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }

            $0 ~ "^" identifier

       prints all lines that start with an AWK identifier.
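
       The dynamic regular expression above can be tried out as follows.
       The second rule, the sample input and the shell variable AWK
       (default: the system awk; set AWK=lmawk to try lmawk) are assumptions
       of this example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$(printf 'foo = 1\n9lives\n_bar\n' | "$AWK" '
BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }
# Concatenation binds tighter than ~, so the whole string is the regex.
$0 ~ "^" identifier "$" { print "identifier only: " $0 }
$0 ~ "^" identifier     { print "starts with identifier: " $0 }
')
printf '%s\n' "$out"
```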

       lmawk recognizes the empty regular expression, //, which matches the
       empty string and hence is matched by any string at the front, back and
       between every character.  For example,

            echo abc | lmawk '{ gsub(//, "X") ; print }'
            XaXbXcX

   4. Records and fields
       Records are read in one at a time, and stored in the field variable $0.
       The record is split into fields which are stored in $1, $2, ..., $NF.
       The built-in variable NF is set to the number of fields, and NR and FNR
       are incremented by 1.  Fields above $NF are set to "".

       Assignment to $0 causes the fields and NF to be recomputed.  Assignment
       to NF or to a field causes $0 to be reconstructed by concatenating the
       $i's separated by OFS.  Assignment to a field with index greater than
       NF increases NF and causes $0 to be reconstructed.
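
       A minimal sketch of the assignment rules above.  The sample record
       and the shell variable AWK (default: the system awk; set AWK=lmawk to
       try lmawk) are assumptions of this example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$(echo 'a b c' | "$AWK" '{
    OFS = "-"
    $5 = "e"          # index greater than NF: NF becomes 5, $4 is ""
    print NF
    print $0          # $0 was rebuilt from the fields joined by OFS
    $0 = "x y"        # assigning $0 recomputes the fields and NF
    print NF, $1
}')
printf '%s\n' "$out"
```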

       Data input stored in fields is string, unless the entire field has
       numeric form, in which case the type is number and string.  For
       example,

            echo 24 24E |
            lmawk '{ print($1>100, $1>"100", $2>100, $2>"100") }'
            0 1 1 1

       $0 and $2 are string and $1 is number and string.  The first comparison
       is numeric, the second is string, the third is string (100 is converted
       to "100"), and the last is string.

   5. Expressions and operators
       The expression syntax is similar to C.  Primary expressions are numeric
       constants, string constants, variables, fields, arrays and function
       calls.  The identifier for a variable, array or function can be a
       sequence of letters, digits and underscores that does not start with a
       digit.  Variables are not declared; they exist when first referenced
       and are initialized to null.

       New expressions are composed with the following operators in order of
       increasing precedence.

              assignment          =  +=  -=  *=  /=  %=  ^=
              conditional         ?  :
              logical or          ||
              logical and         &&
              array membership    in
              matching            ~   !~
              relational          <  >   <=  >=  ==  !=
              concatenation       (no explicit operator)
              add ops             +  -
              mul ops             *  /  %
              unary               +  -
              logical not         !
              exponentiation      ^
              inc and dec         ++ -- (both post and pre)
              field               $

       Assignment, conditional and exponentiation associate right to left; the
       other operators associate left to right.  Any expression can be
       parenthesized.
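
       A short sketch of how the precedence and associativity rules above
       play out.  The variable names and the shell variable AWK (default:
       the system awk; set AWK=lmawk to try lmawk) are assumptions of this
       example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$("$AWK" 'BEGIN {
    a = 2 ^ 3 ^ 2        # right to left: 2 ^ (3 ^ 2)
    b = 10 - 4 - 3       # left to right: (10 - 4) - 3
    c = 1 " " 2 + 3      # + binds tighter than concatenation: 1 " " (2 + 3)
    print a; print b; print c
}')
printf '%s\n' "$out"
```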

   6. Arrays
       Awk provides one-dimensional arrays.  Array elements are expressed as
       array[expr].  Expr is internally converted to string type, so, for
       example, A[1] and A["1"] are the same element and the actual index is
       "1".  Arrays indexed by strings are called associative arrays.
       Initially an array is empty; elements exist when first accessed.  An
       expression, expr in array, evaluates to 1 if array[expr] exists, else
       to 0.

       There is a form of the for statement that loops over each index of an
       array.

            for ( var in array ) statement

       sets var to each index of array and executes statement.  The order in
       which var traverses the indices of array is not defined.

       The statement, delete array[expr], causes array[expr] not to exist.
       lmawk supports an extension, delete array, which deletes all elements
       of array.

       Multidimensional arrays are synthesized with concatenation using the
       built-in variable SUBSEP.  array[expr1,expr2] is equivalent to
       array[expr1 SUBSEP expr2].  Testing for a multidimensional element uses
       a parenthesized index, such as

            if ( (i, j) in A )  print A[i, j]
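
       The array rules above can be sketched as follows.  The array and
       variable names and the shell variable AWK (default: the system awk;
       set AWK=lmawk to try lmawk) are assumptions of this example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$("$AWK" 'BEGIN {
    A[1] = "one"
    print A["1"]                  # same element as A[1]
    p = (2 in A); q = (1 in A)    # the in test does not create A[2]
    print p, q
    M["x", "y"] = 7               # stored under index "x" SUBSEP "y"
    if (("x", "y") in M) print M["x" SUBSEP "y"]
    delete A[1]
    n = 0
    for (k in A) n++              # no elements are left
    print n
}')
printf '%s\n' "$out"
```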

   7. Built-in variables
       The following variables are built-in and initialized before program
       execution.

              ARGC      number of command line arguments.

              ARGV      array of command line arguments, 0..ARGC-1.

              CONVFMT   format for internal conversion of numbers to string,
                        initially = "%.6g".

              ENVIRON   array indexed by environment variables.  An
                        environment string, var=value is stored as
                        ENVIRON[var] = value.

              FILENAME  name of the current input file.

              FNR       current record number in FILENAME.

              FS        splits records into fields as a regular expression.

              NF        number of fields in the current record.

              NR        current record number in the total input stream.

              OFMT      format for printing numbers; initially = "%.6g".

              OFS       inserted between fields on output, initially = " ".

              ORS       terminates each record on output, initially = "\n".

              RLENGTH   length set by the last call to the built-in function,
                        match().

              RS        input record separator, initially = "\n".

              RSTART    index set by the last call to match().

              SUBSEP    used to build multiple array subscripts, initially =
                        "\034".

              ERRNO     misc built-in functions (libmawk extensions) use this
                        variable to report errors.  All extension calls will
                        set this variable before returning, therefore ERRNO
                        holds the result of the last call.  An empty string
                        value means no error.  Error messages are formatted so
                        that the first word is a unique integer, followed by a
                        human readable error message starting at the second
                        word.  int(ERRNO) can be used to acquire the error
                        code, which can then be used as a secondary output
                        from the extension function.  For example, an awk
                        program can use valueof() to determine if a global
                        symbol exists and is a function or a variable or
                        anything else.

              LIBPATH   is a semicolon separated list of search paths.  When
                        loading an awk script by file name (-f command line
                        argument or include from another awk script) these
                        paths are inserted before the file name, in order, one
                        by one, until the first path that allows opening the
                        file.  An empty path is equivalent to the current
                        working directory.  LIBPATH can be modified from the
                        command line using -v, as arguments are scanned before
                        loading the scripts.  Setting LIBPATH to the empty
                        string results in the original behaviour of mawk.
                        LIBPATH is ignored for script file names starting with
                        a slash ('/') as those are assumed to be absolute
                        paths.
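
       A small sketch that reads a few of the standard built-in variables
       listed above (the libmawk-specific ERRNO and LIBPATH are not
       exercised).  The environment variable GREETING, the sample input and
       the shell variable AWK (default: the system awk; set AWK=lmawk to try
       lmawk) are assumptions of this example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$(printf 'a b\nc d e\n' | GREETING=hello "$AWK" '
{ print NR, NF, $NF }                    # record number, field count, last field
END { print ENVIRON["GREETING"], ARGC }  # ARGC counts only the program name here
')
printf '%s\n' "$out"
```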

   8. Built-in functions
       String functions

              gsub(r,s,t)  gsub(r,s)
                     Global substitution, every match of regular expression r
                     in variable t is replaced by string s.  The number of
                     replacements is returned.  If t is omitted, $0 is used.
                     An & in the replacement string s is replaced by the
                     matched substring of t.  \& and \\ put literal & and \,
                     respectively, in the replacement string.

              index(s,t)
                     If t is a substring of s, then the position where t
                     starts is returned, else 0 is returned.  The first
                     character of s is in position 1.

              length(s)
                     Returns the length of string s.

              match(s,r)
                     Returns the index of the first longest match of regular
                     expression r in string s.  Returns 0 if no match.  As a
                     side effect, RSTART is set to the return value.  RLENGTH
                     is set to the length of the match or -1 if no match.  If
                     the empty string is matched, RLENGTH is set to 0, and 1
                     is returned if the match is at the front, and length(s)+1
                     is returned if the match is at the back.

              split(s,A,r)  split(s,A)
                     String s is split into fields by regular expression r and
                     the fields are loaded into array A.  The number of fields
                     is returned.  See section 11 below for more detail.  If r
                     is omitted, FS is used.

              sprintf(format,expr-list)
                     Returns a string constructed from expr-list according to
                     format.  See the description of printf() below.

              sub(r,s,t)  sub(r,s)
                     Single substitution, same as gsub() except at most one
                     substitution.

              substr(s,i,n)  substr(s,i)
                     Returns the substring of string s, starting at index i,
                     of length n.  If n is omitted, the suffix of s, starting
                     at i is returned.

              tolower(s)
                     Returns a copy of s with all upper case characters
                     converted to lower case.

              toupper(s)
                     Returns a copy of s with all lower case characters
                     converted to upper case.
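
       Several of the string functions above can be sketched in one BEGIN
       block.  The sample strings, the variable names and the shell variable
       AWK (default: the system awk; set AWK=lmawk to try lmawk) are
       assumptions of this example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$("$AWK" 'BEGIN {
    s = "hello world"
    print index(s, "wor")                        # position of the substring
    print substr(s, 1, 5)                        # first five characters
    if (match(s, /o+/)) print RSTART, RLENGTH    # set as a side effect
    t = s
    n = gsub(/o/, "[&]", t)                      # & stands for the matched text
    print n, t
    k = split("a:b:c", parts, ":")
    print k, parts[3], toupper(parts[1])
}')
printf '%s\n' "$out"
```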

       Arithmetic functions

              atan2(y,x)     Arctan of y/x between -PI and PI.

              cos(x)         Cosine function, x in radians.

              exp(x)         Exponential function.

              int(x)         Returns x truncated towards zero.

              log(x)         Natural logarithm.

              rand()         Returns a random number between zero and one.

              sin(x)         Sine function, x in radians.

              sqrt(x)        Returns square root of x.

              srand(expr)  srand()
                     Seeds the random number generator, using the clock if
                     expr is omitted, and returns the value of the previous
                     seed.  lmawk seeds the random number generator from the
                     clock at startup so there is no real need to call
                     srand().  Srand(expr) is useful for repeating pseudo
                     random sequences.
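
       A brief sketch of int() truncation and of repeating a pseudo random
       sequence with srand(expr), as described above.  The seed value 42,
       the variable names and the shell variable AWK (default: the system
       awk; set AWK=lmawk to try lmawk) are assumptions of this example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$("$AWK" 'BEGIN {
    print int(3.9), int(-3.9)          # truncation is towards zero
    srand(42); a = rand()
    srand(42); b = rand()              # same seed: the sequence repeats
    same = (a == b)
    inrange = (a >= 0 && a <= 1)
    print same, inrange
    printf "%.4f\n", atan2(0, -1)      # PI, the upper end of the atan2 range
}')
printf '%s\n' "$out"
```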

       Misc functions (libmawk extensions)

              call(fname,arg1,arg2,...)
                     Call awk function fname with the supplied arguments.  If
                     the call fails, an empty value is returned; otherwise the
                     return value of the callee is returned.  Built-in
                     variable ERRNO is always set.

              acall(fname,arrname)
                     Call awk function fname with arguments supplied in the
                     array named arrname (both arguments are strings naming an
                     existing object).  The array should be indexed from 1.
                     The number of arguments is determined by looking for the
                     first empty (non-existing) index in the array.  If the
                     call fails, an empty value is returned; otherwise the
                     return value of the callee is returned.  Built-in
                     variable ERRNO is always set.

              valueof(vname [,idx])
                     Return the value of variable vname; if the variable is an
                     array, return the element indexed by idx (which must be
                     present in this case).  If idx is not present or is
                     empty (""), the variable is expected to be scalar.
                     Built-in variable ERRNO is always set.  NOTE: valueof()
                     has access to the global symbol table only.  It will fail
                     to resolve anything other than global objects; most
                     notably it will fail on local variables, $ arguments and
                     on most of the built-in variables.

   9. Input and output
       There are two output statements, print and printf.

              print  writes $0 ORS to standard output.

              print expr1, expr2, ..., exprn
                     writes expr1 OFS expr2 OFS ... exprn ORS to standard
                     output.  Numeric expressions are converted to string with
                     OFMT.

              printf format, expr-list
                     duplicates the printf C library function writing to
                     standard output.  The complete ANSI C format
                     specifications are recognized with conversions %c, %d,
                     %e, %E, %f, %g, %G, %i, %o, %s, %u, %x, %X and %%, and
                     conversion qualifiers h and l.

       The argument list to print or printf can optionally be enclosed in
       parentheses.  Print formats numbers using OFMT or "%d" for exact
       integers.  "%c" with a numeric argument prints the corresponding 8 bit
       character, with a string argument it prints the first character of the
       string.  The output of print and printf can be redirected to a file or
       command by appending > file, >> file or | command to the end of the
       print statement.  Redirection opens file or command only once,
       subsequent redirections append to the already open stream.  By
       convention, lmawk associates the filename "/dev/stderr" with stderr
       which allows print and printf to be redirected to stderr.  lmawk also
       associates "-" and "/dev/stdout" with stdin and stdout which allows
       these streams to be passed to functions.  Opening /dev/fd/N will do an
       fdopen() on file descriptor N, where N is an integer; this is a libmawk
       extension.  If any of the /dev heuristics needs to be bypassed (i.e.
       the script wants to open the real /dev/stdout or the real /dev/fd/5),
       the leading slash should be doubled (e.g. //dev/fd/5).
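
       The output rules above can be sketched as follows: print applies
       OFMT to a non-integer and prints an exact integer as is, printf
       follows the C conversions, and "/dev/stderr" routes a message to
       stderr.  The sample values and the shell variable AWK (default: the
       system awk; set AWK=lmawk to try lmawk) are assumptions of this
       example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$("$AWK" 'BEGIN {
    OFMT = "%.2f"
    print 3.14159, 42                      # OFMT for 3.14159, integer form for 42
    printf "%5.1f|%-4s|%c|%x\n", 2.5, "ab", "xyz", 255
    print "warning" > "/dev/stderr"        # sent to stderr, discarded below
}' 2>/dev/null)
printf '%s\n' "$out"
```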

       The input function getline has the following variations.

              getline
                     reads into $0, updates the fields, NF, NR and FNR.

              getline < file
                     reads into $0 from file, updates the fields and NF.

              getline var
                     reads the next record into var, updates NR and FNR.

              getline var < file
                     reads the next record of file into var.

              command | getline
                     pipes a record from command into $0 and updates the
                     fields and NF.

              command | getline var
                     pipes a record from command into var.

       Getline returns 0 on end-of-file, -1 on error, otherwise 1.

       Commands on the end of pipes are executed by /bin/sh.

       The function close(expr) closes the file or pipe associated with expr.
       Close returns 0 if expr is an open file, the exit status if expr is a
       piped command, and -1 otherwise.  Close is used to reread a file or
       command, make sure the other end of an output pipe is finished or
       conserve file resources.

       The function fflush(expr) flushes the output file or pipe associated
       with expr.  Fflush returns 0 if expr is an open output stream else -1.
       Fflush without an argument flushes stdout.  Fflush with an empty
       argument ("") flushes all open output.

       The function system(expr) uses /bin/sh to execute expr and returns the
       exit status of the command expr.  Changes made to the ENVIRON array are
       not passed to commands executed with system or pipes.
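
       A sketch of the getline, close() and system() behaviour described
       above: a command is read to end-of-file, closed so that it can be
       rerun, and the exit status of a second command is collected.  The
       commands, the variable names and the shell variable AWK (default: the
       system awk; set AWK=lmawk to try lmawk) are assumptions of this
       example.

```shell
AWK=${AWK:-awk}   # assumption: hypothetical variable, set AWK=lmawk for lmawk

out=$("$AWK" 'BEGIN {
    cmd = "echo x; echo y"
    while ((cmd | getline line) > 0) n++   # getline returns 1 per record, 0 at EOF
    close(cmd)                             # close so the command can be rerun
    cmd | getline first                    # rerun: reads the first record again
    close(cmd)
    status = system("exit 3")              # exit status of the shell command
    print n, first, status
}')
printf '%s\n' "$out"
```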
636
637   10. User defined functions
638       The syntax for a user defined function is
639
640            function name( args ) { statements }
641
642       The function body can contain a return statement
643
644            return opt_expr
645
646       A return statement is not required.  Function calls may  be  nested  or
647       recursive.   Functions  are  passed  expressions by value and arrays by
648       reference.  Extra arguments serve as local variables and  are  initial‐
649       ized  to  null.  For example, csplit(s,A) puts each character of s into
650       array A and returns the length of s.
651
652            function csplit(s, A,    n, i)
653            {
654              n = length(s)
655              for( i = 1 ; i <= n ; i++ ) A[i] = substr(s, i, 1)
656              return n
657            }
658
659       Putting extra space between passed arguments  and  local  variables  is
660       conventional.  Functions can be referenced before they are defined, but
661       the function name and the '(' of the arguments must touch to avoid con‐
662       fusion with concatenation.
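
       The passing rules can be observed with a small sketch, assuming awk
       on the PATH stands in for lmawk.  The function fill and the variable
       names are hypothetical.

```shell
out=$(awk '
function fill(x, A,    i) {
    for (i = 1; i <= 3; i++) A[i] = x   # modifies the array of the caller
    x = "changed"                       # modifies only the local copy of x
    return i                            # i is local, so the global i survives
}
BEGIN {
    i = "global"; v = "orig"
    r = fill(v, arr)
    print v, arr[3], r, i
}')
echo "$out"
```

       The scalar v is unchanged after the call, while arr has been filled
       in place.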
663
664   11. Splitting strings, records and files
665       Awk  programs  use the same algorithm to split strings into arrays with
666       split(), and records into fields on FS.   lmawk  uses  essentially  the
667       same algorithm to split files into records on RS.
668
669       Split(expr,A,sep) works as follows:
670
671              (1)    If  sep  is omitted, it is replaced by FS.  Sep can be an
672                     expression or regular expression.  If it is an expression
673                     of non-string type, it is converted to string.
674
675              (2)    If  sep  =  " " (a single space), then <SPACE> is trimmed
676                     from the front and back of expr, and sep becomes <SPACE>.
677                     lmawk   defines   <SPACE>   as   the  regular  expression
678                     /[ \t\n]+/.   Otherwise  sep  is  treated  as  a  regular
679                     expression, except that meta-characters are ignored for a
680                     string of length 1, e.g., split(x, A, "*")  and  split(x,
681                     A, /\*/) are the same.
682
683              (3)    If  expr  is  not  string, it is converted to string.  If
684                     expr is then the empty string "", split() returns 0 and A
685                     is  set  empty.  Otherwise, all non-overlapping, non-null
686                     and longest matches of sep in expr,  separate  expr  into
687                     fields which are loaded into A.  The fields are placed in
688                     A[1], A[2], ..., A[n] and split() returns n,  the  number
689                     of  fields which is the number of matches plus one.  Data
690                     placed in A  that  looks  numeric  is  typed  number  and
691                     string.
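
       The three rules can be exercised with a sketch like the following,
       assuming awk on the PATH stands in for lmawk; the strings and array
       names are illustrative.

```shell
out=$(awk 'BEGIN {
    n1 = split("  a b  ", A, " ")    # rule (2): space trimmed from both ends
    n2 = split("a::b:", B, /:+/)     # rule (3): two matches, so three fields
    n3 = split("a*b*c", C, "*")      # a single character is taken literally
    print n1, n2, n3, "[" B[3] "]"
}')
echo "$out"
```

       The last field of B is empty because the final match of the separator
       sits at the end of the string.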
692
693       Splitting  records  into  fields  works  the same except the pieces are
694       loaded into $1, $2,..., $NF.  If $0 is empty, NF is set to 0 and all $i
695       to "".
696
697       lmawk  splits  files  into  records by the same algorithm, but with the
698       slight difference that RS is really a terminator instead of  a  separa‐
699       tor.  (ORS is really a terminator too).
700
701              E.g.,  if FS = ":+" and $0 = "a::b:" , then NF = 3 and $1 = "a",
702              $2 = "b" and $3 = "", but if "a::b:" is the contents of an input
703              file and RS = ":+", then there are two records "a" and "b".
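
       The example above can be tried from the shell.  This is a sketch that
       assumes awk on the PATH stands in for lmawk.  A regular expression in
       RS is an extension, so the second command prints its count without
       asserting a result; the text above gives two records for lmawk.

```shell
# FS as a separator: the trailing ":" leaves an empty third field.
nf=$(echo "a::b:" | awk -F ":+" '{ print NF }')
echo "NF = $nf"

# RS as a terminator: the trailing ":" ends the last record.
printf "a::b:" | awk 'BEGIN { RS = ":+" } END { print "NR =", NR }'
```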
704
705       RS = " " is not special.
706
707       If  FS  =  "", then lmawk breaks the record into individual characters,
708       and, similarly, split(s,A,"") places the  individual  characters  of  s
709       into A.
710
711   12. Multi-line records
712       Since  lmawk  interprets RS as a regular expression, multi-line records
713       are easy.  Setting RS = "\n\n+", makes one or more blank lines separate
714       records.  If FS = " " (the default), then single newlines, by the rules
715       for  <SPACE>  above,  count  as space and so act as field separa‐
716       tors.
717
718              For  example,  if  a file is "a b\nc\n\n", RS = "\n\n+" and FS =
719              " ", then there is one record "a b\nc" with  three  fields  "a",
720              "b"  and  "c".   Changing  FS = "\n", gives two fields "a b" and
721              "c"; changing FS = "", gives one field identical to the record.
722
723       If you want lines with spaces or tabs to be considered blank, set RS  =
724       "\n([ \t]*\n)+".   For  compatibility  with other awks, setting RS = ""
725       has the same effect as if blank lines are stripped from the  front  and
726       back  of  files  and  then  records  are determined as if RS = "\n\n+".
727       Posix requires that "\n" always separates records when RS = ""  regard‐
728       less  of  the  value  of  FS.   lmawk does not support this convention,
729       because defining "\n" as <SPACE> makes it unnecessary.
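
       A sketch of the RS = "" convention with the default FS, assuming awk
       on the PATH stands in for lmawk; the input text is illustrative.

```shell
out=$(printf "a b\nc\n\n\nd\n" | awk '
BEGIN { RS = "" }          # runs of blank lines separate records
      { print NR ": " NF } # newlines inside a record count as field space
')
echo "$out"
```

       The first record is "a b\nc" with three fields and the second is "d"
       with one.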
730
731       Most of the time when you change RS for multi-line  records,  you  will
732       also want to change ORS to "\n\n" so the record spacing is preserved on
733       output.
734
735   13. Program execution
736       This section describes the order of program execution.  First  ARGC  is
737       set  to the total number of command line arguments passed to the execu‐
738       tion phase of the program.  ARGV[0] is set to the name of the AWK inter‐
739       preter  and  ARGV[1] ...  ARGV[ARGC-1] hold the remaining  command line
740       arguments exclusive of options and program source.  For example with
741
742            lmawk  -f  prog  v=1  A  t=hello  B
743
744       ARGC = 5 with ARGV[0] =  "lmawk",  ARGV[1]  =  "v=1",  ARGV[2]  =  "A",
745       ARGV[3] = "t=hello" and ARGV[4] = "B".
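
       The same argument list can be inspected with a BEGIN-only program,
       which never opens A or B.  This sketch assumes awk on the PATH stands
       in for lmawk and skips ARGV[0], since its value depends on the name
       of the interpreter.

```shell
out=$(awk 'BEGIN {
    printf "%d:", ARGC
    for (i = 1; i < ARGC; i++) printf " %s", ARGV[i]
    printf "\n"
}' v=1 A t=hello B)
echo "$out"
```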
746
747       Next,  each  BEGIN block is executed in order.  If the program consists
748       entirely of BEGIN blocks, then  execution  terminates,  else  an  input
749       stream  is opened and execution continues.  If ARGC equals 1, the input
750       stream is set to stdin, else  the command line  arguments  ARGV[1]  ...
751       ARGV[ARGC-1] are examined for a file argument.
752
753       The  command  line  arguments  divide  into three sets: file arguments,
754       assignment arguments and empty strings "".  An assignment has the  form
755       var=string.   When  an ARGV[i] is examined as a possible file argument,
756       if it is empty it is skipped; if it  is  an  assignment  argument,  the
757       assignment  to  var  takes place and i skips to the next argument; else
758       ARGV[i] is opened for input.  If it fails to open, execution terminates
759       with exit code 2.  If no command line argument is a file argument, then
760       input comes from stdin.  Getline in a BEGIN action opens input.  "-" as
761       a file argument denotes stdin.
762
763       Once  an input stream is open, each input record is tested against each
764       pattern, and if it matches, the  associated  action  is  executed.   An
765       expression  pattern  matches if it is boolean true (see the end of sec‐
766       tion 2).  A BEGIN pattern matches before any input has been  read,  and
767       an END pattern matches after all input has been read.  A range pattern,
768       expr1,expr2 , matches every record between the match of expr1  and  the
769       match of expr2 inclusively.
770
771       When end of file occurs on the input stream, the remaining command line
772       arguments are examined for a file argument, and if there is one  it  is
773       opened,  else the END pattern is considered matched and all END actions
774       are executed.
775
776       In the example, the assignment v=1 takes place after the BEGIN  actions
777       are  executed,  and  the  data  placed in v is typed number and string.
778       Input is then read from file A.  On end of file A,  t  is  set  to  the
779       string  "hello",  and B is opened for input.  On end of file B, the END
780       actions are executed.
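
       The timing of assignment arguments can be observed with a sketch like
       this one, assuming awk on the PATH stands in for lmawk.  Standard
       input ("-") takes the place of the files A and B.

```shell
out=$(echo data | awk '
BEGIN { printf "begin=[%s] ", v }    # v is not yet assigned
      { printf "main=[%s] ", v }     # v=1 was applied before "-" was opened
END   { printf "end=[%s]\n", v }     # v=2 was applied at end of input
' v=1 - v=2)
echo "$out"
```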
781
782       Program flow at the pattern {action} level can be changed with the
783
784            next
785            exit  opt_expr
786
787       statements.  A next statement causes the next input record to  be  read
788       and  pattern testing to restart with the first pattern {action} pair in
789       the program.  An exit statement causes immediate execution of  the  END
790       actions  or program termination if there are none or if the exit occurs
791       in an END action.  The opt_expr sets the  exit  value  of  the  program
792       unless overridden by a later exit or subsequent error.
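
       A sketch of both statements, assuming awk on the PATH stands in for
       lmawk; the input and the exit value 7 are illustrative.

```shell
status=0
out=$(printf "1\n2\n3\n4\n" | awk '
$1 == 1 { next }              # skip record 1; later patterns are not tested
$1 == 3 { exit 7 }            # stop reading; the END action still runs
        { print "rec", $1 }
END     { print "end" }
') || status=$?
echo "$out"
echo "exit status: $status"
```

       Record 4 is never read, and the exit value set in the main rule
       survives because the END action does not call exit again.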
793
794
795   14. include
796       libmawk introduces a source inclusion feature.  The syntax is:
797
798            include "filename"
799
800       Include statements must be at top level (outside of blocks).  If the file
801       name starts with a plus sign ('+'), the script file is not loaded if it
802       has already been loaded (by another include or a -f command line argument).
803
804
805

EXAMPLES

807       1. emulate cat.
808
809            { print }
810
811       2. emulate wc.
812
813            { chars += length($0) + 1  # add one for the \n
814              words += NF
815            }
816
817            END{ print NR, words, chars }
818
819       3. count the number of unique "real words".
820
821            BEGIN { FS = "[^A-Za-z]+" }
822
823            { for(i = 1 ; i <= NF ; i++)  word[$i] = "" }
824
825            END { delete word[""]
826                  for ( i in word )  cnt++
827                  print cnt
828            }
829
830       4. sum the second field of every record based on the first field.
831
832            $1 ~ /credit|gain/ { sum += $2 }
833            $1 ~ /debit|loss/  { sum -= $2 }
834
835            END { print sum }
836
837       5. sort a file, comparing as string
838
839            { line[NR] = $0 "" }  # make sure of comparison type
840                            # in case some lines look numeric
841
842            END {  isort(line, NR)
843              for(i = 1 ; i <= NR ; i++) print line[i]
844            }
845
846            #insertion sort of A[1..n]
847            function isort( A, n,    i, j, hold)
848            {
849              for( i = 2 ; i <= n ; i++)
850              {
851                hold = A[j = i]
852                while ( A[j-1] > hold )
853                { j-- ; A[j+1] = A[j] }
854                A[j] = hold
855              }
856              # sentinel A[0] = "" will be created if needed
857            }
858
859

COMPATIBILITY ISSUES

861       The  Posix  1003.2(draft 11.3) definition of the AWK language is AWK as
862       described in the AWK book with a few extensions that appeared  in  Sys‐
863       temVR4 nawk. The extensions are:
864
865              New  functions:  toupper()  and  tolower();  libmawk extensions:
866              call(), acall(), valueof().
867
868              New variables: ENVIRON[] and CONVFMT; libmawk extension:  ERRNO,
869              LIBPATH.   As  a libmawk extension, ENVIRON affects the environ‐
870              ment of child processes.
871
872              As a libmawk extension, new built-in variable LIBPATH is used as
873              a  list  of  search paths while loading scripts from the command
874              line or from include.
875
876              If a script name starts with a plus sign ('+'), the file is  not
877              loaded  if  it  has been loaded earlier (to avoid double loading
878              libs through -f and/or include).  This is a libmawk extension.
879
880              It is possible to include a script from another script using the
881              keyword include "scriptname.awk" (libmawk extension).
882
883              ANSI C conversion specifications for printf() and sprintf().
884
885              New  command  options:   -v  var=value,  multiple -f options and
886              implementation options as arguments to -W.
887
888
889       Posix AWK is oriented to operate on files a line at a time.  RS can  be
890       changed  from  "\n" to another single character, but it is hard to find
891       any use for this — there are no examples in the AWK book.   By  conven‐
892       tion, RS = "", makes one or more blank lines separate records, allowing
893       multi-line records.  When RS = "", "\n" is  always  a  field  separator
894       regardless of the value in FS.
895
896       lmawk,  on  the other hand, allows RS to be a regular expression.  When
897       "\n" appears in records, it is treated as space, and FS  always  deter‐
898       mines fields.
899
900       Removing the line at a time paradigm can make some programs simpler and
901       can often improve performance.  For example,  redoing  example  3  from
902       above,
903
904            BEGIN { RS = "[^A-Za-z]+" }
905
906            { word[ $0 ] = "" }
907
908            END { delete  word[ "" ]
909              for( i in word )  cnt++
910              print cnt
911            }
912
913       counts  the  number  of  unique words by making each word a record.  On
914       moderate size files, lmawk executes twice as fast, because of the  sim‐
915       plified inner loop.
916
917       The  following  program  replaces each comment by a single space in a C
918       program file,
919
920            BEGIN {
921              RS = "/\*([^*]|\*+[^/*])*\*+/"
922                 # comment is record separator
923              ORS = " "
924              getline  hold
925              }
926
927              { print hold ; hold = $0 }
928
929              END { printf "%s" , hold }
930
931       Buffering one record is needed to avoid  terminating  the  last  record
932       with a space.
933
934       With lmawk, the following are all equivalent,
935
936            x ~ /a\+b/    x ~ "a\+b"     x ~ "a\\+b"
937
938       The  strings  get  scanned  twice,  once  as string and once as regular
939       expression.  On the string scan, lmawk ignores the escape on non-escape
940       characters  while  the  AWK  book advocates \c be recognized as c which
941       necessitates the double escaping of meta-characters in strings.   Posix
942       explicitly  declines to define the behavior which passively forces pro‐
943       grams that must run under a variety of awks to use  the  more  portable
944       but less readable, double escape.
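
       The two portable spellings can be compared with a sketch like this
       one, assuming awk on the PATH stands in for lmawk.  The single-escape
       string form "a\+b" is left out because its behavior is the undefined
       case described above.

```shell
out=$(echo "a+b" | awk '
$0 ~ /a\+b/   { n++ }    # regular expression constant: scanned once
$0 ~ "a\\+b"  { n++ }    # string: \\ becomes \ on the string scan
END { print n }
')
echo "$out"
```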
945
946       Posix  AWK  does  not  recognize  "/dev/std{out,err}"  or \x hex escape
947       sequences in strings.  Unlike ANSI C, lmawk limits the number of digits
948       that  follows  \x  to two as the current implementation only supports 8
949       bit characters.  The built-in fflush first appeared in a recent  (1993)
950       AT&T  awk  released  to  netlib, and is not part of the Posix standard.
951       Aggregate deletion with delete array is not part of the Posix standard.
952
953       Posix explicitly leaves the behavior of FS = "" undefined, and mentions
954       splitting  the record into characters as a possible interpretation, but
955       currently this use is not portable across implementations.
956
957       Finally, here is how lmawk handles exceptional cases not  discussed  in
958       the  AWK  book  or the Posix draft.  It is unsafe to assume consistency
959       across awks and safe to skip to the next section.
960
961              substr(s, i, n) returns the characters of s in the  intersection
962              of the closed interval [1, length(s)] and the half-open interval
963              [i, i+n).  When this intersection is empty, the empty string  is
964              returned; so substr("ABC", 1, 0) = "" and substr("ABC", -4, 6) =
965              "A".
966
967              Every string, including the  empty  string,  matches  the  empty
968              string  at  the  front so, s ~ // and s ~ "", are always 1 as is
969              match(s, //) and match(s, "").  The last two set RLENGTH to 0.
970
971              index(s, t) is always the same as match(s, t1) where t1  is  the
972              same  as  t with metacharacters escaped.  Hence consistency with
973              match requires that index(s, "") always  returns  1.   Also  the
974              condition, index(s,t) != 0 if and only if t is a substring of s,
975              requires index("","") = 1.
976
977              If getline encounters end  of  file,  getline  var,  leaves  var
978              unchanged.   Similarly,  on  entry  to  the END actions, $0, the
979              fields and NF have their value unaltered from the last record.
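
       Two of these cases can be sketched portably, assuming awk on the PATH
       stands in for lmawk.  The function isubstr is a hypothetical helper
       that spells out the interval rule for integer i and n instead of
       relying on the built-in substr at the boundaries; /dev/null is
       assumed to exist and to read as an empty file.

```shell
out=$(awk '
function isubstr(s, i, n,    lo, hi) {
    lo = (i > 1) ? i : 1                                   # max(1, i)
    hi = (i + n - 1 < length(s)) ? i + n - 1 : length(s)   # min(length(s), i+n-1)
    return (hi >= lo) ? substr(s, lo, hi - lo + 1) : ""    # empty intersection
}
BEGIN {
    printf "[%s] [%s] [%s]\n", isubstr("ABC", 1, 0), isubstr("ABC", -4, 6), isubstr("ABC", 2, 5)
    v = "keep"
    r = (getline v < "/dev/null")    # end of file at once, so v is not touched
    print r, v
}')
echo "$out"
```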
980

SEE ALSO

982       egrep(1), mawk(1)
983
984       Aho, Kernighan and Weinberger, The AWK Programming  Language,  Addison-
985       Wesley  Publishing, 1988, (the AWK book), defines the language, opening
986       with a tutorial and advancing to many interesting programs  that  delve
987       into  issues of software design and analysis relevant to programming in
988       any language.
989
990       The GAWK Manual, The Free Software Foundation, 1991, is a tutorial  and
991       language  reference that does not attempt the depth of the AWK book and
992       assumes the reader may be a novice  programmer.   The  section  on  AWK
993       arrays is excellent.  It also discusses Posix requirements for AWK.
994

BUGS

996       lmawk  cannot handle ASCII NUL \0 in the source or data files.  You can
997       output NUL using printf with %c, and  any  other  8  bit  character  is
998       acceptable input.
999
1000       lmawk  implements printf() and sprintf() using the C library functions,
1001       printf and sprintf, so full  ANSI  compatibility  requires  an  ANSI  C
1002       library.   In practice this means the h conversion qualifier may not be
1003       available.  Also lmawk inherits any bugs or limitations of the  library
1004       functions.
1005
1006       Implementors of the AWK language have shown a consistent lack of imagi‐
1007       nation when naming their programs.
1008

AUTHOR

1010       mawk: Mike Brennan (brennan@whidbey.com).
1011
1012       libmawk extensions: Tibor Palinkas (libmawk@igor2.repo.hu).
1013
1014
1015
1016Version 1.2                       Dec 12 2010                         LMAWK(1)