LMAWK(1)                      USER COMMANDS                      LMAWK(1)

NAME
lmawk - pattern scanning and text processing language

SYNOPSIS
lmawk [-W option] [-F value] [-v var=value] [--] 'program text' [file ...]
lmawk [-W option] [-F value] [-v var=value] [-f program-file] [--] [file ...]
13
DESCRIPTION
15 lmawk is an interpreter for the AWK Programming Language derived from
16 mawk. The AWK language is useful for manipulation of data files, text
17 retrieval and processing, and for prototyping and experimenting with
algorithms. lmawk is a "new awk", meaning it implements the AWK language
as defined in Aho, Kernighan and Weinberger, The AWK Programming
Language, Addison-Wesley Publishing, 1988 (hereafter referred to as the
AWK book). mawk conforms to the Posix 1003.2 (draft 11.3) definition
of the AWK language, which contains a few features not described in the
AWK book, and mawk provides a small number of extensions.
24
25 An AWK program is a sequence of pattern {action} pairs and function
26 definitions. Short programs are entered on the command line usually
27 enclosed in ' ' to avoid shell interpretation. Longer programs can be
28 read in from a file with the -f option. Data input is read from the
29 list of files on the command line or from standard input when the list
30 is empty. The input is broken into records as determined by the record
31 separator variable, RS. Initially, RS = "\n" and records are synony‐
32 mous with lines. Each record is compared against each pattern and if
33 it matches, the program text for {action} is executed.
34
OPTIONS
36 -F value sets the field separator, FS, to value.
37
-f file Program text is read from file instead of from the command
line. Multiple -f options are allowed. As a libmawk extension, if
the file name starts with a plus ('+'), it is not loaded if the same
file has already been loaded by a previous -f or include from any of
the scripts already loaded.
44
-b file Program bytecode is read from file. Multiple -b options
are allowed. Bytecode can be generated using -Wcompile.
Libmawk may refuse to load bytecode generated on a different
system if byte order, type sizes or dump version differ.
50
51 -v var=value assigns value to program variable var.
52
53 -- indicates the unambiguous end of options.
54
The above options will be available with any Posix compatible
implementation of AWK. Implementation specific options are prefaced with
-W. lmawk provides the following:
58
59 -W version lmawk writes its version and copyright to stdout and
60 compiled limits to stderr and exits 0.
61
62 -W debug include location info in the compiled code; location
63 information is visible in the dump and when debugging
64 libmawk.
65
66 -W dump writes an assembler like listing of the internal repre‐
67 sentation of the program to stdout and exits 0 (on suc‐
68 cessful compilation).
69
70 -W dumpsym writes a list of global symbols to stdout and exits 0
71 (on successful compilation).
72
73 -W compile writes a binary dump of the bytecode to stdout. This
74 bytecode can be loaded using the -b switch.
75
76 -W interactive sets unbuffered writes to stdout and line buffered reads
77 from stdin. Records from stdin are lines regardless of
78 the value of RS.
79
-W maxmem=num limits dynamic memory allocation during compilation and
execution to num bytes and exits with an out-of-memory error if more
memory is to be allocated. Optional suffixes are k for kilobyte and
m for megabyte. 0 means unlimited, which is also the default.
85
86 -W exec file Program text is read from file and this is the last
87 option. Useful on systems that support the #! "magic
88 number" convention for executable scripts.
89
90 -W sprintf=num adjusts the size of lmawk's internal sprintf buffer to
91 num bytes. More than rare use of this option indicates
92 lmawk should be recompiled.
93
94 -W posix_space forces lmawk not to consider '\n' to be space.
95
96 The short forms -W[vdiesp] are recognized and on some systems -We is
97 mandatory to avoid command line length limitations.
98
THE AWK LANGUAGE
100 1. Program structure
101 An AWK program is a sequence of pattern {action} pairs and user func‐
102 tion definitions.
103
104 A pattern can be:
105 BEGIN
106 END
107 expression
108 expression , expression
109
110 One, but not both, of pattern {action} can be omitted. If {action} is
111 omitted it is implicitly { print }. If pattern is omitted, then it is
112 implicitly matched. BEGIN and END patterns require an action.
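
The two omission rules above can be tried from a shell. This is a minimal sketch, not part of the original page: `awk` stands in for any POSIX awk on the PATH (the assumption is that lmawk behaves the same for these portable constructs), and the three-line input is made up for illustration.

```shell
# Hypothetical three-line input; `awk` stands in for lmawk here.
input='apple
banana
cherry'

# Pattern only: the action defaults to { print }.
pattern_only=$(printf '%s\n' "$input" | awk '/an/')

# Action only: the pattern is implicitly matched for every record.
action_only=$(printf '%s\n' "$input" | awk '{ n++ } END { print n }')

echo "$pattern_only"   # banana
echo "$action_only"    # 3
```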
113
Statements are terminated by newlines, semi-colons or both. Groups of
statements such as actions or loop bodies are blocked via { ... } as in
C. The last statement in a block doesn't need a terminator. Blank
lines have no meaning; an empty statement is terminated with a
semi-colon. Long statements can be continued with a backslash, \. A
statement can be broken without a backslash after a comma, left brace, &&,
||, do, else, the right parenthesis of an if, while or for statement,
and the right parenthesis of a function definition. A comment starts
with # and extends to, but does not include, the end of the line.
123
124 The following statements control program flow inside blocks.
125
126 if ( expr ) statement
127
128 if ( expr ) statement else statement
129
130 while ( expr ) statement
131
132 do statement while ( expr )
133
134 for ( opt_expr ; opt_expr ; opt_expr ) statement
135
136 for ( var in array ) statement
137
138 continue
139
140 break
141
142 2. Data types, conversion and comparison
143 There are two basic data types, numeric and string. Numeric constants
144 can be integer like -2, decimal like 1.08, or in scientific notation
145 like -1.1e4 or .28E-3. All numbers are represented internally and all
146 computations are done in floating point arithmetic. So for example,
147 the expression 0.2e2 == 20 is true and true is represented as 1.0.
148
149 String constants are enclosed in double quotes.
150
151 "This is a string with a newline at the end.\n"
152
153 Strings can be continued across a line by escaping (\) the newline.
154 The following escape sequences are recognized.
155
156 \\ \
157 \" "
158 \a alert, ascii 7
159 \b backspace, ascii 8
160 \t tab, ascii 9
161 \n newline, ascii 10
162 \v vertical tab, ascii 11
163 \f formfeed, ascii 12
164 \r carriage return, ascii 13
165 \ddd 1, 2 or 3 octal digits for ascii ddd
166 \xhh 1 or 2 hex digits for ascii hh
167
168 If you escape any other character \c, you get \c, i.e., lmawk ignores
169 the escape.
170
171 There are really three basic data types; the third is number and string
172 which has both a numeric value and a string value at the same time.
173 User defined variables come into existence when first referenced and
174 are initialized to null, a number and string value which has numeric
175 value 0 and string value "". Non-trivial number and string typed data
176 come from input and are typically stored in fields. (See section 4).
177
The type of an expression is determined by its context and automatic
type conversion occurs if needed. For example, consider the statements
181
182 y = x + 2 ; z = x "hello"
183
The value stored in variable y will be typed numeric. If x is not
numeric, the value read from x is converted to numeric before it is
added to 2 and stored in y. The value stored in variable z will be
typed string, and the value of x will be converted to string if
necessary and concatenated with "hello". (Of course, the value and type
stored in x are not changed by any conversions.) A string expression is
converted to numeric using its longest numeric prefix as with atof(3).
A numeric expression is converted to string by replacing expr with
sprintf(CONVFMT, expr), unless expr can be represented on the host
machine as an exact integer, in which case it is converted with
sprintf("%d", expr). Sprintf() is an AWK built-in that duplicates the
functionality of sprintf(3), and CONVFMT is a built-in variable used for
internal conversion from number to string and initialized to "%.6g".
Explicit type conversions can be forced: expr "" is string and expr+0 is
numeric.
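
The conversion rules can be observed directly. The sketch below is illustrative only: `awk` stands in for any POSIX awk (assumed to match lmawk here), and the values are arbitrary.

```shell
out=$(awk 'BEGIN {
    x = "3.5kg"            # string with a numeric prefix
    print x + 0            # forced numeric, longest numeric prefix: 3.5
    n = 17                 # exact integer, converted with "%d"
    print n ""             # forced string: 17
    CONVFMT = "%.2f"
    f = 3.14159            # non-integer, converted with CONVFMT
    print (f "")           # 3.14
}')
echo "$out"
```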
199
To evaluate expr1 rel-op expr2: if both operands are numeric or number
and string, the comparison is numeric; if both operands are string,
the comparison is string; if one operand is string, the non-string
operand is converted and the comparison is string. The result is
numeric, 1 or 0.
205
206 In boolean contexts such as, if ( expr ) statement, a string expression
207 evaluates true if and only if it is not the empty string ""; numeric
208 values if and only if not numerically zero.
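
A short sketch of the boolean rule follows. It assumes, as other POSIX awks do, that a field with numeric form (number and string) is tested by its numeric value; the text above does not spell that case out. `awk` stands in for lmawk.

```shell
out=$(echo "0" | awk '{
    if ($1)  a = "true"; else a = "false"    # field "0" has numeric form, tested as 0
    if ("0") b = "true"; else b = "false"    # string constant "0" is not empty
    if ("")  c = "true"; else c = "false"    # empty string
    if (0.0) d = "true"; else d = "false"    # numeric zero
    print a, b, c, d
}')
echo "$out"   # false true false false
```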
209
210 3. Regular expressions
211 In the AWK language, records, fields and strings are often tested for
212 matching a regular expression. Regular expressions are enclosed in
213 slashes, and
214
215 expr ~ /r/
216
217 is an AWK expression that evaluates to 1 if expr "matches" r, which
218 means a substring of expr is in the set of strings defined by r. With
219 no match the expression evaluates to 0; replacing ~ with the "not
220 match" operator, !~ , reverses the meaning. As pattern-action pairs,
221
222 /r/ { action } and $0 ~ /r/ { action }
223
224 are the same, and for each input record that matches r, action is exe‐
225 cuted. In fact, /r/ is an AWK expression that is equivalent to ($0 ~
226 /r/) anywhere except when on the right side of a match operator or
227 passed as an argument to a built-in function that expects a regular
228 expression argument.
229
230 AWK uses extended regular expressions as with egrep(1). The regular
231 expression metacharacters, i.e., those with special meaning in regular
232 expressions are
233
234 ^ $ . [ ] | ( ) * + ?
235
236 Regular expressions are built up from characters as follows:
237
238 c matches any non-metacharacter c.
239
240 \c matches a character defined by the same escape
241 sequences used in string constants or the literal
242 character c if \c is not an escape sequence.
243
244 . matches any character (including newline).
245
246 ^ matches the front of a string.
247
248 $ matches the back of a string.
249
250 [c1c2c3...] matches any character in the class c1c2c3... . An
251 interval of characters is denoted c1-c2 inside a
252 class [...].
253
254 [^c1c2c3...] matches any character not in the class c1c2c3...
255
256 Regular expressions are built up from other regular expressions as fol‐
257 lows:
258
259 r1r2 matches r1 followed immediately by r2 (concatena‐
260 tion).
261
262 r1 | r2 matches r1 or r2 (alternation).
263
264 r* matches r repeated zero or more times.
265
266 r+ matches r repeated one or more times.
267
268 r? matches r zero or once.
269
270 (r) matches r, providing grouping.
271
272 The increasing precedence of operators is alternation, concatenation
273 and unary (*, + or ?).
274
275 For example,
276
277 /^[_a-zA-Z][_a-zA-Z0-9]*$/ and
278 /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/
279
280 are matched by AWK identifiers and AWK numeric constants respectively.
281 Note that . has to be escaped to be recognized as a decimal point, and
282 that metacharacters are not special inside character classes.
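
To see the two expressions at work, the sketch below classifies a few made-up input lines. `awk` stands in for any POSIX awk, on the assumption that lmawk treats these patterns the same way.

```shell
out=$(printf '%s\n' 1.08 -1.1e4 .28E-3 1e abc | awk '
    /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/ { print $0, "number"; next }
    /^[_a-zA-Z][_a-zA-Z0-9]*$/                          { print $0, "identifier"; next }
                                                         { print $0, "neither" }
')
echo "$out"
```

The input "1e" is rejected by both patterns: the exponent part requires at least one digit, and an identifier cannot start with a digit.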
283
284 Any expression can be used on the right hand side of the ~ or !~ opera‐
285 tors or passed to a built-in that expects a regular expression. If
286 needed, it is converted to string, and then interpreted as a regular
287 expression. For example,
288
289 BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }
290
291 $0 ~ "^" identifier
292
293 prints all lines that start with an AWK identifier.
294
295 lmawk recognizes the empty regular expression, //, which matches the
296 empty string and hence is matched by any string at the front, back and
297 between every character. For example,
298
echo abc | lmawk '{ gsub(//, "X") ; print }'
XaXbXcX
301
302
303 4. Records and fields
304 Records are read in one at a time, and stored in the field variable $0.
305 The record is split into fields which are stored in $1, $2, ..., $NF.
306 The built-in variable NF is set to the number of fields, and NR and FNR
307 are incremented by 1. Fields above $NF are set to "".
308
Assignment to $0 causes the fields and NF to be recomputed. Assignment
to NF or to a field causes $0 to be reconstructed by concatenating the
$i's separated by OFS. Assignment to a field with an index greater than
NF increases NF and causes $0 to be reconstructed.
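
The reconstruction rules are easiest to see with a visible OFS. A minimal sketch, with `awk` standing in for lmawk and an arbitrary one-line input:

```shell
out=$(echo "a b c" | awk 'BEGIN { OFS = "-" } {
    $2 = "X"; print             # $0 rebuilt with OFS: a-X-c
    $5 = "e"; print NF; print   # NF grows to 5, $4 is empty: a-X-c--e
    $0 = "p q"; print NF, $1    # assigning $0 re-splits the record: 2-p
}')
echo "$out"
```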
313
Data input stored in fields is string, unless the entire field has
numeric form, in which case the type is number and string. For example,
316
317 echo 24 24E |
318 lmawk '{ print($1>100, $1>"100", $2>100, $2>"100") }'
319 0 1 1 1
320
321 $0 and $2 are string and $1 is number and string. The first comparison
322 is numeric, the second is string, the third is string (100 is converted
323 to "100"), and the last is string.
324
325 5. Expressions and operators
326 The expression syntax is similar to C. Primary expressions are numeric
327 constants, string constants, variables, fields, arrays and function
328 calls. The identifier for a variable, array or function can be a
329 sequence of letters, digits and underscores, that does not start with a
330 digit. Variables are not declared; they exist when first referenced
331 and are initialized to null.
332
333 New expressions are composed with the following operators in order of
334 increasing precedence.
335
336 assignment = += -= *= /= %= ^=
337 conditional ? :
338 logical or ||
339 logical and &&
340 array membership in
341 matching ~ !~
342 relational < > <= >= == !=
343 concatenation (no explicit operator)
344 add ops + -
345 mul ops * / %
346 unary + -
347 logical not !
348 exponentiation ^
349 inc and dec ++ -- (both post and pre)
350 field $
351
352 Assignment, conditional and exponentiation associate right to left; the
353 other operators associate left to right. Any expression can be paren‐
354 thesized.
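
A few expressions whose results depend on the precedence and associativity listed above. This is a sketch using `awk` as a stand-in for lmawk; the numbers are arbitrary.

```shell
out=$(awk 'BEGIN {
    print 2 ^ 3 ^ 2        # right associative: 2 ^ (3 ^ 2) = 512
    print -2 ^ 2           # ^ binds tighter than unary minus: -4
    print 1 " " 2 + 3      # + binds tighter than concatenation: 1 5
    print 10 - 4 - 3       # left associative: (10 - 4) - 3 = 3
}')
echo "$out"
```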
355
356 6. Arrays
357 Awk provides one-dimensional arrays. Array elements are expressed as
358 array[expr]. Expr is internally converted to string type, so, for
359 example, A[1] and A["1"] are the same element and the actual index is
360 "1". Arrays indexed by strings are called associative arrays. Ini‐
361 tially an array is empty; elements exist when first accessed. An
362 expression, expr in array evaluates to 1 if array[expr] exists, else to
363 0.
364
365 There is a form of the for statement that loops over each index of an
366 array.
367
368 for ( var in array ) statement
369
sets var to each index of array and executes statement. The order in
which var traverses the indices of array is not defined.
372
373 The statement, delete array[expr], causes array[expr] not to exist.
374 lmawk supports an extension, delete array, which deletes all elements
375 of array.
376
377 Multidimensional arrays are synthesized with concatenation using the
378 built-in variable SUBSEP. array[expr1,expr2] is equivalent to
379 array[expr1 SUBSEP expr2]. Testing for a multidimensional element uses
380 a parenthesized index, such as
381
382 if ( (i, j) in A ) print A[i, j]
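
The array rules in this section can be exercised together. A minimal sketch with made-up keys; `awk` stands in for lmawk, and only the portable delete array[expr] form is used.

```shell
out=$(awk 'BEGIN {
    A[1] = "one"
    same = (A["1"] == "one")        # A[1] and A["1"] are the same element
    has2 = (2 in A)                 # the membership test does not create A[2]
    A["x", "y"] = "pair"            # stored under "x" SUBSEP "y"
    multi = (("x", "y") in A)
    n = 0; for (k in A) n++         # two elements, visited in undefined order
    delete A[1]
    gone = (1 in A)
    print same, has2, multi, n, gone
}')
echo "$out"   # 1 0 1 2 0
```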
383
384
7. Built-in variables
386 The following variables are built-in and initialized before program
387 execution.
388
389 ARGC number of command line arguments.
390
391 ARGV array of command line arguments, 0..ARGC-1.
392
393 CONVFMT format for internal conversion of numbers to string,
394 initially = "%.6g".
395
396 ENVIRON array indexed by environment variables. An environ‐
397 ment string, var=value is stored as ENVIRON[var] =
398 value.
399
400 FILENAME name of the current input file.
401
402 FNR current record number in FILENAME.
403
404 FS splits records into fields as a regular expression.
405
406 NF number of fields in the current record.
407
408 NR current record number in the total input stream.
409
410 OFMT format for printing numbers; initially = "%.6g".
411
412 OFS inserted between fields on output, initially = " ".
413
414 ORS terminates each record on output, initially = "\n".
415
416 RLENGTH length set by the last call to the built-in function,
417 match().
418
419 RS input record separator, initially = "\n".
420
421 RSTART index set by the last call to match().
422
423 SUBSEP used to build multiple array subscripts, initially =
424 "\034".
425
ERRNO misc built-in functions (libmawk extensions) use this
variable to report errors. Every extension call sets this variable
before returning, so ERRNO holds the result of the last call. An
empty string value means no error. Error messages are formatted so
that the first word is a unique integer, followed by a human readable
error message starting at the second word. int(ERRNO) can be used to
obtain the error code, which can then be used as a secondary output
from the extension function. For example, an awk program can use
valueof() to determine whether a global symbol exists and is a
function, a variable or anything else.
439
440 LIBPATH is a semicolon separated list of search paths. When
441 loading an awk script by file name (-f command line
442 argument or include from another awk script) these
443 paths are inserted before the file name, in order, one
444 by one, until the first path that allows opening the
445 file. An empty path is equivalent to the current work‐
446 ing directory. LIBPATH can be modified from the com‐
447 mand line using -v, as arguments are scanned before
448 loading the scripts. Setting LIBPATH to empty string
449 results in the original behaviour of mawk. LIBPATH is
450 ignored for script file names starting with slash
451 ('/') as those are assumed to be absolute paths.
452
453 8. Built-in functions
454 String functions
455
456 gsub(r,s,t) gsub(r,s)
457 Global substitution, every match of regular expression r
458 in variable t is replaced by string s. The number of
459 replacements is returned. If t is omitted, $0 is used.
460 An & in the replacement string s is replaced by the
461 matched substring of t. \& and \\ put literal & and \,
462 respectively, in the replacement string.
463
464 index(s,t)
465 If t is a substring of s, then the position where t
466 starts is returned, else 0 is returned. The first char‐
467 acter of s is in position 1.
468
469 length(s)
470 Returns the length of string s.
471
472 match(s,r)
473 Returns the index of the first longest match of regular
474 expression r in string s. Returns 0 if no match. As a
475 side effect, RSTART is set to the return value. RLENGTH
476 is set to the length of the match or -1 if no match. If
477 the empty string is matched, RLENGTH is set to 0, and 1
478 is returned if the match is at the front, and length(s)+1
479 is returned if the match is at the back.
480
481 split(s,A,r) split(s,A)
482 String s is split into fields by regular expression r and
483 the fields are loaded into array A. The number of fields
484 is returned. See section 11 below for more detail. If r
485 is omitted, FS is used.
486
487 sprintf(format,expr-list)
488 Returns a string constructed from expr-list according to
489 format. See the description of printf() below.
490
491 sub(r,s,t) sub(r,s)
492 Single substitution, same as gsub() except at most one
493 substitution.
494
495 substr(s,i,n) substr(s,i)
496 Returns the substring of string s, starting at index i,
497 of length n. If n is omitted, the suffix of s, starting
498 at i is returned.
499
500 tolower(s)
501 Returns a copy of s with all upper case characters con‐
502 verted to lower case.
503
504 toupper(s)
505 Returns a copy of s with all lower case characters con‐
506 verted to upper case.
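
The following sketch runs several of the string functions on one sample string. `awk` stands in for lmawk and the string is arbitrary.

```shell
out=$(awk 'BEGIN {
    s = "hello world"
    print index(s, "wor")                 # 7
    m = match(s, /o+/)
    print m, RSTART, RLENGTH              # 5 5 1
    print substr(s, 7), substr(s, 1, 5)   # world hello
    print toupper(s)                      # HELLO WORLD
    n = gsub(/o/, "[&]", s)               # & stands for the matched text
    print n, s                            # 2 hell[o] w[o]rld
    print length(s)                       # 15
}')
echo "$out"
```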
507
508 Arithmetic functions
509
510 atan2(y,x) Arctan of y/x between -PI and PI.
511
512 cos(x) Cosine function, x in radians.
513
514 exp(x) Exponential function.
515
516 int(x) Returns x truncated towards zero.
517
518 log(x) Natural logarithm.
519
520 rand() Returns a random number between zero and one.
521
522 sin(x) Sine function, x in radians.
523
524 sqrt(x) Returns square root of x.
525
526 srand(expr) srand()
527 Seeds the random number generator, using the clock if
528 expr is omitted, and returns the value of the previous
529 seed. lmawk seeds the random number generator from the
530 clock at startup so there is no real need to call
531 srand(). Srand(expr) is useful for repeating pseudo ran‐
532 dom sequences.
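
A brief sketch of the arithmetic functions, including the use of srand(expr) to repeat a pseudo random sequence. `awk` stands in for lmawk; the seed 42 is arbitrary.

```shell
out=$(awk 'BEGIN {
    print int(3.9), int(-3.9)        # truncation towards zero: 3 -3
    print atan2(0, -1)               # PI, printed with OFMT: 3.14159
    print exp(0), log(1), sqrt(16)   # 1 0 4
    srand(42); a = rand()
    srand(42); b = rand()            # same seed, same sequence
    same = (a == b)
    inrange = (a >= 0 && a < 1)
    print same, inrange              # 1 1
}')
echo "$out"
```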
533
534 Misc functions (libmawk extensions)
535
call(fname,arg1,arg2,...)
Call awk function fname with the supplied arguments. If the call
fails, an empty value is returned; otherwise the return value of the
callee is returned. Built-in variable ERRNO is always set.

acall(fname,arrname)
Call awk function fname with arguments supplied in the array named
arrname (both arguments are strings naming an existing object). The
array should be indexed from 1. The number of arguments is determined
by looking for the first empty (non-existing) index in the array. If
the call fails, an empty value is returned; otherwise the return value
of the callee is returned. Built-in variable ERRNO is always set.
551
valueof(vname [,idx])
Return the value of variable vname; if the variable is an array,
return the element indexed by idx (which must be present in this
case). If idx is not present or is empty (""), the variable is
expected to be scalar. Built-in variable ERRNO is always set. NOTE:
valueof() has access to the global symbol table only. It will fail to
resolve anything other than global objects; most notably it will fail
on local variables, $ arguments and on most of the built-in variables.
562
563 9. Input and output
564 There are two output statements, print and printf.
565
566 print writes $0 ORS to standard output.
567
568 print expr1, expr2, ..., exprn
569 writes expr1 OFS expr2 OFS ... exprn ORS to standard out‐
570 put. Numeric expressions are converted to string with
571 OFMT.
572
573 printf format, expr-list
574 duplicates the printf C library function writing to stan‐
575 dard output. The complete ANSI C format specifications
576 are recognized with conversions %c, %d, %e, %E, %f, %g,
577 %G, %i, %o, %s, %u, %x, %X and %%, and conversion quali‐
578 fiers h and l.
579
580 The argument list to print or printf can optionally be enclosed in
581 parentheses. Print formats numbers using OFMT or "%d" for exact inte‐
582 gers. "%c" with a numeric argument prints the corresponding 8 bit
583 character, with a string argument it prints the first character of the
584 string. The output of print and printf can be redirected to a file or
585 command by appending > file, >> file or | command to the end of the
586 print statement. Redirection opens file or command only once, subse‐
587 quent redirections append to the already open stream. By convention,
588 lmawk associates the filename "/dev/stderr" with stderr which allows
589 print and printf to be redirected to stderr. lmawk also associates "-"
590 and "/dev/stdout" with stdin and stdout which allows these streams to
591 be passed to functions. Opening /dev/fd/N will do an fdopen() on file
592 descriptor N, where N is an integer - this is a libmawk extension. If
593 any of the /dev heuristics needs to be bypassed (i.e. the script wants
594 to open the real /dev/stdout or the real /dev/fd/5), the leading slash
595 should be doubled (e.g. //dev/fd/5).
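
Redirection to a command and to stderr might look like the sketch below. `awk` stands in for lmawk, and `sort` is assumed to be available on the PATH.

```shell
out=$(awk 'BEGIN {
    print "b" | "sort"
    print "a" | "sort"
    close("sort")                       # wait for sort to finish and emit its output
    print "warning" > "/dev/stderr"     # goes to stderr, not into $out
    printf "%s=%d\n", "count", 2
}' 2>/dev/null)
echo "$out"
```

Both print statements write to the same already open pipe; sort produces its output only when the pipe is closed.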
596
597 The input function getline has the following variations.
598
599 getline
600 reads into $0, updates the fields, NF, NR and FNR.
601
602 getline < file
603 reads into $0 from file, updates the fields and NF.
604
605 getline var
606 reads the next record into var, updates NR and FNR.
607
608 getline var < file
609 reads the next record of file into var.
610
611 command | getline
612 pipes a record from command into $0 and updates the
613 fields and NF.
614
615 command | getline var
616 pipes a record from command into var.
617
618 Getline returns 0 on end-of-file, -1 on error, otherwise 1.
619
620 Commands on the end of pipes are executed by /bin/sh.
621
622 The function close(expr) closes the file or pipe associated with expr.
623 Close returns 0 if expr is an open file, the exit status if expr is a
624 piped command, and -1 otherwise. Close is used to reread a file or
625 command, make sure the other end of an output pipe is finished or con‐
626 serve file resources.
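
The getline return values and the effect of close can be sketched as follows. `awk` stands in for lmawk and the command string is made up.

```shell
out=$(awk 'BEGIN {
    cmd = "echo 1; echo 2; echo 3"
    while ((cmd | getline line) > 0)    # 1 per record, 0 at end-of-file, -1 on error
        sum += line
    close(cmd)                          # allows the command to be run again
    cmd | getline first                 # reruns cmd from the start
    close(cmd)
    print sum, first
}')
echo "$out"   # 6 1
```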
627
628 The function fflush(expr) flushes the output file or pipe associated
629 with expr. Fflush returns 0 if expr is an open output stream else -1.
630 Fflush without an argument flushes stdout. Fflush with an empty argu‐
631 ment ("") flushes all open output.
632
633 The function system(expr) uses /bin/sh to execute expr and returns the
634 exit status of the command expr. Changes made to the ENVIRON array are
635 not passed to commands executed with system or pipes.
636
637 10. User defined functions
638 The syntax for a user defined function is
639
640 function name( args ) { statements }
641
642 The function body can contain a return statement
643
644 return opt_expr
645
646 A return statement is not required. Function calls may be nested or
647 recursive. Functions are passed expressions by value and arrays by
648 reference. Extra arguments serve as local variables and are initial‐
649 ized to null. For example, csplit(s,A) puts each character of s into
650 array A and returns the length of s.
651
652 function csplit(s, A, n, i)
653 {
654 n = length(s)
655 for( i = 1 ; i <= n ; i++ ) A[i] = substr(s, i, 1)
656 return n
657 }
658
659 Putting extra space between passed arguments and local variables is
660 conventional. Functions can be referenced before they are defined, but
661 the function name and the '(' of the arguments must touch to avoid con‐
662 fusion with concatenation.
663
664 11. Splitting strings, records and files
665 Awk programs use the same algorithm to split strings into arrays with
666 split(), and records into fields on FS. lmawk uses essentially the
667 same algorithm to split files into records on RS.
668
669 Split(expr,A,sep) works as follows:
670
671 (1) If sep is omitted, it is replaced by FS. Sep can be an
672 expression or regular expression. If it is an expression
673 of non-string type, it is converted to string.
674
675 (2) If sep = " " (a single space), then <SPACE> is trimmed
676 from the front and back of expr, and sep becomes <SPACE>.
677 lmawk defines <SPACE> as the regular expression
678 /[ \t\n]+/. Otherwise sep is treated as a regular
679 expression, except that meta-characters are ignored for a
680 string of length 1, e.g., split(x, A, "*") and split(x,
681 A, /\*/) are the same.
682
683 (3) If expr is not string, it is converted to string. If
684 expr is then the empty string "", split() returns 0 and A
685 is set empty. Otherwise, all non-overlapping, non-null
686 and longest matches of sep in expr, separate expr into
687 fields which are loaded into A. The fields are placed in
688 A[1], A[2], ..., A[n] and split() returns n, the number
689 of fields which is the number of matches plus one. Data
690 placed in A that looks numeric is typed number and
691 string.
692
693 Splitting records into fields works the same except the pieces are
694 loaded into $1, $2,..., $NF. If $0 is empty, NF is set to 0 and all $i
695 to "".
696
697 lmawk splits files into records by the same algorithm, but with the
698 slight difference that RS is really a terminator instead of a separa‐
699 tor. (ORS is really a terminator too).
700
701 E.g., if FS = ":+" and $0 = "a::b:" , then NF = 3 and $1 = "a",
702 $2 = "b" and $3 = "", but if "a::b:" is the contents of an input
703 file and RS = ":+", then there are two records "a" and "b".
704
705 RS = " " is not special.
706
707 If FS = "", then lmawk breaks the record into individual characters,
708 and, similarly, split(s,A,"") places the individual characters of s
709 into A.
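
The split() rules can be checked with the same "a::b:" string. This sketch uses `awk` as a stand-in for lmawk and leaves out the RS = ":+" and FS = "" cases, since not every awk accepts them.

```shell
out=$(awk 'BEGIN {
    n = split("a::b:", A, /:+/)          # separators "::" and ":" give 3 fields
    print n, "[" A[1] "]", "[" A[2] "]", "[" A[3] "]"
    m = split("  x   y ", B, " ")        # single space: trim, then split on <SPACE>
    print m, "[" B[1] "]", "[" B[2] "]"
    k = split("", C)                     # empty string: 0 fields
    print k
}')
echo "$out"
```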
710
711 12. Multi-line records
712 Since lmawk interprets RS as a regular expression, multi-line records
713 are easy. Setting RS = "\n\n+", makes one or more blank lines separate
714 records. If FS = " " (the default), then single newlines, by the rules
715 for <SPACE> above, become space and single newlines are field separa‐
716 tors.
717
718 For example, if a file is "a b\nc\n\n", RS = "\n\n+" and FS =
719 " ", then there is one record "a b\nc" with three fields "a",
720 "b" and "c". Changing FS = "\n", gives two fields "a b" and
721 "c"; changing FS = "", gives one field identical to the record.
722
723 If you want lines with spaces or tabs to be considered blank, set RS =
724 "\n([ \t]*\n)+". For compatibility with other awks, setting RS = ""
725 has the same effect as if blank lines are stripped from the front and
726 back of files and then records are determined as if RS = "\n\n+".
727 Posix requires that "\n" always separates records when RS = "" regard‐
728 less of the value of FS. lmawk does not support this convention,
729 because defining "\n" as <SPACE> makes it unnecessary.
730
731 Most of the time when you change RS for multi-line records, you will
732 also want to change ORS to "\n\n" so the record spacing is preserved on
733 output.
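
A minimal sketch of multi-line records follows. It uses the RS = "" form for portability, with `awk` standing in for lmawk; under lmawk, RS = "\n\n+" is described above as equivalent apart from the stripping of leading and trailing blank lines. The input is made up.

```shell
out=$(printf 'a b\nc\n\n\nd e\n' | awk 'BEGIN { RS = "" }
    { print NR ": " NF " fields, first=" $1 }')
echo "$out"
```

With the default FS, the newline inside the first record acts as a field separator, so that record has three fields.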
734
735 13. Program execution
This section describes the order of program execution. First ARGC is
set to the total number of command line arguments passed to the
execution phase of the program. ARGV[0] is set to the name of the AWK
interpreter and ARGV[1] ... ARGV[ARGC-1] hold the remaining command line
arguments exclusive of options and program source. For example, with
741
742 lmawk -f prog v=1 A t=hello B
743
744 ARGC = 5 with ARGV[0] = "lmawk", ARGV[1] = "v=1", ARGV[2] = "A",
745 ARGV[3] = "t=hello" and ARGV[4] = "B".
746
747 Next, each BEGIN block is executed in order. If the program consists
748 entirely of BEGIN blocks, then execution terminates, else an input
749 stream is opened and execution continues. If ARGC equals 1, the input
750 stream is set to stdin, else the command line arguments ARGV[1] ...
751 ARGV[ARGC-1] are examined for a file argument.
752
753 The command line arguments divide into three sets: file arguments,
754 assignment arguments and empty strings "". An assignment has the form
755 var=string. When an ARGV[i] is examined as a possible file argument,
756 if it is empty it is skipped; if it is an assignment argument, the
757 assignment to var takes place and i skips to the next argument; else
758 ARGV[i] is opened for input. If it fails to open, execution terminates
759 with exit code 2. If no command line argument is a file argument, then
760 input comes from stdin. Getline in a BEGIN action opens input. "-" as
761 a file argument denotes stdin.
762
Once an input stream is open, each input record is tested against each
pattern, and if it matches, the associated action is executed. An
expression pattern matches if it is boolean true (see the end of
section 2). A BEGIN pattern matches before any input has been read, and
an END pattern matches after all input has been read. A range pattern,
expr1,expr2, matches every record between the match of expr1 and the
match of expr2 inclusively.
770
771 When end of file occurs on the input stream, the remaining command line
772 arguments are examined for a file argument, and if there is one it is
773 opened, else the END pattern is considered matched and all END actions
774 are executed.
775
776 In the example, the assignment v=1 takes place after the BEGIN actions
777 are executed, and the data placed in v is typed number and string.
778 Input is then read from file A. On end of file A, t is set to the
779 string "hello", and B is opened for input. On end of file B, the END
780 actions are executed.
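
The timing of assignment arguments can be observed with a one-line input read from stdin ("-"). This is a sketch with `awk` standing in for lmawk; the variable name v follows the example above.

```shell
out=$(printf 'x\n' | awk '
    BEGIN { print "BEGIN v=[" v "]" }    # v=1 has not been assigned yet
          { print "main v=[" v "]" }     # assigned before stdin is opened
    END   { print "END v=[" v "]" }
' v=1 -)
echo "$out"
```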
781
782 Program flow at the pattern {action} level can be changed with the
783
784 next
785 exit opt_expr
786
787 statements. A next statement causes the next input record to be read
788 and pattern testing to restart with the first pattern {action} pair in
789 the program. An exit statement causes immediate execution of the END
790 actions or program termination if there are none or if the exit occurs
791 in an END action. The opt_expr sets the exit value of the program
792 unless overridden by a later exit or subsequent error.
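
As an illustration of next and exit, here is a small sketch. It assumes a POSIX awk installed as awk (lmawk can be substituted); the input and the shell variables flow_out and flow_status are made up for the example.

```shell
flow_status=0
flow_out=$(printf '1\n2\n3\n4\n' | awk '
    $1 == 1 { next }      # skip record 1; later patterns never see it
    $1 == 3 { exit 7 }    # stop reading input; END actions still run
            { print "rec", $1 }
    END     { print "end" }') || flow_status=$?
printf '%s\nexit status: %s\n' "$flow_out" "$flow_status"
```

Only record 2 reaches the print action. The exit on record 3 runs the END action and leaves 7 as the exit value, and record 4 is never read.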
793
794
795 14. include
796 libmawk introduces a source inclusion feature. The syntax is:
797
798 include "filename"
799
800 Include statements must be at top level (outside of blocks). If the file
801 name starts with a plus sign ('+'), the script file is not loaded if it
802 has already been loaded (by another include or -f command line argument).
803
804
805
807 1. emulate cat.
808
809 { print }
810
811 2. emulate wc.
812
813 { chars += length($0) + 1 # add one for the \n
814 words += NF
815 }
816
817 END{ print NR, words, chars }
818
819 3. count the number of unique "real words".
820
821 BEGIN { FS = "[^A-Za-z]+" }
822
823 { for(i = 1 ; i <= NF ; i++) word[$i] = "" }
824
825 END { delete word[""]
826 for ( i in word ) cnt++
827 print cnt
828 }
829
830 4. sum the second field of every record based on the first field.
831
832 $1 ~ /credit|gain/ { sum += $2 }
833 $1 ~ /debit|loss/ { sum -= $2 }
834
835 END { print sum }
836
837 5. sort a file, comparing as string
838
839 { line[NR] = $0 "" } # make sure of comparison type
840 # in case some lines look numeric
841
842 END { isort(line, NR)
843 for(i = 1 ; i <= NR ; i++) print line[i]
844 }
845
846 #insertion sort of A[1..n]
847 function isort( A, n, i, j, hold)
848 {
849 for( i = 2 ; i <= n ; i++)
850 {
851 hold = A[j = i]
852 while ( A[j-1] > hold )
853 { j-- ; A[j+1] = A[j] }
854 A[j] = hold
855 }
856 # sentinel A[0] = "" will be created if needed
857 }
858
859
861 The Posix 1003.2(draft 11.3) definition of the AWK language is AWK as
862 described in the AWK book with a few extensions that appeared in Sys‐
863 temVR4 nawk. The extensions are:
864
865 New functions: toupper() and tolower(); libmawk extensions:
866 call(), acall(), valueof().
867
868 New variables: ENVIRON[] and CONVFMT; libmawk extensions: ERRNO and
869 LIBPATH. As a libmawk extension, ENVIRON affects the environment
870 of child processes.
871
872 As a libmawk extension, new built-in variable LIBPATH is used as
873 a list of search paths while loading scripts from the command
874 line or from include.
875
876 If a script name starts with a plus ('+'), the file is not loaded
877 if it has been loaded earlier (to avoid double loading of libraries
878 through -f and/or include). This is a libmawk extension.
879
880 It is possible to include a script from another script using
881 keyword include "scriptname.awk" (libmawk extension).
882
883 ANSI C conversion specifications for printf() and sprintf().
884
885 New command options: -v var=value, multiple -f options and
886 implementation options as arguments to -W.
887
888
889 Posix AWK is oriented to operate on files a line at a time. RS can be
890 changed from "\n" to another single character, but it is hard to find
891 any use for this — there are no examples in the AWK book. By conven‐
892 tion, RS = "", makes one or more blank lines separate records, allowing
893 multi-line records. When RS = "", "\n" is always a field separator
894 regardless of the value in FS.
895
896 lmawk, on the other hand, allows RS to be a regular expression. When
897 "\n" appears in records, it is treated as space, and FS always deter‐
898 mines fields.
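
The RS = "" convention can be sketched as follows. The example keeps the default FS, under which a newline counts as field-separating white space under either rule above. It assumes a POSIX awk installed as awk (lmawk can be substituted); the input and the shell variable para_out are illustrative only.

```shell
# Two records separated by blank lines; the first spans two lines.
para_out=$(printf 'a b\nc\n\n\nd\n' |
    awk 'BEGIN { RS = "" } { print NR ": " NF " fields" }')
printf '%s\n' "$para_out"
```

The first record holds the three fields a, b and c; the second holds only d.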
899
900 Removing the line at a time paradigm can make some programs simpler and
901 can often improve performance. For example, redoing example 3 from
902 above,
903
904 BEGIN { RS = "[^A-Za-z]+" }
905
906 { word[ $0 ] = "" }
907
908 END { delete word[ "" ]
909 for( i in word ) cnt++
910 print cnt
911 }
912
913 counts the number of unique words by making each word a record. On
914 moderate size files, lmawk executes twice as fast, because of the sim‐
915 plified inner loop.
916
917 The following program replaces each comment by a single space in a C
918 program file,
919
920 BEGIN {
921 RS = "/\*([^*]|\*+[^/*])*\*+/"
922 # comment is record separator
923 ORS = " "
924 getline hold
925 }
926
927 { print hold ; hold = $0 }
928
929 END { printf "%s" , hold }
930
931 Buffering one record is needed to avoid terminating the last record
932 with a space.
933
934 With lmawk, the following are all equivalent,
935
936 x ~ /a\+b/ x ~ "a\+b" x ~ "a\\+b"
937
938 The strings get scanned twice, once as string and once as regular
939 expression. On the string scan, lmawk ignores the escape on non-escape
940 characters while the AWK book advocates \c be recognized as c which
941 necessitates the double escaping of meta-characters in strings. Posix
942 explicitly declines to define the behavior which passively forces pro‐
943 grams that must run under a variety of awks to use the more portable
944 but less readable, double escape.
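
A short sketch of the portable double escape follows. It assumes a POSIX awk installed as awk (lmawk can be substituted); the variable names are made up for the example. The single-escape string form "a\+b" is left out on purpose, since its behavior is what varies between awks.

```shell
esc_out=$(awk 'BEGIN {
    x  = "a+b"
    r1 = (x ~ /a\+b/)         # regex literal: \+ is a literal plus
    r2 = (x ~ "a\\+b")        # string: \\ survives the string scan as \
    r3 = ("aab" ~ "a\\+b")    # no literal plus in "aab", so no match
    print r1, r2, r3
}')
printf '%s\n' "$esc_out"
```

Both forms match the literal text a+b, and neither treats + as a repetition operator.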
945
946 Posix AWK does not recognize "/dev/std{out,err}" or \x hex escape
947 sequences in strings. Unlike ANSI C, lmawk limits the number of digits
948 that follow \x to two as the current implementation only supports 8
949 bit characters. The built-in fflush first appeared in a recent (1993)
950 AT&T awk released to netlib, and is not part of the posix standard.
951 Aggregate deletion with delete array is not part of the posix standard.
952
953 Posix explicitly leaves the behavior of FS = "" undefined, and mentions
954 splitting the record into characters as a possible interpretation, but
955 currently this use is not portable across implementations.
956
957 Finally, here is how lmawk handles exceptional cases not discussed in
958 the AWK book or the Posix draft. It is unsafe to assume consistency
959 across awks and safe to skip to the next section.
960
961 substr(s, i, n) returns the characters of s in the intersection
962 of the closed interval [1, length(s)] and the half-open interval
963 [i, i+n). When this intersection is empty, the empty string is
964 returned; so substr("ABC", 1, 0) = "" and substr("ABC", -4, 6) =
965 "A".
966
967 Every string, including the empty string, matches the empty
968 string at the front so, s ~ // and s ~ "", are always 1 as is
969 match(s, //) and match(s, ""). The last two set RLENGTH to 0.
970
971 index(s, t) is always the same as match(s, t1) where t1 is the
972 same as t with metacharacters escaped. Hence consistency with
973 match requires that index(s, "") always returns 1. Also the
974 condition, index(s,t) != 0 if and only if t is a substring of s,
975 requires index("","") = 1.
976
977 If getline encounters end of file, getline var, leaves var
978 unchanged. Similarly, on entry to the END actions, $0, the
979 fields and NF have their value unaltered from the last record.
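
The getline behavior at end of file can be checked with a sketch like this one. It assumes an awk installed as awk (lmawk can be substituted) and a readable /dev/null; as noted above, other awks are not guaranteed to agree.

```shell
eof_out=$(awk 'BEGIN {
    v = "keep"
    r = (getline v < "/dev/null")   # immediate end of file
    print r, v
}')
printf '%s\n' "$eof_out"
```

Here getline returns 0 for end of file, and v still holds the value assigned before the call.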
980
982 egrep(1), mawk(1)
983
984 Aho, Kernighan and Weinberger, The AWK Programming Language, Addison-
985 Wesley Publishing, 1988, (the AWK book), defines the language, opening
986 with a tutorial and advancing to many interesting programs that delve
987 into issues of software design and analysis relevant to programming in
988 any language.
989
990 The GAWK Manual, The Free Software Foundation, 1991, is a tutorial and
991 language reference that does not attempt the depth of the AWK book and
992 assumes the reader may be a novice programmer. The section on AWK
993 arrays is excellent. It also discusses Posix requirements for AWK.
994
996 lmawk cannot handle ascii NUL \0 in the source or data files. You can
997 output NUL using printf with %c, and any other 8 bit character is
998 acceptable input.
999
1000 lmawk implements printf() and sprintf() using the C library functions,
1001 printf and sprintf, so full ANSI compatibility requires an ANSI C
1002 library. In practice this means the h conversion qualifier may not be
1003 available. Also lmawk inherits any bugs or limitations of the library
1004 functions.
1005
1006 Implementors of the AWK language have shown a consistent lack of imagi‐
1007 nation when naming their programs.
1008
1010 mawk: Mike Brennan (brennan@whidbey.com).
1011
1012 libmawk extensions: Tibor Palinkas (libmawk@igor2.repo.hu).
1013
1014
1015
1016Version 1.2 Dec 12 2010 LMAWK(1)