mawk-arrays(7)

1MAWK-ARRAYS(7)                   Miscellaneous                  MAWK-ARRAYS(7)
2
3
4

NAME

6       mawk-arrays - design notes for mawk's array implementation
7

SYNOPSIS

9       This  is  the  documentation for the mawk implementation of awk arrays.
10       Arrays in awk are associations of strings to awk  scalar  values.   The
11       mawk  implementation  stores the associations in hash tables.  The hash
12       table scheme was influenced by and is similar to the  design  presented
13       in  Griswold  and  Townsend,  The  Design and Implementation of Dynamic
14       Hashing Sets and Tables in Icon}, Software Practice and Experience, 23,
15       351-367, 1993.
16

Data Structures

18   Array Structure
19       The  type  ARRAY is a pointer to a struct array.  The size field is the
20       number of elements in the table.  The meaning of the other  fields  de‐
21       pends on the type field.
22
23       There are three types of arrays and these are distinguished by the type
24       field in the structure.  The types are:
25
26       AY_NULL
27            The array is empty and the size field is always zero.   The  other
28            fields have no meaning.
29
30       AY_SPLIT
31            The array was created by the AWK built-in split.  The return value
32            from split is stored in the size field.  The ptr field points at a
33            vector  of  CELLs.  The number of CELLs is the limit field.  It is
34            always  true  that  size ≤ limit.   The   address   of   A[i]   is
35            (CELL*)A->ptr+i-1  for  1≤ i ≤ size.  The hmask field has no mean‐
36            ing.
37
38       Hash Table
39            The array is a hash table.  If the AY_STR bit in the type field is
40            set, then the table is keyed on strings.  If the AY_INT bit in the
41            type field is set, then the table is keyed on integers.  Both bits
42            can be set, and then the two keys are consistent, i.e., look up of
43            A[-14] and A["-14"] will return identical CELL  pointers  although
44            the  look  up  methods  will be different.  In this case, the size
45            field is the number of hash nodes in the table.  When insertion of
46            a new element would cause size to exceed limit, the table grows by
47            doubling   the   number   of   hash   chains.    The    invariant,
48            (hmask+1)max_ave_list_length=limit       is      always      true.
49            Max_ave_list_length is a tunable constant.
50
51   Hash Tables
52       The hash tables are linked lists of nodes, called ANODEs.   The  number
53       of  lists  is  hmask+1  which  is always a power of two.  The ptr field
54       points at a vector of list heads.   Since  there  are  potentially  two
55       types  of  lists,  integer lists and strings lists, each list head is a
56       structure, DUAL_LINK.
57
58       The string lists are chains connected by slinks and the  integer  lists
59       are  chains  connected by ilinks.  We sometimes refer to these lists as
60       slists and ilists, respectively.  The elements on the lists are ANODEs.
61       The fields of an ANODE are:
62
63       slink  The  link  field  for  slists.  ilink The link field for ilists.
64       sval If non-null, then sval is a pointer to a string key.  For a  given
65       table,  if  the  AY_STR bit is set then every ANODE has a non-null sval
66       field and conversely, if AY_STR is not set, then every  sval  field  is
67       null.
68
69       hval  The  hash  value  of  sval.  This field has no meaning if sval is
70       null.
71
72       ival The integer key.  The field has no meaning if set to the constant,
73       NOT_AN_IVALUE.   If the AY_STR bit is off, then every ANODE will have a
74       valid ival field.  If the AY_STR bit is on, then the ival field may  or
75       may not be valid.
76
77       cell The data field in the hash table.  \ndhitems
78
79       So  the  value of A[expr is stored in the cell field, and if expr is an
80       integer, then expr is stored in ival, else it is stored in sval.
81

Array Operations

83       The functions that operate on arrays are,
84
85       CELL* array_find(ARRAY A, CELL *cp, int create_flag)
86            returns a pointer to A[expr] where cp is a  pointer  to  the  CELL
87            holding expr.  If the create_flag is on and expr is not an element
88            of A, then the element is created with value null.
89
90       void array_delete(ARRAY A, CELL *cp)
91            removes an element A[expr from the array A.  cp points at the CELL
92            holding expr.
93
94       void array_load(ARRAY A, size_t cnt)
95            builds  a split array.  The values A[1..cnt] are moved into A from
96            an anonymous buffer with transfer_to_array() which is declared  in
97            split.h.
98
99       void array_clear(ARRAY A) removes all elements of A.
100            The type of A is then AY_NULL.
101
102       STRING** array_loop_vector(ARRAY A, size_t *sizep)
103            returns  a  pointer  to a linear vector that holds all the strings
104            that are indices of A.  The size of the the vector is returned in‐
105            directly in *sizep.  If A->size≡0, a null pointer is returned.
106
107       CELL* array_cat(CELL *sp, int cnt)
108            concatenates the elements of sp[1-cnt..0], with each element sepa‐
109            rated by SUBSEP, to compute an array index.   For  example,  on  a
110            reference to A[i,j], array_cat computes i ○ SUBSEP ○ j where ○ de‐
111            notes concatenation.
112
113   Array Find
114       Any reference to A[expr]  creates  a  call  to  array_find(A,cp,CREATE)
115       where cp points at the cell holding expr.  The test, expr in A, creates
116       a call to array_find(A,cp,NO_CREATE).
117
118       Array_find is a hash-table lookup function that handles two cases:
119
120       1.   If *cp is numeric and integer valued, then lookup by integer value
121            using  find_by_ival.   If  *cp is numeric, but not integer valued,
122            then convert to string with sprintf(CONVFMT,...) and go to case~2.
123
124       2.   If *cp is  string  valued,  then  lookup  by  string  value  using
125            find_by_sval.  \ndlist
126
127       To  test whether cp->dval is integer, we convert to the nearest integer
128       by rounding towards zero (done by do_to_I) and then cast back  to  dou‐
129       ble.  If we get the same number we started with, then cp->dval is inte‐
130       ger valued.
131
132       When we get to the function find_by_ival, the search has  been  reduced
133       to lookup in a hash table by integer value.
134
135       When  a search by integer value fails, we have to check by string value
136       to correctly handle the case insertion by A["123"] and later search  as
137       A[123].   This string search is necessary if and only if the AY_STR bit
138       is set.  An important point is that all ANODEs get created with a valid
139       sval if AY_STR is set, because then creation of new nodes always occurs
140       in a call to find_by_sval.
141
142       Searching by string value is  easier  because  AWK  arrays  are  really
143       string  associations.   If  the array does not have the AY_STR bit set,
144       then we have to convert the array to a dual  hash  table  with  strings
145       which is done by the function add_string_associations.
146
147       One Int value is reserved to show that the ival field is invalid.  This
148       works because d_to_I returns a value in [-Max_Int, Max_Int].
149
150       On entry to add_string_associations, we know that the AY_STR bit is not
151       set.   We convert to a dual hash table, then walk all the integer lists
152       and put each ANODE on a string list.
153
154   Array Delete
155       The execution of the statement, delete A[expr], creates a call  to  ar‐
156       ray_delete(ARRAY  A, CELL *cp).  Depending on the type of *cp, the call
157       is routed to find_by_sval or find_by_ival.   Each  of  these  functions
158       leaves  its  return  value  on  the front of an slist or ilist, respec‐
159       tively, and then it is deleted from the front of the  list.   The  case
160       where A[expr is on two lists, e.g., A[12] and A["12"] is checked by ex‐
161       amining the sval and ival fields of the returned ANODE*.
162
163       Even though we found a node by searching an ilist it might also  be  on
164       an slist and vice-versa.
165
166       When  the size of a hash table drops below a certain value, it might be
167       profitable to shrink the hash table.  Currently we don't do  this,  be‐
168       cause our guess is that it would be a waste of time for most AWK appli‐
169       cations.  However, we do convert an array to AY_NULL when the size goes
170       to  zero which would resize a large hash table that had been completely
171       cleared by successive deletions.
172
173   Building an Array with Split
174       A simple operation is to create an array with the AWK primitive  split.
175       The  code  that  performs split puts the pieces in an anonymous buffer.
176       array_load(A, cnt) moves the cnt elements  from  the  anonymous  buffer
177       into A.  This is the only way an array of type AY_SPLIT is created.
178
179       If the array A is a split array and big enough then we reuse it, other‐
180       wise we need to allocate a new split array.  When we allocate  a  block
181       of CELLs for a split array, we round up to a multiple of 4.
182
183   Array Clear
184       The  function array_clear(ARRAY A) converts A to type AY_NULL and frees
185       all storage used by A except for the struct array itself.   This  func‐
186       tion gets called in three contexts:
187
188       (1)  when an array local to a user function goes out of scope,
189
190       (2)  execution of the AWK statement, delete A and
191
192       (3)  when an existing changes type or size from split().
193
194   Constructor and Conversions
195       Arrays  are always created as empty arrays of type AY_NULL.  Global ar‐
196       rays are never destroyed although they can go empty or have their  type
197       change by conversion.  The only constructor function is a macro.
198
199       Hash  tables  only  get constructed by conversion.  This happens in two
200       ways.  The function make_empty_table converts an empty  array  of  type
201       AY_NULL  to an empty hash table.  The number of lists in the table is a
202       power of 2 determined by the constant STARTING_HMASK.  The  limit  size
203       of the table is determined by the constant MAX_AVE_LIST_LENGTH which is
204       the largest average size of the hash lists that we are willing to  tol‐
205       erate  before  enlarging the table.  When A->size exceeds A->limit, the
206       hash table grows in size by doubling the number of lists.  A->limit  is
207       then reset to MAX_AVE_LIST_LENGTH times A->hmask+1.
208
209       The  other  way  a hash table gets constructed is when a split array is
210       converted to a hash table of type AY_INT.
211
212       To determine the size of the table, we set the initial size  to  START‐
213       ING_HMASK+1 and then double the size until A->size ≤ A->limit.
214
215   Doubling the Size of a Hash Table
216       The  whole  point of making the table size a power of two is to facili‐
217       tate resizing the table.  If the table size is 2**(n) and h is the hash
218       key,  then h mod 2**(n) is the hash chain index which can be calculated
219       with bit-wise and, h & (2**(n-1)).  When the table  size  doubles,  the
220       new bit-mask has one more bit turned on.  Elements of an old hash chain
221       whose hash value have this bit turned on get moved to a new chain.  El‐
222       ements  with  this  bit  turned off stay on the same chain.  On average
223       only half the old chain moves to the new chain.  If the old chain is at
224       table[i], 0 ≤ i  < 2**(n), then the elements that move, all move to the
225       new chain at table[i + 2**(n)].
226
227       As we walk an old string list with pointer p, the expression p->hval  &
228       new_hmask  takes  one  of  two  values.   If  it  is equal to p->hval &
229       old_hmask (which equals i), then the node stays otherwise it gets moved
230       to a new string list at j.  The new string list preserves order so that
231       the positions of the move-to-the-front heuristic are preserved.   Nodes
232       moving  to  the  new  list  are  appended at pointer tail.  The ANODEs,
233       dummy0~and dummy1, are sentinels that remove special handling of bound‐
234       ary conditions.
235
236       The doubling of the integer lists is exactly the same except that slink
237       is replaced by ilink and hval is replaced by ival.
238
239   Array Loops
240       Our mechanism for dealing with execution of the statement,
241
242              for (i in A) { statements }
243
244       is simple.  We allocate a vector of STRING* of size, A->size.  Each el‐
245       ement of the vector is a string key for~A.  Note that if the AY_STR bit
246       of A is not set, then A has to be converted to a string hash table, be‐
247       cause the index i walks string indices.
248
249       To  execute  the loop, the only state that needs to be saved is the ad‐
250       dress of i and an index into the vector of string keys.  Since  nothing
251       about A is saved as state, the user program can do anything to A inside
252       the body of the loop, even delete A, and the loop still works.   Essen‐
253       tially,  we  have traded data space (the string vector) in exchange for
254       implementation simplicity.  On a 32-bit system, each ANODE is 36 bytes,
255       so  the extra memory needed for the array loop is 11 more than the mem‐
256       ory consumed by the ANODEs of the array.  Note that the large  size  of
257       the  ANODEs is indicative of our whole design which pays data space for
258       integer lookup speed and algorithm simplicity.
259
260       The only aspect of array loops that occurs in array.c  is  construction
261       of  the  string  vector.  The rest of the implementation is in the file
262       execute.c.
263
264       As we walk over the hash table ANODEs, putting each  sval  in  ret,  we
265       need  to  increment each reference count.  The user of the return value
266       is responsible for these new reference counts.
267
268   Concatenating Array Indices
269       In AWK, an array expression A[i,j] is equivalent to the expression  A[i
270       SUBSEP  j],  i.e., the index is the concatenation of the three elements
271       i, SUBSEP and j.  This is performed by the function array_cat.  On  en‐
272       try,  sp  points  at the top of a stack of CELLs.  Cnt cells are popped
273       off the stack and concatenated together separated by SUBSEP and the re‐
274       sult  is  pushed back on the stack.  On entry, the first multi-index is
275       in sp[1-cnt] and the last is in sp[0].  The return  value  is  the  new
276       stack  top.   (The stack is the run-time evaluation stack.  This opera‐
277       tion really has nothing to do with array structure, so  logically  this
278       code belongs in execute.c, but remains here for historical reasons.)
279
280       We  make  a  copy of SUBSEP which we can cast to string in the unlikely
281       event the user has assigned a number to SUBSEP.
282
283       Set sp and top so the cells to concatenate are inclusively  between  sp
284       and top.
285
286       The  total_len  is  the  sum  of the lengths of the cnt strings and the
287       cnt-1 copies of subsep.
288
289       The return value is sp and it is already set correctly.  We  just  need
290       to free the strings and set the contents of sp.
291
292
293
294Version 1.3.4                     2020-08-18                    MAWK-ARRAYS(7)