1MAWK-ARRAYS(7) Miscellaneous MAWK-ARRAYS(7)
2
3
4
6 mawk-arrays - design notes for mawk's array implementation
7
9 This is the documentation for the mawk implementation of awk arrays.
10 Arrays in awk are associations of strings to awk scalar values. The
11 mawk implementation stores the associations in hash tables. The hash
12 table scheme was influenced by and is similar to the design presented
13 in Griswold and Townsend, The Design and Implementation of Dynamic
14 Hashing Sets and Tables in Icon}, Software Practice and Experience, 23,
15 351-367, 1993.
16
18 Array Structure
19 The type ARRAY is a pointer to a struct array. The size field is the
20 number of elements in the table. The meaning of the other fields de‐
21 pends on the type field.
22
23 There are three types of arrays and these are distinguished by the type
24 field in the structure. The types are:
25
26 AY_NULL
27 The array is empty and the size field is always zero. The other
28 fields have no meaning.
29
30 AY_SPLIT
31 The array was created by the AWK built-in split. The return value
32 from split is stored in the size field. The ptr field points at a
33 vector of CELLs. The number of CELLs is the limit field. It is
34 always true that size ≤ limit. The address of A[i] is
35 (CELL*)A->ptr+i-1 for 1≤ i ≤ size. The hmask field has no mean‐
36 ing.
37
38 Hash Table
39 The array is a hash table. If the AY_STR bit in the type field is
40 set, then the table is keyed on strings. If the AY_INT bit in the
41 type field is set, then the table is keyed on integers. Both bits
42 can be set, and then the two keys are consistent, i.e., look up of
43 A[-14] and A["-14"] will return identical CELL pointers although
44 the look up methods will be different. In this case, the size
45 field is the number of hash nodes in the table. When insertion of
46 a new element would cause size to exceed limit, the table grows by
47 doubling the number of hash chains. The invariant,
48 (hmask+1)max_ave_list_length=limit is always true.
49 Max_ave_list_length is a tunable constant.
50
51 Hash Tables
52 The hash tables are linked lists of nodes, called ANODEs. The number
53 of lists is hmask+1 which is always a power of two. The ptr field
54 points at a vector of list heads. Since there are potentially two
55 types of lists, integer lists and strings lists, each list head is a
56 structure, DUAL_LINK.
57
58 The string lists are chains connected by slinks and the integer lists
59 are chains connected by ilinks. We sometimes refer to these lists as
60 slists and ilists, respectively. The elements on the lists are ANODEs.
61 The fields of an ANODE are:
62
63 slink The link field for slists. ilink The link field for ilists.
64 sval If non-null, then sval is a pointer to a string key. For a given
65 table, if the AY_STR bit is set then every ANODE has a non-null sval
66 field and conversely, if AY_STR is not set, then every sval field is
67 null.
68
69 hval The hash value of sval. This field has no meaning if sval is
70 null.
71
72 ival The integer key. The field has no meaning if set to the constant,
73 NOT_AN_IVALUE. If the AY_STR bit is off, then every ANODE will have a
74 valid ival field. If the AY_STR bit is on, then the ival field may or
75 may not be valid.
76
77 cell The data field in the hash table. \ndhitems
78
79 So the value of A[expr is stored in the cell field, and if expr is an
80 integer, then expr is stored in ival, else it is stored in sval.
81
83 The functions that operate on arrays are,
84
85 CELL* array_find(ARRAY A, CELL *cp, int create_flag)
86 returns a pointer to A[expr] where cp is a pointer to the CELL
87 holding expr. If the create_flag is on and expr is not an element
88 of A, then the element is created with value null.
89
90 void array_delete(ARRAY A, CELL *cp)
91 removes an element A[expr from the array A. cp points at the CELL
92 holding expr.
93
94 void array_load(ARRAY A, size_t cnt)
95 builds a split array. The values A[1..cnt] are moved into A from
96 an anonymous buffer with transfer_to_array() which is declared in
97 split.h.
98
99 void array_clear(ARRAY A) removes all elements of A.
100 The type of A is then AY_NULL.
101
102 STRING** array_loop_vector(ARRAY A, size_t *sizep)
103 returns a pointer to a linear vector that holds all the strings
104 that are indices of A. The size of the the vector is returned in‐
105 directly in *sizep. If A->size≡0, a null pointer is returned.
106
107 CELL* array_cat(CELL *sp, int cnt)
108 concatenates the elements of sp[1-cnt..0], with each element sepa‐
109 rated by SUBSEP, to compute an array index. For example, on a
110 reference to A[i,j], array_cat computes i ○ SUBSEP ○ j where ○ de‐
111 notes concatenation.
112
113 Array Find
114 Any reference to A[expr] creates a call to array_find(A,cp,CREATE)
115 where cp points at the cell holding expr. The test, expr in A, creates
116 a call to array_find(A,cp,NO_CREATE).
117
118 Array_find is a hash-table lookup function that handles two cases:
119
120 1. If *cp is numeric and integer valued, then lookup by integer value
121 using find_by_ival. If *cp is numeric, but not integer valued,
122 then convert to string with sprintf(CONVFMT,...) and go to case~2.
123
124 2. If *cp is string valued, then lookup by string value using
125 find_by_sval. \ndlist
126
127 To test whether cp->dval is integer, we convert to the nearest integer
128 by rounding towards zero (done by do_to_I) and then cast back to dou‐
129 ble. If we get the same number we started with, then cp->dval is inte‐
130 ger valued.
131
132 When we get to the function find_by_ival, the search has been reduced
133 to lookup in a hash table by integer value.
134
135 When a search by integer value fails, we have to check by string value
136 to correctly handle the case insertion by A["123"] and later search as
137 A[123]. This string search is necessary if and only if the AY_STR bit
138 is set. An important point is that all ANODEs get created with a valid
139 sval if AY_STR is set, because then creation of new nodes always occurs
140 in a call to find_by_sval.
141
142 Searching by string value is easier because AWK arrays are really
143 string associations. If the array does not have the AY_STR bit set,
144 then we have to convert the array to a dual hash table with strings
145 which is done by the function add_string_associations.
146
147 One Int value is reserved to show that the ival field is invalid. This
148 works because d_to_I returns a value in [-Max_Int, Max_Int].
149
150 On entry to add_string_associations, we know that the AY_STR bit is not
151 set. We convert to a dual hash table, then walk all the integer lists
152 and put each ANODE on a string list.
153
154 Array Delete
155 The execution of the statement, delete A[expr], creates a call to ar‐
156 ray_delete(ARRAY A, CELL *cp). Depending on the type of *cp, the call
157 is routed to find_by_sval or find_by_ival. Each of these functions
158 leaves its return value on the front of an slist or ilist, respec‐
159 tively, and then it is deleted from the front of the list. The case
160 where A[expr is on two lists, e.g., A[12] and A["12"] is checked by ex‐
161 amining the sval and ival fields of the returned ANODE*.
162
163 Even though we found a node by searching an ilist it might also be on
164 an slist and vice-versa.
165
166 When the size of a hash table drops below a certain value, it might be
167 profitable to shrink the hash table. Currently we don't do this, be‐
168 cause our guess is that it would be a waste of time for most AWK appli‐
169 cations. However, we do convert an array to AY_NULL when the size goes
170 to zero which would resize a large hash table that had been completely
171 cleared by successive deletions.
172
173 Building an Array with Split
174 A simple operation is to create an array with the AWK primitive split.
175 The code that performs split puts the pieces in an anonymous buffer.
176 array_load(A, cnt) moves the cnt elements from the anonymous buffer
177 into A. This is the only way an array of type AY_SPLIT is created.
178
179 If the array A is a split array and big enough then we reuse it, other‐
180 wise we need to allocate a new split array. When we allocate a block
181 of CELLs for a split array, we round up to a multiple of 4.
182
183 Array Clear
184 The function array_clear(ARRAY A) converts A to type AY_NULL and frees
185 all storage used by A except for the struct array itself. This func‐
186 tion gets called in three contexts:
187
188 (1) when an array local to a user function goes out of scope,
189
190 (2) execution of the AWK statement, delete A and
191
192 (3) when an existing changes type or size from split().
193
194 Constructor and Conversions
195 Arrays are always created as empty arrays of type AY_NULL. Global ar‐
196 rays are never destroyed although they can go empty or have their type
197 change by conversion. The only constructor function is a macro.
198
199 Hash tables only get constructed by conversion. This happens in two
200 ways. The function make_empty_table converts an empty array of type
201 AY_NULL to an empty hash table. The number of lists in the table is a
202 power of 2 determined by the constant STARTING_HMASK. The limit size
203 of the table is determined by the constant MAX_AVE_LIST_LENGTH which is
204 the largest average size of the hash lists that we are willing to tol‐
205 erate before enlarging the table. When A->size exceeds A->limit, the
206 hash table grows in size by doubling the number of lists. A->limit is
207 then reset to MAX_AVE_LIST_LENGTH times A->hmask+1.
208
209 The other way a hash table gets constructed is when a split array is
210 converted to a hash table of type AY_INT.
211
212 To determine the size of the table, we set the initial size to START‐
213 ING_HMASK+1 and then double the size until A->size ≤ A->limit.
214
215 Doubling the Size of a Hash Table
216 The whole point of making the table size a power of two is to facili‐
217 tate resizing the table. If the table size is 2**(n) and h is the hash
218 key, then h mod 2**(n) is the hash chain index which can be calculated
219 with bit-wise and, h & (2**(n-1)). When the table size doubles, the
220 new bit-mask has one more bit turned on. Elements of an old hash chain
221 whose hash value have this bit turned on get moved to a new chain. El‐
222 ements with this bit turned off stay on the same chain. On average
223 only half the old chain moves to the new chain. If the old chain is at
224 table[i], 0 ≤ i < 2**(n), then the elements that move, all move to the
225 new chain at table[i + 2**(n)].
226
227 As we walk an old string list with pointer p, the expression p->hval &
228 new_hmask takes one of two values. If it is equal to p->hval &
229 old_hmask (which equals i), then the node stays otherwise it gets moved
230 to a new string list at j. The new string list preserves order so that
231 the positions of the move-to-the-front heuristic are preserved. Nodes
232 moving to the new list are appended at pointer tail. The ANODEs,
233 dummy0~and dummy1, are sentinels that remove special handling of bound‐
234 ary conditions.
235
236 The doubling of the integer lists is exactly the same except that slink
237 is replaced by ilink and hval is replaced by ival.
238
239 Array Loops
240 Our mechanism for dealing with execution of the statement,
241
242 for (i in A) { statements }
243
244 is simple. We allocate a vector of STRING* of size, A->size. Each el‐
245 ement of the vector is a string key for~A. Note that if the AY_STR bit
246 of A is not set, then A has to be converted to a string hash table, be‐
247 cause the index i walks string indices.
248
249 To execute the loop, the only state that needs to be saved is the ad‐
250 dress of i and an index into the vector of string keys. Since nothing
251 about A is saved as state, the user program can do anything to A inside
252 the body of the loop, even delete A, and the loop still works. Essen‐
253 tially, we have traded data space (the string vector) in exchange for
254 implementation simplicity. On a 32-bit system, each ANODE is 36 bytes,
255 so the extra memory needed for the array loop is 11 more than the mem‐
256 ory consumed by the ANODEs of the array. Note that the large size of
257 the ANODEs is indicative of our whole design which pays data space for
258 integer lookup speed and algorithm simplicity.
259
260 The only aspect of array loops that occurs in array.c is construction
261 of the string vector. The rest of the implementation is in the file
262 execute.c.
263
264 As we walk over the hash table ANODEs, putting each sval in ret, we
265 need to increment each reference count. The user of the return value
266 is responsible for these new reference counts.
267
268 Concatenating Array Indices
269 In AWK, an array expression A[i,j] is equivalent to the expression A[i
270 SUBSEP j], i.e., the index is the concatenation of the three elements
271 i, SUBSEP and j. This is performed by the function array_cat. On en‐
272 try, sp points at the top of a stack of CELLs. Cnt cells are popped
273 off the stack and concatenated together separated by SUBSEP and the re‐
274 sult is pushed back on the stack. On entry, the first multi-index is
275 in sp[1-cnt] and the last is in sp[0]. The return value is the new
276 stack top. (The stack is the run-time evaluation stack. This opera‐
277 tion really has nothing to do with array structure, so logically this
278 code belongs in execute.c, but remains here for historical reasons.)
279
280 We make a copy of SUBSEP which we can cast to string in the unlikely
281 event the user has assigned a number to SUBSEP.
282
283 Set sp and top so the cells to concatenate are inclusively between sp
284 and top.
285
286 The total_len is the sum of the lengths of the cnt strings and the
287 cnt-1 copies of subsep.
288
289 The return value is sp and it is already set correctly. We just need
290 to free the strings and set the contents of sp.
291
292
293
294Version 1.3.4 2020-08-18 MAWK-ARRAYS(7)