1funidx(n) SAORD Documentation funidx(n)
2
3
4
6 Funidx: Using Indexes to Filter Rows in a Table
7
9 This document contains a summary of the user interface for filtering
10 rows in binary tables with indexes.
11
13 Funtools Table Filtering allows rows in a table to be selected based on
14 the values of one or more columns in the row. Because the actual filter
15 code is compiled on the fly, it is very efficient. However, for very
16 large files (hundreds of Mb or larger), evaluating the filter expres‐
17 sion on each row can take a long time. Therefore, funtools supports
18 index files for columns, which are used automatically during filtering
19 to reduce dramatically the number of row evaluations performed. The
20 speed increase for indexed filtering can be an order of magnitude or
21 more, depending on the size of the file.
22
23 The funindex program creates an index on one or more columns in a
24 binary table. For example, to create an index for the column pi in the
25 file huge.fits, use:
26
27 funindex huge.fits pi
28
29 This will create an index named huge_pi.idx.
30
31 When a filter expression is initialized for row evaluation, funtools
32 looks for an index file for each column in the filter expression. If
33 found, and if the file modification date of the index file is later
34 than that of the data file, then the index will be used to reduce the
35 number of rows that are evaluated in the filter. When Spatial Region
36 Filtering is part of the expression, the columns associated with the
37 region are checked for index files.
38
39 If an index file is not available for a given column, then in general,
40 all rows must be checked when that column is part of a filter expres‐
41 sion. This is not true, however, when a non-indexed column is part of
42 an AND expression. In this case, only the rows that pass the other part
43 of the AND expression need to be checked. Thus, in some cases, filter‐
44 ing speed can increase significantly even if all columns are not
45 indexed.
46
47 Also note that certain types of filter expression syntax cannot make
48 use of indices. For example, calling functions with column names as
49 arguments implies that all rows must be checked against the function
50 value. Once again, however, if this function is part of an AND expres‐
51 sion, then a significant improvement in speed still is possible if the
52 other part of the AND expression is indexed.
53
54 For example, note below the dramatic speedup in searching a 1 Gb file
55 using an AND filter, even when one of the columns (pha) has no index:
56
57 time fundisp \
58 huge.fits'[idx_activate=0,idx_debug=1,pha=2348&&cir 4000 4000 1]' \
59 "x y pha"
60 x y pha
61 ---------- ----------- ----------
62 3999.48 4000.47 2348
63 3999.48 4000.47 2348
64 3999.48 4000.47 2348
65 3999.48 4000.47 2348
66 3999.48 4000.47 2348
67 3999.48 4000.47 2348
68 3999.48 4000.47 2348
69 3999.48 4000.47 2348
70 3999.48 4000.47 2348
71 3999.48 4000.47 2348
72 3999.48 4000.47 2348
73 3999.48 4000.47 2348
74 3999.48 4000.47 2348
75 3999.48 4000.47 2348
76 3999.48 4000.47 2348
77 3999.48 4000.47 2348
78 42.36u 13.07s 6:42.89 13.7%
79
80 time fundisp \
81 huge.fits'[idx_activate=1,idx_debug=1,pha=2348&&cir 4000 4000 1]' \
82 "x y pha"
83 x y pha
84 ---------- ----------- ----------
85 idxeq: [INDEF]
86 idxand sort: x[ROW 8037025:8070128] y[ROW 5757665:5792352]
87 idxand(1): INDEF [IDX_OR_SORT]
88 idxall(1): [IDX_OR_SORT]
89 3999.48 4000.47 2348
90 3999.48 4000.47 2348
91 3999.48 4000.47 2348
92 3999.48 4000.47 2348
93 3999.48 4000.47 2348
94 3999.48 4000.47 2348
95 3999.48 4000.47 2348
96 3999.48 4000.47 2348
97 3999.48 4000.47 2348
98 3999.48 4000.47 2348
99 3999.48 4000.47 2348
100 3999.48 4000.47 2348
101 3999.48 4000.47 2348
102 3999.48 4000.47 2348
103 3999.48 4000.47 2348
104 3999.48 4000.47 2348
105 1.55u 0.37s 1:19.80 2.4%
106
107 When all columns are indexed, the increase in speed can be even more
108 dramatic:
109
110 time fundisp \
111 huge.fits'[idx_activate=0,idx_debug=1,pi=770&&cir 4000 4000 1]' \
112 "x y pi"
113 x y pi
114 ---------- ----------- ----------
115 3999.48 4000.47 770
116 3999.48 4000.47 770
117 3999.48 4000.47 770
118 3999.48 4000.47 770
119 3999.48 4000.47 770
120 3999.48 4000.47 770
121 3999.48 4000.47 770
122 3999.48 4000.47 770
123 3999.48 4000.47 770
124 3999.48 4000.47 770
125 3999.48 4000.47 770
126 3999.48 4000.47 770
127 3999.48 4000.47 770
128 3999.48 4000.47 770
129 3999.48 4000.47 770
130 3999.48 4000.47 770
131 42.60u 12.63s 7:28.63 12.3%
132
133 time fundisp \
134 huge.fits'[idx_activate=1,idx_debug=1,pi=770&&cir 4000 4000 1]' \
135 "x y pi"
136 x y pi
137 ---------- ----------- ----------
138 idxeq: pi start=9473025,stop=9492240 => pi[ROW 9473025:9492240]
139 idxand sort: x[ROW 8037025:8070128] y[ROW 5757665:5792352]
140 idxor sort/merge: pi[ROW 9473025:9492240] [IDX_OR_SORT]
141 idxmerge(5): [IDX_OR_SORT] pi[ROW]
142 idxall(1): [IDX_OR_SORT]
143 3999.48 4000.47 770
144 3999.48 4000.47 770
145 3999.48 4000.47 770
146 3999.48 4000.47 770
147 3999.48 4000.47 770
148 3999.48 4000.47 770
149 3999.48 4000.47 770
150 3999.48 4000.47 770
151 3999.48 4000.47 770
152 3999.48 4000.47 770
153 3999.48 4000.47 770
154 3999.48 4000.47 770
155 3999.48 4000.47 770
156 3999.48 4000.47 770
157 3999.48 4000.47 770
158 3999.48 4000.47 770
159 1.67u 0.30s 0:24.76 7.9%
160
161 The miracle of indexed filtering (and indeed, of any indexing) is the
162 speed of the binary search on the index, which is of order log2(n)
163 instead of n. (The funtools binary search method is taken from
164 http://www.tbray.org/ongoing/When/200x/2003/03/22/Binary, to whom
165 grateful acknowledgement is made.) This means that the larger the
166 file, the better the performance. Conversely, it also means that for
167 small files, using an index (and the overhead involved) can slow fil‐
168 tering down somewhat. Our tests indicate that on a file containing a
169 few tens of thousands of rows, indexed filtering can be 10 to 20 per‐
170 cent slower than non-indexed filtering. Of course, your mileage will
171 vary with conditions (disk access speed, amount of available memory,
172 process load, etc.)
173
174 Any problem encountered during index processing will result in indexing
175 being turned off, and replaced by filtering all rows. You can turn fil‐
176 tering off manually by setting the idx_activate variable to 0 (in a
177 filter expression) or the FILTER_IDX_ACTIVATE environment variable to 0
178 (in the global environment). Debugging output showing how the indexes
179 are being processed can be displayed to stderr by setting the idx_debug
180 variable to 1 (in a filter expression) or the FILTER_IDX_DEBUG environ‐
181 ment variable to 1 (in the global environment).
182
183 Currently, indexed filtering only works with FITS binary tables and raw
184 event files. It does not work with text files. This restriction might
185 be removed in a future release.
186
188 See funtools(n) for a list of Funtools help pages
189
190
191
192version 1.4.2 January 2, 2008 funidx(n)