funidx(n)

1funidx(n)                     SAORD Documentation                    funidx(n)
2
3
4

NAME

6       Funidx: Using Indexes to Filter Rows in a Table
7

SYNOPSIS

9       This document contains a summary of the user interface for filtering
10       rows in binary tables with indexes.
11

DESCRIPTION

13       Funtools Table Filtering allows rows in a table to be selected based on
14       the values of one or more columns in the row. Because the actual filter
15       code is compiled on the fly, it is very efficient. However, for very
16       large files (hundreds of Mb or larger), evaluating the filter expres‐
17       sion on each row can take a long time. Therefore, funtools supports
18       index files for columns, which are used automatically during filtering
19       to reduce dramatically the number of row evaluations performed.  The
20       speed increase for indexed filtering can be an order of magnitude or
21       more, depending on the size of the file.
22
23       The funindex program creates an index on one or more columns in a
24       binary table. For example, to create an index for the column pi in the
25       file huge.fits, use:
26
27         funindex huge.fits pi
28
29       This will create an index named huge_pi.idx.
30
31       When a filter expression is initialized for row evaluation, funtools
32       looks for an index file for each column in the filter expression. If
33       found, and if the file modification date of the index file is later
34       than that of the data file, then the index will be used to reduce the
35       number of rows that are evaluated in the filter. When Spatial Region
36       Filtering is part of the expression, the columns associated with the
37       region are checked for index files.
38
39       If an index file is not available for a given column, then in general,
40       all rows must be checked when that column is part of a filter expres‐
41       sion.  This is not true, however, when a non-indexed column is part of
42       an AND expression. In this case, only the rows that pass the other part
43       of the AND expression need to be checked. Thus, in some cases, filter‐
44       ing speed can increase significantly even if all columns are not
45       indexed.
46
47       Also note that certain types of filter expression syntax cannot make
48       use of indices. For example, calling functions with column names as
49       arguments implies that all rows must be checked against the function
50       value. Once again, however, if this function is part of an AND expres‐
51       sion, then a significant improvement in speed still is possible if the
52       other part of the AND expression is indexed.
53
54       For example, note below the dramatic speedup in searching a 1 Gb file
55       using an AND filter, even when one of the columns (pha) has no index:
56
57         time fundisp \
58         huge.fits'[idx_activate=0,idx_debug=1,pha=2348&&cir 4000 4000 1]' \
59         "x y pha"
60                 x           y        pha
61        ---------- ----------- ----------
62           3999.48     4000.47       2348
63           3999.48     4000.47       2348
64           3999.48     4000.47       2348
65           3999.48     4000.47       2348
66           3999.48     4000.47       2348
67           3999.48     4000.47       2348
68           3999.48     4000.47       2348
69           3999.48     4000.47       2348
70           3999.48     4000.47       2348
71           3999.48     4000.47       2348
72           3999.48     4000.47       2348
73           3999.48     4000.47       2348
74           3999.48     4000.47       2348
75           3999.48     4000.47       2348
76           3999.48     4000.47       2348
77           3999.48     4000.47       2348
78           42.36u 13.07s 6:42.89 13.7%
79
80         time fundisp \
81         huge.fits'[idx_activate=1,idx_debug=1,pha=2348&&cir 4000 4000 1]' \
82         "x y pha"
83                 x           y        pha
84        ---------- ----------- ----------
85        idxeq: [INDEF]
86        idxand sort: x[ROW 8037025:8070128] y[ROW 5757665:5792352]
87        idxand(1): INDEF [IDX_OR_SORT]
88        idxall(1): [IDX_OR_SORT]
89           3999.48     4000.47       2348
90           3999.48     4000.47       2348
91           3999.48     4000.47       2348
92           3999.48     4000.47       2348
93           3999.48     4000.47       2348
94           3999.48     4000.47       2348
95           3999.48     4000.47       2348
96           3999.48     4000.47       2348
97           3999.48     4000.47       2348
98           3999.48     4000.47       2348
99           3999.48     4000.47       2348
100           3999.48     4000.47       2348
101           3999.48     4000.47       2348
102           3999.48     4000.47       2348
103           3999.48     4000.47       2348
104           3999.48     4000.47       2348
105           1.55u 0.37s 1:19.80 2.4%
106
107       When all columns are indexed, the increase in speed can be even more
108       dramatic:
109
110         time fundisp \
111         huge.fits'[idx_activate=0,idx_debug=1,pi=770&&cir 4000 4000 1]' \
112         "x y pi"
113                 x           y         pi
114        ---------- ----------- ----------
115           3999.48     4000.47        770
116           3999.48     4000.47        770
117           3999.48     4000.47        770
118           3999.48     4000.47        770
119           3999.48     4000.47        770
120           3999.48     4000.47        770
121           3999.48     4000.47        770
122           3999.48     4000.47        770
123           3999.48     4000.47        770
124           3999.48     4000.47        770
125           3999.48     4000.47        770
126           3999.48     4000.47        770
127           3999.48     4000.47        770
128           3999.48     4000.47        770
129           3999.48     4000.47        770
130           3999.48     4000.47        770
131           42.60u 12.63s 7:28.63 12.3%
132
133         time fundisp \
134         huge.fits'[idx_activate=1,idx_debug=1,pi=770&&cir 4000 4000 1]' \
135         "x y pi"
136                 x           y         pi
137        ---------- ----------- ----------
138        idxeq: pi start=9473025,stop=9492240 => pi[ROW 9473025:9492240]
139        idxand sort: x[ROW 8037025:8070128] y[ROW 5757665:5792352]
140        idxor sort/merge: pi[ROW 9473025:9492240] [IDX_OR_SORT]
141        idxmerge(5): [IDX_OR_SORT] pi[ROW]
142        idxall(1): [IDX_OR_SORT]
143           3999.48     4000.47        770
144           3999.48     4000.47        770
145           3999.48     4000.47        770
146           3999.48     4000.47        770
147           3999.48     4000.47        770
148           3999.48     4000.47        770
149           3999.48     4000.47        770
150           3999.48     4000.47        770
151           3999.48     4000.47        770
152           3999.48     4000.47        770
153           3999.48     4000.47        770
154           3999.48     4000.47        770
155           3999.48     4000.47        770
156           3999.48     4000.47        770
157           3999.48     4000.47        770
158           3999.48     4000.47        770
159           1.67u 0.30s 0:24.76 7.9%
160
161       The miracle of indexed filtering (and indeed, of any indexing) is the
162       speed of the binary search on the index, which is of order log2(n)
163       instead of n. (The funtools binary search method is taken from
164       http://www.tbray.org/ongoing/When/200x/2003/03/22/Binary, to whom
165       grateful acknowledgement is made.)  This means that the larger the
166       file, the better the performance. Conversely, it also means that for
167       small files, using an index (and the overhead involved) can slow fil‐
168       tering down somewhat. Our tests indicate that on a file containing a
169       few tens of thousands of rows, indexed filtering can be 10 to 20 per‐
170       cent slower than non-indexed filtering. Of course, your mileage will
171       vary with conditions (disk access speed, amount of available memory,
172       process load, etc.)
173
174       Any problem encountered during index processing will result in indexing
175       being turned off, and replaced by filtering all rows. You can turn fil‐
176       tering off manually by setting the idx_activate variable to 0 (in a
177       filter expression) or the FILTER_IDX_ACTIVATE environment variable to 0
178       (in the global environment). Debugging output showing how the indexes
179       are being processed can be displayed to stderr by setting the idx_debug
180       variable to 1 (in a filter expression) or the FILTER_IDX_DEBUG environ‐
181       ment variable to 1 (in the global environment).
182
183       Currently, indexed filtering only works with FITS binary tables and raw
184       event files. It does not work with text files. This restriction might
185       be removed in a future release.
186

NAME

SYNOPSIS

DESCRIPTION

SEE ALSO