1file_sorter(3) Erlang Module Definition file_sorter(3)
2
3
4
6 file_sorter - File sorter.
7
9 This module contains functions for sorting terms on files, merging
10 already sorted files, and checking files for sortedness. Chunks con‐
11 taining binary terms are read from a sequence of files, sorted inter‐
12 nally in memory and written on temporary files, which are merged pro‐
13 ducing one sorted file as output. Merging is provided as an optimiza‐
14 tion; it is faster when the files are already sorted, but it always
15 works to sort instead of merge.
16
17 On a file, a term is represented by a header and a binary. Two options
18 define the format of terms on files:
19
20 {header, HeaderLength}:
21 HeaderLength determines the number of bytes preceding each binary
22 and containing the length of the binary in bytes. Defaults to 4.
23 The order of the header bytes is defined as follows: if B is a
24 binary containing a header only, size Size of the binary is calcu‐
25 lated as <<Size:HeaderLength/unit:8>> = B.
26
27 {format, Format}:
28 Option Format determines the function that is applied to binaries
29 to create the terms to be sorted. Defaults to binary_term, which is
30 equivalent to fun binary_to_term/1. Value binary is equivalent to
31 fun(X) -> X end, which means that the binaries are sorted as they
32 are. This is the fastest format. If Format is term, io:read/2 is
33 called to read terms. In that case, only the default value of
34 option header is allowed.
35
36 Option format also determines what is written to the sorted output
37 file: if Format is term, then io:format/3 is called to write each
38 term, otherwise the binary prefixed by a header is written. Notice
39 that the binary written is the same binary that was read; the
40 results of applying function Format are thrown away when the terms
41 have been sorted. Reading and writing terms using the io module is
42 much slower than reading and writing binaries.
43
44 Other options are:
45
46 {order, Order}:
47 The default is to sort terms in ascending order, but that can be
48 changed by value descending or by specifying an ordering function
49 Fun. An ordering function is antisymmetric, transitive, and total.
50 Fun(A, B) is to return true if A comes before B in the ordering,
51 otherwise false. An example of a typical ordering function is less
52 than or equal to, =</2. Using an ordering function slows down the
53 sort considerably. Functions keysort, keymerge and keycheck do not
54 accept ordering functions.
55
56 {unique, boolean()}:
57 When sorting or merging files, only the first of a sequence of
58 terms that compare equal (==) is output if this option is set to
59 true. Defaults to false, which implies that all terms that compare
60 equal are output. When checking files for sortedness, a check that
61 no pair of consecutive terms compares equal is done if this option
62 is set to true.
63
64 {tmpdir, TempDirectory}:
65 The directory where temporary files are put can be chosen explic‐
66 itly. The default, implied by value "", is to put temporary files
67 on the same directory as the sorted output file. If output is a
68 function (see below), the directory returned by file:get_cwd() is
69 used instead. The names of temporary files are derived from the
70 Erlang nodename (node()), the process identifier of the current
71 Erlang emulator (os:getpid()), and a unique integer
72 (erlang:unique_integer([positive])). A typical name is fs_myn‐
73 ode@myhost_1763_4711.17, where 17 is a sequence number. Existing
74 files are overwritten. Temporary files are deleted unless some
75 uncaught EXIT signal occurs.
76
77 {compressed, boolean()}:
78 Temporary files and the output file can be compressed. Defaults
79 false, which implies that written files are not compressed. Regard‐
80 less of the value of option compressed, compressed files can always
81 be read. Notice that reading and writing compressed files are sig‐
82 nificantly slower than reading and writing uncompressed files.
83
84 {size, Size}:
85 By default about 512*1024 bytes read from files are sorted inter‐
86 nally. This option is rarely needed.
87
88 {no_files, NoFiles}:
89 By default 16 files are merged at a time. This option is rarely
90 needed.
91
92 As an alternative to sorting files, a function of one argument can be
93 specified as input. When called with argument read, the function is
94 assumed to return either of the following:
95
96 * end_of_input or {end_of_input, Value}} when there is no more input
97 (Value is explained below).
98
99 * {Objects, Fun}, where Objects is a list of binaries or terms
100 depending on the format, and Fun is a new input function.
101
102 Any other value is immediately returned as value of the current call to
103 sort or keysort. Each input function is called exactly once. If an
104 error occurs, the last function is called with argument close, the
105 reply of which is ignored.
106
107 A function of one argument can be specified as output. The results of
108 sorting or merging the input is collected in a non-empty sequence of
109 variable length lists of binaries or terms depending on the format. The
110 output function is called with one list at a time, and is assumed to
111 return a new output function. Any other return value is immediately
112 returned as value of the current call to the sort or merge function.
113 Each output function is called exactly once. When some output function
114 has been applied to all of the results or an error occurs, the last
115 function is called with argument close, and the reply is returned as
116 value of the current call to the sort or merge function.
117
118 If a function is specified as input and the last input function returns
119 {end_of_input, Value}, the function specified as output is called with
120 argument {value, Value}. This makes it easy to initiate the sequence of
121 output functions with a value calculated by the input functions.
122
123 As an example, consider sorting the terms on a disk log file. A func‐
124 tion that reads chunks from the disk log and returns a list of binaries
125 is used as input. The results are collected in a list of terms.
126
127 sort(Log) ->
128 {ok, _} = disk_log:open([{name,Log}, {mode,read_only}]),
129 Input = input(Log, start),
130 Output = output([]),
131 Reply = file_sorter:sort(Input, Output, {format,term}),
132 ok = disk_log:close(Log),
133 Reply.
134
135 input(Log, Cont) ->
136 fun(close) ->
137 ok;
138 (read) ->
139 case disk_log:chunk(Log, Cont) of
140 {error, Reason} ->
141 {error, Reason};
142 {Cont2, Terms} ->
143 {Terms, input(Log, Cont2)};
144 {Cont2, Terms, _Badbytes} ->
145 {Terms, input(Log, Cont2)};
146 eof ->
147 end_of_input
148 end
149 end.
150
151 output(L) ->
152 fun(close) ->
153 lists:append(lists:reverse(L));
154 (Terms) ->
155 output([Terms | L])
156 end.
157
158 For more examples of functions as input and output, see the end of the
159 file_sorter module; the term format is implemented with functions.
160
161 The possible values of Reason returned when an error occurs are:
162
163 * bad_object, {bad_object, FileName} - Applying the format function
164 failed for some binary, or the key(s) could not be extracted from
165 some term.
166
167 * {bad_term, FileName} - io:read/2 failed to read some term.
168
169 * {file_error, FileName, file:posix()} - For an explanation of
170 file:posix(), see file(3).
171
172 * {premature_eof, FileName} - End-of-file was encountered inside some
173 binary term.
174
176 file_name() = file:name()
177
178 file_names() = [file:name()]
179
180 i_command() = read | close
181
182 i_reply() =
183 end_of_input |
184 {end_of_input, value()} |
185 {[object()], infun()} |
186 input_reply()
187
188 infun() = fun((i_command()) -> i_reply())
189
190 input() = file_names() | infun()
191
192 input_reply() = term()
193
194 o_command() = {value, value()} | [object()] | close
195
196 o_reply() = outfun() | output_reply()
197
198 object() = term() | binary()
199
200 outfun() = fun((o_command()) -> o_reply())
201
202 output() = file_name() | outfun()
203
204 output_reply() = term()
205
206 value() = term()
207
208 options() = [option()] | option()
209
210 option() =
211 {compressed, boolean()} |
212 {header, header_length()} |
213 {format, format()} |
214 {no_files, no_files()} |
215 {order, order()} |
216 {size, size()} |
217 {tmpdir, tmp_directory()} |
218 {unique, boolean()}
219
220 format() = binary_term | term | binary | format_fun()
221
222 format_fun() = fun((binary()) -> term())
223
224 header_length() = integer() >= 1
225
226 key_pos() = integer() >= 1 | [integer() >= 1]
227
228 no_files() = integer() >= 1
229
230 order() = ascending | descending | order_fun()
231
232 order_fun() = fun((term(), term()) -> boolean())
233
234 size() = integer() >= 0
235
236 tmp_directory() = [] | file:name()
237
238 reason() =
239 bad_object |
240 {bad_object, file_name()} |
241 {bad_term, file_name()} |
242 {file_error,
243 file_name(),
244 file:posix() | badarg | system_limit} |
245 {premature_eof, file_name()}
246
248 check(FileName) -> Reply
249
250 check(FileNames, Options) -> Reply
251
252 Types:
253
254 FileNames = file_names()
255 Options = options()
256 Reply = {ok, [Result]} | {error, reason()}
257 Result = {FileName, TermPosition, term()}
258 FileName = file_name()
259 TermPosition = integer() >= 1
260
261 Checks files for sortedness. If a file is not sorted, the first
262 out-of-order element is returned. The first term on a file has
263 position 1.
264
265 check(FileName) is equivalent to check([FileName], []).
266
267 keycheck(KeyPos, FileName) -> Reply
268
269 keycheck(KeyPos, FileNames, Options) -> Reply
270
271 Types:
272
273 KeyPos = key_pos()
274 FileNames = file_names()
275 Options = options()
276 Reply = {ok, [Result]} | {error, reason()}
277 Result = {FileName, TermPosition, term()}
278 FileName = file_name()
279 TermPosition = integer() >= 1
280
281 Checks files for sortedness. If a file is not sorted, the first
282 out-of-order element is returned. The first term on a file has
283 position 1.
284
285 keycheck(KeyPos, FileName) is equivalent to keycheck(KeyPos,
286 [FileName], []).
287
288 keymerge(KeyPos, FileNames, Output) -> Reply
289
290 keymerge(KeyPos, FileNames, Output, Options) -> Reply
291
292 Types:
293
294 KeyPos = key_pos()
295 FileNames = file_names()
296 Output = output()
297 Options = options()
298 Reply = ok | {error, reason()} | output_reply()
299
300 Merges tuples on files. Each input file is assumed to be sorted
301 on key(s).
302
303 keymerge(KeyPos, FileNames, Output) is equivalent to
304 keymerge(KeyPos, FileNames, Output, []).
305
306 keysort(KeyPos, FileName) -> Reply
307
308 Types:
309
310 KeyPos = key_pos()
311 FileName = file_name()
312 Reply = ok | {error, reason()} | input_reply() | out‐
313 put_reply()
314
315 Sorts tuples on files.
316
317 keysort(N, FileName) is equivalent to keysort(N, [FileName],
318 FileName).
319
320 keysort(KeyPos, Input, Output) -> Reply
321
322 keysort(KeyPos, Input, Output, Options) -> Reply
323
324 Types:
325
326 KeyPos = key_pos()
327 Input = input()
328 Output = output()
329 Options = options()
330 Reply = ok | {error, reason()} | input_reply() | out‐
331 put_reply()
332
333 Sorts tuples on files. The sort is performed on the element(s)
334 mentioned in KeyPos. If two tuples compare equal (==) on one
335 element, the next element according to KeyPos is compared. The
336 sort is stable.
337
338 keysort(N, Input, Output) is equivalent to keysort(N, Input,
339 Output, []).
340
341 merge(FileNames, Output) -> Reply
342
343 merge(FileNames, Output, Options) -> Reply
344
345 Types:
346
347 FileNames = file_names()
348 Output = output()
349 Options = options()
350 Reply = ok | {error, reason()} | output_reply()
351
352 Merges terms on files. Each input file is assumed to be sorted.
353
354 merge(FileNames, Output) is equivalent to merge(FileNames, Out‐
355 put, []).
356
357 sort(FileName) -> Reply
358
359 Types:
360
361 FileName = file_name()
362 Reply = ok | {error, reason()} | input_reply() | out‐
363 put_reply()
364
365 Sorts terms on files.
366
367 sort(FileName) is equivalent to sort([FileName], FileName).
368
369 sort(Input, Output) -> Reply
370
371 sort(Input, Output, Options) -> Reply
372
373 Types:
374
375 Input = input()
376 Output = output()
377 Options = options()
378 Reply = ok | {error, reason()} | input_reply() | out‐
379 put_reply()
380
381 Sorts terms on files.
382
383 sort(Input, Output) is equivalent to sort(Input, Output, []).
384
385
386
387Ericsson AB stdlib 3.10 file_sorter(3)