1file_sorter(3)             Erlang Module Definition             file_sorter(3)
2
3
4

NAME

6       file_sorter - File sorter.
7

DESCRIPTION

9       This  module contains functions for sorting terms on files, merging al‐
10       ready sorted files, and checking files for sortedness. Chunks  contain‐
11       ing  binary  terms are read from a sequence of files, sorted internally
12       in memory and written on temporary files, which  are  merged  producing
13       one  sorted  file as output. Merging is provided as an optimization; it
14       is faster when the files are already sorted, but  it  always  works  to
15       sort instead of merge.
16
17       On  a file, a term is represented by a header and a binary. Two options
18       define the format of terms on files:
19
20         {header, HeaderLength}:
21           HeaderLength determines the number of bytes preceding  each  binary
22           and  containing  the  length of the binary in bytes. Defaults to 4.
23           The order of the header bytes is defined as follows: if B is a  bi‐
24           nary  containing  a  header only, size Size of the binary is calcu‐
25           lated as <<Size:HeaderLength/unit:8>> = B.
26
27         {format, Format}:
28           Option Format determines the function that is applied  to  binaries
29           to create the terms to be sorted. Defaults to binary_term, which is
30           equivalent to fun binary_to_term/1. Value binary is  equivalent  to
31           fun(X)  ->  X end, which means that the binaries are sorted as they
32           are. This is the fastest format. If Format is  term,  io:read/2  is
33           called  to  read terms. In that case, only the default value of op‐
34           tion header is allowed.
35
36           Option format also determines what is written to the sorted  output
37           file:  if  Format is term, then io:format/3 is called to write each
38           term, otherwise the binary prefixed by a header is written.  Notice
39           that  the  binary written is the same binary that was read; the re‐
40           sults of applying function Format are thrown away  when  the  terms
41           have  been sorted. Reading and writing terms using the io module is
42           much slower than reading and writing binaries.
43
44       Other options are:
45
46         {order, Order}:
47           The default is to sort terms in ascending order, but  that  can  be
48           changed  by  value descending or by specifying an ordering function
49           Fun. An ordering function is antisymmetric, transitive, and  total.
50           Fun(A,  B)  is  to return true if A comes before B in the ordering,
51           otherwise false. An example of a typical ordering function is  less
52           than  or  equal to, =</2. Using an ordering function slows down the
53           sort considerably. Functions keysort, keymerge and keycheck do  not
54           accept ordering functions.
55
56         {unique, boolean()}:
57           When  sorting  or  merging  files,  only the first of a sequence of
58           terms that compare equal (==) is output if this option  is  set  to
59           true.  Defaults to false, which implies that all terms that compare
60           equal are output. When checking files for sortedness, a check  that
61           no  pair of consecutive terms compares equal is done if this option
62           is set to true.
63
64         {tmpdir, TempDirectory}:
65           The directory where temporary files are put can be  chosen  explic‐
66           itly.  The  default, implied by value "", is to put temporary files
67           on the same directory as the sorted output file.  If  output  is  a
68           function  (see  below), the directory returned by file:get_cwd() is
69           used instead. The names of temporary files are derived from the Er‐
70           lang  nodename  (node()), the process identifier of the current Er‐
71           lang  emulator   (os:getpid()),   and   a   unique   integer   (er‐
72           lang:unique_integer([positive])).  A  typical name is fs_mynode@my‐
73           host_1763_4711.17, where 17 is a sequence  number.  Existing  files
74           are  overwritten.  Temporary files are deleted unless some uncaught
75           EXIT signal occurs.
76
77         {compressed, boolean()}:
78           Temporary files and the output file  can  be  compressed.  Defaults
79           false, which implies that written files are not compressed. Regard‐
80           less of the value of option compressed, compressed files can always
81           be  read. Notice that reading and writing compressed files are sig‐
82           nificantly slower than reading and writing uncompressed files.
83
84         {size, Size}:
85           By default about 512*1024 bytes read from files are  sorted  inter‐
86           nally. This option is rarely needed.
87
88         {no_files, NoFiles}:
89           By  default  16  files  are merged at a time. This option is rarely
90           needed.
91
92       As an alternative to sorting files, a function of one argument  can  be
93       specified as input. When called with argument read, the function is as‐
94       sumed to return either of the following:
95
96         * end_of_input or {end_of_input, Value}} when there is no more  input
97           (Value is explained below).
98
99         * {Objects,  Fun},  where  Objects is a list of binaries or terms de‐
100           pending on the format, and Fun is a new input function.
101
102       Any other value is immediately returned as value of the current call to
103       sort  or keysort. Each input function is called exactly once. If an er‐
104       ror occurs, the last function is called with argument close, the  reply
105       of which is ignored.
106
107       A  function  of one argument can be specified as output. The results of
108       sorting or merging the input is collected in a  non-empty  sequence  of
109       variable length lists of binaries or terms depending on the format. The
110       output function is called with one list at a time, and  is  assumed  to
111       return a new output function. Any other return value is immediately re‐
112       turned as value of the current call to the sort or merge function. Each
113       output  function  is called exactly once. When some output function has
114       been applied to all of the results or an error occurs, the  last  func‐
115       tion  is called with argument close, and the reply is returned as value
116       of the current call to the sort or merge function.
117
118       If a function is specified as input and the last input function returns
119       {end_of_input,  Value}, the function specified as output is called with
120       argument {value, Value}. This makes it easy to initiate the sequence of
121       output functions with a value calculated by the input functions.
122
123       As  an  example, consider sorting the terms on a disk log file. A func‐
124       tion that reads chunks from the disk log and returns a list of binaries
125       is used as input. The results are collected in a list of terms.
126
127       sort(Log) ->
128           {ok, _} = disk_log:open([{name,Log}, {mode,read_only}]),
129           Input = input(Log, start),
130           Output = output([]),
131           Reply = file_sorter:sort(Input, Output, {format,term}),
132           ok = disk_log:close(Log),
133           Reply.
134
135       input(Log, Cont) ->
136           fun(close) ->
137                   ok;
138              (read) ->
139                   case disk_log:chunk(Log, Cont) of
140                       {error, Reason} ->
141                           {error, Reason};
142                       {Cont2, Terms} ->
143                           {Terms, input(Log, Cont2)};
144                       {Cont2, Terms, _Badbytes} ->
145                           {Terms, input(Log, Cont2)};
146                       eof ->
147                           end_of_input
148                   end
149           end.
150
151       output(L) ->
152           fun(close) ->
153                   lists:append(lists:reverse(L));
154              (Terms) ->
155                   output([Terms | L])
156           end.
157
158       For  more examples of functions as input and output, see the end of the
159       file_sorter module; the term format is implemented with functions.
160
161       The possible values of Reason returned when an error occurs are:
162
163         * bad_object, {bad_object, FileName} - Applying the  format  function
164           failed  for  some binary, or the key(s) could not be extracted from
165           some term.
166
167         * {bad_term, FileName} - io:read/2 failed to read some term.
168
169         * {file_error,  FileName,  file:posix()}  -  For  an  explanation  of
170           file:posix(), see file(3).
171
172         * {premature_eof, FileName} - End-of-file was encountered inside some
173           binary term.
174

DATA TYPES

176       file_name() = file:name()
177
178       file_names() = [file:name()]
179
180       i_command() = read | close
181
182       i_reply() =
183           end_of_input |
184           {end_of_input, value()} |
185           {[object()], infun()} |
186           input_reply()
187
188       infun() = fun((i_command()) -> i_reply())
189
190       input() = file_names() | infun()
191
192       input_reply() = term()
193
194       o_command() = {value, value()} | [object()] | close
195
196       o_reply() = outfun() | output_reply()
197
198       object() = term() | binary()
199
200       outfun() = fun((o_command()) -> o_reply())
201
202       output() = file_name() | outfun()
203
204       output_reply() = term()
205
206       value() = term()
207
208       options() = [option()] | option()
209
210       option() =
211           {compressed, boolean()} |
212           {header, header_length()} |
213           {format, format()} |
214           {no_files, no_files()} |
215           {order, order()} |
216           {size, size()} |
217           {tmpdir, tmp_directory()} |
218           {unique, boolean()}
219
220       format() = binary_term | term | binary | format_fun()
221
222       format_fun() = fun((binary()) -> term())
223
224       header_length() = integer() >= 1
225
226       key_pos() = integer() >= 1 | [integer() >= 1]
227
228       no_files() = integer() >= 1
229
230       order() = ascending | descending | order_fun()
231
232       order_fun() = fun((term(), term()) -> boolean())
233
234       size() = integer() >= 0
235
236       tmp_directory() = [] | file:name()
237
238       reason() =
239           bad_object |
240           {bad_object, file_name()} |
241           {bad_term, file_name()} |
242           {file_error,
243            file_name(),
244            file:posix() | badarg | system_limit} |
245           {premature_eof, file_name()}
246

EXPORTS

248       check(FileName) -> Reply
249
250       check(FileNames, Options) -> Reply
251
252              Types:
253
254                 FileNames = file_names()
255                 Options = options()
256                 Reply = {ok, [Result]} | {error, reason()}
257                 Result = {FileName, TermPosition, term()}
258                 FileName = file_name()
259                 TermPosition = integer() >= 1
260
261              Checks files for sortedness. If a file is not sorted, the  first
262              out-of-order  element  is returned. The first term on a file has
263              position 1.
264
265              check(FileName) is equivalent to check([FileName], []).
266
267       keycheck(KeyPos, FileName) -> Reply
268
269       keycheck(KeyPos, FileNames, Options) -> Reply
270
271              Types:
272
273                 KeyPos = key_pos()
274                 FileNames = file_names()
275                 Options = options()
276                 Reply = {ok, [Result]} | {error, reason()}
277                 Result = {FileName, TermPosition, term()}
278                 FileName = file_name()
279                 TermPosition = integer() >= 1
280
281              Checks files for sortedness. If a file is not sorted, the  first
282              out-of-order  element  is returned. The first term on a file has
283              position 1.
284
285              keycheck(KeyPos, FileName)  is  equivalent  to  keycheck(KeyPos,
286              [FileName], []).
287
288       keymerge(KeyPos, FileNames, Output) -> Reply
289
290       keymerge(KeyPos, FileNames, Output, Options) -> Reply
291
292              Types:
293
294                 KeyPos = key_pos()
295                 FileNames = file_names()
296                 Output = output()
297                 Options = options()
298                 Reply = ok | {error, reason()} | output_reply()
299
300              Merges  tuples on files. Each input file is assumed to be sorted
301              on key(s).
302
303              keymerge(KeyPos,   FileNames,   Output)   is    equivalent    to
304              keymerge(KeyPos, FileNames, Output, []).
305
306       keysort(KeyPos, FileName) -> Reply
307
308              Types:
309
310                 KeyPos = key_pos()
311                 FileName = file_name()
312                 Reply  =  ok | {error, reason()} | input_reply() | output_re‐
313                 ply()
314
315              Sorts tuples on files.
316
317              keysort(N, FileName) is  equivalent  to  keysort(N,  [FileName],
318              FileName).
319
320       keysort(KeyPos, Input, Output) -> Reply
321
322       keysort(KeyPos, Input, Output, Options) -> Reply
323
324              Types:
325
326                 KeyPos = key_pos()
327                 Input = input()
328                 Output = output()
329                 Options = options()
330                 Reply  =  ok | {error, reason()} | input_reply() | output_re‐
331                 ply()
332
333              Sorts tuples on files. The sort is performed on  the  element(s)
334              mentioned in KeyPos. If two tuples compare equal (==) on one el‐
335              ement, the next element according to  KeyPos  is  compared.  The
336              sort is stable.
337
338              keysort(N,  Input,  Output)  is  equivalent to keysort(N, Input,
339              Output, []).
340
341       merge(FileNames, Output) -> Reply
342
343       merge(FileNames, Output, Options) -> Reply
344
345              Types:
346
347                 FileNames = file_names()
348                 Output = output()
349                 Options = options()
350                 Reply = ok | {error, reason()} | output_reply()
351
352              Merges terms on files. Each input file is assumed to be sorted.
353
354              merge(FileNames, Output) is equivalent to merge(FileNames,  Out‐
355              put, []).
356
357       sort(FileName) -> Reply
358
359              Types:
360
361                 FileName = file_name()
362                 Reply  =  ok | {error, reason()} | input_reply() | output_re‐
363                 ply()
364
365              Sorts terms on files.
366
367              sort(FileName) is equivalent to sort([FileName], FileName).
368
369       sort(Input, Output) -> Reply
370
371       sort(Input, Output, Options) -> Reply
372
373              Types:
374
375                 Input = input()
376                 Output = output()
377                 Options = options()
378                 Reply = ok | {error, reason()} | input_reply()  |  output_re‐
379                 ply()
380
381              Sorts terms on files.
382
383              sort(Input, Output) is equivalent to sort(Input, Output, []).
384
385
386
387Ericsson AB                      stdlib 3.17.2                  file_sorter(3)
Impressum