1Sort::Maker(3) User Contributed Perl Documentation Sort::Maker(3)
2
3
4
6 Sort::Maker - A simple way to make efficient sort subs
7
9 use Sort::Maker ;
10
11 my $sorter = make_sorter( ... ) ;
12
14 This module has two main goals: to make it easy to create correct sort
15 functions, and to make it simple to select the optimum sorting
16 algorithm for the number of items to be sorted. Sort::Maker generates
17 complete sort subroutines in one of four styles, plain, orcish
18 manouver, Schwartzian Transform and the Guttman-Rosler Transform. You
19 can also get the source for a sort sub you create via the sorter_source
20 call.
21
23 The sub "make_sorter" is exported by Sort::Maker. It makes a sort sub
24 and returns a reference to it. You describe how you want it to sort its
25 input by passing general options and key descriptions to "make_sorter".
26
27 Arguments to "make_sorter"
28 There are two types of arguments, boolean and value. Boolean arguments
29 can be set just with the option name and can optionally be followed by
30 '1'. You can easily set multiple boolean general arguments with qw().
31 Value arguments must have a following value. Arguments can appear in
32 any order but the key descriptions (see below) must appear in their
33 sort order. The code examples below show various ways to set the
34 various arguments.
35
36 Arguments fall into four categories: selecting the style of the sort,
37 key descriptions, setting defaults for key description attributes, and
38 setting general flags and values. The following sections will describe
39 the categories and their associated arguments.
40
41 Sort Style
42 The style of the sort to be made is selected by setting one of the
43 following Boolean arguments. Only one may be set otherwise an error is
44 reported (see below for error handling). Also see below for detailed
45 descriptions of the supported sort styles.
46
47 plain
48 orcish
49 ST
50 GRT
51
52 # Make a plain sorter
53 my $plain_sorter = make_sorter( qw( plain ) ... ) ;
54
55 # Make an orcish manouevre sorter
56 my $orcish_sorter = make_sorter( orcish => 1 ... ) ;
57
58 # Make a Schwartzian Transform sorter
59 my $st_sorter = make_sorter( 'ST', 1, ... ) ;
60
61 # Make a GRT sort
62 my $GRT = make_sorter( 'GRT', ... ) ;
63
64 Key Attribute Defaults
65 The following arguments set defaults for the all of the keys'
66 attributes. These default values can be overridden in any individual
67 key. Only one of the attributes in each of the groups below can be set
68 as defaults or for any given key. If more than one attribute in each
69 group is set, then "make_sorter" will return an error. The attribute
70 that is the default for each group is marked. See below for details on
71 key attributes.
72
73 ascending (default)
74 descending
75
76 case (default)
77 no_case
78
79 signed
80 unsigned
81 signed_float (default)
82 unsigned_float
83
84 fixed
85 varying
86
87 General Options
88 These arguments set general options that apply to how the generated
89 sorter interacts with the outside world.
90
91 "name"
92
93 This is a value option which exports the generated sort sub to that
94 name. The call to "make_sorter" must be run to install the named named
95 function before it is called. You should still check the result of
96 "make_sorter" to see if an error occurred (it returns undef).
97
98 my $sorter = make_sorter( name => 'sort_func', ... ) ;
99 die "make_sorter: $@" unless $sorter ;
100
101 ...
102
103 @sorted = sort_func @unsorted ;
104
105 "ref_in/ref_out"
106
107 This boolean arguments specifies that the input to and output from the
108 sort sub will be array references. "ref_in" makes the sorter only take
109 as input a single array reference (which contains the unsorted
110 records). "ref_out" makes the sorter only return a single array
111 reference (which contains the sorted records). You can set both of
112 these options in a sorter.
113
114 Note: This does not affect key extraction code which still gets each
115 record in $_. It only modifies the I/O of the generated sorter.
116
117 # input records are in an array reference
118 my $sorter = make_sorter( qw( ref_in ), ... ) ;
119 @sorted_array = $sorter->( \@unsorted_input ) ;
120
121 # sorted output records are in an array reference
122 my $sorter = make_sorter( ref_out => 1, ... ) ;
123 $sorted_array_ref = $sorter->( @unsorted_input ) ;
124
125 # input and output records are in array references
126 my $sorter = make_sorter( qw( ref_in ref_out ), ... ) ;
127 $sorted_array_ref = $sorter->( \@unsorted_input ) ;
128
129 "string_data"
130
131 This boolean argument specifies that the input records will be plain
132 text strings with no null (0x0) bytes in them. It is only valid for
133 use with the GRT and it is ignored for the other sort styles. It tells
134 the GRT that it can put the record directly into the string cache and
135 it will be separated from the packed keys with a null byte (hence that
136 restriction). This is an optimization that can run slightly faster than
137 the normal index sorting done with the GRT. Run this to see the
138 benchmark results.
139
140 perl t/string_data.t -bench
141
142 "init_code"
143
144 This value argument is code that will be put into the beginning of the
145 generated sorter subroutine. It is meant to be used to declare lexical
146 variables that the extraction code can use. Normally different
147 extraction code have no way to share common code. By declaring lexicals
148 with the "init_code" option, some key extraction code can save data
149 there for use by another key. This is useful if you have two (or more)
150 keys that share a complex piece of code such as accessing a deep value
151 in a record tree.
152
153 For example, suppose the input record is an array of arrays of hashes
154 of strings and the string has 2 keys that need to be grabbed by a
155 regex. The string is a string key, a ':' and a number key. So the
156 common part of the key extraction is:
157
158 $_->[0][0]{a}
159
160 And the make_sorter call is:
161
162 my $sorter = make_sorter(
163 'ST',
164 init_code => 'my( $str, $num ) ;',
165 string => 'do{( $str, $num ) =
166 $_->[0][0]{a} =~ /^(\w+):(\d+)$/; $str}',
167 number => '$num'
168 ) ;
169
170 In the above code both keys are extracted in the first key extraction
171 code and the number key is saved in $num. The second key extraction
172 code just uses that saved value.
173
174 Note that "init_code" is only useful in the ST and GRT sort styles as
175 they process all the keys of a record at one time and can use variables
176 declared in "init_code" to transfer data to later keys. The plain and
177 orcish sorts may not process a later key at the same time as an earlier
178 key (that only happens when the earlier key is compared to an equal
179 key). Also for "init_code" to be a win, the data set must be large
180 enough and the work to extract the keys must be hard enough for the
181 savings to be noticed. The test init_code.t shows some examples and you
182 can see the speedup when you run:
183
184 perl t/init_code.t -bench
185
186 Key Description Arguments
187 Sorting data requires that records be compared in some way so they can
188 be put into a proper sequence. The parts of the records that actually
189 get compared are called its keys. In the simplest case the entire
190 record is the key, as when you sort a list of numbers or file names.
191 But in many cases the keys are embedded in the full record and they
192 need to be extracted before they can be used in comparisons.
193 Sort::Maker uses key descriptions that extract the key from the record,
194 and optional other attributes that will help optimize the sorting
195 operation. This section will explain how to pass key description
196 arguments to the make_sorter subroutine and what the various attributes
197 mean and how to best use them.
198
199 The generated sorter will sort the records according to the order of
200 the key arguments. The first key is used to compare a pair of records
201 and if they are deemed equal, then the next key is examined. This
202 happens until the records are given an ordering or you run out of keys
203 and the records are deemed equal in sort order. Key descriptions can
204 be mixed with the other arguments which can appear in any order and
205 anywhere in the argument list, but the keys themselves must be in the
206 desired order.
207
208 A key argument is either 'string' or 'number' followed by optional
209 attributes. The key type sets the way that the key is compared (e.g.
210 using 'cmp' or '<=>'). All key attributes can be set from the default
211 values in the global arguments or set in each individual key
212 description.
213
214 There are 4 ways to provide attributes to a key:
215
216 No attributes
217
218 A key argument which is either at the end of the argument list or is
219 followed by a valid keyword token has no explict attributes. This key
220 will use the default attributes. In both of these examples, a default
221 attribute was set and used by the key description which is just a
222 single key argument.
223
224 # sort the record as a single number in descending order
225 my $sorter = make_sorter( qw( plain number descending ) ) ;
226
227 # sort the record as a case sensitive string
228 my $sorter = make_sorter( qw( plain case string ) ) ;
229
230 # sort the record as a single number in ascending order
231 my $sorter = make_sorter( qw( ST number ) ) ;
232
233 Only Code as a Value
234
235 A key argument which is followed by a scalar value which is not a valid
236 keyword token, will use that scalar value as its key extraction code.
237 See below for more on key extraction code.
238
239 # sort by the first (optionally signed) number matched
240 my $sorter = make_sorter( qw( plain number /([+-]?\d+)/ ) ) ;
241
242 # string sort by the 3rd field in the input records (array refs)
243 my $sorter = make_sorter( 'ST', string => '$_->[2]' ) ;
244
245 An Array Reference
246
247 A key argument which is followed by an array reference will parse that
248 array for its description attributes. As with the general boolean
249 arguments, any boolean attribute can be optionally followed by a '1'.
250 Value attributes must be followed by their value.
251
252 # another way to specify the same sort as above
253 # sort by the first (optionally signed) number matched
254
255 my $sorter = make_sorter(
256 qw( plain ),
257 number => [
258 code => '/(\d+)/',
259 'descending',
260 ],
261 ) ;
262
263 # same sort but for the GRT which uses the 'unsigned'
264 # attribute to optimize the sort.
265
266 my $sorter = make_sorter(
267 qw( GRT ),
268 number => [
269 qw( descending unsigned ),
270 code => '/(\d+)/',
271 ],
272 ) ;
273
274 A Hash Reference
275
276 A key argument which is followed a hash reference will use that hash as
277 its description attributes. Any boolean attribute in the hash must have
278 a value of '1'. Value attributes must be followed by their value.
279
280 # another way to specify the same sort as above
281 # sort by the first (optionally signed) number matched
282
283 my $sorter = make_sorter(
284 qw( plain ),
285 number => {
286 code => '/(\d+)/',
287 descending => 1,
288 },
289 ) ;
290
291 # a multi-key sort. the first key is a descending unsigned
292 # integer and the second is a string padded to 10 characters
293
294 my $sorter = make_sorter(
295 qw( GRT ),
296 number => {
297 code => '/(\d+)/',
298 descending => 1,
299 unsigned => 1,
300 },
301 string => {
302 code => '/FOO<(\w+)>/',
303 fixed => 10,
304 },
305 ) ;
306
307 Key Description Attributes
308 What follows are the attributes for key descriptions. Most use the
309 default values passed in the arguments to "make_sorter".
310
311 "code"
312
313 This value attribute is the code that will be used to extract a key
314 from the input record. It can be a string of Perl code, a qr// regular
315 expression (Regexp reference) or an anonymous sub (CODE reference) that
316 operates on $_ and extracts a value. The code will be wrapped in a
317 do{} block and called in a list context so that regular expressions can
318 just use () to grab a key value. The code defaults to $_ which means
319 the entire record is used for this key. You can't set the default for
320 code (unlike all the other key attributes). See the section on
321 Extraction Code for more.
322
323 # make an ST sort of the first number grabbed in descending order
324
325 my $sorter = make_sorter(
326 qw( ST ),
327 number => {
328 code => '/(\d+)/',
329 descending => 1,
330 },
331 ) ;
332
333 "ascending/descending"
334
335 These two Boolean attributes control the sorting order for this key. If
336 a key is marked as "ascending" (which is the initial default for all
337 keys), then lower keys will sort before higher keys. "descending" sorts
338 have the higher keys sort before the lower keys. It is illegal to have
339 both set in the defaults or in any key.
340
341 # sort by descending order of the first grabbed number
342 # and then sort in ascending order the first grabbed <word>
343
344 my $sorter = make_sorter(
345 qw( ST descending ),
346 number => {
347 code => '/(\d+)/',
348 },
349 string => {
350 code => '/<(\w+)>/',
351 ascending => 1,
352 },
353 ) ;
354
355 # this will return undef and store an error in $@.
356 # you can't have both 'ascending' and 'descending' as defaults
357
358 my $sorter = make_sorter(
359 qw( ST ascending descending ),
360 number => {
361 code => '/(\d+)/',
362 descending => 1,
363 },
364 ) ;
365
366 # this will return undef and store an error in $@.
367 # you can't have both 'ascending' and 'descending' in a key
368
369 my $sorter = make_sorter(
370 qw( ST )
371 number => {
372 code => '/(\d+)/',
373 descending => 1,
374 ascending => 1,
375 },
376 ) ;
377
378 "case/no_case"
379
380 These two Boolean attributes control how 'string' keys handle case
381 sensitivity. If a key is marked as "case" (which is the initial default
382 for all keys), then keys will treat upper and lower case letters as
383 different. If the key is marked as "no_case" then they are treated as
384 equal. It is illegal to have both set in the defaults or in any key.
385 Internally this uses the uc() function so you can use locale settings
386 to affect string sorts.
387
388 # sort by the first grabbed word with no case
389 # and then sort the grabbed <word> with case
390
391 my $sorter = make_sorter(
392 qw( ST no_case ),
393 string => {
394 code => '/(\w+)/',
395 },
396 string => {
397 code => '/<(\w+)>/',
398 case => 1,
399 },
400 ) ;
401
402 # this will return undef and store an error in $@.
403 # you can't have both 'case' and 'no_case' as defaults
404
405 my $sorter = make_sorter(
406 qw( ST no_case case ),
407 string => {
408 code => '/(\w+)/',
409 },
410 ) ;
411
412 # this will return undef and store an error in $@.
413 # you can't have both 'case' and 'no_case' in a key
414
415 my $sorter = make_sorter(
416 qw( ST )
417 string => {
418 code => '/(\w+)/',
419 no_case => 1,
420 case => 1,
421 },
422 ) ;
423
424 "closure"
425
426 This Boolean attribute causes this key to use call its CODE reference
427 to extract its value. This is useful if you need to access a lexical
428 variable during the key extraction. A typical use would be if you have
429 a sorting order stored in a lexical and need to access that from the
430 extraction code. If you didn't set the "closure" attribute for this
431 key, the generated source (see Key Extraction) would not be able to see
432 that lexical which will trigger a Perl compiling error in make_sorter.
433
434 my @months = qw(
435 January February March April May June
436 July August September October November December ) ;
437 my @month_jumble = qw(
438 February June October March January April
439 July November August December May September ) ;
440
441 my %month_to_num ;
442 @month_to_num{ @months } = 1 .. @months ;
443
444 # this will fail to generate a sorter if 'closure' is removed # as
445 %month_to_num will not be in scope to the eval inside sort_maker.
446
447 my $sorter = make_sorter(
448 'closure',
449 number => sub { $month_to_num{$_} },
450 ) ;
451
452 my @sorted = $sorter->( @month_jumble ) ;
453
454 "signed/unsigned/signed_float/unsigned_float" (GRT only)
455
456 These Boolean attributes are only used by the GRT sort style. They are
457 meant to describe the type of a number key so that the GRT can best
458 process and cache the key's value. It is illegal to have more than one
459 of them set in the defaults or in any key. See the section on GRT
460 sorting for more.
461
462 The "signed" and "unsigned" attributes mark this number key as an
463 integer. The GRT does the least amount of work processing an unsigned
464 integer and only slightly more work for a signed integer. It is worth
465 using these attributes if a sort key is restricted to integers.
466
467 The "signed_float" (which is the normal default for all keys) and
468 "unsigned_float" attributes mark this number key as a float. The GRT
469 does the less work processing an unsigned float then a signed float.
470 It is worth using the "unsigned_float" attribute if a sort key is
471 restricted to non-negative values. The "signed_float" attribute is
472 supported to allow overriding defaults and to make it easier to auto-
473 generate sorts.
474
475 "fixed/varying" (GRT only)
476
477 These attributes are only used by the GRT sort style. They are used to
478 describe the type of a string key so that the GRT can properly process
479 and cache the key's value. It is illegal to have more than one of them
480 set in the defaults or in any key. See the section on GRT sorting for
481 more.
482
483 "fixed" is a value attribute that marks this string key as always being
484 this length. The extracted value will either be padded with null (0x0)
485 bytes or truncated to the specified length (the value of "fixed"). The
486 data in this key may have embedded null bytes (0x0) and may be sorted
487 in descending order.
488
489 "varying" is a Boolean attribute marks this string key as being of
490 varying lengths. The GRT sorter will do a scan of all of this key's
491 values to find the maximum string length and then it pads all the
492 extracted values to that length. The data in this key may have embedded
493 null bytes (0x0) and may be sorted in descending order.
494
495 Key Extraction Code
496 Each input record must have its sort keys extracted from the data.
497 This is the purpose of the 'code' attribute in key descriptions. The
498 code has to operate on a record which is in $_ and it must return the
499 key value. The code is executed in a list context so you can use grabs
500 in m// to return the key. Note that only the first grab will be used
501 but you shouldn't have more than one anyway. See the examples below.
502
503 Code can be either a string, a qr// object (Regexp reference) or an
504 anonymous sub (CODE reference).
505
506 If qr// is used, the actual generated code will be m($qr) which works
507 because qr// will interpolate to its string representation. The
508 advantage of qr// over a string is that the qr// will be syntax checked
509 at compile time while the string only later when the generated sorter
510 is compiled by an eval.
511
512 If a CODE reference is found, it is used to extract the key in the
513 generated sorter. As with qr//, the advantage is that the extraction
514 code is syntax checked at compile time and not runtime. Also the
515 deparsed code is wrapped in a "do{}" block so you may use complex code
516 to extract the key. In the default case a CODE reference will be
517 deparsed by the B::Deparse module into Perl source. If the key has the
518 "closure" attribute set, the code will be called to extract the key.
519
520 The following will generate sorters with exact same behavior:
521
522 $sorter = make_sorter( 'ST', string => '/(\w+)/' ) ;
523 $sorter = make_sorter( 'ST', string => qr/(\w+)/ ) ;
524 $sorter = make_sorter( 'ST', string => sub { /(\w+)/ } ) ;
525 $sorter = make_sorter( 'ST', 'closure', string => sub { /(\w+)/ } ) ;
526
527 Extraction code for a key can be set in one of three ways.
528
529 No explicit code
530
531 If you don't pass any extraction code to a key, it will default to $_
532 which is the entire record. This is useful in certain cases such as in
533 simple sorts where you are sorting the entire record.
534
535 # sort numerically and in reverse order
536 my $sorter = make_sorter( qw( plain number descending ) ;
537
538 # sort with case folding
539 my $sorter = make_sorter( qw( plain no_case string ) ;
540
541 # sort by file time stamp and then by name
542 my $sorter = make_sorter( 'ST', number => '-M', 'string' ) ;
543
544 Code is the only key attribute
545
546 In many cases you don't need to specify any specific key attributes
547 (the normal or globally set defaults are fine) but you need extraction
548 code. If the argument that follows a key type ( 'string' or 'number' )
549 is not a valid keyword, it will be assumed to be the extraction code
550 for that key.
551
552 # grab the first number string as the key
553 my $sorter = make_sorter( qw( plain number /(\d+)/ ) ) ;
554
555 # no_case string sort on the 3rd-5th chars of the 2nd array element
556 my $sorter = make_sorter(
557 plain => 1,
558 no_case => 1,
559 string => 'substr( $_->[1], 2, 3)'
560 ) ;
561
562 Key needs specific attributes
563
564 When the key needs to have its own specific attributes other than its
565 code, you need to pass them in an ARRAY or HASH reference. This is
566 mostly needed when there are multiple keys and the defaults are not
567 correct for all the keys.
568
569 # string sort by the first 3 elements of the array record with
570 # different case requirements
571
572 my $sorter = make_sorter(
573 ST => 1,
574 string => {
575 code => '$_->[0]',
576 no_case => 1,
577 },
578 string => '$_->[1]',
579 string => {
580 code => '$_->[2]',
581 no_case => 1,
582 },
583 ) ;
584
585 # GRT sort with multiple unsigned integers and padded strings
586 # note that some keys use a hash ref and some an array ref
587 # the record is marked with key\d: sections
588 my $sorter = make_sorter(
589 GRT => 1,
590 descending => 1,
591 number => {
592 code => 'key1:(\d+)',
593 unsigned => 1,
594 },
595 number => [
596 code => 'key2:([+-]?\d+)',
597 qw( signed ascending ),
598 ],
599 string => [
600 code => 'key3:(\w{10})',
601 fixed => 1,
602 ascending => 1,
603 ],
604 # pad the extracted keys to 8 chars
605 string => {
606 code => 'key4:([A-Z]+)',
607 pad => 8,
608 },
609 ) ;
610
612 A good question to ask is "What speed advantages do you get from this
613 module when all the sorts generated use Perl's internal sort function?"
614 The sort function has a O( N * log N ) growth function which means that
615 the amount of work done increases by that formula as N (the number of
616 input records) increases. In a plain sort this means the the key
617 extraction code is executed N * log N times when you only have N
618 records. That can be a very large waste of cpu time. So the other three
619 sort styles speed up the overall sort by only executing the extraction
620 code N times by caching the extracted keys. How they cache the keys is
621 their primary difference. To compare or study the actual code generated
622 for the different sort styles, you can run make_sorter and just change
623 the style. Then call sorter_source (not exported by default) and pass
624 it the sort code reference returned by make_sorter. It will return the
625 generated sort source.
626
627 "plain"
628 Plain sorting doesn't do any key caching. It is fine for short input
629 lists (see the Benchmark section) and also as a way to see how much CPU
630 is saved when using one of the other styles.
631
632 "orcish"
633 The Orcish maneuvre (created by Joseph Hall) caches the extracted keys
634 in a hash. It does this with code like this:
635
636 $cache{$a} ||= CODE($a) ;
637
638 CODE is the extract code and it operates on a record in $a. If we have
639 never seen this record before then the cache entry will be undef and
640 the ||= operator will assign the extracted key to that hash slot. The
641 next time this record is seen in a comparison, the saved extracted key
642 will be found in the hash and used. The name orcish comes from OR-
643 cache.
644
645 "ST"
646 The ST (Schwartzian Transform and popularized by Randal Schwartz) uses
647 an anonymous array to store the record and its extracted keys. It first
648 executes a map that creates an anonymous array:
649
650 map [ $_, CODE1( $_ ), CODE2( $_ ) ], @input
651
652 The CODE's extract the set of keys from the record but only once per
653 record so it is O(N). Now the sort function can just do the comparisons
654 and it returns a list of sorted anonymous arrays.
655
656 sort {
657 $a->[1] cmp $b->[1]
658 ||
659 $a->[2] cmp $b->[2]
660 }
661
662 Finally, we need to get back the original records which are in the
663 first slot of the anonymous array:
664
665 map $_->[0]
666
667 This is why the ST is known as a map/sort/map technique.
668
669 "GRT"
670 The Guttman-Rosler Transform (popularized by Uri Guttman and Larry
671 Rosler) uses a string to cache the extracted keys as well as either the
672 record or its index. It is also a map/sort/map technique but because
673 its cache is a string, it can be sorted without any Perl level callback
674 (the {} block passed to sort). This is a signifigant win since that
675 callback is running O( N log N). But this speedup comes at a cost of
676 complexity. You can't just join the keys into a string and properly
677 sort them. Each key may need to be processed so that it will correctly
678 sort in order and it doesn't interfere with other keys. That is why the
679 GRT has several key attributes to enable it to properly and efficiently
680 pack the sort keys into a single string. The following lists the GRT
681 key attributes, when you need them and what key processing is done for
682 each. Note that you can always enable the GRT specific attributes as
683 they are just ignored by the other sort styles.
684
685 The GRT gains its speed by using a single byte string to cache all of
686 the extracted keys from a given input record. Packing keys into a
687 string such that it will lexically sort the correct way requires some
688 deep mojo and data munging. But that is why this module was written -
689 to hide all that from the coder. Below are descriptions of how the
690 various key types are packed and how to best use the GRT specific key
691 attributes. Note: you can only use one of the GRT number or string
692 attributes for any key. Setting more than one in either the defaults or
693 in any given key is an error (a key's attribute can override a default
694 choice).
695
696 "unsigned"
697
698 The 'unsigned' Boolean attribute tells the GRT that this number key is
699 a non-negative integer. This allows the GRT to just pack it into 4
700 bytes using the N format (network order - big endian). An integer
701 packed this way will have its most significant bytes compared before
702 its least signifigant bytes. This involves the least amount of key
703 munging and so it is the most efficient way to sort numbers in the GRT.
704
705 If you want this key to sort in descending order, then the key value is
706 negated and normalized (see the 'signed' attribute) so there is no
707 advantage to using 'unsigned'.
708
709 "signed"
710
711 The 'signed' Boolean attribute tells the GRT that this number key is an
712 integer. This allows the GRT to just pack it into 4 bytes using the N
713 format (network order - big endian). The key value must first be
714 normalized which will convert it to an unsigned integer but with the
715 same ordering as a signed integer. This is simply done by inverting the
716 sign (highest order) bit of the integer. As mentioned above, when
717 sorting this key in descending order, the GRT just negates the key
718 value.
719
720 NOTE: In the GRT the signed and unsigned integer attributes only work
721 on perl built with 32 bit integers. This is due to using the N format
722 of pack which is specified to be 32 bits. A future version may support
723 64 bit integers (anyone want to help?).
724
725 "unsigned_float"
726
727 The 'unsigned_float' Boolean attribute tells the GRT that this number
728 key is a non-negative floating point number. This allows the GRT to
729 pack it into 8 bytes using the 'd' format. A float packed this way will
730 have its most significant bytes compared before its least signifigant
731 bytes.
732
733 "signed_float"
734
735 The "signed_float" Boolean attribute (which is the default for all
736 number keys when using the GRT) tells the GRT that this number key is a
737 floating point number. This allows the GRT to pack it into 8 bytes
738 using the 'd' format. A float packed this way will have its most
739 significant bytes compared before its least signifigant bytes. When
740 processed this key will be normalized to an unsigned float similar to
741 to the "signed" to "unsigned" conversion mentioned above.
742
743 NOTE: The GRT only works with floats that are in the IEEE format for
744 doubles. This includes most modern architectures including x86, sparc,
745 powerpc, mips, etc. If the cpu doesn't have IEEE floats you can either
746 use the integer attributes or select another sort style (all the others
747 have no restriction on float formats).
748
749 simple string.
750
751 If a string key is being sorted in ascending order with the GRT and it
752 doesn't have one of the GRT string attributes, it will be packed
753 without any munging and a null (0x0) byte will be appended to it. This
754 byte enables a shorter string to sort before longer ones that start
755 with the shorter string.
756
757 NOTE: You cannot sort strings in descending order in the GRT unless the
758 key has either the 'fixed' or 'varying' attributes set. Also, if a
759 string is being sorted in ascending order but has any null (0x0) bytes
760 in it, the key must have one of those attributes set.
761
762 "fixed"
763
764 This value attribute tells the GRT to pack this key value as a fixed
765 length string. The extracted value will either be padded with null
766 (0x0) bytes or truncated to the specified length (the value of the
767 "fixed" attribute). This means it can be packed into the cache string
768 with no padding and no trailing null byte is needed. The key can
769 contain any data including null (0x0) bytes. Data munging happens only
770 if the key's sort order is descending. Then the key value is xor'ed
771 with a same length string of 0xff bytes. This toggles each bit which
772 allows for a lexical comparison but in the reverse order. This same bit
773 inversion is used for descending varying strings.
774
775 "varying"
776
777 This Boolean attribute tells the GRT that this key value is a varying
778 length string and has no predetermined padding length. A prescan is
779 done to determine the maximum string length for this key and that is
780 used as the padding length. The rest is the same as with the 'fixed'
781 attribute.
782
783 "sorter_source"
784 This sub (which can be exported) returns the source of a generated sort
785 sub or the source of the last one that had an error. To get the source
786 of an existing sort sub, pass it a reference to that sub (i.e. the
787 reference returned from make_sorter). To get the source for a failed
788 call to make_sorter, don't pass in any arguments.
789
790 my $sorter = make_sorter( ... ) ;
791 print sorter_source( $sorter ) ;
792
793 make_sorter( name => 'my_sorter', ... ) ;
794 print sorter_source( \&my_sorter ) ;
795
796 my $sorter = make_sorter( ... )
797 or die "make_sorter error: $@\n", sorter_source();
798
799 If all you want is the generated source you can just do:
800
801 print sorter_source make_sorter( ... ) ;
802
803 Error Handling
804 When "make_sorter" detects an error (either bad arguments or when the
805 generated sorter won't compile), it returns undef and set $@ to an
806 error message. The error message will include the generated source and
807 compiler and warning errors if the sorter didn't compile correctly.
808 The test t/errors.t covers all the possible error messages. You can
809 also retrieve the generated source after a compiling error by calling
810 "sorter_source".
811
813 "Sort::Maker" uses a table of test configurations that can both run
814 tests and benchmarks. Each test script is mostly a table that generates
815 multiple versions of the sorters, generate sample data and compares the
816 sorter results with a sort that is known to be good. If you run the
817 scripts directly and with a -bench argument, then they generate the
818 same sorter subs and benchmark them. This design ensures that
819 benchmarks are running on correctly generated code and it makes it very
820 easy to add more test and benchmark variations. The code that does all
821 the work is in t/common.pl. Here is a typical test table entry:
822
823 {
824 skip => 0,
825 source => 0,
826 name => 'init_code',
827 gen => sub { rand_choice( @string_keys ) . ':' .
828 rand_choice( @number_keys ) },
829 gold => sub {
830 ($a =~ /^(\w+)/)[0] cmp ($b =~ /^(\w+)/)[0]
831 ||
832 ($a =~ /(\d+$)/)[0] <=> ($b =~ /(\d+$)/)[0]
833 },
834 args => [
835 init_code => 'my( $str, $num ) ;',
836 string => 'do{( $str, $num ) = /^(\w+):(\d+)$/; $str}',
837 number => '$num',
838 ],
839 },
840
841 "skip" is a boolean that causes this test/benchmark to be skipped.
842 Setting "source" causes the sorter's source to be printed out. "gen"
843 is a sub that generates a single input record. There are support subs
844 in t/common.pl that will generate random data. Some tests have a "data"
845 field which is fixed data for a test (instead of the generated data).
846 The <gold> field is a comparision subroutine usable by the sort
847 function. It is used to sort the test data into a golden result which
848 is used to compare against all the generated sorters. "args" is an
849 anonymous array of arguments for a sorter or a hash ref with multiple
850 named/args pairs. See t/io.t for an example of that.
851
854 This module always exports the "make_sorter" sub. It can also
855 optionally export "sorter_source".
856
858 Sort::Maker GRT currently works only with 32 bit integers due to pack N
859 format being exactly 32 bits. If someone with a 64 bit Perl wants to
860 work on using the Q format or the ! suffix and dealing with endian
861 issues, I will be glad to help and support it. It would be best if
862 there was a network (big endian) pack format for quads/longlongs but
863 that can be done similarly to how floats are packed now.
864
866 Uri Guttman, <uri@stemsystems.com>
867
869 I would like to thank the inimitable Damian Conway for his help in the
870 API design, the POD, and for being a good Perl friend.
871
872 And thanks to Boston.pm for the idea of allowing qr// for key
873 extraction code.
874
875
876
877perl v5.36.0 2022-07-22 Sort::Maker(3)