MSORT(1) User Commands MSORT(1)

msort - sort records in complex ways

msort <options> [<input file>]

msort is a program for sorting text files in sophisticated ways. It
was developed initially for alphabetizing dictionaries of languages in
which the ordering may be quite different from that of English, but it
has many other uses.

msort allows you to sort blocks of text delimited in a number of ways,
rather than just lines, and to specify particular fields of a record as
sort keys, either by their position, counted from either end, or by
matching regular expressions to their tags.
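
As a rough illustration of block records and tagged fields, the following Python sketch splits text into records at runs of two or more newlines and sorts them on the field whose tag matches a regular expression. It is not msort's implementation. The "tag: value" field layout, the sample data, and the helper name tagged_field are assumptions made for the example.

```python
import re

# Sample input: three records separated by two or more newlines.
# The "tag: value" layout of each field is an assumption for this sketch.
text = (
    "word: zebra\npos: noun\n\n"
    "word: apple\npos: noun\n\n\n"
    "word: mango\npos: noun\n"
)

# Split into block records at runs of two or more newlines.
records = [block for block in re.split(r"\n{2,}", text.strip("\n")) if block]


def tagged_field(record, tag_pattern):
    """Return the value of the first field whose tag matches tag_pattern."""
    for line in record.split("\n"):
        tag, _, value = line.partition(":")
        if re.fullmatch(tag_pattern, tag.strip()):
            return value.strip()
    return None  # the record has no such field


# Sort the records on the field tagged "word".
sorted_records = sorted(records, key=lambda r: tagged_field(r, r"word") or "")
```

Each record stays intact as a multi-line block; only the order of the blocks changes.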

msort is capable of sorting on multiple keys, so that when two records
tie on one key, the tie may be broken on another. Any or all keys may
be optional. How absent optional keys are ordered with respect to
present keys may be set separately for each key.
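
The idea of tie-breaking with an optional key can be sketched in Python as follows. This is only an illustration of the concept, not msort's code. The record layout and the names compare_optional and compare_records are hypothetical.

```python
from functools import cmp_to_key


def compare_optional(a, b, absent_order):
    """Compare two key values, either of which may be None (absent).

    absent_order is -1, 0, or 1 and says how an absent key compares
    to a present one (before, equal, or after).
    """
    if a is None and b is None:
        return 0
    if a is None:
        return absent_order
    if b is None:
        return -absent_order
    return (a > b) - (a < b)


def compare_records(r1, r2):
    # First key: surname, always present in this example.
    result = compare_optional(r1["surname"], r2["surname"], 0)
    if result != 0:
        return result
    # Second key breaks ties: year is optional, and an absent year
    # is ordered before any present year.
    return compare_optional(r1.get("year"), r2.get("year"), -1)


records = [
    {"surname": "Poser", "year": 2010},
    {"surname": "Adams", "year": 1999},
    {"surname": "Poser"},
    {"surname": "Adams", "year": 1987},
]
sorted_records = sorted(records, key=cmp_to_key(compare_records))
```

Passing 1 instead of -1 for the second key would place records without a year after those that have one.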

msort allows you to specify arbitrary sort orders and to define
virtually unlimited numbers of multigraphs of effectively unlimited
length. The sort order and multigraphs are defined separately for each
key. If your system has locale support, you can also use locale
collation rules instead of specifying your own sort order.

msort provides twelve types of key comparison: lexicographic, numeric,
numeric string, hybrid, by string length, by angle, by date, by domain
name, by time, by ISO8601 date/time stamp, by month name, and random.

Which month names are recognized depends on the -s flag and on locale
support. If the -s flag is used on the same key and its argument is the
name of a file, the month names are read from the file, which should be
in the same format as a sort order definition file. If the -s flag is
used and its argument is a locale name, the month names recognized will
be the month names and abbreviations associated with the specified
locale. If the -s flag is not used, the month names recognized will be
the month names and abbreviations associated with the current locale.
If your system does not have locale support and you do not use the -s
flag to read the month names from a file, the month names recognized
will be the English month names and abbreviations.

msort can reverse the characters in a key, allowing it to be used to
generate reverse dictionaries.
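
A reverse dictionary orders words by their endings, so that words sharing a suffix are listed together. A minimal Python sketch of the idea, using an invented ASCII word list and making no claim about how msort implements key reversal:

```python
# Invented sample data; plain ASCII, so reversing by character is safe.
words = ["sorting", "banana", "running", "cabana", "apple"]

# Sort on the reversed spelling of each word, but keep the word itself.
reverse_dictionary = sorted(words, key=lambda w: w[::-1])
```

In the result, "cabana" and "banana" are adjacent, as are "running" and "sorting", because each pair shares an ending.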

A choice of sorting algorithms is provided.

msort fully supports Unicode. The text to be sorted, and all
specifications, should be in UTF-8 Unicode. (If you have plain ASCII
text, this is not a problem, as plain ASCII is also valid UTF-8.) Full
Unicode case-folding is available, in Turkic and non-Turkic variants.
Unicode normalization is performed before sorting.
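
The difference between the two case-folding variants can be illustrated with a short Python sketch. It is not msort's implementation. It assumes that Python's str.casefold stands in for the non-Turkic folding, and fold_turkic is a hypothetical helper that adds only the dotted and dotless I conversions.

```python
def fold_default(text):
    """Non-Turkic full case folding, as provided by Python."""
    return text.casefold()


def fold_turkic(text):
    """Case folding with the Turkic treatment of the letter I.

    Capital I folds to dotless i (U+0131) and capital dotted I
    (U+0130) folds to plain i; everything else folds as usual.
    """
    text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.casefold()


default_key = fold_default("ISTANBUL")
turkic_key = fold_turkic("ISTANBUL")
```

With the non-Turkic folding the key begins with a plain i, while the Turkic folding yields a dotless i in the same position, so the two variants can order the same record differently.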

For usage information, execute msort with no arguments.

Full information about msort is currently to be found in the reference
manual, which is distributed as a PDF (Portable Document Format) file.
If a copy is not available locally, you can download it from msort's
home page:
http://billposer.org/Software/msort.html

Informational options
-h,--help
    Print usage message

-v,--version
    Print version message

-D,--defaults
    List defaults

-F,--general-options
    List general command line options

-G,--gnu-equivalences
    List equivalents for GNU sort command line options.

-H,--informational-options
    List informational command line options

-K,--key-specific-options
    List key-specific command line options

-L,--limits
    List limits

-N,--number-systems
    List the supported number systems.

General options
-b,--block
    A record is terminated by two or more newlines

-l,--line
    A record consists of a single line

-r,--record-separator <separator>
    A record is terminated by the specified separator character

-O,--fixed-size-record <bytes>
    A record consists of the specified number of bytes.

-d,--field-separators <character>+
    Fields are delimited by the named character(s)

-w,--whole
    Sort on the entire text of the record

-a,--algorithm <algorithm>
    Use the specified sort algorithm. The choices are:
    I(nsertionSort), M(ergeSort), Q(uickSort), and S(hellSort). Note
    that InsertionSort and MergeSort are stable, while QuickSort and
    ShellSort are unstable. The default is QuickSort.

-M,--initial-maximum-records <records>
    Set initial maximum number of records

-m,--line-end-carriage-return
    End-of-line in the input data is marked by Carriage Return
    (0x0D), as on the Macintosh, rather than by Line Feed (0x0A), as
    on Unix systems.

-I,--invert-globally
    Invert sense of comparisons globally

-B,--BMP
    No characters fall outside the Basic Multilingual Plane (that is,
    none has a value greater than 0xFFFF).

-Z,--skip-first-record
    Copy the first record in the input to the output without sorting
    it. This is useful for sorting files with a header.

-p,--reserve-private-use-area
    Do not make internal use of the Private Use areas. By default,
    multigraphs are assigned internally to codepoints in the
    Supplementary Private Use areas if full Unicode is in use, or to
    codepoints in the Private Use area if input is restricted to the
    Basic Multilingual Plane by means of the -B option. If your input
    makes use of the Private Use areas, this option prevents
    interference with your input. In this case, multigraphs will be
    assigned to the Low and High Surrogate areas (0xD800-0xDFFF).
    Note that this limits the number of multigraphs to 2,048.

-P,--random-seed <seed>
    Set the seed for the random number generator. If not set here,
    it is set to a value determined by the time. The seed used is
    reported in the log. This option allows runs to be replicated.

-Q,--check-only
    Check whether the input is already sorted. Do not generate any
    output. Exit status is 0 if input is already sorted, 11 if not
    sorted.

-1,--in <input file name>

-2,--out <output file name>
    If the output file is the same as the input file, the input file
    will be overwritten. The input file will not be overwritten if
    the run is unsuccessful.

-j,--suppress-log
    Suppress output to the log. If this flag is given before there
    is any output to the log from a command line flag, nothing will
    be written to the log and the log file will not be created. If a
    command line flag generates a log message before this flag is
    processed, the log file will be created but no log messages will
    be written to it once this flag is processed. To guarantee that
    no attempt will be made to open a log file, give this flag
    first.

-q,--quiet
    Be quiet - do not chat while working

-u,--unicode-normalization <mode>
    Select Unicode normalization mode. The choices of mode are: c
    for normalization form C (NFC), d for normalization form D
    (NFD), C for normalization form KC (NFKC), D for normalization
    form KD (NFKD), and n for no normalization. The default is NFC.
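
The effect of the normalization modes listed for -u can be tried out with Python's standard unicodedata module. The sketch below only mirrors the mode letters documented above. The table MODE_TO_FORM and the function normalize are names made up for the example and say nothing about msort's internals.

```python
import unicodedata

# Mode letters as documented for -u, mapped to Unicode normalization forms.
MODE_TO_FORM = {"c": "NFC", "d": "NFD", "C": "NFKC", "D": "NFKD", "n": None}


def normalize(text, mode="c"):
    """Normalize text according to a one-letter mode; 'n' leaves it alone."""
    form = MODE_TO_FORM[mode]
    return text if form is None else unicodedata.normalize(form, text)


composed = "\u00e9"      # e with acute accent as a single codepoint
decomposed = "e\u0301"   # e followed by a combining acute accent

# Without normalization the two spellings are different strings.
same_without = normalize(composed, "n") == normalize(decomposed, "n")
# After NFC normalization they compare equal.
same_with_nfc = normalize(composed, "c") == normalize(decomposed, "c")
```

This is why normalizing before comparison matters: two records that look identical on screen may otherwise be ordered as if they were different.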

Key specific options
-e,--character-range <m,n>
    Sort on characters m through n. Positive indices start from one.
    Negative indices indicate position with respect to the end of
    the record. For example, the range 3,-2 consists of the third
    character through the next-to-last character.

-n,--position <POS>(,<POS>)
    Sort on the specified POS or contiguous range of POSs, where a
    POS is of the form <field number>(.<character number>). Both
    counts begin at one. Field numbers, but not character numbers,
    may be negative, in which case they are counted from the right.
    Thus, 1.2 is the second character of the first field; -2.1 is
    the first character of the next-to-last field.

-t,--tag <tag regexp>
    Sort on the field with the specified tag

-o,--optional <comparison>
    The key is optional. The argument (<, =, or >) specifies how an
    absent key compares to a present key.

-C,--fold-case
    Fold case

-z,--fold-case-turkic
    Fold case with additional Turkic conversions.

-c,--comparison-type <comparison type>
    a(ngle), l(exicographic), i(so8601 date/time), t(ime), D(omain
    name/email address), d(ate), m(onth name), n(umeric), N(umeric
    string), s(ize), h(ybrid), r(andom)

-y,--number-system <number system>
    Specifies the number system expected for this key. This affects
    only numeric and numeric string keys. There are two special
    values. If the number system is "all", records may contain any
    number system that msort can interpret. Different records may
    contain different number systems. If the number system is "any",
    records may contain any number system that msort can interpret,
    but all records must make use of the same number system. msort
    sets the number system on the basis of the first record.

-f,--date-format <date format>
    Permutation of ymd with separators, e.g. y-m-d for international
    date format, m/d/y for American date format, or a permutation of
    yd with separators, e.g. y-d, for day-of-year dates. All three
    components may be numbers in any available number system. The
    month field may also be a month name, determined by the same
    devices as independent month name fields.

-W,--sort-order-file-separators <file name>
    Read from the named file the list of characters to be treated as
    separators in the sort order definition file.

-S,--substitutions <file name>
    Read substitutions from named file

-s,--sort-order <file name>|<locale name>|"locale"
    If the argument is a file name, it is taken to be a sort order
    file and the sort order for the key is read from the file. If
    the argument is a locale name, the collation rules for that
    locale are used. If the argument is "locale", the collation
    rules for the current locale are used.

-T,--transformations <(d)(e)(s)>
    Apply the specified transformations. d specifies that
    diacritics are to be stripped. Separately encoded combining
    diacritics are removed. Characters with diacritics represented
    by single codepoints are replaced with the corresponding ASCII
    character without the diacritics, if there is one. e specifies
    that enclosed characters, that is, characters within circles or
    parentheses, are to be replaced with the corresponding plain
    ASCII character if there is one. s specifies that characters in
    special styles are to be replaced with the corresponding plain
    ASCII character if there is one. Stylistic equivalents include:
    small capitals (e.g. U+1D04), script forms (e.g. U+212C), black
    letter forms (e.g. U+212D), Arabic presentation forms (e.g.
    U+FE81), Hebrew presentation forms (e.g. U+FB1D), fullwidth
    forms (e.g. U+FF01), halfwidth forms (e.g. U+FF7B), and the
    mathematical alphanumeric symbols (e.g. U+1D400).

-x,--exclusion-file <file name>
    Read exclusions from named file

-X,--exclude-characters <exclusions>
    Exclude specified characters

-i,--invert-locally
    Invert sense of comparisons

-R,--reverse-key
    Reverse characters of key

-A,--first-character-only
    Ignore all but the first character of the field, after
    substitutions, exclusions, etc.
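
The index arithmetic described for -e can be checked with a small Python sketch. It follows only the rule stated above (positive indices count from one, negative indices count from the end, both ends inclusive). The function name character_range is hypothetical, and an index of zero, which the text does not define, is not handled.

```python
def character_range(record, m, n):
    """Return characters m through n of record, both ends inclusive.

    Positive indices count from one; negative indices count back
    from the end of the record (-1 is the last character).
    """
    length = len(record)
    start = m - 1 if m > 0 else length + m   # zero-based start offset
    end = n if n > 0 else length + n + 1     # exclusive end offset
    return record[start:end]


# The range 3,-2 from the text: third through next-to-last character.
key = character_range("abcdefg", 3, -2)
```

For the seven-character record "abcdefg", the range 3,-2 selects "cdef".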

Note: long options may not be available on your system.

sort(1), uninum(3)

Bill Poser (billposer@alum.mit.edu)

GNU General Public License (http://www.gnu.org/licenses/gpl.html),
version 3.

msort January 2010 MSORT(1)