SKF(1) General Commands Manual SKF(1)
2
3
4
6 skf - simple Kanji Filter (v2.1)
7
9 skf [-EIJKNQRSXZbehjknqrsuvxz] [ long_format_options ] [infiles..]
10
skf is yet another i18n-capable kanji filter, designed for reading
various CJK-coded files on the Net. skf converts input kanji text or
streams into a character stream in the designated codeset and writes
it to standard output. Specifically, skf is designed to be a versatile
filter for reading documents in various codesets, and does not provide
features unrelated to code conversion.
18
Like nkf, skf automatically recognizes the input file code when it is
a kind of ISO-2022 compliant code, and also detects EUC-variant codes
if the input file is Japanese text without X 0201 kana. skf 2.1 can
read various ISO-2022 compliant character sets, including JIS Kanji
codes (X 0208, X 0212 and X 0213), EUC encodings (euc-jp (with X 0213
support), euc-cn, euc-kr and euc-tw), ISO European Latin sets
(ISO-8859-1 to 11, 13/14/15/16) and many regional character sets. skf
can also read some non-ISO-2022 compliant sets, including Microsoft
Shift-JIS code, KOI-8-R/U, GB2312 (HZ), big5, VISCII (rfc1456,
including VIQR), Unicode standard (UCS2/UTF-16, UTF7 and UTF8), some
MS codesets (cp1250 etc.) and some other vendor-specific codes
(KEIS83, JEF etc.).
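
The escape-based recognition mentioned above relies on the designation
sequences that ISO-2022 streams carry. The following is a minimal
sketch of that idea in Python, not skf's actual detector; the table and
the function name find_designations are illustrative assumptions, and
only a few common iso-2022-jp sequences are listed.

```python
# Illustrative only: a tiny subset of ISO-2022 designation sequences.
# This is NOT skf's detection code.
ISO2022_DESIGNATIONS = {
    b"\x1b(B": "ASCII to G0",
    b"\x1b(J": "JIS X 0201 roman to G0",
    b"\x1b(I": "JIS X 0201 kana to G0",
    b"\x1b$@": "JIS X 0208:1978 to G0",
    b"\x1b$B": "JIS X 0208:1983 and later to G0",
    b"\x1b$(D": "JIS X 0212:1990 to G0",
}


def find_designations(data: bytes) -> list[str]:
    """Return the known designations found in data, in order of appearance."""
    found = []
    i = data.find(b"\x1b")
    while i != -1:
        for seq, meaning in ISO2022_DESIGNATIONS.items():
            if data.startswith(seq, i):
                found.append(meaning)
                break
        i = data.find(b"\x1b", i + 1)
    return found


sample = "日本語 text".encode("iso2022_jp")
print(sample)
print(find_designations(sample))
```

A stream with no escape sequences yields an empty list, which is why
non-escape encodings such as EUC or Shift-JIS need a different kind of
detection.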
30
31 Supported output character sets of skf are more limited, but still
32 include X 0208/X 0212/X 0213 JIS, X 0201 JIS, ASCII, Microsoft Shift-
33 JIS, EUC-jp/-kr/-cn, HZ, iso-2022-jp/kr, big5, VISCII and Unicode.
34
35 skf also provides some basic decoding features for some common encod‐
36 ings including MIME, Punycode and URI codepoint. Unicode decomposition
37 feature is also supported since 1.96.
38
39 As noted above, skf is designed to convert input text into some kind of
40 human-readable forms under a local environment (i.e. codeset), and has
41 several extra conversion features like GNU recode type folding. Such
42 conversions include Windows/Macintosh specific code swaps and old-new
43 jis glyph changes, html-format/TeX format conversion and variant unifi‐
44 cations.
45
46 skf also can be compiled as an extension of some lightweight languages.
47 See README.txt for details.
48
If one or more file names are given, skf reads the files and writes
the converted stream to stdout. If no file names are given, input is
taken from stdin and output still goes to stdout. Options are taken
from the environment variables SKFENV and skfenv and from the command
line, in that order. Environment variables are not used when skf is
running as a privileged user. skf does not use LOCALE-related
environment variables for conversions, but error messages are
controlled by the given LOCALES.
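
The option lookup order described above can be sketched as follows.
This is an illustration under stated assumptions, not skf's code: the
function name gather_options is hypothetical, "privileged" is taken to
mean effective uid 0, and the environment values are assumed to be
split on whitespace like a shell command line.

```python
import os
import shlex


def gather_options(argv, environ=None, privileged=None):
    """Collect options from SKFENV, then skfenv, then the command line."""
    environ = os.environ if environ is None else environ
    if privileged is None:
        # Assumption: a privileged user is one with effective uid 0.
        privileged = hasattr(os, "geteuid") and os.geteuid() == 0
    options = []
    if not privileged:
        for name in ("SKFENV", "skfenv"):
            # Assumption: values are split like shell words.
            options += shlex.split(environ.get(name, ""))
    return options + list(argv)


env = {"SKFENV": "--oc=euc-jp", "skfenv": "-u"}
print(gather_options(["-z", "input.txt"], env, privileged=False))
print(gather_options(["-z", "input.txt"], env, privileged=True))
```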
57
skf is written from scratch, and inherits no code from nkf. However,
skf is intended to be a drop-in replacement for nkf (v1.4) and has a
similar set of commonly used nkf options.
skf 2.1 recognizes the following options. All options are off by
default unless explicitly specified.
64
65 buffering control
66 -b use buffered output. This is default.
67
68 -u use unbuffered output. Code detection feature is disabled when
69 this option is on.
70
71 Input/Output codeset options
72 --ic= input_code_set
73 specify input codeset is input_code_set. Possible candidates
74 are shown below.
75
76 --oc= output_code_set
77 specify output codeset is output_code_set. Possible candidates
78 are shown below. Default codeset in distribution package is euc-
79 jp, but depends on compile option. Default codeset is shown by
80
81 Supported codeset
skf recognizes the following codesets as input/output codesets.
Codeset names are case-insensitive, and minus ('-') and underscore
('_') characters are ignored. Note that an iso-2022 escape-based input
codeset (registered with IANA) is recognized automatically, even when
a non-iso2022 codeset (except Unicode and B-Right/V) is specified. In
the table, o in the in column means the named codeset can be specified
as input, and x means it cannot. The out column is the same, but for
output.
89
in  out  name               description
o   o    iso8859-1          ascii + iso-8859-1 (latin-1)
o   o    iso8859-2          ascii + iso-8859-2 (latin-2)
o   o    iso8859-3          ascii + iso-8859-3 (latin-3)
o   o    iso8859-4          ascii + iso-8859-4 (latin-4)
o   o    iso8859-5          ascii + iso-8859-5 (Cyrillic)
o   o    iso8859-6          ascii + iso-8859-6 (Arabic)
o   o    iso8859-7          ascii + iso-8859-7 (Greek)
o   o    iso8859-8          ascii + iso-8859-8 (Hebrew)
o   o    iso8859-9          ascii + iso-8859-9 (latin-5)
o   o    iso8859-10         ascii + iso-8859-10 (latin-6)
o   o    iso8859-11         ascii + iso-8859-11 (Thai)
o   o    iso8859-13         ascii + iso-8859-13 (Baltic Rim)
o   o    iso8859-14         ascii + iso-8859-14 (Celtic)
o   o    iso8859-15         ascii + iso-8859-15 (Latin-9)
o   o    iso8859-16         ascii + iso-8859-16
o   o    koi-8r             koi-8r (Russian)
o   o    cp1251             Cyrillic latin MS cp1251
o   o    jis                iso-2022-jp (rfc1468 7bit JIS)
o   o    iso-2022-jp-x0213  iso-2022-jp-3 (JIS X 0213:2000)
                            a.k.a. jis-x0213
o   o    jis-x0213-strict   iso-2022-jp-3-strict
o   o    iso-2022-jp-2004   iso-2022-jp-2004 (JIS X 0213:2004)
                            a.k.a. jis-x0213-2004
o   o    oldjis             iso-2022-jp-1978 (JIS X 0208:1978)
o   o    cp50220            Microsoft codepage 50220
o   o    cp50221            Microsoft codepage 50221
o   o    cp50222            Microsoft codepage 50222
o   o    euc-jp             EUC-encoded JIS X 0208:1997
o   o    euc-x0213          EUC-encoded JIS X 0213:2000
o   o    euc-jis-2004       EUC-encoded JIS X 0213:2004
o   o    cp51932            EUC-encoded Microsoft codepage 932
o   o    euc-kr             EUC-encoded KS X 1001 Korean
o   o    euc7-kr            7bit EUC-encoded KS X 1001 Korean
o   o    uhc                Unified Hangul (Windows cp949)
o   o    johab              KS X 1001-johab Korean
o   o    euc-cn             EUC-encoded GB2312 Chinese
o   o    euc7-cn            7bit EUC-encoded GB2312 Chinese
o   o    hz                 HZ-encoded GB2312 Chinese
o   o    euc-tw             EUC-encoded CNS 11643 Chinese
o   o    gb12345            EUC-encoded GB12345 Chinese
o   o    gbk                GB2312 Extension (cp936) Chinese
o   o    gb18030            GB18030 Chinese
o   o    big5               BIG5 (with Eten extension + EURO)
o   o    cp950              BIG5 (Microsoft cp950 + EURO)
o   o    big5-hkscs         BIG5 with HKSCS
o   o    big5-2003          BIG5-2003
o   o    big5-uao           BIG5 Unicode-At-On (UAO)
o   o    sjis               Shift-jis (Microsoft cp943)
o   o    shiftjis-x0213     Shiftjis-encoded JIS X 0213:2000
o   o    shiftjis-2004      Shiftjis-encoded JIS X 0213:2004
o   o    sjis-docomo        Shiftjis-encoded with NTT Docomo emoticons.
o   o    sjis-au            Shiftjis-encoded with AU emoticons.
o   o    sjis-softbank      Shiftjis-encoded with SoftBank emoticons.
o   o    oldsjis            Shift-jis (JIS X 0208:1978)
o   o    cp932              Shift-jis-encoded MS cp932
o   o    cp932w             Shift-jis-encoded MS cp932 with
                            MS compatibility
o   o    viscii             VISCII (rfc1456) Vietnamese
o   o    viqr               VISCII (rfc1456-VIQR) Vietnamese
o   o    keis               Hitachi KEIS83/90
o   x    jef                Fujitsu JEF (basic support only)
o   x    ibm930             IBM EBCDIC DBCS Japanese
o   x    ibm931             IBM EBCDIC DBCS Japanese w.latin
o   x    ibm933             IBM EBCDIC DBCS Korean
o   x    ibm935             IBM EBCDIC DBCS Simpl. Chinese
o   x    ibm937             IBM EBCDIC DBCS Trad. Chinese
o   o    unicode            Unicode(TM) UTF-16LE
o   o    unicodefffe        Unicode(TM) UTF-16BE
o   o    utf7               Unicode(TM) UTF-7
o   o    utf8               Unicode(TM) UTF-8
x   o    nyukan-utf-8       Nyukan-moji (Japanese nyukoku-kanrikyoku
                            gaiji), utf-8 encoding
x   o    nyukan-utf-16      Nyukan-moji (Japanese nyukoku-kanrikyoku
                            gaiji), utf-16 encoding
o   x    arib-b24           ARIB B24 8-bit JIS-based
o   x    arib-b24-sj        ARIB B24 8-bit SJIS-based
x   o    transparent        Transparent mode (see below)
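
The name-matching rule stated before the table (case-insensitive, with
'-' and '_' ignored) can be sketched as below. The function name
normalize_codeset_name is hypothetical and this is only an
illustration of the rule, not skf's lookup code.

```python
def normalize_codeset_name(name: str) -> str:
    """Fold case and drop '-' and '_' so that spelling variants compare equal."""
    return name.lower().replace("-", "").replace("_", "")


for spelling in ("EUC-JP", "euc_jp", "eucjp", "Euc-Jp"):
    print(spelling, "->", normalize_codeset_name(spelling))
```

Under this rule, --oc=EUC-JP, --oc=euc_jp and --oc=eucjp all name the
same codeset.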
166
167
168 Codeset explanations
169 iso-8859-*
170 When specified as output, G0 = GL is ascii and G1 = GR is
171 iso-8859-*. 8bit encoding is used.
172
iso-2022-jp, jis
Encoding is iso-2022-jp-2 (RFC1554). G0 = GL is JIS X 0201
roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is
JIS X 0212:1990 Supplementary Kanji.
177
jis-x0213, iso-2022-jp-3
Encoding is iso-2022-jp-3 (JIS X 0213:2000 based). For output,
G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is
iso-8859-1 and G3 is JIS X 0213 plane 2 Kanji.

jis-x0213-strict
Encoding is a subset of iso-2022-jp-3-strict (uses Plane 1 only).
For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201
kana, G2 is iso-8859-1 and G3 is not set. Output uses JIS X 0208
whenever possible. JIS X 0213 input is automatically recognized.
189
jis-x0213-2004, iso-2022-jp-2004
Encoding is iso-2022-jp-2004 (JIS X 0213:2004 based). For output,
G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is
iso-8859-1 and G3 is JIS X 0213 plane 2 Kanji.
194
oldjis
Encoding is iso-2022-jp using old JIS (JIS X 0208:1978). G0 = GL
is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1
and G3 is JIS X 0212 Supplementary Kanji.
199
200 euc-jp, euc
201 Encoding is 8-bit EUC using JIS X 0208:1997 character set. G0 =
202 GL is ascii, G1 = GR is JIS X 0208, G2 is JIS X 0201 kana and G3
203 is JIS X 0212 Supplementary Kanji.
204
205 euc-x0213, euc-jis-2003
206 Encoding is 8-bit EUC-based JIS X 0213:2000. G0 = GL is ascii,
207 G1 = GR is X 0213:2000 plane 1, G2 is iso-8859-1 and G3 is JIS X
208 0213:2000 plane2 Kanji.
209
210 euc-jis-2004
211 Encoding is 8-bit EUC-based JIS X0213:2004. G0 = GL is ascii,
212 G1 = GR is X0213:2004 plane 1, G2 is iso-8859-1 and G3 is JIS
213 x0213:2004 plane2 Kanji.
214
euc-kr
Encoding is 8-bit EUC using the KS X 1001 Wansung character set.
G0 = GL is KS X 1003, G1 = GR is KS X 1001; G2 and G3 are not set.

euc7-kr, iso-2022-kr
Encoding is iso-2022-kr (rfc1557): 7-bit EUC using the KS X 1001
Wansung character set. G0 = GL is KS X 1003, G1 is KS X 1001; G2
and G3 are not set.

euc-cn
Encoding is 8-bit EUC using the GB 2312 simplified Chinese
character set. G0 = GL is ASCII, G1 = GR is GB2312; G2 and G3 are
not set.

euc7-cn
Encoding is 7-bit EUC using the GB 2312 simplified Chinese
character set. G0 = GL is ASCII, G1 is GB2312; G2 and G3 are not
set.

hz
Encoding is the HZ-encoded (rfc1842) GB 2312 simplified Chinese
character set. G0 = GL is ASCII, G1 = GR is GB2312; G2 and G3
are not set.

euc-tw
Encoding is the EUC-encoded CNS11643 Plane 1/2 traditional Chinese
character set. Subset of iso-2022-cn. G0 = GL is ASCII, G1 = GR
is CNS11643 plane 1, G2 is CNS11643 plane 2 and G3 is not set.

gb12345
Encoding is 8-bit EUC using the GB 12345 (GBF) traditional Chinese
character set. G0 = GL is ASCII, G1 = GR is GB12345; G2 and G3
are not set.

gbk, cp936
Encoding is the GBK simplified Chinese character set. G0 = GL is
ASCII and G1 = GR is GBK. G2 and G3 are not set.
250
251 gb18030 (experimental)
252 Encoding is GB18030 (ibm-1392, Windows cp54936) chinese charac‐
253 ter set. Uses ASCII as latin part.
254
255 big5
256 Encoding is Big5 traditional chinese character set with ETen
257 extension. Include Euro mapping. Uses ASCII as latin part.
258
259 cp950
260 Encoding is Microsoft cp950-Big5 traditional chinese character
261 set. Uses ASCII as latin part.
262
263 big5-hkscs (experimental)
264 Encoding is cp950-Big5 traditional chinese character set with
265 HKSCS extension. Uses ASCII as latin part.
266
267 big5-2003 (experimental)
268 Encoding is Big5-2003 Taiwanese standard traditional chinese
269 character set. Uses ASCII as latin part.
270
271 big5-uao (experimental)
272 Encoding is Big5-UAO (http://uao.cpatch.org) traditional chinese
273 character set. Uses ASCII as latin part.
274
VISCII (experimental)
Vietnamese VISCII (rfc1456) character set. Not TCVN-5712.

VIQR (experimental)
Vietnamese VISCII character set with VIQR encoding (rfc1456).
280
281 sjis
282 Encoding is Shift-encoded JIS X 0208:1997 character set. Note
283 that this is not cp932. Uses JIS X 0201 latin as latin(GL) part.
284
285 sjis-x0213, shift_jis-2000
286 Encoding is Shift-encoded JIS using JIS X 0213:2000 character
287 set.
288
289 sjis-x0213-2004, shift_jis-2004
290 Encoding is Shift-encoded JIS using JIS X 0213:2004 character
291 set. 10 newly defined character added, but Unicode mapping is
292 same as JIS X 0213:2000. Uses JIS X 0201 latin as latin(GL)
293 part.
294
295 sjis-cellular (experimental)
296 Encoding is Shift-encoded JIS X 0208:1997 character set with NTT
297 Docomo/Vodafone(SoftBank) cellular phone glyph mapping. Output
298 is not supported.
299
cp932 cp932w
Encoding is Microsoft SJIS cp932 with NEC/IBM gaiji area, based
on Windows XP mapping. Uses ASCII as latin(GL) part. --use-compat
and --use-ms-compat are automatically enabled. cp932w provides
further WideCharToMultiByte compatibility.

cp51932
Encoding is Microsoft EUC-based cp51932 with NEC/IBM gaiji area,
based on Windows XP mapping. Uses ASCII as G0 and JIS X 0201
kana as EUC G2 part. G3 is not used for output, and JIS X
0212:2000 as input. --use-compat and --use-ms-compat are
automatically enabled.

cp50220, cp50221, cp50222
Encoding is Microsoft JIS-based cp50220, cp50221, cp50222 with
NEC/IBM gaiji area, based on Windows XP mapping. For input, skf
accepts cp50220, 50221 and 50222. Note that this codeset is NOT
compatible with iso-2022. Uses ASCII as default character set.
--use-compat and --use-ms-compat are automatically enabled.
319
320 oldsjis
321 Encoding is Microsoft SJIS (JIS X 0208:1978 a.k.a. old JIS).
322 Uses JIS X 0201 latin as latin(GL) part.
323
324 johab
325 Encoding is KS X1001(Johab) character set. Uses KS X1003 latin
326 as latin(GL) part.
327
328 uhc
329 Encoding is UHC (cp949) character set. Uses ASCII as latin(GL)
330 part.
331
unicode, unicodefffe, utf16, utf16le
Encoding is Unicode UTF-16 (v11.0). The default input/output byte
order is little-endian for unicode and big-endian for unicodefffe,
and an input byte order mark is recognized. utf16 and unicodefffe
are big-endian; utf16le and unicode are little-endian. Output
includes an endian mark by default unless --disable-endian-mark is
specified. The output range is the whole UTF-32 range, using
surrogate pairs, unless --limit-to-ucs2 is specified.
Note that ucs2 is not supported in the lightweight language
extensions for either input or output, because of a limitation in
the data structure SWIG passes. Specifying ucs2 generates an error.
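
To make the byte order, endian mark and surrogate pair terms above
concrete, here is a small Python illustration using only the standard
codecs. It shows what the byte streams look like; it does not call skf,
and the helper fits_in_ucs2 is a hypothetical name for the check that
--limit-to-ucs2 implies.

```python
import codecs

# 'A', HIRAGANA LETTER A, and one CJK Extension B character (outside the BMP)
text = "A\u3042\U00020000"

little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def fits_in_ucs2(s: str) -> bool:
    """True if every character is in the BMP, i.e. needs no surrogate pair."""
    return all(ord(ch) <= 0xFFFF for ch in s)


print(little.hex(" "))  # ff fe 41 00 42 30 40 d8 00 dc
print(big.hex(" "))     # fe ff 00 41 30 42 d8 40 dc 00
# The generic "utf-16" decoder reads the mark and picks the byte order.
print(little.decode("utf-16") == big.decode("utf-16") == text)
print(fits_in_ucs2(text))
```

The last character is stored as the surrogate pair D840 DC00, which is
why it does not fit in UCS2.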
343
344 utf8
345 Encoding is UTF-8 encoded Unicode (v11.0). Output doesn't
346 include byte order mark unless --enable-endian-mark is speci‐
347 fied. Output range is within UTF-32 unless --limit-to-ucs2 is
348 specified. By default, CESU-8 is not accepted as input. Option
349 --enable-cesu8 enables CESU-8 input for utf-8 converter. CESU-8
350 output is not supported. For UTF-8, endian mark (BOM) is always
351 ignored.
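
The difference between UTF-8 and CESU-8 only shows for characters
above U+FFFF: CESU-8 encodes each half of the UTF-16 surrogate pair as
its own 3-byte sequence. The sketch below illustrates this with the
Python standard library; the function names are hypothetical and this
is not skf's converter.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (surrogate halves encoded separately)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for half in (high, low):
                out += chr(half).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 by re-joining surrogate halves through UTF-16."""
    halves = data.decode("utf-8", "surrogatepass")
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


def is_strict_utf8(data: bytes) -> bool:
    """True if data is accepted by a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ch = "\U00020000"                       # a CJK Extension B character
print(ch.encode("utf-8").hex(" "))      # f0 a0 80 80
print(cesu8_encode(ch).hex(" "))        # ed a1 80 ed b0 80
print(is_strict_utf8(cesu8_encode(ch)))
```

A strict UTF-8 reader rejects the CESU-8 form, which matches the
default described above where CESU-8 input must be enabled explicitly.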
352
353 utf7
354 Encoding is UTF-7 encoded Unicode (v11.0). Input/output range is
355 limited to UTF-16, and value above U+10000 is regarded as unde‐
356 fined. BOM is always ignored for input, and never used for out‐
357 put.
358
359 keis (experimental)
360 Encoding is Hitachi KEIS83/90. Output range is limited to EBCDIK
361 and JIS X 0208 area.
362
363 jef (experimental)
364 Encoding is Fujitsu JEF. Input only. Only basic part is sup‐
365 ported.
366
367 ibm930 (experimental)
368 Encoding is IBM DBCS Japanese with EBCDIC Kana
369
370 ibm931 (experimental)
371 Encoding is IBM DBCS Japanese with EBCDIC latin (ibm037)
372
ibm933 (experimental)
Encoding is IBM DBCS Korean with EBCDIC Wansung character set
375
376 ibm935 (experimental)
377 Encoding is IBM DBCS Simplified Chinese with EBCDIC Chinese
378
379 ibm937 (experimental)
380 Encoding is IBM DBCS Traditional Chinese with EBCDIC Chinese
381
382 koi8r
383 Russian KOI-8R code.
384
cp1250
Central European latin Microsoft cp1250 code

cp1251
Eastern European cyrillic Microsoft cp1251 code

arib-b24 arib-b24-sj
ARIB B24 code defined in ARIB STD-B24 vol.1 part.2 chapt. 7.3.
b24 is 8-bit jis based, and b24-sj is sjis based.

nyukan-utf-8 nyukan-utf-16
Normalized Unicode UTF-8/UTF-16 based on Japanese Ministry of
Justice notification (kokuji) No. 582.

transparent
Transparent mode. Various code control features, including folding
and line end code conversion, are also ignored.
402
403
404 Shortcuts
405 -j same as --oc=jis
406
407 -s same as --oc=sjis
408
409 -e same as --oc=euc-jp
410
411 -q same as --oc=unicode
412
-z same as --oc=utf8
414
415 -E same as --ic=euc-jp. Assume input codeset is EUC-JP.
416
417 -J same as --ic=jis. Assume input codeset is iso-2022-jp.
418
419 -S same as --ic=sjis. Assume input codeset is shift JIS
420
421 -Q same as --ic=utf-16 --input-little-endian.
422
423 -Z same as --ic=utf8.
424
425
426 ISO-2022 Specific controls
These options replace the character sets assigned to G0-G3, after
they have been set up according to the specified input codeset, with
the given character set. Note that this doesn't change any other
properties of the original codeset, such as language and encoding.
431
432 --set-g0=`charset name'
433 Predefines specified code set to plane 0 (G0). Also set to GL at
434 initial state.
435
436 --set-g1=`charset name'
437 Predefines specified code set to right plane (G1). Also set to
438 GR at initial state.
439
440 --set-g2=`charset name'
441 Predefines specified code set to right plane (G2).
442
443 --set-g3=`charset name'
444 Predefines specified code set to right plane (G3).
445
446
Supported charset names are as follows. 'o' means the charset can be
assigned to that plane; 'x' means it cannot. For Unicode family
codesets, these options are ignored. For other non-iso2022 codesets,
these options are not supported, and the result is unpredictable.
451
452
g0  g1  g2  g3  codeset name  description
o   o   o   o   ascii         ANSI X3.4 ASCII
o   o   o   o   x0201         JIS X 0201 (latin part)
x   o   o   o   iso8859-1     ISO 8859-1 latin
x   o   o   o   iso8859-2     ISO 8859-2 latin
x   o   o   o   iso8859-3     ISO 8859-3 latin
x   o   o   o   iso8859-4     ISO 8859-4 latin
x   o   o   o   iso8859-5     ISO 8859-5 Cyrillic
x   o   o   o   iso8859-6     ISO 8859-6 Arabic
x   o   o   o   iso8859-7     ISO 8859-7 Greek-latin
x   o   o   o   iso8859-8     ISO 8859-8 Hebrew
x   o   o   o   iso8859-9     ISO 8859-9 latin
x   o   o   o   iso8859-10    ISO 8859-10 latin
x   o   o   o   iso8859-11    ISO 8859-11 Thai
x   o   o   o   iso8859-13    ISO 8859-13 latin
x   o   o   o   iso8859-14    ISO 8859-14 latin
x   o   o   o   iso8859-15    ISO 8859-15 latin
x   o   o   o   iso8859-16    ISO 8859-16 latin
x   o   o   o   tcvn5712      TCVN 5712 (Vietnamese)
x   o   o   o   ecma94        ECMA 94 Cyrillic (KOI-8e)
o   o   o   o   x0212         JIS X 0212:1990
o   o   o   o   x0208         JIS X 0208:1997
o   o   o   o   x0213         JIS X 0213 Plane 1:2000
o   o   o   o   x0213-2       JIS X 0213 Plane 2:2000
o   o   o   o   x0213n        JIS X 0213 Plane 1:2004
o   o   o   o   gb2312        Simplified Chinese GB2312
o   o   o   o   gb1988        Chinese GB1988 (latin)
o   o   o   o   gb12345       Traditional Chinese GB12345
o   o   o   o   ksx1003       Korean KS X 1003 (latin)
o   o   o   o   ksx1001       Korean KS X 1001
x   o   o   o   koi8-r        Cyrillic KOI-8R
x   o   o   o   koi8-u        Ukrainian Cyrillic KOI-8U
o   o   o   o   cns11643-1    Traditional Chinese CNS11643-1
x   o   o   o   viscii-r      RFC1456 VISCII (right plane)
o   o   o   o   viscii-l      RFC1456 VISCII (left plane)
x   o   o   o   cp437         Microsoft cp437 (US latin)
x   o   o   o   cp737         Microsoft cp737
x   o   o   o   cp775         Microsoft cp775
x   o   o   o   cp850         Microsoft cp850
x   o   o   o   cp852         Microsoft cp852
x   o   o   o   cp855         Microsoft cp855
x   o   o   o   cp857         Microsoft cp857
x   o   o   o   cp860         Microsoft cp860
x   o   o   o   cp861         Microsoft cp861
x   o   o   o   cp862         Microsoft cp862
x   o   o   o   cp863         Microsoft cp863
x   o   o   o   cp864         Microsoft cp864
x   o   o   o   cp865         Microsoft cp865
x   o   o   o   cp866         Microsoft cp866
x   o   o   o   cp869         Microsoft cp869
x   o   o   o   cp874         Microsoft cp874
x   o   o   o   cp932         Microsoft cp932 (Japanese)
x   o   o   o   cp1250        Microsoft cp1250 (Central Europe)
x   o   o   o   cp1251        Microsoft cp1251 (Cyrillic)
x   o   o   o   cp1252        Microsoft cp1252 (Latin-1)
x   o   o   o   cp1253        Microsoft cp1253 (Greek)
x   o   o   o   cp1254        Microsoft cp1254 (Turkish)
x   o   o   o   cp1255        Microsoft cp1255
x   o   o   o   cp1256        Microsoft cp1256
x   o   o   o   cp1257        Microsoft cp1257
x   o   o   o   cp1258        Microsoft cp1258
514
515 --euc-protect-g1
516 In EUC input mode, suppress sequences to set a charset to G1.
517 Such sequences are discarded.
518
519 --add-annon
520 Add announcer for JIS X 0208:1997 to X 0208 designate sequence.
521 This option works only with iso-2022-based output.
522
523 --input-detect-jis78
524 Distinguish JIS X 0208:1978 codeset and JIS X 0208:1997 codeset.
525 By default, these two charsets are regarded as X 0208:1997. This
526 option is valid only when input encoding is JIS (iso-2022-jp).
527
528
529 JIS X 0212(Supplement Kanji code) Support
530 --x0212-enable
531 skf by default does not output JIS X 0212 code in JIS/EUC mode.
532 This option enables use of JIS X 0212 part. Non-Japanese code,
533 Shift_JIS variants, Unicode or KEIS output ignore this option.
534 Note that this option is supported for backward compatibility.
535 It may not be supported in future versions.
536
537
538 Unicode coding specific control options
skf 2.10 conforms to the Unicode 11.0 specification.
540
--use-compat --suppress-compat
With --suppress-compat, skf substitutes characters in the Unicode
compatibility planes (U+F900 - U+FFFD) with appropriate characters
in non-compatibility planes. When this substitution is enabled,
these characters are converted to variants or treated as undefined.
With --use-compat, skf outputs characters in this area as they
are. Default is --use-compat. Several codesets control this as a
codeset feature (i.e. use compatibility planes). See the codeset
section.
549
--use-ms-compat
When output is Unicode, make the Unicode mapping Microsoft Windows
compatible. This only changes the conversion of some symbols in
JIS-Kanji, and adding the --use-compat option is recommended for
roundtrip conversion. If you need stricter compatibility, try
cp932w as the input codeset.
556
557 --use-cde-compat
558 When output is Unicode, make translation CDE standard codeset
559 compatible.
560
561 --little-endian
562 When output is UTF-16le/be, use little endian byte-order.
563
564 --big-endian
565 When output is UTF-16le/be, use big endian byte-order.
566
567 --disable-endian-mark --enable-endian-mark
568 When output is UTF-16 or UTF-8, do not use/use byte order mark‐
569 ing. To make UTF-16N, use this option with --little-endian. By
570 default, BOM is enabled for UTF-16 and disabled for UTF-8.
571
572 --input-little-endian
573 When input is UTF-16le/be, assume input is little endian byte-
574 ordered.
575
576 --input-big-endian
577 When input is UTF-16le/be, assume input is big endian byte-
578 ordered.
579
580 --endian-protect
581 Do not use endian mark in input stream. Endian mark is just dis‐
582 carded. This is off by default.
583
584 --limit-to-ucs2
585 Do not use > 0x10000 area code in Unicode (i.e. limits code to
586 BMP area). This option doesn't limit internal code range in
587 skf. This is off by default.
588
589 --disable-cjk-extension
590 Treat CJK extension A/B areas as undefined. This is off (i.e.
591 these areas are enabled) by default.
592
593 --enable-cesu8
594 Enable CESU-8 input in utf-8 codeset. Ignored for any other
595 codesets.
596
--non-strict-utf8
Enable broken (decodable but not obeying specs.) utf-8 input. If
you need this option, proceed with extra care.
600
601 --enable-nfd-decomposition --disable-nfd-decomposition
602 Enable/Disable Unicode Normalized decomposition. Default is dis‐
603 abled.
604
605 --enable-nfda-decomposition --disable-nfda-decomposition
606 Enable/Disable Apple-compatible Unicode Normalized decomposi‐
607 tion. Default is disabled.
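
For readers unfamiliar with Unicode canonical decomposition (NFD), the
following Python snippet shows what the transformation does to two
precomposed characters, using the standard unicodedata module. It
illustrates the standard NFD form only; the Apple-compatible variant
differs in which characters are decomposed and is not shown, and skf's
own tables may follow a different Unicode version.

```python
import unicodedata


def code_points(s: str) -> list[str]:
    """Render a string as a list of U+XXXX code point labels."""
    return [f"U+{ord(ch):04X}" for ch in s]


for composed in ("\u304c", "\u00e9"):   # HIRAGANA LETTER GA, e with acute
    decomposed = unicodedata.normalize("NFD", composed)
    print(code_points(composed), "->", code_points(decomposed))
```

Each precomposed character becomes a base character followed by a
combining mark (U+3099 and U+0301 respectively).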
608
609 --oldcell-to-emoticon
610 Convert old cell-phone gaiji area to emoticon. Supported: NTT
611 Docomo/AU emoticons. A reverse mapping is not supported.
612
613
614
Miscellaneous codeset related options
616 --old-nec-compat
617 Enable old NEC kanji sequence (ESC-K,H). Needs compile option
618 --enable-oldnec at configuration.
619
620 --no-utf7
621 Assume input codeset is *NOT* UTF-7 encoded Unicode. This
622 option disables input utf7 testing.
623
624 --no-kana
625 Assume input codeset does *NOT* include JIS X 0201 kana.
626
627 --input-limit-to-jp
628 Tell detection mechanism that input is some kind of Japanese
629 codeset.
630
631
632 OUTPUT Conversions options
skf is intended to output a stream to stdout, but an nkf-compatible
file-encoding change option is also provided.

--overwrite[=SUFFIX] --in-place[=SUFFIX]
converts the encoding of the file(s) specified as input. --overwrite
preserves the file change date. If a SUFFIX parameter is added, the
input file is backed up under a name with this SUFFIX appended.

skf has various features to make output files suitable for the local
environment. Most of these are controlled by the extended control
switches described in this section.
644
645 --use-g0-ascii
646 set G0(=GL) for output encoding to ASCII, ignoring codeset des‐
647 ignation.
648
649 X-0201 Kana/latin conversions
skf by default converts X-0201 kana to X-0208 kana. To output X-0201
kana as it is, use one of the following options. When output is EUC
or SJIS, these options enable X-0201 kana output in the way each
encoding provides. When Unicode output is specified, output of the
(equivalent) kana part is controlled by --use-compat, not by the
following switches, which are valid only when the output codeset is
NOT in the Unicode family.
656
657 --kana-jis7
658 use SI/SO locking shift sequence to designate X-0201 kana. This
659 switch is valid for jis, jis-x0213 and cp50220 (i.e. cp50221)
660 encoding. For other codesets, this option is ignored.
661
662 --kana-jis8
663 output X-0201 kana using 8-bit code right plane. This switch is
664 valid for jis and jis-x0213 encoding. For other codeset, this
665 option is ignored.
666
667 --kana-esci --kana-call
668 use ESC-(-I to designate X-0201 kana. This switch is valid for
669 jis, jis-x0213 and cp50220 (i.e. cp50222) encoding. For other
670 codeset, this option is ignored.
671
672 --kana-enable
673 If output is EUC-JP or cp51932, use X-0201 kana with G2. If
674 SJIS output, it is same as --kana-jis8. When JIS output, it is
675 same as --kana-call.
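
The two kana forms discussed in this section can be seen with the
Python standard library. The sketch below uses NFKC normalization as a
stand-in for the default X-0201 to X-0208 kana conversion (an
assumption for illustration; skf uses its own tables), and shows how
EUC-JP and Shift-JIS carry X-0201 kana when it is kept as is.

```python
import unicodedata

# Halfwidth KA + halfwidth voiced sound mark + halfwidth NA (X 0201 kana)
halfwidth = "\uff76\uff9e\uff85"

# NFKC folds halfwidth kana to fullwidth and composes the voiced mark.
fullwidth = unicodedata.normalize("NFKC", halfwidth)
print([f"U+{ord(c):04X}" for c in fullwidth])   # ['U+30AC', 'U+30CA']

# Kept as X 0201 kana: EUC-JP prefixes each byte with SS2 (0x8e) to
# select G2, while Shift-JIS uses the single bytes directly.
print(halfwidth.encode("euc_jp").hex(" "))      # 8e b6 8e de 8e c5
print(halfwidth.encode("shift_jis").hex(" "))   # b6 de c5
```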
676
677 --use-iso8859-1
678 Enable iso-8859-1 output. Iso-8859-1 is invoked to G1 and set to
679 GR plane.
680
681
682 URI/TeX format conversion feature options
With Unicode(tm) family output codings, skf outputs the non-ascii
latin character part as it is, but with other output codings, skf
converts these characters using the following rules:

(1) If the code is defined in the specified output codeset, that code
point is used for output.
(2) If one of the html convert modes is enabled (i.e. --convert-html,
--convert-sgml) and the code is defined in the html/sgml codeset, it
is converted to an entity reference or code point reference.
(3) If tex convert mode is enabled and the code is defined as a tex
expression, it is converted to tex format.
(4) If the code is a kind of combined ligature, it is shown as a set
of characters.
(5) Otherwise, a kind of replacement character is shown, with a
warning.
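
The fallback order above can be sketched in Python as follows. This is
a simplified illustration, not skf's implementation: the function name
convert_char is hypothetical, the TeX step (3) is left out, NFKD
compatibility decomposition stands in for the ligature step (4), and
"?" stands in for the replacement character.

```python
import html.entities
import unicodedata


def convert_char(ch, codec, html_mode=False, replacement="?"):
    """Convert one character for an output codec using the rule order above."""
    # (1) Defined in the output codeset: use it as is.
    try:
        ch.encode(codec)
        return ch
    except UnicodeEncodeError:
        pass
    # (2) html mode: named entity if one exists, else a numeric reference.
    if html_mode:
        name = html.entities.codepoint2name.get(ord(ch))
        return f"&{name};" if name else f"&#{ord(ch)};"
    # (3) TeX conversion is omitted from this sketch.
    # (4) Ligature-like characters: fall back to their component characters.
    parts = unicodedata.normalize("NFKD", ch)
    if parts != ch:
        try:
            parts.encode(codec)
            return parts
        except UnicodeEncodeError:
            pass
    # (5) Nothing applies: emit a replacement character.
    return replacement


print(convert_char("A", "ascii"))                        # rule 1
print(convert_char("\u00e9", "ascii", html_mode=True))   # rule 2
print(convert_char("\ufb01", "ascii"))                   # rule 4, fi ligature
print(convert_char("\u3042", "ascii"))                   # rule 5
```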
697
--convert-html --convert-sgml --convert-xml
Enable html convert mode. This mode is cleared by --reset. These
options are synonyms, and are treated as the same option.
701
702 --convert-html-decimal
703 Enable html code-point decimal convert mode. This mode is
704 cleared by --reset.
705
706 --convert-html-hexadecimal
707 Enable html code-point hexadecimal convert mode. This mode is
708 cleared by --reset.
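
The decimal and hexadecimal code-point reference forms named above look
like this. The sketch converts every non-ASCII character, which is a
simplifying assumption; the exact set of characters skf converts and
the letter case it uses for hexadecimal digits are not stated here.

```python
import html


def to_numeric_refs(text: str, hexadecimal: bool = False) -> str:
    """Replace non-ASCII characters with html numeric character references."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        elif hexadecimal:
            out.append(f"&#x{ord(ch):x};")
        else:
            out.append(f"&#{ord(ch)};")
    return "".join(out)


sample = "caf\u00e9 \u3042"
print(to_numeric_refs(sample))                    # caf&#233; &#12354;
print(to_numeric_refs(sample, hexadecimal=True))  # caf&#xe9; &#x3042;
print(html.unescape(to_numeric_refs(sample)) == sample)
```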
709
710 --convert-tex
711 Enable TeX convert mode. This mode is cleared by --reset.
712
713 --convert-perl
714 Enable Perl5 literal convert mode. This mode is cleared by
715 --reset.
716
717 --convert-java
718 Enable Java literal convert mode. This mode is cleared by
719 --reset.
720
721 --convert-python
722 Enable Python literal convert mode. This mode is cleared by
723 --reset.
724
--use-replace-char
In Unicode, use unicode replacement character (U+fffc) for
undefined character.
728
729
730 Extended Options
731 Encoding/Decoding control options
732 --decode=`encoding scheme'
733
734 --encode=`encoding scheme'
Specify a decoding/encoding scheme for the input stream. Supported
736 encoding schemes for decoding are `hex', 'mime', 'mime_q',
737 'mime_b', 'uri', 'ace', 'hex_perc_encode', Each option means CAP
738 hex-code, mime, mime Q-encoding, mime B-encoding, uri character
739 reference, ACE punycode, uri percent notation, base64, Q-encod‐
740 ing, rfc2231 and rot13/47 respectively. 'none' means no decode.
741 For encoding, 'hex', 'mime_b', 'mime_q', 'uri', 'ace', 'cap',
742 'hex_perc_encode', 'base64' and 'none' are supported. EBCDIC
743 related codesets and some already ascii-encoded codeset (e.g.
744 UTF-7) output with encoding is not supported.
Only one decode/encode option is valid; if more than one is
specified, the last one is used. When one of the mime decodings is
specified, the base text is assumed to be EUC encoded unless
specified otherwise. Except for rot, which assumes the input stream
is Shift_JIS, EUC or iso-2022-jp, these encodings assume the input
stream is ascii (as defined in RFC2045). Some decodings may co-exist
with encoding, but this is not guaranteed. In particular, if input
is UTF-16/UCS2 code, these encodings are ignored by skf.
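
To show what several of the schemes named above produce, here is a
short Python illustration using standard library codecs only. It does
not call skf; the sample text is encoded as UTF-8 bytes purely for the
example (skf itself assumes EUC for mime decoding unless told
otherwise), and the pairing of each library call with an skf scheme
name is an assumption.

```python
import base64
import codecs
import quopri
import urllib.parse

text = "日本語"
raw = text.encode("utf-8")      # byte form chosen for this example only

b_encoded = base64.b64encode(raw).decode("ascii")       # base64 / mime B body
q_encoded = quopri.encodestring(raw).decode("ascii")    # quoted-printable body
uri_encoded = urllib.parse.quote(text)                  # uri percent notation
ace_encoded = text.encode("punycode").decode("ascii")   # punycode (ACE) label
rot_encoded = codecs.encode("Hello", "rot13")           # rot13 on ascii

for label, value in [
    ("base64", b_encoded),
    ("quoted-printable", q_encoded),
    ("percent", uri_encoded),
    ("punycode", ace_encoded),
    ("rot13", rot_encoded),
]:
    print(f"{label:17s} {value}")
```

All five outputs are plain ascii, which is what lets them travel
through ascii-only channels and be decoded again afterwards.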
754
--mime-ms-compat
treat Japanese generic codesets as Microsoft cp932 compatible.
More specifically, with this option skf treats iso-2022-jp as
cp50220, euc-jp as cp51932 and Shift_JIS as cp932w.

--mime-persistent
skf detects address-like strings and excludes them from mime
encoding. This option disables such behavior. Default in
nkf-compatible mode.
762
763
764 Shortcut
765 -m same as --decode=mime
766
767 -mB same as --decode=mime_b
768
769 -mQ same as --decode=qencode
770
771 -m0 same as --decode=none
772
773 -M same as --encode=mime_b
774
775 -MB same as --encode=base64
776
777 -MQ same as --encode=qencode
778
779 End of line control options
780 --lineend-thru
781 Output end-of-line code as it is. Also output ^Z code as it is.
782 This is default.
783
--lineend-cr --lineend-mac -Lm
Use CR as end-of-line code. Also delete ^Z code from input
stream.

--lineend-lf --lineend-unix -Lu
Use LF as end-of-line code. Also delete ^Z code from input
stream.

--lineend-crlf --lineend-windows -Lw
Use CR+LF as end-of-line code. Also delete ^Z code from input
stream. This option doesn't preserve the original order of CR and
LF.
796
797 --input-cr
798 Assume input stream uses CR as end-of-line code.
799
800 --input-lf
801 Assume input stream uses LF as end-of-line code.
802
803 --input-crlf
804 Assume input stream uses CR+LF as end-of-line code.
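
The end-of-line conversions above can be sketched as a small Python
function. This is an illustration under assumptions, not skf's code:
the name convert_line_ends is hypothetical, and every CR+LF, lone CR
and lone LF is treated as one line end, which may differ from how skf
handles mixed input.

```python
import re

LINE_END = re.compile(rb"\r\n|\r|\n")


def convert_line_ends(data: bytes, eol: bytes) -> bytes:
    """Delete ^Z (0x1a) and rewrite every line end as eol."""
    data = data.replace(b"\x1a", b"")
    # A function is used as the replacement so eol is inserted literally.
    return LINE_END.sub(lambda _match: eol, data)


mixed = b"one\r\ntwo\rthree\nfour\x1a"
print(convert_line_ends(mixed, b"\n"))     # like --lineend-lf
print(convert_line_ends(mixed, b"\r\n"))   # like --lineend-crlf
print(convert_line_ends(mixed, b"\r"))     # like --lineend-cr
```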
805
806 -F[line_length[-kinsoku]]
807
808 -f[line_length[-kinsoku]] -f[line_length[+kinsoku]]
Wrap input lines at line_length columns. The f option deletes
CR/LFs in input, and the F option doesn't delete them. For Japanese
conventions, both gyoutou-kinsoku (by burasage-gumi) and
gyoumatsu-kinsoku (by oidasi-gumi) are supported. The burasage
length is controlled by the kinsoku option. The default value for
line_length is 66, and it must be < 1000. The default value for
kinsoku is 5, and it must be <= 10. With the 'f' option, skf
autodetects paragraphs and retains some CR/LFs. The second 'f'
option format (with '+') disables this behaviour. In nkf compatible
mode, some fold behaviors change as follows.
(1) Default line_length is set to 60, and kinsoku value is 10.
(2) alphanumeric characters become gyoutou-kinsoku characters.
821
822 File control options
823 --filewise-detect --force-reset
824 Reset and re-detect input code set at the start of each file.
825
826 --linewise-detect
827 Reset and re-detect input code set at the start of each line.
828
829
830 Compatibility options
831 --nkf-compat
832 Interpret the following options in an nkf-compatible manner. -l, -d,
833 -c, -x, -X, -w and -W work as in nkf 2.x. -f and -F behavior is
834 changed as shown above. -T, -i and -o are not supported. Most
835 other nkf options and switches also work as in nkf, except in
836 case of error.
837
838 --skf-compat
839 Interpret the following options in the skf-native manner.
840
841 -r nkf-compatible rot. Works only with --nkf-compat mode. Allowed
842 input encodings are limited to JIS/Shift_JIS/EUC.
843
844 -h[123] --hiragana --katakana --katakana-hiragana
845 -h, -h1 and --hiragana convert all kanas to hiragana. -h2 and
846 --katakana convert all kanas to katakana. -h3 and
847 --katakana-hiragana swap katakana and hiragana.
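
The three conversions can be sketched in Python on Unicode text. This is an illustration under stated assumptions, not skf's implementation: it covers only the Unicode Hiragana (U+3041 to U+3096) and Katakana (U+30A1 to U+30F6) letters, which sit 0x60 apart, and leaves half-width kana and other marks untouched. The function names are hypothetical.

```python
HIRAGANA = range(0x3041, 0x3097)  # U+3041..U+3096
KATAKANA = range(0x30A1, 0x30F7)  # U+30A1..U+30F6
OFFSET = 0x60                     # distance between the two blocks


def to_hiragana(text: str) -> str:
    """Like -h1 / --hiragana: katakana letters become hiragana."""
    return "".join(chr(ord(c) - OFFSET) if ord(c) in KATAKANA else c
                   for c in text)


def to_katakana(text: str) -> str:
    """Like -h2 / --katakana: hiragana letters become katakana."""
    return "".join(chr(ord(c) + OFFSET) if ord(c) in HIRAGANA else c
                   for c in text)


def swap_kana(text: str) -> str:
    """Like -h3 / --katakana-hiragana: swap the two scripts."""
    out = []
    for c in text:
        cp = ord(c)
        if cp in HIRAGANA:
            out.append(chr(cp + OFFSET))
        elif cp in KATAKANA:
            out.append(chr(cp - OFFSET))
        else:
            out.append(c)
    return "".join(out)


if __name__ == "__main__":
    print(to_hiragana("カタカナ"), to_katakana("ひらがな"), swap_kana("ひらカナ"))
```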
848
849 --nkf-help
850 show option difference/compatibility between skf and nkf.
851
852 --in-place[=SUF] --overwrite[=SUF]
853 Replace the specified file with its converted output. --overwrite retains
854 the file creation time stamp. If a suffix is given, the suffix is
855 added to the output file name and the input file is not removed.
856
857
858 Lightweight language specific options
859 The skf plugin for lightweight languages has a subset of options.
860 Specifically, file input/output related options (-b, -u, --overwrite,
861 --in-place, --filewise-detect, --linewise-detect, --show-filename,
862 --suppress-filename) and UTF-16 output are disabled (except for ruby or python3).
863
864
865 Ruby-1.9.x/2.x specific options
866 Since ruby 1.9, ruby uses CCS string handling. skf returns the output
867 string in the specified codeset. The following options override this
868 behavior.
869
870 --rb-out-ascii8bit
871 returns string with ascii-8bit encoding.
872
873 --rb-out-string
874 returns string with specified encoding.
875
876 Python-3.x specific options
877 Since the native codeset representation in python3.x is UCS2/UCS4, skf
878 behaves differently depending on the output codeset option. If the output codeset is
879 either UTF-16 or UTF-32 (in wide mode), skf returns a Unicode object, and
880 for all other codesets skf returns a binary array object. The following
881 options change this behavior.
882
883 --py-out-binary
884 Use a pseudo unicode binary stream for output.
885
886 --py-out-string
887 Use a binary array object for UTF-16/32 output. BOM is enabled.
888 skf accepts either a binary array or a Unicode object for
889 input.
890
891
892 Misc. Control options
893 --disable-space-convert --enable-space-convert
894 skf converts an ideographic space into two ascii spaces. Dis‐
895 able option disables, and enable option enables this behavior.
896 Default is disabled.
897
898 --html-sanitize
899 Convert several characters in HTML document to entity reference
900 expression. Specifically, "!#$&%()/<>:;?´ are escaped by entity-
901 references.
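
A minimal Python sketch of this kind of escaping follows. It is not skf's implementation: it assumes numeric character references as the output form (the reference form skf actually emits is not stated here), it covers only the ASCII characters in the list above whose identity is unambiguous, and html_sanitize is a hypothetical name.

```python
# ASCII characters from the list above; the final listed character is left
# out because its identity is not clear in this rendering of the page.
ESCAPED = '"!#$&%()/<>:;?'


def html_sanitize(text: str) -> str:
    """Replace each listed character with a numeric character reference.

    The text is processed in a single pass, so the '&', '#' and ';' of a
    generated reference are never escaped again.
    """
    return "".join(f"&#{ord(c)};" if c in ESCAPED else c for c in text)


if __name__ == "__main__":
    print(html_sanitize('<a href="x">'))
```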
902
903 --filewise-detect --force-reset
904 If multiple input files are given, detect input codeset for each
905 file.
906
907 --linewise-detect
908 Detect input code line-wise. Note this option weakens code
909 detect correctness.
910
911 --reset
912 Reset all flags specified by extended controls and environment
913 variables.
914
915 --inquiry --guess
916 skf detects code and output detect result to stdout. No filter‐
917 ing output is performed. If multiple input files are given,
918 --show-filename is automatically enabled.
919
920 --hard-inquiry
921 Similar to --inquiry, but reports both the code and the end-of-line
922 character.
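
What end-of-line reporting involves can be sketched in Python as below. This is only an illustration: detect_line_end and the labels it returns are hypothetical and do not reproduce the report format that skf prints.

```python
def detect_line_end(data: bytes) -> str:
    """Classify the line ends found in data.

    Returns "CRLF", "CR", "LF", "mixed" or "none" (hypothetical labels).
    """
    crlf = data.count(b"\r\n")
    # Bare CRs and LFs are those that are not part of a CR+LF pair.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    used = [name for name, n in (("CRLF", crlf), ("CR", cr), ("LF", lf)) if n > 0]
    if not used:
        return "none"
    if len(used) > 1:
        return "mixed"
    return used[0]


if __name__ == "__main__":
    for sample in (b"a\r\nb\r\n", b"a\nb\n", b"a\rb", b"a\rb\n", b"abc"):
        print(sample, detect_line_end(sample))
```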
923
924 --suppress-filename
925 When inquiry(--inquiry) is on, this option disables file name
926 output. This option overrides --show-filename.
927
928 --show-filename
929 When inquiry(--inquiry) is on, this option adds each file name
930 to output.
931
932 --invis-strip
933 Delete all escape sequences not belonging to ISO-2022 code
934 extension. This is intended to replace invisstrip command bun‐
935 dled in inews package.
936
937 -I Warn if input has unassigned code points.
938
939 -v print version information and exit.
940
941 --help print brief help and exit.
942
943 --show-supported-codeset
944 Display supported codesets (input) and exit. Both canonical
945 names (left side) and detailed names are shown. This canonical
946 name can be used as MIME charset and also as ic-option code
947 specification.
948
949 --show-supported-charset
950 Display supported character sets (output) and exit. Both canonical
951 names and detailed names are shown. Some charsets with special
952 treatments (i.e. meaningless as set-g* parameters)
953 intentionally lack addressable cnames.
954
955
957 /usr/(local/)share/skf/lib/ (Unices)
958
959 /Program Files/skf/share/lib (MS Windows)
960 These directories are where external codeset conversion tables
961 go. The location that the current skf assumes is shown by the -h
962 option.
963
964
966 skf is written by Seiji Kaneko (efialtes@osdn.jp), based on an idea from
967 nkf written by Itaru Ichikawa (ichikawa@flab.fujitsu.co.jp). The X 0213 code
968 table is derived from the work of earthian@tama.or.jp. Some codeset
969 mappings are derived from various sources. Detailed origins are given in the
970 copyright document included in this distribution.
971
972
974 skf is inspired by works or requests by shinoda@cs.titech,
975 kato@cs.titech, uematsu@cs.titech, void@global, ohta@ricoh, Hinata(HKE),
976 Ashizawa(CRL), Kunimoto(SDL), Oohara(Univ of Kyoto), Jokagi(elf2000) and
977 Naruse (at osdn.jp). Thanks.
978
979
981 1. skf can handle mixed coding with some limitations. However, code
982 detection tends to fail for mixed code, and giving an explicit input code
983 set is strongly encouraged if the codeset is known beforehand.
984 If necessary, the --linewise-detect option may help, but code detection
985 is then more likely to fail.
986
987 2. skf implements ISO-2022 with the following exceptions.
988 i) GL 0x20 is always space, even when a 96-character codeset is invoked
989 to GL.
990 ii) Sequences for setting codes to C1 and C2 are always ignored.
991 iii) If an unknown sequence is given for G0, G0 is set to ascii, and
992 locking/single shift is cleared. An unknown sequence designating G1-G3 is
993 just ignored.
994 Private charsets are also not supported and are ignored.
995 iv) Sequences for 96-character multibyte coding are ignored (currently,
996 no such codeset is registered).
997 v) Calling the UTF-8 and UTF-16 coding systems from iso-2022 is supported, and
998 the standard return sequence returns to the previous coding system.
999 Calls and returns to/from other coding schemes are ignored.
1000 vi) To support some cellular phone glyphs, several private (not
1001 registered) codesets are defined in skf, and can be called by the
1002 appropriate sequences.
1003
1004 3. Error output coding is controlled by the LOCALE environment variables on
1005 UN*X systems. skf doesn't handle situations where stdout and stderr
1006 are redirected into the same stream. Such cases should be handled on the user
1007 side.
1008
1009 4. skf converts KEIS/JIS X 0213 code using the CJK-extension B area and CJK
1010 compatibility area. For this reason, X 0213 and KEIS conversion results
1011 vary depending on the --use-compat and --limit-to-ucs2 switches.
1012
1013 5. JIS X 0207:1979 is not supported. JIS X 0211:1987 is designed to be
1014 supported (i.e. common terminal control sequence will be transparently
1015 passed to output).
1016
1017 6. Even if the unbuffer option (-u) is specified, some code-translation
1018 related buffering is still performed (in MIME, kana, VIQR etc.).
1019
1020 7. skf-1.9x or later recognizes and handles languages in iso639-1(alpha
1021 2). iso639-2 is not supported as a valid language set.
1022
1023 8. Unicode IVS is not supported. Sequences are just discarded.
1024
1025 9. skf-1.9x or later does not retain Macintosh RLO-ordered character
1026 property. Codesets with this kind of codes are not supported.
1027
1028
1030 1. Extended options have changed extensively since skf-1.9. Some archaic
1031 options (e.g. -B, -@ and -r) have been deleted from this version.
1032
1033 2. skf was originally forked from nkf, but doesn't contain any
1034 nkf code now. The copyright notice is retained out of respect.
1035
1036 3. From version 1.9, default Japanese character set assumed by skf has
1037 changed to JIS X 0208:1990 with Microsoft Japanese Windows gaiji (i.e.
1038 CP932).
1039
1040 4. Code autodetection is not perfect by design. If it fails to
1041 detect the input code properly, please give the input code information
1042 explicitly.
1043
1044 5. Some ligatures in Unicode, cp932 gaiji and KEIS83 are converted
1045 using JIS X 0124 and other convention. During this conversion, its
1046 byte length is not preserved.
1047
1048 6. skf is intended to pass ANSI compatible terminal control codes
1049 transparently, but this is not guaranteed.
1050
1051 7. nkf's -i and -o options work only in nkf-compat mode. They are
1052 obsolete options as of 1.97, and are valid only for iso-2022-jp, without
1053 considering output codeset specifications.
1054
1055 8. For unconverted characters, skf uses geta as the undefined character, as with the
1056 --use-replace-char option. If the output codeset doesn't contain a geta
1057 code, skf prefers the 'black square' character, and uses '.' if that is also unavailable.
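
The fallback order can be sketched in Python using the standard codecs as stand-ins for output codesets. This is not skf's code: pick_replacement is a hypothetical name, and the sketch assumes that geta is U+3013 and that the 'black square character' is U+25A0.

```python
GETA = "\u3013"          # geta mark
BLACK_SQUARE = "\u25a0"  # assumed to be the 'black square character'


def pick_replacement(codec: str) -> str:
    """Return the first of geta, black square, '.' that codec can encode."""
    for candidate in (GETA, BLACK_SQUARE, "."):
        try:
            candidate.encode(codec)
        except UnicodeEncodeError:
            continue  # not in this codeset, try the next fallback
        return candidate
    raise ValueError(f"no replacement character fits {codec!r}")


if __name__ == "__main__":
    for name in ("euc_jp", "cp437", "ascii"):
        print(name, repr(pick_replacement(name)))
```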
1058
1059 9. There are some undocumented options. These options should be consid‐
1060 ered as highly experimental.
1061
1062 10. In lineend_thru mode with folding, skf remembers the order in which cr
1063 and lf appear in the stream, and uses that order. Because of this design, if skf
1064 needs to output a line-end character before any line-end character
1065 appears in the input stream, input order may not be preserved.
1066
1067 11. NKF-compatibility
1068 1) --prefix, some --fb's and --no-best-fit-chars are not supported.
1069 2) MSDOS (and -T), --exec-in and --exec-out are not supported.
1070 3) MIME decoding/encoding handling behaviors differ in various ways.
1071 4) lineend conversion acts differently. Results may not be same for
1072 some messy text.
1073 5) The -r option and --decode=rot are different. See each option
1074 description.
1075 6) detected codeset name is not compatible with nkf. --help and --ver‐
1076 sion return different results.
1077 7) in-place and overwrite suffix with * is not supported.
1078
1079 12. Conversion to NYUUKAN GAIJI is as follows.
1080 1) Kanji codes in JIS X0208(1997), JIS X0212(1990), JIS X0213(2004/2012)
1081 and Houmusho-kokuji No.582 beppyou No.1 are sent to output as they are.
1082 2) Kanji codes in the leftmost columns of beppyou No.4-2 are converted to the
1083 first-priority character in the table. If second-priority characters
1084 appear, the codes are sent to output as they are.
1085 3) Other kanji codes are converted as undefined codes. See the conversion
1086 method above. Non-kanji codes (latins, glyphs etc.) are sent to output
1087 as they are.
1091
1092 13. ARIB B24 compatibility
1093 1) Input only. ARIB B24 output is not supported.
1094 2) Neither international encoding nor the X0213 extension is supported.
1095 3) Macro define sequences are suppressed. These sequences are recognized
1096 and discarded.
1097 4) Without specifying the arib codeset, skf treats Arib-defined codepages as
1098 follows.
1099 i) Private codepages are supported. ascii/jis x-0201 0x5f is not modified.
1100 ii) Macro define/invoke and rpc invoke do not work. These characters are
1101 discarded.
1105
1106
1108 Unicode(TM) is a trademark of Unicode, Inc. Microsoft and Windows are
1109 registered trademarks of Microsoft Corporation. Macintosh is a
1110 registered trademark of Apple Inc. Vodafone is a trademark of Vodafone K.K.
1111 Other names and terms may be trademarks or registered trademarks of
1112 their respective owners. The trademark symbol (TM) may be omitted in this
1113 manual page.
1114
1115
1116
1117
1118 10/Aug/2018 SKF(1)