SKF(1) General Commands Manual SKF(1)

skf - simple Kanji Filter (v2.1)

skf [-EIJKNQRSXZbehjknqrsuvxz] [ long_format_options ] [infiles..]
10
skf is yet another i18n-capable kanji filter, designed for reading
various CJK-coded files on the Net. skf converts input kanji texts or
streams into a character stream in the designated codeset and writes
it to standard output. Specifically, skf is designed to be a versatile
filter for reading documents in various code sets, and does not provide
features unrelated to code conversion.
18
Like nkf, skf automatically recognizes the input file code when it is a
kind of ISO-2022 compliant code, and also detects EUC-variant codes if
the input file is Japanese text without X 0201 kana. skf 2.1 can read
various ISO-2022 compliant character sets, including JIS Kanji codes (X
0208, X 0212 and X 0213), EUC encodings (euc-jp (with X 0213 support),
euc-cn, euc-kr and euc-tw), ISO European Latin sets (ISO-8859-1 to 11,
13/14/15/16) and many regional character sets. skf can also read some
non-ISO-2022 compliant sets, including Microsoft Shift-JIS code,
KOI-8-R/U, GB2312 (HZ), big5, VISCII (rfc1456, including VIQR), Unicode
standard (UCS2/UTF-16, UTF7 and UTF8), some MS codesets (cp1250
etc.) and some other vendor-specific codes (KEIS83, JEF etc.).
30
31 Supported output character sets of skf are more limited, but still
32 include X 0208/X 0212/X 0213 JIS, X 0201 JIS, ASCII, Microsoft Shift-
33 JIS, EUC-jp/-kr/-cn, HZ, iso-2022-jp/kr, big5, VISCII and Unicode.
34
35 skf also provides some basic decoding features for some common encod‐
36 ings including MIME, Punycode and URI codepoint. Unicode decomposition
37 feature is also supported since 1.96.
38
39 As noted above, skf is designed to convert input text into some kind of
40 human-readable forms under a local environment (i.e. codeset), and has
41 several extra conversion features like GNU recode type folding. Such
42 conversions include Windows/Macintosh specific code swaps and old-new
43 jis glyph changes, html-format/TeX format conversion and variant unifi‐
44 cations.
45
46 skf also can be compiled as an extension of some lightweight languages.
47 See README.txt for details.
48
If one or more file names are given, skf reads the files and outputs the
converted stream to stdout. If no file names are given, input is taken
from stdin and output also goes to stdout. OPTIONS are taken from the
environment variables SKFENV and skfenv and from the command line, in
that order. Environment variables are not used when skf is running as a
privileged user. skf does not use LOCALE-related environment variables
for conversions, but error message output is controlled by the given
LOCALES.
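
The option lookup order described above can be sketched in Python as
follows. This is only an illustration, not skf's own code: the function
name gather_options, the use of shlex to split the variable contents,
and the privileged flag are assumptions made for the example.

```python
import shlex


def gather_options(environ, argv, privileged=False):
    """Collect options in the documented order: SKFENV, skfenv, command line.

    Hypothetical helper; splitting the variables with shlex is an assumption.
    """
    options = []
    if not privileged:
        # Environment variables are skipped entirely for a privileged user.
        for name in ("SKFENV", "skfenv"):
            options.extend(shlex.split(environ.get(name, "")))
    # Command-line options are taken last.
    options.extend(argv)
    return options


print(gather_options({"SKFENV": "--oc=utf8"}, ["--ic=euc-jp", "in.txt"]))
```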
57
skf is written from scratch, and inherits no code from nkf. However,
skf is intended to be a drop-in replacement for nkf (v1.4) and supports
a similar set of commonly used nkf options.
skf 2.1 recognizes the following options. All are off by default unless
explicitly specified.
64
65 buffering control
66 -b use buffered output. This is default.
67
68 -u use unbuffered output. Code detection feature is disabled when
69 this option is on.
70
71 Input/Output codeset options
--ic=input_code_set
Specify input_code_set as the input codeset. Possible candidates
are shown below.

--oc=output_code_set
Specify output_code_set as the output codeset. Possible candidates
are shown below. The default codeset in the distribution package is
euc-jp, but it depends on compile options. The default codeset is
shown by ´skf -h´.
81
82 Supported codeset
skf recognizes the following codesets as input/output codesets. Codeset
names are case insensitive, and minus ('-') and underscore ('_')
characters are ignored. Note that an iso-2022 escape-based input codeset
(registered with IANA) is recognized automatically, even when a
non-iso2022 codeset (except Unicode and B-Right/V) is specified. In the
in column, o means the named codeset can be specified as input and x
means it cannot. The out column has the same meaning for output.
90
in  out  name               description
o   o    iso8859-1          ascii + iso-8859-1 (latin-1)
o   o    iso8859-2          ascii + iso-8859-2 (latin-2)
o   o    iso8859-3          ascii + iso-8859-3 (latin-3)
o   o    iso8859-4          ascii + iso-8859-4 (latin-4)
o   o    iso8859-5          ascii + iso-8859-5 (Cyrillic)
o   o    iso8859-6          ascii + iso-8859-6 (Arabic)
o   o    iso8859-7          ascii + iso-8859-7 (Greek)
o   o    iso8859-8          ascii + iso-8859-8 (Hebrew)
o   o    iso8859-9          ascii + iso-8859-9 (latin-5)
o   o    iso8859-10         ascii + iso-8859-10 (latin-6)
o   o    iso8859-11         ascii + iso-8859-11 (Thai)
o   o    iso8859-13         ascii + iso-8859-13 (Baltic Rim)
o   o    iso8859-14         ascii + iso-8859-14 (Celtic)
o   o    iso8859-15         ascii + iso-8859-15 (Latin-9)
o   o    iso8859-16         ascii + iso-8859-16
o   o    koi-8r             koi-8r (Russian)
o   o    cp1251             Cyrillic latin MS cp1251
o   o    jis                iso-2022-jp (rfc1468 7bit JIS)
o   o    iso-2022-jp-x0213  iso-2022-jp-3 (JIS X 0213:2000)
                            a.k.a. jis-x0213
o   o    jis-x0213-strict   iso-2022-jp-3-strict
o   o    iso-2022-jp-2004   iso-2022-jp-2004 (JIS X 0213:2004)
                            a.k.a. jis-x0213-2004
o   o    oldjis             iso-2022-jp-1978 (JIS X 0208:1978)
o   o    cp50220            Microsoft codepage 50220
o   o    cp50221            Microsoft codepage 50221
o   o    cp50222            Microsoft codepage 50222
o   o    euc-jp             EUC-encoded JIS X 0208:1997
o   o    euc-x0213          EUC-encoded JIS X 0213:2000
o   o    euc-jis-2004       EUC-encoded JIS X 0213:2004
o   o    cp51932            EUC-encoded Microsoft codepage 932
o   o    euc-kr             EUC-encoded KS X 1001 Korean
o   o    euc7-kr            7bit EUC-encoded KS X 1001 Korean
o   o    uhc                Unified Hangul (Windows cp949)
o   o    johab              KS X 1001-johab Korean
o   o    euc-cn             EUC-encoded GB2312 Chinese
o   o    euc7-cn            7bit EUC-encoded GB2312 Chinese
o   o    hz                 HZ-encoded GB2312 Chinese
o   o    euc-tw             EUC-encoded CNS 11643 Chinese
o   o    gb12345            EUC-encoded GB12345 Chinese
o   o    gbk                GB2312 Extension (cp936) Chinese
o   o    gb18030            GB18030 Chinese
o   o    big5               BIG5 (with Eten extension + EURO)
o   o    cp950              BIG5 (Microsoft cp950 + EURO)
o   o    big5-hkscs         BIG5 with HKSCS
o   o    big5-2003          BIG5-2003
o   o    big5-uao           BIG5-Unicode at On
o   o    sjis               Shift-jis (Microsoft cp943)
o   o    shiftjis-x0213     Shiftjis-encoded JIS X 0213:2000
o   o    shiftjis-2004      Shiftjis-encoded JIS X 0213:2004
o   o    sjis-docomo        Shiftjis-encoded with NTT Docomo emoticons.
o   o    sjis-au            Shiftjis-encoded with AU emoticons.
o   o    sjis-softbank      Shiftjis-encoded with SoftBank emoticons.
o   o    oldsjis            Shift-jis (JIS X 0208:1978)
o   o    cp932              Shift-jis-encoded MS cp932
o   o    cp932w             Shift-jis-encoded MS cp932 with
                            MS compatibility
o   o    viscii             VISCII (rfc1456) Vietnamese
o   o    viqr               VISCII (rfc1456-VIQR) Vietnamese
o   o    keis               Hitachi KEIS83/90
o   x    jef                Fujitsu JEF (basic support only)
o   x    ibm930             IBM EBCDIC DBCS Japanese
o   x    ibm931             IBM EBCDIC DBCS Japanese w.latin
o   x    ibm933             IBM EBCDIC DBCS Korean
o   x    ibm935             IBM EBCDIC DBCS Simpl. Chinese
o   x    ibm937             IBM EBCDIC DBCS Trad. Chinese
o   o    unicode            Unicode(TM) UTF-16LE
o   o    unicodefffe        Unicode(TM) UTF-16BE
o   o    utf7               Unicode(TM) UTF-7
o   o    utf8               Unicode(TM) UTF-8
o   o    utf7-imap          IMAP modified Unicode(TM) UTF-7 (RFC2060)
o   o    mutf8              Java modified Unicode(TM) UTF-8
o   o    cesu8              CESU-8 (Unicode Technical Report #26)
x   o    nyukan-utf-8       Nyukan-moji (Japanese nyukoku-kanrikyoku
         nyukan-utf-16      gaiji). Encoding is utf-8 and utf-16
                            respectively.
o   x    arib-b24           ARIB B24 8-bit JIS-based
o   x    arib-b24-sj        ARIB B24 8-bit SJIS-based
x   o    transparent        Transparent mode (see below)
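
The name matching rule stated before the table (case insensitive, with
'-' and '_' ignored) can be sketched in Python as below. The function
names are hypothetical and the code only illustrates the rule; it is
not taken from skf.

```python
def normalize_codeset_name(name):
    """Fold case and drop '-' and '_' so equivalent spellings compare equal."""
    return name.lower().replace("-", "").replace("_", "")


def same_codeset(a, b):
    """True when two spellings name the same codeset under the rule above."""
    return normalize_codeset_name(a) == normalize_codeset_name(b)


for spelling in ("EUC-JP", "euc_jp", "eucjp"):
    print(spelling, "->", normalize_codeset_name(spelling))
```

Under this rule --oc=EUC-JP, --oc=euc_jp and --oc=eucjp would all select
the same table entry.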
170
171
172 Codeset explanations
173 iso-8859-*
174 When specified as output, G0 = GL is ascii and G1 = GR is
175 iso-8859-*. 8bit encoding is used.
176
iso-2022-jp, jis
Encoding is iso-2022-jp-2 (RFC1554). G0 = GL is JIS X 0201
roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is
JIS X 0212:1990 Supplementary Kanji.
181
182 jis-x0213, iso-2022-jp-3
183 Encoding is iso-2022-jp-3 (JIS X 0213:2000 based). G0 = GL is
184 JIS X 0201 roman, For output, G1 = GR is JIS X 0201 kana, G2 is
185 iso-8859-1 and G3 is JIS X 0213 plane2 Kanji.
186
187 jis-x0213-strict
188 Encoding is subset of iso-2022-jp-3-strict (uses Plane 1 only).
189 For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201
190 kana, G2 is iso-8859-1 and G3 is not set. Output code using JIS
191 X 0208 whenever possible. JIS X 0213 input is automatically rec‐
192 ognized.
193
194 jis-x0213-2004, iso-2022-jp-2004
195 Encoding is iso-2022-jp-2003:2004. For output, G0 = GL is JIS X
196 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3
197 is JIS X 0213 plane2 Kanji.
198
oldjis
Encoding is iso-2022-jp using old JIS X 0208:1978. G0 = GL is
JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1
and G3 is JIS X 0212 Supplementary Kanji.
203
204 euc-jp, euc
205 Encoding is 8-bit EUC using JIS X 0208:1997 character set. G0 =
206 GL is ascii, G1 = GR is JIS X 0208, G2 is JIS X 0201 kana and G3
207 is JIS X 0212 Supplementary Kanji.
208
209 euc-x0213, euc-jis-2003
210 Encoding is 8-bit EUC-based JIS X 0213:2000. G0 = GL is ascii,
211 G1 = GR is X 0213:2000 plane 1, G2 is iso-8859-1 and G3 is JIS X
212 0213:2000 plane2 Kanji.
213
214 euc-jis-2004
215 Encoding is 8-bit EUC-based JIS X0213:2004. G0 = GL is ascii,
216 G1 = GR is X0213:2004 plane 1, G2 is iso-8859-1 and G3 is JIS
217 x0213:2004 plane2 Kanji.
218
euc-kr
Encoding is 8-bit EUC using KS X 1001 Wansung character set. G0
= GL is KS X1003, G1 = GR is KS X1001, G2 and G3 are not set.

euc7-kr iso-2022-kr
Encoding is iso-2022-kr (rfc1557): 7-bit EUC using KS X 1001
Wansung character set. G0 = GL is KS X1003, G1 is KS X1001, G2
and G3 are not set.

euc-cn
Encoding is 8-bit EUC using GB 2312 simplified Chinese character
set. G0 = GL is ASCII, G1 = GR is GB2312, G2 and G3 are not set.

euc7-cn
Encoding is 7-bit EUC using GB 2312 simplified Chinese character
set. G0 = GL is ASCII, G1 is GB2312, G2 and G3 are not set.

hz
Encoding is HZ encoded (rfc1842) GB 2312 simplified Chinese
character set. G0 = GL is ASCII, G1 = GR is GB2312, G2 and G3
are not set.
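
As an illustration of how the euc-cn and hz entries relate, the sketch
below uses Python's standard "gb2312" (EUC-CN) and "hz" codecs, not skf
itself. It shows that HZ carries the same GB2312 code points as 8-bit
EUC, but with the high bit cleared and wrapped in ~{ ... ~} shifts so
that the stream stays 7-bit. The sample string is arbitrary.

```python
text = "中文 ABC"

# 8-bit EUC-CN: each GB2312 character is two bytes with the high bit set.
euc_bytes = text.encode("gb2312")

# HZ: the same two bytes with the high bit cleared, inside ~{ ... ~} shifts.
hz_bytes = text.encode("hz")

print(euc_bytes)
print(hz_bytes)

# Clearing the high bit of the EUC form of one character gives its HZ form.
seven_bit = bytes(b & 0x7F for b in "中".encode("gb2312"))
print(seven_bit)
```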
240
euc-tw
Encoding is EUC encoded CNS11643 Plane1/2 traditional Chinese
character set. Subset of iso-2022-cn. G0 = GL is ASCII, G1 = GR
is CNS11643 plane 1, G2 is CNS11643 plane 2 and G3 is not set.

gb12345
Encoding is 8-bit EUC using GB 12345 (GBF) traditional Chinese
character set. G0 = GL is ASCII, G1 = GR is GB12345, G2 and G3
are not set.

gbk, cp936
Encoding is GBK simplified Chinese character set. G0 = GL is
ASCII and G1 = GR is GBK. G2 and G3 are not set.
254
255 gb18030 (experimental)
256 Encoding is GB18030 (ibm-1392, Windows cp54936) chinese charac‐
257 ter set. Uses ASCII as latin part.
258
259 big5
260 Encoding is Big5 traditional chinese character set with ETen
261 extension. Include Euro mapping. Uses ASCII as latin part.
262
263 cp950
264 Encoding is Microsoft cp950-Big5 traditional chinese character
265 set. Uses ASCII as latin part.
266
267 big5-hkscs (experimental)
268 Encoding is cp950-Big5 traditional chinese character set with
269 HKSCS extension. Uses ASCII as latin part.
270
271 big5-2003 (experimental)
272 Encoding is Big5-2003 Taiwanese standard traditional chinese
273 character set. Uses ASCII as latin part.
274
275 big5-uao (experimental)
276 Encoding is Big5-UAO (http://uao.cpatch.org) traditional chinese
277 character set. Uses ASCII as latin part.
278
VISCII (experimental)
Vietnamese VISCII (rfc1456) character set. Not TCVN-5712.

VIQR (experimental)
Vietnamese VISCII character set with VIQR encoding (rfc1456).
284
285 sjis
286 Encoding is Shift-encoded JIS X 0208:1997 character set. Note
287 that this is not cp932. Uses JIS X 0201 latin as latin(GL) part.
288
289 sjis-x0213, shift_jis-2000
290 Encoding is Shift-encoded JIS using JIS X 0213:2000 character
291 set.
292
sjis-x0213-2004, shift_jis-2004
Encoding is Shift-encoded JIS using JIS X 0213:2004 character
set. 10 newly defined characters are added, but Unicode mapping is
the same as JIS X 0213:2000. Uses JIS X 0201 latin as latin(GL)
part.
298
299 sjis-cellular (experimental)
300 Encoding is Shift-encoded JIS X 0208:1997 character set with NTT
301 Docomo/Vodafone(SoftBank) cellular phone glyph mapping. Output
302 is not supported.
303
cp932 cp932w
Encoding is Microsoft SJIS cp932 with NEC/IBM gaiji area, based
on Windows XP mapping. Uses ASCII as latin(GL) part. --use-compat
and --use-ms-compat are automatically enabled. cp932w provides
further WideCharToMultiByte compatibility.

cp51932
Encoding is Microsoft EUC-based cp51932 with NEC/IBM gaiji area,
based on Windows XP mapping. Uses ASCII as G0 and JIS X 0201
kana as EUC G2 part. G3 is not used for output, and is JIS X
0212:2000 for input. --use-compat and --use-ms-compat are
automatically enabled.

cp50220, cp50221, cp50222
Encoding is Microsoft JIS-based cp50220, cp50221, cp50222 with
NEC/IBM gaiji area, based on Windows XP mapping. For input, skf
accepts cp50220, 50221 and 50222. Note that this codeset is NOT
compatible with iso-2022. Uses ASCII as default character set.
--use-compat and --use-ms-compat are automatically enabled.
323
324 oldsjis
325 Encoding is Microsoft SJIS (JIS X 0208:1978 a.k.a. old JIS).
326 Uses JIS X 0201 latin as latin(GL) part.
327
328 johab
329 Encoding is KS X1001(Johab) character set. Uses KS X1003 latin
330 as latin(GL) part.
331
332 uhc
333 Encoding is UHC (cp949) character set. Uses ASCII as latin(GL)
334 part.
335
unicode, unicodefffe, utf16, utf16le
Encoding is Unicode UTF-16 (v11.0). The default input/output byte
order is little-endian for unicode and utf16le and big-endian for
unicodefffe and utf16; an input byte order mark is recognized.
Output includes a byte order mark by default unless
--disable-endian-mark is specified. Output covers the full UTF-32
range using surrogate pairs unless --limit-to-ucs2 is specified.
Note that ucs2 is not supported in the lightweight language
extension for either input or output, because of a limitation in
SWIG's data passing structure. Specifying ucs2 generates an error.
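
The byte order mark and surrogate pair handling described for the
UTF-16 codesets can be pictured with the following Python sketch. It
uses only Python's standard codecs; decode_utf16 is a hypothetical
helper written for this example and does not reflect skf internals.

```python
import codecs

text = "漢字"

# Output with a byte order mark, in either byte order.
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def decode_utf16(data, default="utf-16-le"):
    """Honor a leading BOM if present, else fall back to a default order."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode(default)


# A character above the BMP occupies two 16-bit units (a surrogate pair).
pair = "\U00020BB7".encode("utf-16-be")
print(le, be, pair)
```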
347
348 utf8
349 Encoding is UTF-8 encoded Unicode (v11.0). Output doesn't
350 include byte order mark unless --enable-endian-mark is speci‐
351 fied. Output range is within UTF-32 unless --limit-to-ucs2 is
352 specified. By default, CESU-8 is not accepted as input. Option
353 --enable-cesu8 enables CESU-8 input for utf-8 converter. CESU-8
354 output is not supported. For UTF-8, endian mark (BOM) is always
355 ignored.
356
357 utf7
358 Encoding is UTF-7 encoded Unicode (v11.0). Input/output range is
359 limited to UTF-16, and value above U+10000 is regarded as unde‐
360 fined. BOM is always ignored for input, and never used for out‐
361 put.
362
363 utf7-imap
364 Modified utf-7 for IMAP protocol described in RFC2060. BOM is
365 always ignored for input, and never used for output.
366
367 mutf8
368 Modified utf-8 for Java language. CESU-8 plus U-0000 encoding.
369 BOM is always ignored for input, and never used for output.
370
371 cesu-8
372 Modified utf-8 described in unicode technical report #26. BOM
373 is always ignored for input, and never used for output.
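
The difference between utf8, cesu-8 and mutf8 can be shown with a small
Python sketch. Python has no built-in CESU-8 codec, so the two encoder
functions below are written by hand for illustration (the names are
hypothetical); they follow the published definitions, in which CESU-8
encodes each UTF-16 surrogate as its own 3-byte sequence and modified
UTF-8 additionally encodes U+0000 as two bytes.

```python
def cesu8_encode(text):
    """Encode text as CESU-8: supplementary characters become two
    3-byte sequences, one per UTF-16 surrogate."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            for unit in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def mutf8_encode(text):
    """Java-style modified UTF-8: CESU-8 plus U+0000 written as C0 80."""
    return cesu8_encode(text).replace(b"\x00", b"\xc0\x80")


ch = "\U00020BB7"
print(ch.encode("utf-8"))   # 4 bytes in standard UTF-8
print(cesu8_encode(ch))     # 6 bytes in CESU-8
```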
374
375 keis (experimental)
376 Encoding is Hitachi KEIS83/90. Output range is limited to EBCDIK
377 and JIS X 0208 area.
378
379 jef (experimental)
380 Encoding is Fujitsu JEF. Input only. Only basic part is sup‐
381 ported.
382
383 ibm930 (experimental)
384 Encoding is IBM DBCS Japanese with EBCDIC Kana
385
386 ibm931 (experimental)
387 Encoding is IBM DBCS Japanese with EBCDIC latin (ibm037)
388
ibm933 (experimental)
Encoding is IBM DBCS Korean with EBCDIC Wansung character set
391
392 ibm935 (experimental)
393 Encoding is IBM DBCS Simplified Chinese with EBCDIC Chinese
394
395 ibm937 (experimental)
396 Encoding is IBM DBCS Traditional Chinese with EBCDIC Chinese
397
398 koi8r
399 Russian KOI-8R code.
400
cp1250
Central European latin Microsoft cp1250 code

cp1251
Eastern European cyrillic Microsoft cp1251 code

arib-b24 arib-b24-sj
ARIB B24 code defined in ARIB STD-B24 vol.1 part.2 chapt. 7.3.
b24 is 8-bit jis based, and b24-sj is sjis based.
410
411 nyukan-utf-8 nyukan-utf-16
412 Normalized Unicode UTF-8/UTF-16 based on Japanese law ministry
413 kokuji No. 582.
414
transparent
Transparent mode. Various code control features, including folding
and line end code conversion, are also ignored.
418
419
420 Shortcuts
421 -j same as --oc=jis
422
423 -s same as --oc=sjis
424
425 -e same as --oc=euc-jp
426
427 -q same as --oc=unicode
428
429 -z same as --oc=sjis
430
431 -E same as --ic=euc-jp. Assume input codeset is EUC-JP.
432
433 -J same as --ic=jis. Assume input codeset is iso-2022-jp.
434
435 -S same as --ic=sjis. Assume input codeset is shift JIS
436
437 -Q same as --ic=utf-16 --input-little-endian.
438
439 -Z same as --ic=utf8.
440
441
442 ISO-2022 Specific controls
These options replace G0-G3 with the assigned character set after the
planes have been set up according to the specified input codeset. Note
that this doesn't change any properties of the original codeset, such
as language and encoding.
447
448 --set-g0=`charset name'
449 Predefines specified code set to plane 0 (G0). Also set to GL at
450 initial state.
451
452 --set-g1=`charset name'
453 Predefines specified code set to right plane (G1). Also set to
454 GR at initial state.
455
456 --set-g2=`charset name'
457 Predefines specified code set to right plane (G2).
458
459 --set-g3=`charset name'
460 Predefines specified code set to right plane (G3).
461
462
Supported values of `charset name' are as follows. 'o' means the codeset
can be assigned to the plane; 'x' means it cannot. For Unicode family
codesets, these options are ignored. For other non-iso2022 categories,
these options are not supported, and the result is unpredictable.
467
468
g0  g1  g2  g3  codeset name  description
o   o   o   o   ascii         ANSI X3.4 ASCII
o   o   o   o   x0201         JIS X 0201 (latin part)
x   o   o   o   iso8859-1     ISO 8859-1 latin
x   o   o   o   iso8859-2     ISO 8859-2 latin
x   o   o   o   iso8859-3     ISO 8859-3 latin
x   o   o   o   iso8859-4     ISO 8859-4 latin
x   o   o   o   iso8859-5     ISO 8859-5 Cyrillic
x   o   o   o   iso8859-6     ISO 8859-6 Arabic
x   o   o   o   iso8859-7     ISO 8859-7 Greek-latin
x   o   o   o   iso8859-8     ISO 8859-8 Hebrew
x   o   o   o   iso8859-9     ISO 8859-9 latin
x   o   o   o   iso8859-10    ISO 8859-10 latin
x   o   o   o   iso8859-11    ISO 8859-11 Thai
x   o   o   o   iso8859-13    ISO 8859-13 latin
x   o   o   o   iso8859-14    ISO 8859-14 latin
x   o   o   o   iso8859-15    ISO 8859-15 latin
x   o   o   o   iso8859-16    ISO 8859-16 latin
x   o   o   o   tcvn5712      TCVN 5712 (Vietnamese)
x   o   o   o   ecma94        ECMA 94 Cyrillic (KOI-8e)
o   o   o   o   x0212         JIS X 0212:1990
o   o   o   o   x0208         JIS X 0208:1997
o   o   o   o   x0213         JIS X 0213 Plane 1:2000
o   o   o   o   x0213-2       JIS X 0213 Plane 2:2000
o   o   o   o   x0213n        JIS X 0213 Plane 1:2004
o   o   o   o   gb2312        Simplified Chinese GB2312
o   o   o   o   gb1988        Chinese GB1988 (latin)
o   o   o   o   gb12345       Traditional Chinese GB12345
o   o   o   o   ksx1003       Korean KS X 1003 (latin)
o   o   o   o   ksx1001       Korean KS X 1001
x   o   o   o   koi8-r        Cyrillic KOI-8R
x   o   o   o   koi8-u        Ukrainian Cyrillic KOI-8U
o   o   o   o   cns11643-1    Traditional Chinese CNS11643-1
x   o   o   o   viscii-r      RFC1456 VISCII (right plane)
o   o   o   o   viscii-l      RFC1456 VISCII (left plane)
x   o   o   o   cp437         Microsoft cp437 (US latin)
x   o   o   o   cp737         Microsoft cp737
x   o   o   o   cp775         Microsoft cp775
x   o   o   o   cp850         Microsoft cp850
x   o   o   o   cp852         Microsoft cp852
x   o   o   o   cp855         Microsoft cp855
x   o   o   o   cp857         Microsoft cp857
x   o   o   o   cp860         Microsoft cp860
x   o   o   o   cp861         Microsoft cp861
x   o   o   o   cp862         Microsoft cp862
x   o   o   o   cp863         Microsoft cp863
x   o   o   o   cp864         Microsoft cp864
x   o   o   o   cp865         Microsoft cp865
x   o   o   o   cp866         Microsoft cp866
x   o   o   o   cp869         Microsoft cp869
x   o   o   o   cp874         Microsoft cp874
x   o   o   o   cp932         Microsoft cp932 (Japanese)
x   o   o   o   cp1250        Microsoft cp1250 (Central Europe)
x   o   o   o   cp1251        Microsoft cp1251 (Cyrillic)
x   o   o   o   cp1252        Microsoft cp1252 (Latin-1)
x   o   o   o   cp1253        Microsoft cp1253 (Greek)
x   o   o   o   cp1254        Microsoft cp1254 (Turkish)
x   o   o   o   cp1255        Microsoft cp1255
x   o   o   o   cp1256        Microsoft cp1256
x   o   o   o   cp1257        Microsoft cp1257
x   o   o   o   cp1258        Microsoft cp1258
530
531 --euc-protect-g1
532 In EUC input mode, suppress sequences to set a charset to G1.
533 Such sequences are discarded.
534
535 --add-annon
536 Add announcer for JIS X 0208:1997 to X 0208 designate sequence.
537 This option works only with iso-2022-based output.
538
539 --input-detect-jis78
540 Distinguish JIS X 0208:1978 codeset and JIS X 0208:1997 codeset.
541 By default, these two charsets are regarded as X 0208:1997. This
542 option is valid only when input encoding is JIS (iso-2022-jp).
543
544
545 JIS X 0212(Supplement Kanji code) Support
546 --x0212-enable
547 skf by default does not output JIS X 0212 code in JIS/EUC mode.
548 This option enables use of JIS X 0212 part. Non-Japanese code,
549 Shift_JIS variants, Unicode or KEIS output ignore this option.
550 Note that this option is supported for backward compatibility.
551 It may not be supported in future versions.
552
553
554 Unicode coding specific control options
skf-2.10 conforms to the Unicode 11.0 specification.
556
--use-compat --suppress-compat
With --suppress-compat, skf substitutes characters in the Unicode
compatibility planes (U+F900 - U+FFFD) with appropriate characters in
non-compatibility planes; these characters are converted to variants
or treated as undefined. With --use-compat, skf outputs characters in
this area as they are. Default is --use-compat. Several codesets
control this as a codeset feature (i.e. use compatibility planes). See
the codeset section.
565
--use-ms-compat
When output is Unicode, make the Unicode mapping Microsoft Windows
compatible. This only changes conversion for some symbols in
JIS-Kanji, and adding the --use-compat option is recommended for
roundtrip conversion. If you need stricter compatibility, try
cp932w as the input codeset.
572
573 --use-cde-compat
574 When output is Unicode, make translation CDE standard codeset
575 compatible.
576
577 --little-endian
578 When output is UTF-16le/be, use little endian byte-order.
579
580 --big-endian
581 When output is UTF-16le/be, use big endian byte-order.
582
583 --disable-endian-mark --enable-endian-mark
584 When output is UTF-16 or UTF-8, do not use/use byte order mark‐
585 ing. To make UTF-16N, use this option with --little-endian. By
586 default, BOM is enabled for UTF-16 and disabled for UTF-8.
587
588 --input-little-endian
589 When input is UTF-16le/be, assume input is little endian byte-
590 ordered.
591
592 --input-big-endian
593 When input is UTF-16le/be, assume input is big endian byte-
594 ordered.
595
596 --endian-protect
597 Do not use endian mark in input stream. Endian mark is just dis‐
598 carded. This is off by default.
599
600 --limit-to-ucs2
601 Do not use > 0x10000 area code in Unicode (i.e. limits code to
602 BMP area). This option doesn't limit internal code range in
603 skf. This is off by default.
604
605 --disable-cjk-extension
606 Treat CJK extension A/B areas as undefined. This is off (i.e.
607 these areas are enabled) by default.
608
609 --enable-cesu8
610 Enable CESU-8 input in utf-8 codeset. Ignored for any other
611 codesets.
612
--non-strict-utf8
Enable broken (decodable but not obeying specs.) utf-8 input. If
you need this option, proceed with extra care.
616
617 --enable-nfd-decomposition --disable-nfd-decomposition
618 Enable/Disable Unicode Normalized decomposition. Default is dis‐
619 abled.
620
621 --enable-nfda-decomposition --disable-nfda-decomposition
622 Enable/Disable Apple-compatible Unicode Normalized decomposi‐
623 tion. Default is disabled.
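
What standard NFD decomposition does to Japanese text can be seen with
Python's unicodedata module, as in the sketch below. It only shows the
standard Unicode NFD form; the Apple-compatible variant selected by
--enable-nfda-decomposition differs in some ranges and is not
reproduced here.

```python
import unicodedata

composed = "が"  # U+304C HIRAGANA LETTER GA

# NFD splits the letter into its base kana and a combining voiced mark.
decomposed = unicodedata.normalize("NFD", composed)
print([f"U+{ord(c):04X}" for c in decomposed])

# NFC reverses the decomposition.
recomposed = unicodedata.normalize("NFC", decomposed)
```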
624
625 --oldcell-to-emoticon
626 Convert old cell-phone gaiji area to emoticon. Supported: NTT
627 Docomo/AU emoticons. A reverse mapping is not supported.
628
--fix-ms-radical-bug
mscvrt on Windows 10 20H1 or later has an infamous bug which
converts some Kanji to Kanji radicals. This option converts the
radical area back to the appropriate Kanji. This option is for
Unicode output.
634
635
636
Miscellaneous codeset related options
638 --old-nec-compat
639 Enable old NEC kanji sequence (ESC-K,H). Needs compile option
640 --enable-oldnec at configuration.
641
642 --no-utf7
643 Assume input codeset is *NOT* UTF-7 encoded Unicode. This
644 option disables input utf7 testing.
645
646 --no-kana
647 Assume input codeset does *NOT* include JIS X 0201 kana.
648
649 --input-limit-to-jp
650 Tell detection mechanism that input is some kind of Japanese
651 codeset.
652
653
654 OUTPUT Conversions options
skf is intended to output a stream to stdout, but an nkf-compatible
file-encoding change option is also provided.

--overwrite[=SUFFIX] --in-place[=SUFFIX]
Converts the encoding of the file(s) specified as input. --overwrite
preserves the file change date. If the SUFFIX parameter is given, the
input file is backed up under a name with this SUFFIX appended.
662
skf has various features to adapt output files to the local
environment. Most of these are controlled by the extended control
switches described in this section.
666
667 --use-g0-ascii
668 set G0(=GL) for output encoding to ASCII, ignoring codeset des‐
669 ignation.
670
671 X-0201 Kana/latin conversions
skf by default converts X-0201 kana to X-0208 kana. To output X-0201
kana as it is, use one of the following options. When output is
designated as EUC or SJIS, these options enable X-0201 kana output in
the way provided by each encoding. When Unicode output is specified,
output of the (equivalent) kana part is controlled by --use-compat, not
by the following switches, which are valid only when the output codeset
is NOT in the Unicode family.
678
679 --kana-jis7
680 use SI/SO locking shift sequence to designate X-0201 kana. This
681 switch is valid for jis, jis-x0213 and cp50220 (i.e. cp50221)
682 encoding. For other codesets, this option is ignored.
683
684 --kana-jis8
685 output X-0201 kana using 8-bit code right plane. This switch is
686 valid for jis and jis-x0213 encoding. For other codeset, this
687 option is ignored.
688
689 --kana-esci --kana-call
690 use ESC-(-I to designate X-0201 kana. This switch is valid for
691 jis, jis-x0213 and cp50220 (i.e. cp50222) encoding. For other
692 codeset, this option is ignored.
693
694 --kana-enable
695 If output is EUC-JP or cp51932, use X-0201 kana with G2. If
696 SJIS output, it is same as --kana-jis8. When JIS output, it is
697 same as --kana-call.
698
699 --use-iso8859-1
700 Enable iso-8859-1 output. Iso-8859-1 is invoked to G1 and set to
701 GR plane.
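
The default X-0201 to X-0208 kana conversion described in this section,
and the byte forms that X-0201 kana take in EUC-JP and Shift_JIS, can
be approximated in Python as follows. This is an illustration built on
Python's unicodedata and standard codecs; kana_to_x0208 is a
hypothetical helper, and using NFKC on the half-width range is an
assumption that approximates, but may not exactly match, skf's own
mapping.

```python
import re
import unicodedata

# Half-width (JIS X 0201) kana occupy U+FF61..U+FF9F in Unicode.
HALFWIDTH_KANA = re.compile("[\uFF61-\uFF9F]+")


def kana_to_x0208(text):
    """Replace runs of half-width kana with full-width (X 0208) kana."""
    return HALFWIDTH_KANA.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )


print(kana_to_x0208("ｶﾞｲｼﾞ ABC"))

# When X-0201 kana is output as is, EUC-JP carries it in G2 behind the
# single-shift byte 0x8E, while Shift_JIS uses a single 8-bit byte.
euc_kana = "ｱ".encode("euc_jp")
sjis_kana = "ｱ".encode("shift_jis")
```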
702
703
704 URI/TeX format conversion feature options
With Unicode(tm) family output codings, skf outputs the non-ascii latin
character part as it is, but with other output codings, skf converts
these characters using the following rules:

(1) If a code is defined in the specified output codeset, that code
point is used for output.
(2) If one of the following html convert modes is enabled (i.e.
--convert-html --convert-sgml) and the code is defined in the html/sgml
codeset, it is converted to an entity reference or codepoint reference.
(3) If tex convert mode is enabled and the code is defined in tex
expression, it is converted to tex format.
(4) If the code is a kind of combined ligature, it is shown as a set
of characters.
(5) Otherwise, a kind of replacement character is shown, with a warning.
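
The fallback order in the rules above can be sketched in Python. The
sketch covers only rules (1), (2) and (5); the TeX and ligature steps
are omitted. The function name convert_char, the use of Python's
html.entities table, and the '?' replacement byte are assumptions made
for illustration and are not skf's actual tables or output.

```python
from html.entities import codepoint2name


def convert_char(ch, codec, html=False):
    """Convert one character following rules (1), (2) and (5)."""
    try:
        # Rule (1): use the code point if the output codeset defines it.
        return ch.encode(codec)
    except UnicodeEncodeError:
        pass
    if html:
        # Rule (2): entity reference if one exists, else a codepoint
        # reference in decimal.
        name = codepoint2name.get(ord(ch))
        if name is not None:
            return f"&{name};".encode("ascii")
        return f"&#{ord(ch)};".encode("ascii")
    # Rule (5): fall back to a replacement character (assumed to be '?').
    return b"?"


print(convert_char("é", "latin-1"))
print(convert_char("é", "ascii", html=True))
print(convert_char("é", "ascii"))
```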
719
--convert-html --convert-sgml --convert-xml
Enable html convert mode. This mode is cleared by --reset. These
options are synonyms, and are treated as the same option.
723
724 --convert-html-decimal
725 Enable html code-point decimal convert mode. This mode is
726 cleared by --reset.
727
728 --convert-html-hexadecimal
729 Enable html code-point hexadecimal convert mode. This mode is
730 cleared by --reset.
731
732 --convert-tex
733 Enable TeX convert mode. This mode is cleared by --reset.
734
735 --convert-perl
736 Enable Perl5 literal convert mode. This mode is cleared by
737 --reset.
738
739 --convert-java
740 Enable Java literal convert mode. This mode is cleared by
741 --reset.
742
743 --convert-python
744 Enable Python literal convert mode. This mode is cleared by
745 --reset.
746
--use-replace-char
In Unicode, use the unicode replacement character (U+fffc) for
undefined characters.
750
751
752 Extended Options
753 Encoding/Decoding control options
754 --decode=`encoding scheme'
755
756 --encode=`encoding scheme'
Specify a decoding/encoding scheme for the input stream. Supported
schemes for decoding are 'hex', 'mime', 'mime_q', 'mime_b', 'uri',
'ace', 'hex_perc_encode', 'base64', 'qencode', 'rfc2231', 'rot' and
'none'. These mean CAP hex-code, mime, mime Q-encoding, mime
B-encoding, uri character reference, ACE punycode, uri percent
notation, base64, Q-encoding, rfc2231 and rot13/47 respectively.
'none' means no decoding.
For encoding, 'hex', 'mime_b', 'mime_q', 'uri', 'ace', 'cap',
'hex_perc_encode', 'base64' and 'none' are supported. Output with
encoding is not supported for EBCDIC related codesets and some
already ascii-encoded codesets (e.g. UTF-7).
Only one decode/encode option is valid; if more than one is
specified, the last one is used. When one of the mime decodings is
specified, the base text is assumed to be EUC encoded unless specified
otherwise. Except for rot, which assumes the input stream is
Shift_JIS, EUC or iso-2022-jp, these encodings assume the input
stream is ascii (as defined in RFC2045). Some encodings may co-exist
with encoding, but this is not guaranteed. In particular, if the input
is UTF-16/UCS2 code, these encodings are ignored in skf.
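
To give a feel for what several of the decoding schemes named above
undo, the following Python sketch decodes a MIME B-encoded word, a URI
percent-encoded string, an ACE (punycode) label and a rot13 string
using only the Python standard library. It does not call skf, and the
sample strings are made up for the example.

```python
import base64
import codecs
from email.header import decode_header
from urllib.parse import unquote

# MIME B-encoding: build an encoded word, then decode it again.
raw = "日本語".encode("iso2022_jp")
encoded_word = "=?ISO-2022-JP?B?" + base64.b64encode(raw).decode("ascii") + "?="
payload, charset = decode_header(encoded_word)[0]
mime_text = payload.decode(charset)

# URI percent notation (UTF-8 bytes written as %XX).
uri_text = unquote("%E6%97%A5%E6%9C%AC")

# ACE punycode label, without the "xn--" prefix.
ace_text = b"bcher-kva".decode("punycode")

# rot13 on the ASCII letters.
rot_text = codecs.decode("Uryyb", "rot_13")

print(mime_text, uri_text, ace_text, rot_text)
```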
777
--mime-ms-compat
Treat japanese generic codesets as Microsoft cp932 compatible.
More specifically, with this option skf treats iso-2022-jp as
cp50220, euc-jp as cp51932 and Shift_JIS as cp932w.

--mime-persistent
skf detects address-like strings and excludes them from mime
encoding. This option disables such behavior. Default in
nkf-compatible mode.
785
786
787 Shortcut
788 -m same as --decode=mime
789
790 -mB same as --decode=mime_b
791
792 -mQ same as --decode=qencode
793
794 -m0 same as --decode=none
795
796 -M same as --encode=mime_b
797
798 -MB same as --encode=base64
799
800 -MQ same as --encode=qencode
801
802 End of line control options
803 --lineend-thru
804 Output end-of-line code as it is. Also output ^Z code as it is.
805 This is default.
806
--lineend-cr --lineend-mac -Lm
Use CR as end-of-line code. Also delete ^Z code from input
stream.

--lineend-lf --lineend-unix -Lu
Use LF as end-of-line code. Also delete ^Z code from input
stream.

--lineend-crlf --lineend-windows -Lw
Use CR+LF as end-of-line code. Also delete ^Z code from input
stream. This option doesn't preserve original order of cr and
lf.
819
820 --input-cr
821 Assume input stream uses CR as end-of-line code.
822
823 --input-lf
824 Assume input stream uses LF as end-of-line code.
825
826 --input-crlf
827 Assume input stream uses CR+LF as end-of-line code.
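
The effect of the end-of-line options can be sketched in Python as
below. The function convert_line_ends is hypothetical, and the sketch
works on raw bytes, which assumes an ASCII-compatible byte stream (it
would not be correct for UTF-16, for example). It is meant only to
illustrate the described behavior: every line end is rewritten to the
chosen code and ^Z (0x1A) is removed.

```python
def convert_line_ends(data, eol):
    """Rewrite CR, LF and CR+LF to eol and drop ^Z bytes."""
    data = data.replace(b"\x1a", b"")
    # Normalize every line end to LF first, then emit the requested code.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return data.replace(b"\n", eol)


sample = b"one\r\ntwo\rthree\n\x1a"
print(convert_line_ends(sample, b"\n"))      # like --lineend-lf
print(convert_line_ends(sample, b"\r\n"))    # like --lineend-crlf
```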
828
829 -F[line_length[-kinsoku]]
830
831 -f[line_length[-kinsoku]] -f[line_length[+kinsoku]]
832 Wrap input lines at line_length columns. The f option deletes
833 CR/LFs in input, and the F option doesn't delete them. For Japanese
834 convention, both gyoutou-kinsoku (by burasage-gumi) and
835 gyoumatsu-kinsoku (by oidasi-gumi) are supported. The burasage-
836 length is controlled by the kinsoku option. The default value for
837 line_length is 66, and it must be < 1000. The default value for kinsoku
838 is 5, and it must be <= 10. With the 'f' option, skf autodetects
839 paragraphs and retains some CR/LF. The 2nd 'f' option format (with '+')
840 disables this behaviour. In nkf compatible mode, some fold
841 behaviors change as follows.
842 (1) Default line_length is set to 60, and kinsoku value is 10.
843 (2) alpha numeric characters become gyoutou-kinsoku characters.
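
The --lineend-cr, --lineend-lf and --lineend-crlf options described earlier in this section all do two things: rewrite every end-of-line code to one target and delete ^Z (0x1A) from the stream, while --lineend-thru leaves both untouched. The following Python sketch mirrors that description only. It is not skf's code; the function name and the mode strings are hypothetical, and it recognizes just CR, LF and CR+LF as input line ends.

```python
CTRL_Z = "\x1a"  # the ^Z code mentioned in the option descriptions
EOL = {"cr": "\r", "lf": "\n", "crlf": "\r\n"}


def convert_lineend(text: str, mode: str = "thru") -> str:
    """Rewrite line ends in text according to mode ('thru', 'cr', 'lf', 'crlf')."""
    if mode == "thru":
        # Pass line ends and ^Z through unchanged.
        return text
    target = EOL[mode]
    text = text.replace(CTRL_Z, "")
    # Normalize CR+LF first so it is not counted as two line ends.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\n", target)
```

Normalizing to a single form before emitting the target keeps a mixed-convention input from producing doubled line ends.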
844
845 File control options
846 --filewise-detect --force-reset
847 Reset and re-detect input code set at the start of each file.
848
849 --linewise-detect
850 Reset and re-detect input code set at the start of each line.
851
852
853 Compatibility options
854 --nkf-compat
855 Interpret the following options in an nkf-compatible manner. -l,
856 -d, -c, -x, -X, -w and -W work as in nkf2.x. -f and -F behavior is
857 changed as shown above. -T, -i and -o are not supported. Most
858 other nkf options and switches also work like nkf, except in
859 case of error.
860
861 --skf-compat
862 Interpret the following options in the skf-native manner.
863
864 -r nkf-compatible rot. Works only with --nkf-compat mode. Allowed
865 input encodings are limited to JIS/Shift_JIS/EUC.
866
867 -h[123] --hiragana --katakana --katakana-hiragana
868 -h, -h1 and --hiragana convert all kanas to hiragana. -h2 and
869 --katakana convert all kanas to katakana. -h3 and
870 --katakana-hiragana swap katakana and hiragana.
871
872 --nkf-help
873 show option difference/compatibility between skf and nkf.
874
875 --in-place[=SUF] --overwrite[=SUF]
876 Replace the specified file with its converted output. --overwrite
877 retains the file creation time stamp. If a suffix is given, the suffix
878 is added to the output file name and the input file is not removed.
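
The three kana modes of the -h option (all to hiragana, all to katakana, swap) can be pictured as a fixed offset between the Unicode hiragana and katakana blocks. The sketch below is an illustration under that assumption, not skf's implementation: the function name and mode strings are hypothetical, and it covers only full-width kana in U+3041..U+3096 and U+30A1..U+30F6, leaving everything else (including half-width X 0201 kana and the long vowel mark) unchanged.

```python
OFFSET = 0x60  # distance between a hiragana code point and its katakana twin

TO_KATAKANA = {cp: cp + OFFSET for cp in range(0x3041, 0x3097)}
TO_HIRAGANA = {cp: cp - OFFSET for cp in range(0x30A1, 0x30F7)}
SWAP = {**TO_KATAKANA, **TO_HIRAGANA}

TABLES = {"hiragana": TO_HIRAGANA, "katakana": TO_KATAKANA, "swap": SWAP}


def convert_kana(text: str, mode: str) -> str:
    """Convert kana in text; mode is 'hiragana', 'katakana' or 'swap'."""
    return text.translate(TABLES[mode])
```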
879
880
881 Lightweight language specific options
882 The skf plugin for lightweight languages supports a subset of options.
883 Specifically, file input/output related options (-b, -u, --overwrite,
884 --in-place, --filewise-detect, --linewise-detect, --show-filename,
885 --suppress-filename) and UTF-16 output are disabled (except ruby or python3).
886
887
888 Ruby-1.9.x/2.x specific options
889 Since ruby 1.9, ruby uses CCS string handling. skf returns the output
890 string with the specified codeset. The following options override this
891 behavior.
892
893 --rb-out-ascii8bit
894 returns string with ascii-8bit encoding.
895
896 --rb-out-string
897 returns string with specified encoding.
898
899 Python-3.x specific options
900 Since native codeset representation in python3.x is UCS2/UCS4, skf
901 behaves differently with output codeset option. If output codeset is
902 either UTF-16 or UTF-32(in wide mode), skf returns Unicode object, and
903 for all other codesets skf returns binary array object. Following
904 options change this behavior.
905
906 --py-out-binary
907 use pseudo unicode binary stream to output.
908
909 --py-out-string
910 use binary array object on UTF-16/32 output. BOM is enabled.
911 skf accepts either a binary array or a unicode object for
912 input.
913
914
915 Misc. Control options
916 --disable-space-convert --enable-space-convert
917 skf converts an ideographic space into two ascii spaces. Dis‐
918 able option disables, and enable option enables this behavior.
919 Default is disabled.
920
921 --html-sanitize
922 Convert several characters in HTML document to entity reference
923 expression. Specifically, "!#$&%()/<>:;?´ are escaped by entity-
924 references.
925
926 --filewise-detect --force-reset
927 If multiple input files are given, detect input codeset for each
928 file.
929
930 --linewise-detect
931 Detect input code line-wise. Note this option weakens code
932 detect correctness.
933
934 --reset
935 Reset all flags specified by extended controls and environment
936 variables.
937
938 --inquiry --guess
939 skf detects the input code and outputs the detection result to stdout.
940 No filtered output is produced. If multiple input files are given,
941 --show-filename is automatically enabled.
942
943 --hard-inquiry
944 Similar to --inquiry, but reports both the code and the end-of-line
945 character.
946
947 --suppress-filename
948 When inquiry(--inquiry) is on, this option disables file name
949 output. This option overrides --show-filename.
950
951 --show-filename
952 When inquiry(--inquiry) is on, this option adds each file name
953 to output.
954
955 --invis-strip
956 Delete all escape sequences not belonging to ISO-2022 code
957 extension. This is intended to replace invisstrip command bun‐
958 dled in inews package.
959
960 -I Warn if input has unassigned code points.
961
962 -v print version information and exit.
963
964 --help print brief help and exit.
965
966 --show-supported-codeset
967 Display supported codesets (input) and exit. Both canonical
968 names (left side) and detailed names are shown. This canonical
969 name can be used as MIME charset and also as ic-option code
970 specification.
971
972 --show-supported-charset
973 Display supported character sets (output) and exit. Both canonical
974 names and detailed names are shown. Some charsets with special
975 treatments (i.e. meaningless as set-g* parameters) intentionally
976 lack addressable cnames.
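
The --html-sanitize entry above lists the characters that are escaped but does not say which reference form is emitted. The Python sketch below assumes numeric character references purely for illustration. The character set is copied from that entry as printed, and the function name is hypothetical. Translating character by character matters here because '&', '#' and ';' are themselves in the set, so output must not be rescanned.

```python
# Characters copied from the --html-sanitize description, as printed there.
ESCAPED = "\"!#$&%()/<>:;?\u00b4"

# Assumption: numeric character references; skf's actual output form may differ.
TABLE = {ord(ch): "&#{};".format(ord(ch)) for ch in ESCAPED}


def html_sanitize(text: str) -> str:
    """Replace each listed character with a numeric character reference."""
    # str.translate maps every input character once, so the '&', '#' and ';'
    # produced by a replacement are never escaped a second time.
    return text.translate(TABLE)
```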
977
978
980 /usr/(local/)share/skf/lib/ (Unices)
981
982 /Program Files/skf/share/lib (MS Windows)
983 These directories are where external codeset conversion tables
984 go. The location that the current skf assumes is shown by the -h
985 option.
986
987
989 skf is written by Seiji Kaneko (efialtes@osdn.jp) based on an idea from
990 nkf, written by Itaru Ichikawa (ichikawa@flab.fujitsu.co.jp). The X 0213
991 code table is derived from the work of earthian@tama.or.jp. Some codeset
992 mappings are derived from various sources. Detailed origins are shown in
993 the copyright document included in this distribution.
994
995
997 skf is inspired by works or requests by shinoda@cs.titech,
998 kato@cs.titech, uematsu@cs.titech, void@global, ohta@ricoh, Hinata(HKE),
999 Ashizawa(CRL), Kunimoto(SDL), Oohara(Univ of Kyoto), Jokagi(elf2000) and
1000 Naruse (at osdn.jp). Thanks.
1001
1002
1004 1. skf can handle mixed coding with some limitations. However, code
1005 detection tends to fail for mixed code, and giving explicit input code
1006 set is strongly encouraged, if codeset is known beforehand.
1007 In case of need, --linewise-detect option may help, but code detecting
1008 will more likely fail.
1009
1010 2. skf implements ISO-2022 with following exceptions.
1011 i) GL 0x20 is always space, even when a 96-character codeset is invoked
1012 to GL.
1013 ii) Sequences for setting codes to C1 and C2 are always ignored.
1014 iii) If an unknown sequence is given to G0, G0 is set to ascii, and
1015 locking/single shift is cleared. An unknown sequence call to set G1-G3
1016 is just ignored.
1017 Private charsets are also not supported and are ignored.
1018 iv) Sequences for 96-character multibyte coding are ignored (Currently,
1019 no codeset is registered).
1020 v) Calling UTF-8, UTF-16 coding system from iso-2022 is supported, and
1021 returns to previous coding system by standard return.
1022 Callings and returns to/from other coding schemes are ignored.
1023 vi) For supporting some of cellular phone glyphs, several private (not
1024 registered) codesets are defined in skf, and can be called by appropri‐
1025 ate sequences.
1026
1027 3. Error output coding is controlled by LOCALE environment variables on
1028 UN*X systems. skf doesn't handle situations where stdout and stderr
1029 are redirected into the same stream. Such cases should be handled by
1030 the user.
1031
1032 4. skf converts KEIS/JIS X 0213 code using CJK-extension B area and CJK
1033 compatibility area. For this reason, X 0213 and KEIS conversion results
1034 vary depending on the --use-compat and --limit-to-ucs2 switches.
1035
1036 5. JIS X 0207:1979 is not supported. JIS X 0211:1987 is designed to be
1037 supported (i.e. common terminal control sequence will be transparently
1038 passed to output).
1039
1040 6. Even if unbuffer option(-u) is specified, some code-translation
1041 related bufferings are still performed (in MIME, kana, VIQR etc.).
1042
1043 7. skf-1.9x or later recognizes and handles languages in iso639-1(alpha
1044 2). iso639-2 is not supported as a valid language set.
1045
1046 8. Unicode IVS is not supported. Sequences are just discarded.
1047
1048 9. skf-1.9x or later does not retain Macintosh RLO-ordered character
1049 property. Codesets with this kind of codes are not supported.
1050
1051 10. CNS11643 4th, 5th, 6th planes are not supported.
1052
1053 11. In the python 3 extension, the codeset detected by inquiry for input
1054 unicode strings is always UTF-32be.
1055
1056 12. In lightweight language extension except ruby and python,
1057 UCS2/UTF-16 are not supported.
1058
1059
1060
1062 1. Extended options have changed extensively since skf-1.9. Some archaic
1063 options (e.g. -B, -@ and -r) have been deleted from this version.
1064
1065 2. skf was originally forked from nkf, but no longer contains any nkf
1066 code. The copyright notice is retained in honor of that origin.
1067
1068 3. From version 1.9, default Japanese character set assumed by skf has
1069 changed to JIS X 0208:1990 with Microsoft Japanese Windows gaiji (i.e.
1070 CP932).
1071
1072 4. Code autodetection is not perfect by design. If it has failed to
1073 detect input code properly, please give input code information explic‐
1074 itly.
1075
1076 5. Some ligatures in Unicode, cp932 gaiji and KEIS83 are converted
1077 using JIS X 0124 and other convention. During this conversion, its
1078 byte length is not preserved.
1079
1080 6. skf is intended to pass ANSI compatible terminal control codes
1081 transparently, but this is not guaranteed.
1082
1083 7. nkf's -i and -o options work only in nkf-compat mode. They are obsolete
1084 options as of 1.97, and are valid only with iso-2022-jp, without
1085 considering output codeset specifications.
1086
1087 8. For unconverted characters, skf uses geta and undefined character as
1088 --use-replace-char option. If the output codeset doesn't contain a geta
1089 code, skf prefers the 'black square' character, and otherwise uses '.'.
1090
1091 9. There are some undocumented options. These options should be consid‐
1092 ered as highly experimental.
1093
1094 10. In lineend_thru mode with folding, skf remembers the order in which
1095 cr and lf appear in the stream, and uses that order. Because of this
1096 design, if skf needs to output a line-end character before any line-end
1097 character appears in the input stream, input order may not be preserved.
1098
1099 11. NKF-compatibility
1100 1) --prefix, some --fb's and --no-best-fit-chars are not supported.
1101 2) MSDOS (and -T), --exec-in and --exec-out are not supported. -O is
1102 supported.
1103 3) MIME decoding/encoding handling behaviors differ in various ways.
1104 4) lineend conversion acts differently. Results may not be same for
1105 some messy text.
1106 5) The -r option and --decode=rot are different. See each option
1107 description.
1108 6) detected codeset name is not compatible with nkf. --help and --ver‐
1109 sion return different results.
1110 7) in-place and overwrite suffix with * is not supported.
1111
1112 12. Conversion to NYUUKAN GAIJI is as follows
1113 1) Kanji codes in JIS X0208(1997), JIS X0212(1990), JIS X0213(2004/2012),
1114 Houmusho-kokuji No.582 beppyou No.1 are sent to output as is.
1115 2) Kanji codes in beppyou No.4-2 leftmost columns are converted to the
1116 first priority character in the table. If the second priority characters
1117 appear, the codes are sent to output as is.
1118 3) Other kanji codes are converted as undefined codes. See the conversion
1119 method above. Non-kanji codes (latins, glyphs etc.) are sent to output
1120 as is.
1124
1125 13. ARIB B24 compatibility
1126 1) Input only. ARIB B24 output is not supported.
1127 2) Neither international encoding nor X0213 extension is supported.
1128 3) Macro define sequences are suppressed. These sequences are recognized
1129 and discarded.
1130 4) Without specifying arib codeset, skf treats Arib-defined codepage as
1131 follows.
1132 i) private codepages are supported. ascii/jis x-0201 0x5f is not modified.
1133 ii) macro define/invoke and rpc invoke do not work. These characters are
1134 discarded.
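
Item 8 in the list above gives a preference order for the character that stands in for unconvertible input: geta, then a black square, then '.'. A small Python sketch of that selection follows. It is an illustration only, not skf's logic: the function name is hypothetical, the code points U+3013 (geta mark) and U+25A0 (black square) are assumptions about which characters are meant, and Python's codecs stand in for skf's output codesets.

```python
GETA = "\u3013"          # geta mark (assumed code point)
BLACK_SQUARE = "\u25a0"  # black square (assumed code point)
FALLBACK = "."


def pick_replacement(encoding: str) -> str:
    """Return the first candidate that the given output encoding can represent."""
    for candidate in (GETA, BLACK_SQUARE, FALLBACK):
        try:
            candidate.encode(encoding)
            return candidate
        except UnicodeEncodeError:
            # This encoding lacks the candidate; try the next one in order.
            continue
    return FALLBACK
```

With Python's codecs, euc_jp contains the geta mark, cp437 contains only the black square, and ascii contains neither, so the three branches are each reachable.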
1138
1139
1141 Unicode(TM) is a trademark of Unicode, Inc. Microsoft and Windows are
1142 registered trademarks of Microsoft Corporation. Macintosh is a registered
1143 trademark of Apple Inc. Vodafone is a trademark of Vodafone K.K.
1144 Other names and terms may be trademarks or registered trademarks of
1145 their respective owners. The trademark symbol (TM) may be omitted in this
1146 manual page.
1147
1148
1149
1150
1151 10/Aug/2018 SKF(1)