SKF(1) General Commands Manual SKF(1)

NAME
skf - simple Kanji Filter (v1.97)

SYNOPSIS
skf [-AEIJKNQRSXZabehjknqrsuvxz] [ long_format_options ] [infiles..]

DESCRIPTION
skf is yet another i18n-capable kanji filter, designed for reading
various CJK-coded files on the Net. skf converts input kanji texts or
streams into a character stream in the designated codeset and writes
it to standard output. Specifically, skf is designed to be a versatile
filter for reading documents in various code sets, and does not
provide features unrelated to code conversion.
18
Like nkf, skf automatically recognizes the input file code when it is a
kind of ISO-2022 compliant code, and also detects EUC-variant codes if
the input file is Japanese text without X 0201 kana. skf 1.9x can read
various ISO-2022 compliant character sets, including JIS Kanji codes (X
0208, X 0212 and X 0213), EUC encodings (euc-jp (with X 0213 support),
euc-cn, euc-kr and euc-tw), ISO European Latin sets (ISO-8859-1 to 11,
13/14/15/16) and many regional character sets. skf can also read some
non-ISO-2022 compliant sets, including Microsoft Shift-JIS code,
KOI-8-R/U, GB2312 (HZ), big5, VISCII (rfc1456, including VIQR), the
Unicode standard (UCS2/UTF-16, UTF7 and UTF8), some MS codesets (cp1250
etc.) and some other vendor-specific codes (KEIS83, JEF etc.).
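
To illustrate why ISO-2022 compliant input can be recognized automatically: such streams announce their character sets with escape sequences. The following Python sketch scans a byte string for a few well-known designation sequences. It is not skf's detection code; the table and the function name find_designations are assumptions made for this example, and Python's iso2022_jp codec is used only to produce sample input.

```python
ESC = b"\x1b"

# A small, hand-picked table of ISO-2022 designation sequences.
# (Illustrative only; skf's own tables are far larger.)
DESIGNATIONS = {
    b"\x1b$@": "JIS X 0208:1978",
    b"\x1b$B": "JIS X 0208",
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201 roman",
    b"\x1b(I": "JIS X 0201 kana",
    b"\x1b$(D": "JIS X 0212",
}


def find_designations(data: bytes) -> list:
    """Return the character sets designated in data, in order of first use."""
    found = []
    pos = data.find(ESC)
    while pos != -1:
        for seq, name in DESIGNATIONS.items():
            if data.startswith(seq, pos):
                if name not in found:
                    found.append(name)
                break
        pos = data.find(ESC, pos + 1)
    return found


sample = "漢字 kanji".encode("iso2022_jp")
print(find_designations(sample))
```

An EUC or Shift-JIS stream contains no such escape sequences, which is why those codesets need heuristic detection instead.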
30
31 Supported output character sets of skf are more limited, but still
32 include X 0208/X 0212/X 0213 JIS, X 0201 JIS, ASCII, Microsoft Shift-
33 JIS, EUC-jp/-kr/-cn, HZ, iso-2022-jp/kr, big5, VISCII and Unicode.
34
skf also provides basic decoding features for some common encodings,
including MIME, Punycode and URI codepoint. A Unicode decomposition
feature has also been supported since 1.96.
38
As noted above, skf is designed to convert input text into a
human-readable form under a local environment (i.e. codeset), and has
several extra conversion features, such as GNU recode-style folding.
Such conversions include Windows/Macintosh specific code swaps, old-new
JIS glyph changes, html-format/TeX format conversion and variant
unifications.
45
46 skf also can be compiled as an extension of some lightweight languages.
47 See README.txt for details.
48
If one or more file names are given, skf reads the files and writes the
converted stream to stdout. If no file names are given, input is taken
from stdin and output still goes to stdout. Options are taken from the
environment variables SKFENV and skfenv and from the command line, in
that order. Environment variables are not used when skf is running as a
privileged user. skf does not use LOCALE-related environment variables
for conversions, but output of error messages is controlled by the
given locale settings.
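
The option precedence described above can be sketched in Python as follows. This is only an illustration of the stated order, not skf's implementation: the function name effective_options is hypothetical, and splitting the environment variables on whitespace with shlex is an assumption.

```python
import shlex


def effective_options(environ, argv, privileged=False):
    """Collect options in the documented order: SKFENV, skfenv, command line.

    Environment variables are skipped for a privileged user.
    Whitespace splitting via shlex is an assumption of this sketch.
    """
    options = []
    if not privileged:
        for name in ("SKFENV", "skfenv"):
            options.extend(shlex.split(environ.get(name, "")))
    options.extend(argv)
    return options


env = {"SKFENV": "--oc=utf8", "skfenv": "--lineend-lf"}
print(effective_options(env, ["--ic=euc-jp", "in.txt"]))
print(effective_options(env, ["--ic=euc-jp", "in.txt"], privileged=True))
```

Because the command line comes last, an option given there is read after the same option set in the environment.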
57
OPTIONS
skf-1.9 is written from scratch and inherits no code from nkf. However,
skf is intended to be a drop-in replacement for nkf (v1.4) and has a
similar set of commonly used nkf options.
skf 1.96 recognizes the following options. Defaults are all off unless
explicitly specified.
64
65 buffering control
66 -b use buffered output. This is default.
67
68 -u use unbuffered output. Code detection feature is disabled when
69 this option is on.
70
71 Input/Output codeset options
--ic= input_code_set
Specify input_code_set as the input codeset. Possible candidates
are shown below.

--oc= output_code_set
Specify output_code_set as the output codeset. Possible candidates
are shown below. The default codeset in the distribution package is
euc-jp, but this depends on a compile option. Default codeset is shown by
80
81 Supported codeset
skf recognizes the following codesets as input/output codesets. Codeset
names are case-insensitive, and minus ('-') and underscore ('_')
characters are ignored. Note that an iso-2022 escape-based input codeset
(registered with IANA) is recognized automatically, even when a
non-iso2022 codeset (except Unicode and B-Right/V) is specified. In the
"in" column, o means the named codeset can be specified as input and x
means it cannot. The "out" column has the same meaning for output.
89
in  out  name               description
o   o    iso8859-1          ascii + iso-8859-1 (latin-1)
o   o    iso8859-2          ascii + iso-8859-2 (latin-2)
o   o    iso8859-3          ascii + iso-8859-3 (latin-3)
o   o    iso8859-4          ascii + iso-8859-4 (latin-4)
o   o    iso8859-5          ascii + iso-8859-5 (Cyrillic)
o   o    iso8859-6          ascii + iso-8859-6 (Arabic)
o   o    iso8859-7          ascii + iso-8859-7 (Greek)
o   o    iso8859-8          ascii + iso-8859-8 (Hebrew)
o   o    iso8859-9          ascii + iso-8859-9 (latin-5)
o   o    iso8859-10         ascii + iso-8859-10 (latin-6)
o   o    iso8859-11         ascii + iso-8859-11 (Thai)
o   o    iso8859-13         ascii + iso-8859-13 (Baltic Rim)
o   o    iso8859-14         ascii + iso-8859-14 (Celtic)
o   o    iso8859-15         ascii + iso-8859-15 (Latin-9)
o   o    iso8859-16         ascii + iso-8859-16
o   o    koi-8r             koi-8r (Russian)
o   o    cp1251             Cyrillic latin MS cp1251
o   o    jis                iso-2022-jp (rfc1468 7bit JIS)
o   o    iso-2022-jp-x0213  iso-2022-jp-3 (JIS X 0213:2000)
                            a.k.a. jis-x0213
o   o    jis-x0213-strict   iso-2022-jp-3-strict
o   o    iso-2022-jp-2004   iso-2022-jp-2004 (JIS X 0213:2004)
                            a.k.a. jis-x0213-2004
o   o    oldjis             iso-2022-jp-1978 (JIS X 0208:1978)
o   o    cp50220            Microsoft codepage 50220
o   o    cp50221            Microsoft codepage 50221
o   o    cp50222            Microsoft codepage 50222
o   o    euc-jp             EUC-encoded JIS X 0208:1997
o   o    euc-x0213          EUC-encoded JIS X 0213:2000
o   o    euc-jis-2004       EUC-encoded JIS X 0213:2004
o   o    cp51932            EUC-encoded Microsoft codepage 932
o   o    euc-kr             EUC-encoded KS X 1001 Korean
o   o    euc7-kr            7bit EUC-encoded KS X 1001 Korean
o   o    uhc                Unified Hangul (Windows cp949)
o   o    johab              KS X 1001-johab Korean
o   o    euc-cn             EUC-encoded GB2312 Chinese
o   o    euc7-cn            7bit EUC-encoded GB2312 Chinese
o   o    hz                 HZ-encoded GB2312 Chinese
o   o    euc-tw             EUC-encoded CNS 11643 Chinese
o   o    gb12345            EUC-encoded GB12345 Chinese
o   o    gbk                GB2312 Extension (cp936) Chinese
o   o    gb18030            GB18030 Chinese
o   o    big5               BIG5 (with Eten extension + EURO)
o   o    cp950              BIG5 (Microsoft cp950 + EURO)
o   o    big5-hkscs         BIG5 with HKSCS
o   o    big5-2003          BIG5-2003
o   o    big5-uao           BIG5-Unicode at On
o   o    sjis               Shift-jis (Microsoft cp943)
o   o    shiftjis-x0213     Shiftjis-encoded JIS X 0213:2000
o   o    shiftjis-2004      Shiftjis-encoded JIS X 0213:2004
o   x    sjis-cellular      Shiftjis-encoded JIS X 0208:1997
                            with NTT Docomo, Vodafone(SoftBank) phone glyph
o   o    oldsjis            Shift-jis (JIS X 0208:1978)
o   o    cp932              Shift-jis-encoded MS cp932
o   o    cp932w             Shift-jis-encoded MS cp932 with
                            MS compatibility
o   o    viscii             VISCII (rfc1456) Vietnamese
o   o    viqr               VISCII (rfc1456-VIQR) Vietnamese
o   o    keis               Hitachi KEIS83/90
o   x    jef                Fujitsu JEF (basic support only)
o   x    ibm930             IBM EBCDIC DBCS Japanese
o   x    ibm931             IBM EBCDIC DBCS Japanese w.latin
o   x    ibm933             IBM EBCDIC DBCS Korean
o   x    ibm935             IBM EBCDIC DBCS Simpl. Chinese
o   x    ibm937             IBM EBCDIC DBCS Trad. Chinese
o   o    unicode            Unicode(TM) UCS-2/UTF-16LE
o   o    unicodefffe        Unicode(TM) UTF-16BE
o   o    utf7               Unicode(TM) UTF-7
o   o    utf8               Unicode(TM) UTF-8
o   x    transparent        Transparent mode (see below)
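
The name-matching rule stated above the table (case-insensitive, with '-' and '_' ignored) can be sketched as a small normalization step. This is a minimal illustration in Python; the function name normalize_codeset_name is hypothetical and is not part of skf.

```python
def normalize_codeset_name(name: str) -> str:
    """Fold case and drop '-' and '_' so equivalent spellings compare equal."""
    return name.lower().replace("-", "").replace("_", "")


# All of these spellings refer to the same table entry under that rule.
for spelling in ("euc-jp", "EUC_JP", "EucJP"):
    print(spelling, "->", normalize_codeset_name(spelling))
```

Under this rule --oc=EUC_JP and --oc=euc-jp select the same codeset.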
161
162
163 Codeset explanations
164 iso-8859-*
165 When specified as output, G0 = GL is ascii and G1 = GR is
166 iso-8859-*. 8bit encoding is used.
167
iso-2022-jp, jis
Encoding is iso-2022-jp-2 (RFC1554). G0 = GL is JIS X 0201
roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is
JIS X 0212:1990 Supplementary Kanji.

jis-x0213, iso-2022-jp-3
Encoding is iso-2022-jp-3 (JIS X 0213:2000 based). For output, G0 =
GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is
iso-8859-1 and G3 is JIS X 0213 plane 2 Kanji.
177
178 jis-x0213-strict
179 Encoding is subset of iso-2022-jp-3-strict (uses Plane 1 only).
180 For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201
181 kana, G2 is iso-8859-1 and G3 is not set. Output code using JIS
182 X 0208 whenever possible. JIS X 0213 input is automatically rec‐
183 ognized.
184
185 jis-x0213-2004, iso-2022-jp-2004
186 Encoding is iso-2022-jp-2003:2004. For output, G0 = GL is JIS X
187 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3
188 is JIS X 0213 plane2 Kanji.
189
190 oldjis
Encoding is iso-2022-jp (using old JIS X 0208:1978). G0 = GL is
192 JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1
193 and G3 is JIS X 0212 Supplementary Kanji.
194
195 euc-jp, euc
196 Encoding is 8-bit EUC using JIS X 0208:1997 character set. G0 =
197 GL is ascii, G1 = GR is JIS X 0208, G2 is JIS X 0201 kana and G3
198 is JIS X 0212 Supplementary Kanji.
199
200 euc-x0213, euc-jis-2003
201 Encoding is 8-bit EUC-based JIS X 0213:2000. G0 = GL is ascii,
202 G1 = GR is X 0213:2000 plane 1, G2 is iso-8859-1 and G3 is JIS X
203 0213:2000 plane2 Kanji.
204
205 euc-jis-2004
206 Encoding is 8-bit EUC-based JIS X0213:2004. G0 = GL is ascii,
207 G1 = GR is X0213:2004 plane 1, G2 is iso-8859-1 and G3 is JIS
208 x0213:2004 plane2 Kanji.
209
euc-kr
Encoding is 8-bit EUC using KS X 1001 Wansung character set. G0
= GL is KS X1003, G1 = GR is KS X1001, G2 and G3 are not set.

euc7-kr iso-2022-kr
Encoding is iso-2022-kr (rfc1557): 7-bit EUC using KS X 1001
Wansung character set. G0 = GL is KS X1003, G1 is KS X1001, G2
and G3 are not set.

euc-cn
Encoding is 8-bit EUC using GB 2312 simplified Chinese character
set. G0 = GL is ASCII, G1 = GR is GB2312, G2 and G3 are not set.

euc7-cn
Encoding is 7-bit EUC using GB 2312 simplified Chinese character
set. G0 = GL is ASCII, G1 is GB2312, G2 and G3 are not set.

hz
Encoding is HZ encoded (rfc1842) GB 2312 simplified Chinese
character set. G0 = GL is ASCII, G1 = GR is GB2312, G2 and G3
are not set.

euc-tw
Encoding is EUC encoded CNS11643 Plane1/2 traditional Chinese
character set. Subset of iso-2022-cn. G0 = GL is ASCII, G1 = GR
is CNS11643 plane 1, G2 is CNS11643 plane 2 and G3 is not set.

gb12345
Encoding is 8-bit EUC using GB 12345 (GBF) traditional Chinese
character set. G0 = GL is ASCII, G1 = GR is GB12345, G2 and G3
are not set.

gbk, cp936
Encoding is GBK simplified Chinese character set. G0 = GL is
ASCII and G1 = GR is GBK. G2 and G3 are not set.

gb18030 (experimental)
Encoding is GB18030 (ibm-1392, Windows cp54936) Chinese character
set. Uses ASCII as latin part.
249
250 big5
251 Encoding is Big5 traditional chinese character set with ETen
252 extension. Include Euro mapping. Uses ASCII as latin part.
253
254 cp950
255 Encoding is Microsoft cp950-Big5 traditional chinese character
256 set. Uses ASCII as latin part.
257
258 big5-hkscs (experimental)
259 Encoding is cp950-Big5 traditional chinese character set with
260 HKSCS extension. Uses ASCII as latin part.
261
262 big5-2003 (experimental)
263 Encoding is Big5-2003 Taiwanese standard traditional chinese
264 character set. Uses ASCII as latin part.
265
266 big5-uao (experimental)
267 Encoding is Big5-UAO (http://uao.cpatch.org) traditional chinese
268 character set. Uses ASCII as latin part.
269
VISCII (experimental)
Vietnamese VISCII (rfc1456) character set. Not TCVN-5712.

VIQR (experimental)
Vietnamese VISCII character set with VIQR encoding (rfc1456).
275
276 sjis
277 Encoding is Shift-encoded JIS X 0208:1997 character set. Note
278 that this is not cp932. Uses JIS X 0201 latin as latin(GL) part.
279
280 sjis-x0213, shift_jis-2000
281 Encoding is Shift-encoded JIS using JIS X 0213:2000 character
282 set.
283
sjis-x0213-2004, shift_jis-2004
Encoding is Shift-encoded JIS using JIS X 0213:2004 character
set. Ten newly defined characters are added, but the Unicode
mapping is the same as for JIS X 0213:2000. Uses JIS X 0201 latin
as latin(GL) part.
289
290 sjis-cellular (experimental)
291 Encoding is Shift-encoded JIS X 0208:1997 character set with NTT
292 Docomo/Vodafone(SoftBank) cellular phone glyph mapping.
293
cp932 cp932w
Encoding is Microsoft SJIS cp932 with NEC/IBM gaiji area, based
on Windows XP mapping. Uses ASCII as latin(GL) part. --use-compat
and --use-ms-compat are automatically enabled. cp932w provides
further WideCharToMultiByte compatibility.

cp51932
Encoding is Microsoft EUC-based cp51932 with NEC/IBM gaiji area,
based on Windows XP mapping. Uses ASCII as G0 and JIS X 0201
kana as EUC G2 part. G3 is not used for output, and is JIS X
0212:2000 for input. --use-compat and --use-ms-compat are
automatically enabled.

cp50220, cp50221, cp50222
Encoding is Microsoft JIS-based cp50220, cp50221, cp50222 with
NEC/IBM gaiji area, based on Windows XP mapping. For input, skf
accepts cp50220, 50221 and 50222. Note that this codeset is NOT
compatible with iso-2022. Uses ASCII as default character set.
--use-compat and --use-ms-compat are automatically enabled.
313
314 oldsjis
315 Encoding is Microsoft SJIS (JIS X 0208:1978 a.k.a. old JIS).
316 Uses JIS X 0201 latin as latin(GL) part.
317
318 johab
319 Encoding is KS X1001(Johab) character set. Uses KS X1003 latin
320 as latin(GL) part.
321
322 uhc
323 Encoding is UHC (cp949) character set. Uses ASCII as latin(GL)
324 part.
325
326 unicode, unicodefffe
327 Encoding is Unicode UTF-16 (v5.0). Input/Output default byte-
328 endian is little for unicode and big for unicodefffe, and input
329 byte order mark is recognized. Output includes endian mark by
330 default unless --disable-endian-mark is specified. Output range
331 is within UTF-32 with surrogate pair unless --limit-to-ucs2 is
332 specified.
Note that ucs2 is not supported within the perl/ruby extensions for
either input or output, because of a data structure limitation.
Specifying ucs2 generates an error.
336
337 utf8
338 Encoding is UTF-8 encoded Unicode (v5.0). Output doesn't include
339 byte order mark unless --enable-endian-mark is specified. Out‐
340 put range is within UTF-32 unless --limit-to-ucs2 is specified.
341 By default, CESU-8 is not accepted as input. Option
342 --enable-cesu8 enables CESU-8 input for utf-8 converter. CESU-8
343 output is not supported. For UTF-8, endian mark (BOM) is always
344 ignored.
345
346 utf7
347 Encoding is UTF-7 encoded Unicode (v5.0). Input/output range is
348 limited to UTF-16, and value above U+10000 is regarded as unde‐
349 fined. BOM is always ignored for input, and never used for out‐
350 put.
351
352 keis (experimental)
353 Encoding is Hitachi KEIS83/90. Output range is limited to EBCDIK
354 and JIS X 0208 area.
355
356 jef (experimental)
357 Encoding is Fujitsu JEF. Input only. Only basic part is sup‐
358 ported.
359
360 ibm930 (experimental)
361 Encoding is IBM DBCS Japanese with EBCDIC Kana
362
363 ibm931 (experimental)
364 Encoding is IBM DBCS Japanese with EBCDIC latin (ibm037)
365
366 ibm933 (experimental)
Encoding is IBM DBCS Korean with EBCDIC Wansung character set
368
369 ibm935 (experimental)
370 Encoding is IBM DBCS Simplified Chinese with EBCDIC Chinese
371
372 ibm937 (experimental)
373 Encoding is IBM DBCS Traditional Chinese with EBCDIC Chinese
374
375 koi8r
376 Russian KOI-8R code.
377
cp1250
Central European latin Microsoft cp1250 code

cp1251
Eastern European cyrillic Microsoft cp1251 code

transparent
Transparent mode. Various code control features, including folding
and line end code conversion, are also ignored.
387
388
389 Shortcuts
390 -n -j same as --oc=jis
391
392 -s -x same as --oc=sjis
393
394 -a -e same as --oc=euc-jp
395
396 -q same as --oc=ucs2
397
-z same as --oc=utf8
399
400 -y same as --oc=utf7
401
402 -k same as --oc=keis
403
404 -A, -E same as --ic=euc-jp. Assume input codeset is EUC-JP.
405
406 -N same as --ic=jis. Assume input codeset is iso-2022-jp.
407
408 -S, -X same as --ic=sjis. Assume input codeset is shift JIS
409
410 -Q same as --ic=ucs2.
411
412 -Y same as --ic=utf7.
413
414 -Z same as --ic=utf8.
415
416 -K same as --ic=keis.
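
The shortcut list can be read as a simple lookup table from short flags to long options. The Python sketch below transcribes a subset of the entries above to make the pairing of input and output shortcuts easier to see. The function expand_shortcuts is hypothetical, and it does not handle several flags bundled into one argument (e.g. -Es).

```python
# Subset of the shortcuts listed above; upper case selects the input
# codeset, lower case the output codeset.
SHORTCUTS = {
    "-n": "--oc=jis", "-j": "--oc=jis",
    "-s": "--oc=sjis", "-x": "--oc=sjis",
    "-a": "--oc=euc-jp", "-e": "--oc=euc-jp",
    "-q": "--oc=ucs2", "-y": "--oc=utf7", "-k": "--oc=keis",
    "-A": "--ic=euc-jp", "-E": "--ic=euc-jp",
    "-N": "--ic=jis",
    "-S": "--ic=sjis", "-X": "--ic=sjis",
    "-Q": "--ic=ucs2", "-Y": "--ic=utf7", "-Z": "--ic=utf8",
    "-K": "--ic=keis",
}


def expand_shortcuts(args):
    """Replace each known shortcut with its long form; leave the rest alone."""
    return [SHORTCUTS.get(arg, arg) for arg in args]


print(expand_shortcuts(["-E", "-s", "in.txt"]))
```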
417
418
419 ISO-2022 Specific controls
These options replace G0 through G3 with the assigned character set,
after they have been set up according to the specified input codeset.
Note that this doesn't change any codeset properties of the original
codeset, like language and encoding.
424
425 --set-g0=`charset name'
426 Predefines specified code set to plane 0 (G0). Also set to GL at
427 initial state.
428
429 --set-g1=`charset name'
430 Predefines specified code set to right plane (G1). Also set to
431 GR at initial state.
432
433 --set-g2=`charset name'
434 Predefines specified code set to right plane (G2).
435
436 --set-g3=`charset name'
437 Predefines specified code set to right plane (G3).
438
439
The supported `char_set' values are as follows. 'o' means the codeset
can be assigned to that plane; 'x' means it cannot. For Unicode family
codesets, this option is ignored. For other non-iso2022 categories, this
option is not supported, and the result is unpredictable.
444
445
g0 g1 g2 g3 codeset name  description
o  o  o  o  ascii         ANSI X3.4 ASCII
o  o  o  o  x0201         JIS X 0201 (latin part)
x  o  o  o  iso8859-1     ISO 8859-1 latin
x  o  o  o  iso8859-2     ISO 8859-2 latin
x  o  o  o  iso8859-3     ISO 8859-3 latin
x  o  o  o  iso8859-4     ISO 8859-4 latin
x  o  o  o  iso8859-5     ISO 8859-5 Cyrillic
x  o  o  o  iso8859-6     ISO 8859-6 Arabic
x  o  o  o  iso8859-7     ISO 8859-7 Greek-latin
x  o  o  o  iso8859-8     ISO 8859-8 Hebrew
x  o  o  o  iso8859-9     ISO 8859-9 latin
x  o  o  o  iso8859-10    ISO 8859-10 latin
x  o  o  o  iso8859-11    ISO 8859-11 Thai
x  o  o  o  iso8859-13    ISO 8859-13 latin
x  o  o  o  iso8859-14    ISO 8859-14 latin
x  o  o  o  iso8859-15    ISO 8859-15 latin
x  o  o  o  iso8859-16    ISO 8859-16 latin
x  o  o  o  tcvn5712      TCVN 5712 (Vietnamese)
x  o  o  o  ecma94        ECMA 94 Cyrillic (KOI-8e)
o  o  o  o  x0212         JIS X 0212:1990
o  o  o  o  x0208         JIS X 0208:1997
o  o  o  o  x0213         JIS X 0213 Plane 1:2000
o  o  o  o  x0213-2       JIS X 0213 Plane 2:2000
o  o  o  o  x0213n        JIS X 0213 Plane 1:2004
o  o  o  o  gb2312        Simplified Chinese GB2312
o  o  o  o  gb1988        Chinese GB1988 (latin)
o  o  o  o  gb12345       Traditional Chinese GB12345
o  o  o  o  ksx1003       Korean KS X 1003 (latin)
o  o  o  o  ksx1001       Korean KS X 1001
x  o  o  o  koi8-r        Cyrillic KOI-8R
x  o  o  o  koi8-u        Ukrainian Cyrillic KOI-8U
o  o  o  o  cns11643-1    Traditional Chinese CNS11643-1
x  o  o  o  viscii-r      RFC1456 VISCII (right plane)
o  o  o  o  viscii-l      RFC1456 VISCII (left plane)
x  o  o  o  cp437         Microsoft cp437 (US latin)
x  o  o  o  cp737         Microsoft cp737
x  o  o  o  cp775         Microsoft cp775
x  o  o  o  cp850         Microsoft cp850
x  o  o  o  cp852         Microsoft cp852
x  o  o  o  cp855         Microsoft cp855
x  o  o  o  cp857         Microsoft cp857
x  o  o  o  cp860         Microsoft cp860
x  o  o  o  cp861         Microsoft cp861
x  o  o  o  cp862         Microsoft cp862
x  o  o  o  cp863         Microsoft cp863
x  o  o  o  cp864         Microsoft cp864
x  o  o  o  cp865         Microsoft cp865
x  o  o  o  cp866         Microsoft cp866
x  o  o  o  cp869         Microsoft cp869
x  o  o  o  cp874         Microsoft cp874
x  o  o  o  cp932         Microsoft cp932 (Japanese)
x  o  o  o  cp1250        Microsoft cp1250 (Central Europe)
x  o  o  o  cp1251        Microsoft cp1251 (Cyrillic)
x  o  o  o  cp1252        Microsoft cp1252 (Latin-1)
x  o  o  o  cp1253        Microsoft cp1253 (Greek)
x  o  o  o  cp1254        Microsoft cp1254 (Turkish)
x  o  o  o  cp1255        Microsoft cp1255
x  o  o  o  cp1256        Microsoft cp1256
x  o  o  o  cp1257        Microsoft cp1257
x  o  o  o  cp1258        Microsoft cp1258
507
508 --euc-protect-g1
509 In EUC input mode, suppress sequences to set a charset to G1.
510 Such sequences are discarded.
511
512 --add-annon
513 Add announcer for JIS X 0208:1997 to X 0208 designate sequence.
514 This option works only with iso-2022-based output.
515
--input-detect-jis78
Distinguish the JIS X 0208:1978 codeset from the JIS X 0208:1997
codeset. By default, both charsets are regarded as X 0208:1997. This
option is valid only when the input encoding is JIS (iso-2022-jp).
520
521
522 Unicode coding specific control options
--use-compat --suppress-compat
skf substitutes characters in the Unicode compatibility planes
(U+F900 - U+FFFD) with appropriate characters in non-compatibility
planes. When substitution is enabled, these characters are converted
to variants or treated as undefined. --use-compat disables this
substitution, and --suppress-compat enables it. The default is
enabled, but several codesets disable it as a codeset feature
(i.e. they use the compatibility planes). See the codeset section.

--use-ms-compat
When output is Unicode, make the Unicode mapping Microsoft Windows
compatible. This only changes conversion for some symbols in
JIS Kanji, and adding the --use-compat option is recommended for
round-trip conversion. If you need stricter compatibility, try
cp932w as the input codeset.
538
539 --use-cde-compat
540 When output is Unicode, make translation CDE standard codeset
541 compatible.
542
543 --little-endian
544 When output is UTF-16, use little endian byte-order. This is
545 default.
546
547 --big-endian
548 When output is UTF-16, use big endian byte-order.
549
--disable-endian-mark --enable-endian-mark
When output is UTF-16 or UTF-8, omit (--disable-endian-mark) or emit
(--enable-endian-mark) the byte order mark. To make UTF-16N, use
--disable-endian-mark with --little-endian. By default, the BOM is
enabled for UTF-16 and disabled for UTF-8.
554
555 --input-little-endian
556 When input is UTF-16, assume input is little endian byte-
557 ordered. This is default, but skf respects byte-order mark.
558
559 --input-big-endian
560 When input is UTF-16, assume input is big endian byte-ordered.
561 Note that skf respects byte-order mark.
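
The input byte-order rules above (a byte-order mark wins, otherwise the assumed order applies, little endian by default) can be sketched in Python as follows. This illustrates the stated behavior and is not skf's code; the function name decode_utf16 and its parameter are assumptions of this example.

```python
import codecs


def decode_utf16(data: bytes, default_big_endian: bool = False) -> str:
    """Decode UTF-16, honoring a BOM and falling back to an assumed order."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: use the assumed order (little endian unless told otherwise).
    return data.decode("utf-16-be" if default_big_endian else "utf-16-le")


print(decode_utf16(b"\xff\xfeA\x00"))                       # BOM says little endian
print(decode_utf16(b"\xfe\xff\x00A"))                       # BOM says big endian
print(decode_utf16(b"\x00A", default_big_endian=True))      # no BOM, assumed order
```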
562
563 --endian-protect
564 Do not use endian mark in input stream. Endian mark is just dis‐
565 carded. This is off by default.
566
--limit-to-ucs2
Do not use code points at or above U+10000 in Unicode (i.e. limit
codes to the BMP area). This option doesn't limit the internal code
range in skf. This is off by default.
571
572 --disable-cjk-extension
573 Treat CJK extension A/B areas as undefined. This is off (i.e.
574 these areas are enabled) by default.
575
576 --enable-cesu8
577 Enable CESU-8 input in utf-8 codeset. Ignored for any other
578 codesets.
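
For readers unfamiliar with CESU-8: it encodes a character outside the BMP as a UTF-16 surrogate pair, with each surrogate then written as a three-byte UTF-8 sequence, which a strict UTF-8 decoder rejects. The Python sketch below shows one way to accept such input. It uses only standard codecs, is not how skf implements --enable-cesu8, and the function name decode_cesu8 is hypothetical.

```python
def decode_cesu8(data: bytes) -> str:
    """Decode CESU-8 by joining UTF-8-encoded surrogate pairs."""
    # Step 1: decode as UTF-8 but let encoded surrogates through.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16 so each surrogate pair is combined
    # into the single code point it represents.
    return with_surrogates.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


# U+1F600 as CESU-8: surrogates D83D and DE00, three bytes each.
cesu = b"\xed\xa0\xbd\xed\xb8\x80"

try:
    cesu.decode("utf-8")
except UnicodeDecodeError:
    print("strict UTF-8 rejects this input")

print(decode_cesu8(cesu) == "\U0001F600")
```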
579
--non-strict-utf8
Enable broken (decodable but not obeying specs.) utf-8 input. If
you need this option, proceed with extra care.
583
584 --enable-nfd-decomposition --disable-nfd-decomposition
585 Enable/Disable Unicode Normalized decomposition. Default is dis‐
586 abled.
587
588 --enable-nfda-decomposition --disable-nfda-decomposition
589 Enable/Disable Apple-compatible Unicode Normalized decomposi‐
590 tion. Default is disabled.
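
As an illustration of what Unicode canonical decomposition (NFD) does to Japanese text, the following Python sketch uses the standard unicodedata module. It shows standard NFD only; the Apple-compatible variant selected by --enable-nfda-decomposition differs in detail and is not reproduced here.

```python
import unicodedata

composed = "\u304c"  # HIRAGANA LETTER GA, a single code point
decomposed = unicodedata.normalize("NFD", composed)

# After NFD the voiced kana becomes a base letter plus a combining mark.
print([f"U+{ord(ch):04X}" for ch in composed])
print([f"U+{ord(ch):04X}" for ch in decomposed])
```

Both strings look the same when displayed, but they differ byte for byte, which matters when comparing file names or searching text.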
591
592
593 Codeset/Vendor Specific codeset handling flags
skf by default assumes machine-specific parts of kanji code are
Microsoft Windows compatible. Here are some options that control this
behavior. Options in this category are valid when the output codeset is
a Japanese codeset, except --disable-chart.
598
599 --use-apple-gaiji
600 Assume machine specific part in input file is Macintosh Classic
601 OS (System 7,8,9) compatible.
602
603 --disable-ibm-gaiji --disable-nec-gaiji
604 Disable IBM/NEC defined machine specific part in input file.
605
606 --disable-chart
607 Do not use Moji-keisen characters. This is for old Macintosh
608 system (System 6.x or older) compatibility.
609
610
Miscellaneous codeset related options
612 --old-nec-compat
613 Enable old NEC kanji sequence (ESC-K,H). Needs compile option
614 --enable-oldnec at configuration.
615
616 --no-utf7
617 Assume input codeset is *NOT* UTF-7 encoded Unicode. This
618 option disables input utf7 testing.
619
620 --no-kana
621 Assume input codeset does *NOT* include JIS X 0201 kana.
622
623
624 OUTPUT Conversions options
skf is intended to write its output stream to stdout, but an
nkf-compatible option for changing file encoding is also provided.

--overwrite --in-place
Converts the encoding of the file(s) specified as input. --overwrite
preserves the file change date.

skf has various features for adapting output files to the local
environment. Most of these are controlled by the extended control
switches described in this section.
635
636 --use-g0-ascii
637 set G0(=GL) for output encoding to ASCII, ignoring codeset des‐
638 ignation.
639
640 X-0201 Kana/latin conversions
641 skf by default converts X-0201 kanas to X-0208 kanas. To output X-0201
642 kana as it is, use one of following options. When output is designated
643 to EUC or SJIS, these three options enable X-0201 kana output by ways
644 provided by each encoding. When Unicode output is specified, (equiv.)
645 kana part output is controlled by --use-compat, not following switches.
646 Valid only when output codeset is NOT Unicode family.
647
648 --kana-jis7
649 use SI/SO locking shift sequence to designate X-0201 kana. This
650 switch is valid for jis, jis-x0213 and cp50220 (i.e. cp50221)
651 encoding. For other codesets, this option is ignored.
652
653 --kana-jis8
654 output X-0201 kana using 8-bit code right plane. This switch is
655 valid for jis and jis-x0213 encoding. For other codeset, this
656 option is ignored.
657
658 --kana-esci --kana-call
659 use ESC-(-I to designate X-0201 kana. This switch is valid for
660 jis, jis-x0213 and cp50220 (i.e. cp50222) encoding. For other
661 codeset, this option is ignored.
662
663 --kana-enable
664 If output is EUC-JP or cp51932, use X-0201 kana with G2. If
665 SJIS output, it is same as --kana-jis8. When JIS output, it is
666 same as --kana-call.
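
To make the X-0201 kana handling concrete, the Python sketch below shows the two outcomes discussed in this section: widening half-width (X-0201) kana to full-width (X-0208) kana, and keeping them as X-0201 kana in EUC-JP and Shift-JIS. Unicode NFKC normalization is used here only as a stand-in for skf's default conversion, and it also changes other compatibility characters, so this is an approximation and not skf's own mapping.

```python
import unicodedata


def widen_kana(text: str) -> str:
    """Approximate the default X-0201 to X-0208 kana conversion with NFKC."""
    return unicodedata.normalize("NFKC", text)


halfwidth = "\uff71"          # HALFWIDTH KATAKANA LETTER A
print(widen_kana(halfwidth))  # full-width katakana A

# Keeping the kana as X-0201: EUC-JP reaches it through G2 with a
# single-shift byte (0x8E), Shift-JIS uses the 8-bit code directly.
print(halfwidth.encode("euc_jp"))
print(halfwidth.encode("shift_jis"))
```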
667
668 --use-iso8859-1
669 Enable iso-8859-1 output. Iso-8859-1 is invoked to G1 and set to
670 GR plane.
671
672
673 JIS X 0212(Supplement Kanji code) Support
674 --x0212-enable
675 skf by default does not output JIS X 0212 code. This option
676 enables use of JIS X 0212 part. Output code set may be neither
677 Microsoft code nor KEIS. For Unicode variant encodings, this
678 option is ignored. Note that this option is supported for back‐
679 ward compatibility. May not be supported in future versions.
680
681
682 URI/TeX format conversion feature options
With Unicode(tm) family output codings, skf outputs the non-ascii latin
character part as it is, but with other output codings, skf converts
these characters using the following rules:

(1) If the code is defined in the specified output codeset, that code
point is used for output.
(2) If one of the html convert modes is enabled (i.e. --convert-html
--convert-sgml) and the code is defined in the html/sgml codeset,
it is converted to an entity reference or a codepoint reference.
(3) If tex convert mode is enabled and the code is defined as a tex
expression, it is converted to tex format.
(4) If the code is a kind of combined ligature, it is shown as a set
of characters.
(5) Otherwise, a kind of replacement character is shown, with a warning.
697
698 --convert-html --convert-sgml
699 Enable html convert mode. This mode is cleared by --reset. These
700 two options are synonyms, and are treated as same option.
701
702 --convert-html-decimal
703 Enable html code-point decimal convert mode. This mode is
704 cleared by --reset.
705
706 --convert-html-hexadecimal
707 Enable html code-point hexadecimal convert mode. This mode is
708 cleared by --reset.
709
710 --convert-tex
711 Enable TeX convert mode. This mode is cleared by --reset.
712
--use-replace-char
In Unicode, use the Unicode replacement character (U+fffc) for
undefined characters.
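
The decimal and hexadecimal code-point modes above can be pictured with the following Python sketch: characters the output codeset can represent pass through unchanged, and the rest become numeric character references. This covers only rule (1) and the code-point part of rule (2); the function name to_html_refs is hypothetical and the code is not taken from skf.

```python
def to_html_refs(text: str, codec: str, hexadecimal: bool = False) -> str:
    """Replace characters missing from codec with HTML numeric references."""
    out = []
    for ch in text:
        try:
            ch.encode(codec)          # rule (1): representable, keep as is
            out.append(ch)
        except UnicodeEncodeError:    # otherwise emit a code-point reference
            out.append(f"&#x{ord(ch):X};" if hexadecimal else f"&#{ord(ch)};")
    return "".join(out)


print(to_html_refs("\u00e9\u6f22", "iso8859-1"))
print(to_html_refs("\u00e9\u6f22", "iso8859-1", hexadecimal=True))
```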
716
717
718 Encoding/Decoding control options
--decode=`encoding scheme'
--encode=`encoding scheme'
Specify a decoding/encoding scheme for the input stream. Supported
encoding schemes for decoding are
722 `hex', 'mime', 'mime_q', 'mime_b', 'uri', 'ace',
723 'hex_perc_encode', CAP hex-code, mime, mime Q-encoding, mime B-
724 encoding, uri character reference, ACE punycode, uri percent
725 notation, base64, Q-encoding, rfc2231 and rot13/47 respectively.
726 For encoding, 'hex', 'mime_b', 'mime_q', 'uri', 'ace', 'cap',
727 and some already ascii-encoded codeset (e.g. UTF-7) output with
728 encoding is not supported.
Only one decode/encode option is valid, and if more than one
option is specified, the last one is used. When one of the mime
decodings is specified, the base text is assumed to be EUC encoded
unless specified otherwise. Except for rot, which assumes the input
stream is Shift_JIS, EUC or iso-2022-jp, these encodings assume the
input stream is ascii (as defined in RFC2045). Some encodings
may co-exist with encoding, but this is not guaranteed. In
particular, if the input is UTF-16/UCS2 code, these encodings are
ignored in skf.
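
For orientation, the Python sketch below shows what three of the schemes named above look like and what they decode to, using only the Python standard library. It is meant to clarify the input formats, not to reproduce skf's decoder; in particular, the sample strings carry their text as UTF-8, whereas skf assumes EUC for mime unless told otherwise.

```python
from email.header import decode_header, make_header
from urllib.parse import unquote

# MIME B-encoding (an encoded-word as found in mail headers).
mime_b = str(make_header(decode_header("=?UTF-8?B?5ryi5a2X?=")))

# URI percent notation.
uri = unquote("%E6%BC%A2%E5%AD%97")

# ACE punycode (the part after the "xn--" prefix of a host name label).
ace = b"bcher-kva".decode("punycode")

print(mime_b, uri, ace)
```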
738
739 --mime-ms-compat
740 treat japanese generic codesets as Microsoft cp932 compatible.
741 More specifically, with this option skf treats iso-2022-jp as
742 cp50220, euc-jp as cp51932 and Shift_JIS as cp932w.
743
744
745 End of line control options
746 --lineend-thru
747 Output end-of-line code as it is. Also output ^Z code as it is.
748 This is default.
749
750 --lineend-cr --lineend-mac
751 Use CR as end-of-line code. Also delete ^Z code from input
752 stream.
753
754 --lineend-lf --lineend-unix
755 Use LF as end-of-line code. Also delete ^Z code from input
756 stream.
757
758 --lineend-crlf --lineend-windows
759 Use CR+LF as end-of-line code. Also delete ^Z code from input
760 stream. This option doesn't preserve original order of cr and
761 lf.
762
763 --input-cr
764 Assume input stream uses CR as end-of-line code.
765
766 --input-lf
767 Assume input stream uses LF as end-of-line code.
768
769 --input-crlf
770 Assume input stream uses CR+LF as end-of-line code.
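
The effect of the --lineend-* options can be sketched in Python as below: every end-of-line form is rewritten to the requested one and ^Z (0x1A) is dropped, as the option descriptions state. The function name convert_line_ends and the handling of a lone CR or LF inside CR+LF input are assumptions of this example, not a description of skf's internals.

```python
import re

EOL = {"cr": b"\r", "lf": b"\n", "crlf": b"\r\n"}


def convert_line_ends(data: bytes, target: str) -> bytes:
    """Rewrite CR, LF and CR+LF to the target form and delete ^Z bytes."""
    data = data.replace(b"\x1a", b"")
    # Match CR+LF first so it counts as one line end, not two.
    return re.sub(rb"\r\n|\r|\n", lambda m: EOL[target], data)


mixed = b"one\r\ntwo\rthree\nfour\x1a"
print(convert_line_ends(mixed, "lf"))
print(convert_line_ends(mixed, "crlf"))
```

A replacement function is used in re.sub so that the CR and LF bytes in the target are inserted literally.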
771
772 -F[line_length[-kinsoku]]
773
-f[line_length[-kinsoku]] -f[line_length[+kinsoku]]
Wrap input lines at line_length columns. The f option deletes
CR/LFs in the input, and the F option doesn't delete them. Following
Japanese convention, both gyoutou-kinsoku (by burasage-gumi) and
gyoumatsu-kinsoku (by oidasi-gumi) are supported. The burasage
length is controlled by the kinsoku option. The default value for
line_length is 66, and it must be < 1000. The default value for kinsoku
is 5, and it must be <= 10. With the 'f' option, skf autodetects
paragraphs and retains some CR/LFs. The second 'f' option format (with
'+') disables this behaviour. In nkf compatible mode, some fold
behaviors change as follows.
(1) The default line_length is set to 60, and the kinsoku value is 10.
(2) Alphanumeric characters become gyoutou-kinsoku characters.
787
788 File control options
789 --filewise-detect --force-reset
790 Reset and re-detect input code set at the start of each file.
791
792 --linewise-detect
793 Reset and re-detect input code set at the start of each line.
794 This option needs -DKUNIMOTO at compile time.
795
796
797 Compatibility options
--nkf-compat
Interpret the options that follow in an nkf-compatible manner. -l, -d,
-c, -x, -m, -w and -W work as in nkf2.0. -f and -F behavior
changes as shown above, and --disable-space-convert is also
enabled. Most other nkf options and switches also work,
except for behavior on errors.

--skf-compat
Interpret the options that follow in the skf-native manner.
807
808
809 Misc. Control options
810 --disable-space-convert --enable-space-convert
811 skf converts an ideographic space into two ascii spaces. Dis‐
812 able option disables, and enable option enables this behavior.
813 Default is enabled.
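
The space conversion described above amounts to the following, shown as a minimal Python sketch. The function name and its enabled parameter are hypothetical stand-ins for the two options.

```python
IDEOGRAPHIC_SPACE = "\u3000"


def convert_ideographic_space(text: str, enabled: bool = True) -> str:
    """Turn each ideographic space into two ASCII spaces when enabled."""
    if not enabled:
        return text
    return text.replace(IDEOGRAPHIC_SPACE, "  ")


sample = "\u6f22\u5b57\u3000\u30c6\u30b9\u30c8"
print(repr(convert_ideographic_space(sample)))
print(repr(convert_ideographic_space(sample, enabled=False)))
```

Two ASCII spaces are used because an ideographic space occupies two columns on a fixed-width display.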
814
815 --html-sanitize
816 Convert several characters in HTML document to entity reference
817 expression. Specifically, "!#$&%()/<>:;?´ are escaped by entity-
818 references.
819
820 --filewise-detect --force-reset
821 If multiple input files are given, detect input codeset for each
822 file.
823
824 --linewise-detect
825 Detect input code line-wise. Note this option weakens code
826 detect correctness.
827
828 --reset
829 Reset all flags specified by extended controls and given input
830 code.
831
832 --inquiry --guess
833 skf detects the input code and writes the result to stdout. No
834 filtered output is produced. If multiple input files are given,
835 --show-filename is automatically enabled.
836
837 --hard-inquiry
838 Similar to --inquiry, but reports both the code and the
839 end-of-line character.
840
841 --suppress-filename
842 When inquiry(--inquiry) is on, this option disables file name
843 output. This option overrides --show-filename.
844
845 --show-filename
846 When inquiry(--inquiry) is on, this option adds each file name
847 to output.
848
849 --invis-strip
850 Delete all escape sequences not belonging to ISO-2022 code
851 extension. This is intended to replace the invisstrip command
852 bundled in the inews package.
853
854 -I Warn if input has unassigned code points.
855
856 -v print version information and exit.
857
858 -h --help
859 print brief help and exit.
860
861 --show-supported-codeset
862 Display supported codesets (input) and exit. Both canonical
863 names (left side) and detailed names are shown. This canonical
864 name can be used as MIME charset and also as ic-option code
865 specification.
866
867 --show-supported-charset
868 Display supported character sets (output) and exit. Both
869 canonical names and detailed names are shown. Some charsets with
870 special treatments (i.e. meaningless as set-g* parameters)
871 intentionally lack addressable cnames.
872
873 -%[debug_level]
874 Enable skf debugging. Debug level is one digit. 0 is the least
875 verbose, and with -%9 you'll get whole traces within skf. This
876 option needs configure option --enable-debug.
877
878
880 /usr/(local/)share/skf/lib/ (Unices)
881
882 /Program Files/skf/share/lib (MS Windows)
883 These directories are where external codeset conversion tables
884 go. The location that the current skf assumes is shown by the -h
885 option.
886
887
889 skf is written by Seiji Kaneko (efialtes@sourceforge.jp) based on an idea
890 from nkf, written by Itaru Ichikawa (ichikawa@flab.fujitsu.co.jp). The X 0213
891 code table is derived from the work of earthian@tama.or.jp. Some codeset
892 mappings are derived from various sources. Detailed origins are shown in
893 the copyright document included in this distribution.
894
895
897 skf is inspired by works or requests by shinoda@cs.titech,
898 kato@cs.titech, uematsu@cs.titech, void@global, ohta@ricoh, Hinata(HKE),
899 Ashizawa(CRL), Kunimoto(SDL), Oohara(Univ of Kyoto), Jokagi(elf2000) and
900 Naruse (at sourceforge.jp). Thanks.
901
902
904 1. skf can handle mixed coding with some limitations. However, code
905 detection tends to fail for mixed code, and specifying the input codeset
906 explicitly is strongly encouraged if the codeset is known beforehand.
907 If needed, the --linewise-detect option may help, but code detection
908 becomes more likely to fail.
909
910 2. When using UCS2, UTF-16, UTF-8 and UTF-7, skf tries to detect the input
911 code, but specifying the codeset explicitly is encouraged. skf doesn't
912 support UCS4, but does support the UTF-32 area via UTF-16 (i.e. surrogate
913 pairs) and UTF-8. skf just passes composite characters to output. No
914 further normalization is performed.
915
916 3. skf implements ISO-2022 with the following exceptions.
917 i) GL 0x20 is always space, even when a 96-character codeset is
918 invoked to GL.
919 ii) Sequences for setting codes to C1 and C2 are always ignored.
920 iii) If an unknown sequence is given to G0, G0 is set to ascii, and
921 locking/single shift is cleared. An unknown sequence call to set G1-G3
922 is just ignored.
923 Private charsets are also not supported and are ignored.
924 iv) Sequences for 96-character multibyte coding are ignored (currently,
925 no codeset is registered).
926 v) Calling the UTF-8 and UTF-16 coding systems from iso-2022 is supported,
927 and the standard return sequence returns to the previous coding system.
928 Calls and returns to/from other coding schemes are ignored.
929 vi) For supporting some of cellular phone glyphs, several private (not
930 registered) codesets are defined in skf, and can be called by appropri‐
931 ate sequences.
932
933 4. Since skf by default tests input stream to detect utf7 coding, skf
934 sometimes misdetects pure ascii text as utf7. If this occurs, use
935 --no-utf7 option.
936
937 5. Error output coding is controlled by LOCALE environment variables on
938 UN*X systems. skf does not handle the situation where stdout and stderr
939 are redirected into the same stream. Such cases should be handled by the
940 user.
941
942 6. skf-1.9x converts KEIS/JIS X 0213 code using CJK-extension B and the CJK
943 compatibility area. For this reason, X 0213 and KEIS conversion results
944 vary depending on the --use-compat and --limit-to-ucs2 switches.
945
946 7. JIS X 0207:1979 is not supported. JIS X 0211:1987 is designed to be
947 supported (i.e. common terminal control sequence will be transparently
948 passed to output).
949
950 8. Even if the unbuffer option (-u) is specified, some buffering related
951 to code translation is still performed (in MIME, kana, VIQR etc.).
952
953 9. skf-1.9x recognizes and handles languages in iso639-1(alpha 2).
954 iso639-2 is not supported as a valid language set.
955
956 10. UCS-2(UTF-16) is not supported within the perl/ruby extensions for
957 either input or output, because of a data structure limitation. Specifying
958 ucs2 will generate an error. This is a limitation of SWIG and the languages
959 themselves, rather than a limitation of skf. Use UTF-8 for these LWL.
960
961 11. skf-1.9x does not retain Macintosh RLO-ordered character property.
962 Codesets with this kind of codes are not supported.
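
Items 2 and 4 above can be illustrated with Python's standard codecs (not with skf itself). The sketch below first shows a character outside the 16-bit range encoded in UTF-16 as a surrogate pair. It then shows an ASCII-only byte string that is also well-formed UTF-7, which is the kind of ambiguity that can lead a detector to report plain ASCII as utf7. The sample strings are arbitrary choices made for this illustration.

```python
# U+20000 lies in CJK Extension B, outside the 16-bit (UCS2) range.
ch = "\U00020000"
utf16 = ch.encode("utf-16-be")

# Split the four bytes into the two 16-bit surrogate code units.
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")
print(f"{high:04X} {low:04X}")

# These bytes are pure ASCII, yet "+AGE-" is also a valid UTF-7
# shifted sequence, so the two decodings give different text.
ascii_bytes = b"x +AGE- y"
as_ascii = ascii_bytes.decode("ascii")
as_utf7 = ascii_bytes.decode("utf-7")
print(as_ascii)
print(as_utf7)
```

When the input is known to be plain ASCII, the --no-utf7 option mentioned in item 4 avoids this ambiguity.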
963
964
966 1. Extended options have changed extensively since skf-1.9. Some archaic
967 options (e.g. -B, -@ and -r) have been deleted from this version.
968
969 2. skf was originally forked from nkf, but doesn't contain nkf
970 code. The copyright notice is retained as a courtesy.
971
972 3. From version 1.9, default Japanese character set assumed by skf has
973 changed to JIS X 0208:1990 with Microsoft Japanese Windows gaiji (i.e.
974 CP932).
975
976 4. Code autodetection is not perfect by design. If skf fails to
977 detect the input code properly, please specify the input code
978 explicitly.
979
980 5. Some ligatures in Unicode, cp932 gaiji and KEIS83 are converted
981 using JIS X 0124 and other conventions. During this conversion, the
982 byte length is not preserved.
983
984 6. skf is intended to pass ANSI compatible terminal control codes
985 transparently, but this is not guaranteed.
986
987 7. nkf's -i and -o options work only in nkf-compat mode. They are
988 obsolete in 1.97, and are valid only for iso-2022-jp, without
989 regard to output codeset specifications.
990
991 8. For unconverted characters, skf uses geta as the undefined character,
992 as with the --use-replace-char option. If the output codeset doesn't
993 contain a geta code, skf prefers the 'black square' character, and then '.'.
994
995 9. There are some undocumented options. These options should be consid‐
996 ered as highly experimental.
997
998 10. In lineend_thru mode with folding, skf remembers the order in which
999 cr and lf appear in the stream, and uses that order. Because of this, if
1000 skf needs to output a line-end character before any line-end character
1001 appears in the input stream, the input order may not be preserved.
1002
1003 11. NKF-compatibility
1004 1) -B*, and --prefix, some --fb's and --no-cp932ext/best-fit-chars are
1005 not supported.
1006 2) rot encoding is not supported. rot decoding can't be used with other
1007 decoding.
1008 3) MSDOS (and -T) are not supported.
1009 4) MIME decoding/encoding error handling behavior differs in various
1010 ways.
1011 5) LF/CR behaves differently. Results may not be the same for some messy
1012 text.
1013
1014
1016 Unicode(TM) is a trademark of Unicode, Inc. Microsoft and Windows are
1017 registered trademarks of Microsoft Corporation. Macintosh is a
1018 registered trademark of Apple Computer Inc. Vodafone is a trademark of
1019 Vodafone K.K. Other names and terms may be trademarks or registered
1020 trademarks of their respective owners. The trademark symbol (TM) may be
1021 omitted in this manual page.
1022
1023
1024
1025
1026 25/JAN/2008 SKF(1)