SKF(1)                      General Commands Manual                     SKF(1)

NAME

       skf - simple Kanji Filter (v1.97)

SYNOPSIS

       skf [-AEIJKNQRSXZabehjknqrsuvxz] [ long_format_options ] [infiles..]

DESCRIPTION

       skf is yet another i18n-capable kanji filter, designed for reading
       various CJK-coded files on the Net.  skf converts input kanji texts or
       streams into a character stream in the designated codeset and writes
       it to standard output.  Specifically, skf is designed to be a
       versatile filter for reading documents in various code sets, and does
       not provide features unrelated to code conversion.

       Like nkf, skf automatically recognizes the input file code when it is
       a kind of ISO-2022 compliant code, and also detects EUC-variant codes
       if the input file is Japanese text without X 0201 kana.  skf 1.9x can
       read various iso-2022 compliant character sets, including JIS Kanji
       codes (X 0208, X 0212 and X 0213), EUC encodings (euc-jp (with X 0213
       support), euc-cn, euc-kr and euc-tw), ISO European latins (ISO-8859-1
       to 11, 13/14/15/16) and many regional character sets.  skf can also
       read some non-iso2022 compliant sets, including Microsoft Shift-JIS
       code, KOI-8-R/U, GB2312 (HZ), big5, VISCII (rfc1456, including VIQR),
       the Unicode standard (UCS2/UTF-16, UTF7 and UTF8), some MS codesets
       (cp1250 etc.) and some other vendor-specific codes (KEIS83, JEF etc.).

       Supported output character sets of skf are more limited, but still
       include X 0208/X 0212/X 0213 JIS, X 0201 JIS, ASCII, Microsoft
       Shift-JIS, EUC-jp/-kr/-cn, HZ, iso-2022-jp/kr, big5, VISCII and
       Unicode.

       skf also provides basic decoding features for some common encodings,
       including MIME, Punycode and URI codepoint.  A Unicode decomposition
       feature is also supported since 1.96.

       As noted above, skf is designed to convert input text into a
       human-readable form under a local environment (i.e. codeset), and has
       several extra conversion features such as GNU recode-type folding.
       Such conversions include Windows/Macintosh specific code swaps and
       old-new JIS glyph changes, html-format/TeX format conversion and
       variant unifications.

       skf can also be compiled as an extension for some lightweight
       languages.  See README.txt for details.

       If one or more file names are given, skf reads the files and writes
       the converted stream to stdout.  If no file names are given, input is
       taken from stdin and output still goes to stdout.  Options are taken
       from the environment variables SKFENV and skfenv and from the command
       line, in this order.  Environment variables are not used when skf is
       running as a privileged user.  skf does not use LOCALE-related
       environment variables for conversions, but error message output is
       controlled by the given LOCALES.
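
       The option order described above can be sketched as follows.  This is
       only an illustration, not skf's actual code: the function name
       collect_options, the privileged flag and the use of shell-style
       splitting for the environment values are all assumptions.

```python
import shlex


def collect_options(environ, argv, privileged=False):
    """Return option tokens in the order: SKFENV, skfenv, command line.

    `environ` is a mapping such as os.environ and `argv` is the list of
    command-line arguments.  Splitting the environment values with shlex
    is an assumption made for this sketch.
    """
    tokens = []
    if not privileged:
        # Environment variables are skipped entirely for privileged users.
        for name in ("SKFENV", "skfenv"):
            tokens.extend(shlex.split(environ.get(name, "")))
    tokens.extend(argv)
    return tokens
```

       For example, collect_options({"SKFENV": "--oc=utf8"}, ["--ic=sjis"])
       yields the SKFENV token first and the command-line token last.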

OPTIONS

       skf-1.9 is written from scratch and inherits no code from nkf.
       However, skf is intended to be a drop-in replacement for nkf (v1.4)
       and has a similar set of commonly used nkf options.
       skf 1.96 recognizes the following options.  Defaults are all off if
       not explicitly specified.

   buffering control
       -b     use buffered output.  This is the default.

       -u     use unbuffered output.  The code detection feature is disabled
              when this option is on.

   Input/Output codeset options
       --ic=  input_code_set
              specify input_code_set as the input codeset.  Possible
              candidates are shown below.

       --oc=  output_code_set
              specify output_code_set as the output codeset.  Possible
              candidates are shown below.  The default codeset in the
              distribution package is euc-jp, but it depends on compile
              options.  Default codeset is shown by

     Supported codeset
       skf recognizes the following codesets as input/output codesets.  These
       codeset names are case insensitive, and minus ('-') and underscore
       ('_') are ignored.  Note that an iso-2022 escape-based input codeset
       (registered to IANA) is recognized automatically, even when a
       non-iso2022 codeset (except Unicode and B-Right/V) is specified.  In
       the in column, o means the named codeset can be specified as input and
       x means it cannot.  The out column is the same, but for output.

       in out  name            description
       o  o    iso8859-1       ascii + iso-8859-1 (latin-1)
       o  o    iso8859-2       ascii + iso-8859-2 (latin-2)
       o  o    iso8859-3       ascii + iso-8859-3 (latin-3)
       o  o    iso8859-4       ascii + iso-8859-4 (latin-4)
       o  o    iso8859-5       ascii + iso-8859-5 (Cyrillic)
       o  o    iso8859-6       ascii + iso-8859-6 (Arabic)
       o  o    iso8859-7       ascii + iso-8859-7 (Greek)
       o  o    iso8859-8       ascii + iso-8859-8 (Hebrew)
       o  o    iso8859-9       ascii + iso-8859-9 (latin-5)
       o  o    iso8859-10      ascii + iso-8859-10 (latin-6)
       o  o    iso8859-11      ascii + iso-8859-11 (Thai)
       o  o    iso8859-13      ascii + iso-8859-13 (Baltic Rim)
       o  o    iso8859-14      ascii + iso-8859-14 (Celtic)
       o  o    iso8859-15      ascii + iso-8859-15 (Latin-9)
       o  o    iso8859-16      ascii + iso-8859-16
       o  o    koi-8r          koi-8r (Russian)
       o  o    cp1251          Cyrillic latin MS cp1251
       o  o    jis             iso-2022-jp (rfc1468 7bit JIS)
       o  o    iso-2022-jp-x0213 iso-2022-jp-3 (JIS X 0213:2000)
                               a.k.a. jis-x0213
       o  o    jis-x0213-strict iso-2022-jp-3-strict
       o  o    iso-2022-jp-2004 iso-2022-jp-2004 (JIS X 0213:2004)
                               a.k.a. jis-x0213-2004
       o  o    oldjis          iso-2022-jp-1978 (JIS X 0208:1978)
       o  o    cp50220         Microsoft codepage 50220
       o  o    cp50221         Microsoft codepage 50221
       o  o    cp50222         Microsoft codepage 50222
       o  o    euc-jp          EUC-encoded JIS X 0208:1997
       o  o    euc-x0213       EUC-encoded JIS X 0213:2000
       o  o    euc-jis-2004    EUC-encoded JIS X 0213:2004
       o  o    cp51932         EUC-encoded Microsoft codepage 932
       o  o    euc-kr          EUC-encoded KS X 1001 Korean
       o  o    euc7-kr         7bit EUC-encoded KS X 1001 Korean
       o  o    uhc             Unified Hangul (Windows cp949)
       o  o    johab           KS X 1001-johab Korean
       o  o    euc-cn          EUC-encoded GB2312 Chinese
       o  o    euc7-cn         7bit EUC-encoded GB2312 Chinese
       o  o    hz              HZ-encoded GB2312 Chinese
       o  o    euc-tw          EUC-encoded CNS 11643 Chinese
       o  o    gb12345         EUC-encoded GB12345 Chinese
       o  o    gbk             GB2312 Extension (cp936) Chinese
       o  o    gb18030         GB18030 Chinese
       o  o    big5            BIG5 (with Eten extension + EURO)
       o  o    cp950           BIG5 (Microsoft cp950 + EURO)
       o  o    big5-hkscs      BIG5 with HKSCS
       o  o    big5-2003       BIG5-2003
       o  o    big5-uao        BIG5-Unicode at On
       o  o    sjis            Shift-jis (Microsoft cp943)
       o  o    shiftjis-x0213  Shiftjis-encoded JIS X 0213:2000
       o  o    shiftjis-2004   Shiftjis-encoded JIS X 0213:2004
       o  x    sjis-cellular   Shiftjis-encoded JIS X 0208:1997
                        with NTT Docomo, Vodafone(SoftBank) phone glyph
       o  o    oldsjis         Shift-jis (JIS X 0208:1978)
       o  o    cp932           Shift-jis-encoded MS cp932
       o  o    cp932w          Shift-jis-encoded MS cp932 with
                               MS compatibility
       o  o    viscii          VISCII (rfc1456) Vietnamese
       o  o    viqr            VISCII (rfc1456-VIQR) Vietnamese
       o  o    keis            Hitachi KEIS83/90
       o  x    jef             Fujitsu JEF (basic support only)
       o  x    ibm930          IBM EBCDIC DBCS Japanese
       o  x    ibm931          IBM EBCDIC DBCS Japanese w.latin
       o  x    ibm933          IBM EBCDIC DBCS Korean
       o  x    ibm935          IBM EBCDIC DBCS Simpl. Chinese
       o  x    ibm937          IBM EBCDIC DBCS Trad. Chinese
       o  o    unicode         Unicode(TM) UCS-2/UTF-16LE
       o  o    unicodefffe     Unicode(TM) UTF-16BE
       o  o    utf7            Unicode(TM) UTF-7
       o  o    utf8            Unicode(TM) UTF-8
       o  x    transparent     Transparent mode (see below)
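
       The name matching rule for this table (case insensitive, with '-' and
       '_' ignored) can be illustrated with a small sketch.  The helper name
       normalize_codeset_name is hypothetical, and skf's own lookup may
       differ in details such as alias handling.

```python
def normalize_codeset_name(name):
    """Fold a codeset name the way the rule above describes.

    Case is ignored and '-' and '_' are dropped, so "EUC-JP", "euc_jp"
    and "eucjp" all compare equal after folding.
    """
    return name.replace("-", "").replace("_", "").lower()


def same_codeset_name(a, b):
    """Return True when two spellings name the same table entry."""
    return normalize_codeset_name(a) == normalize_codeset_name(b)
```

       Under this rule, --oc=EUC_JP and --oc=euc-jp select the same entry.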

     Codeset explanations
       iso-8859-*
              When specified as output, G0 = GL is ascii and G1 = GR is
              iso-8859-*.  8-bit encoding is used.

       iso-2022-jp, jis
              Encoding is iso-2022-jp-2 (RFC1554).  G0 = GL is JIS X 0201
              roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is
              JIS X 0212:1990 Supplementary Kanji.

       jis-x0213, iso-2022-jp-3
              Encoding is iso-2022-jp-3 (JIS X 0213:2000 based).  For output,
              G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is
              iso-8859-1 and G3 is JIS X 0213 plane 2 Kanji.

       jis-x0213-strict
              Encoding is a subset of iso-2022-jp-3-strict (uses Plane 1
              only).  For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS
              X 0201 kana, G2 is iso-8859-1 and G3 is not set.  Output uses
              JIS X 0208 codes whenever possible.  JIS X 0213 input is
              automatically recognized.

       jis-x0213-2004, iso-2022-jp-2004
              Encoding is iso-2022-jp-2004 (JIS X 0213:2004 based).  For
              output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201
              kana, G2 is iso-8859-1 and G3 is JIS X 0213 plane 2 Kanji.

       oldjis
              Encoding is iso-2022-jp using the old JIS X 0208:1978.  G0 = GL
              is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is
              iso-8859-1 and G3 is JIS X 0212 Supplementary Kanji.

       euc-jp, euc
              Encoding is 8-bit EUC using the JIS X 0208:1997 character set.
              G0 = GL is ascii, G1 = GR is JIS X 0208, G2 is JIS X 0201 kana
              and G3 is JIS X 0212 Supplementary Kanji.
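
       The G0 to G3 assignment of euc-jp is visible in the bytes themselves.
       The sketch below uses Python's built-in euc_jp codec as a stand-in,
       not skf, so individual mappings may differ from skf's tables; the
       function name euc_jp_set is hypothetical.

```python
def euc_jp_set(ch):
    """Report which EUC-JP code set (G0..G3) encodes a single character.

    G0 (ASCII) is a plain 7-bit byte, G1 (JIS X 0208) is two bytes in the
    0xA1..0xFE range, G2 (JIS X 0201 kana) is prefixed by SS2 (0x8E) and
    G3 (JIS X 0212) is prefixed by SS3 (0x8F).
    """
    data = ch.encode("euc_jp")
    lead = data[0]
    if lead == 0x8E:
        return "G2"
    if lead == 0x8F:
        return "G3"
    if lead >= 0xA1:
        return "G1"
    return "G0"
```

       For instance, the half-width kana U+FF71 encodes with the SS2 prefix,
       while the kanji U+6F22 encodes as two GR bytes.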

       euc-x0213, euc-jis-2003
              Encoding is 8-bit EUC-based JIS X 0213:2000.  G0 = GL is ascii,
              G1 = GR is X 0213:2000 plane 1, G2 is iso-8859-1 and G3 is JIS
              X 0213:2000 plane 2 Kanji.

       euc-jis-2004
              Encoding is 8-bit EUC-based JIS X 0213:2004.  G0 = GL is ascii,
              G1 = GR is X 0213:2004 plane 1, G2 is iso-8859-1 and G3 is JIS
              X 0213:2004 plane 2 Kanji.

       euc-kr
              Encoding is 8-bit EUC using the KS X 1001 Wansung character
              set.  G0 = GL is KS X 1003, G1 = GR is KS X 1001, G2 and G3 are
              not set.

       euc7-kr, iso-2022-kr
              Encoding is iso-2022-kr (rfc1557): 7-bit EUC using the KS X
              1001 Wansung character set.  G0 = GL is KS X 1003, G1 is KS X
              1001, G2 and G3 are not set.

       euc-cn
              Encoding is 8-bit EUC using the GB 2312 simplified Chinese
              character set.  G0 = GL is ASCII, G1 = GR is GB2312, G2 and G3
              are not set.

       euc7-cn
              Encoding is 7-bit EUC using the GB 2312 simplified Chinese
              character set.  G0 = GL is ASCII, G1 is GB2312, G2 and G3 are
              not set.

       hz
              Encoding is the HZ-encoded (rfc1842) GB 2312 simplified Chinese
              character set.  G0 = GL is ASCII, G1 = GR is GB2312, G2 and G3
              are not set.

       euc-tw
              Encoding is the EUC-encoded CNS11643 Plane 1/2 traditional
              Chinese character set.  Subset of iso-2022-cn.  G0 = GL is
              ASCII, G1 = GR is CNS11643 plane 1, G2 is CNS11643 plane 2 and
              G3 is not set.

       gb12345
              Encoding is 8-bit EUC using the GB 12345 (GBF) traditional
              Chinese character set.  G0 = GL is ASCII, G1 = GR is GB12345,
              G2 and G3 are not set.

       gbk, cp936
              Encoding is the GBK simplified Chinese character set.  G0 = GL
              is ASCII and G1 = GR is GBK.  G2 and G3 are not set.

       gb18030 (experimental)
              Encoding is the GB18030 (ibm-1392, Windows cp54936) Chinese
              character set.  Uses ASCII as latin part.

       big5
              Encoding is the Big5 traditional Chinese character set with
              ETen extension.  Includes Euro mapping.  Uses ASCII as latin
              part.

       cp950
              Encoding is the Microsoft cp950-Big5 traditional Chinese
              character set.  Uses ASCII as latin part.

       big5-hkscs (experimental)
              Encoding is the cp950-Big5 traditional Chinese character set
              with HKSCS extension.  Uses ASCII as latin part.

       big5-2003 (experimental)
              Encoding is the Big5-2003 Taiwanese standard traditional
              Chinese character set.  Uses ASCII as latin part.

       big5-uao (experimental)
              Encoding is the Big5-UAO (http://uao.cpatch.org) traditional
              Chinese character set.  Uses ASCII as latin part.

       VISCII (experimental)
              Vietnamese VISCII (rfc1456) character set.  Not TCVN-5712.

       VIQR (experimental)
              Vietnamese VISCII character set with VIQR encoding (rfc1456).

       sjis
              Encoding is the Shift-encoded JIS X 0208:1997 character set.
              Note that this is not cp932.  Uses JIS X 0201 latin as
              latin(GL) part.

       sjis-x0213, shift_jis-2000
              Encoding is Shift-encoded JIS using the JIS X 0213:2000
              character set.

       sjis-x0213-2004, shift_jis-2004
              Encoding is Shift-encoded JIS using the JIS X 0213:2004
              character set.  10 newly defined characters are added, but the
              Unicode mapping is the same as JIS X 0213:2000.  Uses JIS X
              0201 latin as latin(GL) part.

       sjis-cellular (experimental)
              Encoding is the Shift-encoded JIS X 0208:1997 character set
              with NTT Docomo/Vodafone(SoftBank) cellular phone glyph
              mapping.

       cp932 cp932w
              Encoding is Microsoft SJIS cp932 with NEC/IBM gaiji area, based
              on Windows XP mapping.  Uses ASCII as latin(GL) part.
              --use-compat and --use-ms-compat are automatically enabled.
              cp932w provides further WideCharToMultiByte compatibility.

       cp51932
              Encoding is Microsoft EUC-based cp51932 with NEC/IBM gaiji
              area, based on Windows XP mapping.  Uses ASCII as G0 and JIS X
              0201 kana as EUC G2 part.  G3 is not used for output, and is
              JIS X 0212:1990 for input.  --use-compat and --use-ms-compat
              are automatically enabled.

       cp50220, cp50221, cp50222
              Encoding is Microsoft JIS-based cp50220, cp50221, cp50222 with
              NEC/IBM gaiji area, based on Windows XP mapping.  For input,
              skf accepts cp50220, 50221 and 50222.  Note that this codeset
              is NOT compatible with iso-2022.  Uses ASCII as default
              character set.  --use-compat and --use-ms-compat are
              automatically enabled.

       oldsjis
              Encoding is Microsoft SJIS (JIS X 0208:1978 a.k.a. old JIS).
              Uses JIS X 0201 latin as latin(GL) part.

       johab
              Encoding is the KS X 1001 (Johab) character set.  Uses KS X
              1003 latin as latin(GL) part.

       uhc
              Encoding is the UHC (cp949) character set.  Uses ASCII as
              latin(GL) part.

       unicode, unicodefffe
              Encoding is Unicode UTF-16 (v5.0).  The default input/output
              byte order is little endian for unicode and big endian for
              unicodefffe, and an input byte order mark is recognized.
              Output includes an endian mark by default unless
              --disable-endian-mark is specified.  Output range is the full
              UTF-32 range, using surrogate pairs, unless --limit-to-ucs2 is
              specified.
              Note that ucs2 is not supported within the perl/ruby extension
              for either input or output, because of a data structure
              limitation.  Specifying ucs2 generates an error.
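
       The byte order and endian mark behavior described for unicode and
       unicodefffe can be sketched with Python's standard codecs.  This is an
       illustration of the encoding rules only, not skf's implementation;
       the function names and keyword parameters are hypothetical.

```python
import codecs


def encode_utf16(text, big_endian=False, bom=True):
    """Encode text as UTF-16, little endian with a BOM unless told otherwise."""
    codec = "utf-16-be" if big_endian else "utf-16-le"
    mark = b""
    if bom:
        mark = codecs.BOM_UTF16_BE if big_endian else codecs.BOM_UTF16_LE
    return mark + text.encode(codec)


def decode_utf16(data, default_big_endian=False):
    """Decode UTF-16, letting a byte order mark override the default order."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be" if default_big_endian else "utf-16-le")
```

       A character outside the BMP, such as U+20BB7, comes out as the
       surrogate pair D842 DFB7 in either byte order.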

       utf8
              Encoding is UTF-8 encoded Unicode (v5.0).  Output doesn't
              include a byte order mark unless --enable-endian-mark is
              specified.  Output range is the full UTF-32 range unless
              --limit-to-ucs2 is specified.  By default, CESU-8 is not
              accepted as input.  The option --enable-cesu8 enables CESU-8
              input for the utf-8 converter.  CESU-8 output is not supported.
              For UTF-8 input, the endian mark (BOM) is always ignored.
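
       The difference between UTF-8 and CESU-8 only shows up for characters
       outside the BMP: CESU-8 encodes each UTF-16 surrogate as its own
       three-byte sequence.  The following sketch shows this with Python's
       standard codecs; the function names are hypothetical and this is not
       how skf itself is implemented.

```python
def to_cesu8(text):
    """Encode text as CESU-8: UTF-16 code units, each written UTF-8 style."""
    units = text.encode("utf-16-be", "surrogatepass")
    # Treat every 16-bit code unit (including surrogates) as a code point.
    halves = "".join(
        chr(int.from_bytes(units[i:i + 2], "big"))
        for i in range(0, len(units), 2)
    )
    return halves.encode("utf-8", "surrogatepass")


def from_cesu8(data):
    """Decode CESU-8 by re-joining surrogate halves through UTF-16."""
    halves = data.decode("utf-8", "surrogatepass")
    return halves.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


def strict_utf8_accepts(data):
    """Return True when a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

       A strict UTF-8 decoder rejects the six-byte CESU-8 form of U+1F600,
       which matches the default behavior described above.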

       utf7
              Encoding is UTF-7 encoded Unicode (v5.0).  Input/output range
              is limited to UTF-16, and values of U+10000 and above are
              regarded as undefined.  BOM is always ignored for input, and
              never used for output.

       keis (experimental)
              Encoding is Hitachi KEIS83/90.  Output range is limited to the
              EBCDIK and JIS X 0208 areas.

       jef (experimental)
              Encoding is Fujitsu JEF.  Input only.  Only the basic part is
              supported.

       ibm930 (experimental)
              Encoding is IBM DBCS Japanese with EBCDIC Kana

       ibm931 (experimental)
              Encoding is IBM DBCS Japanese with EBCDIC latin (ibm037)

       ibm933 (experimental)
              Encoding is IBM DBCS Korean with EBCDIC Wansung character set

       ibm935 (experimental)
              Encoding is IBM DBCS Simplified Chinese with EBCDIC Chinese

       ibm937 (experimental)
              Encoding is IBM DBCS Traditional Chinese with EBCDIC Chinese

       koi8r
              Russian KOI-8R code.

       cp1250
              Central European latin Microsoft cp1250 code

       cp1251
              Eastern European cyrillic Microsoft cp1251 code

       transparent
              Transparent mode.  Various code control features, including
              folding and line end code conversion, are also ignored.

     Shortcuts
       -n -j  same as --oc=jis

       -s -x  same as --oc=sjis

       -a -e  same as --oc=euc-jp

       -q     same as --oc=ucs2

       -z     same as --oc=utf8

       -y     same as --oc=utf7

       -k     same as --oc=keis

       -A, -E same as --ic=euc-jp.  Assume input codeset is EUC-JP.

       -N     same as --ic=jis.  Assume input codeset is iso-2022-jp.

       -S, -X same as --ic=sjis.  Assume input codeset is Shift JIS.

       -Q     same as --ic=ucs2.

       -Y     same as --ic=utf7.

       -Z     same as --ic=utf8.

       -K     same as --ic=keis.

     ISO-2022 Specific controls
       These options replace G0 to G3, after they have been set up according
       to the specified input codeset, with the given character sets.  Note
       that this doesn't change any codeset properties of the original
       codeset, like language and encoding.

       --set-g0=`charset name'
              Predefines the specified code set to plane 0 (G0).  Also set to
              GL at initial state.

       --set-g1=`charset name'
              Predefines the specified code set to right plane (G1).  Also
              set to GR at initial state.

       --set-g2=`charset name'
              Predefines the specified code set to right plane (G2).

       --set-g3=`charset name'
              Predefines the specified code set to right plane (G3).

       Supported `char_set' values are as follows.  'o' means the codeset can
       be assigned to the plane; 'x' means it cannot.  For Unicode family
       codesets, this option is ignored.  For other non-iso2022 categories,
       this option is not supported, and the result is unpredictable.

       g0 g1 g2 g3    codeset name   description
       o  o  o  o     ascii          ANSI X3.4 ASCII
       o  o  o  o     x0201          JIS X 0201 (latin part)
       x  o  o  o     iso8859-1      ISO 8859-1 latin
       x  o  o  o     iso8859-2      ISO 8859-2 latin
       x  o  o  o     iso8859-3      ISO 8859-3 latin
       x  o  o  o     iso8859-4      ISO 8859-4 latin
       x  o  o  o     iso8859-5      ISO 8859-5 Cyrillic
       x  o  o  o     iso8859-6      ISO 8859-6 Arabic
       x  o  o  o     iso8859-7      ISO 8859-7 Greek-latin
       x  o  o  o     iso8859-8      ISO 8859-8 Hebrew
       x  o  o  o     iso8859-9      ISO 8859-9 latin
       x  o  o  o     iso8859-10     ISO 8859-10 latin
       x  o  o  o     iso8859-11     ISO 8859-11 Thai
       x  o  o  o     iso8859-13     ISO 8859-13 latin
       x  o  o  o     iso8859-14     ISO 8859-14 latin
       x  o  o  o     iso8859-15     ISO 8859-15 latin
       x  o  o  o     iso8859-16     ISO 8859-16 latin
       x  o  o  o     tcvn5712       TCVN 5712 (Vietnamese)
       x  o  o  o     ecma94         ECMA 94 Cyrillic (KOI-8e)
       o  o  o  o     x0212          JIS X 0212:1990
       o  o  o  o     x0208          JIS X 0208:1997
       o  o  o  o     x0213          JIS X 0213 Plane 1:2000
       o  o  o  o     x0213-2        JIS X 0213 Plane 2:2000
       o  o  o  o     x0213n         JIS X 0213 Plane 1:2004
       o  o  o  o     gb2312         Simplified Chinese GB2312
       o  o  o  o     gb1988         Chinese GB1988 (latin)
       o  o  o  o     gb12345        Traditional Chinese GB12345
       o  o  o  o     ksx1003        Korean KS X 1003 (latin)
       o  o  o  o     ksx1001        Korean KS X 1001
       x  o  o  o     koi8-r         Cyrillic KOI-8R
       x  o  o  o     koi8-u         Ukrainian Cyrillic KOI-8U
       o  o  o  o     cns11643-1     Traditional Chinese CNS11643-1
       x  o  o  o     viscii-r       RFC1456 VISCII (right plane)
       o  o  o  o     viscii-l       RFC1456 VISCII (left plane)
       x  o  o  o     cp437          Microsoft cp437 (US latin)
       x  o  o  o     cp737          Microsoft cp737
       x  o  o  o     cp775          Microsoft cp775
       x  o  o  o     cp850          Microsoft cp850
       x  o  o  o     cp852          Microsoft cp852
       x  o  o  o     cp855          Microsoft cp855
       x  o  o  o     cp857          Microsoft cp857
       x  o  o  o     cp860          Microsoft cp860
       x  o  o  o     cp861          Microsoft cp861
       x  o  o  o     cp862          Microsoft cp862
       x  o  o  o     cp863          Microsoft cp863
       x  o  o  o     cp864          Microsoft cp864
       x  o  o  o     cp865          Microsoft cp865
       x  o  o  o     cp866          Microsoft cp866
       x  o  o  o     cp869          Microsoft cp869
       x  o  o  o     cp874          Microsoft cp874
       x  o  o  o     cp932          Microsoft cp932 (Japanese)
       x  o  o  o     cp1250         Microsoft cp1250 (Central Europe)
       x  o  o  o     cp1251         Microsoft cp1251 (Cyrillic)
       x  o  o  o     cp1252         Microsoft cp1252 (Latin-1)
       x  o  o  o     cp1253         Microsoft cp1253 (Greek)
       x  o  o  o     cp1254         Microsoft cp1254 (Turkish)
       x  o  o  o     cp1255         Microsoft cp1255
       x  o  o  o     cp1256         Microsoft cp1256
       x  o  o  o     cp1257         Microsoft cp1257
       x  o  o  o     cp1258         Microsoft cp1258

       --euc-protect-g1
              In EUC input mode, suppress sequences that set a charset to G1.
              Such sequences are discarded.

       --add-annon
              Add an announcer for JIS X 0208:1997 to the X 0208 designate
              sequence.  This option works only with iso-2022-based output.

       --input-detect-jis78
              Distinguish the JIS X 0208:1978 codeset from the JIS X
              0208:1997 codeset.  By default, both charsets are regarded as X
              0208:1997.  This option is valid only when the input encoding
              is JIS (iso-2022-jp).

     Unicode coding specific control options
       --use-compat --suppress-compat
              skf substitutes characters in the Unicode compatibility planes
              (U+F900 - U+FFFD) with appropriate characters in
              non-compatibility planes.  When substitution is enabled, these
              characters are converted to variants or treated as undefined.
              --use-compat disables this substitution, and --suppress-compat
              enables it.  Substitution is enabled by default, but several
              codesets disable it as a codeset feature (i.e. they use the
              compatibility planes).  See the codeset section.

       --use-ms-compat
              When output is Unicode, make the Unicode mapping Microsoft
              Windows compatible.  This only changes the conversion of some
              symbols in JIS Kanji, and adding the --use-compat option is
              recommended for round-trip conversion.  If you need stricter
              compatibility, try cp932w as the input codeset.

       --use-cde-compat
              When output is Unicode, make the translation compatible with
              the CDE standard codeset.

       --little-endian
              When output is UTF-16, use little endian byte order.  This is
              the default.

       --big-endian
              When output is UTF-16, use big endian byte order.

       --disable-endian-mark --enable-endian-mark
              When output is UTF-16 or UTF-8, omit or emit the byte order
              mark, respectively.  To make UTF-16N, use --disable-endian-mark
              with --little-endian.  By default, BOM is enabled for UTF-16
              and disabled for UTF-8.

       --input-little-endian
              When input is UTF-16, assume input is in little endian byte
              order.  This is the default, but skf respects a byte order
              mark.

       --input-big-endian
              When input is UTF-16, assume input is in big endian byte order.
              Note that skf respects a byte order mark.

       --endian-protect
              Do not use the endian mark in the input stream.  The endian
              mark is simply discarded.  This is off by default.

       --limit-to-ucs2
              Do not use code points of 0x10000 and above in Unicode (i.e.
              limit codes to the BMP area).  This option doesn't limit the
              internal code range in skf.  This is off by default.

       --disable-cjk-extension
              Treat CJK extension A/B areas as undefined.  This is off (i.e.
              these areas are enabled) by default.

       --enable-cesu8
              Enable CESU-8 input in the utf-8 codeset.  Ignored for any
              other codeset.

       --non-strict-utf8
              Enable broken (decodable but not spec-conforming) utf-8 input.
              If you need this option, proceed with extra care.

       --enable-nfd-decomposition --disable-nfd-decomposition
              Enable/Disable Unicode Normalized decomposition.  Default is
              disabled.

       --enable-nfda-decomposition --disable-nfda-decomposition
              Enable/Disable Apple-compatible Unicode Normalized
              decomposition.  Default is disabled.
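
       What canonical decomposition does to text can be seen with Python's
       unicodedata module.  This sketch shows standard NFD only; it says
       nothing about the Apple-compatible variant, whose exact rules are not
       described here, and the function name nfd is hypothetical.

```python
import unicodedata


def nfd(text):
    """Return the canonical decomposition (NFD) of text.

    Precomposed characters are split into a base character followed by
    combining marks; characters without a decomposition are unchanged.
    """
    return unicodedata.normalize("NFD", text)
```

       For example, the katakana GA (U+30AC) becomes KA (U+30AB) followed by
       the combining voiced sound mark (U+3099).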

     Codeset/Vendor Specific codeset handling flags
       skf by default assumes machine-specific parts of kanji code are
       Microsoft Windows compatible.  Here are some options that control this
       behavior.  Options in this category are valid when the output codeset
       is a Japanese codeset, except --disable-chart.

       --use-apple-gaiji
              Assume the machine-specific part in the input file is Macintosh
              Classic OS (System 7,8,9) compatible.

       --disable-ibm-gaiji --disable-nec-gaiji
              Disable the IBM/NEC defined machine-specific part in the input
              file.

       --disable-chart
              Do not use Moji-keisen characters.  This is for old Macintosh
              system (System 6.x or older) compatibility.

     Miscellaneous codeset related options
       --old-nec-compat
              Enable the old NEC kanji sequence (ESC-K,H).  Needs the compile
              option --enable-oldnec at configuration.

       --no-utf7
              Assume the input codeset is *NOT* UTF-7 encoded Unicode.  This
              option disables input utf7 testing.

       --no-kana
              Assume the input codeset does *NOT* include JIS X 0201 kana.

   OUTPUT Conversions options
       skf is intended to write its output stream to stdout, but an
       nkf-compatible option for changing file encoding in place is also
       provided.

       --overwrite --in-place
              convert the encoding of the file(s) specified as input.
              --overwrite preserves the file change date.

       skf has various features to adapt output files to the local
       environment.  Most of these are controlled by the extended control
       switches described in this section.

       --use-g0-ascii
              set G0 (=GL) for output encoding to ASCII, ignoring the codeset
              designation.

     X-0201 Kana/latin conversions
       skf by default converts X-0201 kana to X-0208 kana.  To output X-0201
       kana as it is, use one of the following options.  When output is
       designated as EUC or SJIS, these three options enable X-0201 kana
       output by the means each encoding provides.  When Unicode output is
       specified, output of the (equivalent) kana part is controlled by
       --use-compat, not by the following switches.  These options are valid
       only when the output codeset is NOT in the Unicode family.
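
       The default X-0201 to X-0208 kana conversion has a close analogue at
       the Unicode level: half-width katakana map to their full-width forms.
       The sketch below uses NFKC normalization restricted to the half-width
       kana block as a stand-in; skf uses its own tables, so results may
       differ in corner cases, and the name widen_kana is hypothetical.

```python
import re
import unicodedata

# Half-width katakana and related punctuation (the X-0201 kana repertoire
# as it appears in Unicode): U+FF61 through U+FF9F.
_HALFWIDTH_KANA = re.compile("[\uff61-\uff9f]+")


def widen_kana(text):
    """Replace runs of half-width kana with their full-width equivalents.

    NFKC is applied per run so that a kana followed by a half-width voiced
    sound mark is composed into a single full-width character, while all
    text outside the half-width kana block is left untouched.
    """
    return _HALFWIDTH_KANA.sub(
        lambda match: unicodedata.normalize("NFKC", match.group(0)), text
    )
```

       Half-width KA plus the voiced sound mark (U+FF76 U+FF9E) becomes the
       single full-width character GA (U+30AC).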
647
648       --kana-jis7
649              use SI/SO locking shift sequence to designate X-0201 kana.  This
650              switch is valid for jis, jis-x0213 and  cp50220  (i.e.  cp50221)
651              encoding.  For other codesets, this option is ignored.
652
653       --kana-jis8
654              output X-0201 kana using 8-bit code right plane.  This switch is
655              valid for jis and jis-x0213 encoding.  For other  codeset,  this
656              option is ignored.
657
658       --kana-esci --kana-call
659              Use ESC-(-I to designate X-0201 kana. This switch is valid for
660              jis, jis-x0213 and cp50220 (i.e. cp50222) encoding. For other
661              codesets, this option is ignored.
662
663       --kana-enable
664              If the output is EUC-JP or cp51932, use X-0201 kana with G2. For
665              SJIS output, this is the same as --kana-jis8. For JIS output, it
666              is the same as --kana-call.
667
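The byte-level difference between the kana options above can be sketched in Python. This is only an illustration of the encodings themselves, not skf's implementation; the function names are hypothetical, and the escape and shift sequences are the standard ISO-2022 ones (SO/SI, ESC ( I, ESC ( B, and SS2 for EUC-JP).

```python
SO, SI = b"\x0e", b"\x0f"   # locking shifts: shift out to kana, shift in to ASCII
ESC_KANA = b"\x1b(I"        # ESC ( I : designate JIS X 0201 kana to G0
ESC_ASCII = b"\x1b(B"       # ESC ( B : designate ASCII to G0 again
SS2 = b"\x8e"               # single shift 2, used by EUC-JP to reach G2


def x0201_byte(ch: str) -> int:
    """Map a halfwidth kana (U+FF61..U+FF9F) to its 8-bit JIS X 0201 byte."""
    cp = ord(ch)
    if not 0xFF61 <= cp <= 0xFF9F:
        raise ValueError(f"not a JIS X 0201 kana: {ch!r}")
    return cp - 0xFF61 + 0xA1


def kana_jis8(text: str) -> bytes:
    """8-bit right-plane form (what --kana-jis8 describes)."""
    return bytes(x0201_byte(c) for c in text)


def kana_jis7(text: str) -> bytes:
    """7-bit form bracketed by SO/SI (what --kana-jis7 describes)."""
    return SO + bytes(x0201_byte(c) & 0x7F for c in text) + SI


def kana_esci(text: str) -> bytes:
    """7-bit form designated with ESC ( I (what --kana-esci describes)."""
    return ESC_KANA + bytes(x0201_byte(c) & 0x7F for c in text) + ESC_ASCII


def kana_euc_g2(text: str) -> bytes:
    """EUC-JP form: each kana byte prefixed by SS2 (kana held in G2)."""
    return b"".join(SS2 + bytes([x0201_byte(c)]) for c in text)
```

For the single kana U+FF71, the 8-bit form matches what Python's own shift_jis codec produces, and the G2 form matches its euc_jp codec.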
668       --use-iso8859-1
669              Enable iso-8859-1 output. Iso-8859-1 is invoked to G1 and set to
670              GR plane.
671
672
673     JIS X 0212(Supplement Kanji code) Support
674       --x0212-enable
675              By default, skf does not output JIS X 0212 code. This option
676              enables use of the JIS X 0212 part. The output codeset must be
677              neither a Microsoft codeset nor KEIS. For Unicode variant
678              encodings, this option is ignored. This option is kept for
679              backward compatibility and may not be supported in future versions.
680
681
682     URI/TeX format conversion feature options
683       With Unicode(tm) family output codings, skf outputs non-ASCII Latin
684       characters as they are. With other output codings, skf converts
685       these characters using the following rules:
686
687       (1) If the character is defined in the specified output codeset, that
688       code point is used for output.
689       (2) If one of the html convert modes is enabled (i.e. --convert-html
690       --convert-sgml) and the character is defined in the html/sgml codeset,
691       it is converted to an entity reference or code point reference.
692       (3) If tex convert mode is enabled and the character is defined as a
693       tex expression, it is converted to tex format.
694       (4) If the character is a kind of combined ligature, it is shown as a
695       set of characters.
696       (5) Otherwise, a kind of replacement character is shown, with a warning.
697
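The fallback chain in rules (1) to (5) can be sketched as follows. This is a minimal illustration, not skf's code: the function name, the mode strings, the tiny TeX table and the "?" replacement are all assumptions, and Unicode NFKD decomposition stands in for skf's ligature handling.

```python
import unicodedata
from html.entities import codepoint2name

# Tiny illustrative TeX table; a real one would be far larger.
TEX_FORMS = {"\u00e9": r"\'e", "\u00fc": r'\"u'}


def encodable(text: str, codec: str) -> bool:
    """True if every character of text exists in the target codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def convert_char(ch: str, codec: str, mode=None, replacement: str = "?") -> str:
    # (1) the output codeset defines the character: emit it unchanged
    if encodable(ch, codec):
        return ch
    cp = ord(ch)
    # (2) html convert modes: named entity, decimal or hexadecimal reference
    if mode == "html" and cp in codepoint2name:
        return f"&{codepoint2name[cp]};"
    if mode == "html-decimal":
        return f"&#{cp};"
    if mode == "html-hexadecimal":
        return f"&#x{cp:x};"
    # (3) tex convert mode
    if mode == "tex" and ch in TEX_FORMS:
        return TEX_FORMS[ch]
    # (4) ligature: show it as a set of characters (NFKD used as a stand-in)
    parts = unicodedata.normalize("NFKD", ch)
    if parts != ch and encodable(parts, codec):
        return parts
    # (5) nothing applied: fall back to a replacement character
    return replacement
```

For example, `convert_char("\u00e9", "ascii", "html")` yields a named entity reference, while the same call without a mode falls through to the replacement character.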
698       --convert-html --convert-sgml
699              Enable html convert mode. This mode is cleared by --reset. These
700              two options are synonyms, and are treated as same option.
701
702       --convert-html-decimal
703              Enable html  code-point  decimal  convert  mode.  This  mode  is
704              cleared by --reset.
705
706       --convert-html-hexadecimal
707              Enable  html  code-point  hexadecimal convert mode. This mode is
708              cleared by --reset.
709
710       --convert-tex
711              Enable TeX convert mode. This mode is cleared by --reset.
712
713       --use-replace-char
714              In Unicode, use the Unicode replacement character (U+fffc) for
715              undefined characters.
716
717
718   Encoding/Decoding control options
719       --decode=`encoding scheme' --encode=`encoding scheme'
720              Specify a decoding/encoding scheme for the input stream.
721              Supported encoding schemes for decoding are
722              'hex', 'mime', 'mime_q', 'mime_b', 'uri', 'ace',
723              'hex_perc_encode', CAP hex-code, mime, mime Q-encoding, mime
724              B-encoding, uri character reference, ACE punycode, uri percent
725              notation, base64, Q-encoding, rfc2231 and rot13/47 respectively.
726              For encoding, 'hex', 'mime_b', 'mime_q', 'uri', 'ace' and 'cap'
727              are supported; output of an already ascii-encoded codeset
728              (e.g. UTF-7) with encoding is not supported.
729              Only one decode/encode option is valid; if more than one
730              option is specified, the last one is used. When one of the mime
731              decodings is specified, the base text is assumed to be EUC
732              encoded unless specified otherwise. Except for rot, which assumes
733              the input stream is Shift_JIS, EUC or iso-2022-jp, these
734              encodings assume the input stream is ascii (as defined in
735              RFC2045). Some decodings may co-exist with encoding, but this is
736              not guaranteed. In particular, if the input is UTF-16/UCS2 code,
737              these encodings are ignored by skf.
738
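To make the transfer encodings named above concrete, the sketch below decodes a few of them with the Python standard library. It shows what the encoded forms look like, not how skf processes them; the helper names and sample strings are assumptions chosen for illustration.

```python
import base64
import codecs
from email.header import decode_header
from urllib.parse import unquote


def decode_mime_word(word: str) -> str:
    """Decode MIME encoded-words (B- or Q-encoding) into text."""
    return "".join(
        raw.decode(charset or "ascii") if isinstance(raw, bytes) else raw
        for raw, charset in decode_header(word)
    )


def decode_uri(text: str) -> str:
    """Decode URI percent notation (UTF-8 assumed)."""
    return unquote(text)


def decode_ace(label: bytes) -> str:
    """Decode a bare punycode label (without the 'xn--' prefix)."""
    return label.decode("punycode")


def decode_rot13(text: str) -> str:
    """rot13 is its own inverse on ASCII letters."""
    return codecs.encode(text, "rot13")


# Sample inputs built at run time, so no encoded literal is guessed.
sample = "\u65e5\u672c\u8a9e"
b_word = "=?UTF-8?B?" + base64.b64encode(sample.encode("utf-8")).decode("ascii") + "?="
q_word = "=?UTF-8?Q?caf=C3=A9?="
ace_label = sample.encode("punycode")
```

Each helper takes one encoded form and returns plain text, which mirrors the idea that only one decode scheme is applied to a given input.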
739       --mime-ms-compat
740              treat japanese generic codesets as Microsoft  cp932  compatible.
741              More  specifically,  with  this option skf treats iso-2022-jp as
742              cp50220, euc-jp as cp51932 and Shift_JIS as cp932w.
743
744
745   End of line control options
746       --lineend-thru
747              Output end-of-line code as it is. Also output ^Z code as it  is.
748              This is default.
749
750       --lineend-cr --lineend-mac
751              Use  CR  as  end-of-line  code.  Also  delete ^Z code from input
752              stream.
753
754       --lineend-lf --lineend-unix
755              Use LF as end-of-line code.  Also  delete  ^Z  code  from  input
756              stream.
757
758       --lineend-crlf --lineend-windows
759              Use  CR+LF  as  end-of-line code. Also delete ^Z code from input
760              stream.  This option doesn't preserve original order of  cr  and
761              lf.
762
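The effect of the --lineend-* options described above can be sketched as a small byte filter. This is an assumption-laden illustration rather than skf's implementation: the function name and mode strings are hypothetical, and any of CR+LF, CR or LF in the input is treated as one line end.

```python
import re

EOL = {"cr": b"\r", "lf": b"\n", "crlf": b"\r\n"}
CTRL_Z = b"\x1a"  # DOS end-of-file marker (^Z)
ANY_EOL = re.compile(rb"\r\n|\r|\n")


def convert_lineend(data: bytes, mode: str = "thru") -> bytes:
    """Normalize line ends; 'thru' passes line ends and ^Z through untouched."""
    if mode == "thru":
        return data
    data = data.replace(CTRL_Z, b"")               # the non-thru modes drop ^Z
    return ANY_EOL.sub(lambda _m: EOL[mode], data)  # rewrite every line end
```

For instance, `convert_lineend(b"a\r\nb\rc\n\x1a", "lf")` returns the three lines terminated by LF only, with the trailing ^Z removed.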
763       --input-cr
764              Assume input stream uses CR as end-of-line code.
765
766       --input-lf
767              Assume input stream uses LF as end-of-line code.
768
769       --input-crlf
770              Assume input stream uses CR+LF as end-of-line code.
771
772       -F[line_length[-kinsoku]]
773
774       -f[line_length[-kinsoku]] -f[line_length[+kinsoku]]
775              Wrap input lines at line_length columns. The f option deletes
776              CR/LFs in the input, and the F option doesn't delete them. For
777              Japanese convention, both gyoutou-kinsoku (by burasage-gumi) and
778              gyoumatsu-kinsoku (by oidasi-gumi) are supported. The burasage
779              length is controlled by the kinsoku option. The default value for
780              line_length is 66, and it must be < 1000. The default value for
781              kinsoku is 5, and it must be <= 10. With the 'f' option, skf
782              autodetects paragraphs and retains some CR/LFs. The second 'f'
783              option format (with '+') disables this behaviour. In nkf
784              compatible mode, some fold behaviors change as follows.
785              (1) Default line_length is set to 60, and kinsoku value is 10.
786              (2) alpha numeric characters become gyoutou-kinsoku characters.
787
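The gyoutou-kinsoku (burasage) rule described for -f/-F can be sketched as below: a character that must not start a line is allowed to hang past line_length, up to kinsoku extra columns. This is a simplified illustration, not skf's algorithm; the prohibited-character set is a small assumed subset, and gyoumatsu-kinsoku and paragraph detection are left out.

```python
import unicodedata

# Illustrative subset of characters that must not begin a line.
NO_LINE_START = set("\u3001\u3002\uff0c\uff0e\uff09\u300d\u300f")


def columns(ch: str) -> int:
    """Wide and fullwidth characters occupy two columns, others one."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def fold(text: str, line_length: int = 66, kinsoku: int = 5) -> list:
    """Wrap text at line_length columns, letting NO_LINE_START chars hang."""
    lines, cur, used = [], "", 0
    for ch in text:
        w = columns(ch)
        if used + w > line_length:
            # burasage: hang the character if the kinsoku budget allows it
            if ch in NO_LINE_START and used + w <= line_length + kinsoku:
                cur, used = cur + ch, used + w
                continue
            lines.append(cur)
            cur, used = "", 0
        cur, used = cur + ch, used + w
    if cur:
        lines.append(cur)
    return lines
```

With a budget of zero the same input breaks before the punctuation mark, which is the situation the kinsoku allowance is meant to avoid.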
788   File control options
789       --filewise-detect --force-reset
790              Reset and re-detect input code set at the start of each file.
791
792       --linewise-detect
793              Reset and re-detect input code set at the start  of  each  line.
794              This option needs -DKUNIMOTO at compile time.
795
796
797   Compatibility options
798       --nkf-compat
799              Interpret the following options in an nkf-compatible manner. -l,
800              -d, -c, -x, -m, -w and -W work as in nkf2.0. The behavior of -f
801              and -F changes as shown above, and --disable-space-convert is
802              also enabled. Most other nkf options and switches also work,
803              except for behavior in case of errors.
804
805       --skf-compat
806              Interpret the following options in the skf-native manner.
807
808
809   Misc. Control options
810       --disable-space-convert --enable-space-convert
811              skf converts an ideographic space into two ascii spaces. The
812              disable option turns this behavior off, and the enable option
813              turns it on. Default is enabled.
814
815       --html-sanitize
816              Convert  several characters in HTML document to entity reference
817              expression. Specifically, "!#$&%()/<>:;?´ are escaped by entity-
818              references.
819
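A rough sketch of what --html-sanitize describes is given below. It is not skf's implementation: the function name is hypothetical, the character set is copied from the description above, and the choice of a named entity where HTML 4 defines one and a decimal reference otherwise is an assumption.

```python
from html.entities import codepoint2name

# Characters listed in the option description above.
SANITIZE_CHARS = set("\"!#$&%()/<>:;?\u00b4")


def html_sanitize(text: str) -> str:
    """Replace each listed character with an entity or numeric reference."""
    out = []
    for ch in text:
        if ch in SANITIZE_CHARS:
            name = codepoint2name.get(ord(ch))      # HTML 4 named entity, if any
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
        else:
            out.append(ch)
    return "".join(out)
```

The text is processed in a single pass, so the '&', '#' and ';' characters produced by the references themselves are not escaped again.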
820       --filewise-detect --force-reset
821              If multiple input files are given, detect input codeset for each
822              file.
823
824       --linewise-detect
825              Detect input code  line-wise.  Note  this  option  weakens  code
826              detect correctness.
827
828       --reset
829              Reset  all  flags specified by extended controls and given input
830              code.
831
832       --inquiry --guess
833              skf detects the input code and writes the detection result to
834              stdout. No filtering output is performed. If multiple input files
835              are given, --show-filename is automatically enabled.
836
837       --hard-inquiry
838              Similar to --inquiry, but reports both the code and the
839              end-of-line character.
840
841       --suppress-filename
842              When  inquiry(--inquiry)  is  on, this option disables file name
843              output.  This option overrides --show-filename.
844
845       --show-filename
846              When inquiry(--inquiry) is on, this option adds each  file  name
847              to output.
848
849       --invis-strip
850              Delete  all  escape  sequences  not  belonging  to ISO-2022 code
851              extension. This is intended to replace invisstrip  command  bun‐
852              dled in inews package.
853
854       -I     Warn if input has unassigned code points.
855
856       -v     print version information and exit.
857
858       -h --help
859              print brief help and exit.
860
861       --show-supported-codeset
862              Display  supported  codesets  (input)  and  exit. Both canonical
863              names (left side) and detailed names are shown.  This  canonical
864              name  can  be  used  as  MIME charset and also as ic-option code
865              specification.
866
867       --show-supported-charset
868              Display supported character sets (output) and exit. Both
869              canonical names and detailed names are shown. Some charsets with
870              special treatments (i.e. meaningless as set-g* parameters)
871              intentionally lack addressable cnames.
872
873       -%[debug_level]
874              Enable  skf  debugging. Debug level is one digit. 0 is the least
875              verbose, and with -%9 you'll get whole traces within skf.   This
876              option needs configure option --enable-debug.
877
878

FILES

880       /usr/(local/)share/skf/lib/   (Unices)
881
882       /Program Files/skf/share/lib (MS Windows)
883              These directories are where external codeset conversion tables
884              go. The location that the current skf assumes is shown by the -h
885              option.
886
887

AUTHOR

889       skf is written by Seiji Kaneko (efialtes@sourceforge.jp) based on an idea
890       from nkf, written by Itaru Ichikawa (ichikawa@flab.fujitsu.co.jp). The X
891       0213 code table is derived from the work of earthian@tama.or.jp. Some
892       codeset mappings are derived from various sources. Detailed origins are
893       shown in the copyright document included in this distribution.
894
895

ACKNOWLEDGEMENT

897       skf   is   inspired   by   works   or  requests  by  shinoda@cs.titech,
898       kato@cs.titech, uematsu@cs.titech, void@global ohta@ricoh,  Hinata(HKE)
899       Ashizawa(CRL)  Kunimoto(SDL) Oohara(Univ of Kyoto), Jokagi(elf2000) and
900       Naruse (at sourceforge.jp). Thanks.
901
902

BUGS AND LIMITATIONS

904       1. skf can handle mixed coding with  some  limitations.  However,  code
905       detection  tends to fail for mixed code, and giving explicit input code
906       set is strongly encouraged, if codeset is known beforehand.
907       In case of need, --linewise-detect option may help, but code  detecting
908       will be more likely to fail.
909
910       2. When using UCS2, UTF-16, UTF-8 and UTF-7, skf tries to detect the
911       input code, but giving an explicit code set is encouraged. skf doesn't
912       support UCS4, but does support the UTF-32 area via UTF-16 (i.e.
913       surrogate pairs) and UTF-8. skf just passes composite characters to
914       output. No further normalization process is performed.
915
916       3. skf implements ISO-2022 with following exceptions.
917        i)  GL 0x20 is always space, even when a 96-character codeset is
918       invoked to GL.
919        ii) Sequences for setting codes to C1 and C2 are always ignored.
920        iii) If an unknown sequence is given to G0, G0 is set to ascii, and
921       locking/single shift is cleared. An unknown sequence that sets G1-G3
922       is just ignored.
923        Private charset is also not supported and is ignored.
924        iv) Sequences for 96 character multibyte coding is ignored (Currently,
925       no codeset is registered).
926        v) Calling UTF-8, UTF-16 coding system from iso-2022 is supported, and
927       returns to previous coding system by standard return.
928        Callings and returns to/from other coding schemes are ignored.
929        vi) For supporting some of cellular phone glyphs, several private (not
930       registered) codesets are defined in skf, and can be called by appropri‐
931       ate sequences.
932
933       4. Since skf by default tests input stream to detect utf7  coding,  skf
934       sometimes  misdetects  pure  ascii  text  as  utf7. If this occurs, use
935       --no-utf7 option.
936
937       5. Error output coding is controlled by the LOCALE environment variables
938       on UN*X systems. skf doesn't handle situations where stdout and stderr
939       are redirected into the same stream. Such cases should be handled on
940       the user side.
941
942       6. skf-1.9x converts KEIS/JIS X 0213 code using CJK-extension B and CJK
943       compatibility area. For this reason, X 0213  and  KEIS  convert  result
944       varies depending on --use-compat and --limit-to-ucs2 switches.
945
946       7.  JIS X 0207:1979 is not supported. JIS X 0211:1987 is designed to be
947       supported (i.e. common terminal control sequence will be  transparently
948       passed to output).
949
950       8.  Even  if  unbuffer  option(-u)  is specified, some code-translation
951       related bufferings are still performed (in MIME, kana, VIQR etc.).
952
953       9. skf-1.9x recognizes and  handles  languages  in  iso639-1(alpha  2).
954       iso639-2 is not supported as a valid language set.
955
956       10. UCS-2(UTF-16) is not supported within the perl/ruby extensions for
957       either input or output, because of a data structure limitation. Specifying
958       ucs2 will generate an error. This is a limitation of SWIG and the languages
959       themselves, rather than a limitation of skf. Use UTF-8 for these languages.
960
961       11. skf-1.9x does not retain Macintosh RLO-ordered character  property.
962       Codesets with this kind of codes are not supported.
963
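As a companion to item 2 above, the sketch below shows how a code point beyond U+FFFF is carried by a UTF-16 surrogate pair and by a four-byte UTF-8 sequence. It illustrates the encodings only; the function name is an assumption and nothing here is taken from skf's source.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"U+{cp:04X} is not a supplementary code point")
    v = cp - 0x10000                      # 20-bit value to distribute
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def utf16be(cp: int) -> bytes:
    """Serialize the surrogate pair as big-endian UTF-16."""
    high, low = to_surrogates(cp)
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

For U+2000B, a CJK Extension B ideograph, this gives the pair D840 DC0B, which agrees with Python's built-in utf-16-be codec.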
964

Notes

966       1. Extended options have changed extensively since skf-1.9. Some archaic
967       options (e.g. -B, -@ and -r) have been deleted from this version.
968
969       2. skf was originally a project forked from nkf, but doesn't contain nkf
970       code. The copyright notice is retained as a courtesy.
971
972       3.  From version 1.9, default Japanese character set assumed by skf has
973       changed to JIS X 0208:1990 with Microsoft Japanese Windows gaiji  (i.e.
974       CP932).
975
976       4.  Code  autodetection  is  not perfect by design. If it has failed to
977       detect input code properly, please give input code information  explic‐
978       itly.
979
980       5.  Some  ligatures  in  Unicode,  cp932 gaiji and KEIS83 are converted
981       using JIS X 0124 and other convention.   During  this  conversion,  its
982       byte length is not preserved.
983
984       6.  skf  is  intended  to  pass  ANSI compatible terminal control codes
985       transparently, but this is not guaranteed.
986
987       7. nkf's -i and -o options work only in nkf-compat mode. They are
988       obsolete in 1.97, and are valid only for iso-2022-jp, without
989       considering output codeset specifications.
990
991       8. For unconverted character, skf uses geta and undefined character  as
992       --use-replace-char  option.   If  output  codeset  doesn't contain geta
993       code, skf prefers 'black square character', then uses '.' respectively.
994
995       9. There are some undocumented options. These options should be consid‐
996       ered as highly experimental.
997
998       10. In lineend-thru mode with folding, skf remembers the order in which
999       cr and lf appear in the stream, and uses that order. Because of this
1000       design, if skf needs to output a line-end character before any line-end
1001       character appears in the input stream, the input order may not be preserved.
1002
1003       11. NKF-compatibility
1004       1) -B*, and --prefix, some --fb's and --no-cp932ext/best-fit-chars  are
1005       not supported.
1006       2) rot encoding is not supported. rot decoding can't be used with other
1007       decodings.
1008       3) MSDOS (and -T) are not supported.
1009       4) MIME decoding/encoding error handling behavior  differs  in  various
1010       ways.
1011       5)  LF/CR  behaves  differently. Results may not be same for some messy
1012       text.
1013
1014

Notice

1016       Unicode(TM) is a trademark of Unicode, Inc. Microsoft and  Windows  are
1017       registered  trademarks  of Microsoft corporation. Macintosh is a regis‐
1018       tered trademark of Apple Computer Inc. Vodafone is a trademark of Voda‐
1019       fone K.K.  Other names and terms may be trademarks or registered trade‐
1020       marks of their respective owner.  Trademark symbol (TM) may be  omitted
1021       in this manual page.
1022
1023
1024
1025
1026                                  25/JAN/2008                           SKF(1)