MMSEG(1) User Contributed Perl Documentation MMSEG(1)

NAME
mmseg - segment Chinese text using maximum matching

SYNOPSIS
mmseg -d dict_file [option]... [corpus_file]...

DESCRIPTION
mmseg is a tool for segmenting Chinese text into words using the maximum
matching algorithm. It segments corpus_file, or standard input if no
file name is specified, and writes the segmented result to standard
output.
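
The matching strategy named above can be illustrated with a minimal Python sketch of greedy forward maximum matching. It is an illustration only, not mmseg's actual implementation: the function name max_match, the sample lexicon, and the choice to emit an unmatched character as a one-character word are all assumptions.

```python
def max_match(text, lexicon):
    """Greedy forward maximum matching (illustrative, not mmseg's code)."""
    # Longest lexicon entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Assumption: a character with no lexicon match becomes its own word.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


sample_lexicon = {"研究", "研究生", "生命", "起源"}  # hypothetical entries
print(" ".join(max_match("研究生命起源", sample_lexicon)))
# prints: 研究生 命 起源
```

The greedy choice takes 研究生 first, which leaves 命 stranded even though 研究 生命 起源 is the more natural reading. Overlapping candidates of this kind are the ambiguity case described under the -a option.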

OPTIONS
-d dict_file
    Use dict_file as the lexicon. A default lexicon can be found at
    /usr/share/sunpinyin-slm/dict.utf8.

-f, --format (text|bin)
    Output format, either 'text' or 'bin'. The default is 'bin'.
    Normally, in text mode the text of each word is written, while in
    binary mode the word ids are written to standard output as binary
    short integers.

-s, --stok STOK_ID
    Sentence token id. The default is 10. In binary mode, it is written
    to the output after every sentence.

-i, --show-id
    Show id information. In text mode, attach the id after each known
    word. In binary mode, print the id(s) as text.

-a, --ambiguious-id AMBI-ID
    Ambiguous means that a sequence ABC can be segmented either as A BC
    or as AB C. If AMBI-ID is specified and nonzero, the sequence ABC is
    not segmented: in binary mode, AMBI-ID is written out; in text mode,
    "<ambi>ABC</ambi>" is output. The default is 0.
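
One way to picture the ABC case is as a crossing overlap: the greedy match at a position (AB) is overlapped by another lexicon word (BC) that starts inside it and ends after it. The Python sketch below detects that situation. How mmseg itself decides that a span is ambiguous is not stated here, so the detection rule, the function name ambiguous_end, and the sample lexicon are assumptions made for illustration.

```python
def ambiguous_end(text, start, lexicon):
    """Return the end index of a crossing-ambiguous span, or None.

    Assumed rule (not taken from mmseg): a span is ambiguous when a lexicon
    word starting inside the greedy match extends past the end of that match.
    """
    max_len = max((len(word) for word in lexicon), default=1)

    def longest(i):
        # Length of the longest lexicon word starting at i (1 if none matches).
        for length in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + length] in lexicon:
                return length
        return 1

    end = start + longest(start)
    ambiguous = False
    i = start + 1
    while i < end:
        reach = i + longest(i)
        if reach > end:
            # An overlapping word crosses the boundary, so widen the span.
            end = reach
            ambiguous = True
        i += 1
    return end if ambiguous else None


text = "ABC"
end = ambiguous_end(text, 0, {"AB", "BC"})  # hypothetical lexicon
print(f"<ambi>{text[:end]}</ambi>" if end else text)
# prints: <ambi>ABC</ambi>
```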

NOTE
In binary mode, consecutive ids of 0 are merged into a single 0. In
text mode, no spaces are inserted between unknown words.
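
The merging rule for binary mode can be sketched in a few lines of Python. This is an illustration of the rule as stated, not mmseg's code; the function name merge_unknown_ids and the sample ids 57 and 102 are hypothetical, while 10 is the default sentence token id from the -s option.

```python
from itertools import groupby


def merge_unknown_ids(ids):
    """Collapse each run of consecutive 0 ids into a single 0."""
    merged = []
    for value, run in groupby(ids):
        if value == 0:
            merged.append(0)    # one 0 stands in for the whole run
        else:
            merged.extend(run)  # every other id is kept unchanged
    return merged


print(merge_unknown_ids([57, 0, 0, 0, 102, 10, 0, 0, 10]))
# prints: [57, 0, 102, 10, 0, 10]
```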

AUTHOR
Originally written by Phill.Zhang <phill.zhang@sun.com>. Currently
maintained by Kov.Chai <tchaikov@gmail.com>.

SEE ALSO
slmseg(1), ids2ngram(1).

perl v5.36.0 2022-07-23 MMSEG(1)