SLMBUILD(1) User Contributed Perl Documentation SLMBUILD(1)

NAME
slmbuild - generate language model from idngram file

SYNOPSIS
slmbuild [option]... idngram_file...

DESCRIPTION
slmbuild generates a back-off smoothing language model from a given
idngram file. Generally, the idngram_file is created by ids2ngram.

OPTIONS
-n, --NMax N
    1 for unigram, 2 for bigram, 3 for trigram. Any number outside the
    range 1..3 is invalid.

-o, --out output-file
    Specify the output file name.

-l, --log
    Use -log(pr) in the model. By default, pr is used directly.

-w, --wordcount N
    Lexicon size, i.e. the number of different words.

-b, --brk id...
    Set the ids that should be treated as breakers.

-e, --e id...
    Set the ids that should not be put into the LM.

-c, --cut c...
    k-grams whose frequency is <= c[k] are dropped.

-d, --discount method, param...
    The k-th -d param specifies the discount method for k-grams.

    For k-grams, the possible values for method/param are:

    GT,R,dis   : GT (Good-Turing) discount for r <= R, where r is the
                 frequency of an n-gram. Linear discount for r > R,
                 i.e. r'=r*dis, with 0 << dis < 1.0, for example 0.999.
    ABS,[dis]  : Absolute discount, r'=r-dis. dis is optional;
                 0 << dis < cut[k]+1.0, normally dis < 1.0.
    LIN,[dis]  : Linear discount, r'=r*dis. dis is optional;
                 0 < dis < 1.0.
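
The three discount rules above can be sketched in a few lines of Python.
This is only an illustration of the formulas as stated, not slmbuild's
actual code. The function names, the toy frequencies, and the fallback to
the raw count are assumptions, and the Good-Turing step uses the textbook
estimate r* = (r+1)*n[r+1]/n[r], which may differ from what slmbuild
computes. The automatic choice of dis (when it is omitted) is not modelled.

```python
from collections import Counter


def good_turing(r, freq_of_freq):
    """Textbook Good-Turing adjusted count: r* = (r + 1) * n[r+1] / n[r]."""
    n_r = freq_of_freq.get(r, 0)
    n_next = freq_of_freq.get(r + 1, 0)
    if n_r == 0 or n_next == 0:
        return float(r)  # assumption: fall back to the raw count
    return (r + 1) * n_next / n_r


def discount(r, method, params, freq_of_freq=None):
    """Return the discounted frequency r' for a raw frequency r."""
    if method == "GT":
        big_r, dis = params
        if r <= big_r:
            return good_turing(r, freq_of_freq or {})
        return r * dis  # linear discount above R
    if method == "ABS":
        (dis,) = params
        return r - dis
    if method == "LIN":
        (dis,) = params
        return r * dis
    raise ValueError(f"unknown discount method: {method}")


# Toy frequencies of six distinct n-grams (made-up data).
freqs = [1, 1, 1, 2, 2, 3]
freq_of_freq = Counter(freqs)  # n[1]=3, n[2]=2, n[3]=1

for r in (1, 2, 10):
    print(r, discount(r, "GT", (8, 0.9995), freq_of_freq))
```

With GT,8,0.9995 only frequencies up to 8 go through the Good-Turing
estimate; larger ones are simply scaled by dis.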

NOTES
-n must be given before -c and -b. -c must give the right number of
cut-offs, and -d must appear exactly N times, specifying the discounts
for 1-gram, 2-gram, ..., respectively.
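
A small pre-flight check can catch violations of these ordering and
counting rules before slmbuild is run. The sketch below is hypothetical:
check_slmbuild_args is not part of the tool, it assumes each option value
is passed as a separate argument, and it assumes "the right number of
cut-offs" means N comma-separated values, as the example's -c 0,3,2
suggests for N=3.

```python
def check_slmbuild_args(argv):
    """Return a list of violated constraints (empty if none were found)."""
    problems = []
    n = None
    cuts = None
    d_count = 0
    i = 0
    while i < len(argv):
        arg = argv[i]
        has_value = i + 1 < len(argv)
        if arg in ("-n", "--NMax") and has_value:
            n = int(argv[i + 1])
            i += 1
        elif arg in ("-c", "--cut") and has_value:
            if n is None:
                problems.append("-c given before -n")
            cuts = argv[i + 1].split(",")
            i += 1
        elif arg in ("-b", "--brk") and has_value:
            if n is None:
                problems.append("-b given before -n")
            i += 1
        elif arg in ("-d", "--discount") and has_value:
            d_count += 1
            i += 1
        i += 1

    if n is None:
        problems.append("-n is missing")
        return problems
    if not 1 <= n <= 3:
        problems.append("N must be in the range 1..3")
    if cuts is not None and len(cuts) != n:
        problems.append(f"-c gives {len(cuts)} cut-offs, expected {n}")
    if d_count != n:
        problems.append(f"-d appears {d_count} times, expected {n}")
    return problems


good = ("-l -n 3 -o all.slm -w 200000 -c 0,3,2 -d GT,8,0.9995 "
        "-d ABS -d ABS -b 10,11,12 -e 9 all.id3gram").split()
bad = "-c 0,3 -n 3 -d ABS all.id3gram".split()
print(check_slmbuild_args(good))
print(check_slmbuild_args(bad))
```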

BREAKER-IDs could be SentenceTokens or ParagraphTokens. Conceptually,
these ids have no meaning when they appear in the middle of an n-gram.

EXCLUDE-IDs could be ambiguous ids. Conceptually, n-grams that
contain those ids are meaningless.

We cannot erase n-grams according to BREAKER-IDs and EXCLUDE-IDs
directly from the IDNGRAM file, because some low-level information in
it is still useful.
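
The two notions can be illustrated by classifying individual n-grams. This
is a conceptual sketch only. The function name and labels are made up, and
"in the middle" is read here as any position other than the first and the
last, which is an assumption. It does not show how slmbuild itself treats
such n-grams.

```python
def classify_ngram(ngram, breaker_ids, exclude_ids):
    """Label an n-gram (a tuple of word ids) as 'excluded', 'broken' or 'ok'."""
    if any(word_id in exclude_ids for word_id in ngram):
        return "excluded"  # contains an EXCLUDE-ID anywhere
    if any(word_id in breaker_ids for word_id in ngram[1:-1]):
        return "broken"    # a BREAKER-ID sits strictly inside the n-gram
    return "ok"


breakers = {10, 11, 12}
excludes = {9}
for ngram in [(5, 9, 7), (5, 10, 7), (10, 5, 7), (5, 7)]:
    print(ngram, classify_ngram(ngram, breakers, excludes))
```

Labelling rather than deleting such n-grams leaves their counts in the
input, in line with the remark that the IDNGRAM file still carries useful
low-level information.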

EXAMPLE
The following example reads 'all.id3gram' and writes the trigram model
'all.slm'.

At the 1-gram level, it uses Good-Turing discount with cut-off 0, R=8,
dis=0.9995. At the 2-gram level, it uses Absolute discount with cut-off
3, dis auto-calculated. At the 3-gram level, it uses Absolute discount
with cut-off 2, dis auto-calculated. Word ids 10, 11 and 12 are breakers
(sentence/paragraph/paper breakers, etc.). The exclude-ID is 9. The
lexicon contains 200000 words. The resulting language model uses
-log(pr).

    slmbuild -l -n 3 -o all.slm -w 200000 -c 0,3,2 \
        -d GT,8,0.9995 -d ABS -d ABS -b 10,11,12 -e 9 all.id3gram
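
When slmbuild is driven from a script, the same command line can be
assembled programmatically. The helper below is a hypothetical sketch:
build_slmbuild_command and its parameter names are not part of the tool.
It uses only the options documented above and emits -n before -c and -b.

```python
import shlex


def build_slmbuild_command(idngram, out, n, wordcount, cuts, discounts,
                           breakers=(), excludes=(), use_log=False):
    """Assemble an slmbuild argument list from per-level settings."""
    if len(cuts) != n or len(discounts) != n:
        raise ValueError("need one cut-off and one discount per n-gram level")
    cmd = ["slmbuild"]
    if use_log:
        cmd.append("-l")
    cmd += ["-n", str(n), "-o", out, "-w", str(wordcount)]
    cmd += ["-c", ",".join(str(c) for c in cuts)]
    for spec in discounts:  # one -d per level, 1-gram first
        cmd += ["-d", spec]
    if breakers:
        cmd += ["-b", ",".join(str(b) for b in breakers)]
    if excludes:
        cmd += ["-e", ",".join(str(e) for e in excludes)]
    cmd.append(idngram)
    return cmd


cmd = build_slmbuild_command(
    "all.id3gram", "all.slm", n=3, wordcount=200000,
    cuts=[0, 3, 2], discounts=["GT,8,0.9995", "ABS", "ABS"],
    breakers=[10, 11, 12], excludes=[9], use_log=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True),
assuming slmbuild is installed and on the PATH.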

AUTHOR
Originally written by Phill.Zhang <phill.zhang@sun.com>. Currently
maintained by Kov.Chai <tchaikov@gmail.com>.

SEE ALSO
ids2ngram(1), slmprune(1).

perl v5.28.1 2016-03-01 SLMBUILD(1)