POCKETSPHINX(1)            General Commands Manual            POCKETSPHINX(1)

NAME
       pocketsphinx - Run speech recognition on audio data

SYNOPSIS
       pocketsphinx [ options... ] [ live | single | align | help | config |
       soxflags ] INPUTS...

DESCRIPTION
       The ‘pocketsphinx’ command-line program reads single-channel 16-bit
       PCM audio from one or more input files (or ‘-’ to read from standard
       input), and attempts to recognize speech in it using the default
       acoustic and language model. The input files can be raw audio, WAV,
       or NIST Sphere files, though some of these may not be recognized
       properly. It accepts a large number of options which you probably
       don't care about, and a command which defaults to ‘live’. The
       commands are as follows:

       help   Print usage and help text, with a long list of those options
              you don't care about.

       config Dump configuration as JSON to standard output (can be loaded
              with the ‘-config’ option).

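              For example, the configuration can be saved and reused to run
              the decoder with the same settings (a sketch; the file name
              here is arbitrary):

                     pocketsphinx config > config.json
                     pocketsphinx -config config.json audio.wav
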
       live   Detect speech segments in input files, run recognition on
              them (using those options you don't care about), and write
              the results to standard output in line-delimited JSON. I
              realize this isn't the prettiest format, but it sure beats
              XML. Each line contains a JSON object with these fields,
              which have short names to make the lines more readable:

              "b": Start time in seconds, from the beginning of the stream

              "d": Duration in seconds

              "p": Estimated probability of the recognition result, i.e. a
              number between 0 and 1 which may be used as a confidence
              score

              "t": Full text of the recognition result

              "w": List of segments (usually words), each of which in turn
              contains the ‘b’, ‘d’, ‘p’, and ‘t’ fields, for the start,
              duration, probability, and text of the word. In the future we
              may also support hierarchical results, in which case ‘w’
              could be present inside these segments as well.

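              For example, a line of output might look roughly like this
              (the values here are made up for illustration):

                     {"b":0.86,"d":1.44,"p":0.95,"t":"go forward ten meters","w":[{"b":0.86,"d":0.28,"p":0.97,"t":"go"},...]}

              and you can pull out just the text and confidence of each
              result with jq (described below under ‘align’):

                     pocketsphinx goforward.wav | jq '[.t,.p]'
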
       single Recognize the input as a single utterance, and write a JSON
              object in the same format described above.

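              For example, to recognize one file in a single pass:

                     pocketsphinx single goforward.wav
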
       align  Align a single input file (or ‘-’ for standard input) to a
              word sequence, and write a JSON object in the same format
              described above. The first positional argument is the input,
              and all subsequent ones are concatenated to make the text, to
              avoid surprises if you forget to quote it. You are
              responsible for normalizing the text to remove punctuation,
              uppercase, centipedes, etc. For example:

                     pocketsphinx align goforward.wav "go forward ten meters"

              By default, only word-level alignment is done. To get phone
              alignments, pass ‘-phone_align yes’ in the flags, e.g.:

                     pocketsphinx -phone_align yes align audio.wav $text

              The output is not particularly readable, but you can use jq
              (https://stedolan.github.io/jq/) to clean it up. For example,
              you can get just the word names and start times like this:

                     pocketsphinx align audio.wav $text | jq '.w[]|[.t,.b]'

              Or you could get the phone names and durations like this:

                     pocketsphinx -phone_align yes align audio.wav $text | jq '.w[]|.w[]|[.t,.d]'

              There are many, many other possibilities, of course.

       soxflags
              Return arguments to ‘sox’ which will create the appropriate
              input format. Note that because the ‘sox’ command line is
              slightly quirky these must always come after the filename or
              ‘-d’ (which tells ‘sox’ to read from the microphone). You can
              run live recognition like this:

                     sox -d $(pocketsphinx soxflags) | pocketsphinx -

              or decode from a file named "audio.mp3" like this:

                     sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx -

       By default only errors are printed to standard error, but if you
       want more information you can pass ‘-loglevel INFO’. Partial results
       are not printed; maybe they will be in the future, but don't hold
       your breath. Force-alignment is supported with the ‘align’ command
       described above.

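       For example, to get more detailed log output in a file while still
       writing JSON results to standard output (a sketch; the log file name
       is arbitrary):

              pocketsphinx -loglevel INFO -logfn decode.log goforward.wav
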
OPTIONS
       -agc   Automatic gain control for c0 ('max', 'emax', 'noise', or
              'none')

       -agcthresh
              Initial threshold for automatic gain control

       -allphone
              Perform phoneme decoding with phonetic lm (given here)

       -allphone_ci
              Perform phoneme decoding with phonetic lm and
              context-independent units only

       -alpha Preemphasis parameter

       -ascale
              Inverse of acoustic model scale for confidence score
              calculation

       -aw    Inverse weight applied to acoustic scores.

       -backtrace
              Print results and backtraces to log.

       -beam  Beam width applied to every frame in Viterbi search (smaller
              values mean wider beam)

       -bestpath
              Run bestpath (Dijkstra) search over word lattice (3rd pass)

       -bestpathlw
              Language model probability weight for bestpath search

       -ceplen
              Number of components in the input feature vector

       -cmn   Cepstral mean normalization scheme ('live', 'batch', or
              'none')

       -cmninit
              Initial values (comma-separated) for cepstral mean when
              'live' is used

       -compallsen
              Compute all senone scores in every frame (can be faster when
              there are many senones)

       -dict  Main pronunciation dictionary (lexicon) input file

       -dictcase
              Dictionary is case sensitive (NOTE: case insensitivity
              applies to ASCII characters only)

       -dither
              Add 1/2-bit noise

       -doublebw
              Use double bandwidth filters (same center freq)

       -ds    Frame GMM computation downsampling ratio

       -fdict Filler word pronunciation dictionary input file

       -feat  Feature stream type, depends on the acoustic model

       -featparams
              File containing feature extraction parameters.

       -fillprob
              Filler word transition probability

       -frate Frame rate

       -fsg   Sphinx format finite state grammar file

       -fsgusealtpron
              Add alternate pronunciations to FSG

       -fsgusefiller
              Insert filler words at each state.

       -fwdflat
              Run forward flat-lexicon search over word lattice (2nd pass)

       -fwdflatbeam
              Beam width applied to every frame in second-pass flat search

       -fwdflatefwid
              Minimum number of end frames for a word to be searched in
              fwdflat search

       -fwdflatlw
              Language model probability weight for flat lexicon (2nd pass)
              decoding

       -fwdflatsfwin
              Window of frames in lattice to search for successor words in
              fwdflat search

       -fwdflatwbeam
              Beam width applied to word exits in second-pass flat search

       -fwdtree
              Run forward lexicon-tree search (1st pass)

       -hmm   Directory containing acoustic model files.

       -input_endian
              Endianness of input data, big or little, ignored if NIST or
              MS Wav

       -jsgf  JSGF grammar file

       -keyphrase
              Keyphrase to spot (see the example at the end of this list)

       -kws   A file with keyphrases to spot, one per line

       -kws_delay
              Delay to wait for best detection score

       -kws_plp
              Phone loop probability for keyphrase spotting

       -kws_threshold
              Threshold for p(hyp)/p(alternatives) ratio

       -latsize
              Initial backpointer table size

       -lda   File containing transformation matrix to be applied to
              features (single-stream features only)

       -ldadim
              Dimensionality of output of feature transformation (0 to use
              entire matrix)

       -lifter
              Length of sin-curve for liftering, or 0 for no liftering.

       -lm    Word trigram language model input file

       -lmctl Specify a set of language models

       -lmname
              Which language model in -lmctl to use by default

       -logbase
              Base in which all log-likelihoods are calculated

       -logfn File to write log messages in

       -loglevel
              Minimum level of log messages (DEBUG, INFO, WARN, ERROR)

       -logspec
              Write out logspectral files instead of cepstra

       -lowerf
              Lower edge of filters

       -lpbeam
              Beam width applied to last phone in words

       -lponlybeam
              Beam width applied to last phone in single-phone words

       -lw    Language model probability weight

       -maxhmmpf
              Maximum number of active HMMs to maintain at each frame (or
              -1 for no pruning)

       -maxwpf
              Maximum number of distinct word exits at each frame (or -1
              for no pruning)

       -mdef  Model definition input file

       -mean  Mixture gaussian means input file

       -mfclogdir
              Directory to log feature files to

       -min_endfr
              Nodes ignored in lattice construction if they persist for
              fewer than N frames

       -mixw  Senone mixture weights input file (uncompressed)

       -mixwfloor
              Senone mixture weights floor (applied to data from -mixw
              file)

       -mllr  MLLR transformation to apply to means and variances

       -mmap  Use memory-mapped I/O (if possible) for model files

       -ncep  Number of cep coefficients

       -nfft  Size of FFT, or 0 to set automatically (recommended)

       -nfilt Number of filter banks

       -nwpen New word transition penalty

       -pbeam Beam width applied to phone transitions

       -pip   Phone insertion penalty

       -pl_beam
              Beam width applied to phone loop search for lookahead

       -pl_pbeam
              Beam width applied to phone loop transitions for lookahead

       -pl_pip
              Phone insertion penalty for phone loop

       -pl_weight
              Weight for phoneme lookahead penalties

       -pl_window
              Phoneme lookahead window size, in frames

       -rawlogdir
              Directory to log raw audio files to

       -remove_dc
              Remove DC offset from each frame

       -remove_noise
              Remove noise using spectral subtraction

       -round_filters
              Round mel filter frequencies to DFT points

       -samprate
              Sampling rate

       -seed  Seed for random number generator; if less than zero, pick
              our own

       -sendump
              Senone dump (compressed mixture weights) input file

       -senlogdir
              Directory to log senone score files to

       -senmgau
              Senone to codebook mapping input file (usually not needed)

       -silprob
              Silence word transition probability

       -smoothspec
              Write out cepstral-smoothed logspectral files

       -svspec
              Subvector specification (e.g., 24,0-11/25,12-23/26-38 or
              0-12/13-25/26-38)

       -tmat  HMM state transition matrix input file

       -tmatfloor
              HMM state transition probability floor (applied to -tmat
              file)

       -topn  Maximum number of top Gaussians to use in scoring.

       -topn_beam
              Beam width used to determine top-N Gaussians (or a list,
              per-feature)

       -toprule
              Start rule for JSGF (first public rule is default)

       -transform
              Which type of transform to use to calculate cepstra (legacy,
              dct, or htk)

       -unit_area
              Normalize mel filters to unit area

       -upperf
              Upper edge of filters

       -uw    Unigram weight

       -var   Mixture gaussian variances input file

       -varfloor
              Mixture gaussian variance floor (applied to data from -var
              file)

       -varnorm
              Variance normalize each utterance (only if CMN == current)

       -verbose
              Show input filenames

       -warp_params
              Parameters defining the warping function

       -warp_type
              Warping function type (or shape)

       -wbeam Beam width applied to word exits

       -wip   Word insertion penalty

       -wlen  Hamming window length

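       As an illustration of the keyphrase options above (a sketch only;
       the phrase and threshold are arbitrary and assume the default
       acoustic model), the following should report detections of a single
       keyphrase in live mode:

              pocketsphinx -keyphrase "go forward" -kws_threshold 1e-20 live goforward.wav
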
AUTHOR
       Written by numerous people at CMU from 1994 onwards. This manual
       page by David Huggins-Daines <dhdaines@gmail.com>.

COPYRIGHT
       Copyright © 1994-2016 Carnegie Mellon University. See the file
       LICENSE included with this package for more information.

SEE ALSO
       pocketsphinx_batch(1), sphinx_fe(1).

                                  2022-09-27                  POCKETSPHINX(1)