POCKETSPHINX(1)             General Commands Manual            POCKETSPHINX(1)



NAME

       pocketsphinx - Run speech recognition on audio data


SYNOPSIS

       pocketsphinx [ options... ] [ live | single | align | config | help |
       soxflags ] INPUTS...


DESCRIPTION

       The ‘pocketsphinx’ command-line program reads single-channel 16-bit
       PCM audio from one or more input files (or ‘-’ to read from standard
       input), and attempts to recognize speech in it using the default
       acoustic and language model.  The input files can be raw audio, WAV,
       or NIST Sphere files, though some of these may not be recognized
       properly.  It accepts a large number of options which you probably
       don't care about, and a command which defaults to ‘live’.  The
       commands are as follows:
       config Dump configuration as JSON to standard output (can be loaded
              with the ‘-config’ option).

       live   Detect speech segments in input files, run recognition on them
              (using those options you don't care about), and write the
              results to standard output in line-delimited JSON.  I realize
              this isn't the prettiest format, but it sure beats XML.  Each
              line contains a JSON object with these fields, which have
              short names to make the lines more readable:

              "b": Start time in seconds, from the beginning of the stream

              "d": Duration in seconds

              "p": Estimated probability of the recognition result, i.e. a
              number between 0 and 1 which may be used as a confidence score

              "t": Full text of the recognition result

              "w": List of segments (usually words), each of which in turn
              contains the ‘b’, ‘d’, ‘p’, and ‘t’ fields, for the start,
              duration, probability, and text of the word.  In the future we
              may also support hierarchical results, in which case these
              segments could contain a nested ‘w’ field of their own.
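
              The line-delimited output is easy to consume from another
              program.  As an illustration (not part of pocketsphinx
              itself; the script name is made up), this Python sketch
              prints the start time, confidence, and text of each result
              read from a pipe such as
              ‘pocketsphinx live input.wav | python3 summarize.py’:

```python
import json
import sys

# Each line of `pocketsphinx live` output is a JSON object using the
# short field names described above: "b" (start), "d" (duration),
# "p" (probability), "t" (text), and "w" (word segments).
def format_result(line):
    """Return a tab-separated summary: start time, confidence, text."""
    result = json.loads(line)
    return "%.2f\t%.2f\t%s" % (result["b"], result["p"], result["t"])

if __name__ == "__main__":
    for line in sys.stdin:
        if line.strip():
            print(format_result(line))
```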

       single Recognize the input as a single utterance, and write a JSON
              object in the same format described above.

       align  Align a single input file (or ‘-’ for standard input) to a
              word sequence, and write a JSON object in the same format
              described above.  The first positional argument is the input,
              and all subsequent ones are concatenated to make the text, to
              avoid surprises if you forget to quote it.  You are
              responsible for normalizing the text to remove punctuation,
              uppercase, centipedes, etc.  For example:

                  pocketsphinx align goforward.wav "go forward ten meters"

              By default, only word-level alignment is done.  To get phone
              alignments, pass ‘-phone_align yes’ in the flags, e.g.:

                  pocketsphinx -phone_align yes align audio.wav $text

              This will make not particularly readable output, but you can
              use jq (https://stedolan.github.io/jq/) to clean it up.  For
              example, you can get just the word names and start times like
              this:

                  pocketsphinx align audio.wav $text | jq '.w[]|[.t,.b]'

              Or you could get the phone names and durations like this:

                  pocketsphinx -phone_align yes align audio.wav $text | jq '.w[]|.w[]|[.t,.d]'

              There are many, many other possibilities, of course.
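
              If jq is not available, the same kind of post-processing can
              be done in a few lines of Python.  This sketch (an
              illustration, not part of pocketsphinx) extracts phone names
              and durations from one line of ‘-phone_align yes’ output,
              using the nested ‘w’ structure described above:

```python
import json

def phone_durations(json_line):
    """Return (phone, duration) pairs from one line of alignment output.

    With -phone_align yes, each word segment in "w" carries its own
    nested "w" list of phone segments, each with "t" (text) and
    "d" (duration) fields.
    """
    result = json.loads(json_line)
    return [(phone["t"], phone["d"])
            for word in result.get("w", [])
            for phone in word.get("w", [])]
```

              It could be applied to each line of output in the same way as
              the jq examples above.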

       help   Print a usage and help text with a list of possible arguments.

       soxflags
              Return arguments to ‘sox’ which will create the appropriate
              input format.  Note that because the ‘sox’ command-line is
              slightly quirky these must always come after the filename or
              ‘-d’ (which tells ‘sox’ to read from the microphone).  You
              can run live recognition like this:

                  sox -d $(pocketsphinx soxflags) | pocketsphinx -

              or decode from a file named "audio.mp3" like this:

                  sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx -

       By default only errors are printed to standard error, but if you want
       more information you can pass ‘-loglevel INFO’.  Partial results are
       not printed; maybe they will be in the future, but don't hold your
       breath.  Force alignment is supported by the ‘align’ command
       described above.


OPTIONS

       -agc   Automatic gain control for c0 ('max', 'emax', 'noise', or
              'none')

       -agcthresh
              Initial threshold for automatic gain control

       -allphone
              Perform phoneme decoding with phonetic lm (given here)

       -allphone_ci
              Perform phoneme decoding with phonetic lm and
              context-independent units only

       -alpha Preemphasis parameter

       -ascale
              Inverse of acoustic model scale for confidence score
              calculation

       -aw    Inverse weight applied to acoustic scores.

       -backtrace
              Print results and backtraces to log.

       -beam  Beam width applied to every frame in Viterbi search (smaller
              values mean wider beam)

       -bestpath
              Run bestpath (Dijkstra) search over word lattice (3rd pass)

       -bestpathlw
              Language model probability weight for bestpath search

       -ceplen
              Number of components in the input feature vector

       -cmn   Cepstral mean normalization scheme ('live', 'batch', or
              'none')

       -cmninit
              Initial values (comma-separated) for cepstral mean when 'live'
              is used

       -compallsen
              Compute all senone scores in every frame (can be faster when
              there are many senones)

       -dict  Main pronunciation dictionary (lexicon) input file

       -dictcase
              Dictionary is case sensitive (NOTE: case insensitivity applies
              to ASCII characters only)

       -dither
              Add 1/2-bit noise

       -doublebw
              Use double bandwidth filters (same center freq)

       -ds    Frame GMM computation downsampling ratio

       -fdict Noise word pronunciation dictionary input file

       -feat  Feature stream type, depends on the acoustic model

       -featparams
              File containing feature extraction parameters.

       -fillprob
              Filler word transition probability

       -frate Frame rate

       -fsg   Sphinx format finite state grammar file

       -fsgusealtpron
              Add alternate pronunciations to FSG

       -fsgusefiller
              Insert filler words at each state.

       -fwdflat
              Run forward flat-lexicon search over word lattice (2nd pass)

       -fwdflatbeam
              Beam width applied to every frame in second-pass flat search

       -fwdflatefwid
              Minimum number of end frames for a word to be searched in
              fwdflat search

       -fwdflatlw
              Language model probability weight for flat lexicon (2nd pass)
              decoding

       -fwdflatsfwin
              Window of frames in lattice to search for successor words in
              fwdflat search

       -fwdflatwbeam
              Beam width applied to word exits in second-pass flat search

       -fwdtree
              Run forward lexicon-tree search (1st pass)

       -hmm   Directory containing acoustic model files.

       -input_endian
              Endianness of input data, big or little, ignored if NIST or
              MS Wav

       -jsgf  JSGF grammar file

       -keyphrase
              Keyphrase to spot

       -kws   A file with keyphrases to spot, one per line

       -kws_delay
              Delay to wait for best detection score

       -kws_plp
              Phone loop probability for keyphrase spotting

       -kws_threshold
              Threshold for p(hyp)/p(alternatives) ratio

       -latsize
              Initial backpointer table size

       -lda   File containing transformation matrix to be applied to
              features (single-stream features only)

       -ldadim
              Dimensionality of output of feature transformation (0 to use
              entire matrix)

       -lifter
              Length of sin-curve for liftering, or 0 for no liftering.

       -lm    Word trigram language model input file

       -lmctl Specify a set of language models

       -lmname
              Which language model in -lmctl to use by default

       -logbase
              Base in which all log-likelihoods calculated

       -logfn File to write log messages in

       -loglevel
              Minimum level of log messages (DEBUG, INFO, WARN, ERROR)

       -logspec
              Write out logspectral files instead of cepstra

       -lowerf
              Lower edge of filters

       -lpbeam
              Beam width applied to last phone in words

       -lponlybeam
              Beam width applied to last phone in single-phone words

       -lw    Language model probability weight

       -maxhmmpf
              Maximum number of active HMMs to maintain at each frame (or
              -1 for no pruning)

       -maxwpf
              Maximum number of distinct word exits at each frame (or -1
              for no pruning)

       -mdef  Model definition input file

       -mean  Mixture gaussian means input file

       -mfclogdir
              Directory to log feature files to

       -min_endfr
              Nodes ignored in lattice construction if they persist for
              fewer than N frames

       -mixw  Senone mixture weights input file (uncompressed)

       -mixwfloor
              Senone mixture weights floor (applied to data from -mixw
              file)

       -mllr  MLLR transformation to apply to means and variances

       -mmap  Use memory-mapped I/O (if possible) for model files

       -ncep  Number of cep coefficients

       -nfft  Size of FFT, or 0 to set automatically (recommended)

       -nfilt Number of filter banks

       -nwpen New word transition penalty

       -pbeam Beam width applied to phone transitions

       -pip   Phone insertion penalty

       -pl_beam
              Beam width applied to phone loop search for lookahead

       -pl_pbeam
              Beam width applied to phone loop transitions for lookahead

       -pl_pip
              Phone insertion penalty for phone loop

       -pl_weight
              Weight for phoneme lookahead penalties

       -pl_window
              Phoneme lookahead window size, in frames

       -rawlogdir
              Directory to log raw audio files to

       -remove_dc
              Remove DC offset from each frame

       -remove_noise
              Remove noise using spectral subtraction

       -round_filters
              Round mel filter frequencies to DFT points

       -samprate
              Sampling rate

       -seed  Seed for random number generator; if less than zero, pick our
              own

       -sendump
              Senone dump (compressed mixture weights) input file

       -senlogdir
              Directory to log senone score files to

       -senmgau
              Senone to codebook mapping input file (usually not needed)

       -silprob
              Silence word transition probability

       -smoothspec
              Write out cepstral-smoothed logspectral files

       -svspec
              Subvector specification (e.g., 24,0-11/25,12-23/26-38 or
              0-12/13-25/26-38)

       -tmat  HMM state transition matrix input file

       -tmatfloor
              HMM state transition probability floor (applied to -tmat
              file)

       -topn  Maximum number of top Gaussians to use in scoring.

       -topn_beam
              Beam width used to determine top-N Gaussians (or a list,
              per-feature)

       -toprule
              Start rule for JSGF (first public rule is default)

       -transform
              Which type of transform to use to calculate cepstra (legacy,
              dct, or htk)

       -unit_area
              Normalize mel filters to unit area

       -upperf
              Upper edge of filters

       -uw    Unigram weight

       -var   Mixture gaussian variances input file

       -varfloor
              Mixture gaussian variance floor (applied to data from -var
              file)

       -varnorm
              Variance normalize each utterance (only if CMN == current)

       -verbose
              Show input filenames

       -warp_params
              Parameters defining the warping function

       -warp_type
              Warping function type (or shape)

       -wbeam Beam width applied to word exits

       -wip   Word insertion penalty

       -wlen  Hamming window length


AUTHOR

       Written by numerous people at CMU from 1994 onwards.  This manual
       page by David Huggins-Daines <dhdaines@gmail.com>.

       Copyright © 1994-2016 Carnegie Mellon University.  See the file
       LICENSE included with this package for more information.


SEE ALSO

       pocketsphinx_batch(1), sphinx_fe(1).



                                  2022-09-27                   POCKETSPHINX(1)