Text::Ngram(3)        User Contributed Perl Documentation       Text::Ngram(3)

NAME

       Text::Ngram - Ngram analysis of text

SYNOPSIS

         use Text::Ngram qw(ngram_counts add_to_counts);
         my $text   = "abcdefghijklmnop";
         my $hash_r = ngram_counts($text, 3); # Window size = 3
         # $hash_r => { abc => 1, bcd => 1, ... }

         add_to_counts($more_text, 3, $hash_r);

DESCRIPTION

       N-gram analysis is a technique in text analysis that slides a
       fixed-size window over a text's characters in order to aid topic
       analysis, language identification and so on. The n-gram spectrum of a
       document can be used to compare and filter documents in multiple
       languages, prepare word prediction networks, and perform spelling
       correction.

       The neat thing about n-grams, though, is that they're really easy to
       determine. For n=3, for instance, we compute the n-gram counts like so:

           the cat sat on the mat
           ---                     $counts{"the"}++;
            ---                    $counts{"he "}++;
             ---                   $counts{"e c"}++;
              ...

       This module provides an efficient XS-based implementation of n-gram
       spectrum analysis.

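       For illustration only, the sliding-window count shown above can be
       sketched in pure Perl. The helper name naive_ngram_counts is
       hypothetical and is not part of this module; the sketch also skips
       the lowercasing and punctuation handling described below, so it is
       not a substitute for the XS implementation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Ngram): count character n-grams
# by sliding a window of $n characters across $text, one position at a time.
sub naive_ngram_counts {
    my ($text, $n) = @_;
    my %counts;
    # The last window starts at length($text) - $n; if the text is shorter
    # than the window, the range is empty and no n-grams are counted.
    for my $start (0 .. length($text) - $n) {
        $counts{ substr($text, $start, $n) }++;
    }
    return \%counts;
}

my $counts = naive_ngram_counts("the cat sat on the mat", 3);
printf "%-5s %d\n", "'$_'", $counts->{$_} for sort keys %$counts;
```

       A text of length L yields L - n + 1 windows, so the counts in the
       histogram always sum to that number.
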
       There are two functions which can be imported:

   ngram_counts
       This first function returns a hash reference with the n-gram histogram
       of the text for the given window size. The default window size is 5.

           $href = ngram_counts(\%config, $text, $window_size);

       As of version 0.14, the %config may instead be passed in as named
       arguments:

           $href = ngram_counts($text, $window_size, %config);

       The only required parameter is $text.

       The possible values for %config are:

       flankbreaks

       If set to 1 (default), breaks are flanked by spaces; if set to 0,
       they're not. Breaks are punctuation and other non-alphabetic
       characters, which, unless you use "punctuation => 1" in your
       configuration, do not make it into the returned hash.

       Here's an example, supposing you're using the default value for
       punctuation (0):

         my $text = "Hello, world";
         my $hash = ngram_counts($text, 5);

       That produces the following ngrams (lowercased, since lowercase
       defaults to 1):

         {
           'hello' => 1,
           'ello ' => 1,
           ' worl' => 1,
           'world' => 1,
         }

       On the other hand, this:

         my $text = "Hello, world";
         my $hash = ngram_counts({flankbreaks => 0}, $text, 5);

       produces the following ngrams:

         {
           'hello' => 1,
           ' worl' => 1,
           'world' => 1,
         }

       lowercase

       If set to 0, casing is preserved. If set to 1, all letters are
       lowercased before counting ngrams. Default is 1.

           # Get all ngrams of size 4 preserving case
           $href_p = ngram_counts( {lowercase => 0}, $text, 4 );

       punctuation

       If set to 0 (default), punctuation is removed before calculating the
       ngrams.  Set to 1 to preserve it.

           # Get all ngrams of size 2 preserving punctuation
           $href_p = ngram_counts( {punctuation => 1}, $text, 2 );

       spaces

       If set to 0 (default is 1), no ngrams containing spaces will be
       returned.

           # Get all ngrams of size 3 that do not contain spaces
           $href = ngram_counts( {spaces => 0}, $text, 3);

       If you're going to request both types of ngrams, then the best way to
       avoid calculating the same thing twice is probably this:

           $href_with_spaces = ngram_counts($text, $window);
           # Copy the hash itself; copying only the reference would alias it
           $href_no_spaces = { %$href_with_spaces };
           for (keys %$href_no_spaces) { delete $href_no_spaces->{$_} if / / }

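       When deriving the space-free histogram this way, the hash itself has
       to be copied: assigning one hash reference to another only creates a
       second name for the same hash, so deletions would also empty the
       original. A minimal sketch, using a made-up literal hash as a
       stand-in for the result of ngram_counts:

```perl
use strict;
use warnings;

# Literal stand-in for a histogram returned by ngram_counts;
# the keys and counts here are made up for illustration.
my $href_with_spaces = { 'the' => 2, 'he ' => 2, 'e c' => 1, 'cat' => 1 };

# Shallow-copy the hash so that deletions do not touch the original.
my $href_no_spaces = { %$href_with_spaces };

# Delete, in one hash-slice operation, every key that contains a space.
delete @{$href_no_spaces}{ grep { / / } keys %$href_no_spaces };

print join(", ", sort keys %$href_no_spaces), "\n";   # cat, the
print scalar(keys %$href_with_spaces), "\n";          # 4
```
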
   add_to_counts
       This incrementally adds to the supplied hash; if $window is zero or
       undefined, then the window size is computed from the hash keys.

           add_to_counts($more_text, $window, $href);
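
       How the window size is computed from the hash keys is not spelled
       out here. One plausible reading, sketched below as an assumption, is
       that every key in the histogram has the same length, so the length
       of any one key gives the window size. The helper infer_window is
       hypothetical and not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Ngram): infer the window size
# from an existing histogram, assuming every key has the same length.
sub infer_window {
    my ($href) = @_;
    my ($any_key) = keys %$href;   # any key will do under that assumption
    # An empty histogram gives no information about the window size.
    return defined $any_key ? length $any_key : undef;
}

my $href = { abc => 1, bcd => 1, cde => 1 };
print infer_window($href), "\n";   # 3
```

       With an empty hash there is nothing to infer from, so in that case
       passing an explicit $window is the safer choice.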

TO DO

       ·     Look further into the tests. Sort them and add more.

SEE ALSO

       Cavnar, W. B. (1993). N-gram-based text filtering for TREC-2. In D.
       Harman (Ed.), Proceedings of TREC-2: Text Retrieval Conference 2.
       Washington, DC: National Bureau of Standards.

       Shannon, C. E. (1951). Prediction and entropy of printed English.  The
       Bell System Technical Journal, 30. 50-64.

       Ullmann, J. R. (1977). Binary n-gram technique for automatic correction
       of substitution, deletion, insert and reversal errors in words.
       Computer Journal, 20. 141-147.

AUTHOR

       Maintained by Alberto Simoes, "ambs@cpan.org".

       Previously maintained by Jose Castro, "cog@cpan.org".  Originally
       created by Simon Cozens, "simon@cpan.org".

       Copyright 2006 by Alberto Simoes

       Copyright 2004 by Jose Castro

       Copyright 2003 by Simon Cozens

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.28.1                      2014-07-17                    Text::Ngram(3)