Text::Ngram(3pm)

Text::Ngram(3)        User Contributed Perl Documentation       Text::Ngram(3)

NAME

       Text::Ngram - Ngram analysis of text

SYNOPSIS

         use Text::Ngram qw(ngram_counts add_to_counts);
         my $text   = "abcdefghijklmnop";
         my $hash_r = ngram_counts($text, 3); # Window size = 3
         # $hash_r => { abc => 1, bcd => 1, ... }

         add_to_counts($more_text, 3, $hash_r);

DESCRIPTION

       N-gram analysis is a text-analysis technique that slides a fixed-size
       window over a character sequence to aid topic analysis, language
       identification and similar tasks. The n-gram spectrum of a document can
       be used to compare and filter documents in multiple languages, prepare
       word prediction networks, and perform spelling correction.

       The neat thing about n-grams, though, is that they're really easy to
       determine. For n=3, for instance, we compute the n-gram counts like so:

           the cat sat on the mat
           ---                     $counts{"the"}++;
            ---                    $counts{"he "}++;
             ---                   $counts{"e c"}++;
              ...

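       The counting pictured above can be sketched in a few lines of pure
       Perl. This is only an illustration of the sliding window: char_ngrams
       is a hypothetical helper, not part of Text::Ngram, and it skips the
       text cleaning (lowercasing, punctuation handling) that the module
       performs.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; not part of Text::Ngram.
# Counts every substring of length $n, moving one character at a time.
sub char_ngrams {
    my ($text, $n) = @_;
    my %counts;
    for my $i (0 .. length($text) - $n) {
        $counts{ substr($text, $i, $n) }++;
    }
    return \%counts;
}

my $counts = char_ngrams("the cat sat on the mat", 3);

# "the" and "he " each occur twice in this sentence.
printf "%-6s %d\n", "'$_'", $counts->{$_} for sort keys %$counts;
```

       A string of length L yields L - n + 1 windows, so the 22-character
       sentence above produces 20 trigram occurrences in total.
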
       This module provides an efficient XS-based implementation of n-gram
       spectrum analysis.

       There are two functions which can be imported:

   ngram_counts
       This function returns a hash reference with the n-gram histogram
       of the text for the given window size. The default window size is 5.

           $href = ngram_counts(\%config, $text, $window_size);

       As of version 0.14, the %config may instead be passed in as named
       arguments:

           $href = ngram_counts($text, $window_size, %config);

       The only required parameter is $text.

       The possible values for %config are:

       flankbreaks

       If set to 1 (default), breaks are flanked by spaces; if set to 0,
       they're not. Breaks are punctuation and other non-alphabetic
       characters, which, unless you use "punctuation => 1" in your
       configuration, do not make it into the returned hash.

       Here's an example, supposing you're using the default value for
       punctuation (0):

         my $text = "Hello, world";
         my $hash = ngram_counts($text, 5);

       That produces the following ngrams:

         {
           'hello' => 1,
           'ello ' => 1,
           ' worl' => 1,
           'world' => 1,
         }

       On the other hand, this:

         my $text = "Hello, world";
         my $hash = ngram_counts({flankbreaks => 0}, $text, 5);

       produces the following ngrams:

         {
           'hello' => 1,
           ' worl' => 1,
           'world' => 1,
         }

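       One way to picture the effect of flankbreaks is the pure-Perl model
       below. It is a sketch under stated assumptions, not the module's XS
       code: ngrams_with_breaks is a hypothetical helper, the NUL sentinel is
       an arbitrary choice, and case folding is left out (the input is
       already lowercase).

```perl
use strict;
use warnings;

# Hypothetical model of the documented behaviour; the real XS code may differ.
sub ngrams_with_breaks {
    my ($text, $n, $flank) = @_;

    # Replace each run of non-alphabetic, non-space characters with a
    # sentinel, optionally flanked by spaces.
    my $break = $flank ? " \0 " : "\0";
    (my $buf = $text) =~ s/[^[:alpha:] ]+/$break/g;
    $buf =~ tr/ //s;                          # squeeze runs of spaces

    my %counts;
    for my $i (0 .. length($buf) - $n) {
        my $gram = substr($buf, $i, $n);
        next if index($gram, "\0") >= 0;      # a window never spans a break
        $counts{$gram}++;
    }
    return \%counts;
}

my $flanked   = ngrams_with_breaks("hello, world", 5, 1);
my $unflanked = ngrams_with_breaks("hello, world", 5, 0);

print scalar(keys %$flanked),   "\n";   # 4: 'hello', 'ello ', ' worl', 'world'
print scalar(keys %$unflanked), "\n";   # 3: 'hello', ' worl', 'world'
```

       In this model the flanking space is what lets the window 'ello ' form
       before the break; without it, every window after 'hello' touches the
       sentinel until ' worl'.
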
       lowercase

       If set to 0, casing is preserved. If set to 1, all letters are
       lowercased before counting ngrams. Default is 1.

           # Get all ngrams of size 4 preserving case
           $href_p = ngram_counts( {lowercase => 0}, $text, 4 );

       punctuation

       If set to 0 (default), punctuation is removed before calculating the
       ngrams.  Set to 1 to preserve it.

           # Get all ngrams of size 2 preserving punctuation
           $href_p = ngram_counts( {punctuation => 1}, $text, 2 );

       spaces

       If set to 0 (default is 1), no ngrams containing spaces will be
       returned.

          # Get all ngrams of size 3 that do not contain spaces
          $href = ngram_counts( {spaces => 0}, $text, 3);

       If you're going to request both types of ngrams, then the best way to
       avoid calculating the same thing twice is probably this:

           $href_with_spaces = ngram_counts($text, $window);   # $window is optional
           $href_no_spaces   = { %$href_with_spaces };         # copy, not an alias
           for (keys %$href_no_spaces) { delete $href_no_spaces->{$_} if / / }

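       The copy matters because assigning one hash reference to another
       only creates a second name for the same hash. The short sketch below
       uses a hand-written stand-in histogram (an assumption, so that it runs
       without the module) to show the filtered copy leaving the original
       untouched.

```perl
use strict;
use warnings;

# Stand-in histogram; in real use this would come from ngram_counts().
my $with_spaces = { 'the' => 2, 'he ' => 2, 'e c' => 1, 'cat' => 1 };

# Shallow-copy the hash, then drop keys containing a space from the copy only.
my $no_spaces = { %$with_spaces };
delete @{$no_spaces}{ grep { / / } keys %$no_spaces };

print join(", ", sort keys %$no_spaces), "\n";   # cat, the
print scalar(keys %$with_spaces), "\n";          # 4
```
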
   add_to_counts
       This incrementally adds the n-gram counts of $more_text to the
       supplied hash; if $window is zero or undefined, then the window size
       is computed from the hash keys.

           add_to_counts($more_text, $window, $href);

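       The incremental behaviour, including inferring the window from
       existing keys, might be modelled as follows. add_counts is a
       hypothetical pure-Perl stand-in, not the module's function; it omits
       text cleaning, and taking the length of an arbitrary existing key is
       an assumption about how the window size could be recovered.

```perl
use strict;
use warnings;

# Hypothetical stand-in for add_to_counts; text cleaning is omitted.
sub add_counts {
    my ($text, $window, $counts) = @_;

    if (!$window) {
        # Assumes every existing key has the same length.
        my ($any_key) = keys %$counts;
        die "cannot infer window size from an empty hash\n"
            unless defined $any_key;
        $window = length $any_key;
    }

    $counts->{ substr($text, $_, $window) }++
        for 0 .. length($text) - $window;
    return $counts;
}

my $counts = {};
add_counts("abcd", 3,     $counts);   # adds abc, bcd
add_counts("bcde", undef, $counts);   # window inferred as 3: adds bcd, cde

print "$_ => $counts->{$_}\n" for sort keys %$counts;
```
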

TO DO

       •     Look further into the tests. Sort them and add more.

AUTHOR

       Maintained by Alberto Simoes, "ambs@cpan.org".

       Previously maintained by Jose Castro, "cog@cpan.org".  Originally
       created by Simon Cozens, "simon@cpan.org".

COPYRIGHT AND LICENSE

       Copyright 2006 by Alberto Simoes

       Copyright 2004 by Jose Castro

       Copyright 2003 by Simon Cozens

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.38.0                      2023-07-21                    Text::Ngram(3)
