Lingua::EN::Sentence(3)  User Contributed Perl Documentation  Lingua::EN::Sentence(3)

Lingua::EN::Sentence - split text into sentences

use Lingua::EN::Sentence qw( get_sentences add_acronyms );

add_acronyms('lt','gen');    ## adding support for 'Lt. Gen.'
my $text = q{
A sentence usually ends with a dot, exclamation or question mark optionally followed by a space!
A string followed by 2 carriage returns denotes a sentence, even though it doesn't end in a dot

Dots after single letters such as U.S.A. or in numbers like -12.34 will not cause a split
as well as common abbreviations such as Dr. I. Smith, Ms. A.B. Jones, Apr. Calif. Esq.
and (some text) ellipsis such as ... or . . are ignored.
Some valid cases cannot be detected, such as the answer is X. It cannot easily be
differentiated from the single letter-dot sequence to abbreviate a person's given name.
Numbered points within a sentence will not cause a split 1. Like this one.
See the code for all the rules that apply.
This string has 7 sentences.
};

my $sentences = get_sentences($text);    # Get the sentences.
my $i = 0;                               # sentence counter
foreach my $sent (@$sentences)
{
    $i++;
    print("SENTENCE $i:$sent\n");
}

The "Lingua::EN::Sentence" module contains the function get_sentences,
which splits text into its constituent sentences, based on a regular
expression and a list of abbreviations (built in and given).

Certain well-known exceptions, such as abbreviations, may cause
incorrect segmentation. Many of them are already built into this code
and are handled. Still, if you find words that cause the get_sentences
function to fail, you can add them to the module so that it recognises
them. Note that abbreviations are case sensitive, so 'Mrs.' is
recognised but not 'mrs.'.

The first step is to mark the dot ending an abbreviation by changing
it to a special character, so that it won't cause a sentence split. The
original dot is restored after the sentences are split.

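The abbreviation-masking step can be sketched as follows. This is a
minimal, self-contained illustration and not the module's actual code:
the placeholder character, the short abbreviation list and the function
name split_protecting_abbreviations are all assumptions made for the
example.

```perl
use strict;
use warnings;

# Hypothetical placeholder and abbreviation list, for illustration only.
my $PLACEHOLDER   = "\x{1F}";
my @abbreviations = qw(Dr Mr Mrs Ms Prof);

sub split_protecting_abbreviations {
    my ($text) = @_;
    my $alternatives = join '|', map { quotemeta($_) } @abbreviations;

    # 1. Hide the dot that ends a known abbreviation.
    $text =~ s/\b($alternatives)\./$1$PLACEHOLDER/g;

    # 2. Split after sentence-ending punctuation followed by white space.
    my @sentences = split /(?<=[.!?])\s+/, $text;

    # 3. Restore the original dots.
    s/\Q$PLACEHOLDER\E/./g for @sentences;

    return \@sentences;
}

my $demo = split_protecting_abbreviations('Dr. Smith arrived. Mrs. Jones left!');
print "$_\n" for @$demo;
```

Because the dot after "Dr" and "Mrs" is hidden before the split, only
the dot after "arrived" is treated as a sentence boundary.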
Basically, I use a 'brute' regular expression to split the text into
sentences. (Well, nothing is split yet - I just mark the
end-of-sentence.) Then I apply a set of rules which decide when an
end-of-sentence mark is justified and when it's a mistake. In case of a
mistake, the end-of-sentence mark is removed. What are such mistakes?

Letter-dot sequences: U.S.A., i.e., e.g.

Dot sequences: '..' or '...' or 'text . . more text'

Two carriage returns denote the end of a sentence even if it doesn't
end with a dot.

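The mark-then-correct approach can be sketched in a few lines. This is
a simplified illustration under stated assumptions, not the module's
real rule set: the marker value, the three regular expressions and the
function name mark_and_split are hypothetical.

```perl
use strict;
use warnings;

my $EOS = "\001";    # hypothetical end-of-sentence marker for this sketch

sub mark_and_split {
    my ($text) = @_;

    # Brute pass: mark every ., ! or ? that is followed by white space.
    $text =~ s/([.!?])\s+/$1$EOS/g;

    # Correction rule: a dot that ends a letter-dot sequence (U.S.A.,
    # i.e., e.g.) is not a sentence end, so remove the mark again.
    $text =~ s/((?:\b[A-Za-z]\.)+)\Q$EOS\E/$1 /g;

    # Two newlines end a sentence even without closing punctuation.
    $text =~ s/\n\s*\n/$EOS/g;

    # Split on the surviving marks; drop pieces with no word characters.
    return [ grep { /\w/ } split /\Q$EOS\E/, $text ];
}

my $demo = mark_and_split("The U.S.A. is big. A line with no dot\n\nNext one!");
print "[$_]\n" for @$demo;
```

Note that the correction rule in this sketch also swallows a genuine
sentence end after a single letter, which is the same trade-off the
module documents for initials.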
1) John F. Kennedy was a former president
2) The answer is F. That ends the quiz

In the first example, F. is detected as a person's initial and not the
end of a sentence. But this means we cannot detect the true end of the
first sentence in example 2, which is after the 'F.'. This case is not
common, though.

All functions used should be requested in the 'use' clause. None is
exported by default.

get_sentences( $text )
The get_sentences function takes a scalar containing ASCII text as an
argument and returns a reference to an array of the sentences that the
text has been split into. Leading and trailing white space is trimmed
from each returned sentence. Strings with no alphanumeric characters in
them won't be returned as sentences.

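A short usage sketch, assuming the module is installed. The sample text
is made up for the example, and the "|| []" guard is a defensive
assumption in case nothing usable comes back.

```perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_sentences );

my $text      = 'Hello world. How are you today?';
my $sentences = get_sentences($text);

# The return value is an array reference, so dereference it to iterate.
my $count = 0;
for my $sentence ( @{ $sentences || [] } ) {
    $count++;
    print "SENTENCE $count: $sentence\n";
}
```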
add_acronyms( @acronyms )
This function is used for adding acronyms not supported by this code.
The input should be regular expressions for matching the desired
acronyms, but should not include the final period ("."). So, for
example, "blv?d" matches "blvd." and "bld.", and "a\.mlf" will match
"a.mlf.". You do not need to bother with acronyms consisting of single
letters and dots (e.g. "U.S.A."), as these are found automatically.
Note also that acronyms are searched for on a case-insensitive basis.

Please see the 'Acronym/Abbreviations list' section for the
abbreviations already supported by this module.

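A hedged example of registering patterns, assuming the module is
installed and that get_acronyms() returns a plain list, as its
description suggests. The sample sentence is invented for illustration.

```perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_sentences add_acronyms get_acronyms );

# Patterns are given without the final period, as described above.
add_acronyms( 'blv?d', 'a\.mlf' );

# Check that the pattern is now part of the acronym list.
my @acronyms = get_acronyms();
my $added    = grep { $_ eq 'blv?d' } @acronyms;
print "blv?d registered\n" if $added;

my $sentences = get_sentences('Turn left at Sunset blvd. and keep going. Then stop.');
print "$_\n" for @$sentences;
```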
get_acronyms( )
This function will return the defined list of acronyms.

set_acronyms( @my_acronyms )
This function replaces the predefined acronym list with the given list.
See "add_acronyms" for details on the input specifications.

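Since set_acronyms replaces the whole list, it can be useful to save
the current list first. A minimal sketch, assuming the module is
installed and that get_acronyms() returns a plain list; the four-entry
list is hypothetical.

```perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_acronyms set_acronyms );

# Remember the predefined list so it can be put back later.
my @saved = get_acronyms();

# Replace it with a small, hypothetical list of our own.
set_acronyms( qw( Dr Mr Mrs Prof ) );
my @short = get_acronyms();
print scalar(@short), " patterns active\n";

# Restore the original list.
set_acronyms(@saved);
my @restored = get_acronyms();
print scalar(@restored), " patterns active after restore\n";
```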
get_EOS( )
This function returns the string used to mark the end of a sentence.
You might want to check what it is, to make sure your text doesn't
contain it. You can use set_EOS() to change the end-of-sentence string
to whatever you desire.

set_EOS( $new_EOS_string )
This function alters the end-of-sentence string used to mark the end of
sentences.

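The check suggested above might look like this. It is a sketch that
assumes the module is installed; the replacement marker '<<EOS>>' is an
arbitrary, hypothetical choice.

```perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_sentences get_EOS set_EOS );

my $text        = 'First sentence. Second sentence.';
my $default_eos = get_EOS();

# If the text happens to contain the marker, switch to another one.
if ( index( $text, $default_eos ) >= 0 ) {
    set_EOS('<<EOS>>');    # hypothetical replacement marker
}

my $sentences = get_sentences($text);
print scalar(@$sentences), " sentences found\n";
```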
set_locale( $new_locale )
Receives a language locale in the form language_country.character-set,
for example "fr_CA.ISO8859-1" for Canadian French using character set
ISO8859-1.

Returns a reference to a hash containing the current locale formatting
values. Returns undef if given undef.

The following will set the locale behaviour, including LC_COLLATE, to
Argentinian Spanish. NOTE: The naming and availability of locales
depends on your operating system. Please consult the perllocale manpage
for how to find out which locales are available on your system.

$loc = set_locale( "es_AR.ISO8859-1" );

This actually does this:

$loc = setlocale( LC_ALL, "es_AR.ISO8859-1" );

You can use the get_acronyms() function to get the acronyms. The list
has become too long to specify in the documentation.

If I come across a good general-purpose list, I'll incorporate it into
this module. Feel free to suggest such lists.

[1] Object-oriented usage
[2] Supporting more than just English/French
[3] Code optimization. Currently everything is regular-expression based, and the expressions are not well optimized
[4] Possibly use more semantic heuristics for detecting the beginning of a sentence

Text::Sentence
Lingua::Sentence
Raku port of Lingua::EN::Sentence

<https://github.com/kimryan/Lingua-EN-Sentence>

Shlomo Yona shlomo@cs.haifa.ac.il

Currently being maintained by Kim Ryan, kimryan at CPAN d o t org

Copyright (c) 2001-2016 Shlomo Yona. All rights reserved.
Copyright (c) 2022 Kim Ryan. All rights reserved.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

perl v5.36.0                      2023-01-20                      Lingua::EN::Sentence(3)