Lingua::EN::Sentence(3pm)

Lingua::EN::Sentence(3)     User Contributed Perl Documentation    Lingua::EN::Sentence(3)

NAME

       Lingua::EN::Sentence - split text into sentences

SYNOPSIS

       use Lingua::EN::Sentence qw( get_sentences add_acronyms );

       add_acronyms('lt','gen');          ## adding support for 'Lt. Gen.'

       my $text = q{
       A sentence usually ends with a dot, exclamation or question
       mark optionally followed by a space!  A string followed by 2 carriage
       returns denotes a sentence, even though it doesn't end in a dot

       Dots after single letters such as U.S.A. or in numbers like -12.34 will
       not cause a split as well as common abbreviations such as Dr. I. Smith,
       Ms. A.B. Jones, Apr. Calif. Esq.  and (some text) ellipsis such as ...
       or . . are ignored.  Some valid cases cannot be detected, such as the
       answer is X. It cannot easily be differentiated from the single
       letter-dot sequence to abbreviate a person's given name.  Numbered points
       within a sentence will not cause a split 1. Like this one.  See the
       code for all the rules that apply.  This string has 7 sentences.
       };

       my $sentences = get_sentences($text);
       if (defined($sentences)) {
            my $i = 0;
            foreach my $sent (@$sentences) {
                 $i++;
                 print("SENTENCE $i:$sent\n");
            }
       }

DESCRIPTION

       The "Lingua::EN::Sentence" module contains the function get_sentences,
       which splits text into its constituent sentences, based on a regular
       expression and a list of abbreviations (built in and given).

       Certain well-known exceptions, such as abbreviations, may cause
       incorrect segmentations. Some of them are already built into this
       code and are handled. Still, if you find words that cause the
       get_sentences function to split incorrectly, you can add them to the
       module with add_acronyms so that it recognises them.  Note that
       abbreviations are case sensitive, so 'Mrs.' is recognised but not
       'mrs.'.

ALGORITHM

       The first step is to mark the dot ending an abbreviation by changing
       it to a special character, so that it won't cause a sentence split.
       The original dot is restored after the sentences are split.

       Basically, I use a 'brute' regular expression to split the text into
       sentences.  (Well, nothing is yet split - I just mark the
       end-of-sentence.) Then I look into a set of rules which decide when an
       end-of-sentence is justified and when it's a mistake. In case of a
       mistake, the end-of-sentence mark is removed. What are such mistakes?

       Letter-dot sequences: U.S.A., i.e., e.g.

       Dot sequences: '..' or '...' or 'text . . more text'

       Two carriage returns denote the end of a sentence even if it doesn't
       end with a dot.
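
       The steps above can be sketched in plain Perl. This is a simplified
       illustration and not the module's actual code: the abbreviation list,
       the two marker characters, the function name sketch_split and the
       regular expressions are all assumptions chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical values for this sketch; the module keeps its own list and markers.
my @abbreviations = qw( Dr Mr Mrs Ms Prof );
my $PROTECTED_DOT = "\x{02}";   # stands in for a dot that must not cause a split
my $EOS           = "\x{01}";   # end-of-sentence marker

sub sketch_split {
    my ($text) = @_;

    # Step 1: protect the dot that ends a known abbreviation ...
    my $abbrev_re = join '|', map { quotemeta($_) } @abbreviations;
    $text =~ s/\b($abbrev_re)\./$1$PROTECTED_DOT/g;

    # ... and the dots in letter-dot sequences such as U.S.A.
    $text =~ s/\b([A-Za-z])\./$1$PROTECTED_DOT/g;

    # Step 2: mark (not yet split) every remaining . ! or ? that is
    # followed by white space or by the end of the text.
    $text =~ s/([.!?])(?:\s+|\z)/$1$EOS/g;

    # Two consecutive newlines end a sentence even without a dot.
    $text =~ s/\n\s*\n/$EOS/g;

    # Step 3: split on the marker, restore the protected dots, trim,
    # and drop fragments that contain no alphanumeric characters.
    my @sentences;
    for my $s ( split /\Q$EOS\E/, $text ) {
        $s =~ s/\Q$PROTECTED_DOT\E/./g;
        $s =~ s/^\s+|\s+$//g;
        push @sentences, $s if $s =~ /[A-Za-z0-9]/;
    }
    return \@sentences;
}

my $demo = sketch_split("Dr. Smith lives in the U.S.A. today. Is he home? Yes!");
print "$_\n" for @$demo;
```

       For the demo string the sketch yields three sentences. Because every
       single letter followed by a dot is protected, it shares the weakness
       described under LIMITATIONS: a sentence that really ends in a single
       letter and a dot is not split.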

LIMITATIONS

       1) John F. Kennedy was a former president
       2) The answer is F. That ends the quiz

       In the first example, F. is detected as a person's initial and not the
       end of a sentence.  But this means we cannot detect the true end of
       the first sentence in the second example, which is after the 'F.'.
       This case is not common though.

FUNCTIONS

       All functions used should be requested in the 'use' clause. None is
       exported by default.

       get_sentences( $text )
           The get_sentences function takes a scalar containing ASCII text as
           an argument and returns a reference to an array of sentences that
           the text has been split into. Returned sentences will be trimmed
           (beginning and end of sentence) of white space. Strings with no
           alphanumeric characters in them won't be returned as sentences.
           If no text is supplied, a reference to an empty array is returned.

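           A minimal usage sketch, assuming the module is installed from
           CPAN. It relies only on the documented return value (a reference
           to an array). The sample text and the helper name count_sentences
           are made up for illustration.

```perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_sentences );

# Hypothetical helper: number of sentences found, 0 for empty input.
sub count_sentences {
    my ($text) = @_;
    my $sentences = get_sentences($text);
    # Also guard against undef, in case another module version returns it.
    return defined $sentences ? scalar @$sentences : 0;
}

my $text      = "The cat sat on the mat. Did it stay there? It did not!";
my $sentences = get_sentences($text);

# Each element is one trimmed sentence.
print "$_\n" for @{ $sentences // [] };
```
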
       add_acronyms( @acronyms )
           This function is used for adding acronyms not supported by this
           code.  The input should be regular expressions for matching the
           desired acronyms, but should not include the final period (".").
           So, for example, "blv?d" matches "blvd." and "bld.". "a\.mlf" will
           match "a.mlf.". You do not need to bother with acronyms consisting
           of single letters and dots (e.g. "U.S.A."), as these are found
           automatically. Note also that acronyms are searched for on a
           case-insensitive basis.

           Please see the 'Acronym/Abbreviations list' section for the
           abbreviations already supported by this module.

       get_acronyms( )
           This function will return the defined list of acronyms.

       set_acronyms( @my_acronyms )
           This function replaces the predefined acronym list with the given
           list. See "add_acronyms" for details on the input specifications.

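           The acronym functions might be combined as in the sketch below,
           which assumes the module is installed and that add_acronyms
           appends its arguments to the list returned by get_acronyms. The
           entries added here are only examples.

```perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( add_acronyms get_acronyms );

# Snapshot of the list before any additions.
my @before = get_acronyms();

# Entries are regular expressions without the final period.
add_acronyms( 'lt', 'gen', 'blv?d' );

my @after = get_acronyms();
printf "%d acronyms before, %d after\n", scalar @before, scalar @after;
```
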
       get_EOS( )
           This function returns the value of the string used to mark the end
           of sentence.  You might want to see what it is, and to make sure
           your text doesn't contain it.  You can use set_EOS() to alter the
           end-of-sentence string to whatever you desire.

       set_EOS( $new_EOS_string )
           This function alters the end-of-sentence string used to mark the
           end of sentences.

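           One way to follow the advice above is to check the text for the
           current marker before splitting. This is a sketch that assumes
           the module is installed. The replacement marker is an arbitrary
           choice for the example and is not a value the module suggests.

```perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_sentences get_EOS set_EOS );

my $text = "First sentence here. Second sentence here.";

# If the text happens to contain the current marker, pick another one.
# The replacement below is an arbitrary choice for this sketch.
if ( index( $text, get_EOS() ) >= 0 ) {
    set_EOS("\x{1E}\x{1E}");
}

my $sentences = get_sentences($text);
print "$_\n" for @{ $sentences // [] };
```
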
       set_locale( $new_locale )
           Receives language locale in the form
           language_country.character-set, for example "fr_CA.ISO8859-1" for
           Canadian French using character set ISO8859-1.

           Returns a reference to a hash containing the current locale
           formatting values.  Returns undef if given undef.

           The following will set the LC_COLLATE behaviour to Argentinian
           Spanish.  NOTE: The naming and availability of locales depends on
           your operating system.  Please consult the perllocale manpage for
           how to find out which locales are available in your system.

               $loc = set_locale( "es_AR.ISO8859-1" );

           This actually does this:

               $loc = setlocale( LC_ALL, "es_AR.ISO8859-1" );

Acronym/Abbreviations list

       You can use the get_acronyms() function to get acronyms.  The list has
       become too long to specify in the documentation.

       If I come across a good general-purpose list, I'll incorporate it into
       this module.  Feel free to suggest such lists.

FUTURE WORK

       [1] Object-oriented usage
       [2] Supporting more than just English/French
       [3] Code optimization. Currently everything is RE-based, and the REs
           are not well optimized
       [4] Possibly use more semantic heuristics for detecting the beginning
           of a sentence

REPOSITORY

       <https://github.com/kimryan/Lingua-EN-Sentence>

AUTHOR

       Shlomo Yona shlomo@cs.haifa.ac.il

       Currently being maintained by Kim Ryan, kimryan at CPAN d o t org

COPYRIGHT AND LICENSE

       Copyright (c) 2001-2016 Shlomo Yona. All rights reserved.  Copyright
       (c) 2022 Kim Ryan. All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.38.0                      2023-07-21           Lingua::EN::Sentence(3)