Lingua::EN::Sentence(3pm)

Lingua::EN::Sentence(3)  User Contributed Perl Documentation  Lingua::EN::Sentence(3)

NAME

       Lingua::EN::Sentence - split text into sentences

SYNOPSIS

               use Lingua::EN::Sentence qw( get_sentences add_acronyms );

               add_acronyms('lt','gen');               ## adding support for 'Lt. Gen.'
               my $sentences=get_sentences($text);     ## Get the sentences.
               foreach my $sentence (@$sentences) {
                       ## do something with $sentence
               }

DESCRIPTION

       The "Lingua::EN::Sentence" module contains the function get_sentences,
       which splits text into its constituent sentences, based on a regular
       expression and a list of abbreviations (both built-in and
       user-supplied).

       Certain well-known exceptions, such as abbreviations, may cause
       incorrect segmentation. Many of them are already built into this code
       and handled. Still, if you find words that cause the get_sentences
       function to split incorrectly, you can add them to the module so that
       it recognizes them.

ALGORITHM

       Basically, I use a 'brute' regular expression to mark each candidate
       end-of-sentence in the text (nothing is split at this stage). I then
       apply a set of rules that decide whether each end-of-sentence mark is
       justified or a mistake. Marks judged to be mistakes are removed.

       What are such mistakes? Abbreviations, for example. I have a list of
       such abbreviations (see the 'Acronym/Abbreviations list' section
       below), plus more general rules. For example, the abbreviations 'i.e.'
       and 'e.g.' need not be in the list, as a special rule takes care of
       all single-letter abbreviations.
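
       The mark-then-unmark idea described above can be sketched in a few
       lines of plain Perl. This is only an illustration, not the module's
       actual code: the function name sketch_split, the marker value, the
       short abbreviation list and the three regular expressions are all
       assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical names throughout; this is not the module's internal code.
my $EOS  = "\001";                       # assumed end-of-sentence marker
my @ABBR = qw( mr mrs dr prof lt gen );  # assumed abbreviation list

sub sketch_split {
    my ($text) = @_;

    # Step 1: brute marking. Replace the white space that follows every
    # '.', '?' or '!' with the end-of-sentence marker.
    $text =~ s/([.?!])\s+/$1$EOS/g;

    # Step 2: remove marks that follow a listed abbreviation.
    my $abbr = join '|', @ABBR;
    $text =~ s/\b($abbr)\.$EOS/$1. /gi;

    # Step 3: remove marks that follow a single letter and a period,
    # which covers the last part of 'i.e.' or 'e.g.'. (A real rule set
    # would need more care, e.g. for a sentence ending in "plan B.")
    $text =~ s/\b([A-Za-z])\.$EOS/$1. /g;

    # Step 4: split on the surviving marks, trim white space, and drop
    # pieces that contain no alphanumeric character.
    my @sentences;
    for my $s (split /\Q$EOS\E/, $text) {
        $s =~ s/^\s+|\s+$//g;
        push @sentences, $s if $s =~ /[[:alnum:]]/;
    }
    return \@sentences;
}

my $sample = "Dr. Smith arrived, i.e. on time. He left!";
print "$_\n" for @{ sketch_split($sample) };
```

       For the sample string, the sketch yields two sentences: the mark
       after "Dr." is removed by the abbreviation list and the mark after
       "i.e." by the single-letter rule.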

FUNCTIONS

       All functions you use must be requested in the 'use' clause. None is
       exported by default.

       get_sentences( $text )
           The get_sentences function takes a scalar containing ASCII text as
           an argument and returns a reference to an array of the sentences
           that the text has been split into. Returned sentences are trimmed
           of white space at the beginning and end. Strings with no
           alphanumeric characters in them are not returned as sentences.

       add_acronyms( @acronyms )
           This function adds acronyms not already supported by this code.
           The input should be regular expressions for matching the desired
           acronyms, but should not include the final period ("."). So, for
           example, "blv?d" matches "blvd." and "bld.". "a\.mlf" will match
           "a.mlf.". You do not need to bother with acronyms consisting of
           single letters and dots (e.g. "U.S.A."), as these are found
           automatically. Note also that acronyms are matched
           case-insensitively.

           Please see the 'Acronym/Abbreviations list' section for the
           abbreviations already supported by this module.

       get_acronyms( )
           This function returns the defined list of acronyms.

       set_acronyms( @my_acronyms )
           This function replaces the predefined acronym list with the given
           list. See "add_acronyms" for details on the input specifications.

       get_EOS( )
           This function returns the string used to mark the end of a
           sentence. You might want to see what it is, to make sure your
           text doesn't contain it. You can use set_EOS() to change the
           end-of-sentence string to whatever you desire.

       set_EOS( $new_EOS_string )
           This function changes the string used to mark the end of
           sentences.

       set_locale( $new_locale )
           Receives a language locale in the form
           language.country.character-set, for example "fr_CA.ISO8859-1" for
           Canadian French using character set ISO8859-1.

           Returns a reference to a hash containing the current locale
           formatting values. Returns undef if given undef.

           The following will set the LC_COLLATE behaviour to Argentinian
           Spanish. NOTE: The naming and availability of locales depends on
           your operating system. Please consult the perllocale manpage for
           how to find out which locales are available on your system.

           $loc = set_locale( "es_AR.ISO8859-1" );

           This actually does this:

           $loc = setlocale( LC_ALL, "es_AR.ISO8859-1" );
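
       To make the pattern rules for add_acronyms concrete, here is a small
       stand-alone sketch of how such patterns behave when matched
       case-insensitively with the final period added. The helper
       matches_acronym is hypothetical and is not part of
       Lingua::EN::Sentence; anchoring the pattern to the whole word is an
       assumption made for the example, and the module's own matching may
       differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::EN::Sentence: reports whether
# $word consists of one of the given acronym patterns plus a final period.
sub matches_acronym {
    my ($word, @patterns) = @_;
    for my $p (@patterns) {
        # The pattern is given without the final period, so add it here.
        # The /i flag makes the match case-insensitive.
        return 1 if $word =~ /^(?:$p)\.$/i;
    }
    return 0;
}

# The two patterns used as examples in the add_acronyms description.
my @patterns = ('blv?d', 'a\.mlf');

for my $word ('blvd.', 'bld.', 'BLVD.', 'a.mlf.', 'road.') {
    printf "%-8s %s\n", $word,
        matches_acronym($word, @patterns) ? 'acronym' : 'no match';
}
```

       Note that the dot inside "a\.mlf" is escaped because an unescaped
       "." in a regular expression matches any character.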

Acronym/Abbreviations list

       The list of supported acronyms has become too long to include in this
       documentation. Use the get_acronyms() function to retrieve it.

       If I come across a good general-purpose list, I'll incorporate it into
       this module. Feel free to suggest such lists.

FUTURE WORK

               [1] Object-oriented usage
               [2] Supporting more than just English/French
               [3] Code optimization. Currently everything is based on regular expressions, and these are not well optimized
               [4] Possibly use more semantic heuristics for detecting the beginning of a sentence

REPOSITORY

       <https://github.com/kimryan/Lingua-EN-Sentence>

AUTHOR

       Shlomo Yona shlomo@cs.haifa.ac.il

       Currently being maintained by Kim Ryan, kimryan at CPAN d o t org

COPYRIGHT AND LICENSE

       Copyright (c) 2001-2016 Shlomo Yona. All rights reserved.  Copyright
       (c) 2018 Kim Ryan. All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.34.0                      2022-01-21           Lingua::EN::Sentence(3)