NAME
    Lingua::EN::Sentence - split text into sentences

SYNOPSIS
    use Lingua::EN::Sentence qw( get_sentences add_acronyms );

    add_acronyms('lt','gen');               ## adding support for 'Lt. Gen.'
    my $sentences=get_sentences($text);     ## Get the sentences.
    foreach my $sentence (@$sentences) {
        ## do something with $sentence
    }

DESCRIPTION
    The "Lingua::EN::Sentence" module contains the function get_sentences, which splits text into its constituent sentences, based on a regular expression and a list of abbreviations (built in and given).

    Certain well-known exceptions, such as abbreviations, may cause incorrect segmentations. Some of them are already built into this code and are handled. Still, if you find words that cause the get_sentences function to fail, you can add them to the module so that it recognizes them.

ALGORITHM
    Basically, I use a 'brute' regular expression to split the text into sentences. (Well, nothing is yet split - I just mark the end-of-sentence). Then I look into a set of rules which decide when an end-of-sentence is justified and when it's a mistake. In case of a mistake, the end-of-sentence mark is removed.

    What are such mistakes? Cases of abbreviations, for example. I have a list of such abbreviations (please see the public globals below for a list), and more general rules (for example, the abbreviations 'i.e.' and 'e.g.' need not be in the list, as a special rule takes care of all single-letter abbreviations).

FUNCTIONS
    All functions used should be requested in the 'use' clause. None is exported by default.
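    The mark-then-repair approach described under ALGORITHM can be illustrated with a minimal, self-contained sketch. This is not the module's actual code: the marker character, the short abbreviation list, and the name sketch_split are all hypothetical and chosen only for illustration, and the real rule set is more extensive.

```perl
use strict;
use warnings;

# Hypothetical end-of-sentence marker and abbreviation list (assumptions
# for this sketch only; the module's real internals may differ).
my $EOS           = "\001";
my @ABBREVIATIONS = qw(mr mrs dr prof lt gen);

sub sketch_split {
    my ($text) = @_;
    return [] unless defined $text && length $text;

    # Step 1: 'brute' pass. Nothing is split yet; every '.', '?' or '!'
    # followed by whitespace just gets an end-of-sentence mark.
    $text =~ s/([.?!])\s+/$1$EOS/g;

    # Step 2: repair pass. Remove the mark after a listed abbreviation...
    my $abbr = join '|', map { quotemeta($_) } @ABBREVIATIONS;
    $text =~ s/\b($abbr)\.$EOS/$1. /gi;

    # ...and after any single-letter abbreviation, which covers the
    # final letter of 'i.e.' and 'e.g.' without listing them.
    $text =~ s/\b([A-Za-z])\.$EOS/$1. /g;

    # Step 3: split on the marks that survived the repair pass.
    return [ grep { length } split /$EOS/, $text ];
}
```

    Given the text "Lt. Gen. Smith arrived. He met Dr. Jones, i.e. the surgeon. Was it late?", this sketch returns three sentences, leaving 'Lt. Gen.', 'Dr.' and 'i.e.' intact. The single-letter rule is deliberately crude: a sentence that really ends in a single letter (for example "It was I.") would be joined to the next one.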
Acronym/Abbreviations list
    You can use the get_acronyms() function to get the current list of acronyms. The list has become too long to reproduce in the documentation.

    If I come across a good general-purpose list, I'll incorporate it into this module. Feel free to suggest such lists.

FUTURE WORK
    [1] Object-oriented usage
    [2] Supporting more than just English/French
    [3] Code optimization. Currently everything is regular-expression based, and the expressions are not well optimized
    [4] Possibly use more semantic heuristics for detecting the beginning of a sentence

SEE ALSO
    Text::Sentence

REPOSITORY
    <https://github.com/kimryan/Lingua-EN-Sentence>

AUTHOR
    Shlomo Yona shlomo@cs.haifa.ac.il

    Currently being maintained by Kim Ryan, kimryan at CPAN d o t org

COPYRIGHT AND LICENSE
    Copyright (c) 2001-2016 Shlomo Yona. All rights reserved.
    Copyright (c) 2018 Kim Ryan. All rights reserved.

    This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.