NAME

Text::Similarity - Measure the pair-wise Similarity of Files or Strings

SYNOPSIS

    # this will return an un-normalized score that just gives the
    # number of overlaps by default (or F1 if normalize is set),
    # plus a hash table of other scores, with the following keys
    # 'wc1', 'wc2', 'raw', 'precision', 'recall', 'F', 'dice', 'E', 'cosine', 'raw_lesk','lesk'
    # wc1 and wc2 are respective word counts; see Overlaps.pm for definitions of other scores

    use Text::Similarity::Overlaps;
    my $mod = Text::Similarity::Overlaps->new;
    defined $mod or die "Construction of Text::Similarity::Overlaps failed";

    # adjust file names to reflect true relative position
    # these paths are valid from lib/Text/Similarity
    my $text_file1 = 'Overlaps.pm';
    my $text_file2 = '../OverlapFinder.pm';

    my $score = $mod->getSimilarity ($text_file1, $text_file2);
    print "The similarity of $text_file1 and $text_file2 is : $score\n";

    my ($score1, %allScores) = $mod->getSimilarity ($text_file1, $text_file2);
    print "The raw similarity of $text_file1 and $text_file2 is : $allScores{'raw'}\n";
    print "The lesk score of $text_file1 and $text_file2 is : $allScores{'lesk'}\n";

    # if you want to turn on the verbose options and provide a stoplist
    # you can pass those parameters to Overlaps.pm via hash arguments
    # the verbose option causes extra scores to be printed to STDERR

    use Text::Similarity::Overlaps;
    my %options = ('verbose' => 1, 'stoplist' => '../../samples/stoplist.txt');
    my $mod = Text::Similarity::Overlaps->new (\%options);
    defined $mod or die "Construction of Text::Similarity::Overlaps failed";

    # adjust file names to reflect true relative position
    # these paths are valid from lib/Text/Similarity
    my $text_file1 = 'Overlaps.pm';
    my $text_file2 = '../OverlapFinder.pm';

    my ($score, %allScores) = $mod->getSimilarity ($text_file1, $text_file2);
    print "The raw similarity of $text_file1 and $text_file2 is : $allScores{'raw'}\n";
    print "The lesk score of $text_file1 and $text_file2 is : $allScores{'lesk'}\n";

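The DESCRIPTION section says that text is cleaned (most punctuation removed, everything lower-cased, runs of white space collapsed) and that stop words are dropped before word counts and overlaps are computed. The following is a minimal sketch of that kind of preprocessing. It is not the module's own sanitizeString code. The names sketch_sanitize and sketch_remove_stopwords are hypothetical, and representing the stoplist as a hash of plain words is an assumption made only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Similarity): lower-case the text,
# drop punctuation other than underscores and embedded apostrophes, and
# collapse runs of white space to a single space.
sub sketch_sanitize {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/[^\w'\s]/ /g;             # keep word characters, apostrophes, white space
    $text =~ s/(?<!\w)'|'(?!\w)/ /g;     # drop apostrophes that are not inside a word
    $text =~ s/\s+/ /g;                  # collapse white space
    $text =~ s/^ | $//g;                 # trim leading and trailing space
    return $text;
}

# Hypothetical helper: remove stop words given as hash keys (an assumption;
# the module's real stoplist format may differ).
sub sketch_remove_stopwords {
    my ($text, $stop) = @_;
    return join ' ', grep { !$stop->{$_} } split ' ', $text;
}

my $clean = sketch_sanitize("The White_House,  isn't   'HERE'!");
my $kept  = sketch_remove_stopwords($clean, { the => 1 });
my $count = () = split ' ', $kept, -1;   # words left after stop word removal
print "$clean\n$kept\n$count\n";
```

In this sketch the word count is taken after stop word removal, mirroring the statement in DESCRIPTION that reported document lengths exclude stop words.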
DESCRIPTION

This module is a superclass for other modules and provides generic services such as stop word removal, compound identification, and text cleaning or sanitizing.

It's important to realize that additional methods of measuring similarity can be added to this package. Text::Similarity::Overlaps is just one possible way of measuring similarity; others can be added.

Subroutine sanitizeString carries out text cleaning. Briefly, it removes nearly all punctuation except for underscores and embedded apostrophes, converts all text to lower case, and collapses multiple white spaces to a single space.

This module is where compounds are identified (although this is currently disabled). When implemented, it will check a list of compounds provided by the user, and when a compound is found in the text it will be designated via an underscore (e.g., white house might be converted to white_house).

Stop words are removed here. The length of the documents reported does not include the stop words. Overlaps are found after stop word removal. By including a word in the stoplist, you are in effect saying that the word never existed in your input.

BUGS

SEE ALSO

<http://text-similarity.sourceforge.net>

AUTHORS

Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu

Siddharth Patwardhan, University of Utah
sidd at cs.utah.edu

Jason Michelizzi

Ying Liu, University of Minnesota, Twin Cities
liux0395 at umn.edu

Last modified by : $Id: Similarity.pm,v 1.4 2015/10/08 13:22:13 tpederse Exp $

COPYRIGHT AND LICENSE

Copyright (C) 2004-2010, Ted Pedersen, Jason Michelizzi, Siddharth Patwardhan, and Ying Liu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA