|
NAME
Text::Distill - Quick text comparison, plagiarism and common parts detection

VERSION
Version 0.2

SYNOPSIS
    use Text::Distill qw(Distill);

    my $DistilledText1 = Distill($text1);
    my $DistilledText2 = Distill($text2);
    print($DistilledText1 eq $DistilledText2 ? "Equal" : "Not equal");

or

    use Text::Distill;

    my $FileFormat = Text::Distill::DetectBookFormat($FilePath);
    die "Not a fb2.zip file" if $FileFormat ne 'fb2.zip';

    my $Text = Text::Distill::ExtractTextFromFB2File($FilePath);
    my $Gems = Text::Distill::TextToGems($Text);

    my $VURL = 'http://partnersdnld.litres.ru/copyright_check_by_gems/';
    my $TextInfo = Text::Distill::GemsValidate($Gems, $VURL);
    die "Copyright-protected content" if $TextInfo->{verdict} eq 'protected';

Distilling gems from text

TextToGems($UTF8TextString)
Transforms a text (valid UTF-8 expected) into an array of 32-bit hash sums (Jenkins hash). The text is first flattened the hard way (something like soundex, see Distill below), then split into fragments at statistically chosen sequences. The first and last fragments are rejected, as are short fragments; hashes are calculated from the remaining strings, and a reference to the array of hashes is returned.

What you really need to know is that gems from exactly the same texts are equal, and texts with small changes have similar "gems" as well. If two texts have 3+ common gems, they share some text parts for sure. This is somewhat close to "edit distance", but fast to calculate and indexable, so you can effectively search for citations or plagiarism.

The chosen split method makes the average detection segment about 2k of text (1-2 paper pages), so this package will not normally detect a single equal paragraph. If you need a more precise match, extend @Text::Distill::SplitChars with some sequences from SeqNumStats.xlsx on GitHub; I guess you can get down to parts of about 300 chars without problems.
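
The "3+ common gems" rule described above can be sketched in plain Perl. This is only a minimal illustration and not part of Text::Distill: CommonGems is a hypothetical helper, and the gem values are made up to stand in for what TextToGems would return.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of Text::Distill): counts the distinct
# gems that occur in both array references.
sub CommonGems {
    my ($GemsA, $GemsB) = @_;
    my %Seen = map { $_ => 1 } @$GemsA;
    my %Common;
    for my $Gem (@$GemsB) {
        $Common{$Gem} = 1 if $Seen{$Gem};
    }
    return scalar keys %Common;
}

# Made-up gem values standing in for TextToGems() output.
my $GemsA = [ 101, 202, 303, 404, 505 ];
my $GemsB = [ 202, 303, 404, 999 ];

my $Shared  = CommonGems($GemsA, $GemsB);
my $Verdict = $Shared >= 3 ? "Texts share some parts" : "No reliable overlap";
print "Common gems: $Shared\n";
print "$Verdict\n";
```

With real data you would pass the array references returned by TextToGems for the two texts instead of the literal lists.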

If you extend @Text::Distill::SplitChars, don't forget to lower $Text::Distill::MinPartSize as well, and keep in mind that GemsValidate will break if you play with $MinPartSize and @SplitChars.

Should return about one 32-bit jHash for every 2kb of source text (may vary depending on the text, though).

    my $Gems = TextToGems($String);
    print join(',', @$Gems);

Distill($UTF8TextString)
Transforms the text (valid UTF-8 expected) into a sequence of numbers 1-8 (as a string). Used internally by TextToGems, but you may use its output with a standard "edit distance" algorithm, like Text::Levenshtein. The distilled string is shorter, so your math will go much faster.

In the end it works somewhat close to 'soundex', with the addition of some basic rules for Cyrillic chars, pre- and post-cleanup and UTF normalization. Drops strange sequences, drops short words as well (how are you going to make your plagiarism without copying the long words, huh?)

    $Distilled = Distill($Text); # $Distilled should be ~60% shorter than $Text

Remote validation
There is at least one open service to check your text against a known text database; docs are here: <https://goo.gl/xmFMdr>.

GemsValidate(\@Gems, $Url)
Checks your gems against a remote database, returns an overall verdict and a structure with info on found titles.

Service functions

ExtractTextFromFB2File($FilePath)
Receives a path to an fb2 file and returns all significant text from the file as a string.

ExtractTextFromFB3File($FilePath)
Receives a path to an fb3 file and returns all significant text from the file as a string.

ExtractTextFromTXTFile($FilePath)
Receives a path to a text file and returns all significant text from the file as a string.

ExtractTextFromDocFile($FilePath)
Receives a path to a doc file and returns all significant text from the file as a string.

ExtractTextFromDOCXFile($FilePath)
Receives a path to a docx file and returns all significant text from the file as a string.

ExtractTextFromEPUBFile($FilePath)
Receives a path to an epub file and returns all significant text from the file as a string.

DetectBookFormat($FilePath, $Format)
Detects the format of an e-book and returns it. You may suggest a format to start with; this will speed up the process a bit (not required).

$Format can be 'fb2.zip', 'fb2', 'doc.zip', 'doc', 'docx.zip', 'docx', 'epub.zip', 'epub', 'txt.zip', 'txt', 'fb3'

Internals:
Receives a path to the file and checks whether this file is ...

    CheckIfDocZip()   - MS Word .doc in zip-archive
    CheckIfEPubZip()  - Electronic Publication .epub in zip-archive
    CheckIfDocxZip()  - MS Word 2007 .docx in zip-archive
    CheckIfFB2Zip()   - FictionBook2 (FB2) in zip-archive
    CheckIfTXT2Zip()  - text-file in zip-archive
    CheckIfEPub()     - Electronic Publication .epub
    CheckIfDocx()     - MS Word 2007 .docx
    CheckIfDoc()      - MS Word .doc
    CheckIfFB2()      - FictionBook2 (FB2)
    CheckIfFB3()      - FictionBook3 (FB3)
    CheckIfTXT()      - text-file

REQUIRED MODULES
    Digest::JHash;
    XML::LibXML;
    XML::LibXSLT;
    Encode::Detect;
    Text::Extract::Word;
    HTML::TreeBuilder;
    OLE::Storage_Lite;
    Text::Unidecode (v1.27 or later);
    Unicode::Normalize (v1.25 or later);
    Archive::Zip;
    Encode;
    Carp;
    LWP::UserAgent;
    JSON::XS;
    File::Temp;

SCRIPTS

plagiarism_check.pl - checks your ebook against a database of known texts
The script uses the check_by_gems API (<https://goo.gl/xmFMdr>). You can select any "check service" provider with CHECKURL (see below); by default the text is checked with the LitRes copyright-check service: <http://partnersdnld.litres.ru/copyright_check_by_gems/>

USAGE
    > plagiarism_check.pl FILEPATH [CHECKURL] [--full-info] [--help]

EXAMPLE
    > plagiarism_check.pl /home/file.epub --full-info

PARAMS
    FILEPATH      path to the file to check
    CHECKURL      url of the validating API to check the file with.
                  By default: http://partnersdnld.litres.ru/copyright_check_by_gems/
    --full-info   show full info on the checked ebook
    --help        show this information

OUTPUT
Ebook statuses explained:

    protected      there are either copyrights on this book or it is
                   forbidden for distribution for some other reason
                   (racist content, etc)
    free           the ebook content owner distributes it for free (but the
                   content may still be protected from certain kinds of use)
    public_domain  this is public domain, no restrictions at all
    unknown        the service has no valid info on this text

AUTHOR
Litres.ru, "<gu at litres.ru>"

Get the latest code from <https://github.com/Litres/TextDistill>

BUGS
Please report any bugs or feature requests to <https://github.com/Litres/TextDistill/issues>.

SUPPORT
You can find documentation for this module with the perldoc command.

    perldoc Text::Distill

LICENSE AND COPYRIGHT
Copyright (C) 2016 Litres.ru

The GNU Lesser General Public License version 3.0

Text::Distill is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3.0 of the License.

Text::Distill is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

Full text of License <http://www.gnu.org/licenses/lgpl-3.0.en.html>.
|