|
NAMEAlgorithm::RabinKarp - Rabin-Karp streaming hashSYNOPSISmy $text = "A do run run run, a do run run"; my $kgram = Algorithm::RabinKarp->new($window, $text); or my $kgram2 = Algorithm::RabinKarp->new($window, $fh); or my $kgram3 = Algorithm::RabinKarp->new($window, sub { ... return $num, $position; }); my ($hash, $start_position, $end_position) = $kgram->next; my @values = $kgram->values; my %occurances; # a dictionary of all kgrams. while (my ($hash, @pos) = @{shift @values}) { push @{$occurances{$hash}}, \@pos; } my $needle = Algorithm::RabinKarp->new(6, "needle"); open my $fh, '<', "haystack.txt"; my $haystack = Algorithm::RabinKarp->new(6, $fh); my $needle_hash = $needle->next; while (my ($hay_hash, @pos) = $haystack->next) { warn "Possible match for 'needle' at @pos" if $needle_hash eq $hay_hash; } DESCRIPTIONThis is an implementation of Rabin and Karp's streaming hash, as described in "Winnowing: Local Algorithms for Document Fingerprinting" by Schleimer, Wilkerson, and Aiken. Following the suggestion of Schleimer, I am using their second equation:$H[ $c[2..$k + 1] ] = (( $H[ $c[1..$k] ] - $c[1] ** $k ) + $c[$k+1] ) * $k The results of this hash encodes information about the next k values in the stream (hense k-gram.) This means for any given stream of length n integer values (or characters), you will get back n - k + 1 hash values. For best results, you will want to create a code generator that filters your data to remove all unnecessary information. For example, in a large english document, you should probably remove all white space, as well as removing all capitalization. INTENTBy preprocessing your document with the Rabin Karp hashing algorithm, it makes it possible to create a "fingerprint" of your document (or documents), and then perform multiple searches for fragments contained within your document database.Schleimer, Wilkerson, and Aiken suggest preproccessing to remove unnecessary information (like whitespace), as well as known redundent information (like, say, copyright notices or other boilerplate that is 'acceptable'.) They also suggest a post processing pass to reduce data volume, using a technique called winnowing (see the link at the end of this documentation.) METHODS
BUGSThe current multipliers and modulus lead to very poor hash distributions. I'll investigate methods of improving this in future versions.SEE ALSO"Winnowing: Local Algorithms for Document Fingerprinting" L<http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf> Wikipedia: Rabin-Karp string search algorithm L<http://en.wikipedia.org/wiki/Rabin-Karp> AUTHORNorman Nunley E<lt>nnunley@gmail.comE<gt> Nicholas Clark (Who paired with me)
Visit the GSP FreeBSD Man Page Interface. |