|
NAMEText::Match::FastAlternatives - efficient search for many stringsSYNOPSISuse Text::Match::FastAlternatives; my $expletives = Text::Match::FastAlternatives->new(@naughty); while (my $line = <>) { print "Do you email your mother with that keyboard?\n" if $expletives->match($line); } DESCRIPTIONThis module allows you to search for any of a list of substrings ("keys") in a larger string. It is particularly efficient when the set of keys is large.This efficiency comes at the cost of some flexibility: if you want case-insensitive matching, you have to fold case yourself: my $expletives = Text::Match::FastAlternatives->new( map { lc } @naughty); while (my $line = <>) { print "Do you email your mother with that keyboard?\n" if $expletives->match(lc $line); } (The same applies if you want matching that is insensitive to Unicode normalization forms; see Unicode::Normalize.) This module is designed as a drop-in replacement for Perl code of the following form: my $expletives_regex = join '|', map { quotemeta } @naughty; $expletives_regex = qr/$expletives_regex/; while (my $line = <>) { print "Do you email your mother with that keyboard?\n" if $line =~ $expletives_regex; } Text::Match::FastAlternatives can easily perform this test a hundred times faster than the equivalent regex, if you have enough keys. The more keys it searches for, the faster it gets compared to the regex. Modules like Regexp::Trie can build an optimised version of such a regex, designed to take advantage of the niceties of perl's regex engine. With a large number of keys, this module will substantially outperform even an optimised regex like that. In one real-world situation with 339 keys, running on Perl 5.8, Regexp::Trie produced a regex that ran 857% faster than the naive regex (according to Benchmark), but using Text::Match::FastAlternatives ran 18275% faster than the naive regex, or twenty times faster than Regexp::Trie's optimised regex. The enhancements to the regex engine in Perl 5.10 include algorithms similar to those in Text::Match::FastAlternatives. However, even with very small sets of keys, Perl has to do extra work to be fully general, so Text::Match::FastAlternatives is still faster. The difference is greater for larger sets of keys. For one test with only 5 keys, Text::Match::FastAlternatives was 21% faster than perl-5.10.0; with 339 keys (as before), the difference was 111% (that is, slightly over twice as fast). METHODS
CAVEATSSubclassingText::Match::FastAlternatives has a "DESTROY" method implemented in XS. If you write a subclass with its own destructor, you will need to invoke the base destructor, or you will leak memory.Interaction with Perl internalsText::Match::FastAlternatives may change the Perl-internal encoding of strings passed to "new" or to its "match" methods. This is not considered a bug, as the Perl-internal encoding of a string is not normally of interest to Perl code (as opposed to "perl" internals). However, you may encounter situations where preserving a string's existing encoding is important (perhaps to work around a bug in some other module). If so, you may need to copy scalar variables before matching them:$matches++ if $tmfa->match(my $temporary_copy = $original); IMPLEMENTATIONText::Match::FastAlternatives manages to be so fast by using an Aho-Corasick automaton internally. The time to search for a match (or determine that there is no match) is independent of the number of keys being sought; total worst-case search time is O(n) where n is the length of the target string.The "match_at" and "exact_match" methods only need to find a match at one position, so they have worst-case running time of O(min(n, m)) where m is the length of the longest key. Text::Match::FastAlternatives uses an adaptive technique to minimise memory usage; in the best case, each character in a key requires only 4 bytes of storage in the automaton. This low memory usage has the additional advantage of reducing contention for CPU cache lines, further improving performance. SEE ALSO<http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm>, Regexp::Trie, Regexp::Optimizer, Regexp::Assemble, Unicode::Normalize, perl5100delta, perlunitut, perlunifaq.AUTHORAaron Crane <arc@cpan.org>COPYRIGHTCopyright 2006, 2007, 2008, 2010, 2012 Aaron Crane.This library is free software; you can redistribute it and/or modify it under the terms of the Artistic License, or (at your option) under the terms of the GNU General Public License version 2.
Visit the GSP FreeBSD Man Page Interface. |