Search::VectorSpace - a very basic vector-space search engine
use Search::VectorSpace;
my @docs = ...;
my $engine = Search::VectorSpace->new( docs => \@docs, threshold => .04);
$engine->build_index();
while ( my $query = <> ) {
    my %results = $engine->search( $query );
    print join "\n", keys %results;
}
This module takes a list of documents (in English) and builds a simple in-memory
search engine using a vector space model. Documents are stored as PDL objects,
and after the initial indexing phase, the search should be very fast. This
implementation applies a rudimentary stop list to filter out very common
words, and uses a cosine measure to calculate document similarity. All
documents above a user-configurable similarity threshold are returned.
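To make the cosine measure and the threshold concrete, here is a minimal sketch in plain Perl. It is not the module's code: the module stores vectors as PDL objects, while this sketch uses ordinary array references, and the names cosine, %doc_vectors, and @query_vector are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Cosine of the angle between two equal-length term-count vectors.
# Returns 0 when either vector is all zeros, to avoid dividing by zero.
sub cosine {
    my ( $u, $v ) = @_;
    my ( $dot, $sq_u, $sq_v ) = ( 0, 0, 0 );
    for my $i ( 0 .. $#$u ) {
        $dot  += $u->[$i] * $v->[$i];
        $sq_u += $u->[$i]**2;
        $sq_v += $v->[$i]**2;
    }
    return 0 unless $sq_u && $sq_v;
    return $dot / ( sqrt($sq_u) * sqrt($sq_v) );
}

# Hypothetical term-count vectors over a shared three-word list.
my %doc_vectors = (
    doc_a => [ 1, 1, 0 ],
    doc_b => [ 0, 0, 1 ],
);
my @query_vector = ( 1, 0, 0 );
my $threshold    = 0.04;

# Keep only the documents whose similarity to the query exceeds the threshold.
my %results;
for my $name ( keys %doc_vectors ) {
    my $score = cosine( \@query_vector, $doc_vectors{$name} );
    $results{$name} = $score if $score > $threshold;
}

printf "%s => %.3f\n", $_, $results{$_} for sort keys %results;
```

In this sketch doc_a shares one of its two words with the query and is kept, while doc_b shares none, scores zero, and falls below the threshold.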
- new docs => ARRAYREF [, threshold => VALUE ]
- Object constructor. Argument hash must contain a key 'docs' whose value is
a reference to an array of documents. The hash can also contain an
optional threshold setting, between zero and one, to serve as a relevance
cutoff for search results.
- build_index
- Creates the document vectors and stores them in memory, along with a
master word list for the document collection.
- search QUERY
- Returns all documents whose similarity to the QUERY string is above the
set relevance threshold. Unlike typical keyword search engines, the query
can be arbitrarily long and contain any text. It is mapped into a query
vector in the same way as the documents in the collection. Returns a hash
in the form RESULT => RELEVANCE, where the relevance value is between zero
and one.
- get_words STRING
- Rudimentary parser: splits STRING on whitespace and removes punctuation.
Returns a hash in the form WORD => NUMBER, where NUMBER is the number of
times the word was found.
- stem WORD
- Convenience wrapper for Lingua::Stem::stem().
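The parsing and indexing steps described in the method list can be sketched in plain Perl as follows. This is an illustration under stated assumptions, not the module's implementation: count_words and to_vector are hypothetical helpers, lowercasing is an added assumption, the stop list and stemming are left out, and plain arrays stand in for PDL objects.

```perl
use strict;
use warnings;

# Hypothetical stand-in for get_words: split on whitespace, strip
# punctuation, and count occurrences. Lowercasing is an added assumption.
sub count_words {
    my ($text) = @_;
    my %count;
    for my $token ( split ' ', $text ) {
        ( my $word = lc $token ) =~ s/[^a-z0-9]//g;
        $count{$word}++ if length $word;
    }
    return %count;
}

# Hypothetical stand-in for build_index: give every distinct word a
# position in a master word list and keep the word counts per document.
my @docs = ( 'The cat sat.', 'The dog sat, and the dog barked!' );
my ( %position, @doc_counts );
my $next_position = 0;
for my $doc (@docs) {
    my %count = count_words($doc);
    for my $word ( sort keys %count ) {
        $position{$word} = $next_position++ unless exists $position{$word};
    }
    push @doc_counts, \%count;
}

# Turn a WORD => NUMBER hash into a vector over the master word list.
# Words that never occur in the collection have no position and are ignored.
sub to_vector {
    my ($count) = @_;
    my @vector = (0) x scalar( keys %position );
    for my $word ( keys %$count ) {
        next unless exists $position{$word};
        $vector[ $position{$word} ] = $count->{$word};
    }
    return \@vector;
}

my @doc_vectors = map { to_vector($_) } @doc_counts;

# A query is mapped into the same vector space as the documents.
my %query_count  = count_words('dog barked loudly');
my $query_vector = to_vector( \%query_count );

print join( ' ', @$_ ), "\n" for @doc_vectors, $query_vector;
```

Because the query vector has the same length and word order as the document vectors, it can be compared against each of them directly with a similarity measure such as the cosine.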
Maciej Ceglowski <maciej@ceglowski.com>
This program is free software, released under the GNU General Public
License.