TITLE

Search::VectorSpace - a very basic vector-space search engine

SYNOPSIS

        use Search::VectorSpace;
        
        my @docs = ...;
        my $engine = Search::VectorSpace->new( docs => \@docs, threshold => .04);
        $engine->build_index();
        
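        while ( my $query = <> ) {
                chomp $query;
                my %results = $engine->search( $query );
                # list the matching documents, most relevant first
                print "$_\n" for sort { $results{$b} <=> $results{$a} } keys %results;
        }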

DESCRIPTION

This module takes a list of English-language documents and builds a simple in-memory search engine using a vector space model. Documents are stored as PDL objects, and once the initial indexing phase is complete, searches should be very fast. The implementation applies a rudimentary stop list to filter out very common words and uses a cosine measure to score the similarity between the query and each document. All documents scoring above a user-configurable similarity threshold are returned.
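As a rough illustration of the cosine measure and the threshold cutoff described above, here is a minimal sketch in plain Perl. It is not the module's code: the module works on PDL objects, whereas this sketch assumes sparse term-count hashes, and the `cosine` helper, the sample documents, and their counts are all hypothetical.

```perl
use strict;
use warnings;

# Cosine similarity between two sparse term-count vectors (hash refs).
# Plain hashes stand in here for the PDL objects the module uses.
sub cosine {
    my ( $u, $v ) = @_;
    my ( $dot, $norm_u, $norm_v ) = ( 0, 0, 0 );
    $norm_u += $_ ** 2 for values %$u;
    $norm_v += $_ ** 2 for values %$v;
    for my $term ( keys %$u ) {
        $dot += $u->{$term} * $v->{$term} if exists $v->{$term};
    }
    return 0 unless $norm_u && $norm_v;    # an empty vector matches nothing
    return $dot / sqrt( $norm_u * $norm_v );
}

# Hypothetical, already-counted documents and query.
my %doc_vectors = (
    doc1 => { cat   => 2, sat    => 1, mat => 1 },
    doc2 => { dog   => 1, sat    => 1, log => 1 },
    doc3 => { stock => 3, market => 1 },
);
my %query     = ( cat => 1, mat => 1 );
my $threshold = 0.04;

# Keep only the documents whose score exceeds the threshold.
my %results;
for my $name ( keys %doc_vectors ) {
    my $score = cosine( \%query, $doc_vectors{$name} );
    $results{$name} = $score if $score > $threshold;
}

printf "%s %.3f\n", $_, $results{$_}
    for sort { $results{$b} <=> $results{$a} } keys %results;
# prints "doc1 0.866"
```

Because term counts are never negative, the score always falls between zero and one, which is why a threshold in that range works as a relevance cutoff.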

METHODS

new docs => ARRAYREF [, threshold => VALUE ]: Object constructor. Argument hash must contain a key 'docs' whose value is a reference to an array of documents. The hash can also contain an optional threshold setting, between zero and one, to serve as a relevance cutoff for search results.
build_index: Creates the document vectors and stores them in memory, along with a master word list for the document collection.
search QUERY: Returns all documents whose similarity to the QUERY string exceeds the relevance threshold. Unlike in conventional search engines, the query can be arbitrarily long and can contain arbitrary text; it is mapped into a query vector in the same way as the documents in the collection. Returns a hash of the form RESULT => RELEVANCE, where each relevance value is between zero and one.
get_words STRING: Rudimentary parser that splits the string on whitespace and removes punctuation. Returns a hash of the form WORD => NUMBER, where NUMBER is how many times the word was found.
stem WORD: Convenience wrapper for Lingua::Stem::stem().
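
The indexing and query-mapping steps described for build_index, search, and get_words can be sketched roughly as follows. This is an assumption-laden illustration, not the module's implementation: the helper names `count_words`, `build_word_index`, and `make_vector` are hypothetical, the stop list is invented for the example, stemming is left out, and plain Perl arrays are used where the module stores PDL objects.

```perl
use strict;
use warnings;

# Hypothetical stop list; the module's own list is not shown in this document.
my %stop = map { $_ => 1 } qw( a an and the of to in is it on );

# Count words in a string: split on whitespace, strip punctuation,
# skip stop words. Returns a hash of the form WORD => NUMBER.
sub count_words {
    my ($text) = @_;
    my %count;
    for my $word ( split ' ', lc $text ) {
        $word =~ s/[^a-z0-9]//g;    # ASCII-only simplification
        next if $word eq '' || $stop{$word};
        $count{$word}++;
    }
    return %count;
}

# Build the master word list: each distinct word gets one vector position.
sub build_word_index {
    my @docs = @_;
    my %index;
    for my $doc (@docs) {
        my %count = count_words($doc);
        for my $word ( sort keys %count ) {
            next if exists $index{$word};
            my $position = keys %index;    # next free slot
            $index{$word} = $position;
        }
    }
    return %index;
}

# Map text onto a dense count vector over the master word list.
# Words that never occur in the collection are ignored.
sub make_vector {
    my ( $text, $index ) = @_;
    my @vector = (0) x scalar keys %$index;
    my %count  = count_words($text);
    for my $word ( keys %count ) {
        $vector[ $index->{$word} ] = $count{$word} if exists $index->{$word};
    }
    return @vector;
}

my @docs  = ( 'The cat sat on the mat.', 'The dog sat on the log.' );
my %index = build_word_index(@docs);

# The query goes through the same mapping as the documents.
my @query_vector = make_vector( 'Cat, cat, and a zebra!', \%index );
print "@query_vector\n";    # prints "2 0 0 0 0"
```

In this sketch "zebra" is dropped because it never appears in the collection, so it has no position in the master word list; a query made up entirely of unknown words would map to an all-zero vector and match nothing.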

AUTHOR

Maciej Ceglowski <maciej@ceglowski.com>

This program is free software, released under the GNU General Public License.