VectorSpace(3) User Contributed Perl Documentation VectorSpace(3)

Search::VectorSpace - a very basic vector-space search engine

        use Search::VectorSpace;

        my @docs = ...;
        my $engine = Search::VectorSpace->new( docs => \@docs, threshold => .04 );
        $engine->build_index();

        # Read one query per line from STDIN or from files named on the command line.
        while ( my $query = <> ) {
                chomp $query;
                my %results = $engine->search( $query );
                print "$_\n" for keys %results;
        }

This module takes a list of English-language documents and builds a simple in-memory search engine using a vector space model. Documents are stored as PDL objects, and after the initial indexing phase, searches should be very fast. The implementation applies a rudimentary stop list to filter out very common words and uses a cosine measure to calculate the similarity between the query and each document. All documents whose similarity is above a user-configurable threshold are returned.
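The cosine measure and threshold described above can be sketched in plain Perl. This is only an illustration of the technique, not the module's implementation: it uses hashes instead of PDL objects, skips stemming, and the stop list, the helper names word_counts and cosine, and the sample documents are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical stop list; the module's real list is not shown on this page.
my %stop = map { $_ => 1 } qw(a an and the of to in is on);

# Count the non-stop words in a string (lowercased, punctuation stripped).
sub word_counts {
    my ($text) = @_;
    my %count;
    for my $word ( split ' ', lc $text ) {
        $word =~ s/[^a-z0-9]//g;
        next if $word eq '' or $stop{$word};
        $count{$word}++;
    }
    return \%count;
}

# Cosine of the angle between two sparse vectors stored as hash refs.
sub cosine {
    my ( $u, $v ) = @_;
    my ( $dot, $norm_u, $norm_v ) = ( 0, 0, 0 );
    $dot    += $u->{$_} * ( $v->{$_} // 0 ) for keys %$u;
    $norm_u += $_**2 for values %$u;
    $norm_v += $_**2 for values %$v;
    return 0 unless $norm_u && $norm_v;    # an empty vector matches nothing
    return $dot / sqrt( $norm_u * $norm_v );
}

# Made-up sample collection, query, and cutoff.
my @docs = ( 'The cat sat on the mat.', 'Dogs chase cats.', 'A cat and a mat.' );
my $threshold = 0.04;
my $query     = word_counts('cat mat');

# Print every document whose similarity to the query is above the cutoff.
for my $doc (@docs) {
    my $score = cosine( $query, word_counts($doc) );
    printf "%.3f  %s\n", $score, $doc if $score > $threshold;
}
```

Because word counts are never negative, the cosine always falls between zero and one, which is why a threshold in that range works as a relevance cutoff. Without stemming, "cats" and "cat" count as different words in this sketch.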

new docs => ARRAYREF [, threshold => VALUE ]
Object constructor. The argument hash must contain a key 'docs' whose value is a reference to an array of documents. It may also contain an optional 'threshold' setting, between zero and one, that serves as the relevance cutoff for search results.
build_index
Creates the document vectors and stores them in memory, along with a master word list for the document collection.
search QUERY
Returns all documents whose relevance to the QUERY string is above the configured threshold. Unlike with regular search engines, the query can be arbitrarily long and contain pretty much anything; it is mapped into a query vector in the same way as the documents in the collection. Returns a hash of the form RESULT => RELEVANCE, where each relevance value is between zero and one.
get_words STRING
Rudimentary parser that splits STRING on whitespace and removes punctuation. Returns a hash of the form WORD => NUMBER, where NUMBER is the number of times the word was found.
stem WORD
Convenience wrapper for Lingua::Stem::stem().
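Since search returns a plain hash of the form RESULT => RELEVANCE, the keys come back in no particular order. One way to list the most relevant documents first is sketched below. The helper name rank_results is hypothetical and not part of the module, and the hash contents are made-up stand-ins for real search output.

```perl
use strict;
use warnings;

# Order the keys of a RESULT => RELEVANCE hash from most to least relevant.
# Ties are broken by the result text so the order is reproducible.
sub rank_results {
    my (%results) = @_;
    return sort { $results{$b} <=> $results{$a} or $a cmp $b } keys %results;
}

# Illustrative stand-in for the hash returned by search(); scores are made up.
my %results = (
    'first document'  => 0.12,
    'second document' => 0.87,
    'third document'  => 0.45,
);

# Print each result with its relevance, best match first.
for my $doc ( rank_results(%results) ) {
    printf "%.2f  %s\n", $results{$doc}, $doc;
}
```

With a real engine, the hash passed to rank_results would be the return value of $engine->search( $query ).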

Maciej Ceglowski <maciej@ceglowski.com>

This program is free software, released under the GNU public license.

2003-12-19 perl v5.32.1
