AI::Categorizer::Document(3)    User Contributed Perl Documentation
AI::Categorizer::Document - Embodies a document
use AI::Categorizer::Document;
use AI::Categorizer::Learner::NaiveBayes;

# Simplest way to create a document:
my $d = AI::Categorizer::Document->new(name    => $string,
                                       content => $string);

# Other parameters are accepted:
my $d = AI::Categorizer::Document->new(name            => $string,
                                       categories      => \@category_objects,
                                       content         => { subject => $string,
                                                            body    => $string2, ... },
                                       content_weights => { subject => 3,
                                                            body    => 1, ... },
                                       stopwords       => \%skip_these_words,
                                       stemming        => $string,
                                       front_bias      => $float,
                                       use_features    => $feature_vector,
                                      );

# Specify explicit feature vector:
my $d = AI::Categorizer::Document->new(name => $string);
$d->features( $feature_vector );

# Now pass the document to a categorization algorithm:
my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
my $hypothesis = $learner->categorize($d);
The Document class embodies the data in a single document, and contains methods
for turning this data into a FeatureVector. Usually documents are plain text,
but subclasses of the Document class may handle any kind of data.
- new(%parameters)
- Creates a new Document object. Document objects are used during training
(for the training documents), testing (for the test documents), and when
categorizing new unseen documents in an application (for the unseen
documents). However, you'll typically call "new()" only in the last case,
since the KnowledgeSet or Collection classes create Document objects for
you during training and testing.
The "new()" method accepts the following parameters:
- name
- A string that identifies this document. Required.
- content
- The raw content of this document. May be specified as either a string or
as a hash reference, allowing structured document types.
- content_weights
- A hash reference indicating the weights that should be assigned to
features in different sections of a structured document when creating its
feature vector. The weight is a multiplier of the feature vector values.
For instance, if a "subject" section has
a weight of 3 and a "body" section has a
weight of 1, and word counts are used as feature vector values, then it
will be as if all words appearing in the
"subject" appeared 3 times.
If no weights are specified, all weights are set to 1.
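The weighting described above can be sketched in plain Perl. This is only an illustration under stated assumptions: "weighted_counts" is a hypothetical helper, not part of AI::Categorizer, and the tokenizer is a naive one chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of AI::Categorizer: count words per section
# and multiply each count by that section's weight (default weight 1).
sub weighted_counts {
    my ($content, $weights) = @_;
    my %features;
    for my $section (keys %$content) {
        my $w = exists $weights->{$section} ? $weights->{$section} : 1;
        # Naive tokenizer (an assumption): lowercased runs of word characters.
        $features{$_} += $w for map { lc $_ } $content->{$section} =~ /(\w+)/g;
    }
    return \%features;
}

my $fv = weighted_counts(
    { subject => 'Perl news', body => 'perl is fun' },
    { subject => 3, body => 1 },
);
# "perl" scores 3 (subject) + 1 (body) = 4; "news" scores 3; the rest score 1.
printf "%s => %d\n", $_, $fv->{$_} for sort keys %$fv;
```

With a "subject" weight of 3, each subject word counts as if it had appeared three times, which matches the description above.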
- front_bias
- Smoothly biases the weights of words in a document according to
their position. The value should be a number between -1 and 1. Positive
numbers indicate that words toward the beginning of the document should
have higher weight than words toward the end of the document. Negative
numbers indicate the opposite. A bias of 0 indicates that no biasing
should be done.
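To make the idea of positional bias concrete, here is a minimal sketch using a simple linear ramp. The ramp is an assumption made for illustration; this page does not specify the weighting function the module actually uses, and "position_weights" is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical linear ramp, NOT the module's formula: the first token gets
# weight 1 + bias, the last gets 1 - bias, and the mean weight stays 1.
sub position_weights {
    my ($n, $bias) = @_;
    die "bias must be between -1 and 1\n" if $bias < -1 || $bias > 1;
    return ((1) x $n) if $n < 2;    # nothing to ramp over
    return map { 1 + $bias * (1 - 2 * $_ / ($n - 1)) } 0 .. $n - 1;
}

my @tokens  = qw(breaking news about perl);
my @weights = position_weights(scalar @tokens, 0.5);
printf "%-8s %.2f\n", $tokens[$_], $weights[$_] for 0 .. $#tokens;
```

A positive bias favors early tokens, a negative bias favors late ones, and a bias of 0 leaves every weight at 1.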
- categories
- A reference to an array of Category objects that this document belongs to.
Optional.
- stopwords
- An array or hash reference of features (words) that should be ignored
when parsing document content. A hash reference is preferred, with the
features as the keys. If you pass an array reference containing the
features, it will be converted to a hash reference internally.
- use_features
- A Feature Vector specifying the only features that should be considered
when parsing this document. This is an alternative to using
"stopwords".
- stemming
- Indicates the linguistic procedure that should be used to convert tokens
in the document to features. Possible values are
"none", which indicates that the tokens
should be used without change, or
"porter", indicating that the Porter
stemming algorithm should be applied to each token. This requires the
"Lingua::Stem" module from CPAN.
- stopword_behavior
- There are a few ways you might want the stopword list (specified with the
"stopwords" parameter) to interact with
the stemming algorithm (specified with the
"stemming" parameter). These options can
be controlled with the
"stopword_behavior" parameter, which can
take the following values:
- no_stem
- Match stopwords against non-stemmed document words.
- stem
- Stem the stopwords according to the "stemming" parameter, then match them
against stemmed document words.
- pre_stemmed
- The stopwords are already stemmed; match them against stemmed document
words.
The default value is "stem",
which seems to produce the best results in most cases I've tried. I'm not
aware of any studies comparing the
"no_stem" behavior to the
"stem" behavior in the general case.
This parameter has no effect if there are no stopwords being used,
or if stemming is not being used. In the latter case, the list of stopwords
will always be matched as-is against the document words.
Note that if the "stem" option
is used, the data structure passed as the
"stopwords" parameter will be modified
in-place to contain the stemmed versions of the stopwords supplied.
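The three "stopword_behavior" values can be compared with a small standalone sketch. Everything here is an assumption made for illustration: "toy_stem" is a stand-in for the Porter algorithm that only strips a trailing "ing" or "s", and "filter_tokens" is a hypothetical helper, not the module's code.

```perl
use strict;
use warnings;

# Toy stemmer (an assumption standing in for the Porter algorithm):
# it only strips a trailing "ing" or "s".
sub toy_stem { my ($w) = @_; $w =~ s/(?:ing|s)$//; return $w }

# Hypothetical helper mirroring the three documented behaviors.
sub filter_tokens {
    my ($tokens, $stopwords, $behavior) = @_;
    if ($behavior eq 'no_stem') {
        # Drop stopwords from the raw tokens, then stem what is left.
        return map { toy_stem($_) } grep { !$stopwords->{$_} } @$tokens;
    }
    if ($behavior eq 'stem') {
        # Replace the caller's hash contents with stemmed stopwords, in place.
        %$stopwords = map { toy_stem($_) => 1 } keys %$stopwords;
    }
    # 'stem' and 'pre_stemmed': compare against stemmed document words.
    return grep { !$stopwords->{$_} } map { toy_stem($_) } @$tokens;
}

my @tokens = qw(being cats run);
my %stop   = (being => 1);
print join(' ', filter_tokens(\@tokens, { %stop }, 'no_stem')),     "\n";
print join(' ', filter_tokens(\@tokens, { %stop }, 'pre_stemmed')), "\n";
print join(' ', filter_tokens(\@tokens, \%stop,    'stem')),        "\n";
print join(' ', sort keys %stop), "\n";   # the hash now holds stemmed keys
```

In this sketch, an unstemmed stopword passed as "pre_stemmed" never matches the stemmed token, and the "stem" call leaves the caller's hash modified, as the note above warns.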
- read( path => $path )
- An alternative constructor method which reads a file on disk and returns a
document with that file's contents.
- parse( content => $content )
- name()
- Returns this document's "name" property
as specified when the document was created.
- features()
- Returns the Feature Vector associated with this document.
- categories()
- In a list context, returns a list of Category objects to which this
document belongs. In a scalar context, returns the number of such
categories.
- create_feature_vector()
- Creates this document's Feature Vector by parsing its content. You won't
call this method directly; it's called by "new()".
Ken Williams <ken@mathforum.org>
This distribution is free software; you can redistribute it and/or modify it
under the same terms as Perl itself. These terms apply to every file in the
distribution - if you have questions, please contact the author.