AI::Categorizer::KnowledgeSet - Encapsulates set of documents
use AI::Categorizer::KnowledgeSet;
my $k = new AI::Categorizer::KnowledgeSet(...parameters...);
my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
$nb->train(knowledge_set => $k);
The KnowledgeSet class provides an interface to a set of documents, a set of
categories, and a mapping between the two. Many parameters for controlling
the processing of documents are managed by the KnowledgeSet class.
- new()
- Creates a new KnowledgeSet and returns it. Accepts the following
parameters:
- load
- If a "load" parameter is present, the
"load()" method will be invoked
immediately. If the "load" parameter is
a string, it will be passed as the
"path" parameter to
"load()". If the
"load" parameter is a hash reference, it
will represent all the parameters to pass to
"load()".
- categories
- An optional reference to an array of Category objects representing the
complete set of categories in a KnowledgeSet. If used, the "documents"
parameter should also be specified.
- documents
- An optional reference to an array of Document objects representing the
complete set of documents in a KnowledgeSet. If used, the "categories"
parameter should also be specified.
- features_kept
- A number indicating how many features (words) should be considered when
training the Learner or categorizing new documents. May be specified as a
positive integer (e.g. 2000) indicating the absolute number of features to
be kept, or as a decimal between 0 and 1 (e.g. 0.2) indicating the
fraction of the total number of features to be kept, or as 0 to indicate
that no feature selection should be done and that the entire set of
features should be used. The default is 0.2.
- feature_selection
- A string indicating the type of feature selection that should be
performed. Currently the only option is also the default option:
"document_frequency".
- tfidf_weighting
- Specifies how document word counts should be converted to vector values.
Uses the three-character specification strings from Salton & Buckley's
paper "Term-weighting approaches in automatic text retrieval".
The three characters indicate the three factors that will be multiplied
for each feature to find the final vector value for that feature. The
default weighting is "xxx".
The first character specifies the "term frequency"
component, which can take the following values:
- b
- Binary weighting - 1 for terms present in a document, 0 for terms
absent.
- t
- Raw term frequency - equal to the number of times a feature occurs in the
document.
- x
- A synonym for 't'.
- n
- Normalized term frequency - 0.5 + 0.5 * t/max(t). This is the same as the
't' specification, but with term frequency normalized to lie between 0.5
and 1.
The second character specifies the "collection
frequency" component, which can take the following values:
- f
- Inverse document frequency - multiply term "t"'s value by "log(N/n)",
where "N" is the total number of documents in the collection, and "n" is
the number of documents in which term "t" is found.
- p
- Probabilistic inverse document frequency - multiply term "t"'s value by
"log((N-n)/n)" (same variable meanings as above).
- x
- No change - multiply by 1.
The third character specifies the "normalization"
component, which can take the following values:
- c
- Apply cosine normalization - multiply by 1/length(document_vector).
- x
- No change - multiply by 1.
The three components may alternatively be specified by the
"term_weighting", "collection_weighting", and "normalize_weighting"
parameters respectively.
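As an illustration (not code from this module), here is a rough sketch of
how an "nfc" weighting would compose the three factors for one document;
every variable name and value below is made up for the example:
    use strict;
    use warnings;

    # Placeholder inputs for a single document:
    my %tf = (soccer => 3, goal => 1, league => 2);      # raw term counts
    my %df = (soccer => 50, goal => 120, league => 80);  # docs containing term
    my $N  = 1000;                                       # docs in collection

    my $max_t = (sort { $b <=> $a } values %tf)[0];      # largest raw count

    # 'n' (normalized term frequency) times 'f' (inverse document frequency):
    my %raw;
    for my $term (keys %tf) {
        my $tf_part  = 0.5 + 0.5 * $tf{$term} / $max_t;  # 0.5 + 0.5 * t/max(t)
        my $idf_part = log($N / $df{$term});             # log(N/n)
        $raw{$term}  = $tf_part * $idf_part;
    }

    # 'c' (cosine normalization): divide by the vector's Euclidean length.
    my $len = 0;
    $len += $_ ** 2 for values %raw;
    $len = sqrt($len);
    my %weight;
    $weight{$_} = $raw{$_} / $len for keys %raw;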
- verbose
- If set to a true value, some status/debugging information will be output
on "STDOUT".
- categories()
- In a list context returns a list of all Category objects in this
KnowledgeSet. In a scalar context returns the number of such objects.
- documents()
- In a list context returns a list of all Document objects in this
KnowledgeSet. In a scalar context returns the number of such objects.
- document()
- Given a document name, returns the Document object with that name, or
"undef" if no such Document object exists in this KnowledgeSet.
- features()
- Returns a FeatureSet object which represents the features of all the
documents in this KnowledgeSet.
- verbose()
- Returns the "verbose" parameter of this
KnowledgeSet, or sets it with an optional argument.
- scan_stats()
- Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection. (XXX need to describe
stats)
- scan_features()
- This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and
training the Learner. This process is known as "feature
selection", and it's a very important part of categorization.
The Collection object should be specified as a "collection" parameter, or
by giving the arguments to pass to the Collection's "new()" method.
The process of feature selection is governed by the "feature_selection"
and "features_kept" parameters given to the KnowledgeSet's "new()" method.
This method returns the features as a FeatureVector whose values are the
"quality" of each feature, by whatever measure the "feature_selection"
parameter specifies. Normally you won't need to use the return value,
because this FeatureVector will become the "use_features" parameter of any
Document objects created by this KnowledgeSet.
- save_features()
- Given the name of a file, this method writes the features (as determined
by the "scan_features" method) to the file.
- restore_features()
- Given the name of a file written by "save_features", loads the features
from that file and passes them as the "use_features" parameter for any
Document objects created in the future by this KnowledgeSet.
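A sketch of the save/restore workflow, assuming the feature file name is
passed as the sole argument and that the collection arguments are the same
ones "read()" accepts (the paths below are placeholders):
    # First run: select features by scanning the collection, then save them.
    $k->scan_features(path => '/path/to/training_corpus');
    $k->save_features('features.save');

    # A later run: reuse the saved feature set instead of rescanning.
    $k->restore_features('features.save');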
- read()
- Iterates through a Collection of documents and adds them to the
KnowledgeSet. The Collection can be specified using a "collection"
parameter - otherwise, specify the arguments to pass to the "new()" method
of the Collection class.
- load()
- This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).
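For instance, assuming the default Collection class reads documents from a
directory, a one-step load might look like this (placeholder path):
    # Performs feature selection and reads the documents in a single call.
    $k->load(path => '/path/to/corpus');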
- add_document()
- Given a Document object as an argument, this method will add it, along
with any categories it belongs to, to the KnowledgeSet.
- make_document()
- This method will create a Document object with the given data and then
call "add_document()" to add it to the KnowledgeSet. A "categories"
parameter should specify an array reference containing a list of categories
by name. These are the categories that the document belongs to. Any other
parameters will be passed to the Document class's "new()" method.
- finish()
- This method will be called prior to training the Learner. Its purpose is
to perform any operations (such as feature vector weighting) that may
require examination of the entire KnowledgeSet.
- weigh_features()
- This method will be called during "finish()" to adjust the weights of the
features according to the "tfidf_weighting" parameter.
- document_frequency()
- Given a single feature (word) as an argument, this method will return the
number of documents in the KnowledgeSet that contain that feature.
- partition()
- Divides the KnowledgeSet into several subsets. This may be useful for
performing cross-validation. The relative sizes of the subsets should be
passed as arguments. For example, to split the KnowledgeSet into four
KnowledgeSets of equal size, pass the arguments .25, .25, .25 (the final
size is 1 minus the sum of the other sizes). The partitions will be
returned as a list.
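For instance, a sketch of a four-way split for cross-validation:
    # Split into four equal-sized KnowledgeSets; the last subset receives
    # whatever remains after the listed fractions.
    my @folds = $k->partition(0.25, 0.25, 0.25);
    my ($fold1, $fold2, $fold3, $fold4) = @folds;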
Ken Williams, ken@mathforum.org
Copyright 2000-2003 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.