NAME

AI::Categorizer - Automatic Text Categorization

SYNOPSIS

    use AI::Categorizer;
    my $c = AI::Categorizer->new(...parameters...);

    # Run a complete experiment - training on a corpus, testing on a test
    # set, printing a summary of results to STDOUT
    $c->run_experiment;

    # Or, run the parts of $c->run_experiment separately
    $c->scan_features;
    $c->read_training_set;
    $c->train;
    $c->evaluate_test_set;
    print $c->stats_table;

    # After training, use the Learner for categorization
    my $l = $c->learner;
    while (...) {
      my $d = ...create a document...
      my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
      # join() is parenthesized so that "\n" is not swallowed into the joined list
      print "Assigned categories: ", join(', ', $hypothesis->categories), "\n";
      print "Best category: ", $hypothesis->best_category, "\n";
    }

DESCRIPTION

"AI::Categorizer" is a framework for automatic text categorization. It consists of a collection of Perl modules that implement common categorization tasks, and a set of defined relationships among those modules. The various details are flexible - for example, you can choose what categorization algorithm to use, what features (words or otherwise) of the documents should be used (or how to automatically choose these features), what format the documents are in, and so on.

The basic process of using this module will typically involve obtaining a collection of pre-categorized documents, creating a "knowledge set" representation of those documents, training a categorizer on that knowledge set, and saving the trained categorizer for later use. There are several ways to carry out this process. The top-level "AI::Categorizer" module provides an umbrella class for high-level operations, or you may use the interfaces of the individual classes in the framework.

A simple sample script that reads a training corpus, trains a categorizer, and tests the categorizer on a test corpus, is distributed as eg/demo.pl .

Disclaimer: the results of any of the machine learning algorithms are far from infallible (close to fallible?). Categorization of documents is often a difficult task even for humans well-trained in the particular domain of knowledge, and there are many things a human would consider that none of these algorithms consider. These are only statistical tests - at best they are neat tricks or helpful assistants, and at worst they are totally unreliable. If you plan to use this module for anything really important, human supervision is essential, both of the categorization process and the final results.

For the usage details, please see the documentation of each individual module.

FRAMEWORK COMPONENTS

This section explains the major pieces of the "AI::Categorizer" object framework. We give a conceptual overview, but don't get into any of the details about interfaces or usage. See the documentation for the individual classes for more details.

A diagram of the various classes in the framework can be seen in "doc/classes-overview.png", and a more detailed view of the same thing can be seen in "doc/classes.png".

Knowledge Sets

A "knowledge set" is defined as a collection of documents, together with some information on the categories each document belongs to. Note that this term is fairly specific to this project - other sources may call it a "training corpus", or "prior knowledge". A knowledge set also contains some information on how documents will be parsed and how their features (words) will be extracted and turned into meaningful representations. In this sense, a knowledge set represents not only a collection of data, but a particular view on that data.

A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet" class. Before you can start playing with categorizers, you will have to start playing with knowledge sets, so that the categorizers have some data to train on. See the documentation for the "AI::Categorizer::KnowledgeSet" module for information on its interface.

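To make the idea of "a collection of data plus a particular view on that data" concrete, here is a minimal plain-Perl sketch. It does not use the "AI::Categorizer::KnowledgeSet" interface at all; the sample documents, the %knowledge_set layout, and the extract_features() helper are hypothetical names invented for this illustration only.

```perl
use strict;
use warnings;

# Hypothetical toy data: each document has raw text and its known categories.
my @documents = (
    { name => 'doc1', content => 'The striker scored a late goal',
      categories => ['sports'] },
    { name => 'doc2', content => 'Parliament passed the budget bill',
      categories => ['politics'] },
    { name => 'doc3', content => 'The minister praised the goal of the bill',
      categories => ['politics'] },
);

# The "view" on the data: how raw text is turned into features.
# Here a feature is a lowercased word and its weight is a raw count.
sub extract_features {
    my ($text) = @_;
    my %count;
    $count{ lc $_ }++ for $text =~ /\w+/g;
    return \%count;
}

# The toy knowledge set: per-document features plus a category index
# recording which documents belong to which category.
my %knowledge_set = ( features => {}, members => {} );
for my $doc (@documents) {
    $knowledge_set{features}{ $doc->{name} } = extract_features( $doc->{content} );
    push @{ $knowledge_set{members}{$_} }, $doc->{name}
        for @{ $doc->{categories} };
}

# List each category with the names of its member documents.
for my $cat ( sort keys %{ $knowledge_set{members} } ) {
    print "$cat: @{ $knowledge_set{members}{$cat} }\n";
}
```

Swapping extract_features() for a different tokenizer changes the view without touching the underlying documents, which is the distinction the text above draws.
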
Feature selection

Deciding which features are the most important is a very large part of the categorization task - you cannot simply consider all the words in all the documents when training, and all the words in the document being categorized. There are two main reasons for this - first, it would mean that your training and categorizing processes would take forever and use tons of memory, and second, the significant stuff of the documents would get lost in the "noise" of the insignificant stuff.

The process of selecting the most important features in the training set is called "feature selection". It is managed by the "AI::Categorizer::KnowledgeSet" class, and you will find the details of feature selection processes in that class's documentation.

Collections

Because documents may be stored in lots of different formats, a "collection" class has been created as an abstraction of a stored set of documents, together with a way to iterate through the set and return Document objects. A knowledge set contains a single collection object. A "Categorizer" doing a complete test run generally contains two collections, one for training and one for testing. A "Learner" can mass-categorize a collection.

The "AI::Categorizer::Collection" class and its subclasses instantiate the idea of a collection in this sense.

Documents

Each document is represented by an "AI::Categorizer::Document" object, or an object of one of its subclasses. Each document class contains methods for turning a bunch of data into a Feature Vector. Each document also has a method to report which categories it belongs to.

Categories

Each category is represented by an "AI::Categorizer::Category" object. Its main purpose is to keep track of which documents belong to it, though you can also examine statistical properties of an entire category, such as obtaining a Feature Vector representing an amalgamation of all the documents that belong to it.

Machine Learning Algorithms

There are lots of different ways to make the inductive leap from the training documents to unseen documents. The Machine Learning community has studied many algorithms for this purpose. To allow flexibility in choosing and configuring categorization algorithms, each such algorithm is a subclass of "AI::Categorizer::Learner". There are currently four categorizers included in the distribution:

Other machine learning methods that may be implemented soonish include Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts combiner for ensemble learning. No timetable for their creation has yet been set.

Please see the documentation of these individual modules for more details on their guts and quirks. See the "AI::Categorizer::Learner" documentation for a description of the general categorizer interface.

If you wish to create your own classifier, you should inherit from "AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are abstract classes that manage some of the work for you.

Feature Vectors

Most categorization algorithms don't deal directly with documents' data; instead, they deal with a vector representation of a document's features. The features may be any properties of the document that seem helpful for determining its category, but they are usually some version of the "most important" words in the document. A list of features and their weights in each document is encapsulated by the "AI::Categorizer::FeatureVector" class. You may think of this class as roughly analogous to a Perl hash, where the keys are the names of features and the values are their weights.

Hypotheses

The result of asking a categorizer to categorize a previously unseen document is called a hypothesis, because it is some kind of "statistical guess" of what categories this document should be assigned to. Since you may be interested in any of several pieces of information about the hypothesis (for instance, which categories were assigned, which category was the single most likely category, the scores assigned to each category, etc.), the hypothesis is returned as an object of the "AI::Categorizer::Hypothesis" class, and you can use its object methods to get information about the hypothesis. See its class documentation for the details.

Experiments

The "AI::Categorizer::Experiment" class helps you organize the results of categorization experiments. As you get lots of categorization results (Hypotheses) back from the Learner, you can feed these results to the Experiment class, along with the correct answers. When all results have been collected, you can get a report on accuracy, precision, recall, F1, and so on, with both micro-averaging and macro-averaging over categories. We use the "Statistics::Contingency" module from CPAN to manage the calculations. See the docs for "AI::Categorizer::Experiment" for more details.

METHODS

HISTORY

This module is a revised and redesigned version of the previous "AI::Categorize" module by the same author. Note the added 'r' in the new name. The older module has a different interface, and no attempt at backward compatibility has been made - that's why I changed the name.

You can have both "AI::Categorize" and "AI::Categorizer" installed at the same time on the same machine, if you want. They don't know about each other or use conflicting namespaces.

AUTHOR

Ken Williams <ken@mathforum.org>

Discussion about this module can be directed to the perl-AI list at <perl-ai@perl.org>. For more info about the list, see http://lists.perl.org/showlist.cgi?name=perl-ai

REFERENCES

An excellent introduction to the academic field of Text Categorization is Fabrizio Sebastiani's "Machine Learning in Automated Text Categorization": ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.

COPYRIGHT

Copyright 2000-2003 Ken Williams. All rights reserved.

This distribution is free software; you can redistribute it and/or modify it under the same terms as Perl itself. These terms apply to every file in the distribution - if you have questions, please contact the author.