NAME
    Algorithm::NaiveBayes - Bayesian prediction of categories

SYNOPSIS
      use Algorithm::NaiveBayes;
      my $nb = Algorithm::NaiveBayes->new;

      $nb->add_instance
        (attributes => {foo => 1, bar => 1, baz => 3},
         label => 'sports');

      $nb->add_instance
        (attributes => {foo => 2, blurp => 1},
         label => ['sports', 'finance']);

      ... repeat for several more instances, then:

      $nb->train;

      # Find results for unseen instances
      my $result = $nb->predict
        (attributes => {bar => 3, blurp => 2});

DESCRIPTION
    This module implements the classic "Naive Bayes" machine learning
    algorithm. It is a well-studied probabilistic algorithm often used in
    automatic text categorization. Compared to other algorithms (kNN, SVM,
    Decision Trees), it's pretty fast and reasonably competitive in the
    quality of its results.

    A paper by Fabrizio Sebastiani provides a really good introduction to
    text categorization:
    <http://faure.iei.pi.cnr.it/~fabrizio/Publications/ACMCS02.pdf>

METHODS
    new()
        Creates a new Algorithm::NaiveBayes object and returns it.

    add_instance( attributes => HASH, label => STRING|ARRAY )
        Adds a training instance to the categorizer. The attributes
        parameter is a reference to a hash mapping attribute names to
        numeric values. The label parameter is a single label, or a
        reference to an array of labels when the instance belongs to more
        than one category.

    train()
        Computes the probabilities that predict() will need, from the
        instances added so far.

    predict( attributes => HASH )
        Predicts the categories of an unseen instance. The attributes
        parameter takes the same form as in add_instance(). Returns a
        reference to a hash whose keys are labels and whose values are
        scores indicating how strongly each label applies.
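    The following is a minimal end-to-end sketch of the workflow above,
    printing every label with its score. The training data and the
    sorted-output loop are illustrative choices, not part of the module's
    API:

      use strict;
      use warnings;
      use Algorithm::NaiveBayes;

      my $nb = Algorithm::NaiveBayes->new;

      # Attribute values here are word counts; labels name categories.
      $nb->add_instance(attributes => {ball => 2, goal => 1}, label => 'sports');
      $nb->add_instance(attributes => {stock => 3, bank => 1}, label => 'finance');
      $nb->train;

      # predict() returns a hash reference of label => score.
      my $result = $nb->predict(attributes => {goal => 1, stock => 1});
      for my $label (sort { $result->{$b} <=> $result->{$a} } keys %$result) {
          printf "%-10s %.3f\n", $label, $result->{$label};
      }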
THEORY
    Bayes' Theorem is a way of inverting a conditional probability. It
    states:

                 P(y|x) P(x)
        P(x|y) = -----------
                    P(y)

    The notation P(x|y) means "the probability of x given y". See also
    <http://mathforum.org/dr.math/problems/battisfore.03.22.99.html> for a
    simple but complete example of Bayes' Theorem.

    In this case, we want to know the probability of a given category given
    a certain string of words in a document, so we have:

                         P(words | cat) P(cat)
        P(cat | words) = ---------------------
                               P(words)

    We have applied Bayes' Theorem because P(cat | words) is a difficult
    quantity to compute directly, but P(words | cat) and P(cat) are
    accessible (see below).

    The greater the expression above, the greater the probability that the
    given document belongs to the given category. So we want to find the
    maximum value. We write this as

                                       P(words | cat) P(cat)
        Best category =   ArgMax       ---------------------
                         cat in cats         P(words)

    Since P(words) doesn't change over the range of categories, we can get
    rid of it. That's good, because we didn't want to have to compute these
    values anyway. So our new formula is:

        Best category =   ArgMax       P(words | cat) P(cat)
                         cat in cats

    Finally, we note that if w1, w2, ... wn are the words in the document,
    then this expression is equivalent to:

        Best category =   ArgMax       P(w1|cat)*P(w2|cat)*...*P(wn|cat)*P(cat)
                         cat in cats

    That's the formula I use in my document categorization code. The last
    step is the only non-rigorous one in the derivation, and this is the
    "naive" part of the Naive Bayes technique. It assumes that the
    probability of each word appearing in a document is unaffected by the
    presence or absence of each other word in the document. We assume this
    even though we know it isn't true: for example, the word "iodized" is
    far more likely to appear in a document that contains the word "salt"
    than it is to appear in a document that contains the word "subroutine".
    Luckily, as it turns out, making this assumption even when it isn't
    true may have little effect on our results, as the following paper by
    Pedro Domingos argues:
    <http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz>
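    To make the final formula concrete, here is a short self-contained Perl
    sketch of the ArgMax computation. The tables %p_cat and
    %p_word_given_cat hold made-up numbers standing in for what train()
    estimates from real counts, and the sketch assumes every word appears
    in the vocabulary; the product is computed as a sum of logarithms to
    avoid floating-point underflow when many probabilities are multiplied:

      use strict;
      use warnings;

      # Hypothetical model parameters, for illustration only.
      my %p_cat = (sports => 0.5, finance => 0.5);
      my %p_word_given_cat = (
          sports  => { ball => 0.4, goal => 0.4, stock => 0.1, bank => 0.1 },
          finance => { ball => 0.1, goal => 0.1, stock => 0.4, bank => 0.4 },
      );

      # Best category =   ArgMax    P(w1|cat)*...*P(wn|cat)*P(cat),
      #                 cat in cats
      # computed in log space.
      sub best_category {
          my @words = @_;
          my ($best, $best_score);
          for my $cat (keys %p_cat) {
              my $score = log $p_cat{$cat};
              $score += log $p_word_given_cat{$cat}{$_} for @words;
              ($best, $best_score) = ($cat, $score)
                  if !defined $best_score || $score > $best_score;
          }
          return $best;
      }

      print best_category(qw(goal stock stock)), "\n";   # prints "finance"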
HISTORY
    My first implementation of a Naive Bayes algorithm was in the
    now-obsolete AI::Categorize module, first released in May 2001. I
    replaced it with the Naive Bayes implementation in AI::Categorizer
    (note the extra 'r'), first released in July 2002. I then extracted
    that implementation into its own module that could be used outside the
    framework, and that's what you see here.

AUTHOR
    Ken Williams, ken@mathforum.org

COPYRIGHT
    Copyright 2003-2004 Ken Williams. All rights reserved.

    This library is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.

SEE ALSO
    AI::Categorizer(3), perl(1).