NAME
dbacl - a digramic Bayesian classifier for text recognition

SYNOPSIS
dbacl [-01dvnirmwMNDXW] [-T type] -l category [-h size] [-H gsize] [-x decim] [-q quality] [-w max_order] [-e deftok] [-o online] [-L measure] [-z ftresh] [-O ronline]... [-g regex]... [FILE]...

dbacl [-vnimNRXYP] [-h size] [-T type] -c category [-c category]... [-f keep]... [FILE]...

dbacl -V

OVERVIEW
dbacl is a Bayesian text and email classifier. When using the -l switch, it learns a body of text and produces a file named category which summarizes the text. When using the -c switch, it compares an input text stream with any number of category files, and outputs the name of the closest match, or optionally various numerical scores explained below.

Whereas this manual page is intended as a reference, there are several tutorials and documents you can read for more specialized information. Documentation about the design of dbacl and the statistical models it uses can be found in dbacl.ps. For a basic overview of text classification with dbacl, see tutorial.html. A companion tutorial geared towards email filtering is email.html. If you have trouble getting dbacl to classify reliably, read is_it_working.html. The USAGE section of this manual page also has some examples.
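For instance, the two command forms are typically used as follows (the corpus files and category names here are purely illustrative; see USAGE for fuller examples):

% dbacl -l sports Sports_Articles.txt
% dbacl -l finance Finance_Articles.txt
% echo "the team won the cup final" | dbacl -v -c sports -c finance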
dbacl uses a maximum entropy (minimum divergence) language model constructed with respect to a digramic reference measure: unknown tokens are predicted from digrams, i.e. pairs of letters. Practically, this means that a category is constructed from the tokens in the training set, while previously unseen tokens can still be predicted from their letters. A token here is either a word (fragment) or a combination of words (fragments), selected according to various switches. Learning works, roughly, by tweaking token probabilities until the training data is least surprising.

EXIT STATUS
The normal shell exit conventions aren't followed (sorry!). When using the -l command form, dbacl returns zero on success and nonzero if an error occurs. When using the -c form, dbacl returns a positive integer (1, 2, 3...) corresponding to the category with the highest posterior probability. In case of a tie, the first most probable category is chosen. If an error occurs, dbacl returns zero.

DESCRIPTION
When using the -l command form, dbacl learns a category when given one or more FILE names, which should contain readable ASCII text. If no FILE is given, dbacl learns from STDIN. If FILE is a directory, it is opened and all the files inside it are read, but not its subdirectories. The result is saved in the binary file named category, and completely replaces any previous contents. As a convenience, if the environment variable DBACL_PATH contains a directory, it is prepended to the file path, unless category starts with a '/' or a '.'.

The input text for learning is assumed to be unstructured plain text by default. This is not suitable for learning email, because email contains various transport encodings and formatting instructions which can reduce classification effectiveness. You must use the -T switch in that case, so that dbacl knows it should decode and filter MIME and HTML as appropriate. Appropriate switch values are "-T email" for RFC 2822 email input, "-T html" for HTML input, "-T xml" for generic XML style input, and "-T text" for the default plain text format. Other values of the -T switch allow finer tuning of the decoding capabilities.

When using the -c command form, dbacl attempts to classify the text found in FILE, or STDIN if no FILE is given. Each possible category must be given separately, and should be the file name of a previously learned text corpus. As a convenience, if the variable DBACL_PATH contains a directory, it is prepended to each file path which doesn't start with a '/' or a '.'.

The visible output of the classification depends on the combination of extra switches used. If no switch is used, then no output is shown on STDOUT. However, dbacl always produces an exit code which can be tested.
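A classification script can therefore dispatch on the exit status directly. A minimal sketch, in which the category names and message file are illustrative:

dbacl -T email -c work -c personal -c spam message.txt
case $? in
    1) dest=work ;;        # first -c category
    2) dest=personal ;;    # second -c category
    3) dest=spam ;;        # third -c category
    0) dest=error ;;       # 0 signals an error
esac
echo "$dest"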
To see any output for a classification, you must use at least one of the -v, -U, -n, -N, -D, -d switches. Sometimes they can be used in combination to produce a natural variation of their individual outputs. dbacl also produces warnings on STDERR where applicable.

The -v switch outputs the name of the best category among all the choices given.

The -U switch outputs the name of the best category followed by a confidence percentage. Normally, this is the switch you want to use. A percentage of 100% means that dbacl is sure of its choice, while a percentage of 0% means that some other category is equally likely. This is not the model probability; it measures how unambiguous the classification is, and can be used to tag unsure classifications (e.g. if the confidence is 25% or less).

The -N switch prints each category name followed by its (posterior) probability, expressed as a percentage. The percentages always sum to 100%. This is intuitive, but only valuable if the document being classified contains a handful of tokens (ten or less). In the common case with many more tokens, the probabilities are always extremely close to 100% and 0%.

The -n switch prints each category name followed by the negative logarithm of its probability. This is equivalent to using the -N switch, but much more useful. The smallest number gives the best category. A more convenient form is to use both -n and -v, which prints each category name followed by the cross entropy and the number of tokens analyzed. The cross entropy measures (in bits) the average compression rate achievable, under the given category model, per token of input text.

If you use all three of -n, -v, -X, then an extra value is output for each category, representing a kind of p-value for the category score. This indicates how typical the score is compared to the training documents, but it only works if the -X switch was used during learning, and only for some types of models (e.g. email). These p-values are uniformly distributed and independent (if the categories are independent), so they can be combined using Fisher's chi squared test to obtain composite p-values for groupings of categories.

The -v and -X switches together print each category name followed by a detailed decomposition of the category score, factored as ( divergence rate + shannon entropy rate )* token count @ p-value. Again, this only works in some types of models. The -v and -U switches together print each category name followed by a decomposition of the category score into ( divergence rate + shannon entropy rate # score variance )* token count.

The -D switch prints out the input text as modified internally by dbacl prior to tokenization. For example, if a MIME encoded email document is classified, this prints the decoded text that will actually be tokenized and classified. This switch is mainly useful for debugging.

The -d switch dumps tokens and scores while they are being read. It is useful for debugging, or if you want to create graphical representations of the classification. A detailed explanation of the output is beyond the scope of this manual page, but is straightforward if you've read dbacl.ps. Possible variations include -d together with -n or -N.

Classification can in principle be done with one or several categories. When two or more categories are used, the Bayesian posterior probability given the input text is used, with a uniform prior distribution on categories. For other choices of prior, see the companion utility bayesol(1). When a single category is used, classification can be done by comparing the score with a threshold. In practice, however, much better results are obtained with several categories.

Learning and classifying cannot be mixed on the same command invocation, but there are no locking issues, and separate dbacl processes can operate simultaneously with obvious results, because file operations are designed to be atomic.

Finally, note that dbacl does not manage your document corpora or your computed categories. In particular, dbacl cannot add or subtract a document from a category file directly. If you want to learn a category incrementally, the standard way is to keep adding to your document corpus and to learn the whole corpus each time.
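For example, a simple corpus-based workflow keeps one directory of training documents per category and relearns the whole directory after every change (the paths here are illustrative):

% cp suspicious_message.txt ~/corpus/spam/
% dbacl -T email -l spam ~/corpus/spam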
By keeping control of your archives, you can never lose the information in your categories, and you can easily experiment with different switches, tokenizations or sets of training documents.

If the standard incremental learning method is too slow, the -o switch can help. It creates a data file named online which contains all the document statistics that have been learned. When you use the -l and -o switches together, dbacl merges the online data file (if it exists) with the new document(s) to be learned, and recreates an updated version of online. This is equivalent to adding the new documents to the corpus and relearning the whole corpus, but faster. However, documents cannot be removed if you change your mind; this is a limitation which cannot be lifted for mathematical reasons. You can work around it by making backups of the online data file. It is also possible to merge one or more extra online data files simultaneously by using the -O switch one or more times.

SECONDARY SWITCHES
By default, dbacl classifies the input text as a whole, i.e. it outputs only a single result even if you specify several input files. If you want to classify multiple input files, you can either call dbacl repeatedly (which is fast when you use the -m switch), or use the -F switch, which prints each input FILE followed by the result for that FILE. Alternatively, you can classify each line of the input individually by using the -f option, which prints only those lines that match one or more of the models identified by keep (use the category name or number to refer to a category). This last switch is useful if you want to filter out some lines, but note that if the lines are short, the error rate can be high.

The -e, -w, -g, -j switches are used for selecting an appropriate tokenization scheme. A token is a word, a word fragment, or a combination of words or fragments. The shape of tokens is important because it forms the basis of the language models used by dbacl. The -e switch selects a predefined tokenization scheme, which is speedy but limited. The -w switch specifies composite tokens derived from the -e switch. For example, "-e alnum -w 2" means that tokens are alphanumeric word fragments combined into overlapping pairs (bigrams). When the -j switch is used, all tokens are converted to lowercase, which reduces the number of possible tokens and therefore memory consumption. With the -g switch, you can specify completely what the tokens should look like using a regular expression. Several -g switches can be used to construct complex tokenization schemes, and parentheses within each expression select fragments and combine them into n-grams. The cost of such flexibility is reduced classification and learning speed. When experimenting with tokenization schemes, try using the -d or -D switches while learning or classifying, as they print the tokens explicitly so you can see which text fragments are picked up or missed. For regular expression syntax, see regex(7).

The -h and -H switches regulate how much memory dbacl may use for learning. Text classification can use a lot of memory, and by default dbacl limits itself, even at the expense of learning accuracy. In many cases, if a limit is reached, a warning message with some advice is printed on STDERR. When relearning the same category several times, a significant speedup can be obtained by using the -1 switch, as this allows the previously learned probabilities to be read from the category file and reused.
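Putting several of these switches together, a learning run over a larger corpus might look like the following sketch (the category name, corpus directory and table sizes are illustrative):

% dbacl -T email -e alnum -w 2 -j -h 17 -H 21 -l mail ~/corpus/mail
% dbacl -T email -e alnum -w 2 -j -d -l mail ~/corpus/mail | less

The second command relearns the same corpus while dumping the tokens it reads, which makes it easy to check what the chosen tokenization switches actually pick up.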
Note that classification accuracy depends foremost on the amount and quality of the training samples, and only then on the amount of switch tweaking.

OPTIONS
USAGE
To create two category files in the current directory from two ASCII text files named Mark_Twain.txt and William_Shakespeare.txt respectively, type:

% dbacl -l twain Mark_Twain.txt
% dbacl -l shake William_Shakespeare.txt
Now you can classify input text, for example:

% echo "howdy" | dbacl -v -c twain -c shake
Note that at least the -v option is necessary, otherwise dbacl does not print anything. The return value identifies the best category: 1 for twain (the first -c category), 2 for shake (the second).

% echo "to be or not to be" | dbacl -v -n -c twain -c shake
% echo "to be or not to be" | dbacl -v -N -c twain -c shake

These two invocations are equivalent. In the -v -n output, the numbers 6.74 and 7.04 are the cross entropies, measuring how close the average token is to each category, and 6.0 is the number of tokens observed. If you want to print a simple confidence value together with the best category, replace -v with -U.

% echo "to be or not to be" | dbacl -U -c twain -c shake
Note that the true probability of category shake versus category twain is 77.37%, but the calculation is somewhat ambiguous, and the confidence printed by -U (34% here) expresses, out of 100%, how likely it is that the calculation is qualitatively correct.

Suppose a file document.txt contains English text lines interspersed with noise lines. To filter the noise lines out from the English lines, assuming you have an existing category shake, type:

% dbacl -c shake -f shake -R document.txt > document.txt_eng
Note that the quality of the results will vary depending on how well the categories shake and random represent each input line. It is sometimes useful to see the posterior probabilities for each line without filtering:

% dbacl -c shake -f shake -RN document.txt > document.txt_probs

You can now postprocess the posterior probabilities for each line of text with another script, to replicate an arbitrary Bayesian decision rule of your choice.

In the special case of exactly two categories, the optimal Bayesian decision procedure can be implemented for documents as follows. Let p1 be the prior probability that the input text is classified as category1; consequently, the prior probability of classifying it as category2 is 1 - p1. Let u12 be the cost of misclassifying a category1 input text as belonging to category2, and u21 the cost of the opposite mistake. We assume there is no cost for classifying correctly. Then the following command implements the optimal Bayesian decision (where p1, u12 and u21 stand for their numerical values):

% dbacl -n -c category1 -c category2 | awk '{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1; } else { print $3; } }'

dbacl can also be used in conjunction with procmail(1) to implement a simple Bayesian email classification system. Assume that incoming mail should be automatically delivered to one of three mail folders located in $MAILDIR and named work, personal, and spam. Initially, these must be created and filled with appropriate sample emails. A crontab(1) file can then be used to learn the three categories once a day.
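A sketch of such a crontab, assuming the three mail folders live under $HOME/mail and the categories are kept under $HOME/.dbacl; the schedule and the paths are illustrative:

# relearn each category from its mail folder once a day
5  0 * * * dbacl -T email -l $HOME/.dbacl/work     $HOME/mail/work
10 0 * * * dbacl -T email -l $HOME/.dbacl/personal $HOME/mail/personal
15 0 * * * dbacl -T email -l $HOME/.dbacl/spam     $HOME/mail/spam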
To automatically deliver each incoming email into the appropriate folder, a procmailrc(5) recipe can be used.
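A sketch of such a recipe, assuming the categories were learned into $HOME/.dbacl as above; the capture syntax and recipe flags shown here are illustrative and should be checked against procmailrc(5):

CATS=$HOME/.dbacl

# run the spam classifier; -v makes dbacl print the name of the best
# category, which is captured in the variable YAY
:0
YAY=| dbacl -v -T email -c $CATS/work -c $CATS/personal -c $CATS/spam

# send to the appropriate mailbox
:0:
* YAY ?? work
$MAILDIR/work

:0:
* YAY ?? personal
$MAILDIR/personal

:0:
* YAY ?? spam
$MAILDIR/spam

# anything unmatched falls through to the rest of the procmailrc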
Sometimes, dbacl will send an email to the wrong mailbox. In that case, the misclassified message should be removed from the wrong destination and placed in the correct mailbox. The error will be corrected the next time your messages are learned; if the message is left in the wrong category, dbacl will learn the wrong corpus statistics.

The default text features (tokens) read by dbacl are purely alphabetic strings, which minimizes memory requirements but can be unrealistic in some cases. To construct models based on alphanumeric tokens, use the -e switch. The example below also uses the optional -D switch, which prints a list of the actual tokens found in the document:

% dbacl -e alnum -D -l twain Mark_Twain.txt | less

It is also possible to override the default feature selection method used to learn the category model by means of regular expressions. For example, the following duplicates the default feature selection method in the C locale, while being much slower:

% dbacl -l twain -g '^([[:alpha:]]+)' -g '[^[:alpha:]]([[:alpha:]]+)' Mark_Twain.txt

The resulting category twain depends only on single alphabetic words in the text file Mark_Twain.txt (plus computed digram statistics for prediction). As a second example, the following command builds a smoothed Markovian (word bigram) model which depends on pairs of consecutive words within each line (pairs cannot straddle a line break):

% dbacl -l twain2 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2' -g '(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23' Mark_Twain.txt

More general, line based, n-gram models of all orders (up to 7) can be built in a similar way. To construct paragraph based models, you should reformat the input corpora with awk(1) or sed(1) to obtain one paragraph per line. Line size is limited by available memory, but note that regex performance degrades quickly for long lines.

PERFORMANCE
The underlying assumption of statistical learning is that a relatively small number of training documents can represent a much larger set of input documents, so that in the long run, learning can grind to a halt without serious impact on classification accuracy. While not strictly true in reality, this assumption is surprisingly accurate for problems such as email filtering. In practice, this means that a well chosen corpus on the order of ten thousand documents is sufficient for highly accurate results for years, and continual learning beyond such a critical mass yields diminishing returns. Of course, when real world input document patterns change dramatically, the predictive power of the models can be lost. At the other end of the scale, a few hundred documents already give acceptable results in most cases.

dbacl is heavily optimized for the case of frequent classifications but infrequent batch learning, which is the long run optimum described above. Under ideal conditions, dbacl can classify a hundred emails per second on low end hardware (a 500MHz Pentium III). Learning is not very much slower, but effectively takes much longer for large document collections, for various reasons. When using the -m switch, data structures are aggressively mapped into memory if possible, reducing overheads for both I/O and memory allocations. dbacl throws away its input as soon as possible, and has no limits on the input document size. Both classification and learning speed are directly proportional to the number of tokens in the input, but learning also requires a nonlinear optimization step which takes time proportional to the number of unique tokens discovered.
At the time of writing, dbacl is one of the fastest open source mail filters in its optimal usage scenario, but it uses more memory for learning than other filters.

MULTIPLE PROCESSES AND DATA CORRUPTION
When saving category files, dbacl first writes out a temporary file in the same location, and renames it afterwards. If a problem or crash occurs during learning, the old category file is therefore left untouched. This ensures that categories can never be corrupted, no matter how many processes try to learn or classify simultaneously, and means that valid categories are available for classification at any time.

When using the -m switch, file contents are memory mapped for speedy reading and writing. This, together with the -o switch, is intended mainly for testing purposes, when tens of thousands of messages must be learned and scored in a laboratory setting to measure dbacl's accuracy. Because no file locking is attempted, for performance reasons, corruption is possible unless you make sure that only one dbacl process reads or writes any given file at any given time. This (-m and -o together) is the only case in which corruption is possible.

MEMORY USE
When classifying a document, dbacl loads all indicated categories into RAM, so the total memory needed is approximately the sum of the category file sizes plus a small fixed overhead. The input document is consumed while being read, so its size doesn't matter, but very long lines can take up space. When using the -m switch, the categories are read using mmap(2) where available.

When learning, dbacl keeps a large structure in memory which contains many objects that won't be saved into the output category. The size of this structure is proportional to the number of unique tokens read, but not to the size of the input documents, since they are discarded while being read. As a rough guide, this structure is 4x-5x the size of the final category file that is produced.

To prevent unchecked memory growth, dbacl by default allocates a fixed, smallish amount of memory for tokens. When this space is used up, further tokens are discarded, which skews the learned category and makes it less usable as more tokens are dropped. A warning is printed on STDERR in that case. The -h switch lets you fix the initial size of the token space in powers of 2, i.e. "-h 17" means 2^17 = 131072 possible tokens. If you type "dbacl -V", you can see the number of bytes needed for each token when either learning or classifying. Multiply this number by the maximum number of possible tokens to estimate the memory needed for learning. The -H switch lets dbacl grow its tables automatically if and when needed, up to the maximum specified, so if you type "-H 21", the initial size will be doubled repeatedly if necessary, up to approximately two million unique tokens. When learning with the -X switch, a handful of input documents are also kept in RAM throughout.

ENVIRONMENT
DBACL_PATH
When this variable contains a directory, it is prepended to every category file path which does not start with a '/' or a '.', both when learning and when classifying.
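For example, with categories stored under $HOME/.dbacl (the paths and category names are illustrative):

% export DBACL_PATH=$HOME/.dbacl
% dbacl -T email -U -c work -c personal -c spam message.txt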
SIGNALS
NOTES
dbacl generated category files are in binary format, and may or may not be portable to systems with a different byte order architecture (this depends on how dbacl was compiled). The -V switch prints out whether categories are portable, or else you can simply experiment.

dbacl does not recognize functionally equivalent regular expressions, and in such a case duplicate features will be counted several times.

With every learned category, the command line options that were used are saved. When classifying, make sure that every relevant category was learned with the same set of options (regexes are allowed to differ), otherwise the behaviour is undefined. There is no need to repeat all the switches when classifying.

If you get many digitization warnings, then you are trying to learn too much data at once, or your model is too complex. dbacl is compiled to save memory by digitizing final weights, but you can disable digitization by editing dbacl.h and recompiling.

dbacl offers several built-in tokenizers (see the -e switch), with more to come in future versions as the author invents them. While the default tokenizer may evolve, no tokenizer should ever be removed, so you can always simulate previous dbacl behaviour, subject to bug fixes and architectural changes.

The confidence estimates obtained through the -X switch are underestimates, i.e. they are more conservative than they should be.

BUGS
"Ya know, some day scientists are gonna invent something that will outsmart a rabbit." (Robot Rabbit, 1953)

SOURCE
The source code for the latest version of this program is available at the following location:

http://www.lbreyer.com/gpl.html
AUTHOR
Laird A. Breyer <laird@lbreyer.com>

SEE ALSO
awk(1), bayesol(1), crontab(1), hmine(1), hypex(1), less(1), mailcross(1), mailfoot(1), mailinspect(1), mailtoe(1), procmailex(5), regex(7), sed(1), stty(1)