|
NAMETree::Trie - A data structure optimized for prefix lookup.SYNOPSISuse Tree::Trie; use strict; my($trie) = new Tree::Trie; $trie->add(qw[aeode calliope clio erato euterpe melete melpomene mneme polymnia terpsichore thalia urania]); my(@all) = $trie->lookup(""); my(@ms) = $trie->lookup("m"); $" = "--"; print "All muses: @all\nMuses beginning with 'm': @ms\n"; my(@deleted) = $trie->remove(qw[calliope thalia doc]); print "Deleted muses: @deleted\n"; DESCRIPTIONThis module implements a trie data structure. The term "trie" comes from the word retrieval, but is generally pronounced like "try". A trie is a tree structure (or directed acyclic graph), the nodes of which represent letters in a word. For example, the final lookup for the word 'bob' would look something like "$ref->{'b'}{'o'}{'b'}{'00'}" (the 00 being an end marker). Only nodes which would represent words in the trie exist, making the structure slightly smaller than a hash of the same data set.The advantages of the trie over other data storage methods is that lookup times are O(1) WRT the size of the index. For sparse data sets, it is probably not as efficient as performing a binary search on a sorted list, and for small files, it has a lot of overhead. The main advantage (at least from my perspective) is that it provides a relatively cheap method for finding a list of words in a large, dense data set which begin with a certain string. The term "word" in this documentation can refer to one of two things: either a reference to an array of strings, or a scalar which is not a reference. In the case of the former, each element of the array is treated as a "letter" of the "word". In the case of the latter, the scalar is evaluated in string context and it is split into its component letters. Return values of methods match the values of what is passed in -- that is, if you call lookup() with an array reference, the return value will be an array reference (if appropriate). NOTE: The return semantics of the lookup_data method have CHANGED from version 1.0 to version 1.1. If you use this method, be sure to see the perldoc on that method for details. METHODS
End MarkersOverviewThe following discussion is only important for those people using multi-character letters, or words as array references. If you are just using this module with words as simple strings, you may disregard this section.First, it's important to understand how data is stored in the trie. As described above, the trie structure is basically just a complicated hash of hashes, with each key of each has being a letter. There needs to be a distinct way of determining when we're at the end of a word; we can't just use the end of the hash structure as a guide, because we need to distinguish between the word "barn" being in the trie and the words "bar" and "barn" being there. The answer is an end marker -- a distinct token that signifies that we're at the end of the word. Using the above example, if "bar" and "barn" are in the trie, then the keys of the hash at "r" would be "n" and this end marker. Choosing this end marker is easy when all letters are just one character -- we just choose any two-character string and we know that it will never match a letter. However, once we allow arbitrary multi-character letters, then things get much more difficult: there is no possible end marker which can be guaranteed to always work. Here is where we enter some dark water. Dark WaterIn order to make sure that the end marker is always safe, we must check incoming letters on every word submission. If the word is an array ref, then each letter in it is compared to the current end marker. This does add overhead, but it's necessary. If it is found that a letter does conflict with the end marker, then we choose a new end marker.In order to find a new end marker, we obviously need to find a string that isn't already being used in the trie. This requires a complete traversal of the trie to collect a complete set of the letters in use. Once we have this it is a simple exercise to generate a new marker which is not in use. Then we must replace the marker. This of course requires a complete traversal once again. As you can see, this adds a bit of overhead to working with multi-character letters, but it's neccessary to make sure things keep working correctly. This should be fine for people with small data sets, or who just do a bunch of additions ahead of time and then only do lookups. However, if computation time is important to you, there are ways to avoid this mess. Speeding Things UpOne way to speed things up is to avoid the need to replace the end marker. You can set the trie's end marker using the "end_marker()" method, or at creation time, by passing the "end_marker" option to the trie in its constructor's option hashref. Note that setting the end marker causes a trie traversal, as it must update existing data. As such, you want to set the end marker as soon as possible.Note that end marker MUST be at least 2 characters long. Just setting the end marker though, won't stop the trie from checking each letter as you add arrayref words. If you are 100% sure that the end marker you set won't ever show up in an added word, you can either use the "freeze_end_marker()" method or the "freeze_end_marker" construction option to tell the trie not to check any more. However, be careful -- once this option is enabled, the data structure is no longer self-policing, so if a letter that matches your end marker does end up slipping in, strange things will begin to happen. ExamplesHere are some situations in which you might want to use the methods described in the previous section.Let's say your application takes user input data describing travel across the united states, and each node in the trie is a two-letter state abbreviation. In this case, it would probably be fairly safe to set your end marker to something like '00'. However, since this is user-supplied data, you don't want to let some user break your whole system by entering '00', so you should probably not freeze the end marker in this case. Let's say you're using the trie for a networking application -- your words will be IP addresses, and your letters will be the four "quads" of an IP address. In this case you can safely set your end marker to 'xx' or anything with letters in it, and know that there will never be a collision. It is entirely reasonable to set the freeze tag in this case. Future Work
Known Problems
AUTHORCopyright 2011 Avi Finkel <avi@finkel.org>This package is free software and is provided "as is" without express or implied warranty. It may be used, redistributed and/or modified under the terms of the Perl Artistic License (see http://www.perl.com/perl/misc/Artistic.html)
Visit the GSP FreeBSD Man Page Interface. |