|
|
| |
Tree::Parser(3) |
User Contributed Perl Documentation |
Tree::Parser(3) |
Tree::Parser - Module to parse formatted files into tree structures
use Tree::Parser;
# create a new parser object with some input
my $tp = Tree::Parser->new($input);
# use the built in tab indent filters
$tp->useTabIndentedFilters();
# use the built in space indent filters
$tp->useSpaceIndentedFilters(4);
# use the built in dot-seperated numbers filters
$tp->useDotSeperatedLevelFilters();
# use the nested parens filter
$tp->useNestedParensFilters();
# create your own filter
$tp->setParseFilter(sub {
my ($line_iterator) = @_;
my $line = $line_iterator->next();
my ($id, $tabs, $desc) = $line =~ /(\d+)(\t*)(.*)/;
my $depth = length $tabs;
return ($depth, { id => $id, desc => $desc } );
});
# parse our input and get back a tree
my $tree = $tp->parse();
# create your own deparse filter
# (which is in the inverse of our
# custom filter above)
$tp->setDeparseFilter(sub {
my ($tree) = @_;
my $info = $tree->getNodeValue();
return ($info->{id} . ("\t" x $tree->getDepth()) . $info->{desc});
});
# deparse our tree and get back a string
my $tree_string = $tp->deparse();
This module can parse various types of input (formatted and containing
hierarchal information) into a tree structures. It can also deparse the same
tree structures back into a string. It accepts various types of input, such
as; strings, filenames, array references. The tree structure is a hierarchy of
Tree::Simple objects.
The parsing is controlled through a parse filter, which is used to
process each "line" in the input (see
"setParseFilter" below for more
information about parse filters).
The deparseing as well is controlled by a deparse filter, which is
used to covert each tree node into a string representation.
This module can be viewed (somewhat simplistically) as a
serialization tool for Tree::Simple objects. Properly written parse
and deparse filters can be used to do "round-trip" tree
handling.
- new ($tree | $input)
- The constructor is used primarily for creating an object instance.
Initializing the object is done by the
"_init" method (see below).
- setInput ($input)
- This method will take varios types of input, and pre-process them through
the "prepareInput" method below.
- prepareInput ($input)
- The "prepareInput" method is used to
pre-process certain types of $input. It accepts
any of the follow types of arguments:
- an Array::Iterator object
This just gets passed on through.
- an array reference containing the lines to be parsed
This type of argument is used to construct an
Array::Iterator instance.
- a filename
The file is opened, its contents slurped into an array, which
is then used to construct an Array::Iterator instance.
NOTE: we used to only handle files with the
".tree" extension, however that was
annoying, so now we accept any file name.
- a string
The string is expected to have at least one embedded newline
or be in the nested parens format.
It then returns an Array::Iterator object ready for the
parser.
- setFileEncoding($encoding)
- This allows you to specify the $encoding that the
file should be read using. This is only only applicable when your input is
a file.
- useTabIndentedFilters
- This will set the parse and deparse filters to handle tab indented
content. This is for true tabs "\t"
only. The parse and deparse filters this uses are compatible with one
another so round-triping is possible.
Example:
1.0
1.1
1.2
1.2.1
2.0
2.1
3.0
3.1
3.1.1
- useSpaceIndentedFilters ($num_spaces)
- This will set the parse and deparse filters to handle space indented
content. The optional $num_spaces argument allows
you to specify how many spaces are to be treated as a single indent, if
this argument is not specified it will default to a 4 space indent. The
parse and deparse filters this uses are compatible with one another so
round-triping is possible.
Example:
1.0
1.1
1.2
1.2.1
2.0
2.1
3.0
3.1
3.1.1
- useDotSeparatedLevelFilters (@level_identifiers)
- This will set the parse and deparse filters to handle trees which are
described in the following format:
1 First Child
1.1 First Grandchild
1.2 Second Grandchild
1.2.1 First Child of the Second Grandchild
1.3 Third Grandchild
2 Second Child
There must be at least one space seperating the level
identifier from the level name, all other spaces will be considered part
of the name itself.
The parse and deparse filters this uses are compatible with
one another so round-triping is possible.
The labels used are those specified in the
@level_identifiers argument. The above code uses
the default level identifiers ("1 ..
100"). But by passing the following as a set of level
identifiers: 'a' .. 'z', you can successfully
parse a format like this:
a First Child
a.a First Grandchild
a.b Second Grandchild
a.b.a First Child of the Second Grandchild
a.c Third Grandchild
b Second Child
Currently, you are restricted to only one set of level
identifiers. Future plans include allowing each depth to have its own
set of identifiers, therefore allowing formats like this:
"1.a" or other such variations (see
"TO DO" section for more info).
- useDotSeperatedLevelFilters
- This old mispelled method name is kept for backwards compat.
- useNestedParensFilters
- This will set the parse and deparse filters to handle trees which are
described in the following format:
(1 (1.1 1.2 (1.2.1) 1.3) 2 (2.1))
The parser will count the parentheses to determine the depth
of the current node. This filter can also handle double quoted strings
as values as well. So this would be valid input:
(root ("tree 1" ("tree 1 1" "tree 1 2") "tree 2"))
This format is currently somewhat limited in that the input
must all be on one line and not contain a trailing newline. It also does
not handle embedded escaped double quotes. Further refinement and
improvement of this filter format is to come (and patches are always
welcome).
It should be noted that this filter also cannot perform a
roundtrip operation where the deparsed output is the exact same as the
parsed input because it does not treat whitespace as signifigant (unless
it is within a double quoted string).
- setParseFilter ($filter)
- A parse filter is a subroutine reference which is used to process each
element in the input. As the main parse loop runs, it calls this filter
routine and passes it the Array::Iterator instance which represents
the input. To get the next element/line/token in the iterator, the filter
must call "next", the element should
then be processed by the filter. A filter can if it wants advance the
iterator further by calling "next" more
than once if nessecary, there are no restrictions as to what it can do.
However, the filter must return these two values in order to
correctly construct the tree:
- the depth of the node within the tree
- Followed by either of the following items:
- the value of the node
- This value will be used as the node value when constructing the new tree.
This can basically be any scalar value.
- an instance of either a Tree::Simple object, or some derivative of
Tree::Simple
- If you need to perform special operations on the tree instance before it
get's added to the larger hierarchy, then you can construct it within the
parse filter and return it. An example of why you might want to do this
would be if you wanted to set the UID of the tree instance from something
in the parse filter.
The following is an example of a very basic filter which simply
counts the number of tab characters to determine the node depth and then
captures any remaining character on the line.
$tree_parser->setParseFilter(sub {
my ($iterator) = @_;
my $line = $iterator->next();
# match the tables and all that follows it
my ($tabs, $node) = ($line =~ /(\t*)(.*)/);
# calculate the depth by seeing how long
# the tab string is.
my $depth = length $tabs;
# return the depth and the node value
return ($depth, $node);
});
- setDeparseFilter ($filter)
- The deparse filter is the opposite of the parse filter, it takes each
element of the tree and returns a string representation of it. The filter
routine gets passed a Tree::Simple instance and is expected to
return a single string. However, this is not enforced we actually will
gobble up all the filter returns, but keep in mind that each element
returned is considered to be a single line in the output, so multiple
elements will be treated as mutiple lines.
Here is an example of a deparse filter. This can be viewed as
the inverse of the parse filter example above.
$tp->setDeparseFilter(sub {
my ($tree) = @_;
return ("\t" x $tree->getDepth()) . $tree->getNodeValue();
});
- getTree
- This method returns the tree held by the parser or set through the
constructor.
- parse
- Parsing is pretty automatic once everthing is set up. This routine will
check to be sure you have all you need to proceed, and throw an execption
if not. Once the parsing is complete, the tree will be stored interally as
well as returned from this method.
- deparse
- This method too is pretty automatic, it verifies that it has all its
needs, throwing an exception if it does not. It will return an array of
lines in list context, or in scalar context it will join the array into a
single string seperated by newlines.
- _init ($tree | $input)
- This will initialize the slots of the object. If given a
$tree object, it will store it. This is currently
the prefered way in which to use subclasses of Tree::Simple to
build your tree with, as this object will be used to build any other trees
(see "TO DO" for more information). If given some other kind of
input, it will process this through the
"prepareInput" method.
- _parse
- This is where all the parsing work is done. If you are truely interested
in the inner workings of this method, I suggest you refer to the source.
It is a very simple algorithm and should be easy to understand.
- _deparse
- This is where all the deparsing work is done. As with the
"_parse" method, if you are interested
in the inner workings, I suggest you refer to the source.
- Enhance the Nested Parens filter
- This filter is somewhat limited in its handling of embedded newlines as
well as embedded double quotes (even if they are escaped). I would like to
improve this filter more when time allows.
- Enhance the Dot Seperated Level filter
- I would like to enhance this built in filter to handle multi-level
level-identifiers, basically allowing formats like this:
1 First Child
1.a First Grandchild
1.b Second Grandchild
1.b.I First Child of the Second Grandchild
1.b.II Second Child of the Second Grandchild
1.c Third Grandchild
2 Second Child
- Make Tree::Simple subclasses more easy to handle
- Currently in order to have Tree::Parser use a subclass of Tree::Simple to
build the heirarchy with, you must pass a tree into the constructor, and
then set the input manually. This could be handled better I think, but
right now I am not 100% how best to go about it.
None that I am aware of. Of course, if you find a bug, let me know, and I will
be sure to fix it. This module, in an earlier form, has been and is being used
in production for approx. 1 year now without incident. This version has been
improved and the test suite added.
I use Devel::Cover to test the code coverage of my tests, below is the
Devel::Cover report on this module's test suite.
---------------------------- ------ ------ ------ ------ ------ ------ ------
File stmt branch cond sub pod time total
---------------------------- ------ ------ ------ ------ ------ ------ ------
Tree/Parser.pm 100.0 87.9 81.2 100.0 100.0 100.0 94.6
---------------------------- ------ ------ ------ ------ ------ ------ ------
Total 100.0 87.9 81.2 100.0 100.0 100.0 94.6
---------------------------- ------ ------ ------ ------ ------ ------ ------
This module is not an attempt at a general purpose parser by any stretch of the
imagination. It is basically a very flexible special purpose parser, it only
builds Tree::Simple heirarchies, but your parse filters can be as complex as
nessecary. If this is not what you are looking for, then you might want to
consider one of the following modules:
- Parse::RecDescent
- This is a general purpose Recursive Descent parser generator written by
Damian Conway. If your parsing needs lean towards the more complex, this
is good module for you. Recursive Descent parsing is known to be slower
than other parsing styles, but it tends to be easier to write grammers
for, so there is a trade off. If speed is a concern, then you may just
want to skip perl and go straight to C and use
"yacc".
- Parse::Yapp
- As an alternative to Recursive Descent parsing, you can do LALR parsing.
It is faster and does not have some of the well known (and avoidable)
problems of Recursive Descent parsing. I have never actually used this
module, but I have heard good things about it.
- Parse::FixedLength
- If all you really need to do is process a file with fixed length fields in
it, you can use this module.
- Parse::Tokens
- This class will help you parse text with embedded tokens in it. I am not
very familiar with this module, but it looks interesting.
There are also a number of specific parsers out here, such as
HTML::Parser and XML::Parser, which do one thing and do it
well. If you are looking to parse HTML or XML, don't use my module, use
these ones, it just makes sense. Use the right tool for the job
basically.
This module uses two other modules I have written:
- Tree::Simple
- Array::Iterator
- Thanks to Chad Ullman for reporting RT Bug #12244 and providing code and
test case for it.
- Thanks to Gerd for reporting RT Bug #13041 and providing code to fix
it.
stevan little, <stevan@iinteractive.com>
Copyright 2004-2007 by Infinity Interactive, Inc.
<http://www.iinteractive.com>
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
Visit the GSP FreeBSD Man Page Interface. Output converted with ManDoc. |