NAME
Marpa::Advanced::Algorithm - The Marpa Algorithm

DESCRIPTION
This document describes the aspects of the Marpa algorithm that break new ground. It is not a comprehensive description of the Marpa algorithm. Marpa owes its existence to previous work by other researchers. Their work is cited in Marpa's bibliography. This document is intended to supplement their writings.

I hope to have the opportunity to write a full account of the Marpa algorithm. Until then, I hope the reader sees the one-sidedness of this document as a temporary necessity, and not as the product of my ingratitude toward the researchers without whom my own work would not have been possible.

When I started writing Marpa I had many expectations. Marpa has exceeded most of them. Marpa will be a fast algorithm for almost every practical grammar. The one respect in which Marpa disappoints me is its complexity. As an algorithm, Marpa has its moments of beauty, but nobody could say it achieves that beauty through simplicity. Marpa is many different parsing techniques rolled into one. The way these interact is one of the beauties I do see in the algorithm. Perhaps it was wrong of me to hope Marpa would be a late Picasso, instead of a Hieronymus Bosch.

THE SOURCES OF THE MARPA ALGORITHM
ABOUT THE REST OF THIS DOCUMENT
The rest of this document is fragmented, highly technical material. I tried to write these sections for as wide an audience as possible, but in many cases it simply was not possible to keep the sections short without assuming a lot of familiarity with the literature on Earley parsing.

COMBINING LEO 1991 AND AYCOCK-HORSPOOL 2002
The algorithm described in Aycock-Horspool 2002 speeds up Earley's algorithm by combining Earley items. Each Earley item in Earley's original algorithm is for a dotted rule -- a pair consisting of a grammar rule and a position in that rule. Dotted rules are so called because the convention when writing them is to designate the distinguished position with a raised dot. Dotted rules are also called LR(0) items.

Aycock and Horspool combined dotted rules into sets. Those sets were also states of a finite automaton. I call this finite automaton an Aycock-Horspool Finite Automaton (AHFA). Aycock and Horspool changed their Earley items to contain information about AHFA states, instead of about individual dotted rules. This reduced the number of Earley items and improved the speed of Earley's algorithm considerably.

Aycock-Horspool 2002 had other very practical insights. For example, I can't imagine how I could have written Marpa had I not learned how to deal with nullable symbols and rules from Aycock-Horspool 2002. But Aycock and Horspool could not lay claim to any theoretical milestones. As the theoreticians say, Aycock-Horspool 2002 "improved the constants". The worst case remained the same, and right-recursion was still quadratic.

Leo's algorithm and the Aycock-Horspool algorithm each offered major improvements. Was it possible to get the best of both worlds? The major obstacle was that the algorithms spoke different dialects: Earley items in the Aycock-Horspool algorithm used AHFA states. Leo kept his Earley items in the classic form, one dotted rule per Earley item. To combine the two algorithms, one algorithm would have to learn to speak the other's dialect. It was always clear that translating from one algorithm to the other would be possible. What was not so clear was whether this translation could be made without doing so much processing that most or all of the benefit of one algorithm or the other would be lost.

Leo 1991, Aycock-Horspool 2002, and even Earley's original algorithm (after the bug fix suggested by Aho & Ullman in 1972) all do the same job. For decades now, the only question has been speed.

The main issue was the translation of "Leo completions" -- those Earley items added during recognition as a result of Leo's logic. Leo completions had to "play nice" with Marpa's AHFA-based recognition engine, or all was lost. Potentially, I might have been required to rewrite the Aycock-Horspool Finite Automaton. But the AHFA was the source of the efficiencies delivered by the Aycock-Horspool algorithm. Rewrite the AHFA enough and I might as well throw it away.

But I was in luck. I noticed that, in the special case of Leo completions, there seemed to be a one-to-one correspondence between AHFA states and the original dotted rules as used by Earley and Leo. Eventually, I was able to prove this. Leo completions were always AHFA "singletons" -- exactly one dotted rule. This was very unexpected. The Aycock-Horspool algorithm produced its gains by combining dotted rules into AHFA states, and was aggressive about it. Singletons were infrequent. But when it most mattered, the Leo logic and the AHFA states spoke the same language.
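For readers who want something concrete to picture, here is a minimal sketch, in plain Perl, of a dotted rule and of AHFA states as sets of dotted rules. The representation is illustrative only -- it is not Marpa's internal format -- and the grammar fragments are made up.

    # Illustrative only: not Marpa's internal representation.
    use strict;
    use warnings;

    # A rule is an LHS plus a list of RHS symbols.
    my $rule = { lhs => 'A', rhs => [ 'B', 'C' ] };

    # A dotted rule (LR(0) item) is a rule plus a dot position.
    # With dot => 1 this is "A -> B . C": B has been seen, C is expected.
    my $dotted_rule = { rule => $rule, dot => 1 };

    # An AHFA state is a set of dotted rules.  Aycock and Horspool combine
    # dotted rules aggressively, so most states contain more than one.
    my @typical_state = (
        { rule => { lhs => 'A', rhs => [ 'B', 'C' ] }, dot => 1 },
        { rule => { lhs => 'B', rhs => [ 'C' ]      }, dot => 0 },
    );

    # A "singleton" state contains exactly one dotted rule.  The completed
    # rule below (dot at the end of the RHS) is the kind of thing a Leo
    # completion turns out to be.
    my @singleton_state = ( { rule => $rule, dot => 2 } );

    printf "typical state: %d dotted rules; singleton: %d\n",
        scalar @typical_state, scalar @singleton_state;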
In almost every other case, as expected, the Leo logic and the Aycock-Horspool logic did not speak the same language. A lot of details had to be worked out, but none of these other details had the potential to derail the project. With the proof that Leo completions were always AHFA singletons, I knew that Leo 1991 and Aycock-Horspool 2002 could be combined to get the best of both.

MARPA'S CHANGES TO THE LEO ALGORITHM
Leo 1991 describes an algorithm for a recognizer. Marpa modifies Leo's original recognizer algorithm in two major respects. First, as mentioned above, Leo's recognizer algorithm had to be converted to work with AHFA states, instead of dotted rules. Second, Leo's algorithm is lazy about computing Leo items. (A Leo item is my term for what Leo 1991 calls "transition items".) Leo's original algorithm does not compute the Leo items until they are actually needed. I chose eager computation -- computing the Leo items as soon as each Earley set is complete. Eager computation of Leo items greatly simplifies the logic for using the Leo items. The processing for each Earley set can assume that any Leo items of interest in prior Earley sets already exist. Another advantage of eager computation is that the logic in the recognition phase never needs to follow long chains of Leo items.

LEO SEMANTICS: INDIRECT AND LAZY
Leo's hints about semantic processing, while brief, were insightful. The first decision to make with the Leo semantics was direct versus indirect. The direct approach does the semantic processing with the Leo items themselves, without converting them. This has the advantage that costs are incurred only for the Leo items that are actually used by the semantics. It has the very serious disadvantage of intertwining the Leo logic with the semantics. Dealing directly with Leo items would more than double the complexity of the logic for Marpa's semantics. Because of this, Marpa rejected the direct approach.

Leo 1991 suggests an indirect approach. The indirect approach is to expand the Leo completions into the stacks of Earley items for which Leo completions are a shorthand. However, in a naive implementation, the indirect approach eliminates every advantage of using the Leo speedups -- it merely moves processing from the recognizer into the semantic phase. Marpa optimizes by using a lazy implementation of the indirect approach. Only those Leo completions useful for the semantics are expanded. For a lazy and indirect implementation, in all cases, time complexity is the same as with a direct implementation. But for some ambiguous grammars, the space required grows from linear to quadratic with any indirect approach. This does not change the worst-case time complexity -- that was already quadratic. Fortunately, this worst case -- highly ambiguous parses where all the parse results are demanded -- is of limited practical interest.

IMPROVEMENTS TO THE AYCOCK-HORSPOOL RECOGNITION ENGINE
Marpa takes the recognition engine as described by Aycock and Horspool and turns it inside out and upside down -- a double inversion. In the Aycock and Horspool version of Earley's, the main loop iterated over each item of an Earley set, first using that item to scan for tokens, then using it to add new items to the Earley set through completion. Marpa turns the single main loop of the Aycock-Horspool algorithm into two separate loops over the Earley items. The first loop completes the current Earley set. The second loop scans for tokens.
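Below is a minimal structural sketch, in plain Perl, of that two-loop arrangement. The helper subroutines are stubs and the data layout is invented for the example; the only point being illustrated is the order of operations -- the completion pass over the current Earley set finishes before any scanning begins.

    # Structural sketch only -- not Marpa's code.  The helpers are stubs.
    use strict;
    use warnings;

    sub add_completions_and_predictions { }                    # would add items to the current set
    sub expected_tokens                 { return ('a', 'b') }  # would be derived from the current set
    sub scan_tokens                     { }                    # would add items to later sets

    sub process_earley_set {
        my ($current_set) = @_;

        # Loop 1: finish the current Earley set.  Items added by this loop
        # land in the set being processed, so a C-style loop is used to
        # pick up items appended while the loop runs.
        my $items = $current_set->{items};
        for ( my $i = 0; $i < @{$items}; $i++ ) {
            add_completions_and_predictions( $current_set, $items->[$i] );
        }

        # The current Earley set is now complete, so the expected tokens
        # are known and can be reported to the lexer before any scanning.
        my @expected = expected_tokens($current_set);
        print "tokens expected here: @expected\n";

        # Loop 2: scan.  Scanning adds items only to later Earley sets.
        for my $item ( @{$items} ) {
            scan_tokens( $item, \@expected );
        }
        return;
    }

    process_earley_set( { items => [ {} ] } );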
The next two sections describe the advantages of separating completion and scanning.

PREDICTION-DRIVEN LEXING
Prediction-driven lexing involves lookahead, but it is lookahead of an especially powerful kind. A very nice property of Earley's algorithm is that an Earley item exists in the Earley sets if and only if there is, starting with the current location, some additional input that would make that Earley item a part of a successful parse. Note that this additional input may be zero length, in which case the presence of that Earley item indicates that the parse would be complete and successful if the input ended at that location.

It is easy to tell from the Earley items which tokens they are expecting. This means that at every point in the parse, it is easy to determine, for every token, if there is some subsequent input that would make it part of a successful parse. Conversely, it is easy to determine which tokens are fated to be "dead ends". Intuitively, you might say that Earley's algorithm always knows exactly what is going on at any point in the parse.

But previous implementations of the Earley parse engine did not allow the lexer to take advantage of this information. The obstacle lay in the way that the parse engines worked. At each Earley set, Earley parse engines add Earley items either directly, as the result of scanning tokens, or indirectly, as the result of Earley items that are already in the Earley sets.
The knowledge of which tokens to expect is not available until the indirect processing is finished. For predictive lexing to be done, that knowledge needed to be available when the direct processing began. Previous Earley implementations intermixed direct and indirect processing, which made predictive lexing impossible.

Indirect processing depends on direct processing, as well as vice versa. It might seem that this would make it impossible to separate the two, much less complete the indirect processing first. But direct processing only adds Earley items to subsequent Earley sets, and indirect processing only adds Earley items to the current Earley set. While indirect processing depends on direct processing as the parse progresses, for each individual Earley set the indirect processing is independent of the direct processing, and can be done separately and first. To my knowledge, Marpa's implementation of Earley's algorithm is the first to make this separation.

Marpa's double inversion of the Aycock-Horspool parse engine divides the parsing for each Earley set into two loops. The first loop does the indirect processing for an Earley set. The second loop does the direct processing. In Marpa, the indirect processing is complete before the direct processing starts. Once indirect processing ends, the parse engine knows which tokens are possible at the current location. This knowledge can be made available to the lexer before any tokens are scanned. This is what makes prediction-driven lexing possible.

A lexer working with the Marpa parse engine can know which tokens are possible at any point, and which are not. The lexer can restrict itself to looking for those tokens which have the potential to produce a successful parse. The lexer can avoid wasting its time on the other tokens. Some current lexers are already stateful. But these rely on state generated by their own regular expression engines. The Marpa parse engine's knowledge of expected and unexpected tokens is far more specific than the knowledge that even the most sophisticated lexer can derive from looking at its internal state.

RUBY SLIPPERS PARSING
I expected that more efficient lexing would be the main benefit of prediction-driven lexing. But then I stumbled on the Ruby Slippers Technique, and realized that prediction-driven lexing is capable of far more than optimizing the lexer. I call the technique of this section Ruby Slippers Parsing, after Dorothy's magical shoes in the movie The Wizard of Oz. This is because, frankly, it does seem too good to be true.

To do Ruby Slippers Parsing, you program the lexer to change, based on the input "wished for" by the parser, the list of tokens passed to the parser. To use the image of the movie, if the parse engine says that the next thing it wants to do is go to Kansas, then the next thing the lexer tells the parse engine is that, yes indeed, the parse engine is in Kansas.

The Ruby Slippers Technique can make difficult problems stunningly easy. For example, Marpa::HTML is extremely liberal in the HTML it accepts, and therefore had to come up with a way to deal with the problem of missing HTML end tags. Ordinarily, this is done with some very convoluted logic full of special cases. Instead, Marpa::HTML uses the Ruby Slippers. When the parser in Marpa::HTML hits an obstacle, it checks to see if the parser is looking for an HTML end tag. If the parser wants an HTML end tag, the lexer creates a tag of exactly the type that the parser wants, on the fly. Presto, Kansas.
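Here is a sketch of the shape of that trick, using an invented lexer interface. The expected_tokens() stub stands in for asking the parse engine what it is wishing for at the current location, and the token encoding is made up; Marpa::HTML's real logic is more involved.

    # Ruby Slippers sketch with an invented interface -- not Marpa::HTML's code.
    use strict;
    use warnings;

    # Stub: a real Ruby Slippers lexer would ask the parse engine.
    sub expected_tokens { return ( 'END_TAG_P', 'TEXT' ) }

    sub next_token_for_parser {
        my ($raw_token) = @_;
        my %expected = map { $_ => 1 } expected_tokens();

        # If the token actually found in the input is acceptable, pass it on.
        return $raw_token if $expected{ $raw_token->{type} };

        # Otherwise, if the parser is wishing for an end tag, conjure one up
        # on the fly.  (A real lexer would retry the original token afterwards.)
        for my $type ( sort keys %expected ) {
            return { type => $type, value => q{}, virtual => 1 }
                if $type =~ /^END_TAG_/;
        }

        # No wish we can grant: let the parser cope with the raw token.
        return $raw_token;
    }

    my $token = next_token_for_parser(
        { type => 'START_TAG_TABLE', value => '<table>' } );
    print "handing the parser: $token->{type}\n";    # prints END_TAG_P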
For more about Ruby Slippers Parsing, see my blog post on the subject: <http://blogs.perl.org/users/jeffrey_kegler/2010/06/parsing-with-ruby-slippers.html>.

EARLEMES
Marpa allows ambiguous lexing, including the recognition of lexemes of different lengths starting at the same lexical position, and the recognition of overlapping lexemes. To facilitate this, Marpa introduces the earleme (named after Jay Earley). Previous Earley implementations required the input to be broken up into tokens, typically by lexical analysis of the input using regular expressions. This in turn forced the input to be unambiguous. Requiring that the input of a general BNF parser be unambiguous is hobbling the parser.

Marpa allows the scanning phase of Earley's algorithm to add items not just to the next Earley set, but to any later one. Alternative scans of the input can be put into the Earley sets, and the power of Earley's algorithm harnessed to deal with the indeterminism. Marpa does this by allowing each scanned token to have a length in earlemes. The earleme is a unit of distance measured in Earley sets. If the length of a scanned token is L, and the current Earley set is C, a newly scanned Earley item is added to Earley set C+L. For example, if a token is 7 earlemes long, and the token is to start at Earley set 35, then the token is added to Earley set 42.

Lexing in earlemes is flexible. If an implementation defines an earleme so that distances in earlemes are the same as distances in a token stream, then lexing in earlemes is exactly equivalent to traditional lexing. But distance in earlemes could also be defined as equivalent to string length, measured either in bytes or Unicode graphemes.

CHOMSKY-HORSPOOL-AYCOCK FORM
Marpa's internal rewriting of the grammar is another significant change to the Aycock and Horspool algorithm. Aycock and Horspool rewrote their grammars internally into what they called NNF (Nihilist Normal Form). Earley's original algorithm had serious issues with nullable symbols and productions, and the NNF rewrite fixes almost all of them. (As a reminder, a nullable symbol or production is one which derives the empty string.) Importantly, NNF also allows complete and easy mapping of the semantics of the original grammar to its NNF rewrite, so that NNF and the whole rewrite process can be made invisible to the user.

A problem with NNF is that, in the worst case, the rewritten grammar is exponentially larger than the original. I think cases will arise in practice where the NNF size explosion is a real problem. For example, any language in which whitespace is significant but optional can cause an NNF size explosion.

Marpa's solution is Chomsky-Horspool-Aycock Form (CHAF). This is Aycock and Horspool's NNF, but with the further restriction that no more than two nullable symbols may appear in any production. (The idea that any context-free grammar can be rewritten into productions of at most a small fixed size appears to originate, like so much else, with Noam Chomsky.) The shortened CHAF productions map back to the original grammar, so that, like NNF, the CHAF rewrite can be made invisible to the user. With CHAF, the theoretical worst-case behavior is linear, and in those difficult cases likely to arise in practice the multiplier is much smaller.

CONSTRUCTING THE AHFA
The Aycock-Horspool algorithm requires the grammar to be converted into what their paper calls a "split ε-DFA". I find the term "split ε-DFA" unwieldy, and prefer to take the opportunity to give credit where it is due.
The sets that Aycock and Horspool call split ε-DFA states, I call AHFA (Aycock-Horspool Finite Automaton) states. The description of the split ε-DFA is on pages 624-627 of Aycock-Horspool 2002. This description contained the insights without which Marpa could not have been written. But, in my reading at least, the Aycock-Horspool description was conceptual -- a tool for explaining the theory behind the AHFA. An actual algorithm to create an AHFA from an arbitrary grammar needed to be developed as part of the Marpa project.

TWO PROBLEMS I DID NOT HAVE TO SOLVE
This document is devoted to what is new with Marpa. It is not intended to describe the work of others. But I hope the reader will forgive the digression that is this section. Having the work of others to build on was essential to the Marpa project, and I want to illustrate this with two examples.

The Marpa project did not have to figure out how to handle nullable symbols. This is fortunate, because I am not sure I would have had the insight to look at the right problem, never mind find the solution. But Aycock and Horspool identified this problem and solved it for me. All I needed to do was understand their ideas and implement them.

Similarly, Marpa did not have to figure out how to solve another long-standing problem with Earley's algorithm -- the inefficiency of right-recursion. Joop Leo solved this problem. As with the problem of nullables, I am not sure I would have had the perspicacity even to define the problem of right-recursion correctly, and I am very happy to have had it solved for me.

This digression is far from a comprehensive list of what the Marpa project owes to the work of others. But I hope these two examples are useful in putting things in perspective.

ITERATING AND EVALUATING PARSES
The rest of this document will deal with my algorithm for iterating and evaluating the Marpa parses, and the concepts I discovered and methods I developed along the way. While the problem of efficiently recognizing parses is harder than the problem of iterating and evaluating them, more than half of the work in writing Marpa was to develop the algorithm to iterate and evaluate the parses.

The reason I wound up putting so much effort into iteration and evaluation lay in Marpa's objective. From the first, the main objective of the Marpa project was to put a practical general BNF parsing tool into the hands of mainstream programmers. This had never been done. Clearly, for a general parsing tool to be practical, it needed a convenient way to evaluate the parse results it recognized. And, if it claimed (as in fact Marpa does claim) to handle ambiguous parsing smoothly, it also needed a good method for iterating parse results.

Among my sources, the degree of focus on the practical varies. But all of my sources focused far more on the recognition problem than anything else. When it came to issues in recognition, my sources went out of their way to find the hard problems, solve them, and set forth the main ideas to the reader. When it came to iterating and evaluating the Earley sets, my sources were content with hints and guesses. The guesses proved insightful and the hints wise, but the actual working out of an algorithm fell to the Marpa project.

My work in developing Marpa's Evaluator demonstrated persistence more than it embodied elegance. I hope the current implementation looks straightforward, because my route to it certainly was not. Marpa's parse iteration and evaluation logic is in its 3rd Generation.
Two previous Marpa evaluators were developed and repeatedly enhanced, only to reach a point where it was obvious that further progress demanded that they be completely discarded. Twice, Marpa's Evaluator was replaced with a new implementation, one written almost entirely from scratch. Aycock and Horspool seem to have understood that iteration and evaluation might contain significant unsolved problems. In their paper, they identify the search for "better ways to reconstruct derivations" as an "area of future work".

APPLICABLE DOTTED RULES
A key concept behind Marpa's 3rd Generation Evaluator is that of an applicable dotted rule. In the Aycock-Horspool modification of Earley's algorithm, the Earley items do not directly correspond to dotted rules. Aycock and Horspool modified the Earley items to contain LR(0) states, which are sets of dotted rules. (As a reminder, a dotted rule is another term for an LR(0) item.) In order to iterate and evaluate a parse result, it is necessary to find all the dotted rules which are applicable to that parse result. This process is described more fully in the implementation document.

A disadvantage of Aycock-Horspool's use of LR(0) states in Earley items was that it muddied the question of what it means for two parses to be "different". Following the terminology in the textbooks, let me say that two parse trees are different if they have different rightmost derivations. Next, let me introduce the idea of a factored parse tree. In this context, what I mean by factoring is assigning start and end locations in the input to each node of the parse tree. Often, given a parse tree and an input, only one factoring is possible, but this is not always the case. A factored parse tree is a parse tree all of whose nodes are factored. A factoring is a total map from the nodes of the parse tree to duples composed of a start location and an end location.

The use of LR(0) states in the Aycock-Horspool Earley items meant that two different sets of Earley items could produce the same factored parse tree. As far as I know, I was the first person to encounter this issue -- I've not seen it mentioned in the literature. This problem did not arise with traditional Earley parsing. In traditional Earley parsing, dotted rules have a one-to-one correspondence with traditional Earley items. Because Earley items have an origin and a current location, each Earley item implies a factoring. Since in traditional Earley parsing each Earley item implies a unique factoring and a unique dotted rule, each factored parse tree corresponded to a unique set of traditional Earley items. In traditional Earley parsing, if two sets of Earley items were different, their factored parse trees had to be different. The intuitive idea of "two parses being the same" seems to be equivalent to the identity of their factored parse trees.

In Marpa's Generation 2 Evaluator, or-nodes in the parse bocage usually corresponded to Aycock-Horspool Earley items, and therefore to LR(0) states. This resulted in the production of multiple parse results for the same factored parse tree. Later versions of the Generation 2 Evaluator were modified to detect and prevent these duplicate parses. With the Generation 3 Marpa Evaluator, the whole logic of the parse bocage was rethought, and the entire evaluator rewritten. In Generation 3, the parse bocage or-nodes do not necessarily correspond one-to-one to Earley items. Instead, Generation 3 parse bocage or-nodes correspond one-to-one to dotted rules.
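To make the notion of factoring concrete, here is a small sketch in plain Perl. The grammar, input and encoding are invented for the example; the point is only that a factoring is a total map from parse tree nodes to (start, end) duples, and that two parse results count as the same exactly when their factored parse trees are identical.

    # Illustrative only: a tiny factored parse tree for a made-up rule
    #   pair ::= item ';' item
    # parsed over three tokens, so locations run from 0 to 3.
    use strict;
    use warnings;

    my $factored_tree = {
        node     => 'pair ::= item ";" item',
        span     => [ 0, 3 ],
        children => [
            { node => 'item', span => [ 0, 1 ] },
            { node => '";"',  span => [ 1, 2 ] },
            { node => 'item', span => [ 2, 3 ] },
        ],
    };

    # Every node carries its (start, end) duple -- that assignment is the
    # factoring.  Comparing two parses means comparing these trees node
    # by node, duples included.
    sub show {
        my ( $tree, $depth ) = @_;
        $depth //= 0;
        printf "%s%s [%d,%d]\n", '  ' x $depth, $tree->{node}, @{ $tree->{span} };
        show( $_, $depth + 1 ) for @{ $tree->{children} || [] };
    }
    show($factored_tree);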
PREVIOUS WORK: PARSE FORESTS
Parse forests and their terminology are very standard in the literature on ambiguous parsing. But ambiguous parsing gets little or no coverage in many modern textbooks on parsing, so that even readers who have studied parsing might not know these terms.

A Marpa parse is a set of parse trees, one for every parse result. The set of parse trees is trivial if the parse is unambiguous -- it is a set of trees that contains only one tree. If the parse is ambiguous, there is more than one parse tree. There is an efficient and compact means of storing a set of closely related parse trees. It is called, aptly, a parse forest. Parse forests have been in the parsing literature for some time now. I am unable to discover who discovered the concept or who coined the term.

Nodes in a parse forest are divided into and-nodes and or-nodes. And-nodes are individual pieces of parse trees. In conventional parse forests, each and-node represents a production. And-nodes in a parse forest are just like all the nodes in a parse tree -- they have the LHS as the parent node and the RHS symbols as child nodes. Or-nodes represent choices among and-nodes. Each child of an or-node is one choice. A parse tree can be generated from a parse forest by traversing the forest and selecting one child and-node at every or-node.

THE PARSE BOCAGE
Marpa's representation of parse results is considerably more complicated than a parse forest, but it borrows many of the ideas, and recycles much of the terminology. Marpa represents parse results in a virtual data structure called a parse bocage. I say that the parse bocage is a virtual data structure, because no actual Marpa data structure corresponds exactly to the parse bocage. Like parse forests, parse bocages contain and-nodes and or-nodes, but in the parse bocage these are created on a just-in-time basis.

The parse bocage differs from a parse forest in several other ways.
LINKS IN THE PARSE BOCAGE
The system of links used in Marpa is based on that devised by Aycock and Horspool. In the Aycock-Horspool scheme, each Earley item has a list of sources. Each source can have two elements: a predecessor and a cause.

The predecessor is optional. If present, it is a link to an Earley item. It's called a "predecessor link" because both the linking and the linked Earley item are pieces of the same production, and the linked Earley item is the "predecessor", or earlier part. The cause element of the source is always present. It can be a token or a link to a child Earley item.

Because the Earley items contain their sources in the form of predecessor-cause pairs, they in effect rewrite the original grammar into productions with at most two symbols on the right hand side. This is basically Chomsky Normal Form (CNF). A CNF grammar can be represented conveniently as a binary tree. Marpa does not attempt to reconstruct the original structure of the grammar when it creates the parse bocage from the Earley items. Code to traverse, iterate and evaluate binary nodes is considerably simpler than code which needs to deal with nodes which have an arbitrary number of children. Rather than rewrite the grammar into a form that's harder to process, it makes sense to leave the grammar the way it is found in the Earley items.

THE ITERATION STACK
A crucial change for Generation 3 of Marpa's Evaluator was the introduction of the iteration nodes, and the creation of an iteration stack. Marpa uses the iteration stack and iteration nodes to perform the iteration of parse results.

As of Generation 3, information that changes during iteration and evaluation is not kept in the parse bocage. As soon as the recognition phase ends, Marpa's Generation 3 parse bocage is conceptually constant. (In the actual implementation, the parse bocage is created and linked during iteration and evaluation, as needed and on a just-in-time basis.) The content of the parse bocage is fully determined as soon as the recognition phase ends, and before the iteration and evaluation of parse results begins.

The iteration stack is composed of iteration nodes. An iteration node corresponds to one or-node, and throughout the iteration node's lifetime that correspondence remains constant. In addition, if the parse is cycle-free, that correspondence will be on a one-to-one basis. At every evaluation point during the iteration, each iteration node contains a choice of and-node. Conceptually, Marpa does its evaluation using a stack of and-nodes, called an evaluation stack. Marpa's evaluation stack is created from the iteration stack by transforming every iteration node into the and-node that is its current choice.

PARSE BOCAGE: ABOUT THE TERM
The term "bocage" comes from the bocage forests of Normandy and Brittany. These man-made forests are networks of hedgerows, vegetative tangles grown more and more impenetrable over the centuries.

A bocage hedge is typically part of a working farm, and its everyday use is to fence livestock. But most of the world's farmers find their cows to be adequately confined by less imposing obstacles. The bocage country is the most fought-over section of France, and over its violent history the hedgerows have accumulated a considerable reputation as field fortification. The first bocage hedgerows may have been cultivated to stop the armies of the Hundred Years War. But bocage continued to be formidable well after the invention of tanks, artillery and aircraft.
Every account of the Normandy Campaign of 1944 speaks with respect of the bocage country. Most identify the Allied breakout from the hedgerows as the turning point of that campaign.

LICENSE AND COPYRIGHT
Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.