Marpa::Advanced::Algorithm - The Marpa Algorithm
This document describes the aspects of the Marpa algorithm that break new
ground. It is not a comprehensive description of the Marpa algorithm. Marpa
owes its existence to previous work by other researchers. Their work is cited
in Marpa's bibliography. This document is intended to supplement their
I hope to have the opportunity to write a full account of the Marpa algorithm.
Until then, I hope the reader sees the one-sidedness of this document as a
temporary necessity, and not as the product of my ingratitude toward the
researchers without whom my own work would not have been possible.
When I started writing Marpa I had many expectations. Marpa has exceeded most of
them. Marpa will be a fast algorithm for almost every practical grammar. The
one respect in which Marpa disappoints me is its complexity. As an algorithm,
Marpa has its moments of beauty, but nobody could say it achieves that beauty
Marpa is many different parsing techniques, rolled into one. The way these
interact is one of the beauties I do see in the algorithm. Perhaps it was
wrong of me to hope Marpa would be a late Picasso, instead of a Hieronymous
- Jay Earley invents the algorithm that now bears his name.
- Joop Leo describes a way to modify Earley's algorithm so that it runs in
O(n) time for all LR-regular grammars. LR-regular is a vast class of
grammars, including all the LR(k) grammars, all grammars parseable with
recursive descent, and regular expressions. LR-regular can safely be
thought of as including all grammars in practical use today, and then
- Aycock and Horspool describe a way to do LR(0) precomputation for
Earley's algorithm. Their method makes Earley's faster in most practical
situations, but not all. In particular, right-recursion remains quadratic
in the Aycock and Horspool algorithm. Worst case is no better than
Earley's. Leo is unaware of Aycock and Horspool's work and Aycock and
Horspool seem unaware of Leo.
- ... which brings us to Marpa.
The rest of this document is fragmented, highly technical material. I tried to
write these sections for as wide an audience as possible, but in many cases it
simply was not possible to keep the sections short without assuming a lot of
familiarity with the literature on Earley parsing.
The algorithm described in Aycock-Horspool 2002
speeds up Earley's algorithm by combining Earley items. Each Earley item in
Earley's original algorithm is for a dotted rule
-- a pair consisting
of a grammar rule and a position in that rule. Dotted rules are so called
because the convention when writing them is to designate the distinguished
position with a raised dot. Dotted rules are also called LR
Aycock and Horspool combined dotted rules into sets. Those sets were also states
of a finite automaton. I call this finite automaton, an Aycock-Horspool Finite
Automaton (AHFA). Aycock and Horspool changed their Earley items to contain
information about AHFA states, instead of about individual dotted rules. This
reduced the number of Earley items and improved the speed of Earley's
had other very practical insights. For example, I can't imagine how I could have
written Marpa had I not learned how to deal with nullable symbols and rules
from Aycock-Horspool 2002 . But Aycock and Horspool could not lay claim to any
theoretical milestones. As the theoreticians say, Aycock-Horspool 2002
"improved the constants". Worst case remained the same, and
right-recursion was still quadratic.
Leo's algorithm and the Aycock-Horspool algorithm each offered major
improvements. Was it possible to get the best of both worlds? The major
obstacle was that the algorithms spoke different dialects: Earley items in the
Aycock-Horspool algorithm used AHFA states. Leo kept his Earley items in the
classic form, one dotted rule per Earley item. To combine the two algorithms,
one algorithm would have to learn to speak the other's dialect.
It was always clear that translating from one algorithm to the other would be
possible. What was not so clear was whether this translation could be made
without doing so much processing that most or all of the benefit of one
algorithm or the other would be lost. Leo 1991 , Aycock-Horspool 2002 , and
even Earley's original algorithm (after the bug fix suggested by Aho &
Ullman in 1972) all do the same job. For decades now, the only question has
The main issue was the translation of "Leo completions" -- those
Earley items added during recognition as a result of Leo's logic. Leo
completions had to "play nice" with Marpa's AHFA-based recognition
engine, or all was lost. Potentially, I might have been required to rewrite
the Aycock-Horspool Finite Automaton. But the AHFA was the source of the
efficiencies delivered by the Aycock-Horspool algorithm. Rewrite the AHFA
enough and I might as well throw it away.
But I was in luck. I noticed that, in the special case of Leo completions, there
seemed to be a one-to-one correspondence between AHFA states and the original
dotted rules as used by Earley and Leo. Eventually, I was able to prove this.
Leo completions were always AHFA "singletons" -- exactly one dotted
This was very unexpected. The Aycock-Horspool algorithm produced its gains by
combining dotted rules into AHFA states, and was aggressive about it.
Singletons were infrequent. But when it most mattered, the Leo logic and the
AHFA states spoke the same language.
In almost every other case, as expected, the Leo logic and the Aycock-Horspool
logic did not speak the same language. A lot of details had to be worked out,
but none of these other details had the potential to derail the project. With
the proof that Leo completions were always AHFA singletons, I knew that Leo
and Aycock-Horspool 2002 could be combined to get the best of both.
describes an algorithm for a recognizer. Marpa modifies Leo's original
recognizer algorithm in two major respects. First, as mentioned above, Leo's
recognizer algorithm had to be converted to work with AHFA states, instead of
Second, Leo's algorithm is lazy about computing Leo items. (A Leo item
my term for what Leo 1991
calls "transition items".) Leo's original algorithm does not compute
the Leo items until they are actually needed. I chose eager computation --
computing the Leo items as soon as each Earley set is complete.
Eager computation of Leo items greatly simplifies the logic for using the Leo
items. The processing for each Earley set can assume that any Leo items of
interest in prior Earley sets already exist. Another advantage of eager
computation is that the logic in the recognition phase never needs to follow
long chains of Leo items.
Leo's hints about semantic processing, while brief, were insightful. The first
decision to make with the Leo semantics was direct versus indirect. The direct
approach does the semantic processing with the Leo items themselves, without
converting them. This has the advantage that costs are incurred only for the
Leo items that are actually used by the semantics. It has the very serious
disadvantage of intertwining the Leo logic with the semantics. Dealing
directly with Leo items would more than double the complexity of the logic for
Marpa's semantics. Because of this, Marpa rejected the direct approach.
suggests an indirect approach. The indirect approach is to expand the Leo
completions into the stacks of Earley items for which Leo completions are a
shorthand. However, in a naive implementation, the indirect approach
eliminates every advantage of using the Leo speedups -- it merely moves
processing from the recognizer into the semantic phase.
Marpa optimizes by using a lazy implementation of the indirect approach. Only
those Leo completions useful for the semantics are expanded.
For a lazy and indirect implementation, in all cases, time complexity is the
same as with a direct implementation. But for some ambiguous grammars, the
space required grows from linear to quadratic with any indirect approach. This
does not change the worst case time-complexity -- that was already quadratic.
Fortunately, this worst case -- highly ambiguous parses where all the parse
results are demanded -- is of limited practical interest.
Marpa takes the recognition engine as described by Aycock and Horspool and turns
it inside out and upside down -- a double inversion. In the Aycock and
Horspool version of Earley's, the main loop iterated over each item of an
Earley set, first using that item to scan for tokens, then using it to add new
items to the Earley set through completion.
Marpa turns the single main loop of the Aycock-Horspool algorithm into two
separate loops over the Earley items. The first loop completes the current
earley set. The second loop scans for tokens. The next two sections describe
the advantages of separating completion and scanning.
Prediction-driven lexing involves lookahead, but it is lookahead of an
especially powerful kind. A very nice property of Earley's algorithm is that
an Earley item exists in the Earley sets if and only if there is, starting
with the current location, some additional input that would make that Earley
item a part of a successful parse. Note that this additional input may be zero
length, in which case the presence of that Earley item indicates that the
parse would be complete and successful if the input ended at that location.
It is easy to tell from the Earley items which tokens they are expecting. This
means that at every point in the parse, it is easy to determine, for every
token, if there is some subsequent input that would make it part of a
successful parse. Conversely, it is easy to determine which tokens are fated
to be "dead ends".
Intuitively, you might say that Earley's algorithm always knows
what is going on at any point in the parse. But previous implementations of
the Earley parse engine did not allow the lexer to take advantage of this
information. The obstacle lay in the way that the parse engines worked.
At each Earley set, Earley parse engines add Earley items either
- Directly based on what is seen in the input. In the literature this
process is called "scanning". In this discussion I will call it
- Based on already existing Earley items, and only indirectly on what is
seen the input. In the literature this process is called
"completion". In this discussion I will call it indirect
The knowledge of which tokens to expect is not available until the indirect
processing is finished. For predictive lexing to be done, that knowledge
needed to be available when the direct processing began. Previous Earley's
implementations intermixed direct and indirect processing, which made
predictive lexing impossible.
Indirect processing depends on direct processing, as well as vice versa. It
might seem that this would make it impossible to separate the two, much less
complete the indirect processing first.
But direct processing only adds Earley items to subsequent
And indirect processing only adds Earley items to the current
set. While indirect processing depends on direct processing as the parse
progresses, for each individual Earley set the indirect processing is
independent of the direct processing, and can be done separately and first.
To my knowledge, the Marpa Earley's implementation is the first to make this
separation. Marpa's double inversion of the Aycock-Horspool parse engine
divides the parsing for each Earley set into two loops. The first loop does
the indirect processing for an Earley set. The second loop does the direct
In Marpa, the indirect processing is complete before the direct processing
starts. Once indirect processing ends, the parse engine knows which tokens are
possible at the current location. This knowledge can be made available to the
lexer before any tokens are scanned. This is what makes prediction-driven
A lexer working with the Marpa parse engine can know which tokens are possible
at any point, and which are not. The lexer can restrict itself to looking for
those tokens which have the potential to produce a successful parse. The lexer
can avoid wasting its time on the other tokens.
Some current lexers are already stateful. But these rely on state generated by
their own regular expression engines. The Marpa parse engine's knowledge of
expected and unexpected tokens is far more specific than the knowledge that
even the most sophisticated lexer can derive from looking at its internal
I expected that more efficient lexing would be the main benefit of
prediction-driven lexing. But then I stumbled on the Ruby Slippers Technique,
and realized that prediction-driven lexing is capable of far more than
optimizing the lexer. I call the technique of this section, Ruby Slippers
Parsing, after Dorothy's magical shoes in the movie Wizard of Oz
is because, frankly, it does seem too good to be true.
To do Ruby Slippers Parsing, you program the lexer to change, based on the input
"wished for" by the parser, the list of tokens passed to the parser.
To use the image of the movie, if the parse engine says that the next thing it
wants to do is go to Kansas, then the next thing that lexer tells the parse
engine is that, yes indeed, the parse engine is in Kansas.
The Ruby Slippers Technique can make difficult problems stunningly easy. For
example, Marpa::HTML is extremely liberal in the HTML it accepts, and
therefore had to come up with a way to deal with the problem of missing HTML
end tags. Ordinarily, this is done with some very convoluted logic full of
Instead, Marpa::HTML uses the Ruby Slippers. When the parser in Marpa::HTML hits
an obstacle, it checks to see if the parser is looking for an HTML end tag. If
the parser wants an HTML end tag, the lexer creates an tag of exactly the type
that the parser wants, on the fly. Presto, Kansas. For more about Ruby
Slippers Parsing, see my blog post on the subject
Marpa allows ambiguous lexing, including the recognition of lexemes of different
lengths starting at the same lexical position, and the recognition of
overlapping lexemes. To facilitate this, Marpa introduces the earleme
(named after Jay Earley). Previous Earley implementations required the input
to be broken up into tokens, typically by lexical analysis of the input using
regular expressions. This in turn forced the input to be unambiguous.
Requiring that the input of a general BNF parser be unambiguous is hobbling
Marpa allows the scanning phase of Earley's algorithm to add items not just to
the next Earley set, but to any later one. Alternative scans of the input can
be put into the Earley sets, and the power of Earley's algorithm harnessed to
deal with the indeterminism.
Marpa does this by allowing each scanned token to have a length in earlemes. The
is a unit of distance measured in Earley sets. If the length of
a scanned token is L
, and the current Earley set is C
, a newly
scanned Earley item is added to Earley set C+L
. For example, if a token
is 7 earlemes long, and the token is to start at Earley set 35, then the token
is added to Earley set 42.
Lexing in earlemes is flexible. If an implementation defines an earleme so that
the distances in earlemes are the same as distance in a token stream, then
lexing in earlemes is exactly equivalent to traditional lexing. But distance
in earlemes could also be defined as equivalent to string length, measured
either in bytes or Unicode graphemes.
Marpa's internal rewriting of the grammar is another significant change to the
Aycock and Horspool algorithm. Aycock and Horspool rewrote their grammars
internally into what they called NNF (Nihilist Normal Form). Earley's original
algorithm had serious issues with nullable symbols and productions, and the
NNF rewrite fixes almost all of them. (As a reminder, a nullable symbol or
production is one which derives the empty string.) Importantly, NNF also
allows complete and easy mapping of the semantics of the original grammar to
its NNF rewrite, so that NNF and the whole rewrite process can be made
invisible to the user.
A problem with NNF is that the rewritten grammar is exponentially larger than
the original in the worst case. I think cases will arise in practice where the
NNF size explosion is a real problem. For example, any language in which
whitespace is significant but optional can cause an NNF size explosion.
Marpa's solution is Chomsky-Horspool-Aycock Form (CHAF). This is Aycock and
Horspool's NNF, but with the further restriction that no more than two
nullable symbols may appear in any production. (The idea that any context-free
grammar can be rewritten into productions of at most a small fixed size
appears to originate, like so much else, with Noam Chomsky.) The shortened
CHAF production maps back to the original grammar so that like NNF, the CHAF
rewrite can be made invisible to the user. With CHAF, the theoretical worst
behavior is linear, and in those difficult cases likely to arise in practice
the multiplier is much smaller.
The Aycock-Horspool algorithm requires the grammar to be converted into what
their paper calls a "split X-DFA". I find the term "split
X-DFA" unwieldy, and prefer to take the opportunity to give credit where
it is due. The sets that Aycock and Horspool call split X-DFA states, I call
AHFA (Aycock-Horspool Finite Automaton) states.
The description of the "split X-DFA" is on pages 624-627 of
Aycock-Horspool 2002. This description contained the insights without which
Marpa could not have been written. But, in my reading at least, the
Aycock-Horspool description was conceptual -- a tool for explaining the theory
behind the AHFA. An actual algorithm to create an AHFA from an arbitrary
grammar needed to be developed as part of the Marpa project.
This document is devoted to what is new with Marpa. It is not intended to
describe the work of others. But I hope the reader will forgive the digression
that is this section. Having the work of others to build on was essential to
the Marpa project, and I want to illustrate this with two examples.
The Marpa project did not
have to figure out how to handle nullable
symbols. This is fortunate, because I am not sure I would have had the insight
to look at the right problem, never mind find the solution. But Aycock and
Horspool identified this problem and solved it for me. All I needed to do was
understand their ideas and implement them.
Similarly, Marpa did not
have to figure out how to solve another
long-standing problem with Earley's algorithm -- the inefficiency of
right-recursion. Joop Leo solved this problem. As with the problem of
nullables, I am not sure I would have had the perspicuity even to define the
problem of right-recursion correctly, and I am very happy to have had it
solved for me.
This digression is far from a comprehensive list of what the Marpa project owes
to the work of others. But I hope these two examples are useful in putting
things in perspective.
The rest of this document will deal with my algorithm for iterating and
evaluating the Marpa parses, and the concepts I discovered and methods I
developed along the way. While the problem of efficiently recognizing parses
is harder than the problem of iterating and evaluating them, more than half of
the work in writing Marpa was to develop the algorithm to iterate and evaluate
The reason I wound up putting so much effort into iteration and evaluation lay
in Marpa's objective. From the first, the main objective of the Marpa project
was to put a practical general BNF parsing tool into the hands of mainstream
programmers. This had never been done. Clearly, for a general parsing tool to
be practical, it needed a convenient way to evaluate the parse results it
recognized. And, if it claimed (as in fact Marpa does claim) to handle
ambiguous parsing smoothly, it also needed a good method for iterating parse
Among my sources, the degree of focus on the practical varies. But all of my
sources focused far more on the recognition problem than anything else. When
it came to issues in recognition, my sources went out of their way to find the
hard problems, solve them, and set forth the main ideas to the reader. When it
came to iterating and evaluating the Earley sets, my sources were content with
hints and guesses. The guesses proved insightful and the hints wise, but the
actual working out of an algorithm fell to the Marpa project.
My work in developing Marpa's Evaluator demonstrated persistence more than it
embodied elegance. I hope the current implementation looks straightforward,
because my route to it certainly was not. Marpa's parse iteration and
evaluation logic is in its 3rd Generation. Two previous Marpa evaluators were
developed and repeatedly enhanced, only to have them reach a point where it
was obvious that further progress demanded that they be completely discarded.
Twice, Marpa's Evaluator was replaced with a new implementation, one written
almost entirely from scratch.
Aycock and Horspool seem to have understood the iteration and evaluation might
contain significant unsolved problems. In their paper, they identify the
search for "better ways to reconstruct derivations" as an "area
of future work."
A key concept behind Marpa's 3rd Generation Evaluator is that of an
applicable dotted rule
. In the Aycock-Horspool modification of
Earley's algorithm, the Earley items do not directly correspond to dotted
rules. Aycock-Horspool modified the Earley items to contain LR
states, which are sets of dotted rules. (As a reminder, a dotted rule is
another term for an LR
(0) item.) In order to iterate and evaluate a
parse result, it is necessary to find all the dotted rules which are
applicable to that parse result. This process is described more fully in the
A disadvantage of Aycock-Horspool's use of LR
(0) states in Earley items
was that it muddied the question of what it means for two parses to be
"different". Following the terminology in the textbooks, let me say
that two parse trees are different if they have different rightmost
derivations. Next, let me introduce the idea of a factored parse tree.
In this context, what I mean by factoring is assigning start and end locations
in the input to each node of the parse tree. Often, given a parse tree and an
input, only one factoring is possible, but this is not always the case. A
factored parse tree
is a parse tree, all of whose nodes are factored. A
is a total map, from the nodes of the parse tree, to duples
composed of a start location and an end location.
The use of LR
(0) states in the Aycock-Horspool Earley items meant that
two different sets of the Earley items could produce the same factored parse
tree. As far as I know I was the first person to encounter this issue -- I've
not seen it mentioned in the literature. This problem did not arise with
traditional Earley parsing. In traditional Earley parsing, dotted rules have a
one-to-one correspondence with traditional Earley items.
Because Earley items have an origin and a current location, each Earley item
implies a factoring. Since in traditional Earley parsing each Earley item
implies a unique factoring and a unique dotted rule, each factored parse tree
corresponded to a unique set of traditional Earley items. In traditional
Earley parsing, if two sets of Earley items were different, their factored
parse trees had to be different.
The intuitive idea of "two parses being the same" seems to be
equivalent to the identity of their factored parse trees. In Marpa's
Generation 2 Evaluator, or-nodes in the parse bocage usually corresponded to
Aycock-Horspool Earley items, and therefore to LR
(0) states. This
resulted in the production of multiple parse results for the same factored
parse tree. Later versions of the Generation 2 Evaluator were modified to
detect and prevent these duplicate parses.
With the Generation 3 Marpa Evaluator, the whole logic of the parse bocage was
rethought, and the entire evaluator rewritten. In Generation 3, the parse
bocage or-nodes do not necessarily correspond one-on-one to Earley items.
Instead, Generation 3 parse bocage or-nodes correspond one-on-one to dotted
Parse forests and their terminology are very standard in the literature on
ambiguous parsing. But ambiguous parsing gets little or no coverage in many
modern textbooks on parsing, so that even readers who have studied parsing
might not know these terms.
A Marpa parse is a set of parse trees, one for every parse result. The set of
parse trees is trivial if the parse is unambiguous -- it is a set of trees
that contains only one tree.
If the parse is ambiguous, there is more than one parse tree. There is an
efficient and compact means of storing a set of closely related parse trees.
It is called, aptly, a parse forest. Parse forests have been in the parsing
literature for some time now. I am unable to discover who discovered the
concept or who coined the term.
Nodes in a parse forest are divided into and-nodes and or-nodes. And-nodes are
individual pieces of parse trees. In conventional parse forests, each and-node
represents a production. And-nodes in a parse forest are just like all the
nodes in a parse tree -- they have the LHS as the parent node and the RHS
symbols as child nodes.
Or-nodes represent choices among and-nodes. Each child of an or-node is one
choice. A parse tree can be generated from a parse forest by traversing the
forest and selecting one child and-node at every or-node.
Marpa's representation of parse results is considerably more complicated than a
parse forest, but it borrows many of the ideas, and recycles much of the
terminology. Marpa represents parse results in a virtual data structure called
a parse bocage
. I say that the parse bocage is a virtual data
structure, because no actual Marpa data structure corresponds exactly to the
parse bocage. Like parse forests, parse bocages contain and-nodes and
or-nodes, but in the parse bocage these are created on a just-in-time basis.
The parse bocage
differs from a parse forest in several other ways.
- The parse bocage is never fully linked internally. Its structure relies on
the iteration stack (described below).
- Marpa parses grammars with cycles, so parse bocages might contain not only
trees, but graphs. This means that, pedantically speaking, Marpa's parse
bocages are based not on parse forests, but on parse graphs.
- In the parse bocage, and-nodes often do not represent productions of the
grammar directly. (In a parse forest, the and-nodes always represent the
productions of the grammar directly.) In the parse bocage, and-nodes have
at most two children. The parse bocage often breaks productions of the
original grammar into pieces, and it is the pieces that are represented in
the and-nodes of the bocage.
The system of links used in Marpa is based on that devised by Aycock and
Horspool. In the Aycock-Horspool scheme, each Earley item has a list of
sources. Each source can have two elements: a predecessor and a cause.
The predecessor is optional. If present, it is a link to an Earley item. It's
called a "predecessor link" because both the linking and the linked
Earley item are pieces of the same production, and the linked Earley item is
the "predecessor", or earlier part.
The cause element of the source is always present. It can be a token or a link
to a child Earley item.
Because the Earley items contain their sources in the form of predecessor-cause
pairs, they in effect rewrite the original grammar into productions with at
most two symbols on the right hand side. This is basically Chomsky Normal Form
A CNF grammar can be represented conveniently as a binary tree. Marpa does not
attempt to reconstruct the original structure of the grammar when it creates
the parse bocage from the Earley items. Code to traverse, iterate and evaluate
binary nodes is considerably simpler than code which needs to deal with nodes
which have an arbitrary number of children. Rather than rewrite the grammar
into a form that's harder to process, it makes sense to leave the grammar the
way it is found in the Earley items.
A crucial change for Generation 3 of Marpa's Evaluator was the introduction of
the iteration nodes
, and the creation of an iteration stack
Marpa uses the iteration stack and iteration nodes to perform the iteration of
As of Generation 3, information that changes during iteration and evaluation is
not kept in the parse bocage. As soon as the recognition phase ends, Marpa's
Generation 3 parse bocage is conceptually constant. (In the actual
implementation, the parse bocage is created and linked during iteration and
evaluation, as needed and on a just-in-time basis.) The content of the parse
bocage is fully determined as soon as the recognition phase ends, and before
the iteration and evaluation of parse results begins.
The iteration stack
is composed of iteration nodes
. An iteration
node corresponds to one or-node, and throughout the iteration node's lifetime
that correspondence remains constant. In addition, if the parse is cycle-free,
that correspondence will be on a one-to-one basis.
At every evaluation point during the iteration, each iteration node contains a
choice of and-node. Conceptually, Marpa does its evaluation using a stack of
and-nodes, called an evaluation stack
. Marpa's evaluation stack is
created from the iteration stack by transforming every iteration node into the
and-node that is its current choice.
The term "bocage" comes from the bocage forests of Normandy and
Brittany. These man-made forests are networks of hedgerows, vegetative tangles
grown more and more impenetrable over the centuries.
A bocage hedge is typically part of a working farm, and its everyday use is to
fence livestock. But most of the world's farmers find their cows to be
adequately confined by less imposing obstacles. The bocage country is the most
fought-over section of France, and over its violent history the hedgerows have
accumulated a considerable reputation as field fortification.
The first bocage hedgerows may have been cultivated to stop the armies of the
Hundred Years War. But bocage continued to be formidable well after the
invention of tanks, artillery and aircraft. Every account of the Normandy
Campaign of 1944 speaks with respect of the bocage country. Most identify the
Allied breakout from the hedgerows as the turning point of that campaign.
Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software
under the Perl license. For details see the LICENSE file in the Marpa
Hey! The above document had some coding errors, which are explained
- Around line 51:
- Expected text after =item, not a number
- Around line 62:
- Expected text after =item, not a number
- Around line 75:
- Expected text after =item, not a number