NAME

Marpa::Advanced::Models - Other Input Models

ABOUT THIS DOCUMENT

The alternative input models desribed in this document are an advanced technique. This document may safely be ignored by ordinary Marpa users.

DESCRIPTION

Token Streams

Marpa's default input model is the traditional one -- a token stream. Token streams are very standard in parsing applications -- so much so that most texts do not take the trouble of defining the term. A token stream is input structured as a sequence of tokens, where each token occupies one location and every location has a token. In the token stream model, all tokens are of the same length.

Conventionally, all tokens are of length 1, and the token stream starts at location 0. Following this convention, the Nth token would start at location N-1 and end at location N. For example, the first token would start at location 0 and end at location 1. The second token would start at location 1 and end at location 2.

Earlemes

The basic idea of location in Marpa is the earleme. Earlemes are named after Jay Earley. Internally, each earleme corresponds exactly and directly to an Earley set. Every token has a start earleme and an end earleme.

The token stream model may also be called the token-per-earleme model. In the token stream model, token location and earleme location directly correspond on an one-to-one basis. It can be useful to have the structure of the input relate to earleme location in other ways. One such alternative, which is useful and been tested, is the character-per-earleme model, discussed below.

INTERFACE

Alternative models are implemented using the optional third and fourth parameters of the token descriptors. Token descriptors are used in the arguments to the Marpa Recognizer's tokens method.

Token Length

Token length is the optional third element of the token descriptor. By default, it is 1, which is the correct value for the token stream model. Its value can be any integer greater than zero. Marpa does not allow zero length tokens in any input model.

Token Offset

The token offset is the fourth element of each token's descriptor. Its value is taken as the offset to be added to the current earleme location. The current earleme location is the earleme which will be the start earleme of the next token to be added. When parsing begins, the current earleme location is earleme 0.

Negative token offsets are not allowed. Zero token offsets are allowed except for the last token descriptor of a tokens() call. A zero token offset will cause the next token to start at the same location. Multiple tokens starting at the same location will cause the lexing to be ambiguous. Marpa supports ambiguous lexing.

When token offset is left undefined, it defaults to 1. This is the correct value for the token stream model of input.

The intuitive expectation seems to be that every tokens() call will advance the current earleme after the last token descriptor. Because of this, a token offset of zero is not allowed for the last token descriptor. As a special case, when a token offset of zero is explicitly specified for the last token descriptor of a tokens() call, it is silently changed to 1.

In practice this seems to obey the principle of least surprise. The typical loop in an ambiguous lexing makes one pass for each earleme, gathering all the tokens that can start at that earleme into a list. For every token in this list, except the last, the token offset clearly should be 0, since all of the tokens are at the same earleme. A programmer might expect that he can give all the tokens a token offset of 0, thinking that Marpa will understand that the intention is to advance the current earleme after the tokens() call is complete. Since this is what the programmer expects, this is what Marpa does.

AMBIGUOUS LEXING

Ambiguous lexing occurs when several different sequences of tokens are possible. Potentially ambiguous lexing occurs in any parse where multiple tokens start at a single earleme. An actual ambiguity only occurs if more than one of the potential token choices is consistent with the grammar and previous input. If there is no actual ambiguity, Marpa will use the only token choice which is consistent with the grammar and previous inputs.

When lexing is actually ambiguous, Marpa will use all consistent alternatives. When the lexing in a parse is actually ambiguous, the parse will be ambiguous. From the point of view of Marpa's semantics, ambiguity caused by lexing is exactly the same as ambiguity caused by an ambiguous grammar.

In the standard terminology, a grammar is ambiguous if it can produce more than one parse result, and vice versa. In Marpa this is not strictly true. In Marpa, an unambiguous grammar will produce than one parse, if the lexing is actually ambiguous.