NAME

Parse::RecDescent - Generate Recursive-Descent Parsers

VERSION

This document describes version 1.967015 of Parse::RecDescent released April 4th, 2017.

SYNOPSIS

    use Parse::RecDescent;

    # Generate a parser from the specification in $grammar:

    $parser = new Parse::RecDescent ($grammar);

    # Generate a parser from the specification in $othergrammar

    $anotherparser = new Parse::RecDescent ($othergrammar);

    # Parse $text using rule 'startrule' (which must be
    # defined in $grammar):

    $parser->startrule($text);

    # Parse $text using rule 'otherrule' (which must also
    # be defined in $grammar):

    $parser->otherrule($text);

    # Change the universal token prefix pattern
    # before building a grammar
    # (the default is: '\s*'):

    $Parse::RecDescent::skip = '[ \t]+';

    # Replace productions of existing rules (or create new ones)
    # with the productions defined in $newgrammar:

    $parser->Replace($newgrammar);

    # Extend existing rules (or create new ones)
    # by adding extra productions defined in $moregrammar:

    $parser->Extend($moregrammar);

    # Global flags (useful as command line arguments under -s):

    $::RD_ERRORS       # unless undefined, report fatal errors
    $::RD_WARN         # unless undefined, also report non-fatal problems
    $::RD_HINT         # if defined, also suggest remedies
    $::RD_TRACE        # if defined, also trace parsers' behaviour
    $::RD_AUTOSTUB     # if defined, generates "stubs" for undefined rules
    $::RD_AUTOACTION   # if defined, appends specified action to productions

DESCRIPTION

Overview

Parse::RecDescent incrementally generates top-down recursive-descent text parsers from simple yacc-like grammar specifications. It provides:
Using "Parse::RecDescent"

Parser objects are created by calling "Parse::RecDescent::new", passing in a grammar specification (see the following subsections). If the grammar is correct, "new" returns a blessed reference which can then be used to initiate parsing through any rule specified in the original grammar. A typical sequence looks like this:

    $grammar = q {
        # GRAMMAR SPECIFICATION HERE
    };

    $parser = new Parse::RecDescent ($grammar) or die "Bad grammar!\n";

    # acquire $text

    defined $parser->startrule($text) or print "Bad text!\n";

The rule through which parsing is initiated must be explicitly defined in the grammar (i.e. for the above example, the grammar must include a rule of the form: "startrule: <subrules>").

If the starting rule succeeds, its value (see below) is returned. Failure to generate the original parser or failure to match a text is indicated by returning "undef". Note that it's easy to set up grammars that can succeed, but which return a value of 0, "0", or "". So don't be tempted to write:

    $parser->startrule($text) or print "Bad text!\n";

Normally, the parser has no effect on the original text. So in the previous example the value of $text would be unchanged after having been parsed.

If, however, the text to be matched is passed by reference:

    $parser->startrule(\$text)

then any text which was consumed during the match will be removed from the start of $text.

Rules

In the grammar from which the parser is built, rules are specified by giving an identifier (which must satisfy /[A-Za-z]\w*/), followed by a colon on the same line, followed by one or more productions, separated by single vertical bars. The layout of the productions is entirely free-format:

    rule1:  production1
         |  production2 |
            production3 | production4

At any point in the grammar previously defined rules may be extended with additional productions. This is achieved by redeclaring the rule with the new productions.
Thus:

    rule1: a | b | c
    rule2: d | e | f
    rule1: g | h

is exactly equivalent to:

    rule1: a | b | c | g | h
    rule2: d | e | f

Each production in a rule consists of zero or more items, each of which may be either: the name of another rule to be matched (a "subrule"), a pattern or string literal to be matched directly (a "token"), a block of Perl code to be executed (an "action"), a special instruction to the parser (a "directive"), or a standard Perl comment (which is ignored).

A rule matches a text if one of its productions matches. A production matches if each of its items match consecutive substrings of the text. The productions of a rule being matched are tried in the same order that they appear in the original grammar, and the first matching production terminates the match attempt (successfully). If all productions are tried and none matches, the match attempt fails.

Note that this behaviour is quite different from the "prefer the longer match" behaviour of yacc. For example, if yacc were parsing the rule:

    seq : 'A' 'B'
        | 'A' 'B' 'C'

upon matching "AB" it would look ahead to see if a 'C' is next and, if so, will match the second production in preference to the first. In other words, yacc effectively tries all the productions of a rule breadth-first in parallel, and selects the "best" match, where "best" means longest (note that this is a gross simplification of the true behaviour of yacc but it will do for our purposes).

In contrast, "Parse::RecDescent" tries each production depth-first in sequence, and selects the "best" match, where "best" means first. This is the fundamental difference between "bottom-up" and "recursive descent" parsing.

Each successfully matched item in a production is assigned a value, which can be accessed in subsequent actions within the same production (or, in some cases, as the return value of a successful subrule call).
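The "first match wins" behaviour described above can be seen in a small sketch. The rule names "short_first" and "long_first" are hypothetical and exist only for this illustration; each production ends in an action that returns a label, so the rule's value shows which production matched.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical grammar: the same two productions, listed in both orders.
my $grammar = q{
    short_first: 'A' 'B'     { 'AB' }
               | 'A' 'B' 'C' { 'ABC' }

    long_first:  'A' 'B' 'C' { 'ABC' }
               | 'A' 'B'     { 'AB' }
};

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar!\n";

# The first production that matches ends the attempt, so the shorter
# production wins when it is listed first, even though 'C' follows.
my $short = $parser->short_first('ABC');
my $long  = $parser->long_first('ABC');

print "short_first matched: $short\n";
print "long_first matched:  $long\n";
```

When a longer alternative shares a prefix with a shorter one, listing the longer production first is the usual way to get the yacc-like result.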
Unsuccessful items don't have an associated value, since the failure of an item causes the entire surrounding production to immediately fail. The following sections describe the various types of items and their success values.

Subrules

A subrule which appears in a production is an instruction to the parser to attempt to match the named rule at that point in the text being parsed. If the named subrule is not defined when requested the production containing it immediately fails (unless it was "autostubbed" - see Autostubbing).

A rule may (recursively) call itself as a subrule, but not as the left-most item in any of its productions (since such recursions are usually non-terminating).

The value associated with a subrule is the value associated with its $return variable (see "Actions" below), or with the last successfully matched item in the subrule match.

Subrules may also be specified with a trailing repetition specifier, indicating that they are to be (greedily) matched the specified number of times. The available specifiers are:

    subrule(?)      # Match one-or-zero times
    subrule(s)      # Match one-or-more times
    subrule(s?)     # Match zero-or-more times
    subrule(N)      # Match exactly N times for integer N > 0
    subrule(N..M)   # Match between N and M times
    subrule(..M)    # Match between 1 and M times
    subrule(N..)    # Match at least N times

Repeated subrules keep matching until either the subrule fails to match, or it has matched the minimal number of times but fails to consume any of the parsed text (this second condition prevents the subrule matching forever in some cases).

Since a repeated subrule may match many instances of the subrule itself, the value associated with it is not a simple scalar, but rather a reference to a list of scalars, each of which is the value associated with one of the individual subrule matches.
In other words in the rule:

    program: statement(s)

the value associated with the repeated subrule "statement(s)" is a reference to an array containing the values matched by each call to the individual subrule "statement".

Repetition modifiers may include a separator pattern:

    program: statement(s /;/)

specifying some sequence of characters to be skipped between each repetition. This is really just a shorthand for the <leftop:...> directive (see below).

Tokens

If a quote-delimited string or a Perl regex appears in a production, the parser attempts to match that string or pattern at that point in the text. For example:

    typedef: "typedef" typename identifier ';'

    identifier: /[A-Za-z_][A-Za-z0-9_]*/

As in regular Perl, a single quoted string is uninterpolated, whilst a double-quoted string or a pattern is interpolated (at the time of matching, not when the parser is constructed). Hence, it is possible to define rules in which tokens can be set at run-time:

    typedef: "$::typedefkeyword" typename identifier ';'

    identifier: /$::identpat/

Note that, since each rule is implemented inside a special namespace belonging to its parser, it is necessary to explicitly qualify variables from the main package.

Regex tokens can be specified using just slashes as delimiters or with the explicit "m<delimiter>......<delimiter>" syntax:

    typedef: "typedef" typename identifier ';'

    typename: /[A-Za-z_][A-Za-z0-9_]*/

    identifier: m{[A-Za-z_][A-Za-z0-9_]*}

A regex of either type can also have any valid trailing parameter(s) (that is, any of [cgimsox]):

    typedef: "typedef" typename identifier ';'

    identifier: / [a-z_]        # LEADING ALPHA OR UNDERSCORE
                  [a-z0-9_]*    # THEN DIGITS ALSO ALLOWED
                /ix             # CASE/SPACE/COMMENT INSENSITIVE

The value associated with any successfully matched token is a string containing the actual text which was matched by the token.
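As a minimal sketch of a token that is set at run-time, the following reuses the "identifier: /$::identpat/" rule from above. The patterns assigned to $::identpat are assumptions chosen for illustration, and they avoid backslashes so that no escaping questions arise.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Package variable in main, read by the grammar each time a match is attempted.
$::identpat = '[a-z]+';

my $parser = Parse::RecDescent->new(q{
    identifier: /$::identpat/
}) or die "Bad grammar!\n";

# Matched against [a-z]+; the token's value is the matched text.
my $word = $parser->identifier('hello');

# Change the pattern without rebuilding the parser.
$::identpat = '[0-9]+';
my $number   = $parser->identifier('42');
my $rejected = $parser->identifier('hello');   # no longer matches, so undef

print "word:   $word\n";
print "number: $number\n";
print "letters now ", (defined $rejected ? "match" : "fail"), "\n";
```

Because the variable lives in the main package, it has to be written as $::identpat inside the grammar, as the text notes.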
It is important to remember that, since each grammar is specified in a Perl string, all instances of the universal escape character '\' within a grammar must be "doubled", so that they interpolate to single '\'s when the string is compiled. For example, to use the grammar:

    word:       /\S+/ | backslash
    line:       prefix word(s) "\n"
    backslash:  '\\'

the following code is required:

    $parser = new Parse::RecDescent (q{

        word:       /\\S+/ | backslash
        line:       prefix word(s) "\\n"
        backslash:  '\\\\'

    });

Anonymous subrules

Parentheses introduce a nested scope that is very like a call to an anonymous subrule. Hence they are useful for "in-lining" subroutine calls, and other kinds of grouping behaviour. For example, instead of:

    word:       /\S+/ | backslash
    line:       prefix word(s) "\n"

you could write:

    line:       prefix ( /\S+/ | backslash )(s) "\n"

and get exactly the same effects.

Parentheses are also used for collecting unrepeated alternations within a single production.

    secret_identity: "Mr" ("Incredible"|"Fantastic"|"Sheen") ", Esq."

Terminal Separators

For the purpose of matching, each terminal in a production is considered to be preceded by a "prefix" - a pattern which must be matched before a token match is attempted. By default, the prefix is optional whitespace (which always matches, at least trivially), but this default may be reset in any production.

The variable $Parse::RecDescent::skip stores the universal prefix, which is the default for all terminal matches in all parsers built with "Parse::RecDescent".

If you want to change the universal prefix using $Parse::RecDescent::skip, be careful to set it before creating the grammar object, because it is applied statically (when a grammar is built) rather than dynamically (when the grammar is used). Alternatively you can provide a global "<skip:...>" directive in your grammar before any rules (described later).

The prefix for an individual production can be altered by using the "<skip:...>" directive (described later).
Setting this directive in the top-level rule is an alternative approach to setting $Parse::RecDescent::skip before creating the object, but in this case you don't get the intended skipping behaviour if you directly invoke methods different from the top-level rule.

Actions

An action is a block of Perl code which is to be executed (as the block of a "do" statement) when the parser reaches that point in a production. The action executes within a special namespace belonging to the active parser, so care must be taken in correctly qualifying variable names (see also "Start-up Actions" below).

The action is considered to succeed if the final value of the block is defined (that is, if the implied "do" statement evaluates to a defined value - even one which would be treated as "false"). Note that the value associated with a successful action is also the final value in the block.

An action will fail if its last evaluated value is "undef". This is surprisingly easy to accomplish by accident. For instance, here's an infuriating case of an action that makes its production fail, but only when debugging isn't activated:

    description: name rank serial_number
                    { print "Got $item[2] $item[1] ($item[3])\n"
                        if $::debugging
                    }

If $::debugging is false, no statement in the block is executed, so the final value is "undef", and the entire production fails. The solution is:

    description: name rank serial_number
                    { print "Got $item[2] $item[1] ($item[3])\n"
                        if $::debugging;
                      1;
                    }

Within an action, a number of useful parse-time variables are available in the special parser namespace (there are other variables also accessible, but meddling with them will probably just break your parser. As a general rule, if you avoid referring to unqualified variables - especially those starting with an underscore - inside an action, things should be okay):
Warning: the parser relies on the information in the various "this..." objects in some non-obvious ways. Tinkering with the other members of these objects will probably cause Bad Things to happen, unless you really know what you're doing. The only exception to this advice is that the use of "$this...->{local}" is always safe.

Start-up Actions

Any actions which appear before the first rule definition in a grammar are treated as "start-up" actions. Each such action is stripped of its outermost brackets and then evaluated (in the parser's special namespace) just before the rules of the grammar are first compiled.

The main use of start-up actions is to declare local variables within the parser's special namespace:

    { my $lastitem = '???'; }

    list: item(s)   { $return = $lastitem }

    item: book      { $lastitem = 'book'; }
        | bell      { $lastitem = 'bell'; }
        | candle    { $lastitem = 'candle'; }

but start-up actions can be used to execute any valid Perl code within a parser's special namespace.

Start-up actions can appear within a grammar extension or replacement (that is, a partial grammar installed via "Parse::RecDescent::Extend()" or "Parse::RecDescent::Replace()" - see "Incremental Parsing"), and will be executed before the new grammar is installed. Note, however, that a particular start-up action is only ever executed once.

Autoactions

It is sometimes desirable to be able to specify a default action to be taken at the end of every production (for example, in order to easily build a parse tree). If the variable $::RD_AUTOACTION is defined when "Parse::RecDescent::new()" is called, the contents of that variable are treated as a specification of an action which is to be appended to each production in the corresponding grammar.

Alternatively, you can hard-code the autoaction within a grammar, using the "<autoaction:...>" directive.
So, for example, to construct a simple parse tree you could write:

    $::RD_AUTOACTION = q { [@item] };

    $parser = Parse::RecDescent->new(q{
        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
    });

or:

    $parser = Parse::RecDescent->new(q{
        <autoaction: { [@item] } >

        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
    });

Either of these is equivalent to:

    $parser = new Parse::RecDescent (q{
        expression: and_expr '||' expression
            { [@item] }
                  | and_expr
            { [@item] }

        and_expr:   not_expr '&&' and_expr
            { [@item] }
                |   not_expr
            { [@item] }

        not_expr:   '!' brack_expr
            { [@item] }
                |   brack_expr
            { [@item] }

        brack_expr: '(' expression ')'
            { [@item] }
                  | identifier
            { [@item] }

        identifier: /[a-z]+/i
            { [@item] }
    });

Alternatively, we could take an object-oriented approach, using different classes for each node (and also eliminating redundant intermediate nodes):

    $::RD_AUTOACTION = q
      { $#item==1 ? $item[1] : "$item[0]_node"->new(@item[1..$#item]) };

    $parser = Parse::RecDescent->new(q{
        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
    });

or:

    $parser = Parse::RecDescent->new(q{
        <autoaction:
          $#item==1 ? $item[1] : "$item[0]_node"->new(@item[1..$#item])
        >

        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
    });

which are equivalent to:

    $parser = Parse::RecDescent->new(q{
        expression: and_expr '||' expression
            { "expression_node"->new(@item[1..3]) }
            | and_expr

        and_expr:   not_expr '&&' and_expr
            { "and_expr_node"->new(@item[1..3]) }
            |   not_expr

        not_expr:   '!' brack_expr
            { "not_expr_node"->new(@item[1..2]) }
            |   brack_expr

        brack_expr: '(' expression ')'
            { "brack_expr_node"->new(@item[1..3]) }
            | identifier

        identifier: /[a-z]+/i
            { "identifier_node"->new(@item[1]) }
    });

Note that, if a production already ends in an action, no autoaction is appended to it. For example, in this version:

    $::RD_AUTOACTION = q
      { $#item==1 ? $item[1] : "$item[0]_node"->new(@item[1..$#item]) };

    $parser = Parse::RecDescent->new(q{
        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
            { 'terminal_node'->new($item[1]) }
    });

each "identifier" match produces a "terminal_node" object, not an "identifier_node" object.

A level 1 warning is issued each time an "autoaction" is added to some production.

Autotrees

A commonly needed autoaction is one that builds a parse-tree. It is moderately tricky to set up such an action (which must treat terminals differently from non-terminals), so Parse::RecDescent simplifies the process by providing the "<autotree>" directive.

If this directive appears at the start of a grammar, it causes Parse::RecDescent to insert autoactions at the end of any rule except those which already end in an action. The action inserted depends on whether the production is an intermediate rule (two or more items), or a terminal of the grammar (i.e. a single pattern or string item).
So, for example, the following grammar:

    <autotree>

    file    : command(s)
    command : get | set | vet
    get     : 'get' ident ';'
    set     : 'set' ident 'to' value ';'
    vet     : 'check' ident 'is' value ';'
    ident   : /\w+/
    value   : /\d+/

is equivalent to:

    file    : command(s)                    { bless \%item, $item[0] }
    command : get                           { bless \%item, $item[0] }
            | set                           { bless \%item, $item[0] }
            | vet                           { bless \%item, $item[0] }
    get     : 'get' ident ';'               { bless \%item, $item[0] }
    set     : 'set' ident 'to' value ';'    { bless \%item, $item[0] }
    vet     : 'check' ident 'is' value ';'  { bless \%item, $item[0] }

    ident   : /\w+/     { bless {__VALUE__=>$item[1]}, $item[0] }
    value   : /\d+/     { bless {__VALUE__=>$item[1]}, $item[0] }

Note that each node in the tree is blessed into a class of the same name as the rule itself. This makes it easy to build object-oriented processors for the parse-trees that the grammar produces. Note too that the last two rules produce special objects with the single attribute '__VALUE__'. This is because they consist solely of a single terminal.

This autoaction-ed grammar would then produce a parse tree in a data structure like this:

    {
      file => {
        command => {
          [
            get => {
              ident => { __VALUE__ => 'a' },
            },
            set => {
              ident => { __VALUE__ => 'b' },
              value => { __VALUE__ => '7' },
            },
            vet => {
              ident => { __VALUE__ => 'b' },
              value => { __VALUE__ => '7' },
            },
          ],
        },
      }
    }

(except, of course, that each nested hash would also be blessed into the appropriate class).

You can also specify a base class for the "<autotree>" directive. The supplied prefix will be prepended to the rule names when creating tree nodes. The following are equivalent:

    <autotree:MyBase::Class>
    <autotree:MyBase::Class::>

And will produce a root node blessed into the "MyBase::Class::file" package in the example above.

Autostubbing

Normally, if a subrule appears in some production, but no rule of that name is ever defined in the grammar, the production which refers to the non-existent subrule fails immediately.
This typically occurs as a result of misspellings, and is a sufficiently common occurrence that a warning is generated for such situations.

However, when prototyping a grammar it is sometimes useful to be able to use subrules before a proper specification of them is really possible. For example, a grammar might include a section like:

    function_call: identifier '(' arg(s?) ')'

    identifier: /[a-z]\w*/i

where the possible format of an argument is sufficiently complex that it is not worth specifying in full until the general function call syntax has been debugged. In this situation it is convenient to leave the real rule "arg" undefined and just slip in a placeholder (or "stub"):

    arg: 'arg'

so that the function call syntax can be tested with dummy input such as:

    f0()
    f1(arg)
    f2(arg arg)
    f3(arg arg arg)

et cetera.

Early in prototyping, many such "stubs" may be required, so "Parse::RecDescent" provides a means of automating their definition. If the variable $::RD_AUTOSTUB is defined when a parser is built, a subrule reference to any non-existent rule (say, "subrule"), will cause a "stub" rule to be automatically defined in the generated parser. If "$::RD_AUTOSTUB eq '1'" or is false, a stub rule of the form:

    subrule: 'subrule'

will be generated. The special-case for a value of '1' is to allow the use of the perl -s with -RD_AUTOSTUB without generating "subrule: '1'" per below. If $::RD_AUTOSTUB is true, a stub rule of the form:

    subrule: $::RD_AUTOSTUB

will be generated. $::RD_AUTOSTUB must contain a valid production item; no checking is performed. No lazy evaluation of $::RD_AUTOSTUB is performed; it is evaluated at the time the parser is generated.

Hence, with $::RD_AUTOSTUB defined, it is possible to only partially specify a grammar, and then "fake" matches of the unspecified (sub)rules by just typing in their name, or a literal value that was assigned to $::RD_AUTOSTUB.
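The function-call example above can be tried out roughly as follows. This is a sketch only: the trailing action that counts the matched "arg" stubs is an addition for illustration and is not part of the original grammar fragment.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Must be set before the parser is built; '1' requests stubs of the
# form  arg: 'arg'  for every undefined subrule.
$::RD_AUTOSTUB = 1;

my $parser = Parse::RecDescent->new(q{
    # The action (an illustrative addition) returns how many stubbed
    # 'arg' subrules were matched.
    function_call: identifier '(' arg(s?) ')' { scalar @{ $item[3] } }

    identifier: /[a-z]\w*/i
}) or die "Bad grammar!\n";

# 'arg' is never defined in the grammar, so the stub matches the literal
# text "arg" in the dummy input.
my $count = $parser->function_call('f2(arg arg)');

print "matched ", (defined $count ? $count : 'no'), " stubbed argument(s)\n";
```

Note the "defined" test: a call such as "f0()" would yield a count of 0, which is a successful parse with a false value.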
Look-ahead

If a subrule, token, or action is prefixed by "...", then it is treated as a "look-ahead" request. That means that the current production can (as usual) only succeed if the specified item is matched, but that the matching does not consume any of the text being parsed. This is very similar to the "/(?=...)/" look-ahead construct in Perl patterns. Thus, the rule:

    inner_word: word ...word

will match whatever the subrule "word" matches, provided that match is followed by some more text which subrule "word" would also match (although this second substring is not actually consumed by "inner_word").

Likewise, a "...!" prefix, causes the following item to succeed (without consuming any text) if and only if it would normally fail. Hence, a rule such as:

    identifier: ...!keyword ...!'_' /[A-Za-z_]\w*/

matches a string of characters which satisfies the pattern "/[A-Za-z_]\w*/", but only if the same sequence of characters would not match either subrule "keyword" or the literal token '_'.

Sequences of look-ahead prefixes accumulate, multiplying their positive and/or negative senses. Hence:

    inner_word: word ...!......!word

is exactly equivalent to the original example above (a warning is issued in cases like these, since they often indicate something left out, or misunderstood).

Note that actions can also be treated as look-aheads. In such cases, the state of the parser text (in the local variable $text) after the look-ahead action is guaranteed to be identical to its state before the action, regardless of how it's changed within the action (unless you actually undefine $text, in which case you get the disaster you deserve :-).

Directives

Directives are special pre-defined actions which may be used to alter the behaviour of the parser.
There are currently twenty-two directives: "<commit>", "<uncommit>", "<reject>", "<score>", "<autoscore>", "<skip>", "<resync>", "<error>", "<warn>", "<hint>", "<trace_build>", "<trace_parse>", "<nocheck>", "<rulevar>", "<matchrule>", "<leftop>", "<rightop>", "<defer>", "<perl_quotelike>", "<perl_codeblock>", "<perl_variable>", and "<token>".
Note that, when autogenerating error messages, all underscores in any rule name used in a message are replaced by single spaces (for example "a_production" becomes "a production"). Judicious choice of rule names can therefore considerably improve the readability of automatic error messages (as well as the maintainability of the original grammar).

If the automatically generated error is not sufficient, it is possible to provide an explicit message as part of the error directive. For example:

    Spock: "Fascinating" ',' (name | 'Captain') '.'
         | "Highly illogical, doctor."
         | <error: He never said that!>

which would result in all failures to parse a "Spock" subrule printing the following message:

    ERROR (line <N>): Invalid Spock:  He never said that!

The error message is treated as a "qq{...}" string and interpolated when the error is generated (not when the directive is specified!). Hence:

    <error: Mystical error near "$text">

would correctly insert the ambient text string which caused the error.

There are two other forms of error directive: "<error?>" and "<error?: msg>". These behave just like "<error>" and "<error: msg>" respectively, except that they are only triggered if the rule is "committed" at the time they are encountered. For example:

    Scotty: "Ya kenna change the Laws o' Phusics," <commit> name
          | name <commit> ',' "she's goanta blaw!"
          | <error?>

will only generate an error for a string beginning with "Ya kenna change the Laws o' Phusics," or a valid name, but which still fails to match the corresponding production. That is, "$parser->Scotty("Aye, Cap'ain")" will fail silently (since neither production will "commit" the rule on that input), whereas "$parser->Scotty("Mr Spock, ah jest kenna do'ut!")" will fail with the error message:

    ERROR (line 1): Invalid Scotty: expected 'she's goanta blaw!'
                    but found 'ah jest kenna do'ut!' instead.

since in that case the second production would commit after matching the leading name.
Note that to allow this behaviour, all "<error>" directives which are the first item in a production automatically uncommit the rule just long enough to allow their production to be attempted (that is, when their production fails, the commitment is reinstated so that subsequent productions are skipped).

In order to permanently uncommit the rule before an error message, it is necessary to put an explicit "<uncommit>" before the "<error>". For example:

    line: 'Kirk:'  <commit> Kirk
        | 'Spock:' <commit> Spock
        | 'McCoy:' <commit> McCoy
        | <uncommit> <error?> <reject>
        | <resync>

Error messages generated by the various "<error...>" directives are not displayed immediately. Instead, they are "queued" in a buffer and are only displayed once parsing ultimately fails. Moreover, "<error...>" directives that cause one production of a rule to fail are automatically removed from the message queue if another production subsequently causes the entire rule to succeed. This means that you can put "<error...>" directives wherever useful diagnosis can be done, and only those associated with actual parser failure will ever be displayed. Also see "GOTCHAS".

As a general rule, the most useful diagnostics are usually generated either at the very lowest level within the grammar, or at the very highest. A good rule of thumb is to identify those subrules which consist mainly (or entirely) of terminals, and then put an "<error...>" directive at the end of any other rule which calls one or more of those subrules.

There is one other situation in which the output of the various types of error directive is suppressed; namely, when the rule containing them is being parsed as part of a "look-ahead" (see "Look-ahead"). In this case, the error directive will still cause the rule to fail, but will do so silently.

An unconditional "<error>" directive always fails (and hence has no associated value). This means that encountering such a directive always causes the production containing it to fail.
Hence an "<error>" directive will inevitably be the last (useful) item of a rule (a level 3 warning is issued if a production contains items after an unconditional "<error>" directive).

An "<error?>" directive will succeed (that is: fail to fail :-), if the current rule is uncommitted when the directive is encountered. In that case the directive's associated value is zero. Hence, this type of error directive can be used before the end of a production. For example:

    command: 'do' <commit> something
           | 'report' <commit> something
           | <error?: Syntax error> <error: Unknown command>

Warning: The "<error?>" directive does not mean "always fail (but do so silently unless committed)". It actually means "only fail (and report) if committed, otherwise succeed". To achieve the "fail silently if uncommitted" semantics, it is necessary to use:

    rule: item <commit> item(s)
        | <error?> <reject>      # FAIL SILENTLY UNLESS COMMITTED

However, because people seem to expect a lone "<error?>" directive to work like this:

    rule: item <commit> item(s)
        | <error?: Error message if committed>
        | <error:  Error message if uncommitted>

Parse::RecDescent automatically appends a "<reject>" directive if the "<error?>" directive is the only item in a production. A level 2 warning (see below) is issued when this happens.

The level of error reporting during both parser construction and parsing is controlled by the presence or absence of four global variables: $::RD_ERRORS, $::RD_WARN, $::RD_HINT, and $::RD_TRACE. If $::RD_ERRORS is defined (and, by default, it is) then fatal errors are reported.

Whenever $::RD_WARN is defined, certain non-fatal problems are also reported.

Warnings have an associated "level": 1, 2, or 3. The higher the level, the more serious the warning. The value of the corresponding global variable ($::RD_WARN) determines the lowest level of warning to be displayed. Hence, to see all warnings, set $::RD_WARN to 1. To see only the most serious warnings set $::RD_WARN to 3.
By default $::RD_WARN is initialized to 3, ensuring that serious but non-fatal errors are automatically reported.

There is also a grammar directive to turn on warnings from within the grammar: "<warn>". It takes an optional argument, which specifies the warning level: "<warn: 2>".

See "DIAGNOSTICS" for a list of the various error and warning messages that Parse::RecDescent generates when these two variables are defined.

Defining any of the remaining variables (which are not defined by default) further increases the amount of information reported. Defining $::RD_HINT causes the parser generator to offer more detailed analyses and hints on both errors and warnings. Note that setting $::RD_HINT at any point automagically sets $::RD_WARN to 1. There is also a "<hint>" directive, which can be hard-coded into a grammar.

Defining $::RD_TRACE causes the parser generator and the parser to report their progress to STDERR in excruciating detail (although, without hints unless $::RD_HINT is separately defined). This detail can be moderated in only one respect: if $::RD_TRACE has an integer value (N) greater than 1, only N characters of the "current parsing context" (that is, where in the input string we are at any point in the parse) are reported at any time.

$::RD_TRACE is mainly useful for debugging a grammar that isn't behaving as you expected it to. To this end, if $::RD_TRACE is defined when a parser is built, any actual parser code which is generated is also written to a file named "RD_TRACE" in the local directory.

There are two directives associated with the $::RD_TRACE variable. If a grammar contains a "<trace_build>" directive anywhere in its specification, $::RD_TRACE is turned on during the parser construction phase. If a grammar contains a "<trace_parse>" directive anywhere in its specification, $::RD_TRACE is turned on during any parse the parser performs.
Note that the four variables belong to the "main" package, which makes them easier to refer to in the code controlling the parser, and also makes it easy to turn them into command line flags ("-RD_ERRORS", "-RD_WARN", "-RD_HINT", "-RD_TRACE") under perl -s. The corresponding directives are useful to "hardwire" the various debugging features into a particular grammar (rather than having to set and reset external variables).
If a quote-like expression is not found, the directive fails with the usual "undef" value. The "<perl_variable>" directive can be used to parse any Perl variable: $scalar, @array, %hash, $ref->{field}[$index], etc. It does this by calling Text::Balanced::extract_variable(). If the directive matches text representing a valid Perl variable specification, it returns that text. Otherwise it fails with the usual "undef" value. The "<perl_codeblock>" directive can be used to parse a curly-brace-delimited block of Perl code, such as: { $a = 1; f() =~ m/pat/; }. It does this by calling Text::Balanced::extract_codeblock(). If the directive matches text representing a valid Perl code block, it returns that text. Otherwise it fails with the usual "undef" value. You can also tell it what kind of brackets to use as the outermost delimiters. For example: arglist: <perl_codeblock ()> causes an arglist to match a perl code block whose outermost delimiters are "(...)" (rather than the default "{...}").
Subrule argument lists

It is occasionally useful to pass data to a subrule which is being invoked. For example, consider the following grammar fragment: classdecl: keyword decl keyword: 'struct' | 'class'; decl: # WHATEVER The "decl" rule might wish to know which of the two keywords was used (since it may affect some aspect of the way the subsequent declaration is interpreted). "Parse::RecDescent" allows the grammar designer to pass data into a rule, by placing that data in an argument list (that is, in square brackets) immediately after any subrule item in a production. Hence, we could pass the keyword to "decl" as follows: classdecl: keyword decl[ $item[1] ] keyword: 'struct' | 'class'; decl: # WHATEVER The argument list can consist of any number (including zero!) of comma-separated Perl expressions. In other words, it looks exactly like a Perl anonymous array reference. For example, we could pass the keyword, the name of the surrounding rule, and the literal 'keyword' to "decl" like so: classdecl: keyword decl[$item[1],$item[0],'keyword'] keyword: 'struct' | 'class'; decl: # WHATEVER Within the rule to which the data is passed ("decl" in the above examples) that data is available as the elements of a local variable @arg. Hence "decl" might report its intentions as follows: classdecl: keyword decl[$item[1],$item[0],'keyword'] keyword: 'struct' | 'class'; decl: { print "Declaring $arg[0] (a $arg[2])\n"; print "(this rule called by $arg[1])" } Subrule argument lists can also be interpreted as hashes, simply by using the local variable %arg instead of @arg. Hence we could rewrite the previous example: classdecl: keyword decl[keyword => $item[1], caller => $item[0], type => 'keyword'] keyword: 'struct' | 'class'; decl: { print "Declaring $arg{keyword} (a $arg{type})\n"; print "(this rule called by $arg{caller})" } Both @arg and %arg are always available, so the grammar designer may choose whichever convention (or combination of conventions) suits best.
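The following is a small, self-contained sketch of the %arg convention described above. The body of "decl" (a single /\w+/ token) and the input string are assumptions made for illustration, since the text leaves "decl" unspecified.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = q{
    # Pass the matched keyword and the calling rule's name to decl.
    classdecl: keyword decl[ keyword => $item[1], caller => $item[0] ]
               { $return = $item[2] }

    keyword:   'struct' | 'class'

    # Hypothetical decl body: one identifier. %arg holds the named arguments.
    decl:      /\w+/
               { $return = "$arg{keyword} $item[1] (via $arg{caller})" }
};

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar!\n";

my $result = $parser->classdecl("class Widget");
print defined $result ? "$result\n" : "no match\n";
```

Here "decl" sees which keyword preceded it without having to re-parse the text, which is the situation the argument list mechanism is meant for.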
Subrule argument lists are also useful for creating "rule templates" (especially when used in conjunction with the "<matchrule:...>" directive). For example, the subrule: list: <matchrule:$arg{rule}> /$arg{sep}/ list[%arg] { $return = [ $item[1], @{$item[3]} ] } | <matchrule:$arg{rule}> { $return = [ $item[1]] } is a handy template for the common problem of matching a separated list. For example: function: 'func' name '(' list[rule=>'param',sep=>';'] ')' param: list[rule=>'name',sep=>','] ':' typename name: /\w+/ typename: name When a subrule argument list is used with a repeated subrule, the argument list goes before the repetition specifier: list: /some|many/ thing[ $item[1] ](s) The argument list is "late bound". That is, it is re-evaluated for every repetition of the repeated subrule. This means that each repeated attempt to match the subrule may be passed a completely different set of arguments if the value of the expression in the argument list changes between attempts. So, for example, the grammar: { $::species = 'dogs' } pair: 'two' animal[$::species](s) animal: /$arg[0]/ { $::species = 'cats' } will match the string "two dogs cats cats" completely, whereas it will only match the string "two dogs dogs dogs" up to the eighth letter. If the value of the argument list were "early bound" (that is, evaluated only the first time a repeated subrule match is attempted), one would expect the matching behaviours to be reversed. Of course, it is possible to effectively "early bind" such argument lists by passing them a value which does not change on each repetition. For example: { $::species = 'dogs' } pair: 'two' { $::species } animal[$item[2]](s) animal: /$arg[0]/ { $::species = 'cats' } Arguments can also be passed to the start rule, simply by appending them to the argument list with which the start rule is called (after the "line number" parameter). 
For example, given: $parser = new Parse::RecDescent ( $grammar ); $parser->data($text, 1, "str", 2, \@arr); # ^^^^^ ^ ^^^^^^^^^^^^^^^ # | | | # TEXT TO BE PARSED | | # STARTING LINE NUMBER | # ELEMENTS OF @arg WHICH IS PASSED TO RULE data then within the productions of the rule "data", the array @arg will contain "("str", 2, \@arr)".

Alternations

Alternations are implicit (unnamed) rules defined as part of a production. An alternation is defined as a series of '|'-separated productions inside a pair of round brackets. For example: character: 'the' ( good | bad | ugly ) /dude/ Every alternation implicitly defines a new subrule, whose automatically-generated name indicates its origin: "_alternation_<I>_of_production_<P>_of_rule_<R>" for the appropriate values of <I>, <P>, and <R>. A call to this implicit subrule is then inserted in place of the brackets. Hence the above example is merely a convenient short-hand for: character: 'the' _alternation_1_of_production_1_of_rule_character /dude/ _alternation_1_of_production_1_of_rule_character: good | bad | ugly Since alternations are parsed by recursively calling the parser generator, any type(s) of item can appear in an alternation. For example: character: 'the' ( 'high' "plains" # Silent, with poncho | /no[- ]name/ # Silent, no poncho | vengeance_seeking # Poncho-optional | <error> ) drifter In this case, if an error occurred, the automatically generated message would be: ERROR (line <N>): Invalid implicit subrule: Expected 'high' or /no[- ]name/ or generic, but found "pacifist" instead Since every alternation actually has a name, it's even possible to extend or replace them: $parser->Replace( "_alternation_1_of_production_1_of_rule_character: 'generic Eastwood'" ); More importantly, since alternations are a form of subrule, they can be given repetition specifiers: character: 'the' ( good | bad | ugly )(?)
/dude/

Incremental Parsing

"Parse::RecDescent" provides two methods - "Extend" and "Replace" - which can be used to alter the grammar matched by a parser. Both methods take the same argument as "Parse::RecDescent::new", namely a grammar specification string. "Parse::RecDescent::Extend" interprets the grammar specification and adds any productions it finds to the end of the rules for which they are specified. For example: $add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/"; $parser->Extend($add); adds two productions to the rule "name" (creating it if necessary) and one production to the rule "desc". "Parse::RecDescent::Replace" is identical, except that it first resets each rule specified in the additional grammar, removing any existing productions. Hence after: $add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/"; $parser->Replace($add); there are only two valid "name"s and the one possible description. A more interesting use of the "Extend" and "Replace" methods is to call them inside the action of an executing parser. For example: typedef: 'typedef' type_name identifier ';' { $thisparser->Extend("type_name: '$item[3]'") } | <error> identifier: ...!type_name /[A-Za-z_]\w*/ which automatically prevents type names from being typedef'd, or: command: 'map' key_name 'to' abort_key { $thisparser->Replace("abort_key: '$item[2]'") } | 'map' key_name 'to' key_name { map_key($item[2],$item[4]) } | abort_key { exit if confirm("abort?") } abort_key: 'q' key_name: ...!abort_key /[A-Za-z]/ which allows the user to change the abort key binding, but not to unbind it. The careful use of such constructs makes it possible to reconfigure a running parser, eliminating the need for semantic feedback by providing syntactic feedback instead. However, as currently implemented, "Replace()" and "Extend()" have to regenerate and re-"eval" the entire parser whenever they are called. This makes them quite slow for large grammars.
In such cases, the judicious use of an interpolated regex is likely to be far more efficient: typedef: 'typedef' type_name identifier ';' { $thisparser->{local}{type_name} .= "|$item[3]" } | <error> identifier: ...!type_name /[A-Za-z_]\w*/ type_name: /$thisparser->{local}{type_name}/

Precompiling parsers

Normally Parse::RecDescent builds a parser from a grammar at run-time. That approach simplifies the design and implementation of parsing code, but has the disadvantage that it slows the parsing process down - you have to wait for Parse::RecDescent to build the parser every time the program runs. Long or complex grammars can be particularly slow to build, leading to unacceptable delays at start-up. To overcome this, the module provides a way of "pre-building" a parser object and saving it in a separate module. That module can then be used to create clones of the original parser. A grammar may be precompiled using the "Precompile" class method. For example, to precompile a grammar stored in the scalar $grammar, and produce a class named PreGrammar in a module file named PreGrammar.pm, you could use: use Parse::RecDescent; Parse::RecDescent->Precompile([$options_hashref], $grammar, "PreGrammar", ["RuntimeClass"]); The first required argument is the grammar string, the second is the name of the class to be built. The name of the module file is generated automatically by appending ".pm" to the last element of the class name. Thus Parse::RecDescent->Precompile($grammar, "My::New::Parser"); would produce a module file named Parser.pm. After the class name, you may specify the name of the runtime_class called by the Precompiled parser. See "Precompiled runtimes" for more details. An optional hash reference may be supplied as the first argument to "Precompile". This argument is currently EXPERIMENTAL, and may change in a future release of Parse::RecDescent. The only supported option is currently "-standalone", see "Standalone precompiled parsers".
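A minimal sketch of the whole round trip, precompiling a grammar and then loading the generated module, might look like the following. The grammar, the rule names and the choice to do both steps in a single script are assumptions made for illustration; in practice the precompilation step would normally live in a separate build script.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical grammar: a whitespace-separated list of integers.
my $grammar = q{
    list:   number(s) { $return = $item[1] }
    number: /\d+/
};

# Step 1: write PreGrammar.pm into the current directory.
Parse::RecDescent->Precompile($grammar, "PreGrammar");

# Step 2: load the generated module and build a parser from it,
# without handing the grammar text to Parse::RecDescent->new again.
require "./PreGrammar.pm";
my $parser = PreGrammar->new();

my $numbers = $parser->list("1 2 3");
print defined $numbers ? "@$numbers\n" : "no match\n";
```

The generated PreGrammar.pm still loads Parse::RecDescent at run-time unless the "-standalone" option is used.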
It is somewhat tedious to have to write a small Perl program just to generate a precompiled grammar class, so Parse::RecDescent has some special magic that allows you to do the job directly from the command-line. If your grammar is specified in a file named grammar, you can generate a class named Yet::Another::Grammar like so: > perl -MParse::RecDescent - grammar Yet::Another::Grammar [Runtime::Class] This would produce a file named Grammar.pm containing the full definition of a class called Yet::Another::Grammar. Of course, to use that class, you would need to put the Grammar.pm file in a directory named Yet/Another, somewhere in your Perl include path. Having created the new class, it's very easy to use it to build a parser. You simply "use" the new module, and then call its "new" method to create a parser object. For example: use Yet::Another::Grammar; my $parser = Yet::Another::Grammar->new(); The effect of these two lines is exactly the same as: use Parse::RecDescent; open GRAMMAR_FILE, "grammar" or die; local $/; my $grammar = <GRAMMAR_FILE>; my $parser = Parse::RecDescent->new($grammar); only considerably faster. Note however that the parsers produced by either approach are exactly the same, so whilst precompilation has an effect on set-up speed, it has no effect on parsing speed. RecDescent 2.0 will address that problem. Standalone precompiled parsers Until version 1.967003 of Parse::RecDescent, parser modules built with "Precompile" were dependent on Parse::RecDescent. Future Parse::RecDescent releases with different internal implementations would break pre-existing precompiled parsers. 
Version 1.967_005 added the ability for Parse::RecDescent to include itself in the resulting .pm file if you pass the boolean option "-standalone" to "Precompile": Parse::RecDescent->Precompile({ -standalone => 1, }, $grammar, "My::New::Parser"); Parse::RecDescent is included as $class::_Runtime in order to avoid conflicts between an installed version of Parse::RecDescent and other precompiled, standalone parser made with Parse::RecDescent. The name of this class may be changed with the "-runtime_class" option to Precompile. This renaming is experimental, and is subject to change in future versions. Precompiled parsers remain dependent on Parse::RecDescent by default, as this feature is still considered experimental. In the future, standalone parsers will become the default. Precompiled runtimes Standalone precompiled parsers each include a copy of Parse::RecDescent. For users who have a family of related precompiled parsers, this is very inefficient. "Precompile" now supports an experimental "-runtime_class" option. To build a precompiled parser with a different runtime name, call: Parse::RecDescent->Precompile({ -standalone => 1, -runtime_class => "My::Runtime", }, $grammar, "My::New::Parser"); The resulting standalone parser will contain a copy of Parse::RecDescent, renamed to "My::Runtime". To build a set of parsers that "use" a custom-named runtime, without including that runtime in the output, simply build those parsers with "-runtime_class" and without "-standalone": Parse::RecDescent->Precompile({ -runtime_class => "My::Runtime", }, $grammar, "My::New::Parser"); The runtime itself must be generated as well, so that it may be "use"d by My::New::Parser. 
To generate the runtime file, use one of the two following calls: Parse::RecDescent->PrecompiledRuntime("My::Runtime"); Parse::RecDescent->Precompile({ -standalone => 1, -runtime_class => "My::Runtime", }, '', # empty grammar "My::Runtime");

GOTCHAS

This section describes common mistakes that grammar writers seem to make on a regular basis.

1. Expecting an error to always invalidate a parse

A common mistake when using error messages is to write the grammar like this: file: line(s) line: line_type_1 | line_type_2 | line_type_3 | <error> The expectation seems to be that any line that is not of type 1, 2 or 3 will invoke the "<error>" directive and thereby cause the parse to fail. Unfortunately, that only happens if the error occurs in the very first line. The first rule states that a "file" is matched by one or more lines, so if even a single line succeeds, the first rule is completely satisfied and the parse as a whole succeeds. That means that any error messages generated by subsequent failures in the "line" rule are quietly ignored. Typically what's really needed is this: file: line(s) eofile { $return = $item[1] } line: line_type_1 | line_type_2 | line_type_3 | <error> eofile: /^\Z/ The addition of the "eofile" subrule to the first production means that a file only matches a series of successful "line" matches that consume the complete input text. If any input text remains after the lines are matched, there must have been an error in the last "line". In that case the "eofile" rule will fail, causing the entire "file" rule to fail too. Note too that "eofile" must match "/^\Z/" (end-of-text), not "/^\cZ/" or "/^\cD/" (end-of-file). And don't forget the action at the end of the production. If you just write: file: line(s) eofile then the value returned by the "file" rule will be the value of its last item: "eofile". Since "eofile" always returns an empty string on success, that will cause the "file" rule to return that empty string.
Apart from returning the wrong value, returning an empty string will trip up code such as: $parser->file($filetext) || die; (since "" is false). Remember that Parse::RecDescent returns undef on failure, so the only safe test for failure is: defined($parser->file($filetext)) || die;

2. Using a "return" in an action

An action is like a "do" block inside the subroutine implementing the surrounding rule. So if you put a "return" statement in an action: range: '(' start '..' end ')' { return $item{end} } /\s+/ that subroutine will immediately return, without checking the rest of the items in the current production (e.g. the "/\s+/") and without setting up the necessary data structures to tell the parser that the rule has succeeded. The correct way to set a return value in an action is to set the $return variable: range: '(' start '..' end ')' { $return = $item{end} } /\s+/

3. Setting $Parse::RecDescent::skip at parse time

If you want to change the default skipping behaviour (see "Terminal Separators" and the "<skip:...>" directive) by setting $Parse::RecDescent::skip you have to remember to set this variable before creating the grammar object. For example, you might want to skip all Perl-like comments with this regular expression: my $skip_spaces_and_comments = qr/ (?mxs: \s+ # either spaces | \# .*?$ # or a hash sign and whatever up to the end of line )* # repeated at will (in whatever order) /; And then: my $parser1 = Parse::RecDescent->new($grammar); $Parse::RecDescent::skip = $skip_spaces_and_comments; my $parser2 = Parse::RecDescent->new($grammar); $parser1->parse($text); # this does not cope with comments $parser2->parse($text); # this skips comments correctly The two parsers behave differently, because any skipping behaviour specified via $Parse::RecDescent::skip is hard-coded when the grammar object is built, not at parse time.
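The first gotcha can be illustrated with a short sketch. The "line" rule here is a stand-in that accepts a single integer, and "bare_file" is a hypothetical rule added only to show what happens when the final action is omitted.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = q{
    # Recommended form: insist on end-of-text and return the list of lines.
    file:      line(s) eofile { $return = $item[1] }

    # Same production without the action: the rule's value is that of eofile.
    bare_file: line(s) eofile

    # Stand-in for the real line types.
    line:      /\d+/ | <error>

    eofile:    /^\Z/
};

my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar!\n";

my $lines  = $parser->file("10 20 30");       # reference to the matched lines
my $bare   = $parser->bare_file("10 20 30");  # defined, but an empty string
my $failed = $parser->file("10 20 oops");     # undef; the error goes to STDERR

print "file matched ", scalar(@$lines), " lines\n" if defined $lines;
print "bare_file succeeded but returned a false value\n"
    if defined $bare && !$bare;
print "trailing junk was rejected\n" unless defined $failed;
```

Testing with "defined" rather than plain truth is what lets the caller tell the successful "bare_file" parse apart from the failed one.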
DIAGNOSTICS

Diagnostics are intended to be self-explanatory (particularly if you use -RD_HINT (under perl -s) or define $::RD_HINT inside the program). "Parse::RecDescent" currently diagnoses the following:
AUTHOR

Damian Conway (damian@conway.org)
Jeremy T. Braun (JTBRAUN@CPAN.org) [current maintainer]

BUGS AND IRRITATIONS

There are undoubtedly serious bugs lurking somewhere in this much code :-) Bug reports, test cases and other feedback are most welcome. Ongoing annoyances include:
ON-GOING ISSUES AND FUTURE DIRECTIONS
SUPPORT

Source Code Repository

<http://github.com/jtbraun/Parse-RecDescent>

Mailing List

Visit <http://www.perlfoundation.org/perl5/index.cgi?parse_recdescent> to sign up for the mailing list. <http://www.PerlMonks.org> is also a good place to ask questions. Previous posts about Parse::RecDescent can typically be found with this search: <http://perlmonks.org/index.pl?node=recdescent>.

FAQ

Visit Parse::RecDescent::FAQ for answers to frequently (and not so frequently) asked questions about Parse::RecDescent.

View/Report Bugs

To view the current bug list or report a new issue visit <https://rt.cpan.org/Public/Dist/Display.html?Name=Parse-RecDescent>.

SEE ALSO

Regexp::Grammars provides Parse::RecDescent style parsing using native Perl 5.10 regular expressions.

LICENCE AND COPYRIGHT

Copyright (c) 1997-2007, Damian Conway "<DCONWAY@CPAN.org>". All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.