Parse::Token(3)    User Contributed Perl Documentation    Parse::Token(3)
"Parse::Token" - Definition of tokens used by "Parse::Lex"
    require 5.005;

    use Parse::Lex;
    @token = qw(
        ADDOP    [-+]
        INTEGER  [1-9][0-9]*
    );

    $lexer = Parse::Lex->new(@token);
    $lexer->from(\*DATA);

    $content = $INTEGER->next;
    if ($INTEGER->status) {
        print "$content\n";
    }
    $content = $ADDOP->next;
    if ($ADDOP->status) {
        print "$content\n";
    }
    if ($INTEGER->isnext(\$content)) {
        print "$content\n";
    }
    __END__
    1+2
The "Parse::Token" class and its derived
classes permit defining the tokens used by
"Parse::Lex" or
"Parse::LexEvent".
Tokens can be created with the "new()" or "factory()" methods. The
"new()" method of the "Parse::Lex" class also creates, indirectly,
instances of the tokens to be recognized.
The "next()" or
"isnext()" methods of the
"Parse::Token" package permit interfacing
the lexical analyzer with a syntactic analyzer of recursive descent type.
For interfacing with "byacc", see the
"Parse::YYLex" package.
"Parse::Token" is included
indirectly by means of "use Parse::Lex" or
"use Parse::LexEvent".
- action
- Returns the anonymous subroutine defined within the
"Parse::Token" object.
- factory LIST
- factory ARRAY_REF
- The "factory(LIST)" method creates a
list of tokens from a list of specifications, which include for each
token: a name, a regular expression, and possibly an anonymous subroutine.
The list can also include objects of class
"Parse::Token" or of a class derived
from it.
The "factory(ARRAY_REF)"
method permits creating tokens from specifications of type
attribute-value:
    Parse::Token->factory([Type  => 'Simple',
                           Name  => 'EXAMPLE',
                           Regex => '.+']);
"Type" indicates the type of
each token to be created (the package prefix is not indicated).
"factory()" creates a series
of tokens but does not import these tokens into the calling package.
You could for example write:
    %keywords =
      qw (
          PROC    undef
          FUNC    undef
          RETURN  undef
          IF      undef
          ELSE    undef
          WHILE   undef
          PRINT   undef
          READ    undef
      );

    @tokens = Parse::Token->factory(%keywords);
and install these tokens in a symbol table in the following
manner:
    foreach $token (@tokens) {
        $name = $token->name;    # pair each token with its own name
        ${$name} = $token;
        $symbol{"\L$name"} = [$token, ''];
    }
"${$name}" is the token
instance.
During the lexical analysis phase, you can use the tokens in
the following manner:
    qw(IDENT [a-zA-Z][a-zA-Z0-9_]*), sub {
        $symbol{$_[1]} = [] unless defined $symbol{$_[1]};
        my $type = $symbol{$_[1]}[0];
        $lexer->setToken((not defined $type) ? $VAR : $type);
        $_[1];    # THE TOKEN TEXT
    }
This permits indicating that any symbol of unknown type is a
variable.
In this example we have used $_[1]
which corresponds to the text recognized by the regular expression. This
text associated with the token must be returned by the anonymous
subroutine.
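The lookup performed by the anonymous subroutine can be tried in isolation. The following is only a sketch under stated assumptions: plain strings such as "TOKEN_IF" stand in for real token objects, and "classify()" is a hypothetical helper that is not part of "Parse::Lex".

```perl
use strict;
use warnings;

# Assumption: strings stand in for Parse::Token instances.
my %symbol = map { lc($_) => ["TOKEN_$_", ''] }
    qw(PROC FUNC RETURN IF ELSE WHILE PRINT READ);
my $VAR = 'TOKEN_VAR';

# Hypothetical helper mirroring the IDENT action: a known name keeps its
# token type, any other name is entered in the table and treated as a variable.
sub classify {
    my ($text) = @_;
    $symbol{$text} = [] unless defined $symbol{$text};
    my $type = $symbol{$text}[0];
    return defined $type ? $type : $VAR;
}

print classify($_), "\n" for qw(while counter);
```

Because the table is keyed by the lowercased name, the lookup is case-sensitive: in this sketch "IF" in upper case would be classified as a variable.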
- get EXPR
- "get" obtains the value of the attribute
named by the result of evaluating EXPR. You can also use the name of the
attribute as a method name.
- getText
- Returns the character string that was recognized by means of this
"Parse::Token" object.
Same as the text() method.
- isnext EXPR
- isnext
- Returns the status of the token. The consumed string is put into EXPR if
it is a reference to a scalar.
- name
- Returns the name of the token.
- next
- Activates the search for the lexeme defined by the regular expression
contained in the object. If this lexeme is recognized in the character
stream being analyzed, "next" returns the
string found and sets the status of the object to true.
- new SYMBOL_NAME, REGEXP, SUB
- new SYMBOL_NAME, REGEXP
- Creates an object of type
"Parse::Token::Simple" or
"Parse::Token::Segmented". The arguments
of the "new()" method are, respectively:
a symbolic name, a regular expression, and possibly an anonymous
subroutine. The subclasses of
"Parse::Token" permit specifying tokens
by means of a list of attribute-values.
REGEXP is either a simple regular expression, or a reference
to an array containing from one to three regular expressions. In the
first case, the instance belongs to the
"Parse::Token::Simple" class. In the
second case, the instance belongs to the
"Parse::Token::Segmented" class. The
tokens of this type permit recognizing structures of type character
string delimited by quotation marks, comments in a C program, etc. The
regular expressions are used to recognize:
1. The beginning of the lexeme,
2. The "body" of the lexeme; if this second
expression is missing, "Parse::Lex"
uses "(?:.*?)",
3. the end of the lexeme; if this last expression is missing
then the first one is used. (Note! The end of the lexeme cannot span
several lines).
Example:
    qw(STRING), [qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ")],
These regular expressions can recognize multi-line strings
delimited by quotation marks, where the backslash is used to quote the
quotation marks appearing within the string. Notice the quadrupling of
the backslash.
Here is a variation of the previous example which uses the
"s" option to include newline in the
characters recognized by ".":
    qw(STRING), [qw(" (?s:[^"\\\\]+|\\\\.)* ")],
(Note: it is possible to write regular expressions which are
more efficient in terms of execution time, but this is not our objective
with this example. See Mastering Regular Expressions.)
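How the three expressions fit together, including the two defaults described above, can be sketched in plain Perl. "match_segmented()" is a hypothetical helper written for this illustration; it does not show how "Parse::Lex" is implemented, and it matches against a single string rather than a stream.

```perl
use strict;
use warnings;

# Hypothetical helper: match start, body and end at the beginning of $text.
sub match_segmented {
    my ($text, $start, $body, $end) = @_;
    $body = '(?:.*?)' unless defined $body;    # default body named in the text
    $end  = $start    unless defined $end;     # default end: reuse the start
    return $text =~ /\A($start$body$end)/s ? $1 : undef;
}

# In single quotes each pair of backslashes yields one backslash, so the
# pattern seen by the regex engine is (?:[^"\\]+|\\.)*
my $body = '(?:[^"\\\\]+|\\\\.)*';

my $input = qq{"say \\"hi\\"\nagain" + 1};
print match_segmented($input, '"', $body, '"'), "\n";
```

With only a start expression, the same helper falls back on "(?:.*?)" for the body and reuses the start expression as the end.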
The anonymous subroutine is called when the lexeme is
recognized by the lexical analyzer. This subroutine takes two arguments:
$_[0] contains the token instance, and
$_[1] contains the string recognized by the
regular expression. The scalar returned by the anonymous subroutine
defines the character string memorized in the token instance.
In the anonymous subroutine you can use the positional
variables $1, $2, etc.
which correspond to the groups of parentheses in the regular
expression.
- regexp
- Returns the regular expression of the
"Token" object.
- set LIST
- Allows marking a token with a list of attribute-value pairs.
An attribute name can be used as a method name.
- setText EXPR
- The value of "EXPR" defines the
character string associated with the lexeme.
Same as the "text(EXPR)"
method.
- status EXPR
- status
- Indicates if the last search of the lexeme succeeded or failed.
"status EXPR" overrides the existing
value and sets it to the value of EXPR.
- text EXPR
- text
- "text()" returns the character string
recognized by means of the token. The value of
"EXPR" sets the character string
associated with the lexeme.
- trace OUTPUT
- trace
- Class method which activates/deactivates a trace of the lexical analysis.
"OUTPUT" can be a file name
or a reference to a filehandle to which the trace will be directed.
Subclasses of the "Parse::Token" class are
being defined. They permit recognizing specific structures such as, for
example, strings within double-quotes, C comments, etc. Here are the
subclasses which I am working on:
"Parse::Token::Simple" : tokens
of this class are defined by means of a single regular expression.
"Parse::Token::Segmented" :
tokens of this class are defined by means of three regular expressions.
Reading of new data is done automatically.
"Parse::Token::Delimited" :
permits recognizing, for example, C language comments.
"Parse::Token::Quoted" : permits
recognizing, for example, character strings within quotation marks.
"Parse::Token::Nested" : permits
recognizing nested structures such as parenthesized expressions. NOT
DEFINED.
These classes are recently created and no doubt contain some
bugs.
Tokens of the "Parse::Token::Action" class
permit inserting arbitrary Perl expressions within a lexical analyzer. An
expression can be used for instance to print out internal variables of the
analyzer:
- $LEX_BUFFER : contents of the buffer to be
analyzed
- $LEX_LENGTH : length of the character string being
analyzed
- $LEX_RECORD : number of the record being
analyzed
- $LEX_OFFSET : number of characters already
consumed since the start of the analysis.
- $LEX_POS : position reached by the analysis as a
number of characters since the start of the buffer.
The class constructor accepts the following attributes:
- "Name" : the name of the token
- "Expr" : a Perl expression
Example :
    $ACTION = Parse::Token::Action->new(
        Name => 'ACTION',
        Expr => q!print "LEX_POS: $LEX_POS\n" .
                  "LEX_BUFFER: $LEX_BUFFER\n" .
                  "LEX_LENGTH: $LEX_LENGTH\n" .
                  "LEX_RECORD: $LEX_RECORD\n" .
                  "LEX_OFFSET: $LEX_OFFSET\n"
                  ;!,
    );
The constructor of the "Parse::Token::Simple" class accepts the following attributes:
- "Handler" : the value indicates the name
of a function to call during an analysis performed by an analyzer of class
"Parse::LexEvent".
- "Name" : the associated value is the
name of the token.
- "Regex" : the associated value is a
regular expression corresponding to the pattern to be recognized.
- "ReadMore" : if the associated value is
1, the recognition of the token continues after reading a new record. The
strings recognized are concatenated. This attribute only has effect during
analysis of a character stream.
- "Sub" : the associated value must be an
anonymous subroutine to be executed after the token is recognized. This
function is only used with analyzers of class
"Parse::Lex" or
"Parse::CLex".
Example.
    Parse::Token::Simple->new(Name     => 'remainder',
                              Regex    => '[^/\'\"]+',
                              ReadMore => 1);
The definition of "Parse::Token::Segmented" tokens includes three regular
expressions. During analysis of a data stream, new data is read as long as
the end of the token has not been reached.
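The idea of reading new data until the end of the token is reached can be sketched as follows. This is an illustration only, assuming one record per line and a hypothetical helper "read_segmented()"; it says nothing about how "Parse::Lex" manages its buffer.

```perl
use strict;
use warnings;

# Hypothetical helper: keep appending records to the buffer until the end
# expression is found or the input is exhausted.
sub read_segmented {
    my ($fh, $start, $end) = @_;
    my $buffer = <$fh>;
    return unless defined $buffer && $buffer =~ /\A$start/;
    until ($buffer =~ /\A$start.*?$end/s) {
        my $more = <$fh>;
        last unless defined $more;    # input exhausted, token never closed
        $buffer .= $more;
    }
    return $buffer =~ /\A($start.*?$end)/s ? $1 : undef;
}

# An in-memory filehandle plays the role of the data stream.
my $source = "/* first line\n   second line */ int x;\n";
open my $fh, '<', \$source or die "cannot open in-memory file: $!";
print read_segmented($fh, '/[*]', '[*]/'), "\n";
```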
The class constructor accepts the following attributes:
- "Handler" : the value indicates the name
of a function to call during analysis performed by an analyzer of class
"Parse::LexEvent".
- "Name" : the associated value is the
name of the token.
- "Regex" : the associated value must be a
reference to an array that contains three regular expressions.
- "Sub" : the associated value must be an
anonymous subroutine to be executed after the token is recognized. This
function is only used with analyzers of class
"Parse::Lex" or
"Parse::CLex".
"Parse::Token::Quoted" is a subclass of
"Parse::Token::Segmented". It permits
recognizing character strings within double quotes or single quotes.
Examples.
    ---------------------------------------------------------
     Start     End       Escaping
    ---------------------------------------------------------
       '        '          ''
       "        "          ""
       "        "          \
    ---------------------------------------------------------
The class constructor accepts the following attributes:
- "End" : The associated value is a
regular expression permitting recognizing the end of the token.
- "Escape" : The associated value
indicates the character used to escape the delimiter. By default, a double
occurrence of the terminating character escapes that character.
- "Handler" : the value indicates the name
of a function to be called during an analysis performed by an analyzer of
class "Parse::LexEvent".
- "Name" : the associated value is the
name of the token.
- "Start" : the associated value is a
regular expression permitting recognizing the start of the token.
- "Sub" : the associated value must be an
anonymous subroutine to be executed after the token is recognized. This
function is only used with analyzers of class
"Parse::Lex" or
"Parse::CLex".
Example.
    Parse::Token::Quoted->new(Name    => 'squotes',
                              Handler => 'string',
                              Escape  => '\\',
                              Quote   => qq!\'!,
                             );
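The two escaping conventions from the table above (a doubled delimiter, or an explicit escape character) can be illustrated with plain regular expressions. The helpers "quoted_pattern()" and "first_quoted()" are hypothetical and written for this sketch; they are not the patterns that "Parse::Token::Quoted" builds.

```perl
use strict;
use warnings;

# Hypothetical helper: build a pattern for a quoted string. Without an
# escape character, a doubled quote stands for one literal quote.
sub quoted_pattern {
    my ($quote, $escape) = @_;
    my $q = quotemeta $quote;
    if (defined $escape) {
        my $e = quotemeta $escape;
        return qr/$q(?:[^$q$e]|$e.)*$q/s;
    }
    return qr/$q(?:[^$q]|$q$q)*$q/s;
}

# Hypothetical helper: return the first quoted string found in $text.
sub first_quoted {
    my ($text, $quote, $escape) = @_;
    my $re = quoted_pattern($quote, $escape);
    return $text =~ /($re)/ ? $1 : undef;
}

print first_quoted(q{x = 'it''s' ;}, q{'}), "\n";
print first_quoted(q{x = "a \" b" ;}, q{"}, '\\'), "\n";
```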
"Parse::Token::Delimited" is a subclass of
"Parse::Token::Segmented". It permits, for
example, recognizing C language comments.
Examples.
    ---------------------------------------------------------
     Start    End     Constraint
                      on the contents
    ---------------------------------------------------------
     /*       */                        C Comment
     <!--     -->     No '--'           XML Comment
     <!--     -->                       SGML Comment
     <?       ?>                        Processing instruction
                                        in SGML/XML
    ---------------------------------------------------------
The class constructor accepts the following attributes:
- "End" : The associated value is a
regular expression permitting recognizing the end of the token.
- "Handler" : the value indicates the name
of a function to be called during an analysis performed by an analyzer of
class "Parse::LexEvent".
- "Name" : the associated value is the
name of the token.
- "Start" : the associated value is a
regular expression permitting recognizing the start of the token.
- "Sub" : the associated value must be an
anonymous subroutine to be executed after the token is recognized. This
function is only used with analyzers of class
"Parse::Lex" or
"Parse::CLex".
Example.
    Parse::Token::Delimited->new(Name  => 'comment',
                                 Start => '/[*]',
                                 End   => '[*]/'
                                );
Examples of nested structures that the "Parse::Token::Nested" class (not
defined) is meant to recognize.
    ----------------------------------------------------------
     Start    End
    ----------------------------------------------------------
     (        )       Symbolic Expressions
     {        }       Rich Text Format Groups
    ----------------------------------------------------------
The implementation of subclasses of tokens is not complete for analyzers of the
"Parse::CLex" class. I am not too keen to do
it, since an implementation for classes
"Parse::Lex" and
"Parse::LexEvent" seems quite sufficient.
Philippe Verdret. Documentation translated to English by Vladimir Alexiev and
Ocrat.
Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat has
significantly contributed to improving this documentation. Thanks also to the
numerous persons who have made comments or sometimes sent bug fixes.
Friedl, J.E.F. Mastering Regular Expressions. O'Reilly & Associates 1996.
Mason, T. & Brown, D. Lex & Yacc. O'Reilly & Associates 1990.
Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This module is
free software; you can redistribute it and/or modify it under the same terms
as Perl itself.