"Parse::Lex" - Generator of lexical analyzers - moving pointer inside
text
SYNOPSIS

    require 5.005;

    use Parse::Lex;
    @token = (
        qw(
            ADDOP    [-+]
            LEFTP    [\(]
            RIGHTP   [\)]
            INTEGER  [1-9][0-9]*
            NEWLINE  \n
        ),
        qw(STRING), [qw(" (?:[^"]+|"")* ")],
        qw(ERROR .*), sub {
            die qq!can't analyze: "$_[1]"!;
        }
    );

    Parse::Lex->trace;  # Class method
    $lexer = Parse::Lex->new(@token);
    $lexer->from(\*DATA);

    print "Tokenization of DATA:\n";

    TOKEN: while (1) {
        $token = $lexer->next;
        if (not $lexer->eoi) {
            print "Line $.\t";
            print "Type: ", $token->name, "\t";
            print "Content:->", $token->text, "<-\n";
        } else {
            last TOKEN;
        }
    }

    __END__
    1+2-5
    "a multiline
    string with an embedded "" in it"
    an invalid string with a "" in it"
DESCRIPTION

The classes "Parse::Lex" and "Parse::CLex" create lexical analyzers. They use different analysis techniques:

1. "Parse::Lex" steps through the analysis by moving a pointer within the character strings to be analyzed (use of "pos()" together with "\G"),

2. "Parse::CLex" steps through the analysis by consuming the data recognized (use of "s///").

Analyzers of the "Parse::CLex" class do not allow the use of anchoring in regular expressions. In addition, the subclasses of "Parse::Token" are not implemented for this type of analyzer.
A lexical analyzer is specified by means of a list of tokens
passed as arguments to the "new()" method.
Tokens are instances of the "Parse::Token"
class, which comes with "Parse::Lex". The
definition of a token usually comprises two arguments: a symbolic name (like
"INTEGER"), followed by a regular
expression. If a sub ref (anonymous subroutine) is given as third argument,
it is called when the token is recognized. Its arguments are the
"Parse::Token" instance and the string
recognized by the regular expression. The anonymous subroutine's return
value is used as the new string contents of the
"Parse::Token" instance.
The order in which the lexical analyzer examines the regular
expressions is determined by the order in which these expressions are passed
as arguments to the "new()" method. The
token returned by the lexical analyzer corresponds to the first regular
expression which matches (this strategy is different from that used by Lex,
which returns the longest match possible out of all that can be
recognized).
The lexical analyzer can recognize tokens which span multiple
records. If the definition of the token comprises more than one regular
expression (placed within a reference to an anonymous array), the analyzer
reads as many records as required to recognize the token (see the
documentation for the "Parse::Token"
class). When the start pattern is found, the analyzer looks for the end, and
if necessary, reads more records. No backtracking is done in case of
failure.
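The STRING token from the synopsis is an example of this: its definition is a reference to an anonymous array of start, middle, and end patterns (a sketch repeating just that token):

    use Parse::Lex;

    # A string literal may span several records: the analyzer finds
    # the opening ", then reads further records until the closing
    # " has been matched
    my $lexer = Parse::Lex->new(
        'STRING' => [qw(" (?:[^"]+|"")* ")],
    );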
The analyzer can be used to analyze an isolated character string
or a stream of data coming from a file handle. At the end of the input data
the analyzer returns a "Parse::Token"
instance named "EOI" (End Of Input).
Start conditions

You can associate start conditions with the token-recognition rules that comprise your lexical analyzer (this is similar to what Flex provides). When start conditions are used, the rule which succeeds is no longer necessarily the first rule that matches.

A token symbol may be preceded by a start condition specifier for the associated recognition rule. For example:

    qw(C1:TERMINAL_1 REGEXP), sub { ... },  # associated action
    qw(TERMINAL_2    REGEXP), sub { ... },  # associated action
Symbol "TERMINAL_1" will be
recognized only if start condition "C1" is
active. Start conditions are activated/deactivated using the
"start(CONDITION_NAME)" and
"end(CONDITION_NAME)" methods.
"start('INITIAL')" resets the
analysis automaton.
Start conditions can be combined using AND/OR operators as follows:

    C1:SYMBOL       condition C1
    C1:C2:SYMBOL    condition C1 AND condition C2
    C1,C2:SYMBOL    condition C1 OR condition C2
There are two types of start conditions: inclusive and
exclusive, which are declared by class methods
"inclusive()" and
"exclusive()" respectively. With an
inclusive start condition, all rules are active regardless of whether or not
they are qualified with the start condition. With an exclusive start
condition, only the rules qualified with the start condition are active; all
other rules are deactivated.
Example (borrowed from the documentation of Flex):

    use Parse::Lex;
    @token = (
        'EXPECT', 'expect-floats', sub {
            $lexer->start('expect');
            $_[1]
        },
        'expect:FLOAT', '\d+\.\d+', sub {
            print "found a float: $_[1]\n";
            $_[1]
        },
        'expect:NEWLINE', '\n', sub {
            $lexer->end('expect');
            $_[1]
        },
        'NEWLINE2', '\n',
        'INT', '\d+', sub {
            print "found an integer: $_[1]\n";
            $_[1]
        },
        'DOT', '\.', sub {
            print "found a dot\n";
            $_[1]
        },
    );
    Parse::Lex->exclusive('expect');
    $lexer = Parse::Lex->new(@token);
The special start condition "ALL" is always satisfied.

Methods
- analyze EXPR
    Analyzes "EXPR" and returns a list of pairs consisting of a token name
    followed by recognized text. "EXPR" can be a character string or a
    reference to a filehandle.

    Examples:

        @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze("3+3+3");
        @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze(\*STREAM);
- buffer EXPR
- buffer
    Returns the contents of the internal buffer of the lexical analyzer.
    With an expression as argument, places the result of the expression in
    the buffer.

    It is not advisable to change the contents of the buffer directly
    without also adjusting the position of the analysis pointer ("pos()")
    and the length of the buffer ("length()").
- configure(HASH)
    Instance method for specifying a lexical analyzer. This method accepts
    a list of the following attribute-value pairs:

    From => EXPR
        This attribute plays the same role as the "from(EXPR)" method.
        "EXPR" can be a filehandle or a character string.

    Tokens => ARRAY_REF
        "ARRAY_REF" must contain the list of attribute values specifying
        the tokens to be recognized (see the documentation for
        "Parse::Token").

    Skip => REGEX
        This attribute plays the same role as the "skip(REGEX)" method.
        "REGEX" describes the patterns to skip over during the analysis.
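    Taken together, a call might look like this (a sketch; the INTEGER
    token and the patterns are illustrative):

        $lexer = Parse::Lex->new();
        $lexer->configure(
            From   => \*DATA,
            Tokens => [ 'INTEGER', '\d+' ],
            Skip   => '\s+',
        );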
- end EXPR
    Deactivates condition "EXPR".

- eoi
    Returns TRUE when there is no more data to analyze.
- every SUB
    Avoids having to write a reading loop in order to analyze a stream of
    data. "SUB" is an anonymous subroutine executed after the recognition
    of each token. For example, to lex the string "1+2" you can write:

        use Parse::Lex;

        $lexer = Parse::Lex->new(
            qw(
                ADDOP   [-+]
                INTEGER \d+
            ));
        $lexer->from("1+2");
        $lexer->every(sub {
            print $_[0]->name, "\t";
            print $_[0]->text, "\n";
        });

    The first argument of the anonymous subroutine is the "Parse::Token"
    instance recognized.
- exclusive LIST
    Class method declaring the conditions present in LIST to be exclusive.

- flush
    If saving of the consumed strings is activated, "flush()" returns and
    clears the buffer containing the character strings recognized up to
    now. This is only useful if "hold()" has been called to activate
    saving of consumed strings.
- from EXPR
- from
    "from(EXPR)" specifies the source of the data to be analyzed. The
    argument of this method can be a string (or list of strings), or a
    reference to a filehandle. If no argument is given, "from()" returns
    the filehandle if defined, or "undef" if input is a string. When an
    argument "EXPR" is used, the return value is the calling lexer object
    itself.

    By default it is assumed that data are read from "STDIN".

    Examples:

        use IO::File;
        $handle = IO::File->new("< filename");
        $lexer->from($handle);

        $lexer->from(\*DATA);
        $lexer->from('the data to be analyzed');
- getSub
    "getSub" returns the anonymous subroutine that performs the lexical
    analysis.

    Example:

        my $token = '';
        my $sub = $lexer->getSub;
        while (($token = &$sub()) ne $Token::EOI) {
            print $token->name, "\t";
            print $token->text, "\n";
        }

        # or

        my $token = '';
        local *tokenizer = $lexer->getSub;
        while (($token = tokenizer()) ne $Token::EOI) {
            print $token->name, "\t";
            print $token->text, "\n";
        }
- getToken
    Same as the "token()" method.

- hold EXPR
- hold
    Activates/deactivates saving of the consumed strings. The return value
    is the current setting (TRUE or FALSE). Can be used as a class method.

    You can obtain the contents of the buffer using the "flush" method,
    which also empties the buffer.
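    For example (a sketch; it assumes that "hold(1)" turns saving on, per
    the description above):

        $lexer->hold(1);               # start saving consumed strings
        $lexer->next for 1 .. 3;       # consume a few tokens
        my $consumed = $lexer->flush;  # fetch and clear the saved text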
- inclusive LIST
    Class method declaring the conditions present in LIST to be inclusive.

- length EXPR
- length
    Returns the length of the current record. "length EXPR" sets the
    length of the current record.
- line EXPR
- line
    Returns the line number of the current record. "line EXPR" sets the
    value of the line number. Always returns 1 if a character string is
    being analyzed. The "readline()" method increments the line number.

- name EXPR
- name
    "name EXPR" lets you give a name to the lexical analyzer. "name()"
    returns the value of this name.
- next
    Searches for the next token. Returns the recognized "Parse::Token"
    instance, or the "Token::EOI" instance at the end of the data.

    Examples:

        $lexer = Parse::Lex->new(@token);
        print $lexer->next->name;  # print the token type
        print $lexer->next->text;  # print the token content
- nextis SCALAR_REF
    Variant of the "next()" method. Tokens are placed in "SCALAR_REF". The
    method returns 1 as long as the token is not "EOI".

    Example:

        while ($lexer->nextis(\$token)) {
            print $token->text();
        }

- new LIST
    Creates and returns a new lexical analyzer. The argument of the method
    is a list of "Parse::Token" instances, or a list of triplets
    permitting their creation. The triplets consist of: the symbolic name
    of the token, the regular expression necessary for its recognition,
    and optionally an anonymous subroutine that is called when the token
    is recognized. For each triplet, an instance of type "Parse::Token" is
    created in the calling package.
- offset
    Returns the number of characters already consumed since the beginning
    of the analyzed data stream.

- pos EXPR
- pos
    "pos EXPR" sets the position of the beginning of the next token to be
    recognized in the current line (this doesn't work with analyzers of
    the "Parse::CLex" class). "pos()" returns the number of characters
    already consumed in the current line.
- readline
    Reads data from the input specified by the "from()" method. Returns
    the result of the reading.

    Example:

        use Parse::Lex;

        $lexer = Parse::Lex->new();
        while (not $lexer->eoi) {
            print $lexer->readline();  # read and print one line
        }

- reset
    Clears the internal buffer of the lexical analyzer and erases all
    tokens already recognized.

- restart
    Reinitializes the analysis automaton. The only active condition
    becomes the condition "INITIAL".
- setToken TOKEN
    Sets the token to "TOKEN". Useful to requalify a token inside the
    anonymous subroutine associated with this token.
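    For instance (a sketch; it assumes an IDENT and a KEYWORD token are
    both defined in the token list, so that $KEYWORD exists in the calling
    package as described under "new LIST" above):

        qw(IDENT [A-Za-z_]\w*), sub {
            my ($token, $text) = @_;
            # requalify the bare identifier as a keyword
            $token->lexer->setToken($KEYWORD) if $text eq 'while';
            $text;
        },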
- skip EXPR
- skip
    "EXPR" is a regular expression defining the token separator pattern
    (by default "[ \t]+"). "skip('')" sets this to no pattern. With no
    argument, "skip()" returns the value of the pattern. "skip()" can be
    used as a class method.

    Changing the skip pattern causes recompilation of the lexical
    analyzer.

    Example:

        Parse::Lex->skip('\s*#(?s:.*)|\s+');
        @tokens = Parse::Lex->new('INTEGER' => '\d+')->analyze(\*DATA);
        print "@tokens\n"; # prints: INTEGER 1 INTEGER 2 INTEGER 3 INTEGER 4 EOI
        __END__
        1 # first string to skip
        2
        3# second string to skip
        4
- start EXPR
    Activates condition EXPR.

- state EXPR
    Returns the state of the condition represented by EXPR.

- token
    Returns the instance corresponding to the last recognized token. In
    case no token was recognized, returns the special token named
    "DEFAULT".
- tokenClass EXPR
- tokenClass
    Sets the class of the tokens to be created from the list passed as
    argument to the "new()" method. If no argument is given, returns the
    name of the class. By default the class is "Parse::Token".
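    For instance (a sketch; "MyToken" stands for a hypothetical subclass
    of "Parse::Token", and the class-method usage is an assumption):

        Parse::Lex->tokenClass('MyToken');
        print Parse::Lex->tokenClass, "\n";  # prints "MyToken"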
- trace OUTPUT
- trace
    Class method which activates trace mode. The activation of trace mode
    must take place before the creation of the lexical analyzer. The mode
    can then be deactivated by another call of this method. "OUTPUT" can
    be a file name or a reference to a filehandle where the trace will be
    redirected.
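    For example (a sketch; "trace.log" is an arbitrary file name):

        Parse::Lex->trace('trace.log');    # activate tracing before new()
        $lexer = Parse::Lex->new(@token);  # analyzer is created with tracing on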
ERROR HANDLING

To handle the cases of token non-recognition, you can define a specific token at the end of the list of tokens that comprise your lexical analyzer. If searching for this token succeeds, it is then possible to call an error handling function:

    qw(ERROR (?s:.*)), sub {
        print STDERR "ERROR: buffer content->", $_[0]->lexer->buffer, "<-\n";
        die qq!can't analyze: "$_[1]"!;
    }
EXAMPLES

ctokenizer.pl - Scan a stream of data using the "Parse::CLex" class.

tokenizer.pl - Scan a stream of data using the "Parse::Lex" class.

every.pl - Use of the "every" method.

sexp.pl - Interpreter for prefix arithmetic expressions.

sexpcond.pl - Interpreter for prefix arithmetic expressions, using conditions.
BUGS

Analyzers of the "Parse::CLex" class do not allow the use of regular expressions with anchoring.
"Parse::Token",
"Parse::LexEvent",
"Parse::YYLex".
AUTHOR

Philippe Verdret. Documentation translated to English by Vladimir Alexiev and Ocrat.
ACKNOWLEDGMENTS

Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat has significantly contributed to improving this documentation. Thanks also to the numerous people who have sent me bug reports and occasionally fixes.
REFERENCES

Friedl, J. E. F. Mastering Regular Expressions. O'Reilly & Associates, 1996.

Mason, T. & Brown, D. Lex & Yacc. O'Reilly & Associates, 1990.

FLEX - a scanner generator (available at ftp://ftp.ee.lbl.gov/ and elsewhere).
COPYRIGHT

Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.