Text::Shellwords::Cursor(3)    User Contributed Perl Documentation    Text::Shellwords::Cursor(3)
NAME
Text::Shellwords::Cursor - Parse a string into tokens
SYNOPSIS
use Text::Shellwords::Cursor;
my $parser = Text::Shellwords::Cursor->new();
my $str = 'ab cdef "ghi" j"k\"l "';
my ($tok1) = $parser->parse_line($str);
# $tok1 = ['ab', 'cdef', 'ghi', 'j', 'k"l ']
my ($tok2, $tokno, $tokoff) = $parser->parse_line($str, cursorpos => 6);
# as above, but $tokno=1, $tokoff=3 (under the 'f')
DESCRIPTION
This module is very similar to Text::Shellwords and
Text::ParseWords. However, it has one very significant difference: it keeps
track of a character position in the line it's parsing. For instance, if you
pass it ("zq fmgb", cursorpos=>6), it would return (['zq',
'fmgb'], 1, 3). The cursorpos parameter tells where in the input string the
cursor resides (just before the 'b'), and the result tells you that the
cursor was on token 1 ('fmgb'), character 3 ('b'). This is very useful when
computing command-line completions involving quoting, escaping, and
tokenizing characters (like '(' or '=').
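As a minimal sketch of how a completion routine might use this result, the snippet below extracts the part of the cursor's token that lies to the left of the cursor. The variable names ($line, $prefix) are illustrative only. Given the result documented above, $prefix should come out as 'fmg'.

```perl
use strict;
use warnings;
use Text::Shellwords::Cursor;

my $parser = Text::Shellwords::Cursor->new();
my $line   = 'zq fmgb';

# The cursor sits just before the 'b' (character position 6).
my ($tokens, $tokno, $tokoff) = $parser->parse_line($line, cursorpos => 6);

# The text a completer would try to extend: everything in the
# cursor's token that precedes the cursor.
my $prefix = '';
if ($tokens && defined $tokno) {
    $prefix = substr($tokens->[$tokno], 0, $tokoff);
    print "token $tokno, offset $tokoff, prefix '$prefix'\n";
}
```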
A few helper utilities are included as well. You can escape a
string to ensure that parsing it will produce the original string
(parse_escape). You can also reassemble the tokens with a visually pleasing
amount of whitespace between them (join_line).
This module started out as an integral part of Term::GDBUI using
code loosely based on Text::ParseWords. However, it is now basically a
ground-up reimplementation. It was split out of Term::GDBUI for version
0.8.
- new
- Creates a new parser. Accepts the following named arguments:
- keep_quotes
- Normally all unescaped, unnecessary quote marks are stripped. If you
specify "keep_quotes=>1", however,
they are preserved. This is useful if you need to know whether the string
was quoted or not (string constants) or what type of quotes was around it
(affecting variable interpolation, for instance).
- token_chars
- This argument specifies the characters that should be considered tokens
all by themselves. For instance, if I pass token_chars=>'=', then
'ab=123' would be parsed to ('ab', '=', '123'). Without token_chars,
'ab=123' remains a single string.
NOTE: you cannot change token_chars after the constructor has
been called! The regexps that use it are compiled once (m//o). Also,
until the GNU Readline library can accept "=[]," without
diving into an endless loop, we will not tell history expansion to use
token_chars (it uses " \t\n()<>;&|" by
default).
- debug
- Turns on rather copious debugging to try to show what the parser is
thinking at every step.
- space_none
- space_before
- space_after
- These variables affect how whitespace in the line is normalized when the
tokens are reassembled into a string. See the join_line routine.
- error
- This is a reference to a routine that should be called to display a parse
error. The routine takes two arguments: a reference to the parser, and the
error message to display as a string.
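The constructor arguments above can be combined as in the following sketch. The @errors array is a hypothetical collector introduced only for this example. The error callback follows the two-argument form described above (parser reference, then message). Per the token_chars description, 'ab=123' should split into 'ab', '=' and '123'.

```perl
use strict;
use warnings;
use Text::Shellwords::Cursor;

my @errors;   # hypothetical: collects parse errors instead of printing them

my $parser = Text::Shellwords::Cursor->new(
    token_chars => '=',          # '=' becomes a token all by itself
    error       => sub {
        my ($parser_ref, $msg) = @_;   # parser reference, error message
        push @errors, $msg;
    },
);

my ($tokens) = $parser->parse_line('ab=123');
print join(' | ', @$tokens), "\n" if $tokens;
```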
- parsebail(msg)
- If the parsel routine or any of its subroutines runs into a fatal error,
they call parsebail to present a very descriptive diagnostic.
- parsel
- This is the heinous routine that actually does the parsing. You should
never need to call it directly. Call parse_line instead.
- parse_line(line, named args)
- This is the entrypoint to this module's parsing functionality. It converts
a line into tokens, respecting quoted text, escaped characters, etc. It
also keeps track of a cursor position on the input text, returning the
token number and offset within the token where that position can be found
in the output.
This routine originally bore some resemblance to
Text::ParseWords. It has since changed almost completely to support
keeping track of the cursor position. It also has nicer failure modes,
modular quoting, and token characters (see token_chars in "new").
Arguments:
- line
- This is a string containing the command-line to parse.
This routine also accepts the following named parameters:
- cursorpos
- This is the character position in the line to keep track of. Pass undef
(by not specifying it) or the empty string to have the line processed with
cursorpos ignored.
Note that passing undef is not the same as passing some
random number and ignoring the result! For instance, if you pass 0 and
the line begins with whitespace, you'll get a 0-length token at the
beginning of the line to represent the cursor in the middle of the
whitespace. This allows command completion to work even when the cursor
is not near any tokens. If you pass undef, all whitespace at the
beginning and end of the line will be trimmed as you would expect.
If it is ambiguous whether the cursor should belong to the
previous token or to the following one (i.e. if it's between two quoted
strings, say "a""b" or a token_char), it always
gravitates to the previous token. This makes more sense when
completing.
- fixclosequote
- Sometimes you want to try to recover from a missing close quote (for
instance, when calculating completions), but usually you want a missing
close quote to be a fatal error. fixclosequote=>1 will implicitly
insert the correct quote if it's missing. fixclosequote=>0 is the
default.
- messages
- parse_line is capable of printing very informative error messages.
However, sometimes you don't care enough to print a message (like when
calculating completions). Messages are printed by default, so pass
messages=>0 to turn them off.
This function returns a list of three items (as shown in the
synopsis):
- tokens
- The tokens that the line was separated into (ref to an array of
strings).
- tokno
- The number of the token (index into the previous array) that contains
cursorpos.
- tokoff
- The character offset into tokno of cursorpos.
If the cursor is at the end of the token, tokoff will point to 1
character past the last character in tokno, a nonexistent character. If the
cursor is between tokens (surrounded by whitespace), a zero-length token
will be created for it.
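A sketch of a completion-style call using the named parameters above: the line has an unclosed quote, so fixclosequote=>1 is passed to recover and messages=>0 to stay quiet. The eval wrapper and the defensive checks are assumptions added for safety, since this page does not spell out what parse_line returns when parsing fails.

```perl
use strict;
use warnings;
use Text::Shellwords::Cursor;

my $parser = Text::Shellwords::Cursor->new();
my $line   = 'open "my fi';   # the user has not typed the close quote yet

# Track the cursor at the end of the line, as a completer usually would.
my ($tokens, $tokno, $tokoff) = eval {
    $parser->parse_line(
        $line,
        cursorpos     => length($line),
        fixclosequote => 1,   # implicitly supply the missing quote
        messages      => 0,   # do not print parse errors
    );
};

if (ref $tokens eq 'ARRAY' && defined $tokno) {
    my $partial = substr($tokens->[$tokno], 0, $tokoff);
    print "complete from: '$partial'\n";
} else {
    print "line could not be parsed\n";
}
```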
- parse_escape(lines)
- Escapes characters that would be otherwise interpreted by the parser. Will
accept either a single string or an arrayref of strings (which will be
modified in-place).
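The round trip described for parse_escape can be checked with a sketch like the one below. It assumes that, when given a single string, parse_escape returns the escaped string. That return value is not stated on this page, so treat it as an assumption.

```perl
use strict;
use warnings;
use Text::Shellwords::Cursor;

my $parser   = Text::Shellwords::Cursor->new();
my $original = 'my file.txt';   # contains a space the parser would split on

# Assumption: the single-string form returns the escaped copy.
my $escaped = $parser->parse_escape($original);

# Parsing the escaped text should give back one token equal to the original.
my ($tokens) = $parser->parse_line($escaped, messages => 0);

if ($tokens && @$tokens == 1 && $tokens->[0] eq $original) {
    print "round trip ok: '$escaped'\n";
} else {
    print "round trip failed\n";
}
```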
- join_line(tokens)
- This routine does a somewhat intelligent job of joining tokens back into a
command line. If token_chars (see "new") is empty (the default),
then it just escapes backslashes and quotes, and joins the tokens with
spaces.
However, if token_chars is nonempty, it tries to insert a
visually pleasing amount of space between the tokens. For instance,
rather than 'a ( b , c )', it tries to produce 'a (b, c)'. It won't
reformat any tokens that aren't found in
$self->{token_chars}, of course.
To change the formatting, you can redefine the variables
$self->{space_none},
$self->{space_before}, and
$self->{space_after}. Each variable is a
string containing all characters that should not be surrounded by
whitespace, should have whitespace before, and should have whitespace
after, respectively. Any character found in token_chars, but not in any
of these space_ variables, will have space placed both before and
after.
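A brief sketch of join_line with a nonempty token_chars. It assumes join_line accepts the same array reference that parse_line returns. The exact spacing of the output depends on the space_ settings, so no particular result is claimed here.

```perl
use strict;
use warnings;
use Text::Shellwords::Cursor;

# Treat parentheses and commas as stand-alone tokens.
my $parser = Text::Shellwords::Cursor->new(token_chars => '(),');

my ($tokens) = $parser->parse_line('a ( b , c )');

my $rejoined;
if (ref $tokens eq 'ARRAY') {
    # Assumption: join_line takes a reference to an array of tokens.
    $rejoined = $parser->join_line($tokens);
    print "$rejoined\n";
}
```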
LICENSE
Copyright (c) 2003-2011 Scott Bronson, all rights reserved. This program is
covered by the MIT license.
AUTHOR
Scott Bronson <bronson@rinspin.com>