Hand-Coded Parsers
Since Python is a general purpose programming language, its also reasonable to consider writing a hand-coded parser. For instance, recursive descent parsing is a fairly well-known technique for analyzing language-based information. Since Python is a very high-level language, writing the parser itself is usually easier than it would be in a traditional language like C or C++.
To illustrate, this section develops a custom parser for a simple grammar: it parses and evaluates arithmetic expression strings. This example also demonstrates the utility of Python as a general-purpose programming language. Although Python is often used as a frontend or rapid development language, its also useful for the kinds of things wed normally write in a systems development language like C or C++.
.6.1 The Expression Grammar
The grammar our parser will recognize can be described as follows:
goal -> This is a fairly typical grammar for a simple expression language,
and it allows for arbitrary expression nesting (some example
expressions appear at the end of the testparser
module listing in Example 18-11). Strings to be parsed
are either an expression or an assignment to a variable name
(set). Expressions involve
numbers, variables, and the operators +,
-, *, and /.
Because factor is nested in
expr in the grammar, * and
/ have higher precedence (i.e., bind tighter) than
+ and -. Expressions can be
enclosed in parentheses to override precedence, and all operators are
left associative -- that is, group on the left (e.g.,
1-2-3 is treated the same as
(1-2)-3).
Tokens are just the most primitive components of
the expression language. Each grammar rule earlier is followed in
square brackets by a list of tokens used to select it. In recursive
descent parsing, we determine the set of tokens that can possibly
start a rules substring, and use that information to predict
which rule will work ahead of time. For rules that iterate (the
-tail rules), we use the set of possibly following
tokens to know when to stop. Typically, tokens are recognized by a
string processor (a "scanner"), and a higher-level
processor (a "parser") uses the token stream to predict
and step through grammar rules and substrings.
The system is structured as two modules, holding two classes:
The parser is also responsible for computing the expressions
value and testing the system. In this version, the parser evaluates
the expression while it is being parsed. To use the system, we create
a parser with an input string and call its parse
method. We can also call parse again later with a
new expression string.
Theres a deliberate division of labor here. The scanner
extracts tokens from the string, but knows nothing about the grammar.
The parser handles the grammar, but is naive about the string itself.
This modular structure keeps the code relatively simple. And
its another example of the OOP composition relationship at
work: parsers embed and delegate to scanners.
The module in Example 18-9 implements the lexical
analysis task -- detecting the expressions basic tokens by
scanning the text string left to right on demand. Notice that this is
all straightforward logic here; such analysis can sometimes be
performed with regular expressions instead (described earlier), but
the pattern needed to detect and extract tokens in this example would
be too complex and fragile for my tastes. If your tastes vary, try
recoding this module with re.
####################################################
# the scanner (lexical analyser)
####################################################
import string
SyntaxError = SyntaxError # local errors
LexicalError = LexicalError
class Scanner:
def __init__(self, text):
self.next = 0
self.text = text + \0
def newtext(self, text):
Scanner.__init__(self, text)
def showerror(self):
print => , self.text
print => , ( * self.start) + ^
def match(self, token):
if self.token != token:
raise SyntaxError, [token]
else:
value = self.value
if self.token != \0:
self.scan( ) # next token/value
return value # return prior value
def scan(self):
self.value = None
ix = self.next
while self.text[ix] in string.whitespace:
ix = ix+1
self.start = ix
if self.text[ix] in [(, ), -, +, /, *, \0]:
self.token = self.text[ix]
ix = ix+1
elif self.text[ix] in string.digits:
str = \
while self.text[ix] in string.digits:
str = str + self.text[ix]
ix = ix+1
if self.text[ix] == .:
str = str + .
ix = ix+1
while self.text[ix] in string.digits:
str = str + self.text[ix]
ix = ix+1
self.token =
um
self.value = string.atof(str)
else:
self.token =
um
self.value = string.atol(str)
elif self.text[ix] in string.letters:
str = \
while self.text[ix] in (string.digits + string.letters):
str = str + self.text[ix]
ix = ix+1
if string.lower(str) == set:
self.token = set
else:
self.token = var
self.value = str
else:
raise LexicalError
self.next = ix
The parser modules class creates and embeds a scanner for its
lexical chores, and handles interpretation of the expression
grammars rules and evaluation of the expressions
result, as shown in Example 18-10.
########################################################
# the parser (syntax analyser, evaluates during parse)
########################################################
UndefinedError = UndefinedError
from scanner import Scanner, LexicalError, SyntaxError
class Parser:
def __init__(self, text=\):
self.lex = Scanner(text) # embed a scanner
self.vars = {pi:3.14159} # add a variable
def parse(self, *text):
if text: # main entry-point
self.lex.newtext(text[0]) # reuse this parser?
try:
self.lex.scan( ) # get first token
self.Goal( ) # parse a sentence
except SyntaxError:
print Syntax Error at column:, self.lex.start
self.lex.showerror( )
except LexicalError:
print Lexical Error at column:, self.lex.start
self.lex.showerror( )
except UndefinedError, name:
print "\%s is undefined at column:" % name, self.lex.start
self.lex.showerror( )
def Goal(self):
if self.lex.token in [
um, var, (]:
val = self.Expr( )
self.lex.match(\0) # expression?
print val
elif self.lex.token == set: # set command?
self.Assign( )
self.lex.match(\0)
else:
raise SyntaxError
def Assign(self):
self.lex.match(set)
var = self.lex.match(var)
val = self.Expr( )
self.vars[var] = val # assign name in dict
def Expr(self):
left = self.Factor( )
while 1:
if self.lex.token in [\0, )]:
return left
elif self.lex.token == +:
self.lex.scan( )
left = left + self.Factor( )
elif self.lex.token == -:
self.lex.scan( )
left = left - self.Factor( )
else:
raise SyntaxError
def Factor(self):
left = self.Term( )
while 1:
if self.lex.token in [+, -, \0, )]:
return left
elif self.lex.token == *:
self.lex.scan( )
left = left * self.Term( )
elif self.lex.token == /:
self.lex.scan( )
left = left / self.Term( )
else:
raise SyntaxError
def Term(self):
if self.lex.token ==
um:
val = self.lex.match(
um) # numbers
return val
elif self.lex.token == var:
if self.vars.has_key(self.lex.value):
val = self.vars[self.lex.value] # lookup names value
self.lex.scan( )
return val
else:
raise UndefinedError, self.lex.value
elif self.lex.token == (:
self.lex.scan( )
val = self.Expr( ) # sub-expression
self.lex.match())
return val
else:
raise SyntaxError
if __name__ == \__main__:
import testparser # self-test code
testparser.test(Parser, parser1) # test local Parser
If you study this code closely, youll notice that the parser
keeps a dictionary (self.vars) to manage variable
names: they
e stored in the dictionary on a
set command and fetched from it when they appear
in an expression. Tokens are represented as strings, with an optional
associated value (a numeric value for numbers and a string for
variable names).
The parser uses iteration (while loops) instead of
recursion, for the expr-tail and
factor-tail rules. Other than this optimization,
the rules of the grammar map directly onto parser methods: tokens
become calls to the scanner, and nested rule references become calls
to other methods.
When file parser1.py is run as a top-level
program, its self-test code is executed, which in turn simply runs a
canned test in the module shown in Example 18-11. Note
that all integer math uses Python long integers (unlimited precision
integers), because the scanner converts numbers to strings with
string.atol. Also notice that mixed
integer/floating-point operations cast up to floating point since
Python operators are used to do the actual calculations.
####################################################
# parser test code
####################################################
def test(ParserClass, msg):
print msg, ParserClass
x = ParserClass(4 / 2 + 3) # allow different Parsers
x.parse( )
x.parse(3 + 4 / 2) # like eval(3 + 4 / 2)...
x.parse((3 + 4) / 2)
x.parse(4 / (2 + 3))
x.parse(4.0 / (2 + 3))
x.parse(4 / (2.0 + 3))
x.parse(4.0 / 2 * 3)
x.parse((4.0 / 2) * 3)
x.parse(4.0 / (2 * 3))
x.parse((((3))) + 1)
y = ParserClass( )
y.parse(set a 4 / 2 + 1)
y.parse(a * 3)
y.parse(set b 12 / a)
y.parse()
z = ParserClass( )
z.parse(set a 99)
z.parse(set a a + 1)
z.parse(a)
z = ParserClass( )
z.parse(pi)
z.parse(2 * pi)
z.parse(1.234 + 2.1)
def interact(ParserClass): # command-line entry
print ParserClass
x = ParserClass( )
while 1:
cmd = raw_input(Enter=> )
if cmd == stop:
break
x.parse(cmd)
Correlate the following results to print statements in the self-test
module:
C:...PP2ELangParser>python parser1.py
parser1 __main__.Parser
5L
5L
3L
0L
0.8
0.8
6.0
6.0
0.666666666667
4L
9L
4L
100L
3.14159
6.28318
3.334
As usual, we can also test and use the system interactively: % python
>>> import parser1
>>> x = parser1.Parser( )
>>> x.parse(1 + 2)
3L
Error cases are trapped and reported: >>> x.parse(1 + a)
a is undefined at column: 4
=> 1 + a
=> ^
>>> x.parse(1+a+2)
a is undefined at column: 2
=> 1+a+2
=> ^
>>> x.parse(1 * 2 $)
Lexical Error at column: 6
=> 1 * 2 $
=> ^
>>> x.parse(1 * - 1)
Syntax Error at column: 4
=> 1 * - 1
=> ^
>>> x.parse(1 * (9)
Syntax Error at column: 6
=> 1 * (9
=> ^
Pathologically big numbers are handled well, because Pythons
built-in objects and operators are used along the way:
>>> x.parse(888888888888888888888888888888888888888888888.9999999)
8.88888888889e+44
>>> x.parse(99999999999999999999999999999999999999999 + 2)
100000000000000000000000000000000000000001L
>>> x.parse(999999999999999999999999999999.88888888888 + 1.1)
1e+30
In addition, there is an interactive loop interface in the
testparser module, if you want to use the parser
as a simple command-line calculator (or if you get tired of typing
parser method calls). Pass the Parser class, so
testparser can make one of its own:
>>> import testparser
>>> testparser.interact(parser1.Parser)
Enter=> 4 * 3 + 5
17L
Enter=> 5 + 4 * 3
17L
Enter=> (5 + 4) * 3
27L
Enter=> set a 99
Enter=> set b 66
Enter=> a + b
165L
Enter=> # + 1
Lexical Error at column: 0
=> # + 1
=> ^
Enter=> a * b + c
c is undefined at column: 8
=> a * b + c
=> ^
Enter=> a * b * + c
Syntax Error at column: 8
=> a * b * + c
=> ^
Enter=> a
99L
Enter=> a * a * a
970299L
Enter=> stop
>>>
.6.2 The Parsers Code
Example 18-9. PP2ELangParserscanner.py
Example 18-10. PP2ELangParserparser1.py
Example 18-11. PP2ELangParser estparser.py
.6.3 Adding a Parse Tree Interpreter
One weakness in the parser1 program is that it embeds expression evaluation logic in the parsing logic: the result is computed while the string is being parsed. This makes evaluation quick, but it can also make it difficult to modify the code, especially in larger systems. To simplify, we could restructure the program to keep expression parsing and evaluation separate. Instead of evaluating the string, the parser can build up an intermediate representation of it that can be evaluated later. As an added incentive, building the representation separately makes it available to other analysis tools (e.g., optimizers, viewers, and so on).
Example 18-12 shows a variant of parser1 that implements this idea. The parser analyzes the string and builds up a parse tree -- that is, a tree of class instances that represents the expression and that may be evaluated in a separate step. The parse tree is built from classes that "know" how to evaluate themselves: to compute the expression, we just ask the tree to evaluate itself. Root nodes in the tree ask their children to evaluate themselves and then combine the results by applying a single operator. In effect, evaluation in this version is simply a recursive traversal of a tree of embedded class instances constructed by the parser.
Example 18-12. PP2ELangParserparser2.py
TraceDefault = 0
UndefinedError = "UndefinedError"
from scanner import Scanner, SyntaxError, LexicalError
####################################################
# the interpreter (a smart objects tree)
####################################################
class TreeNode:
def validate(self, dict): # default error check
pass
def apply(self, dict): # default evaluator
pass
def trace(self, level): # default unparser
print .*level + When parser2 is run as a top-level program, we get
the same test code output as for parser1. In fact,
it reuses the same test code: both parsers pass in their parser class
object to testparser.test. And since classes are
objects, we can also pass this version of the parser to
testparsers interactive loop:
testparser.interact(parser2.Parser). The new
parsers external behavior is identical to that of the
original.
Notice that the new parser reuses the same scanner module, too. To
catch errors raised by scanner, it also imports the specific strings
that identify the scanners exceptions. The scanner and parser
can both raise exceptions on errors (lexical errors, syntax errors,
and undefined name errors). They
e caught at the top level of
the parser, and end the current parse. Theres no need to set
and check status flags to terminate the recursion. Since math is done
using long integers, floating-point numbers, and Pythons
operators, theres usually no need to trap numeric overflow or
underflow errors. But as is, the parser doesn handle errors
like division by zero: they make the parser system exit with a Python
stack dump. Uncovering the cause and fix for this is left as an
exercise.
The
intermediate representation of an expression is a tree of class
instances, whose shape reflects the order of operator evaluation.
This parser also has logic to print an indented listing of the
constructed parse tree if the traceme attribute is
set. Indentation gives the nesting of subtrees, and binary operators
list left subtrees first. For example:
% python
>>> import parser2
>>> p = parser2.Parser( )
>>> p.traceme = 1
>>> p.parse(5 + 4 * 2)
[+]
...5L
...[*]
......4L
......2L
13L
When this tree is evaluated, the apply method
recursively evaluates subtrees and applies root operators to their
results. Here, * is evaluated before
+, since its lower in the tree. The
Factor method consumes the *
substring before returning a right subtree to
Expr:
>>> p.parse(5 * 4 - 2)
[-]
...[*]
......5L
......4L
...2L
18L
In this example, * is evaluated before -. The
Factor method loops though a substring of
* and / expressions before
returning the resulting left subtree to Expr:
>>> p.parse(1 + 3 * (2 * 3 + 4))
[+]
...1L
...[*]
......3L
......[+]
.........[*]
............2L
............3L
.........4L
31L
Trees are made of nested class instances. From an OOP perspective,
its another way to use composition. Since tree nodes are just
class instances, this tree could be created and evaluated manually,
too:
PlusNode( NumNode(1),
TimesNode( NumNode(3),
PlusNode( TimesNode(NumNode(2), NumNode(3)),
NumNode(4) ))).apply({})
But we might as well let the parser build it for us (Python is not
that much like Lisp, despite what you may have heard).
But
wait -- there is a better way to explore parse tree structures.
Figure 18-1 shows the parse tree generated for
string "1 + 3 * (2 * 3 +
4)", displayed in PyTree, the tree visualization GUI presented
at the end of the previous chapter. This only works because the
parser2 module builds the parse tree explicitly
(parser1 evaluates during a parse instead), and
because PyTrees code is generic and reusable.
If you read the last chapter, youll recall that PyTree can
draw most any tree data structure, but it is preconfigured to handle
binary search trees and the parse trees we
e studying in this
chapter. You might also remember that clicking on nodes in a
displayed parse tree evaluates the subtree rooted there. Figure 18-2 shows the pop-up generated after clicking the
trees root node (you get different results if you click other
parts of tree, because smaller subtrees are evaluated).
PyTree makes it easy to learn about and experiment with the parser.
To determine the tree shape produced for a given expression, start
PyTree, click on its Parser radiobutton, type the expression in the
input field at the bottom, and press "input" (or your
Enter key). The parser class is run to generate a tree from your
input, and the GUI displays the result. For instance, Figure 18-3 sketches the parse tree generated if we remove
the parentheses from the first expression in the input field. The
root node evaluates to 23 this time, due to the different
shapes evaluation order.
To generate an even more different shape, try introducing more
parentheses to the expression and hitting the Enter key again. Figure 18-4 shows a much flatter tree structure produced
by adding a few parentheses to override operator precedence. Because
these parentheses change the tree shape, they also change the
expressions overall result again. Figure 18-5
shows the result pop-up after clicking the root node in this display.
Depending upon the operators used within an expression, some very
differently shaped trees yield the same result when evaluated. For
instance, Figure 18-6 shows a more left-heavy tree
generated from a different expression string that evaluates to 56
nevertheless.
Finally, Figure 18-7 shows a parsed assignment
statement; clicking the "set" root assigns variable
spam, and clicking node spam
then evaluates to -4. If you find the parser puzzling, try running
PyTree like this on your computer to get a better feel for the
parsing process. (Id like to show more example trees, but I
ran out of page real estate at this point in the book.)
The hand-coded parser programs shown earlier illustrate some
interesting concepts and underscore the power of Python for
general-purpose programming. Depending on your job description, they
may also be typical of the sort of thing youd write regularly
in a traditional language like C. Parsers are an important component
in a wide variety of applications, but in some cases, they
e
not as necessary as you might think. Let me explain why.
So far, we started with an expression parser and added a parse tree
interpreter to make the code easier to modify. As is, the parser
works, but it may be slow compared to a C implementation. If the
parser is used frequently, we could speed it up by moving parts to C
extension modules. For instance, the scanner might be moved to C
initially, since its often called from the parser. Ultimately,
we might add components to the grammar that allow expressions to
access application-specific variables and functions.
All of the these steps constitute
good engineering. But depending on your application, this approach
may not be the best one in Python. The easiest way to evaluate input
expressions in Python is often to let Python do it, by calling the
eval built-in function. In fact, we can usually
replace the entire expression evaluation program with one function
call. The next example will demonstrate how this is done.
More importantly, the next section underscores a core idea behind the
language: if you already have an extensible, embeddable, high-level
language system, why invent another? Python itself can often satisfy
language-based component needs.
.6.4 Parse Tree Structure
.6.5 Exploring Parse Trees with Pytree
Figure 18-1. Parse tree built for 1 + 3 * (2 * 3 + 4)
Figure 18-2. Clicking the root node to evaluate a tree
Figure 18-3. Parse tree for 1 + 3 * 2 * 3 + 4, result=23
Figure 18-4. Parse tree built for "(1 + 3) * (2 * ( 3 + 4))"
Figure 18-5. Clicking and evaluating the root node
Figure 18-6. Parse tree for "(1 + 3) * 2 * ( 3 + 4)", result=56
Figure 18-7. Assignment, left-grouping: "set spam 1 - 2 - 3"
.6.6 Parsers Versus Python
Категории