Hand-Coded Parsers

2017-11-03 09:05:09

Since Python is a general purpose programming language, its also reasonable to consider writing a hand-coded parser. For instance, recursive descent parsing is a fairly well-known technique for analyzing language-based information. Since Python is a very high-level language, writing the parser itself is usually easier than it would be in a traditional language like C or C++.

To illustrate, this section develops a custom parser for a simple grammar: it parses and evaluates arithmetic expression strings. This example also demonstrates the utility of Python as a general-purpose programming language. Although Python is often used as a frontend or rapid development language, its also useful for the kinds of things wed normally write in a systems development language like C or C++.

.6.1 The Expression Grammar

The grammar our parser will recognize can be described as follows:

goal -> END [number, variable, ( ] goal -> <assign> END [set] assign -> set [set] expr -> [number, variable, ( ] expr-tail -> ^ [END, ) ] expr-tail -> + [+] expr-tail -> - [-] factor -> [number, variable, ( ] factor-tail -> ^ [+, -, END, ) ] factor-tail -> * [*] factor-tail -> / [/] term -> [number] term -> [variable] term -> ( ) [(] tokens: (, ), num, var, -, +, /, *, set, end

This is a fairly typical grammar for a simple expression language, and it allows for arbitrary expression nesting (some example expressions appear at the end of the testparser module listing in Example 18-11). Strings to be parsed are either an expression or an assignment to a variable name (set). Expressions involve numbers, variables, and the operators +, -, *, and /. Because factor is nested in expr in the grammar, * and / have higher precedence (i.e., bind tighter) than + and -. Expressions can be enclosed in parentheses to override precedence, and all operators are left associative -- that is, group on the left (e.g., 1-2-3 is treated the same as (1-2)-3).

Tokens are just the most primitive components of the expression language. Each grammar rule earlier is followed in square brackets by a list of tokens used to select it. In recursive descent parsing, we determine the set of tokens that can possibly start a rules substring, and use that information to predict which rule will work ahead of time. For rules that iterate (the -tail rules), we use the set of possibly following tokens to know when to stop. Typically, tokens are recognized by a string processor (a "scanner"), and a higher-level processor (a "parser") uses the token stream to predict and step through grammar rules and substrings.

.6.2 The Parsers Code

The system is structured as two modules, holding two classes:

The scanner handles low-level character-by-character analysis.

The parser embeds a scanner and handles higher-level grammar analysis.

The parser is also responsible for computing the expressions value and testing the system. In this version, the parser evaluates the expression while it is being parsed. To use the system, we create a parser with an input string and call its parse method. We can also call parse again later with a new expression string.

Theres a deliberate division of labor here. The scanner extracts tokens from the string, but knows nothing about the grammar. The parser handles the grammar, but is naive about the string itself. This modular structure keeps the code relatively simple. And its another example of the OOP composition relationship at work: parsers embed and delegate to scanners.

The module in Example 18-9 implements the lexical analysis task -- detecting the expressions basic tokens by scanning the text string left to right on demand. Notice that this is all straightforward logic here; such analysis can sometimes be performed with regular expressions instead (described earlier), but the pattern needed to detect and extract tokens in this example would be too complex and fragile for my tastes. If your tastes vary, try recoding this module with re.

Example 18-9. PP2ELangParserscanner.py

#################################################### # the scanner (lexical analyser) #################################################### import string SyntaxError = SyntaxError # local errors LexicalError = LexicalError class Scanner: def __init__(self, text): self.next = 0 self.text = text + \0 def newtext(self, text): Scanner.__init__(self, text) def showerror(self): print => , self.text print => , ( * self.start) + ^ def match(self, token): if self.token != token: raise SyntaxError, [token] else: value = self.value if self.token != \0: self.scan( ) # next token/value return value # return prior value def scan(self): self.value = None ix = self.next while self.text[ix] in string.whitespace: ix = ix+1 self.start = ix if self.text[ix] in [(, ), -, +, /, *, \0]: self.token = self.text[ix] ix = ix+1 elif self.text[ix] in string.digits: str = \ while self.text[ix] in string.digits: str = str + self.text[ix] ix = ix+1 if self.text[ix] == .: str = str + . ix = ix+1 while self.text[ix] in string.digits: str = str + self.text[ix] ix = ix+1 self.token = um self.value = string.atof(str) else: self.token = um self.value = string.atol(str) elif self.text[ix] in string.letters: str = \ while self.text[ix] in (string.digits + string.letters): str = str + self.text[ix] ix = ix+1 if string.lower(str) == set: self.token = set else: self.token = var self.value = str else: raise LexicalError self.next = ix

The parser modules class creates and embeds a scanner for its lexical chores, and handles interpretation of the expression grammars rules and evaluation of the expressions result, as shown in Example 18-10.

Example 18-10. PP2ELangParserparser1.py

######################################################## # the parser (syntax analyser, evaluates during parse) ######################################################## UndefinedError = UndefinedError from scanner import Scanner, LexicalError, SyntaxError class Parser: def __init__(self, text=\): self.lex = Scanner(text) # embed a scanner self.vars = {pi:3.14159} # add a variable def parse(self, *text): if text: # main entry-point self.lex.newtext(text[0]) # reuse this parser? try: self.lex.scan( ) # get first token self.Goal( ) # parse a sentence except SyntaxError: print Syntax Error at column:, self.lex.start self.lex.showerror( ) except LexicalError: print Lexical Error at column:, self.lex.start self.lex.showerror( ) except UndefinedError, name: print "\%s is undefined at column:" % name, self.lex.start self.lex.showerror( ) def Goal(self): if self.lex.token in [ um, var, (]: val = self.Expr( ) self.lex.match(\0) # expression? print val elif self.lex.token == set: # set command? self.Assign( ) self.lex.match(\0) else: raise SyntaxError def Assign(self): self.lex.match(set) var = self.lex.match(var) val = self.Expr( ) self.vars[var] = val # assign name in dict def Expr(self): left = self.Factor( ) while 1: if self.lex.token in [\0, )]: return left elif self.lex.token == +: self.lex.scan( ) left = left + self.Factor( ) elif self.lex.token == -: self.lex.scan( ) left = left - self.Factor( ) else: raise SyntaxError def Factor(self): left = self.Term( ) while 1: if self.lex.token in [+, -, \0, )]: return left elif self.lex.token == *: self.lex.scan( ) left = left * self.Term( ) elif self.lex.token == /: self.lex.scan( ) left = left / self.Term( ) else: raise SyntaxError def Term(self): if self.lex.token == um: val = self.lex.match( um) # numbers return val elif self.lex.token == var: if self.vars.has_key(self.lex.value): val = self.vars[self.lex.value] # lookup names value self.lex.scan( ) return val else: raise UndefinedError, self.lex.value elif self.lex.token == (: self.lex.scan( ) val = self.Expr( ) # sub-expression self.lex.match()) return val else: raise SyntaxError if __name__ == \__main__: import testparser # self-test code testparser.test(Parser, parser1) # test local Parser

If you study this code closely, youll notice that the parser keeps a dictionary (self.vars) to manage variable names: they e stored in the dictionary on a set command and fetched from it when they appear in an expression. Tokens are represented as strings, with an optional associated value (a numeric value for numbers and a string for variable names).

The parser uses iteration (while loops) instead of recursion, for the expr-tail and factor-tail rules. Other than this optimization, the rules of the grammar map directly onto parser methods: tokens become calls to the scanner, and nested rule references become calls to other methods.

When file parser1.py is run as a top-level program, its self-test code is executed, which in turn simply runs a canned test in the module shown in Example 18-11. Note that all integer math uses Python long integers (unlimited precision integers), because the scanner converts numbers to strings with string.atol. Also notice that mixed integer/floating-point operations cast up to floating point since Python operators are used to do the actual calculations.

Example 18-11. PP2ELangParser estparser.py

#################################################### # parser test code #################################################### def test(ParserClass, msg): print msg, ParserClass x = ParserClass(4 / 2 + 3) # allow different Parsers x.parse( ) x.parse(3 + 4 / 2) # like eval(3 + 4 / 2)... x.parse((3 + 4) / 2) x.parse(4 / (2 + 3)) x.parse(4.0 / (2 + 3)) x.parse(4 / (2.0 + 3)) x.parse(4.0 / 2 * 3) x.parse((4.0 / 2) * 3) x.parse(4.0 / (2 * 3)) x.parse((((3))) + 1) y = ParserClass( ) y.parse(set a 4 / 2 + 1) y.parse(a * 3) y.parse(set b 12 / a) y.parse() z = ParserClass( ) z.parse(set a 99) z.parse(set a a + 1) z.parse(a) z = ParserClass( ) z.parse(pi) z.parse(2 * pi) z.parse(1.234 + 2.1) def interact(ParserClass): # command-line entry print ParserClass x = ParserClass( ) while 1: cmd = raw_input(Enter=> ) if cmd == stop: break x.parse(cmd)

Correlate the following results to print statements in the self-test module:

C:...PP2ELangParser>python parser1.py parser1 __main__.Parser 5L 5L 3L 0L 0.8 0.8 6.0 6.0 0.666666666667 4L 9L 4L 100L 3.14159 6.28318 3.334

As usual, we can also test and use the system interactively:

% python >>> import parser1 >>> x = parser1.Parser( ) >>> x.parse(1 + 2) 3L

Error cases are trapped and reported:

>>> x.parse(1 + a) a is undefined at column: 4 => 1 + a => ^ >>> x.parse(1+a+2) a is undefined at column: 2 => 1+a+2 => ^ >>> x.parse(1 * 2 $) Lexical Error at column: 6 => 1 * 2 $ => ^ >>> x.parse(1 * - 1) Syntax Error at column: 4 => 1 * - 1 => ^ >>> x.parse(1 * (9) Syntax Error at column: 6 => 1 * (9 => ^

Pathologically big numbers are handled well, because Pythons built-in objects and operators are used along the way:

>>> x.parse(888888888888888888888888888888888888888888888.9999999) 8.88888888889e+44 >>> x.parse(99999999999999999999999999999999999999999 + 2) 100000000000000000000000000000000000000001L >>> x.parse(999999999999999999999999999999.88888888888 + 1.1) 1e+30

In addition, there is an interactive loop interface in the testparser module, if you want to use the parser as a simple command-line calculator (or if you get tired of typing parser method calls). Pass the Parser class, so testparser can make one of its own:

>>> import testparser >>> testparser.interact(parser1.Parser) Enter=> 4 * 3 + 5 17L Enter=> 5 + 4 * 3 17L Enter=> (5 + 4) * 3 27L Enter=> set a 99 Enter=> set b 66 Enter=> a + b 165L Enter=> # + 1 Lexical Error at column: 0 => # + 1 => ^ Enter=> a * b + c c is undefined at column: 8 => a * b + c => ^ Enter=> a * b * + c Syntax Error at column: 8 => a * b * + c => ^ Enter=> a 99L Enter=> a * a * a 970299L Enter=> stop >>>

Lesson 3 Divide and Conquer

As the parser system demonstrates, modular program design is almost always a major win. By using Pythons program structuring tools (functions, modules, classes, etc.), big tasks can be broken down into small, manageable parts that can be coded and tested independently.

For instance, the scanner can be tested without the parser by making an instance with an input string and calling its scan or match methods repeatedly. We can even test it like this interactively, from Pythons command line. By separating programs into logical components, they become easier to understand and modify. Imagine what the parser would look like if the scanners logic was embedded rather than called.

.6.3 Adding a Parse Tree Interpreter

One weakness in the parser1 program is that it embeds expression evaluation logic in the parsing logic: the result is computed while the string is being parsed. This makes evaluation quick, but it can also make it difficult to modify the code, especially in larger systems. To simplify, we could restructure the program to keep expression parsing and evaluation separate. Instead of evaluating the string, the parser can build up an intermediate representation of it that can be evaluated later. As an added incentive, building the representation separately makes it available to other analysis tools (e.g., optimizers, viewers, and so on).

Example 18-12 shows a variant of parser1 that implements this idea. The parser analyzes the string and builds up a parse tree -- that is, a tree of class instances that represents the expression and that may be evaluated in a separate step. The parse tree is built from classes that "know" how to evaluate themselves: to compute the expression, we just ask the tree to evaluate itself. Root nodes in the tree ask their children to evaluate themselves and then combine the results by applying a single operator. In effect, evaluation in this version is simply a recursive traversal of a tree of embedded class instances constructed by the parser.

Example 18-12. PP2ELangParserparser2.py

TraceDefault = 0 UndefinedError = "UndefinedError" from scanner import Scanner, SyntaxError, LexicalError #################################################### # the interpreter (a smart objects tree) #################################################### class TreeNode: def validate(self, dict): # default error check pass def apply(self, dict): # default evaluator pass def trace(self, level): # default unparser print .*level + # ROOTS class BinaryNode(TreeNode): def __init__(self, left, right): # inherited methods self.left, self.right = left, right # left/right branches def validate(self, dict): self.left.validate(dict) # recurse down branches self.right.validate(dict) def trace(self, level): print .*level + [ + self.label + ] self.left.trace(level+3) self.right.trace(level+3) class TimesNode(BinaryNode): label = * def apply(self, dict): return self.left.apply(dict) * self.right.apply(dict) class DivideNode(BinaryNode): label = / def apply(self, dict): return self.left.apply(dict) / self.right.apply(dict) class PlusNode(BinaryNode): label = + def apply(self, dict): return self.left.apply(dict) + self.right.apply(dict) class MinusNode(BinaryNode): label = - def apply(self, dict): return self.left.apply(dict) - self.right.apply(dict) # LEAVES class NumNode(TreeNode): def __init__(self, num): self.num = num # already numeric def apply(self, dict): # use default validate return self.num def trace(self, level): print .*level + `self.num` class VarNode(TreeNode): def __init__(self, text, start): self.name = text # variable name self.column = start # column for errors def validate(self, dict): if not dict.has_key(self.name): raise UndefinedError, (self.name, self.column) def apply(self, dict): return dict[self.name] # validate before apply def assign(self, value, dict): dict[self.name] = value # local extension def trace(self, level): print .*level + self.name # COMPOSITES class AssignNode(TreeNode): def __init__(self, var, val): self.var, self.val = var, val def validate(self, dict): self.val.validate(dict) # don validate var def apply(self, dict): self.var.assign( self.val.apply(dict), dict ) def trace(self, level): print .*level + set self.var.trace(level + 3) self.val.trace(level + 3) #################################################### # the parser (syntax analyser, tree builder) #################################################### class Parser: def __init__(self, text=\): self.lex = Scanner(text) # make a scanner self.vars = {pi:3.14159} # add constants self.traceme = TraceDefault def parse(self, *text): # external interface if text: self.lex.newtext(text[0]) # reuse with new text tree = self.analyse( ) # parse string if tree: if self.traceme: # dump parse-tree? print; tree.trace(0) if self.errorCheck(tree): # check names self.interpret(tree) # evaluate tree def analyse(self): try: self.lex.scan( ) # get first token return self.Goal( ) # build a parse-tree except SyntaxError: print Syntax Error at column:, self.lex.start self.lex.showerror( ) except LexicalError: print Lexical Error at column:, self.lex.start self.lex.showerror( ) def errorCheck(self, tree): try: tree.validate(self.vars) # error checker return ok except UndefinedError, varinfo: print "\%s is undefined at column: %d" % varinfo self.lex.start = varinfo[1] self.lex.showerror( ) # returns None def interpret(self, tree): result = tree.apply(self.vars) # tree evals itself if result != None: # ignore set result print result def Goal(self): if self.lex.token in [ um, var, (]: tree = self.Expr( ) self.lex.match(\0) return tree elif self.lex.token == set: tree = self.Assign( ) self.lex.match(\0) return tree else: raise SyntaxError def Assign(self): self.lex.match(set) vartree = VarNode(self.lex.value, self.lex.start) self.lex.match(var) valtree = self.Expr( ) return AssignNode(vartree, valtree) # two subtrees def Expr(self): left = self.Factor( ) # left subtree while 1: if self.lex.token in [\0, )]: return left elif self.lex.token == +: self.lex.scan( ) left = PlusNode(left, self.Factor( )) # add root-node elif self.lex.token == -: self.lex.scan( ) left = MinusNode(left, self.Factor( )) # grows up/right else: raise SyntaxError def Factor(self): left = self.Term( ) while 1: if self.lex.token in [+, -, \0, )]: return left elif self.lex.token == *: self.lex.scan( ) left = TimesNode(left, self.Term( )) elif self.lex.token == /: self.lex.scan( ) left = DivideNode(left, self.Term( )) else: raise SyntaxError def Term(self): if self.lex.token == um: leaf = NumNode(self.lex.match( um)) return leaf elif self.lex.token == var: leaf = VarNode(self.lex.value, self.lex.start) self.lex.scan( ) return leaf elif self.lex.token == (: self.lex.scan( ) tree = self.Expr( ) self.lex.match()) return tree else: raise SyntaxError #################################################### # self-test code: use my parser, parser1s tester #################################################### if __name__ == \__main__: import testparser testparser.test(Parser, parser2) # run with Parser class here

When parser2 is run as a top-level program, we get the same test code output as for parser1. In fact, it reuses the same test code: both parsers pass in their parser class object to testparser.test. And since classes are objects, we can also pass this version of the parser to testparsers interactive loop: testparser.interact(parser2.Parser). The new parsers external behavior is identical to that of the original.

Notice that the new parser reuses the same scanner module, too. To catch errors raised by scanner, it also imports the specific strings that identify the scanners exceptions. The scanner and parser can both raise exceptions on errors (lexical errors, syntax errors, and undefined name errors). They e caught at the top level of the parser, and end the current parse. Theres no need to set and check status flags to terminate the recursion. Since math is done using long integers, floating-point numbers, and Pythons operators, theres usually no need to trap numeric overflow or underflow errors. But as is, the parser doesn handle errors like division by zero: they make the parser system exit with a Python stack dump. Uncovering the cause and fix for this is left as an exercise.

.6.4 Parse Tree Structure

The intermediate representation of an expression is a tree of class instances, whose shape reflects the order of operator evaluation. This parser also has logic to print an indented listing of the constructed parse tree if the traceme attribute is set. Indentation gives the nesting of subtrees, and binary operators list left subtrees first. For example:

% python >>> import parser2 >>> p = parser2.Parser( ) >>> p.traceme = 1 >>> p.parse(5 + 4 * 2) [+] ...5L ...[*] ......4L ......2L 13L

When this tree is evaluated, the apply method recursively evaluates subtrees and applies root operators to their results. Here, * is evaluated before +, since its lower in the tree. The Factor method consumes the * substring before returning a right subtree to Expr:

>>> p.parse(5 * 4 - 2) [-] ...[*] ......5L ......4L ...2L 18L

In this example, * is evaluated before -. The Factor method loops though a substring of * and / expressions before returning the resulting left subtree to Expr:

>>> p.parse(1 + 3 * (2 * 3 + 4)) [+] ...1L ...[*] ......3L ......[+] .........[*] ............2L ............3L .........4L 31L

Trees are made of nested class instances. From an OOP perspective, its another way to use composition. Since tree nodes are just class instances, this tree could be created and evaluated manually, too:

PlusNode( NumNode(1), TimesNode( NumNode(3), PlusNode( TimesNode(NumNode(2), NumNode(3)), NumNode(4) ))).apply({})

But we might as well let the parser build it for us (Python is not that much like Lisp, despite what you may have heard).

.6.5 Exploring Parse Trees with Pytree

But wait -- there is a better way to explore parse tree structures. Figure 18-1 shows the parse tree generated for string "1 + 3 * (2 * 3 + 4)", displayed in PyTree, the tree visualization GUI presented at the end of the previous chapter. This only works because the parser2 module builds the parse tree explicitly (parser1 evaluates during a parse instead), and because PyTrees code is generic and reusable.

Figure 18-1. Parse tree built for 1 + 3 `` (2 `` 3 + 4)

If you read the last chapter, youll recall that PyTree can draw most any tree data structure, but it is preconfigured to handle binary search trees and the parse trees we e studying in this chapter. You might also remember that clicking on nodes in a displayed parse tree evaluates the subtree rooted there. Figure 18-2 shows the pop-up generated after clicking the trees root node (you get different results if you click other parts of tree, because smaller subtrees are evaluated).

Figure 18-2. Clicking the root node to evaluate a tree

PyTree makes it easy to learn about and experiment with the parser. To determine the tree shape produced for a given expression, start PyTree, click on its Parser radiobutton, type the expression in the input field at the bottom, and press "input" (or your Enter key). The parser class is run to generate a tree from your input, and the GUI displays the result. For instance, Figure 18-3 sketches the parse tree generated if we remove the parentheses from the first expression in the input field. The root node evaluates to 23 this time, due to the different shapes evaluation order.

Figure 18-3. Parse tree for 1 + 3 `` 2 `` 3 + 4, result=23

To generate an even more different shape, try introducing more parentheses to the expression and hitting the Enter key again. Figure 18-4 shows a much flatter tree structure produced by adding a few parentheses to override operator precedence. Because these parentheses change the tree shape, they also change the expressions overall result again. Figure 18-5 shows the result pop-up after clicking the root node in this display.

Figure 18-4. Parse tree built for "(1 + 3) `` (2 `` ( 3 + 4))"

Figure 18-5. Clicking and evaluating the root node

Depending upon the operators used within an expression, some very differently shaped trees yield the same result when evaluated. For instance, Figure 18-6 shows a more left-heavy tree generated from a different expression string that evaluates to 56 nevertheless.

Figure 18-6. Parse tree for "(1 + 3) `` 2 `` ( 3 + 4)", result=56

Finally, Figure 18-7 shows a parsed assignment statement; clicking the "set" root assigns variable spam, and clicking node spam then evaluates to -4. If you find the parser puzzling, try running PyTree like this on your computer to get a better feel for the parsing process. (Id like to show more example trees, but I ran out of page real estate at this point in the book.)

Figure 18-7. Assignment, left-grouping: "set spam 1 - 2 - 3"

.6.6 Parsers Versus Python

The hand-coded parser programs shown earlier illustrate some interesting concepts and underscore the power of Python for general-purpose programming. Depending on your job description, they may also be typical of the sort of thing youd write regularly in a traditional language like C. Parsers are an important component in a wide variety of applications, but in some cases, they e not as necessary as you might think. Let me explain why.

So far, we started with an expression parser and added a parse tree interpreter to make the code easier to modify. As is, the parser works, but it may be slow compared to a C implementation. If the parser is used frequently, we could speed it up by moving parts to C extension modules. For instance, the scanner might be moved to C initially, since its often called from the parser. Ultimately, we might add components to the grammar that allow expressions to access application-specific variables and functions.

All of the these steps constitute good engineering. But depending on your application, this approach may not be the best one in Python. The easiest way to evaluate input expressions in Python is often to let Python do it, by calling the eval built-in function. In fact, we can usually replace the entire expression evaluation program with one function call. The next example will demonstrate how this is done.

More importantly, the next section underscores a core idea behind the language: if you already have an extensible, embeddable, high-level language system, why invent another? Python itself can often satisfy language-based component needs.