
Notes on How Parsers and Compilers Work

Contact the author: Stephen Ferg
Source: http://parsingintro.sourceforge.net/ (retrieved 1/13/15)
Content updated: 2007-10-15. Page format updated: 2011-10-14.

Download this zip file to obtain the source code of files discussed in this article.

Table of Contents

1 About this document
    1.1 What I want...
    1.2 I achieve enlightenment
2 The Language Definition
3 The Source Text
4 The Scanner
    4.1 The Character Class
    4.2 The Scanner Class
    4.3 A driver for the scanner
    4.4 A bit of source code for the scanner to process
5 The Lexer
    5.1 What is whitespace?
6 The Rules for a Programming Language
    6.1 Tokenizing Rules
    6.2 Writing a Lexer
    6.3 Specifying keywords, symbols, token types, etc.
    6.4 A Token class
    6.5 A Lexer class
    6.6 A driver for the lexer
    6.7 Some source text to test the lexer
7 The Parser
    7.1 Writing a Recursive Descent Parser
8 I achieve enlightenment (again)
    8.1 Generating an Abstract Syntax Tree (AST)
    8.2 A Node class for the AST
    8.3 A recursive descent parser
    8.4 A driver program for the recursive descent parser
    8.5 The AST it generated

1 About this document

A few months ago I began a personal project to learn a bit more about how parsers and compilers work.


Here are some notes that I made during that project. Maybe they will be of use to you.

1.1 What I want...

I'd like to learn how to write parsers and compilers. If I could write my own parsers and compilers, I could do things like:

write compilers or parsers for my own little languages
write a pretty printer for some of my favorite little languages
write a translator from one little language to another

To learn, I have decided to embark on a project to write a parser and a compiler in Python. Python is a powerful, high-level, object-oriented language that is also very readable. It would allow me to work with basic concepts without getting bogged down in language mechanics.

1.2 I achieve enlightenment

I had been studying the subject for some time and I had learned some basic concepts. But I was totally confused by discussions that conflated the scanner, lexer, and parser — some even used the terms "scanner", "lexer", and "parser" interchangeably. Then I discovered How does an interpreter/compiler work? and became enlightened. It clearly laid out the different functions of the scanner, lexer, and parser. Here are the lines that triggered my awakening:

Source File —> Scanner —> Lexer —> Parser —> Interpreter/Code Generator

Scanner: This is the first module in a compiler or interpreter. Its job is to read the source file one character at a time. It can also keep track of which line number and character is currently being read..... For now, assume that each time the scanner is called, it returns the next character in the file.

Lexer: This module serves to break up the source file into chunks (called tokens). It calls the scanner to get characters one at a time and organizes them into tokens and token types. Thus, the lexer calls the scanner to pass it one character at a time and groups them together and identifies them as tokens for the language parser (which is the next stage).

Parser: This is the part of the compiler that really understands the syntax of the language. It calls the lexer to get tokens and processes the tokens per the syntax of the language.
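That division of labor can be sketched in miniature. The following is my own toy illustration of the Source File —> Scanner —> Lexer —> Parser chain, not code from this article; the function names and the trivial "name = number ;" grammar are just for illustration:

```python
def scanner(sourceText):
    """Scanner: return the characters of the source text one at a time."""
    for ch in sourceText:
        yield ch

def lexer(chars):
    """Lexer: call the scanner for characters and group them into tokens."""
    token = ""
    for ch in chars:
        if ch.isalnum():
            token += ch          # still inside a word: keep accumulating
        else:
            if token:            # a word just ended: emit it
                yield token
                token = ""
            if not ch.isspace(): # whitespace delimits tokens; symbols ARE tokens
                yield ch
    if token:
        yield token

def parser(tokens):
    """Parser: check the tokens against a (trivial) grammar: name = number ;"""
    name, eq, value, end = tokens
    assert eq == "=" and end == ";"
    return (name, value)

print(parser(list(lexer(scanner("alpha = 16 ;")))))
```

Running the chain on the statement "alpha = 16 ;" yields the pair ('alpha', '16'): the scanner saw only characters, the lexer saw only character groups, and only the parser knew the statement's shape.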

2 The Language Definition

Every compiler is written to process source files in a particular language. A COBOL compiler compiles COBOL code; it doesn't compile Fortran. A Python interpreter parses and executes Python, not Ruby.

So the place to start in compiler development is with the language that I want to compile. In particular, I need some mechanism for creating a precise, formal specification of the language that I want to compile. That mechanism is EBNF. So the first step is to use EBNF to specify the language that I want to process.

But to begin with, I don't really need to know much about the particular language to write a scanner. So I will defer thinking about EBNF and language specification until I start to build the lexer.


3 The Source Text

The source text contains the text of a program written in the specified language.


4 The Scanner

The scanner's job is to read the source file one character at a time. For each character, it keeps track of the line and character position where the character was found. Each time the scanner is called, it reads the next character from the file and returns it.

So let's write a scanner.

4.1 The Character Class

Since we're all object-oriented now, we write two classes.

The first class is a Character class that will wrap a single character that the scanner retrieves from the source text. In addition to holding the character itself (its cargo) it will hold information about the location of the character in the source text. [SourceCode]

ENDMARK = "\0" # aka "lowvalues"

#-----------------------------------------------------------------------
#
#                   Character
#
#-----------------------------------------------------------------------
class Character:
    """
    A Character object holds
        - one character (self.cargo)
        - the index of the character's position in the sourceText
        - the index of the line where the character was found in the sourceText
        - the index of the column in the line where the character was found in the sourceText
        - (a reference to) the entire sourceText (self.sourceText)

    This information will be available to a token that uses this
    character.  If an error occurs, the token can use this information
    to report the line/column number where the error occurred, and to
    show an image of the line in sourceText where the error occurred.
    """
    #-------------------------------------------------------------------
    #
    #-------------------------------------------------------------------
    def __init__(self, c, lineIndex, colIndex, sourceIndex, sourceText):
        """ In Python, the __init__ method is the constructor. """
        self.cargo       = c
        self.sourceIndex = sourceIndex
        self.lineIndex   = lineIndex
        self.colIndex    = colIndex
        self.sourceText  = sourceText

    #-------------------------------------------------------------------
    # return a displayable string representation of the Character object
    #-------------------------------------------------------------------
    def __str__(self):
        """
        In Python, the __str__ method returns a string representation
        of an object.  In Java, this would be the toString() method.
        """
        cargo = self.cargo
        if   cargo == " "     : cargo = "   space"
        elif cargo == "\n"    : cargo = "   newline"
        elif cargo == "\t"    : cargo = "   tab"
        elif cargo == ENDMARK : cargo = "   eof"

        return (
              str(self.lineIndex).rjust(6)
            + str(self.colIndex).rjust(4)
            + "  "
            + cargo
            )
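To check my understanding of the Character class, here is a tiny self-test. The class body is my own condensed copy of the one above (so the snippet runs on its own); the behavior it exercises is exactly the __str__ formatting just shown:

```python
ENDMARK = "\0"   # the end-of-source marker defined earlier

class Character:
    """Condensed copy of the Character class above, for a quick demo."""
    def __init__(self, c, lineIndex, colIndex, sourceIndex, sourceText):
        self.cargo       = c
        self.lineIndex   = lineIndex
        self.colIndex    = colIndex
        self.sourceIndex = sourceIndex
        self.sourceText  = sourceText

    def __str__(self):
        cargo = self.cargo
        if   cargo == " "     : cargo = "   space"
        elif cargo == "\n"    : cargo = "   newline"
        elif cargo == "\t"    : cargo = "   tab"
        elif cargo == ENDMARK : cargo = "   eof"
        return str(self.lineIndex).rjust(6) + str(self.colIndex).rjust(4) + "  " + cargo

print(Character("x", 0, 5, 5, "demo x"))      # line 0, column 5, the character "x"
print(Character("\n", 0, 6, 6, "demo x\n"))   # the newline prints as the word "newline"
```

Unprintable characters (space, newline, tab, the end marker) come back as words, which is what makes the scanner listing later in this article readable.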


4.2 The Scanner Class

The second class is a Scanner class. When we instantiate this class, we will pass the constructor a string containing the source text. The result will be a scanner object that is ready to scan that particular string of source text. [SourceCode]

from genericCharacter import *

"""A Scanner object reads through the sourceTextand returns one character at a time."""#-------------------------------------------------------------------##-------------------------------------------------------------------def initialize(sourceTextArg): global sourceText, lastIndex, sourceIndex, lineIndex, colIndex sourceText = sourceTextArg lastIndex = len(sourceText) - 1 sourceIndex = -1 lineIndex = 0 colIndex = -1

#-------------------------------------------------------------------##-------------------------------------------------------------------def get(): """ Return the next character in sourceText. """ global lastIndex, sourceIndex, lineIndex, colIndex

sourceIndex += 1 # increment the index in sourceText

# maintain the line count if sourceIndex > 0: if sourceText[sourceIndex - 1] == "\n": #------------------------------------------------------- # The previous character in sourceText was a newline # character. So... we're starting a new line. # Increment lineIndex and reset colIndex. #------------------------------------------------------- lineIndex +=1 colIndex = -1

colIndex += 1

if sourceIndex > lastIndex: # We've read past the end of sourceText. # Return the ENDMARK character. char = Character(ENDMARK, lineIndex, colIndex, sourceIndex,sourceText) else: c = sourceText[sourceIndex] char = Character(c, lineIndex, colIndex, sourceIndex, sourceText)

return char
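The same line/column bookkeeping can also be written as a Python generator. This is my own alternative sketch of the scanner's job, not the article's code, but it produces the same line and column indices as the get() function above:

```python
ENDMARK = "\0"   # end-of-source marker, as in the article

def scan(sourceText):
    """Yield (char, lineIndex, colIndex) for each character, then ENDMARK."""
    lineIndex, colIndex = 0, 0
    for c in sourceText:
        yield (c, lineIndex, colIndex)
        if c == "\n":          # the character we just handed out ended a line,
            lineIndex += 1     # so the NEXT character starts a new line
            colIndex   = 0
        else:
            colIndex += 1
    yield (ENDMARK, lineIndex, colIndex)

print(list(scan("ab\nc")))
```

For the input "ab\nc" this yields a(0,0), b(0,1), newline(0,2), c(1,0), and finally the ENDMARK at (1,1) — note that the newline itself still belongs to the line it terminates, exactly as in the class-based scanner.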


4.3 A driver for the scanner


Having created the scanner machinery, let's put it through its paces.

I create a driver program that sets up a string of source text, creates a scanner object to scan that source text, and then displays the characters that it gets back from the scanner. [SourceCode]

writeln("Here are the characters returned by the scanner:") writeln(" line col character")

# create a scanner (an instance of the Scanner class) scanner.initialize(sourceText)

#------------------------------------------------------------------ # Call the scanner's get() method repeatedly # to get the characters in the sourceText. # Stop when we reach the ENDMARK. #------------------------------------------------------------------ character = scanner.get() # getfirst Character object from the scanner while True: writeln(character) if character.cargo == scanner.ENDMARK: break character = scanner.get() # getnext

4.4 A bit of source code for the scanner to process

We specify the contents of the source text:

/* PROGRAM NAME: nxx1.txt

nxx is a simple programming language that provides:
 numbers
 strings
 assignment statements
 string concatenation
 simple arithmetic operations
 print capability
 
comments may be enclosed in slash+asterisk .. asterisk+slash
*/
alpha = 16 ;
beta = 2   ;
resultName = "delta" ;
delta = alpha / beta ;
print "Value of " || resultName || " is: " ;
print delta ;
print "\n" ;

When we run the scanner driver program, it produces the following results.

Here are the characters returned by the scanner:
  line col  character
     0   0  /
     0   1  *
     0   2  space
     0   3  P
     0   4  R
     0   5  O
     0   6  G
     0   7  R
     0   8  A
     0   9  M
     0  10  space
     0  11  N
     0  12  A
     0  13  M
     0  14  E
     0  15  :
     0  16  space
     0  17  n
     0  18  x
     0  19  x
     0  20  1
     0  21  .
     0  22  t
     0  23  x
     0  24  t
     0  25  newline
     1   0  newline
     2   0  n
     2   1  x
     2   2  x
     2   3  space
     2   4  i
     2   5  s
     2   6  space
     2   7  a
     2   8  space
     2   9  s
     2  10  i
     2  11  m
     2  12  p
     2  13  l
     2  14  e
     2  15  space
     2  16  p
     2  17  r
     2  18  o
     2  19  g
     2  20  r
     2  21  a
     2  22  m
     2  23  m
     2  24  i
     2  25  n
     2  26  g
     2  27  space
     2  28  l
     2  29  a
     2  30  n
     2  31  g
     2  32  u
     2  33  a
     2  34  g
     2  35  e
     2  36  space
     2  37  t
     2  38  h
     2  39  a
     2  40  t
     2  41  space
     2  42  p
     2  43  r
     2  44  o
     2  45  v
     2  46  i
     2  47  d
     2  48  e
     2  49  s
     2  50  :
     2  51  newline
     3   0  space
     3   1  n
     3   2  u
     3   3  m
     3   4  b
     3   5  e
     3   6  r
     3   7  s
     3   8  newline
     4   0  space
     4   1  s
     4   2  t
     4   3  r
     4   4  i
     4   5  n
     4   6  g
     4   7  s
     4   8  newline
     5   0  space
     5   1  a
     5   2  s
     5   3  s
     5   4  i
     5   5  g
     5   6  n
     5   7  m
     5   8  e
     5   9  n
     5  10  t
     5  11  space
     5  12  s
     5  13  t
     5  14  a
     5  15  t
     5  16  e
     5  17  m
     5  18  e
     5  19  n
     5  20  t
     5  21  s
     5  22  newline
     6   0  space
     6   1  s
     6   2  t
     6   3  r
     6   4  i
     6   5  n
     6   6  g
     6   7  space
     6   8  c
     6   9  o
     6  10  n
     6  11  c
     6  12  a
     6  13  t
     6  14  e
     6  15  n
     6  16  a
     6  17  t
     6  18  i
     6  19  o
     6  20  n
     6  21  newline
     7   0  space
     7   1  s
     7   2  i
     7   3  m
     7   4  p
     7   5  l
     7   6  e
     7   7  space
     7   8  a
     7   9  r
     7  10  i
     7  11  t
     7  12  h
     7  13  m
     7  14  e
     7  15  t
     7  16  i
     7  17  c
     7  18  space
     7  19  o
     7  20  p
     7  21  e
     7  22  r
     7  23  a
     7  24  t
     7  25  i
     7  26  o
     7  27  n
     7  28  s
     7  29  newline
     8   0  space
     8   1  p
     8   2  r
     8   3  i
     8   4  n
     8   5  t
     8   6  space
     8   7  c
     8   8  a
     8   9  p
     8  10  a
     8  11  b
     8  12  i
     8  13  l
     8  14  i
     8  15  t
     8  16  y
     8  17  newline
     9   0  space
     9   1  newline
    10   0  c
    10   1  o
    10   2  m
    10   3  m
    10   4  e
    10   5  n
    10   6  t
    10   7  s
    10   8  space
    10   9  m
    10  10  a
    10  11  y
    10  12  space
    10  13  b
    10  14  e
    10  15  space
    10  16  e
    10  17  n
    10  18  c
    10  19  l
    10  20  o
    10  21  s
    10  22  e
    10  23  d
    10  24  space
    10  25  i
    10  26  n
    10  27  space
    10  28  s
    10  29  l
    10  30  a
    10  31  s
    10  32  h
    10  33  +
    10  34  a
    10  35  s
    10  36  t
    10  37  e
    10  38  r
    10  39  i
    10  40  s
    10  41  k
    10  42  space
    10  43  .
    10  44  .
    10  45  space
    10  46  a
    10  47  s
    10  48  t
    10  49  e
    10  50  r
    10  51  i
    10  52  s
    10  53  k
    10  54  +
    10  55  s
    10  56  l
    10  57  a
    10  58  s
    10  59  h
    10  60  newline
    11   0  *
    11   1  /
    11   2  newline
    12   0  a
    12   1  l
    12   2  p
    12   3  h
    12   4  a
    12   5  space
    12   6  =
    12   7  space
    12   8  1
    12   9  6
    12  10  space
    12  11  ;
    12  12  newline
    13   0  b
    13   1  e
    13   2  t
    13   3  a
    13   4  space
    13   5  =
    13   6  space
    13   7  2
    13   8  space
    13   9  space
    13  10  space
    13  11  ;
    13  12  newline
    14   0  r
    14   1  e
    14   2  s
    14   3  u
    14   4  l
    14   5  t
    14   6  N
    14   7  a
    14   8  m
    14   9  e
    14  10  space
    14  11  =
    14  12  space
    14  13  "
    14  14  d
    14  15  e
    14  16  l
    14  17  t
    14  18  a
    14  19  "
    14  20  space
    14  21  ;
    14  22  newline
    15   0  d
    15   1  e
    15   2  l
    15   3  t
    15   4  a
    15   5  space
    15   6  =
    15   7  space
    15   8  a
    15   9  l
    15  10  p
    15  11  h
    15  12  a
    15  13  space
    15  14  /
    15  15  space
    15  16  b
    15  17  e
    15  18  t
    15  19  a
    15  20  space
    15  21  ;
    15  22  newline
    16   0  p
    16   1  r
    16   2  i
    16   3  n
    16   4  t
    16   5  space
    16   6  "
    16   7  V
    16   8  a
    16   9  l
    16  10  u
    16  11  e
    16  12  space
    16  13  o
    16  14  f
    16  15  space
    16  16  "
    16  17  space
    16  18  |
    16  19  |
    16  20  space
    16  21  r
    16  22  e
    16  23  s
    16  24  u
    16  25  l
    16  26  t
    16  27  N
    16  28  a
    16  29  m
    16  30  e
    16  31  space
    16  32  |
    16  33  |
    16  34  space
    16  35  "
    16  36  space
    16  37  i
    16  38  s
    16  39  :
    16  40  space
    16  41  "
    16  42  space
    16  43  ;
    16  44  newline
    17   0  p
    17   1  r
    17   2  i
    17   3  n
    17   4  t
    17   5  space
    17   6  d
    17   7  e
    17   8  l
    17   9  t
    17  10  a
    17  11  space
    17  12  ;
    17  13  newline
    18   0  p
    18   1  r
    18   2  i
    18   3  n
    18   4  t
    18   5  space
    18   6  "
    18   7  \
    18   8  n
    18   9  "
    18  10  space
    18  11  ;
    18  12  eof


5 The Lexer

A lexical analyser is also called a lexer or a tokenizer.

The lexer's job is to group the characters of the source file into chunks called tokens. (If the source text was written in a natural language (English, Spanish, French, etc.), the tokens would correspond to the words and punctuation marks in the text.) Each time the lexer is called, it calls the scanner (perhaps several times) to get as many characters as it needs in order to assemble the characters into a token. It determines the type of token that it has found (a string, a number, an identifier, a comment, etc.) and returns the token.

A scanner can be pretty much language-agnostic, but a lexer needs to have a precise specification for the language that it must tokenize. Suppose we want to process a language called nxx. Then the lexer needs to know the answers to questions like these about nxx:

What counts as whitespace?
How are strings (string literals) delimited — single quotes? double quotes? both? other?
What symbols and operators does the language support? For example: ( ) + - = ;
What are the rules governing the formation of names (identifiers)? Can names contain dashes? underscores?
What are the rules governing the formation of numbers (numeric literals)?
What are the rules for writing comments?

5.1 What is whitespace?

Most people think of whitespace as being any string that contains only instances of the space character, the tab character, and the NL (newline) character (or the carriage return + linefeed characters (CRLF) that Windows uses to indicate a newline).

For a lexer, a whitespace character is a character whose sole purpose is to delimit tokens. In a COBOL statement like this, for example:

MOVE STATE-ID TO HOLD-STATE-ID.

the spaces aren't significant in themselves, but without them the COBOL compiler would see:

MOVESTATE-IDTOHOLD-STATE-ID.

and wouldn't know what to make of it.

Whitespace characters are used by the lexer to detect the end of tokens, but generally are not passed on to the parser. Here is (a rough approximation of) the list of tokens that the lexer would pass to the COBOL parser. Note that the list does not include any whitespace tokens.

KEYWORD   MOVE
VARIABLE  STATE-ID
KEYWORD   TO
VARIABLE  HOLD-STATE-ID
SYMBOL    .


In many languages, the whitespace characters consist of the usual suspects: SPACE, TAB, NEWLINE. But consider a language in which each statement must exist on its own line. For such a language the NEWLINE character is not whitespace at all, but a token indicating the end of a statement, in the same way that a semi-colon does in Java and PL/I. Another example: Python uses indentation (tabs or spaces) rather than keywords or symbols (e.g. "do..end" or "{..}") to control scope. So for Python (in at least some contexts) spaces and tabs are not whitespace characters.

Another case in which the lexer would pass whitespace tokens back to its caller is if the calling module is making some modifications to the input text (for example, removing comments from the source code) but otherwise leaving the source text intact, whitespace and all.

6 The Rules for a Programming Language

At the very simplest there are three kinds of rules for a language:

tokenizing rules
syntactic rules
semantic rules

For the lexer, we will need some tokenizing rules.

6.1 Tokenizing Rules

The standard way to specify rules for recognizing tokens is via a finite state machine (FSM), also known as a deterministic finite automaton (DFA). So explanations of how to build a tokenizer often begin this way: Build a FSM that.... Or this way: Write a regular expression to match....

But in many cases, you don't need the power of a full FSM — the job can be done in a much simpler way. For example, most programming languages have simple rules for identifier tokens that can be expressed this way:

An identifier token can contain any of the following characters: <a list of characters goes here>

An identifier token must start with one of the following characters: <a list of characters goes here>

For example:

Identifiers can contain letters, numeric digits, and underscores, and must begin with a letter.
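That identifier rule really does need nothing more than two membership tests — no FSM required. A minimal sketch (the function name is mine, not from this article; character sets as in the rule above):

```python
import string

IDENTIFIER_STARTCHARS = string.ascii_letters
IDENTIFIER_CHARS      = string.ascii_letters + string.digits + "_"

def isIdentifier(text):
    """Apply the two rules: a legal start character, then only legal characters."""
    return (len(text) > 0
            and text[0] in IDENTIFIER_STARTCHARS
            and all(c in IDENTIFIER_CHARS for c in text))

print(isIdentifier("resultName"))   # True
print(isIdentifier("2fast"))        # False: starts with a digit
print(isIdentifier("HOLD-STATE"))   # False: contains a dash
```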


6.2 Writing a Lexer

So let's write a lexer.

First, we invent a simple language called nxx. We won't bother to define nxx very carefully -- we won't need to do that in order to write a simple lexer that can process identifiers and whitespace.

We will need to define at least the rudiments of our language. That means specifying the symbols: the assignment symbol, grouping symbols such as parentheses and brackets, symbols for mathematical operations (plus, minus, etc.), and so on. We also need to define the rules for tokenizing comments, string literals (strings), numeric literals (numbers), and identifiers. For example, for nxx we say that a string can be contained in either single or double quotes, an identifier can contain letters, numeric digits, and underscores, and must start with a letter, and so on. I defined these in a file called nxxSymbols.py.

6.3 Specifying keywords, symbols, token types, etc.

#----------------------------------------------------------
# a list of keywords
#----------------------------------------------------------
Keywords = """
if
then
else
elif
endif
while
loop
endloop
print
return
exit
"""
Keywords = Keywords.split()

#----------------------------------------------------------
# a list of symbols that are one character long
#----------------------------------------------------------
OneCharacterSymbols = """
=
( )
< >
/ * + -
! &
. ;
"""
OneCharacterSymbols = OneCharacterSymbols.split()

#----------------------------------------------------------
# a list of symbols that are two characters long
#----------------------------------------------------------
TwoCharacterSymbols = """
==
<=
>=
<>
!=
++
**
--
+=
-=
||
"""
TwoCharacterSymbols = TwoCharacterSymbols.split()

import string

IDENTIFIER_STARTCHARS = string.letters    # Python 2; in Python 3 this is string.ascii_letters
IDENTIFIER_CHARS      = string.letters + string.digits + "_"

NUMBER_STARTCHARS = string.digits
NUMBER_CHARS      = string.digits + "."

STRING_STARTCHARS = "'" + '"'
WHITESPACE_CHARS  = " \t\n"

#-----------------------------------------------------------------------
# TokenTypes for things other than symbols and keywords
#-----------------------------------------------------------------------
STRING     = "String"
IDENTIFIER = "Identifier"
NUMBER     = "Number"
WHITESPACE = "Whitespace"
COMMENT    = "Comment"
EOF        = "Eof"

Now we write two classes.

6.4 A Token class

The first class is a Token class that will wrap its cargo -- a string of characters that is the text of the token. In addition to holding its cargo, the token will hold information about the location of the token (actually, the location of its first character) in the source text. [SourceCode]

from genericScanner import *

class LexerError(Exception): pass

#-----------------------------------------------------------------------
#
#                   Token
#
#-----------------------------------------------------------------------
class Token:
    """
    A Token object is the kind of thing that the Lexer returns.
    It holds:
        - the text of the token (self.cargo)
        - the type of token that it is
        - the line number and column index where the token starts
    """

    #-------------------------------------------------------------------
    #
    #-------------------------------------------------------------------
    def __init__(self, startChar):
        """ The constructor of the Token class """
        self.cargo = startChar.cargo

        #----------------------------------------------------------
        # The token picks up information
        # about its location in the sourceText
        #----------------------------------------------------------
        self.sourceText = startChar.sourceText
        self.lineIndex  = startChar.lineIndex
        self.colIndex   = startChar.colIndex

        #----------------------------------------------------------
        # We won't know what kind of token we have until we have
        # finished processing all of the characters in the token.
        # So when we start, the token.type is None (aka null).
        #----------------------------------------------------------
        self.type = None

    #-------------------------------------------------------------------
    # return a displayable string representation of the token
    #-------------------------------------------------------------------
    def show(self, showLineNumbers=False, **kwargs):
        """
        align=True shows token type left justified with dot leaders.
        Specify align=False to turn this feature OFF.
        """
        align = kwargs.get("align", True)
        if align:
            tokenTypeLen = 12
            space = " "
        else:
            tokenTypeLen = 0
            space = ""

        if showLineNumbers:
            s = str(self.lineIndex).rjust(6) + str(self.colIndex).rjust(4) + "  "
        else:
            s = ""

        if self.type == self.cargo:
            s = s + "Symbol".ljust(tokenTypeLen, ".") + ":" + space + self.type
        elif self.type == "Whitespace":
            s = s + "Whitespace".ljust(tokenTypeLen, ".") + ":" + space + repr(self.cargo)
        else:
            s = s + self.type.ljust(tokenTypeLen, ".") + ":" + space + self.cargo
        return s

    guts = property(show)


6.5 A Lexer class

The second class is a Lexer class. When we instantiate this class, we will pass the constructor a string containing the source text. The lexer object will create a scanner object and pass the source text to it. Then the lexer will get characters from the scanner, which will get them from the source text. The result will be a lexer object that is ready to return the tokens in the source text. [SourceCode]


#-------------------------------------------------------------------
#
#-------------------------------------------------------------------
def initialize(sourceText):
    """ """
    global scanner

    # initialize the scanner with the sourceText
    scanner.initialize(sourceText)

    # use the scanner to read the first character from the sourceText
    getChar()

#-------------------------------------------------------------------
#
#-------------------------------------------------------------------
def get():
    """
    Construct and return the next token in the sourceText.
    """

    #--------------------------------------------------------------------------------
    # read past and ignore any whitespace characters or any comments -- START
    #--------------------------------------------------------------------------------
    while c1 in WHITESPACE_CHARS or c2 == "/*":

        # process whitespace
        while c1 in WHITESPACE_CHARS:
            token = Token(character)
            token.type = WHITESPACE
            getChar()

            while c1 in WHITESPACE_CHARS:
                token.cargo += c1
                getChar()

            # return token  # only if we want the lexer to return whitespace

        # process comments
        while c2 == "/*":
            # we found comment start
            token = Token(character)
            token.type = COMMENT
            token.cargo = c2

            getChar()  # read past the first  character of a 2-character token
            getChar()  # read past the second character of a 2-character token

            while not (c2 == "*/"):
                if c1 == ENDMARK:
                    token.abort("Found end of file before end of comment")
                token.cargo += c1
                getChar()

            token.cargo += c2  # append the */ to the token cargo

            getChar()  # read past the first  character of a 2-character token
            getChar()  # read past the second character of a 2-character token

            # return token  # only if we want the lexer to return comments
    #--------------------------------------------------------------------------------
    # read past and ignore any whitespace characters or any comments -- END
    #--------------------------------------------------------------------------------

    # Create a new token.  The token will pick up
    # its line and column information from the character.
    token = Token(character)

    if c1 == ENDMARK:
        token.type = EOF
        return token

    if c1 in IDENTIFIER_STARTCHARS:
        token.type = IDENTIFIER
        getChar()

        while c1 in IDENTIFIER_CHARS:
            token.cargo += c1
            getChar()

        if token.cargo in Keywords:
            token.type = token.cargo
        return token

    if c1 in NUMBER_STARTCHARS:
        token.type = NUMBER
        getChar()
        while c1 in NUMBER_CHARS:
            token.cargo += c1
            getChar()
        return token

    if c1 in STRING_STARTCHARS:
        # remember the quoteChar (single or double quote)
        # so we can look for the same character to terminate the quote.
        quoteChar = c1

        getChar()

        while c1 != quoteChar:
            if c1 == ENDMARK:
                token.abort("Found end of file before end of string literal")

            token.cargo += c1  # append quoted character to text
            getChar()

        token.cargo += c1  # append close quote to text
        getChar()
        token.type = STRING
        return token

    if c2 in TwoCharacterSymbols:
        token.cargo = c2
        token.type = token.cargo  # for symbols, the token type is same as the cargo
        getChar()  # read past the first  character of a 2-character token
        getChar()  # read past the second character of a 2-character token
        return token

    if c1 in OneCharacterSymbols:
        token.type = token.cargo  # for symbols, the token type is same as the cargo
        getChar()  # read past the symbol
        return token

    # else.... We have encountered something that we don't recognize.
    token.abort("I found a character or symbol that I do not recognize: " + dq(c1))
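For comparison, the same tokenizing rules can also be sketched with Python's re module. This is my own alternative sketch, not the article's lexer: it skips whitespace and comments, does not distinguish keywords from other identifiers, does not track line/column positions, and silently skips anything unrecognized instead of aborting:

```python
import re

# One named group per token type; earlier alternatives win, so comments
# are tried before the "/" and "*" symbols, and two-character symbols
# are tried before their one-character prefixes.
TOKEN_PATTERN = re.compile(r"""
      (?P<Comment>    /\*.*?\*/                                )
    | (?P<Whitespace> \s+                                      )
    | (?P<Number>     \d[\d.]*                                 )
    | (?P<String>     "[^"]*" | '[^']*'                        )
    | (?P<Identifier> [A-Za-z][A-Za-z0-9_]*                    )
    | (?P<Symbol>     ==|<=|>=|<>|!=|\+\+|\*\*|--|\+=|-=|\|\|
                      | [=()<>/*+\-!&.;]                       )
""", re.VERBOSE | re.DOTALL)

def tokens(sourceText):
    """Yield (tokenType, cargo) pairs, dropping whitespace and comments."""
    for m in TOKEN_PATTERN.finditer(sourceText):
        if m.lastgroup not in ("Whitespace", "Comment"):
            yield (m.lastgroup, m.group())

print(list(tokens('alpha = 16 ;')))
```

For the statement 'alpha = 16 ;' this yields an Identifier, a Symbol, a Number, and a Symbol, in that order. The hand-written lexer above does the same grouping work explicitly, one character at a time, which is exactly what makes it a better teaching vehicle.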


6.6 A driver for the lexer

Having created the lexer machinery, let's put it through its paces. We will create a driver program that sets up a string of source text, creates a lexer object to tokenize that source text, and then displays the tokens that it gets back from the lexer. [SourceCode]

Here is the core code of the driver program.

#-----------------------------------------------------------------------
#
#                   main
#
#-----------------------------------------------------------------------
def main(sourceText):
    global f
    f = open(outputFilename, "w")
    writeln("Here are the tokens returned by the lexer:")

    # create an instance of a lexer
    lexer.initialize(sourceText)

    #------------------------------------------------------------------
    # use the lexer.get() method repeatedly to get the tokens in
    # the sourceText.  Then print the tokens.
    #------------------------------------------------------------------
    while True:
        token = lexer.get()
        writeln(token.show(True))
        if token.type == EOF:
            break
    f.close()

6.7 Some source text to test the lexer

Here is the source text that we will pass to the lexer.

/* PROGRAM NAME: nxx1.txt

nxx is a simple programming language that provides: numbers strings assignment statements string concatenation simple arithmetic operations print capability comments may be enclosed in slash+asterisk .. asterisk+slash*/alpha = 16 ;beta = 2 ;resultName = "delta" ;delta = alpha / beta ;print "Value of " || resultName || " is: " ;


print delta ;
print "\n" ;


When we run the lexer on this source code, it produces the following list of tokens.

Here are the tokens returned by the lexer:
12  0 Identifier..: alpha
12  6 Symbol......: =
12  8 Number......: 16
12 11 Symbol......: ;
13  0 Identifier..: beta
13  5 Symbol......: =
13  7 Number......: 2
13 11 Symbol......: ;
14  0 Identifier..: resultName
14 11 Symbol......: =
14 13 String......: "delta"
14 21 Symbol......: ;
15  0 Identifier..: delta
15  6 Symbol......: =
15  8 Identifier..: alpha
15 14 Symbol......: /
15 16 Identifier..: beta
15 21 Symbol......: ;
16  0 Symbol......: print
16  6 String......: "Value of "
16 18 Symbol......: ||
16 21 Identifier..: resultName
16 32 Symbol......: ||
16 35 String......: " is: "
16 43 Symbol......: ;
17  0 Symbol......: print
17  6 Identifier..: delta
17 12 Symbol......: ;
18  0 Symbol......: print
18  6 String......: "\n"
18 11 Symbol......: ;
18 12 Eof.........: ?
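For comparison, a line-and-column-tagged token stream like this can also be sketched in a few lines of standalone Python with the standard re module. The pattern names below are illustrative, and this is not the article's lexer, which reads the source character by character:

```python
import re

# Token patterns, tried in order. These names and patterns are an
# assumption for illustration; the article's lexer works differently.
TOKEN_SPEC = [
    ("Number",     r"\d+"),
    ("Identifier", r"[A-Za-z_]\w*"),
    ("String",     r'"[^"]*"'),
    ("Symbol",     r"\|\||[=;/+*-]"),
    ("Whitespace", r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(line, lineIndex=12):
    """Yield (lineIndex, colIndex, type, cargo) tuples, echoing the
    shape of the lexer's output shown above."""
    for m in MASTER.finditer(line):
        if m.lastgroup != "Whitespace":
            yield (lineIndex, m.start(), m.lastgroup, m.group())

print(list(tokenize("alpha = 16 ;")))
```

Running it on the first statement of nxx1.txt reproduces the first four rows of the listing above.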


7 The Parser

7.1 Writing a Recursive Descent Parser

Parsing Techniques (first edition, 1990) by Dick Grune and Ceriel Jacobs is a great book about parsing techniques. This book is freely downloadable from http://www.cs.vu.nl/~dick/PTAPG.html.

I learned a lot from the book.

The first thing I tried to do was to implement (in Python) what they consider the best-ever recursive-descent parsing technique. It is explained on pp. 137-140 of their book. Unfortunately, the results were disappointing. I won't dispute their claim that this is a fabulous technique. But I found it clumsy and difficult to understand. I certainly couldn't just sit down and write a parser using the technique.

I had always heard that it wasn't difficult to implement a recursive-descent parser. But that technique was much too difficult (for me at least).

8 I achieve enlightenment (again)

While searching the Web for more information about recursive-descent parsers, I found the Wikipedia article on recursive-descent parsers, complete with EBNF and an example in C. And I could see that it is easy to write a recursive-descent parser. As the Wikipedia article puts it — "Notice how closely the [code] mirrors the grammar. There is a procedure for each nonterminal in the grammar."

Here is the Python translation of that code.

token = ""

ASSIGNMENT = ":="
BEGIN = "begin"
CALL = "call"
COMMA = ","
CONST = "const"
DO = "do"
END = "end"
EQ = "=="
GE = ">="
GT = ">"
IF = "if"
LE = "<="
LPAREN = "("
LT = "<"
MINUS = "-"
MULTIPLY = "*"
NE = "!="
ODD = "odd"
PERIOD = "."
PLUS = "+"
PROC = "proc"
RPAREN = ")"
SEMICOLON = ";"


SLASH = "/"
THEN = "then"
VAR = "var"
WHILE = "while"

IDENTIFIER = "IDENTIFIER"
NUMBER = "NUMBER"

class ParserException(Exception): pass

def error(msg):
    quotedToken = '"%s"' % token
    msg = msg + " while processing token " + quotedToken
    raise ParserException("\n\n" + msg)

def found(argToken):
    if (token == argToken):
        getToken()
        return True
    return False

def expect(argToken):
    if found(argToken):
        return  # no problem
    else:
        quotedToken = '"%s"' % argToken
        error("I was expecting to find token " + quotedToken
              + "\n  but I found something else")

#--------------------------------------------------
def factor():
    """
    factor = IDENTIFIER | NUMBER | "(" expression ")" .
    """
    if found(IDENTIFIER):
        pass
    elif found(NUMBER):
        pass
    elif found(LPAREN):
        expression()
        expect(RPAREN)
    else:
        error("factor: syntax error")
        getToken()

#--------------------------------------------------
def term():
    """
    term = factor {("*"|"/") factor} .
    """
    factor()
    while found(MULTIPLY) or found(SLASH):
        factor()

#--------------------------------------------------
def expression():


""" expression = ["+"|"-"] term {("+"|"-") term} . """ if found(PLUS) or found(MINUS): pass term() while found(PLUS) or found(MINUS): term()

#--------------------------------------------------
def condition():
    """
    condition =
        "odd" expression
        | expression ("="|"#"|"<"|"<="|">"|">=") expression .
    """
    if found(ODD):
        expression()
    else:
        expression()
        if (found(EQ) or found(NE) or found(LT)
                or found(LE) or found(GT) or found(GE)):
            expression()
        else:
            error("condition: found invalid operator")
            getToken()

#--------------------------------------------------
def statement():
    """
    statement =
        [IDENTIFIER ":=" expression
        | "call" IDENTIFIER
        | "begin" statement {";" statement} "end"
        | "if" condition "then" statement
        | "while" condition "do" statement
        ] .
    """
    if found(IDENTIFIER):
        expect(ASSIGNMENT)
        expression()

    elif found(CALL):
        expect(IDENTIFIER)

    elif found(BEGIN):
        statement()
        while found(SEMICOLON):
            statement()
        expect(END)

    elif found(IF):
        condition()
        expect(THEN)
        statement()


    elif found(WHILE):
        condition()
        expect(DO)
        statement()

#--------------------------------------------------
def block():
    """
    block =
        ["const" IDENTIFIER "=" NUMBER {"," IDENTIFIER "=" NUMBER} ";"]
        ["var" IDENTIFIER {"," IDENTIFIER} ";"]
        {"procedure" IDENTIFIER ";" block ";"} statement .
    """
    if found(CONST):
        expect(IDENTIFIER)
        expect(EQ)
        expect(NUMBER)

        while found(COMMA):
            expect(IDENTIFIER)
            expect(EQ)
            expect(NUMBER)
        expect(SEMICOLON)

    if found(VAR):
        expect(IDENTIFIER)
        while found(COMMA):
            expect(IDENTIFIER)
        expect(SEMICOLON)

    while found(PROC):
        expect(IDENTIFIER)
        expect(SEMICOLON)
        block()
        expect(SEMICOLON)

    statement()

#--------------------------------------------------
def program():
    """
    program = block "." .
    """
    getToken()
    block()
    expect(PERIOD)
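The one-procedure-per-nonterminal idea is easiest to appreciate in miniature. The sketch below is illustrative, not the article's PL/0 code: it covers only the expression/term/factor subset, reads tokens from a plain Python list instead of a lexer, and evaluates as it parses:

```python
# A miniature of the same idea: one function per nonterminal, for
#   expression = term {("+"|"-") term} .
#   term       = factor {("*"|"/") factor} .
#   factor     = NUMBER | "(" expression ")" .
# Illustrative sketch only; not the article's PL/0 parser.

def parse_expression(tokens):
    value, tokens = parse_term(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op, tokens = tokens[0], tokens[1:]
        right, tokens = parse_term(tokens)
        value = value + right if op == "+" else value - right
    return value, tokens

def parse_term(tokens):
    value, tokens = parse_factor(tokens)
    while tokens and tokens[0] in ("*", "/"):
        op, tokens = tokens[0], tokens[1:]
        right, tokens = parse_factor(tokens)
        value = value * right if op == "*" else value / right
    return value, tokens

def parse_factor(tokens):
    if tokens[0] == "(":
        value, tokens = parse_expression(tokens[1:])
        assert tokens[0] == ")", "expected closing parenthesis"
        return value, tokens[1:]
    return float(tokens[0]), tokens[1:]

value, rest = parse_expression(["2", "+", "3", "*", "(", "4", "-", "1", ")"])
print(value)   # 2 + 3*(4-1) = 11.0
```

Notice that, just as in the PL/0 translation, each function mirrors one production of the grammar, and the loops mirror the EBNF repetition braces.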

8.1 Generating an Abstract Syntax Tree (AST)

Many discussions of parsers omit the most important thing — that the purpose of a parser is to produce an abstract syntax tree, an AST. So here is some code for a node in an AST.


8.2 A Node class for the AST

class Node:
    def __init__(self, token=None):
        self.token = token
        self.level = 0
        self.children = []  # a list of my children

    def add(self, token):
        """ make a node out of a token and add it to self.children """
        self.addNode(Node(token))

    def addNode(self, node):
        """ add a node to self.children """
        node.level = self.level + 1
        self.children.append(node)

    def toString(self):
        s = "    " * self.level
        if self.token == None:
            s += "ROOT\n"
        else:
            s += self.token.cargo + "\n"
        for child in self.children:
            s += child.toString()
        return s
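To see what toString produces, here is a tiny self-contained exercise of the class. The FakeToken class is a hypothetical stand-in for the lexer's Token objects (toString only needs a .cargo attribute), and the Node class is repeated so the snippet runs on its own:

```python
class FakeToken:
    """Hypothetical stand-in for the lexer's Token objects."""
    def __init__(self, cargo):
        self.cargo = cargo

class Node:
    # the same Node class as above, repeated so this runs standalone
    def __init__(self, token=None):
        self.token = token
        self.level = 0
        self.children = []
    def add(self, token):
        self.addNode(Node(token))
    def addNode(self, node):
        node.level = self.level + 1
        self.children.append(node)
    def toString(self):
        s = "    " * self.level
        if self.token == None:
            s += "ROOT\n"
        else:
            s += self.token.cargo + "\n"
        for child in self.children:
            s += child.toString()
        return s

# build the tree for:  alpha = 16 ;
root = Node()                    # the ROOT node has no token
assign = Node(FakeToken("="))
root.addNode(assign)
assign.add(FakeToken("alpha"))
assign.add(FakeToken("16"))
print(root.toString())
```

Each level of the tree indents its children one step further, which is exactly the shape of the AST dump at the end of this article.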

8.3 A recursive descent parser

So here is a Python recursive descent parser for nxx.

"""A recursive descent parser for nxx1, as defined in nxx1ebnf.txt"""import nxxLexer as lexerfrom nxxSymbols import *from genericAstNode import Node

class ParserError(Exception): pass

def dq(s):
    return '"%s"' % s

token = None
verbose = False
indent = 0
numberOperator = ["+","-","/","*"]

#-------------------------------------------------------------------


#
#-------------------------------------------------------------------
def getToken():
    global token
    if verbose:
        if token:
            # print the current token, before we get the next one
            #print (" "*40 ) + token.show()
            print((" "*indent) + " (" + token.show(align=False) + ")")
    token = lexer.get()

#-------------------------------------------------------------------
# push and pop
#-------------------------------------------------------------------
def push(s):
    global indent
    indent += 1
    if verbose:
        print((" "*indent) + " " + s)

def pop(s):
    global indent
    if verbose:
        #print((" "*indent) + " " + s + ".end")
        pass
    indent -= 1

#-------------------------------------------------------------------
# decorator track0
#-------------------------------------------------------------------
def track0(func):
    def newfunc():
        push(func.__name__)
        func()
        pop(func.__name__)
    return newfunc

#-------------------------------------------------------------------
# decorator track
#-------------------------------------------------------------------
def track(func):
    def newfunc(node):
        push(func.__name__)
        func(node)
        pop(func.__name__)
    return newfunc

#-------------------------------------------------------------------
#
#-------------------------------------------------------------------
def error(msg):
    token.abort(msg)

#-------------------------------------------------------------------
# foundOneOf
#-------------------------------------------------------------------
def foundOneOf(argTokenTypes):
    """
    argTokenTypes should be a list of argTokenType


""" for argTokenType in argTokenTypes: #print "foundOneOf", argTokenType, token.type if token.type == argTokenType: return True return False

#-------------------------------------------------------------------
# found
#-------------------------------------------------------------------
def found(argTokenType):
    if token.type == argTokenType:
        return True
    return False

#-------------------------------------------------------------------
# consume
#-------------------------------------------------------------------
def consume(argTokenType):
    """
    Consume a token of a given type and get the next token.
    If the current token is NOT of the expected type, then
    raise an error.
    """
    if token.type == argTokenType:
        getToken()
    else:
        error("I was expecting to find "
              + dq(argTokenType)
              + " but I found "
              + token.show(align=False))

#-------------------------------------------------------------------
# parse
#-------------------------------------------------------------------
def parse(sourceText, **kwargs):
    global lexer, verbose
    verbose = kwargs.get("verbose", False)
    # create a Lexer object & pass it the sourceText
    lexer.initialize(sourceText)
    getToken()
    program()
    if verbose:
        print "~"*80
        print "Successful parse!"
        print "~"*80
    return ast

#--------------------------------------------------------
# program
#--------------------------------------------------------
@track0
def program():
    """
    program = statement {statement} EOF.
    """
    global ast
    node = Node()


    statement(node)
    while not found(EOF):
        statement(node)

    consume(EOF)
    ast = node

#--------------------------------------------------------
# statement
#--------------------------------------------------------
@track
def statement(node):
    """
    statement = printStatement | assignmentStatement .
    assignmentStatement = variable "=" expression ";".
    printStatement = "print" expression ";".
    """
    if found("print"):
        printStatement(node)
    else:
        assignmentStatement(node)

#--------------------------------------------------------
# expression
#--------------------------------------------------------
@track
def expression(node):
    """
    expression = stringExpression | numberExpression.

/* "||" is the concatenation operator, as in PL/I */stringExpression = (stringLiteral | variable) {"||" stringExpression}.numberExpression = (numberLiteral | variable) { numberOperator numberExpression}.numberOperator = "+" | "-" | "/" | "*" . """

    if found(STRING):
        stringLiteral(node)
        while found("||"):
            getToken()
            stringExpression(node)

    elif found(NUMBER):
        numberLiteral(node)
        while foundOneOf(numberOperator):
            node.add(token)
            getToken()
            numberExpression(node)
    else:
        node.add(token)
        consume(IDENTIFIER)

if found("||"): while found("||"): getToken() stringExpression(node) elif foundOneOf(numberOperator):


            while foundOneOf(numberOperator):
                node.add(token)
                getToken()
                numberExpression(node)

#--------------------------------------------------------
# assignmentStatement
#--------------------------------------------------------
@track
def assignmentStatement(node):
    """
    assignmentStatement = variable "=" expression ";".
    """
    identifierNode = Node(token)
    consume(IDENTIFIER)

    operatorNode = Node(token)
    consume("=")
    node.addNode(operatorNode)

    operatorNode.addNode(identifierNode)

    expression(operatorNode)
    consume(";")

#--------------------------------------------------------
# printStatement
#--------------------------------------------------------
@track
def printStatement(node):
    """
    printStatement = "print" expression ";".
    """
    statementNode = Node(token)
    consume("print")

    node.addNode(statementNode)

    expression(statementNode)

consume(";")

#--------------------------------------------------------
# stringExpression
#--------------------------------------------------------
@track
def stringExpression(node):
    """
    /* "||" is the concatenation operator, as in PL/I */
    stringExpression = (stringLiteral | variable) {"||" stringExpression}.
    """

    if found(STRING):
        node.add(token)
        getToken()

while found("||"): getToken() stringExpression(node) else:


        node.add(token)
        consume(IDENTIFIER)

while found("||"): getToken() stringExpression(node)

#--------------------------------------------------------
# numberExpression
#--------------------------------------------------------
@track
def numberExpression(node):
    """
    numberExpression = (numberLiteral | variable) {numberOperator numberExpression}.
    numberOperator = "+" | "-" | "/" | "*" .
    """
    if found(NUMBER):
        numberLiteral(node)
    else:
        node.add(token)
        consume(IDENTIFIER)

    while foundOneOf(numberOperator):
        node.add(token)
        getToken()
        numberExpression(node)

#--------------------------------------------------------
# stringLiteral
#--------------------------------------------------------
def stringLiteral(node):
    node.add(token)
    getToken()

#--------------------------------------------------------
# numberLiteral
#--------------------------------------------------------
def numberLiteral(node):
    node.add(token)
    getToken()
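The push/pop/track machinery in this parser exists purely for tracing. A stripped-down sketch (illustrative, not the article's code: it collects the trace in a list instead of printing, and folds push/pop into the decorator) shows the kind of indented call trace the real version prints when verbose is on:

```python
# A stripped-down version of the tracing decorators above,
# collecting the trace in a list instead of printing (illustration only).

trace = []
indent = 0

def track(func):
    def newfunc(*args):
        global indent
        indent += 1
        trace.append(("  " * indent) + func.__name__)  # like push()
        result = func(*args)
        indent -= 1                                    # like pop()
        return result
    return newfunc

@track
def statement():
    expression()

@track
def expression():
    pass

statement()
print(trace)
```

Because each grammar procedure is wrapped, the nesting of the trace mirrors the nesting of the parse, which makes it easy to see which nonterminal the parser was working on when something goes wrong.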

8.4 A driver program for the recursive descent parser

Here is the code for its driver.

import nxxParser as parser

#-------------------------------------------------
# support for writing output to a file
#-------------------------------------------------
def writeln(*args):
    for arg in args:
        f.write(str(arg))
    f.write("\n")

if __name__ == "__main__":


    outputFilename = "output\\nxxParserDriver_output.txt"
    sourceFilename = "input\\nxx1.txt"
    sourceText = open(sourceFilename).read()
    ast = parser.parse(sourceText, verbose=False)
    print "~"*80
    print "Here is the abstract syntax tree:"
    print "~"*80
    f = open(outputFilename, "w")
    f.write(ast.toString())
    f.close()
    print(open(outputFilename).read())


8.5 The AST it generated

And here is the output of the nxx parser when run on nxx1.txt. This is (as near as I can tell) pretty close to an AST.

ROOT
    =
        alpha
        16
    =
        beta
        2
    =
        resultName
        "delta"
    =
        delta
        alpha
        /
        beta
    print
        "Value of "
        resultName
        " is: "
    print
        delta
    print
        "\n"
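The article stops at the AST, but the AST is exactly where an interpreter or compiler would begin: walk the tree, evaluate each "=" node, execute each print node. Here is an illustrative sketch of such a walk over a simplified nested-list rendering of the assignment part of the tree above; the representation and the evaluator are mine, not part of the article's code:

```python
# An illustrative next step: interpret the AST by walking it.
# The tree below is a simplified nested-list rendering of the
# assignment nodes shown above (hypothetical representation).

ast = [
    ["=", "alpha", ["16"]],
    ["=", "beta", ["2"]],
    ["=", "delta", ["alpha", "/", "beta"]],
]

def operand(text, env):
    """A number literal evaluates to itself; a variable is looked up."""
    return float(text) if text.isdigit() else env[text]

def evaluate(expr, env):
    """Evaluate a flat left-to-right list of operands and operators,
    the same flat shape the parser gives an expression's children."""
    value = operand(expr[0], env)
    i = 1
    while i < len(expr):
        op, right = expr[i], operand(expr[i + 1], env)
        if op == "/":   value = value / right
        elif op == "*": value = value * right
        elif op == "+": value = value + right
        else:           value = value - right
        i += 2
    return value

env = {}
for _, name, expr in ast:
    env[name] = evaluate(expr, env)

print(env["delta"])   # alpha / beta = 16 / 2 = 8.0
```

A real interpreter would also handle the print nodes and string concatenation, but even this small walk shows why the AST, not the token stream, is the natural input for the back end of a compiler.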

Download this zip file to obtain the source code of files discussed in this article.

End of this article/web page.