Introduction to PEG (Parsing Expression Grammar) in Python

Introduction to packrat parsing for PEGs (Parsing Expression Grammars)

Gavin Bong
PyCon APAC 2011, Singapore

Wednesday, 8 June 2011, 21:29

roadmap

Motivation, PEG theory, pyparsing, PyMeta, PyPy rlib/parsing, Closing (34 mins in total)

motivation

Natural languages

Mini languages (DSLs) Structured / unstructured file formats

4 thoughts:
i. Aren't structured formats like JSON, XML, HTML well served by existing parsers?
ii. Parsing log files & configuration files is easy with Python.
iii. Regular expressions are good enough.


How to parse texts with PEGs NLTK

iv. What is wrong with the classical way of writing parsers ?

CFG (Context Free Grammars)

In formal language theory, CFG is suitable for modeling both natural & computer languages.


BNF is the de facto notation for describing the syntax of CFGs.

if_stmt ::= "if" expression ":" suite ( "elif" expression ":" suite )* [ "else" ":" suite ]

EBNF

Constructs: sequence, decision (choice), repetition, recursion.
Original BNF only supported recursion, so repetition had to be written recursively:

S → aS
S → ε

CFG & Ambiguity

CFG grammars are potentially ambiguous.

Dangling else problem

if( x > 5 )
if( y > 5 )
console.log("heaven");
else console.log("limbo");

[Figure: AST #1, one of the two possible parse trees for the dangling else code, built from IfExp, Comp, Name, Num, Ops, Str and Log nodes with test, body, orelse and values edges.]

CFG & Ambiguity (2)

[Figure: AST #2, the other possible parse tree for the same code, built from the same node types.]

Definitions

Parse trees vs AST

Parse tree = concrete (whitespace, braces, semicolons)
           = nodes are nonterminals from grammar
AST        = abstract
           = uses tree nodes specific to language constructs

Top-Down vs Bottom-up

Top-down   = begin with start nonterminal.
           = work down the parse tree.
Bottom-up  = identify terminals
           = infer nonterminals
           = climb the parse tree.

Definitions (2)

Recursive descent parsing

* A top-down parser constructed from recursive functions.
* Each function represents a rule in the grammar.

version ::= <digit> '.' <digit>
digit ::= '0' | '1' ... | '9'

def version( source, position=0 ):
    digit( source, position )
    period( source, position + 1 )
    digit( source, position + 2 )

Run (pymeta) nose --nocapture -v test_rdp_list.py

Recursive Descent Parsing

import string

def digit(source, position):
    # this_rule() is a helper from the talk's source (not shown); it supplies the error message
    fn = (lambda t: t in string.digits, this_rule())
    expect(source, position, fn)

def expect(source, position, comparator):
    try:
        expecting, msg = comparator
        if not expecting(source[0]):
            raise ParseError(position, msg)
        source.popleft()  # consume !
    except IndexError:
        raise EOFError(position)

def period(source, position):
    fn = (lambda t: t == '.', this_rule())
    expect(source, position, fn)

Recursive Descent Parsing (2)

>>> import collections

>>> version(collections.deque('1.6'))

>>> version(collections.deque('1,6'))
ParseError: (1, 'expected <period>')

>>> version(collections.deque('1.'))
EOFError: (2, [('message', 'end of input')])

>>> version(collections.deque('A.6'))
ParseError: (0, 'expected <digit>')
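The slides rely on helpers that are not shown (this_rule, ParseError and a custom end-of-input error). The following is a minimal self-contained sketch of the same recursive descent parser, under the assumption that those helpers only carry a position and a message. The names EndOfInput and rule_name are hypothetical and not taken from the talk.

```python
import collections
import string


class ParseError(Exception):
    """Raised when the next character does not satisfy a rule."""


class EndOfInput(Exception):
    """Hypothetical stand-in for the talk's end-of-input error."""


def expect(source, position, predicate, rule_name):
    """Consume one character from the deque if the predicate accepts it."""
    if not source:
        raise EndOfInput(position, 'end of input')
    if not predicate(source[0]):
        raise ParseError(position, 'expected <%s>' % rule_name)
    return source.popleft()


def digit(source, position):
    # digit ::= '0' | '1' ... | '9'
    return expect(source, position, lambda t: t in string.digits, 'digit')


def period(source, position):
    return expect(source, position, lambda t: t == '.', 'period')


def version(source, position=0):
    # version ::= <digit> '.' <digit>
    major = digit(source, position)
    period(source, position + 1)
    minor = digit(source, position + 2)
    return int(major), int(minor)


print(version(collections.deque('1.6')))  # (1, 6)
```

Each grammar rule maps to one function, and the deque is consumed from the left as rules succeed, which is the essence of the technique.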

Classical method of parsing

Specific to LALR(1) bottom-up parsers

1. Flesh out a grammar in BNF

2. Lexical analysis phase

lexer ( patterns, stream-of-characters) => stream of tokens

3. Parsing phase

parser ( grammar, stream-of-tokens) => parse tree / AST

4. Use your parser
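To make the lexical analysis phase concrete, here is a rough sketch of a regex-driven lexer built with the standard re module. The token names and patterns (NUMBER, NAME, OP, SKIP) are assumptions chosen for illustration; they are not from the talk, and real lex/flex specifications are considerably richer.

```python
import re

# Hypothetical token patterns; order matters when patterns overlap.
TOKEN_PATTERNS = [
    ('NUMBER', r'\d+'),
    ('NAME', r'[A-Za-z_]\w*'),
    ('OP', r'[-+*/()=]'),
    ('SKIP', r'\s+'),
]
MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_PATTERNS))


def lexer(text):
    """lexer(patterns, stream-of-characters) => stream of tokens."""
    tokens = []
    position = 0
    while position < len(text):
        match = MASTER.match(text, position)
        if match is None:
            raise SyntaxError('unexpected character %r at %d'
                              % (text[position], position))
        if match.lastgroup != 'SKIP':  # drop whitespace tokens
            tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens


print(lexer('x = 3 + 42'))
```

A parser in the classical pipeline would then consume this token list instead of raw characters. PEG parsers, being scanner-less, skip this separate phase.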

Photo attribution: http://www.flickr.com/photos/j_aroche/2160902499/

Spectrum of parsing solutions

Regex

Lex / Yacc parser generators (GNU flex/bison)

PEG parsers

Handwritten Recursive Descent Parsers

ANTLR


Other python parsing toolkits

PLY

funcparserlib

Yapps

http://wiki.python.org/moin/LanguageParsing

PEG

Scanner-less

Formalized by Bryan Ford in 2002-2004

Grammar mimics a recursive descent parser (+ backtracking).

A PEG grammar consists of a set of parsing expressions of the form: A → e
One expression is denoted the starting expression.

e1 / e2     Ordered choice
e1 e2       Sequence
e+ e? e*    Repetition
&e !e       Predicates

PEG != EBNF

PEG's ordered choice

S → “Hitch” / “Hitchens”

Q. Given an input string of “Hitchens”, what is the result of the parse?

Law #1: Given an input of A, the parsing expression matches a prefix A' of A or fails.
Law #2: A rule S → M / N will first try to parse an M. If that fails, it backtracks & looks for N.

Answer: Hitch
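The two laws can be illustrated with a tiny hand-rolled sketch. The helper names literal and ordered_choice are hypothetical and do not come from any library; each parser returns the end position of the matched prefix, or None on failure.

```python
def literal(text):
    """Parser that matches `text` as a prefix of the input at `pos`."""
    def parse(source, pos):
        if source.startswith(text, pos):
            return pos + len(text)
        return None
    return parse


def ordered_choice(*alternatives):
    """PEG e1 / e2: try alternatives in order, commit to the first success."""
    def parse(source, pos):
        for alternative in alternatives:
            end = alternative(source, pos)
            if end is not None:
                return end
        return None
    return parse


S = ordered_choice(literal('Hitch'), literal('Hitchens'))
end = S('Hitchens', 0)
print('Hitchens'[:end])  # Hitch
```

Swapping the alternatives, so that the longer literal comes first, makes the same input match in full.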

PEG vs CFG

                                     PEG               CFG
Handles ambiguous grammars           No                Yes
Syntax definition philosophy         Analytical        Generative
Requires a lexical analysis phase?   No                Yes (lex/yacc)
Choice (alternation)                 Ordered, e1/e2    Commutative
Left recursion *                     No                Yes

* Warth et al. Packrat parsers can support left recursion (2008)

PEG & Packrat parsing

Neotoma Cinerea

Context: recursive descent parsing with backtracking

Problem: an input substring might be re-parsed during backtracking.

Solution: memoization guarantees linear time performance.

grammar ::= AB | AC
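The re-parsing problem can be sketched for a grammar of the AB | AC shape. The concrete rules below (A ::= 'a'+, B ::= 'b', C ::= 'c') are assumptions made up for this illustration, and functools.lru_cache stands in for a real packrat table keyed on (rule, position).

```python
import functools


def parse(text, memoize):
    """Parse grammar ::= A 'b' / A 'c' with A ::= 'a'+.

    Returns (end_position_or_None, number_of_times_A_was_actually_parsed).
    """
    counter = {'A': 0}

    def rule_a(pos):
        counter['A'] += 1
        end = pos
        while end < len(text) and text[end] == 'a':
            end += 1
        return end if end > pos else None

    if memoize:
        # Cache results per start position, as a packrat parser would.
        rule_a = functools.lru_cache(maxsize=None)(rule_a)

    def literal(ch, pos):
        if pos is not None and text.startswith(ch, pos):
            return pos + 1
        return None

    end = literal('b', rule_a(0))      # first alternative: A B
    if end is None:
        end = literal('c', rule_a(0))  # backtrack: A C asks for A again
    return end, counter['A']


print(parse('aaac', memoize=False))  # (4, 2)
print(parse('aaac', memoize=True))   # (4, 1)
```

Without the cache, A is parsed twice from position 0; with it, the second request is answered from the table.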

Photo attribution: http://en.wikipedia.org/wiki/File:Neotoma_cinerea.jpg

case study #1 problem statement

Parse modern Japanese dates in various formats.

If the date parses successfully, convert it to its equivalent datetime.date instance.

case study #1 : The four ERAs

Era       Kanji   Period                      Emperor
HEISEI    平成    1989 Jan 8 - present        Akihito
SHOWA     昭和    1926 Dec 25 - 1989 Jan 7    Hirohito
TAISHOU   大正    1912 Jul 30 - 1926 Dec 24   Yoshihito
MEIJI     明治    1868 Sep 8 - 1912 Jul 29    Mutsuhito

case study #1 : liberties taken

1. No support for days-of-the-week tagged onto the end.

2. Numbers use western digits, not kanji.

3. Some eras have overlapping days. Ignore.

4. For 1st year of an era, no support for gannen.
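Independently of any parsing library, the conversion step can be sketched as below. This assumes the era start dates listed in the talk, treats year 1 as the era's first calendar year, and follows liberty 3 by not checking the end of an era. The names ERA_START and imperial_to_date are hypothetical.

```python
from datetime import date

# Era start dates as given in the talk.
ERA_START = {
    'HEISEI': date(1989, 1, 8),
    'SHOWA': date(1926, 12, 25),
    'TAISHOU': date(1912, 7, 30),
    'MEIJI': date(1868, 9, 8),
}


def imperial_to_date(era, era_year, month, day):
    """Convert an imperial date to datetime.date; raise ValueError if invalid."""
    start = ERA_START[era]
    if era_year < 1:
        raise ValueError('era years start at 1')
    # Year 1 of an era is the calendar year in which the era began.
    result = date(start.year + era_year - 1, month, day)
    if result < start:
        raise ValueError('date precedes the start of the %s era' % era)
    return result


print(imperial_to_date('HEISEI', 23, 6, 8))  # 2011-06-08
```

datetime.date itself rejects impossible month and day combinations with ValueError, so the sketch needs no extra calendar validation.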


case study #1 : initial attempt

from pyparsing import Literal, Word, nums

year = Literal( u'\u5e74' )
month = Literal( u'\u6708' )
day = Literal( u'\u65e5' )
heisei_era = Literal( u'\u5e73\u6210' )
integer = Word(nums)

Word(nums, exact=2)

case study #1 : initial attempt (2)

western_year = integer('yyyy') + year
imperial_year = heisei_era + western_year

day_spec = integer.setResultsName('dd') + day
month_spec = integer('mm') + month

year_spec = (imperial_year('imperial') | western_year('western'))
grammar = year_spec + month_spec + day_spec

case study #1 : initial attempt (3)

result = grammar.parseString(japanese_date)
print(result.dump())

pyparsing : introduction

Easy to use PEG-based text parser

Grammar definitions in python

Framework distributed as one file pyparsing.py

Runs on both Python 2.x & 3.x. Future releases after 1.5.x will focus on Python 3.x only.

Not classified as recursive descent !


pyparsing : framework overview

pyparsing & PEGs : correlation

PEG        pyparsing
e1 e2      e1 + e2 == And( [e1,e2] )
e1 / e2    e1 | e2 == MatchFirst( [e1,e2] )
e*         ZeroOrMore( e )
e+         OneOrMore( e )
e?         Optional( e )
&e         FollowedBy( e )
!e         ~e == NotAny( e )


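pyparsing : ordered choice

MatchFirst will short-circuit as soon as a match is found. Not commutative.

Avoid alternatives in which one literal is a leading substring of another: if the shorter literal is listed first, it shadows the longer one.

Keywords are different: Keyword matches whole words only, so this shadowing does not occur.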

pyparsing : backtracking

Or forces the parser to make an exhaustive search of the alternatives. (match longest)

Or might introduce ambiguities. No better than non-PEG parsers.

Tweak the order of alternatives & put most probable (e.g. frequency of occurrence) first. Avoids wasteful backtracking.



Ballon d'Or 2011 example

p1,p2,p3,p4,p5 = map(Literal,['ronaldo','messi', 'park-ji-sung', 'xavi','iniesta'])
first = p2 + p1 + p4
second = p2 + p1 + p5
third = p2 + p1 + p3

grammar = first | second | third

print(grammar.parseString( "messi ronaldo park-ji-sung" ))

pyparsing : left factored

p1,p2,p3,p4,p5 = map(Literal,['ronaldo','messi', 'park-ji-sung', 'xavi','iniesta'])
absolute_certainty = p2 + p1
too_close_to_call = p4 | p5 | p3

grammar = absolute_certainty + too_close_to_call
print(grammar.parseString( "messi ronaldo park-ji-sung" ))

pyparsing : packrat

Memoization must be manually turned on.

ParserElement.enablePackrat()

Caches:
a. ParseResults
b. Exceptions thrown

Caveat emptor: grammars whose parse actions have side effects do not always play well with memoization turned on.

run python select_parser.py

pyparsing : semantic actions

In pyparsing parlance, a ParserElement can have zero or more parse actions.

4 forms of parse actions:
fn(s,loc,toks)
fn(loc,toks)
fn(toks)
fn()

Usage:
ParserElement.setParseAction( *fn )
ParserElement.addParseAction( *fn )

Uses:
1. Perform validation (see ParseException)
2. Process the matched token(s) & modify them. Returning a value overwrites the matched token(s).
3. Annotate with custom types (corollary of #2)

case study #1 : Semantic action

integer = Word(nums).setParseAction( lambda t: int(t[0]))

All users of the integer expression will inherit the parse action.

def range_check(toks):
    month = int(toks[0])
    if month <= 0 or month >= 13:
        raise ParseException('month must be in range 1..12')

month_spec = integer('m').addParseAction(range_check) + month

Selective assignment of parse actions to copies:

integer.copy().addParseAction( .. )
integer( 'result_name' ).addParseAction( .. )

Show: japan_simple.py

case study #1 : test files

imperial.utf8, western.utf8

case study #1 : complete solution

Show: japan_dates.py

Demo:

@traceParseAction
def convert_kanji_year(toks):
    if 'imperial' in toks.keys():
        year = toks.imperial.yearZero + toks.imperial.yy
        toks['era'] = toks.imperial.type_
        toks['yyyy'] = year
    elif 'western' in toks.keys():
        year = toks.yyyy
    try:
        toks['modernDate'] = date(year, toks.mm, toks.dd)
    except ValueError as error:
        raise ParseException(error.args[0])

case study #2 problem statement

Parse Gmail search criteria.

Supports a tiny subset of the full grammar:

from : ( <sender> )

label : inbox    -label : sent

yyyyy    -yyyyy    “zzzzz”    -“zzzzz”

case study #2: example strings

label : sarawak -label : not-urgent

from : ( bruno manser )

from : ( [email protected] )

from : ( @swiss.org )

“penan injustice”

-logging


case study #2: email addresses

emailfull = Regex(r"(?P<user>[A-Za-z0-9._%+-]+)@(?P<hostname>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,4})")

emailpartial = Regex(r"@(?P<hostname>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,4})")


email = (emailpartial | emailfull)

squeeze = lambda t: ' '.join( t[0].split() )

name = ZeroOrMore(Word(alphanums + ' ')) .setParseAction( squeeze )

case study #2: email addresses

opener,closer,colon = map(Suppress,'():')

enclosed = email | name

nested = opener + enclosed + closer

grammar_email = Combine(Suppress('from') + colon + nested)


case study #2: email addresses

result = grammar_email.parseString( 'from:([email protected])' )
print(result.dump())

result = grammar.parseString( 'from:( Marco de Gasperi )')
print(result.dump())

Run: nosetests -v testFromTo.py

case study #2: labels

hyphen = Suppress('-')

label_rhs = delimitedList(Word(alphanums), delim='-', combine=True )

delimitedList( expr, delim, combine=True ) is equivalent to:
Combine( expr + ZeroOrMore( delim + expr ) )

label_include = Combine( Suppress('label') + colon + label_rhs )
label_exclude = Combine( hyphen + label_include )

label_all = MatchFirst([ label_exclude.setResultsName('labels.exclude', listAllMatches=True), label_include('labels.include*')])

grammar_label = ZeroOrMore( label_all )

pyparsing 1.5.6

GOAL: group the excluded and included labels into their own sub-lists. E.g. label : fukushima1 -label : aloo-gobi

case study #2: labels

result = grammar_label.parseString('-label:fukushima1 label:onagawa -label:aloo-gobi label:cheese-naan' )
print(result.dump())

Question. Will this grammar work if the user entered LABEL instead of label?

Answer. Use CaselessLiteral('label')

case study #2: search strings

GOAL: group the excluded and included search strings into their own sub-lists.

rumi - “ jack kerouac ”

key_single = Word(alphanums)
key_quoted = quotedString.setParseAction(removeQuotes)

key_included = key_quoted | key_single
key_excluded = Combine(hyphen + key_included)

key_all = MatchFirst( [key_excluded("key.exclude*"), key_included("key.include*")] )

grammar_key = ZeroOrMore( key_all )

result = grammar_key.parseString( ' -osama obama -"bin laden" "white house" ' )
print(result.dump())

Question. If the user entered single instead of double quotes, will it conform to the grammar?

Answer. Yes

case study #2: Final solution

Let's compose all the individual pieces together.

email_all = grammar_email('from*')

gmail = (ZeroOrMore(email_all | label_all | key_all) + Suppress(restOfLine))

nested = opener + Group(enclosed) + closer

result = gmail.parseString('love label:writing-tips "bird by bird" from:(Anne Lamott) -"dalai lama" -label:macchu-pichu from:([email protected]) -label:french-guiana -"epictetus" label:yoga "bugle podcast" label from:(@microsoft.com)')
print(result.dump())

case study #2: Final solution

['love', 'writing-tips', 'bird-by-bird', 'Anne Lamott', 'dalai lama', 'macchu-pichu', '@microsoft.com', '[email protected]', 'french-guiana', 'epictetus', 'yoga', 'bugle podcast', 'label']
- from: ['Anne Lamott','@microsoft.com', '[email protected]']
- key.exclude: ['dalai lama','epictetus']
- key.include: ['love', 'bird by bird', 'bugle podcast', 'label']
- labels.exclude: ['macchu-pichu', 'french-guiana']
- labels.include: ['writing-tips','yoga']


pyparsing: Recursion

A grammar is recursive when there exists a nonterminal which has itself in the right-hand side of the production rule.

number ::= digit rest
rest ::= digit rest | empty

digit = Word(nums,exact=1).setName('1-digit')

rest = Forward()
rest << Optional(digit + rest)

number = Combine(digit + rest, adjacent=False)('digit-list')

grammar = number.setParseAction(lambda t:int(t[0])) + Suppress(restOfLine)

Run

case study #3: binary tree

Parse parentheses notation for binary trees.

(nil,4,nil)
((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))

[Figure: binary tree with nodes 2, 3, 4, 5, 6, 7]

Convert it to list notation in Python.

case study #3: recursive solution

BNF

node ::= '(' node ',' number ',' node ')' | empty

Code

left, right, comma = map(Suppress, '(),')
empty = (CaselessLiteral('nil').setParseAction(replaceWith(None)))
tree = Forward()
value = Word(nums).setParseAction(lambda t:int(t[0]))

# bookend() is a helper from the talk's source (not shown here)
tree << ((left + Group(tree) + bookend(value) + Group(tree) + right) | empty)

Run

case study #3: recursive solution

Input:

((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))

Output:

[[[None],2,[[None],3,[None]]],4,[[[None],5,[[None],6,[None]]],7,[None]]]

How to fix it: Group(tree) wraps each lone None in its own list. Re-implement Group:

class TreeGroup(TokenConverter):
    def postParse(self, instring, loc, tokenlist):
        if len(tokenlist) == 1 and tokenlist[0] is None:
            return tokenlist
        else:
            return [tokenlist]
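For comparison, the same notation can be parsed with a small hand-written recursive descent parser that uses no library at all. This is only a sketch, assuming the input contains no whitespace and that nil is written in lower case. The function names parse_tree, _node and _expect are hypothetical.

```python
def _expect(text, pos, ch):
    """Check that `ch` is at `pos` and return the position after it."""
    if not text.startswith(ch, pos):
        raise ValueError('expected %r at position %d' % (ch, pos))
    return pos + len(ch)


def _node(text, pos):
    # node ::= '(' node ',' number ',' node ')' | 'nil'
    if text.startswith('nil', pos):
        return None, pos + 3
    pos = _expect(text, pos, '(')
    left, pos = _node(text, pos)
    pos = _expect(text, pos, ',')
    start = pos
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    if pos == start:
        raise ValueError('expected a number at position %d' % pos)
    value = int(text[start:pos])
    pos = _expect(text, pos, ',')
    right, pos = _node(text, pos)
    pos = _expect(text, pos, ')')
    return [left, value, right], pos


def parse_tree(text):
    """Return the tree as nested [left, value, right] lists."""
    tree, pos = _node(text, 0)
    if pos != len(text):
        raise ValueError('unexpected trailing input at position %d' % pos)
    return tree


print(parse_tree('(nil,4,nil)'))  # [None, 4, None]
```

Because each empty subtree is returned as a bare None, this version does not need the regrouping step that the pyparsing solution requires.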


pyparsing : left recursion

pyparsing does not support left recursion.

term ::= \d+
expr ::= expr + term | term

@raises(RecursiveGrammarException)
def test_left_recursion(self):
    expr.validate()

Run

If you parse with such a grammar anyway, pyparsing will raise a RuntimeError with message 'maximum recursion depth exceeded'.

Eliminate left recursion if you want it to work in pyparsing.
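One common way to eliminate the left recursion in expr ::= expr + term | term is to rewrite it as the iteration expr ::= term ('+' term)*. The sketch below shows that rewrite with plain regular expressions rather than pyparsing. The name parse_expr is hypothetical, and summing the terms is just one possible semantic action.

```python
import re

TERM = re.compile(r'\s*(\d+)')   # term ::= \d+
PLUS = re.compile(r'\s*\+')


def parse_expr(text):
    """expr ::= term ('+' term)* ; returns (sum_of_terms, end_position)."""
    match = TERM.match(text, 0)
    if match is None:
        raise ValueError('expected a term at position 0')
    total = int(match.group(1))
    pos = match.end()
    while True:
        plus = PLUS.match(text, pos)
        if plus is None:
            break  # no more '+ term' repetitions
        match = TERM.match(text, plus.end())
        if match is None:
            raise ValueError('expected a term at position %d' % plus.end())
        total += int(match.group(1))  # fold to the left, as expr + term would
        pos = match.end()
    return total, pos


print(parse_expr('1 + 20 + 300'))  # (321, 12)
```

The loop accumulates results from left to right, so the left-associative meaning of the original rule is preserved even though the grammar no longer recurses on its left edge.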

PyMeta : introduction

OMeta is a language prototyping system (PEG), implemented in several programming languages.

* Packrat memoization
* Grammar: BNF dialect (with host language snippets)
* Object-oriented: inheritance, overriding rules
* <anything> consumes one object from the input stream. (cf. regex)
* Built-in rules <letter> <digit> <letterOrDigit> <token '?'>

lowercase ::= <char_range 'a' 'z'>

def rule_lowercase(): # ..body..

PEGs & PyMeta

PEG        PyMeta
e1 e2      e1 e2
e1 / e2    e1 | e2
e*         e*
e+         e+
e?         e?

Syntactic predicates (unlimited lookahead)

&e         ~~e
!e         ~e

case study #1 : in PyMeta

Modest goals:
a) recognize western and Heisei imperial dates
b) read & parse both imperial.utf8 & western.utf8

Separate files:

common.py : Common rules & utilities

western_dates.py : Grammar to recognize western dates

era_heisei.py : Grammar to recognize heisei dates

japan_date_parser.py : Final grammar

case study #1 : in PyMeta pt A

common.py

from pymeta.grammar import OMeta

baseGrammar = r"""
# common literals for all ERAs
year ::= <token u'\u5E74'>
month ::= <token u'\u6708'>
day ::= <token u'\u65E5'>

range_num :min :max ::= <digit>+:m ?(int(join(m)) >= min and int(join(m)) <= max) => m
rest_of_line ::= <anything>* <token '\n'>? => None
empty_line ::= <spaces> <rest_of_line> => None
python_comment ::= <token '#'> <rest_of_line> => None
"""

JapanCommonParser = OMeta.makeGrammar(baseGrammar, globals(), "JapanCommonParser")

def join(x):
    return ''.join(x)

case study #1 : in PyMeta pt B

western_dates.py

westernGrammar = r"""
western ::= <spaces> <digit>+:y <year> <range_num 1 12>:m <month> <range_num 1 31>:d <day> <rest_of_line> => westernized( int(join(y)),int(join(m)), int(join(d)))

grammar ::= <python_comment> | <western>
"""

def westernized(yyyy, mm, dd):
    retval = JapanDate()
    retval['western'] = date(yyyy,mm,dd)
    return retval

WesternParser = JapanCommonParser.makeGrammar( westernGrammar, globals(), 'WesternParser')

case study #1 : in PyMeta pt C

era_heisei.py

era_heisei = Era('Heisei','Akihito', (u'\u5E73\u6210',u'\u337B'), startDate=date(1989,1,8))

def heisei_year_ok(yy):
    return (yy >= 1 and yy <= era_heisei.maxYearUnit)

def collect( yy, mm, dd ):
    retval = JapanDate()
    retval['imperial'] = date( era_heisei.yearZero + yy, mm, dd )
    retval['era'] = [ era_heisei.name, yy ]
    return retval

case study #1: in PyMeta pt C (2)

era_heisei.py (continued)

heiseiGrammar = r"""
hlong ::= <token u'\u5e73\u6210'>
hshort ::= <token u'\u337b'>

heisei ::= (<hlong> | <hshort>) <digit>+:y ?(heisei_year_ok(int(join(y)))) <year> <range_num 1 12>:m <month> <range_num 1 31>:d <day> <rest_of_line> => collect(int(join(y)),int(join(m)),int(join(d)))
"""

HeiseiParser = JapanCommonParser.makeGrammar(heiseiGrammar, globals(), 'HeiseiParser')

case study #1 : in PyMeta pt D

japan_date_parser.py

finalGrammar = r"""
# override 'grammar' in WesternParser
grammar ::= <super> | <heisei> | <empty_line>
"""

class BaseParser(HeiseiParser, WesternParser):
    pass

BaseParser.globals.update(WesternParser.globals)
BaseParser.globals.update(HeiseiParser.globals)

JapanDateParser = BaseParser.makeGrammar( finalGrammar, globals(), "JapanDateParser")

case study #1 : in PyMeta pt D (2)

japan_date_parser.py (continued)

def parse_file(filename):
    """ iterate through each line """
    .... snipped ...
    parser = JapanDateParser(line)
    result,error = parser.apply('grammar')
    .... snipped ...

results = parse_file('imperial.utf8')
results = parse_file('western.utf8')

Run

case study #1 : PyMeta output


PyMeta : Left Recursion

PyMeta can handle left recursion.

recursiveGrammar = r"""
num ::= <num>:n <digit>:d => n * 10 + d | <digit>
digit ::= :d ?((d>='0') & (d<='9')) => int(d)
"""

Run

Quiz. Is the following grammar equivalent?

num ::= <digit> | <num>:n <digit>:d => n * 10 + d

PyMeta : Matching objects

listGrammar = """
digit ::= :x ?(x.isdigit()) => int(x)
interp ::= [<digit>:x '+' <digit>:y] => x + y
"""

g = OMeta.makeGrammar(listGrammar, {})
parser = g( [['600','+','66']] )   # an iterable wrapping a python list
result,error = parser.apply('interp')

>>> result
666

>>> error
ParseError(2,[])

PyMeta : Matching objects (2)

import :i ::= <anything>:a ?(a.__class__ == Import) => 'import '+', '.join(import_match(a.names))

Object graph (e.g. tree)

The python rewriter project visits the AST created by the compiler module (python 2.x) & regenerates the python statement.

>>> import compiler
>>> print(compiler.parse('import ctypes'))
Module(None, Stmt([Import([('ctypes', None)])]))

pyparsing vs PyMeta

                                pyparsing                         PyMeta
Whitespace sensitive?           No. But turned on via             Yes. Use <spaces> rule
                                leaveWhitespace()                 to eat whitespace
Left recursion                  No                                Yes
Packrat memoization             Yes. Off by default.              Yes. Only no-arg rules
Operates on character streams   Yes                               Yes
Operates on object streams      No                                Yes
Syntactic predicates            Yes                               Yes
Semantic predicates             No (see parse actions)            Yes
Semantic actions                Yes                               Yes
Regex support                   Yes                               No

PyPy rlib/parsing


Library for generating tokenizers & parsers in RPython.

Consists of: regex / packrat parser

tree structure / EBNF parser

Sample JSON ebnf

NUMBER: "\-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][\+\-]?[0-9]+)?";
value: <STRING> | <NUMBER> | <object> | <array> | <"null"> | <"true"> | <"false">;
array: ["["] (value [","])* value ["]"];
entry: STRING [":"] value;

Resulting parse tree can be transformed or traversed with custom visitors. (dot)

Topics not covered

● Usage of syntactic predicates
● Parsing grammars of mathematical expressions in order to preserve operator precedence
● Handling indents/dedents in order to parse indentation-sensitive languages, e.g. coffeescript, python, haskell

Resources

pyparsing: http://pyparsing.wikispaces.com/

PyMeta: http://www.tinlizzie.org/ometa/

PyPy RPython parsing library: http://doc.pypy.org/en/latest/rlib.html

http://gitorious.org/python-decompiler/python_rewriter

https://github.com/marcua/tweeql

[email protected]
