introduction to peg (parsing expression grammar) in python
Explainer for PEG
Introduction to packrat parsing for PEGs (Parsing Expression Grammars)
Gavin Bong, PyCon APAC 2011, Singapore
Wednesday, 8 June 2011, 21:29
roadmap
Motivation (4 mins)
PEG theory (5 mins)
pyparsing (16 mins)
PyMeta (7 mins)
PyPy rlib/parsing (1 min)
Closing (1 min)
Total: 34 mins
motivation
How to parse texts with PEGs
Natural languages (NLTK)
Mini languages (DSLs)
Structured / unstructured file formats
4 thoughts:
i. Aren't structured formats like JSON, XML, HTML well served by existing parsers?
ii. Parsing log files & configuration files is easy with Python.
iii. Regular expressions are good enough.
iv. What is wrong with the classical way of writing parsers?
CFG (Context Free Grammars)
In formal language theory, CFG is suitable for modeling both natural & computer languages.
BNF is the defacto notation for describing syntax of CFGs.
if_stmt ::= "if" expression ":" suite ( "elif" expression ":" suite )* [ "else" ":" suite ]
EBNF
sequence, decision (choice), repetition, recursion
Original BNF only supported recursion, e.g.
S → aS
S → ε
CFG & Ambiguity
CFG grammars are potentially ambiguous.
Dangling else problem
if( x > 5 )
if( y > 5 )
console.log("heaven");
else console.log("limbo");
[Figure: AST #1, one possible parse of the dangling else. Two IfExp nodes with test (Comp of Name 'x' / Name 'y', Ops >, Num 5), body (Log of Str 'heaven') and orelse (Log of Str 'limbo') branches.]
CFG & Ambiguity (2)
[Figure: AST #2, the alternative parse of the same dangling else. The same IfExp, Comp, Log and Str nodes, with the orelse branch attached to the other IfExp.]
Definitions
Parse trees vs AST
Parse tree = concrete: whitespace, braces, semicolons
           = nodes are nonterminals from grammar
AST        = abstract
           = uses tree nodes specific to language constructs
Top-Down vs Bottom-up
Top-down   = begin with start nonterminal.
           = work down the parse tree.
Bottom-up  = identify terminals
           = infer nonterminals
           = climb the parse tree.
Definitions (2)
Recursive descent parsing
* A top-down parser constructed from recursive functions.
* Each function represents a rule in the grammar.
version ::= <digit> '.' <digit>
digit ::= '0' | '1' ... | '9'
def version( source, position=0 ):
    digit( source, position )
    period( source, position + 1 )
    digit( source, position + 2 )
Run (pymeta) nose --nocapture -v test_rdp_list.py
Recursive Descent Parsing
import string
# ParseError and this_rule() are helpers that are not shown on the slide.
def digit(source, position):
    fn = (lambda t: t in string.digits, this_rule())
    expect(source, position, fn)
def period(source, position):
    fn = (lambda t: t == '.', this_rule())
    expect(source, position, fn)
def expect(source, position, comparator):
    try:
        expecting, msg = comparator
        if not expecting(source[0]):
            raise ParseError(position, msg)
        source.popleft()  # consume!
    except IndexError:
        raise EOFError(position)
Recursive Descent Parsing (2)
>>> import collections
>>> version(collections.deque('1.6'))
>>> version(collections.deque('1,6'))
ParseError: (1, 'expected <period>')
>>> version(collections.deque('1.'))
EOFError: (2, [('message', 'end of input')])
>>> version(collections.deque('A.6'))
ParseError: (0, 'expected <digit>')
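The slides leave out ParseError and this_rule(), so the session above cannot be reproduced from the slide text alone. The following is a minimal self-contained sketch of the same recursive descent idea. The ParseError class, the rule_name argument (standing in for this_rule()) and the try_parse helper are assumptions made for illustration, not the author's code.

```python
import string
from collections import deque


class ParseError(Exception):
    """Raised when the next character does not satisfy the current rule."""


def expect(source, position, predicate, rule_name):
    # Fail on end of input, otherwise test the next character.
    if not source:
        raise EOFError(position)
    if not predicate(source[0]):
        raise ParseError(position, "expected <%s>" % rule_name)
    source.popleft()  # consume the matched character


def digit(source, position):
    expect(source, position, lambda ch: ch in string.digits, "digit")


def period(source, position):
    expect(source, position, lambda ch: ch == ".", "period")


def version(source, position=0):
    # version ::= <digit> '.' <digit>
    digit(source, position)
    period(source, position + 1)
    digit(source, position + 2)
    return True


def try_parse(text):
    """Hypothetical helper: return True or (exception name, args)."""
    try:
        return version(deque(text))
    except (ParseError, EOFError) as exc:
        return (type(exc).__name__, exc.args)


if __name__ == "__main__":
    for text in ("1.6", "1,6", "1.", "A.6"):
        print(text, try_parse(text))
```

Each grammar rule maps to one function, and a sequence is simply one call after another.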
Classical method of parsing
Specific to LALR(1) bottom-up parsers
1. Flesh out a grammar in BNF
2. Lexical analysis phase
lexer ( patterns, stream-of-characters ) => stream of tokens
3. Parsing phase
parser ( grammar, stream-of-tokens ) => parse tree / AST
4. Use your parser
Photo attribution: http://www.flickr.com/photos/j_aroche/2160902499/
Spectrum of parsing solutions
Regex
Lex / Yacc parser generators (GNU flex/bison)
PEG parsers
Handwritten Recursive Descent Parsers
ANTLR
Other python parsing toolkits
PLY
funcparserlib
Yapps
http://wiki.python.org/moin/LanguageParsing
PEG
Scanner-less
Formalized by Bryan Ford in 2002-2004
Grammar mimics a recursive descent parser (+ backtracking).
A PEG grammar consists of a set of parsing expressions of the form: A → e
One expression is denoted the starting expression.
e1 / e2      Ordered choice
e1 e2        Sequence
e+ e? e*     Repetition
&e !e        Predicates
PEG != EBNF
PEG's ordered choice
S → "Hitch" / "Hitchens"
Q. Given an input string of "Hitchens", what is the result of the parse?
Law #1: Given an input of A, the parsing expression matches a prefix A' of A or fails.
Law #2: A rule S → M / N will try to parse for an M. If that fails, backtrack & look for N.
Answer: Hitch
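The two laws can be checked with a few lines of plain Python. This is only a sketch of ordered choice over string prefixes. The helper names literal, ordered_choice and matched_prefix are made up for this example and do not belong to any library.

```python
def literal(text):
    """Return a parser that matches `text` as a prefix at position `pos`."""
    def parse(source, pos):
        if source.startswith(text, pos):
            return pos + len(text)   # position after the match
        return None                  # failure
    return parse


def ordered_choice(*alternatives):
    """PEG e1 / e2 / ...: the first alternative that succeeds wins."""
    def parse(source, pos):
        for alternative in alternatives:
            end = alternative(source, pos)
            if end is not None:
                return end           # later alternatives are never tried
        return None
    return parse


def matched_prefix(parser, source):
    end = parser(source, 0)
    return None if end is None else source[:end]


hitch_first = ordered_choice(literal("Hitch"), literal("Hitchens"))
hitchens_first = ordered_choice(literal("Hitchens"), literal("Hitch"))

if __name__ == "__main__":
    print(matched_prefix(hitch_first, "Hitchens"))
    print(matched_prefix(hitchens_first, "Hitchens"))
```

Swapping the alternatives changes the result, which is why the choice is called ordered and why the longer literal should normally come first.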
PEG vs CFG
                                     PEG               CFG
Handles ambiguous grammars           No                Yes
Syntax definition philosophy         Analytical        Generative
Requires a lexical analysis phase?   No                Yes (lex/yacc)
Choice / alternation                 Ordered (e1/e2)   Commutative
Left recursion *                     No                Yes
* Warth et al. Packrat parsers can support left recursion (2008)
PEG & Packrat parsing
Context: recursive descent parsing with backtracking
Problem: an input substring might be re-parsed during backtracking.
grammar ::= AB | AC
Solution: memoization guarantees linear time performance.
Neotoma cinerea
Photo attribution: http://en.wikipedia.org/wiki/File:Neotoma_cinerea.jpg
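The AB | AC grammar can be sketched in plain Python to show what memoization saves. Here A is assumed to be "one or more 'a'", while B and C are the single characters 'b' and 'c'. These rule bodies, and the make_parser helper, are choices made for the example only.

```python
def make_parser(memoize):
    """Build a parser for grammar ::= A B / A C and count work done in A."""
    calls = {"A": 0}   # how many times the body of rule A actually runs
    cache = {}         # (rule name, position) -> result; valid for ONE input

    def rule_a(text, pos):
        key = ("A", pos)
        if memoize and key in cache:
            return cache[key]
        calls["A"] += 1
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        result = end if end > pos else None
        if memoize:
            cache[key] = result
        return result

    def lit(ch):
        return lambda text, pos: pos + 1 if text[pos:pos + 1] == ch else None

    rule_b, rule_c = lit("b"), lit("c")

    def sequence(first, second, text, pos):
        mid = first(text, pos)
        return None if mid is None else second(text, mid)

    def grammar(text):
        # Try A B first; on failure backtrack to position 0 and try A C.
        end = sequence(rule_a, rule_b, text, 0)
        if end is None:
            end = sequence(rule_a, rule_c, text, 0)
        return end

    return grammar, calls


if __name__ == "__main__":
    for flag in (False, True):
        parse, calls = make_parser(memoize=flag)
        print(flag, parse("aaac"), calls["A"])
```

Because the cache is keyed on position only, a fresh parser (or a cleared cache) is needed for every new input string.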
case study #1 problem statement
Parse modern Japanese dates in various formats.
If the date parses successfully, convert it to its equivalent datetime.date instance.
case study #1 : The four ERAs
HEISEI  (平成)  1989 Jan 8 - present        Akihito
SHOWA   (昭和)  1926 Dec 25 - 1989 Jan 7    Hirohito
TAISHOU (大正)  1912 Jul 30 - 1926 Dec 24   Yoshihito
MEIJI   (明治)  1868 Sep 8 - 1912 Jul 29    Mutsuhito
case study #1 : liberties taken
1. No support for days-of-the-week tagged onto the end.
2. Numbers use western digits, not kanji.
3. Some eras have overlapping days. Ignore.
4. For 1st year of an era, no support for gannen.
case study #1 : initial attempt
from pyparsing import Literal, Word, nums
year = Literal( u'\u5e74' )
month = Literal( u'\u6708' )
day = Literal( u'\u65e5' )
heisei_era = Literal( u'\u5e73\u6210' )
integer = Word(nums)   # stricter alternative: Word(nums, exact=2)
case study #1 : initial attempt (2)
western_year = integer('yyyy') + year
imperial_year = heisei_era + western_year
day_spec = integer.setResultsName('dd') + day
month_spec = integer('mm') + month
year_spec = (imperial_year('imperial') | western_year('western'))
grammar = year_spec + month_spec + day_spec
case study #1 : initial attempt (3)
result = grammar.parseString(japanese_date)
print result.dump()
pyparsing : introduction
Easy to use PEG-based text parser
Grammar definitions in python
Framework distributed as one file pyparsing.py
Runs on both Python 2.x & 3.x. Future releases after 1.5.x will focus on Python 3.x only.
Not classified as recursive descent!
pyparsing : framework overview
pyparsing & PEGs : correlation
PEG        pyparsing
e1 e2      e1 + e2 == And( [e1, e2] )
e1 / e2    e1 | e2 == MatchFirst( [e1, e2] )
e*         ZeroOrMore( e )
e+         OneOrMore( e )
e?         Optional( e )
&e         FollowedBy( e )
!e         ~e == NotAny( e )
pyparsing : ordered choice
MatchFirst will short circuit as soon as a match is found. Not commutative.
Shadowing literals in which one is a substring of the other should be avoided.
Keywords are different: a Keyword only matches a whole word, so a shorter keyword does not shadow a longer one.
pyparsing : backtracking
Or (the ^ operator) forces the parser to make an exhaustive search of the alternatives and keep the longest match.
Or might introduce ambiguities. No better than non-PEG parsers.
With MatchFirst, tweak the order of alternatives & put the most probable (e.g. by frequency of occurrence) first. This avoids wasteful backtracking.
pyparsing : backtracking
Ballon d'Or 2011 example
p1,p2,p3,p4,p5 = map(Literal,['ronaldo','messi','park-ji-sung','xavi','iniesta'])
first = p2 + p1 + p4
second = p2 + p1 + p5
third = p2 + p1 + p3
grammar = first | second | third
print grammar.parseString( "messi ronaldo park-ji-sung" )
pyparsing : left factored
p1,p2,p3,p4,p5 = map(Literal,['ronaldo','messi','park-ji-sung','xavi','iniesta'])
absolute_certainty = p2 + p1
too_close_to_call = p4 | p5 | p3
grammar = absolute_certainty + too_close_to_call
print grammar.parseString( "messi ronaldo park-ji-sung" )
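To make the saving from left factoring concrete, the sketch below counts literal comparisons for both shapes of the grammar. It uses a few hand-written combinators over a list of words rather than pyparsing itself, so the counts illustrate the principle and are not a measurement of pyparsing. All helper names are invented for the example.

```python
attempts = []   # every literal comparison is recorded here


def literal(word):
    def parse(tokens, pos):
        attempts.append(word)
        if pos < len(tokens) and tokens[pos] == word:
            return pos + 1
        return None
    return parse


def seq(*parsers):
    def parse(tokens, pos):
        for parser in parsers:
            pos = parser(tokens, pos)
            if pos is None:
                return None
        return pos
    return parse


def choice(*parsers):
    def parse(tokens, pos):
        for parser in parsers:
            end = parser(tokens, pos)   # each alternative restarts at pos
            if end is not None:
                return end
        return None
    return parse


def count_attempts(grammar, tokens):
    """Return (end position, number of literal comparisons)."""
    del attempts[:]
    end = grammar(tokens, 0)
    return end, len(attempts)


ronaldo, messi, park, xavi, iniesta = map(
    literal, ["ronaldo", "messi", "park-ji-sung", "xavi", "iniesta"])

unfactored = choice(seq(messi, ronaldo, xavi),
                    seq(messi, ronaldo, iniesta),
                    seq(messi, ronaldo, park))
factored = seq(messi, ronaldo, choice(xavi, iniesta, park))

tokens = "messi ronaldo park-ji-sung".split()

if __name__ == "__main__":
    print(count_attempts(unfactored, tokens))
    print(count_attempts(factored, tokens))
```

The unfactored grammar re-matches the common prefix for every alternative, while the factored one matches it once.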
pyparsing : packrat
Memoization must be manually turned on.
ParserElement.enablePackrat()
Caches: a. ParseResults b. Exceptions thrown
Run: python select_parser.py
Caveat emptor: grammars whose parse actions have side effects do not always play well with memoization turned on.
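The caveat follows from how any memoizing parser works: a cached result is returned without running the rule body again, so a side effect in the action fires fewer times than the rule is "matched". The sketch below shows the mechanism in plain Python. It is an illustration of the principle and not pyparsing's internal implementation. make_rule and its log list are hypothetical names.

```python
def make_rule(memoize):
    """Return an integer rule plus the log its 'parse action' appends to."""
    log = []     # side effect: one entry every time the action really runs
    cache = {}   # position -> result

    def integer(text, pos):
        if memoize and pos in cache:
            return cache[pos]            # cached: the action is NOT re-run
        end = pos
        while end < len(text) and text[end] in "0123456789":
            end += 1
        if end == pos:
            result = None
        else:
            log.append(text[pos:end])    # the side effect
            result = (int(text[pos:end]), end)
        if memoize:
            cache[pos] = result
        return result

    return integer, log


if __name__ == "__main__":
    for flag in (False, True):
        rule, log = make_rule(memoize=flag)
        rule("42", 0)
        rule("42", 0)    # a second visit, as backtracking would cause
        print(flag, log)
```

Actions that only compute and return a value are unaffected. Actions that count, print or mutate shared state are the ones to watch.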
pyparsing : semantic actions
In pyparsing parlance, a ParserElement can have zero or more parse actions.
4 forms of parse actions: fn(s,loc,toks) fn(loc,toks) fn(toks) fn()
Usage: ParserElement.setParseAction( *fn ) ParserElement.addParseAction( *fn )
Uses:
1. Perform validation (see ParseException)
2. Process the matched token(s) & modify them. Returning a value overwrites the matched token(s).
3. Annotate with custom types (corollary of #2)
case study #1 : Semantic action
integer = Word(nums).setParseAction( lambda t: int(t[0]))
All users of the integer expression will inherit the parse action.
def range_check(toks):
    month = int(toks[0])
    if month <= 0 or month >= 13:
        raise ParseException('month must be in range 1..12')
month_spec = integer('m').addParseAction(range_check) + month
Selective assignment of parse actions to copies:
integer.copy().addParseAction( .. )
integer( 'result_name' ).addParseAction( .. )
Show: japan_simple.py
case study #1 : test files
imperial.utf8
western.utf8
case study #1 : complete solution
Show: japan_dates.py
Demo:
@traceParseAction
def convert_kanji_year(toks):
    if 'imperial' in toks.keys():
        year = toks.imperial.yearZero + toks.imperial.yy
        toks['era'] = toks.imperial.type_
        toks['yyyy'] = year
    elif 'western' in toks.keys():
        year = toks.yyyy
    try:
        # date() rejects impossible month/day combinations
        toks['modernDate'] = date(year, toks.mm, toks.dd)
    except ValueError as error:
        raise ParseException(error.args[0])
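The arithmetic behind yearZero can be shown without pyparsing. The sketch below assumes that yearZero means the Gregorian year just before an era's first year (HEISEI 1 is 1989, so its yearZero is 1988). The ERAS table and the imperial_to_date function are hypothetical names, and the start dates are the ones listed on the four ERAs slide.

```python
from datetime import date

# Assumed meaning of "year zero": Gregorian year before the era's year 1.
ERAS = {
    "HEISEI": {"year_zero": 1988, "start": date(1989, 1, 8)},
    "SHOWA": {"year_zero": 1925, "start": date(1926, 12, 25)},
    "TAISHOU": {"year_zero": 1911, "start": date(1912, 7, 30)},
    "MEIJI": {"year_zero": 1867, "start": date(1868, 9, 8)},
}


def imperial_to_date(era, yy, mm, dd):
    """Convert an imperial year/month/day to a datetime.date."""
    info = ERAS[era]
    # date() raises ValueError for an impossible month or day.
    result = date(info["year_zero"] + yy, mm, dd)
    if result < info["start"]:
        raise ValueError("%s is before the start of %s" % (result, era))
    return result


if __name__ == "__main__":
    print(imperial_to_date("HEISEI", 23, 6, 8))
```

Like the slides, this sketch does not check the end of an era, so overlapping days are ignored.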
case study #2 problem statement
Parse Gmail search criteria.
Supports a tiny subset of the full grammar:
from : ( <sender> )
label : inbox    -label : sent
yyyyy    -yyyyy    "zzzzz"    -"zzzzz"
case study #2: example strings
label : sarawak -label : not-urgent
from : ( bruno manser )
from : ( [email protected] )
from : ( @swiss.org )
“penan injustice”
-logging
case study #2: email addresses
emailfull = Regex(r"(?P<user>[A-Za-z0-9._%+-]+)@(?P<hostname>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,4})")
emailpartial = Regex(r"@(?P<hostname>[A-Za-z0-9.-]+)\.(?P<tld>[A-Za-z]{2,4})")
email = (emailpartial | emailfull)
squeeze = lambda t: ' '.join( t[0].split() )
name = ZeroOrMore(Word(alphanums + ' ')) .setParseAction( squeeze )
case study #2: email addresses
opener,closer,colon = map(Suppress,'():')
enclosed = email | name
nested = opener + enclosed + closer
grammar_email = Combine(Suppress('from') + colon + nested)
case study #2: email addresses
result = grammar_email.parseString( 'from:([email protected])' )
print result.dump()
result = grammar_email.parseString( 'from:( Marco de Gasperi )' )
print result.dump()
Run: nosetests -v testFromTo.py
case study #2: labels
hyphen = Suppress('-')
label_rhs = delimitedList(Word(alphanums), delim='-', combine=True )
# with combine=True this is Combine( expr + ZeroOrMore( delim + expr ) )
label_include = Combine( Suppress('label') + colon + label_rhs )
label_exclude = Combine( hyphen + label_include )
label_all = MatchFirst([
    label_exclude.setResultsName('labels.exclude', listAllMatches=True),
    label_include('labels.include*')])   # trailing '*' shorthand: pyparsing 1.5.6
grammar_label = ZeroOrMore( label_all )
case study #2: labels
GOAL: group the excluded and included labels into their own sub-lists. E.g. label : fukushima1 -label : aloo-gobi
result = grammar_label.parseString('-label:fukushima1 label:onagawa -label:aloo-gobi label:cheese-naan' )
print result.dump()
Question. Will this grammar work if the user entered LABEL instead of label?
Answer. Use CaselessLiteral('label')
case study #2: search strings
GOAL: group the excluded and included search strings into their own sub-lists.
E.g. rumi -"jack kerouac"
key_single = Word(alphanums)
key_quoted = quotedString.setParseAction(removeQuotes)
key_included = key_quoted | key_single
key_excluded = Combine(hyphen + key_included)
key_all = MatchFirst( [key_excluded("key.exclude*"), key_included("key.include*")] )
grammar_key = ZeroOrMore( key_all )
result = grammar_key.parseString( ' -osama obama -"bin laden" "white house" ' )
print result.dump()
Question. If the user entered single instead of double quotes, will it conform to the grammar?
Answer. Yes
case study #2: Final solution
Let's compose all the individual pieces together.
email_all = grammar_email('from*')
gmail = (ZeroOrMore(email_all | label_all | key_all) + Suppress(restOfLine))
nested = opener + Group(enclosed) + closer
result = gmail.parseString('love label:writing-tips "bird by bird" from:(Anne Lamott) -"dalai lama" -label:macchu-pichu from:([email protected]) -label:french-guiana -"epictetus" label:yoga "bugle podcast" label from:(@microsoft.com)')
print result.dump()
case study #2: Final solution
['love', 'writing-tips', 'bird-by-bird', 'Anne Lamott', 'dalai lama', 'macchu-pichu', '@microsoft.com', '[email protected]', 'french-guiana', 'epictetus', 'yoga', 'bugle podcast', 'label']
- from: ['Anne Lamott','@microsoft.com', '[email protected]']
- key.exclude: ['dalai lama','epictetus']
- key.include: ['love', 'bird by bird', 'bugle podcast', 'label']
- labels.exclude: ['macchu-pichu', 'french-guiana']
- labels.include: ['writing-tips','yoga']
pyparsing: Recursion
A grammar is recursive when a nonterminal appears on the right-hand side of its own production rule.
number ::= digit rest
rest ::= digit rest | empty
digit = Word(nums,exact=1).setName('1-digit')
rest = Forward()rest << Optional(digit + rest)
number = Combine(digit + rest, adjacent=False) ('digit-list')
grammar = number.setParseAction(lambda t:int(t[0])) + Suppress(restOfLine)
Run
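The number / rest grammar maps directly onto two mutually dependent functions. The sketch below is plain Python rather than pyparsing, with one function per rule. The function names mirror the rule names and the (value, end) return convention is an assumption made for the example.

```python
DIGITS = "0123456789"


def digit(text, pos):
    """digit ::= '0' .. '9'; return the new position or None."""
    if pos < len(text) and text[pos] in DIGITS:
        return pos + 1
    return None


def rest(text, pos):
    """rest ::= digit rest | empty"""
    after_digit = digit(text, pos)
    if after_digit is None:
        return pos                    # the empty alternative always succeeds
    return rest(text, after_digit)    # recursive reference to the same rule


def number(text, pos=0):
    """number ::= digit rest; return (value, end) or None."""
    after_digit = digit(text, pos)
    if after_digit is None:
        return None
    end = rest(text, after_digit)
    return int(text[pos:end]), end


if __name__ == "__main__":
    print(number("2011 rest of line"))
```

The recursion in rest is on the right of the rule, so every recursive call happens only after a digit has been consumed and the parse always terminates.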
case study #3: binary tree
Parse parentheses notation for binary trees.
(nil,4,nil) ((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))
[Figure: the binary tree with nodes 2, 3, 4, 5, 6, 7]
Convert it to list notation in Python.
case study #3: recursive solution
BNF
node ::= '(' node ',' number ',' node ')' | empty
Code
left, right, comma = map(Suppress, '(),')
empty = (CaselessLiteral('nil').setParseAction(replaceWith(None)))
tree = Forward()
value = Word(nums).setParseAction(lambda t:int(t[0]))
# bookend() is a helper that is not shown on the slide
tree << ((left + Group(tree) + bookend(value) + Group(tree) + right) | empty)
Run
case study #3: recursive solution
Input : ((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))
Output : [[[None],2,[[None],3,[None]]],4,[[[None],5,[[None],6,[None]]],7,[None]]]
How to fix it : Group(tree)
Re-implement Group in
class TreeGroup(TokenConverter):
    def postParse(self, instring, loc, tokenlist):
        if len(tokenlist) == 1 and tokenlist[0] is None:
            return tokenlist
        else:
            return [tokenlist]
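For comparison, the same notation can be parsed with a small hand-written recursive descent parser that yields None for nil directly, which is the output shape the TreeGroup fix aims for. This is a plain Python sketch and not the pyparsing solution. parse_tree and its helpers are names invented for the example, and unlike CaselessLiteral it only accepts lowercase nil.

```python
def parse_tree(text):
    """Parse '(left,value,right)' / 'nil' notation into nested lists."""
    text = text.replace(" ", "")
    node, pos = _node(text, 0)
    if pos != len(text):
        raise ValueError("unexpected input at position %d" % pos)
    return node


def _expect(text, pos, ch):
    if text[pos:pos + 1] != ch:
        raise ValueError("expected %r at position %d" % (ch, pos))
    return pos + 1


def _node(text, pos):
    # node ::= '(' node ',' number ',' node ')' | 'nil'
    if text.startswith("nil", pos):
        return None, pos + 3
    pos = _expect(text, pos, "(")
    left, pos = _node(text, pos)          # recursive call: left subtree
    pos = _expect(text, pos, ",")
    start = pos
    while pos < len(text) and text[pos] in "0123456789":
        pos += 1
    if pos == start:
        raise ValueError("expected a number at position %d" % start)
    value = int(text[start:pos])
    pos = _expect(text, pos, ",")
    right, pos = _node(text, pos)         # recursive call: right subtree
    pos = _expect(text, pos, ")")
    return [left, value, right], pos


if __name__ == "__main__":
    print(parse_tree("((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))"))
```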
pyparsing : left recursion
pyparsing does not support left recursion.
term ::= \d+
expr ::= expr + term | term
@raises(RecursiveGrammarException)
def test_left_recursion(self):
    expr.validate()
Run
pyparsing will raise a RuntimeError with message 'maximum recursion depth exceeded'.
Eliminate left recursion if you want it to work in pyparsing.
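One common way to eliminate the left recursion is to rewrite expr ::= expr + term | term as expr ::= term ('+' term)*, turning the recursion into a loop. The sketch below does this in plain Python and assumes that + in the grammar is a literal plus sign and that the parse result should be the sum. Both are assumptions made for the example.

```python
import re

TERM = re.compile(r"\d+")   # term ::= \d+


def term(text, pos):
    match = TERM.match(text, pos)
    if match is None:
        raise ValueError("expected a number at position %d" % pos)
    return int(match.group()), match.end()


def expr(text, pos=0):
    """expr ::= term ('+' term)*   (left recursion replaced by repetition)"""
    total, pos = term(text, pos)
    while text[pos:pos + 1] == "+":
        value, pos = term(text, pos + 1)
        total += value   # fold left to right, keeping left associativity
    return total, pos


if __name__ == "__main__":
    print(expr("1+2+39"))
```

In pyparsing terms the rewritten rule corresponds to term + ZeroOrMore('+' + term), which needs no Forward at all.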
PyMeta : introduction
OMeta is a language prototyping system (PEG).
Implemented in several programming languages.
* Packrat memoization
* Grammar: BNF dialect (with host language snippets)
* Object-Oriented: inheritance, overriding rules
* <anything> consumes one object from the input stream. (c.f. regex)
* Built-in rules <letter> <digit> <letterOrDigit> <token '?'>
lowercase ::= <char_range 'a' 'z'>
def rule_lowercase():
    # ..body..
PEGs & PyMeta
PEG        PyMeta
e1 e2      e1 e2
e1 / e2    e1 | e2
e*         e*
e+         e+
e?         e?
Syntactic predicates (unlimited lookahead)
&e         ~~e
!e         ~e
case study #1 : in PyMeta
Modest goals:
a) recognize western and Heisei imperial dates
b) read & parse both imperial.utf8 & western.utf8
Separate files:
common.py : Common rules & utilities
western_dates.py : Grammar to recognize western dates
era_heisei.py : Grammar to recognize heisei dates
japan_date_parser.py : Final grammar
case study #1 : in PyMeta pt A
common.py
from pymeta.grammar import OMeta
baseGrammar = r"""
# common literals for all ERAs
year ::= <token u'\u5E74'>
month ::= <token u'\u6708'>
day ::= <token u'\u65E5'>
range_num :min :max ::= <digit>+:m ?((int(join(m)) >= min) & (int(join(m)) <= max)) => m
rest_of_line ::= <anything>* <token '\n'>? => None
empty_line ::= <spaces> <rest_of_line> => None
python_comment ::= <token '#'> <rest_of_line> => None
"""
JapanCommonParser = OMeta.makeGrammar(baseGrammar, globals(), "JapanCommonParser")
def join(x): return ''.join(x)
case study #1 : in PyMeta pt B
western_dates.py
westernGrammar = r"""
western ::= <spaces> <digit>+:y <year> <range_num 1 12>:m <month> <range_num 1 31>:d <day> <rest_of_line> => westernized( int(join(y)),int(join(m)), int(join(d)))
grammar ::= <python_comment> | <western>
"""
def westernized(yyyy, mm, dd):
    retval = JapanDate()
    retval['western'] = date(yyyy,mm,dd)
    return retval
WesternParser = JapanCommonParser.makeGrammar( westernGrammar, globals(), 'WesternParser')
case study #1 : in PyMeta pt C
era_heisei.py
era_heisei = Era('Heisei','Akihito', (u'\u5E73\u6210',u'\u337B'), startDate=date(1989,1,8))
def heisei_year_ok(yy):
    return (yy >= 1 and yy <= era_heisei.maxYearUnit)
def collect( yy, mm, dd ):
    retval = JapanDate()
    retval['imperial'] = date( era_heisei.yearZero + yy, mm, dd )
    retval['era'] = [ era_heisei.name, yy ]
    return retval
case study #1 : in PyMeta pt C (2)
era_heisei.py (continued)
heiseiGrammar = r"""
hlong ::= <token u'\u5e73\u6210'>
hshort ::= <token u'\u337b'>
heisei ::= (<hlong> | <hshort>) <digit>+:y ?(heisei_year_ok(int(join(y)))) <year> <range_num 1 12>:m <month> <range_num 1 31>:d <day> <rest_of_line> => collect(int(join(y)),int(join(m)),int(join(d)))
"""
HeiseiParser = JapanCommonParser.makeGrammar(heiseiGrammar, globals(), 'HeiseiParser')
case study #1 : in PyMeta pt D
japan_date_parser.py
finalGrammar = r"""
# override 'grammar' in WesternParser
grammar ::= <super> | <heisei> | <empty_line>
"""
class BaseParser(HeiseiParser, WesternParser): pass
BaseParser.globals.update(WesternParser.globals)
BaseParser.globals.update(HeiseiParser.globals)
JapanDateParser = BaseParser.makeGrammar( finalGrammar, globals(), "JapanDateParser")
case study #1 : in PyMeta pt D (2)
japan_date_parser.py (continued)
def parse_file(filename):
    """ iterate through each line """
    .... snipped ...
    parser = JapanDateParser(line)
    result,error = parser.apply('grammar')
    .... snipped ...
results = parse_file('imperial.utf8')
results = parse_file('western.utf8')
Run
case study #1 : PyMeta output
PyMeta : Left Recursion
PyMeta can handle left recursion.
recursiveGrammar = r"""
num ::= <num>:n <digit>:d => n * 10 + d
      | <digit>
digit ::= :d ?((d>='0') & (d<='9')) => int(d)
"""
Run
Quiz. Is the following grammar equivalent ?
num ::= <digit> | <num>:n <digit>:d => n * 10 + d
PyMeta : Matching objects
listGrammar = """
digit ::= :x ?(x.isdigit()) => int(x)
interp ::= [<digit>:x '+' <digit>:y] => x + y
"""
g = OMeta.makeGrammar(listGrammar, {})
parser = g( [['600','+','66']] )   # outer list: the input iterable; inner: a python list to match
result,error = parser.apply('interp')
>>> result
666
>>> error
ParseError(2,[])
PyMeta : Matching objects (2)
Object graph (e.g. tree)
The python rewriter project visits the AST created by the compiler module (Python 2.x) & regenerates the Python statement.
>>> import compiler
>>> print compiler.parse('import ctypes')
Module(None, Stmt([Import([('ctypes', None)])]))
import :i ::= <anything>:a ?(a.__class__ == Import) => 'import '+', '.join(import_match(a.names))
pyparsing vs PyMeta
                                pyparsing                                   PyMeta
Whitespace sensitive?           No. But turned on via leaveWhitespace()     Yes. Use <spaces> rule to eat whitespace
Left recursion                  No                                          Yes
Packrat memoization             Yes. Off by default.                        Yes. Only no-arg rules
Operates on character streams   Yes                                         Yes
Operates on object streams      No                                          Yes
Syntactic predicates            Yes                                         Yes
Semantic predicates             No (@see parse actions)                     Yes
Semantic actions                Yes                                         Yes
Regex support                   Yes                                         No
PyPy rlib/parsing
Library for generating tokenizers & parsers in RPython.
Consists of: regex / packrat parser
tree structure / EBNF parser
Sample JSON ebnf
NUMBER: "\-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][\+\-]?[0-9]+)?";
value: <STRING> | <NUMBER> | <object> | <array> | <"null"> | <"true"> | <"false">;
array: ["["] (value [","])* value ["]"];
entry: STRING [":"] value;
Resulting parse tree can be transformed or traversed with custom visitors. (dot)
Topics not covered
● Usage of syntactic predicates
● Parsing grammars of mathematical expressions in order to preserve operator precedence
● Handling indents/dedents in order to parse indentation-sensitive languages, e.g. coffeescript, python, haskell
Resources
pyparsing: http://pyparsing.wikispaces.com/
PyMeta: http://www.tinlizzie.org/ometa/
PyPy RPython parsing library: http://doc.pypy.org/en/latest/rlib.html
http://gitorious.org/python-decompiler/python_rewriter
https://github.com/marcua/tweeql