lexing, parsing - inria...recalling grammar definitions grammar a grammar is composed of : i a...

42
Lexing, Parsing Laure Gonnord & Ludovic Henrio & Matthieu Moy Master 1, ENS de Lyon 2019-2020

Upload: others

Post on 11-May-2020

23 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Lexing, Parsing

Laure Gonnord & Ludovic Henrio & Matthieu Moyhttps:

//compil-lyon.gitlabpages.inria.fr/compil-lyon/

[email protected]

Master 1, ENS de Lyon

2019-2020

Page 2: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Analysis Phase

source code↓

lexical analysis↓

sequence of “lexems” (tokens)↓

syntactic analysis (Parsing)↓

abstract syntax tree (AST )↓

semantic analysis↓

abstract syntax (+ symbol table)

Page 3: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Goal of this chapter

I Understand the syntaxic structure of a language;

I Separate the different steps of syntax analysis;

I Be able to write a syntax analysis tool for a simplelanguage;

I Remember: syntax6=semantics.

Page 4: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Syntax analysis steps

How do you read text ?

I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens:

I Words: groups of letters;I Punctuation;I Spaces.

I Group tokens into:I Propositions;I Sentences.

I Then proceed with word meanings:I Definition of each word.

ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...

Syntax analysis=Lexical analysis+Parsing

Page 5: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Syntax analysis steps

How do you read text ?

I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens: Lexical analysis

I Words: groups of letters;I Punctuation;I Spaces.

I Group tokens into:I Propositions;I Sentences.

I Then proceed with word meanings:I Definition of each word.

ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...

Syntax analysis=Lexical analysis+Parsing

Page 6: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Syntax analysis steps

How do you read text ?

I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens: Lexical analysis

I Words: groups of letters;I Punctuation;I Spaces.

I Group tokens into:I Propositions;I Sentences.

I Then proceed with word meanings:I Definition of each word.

ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...

Syntax analysis=Lexical analysis+Parsing

Page 7: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Syntax analysis steps

How do you read text ?

I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens: Lexical analysis

I Words: groups of letters;I Punctuation;I Spaces.

I Group tokens into: ParsingI Propositions;I Sentences.

I Then proceed with word meanings:I Definition of each word.

ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...

Syntax analysis=Lexical analysis+Parsing

Page 8: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Syntax analysis steps

How do you read text ?

I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens: Lexical analysis

I Words: groups of letters;I Punctuation;I Spaces.

I Group tokens into: ParsingI Propositions;I Sentences.

I Then proceed with word meanings:I Definition of each word.

ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...

Syntax analysis=Lexical analysis+Parsing

Page 9: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Syntax analysis steps

How do you read text ?

I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens: Lexical analysis

I Words: groups of letters;I Punctuation;I Spaces.

I Group tokens into: ParsingI Propositions;I Sentences.

I Then proceed with word meanings: SemanticsI Definition of each word.

ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...

Syntax analysis=Lexical analysis+Parsing

Page 10: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Syntax analysis steps

How do you read text ?

I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens: Lexical analysis

I Words: groups of letters;I Punctuation;I Spaces.

I Group tokens into: ParsingI Propositions;I Sentences.

I Then proceed with word meanings: SemanticsI Definition of each word.

ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...

Syntax analysis=Lexical analysis+Parsing

Page 11: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

What for ?

int y = 12 + 4*x ;

=⇒ [TINT, VAR("y"), EQ, INT(12), PLUS, INT(4), TIMES,VAR("x"), SCOL]

I Group characters into a list of tokens, e.g.:

I The word “int” stands for type integer;

I A sequence of letters stands for a variable;

I A sequence of digits stands for an integer;

I ...

Page 12: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Principle

I Take a lexical description: E = ( E1︸︷︷︸Tokens class

| . . . |En)∗

I Construct an automaton.

Example - lexical description (“lex file”)E = ((0|1)+|(0|1)+.(0|1)+|′+′)∗

Page 13: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

What’s behind

Regular languages, regular automata:

I Thompson construction I non-det automaton

I Determinization, completion

I Minimisation

Page 14: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Construction/use of automaton

Let E = ((0|1)+︸ ︷︷ ︸#1

| (0|1)+.(0|1)+︸ ︷︷ ︸#2

| ′+′︸︷︷︸#3

)∗ and

Σ = {0, 1,+, .,#1,#2,#3}.We define E# = ((0|1)+#1|(0|1)+.(0.1)+#2|′ +′ #3).

1start 2 3 4

5 sink

0,1 . 0,1

+

0,10,1

#1

#2

#3

Source C. Alias drawn by former M1 students

Page 15: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Remarks

I Notion of ambiguity.

I Compaction of transition table.

Page 16: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Tools: lexical analyzer constructors

I Lexical analyzer constructor: builds an automaton from aregular language definition;

I Ex: Lex (C), JFlex (Java), OCamllex, ANTLR (multi), ...

I input: a set of regular expressions with actions (Toto.g4);

I output: a file(s) (Toto.java) that contains thecorresponding automaton.

Page 17: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Analyzing text with the compiled lexer

I The input of the lexer is a text file;I Execution:

I Checks that the input is accepted by the compiledautomaton;

I Executes some actions during the “automaton traversal”.

Page 18: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Lexing tool for Java: ANTLR

I The official webpage : www.antlr.org (BSD license);

I ANTLR is both a lexer and a parser;

I ANTLR is multi-language (not only Java).

Page 19: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

ANTLR lexer format and compilation

.g4

grammar XX;

@header {

// Some init code ...

}

@members {

// Some global variables

}

// More optional blocks are available

--->> lex rules

Compilation:

antlr4 Toto.g4 // produces several Java files

javac *.java // compiles into xx.class files

grun Toto r // Run analyzer with starting rule r

Page 20: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Lexing with ANTLR: example

Lexing rules:

I Must start with an upper-case letter;

I Follow the usual extended regular-expressions syntax(same as egrep, sed, ...).

A simple example

lexer grammar Tokens;

HELLO : 'hello' ; // beware the single quotes

ID : [a-z]+ ; // match lower -case identifiers

INT : [0-9]+ ;

KEYWORD : 'begin' | 'end' | 'for' ; // perhaps this should

be elsewhere

WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines

Page 21: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Lexing - We can count!

Counting in ANTLR - CountLines2.g4

lexer grammar CountLines2;

// Members can be accessed in any rule

@members {int nbLines =0;}

NEWLINE : [\r\n] {

nbLines ++;

System.out.println("Current lines:"+nbLines);} ;

WS : [ \t]+ -> skip ;

Page 22: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

What’s Parsing ?

Relate tokens by structuring them.

Flat tokens[TINT, VAR("y"), EQ, INT(12), PLUS, INT(4), TIMES, VAR("x"),SCOL]

⇒ Parsing⇒

Accept→ Structured tokens=

yint +

12

4 x

*

int

Page 23: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Analysis Phase

source code↓

lexical analysis↓

sequence of “lexems” (tokens)↓

syntactic analysis (Parsing)↓

abstract syntax tree (AST )↓

semantic analysis↓

abstract syntax (+ symbol table)

Page 24: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

In this course

Only write acceptors “OK” or “Syntax Error”.

Page 25: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

What’s behind ?

From a Context-free Grammar, produce a Stack Automaton(already seen in L3 course?)

Page 26: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Recalling grammar definitions

GrammarA grammar is composed of :

I A finite set N of non terminal symbols

I A finite set Σ of terminal symbols (disjoint from N )

I A finite set of production rules, each rule of the formw → w′ where w is a word on Σ ∪N with at least one letterof N . w′ is a word on Σ ∪N .

I A start symbol S ∈ N .

Page 27: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Example

Example:S → aSb

S → ε

is a grammar with N = .... and ....

Page 28: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Associated Language

DerivationG a grammar defines the relation :

x⇒G y iff ∃u, v, p, qx = upv and y = uqv and (p→ q) ∈ P

I A grammar describes a language (the set of words on Σ thatcan be derived from the start symbol).

Page 29: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Example - associated language

S → aSb

S → ε

The grammar defines the language {anbn, n ∈ N}

S → aBSc

S → abc

Ba→ aB

Bb→ bb

The grammar defines the language {anbncn, n ∈ N}

Page 30: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Context-free grammars

Context-free grammarA CF-grammar is a grammar where all production rules are ofthe form N → (Σ ∪N)∗.

Example:S → S + S|S ∗ S|a

The grammar defines a language of arithmetical expressions.

I Notion of derivation tree.Draw a derivation tree of a*a+a, of S+S !

Page 31: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Recognizing grammars

Grammar type Rules DecidableRegular grammar X → aY,X → b YESContext-free grammar X → u YESContext-sensitive grammar uXv → uwv YESGeneral grammar u→ v NO

Page 32: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Parser construction

There exists algorithms to recognize class of grammars:

I Predictive (descending) analysis (LL)

I Ascending analysis (LR)

I See the Dragon book.

Page 33: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Tools: parser generators

I Parser generator: builds a stack automaton from agrammar definition;

I Ex: yacc(C), javacup (Java), OCamlyacc, ANTLR, ...

I input : a set of grammar rules with actions (Toto.g4);

I output : a file(s) (Toto.java) that contains thecorresponding stack automaton.

Page 34: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Lexing vs Parsing

I Lexing supports (' regular) languages;

I We want more (general) languages⇒ rely on context-freegrammars;

I To that intent, we need a way:I To declare terminal symbols (tokens);I To write grammars.

I Use both Lexing rules and Parsing rules.

Page 35: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

From a grammar to a parser

The grammar must be context-free:

S-> aSb

S-> eps

I The grammar rules are specified as Parsing rules;

I a and b are terminal tokens, produced by Lexing rules.

Page 36: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Parsing with ANTLR: example 1

Tokens.g4

lexer grammar Tokens;

HELLO : 'hello' ; // beware the single quotes

ID : [a-z]+ ; // match lower -case identifiers

INT : [0-9]+ ;

KEYWORD : 'begin' | 'end' | 'for' ; // perhaps this should

be elsewhere

WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines

Page 37: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Parsing with ANTLR: example 2

AnBnLexer.g4

l e xe r grammar AnBnLexer;

/ / Lexing r u l e s : recognize tokensA: 'a' ;B: 'b' ;

WS : [ \ t \ r \ n ]+ −> sk ip ; / / sk ip spaces, t abs , newl ines

Page 38: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Parsing with ANTLR: example 2 (cont’)

AnBnParser.g4

parser grammar AnBnParser;options {tokenVocab=AnBnLexer;} / / ex tern tokens d e f i n i t i o n

/ / Parsing r u l e s : s t r u c t u r e tokens toge therprog : s EOF ; / / EOF: predef ined end−of− f i l e tokens : A s B

| ; / / no th ing f o r empty a l t e r n a t i v e

Page 39: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

ANTLR expressivity

LL(*)At parse-time, decisions gracefully throttle up from con-ventional fixed k > 1 lookahead to arbitrary lookahead.

Further reading (PLDI’11 paper, T. Parr, K. Fisher)

http://www.antlr.org/papers/LL-star-PLDI11.pdf

Page 40: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Left recursion

ANTLR permits left recursion:

a: a b;

But not indirect left recursion. X1 → . . .→ Xn

There exist algorithms to eliminate indirect recursions.

Page 41: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

Lists

ANTLR permits lists:

prog: statement+ ;

Read the documentation!

https:

//github.com/antlr/antlr4/blob/master/doc/index.md

Page 42: Lexing, Parsing - Inria...Recalling grammar definitions Grammar A grammar is composed of : I A finite set N of non terminal symbols I A finite set of terminal symbols (disjoint

So Far ...

ANTLR has been used to:

I Produce acceptors for context-free languages;

I Do a bit of computation on-the-fly.

⇒ In a classic compiler, parsing produces an Abstract SyntaxTree.I Next course!