CSE P501 – Compiler Construction
Scanner
Regex
Automata
Hand-Written Scanner
Grammars & BNF
Next
Spring 2014 Jim Hogg - UW - CSE P501 B-1
[Figure: compiler pipeline from Source to Target. Front End: Scan (chars to tokens), Parse (tokens to AST), Semantics (AST to IR). 'Middle End': Optimize (IR to IR). Back End: Select Instructions, Allocate Registers, Emit (IR to Machine Code).]
AST = Abstract Syntax Tree
IR = Intermediate Representation
Scanner
Automatic or Hand-Written?
Use a scanner-generator - JFlex
  regex define tokens
  .jflex file -> JFlex -> Scanner (.java file)
OR
Write a scanner, in Java, by hand
  Easy and enlightening
  Will see an outline of how, later
Reminder: a token is . . .
class C {
  public int fac(int n) { // factorial
    int nn;
    if (n < 1)
      nn = 1;
    else
      nn = n * (this.fac(n-1));
    return nn;
  }
}
class∙C∙{◊∙∙public∙int∙fac(int∙n)∙{∙∙//∙factorial◊∙∙∙∙int∙nn;◊∙∙∙∙if(n∙<∙1)◊∙∙∙∙∙∙nn∙=∙1;◊∙∙∙∙else◊∙∙∙∙nn∙=∙n∙*∙(this.fac(n-1));◊∙∙∙∙return∙nn;◊∙∙}◊}
Key for Char Stream: ◊ = newline (\n), ∙ = space
CLASS ID:C LBRACE PUBLIC INT ID:fac LPAREN INT ID:n RPAREN LBRACE INT ID:nn SEMI IF LPAREN ID:n LT ILIT:1 RPAREN ID:nn EQ ILIT:1 ELSE ID:nn EQ ID:n TIMES LPAREN ID:this DOT ID:fac LPAREN ID:n MINUS ILIT:1 RPAREN RPAREN SEMI RETURN ID:nn SEMI RBRACE RBRACE
A Token in your Java scanner
class Token {
  public int kind;       // eg: LPAREN, ID, ILIT
  public int line;       // for debugging/diagnostics
  public int column;     // for debugging/diagnostics
  public String lexeme;  // eg: "x", "Total", "(", "42"
  public int value;      // attribute of ILIT
}
Obviously this Token is wasteful of memory:
• lexeme is not required for primitive tokens, such as LPAREN, RBRACE, etc
• value is only required for ILIT
But there's only 1 token alive at any instant during parsing, so there is no point refining it into 3 leaner variants!
Typical Tokens
Operators & Punctuation
  Single chars: + - * = / ( ] ; :
  Double chars: :: <= == !=
Keywords
  if while for goto return switch void …
Identifiers
  A single ID token kind, parameterized by lexeme
Integer constants
  A single ILIT token kind, parameterized by int value
See jflex-1.5.0\examples\java\java.flex for real example
Token Spotting
if(a<=3)++grades[1]; // what are the tokens? (no spaces)
public int fac(int n) { // what are the tokens? (need spaces?)
Counter-example: fixed-format FORTRAN:
DO 50 I = 1,99 // DO loop
DO 50 I = 1.2 // assignment: DO50I = 1.2
Principle of Longest Match
Scanner should pick the longest possible string to make up the next token (“greedy” algorithm)
Example: return idx <= iffy;
should be scanned into 5 tokens:
RETURN ID:idx LEQ ID:iffy SEMI
<= is one token, not two
iffy is an ID, not IF followed by ID:fy
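A minimal sketch of longest match, for illustration only. It assumes a hypothetical six-rule token set (not the full MiniJava set) and uses java.util.regex in place of a generated DFA. The class name LongestMatch and the token-kind strings are made up for this example. At each position every rule is tried, the longest lexeme wins, and ties go to the earlier rule so that keywords beat ID.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class LongestMatch {
    // Rule order matters only for ties: earlier rules win (keywords before ID).
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("RETURN", Pattern.compile("return"));
        RULES.put("IF",     Pattern.compile("if"));
        RULES.put("LEQ",    Pattern.compile("<="));
        RULES.put("LT",     Pattern.compile("<"));
        RULES.put("SEMI",   Pattern.compile(";"));
        RULES.put("ID",     Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int p = 0;
        while (p < input.length()) {
            if (Character.isWhitespace(input.charAt(p))) { p++; continue; }
            String bestKind = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(p, input.length());
                // lookingAt() anchors the match at the start of the region
                if (m.lookingAt() && m.end() - p > bestLen) {
                    bestLen = m.end() - p;
                    bestKind = rule.getKey();
                }
            }
            if (bestKind == null) {
                throw new IllegalArgumentException("no token matches at index " + p);
            }
            String lexeme = input.substring(p, p + bestLen);
            tokens.add(bestKind.equals("ID") ? "ID:" + lexeme : bestKind);
            p += bestLen;
        }
        return tokens;
    }
}
```

Calling LongestMatch.tokenize("return idx <= iffy;") yields the five tokens listed on the slide. For "iffy" the IF rule matches 2 chars but ID matches 4, so ID wins.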
The lexical structure (the tokens) of most programming languages can be specified using Regular Expressions
  Called “REs” in Cooper&Torczon; “regex” is more common
Tokens can be recognized by a deterministic finite automaton (DFA)
  The DFA (a Java class) is almost always generated from regex using a software tool, such as JFlex
Regex
Regex Cheat Sheet
Pattern   Matches
a         a
a*        zero or more a’s
a+        one or more a’s
a?        zero or one a
a|b       a or b
ab        a followed by b
[c-f]     one of c or d or e or f
[^0-3]    any one character except 0-3
.         any character, except newline
Precedence: * (highest), concatenation, | (lowest)
Parentheses can be used to group regexes as needed
Notice the meta-characters (shown in red on the slide)
Escaped characters: \* \+ \? \| \. \t \n
Regex Examples
regex Meaning?
[abc]+
[abc]* (Kleene closure)
[0-9]+
[1-9][0-9]*
[a-zA-Z_][a-zA-Z0-9_]*
(0|1)* 0
(a|b)*aa(a|b)*
Check free online Regex tutorials if you are rusty. Eg: http://regexone.com/ Experiment with a regex-capable editor. Eg: http://www.editpadpro.com/
regex
Defined over some alphabet Σ
  For programming languages, alphabet is ASCII or Unicode
If re is a regular expression, L(re) is the language (set of strings) generated by re
regex macros
Possible syntax for numeric constants
Digit = [0-9]
Digits = Digit+
Number = Digits ( \. Digits )? ( [eE] (\+ | -)? Digits )?
How would you describe this set in English?
What are some examples of legal constants (strings) generated by Number?
Tools like JFlex accept these convenient macros
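A minimal sketch of the Number macro above, written with java.util.regex rather than JFlex syntax. The macros are modelled as string constants that get concatenated; the class name NumberRegex and method isNumber are made up for this example. It assumes the decimal point and the plus sign are meant literally, so both are escaped or placed in a character class.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class NumberRegex {
    static final String DIGIT  = "[0-9]";
    static final String DIGITS = DIGIT + "+";
    // Digits ( \. Digits )? ( [eE] (\+ | -)? Digits )?
    static final String NUMBER =
        DIGITS + "(\\." + DIGITS + ")?" + "([eE][+-]?" + DIGITS + ")?";

    private static final Pattern P = Pattern.compile(NUMBER);

    // True only if the whole string is a legal Number
    static boolean isNumber(String s) {
        return P.matcher(s).matches();
    }
}
```

Under these assumptions 42, 3.14, 6.02e23 and 1E-9 are accepted, while .5, 1. and 1e are rejected, because Digits is required on both sides of the point and after the exponent marker.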
Finite automata (state machines) can be used to recognize strings generated by regular expressions
Can build automaton by-hand or automagically
  Will not build by-hand in this course
  Will use the JFlex tool: given a set of regex, it generates an automaton recognizer (a Java class)
Automata
Finite Automata Terminology
Phrase Abbreviation
Finite Automaton FA
Deterministic Finite Automaton DFA
Non-deterministic Finite Automaton NFA
Finite-State Automaton FSA = {DFA, NFA}
DFA for “cat”
[Figure: DFA for regex = cat. Start State --c--> state --a--> state --t--> Accepting State (double circle).]
DFA for ILIT
[Figure: DFA for regex = [0-9][0-9]* = [0-9]+. State 1 (start) --0-9--> state 2 (accepting), with a self-loop on 0-9 at state 2. We have labelled the states.]
DFA for ID
[Figure: DFA for regex = [a-zA-Z_][a-zA-Z0-9_]*. State 0 (start) --a-z, A-Z, _--> state 1 (accepting), with a self-loop on a-z, A-Z, _, 0-9 at state 1.]
DFAs work like this . . .
1. scan the input text string, character-by-character
2. follow the arc/edge corresponding to the character just read
3. if there is no arc for the character just read, then, either:
a. if you are in an accepting state: you're done. Success!
b. if you are not in an accepting state: you're done. Failure!
DFAs work like this - examples
regex: alphaid = [a-z]+ (lower-case alphas)
[Figure: DFA for alphaid. State 0 (start) --a-z--> state 1 (accepting), with a self-loop on a-z at state 1.]
1. Scan "fac(int n);" for alphaid. We hit "(" and are already in state 1. Success
2. Scan "23;" for alphaid. There is no arc for "2". We are still in state 0. Failure
3. Scan "today" for alphaid. We hit end-of-string and are already in state 1. Success
Note: no need to add arcs to the DFA for all error cases - they are implicit
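The three scans above can be sketched as a tiny DFA simulator. This is an illustration under stated assumptions, not how JFlex generates code: the class name AlphaIdDfa, the step and match methods, and the use of -1 for "no arc" are all made up for this example. States 0 and 1 are the two states of the alphaid DFA.

```java
// Hypothetical helper, for illustration only.
final class AlphaIdDfa {
    // Returns the next state, or -1 when there is no arc (the implicit error case).
    private static int step(int state, char c) {
        boolean lower = c >= 'a' && c <= 'z';
        if ((state == 0 || state == 1) && lower) return 1;
        return -1;
    }

    // Returns the length of the longest prefix matching [a-z]+, or -1 on failure.
    static int match(String input) {
        int state = 0;
        int p = 0;
        while (p < input.length()) {
            int next = step(state, input.charAt(p));
            if (next < 0) break;       // no arc for this character: stop scanning
            state = next;
            p++;
        }
        return state == 1 ? p : -1;    // state 1 is the only accepting state
    }
}
```

With this sketch, match("fac(int n);") returns 3, match("23;") returns -1, and match("today") returns 5.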
Thompson’s Construction: Combining DFAs
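[Figure: DFA for: a and DFA for: b, each a single labelled arc. NFA for: ab joins the a machine to the b machine with an ε arc. NFA for a|b adds a new start state with ε arcs into the a and b machines, and ε arcs from each of them into a new accepting state.]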
Combining DFAs, cont’d
[Figure: DFA for: a and DFA for: b, as on the previous slide. NFA for: a* wraps the a machine in ε arcs, so that a can be skipped or repeated.]
Exercise
Draw the NFA for: b(at|ag) | bug
[Figure: NFA for b(at|ag) | bug, with arcs labelled b, a, t, g and u.]
NFA for a(b|c)*
[Figure: NFA for a(b|c)*, with arcs labelled a, b and c.]
To recognize "acb" successfully, we would need to do one of the following:
• guess the future correctly
• backtrack and retry if we fail to recognize
• somehow execute all possible paths
None of these is attractive! Can we construct an equivalent DFA?
Finite State Automaton (FSA)
A finite set of states
  One marked as initial state
  One or more marked as final states
  States sometimes labeled or numbered
A set of transitions from state to state
  Each labeled with symbol from Σ, or ε
Operate by reading input symbols (usually characters)
  Transition can be taken if labeled with current symbol
  ε-transition can be taken at any time (free bus ride)
Accept when final state reached & no more input
  Scanner uses an FSA as a subroutine – accept longest match from current location each time called, even if more input
Reject if no transition possible, or no more input and not in final state (DFA)
DFA vs NFA
Deterministic Finite Automata (DFA)
  No choice of which transition to take
  In particular, no ε transitions
  No guessing
Non-deterministic Finite Automata (NFA)
  Choice of transition in at least one case
  Accepts if some way to reach final state on given input
  Reject if no possible way to final state
  How to implement in software?
DFAs in Scanners
We really want DFA for speed: no backtracking, no guessing, no foretelling the future
Conversion from regex to NFA is easy, right?
But how to turn an NFA into an equivalent DFA?
Turns out to be obvious (once seen) and easy
NFA to DFA
Starting with the above NFA, we want to 'collapse' epsilon edges, ending up with a DFA that recognizes, and rejects, the same char strings. Ideally, we will end up with:
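[Figure: NFA for a(b|c)* with states 0 to 9; arcs 0 --a--> 1, 4 --b--> 5, 6 --c--> 7, and ε arcs elsewhere. Target DFA: state 0 --a--> state 1, with self-loops on b and c at state 1.]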
NFA to DFA
[Figure: the same NFA for a(b|c)*, states 0 to 9.]
• Begin in the Start state
• For each labelled arc leaving that state, what set of states can I reach along that labelled arc and then along ε transitions?
NFA to DFA
[Figure: the same NFA for a(b|c)*, with states renamed n0 to n9.]
DFA State (set of NFA states)   a                     b                     c
d0 = n0                         d1 = {1,2,3,4,6,9}    none                  none
d1 = {1,2,3,4,6,9}              none                  d2 = {3,4,5,6,8,9}    d3 = {3,4,6,7,8,9}
d2 = {3,4,5,6,8,9}              none                  d2 = {3,4,5,6,8,9}    d3 = {3,4,6,7,8,9}
d3 = {3,4,6,7,8,9}              none                  d2 = {3,4,5,6,8,9}    d3 = {3,4,6,7,8,9}
NFA to DFA
[Figure: DFA for a(b|c)*. d0 --a--> d1; d1 --b--> d2; d1 --c--> d3; d2 and d3 each have an arc on b to d2 and on c to d3.]
DFA State   a    b    c
d0          d1   -    -
d1          -    d2   d3
d2          -    d2   d3
d3          -    d2   d3
NFA to DFA - Even Better
[Figure: minimized DFA for a(b|c)*. d0 --a--> d1, with self-loops on b and c at d1.]
• Can reduce number of states further, to yield above result
• If interested, see books for details
• State minimization is not examined in P501
From NFA to DFA
Subset construction (equivalence class)
  Construct DFA from NFA, where each DFA state represents a set of NFA states
Key idea
  State of DFA after reading some input is the set of all states the NFA could have reached after reading the same input
Algorithm: example of a fixed-point computation
If NFA has n states, DFA has at most 2^n states
  => DFA is finite, can construct in finite # steps
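A minimal sketch of subset construction, applied to the NFA for a(b|c)* from the earlier slides. The ε arcs in EDGES are an assumption: they are inferred from the state sets in the slide's table (for example, the ε-closure of state 1 must be {1,2,3,4,6,9}), since the figure itself did not survive. The class and method names are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
final class SubsetConstruction {
    // Each edge is {from, label, to}; label 0 stands for epsilon.
    // Assumed NFA for a(b|c)*, inferred from the table of state sets.
    static final int[][] EDGES = {
        {0, 'a', 1},
        {1, 0, 2}, {2, 0, 3}, {2, 0, 9},
        {3, 0, 4}, {3, 0, 6},
        {4, 'b', 5}, {6, 'c', 7},
        {5, 0, 8}, {7, 0, 8},
        {8, 0, 3}, {8, 0, 9}
    };

    // All states reachable from 'states' using only epsilon arcs.
    static Set<Integer> epsilonClosure(Set<Integer> states) {
        Set<Integer> closure = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] e : EDGES) {
                if (e[0] == s && e[1] == 0 && closure.add(e[2])) work.push(e[2]);
            }
        }
        return closure;
    }

    // All states reachable from 'states' along one arc labelled c.
    static Set<Integer> move(Set<Integer> states, char c) {
        Set<Integer> out = new TreeSet<>();
        for (int[] e : EDGES) {
            if (e[1] == c && states.contains(e[0])) out.add(e[2]);
        }
        return out;
    }

    // Maps each DFA state (a set of NFA states) to its outgoing transitions.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build(String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = epsilonClosure(Set.of(0));
        dfa.put(start, new TreeMap<>());
        work.add(start);
        while (!work.isEmpty()) {   // fixed point: stop when no new sets appear
            Set<Integer> d = work.remove();
            for (char c : alphabet.toCharArray()) {
                Set<Integer> target = epsilonClosure(move(d, c));
                if (target.isEmpty()) continue;   // no arc: implicit error
                dfa.get(d).put(c, target);
                if (!dfa.containsKey(target)) {
                    dfa.put(target, new TreeMap<>());
                    work.add(target);
                }
            }
        }
        return dfa;
    }
}
```

Under the assumed edges, SubsetConstruction.build("abc") produces four DFA states, matching d0 to d3 in the table. The worklist loop is the fixed-point computation: it terminates because there are only finitely many subsets of NFA states.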
Build DFA for: b(at|ag) | bug from its NFA
[Figure: NFA for b(at|ag) | bug, with states 0 to 12 and arcs labelled b, a, t, g, u.]
DFA State        a    b            g    t    u
d0 = {0}         -    {1,2,5,9}    -    -    -
d1 = {1,2,5,9}   ?    ?            ?    ?    ?
?                ?    ?            ?    ?    ?
Build DFA for: b(at|ag) | bug from its NFA
[Figure: the same NFA for b(at|ag) | bug, states 0 to 12.]
DFA State        a           b               g             t            u
d0 = {0}         -           d1={1,2,5,9}    -             -            -
d1 = {1,2,5,9}   d2={3,6}    -               -             -            d3={10}
d2 = {3,6}       -           -               d4={7}        d5={4,12}    -
d3 = {10}        -           -               d6={11,12}    -            -
TBD              ?           ?               ?             ?            ?
Idea: show a hand-written DFA for some typical tokens
  Then use to construct hand-written scanner
Setting: Parser calls scanner whenever it wants next token
  JFlex provides next_token
  Scanner stores current position in input
For illustration only. Course project will use JFlex scanner-generator
Note - most commercial compilers use hand-written scanners - generally faster
Hand-Written Scanner
Scanner DFA Example – Part 1
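[Figure: scanner DFA, part 1. Start state 0, with a self-loop on whitespace or comments.]
0 --end of input--> 1: Accept EOF
0 --(--> 2: Accept LPAREN
0 --)--> 3: Accept RPAREN
0 --;--> 4: Accept SEMI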
Scanner DFA Example – Part 2
[Figure: scanner DFA, part 2.]
0 --!--> 5
5 --=--> 6: Accept NEQ
5 --[other]--> 7: Accept NOT
0 --<--> 8
8 --=--> 9: Accept LEQ
8 --[other]--> 10: Accept LESS
Scanner DFA Example – Part 3
[Figure: scanner DFA, part 3.]
0 --[0-9]--> 11, with a self-loop on [0-9] at 11
11 --[other]--> 12: Accept ILIT
Strategies for handling identifiers vs keywords
  Hand-written scanner: look up identifier-like things in table of keywords
  Machine-generated scanner: generate DFA with appropriate transitions to recognize keywords
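A minimal sketch of the hand-written strategy: scan an identifier-like lexeme first, then look it up in a keyword table. The class name Keywords, the method kindOf, the six keywords chosen and the use of strings for token kinds are all assumptions made for this example; a real scanner would use its own token-kind constants.

```java
import java.util.Map;

// Hypothetical helper, for illustration only.
final class Keywords {
    // Assumed keyword table; token kinds are plain strings here for brevity.
    private static final Map<String, String> TABLE = Map.of(
        "if", "IF",
        "else", "ELSE",
        "while", "WHILE",
        "return", "RETURN",
        "class", "CLASS",
        "int", "INT");

    // Classify an identifier-like lexeme: its keyword kind if listed, else ID.
    static String kindOf(String lexeme) {
        return TABLE.getOrDefault(lexeme, "ID");
    }
}
```

Because the lookup happens after the whole lexeme has been scanned, "iffy" is classified as ID even though it starts with "if".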
Scanner DFA Example – Part 4
[Figure: scanner DFA, part 4.]
0 --[a-zA-Z]--> 13, with a self-loop on [a-zA-Z0-9_] at 13
13 --[other]--> 14: Accept ID or keyword
Scanner – class, ctor, skipWhite
public class Scanner {
  private String prog;  // the MiniJava program to be scanned
  private int p;        // index in 'prog' of current char

  public Scanner(String prog) {
    this.prog = prog;
    p = 0;
  }

  // Advance p past any whitespace, stopping at end of input
  private void skipWhite() {
    while (p < prog.length() && Character.isWhitespace(prog.charAt(p))) {
      p++;
    }
  }
Scanner- id
  private Token id() {
    int pBegin = p;  // remember begin index of id
    // current char is alphabetic; consume letters, digits and '_'
    while (p < prog.length()) {
      char c = prog.charAt(p);
      if (!(Character.isAlphabetic(c) || Character.isDigit(c) || c == '_')) break;
      p++;
    }
    return new Token(ID, prog.substring(pBegin, p));
  }
Scanner - iLit
  private Token iLit() {
    int pBegin = p;  // remember begin index of lexeme
    int val = 0;     // value accumulated so far
    // step thru chars of number, stopping at first non-digit or end of input
    while (p < prog.length() && Character.isDigit(prog.charAt(p))) {
      val = 10 * val + Character.getNumericValue(prog.charAt(p));
      p++;
    }
    String lex = prog.substring(pBegin, p);
    return new Token(ILIT, lex, val);
  }
Scanner - nextToken
  public Token nextToken() {
    skipWhite();  // returns at prog[p]
    if (p >= prog.length()) return new Token(EOF, "");  // end of input
    char c = prog.charAt(p);  // current char in 'prog'
    char n = (p + 1 < prog.length()) ? prog.charAt(p + 1) : '\0';  // next char, if any

    switch (c) {
      case '>':
        if (n == '=') { p++; p++; return new Token(GEQ, ">="); }
        else          { p++;      return new Token(GT, ">"); }
      // . . .
      case '+': p++; return new Token(PLUS, "+");
      // . . .
    } // end of switch
Scanner – nextToken, cont’d
    if (Character.isDigit(c)) {
      return this.iLit();
    } else if (Character.isAlphabetic(c)) {
      return this.id();
    } else {
      p++;  // skip the bad char so the scanner makes progress
      return new Token(BAD, "");
    }
  } // end of nextToken
} // end of class Scanner
An entire hand-written scanner for MiniJava takes ~100 lines of Java
Since the 60s, the syntax of every significant programming language has been specified by a formal grammar
First done in 1959 with BNF (Backus-Naur Form); used to specify ALGOL 60 syntax
Borrowed from the linguistics community (Noam Chomsky)
Grammars & BNF
Grammar for a Tiny Language
program → statement | program statement
statement → assignStmt | ifStmt
assignStmt → id = expr ;
ifStmt → if ( expr ) statement
expr → id | ilit | expr + expr
id → a | b | c | i | j | k | n | x | y | z
ilit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Note: often see ::= used instead of →
Example Derivation
a = 1 ; if ( a + 1 ) b = 2 ;
program ::= statement | program statement
statement ::= assignStmt | ifStmt
assignStmt ::= id = expr ;
ifStmt ::= if ( expr ) statement
expr ::= id | ilit | expr + expr
id ::= a | b | c | i | j | k | n | x | y | z
ilit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Abbreviated:
P → S | P S
S → A | I
A → id = E ;
I → if ( E ) S
E → id | ilit | E + E
id → [a-z]
ilit → [0-9]
Parse Tree - First Few Steps
a = 1 ; if ( a + 1 ) b = 2 ;
P
  P
    S
      A
        id
        =
        E
          ilit
        ;
  S
Parse Tree - Complete
a = 1 ; if ( a + 1 ) b = 2 ;
P
  P
    S
      A
        id
        =
        E
          ilit
        ;
  S
    I
      if
      (
      E
        E
          id
        +
        E
          ilit
      )
      S
        A
          id
          =
          E
            ilit
          ;
Alternative Notations
There are several syntax notations for productions in common use; all mean the same thing
ifStmt ::= if ( expr ) statement
ifStmt → if ( expr ) statement
<ifStmt> ::= if ( <expr> ) <statement>
Formal Languages & Automata Theory
Alphabet: a finite set of symbols ( eg: [a-zA-Z0-9_] )
String: a finite, possibly empty sequence of symbols from an alphabet
Language: a set, often infinite, of strings
Finite specifications of (possibly infinite) languages
  Grammar – a generator; a system for producing all strings in the language (and no other strings)
A particular language may be specified by many different grammars
A grammar specifies only one language
Productions
The rules of a grammar are called productions
Rules contain
  Nonterminal symbols: grammar variables (program, statement, id, etc)
  Terminal symbols: concrete syntax that appears in programs (a, b, c, 0, 1, if, (, ), … )
Meaning of: nonterminal → <sequence of terminals and nonterminals>
  In a derivation, an instance of the nonterminal can be replaced by the sequence of terminals and nonterminals on its RHS
Often, there are two or more productions for one nonterminal – use any in different parts of derivation
Two ways to Parse
Parse: re-construct the derivation (syntactic structure) of a program
More prosaically: fill the gap between top and bottom of page with a parse tree:
Start at top; build tree downwards, sweeping left-to-right. This is called a "top-down" parse. What we just did for the "Tiny Language" example
Start at bottom; build little trees that join upwards. Called a "bottom-up" parse. What CUP does for us.
Why Separate Scanner and Parser?
In principle, a single recognizer could work directly from a concrete, character-by-character grammar
In practice this is rarely done: compilers almost always scan chars to tokens first, because:
Simplicity & Separation of Concerns
  Scanner hides details from parser (comments, whitespace, input files, etc)
  Parser becomes easier to build; has simpler input - stream-of-tokens
Efficiency
  Scanner can use simpler, fast design
  But still often consumes a surprising amount of the compiler’s total execution time - it touches every char in source program
Project Notes
For MiniJava project
  Use JFlex scanner-generator tool
  Use CUP parser-generator tool
  The two work together
CUP generates a file of token kinds into sym.java (SEMI = 28, LT = 18, etc)
JFlex needs these definitions. To bootstrap this process, inspect the MiniJava grammar and devise your own set of token kinds
See MiniJava page at: http://www.cambridge.org/resources/052182060X/
Homework: paper exercises on regex and FAs
Next week: first part of the compiler assignment – the scanner
Send partner info to Nat if you want project space
Next topic: parsing
  Will do LR parsing first, for the project (CUP)
  Cooper&Torczon chapter 3
Next