CSE P501 – Compiler Construction

Scanner

Regex

Automata

Hand-Written Scanner

Grammars & BNF

Next

Spring 2014 Jim Hogg - UW - CSE P501 B-1

Source → Front End → 'Middle End' → Back End → Target

Front End: chars → Scan → tokens → Parse → AST → Semantics → IR

'Middle End': IR → Optimize → IR

Back End: IR → Select Instructions → IR → Allocate Registers → IR → Emit → Machine Code

AST = Abstract Syntax Tree

IR = Intermediate Representation

Scanner

Automatic or Hand-Written?

Use a scanner-generator - JFlex
• regex define tokens
• .jflex → JFlex → Scanner (.java)

OR

Write a scanner, in Java, by hand
• Easy and enlightening
• Will see an outline of how, later

Reminder: a token is . . .


class C {
  public int fac(int n) {    // factorial
    int nn;
    if (n < 1)
      nn = 1;
    else
      nn = n * (this.fac(n-1));
    return nn;
  }
}

class∙C∙{◊∙∙public∙int∙fac(int∙n)∙{∙∙//∙factorial◊∙∙∙∙int∙nn;◊∙∙∙∙if(n∙<∙1)◊∙∙∙∙∙∙nn∙=∙1;◊∙∙∙∙else◊∙∙∙∙nn∙=∙n∙*∙(this.fac(n-1));◊∙∙∙∙return∙nn;◊∙∙}◊}

Key for Char Stream:

◊ newline \n
∙ space

CLASS ID:C LBRACE PUBLIC INT ID:fac LPAREN INT ID:n RPAREN LBRACE INT ID:nn SEMI IF LPAREN ID:n LT ILIT:1 RPAREN ID:nn EQ ILIT:1 ELSE ID:nn EQ ID:n TIMES LPAREN ID:this DOT ID:fac LPAREN ID:n MINUS ILIT:1 RPAREN RPAREN SEMI RETURN ID:nn SEMI RBRACE RBRACE

A Token in your Java scanner

class Token {
  public int kind;        // eg: LPAREN, ID, ILIT
  public int line;        // for debugging/diagnostics
  public int column;      // for debugging/diagnostics
  public String lexeme;   // eg: "x", "Total", "(", "42"
  public int value;       // attribute of ILIT
}

Obviously this Token is wasteful of memory:
• lexeme is not required for primitive tokens, such as LPAREN, RBRACE, etc
• value is only required for ILIT

But, there's only 1 token alive at any instant during parsing, so no point refining into 3 leaner variants!


Typical Tokens

Operators & Punctuation
• Single chars: + - * = / ( ] ; :
• Double chars: :: <= == !=

Keywords
• if while for goto return switch void …

Identifiers
• A single ID token kind, parameterized by lexeme

Integer constants
• A single ILIT token kind, parameterized by int value

See jflex-1.5.0\examples\java\java.flex for real example

Token Spotting


if(a<=3)++grades[1]; // what are the tokens? (no spaces)

public int fac(int n) { // what are the tokens? (need spaces?)

Counter-example: fixed-format FORTRAN:

DO 50 I = 1,99    // DO loop
DO 50 I = 1.2     // assignment: DO50I = 1.2


Principle of Longest Match

Scanner should pick the longest possible string to make up the next token ("greedy" algorithm)

Example: return idx <= iffy;

should be scanned into 5 tokens:

RETURN ID:idx LEQ ID:iffy SEMI

• <= is one token, not two
• iffy is an ID, not IF followed by ID:fy
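The longest-match rule can be sketched in a few lines of Java. This is only an illustration, not the course scanner: the class name LongestMatch, the tiny operator table, and the use of strings for token kinds are all assumptions made for the example. Identifiers are consumed greedily before the keyword check, and the two-character operator is tried before its one-character prefix.

```java
import java.util.ArrayList;
import java.util.List;

public class LongestMatch {
    // Hypothetical operator table, ordered so that longer operators are tried first.
    private static final String[] OPS      = { "<=",  "<",  ";"    };
    private static final String[] OP_KINDS = { "LEQ", "LT", "SEMI" };

    public static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        int p = 0;
        while (p < src.length()) {
            char c = src.charAt(p);
            if (Character.isWhitespace(c)) { p++; continue; }
            if (Character.isLetter(c)) {
                int begin = p;
                // Greedy: take every identifier char before deciding what the lexeme is.
                while (p < src.length() && Character.isLetterOrDigit(src.charAt(p))) p++;
                String lexeme = src.substring(begin, p);
                switch (lexeme) {
                    case "return": out.add("RETURN"); break;
                    case "if":     out.add("IF");     break;
                    default:       out.add("ID:" + lexeme);
                }
                continue;
            }
            boolean matched = false;
            for (int i = 0; i < OPS.length && !matched; i++) {
                if (src.startsWith(OPS[i], p)) {   // longest operator wins
                    out.add(OP_KINDS[i]);
                    p += OPS[i].length();
                    matched = true;
                }
            }
            if (!matched) throw new IllegalArgumentException("unexpected char at index " + p);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("return idx <= iffy;"));
    }
}
```

Because the whole run of letters is read first, "iffy" never gets a chance to be split into IF and ID:fy.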


The lexical syntax (token structure) of most programming languages can be specified using Regular Expressions
• "REs" in Cooper&Torczon; "regex" is more common

Tokens can be recognized by a deterministic finite automaton (DFA)
• DFA (a Java class) is almost always generated from regex using a software tool, such as JFlex

Regex

Regex Cheat Sheet


Pattern    Matches?
a          a
a*         zero or more a's
a+         one or more a's
a?         zero or one a
a|b        a or b
ab         a followed by b
[c-f]      one of c or d or e or f
[^0-3]     any one character except 0-3
.          any character, except newline

Precedence: * (highest), concatenation, | (lowest)

Parentheses can be used to group regexs as needed

Notice the meta-characters (shown in red on the original slide)

Escaped characters: \* \+ \? \| \. \t \n


Regex Examples

regex Meaning?

[abc]+

[abc]* (Kleene closure)

[0-9]+

[1-9][0-9]*

[a-zA-Z_][a-zA-Z0-9_]*

(0|1)* 0

(a|b)*aa(a|b)*

Check free online Regex tutorials if you are rusty. Eg: http://regexone.com/ Experiment with a regex-capable editor. Eg: http://www.editpadpro.com/
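One quick way to experiment is with java.util.regex, whose syntax agrees with the cheat sheet for all of the patterns above. The sketch below is an illustration only; the class name RegexExamples and the sample strings are made up for the example, and the space in "(0|1)* 0" is dropped because Java would treat it as a literal character.

```java
import java.util.regex.Pattern;

public class RegexExamples {
    // True when the WHOLE string s is in the language of the regex.
    public static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        String[][] cases = {
            { "[abc]+",                 "abca"  },
            { "[abc]*",                 ""      },
            { "[0-9]+",                 "007"   },
            { "[1-9][0-9]*",            "007"   },   // leading zero is not allowed
            { "[a-zA-Z_][a-zA-Z0-9_]*", "_tmp1" },
            { "(0|1)*0",                "1010"  },   // binary strings ending in 0
            { "(a|b)*aa(a|b)*",         "abab"  },   // needs two consecutive a's
        };
        for (String[] c : cases) {
            System.out.println(c[0] + " matches \"" + c[1] + "\"? " + matches(c[0], c[1]));
        }
    }
}
```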



regex

Defined over some alphabet Σ
• For programming languages, alphabet is ASCII or Unicode

If re is a regular expression, L(re) is the language (set of strings) generated by re


regex macros

Possible syntax for numeric constants

Digit = [0-9]
Digits = Digit+
Number = Digits ( \. Digits )? ( [eE] ( \+ | - )? Digits )?

(The literal . and + are escaped here, since both are meta-characters)

How would you describe this set in English?

What are some examples of legal constants (strings) generated by Number?

Tools like JFlex accept these convenient macros
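The same macro idea can be imitated in plain Java by building the pattern from string constants. This is a sketch, not JFlex syntax: the class name NumberMacro and the method isNumber are hypothetical, and the literal "." and "+" are escaped because java.util.regex treats them as meta-characters.

```java
import java.util.regex.Pattern;

public class NumberMacro {
    // Each "macro" is just a string that later definitions splice in.
    static final String DIGIT  = "[0-9]";
    static final String DIGITS = DIGIT + "+";
    static final String NUMBER =
        DIGITS + "(\\." + DIGITS + ")?" + "([eE](\\+|-)?" + DIGITS + ")?";

    private static final Pattern NUMBER_PATTERN = Pattern.compile(NUMBER);

    // True when the whole string is a legal Number.
    public static boolean isNumber(String s) {
        return NUMBER_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "42", "3.14", "6.02e23", "1E-9", ".5", "1.", "1e" }) {
            System.out.println(s + " -> " + isNumber(s));
        }
    }
}
```

Under this definition a Number must start with a digit, a decimal point must be followed by at least one digit, and an exponent needs at least one digit after the optional sign.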


Finite automata (state machines) can be used to recognize strings generated by regular expressions

Can build automaton by-hand or automagically
• Will not build by-hand in this course
• Will use the JFlex tool: given a set of regex, it generates an automaton recognizer (a Java class)

Automata


Finite Automata Terminology

Phrase Abbreviation

Finite Automaton FA

Deterministic Finite Automaton DFA

Non-deterministic Finite Automaton NFA

Finite-State Automaton FSA = {DFA, NFA}


DFA for “cat”

[Figure: DFA for regex = cat. Start State --c--> --a--> --t--> Accepting State (double circles)]


DFA for ILIT

[Figure: DFA for ILIT. State 1 --[0-9]--> state 2; state 2 loops on [0-9]; state 2 is accepting]

We have labelled the states

regex = [0-9][0-9]* = [0-9]+


DFA for ID

[Figure: DFA for ID. State 0 --[a-zA-Z_]--> state 1; state 1 loops on [a-zA-Z0-9_]; state 1 is accepting]

regex = [a-zA-Z_][a-zA-Z0-9_]*

DFAs work like this . . .


1. scan the input text string, character-by-character

2. following the arc/edge corresponding to the character just read

3. if there is no arc for the character just read, then, either:

a. if you are in an accepting state: you're done. Success!

b. if you are not in an accepting state: you're done. Failure!

DFAs work like this - examples


1. Scan "fac(int n);" for the regex, alphaid = [a-z]+ (lower-case alphas)
   We hit "(" and are already in state 1. Success

2. Scan "23;" for regex alphaid
   There is no arc for "2". We are still in state 0. Failure

3. Scan "today" for regex alphaid
   We hit end-of-string and are already in state 1. Success

[Figure: DFA for alphaid. State 0 --[a-z]--> state 1; state 1 loops on [a-z]; state 1 is accepting]

Note: no need to add arcs to the DFA for all error cases - they are implicit
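The three examples above can be replayed with a small simulator. This is a sketch under assumptions: the class name AlphaIdDfa, the ERROR constant, and the convention of returning the matched length (or -1) are all invented for illustration. The implicit error arcs show up as the single ERROR result.

```java
public class AlphaIdDfa {
    private static final int ERROR = -1;   // stands for every arc not drawn in the DFA

    // Transition function for alphaid = [a-z]+ : state 0 is the start, state 1 is accepting.
    private static int step(int state, char c) {
        boolean lower = c >= 'a' && c <= 'z';
        switch (state) {
            case 0:  return lower ? 1 : ERROR;
            case 1:  return lower ? 1 : ERROR;
            default: return ERROR;
        }
    }

    // Returns the number of chars consumed if we stop in the accepting state, else -1.
    public static int match(String input) {
        int state = 0;
        int p = 0;
        while (p < input.length()) {
            int next = step(state, input.charAt(p));
            if (next == ERROR) break;      // no arc for this char: stop scanning
            state = next;
            p++;
        }
        return state == 1 ? p : -1;
    }

    public static void main(String[] args) {
        System.out.println(match("fac(int n);"));  // stops at "(" in state 1
        System.out.println(match("23;"));          // no arc for "2" from state 0
        System.out.println(match("today"));        // end-of-string in state 1
    }
}
```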

Thompson’s Construction: Combining DFAs


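[Figure: DFA for: a, and DFA for: b]

[Figure: NFA for: ab. The DFA for a is joined to the DFA for b by an ε transition]

[Figure: NFA for a|b. ε transitions lead from a new start state into the DFAs for a and for b, and ε transitions lead from each of them to a new accepting state]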

Combining DFAs, cont’d


[Figure: DFA for: a, and DFA for: b]

[Figure: NFA for: a*. The DFA for a, with added ε transitions that allow it to be skipped or repeated]

Exercise

Draw the NFA for: b(at|ag) | bug


[Figure: NFA for b(at|ag) | bug, with one path for each of b-a-t, b-a-g and b-u-g]


NFA for a(b|c)*


[Figure: NFA for a(b|c)*]

To recognize "acb" successfully, we need to:

• guess the future correctly
• backtrack and retry if we fail to recognize
• somehow execute all possible paths

None of these is attractive! Can we construct an equivalent DFA?


Finite State Automaton (FSA)

A finite set of states
• One marked as initial state
• One or more marked as final states
• States sometimes labeled or numbered

A set of transitions from state to state
• Each labeled with symbol from Σ, or ε

Operate by reading input symbols (usually characters)
• Transition can be taken if labeled with current symbol
• ε-transition can be taken at any time (free bus ride)

Accept when final state reached & no more input
• Scanner uses an FSA as a subroutine – accept longest match from current location each time called, even if more input

Reject if no transition possible, or no more input and not in final state (DFA)


DFA vs NFA

Deterministic Finite Automata (DFA)
• No choice of which transition to take
• In particular, no ε transitions
• No guessing

Non-deterministic Finite Automata (NFA)
• Choice of transition in at least one case
• Accepts if some way to reach final state on given input
• Reject if no possible way to final state
• How to implement in software?


DFAs in Scanners

We really want DFA for speed: no backtracking, no guessing, no foretelling the future

Conversion from regex to NFA is easy, right?

But how to turn an NFA into an equivalent DFA?

Turns out to be obvious (once seen) and easy

NFA to DFA


Starting with the above NFA, we want to 'collapse' epsilon edges, ending up with a DFA that recognizes, and rejects, the same char strings. Ideally, we will end up with:

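[Figure: target DFA. State 0 --a--> state 1; state 1 loops on b and on c]

[Figure: NFA for a(b|c)*, states 0 to 9. Labelled arcs: 0 --a--> 1, 4 --b--> 5, 6 --c--> 7; all other arcs are ε transitions]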

NFA to DFA


[Figure: NFA for a(b|c)*, states 0 to 9. Labelled arcs: 0 --a--> 1, 4 --b--> 5, 6 --c--> 7; all other arcs are ε transitions]

• Begin in the Start state
• For each labelled arc leaving that state, what set of states can I reach, along the labelled arc, or along ε transitions?

NFA to DFA


[Figure: NFA for a(b|c)*, states n0 to n9. Labelled arcs: n0 --a--> n1, n4 --b--> n5, n6 --c--> n7; all other arcs are ε transitions]

DFA State (set of NFA states)    a                       b                       c
d0 = {0}                         d1 = {1,2,3,4,6,9}      none                    none
d1 = {1,2,3,4,6,9}               none                    d2 = {3,4,5,6,8,9}      d3 = {3,4,6,7,8,9}
d2 = {3,4,5,6,8,9}               none                    d2 = {3,4,5,6,8,9}      d3 = {3,4,6,7,8,9}
d3 = {3,4,6,7,8,9}               none                    d2 = {3,4,5,6,8,9}      d3 = {3,4,6,7,8,9}

NFA to DFA


DFA for a(b|c)*

[Figure: DFA with states d0, d1, d2, d3; arcs as in the table below]

DFA State    a     b     c
d0           d1    -     -
d1           -     d2    d3
d2           -     d2    d3
d3           -     d2    d3

NFA to DFA - Even Better


DFA for a(b|c)*

[Figure: minimized DFA. d0 --a--> d1; d1 loops on b and on c]

• Can reduce number of states further, to yield above result

• If interested, see books for details

• State minimization is not examined in P501


From NFA to DFA

Subset construction (equivalence class)
• Construct DFA from NFA, where each DFA state represents a set of NFA states

Key idea
• State of DFA after reading some input is the set of all states the NFA could have reached after reading the same input

Algorithm: example of a fixed-point computation

If NFA has n states, DFA has at most 2^n states => DFA is finite, can construct in finite # steps
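The worklist form of subset construction can be sketched as follows, using the NFA for a(b|c)* from the earlier slides. Treat it as an illustration under assumptions: the class name SubsetConstruction and its helper names are invented, and the ε edges are reconstructed so that they agree with the state sets in the earlier table (the original diagram did not survive in this text).

```java
import java.util.*;

public class SubsetConstruction {
    // Assumed NFA for a(b|c)*: labelled edges {from, symbol, to} and ε edges {from, to}.
    static final int[][] SYM_EDGES = { {0, 'a', 1}, {4, 'b', 5}, {6, 'c', 7} };
    static final int[][] EPS_EDGES = {
        {1, 2}, {2, 3}, {2, 9}, {3, 4}, {3, 6}, {5, 8}, {7, 8}, {8, 3}, {8, 9}
    };
    static final String ALPHABET = "abc";
    static final int NFA_FINAL = 9;

    // All NFA states reachable from 'states' using only ε transitions.
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] e : EPS_EDGES)
                if (e[0] == s && result.add(e[1])) work.push(e[1]);
        }
        return result;
    }

    // NFA states reachable from 'states' along one arc labelled c.
    static Set<Integer> move(Set<Integer> states, char c) {
        Set<Integer> result = new TreeSet<>();
        for (int[] e : SYM_EDGES)
            if (e[1] == c && states.contains(e[0])) result.add(e[2]);
        return result;
    }

    // A DFA state accepts if it contains the NFA's final state.
    static boolean accepting(Set<Integer> dfaState) {
        return dfaState.contains(NFA_FINAL);
    }

    // Maps each DFA state (a set of NFA states) to its outgoing transitions.
    public static Map<Set<Integer>, Map<Character, Set<Integer>>> build() {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = closure(Set.of(0));
        dfa.put(start, new TreeMap<>());
        work.add(start);
        while (!work.isEmpty()) {              // fixed point: stop when no new sets appear
            Set<Integer> d = work.remove();
            for (char c : ALPHABET.toCharArray()) {
                Set<Integer> target = closure(move(d, c));
                if (target.isEmpty()) continue;   // implicit error transition
                dfa.get(d).put(c, target);
                if (!dfa.containsKey(target)) {
                    dfa.put(target, new TreeMap<>());
                    work.add(target);
                }
            }
        }
        return dfa;
    }

    public static void main(String[] args) {
        build().forEach((state, arcs) ->
            System.out.println(state + (accepting(state) ? " (accepting) " : " ") + arcs));
    }
}
```

The loop terminates because there are only finitely many subsets of NFA states, and each subset enters the worklist at most once.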

Build DFA for: b(at|ag) | bug from its NFA


[Figure: NFA for b(at|ag) | bug, states 0 to 12]

DFA State         a    b            g    t    u
d0 = {0}          -    {1,2,5,9}    -    -    -
d1 = {1,2,5,9}    ?    ?            ?    ?    ?
?                 ?    ?            ?    ?    ?

Build DFA for: b(at|ag) | bug from its NFA


[Figure: NFA for b(at|ag) | bug, states 0 to 12]

DFA State         a           b                 g               t              u
d0 = {0}          -           d1 = {1,2,5,9}    -               -              -
d1 = {1,2,5,9}    d2 = {3,6}  -                 -               -              d3 = {10}
d2 = {3,6}        -           -                 d4 = {7}        d5 = {4,12}    -
d3 = {10}         -           -                 d6 = {11,12}    -              -
TBD               ?           ?                 ?               ?              ?

Idea: show a hand-written DFA for some typical tokens
• Then use to construct hand-written scanner

Setting: Parser calls scanner whenever it wants next token
• JFlex provides next_token
• Scanner stores current position in input

For illustration only. Course project will use JFlex scanner-generator

Note - most commercial compilers use hand-written scanners - generally faster


Hand-Written Scanner


Scanner DFA Example – Part 1

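[Figure: scanner DFA with start state 0, given here as a list of transitions]

0 --whitespace or comments--> 0
0 --end of input--> 1     Accept EOF
0 --(--> 2                Accept LPAREN
0 --)--> 3                Accept RPAREN
0 --;--> 4                Accept SEMI

0 --!--> 5
5 --=--> 6                Accept NEQ
5 --[other]--> 7          Accept NOT

0 --<--> 8
8 --=--> 9                Accept LEQ
8 --[other]--> 10         Accept LESS

0 --[0-9]--> 11
11 --[0-9]--> 11
11 --[other]--> 12        Accept ILIT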


Strategies for handling identifiers vs keywords
• Hand-written scanner: look up identifier-like things in table of keywords
• Machine-generated scanner: generate DFA with appropriate transitions to recognize keywords

0 --[a-zA-Z]--> 13
13 --[a-zA-Z0-9_]--> 13
13 --[other]--> 14        Accept ID or keyword
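The hand-written strategy, looking up identifier-like lexemes in a keyword table, might look like the sketch below. The class name Keywords, the method classify, the choice of keywords, and the use of strings for token kinds are assumptions made for illustration.

```java
import java.util.Map;

public class Keywords {
    // Hypothetical keyword table: lexeme -> token kind.
    private static final Map<String, String> KEYWORDS = Map.of(
        "class",  "CLASS",
        "public", "PUBLIC",
        "int",    "INT",
        "if",     "IF",
        "else",   "ELSE",
        "return", "RETURN");

    // Called after the scanner has consumed a complete identifier-like lexeme.
    public static String classify(String lexeme) {
        String kind = KEYWORDS.get(lexeme);
        return kind != null ? kind : "ID:" + lexeme;
    }

    public static void main(String[] args) {
        System.out.println(classify("if"));
        System.out.println(classify("iffy"));
    }
}
```

The lookup happens only after the whole lexeme has been read, so this approach also respects the longest-match rule.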

Scanner – class, ctor, skipWhite

public class Scanner {
  private String prog;    // the MiniJava program to be scanned
  private int p;          // index in 'prog' of current char

  public Scanner(String prog) {
    this.prog = prog;
    p = 0;
  }

  // Advance p past any whitespace; stops safely at end of input
  private void skipWhite() {
    while (p < prog.length() && Character.isWhitespace(prog.charAt(p))) p++;
  }


Scanner- id

  private Token id() {
    int pBegin = p;                          // remember begin index of id
    // current char is alphabetic; consume letters, digits and '_' up to end of input
    while (p < prog.length()) {
      char c = prog.charAt(p);
      if (!(Character.isAlphabetic(c) || Character.isDigit(c) || c == '_')) break;
      p++;
    }
    return new Token(ID, prog.substring(pBegin, p));
  }


Scanner - iLit

  private Token iLit() {
    int pBegin = p;                          // remember begin index of lexeme
    int val = 0;
    // step thru chars of number, accumulating its value; stop at first non-digit
    while (p < prog.length() && Character.isDigit(prog.charAt(p))) {
      val = 10 * val + Character.getNumericValue(prog.charAt(p));
      p++;
    }
    String lex = prog.substring(pBegin, p);
    return new Token(ILIT, lex, val);        // kind is ILIT, not ID
  }


Scanner - nextToken

  public Token nextToken() {
    skipWhite();                                         // returns at prog[p]
    if (p >= prog.length()) return new Token(EOF, "");   // end of input
    char c = prog.charAt(p);                             // current char in 'prog'
    char n = (p + 1 < prog.length()) ? prog.charAt(p + 1) : '\0';  // next char, if any

    switch (c) {
      case '>':
        if (n == '=') { p += 2; return new Token(GEQ, ">="); }
        else          { p++;    return new Token(GT, ">"); }
      // . . .
      case '+': p++; return new Token(PLUS, "+");
      // . . .
    } // end of switch


Scanner – nextToken, cont’d

    if (Character.isDigit(c)) {
      return this.iLit();
    } else if (Character.isAlphabetic(c)) {
      return this.id();
    } else {
      p++;                                   // skip the bad char so the next call makes progress
      return new Token(BAD, "");
    }
  } // end of nextToken

} // end of class Scanner


An entire hand-written scanner for MiniJava takes ~100 lines of Java


Since the 60s, the syntax of every significant programming language has been specified by a formal grammar

First done in 1959 with BNF (Backus-Naur Form); used to specify ALGOL 60 syntax

Borrowed from the linguistics community (Noam Chomsky)

Grammars & BNF


Grammar for a Tiny Language

program    → statement | program statement
statement  → assignStmt | ifStmt
assignStmt → id = expr ;
ifStmt     → if ( expr ) statement
expr       → id | ilit | expr + expr
id         → a | b | c | i | j | k | n | x | y | z
ilit       → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Note: often see ::= used instead of →


Example Derivation

a = 1 ; if ( a + 1 ) b = 2 ;

program    ::= statement | program statement
statement  ::= assignStmt | ifStmt
assignStmt ::= id = expr ;
ifStmt     ::= if ( expr ) statement
expr       ::= id | ilit | expr + expr
id         ::= a | b | c | i | j | k | n | x | y | z
ilit       ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Abbreviated:

P    → S | P S
S    → A | I
A    → id = E ;
I    → if ( E ) S
E    → id | ilit | E + E
id   → [a-z]
ilit → [0-9]


Parse Tree - First Few Steps

a = 1 ; if ( a + 1 ) b = 2 ;

P
  P
    S
      A
        id        (a)
        =
        E
          ilit    (1)
        ;
  S



Parse Tree - Complete

a = 1 ; if ( a + 1 ) b = 2 ;

P
  P
    S
      A
        id        (a)
        =
        E
          ilit    (1)
        ;
  S
    I
      if
      (
      E
        E
          id      (a)
        +
        E
          ilit    (1)
      )
      S
        A
          id      (b)
          =
          E
            ilit  (2)
          ;



Alternative Notations

There are several syntax notations for productions in common use; all mean the same thing

ifStmt ::= if ( expr ) statement

ifStmt → if ( expr ) statement

<ifStmt> ::= if ( <expr> ) <statement>


Formal Languages & Automata Theory

Alphabet: a finite set of symbols ( eg: [a-zA-Z0-9_] )

String: a finite, possibly empty sequence of symbols from an alphabet

Language: a set, often infinite, of strings

Finite specifications of (possibly infinite) languages
• Grammar – a generator; a system for producing all strings in the language (and no other strings)

A particular language may be specified by many different grammars

A grammar specifies only one language


Productions

The rules of a grammar are called productions

Rules contain
• Nonterminal symbols: grammar variables (program, statement, id, etc)
• Terminal symbols: concrete syntax that appears in programs (a, b, c, 0, 1, if, (, ), … )

Meaning of: nonterminal → <sequence of terminals and non-terminals>

In a derivation, an instance of the non-terminal can be replaced by the sequence of terminals and non-terminals on its RHS

Often, there are two or more productions for one nonterminal – use any in different parts of derivation


Two ways to Parse

Parse: re-construct the derivation (syntactic structure) of a program

More prosaically: fill the gap between top and bottom of page with a parse tree:

Start at top; build tree downwards, sweeping left-to-right. This is called a "top-down" parse. What we just did for the "Tiny Language" example

Start at bottom; build little trees that join upwards. Called a "bottom-up" parse. What CUP does for us.


Why Separate Scanner and Parser?

In principle, a single recognizer could work directly from a concrete, character-by-character grammar

In practice this is never done: always scan chars to tokens, because:

Simplicity & Separation of Concerns
• Scanner hides details from parser (comments, whitespace, input files, etc)
• Parser becomes easier to build; has simpler input - stream-of-tokens

Efficiency
• Scanner can use simpler, fast design
• But still often consumes a surprising amount of the compiler's total execution time - it touches every char in source program


Project Notes

For MiniJava project
• Use JFlex scanner-generator tool
• Use CUP parser-generator tool
• The two work together

CUP generates a file of token kinds into sym.java (SEMI = 28, LT = 18, etc)

JFlex needs these definitions. To bootstrap this process, inspect the MiniJava grammar and devise your own set of token kinds

See MiniJava page at: http://www.cambridge.org/resources/052182060X/


Homework: paper exercises on regex and FAs

Next week: first part of the compiler assignment – the scanner

Send partner info to Nat if you want project space

Next topic: parsing
• Will do LR parsing first, for the project (CUP)
• Cooper&Torczon chapter 3

Next
