
CS308 Compiler Principles

Lexical Analyzer

Li Jiang

Department of Computer Science and Engineering

Shanghai Jiao Tong University

Compiler Principles

Outline

• Content:

• Basic concepts: pattern, lexeme, and token.

• Operations on languages, and regular expression

• Recognition of tokens

• Finite automata, including NFA and DFA

• Conversion from regular expression to NFA and DFA

• Optimization of lexical analyzer


Lexical Analyzer

• The lexical analyzer reads the source program character by character to produce tokens.

– strips out comments and whitespace

– returns a token each time the parser asks for one

– correlates error messages with the source program


Token

• A token is a pair of a token name and an optional attribute value.

– Token name specifies the pattern of the token

– Attribute stores the lexeme of the token

• Tokens

– Keyword: “begin”, “if”, “else”, …

– Identifier: string of letters or digits, starting with a letter

– Integer: a non-empty string of digits

– Punctuation symbol: “,”, “;”, “(”, “)”, …

• Regular expressions are widely used to specify patterns of the tokens.
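The token classes listed above can be sketched with Python's `re` module. This is only an illustration, not the course's implementation: the keyword set, the pattern table `TOKEN_SPEC`, and the function name `tokenize` are assumptions made for this example.

```python
import re

# Illustrative keyword subset taken from the slide; a real language has more.
KEYWORDS = {"begin", "if", "else"}

# (token name, pattern) pairs; names and patterns are assumptions for this sketch.
TOKEN_SPEC = [
    ("integer",     r"[0-9]+"),                # non-empty string of digits
    ("identifier",  r"[A-Za-z][A-Za-z0-9]*"),  # letters or digits, starting with a letter
    ("punctuation", r"[,;()]"),
    ("skip",        r"\s+"),                   # whitespace is stripped, not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (token name, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "skip":
            continue
        if kind == "identifier" and lexeme in KEYWORDS:
            kind = "keyword"                   # keywords look like identifiers
        tokens.append((kind, lexeme))
    return tokens
```

Each returned pair plays the role of a token name plus its attribute (here simply the lexeme).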


Attributes of Token

• Information for subsequent compiler phases about the particular lexeme

– Token name influences parsing decisions

– Attribute value influences translation of tokens after the parse

• Attributes of identifier

– Lexeme, type, location

– Stored in symbol table

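• Tricky problem

– Fortran: DO 5 I = 1.25 vs. DO 5 I = 1,25 (blanks are insignificant, so the first is an assignment to the variable DO5I and the second starts a DO loop; the lexer cannot tell which until it reaches the "." or the ",")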


Token Example


Input Buffering

• Why does a compiler need buffers?

• Buffer pairs: two buffers that are reloaded alternately

• Two pointers

– lexemeBegin

– forward

• Sentinels: a mark for buffer end

• Limitation: the scheme breaks down if length of lexeme + lookahead distance > buffer size

Compiler Principles

Lookahead with Sentinels
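A minimal sketch of advancing the forward pointer over a buffer pair with sentinels. The class name `BufferPair`, the method names, and the use of `"\0"` as the sentinel are assumptions for this example; it also assumes the input never contains the sentinel character and that `read(n)` returns fewer than `n` characters only at end of input.

```python
import io

SENTINEL = "\0"  # assumed never to occur in the source text

class BufferPair:
    """Two halves of n characters, each followed by one sentinel slot."""

    def __init__(self, stream, n=4096):
        self.stream = stream
        self.n = n
        self.buf = [SENTINEL] * (2 * n + 2)
        self.forward = 0
        self._load(0)

    def _load(self, start):
        # Fill one half; the sentinel lands right after the last character read,
        # which is the end of the half or the true end of input.
        data = self.stream.read(self.n)
        for i, ch in enumerate(data):
            self.buf[start + i] = ch
        self.buf[start + len(data)] = SENTINEL

    def next_char(self):
        """Return the next character, or None at end of input."""
        ch = self.buf[self.forward]
        if ch != SENTINEL:              # common case: a single test per character
            self.forward += 1
            return ch
        if self.forward == self.n:      # sentinel at end of first half
            self._load(self.n + 1)
            self.forward = self.n + 1
            return self.next_char()
        if self.forward == 2 * self.n + 1:  # sentinel at end of second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                     # sentinel inside a half: real end of input
```

The point of the sentinel is that the common path performs one comparison per character; the buffer-boundary checks run only when a sentinel is seen. The sketch leaves out lexemeBegin and retraction.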


Terminology of Languages

• Alphabet: a finite set of symbols

– ASCII

– Unicode

• String: a finite sequence of symbols on an alphabet

– ε is the empty string

– |s| is the length of string s

– Concatenation: xy represents x followed by y

– Exponentiation: s^n = s s s .. s (n times), s^0 = ε

• Language: a set of strings over some fixed alphabet

– ∅, the empty set, is a language

– The set of well-formed C programs is a language


Operations on Languages

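• Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }

• Concatenation: L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }

• (Kleene) Closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ ... (the union of L^i for all i ≥ 0)

• Positive Closure: L+ = L^1 ∪ L^2 ∪ L^3 ∪ ... (the union of L^i for all i ≥ 1)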


Example

• L1 = {a,b,c,d} L2 = {1,2}

• L1 ∪ L2 = {a,b,c,d,1,2}

• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}

• L1* = all strings using letters a,b,c,d, including the empty string

• L1+ = all strings using letters a,b,c,d, without the empty string
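The operations can be tried out on finite languages represented as Python sets of strings. The closures are infinite, so this sketch only enumerates them up to a chosen power; the helper names (`union`, `concat`, `power`, `closure_up_to`) are made up for this example.

```python
from itertools import product

def union(l1, l2):
    return l1 | l2

def concat(l1, l2):
    return {x + y for x, y in product(l1, l2)}

def power(lang, i):
    """L^i: i-fold concatenation of lang with itself; L^0 = {ε}."""
    result = {""}                      # "" plays the role of ε
    for _ in range(i):
        result = concat(result, lang)
    return result

def closure_up_to(lang, n, positive=False):
    """Finite approximation of L* (or L+): union of L^i for i up to n."""
    start = 1 if positive else 0
    out = set()
    for i in range(start, n + 1):
        out |= power(lang, i)
    return out

L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}
```

With these definitions, `concat(L1, L2)` yields the eight strings a1 ... d2 from the example, and the empty string belongs to `closure_up_to(L1, 2)` but not to `closure_up_to(L1, 2, positive=True)`.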


Regular Expressions

• A regular expression is a representation of a language that can be built from operators applied to the symbols of some alphabet.

• A regular expression is built up of smaller regular expressions (using defining rules).

• Each regular expression r denotes a language L(r).

• A language denoted by a regular expression is called a regular set.


Regular Expressions (Rules)

Regular expressions over alphabet Σ

Reg. Expr        Language it denotes
ε                L(ε) = {ε}
a ∈ Σ            L(a) = {a}
(r1) | (r2)      L(r1) ∪ L(r2)
(r1) (r2)        L(r1) L(r2)
(r)*             (L(r))*
(r)              L(r)

Extension
(r)+ = (r)(r)*   (L(r))+          positive closure
(r)? = (r) | ε   L(r) ∪ {ε}       zero or one instance
[a1-an]          L(a1|a2|…|an)    character class


Regular Expressions (cont.)

• We may remove parentheses by using precedence rules:

– * highest

– concatenation second highest

– | lowest

• (a(b)*)|(c) can be written as ab*|c

• Example:

– Σ = {0,1}

– 0|1 => {0,1}

– (0|1)(0|1) => {00,01,10,11}

– 0* => {ε,0,00,000,0000,....}

– (0|1)* => all strings with 0 and 1, including the empty string
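The examples can be checked by brute force: enumerate every string over the alphabet up to some length and keep those that the expression matches in full. Python's `re` syntax happens to use the same precedence for `*`, concatenation and `|`; the helper name `language` and its parameters are assumptions for this sketch.

```python
import re
from itertools import product

def language(regex, alphabet="01", max_len=3):
    """List, in order of length, the strings up to max_len that regex denotes."""
    pattern = re.compile(regex)
    words = []
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            word = "".join(symbols)
            if pattern.fullmatch(word):   # the whole string must match
                words.append(word)
    return words
```

For instance, `language("(0|1)(0|1)")` returns the four strings 00, 01, 10, 11, and `language("ab*|c", alphabet="abc", max_len=2)` shows that the unparenthesized form means (a(b)*)|(c).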

Compiler Principles

Lex regular expression


Regular Definitions

• We can give names to regular expressions, and use these names as symbols to define other regular expressions.

• A regular definition is a sequence of definitions of the form:

d1 → r1
d2 → r2
…
dn → rn

where each di is a new symbol and each ri is a regular expression over the symbols in Σ ∪ {d1,d2,...,di-1}, that is, the alphabet plus the previously defined symbols.


Regular Definitions Example

• Example: Identifiers in Pascal

letter → A | B | ... | Z | a | b | ... | z

digit → 0 | 1 | ... | 9

id → letter (letter | digit)*

– If we try to write the regular expression representing identifiers without using regular definitions, that regular expression will be complex:

(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) )*

Q: unsigned numbers (integer or floating point)
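Regular definitions map naturally onto named pattern strings that are spliced into later patterns. The sketch below does this with Python's `re`. The definition for unsigned numbers is one common textbook-style answer to the question above, not necessarily the one intended in the lecture, and all variable and function names are assumptions.

```python
import re

# Each name is defined only in terms of the alphabet and earlier names.
letter = r"[A-Za-z]"
digit = r"[0-9]"
ident = rf"{letter}(?:{letter}|{digit})*"

# One possible definition of unsigned numbers (assumed, textbook style):
digits = rf"{digit}+"
optional_fraction = rf"(?:\.{digits})?"
optional_exponent = rf"(?:E[+-]?{digits})?"
number = rf"{digits}{optional_fraction}{optional_exponent}"

def is_identifier(s):
    return re.fullmatch(ident, s) is not None

def is_unsigned_number(s):
    return re.fullmatch(number, s) is not None
```

Under this definition `6.02E23` and `1E-5` are unsigned numbers, while `1.` and `.5` are not; a different language may well choose otherwise.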


Quiz

1. All strings of lowercase letters that contain the five vowels in order.

2. All strings of lowercase letters in which the letters are in ascending lexicographic order.

3. Comments, consisting of a string surrounded by /* and */, without an intervening */, unless it is inside double-quotes ("). [HOMEWORK]


Recognition of token

[Figure: a grammar and the regular definitions for its tokens]

• The regular definitions express the pattern.

• The lexical analyzer finds a prefix of the input that is a lexeme matching the pattern.


Transition Diagram

• State: represents a condition that could occur during scanning

– start/initial state:

– accepting/final state: lexeme found

– intermediate state:

• Edge: directs from one state to another, labeled with one or a set of symbols


Transition Diagram for relop

Transition diagram for relop → < | > | <= | >= | = | <>

Q: Among the lexemes that match the pattern for relop, which ones can we still be looking at in a given state?


Transition-Diagram-Based Lexical Analyzer

Implementation of relop transition diagram

• A variable holds the number of the current state.

• A switch statement or multi-way branch on the state determines the next state by reading and examining the next input character.

• In each case: find the edge, then take the action.
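A minimal sketch of such a state-driven recognizer for relop. The state numbers (0, 1, 6) follow the usual textbook diagram, which is an assumption here since the slide's figure is not reproduced; the function name `get_relop` and the attribute names LT, LE, NE, EQ, GT, GE are also assumptions.

```python
def get_relop(text, pos=0):
    """Return ("relop", attribute, next_pos), or None if no relop starts at pos."""
    state = 0                                       # holds the current state number
    while True:
        c = text[pos] if pos < len(text) else ""    # "" stands for end of input
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ", pos + 1)
            elif c == ">":
                state = 6
            else:
                return None                         # not a relop: let another diagram try
            pos += 1
        elif state == 1:                            # seen "<"
            if c == "=":
                return ("relop", "LE", pos + 1)
            if c == ">":
                return ("relop", "NE", pos + 1)
            return ("relop", "LT", pos)             # retract: c is not part of the lexeme
        elif state == 6:                            # seen ">"
            if c == "=":
                return ("relop", "GE", pos + 1)
            return ("relop", "GT", pos)             # retract
```

The two "retract" returns correspond to accepting states reached on a character that does not belong to the lexeme, so the input position is not advanced past it.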


Transition Diagram for Others

A transition diagram for id's

A transition diagram for unsigned numbers

Q: What about the transition diagram of letter/digit?


Finite Automata

• A finite automaton is a recognizer: it takes a string and answers “yes” if the string matches a pattern of a specified language, and “no” otherwise.

• Two kinds:

– Nondeterministic finite automaton (NFA): no restriction on the labels of its edges

– Deterministic finite automaton (DFA): for each state and each input symbol, exactly one edge labeled with that symbol leaves the state

• NFAs and DFAs have the same capability: both recognize exactly the regular languages

• We may use either an NFA or a DFA as a lexical analyzer


Nondeterministic Finite Automaton (NFA)

• An NFA consists of:

– S: a set of states

– Σ: a set of input symbols (alphabet)

– A transition function: maps state-symbol pairs to sets of states

– s0: a start (initial) state

– F: a set of accepting states (final states)

• An NFA can be represented by a transition graph

• It accepts a string x if and only if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x.

• Remarks

– The same symbol can label edges from one state to several different states

– An edge may be labeled by ε, the empty string


NFA Example (1)

The language recognized by this NFA is (a|b)*ab

NFA Example (2)

NFA accepting aa*|bb*

Implementing an NFA

S ← ε-closure({s0})        { set of all states accessible from s0 by ε-transitions }
c ← nextchar
while (c != eof) do
begin
    S ← ε-closure(move(S,c))
    c ← nextchar
end
if (S ∩ F != ∅) then       { if S contains an accepting state }
    return “yes”
else
    return “no”

{ move(S,c): set of all states accessible from a state in S by a transition on c }

• This is the subset construction carried out on the fly.

• Backtracking may be needed to identify the longest match.
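The simulation can be written out in a few lines of Python. The NFA below is a hand-made one for aa*|bb* with hypothetical state numbers (the slide's figure is not reproduced), stored as a dictionary from (state, label) to a set of states, with the empty string standing for ε.

```python
EPS = ""  # label used for ε-transitions

# Hypothetical NFA for aa*|bb*; state numbering is an assumption.
NFA = {
    (0, EPS): {1, 3},
    (1, "a"): {2},
    (2, "a"): {2},
    (3, "b"): {4},
    (4, "b"): {4},
}
START, ACCEPTING = 0, {2, 4}

def eps_closure(states, delta):
    """All states reachable from `states` using ε-transitions only."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(states, c, delta):
    """All states reachable from a state in `states` by one transition on c."""
    result = set()
    for s in states:
        result |= delta.get((s, c), set())
    return result

def nfa_accepts(text, delta=NFA, start=START, accepting=ACCEPTING):
    current = eps_closure({start}, delta)
    for c in text:
        current = eps_closure(move(current, c, delta), delta)
    return bool(current & accepting)   # S ∩ F != ∅
```

This version answers yes or no for a whole string; it does not implement the longest-match backtracking mentioned above.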


Exercise 3

• For the NFA in the following figure, indicate all the paths labeled aabb. Does the NFA accept aabb?

• Give the transition table.

Paths labeled aabb (double parentheses mark the accepting state):

- (0) -a-> (1) -a-> (2) -b-> (2) -b-> ((3))
- (0) -a-> (1) -a-> (2) -b-> (2) -b-> (2)
- (0) -a-> (0) -a-> (0) -b-> (0) -b-> (0)
- (0) -a-> (0) -a-> (1) -b-> (1) -b-> (1)
- (0) -a-> (1) -a-> (1) -b-> (1) -b-> (1)
- (0) -a-> (1) -a-> (2) -b-> (2) -ε-> (0) -b-> (0)
- (0) -a-> (1) -a-> (2) -ε-> (0) -b-> (0) -b-> (0)

The first path ends in the accepting state 3, so the NFA accepts aabb.


Deterministic Finite Automaton (DFA)

• A Deterministic Finite Automaton (DFA) is a special form of an NFA.

– No state has an ε-transition

– For each symbol a and state s, there is at most one edge labeled a leaving s.

Q: What is the language recognized by this DFA?

A: (a|b)*ab


Practice

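• Draw the transition diagram for recognizing the following regular expression

a(a|b)*a

Nondeterministic (start state 1, accepting state 3):

(1) -a-> (2), (2) -a|b-> (2), (2) -a-> ((3))

Deterministic (start state 1, accepting state 3):

(1) -a-> (2), (2) -b-> (2), (2) -a-> ((3)), ((3)) -a-> ((3)), ((3)) -b-> (2)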


Implementing a DFA

s ← s0                     { start from the initial state }
c ← nextchar               { get the next character from the input string }
while (c != eof) do        { do until the end of the string }
begin
    s ← move(s,c)          { transition function }
    c ← nextchar
end
if (s in F) then           { if s is an accepting state }
    return “yes”
else
    return “no”
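The same loop in Python, for a hand-written DFA recognizing (a|b)*ab. The state numbers and the table layout (a dictionary of dictionaries) are assumptions for this sketch; a character with no entry in the table is treated as a move to a dead state.

```python
# Hypothetical DFA for (a|b)*ab: state 1 = "just saw a", state 2 = "just saw ab".
DFA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 0},
}

def dfa_accepts(text, dtran=DFA, start=0, accepting=frozenset({2})):
    s = start
    for c in text:
        if c not in dtran[s]:
            return False          # no transition on c: reject
        s = dtran[s][c]           # s <- move(s, c)
    return s in accepting
```

Only one state is tracked per input character, which is why a DFA simulation is faster than maintaining a set of NFA states.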


NFA vs. DFA

       Compactness   Readability   Speed
NFA    Good          Good          Slow
DFA    Bad           Bad           Fast

• Maintaining a set of states is more complex than keeping track of a single state.

• DFAs are widely used to build lexical analyzers.

[Figure: an NFA and a DFA that both recognize the language (a|b)*ab]

Pop Quiz

1) What are the languages represented by the two FAs?

[Figure: (a) an automaton with states 1 to 9 over {0,1}; (b) an automaton with states 1 to 5 over {a}]

Solution (a): strings of 0s and 1s with length 4, except 0110 (a fixed pattern)

Solution (b): a(aaaaa)* (a closure)


Regular Expression → NFA

• McNaughton-Yamada-Thompson (MYT) construction

– Simple and systematic (recursive up the parse tree for the regular expression)

– Construction starts from the simplest parts (alphabet symbols).

– For a complex regular expression, sub-expressions are combined to create its NFA.

– Guarantees the resulting NFA will have exactly one final state, and one start state.


MYT Construction

• Basic rules: for subexpressions with no operators

– For expression ε: start state i, accepting state f, (i) -ε-> ((f))

– For a symbol a in the alphabet: start state i, accepting state f, (i) -a-> ((f))


MYT Construction Cont’d

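• Inductive rules: for constructing larger NFAs from the NFAs of subexpressions (let N(r1) and N(r2) denote the NFAs for regular expressions r1 and r2, respectively)

– For regular expression r1 | r2

[Figure: a new start state i has ε-transitions to the start states of N(r1) and N(r2); the accepting states of N(r1) and N(r2) have ε-transitions to a new accepting state f]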

MYT Construction Cont’d

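– For regular expression r1r2

[Figure: N(r1) followed by N(r2); i is the start state of N(r1), f is the accepting state of N(r2), and the accepting state of N(r1) is merged with the start state of N(r2)]

– For regular expression r*

[Figure: new states i and f around N(r); ε-transitions lead from i to the start state of N(r) and to f, and from the accepting state of N(r) back to its start state and to f]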


Example: (a|b)*a

[Figures: the NFAs built step by step for a, b, (a|b), (a|b)* and finally (a|b)*a]

Properties of the Constructed NFA

1. N(r) has at most twice as many states as there are operators and operands in r.

– This bound follows from the fact that each step of the algorithm creates at most two new states.

2. N(r) has one start state and one accepting state. The accepting state has no outgoing transitions, and the start state has no incoming transitions.

3. Each state of N(r) other than the accepting state has either one outgoing transition on a symbol in Σ ∪ {ε} or two outgoing transitions, both on ε.
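A compact sketch of the MYT construction over a small tuple-based syntax tree. Two things are assumptions of this example rather than part of the slides: the tree encoding (`("sym", a)`, `("cat", x, y)`, `("alt", x, y)`, `("star", x)`) and the treatment of concatenation, which links the two fragments with an ε-edge instead of merging states. That variant is easier to code and keeps the state bound of property 1.

```python
EPS = ""  # label for ε-transitions

class Builder:
    """Collects the states and edges of the NFA under construction."""
    def __init__(self):
        self.delta = {}          # (state, label) -> set of target states
        self.num_states = 0

    def new_state(self):
        self.num_states += 1
        return self.num_states - 1

    def edge(self, src, label, dst):
        self.delta.setdefault((src, label), set()).add(dst)

def build(b, node):
    """Return (start, accept) of the NFA fragment for one syntax-tree node."""
    kind = node[0]
    if kind == "sym":                         # basic rule; ("sym", EPS) gives the ε rule
        i, f = b.new_state(), b.new_state()
        b.edge(i, node[1], f)
        return i, f
    if kind == "cat":
        i1, f1 = build(b, node[1])
        i2, f2 = build(b, node[2])
        b.edge(f1, EPS, i2)                   # variant: ε-link instead of merging
        return i1, f2
    if kind == "alt":
        i1, f1 = build(b, node[1])
        i2, f2 = build(b, node[2])
        i, f = b.new_state(), b.new_state()
        for start in (i1, i2):
            b.edge(i, EPS, start)
        for accept in (f1, f2):
            b.edge(accept, EPS, f)
        return i, f
    if kind == "star":
        i1, f1 = build(b, node[1])
        i, f = b.new_state(), b.new_state()
        b.edge(i, EPS, i1)
        b.edge(i, EPS, f)                     # zero repetitions
        b.edge(f1, EPS, i1)                   # repeat
        b.edge(f1, EPS, f)
        return i, f
    raise ValueError(f"unknown node kind {kind!r}")

def accepts(b, start, accept, text):
    """Simulate the built NFA on text."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            for t in b.delta.get((stack.pop(), EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen
    current = closure({start})
    for c in text:
        moved = set()
        for s in current:
            moved |= b.delta.get((s, c), set())
        current = closure(moved)
    return accept in current

# (a|b)*a
AST = ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "a"))
builder = Builder()
START, ACCEPT = build(builder, AST)
```

For (a|b)*a the tree has three operands and three operators, and the sketch creates 10 states, within the bound of 12 given by property 1.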


Conversion of an NFA to a DFA

• Approach: Subset Construction

– each state of the constructed DFA corresponds to a set / combination of NFA states

• Details

① Create transition table Dtran for the DFA

② Insert ε-closure(s0) into Dstates as the initial state

③ Pick an unvisited state T in Dstates

④ For each symbol a, create the state ε-closure(move(T, a)), and add it to Dstates (if it is new) and to Dtran

⑤ Repeat steps ③ and ④ until all states in Dstates are visited


The Subset Construction

• ε-closure(move(T, a)) simulates in parallel all possible moves the NFA can make on the input a.

NFA to DFA Example

NFA for (a|b)*abb

A = ε-closure({0}) = {0,1,2,4,7}; put A into DS as an unmarked state

mark A
ε-closure(move(A,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B; put B into DS
ε-closure(move(A,b)) = ε-closure({5}) = {1,2,4,5,6,7} = C; put C into DS
transfunc[A,a] ← B, transfunc[A,b] ← C

mark B
ε-closure(move(B,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B
ε-closure(move(B,b)) = ε-closure({5,9}) = {1,2,4,5,6,7,9} = D; put D into DS
transfunc[B,a] ← B, transfunc[B,b] ← D

mark C
ε-closure(move(C,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B
ε-closure(move(C,b)) = ε-closure({5}) = {1,2,4,5,6,7} = C
transfunc[C,a] ← B, transfunc[C,b] ← C
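The worked steps can be reproduced with a short subset-construction sketch. The NFA table below assumes the usual textbook numbering of states 0 to 10 for (a|b)*abb, chosen because it is consistent with the closures computed above; the slide's own figure is not reproduced, so treat the table as an assumption. Empty target sets are skipped rather than turned into an explicit dead state.

```python
EPS = ""  # label for ε-transitions

# Assumed NFA for (a|b)*abb, states 0..10, accepting state 10.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},    (8, "b"): {9},   (9, "b"): {10},
}

def eps_closure(states):
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(states, c):
    result = set()
    for s in states:
        result |= NFA.get((s, c), set())
    return result

def subset_construction(start, alphabet):
    """Return (dstates in discovery order, dtran)."""
    start_set = frozenset(eps_closure({start}))
    dstates = [start_set]
    dtran = {}
    unmarked = [start_set]
    while unmarked:
        T = unmarked.pop(0)                    # pick an unvisited state and mark it
        for a in alphabet:
            U = frozenset(eps_closure(move(T, a)))
            if not U:
                continue                       # no move on a: omit the dead state
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[(T, a)] = U
    return dstates, dtran

DSTATES, DTRAN = subset_construction(0, "ab")
```

`DSTATES[0]` to `DSTATES[3]` correspond to A, B, C and D above; the construction then finds one more state, reached from D on b, which contains NFA state 10 and is therefore accepting.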


NFA to DFA Example

[Figures: the NFA for (a|b)*abb, the transition table for the DFA, and the equivalent DFA]


Quiz 1

Suppose we have two tokens: (1) the keyword if, and (2) identifiers, which are strings of letters other than if. Show:

1. The NFA for these tokens, and

2. The DFA for these tokens

[Figures: the NFA and the DFA]

Regular Expression → DFA

• First, augment the given regular expression by concatenating a special symbol #

r → (r)# augmented regular expression

• Second, create a syntax tree for the augmented regular expression.

– All leaves are alphabet symbols (plus # and the empty string)

– All inner nodes are operators

• Third, number each alphabet symbol (plus #) (position numbers)


Regular Expression → DFA Cont’d

(a|b)*a → (a|b)*a# augmented regular expression

Syntax tree of (a|b)*a#

concatenation
├─ concatenation
│  ├─ *
│  │  └─ |
│  │     ├─ a (position 1)
│  │     └─ b (position 2)
│  └─ a (position 3)
└─ # (position 4)

• each symbol is at a leaf

• each symbol is numbered (positions)

• inner nodes are operators


followpos

Then we define the function followpos for the positions (positions assigned to leaves).

followpos(i) -- the set of positions which can follow the position i in the strings generated by the augmented regular expression.

Example: (a|b)*a# with positions a = 1, b = 2, a = 3, # = 4

followpos(1) = {1,2,3}

followpos(2) = {1,2,3}

followpos(3) = {4}

followpos(4) = {}

followpos() is defined only for leaves, not for inner nodes.

firstpos, lastpos, nullable

• To compute followpos, we need three more functions defined for the nodes (not just for leaves) of the syntax tree.

– firstpos(n) -- the set of the positions of the first symbols of strings generated by the sub-expression rooted by n.

– lastpos(n) -- the set of the positions of the last symbols of strings generated by the sub-expression rooted by n.

– nullable(n) -- true if the empty string is a member of strings generated by the sub-expression rooted by n; false otherwise

Usage of the Functions

(a|b)*a → (a|b)*a# augmented regular expression

[Figure: syntax tree of (a|b)*a#, where n is the concatenation node for (a|b)*a and m is the star node]

nullable(n) = false

nullable(m) = true

firstpos(n) = {1, 2, 3}

lastpos(n) = {3}


Computing nullable, firstpos, lastpos

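For each kind of node n: nullable(n), firstpos(n), lastpos(n)

n = leaf labeled ε:
    nullable(n) = true
    firstpos(n) = ∅
    lastpos(n) = ∅

n = leaf labeled with position i:
    nullable(n) = false
    firstpos(n) = {i}
    lastpos(n) = {i}

n = or-node c1 | c2:
    nullable(n) = nullable(c1) or nullable(c2)
    firstpos(n) = firstpos(c1) ∪ firstpos(c2)
    lastpos(n) = lastpos(c1) ∪ lastpos(c2)

n = cat-node c1 c2:
    nullable(n) = nullable(c1) and nullable(c2)
    firstpos(n) = if (nullable(c1)) firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n) = if (nullable(c2)) lastpos(c1) ∪ lastpos(c2) else lastpos(c2)

n = star-node c1*:
    nullable(n) = true
    firstpos(n) = firstpos(c1)
    lastpos(n) = lastpos(c1)

Straightforward recursion on the height of the tree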


Thinking

Extend the above table to include two more operations: (a) ? (b) +

n = ?-node c1?:
    nullable(n) = true
    firstpos(n) = firstpos(c1)
    lastpos(n) = lastpos(c1)

n = +-node c1+:
    nullable(n) = nullable(c1)
    firstpos(n) = firstpos(c1)
    lastpos(n) = lastpos(c1)

How to evaluate followpos

• Two rules define the function followpos:

1. If n is concatenation-node with left child c1 and right child c2, and i is a position in lastpos(c1), then all positions in firstpos(c2) are in followpos(i).

2. If n is a star-node, and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).

• If firstpos and lastpos have been computed for each node, followpos of each position can be computed by making one depth-first traversal of the syntax tree.
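The rules for nullable, firstpos, lastpos and followpos fit into one recursive traversal. The sketch below uses an assumed tuple encoding of the syntax tree (`("leaf", symbol, position)`, `("eps",)`, `("or", c1, c2)`, `("cat", c1, c2)`, `("star", c1)`); the function name `analyze` is likewise made up for this example.

```python
def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) of node and fill followpos as a side effect."""
    kind = node[0]
    if kind == "eps":
        return True, set(), set()
    if kind == "leaf":
        _, _symbol, pos = node
        followpos.setdefault(pos, set())
        return False, {pos}, {pos}
    if kind == "or":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for i in l1:                       # rule 1: firstpos(c2) follows lastpos(c1)
            followpos[i] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        _n1, f1, l1 = analyze(node[1], followpos)
        for i in l1:                       # rule 2: firstpos(n) follows lastpos(n)
            followpos[i] |= f1
        return True, f1, l1
    raise ValueError(f"unknown node kind {kind!r}")

# Syntax tree of (a|b)*a# with positions a=1, b=2, a=3, #=4
TREE = ("cat",
        ("cat",
         ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2))),
         ("leaf", "a", 3)),
        ("leaf", "#", 4))

FOLLOWPOS = {}
NULLABLE, FIRSTPOS, LASTPOS = analyze(TREE, FOLLOWPOS)
```

For this tree the traversal gives firstpos(root) = {1,2,3} and the followpos sets {1,2,3}, {1,2,3}, {4}, {} for positions 1 to 4.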



Example -- ( a | b) * a #

Node               firstpos    lastpos
a (position 1)     {1}         {1}
b (position 2)     {2}         {2}
a | b              {1,2}       {1,2}
(a|b)*             {1,2}       {1,2}
a (position 3)     {3}         {3}
(a|b)*a            {1,2,3}     {3}
# (position 4)     {4}         {4}
(a|b)*a#           {1,2,3}     {4}

Then we can calculate followpos

followpos(1) = {1,2,3}

followpos(2) = {1,2,3}

followpos(3) = {4}

followpos(4) = {}

• After we calculate follow positions, we are ready to create the DFA for the regular expression.

Algorithm (RE → DFA)

1. Create the syntax tree of (r) #

2. Calculate nullable, firstpos, lastpos, followpos

3. Put firstpos(root) into the states of DFA as an unmarked state.

4. while (there is an unmarked state S in the states of DFA) do

– mark S

– for each input symbol a do

• let s1,...,sn be the positions in S whose symbol is a

• S’ ← followpos(s1) ∪ ... ∪ followpos(sn)

• Dtran[S,a] ← S’

• if (S’ is not in the states of DFA)

– put S’ into the states of DFA as an unmarked state.

• the start state of DFA is firstpos(root)

• the accepting states of DFA are all states containing the position of #
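Step 4 of the algorithm, sketched in Python under the assumption that firstpos(root) and followpos have already been computed. The function name `build_dfa` and its parameters are made up for this example, and, unlike the algorithm as stated, the sketch skips empty target sets instead of adding a dead state.

```python
def build_dfa(first_root, followpos, symbol_at, end_pos):
    """Build DFA states (sets of positions) directly from followpos.

    symbol_at maps each position to its symbol; end_pos is the position of #.
    """
    start = frozenset(first_root)
    states = [start]
    dtran = {}
    unmarked = [start]
    alphabet = sorted({sym for pos, sym in symbol_at.items() if pos != end_pos})
    while unmarked:
        S = unmarked.pop()                        # mark S
        for a in alphabet:
            target = set()
            for p in S:
                if symbol_at[p] == a:             # positions in S whose symbol is a
                    target |= followpos[p]
            if not target:
                continue                          # no transition on a (dead state omitted)
            T = frozenset(target)
            if T not in states:
                states.append(T)
                unmarked.append(T)
            dtran[(S, a)] = T
    accepting = [S for S in states if end_pos in S]
    return start, states, dtran, accepting

# Data for (a|b)*a# taken from the slides
FOLLOW = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: set()}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "#"}
S1, STATES, DTRAN, ACCEPTING = build_dfa({1, 2, 3}, FOLLOW, SYMBOL_AT, end_pos=4)
```

On this input the sketch produces two states, {1,2,3} and {1,2,3,4}, with the second one accepting because it contains the position of #.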


Example -- ( a | b) * a # (positions a = 1, b = 2, a = 3, # = 4)

followpos(1)={1,2,3} followpos(2)={1,2,3} followpos(3)={4} followpos(4)={}

S1=firstpos(root)={1,2,3}

mark S1

a: followpos(1) ∪ followpos(3)={1,2,3,4}=S2 Dtran[S1,a]=S2

b: followpos(2)={1,2,3}=S1 Dtran[S1,b]=S1

mark S2

a: followpos(1) ∪ followpos(3)={1,2,3,4}=S2 Dtran[S2,a]=S2

b: followpos(2)={1,2,3}=S1 Dtran[S2,b]=S1

start state: S1

accepting states: {S2}

Resulting DFA: (S1) -a-> ((S2)), (S1) -b-> (S1), ((S2)) -a-> ((S2)), ((S2)) -b-> (S1)


Example -- ( a | ε ) b c* # (positions a = 1, b = 2, c = 3, # = 4)

followpos(1)={2} followpos(2)={3,4} followpos(3)={3,4} followpos(4)={}

S1=firstpos(root)={1,2}

mark S1

a: followpos(1)={2}=S2 Dtran[S1,a]=S2

b: followpos(2)={3,4}=S3 Dtran[S1,b]=S3

mark S2

b: followpos(2)={3,4}=S3 Dtran[S2,b]=S3

mark S3

c: followpos(3)={3,4}=S3 Dtran[S3,c]=S3

start state: S1

accepting states: {S3}

Resulting DFA: (S1) -a-> (S2), (S1) -b-> ((S3)), (S2) -b-> ((S3)), ((S3)) -c-> ((S3))

Minimizing Number of DFA States

• For any regular language, there is always a unique minimum-state DFA (up to renaming of states), which can be constructed from any DFA of the language.

• Algorithm:

– Partition the set of states into two groups:

• G1 : set of accepting states

• G2 : set of non-accepting states

– For each new group G, partition G into subgroups such that states s1 and s2 are in the same subgroup iff, for all input symbols a, states s1 and s2 have transitions on a to states in the same group. Repeat until no group can be split further.

– Start state of the minimized DFA is the group containing the start state of the original DFA.

– Accepting states of the minimized DFA are the groups containing the accepting states of the original DFA.
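A small partition-refinement sketch of the algorithm. As test data it uses the five-state DFA A to E that the subset construction yields for (a|b)*abb (E accepting); the function name `minimize` and the dictionary layout are assumptions for this example.

```python
def minimize(states, alphabet, dtran, accepting):
    """Return the final partition of states as a list of sets."""
    accepting = set(accepting)
    partition = [g for g in (accepting, set(states) - accepting) if g]
    while True:
        def group_of(state):
            for idx, group in enumerate(partition):
                if state in group:
                    return idx
            return None                       # missing transition

        new_partition = []
        for group in partition:
            buckets = {}
            for s in sorted(group):
                # States stay together iff they go to the same group on every symbol.
                key = tuple(group_of(dtran.get((s, a))) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):
            return new_partition              # nothing was split: done
        partition = new_partition

# DFA for (a|b)*abb as produced by the subset construction (E is accepting).
DTRAN = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
GROUPS = minimize("ABCDE", "ab", DTRAN, {"E"})
```

Here only A and C end up in the same group, so the minimized DFA has four states.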



Minimizing DFA – Example (1)

[Figure: a DFA with states 1, 2, 3 over {a,b}]

G1 = {2}

G2 = {1,3}

G2 cannot be partitioned because

Dtran[1,a]=2 Dtran[1,b]=3

Dtran[3,a]=2 Dtran[3,b]=3

So, the minimized DFA (with minimum states) is

[Figure: the minimized DFA with two states, 1 and 2]


Minimizing DFA – Example (2)

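Groups: {1,2,3} {4}

        a       b
1       1->2    1->3
2       2->2    2->3
3       3->4    3->3

On a, states 1 and 2 stay inside {1,2,3} while state 3 moves to {4}, so {1,2,3} splits into {1,2} {3}; no more partitioning.

[Figures: the original DFA with states 1 to 4, and the minimized DFA with states 1, 2, 3]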


Architecture of A Lexical Analyzer


An NFA for Lex program

• Create an NFA for each regular expression

• Combine all the NFAs into one

• Introduce a new start state

• Connect it with ε-transitions to the start states of the NFAs

Pattern Matching with NFA

① The lexical analyzer reads in input and calculates the set of states it is in at each symbol.

② Eventually, it reaches a point with no next state.

③ It looks backwards in the sequence of sets of states, until it finds a set including one or more accepting states.

④ It picks the one associated with the earliest pattern in the list from the Lex program.

⑤ It performs the associated action of the pattern.


Pattern Matching with NFA -- Example

Input: aaba

Report pattern: a*b+


Pattern Matching with DFA

① Convert the NFA for all the patterns into an equivalent DFA. For each DFA state that contains more than one accepting NFA state, choose the pattern that is defined earliest as the output of the DFA state.

② Simulate the DFA until there is no next state.

③ Trace back to the nearest accepting DFA state, and perform the associated action.

Input: abba

DFA states visited: 0137, 247, 58, 68

Report pattern abb
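The two disambiguation rules (prefer the longest lexeme, then the earliest pattern) can be illustrated without building the automaton. The sketch below uses Python's `re` and brute force over prefix lengths instead of simulating the DFA, and it assumes the pattern list a, abb, a*b+ in that order, as in the usual textbook version of this example.

```python
import re

# Assumed pattern list, earliest pattern first.
PATTERNS = [("a", r"a"), ("abb", r"abb"), ("a*b+", r"a*b+")]

def next_lexeme(text, pos=0, patterns=PATTERNS):
    """Return (pattern name, lexeme) for the longest match at pos, or None."""
    for end in range(len(text), pos, -1):        # longest prefix first
        candidate = text[pos:end]
        for name, regex in patterns:             # earliest pattern wins ties
            if re.fullmatch(regex, candidate):
                return name, candidate
    return None
```

For the input abba this reports pattern abb with lexeme abb (the tie with a*b+ is broken by order), and for aaba it reports a*b+ with lexeme aab. A generated lexer reaches the same answers in a single left-to-right pass by remembering the last accepting state.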


Summary

• How lexical analyzers work

– Convert REs to NFA

– Convert NFA to DFA

– Minimize DFA

– Use the minimized DFA to recognize tokens in the input

– Use priorities, longest matching rule

Homework

• Check the web page!!!
