Compiler Design Unit 2: Lexical Analysis

UNIT 2: LEXICAL ANALYSIS

Sadique Nayeem, Asst. Professor, Dept. of CSE, Sitamarhi Institute of Technology, Sitamarhi


Post on 17-Aug-2020





Lexical Analysis

Being the first phase of a compiler, the main task of the lexical analyzer is to:

1. Read the input characters of the source program,

2. Group them into lexemes, and

3. Produce as output a sequence of tokens, one for each lexeme in the source program.

The stream of tokens is sent to the parser for syntax analysis.

It is common for the lexical analyzer to interact with the symbol table as well.

Another task of the lexical analyzer is stripping out comments and whitespace (blank, newline, tab).

Another task is correlating error messages generated by the compiler with the source program.


getNextToken

Commonly, the interaction is implemented by having the parser call the lexical analyzer. The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
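The parser-driven interaction described above can be sketched roughly as follows. This is a minimal illustration in Python, not the deck's own implementation: the class name Lexer, the method name get_next_token, the stand-in parse function, and the token patterns in TOKEN_SPEC are all assumptions made for the example.

```python
import re

# Hypothetical token patterns, chosen only for this illustration.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("op", r"[+\-*/=]"),
    ("punct", r"[;(),]"),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def get_next_token(self):
        """Read characters until the next lexeme is identified, then return its token."""
        while self.pos < len(self.text):
            m = MASTER.match(self.text, self.pos)
            if m is None:
                raise SyntaxError(f"unexpected character {self.text[self.pos]!r}")
            self.pos = m.end()
            if m.lastgroup != "ws":  # whitespace is consumed but never returned
                return (m.lastgroup, m.group())
        return ("eof", "")


def parse(lexer):
    """Stand-in for a parser: it only pulls tokens on demand and collects them."""
    tokens = []
    tok = lexer.get_next_token()
    while tok[0] != "eof":
        tokens.append(tok)
        tok = lexer.get_next_token()
    return tokens


tokens = parse(Lexer("c = a + b;"))
print(tokens)
```

The point of the design is that the lexer does no work until the parser asks for a token, so the two phases can run interleaved rather than one after the other.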


Sometimes, lexical analyzers are divided into a cascade of two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.
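The scanning step in (a) might look like the sketch below. It assumes C-style comments (/* ... */ and // ...) and deliberately ignores the case of comment-like text inside string literals; the function name scan is hypothetical.

```python
import re


def scan(source):
    """Preprocessing pass: delete comments, then compact whitespace runs."""
    # Assumption: comments are /* ... */ or // ... ; string literals that
    # contain comment-like text are NOT handled by this simple sketch.
    no_block = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    no_line = re.sub(r"//[^\n]*", " ", no_block)
    # Each run of blanks, tabs, and newlines becomes a single blank;
    # strip() additionally drops leading and trailing whitespace.
    return re.sub(r"\s+", " ", no_line).strip()


cleaned = scan("int a = 10;   /* first */\n\n int b = 20; // second\n")
print(cleaned)
```

The output of this pass is still plain text; producing tokens from it is the job of the second process, (b).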


All programs have:

Keywords

Operators

Identifiers

Constants (number and strings)

Punctuation marks


Token

A token is a pair consisting of a token name and an optional attribute value.

<token name, attribute value>

The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier.

The token names are the input symbols that the parser processes.


Pattern

A pattern is a description of the form that the lexemes of a token may take.

In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. (Example: if)

For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings. (Example: age)
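As a rough illustration of the two kinds of pattern, the sketch below writes them as Python regular expressions. The identifier rule used here (a letter or underscore followed by letters, digits, or underscores) is an assumption modeled on C; the names KEYWORD_IF, IDENTIFIER, and matches are hypothetical.

```python
import re

# Keyword pattern: exactly the characters of the keyword.
KEYWORD_IF = re.compile(r"if")
# Identifier pattern (assumed C-like rule): matched by many different strings.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def matches(pattern, lexeme):
    """True if the whole lexeme has the form described by the pattern."""
    return pattern.fullmatch(lexeme) is not None


for lexeme in ["if", "age", "total_2", "9lives"]:
    print(lexeme, matches(KEYWORD_IF, lexeme), matches(IDENTIFIER, lexeme))
```

Note that the string if also fits the identifier pattern, which is why lexical analyzers typically give keyword patterns priority over the identifier pattern.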


Lexeme

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

#include<stdio.h>
void main()
{
    printf("SIT, Sitamarhi");
}

#include<stdio.h>
void main()
{
    int a=10, b=20, c;
    c = a + b;
    printf("%d", c);
}
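To make the relation between lexemes and tokens concrete, the sketch below tokenizes the two statements from the body of the second program into <token name, attribute value> pairs. The conventions are assumptions for this example only: identifiers get the token name id with a symbol-table index as attribute, numbers carry their integer value, and keywords, operators, and punctuation use the lexeme itself as token name with no attribute.

```python
import re

KEYWORDS = {"int", "void"}
PATTERNS = [
    ("number", r"\d+"),
    ("word", r"[A-Za-z_]\w*"),
    ("op", r"[=+]"),
    ("punct", r"[,;]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in PATTERNS))


def tokenize(text):
    symbol_table = []  # identifier lexemes, in order of first appearance
    tokens = []
    # finditer silently skips characters that match no pattern (here: blanks).
    for m in MASTER.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "word" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))  # keyword: no attribute needed
        elif kind == "word":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme)))
        elif kind == "number":
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme, None))  # operator or punctuation mark
    return tokens, symbol_table


tokens, symbols = tokenize("int a=10, b=20, c; c = a + b;")
for tok in tokens:
    print(tok)
print(symbols)
```

Here the lexemes a, b, and c are three different character sequences, but all of them are instances of the same token, id.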


Examples of Tokens


GATE 2000

printf("i = %d, &i = %x", i, &i);
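Assuming the slide poses this statement as a token-counting exercise, the count can be checked with a small sketch. The regular expression below is an assumption that covers only what this one statement needs: string literals, identifiers, numbers, and a few single-character symbols.

```python
import re

TOKEN = re.compile(r'''
    "(?:[^"\\]|\\.)*"      # string literal: one token, whatever it contains
  | [A-Za-z_]\w*           # identifier
  | \d+                    # number
  | [&(),;]                # single-character operators and punctuation
''', re.VERBOSE)

statement = 'printf("i = %d, &i = %x", i, &i);'
lexemes = TOKEN.findall(statement)
for lexeme in lexemes:
    print(lexeme)
print(len(lexemes))  # 10
```

The whole format string counts as a single token, so the &i and %d inside the quotes are not counted separately. The ten tokens are: printf, (, the string literal, the first comma, i, the second comma, &, i, ), and ;.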


Lexical Errors

Lexical errors are mainly spelling mistakes and the accidental insertion of characters that the language does not allow.

It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.

For instance, suppose the string fi is encountered for the first time in a C program in the context:

fi ( a == 10 )

A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle an error due to transposition of the letters.


Suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.

The simplest recovery strategy is "panic mode" recovery. We delete successive characters from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left.

Other possible error-recovery actions are:

1. Delete one character from the remaining input.

2. Insert a missing character into the remaining input.

3. Replace a character by another character.

4. Transpose two adjacent characters.
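Panic-mode recovery, the simplest of these strategies, can be sketched as below. The token patterns and the function name tokenize_with_panic_mode are assumptions for the illustration; the input deliberately contains two characters (@ and #) that no pattern accepts.

```python
import re

# Hypothetical patterns: numbers, identifiers, and a few operators.
TOKEN = re.compile(
    r"\s*(?:(?P<number>\d+)|(?P<id>[A-Za-z_]\w*)|(?P<op>==|[=+\-*/()]))"
)


def tokenize_with_panic_mode(text):
    tokens, skipped = [], []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m:
            tokens.append((m.lastgroup, m.group(m.lastgroup)))
            pos = m.end()
        elif text[pos].isspace():
            pos += 1  # whitespace that is not followed by a token
        else:
            # Panic mode: no pattern matches a prefix of the remaining input,
            # so delete one character and try again.
            skipped.append(text[pos])
            pos += 1
    return tokens, skipped


tokens, skipped = tokenize_with_panic_mode("fi ( a == @#10 )")
print(tokens)
print(skipped)
```

In this run the lexeme fi is returned as an ordinary id token, in line with the earlier discussion, while @ and # are deleted before the lexer resumes at 10.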


Specification of Tokens

Alphabet

String

Language

Operations on Languages (U , . , * , +)

Kleene Closure and Positive Closure

Regular Expression

Transition Diagram

Finite Automata

NFA

DFA

ε-NFA

Transition Table

ε-Closure

RE to ε-NFA

ε-NFA to NFA

NFA to DFA

DFA Minimization


Regular Definitions


[Figure: conversions among RE, ε-NFA, NFA, and DFA]


RE to DFA

A regular expression can be represented by its syntax tree, where the leaves correspond to operands and the interior nodes correspond to operators.

An interior node is called a cat-node, or-node, or star-node if it is labeled by the concatenation operator (dot), union operator |, or star operator *, respectively.


Leaves in a syntax tree are labeled by ε or by an alphabet symbol. To each leaf not labeled ε, we attach a unique integer.

We refer to this integer as the position of the leaf and also as a position of its symbol.
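One possible in-memory form of such a tree is sketched below. The Node class, the helper names (leaf, epsilon, cat, alt, star, positions), and the choice of (a|b)*abb# as the example expression are all assumptions for illustration. Positions are handed out left to right as the leaves are created.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Node:
    kind: str                        # "leaf", "cat", "or", or "star"
    symbol: Optional[str] = None     # alphabet symbol of a leaf; None for an ε leaf
    position: Optional[int] = None   # unique integer, only for non-ε leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


_next_position = count(1)  # shared counter: positions 1, 2, 3, ...


def leaf(symbol):
    return Node("leaf", symbol=symbol, position=next(_next_position))


def epsilon():
    return Node("leaf")  # ε leaf: no symbol and no position


def cat(l, r):
    return Node("cat", left=l, right=r)


def alt(l, r):
    return Node("or", left=l, right=r)


def star(c):
    return Node("star", left=c)


def positions(node):
    """Map each position in the subtree to the symbol of its leaf."""
    if node is None:
        return {}
    if node.kind == "leaf":
        return {} if node.position is None else {node.position: node.symbol}
    return {**positions(node.left), **positions(node.right)}


# (a|b)*abb# with concatenation grouped to the left
tree = cat(cat(cat(cat(star(alt(leaf("a"), leaf("b"))), leaf("a")),
                   leaf("b")), leaf("b")), leaf("#"))
table = positions(tree)
print(table)
```

Because a symbol such as a can occur at several leaves, it can have several positions (here 1 and 3), while each leaf has exactly one.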


Construct the syntax tree for each of the following:

a(a|b)*#

(a|b)c*#

(a|b)(a|b)#

(a|b)*(a|b)#


Functions Computed From the Syntax Tree

To construct a DFA directly from a regular expression, we construct its syntax tree and then compute four functions: nullable, firstpos, lastpos, and followpos, defined as follows. Each definition refers to the syntax tree for a particular augmented regular expression ( r ) #.

1. nullable(n) is true for a syntax-tree node n if and only if the subexpression represented by n has ε in its language. That is, the subexpression can be "made null" or the empty string, even though there may be other strings it can represent as well.

2. firstpos(n) is the set of positions in the subtree rooted at n that correspond to the first symbol of at least one string in the language of the subexpression rooted at n. (That is, the positions from which the first symbol of a string can come.)

3. lastpos(n) is the set of positions in the subtree rooted at n that correspond to the last symbol of at least one string in the language of the subexpression rooted at n. (That is, the positions from which the last symbol of a string can come.)

4. followpos(p), for a position p, is the set of positions q in the entire syntax tree that can immediately follow p: there is some string in the language of ( r ) # in which one symbol is matched to position p and the very next symbol is matched to position q.


Computing nullable, firstpos, and lastpos

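For a leaf labeled ε: nullable(n) = true, firstpos(n) = Ø, lastpos(n) = Ø

For a leaf with position i: nullable(n) = false, firstpos(n) = {i}, lastpos(n) = {i}

For an or-node n = c1 | c2: nullable(n) = nullable(c1) or nullable(c2), firstpos(n) = firstpos(c1) U firstpos(c2), lastpos(n) = lastpos(c1) U lastpos(c2)

For a cat-node n = c1 c2: nullable(n) = nullable(c1) and nullable(c2), firstpos(n) = if (nullable(c1)) firstpos(c1) U firstpos(c2) else firstpos(c1), lastpos(n) = if (nullable(c2)) lastpos(c1) U lastpos(c2) else lastpos(c2)

For a star-node n = c1*: nullable(n) = true, firstpos(n) = firstpos(c1), lastpos(n) = lastpos(c1)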
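The standard rules for nullable, firstpos, and lastpos translate almost line by line into recursive functions. The sketch below assumes a tuple encoding of the tree, stated in the first comment, and uses (a|b)*abb# as an assumed example with positions 1 to 6.

```python
# Tree encoding (an assumption of this sketch):
#   ("leaf", symbol, position)   symbol None means an ε leaf
#   ("or", c1, c2)   ("cat", c1, c2)   ("star", c1)


def nullable(n):
    kind = n[0]
    if kind == "leaf":
        return n[1] is None  # only an ε leaf is nullable
    if kind == "or":
        return nullable(n[1]) or nullable(n[2])
    if kind == "cat":
        return nullable(n[1]) and nullable(n[2])
    return True  # star-node


def firstpos(n):
    kind = n[0]
    if kind == "leaf":
        return set() if n[1] is None else {n[2]}
    if kind == "or":
        return firstpos(n[1]) | firstpos(n[2])
    if kind == "cat":
        if nullable(n[1]):
            return firstpos(n[1]) | firstpos(n[2])
        return firstpos(n[1])
    return firstpos(n[1])  # star-node


def lastpos(n):
    kind = n[0]
    if kind == "leaf":
        return set() if n[1] is None else {n[2]}
    if kind == "or":
        return lastpos(n[1]) | lastpos(n[2])
    if kind == "cat":
        if nullable(n[2]):
            return lastpos(n[1]) | lastpos(n[2])
        return lastpos(n[2])
    return lastpos(n[1])  # star-node


star_ab = ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2)))
tree = ("cat",
        ("cat",
         ("cat", ("cat", star_ab, ("leaf", "a", 3)), ("leaf", "b", 4)),
         ("leaf", "b", 5)),
        ("leaf", "#", 6))

print(nullable(tree), firstpos(tree), lastpos(tree))
```

For this example the root is not nullable, its firstpos is {1, 2, 3}, and its lastpos is {6}. The functions recompute shared subresults on every call; a real implementation would more likely store the three values on each node in a single bottom-up pass.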


Computing followpos
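In the standard construction, only two kinds of node contribute to followpos. If n is a cat-node with children c1 and c2, every position in firstpos(c2) is added to followpos(i) for each i in lastpos(c1). If n is a star-node, every position in firstpos(n) is added to followpos(i) for each i in lastpos(n). The sketch below applies these two rules in one bottom-up pass; the tuple encoding of the tree, the function name analyze, and the example (a|b)*abb# are assumptions.

```python
# Tree encoding (assumed): ("leaf", symbol, position) with symbol None for ε,
# ("or", c1, c2), ("cat", c1, c2), ("star", c1).


def analyze(n, follow):
    """Return (nullable, firstpos, lastpos) of n and fill `follow` on the way."""
    kind = n[0]
    if kind == "leaf":
        _, symbol, pos = n
        if symbol is None:
            return True, set(), set()
        follow.setdefault(pos, set())  # every position gets an entry
        return False, {pos}, {pos}
    if kind == "star":
        _, first, last = analyze(n[1], follow)
        for i in last:  # star rule: the body may repeat
            follow[i] |= first
        return True, first, last
    null1, first1, last1 = analyze(n[1], follow)
    null2, first2, last2 = analyze(n[2], follow)
    if kind == "or":
        return null1 or null2, first1 | first2, last1 | last2
    for i in last1:  # cat rule: c2 can start right after c1 ends
        follow[i] |= first2
    first = (first1 | first2) if null1 else first1
    last = (last1 | last2) if null2 else last2
    return null1 and null2, first, last


star_ab = ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2)))
tree = ("cat",
        ("cat",
         ("cat", ("cat", star_ab, ("leaf", "a", 3)), ("leaf", "b", 4)),
         ("leaf", "b", 5)),
        ("leaf", "#", 6))

followpos = {}
root_nullable, root_first, root_last = analyze(tree, followpos)
for p in sorted(followpos):
    print(p, sorted(followpos[p]))
```

For this expression the pass yields followpos(1) = followpos(2) = {1, 2, 3}, followpos(3) = {4}, followpos(4) = {5}, followpos(5) = {6}, and followpos(6) = Ø.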


Converting a Regular Expression Directly to a DFA

Step 1. Construct a syntax tree T from the augmented regular expression ( r ) #.

Step 2. Compute nullable, firstpos, lastpos, and followpos for T.

Step 3. Construct Dstates (the set of states of DFA D) and Dtran (the transition function for D) by using the following procedure.


The states of D are sets of positions in T.

Initially, each state is "unmarked," and a state becomes "marked" just before we consider its out-transitions.

The start state of D is firstpos(n0), where node n0 is the root of T.

The accepting states are those containing the position for the endmarker symbol #.


The value of firstpos for the root of the tree is {1,2,3}, so this set is the start state of D.

Let us call this set of positions A.

We must compute Dtran[A, a] and Dtran[A, b].

Among the positions of A, leaf 1 and leaf 3 correspond to a, while leaf 2 corresponds to b. Thus,

Dtran[A, a] = followpos(1) U followpos(3) = {1,2,3,4} = B

Dtran[A, b] = followpos(2) = {1,2,3} = A


Dtran[B, a] = followpos(1) U followpos(3) = {1,2,3,4} = B

Dtran[B, b] = followpos(2) U followpos(4) = {1,2,3,5} = C

Dtran[C, a] = followpos(1) U followpos(3) = {1,2,3,4} = B

Dtran[C, b] = followpos(2) U followpos(5) = {1,2,3,6} = D

Dtran[D, a] = followpos(1) U followpos(3) = {1,2,3,4} = B

Dtran[D, b] = followpos(2) = {1,2,3} = A
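The transitions above can be reproduced mechanically. The sketch below assumes the running example is (a|b)*abb#, which is consistent with the positions and followpos values used in the computation, and hard-codes the followpos table rather than deriving it. The function name build_dfa and the variable names are hypothetical.

```python
from collections import deque

# Assumed inputs for the running example (a|b)*abb#
symbol_at = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}
followpos = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
start = frozenset({1, 2, 3})  # firstpos of the root


def build_dfa(start, followpos, symbol_at, alphabet=("a", "b")):
    dstates = [start]  # in order of discovery; index 0 is the start state
    dtran = {}
    unmarked = deque([start])
    while unmarked:
        state = unmarked.popleft()  # the state is now "marked"
        for sym in alphabet:
            target = set()
            for p in state:
                if symbol_at[p] == sym:
                    target |= followpos[p]
            target = frozenset(target)
            if not target:
                continue  # this sketch omits the empty (dead) state
            if target not in dstates:
                dstates.append(target)
                unmarked.append(target)
            dtran[(state, sym)] = target
    end = next(p for p, s in symbol_at.items() if s == "#")
    accepting = [s for s in dstates if end in s]
    return dstates, dtran, accepting


dstates, dtran, accepting = build_dfa(start, followpos, symbol_at)
names = {s: "ABCD"[i] for i, s in enumerate(dstates)}
table = {(names[s], a): names[t] for (s, a), t in dtran.items()}
for key in sorted(table):
    print(key, "->", table[key])
print("accepting:", [names[s] for s in accepting])
```

The states are discovered in the order A, B, C, D, and D = {1,2,3,6} is the only accepting state because it is the only one containing position 6, the position of #.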


[Figure: transition diagram of the DFA with states A, B, C, and D]

Note: The resulting DFA can also be minimized.


Question Time

Q. Construct the DFA for each of the following regular expressions.

a(a|b)*#

(a|b)c*#


THANK YOU!