Compiler Design

UNIT 2: LEXICAL ANALYSIS


Sadique Nayeem, Asst. Professor, Dept. of CSE

Sitamarhi Institute of Technology, Sitamarhi

Lexical Analysis

Being the first phase of a compiler, the main task of the lexical analyzer is to: read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program.

The stream of tokens is sent to the parser for syntax analysis.

It is common for the lexical analyzer to interact with the symbol table as well.

Another task of the lexical analyzer is stripping out comments and whitespace (blank, newline, tab).

Another task is correlating error messages generated by the compiler with the source program.

getNextToken

Commonly, the interaction is implemented by having the parser call the lexical analyzer. The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
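The parser-driven interaction can be sketched in Python as follows. This is only an illustration: the Token and Lexer classes, the TOKEN_SPEC patterns, and the stand-in parse function are assumptions made for the example, not part of the slides.

```python
import re
from typing import NamedTuple, Optional


class Token(NamedTuple):
    name: str      # token name, e.g. "id"
    lexeme: str    # the matched characters


# Hypothetical token patterns, tried in this order.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("op", r"[+\-*/=]"),
    ("punct", r"[();]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


class Lexer:
    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0

    def get_next_token(self) -> Optional[Token]:
        """Read characters until the next lexeme is identified; None at end of input."""
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1                      # whitespace is skipped, not returned
        if self.pos == len(self.text):
            return None
        match = MASTER.match(self.text, self.pos)
        if match is None:
            raise SyntaxError(f"unexpected character {self.text[self.pos]!r}")
        self.pos = match.end()
        return Token(match.lastgroup, match.group())


def parse(lexer: Lexer) -> list:
    """Stand-in for a parser: it only pulls tokens on demand."""
    tokens = []
    while (token := lexer.get_next_token()) is not None:
        tokens.append(token)
    return tokens


tokens = parse(Lexer("c = a + 10;"))
print([token.name for token in tokens])
```

The point of the design is that the lexer does no work until the parser asks for a token, so the two phases can run interleaved over the input.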

Sometimes, lexical analyzers are divided into a cascade of two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.
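A minimal sketch of the scanning step, assuming C-style comments. The function name scan and the two regular expressions are illustrative choices; this simplified version does not protect comment markers that appear inside string literals.

```python
import re

# /* ... */ block comments and // line comments (simplified, see note above).
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
WHITESPACE = re.compile(r"\s+")


def scan(source: str) -> str:
    """Delete comments, then compact each run of whitespace into one blank."""
    without_comments = COMMENT.sub(" ", source)
    return WHITESPACE.sub(" ", without_comments).strip()


print(scan("int a = 10;   /* counter */\n\tint b;  // unused\n"))
```

The output of this pass would then be handed to lexical analysis proper, which groups the remaining characters into tokens.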

All programs have

Keywords

Operators

Identifiers

Constants (number and strings)

Punctuation marks

Token

A token is a pair consisting of a token name and an optional attribute value.

<token name, attribute value>

The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier.

The token names are the input symbols that the parser processes.

Pattern

A pattern is a description of the form that the lexemes of a token may take.

In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. (Example: if)

For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings. (Example: age)

Lexeme

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
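The relationship between token, pattern, and lexeme can be illustrated with a small Python sketch. The PATTERNS table and the classify function are hypothetical names chosen for this example; the keyword pattern is listed first so that the lexeme if is not reported as an identifier.

```python
import re

# (token name, pattern) pairs; order decides which pattern wins on a tie.
PATTERNS = [
    ("if", re.compile(r"if")),                      # keyword: the pattern is the keyword itself
    ("id", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),  # identifier: matched by many strings
    ("number", re.compile(r"[0-9]+")),
]


def classify(lexeme: str):
    """Return the pair <token name, attribute value> for a lexeme, or None."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(lexeme):
            attribute = None if name == "if" else lexeme   # keywords carry no attribute
            return (name, attribute)
    return None


for lexeme in ["if", "age", "iffy", "42"]:
    print(lexeme, "->", classify(lexeme))
```

In a real compiler the attribute of an id token would usually be a pointer to a symbol-table entry; the lexeme string is used here only to keep the sketch short.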

#include<stdio.h>
void main()
{
    printf("SIT, Sitamarhi");
}

#include<stdio.h>
void main()
{
    int a=10, b=20, c;
    c = a + b;
    printf("%d", c);
}

Examples of Tokens

GATE 2000

printf("i = %d, &i = %x", i, &i);
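Assuming the exercise asks how many tokens the statement contains, the count can be checked with a short Python sketch. The TOKEN pattern below is a simplified, hypothetical subset of the C lexical rules, just large enough for this statement; note that the whole string literal is a single token.

```python
import re

TOKEN = re.compile(r'''
    "(?:\\.|[^"\\])*"     # string literal (one token)
  | [A-Za-z_]\w*          # identifier or keyword
  | \d+                   # integer constant
  | [-+*/%=&<>!]          # single-character operator
  | [(){}\[\];,]          # punctuation
''', re.VERBOSE)


def tokenize(statement: str) -> list:
    """Split a statement into lexemes, skipping whitespace between them."""
    lexemes, pos = [], 0
    while pos < len(statement):
        if statement[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(statement, pos)
        if match is None:
            raise ValueError(f"no token matches at index {pos}")
        lexemes.append(match.group())
        pos = match.end()
    return lexemes


lexemes = tokenize('printf("i = %d, &i = %x", i, &i);')
print(lexemes)
print(len(lexemes))  # 10
```

The ten tokens are printf, (, the string literal, the first comma, i, the second comma, &, i, ), and the semicolon.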

Lexical Errors

These errors are mainly spelling mistakes and the accidental insertion of foreign characters that the language does not allow.

It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.

For instance, if the string fi is encountered for the first time in a C program in the context:

fi ( a == 10 )

A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle an error due to transposition of the letters.

Suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.

The simplest recovery strategy is "panic mode" recovery. We delete successive characters from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left.

Other possible error-recovery actions are:

1. Delete one character from the remaining input.

2. Insert a missing character into the remaining input.

3. Replace a character by another character.

4. Transpose two adjacent characters.
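Panic-mode recovery can be sketched as follows, assuming a tiny hypothetical token set (identifiers, numbers, and a few operators). When no pattern matches a prefix of the remaining input, the sketch deletes one character and tries again, recording what it deleted.

```python
import re

# Hypothetical token patterns for the illustration.
TOKEN = re.compile(r"(?P<id>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<op>[-+*/=()])")


def tokenize_with_panic_mode(text: str):
    """Return (tokens, deleted): tokens found and characters skipped by recovery."""
    tokens, deleted, pos = [], [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(text, pos)
        if match:
            tokens.append((match.lastgroup, match.group()))
            pos = match.end()
        else:
            deleted.append(text[pos])   # panic mode: drop the offending character
            pos += 1
    return tokens, deleted


tokens, deleted = tokenize_with_panic_mode("a = 1 @# + b")
print(tokens)
print(deleted)
```

The other recovery actions listed above (insert, replace, transpose) would need a way to guess the intended lexeme, which is why deletion is usually considered the simplest strategy.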

Specification of Tokens

Alphabet

String

Language

Operations on Languages (U, ., *, +)

Kleene Closure and Positive Closure

Transition Table

ε-Closure

RE to ε-NFA

ε-NFA to NFA

NFA to DFA

Regular Expression

Transition Diagram

Finite Automata

NFA

DFA

ε-NFA

NFA to DFA

DFA Minimizations

Regular Definitions


RE to DFA

A regular expression can be represented by its syntax tree, where the leaves correspond to operands and the interior nodes correspond to operators.

An interior node is called a cat-node, or-node, or star-node if it is labeled by the concatenation operator (dot), union operator |, or star operator *, respectively.

Leaves in a syntax tree are labeled by ε or by an alphabet symbol. To each leaf not labeled ε, we attach a unique integer.

We refer to this integer as the position of the leaf and also as a position of its symbol.

Construct Syntax tree

a(a|b)*#

(a|b)c*#

(a|b)(a|b)#

(a|b)*(a|b)#
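Syntax trees for expressions like these can be checked with a small recursive-descent parser. The sketch below is an assumption-laden illustration: the tuple encoding ("cat", "or", "star", "leaf", "eps") and the function name parse_regex are choices made for this example, and positions are numbered left to right starting at 1.

```python
from pprint import pprint


def parse_regex(regex: str):
    """Build a syntax tree (nested tuples) for an augmented regex such as 'a(a|b)*#'."""
    pos = 0              # index of the next unread character
    next_position = 1    # position number for the next non-ε leaf

    def peek():
        return regex[pos] if pos < len(regex) else None

    def expr():          # expr -> term ('|' term)*
        nonlocal pos
        node = term()
        while peek() == "|":
            pos += 1
            node = ("or", node, term())
        return node

    def term():          # term -> factor factor*
        node = factor()
        while peek() is not None and peek() not in "|)":
            node = ("cat", node, factor())
        return node

    def factor():        # factor -> atom '*'*
        nonlocal pos
        node = atom()
        while peek() == "*":
            pos += 1
            node = ("star", node)
        return node

    def atom():          # atom -> '(' expr ')' | ε | symbol
        nonlocal pos, next_position
        ch = peek()
        if ch is None or ch in "|)*":
            raise ValueError(f"unexpected {ch!r} at index {pos}")
        pos += 1
        if ch == "(":
            node = expr()
            if peek() != ")":
                raise ValueError("missing ')'")
            pos += 1
            return node
        if ch == "ε":
            return ("eps",)
        leaf = ("leaf", ch, next_position)
        next_position += 1
        return leaf

    tree = expr()
    if pos != len(regex):
        raise ValueError(f"unexpected {regex[pos]!r} at index {pos}")
    return tree


pprint(parse_regex("a(a|b)*#"))
```

For a(a|b)*# the root is a cat-node whose right child is the leaf # at position 4, and whose left child concatenates leaf a (position 1) with the star-node over the or-node of positions 2 and 3.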

Functions Computed From the Syntax Tree

To construct a DFA directly from a regular expression, we construct its syntax tree and then compute four functions: nullable, firstpos, lastpos, and followpos, defined as follows. Each definition refers to the syntax tree for a particular augmented regular expression (r)#.

1. nullable(n) is true for a syntax-tree node n if and only if the subexpression represented by n has ε in its language. That is, the subexpression can be "made null" or the empty string, even though there may be other strings it can represent as well.

2. firstpos(n) is the set of positions in the subtree rooted at n that correspond to the first symbol of at least one string in the language of the subexpression rooted at n. (That is, the positions from which the first symbol of a string can come.)

3. lastpos(n) is the set of positions in the subtree rooted at n that correspond to the last symbol of at least one string in the language of the subexpression rooted at n. (That is, the positions from which the last symbol of a string can come.)

4. followpos(p), for a position p, is the set of positions q in the entire syntax tree that can immediately follow p: for some string in the language of (r)#, one symbol of the string is matched to position p and the next symbol is matched to position q.

Computing nullable, firstpos, and lastpos

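A leaf labeled ε: nullable(n) = true; firstpos(n) = Ø; lastpos(n) = Ø

A leaf with position i: nullable(n) = false; firstpos(n) = {i}; lastpos(n) = {i}

An or-node n = c1 | c2: nullable(n) = nullable(c1) or nullable(c2); firstpos(n) = firstpos(c1) U firstpos(c2); lastpos(n) = lastpos(c1) U lastpos(c2)

A cat-node n = c1 c2: nullable(n) = nullable(c1) and nullable(c2); firstpos(n) = if (nullable(c1)) (firstpos(c1) U firstpos(c2)) else firstpos(c1); lastpos(n) = if (nullable(c2)) (lastpos(c1) U lastpos(c2)) else lastpos(c2)

A star-node n = c1*: nullable(n) = true; firstpos(n) = firstpos(c1); lastpos(n) = lastpos(c1)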
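The rules for nullable, firstpos, and lastpos translate directly into recursive functions. The sketch below assumes a tuple encoding of the syntax tree chosen for this example, and it assumes the tree of (a|b)*abb#, an expression that is consistent with the position numbers used in the worked example later in these slides but is not stated in them.

```python
def leaf(symbol, position):
    return ("leaf", symbol, position)


# Assumed example: syntax tree of (a|b)*abb# with positions 1..6.
STAR = ("star", ("or", leaf("a", 1), leaf("b", 2)))
TREE = ("cat", ("cat", ("cat", ("cat", STAR, leaf("a", 3)), leaf("b", 4)), leaf("b", 5)), leaf("#", 6))


def nullable(n):
    kind = n[0]
    if kind == "eps":
        return True
    if kind == "leaf":
        return False
    if kind == "or":
        return nullable(n[1]) or nullable(n[2])
    if kind == "cat":
        return nullable(n[1]) and nullable(n[2])
    if kind == "star":
        return True
    raise ValueError(f"unknown node kind {kind!r}")


def firstpos(n):
    kind = n[0]
    if kind == "eps":
        return set()
    if kind == "leaf":
        return {n[2]}
    if kind == "or":
        return firstpos(n[1]) | firstpos(n[2])
    if kind == "cat":
        # If c1 can be empty, the first symbol may also come from c2.
        return firstpos(n[1]) | firstpos(n[2]) if nullable(n[1]) else firstpos(n[1])
    if kind == "star":
        return firstpos(n[1])
    raise ValueError(f"unknown node kind {kind!r}")


def lastpos(n):
    kind = n[0]
    if kind == "eps":
        return set()
    if kind == "leaf":
        return {n[2]}
    if kind == "or":
        return lastpos(n[1]) | lastpos(n[2])
    if kind == "cat":
        # If c2 can be empty, the last symbol may also come from c1.
        return lastpos(n[1]) | lastpos(n[2]) if nullable(n[2]) else lastpos(n[2])
    if kind == "star":
        return lastpos(n[1])
    raise ValueError(f"unknown node kind {kind!r}")


print(nullable(TREE), firstpos(TREE), lastpos(TREE))
```

For this tree the root is not nullable, its firstpos is {1, 2, 3}, and its lastpos is {6}, the position of the endmarker.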

Computing followpos
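In the standard construction, followpos is filled in by two rules: (1) if n is a cat-node with children c1 and c2, every position in firstpos(c2) is added to followpos(i) for each i in lastpos(c1); (2) if n is a star-node, every position in firstpos(n) is added to followpos(i) for each i in lastpos(n). The sketch below applies both rules in one bottom-up pass. The tuple encoding, the function name analyze, and the example tree of (a|b)*abb# are assumptions made for illustration.

```python
def leaf(symbol, position):
    return ("leaf", symbol, position)


# Assumed example: syntax tree of (a|b)*abb# with positions 1..6.
TREE = ("cat", ("cat", ("cat", ("cat",
        ("star", ("or", leaf("a", 1), leaf("b", 2))),
        leaf("a", 3)), leaf("b", 4)), leaf("b", 5)), leaf("#", 6))


def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) for node and record followpos entries."""
    kind = node[0]
    if kind == "eps":
        return True, set(), set()
    if kind == "leaf":
        followpos.setdefault(node[2], set())
        return False, {node[2]}, {node[2]}
    if kind == "or":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for i in l1:                       # rule 1: lastpos(c1) is followed by firstpos(c2)
            followpos[i] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        _, f1, l1 = analyze(node[1], followpos)
        for i in l1:                       # rule 2: lastpos(n) is followed by firstpos(n)
            followpos[i] |= f1
        return True, f1, l1
    raise ValueError(f"unknown node kind {kind!r}")


followpos = {}
analyze(TREE, followpos)
for position in sorted(followpos):
    print(position, sorted(followpos[position]))
```

Only cat-nodes and star-nodes contribute to followpos; an or-node never places one position directly after another.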

Converting a Regular Expression Directly to a DFA

Step 1. Construct a syntax tree T from the augmented regular expression (r)#.

Step 2. Compute nullable, firstpos, lastpos, and followpos for T.

Step 3. Construct Dstates (the set of states of DFA D) and Dtran (the transition function for D) by using the following procedure.

The states of D are sets of positions in T.

Initially, each state is "unmarked," and a state becomes "marked" just before we consider its out-transitions.

The start state of D is firstpos(n0), where node n0 is the root of T.

The accepting states are those containing the position for the endmarker symbol #.

The value of firstpos for the root of the tree is {1,2,3}, so this set is the start state of D.

Let us call this set A.

We must compute Dtran[A, a] and Dtran[A, b].

Among the positions of A, leaf 1 and leaf 3 correspond to a, while leaf 2 corresponds to b. Thus,

Dtran[A, a] = followpos(1) U followpos(3) = {1,2,3,4} = B

Dtran[A, b] = followpos(2) = {1,2,3} = A

Dtran[B, a] = followpos(1) U followpos(3) = {1,2,3,4} = B

Dtran[B, b] = followpos(2) U followpos(4) = {1,2,3,5} = C

Dtran[C, a] = followpos(1) U followpos(3) = {1,2,3,4} = B

Dtran[C, b] = followpos(2) U followpos(5) = {1,2,3,6} = D

Dtran[D, a] = followpos(1) U followpos(3) = {1,2,3,4} = B

Dtran[D, b] = followpos(2) = {1,2,3} = A
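The transitions above can be reproduced with a short Python sketch of Step 3. The followpos table and the symbol at each position are entered by hand here and are assumptions consistent with the values in the worked example (positions 1 to 6 labeled a, b, a, b, b, #); the function name build_dfa is also chosen only for this illustration.

```python
from string import ascii_uppercase

# Assumed inputs, consistent with the worked example.
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}
START = frozenset({1, 2, 3})   # firstpos of the root
END_POSITION = 6               # position of the endmarker #


def build_dfa(start, followpos, symbol_at, alphabet):
    """Return (Dstates in discovery order, Dtran) using sets of positions as states."""
    dstates = [start]
    unmarked = [start]
    dtran = {}
    while unmarked:
        state = unmarked.pop(0)            # mark the state
        for symbol in alphabet:
            # Union of followpos(p) over positions p in the state labeled by symbol.
            target = frozenset().union(
                *(followpos[p] for p in state if symbol_at[p] == symbol)
            )
            if not target:
                continue
            if target not in dstates:
                dstates.append(target)
                unmarked.append(target)
            dtran[(state, symbol)] = target
    return dstates, dtran


dstates, dtran = build_dfa(START, FOLLOWPOS, SYMBOL_AT, "ab")
name = {state: ascii_uppercase[i] for i, state in enumerate(dstates)}
table = {(name[s], a): name[t] for (s, a), t in dtran.items()}
accepting = [name[s] for s in dstates if END_POSITION in s]

for (s, a), t in sorted(table.items()):
    print(f"Dtran[{s}, {a}] = {t}")
print("accepting:", accepting)
```

Naming states in discovery order gives the same A, B, C, D as in the hand computation, and D is the only accepting state because it is the only set containing position 6.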


Note: We can also minimize the resultant DFA.


Question Time

Q. Find the DFA for each of the following regular expressions.

a(a|b)*#

(a|b)c*#


THANK YOU!
