![Page 1: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/1.jpg)
COMP 524: Programming Language ConceptsBjörn B. Brandenburg
The University of North Carolina at Chapel Hill
Based in part on slides and notes by S. Olivier, A. Block, N. Fisher, F. Hernandez-Campos, and D. Stotts.
Lexical Analysis
![Page 2: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/2.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
The Big Picture
Scanner (lexical analysis)
Parser (syntax analysis)
Semantic analysis & intermediate code gen.
Machine-independent optimization (optional)
Target code generation.
Machine-specific optimization (optional)
Character Stream
Token Stream
Parse Tree
Abstract syntax tree
Modified intermediate form
Machine language
Modified target language
2
![Page 3: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/3.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
The Big Picture
Scanner (lexical analysis)
Parser (syntax analysis)
Semantic analysis & intermediate code gen.
Machine-independent optimization (optional)
Target code generation.
Machine-specific optimization (optional)
Character Stream
Token Stream
Parse Tree
Abstract syntax tree
Modified intermediate form
Machine language
Modified target language
3
Lexical analysis:grouping consecutive characters that “belong together.”
Turn the stream of individual characters into astream of tokens that have individual meaning.
![Page 4: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/4.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Source ProgramThe compiler reads the program from a file.➡ Input as a character stream.
4
1 2
Source File
3 - 7 5 * f… …o o ;=
Compilation requires analysis of program structure.➡ Identify subroutines, classes, methods, etc.➡ Thus, first step is to find units of meaning.
![Page 5: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/5.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Tokens
Not every character has an individual meaning.➡ In Java, a ʻ+ʼ can have two interpretations:‣A single ʻ+ʼ means addition.‣A ʻ+ʼ ʻ+ʼ sequence means increment.
➡ A sequence of characters that has an atomic meaning is called a token.
➡ Compiler must identify all input tokens.5
1 2
Source File
3 - 7 5 * f… …o o ;=
![Page 6: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/6.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Tokens
Not every character has an individual meaning.➡ In Java, a ʻ+ʼ can have two interpretations:‣A single ʻ+ʼ means addition.‣A ʻ+ʼ ʻ+ʼ followed by another means increment.
➡ A sequence of characters that has an atomic meaning is called a token.
➡ Compiler must identify all input tokens.6
1 2
Source File
3 - 7 5 * f… …o o ;=
Human Analogy:To understand the meaning of an English
sentence, we do not look at individual characters. Rather, we look at individual words.
Human word = Program token
![Page 7: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/7.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Tokens
Not every character has an individual meaning.➡ In Java, a ʻ+ʼ can have two interpretations:‣A single ʻ+ʼ means addition.‣A ʻ+ʼ ʻ+ʼ followed by another means increment.
➡ A sequence of characters that has an atomic meaning is called a token.
➡ Compiler must identify all input tokens.7
1 2
Source File
3 - 7 5 * f… …o o ;=
Operator: Assignment
![Page 8: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/8.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Tokens
Not every character has an individual meaning.➡ In Java, a ʻ+ʼ can have two interpretations:‣A single ʻ+ʼ means addition.‣A ʻ+ʼ ʻ+ʼ followed by another means increment.
➡ A sequence of characters that has an atomic meaning is called a token.
➡ Compiler must identify all input tokens.8
1 2
Source File
3 - 7 5 * f… …o o ;=
Integer Literal
![Page 9: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/9.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Tokens
Not every character has an individual meaning.➡ In Java, a ʻ+ʼ can have two interpretations:‣A single ʻ+ʼ means addition.‣A ʻ+ʼ ʻ+ʼ followed by another means increment.
➡ A sequence of characters that has an atomic meaning is called a token.
➡ Compiler must identify all input tokens.9
1 2
Source File
3 - 7 5 * f… …o o ;=
Operator: Minus
![Page 10: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/10.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Tokens
Not every character has an individual meaning.➡ In Java, a ʻ+ʼ can have two interpretations:‣A single ʻ+ʼ means addition.‣A ʻ+ʼ ʻ+ʼ followed by another means increment.
➡ A sequence of characters that has an atomic meaning is called a token.
➡ Compiler must identify all input tokens.10
1 2
Source File
3 - 7 5 * f… …o o ;=
Integer Literal
![Page 11: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/11.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Tokens
Not every character has an individual meaning.➡ In Java, a ʻ+ʼ can have two interpretations:‣A single ʻ+ʼ means addition.‣A ʻ+ʼ ʻ+ʼ followed by another means increment.
➡ A sequence of characters that has an atomic meaning is called a token.
➡ Compiler must identify all input tokens.11
1 2
Source File
3 - 7 5 * f… …o o ;=
Operator: Multiplication
![Page 12: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/12.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Tokens
Not every character has an individual meaning.➡ In Java, a ʻ+ʼ can have two interpretations:‣A single ʻ+ʼ means addition.‣A ʻ+ʼ ʻ+ʼ followed by another means increment.
➡ A sequence of characters that has an atomic meaning is called a token.
➡ Compiler must identify all input tokens.12
1 2
Source File
3 - 7 5 * f… …o o ;=
Identifier: foo
![Page 13: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/13.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Tokens
Not every character has an individual meaning.➡ In Java, a ʻ+ʼ can have two interpretations:‣A single ʻ+ʼ means addition.‣A ʻ+ʼ ʻ+ʼ followed by another means increment.
➡ A sequence of characters that has an atomic meaning is called a token.
➡ Compiler must identify all input tokens.13
1 2
Source File
3 - 7 5 * f… …o o ;=
Statementseparator/terminator
![Page 14: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/14.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Lexical vs. Syntactical Analysis
‣ In theory, token discovery (lexical analysis) could be done as part of the structure discovery (syntactical analysis, parsing).‣However, this is unpractical.‣ It is much easier (and much more efficient) to express the syntax rules in terms of tokens.‣ Thus, lexical analysis is made a separate step because it greatly simplifies the subsequently performed syntactical analysis.
14
Why have a separate lexical analysis phase?
![Page 15: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/15.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Example: Java Language Specification
LEXICAL STRUCTURE Operators 3.12
31
3.11 Separators
The following nine ASCII characters are the separators (punctuators):
Separator: one of! " # $ % & ' ( )
3.12 Operators
The following 37 tokens are the operators , formed from ASCII characters:
Operator: one of* + , - . / 0** ,* +* -* 11 22 33 443 4 5 6 1 2 7 8 ,, ++ +++3* 4* 5* 6* 1* 2* 7* 8* ,,* ++* +++*
EXPRESSIONS Prefix Increment Operator ++ 15.15.1
487
A variable that is declared !"#$% cannot be decremented (unless it is a defi-nitely unassigned (§16) blank final variable (§4.12.4)), because when an access ofsuch a !"#$% variable is used as an expression, the result is a value, not a variable.Thus, it cannot be used as the operand of a postfix decrement operator.
15.15 Unary Operators
The unary operators include &, ', &&, '', (, ), and cast operators. Expressionswith unary operators group right-to-left, so that '(* means the same as '+(*,.
UnaryExpression-PreIncrementExpressionPreDecrementExpression&.UnaryExpression'.UnaryExpressionUnaryExpressionNotPlusMinus
PreIncrementExpression:&&.UnaryExpression
PreDecrementExpression:''.UnaryExpression
UnaryExpressionNotPlusMinus-PostfixExpression(.UnaryExpression).UnaryExpressionCastExpression
The following productions from §15.16 are repeated here for convenience:
CastExpression:+.PrimitiveType.,.UnaryExpression+.ReferenceType., UnaryExpressionNotPlusMinus
15.15.1 Prefix Increment Operator ++
A unary expression preceded by a && operator is a prefix increment expression.The result of the unary expression must be a variable of a type that is convertible(§5.1.8) to a numeric type, or a compile-time error occurs. The type of the prefixincrement expression is the type of the variable. The result of the prefix incrementexpression is not a variable, but a value.
EXPRESSIONS Prefix Increment Operator ++ 15.15.1
487
A variable that is declared !"#$% cannot be decremented (unless it is a defi-nitely unassigned (§16) blank final variable (§4.12.4)), because when an access ofsuch a !"#$% variable is used as an expression, the result is a value, not a variable.Thus, it cannot be used as the operand of a postfix decrement operator.
15.15 Unary Operators
The unary operators include &, ', &&, '', (, ), and cast operators. Expressionswith unary operators group right-to-left, so that '(* means the same as '+(*,.
UnaryExpression-PreIncrementExpressionPreDecrementExpression&.UnaryExpression'.UnaryExpressionUnaryExpressionNotPlusMinus
PreIncrementExpression:&&.UnaryExpression
PreDecrementExpression:''.UnaryExpression
UnaryExpressionNotPlusMinus-PostfixExpression(.UnaryExpression).UnaryExpressionCastExpression
The following productions from §15.16 are repeated here for convenience:
CastExpression:+.PrimitiveType.,.UnaryExpression+.ReferenceType., UnaryExpressionNotPlusMinus
15.15.1 Prefix Increment Operator ++
A unary expression preceded by a && operator is a prefix increment expression.The result of the unary expression must be a variable of a type that is convertible(§5.1.8) to a numeric type, or a compile-time error occurs. The type of the prefixincrement expression is the type of the variable. The result of the prefix incrementexpression is not a variable, but a value.
Lexical Structure
Syntactical Structure
![Page 16: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/16.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Example: Java Language Specification
LEXICAL STRUCTURE Operators 3.12
31
3.11 Separators
The following nine ASCII characters are the separators (punctuators):
Separator: one of! " # $ % & ' ( )
3.12 Operators
The following 37 tokens are the operators , formed from ASCII characters:
Operator: one of* + , - . / 0** ,* +* -* 11 22 33 443 4 5 6 1 2 7 8 ,, ++ +++3* 4* 5* 6* 1* 2* 7* 8* ,,* ++* +++*
EXPRESSIONS Prefix Increment Operator ++ 15.15.1
487
A variable that is declared !"#$% cannot be decremented (unless it is a defi-nitely unassigned (§16) blank final variable (§4.12.4)), because when an access ofsuch a !"#$% variable is used as an expression, the result is a value, not a variable.Thus, it cannot be used as the operand of a postfix decrement operator.
15.15 Unary Operators
The unary operators include &, ', &&, '', (, ), and cast operators. Expressionswith unary operators group right-to-left, so that '(* means the same as '+(*,.
UnaryExpression-PreIncrementExpressionPreDecrementExpression&.UnaryExpression'.UnaryExpressionUnaryExpressionNotPlusMinus
PreIncrementExpression:&&.UnaryExpression
PreDecrementExpression:''.UnaryExpression
UnaryExpressionNotPlusMinus-PostfixExpression(.UnaryExpression).UnaryExpressionCastExpression
The following productions from §15.16 are repeated here for convenience:
CastExpression:+.PrimitiveType.,.UnaryExpression+.ReferenceType., UnaryExpressionNotPlusMinus
15.15.1 Prefix Increment Operator ++
A unary expression preceded by a && operator is a prefix increment expression.The result of the unary expression must be a variable of a type that is convertible(§5.1.8) to a numeric type, or a compile-time error occurs. The type of the prefixincrement expression is the type of the variable. The result of the prefix incrementexpression is not a variable, but a value.
EXPRESSIONS Prefix Increment Operator ++ 15.15.1
487
A variable that is declared !"#$% cannot be decremented (unless it is a defi-nitely unassigned (§16) blank final variable (§4.12.4)), because when an access ofsuch a !"#$% variable is used as an expression, the result is a value, not a variable.Thus, it cannot be used as the operand of a postfix decrement operator.
15.15 Unary Operators
The unary operators include &, ', &&, '', (, ), and cast operators. Expressionswith unary operators group right-to-left, so that '(* means the same as '+(*,.
UnaryExpression-PreIncrementExpressionPreDecrementExpression&.UnaryExpression'.UnaryExpressionUnaryExpressionNotPlusMinus
PreIncrementExpression:&&.UnaryExpression
PreDecrementExpression:''.UnaryExpression
UnaryExpressionNotPlusMinus-PostfixExpression(.UnaryExpression).UnaryExpressionCastExpression
The following productions from §15.16 are repeated here for convenience:
CastExpression:+.PrimitiveType.,.UnaryExpression+.ReferenceType., UnaryExpressionNotPlusMinus
15.15.1 Prefix Increment Operator ++
A unary expression preceded by a && operator is a prefix increment expression.The result of the unary expression must be a variable of a type that is convertible(§5.1.8) to a numeric type, or a compile-time error occurs. The type of the prefixincrement expression is the type of the variable. The result of the prefix incrementexpression is not a variable, but a value.
Lexical Structure
Syntactical Structure
Token Specification:These strings mean something, but knowledge of the exact meaning is not required to identify them.
![Page 17: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/17.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Example: Java Language Specification
LEXICAL STRUCTURE Operators 3.12
31
3.11 Separators
The following nine ASCII characters are the separators (punctuators):
Separator: one of! " # $ % & ' ( )
3.12 Operators
The following 37 tokens are the operators , formed from ASCII characters:
Operator: one of* + , - . / 0** ,* +* -* 11 22 33 443 4 5 6 1 2 7 8 ,, ++ +++3* 4* 5* 6* 1* 2* 7* 8* ,,* ++* +++*
EXPRESSIONS Prefix Increment Operator ++ 15.15.1
487
A variable that is declared !"#$% cannot be decremented (unless it is a defi-nitely unassigned (§16) blank final variable (§4.12.4)), because when an access ofsuch a !"#$% variable is used as an expression, the result is a value, not a variable.Thus, it cannot be used as the operand of a postfix decrement operator.
15.15 Unary Operators
The unary operators include &, ', &&, '', (, ), and cast operators. Expressionswith unary operators group right-to-left, so that '(* means the same as '+(*,.
UnaryExpression-PreIncrementExpressionPreDecrementExpression&.UnaryExpression'.UnaryExpressionUnaryExpressionNotPlusMinus
PreIncrementExpression:&&.UnaryExpression
PreDecrementExpression:''.UnaryExpression
UnaryExpressionNotPlusMinus-PostfixExpression(.UnaryExpression).UnaryExpressionCastExpression
The following productions from §15.16 are repeated here for convenience:
CastExpression:+.PrimitiveType.,.UnaryExpression+.ReferenceType., UnaryExpressionNotPlusMinus
15.15.1 Prefix Increment Operator ++
A unary expression preceded by a && operator is a prefix increment expression.The result of the unary expression must be a variable of a type that is convertible(§5.1.8) to a numeric type, or a compile-time error occurs. The type of the prefixincrement expression is the type of the variable. The result of the prefix incrementexpression is not a variable, but a value.
EXPRESSIONS Prefix Increment Operator ++ 15.15.1
487
A variable that is declared !"#$% cannot be decremented (unless it is a defi-nitely unassigned (§16) blank final variable (§4.12.4)), because when an access ofsuch a !"#$% variable is used as an expression, the result is a value, not a variable.Thus, it cannot be used as the operand of a postfix decrement operator.
15.15 Unary Operators
The unary operators include &, ', &&, '', (, ), and cast operators. Expressionswith unary operators group right-to-left, so that '(* means the same as '+(*,.
UnaryExpression-PreIncrementExpressionPreDecrementExpression&.UnaryExpression'.UnaryExpressionUnaryExpressionNotPlusMinus
PreIncrementExpression:&&.UnaryExpression
PreDecrementExpression:''.UnaryExpression
UnaryExpressionNotPlusMinus-PostfixExpression(.UnaryExpression).UnaryExpressionCastExpression
The following productions from §15.16 are repeated here for convenience:
CastExpression:+.PrimitiveType.,.UnaryExpression+.ReferenceType., UnaryExpressionNotPlusMinus
15.15.1 Prefix Increment Operator ++
A unary expression preceded by a && operator is a prefix increment expression.The result of the unary expression must be a variable of a type that is convertible(§5.1.8) to a numeric type, or a compile-time error occurs. The type of the prefixincrement expression is the type of the variable. The result of the prefix incrementexpression is not a variable, but a value.
Lexical Structure
Syntactical Structure
Token Specification:These strings mean something, but knowledge of the exact meaning is not required to identify them.
Meaning is given by where they can occur in the program (grammar) and
and language semantics.
![Page 18: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/18.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Lexical Analysis
The need to identify tokens raises two questions.➡ How can we specify the tokens of a language?➡ How can we recognize tokens in a character stream?
18
Regular Expressions
Language Design and
Specification
Token Specification
Deterministic Finite Automata (DFA)
Language Implementation
Token Recognition
DFA Construction
(several steps)
![Page 19: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/19.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammars and Languages
A regular expression is a kind of grammar.➡ A grammar describes the structure of strings.➡ A string that “matches” a grammar Gʼs structure is
said to be in the language L(G) (which is a set).
19
A grammar is a set of productions:➡ Rules to obtain (produce) a string that is in L(G) via
repeated substitutions.➡ There are many grammar classes (see COMP 455).➡ Two are commonly used to describe programming
languages: regular grammars for tokens and context-free grammars for syntax.
![Page 20: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/20.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
20
![Page 21: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/21.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101: Productions
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
21
“A → B” is called a production.
![Page 22: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/22.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101: Non-Terminals
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
22
The “name” on the left is calleda non-terminal symbol.
![Page 23: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/23.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101: Terminals
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
23
The symbols on the right are either terminal or non-terminal symbols. A terminal symbol is just a character.
![Page 24: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/24.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101: Definition
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
24
“→” means “is a”or “replace with”
![Page 25: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/25.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101: Choice
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
25
“|” denotes or
![Page 26: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/26.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101: Example
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
26
Thus, the first production means:
A digit is a “0” or ‘1’ or ‘2’ or … or ’9’.
![Page 27: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/27.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101: Optional Repetition
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
27
“*” denotes zero or more of a symbol.
![Page 28: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/28.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101: Sequence
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
28
Two symbols next to each other means “followed by.”
![Page 29: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/29.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101: Example
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
29
Thus, this means:
A natural number is a non-zero digit followed by zero or more digits.
![Page 30: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/30.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101: Epsilon
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
30
“ε” is special terminal that means empty.It corresponds to the empty string.
![Page 31: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/31.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101: Example
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
31
So, what does this mean?
![Page 32: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/32.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Grammar 101: Example
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
non_zero_digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
natural_number → non_zero_digit digit*
non_neg_number → (0 | natural_number) ( ( . digit* non_zero_digit) | ε )
32
A non-negative number is a ‘0’ or a natural number, followed by either
nothing or a ‘.’, followed by zero or more digits, followed by (exactly one) digit.
![Page 33: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/33.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Regular Expression Rules
Base case: a regular expression (RE) is either➡ a character (e.g., ʻ0ʼ, ʻ1ʼ, ...), or➡ the empty string (i.e., ʻεʼ).
A compound RE is constructed by➡ concatenation: two REs next to each other (e.g.,
“non_negative_digit digit”),➡ alternation: two REs separated by “|” next to each other
(e.g., “non_negative_digit | digit”),➡ optional repetition: a RE followed by “*” (the Kleene star)
to denote zero or more occurrences, and➡ parentheses (in order to avoid ambiguity).
33
![Page 34: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/34.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Regular Expression Rules
Base case: a regular expression (RE) is either➡ a character (e.g., ʻ0ʼ, ʻ1ʼ, ...), or➡ the empty string (i.e., ʻεʼ).
A compound RE is constructed by➡ concatenation: two REs next to each other (e.g.,
“non_negative_digit digit”),➡ alternation: two REs separated by “|” next to each other
(e.g., “non_negative_digit | digit”),➡ optional repetition: a RE followed by “*” (the Kleene star)
to denote zero or more occurrences, and➡ parentheses (in order to avoid ambiguity).
34
A RE is NEVER defined in terms of itself!Thus, REs cannot define recursive statements.
![Page 35: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/35.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Example
35
Letʼs create a regular expression corresponding to the “City, State ZIP-code” line in mailing addresses.
E.g.:# Chapel Hill, NC 27599-3175! ! Beverly Hills, CA 90210
![Page 36: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/36.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Example
36
Letʼs create a regular expression corresponding to the “City, State ZIP-code” line in mailing addresses.
E.g.:# Chapel Hill, NC 27599-3175! ! Beverly Hills, CA 90210
city_line## # → city ʻ, ʻ state_abbrev ʻ ʻ zip_codecity! ! ! ! → letter (letter | ʻ ʻ letter)*state_abbrev! → ʻALʼ | ʻAKʼ | ʻASʼ | ʻAZʼ | … | ʻWYʼzip_code! ! → digit digit digit digit digit (extra | ε )extra!! ! ! → ʻ-ʼ digit digit digit digitdigit! ! ! ! → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9letter!! ! ! → A | B | C | … | ö | …
![Page 37: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/37.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Regular Sets and Finite Automata
37
Fundamental equivalence:
For every regular set L(G), there exists a deterministic finite automaton (DFA) that
accepts a string S if and only if S∈L(G).
If a grammar G is a regular expression,then the language L(G) is called a regular set.
(See COMP 455 for proof.)
![Page 38: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/38.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA 101
38
Deterministic finite automaton:➡ Has a finite number of states.➡ Exactly one start state.➡ One or more final states.➡ Transitions: define how automaton switches between
states (given an input symbol).
A(Start)
0
B1
0
C1
0, 1
![Page 39: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/39.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA 101
39
Deterministic finite automaton:➡ Has a finite number of states.➡ Exactly one start state.➡ One or more final states.➡ Transitions: define how automaton switches between
states (given an input symbol).
A(Start)
0
B1
0
C1
0, 1
Start State
![Page 40: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/40.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA 101
40
Deterministic finite automaton:➡ Has a finite number of states.➡ Exactly one start state.➡ One or more final states.➡ Transitions: define how automaton switches between
states (given an input symbol).
A(Start)
0
B1
0
C1
0, 1
Intermediate State(neither start nor final)
![Page 41: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/41.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA 101
41
Deterministic finite automaton:➡ Has a finite number of states.➡ Exactly one start state.➡ One or more final states.➡ Transitions: define how automaton switches between
states (given an input symbol).
A(Start)
0
B1
0
C1
0, 1
Final State(indicated by double border)
![Page 42: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/42.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA 101
42
Deterministic finite automaton:➡ Has a finite number of states.➡ Exactly one start state.➡ One or more final states.➡ Transitions: define how automaton switches between
states (given an input symbol).
A(Start)
0
B1
0
C1
0, 1
TransitionGiven an input of ‘1’, if DFA is in
state A, then transition to state B(and consume the input).
![Page 43: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/43.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA 101
43
Deterministic finite automaton:➡ Has a finite number of states.➡ Exactly one start state.➡ One or more final states.➡ Transitions: define how automaton switches between
states (given an input symbol).
A(Start)
0
B1
0
C1
0, 1
Self TransitionGiven an input of ‘0’, if DFA is in state A, then stay in state A
(and consume the input).
![Page 44: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/44.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA 101
44
Deterministic finite automaton:➡ Has a finite number of states.➡ Exactly one start state.➡ One or more final states.➡ Transitions: define how automaton switches between
states (given an input symbol).
A(Start)
0
B1
0
C1
0, 1
Transitions must be unambiguous:For each state and each input, there exist only one transition. This is what makes the DFA deterministic.
Z
Y
X1
1Not a legal DFA!
![Page 45: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/45.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA 101
45
Deterministic finite automaton:➡ Has a finite number of states.➡ Exactly one start state.➡ One or more final states.➡ Transitions: define how automaton switches between
states (given an input symbol).
A(Start)
0
B1
0
C1
0, 1
Multiple TransitionsGiven an input of either ‘0’ or ‘1’, if DFA
is in state C, then stay in state C(and consume the input).
![Page 46: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/46.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA String Processing
String processing.➡ Initially in start state.➡ Sequentially make transitions each character in input string.
A DFA either accepts or rejects a string.➡ Reject if a character is encountered for which no transition is
defined in the current state.➡ Reject if end of input is reached and DFA is not in a final state.➡ Accept if end of input is reached and DFA is in final state.
46
A(Start)
0
B1
0
C1
0, 1
![Page 47: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/47.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Example
47
A(Start)
0
B1
0
C1
0, 1
1Input: 0
current input character
current state
Initially, DFA is in the start State A.The first input character is ʻ1ʼ.
This causes a transition to State B.
![Page 48: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/48.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Example
48
A(Start)
0
B1
0
C1
0, 1
1Input: 0
current input character
current state
The next input character is ʻ0ʼ.This causes a self transition in
State B.
![Page 49: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/49.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Example
49
A(Start)
0
B1
0
C1
0, 1
1Input: 0
current input character
current state
The end of the input is reached, but the DFA is not in a final state:
the string ʼ10ʼ is rejected!
![Page 50: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/50.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA-Equivalent Regular Expression
50
A(Start)
0
B1
0
C1
0, 1
Whatʼs the RE such that the REʼs language is exactly the set of strings that is accepted by this DFA?
0*10*1(1|0)*
![Page 51: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/51.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA-Equivalent Regular Expression
51
A(Start)
0
B1
0
C1
0, 1
Whatʼs the RE such that the REʼs language is exactly the set of strings that is accepted by this DFA?
0*10*1(1|0)*
![Page 52: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/52.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Recognizing Tokens with a DFA
Table-driven implementation.➡ DFAʼs can be represented as a 2-dimensional table.
52
A(Start)
0
B1
0
C1
0, 1
Current State On ‘0’ On ‘1’ NoteA transition to A transition to B startB transition to B transition to C —C transition to C transition to C final
![Page 53: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/53.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Recognizing Tokens with a DFA
Table-driven implementation.➡ DFAʼs can be represented as a 2-dimensional table.
53
A(Start)
0
B1
0
C1
0, 1
Current State On ‘0’ On ‘1’ NoteA transition to A transition to B startB transition to B transition to C —C transition to C transition to C final
currentState = start state;while end of input not yet reached: {c = get next input character;if transitionTable[currentState][c] ≠ null:currentState = transitionTable[currentState][c]
else:reject input
}if currentState is final:accept input
else:reject input
![Page 54: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/54.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Recognizing Tokens with a DFA
Table-driven implementation.➡ DFAʼs can be represented as a 2-dimensional table.
54
A(Start)
0
B1
0
C1
0, 1
Current State On ‘0’ On ‘1’ NoteA transition to A transition to B startB transition to B transition to C —C transition to C transition to C final
currentState = start state;while end of input not yet reached: {c = get next input character;if transitionTable[currentState][c] ≠ null:currentState = transitionTable[currentState][c]
else:reject input
}if currentState is final:accept input
else:reject input
This accepts exactly one token in the input.A real lexer must detect multiple successive tokens.
This can be achieved by resetting to the start state.But what happens if the suffix of one token is the prefix of another?
(See Chapter 2 for a solution.)
![Page 55: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/55.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Lexical AnalysisThe need to identify tokens raises two questions.➡ How can we specify the tokens of a language?‣ With regular expressions.
➡ How can we recognize tokens in a character stream?‣ With DFAs.
55
Deterministic Finite Automata (DFA)Regular Expressions DFA Construction
Token Specification Token Recognition
No single-step algorithm:We first need to construct a Non-Deterministic Finite Automaton…
![Page 56: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/56.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Non-Deterministic Finite Automaton (NFA)
Like a DFA, but less restrictive:➡ Transitions do not have to be unique: each state may have
multiple ambiguous transitions for the same input symbol. (Hence, it can be non-deterministic.)
➡ Epsilon transitions do not consume any input.(They correspond to the empty string.)
➡ Note that every DFA is also a NFA.
56
Z Y
X
1 1
V
A legal NFA fragment.
ε
Acceptance rule:➡ Accepts an input string if there exists
a series of transitions such that the NFA is in a final state when the end of input is reached.
➡ Inherent parallelism: all possible paths are explored simultaneously.
![Page 57: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/57.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA Example
57
aInput: a
![Page 58: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/58.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA Example
58
aInput: a
current input character
current state
Epsilon transition:Can transition from State 1 to State
2 without consuming any input.
![Page 59: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/59.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA Example
59
aInput: a
current input character
current state
Regular transition:Can transition from State 2 to State
3, which consumes the first ʻaʼ.
![Page 60: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/60.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA Example
60
aInput: a
current input character
current state
Epsilon transition:Can transition from State 3 to State 2
without consuming any input.
![Page 61: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/61.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA Example
61
aInput: a
current input character
current state
Regular transition:Can transition from State 2 to State 3,
which consumes the second ʻaʼ.
![Page 62: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/62.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA Example
62
aInput: a
current input character
current state
Epsilon transition from State 3 to 4:End of input reached, but the NFA
can still carry out epsilon transitions.
![Page 63: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/63.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA Example
63
aInput: a
current input character
current state
Input Accepted:
There exists a sequence of transitions such that the NFA is in a final state at the
end of input.
![Page 64: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/64.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Equivalent DFA Construction
Constructing a DFA corresponding to a RE.➡ In theory, this requires two steps.‣From a RE to an equivalent NFA.‣From the NFA to an equivalent DFA.
To be practical, we require a third optimization step.➡ Large DFA to minimal DFA.
64
![Page 65: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/65.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Example
65
0*1(1|0)*RE to NFA
NFA to DFA
Optimization
Final DFA
![Page 66: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/66.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Step 1: RE ➔ NFA
66
Every RE can be converted to a NFA by repeatedly applying four simple rules.➡ Base case: a single character.➡ Concatenation: joining two REs in sequence.➡ Alternation: joining two REs in parallel.➡ Kleene Closure: repeating a RE.
(recall the definition of a RE)
![Page 67: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/67.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
The Four NFA Construction Rules
S
Rule 1—Base case: ‘a’
a
67
![Page 68: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/68.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
The Four NFA Construction Rules
S
Rule 1—Base case: ‘a’
a
68
Simple two-state NFA (even DFA, too).
![Page 69: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/69.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
The Four NFA Construction Rules
Rule 2—Concatenation: AB
S
A
S
B
S
AB
followed by
69
![Page 70: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/70.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
The Four NFA Construction Rules
Rule 2—Concatenation: AB
S
A
S
B
S
AB
70
Not just two states, but any NFA with a single final state.
followed by
![Page 71: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/71.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
The Four NFA Construction Rules
Rule 3--Alternation: “A|B”
S
A
S
B
BS
or
A
A|B
ε
ε
ε
ε
71
![Page 72: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/72.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
The Four NFA Construction Rules
Rule 3--Alternation: “A|B”
S
A
S
B
BS
or
A
A|B
ε
ε
ε
ε
72
Notice the epsilon transitions.
![Page 73: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/73.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
The Four NFA Construction Rules
Rule 4—Kleene Closure: “A*”
S
A
S A
A*
ε ε
ε
ε
73
![Page 74: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/74.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
The Four NFA Construction Rules
Rule 4—Kleene Closure: “A*”
S
A
S A
A*
ε ε
ε
ε
74
Notice the epsilon transitions.
![Page 75: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/75.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
The Four NFA Construction Rules
Rule 4—Kleene Closure: “A*”
S
A
S
zero occurrences
A
A*
ε ε
ε
ε
75
one occurrencerepetition
![Page 76: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/76.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Overview S
S
BS
Aε
ε
ε
ε
S Aε ε
ε
ε
a
BA
76
Four rules:➡ Create two-state NFAs
for individual symbols,e.g., ʻaʼ.
➡ Append consecutive NFAs, e.g., AB.
➡ Alternate choices in parallel, e.g., A|B.
➡ Repeat Kleene Star,e.g., A*.
![Page 77: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/77.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA Construction Example
77
Regular expression: (a|b)(c|d)e*
Apply Rule 1:
Apply Rule 3:
a|b c|d
B
Aε
ε
ε
ε
![Page 78: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/78.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA Construction Example
78
Regular expression: (a|b)(c|d)e*
Apply Rule 2:BA
(a|b)(c|d)
a|b c|d
![Page 79: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/79.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA Construction Example
79
Regular expression: (a|b)(c|d)e*
Apply Rule 1:
Apply Rule 4:
S Aε ε
ε
ε
e*
![Page 80: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/80.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA Construction Example
80
Regular expression: (a|b)(c|d)e*
e*(a|b)(c|d)
Apply Rule 2:BA
(a|b)(c|d)e*
![Page 81: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/81.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Step 2: NFA ➔ DFA
81
Simulating NFA requires exploration of all paths.➡ Either in parallel (memory consumption!).➡ Or with backtracking (large trees!).➡ Both are impractical.
Instead, we derive a DFA that encodes all possible paths.➡ Instead of doing a specific parallel search each time that
we simulate the NFA, we do it only once in general.
Key idea: for each input character, find sets of NFA states that can be reached.➡ These are the states that a parallel search would explore.➡ Create a DFA state + transitions for each such set.➡ Final states: a DFA state is a final state if its corresponding
set of NFA states contains at least one final NFA state.
![Page 82: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/82.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
82
NFA-to-DFA-CONVERSION:" " todo: stack of sets of NFA states." " push {NFA start state and all epsilon-reachable states} onto todo
" "" " while (todo is not empty):" " " curNFA: set of NFA states
curDFA: a DFA state
" " curNFA = todo.popmark curNFA as done
curDFA = find or create DFA state corresponding to curNFA
reachableNFA: set of NFA states "reachableDFA: a DFA state
for each symbol x for which at least one state in curNFA has a transition:" " " " reachableNFA = find each state that is reachable from a state in curNFA via one x transition and any number of epsilon transitions
" " " " if (reachableNFA is not empty and not done):push reachableNFA onto todo
" " " " reachableDFA = find or create DFA state corresponding to reachableNFAadd transition on x from curDFA to reachableDFA
end forend while
![Page 83: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/83.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
83
Regular expression: (a|b)(c|d)e*
continues…
![Page 84: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/84.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
84
Regular expression: (a|b)(c|d)e*
continues…
First Step: before any input is consumed
Find all states that are reachable from the start state via epsilon transitions.
![Page 85: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/85.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
85
Regular expression: (a|b)(c|d)e*
continues…
First Step: before any input is consumed
Create corresponding DFA start state.
![Page 86: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/86.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
86
Regular expression: (a|b)(c|d)e*
continues…
Next: find all input characters for which transitions in start set exist.
ʻaʼ and ʻbʼ in this case.
![Page 87: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/87.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
87
Regular expression: (a|b)(c|d)e*
continues…
For each such input character, determine the set of reachable states
(including epsilon transitions).
On an ʻbʼ, NFA can reach states 5,6,7, and 9.
![Page 88: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/88.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
88
Regular expression: (a|b)(c|d)e*
continues…
Create DFA states for each distinct reachable set of states.
![Page 89: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/89.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
89
Regular expression: (a|b)(c|d)e*
continues…
On an ʻaʼ, NFA can reach states 3,6,7, and 9.
![Page 90: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/90.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
90
Regular expression: (a|b)(c|d)e*
continues…
Create DFA states for each distinct reachable set of states.
![Page 91: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/91.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
91
Regular expression: (a|b)(c|d)e*
continues…
Repeat process for each newly-discovered set of states.
done
![Page 92: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/92.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
92
Regular expression: (a|b)(c|d)e*
Reachable states:on a ʻcʼ: 8,11,12,14
done
![Page 93: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/93.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
93
Regular expression: (a|b)(c|d)e*
Create state and transitions for the set of reachable states.
done
![Page 94: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/94.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
94
Regular expression: (a|b)(c|d)e*
Reachable states:on a ʻdʼ: 10,11,12,14
done
![Page 95: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/95.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
95
Regular expression: (a|b)(c|d)e*
Reachable states:on a ʻdʼ: 10,11,12,14
Create state and transitions for the set of reachable states.
done
![Page 96: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/96.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
96
Regular expression: (a|b)(c|d)e*
Note: both new DFA states are final states because their corresponding sets include NFA state 14, which is a final state.
done
done
![Page 97: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/97.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
97
Regular expression: (a|b)(c|d)e*
done
done
Repeat process for State [3, 6, 7, 9].
![Page 98: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/98.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
98
Regular expression: (a|b)(c|d)e*
done
done
Reachable states:on a ʻdʼ: 10,11,12,14
Reachable states:on a ʻcʼ: 8,11,12,14
![Page 99: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/99.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
99
Regular expression: (a|b)(c|d)e*
done
done
There already exist DFA states corresponding to those sets!Just add transitions to these states.
![Page 100: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/100.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
100
Regular expression: (a|b)(c|d)e*
done
done
done
Repeat process for State [10, 11, 12, 14].
Reachable states:on an ʻeʼ: 12, 13, 14
![Page 101: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/101.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
101
Regular expression: (a|b)(c|d)e*
done
done
done
Create state and transitions for the set of reachable states.
![Page 102: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/102.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
102
Regular expression: (a|b)(c|d)e*
done
done
done
Repeat process for State [8, 11, 12, 14].
Reachable states:on an ʻeʼ: 12, 13, 14
done
98
![Page 103: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/103.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
103
Regular expression: (a|b)(c|d)e*
done
done
done
done
State already exists. Just create transition.
![Page 104: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/104.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
104
Regular expression: (a|b)(c|d)e*
done
done
done
done
done
Repeat process for State [12, 13, 14].
Reachable states:on an ʻeʼ: 12, 13, 14 (itself!)
![Page 105: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/105.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
105
Regular expression: (a|b)(c|d)e*
done
done
done
done
done
There is no “escape” from the set of states [12, 13, 14] on an ʻeʼ. Thus, create a self-loop.
![Page 106: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/106.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Conversion Example
106
Regular expression: (a|b)(c|d)e*
done
done
done
done
done
done
The result: an equivalent DFA!
![Page 107: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/107.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA ➔ DFA Conversion
107
‣Any NFA can be converted into an equivalent DFA using this method.‣However, the number of states can increase exponentially.‣With careful syntax design, this problem can be avoided in practice.‣ Limitation: resulting DFA is not necessarily optimal.
![Page 108: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/108.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
NFA ➔ DFA Conversion
108
‣Any NFA can be converted into an equivalent DFA using this method.‣However, the number of states can increase exponentially.‣With careful syntax design, this problem can be avoided in practice.‣ Limitation: resulting DFA is not necessarily optimal.
These two states are equivalent: for each input element, they both lead to the same state.
Thus, having two states is unnecessary.
![Page 109: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/109.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Step 3: DFA MinimizationGoal: obtain minimal DFA.➡ For each RE, the minimal DFA is unique (ignoring
simple renaming).➡ DFA minimization: merge states that are
equivalent.
Key idea: itʼs easier to split.➡ Start with two partitions: final and non-final states.➡ Repeatedly split partitions until all partitions
contain only equivalent states.➡ Two states S1, S2 are equivalent if all their
transitions “agree,” i.e., if there exists an input symbol x such that the DFA transitions (on input x) to a state in partition P1 if in S1 and to state in partition P2 if in S2 and P1≠P2, then S1 and S2 are not equivalent.
![Page 110: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/110.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Step 3: DFA MinimizationGoal: obtain minimal DFA.➡ For each RE, the minimal DFA is unique (ignoring
simple renaming).➡ DFA minimization: merge states that are
equivalent.
Key idea: itʼs easier to split.➡ Start with two partitions: final and non-final states.➡ Repeatedly split partitions until all partitions
contain only equivalent states.➡ Two states S1, S2 are equivalent if all their
transitions “agree,” i.e., if there exists an input symbol x such that the DFA transitions (on input x) to a state in partition P1 if in S1 and to state in partition P2 if in S2 and P1≠P2, then S1 and S2 are not equivalent.
Part. 1
Part. 2
Part. 3
![Page 111: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/111.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Step 3: DFA MinimizationGoal: obtain minimal DFA.➡ For each RE, the minimal DFA is unique (ignoring
simple renaming).➡ DFA minimization: merge states that are
equivalent.
Key idea: itʼs easier to split.➡ Start with two partitions: final and non-final states.➡ Repeatedly split partitions until all partitions
contain only equivalent states.➡ Two states S1, S2 are equivalent if all their
transitions “agree,” i.e., if there exists an input symbol x such that the DFA transitions (on input x) to a state in partition P1 if in S1 and to state in partition P2 if in S2 and P1≠P2, then S1 and S2 are not equivalent.
Part. 1
Part. 2
Part. 3
A and B are equivalent.
C is not equivalent to either A or B.
Because it has a transition into Part.3.
![Page 112: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/112.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Minimization Example
112
![Page 113: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/113.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Minimization Example
113
Partition final and non-final states.
FinalNon-Final
![Page 114: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/114.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Minimization Example
114
Examine final states.
All final states are equivalent!
FinalNon-Final
![Page 115: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/115.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Minimization Example
115
[1,2,4] is not equivalent to any other state:it is the only state with a transition to the non-final partition.
FinalNon-Final
![Page 116: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/116.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Minimization Example
116
[5,6,7,9] and [3,6,7,9] are equivalent.Thus, we are done.
Final
![Page 117: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/117.jpg)
UNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
DFA Minimization Example
117
Create one state for each partition.We have obtained a minimal DFA for (a|b)(c|d)e*.
![Page 118: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/118.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Recognizing Multiple Tokens
118
‣Construction up to this point can only recognize a single token type.‣Results in Accept or Reject, but does not yield which token was seen.
‣Real lexical analysis must discern between multiple token types.‣Solution: annotate final states with token type.
![Page 119: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/119.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Multi Token ConstructionTo build DFA for N tokens:➡ Create a NFA for each token type RE as before.➡ Join all token NFAs as shown below:
119
Type 1
S
NFA 1ε
εType 2NFA 2
Type NNFA N
ε …
![Page 120: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/120.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Multi Token ConstructionTo build DFA for N tokens:➡ Create a NFA for each token type RE as before.➡ Join all token NFAs as shown below:
120
Type 1
S
NFA 1ε
εType 2NFA 2
Type NNFA N
ε …
This is similar to NFA construction rule 3.Key difference: we keep all final states.
![Page 121: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/121.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Multi Token ConstructionTo build DFA for N tokens:➡ Create a NFA for each token type RE as before.➡ Join all token NFAs as shown below:
121
Type 1
S
NFA 1ε
εType 2NFA 2
Type NNFA N
ε …
This is similar to NFA construction rule 3.Key difference: we keep all final states.
![Page 122: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/122.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Token PrecedenceConsider the following regular grammar.➡ Create DFA to recognize identifiers and keywords.
122
identifier## # → letter (letter | digit | _)*keyword## # → if | else | whiledigit! ! ! ! → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9letter!! ! ! → a | b | c | … | z
Can you spot a problem?
![Page 123: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/123.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Token PrecedenceConsider the following regular grammar.➡ Create DFA to recognize identifiers and keywords.
123
identifier## # → letter (letter | digit | _)*keyword## # → if | else | whiledigit! ! ! ! → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9letter!! ! ! → a | b | c | … | z
All keywords are also identifiers!The grammar is ambiguous.
Example: for string ʻwhileʼ, there are two accepting states in the final NFA with different labels.
![Page 124: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/124.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Token PrecedenceConsider the following regular grammar.➡ Create DFA to recognize identifiers and keywords.
124
identifier## # → letter (letter | digit | _)*keyword## # → if | else | whiledigit! ! ! ! → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9letter!! ! ! → a | b | c | … | z
Solution➡ Assign precedence values to tokens (and labels).➡ In case of ambiguity, prefer final state with highest
precedence value.
![Page 125: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/125.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Token PrecedenceConsider the following regular grammar.➡ Create DFA to recognize identifiers and keywords.
125
identifier## # → letter (letter | digit | _)*keyword## # → if | else | whiledigit! ! ! ! → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9letter!! ! ! → a | b | c | … | z
Solution➡ Assign precedence values to tokens (and labels).➡ In case of ambiguity, prefer final state with highest
precedence value.
Note: during DFA optimization, two final states are not equivalent if they are labeled with
different token types.
![Page 126: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/126.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Multi Token Example
126
Java Integer Literals
![Page 127: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/127.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Multi Token Example
127
Java Integer Literals
Final states labeled with token type.
![Page 128: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/128.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Multi Token Example
128
Java Integer Literals
Final states can have transitions into
non-final states.
![Page 129: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/129.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
+ Kleene Plus
name → letter+is the same as
name → letter letter*
Extended Regular Expressions
129
some commonly used abbreviations+ ? [] [^]n
![Page 130: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/130.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
n times
name → letter3
is the same asname → letter letter letter
Extended Regular Expressions
130
some commonly used abbreviations+ ? [] [^]n
![Page 131: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/131.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Extended Regular Expressions
131
some commonly used abbreviations+ ? [] [^]n
? optionally
ZIP → digit5 (-digit4)?is the same as
ZIP → digit5 ( ε | -digit4 )
![Page 132: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/132.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Extended Regular Expressions
132
some commonly used abbreviations+ ? [] [^]n
[] one off
digit → [123456789]is the same as
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
![Page 133: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/133.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Extended Regular Expressions
133
some commonly used abbreviations+ ? [] [^]n
[^] not one off
notADigit → [^123456789]is the same as
notADigit → A | B | C ...
![Page 134: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/134.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Extended Regular Expressions
134
some commonly used abbreviations+ ? [] [^]n
[^] not one off
notADigit → [^123456789]is the same as
notADigit → A | B | C ...
Every character except those listed between [^ and ].
![Page 135: Lexical Analysis - Computer Science · 2012-04-02 · 04: Lexical Analysis COMP 524: Programming Language Concepts Lexical vs. Syntactical Analysis ‣ In theory, token discovery](https://reader033.vdocuments.net/reader033/viewer/2022041808/5e563fa6a0895e7a2c7fd588/html5/thumbnails/135.jpg)
UNC Chapel HillUNC Chapel Hill Brandenburg — Spring 2010
COMP 524: Programming Language Concepts04: Lexical Analysis
Limitations of REsSuppose we wanted to remove extraneous, balanced ʻ(ʻ ʻ)ʼ pairs around identifiers.➡ Example: report (sum), ((sum)) and (((sum)))
simply as Identifier.➡ But not: ((sum)
One might try:
135
identifier → (n letter+ )m such that n = m
This cannot be expressed with regular expressions!Requires a recursive grammar: let the parser do it.