CPSC 325 - Compiler
Tutorial 2
Scanner & Lex
Tokens
Token Stream: Each significant lexical chunk of the program is represented by a token
- Operators & Punctuation: { } ! + - = * ; : …
- Keywords: if while return goto
- Identifier: id & actual name
- Constants: kind & value; int, floating-point, character, string, …
Input
Token – example 1
Input text
if( x >= y ) y = 10;
Token Stream
IF LP ID(x) GEQ ID(y) RP ID(y) Assign INT(10) SEMI
Parser
Tokens
IF LP ID(x) GEQ ID(y) RP ID(y) Assign INT(10) SEMI
IfStmt
  >=
    ID(x)
    ID(y)
  assign
    ID(y)
    INT(10)
Sample Grammar
program ::= statement | program statement
statement ::= assignStmt | ifStmt
assignStmt ::= id = expr ;
ifStmt ::= if ( expr ) statement
expr ::= id | int | expr + expr
id ::= a | b | … | y | z
int ::= 1 | 2 | … | 9 | 0
Why Separate the Scanner and Parser?
Simplicity & Separation of Concerns
- Scanner hides details from parser (comments, whitespace, input files, etc.)
- Parser is easier to build; has simpler input stream
Efficiency
- Scanner can use simpler, faster design
  (But still often consumes a surprising amount of the compiler’s total execution time)
Principle of Longest Match
In most languages, the scanner should pick the longest possible string to make up the next token if there is a choice.
Example: return apple != banana;
Should be recognized as 5 tokens
Not more (not parts of words or identifiers, and not ! and = as separate tokens)
RETURN ID(apple) NEQ ID(banana) SEMI
Scanner DFA Example (1)
State 0 (start): loops on white space or comments
0 --end of input--> 1: Accept EOF
0 --(--> 2: Accept LP
0 --)--> 3: Accept RP
0 --;--> 4: Accept SEMI
Scanner DFA Example (2)
Start state: loops on white space or comments
start --!--> 5
5 --=--> 6: Accept NEQ
5 --other--> 7: Accept NOT
start --<--> 8
8 --=--> 9: Accept LEQ
8 --other--> 10: Accept LESS
Scanner DFA Example (3)
Start state: loops on white space or comments
start --[0-9]--> 11
11 --[0-9]--> 11
11 --other--> 12: Accept INT
Scanner DFA Example (4)
Start state: loops on white space or comments
start --[a-zA-Z]--> 13
13 --[a-zA-Z]--> 13
13 --other--> 14: Accept ID or keyword
Lex/Flex
Use Flex instead of Lex
Use Bison instead of yacc
When compiling, link to the library:
flex file.lex
gcc -o object lex.yy.c -ll
object
Lex - Structure
Declarations/Definitions
%%
Rules/Production
- Lex expression
- white space
- C statement (optional)
%%
Additional Code/Subroutines
Lex – Basic operators
* - zero or more occurrences
. - “ANY” character (except newline)
.* - matches any sequence
| - separator between alternatives
+ - one or more occurrences (a+ :== aa*)
? - zero or one of something (b? :== (b+null))
[ ] - choice, so [12345] is (1|2|3|4|5)
  (Note: [*+] represents a choice between star and plus; inside brackets they lose their special meaning.)
- - range, so [a-zA-Z] is a to z and A to Z, all the letters
\ - escape, so \* matches * and \. matches a period or decimal point