Lexical Analysis

Post on 21-Dec-2015


Page 1: Lexical Analysis. The Input Read string input Might be sequence of characters (Unix) Might be sequence of lines (VMS) Character set: ASCII ISO Latin-1

Lexical Analysis


The Input

Read string input
    Might be sequence of characters (Unix)
    Might be sequence of lines (VMS)
Character set:
    ASCII
    ISO Latin-1
    ISO 10646 (16-bit = Unicode): Ada, Java
    Others (EBCDIC, JIS, etc.)


The Output

A series of tokens: kind, location, name (if any)
    Punctuation: ( ) ; , [ ]
    Operators: + - ** :=
    Keywords: begin end if while try catch
    Identifiers: Square_Root
    String literals: "press Enter to continue"
    Character literals: 'x'
    Numeric literals:
        Integer: 123
        Floating_point: 4_5.23e+2
        Based representation: 16#ac#
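
A minimal sketch of what a token record with these three fields might look like. The names TokenKind and Token, and the choice of line/column as the location, are illustrative assumptions rather than anything prescribed by the slides.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class TokenKind(Enum):
    # A small, hypothetical subset of token kinds.
    IDENTIFIER = auto()
    ASSIGN = auto()
    NUMERIC_LITERAL = auto()


@dataclass(frozen=True)
class Token:
    kind: TokenKind
    line: int                   # location: 1-based source line
    column: int                 # location: 1-based source column
    name: Optional[str] = None  # spelling, only for tokens that carry one


# Tokens a scanner could return for the source line "X := 123".
tokens = [
    Token(TokenKind.IDENTIFIER, 1, 1, "X"),
    Token(TokenKind.ASSIGN, 1, 3),  # kind alone identifies the token
    Token(TokenKind.NUMERIC_LITERAL, 1, 6, "123"),
]
```

Punctuation and operators need no name because the kind already says everything; only identifiers and literals carry extra data.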


Free form vs Fixed form

Free form languages (most modern ones)
    White space does not matter. Ignore these:
        Tabs, spaces, new lines, carriage returns
    Only the ordering of tokens is important
Fixed format languages (historical)
    Layout is critical
        Fortran: label in cols 1-5, continuation mark in col 6
        COBOL: areas A and B
    Lexical analyzer must know about layout to find tokens


Punctuation: Separators

Typically individual special characters such as ( { } : .. (two dots)
Sometimes double characters: lexical scanner looks for longest token:
    (*, /*, -- comment openers in various languages
Returned just as identity (kind) of token
    And perhaps location for error messages and debugging purposes
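
The "longest token" rule can be sketched as follows. The PUNCTUATION list and the function name longest_punctuation are assumptions made for illustration; a real scanner would normally branch on the first character instead of trying every candidate.

```python
# Hypothetical punctuation set mixing one- and two-character tokens.
PUNCTUATION = ["(", ")", "{", "}", ":", ".", "..", "(*", "/*", "--", "-", "*", "/"]


def longest_punctuation(text, pos):
    """Return the longest punctuation token starting at text[pos], or None."""
    best = None
    for candidate in PUNCTUATION:
        if text.startswith(candidate, pos):
            if best is None or len(candidate) > len(best):
                best = candidate
    return best
```

With this rule, "(*" is returned as a comment opener rather than as "(" followed by "*", and ".." is one token rather than two periods.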


Operators

Like punctuation
    No real difference for lexical analyzer
Typically single or double special chars
    Operators: + - == <=
    Operations: := =>
Returned as kind of token
    And perhaps location


Keywords

Reserved identifiers
    E.g. BEGIN END in Pascal, if in C, catch in C++
May be distinguished from identifiers
    E.g. mode (keyword, set in bold or stropped) vs mode (identifier) in Algol-68
Returned as kind of token
    With possible location information
Oddity: unreserved keywords in PL/1
    IF IF THEN THEN = THEN + 1;
    Handled as identifiers (parser disambiguates)


Identifiers

Rules differ
    Length, allowed characters, separators
Need to build a names table
    Single entry for all occurrences of Var1
    Language may be case insensitive: same entry for VAR1, vAr1, Var1
    Typical structure: hash table
Lexical analyzer returns token kind
    And key (index) to table entry
    Table entry includes location information


Organization of names table

Most common structure is hash table
    With fixed number of headers
    Chain according to hash code
    Serial search on one chain
Hash code computed from characters (e.g. sum mod table size).
No hash code is perfect! Expect collisions.
Avoid any arbitrary limits on table or chain size.
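
A minimal sketch of such a names table, assuming a chained hash table with a fixed number of headers and the "sum of character codes mod table size" hash mentioned above. The class name NamesTable, the default of 211 headers, and the case_insensitive flag are hypothetical choices for illustration.

```python
class NamesTable:
    def __init__(self, headers=211, case_insensitive=False):
        self.chains = [[] for _ in range(headers)]  # fixed number of headers
        self.entries = []                           # key -> spelling as first seen
        self.case_insensitive = case_insensitive

    def _hash(self, name):
        # Sum of character codes modulo the table size.
        return sum(ord(ch) for ch in name) % len(self.chains)

    def enter(self, spelling):
        """Return the key for spelling, creating an entry on first sight."""
        name = spelling.lower() if self.case_insensitive else spelling
        chain = self.chains[self._hash(name)]
        for stored_name, key in chain:              # serial search on one chain
            if stored_name == name:
                return key
        key = len(self.entries)
        self.entries.append(spelling)               # keep the user's casing
        chain.append((name, key))                   # chains grow without limit
        return key
```

Names such as "ab" and "ba" collide under this hash and simply share a chain. Keeping the first spelling in entries lets diagnostics follow the user's casing even when lookups ignore case.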


String Literals

Text must be stored
Actual characters are important
    Not like identifiers: must preserve casing
    Character set issues: uniform internal representation
Table needed
    Lexical analyzer returns key into table
    May or may not be worth hashing to avoid duplicates


Character Literals

Similar issues to string literals
Lexical analyzer returns
    Token kind
    Identity of character
Cannot assume character set of host machine, may be different


Numeric Literals

Need a table to store numeric value
    E.g. 123 = 0123 = 01_23 (Ada)
    But cannot use predefined type for values
        Because may have different bounds
Floating point representations much more complex
    Denormals, correct rounding
    Very delicate to compute correct value.
Host / target issues
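
A rough sketch of computing the value of an Ada-style integer literal independently of any fixed-width host type. Python integers are unbounded, which sidesteps the bounds problem here; in a language with fixed-width integers a multi-precision representation would be needed. The function name ada_integer_value is hypothetical, and the sketch does not validate underscore placement or handle exponents.

```python
def ada_integer_value(text: str) -> int:
    """Value of a decimal or based integer literal such as 01_23 or 16#ac#."""
    cleaned = text.replace("_", "")       # underscores only separate digits
    if "#" not in cleaned:
        return int(cleaned, 10)
    base_text, digits, tail = cleaned.split("#")
    if tail:
        raise ValueError("exponents are not handled in this sketch")
    base = int(base_text, 10)
    if not 2 <= base <= 16:
        raise ValueError("base must be between 2 and 16")
    return int(digits, base)
```

All spellings of the same number map to one value, so the literal table can hold a single entry for 123, 0123 and 01_23.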


Handling Comments

Comments have no effect on program
Can be eliminated by scanner
    But may need to be retrieved by tools
Error detection issues
    E.g. unclosed comments
Scanner skips over comments and returns next meaningful token


Case Equivalence

Some languages are case-insensitive
    Pascal, Ada
Some are not
    C, Java
Lexical analyzer ignores case if needed
    This_Routine = THIS_RouTine
Error analysis may need exact casing
    Friendly diagnostics follow user's conventions


Performance Issues

Speed
    Lexical analysis can become bottleneck
    Minimize processing per character
        Skip blanks fast
    I/O is also an issue (read large blocks)
We compile frequently
    Compilation time is important
        Especially during development
Communicate with parser through global variables


General Approach

Define set of token kinds:
    An enumeration type (tok_int, tok_if, tok_plus, tok_left_paren, tok_assign etc).
    Or a series of integer definitions in more primitive languages…
Some tokens carry associated data
    E.g. key for identifier table
May be useful to build tree node
    For identifiers, literals etc


Interface to Lexical Analyzer

Either: Convert entire file to a file of tokens
    Lexical analyzer is separate phase
Or: Parser calls lexical analyzer to supply next token
    This approach avoids extra I/O
    Parser builds tree incrementally, using successive tokens as tree nodes


Relevant Formalisms

Type 3 (Regular) Grammars
Regular Expressions
Finite State Machines
Equivalent in expressive power
Useful for program construction, even if hand-written


Regular Grammars

Regular grammars
    Non-terminals (arbitrary names)
    Terminals (characters)
    Productions limited to the following:
        Non-terminal ::= terminal
        Non-terminal ::= terminal Non-terminal
    Treat character class (e.g. digit) as terminal
Regular grammars cannot count: no practical way to express size limits on identifiers, literals
Cannot express proper nesting (parentheses)


Regular Grammars

Grammar for real literals with no exponent
    digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    REAL ::= digit REAL1
    REAL1 ::= digit REAL1 (arbitrary size)
    REAL1 ::= . INTEGER
    INTEGER ::= digit INTEGER (arbitrary size)
    INTEGER ::= digit
Start symbol is REAL


Regular Expressions

Regular expressions (RE) defined by an alphabet (terminal symbols) and three operations:
    Alternation RE1 | RE2
    Concatenation RE1 RE2
    Repetition RE* (zero or more RE's)
Languages described by RE's = languages generated by regular grammars
Regular expressions are more convenient for some applications


Specifying RE’s in Unix Tools

Single characters: a b c d \x
Alternation: [bcd] [b-z] ab|cd
Any character: . (period)
Match sequence of characters: x* y+
Concatenation: abc[d-q]
Optional RE: [0-9]+(\.[0-9]*)?
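
These notations can be tried out directly. The sketch below uses Python's standard re module, whose syntax agrees with the Unix tools for the constructs listed here; the helper name is_number is an assumption for illustration.

```python
import re

# Digits, optionally followed by a period and more digits.
NUMBER = re.compile(r"[0-9]+(\.[0-9]*)?")


def is_number(text):
    """True if the whole of text matches the optional-fraction pattern."""
    return NUMBER.fullmatch(text) is not None


# Character class plus concatenation: "abc" followed by one letter in d..q.
ABC_THEN_D_TO_Q = re.compile(r"abc[d-q]")

# Alternation between two two-character sequences.
AB_OR_CD = re.compile(r"ab|cd")
```

Note that the pattern accepts "12." (the fraction digits may be empty) but not ".5" (at least one leading digit is required).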


Finite State Machines

A language defined by a grammar is a (possibly infinite) set of strings

An automaton is a computation that determines whether a given string belongs to a specified language

A finite state machine (FSM) is an automaton that recognizes regular languages (those described by regular expressions)

Simplest automaton: memory is single number (state)


Specifying an FSM

A set of labeled states
Directed arcs between states labeled with character
One or more states may be terminal (accepting)
A distinguished state is start
Automaton makes transition from state S1 to S2
    If and only if arc from S1 to S2 is labeled with next character in input
Token is legal if automaton stops on terminal state
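
One way to write such a specification down is as a dictionary of arcs. The machine below (states S, Int and Id, recognizing integers and identifiers) is a hypothetical example chosen for illustration, and arcs are labeled with character classes rather than individual characters.

```python
import string


def classify(ch):
    """Map a character to the class used as an arc label."""
    if ch in string.digits:
        return "digit"
    if ch in string.ascii_letters:
        return "letter"
    if ch == "_":
        return "underscore"
    return "other"


START = "S"
ARCS = {  # (state, arc label) -> next state
    ("S", "digit"): "Int",
    ("Int", "digit"): "Int",
    ("S", "letter"): "Id",
    ("Id", "letter"): "Id",
    ("Id", "digit"): "Id",
    ("Id", "underscore"): "Id",
}
ACCEPTING = {"Int": "integer", "Id": "identifier"}  # terminal states


def run_fsm(text):
    """Return the token kind if text is legal, otherwise None."""
    state = START
    for ch in text:
        state = ARCS.get((state, classify(ch)))
        if state is None:           # no arc labeled with this character
            return None
    return ACCEPTING.get(state)     # legal only if we stop on a terminal state
```

For example, run_fsm("x_1") ends in the accepting state Id, while run_fsm("1x") fails because Int has no arc labeled letter.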


Building FSM from Grammar

One state for each non-terminal
A rule of the form
        Nt1 ::= terminal
    generates transition from S1 to final state
A rule of the form
        Nt1 ::= terminal Nt2
    generates transition from S1 to S2 on an arc labeled by the terminal
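
The two rules can be applied mechanically. The sketch below encodes productions as (non-terminal, terminal, next non-terminal or None) triples, an encoding assumed here for illustration, and uses a grammar for real literals without exponent as input. Because INTEGER has two productions starting with digit, the resulting machine is non-deterministic, so the simulation tracks a set of current states.

```python
FINAL = "<final>"

# REAL ::= digit REAL1, REAL1 ::= digit REAL1 | . INTEGER,
# INTEGER ::= digit INTEGER | digit
PRODUCTIONS = [
    ("REAL", "digit", "REAL1"),
    ("REAL1", "digit", "REAL1"),
    ("REAL1", ".", "INTEGER"),
    ("INTEGER", "digit", "INTEGER"),
    ("INTEGER", "digit", None),   # Nt ::= terminal  -> arc to the final state
]


def build_fsm(productions):
    """One state per non-terminal; one arc per production."""
    arcs = {}
    for lhs, terminal, rhs in productions:
        target = FINAL if rhs is None else rhs
        arcs.setdefault((lhs, terminal), set()).add(target)
    return arcs


def label(ch):
    # The character class "digit" is treated as a single terminal.
    return "digit" if ch in "0123456789" else ch


def accepts(arcs, start, text):
    states = {start}
    for ch in text:
        states = {t for s in states for t in arcs.get((s, label(ch)), ())}
        if not states:
            return False
    return FINAL in states


REAL_FSM = build_fsm(PRODUCTIONS)
```

Here accepts(REAL_FSM, "REAL", "3.14") holds, while "12" and "1." are rejected because the grammar requires a period followed by at least one digit.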


Graphic representation

[Figure: state diagram with start state S and states Int and id; arcs labeled digit, letter and underscore]


Building FSM’s from RE’s

Every RE corresponds to a grammar
For all regular expressions
    A natural translation to FSM exists
Alternation often leads to non-deterministic machines


Non-Deterministic FSM

A non-deterministic FSM
    Has at least one state
        With two arcs to two distinct states
        Labeled with the same character
Example: from start state, a digit can begin an integer literal or a real literal
Direct implementation requires backtracking
    Nasty


Deterministic FSM

For all states S
    For all characters C:
        There is at most one arc from any state S that is labeled with C
Much easier to implement
    No backtracking


From NFSM to DFSM

There is an algorithm for converting a non-deterministic machine to a deterministic one
    Result may have exponentially more states
        Intuitively: need new states to express uncertainty about token: int or real
    Algorithm is efficient in practice (e.g. grep)
Other algorithms for minimizing number of states of FSM, for showing equivalence, etc.
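
The usual conversion is the subset construction, sketched below under the assumption that the non-deterministic machine has no empty (epsilon) transitions. Each deterministic state is a set of original states, which is how the "int or real" uncertainty gets expressed. The state names S, I, R, F and the function name are hypothetical.

```python
def subset_construction(nfa, start, accepting, alphabet):
    """Convert an NFSM (dict: (state, symbol) -> set of states) to a DFSM."""
    start_set = frozenset([start])
    dfa = {}                      # (frozenset, symbol) -> frozenset
    seen = {start_set}
    work = [start_set]
    while work:
        current = work.pop()
        for symbol in alphabet:
            target = frozenset(
                t for s in current for t in nfa.get((s, symbol), ())
            )
            if not target:
                continue          # no arc: the DFSM simply has no transition
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    dfa_accepting = {s for s in seen if s & accepting}
    return dfa, start_set, dfa_accepting


# From S, a digit may begin an integer (I) or a real (R); F is the fraction part.
NFA = {
    ("S", "digit"): {"I", "R"},
    ("I", "digit"): {"I"},
    ("R", "digit"): {"R"},
    ("R", "."): {"F"},
    ("F", "digit"): {"F"},
}
DFA, DFA_START, DFA_ACCEPTING = subset_construction(
    NFA, "S", {"I", "F"}, ["digit", "."]
)
```

In this example the deterministic machine has only three states: {S}, {I, R} and {F}. The combined state {I, R} means "still either an integer or the integer part of a real".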


Implementing the Scanner

Three methods
Hand-coded approach:
    draw DFSM, then implement with loop and case statement
Hybrid approach:
    define tokens using regular expressions, convert to NFSM, apply algorithm to obtain minimal DFSM
    Hand-code resulting DFSM
Automated approach:
    Use regular expressions as input to lexical scanner generator (e.g. LEX)


Hand-coding

Normal coding techniques
Scan over white space and comments till non-blank character found.
Branch depending on first character:
    If digit, scan numeric literal
    If letter, scan identifier or keyword
    If operator, check next character (++, etc.)
Need table to determine character type efficiently
Return token found
Write aggressive efficient code: goto's, global variables
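
A compact sketch of this structure, written for clarity rather than speed. The keyword set, the two-character operator set, the Ada-style "--" comments and all names are assumptions made for illustration. Letters and underscore share one character class for brevity, and only integer literals are scanned.

```python
import string

# Character-class table indexed by character code (ASCII only in this sketch).
LETTER, DIGIT, BLANK, OTHER = range(4)
CHAR_CLASS = [OTHER] * 128
for c in string.ascii_letters + "_":
    CHAR_CLASS[ord(c)] = LETTER
for c in string.digits:
    CHAR_CLASS[ord(c)] = DIGIT
for c in " \t\r\n":
    CHAR_CLASS[ord(c)] = BLANK

KEYWORDS = {"begin", "end", "if", "while"}
TWO_CHAR_OPERATORS = {":=", "<=", "==", "=>", "**"}


def char_class(ch):
    code = ord(ch)
    return CHAR_CLASS[code] if code < 128 else OTHER


def scan(text):
    """Yield (kind, lexeme) pairs for each token in text."""
    pos, n = 0, len(text)
    while True:
        # Scan over white space and comments until a meaningful character.
        while pos < n:
            if char_class(text[pos]) == BLANK:
                pos += 1
            elif text.startswith("--", pos):
                while pos < n and text[pos] != "\n":
                    pos += 1
            else:
                break
        if pos == n:
            return
        start = pos
        first = char_class(text[pos])     # branch on the first character
        if first == DIGIT:
            while pos < n and char_class(text[pos]) == DIGIT:
                pos += 1
            kind = "integer"
        elif first == LETTER:
            while pos < n and char_class(text[pos]) in (LETTER, DIGIT):
                pos += 1
            word = text[start:pos].lower()
            kind = "keyword" if word in KEYWORDS else "identifier"
        else:
            # Operator: check the next character for a two-character form.
            pos += 2 if text[pos:pos + 2] in TWO_CHAR_OPERATORS else 1
            kind = "operator"
        yield kind, text[start:pos]
```

The generator form corresponds to the parser asking for one token at a time; a production scanner would replace the per-character function calls with inline table lookups.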


Using grammar and FSM

Start with regular grammar or RE
    Typically found in the language reference
Example (Ada): Chapter 2. Lexical Elements
    digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    decimal-literal ::= integer [.integer] [exponent]
    integer ::= digit {[underline] digit}
    exponent ::= E [+] integer | E - integer


Using grammar and FSM

Create one state for each non-terminal
Label edges according to productions in grammar
Each state becomes a label in the program
Code for each state is a switch on next character, corresponding to edges out of current state
If no possible transition on next character, then:
    If state is accepting, return the corresponding token
    If state is not accepting, report error


Hand-coded version:

Each state is encoded as follows:

<<state1>>
   case Next_Character is
      when 'a' => goto state3;
      when 'b' => goto state1;
      when others => End_of_token_processing;
   end case;
<<state2>>
   …

No explicit mention of state of automaton


Translating from FSM to code

Variable holds current state:

loop
   case State is
      when state1 =>
         <<state1>>
         case Next_Character is
            when 'a' => State := state3;
            when 'b' => State := state1;
            when others => End_token_processing;
         end case;
      when state2 …
      …
   end case;
end loop;


Automatic scanner construction

LEX builds a transition table, indexed by state and by character.
Code gets transition from table:

Tab : array (State, Character) of State := …
begin
   while More_Input loop
      Curstate := Tab (Curstate, Next_Char);
      if Curstate = Error_State then …
   end loop;
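
The same loop can be sketched in runnable form. The table here is filled in by hand for a hypothetical two-token language (integers and identifiers); a generator such as LEX would compute it from the regular expressions. State names and the function name next_token are illustrative assumptions.

```python
ERROR, START, IN_INT, IN_ID = range(4)
ACCEPTING = {IN_INT: "integer", IN_ID: "identifier"}

# TABLE[state][character code] -> next state; everything defaults to ERROR.
TABLE = [[ERROR] * 128 for _ in range(4)]
for code in range(128):
    ch = chr(code)
    if ch.isdigit():
        TABLE[START][code] = IN_INT
        TABLE[IN_INT][code] = IN_INT
        TABLE[IN_ID][code] = IN_ID
    elif ch.isalpha():
        TABLE[START][code] = IN_ID
        TABLE[IN_ID][code] = IN_ID
    elif ch == "_":
        TABLE[IN_ID][code] = IN_ID


def next_token(text, pos):
    """Return (kind, lexeme, new position) for the token starting at pos."""
    state = START
    end = pos
    while end < len(text):
        code = ord(text[end])
        following = TABLE[state][code] if code < 128 else ERROR
        if following == ERROR:
            break                      # no transition: the token ends here
        state = following
        end += 1
    kind = ACCEPTING.get(state)
    if kind is None:
        raise ValueError(f"no token starts at position {pos}")
    return kind, text[pos:end], end
```

The inner loop does one table lookup per character and contains no token-specific logic, which is what makes the table-driven form easy to generate automatically.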


Automatic FSM Generation

Our example: FLEX
    See home page for manual in HTML
FLEX is given
    A set of regular expressions
    Actions associated with each RE
It builds a scanner
    Which matches RE's and executes actions


Flex General Format

Input to Flex is a set of rules:
    Regexp    actions (C statements)
    Regexp    actions (C statements)
    …
Flex scans the longest matching Regexp
    And executes the corresponding actions


An Example of a Flex scanner

DIGIT    [0-9]
ID       [a-z][a-z0-9]*
%%
{DIGIT}+ {
    printf("an integer %s (%d)\n", yytext, atoi(yytext));
}

{DIGIT}+"."{DIGIT}* {
    printf("a float %s (%g)\n", yytext, atof(yytext));
}

if|then|begin|end|procedure|function {
    printf("a keyword: %s\n", yytext);
}


Flex Example (continued)

{ID}    printf("an identifier %s\n", yytext);

"+"|"-"|"*"|"/"    { printf("an operator %s\n", yytext); }

"--".*\n    /* eat Ada style comment */

[ \t\n]+    /* eat white space */

.    printf("unrecognized character");
%%


Assembling the flex program

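%{
#include <stdio.h>    /* for printf, fopen */
#include <stdlib.h>   /* for atoi, atof */
%}

<<flex text we gave goes here>>

%%
int main(int argc, char **argv)
{
    if (argc > 1)
        yyin = fopen(argv[1], "r");   /* with no file, flex reads stdin */
    yylex();
    return 0;
}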


Running flex

flex is an executable program
    The input is lexical grammar as described
    The output is C source for the scanner (lex.yy.c by default), to be compiled
For Ada fans
    Look at aflex (www.adapower.com)
For C++ fans
    flex can run in C++ mode
        Generates appropriate classes


Choice Between Methods?

Hand written scanners
    Typically much faster execution
    Easy to write (standard structure)
    Preferable for good error recovery
Flex approach
    Simple to use
    Easy to modify token language


The GNAT Scanner

Hand written (scn.adb/scn.ads)
Each call does:
    Optimal scan past blanks/comments etc.
    Processing based on first character
    Call special routines for major classes:
        Namet.Get_Name for identifier (hashing)
        Keywords recognized by special hash
        Strings (scn-slit.adb):
            complication with "+", "and", etc. (string or operator?)
        Numeric literals (scn-nlit.adb):
            complication with based literals: 16#FFF#


Historical oddities

Because early keypunch machines were unreliable, FORTRAN treats blanks as optional: lexical analysis and parsing are intertwined.
    DO10I=1.6    3 tokens: identifier operator literal
        DO10I = 1.6
    DO10I=1,6    7 tokens: keyword stmt id operator literal comma literal
        DO 10 I = 1 , 6
Celebrated NASA failure caused by this bug (?)