
    CS 352 Compilers: Principles and Practice

1. Introduction

2. Lexical analysis

3. LL parsing

4. LR parsing

5. JavaCC and JTB

6. Semantic analysis

7. Translation and simplification

8. Liveness analysis and register allocation

9. Activation Records

    Chapter 1: Introduction


    Things to do

    make sure you have a working mentor account

    start brushing up on Java

review Java development tools; see http://www.cs.purdue.edu/homes/palsberg/cs352/F00/index.html

    add yourself to the course mailing list by writing (on a CS computer)

    mailer add me to cs352

Copyright © 2000 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected].


    Compilers

    What is a compiler?

a program that translates an executable program in one language into an executable program in another language

we expect the program produced by the compiler to be better, in some way, than the original

    What is an interpreter?

a program that reads an executable program and produces the results

    of running that program

    usually, this involves executing the source program in some fashion

    This course deals mainly with compilers

Many of the same issues arise in interpreters

    Motivation

    Why study compiler construction?

    Why build compilers?

    Why attend class?


    Interest

    Compiler construction is a microcosm of computer science

artificial intelligence: greedy algorithms, learning algorithms

algorithms: graph algorithms, union-find, dynamic programming

theory: DFAs for scanning, parser generators, lattice theory for analysis

systems: allocation and naming, locality, synchronization

architecture: pipeline management, hierarchy management, instruction set use

Inside a compiler, all these things come together

Isn't it a solved problem?

    Machines are constantly changing

Changes in architecture ⇒ changes in compilers

    new features pose new problems

    changing costs lead to different concerns

    old solutions need re-engineering

    Changes in compilers should prompt changes in architecture

    New languages and features


    Intrinsic Merit

    Compiler construction is challenging and fun

    interesting problems

    primary responsibility for performance (blame)

new architectures ⇒ new challenges

    real results

    extremely complex interactions

    Compilers have an impact on how computers are used

Compiler construction poses some of the most interesting problems in computing

    Experience

    You have used several compilers

    What qualities are important in a compiler?

    1. Correct code

    2. Output runs fast

    3. Compiler runs fast

    4. Compile time proportional to program size

5. Support for separate compilation

6. Good diagnostics for syntax errors

    7. Works well with the debugger

    8. Good diagnostics for flow anomalies

    9. Cross language calls

    10. Consistent, predictable optimization

Each of these shapes your feelings about the correct contents of this course

    Abstract view

[diagram: source code → compiler → machine code, with errors reported]

    Implications:

    recognize legal (and illegal) programs

    generate correct code

    manage storage of all variables and code

    agreement on format for object (or assembly) code

Big step up from assembler: higher level notations

    Traditional two pass compiler

[diagram: source code → front end → IR → back end → machine code, with errors reported]

    Implications:

    intermediate representation (IR)

    front end maps legal code into IR

    back end maps IR onto target machine

    simplify retargeting

    allows multiple front ends

multiple passes ⇒ better code

    A fallacy

[diagram: front ends for FORTRAN, C++, CLU, and Smalltalk code feeding back ends for target1, target2, and target3]

Can we build n × m compilers with n + m components?

    must encode all the knowledge in each front end

    must represent all the features in one IR

    must handle all the features in each back end

Limited success with low-level IRs

    Front end

[diagram: source code → scanner → tokens → parser → IR, with errors reported]

    Scanner:

maps characters into tokens, the basic unit of syntax

x = x + y becomes <id,x> = <id,x> + <id,y>

character string value for a token is a lexeme

typical tokens: number, id, +, -, *, /, do, end

    eliminates white space (tabs, blanks, comments)

    a key issue is speed

use specialized recognizer (as opposed to lex)

    Front end

[diagram: source code → scanner → tokens → parser → IR, with errors reported]

    Parser:

    recognize context-free syntax

    guide context-sensitive analysis

    construct IR(s)

    produce meaningful error messages

    attempt error correction

Parser generators mechanize much of the work

    Front end

    Context-free syntax is specified with a grammar

<sheep noise> ::= baa <sheep noise>
               |  baa

This grammar defines the set of noises that a sheep makes under normal circumstances

The format is called Backus-Naur form (BNF)

Formally, a grammar G = (S, N, T, P)

S is the start symbol

N is a set of non-terminal symbols

T is a set of terminal symbols

P is a set of productions or rewrite rules (P : N → (N ∪ T)*)


    Front end

Context-free syntax can be put to better use

1 <goal> ::= <expr>
2 <expr> ::= <expr> <op> <term>
3         |  <term>
4 <term> ::= number
5         |  id
6 <op>   ::= +
7         |  -

This grammar defines simple expressions with addition and subtraction over the tokens number and id

S = <goal>
T = { number, id, +, - }
N = { <goal>, <expr>, <term>, <op> }
P = { 1, 2, 3, 4, 5, 6, 7 }


    Front end

Given a grammar, valid sentences can be derived by repeated substitution.

Prod'n  Result
–       <goal>
1       <expr>
2       <expr> <op> <term>
5       <expr> <op> y
7       <expr> - y
2       <expr> <op> <term> - y
4       <expr> <op> 2 - y
6       <expr> + 2 - y
3       <term> + 2 - y
5       x + 2 - y

To recognize a valid sentence in some CFG, we reverse this process and build up a parse


    Front end

A parse can be represented by a tree, called a parse or syntax tree

[parse tree for x + 2 - y, with interior nodes <goal>, <expr>, <op>, and <term>, and leaves x, +, 2, -, y]

    Front end

    So, compilers often use an abstract syntax tree

[abstract syntax tree for x + 2 - y: a - node whose children are a + node (over x and 2) and y]

    Back end

[diagram: IR → instruction selection → register allocation → machine code, with errors reported]

    Responsibilities

    translate IR into target machine code

    choose instructions for each IR operation

    decide what to keep in registers at each point

    ensure conformance with system interfaces

Automation has been less successful here

    Back end

[diagram: IR → instruction selection → register allocation → machine code, with errors reported]

    Instruction selection:

    produce compact, fast code

    use available addressing modes

    pattern matching problem

ad hoc techniques

tree pattern matching

string pattern matching

dynamic programming


    Back end

[diagram: IR → instruction selection → register allocation → machine code, with errors reported]

    Register Allocation:

    have value in a register when used

    limited resources

    changes instruction choices

    can move loads and stores

    optimal allocation is difficult

Modern allocators often use an analogy to graph coloring

    Traditional three pass compiler

[diagram: source code → front end → IR → middle end → IR → back end → machine code, with errors reported]

    Code Improvement

    analyzes and changes IR

    goal is to reduce runtime

    must preserve values


    The Tiger compiler

[diagram: the phases of the Tiger compiler (Passes 1-10): Lex, Parse with Parsing Actions, Semantic Analysis, Frame Layout, Translate, Canonicalize, Instruction Selection, Control Flow Analysis, Data Flow Analysis, Register Allocation, Code Emission, Assembler, Linker; connected by the intermediate forms Source Program, Tokens, Reductions, Abstract Syntax, IR Trees, Assem, Flow Graph, Interference Graph, Register Assignment, Assembly Language, Relocatable Object Code, Machine Language, with Environments and Frame tables used along the way]

    The Tiger compiler phases

Lex: Break source file into individual words, or tokens

Parse: Analyse the phrase structure of program

Parsing Actions: Build a piece of abstract syntax tree for each phrase

Semantic Analysis: Determine what each phrase means, relate uses of variables to their definitions, check types of expressions, request translation of each phrase

Frame Layout: Place variables, function parameters, etc., into activation records (stack frames) in a machine-dependent way

Translate: Produce intermediate representation trees (IR trees), a notation that is not tied to any particular source language or target machine

Canonicalize: Hoist side effects out of expressions, and clean up conditional branches, for convenience of later phases

Instruction Selection: Group IR-tree nodes into clumps that correspond to actions of target-machine instructions

Control Flow Analysis: Analyse sequence of instructions into control flow graph showing all possible flows of control the program might follow when it runs

Data Flow Analysis: Gather information about flow of data through variables of program; e.g., liveness analysis calculates places where each variable holds a still-needed (live) value

Register Allocation: Choose registers for variables and temporary values; variables not simultaneously live can share same register

Code Emission: Replace temporary names in each machine instruction with registers

A straight-line programming language

    A straight-line programming language (no loops or conditionals):

Stm → Stm ; Stm           (CompoundStm)
Stm → id := Exp           (AssignStm)
Stm → print ( ExpList )   (PrintStm)
Exp → id                  (IdExp)
Exp → num                 (NumExp)
Exp → Exp Binop Exp       (OpExp)
Exp → ( Stm , Exp )       (EseqExp)
ExpList → Exp , ExpList   (PairExpList)
ExpList → Exp             (LastExpList)
Binop → +                 (Plus)
Binop → -                 (Minus)
Binop → *                 (Times)
Binop → /                 (Div)

e.g., a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b)

prints:

8 7
80


    Tree representation

a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b)

[tree: a CompoundStm whose left child is AssignStm(a, OpExp(NumExp 5, Plus, NumExp 3)) and whose right child is CompoundStm(AssignStm(b, EseqExp(PrintStm(PairExpList(IdExp a, LastExpList(OpExp(IdExp a, Minus, NumExp 1)))), OpExp(NumExp 10, Times, IdExp a))), PrintStm(LastExpList(IdExp b)))]

    This is a convenient internal representation for a compiler to use.


    Java classes for trees

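The code listing for this slide was lost in extraction. The classes follow the design in Appel's Modern Compiler Implementation in Java; a representative sketch, one class per grammar constructor:

    public abstract class Stm {}
    public class CompoundStm extends Stm {
        public Stm stm1, stm2;
        public CompoundStm(Stm s1, Stm s2) { stm1 = s1; stm2 = s2; }
    }
    public class AssignStm extends Stm {
        public String id; public Exp exp;
        public AssignStm(String i, Exp e) { id = i; exp = e; }
    }
    public class PrintStm extends Stm {
        public ExpList exps;
        public PrintStm(ExpList e) { exps = e; }
    }
    public abstract class Exp {}
    public class IdExp extends Exp {
        public String id;
        public IdExp(String i) { id = i; }
    }
    public class NumExp extends Exp {
        public int num;
        public NumExp(int n) { num = n; }
    }
    public class OpExp extends Exp {
        public static final int Plus = 1, Minus = 2, Times = 3, Div = 4;
        public Exp left, right; public int oper;
        public OpExp(Exp l, int o, Exp r) { left = l; oper = o; right = r; }
    }
    public class EseqExp extends Exp {
        public Stm stm; public Exp exp;
        public EseqExp(Stm s, Exp e) { stm = s; exp = e; }
    }
    public abstract class ExpList {}
    public class PairExpList extends ExpList {
        public Exp head; public ExpList tail;
        public PairExpList(Exp h, ExpList t) { head = h; tail = t; }
    }
    public class LastExpList extends ExpList {
        public Exp head;
        public LastExpList(Exp h) { head = h; }
    }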


    Chapter 2: Lexical Analysis


    Scanner


[diagram: source code → scanner → tokens → parser → IR, with errors reported]

maps characters into tokens, the basic unit of syntax

x = x + y becomes <id,x> = <id,x> + <id,y>

character string value for a token is a lexeme

typical tokens: number, id, +, -, *, /, do, end

eliminates white space (tabs, blanks, comments)

a key issue is speed; use specialized recognizer (as opposed to lex)


    Specifying patterns


A scanner must recognize various parts of the language's syntax

Some parts are easy:

white space
  <ws> ::= <ws> ' '
        |  <ws> '\t'
        |  ' '
        |  '\t'

keywords and operators
  specified as literal patterns: do, end

comments
  opening and closing delimiters: /* ... */


    Specifying patterns


A scanner must recognize various parts of the language's syntax

    Other parts are much harder:

identifiers
  alphabetic followed by k alphanumerics (_, $, &, ...)

numbers
  integers: 0 or digit from 1-9 followed by digits from 0-9
  decimals: integer . digits from 0-9
  reals: (integer or decimal) E (+ or -) digits from 0-9
  complex: ( real , real )

We need a powerful notation to specify these patterns

    Operations on languages


Operation                   Definition
union of L and M            L ∪ M = { s | s ∈ L or s ∈ M }
concatenation of L and M    LM = { st | s ∈ L and t ∈ M }
Kleene closure of L         L* = ∪(i≥0) L^i
positive closure of L       L+ = ∪(i≥1) L^i

    Regular expressions


    Patterns are often specified as regular languages

Notations used to describe a regular language (or a regular set) include both regular expressions and regular grammars

Regular expressions (over an alphabet Σ):

1. ε is a RE denoting the set {ε}
2. if a ∈ Σ, then a is a RE denoting {a}
3. if r and s are REs, denoting L(r) and L(s), then:
     (r) is a RE denoting L(r)
     (r) | (s) is a RE denoting L(r) ∪ L(s)
     (r)(s) is a RE denoting L(r)L(s)
     (r)* is a RE denoting (L(r))*

If we adopt a precedence for operators, the extra parentheses can go away. We assume closure, then concatenation, then alternation as the order of precedence.

    Examples


identifier
  letter → (a | b | c | ... | z | A | B | C | ... | Z)
  digit  → (0 | 1 | 2 | ... | 9)
  id     → letter (letter | digit)*

numbers
  integer → (+ | - | ε)(0 | (1 | 2 | 3 | ... | 9) digit*)
  decimal → integer . digit*
  real    → (integer | decimal) E (+ | - | ε) digit*
  complex → ( real , real )

Numbers can get much more complicated

Most programming language tokens can be described with REs

We can use REs to build scanners automatically
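For instance, the id and integer patterns above transcribe directly into Java's regex notation (a small illustrative sketch, not part of the original notes):

    import java.util.regex.Pattern;

    public class TokenPatterns {
        // id → letter (letter | digit)*
        static final Pattern ID = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
        // integer → (+ | - | ε)(0 | (1..9) digit*)
        static final Pattern INTEGER = Pattern.compile("[+-]?(0|[1-9][0-9]*)");

        public static void main(String[] args) {
            System.out.println(ID.matcher("x1").matches());        // true
            System.out.println(INTEGER.matcher("-42").matches());  // true
            System.out.println(INTEGER.matcher("042").matches());  // false: leading zero
        }
    }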

    Algebraic properties of REs


Axiom                       Description
r | s = s | r               | is commutative
r | (s | t) = (r | s) | t   | is associative
(rs)t = r(st)               concatenation is associative
r(s | t) = rs | rt          concatenation distributes over |
(s | t)r = sr | tr
εr = r,  rε = r             ε is the identity for concatenation
r* = (r | ε)*               relation between * and ε
(r*)* = r*                  * is idempotent

    Examples


Let Σ = {a, b}

1. a | b denotes {a, b}

2. (a | b)(a | b) denotes {aa, ab, ba, bb}
   i.e., (a | b)(a | b) = aa | ab | ba | bb

3. a* denotes {ε, a, aa, aaa, ...}

4. (a | b)* denotes the set of all strings of a's and b's (including ε)
   i.e., (a | b)* = (a* b*)*

5. a | a*b denotes {a, b, ab, aab, aaab, aaaab, ...}

    Recognizers


From a regular expression we can construct a deterministic finite automaton (DFA)

Recognizer for identifier:

[DFA: state 0 on letter → state 1; state 1 on letter or digit → state 1; state 1 on other → state 2 (accept); state 0 on digit or other → state 3 (error)]

identifier
  letter → (a | b | c | ... | z | A | B | C | ... | Z)
  digit  → (0 | 1 | 2 | ... | 9)
  id     → letter (letter | digit)*


    Code for the recognizer

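The code on this slide was lost in extraction. The table-driven recognizer it presents (cf. the two tables on the next slide) looks roughly like this Java sketch; the method names and the padding convention are illustrative assumptions:

    public class IdRecognizer {
        // character classes: 0 = letter, 1 = digit, 2 = other
        static int charClass(char c) {
            if (Character.isLetter(c)) return 0;
            if (Character.isDigit(c))  return 1;
            return 2;
        }
        // nextState[class][state]; states: 0 start, 1 in-identifier,
        // 2 accept, 3 error (unused entries encoded as -1)
        static final int[][] nextState = {
            { 1, 1, -1, -1 },   // letter
            { 3, 1, -1, -1 },   // digit
            { 3, 2, -1, -1 },   // other
        };

        /** Returns the identifier lexeme at the front of s, or null. */
        static String scan(String s) {
            int state = 0, i = 0;
            StringBuilder lexeme = new StringBuilder();
            while (state == 0 || state == 1) {
                char c = i < s.length() ? s.charAt(i) : ' ';   // pad with 'other'
                state = nextState[charClass(c)][state];
                if (state == 1) { lexeme.append(c); i++; }
            }
            return state == 2 ? lexeme.toString() : null;      // 2 = accept
        }

        public static void main(String[] args) {
            System.out.println(scan("x1 = y"));   // prints x1
            System.out.println(scan("9abc"));     // prints null
        }
    }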

    Tables for the recognizer


    Two tables control the recognizer

char:   a..z    A..Z    0..9   other
value:  letter  letter  digit  other

class    0  1  2  3
letter   1  1  –  –
digit    3  1  –  –
other    3  2  –  –

To change languages, we can just change tables

    Automatic construction


Scanner generators automatically construct code from regular expression-like descriptions

construct a DFA

use state minimization techniques

emit code for the scanner (table-driven or direct code)

A key issue in automation is an interface to the parser

lex is a scanner generator supplied with UNIX

emits C code for scanner

provides macro definitions for each token (used in the parser)


    Grammars for regular languages


Can we place a restriction on the form of a grammar to ensure that it describes a regular language?

Provable fact:

For any RE r, there is a grammar g such that L(r) = L(g).

The grammars that generate regular sets are called regular grammars

Definition:

In a regular grammar, all productions have one of two forms:

1. A → aA
2. A → a

where A is any non-terminal and a is any terminal symbol

These are also called type 3 grammars (Chomsky)

    More regular languages


Example: the set of strings containing an even number of zeros and an even number of ones

[four-state DFA: start/accept state s0; 1s toggle s0 ↔ s1 and s2 ↔ s3; 0s toggle s0 ↔ s2 and s1 ↔ s3]

The RE is (00 | 11)* ((01 | 10)(00 | 11)* (01 | 10)(00 | 11)*)*

    More regular expressions


What about the RE (a | b)* abb ?

[NFA: s0 on a or b → s0; s0 on a → s1; s1 on b → s2; s2 on b → s3 (accept)]

State s0 has multiple transitions on a!

⇒ nondeterministic finite automaton

      a           b
s0    {s0, s1}    {s0}
s1    –           {s2}
s2    –           {s3}

    Finite automata


A non-deterministic finite automaton (NFA) consists of:

1. a set of states S = { s0, ..., sn }

2. a set of input symbols Σ (the alphabet)

3. a transition function move mapping state-symbol pairs to sets of states

4. a distinguished start state s0

5. a set of distinguished accepting or final states F

A Deterministic Finite Automaton (DFA) is a special case of an NFA:

1. no state has an ε-transition, and

2. for each state s and input symbol a, there is at most one edge labelled a leaving s.

A DFA accepts x iff. there exists a unique path through the transition graph from s0 to an accepting state such that the labels along the edges spell x.

    DFAs and NFAs are equivalent


    1. DFAs are clearly a subset of NFAs

2. Any NFA can be converted into a DFA, by simulating sets of simultaneous states:

each DFA state corresponds to a set of NFA states

possible exponential blowup


    NFA to DFA using the subset construction: example 1


[the NFA for (a | b)* abb, as above]

            a           b
{s0}        {s0, s1}    {s0}
{s0, s1}    {s0, s1}    {s0, s2}
{s0, s2}    {s0, s1}    {s0, s3}
{s0, s3}    {s0, s1}    {s0}

[resulting DFA over states {s0}, {s0,s1}, {s0,s2}, {s0,s3}: every a edge leads to {s0,s1}; b edges step {s0,s1} → {s0,s2} → {s0,s3} (accept) and otherwise return toward {s0}]

    Constructing a DFA from a regular expression


RE → NFA w/ ε moves → DFA → minimized DFA

RE → NFA w/ ε moves: build NFA for each term; connect them with ε moves

NFA w/ ε moves to DFA: construct the simulation, i.e., the subset construction

DFA → minimized DFA: merge compatible states

DFA → RE: construct R^k_ij = R^(k-1)_ik (R^(k-1)_kk)* R^(k-1)_kj ∪ R^(k-1)_ij

    RE to NFA


[Thompson's construction: N(ε) and N(a) are two-state NFAs with a single ε or a edge; N(AB) chains N(A) into N(B); N(A | B) adds a new start and final state with ε edges around N(A) and N(B); N(A*) wraps N(A) with ε edges that allow it to be repeated or bypassed]

    RE to NFA: example


(a | b)* abb

[NFA for a | b, states 1-6: ε edges from 1 to 2 and 4; 2 goes to 3 on a, 4 goes to 5 on b; 3 and 5 join at 6 by ε]

[NFA for (a | b)*, states 0-7: ε edges from 0 to 1 and 7; 6 loops back to 1 and exits to 7 by ε]

[NFA for abb, states 7-10: 7 to 8 on a, 8 to 9 on b, 9 to 10 on b; 10 accepting]

[concatenating these gives the NFA for (a | b)* abb with states 0-10]

    NFA to DFA: the subset construction


Input: NFA N. Output: a DFA D with states Dstates and transitions Dtrans such that L(D) = L(N)

Method: let s be a state in N and T be a set of states, and use the following operations:

Operation      Definition
ε-closure(s)   set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(T)   set of NFA states reachable from some NFA state s in T on ε-transitions alone
move(T, a)     set of NFA states to which there is a transition on input symbol a from some NFA state s in T

add state T = ε-closure(s0) unmarked to Dstates
while ∃ unmarked state T in Dstates
    mark T
    for each input symbol a
        U = ε-closure(move(T, a))
        if U ∉ Dstates then add U to Dstates unmarked
        Dtrans[T, a] = U
    endfor
endwhile

ε-closure(s0) is the start state of D
A state of D is accepting if it contains at least one accepting state in N
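The algorithm translates almost line for line into code. A minimal Java sketch, assuming an NFA encoded as transition maps over integer states (the names Nfa, move, eClosure, and subsetConstruction are mine, not from the notes):

    import java.util.*;

    class Nfa {
        int start;
        Set<Integer> accepting = new HashSet<>();
        // edges.get(s).get(c) = states reachable from s on symbol c; '\0' encodes an ε edge
        Map<Integer, Map<Character, Set<Integer>>> edges = new HashMap<>();

        // move(T, a): NFA states reachable from some state in T on input a
        Set<Integer> move(Set<Integer> T, char a) {
            Set<Integer> out = new HashSet<>();
            for (int s : T)
                out.addAll(edges.getOrDefault(s, Map.of()).getOrDefault(a, Set.of()));
            return out;
        }

        // ε-closure(T): NFA states reachable from T on ε-transitions alone
        Set<Integer> eClosure(Set<Integer> T) {
            Deque<Integer> work = new ArrayDeque<>(T);
            Set<Integer> closure = new HashSet<>(T);
            while (!work.isEmpty())
                for (int t : edges.getOrDefault(work.pop(), Map.of()).getOrDefault('\0', Set.of()))
                    if (closure.add(t)) work.push(t);
            return closure;
        }

        // Each DFA state is the set of NFA states it simulates; a DFA state
        // is accepting if it contains an accepting NFA state.
        Map<Set<Integer>, Map<Character, Set<Integer>>> subsetConstruction(Set<Character> alphabet) {
            Map<Set<Integer>, Map<Character, Set<Integer>>> dtrans = new HashMap<>();
            Deque<Set<Integer>> unmarked = new ArrayDeque<>();
            Set<Integer> d0 = eClosure(Set.of(start));       // start state of D
            dtrans.put(d0, new HashMap<>());
            unmarked.push(d0);
            while (!unmarked.isEmpty()) {                    // while ∃ unmarked T
                Set<Integer> T = unmarked.pop();             // mark T
                for (char a : alphabet) {
                    Set<Integer> U = eClosure(move(T, a));
                    if (!dtrans.containsKey(U)) {            // U not yet in Dstates
                        dtrans.put(U, new HashMap<>());
                        unmarked.push(U);
                    }
                    dtrans.get(T).put(a, U);                 // Dtrans[T, a] = U
                }
            }
            return dtrans;
        }
    }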

    NFA to DFA using subset construction: example 2


[the NFA for (a | b)* abb from the previous slide, states 0-10]

A = {0, 1, 2, 4, 7}          D = {1, 2, 4, 5, 6, 7, 9}
B = {1, 2, 3, 4, 6, 7, 8}    E = {1, 2, 4, 5, 6, 7, 10}
C = {1, 2, 4, 5, 6, 7}

    a  b
A   B  C
B   B  D
C   B  C
D   B  E
E   B  C

    Limits of regular languages

    Not all languages are regular


    One cannot construct DFAs to recognize these languages:

L = { p^k q^k }

L = { wcw^r | w ∈ Σ* }

    Note: neither of these is a regular expression!

    (DFAs cannot count!)

But, this is a little subtle. One can construct DFAs for:

alternating 0s and 1s: (ε | 1)(01)*(ε | 0)

sets of pairs of 0s and 1s: (01 | 10)+


    So what is hard?

    Language features that can cause problems:


reserved words
  PL/I had no reserved words

significant blanks
  FORTRAN and Algol68 ignore blanks

string constants
  special characters in strings: newline, tab, quote, comment delimiters

finite closures
  some languages limit identifier lengths; adds states to count length
  FORTRAN 66: 6 characters

These can be swept under the rug in the language design


    How bad can it get?



    Chapter 3: LL Parsing


    The role of the parser


[diagram: source code → scanner → tokens → parser → IR, with errors reported]

    Parser

    performs context-free syntax analysis

    guides context-sensitive analysis

constructs an intermediate representation

produces meaningful error messages

    attempts error correction

For the next few weeks, we will look at parser construction


    Syntax analysis


    Context-free syntax is specified with a context-free grammar.

Formally, a CFG G is a 4-tuple (Vt, Vn, S, P), where:

Vt is the set of terminal symbols in the grammar. For our purposes, Vt is the set of tokens returned by the scanner.

Vn, the nonterminals, is a set of syntactic variables that denote sets of (sub)strings occurring in the language. These are used to impose a structure on the grammar.

S is a distinguished nonterminal (S ∈ Vn) denoting the entire set of strings in L(G). This is sometimes called a goal symbol.

P is a finite set of productions specifying how terminals and non-terminals can be combined to form strings in the language. Each production must have a single non-terminal on its left hand side.

The set V = Vt ∪ Vn is called the vocabulary of G


    Notation and terminology


a, b, c, ... ∈ Vt
A, B, C, ... ∈ Vn
U, V, W, ... ∈ V
α, β, γ, ... ∈ V*
u, v, w, ... ∈ Vt*

If A → γ then αAβ ⇒ αγβ is a single-step derivation using A → γ

Similarly, ⇒* and ⇒+ denote derivations of ≥ 0 and ≥ 1 steps

If S ⇒* β then β is said to be a sentential form of G

L(G) = { w ∈ Vt* | S ⇒+ w }; w ∈ L(G) is called a sentence of G

Note, L(G) = { β ∈ V* | S ⇒* β } ∩ Vt*

    Syntax analysis

    Grammars are often written in Backus-Naur form (BNF).


Example:

1 <goal> ::= <expr>
2 <expr> ::= <expr> <op> <expr>
3         |  num
4         |  id
5 <op>   ::= +
6         |  -
7         |  *
8         |  /

This describes simple expressions over numbers and identifiers.

In a BNF for a grammar, we represent

1. non-terminals with angle brackets or capital letters
2. terminals with typewriter font or underline
3. productions as in the example


    Scanning vs. parsing

    Where do we draw the line?


term ::= [a-zA-Z] ([a-zA-Z] | [0-9])*
      |  0 | [1-9][0-9]*
op   ::= + | - | * | /
expr ::= (term op)* term

Regular expressions are used to classify:

identifiers, numbers, keywords

REs are more concise and simpler for tokens than a grammar

more efficient scanners can be built from REs (DFAs) than grammars

Context-free grammars are used to count:

brackets: (), begin ... end, if ... then ... else

imparting structure: expressions

Syntactic analysis is complicated enough: grammar for C has around 200 productions. Factoring out lexical analysis as a separate phase makes compiler more manageable.


    Derivations


    We can view the productions of a CFG as rewriting rules.

Using our example CFG:

<goal> ⇒ <expr>
       ⇒ <expr> <op> <expr>
       ⇒ <id,x> <op> <expr>
       ⇒ <id,x> + <expr>
       ⇒ <id,x> + <expr> <op> <expr>
       ⇒ <id,x> + <num,2> <op> <expr>
       ⇒ <id,x> + <num,2> * <expr>
       ⇒ <id,x> + <num,2> * <id,y>

We have derived the sentence x + 2 * y.
We denote this: <goal> ⇒* id + num * id.

Such a sequence of rewrites is a derivation or a parse.

The process of discovering a derivation is called parsing.


    Derivations


At each step, we choose a non-terminal to replace.

    This choice can lead to different derivations.

    Two are of particular interest:

leftmost derivation: the leftmost non-terminal is replaced at each step

rightmost derivation: the rightmost non-terminal is replaced at each step

    The previous example was a leftmost derivation.


    Rightmost derivation


For the string x + 2 * y:

<goal> ⇒ <expr>
       ⇒ <expr> <op> <expr>
       ⇒ <expr> <op> <id,y>
       ⇒ <expr> * <id,y>
       ⇒ <expr> <op> <expr> * <id,y>
       ⇒ <expr> <op> <num,2> * <id,y>
       ⇒ <expr> + <num,2> * <id,y>
       ⇒ <id,x> + <num,2> * <id,y>

Again, <goal> ⇒* id + num * id.


    Precedence


[parse tree for x + 2 * y in which <expr> <op> <expr> splits at *, grouping (x + 2) * y]

Treewalk evaluation computes (x + 2) * y

the wrong answer!

Should be x + (2 * y)


    Precedence

    These two derivations point out a problem with the grammar.


It has no notion of precedence, or implied order of evaluation.

To add precedence takes additional machinery:

1 <goal> ::= <expr>
2 <expr> ::= <expr> + <term>
3         |  <expr> - <term>
4         |  <term>
5 <term> ::= <term> * <factor>
6         |  <term> / <factor>
7         |  <factor>
8 <factor> ::= num
9           |  id

This grammar enforces a precedence on the derivation:

terms must be derived from expressions

forces the correct tree


    Precedence


Now, for the string x + 2 * y:

<goal> ⇒ <expr>
       ⇒ <expr> + <term>
       ⇒ <expr> + <term> * <factor>
       ⇒ <expr> + <term> * <id,y>
       ⇒ <expr> + <factor> * <id,y>
       ⇒ <expr> + <num,2> * <id,y>
       ⇒ <term> + <num,2> * <id,y>
       ⇒ <factor> + <num,2> * <id,y>
       ⇒ <id,x> + <num,2> * <id,y>

Again, <goal> ⇒* id + num * id, but this time, we build the desired tree.


    Precedence


[parse tree under the precedence grammar: <goal> → <expr> → <expr> + <term>; the left <expr> derives x through <term> and <factor>, and the right <term> derives <term> * <factor>, covering 2 * y]

Treewalk evaluation computes x + (2 * y)


    Ambiguity

    If a grammar has more than one derivation for a single sentential form,


    then it is ambiguous

Example:

<stmt> ::= if <expr> then <stmt>
        |  if <expr> then <stmt> else <stmt>
        |  ...

Consider deriving the sentential form:

if E1 then if E2 then S1 else S2

It has two derivations.

This ambiguity is purely grammatical.

It is a context-free ambiguity.


    Ambiguity


May be able to eliminate ambiguities by rearranging the grammar:

<stmt>      ::= <matched>
             |  <unmatched>
<matched>   ::= if <expr> then <matched> else <matched>
             |  ...
<unmatched> ::= if <expr> then <stmt>
             |  if <expr> then <matched> else <unmatched>

This generates the same language as the ambiguous grammar, but applies the common sense rule:

match each else with the closest unmatched then

This is most likely the language designer's intent.


    Ambiguity

    Ambiguity is often due to confusion in the context-free specification.


    Context-sensitive confusions can arise from overloading.

Example:

In many Algol-like languages, an expression like a(17) could be a function call or a subscripted variable reference.

Disambiguating this statement requires context:

need values of declarations

not context-free

really an issue of type

    Rather than complicate parsing, we will handle this separately.


    Parsing: the big picture


[diagram: grammar → parser generator → parser code; tokens → parser → IR]

    Our goal is a flexible parser generator system


    Top-down versus bottom-up

    Top-down parsers


    start at the root of derivation tree and fill in

    picks a production and tries to match the input

    may require backtracking

    some grammars are backtrack-free (predictive)

    Bottom-up parsers

    start at the leaves and fill in

    start in a state valid for legal first tokens

    as input is consumed, change state to encode possibilities

    (recognize valid prefixes)

    use a stack to store both state and sentential forms


    Top-down parsing

A top-down parser starts with the root of the parse tree, labelled with the start or goal symbol of the grammar.

To build a parse, it repeats the following steps until the fringe of the parse tree matches the input string

1. At a node labelled A, select a production A → α and construct the appropriate child for each symbol of α

2. When a terminal is added to the fringe that doesn't match the input string, backtrack

3. Find the next node to be expanded (must have a label in Vn)

The key is selecting the right production in step 1

should be guided by input string


    Simple expression grammar

    Recall our grammar for simple expressions:


1 <goal> ::= <expr>
2 <expr> ::= <expr> + <term>
3         |  <expr> - <term>
4         |  <term>
5 <term> ::= <term> * <factor>
6         |  <term> / <factor>
7         |  <factor>
8 <factor> ::= num
9           |  id

Consider the input string x - 2 * y


    Example

Another possible parse for x - 2 * y:

Prod'n  Sentential form            Input
–       <goal>                     ↑x - 2 * y
1       <expr>                     ↑x - 2 * y
2       <expr> + <term>            ↑x - 2 * y
2       <expr> + <term> + <term>   ↑x - 2 * y
2       <expr> + <term> + ...      ↑x - 2 * y
2       ...                        ↑x - 2 * y

If the parser makes the wrong choices, expansion doesn't terminate. This isn't a good property for a parser to have.

(Parsers should terminate!)


    Left-recursion

    Top-down parsers cannot handle left-recursion in a grammar


Formally, a grammar is left-recursive if

∃ A ∈ Vn such that A ⇒+ Aα for some string α

    Our simple expression grammar is left-recursive


    Eliminating left-recursion

    To remove left-recursion, we can transform the grammar


Consider the grammar fragment:

<foo> ::= <foo> α
       |  β

where α and β do not start with <foo>

We can rewrite this as:

<foo> ::= β <bar>
<bar> ::= α <bar>
       |  ε

where <bar> is a new non-terminal

    This fragment contains no left-recursion


    Example

    Our expression grammar contains two cases of left-recursion

<expr> ::= <expr> + <term>
        |  <expr> - <term>
        |  <term>
<term> ::= <term> * <factor>
        |  <term> / <factor>
        |  <factor>

Applying the transformation gives

<expr>  ::= <term> <expr'>
<expr'> ::= + <term> <expr'>
         |  - <term> <expr'>
         |  ε
<term>  ::= <factor> <term'>
<term'> ::= * <factor> <term'>
         |  / <factor> <term'>
         |  ε

With this grammar, a top-down parser will

terminate

backtrack on some inputs


    Example

    This cleaner grammar defines the same language


1 <goal> ::= <expr>
2 <expr> ::= <term> + <expr>
3         |  <term> - <expr>
4         |  <term>
5 <term> ::= <factor> * <term>
6         |  <factor> / <term>
7         |  <factor>
8 <factor> ::= num
9           |  id

It is

right-recursive

free of ε productions

Unfortunately, it generates different associativity

Same syntax, different meaning


    Example

    Our long-suffering expression grammar:


1  <goal>  ::= <expr>
2  <expr>  ::= <term> <expr'>
3  <expr'> ::= + <term> <expr'>
4           |  - <term> <expr'>
5           |  ε
6  <term>  ::= <factor> <term'>
7  <term'> ::= * <factor> <term'>
8           |  / <factor> <term'>
9           |  ε
10 <factor> ::= num
11           |  id

Recall, we factored out left-recursion


    How much lookahead is needed?

    We saw that top-down parsers may need to backtrack when they select the

    wrong production


Do we need arbitrary lookahead to parse CFGs?

in general, yes

use the Earley or Cocke-Younger-Kasami algorithms
(Aho, Hopcroft, and Ullman, Problem 2.34; Parsing, Translation and Compiling, Chapter 4)

Fortunately

large subclasses of CFGs can be parsed with limited lookahead

most programming language constructs can be expressed in a grammar that falls in these subclasses

Among the interesting subclasses are:

LL(1): left to right scan, left-most derivation, 1-token lookahead; and

LR(1): left to right scan, right-most derivation, 1-token lookahead


    Predictive parsing

    Basic idea:


For any two productions A → α | β, we would like a distinct way of choosing the correct production to expand.

For some RHS α ∈ G, define FIRST(α) as the set of tokens that appear first in some string derived from α. That is, for some w ∈ Vt*, w ∈ FIRST(α) iff. α ⇒* wγ.

Key property:

Whenever two productions A → α and A → β both appear in the grammar, we would like

FIRST(α) ∩ FIRST(β) = ∅

This would allow the parser to make a correct choice with a lookahead of only one symbol!

The example grammar has this property!


    Left factoring

    What if a grammar does not have this property?


Sometimes, we can transform a grammar to have this property.

For each non-terminal A find the longest prefix α common to two or more of its alternatives.

If α ≠ ε then replace all of the A productions

A → α β1 | α β2 | ... | α βn

with

A  → α A'
A' → β1 | β2 | ... | βn

where A' is a new non-terminal.

Repeat until no two alternatives for a single non-terminal have a common prefix.


    Example

    Consider a right-recursiveversion of the expression grammar:


1 <goal> ::= <expr>
2 <expr> ::= <term> + <expr>
3         |  <term> - <expr>
4         |  <term>
5 <term> ::= <factor> * <term>
6         |  <factor> / <term>
7         |  <factor>
8 <factor> ::= num
9           |  id

To choose between productions 2, 3, & 4, the parser must see past the num or id and look at the +, -, *, or /.

FIRST(2) ∩ FIRST(3) ∩ FIRST(4) ≠ ∅

This grammar fails the test.

Note: This grammar is right-associative.


    Example

    There are two nonterminals that must be left factored:


<expr> ::= <term> + <expr>
        |  <term> - <expr>
        |  <term>
<term> ::= <factor> * <term>
        |  <factor> / <term>
        |  <factor>

Applying the transformation gives us:

<expr>  ::= <term> <expr'>
<expr'> ::= + <expr>
         |  - <expr>
         |  ε
<term>  ::= <factor> <term'>
<term'> ::= * <term>
         |  / <term>
         |  ε


    Example

    Substituting back into the grammar yields


1  <goal>  ::= <expr>
2  <expr>  ::= <term> <expr'>
3  <expr'> ::= + <expr>
4           |  - <expr>
5           |  ε
6  <term>  ::= <factor> <term'>
7  <term'> ::= * <term>
8           |  / <term>
9           |  ε
10 <factor> ::= num
11           |  id

Now, selection requires only a single token lookahead.

Note: This grammar is still right-associative.


    Example

Prod'n  Sentential form                        Input
–       <goal>                                 ↑x - 2 * y
1       <expr>                                 ↑x - 2 * y
2       <term> <expr'>                         ↑x - 2 * y
6       <factor> <term'> <expr'>               ↑x - 2 * y
11      id <term'> <expr'>                     ↑x - 2 * y
–       id <term'> <expr'>                     x ↑- 2 * y
9       id <expr'>                             x ↑- 2 * y
4       id - <expr>                            x ↑- 2 * y
–       id - <expr>                            x - ↑2 * y
2       id - <term> <expr'>                    x - ↑2 * y
6       id - <factor> <term'> <expr'>          x - ↑2 * y
10      id - num <term'> <expr'>               x - ↑2 * y
–       id - num <term'> <expr'>               x - 2 ↑* y
7       id - num * <term> <expr'>              x - 2 ↑* y
–       id - num * <term> <expr'>              x - 2 * ↑y
6       id - num * <factor> <term'> <expr'>    x - 2 * ↑y
11      id - num * id <term'> <expr'>          x - 2 * ↑y
–       id - num * id <term'> <expr'>          x - 2 * y↑
9       id - num * id <expr'>                  x - 2 * y↑
5       id - num * id                          x - 2 * y↑

The next symbol determined each choice correctly.


    Back to left-recursion elimination

    Given a left-factored CFG, to eliminate left-recursion:


If A → A α | β | γ then replace all of the A productions with

A  → N A'
N  → β | γ
A' → α A' | ε

where N and A' are new non-terminals.

    Repeat until there are no left-recursive productions.


    Generality

    Question:


By left factoring and eliminating left-recursion, can we transform an arbitrary context-free grammar to a form where it can be predictively parsed with a single token lookahead?

Answer:

Given a context-free grammar that doesn't meet our conditions, it is undecidable whether an equivalent grammar exists that does meet our conditions.

Many context-free languages do not have such a grammar:

{ a^n 0 b^n | n ≥ 1 } ∪ { a^n 1 b^2n | n ≥ 1 }

Must look past an arbitrary number of a's to discover the 0 or the 1 and so determine the derivation.


    Recursive descent parsing

Now, we can produce a simple recursive descent parser from the (right-associative) grammar.


    Recursive descent parsing

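The parser code on these two slides was lost in extraction. A sketch of what such a parser looks like in Java, with one procedure per non-terminal of the grammar above (the Token enum, the list-based input, and the error handling are illustrative assumptions):

    import java.util.*;

    public class RDParser {
        enum Token { PLUS, MINUS, TIMES, DIV, NUM, ID, EOF }

        private final Iterator<Token> input;
        private Token tok;                                   // one-token lookahead

        RDParser(List<Token> tokens) { input = tokens.iterator(); tok = input.next(); }

        private void eat(Token expected) {
            if (tok == expected) tok = input.hasNext() ? input.next() : Token.EOF;
            else throw new RuntimeException("expected " + expected + ", found " + tok);
        }

        void goal() { expr(); eat(Token.EOF); }              // <goal> ::= <expr>
        void expr() { term(); exprPrime(); }                 // <expr> ::= <term> <expr'>
        void exprPrime() {                                   // <expr'> ::= + <expr> | - <expr> | ε
            if (tok == Token.PLUS)       { eat(Token.PLUS);  expr(); }
            else if (tok == Token.MINUS) { eat(Token.MINUS); expr(); }
            // else ε: match nothing
        }
        void term() { factor(); termPrime(); }               // <term> ::= <factor> <term'>
        void termPrime() {                                   // <term'> ::= * <term> | / <term> | ε
            if (tok == Token.TIMES)    { eat(Token.TIMES); term(); }
            else if (tok == Token.DIV) { eat(Token.DIV);   term(); }
        }
        void factor() {                                      // <factor> ::= num | id
            if (tok == Token.NUM)     eat(Token.NUM);
            else if (tok == Token.ID) eat(Token.ID);
            else throw new RuntimeException("expected num or id, found " + tok);
        }

        public static void main(String[] args) {             // parse x - 2 * y
            new RDParser(List.of(Token.ID, Token.MINUS, Token.NUM,
                                 Token.TIMES, Token.ID, Token.EOF)).goal();
        }
    }

Each procedure chooses its production by inspecting the single lookahead token, exactly the FIRST-set discipline developed below.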

    Building the tree

One of the key jobs of the parser is to build an intermediate representation of the source code.

To build an abstract syntax tree, we can simply insert code at the appropriate points:

factor actions can stack nodes id, num

op actions can stack nodes +, -, *, /

expr actions can pop 3, build and push subtree

term actions can pop 3, build and push subtree

goal action can pop and return tree


    Non-recursive predictive parsing

    Observation:

Our recursive descent parser encodes state information in its run-time stack, or call stack.

Using recursive procedure calls to implement a stack abstraction may not be particularly efficient. This suggests other implementation methods:

    explicit stack, hand-coded parser

    stack-based, table-driven parser


    Non-recursive predictive parsing

    Now, a predictive parser looks like:

[diagram: source code → scanner → tokens → table-driven parser → IR; the parser consults a stack and parsing tables]

    Rather than writing code, we build tables.

    Building tables can be automated!


    Table-driven parsers

    A parser generator system often looks like:


[diagram: grammar → parser generator → parsing tables; source code → scanner → tokens → table-driven parser (with stack and parsing tables) → IR]

    This is true for both top-down (LL) and bottom-up (LR) parsers


    Non-recursive predictive parsing

    Input: a string w and a parsing table M for G


tos ← 0
Stack[tos] ← EOF
Stack[++tos] ← Start Symbol
token ← next_token()
repeat
    X ← Stack[tos]
    if X is a terminal or EOF then
        if X = token then
            pop X
            token ← next_token()
        else error()
    else   /* X is a non-terminal */
        if M[X, token] = X → Y1 Y2 ... Yk then
            pop X
            push Yk, Yk-1, ..., Y1
        else error()
until X = EOF


    FIRST

For a string of grammar symbols α, define FIRST(α) as:

the set of terminal symbols that begin strings derived from α:
{ a ∈ Vt | α ⇒* aβ }

If α ⇒* ε then ε ∈ FIRST(α)

FIRST(α) contains the set of tokens valid in the initial position in α

To build FIRST(X):

1. If X ∈ Vt then FIRST(X) is {X}

2. If X → ε then add ε to FIRST(X).

3. If X → Y1 Y2 ... Yk:
   (a) Put FIRST(Y1) − {ε} in FIRST(X)
   (b) ∀i: 1 < i ≤ k, if ε ∈ FIRST(Y1) ∩ ... ∩ FIRST(Y(i-1)) (i.e., Y1 ... Y(i-1) ⇒* ε) then put FIRST(Yi) − {ε} in FIRST(X)
   (c) If ε ∈ FIRST(Y1) ∩ ... ∩ FIRST(Yk) then put ε in FIRST(X)

Repeat until no more additions can be made.

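The FIRST construction is a classic fixpoint computation. A compact Java sketch (the grammar encoding, with productions as string arrays and "" standing for ε, is an assumption for illustration):

    import java.util.*;

    class FirstSets {
        // A production [A, Y1, ..., Yk] encodes A → Y1 ... Yk; [A] alone encodes A → ε.
        static Map<String, Set<String>> compute(Set<String> terminals,
                                                List<String[]> productions) {
            Map<String, Set<String>> first = new HashMap<>();
            for (String t : terminals)                          // rule 1: FIRST(a) = {a}
                first.put(t, new HashSet<>(Set.of(t)));
            for (String[] p : productions)
                first.putIfAbsent(p[0], new HashSet<>());

            boolean changed = true;
            while (changed) {                                   // repeat until no additions
                changed = false;
                for (String[] p : productions) {
                    Set<String> fa = first.get(p[0]);
                    if (p.length == 1) { changed |= fa.add(""); continue; }   // rule 2
                    boolean allNullable = true;
                    for (int i = 1; i < p.length && allNullable; i++) {
                        Set<String> fy = first.computeIfAbsent(p[i], k -> new HashSet<>());
                        for (String a : fy)
                            if (!a.isEmpty()) changed |= fa.add(a);   // rules 3(a), 3(b)
                        allNullable = fy.contains("");   // proceed past nullable Yi only
                    }
                    if (allNullable) changed |= fa.add("");           // rule 3(c)
                }
            }
            return first;
        }
        // e.g. E → T E', E' → + T E' | ε, T → id gives
        // FIRST(E) = FIRST(T) = {id} and FIRST(E') = {+, ε}
    }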

    FOLLOW

For a non-terminal A, define FOLLOW(A) as

the set of terminals that can appear immediately to the right of A in some sentential form

Thus, a non-terminal's FOLLOW set specifies the tokens that can legally appear after it.

A terminal symbol has no FOLLOW set.

To build FOLLOW(A):

1. Put $ in FOLLOW(<goal>)

2. If A → αBβ:
   (a) Put FIRST(β) − {ε} in FOLLOW(B)
   (b) If β = ε (i.e., A → αB) or ε ∈ FIRST(β) (i.e., β ⇒* ε) then put FOLLOW(A) in FOLLOW(B)

Repeat until no more additions can be made


    LL(1) grammars

    Previous definition

A grammar G is LL(1) iff. for all non-terminals A, each distinct pair of productions A → β and A → γ satisfy the condition FIRST(β) ∩ FIRST(γ) = ∅.


What if A ⇒* ε?

Revised definition

A grammar G is LL(1) iff. for each set of productions A → β1 | β2 | ... | βn:

1. FIRST(β1), FIRST(β2), ..., FIRST(βn) are all pairwise disjoint

2. If βi ⇒* ε then FIRST(βj) ∩ FOLLOW(A) = ∅, ∀ 1 ≤ j ≤ n, i ≠ j.

If G is ε-free, condition 1 is sufficient.


    LL(1) grammars

    Provable facts about LL(1) grammars:

    1. No left-recursive grammar is LL(1)

2. No ambiguous grammar is LL(1)

3. Some languages have no LL(1) grammar

4. An ε-free grammar where each alternative expansion for A begins with a distinct terminal is a simple LL(1) grammar.

Example

S → aS | a

is not LL(1) because FIRST(aS) = FIRST(a) = {a}

S → aS'
S' → aS' | ε

accepts the same language and is LL(1)


    LL(1) parse table construction

    Input: Grammar G

    Output: Parsing table M

Method:

1. ∀ productions A → α:
   (a) ∀ a ∈ FIRST(α), add A → α to M[A, a]
   (b) If ε ∈ FIRST(α):
       i.  ∀ b ∈ FOLLOW(A), add A → α to M[A, b]
       ii. If $ ∈ FOLLOW(A) then add A → α to M[A, $]

2. Set each undefined entry of M to error

If ∃ M[A, a] with multiple entries then grammar is not LL(1).

Note: recall a, b ∈ Vt, so a ≠ ε and b ≠ ε


    Example

    Our long-suffering expression grammar:

S  → E           T  → F T'
E  → T E'        T' → * F T'
E' → + T E'           | / F T'
    | - T E'          | ε
    | ε          F  → num
                      | id

      FIRST           FOLLOW
S     {num, id}       {$}
E     {num, id}       {$}
E'    {ε, +, -}       {$}
T     {num, id}       {+, -, $}
T'    {ε, *, /}       {+, -, $}
F     {num, id}       {+, -, *, /, $}

      num      id      +          -          *          /          $
S     S→E      S→E
E     E→TE'    E→TE'
E'                     E'→+TE'    E'→-TE'                          E'→ε
T     T→FT'    T→FT'
T'                     T'→ε       T'→ε       T'→*FT'    T'→/FT'    T'→ε
F     F→num    F→id


    A grammar that is not LL(1)

<stmt> ::= if <expr> then <stmt>
        |  if <expr> then <stmt> else <stmt>

Left-factored:

<stmt>  ::= if <expr> then <stmt> <stmt'>
<stmt'> ::= else <stmt>
         |  ε

Now, FIRST(<stmt'>) = {ε, else}

Also, FOLLOW(<stmt'>) = {else, $}

But, FIRST(<stmt'>) ∩ FOLLOW(<stmt'>) = {else} ≠ ∅

On seeing else, conflict between choosing

<stmt'> ::= else <stmt>   and   <stmt'> ::= ε

⇒ grammar is not LL(1)!

The fix:

Put priority on <stmt'> ::= else <stmt> to associate else with the closest previous then.


    Error recovery

    Key notion:

    For each non-terminal, construct a set of terminals on which the parser

    can synchronize


When an error occurs looking for A, scan until an element of SYNCH(A) is found

Building SYNCH:

1. a ∈ FOLLOW(A) ⇒ a ∈ SYNCH(A)

2. place keywords that start statements in SYNCH(A)

3. add symbols in FIRST(A) to SYNCH(A)

If we can't match a terminal on top of stack:

1. pop the terminal

2. print a message saying the terminal was inserted

3. continue the parse

(i.e., SYNCH(a) = Vt − {a})


    Chapter 4: LR Parsing


    Some definitions

    Recall

For a grammar G, with start symbol S, any string α such that S ⇒* α is called a sentential form

If α ∈ Vt*, then α is called a sentence in L(G)

Otherwise it is just a sentential form (not a sentence in L(G))

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence.

    A right-sentential form is a sentential form that occurs in the rightmost

    derivation of some sentence.


    Bottom-up parsing

    Goal:

    Given an input string w and a grammar G, construct a parse tree by


starting at the leaves and working to the root.

The parser repeatedly matches a right-sentential form from the language against the tree's upper frontier.

    At each match, it applies a reduction to build on the frontier:

each reduction matches an upper frontier of the partially built tree to the RHS of some production

    each reduction adds a node on top of the frontier

    The final result is a rightmost derivation, in reverse.


    Example

Consider the grammar

1  S → aABe
2  A → Abc
3    | b
4  B → d

and the input string abbcde

Prod'n  Sentential Form
–       abbcde
3       aAbcde
2       aAde
4       aABe
1       S

The trick appears to be scanning the input and finding valid sentential forms.


    Handles

    What are we trying to find?

A substring β of the tree's upper frontier that matches some production A → β, where reducing β to A is one step in the reverse of a rightmost derivation

    We call such a string a handle.

Formally:

a handle of a right-sentential form γ is a production A → β and a position in γ where β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ

i.e., if S ⇒*rm αAw ⇒rm αβw then A → β in the position following α is a handle of αβw

Because γ is a right-sentential form, the substring to the right of a handle contains only terminal symbols.


    Handles

[parse tree of αβw rooted at S, with the handle A → β at the frontier: A spans β, followed by the terminals w]


    Handles

    Theorem:

If G is unambiguous then every right-sentential form has a unique handle.

Proof: (by definition)

1. G is unambiguous ⇒ rightmost derivation is unique

2. ⇒ a unique production A → β applied to take γ(i-1) to γi

3. ⇒ a unique position k at which A → β is applied

4. ⇒ a unique handle A → β


    Handle-pruning

    The process to construct a bottom-up parse is called handle-pruning.

    To construct a rightmost derivation


S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ... ⇒ γ(n-1) ⇒ γn = w

we set i to n and apply the following simple algorithm

for i = n down to 1:
1. find the handle A_i → β_i in γ_i
2. replace β_i with A_i to generate γ(i-1)

This takes 2n steps, where n is the length of the derivation


    Stack implementation

One scheme to implement a handle-pruning, bottom-up parser is called a shift-reduce parser.

    Shift-reduce parsers use a stack and an input buffer


1. initialize stack with $

2. Repeat until the top of the stack is the goal symbol and the input token is $

a) find the handle
   if we don't have a handle on top of the stack, shift an input symbol onto the stack

b) prune the handle
   if we have a handle A → β on the stack, reduce
   i)  pop |β| symbols off the stack
   ii) push A onto the stack


Example: back to x - 2 * y

1 <goal> ::= <expr>
2 <expr> ::= <expr> + <term>
3         |  <expr> - <term>
4         |  <term>
5 <term> ::= <term> * <factor>
6         |  <term> / <factor>
7         |  <factor>
8 <factor> ::= num
9           |  id

Stack                          Input          Action
$                              id - num * id  shift
$ id                           - num * id     reduce 9
$ <factor>                     - num * id     reduce 7
$ <term>                       - num * id     reduce 4
$ <expr>                       - num * id     shift
$ <expr> -                     num * id       shift
$ <expr> - num                 * id           reduce 8
$ <expr> - <factor>            * id           reduce 7
$ <expr> - <term>              * id           shift
$ <expr> - <term> *            id             shift
$ <expr> - <term> * id                        reduce 9
$ <expr> - <term> * <factor>                  reduce 5
$ <expr> - <term>                             reduce 3
$ <expr>                                      reduce 1
$ <goal>                                      accept

1. Shift until top of stack is the right end of a handle
2. Find the left end of the handle and reduce

5 shifts + 9 reduces + 1 accept


    Shift-reduce parsing

    Shift-reduce parsers are simple to understand

A shift-reduce parser has just four canonical actions:

    1. shift next input symbol is shifted onto the top of the stack

    2. reduce right end of handle is on top of stack;

    locate left end of handle within the stack;

    pop handle off stack and push appropriate non-terminal LHS

    3. accept terminate parsing and signal success

    4. error call an error recovery routine

    The key problem: to recognize handles (not covered in this course).


LR(k) grammars

Informally, we say that a grammar G is LR(k) if, given a rightmost derivation

S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ... ⇒ γn = w

we can, for each right-sentential form in the derivation,


    1. isolate the handle of each right-sentential form, and

    2. determine the production by which to reduce

by scanning γi from left to right, going at most k symbols beyond the right end of the handle of γi.


LR(k) grammars

Formally, a grammar G is LR(k) iff.:

1. S ⇒*rm αAw ⇒rm αβw, and
2. S ⇒*rm γBx ⇒rm αβy, and
3. FIRSTk(w) = FIRSTk(y)

⇒ αAy = γBx

i.e., assume sentential forms αβw and αβy, with common prefix αβ and common k-symbol lookahead FIRSTk(y) = FIRSTk(w), such that αβw reduces to αAw and αβy reduces to γBx.

But, the common prefix means αβy also reduces to αAy, for the same result.

Thus αAy = γBx.


    Why study LR grammars?

    LR(1) grammars are often used to construct parsers.

    We call these parsers LR(1) parsers.

everyone's favorite parser


virtually all context-free programming language constructs can be expressed in an LR(1) form

LR grammars are the most general grammars parsable by a deterministic, bottom-up parser

efficient parsers can be implemented for LR(1) grammars

LR parsers detect an error as soon as possible in a left-to-right scan of the input

LR grammars describe a proper superset of the languages recognized by predictive (i.e., LL) parsers

LL(k): recognize use of a production A → β seeing first k symbols of β

LR(k): recognize occurrence of β (the handle) having seen all of what is derived from β plus k symbols of lookahead


    Left versus right recursion

    Right Recursion:

    needed for termination in predictive parsers

    requires more stack space


    right associative operators

    Left Recursion:

    works fine in bottom-up parsers

limits required stack space

left associative operators

    Rule of thumb:

    right recursion for top-down parsers

    left recursion for bottom-up parsers


    Parsing review

    Recursive descent

A hand-coded recursive descent parser directly encodes a grammar (typically an LL(1) grammar) into a series of mutually recursive procedures. It has most of the linguistic limitations of LL(1).

LL(k)

An LL(k) parser must be able to recognize the use of a production after seeing only the first k symbols of its right hand side.

LR(k)

An LR(k) parser must be able to recognize the occurrence of the right hand side of a production after having seen all that is derived from that right hand side with k symbols of lookahead.

    126


    Chapter 5: JavaCC and JTB


    The Java Compiler Compiler

    Can be thought of as Lex and Yacc for Java.

    It is based on LL(k) rather than LALR(1).


    Grammars are written in EBNF.

The Java Compiler Compiler transforms an EBNF grammar into an LL(k) parser.

    The JavaCC grammar can have embedded action code written in Java,

    just like a Yacc grammar can have embedded action code written in C.

The lookahead can be changed by writing LOOKAHEAD(...).

    The whole input is given in just one file (not two).


    The JavaCC input format


    One file:

    header

    token specifications for lexical analysis

    grammar


    The JavaCC input format

    Example of a token specification:
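The slide's code was lost in extraction; a plausible sketch of such a specification (token name assumed):

    TOKEN :
    {
      < INTEGER_LITERAL: ( ["1"-"9"] (["0"-"9"])* | "0" ) >
    }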


    Example of a production:
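Again a sketch, assuming productions Statement() and Expression() are defined elsewhere in the grammar:

    void StatementListReturn() :
    {}
    {
      ( Statement() )* "return" Expression() ";"
    }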



    Generating a parser with JavaCC
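The slide's commands were lost; a typical session might look like this (file names assumed; javacc generates the parser class named in the grammar's PARSER_BEGIN section):

    javacc Main.jj       (generates Main.java and supporting classes)
    javac Main.java      (compiles the generated parser)
    java Main < prog     (parses the input program)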


    The Visitor Pattern

For object-oriented programming, the Visitor pattern enables the definition of a new operation on an object structure without changing the classes of the objects.

    Gamma, Helm, Johnson, Vlissides:

    Design Patterns, 1995.


    Sneak Preview


    When using the Visitor pattern,

    the set of classes must be fixed in advance, and

    each class must have an accept method.


    First Approach: Instanceof and Type Casts

    The running Java example: summing an integer list.
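The class definitions were lost in extraction; a plausible reconstruction (the names Nil and Cons are assumed and reused in the sketches below):

    interface List {}

    class Nil implements List {}

    class Cons implements List {
        int head;
        List tail;
        Cons(int head, List tail) { this.head = head; this.tail = tail; }
    }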



    First Approach: Instanceof and Type Casts
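The summation code was also lost; a sketch in the instanceof style, assuming the List/Nil/Cons classes above (a fragment, to be placed in some enclosing class):

    static int sum(List l) {
        int sum = 0;
        while (l instanceof Cons) {        // test the class of the object
            sum = sum + ((Cons) l).head;   // type cast to reach the field
            l = ((Cons) l).tail;
        }
        return sum;                        // l is a Nil: end of list
    }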


Advantage: The code is written without touching the classes Nil and Cons.

Drawback: The code constantly uses type casts and instanceof to determine what class of object it is considering.


    Second Approach: Dedicated Methods

    The first approach is not object-oriented!

To access parts of an object, the classical approach is to use dedicated methods which both access and act on the subobjects.
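A sketch of the dedicated-method version (again with the assumed Nil/Cons names):

    interface List { int sum(); }

    class Nil implements List {
        public int sum() { return 0; }
    }

    class Cons implements List {
        int head;
        List tail;
        public int sum() { return head + tail.sum(); }  // access and act
    }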

We can now compute the sum of all components of a given List-object l by writing l.sum().


    Second Approach: Dedicated Methods


Advantage: The type casts and instanceof operations have disappeared, and the code can be written in a systematic way.

Disadvantage: For each new operation on List-objects, new dedicated methods have to be written, and all classes must be recompiled.


    Third Approach: The Visitor Pattern

    The Idea:

Divide the code into an object structure and a Visitor (akin to Functional Programming!)

Insert an accept method in each class. Each accept method takes a Visitor as argument.

A Visitor contains a visit method for each class (overloading!). A visit method for a class C takes an argument of type C.
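A sketch of the visitor version (once more with the assumed Nil/Cons names):

    interface List { void accept(Visitor v); }

    interface Visitor {
        void visit(Nil x);
        void visit(Cons x);
    }

    class Nil implements List {
        public void accept(Visitor v) { v.visit(this); }
    }

    class Cons implements List {
        int head;
        List tail;
        Cons(int head, List tail) { this.head = head; this.tail = tail; }
        public void accept(Visitor v) { v.visit(this); }
    }

    class SumVisitor implements Visitor {
        int sum = 0;
        public void visit(Nil x) {}
        public void visit(Cons x) {
            sum = sum + x.head;
            x.tail.accept(this);   // continue down the list
        }
    }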


    Third Approach: The Visitor Pattern

The purpose of the accept methods is to invoke the visit method in the Visitor which can handle the current object.



    Third Approach: The Visitor Pattern

The control flow goes back and forth between the visit methods in the Visitor and the accept methods in the object structure.


Notice: The visit methods describe both 1) actions, and 2) access of subobjects.
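For example, continuing the sketch above:

    List l = new Cons(5, new Cons(4, new Nil()));
    SumVisitor sv = new SumVisitor();
    l.accept(sv);                // accept/visit alternate down the list
    System.out.println(sv.sum);  // prints 9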


    Comparison

    The Visitor pattern combines the advantages of the two other approaches.

                               Frequent        Frequent
                               type casts?     recompilation?
    Instanceof and type casts  Yes             No
    Dedicated methods          No              Yes
    The Visitor pattern        No              No

    The advantage of Visitors: New methods without recompilation!

    Requirement for using Visitors: All classes must have an accept method.

    Tools that use the Visitor pattern:

JJTree (from Sun Microsystems) and the Java Tree Builder (from Purdue University), both frontends for The Java Compiler Compiler from Sun Microsystems.


    Visitors: Summary

Visitor makes adding new operations easy. Simply write a new visitor.

A visitor gathers related operations. It also separates unrelated ones.

Adding new classes to the object structure is hard. Key consideration: are you most likely to change the algorithm applied over an object structure, or are you most likely to change the classes of objects that make up the structure?

    Visitors can accumulate state.

Visitor can break encapsulation. The Visitor approach assumes that the interface of the data structure classes is powerful enough to let visitors do their job. As a result, the pattern often forces you to provide public operations that access internal state, which may compromise its encapsulation.


    The Java Tree Builder

    The Java Tree Builder (JTB) has been developed here at Purdue in my

    group.

    JTB is a frontend for The Java Compiler Compiler.


JTB supports the building of syntax trees which can be traversed using visitors.

    JTB transforms a bare JavaCC grammar into three components:

    a JavaCC grammar with embedded Java code for building a syntax

    tree;

    one class for every form of syntax tree node; and

    a default visitor which can do a depth-first traversal of a syntax tree.


    The Java Tree Builder

The produced JavaCC grammar can then be processed by the Java Compiler Compiler to give a parser which produces syntax trees.

The produced syntax trees can now be traversed by a Java program by writing subclasses of the default visitor.

    JavaCC grammar --> [JTB] --> JavaCC grammar with embedded Java code
                   --> [Java Compiler Compiler] --> Parser

    The Parser, run on a Program, builds a syntax tree with accept methods.
    JTB also emits the syntax-tree-node classes and a default visitor.


    Using JTB
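The slide's commands were lost; a typical session might look like this (grammar file name and jtb launcher assumed; jtb.out.jj is JTB's default output name):

    jtb Main.jj          (generates jtb.out.jj plus node classes and visitors)
    javacc jtb.out.jj    (generates a tree-building parser)
    javac Main.java
    java Main < prog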



    Example (simplified)

    For example, consider the Java 1.1 production
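The production shown was lost in extraction; the examples below assume it was the assignment production, roughly:

    void Assignment() :
    {}
    {
      PrimaryExpression() AssignmentOperator() Expression()
    }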


    JTB produces:
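A sketch of the transformed production (JTB's actual output differs in details):

    Assignment Assignment() :
    {
      PrimaryExpression f0;
      AssignmentOperator f1;
      Expression f2;
    }
    {
      f0 = PrimaryExpression()
      f1 = AssignmentOperator()
      f2 = Expression()
      { return new Assignment(f0, f1, f2); }
    }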

Notice that the production now returns a syntax tree represented as an Assignment object.


Example (simplified)

JTB produces a syntax-tree-node class for Assignment:
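A sketch of such a node class (field and visitor names assumed):

    public class Assignment implements Node {
        public PrimaryExpression f0;
        public AssignmentOperator f1;
        public Expression f2;

        public Assignment(PrimaryExpression n0, AssignmentOperator n1,
                          Expression n2) {
            f0 = n0; f1 = n1; f2 = n2;
        }

        public void accept(Visitor v) { v.visit(this); }
    }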


Notice the accept method; it invokes the visit method for Assignment in the default visitor.


Example (simplified)

The default visitor looks like this:
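A sketch of the relevant part (one visit method per node class):

    public class DepthFirstVisitor implements Visitor {
        // ... one method per syntax-tree-node class ...
        public void visit(Assignment n) {
            n.f0.accept(this);
            n.f1.accept(this);
            n.f2.accept(this);
        }
    }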


Notice the body of the visit method, which visits each of the three subtrees of the Assignment node.


Example (simplified)

Here is an example of a program which operates on syntax trees for Java 1.1 programs. The program prints the right-hand side of every assignment.

    The entire program is six lines:
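The program itself was lost; a sketch of what it plausibly looked like, assuming a pretty-printing visitor named TreeDumper:

    public class VPrinter extends DepthFirstVisitor {
        public void visit(Assignment n) {
            n.f2.accept(new TreeDumper());  // print the right-hand side
        }
    }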


When this visitor is passed to the root of the syntax tree, the depth-first traversal will begin, and when Assignment nodes are reached, the overriding visit method is executed.

Notice the use of the tree-dumping visitor (TreeDumper in the sketch above); it is a visitor which pretty prints Java 1.1 programs.

    JTB is bootstrapped.



    Chapter 6: Semantic Analysis


    Semantic Analysis

    The compilation process is driven by the syntactic structure of the program

    as discovered by the parser

    Semantic routines:

    interpret meaning of the program based on its syntactic structure


two purposes:

    finish analysis by deriving context-sensitive information

    begin synthesis by generating the IR or target code

associated with individual productions of a context-free grammar or subtrees of a syntax tree

    Copyright c 2000 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this workfor personal or classroom use is granted without fee provided that copies are not made or distributed forprofit or commercial advantage and that copies bear this notice and full citation on the first page. To copyotherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/orfee. Request permission to publish from [email protected].


    Context-sensitive analysis

    What context-sensitive questions might the compiler ask?

1. Is a name a scalar, an array, or a function?

2. Is a name declared before it is used?

3. Are any names declared but not used?

4. Which declaration of a name does each reference use?

5. Is an expression type-consistent?

6. Does the dimension of a reference match the declaration?

7. Where can a variable be stored? (heap, stack, ...)

8. Does a pointer reference the result of a malloc()?

9. Is a variable defined before it is used?

10. Is an array reference in bounds?

11. Does a function produce a constant value?

12. Can a function be implemented as a memo-function?

These cannot be answered with a context-free grammar

    Context-sensitive analysis

    Why is context-sensitive analysis hard?

    answers depend on values, not syntax

    questions and answers involve non-local information


    answers may involve computation

Several alternatives:

    abstract syntax tree     specify non-local computations
    (attribute grammars)     automatic evaluators

    symbol tables            central store for facts
                             express checking code

    language design          simplify language
                             avoid problems

    Symbol tables

For compile-time efficiency, compilers often use a symbol table:

    associates lexical names (symbols) with their attributes

    What items should be entered?


    variable names

    defined constants

    procedure and function names

    literal constants and strings

    source text labels

compiler-generated temporaries (we'll get there)

    Separate table for structure layouts (types) (field offsets and lengths)

    A symbol table is a compile-time structure


    Symbol table information

    What kind of information might the compiler need?

    textual name

    data type

    dimension information (for aggregates)


    declaring procedure

    lexical level of declaration

    storage class (base address)

    offset in storage

    if record, pointer to structure table

    if parameter, by-reference or by-value?

    can it be aliased? to what other names?

    number and type of arguments to functions


    Nested scopes: block-structured symbol tables

    What information is needed?

    when we ask about a name, we want the most recent declaration

    the declaration may be from the current scope or some enclosing

    scope

    innermost scope overrides declarations from outer scopes


    Key point: new declarations (usually) occur only in current scope

What operations do we need? (operation names here are illustrative)

    insert(key, value)   binds key to value
    lookup(key)          returns value bound to key
    beginScope()         remembers current state of table
    endScope()           restores table to state at most recent scope that
                         has not been ended

    May need to preserve list of locals for the debugger
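A minimal sketch of such a table in Java (a hash map of binding stacks plus an undo stack; names are illustrative, not the course API):

    import java.util.*;

    class SymbolTable {
        private final Map<String, Deque<Object>> table = new HashMap<>();
        private final Deque<String> undo = new ArrayDeque<>();
        private final Deque<Integer> marks = new ArrayDeque<>();

        void insert(String key, Object value) {      // bind key to value
            table.computeIfAbsent(key, k -> new ArrayDeque<>()).push(value);
            undo.push(key);
        }

        Object lookup(String key) {                  // most recent binding wins
            Deque<Object> d = table.get(key);
            return (d == null || d.isEmpty()) ? null : d.peek();
        }

        void beginScope() { marks.push(undo.size()); }

        void endScope() {                            // pop bindings made in scope
            for (int m = marks.pop(); undo.size() > m; )
                table.get(undo.pop()).pop();
        }
    }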


    Attribute information

    Attributes are internal representation of declarations

    Symbol table associates names with attributes


    Names may have different attributes depending on their meaning:

    variables: type, procedure level, frame offset

    types: type descriptor, data size/alignment

    constants: type, value

procedures: formals (names/types), result type, block information (local decls.), frame size


    Type expressions

    Type expressions are a textual representation for types:

    1. basic types: boolean, char, integer, real, etc.

    2. type names

3. constructed types (constructors applied to type expressions):

   (a) array(I, T) denotes an array of elements of type T with index type I
       e.g., array(1..10, integer)

   (b) T1 × T2 denotes the Cartesian product of type expressions T1 and T2

   (c) records: like products, but fields have names
       e.g., record((a × integer), (b × real))

   (d) pointer(T) denotes the type pointer to object of type T

   (e) D → R denotes the type of a function mapping domain D to range R
       e.g., integer × integer → integer

Type descriptors

Type descriptors are compile-time structures representing type expressions,
e.g., char × char → pointer(integer):

              →
            /   \
           ×     pointer
          / \       |
      char   char  integer

or, as a DAG in which the two occurrences of char share a single leaf node.

    Type compatibility

    Type checking needs to determine type equivalence

    Two approaches:

    Name equivalence: each type name is a distinct type

Structural equivalence: two types are equivalent iff. they have the same structure (after substituting type expressions for type names):

    s ≡ t                          iff. s and t are the same basic types
    array(s1, s2) ≡ array(t1, t2)  iff. s1 ≡ t1 and s2 ≡ t2
    s1 × s2 ≡ t1 × t2              iff. s1 ≡ t1 and s2 ≡ t2
    pointer(s) ≡ pointer(t)        iff. s ≡ t
    s1 → s2 ≡ t1 → t2              iff. s1 ≡ t1 and s2 ≡ t2

    Type compatibility: example

    Consider:
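The declarations were lost in extraction; judging from the type graph two slides on, they were presumably along these lines:

    type link = ^cell;
    var  next : link;
         last : link;
         p    : ^cell;
         q, r : ^cell;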


Under name equivalence:

    next and last have the same type

    p, q and r have the same type

    p and next have different type

    Under structural equivalence all variables have the same type

Ada/Pascal/Modula-2/Tiger are somewhat confusing: they treat distinct type definitions as distinct types, so p has a different type from q and r.


    Type compatibility: Pascal-style name equivalence

    Build compile-time structure called a type graph:

    each constructor or basic type creates a node

each name creates a leaf (associated with the type's descriptor)


    next, last --> link = pointer --> cell
    p          --> pointer --> cell    (a pointer node of its own)
    q, r       --> pointer --> cell    (another pointer node, shared by q and r)

Type expressions are equivalent if they are represented by the same node in the graph


    Type compatibility: recursive types

    Consider:
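The declarations were lost; from the diagrams below they were presumably:

    type link = ^cell;
         cell = record
                  info : integer;
                  next : link;
                end;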

    We may want to eliminate the names from the type graph


Eliminating the name link from the type graph for the record:

    cell = record
             info --> integer
             next --> pointer --> cell


    Type compatibility: recursive types

Allowing cycles in the type graph eliminates the name cell as well:


    record
      info --> integer
      next --> pointer, whose target edge cycles back to the record node itself



    Chapter 7: Translation and Simplification


Tiger IR trees: Expressions

CONST(i)    Integer constant i

NAME(n)     Symbolic constant n [a code label]

TEMP(t)     Temporary t [one of any number of "registers"]

BINOP(o, e1, e2)
            Application of binary operator o:
                PLUS, MINUS, MUL, DIV
                AND, OR, XOR [bitwise logical]
                LSHIFT, RSHIFT [logical shifts]
                ARSHIFT [arithmetic right-shift]
            to integer operands e1 (evaluated first) and e2 (evaluated second)

MEM(e)      Contents of a word of memory starting at address e

CALL(f, e1, ..., en)
            Procedure call; expression f is evaluated before arguments e1, ..., en

ESEQ(s, e)  Expression sequence; evaluate s for side-effects, then e for result

    Copyright c 2000 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this workfor personal or classroom use is granted without fee provided that copies are not made or distributed forprofit or commercial advantage and that copies bear this notice and full citation on the first page. To copyotherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/orfee. Request permission to publish from [email protected].


Tiger IR trees: Statements

MOVE(TEMP t, e)
            Evaluate e into temporary t

MOVE(MEM(e1), e2)
            Evaluate e1 yielding address a, then e2 into the word at a

EXP(e)      Evaluate e and discard the result

JUMP(e, [l1, ..., ln])
            Transfer control to address e; l1, ..., ln are all possible values for e

CJUMP(o, e1, e2, t, f)
            Evaluate e1 then e2, yielding a and b, respectively; compare a with b
            using relational operator o:
                EQ, NE [signed and unsigned integers]
                LT, GT, LE, GE [signed]
                ULT, ULE, UGT, UGE [unsigned]
            jump to t if true, f if false

SEQ(s1, s2) Statement s1 followed by s2

LABEL(n)    Define constant value of name n as current code address; NAME(n)
            can be used as target of jumps, calls, etc.

    Kinds of expressions

    Expression kinds indicate how expression might be used

    Ex(exp) expressions that compute a value

    Nx(stm) statements: expressions that compute no value

    Cx conditionals (jump to true and false destinations)


    RelCx(op, left, right)

    IfThenElseExp expression/statement depending on use

    Conversion operators allow use of one form in context of another:

    unEx convert to tree expression that computes value of inner tree

    unNx convert to tree statement that computes inner tree but returns no

    value

unCx(t, f) convert to statement that evaluates inner tree and branches to true destination if non-zero, false destination otherwise
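A sketch of how the conversion operators can be organized in Java, in the style of Appel's Tiger framework (class and constant names assumed):

    abstract class Exp {
        abstract Tree.Exp unEx();                 // as a value
        abstract Tree.Stm unNx();                 // for effect only
        abstract Tree.Stm unCx(Label t, Label f); // as a conditional branch
    }

    class Ex extends Exp {
        Tree.Exp exp;
        Ex(Tree.Exp e) { exp = e; }
        Tree.Exp unEx() { return exp; }
        Tree.Stm unNx() { return new Tree.EXP(exp); }
        Tree.Stm unCx(Label t, Label f) {
            // branch to t if exp evaluates non-zero, else to f
            return new Tree.CJUMP(Tree.CJUMP.NE, exp, new Tree.CONST(0), t, f);
        }
    }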


Translating Tiger

Simple variables: fetch with a MEM:

    Ex(MEM(+(TEMP fp, CONST k)))

where fp is the home frame of the variable, found by following static links, and k is the offset of the variable in that level.

Tiger array variables: Tiger arrays are pointers to the array base, so fetch with a MEM like any other variable:

    Ex(MEM(+(TEMP fp, CONST k)))

Thus, for e[i]:

    Ex(MEM(+(e.unEx, ×(i.unEx, CONST w))))

where i is the index expression and w is the word size; all values are word-sized (scalar) in Tiger.

Note: must first check that the array index i is in bounds (0 ≤ i < size of e); the runtime will put the size in the word preceding the array base.

Translating Tiger

Tiger record variables: Again, records are pointers to the record base, so fetch like other variables. For e.f:

    Ex(MEM(+(e.unEx, CONST o)))

where o is the byte offset of the field f in the record.
Note: must check that the record pointer is non-nil (i.e., non-zero).

String literals: Statically allocated, so just use the string's label:

    Ex(NAME(label))

where the literal will be emitted with the assembled code.

Record creation, {f1 = e1; f2 = e2; ...; fn = en}, in the (preferably GC'd) heap: first allocate the space, then initialize it:

    Ex(ESEQ(SEQ(MOVE(TEMP r, externalCall("allocRecord", [CONST n])),
            SEQ(MOVE(MEM(TEMP r), e1.unEx),
            SEQ(...,
                MOVE(MEM(+(TEMP r, CONST (n−1)·w)), en.unEx)))),
        TEMP r))

where w is the word size.

Array creation, t[e1] of e2:

    Ex(externalCall("initArray", [e1.unEx, e2.unEx]))

    Control structures

Basic blocks:

    a sequence of straight-line code

    if one instruction executes then they all execute

    a maximal sequence of instructions without branches

    a label starts a new basic block

Overview of control structure translation:

    control flow links up the basic blocks

    ideas are simple

    implementation requires bookkeeping

    some care is needed for good code


while loops

while c do s:

1. evaluate c
2. if false jump to next statement after loop
3. if true fall into loop body
4. branch to top of loop

e.g.,

    test:
          if not(c) jump done
          s
          jump test
    done:

Nx( SEQ(SEQ(SEQ(LABEL test, c.unCx(body, done)),
        SEQ(SEQ(LABEL body, s.unNx), JUMP(NAME test))),
    LABEL done))

repeat e1 until e2: evaluate/compare/branch at bottom of loop

for loops

for i := e1 to e2 do s:

1. evaluate lower bound into index variable
2. evaluate upper bound into limit variable
3. if index > limit jump to next statement after loop
4. fall through to loop body
5. increment index
6. if index <= limit jump to top of loop body

    t1 <- e1
    t2 <- e2
    if t1 > t2 jump done
    body:
          s
          t1 <- t1 + 1
          if t1 <= t2 jump body
    done:

For break statements:

    when translating a loop, push the done label on some stack

    break simply jumps to the label on top of the stack

    when done translating the loop and its body, pop the label

Function calls

f(e1, ..., en):

    Ex(CALL(NAME label_f, [sl, e1, ..., en]))

where sl is the static link for the callee f, found by following n static links from the caller, n being the difference between the levels of the caller and the callee.

Comparisons

Translate a op b as:

    RelCx(op, a.unEx, b.unEx)

When used as a conditional, unCx(t, f) yields:

    CJUMP(op, a.unEx, b.unEx, t, f)

where t and f are labels.

When used as a value, unEx yields:

    ESEQ(SEQ(MOVE(TEMP r, CONST 1),
         SEQ(unCx(t, f),
         SEQ(LABEL f,
         SEQ(MOVE(TEMP r, CONST 0), LABEL t)))),
    TEMP r)

Conditionals

The short-circuiting Boolean operators have already been transformed into if-expressions in Tiger abstract syntax:

e.g., x < 5 & a > b turns into if x < 5 then a > b else 0

Translate if e1 then e2 else e3 into: IfThenElseExp(e1, e2, e3)

When used as a value, unEx yields:

    ESEQ(SEQ(SEQ(e1.unCx(t, f),
             SEQ(SEQ(LABEL t,
                     SEQ(MOVE(TEMP r, e2.unEx),
                         JUMP join)),
                 SEQ(LABEL f,
                     SEQ(MOVE(TEMP r, e3.unEx),
                         JUMP join)))),
         LABEL join),
    TEMP r)

As a conditional, unCx(t, f) yields:

    SEQ(e1.unCx(tt, ff),
        SEQ(SEQ(LABEL tt, e2.unCx(t, f)),
            SEQ(LABEL ff, e3.unCx(t, f))))

Conditionals: Example

Applying unCx(t, f) to if x < 5 then a > b else 0:

    SEQ(CJUMP(LT, x.unEx, CONST 5, tt, ff),
        SEQ(SEQ(LABEL tt, CJUMP(GT, a.unEx, b.unEx, t, f)),
            SEQ(LABEL ff, JUMP f)))

or more optimally:

    SEQ(CJUMP(LT, x.unEx, CONST 5, tt, f),
        SEQ(LABEL tt, CJUMP(GT, a.unEx, b.unEx, t, f)))

One-dimensional fixed arrays

An array declared with lower bound 2 (e.g., var a : ARRAY [2..5] of integer), subscripted as a[e], translates to:

    MEM(+(TEMP fp, +(CONST k − 2·w, ×(CONST w, e.unEx))))

where k is the offset of the static array from fp and w is the word size.

In Pascal, multidimensional arrays are treated as arrays of arrays, so A[i,j] is equivalent to A[i][j], and can be translated as above.

    Multidimensional arrays

    Array allocation:

    constant bounds
        allocate in static area, stack, or heap
        no run-time descriptor is needed

    dynamic arrays: bounds fixed at run-time
        allocate in stack or heap
        descriptor is needed

    dynamic arrays: bounds can change at run-time
        allocate in heap
        descriptor is needed

Multidimensional arrays

Array layout:

    Contiguous:

    1. Row major
       Rightmost subscript varies most quickly:
           A[1,1], A[1,2], ..., A[2,1], A[2,2], ...
       Used in PL/1, Algol, Pascal, C, Ada, Modula-3

    2. Column major
       Leftmost subscript varies most quickly:
           A[1,1], A[2,1], ..., A[1,2], A[2,2], ...
       Used in FORTRAN

    By vectors
       Contiguous vector of pointers to (non-contiguous) subarrays

Multi-dimensional arrays: row-major layout

number of elements in dimension j:  D_j = U_j − L_j + 1

position of A[i_1, ..., i_n]:

    (i_n − L_n)
    + (i_{n−1} − L_{n−1})·D_n
    + (i_{n−2} − L_{n−2})·D_n·D_{n−1}
    + ...
    + (i_1 − L_1)·D_n···D_2

which can be rewritten as

    variable part:  i_1·D_2···D_n + i_2·D_3···D_n + ... + i_{n−1}·D_n + i_n

    constant part:  L_1·D_2···D_n + L_2·D_3···D_n + ... + L_{n−1}·D_n + L_n

address of A[i_1, ..., i_n]:

    address(A) + ((variable part − constant part) × element size)
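As a quick illustration (numbers chosen here, not from the notes): for A : array[1..3, 1..4] with one-word (4-byte) elements, D_2 = 4; for A[2,3] the variable part is 2·4 + 3 = 11 and the constant part is 1·4 + 1 = 5, so A[2,3] lives at address(A) + (11 − 5)·4 = address(A) + 24.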

