cpcs302 1
1
Course Goals
• To provide students with an understanding of the major phases of a compiler.
• To introduce students to the theory behind the various phases, including regular expressions, context-free grammars, and finite state automata.
• To provide students with an understanding of the design and implementation of a compiler.
• To have the students build a compiler, through type checking and intermediate code generation, for a small language.
• To provide students with an opportunity to work in a group on a large project.
2
Course Outcomes
• Students will have experience using current compiler generation tools.
• Students will be familiar with the different phases of compilation.
• Students will have experience defining and specifying the semantic rules of a programming language.
3
Prerequisites
• In-depth knowledge of at least one structured programming language.
• Strong background in algorithms, data structures, and abstract data types, including stacks, binary trees, graphs.
• Understanding of grammar theories.
• Understanding of data types and control structures, their design and implementation.
• Understanding of the design and implementation of subprograms, parameter passing mechanisms, scope.
4
Major Topics Covered in the Course
• Overview & Lexical Analysis (Scanning)
• Grammars & Syntax Analysis: Top-Down Parsing
• Syntax Analysis: Bottom-Up Parsing
• Semantic Analysis
• Symbol Tables and Run-time Systems
• Code Generation
• Introduction to Optimization and Control Flow Analysis
5
Textbook
“Compilers: Principles, Techniques, and Tools” by Aho, Lam, Sethi, and Ullman, 2nd edition.
6
GRADING
Assignments & project: 40
Midterm Exam: 20
Final Exam: 40
7
Compilers and Interpreters
“Compilation”: translation of a program written in a source language into a semantically equivalent program written in a target language.
Source Program → Compiler → Target Program
Compiler → Error messages
Input → Target Program → Output
8
Compilers and Interpreters (cont’d)
• “Interpretation”: performing the operations implied by the source program.
Source Program + Input → Interpreter → Output
Interpreter → Error messages
9
The Analysis-Synthesis Model of Compilation
• There are two parts to compilation:
– Analysis determines the operations implied by the source program, which are recorded in a tree structure
– Synthesis takes the tree structure and translates the operations therein into the target program
10
Preprocessors, Compilers, Assemblers, and Linkers
Skeletal Source Program
  → Preprocessor → Source Program
  → Compiler → Target Assembly Program
  → Assembler → Relocatable Object Code
  → Linker (with Libraries and Relocatable Object Files) → Absolute Machine Code
Try for example: gcc -v myprog.c
11
The Phases of a Compiler
(each entry lists Phase, Output, Sample)

Programmer (source code producer)
  Output: Source string
  Sample: A=B+C;

Scanner (performs lexical analysis)
  Output: Token string, and symbol table with names
  Sample: ‘A’, ‘=’, ‘B’, ‘+’, ‘C’, ‘;’

Parser (performs syntax analysis based on the grammar of the programming language)
  Output: Parse tree or abstract syntax tree
  Sample:
        ;
        |
        =
       / \
      A   +
         / \
        B   C

Semantic analyzer (type checking, etc.)
  Output: Annotated parse tree or abstract syntax tree

Intermediate code generator
  Output: Three-address code, quads, or RTL
  Sample:
      int2fp B t1
      + t1 C t2
      := t2 A

Optimizer
  Output: Three-address code, quads, or RTL
  Sample:
      int2fp B t1
      + t1 #2.3 A

Code generator
  Output: Assembly code
  Sample:
      MOVF #2.3,r1
      ADDF2 r1,r2
      MOVF r2,A

Peephole optimizer
  Output: Assembly code
  Sample:
      ADDF2 #2.3,r2
      MOVF r2,A
12
The Grouping of Phases
• Compiler front and back ends:
– Front end: analysis (machine independent)
– Back end: synthesis (machine dependent)
• Compiler passes:
– A collection of phases is done only once (single pass) or multiple times (multi pass)
– Single pass: usually requires everything to be defined before being used in the source program
– Multi pass: compiler may have to keep the entire program representation in memory
13
Compiler-Construction Tools
• Software development tools are available to implement one or more compiler phases
– Scanner generators
– Parser generators
– Syntax-directed translation engines
– Automatic code generators
– Data-flow engines
14
What qualities do you want in a compiler that you buy?
1. Correct code
2. Output runs fast
3. Compiler runs fast
4. Compile time proportional to program size
5. Support for separate compilation
6. Good diagnostics for syntax errors
7. Works well with debugger
8. Good diagnostics for flow anomalies
9. Good diagnostics for storage leaks
10. Consistent, predictable optimization
15
16
High-level View of a Compiler
Source code → Compiler → Machine code
Compiler → Errors
Implications
• Must recognize legal (and illegal) programs
• Must generate correct code
• Must manage storage of all variables (and code)
• Must agree with OS & linker on format for object code
17
Traditional Two-pass Compiler
Source code → Front End → IR → Back End → Machine code
Front End, Back End → Errors
• Use an intermediate representation (IR)
• Front end maps legal source code into IR
• Back end maps IR into target machine code
• Admits multiple front ends & multiple passes (better code)
18
The Front End
Source code → Scanner → tokens → Parser → IR
Scanner, Parser → Errors
Responsibilities
• Recognize legal (& illegal) programs
• Report errors in a useful way
• Produce IR & preliminary storage map
• Shape the code for the back end
• Much of front end construction can be automated
19
The Front End
Scanner
• Maps character stream into words, the basic unit of syntax
• Produces words & their parts of speech
– x = x + y ; becomes <id,x> <op,=> <id,x> <op,+> <id,y> ;
– word ≅ lexeme, part of speech ≅ token
– In casual speech, we call the pair a token
• Typical tokens include number, identifier, +, -, while, if
• Scanner eliminates white space
• Speed is important ⇒ use a specialized recognizer
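The scanner's job described above can be sketched in a few lines of Python. This is only an illustration, not the course's implementation: the token class names (`number`, `id`, `op`, `semi`, `skip`) and the function name `scan` are assumptions made for this sketch, and a generated scanner would normally use a specialized recognizer rather than Python's `re` module.

```python
import re

# Hypothetical token classes for this sketch; a real language has many more.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("op",     r"[=+\-*/]"),
    ("semi",   r";"),
    ("skip",   r"\s+"),   # white space is recognized, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    """Return a list of (part_of_speech, lexeme) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"illegal character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(scan("x = x + y ;"))
```

Each pair plays the role of a <part of speech, lexeme> token from the slide; the white space between words never reaches the parser.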
20
The Front End
Parser
• Recognizes context-free syntax & reports errors
• Guides context-sensitive analysis (type checking)
• Builds IR for source program
Hand-coded parsers are fairly easy to build
Most books advocate using automatic parser generators
21
The Front End
Compilers often use an abstract syntax tree (AST)
[Figure: AST with operator nodes + and - over the leaves <id,x>, <number,2>, and <id,y>]
• This is much more concise
• ASTs are one form of intermediate representation (IR)
The AST summarizes grammatical structure, without including detail about the derivation
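As a rough illustration of an AST as a data structure, the Python sketch below builds a tree from the three leaves shown on the slide. The source expression is assumed to be x + 2 - y with left-associative operators; the transcript does not state this, and the class and function names (`Leaf`, `BinOp`, `build_ast`, `show`) are invented for the sketch.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    kind: str    # "id" or "number"
    value: str

@dataclass
class BinOp:
    op: str
    left: "Node"
    right: "Node"

Node = Union[Leaf, BinOp]

def build_ast(items):
    """Fold an operand, operator, operand, ... sequence left-associatively."""
    tree = items[0]
    for i in range(1, len(items), 2):
        tree = BinOp(items[i], tree, items[i + 1])
    return tree

def show(node):
    """Render the tree with explicit parentheses."""
    if isinstance(node, Leaf):
        return f"<{node.kind},{node.value}>"
    return f"({show(node.left)} {node.op} {show(node.right)})"

# Assumed expression: x + 2 - y
ast = build_ast([Leaf("id", "x"), "+", Leaf("number", "2"), "-", Leaf("id", "y")])
print(show(ast))  # ((<id,x> + <number,2>) - <id,y>)
```

The tree keeps only operators and operands; nothing about which grammar productions were applied survives, which is the sense in which the AST omits derivation detail.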
22
The Back End
IR → Instruction Selection → IR → Register Allocation → IR → Instruction Scheduling → Machine code
(each stage can report Errors)
Responsibilities
• Translate IR into target machine code
• Choose instructions to implement each IR operation
• Decide which values to keep in registers
• Ensure conformance with system interfaces
Automation has been much less successful in the back end
23
The Back End
Instruction Selection
• Produce fast, compact code
• Take advantage of target features such as addressing modes
• Usually viewed as a pattern matching problem
– ad hoc methods, pattern matching, dynamic programming
• This was the problem of the future in 1978
– Spurred by transition from PDP-11 to VAX-11
– Orthogonality of RISC simplified this problem
24
The Back End
Instruction Scheduling
• Avoid hardware stalls and interlocks
• Use all functional units productively
• Can increase lifetime of variables (changing the allocation)
• Optimal scheduling is NP-Complete in nearly all cases
Good heuristic techniques are well understood
25
The Back End
Register Allocation
• Have each value in a register when it is used
• Manage a limited set of resources
• Can change instruction choices & insert LOADs & STOREs
• Optimal allocation is NP-Complete (1 or k registers)
Compilers approximate solutions to NP-Complete problems
26
Traditional Three-pass Compiler
Source Code → Front End → IR → Middle End → IR → Back End → Machine code
Front End, Middle End, Back End → Errors
Code Improvement (or Optimization)
• Analyzes IR and rewrites (or transforms) IR
• Primary goal is to reduce running time of the compiled code
– May also improve space, power consumption, …
• Must preserve “meaning” of the code
– Measured by values of named variables
27
The Optimizer (or Middle End)
IR → Opt 1 → IR → Opt 2 → IR → Opt 3 → IR → ... → Opt n → IR
(each pass can report Errors)
Modern optimizers are structured as a series of passes
Typical Transformations
• Discover & propagate some constant value
• Move a computation to a less frequently executed place
• Discover a redundant computation & remove it
• Remove useless or unreachable code
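The "series of passes" structure can be sketched as follows, under heavy simplifying assumptions: the IR here is a made-up list of `(dest, op, a, b)` tuples for straight-line code only, the only operations are `const` and `add`, and the pass names are invented for the example. Two of the transformations listed above are shown: constant propagation (with folding) and removal of useless code.

```python
def propagate_constants(code):
    """Replace operands with known constants and fold constant additions."""
    known = {}   # name -> constant value discovered so far
    out = []
    for dest, op, a, b in code:
        a = known.get(a, a)
        b = known.get(b, b)
        if op == "const":
            known[dest] = a
        elif op == "add" and isinstance(a, int) and isinstance(b, int):
            op, a, b = "const", a + b, None
            known[dest] = a
        else:
            known.pop(dest, None)   # dest no longer holds a known constant
        out.append((dest, op, a, b))
    return out

def remove_useless(code, live_out):
    """Drop instructions whose result is never used (backward scan)."""
    live = set(live_out)
    kept = []
    for dest, op, a, b in reversed(code):
        if dest in live:
            live.discard(dest)
            live.update(x for x in (a, b) if isinstance(x, str))
            kept.append((dest, op, a, b))
    kept.reverse()
    return kept

# Each pass maps IR to IR, so passes can be chained in any order.
PASSES = [propagate_constants, lambda c: remove_useless(c, {"z"})]

code = [
    ("t1", "const", 2, None),
    ("t2", "const", 3, None),
    ("t3", "add", "t1", "t2"),
    ("z",  "add", "t3", "y"),
]
for run_pass in PASSES:
    code = run_pass(code)
print(code)  # [('z', 'add', 5, 'y')]
```

The value of the named variable `z` is unchanged by either pass, which is the "meaning" the optimizer must preserve.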
The Big Picture
Why study lexical analysis?
• We want to avoid writing scanners by hand
source code → Scanner → parts of speech
specifications → Scanner Generator → tables or code → Scanner
Goals:
• To simplify specification & implementation of scanners
• To understand the underlying techniques and technologies
28
Lexical Analysis
• The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the form
(token-name, attribute-value)
The first component, token-name, is an abstract symbol that is used during syntax analysis, and the second component, attribute-value, points to an entry in the symbol table for this token.
29
Example
• Suppose a source program contains the assignment statement:
position = initial + rate * 60
The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token (id, 1).
2. The assignment symbol = is a lexeme that is mapped into the token (=).
3. initial is a lexeme that is mapped into the token (id, 2).
4. + is a lexeme that is mapped into the token (+).
5. rate is a lexeme that is mapped into the token (id, 3).
6. * is a lexeme that is mapped into the token (*).
7. 60 is a lexeme that is mapped into the token (60).
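The mapping in this example can be reproduced with a short Python sketch. The function name `tokenize`, the regular expression, and the choice to represent the symbol table as a dictionary from lexeme to entry number are all assumptions for illustration; only the token shapes follow the example.

```python
import re

# One lexeme per match: an identifier, a number, or a single operator.
LEXEME = re.compile(r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>[=+*]))")

def tokenize(source):
    """Return (tokens, symbol_table) for a simple assignment statement."""
    symtab = {}    # lexeme -> symbol-table entry number (1-based)
    tokens = []
    pos = 0
    source = source.rstrip()
    while pos < len(source):
        m = LEXEME.match(source, pos)
        if m is None:
            raise ValueError(f"cannot scan at offset {pos}")
        if m.lastgroup == "id":
            # Reuse the existing entry, or create the next one.
            index = symtab.setdefault(m.group("id"), len(symtab) + 1)
            tokens.append(("id", index))
        else:
            tokens.append((m.group(m.lastgroup),))
        pos = m.end()
    return tokens, symtab

tokens, symtab = tokenize("position = initial + rate * 60")
print(tokens)
print(symtab)
```

Identifiers carry an attribute value pointing into the symbol table, while operators and the number appear as one-component tokens, as in items 1 through 7.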
30
31
32
Specifying Lexical Patterns (micro-syntax)
A scanner recognizes the language’s parts of speech
Some parts are easy
• White space
– WhiteSpace → blank | tab | WhiteSpace blank | WhiteSpace tab
• Keywords and operators
– Specified as literal patterns: if, then, else, while, =, +, …
• Comments
– Opening and (perhaps) closing delimiters
– /* followed by */ in C
– // in C++
– % in LaTeX
33
Specifying Lexical Patterns (micro-syntax)
A scanner recognizes the language’s parts of speech
Some parts are more complex
• Identifiers
– Alphabetic followed by alphanumerics (plus _, &, $, …)
– May have limited length
• Numbers
– Integers: 0 or a digit from 1-9 followed by digits from 0-9
– Decimals: integer . digits from 0-9, or . digits from 0-9
– Reals: (integer or decimal) E (+ or -) digits from 0-9
– Complex: ( real , real )
We need a notation for specifying these patterns
We would like the notation to lead to an implementation
34
Regular Expressions
Patterns form a regular language
*** any finite language is regular ***
Regular expressions (REs) describe regular languages
Regular Expression (over alphabet Σ)
• ε is an RE denoting the set {ε}
• If a is in Σ, then a is an RE denoting {a}
• If x and y are REs denoting L(x) and L(y) then
– (x) is an RE denoting L(x)
– x | y is an RE denoting L(x) ∪ L(y)
– xy is an RE denoting L(x)L(y)
– x* is an RE denoting L(x)*
Precedence is closure, then concatenation, then alternation
Ever type “rm *.o a.out”?
35
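Set Operations (refresher)
Operation: Definition
• Union of L and M, written $L \cup M$: $L \cup M = \{\, s \mid s \in L \text{ or } s \in M \,\}$
• Concatenation of L and M, written $LM$: $LM = \{\, st \mid s \in L \text{ and } t \in M \,\}$
• Kleene closure of L, written $L^*$: $L^* = \bigcup_{0 \le i \le \infty} L^i$
• Positive closure of L, written $L^+$: $L^+ = \bigcup_{1 \le i \le \infty} L^i$
You need to know these definitions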
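These definitions can be tried out on small finite languages. The Python sketch below represents a language as a set of strings; because the closures are infinite unions, `kleene` and `positive` take a bound `max_i` and return only a finite approximation. All function names are chosen for this sketch.

```python
from itertools import product

def union(L, M):
    """L ∪ M: strings in L or in M."""
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s, t in product(L, M)}

def power(L, i):
    """L^i: L concatenated with itself i times; L^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene(L, max_i):
    """Finite approximation of L*: union of L^0 .. L^max_i."""
    out = set()
    for i in range(max_i + 1):
        out |= power(L, i)
    return out

def positive(L, max_i):
    """Finite approximation of L+: union of L^1 .. L^max_i."""
    out = set()
    for i in range(1, max_i + 1):
        out |= power(L, i)
    return out

L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))    # ['0', '1', 'a', 'b']
print(sorted(concat(L, M)))   # ['a0', 'a1', 'b0', 'b1']
print(sorted(kleene(L, 2)))   # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```

The only difference between the two closures is the empty string contributed by L^0, which is why `positive(L, 2)` lacks `''` here.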
36
Examples of Regular Expressions
Identifiers:
Letter → (a|b|c| … |z|A|B|C| … |Z)
Digit → (0|1|2| … |9)
Identifier → Letter ( Letter | Digit )*
Numbers:
Integer → (+|-|ε) (0 | (1|2|3| … |9)(Digit*) )
Decimal → Integer . Digit*
Real → ( Integer | Decimal ) E (+|-|ε) Digit*
Complex → ( Real , Real )
Numbers can get much more complicated!
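The identifier and number definitions above translate almost symbol for symbol into Python's `re` syntax. This is a sketch for experimenting with the patterns, not a scanner; the constant names and the helper `matches` are assumptions, and the patterns follow the slide literally, so `Digit*` is allowed to be empty.

```python
import re

DIGIT = r"[0-9]"
LETTER = r"[A-Za-z]"
IDENTIFIER = rf"{LETTER}(?:{LETTER}|{DIGIT})*"
INTEGER = rf"[+-]?(?:0|[1-9]{DIGIT}*)"          # optional sign, no leading zeros
DECIMAL = rf"{INTEGER}\.{DIGIT}*"
REAL = rf"(?:{INTEGER}|{DECIMAL})E[+-]?{DIGIT}*"

def matches(pattern, text):
    """True when the whole of text is in the language of pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(IDENTIFIER, "x1"))     # True
print(matches(IDENTIFIER, "1x"))     # False
print(matches(INTEGER, "007"))       # False
print(matches(DECIMAL, "3.14"))      # True
print(matches(REAL, "3.14E-2"))      # True
```

Taken literally, the Real pattern also accepts a string such as `1E` with no exponent digits, which is one example of how number definitions "can get much more complicated" once such cases are tightened.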
37
Regular Expressions (the point)
To make scanning tractable, programming languages differentiate between parts of speech by controlling their spelling (as opposed to dictionary lookup)
Difference between Identifier and Keyword is entirely lexical
– While is a Keyword
– Whilst is an Identifier
The lexical patterns used in programming languages are regular
Using results from automata theory, we can automatically build recognizers from regular expressions
We study REs to automate scanner construction!
38
Example
Consider the problem of recognizing register names
Register → r (0|1|2| … |9) (0|1|2| … |9)*
• Allows registers of arbitrary number
• Requires at least one digit
RE corresponds to a recognizer (or DFA)
Recognizer for Register:
  S0 --r--> S1
  S1 --(0|1|2| … |9)--> S2
  S2 --(0|1|2| … |9)--> S2
  S2 is the accepting state
With implicit transitions on other inputs to an error state, se
39
Example (continued)
DFA operation
• Start in state S0 & take transitions on each input character
• DFA accepts a word x iff x leaves it in a final state (S2)
So,
• r17 takes it through s0, s1, s2 and accepts
• r takes it through s0, s1 and fails
• a takes it straight to se
40
Example (continued)
Skeleton recognizer, where δ is the transition table and T is the action table:

char ← next character;
state ← s0;
call action(state, char);
while (char ≠ eof)
    state ← δ(state, char);
    call action(state, char);
    char ← next character;
if T(state) = final then
    report acceptance;
else
    report failure;

action(state, char)
    switch (T(state))
        case start:
            word ← char;
            break;
        case normal:
            word ← word + char;
            break;
        case final:
            word ← char;
            break;
        case error:
            report error;
            break;
    end;

Action table (T):
  State   Action
  S0      start
  S1      normal
  S2      final
  Se      error

Transition table (δ):
  State   r    0,1,2,3,4,5,6,7,8,9   other
  S0      S1   Se                    Se
  S1      Se   S2                    Se
  S2      Se   S2                    Se
  Se      Se   Se                    Se

• The recognizer translates directly into code
• To change DFAs, just change the tables
41
What if we need a tighter specification?
r Digit Digit* allows arbitrary numbers
• Accepts r00000
• Accepts r99999
• What if we want to limit it to r0 through r31?
Write a tighter regular expression
– Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
– Register → r0|r1|r2| … |r31|r00|r01|r02| … |r09
Produces a more complex DFA
• Has more states
• Same cost per transition
• Same basic implementation
42
Tighter register specification (continued)
The DFA for
Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
• Accepts a more constrained set of registers
• Same set of actions, more states
Transitions:
  S0 --r--> S1
  S1 --0,1,2--> S2
  S1 --3--> S5
  S1 --4,5,6,7,8,9--> S4
  S2 --(0|1|2| … |9)--> S3
  S5 --0,1--> S6
43
Tighter register specification (continued)
To implement the recognizer
• Use the same code skeleton
• Use transition and action tables for the new RE
• Bigger tables, more space, same asymptotic costs
• Better (micro-)syntax checking at the same cost

Transition table:
  State   r    0,1   2    3    4,5,6,7,8,9   other
  S0      S1   Se    Se   Se   Se            Se
  S1      Se   S2    S2   S5   S4            Se
  S2      Se   S3    S3   S3   S3            Se
  S3      Se   Se    Se   Se   Se            Se
  S4      Se   Se    Se   Se   Se            Se
  S5      Se   S6    Se   Se   Se            Se
  S6      Se   Se    Se   Se   Se            Se
  Se      Se   Se    Se   Se   Se            Se

Action table:
  State          Action
  S0             start
  S1             normal
  S2,3,4,5,6     final
  Se             error
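The tables above can be dropped into a table-driven loop to check that they accept exactly r0 through r31 (plus the two-digit forms r00 to r09). The Python below is a sketch under the same assumptions as any small demo: the column names, `char_class`, and `accepts` are invented here, and only accept/reject is computed, not the action routine.

```python
COLUMNS = ["r", "01", "2", "3", "4-9", "other"]

def char_class(ch):
    """Map an input character to a column of the transition table."""
    if ch == "r":
        return "r"
    if ch in ("0", "1"):
        return "01"
    if ch in ("2", "3"):
        return ch
    if ch in set("456789"):
        return "4-9"
    return "other"

ERR = ["se"] * 6
ROWS = {
    "s0": ["s1", "se", "se", "se", "se", "se"],
    "s1": ["se", "s2", "s2", "s5", "s4", "se"],
    "s2": ["se", "s3", "s3", "s3", "s3", "se"],
    "s3": ERR,
    "s4": ERR,
    "s5": ["se", "s6", "se", "se", "se", "se"],
    "s6": ERR,
    "se": ERR,
}
DELTA = {state: dict(zip(COLUMNS, row)) for state, row in ROWS.items()}
FINAL = {"s2", "s3", "s4", "s5", "s6"}

def accepts(word):
    """Run the tighter register DFA over word."""
    state = "s0"
    for ch in word:
        state = DELTA[state][char_class(ch)]
    return state in FINAL

print(all(accepts(f"r{n}") for n in range(32)))        # True
print(any(accepts(f"r{n}") for n in range(32, 100)))   # False
```

The loop is the same as for the simpler recognizer; only the tables grew, which matches the claim of bigger tables at the same cost per transition.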
44