1 incorporating domain- specific information into the compilation process samuel z. guyer...

51
1 Domain-Specific Information into the Compilation Process Samuel Z. Guyer Supervisor: Calvin Lin April 14, 2003

Post on 22-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

1

Incorporating Domain-Specific Information into the Compilation Process

Samuel Z. Guyer

Supervisor: Calvin Lin

April 14, 2003

2

Motivation

Two different views of software:

Compiler’s view Abstractions: numbers, pointers, loops Operators: +, -, *, ->, []

Programmer’s view Abstractions: files, matrices, locks, graphics Operators: read, factor, lock, draw

This discrepancy is a problem...

3

Find the error – part 1

Example:

Error: case outside of switch statement Part of the language definition Error reported at compile time Compiler indicates the location and nature of error

switch (var_83) {case 0: func_24(); break;case 1: func_29(); break;}case 2: func_78();!

4

Find the error – part 2

Example:

Improper call to libfunc_38 Syntax is correct – no compiler message Fails at run-time

Problem: what does libfunc_38 do? This is how compilers view reusables

struct __sue_23 * var_72;char var_81[100];var_72 = libfunc_84(__str_14, __str_65);libfunc_44(var_72);libfunc_38(var_81, 100, 1, var_72);!

5

Find the error – part 3

Example:

Improper call to fread() after fclose() The names reveal the mistake

No traditional compiler reports this error Run-time system: how does the code fail? Code review: rarely this easy to spot

FILE * my_file;char buffer[100];my_file = fopen(“my_data”, “r”);fclose(my_file);fread(buffer, 100, 1, my_file);!

6

Problem

Compilers are unaware of library semantics Library calls have no special meaning The compiler cannot provide any assistance

Burden is on the programmer: Use library routines correctly Use library routines efficiently and effectively

These are difficult manual tasks Tedious and error-prone Can require considerable expertise

7

Solution

A library-level compiler Compiler support for software libraries Treat library routines more like built-in operators

Compile at the library interface level Check programs for library-level errors Improve performance with library-level optimizations

Key: Libraries represent domains Capture domain-specific semantics and expertise Encode in a form that the compiler can use

8

The Broadway Compiler

Broadway – source-to-source C compiler Domain-independent compiler mechanisms

Annotations – lightweight specification language Domain-specific analyses and transformations

Many libraries, one compiler

ApplicationSource code

LibraryAnnotationsHeader filesSource code

Broadway

Analyzer

Optimizer

Error reportsLibrary-specific messages

Application+LibraryIntegrated source code

9

Benefits

Improves capabilities of the compiler Adds many new error checks and optimizations Qualitatively different

Works with existing systems Domain-specific compilation without recoding For us: more thorough and convincing validation

Improve productivity Less time spent on manual tasks All users benefit from one set of annotations

10

Outline

Motivation The Broadway Compiler

Recent work on scalable program analysis Problem: Error checking demands powerful analysis Solution: Client-driven analysis algorithm Example: Detecting security vulnerabilities

Contributions Related work Conclusions and future work

11

Security vulnerabilities

How does remote hacking work? Most are not direct attacks (e.g., cracking passwords) Idea: trick a program into unintended behavior

Automated vulnerability detection: How do we define “intended”? Difficult to formalize and check application logic

Libraries control all critical system services Communication, file access, process control Analyze routines to approximate vulnerability

12

Remote access vulnerability

Example:

Vulnerability: executes any remote command What if this program runs as root? Clearly domain-specific: sockets, processes, etc. Requirement:

Why is detecting this vulnerability hard?

int sock;char buffer[100];sock = socket(AF_INET, SOCK_STREAM, 0);read(sock, buffer, 100);execl(buffer);

Data from an Internet socket should not specify a program to execute

!

13

Challenge 1: Pointers

Example:

Still contains a vulnerability Only one buffer Variables buffer and ref are aliases

We need an accurate model of memory

int sock;char buffer[100];char * ref = buffer;sock = socket(AF_INET, SOCK_STREAM, 0);read(sock, buffer, 100);execl(ref);!

14

Challenge 2: Scope

Call graph:

Objects flow throughout program No scoping constraints Objects referenced through pointers

We need whole-program analysis

main

readsocketsock = (AF_INET, SOCK_STREAM, 0);

(sock, buffer, 100); (ref);execl

!

15

Challenge 3: Precision

Static analysis is always an approximation

Precision: level of detail or sensitivity Multiple calls to a procedure

Context-sensitive: analyze each call separately Context-insensitive: merge information from all calls

Multiple assignments to a variable Flow-sensitive: record each value separately Flow-insensitive: merge values from all assignments

Lower precision reduces the cost of analysis Exponential polynomial ~linear

16

Insufficient precision

Example:Context-insensitivity

Information merged at call Analyzer reports 2 possible errors Only 1 real error

Imprecision leads to false positives

!

main

socketexeclexecl

read

stdin

??

17

Cost versus precision

Problem: A tradeoff Precise analysis prohibitively expensive Cheap analysis too many false positives

Idea: Mixed precision analysis Focus effort on the parts of the program that matter Don’t waste time over-analyzing the rest

Key: Let error detection problem drive precision Client-driven program analysis

18

Client-Driven Algorithm Client: Error detection analysis problem

Algorithm: Start with fast cheap analysis – monitor imprecision Determine extra precision – reanalyze

Pointer Analyzer

Client Analysis

Memory Model

Error Reports

Dependence Graph

MonitorInformation Loss

Adaptor

Precision Policy

19

Algorithm components

Monitor Runs alongside main analysis Records imprecision

Adaptor Start at the locations of reported errors Trace back to the cause and diagnose

?

20

Sources of imprecisionPolluting assignments

Multiple assignments

x =

x =

x

foo( )

Multiple procedure calls

foo( )

foo( ) = ( , )

Conditions

if(cond)

x = x =

ptr

Polluted target

ptr

Polluted pointer

(*ptr)or

Pointer dereference

21

In action...

Monitor analysis

Polluting assignments

Diagnose and apply “fix” In this case: one procedure context-sensitive

Reanalyze

main

socketexeclexecl

read

stdin

??

readread

!

22

Methodology

Compare with commonly-used fixed precision

Metrics Accuracy – number of errors reported

Includes false positives – fewer is better Performance – only when accuracy is the same

CS-FS Full: context-sensitive, flow-sensitive

CS-FI Slow: context-sensitive, flow-insensitive

CI-FS Medium: context-insensitive, flow-sensitive

CI-FI Fast: context-insensitive, flow-insensitive

23

Programs 18 real C programs

Unmodified source – all the issues of production code Many are system tools – run in privileged mode

Representative examples:

Name Description Priv Lines of code Procedures CFG nodes

muh IRC proxy 5K (25K) 84 5,191

blackhole E-mail filter 12K (244K) 71 21,370

wu-ftpd FTP daemon 22K (66K) 205 23,107

named DNS server 26K (84K) 210 25,452

nn News reader 36K (116K) 494 46,336

24

Error detection problems

1. File access:

2. Remote access vulnerabillity:

3. Format string vulnerability (FSV):

4. Remote FSV:

5. FTP behavior:

Data from an Internet socket should not specify a program to execute

Files must be open when accessed

Format string may not contain untrusted data

Check if FSV is remotely exploitable

Can this program be tricked into reading and transmitting arbitrary files

25

Increasing number of CFG nodes

Results

10X

1000X

1

100X

stunnelpfingerm

uh (I)m

uh (II)pureftpfcronapachem

akeblackholew

u-ftp (I)ssh clientprivoxyw

u-ftp (II)nam

edssh servercfengineS

QLite

nn

0 0 0 0 0 0 07 29 6 85 28 2 31 4 5 93 41

00 0

0

0

00

7 186

85

15

1

26 4

5 8941

07 18

615

1

26 4

5

8841

Slow (CS-FI)

Medium (CI-FS)

Fast (CI-FI)

Full (CS-FS)

Client-DrivenRemote access vulnerability

No

rmal

ized

per

form

ance

0

0 0

0

7

29

28310

0 0

7 1526

? ? ? ? ? ? ? ? ? ? ??

26

Overall results 90 test cases: 18 programs, 5 problems

Test case: 1 program and 1 error detection problem Compare algorithms: client-driven vs. fixed precision

As accurate as any other algorithm:

Runs faster than best fixed algorithm:

Both most accurate and fastest:

Performance not an issue:

87 out of 90

64 out of 87

19 of 23

29 out of 64

27

Why does it work?

Validates our hypothesis Different errors have different precision requirements Amount of extra precision is small

Name

Total procedures

# procedures context-sensitive

RA File FSV RFSV FTP

muh 84 6

apache 313 8 2 2 10

blackhole 71 2 5

wu-ftpd 205 4 4 17

named 210 1 2 1 4

cfengine 421 4 1 3 31

nn 494 2 1 1 30

28

Outline

Motivation The Broadway Compiler

Recent work on scalable program analysis Problem: Error checking demands powerful analysis Solution: Client-driven analysis algorithm Example: Detecting security vulnerabilities

Contributions Related work Conclusions and future work

29

Central contribution Library-level compilation

Opportunity: library interfaces make domains explicit in existing programming practice

Key: a separate language for codifying domain-specific knowledge

Result: our compiler can automate previously manual error checks and performance improvements

Knowledge representation

Applying knowledge

Results

Old way:

Broadway:

Informal

Codified

Manual

Compiler

Difficult, unpredictable

Easy, automatic, reliable

30

Specific contributions

Broadway compiler implementation Working system (43K C-Breeze, 23K pointers, 30K Broadway)

Client-driven pointer analysis algorithm [SAS’03] Precise and scalable whole-program analysis

Library-level error checking experiments [CSTR’01] No false positives for format string vulnerability

Library-level optimization experiments [LCPC’00] Solid improvements for PLAPACK programs

Annotation language [DSL’99] Balance expressive power and ease of use

31

Related work Configurable compilers

Power versus usability – who is the user?

Active libraries Previous work focusing on specific domains Few complete, working systems

Error detection Partial program verification – paucity of results

Scalable pointer analysis Many studies of cost/precision tradeoff Few mixed-precision approaches

32

Future work Language

More analysis capabilities

Optimization We have only scratched the surface

Error checking Resource leaks Path sensitivity – conditional transfer functions

Scalable analysis Start with cheaper analysis – unification-based Refine to more expensive analysis – shape analysis

33

Thank You

34

Annotations (I)

Dependence and pointer information Describe pointer structures Indicate which objects are accessed and modified

procedure fopen(pathname, mode){ on_entry { pathname --> path_string mode --> mode_string }

access { path_string, mode_string }

on_exit { return --> new file_stream }}

35

Annotations (II)

Library-specific properties Dataflow lattices

property State : { Open, Closed} initially Open

property Kind : { File, Socket { Local, Remote } }

SocketFile

Local RemoteOpenClosed

36

Annotations (III)

Library routine effects Dataflow transfer functions

procedure socket(domain, type, protocol){ analyze Kind { if (domain == AF_UNIX) IOHandle <- Local if (domain == AF_INET) IOHandle <- Remote }

analyze State { IOHandle <- Open }

on_exit { return --> new IOHandle }}

37

Annotations (IV)

Reports and transformations

procedure execl(path, args){ on_entry { path --> path_string }

report if (Kind : path_string could-be Remote) “Error at “ ++ $callsite ++ “: remote access”;}

procedure slow_routine(first, second){ when (condition) replace-with %{ quick_check($first); fast_routine($first, $second); }%}

38

Why does it work?

Validates our hypothesis Different clients have different precision requirements Amount of extra precision is small

Name

# procedures context-sensitive % variables flow-sensitive

RA File FSV RFSV FTP RA File FSV RFSV FTP

muh 6 0.1 0.07 0.31

apache 8 2 2 10 0.89 0.18 0.91 1.07 0.83

blackhole 2 5 0.24 0.04 0.32

wu-ftpd 4 4 17 0.63 0.09 0.51 0.53 0.23

named 1 2 1 4 0.14 0.01 0.23 0.20 0.42

cfengine 4 1 3 31 0.43 0.04 0.46 0.48 0.03

nn 2 1 1 30 1.82 0.17 1.99 2.03 0.97

39

Time

40

Validation

Optimization experiments Cons: One library – three applications Pros: Complex library – consistent results

Error checking experiments Cons: Quibble about different errors Pros: We set the standard for experiments

Overall Same system designed for optimizations is among the

best for detecting errors and security vulnerabilities

41

Type Theory

Equivalent to dataflow analysis (heresy?)

Different in practice Dataflow: flow-sensitive problems, iterative analysis Types: flow-insensitive problems, constraint solver

Commonality No magic bullet: same cost for the same precision Extracting the store model is a primary concern

Flow values Types

Transfer functions Inference rules Remember PhilWadler’s talk?

42

Generators

Direct support for domain-specific programming Language extensions or new language Generate implementation from specification

Our ideas are complementary Provides a way to analyze component compositions Unifies common algorithms:

Redundancy elimination Dependence-based optimizations

43

Is it correct?

Three separate questions:

Are Sam Guyer’s experiments correct? Yes, to the best of our knowledge

Checked PLAPACK results Checked detected errors against known errors

Is our compiler implemented correctly? Flip answer: who’s is? Better answer: testing suites

How do we validate a set of annotations?

44

Annotation correctness

Not addressed in my dissertation, but... Theoretical approach

Does the library implement the domain? Formally verify annotations against implementation

Practical approach Annotation debugger: interactive Automated assistance in early stages of development

Middle approach Basic consistency checks

45

Error Checking vs Optimization

Optimistic False positives allowed It can even be unsound Tend to be “may” analyses

Correctness is absolute “Black and white” Certify programs bug-free

Cost tolerant Explore costly analysis

Pessimistic Must preserve semantics Soundness mandatory Tend to be “must” analyses

Performance is relative Spectrum of results No guarantees

Cost sensitive Compile-time is a factor

46

Complexity

Pointer analysis Address taken: linear Steensgaard: almost linear (log log n factor) Anderson: polynomial (cubic) Shape analysis: double exponential

Dataflow analysis Intraprocedural: polynomial (height of lattice) Context-sensitivity: exponential (call graph)

Rarely see worst-case

47

Optimization

Overall strategy Exploit layers and modularity Customize lower-level layers in context

Compiler strategy: Top-down layer processing Preserve high-level semantics as long as possible Systematically dissolve layer boundaries

Annotation strategy General-purpose specialization Idiomatic code substitutions

48

PLAPACK Optimizations

PLAPACK matrices are distributed

Optimizations exploit special cases Example: Matrix multiply

Processor grid

PLA_Gemm( , , ); PLA_Local_gemm

PLA_Gemm( , , ); PLA_Rankk

49

Results

50

Find the error – part 3

State-of-the-art compilerstruct __sue_23 * var_72;struct __sue_25 * new_f = (struct __sue_25 *) malloc(sizeof (struct __sue_25));_IO_no_init(& new_f->fp.file, 1, 0, ((void *) 0), ((void *) 0));(& new_f->fp)->vtable = & _IO_file_jumps;_IO_file_init(& new_f->fp);if (_IO_file_fopen((struct __sue_23 *) new_f, filename, mode, is32) != ((void *) 0)) {  var_72 = & new_f->fp.file;  if ((var_72->_flags2 & 1) && (var_72->_flags & 8)) {    if (var_72->_mode <= 0) ((struct __sue_23 *) var_72)->vtable = & _IO_file_jumps_maybe_mmap;    else ((struct __sue_23 *) var_72)->vtable = & _IO_wfile_jumps_maybe_mmap;    var_72->_wide_data->_wide_vtable = & _IO_wfile_jumps_maybe_mmap;  }}if (var_72->_flags & 8192U) _IO_un_link((struct __sue_23 *) var_72);if (var_72->_flags & 8192U) status = _IO_file_close_it(var_72);  else status = var_72->_flags & 32U ? - 1 : 0;((* (struct _IO_jump_t * *) ((void *) (& ((struct __sue_23 *) (var_72))->vtable) +                              (var_72)->_vtable_offset))->__finish)(var_72, 0);if (var_72->_mode <= 0)  if (((var_72)->_IO_save_base != ((void *) 0))) _IO_free_backup_area(var_72);if (var_72 != ((struct __sue_23 *) (& _IO_2_1_stdin_)) && var_72 != ((struct __sue_23 *) (& _IO_2_1_stdout_)) &&     var_72 != ((struct __sue_23 *) (& _IO_2_1_stderr_))) { var_72->_flags = 0;  free(var_72); }bytes_read = _IO_sgetn(var_72, (char *) var_81, bytes_requested);

51

End backup slides