binary translation using peephole superoptimizers

Binary Translation Using Peephole Superoptimizers Sorav Bansal, Alex Aiken Stanford University


Post on 10-Jan-2016



Page 1: Binary Translation Using Peephole Superoptimizers

Binary Translation Using Peephole Superoptimizers

Sorav Bansal, Alex Aiken
Stanford University

Page 2: Binary Translation Using Peephole Superoptimizers

Binary Translation

• Allow one ISA to run on another
• Applications
  – Portability (e.g., running legacy software)
  – Virtualization
  – Backward and Forward Compatibility
  – On-chip binary translation
  – Java Virtual Machines

Page 3: Binary Translation Using Peephole Superoptimizers

Binary Translation

[Figure: three system stacks, each on x86 hardware, running x86 apps next to a powerpc app through a Binary Translator. Labels in the first stack: Hypervisor, x86 OS, powerpc OS, Binary Translator. Labels in the other two stacks: OS, Binary Translator.]

Page 4: Binary Translation Using Peephole Superoptimizers

Binary Translation Wish-list

Performance

Large Complex ISAs

Retargetability

OS Compatibility

Page 5: Binary Translation Using Peephole Superoptimizers

Talk Outline

• Superoptimization
• Peephole Superoptimization
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion

Page 6: Binary Translation Using Peephole Superoptimizers

Superoptimization

• A superoptimizer is a code generator that uses brute-force search to try to find the optimal code for a given function

E.g.:
    int signum(int x) {
        if (x > 0) return 1;
        if (x < 0) return -1;
        else return 0;
    }

On Motorola 68020:
    add.l  d0, d0
    subx.l d1, d1
    negx.l d0
    addx.l d1, d1
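The four-instruction 68020 sequence is hard to read by eye, so the sketch below steps through it with a simplified Python model of the extend (X) flag. It is an illustration only: the helper names are hypothetical, and it assumes the input is in d0 and the result is left in d1, which the slide does not state.

```python
MASK = 0xFFFFFFFF  # model 32-bit data registers


def add(dst, src):
    """add.l: returns (result, X flag), where X is the carry out."""
    total = dst + src
    return total & MASK, total >> 32


def subx(dst, src, x):
    """subx.l: dst - src - X; the new X flag is the borrow."""
    diff = dst - src - x
    return diff & MASK, int(diff < 0)


def negx(dst, x):
    """negx.l: 0 - dst - X; the new X flag is the borrow."""
    return subx(0, dst, x)


def addx(dst, src, x):
    """addx.l: dst + src + X; the new X flag is the carry out."""
    total = dst + src + x
    return total & MASK, total >> 32


def signum_68020(value):
    d0 = value & MASK          # assumed: input in d0 (two's complement)
    d1 = 0xDEADBEEF            # arbitrary garbage; subx d1,d1 erases it
    d0, x = add(d0, d0)        # X = sign bit of the input
    d1, x = subx(d1, d1, x)    # d1 = -X, i.e. -1 for negative inputs
    d0, x = negx(d0, x)        # X = 1 exactly when the input was nonzero
    d1, x = addx(d1, d1, x)    # d1 = 2*d1 + X
    return d1 - (1 << 32) if d1 & 0x80000000 else d1  # read d1 as signed
```

The sequence has no branches: the sign and the "nonzero" test both travel through the X flag.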

Page 7: Binary Translation Using Peephole Superoptimizers

Superoptimization

• Enumerate all sequences up to a certain length

and

• Compare each enumerated sequence with target function for equivalence
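The two steps above can be sketched on a toy machine. This is a minimal illustration, not the superoptimizer from the talk: the one-register, 8-bit instruction set and all names are hypothetical, and equivalence is checked by running every possible input, which is only feasible because the machine is so small.

```python
from itertools import product

MASK = 0xFF  # 8-bit toy machine keeps the search space small

# Hypothetical one-register instruction set: name -> semantics.
INSTRUCTIONS = {
    "inc": lambda r: (r + 1) & MASK,
    "dec": lambda r: (r - 1) & MASK,
    "shl": lambda r: (r << 1) & MASK,
    "neg": lambda r: (-r) & MASK,
    "not": lambda r: ~r & MASK,
}


def run(sequence, value):
    """Execute a sequence of instruction names on one register value."""
    for name in sequence:
        value = INSTRUCTIONS[name](value)
    return value


def superoptimize(target, max_len):
    """Return the first shortest sequence equal to target on all inputs."""
    for length in range(max_len + 1):              # shortest sequences first
        for sequence in product(INSTRUCTIONS, repeat=length):
            if all(run(sequence, v) == target(v) for v in range(MASK + 1)):
                return list(sequence)
    return None                                    # nothing within max_len
```

For example, `superoptimize(lambda v: (2 * v + 2) & 0xFF, 3)` searches all sequences of length 0, 1 and 2 before it finds `inc` followed by `shl`. The number of candidates grows exponentially with the length, which is why the search is bounded.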

Page 8: Binary Translation Using Peephole Superoptimizers

Talk Outline

• Superoptimization
• Peephole Superoptimization
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion

Page 9: Binary Translation Using Peephole Superoptimizers

Peephole Superoptimization

Use a superoptimizer to automatically infer peephole optimizations

Table of Peephole Optimizations

pattern          replace-with
add $1, reg      inc reg
mul $2, reg      shl reg
…                …

[S. Bansal, A. Aiken. Automatic Generation of Peephole Superoptimizers, ASPLOS 2006]

Page 10: Binary Translation Using Peephole Superoptimizers

Peephole Superoptimizer: Step 1

[Figure: the raw bits of a binary (a.out) are scanned for instruction sequences.]

Harvest instruction sequences that can potentially be optimized. Canonicalize and store them.

Target Sequences
  – mov %eax, %ecx; mov %ecx, %eax
  – sub $123, %eax; add $456, %eax
  – movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax)
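One way to picture the canonicalization step is renaming registers and immediates by order of first appearance, so that sequences differing only in those names are stored once. The sketch below is an assumption about what such a step could look like, not the talk's implementation; the `%rN` and `$CN` placeholder names are hypothetical, and it ignores the fact that real x86 registers are not fully interchangeable.

```python
import re


def canonicalize(sequence):
    """Rename registers and immediates by order of first appearance."""
    names = {}  # original token -> (kind, index), shared across the sequence

    def rename(match):
        token = match.group(0)
        if token not in names:
            kind = "reg" if token.startswith("%") else "imm"
            index = sum(1 for k, _ in names.values() if k == kind)
            names[token] = (kind, index)
        kind, index = names[token]
        return f"%r{index}" if kind == "reg" else f"$C{index}"

    # %name matches a register, $value matches an immediate operand.
    return [re.sub(r"%\w+|\$-?\w+", rename, instr) for instr in sequence]
```

With this renaming, `mov %eax, %ecx; mov %ecx, %eax` and `mov %ebx, %edx; mov %edx, %ebx` map to the same stored sequence.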

Page 11: Binary Translation Using Peephole Superoptimizers

Peephole Superoptimization: Step 2

Brute-force optimization: Target Sequences -> Optimized Sequences
  – mov %eax, %ecx; mov %ecx, %eax -> mov %eax, %ecx
  – sub $123, %eax; add $456, %eax -> add $333, %eax
  – movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax) -> inc (%eax)
  – …

Page 12: Binary Translation Using Peephole Superoptimizers

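Equivalence Test

Two sequences -> Execution Test -> Boolean Test
  – Execution Test fails: not-equivalent
  – Execution Test passes: run the Boolean Test
  – Boolean Test fails: not-equivalent
  – Boolean Test passes: equivalent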
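The two-stage test can be sketched as a cheap filter followed by an exact check. In the sketch below the function names are hypothetical, sequences are modeled as Python functions on an 8-bit value, and the exact check is an exhaustive loop standing in for the SAT-based boolean test, which a real 32-bit setting would need instead.

```python
import random

MASK = 0xFF  # 8-bit values so that the exhaustive stand-in is feasible


def execution_test(seq_a, seq_b, trials=16, seed=0):
    """Fast filter: run both sequences on a few random inputs."""
    rng = random.Random(seed)
    inputs = [rng.randrange(MASK + 1) for _ in range(trials)]
    return all(seq_a(v) == seq_b(v) for v in inputs)


def boolean_test(seq_a, seq_b):
    """Exact check (stand-in for a SAT query): compare on every input."""
    return all(seq_a(v) == seq_b(v) for v in range(MASK + 1))


def equivalent(seq_a, seq_b):
    # Most non-equivalent pairs are rejected by the cheap execution test,
    # so the expensive exact test only runs on the survivors.
    return execution_test(seq_a, seq_b) and boolean_test(seq_a, seq_b)
```

A pair such as "sub 123 then add 456" versus "add 333" passes both stages, while a pair that differs on a single input may slip past the execution test and is then caught by the exact test.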

Page 13: Binary Translation Using Peephole Superoptimizers

Peephole Superoptimization: Step 3

Table of Peephole Optimizations
  – mov %eax, %ecx; mov %ecx, %eax -> mov %eax, %ecx
  – sub $123, %eax; add $456, %eax -> add $333, %eax
  – movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax) -> inc (%eax)

Page 14: Binary Translation Using Peephole Superoptimizers

Talk Outline

• Superoptimization
• Peephole Superoptimizers
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion

Page 15: Binary Translation Using Peephole Superoptimizers

Application to Binary Translation

• Our approach: Use lots of peephole transformations

pattern (ppc)     ppc->x86 register map     translate-to (x86)
addi r1,r1,1      r1->eax                   inc %eax
mullw r1,r1,2     r1->eax                   shl %eax
add r1,r1,r2      r1->eax; r2->ecx          add %ecx,%eax
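A translation rule on this slide has three parts: a PowerPC pattern, the register map under which it is valid, and the x86 replacement. The sketch below looks rules up by exact text match, which is a deliberate simplification; the `RULES` layout and the `translate` function are hypothetical, and a real table would match canonicalized, multi-instruction patterns.

```python
# pattern (ppc) -> (required ppc->x86 register map, x86 replacement)
RULES = {
    "addi r1,r1,1": ({"r1": "%eax"}, ["inc %eax"]),
    "mullw r1,r1,2": ({"r1": "%eax"}, ["shl %eax"]),
    "add r1,r1,r2": ({"r1": "%eax", "r2": "%ecx"}, ["add %ecx,%eax"]),
}


def translate(ppc_code, register_map):
    """Translate instruction by instruction under one fixed register map."""
    x86_code = []
    for instr in ppc_code:
        required, replacement = RULES[instr]
        # A rule only applies if the current map places every register
        # where the rule expects it.
        for reg, location in required.items():
            if register_map.get(reg) != location:
                raise ValueError(f"{instr!r} needs {reg} in {location}")
        x86_code.extend(replacement)
    return x86_code
```

Because each rule is tied to a register map, choosing the rules and choosing the maps cannot be separated, which is the problem the register map selection slides address.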

Page 16: Binary Translation Using Peephole Superoptimizers

Peephole Binary Translation

source arch. (ppc)                            register map          destination arch. (x86)
mr r1, r2; mr r2, r1                          r1->eax; r2->ecx      mov %eax, %ecx
lis r1, 0x12; ori r1, r1, 0x3456              r1->Mr1               mov $0x123456, Mr1
ldl r2, (r1); addi r2, r2, 1; stl r2, (r1)    r1->eax; r2->ecx      inc (%eax)

Page 17: Binary Translation Using Peephole Superoptimizers

Register Map Selection

• The best code may require changing the register map from one code point to another

• The choice of register maps affects the choice of instruction selection and vice-versa

Page 18: Binary Translation Using Peephole Superoptimizers

Register Map Selection

Example

powerpc sequence:
  P0: li r1, 123
  P1: addi r2, r2, 1
  P2: subf r2, r1, r2
  P3: ori r1, r1, 31
  exit

x86 sequence: ?

At entry: r1->Mr1; r2->Mr2
At exit: r1->Mr1; r2->Mr2

Cost Model
  – Instruction costs: 10 if the instruction accesses memory, else 1
  – Switching costs: R->M or M->R: 10

Page 19: Binary Translation Using Peephole Superoptimizers

Register Map Selection

Greedy Strategy

Cost model: instruction cost is 10 if it accesses memory, else 1; switching cost (R->M or M->R) is 10.

point   powerpc          x86               register map          switching   instruction
entry                                      r1->Mr1; r2->Mr2
P0      li r1, 123       movl $123, Mr1    r1->Mr1               0           10
P1      addi r2,r2,1     incl Mr2          r2->Mr2               0           10
P2      subf r2,r1,r2    subl Mr1, eax     r1->Mr1; r2->eax      10          10
P3      ori r1,r1,31     orl $31, Mr1      r1->Mr1               0           10
exit                                       r1->Mr1; r2->Mr2      10

Total instruction cost: 40
Total switching cost: 20
Grand Total: 60

Page 20: Binary Translation Using Peephole Superoptimizers

Register Map Selection: Optimal Solution

Cost model: instruction cost is 10 if it accesses memory, else 1; switching cost (R->M or M->R) is 10.

point   powerpc          x86               register map          switching   instruction
entry                                      r1->Mr1; r2->Mr2
P0      li r1, 123       movl $123, eax    r1->eax               10          1
P1      addi r2,r2,1     incl ecx          r2->ecx               10          1
P2      subf r2,r1,r2    subl eax, ecx     r1->eax; r2->ecx      0           1
P3      ori r1,r1,31     orl $31, eax      r1->eax               0           1
exit                                       r1->Mr1; r2->Mr2      20

Total instruction cost: 4
Total switching cost: 40
Grand Total: 44

Page 21: Binary Translation Using Peephole Superoptimizers

Register Map Selection

• Use Dynamic Programming
  – near-optimal solution
  – account for translations spanning multiple instructions
  – simultaneously perform instruction-selection and register-mapping
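The greedy and optimal totals from the running example (60 and 44) can be reproduced with a small dynamic program. The sketch below is a simplified model, not the algorithm from the talk: it only tracks whether each PowerPC register lives in memory (M) or in some x86 register (R), it considers only the candidate translations shown on the two preceding slides, and all names are hypothetical.

```python
from itertools import product

REGS = ("r1", "r2")
SWITCH_COST = 10  # moving one PowerPC register between M and R
STATES = list(product("MR", repeat=len(REGS)))  # locations of (r1, r2)

# Per PowerPC instruction: (x86 text, required locations, instruction cost).
CANDIDATES = [
    [("movl $123, Mr1", {"r1": "M"}, 10), ("movl $123, eax", {"r1": "R"}, 1)],
    [("incl Mr2", {"r2": "M"}, 10), ("incl ecx", {"r2": "R"}, 1)],
    [("subl Mr1, eax", {"r1": "M", "r2": "R"}, 10),
     ("subl eax, ecx", {"r1": "R", "r2": "R"}, 1)],
    [("orl $31, Mr1", {"r1": "M"}, 10), ("orl $31, eax", {"r1": "R"}, 1)],
]


def switch_cost(a, b):
    """Cost of changing the register map from state a to state b."""
    return SWITCH_COST * sum(x != y for x, y in zip(a, b))


def allowed(state, required):
    return all(state[REGS.index(reg)] == loc for reg, loc in required.items())


def greedy_cost(entry, exit_map):
    """Pick the locally cheapest translation at every instruction."""
    state, total = entry, 0
    for rules in CANDIDATES:
        options = []
        for _, required, cost in rules:
            # Registers the rule does not mention stay where they are.
            target = tuple(required.get(r, loc) for r, loc in zip(REGS, state))
            options.append((switch_cost(state, target) + cost, target))
        step, state = min(options)
        total += step
    return total + switch_cost(state, exit_map)


def optimal_cost(entry, exit_map):
    """Dynamic program over (instruction index, register map) pairs."""
    best = {entry: 0}  # cheapest cost to reach each map so far
    for rules in CANDIDATES:
        step = {}
        for state in STATES:
            costs = [c for _, req, c in rules if allowed(state, req)]
            if costs:
                arrive = min(c + switch_cost(p, state) for p, c in best.items())
                step[state] = arrive + min(costs)
        best = step
    return min(c + switch_cost(s, exit_map) for s, c in best.items())
```

With `("M", "M")` as both entry and exit map, `greedy_cost` returns 60 and `optimal_cost` returns 44, matching the two slides. The table has one entry per instruction and register map, so the work is linear in the number of instructions but grows with the number of maps considered.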

Page 22: Binary Translation Using Peephole Superoptimizers

Talk Outline

• Superoptimization
• Peephole Superoptimizers
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion

Page 23: Binary Translation Using Peephole Superoptimizers

PowerPC -> x86 Translator Implementation

• Superoptimizer
  – Use a PPC emulator (Qemu) for execution test
  – Use a SAT solver (zChaff) for boolean test
• Static user-level translator
  – ELF 32-bit ppc/Linux binary -> ELF 32-bit x86/Linux binary
  – Translate most (but not all) system calls

Page 24: Binary Translation Using Peephole Superoptimizers

Implementation

• Many Issues
  – Condition Codes, Endianness, System Calls, Stack and Heap, Indirect Jumps, Function Calls and Returns, Register Name Constraints, Untranslated Opcodes, Compiler Optimizations
• Endianness: ppc big-endian; x86 little-endian
  – Convert all memory writes to big-endian (source)
  – Convert all memory reads to little-endian (dest)
• Compiler Optimizations
  – Problem: PowerPC optimizer staggers data-dependent instructions to reduce pipeline stalls
  – Solution: Cluster data-dependent instructions in basic block before translation
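The endianness rule on this slide amounts to a byte swap around every memory access, so that memory always holds data in the PowerPC (big-endian) layout while registers hold native x86 values. The sketch below models this on a Python `bytearray`; the function names are hypothetical, and on x86 the swap itself could be done with an instruction such as `bswap`.

```python
def bswap32(value):
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


def store_word(memory, address, value):
    """Translated store: swap to big-endian, then do a native x86 write."""
    memory[address:address + 4] = bswap32(value).to_bytes(4, "little")


def load_word(memory, address):
    """Translated load: native x86 read, then swap back to native order."""
    return bswap32(int.from_bytes(memory[address:address + 4], "little"))
```

Storing `0x12345678` this way leaves the bytes `12 34 56 78` in memory, which is what the original PowerPC program would expect to find if it later read them one byte at a time.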

Page 25: Binary Translation Using Peephole Superoptimizers

Experimental Results

• Setup
  – Pentium4 3.0 GHz, 1MB Cache, 4GB Memory
  – gcc 4.0.1, glibc 2.3.6
  – Use soft-float library
  – Statically-linked input executables
• Benchmarks
  – Microbenchmarks, SPEC CINT2000
• Metrics
  – Compare against natively-compiled code
  – Compare against other binary translators
    • Qemu, Apple’s Rosetta

Page 26: Binary Translation Using Peephole Superoptimizers

Experimental Setup

• For our experiments
  – there are around 750 translation rules in the peephole table
  – the translation table is computed offline and it can take up to a week to compute the peephole rules

Page 27: Binary Translation Using Peephole Superoptimizers

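Experimental Results: Setup

[Figure: the C source is compiled twice. "gcc <options> -arch=ppc" produces a PowerPC executable, which Peephole Binary Translation turns into an x86 executable. "gcc <options> -arch=x86" produces a native x86 executable. The two x86 executables are compared.]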

Page 28: Binary Translation Using Peephole Superoptimizers

Microbenchmarks

emptyloop A bounded for-loop doing nothing

fibo Compute first few Fibonacci numbers

quicksort Quicksort on 64-bit integers

mergesort Mergesort on 64-bit integers

bubblesort Bubblesort on 64-bit integers

hanoi1 Towers of Hanoi Algorithm 1

hanoi2 Towers of Hanoi Algorithm 2

hanoi3 Towers of Hanoi Algorithm 3

traverse Traverse a linked list

binsearch Binary search on a sorted array

Page 29: Binary Translation Using Peephole Superoptimizers

Microbenchmarks

[Figure: bar chart of percentage of native (%), y-axis 0 to 100, for emptyloop, fibo, quicksort, mergesort, bubsort, hanoi1, hanoi2, hanoi3, traverse and binsearch. Series: O0, O2, O2 -omit-frame-pointer. Data labels in extraction order: 99, 11, 9, 81, 83, 75, 85, 107, 81, 69, 65, 319, 93, 92, 71, 70, 140, 90, 68, 61, 127, 128, 90, 84, 65, 62, 144, 80, 67, 62, 129.]

avg: 90% of native

Page 30: Binary Translation Using Peephole Superoptimizers

Experimental Results: Microbenchmarks

• We sometimes outperform native performance on these small benchmarks!
  – gcc generates better code for powerpc primarily because it has the luxury of many registers
  – Our register-mapping algorithm performs an efficient “re-allocation” of the PowerPC registers to x86 registers.

Page 31: Binary Translation Using Peephole Superoptimizers

Experimental Results: SPEC CINT2000

[Figure: bar chart of percentage of native (%), y-axis 0 to 100, for bzip2, gap, gzip, mcf, parser, twolf and vortex. Series: O0, O2. Data labels in extraction order: 66, 53, 66, 87, 59, 167, 42, 43, 57, 95, 67, 153, 74.]

Page 32: Binary Translation Using Peephole Superoptimizers

Comparisons with Qemu and Rosetta

• Qemu
  – Use same PowerPC and x86 executables as used for our own translator
• Rosetta
  – Runs on Mac OS X and hence supports only Mac executables
  – Recompiled the benchmarks on Mac using the same compiler version (gcc 4.0.1)
  – Mac Hardware: Intel Core 2 Duo 1.83GHz processor, 32KB L1-cache, 2MB L2-cache and 2GB memory

Page 33: Binary Translation Using Peephole Superoptimizers

Comparisons with Qemu and Rosetta

[Figure: two bar charts, -O0 and -O2, y-axis 0 to 100, for bzip2, gap, gzip, mcf, parser, twolf and vortex. Series: qemu, rosetta, peep. Annotations: avg: 3% faster than rosetta; avg: 12% faster than rosetta. Data labels in extraction order, first chart: 18, 12, 15, 48, 16, 55, 11, 65, 59, 85, 54, 43, 66, 53, 66, 87, 59, 167, 42. Second chart: 25, 13, 22, 64, 21, 58, 54, 53, 82, 49, 74, 43, 57, 95, 67, 153.]

Page 34: Binary Translation Using Peephole Superoptimizers

Translation Time

• Takes 2-6 minutes to translate a 650KB executable (around 100K instructions)
  – majority of time spent in optimal register map computation
• It is possible to reduce this to <10 seconds
  – For 98K instructions (<0.01% of time), use any register map. Fast (<1 second)
  – For other 2K, use optimal computation

Page 35: Binary Translation Using Peephole Superoptimizers

Conclusions and Future Work

• A scheme to perform efficient binary translation using a superoptimizer
  – Competitive performance
  – Simplified Design
• Other applications
  – Just-in-time compilation
  – Machine virtualization

Page 36: Binary Translation Using Peephole Superoptimizers

Q&A Thank you.

Page 37: Binary Translation Using Peephole Superoptimizers

Backup Slides