pypy's approach to construct domain-specific language runtime

Post on 16-Mar-2018

Tag: virtual machine, compiler, performance

PyPy’s Approach to Construct Domain-specific Language Runtime

Speed

7.4 times faster than CPython (http://speed.pypy.org)

antocuni (PyCon Otto) PyPy Status Update April 07 2017 4 / 19

Why is Python slow?

Interpretation overhead

Boxed arithmetic and automatic overflow handling

Dynamic dispatch of operations

Dynamic lookup of methods and attributes

Everything can change on runtime

Extreme introspective and reflective capabilities

Francisco Fernandez Castano (@fcofdezc) PyPy November 8, 2014 8 / 51

Why is Python slow? Boxed arithmetic and automatic overflow handling

i = 0
while i < 10000000:
    i = i + 1

Why is Python slow? Dynamic dispatch of operations

# while i < 10000000
 9 LOAD_FAST                0 (i)
12 LOAD_CONST               2 (10000000)
15 COMPARE_OP               0 (<)
18 POP_JUMP_IF_FALSE       34

# i = i + 1
21 LOAD_FAST                0 (i)
24 LOAD_CONST               3 (1)
27 BINARY_ADD
28 STORE_FAST               0 (i)
31 JUMP_ABSOLUTE            9

Why is Python slow? Dynamic lookup of methods and attributes

class MyExample(object):
    pass

def foo(target, flag):
    if flag:
        target.x = 42

obj = MyExample()
foo(obj, True)

print obj.x #=> 42
print getattr(obj, "x") #=> 42

Why is Python slow? Everything can change on runtime

def fn():
    return 42

def hello():
    return 'Hi! PyConEs!'

def change_the_world():
    global fn
    fn = hello

print fn() #=> 42
change_the_world()
print fn() #=> 'Hi! PyConEs!'

Why is Python slow? Everything can change on runtime

class Dog(object):
    def __init__(self):
        self.name = 'Jandemor'

    def talk(self):
        print "%s: guau!" % self.name

class Cat(object):
    def __init__(self):
        self.name = 'CatInstance'

    def talk(self):
        print "%s: miau!" % self.name

my_pet = Dog()
my_pet.talk() #=> 'Jandemor: guau!'
my_pet.__class__ = Cat
my_pet.talk() #=> 'Jandemor: miau!'

Why is Python slow? Extreme introspective and reflective capabilities

import sys

def fill_list(name):
    frame = sys._getframe().f_back
    lst = frame.f_locals[name]
    lst.append(42)

def foo():
    things = []
    fill_list('things')
    print things #=> [42]

PyPy Translation Toolchain

• Capable of compiling (R)Python

• Garbage collection

• Tracing just-in-time compiler generator

• Software transactional memory?

PyPy Architecture

PyPy based interpreters

• Topaz (Ruby)

• HippyVM (PHP)

• Pyrolog (Prolog)

• pycket (Racket)

• Various other interpreters (Scheme, JavaScript, io, Gameboy)

Compiler / Interpreter

Source: Compiler Construction, Prof. O. Nierstrasz

Traditional 2 pass compiler

• intermediate representation (IR)
• front end maps legal code into IR
• back end maps IR onto target machine
• simplify retargeting
• allows multiple front ends
• multiple passes → better code

Traditional 3 pass compiler

• analyzes and changes IR
• goal is to reduce runtime
• must preserve values

Optimizer: middle end

• constant propagation and folding
• code motion
• reduction of operator strength
• common sub-expression elimination
• redundant store elimination
• dead code elimination

Modern optimizers are usually built as a set of passes
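Two of the middle-end passes named above, constant propagation/folding and dead code elimination, can be sketched on a toy IR. This is only an illustration: the tuple layout (dest, op, arg1, arg2), the op names "const", "add" and "mul", and both function names are assumptions made for this sketch, not part of any real compiler. It handles straight-line code where every variable is assigned once.

```python
# Toy IR (hypothetical): each instruction is (dest, op, arg1, arg2).
# Variables are strings, constants are ints; only "const", "add", "mul" exist.

def fold_constants(instrs):
    """Constant propagation and folding in a single forward pass."""
    known = {}  # variable name -> constant value discovered so far
    out = []
    for dest, op, a, b in instrs:
        # Propagate: replace variables whose value is already known.
        a = known.get(a, a) if isinstance(a, str) else a
        b = known.get(b, b) if isinstance(b, str) else b
        if op == "const":
            known[dest] = a
        elif isinstance(a, int) and isinstance(b, int):
            # Fold: both operands are constants, compute at compile time.
            value = a + b if op == "add" else a * b
            known[dest] = value
            op, a, b = "const", value, None
        out.append((dest, op, a, b))
    return out


def eliminate_dead_code(instrs, live_out):
    """Drop instructions whose result is never needed (backward pass)."""
    live = set(live_out)
    kept = []
    for dest, op, a, b in reversed(instrs):
        if dest not in live:
            continue  # nobody reads this value: dead code
        live.discard(dest)
        live.update(arg for arg in (a, b) if isinstance(arg, str))
        kept.append((dest, op, a, b))
    kept.reverse()
    return kept


program = [
    ("x", "const", 3, None),
    ("y", "const", 4, None),
    ("t", "mul", "x", "y"),     # foldable: 3 * 4
    ("u", "add", "n", "t"),     # n is an input, so this must stay
    ("dead", "add", "n", "n"),  # result never used
]
optimized = eliminate_dead_code(fold_constants(program), live_out=["u"])
```

Running the passes one after the other is what makes them effective here: folding turns `t` into a constant, which in turn lets dead code elimination drop `x`, `y` and `t`.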

Optimization Challenges

• Preserve language semantics
    • Reflection, Introspection, Eval
    • External APIs
• Interpreter consists of short sequences of code
    • Prevent global optimizations
    • Typically implemented as a stack machine
• Dynamic, imprecise type information
    • Variables can change type
    • Duck Typing: method works with any object that provides accessed interfaces
    • Monkey Patching: add members to “class” after initialization
• Memory management and concurrency
• Function calls through packing of operands in fat object

PyPy Functional Architecture

RPython

• Python subset

• Statically typed

• Garbage collected

• Standard library almost entirely unavailable

• Some missing builtins (print, open(), …)

• rpython.rlib

• exceptions are (sometimes) ignored

• Not really a language, rather a "state"

PyPy Interpreter

def f(x):
    return x + 1

>>> dis.dis(f)
  2           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (1)
              6 BINARY_ADD
              7 RETURN_VALUE

• written in RPython
• Stack-based bytecode interpreter (like JVM)
• bytecode compiler → generates bytecode
• bytecode evaluator → interprets bytecode
• object space → handles operations on objects
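To make the "stack-based bytecode interpreter" point concrete, here is a minimal sketch of an evaluator for the four opcodes in the dis output above. It is a toy, not PyPy's code: the function name `run`, the (opname, arg) tuple encoding and the direct use of Python's `+` are assumptions for illustration. In PyPy the addition would be delegated to the object space rather than performed inline.

```python
def run(bytecode, consts, local_vars):
    """Evaluate a list of (opname, arg) pairs on an operand stack."""
    stack = []
    pc = 0
    while True:
        opname, arg = bytecode[pc]
        pc += 1
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])      # push a local variable
        elif opname == "LOAD_CONST":
            stack.append(consts[arg])          # push a constant
        elif opname == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)         # stand-in for the object space
        elif opname == "RETURN_VALUE":
            return stack.pop()
        else:
            raise NotImplementedError(opname)


# Hand-written equivalent of the bytecode shown for f(x) = x + 1.
code_for_f = [
    ("LOAD_FAST", 0),
    ("LOAD_CONST", 1),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
result = run(code_for_f, consts=(None, 1), local_vars=[41])
```

Every opcode goes through the same if/elif chain on every execution, which is the interpretation overhead the earlier slides refer to.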

PyPy Bytecode Interpreter

CFG (Control Flow Graph)

• Consists of Blocks and Links
• Starting from entry_point
• “Static Single Information” form

def f(n):
    return 3 * n + 2

Block(v1): # input argument
    v2 = mul(Constant(3), v1)
    v3 = add(v2, Constant(2))

CFG: Static Single Information

def test(a):
    if a > 0:
        if a > 5:
            return 10
        return 4
    if a < -10:
        return 3
    return 10

• SSI: “PHIs” for all used variables
• Blocks as “functions without branches”

PyPy Advantages

• High Level Language Implementation
    • to implement new features: lazily computed objects and functions, pluggable garbage collection, runtime replacement of live objects, stackless concurrency
• JIT Generation
• Object space
• Stackless
    • infinite recursion
    • Microthreads: Coroutines, Tasklets and Channels, Greenlets

PERCEPTION

http://abstrusegoose.com/secretarchives/under-the-hood - CC BY-NC 3.0 US

Assumptions

Pareto Principle (80-20 rule)
    • the 20% of the program accounts for the 80% of the runtime
    • hot-spots

Fast Path principle
    • optimize only what is necessary
    • fall back for uncommon cases

Most of runtime spent in loops

Always the same code paths (likely)

antocuni (Intel@Bucharest) PyPy Intro April 4 2016 9 / 32

Tracing JIT phases

Phases: Interpretation → Tracing → Compilation → Running

Transitions:
• hot loop detected (enters Tracing)
• cold guard failed
• entering compiled loop
• guard failure → hot
• hot guard failed
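The first half of the phase diagram (interpret until a loop gets hot, then trace, compile and run) can be sketched as a small counter-driven state machine. This is a simplification under stated assumptions: the threshold value, the function name `phases_for` and the idea that tracing and compilation each take exactly one loop iteration are made up for the sketch, and guard failures are left out entirely.

```python
HOT_THRESHOLD = 3  # hypothetical value, chosen small so the example is readable


def phases_for(iterations, threshold=HOT_THRESHOLD):
    """Return the phase the loop is in after each iteration."""
    phase = "interpretation"
    counter = 0
    history = []
    for _ in range(iterations):
        if phase == "interpretation":
            counter += 1
            if counter >= threshold:
                phase = "tracing"       # hot loop detected
        elif phase == "tracing":
            phase = "compilation"       # one iteration has been recorded
        elif phase == "compilation":
            phase = "running"           # machine code is now in use
        # "running" stays "running" as long as no guard fails
        history.append(phase)
    return history


history = phases_for(6, threshold=3)
```

The counter is the only cost paid by cold code, which fits the Pareto assumption: loops that never become hot are never traced or compiled.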

Trace trees (1)

tracetree.py

def foo():
    a = 0
    i = 0
    N = 100
    while i < N:
        if i%2 == 0:
            a += 1
        else:
            a *= 2;
        i += 1
    return a

Trace trees (2)

label(start, i0, a0)
v0 = int_lt(i0, 100)
guard_true(v0)
v1 = int_mod(i0, 2)
v2 = int_eq(v1, 0)
guard_true(v2)
a1 = int_add(a0, 1)
i1 = int_add(i0, 1)
jump(start, i1, a1)

COLD FAIL → BLACKHOLE → INTERPRETER

HOT FAIL:

a1 = int_mul(a0, 2)
i1 = int_add(i0, 1)
jump(start, i1, a1)
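The cold-fail and hot-fail behaviour can be imitated in plain Python. In this sketch the recorded trace is a function that raises when a guard fails. The first few failures fall back to an "interpreter" step, and once the guard has failed often enough a second trace is attached for the odd case. All names here (`GuardFailed`, `main_trace`, `bridge`, `hot_threshold`) are hypothetical, and the blackhole interpreter is reduced to executing the else branch by hand.

```python
def foo_interpreted(n=100):
    """Reference result: the loop from tracetree.py, run normally."""
    a = 0
    i = 0
    while i < n:
        if i % 2 == 0:
            a += 1
        else:
            a *= 2
        i += 1
    return a


class GuardFailed(Exception):
    def __init__(self, name):
        Exception.__init__(self, name)
        self.name = name


def main_trace(i, a, n):
    """One pass through the recorded trace (the even-i path)."""
    if not i < n:
        raise GuardFailed("loop_exit")
    if not i % 2 == 0:
        raise GuardFailed("even")
    return i + 1, a + 1


def bridge(i, a):
    """Trace compiled for the failing guard once it became hot."""
    return i + 1, a * 2


def foo_traced(n=100, hot_threshold=2):
    i, a = 0, 0
    failures = 0
    fallbacks = 0           # how often we went back to the interpreter
    bridge_attached = False
    while True:
        try:
            i, a = main_trace(i, a, n)
        except GuardFailed as failure:
            if failure.name == "loop_exit":
                return a, fallbacks
            if bridge_attached:
                i, a = bridge(i, a)          # hot fail: stay in compiled code
            else:
                fallbacks += 1               # cold fail: interpreter runs the
                a *= 2                       # else branch for this iteration
                i += 1
                failures += 1
                if failures >= hot_threshold:
                    bridge_attached = True


traced_result, interpreter_fallbacks = foo_traced(100, hot_threshold=2)
```

With the threshold set to 2, only the first two odd iterations leave compiled code; every later one is served by the attached second trace.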

Part 3

The PyPy JIT

Terminology (1)

translation time: when you run "rpython targetpypy.py" to get the pypy binary
runtime: everything which happens after you start pypy
interpretation, tracing, compiling
assembler/machine code: the output of the JIT compiler
execution time: when your Python program is being executed
    • by the interpreter
    • by the machine code

Terminology (2)

interp-level: things written in RPython
[PyPy] interpreter: the RPython program which executes the final Python programs
bytecode: "the output of dis.dis". It is executed by the PyPy interpreter.
app-level: things written in Python, and executed by the PyPy Interpreter

Terminology (3)

(the following is not 100% accurate but it’s enough to understand the general principle)

low level op or ResOperation
    • low-level instructions like "add two integers", "read a field out of a struct", "call this function"
    • (more or less) the same level of C ("portable assembler")
    • knows about GC objects (e.g. you have getfield_gc vs getfield_raw)

jitcodes: low-level representation of RPython functions
    • sequence of low level ops
    • generated at translation time
    • 1 RPython function --> 1 C function --> 1 jitcode

Terminology (4)

JIT traces or loops
    • a very specific sequence of llops as actually executed by your Python program
    • generated at runtime (more specifically, during tracing)

JIT optimizer: takes JIT traces and emits JIT traces
JIT backend: takes JIT traces and emits machine code

General architecture

compile-time: RPYTHON → CODEWRITER → JITCODE
runtime: META-TRACER → OPTIMIZER → BACKEND → ASSEMBLER

RPYTHON

def LOAD_GLOBAL(self): ...
def STORE_FAST(self): ...
def BINARY_ADD(self): ...

JITCODE

...
p0 = getfield_gc(p0, 'func_globals')
p2 = getfield_gc(p1, 'strval')
call(dict_lookup, p0, p2)
....

...
p0 = getfield_gc(p0, 'locals_w')
setarrayitem_gc(p0, i0, p1)
....

...
promote_class(p0)
i0 = getfield_gc(p0, 'intval')
promote_class(p1)
i1 = getfield_gc(p1, 'intval')
i2 = int_add(i0, i1)
if (overflowed) goto ...
p2 = new_with_vtable('W_IntObject')
setfield_gc(p2, i2, 'intval')
....

PyPy trace example

def fn():
    c = a+b
    ...

LOAD_GLOBAL A
LOAD_GLOBAL B
BINARY_ADD
STORE_FAST C

...
p0 = getfield_gc(p0, 'func_globals')
p2 = getfield_gc(p1, 'strval')
call(dict_lookup, p0, p2)
...
...
p0 = getfield_gc(p0, 'func_globals')
p2 = getfield_gc(p1, 'strval')
call(dict_lookup, p0, p2)
...
...
guard_class(p0, W_IntObject)
i0 = getfield_gc(p0, 'intval')
guard_class(p1, W_IntObject)
i1 = getfield_gc(p1, 'intval')
i2 = int_add(i0, i1)
guard_not_overflow()
p2 = new_with_vtable('W_IntObject')
setfield_gc(p2, i2, 'intval')
...
...
p0 = getfield_gc(p0, 'locals_w')
setarrayitem_gc(p0, i0, p1)
....

PyPy optimizer

• intbounds
• constant folding / pure operations
• virtuals
• string optimizations
• heap (multiple get/setfield, etc)
• unroll

Intbound optimization (1)

intbound.py

def fn():
    i = 0
    while i < 5000:
        i += 2
    return i

Intbound optimization (2)

unoptimized

...
i17 = int_lt(i15, 5000)
guard_true(i17)
i19 = int_add_ovf(i15, 2)
guard_no_overflow()
...

optimized

...
i17 = int_lt(i15, 5000)
guard_true(i17)
i19 = int_add(i15, 2)
...

• It works often
• array bound checking
• intbound info propagates all over the trace
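The rewrite from int_add_ovf to int_add can be reproduced with a small pass over a trace. The sketch below is not PyPy's optimizer: the (result, opname, args) tuple encoding and the function name `optimize_intbounds` are assumptions, and it only tracks upper bounds learned from an int_lt followed by guard_true, which is enough for the example on the slide.

```python
import sys

MAXINT = sys.maxsize  # stand-in for the machine word limit


def optimize_intbounds(trace):
    """Remove overflow checks that a known upper bound makes impossible."""
    upper = {}     # variable -> inclusive upper bound proven by a guard
    pending = {}   # result of int_lt -> (variable, constant) being compared
    out = []
    drop_next_overflow_guard = False
    for result, opname, args in trace:
        if opname == "int_lt" and isinstance(args[1], int):
            pending[result] = args
        elif opname == "guard_true" and args[0] in pending:
            var, const = pending[args[0]]
            upper[var] = const - 1           # past the guard, var < const holds
        elif opname == "int_add_ovf":
            var, const = args
            if (var in upper and isinstance(const, int)
                    and const >= 0 and upper[var] + const <= MAXINT):
                opname = "int_add"           # cannot overflow: plain add
                drop_next_overflow_guard = True
        elif opname == "guard_no_overflow" and drop_next_overflow_guard:
            drop_next_overflow_guard = False
            continue                         # the guard is now redundant
        out.append((result, opname, args))
    return out


unoptimized = [
    ("i17", "int_lt", ("i15", 5000)),
    (None, "guard_true", ("i17",)),
    ("i19", "int_add_ovf", ("i15", 2)),
    (None, "guard_no_overflow", ()),
]
optimized_trace = optimize_intbounds(unoptimized)
```

The same bound information could also answer an array bounds check, which is presumably what the "array bound checking" bullet refers to.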

Virtuals (1)

virtuals.py

def fn():
    i = 0
    while i < 5000:
        i += 2
    return i

Virtuals (2)

unoptimized

...
guard_class(p0, W_IntObject)
i1 = getfield_pure(p0, 'intval')
i2 = int_add(i1, 2)
p3 = new(W_IntObject)
setfield_gc(p3, i2, 'intval')
...

optimized

...
i2 = int_add(i1, 2)
...

• The most important optimization (TM)
• It works both inside the trace and across the loop
• It works for tons of cases
    • e.g. function frames
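The idea behind virtuals is that an object allocated inside the trace does not need to exist as long as nothing outside the trace can see it. A rough sketch of such a pass follows. The trace encoding, the op names and the function `remove_virtuals` are assumptions for illustration; the sketch handles only one trace and does not deal with virtuals stored inside other virtuals.

```python
def remove_virtuals(trace):
    """Delay allocations until an object escapes; drop them if it never does."""
    virtuals = {}  # pointer -> (class name, {field: value}) not yet allocated
    alias = {}     # variable -> value it is known to be equal to
    out = []

    def force(ptr):
        # The object escapes: emit the allocation we had been holding back.
        cls, fields = virtuals.pop(ptr)
        out.append((ptr, "new", (cls,)))
        for field, value in sorted(fields.items()):
            out.append((None, "setfield_gc", (ptr, value, field)))

    for result, opname, args in trace:
        args = tuple(alias.get(a, a) for a in args)
        if opname == "new":
            virtuals[result] = (args[0], {})
        elif opname == "setfield_gc" and args[0] in virtuals:
            ptr, value, field = args
            virtuals[ptr][1][field] = value       # remember, emit nothing
        elif opname == "getfield_gc" and args[0] in virtuals:
            ptr, field = args
            alias[result] = virtuals[ptr][1][field]  # read is served from memory
        else:
            for a in args:
                if a in virtuals:
                    force(a)
            out.append((result, opname, args))
    return out


# Two boxed "i += 2" steps back to back: the intermediate box never escapes.
boxed = [
    ("i1", "getfield_gc", ("p0", "intval")),
    ("i2", "int_add", ("i1", 2)),
    ("p3", "new", ("W_IntObject",)),
    (None, "setfield_gc", ("p3", "i2", "intval")),
    ("i4", "getfield_gc", ("p3", "intval")),
    ("i5", "int_add", ("i4", 2)),
]
unboxed = remove_virtuals(boxed)
```

If a later operation did pass `p3` to something the optimizer cannot see into, `force` would re-emit the allocation just before that point, so correctness does not depend on the object staying virtual.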

Constant folding (1)

constfold.py

def fn():
    i = 0
    while i < 5000:
        i += 2
    return i

Constant folding (2)

unoptimized

...
i1 = getfield_pure(p0, 'intval')
i2 = getfield_pure(<W_Int(2)>, 'intval')
i3 = int_add(i1, i2)
...

optimized

...
i1 = getfield_pure(p0, 'intval')
i3 = int_add(i1, 2)
...

• It "finishes the job"
• Works well together with other optimizations (e.g. virtuals)
• It also does "normal, boring, static" constant-folding
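Folding a pure read on a constant object can be sketched as follows. This is an illustration only: `W_Int` here is a stand-in class, variables are represented as strings and constants as real Python values, and `fold_pure_ops` is a hypothetical name. A pure operation whose inputs are all known at optimization time is evaluated immediately and disappears from the trace.

```python
class W_Int(object):
    """Stand-in for an interpreter-level integer box."""
    def __init__(self, intval):
        self.intval = intval


def fold_pure_ops(trace):
    known = {}  # variable -> value computed at optimization time
    out = []
    for result, opname, args in trace:
        # Substitute variables whose value is already known.
        args = tuple(known.get(a, a) if isinstance(a, str) else a for a in args)
        if opname == "getfield_pure" and isinstance(args[0], W_Int):
            # Reading an immutable field of a constant object: do it now.
            known[result] = getattr(args[0], args[1])
        elif opname == "int_add" and all(isinstance(a, int) for a in args):
            known[result] = args[0] + args[1]
        else:
            out.append((result, opname, args))
    return out


unfolded = [
    ("i1", "getfield_pure", ("p0", "intval")),
    ("i2", "getfield_pure", (W_Int(2), "intval")),
    ("i3", "int_add", ("i1", "i2")),
]
folded = fold_pure_ops(unfolded)
```

The read from `p0` stays because `p0` is only known at run time, while the read from the constant box is replaced by the literal 2.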

Out of line guards (1)

outoflineguards.py

N = 2
def fn():
    i = 0
    while i < 5000:
        i += N
    return i

Out of line guards (2)

unoptimized

...
quasiimmut_field(<Cell>, 'val')
guard_not_invalidated()
p0 = getfield_gc(<Cell>, 'val')
...
i2 = getfield_pure(p0, 'intval')
i3 = int_add(i1, i2)

optimized

...
guard_not_invalidated()
...
i3 = int_add(i1, 2)
...

• Python is too dynamic, but we don’t care :-)
• No overhead in assembler code
• Used a bit "everywhere"
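One way to picture an out-of-line guard: instead of the compiled loop checking the global on every iteration, the global's cell remembers which loops depend on it and invalidates them when it is written. The sketch below models that with plain objects. `Cell`, `CompiledLoop` and the `state` dictionary are hypothetical names, and "recompiling" is reduced to building a new object; the real mechanism works on machine code.

```python
class Cell(object):
    """Holds a global value that rarely changes (quasi-immutable)."""
    def __init__(self, val):
        self._val = val
        self._watchers = []

    def get(self):
        return self._val

    def watch(self, loop):
        self._watchers.append(loop)

    def set(self, val):
        self._val = val
        # The check lives here, on the slow write path, not in the loop.
        for loop in self._watchers:
            loop.valid = False
        del self._watchers[:]


class CompiledLoop(object):
    def __init__(self, cell):
        self.step = cell.get()   # the global is baked in as a constant
        self.valid = True
        cell.watch(self)

    def run(self, i, limit):
        while i < limit:
            i += self.step       # no lookup and no guard inside the loop
        return i


def fn(cell, state):
    loop = state.get("loop")
    if loop is None or not loop.valid:
        loop = state["loop"] = CompiledLoop(cell)   # "recompile"
    return loop.run(0, 5000)


N = Cell(2)
state = {}
first_result = fn(N, state)
first_loop = state["loop"]
N.set(3)                          # invalidates the loop compiled for N == 2
second_result = fn(N, state)
```

The trade-off is that writes to the global become more expensive, which is acceptable because, as the slides assume, such values almost never change.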

Hello RPython

# hello_rpython.py
import os

def entry_point(argv):
    os.write(2, "Hello, World!\n")
    return 0

def target(driver, argv):
    return entry_point, None

$ rpython hello_rpython.py
…
$ ./hello_rpython-c
Hello, World!

Goal

• BASIC interpreter capable of running Hamurabi

• Bytecode based

• Garbage Collection

• Just-In-Time Compilation

Live play session

Architecture

Source → Parser → AST → Compiler → Bytecode → Virtual Machine

10 PRINT TAB(32);"HAMURABI"
20 PRINT TAB(15);"CREATIVE COMPUTING MORRISTOWN, NEW JERSEY"
30 PRINT:PRINT:PRINT
80 PRINT "TRY YOUR HAND AT GOVERNING ANCIENT SUMERIA"
90 PRINT "FOR A TEN-YEAR TERM OF OFFICE.":PRINT
95 D1=0: P1=0
100 Z=0: P=95:S=2800: H=3000: E=H-S
110 Y=3: A=H/Y: I=5: Q=1
210 D=0
215 PRINT:PRINT:PRINT "HAMURABI: I BEG TO REPORT TO YOU,": Z=Z+1
217 PRINT "IN YEAR";Z;",";D;"PEOPLE STARVED,";I;"CAME TO THE CITY,"
218 P=P+I
227 IF Q>0 THEN 230
228 P=INT(P/2)
229 PRINT "A HORRIBLE PLAGUE STRUCK! HALF THE PEOPLE DIED."
230 PRINT "POPULATION IS NOW";P
232 PRINT "THE CITY NOW OWNS ";A;"ACRES."
235 PRINT "YOU HARVESTED";Y;"BUSHELS PER ACRE."
250 PRINT "THE RATS ATE";E;"BUSHELS."
260 PRINT "YOU NOW HAVE ";S;"BUSHELS IN STORE.": PRINT
270 REM *** MORE CODE THAT DID NOT FIT INTO THE SLIDE FOLLOWS

Parser

Source → Parser → Abstract Syntax Tree (AST)

Source → Lexer → Tokens → Parser → AST

RPLY

• Based on PLY, which is based on Lex and Yacc

• Lexer generator

• LALR parser generator

Lexer

from rply import LexerGenerator

lg = LexerGenerator()

lg.add("NUMBER", "[0-9]+")
# …
lg.ignore(" +") # whitespace

lexer = lg.build().lex

lg.add('NUMBER', r'[0-9]*\.[0-9]+')
lg.add('PRINT', r'PRINT')
lg.add('IF', r'IF')
lg.add('THEN', r'THEN')
lg.add('GOSUB', r'GOSUB')
lg.add('GOTO', r'GOTO')
lg.add('INPUT', r'INPUT')
lg.add('REM', r'REM')
lg.add('RETURN', r'RETURN')
lg.add('END', r'END')
lg.add('FOR', r'FOR')
lg.add('TO', r'TO')
lg.add('NEXT', r'NEXT')
lg.add('NAME', r'[A-Z][A-Z0-9$]*')
lg.add('(', r'\(')
lg.add(')', r'\)')
lg.add(';', r';')
lg.add('STRING', r'"[^"]*"')
lg.add(':', r'\r?\n')
lg.add(':', r':')
lg.add('=', r'=')
lg.add('<>', r'<>')
lg.add('-', r'-')
lg.add('/', r'/')
lg.add('+', r'\+')
lg.add('>=', r'>=')
lg.add('>', r'>')
lg.add('***', r'\*\*\*.*')
lg.add('*', r'\*')
lg.add('<=', r'<=')
lg.add('<', r'<')

>>> from basic.lexer import lex
>>> source = open("hello.bas").read()
>>> for token in lex(source):
...     print token
Token("NUMBER", "10")
Token("PRINT", "PRINT")
Token("STRING",'"HELLO BASIC!"')
Token(":", "\n")

Grammar

• A set of formal rules that defines the syntax

• terminals = tokens

• nonterminals = rules defining a sequence of one or more (non)terminals

program :
program : line
program : line program

line : NUMBER statements

statements : statement
statements : statement statements

statement : PRINT :
statement : PRINT expressions :
expressions : expression
expressions : expression ;
expressions : expression ; expressions

statement : NAME = expression :

statement : IF expression THEN number :

statement : INPUT name :

statement : GOTO NUMBER :
statement : GOSUB NUMBER :
statement : RETURN :

statement : REM *** :

statement : FOR NAME = NUMBER TO NUMBER :
statement : NEXT NAME :

statement : END :

expression : NUMBER
expression : NAME
expression : STRING
expression : operation
expression : ( expression )
expression : NAME ( expression )

operation : expression + expression
operation : expression - expression
operation : expression * expression
operation : expression / expression
operation : expression <= expression
operation : expression < expression
operation : expression = expression
operation : expression <> expression
operation : expression > expression
operation : expression >= expression

AST

from rply.token import BaseBox

class Program(BaseBox):
    def __init__(self, lines):
        self.lines = lines

class Line(BaseBox):
    def __init__(self, lineno, statements):
        self.lineno = lineno
        self.statements = statements

class Statements(BaseBox):
    def __init__(self, statements):
        self.statements = statements

class Print(BaseBox):
    def __init__(self, expressions, newline=True):
        self.expressions = expressions
        self.newline = newline

Parser

from rply import ParserGenerator

pg = ParserGenerator(["NUMBER", "PRINT", …])

@pg.production("program : ")
@pg.production("program : line")
@pg.production("program : line program")
def program(p):
    if len(p) == 2:
        return Program([p[0]] + p[1].get_lines())
    return Program(p)

@pg.production("line : number statements")
def line(p):
    return Line(p[0], p[1].get_statements())

@pg.production("op : expression + expression")
@pg.production("op : expression * expression")
def op(p):
    if p[1].gettokentype() == "+":
        return Add(p[0], p[2])
    elif p[1].gettokentype() == "*":
        return Mul(p[0], p[2])

pg = ParserGenerator([…], precedence=[
    ("left", ["+", "-"]),
    ("left", ["*", "/"])
])

parse = pg.build().parse

Compiler/Virtual Machine

AST → Compiler → Bytecode → Virtual Machine

class VM(object):
    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.frames = []
        self.iterators = {}
        self.stack = []
        self.variables = {}

class VM(object):
    …
    def execute(self):
        while self.pc < len(self.program.instructions):
            self.execute_bytecode(self.program.instructions[self.pc])

class VM(object):
    ...
    def execute_bytecode(self, code):
        if isinstance(code, TYPE):
            self.execute_TYPE(code)
        ...
        else:
            raise NotImplementedError(code)
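The dispatch loop above can be made runnable with a handful of instructions. The following is a cut-down sketch, not the talk's actual VM: it keeps only `Number`, `Add` and `Print`, collects output in a list instead of writing it, and assumes that every handler advances `pc` by one after executing (the slides do not show where `pc` is incremented).

```python
class Instruction(object):
    pass


class Number(Instruction):
    def __init__(self, value):
        self.value = value


class Add(Instruction):
    pass


class Print(Instruction):
    pass


class Program(object):
    def __init__(self, instructions):
        self.instructions = instructions


class VM(object):
    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.stack = []
        self.output = []  # assumption: collected here to keep the sketch testable

    def execute(self):
        while self.pc < len(self.program.instructions):
            self.execute_bytecode(self.program.instructions[self.pc])

    def execute_bytecode(self, code):
        if isinstance(code, Number):
            self.stack.append(code.value)
        elif isinstance(code, Add):
            right = self.stack.pop()
            left = self.stack.pop()
            self.stack.append(left + right)
        elif isinstance(code, Print):
            self.output.append(str(self.stack.pop()))
        else:
            raise NotImplementedError(code)
        self.pc += 1  # a jump instruction would set pc itself instead


# Bytecode a compiler might produce for: 10 PRINT 1+2
vm = VM(Program([Number(1), Number(2), Add(), Print()]))
vm.execute()
```

Keeping `pc` as an attribute of the VM rather than a local loop variable is what lets instructions such as GOTO change the control flow.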

class Program(object): def __init__(self): self.instructions = []

Bytecode

class Instruction(object): pass

class Number(Instruction): def __init__(self, value): self.value = value!class String(Instructions): def __init__(self, value): self.value = value

class Print(Instruction): def __init__(self, expressions, newline): self.expressions = expressions self.newline = newline

class Call(Instruction): def __init__(self, function_name): self.function_name = function_name

class Let(Instruction): def __init__(self, name): self.name = name

class Lookup(Instruction): def __init__(self, name): self.name = name

class Add(Instruction): pass!class Sub(Instruction): pass!class Mul(Instruction): pass!class Equal(Instruction): pass!...

class GotoIfTrue(Instruction): def __init__(self, target): self.target = target!class Goto(Instruction): def __init__(self, target, with_frame=False): self.target = target self.with_frame = with_frame!class Return(Instruction): pass

class Input(object): def __init__(self, name): self.name = name

class For(Instruction):
    def __init__(self, variable):
        self.variable = variable

class Next(Instruction):
    def __init__(self, variable):
        self.variable = variable

class Program(object):
    def __init__(self):
        self.instructions = []
        self.lineno2instruction = {}

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        if exc_type is None:
            for i, instruction in enumerate(self.instructions):
                instruction.finalize(self, i)

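# finalize() as it presumably appears on the jump instructions
# (Goto, GotoIfTrue): replace the BASIC line number with an instruction index
def finalize(self, program, index):
    self.target = program.lineno2instruction[self.target]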
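Resolving line numbers only when the `with` block exits is what lets a GOTO refer to a line that has not been compiled yet. The sketch below shows this in isolation; `Goto` and `Plain` are simplified stand-ins introduced for illustration, and the `Program` class repeats the one shown above so the example runs on its own.

```python
class Plain(object):
    """Hypothetical instruction with nothing to resolve."""

    def finalize(self, program, index):
        pass


class Goto(object):
    def __init__(self, target):
        self.target = target  # a BASIC line number until finalize() runs

    def finalize(self, program, index):
        # Swap the BASIC line number for an index into program.instructions.
        self.target = program.lineno2instruction[self.target]


class Program(object):
    def __init__(self):
        self.instructions = []
        self.lineno2instruction = {}

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        if exc_type is None:
            for i, instruction in enumerate(self.instructions):
                instruction.finalize(self, i)


with Program() as program:
    # Line 10: GOTO 30 (forward reference, line 30 is not known yet)
    program.lineno2instruction[10] = len(program.instructions)
    program.instructions.append(Goto(30))
    # Line 20: two ordinary instructions
    program.lineno2instruction[20] = len(program.instructions)
    program.instructions.append(Plain())
    program.instructions.append(Plain())
    # Line 30: GOTO 10 (backward reference)
    program.lineno2instruction[30] = len(program.instructions)
    program.instructions.append(Goto(10))

print(program.instructions[0].target)  # 3
print(program.instructions[3].target)  # 0
```

A jump to a line number that was never compiled raises `KeyError` inside `finalize()`, so a bad target is reported at compile time rather than in the middle of execution.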

class Program(BaseBox):
    ...
    def compile(self):
        with bytecode.Program() as program:
            for line in self.lines:
                line.compile(program)
        return program

class Line(BaseBox):
    ...
    def compile(self, program):
        program.lineno2instruction[self.lineno] = len(program.instructions)
        for statement in self.statements:
            statement.compile(program)

class Print(Statement):
    ...
    def compile(self, program):
        for expression in self.expressions:
            expression.compile(program)
        program.instructions.append(
            bytecode.Print(
                len(self.expressions),
                self.newline
            )
        )

class Let(Statement):
    ...
    def compile(self, program):
        self.value.compile(program)
        program.instructions.append(
            bytecode.Let(self.name)
        )

class Input(Statement):
    ...
    def compile(self, program):
        program.instructions.append(
            bytecode.Input(self.variable)
        )

class Goto(Statement):
    ...
    def compile(self, program):
        program.instructions.append(
            bytecode.Goto(self.target)
        )

class Gosub(Statement):
    ...
    def compile(self, program):
        program.instructions.append(
            bytecode.Goto(
                self.target,
                with_frame=True
            )
        )

class Return(Statement):
    ...
    def compile(self, program):
        program.instructions.append(
            bytecode.Return()
        )

class For(Statement):
    ...
    def compile(self, program):
        self.start.compile(program)
        program.instructions.append(
            bytecode.Let(self.variable)
        )
        self.end.compile(program)
        program.instructions.append(
            bytecode.For(self.variable)
        )

class WrappedObject(object): pass

class WrappedString(WrappedObject):
    def __init__(self, value):
        self.value = value

class WrappedFloat(WrappedObject):
    def __init__(self, value):
        self.value = value

class VM(object):
    ...
    def execute_number(self, code):
        self.stack.append(WrappedFloat(code.value))
        self.pc += 1

    def execute_string(self, code):
        self.stack.append(WrappedString(code.value))
        self.pc += 1

import random

class VM(object):
    ...
    def execute_call(self, code):
        argument = self.stack.pop()
        if code.function_name == "TAB":
            # unwrap the WrappedFloat before converting it to int
            self.stack.append(WrappedString(" " * int(argument.value)))
        elif code.function_name == "RND":
            self.stack.append(WrappedFloat(random.random()))
        ...
        self.pc += 1

class VM(object):
    ...
    def execute_let(self, code):
        value = self.stack.pop()
        self.variables[code.name] = value
        self.pc += 1

    def execute_lookup(self, code):
        value = self.variables[code.name]
        self.stack.append(value)
        self.pc += 1

class VM(object):
    ...
    def execute_add(self, code):
        right = self.stack.pop()
        left = self.stack.pop()
        # operands are WrappedFloat objects, so add their unwrapped values
        self.stack.append(WrappedFloat(left.value + right.value))
        self.pc += 1

class VM(object):
    ...
    def execute_goto_if_true(self, code):
        condition = self.stack.pop()
        if condition:
            self.pc = code.target
        else:
            self.pc += 1

class VM(object):
    ...
    def execute_goto(self, code):
        if code.with_frame:
            self.frames.append(self.pc + 1)
        self.pc = code.target

class VM(object):
    ...
    def execute_return(self, code):
        self.pc = self.frames.pop()
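GOSUB is compiled to a Goto with `with_frame=True`, so the only difference from a plain jump is that the return address `pc + 1` is pushed onto `frames`. The sketch below isolates that mechanism. The tuple encoding (`"gosub"`, `"return"`, `"emit"`, `"end"`) and the `run` function are hypothetical simplifications for illustration, not the bytecode classes from the talk.

```python
def run(program):
    """Run a list of tuple-encoded instructions and return the emitted text."""
    pc = 0
    frames = []   # return addresses, one per active GOSUB
    output = []
    while pc < len(program):
        op = program[pc]
        if op[0] == "gosub":
            frames.append(pc + 1)   # resume after the call site
            pc = op[1]
        elif op[0] == "return":
            pc = frames.pop()
        elif op[0] == "emit":
            output.append(op[1])
            pc += 1
        elif op[0] == "end":
            break
        else:
            raise NotImplementedError(op)
    return output


program = [
    ("emit", "before"),   # 0
    ("gosub", 4),         # 1: call the subroutine starting at index 4
    ("emit", "after"),    # 2
    ("end",),             # 3
    ("emit", "inside"),   # 4: subroutine body
    ("return",),          # 5
]

print(run(program))  # ['before', 'inside', 'after']
```

Since `frames` is a stack, nested GOSUB calls unwind in the right order without extra bookkeeping.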

class VM(object):
    ...
    def execute_input(self, code):
        value = WrappedFloat(float(raw_input() or "0.0"))
        self.variables[code.name] = value
        self.pc += 1

class VM(object):
    ...
    def execute_for(self, code):
        self.pc += 1
        # remember where the loop body starts and the loop's end value
        self.iterators[code.variable] = (
            self.pc,
            self.stack.pop()
        )

class VM(object):
    ...
    def execute_next(self, code):
        loop_begin, end = self.iterators[code.variable]
        current_value = self.variables[code.variable].value
        next_value = current_value + 1.0
        # end was popped from the stack by execute_for, so it is a WrappedFloat
        if next_value <= end.value:
            self.variables[code.variable] = WrappedFloat(next_value)
            self.pc = loop_begin
        else:
            del self.iterators[code.variable]
            self.pc += 1
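The interplay between For and Next is easier to follow on a three-slot program: index 0 is FOR, index 1 is the loop body, index 2 is NEXT. The sketch below is a simplified stand-in that uses raw floats instead of wrapped objects; `run_for_loop` and the fixed layout are assumptions made for illustration.

```python
def run_for_loop(start, end):
    """Simulate FOR I = start TO end with a one-instruction body.

    Returns the values of I seen by the loop body.
    """
    variables = {"I": start}   # the Let emitted before For stores the start value
    iterators = {}
    visited = []

    # FOR: step past the For instruction, remember body start and end value.
    pc = 0
    pc += 1
    iterators["I"] = (pc, end)

    while pc < 3:
        if pc == 1:
            # Loop body: record the current value of I.
            visited.append(variables["I"])
            pc += 1
        elif pc == 2:
            # NEXT: increment and either jump back or leave the loop.
            loop_begin, loop_end = iterators["I"]
            next_value = variables["I"] + 1.0
            if next_value <= loop_end:
                variables["I"] = next_value
                pc = loop_begin
            else:
                del iterators["I"]
                pc += 1
    return visited


print(run_for_loop(2.0, 4.0))  # [2.0, 3.0, 4.0]
print(run_for_loop(5.0, 1.0))  # [5.0]
```

The second call shows a consequence of testing the bound only in NEXT: the body runs once even when the start value is already past the end value.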

def entry_point(argv):
    try:
        filename = argv[1]
    except IndexError:
        print("You must supply a filename")
        return 1
    content = read_file(filename)
    tokens = lex(content)
    ast = parse(tokens)
    program = ast.compile()
    vm = VM(program)
    vm.execute()
    return 0

Entry Point

JIT (in PyPy)

1. Identify "hot" loops

2. Create trace inserting guards based on observed values

3. Optimize trace

4. Compile trace

5. Execute machine code instead of interpreter
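Step 1 can be illustrated with a counter on backward jumps: a loop is treated as hot once its entry point has been jumped back to often enough. This is a toy sketch in plain Python under that assumption. `HotLoopCounter` and its threshold are hypothetical, and it says nothing about how PyPy implements or tunes the detection internally.

```python
class HotLoopCounter(object):
    """Toy hot-loop detector: counts backward jumps per target pc."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = {}    # target pc -> number of backward jumps seen
        self.hot = set()    # target pcs already reported as hot

    def record_jump(self, source_pc, target_pc):
        """Return True exactly once, when target_pc first becomes hot."""
        if target_pc > source_pc:
            return False    # forward jump, cannot close a loop
        count = self.counts.get(target_pc, 0) + 1
        self.counts[target_pc] = count
        if count >= self.threshold and target_pc not in self.hot:
            self.hot.add(target_pc)
            return True     # a tracing JIT would start recording here
        return False


counter = HotLoopCounter(threshold=3)
events = [counter.record_jump(5, 2) for _ in range(5)]
print(events)  # [False, False, True, False, False]
```

In the BASIC VM above, the natural places to call such a counter would be the handlers that set `pc` to an earlier index, such as Goto and Next.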

from rpython.rlib.jit import JitDriver

jitdriver = JitDriver(
    greens=["pc", "vm", "program", "frames", "iterators"],
    reds=["stack", "variables"])

class VM(object):
    ...
    def execute(self):
        while self.pc < len(self.program.instructions):
            # JitDriver's hook is named jit_merge_point
            jitdriver.jit_merge_point(
                vm=self, pc=self.pc, ...
            )

Benchmark

10 N = 1
20 IF N <= 10000 THEN 40
30 END
40 GOSUB 100
50 IF R = 0 THEN 70
60 PRINT "PRIME"; N
70 N = N + 1: GOTO 20
100 REM *** ISPRIME N -> R
110 IF N <= 2 THEN 170
120 FOR I = 2 TO (N - 1)
130 A = N: B = I: GOSUB 200
140 IF R <> 0 THEN 160
150 R = 0: RETURN
160 NEXT I
170 R = 1: RETURN
200 REM *** MOD A -> B -> R
210 R = A - (B * INT(A / B))
220 RETURN
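For readers less familiar with BASIC, the benchmark is a naive trial-division prime search. The following is a rough Python transliteration, offered as a reading aid rather than as the code that was timed; the function names are made up for this sketch, and the comments map each part back to the BASIC line numbers.

```python
def mod(a, b):
    # Lines 200-220: R = A - (B * INT(A / B))
    return a - (b * int(a / b))


def is_prime(n):
    # Lines 100-170: returns 1 for "prime", 0 otherwise.
    if n <= 2:
        return 1            # line 110 jumps to 170, so 1 and 2 both count
    for i in range(2, n):   # FOR I = 2 TO (N - 1)
        if mod(n, i) == 0:
            return 0        # line 150
    return 1                # line 170


def primes_up_to(limit):
    # Lines 10-70: the main loop, collecting values instead of printing.
    found = []
    n = 1
    while n <= limit:
        if is_prime(n) != 0:
            found.append(n)
        n += 1
    return found


print(primes_up_to(20))  # [1, 2, 3, 5, 7, 11, 13, 17, 19]
```

The program reports 1 as prime because line 110 short-circuits every N up to 2. The cost is dominated by the inner FOR loop and the GOSUB per iteration, which is the kind of hot loop a tracing JIT targets.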

cbmbasic 58.22s

basic-c 5.06s

basic-c-jit 2.34s

Python implementation (CPython) 2.83s

Python implementation (PyPy) 0.11s

C implementation 0.03s

Project milestones

2008 Django support

2010 First JIT-compiler

2011 Compatibility with CPython 2.7

2014 Basic ARM support

CPython 3 support

Improve compatibility with C extensions

NumPyPy

Multi-threading support

PyPy STM

http://dabeaz.com/GIL/gilvis/

GIL locking

PyPy STM

10 loops, best of 3: 1.2 sec per loop

10 loops, best of 3: 822 msec per loop

from threading import Thread

def count(n):
    while n > 0:
        n -= 1

def run():
    t1 = Thread(target=count, args=(10000000,))
    t1.start()
    t2 = Thread(target=count, args=(10000000,))
    t2.start()
    t1.join(); t2.join()

def count(n):
    while n > 0:
        n -= 1

def run():
    count(10000000)
    count(10000000)
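The two variants can be compared side by side with the standard `timeit` module. The harness below is a sketch: `run_threaded`, `run_sequential` and `best_of` are names chosen here, and the iteration count is far smaller than the slide's 10000000 so that it finishes quickly. The absolute numbers depend on the machine and interpreter and are not meant to reproduce the figures quoted above.

```python
import timeit
from threading import Thread


def count(n):
    while n > 0:
        n -= 1


def run_threaded(n):
    # Two threads, each counting down from n.
    threads = [Thread(target=count, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def run_sequential(n):
    # The same total work, one call after the other.
    count(n)
    count(n)


def best_of(func, n, repeat=3):
    """Best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeat))


n = 200000  # reduced from the slide's 10000000 to keep the demo short
print("threaded:   %.4f s" % best_of(run_threaded, n))
print("sequential: %.4f s" % best_of(run_sequential, n))
```

On an interpreter with a GIL, the threaded variant is not expected to beat the sequential one for this CPU-bound loop, since only one thread executes bytecode at a time.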

Inside the Python GIL - David Beazley

PyPy in the real world (1)

High frequency trading platform for sports bets
    low latency is a must

PyPy used in production since 2012

~100 PyPy processes running 24/7

up to 10x speedups
    after careful tuning and optimizing for PyPy

PyPy in the real world (2)

Real-time online advertising auctions
    tight latency requirement (<100ms)
    high throughput (hundreds of thousands of requests per second)

30% speedup

We run PyPy basically everywhere

Julian Berman

PyPy in the real world (3)

IoT on the cloud

5-10x faster

We do not even run benchmarks on CPython because we just know that PyPy is way faster

Tobias Oberstein
