high performance computing with python (4 hour...
TRANSCRIPT
[email protected] - EuroPy 2011
High Performance Computing High Performance Computing with Python (4 hour tutorial)with Python (4 hour tutorial)
EuroPython 2011
[email protected] - EuroPy 2011
GoalGoal
ā¢ Get you writing faster code for CPU-bound problems using Python
ā¢ Your task is probably in pure Python, is CPU bound and can be parallelised (right?)
ā¢ We're not looking at network-bound problems
ā¢ Profiling + Tools == Speed
[email protected] - EuroPy 2011
Get the source please!Get the source please!
ā¢ http://tinyurl.com/europyhpcā¢ (original:
http://ianozsvald.com/wp-content/hpc_tutorial_code_europython2011_v1.zipā¢ )ā¢ google: āgithub ianozsvaldā, get HPC full
source (but you can do this after!)
[email protected] - EuroPy 2011
About me (Ian Ozsvald)About me (Ian Ozsvald)
ā¢ A.I. researcher in industry for 12 yearsā¢ C, C++, (some) Java, Python for 8 years
ā¢ Demo'd pyCUDA and Headroid last year
ā¢ Lecturer on A.I. at Sussex Uni (a bit)ā¢ ShowMeDo.com co-founderā¢ Python teacher, BrightonPy co-founder
ā¢ IanOzsvald.com - MorConsulting.com
[email protected] - EuroPy 2011
Overview (pre-requisites)Overview (pre-requisites)
ā¢ cProfile, line_profiler, runsnakeā¢ numpyā¢ Cython and ShedSkin
ā¢ multiprocessing
ā¢ ParallelPythonā¢ PyPyā¢ pyCUDA
[email protected] - EuroPy 2011
We won't be looking at...We won't be looking at...
ā¢ Algorithmic choices, clusters or cloudā¢ Gnumpy (numpy->GPU)
ā¢ Theano (numpy(ish)->CPU/GPU)
ā¢ CopperHead (numpy(ish)->GPU)ā¢ BottleNeck (Cython'd numpy)ā¢ Map/Reduce
ā¢ pyOpenCL
[email protected] - EuroPy 2011
Something to considerSomething to consider
ā¢ āProebsting's Lawāā¢ http://research.microsoft.com/en-us/um/people/toddpro/papers/law.htm
ā¢ Compiler advances (generally) unhelpful (sort-of ā consider auto vectorisation!)
ā¢ Multi-core commonā¢ Very-parallel (CUDA, OpenCL, MS AMP,
APUs) should be considered
[email protected] - EuroPy 2011
What can we expect?What can we expect?
ā¢ Close to C speeds (shootout):ā http://attractivechaos.github.com/plb/ā http://shootout.alioth.debian.org/u32/which-programming-languages-are-fastest.php
ā¢ Depends on how much work you put inā¢ nbody JavaScript much faster than Python
but we can catch it/beat it (and get close to C speed)
[email protected] - EuroPy 2011
Practical result - PANalyticalPractical result - PANalytical
Numpy+Py +More Numpy +'if' added! Numpy+Py+Cy All +Multiprocessing0
50
100
150
200
250234
167
126
9081
20
Sec
ond
s
[email protected] - EuroPy 2011
Mandelbrot results (Desktop i3)Mandelbrot results (Desktop i3)
Py 2.7PyPy 1.5
Numpy v.NumExpr
Cython np.ShedSkin
pyCUDA pypyCUDA C
ParallelPython
0
5
10
15
20
25
30
35
40
35
10
36
10
0.3 0.3
3.5
0.07
9
Sec
ond
s
[email protected] - EuroPy 2011
Our codeOur code
ā¢ pure_python.py ā¢ numpy_vector.py ā¢ pure_python.py 1000 1000 # RUNā¢ Our two building blocksā¢ Google āgithub ianozsvaldā ->
EuroPython2011_HighPerformanceComputing
ā¢ https://github.com/ianozsvald/EuroPython2011_HighPerformanceComputing
[email protected] - EuroPy 2011
Profiling bottlenecksProfiling bottlenecks
ā¢ python -m cProfile -o rep.prof pure_python.py 1000 1000
ā¢ import pstatsā¢ p = pstats.Stats('rep.prof')ā¢ p.sort_stats('cumulative').pri
nt_stats(10)
[email protected] - EuroPy 2011
cProfile outputcProfile output
51923594 function calls (51923523 primitive calls) in 74.301 seconds
ncalls tottime percall cumtime percall
pure_python.py:1(<module>)
1 0.034 0.034 74.303 74.303
pure_python.py:23(calc_pure_python)
1 0.273 0.273 74.268 74.268
pure_python.py:9(calculate_z_serial_purepython)
1 57.168 57.168 73.580 73.580
{abs}
51,414,419 12.465 0.000 12.465 0.000
...
[email protected] - EuroPy 2011
Let's profile python.pyLet's profile python.py
ā¢ python -m cProfile -o res.prof pure_python.py 1000 1000
ā¢ runsnake res.profā¢ Let's look at the result
[email protected] - EuroPy 2011
What's the problem?What's the problem?
ā¢ What's really slow?ā¢ Useful from a high level...ā¢ We want a line profiler!
[email protected] - EuroPy 2011
line_profiler.pyline_profiler.py
ā¢ kernprof.py -l -v pure_python_lineprofiler.py 1000 1000
ā¢ Warning...slow! We might want to use 300 100
[email protected] - EuroPy 2011
kernprof.py outputkernprof.py output
...% Time Line Contents
=====================
@profile
def calculate_z_serial_purepython(q, maxiter, z):
0.0 output = [0] * len(q)
1.1 for i in range(len(q)):
27.8 for iteration in range(maxiter):
35.8 z[i] = z[i]*z[i] + q[i]
31.9 if abs(z[i]) > 2.0:
[email protected] - EuroPy 2011
Dereferencing is slowDereferencing is slow
ā¢ Dereferencing involves lookups ā slowā¢ Our 'i' changes slowlyā¢ zi = z[i]; qi = q[i] # DO ITā¢ Change all z[i] and q[i] referencesā¢ Run kernprof againā¢ Is it cheaper?
[email protected] - EuroPy 2011
We have faster codeWe have faster code
ā¢ pure_python_2.py is faster, we'll use this as the basis for the next steps
ā¢ There are tricks:ā sets over lists if possibleā use dict[] rather than dict.get()ā build-in sort is fastā list comprehensionsā map rather than loops
[email protected] - EuroPy 2011
PyPy 1.5PyPy 1.5
ā¢ Confession ā I'm a newbieā¢ Probably cool tricks to learnā¢ pypy pure_python_2.py 1000 1000ā¢ PIL support, numpy isn'tā¢ My (bad) code needs numpy for display
(maybe you can fix that?)ā¢ pypy -m cProfile -o
runpypy.prof pure_python_2.py 1000 1000 # abs but no range
[email protected] - EuroPy 2011
CythonCython
ā¢ Manually add types, converts to Cā¢ .pyx files (built on Pyrex)ā¢ Win/Mac/Lin with gcc, msvc etcā¢ 10-100* speed-upā¢ numpy integrationā¢ http://cython.org/
[email protected] - EuroPy 2011
Cython on pure_python_2.pyCython on pure_python_2.py
ā¢ # ./cython_pure_pythonā¢ Make calculate_z.py, test it worksā¢ Turn calculate_z.py to .pyxā¢ Add setup.py (see Getting Started doc)ā¢ python setup.py build_ext
--inplaceā¢ cython -a calculate_z.pyx to get
profiling feedback (.html)
[email protected] - EuroPy 2011
Cython typesCython types
ā¢ Help Cython by adding annotations:ā list q zā int ā unsigned int # hint no negative
indices with for loop ā complex and complex double
ā¢ How much faster?
[email protected] - EuroPy 2011
Compiler directivesCompiler directives
ā¢ http://wiki.cython.org/enhancements/compilerdirectivesā¢ We can go faster (maybe):
ā #cython: boundscheck=Falseā #cython: wraparound=False
ā¢ Profiling:ā #cython: profile=True
ā¢ Check profiling worksā¢ Show _2_bettermath # FAST!
[email protected] - EuroPy 2011
ShedSkinShedSkin
ā¢ http://code.google.com/p/shedskin/ā¢ Auto-converts Python to C++ (auto type
inference)ā¢ Can only import modules that have been
implementedā¢ No numpy, PIL etc but great for writing
new fast modulesā¢ 3000 SLOC 'limit', always improving
[email protected] - EuroPy 2011
Easy to useEasy to use
ā¢ # ./shedskin/ā¢ shedskin shedskin1.pyā¢ makeā¢ ./shedskin1 1000 1000ā¢ shedskin shedskin2.py; makeā¢ ./shedskin2 1000 1000 # FAST!ā¢ No easy profiling, complex is slow (for
now)
[email protected] - EuroPy 2011
numpy vectorsnumpy vectors
ā¢ http://numpy.scipy.org/ā¢ Vectors not brilliantly suited to Mandelbrot
(but we'll ignore that...)ā¢ numpy is very-parallel for CPUsā¢ a = numpy.array([1,2,3,4])ā¢ a *= 3 ->
numpy.array([3,6,9,12])
[email protected] - EuroPy 2011
Vector outline...Vector outline...
# ./numpy_vector/numpy_vector.py
for iteration...
z = z*z + q
done = np.greater(abs(z), 2.0)
q = np.where(done,0+0j, q)
z = np.where(done,0+0j, z)
output = np.where(done,
iteration, output)
[email protected] - EuroPy 2011
Profiling some moreProfiling some more
ā¢ python numpy_vector.py 1000 1000
ā¢ kernprof.py -l -v numpy_vector.py 300 100
ā¢ How could we break out early?ā¢ How big is 250,000 complex numbers?ā¢ # .nbytes, .size
[email protected] - EuroPy 2011
Cache sizesCache sizes
ā¢ Modern CPUs have 2-6MB cachesā¢ Tuning is hard (and may not be
worthwhile)ā¢ Heuristic: Either keep it tiny (<64KB) or
worry about really big data sets (>20MB)ā¢ # numpy_vector_2.py
[email protected] - EuroPy 2011
Speed vs cache size (Core2/i3)Speed vs cache size (Core2/i3)
250k 90k 50k 45k 20k 10k 5k 1k 1000
20
40
60
80
100
120
140
160
180
200
54 5245 45 42 43 45
62
180
Sec
ond
s
[email protected] - EuroPy 2011
NumExprNumExpr
ā¢ http://code.google.com/p/numexpr/ā¢ This is magicā¢ With Intel MKL it goes even fasterā¢ # ./numpy_vector_numexpr/ā¢ python numpy_vector_numexpr.py
1000 1000ā¢ Now convert your numpy_vector.py
[email protected] - EuroPy 2011
numpy and iterationnumpy and iteration
ā¢ Normally there's no point using numpy if we aren't using vector operations
ā¢ python numpy_loop.py 1000 1000ā¢ Is it any faster?ā¢ Let's run kernprof.py on this and the
earlier pure_python_2.pyā¢ Any significant differences?
[email protected] - EuroPy 2011
Cython on numpy_loop.pyCython on numpy_loop.py
ā¢ Can low-level C give us a speed-up over vectorised C?
ā¢ # ./cython_numpy_loop/ā¢ http://docs.cython.org/src/tutorial/numpy.htmlā¢ Your task ā make .pyx, start without types,
make it work from numpy_loop.pyā¢ Add basic types, use cython -a
[email protected] - EuroPy 2011
multiprocessingmultiprocessing
ā¢ Using all our CPUs is cool, 4 are common, 8 will be common
ā¢ Global Interpreter Lock (isn't our enemy)ā¢ Silo'd processes are easiest to paralleliseā¢ http://docs.python.org/library/multiprocessing.html
[email protected] - EuroPy 2011
multiprocessing Poolmultiprocessing Pool
ā¢ # ./multiprocessing/multi.pyā¢ p = multiprocessing.Pool()ā¢ po = p.map_async(fn, args)ā¢ result = po.get() # for all po
objectsā¢ join the result items to make full result
[email protected] - EuroPy 2011
Making chunks of workMaking chunks of work
ā¢ Split the work into chunks (follow my code)ā¢ Splitting by number of CPUs is goodā¢ Submit the jobs with map_asyncā¢ Get the results back, join the lists
[email protected] - EuroPy 2011
Code outlineCode outline
ā¢ Copy my chunk code
output = []
for chunk in chunks:
out = calc...(chunk)
output += out
[email protected] - EuroPy 2011
ParallelPythonParallelPython
ā¢ Same principle as multiprocessing but allows >1 machine with >1 CPU
ā¢ http://www.parallelpython.com/ā¢ Seems to work poorly with lots of data
(e.g. 8MB split into 4 lists...!)ā¢ We can run it locally, run it locally via
ppserver.py and run it remotely tooā¢ Can we demo it to another machine?
[email protected] - EuroPy 2011
ParallelPython + binariesParallelPython + binaries
ā¢ We can ask it to use modules, other functions and our own compiled modules
ā¢ Works for Cython and ShedSkinā¢ Modules have to be in PYTHONPATH (or
current directory for ppserver.py)ā¢ parallelpython_cython_pure_pyth
on
[email protected] - EuroPy 2011
Challenge...Challenge...
ā¢ Can we send binaries (.so/.pyd) automatically?
ā¢ It looks like we couldā¢ We'd then avoid having to deploy to
remote machines ahead of time...ā¢ Anybody want to help me?
[email protected] - EuroPy 2011
pyCUDApyCUDA
ā¢ NVIDIA's CUDA -> Python wrapperā¢ http://mathema.tician.de/software/pycudaā¢ Can be a pain to install...ā¢ Has numpy-like interface and two lower
level C interfaces
[email protected] - EuroPy 2011
pyCUDA demospyCUDA demos
ā¢ # ./pyCUDA/ā¢ I'm using float32/complex64 as my CUDA
card is too old :-( (Compute 1.3)ā¢ numpy-like interface is easy but slowā¢ elementwise requires C thinkingā¢ sourcemodule gives you complete controlā¢ Great for prototyping and moving to C
[email protected] - EuroPy 2011
Birds of Feather?Birds of Feather?
ā¢ numpy is cool but CPU boundā¢ pyCUDA is cool and is numpy-likeā¢ Could we monkey patch numpy to auto-
run CUDA(/openCL) if a card is present?ā¢ Anyone want to chat about this?
[email protected] - EuroPy 2011
Future trendsFuture trends
ā¢ multi-core is obviousā¢ CUDA-like systems are inevitableā¢ write-once, deploy to many targets ā that
would be lovelyā¢ Cython+ShedSkin could be coolā¢ Parallel Cython could be coolā¢ Refactoring with rope is definitely cool
[email protected] - EuroPy 2011
Bits to considerBits to consider
ā¢ Cython being wired into Python (GSoC)ā¢ CorePy assembly -> numpy
http://numcorepy.blogspot.com/ā¢ PyPy advancing nicelyā¢ GPUs being interwoven with CPUs (APU)ā¢ numpy+NumExpr->GPU/CPU mix?ā¢ Learning how to massively parallelise is
the key
[email protected] - EuroPy 2011
FeedbackFeedback
ā¢ I plan to write this upā¢ I want feedback (and maybe a testimonial
if you found this helpful?)ā¢ [email protected]ā¢ Thank you :-)