intel python - microsigma
TRANSCRIPT
Intel Python
Stephen Blair-Chappell, Bayncore
Introduction
It used to be
- Prototype in Python
- Implement in C/C++
But now you can
- Prototype in Python
- Implement in Python
Python is the #1 programming language in hiring demand, followed by Java and C++.
And the demand is growing
Seven Levels of Hardware-supported Parallelism
Levels of parallelism: Node, Socket, Core, Thread-Level (Hyperthreading), GPU-CPU, Instruction, Data (Vectorisation)
• Use Intel Python to leverage all these levels of parallelism
• Use Intel tools to profile and further optimise the code
Efficient Vectorisation & Parallelism leads to best performance
* Xeon = Intel® Xeon® processor; * Xeon Phi = Intel® Xeon Phi™ coprocessor
[Figure: Performance (bigger is better) with Parallelism and Vectorisation axes; the performance ‘sweet-spot’ is marked.]
Intel Python Distribution
Intel Python is FREE
See: software.intel.com/en-us/distribution-for-python
Commercial Support Available
Performance-boosted Python
• Numerical packages acceleration with Intel® performance libraries (MKL, DAAL, IPP)
• Better parallelism and composable multi-threading (TBB, MPI)
• Language extensions for vectorization and multi-threading (Cython, Numba, Pyston, etc.)
• Integration with Big Data and Machine Learning platforms and frameworks (Spark, Hadoop, Theano, etc.)
• Profiling Python and mixed-language codes (VTune)
Numerical Packages are boosted by Intel Math Kernel Library
Boosted by MKL - What does this mean?
• Use Intel Python to get improved performance
• NO code changes - just write regular Python code
• This way you get improved
  - Vectorisation
  - Thread parallelism
Leads to better use of CPUs
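As a minimal sketch of what "no code changes" means, the ordinary NumPy script below contains nothing Intel-specific. The assumption is that, when it runs on a NumPy build linked against MKL (as in the Intel distribution), calls such as `np.linalg.solve` and the matrix-vector product are handed to the optimised LAPACK/BLAS routines. The function name and matrix sizes are illustrative and do not come from the slides.

```python
import time
import numpy as np

def solve_random_system(n=200, seed=0):
    """Build a random n x n system A x = b and solve it with plain NumPy calls."""
    rng = np.random.RandomState(seed)
    a = rng.rand(n, n) + n * np.eye(n)       # diagonally dominant, so well conditioned
    b = rng.rand(n)
    x = np.linalg.solve(a, b)                # LAPACK solver behind the scenes
    residual = np.linalg.norm(a.dot(x) - b)  # BLAS matrix-vector product
    return x, residual

if __name__ == "__main__":
    start = time.perf_counter()
    _, res = solve_random_system(n=2000)
    print("residual %.2e, elapsed %.3f s" % (res, time.perf_counter() - start))
```

Running the same file with two different Python distributions and comparing the elapsed time is one way to see the effect on a given machine.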
Automatic Performance Improvements
NumPy & SciPy optimizations with Intel® MKL
Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse Solvers
• Iterative
• PARDISO SMP & Cluster
Fast Fourier Transforms
• 1-7D
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
Vector RNGs
• Multiple BRNG
• Support methods for independent streams creation
• Support all key probability distributions
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
[Benchmark charts: speed-up callouts of up to 100x, up to 10x (two charts) and up to 60x]
Configuration info: - Versions: Intel® Distribution for Python 2017 Beta, icc 15.0; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz (2 sockets, 16 cores each, HT=OFF), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS.
Python is continuing to be optimised by Intel
Optimizations in update 2 of Intel Python
• Fast Fourier Transforms (60x vs update 1)
• Arithmetic and transcendental expressions
• Memory management optimizations
• Faster Machine Learning with Scikit-learn (1.5x to 160x vs update 1)
• Neural network enhancements for pyDAAL
Source: software.intel.com/en-us/articles/intelr-distribution-for-python-2017-update-2
Our problem . . .
Running several parallel programs together, or nesting parallelism within a program, can cause over-subscription and slow down the whole environment.
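The arithmetic behind over-subscription can be sketched as follows. The assumed worst case is an outer pool in which every worker calls a library that starts its own full set of threads; the function names are illustrative and not part of any Intel API.

```python
import os

def total_threads(outer_workers, inner_threads):
    """Threads alive at once when every outer worker starts its own inner pool."""
    return outer_workers * inner_threads

def oversubscription_factor(outer_workers, inner_threads, hardware_threads=None):
    """Ratio of software threads to hardware threads (above 1.0 means over-subscribed)."""
    if hardware_threads is None:
        hardware_threads = os.cpu_count() or 1
    return total_threads(outer_workers, inner_threads) / hardware_threads

if __name__ == "__main__":
    hw = os.cpu_count() or 1
    # Assumed worst case: an outer pool of hw workers, each calling a library
    # that also starts hw threads of its own.
    print("hardware threads:", hw)
    print("software threads:", total_threads(hw, hw))
    print("over-subscription factor:", oversubscription_factor(hw, hw))
```

Under this assumption the thread count grows with the square of the core count, which is why nested parallelism can hurt most on large multi-socket machines.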
Intel Python Thread Support
Layer                 Default usage (can oversubscribe)   Composable solution (-m TBB)
Regular Python        Python                              Python
Optimised libraries   Intel MKL                           Intel MKL
Threading provision   OpenMP threading                    TBB
Dask Code example – bench.py
python bench.py
Dask uses MKL, which uses OpenMP
python -m TBB bench.py
Now Dask uses MKL, which uses TBB
By using the TBB enabled Dask . . .
https://software.intel.com/en-us/blogs/2016/04/04/unleash-parallel-performance-of-python-programs
“… you can get about 50% reduction of the compute time for this particular example or even more …”
Configuration Info: - Versions: Intel(R) Distribution for Python 2.7.11 2017, Beta (Mar 04, 2016), MKL version 11.3.2 for Intel Distribution for Python 2017, Beta, Fedora* built Python*: Python 2.7.10 (default, Sep 8 2015), NumPy 1.9.2, SciPy 0.14.1, multiprocessing 0.70a1 built with gcc 5.1.1; Hardware: 96 CPUs (HT ON), 4 sockets (12 cores/socket), 1 NUMA node, Intel(R) Xeon(R) E5-4657L, RAM 64GB, Operating System: Fedora release 23 (Twenty Three)
[Chart: Collaborative Filtering - Generation of User Recommendations. Positive effect of TBB-based nested parallelism in Python. Y-axis: user requests per second (0 to 12000). Bars: Fedora Python; Intel Python; Fedora Python + ThreadPool; Intel Python + ThreadPool; Intel Python + ThreadPool + TBB. Speed-up labels: 1x, 11x, 15x, 17x, 27x. Annotations: innermost parallelism, outermost parallelism, nested, oversubscription with nested.]
About this Session
• Live demo of using VTune
• Profiling Python & Cython code
• Comparing results
• Digging deeper
Setup
Windows laptop (Windows 10) connected via ssh -x to a Linux laptop (i7, 4-core, Ubuntu 14 LTS)
source /opt/intel/parallel_studio_xe_2017.0.035/psxevars.sh
source intel/intelpython35/bin/activate
1. Tool Setup
2. CPU Spec
The Application
main.py
mandelbrot.py
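The slides do not reproduce the source of mandelbrot.py, so the following is only a sketch of the kind of naïve, pure-Python kernel such a demo typically profiles. The names `escape_time` and `render`, the grid bounds and the iteration limit are assumptions made for illustration.

```python
def escape_time(c, max_iter=50):
    """Return the number of iterations before z = z*z + c leaves |z| <= 2."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter

def render(width=60, height=24, max_iter=50):
    """Evaluate escape_time over a grid covering [-2, 1] x [-1, 1]."""
    rows = []
    for y in range(height):
        im = -1.0 + 2.0 * y / (height - 1)
        row = []
        for x in range(width):
            re = -2.0 + 3.0 * x / (width - 1)
            row.append(escape_time(complex(re, im), max_iter))
        rows.append(row)
    return rows

if __name__ == "__main__":
    # Points that never escaped are drawn as '#', everything else as '.'.
    for row in render():
        print("".join("#" if n == 50 else "." for n in row))
```

The nested loops and per-point arithmetic on Python objects are exactly the sort of hotspot a profiler is expected to surface.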
Interactive Demo
Eight Activities
Activity 1 to 3: Python
Activity 4 to 6: Cython
Activity 7: Threads
Activity 8: Using the Intel Compiler
HEALTH WARNING: We are looking at the profiling tool, not writing the best code
Activity 0: Setting up the shell
Demo 0: Sourcing the tools
# set up path to Intel tools
$ source /opt/intel/<parallel studio folder>/bin/psxevars.sh
# enable Intel Python
$ source <Intel python folder>/bin/activate
# reduce prompt (for demo purposes)
$ echo $PS1
$ PS1="IP \$"
Activity 1: Running the Naïve Python Application
Shell Scripts – Activity 1
run.sh
nuke.sh
profile.sh
Running Naïve Python Example
$ ./run.sh -v
Files: main.py, settings.py, mandelbrot.py
run.sh: python main.py $1
3. Run Naïve Python Example
Demo 1: Running Naïve Python example
# View directories & contents
$ cd dv/VTUNE-PYTHON/Mandelbrot
$ ls
# View shell script
$ cd 01_NaivePython
$ cat ./run.sh
# run (without, then with visual output)
$ ./run.sh
$ ./run.sh -v
Reminder: You must run Activity 0 first!
VTune Command Line Collection
$ ./profile.sh
Files: main.py, settings.py, mandelbrot.py
profile.sh: amplxe-cl -collect hotspots -- python main.py $1 $2 $3
Result directory: r000hs
Demo 2: Command Line Profiling & Reporting
# Collect Hotspots
$ cd 01_NaivePython/
$ cat ./profile.sh
$ ./profile.sh
# View results & generate a command line report (top 6 hotspots)
$ ls
$ amplxe-cl -R hotspots -limit=6
4. Profile Naïve Python Example
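For scripted runs, the collect and report steps can be assembled from Python. This is a hedged sketch: the helper names are hypothetical, `-report` and `-r` are used as the long form of `-R` and as the result-directory option respectively, and the commands are only printed so that the sketch runs on a machine without VTune installed.

```python
import shlex

def collect_command(analysis, app_args):
    """Command line that runs app_args under VTune with the given analysis type."""
    return ["amplxe-cl", "-collect", analysis, "--"] + list(app_args)

def report_command(report, result_dir, limit=None):
    """Command line that prints a text report from an existing result directory."""
    cmd = ["amplxe-cl", "-report", report, "-r", result_dir]
    if limit is not None:
        cmd.append("-limit=%d" % limit)
    return cmd

if __name__ == "__main__":
    # Print the commands rather than executing them.
    for cmd in (collect_command("hotspots", ["python", "main.py"]),
                report_command("hotspots", "r000hs", limit=6)):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

On a machine with the tools sourced, each list could be passed to `subprocess.run` instead of being printed.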
Command line report types
$ amplxe-cl -help reports
5. Command Line Report
Activity 3: Viewing results from the VTune GUI
VTune Command Line Collection
$ amplxe-gui r000hs
Summary Page
Bottom Up
Source Code View
With Assembler
Activity 4: Cythonizing a Python Code
Cython Example
Files: main.py, settings.py, mandelbrot.pyx, setup.py
setup.py:
    # The imports are not shown on the slide; these are the usual ones for a Cython build.
    from setuptools import setup, Extension
    from Cython.Distutils import build_ext

    setup(
        cmdclass={'build_ext': build_ext},
        ext_modules=[
            Extension('mandelbrot',
                      sources=["mandelbrot.pyx"],
                      extra_compile_args=['-g'],
                      extra_link_args=['-DEBUG']),
        ],
    )
build.sh:
    python setup.py build_ext --inplace
    cython -a mandelbrot.pyx
Generated files:
1. mandelbrot.c
2. mandelbrot.cpython-35m-x86_64-linux-gnu.so
6. Build Naïve Cython (no code changes)
Two step automatic build
1. Compile (default GCC)
2. Link (default GCC)
Shell Scripts – Activity 4+
run.sh
build.sh
profile.sh
Demo 4: Build & Profile Naïve code
# Build the naïve cython project
$ cd dv/VTUNE-PYTHON/Mandelbrot/02_NaiveCython
$ ls
$ ./build.sh
$ ls -l #see what the build system produced
$ cat ./run.sh # notice we still call main.py
$ amplxe-gui &
7. Run Naïve Cython
Config issues . . .
VTune may warn you that you need to do the following
8. Profile Naïve Cython
Summary
Bottom Up
The native source code
Activity 5: Building and running all the versions
The Different Versions
Name                Code Changes
01_NaivePython      Naïve Python
02_NaiveCython      Naïve Cython (no changes to code)
03_MoreCdefs        Add some cdefs
04_RichCompare      Add more cdefs
05_ALL_Cython       Get rid of all API calls in the .pyx
06_MovePythonLoop   Move loop from main.py to the .pyx code
07_Parallel         Add parallelism
08_icc              Use the Intel compiler
Demo 5: Building and running all the apps
# Build all the projects
$ cd dv/VTUNE-PYTHON/Mandelbrot
$ ./build_all.sh
$ ./run_all.sh
9. Build All
10. Run All
The Different Versions
Name                Code Changes                                 Time
01_NaivePython      Naïve Python                                 20.41
02_NaiveCython      Naïve Cython (no changes to code)            12.83
03_MoreCdefs        Add some cdefs                               8.54
04_RichCompare      Add more cdefs                               7.88
05_ALL_Cython       Get rid of all API calls in the .pyx         1.91
06_MovePythonLoop   Move loop from main.py to the .pyx code      1.5
07_Parallel         Add parallelism (using NumPy & threading)    0.11
08_icc              Use the Intel compiler                       0.078
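To read the timings as speed-ups over the naïve Python version, a small sketch such as the one below can be used. The timing values are copied from the table; the dictionary and function names are illustrative.

```python
# Timings copied from the table of versions.
TIMINGS = {
    "01_NaivePython": 20.41,
    "02_NaiveCython": 12.83,
    "03_MoreCdefs": 8.54,
    "04_RichCompare": 7.88,
    "05_ALL_Cython": 1.91,
    "06_MovePythonLoop": 1.5,
    "07_Parallel": 0.11,
    "08_icc": 0.078,
}

def speedups(timings, baseline="01_NaivePython"):
    """Return each version's speed-up factor relative to the baseline version."""
    base = timings[baseline]
    return {name: base / t for name, t in timings.items()}

if __name__ == "__main__":
    for name, factor in speedups(TIMINGS).items():
        print("%-20s %8.1fx" % (name, factor))
```

The largest single jump in the table comes from removing the remaining Python API calls (04 to 05) and then from adding parallelism (06 to 07).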
Activity 6: Comparing Results
In this activity we
• Use Cython annotation to discover Python API calls in Cythonized code
• Use VTune to compare two sets of results
Purpose:
• Identify expensive APIs
• Understand the benefit of a particular code change
Annotating the Cython code
build.sh
The annotation option (cython -a) produces an HTML report
Goal: get rid of the deep yellow!
Calls to the Python API can be expensive
You can use VTune to help understand the benefits of your code changes
You can use the command line
amplxe-cl -report hotspots -r r001hs -r r002hs
But better to use the GUI!
Demo 6: Comparison of two results
# Build & profile all the projects
$ cd dv/VTUNE-PYTHON/Mandelbrot
$ ./build_all.sh
$ ./profile_all.sh
# open the HTML file and look at the Cython annotated code
# open the GUI, and load the results from 04_RichCompare and 05_ALL_Cython
11. Profile All
Comparison Summary – Top Hotspots
Comparison Summary - Platform Info
Comparison Bottom-up
Activity 7: Profiling threading
12. Settings for last two activities
You CAN thread code
• Global interpreter lock (GIL) is disabled
• Only on pure Cython code
The generated code uses OpenMP
Demo 8: Threading & Advanced Hotspot analysis
# Edit settings.py so that 'factor = 8'
# build the application
$ ./build.sh
# run to see how long it runs for
$ ./run.sh
# profile using advanced-hotspots
$ amplxe-cl -collect advanced-hotspots ./run.sh
# open the GUI and look at the results
$ amplxe-gui &
Advanced Hotspot Analysis
$ amplxe-cl -collect advanced-hotspots ./run.sh
Activity 8: Using the Intel compiler (rather than GCC)
Using the Intel Compiler
export CC=icc
export CXX=icpc
export LDSHARED="icc -shared"
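When the build is driven from a script rather than an interactive shell, the same three variables can be set on a copy of the environment. This is a sketch under the assumption that setup.py honours `CC`, `CXX` and `LDSHARED`; the function name is hypothetical.

```python
import os

def intel_build_env(base_env=None):
    """Return a copy of the environment with the Intel compiler variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({"CC": "icc", "CXX": "icpc", "LDSHARED": "icc -shared"})
    return env

if __name__ == "__main__":
    env = intel_build_env()
    # On a machine with icc installed, env could be passed to
    # subprocess.run(["python", "setup.py", "build_ext", "--inplace"], env=env).
    for key in ("CC", "CXX", "LDSHARED"):
        print(key, "=", env[key])
```

Working on a copy keeps the calling process's own environment unchanged, so GCC and Intel builds can be compared from the same session.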
Differences between v8 and v7
• All arithmetic instructions are scalar SSE instructions
• Some arithmetic instructions are packed AVX instructions (but still using 128-bit-wide registers)
Using the Intel Compiler (Windows)
SET CC=icl
SET CXX=icl
SET LD=xilink
SET AR=xilib
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Notices & Disclaimers
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Intel, the Intel logo, Pentium, Celeron, Atom, Core, Xeon and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation.