intel python - microsigma

Intel Python

Stephen Blair-Chappell, Bayncore

Introduction

It used to be:

- Prototype in Python

- Implement in C/C++

But now you can:

- Prototype in Python

- Implement in Python

Python is the #1 programming language in hiring demand, followed by Java and C++.

And the demand is growing.

Seven Levels of Hardware-supported Parallelism


Levels of Parallelism

• Node

• Socket

• Core

• Thread-level (Hyperthreading)

• GPU-CPU

• Instruction

• Data (Vectorisation)

• Use Intel Python to leverage all these levels of parallelism

• Use Intel tools to profile and further optimise the code

Efficient Vectorisation & Parallelism leads to best performance


* Xeon = Intel® Xeon® processor; * Xeon Phi = Intel® Xeon Phi™ coprocessor

[Figure: performance (bigger is better) as a function of parallelism and vectorisation, with the performance ‘sweet-spot’ marked.]

Intel Python Distribution


Intel Python is FREE


See: software.intel.com/en-us/distribution-for-python



Commercial Support Available



Performance-boosted Python

• Numerical package acceleration with Intel® performance libraries (MKL, DAAL, IPP)

• Better parallelism and composable multi-threading (TBB, MPI)

• Language extensions for vectorization and multi-threading (Cython, Numba, Pyston, etc.)

• Integration with Big Data and Machine Learning platforms and frameworks (Spark, Hadoop, Theano, etc.)

• Profiling Python and mixed-language codes (VTune)

Numerical Packages are boosted by the Intel Math Kernel Library (MKL)

Boosted by MKL - What does this mean?

• Use Intel Python to get improved performance

• NO code changes - just write regular Python code

• This way you get improved vectorisation and thread parallelism, which leads to better use of CPUs

Automatic Performance Improvements

NumPy & SciPy optimizations with Intel® MKL

Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse Solvers
• Iterative
• PARDISO SMP & Cluster

Fast Fourier Transforms
• 1-7D
• FFTW interfaces
• Cluster FFT

Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root

Vector RNGs
• Multiple BRNG
• Support methods for independent streams creation
• Support all key probability distributions

Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance

And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver


[Benchmark charts with speedup callouts: up to 100x faster; up to 10x faster (two charts); up to 60x faster.]

Configuration info: - Versions: Intel® Distribution for Python 2017 Beta, icc 15.0; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz (2 sockets, 16 cores each, HT=OFF), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS.

Python is continuing to be optimised by Intel

Optimizations in update 2 of Intel Python

• Fast Fourier Transforms (60x vs update 1)

• Arithmetic and transcendental expressions

• Memory management optimizations

• Faster Machine Learning with Scikit-learn (1.5x to 160x vs update 1)

• Neural network enhancements for pyDAAL


Source: software.intel.com/en-us/articles/intelr-distribution-for-python-2017-update-2

Our problem . . .

Running several parallel programs together, or nesting parallelism within a single program, can cause over-subscription: more software threads than hardware threads, which slows down the whole environment.
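To make the over-subscription problem concrete, the sketch below simply counts requested software threads against available hardware threads. It is an illustrative calculation rather than a measurement: the helper name `oversubscription_factor` and the example figures (8 outer workers, each calling a library that starts 8 threads, on an 8-thread machine) are assumptions chosen for the example.

```python
import os


def oversubscription_factor(outer_workers, inner_threads, hardware_threads=None):
    """Return software threads requested per available hardware thread.

    outer_workers: threads started by the application (e.g. a thread pool).
    inner_threads: threads each worker's library call starts on its own.
    hardware_threads: logical CPUs; defaults to what the OS reports.
    """
    if hardware_threads is None:
        hardware_threads = os.cpu_count() or 1
    requested = outer_workers * inner_threads
    return requested / hardware_threads


# Hypothetical example: 8 pool workers, each calling a threaded library
# that itself starts 8 threads, on a machine with 8 hardware threads.
factor = oversubscription_factor(8, 8, hardware_threads=8)
print(f"{factor:.1f} software threads per hardware thread")
```

A factor well above 1 means the operating system has to time-slice threads that each expect a core to themselves.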


Intel Python Thread Support

                      Default usage (can oversubscribe)   Composable solution ( -m TBB )
Regular Python        Python                              Python
Optimised libraries   Intel MKL                           Intel MKL
Threading provision   OpenMP threading                    TBB

Dask Code example – bench.py

python bench.py

Dask uses MKL, which uses OpenMP.

python -m TBB bench.py

Now Dask uses MKL, which uses TBB.
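The bench.py source is not reproduced in these slides. As a stand-in, the sketch below shows the shape of a nested-parallel program using only the standard library: an outer thread pool whose tasks each open an inner thread pool. The function names and sizes are hypothetical, and pure Python work like this is serialised by the interpreter lock, so it illustrates the nesting structure rather than a speedup.

```python
import time
from multiprocessing.pool import ThreadPool


def inner_sum(n):
    """Inner level of parallelism: sum the squares of range(n) in chunks."""
    step = max(1, n // 4)
    chunks = [range(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPool(4) as pool:
        return sum(pool.map(lambda c: sum(x * x for x in c), chunks))


def nested_benchmark(tasks=8, n=10_000):
    """Outer level: run several inner_sum calls from an outer thread pool."""
    with ThreadPool(4) as pool:
        return pool.map(inner_sum, [n] * tasks)


if __name__ == "__main__":
    start = time.time()
    results = nested_benchmark()
    print(f"{len(results)} tasks in {time.time() - start:.3f} s")
```

With 4 outer workers each opening a 4-thread inner pool, up to 16 worker threads can be live at once, which is the pattern a composable threading layer is meant to keep in check.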

By using the TBB enabled Dask . . .


https://software.intel.com/en-us/blogs/2016/04/04/unleash-parallel-performance-of-python-programs

“… you can get about 50% reduction of the compute time for this particular example or even more …”


Configuration Info: - Versions: Intel(R) Distribution for Python 2.7.11 2017, Beta (Mar 04, 2016), MKL version 11.3.2 for Intel Distribution for Python 2017, Beta, Fedora* built Python*: Python 2.7.10 (default, Sep 8 2015), NumPy 1.9.2, SciPy 0.14.1, multiprocessing 0.70a1 built with gcc 5.1.1; Hardware: 96 CPUs (HT ON), 4 sockets (12 cores/socket), 1 NUMA node, Intel(R) Xeon(R) E5-4657L [email protected], RAM 64GB, Operating System: Fedora release 23 (Twenty Three)

[Chart: Collaborative Filtering - Generation of User Recommendations. Positive effect of TBB-based nested parallelism in Python. Y axis: user requests per sec (0 to 12000). Bars: Fedora Python; Intel Python; Fedora Python + ThreadPool; Intel Python + ThreadPool; Intel Python + ThreadPool + TBB. Annotations: innermost parallelism, outermost, oversubscription with nested, nested. Speedup labels: 17x, 15x, 11x, 27x, 1x.]

About this Session

• Live Demo of Using VTune

• Profiling Python & Cython code

• Comparing Results

• Digging deeper

Setup

Windows laptop (Windows 10), connected via ssh -x to a Linux laptop (i7, 4 cores, Ubuntu 14 LTS)

source /opt/intel/parallel_studio_xe_2017.0.035/psxevars.sh

source intel/intelpython35/bin/activate

1. Tool Setup

2. CPU Spec

The Application

main.py

mandelbrot.py
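The source of mandelbrot.py is not included in the slides. For orientation, a naive pure-Python Mandelbrot kernel typically looks like the sketch below; the function names, the plotted region and the iteration limit are assumptions for illustration, not the demo's actual code.

```python
def escape_count(c_real, c_imag, max_iter=100):
    """Return how many iterations of z = z*z + c it takes for |z| to exceed 2."""
    z_real = z_imag = 0.0
    for i in range(max_iter):
        # z = z*z + c, written out on the real and imaginary parts
        z_real, z_imag = (z_real * z_real - z_imag * z_imag + c_real,
                          2.0 * z_real * z_imag + c_imag)
        if z_real * z_real + z_imag * z_imag > 4.0:
            return i + 1
    return max_iter


def render(width=40, height=20, max_iter=100):
    """Compute escape counts over the region [-2, 1] x [-1, 1]."""
    rows = []
    for y in range(height):
        c_imag = -1.0 + 2.0 * y / (height - 1)
        rows.append([escape_count(-2.0 + 3.0 * x / (width - 1), c_imag, max_iter)
                     for x in range(width)])
    return rows


if __name__ == "__main__":
    for row in render():
        print("".join("#" if n == 100 else "." for n in row))
```

The per-pixel inner loop is the kind of code that a hotspot profile highlights and that typed Cython variables speed up.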

Interactive Demo


Eight Activities

Activities 1 to 3: Python

Activities 4 to 6: Cython

Activity 7: Threads

Activity 8: Using the Intel Compiler

HEALTH WARNING: We are looking at the profiling tool, not writing the best code.

Activity 0: Setting up the SHELL

Demo 0: Sourcing the tools

# setup path to intel tools

$ source /opt/intel/<parallel studio folder>/bin/psxevars.sh

# enable intel python

$ source <Intel python folder>/bin/activate

# reduce prompt (for demo purposes)

$ echo $PS1

$ PS1="IP \$"

Activity 1: Running the Naïve Python Application

Shell Scripts – Activity 1


run.sh

nuke.sh

profile.sh

Running Naïve Python Example

$ ./run.sh -v

main.py

settings.py

mandelbrot.py

run.sh: python main.py $1

3. Run Naïve Python Example

Demo 1: Running Naïve Python example

# View directories & contents

$ cd dv/VTUNE-PYTHON/Mandelbrot

$ ls

# View shell script

$ cd 01_NaivePython

$ cat ./run.sh

# run (without, then with visual output)

$ ./run.sh

$ ./run.sh -v

Reminder: You must run Activity 0 first!

VTune Command Line Collection

$ ./profile.sh

main.py

settings.py

mandelbrot.py

profile.sh: amplxe-cl -collect hotspots -- python main.py $1 $2 $3

Result directory: r000hs

Demo 2: Command Line Profiling & Reporting

# Collect Hotspots

$ cd 01_NaivePython/

$ cat ./profile.sh

$ ./profile.sh

# View results & generate a command line report (top 6 hotspots)

$ ls

$ amplxe-cl -R hotspots -limit=6
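As a quick cross-check when VTune is not to hand, Python's standard-library cProfile can produce a comparable "top N" listing. This is a different tool from amplxe-cl and does not look inside native code. The `busy` workload below is a hypothetical stand-in, and `top_functions` is an illustrative helper rather than part of the demo scripts.

```python
import cProfile
import io
import pstats


def busy(n):
    """Stand-in workload (hypothetical; replace with the real entry point)."""
    return sum(i * i for i in range(n))


def top_functions(func, *args, limit=6):
    """Profile func(*args) and return a text report of the top `limit` entries."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


if __name__ == "__main__":
    print(top_functions(busy, 200_000))
```

The default limit of 6 mirrors the -limit=6 used in the demo command.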

4. Profile Naïve Python Example

Command line report types

$ amplxe-cl -help reports

5. Command Line Report

Activity 3: Viewing results from the VTune GUI

VTune Command Line Collection

$ amplxe-gui r000hs

Summary Page


Bottom Up


Source Code View


With Assembler


Activity 4: Cythonizing a Python Code

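Cython Example

Inputs: mandelbrot.pyx, setup.py, main.py, settings.py

setup.py:

    # imports are assumed; the slide shows only the setup() call
    from setuptools import setup, Extension
    from Cython.Distutils import build_ext

    setup(
        cmdclass={'build_ext': build_ext},
        ext_modules=[
            Extension('mandelbrot',
                      sources=["mandelbrot.pyx"],
                      extra_compile_args=['-g'],  # emit debug info for source-level profiling
                      extra_link_args=['-DEBUG']),
        ],
    )

build.sh:

    python setup.py build_ext --inplace
    cython -a mandelbrot.pyx

Generated files: mandelbrot.c, mandelbrot.cpython-35m-x86_64-linux-gnu.so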

6. Build Naïve Cython (no code changes)

Two step automatic build

Compile (default GCC)

Link (default GCC)

Shell Scripts – Activity 4+


run.sh

build.sh

profile.sh

Demo 4: Build & Profile Naïve code

# Build the naïve cython project

$ cd dv/VTUNE-PYTHON/Mandelbrot/02_NaiveCython

$ ls

$ ./build.sh

$ ls -l  # see what the build system produced

$ cat ./run.sh  # notice we still call main.py

$ amplxe-gui &

7. Run Naïve Cython

Config issues . . .

VTune may warn you that you need to do the following

8. Profile Naïve Cython

Summary


Bottom Up


The native source code


Activity 5: Building and running all the versions

The Different Versions

Name                 Code changes
01_NaivePython       Naïve Python
02_NaiveCython       Naïve Cython (no changes to code)
03_MoreCdefs         Add some cdefs
04_RichCompare       Add more cdefs
05_ALL_Cython        Get rid of all API calls in pyx
06_MovePythonLoop    Move loop from main.py to pyx code
07_Parallel          Add parallelism
08_icc               Use the Intel compiler

Demo 5: Building and running all the apps

# Build all the projects


$ ./build_all.sh

$ ./run_all.sh

9. Build All

10. Run All

The Different Versions

Name                 Code changes                               Time
01_NaivePython       Naïve Python                               20.41
02_NaiveCython       Naïve Cython (no changes to code)          12.83
03_MoreCdefs         Add some cdefs                             8.54
04_RichCompare       Add more cdefs                             7.88
05_ALL_Cython        Get rid of all API calls in pyx            1.91
06_MovePythonLoop    Move loop from main.py to pyx code         1.5
07_Parallel          Add parallelism using NumPy & threading    0.11
08_icc               Use the Intel compiler                     0.078
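The gain at each step is easier to judge as a speedup over the naive Python version. The short script below works that out from the times listed above; the numbers are copied from the table, while the dictionary layout and output format are choices made for this example.

```python
# Times copied from the table above (units as reported on the slide).
times = {
    "01_NaivePython": 20.41,
    "02_NaiveCython": 12.83,
    "03_MoreCdefs": 8.54,
    "04_RichCompare": 7.88,
    "05_ALL_Cython": 1.91,
    "06_MovePythonLoop": 1.5,
    "07_Parallel": 0.11,
    "08_icc": 0.078,
}

# Speedup of each version relative to the naive Python baseline.
baseline = times["01_NaivePython"]
speedups = {name: baseline / t for name, t in times.items()}

for name, factor in speedups.items():
    print(f"{name:<18} {times[name]:>7.3f}  {factor:>6.1f}x")
```

On these figures, Cythonizing without code changes gives about 1.6x, removing the remaining Python API calls about 10.7x, and the threaded build compiled with icc roughly 260x.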

Activity 6: Comparing Results

In this activity we:

• Use Cython annotation to discover Python API calls in Cythonized code

• Use VTune to compare two sets of results

Purpose:

• Identifying Expensive APIs

• Understand the benefit of a particular code change


Annotating the Cython code

build.sh

The annotation option (cython -a) produces an HTML report.

Goal: get rid of the deep yellow!

Calls to the Python API can be expensive

You can use VTune to help understand the benefits of your code changes

You can use the command line

amplxe-cl -r r001hs -r r002hs

But better to use the GUI!

Demo 6: Comparison of two results

# Build & profile all the projects

$ ./build_all.sh

$ ./profile_all.sh

# open the HTML file and look at the Cython annotated code

# open the GUI, and load the results from 04_RichCompare and 05_ALL_Cython

11. Profile All

Comparison Summary – Top Hotspots


Comparison Summary - Platform Info


Comparison Bottom-up


Activity 7: Profiling threading

12. Settings for last two activities

You CAN thread code

• Global interpreter lock (GIL) is disabled

• Only on pure Cython code

The generated code uses OpenMP

Demo 8: Threading & Advanced Hotspot analysis

# Edit settings.py so that 'factor = 8'

# build the application

$ ./build.sh

# run to see how long it runs for

$ ./run.sh

# profile using advanced-hotspots

$ amplxe-cl -collect advanced-hotspots ./run.sh

# open the GUI and look at the results

$ amplxe-gui &

Advanced Hotspot Analysis

$ amplxe-cl -collect advanced-hotspots ./run.sh

Activity 8: Using the Intel compiler (rather than GCC)

Using the Intel Compiler

export CC=icc

export CXX=icpc

export LDSHARED="icc -shared"
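The same settings can be applied from Python when the build is driven by a script. The sketch below is only an illustration: `intel_build_env` is a hypothetical helper that prepares the environment, and it does not check that icc is actually installed.

```python
import os


def intel_build_env(base_env=None):
    """Return a copy of the environment with the Intel compiler selected.

    Mirrors the three exports on the slide: CC, CXX and LDSHARED.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "CC": "icc",
        "CXX": "icpc",
        "LDSHARED": "icc -shared",
    })
    return env


# Example with a minimal, made-up base environment.
env = intel_build_env({"PATH": "/usr/bin"})
print(env["CC"], env["CXX"], env["LDSHARED"])
```

The returned mapping can be passed as the env argument of subprocess.run when invoking python setup.py build_ext --inplace, which leaves the calling shell's environment untouched.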


Differences between v8 and v7

• All arithmetic instructions are scalar SSE instructions

• Some arithmetic instructions are packed AVX instructions (but still using 128-bit-wide registers)

Using the Intel Compiler (Windows)

SET CC=icl

SET CXX=icl

SET LD=xilink

SET AR=xilib

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Notices & disclaimers

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel, the Intel logo, Pentium, Celeron, Atom, Core, Xeon and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.
