intel python - microsigma

Intel Python: Stephen Blair-Chappell, Bayncore

Post on 09-Apr-2022

TRANSCRIPT

Page 1: Intel Python - Microsigma

Intel Python

Stephen Blair-Chappell, Bayncore

Page 2: Intel Python - Microsigma

Introduction

It used to be:
- Prototype in Python
- Implement in C/C++

But now you can:
- Prototype in Python
- Implement in Python

Python is the #1 programming language in hiring demand, followed by Java and C++, and the demand is growing.

Page 3: Intel Python - Microsigma

Seven Levels of Hardware-supported Parallelism

Levels of parallelism:
• Node
• Socket
• Core
• Thread-level (Hyperthreading)
• GPU-CPU
• Instruction
• Data (Vectorisation)

• Use Intel Python to leverage all these levels of parallelism

• Use Intel tools to profile and further optimise the code

Page 4: Intel Python - Microsigma

Efficient Vectorisation & Parallelism leads to best performance

[Figure: performance (bigger is better) plotted against vectorisation and parallelism, with the performance 'sweet-spot' marked.]

* Xeon = Intel® Xeon® processor; * Xeon Phi = Intel® Xeon Phi™ coprocessor

Page 5: Intel Python - Microsigma

Intel Python Distribution

Page 6: Intel Python - Microsigma

Intel Python is FREE

See: software.intel.com/en-us/distribution-for-python

Page 7: Intel Python - Microsigma

software.intel.com/en-us/distribution-for-python

Page 8: Intel Python - Microsigma

Commercial Support Available

Page 9: Intel Python - Microsigma

Performance-boosted Python

• Numerical package acceleration with Intel® performance libraries (MKL, DAAL, IPP)
• Better parallelism and composable multi-threading (TBB, MPI)
• Language extensions for vectorization and multi-threading (Cython, Numba, Pyston, etc.)
• Integration with Big Data and Machine Learning platforms and frameworks (Spark, Hadoop, Theano, etc.)
• Profiling Python and mixed-language code (VTune)

Page 11: Intel Python - Microsigma

Numerical packages are boosted by the Intel Math Kernel Library (MKL)

Page 12: Intel Python - Microsigma

Boosted by MKL - What does this mean?

• Use Intel Python to get improved performance

• NO code changes - just write regular Python code

• This way you get improved:
  - Vectorisation
  - Thread parallelism

Leads to better use of CPUs

Automatic Performance Improvements
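As a rough illustration of the "no code changes" point, the sketch below is ordinary NumPy code with no reference to MKL anywhere. Whether the matrix product is actually dispatched to MKL depends on how the installed NumPy was built, which is an assumption here and not something the script checks. The matrix size and the pure-Python baseline function are illustrative choices, not taken from the slides.

```python
import time
import numpy as np

def matmul_pure_python(a, b):
    """Naive triple-loop matrix product on nested lists (the baseline)."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

n = 60  # illustrative size, small enough for the pure-Python loop to finish quickly
rng = np.random.RandomState(0)
a = rng.rand(n, n)
b = rng.rand(n, n)

t0 = time.perf_counter()
slow = matmul_pure_python(a.tolist(), b.tolist())
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
fast = a @ b  # regular NumPy call; the BLAS library behind NumPy does the work
t_blas = time.perf_counter() - t0

print("pure Python: {:.4f} s, NumPy: {:.6f} s".format(t_loop, t_blas))
print("results agree:", np.allclose(slow, fast))
```

The same script runs unchanged under any Python distribution that ships NumPy; only the library underneath differs.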

Page 13: Intel Python - Microsigma

NumPy & SciPy optimizations with Intel® MKL

Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse Solvers
• Iterative
• PARDISO SMP & Cluster

Fast Fourier Transforms
• 1-7D
• FFTW interfaces
• Cluster FFT

Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root

Vector RNGs
• Multiple BRNG
• Support methods for independent streams creation
• Support all key probability distributions

Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance

And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver

Callouts: up to 100x faster; up to 10x faster (shown twice); up to 60x faster

Configuration info: Versions: Intel® Distribution for Python 2017 Beta, icc 15.0; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.30 GHz (2 sockets, 16 cores each, HT=OFF), 64 GB of RAM, 8 DIMMs of 8 GB @ 2133 MHz; Operating System: Ubuntu 14.04 LTS.

Page 14: Intel Python - Microsigma

Python is continuing to be optimised by Intel

Optimizations in update 2 of Intel Python

• Fast Fourier Transforms (60x vs update 1)

• Arithmetic and transcendental expressions

• Memory management optimizations

• Faster Machine Learning with Scikit-learn (1.5x to 160x vs update 1)

• Neural network enhancements for pyDAAL

Source: software.intel.com/en-us/articles/intelr-distribution-for-python-2017-update-2

Page 15: Intel Python - Microsigma
Page 16: Intel Python - Microsigma

Our problem . . .

Running several parallel programs together, or using nested parallelism within one program, can cause over-subscription (more runnable threads than cores) and slow down the whole environment.
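A small sketch of how nesting multiplies thread counts, using only the standard library. The pool sizes are hypothetical: the outer pool stands in for application-level parallelism and the inner pool for the threads a parallel library call might start. A barrier is used so that the run only completes when every inner thread is alive at the same moment, which makes the peak count exact.

```python
import threading
from multiprocessing.pool import ThreadPool

OUTER_WORKERS = 4  # hypothetical: threads started by the application
INNER_WORKERS = 4  # hypothetical: threads started inside each parallel call

barrier = threading.Barrier(OUTER_WORKERS * INNER_WORKERS)
lock = threading.Lock()
active = 0
peak = 0

def inner_task(_):
    """Record how many inner threads are alive at the same time."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    barrier.wait(timeout=30)  # released only once all inner threads have arrived
    with lock:
        active -= 1

def outer_task(_):
    """Each outer worker opens its own inner pool, as a nested parallel call would."""
    with ThreadPool(INNER_WORKERS) as inner:
        inner.map(inner_task, range(INNER_WORKERS))

with ThreadPool(OUTER_WORKERS) as outer:
    outer.map(outer_task, range(OUTER_WORKERS))

print("peak concurrent inner threads:", peak)
```

With both levels sized to the core count, the thread demand grows with the square of the number of cores, which is the over-subscription described on this slide.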

Page 17: Intel Python - Microsigma

Intel Python Thread Support

[Diagram: two software stacks, each with three layers: regular Python, optimised libraries, threading provision]

• Default usage (can oversubscribe): Python, Intel MKL, OpenMP threading
• Composable solution (-m TBB): Python, Intel MKL, TBB

Page 18: Intel Python - Microsigma

Dask Code example – bench.py


Page 19: Intel Python - Microsigma

python bench.py

Dask uses MKL, which uses OpenMP

Page 20: Intel Python - Microsigma

python -m TBB bench.py

Now Dask uses MKL, which uses TBB

Page 21: Intel Python - Microsigma

By using the TBB enabled Dask . . .


https://software.intel.com/en-us/blogs/2016/04/04/unleash-parallel-performance-of-python-programs

“… you can get about 50% reduction of the compute time for this particular example or even more …”

Page 22: Intel Python - Microsigma

Configuration Info: - Versions: Intel(R) Distribution for Python 2.7.11 2017, Beta (Mar 04, 2016), MKL version 11.3.2 for Intel Distribution for Python 2017, Beta, Fedora* built Python*: Python 2.7.10 (default, Sep 8 2015), NumPy 1.9.2, SciPy 0.14.1, multiprocessing 0.70a1 built with gcc 5.1.1; Hardware: 96 CPUs (HT ON), 4 sockets (12 cores/socket), 1 NUMA node, Intel(R) Xeon(R) E5-4657L [email protected], RAM 64GB, Operating System: Fedora release 23 (Twenty Three)

[Figure: Collaborative Filtering - Generation of User Recommendations. Positive effect of TBB-based nested parallelism in Python. Bar chart of user requests per second (axis 0 to 12000) for five configurations: Fedora Python; Intel Python; Fedora Python + ThreadPool; Intel Python + ThreadPool; Intel Python + ThreadPool + TBB. Annotations: innermost parallelism; outermost; oversubscription with nested; nested. Speed-up labels: 17x, 15x, 11x, 27x, 1x.]

Page 23: Intel Python - Microsigma
Page 24: Intel Python - Microsigma

About this Session

• Live demo of using VTune
• Profiling Python & Cython code
• Comparing results
• Digging deeper

Page 25: Intel Python - Microsigma

Setup

[Diagram: a Windows laptop (Windows 10) connected over ssh -x to a Linux laptop (i7 4-core, Ubuntu 14 LTS)]

source /opt/intel/parallel_studio_xe_2017.0.035/psxevars.sh

source intel/intelpython35/bin/activate

Page 26: Intel Python - Microsigma

1. Tool Setup

Page 27: Intel Python - Microsigma

2. CPU Spec

Page 28: Intel Python - Microsigma

The Application

main.py

mandelbrot.py
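The slides do not reproduce the source of mandelbrot.py, so the following is only a guess at what a deliberately naive, pure-Python kernel of this kind looks like. The function names, the default bounds, and the text rendering are all hypothetical; only the idea (an escape-time loop per pixel, called from a driver script) is taken from the demo.

```python
def escape_time(c, max_iter=100):
    """Iterations of z = z*z + c before |z| exceeds 2, or max_iter if it never does."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter

def mandelbrot(width, height, max_iter=100,
               x_min=-2.0, x_max=1.0, y_min=-1.0, y_max=1.0):
    """Escape-time grid (a list of rows) over a rectangle of the complex plane."""
    rows = []
    for py in range(height):
        y = y_min + (y_max - y_min) * py / (height - 1)
        row = []
        for px in range(width):
            x = x_min + (x_max - x_min) * px / (width - 1)
            row.append(escape_time(complex(x, y), max_iter))
        rows.append(row)
    return rows

if __name__ == "__main__":
    # Stand-in for a driver such as main.py: render the set as text.
    for row in mandelbrot(60, 24, max_iter=50):
        print("".join("#" if n == 50 else "." for n in row))
```

Per-pixel Python loops like these are exactly the kind of code where a hotspot profile is dominated by interpreter overhead, which is what makes it a useful profiling subject.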

Page 29: Intel Python - Microsigma

Interactive Demo


Page 30: Intel Python - Microsigma

Eight Activities

• Activities 1 to 3: Python
• Activities 4 to 6: Cython
• Activity 7: Threads
• Activity 8: Using the Intel Compiler

HEALTH WARNING: We are looking at the profiling tool and not writing the best code

Page 31: Intel Python - Microsigma

Activity 0: Setting up the shell

Page 32: Intel Python - Microsigma

Demo 0: Sourcing the tools

# set up path to Intel tools
$ source /opt/intel/<parallel studio folder>/bin/psxevars.sh

# enable Intel Python
$ source <Intel python folder>/bin/activate

# reduce prompt (for demo purposes)
$ echo $PS1
$ PS1="IP \$"

Page 33: Intel Python - Microsigma

Activity 1: Running the Naïve Python Application

Page 34: Intel Python - Microsigma

Shell Scripts – Activity 1

run.sh

nuke.sh

profile.sh

Page 35: Intel Python - Microsigma

Running Naïve Python Example

$ ./run.sh -v

Files: main.py, settings.py, mandelbrot.py

run.sh:
python main.py $1

Page 36: Intel Python - Microsigma

3. Run Naïve Python Example

Page 37: Intel Python - Microsigma

Demo 1: Running Naïve Python example

# View directories & contents
$ cd dv/VTUNE-PYTHON/Mandelbrot
$ ls

# View shell script
$ cd 01_NaivePython
$ cat ./run.sh

# run (without, then with visual output)
$ ./run.sh
$ ./run.sh -v

Reminder: You must run Activity 0 first!

Page 38: Intel Python - Microsigma

VTune Command Line Collection

$ ./profile.sh

Files: main.py, settings.py, mandelbrot.py

profile.sh:
amplxe-cl -collect hotspots -- python main.py $1 $2 $3

Result directory: r000hs

Page 39: Intel Python - Microsigma

Demo 2: Command Line Profiling & Reporting

# Collect Hotspots
$ cd 01_NaivePython/
$ cat ./profile.sh
$ ./profile.sh

# View results & generate a command line report (top 6 hotspots)
$ ls
$ amplxe-cl -R hotspots -limit=6
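The collection and report commands used in this demo can also be driven from a Python script, for example when profiling several variants in a loop. The sketch below only assembles the argument lists shown on the slides and runs them if amplxe-cl is on the PATH. The helper names (collect_cmd, report_cmd, run_if_available) are hypothetical, and the flags are limited to the ones that appear in the deck.

```python
import shutil
import subprocess

def collect_cmd(script, *script_args, analysis="hotspots"):
    """Argument list for: amplxe-cl -collect <analysis> -- python <script> [args]"""
    return ["amplxe-cl", "-collect", analysis, "--", "python", script] + list(script_args)

def report_cmd(limit=6, report="hotspots"):
    """Argument list for: amplxe-cl -R <report> -limit=<limit>"""
    return ["amplxe-cl", "-R", report, "-limit={}".format(limit)]

def run_if_available(cmd):
    """Run cmd only when the tool is installed; return its exit code, else None."""
    if shutil.which(cmd[0]) is None:
        print("amplxe-cl not found; would run:", " ".join(cmd))
        return None
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    run_if_available(collect_cmd("main.py"))
    run_if_available(report_cmd(limit=6))
```

Passing the command as a list avoids shell quoting issues when script arguments contain spaces.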

Page 40: Intel Python - Microsigma

4. Profile Naïve Python Example

Page 41: Intel Python - Microsigma

Command line report types

$ amplxe-cl -help reports

Page 42: Intel Python - Microsigma

5. Command Line Report

Page 43: Intel Python - Microsigma

Activity 3: Viewing results from the VTune GUI

Page 44: Intel Python - Microsigma

VTune Command Line Collection

$ amplxe-gui r000hs

Result directory: r000hs

Page 45: Intel Python - Microsigma

Summary Page


Page 46: Intel Python - Microsigma

Bottom Up


Page 47: Intel Python - Microsigma

Source Code View


Page 48: Intel Python - Microsigma

With Assembler


Page 49: Intel Python - Microsigma

Activity 4: Cythonizing Python Code

Page 50: Intel Python - Microsigma

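Cython Example

Files: main.py, settings.py, mandelbrot.pyx, setup.py

setup.py:
# imports assumed; the slide shows only the setup() call
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [
        Extension('mandelbrot',
                  sources=["mandelbrot.pyx"],
                  extra_compile_args=['-g'],
                  extra_link_args=['-DEBUG']),
    ],
)

build.sh:
python setup.py build_ext --inplace
cython -a mandelbrot.pyx

Generated files:
1. mandelbrot.c
2. mandelbrot.cpython-35m-x86_64-linux-gnu.so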

Page 51: Intel Python - Microsigma

6. Build Naïve Cython (no code changes)

Page 52: Intel Python - Microsigma

Two step automatic build

1. Compile (default GCC)
2. Link (default GCC)

Page 53: Intel Python - Microsigma

Shell Scripts – Activity 4+

run.sh

build.sh

profile.sh

Page 54: Intel Python - Microsigma

Demo 4: Build & Profile Naïve code

# Build the naïve Cython project
$ cd dv/VTUNE-PYTHON/Mandelbrot/02_NaiveCython
$ ls
$ ./build.sh
$ ls -l    # see what the build system produced
$ cat ./run.sh    # notice we still call main.py
$ amplxe-gui &

Page 55: Intel Python - Microsigma

7. Run Naïve Cython

Page 56: Intel Python - Microsigma

Config issues . . .

VTune may warn you that you need to do the following


Page 57: Intel Python - Microsigma

8. Profile Naïve Cython

Page 58: Intel Python - Microsigma

Summary


Page 59: Intel Python - Microsigma

Bottom Up


Page 60: Intel Python - Microsigma

The native source code


Page 61: Intel Python - Microsigma

Activity 5: Building and running all the versions

Page 62: Intel Python - Microsigma

The Different Versions

Name                Code Changes
01_NaivePython      Naïve Python
02_NaiveCython      Naïve Cython (no changes to code)
03_MoreCdefs        Add some cdefs
04_RichCompare      Add more cdefs
05_ALL_Cython       Get rid of all API calls in pyc
06_MovePythonLoop   Move loop from main.py to pyc code
07_Parallel         Add parallelism
08_icc              Use the Intel compiler

Page 63: Intel Python - Microsigma

Demo 5: Building and running all the apps

# Build all the projects

$ cd dv/VTUNE-PYTHON/Mandelbrot

$ ./build_all.sh

$ ./run_all.sh

Page 64: Intel Python - Microsigma

9. Build All

Page 65: Intel Python - Microsigma

10. Run All

Page 66: Intel Python - Microsigma

The Different Versions

Name                Code Changes                                Time
01_NaivePython      Naïve Python                                20.41
02_NaiveCython      Naïve Cython (no changes to code)           12.83
03_MoreCdefs        Add some cdefs                              8.54
04_RichCompare      Add more cdefs                              7.88
05_ALL_Cython       Get rid of all API calls in pyc             1.91
06_MovePythonLoop   Move loop from main.py to pyc code          1.5
07_Parallel         Add parallelism using NumPy & threading     0.11
08_icc              Use the Intel compiler                      0.078
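To put the timings in this table on a common footing, the short script below computes each version's speed-up relative to the naïve Python run. The numbers are copied from the slide, which does not state a unit; the variable names are illustrative.

```python
# Timings copied from the slide (the slide gives no unit).
timings = [
    ("01_NaivePython", 20.41),
    ("02_NaiveCython", 12.83),
    ("03_MoreCdefs", 8.54),
    ("04_RichCompare", 7.88),
    ("05_ALL_Cython", 1.91),
    ("06_MovePythonLoop", 1.5),
    ("07_Parallel", 0.11),
    ("08_icc", 0.078),
]

baseline = timings[0][1]
speedups = {name: baseline / t for name, t in timings}

for name, t in timings:
    print("{:<18} {:>7.3f} {:>7.1f}x".format(name, t, speedups[name]))
```

The largest single jump in the sequence comes from adding parallelism (06 to 07), followed by removing the remaining Python API calls (04 to 05).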

Page 67: Intel Python - Microsigma

Activity 6: Comparing Results

Page 68: Intel Python - Microsigma

In this activity we:

• Use Cython annotation to discover Python API calls in Cythonized code
• Use VTune to compare two sets of results

Purpose:

• Identifying Expensive APIs

• Understand the benefit of a particular code change


Page 69: Intel Python - Microsigma

Annotating the Cython code

build.sh

• The annotation option (cython -a) produces an HTML report

Page 70: Intel Python - Microsigma

Goal: get rid of the deep yellow!


Calls to the Python API can be expensive

Page 71: Intel Python - Microsigma

You can use VTune to help understand the benefits of your code changes

You can use the command line

amplxe-cl -r r001hs -r r002hs

But better to use the GUI!

Page 72: Intel Python - Microsigma

Demo 6: Comparison of two results

# Build & profile all the projects
$ cd dv/VTUNE-PYTHON/Mandelbrot
$ ./build_all.sh
$ ./profile_all.sh

# open the HTML file and look at the Cython annotated code

# open the GUI, and load the results from 04_RichCompare and 05_ALL_Cython

Page 73: Intel Python - Microsigma

11. Profile All

Page 74: Intel Python - Microsigma

Comparison Summary – Top Hotspots


Page 75: Intel Python - Microsigma

Comparison Summary - Platform Info


Page 76: Intel Python - Microsigma

Comparison Bottom-up


Page 77: Intel Python - Microsigma

Activity 7: Profiling threading

Page 78: Intel Python - Microsigma

12. Settings for last two activities

Page 79: Intel Python - Microsigma

You CAN thread code

• Global interpreter lock (GIL) is disabled
• Only in pure Cython code

Page 80: Intel Python - Microsigma

The generated code uses OpenMP


Page 81: Intel Python - Microsigma

Demo 8: Threading & Advanced Hotspot analysis

# Edit settings.py so that 'factor = 8'

# build the application
$ ./build.sh

# run to see how long it runs for
$ ./run.sh

# profile using advanced-hotspots
$ amplxe-cl -collect advanced-hotspots ./run.sh

# open the GUI and look at the results
$ amplxe-gui &

Page 82: Intel Python - Microsigma

Advanced Hotspot Analysis

amplxe-cl -collect advanced-hotspots ./run.sh

Page 83: Intel Python - Microsigma

Activity 8: Using the Intel compiler (rather than GCC)

Page 84: Intel Python - Microsigma

Using the Intel Compiler

export CC=icc
export CXX=icpc
export LDSHARED="icc -shared"
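If the build is launched from Python instead of a shell, the same three variables can be set on a copy of the environment and passed to the build process. This is a sketch under assumptions: the helper name intel_build_env is hypothetical, and the build command is taken to be the setup.py invocation shown for build.sh earlier in the deck.

```python
import os
import subprocess
import sys

def intel_build_env(base=None):
    """Copy of the environment with the compiler variables from the slide set."""
    env = dict(os.environ if base is None else base)
    env["CC"] = "icc"
    env["CXX"] = "icpc"
    env["LDSHARED"] = "icc -shared"
    return env

if __name__ == "__main__":
    if os.path.exists("setup.py"):
        # Same build step as build.sh, run with the Intel compiler selected.
        subprocess.run([sys.executable, "setup.py", "build_ext", "--inplace"],
                       env=intel_build_env(), check=True)
    else:
        print("no setup.py in this directory; nothing built")
```

Working on a copy keeps the change local to the build process, so the calling shell's compiler settings are left untouched.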


Page 85: Intel Python - Microsigma

Differences between v8 and v7

• All arithmetic instructions are scalar SSE instructions
• Some arithmetic instructions are packed AVX instructions (but still using 128-bit wide registers)

Page 86: Intel Python - Microsigma

Using the Intel Compiler (Windows)

SET CC=icl
SET CXX=icl
SET LD=xilink
SET AR=xilib

Page 87: Intel Python - Microsigma

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804


Page 88: Intel Python - Microsigma

Legal Notices & Disclaimers

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel, the Intel logo, Pentium, Celeron, Atom, Core, Xeon and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.

Page 89: Intel Python - Microsigma