Using jupyter notebooks on Blue Waters
https://goo.gl/4Eb7qw
Roland Haas (NCSA / University of Illinois), Email: [email protected]


Page 1:

Using jupyter notebooks on Blue Waters

https://goo.gl/4Eb7qw

Roland Haas (NCSA / University of Illinois), Email: [email protected]

Page 2:

Jupyter notebooks
● interactive, browser-based interface to Python
● freely mix
  – Python commands
  – shell commands
  – graphics
  – markdown
● explore data and debug scripts interactively
  – more memory (128 GB) than your laptop
  – no need to copy data off Blue Waters
● export results as a Python script
● share notebooks with collaborators
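The "export as a Python script" step is normally done with `jupyter nbconvert --to script notebook.ipynb`. As a stdlib-only sketch of what that conversion does (the notebook contents below are made up), one can pull just the code cells out of the notebook's JSON, since `.ipynb` files are plain JSON:

```python
import json

# a tiny hand-written notebook standing in for a real .ipynb file
nb = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Analysis\n"]},
        {"cell_type": "code", "source": ["x = 2\n", "print(x * x)\n"]},
    ],
    "nbformat": 4, "nbformat_minor": 5,
}

# keep only the code cells, as `jupyter nbconvert --to script` would
script = "".join("".join(c["source"]) for c in nb["cells"]
                 if c["cell_type"] == "code")
print(script)
```

The markdown cell is dropped and only executable code survives, which is what makes the exported script runnable outside the notebook.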

Page 3:

Three ways to use jupyter on Blue Waters
Jupyter starts a notebook server that one connects to using a browser.
● on a login node
  – good for quick, inexpensive tasks
  – simple to set up
● on a single compute node
  – for tasks longer than what the process killer allows
  – for tasks using large amounts of resources (IO, memory, CPU)
  – somewhat harder to set up
● on multiple compute nodes
  – large-scale Python applications using MPI, e.g. yt
  – task farms with many jobs
  – workflow systems like parsl
  – complex to set up

[Diagram: browser on laptop → ssh tunnel (TCP) → notebook server on Blue Waters → Python kernel; the kernel reads the notebook file]
Page 4:

Common setup
● the jupyter notebook server is provided by the bwpy module: module load bwpy
● the notebook server listens for connections using TCP
  – anyone on Blue Waters can connect
  – provides full access to your account
  – protect access using a password
● the password is set in jupyter_notebook_config.py, a file in the ~/.jupyter directory

from socket import *
ips = gethostbyname_ex(gethostname())[2]
for ip in ips:
    if ip.startswith("10."):
        internal_ip = ip
        break

from IPython.lib import passwd
c.NotebookApp.password = passwd("$passwd$")
c.NotebookApp.open_browser = False
c.NotebookApp.ip = internal_ip

● the config file is available on the portal: http://bluewaters.ncsa.illinois.edu/pythonnotebooks
● the notebook server listens on the internal network interface only
● the default password is $passwd$

Page 5:

Jupyter on login nodes

bw$ module load bwpy
bw$ jupyter notebook
The Jupyter Notebook is running at: http://10.0.0.147:8981/

laptop% ssh -L 8888:10.0.0.147:8981 bw.ncsa.illinois.edu
laptop% open http://127.0.0.1:8888

● the notebook server is accessible Blue Waters wide but not from the public internet
  – jupyter prints connection information to stdout on startup
  – use a second ssh connection to any login node to forward ports
  – on Windows, ssh -L 127.0.0.1:8888:10.0.0.147:8981 bw may be required

Page 6:

General issues and suggestions
● bwpy provides a large number of Python modules useful for exploring data
  – numpy, scipy, sympy
  – h5py, netCDF, gdal, pandas
  – PostCactus
  – matplotlib, yt, plotly
  – pip list shows all packages
● use the %matplotlib notebook magic command to show plots
● jupyter auto-saves notebooks in case the connection is dropped
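A minimal plotting cell using those modules might look like the sketch below (the data and output filename are made up; in a notebook the %matplotlib notebook magic replaces the explicit headless Agg backend used here):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook use %matplotlib notebook instead
import matplotlib.pyplot as plt
import numpy as np

# toy data standing in for simulation output
x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
fig.savefig("sine.png")  # hypothetical output file
```

With %matplotlib notebook the figure renders inline and stays interactive, so the savefig call is only needed when running outside the browser.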

Page 7:

Jupyter on compute nodes
● sample PBS script (note the ccmrun!)

#!/bin/bash
#PBS -l nodes=1:xe:ppn=32
#PBS -l walltime=2:0:0
module load ccm
module load bwpy
ccmrun jupyter-notebook

● uses the same jupyter_notebook_config.py as on the login node
● CCM is used to provide a cluster-like environment on the compute node
● use xe or xk nodes
● connection information is written to <JOBID>.bw.OU
● create port forwarding as before, this time to the compute node

laptop% ssh -L 8888:10.128.22.27:8981 bw.ncsa.illinois.edu
laptop% open http://127.0.0.1:8888

Page 8:

Parallelism in Python
● bag-of-jobs workloads
  – use the multiprocessing module
  – http://docs.python.org/3/library/multiprocessing.html
  – use multi-threading in compiled code
● more complex workflows
  – use parsl (http://parsl-project.org/)
  – install in a virtualenv using pip
● caveats
  – request all 32 (16) cores in the PBS script
  – avoid core binding (ccmrun is fine)
  – check whether you are compute- or IO-bound

Multiprocessing

import multiprocessing as mp

def sqr(x):
    return x*x

data = range(21)
pool = mp.Pool(processes=8)  # create the pool after defining sqr
squared = pool.map(sqr, data)
print(squared)

Parsl

from parsl import App, DataFlowKernel
import parsl.configs.local as lc
dfk = DataFlowKernel(lc.localThreads)

@App('python', dfk)
def sqr(x):
    return x*x

data = range(21)
squared = map(sqr, data)
print([i.result() for i in squared])
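The last caveat above, checking whether a task is compute- or IO-bound, can be probed by timing the same bag of jobs with different pool sizes (a sketch with a made-up CPU-bound toy task; for compute-bound work the wall time should shrink as processes are added, while for IO-bound or core-bound work it will not):

```python
import multiprocessing as mp
import time

def busy(n):
    # CPU-bound toy task: sum of squares up to n
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    for procs in (1, 2, 4):
        t0 = time.time()
        with mp.Pool(processes=procs) as pool:
            pool.map(busy, [200_000] * 8)
        print(procs, "processes:", round(time.time() - t0, 2), "s")
```

If the timings stay flat, adding more processes (or nodes) will not help, and the bottleneck is elsewhere, e.g. the filesystem.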

Page 9:

Multiple compute nodes using MPI
● PBS starts the notebook server: ccmrun jupyter-notebook
● this does not work with aprun: aprun -n2 jupyter-notebook starts 2 copies of the notebook server, not two copies of the python kernel
● need to start the kernel copies and hook them up to the server → handled by ipyparallel
● ipyparallel lets you use clusters with jupyter
● multiple parallelism backends
  – local processes → good for testing
  – MPI (and PBS) → Blue Waters
● supported in bwpy/1.1.1

[Diagram: browser on laptop → ssh tunnel (TCP) → notebook server on Blue Waters → Python kernel → MPI workers over TCP; the kernel reads the notebook file]

Page 10:

Setting up ipyparallel in your account

bw$ module load bwpy
bw$ jupyter serverextension enable --py ipyparallel
bw$ jupyter nbextension install --user --py ipyparallel
bw$ jupyter nbextension enable --py ipyparallel
bw$ ipython profile create --parallel --profile=pbs

● installs to $HOME/.local
  – cannot use virtualenv
● the "pbs" profile is in $HOME/.ipython/profile_pbs
  – ipcluster_config.py
  – ipengine_config.py
● enable MPI in ipengine_config.py: c.MPI.use = 'mpi4py'
● ipcluster_config.py contains
  – the launcher class (PBS)
  – the PBS script template
  – the port to listen for engines
● listen only on the internal BW network

Page 11:

Sample ipcluster_config.py file

from socket import *
ips = gethostbyname_ex(gethostname())[2]
for ip in ips:
    if ip.startswith("10."):
        internal_ip = ip
        break

c.HubFactory.ip = internal_ip

c.BatchSystemLauncher.queue = 'debug'

c.IPClusterEngines.engine_launcher_class = 'PBSEngineSetLauncher'

c.PBSEngineSetLauncher.batch_template = """#!/bin/bash
#PBS -q {queue}
#PBS -l walltime=00:30:00
#PBS -l nodes={n//4}:ppn=32:xe

module load bwpy bwpy-mpi

OMP_NUM_THREADS=8
aprun -n{n} -d$OMP_NUM_THREADS \
    bwpy-environ -- ipengine \
    --profile-dir={profile_dir}"""

Page 12:

Starting ipyparallel on compute nodes
● the ipcluster command in bwpy creates parallel workers

bw$ module load bwpy
bw$ bwpy-environ ipcluster start --n=4 --profile=pbs

● alternatively one can use the IPython Clusters tab in the jupyter notebook for the same purpose
● connect to the workers in the jupyter notebook

import ipyparallel as ipp
ranks = ipp.Client(profile='pbs')
ranks.ids

● if this is empty then the workers are not yet ready → wait and try again
● ipcluster has a --PBSLauncher.queue option which accepts any string for the {queue} placeholder. This allows for creative abuse.
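Since the engines can take minutes to come up through the PBS queue, the "wait and try again" step can be wrapped in a small polling helper (a sketch: `wait_for` and its timeout/interval values are made up, and the ipyparallel usage shown in the comment assumes the pbs profile above):

```python
import time

def wait_for(check, timeout=300, interval=5):
    # poll check() until it returns a non-empty result or the timeout expires
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = check()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("workers did not register in time")

# on Blue Waters (needs a running cluster):
#   import ipyparallel as ipp
#   ranks = ipp.Client(profile='pbs')
#   ids = wait_for(lambda: ranks.ids)
```

This avoids re-running the cell by hand while the PBS job waits in the queue.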

Page 13:

MPI parallel notebooks – examples (1/4)
● two sets of nodes
  – notebook and ipython kernel
  – MPI workers
● use the %%px cell magic command to execute code on the MPI ranks
● use mpi4py to access MPI
● ranks[rank]['var'] accesses var from rank rank
● worker variables persist between %%px cells
● output to stdout and stderr is forwarded

Page 14:

MPI parallel notebooks – examples (2/4)

%%px
from mpi4py import MPI
import numpy as np, numpy.random as rd
x = np.arange(10); y = rd.rand(len(x))
import matplotlib as mpl
mpl.use('agg')
import matplotlib.pyplot as plt
if MPI.COMM_WORLD.Get_rank() == 3:
    p = plt.plot(x, y)

%matplotlib notebook
ranks[3]['p'];

● can use matplotlib to visualize results
  – runs on all nodes by default
  – use the MPI rank to select a single node
● plots need to be stored and pulled to display
● use the %matplotlib notebook magic

Page 15:

MPI parallel notebooks – examples (3/4)

%%px
import yt
yt.enable_parallelism()

ds = yt.load("enzo_tiny_cosmology/D0046/D0046")
p = yt.ProjectionPlot(ds, "y", "density")

if yt.is_root():
    print("Redshift =", ds.current_redshift)
    p.show()

Page 16:

MPI parallel notebooks – examples (4/4)

%%px
import time
ds = yt.load("IsolatedGalaxy/galaxy0030/galaxy0030")
t2 = time.time()

sc = yt.create_scene(ds)
sc.camera.set_width(ds.quan(20, 'kpc'))
source = sc.sources['source_00']
tf = yt.ColorTransferFunction((-28, -24))
tf.add_layers(4, w=0.01)
source.set_transfer_function(tf)

sc.render()

if yt.is_root():
    sc.show()

Page 17:

Questions?

This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.

Page 18:

Using compatible Python versions
● bwpy/1.1.0 offers python 2.7, 3.5 and 3.6 and defaults to python 3.5
● the jupyter notebook uses python 3.6
● ipyparallel workers use python 3.5
→ this causes a hang or crash when accessing worker variables
→ adjust the jupyter kernel config in $HOME/.local/share/jupyter/kernels/python3/kernel.json

{"argv": [ "python3.5", "-m", "ipykernel_launcher",
           "-f", "{connection_file}" ],
 "display_name": "Python 3", "language": "python"}
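Editing the kernel spec can also be scripted. The sketch below writes a throw-away copy of kernel.json to a temporary directory and pins its interpreter; on Blue Waters the real file is the $HOME/.local/... path above, and the temp-dir indirection here is only so the sketch is self-contained:

```python
import json
import os
import tempfile

# throw-away copy of the kernel spec (the real one lives under
# $HOME/.local/share/jupyter/kernels/python3/kernel.json)
spec = os.path.join(tempfile.mkdtemp(), "kernel.json")
with open(spec, "w") as f:
    json.dump({"argv": ["python3", "-m", "ipykernel_launcher",
                        "-f", "{connection_file}"],
               "display_name": "Python 3", "language": "python"}, f)

# pin the interpreter so the notebook kernel and ipyparallel engines
# run the same python version
with open(spec) as f:
    cfg = json.load(f)
cfg["argv"][0] = "python3.5"
with open(spec, "w") as f:
    json.dump(cfg, f, indent=1)
```

Only `argv[0]` needs to change; the `{connection_file}` placeholder is filled in by jupyter at kernel start and must be left as-is.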