training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · training...

35
Training deep convolutional neural networks for classification of multi-scale, nonlocal data in fusion energy, using the Pytorch framework R.M. Churchill 1 , the DIII-D team Special thanks to: DIII-D team generally, specifically Ben Tobias 1 , Yilun Zhu 2 , Neville Luhmann 2 , Dave Schissel 3 , Raffi Nazikian 1 , Cristina Rea 4 , Bob Granetz 4 PPPL colleagues: CS Chang 1 , Bill Tang 1 , Julian Kates-Harbeck 1,5 , Ahmed Diallo 1 , Ken Silber 1 Princeton University Research Computing 6 https://deepmind.com/blog/wavenet-generative-model-raw-audio/ 1 2 3 4 6 5

Upload: others

Post on 30-Apr-2020

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Training deep convolutional neural networks for

classification of multi-scale, nonlocal data in fusion

energy, using the Pytorch framework

R.M. Churchill1, the DIII-D teamSpecial thanks to: ● DIII-D team generally, specifically Ben Tobias1,

Yilun Zhu2, Neville Luhmann2, Dave Schissel3, Raffi Nazikian1, Cristina Rea4, Bob Granetz4

● PPPL colleagues: CS Chang1, Bill Tang1, Julian Kates-Harbeck1,5, Ahmed Diallo1, Ken Silber1

● Princeton University Research Computing6

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

1

2

3

4

6

5

Page 2: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

The Princeton Plasma Physics Laboratory is a world-class fusion energy research laboratory dedicated to developing the scientific and technological knowledge base for fusion energy as a safe, economical and environmentally attractive energy source for the world’s long-term energy requirements.

Page 3: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Disruptions need to be avoided in fusion machines

Page 4: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Why you should use Pytorch...

Page 5: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Outline

● Neural network architectures for multi-scale, non-local data

● Initial results with ECEi for Disruption Prediction

● Using Pytorch

Page 6: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Motivation: Automating classification of fusion plasma phenomena is complicated ● Fusion plasmas exhibit a range of physics over different time and

spatial scales● Fusion experimental diagnostics are disparate, and increasingly

high time resolution● How can we automate identification of important plasma

phenomena, for example oncoming disruptions?

Figure: [F. Poli, APS DPP 2017]

Page 7: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Neural networks can be thought of as series of filters whose weights are “learned” to accomplish a task

● Fusion experiment/simulation have a wide variety of data analysis pipelines, use prior knowledge to get result

● Neural networks (NN) have a number of layers of “weights” which can be viewed as filters (esp. Convolutional NN). But these filters are taught how to map given input through a complicated non-linear function to a given output.

Low pass filter EFITInternal Inductance

Input (magnetics)

https://becominghuman.ai/deep-learning-made-easy-with-deep-cognition-403fbe445351

Page 8: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Challenges for RNN/LSTM on long sequences

● Typical, popular sequence NN like LSTM in principle are sensitive to infinite sequence length, due to memory cell technique

● However, in practice they tend to “forget” for phenomena with sequence length >1000 (approximate, depends on data)

● If characterising a sequence requires Tlong seconds, and short-scale phenomena of time-scale Tshort are important in the sequence, to use an LSTM requires

● Various NN architectures enable learning on long sequences (CNN with dilated convolutions, attention, etc.)

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 9: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Dilated convolutions enable efficient training on long sequences● One difficulty using CNNs with

causal filters is they require large filters or many layers to learn from long sequences ○ Due to memory constraints, this

becomes infeasible

● A seminal paper [*] showed using dilated convolutions (i.e. convolution w/ defined gaps) for time series modeling could increase the NN receptive field, reducing computational and memory requirements, and allowing training on long sequences

Normal convolution

Dilated convolution

[* A. Van Den Oord, et. al., WaveNET: A Generative Model for Raw Audio, 2016]

Page 10: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Temporal Convolutional Networks

● Temporal Convolutional Network (TCN) architecture [*] combines causal, dilated convolutions with additional modern NN improvements (residual connections, weight normalization)

● Several beneficial aspects compared to RNN’s:○ Empirically TCN’s exhibit longer memory (i.e. better for long

sequences)○ Non-sequential, allows parallelized training and inference○ Require less GPU memory for training

[* Bai, J.Z. Kolter, V. Koltun, http://arxiv.org/abs/1803.01271(2018)]

Page 11: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Outline

● Neural network architectures for multi-scale, non-local data

● Initial results with ECEi for Disruption Prediction

● Using Pytorch

Page 12: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Machine learning for Disruption Prediction

● Predicting (and understanding?) disruptions is a key challenge for tokamak operation, a lot of ML research has been applied [Vega Fus. Eng., 2013 , Rea FST 2018, Kates-Harbeck Nature 2019, D. Ferreira arxiv 2018]○ Most ML methods use processed 0-D signals (e.g. line averaged

density, locked mode amplitude, internal inductance, etc.)○ Can we apply deep CNNs directly to diagnostic outputs for improved

disruption prediction?● Electron Cyclotron Emission imaging (ECEi) diagnostic has

temporal & spatial sensitivity to disruption markers [Choi NF 2016]

ITER Physics Basis, Chapter 3 Nucl. Fusion 39 (1999) 2251–2389.

disr

uptio

n

ECEi data near disruption

Page 13: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

DIII-D Electron Cyclotron Emission Imaging (ECEi)

● ECEi characteristics:○ Measures electron temperature, Te○ Time resolution (1 MHz) enabling measurement of δTe on turbulent

timescales○ Digitizer sufficient to measure entire DIII-D discharge (~O(5s))○ 20 x 8 channels for spatial resolution○ Some limitations due to signal cutoff above certain densities

● Sensitive to a number of plasma phenomena, e.g.○ Sawteeth○ Tearing modes○ ELM’s

https://sites.google.com/view/mmwave/research/advanced-mmw-imaging/ecei-on-diii-d

● Due to high temporal resolution (long time sequences), and spatial resolution, ECEi is a good candidate for applying end-to-end TCN

[B. Tobias et al., RSI (2010)]

Page 14: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Dataset and computation

● Database of ~3000 shots (~50/50 non-disruptive/disruptive) with good ECEi data created from the Omfit DISRUPTIONS module shot list [E. Kolemen, et. al.]○ “Good” data defined as all channels have SNR>3, avoid

discharges where 2nd harmonic ECE cutoff● ECEi data (~10 TB) transferred to Princeton TigerGPU cluster for

distributed training (320 nVidia P100 GPU’s, 4 GPU’s per compute node)

Page 15: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Setup for training neural network

● Each time point is labeled as “disruptive” or “non”. For a disruptive shot, all time points 300ms or closer to disruption are labelled “disruptive” ○ Times before 350ms have similar distribution

to non-disruptive discharges [Rea FST 2018]● Binary classification problem

(disruptive/non-disruptive time slice)● Overlapping subsequences of length

>> receptive field are created, length mainly set by GPU memory constraints

Figure: Rea, FST, 2018

Near disruption

Far from disruption

Non-disrupted

Receptive field(# inputs needed to make 1 prediction)

Page 16: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

● Starting with smaller subsets of data, working up.● Data setup

○ Downsampled to 100 kHz○ Sequences broken up into subsequences of 78,125 (781ms)○ Undersampled subsequence training dataset so that 50/50 split in

non-disruptive/disruptive subsequences (natural class imbalance ~5% disruptive subsequences).

○ Weighted loss function for 50/50 balancing of time slices classes○ Full 20 x 8 channels used (but no 2D convolutions)○ Data normalized with z-normalization (y - mean(y))/std(y)

Setup for training neural network

● TCN setup:○ Receptive field ~30,000 i.e. 300ms (each

time slice prediction based on receptive field)

○ 4 layers, dilation 10, kernel size 15, hidden nodes 80 per layer

Page 17: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Current, initial results

● Training on the subset of data, the loss does continually decrease, suggesting the network has the capacity necessary to capture and model disruptions with the ECEi data

● F1-score is ~91%, accuracy ~94%, on individual time slices. ○ Additional regularization and/or training with larger dataset can

help improve.● Run on 16 GPU’s for 2 days.

Page 18: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Future Possibilities

● Deep CNN architectures (e.g. TCN) can be applied to many fusion sequence diagnostics, e.g. magnetics, bolometry, etc.○ Tying together multiple diagnostics in a single or multiple neural

networks can give enhanced possibilities○ Can be used to create “automated logbook”, to enable

researchers to quickly find discharges with phenomena of interest. Especially important for longer pulses + more diagnostics + higher time resolution diagnostics

● Transfer learning can be explored for quickly re-training CNN on a different machine with few examples

● Large batch training for distributed learning ○ Very hot topic now in ML/DL community, how to quickly and

efficiently train NN? Especially important since training often requires many iterations on hyperparameters

Page 19: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Outline

● Neural network architectures for multi-scale, non-local data

● Initial results with ECEi for Disruption Prediction

● Using Pytorch

Page 20: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Getting started with Pytorch

● Once you know Python (+NumPy), most of Pytorch will be intuitive (with some exceptions)

● Pytorch.org has great documentation, decent tutorials (some outdated), and generally useful User Forum

● For TigerGPU, make sure you load:○ anaconda3○ cudatoolkit/10.0○ cudnn/cuda-10.0○ ->then install Pytorch according to website

● For distributed training examples, highly recommend the Pytorch Imagenet example (https://github.com/pytorch/examples/tree/master/imagenet)

● Lots of good examples available (see e.g. https://paperswithcode.com/, or new Pytorch Hub)

Page 21: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Custom Datasets in Pytorch

● If your dataset isn’t ImageNet, or other predefined datasets that Pytorch offers, creating a custom data loader is straightforward:○ Define your “__init__” method, with any parameters or metadata○ Define a “__len__” method, returning the number of samples○ Define a “__getitem__(index)” method, which returns a sample for

a given index● This allows a lot of flexibility in terms of data file type, size, etc.

Anyway of reading data in Python can be used within custom DataLoaders.

● Any correctly created Dataset object can be passed to the Pytorch DataLoader, defining things like batch_size, samplers, data loader num_workers, etc.)○ One caution: still appears to be a bug at times for num_workers>0

Page 22: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Custom Data Samplers

● The torch DataLoader (highly recommend!) allows inputing a sampler object (defines how to draw samples from dataset each epoch)○ Pytorch already has a number of predefined samplers, including a

DistributedSampler● Custom DataSamplers can be written, inheriting from the other

Samplers and rewriting methods needed (__init__, __iter__, __len__)

● This allowed creating a stratified sampler, to ensure each batch received balanced labels, and ensure no bleeding from training to validation and test sets.

Page 23: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Pytorch model size estimation

● A common error you might get is “OOM”, out-of-memory, due to your neural network size○ This is (usually) dominated by gradients for backpropagation, not

by parameter size itself● There are ways to estimate before hand the memory required

○ See great blog: ○ Code implemented here:

(though this tended to be too high in practice for me)● Brute force method is to load a model onto GPU, begin training,

and monitor GPU memory usage with “nvidia-smi”○ This code can be added to your bash script for running in parallel

to your Pytorch code

http://jacobkimmel.github.io/pytorch_estimating_model_size/

https://github.com/sksq96/pytorch-summary

Page 24: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Distributed training in PytorchMain Pytorch code

Rank0 Rank1 RankN

GPU0

GPU2 GPU3

GPU1 GPU0

GPU2 GPU3

GPU1 GPU0

GPU2 GPU3

GPU1

Rank per compute node, torch.multiprocessing launches 1 process per 4 GPU’s. Allows sharing memory between processes.

Main Pytorch code

GPU0

GPU2 GPU3

GPU1

GPU3GPU3

Rank per GPU, no multiprocessing

Rank0

Rank2 Rank3

Rank1

GPU0

GPU2 GPU3

GPU1

Rank4

Rank6 Rank7

Rank5

GPU0

GPU2 GPU3

GPU1

RankN-4

RankN-2

RankN-1

RankN-3

How Pytorch distributed recommends

How I could get Pytorch distributed to work on TigerGPY

Main issue with torch.multiprocessing was deadlocks

Page 25: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Distributed training on TigerGPU with Pytorch● Use Pytorch distributed package

○ Easy SLURM integration○ If using file init_method, make sure

file name changes each run (otherwise can have issues)● Start with DistributedDataParallel

○ Efficient for data parallel training (I haven’t tried new model parallel just out in Pytorch 1.1)

○ Should be able to do everything DataParallel does, and more● Stay away from torch.multiprocessing (or tell me how it works)● Pytorch distributed package recommends NCCL backend for GPU

systems with Infiniband, but also works on TigerGPU○ GLOO often gave deadlocks, and was very opaque to debug. NCCL

once working worked consistently, and was ~1.5x faster

Use NOTNCCL GLOO

Page 26: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Conclusions

● Deep convolutional neural networks offer the promise of identifying multi-scale plasma phenomena, using end-to-end learning to work with diagnostic output directly○ TCN architecture with causal, dilated convolutions allows

predictions to be sensitive to longer time sequences while maintaining computational efficiency

● Initial work training a TCN on reduced ECEi datasets yields promising results for the ability of the TCN architecture to train a disruption predictor based on the ECEi data alone.

● Pytorch offers a flexible, performant architecture for tackling YOUR problem

Page 27: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

References

1. https://deepmind.com/blog/wavenet-generative-model-raw-audio/

2. F. Poli, APS DPP (2017)

3. https://becominghuman.ai/deep-learning-made-easy-with-deep-cognition-403fbe445351

4. http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html

5. https://developers.google.com/machine-learning/practica/image-classification/convolutional-neural-networks

6. A. Van Den Oord, et. al., https://arxiv.org/pdf/1609.03499.pdf (2016),

7. S. Bai, et. al., http://arxiv.org/abs/1803.01271 (2018).

8. ITER Physics Basis, Chapter 3, Nucl. Fusion 39 (1999) 2251–2389.

9. J. Vega, et. al., Fusion Eng. Des. 88 (2013) 1228–1231.

10. C. Rea, et. al., Fusion Sci. Technol. (2018) 1–12.

11. Kates-Harbeck, et. al., submitted (2018)

12. D. Ferreira, http://arxiv.org/abs/1811.00333, (2018)

13. M.J. Choi, et. al., Nucl. Fusion 56 (2016) 066013.

14. https://sites.google.com/view/mmwave/research/advanced-mmw-imaging/ecei-on-diii-d

15. B. Tobias, et. al., Rev. Sci. Instrum. 81 (2010) 10D928.

Page 28: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

END PRESENTATION

Page 29: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Deep learning enables end-to-end learning

● Traditional machine learning focused on hand developed features (e.g. shape in an image) to train shallow NN or other ML algorithms

● Deep learning (multiple layer NN) enable end-to-end learning, where higher dimensional features (e.g. pixels in an image) are input directly to the NN

Page 30: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Convolutional neural networks (CNN)

http://deeplearning.net/software/theano/tutorial/conv_arithmetic.htmlhttps://developers.google.com/machine-learning/practica/image-classification/convolutional-neural-networks

● CNN’s very successful in image classification, whereas Recurrent NN (e.g. LSTM) are often used for sequence classification (e.g. time series)

● But viewing NN as “filters”, no reason CNN can’t be applied to sequence machine learning also

Page 32: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Alex Krizhevsky, et. al., NIPS 2012

1st layers of CNN often exhibit basic filter charactersitics, e.g. edge or color filters

Page 33: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

● Shot where disruption alarm triggered >> 300ms before disruption due to very similar behavior in that time region to just before the disruption (drop in Te, followed by recovery)

Target/prediction

ECEi data

Sequence ind

Target and prediction for disruptive shot

Page 34: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

(previous results)

● F1-score is ~86%, accuracy ~91%, on individual time slices. ○ Additional regularization and/or training with larger dataset can

help improve.● Run on 16 GPU’s for 2 days.

Page 35: Training deep convolutional neural networks for ...jdh4/churchill_deep_learning_user... · Training deep convolutional neural networks for classification of multi-scale, nonlocal

Training deep convolutional neural networks for classification of multi-scale, nonlocal data in fusion energy, using the Pytorch framework

R. Michael Churchill, Princeton Plasma Physics Laboratory

Fusion plasmas exhibit phenomena over a wide range of time and spatial scales, and a number of different sensors are used in fusion energy experiments to observe these phenomena. Recently, many deep neural network architectures have been developed to work directly with such multi-scale, nonlocal data, including deep convolutional neural networks with dilated convolutions[1,2], and transformer networks[3]. I’ll show how these networks have been applied to the raw data of a high sample rate fusion experiment diagnostic (ECEi), to predict one of the most pressing issues with magnetic confinement fusion experiments: sudden, violent instabilities known as disruptions. Along the way I’ll share how Pytorch was used, and lessons learned with the framework, including custom data classes, custom DistributedSamplers, and distributed training on TigerGPU.

[1] A. Van Den Oord, et. al., ArXiv E-Prints (2016) arXiv:1609.03499.

[2] S. Bai, et. al., ArXiv E-Prints (2018) arXiv:1803.01271.

[3] R. Child, et. al., ArXiv E-Prints (2019) arXiv:1904.10509.