simulating a quantum annealer with gpu-based monte carlo...
TRANSCRIPT
Simulating a quantum annealer withGPU-based Monte Carlo algorithmsMayssam Mohammadi NevisiMani RanjbarJames KingSheir YarkoniJeremy P. HiltonCatherine C. McGeoch
April 6, 2016
©2016 D-Wave Systems Inc. All rights reserved.
Introduction
2 / 27
©2016 D-Wave Systems Inc. All rights reserved.
D-Wave QPU
I Quantum annealing chipI Highly specialized co-processorI Physical implementation of an
NP-hard optimization problemI Physical heuristic algorithm runs
on the chip
3 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Ising Minimization
Given:I A graph G = (V , E)
I A collection of weights h = {hi : i ∈ V} andJ = {Jij : (i , j) ∈ E} (the Hamiltonian)
Assign:I Values from {−1, +1} to n spin variables s = {si}
Such that we minimize the energy function:
E(s) =∑i∈V
hisi +∑
(i,j)∈E
Jijsisj .
4 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Chimera topology
I Ck is a k × k grid of dense K4,4“unit cells”
Processor Topology QubitsD-Wave One C4 128D-Wave Two C8 512D-Wave 2X C12 1152
I Chimera topologies are bipartiteI Any graph can be embedded in
a Chimera graph via minorembedding
5 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Simulated (Thermal) Annealing
I Heuristic optimization algorithmthat simulates classical thermalannealing
I System of spins moves randomlyin state space
I Cools slowly from hot(random/explorative) to cold(greedy/exploitative)
I Uses thermal activation to jumpover energy barriers
Thermal Jump
Configuration(state)
Energy(cost)
6 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Quantum Annealing
I Quantum annealing (QA) isrelated to adiabatic quantumcomputing (AQC)
H(t) = A(t) · Hinit + B(t) · Hprob.
I Takes advantage of thermalactivation just like classicalannealing
I Also has a new complementaryresource: quantum tunneling.
Thermal Jump
QuantumTunneling
Configuration(state)
Energy(cost)
7 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Motivation for GPU Solvers
8 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Why develop optimized GPU implementations?
I Quantum computers arehard to simulate
I Even approximatesimulations via MonteCarlo methods can be slow
Between somequantiles and systemsizes we observe aprefactor advantage[for D-Wave] as highas 108.- Denchev et al. (2015)
9 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Why develop optimized GPU implementations?
I Software solvers slow down our experiments”This experiment occupied millions of processor coresfor several days to tune and run the classicalalgorithms for these benchmarks.”
- Denchev et al. (2015)
I Faster solvers → faster experimental cycle → improvedunderstanding of our chips
I Fast GPU simulation leads to better quantum computers!
10 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Algorithms and GPU suitability
I Good/interesting classical solvers forChimera Ising problems fall into twocategories:
I Low-treewidth local searchI Single-spin Monte Carlo algorithms
I Low-treewidth local search is notsuitable.
I Memory requirements are too highI Limited parallelizability.
I Single-spin Monte Carlo algorithms areideal!
I Very low memory requirementsI Highly parallelizable.
11 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Algorithms
12 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Simulated Annealing
I Single-spin updatesI Flipping this spin would lead to a change in energy ∆EI Probability of accepting the spin flip is min(1, e−β∆E )
Algorithm 1 Simulated Annealing1: for each sample to be taken do2: for i = 1 to num sweeps do3: β := betas[i]4: for spin in spins do5: calculate ∆Espin6: flip spin with probability min(1, e−β∆Espin )7: end for8: end for9: end for
sam
ples
swee
pssp
ins
bipartite graph means half of thespin updates can be done in parallel
13 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Parallel Tempering
I Instead of one Markov chain thatslowly goes from high to lowtemperature:
I Use an ensemble offixed-temperature Markov chains(“replicas”)
I Replicas form a “temperatureladder”
I Replicas can exchangetemperatures with neighbouringchains on the ladder withprobability min(1, e(Ei−Ej )(βi−βj ))
14 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Approximate Simulations of Quantum Annealing
Quantum Monte Carlo†
I Many replicas of thesystem (Trotter slices)representing differentpoints in imaginary time
I Path-integral Monte Carlomethod
I We implement the ‘discretetime’ variant
† QMC can reproduce QAequilibrated statistics, but doesn’tsimulate its dynamics.
Spin Vector Monte Carlo
I Mean-field approximationI Simulates coherence but
no entanglementI Each spin is represented
by an angle
15 / 27
©2016 D-Wave Systems Inc. All rights reserved.
GPU Simulated AnnealingImplementation
16 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Thread Structure — Hamiltonian
I One unit cell per threadI Cell Hamiltonian stored as floats in
40 registers
8 fields (h)16 in-tile couplings (J)
+ 16 inter-tile couplings† (J)40 registers
I Compiler uses additional 39 registersper thread
† Each inter-tile coupling is stored in two threads
17 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Thread Structure — States
I Each state is +1 or −1I Each state is accessed by multiple
threads for energy calculation
States must be stored inshared memory!
I 8k2 states per sampleI Storing as floats is faster than
packing bits; registers are still thelimiting factor†
† For parallel tempering and quantum MonteCarlo we pack bits because we have up to 64replicas
18 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Block Structure
I 79 registers per threadI k2 threads per sampleI 65,536 registers per SM (Maxwell)I Each SM can run
⌊ 65,53679k2
⌋samples in
parallel
Topology C4 C8 C12
Concurrentsamples 51 12 5per SM
19 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Fast Random Number Generation
I A significant fraction of running time isused to generate random numbers.
I We use xorshift random numbergenerators
I 2-3 times faster than cuRandI Imperfect but still suitable for
applications that are not highly sensitiveto RNG quality.
20 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Fast Approximations of Mathematical Functions
I Exponentiation is necessary todetermine flip probabilities
I Sine and cosine are used in Spin VectorMonte Carlo
I CPU implementations often cachefunction values in lookup tables
I Not feasible for GPUs due to memoryrestrictions
I CUDA to the rescue! Intrinsic fast mathfunctions are:
I Faster than regular math functions orTaylor approximations
I Accurate enough for our Monte Carloalgorithms
x
f (x)
f (x) = sin x
f (x) = min(1, e−x )
21 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Results
22 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Implementation Speeds
I Code is still being fine-tunedI Significant speedup over CPU seen in all four algorithmsI Huge spin flip/nanosecond/dollar improvement over CPUsI Actual numbers to be released in a forthcoming paper
23 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Breakdown of Runtime — Simulated Annealing
35%
25%
35%
5%
Delta energy calculationRandom number generationSpin flip tests (exp function)Other
24 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Conclusion
25 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Recap
I Quantum processors are very hard to simulate classicallyI Monte Carlo algorithms are among the best tractable
approximationsI Monte Carlo algorithms with single-spin updates are ideal
for GPUI We can achieve significant speedups even over a more
expensive CPU
26 / 27
©2016 D-Wave Systems Inc. All rights reserved.
Looking to the Future
I Future D-Wave chips will be biggerand denser
I Future NVIDIA chips will be biggerand faster (more registers per SM?)
I GPUs should continue to beat CPUsfor Monte Carlo algorithms withsingle-spin updates
I Algorithms with low-treewidthupdates unlikely to become feasiblefor GPUs
27 / 27