simulating a quantum annealer with gpu-based monte carlo...

Simulating a quantum annealer withGPU-based Monte Carlo algorithmsMayssam Mohammadi NevisiMani RanjbarJames KingSheir YarkoniJeremy P. HiltonCatherine C. McGeoch

April 6, 2016

©2016 D-Wave Systems Inc. All rights reserved.

Introduction

2 / 27


D-Wave QPU

I Quantum annealing chipI Highly specialized co-processorI Physical implementation of an

NP-hard optimization problemI Physical heuristic algorithm runs

on the chip

3 / 27


Ising Minimization

Given:I A graph G = (V , E)

I A collection of weights h = {hi : i ∈ V} andJ = {Jij : (i , j) ∈ E} (the Hamiltonian)

Assign:I Values from {−1, +1} to n spin variables s = {si}

Such that we minimize the energy function:

E(s) =∑i∈V

hisi +∑

(i,j)∈E

Jijsisj .

4 / 27


Chimera topology

I Ck is a k × k grid of dense K4,4“unit cells”

Processor Topology QubitsD-Wave One C4 128D-Wave Two C8 512D-Wave 2X C12 1152

I Chimera topologies are bipartiteI Any graph can be embedded in

a Chimera graph via minorembedding

5 / 27


Simulated (Thermal) Annealing

I Heuristic optimization algorithmthat simulates classical thermalannealing

I System of spins moves randomlyin state space

I Cools slowly from hot(random/explorative) to cold(greedy/exploitative)

I Uses thermal activation to jumpover energy barriers

Thermal Jump

Configuration(state)

Energy(cost)

6 / 27


Quantum Annealing

I Quantum annealing (QA) isrelated to adiabatic quantumcomputing (AQC)

H(t) = A(t) · Hinit + B(t) · Hprob.

I Takes advantage of thermalactivation just like classicalannealing

I Also has a new complementaryresource: quantum tunneling.

Thermal Jump

QuantumTunneling

Configuration(state)

Energy(cost)

7 / 27


Motivation for GPU Solvers

8 / 27


Why develop optimized GPU implementations?

I Quantum computers arehard to simulate

I Even approximatesimulations via MonteCarlo methods can be slow

Between somequantiles and systemsizes we observe aprefactor advantage[for D-Wave] as highas 108.- Denchev et al. (2015)

9 / 27


Why develop optimized GPU implementations?

I Software solvers slow down our experiments”This experiment occupied millions of processor coresfor several days to tune and run the classicalalgorithms for these benchmarks.”

- Denchev et al. (2015)

I Faster solvers → faster experimental cycle → improvedunderstanding of our chips

I Fast GPU simulation leads to better quantum computers!

10 / 27


Algorithms and GPU suitability

I Good/interesting classical solvers forChimera Ising problems fall into twocategories:

I Low-treewidth local searchI Single-spin Monte Carlo algorithms

I Low-treewidth local search is notsuitable.

I Memory requirements are too highI Limited parallelizability.

I Single-spin Monte Carlo algorithms areideal!

I Very low memory requirementsI Highly parallelizable.

11 / 27


Algorithms

12 / 27


Simulated Annealing

I Single-spin updatesI Flipping this spin would lead to a change in energy ∆EI Probability of accepting the spin flip is min(1, e−β∆E )

Algorithm 1 Simulated Annealing1: for each sample to be taken do2: for i = 1 to num sweeps do3: β := betas[i]4: for spin in spins do5: calculate ∆Espin6: flip spin with probability min(1, e−β∆Espin )7: end for8: end for9: end for

sam

ples

swee

pssp

ins

bipartite graph means half of thespin updates can be done in parallel

13 / 27


Parallel Tempering

I Instead of one Markov chain thatslowly goes from high to lowtemperature:

I Use an ensemble offixed-temperature Markov chains(“replicas”)

I Replicas form a “temperatureladder”

I Replicas can exchangetemperatures with neighbouringchains on the ladder withprobability min(1, e(Ei−Ej )(βi−βj ))

14 / 27


Approximate Simulations of Quantum Annealing

Quantum Monte Carlo†

I Many replicas of thesystem (Trotter slices)representing differentpoints in imaginary time

I Path-integral Monte Carlomethod

I We implement the ‘discretetime’ variant

† QMC can reproduce QAequilibrated statistics, but doesn’tsimulate its dynamics.

Spin Vector Monte Carlo

I Mean-field approximationI Simulates coherence but

no entanglementI Each spin is represented

by an angle

15 / 27


GPU Simulated AnnealingImplementation

16 / 27


Thread Structure — Hamiltonian

I One unit cell per threadI Cell Hamiltonian stored as floats in

40 registers

8 fields (h)16 in-tile couplings (J)

+ 16 inter-tile couplings† (J)40 registers

I Compiler uses additional 39 registersper thread

† Each inter-tile coupling is stored in two threads

17 / 27


Thread Structure — States

I Each state is +1 or −1I Each state is accessed by multiple

threads for energy calculation

States must be stored inshared memory!

I 8k2 states per sampleI Storing as floats is faster than

packing bits; registers are still thelimiting factor†

† For parallel tempering and quantum MonteCarlo we pack bits because we have up to 64replicas

18 / 27


Block Structure

I 79 registers per threadI k2 threads per sampleI 65,536 registers per SM (Maxwell)I Each SM can run

⌊ 65,53679k2

⌋samples in

parallel

Topology C4 C8 C12

Concurrentsamples 51 12 5per SM

19 / 27


Fast Random Number Generation

I A significant fraction of running time isused to generate random numbers.

I We use xorshift random numbergenerators

I 2-3 times faster than cuRandI Imperfect but still suitable for

applications that are not highly sensitiveto RNG quality.

20 / 27


Fast Approximations of Mathematical Functions

I Exponentiation is necessary todetermine flip probabilities

I Sine and cosine are used in Spin VectorMonte Carlo

I CPU implementations often cachefunction values in lookup tables

I Not feasible for GPUs due to memoryrestrictions

I CUDA to the rescue! Intrinsic fast mathfunctions are:

I Faster than regular math functions orTaylor approximations

I Accurate enough for our Monte Carloalgorithms

x

f (x)

f (x) = sin x

f (x) = min(1, e−x )

21 / 27


Results

22 / 27


Implementation Speeds

I Code is still being fine-tunedI Significant speedup over CPU seen in all four algorithmsI Huge spin flip/nanosecond/dollar improvement over CPUsI Actual numbers to be released in a forthcoming paper

23 / 27


Breakdown of Runtime — Simulated Annealing

35%

25%

35%

5%

Delta energy calculationRandom number generationSpin flip tests (exp function)Other

24 / 27


Conclusion

25 / 27


Recap

I Quantum processors are very hard to simulate classicallyI Monte Carlo algorithms are among the best tractable

approximationsI Monte Carlo algorithms with single-spin updates are ideal

for GPUI We can achieve significant speedups even over a more

expensive CPU

26 / 27


Looking to the Future

I Future D-Wave chips will be biggerand denser

I Future NVIDIA chips will be biggerand faster (more registers per SM?)

I GPUs should continue to beat CPUsfor Monte Carlo algorithms withsingle-spin updates

I Algorithms with low-treewidthupdates unlikely to become feasiblefor GPUs

27 / 27

simulating a quantum annealer with gpu-based monte carlo...

Documents