Post on 19-Dec-2015
TRANSCRIPT
Summary

Problem:
- Exponential performance gap: computer architectures transitioned from exponential frequency scaling to parallelism, ending decades of free exponential performance gains.
- The natural "MapReduce" Belief Propagation (BP) algorithm is embarrassingly parallel but highly inefficient (asymptotically slower than efficient sequential algorithms).

Solution:
- Explore the limiting sequential structure using chain graphical models.
- Introduce an approximation which improves parallel performance.
- Propose ResidualSplash, a new parallel dynamic BP algorithm, and show that it performs optimally on chain graphical models in the approximate inference setting.

Results:
- We demonstrate that our new algorithm outperforms existing techniques on two real-world tasks.
Many Core Revolution

Transition from exponential frequency scaling to exponential parallelism.

[Figure: number of cores (1 to 512) versus year (1970-2005 and beyond) for processors ranging from the Intel 4004, 8008, and 8086 through the Pentium, Athlon, Itanium, and Opteron lines up to many-core designs such as Niagara, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, Cisco CSR-1, Picochip PC102, Broadcom 1480, and Ambric AM2045. Single-processor performance flattens while parallel performance keeps growing, leaving an exponentially growing gap. Graph courtesy of Saman Amarasinghe.]
Inference in Markov Random Fields

Pairwise Markov Random Field (MRF):
- Graph encoding conditional independence assumptions
- Factors encoding functional dependencies

$$P(X_1, \dots, X_N) \propto \prod_{i \in V} \psi_i(x_i) \prod_{\{i,j\} \in E} \psi_{i,j}(x_i, x_j)$$

- Unary potentials: $\psi_i(x_i) = \mathcal{N}(o_i; \sigma^2)$
- Binary potentials: $\psi_{i,j}(x_i, x_j) = \begin{cases} 1 & x_i = x_j \\ e^{-\lambda} & x_i \neq x_j \end{cases}$

[Figure: image denoising example; a 3x3 grid MRF over variables X1-X9, one per pixel, maps a noisy image to a predicted image.]

Inference objective: compute the marginal distribution of all variables.
Loopy Belief Propagation:
- Approximate inference method
- Exact on trees

Message update:

$$m_{i \to j}(x_j) \propto \sum_{x_i \in \mathcal{A}_i} \psi_{i,j}(x_i, x_j)\, \psi_i(x_i) \prod_{k \in \Gamma_i \setminus j} m_{k \to i}(x_i)$$

At convergence:

$$P(X_i = x_i) \approx b_i(x_i) \propto \psi_i(x_i) \prod_{k \in \Gamma_i} m_{k \to i}(x_i)$$

[Figure: star graph over X1-X5 illustrating messages such as $m_{3 \to 2}(x_2)$, $m_{4 \to 2}(x_2)$, and $m_{2 \to 1}(x_1)$.]
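As a concrete illustration, the message update and belief equations above can be sketched in NumPy on a tiny chain-structured MRF. Everything here is illustrative (the three-variable model, potentials, and function names are ours, not the paper's implementation):

```python
import numpy as np

# Illustrative 3-variable chain MRF: X1 - X2 - X3, each with arity A.
A, lam = 3, 1.0
psi_edge = np.where(np.eye(A) == 1, 1.0, np.exp(-lam))  # psi_{ij}(x_i, x_j)
rng = np.random.default_rng(0)
psi_node = {v: rng.random(A) + 0.1 for v in (1, 2, 3)}  # unary potentials
neighbors = {1: [2], 2: [1, 3], 3: [2]}
msgs = {(i, j): np.ones(A) / A                          # uniform initial messages
        for i in neighbors for j in neighbors[i]}

def send_message(i, j):
    """m_{i->j}(x_j): sum over x_i of psi_{ij}(x_i, x_j) psi_i(x_i)
    times the product of messages into i from neighbors other than j."""
    prod = psi_node[i].copy()
    for k in neighbors[i]:
        if k != j:
            prod *= msgs[(k, i)]
    m = psi_edge.T @ prod       # marginalize out x_i
    return m / m.sum()          # normalize for numerical stability

def belief(i):
    """b_i(x_i): psi_i(x_i) times the product of all incoming messages."""
    b = psi_node[i].copy()
    for k in neighbors[i]:
        b *= msgs[(k, i)]
    return b / b.sum()

# Synchronous sweeps; on this tree-structured model BP is exact at convergence.
for _ in range(5):
    msgs = {edge: send_message(*edge) for edge in msgs}
print(belief(2))
```

Because this toy model is a tree, the converged beliefs match the exact marginals obtained by brute-force enumeration.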
Levels of Parallelism

Message level parallelism:
- Makes a single message update calculation run in parallel
- Limited by the complexity of individual variables

Graph level parallelism:
- Simultaneously updates multiple messages
- More "parallelism" for larger models

Running time definition:
- Message calculations are treated as unit-time operations
- Running time is measured in message calculations
"MapReduce" Belief Propagation

- Update all messages simultaneously using $p \le 2(n-1)$ processors.
- Chain graphs provide a challenging performance benchmark.

[Figure: in each of n rounds (t = 1, ..., n), the processors read the old messages (t-1) from shared memory and write the new messages (t), for 2n message calculations per round.]

Running time: $\frac{2(n-1)^2}{p}$
Optimal Sequential Scheduling

Using one processor, send messages left to right and then right to left. Running time: $2(n-1)$.

Optimal Parallel Scheduling

Using two processors, send messages left to right and right to left at the same time. Running time: $n-1$.

[Figure: efficient chain scheduling; a single CPU sweeps forward over rounds t = 1, ..., n and backward over rounds t = n+1, ..., 2n, while the two-processor schedule completes both sweeps in n rounds.]

Efficiency gap:
- For p < n, the MapReduce algorithm is slower than the efficient single-processor algorithm.
- Cannot efficiently use more than 2 processors.
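The forward/backward schedule above is the classic two-sweep exact inference on a chain; a minimal NumPy sketch (the chain length, arity, and potentials are illustrative choices of ours):

```python
import itertools
import numpy as np

n, A = 10, 2                                   # illustrative chain length and arity
rng = np.random.default_rng(1)
psi_node = rng.random((n, A)) + 0.1            # unary potentials psi_i
psi_edge = np.array([[1.0, 0.5], [0.5, 1.0]])  # shared pairwise potential

normalize = lambda v: v / v.sum()
fwd = np.ones((n, A))   # fwd[i] = m_{i -> i+1}
bwd = np.ones((n, A))   # bwd[i] = m_{i -> i-1}

# Left-to-right sweep: n-1 message calculations.
for i in range(n - 1):
    prev = fwd[i - 1] if i > 0 else np.ones(A)
    fwd[i] = normalize(psi_edge.T @ (psi_node[i] * prev))

# Right-to-left sweep: another n-1 calculations, 2(n-1) in total.
for i in range(n - 1, 0, -1):
    prev = bwd[i + 1] if i < n - 1 else np.ones(A)
    bwd[i] = normalize(psi_edge @ (psi_node[i] * prev))

def marginal(i):
    """Exact marginal of X_i from the forward and backward messages."""
    b = psi_node[i].copy()
    if i > 0:
        b *= fwd[i - 1]
    if i < n - 1:
        b *= bwd[i + 1]
    return normalize(b)
```

This performs exactly 2(n-1) message calculations, matching the optimal single-processor bound on the slide; running the two sweeps on separate processors gives the n-1 round two-processor schedule.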
[Figure: schedule comparison. The "MapReduce" parallel schedule runs n rounds of 2n messages each, for $\frac{2n^2}{p}$ calculations per processor; the optimal single-processor schedule finishes in 2n rounds and the optimal parallel (p = 2) schedule in n rounds.]

Factor n gap: $\frac{2n^2}{p}$ vs. $n$.
Breaking Sequentiality with $\tau_\epsilon$-Approximation

Message errors decay over paths.

The value of $\tau_\epsilon$:
- Maximum length of dependencies for a given accuracy $\epsilon$
- Not known in practice
- Not known to the algorithm

[Figure: chain over vertices 1-10. The true messages $m_{1 \to 2}, \dots, m_{9 \to 10}$ are compared with approximate messages $m'_{3 \to 4}, \dots, m'_{9 \to 10}$ started $\tau_\epsilon$ steps back, with $\|m_{9 \to 10} - m'_{9 \to 10}\| \le \epsilon$.]

Based on work by [Ihler et al., 2005].
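The decay of message errors can be checked empirically: start the left-to-right message chain from a uniform message only $\tau$ steps before the target vertex and compare with the true message. The chain, potentials, and seed below are illustrative choices of ours:

```python
import numpy as np

n, A, lam = 10, 2, 1.0
rng = np.random.default_rng(2)
psi_node = rng.random((n, A)) + 0.1                     # illustrative unary potentials
psi_edge = np.where(np.eye(A) == 1, 1.0, np.exp(-lam))  # attractive pairwise potential

def forward_message(start):
    """Propagate left-to-right messages from a uniform message at `start`
    up to the message into the last vertex."""
    m = np.ones(A) / A
    for i in range(start, n - 1):
        m = psi_edge.T @ (psi_node[i] * m)
        m /= m.sum()
    return m

exact = forward_message(0)   # the true message into the last vertex
errors = {tau: np.max(np.abs(exact - forward_message(n - 1 - tau)))
          for tau in (1, 2, 4, 9)}
for tau, err in errors.items():
    print(f"tau = {tau}: error = {err:.2e}")
```

Longer dependency windows $\tau$ give smaller errors; $\tau_\epsilon$ is the smallest window whose error is already below $\epsilon$.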
Synchronous BP and $\tau_\epsilon$-Approximation

For an approximate marginal, we only need to consider a small $\tau_\epsilon$ subgraph.

Theorem: Given an acyclic MRF with n vertices, a $\tau_\epsilon$-approximation is obtained by running Parallel Synchronous BP using p processors ($p \le 2n$) in running time

$$\frac{2(n-1)\tau_\epsilon}{p} = O\left(\frac{n \tau_\epsilon}{p}\right)$$

[Figure: chain over vertices 1-10; $\tau_\epsilon$ synchronous steps suffice instead of n.]
Optimal Approximate Inference

- Evenly partition the vertices.
- Run sequential exact inference on each "tree" in parallel (Step 1: forward sweeps; Step 2: backward sweeps).

We obtain the running time on chain graphs: $O\left(\frac{n}{p} + \tau_\epsilon\right)$ time per iteration.

[Figure: a chain split among processors 1-3, each owning a block of $n/p$ vertices extended by $\tau_\epsilon$.]

Theorem: For an arbitrary chain graphical model with n vertices and p processors, a $\tau_\epsilon$-approximation cannot in general be computed with fewer message updates than

$$\frac{n}{p} + \tau_\epsilon$$

Proof sketch: after k iterations of parallel message computations in one direction, the total required work in one direction cannot exceed the maximum possible work done by a single processor:

$$n - \tau_\epsilon \le \frac{p}{2}\,(k - \tau_\epsilon + 1)$$

Solving for k:

$$k \ge \frac{2n}{p} + \tau_\epsilon\left(1 - \frac{2}{p}\right) - 1$$
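The partitioning step above can be sketched as follows; `partition_with_overlap` is a hypothetical helper of ours (not from the paper's code) that splits the chain into p contiguous blocks, each extended by tau vertices on either side, so that local exact inference yields a $\tau_\epsilon$-approximate marginal for each block's own vertices:

```python
def partition_with_overlap(n, p, tau):
    """Split chain vertices 0..n-1 into at most p contiguous blocks.
    Each processor owns [own_lo, own_hi) but also reads tau extra
    vertices on each side, so its local sweeps cost O(n/p + tau)
    message calculations per direction."""
    size = -(-n // p)  # ceil(n / p) vertices owned per processor
    blocks = []
    for k in range(p):
        lo, hi = k * size, min((k + 1) * size, n)
        if lo >= hi:
            break
        blocks.append((max(0, lo - tau), lo, hi, min(n, hi + tau)))
    return blocks

for block in partition_with_overlap(n=10, p=3, tau=2):
    print(block)   # (read_lo, own_lo, own_hi, read_hi)
```

Each processor then runs the two-sweep exact inference over its read range, which covers at most $\lceil n/p \rceil + 2\tau$ vertices.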
Splash Operation

Generalizes optimal tree inference:
- Construct a BFS tree of a fixed size
- Starting at the leaves, invoke SendMessages on each vertex [13, 12, 11, ..., 1]
- Starting at the root, invoke SendMessages on each vertex [1, 2, 3, ..., 13]

SendMessages routine: using all current inbound messages, compute all outbound messages.

[Figure: a 13-vertex BFS tree illustrating Splash(1) and the SendMessages(8) step.]
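A minimal sketch of the Splash operation, assuming a fixed-size BFS tree as described above; `send_messages` is a caller-supplied callback and the vertex names are illustrative:

```python
from collections import deque

def splash(root, neighbors, splash_size, send_messages):
    """One Splash: build a BFS tree of at most `splash_size` vertices
    rooted at `root`, then invoke `send_messages` on every vertex in
    reverse BFS order (leaves to root) and again in forward BFS order
    (root to leaves)."""
    order, seen, queue = [], {root}, deque([root])
    while queue and len(order) < splash_size:
        v = queue.popleft()
        order.append(v)
        for u in neighbors[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    for v in reversed(order):   # upward pass: leaves first
        send_messages(v)
    for v in order:             # downward pass: root first
        send_messages(v)
    return order

# Example on a 7-vertex chain: a size-5 splash rooted at vertex 2.
chain = {v: [u for u in (v - 1, v + 1) if 0 <= u < 7] for v in range(7)}
calls = []
splash(2, chain, 5, calls.append)
print(calls)   # -> [4, 0, 3, 1, 2, 2, 1, 3, 0, 4]
```

On a chain, a splash rooted at one end with `splash_size = n` reduces exactly to the optimal forward/backward schedule.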
Scheduling Splashes

Not all vertices are equal: some vertices need to be updated more often than others.

[Figure: a model with difficult and easy regions; splashes rooted at vertices A and B at times t and t+1 perform useful work in the difficult region and wasted work in the easy region.]
Residual Scheduling

Intuition: prioritize updating the messages which change the most.

Message residual: the difference between the current message value and the next incoming message value,

$$\left\| m^{\text{next}}_{i \to u} - m^{\text{last}}_{i \to u} \right\|_\infty$$

Vertex residual: the maximum of all incoming message residuals,

$$\max_{i \in \Gamma_u} \left\| m^{\text{next}}_{i \to u} - m^{\text{last}}_{i \to u} \right\|_\infty$$

[Figure: a vertex whose incoming messages $m(x)$ and $m'(x)$ each have residual 0.1; when a new message arrives with residual 0.4, the vertex residual is updated and the vertex is promoted for update.]
Parallel Residual Splash

Against a shared priority queue of vertices and shared memory, each CPU repeatedly:
1. Pops the top vertex from the queue.
2. Builds a BFS tree of size s.
3. Updates the vertices in the tree in reverse BFS order, updating the priority queue as needed.
4. Updates the vertices in the tree in forward BFS order, updating the priority queue as needed.
5. Returns the root vertex to the queue.

[Figure: two CPUs drawing vertices (e.g., vertex 5, vertex 91, vertex 62, vertex 22, vertex 28) from the shared priority queue and splashing them concurrently.]
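The shared queue's behavior can be sketched sequentially. The class and method names below are illustrative; the actual implementation shares a concurrent priority queue among p threads:

```python
import heapq

class ResidualScheduler:
    """Priority queue of vertices keyed by vertex residual, i.e. the
    maximum residual over incoming messages. Stale heap entries are
    skipped lazily on pop. A sequential sketch of the shared queue
    used by Parallel Residual Splash."""
    def __init__(self):
        self.heap = []       # (-residual, vertex): max-heap via negation
        self.residual = {}   # current vertex residual

    def promote(self, v, message_residual):
        """Record that a message into v changed by `message_residual`."""
        if message_residual > self.residual.get(v, 0.0):
            self.residual[v] = message_residual
            heapq.heappush(self.heap, (-message_residual, v))

    def pop(self):
        """Return the vertex with the largest residual, resetting it."""
        while self.heap:
            neg_r, v = heapq.heappop(self.heap)
            if -neg_r == self.residual.get(v, 0.0):  # entry is fresh
                self.residual[v] = 0.0               # v is about to be splashed
                return v, -neg_r
        return None, 0.0

q = ResidualScheduler()
q.promote("a", 0.1)
q.promote("b", 0.4)
q.promote("a", 0.3)   # supersedes a's earlier, smaller entry
print(q.pop())        # -> ('b', 0.4)
print(q.pop())        # -> ('a', 0.3)
```

Lazy deletion keeps `promote` and `pop` at O(log q) without needing a decrease-key operation, a common pattern for residual-style schedulers.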
Residual Splash Running Time

Theorem: For an arbitrary chain graphical model with n vertices and p processors ($p \le n$), and a particular initial residual scheduling, the Residual Splash algorithm computes a $\tau_\epsilon$-approximation in time

$$O\left(\frac{n}{p} + \tau_\epsilon\right)$$

Using random initial priorities, the Residual Splash algorithm computes a $\tau_\epsilon$-approximation in time

$$O\left(\log(p)\left(\frac{n}{p} + \tau_\epsilon\right)\right)$$

We suspect that the log(p) factor is not tight.
Overall Performance: Non-uniform Complexity

[Figure: true and predicted images with difficult and easy regions (1)-(6); a log-scale plot of total updates per region across execution phases shows difficult regions receiving far more updates than easy ones.]
Experimental Setup

Protein side chain prediction [Chen Yanover and Yair Weiss. Approximate Inference and Protein Folding. NIPS 2002]:
- Predict protein side chain positions [Chen 02]
- 276 proteins
- Hundreds of variables per protein, with arity up to 79
- Average degree of 20

3D Video Popup:
- Extension of Make3D [ref] to videos, with edges connecting pixels over frames
- Depths discretized to 40 levels
- 500K vertices; 3D grid MRF of size 107x86x60

[Figure: movie frames, depth map, stereo images, and a 3D movie (anaglyph).]
Software Implementation

Optimized GNU C++ using POSIX threads, with a MATLAB wrapper: www.select.cs.cmu.edu/code
Protein Results

Experiments performed on an 8-core AMD Opteron 2384 processor @ 2.7 GHz with 32 GB RAM.

[Figure: relative speedup versus number of cores (1-8) and runtime in seconds (up to ~70 s) versus number of cores for MapReduce BP, Residual BP, and ResidualSplash BP, plotted against linear speedup.]
3D-Video Results

Experiments performed on an 8-core AMD Opteron 2384 processor @ 2.7 GHz with 32 GB RAM.

[Figure: relative speedup versus number of cores (1-8) and runtime in seconds (up to ~8000 s) versus number of cores for MapReduce BP, Residual BP, and ResidualSplash BP, plotted against linear speedup.]
Conclusions and Future Work

- The trivially parallel MapReduce algorithm is inefficient.
- Approximation can lead to increased parallelism.
- Provided a new parallel inference algorithm which performs optimally on chain graphs and generalizes to loopy graphs.
- Demonstrated superior performance on several real-world tasks.

Future work:
- A cluster-scale factor graph extension is under review.
- Extend the running time bounds to arbitrary cyclic graphical models.
- Efficient parallel parameter learning.
Acknowledgements

David O'Hallaron and Jason Campbell from Intel Research Pittsburgh, who provided guidance in algorithm and task development and access to the BigData multi-core cluster.

Funding provided by:
- ONR Young Investigator Program Grant N00014-08-1-0752
- ARO under MURI W911NF0810242
- NSF Grants NeTS-NOSS and CNS-0625518
- AT&T Labs Fellowship Program
References

- R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability. In ICDMW '07: Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, pages 349-354, 2007.
- D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In NIPS, pages 1081-1088, 2008.
- D. M. Pennock. Logarithmic time parallel Bayesian inference. In Proc. 14th Conf. Uncertainty in Artificial Intelligence, pages 431-438. Morgan Kaufmann, 1998.
- C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS, pages 281-288. MIT Press, 2006.
- M. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983-1006, 1998.
- A. I. Vila Casado, M. Griot, and R. D. Wesel. Informed dynamic scheduling for belief-propagation decoding of LDPC codes. CoRR, abs/cs/0702111, 2007.
- A. Mendiburu, R. Santana, J. A. Lozano, and E. Bengoetxea. A parallel framework for loopy belief propagation. In GECCO '07: Proceedings of the 2007 GECCO Conference Companion on Genetic and Evolutionary Computation, pages 2843-2850, 2007.
- D. Koller and N. Friedman. Probabilistic Graphical Models.
- J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988. ISBN 0-934613-73-7.
- R. J. McEliece, D. J. C. MacKay, and J. F. Cheng. Turbo decoding as an instance of Pearl's belief propagation algorithm. IEEE Journal on Selected Areas in Communications, 16(2):140-152, Feb 1998.
- J. Sun, N. N. Zheng, and H. Y. Shum. Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(7):787-800, July 2003.
- J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. Pages 239-269, 2003.
- C. Yanover and Y. Weiss. Approximate inference and protein folding. In NIPS, pages 84-86. MIT Press, 2002.
- C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Pages 381-395, 2007.
- J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.
- A. T. Ihler, J. W. Fischer III, and A. S. Willsky. Loopy belief propagation: Convergence and effects of message errors. J. Mach. Learn. Res., 6:905-936, 2005.
- Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Comput., 12(1):1-41, 2000.
- J. M. Mooij and H. J. Kappen. Sufficient conditions for convergence of the sum-product algorithm. IEEE Transactions on Information Theory, 53(12):4422-4437, Dec. 2007.
- G. Elidan, I. McGraw, and D. Koller. Residual belief propagation: Informed scheduling for asynchronous message passing. In Proceedings of the Twenty-Second Conference on Uncertainty in AI (UAI), Boston, Massachusetts, 2006.
- A. Saxena, S. H. Chung, and A. Y. Ng. 3-D depth reconstruction from a single still image. International Journal of Computer Vision (IJCV), 2007.
- SelectLab. ResidualSplash pairwise MRF code, 2009. URL http://www.select.cs.cmu.edu/code.