Post on 19-Dec-2015
TRANSCRIPT
Summary

Problem:
- Exponential performance gap: computer architectures transitioned from exponential frequency scaling to parallelism, ending decades of free exponential performance gains.
- The natural "MapReduce" Belief Propagation (BP) algorithm is embarrassingly parallel but highly inefficient (asymptotically slower than efficient sequential algorithms).

Solution:
- Explore the limiting sequential structure using chain graphical models.
- Introduce an approximation which improves parallel performance.
- Propose ResidualSplash, a new parallel dynamic BP algorithm, and show that it performs optimally on chain graphical models in the approximate inference setting.

Results:
- We demonstrate that our new algorithm outperforms existing techniques on two real-world tasks.
Many Core Revolution

Transition from exponential frequency scaling to exponential parallelism.

[Figure: number of cores (1 to 512) versus year (1970-2005 and beyond) for processors ranging from the Intel 4004, 8008, and 8086 through the Pentium, Athlon, Itanium, and Opteron lines up to many-core designs such as Niagara, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, Cisco CSR-1, Picochip PC102, Broadcom 1480, and Ambric AM2045. Single-processor performance flattens while parallel performance keeps growing, leaving an exponentially growing gap. Graph courtesy of Saman Amarasinghe.]
Inference in Markov Random Fields

Pairwise Markov Random Field (MRF):
- Graph encoding conditional independence assumptions
- Factors encoding functional dependencies

$$P(X_1, \dots, X_N) \propto \prod_{i \in V} \psi_i(x_i) \prod_{\{i,j\} \in E} \psi_{i,j}(x_i, x_j)$$

- Unary potentials: $\psi_i(x_i) = \mathcal{N}(o_i; \sigma^2)$
- Binary potentials: $\psi_{i,j}(x_i, x_j) = \begin{cases} 1 & x_i = x_j \\ e^{-\lambda} & x_i \neq x_j \end{cases}$

[Figure: image denoising example; a 3x3 grid MRF over variables X1-X9, one per pixel, maps a noisy image to a predicted image.]

Inference objective: compute the marginal distribution of all variables.
Loopy Belief Propagation:
- Approximate inference method
- Exact on trees

Message update:

$$m_{i \to j}(x_j) \propto \sum_{x_i \in \mathcal{A}_i} \psi_{i,j}(x_i, x_j)\, \psi_i(x_i) \prod_{k \in \Gamma_i \setminus j} m_{k \to i}(x_i)$$

At convergence:

$$P(X_i = x_i) \approx b_i(x_i) \propto \psi_i(x_i) \prod_{k \in \Gamma_i} m_{k \to i}(x_i)$$

[Figure: star graph over X1-X5 illustrating messages such as $m_{3 \to 2}(x_2)$, $m_{4 \to 2}(x_2)$, and $m_{2 \to 1}(x_1)$.]
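As a concrete illustration, the message update and belief equations above can be sketched in NumPy on a tiny chain-structured MRF. Everything here is illustrative (the three-variable model, potentials, and function names are ours, not the paper's implementation):

```python
import numpy as np

# Illustrative 3-variable chain MRF: X1 - X2 - X3, each with arity A.
A, lam = 3, 1.0
psi_edge = np.where(np.eye(A) == 1, 1.0, np.exp(-lam))  # psi_{ij}(x_i, x_j)
rng = np.random.default_rng(0)
psi_node = {v: rng.random(A) + 0.1 for v in (1, 2, 3)}  # unary potentials
neighbors = {1: [2], 2: [1, 3], 3: [2]}
msgs = {(i, j): np.ones(A) / A                          # uniform initial messages
        for i in neighbors for j in neighbors[i]}

def send_message(i, j):
    """m_{i->j}(x_j): sum over x_i of psi_{ij}(x_i, x_j) psi_i(x_i)
    times the product of messages into i from neighbors other than j."""
    prod = psi_node[i].copy()
    for k in neighbors[i]:
        if k != j:
            prod *= msgs[(k, i)]
    m = psi_edge.T @ prod       # marginalize out x_i
    return m / m.sum()          # normalize for numerical stability

def belief(i):
    """b_i(x_i): psi_i(x_i) times the product of all incoming messages."""
    b = psi_node[i].copy()
    for k in neighbors[i]:
        b *= msgs[(k, i)]
    return b / b.sum()

# Synchronous sweeps; on this tree-structured model BP is exact at convergence.
for _ in range(5):
    msgs = {edge: send_message(*edge) for edge in msgs}
print(belief(2))
```

Because this toy model is a tree, the converged beliefs match the exact marginals obtained by brute-force enumeration.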
Levels of Parallelism

Message level parallelism:
- Makes a single message update calculation run in parallel
- Limited by the complexity of individual variables

Graph level parallelism:
- Simultaneously updates multiple messages
- More "parallelism" for larger models

Running time definition:
- Message calculations are treated as unit-time operations
- Running time is measured in message calculations
"MapReduce" Belief Propagation

- Update all messages simultaneously using $p \le 2(n-1)$ processors.
- Chain graphs provide a challenging performance benchmark.

[Figure: in each of n rounds (t = 1, ..., n), the processors read the old messages (t-1) from shared memory and write the new messages (t), for 2n message calculations per round.]

Running time: $\frac{2(n-1)^2}{p}$
Optimal Sequential Scheduling

Using one processor, send messages left to right and then right to left. Running time: $2(n-1)$.

Optimal Parallel Scheduling

Using two processors, send messages left to right and right to left at the same time. Running time: $n-1$.

[Figure: efficient chain scheduling; a single CPU sweeps forward over rounds t = 1, ..., n and backward over rounds t = n+1, ..., 2n, while the two-processor schedule completes both sweeps in n rounds.]

Efficiency gap:
- For p < n, the MapReduce algorithm is slower than the efficient single-processor algorithm.
- Cannot efficiently use more than 2 processors.
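The forward/backward schedule above is the classic two-sweep exact inference on a chain; a minimal NumPy sketch (the chain length, arity, and potentials are illustrative choices of ours):

```python
import itertools
import numpy as np

n, A = 10, 2                                   # illustrative chain length and arity
rng = np.random.default_rng(1)
psi_node = rng.random((n, A)) + 0.1            # unary potentials psi_i
psi_edge = np.array([[1.0, 0.5], [0.5, 1.0]])  # shared pairwise potential

normalize = lambda v: v / v.sum()
fwd = np.ones((n, A))   # fwd[i] = m_{i -> i+1}
bwd = np.ones((n, A))   # bwd[i] = m_{i -> i-1}

# Left-to-right sweep: n-1 message calculations.
for i in range(n - 1):
    prev = fwd[i - 1] if i > 0 else np.ones(A)
    fwd[i] = normalize(psi_edge.T @ (psi_node[i] * prev))

# Right-to-left sweep: another n-1 calculations, 2(n-1) in total.
for i in range(n - 1, 0, -1):
    prev = bwd[i + 1] if i < n - 1 else np.ones(A)
    bwd[i] = normalize(psi_edge @ (psi_node[i] * prev))

def marginal(i):
    """Exact marginal of X_i from the forward and backward messages."""
    b = psi_node[i].copy()
    if i > 0:
        b *= fwd[i - 1]
    if i < n - 1:
        b *= bwd[i + 1]
    return normalize(b)
```

This performs exactly 2(n-1) message calculations, matching the optimal single-processor bound on the slide; running the two sweeps on separate processors gives the n-1 round two-processor schedule.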
[Figure: schedule comparison. The "MapReduce" parallel schedule runs n rounds of 2n messages each, for $\frac{2n^2}{p}$ calculations per processor; the optimal single-processor schedule finishes in 2n rounds and the optimal parallel (p = 2) schedule in n rounds.]

Factor n gap: $\frac{2n^2}{p}$ vs. $n$.
Breaking Sequentiality with $\tau_\epsilon$-Approximation

Message errors decay over paths.

The value of $\tau_\epsilon$:
- Maximum length of dependencies for a given accuracy $\epsilon$
- Not known in practice
- Not known to the algorithm

[Figure: chain over vertices 1-10. The true messages $m_{1 \to 2}, \dots, m_{9 \to 10}$ are compared with approximate messages $m'_{3 \to 4}, \dots, m'_{9 \to 10}$ started $\tau_\epsilon$ steps back, with $\|m_{9 \to 10} - m'_{9 \to 10}\| \le \epsilon$.]

Based on work by [Ihler et al., 2005].
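The decay of message errors can be checked empirically: start the left-to-right message chain from a uniform message only $\tau$ steps before the target vertex and compare with the true message. The chain, potentials, and seed below are illustrative choices of ours:

```python
import numpy as np

n, A, lam = 10, 2, 1.0
rng = np.random.default_rng(2)
psi_node = rng.random((n, A)) + 0.1                     # illustrative unary potentials
psi_edge = np.where(np.eye(A) == 1, 1.0, np.exp(-lam))  # attractive pairwise potential

def forward_message(start):
    """Propagate left-to-right messages from a uniform message at `start`
    up to the message into the last vertex."""
    m = np.ones(A) / A
    for i in range(start, n - 1):
        m = psi_edge.T @ (psi_node[i] * m)
        m /= m.sum()
    return m

exact = forward_message(0)   # the true message into the last vertex
errors = {tau: np.max(np.abs(exact - forward_message(n - 1 - tau)))
          for tau in (1, 2, 4, 9)}
for tau, err in errors.items():
    print(f"tau = {tau}: error = {err:.2e}")
```

Longer dependency windows $\tau$ give smaller errors; $\tau_\epsilon$ is the smallest window whose error is already below $\epsilon$.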
Synchronous BP and $\tau_\epsilon$-Approximation

For an approximate marginal, we only need to consider a small $\tau_\epsilon$ subgraph.

Theorem: Given an acyclic MRF with n vertices, a $\tau_\epsilon$-approximation is obtained by running Parallel Synchronous BP using p processors ($p \le 2n$) in running time

$$\frac{2(n-1)\tau_\epsilon}{p} = O\left(\frac{n \tau_\epsilon}{p}\right)$$

[Figure: chain over vertices 1-10; $\tau_\epsilon$ synchronous steps suffice instead of n.]
Optimal Approximate Inference

- Evenly partition the vertices.
- Run sequential exact inference on each "tree" in parallel (Step 1: forward sweeps; Step 2: backward sweeps).

We obtain the running time on chain graphs: $O\left(\frac{n}{p} + \tau_\epsilon\right)$ time per iteration.

[Figure: a chain split among processors 1-3, each owning a block of $n/p$ vertices extended by $\tau_\epsilon$.]

Theorem: For an arbitrary chain graphical model with n vertices and p processors, a $\tau_\epsilon$-approximation cannot in general be computed with fewer message updates than

$$\frac{n}{p} + \tau_\epsilon$$

Proof sketch: after k iterations of parallel message computations in one direction, the total required work in one direction cannot exceed the maximum possible work done by a single processor:

$$n - \tau_\epsilon \le \frac{p}{2}\,(k - \tau_\epsilon + 1)$$

Solving for k:

$$k \ge \frac{2n}{p} + \tau_\epsilon\left(1 - \frac{2}{p}\right) - 1$$
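The partitioning step above can be sketched as follows; `partition_with_overlap` is a hypothetical helper of ours (not from the paper's code) that splits the chain into p contiguous blocks, each extended by tau vertices on either side, so that local exact inference yields a $\tau_\epsilon$-approximate marginal for each block's own vertices:

```python
def partition_with_overlap(n, p, tau):
    """Split chain vertices 0..n-1 into at most p contiguous blocks.
    Each processor owns [own_lo, own_hi) but also reads tau extra
    vertices on each side, so its local sweeps cost O(n/p + tau)
    message calculations per direction."""
    size = -(-n // p)  # ceil(n / p) vertices owned per processor
    blocks = []
    for k in range(p):
        lo, hi = k * size, min((k + 1) * size, n)
        if lo >= hi:
            break
        blocks.append((max(0, lo - tau), lo, hi, min(n, hi + tau)))
    return blocks

for block in partition_with_overlap(n=10, p=3, tau=2):
    print(block)   # (read_lo, own_lo, own_hi, read_hi)
```

Each processor then runs the two-sweep exact inference over its read range, which covers at most $\lceil n/p \rceil + 2\tau$ vertices.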
Splash Operation

Generalizes optimal tree inference:
- Construct a BFS tree of a fixed size
- Starting at the leaves, invoke SendMessages on each vertex [13, 12, 11, ..., 1]
- Starting at the root, invoke SendMessages on each vertex [1, 2, 3, ..., 13]

SendMessages routine: using all current inbound messages, compute all outbound messages.

[Figure: a 13-vertex BFS tree illustrating Splash(1) and the SendMessages(8) step.]
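A minimal sketch of the Splash operation, assuming a fixed-size BFS tree as described above; `send_messages` is a caller-supplied callback and the vertex names are illustrative:

```python
from collections import deque

def splash(root, neighbors, splash_size, send_messages):
    """One Splash: build a BFS tree of at most `splash_size` vertices
    rooted at `root`, then invoke `send_messages` on every vertex in
    reverse BFS order (leaves to root) and again in forward BFS order
    (root to leaves)."""
    order, seen, queue = [], {root}, deque([root])
    while queue and len(order) < splash_size:
        v = queue.popleft()
        order.append(v)
        for u in neighbors[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    for v in reversed(order):   # upward pass: leaves first
        send_messages(v)
    for v in order:             # downward pass: root first
        send_messages(v)
    return order

# Example on a 7-vertex chain: a size-5 splash rooted at vertex 2.
chain = {v: [u for u in (v - 1, v + 1) if 0 <= u < 7] for v in range(7)}
calls = []
splash(2, chain, 5, calls.append)
print(calls)   # -> [4, 0, 3, 1, 2, 2, 1, 3, 0, 4]
```

On a chain, a splash rooted at one end with `splash_size = n` reduces exactly to the optimal forward/backward schedule.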
Scheduling Splashes

Not all vertices are equal: some vertices need to be updated more often than others.

[Figure: a model with difficult and easy regions; splashes rooted at vertices A and B at times t and t+1 perform useful work in the difficult region and wasted work in the easy region.]
Residual Scheduling

Intuition: prioritize updating the messages which change the most.

Message residual: the difference between the current message value and the next incoming message value,

$$\left\| m^{\text{next}}_{i \to u} - m^{\text{last}}_{i \to u} \right\|_\infty$$

Vertex residual: the maximum of all incoming message residuals,

$$\max_{i \in \Gamma_u} \left\| m^{\text{next}}_{i \to u} - m^{\text{last}}_{i \to u} \right\|_\infty$$

[Figure: a vertex whose incoming messages $m(x)$ and $m'(x)$ each have residual 0.1; when a new message arrives with residual 0.4, the vertex residual is updated and the vertex is promoted for update.]
Parallel Residual Splash

Against a shared priority queue of vertices and shared memory, each CPU repeatedly:
1. Pops the top vertex from the queue.
2. Builds a BFS tree of size s.
3. Updates the vertices in the tree in reverse BFS order, updating the priority queue as needed.
4. Updates the vertices in the tree in forward BFS order, updating the priority queue as needed.
5. Returns the root vertex to the queue.

[Figure: two CPUs drawing vertices (e.g., vertex 5, vertex 91, vertex 62, vertex 22, vertex 28) from the shared priority queue and splashing them concurrently.]
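The shared queue's behavior can be sketched sequentially. The class and method names below are illustrative; the actual implementation shares a concurrent priority queue among p threads:

```python
import heapq

class ResidualScheduler:
    """Priority queue of vertices keyed by vertex residual, i.e. the
    maximum residual over incoming messages. Stale heap entries are
    skipped lazily on pop. A sequential sketch of the shared queue
    used by Parallel Residual Splash."""
    def __init__(self):
        self.heap = []       # (-residual, vertex): max-heap via negation
        self.residual = {}   # current vertex residual

    def promote(self, v, message_residual):
        """Record that a message into v changed by `message_residual`."""
        if message_residual > self.residual.get(v, 0.0):
            self.residual[v] = message_residual
            heapq.heappush(self.heap, (-message_residual, v))

    def pop(self):
        """Return the vertex with the largest residual, resetting it."""
        while self.heap:
            neg_r, v = heapq.heappop(self.heap)
            if -neg_r == self.residual.get(v, 0.0):  # entry is fresh
                self.residual[v] = 0.0               # v is about to be splashed
                return v, -neg_r
        return None, 0.0

q = ResidualScheduler()
q.promote("a", 0.1)
q.promote("b", 0.4)
q.promote("a", 0.3)   # supersedes a's earlier, smaller entry
print(q.pop())        # -> ('b', 0.4)
print(q.pop())        # -> ('a', 0.3)
```

Lazy deletion keeps `promote` and `pop` at O(log q) without needing a decrease-key operation, a common pattern for residual-style schedulers.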
Residual Splash Running Time

Theorem: For an arbitrary chain graphical model with n vertices and p processors ($p \le n$), and a particular initial residual scheduling, the Residual Splash algorithm computes a $\tau_\epsilon$-approximation in time

$$O\left(\frac{n}{p} + \tau_\epsilon\right)$$

Using random initial priorities, the Residual Splash algorithm computes a $\tau_\epsilon$-approximation in time

$$O\left(\log(p)\left(\frac{n}{p} + \tau_\epsilon\right)\right)$$

We suspect that the log(p) factor is not tight.
Overall Performance: Non-uniform Complexity

[Figure: true and predicted images with difficult and easy regions (1)-(6); a log-scale plot of total updates per region across execution phases shows difficult regions receiving far more updates than easy ones.]
Experimental Setup

Protein side chain prediction [Chen Yanover and Yair Weiss. Approximate Inference and Protein Folding. NIPS 2002]:
- Predict protein side chain positions [Chen 02]
- 276 proteins
- Hundreds of variables per protein, with arity up to 79
- Average degree of 20

3D Video Popup:
- Extension of Make3D [ref] to videos, with edges connecting pixels over frames
- Depths discretized to 40 levels
- 500K vertices; 3D grid MRF of size 107x86x60

[Figure: movie frames, depth map, stereo images, and a 3D movie (anaglyph).]
Software Implementation

Optimized GNU C++ using POSIX threads, with a MATLAB wrapper: www.select.cs.cmu.edu/code
Protein Results

Experiments performed on an 8-core AMD Opteron 2384 processor @ 2.7 GHz with 32 GB RAM.

[Figure: relative speedup versus number of cores (1-8) and runtime in seconds (up to ~70 s) versus number of cores for MapReduce BP, Residual BP, and ResidualSplash BP, plotted against linear speedup.]
3D-Video Results

Experiments performed on an 8-core AMD Opteron 2384 processor @ 2.7 GHz with 32 GB RAM.

[Figure: relative speedup versus number of cores (1-8) and runtime in seconds (up to ~8000 s) versus number of cores for MapReduce BP, Residual BP, and ResidualSplash BP, plotted against linear speedup.]
Conclusions and Future Work

- The trivially parallel MapReduce algorithm is inefficient.
- Approximation can lead to increased parallelism.
- Provided a new parallel inference algorithm which performs optimally on chain graphs and generalizes to loopy graphs.
- Demonstrated superior performance on several real-world tasks.

Future work:
- A cluster-scale factor graph extension is under review.
- Extend the running time bounds to arbitrary cyclic graphical models.
- Efficient parallel parameter learning.
Acknowledgements

David O'Hallaron and Jason Campbell from Intel Research Pittsburgh, who provided guidance in algorithm and task development and access to the BigData multi-core cluster.

Funding provided by:
- ONR Young Investigator Program Grant N00014-08-1-0752
- ARO under MURI W911NF0810242
- NSF Grants NeTS-NOSS and CNS-0625518
- AT&T Labs Fellowship Program
References

- R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability. In ICDMW '07: Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, pages 349-354, 2007.
- D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In NIPS, pages 1081-1088, 2008.
- D. M. Pennock. Logarithmic time parallel Bayesian inference. In Proc. 14th Conf. Uncertainty in Artificial Intelligence, pages 431-438. Morgan Kaufmann, 1998.
- C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS, pages 281-288. MIT Press, 2006.
- M. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983-1006, 1998.
- A. I. Vila Casado, M. Griot, and R. D. Wesel. Informed dynamic scheduling for belief-propagation decoding of LDPC codes. CoRR, abs/cs/0702111, 2007.
- A. Mendiburu, R. Santana, J. A. Lozano, and E. Bengoetxea. A parallel framework for loopy belief propagation. In GECCO '07: Proceedings of the 2007 GECCO Conference Companion on Genetic and Evolutionary Computation, pages 2843-2850, 2007.
- D. Koller and N. Friedman. Probabilistic Graphical Models.
- J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988. ISBN 0-934613-73-7.
- R. J. McEliece, D. J. C. MacKay, and J. F. Cheng. Turbo decoding as an instance of Pearl's belief propagation algorithm. IEEE Journal on Selected Areas in Communications, 16(2):140-152, Feb 1998.
- J. Sun, N. N. Zheng, and H. Y. Shum. Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(7):787-800, July 2003.
- J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. Pages 239-269, 2003.
- C. Yanover and Y. Weiss. Approximate inference and protein folding. In NIPS, pages 84-86. MIT Press, 2002.
- C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Pages 381-395, 2007.
- J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.
- A. T. Ihler, J. W. Fischer III, and A. S. Willsky. Loopy belief propagation: Convergence and effects of message errors. J. Mach. Learn. Res., 6:905-936, 2005.
- Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Comput., 12(1):1-41, 2000.
- J. M. Mooij and H. J. Kappen. Sufficient conditions for convergence of the sum-product algorithm. IEEE Transactions on Information Theory, 53(12):4422-4437, Dec. 2007.
- G. Elidan, I. McGraw, and D. Koller. Residual belief propagation: Informed scheduling for asynchronous message passing. In Proceedings of the Twenty-Second Conference on Uncertainty in AI (UAI), Boston, Massachusetts, 2006.
- A. Saxena, S. H. Chung, and A. Y. Ng. 3-D depth reconstruction from a single still image. International Journal of Computer Vision (IJCV), 2007.
- SelectLab. ResidualSplash pairwise MRF code, 2009. URL http://www.select.cs.cmu.edu/code.