smart redundancy for distributed computation

Smart Redundancy for Distributed Computation. George Edwards (Blue Cell Software, LLC), Yuriy Brun (University of Washington), Jae young Bang (University of Southern California), Nenad Medvidovic (University of Southern California)


Post on 23-Mar-2016



Page 1: Smart Redundancy for Distributed Computation

Smart Redundancy for Distributed Computation

George Edwards, Blue Cell Software, LLC
Yuriy Brun, University of Washington
Jae young Bang, University of Southern California
Nenad Medvidovic, University of Southern California

Page 2: Smart Redundancy for Distributed Computation

Distributed Computation Architectures

• Solve large computational problems and/or process large data sets
• Provide a platform and API for applications
• Transparently parallelize computation across a pool of computers
• Examples:
  – Clouds
  – Grids
  – Volunteer computing

Page 3: Smart Redundancy for Distributed Computation

DCA Applications

• Highly parallelizable problems
  – Find the 10^100th digit of π
  – Factor 2^2011 – 1
• Driven by:
  – Basic research
  – Pharmaceutical applications
  – Web analytics
  – …

Page 4: Smart Redundancy for Distributed Computation

Volunteer Computing

• Attempts to leverage the more than 1 billion (mostly idle) machines on the Internet
  – Volunteers install a client
  – When idle, the client requests work from a server and sends back results
• Aids projects that have limited funding but large public appeal

Page 5: Smart Redundancy for Distributed Computation

Dealing with Faults

• Context:
  – Volunteers fail and maliciously return false results
  – Volunteers are not accountable
  – Malicious volunteers may collude
  – Well-formed but incorrect results are hard to detect
  – The reliability of volunteers is difficult to estimate
• Solution:
  – Redundancy and voting

Page 6: Smart Redundancy for Distributed Computation

System Model

• A task server subdivides computations into tasks

• The task server replicates each task into multiple identical jobs

• The task server assigns each job to a node in the node pool

• Nodes perform work, send results, and rejoin the pool

• New volunteer nodes may join the pool while other nodes may leave

Page 7: Smart Redundancy for Distributed Computation

k-vote Traditional Redundancy (TR)

• Performs k independent executions of each task

• Takes a vote on the correctness of the result

• Requires expending a factor of k resources or suffering a factor of k slowdown in performance

Example

• k = 19, r = 0.7
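The reliability that a k-vote majority delivers can be worked out directly from the binomial distribution. The following is a minimal sketch, not the authors' implementation. It assumes results are binary (every incorrect result is identical, the worst case for voting) and that nodes fail independently with the same reliability r. The function name `tr_reliability` is hypothetical.

```python
from math import comb


def tr_reliability(k, r):
    """Probability that a majority of k independent jobs return the correct result.

    Assumes each job is correct with probability r, independently of the others.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that a majority always exists")
    majority = k // 2 + 1
    # Sum the binomial probabilities of getting at least `majority` correct results.
    return sum(comb(k, i) * r**i * (1 - r) ** (k - i) for i in range(majority, k + 1))


# The slide's example: 19 votes, 70% node reliability.
print(f"TR reliability for k=19, r=0.7: {tr_reliability(19, 0.7):.4f}")
```

The cost is always k jobs per task, regardless of how the votes fall.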

Page 8: Smart Redundancy for Distributed Computation

Insights

• Redundant computations need not be simultaneous
• DCAs can dynamically adjust the level of redundancy based on run-time information
• k-vote traditional redundancy wastes computations

Example

• 19 independent computations (k = 19)
• 70% node reliability (r = 0.7)
• (0.7)^10 ≈ 2.8% of the time, the first 10 of them will return the correct result
  • The last 9 results are irrelevant
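The arithmetic in this example can be checked in a few lines. This is a sketch under the assumption that nodes are independent and equally reliable; the function name `unanimous_first_batch` is hypothetical.

```python
def unanimous_first_batch(k, r):
    """Probability that the first majority-sized batch of jobs is all correct.

    With k votes, a majority is k // 2 + 1 jobs; once that many agree,
    the remaining jobs cannot change the outcome of the vote.
    """
    consensus = k // 2 + 1
    return r**consensus


print(f"{unanimous_first_batch(19, 0.7):.1%}")  # 2.8%
```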

Page 9: Smart Redundancy for Distributed Computation

k-vote Progressive Redundancy (PR)

• Distributes jobs in waves

• In each wave, distributes the minimum jobs needed to produce a consensus (assuming all agree)

• Repeats until a consensus is reached

Example

• k = 19, r = 0.7
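The wave logic above can be sketched as a small simulation. This is a minimal sketch rather than the authors' implementation. It assumes binary results (all incorrect results agree with each other) and independent nodes that share the same reliability r. The name `progressive_redundancy` is hypothetical.

```python
import random


def progressive_redundancy(k, r, rng):
    """Simulate one task under k-vote progressive redundancy.

    Returns (result_is_correct, jobs_used).
    """
    consensus = k // 2 + 1  # votes needed for a majority of k
    correct = wrong = 0
    while max(correct, wrong) < consensus:
        # Fewest jobs that could complete a consensus if they all agree with the leader.
        wave = consensus - max(correct, wrong)
        for _ in range(wave):
            if rng.random() < r:
                correct += 1
            else:
                wrong += 1
    return correct >= consensus, correct + wrong


rng = random.Random(2011)  # fixed seed so the run is repeatable
trials = [progressive_redundancy(19, 0.7, rng) for _ in range(10_000)]
print("average jobs per task:", sum(jobs for _, jobs in trials) / len(trials))
print("fraction correct:", sum(ok for ok, _ in trials) / len(trials))
```

Under these assumptions a task never uses more than k jobs, and it uses only k // 2 + 1 when the first wave is unanimous.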

Page 10: Smart Redundancy for Distributed Computation

Insights

• The confidence level associated with a result can be computed

• k-vote progressive redundancy produces results with varying confidence

Example

• k = 19, r = 0.7

• If the vote is 10-0, confidence level ≈ 99.98%

• If the vote is 10-9, confidence level = 70%
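One way to arrive at these confidence levels is Bayes' rule. The sketch below assumes binary results, independent nodes with the same reliability r, and no prior preference between the two candidate results. Under those assumptions the confidence depends only on the vote margin. The function name `confidence` is hypothetical.

```python
def confidence(agree, disagree, r):
    """Probability that the majority result is correct, given the vote split.

    Derived from r^a (1-r)^b / (r^a (1-r)^b + r^b (1-r)^a), which simplifies
    to 1 / (1 + ((1-r)/r)^(a-b)).
    """
    margin = agree - disagree
    return 1.0 / (1.0 + ((1.0 - r) / r) ** margin)


print(f"{confidence(10, 0, 0.7):.2%}")  # 99.98%
print(f"{confidence(10, 9, 0.7):.2%}")  # 70.00%
```

A 10-9 vote carries no more confidence than a single result from one node, which is why progressive redundancy produces results of varying quality.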

Page 11: Smart Redundancy for Distributed Computation

Iterative Redundancy (IR)

• Distributes jobs in waves

• In each wave, distributes the minimum jobs required to achieve a desired confidence level

• Repeats until desired confidence level is reached

Example

• d = 4, r = 0.7
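The steps above can be sketched as a simulation. The slide does not define d; this sketch assumes d is the required margin between agreeing and disagreeing results, so that reaching the margin is what achieves the desired confidence. It also assumes binary results and independent nodes with equal reliability r. The name `iterative_redundancy` is hypothetical, and this is not the authors' implementation.

```python
import random


def iterative_redundancy(d, r, rng):
    """Simulate one task under iterative redundancy with required vote margin d.

    Returns (result_is_correct, jobs_used).
    """
    correct = wrong = 0
    while abs(correct - wrong) < d:
        # Fewest jobs that could reach margin d if they all agree with the leader.
        wave = d - abs(correct - wrong)
        for _ in range(wave):
            if rng.random() < r:
                correct += 1
            else:
                wrong += 1
    return correct > wrong, correct + wrong


rng = random.Random(2011)  # fixed seed so the run is repeatable
trials = [iterative_redundancy(4, 0.7, rng) for _ in range(10_000)]
print("average jobs per task:", sum(jobs for _, jobs in trials) / len(trials))
print("fraction correct:", sum(ok for ok, _ in trials) / len(trials))
```

Unlike the k-vote schemes, there is no fixed cap on the number of jobs: a task keeps receiving waves until the margin is reached, so every accepted result carries the same confidence.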

Page 12: Smart Redundancy for Distributed Computation

Algorithm Comparison

• System reliability approaches 1 exponentially for TR, PR, and IR
• IR produces the same reliability at a lower cost
  – Or, equivalently, higher reliability at the same cost
• IR is optimal with respect to cost
  – Guaranteed to use the minimum computation needed to achieve desired system reliability

[Figure: two plots of system reliability against cost factor]

Page 13: Smart Redundancy for Distributed Computation

Algorithm Comparison

• PR and IR perform best when the reliability of the node pool is high

[Figure: plot of ratio improvement over traditional recovery against node reliability]

Page 14: Smart Redundancy for Distributed Computation

Adaptive Behavior

• IR maintains a constant system reliability as node reliability fluctuates
  – Injects redundancy where it is needed
    • “Unlucky” situations
  – Removes redundancy where it is unnecessary

[Figure: three plots over time showing node reliability, cost factor, and system reliability]

Page 15: Smart Redundancy for Distributed Computation

Node Reliability Estimation

• Incorrectly estimating node reliability does not affect the performance of IR

[Figure: plot of system reliability against cost factor]

Page 16: Smart Redundancy for Distributed Computation

Conclusions

• Iterative redundancy automatically replicates computation with optimal efficiency

• Iterative redundancy can be used when:
  – A computation can be broken down into independent tasks
  – Computation is performed by a pool of independent processing resources
  – Task deployment decisions can be made at runtime
  – The reliability of resources in the pool is unknown

Page 17: Smart Redundancy for Distributed Computation

For More Information

To appear in ICDCS 2011:

Smart Redundancy for Distributed Computation
by Yuriy Brun, George Edwards, Jae young Bang and Nenad Medvidovic

http://www.cs.washington.edu/homes/brun/pubs/pubs/Brun11icdcs.pdf