adaptive replication for elastic data stream processing

22
An Adaptive Replication Scheme For Elastic Data Stream Processing Thomas Heinze, Mariam Zia, Robert Krahn, Zbigniew Jerzak, Christof Fetzer July 02, 2015

Upload: zbigniew-jerzak

Post on 16-Apr-2017

545 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Adaptive Replication for Elastic Data Stream Processing

An Adaptive Replication Scheme For Elastic Data Stream ProcessingThomas Heinze, Mariam Zia, Robert Krahn, Zbigniew Jerzak, Christof FetzerJuly 02, 2015

Page 2: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 2InternalPublic

Elasticity

Utilization below 30% in most cloud data centers

Users needs to reserve required resources Limited understanding of the performance of the system Limited knowledge of characteristics of the workload

Workload/Resources

Load

Static Provisioning

Elastic Provisioning

time

Underprovisioning

Overprovisioning

Page 3: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 3InternalPublic

Configuring an Elastic Scaling System

Data Stream Processing highly suited for elasticity due to highly variable load and small state size (e.g. StreamCloud[1] or SEEP[2])

Key challenge: Minimize the overprovisioning and number of SLA violations by optimizing scaling decisions

Page 4: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 4InternalPublic

Enabling Fault Tolerance

Elasticity requires horizontal scaling → we need fault tolerance

Two mechanisms: Active Replication vs. Upstream Backup

User-defined Threshold

Financial

0.0

2.5

5.0

7.5

10.0

12.5

0.0 0.5 1.0 1.5 2.0Monetary Cost($)

Rec

over

y Ti

me

(in s

ec.)

Active Upstream

Page 5: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 5InternalPublic

Outline

1. Introduction2. An Adaptive Replication Scheme3. Evaluation4. Conclusion and Future Work

Page 6: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 6InternalPublic

Related Work

a) Improving Upstream Backup: Sweeping checkpoints, … Faster Recovery by using Micro Batch Processing (D-Stream [3],

TimeStream [4]) But: no user-configurable recovery time threshold

b) Combination of both mechanisms: Already proposed for Borealis by Hwang et al. [5] Static Optimizer proposed by Updahyaya et al. [6] Dynamic switching to handle overload/fault case by Martin et al. [7]/

Zhang et al.[8] But: static or without user-configurable recovery time threshold

Page 7: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 7InternalPublic

Adaptive Replication Scheme

Dynamically switch between upstream backup and active replication during runtime

Replication Scheme describes current replication mode for all operatorsActive Replication

Upstream BackupUpstream Backup

process

process

passive

reserved Switch

Roles

Switch Instance 2

SwitchInstance 1

process

process

reserved

Key Questions:1) When we need to switch replication mode?

→ Estimation Model for Upstream Backup

2) How to integrate with our elastic scaling system?

→ Adaption Algorithm

Page 8: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 8InternalPublic

Recovery Time Estimation

Many factors influence the recovery time: Operator type and Checkpointing time (static) State Size and Queue Length (changing with the current workload)

Our solution: estimation based on historical samples

Accurancy: 0.3 sec. error for 10 sec. recovery time (sample size: 1000)

Clustering EstimationHistorical Samples

Clustered Samples

Estimated Recovery Time

Current workload characteristics

Page 9: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 9InternalPublic

Single Operator Scenario

Observe current state size and queue length of all operators Adapt replication scheme if user threshold is not met

t

Operating interval

Recovery Time Threshold

EstimatedRecovery

Time

EstimatedRecovery Time

1

2Active Replication

Upstream Backup

Page 10: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 10InternalPublic

Integration with Elastic Scaling System

Architecture of an elastic scaling system Process many queries in parallel Places operators on a varying number of hosts based CPU + network

consumption Scaling requires moving operators between hosts

Integration Replication-aware operator placement System recovery time = max(recovery time per query) Monitor the recovery time for the crash of host h

Page 11: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 11InternalPublic

Example: Multi Query Scenario

System Recovery Time:

max(trec(q1) , trec(q2))

F1 A1S

A1‘F1‘

q1:

trec(q1)= max(trec(F1), trec(A1) , trec(D1))

D1

D1‘

F2 A1S

A2‘F2‘

q2:

trec(q2)= max(trec(F2), trec(A2) , trec(D2))

D1

D1‘

Page 12: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 12InternalPublic

Example: Operator Placement

F1 A1S

A1‘F1‘

D1

D1‘

F2 A2S

A2‘F2‘

q2:

D2

D2‘

Placement:Host 1

F1

F2

Host 2

A2

D1‘

F1‘Host 4

A2‘

D2

F1‘

A1

q1:

Host 3

D2‘

D1

A1‘

Recovery Time (max):

trec(F1), trec(F2),

trec(A1)

trec(A2) trec(D1) trec(A2), trec(D2)

Page 13: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 13InternalPublic

Example: Too High Recovery Time

F1 A1S

A1‘F1‘

D1

D1‘

F2 A2S

A2‘F2‘

q2:

D2

D2‘

Placement:Host 1

F1

F2

Host 2

A2

D1‘

F1‘Host 4

A2‘

D2

F1‘

A1

trec(F1), trec(F2),

trec(A1)

trec(A2) trec(D1)

q1:

Host 3

D2‘

D1

A1‘

trec(A2), trec(D2)Recovery Time (max):

Page 14: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 14InternalPublic

Example: Too High Recovery Time

F1 A1S

A1‘F1‘

D1

D1‘

F2 A2S

A2‘F2‘

q2:

D2

D2‘

Placement:Host 1

F1

F2

Host 2

A2

D1‘

F1‘Host 4

A2‘

D2

F1‘

A1

trec(F1), trec(F2),

trec(A1)

trec(A2), trec (D1‘) trec(D1), trec(A1)

q1:

Host 3

D2‘

D1

A1‘

trec(A2), trec(D2)Recovery Time (max):

Page 15: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 15InternalPublic

Example: Too Low Recovery Time

F1 A1S

A1‘F1‘

D1

D1‘

F2 A2S

A2‘F2‘

q2:

D2

D2‘

Placement:Host 1

F1

F2

Host 2

A2

D1‘

F1‘Host 4

A2‘

D2

F1‘

A1

trec(F1), trec(F2),

trec(A1)

trec(A2), trec (D1‘) trec(D1), trec(A1)

q1:

Host 3

D2‘

D1

A1‘

trec(A2), trec(D2)Recovery Time (max):

Page 16: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 16InternalPublic

Example: Too Low Recovery Time

F1 A1S

A1‘F1‘

D1

D1‘

F2 A2S

A2‘F2‘

q2:

D2

D2‘

Placement:Host 1

F1

F2

Host 2

A2

D1‘

F1‘Host 4

A2‘

D2

F1‘

A1

trec(F1), trec(F2),

trec(A1)

trec(A2), trec (D1‘) trec(D1), trec(A1)

q1:

Host 3

D2‘

D1

A1‘

trec(D2)Recovery Time (max):

Page 17: Adaptive Replication for Elastic Data Stream Processing

Evaluation

Page 18: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 18InternalPublic

Setup

Private cloud environment with up to 12 hosts Three Workloads: Financial, Twitter, Energy Sensors Measure characteristics like CPU load, latency, etc. in 10 seconds

intervals 20 crashes of a random host (immediately trigger recovery process) Recovery Time measured as maximal latency peak observed after a host

crash

Two baseline algorithms: Active Replication and Upstream Backup

Page 19: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 19InternalPublic

Recovery Time For Different Thresholds

Page 20: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 20InternalPublic

Adaptive Replication Scheme

Page 21: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 21InternalPublic

Summary

Active replication/upstream backup forces a hard trade-off between resource overhead and recovery time

Our adaptive replication scheme allows to customize trade-off based on user configuration

Future work Formalize approach for replication degree >2 Network-bound workloads Replication Placement

Page 22: Adaptive Replication for Elastic Data Stream Processing

© 2015 SAP SE or an SAP affiliate company. All rights reserved.

Thank you

Contact information:

Thomas HeinzeResearch [email protected]