

Machine Learning for Router Congestion
Kurtis Heimerl
bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F07/Kurtis_poster.pdf

Preamble/Abstract

Network congestion in datacenters has taken down some of the largest service providers. In this work, I combine machine learning techniques with the flexibility a datacenter environment provides to remedy this problem. I focus in particular on the router, and use an active testing methodology to determine “failure dependencies” in the datacenter. With these dependencies, I will be able to route packets more effectively during times of congestion.

Goals:
• Determine the relative importance of services by testing at the router level.
• Route packets in a congestion event as dictated by their relative importance.
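The second goal can be illustrated with a minimal sketch. The poster does not specify a routing policy, so everything here is an assumption: the function name `admit_packets`, the per-service `importance` scores, the service names, and the simple rule of admitting the most important packets first until a byte budget is used up.

```python
from typing import Dict, List, Tuple

Packet = Tuple[str, int]  # (service name, size in bytes); hypothetical representation


def admit_packets(
    packets: List[Packet],
    importance: Dict[str, float],
    capacity_bytes: int,
) -> List[Packet]:
    """Admit the most important packets that fit within capacity_bytes.

    importance maps a service name to an assumed score (higher = more
    important). Services without a score are treated as importance 0.
    """
    # Rank highest importance first. sorted() is stable, so packets of
    # equally important services keep their arrival order.
    ranked = sorted(packets, key=lambda p: importance.get(p[0], 0.0), reverse=True)

    admitted: List[Packet] = []
    used = 0
    for service, size in ranked:
        # Skip (drop) any packet that would exceed the remaining budget.
        if used + size <= capacity_bytes:
            admitted.append((service, size))
            used += size
    return admitted


# Hypothetical congestion event: 1500 bytes offered, 1000 bytes of capacity.
queue = [("batch", 600), ("hdfs", 500), ("web", 400)]
scores = {"hdfs": 0.8, "web": 0.6, "batch": 0.1}
kept = admit_packets(queue, scores, capacity_bytes=1000)
```

In this example the least important service's packet is the one dropped. A real router would more likely use priority queues or weighted scheduling than a one-shot greedy pass.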


Key Concepts/Tools
• Active testing
  • Actively drop packets to test the nature of the dependencies.
• Failure dependencies rather than generic dependencies
  • Generic dependencies do not allow us to route.
  • Actively modifying the network allows us to test the nature of the dependency.
• Batch jobs rather than online algorithms
  • Distinct phases
    • Data gathering (blue)
    • Testing (red)
  • Each phase requires only one SVM run, which allows us to reduce the overhead by using a coprocessor.
• Datacenter service redundancy
  • Services expect outages.
  • Infrastructure exists to restart dead services.
  • These allow us to kill services and expect minimal effect on the users of the services.
• Weighing later data points more heavily
  • Services may not immediately recover from outages.
  • Weigh packets based on the expectation of a service recovery.
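The idea of weighing later data points more heavily can be sketched as follows. The poster names equal, linear, and exponential weightings in its results but gives no formulas, so the definitions below are assumptions: weights are normalized to sum to 1, the linear weight grows with the sample index, and the exponential weight uses a hypothetical `rate` parameter.

```python
import math
from typing import List, Sequence


def equal_weights(n: int) -> List[float]:
    """Every observation counts the same."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [1.0 / n] * n


def linear_weights(n: int) -> List[float]:
    """Weight grows linearly with the observation index (assumed form)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    raw = [float(i + 1) for i in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def exponential_weights(n: int, rate: float = 0.5) -> List[float]:
    """Weight grows exponentially with the index; rate is hypothetical."""
    if n < 1:
        raise ValueError("n must be at least 1")
    raw = [math.exp(rate * i) for i in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_score(observations: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of observations; weights are assumed to sum to 1."""
    if len(observations) != len(weights):
        raise ValueError("observations and weights must have the same length")
    return sum(o * w for o, w in zip(observations, weights))


# Hypothetical signal: a service shows no effect at first, then fails
# in the later samples taken after packets were dropped.
signal = [0.0, 0.0, 1.0, 1.0]
score_equal = weighted_score(signal, equal_weights(4))
score_linear = weighted_score(signal, linear_weights(4))
score_exponential = weighted_score(signal, exponential_weights(4))
```

With these assumed definitions, an effect that appears late in the test window scores higher under the linear and exponential weightings than under equal weighting.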

Pair     Equal Weight       Linear Weight      Exponential Weight
117-108  Weight 0
158-108  0.603175568804     0.678278982714944  0.689595088365564
174-108  0.642814009661783  0.797291379253128  0.844763634041119

Result: 117 depends on 108. Very unlikely that 108 depends on 117.

HDFS 3 nodes, 3 replication: 174-108 disconnected

HDFS 3 nodes, 1 replication: 117-108 disconnected

Pair     Equal Weight        Linear Weight       Exponential Weight
117-108  0.289433384379793   0.281712955072555   0.27887394256017
117-158  Weight 0
158-108  0.0677545796336299  0.0794150611951044  0.0835522769207024
174-108  Weight 0
174-117  No Data
174-158  Weight 0

Result: 174 depends on 108. Possible that 108 depends on 174.
