scalability, accountability and instant information access for network centric warfare



Page 1: Scalability, Accountability and Instant Information Access for Network Centric Warfare

Johns Hopkins & Purdue, 15 Dec 05

Scalability, Accountability and Instant Information Access for Network Centric Warfare

Department of Computer Science, Johns Hopkins University

Yair Amir, Claudiu Danilov, Danny Dolev, Jon Kirsch, John Lane, Jonathan Shapiro

Chi-Bun Chan, Cristina Nita-Rotaru, Josh Olsen, David Zage

Department of Computer Science, Purdue University

http://www.cnds.jhu.edu

Page 2: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Dealing with Insider Threats

Project Goals:

• Scaling survivable replication to wide area networks.
  – Overcome 5 malicious replicas.
  – SRS goal: Improve latency by a factor of 3.
  – Self-imposed goal: Improve throughput by a factor of 3.
  – Self-imposed goal: Improve availability of the system.

• Dealing with malicious clients.
  – Compromised clients can inject authenticated but incorrect data, which is hard to detect on the fly.
  – Malicious or just an honest error? Can be useful for both.

• Exploiting application update semantics for replication speedup in malicious environments.
  – Weaker update semantics allow for an immediate response.

Page 3: Scalability, Accountability and Instant Information Access for Network Centric Warfare


State Machine Replication

• Main Challenge: Ensuring coordination between servers.
  – Requires agreement on the request to be processed and a consistent order of requests.

• Byzantine faults: BFT [CL99]: must contact 2f+1 out of 3f+1 servers and uses 3 rounds to allow consistent progress.

• Benign faults: Paxos [Lam98,Lam01]: must contact f+1 out of 2f+1 servers and uses 2 rounds to allow consistent progress.
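The server counts quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic on this slide; the function names are illustrative and are not taken from any BFT or Paxos implementation.

```python
def bft_sizes(f):
    """Figures quoted on the slide for tolerating f Byzantine faults."""
    return {"servers": 3 * f + 1, "must_contact": 2 * f + 1, "rounds": 3}


def paxos_sizes(f):
    """Figures quoted on the slide for tolerating f benign faults."""
    return {"servers": 2 * f + 1, "must_contact": f + 1, "rounds": 2}
```

For f = 5, this gives 16 servers with 11 to contact for BFT, and 11 servers with 6 to contact for Paxos.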

Page 4: Scalability, Accountability and Instant Information Access for Network Centric Warfare


State of the Art in Byzantine Replication: BFT [CL99]

Baseline technology

[Figure: BFT message flow between client C and replicas 0, 1, 2 and 3, through the phases request, pre-prepare, prepare, commit and reply.]

Page 5: Scalability, Accountability and Instant Information Access for Network Centric Warfare


The Paxos Protocol: Normal Case, after leader election [Lam98]

Key: A simple end-to-end algorithm

[Figure: Paxos message flow between client C and replicas 0, 1 and 2, through the phases request, proposal, accept and reply.]
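The normal-case flow (request, proposal, accept, reply) can be sketched from the leader's point of view. This is a simplified illustration and not the authors' code: the class name `NormalCaseLeader` is hypothetical, there is no networking or leader election, and counting the leader's own proposal as one accept is an assumption.

```python
class NormalCaseLeader:
    """Toy leader: assigns sequence numbers and waits for a majority of accepts."""

    def __init__(self, n_servers, leader_id=0):
        self.leader_id = leader_id
        self.majority = n_servers // 2 + 1   # f+1 out of 2f+1
        self.next_seq = 1
        self.accepts = {}                    # seq -> set of server ids that accepted
        self.ordered = []                    # seqs in the order they became ordered

    def propose(self, update):
        """Handle a client request: bind the update to the next sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        self.accepts[seq] = {self.leader_id}  # assumption: the proposal counts as an accept
        return seq

    def on_accept(self, seq, server_id):
        """Record an accept; return True the first time seq reaches a majority."""
        self.accepts[seq].add(server_id)
        if len(self.accepts[seq]) >= self.majority and seq not in self.ordered:
            self.ordered.append(seq)          # the reply to the client would be sent here
            return True
        return False
```

With three servers, a single accept from another server is enough to order the update; later accepts for the same sequence number are ignored.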

Page 6: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Steward: Survivable Technology for Wide Area Replication

• Each site acts as a trusted logical unit that can crash or partition.

• Effects of malicious faults are confined to the local site.
  – Threshold signatures prove agreement to other sites.

• Between sites:
  – Fault-tolerant protocol between sites.

• There is no free lunch: we pay with more hardware.

[Figure: a site, consisting of server replicas 1, 2, 3, ..., 3f+1 and its clients.]
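The idea that a site speaks to other sites only with proof of internal agreement can be illustrated with a stand-in for the threshold-signature check. The sketch below does no cryptography: it only counts distinct endorsing replicas, and the 2f+1 threshold is an assumption chosen to match the 2f+1 out of 3f+1 figure quoted elsewhere in these slides. The function name is hypothetical.

```python
def site_message_accepted(endorsers, f):
    """Stand-in for verifying a threshold signature from a site of 3f+1 replicas.

    endorsers: iterable of replica ids (0 .. 3f) that contributed a share.
    Returns True if at least 2f+1 distinct, in-range replicas endorsed.
    """
    site_size = 3 * f + 1
    distinct = {r for r in endorsers if 0 <= r < site_size}
    return len(distinct) >= 2 * f + 1
```

Because up to f replicas in a site may be malicious, any set of 2f+1 endorsers contains at least f+1 correct ones, which is why other sites can treat the site as a single logical unit.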

Page 7: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Challenges (I)

• Each site has a representative that:
  – Coordinates the Byzantine protocol inside the site.
  – Forwards packets in and out of the site.

• One of the sites acts as the leader in the wide area protocol.
  – The representative of the leading site is the one assigning sequence numbers to updates.

• How do we select and change the representatives and the leader site, in agreement?

• How do we transition safely when we need to change them?

Page 8: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Challenges (II)

• Messages coming out of a site during leader election are based on communication between 2f+1 (out of 3f+1) servers inside the site.
  – There can be multiple sets of 2f+1 servers.
  – In some instances, multiple correct but different site messages can be issued by a malicious representative.
  – It is sometimes impossible to completely isolate a malicious server's behavior inside its own site.

• This behavior can happen in two instances:
  – The servers inside a site propose a new leading site.
  – The servers inside a site report their individual status with respect to the global site progress.

• Developed a detailed proof of correctness of the protocol.
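To make "multiple sets of 2f+1 servers" concrete, the sketch below counts how many distinct sets exist in a site and how much any two of them must overlap. It is plain combinatorics under the 2f+1 out of 3f+1 figures above; the function names are illustrative.

```python
from math import comb


def quorum_count(f):
    """Number of distinct sets of 2f+1 servers in a site of 3f+1."""
    return comb(3 * f + 1, 2 * f + 1)


def min_quorum_overlap(f):
    """Smallest possible intersection of two sets of 2f+1 out of 3f+1 servers."""
    return 2 * (2 * f + 1) - (3 * f + 1)   # simplifies to f + 1
```

For f = 5 there are 4368 possible sets, and any two share at least 6 servers, so they share at least one correct server. The overlap alone does not stop a malicious representative from assembling different valid site messages out of different sets, which is the difficulty this slide describes.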

Page 9: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Main idea

• Sites change their local representatives based on timeouts.

• The leader site representative has a larger timeout.
  – Allows for communication with at least one correct rep. at other sites.

• After changing f+1 leader site representatives, servers at all sites stop participating in the protocol and elect a different leading site.
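One possible reading of the last rule is sketched below: each timeout of the leader site's representative counts as one change, and the (f+1)-th change triggers a new leading site. The round-robin choice of the next representative and the next site, and the class name `LeaderTracker`, are assumptions made for illustration; the slides do not specify them.

```python
class LeaderTracker:
    """Tracks the leading site and its representative across timeouts."""

    def __init__(self, num_sites, f):
        self.num_sites = num_sites
        self.f = f
        self.site_size = 3 * f + 1
        self.leader_site = 0
        self.leader_rep = 0
        self.rep_changes = 0   # representative changes since the leading site was elected

    def on_leader_rep_timeout(self):
        """Called when the leader site representative times out."""
        self.rep_changes += 1
        if self.rep_changes >= self.f + 1:
            # f+1 representatives tried: at least one was correct, so blame the site.
            self.leader_site = (self.leader_site + 1) % self.num_sites
            self.leader_rep = 0
            self.rep_changes = 0
            return "new_leader_site"
        self.leader_rep = (self.leader_rep + 1) % self.site_size
        return "new_representative"
```

The f+1 bound matters because at most f servers in a site are malicious, so f+1 consecutive representatives must include a correct one.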

Page 10: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Steward: First Byzantine Replication Scalable to Wide Area Networks

• A second iteration implementation
  – Based on the complete theoretical design.
  – Follows closely the pseudocode proven to be correct.

• We benchmarked the new implementation against the program metrics.

• The code successfully passed the red-team experiment.

• We believe it is theoretically unbreakable.

Page 11: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Testing Environment

Platform: Dual Intel Xeon CPU 3.2 GHz, 64 bits, 1 GByte RAM, Linux Fedora Core 3.

Library relies on OpenSSL:
- Used OpenSSL 0.9.7a 19 Feb 2003.

Baseline operations:
- RSA 1024-bit sign: 1.3 ms, verify: 0.07 ms.
- Perform modular exponentiation, 1024 bits: ~1 ms.
- Generate a 1024-bit RSA key: ~55 ms.
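The baseline timings translate directly into a per-CPU ceiling on cryptographic operations. The sketch below is back-of-the-envelope arithmetic on the numbers above and ignores all other protocol costs; the function name is illustrative.

```python
def ops_per_second(ms_per_op):
    """Upper bound on sequential operations per second for a given cost in ms."""
    return 1000.0 / ms_per_op


SIGN_MS = 1.3      # RSA 1024-bit sign, from the slide
VERIFY_MS = 0.07   # RSA 1024-bit verify, from the slide

max_signs = ops_per_second(SIGN_MS)       # roughly 769 signatures per second
max_verifies = ops_per_second(VERIFY_MS)  # roughly 14286 verifications per second
```

Signing is about 18 times more expensive than verifying, so the number of signatures generated per update is the more important cost to watch.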

Page 12: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Evaluation Network 1: Symmetric Wide Area Network

• Synthetic network used for analysis and understanding.

• 5 sites, each connected to all other sites with equal bandwidth/latency links.

• One fully deployed site of 16 replicas; the other sites are emulated by one computer each.

• Total: 80 replicas in the system, emulated by 20 computers.

• 50 ms wide area links between sites.

• Varied wide area bandwidth and the number of clients.
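The replica and computer counts follow from the site size 3f+1. The sketch below reproduces them, assuming f = 5 per site (the value the deck quotes in its performance metrics); the function name is illustrative.

```python
def deployment(num_sites, f, fully_deployed_sites=1):
    """Return (replicas per site, total replicas, physical computers).

    Each emulated site is assumed to run on exactly one computer.
    """
    per_site = 3 * f + 1
    total = num_sites * per_site
    computers = fully_deployed_sites * per_site + (num_sites - fully_deployed_sites)
    return per_site, total, computers
```

Calling `deployment(5, 5)` gives 16 replicas per site, 80 replicas in total, and 20 computers (16 for the real site plus one for each of the four emulated sites).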

Page 13: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Write Update Performance

• Symmetric network.
• 5 sites.

• Steward:
  • 16 replicas per site.
  • Total of 80 replicas (four sites are emulated).
  • Actual computers: 20.

• BFT:
  • 16 replicas total.
  • 4 replicas in one site, 3 replicas in each other site.

• Update only performance (no disk writes).

[Figure: Update Throughput. Updates/sec (0 to 90) versus number of clients (0 to 30), for Steward and BFT at 10 Mbps, 5 Mbps and 2.5 Mbps.]

[Figure: Update Latency. Latency in ms (0 to 1000) versus number of clients (0 to 30), for Steward and BFT at 10 Mbps, 5 Mbps and 2.5 Mbps.]

Page 14: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Read-only Query Performance

• 10 Mbps on wide area links.

• 10 clients inject mixes of read-only queries and write updates.

• None of the systems was limited by bandwidth.

• Performance improves by between a factor of two and more than an order of magnitude.

• Availability: Queries can be answered locally, within each site.

[Figure: Query Mix Throughput. Actions/sec (0 to 500) versus update ratio (0 to 100%), for Steward and BFT.]

[Figure: Query Mix Latency. Latency in ms (0 to 500) versus update ratio (0 to 100%), for Steward and BFT.]

Page 15: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Evaluation Network 2: Practical Wide-Area Network

• Based on a real experimental network (CAIRN).
• Modeled on our cluster, emulating bandwidth and latency constraints, both for Steward and BFT.

[Figure: CAIRN topology. Nodes ISIPC, ISIPC4, TISWPC, ISEPC3, ISEPC, UDELPC and MITPC, at sites in Virginia, Delaware, Boston, San Jose and Los Angeles. Wide area links: 38.8 ms at 1.86 Mbits/sec; 1.4 ms at 1.47 Mbits/sec; 4.9 ms at 9.81 Mbits/sec; 3.6 ms at 1.42 Mbits/sec. Local links: 100 Mb/s, < 1 ms.]

Page 16: Scalability, Accountability and Instant Information Access for Network Centric Warfare


CAIRN Emulation Performance

• The 1.86 Mbps link between the East and West coasts is the bottleneck.

• Steward is limited by bandwidth at 51 updates per second.

• 1.8Mbps can barely accommodate 2 updates per second for BFT.

• Earlier experimentation with benign fault 2-phase commit protocols achieved up to 76 updates per second.
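The bandwidth-limited figures above imply a rough per-update traffic budget on the bottleneck link. The sketch below is a back-of-the-envelope estimate, assuming the link is fully saturated at the quoted throughput; it is not a measurement from the experiments, and the function name is illustrative.

```python
def bits_per_update(link_mbps, updates_per_sec):
    """Bottleneck-link bits available per update when the link is saturated."""
    return link_mbps * 1_000_000 / updates_per_sec


steward_bits = bits_per_update(1.86, 51)  # roughly 36,500 bits (about 4.6 KB) per update
bft_bits = bits_per_update(1.8, 2)        # roughly 900,000 bits (about 112 KB) per update
```

Under this estimate BFT sends on the order of 25 times more traffic per update across the coast-to-coast link than Steward, which is consistent with the throughput gap reported on this slide.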

[Figure: CAIRN Update Throughput. Updates/sec (0 to 90) versus number of clients (0 to 30), for Steward and BFT.]

[Figure: CAIRN Update Latency. Latency in ms (0 to 1400) versus number of clients (0 to 30), for Steward and BFT.]

Page 17: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Wide-Area Scalability (3)

• Selected 5 PlanetLab sites on 5 different continents: US, Brazil, Sweden, Korea and Australia.

• Measured bandwidth and latency between every pair of sites.

• Emulated the network on our cluster, both for Steward and BFT.

• 3-fold latency improvement even when bandwidth is not limited.

[Figure: PlanetLab Update Throughput. Updates/sec (0 to 90) versus number of clients (0 to 30), for Steward and BFT.]

[Figure: PlanetLab Update Latency. Latency in ms (0 to 1400) versus number of clients (0 to 30), for Steward and BFT.]

Page 18: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Performance metrics

• The system can withstand f (5) faults in each site.
• Performs better than a flat solution that withstands f (5) faults total.
• Quantitative improvements - Performance
  – Between twice and over 30 times lower latency, depending on network topology and update/query mix.
  – Program metric met and exceeded in most types of wide area networks, even when only write updates are considered.
• Qualitative improvements - Availability
  – Read-only queries can be answered locally even in case of partitions.
  – Write updates can be done when only a majority of sites are connected (as opposed to 2f+1 out of 3f+1 connected servers).
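The availability difference in the last bullet can be illustrated with two small predicates. This is a simplified sketch: it assumes each connected Steward site is itself working internally, and it uses the flat BFT layout from the symmetric-network experiment (16 replicas placed as 4, 3, 3, 3, 3 across five sites). Function names are illustrative.

```python
def steward_can_update(connected_sites, total_sites):
    """Steward orders updates when a strict majority of sites is connected."""
    return connected_sites > total_sites // 2


def flat_bft_can_update(connected_servers, f):
    """Flat BFT orders updates when 2f+1 of its 3f+1 servers are connected."""
    return connected_servers >= 2 * f + 1


# Three of five sites connected, including the site that holds 4 BFT replicas.
bft_layout = [4, 3, 3, 3, 3]
connected = sum(bft_layout[:3])   # 10 servers reachable
```

With three of five sites connected, Steward can still order updates, while the flat deployment reaches only 10 of the 11 servers it needs for f = 5.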

Page 19: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Red Team Experiment

• Excellent interaction with both the red team and the white team.

• Performance evaluation: symmetric network
  – Several points on the performance graphs presented were re-evaluated.
    • Results were almost identical.
  – Thorough discussions regarding the measuring methodology and the presentation of the latency results.
    • Validated our experiments.
  – Five crash faults were induced in the leading site.
    • Performance slightly improved!

Page 20: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Red Team Experiment (2)

• Steward under attack
  – Five sites, 4 replicas each.
  – Red team had full control (sudo) over five replicas, one in each site.
  – Compromised replicas were injecting:
    • Loss (up to 20% each)
    • Delay (up to 200 ms)
    • Packet reordering
    • Fragmentation (up to 100 bytes)
    • Replay attacks
  – Compromised replicas were running modified servers that contained malicious code.

[Figure: diagram with labels 1 through 5.]

Page 21: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Red Team Experiment (3)

• The system was NOT compromised!
  – Safety and liveness guarantees were preserved.
  – The system continued to run correctly under all attacks.
  – All logs from all experiments are available.

• Most of the attacks did not affect the performance.
• The system was slowed down when the representative of the leading site was attacked.
  – Speed of update ordering was reduced to 1/5 of its normal value.
  – Speed was not low enough to trigger defense mechanisms.
  – Crashing the corrupt representative caused the system to do a view change and regain performance.

Page 22: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Red Team Experiment (4)

Lessons learned:

• We rebuilt the entire system with the red-team attack in mind: we learned a lot even before the experiment.

• The overall performance of the system could be improved by not validating messages that are not needed (after 2f+1 messages have been received).

• Performance under attack could be improved substantially with further research.
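The optimization in the second lesson (skipping validation once 2f+1 messages are in hand) can be sketched as an early-exit loop. This is an illustration rather than Steward's code: `collect_quorum` and the `verify` callback are hypothetical names, and `verify` stands in for whatever signature check the real system performs.

```python
def collect_quorum(messages, f, verify):
    """Verify messages in arrival order, stopping once 2f+1 valid ones are found.

    messages: iterable of (sender_id, payload) pairs.
    verify:   callable(sender_id, payload) -> bool, the expensive check.
    Returns (quorum_reached, number_of_verifications_performed).
    """
    needed = 2 * f + 1
    valid_senders = set()
    verifications = 0
    for sender, payload in messages:
        if len(valid_senders) >= needed:
            break                      # quorum reached: later messages are not needed
        if sender in valid_senders:
            continue                   # duplicate from an already counted sender
        verifications += 1
        if verify(sender, payload):
            valid_senders.add(sender)
    return len(valid_senders) >= needed, verifications
```

With f = 1 and four valid messages, only three verifications are performed; the fourth message is never checked.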

Page 23: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Next Steps: Throughput Comparison (CAIRN)

[Figure: update transactions per second (0 to 400) versus number of clients (0 to 140) on Evaluation Network 2. Legend: Congruity Engine; Upper bound 2PC. Annotation: [ADMST02], not Byzantine.]

Page 24: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Next Steps:

• Performance during common operation:
  – We believe that wide-area throughput performance can be improved by at least a factor of 5 by using a more elaborate replication algorithm between wide area sites.

• Performance under attack:
  – So far, we have focused only on optimizing performance in the common case, while guaranteeing safety and liveness at all times. Performance under attack is extremely important, but not trivial to achieve.

• System availability and safety guarantees:
  – A Byzantine-tolerant protocol between wide-area sites would guarantee system availability and safety even when some of the sites are completely compromised.

Page 25: Scalability, Accountability and Instant Information Access for Network Centric Warfare


Scalability, Accountability and Instant Information Access for Network-Centric Warfare

http://www.cnds.jhu.edu/funding/srs/

New ideas

• First scalable wide-area intrusion-tolerant replication architecture.
• Providing accountability for authorized but malicious client updates.
• Exploiting update semantics to provide instant and consistent information access.

Impact

• Resulting systems with at least 3 times higher throughput, lower latency and high availability for updates over wide area networks.
• Clear path for technology transitions into military C3I systems such as the Army Future Combat System.

Schedule

[Figure: timeline with milestones June 04, Dec 04, June 05 and Dec 05. Activities: C3I model, baseline and demo; Component analysis & design; Component implement.; Comp. eval.; System integration & evaluation; Final C3I demo and baseline eval.]