Supporting Strong Cache Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand
S. Narravula, P. Balaji, K. Vaidyanathan, S. Krishnamoorthy, J. Wu and D. K. Panda
The Ohio State University
Presentation Outline
Introduction/Motivation
Design and Implementation
Experimental Results
Conclusions
Introduction
Fast Internet growth
- Number of users
- Amount of data
- Types of services
Several uses
- E-Commerce, online banking, online auctions, etc.
Types of content
- Static content: images, documents, audio clips, video clips, etc.
- Dynamic (active) content: stock quotes, online stores (Amazon), online banking, etc.
Presentation Outline
Introduction/Motivation
- Multi-Tier Data-Centers
- Active Caches
- InfiniBand
Design and Implementation
Experimental Results
Conclusions
Multi-Tier Data-Centers
From single powerful computers to clusters
- Low "cost to performance" ratio
- Increasingly popular
Multi-tier data-centers: scalability is an important issue
A Typical Multi-Tier Data-Center
[Figure: clients connect over a WAN to proxy nodes (Tier 0), which front web servers (Apache) and application servers (PHP) (Tier 1), backed by database servers (Tier 2).]
Tiers of a Typical Multi-Tier Data-Center
Proxy nodes: handle caching, load balancing, security, etc.
Web servers: handle the HTML content
Application servers: handle dynamic content, provide services
Database servers: handle persistent storage
Data-Center Characteristics
[Figure: computation per request grows from the front-end tiers to the back-end tiers.]
- The amount of computation required for processing each request increases as we go to the inner tiers of the data-center
- Caching at the front tiers is an important factor for scalability
Presentation Outline
Introduction/Motivation
- Multi-Tier Data-Centers
- Active Caches
- InfiniBand
Design and Implementation
Experimental Results
Conclusions
Caching
Can avoid re-fetching of content; beneficial if requests repeat
Static content caching
- Well studied in the past
- Widely used
[Figure: the number of requests decreases from the front-end tiers to the back-end tiers.]
Active Caching: caching of dynamic data
- Stock quotes, scores, personalized content, etc.
Simple caching methods are not suited
Issues:
- Consistency
- Coherency
[Figure: a user request served from the proxy node's cache while updates are applied to the back-end data.]
Cache Consistency
- Non-decreasing views of system state
- Updates seen by all or none
[Figure: user requests arrive at the proxy nodes; updates are applied at the back-end nodes.]
Cache Coherency
Refers to the average staleness of the document served from cache
Two models of coherency:
- Bounded staleness (weak coherency)
- Strong or immediate (strong coherency)
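The two models can be contrasted in a small sketch. This is illustrative only; the class and function names are our own, not from the system described in this talk:

```python
import time

class CacheEntry:
    """A cached response with its back-end version and fetch time."""
    def __init__(self, value, version, fetched_at):
        self.value = value
        self.version = version
        self.fetched_at = fetched_at

def weak_coherent_read(entry, ttl_seconds, now=None):
    """Bounded staleness: serve from cache until the TTL expires."""
    now = time.time() if now is None else now
    return entry.value if now - entry.fetched_at < ttl_seconds else None

def strong_coherent_read(entry, backend_version):
    """Strong coherency: serve only if the back-end version still matches."""
    return entry.value if entry.version == backend_version else None
```

Under weak coherency a stale value may be served until the TTL runs out; under strong coherency every read is validated against the current back-end state.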
Strong Cache Coherency
An absolute necessity for certain kinds of data
- Online shopping, travel ticket availability, stock quotes, online auctions
Example: online banking
- Cannot afford to show different values to different concurrent requests
Caching policies
- TTL/Adaptive TTL
- Invalidation *
- Client Polling
- No Caching
[Figure: the policies arranged by the strength of the coherency and consistency guarantees they provide.]
*D. Li, P. Cao, and M. Dahlin. WCIP: Web Cache Invalidation Protocol. IETF Internet Draft, November 2000.
Presentation Outline
Introduction/Motivation
- Multi-Tier Data-Centers
- Active Caches
- InfiniBand
Design and Implementation
Experimental Results
Conclusions
InfiniBand
High performance
- Low latency
- High bandwidth
Open industry standard
Provides rich features
- RDMA, remote atomic operations, etc.
Targeted for data-centers
Transport layers
- VAPI
- IPoIB
- SDP
Performance
[Figure: latency (µs) vs. message size (2 bytes to 16 KB) and throughput (MB/s) vs. message size (4 bytes to 64 KB) for IPoIB, SDP, and VAPI.]
- Low latencies of less than 5 µs achieved
- Bandwidth over 840 MB/s
* SDP and IPoIB from Voltaire's software stack
Performance
- Receiver-side CPU utilization is very low
- Leveraging the benefits of one-sided communication
[Figure: RDMA Read throughput (Mbps) vs. message size (1 byte to 256 KB), with send- and receive-side CPU utilization, for polling- and event-based completion.]
Caching policies
- TTL/Adaptive TTL
- Invalidation
- Client Polling
- No Caching
[Figure: the policies arranged by the strength of the coherency and consistency guarantees they provide.]
Objective
To design an architecture that very efficiently supports strong cache coherency on InfiniBand
Presentation Outline
Introduction/Motivation
Design and Implementation
Experimental Results
Conclusions
Basic Architecture
External modules are used
Module communication can use any transport
Versioning:
- Application servers version dynamic data
- The version value of the data is passed to the front-end with every request to the back-end
- The version is maintained by the front-end along with the cached value of the response
Mechanism
Cache hit:
- Back-end version check
- If the version is current, use the cache
- Invalidate data on a failed version check
Cache miss:
- Get data to the cache
- Initialize local versions
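The cache-hit/cache-miss mechanism can be sketched as follows. This is a minimal illustration, not the actual module: `read_backend_version` and `fetch_from_backend` are hypothetical stand-ins for the module's communication with the back-end.

```python
class ProxyCache:
    """Front-end cache that validates entries against back-end versions."""

    def __init__(self, read_backend_version, fetch_from_backend):
        self.read_backend_version = read_backend_version  # key -> version
        self.fetch_from_backend = fetch_from_backend      # key -> (value, version)
        self.cache = {}                                   # key -> (value, version)

    def get(self, key):
        if key in self.cache:                             # cache hit
            value, version = self.cache[key]
            if version == self.read_backend_version(key):
                return value                              # version current: use cache
            del self.cache[key]                           # failed check: invalidate
        # cache miss (or just invalidated): fetch and record the local version
        value, version = self.fetch_from_backend(key)
        self.cache[key] = (value, version)
        return value
```

A back-end update that bumps the version causes the next hit to fail its check and re-fetch, which is exactly what gives strong coherency.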
Architecture
[Figure: request flow between front-end and back-end; a cache hit triggers a version check, a cache miss fetches the data before the response is sent.]
Design
Every server has an associated module that uses IPoIB, SDP or VAPI to communicate.
VAPI:
- When a request arrives at the proxy, the VAPI module is contacted
- The module reads the latest version of the data from the back-end using a one-sided RDMA Read operation
- If the versions do not match, the cached value is invalidated
VAPI Architecture
[Figure: request flow between front-end and back-end; on a cache hit, the module performs the version check with an RDMA Read into back-end memory.]
Implementation
Socket-based implementation:
- IPoIB and SDP are used
- The back-end version check is done using two-sided communication from the module
Requests to read and update are mutually excluded at the back-end module to avoid simultaneous readers and writers accessing the same data.
Minimal changes to existing software.
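The read/update exclusion at the back-end module might look like the following sketch. We assume a single lock per module; the class and method names are our own illustration, not the paper's implementation:

```python
import threading

class BackendModule:
    """Back-end module serializing version checks against data updates."""

    def __init__(self):
        self.lock = threading.Lock()
        self.versions = {}   # key -> version number
        self.data = {}       # key -> current value

    def check_version(self, key):
        # Serves the front-end module's version-check requests.
        with self.lock:
            return self.versions.get(key, 0)

    def update(self, key, value):
        # An update bumps the version under the same lock, so a concurrent
        # version check never observes a partially applied update.
        with self.lock:
            self.data[key] = value
            self.versions[key] = self.versions.get(key, 0) + 1
```

Holding one lock across both the value write and the version bump is what prevents a reader from seeing a new version paired with old data, or vice versa.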
Presentation Outline
Introduction/Motivation
Design and Implementation
Experimental Results
- Data-Center Throughput
- Data-Center Response Time
- Data-Center Break-up
- Zipf and WC Trace Throughput
Conclusions
Experimental Test-bed
Eight dual 2.4 GHz Xeon processor nodes
- 64-bit 133 MHz PCI-X interfaces
- 512 KB L2 cache and 400 MHz front-side bus
Mellanox InfiniHost MT23108 dual-port 4x HCAs
MT43132 eight-port 4x switch
SDK version 0.2.0, firmware version 1.17
Data-Center: Performance
- The VAPI module can sustain performance even with heavy load on the back-end servers
[Figure: data-center throughput (transactions per second, TPS) vs. number of compute threads (0 to 200) for No Cache, IPoIB, VAPI, and SDP.]
Data-Center: Performance
- The VAPI module responds faster even with heavy load on the back-end servers
[Figure: data-center response time (ms) vs. number of compute threads (0 to 200) for No Cache, IPoIB, VAPI, and SDP.]
Response Time Breakup
[Figure: response time split (ms) at 0 compute threads into client communication, proxy processing, module processing, and back-end version check, for IPoIB, SDP, and VAPI.]
- Worst-case module overhead is less than 10% of the response time
- Minimal overhead for the VAPI-based version check, even for 200 compute threads
[Figure: response time split (ms) at 200 compute threads into client communication, proxy processing, module processing, and back-end version check, for IPoIB, SDP, and VAPI.]
Data-Center: Throughput
[Figure: throughput (transactions per second) vs. number of compute threads (0 to 200) for No Cache, IPoIB, VAPI, and SDP, under a Zipf distribution and under the World Cup trace.]
- The drop in the throughput of VAPI on the World Cup trace is due to the higher penalty for cache misses under increased load
- The VAPI implementation does better for the real trace too
Conclusions
An architecture for supporting strong cache coherency
- External module-based design
- Freedom in the choice of transport
- Minimal changes to existing software
The sockets API has an inherent limitation: two-sided communication
- High-performance sockets (SDP) are not the solution
Main benefit: the one-sided nature of RDMA calls
Web Pointers
http://nowlab.cis.ohio-state.edu/
E-mail: {narravul, balaji, vaidyana, savitha, wuj, panda}@cis.ohio-state.edu
NBC (Network-Based Computing Laboratory) home page