
Towards a Smart Workload Generator on RAMP

Archana Ganapathi, David Patterson, Anthony Joseph

{archanag, pattrsn, adj}@cs.berkeley.edu

RADLAB Goals

Help Single Operator/Developer sites become Google-scale

Eliminate SW/HW obstacles for scaling

Tools to identify/fix problems

Use RAMP to move websites from right to left

[Figure: Top 50 Web Domains, millions of unique visitors for Feb 2006 (0 to 120) across the top 50 domains. Source: Washington Post 3/31/2006]

RADLAB Goals (2)

[Figure: YouTube.com 2006 Daily Traffic Ranking, Jan to Jun, log scale. Source: Alexa.com]

Challenges:

Scalability

Configurability

Single person operation

Cost-effectiveness

Reproducibility and Observability

RAMP for Time Travel

UCB goal for RAMP: Data center in a box

Google/Amazon.com: O(10,000) processors in data center

Anticipate load 3-6 months in the future for a fast-moving company

Smart Workload Generation: “smart gateware” design informed by SML (statistical machine learning) analysis of workload

Time Dilation on Emulated Machines: Try software/config changes and observe behavior prior to deployment

Targeted Component-specific Load Generation: Stress-test components in the critical path to determine performance limitations

RAMP as emulation environment + workload generator

Building Blocks

Workload generation engine

Parser to extract data from server response

Workload description language to specify primitives to compile onto an FPGA

Machine Learning techniques to discover web interactions

[Diagram labels: ML to determine interactions; User scripts; Real System; Workload Generator]

Naïve Workload Generator (CS252 class project by Lorenzo Orecchia and Madhur Tulsiani)

Generate the data-set using analytical models: server file size distribution, request size distribution, relative file popularity

Derive URL connectivity graph, load in memory

Circuit logic to perform random walk on graph

We achieve our goal of scalability:

Graph Size: 1048576, Memory Usage: 21 MB, Total data size: 21083 MB
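The random walk over the URL connectivity graph can be illustrated in software. The following is only a minimal sketch, not the class project's circuit: the names `build_graph` and `random_walk`, the fixed out-degree, and the uniform choice among outgoing links are all assumptions made for illustration.

```python
import random


def build_graph(num_pages, out_degree, seed=0):
    """Build a hypothetical URL graph: each page links to `out_degree` random pages."""
    rng = random.Random(seed)
    return [
        [rng.randrange(num_pages) for _ in range(out_degree)]
        for _ in range(num_pages)
    ]


def random_walk(graph, start, steps, seed=0):
    """Follow one uniformly chosen outgoing link per step; return the visited page ids."""
    rng = random.Random(seed)
    page = start
    trace = [page]
    for _ in range(steps):
        page = rng.choice(graph[page])  # one memory lookup per step
        trace.append(page)
    return trace


if __name__ == "__main__":
    graph = build_graph(num_pages=4096, out_degree=8)
    print(random_walk(graph, start=0, steps=10))
```

Each step needs one lookup of the current page's link list, which is why the number of memory accesses per second bounds the request rate a walk can sustain.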

Scalability: RAMP DRAM limits

Given: 2 GB DRAMs, 4 DRAM banks per FPGA, 100 MHz clock (cycle ~ 10 ns)

Per cycle = 21 bits per walk (21 Mbits for 1M walks)

Assume 10 clock cycles per access => 100 ns per access => 10 million accesses per second per bank per walk

Given four DRAM banks = 40M accesses per sec

Compare: Google receives 2000 requests per second
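The DRAM estimate above is a short chain of divisions, reproduced here as a sketch. The constant names are illustrative; the values are the ones stated on the slide.

```python
CLOCK_HZ = 100_000_000     # 100 MHz clock, i.e. a 10 ns cycle
CYCLES_PER_ACCESS = 10     # assumed DRAM access cost from the slide
DRAM_BANKS = 4             # banks per FPGA
GOOGLE_REQUESTS_PER_SEC = 2000

cycle_ns = 1_000_000_000 // CLOCK_HZ                 # 10 ns per cycle
access_ns = cycle_ns * CYCLES_PER_ACCESS             # 100 ns per access
accesses_per_bank = 1_000_000_000 // access_ns       # accesses per second per bank
accesses_per_fpga = accesses_per_bank * DRAM_BANKS   # accesses per second per FPGA
headroom = accesses_per_fpga // GOOGLE_REQUESTS_PER_SEC

print(f"{accesses_per_bank:,} accesses/s per bank")
print(f"{accesses_per_fpga:,} accesses/s per FPGA")
print(f"{headroom:,}x the quoted Google request rate")
```

Under these assumptions the memory system is not the bottleneck: four banks support on the order of tens of millions of lookups per second against a target of thousands of requests per second.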

Scalability: RAMP Ethernet limits

Given: 20 10 Gbit/sec Ethernet ports per board

Assume can generate 100 million accesses/sec

Naïve Response-ignorant workload generation: 4 bytes for URL + checksum + header… ~ 32 bytes => 3 GB for 100 million accesses (per second)

Smart Response-driven workload generation: Google: 23 KB, Flickr: 45 KB, CNN: 100 KB. Assume up to 200 KB response; can receive 50K responses per second per port = 1M responses handled by 20 10G ports.
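The Ethernet budget can be checked the same way. This sketch assumes decimal units (1 KB = 1000 bytes) and reads each port strictly as 10 Gbit/s; the constant names are illustrative. Under that reading a 200 KB response gives 6,250 responses per second per port, whereas the 50K per port quoted above corresponds to 10 gigabytes per second per port, so the unit assumption is worth confirming before relying on either number.

```python
PORTS = 20
PORT_BITS_PER_SEC = 10_000_000_000   # 10 Gbit/s per port
REQUEST_BYTES = 32                   # naive request: URL + checksum + header
REQUESTS_PER_SEC = 100_000_000       # assumed generation rate
RESPONSE_BYTES = 200_000             # assumed worst-case response, decimal KB

port_bytes_per_sec = PORT_BITS_PER_SEC // 8
board_bytes_per_sec = port_bytes_per_sec * PORTS

# Response-ignorant generation: outbound request traffic only.
naive_bytes_per_sec = REQUEST_BYTES * REQUESTS_PER_SEC

# Response-driven generation: inbound responses bound the rate.
responses_per_port = port_bytes_per_sec // RESPONSE_BYTES
responses_per_board = responses_per_port * PORTS

print(f"board capacity: {board_bytes_per_sec / 1e9:.1f} GB/s")
print(f"naive request traffic: {naive_bytes_per_sec / 1e9:.1f} GB/s")
print(f"responses/s per port: {responses_per_port:,}")
print(f"responses/s per board: {responses_per_board:,}")
```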

Ethernet limits cont.

About 1000X Google average with Smart (response-driven) generation

Mixed RAMP emulation/workload generation: even higher BW inside box

Have plenty of headroom to trade off speed for greater accuracy of workload

Some Open Questions

Limits on types of workloads?

Workload trace sources? Web services, existing traffic generators, …?

Role of Response-Ignorant trace generation? UDP, error/congestion-free TCP, …?

Required level of fidelity for Response-Driven trace generation? How much of TCP FSM to model?

Questions/Feedback?

Backup Slides

State of the art

Hardware vs. software based: Hammer, Optixia vs SURGE, SLAMd

Tunability vs Automation: SPECweb, TPC-W, Harpoon vs Optixia, SLAMd

Realistic vs Synthetic: SURGE, SLAMd, Harpoon vs TPC-W

Generic vs App-Specific: SLAMd, Harpoon vs TPC-W, Hammer

Open-loop vs Closed-loop: Partly-open loop is most realistic for web services
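The open-loop versus closed-loop distinction can be made concrete with a small simulation. This is an illustrative sketch only: the function names, the Poisson arrival process for the open loop, and the fixed service and think times for the closed loop are assumptions, not properties of any of the tools listed.

```python
import random


def open_loop_arrivals(rate_per_s, duration_s, seed=0):
    """Open loop: requests arrive on their own schedule, regardless of responses."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential inter-arrival gaps
        if t >= duration_s:
            return times
        times.append(t)


def closed_loop_arrivals(num_users, service_s, think_s, duration_s):
    """Closed loop: each user waits for the response, thinks, then sends again."""
    times = []
    for _ in range(num_users):
        t = 0.0
        while t < duration_s:
            times.append(t)
            t += service_s + think_s  # next request depends on the response
    return sorted(times)


if __name__ == "__main__":
    print(len(open_loop_arrivals(rate_per_s=2.0, duration_s=3.0)))
    print(len(closed_loop_arrivals(2, service_s=0.5, think_s=0.5, duration_s=3.0)))
    print(len(closed_loop_arrivals(2, service_s=1.5, think_s=0.5, duration_s=3.0)))
```

In the closed-loop case a slower server lowers the offered load, while the open-loop rate is unaffected. A partly-open generator sits between the two, with new users arriving independently and each user then behaving in a closed-loop fashion for a while.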

Workload Generator Next Steps

Handle server responses: include “server response” states in logic; parse server response to identify current state

Include think-time distribution: user think-time + server response time; what happens when things go wrong?

Improve temporal/spatial locality: prefetch other URLs a page is linked to; take advantage of Zipfian popularity distribution
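Two of the next steps above, think-time and Zipfian popularity, can be sketched together. The names `zipf_weights` and `generate_session` are hypothetical, the Zipf exponent of 1.0 is an assumption, and the exponential think-time distribution is a placeholder, since the slides do not say which distribution would be used.

```python
import random


def zipf_weights(num_files, exponent=1.0):
    """Normalized Zipf weights: the file at rank r gets weight proportional to 1/r**exponent."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, num_files + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def generate_session(num_files, num_requests, mean_think_s, seed=0):
    """Return (file_rank, think_time_s) pairs for one simulated user session."""
    rng = random.Random(seed)
    weights = zipf_weights(num_files)
    ranks = rng.choices(range(1, num_files + 1), weights=weights, k=num_requests)
    # Assumed think-time model: exponential with the given mean.
    return [(rank, rng.expovariate(1.0 / mean_think_s)) for rank in ranks]


if __name__ == "__main__":
    for rank, think in generate_session(num_files=600, num_requests=5, mean_think_s=2.0):
        print(f"request file rank {rank}, then think for {think:.2f} s")
```

Because low ranks dominate the request stream under a Zipfian distribution, keeping the most popular entries close to the walk logic is one way the locality improvement could pay off.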

Sketch of Random Walk Module

[Diagram: block labeled MEMORY]

Data-set parameters

Graph Size    Memory Usage    Total data size    Average file size
4096          50 KB           2138 MB            521 KB
65536         1 MB            3121 MB            47.6 KB
1048576       21 MB           21083 MB           20.11 KB

Circuit properties

Device: Virtex-E

Maximum delay: Walk Module: 3.99 ns, Memory Module: 5.82 ns

Estimated frequency = 1000/(3.99 + 5.82) = 1000/9.81 = 101.93 MHz

Number of LUTs per walk: 593

Number of slices per walk: 307


Request Size Distribution

[Figure: number of requests (out of 20000) versus request size (KB), observed and expected]

Popularity Distribution

[Figure: number of requests (out of 20000) versus file rank, observed and expected]
