Hippodrome: Automatic Global Storage Adaptation
Eric Anderson, Mustafa Uysal, Michael Hobbs, Guillermo Alvarez, Mahesh Kallahalla, Kim Keeton, Arif Merchant, Erik Riedel, Susan Spence, Ram Swaminathan, Simon Towers, Alistair Veitch, John Wilkes; HP Labs Storage Systems Program
[Loop diagram: Analyze Workload → Design New Configuration → Migrate to Configuration → Execute Application → (repeat)]
Hippodrome: Why?
• Computer systems very complex
• System administrators very expensive
• Let the computer handle it
• Optimize the system for the workload as it changes
• Determine when to add/remove hardware
• Two parts to this talk:
• Description of framework for managing a large I/O centric system
• Experimental results showing when it works and when it doesn’t.
Hippodrome: Lessons
• Global system adaptation possible by use of four parts of the loop:
  • Solver: Finds new "optimal" configuration
  • Models: Predicts the performance of a configuration
  • Analysis: Generates summary of a workload
  • Migration: Moves current configuration to new one
• "Goodness" dependent on accuracy of models
• Rate of adaptation dependent on "over-commit" available in the system
• A gradually increasing workload can always be "good" if enough headroom exists
Hippodrome: Our System
• Targeted at applications running on large storage systems
• Solver chooses appropriate configuration for array and mapping of application-level storage units onto the array
• Experiments use synthetic applications for ease of understanding "good" behaviour
• Applications run on an N-class server and access an HP FC-60 disk array via switched fibre channel
Hippodrome: Four Parts Needed for Adaptation
  • Analysis: Generates summary of a workload
  • Models: Predicts the performance of a configuration
  • Solver: Finds new "optimal" configuration
  • Migration: Moves current configuration to new one
• Solver and Models both part of "Design New Configuration" step
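The four parts above form one control loop. A minimal sketch of that loop follows; the four callables (run_application, analyze, solve, migrate) stand in for the real Hippodrome components and their signatures are assumptions, not the actual API.

```python
def adaptation_loop(run_application, analyze, solve, migrate,
                    initial_config, steps):
    """Iterate Execute -> Analyze -> Design -> Migrate for `steps`
    iterations, returning the sequence of configurations the system
    passed through."""
    config = initial_config
    history = [config]
    for _ in range(steps):
        trace = run_application(config)   # Execute Application
        workload = analyze(trace)         # Analyze Workload
        config = solve(workload, config)  # Design New Configuration
        migrate(config)                   # Migrate to Configuration
        history.append(config)
    return history
```

Each experiment later in the talk is exactly a fixed number of trips around this loop, with each trip called a "step".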
Hippodrome: Analysis, Models, Solver, Migration
• Trace the I/Os generated, then run them through analysis tools to create a "workload" file.
• Two parts generated from analysis:
• "stores:" a logically contiguous fixed-size block of storage. Usually implemented as a logical volume
• "streams:" an access pattern to a particular store. Currently defined as average request rate, average request size, run count, on/off time, overlap fraction
• In our experiments, some additional per-stream values are also calculated to ease understanding the behaviour of the system
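The two summary records the analysis step emits can be sketched as plain data types; the stream fields follow the attributes listed above, but the exact field names and units are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Store:
    """A logically contiguous, fixed-size block of storage, usually
    implemented as a logical volume."""
    name: str
    capacity_mb: int

@dataclass
class Stream:
    """An access pattern to a particular store."""
    store: Store
    request_rate: float      # average requests per second
    request_size_kb: float   # average request size
    run_count: float         # average sequential run length (1 = random)
    on_time_s: float         # seconds active per on/off cycle
    off_time_s: float        # seconds idle per on/off cycle
    overlap_fraction: float  # fraction of on-time shared with other streams
```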
Hippodrome: Analysis, Models, Solver, Migration
• Two inputs to models:
  • Device configuration: Logical Units (LUNs) with disk type, number of disks, RAID level, stripe size; array controller associated with each LUN
  • Workload configuration: list of stores on each LUN, and therefore the streams accessing that LUN and using the associated controller
• Output is utilization of each component (disk, controller, SCSI bus, etc.)
• In our experiments, models are calibrated to a 6-disk R5 LUN for 4k and 256k random I/Os at above 98% accuracy, as the general models are still being developed.
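A deliberately simplified utilization model in this spirit: each component has a calibrated maximum request rate, and its predicted utilization is the offered rate divided by that maximum. The 625 random 4k reads/s figure for a 6-disk R5 LUN comes from the hardware slide later in the talk; treating the LUN as the only bottleneck component is a simplifying assumption.

```python
LUN_MAX_RATE = 625.0  # calibrated random 4k reads/s per 6-disk R5 LUN

def lun_utilization(stream_rates):
    """Predicted utilization of one LUN, given the request rates of
    the streams whose stores are mapped onto it."""
    return sum(stream_rates) / LUN_MAX_RATE

def config_valid(lun_to_stream_rates):
    """A configuration is valid when no component exceeds 100%
    utilization (here, only LUNs are modeled)."""
    return all(lun_utilization(rates) <= 1.0
               for rates in lun_to_stream_rates.values())
```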
Hippodrome: Analysis, Models, Solver, Migration
• Two inputs to solver:
  • The workload (streams and stores)
  • Description of "valid" configurations (what devices to use, what RAID levels to use, etc.)
• Output of solver is a configuration:
  • Array descriptions (LUNs, disks, controllers, etc.)
  • The mapping of stores onto LUNs
• Solver uses models to predict whether a configuration is valid (i.e., no component is over 100% utilized)
• In our experiments, the solver is pinned to 6-disk R5 LUNs to match the models and to eliminate the need to migrate between RAID types.
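A first-fit sketch of what the solver does in these experiments: pack stores onto identical 6-disk R5 LUNs, opening a new LUN whenever the model predicts the existing ones would exceed 100% utilization. The real solver searches over array configurations and RAID levels; this version is pinned to one LUN type, and the headroom parameter mirrors the experiments' headroom setting.

```python
LUN_MAX_RATE = 625.0  # calibrated random 4k reads/s per 6-disk R5 LUN

def solve(store_rates, headroom=0.0):
    """Map stores (given each store's aggregate stream request rate)
    onto LUNs; return one [used_rate, store_indices] pair per LUN."""
    limit = LUN_MAX_RATE * (1.0 - headroom)
    luns = []  # each entry: [used_rate, [store indices]]
    for i, rate in enumerate(store_rates):
        for lun in luns:
            if lun[0] + rate <= limit:  # model says this LUN still fits
                lun[0] += rate
                lun[1].append(i)
                break
        else:
            luns.append([rate, [i]])    # open a new LUN
    return luns
```

With 100 stores at 20 requests/s each (2000 requests/s total, as in constant-1 variant 0), this packs onto 4 LUNs, matching the experimental result.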
Hippodrome: Analysis, Models, Solver, Migration
• Takes as input new "desired" configuration
• Migrates the system to the new configuration preserving the data and access to the data during the migration
• In our experiments, the synthetic application does not care about the data, and so we simply destroy the old configuration and create the new one to do a "migration"
Hippodrome: Experimental overview
• Each experiment is a series of iterations around the loop. Each iteration is called a "step"
• Each step will provide three values:
  • Deviation from target rate: "goodness" metric 1
  • Average I/O response time: "goodness" metric 2
  • Number of LUNs used
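Computing the three per-step values can be sketched as below, assuming each step yields per-stream (achieved, target) request-rate pairs and a list of per-request response times; these input shapes are assumptions about the measurement harness.

```python
def step_metrics(rates, response_times, num_luns):
    """Return (deviation from target rate, average response time,
    number of LUNs used) for one step."""
    # Metric 1: mean shortfall from the target rate (0 = on target).
    deviation = sum(target - achieved
                    for achieved, target in rates) / len(rates)
    # Metric 2: average I/O response time across all requests.
    avg_response = sum(response_times) / len(response_times)
    # Metric 3: LUN count comes straight from the configuration.
    return deviation, avg_response, num_luns
```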
Experiment Grouping
• Multiple variants of each "application":
• constant-1: streams always on, I/O rate constant
• constant-2: stream groups anti-correlated, I/O rate constant when active
• scaling-1: one store running as fast as possible
• scaling-2: like constant-1, but streams are enabled in different steps; once enabled, a stream will stay on
• scaling-3: like constant-1, but stream I/O rate increases as step number increases
• All experiments show global adaptation possible
Hippodrome: Experiments Demonstrate Lessons
• "Goodness" dependent on accuracy of models
  • constant-1, constant-2; we show how to "break" the loop
• Rate of adaptation dependent on "over-commit" available in the system
  • constant-1, constant-2; we show how fast the system converges
• A gradually increasing workload can always be "good" if enough headroom exists
  • scaling-2, scaling-3; we show that the application always runs at its target rate
Hippodrome: Experimental Hardware/Software
• Array for experiments is an HP FC-60
  • 2 controllers, 6 trays
  • 1 Ultra SCSI bus/tray (40 MB/s)
  • 4 Seagate 18 GB, 10k RPM disks used/tray = 24 total
  • 4 six-disk R5 LUNs at 16k stripe size
  • 1 LUN can do ~625 random 4k reads/second
• Host for experiments is an HP N-Class
  • 1 440 MHz CPU, 1 GB memory, HP-UX 11.00
  • 2 100 MB/s Fibre Channel cards used
• Locally developed synthetic application (Buttress)
• Host and array connected through a Brocade switch
Hippodrome: Common Experiment Parameters
• Will vary # stores, # streams, target request rate
• Some parameters usually the same:
  • Phasing: all streams on at the same time
  • Store capacity: 256 MB
  • Max. # I/Os outstanding/stream: 4
  • Headroom: 0%
• Some parameters constant for all experiments:
  • Request type: 4k read
  • Request offset: uniformly random across the store, aligned to a 1k boundary
  • Run count: 1 (no sequentiality in requests)
  • Arrival process: open, Poisson
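One stream's requests under these constant parameters could be generated as sketched below: 4k reads, offsets uniform across the store and aligned to 1k, run count 1, open Poisson arrivals. The function and parameter names are illustrative, not Buttress's actual API.

```python
import random

def generate_requests(target_rate, store_size_kb, duration_s, seed=0):
    """Return (arrival_time_s, offset_kb, size_kb) tuples for an open
    Poisson arrival process at target_rate requests/second."""
    rng = random.Random(seed)
    t = 0.0
    requests = []
    while True:
        # Exponential inter-arrival times yield a Poisson process; the
        # process is "open": arrivals do not wait for completions.
        t += rng.expovariate(target_rate)
        if t >= duration_s:
            break
        # Uniformly random offset on a 1k boundary, chosen so the
        # whole 4k request stays inside the store.
        offset_kb = rng.randrange(0, store_size_kb - 4 + 1)
        requests.append((t, offset_kb, 4))
    return requests
```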
Hippodrome: Constant-1 experiments
• Important result is shape of the graphs:
• Deviation from target rate converges to 0
• Response time gets (much) better
• # LUNs used (in the end) matches the required request rate
• Comments:
• Variants 0-3 have a total request rate of 2000 = 4 LUNs
• Variants 4-6 experiment with filling a LUN to start
• Variants 5,6 differ only in the headroom

Parameter            Variant 0  Variant 1  Variant 2  Variant 3  Variant 4  Variant 5  Variant 6
# Stores             100        200        50         50         84         21         21
Target Request Rate  20         10         40         40         20         80         80
Max Outstanding      4          4          4          16         4          4          4
Store Size           256MB      256MB      256MB      256MB      1GB        4GB        4GB
Headroom             0%         0%         0%         0%         0%         10%        0%
Constant-1: Deviation from Target Rate
• Variants 0-5 converge to 95% CI of 0
• Variant 4 converged even though the LUN was full at the start
• Variant 5 converged because of the 10% headroom
• Variant 6 never converges; models predict the LUN is only 95% utilized
Constant-1: Response Time
• Response times get an order of magnitude better
• Variant 6 stays at the bad (0.15 second) average response time
Constant-1: Number of LUNs
• Lines offset slightly so different variants can be seen
• Goes up by 1 LUN each step; can't over-commit a device to 200%
• Variants 4,5 have a total request rate < 3*625, so only use 3 LUNs
• Variant 6 stays at 1 LUN, as would be predicted by the other results
Hippodrome: Constant workload review
• Given a constant workload, the loop converges to the "correct" system in most cases
• "Goodness" dependent on accuracy of models
• We "break" the loop either through not enough headroom or bad models
• Rate of adaptation dependent on "over-commit" available in the system
• In general, it increases by 1 LUN per iteration
• With a workload with idle time, it converges faster
• Now look at workloads that change
Hippodrome: Scaling-2 experiments
• Scaling-2 intended to simulate adding in additional weeks in a data warehouse, additional file systems, etc.
• We turn on streams as the step number increases
• Store capacity 64 MB, max. outstanding 4
• Comments:
  • Always "correct"!; rate of increase is small enough
  • Response time shows points where we added work
  • LUNs increase as necessary

Parameter            Variant 0            Variant 1      Variant 2        Variant 3
# Stores             60                   120            90               60
Target Request Rate  36                   18             24               36
Stream Enablement    10*(1+step/2)        10*(1+step)    10 for 2 steps   10 for 2 steps
Pattern              10 every other step  10 every step  same for 1 step  same for 2 steps
Scaling-2: Deviation from Target Rate
• Error bars are the same size as before; scale is much smaller
• Amazingly, always within 95% confidence interval of correct
• Slightly above 0 deviation because of measurement methodology
Scaling-2: Response Time
• Scale is much smaller than for constant workloads (max. of 0.055 s vs. 1s)
• Now we can see when we add work and when we remain constant
• Height of the peaks shows how close to 100% the previous step was
• Slight trend upward; more total I/Os and more capacity actively used
Scaling-2: Response Time – Variant 0 only
• Now we can see when we add work and when we remain constant
• Height of the peaks shows how close to 100% the previous step was
Scaling-2: Number of LUNs
• Gradual increase in # LUNs
• Exact switch point dependent on specific increase pattern
• Changes close together as increase patterns are similar
Hippodrome: Scaling workload review
• Handled order of magnitude increase in workload without having serious slowdowns
• Number of LUNs up by a factor of 4
• Could see points of additional work in response time jumping and then settling
• Question: what other scaling up patterns are useful?
• One other group planned is different streams scaling at different rates
Hippodrome: Future Work
• Shifting workloads (transaction processing in the day, decision support at night)
• Cyclic workloads (system is told about the different shift positions)
• More complete models, migration of actual data
• More complex synthetic workloads
• Simple "application" (TPC-B?)
• Complex application (Retail Data Warehouse)
• Support for global bounds on system size/cost
Hippodrome: Four Parts Needed for Adaptation
  • Analysis: Generates summary of a workload
  • Models: Predicts the performance of a configuration
  • Solver: Finds new "optimal" configuration
  • Migration: Moves current configuration to new one
• Solver and Models both part of "Design New Configuration" step
Hippodrome: Lessons
• Global system adaptation possible by use of four parts of the loop:
  • Solver: Finds new "optimal" configuration
  • Models: Predicts the performance of a configuration
  • Analysis: Generates summary of a workload
  • Migration: Moves current configuration to new one
• "Goodness" dependent on accuracy of models
• Rate of adaptation dependent on "over-commit" available in the system
• A gradually increasing workload can always be "good" if enough headroom exists
Hippodrome: Automatic Global Storage Adaptation
• Questions?
• Joint work with: Eric Anderson, Mustafa Uysal, Michael Hobbs, Guillermo Alvarez, Mahesh Kallahalla, Kim Keeton, Arif Merchant, Erik Riedel, Susan Spence, Ram Swaminathan, Simon Towers, Alistair Veitch, John Wilkes; HP Labs Storage Systems Program
Hippodrome: Constant-2 experiments
• Phasing is a very important workload property
• Divide streams into groups (1..n), group start times offset, then constant on/off pattern
• Max. outstanding/stream: 32; Target rate: 40
• Comments:
  • Variant 1 shows faster adaptation because of idle time
  • Variants 4/5 show what happens when the analysis step is wrong
  • Scaling # groups proved uninteresting

Parameter               Variant 0  Variant 1  Variant 2  Variant 3  Variant 4  Variant 5
# Stores                200        200        100        100        100        100
On/Off Time             1.0/1.0    0.75/3.25  2.0/2.0    1.0/1.0    0.5/0.5    0.5/0.5
Number of Groups        4          4          2          2          2          2
Start Delay Multiplier  1          1          2          1          0          0.5
Constant-2: Deviation from Target Rate
• All except variant 4 converge; 4 appears the same as 5 in the analysis, but the two groups overlap in 4 and are anti-correlated in 5
• Variant 1 converges faster than the others; this is because the idle time between groups running allows the system to drain requests
Constant-2: Response Time
• Similar results to previous slide:
• Variant 4 does not get to a good response time
• Variant 1 converges faster than others.
Constant-2: Number of LUNs
• Now we see why variant 1 converges faster: it reaches 4 LUNs in only two steps rather than three, because of the idle time
• Otherwise, behaviour is the same as for constant-1, which is to be expected since, in the aggregate, constant-2 is the same as constant-1
Hippodrome: Scaling-1 experiments
• Scaling-1 intended to simulate something like a disk copy that will run as fast as the disks will go (it's disk bound, not cpu bound)
• Worked for 3 iterations of the loop (it even striped the store across multiple LUNs), then wanted 5 LUNs, which are not available
• Future work: handling a global bound on the size of the storage system (for example, you can't spend more than $100,000)
Hippodrome: Scaling-3 experiments
• Scaling-3 intended to simulate adding work over constant data set (e.g. more queries to DB)
• We increase target request rate as step increases
• Store capacity 64 MB, max. outstanding 4, max. RR 36
• Comments:
• Always "correct!"; rate of increase is small enough
• Response time shows points where we added work
• LUNs increases as necessary
• Initial deviations garbage due to low request rate

Parameter            Variant 0     Variant 1
# Stores             60            60
Target Request Rate  3*(step+1)    6*(1+step/2)
Scaling-3: Deviation from target rate
• Ignore the graph before about step 4: request rates are too low, so the analysis sees bursts and calculates rates over them
• Always supports target request rate
Scaling-3: Response Time
• Variant 1 shows up-down pattern of changing then stabilizing workload
• Always doing pretty well; big drops for variant 0 as the LUN count increased
Scaling-3: Number of LUNs
• Increases gradually, exact switch over dependent on variant specifics