Hippodrome: Automatic Global Storage Adaptation
Eric Anderson, Mustafa Uysal, Michael Hobbs, Guillermo Alvarez, Mahesh Kallahalla, Kim Keeton, Arif Merchant, Erik Riedel, Susan Spence, Ram Swaminathan, Simon Towers, Alistair Veitch, John Wilkes; HP Labs Storage Systems Program
[Loop diagram: Analyze Workload → Design New Configuration → Migrate to Configuration → Execute Application → (repeat)]
Hippodrome: Why?
• Computer systems very complex
• System administrators very expensive
• Let the computer handle it
• Optimize the system for the workload as it changes
• Determine when to add/remove hardware
• Two parts to this talk:
• Description of framework for managing a large I/O centric system
• Experimental results showing when it works and when it doesn’t.
Hippodrome: Lessons
• Global system adaptation possible by use of four parts of the loop:
  • Solver: Finds new "optimal" configuration
  • Models: Predicts the performance of a configuration
  • Analysis: Generates summary of a workload
  • Migration: Moves current configuration to new one
• "Goodness" dependent on accuracy of models
• Rate of adaptation dependent on "over-commit" available in the system
• A gradually increasing workload can always be "good" if enough headroom exists
Hippodrome: Our System
• Targeted at applications running on large storage systems
• Solver chooses appropriate configuration for array and mapping of application-level storage units onto the array
• Experiments use synthetic applications for ease of understanding "good" behaviour
• Applications run on an N-class server and access an HP FC-60 disk array via switched fibre channel
Hippodrome: Four Parts Needed for Adaptation
  • Analysis: Generates summary of a workload
  • Models: Predicts the performance of a configuration
  • Solver: Finds new "optimal" configuration
  • Migration: Moves current configuration to new one
• Solver and Models both part of "Design New Configuration" step
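The four parts above form one control loop. A minimal sketch of that loop follows; the four callables (run_application, analyze, solve, migrate) stand in for the real Hippodrome components and their signatures are assumptions, not the actual API.

```python
def adaptation_loop(run_application, analyze, solve, migrate,
                    initial_config, steps):
    """Iterate Execute -> Analyze -> Design -> Migrate for `steps`
    iterations, returning the sequence of configurations the system
    passed through."""
    config = initial_config
    history = [config]
    for _ in range(steps):
        trace = run_application(config)   # Execute Application
        workload = analyze(trace)         # Analyze Workload
        config = solve(workload, config)  # Design New Configuration
        migrate(config)                   # Migrate to Configuration
        history.append(config)
    return history
```

Each experiment later in the talk is exactly a fixed number of trips around this loop, with each trip called a "step".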
Hippodrome: Analysis, Models, Solver, Migration
• Trace the I/Os generated, then run them through analysis tools to create a "workload" file.
• Two parts generated from analysis:
• "stores:" a logically contiguous fixed-size block of storage. Usually implemented as a logical volume
• "streams:" an access pattern to a particular store. Currently defined as average request rate, average request size, run count, on/off time, overlap fraction
• In our experiments, some additional per-stream values are also calculated to ease understanding the behaviour of the system
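The two summary records the analysis step emits can be sketched as plain data types; the stream fields follow the attributes listed above, but the exact field names and units are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Store:
    """A logically contiguous, fixed-size block of storage, usually
    implemented as a logical volume."""
    name: str
    capacity_mb: int

@dataclass
class Stream:
    """An access pattern to a particular store."""
    store: Store
    request_rate: float      # average requests per second
    request_size_kb: float   # average request size
    run_count: float         # average sequential run length (1 = random)
    on_time_s: float         # seconds active per on/off cycle
    off_time_s: float        # seconds idle per on/off cycle
    overlap_fraction: float  # fraction of on-time shared with other streams
```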
Hippodrome: Analysis, Models, Solver, Migration
• Two inputs to models:
  • Device configuration: Logical Units (LUNs) with disk type, number of disks, RAID level, stripe size; array controller associated with each LUN
  • Workload configuration: list of stores on each LUN, and therefore the streams accessing that LUN and using the associated controller
• Output is utilization of each component (disk, controller, SCSI bus, etc.)
• In our experiments, models are calibrated to a 6-disk R5 LUN for 4k and 256k random I/Os at above 98% accuracy, as the general models are still being developed.
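A deliberately simplified utilization model in this spirit: each component has a calibrated maximum request rate, and its predicted utilization is the offered rate divided by that maximum. The 625 random 4k reads/s figure for a 6-disk R5 LUN comes from the hardware slide later in the talk; treating the LUN as the only bottleneck component is a simplifying assumption.

```python
LUN_MAX_RATE = 625.0  # calibrated random 4k reads/s per 6-disk R5 LUN

def lun_utilization(stream_rates):
    """Predicted utilization of one LUN, given the request rates of
    the streams whose stores are mapped onto it."""
    return sum(stream_rates) / LUN_MAX_RATE

def config_valid(lun_to_stream_rates):
    """A configuration is valid when no component exceeds 100%
    utilization (here, only LUNs are modeled)."""
    return all(lun_utilization(rates) <= 1.0
               for rates in lun_to_stream_rates.values())
```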
Hippodrome: Analysis, Models, Solver, Migration
• Two inputs to solver:
  • The workload (streams and stores)
  • Description of "valid" configurations (what devices to use, what RAID levels to use, etc.)
• Output of solver is a configuration:
  • Array descriptions (LUNs, disks, controllers, etc.)
  • The mapping of stores onto LUNs
• Solver uses models to predict whether a configuration is valid (i.e., no component is over 100% utilized)
• In our experiments, the solver is pinned to 6-disk R5 LUNs to match the models and to eliminate the need to migrate between RAID types.
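A first-fit sketch of what the solver does in these experiments: pack stores onto identical 6-disk R5 LUNs, opening a new LUN whenever the model predicts the existing ones would exceed 100% utilization. The real solver searches over array configurations and RAID levels; this version is pinned to one LUN type, and the headroom parameter mirrors the experiments' headroom setting.

```python
LUN_MAX_RATE = 625.0  # calibrated random 4k reads/s per 6-disk R5 LUN

def solve(store_rates, headroom=0.0):
    """Map stores (given each store's aggregate stream request rate)
    onto LUNs; return one [used_rate, store_indices] pair per LUN."""
    limit = LUN_MAX_RATE * (1.0 - headroom)
    luns = []  # each entry: [used_rate, [store indices]]
    for i, rate in enumerate(store_rates):
        for lun in luns:
            if lun[0] + rate <= limit:  # model says this LUN still fits
                lun[0] += rate
                lun[1].append(i)
                break
        else:
            luns.append([rate, [i]])    # open a new LUN
    return luns
```

With 100 stores at 20 requests/s each (2000 requests/s total, as in constant-1 variant 0), this packs onto 4 LUNs, matching the experimental result.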
Hippodrome: Analysis, Models, Solver, Migration
• Takes as input new "desired" configuration
• Migrates the system to the new configuration preserving the data and access to the data during the migration
• In our experiments, the synthetic application does not care about the data, and so we simply destroy the old configuration and create the new one to do a "migration"
Hippodrome: Experimental overview
• Each experiment is a series of iterations around the loop. Each iteration is called a "step"
• Each step will provide three values:
  • Deviation from target rate: "goodness" metric 1
  • Average I/O response time: "goodness" metric 2
  • Number of LUNs used
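Computing the three per-step values can be sketched as below, assuming each step yields per-stream (achieved, target) request-rate pairs and a list of per-request response times; these input shapes are assumptions about the measurement harness.

```python
def step_metrics(rates, response_times, num_luns):
    """Return (deviation from target rate, average response time,
    number of LUNs used) for one step."""
    # Metric 1: mean shortfall from the target rate (0 = on target).
    deviation = sum(target - achieved
                    for achieved, target in rates) / len(rates)
    # Metric 2: average I/O response time across all requests.
    avg_response = sum(response_times) / len(response_times)
    # Metric 3: LUN count comes straight from the configuration.
    return deviation, avg_response, num_luns
```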
Experiment Grouping
• Multiple variants of each "application":
• constant-1: streams always on, I/O rate constant
• constant-2: stream groups anti-correlated, I/O rate constant when active
• scaling-1: one store running as fast as possible
• scaling-2: like constant-1, but streams are enabled in different steps; once enabled, a stream will stay on
• scaling-3: like constant-1, but stream I/O rate increases as step number increases
• All experiments show global adaptation possible
Hippodrome: Experiments Demonstrate Lessons
• "Goodness" dependent on accuracy of models
  • constant-1, constant-2; we show how to "break" the loop
• Rate of adaptation dependent on "over-commit" available in the system
  • constant-1, constant-2; we show how fast the system converges
• A gradually increasing workload can always be "good" if enough headroom exists
  • scaling-2, scaling-3; we show that the application always runs at its target rate
Hippodrome: Experimental Hardware/Software
• Array for experiments is an HP FC-60
  • 2 controllers, 6 trays
  • 1 Ultra SCSI bus/tray (40 MB/s)
  • 4 Seagate 18 GB, 10k RPM disks used/tray = 24 total
  • 4 six-disk R5 LUNs at 16k stripe size
  • 1 LUN can do ~625 random 4k reads/second
• Host for experiments is an HP N-Class
  • 1 440 MHz CPU, 1 GB memory, HP-UX 11.00
  • 2 100 MB/s Fibre Channel cards used
• Locally developed synthetic application (Buttress)
• Host and array connected through a Brocade switch
Hippodrome: Common Experiment Parameters
• Will vary # stores, # streams, target request rate
• Some parameters usually the same:
  • Phasing: all streams on at the same time
  • Store capacity: 256 MB
  • Max. # I/Os outstanding/stream: 4
  • Headroom: 0%
• Some parameters constant for all experiments:
  • Request type: 4k read
  • Request offset: uniformly random across the store, aligned to a 1k boundary
  • Run count: 1 (no sequentiality in requests)
  • Arrival process: open, Poisson
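One stream's requests under these constant parameters could be generated as sketched below: 4k reads, offsets uniform across the store and aligned to 1k, run count 1, open Poisson arrivals. The function and parameter names are illustrative, not Buttress's actual API.

```python
import random

def generate_requests(target_rate, store_size_kb, duration_s, seed=0):
    """Return (arrival_time_s, offset_kb, size_kb) tuples for an open
    Poisson arrival process at target_rate requests/second."""
    rng = random.Random(seed)
    t = 0.0
    requests = []
    while True:
        # Exponential inter-arrival times yield a Poisson process; the
        # process is "open": arrivals do not wait for completions.
        t += rng.expovariate(target_rate)
        if t >= duration_s:
            break
        # Uniformly random offset on a 1k boundary, chosen so the
        # whole 4k request stays inside the store.
        offset_kb = rng.randrange(0, store_size_kb - 4 + 1)
        requests.append((t, offset_kb, 4))
    return requests
```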
Hippodrome: Constant-1 experiments
• Important result is shape of the graphs:
• Deviation from target rate converges to 0
• Response time gets (much) better
• # LUNs used (in the end) matches the required request rate
• Comments:
• Variants 0-3 have a total request rate of 2000 = 4 LUNs
• Variants 4-6 experiment with filling a LUN to start
• Variants 5,6 differ only in the headroom

Parameter            Variant 0  Variant 1  Variant 2  Variant 3  Variant 4  Variant 5  Variant 6
# Stores             100        200        50         50         84         21         21
Target Request Rate  20         10         40         40         20         80         80
Max Outstanding      4          4          4          16         4          4          4
Store Size           256MB      256MB      256MB      256MB      1GB        4GB        4GB
Headroom             0%         0%         0%         0%         0%         10%        0%
Constant-1: Deviation from Target Rate
• Variants 0-5 converge to 95% CI of 0
• Variant 4 converged even though the LUN was full at the start
• Variant 5 converged because of the 10% headroom
• Variant 6 never converges; models predict the LUN is only 95% utilized
Constant-1: Response Time
• Response times get an order of magnitude better
• Variant 6 stays at the bad (0.15 second) average response time
Constant-1: Number of LUNs
• Lines offset slightly so different variants can be seen
• Goes up by 1 LUN each step; can't over-commit a device to 200%
• Variants 4,5 have a total request rate < 3*625, so only use 3 LUNs
• Variant 6 stays at 1 LUN, as would be predicted by the other results
Hippodrome: Constant workload review
• Given a constant workload, the loop converges to the "correct" system in most cases
• "Goodness" dependent on accuracy of models
• We "break" the loop either through not enough headroom or bad models
• Rate of adaptation dependent on "over-commit" available in the system
• In general, it increases by 1 LUN per iteration
• With a workload with idle time, it converges faster
• Now look at workloads that change
Hippodrome: Scaling-2 experiments
• Scaling-2 intended to simulate adding in additional weeks in a data warehouse, additional file systems, etc.
• We turn on streams as the step number increases
• Store capacity 64 MB, max. outstanding 4
• Comments:
  • Always "correct"!; rate of increase is small enough
  • Response time shows points where we added work
  • LUNs increase as necessary

Parameter            Variant 0            Variant 1      Variant 2        Variant 3
# Stores             60                   120            90               60
Target Request Rate  36                   18             24               36
Stream Enablement    10*(1+step/2)        10*(1+step)    10 for 2 steps   10 for 2 steps
Pattern              10 every other step  10 every step  same for 1 step  same for 2 steps
Scaling-2: Deviation from Target Rate
• Error bars are the same size as before; scale is much smaller
• Amazingly, always within 95% confidence interval of correct
• Slightly above 0 deviation because of measurement methodology
Scaling-2: Response Time
• Scale is much smaller than for constant workloads (max. of 0.055 s vs. 1s)
• Now we can see when we add work and when we remain constant
• Height of the peaks shows how close to 100% the previous step was
• Slight trend upward; more total I/Os and more capacity actively used
Scaling-2: Response Time – Variant 0 only
• Now we can see when we add work and when we remain constant
• Height of the peaks shows how close to 100% the previous step was
Scaling-2: Number of LUNs
• Gradual increase in # LUNs
• Exact switch point dependent on specific increase pattern
• Changes close together as increase patterns are similar
Hippodrome: Scaling workload review
• Handled order of magnitude increase in workload without having serious slowdowns
• Number of LUNs up by a factor of 4
• Could see points of additional work in response time jumping and then settling
• Question: what other scaling up patterns are useful?
• One other group planned is different streams scaling at different rates
Hippodrome: Future Work
• Shifting workloads (transaction processing in the day, decision support at night)
• Cyclic workloads (system is told about the different shift positions)
• More complete models, migration of actual data
• More complex synthetic workloads
• Simple "application" (TPC-B?)
• Complex application (Retail Data Warehouse)
• Support for global bounds on system size/cost
Hippodrome: Four Parts Needed for Adaptation
  • Analysis: Generates summary of a workload
  • Models: Predicts the performance of a configuration
  • Solver: Finds new "optimal" configuration
  • Migration: Moves current configuration to new one
• Solver and Models both part of "Design New Configuration" step
Hippodrome: Lessons
• Global system adaptation possible by use of four parts of the loop:
  • Solver: Finds new "optimal" configuration
  • Models: Predicts the performance of a configuration
  • Analysis: Generates summary of a workload
  • Migration: Moves current configuration to new one
• "Goodness" dependent on accuracy of models
• Rate of adaptation dependent on "over-commit" available in the system
• A gradually increasing workload can always be "good" if enough headroom exists
Hippodrome: Automatic Global Storage Adaptation
• Questions?
• Joint work with: Eric Anderson, Mustafa Uysal, Michael Hobbs, Guillermo Alvarez, Mahesh Kallahalla, Kim Keeton, Arif Merchant, Erik Riedel, Susan Spence, Ram Swaminathan, Simon Towers, Alistair Veitch, John Wilkes; HP Labs Storage Systems Program
Hippodrome: Constant-2 experiments
• Phasing is a very important workload property
• Divide streams into groups (1..n), group start times offset, then constant on/off pattern
• Max. outstanding/stream: 32; Target rate: 40
• Comments:
  • Variant 1 shows faster adaptation because of idle time
  • Variants 4/5 show what happens when the analysis step is wrong
  • Scaling # groups proved uninteresting

Parameter               Variant 0  Variant 1  Variant 2  Variant 3  Variant 4  Variant 5
# Stores                200        200        100        100        100        100
On/Off Time             1.0/1.0    0.75/3.25  2.0/2.0    1.0/1.0    0.5/0.5    0.5/0.5
Number of Groups        4          4          2          2          2          2
Start Delay Multiplier  1          1          2          1          0          0.5
Constant-2: Deviation from Target Rate
• All except variant 4 converge; 4 appears the same as 5 in the analysis, but the two groups overlap in 4 and are anti-correlated in 5
• Variant 1 converges faster than the others; this is because the idle time between groups running allows the system to drain requests
Constant-2: Response Time
• Similar results to previous slide:
• Variant 4 does not get to a good response time
• Variant 1 converges faster than others.
Constant-2: Number of LUNs
• Now we see why variant 1 converges faster: it reaches 4 LUNs in only two steps rather than three, because of the idle time
• Otherwise, behaviour is the same as for constant-1, which is to be expected since, in the aggregate, constant-2 is the same as constant-1
Hippodrome: Scaling-1 experiments
• Scaling-1 intended to simulate something like a disk copy that will run as fast as the disks will go (it's disk bound, not cpu bound)
• Worked for 3 iterations of the loop (it even striped the store across multiple LUNs), then wanted 5 LUNs, which are not available
• Future work: handling a global bound on the size of the storage system (for example, you can't spend more than $100,000)
Hippodrome: Scaling-3 experiments
• Scaling-3 intended to simulate adding work over constant data set (e.g. more queries to DB)
• We increase target request rate as step increases
• Store capacity 64 MB, max. outstanding 4, max. RR 36
• Comments:
• Always "correct!"; rate of increase is small enough
• Response time shows points where we added work
• LUNs increases as necessary
• Initial deviations garbage due to low request rate

Parameter            Variant 0     Variant 1
# Stores             60            60
Target Request Rate  3*(step+1)    6*(1+step/2)
Scaling-3: Deviation from target rate
• Ignore the graph before about step 4: request rates are too low, so the analysis sees bursts and calculates rates over them
• Always supports target request rate
Scaling-3: Response Time
• Variant 1 shows up-down pattern of changing then stabilizing workload
• Always doing pretty well; big drops for variant 0 as the LUN count increased
Scaling-3: Number of LUNs
• Increases gradually, exact switch over dependent on variant specifics