A Distributed Storage System Allowing Application Users to Reserve I/O Performance in Advance for Achieving SLA

Yusuke Tanimura, Hidetaka Koie, Tomohiro Kudoh,
Isao Kojima, and Yoshio Tanaka
National Institute of Advanced Industrial Science and Technology (AIST), Japan
The 11th ACM/IEEE International Conference on Grid Computing
October 26-28, 2010, Brussels, Belgium
2
On Grids/Clouds
• Importance of Service Level Agreement (SLA)
  – A contract between users and the service providers
  – End-to-end performance, reliability, etc.
• I/O performance of the storage system tends to be a critical bottleneck.
  – Network bandwidth can be guaranteed by recent technologies such as the lambda path.
[Figure: a service reaches storage over best-effort or guaranteed network bandwidth — but is the storage performance itself guaranteed?]
3
On-going Studies
• QoS of parallel I/O in a distributed storage
  – Focused on scheduling and resource allocation
[Figure: application requirements are analyzed by the I/O library, automatically translated into performance reservations by a storage/network broker, and enforced by I/O control technologies across the storage client, local I/O scheduler, and storage servers.]
However, resources are assigned to each application on a first-come (open request), first-served basis.
4
Our Approach
• The storage system allows application users to reserve I/O performance in advance.
  – Explicit throughput (MB/sec) reservation
• With advance reservation, there is room to negotiate the contract.
  – e.g., a financial charge for a request
• Features of our design and implementation:
  – A distributed storage which supports:
    • Advance performance reservation
      – User interfaces, protocols, resource allocation, etc.
    • Striping I/O with QoS according to the reservation
      – Integration of I/O control techniques
  – Cooperation with network bandwidth reservation and computing resource reservation
5
Assumptions and Definitions (1)
• Assumed I/O workload (our current focus)
  – Streaming access to a large amount of data
  – No mixture of read & write in a single access
    • Open for read-only, create, or append-only
• Space reservation
  – Cooperates with write performance reservation
  – Reserved space = “Bucket”: a user’s private space, defined by
    – Name
    – Start and end time
    – Space size
    – Guaranteed throughput (read / write)
  – Stored data = “Object”
[Figure: over time, each object’s lifetime is contained within the bucket lifetime; object creation is allowed only while the bucket exists.]
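The bucket attributes listed above can be captured in a small struct. This is an illustrative sketch only: the field names, types, and the lifetime check are our own, not Papio's actual data model.

```cpp
#include <cassert>
#include <string>

// Illustrative sketch of the "bucket" (reserved space) described above.
// Field names are hypothetical; Papio's real interface may differ.
struct Bucket {
    std::string name;     // user-visible bucket name
    long start_time;      // reservation start (epoch seconds)
    long end_time;        // reservation end (epoch seconds)
    long space_size_mb;   // reserved space size
    double read_mbps;     // guaranteed read throughput
    double write_mbps;    // guaranteed write throughput
};

// Mirrors the slide's rule: object creation is allowed only while
// the bucket reservation is active.
inline bool can_create_object(const Bucket& b, long now) {
    return now >= b.start_time && now < b.end_time;
}
```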
6
Assumptions and Definitions (2)
• Performance reservation
  – Metrics shown to users:
    • Throughput (MB/sec)
    • Start and end time
    • Access type: read for an object; write for a bucket or an object
  – Condition:
    • The space reservation or object creation must precede the performance reservation; cancellation proceeds in the reverse order.
• Combined reservation
  – Supports 1 space & N performance reservations at once
  – Storage resources are co-allocated so that either all reservations are accepted or the whole request is rejected.
[Figure: during the object lifetime, read and write (append) reservations are allowed; during the rest of the bucket lifetime, only write reservations are allowed.]
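The combined reservation's all-or-nothing semantics might be sketched as follows. The helper names are hypothetical; the real co-allocation logic lives inside the MGS.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the combined reservation's all-or-nothing
// semantics described above. grants[i] records whether sub-reservation i
// (the space reservation or one of the N performance reservations) can be
// granted for the requested window.

// The whole request is accepted only if every sub-reservation is granted.
bool co_allocate(const std::vector<bool>& grants) {
    for (bool g : grants)
        if (!g) return false;  // one failure rejects the entire request
    return true;
}

// On rejection, resources tentatively held before the failing
// sub-reservation must be released again; this counts them.
int rolled_back(const std::vector<bool>& grants) {
    int held = 0;
    for (bool g : grants) {
        if (!g) return held;   // release the `held` earlier grants
        ++held;
    }
    return 0;                  // all granted: nothing to roll back
}
```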
7
Overview Architecture
[Figure: our proposed distributed storage system. Applications on a client node issue reserve requests through a Web Services-based reservation client or through commands; a Global Resource Coordinator co-allocates resources with the Network Resource Manager and the Storage Resource Manager over a Web Services-based protocol. The Management Server (MGS) performs reservation management and metadata management for buckets and objects, and allocates resources and administers I/O controls according to the reservation. Storage servers (SS) host OSDs with disk I/O rate control; network flow control is applied between the client API library and the storage servers.]
8
Overview Architecture
[Architecture overview diagram repeated from slide 7.]
9
Reservation Interface
• Command-line interface
• Web Services interface (SRM interface)
  – A wrapper around the command-line interface
  – Based on the GNS-WSI3 protocol
    • Polling-based asynchronous operation and two-phase commit
      – reserve/modify/release request ... (polling) ... commit/abort ... (polling) ...
  – We newly defined “ReservationResources_Type” for storage resources.

ReservationResources_Type
  ReservationID:        string                  [0..1]
  ReservationStatus:    ReservationStatus_Type  [0..1]
  TimeSpecification:    TimeSpecification_Type  [1..1]
  ResourceAttribute:    ResourceAttribute_Type  [0..*]

StorageResources_Type
  ServicePoint:         ServicePoint_Type       [1..1]
  Space:                Space_Type              [0..1]
  Access:               Access_Type             [0..1]

Space_Type
  SpaceName:            string                  [1..1]
  SpaceSize:            GeneralSpaceSize_Type   [1..1]
  GuaranteedReadThput:  GeneralThput_Type       [0..1]
  GuaranteedWriteThput: GeneralThput_Type       [0..1]

Access_Type
  Client:               Client_Type             [0..1]
  FileName:             string                  [1..1]
  SpaceName:            string                  [0..1]
  Mode:                 Mode_Type               [1..1]
  GuaranteedThput:      GeneralThput_Type       [1..1]
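The polling-based two-phase commit cycle described above can be simulated in a few lines. Everything here (state names, method names, the delay model) is our own illustration, not the actual GridARS/SRM API.

```cpp
#include <cassert>

// Simulated sketch of the GNS-WSI3-style polling-based two-phase
// reservation cycle: reserve request ... (polling) ... commit.
enum class State { Requested, Prepared, Committed, Aborted };

class ReservationSession {
    State state_ = State::Requested;
    int polls_until_ready_;  // models server-side processing delay
public:
    explicit ReservationSession(int delay) : polls_until_ready_(delay) {}

    // First phase: the client polls until the request is "Prepared".
    State poll() {
        if (state_ == State::Requested && --polls_until_ready_ <= 0)
            state_ = State::Prepared;
        return state_;
    }
    // Second phase: commit is only valid once the request is prepared.
    bool commit() {
        if (state_ != State::Prepared) return false;
        state_ = State::Committed;
        return true;
    }
};

// Drive one reserve ... (poll) ... commit cycle; true on success.
inline bool run_cycle(int server_delay, int max_polls) {
    ReservationSession s(server_delay);
    for (int i = 0; i < max_polls; ++i)
        if (s.poll() == State::Prepared) return s.commit();
    return false;  // timed out polling; the caller would abort instead
}
```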
10
Overview Architecture
[Architecture overview diagram repeated from slide 7.]
11
Client API Library
• Features
  – Striping I/O over multiple storage servers
  – Uses a fixed I/O size against the storage servers
    • Converted from the application’s I/O size
  – Non-POSIX API
    • A reservation ID must be specified in a create or open request for an object.
      – The reservation ID is returned as a ticket when the performance reservation request is accepted.
      – The management server verifies the reservation by reservation ID and user ID.
• API functions: create_bucket(), delete_bucket(), create_object(), open_object(), read(), write(), close()
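The ticket check described above might look like the following mock. The function name mirrors the slide's API list, but the signatures and the verification logic are assumptions, not Papio's code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Mock of the verification step described above: the reservation ID
// (the "ticket" returned when the performance reservation is accepted)
// must be presented at object creation, and the MGS pairs it with the
// user ID. All names and checks here are illustrative.
struct Mgs {
    std::map<std::string, std::string> tickets;  // reservation ID -> owner

    bool verify(const std::string& rsv_id, const std::string& user) const {
        auto it = tickets.find(rsv_id);
        return it != tickets.end() && it->second == user;
    }
};

// create_object() is rejected unless the (reservation ID, user ID)
// pair is valid.
bool create_object(const Mgs& mgs, const std::string& rsv_id,
                   const std::string& user) {
    return mgs.verify(rsv_id, user);
}

// Demo MGS holding one accepted reservation, owned by "alice".
inline Mgs demo_mgs() {
    Mgs m;
    m.tickets["rsv-1"] = "alice";
    return m;
}
```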
12
Overview Architecture
[Architecture overview diagram repeated from slide 7.]
13
Resource Management
• Storage resources
  – Disk space & throughput of each OSD in a certain period of time
• Role of the MGS
  – Collects status information from all OSDs
    • Max. throughput (currently static)
    • Used/free space
      – Each OSD primarily manages its own disk space.
  – Allocates resources according to the reservation request
    • Records allocate/free information in internal tables
      – Access reservation info; space reservation info (cache)
[Figure: the Management Server (MGS) receives a reserve request from a client, consults its internal tables, and sends a space reservation request to the storage server (SS/OSD) before committing the allocation plan; the OSD replies.]
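The internal bookkeeping could resemble the sketch below, where a new allocation fits only if the throughput already promised on the OSD during the requested window leaves enough headroom. The data layout and names are our own; the prototype keeps these tables in SQLite.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch of the MGS bookkeeping described above: each OSD's
// throughput allocations are recorded per time window, and a new request
// fits only if the remaining throughput over its window suffices.
struct Alloc { long start, end; double mbps; };

struct OsdState {
    double max_mbps;             // statically measured peak throughput
    std::vector<Alloc> allocs;   // committed reservations
};

// Worst-case throughput already promised during [start, end).
double reserved_mbps(const OsdState& osd, long start, long end) {
    double worst = 0;
    for (const Alloc& a : osd.allocs) {
        if (a.end <= start || a.start >= end) continue;  // no overlap
        worst += a.mbps;  // pessimistic: treat all overlaps as concurrent
    }
    return worst;
}

bool can_allocate(const OsdState& osd, long start, long end, double mbps) {
    return reserved_mbps(osd, start, end) + mbps <= osd.max_mbps;
}

// Demo OSD: 100 MB/s peak, 60 MB/s reserved in [0,10), 50 in [20,30).
inline OsdState demo_osd() {
    return OsdState{100.0, {{0, 10, 60.0}, {20, 30, 50.0}}};
}
```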
14
Resource Allocation (1)
Input: a set of reservation requests (each request usually has a time window, space size, and performance)
1. Check availability of each OSD
   – Estimate available space in a time window
   – Estimate available performance in a time window (performance model)
2. Score each OSD and sort the list
   – Normalize and weight the availability (scoring model)
   – Allocation strategy: balancing space, balancing workload, or something else?
3. Allocate a set of OSDs to the request
   – Assign OSDs from the list according to the score (allocation model)
   – Check to ensure the assigned OSDs are not overused
   – Change the striping count and iterate this process
Output: a set of OSDs
15
Resource Allocation (2)
• The three models (performance, scoring, and allocation models) should be customizable by storage administrators.
• Our simple models in the prototype:
  – Read throughput is proportionally shared by multiple accesses.
    • E.g., a total of 200 MB/s shared by 2 processes -> each process can get 90 MB/s, assuming 10% overhead.
  – Write access is always exclusive of any other access.
  – Balancing the I/O workload comes first.
    • The OSD which can provide higher throughput is assigned first (a greedy strategy).
  – Free space is considered second.
  – Minimize the striping count, and limit the max. striping count.
    • The striping size is fixed as a system-wide parameter.
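Under these simple models, the allocation step can be sketched as a greedy pick over scored OSDs, together with the proportional read-sharing rule. All names and the scoring details are illustrative assumptions, not the prototype's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch of the greedy strategy described above: OSDs are sorted by
// available throughput (workload balancing first) with free space as the
// tiebreaker, then taken from the top of the list until the requested
// rate is covered or the striping-count cap is hit.
struct OsdScore { int id; double avail_mbps; double free_mb; };

std::vector<int> allocate(std::vector<OsdScore> osds,
                          double requested_mbps, int max_stripes) {
    std::sort(osds.begin(), osds.end(),
              [](const OsdScore& a, const OsdScore& b) {
                  if (a.avail_mbps != b.avail_mbps)
                      return a.avail_mbps > b.avail_mbps;  // throughput 1st
                  return a.free_mb > b.free_mb;            // free space 2nd
              });
    std::vector<int> chosen;
    double got = 0;
    for (const OsdScore& o : osds) {
        if (got >= requested_mbps || (int)chosen.size() >= max_stripes)
            break;
        chosen.push_back(o.id);
        got += o.avail_mbps;
    }
    // Reject the request if even max_stripes OSDs cannot cover it.
    return got >= requested_mbps ? chosen : std::vector<int>{};
}

// Proportional read sharing with a fixed overhead factor, as in the
// slide's example: 200 MB/s over 2 processes -> 90 MB/s each.
double shared_read_mbps(double total, int procs, double overhead = 0.10) {
    return total / procs * (1.0 - overhead);
}

// Demo OSD list: ids 1 and 3 tie on throughput; 3 has more free space.
inline std::vector<OsdScore> demo_osds() {
    return {{1, 80.0, 500.0}, {2, 50.0, 900.0}, {3, 80.0, 700.0}};
}
```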
16
Overview Architecture
[Architecture overview diagram repeated from slide 7.]
17
I/O Rate Control Framework
• The storage server controls the I/O rate according to the MGS’s instruction.
  – Disk I/O scheduling (under development)
  – Storage network between the client and storage servers
    • PSPacer is integrated into our prototype to configure the target network bandwidth on Ethernet.
• The instruction is delivered using the capability model.
[Figure: 1. the client sends an open request with the reservation ID to the Management Server (MGS); 2. the client receives a capability; 3. the client sends a connect request to the storage server (SS/OSD); 4. the SS, sharing a key with the MGS, verifies the capability; 5. the SS enforces rate control on this connection.]
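The capability flow might be sketched as below. The `toy_tag` keyed hash is purely illustrative; a real deployment would use a proper MAC such as HMAC, and the struct layout is our own assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy sketch of the capability model described above: the MGS issues the
// client a capability bound to its reservation, and the storage server,
// sharing a key with the MGS, verifies it before enforcing rate control.
struct Capability {
    std::string reservation_id;
    double rate_mbps;     // rate the SS must enforce on this connection
    std::uint64_t tag;    // keyed integrity tag over the fields above
};

// NOT real cryptography -- a trivial keyed hash for illustration only.
std::uint64_t toy_tag(const std::string& rsv, double rate,
                      std::uint64_t key) {
    std::uint64_t h = key;
    for (char c : rsv) h = h * 131 + static_cast<unsigned char>(c);
    h ^= static_cast<std::uint64_t>(rate * 1000);
    return h;
}

// Issued by the MGS (step 2 in the figure).
Capability issue(const std::string& rsv, double rate, std::uint64_t key) {
    return Capability{rsv, rate, toy_tag(rsv, rate, key)};
}

// Checked by the SS before applying rate control (steps 4-5).
bool verify(const Capability& cap, std::uint64_t key) {
    return cap.tag == toy_tag(cap.reservation_id, cap.rate_mbps, key);
}
```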
18
Prototype Implementation
• Papio: our distributed storage software
  – Implemented in C++ on Linux
  – Uses SQLite version 3 for the internal database of the MGS
  – Uses EBOFS (an extent and B+tree based object file system) as our OSD base
    • The allocation algorithm is extended to support space reservation.
  – Uses PSPacer for network bandwidth control
  – Supports the simple models for resource allocation
• SRM: our reservation agent for Papio, providing a Web Services interface (SRM interface)
  – Implemented in Java
  – Uses GridARS to support the GNS-WSI3 protocol
19
Evaluation
• Reservation cost
  – Comparison between command-line and SRM interfaces
  – Overheads of SRM and Papio
• Performance of reserved vs. non-reserved access
  – A single occupation strategy
  – A multiple occupation strategy
• Experiment environment
  – 6 machines (below) connected by a Dell PowerConnect 6248 switch

  CPU:     AMD Opteron Quad Core 2.3 GHz
  Memory:  8 GB
  Disk:    OCZ Apex v3 (SSD)
  OS:      CentOS 5 (kernel version 2.6.18)
  Network: 1 GbE (4 nodes) or 10 GbE (2 nodes)
20
Reservation Cost (1)
• We measured 4 experiment cases:
  a) SRM and the Web Services-based reservation client on Node-2, with a dummy MGS (via MGS cmds) on Node-1
  b) SRM and the Web Services-based reservation client on Node-2, with the real MGS (via MGS cmds) on Node-1
  c) MGS cmds and the MGS together on Node-1
  d) MGS cmds on Node-2 and the MGS on Node-1
21
Reservation Cost (2)
• In the results, the SRM interface was 3-4 times slower than the command-line interface because of its polling-based operation (100 msec interval).
• The cost is reasonably low and is unlikely to be a bottleneck.

Execution time [msec]:
  Operation              a)     b)     c)     d)
  Initialize           82.4            -      -
  Reserve  Total        477    613    149    153
           Request      102    105
           Confirm      197    322
           Commit       177    185
  Release  Total        386    511    137    141
           Request      107    108
           Confirm       98    221
           Commit       181    183
22
Reserved / non-reserved access (1)
• Measured Client-A’s read access:
  – Reserved: Papio applies a single occupation strategy in which each OSD serves only one access.
[Figure: four configurations of Client-A (1 stream, or striping over 3 storage servers) sharing SS/OSD nodes with Client-B. In the reserved cases, Client-A’s OSDs are occupied exclusively; in the non-reserved cases, no I/O control is applied and Client-A conflicts with Client-B’s read or write access.]
23
Reserved / non-reserved access (2)
• Non-reserved access was affected by Client-B’s access.
[Figure: Client-A’s read throughput [MB/sec] vs. total read size [MB], for the 1-stream case (55 MB/s) and the striping case (55 MB/s ×3). Legend: - Reserved, ■ Non-reserved: R-R, X Non-reserved: R-W.]
24
Reserved / non-reserved access (3)
• Measured Client-A’s read access:
  – Reserved: Papio applies a multiple occupation strategy in which each OSD serves more than one access.
[Figure: the same four configurations as before, now with I/O control by PSPacer applied and a 10% overhead (protocol etc.) estimation. In the reserved cases, Client-A gets 80 MB/s (1 stream) or 80 MB/s ×3 (striping) while Client-B gets 20 MB/s on the shared OSDs; in the non-reserved cases, Client-A conflicts with Client-B’s read or write access.]
25
Reserved / non-reserved access (4)
• Reserved access achieved the requested I/O throughput.
[Figure: Client-A’s read throughput [MB/sec] vs. total read size [MB], for the 1-stream case (80 MB/s) and the striping case (80 MB/s ×3). Legend: - Single occupation, ■ Non-reserved: R-R, ▲ Reserved: controlled.]
26
Potential Applications
• Constraints
  – Require an advance reservation
  – Read and append-only access for a large amount of data
• Potential applications (scheduled execution?)
  – Multimedia streaming (we gave a demo in August)
  – Moving large data between data centers
  – Server provisioning
[Figure: a VOD service example. Users make watch reservations with the VOD service provider; the SRM and NRM coordinate & reserve resources so that streaming servers read from Papio storage over an optical path network.]
27
Related Work
• SRM in OGF
  – SLA features: retention policy, access latency
• Automatic configuration to satisfy a given I/O workload
  – Hippodrome, MINERVA
    • Resource allocation based on performance prediction
• Many existing works on QoS
  – Disk I/O scheduling
  – Network QoS
  – Performance monitoring and feedback-based I/O control
We would like to apply some of these techniques to Papio to achieve a more fine-grained performance guarantee.
28
Conclusion and Future Work
• Proposed “an advance reservation feature by application users” for storage access.
  – A different model from on-demand allocation, where resources are allocated at the time of creating/opening files
  – Design
    • Defined performance metrics and storage resources
    • Four key components: reservation interface, client API, resource management framework, and I/O control framework
• Implemented Papio and SRM as a prototype and evaluated the basic performance and functions.
• Providing a more sophisticated user interface and a “guarantee” mechanism is future work.