TRANSCRIPT
PERSISTENT I/O CHALLENGES & APPROACHES
17-June-2011, TERENA TF on Storage
Angelos Bilas, [email protected]
Outline
• Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
• Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)
• Remarks
Application Stacks
• STREAM
• CumuloNimbo
Stream Global Architecture Picture
[Architecture diagram: streaming applications (credit-card fraud detection, telephony fraud detection with fraud profiles, SLA compliance and SLA violation detection, COIs, aggregation and monitoring queries) running over StreamCloud (parallel stream and DB operators) and StreamMine (MapReduce, state-machine, and dynamic-graph operators), both with fault tolerance and self-provisioning, on top of a communication & storage layer (compressed SSD queue, mem-to-mem communication, persistent streaming, silent error detection).]
CumuloNimbo Global Architecture
[Architecture diagram: a layered stack consisting of a JEE application server (JBoss+Hibernate), an object cache (CumuloCache), a query engine (Derby), a column-oriented data store & block cache (HBase), a distributed file system (HDFS), and storage/communication layers; cross-cutting transaction management (transactions, concurrency controllers, commit sequencers, loggers) and elasticity management (self-provisioner, monitors, load balancers).]
Application Stacks
• They tend to be complex
• Each layer adds substantial protocol "machinery"
  • E.g. transactions, global name space
• Today, I/O is a significant bottleneck
• Hard to know what all the layers do
• Questionable what can be modified realistically
• How can modern storage systems best support these?
Outline
• Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
• Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)
• Remarks
Dimension Infrastructure Properly
[Infrastructure diagram: 1000s of application servers (different flavors of PCs/blades; multicores + memory + I/O throughput), connected over a high-speed interconnect (10-40 Gbit/s to application servers, 10-100 Gbit/s to storage) to 100s of file servers with disk controllers (~2 GB/s) and 12-36 SATA disks per node (100 MB/s, ~2 TBytes each, +10% SSD cache).]
• Dimensioning issues are not straightforward today (a back-of-the-envelope example follows below)
  - I/O application overheads not understood
  - Do you balance thin or fat?
  - Other factors besides performance, e.g. power
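A rough illustration of the dimensioning question, using only the figures on this slide (a back-of-the-envelope sketch, not measured data): a "fat" storage node with 36 SATA disks at 100 MB/s each can stream roughly 3.6 GB/s from its disks, which already exceeds a ~2 GB/s disk controller and would need on the order of 30 Gbit/s of network bandwidth to export, so either the controller or the interconnect becomes the bottleneck. A "thin" node with 12 disks (~1.2 GB/s) stays within both budgets, but needs three times as many nodes, and their power, for the same raw disk bandwidth.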
Scaling I/O on multicore CPUs
• Observation
  • As the number of cores increases in modern systems, we are not able to perform more I/O
• Target: 1M IOPS, 10 GBytes/s
• Goal
  • Provide a scalable I/O stack (virtualized) over direct and networked storage devices
• Go over
  1. Performance and scaling analysis
  2. Hybrid hierarchies to take advantage of potential
  3. Design for memory and synchronization issues
  4. Parallelism in lower part of networked I/O stack
(1) Performance and Scaling Analysis
• Bottom-up
• Controller
  • Actual controller
  • PCI
  • Host drivers
• Block layer
  • SCSI
  • Block
• Filesystem
  • xfs (a well-accepted fs)
  • vfs (integral Linux part)
[I/O stack diagram: applications and middleware in guest user space; system calls into the guest OS kernel (VFS+FS, virtual drivers); host OS (VFS+FS, block devices, SCSI layers, HW device drivers, PCI driver); PCI Express interconnect to storage, disk, and network controllers.]
I/O Controller [Systor'10]
• (1) A queue protocol over PCI (a minimal sketch follows below)
  • Many parameters and quite complex
  • Requires decisions: tune for high throughput
• (2) Request translation on controller
  • Memory management: balance between speed and waste
• (3) Request issue/completion towards devices
  • Use existing mechanisms but do careful scheduling
• Prototype comparable to commercial products
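The head/tail queue mentioned here (and pictured on the next slide) can be thought of as a single-producer/single-consumer ring of fixed-size descriptors. The sketch below is only illustrative; the names, sizes, and in-process setup are assumptions, not the actual Systor'10 protocol, which runs across PCIe with DMA and throughput tuning. The host advances the tail to publish descriptors, the controller advances the head as it consumes them, and each side only needs to learn the other's index.

    /* Minimal sketch of a host->controller descriptor ring with head/tail
     * indices, in the spirit of the queue protocol described on this slide.
     * All names and sizes are illustrative assumptions, not the actual
     * protocol. In a real design the ring lives in memory visible over PCIe
     * and the indices are exchanged via DMA/doorbells; here both sides run
     * in one process for simplicity. */
    #include <stdint.h>
    #include <stdio.h>

    #define RING_SLOTS 8u            /* power of two for cheap wrap-around */

    struct io_desc {                 /* fixed-size request descriptor */
        uint64_t lba;                /* starting block address */
        uint32_t len;                /* length in blocks */
        uint32_t write;              /* 1 = write, 0 = read */
    };

    struct ring {
        struct io_desc slot[RING_SLOTS];
        uint32_t head;               /* next slot the controller consumes */
        uint32_t tail;               /* next free slot the host fills */
    };

    /* Host side: enqueue a request at the tail; fails if the ring is full. */
    static int host_submit(struct ring *r, struct io_desc d)
    {
        if (r->tail - r->head == RING_SLOTS)
            return -1;                       /* full: controller must catch up */
        r->slot[r->tail % RING_SLOTS] = d;
        r->tail++;                           /* "doorbell": new tail visible to controller */
        return 0;
    }

    /* Controller side: dequeue from the head and (pretend to) start DMA. */
    static int ctrl_poll(struct ring *r)
    {
        if (r->head == r->tail)
            return 0;                        /* empty */
        struct io_desc d = r->slot[r->head % RING_SLOTS];
        printf("DMA %s lba=%llu len=%u\n", d.write ? "write" : "read",
               (unsigned long long)d.lba, (unsigned)d.len);
        r->head++;                           /* new head visible to host */
        return 1;
    }

    int main(void)
    {
        struct ring r = {0};
        host_submit(&r, (struct io_desc){ .lba = 1024, .len = 8,  .write = 1 });
        host_submit(&r, (struct io_desc){ .lba = 4096, .len = 64, .write = 0 });
        while (ctrl_poll(&r))
            ;
        return 0;
    }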
Results and Outlook
[Plots: DMA throughput (MB/s) vs. transfer size (4-64 KB) for host-to-HBA, HBA-to-host, and 2-way transfers, with and without host-issued PIO (impact of host-issued PIO on DMA throughput). Diagram: the host appends at the tail and the controller consumes from the head of the descriptor queue over the PCIe interconnect; the controller initiates DMA; the host needs to know the head at the controller, and the controller needs to know the tail at the host.]
• xput: Each controller can achieve 2 GBytes/s bi-directional
• IOPS: Each controller can achieve ~80K IOPS
  • 50K for commercial controllers with full I/O processing
• Controller CPU is an important limitation
• Outlook
  • (1) Scale throughput and IOPS by using multiple controllers
  • (2) I/O controllers should be fused with the host CPU
Block Layer
• I/O request protocol translation, e.g. SCSI
• Buffer management and placement
• Other layers involved; essentially a block-type operation
• Modern architecture trends create significant problems
Results and Outlook
• Translation processing scales with number of cores
  • Both throughput and IOPS
• I/O translation incurs overhead
• Affinity is an important problem
  • Wrong placement can reduce throughput almost to half
[Plots: sequential I/O throughput (MB/s, sequential reads and writes) and random I/O operations (read and write IOPS) vs. number of controllers (1-4) and number of benchmark instances (1-8), for different placement/affinity configurations.]
Filesystem
• Complex layer
  • Many complain about FS performance on multicores
  • Translates from a (request, file, offset, size) API to a (request, block#) API (see the sketch below)
  • Responsible for recovery (first layer to include extensive metadata in traditional systems)
  • We include VFS in our analysis; additional complexity
• Detailed analysis with extensive modifications to kernel
  • Required non-trivial instrumentation to measure lock and wait times
  • Extensive tuning to ensure that we measure "meaningful" cases
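As a toy illustration of the (file, offset, size) to (block#) translation mentioned above; this is a sketch with assumed names, and real filesystems such as xfs keep the mapping in extents or trees rather than a flat per-file array:

    /* Toy translation from (file, offset, size) to a list of block numbers.
     * Hypothetical structures; real filesystems keep this mapping in extents
     * or B+-trees and must also consult and maintain metadata. */
    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4096u
    #define MAX_FILE_BLOCKS 16

    struct file_map {
        /* logical block index within the file -> physical block number */
        uint64_t phys[MAX_FILE_BLOCKS];
    };

    /* Translate a byte range into physical block numbers; returns the count. */
    static int map_range(const struct file_map *f, uint64_t offset, uint32_t size,
                         uint64_t *out, int max_out)
    {
        uint64_t first = offset / BLOCK_SIZE;
        uint64_t last  = (offset + size - 1) / BLOCK_SIZE;
        int n = 0;
        for (uint64_t b = first; b <= last && n < max_out; b++)
            out[n++] = f->phys[b];           /* no bounds/hole handling in this sketch */
        return n;
    }

    int main(void)
    {
        struct file_map f = { .phys = { 100, 101, 250, 251 } };
        uint64_t blocks[8];
        int n = map_range(&f, 5000, 9000, blocks, 8);   /* spans logical blocks 1..3 */
        for (int i = 0; i < n; i++)
            printf("block #%llu\n", (unsigned long long)blocks[i]);
        return 0;
    }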
Results and Outlook
[Plots: fsmark CPU breakdown (1 MB files, 64 application threads) into IO-WAIT, USER, SYSTEM, INTERRUPT, and IDLE; CREAT and READ throughput (thousands of ops/sec, 1 log per process) vs. number of CPUs (1-16).]
• Most FS operations do not scale with # of cores
• Two main scaling problems
  • (1) vfs locking
    • vfs uses a structure for maintaining directory entry and inode information (the dentry and inode caches)
    • Synchronization over the dentry cache is problematic due to the vfs design
  • (2) FS journaling (see the sketch below)
    • All modern FSs need to worry about recovery
    • Most use a journaling scheme that is integrated with the lookup/update path
    • Synchronization over this journal hinders scaling
• Outlook
  • There is significant potential from both (1) and (2)
  • (1) is being discussed and (a) people are working on it, (b) there is potential to bypass
  • (2) is more fundamental; our goal is to target this
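To make the journaling bottleneck concrete, here is a minimal sketch with an assumed structure, not the actual xfs or ext journal code: every update, regardless of which core issues it, must append its record to one shared journal under a single lock before the in-place update proceeds, so the journal tail becomes a serialization point as core counts grow.

    /* Minimal sketch of a shared, lock-protected journal on the update path.
     * Illustrative only: real journals batch records into transactions and
     * commit asynchronously, but the single append point remains shared. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdint.h>

    #define JOURNAL_SLOTS 1024

    struct journal {
        pthread_mutex_t lock;
        uint64_t records[JOURNAL_SLOTS];
        unsigned tail;
    };

    static struct journal jnl = { .lock = PTHREAD_MUTEX_INITIALIZER };

    /* Every metadata update from every core funnels through this append. */
    static void journal_append(uint64_t record)
    {
        pthread_mutex_lock(&jnl.lock);       /* serialization point */
        jnl.records[jnl.tail % JOURNAL_SLOTS] = record;
        jnl.tail++;
        pthread_mutex_unlock(&jnl.lock);
    }

    static void *worker(void *arg)
    {
        uint64_t id = (uint64_t)(uintptr_t)arg;
        for (int i = 0; i < 100; i++) {
            journal_append((id << 32) | (uint64_t)i);  /* log intent first... */
            /* ...then the in-place update would happen here. */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[8];
        for (int i = 0; i < 8; i++)
            pthread_create(&t[i], NULL, worker, (void *)(uintptr_t)i);
        for (int i = 0; i < 8; i++)
            pthread_join(t[i], NULL);
        printf("journal records: %u\n", jnl.tail);
        return 0;
    }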
Summary of Analysis
• (1) Fundamentally, I/O performance should scale
• (2) Controller: use spatial parallelism and go with technology trends
• (3) Block: worry about placement and affinity problems
• (4) FS: worry about synchronization at specific points
• Both (3) and (4) are due to current trends in multicores
  • Not broadly known problems yet
(2) Hybrid Device Hierarchies
• To take advantage of this potential
  • Need hybrid device hierarchies using disks and SSDs
  • Otherwise, raw performance will not be adequate
  • [FlashCache'06, BPLRU'08, …]

  HDD (WD5001AALS-00L3B2) vs. SSD (Intel X25-E):
  Price/capacity ($/GB):    0.3      vs. 3
  Response time (ms):       12.6     vs. 0.17
  Throughput R/W (MB/s):    100/90   vs. 277/202
  IOPS R/W:                 150/150  vs. 30,000/3,500

• Designed and evaluated such a base hierarchy (a caching sketch follows below)
• Significant improvement
  • Over disks only
  • Over disks + SSDs, due to our policies
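A minimal sketch of the idea behind a block-level SSD cache in front of disks; the names and the direct-mapped policy are illustrative only, and the actual hierarchy adds admission, placement, and compression policies. On a read, look the block up in an in-memory map of cached blocks; hit the SSD on success, otherwise read from the HDD and populate the cache.

    /* Minimal block-level SSD read cache in front of an HDD (illustrative).
     * The devices are stubbed out; block_map is a toy direct-mapped index. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define CACHE_SLOTS 4096u
    #define BLOCK_SIZE  4096u

    struct cache_slot { uint64_t hdd_block; int valid; };
    static struct cache_slot block_map[CACHE_SLOTS];    /* direct-mapped */

    /* Stubs standing in for real device reads/writes. */
    static void hdd_read(uint64_t blk, char *buf)  { memset(buf, 'h', BLOCK_SIZE); (void)blk; }
    static void ssd_read(uint32_t slot, char *buf) { memset(buf, 's', BLOCK_SIZE); (void)slot; }
    static void ssd_write(uint32_t slot, const char *buf) { (void)slot; (void)buf; }

    /* Read one block through the cache; returns 1 on SSD hit, 0 on miss. */
    static int cached_read(uint64_t blk, char *buf)
    {
        uint32_t slot = (uint32_t)(blk % CACHE_SLOTS);
        if (block_map[slot].valid && block_map[slot].hdd_block == blk) {
            ssd_read(slot, buf);                     /* hit: cheap SSD access */
            return 1;
        }
        hdd_read(blk, buf);                          /* miss: slow HDD access */
        ssd_write(slot, buf);                        /* populate (evicts old mapping) */
        block_map[slot] = (struct cache_slot){ .hdd_block = blk, .valid = 1 };
        return 0;
    }

    int main(void)
    {
        char buf[BLOCK_SIZE];
        printf("first access: %s\n",  cached_read(12345, buf) ? "hit" : "miss");
        printf("second access: %s\n", cached_read(12345, buf) ? "hit" : "miss");
        return 0;
    }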
Summary [Eurosys'10, NAS'11]
• Transparent SSD caching promising for improving performance
• Improve SSD caching efficiency using online compression
  • Trade (cheap) CPU cycles for (expensive) I/O performance
  • Address challenges in online block-level compression for SSDs
• Our techniques mitigate CPU and additional I/O overheads
• Results in increased performance with realistic workloads
  • TPC-H up to 99%, PostMark up to 20%, SPECsfs2008 up to 11%
  • Cache hit ratio improves between 22% and 145%
  • Increased CPU utilization by up to 4.5x
  • Low-concurrency, small-I/O workloads problematic
• Overall our approach is worthwhile, but adds complexity…
• Future work
  • Power-performance implications interesting; hardware off-loading
  • Improving compression efficiency by grouping similar blocks
(3) Buffer Mgmt and Recovery Issues
• Revisit
  • Buffer mgmt in DRAM required to stage/cache I/Os
  • Recovery required due to volatility of DRAM
  • Both fundamental and related to system I/O architecture
• We design a new DRAM buffer+cache mechanism
  • (1) Allow isolation and partitioning
  • (2) Allow control over placement
  • (3) Deal with both fixed and variable size items
  • Similar techniques recently used for other structures in the kernel [OSDI'10]
• Use it with a kernel-level FS that is stateless
(4) Networked I/O Stack
• Host overhead for network processing is significant
  • We would like to push limits for networked I/O
  • Related: TCP/IP overhead at 10 GigE, xATA over Ethernet
• Use spatial parallelism in the network
  • Multiple 10 GBit/s controllers
  • Total 80 GBit/s bi-directional over Ethernet
  • Treat as a transparent link between target and initiator
• Storage protocols are not arbitrary (see the sketch below)
  • Request/response
  • Fixed-size buffers
• How well can we do?
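The request/response, fixed-buffer structure that a storage-specific protocol can exploit might look like the sketch below; the wire format and field names are assumptions, not the actual protocol. Every message is a fixed-size header plus at most one block-sized payload, so buffers can be pre-posted and reused without general-purpose stream parsing.

    /* Sketch of fixed-size request/response messages for a block storage
     * protocol between initiator and target (illustrative layout only). */
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096u

    enum msg_op { OP_READ = 1, OP_WRITE = 2 };

    struct storage_req {           /* initiator -> target */
        uint64_t tag;              /* matches the response to the request */
        uint64_t lba;              /* starting block address */
        uint32_t op;               /* OP_READ or OP_WRITE */
        uint32_t len;              /* payload bytes, <= BLOCK_SIZE */
        uint8_t  payload[BLOCK_SIZE];   /* used only for writes */
    };

    struct storage_rsp {           /* target -> initiator */
        uint64_t tag;
        int32_t  status;           /* 0 = ok */
        uint32_t len;
        uint8_t  payload[BLOCK_SIZE];   /* used only for read completions */
    };

    int main(void)
    {
        /* Because both message types have a known maximum size, the target
         * can pre-post a pool of receive buffers and reuse them, avoiding
         * per-request allocation on the data path. */
        printf("req %zu bytes, rsp %zu bytes\n",
               sizeof(struct storage_req), sizeof(struct storage_rsp));
        return 0;
    }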
Results and Outlook
• Base net protocol design and implementation
• Preliminary numbers (latest)
  • Over 4.5 GBytes/s for writes, with 4x10GigE NICs
  • Read is about 2 GBytes/s
  • Over 160K IOPS
• Insight: using a traditional, generic comm protocol induces overheads
  • Able to design a comm protocol that benefits from storage-specific semantics
• Target vs. initiator
  • I/O semantics not simple
  • Buffer management happens high up in the stack
  • Initiator less important (?)
• Results very encouraging
IOLanes
• Overall, data intensive applications are increasing
  • Distributed, data-center type applications
  • I/O subsystem an important building block
• Main challenges
  • (1) Performance and scalability
  • (2) Extensibility and effort
• Today
  • Few disks per cpu/core (e.g. two)
  • Any new feature or adaptation in the stack is remarkably complex
• IOLanes
  • (1) Identify bottlenecks
  • (2) Build a better stack
  • (3) Allow for easier extensibility
[Diagram: virtualized I/O stack (guest user space and kernel with VFS+FS, virtio/Split-X block devices, QEMU, KVM, host OS with VFS+FS, on/off-load modules, block devices, SCSI layers, HW device drivers, PCI driver, storage/network controllers) exercised by workloads such as TPC-W, SPECjAppServer, RUBiS with replication, LinearRoad streaming, Tariff Advisor, and TPC-H/TPC-C on PostgreSQL.]
Specific Challenges
• Scaling the I/O stack across all system layers on multicore CPUs
• Interaction of the I/O paths of multiple isolated virtual machines
• Use cycles offered by multicores to offer more "machinery" and optimize online
• Evaluation with realistic workloads
• Full-stack monitoring and analysis
Outline
• Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
• Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Parallel I/O (SCALUS)
  • Abstractions for modern applications (CumuloNimbo)
• Remarks
Dimension Infrastructure Properly
[Infrastructure diagram, repeated from earlier: 1000s of application servers (different flavors of PCs/blades; multicores + memory + I/O throughput), a high-speed interconnect (10-40 Gbit/s to application servers, 10-100 Gbit/s to storage), and 100s of file servers with disk controllers (~2 GB/s) and 12-36 SATA disks per node (100 MB/s, ~2 TBytes each, +10% SSD cache).]
• Dimensioning issues are not straightforward today
  - I/O application overheads not understood
  - Do you balance thin or fat?
  - Other factors besides performance, e.g. power
Scaling Beyond Single Node Requires
• Namespace management
• Distributed recovery, mostly for metadata
• Distributed DRAM caching, at the client side
• Understanding scaling overheads (efficiency)
Namespace Management
• Need to go from (filename, offset) to (node, device, object block) (a sketch follows below)
  • This requires translation metadata
• Metadata cannot be co-located with file/object data, if we need to scale single-file performance
  • This requires distributed lookup
  • Also, updates can be complicated
• Would be interesting to separate this from the rest of data storage
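A toy illustration of the (filename, offset) to (node, device, object block) translation; the striping scheme, sizes, and hash are illustrative assumptions, and a real system keeps this metadata in a distributed, updatable service rather than a fixed formula:

    /* Toy translation of (filename, offset) to (node, device, object block).
     * Striping parameters and the hash are illustrative assumptions. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define NODES         4u
    #define DEVS_PER_NODE 8u
    #define STRIPE_BYTES  (1u << 20)      /* 1 MB stripe unit */
    #define BLOCK_BYTES   4096u

    struct location { uint32_t node, device; uint64_t object_block; };

    static uint64_t hash_name(const char *name)   /* FNV-1a, for placement */
    {
        uint64_t h = 1469598103934665603ull;
        for (size_t i = 0; i < strlen(name); i++)
            h = (h ^ (uint8_t)name[i]) * 1099511628211ull;
        return h;
    }

    static struct location resolve(const char *name, uint64_t offset)
    {
        uint64_t stripe = offset / STRIPE_BYTES;          /* which stripe unit */
        uint64_t spread = hash_name(name) + stripe;       /* rotate per stripe */
        struct location loc = {
            .node         = (uint32_t)(spread % NODES),
            .device       = (uint32_t)((spread / NODES) % DEVS_PER_NODE),
            .object_block = (offset % STRIPE_BYTES) / BLOCK_BYTES,
        };
        return loc;
    }

    int main(void)
    {
        struct location l = resolve("/data/table.db", 3u * STRIPE_BYTES + 8192);
        printf("node %u, device %u, object block %llu\n",
               (unsigned)l.node, (unsigned)l.device,
               (unsigned long long)l.object_block);
        return 0;
    }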
Distributed Recovery
• Single-node recovery is not enough when data is spread out
  • Some layer will need to do it
  • Part of the storage system or the application middleware
• It probably means that storage nodes and application nodes will need to be separate tiers
  • Fewer storage nodes and more application nodes
  • Recovery protocol will only involve (hopefully) the storage nodes
• Some form of transactional API to storage seems right (a sketch follows below)
  • Not simply read/write any more
  • Versioning vs. logging approaches
  • Will involve some agreement protocol for all nodes involved in an operation, due to striping, replication, metadata/data, etc.
• New mechanism for the common path
  • Much more complicated than traditional systems
  • Either centralized controllers or centralized metadata servers
Distributed DRAM Caching
• Traditionally, a cache exists as close to the application node as possible
  • In the file client
• This is problematic
  • For recovery
  • For scaling to many application nodes
• Two possibilities (option (1) is sketched below)
  • (1) Do client-side caching but avoid write-back
  • (2) Do not do client-side caching and use a single-object-owner approach at the next (storage) tier
• Both seem good approaches
  • (1) relies on a "smarter" I/O path
  • (2) relies on "smarter/faster" networks between the application/file client and the storage node
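A minimal sketch of option (1), a client-side cache that avoids write-back (illustrative structure only): reads are served from local DRAM when possible, but every write goes through to the storage tier before the cached copy is updated, so no dirty state is lost if the client fails.

    /* Sketch of option (1): a client-side DRAM cache with write-through
     * (no write-back). Storage access is stubbed; layout is illustrative. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define SLOTS      1024u
    #define BLOCK_SIZE 4096u

    struct slot { uint64_t block; int valid; char data[BLOCK_SIZE]; };
    static struct slot cache[SLOTS];

    static void storage_read(uint64_t blk, char *buf)  { memset(buf, 0, BLOCK_SIZE); (void)blk; }
    static void storage_write(uint64_t blk, const char *buf) { (void)blk; (void)buf; }

    static void client_read(uint64_t blk, char *buf)
    {
        struct slot *s = &cache[blk % SLOTS];
        if (!(s->valid && s->block == blk)) {       /* miss: fetch from storage */
            storage_read(blk, s->data);
            s->block = blk;
            s->valid = 1;
        }
        memcpy(buf, s->data, BLOCK_SIZE);
    }

    static void client_write(uint64_t blk, const char *buf)
    {
        storage_write(blk, buf);                    /* write-through: durable first */
        struct slot *s = &cache[blk % SLOTS];       /* then refresh the local copy */
        memcpy(s->data, buf, BLOCK_SIZE);
        s->block = blk;
        s->valid = 1;
    }

    int main(void)
    {
        char buf[BLOCK_SIZE] = "hello";
        client_write(17, buf);
        client_read(17, buf);                       /* served from the local cache */
        printf("%s\n", buf);
        return 0;
    }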
Efficiency: Ultimately it is all about power
• Today, people do not pay much attention to the cost of scaling
  • Goal is to scale performance
  • Experimental setups with 1-2 disks per node and many nodes for scaling I/O are common
  • This is very poor efficiency (CPU-to-disk ratio; consider power)
• How much are you willing to pay for scaling?
  • Start from a base, optimized I/O stack like the one I have described
  • If we can scale and each I/O subsystem operates at its best rate, we are fine
  • Essentially, the cost of scaling should not be too high (or, ideally, even visible) when going from one to many nodes
  • This is not true today, by far…
• Ultimately, power will force everyone to look into this
  • Or, only a few applications will be able to pay for it
  • Analogy: SANs today work, but they cost
"Machinery" for distribution
• All previous mechanisms require "machinery" that is expensive
• We need to come up with distributed I/O approaches that do all this processing more efficiently
  • We have, or can assume, a lot of concurrency, so there is always work to do
  • This is more about being asynchronous all the time, using DRAM as a buffer so as not to starve any other resource
• Design systems that wait only when I/O xput is exhausted
  • No application should be I/O bound!
  • …with high-throughput devices and system interconnects in modern and future systems
• Efficiency will matter at some point
  • Even for apps that are able to scale and achieve their performance goals
• We need to understand
  • Mechanisms required for scaling, and their overheads
  • Who should do what in the distributed I/O path
• Different application domains will resolve the tradeoffs in overheads and semantics differently
Where Should Each Op Go in the I/O Path?
• (1) Everything in the file system (most prevalent today)
  • Has to be provided by every filesystem
  • The world will have many filesystems
  • Some problems, e.g. consistent client caching, are inherently difficult (not scalable)
  • Try using GPFS (not to mention extending it…)
• (2) Why not be closer to traditional SAN/NAS?
  • Let's do reliability and availability as a SAN
  • File operations and scaling as a NAS
  • Requires distributed block-level consistency and atomicity…
  • …at the infrastructure level (kernel, firmware, …)
  • Not clear this is the way to go…
• (3) Other alternatives? Who knows…
[Diagram: I/O path design and implementation. File servers expose NAS (NFS/CIFS) over an FS layer and a block I/O stack; the network connects them to storage nodes running a block-level stack.]
Outline
• Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
• Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Parallel I/O (SCALUS)
  • Abstractions for modern applications (CumuloNimbo)
• Summary
CumuloNimbo Global Architecture
[Architecture diagram, repeated from earlier: JEE application server (JBoss+Hibernate), object cache (CumuloCache), query engine (Derby), column-oriented data store & block cache (HBase), distributed file system (HDFS), storage and communication layers; cross-cutting transaction management (transactions, concurrency controllers, commit sequencers, loggers) and elasticity management (self-provisioner, monitors, load balancers).]
State of the Art
• Key-value data stores gaining significance
  • Supporting arbitrary variable-size keys and values
• Distributed key-value stores are used increasingly
  • HBase is a component of the CumuloNimbo architecture
  • Also, other s/w stacks are built on top of key-value stores
• To access persistent storage, such systems are built today on top of traditional file systems
  • However, the semantics of the underlying system differ in fundamental ways
Key-value Store vs. FS Mismatch
• Hard to map mutable, variable-size keys/values to files
  • Key-based indexing vs. offset-based indexing in the presence of variable-size values
• Data placement on local/networked storage devices cannot take advantage of the semantics of key/value stores
  • Information that has been provided by the application is thrown away during the mapping to flat files
• Local file systems offer limited recovery/availability guarantees
  • Last-write recovery is expensive; no data consistency guarantees
• Significant performance overheads and scalability limitations
  • When scaling to large amounts of storage and high rates
Our Goal
• Raise the abstraction of traditional, locally managed persistent storage using a native key-value API (a sketch follows below)
• Support mutable, variable-length items; important for workloads that incur frequent updates
• Perform all operations required (packing, cleanup) for dealing with variable-size items over fixed block-size persistent devices
• Optimize device use based on the importance of data items
• Ensure consistency of the data store after a failure, based on configurable workload requirements
• Use tunable data replication for availability purposes
• Separate distributed aspects from efficiency at the local level
  • Synergies can be important for performance, e.g. the recovery mechanism
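What a "native key-value API" over locally managed persistent storage might look like, as a hedged sketch; these signatures are illustrative, not the CumuloNimbo interface. Put/get operate on mutable, variable-length values, with packing, cleanup, replication, and recovery meant to live behind the API rather than in a filesystem above it.

    /* Sketch of a native key-value API for locally managed persistent storage
     * (hypothetical signatures). Backed here by a tiny in-memory table so the
     * example runs; a real store would map items onto block devices. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct kv_item { char *key; void *val; size_t len; };
    static struct kv_item table[64];           /* toy store, linear scan */

    static int kv_put(const char *key, const void *val, size_t len)
    {
        for (size_t i = 0; i < 64; i++) {
            struct kv_item *it = &table[i];
            if ((it->key && strcmp(it->key, key) == 0) || it->key == NULL) {
                free(it->val);
                it->val = malloc(len);
                if (!it->val) return -1;
                memcpy(it->val, val, len);      /* values are mutable, any length */
                it->len = len;
                if (!it->key) it->key = strdup(key);
                return 0;
            }
        }
        return -1;                              /* toy store full */
    }

    static const void *kv_get(const char *key, size_t *len)
    {
        for (size_t i = 0; i < 64; i++)
            if (table[i].key && strcmp(table[i].key, key) == 0) {
                *len = table[i].len;
                return table[i].val;
            }
        return NULL;
    }

    int main(void)
    {
        size_t len = 0;
        kv_put("user:42", "alice", 6);
        kv_put("user:42", "alice-updated", 14);     /* in-place mutation, new length */
        const char *v = kv_get("user:42", &len);
        printf("%s (%zu bytes)\n", v ? v : "(missing)", len);
        return 0;
    }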
Storage Layer Architecture
Outline
• Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
• Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)
• Remarks
The role of persistent I/O
• Required to keep user data
  • Data generated and used at different times (and over long periods)
  • Tolerate failures
• Persistence of control information (metadata)
• Both emerge as problems
Data
• Many applications today in data centers require large amounts of data
• "Waste" in today's architectures
  • Getting data from persistent devices to memory
  • Requires complex namespace operations, which lead to significant resource utilization
  • Contrast this to memory accesses, which are simpler in nature
• Systems have been built to tolerate high response times
  • Results in more work per I/O
• Virtualization introduces significant overheads for I/O
  • But it is important for isolation among workloads and environments
Metadata
• Examples
  • In a filesystem: inodes and dentries
  • In a tuplestore: hash tables and B-trees for indexing
  • At the block level (e.g. FTL): logical-to-physical (re)mapping tables
• Equally important to data
  • In some cases even more so
  • Many systems can afford to be sloppy about data, but not about metadata
• Footprint
  • Metadata needs to be kept in memory for performance purposes
  • Sophisticated (and application-specific) caching techniques
  • Otherwise the number of I/Os per user I/O increases dramatically
• Persistence
  • Remaining consistent across failures is of paramount importance
  • But DRAM is not persistent => complex write management techniques
• Many system, middleware, and application layers need to handle metadata, multiplying these inefficiencies
Today
• Persistence is "heavy" due to device/controller technology
• Persistence is not designed with multicores in mind
• Persistence is inefficient when scaling across nodes
• Persistence incurs overheads in multiple layers
What can we do?
• Persistent I/O should "get closer" to the CPU
  • Namespace issues should be simpler
  • Transfers between persistent and non-persistent stages of memory should be more efficient
  • Role of access granularity
• Architectures should better support persistence for metadata
  • Treating data and metadata the same is a very inefficient simplification
• Understand overheads and scaling characteristics on modern systems
  • How many cycles of processing per I/O does a data-centric application need?
Summary
• (1) Memory hierarchy work to bring persistence closer to the CPU
  • Profound changes that impact all layers
  • Achieving efficiency with device technology
• (2) I/O path evolution to scale with # cores
  • Current systems not designed with this in mind
  • As cores increase, base I/O performance does not scale
  • Virtualization overheads/contention exacerbate this
  • Energy proportionality
• (3) Persistent I/O needs to scale efficiently with # nodes
  • Extensive additional "machinery" today at the system and middleware level to achieve scaling => incurs high overhead and impacts efficiency
  • E.g. heartbeats and replication are not compatible with energy efficiency
Acknowledgements
• People
  • Shoaib Akram, Konstantinos Chassapis, Michail Flouris, Markos Foundoulakis, Dhiraj Gulati, Yiannis Klonatos, Kostas Magoutis, Thanos Makatos, Manolis Marazakis, Stelios Mavridis, Zoe Sebepou
• Funding agencies
  • EC: SIVSS, SCALUS, IOLANES, CumuloNimbo, STREAM, HiPEAC
  • GSRT: National research office
• Many partners and colleagues