Acunu - Research overview



Storage is changing. We need new algorithms to deal with it.

We are witnessing at least two revolutions in storage: (1) massive datasets and workloads, and (2) the rise of scale-out commodity hardware. This whitepaper describes the Acunu Data Platform, and how Acunu is allowing massive data workloads to take full advantage of today’s hardware.

Acunu is rewriting the storage stack in the Linux kernel for Massive Data thanks to world-class engineering and algorithms research.

Massive Data Workloads.

How have workloads changed? The workloads generated by massive datasets typically exhibit three main features:

• Continuously high ingest rates (many thousands of updates/s, typically high-entropy, random updates)
• Individual pieces of data are small, and aren’t valuable in isolation (for example, stock ticks or session IDs)
• Continual range queries are important for analytics (such as those demanded by Apache Hadoop)

This is in stark contrast to the ‘load, then query’ regimes of more traditional databases.

Understanding massive data means being able to extract features and trends, all while the data is continually updated. Existing platforms and solutions cannot do this at scale with predictably high performance. This is where Acunu comes in.

The first revolution is the rise of non-relational, or ‘NoSQL’, databases such as Cassandra, and analytics frameworks and tools such as Hadoop. The driving force is using clusters of commodity machines to ingest large volumes of data, process it, and serve it. Previous technologies such as MySQL are traditionally cumbersome to operate at the scales needed here. For many deployments in both enterprise and non-enterprise settings, these technologies are likely to account for the majority of data stored, where features such as high availability at low cost are more important than transactional durability.

The second revolution is a hardware one. Commodity machines now typically possess many cores, and bear a closer resemblance to a supercomputer of the 90s than to a desktop of the same era. Hard drive capacity and sequential bandwidth have been doubling every 18 months, as predicted; yet random IO performance has not improved. Solid-state drives (SSDs) offer 2-3 orders of magnitude better random IO performance than hard drives. Clearly these have huge potential to revolutionize the database world, if only the software stack can harness their performance.


Fundamental research = new possibilities.

The Acunu Storage Core is based on fundamental, patent-pending algorithms and engineering research. This isn’t just a better implementation of an existing idea, or a shinier UI or management console (although our management stack is also pretty cool). We are doing world-class research, engineering and patenting, and we publish at top conferences. Why? Because this allows us to do things simply not possible before. Here are some examples.

Fast, full versioning.

Versioning of large datasets is an incredibly powerful tool. Not just low-performance snapshots for backups, but high-performance, concurrently accessible clones and snapshots of live datasets for test and development, offering many users different, writeable views of the same large dataset, going back in time, and much more.

Traditionally, the state of the art in algorithms for versioning large datasets has been based on a data structure known as the ‘copy-on-write B-tree’ (CoW B-tree) - this is ubiquitous in file systems and databases including ZFS, WAFL, Btrfs, and more. The CoW B-tree (and most of its variants, such as append-only trees, log file systems, redirect-on-write, etc.) has three fundamental problems: (1) it is space-inefficient (and thus requires frequent garbage collection); (2) it relies on random IO to scale (and thus performs poorly on rotational drives); and (3) it cannot perform fast updates, even on SSDs.
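The path-copying behaviour behind CoW structures can be illustrated with a toy persistent binary search tree (a simplified stand-in for a B-tree; the `Node`, `insert` and `lookup` names here are illustrative, not Acunu code). Every update copies the whole root-to-leaf path, never mutating in place - which is exactly what makes old roots free snapshots, and also why each small update rewrites many nodes:

```python
class Node:
    """Immutable tree node: updates copy, never mutate."""
    __slots__ = ("key", "val", "left", "right")
    def __init__(self, key, val, left=None, right=None):
        self.key, self.val, self.left, self.right = key, val, left, right

def insert(node, key, val, stats):
    """Copy-on-write insert: returns a new root, counting node writes."""
    stats["nodes_written"] += 1
    if node is None:
        return Node(key, val)
    if key < node.key:
        return Node(node.key, node.val, insert(node.left, key, val, stats), node.right)
    if key > node.key:
        return Node(node.key, node.val, node.left, insert(node.right, key, val, stats))
    return Node(key, val, node.left, node.right)   # overwrite existing key

def lookup(node, key):
    while node is not None:
        if key == node.key:
            return node.val
        node = node.left if key < node.key else node.right
    return None

# Old roots remain valid, fully queryable snapshots:
stats = {"nodes_written": 0}
v1 = None
for k in (50, 25, 75, 60):
    v1 = insert(v1, k, str(k), stats)
v2 = insert(v1, 60, "updated", stats)   # copies the path 50 -> 75 -> 60
print(lookup(v1, 60), lookup(v2, 60))   # 60 updated
```

A real CoW B-tree copies whole disk-sized nodes rather than tiny in-memory ones, so each copied path costs hundreds of KB of writes - the root of the space-inefficiency and slow-update problems above.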

Acunu has invented a fundamentally new data structure - the Stratified B-tree - that addresses all the above problems. Some details of this revolutionary data structure have been published: see [Twigg, Byde - Stratified B-trees and versioned dictionaries, USENIX HotStorage’11].

Designed for SSDs.

Existing storage schemes do not address the fact that SSDs require addressing in a fundamentally different way. Although they present a SATA/SAS interface and are sector-addressed, this is only to allow them to be a drop-in replacement for hard drives. Extracting maximum performance and lifetime requires two things: (1) a storage stack that understands how they operate; and (2) new data structures and algorithms that exploit their design characteristics.

By understanding how SSDs fundamentally work, Acunu has been able to engineer data structures that allow unprecedented long-term write performance, while guaranteeing device endurance.

Not just peak performance, but predictable performance.

By eliminating JVM-based garbage collection and memory management issues, and carefully controlling hardware access from within the Linux kernel, Acunu is able to offer predictably high performance, even under sustained high loads, for both ingest and analytic range queries - the perfect ingredients for any real-time analytics platform. Watch carefully in future versions as Acunu begins to deploy fundamentally new offerings here, exploiting our back-end algorithmic advantage.


SSDs - it’s all about endurance.

Flash SSDs are a fundamental change in storage technology, yet many systems treat them as if they were rotating hard drives. Indeed, the legacy storage stack is filled with implicit assumptions about rotational drives. To exploit SSDs fully, we need new algorithms and a stack that understands how flash SSDs fundamentally work.

What’s the problem?

Let’s start by considering why in-place updates to B-trees fail to give good performance on SSDs. The figure below shows what happens to a fresh Intel X25-M Flash SSD [1] under a simple workload: write a random 512KB buffer to a random 512KB-aligned offset. The device’s stated capacity is 160GB, and once the total volume written reaches this point, the performance drops off dramatically. The take-away message is this: to get consistently high performance from this device, we need to do something else. B-trees, or any other random-write-intensive data structure, won’t work.
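The workload above can be approximated with a standard fio job file. This is a sketch: the device path and the total write size are placeholders, while the IO engine and queue depth follow the setup described in footnote [1]:

```ini
; hypothetical fio job approximating the workload described above
; (replace /dev/sdX with the device under test - this is destructive!)
[randwrite-512k]
filename=/dev/sdX
rw=randwrite
blocksize=512k
direct=1
ioengine=libaio
iodepth=32
size=320g
```

Writing roughly twice the 160GB stated capacity is enough to push the device past the point where the drop-off appears.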

The reason for the drop-off once the write volume reaches the device capacity is quite complex, and depends on the internal structure of the device - if you’re interested, read this great report [2] for a simulation-based analysis of different SSD architectures. The basic reason is that although the flash memory chips have a 512KB erase block, most SSDs implement an internal log structure (the magic ‘flash translation layer’, or FTL) for several reasons, most notably because the bandwidth of these individual memory chips is relatively low, and to enable wear leveling and error correction. This often makes the ‘effective’ logical erase block size much larger, typically hundreds of MBs for recent MLC devices. The result is that writes are at the mercy of the device’s FTL, which is the part manufacturers keep quiet and closed.
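The interaction between random writes and an FTL can be made concrete with a toy simulation. This is a sketch under simplifying assumptions - a page-mapped FTL with greedy garbage collection; the block counts, sizes, and 90% utilization figure are illustrative, not modelled on any real device. Once the logical space has been written once, every further random write forces the FTL to relocate still-valid pages, so flash writes exceed host writes:

```python
import random

PAGES_PER_BLOCK = 32
NUM_BLOCKS = 32
LOGICAL_PAGES = int(NUM_BLOCKS * PAGES_PER_BLOCK * 0.9)  # 90% utilization

class ToyFTL:
    """Page-mapped FTL: out-of-place writes plus greedy garbage collection."""
    def __init__(self):
        self.mapping = {}                         # logical page -> (block, slot)
        self.blocks = [[] for _ in range(NUM_BLOCKS)]
        self.free = list(range(1, NUM_BLOCKS))    # block 0 starts as active
        self.active = 0
        self.host_writes = 0
        self.flash_writes = 0

    def _valid(self, b):
        """Logical pages in block b whose mapping still points at b."""
        return [lp for slot, lp in enumerate(self.blocks[b])
                if self.mapping.get(lp) == (b, slot)]

    def _gc(self):
        """Erase the sealed block with fewest valid pages, relocating them."""
        sealed = [b for b in range(NUM_BLOCKS)
                  if b != self.active and b not in self.free]
        victim = min(sealed, key=lambda b: len(self._valid(b)))
        survivors = self._valid(victim)
        self.blocks[victim] = []                  # erase
        self.free.append(victim)
        for lp in survivors:                      # relocation = extra flash writes
            self._append(lp)

    def _append(self, lp):
        while len(self.blocks[self.active]) == PAGES_PER_BLOCK:
            if self.free:
                self.active = self.free.pop()
            else:
                self._gc()                        # frees a block, switches active
        slot = len(self.blocks[self.active])
        self.blocks[self.active].append(lp)
        self.mapping[lp] = (self.active, slot)
        self.flash_writes += 1

    def write(self, lp):
        self.host_writes += 1
        self._append(lp)

random.seed(0)
ftl = ToyFTL()
for lp in range(LOGICAL_PAGES):                   # fill the drive once
    ftl.write(lp)
seq_wa = ftl.flash_writes / ftl.host_writes       # 1.0: no GC needed yet
for _ in range(20000):                            # sustained random overwrites
    ftl.write(random.randrange(LOGICAL_PAGES))
print("write amplification:", round(ftl.flash_writes / ftl.host_writes, 2))
```

Real FTLs are far more sophisticated (wear leveling, hot/cold separation), but the qualitative effect matches the drop-off described above: once the write volume exceeds the logical capacity, every host write drags relocation traffic behind it.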

Log file systems.

Many emerging file systems and storage products argue that append-only B-trees are perfectly suited to today’s hardware, particularly SSDs. Is this true? The append-only B-tree has two major problems, which Acunu’s fundamental algorithms research finally overcomes.

The CoW B-tree has a potentially big space blowup: to rewrite a 16-byte key/value pair in a tree of depth 3 with 256K block size, you may have to do 3x256K of random reads and then write 768K of data. In practice, some of these nodes are cached and don’t need rewriting, but for random updates to large datasets this is pretty close. Even if you don’t care about space utilisation, when the device is full you’ll be writing, on average, a lot of data per small random update, and this means you’re no longer fast at writing. Unfortunately, other than heuristic tweaks or giving your machine gigantic amounts of RAM, this is an inherent problem for append-only CoW indexes.
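The write amplification implied by these numbers is easy to make explicit (a worked version of the 16-byte-update example above):

```python
NODE_SIZE = 256 * 1024        # 256K block size, as in the example above
DEPTH = 3                     # root-to-leaf path length
UPDATE_SIZE = 16              # one 16-byte key/value pair

bytes_read = DEPTH * NODE_SIZE       # 3 x 256K of random reads
bytes_written = DEPTH * NODE_SIZE    # the whole path is rewritten: 768K
amplification = bytes_written // UPDATE_SIZE

print(bytes_written // 1024)  # 768 (KB written per 16-byte update)
print(amplification)          # 49152x write amplification
```

Nearly fifty thousand bytes hit the device for every byte the application actually changed, which is why the garbage collector ends up chasing so much dead data.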

The classic Achilles heel of a log file system is garbage collection (cleaning) - recovering invalidated (e.g. overwritten) blocks in order to reclaim sufficiently large contiguous regions of free space so that future writes can be efficient. Very few guarantees are known for garbage collection in log file systems, particularly when the system does not experience idle time, or is under low free-space conditions. To make matters worse, the space blowup described above means that CoW trees generate a lot of extra work for the garbage collector - at a 50x space blowup, the garbage collector has to work 50x harder to keep ahead of the input stream.

Soules et al. (2003) [3] compare the metadata efficiency of a versioning file system using both CoW B-trees and a structure (CVFS) based on the Multi-version B-tree (MVBT) [4]. They find that, in many cases, the size of the CoW metadata index exceeds the dataset size. In one trace, the versioned data occupies 123GB, yet the CoW metadata requires 152GB while the CVFS metadata requires 4GB - a saving of 97%.

Stratified B-trees.

Acunu has invented a fundamentally new data structure, the Stratified B-tree [5,6], that dominates CoW B-trees, with or without log file systems. Stratified B-trees can be written without append-only logs and heuristic-based garbage collectors. They are the first data structure to offer provably optimal performance for full versioning (allowing updates in far less than 1 IO per update on average), use asymptotically optimal O(N) space, offer an optimal range of trade-offs between updates and queries, and can generally avoid performing random IO for both updates and range queries. In particular, one construction offers updates three orders of magnitude faster than CoW B-trees, and can answer range queries around one order of magnitude faster than the CoW B-tree!
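The full construction is in the cited papers [5,6]; as rough intuition for how write-optimized structures trade random IO for sequential merges, here is a toy store that buffers updates and merges sorted runs. This is an LSM-style sketch for illustration only - it is not the Stratified B-tree, and it omits versioning entirely; `TieredStore` and its parameters are hypothetical names:

```python
import bisect

class TieredStore:
    """Toy write-optimized store: updates are buffered in memory, spilled
    into sorted runs, and runs of similar size are merged sequentially."""
    def __init__(self, buf_size=4):
        self.buf = {}
        self.buf_size = buf_size
        self.runs = []   # sorted lists of (key, val), oldest first

    def put(self, k, v):
        self.buf[k] = v
        if len(self.buf) >= self.buf_size:
            self._spill()

    def _spill(self):
        self.runs.append(sorted(self.buf.items()))
        self.buf = {}
        # merge runs of similar size; newer values win on duplicate keys
        while len(self.runs) >= 2 and len(self.runs[-1]) >= len(self.runs[-2]):
            newer, older = self.runs.pop(), self.runs.pop()
            merged = dict(older)
            merged.update(newer)
            self.runs.append(sorted(merged.items()))

    def get(self, k):
        if k in self.buf:
            return self.buf[k]
        for run in reversed(self.runs):          # newest run first
            i = bisect.bisect_left(run, (k,))
            if i < len(run) and run[i][0] == k:
                return run[i][1]
        return None

    def range(self, lo, hi):
        out = {}
        for run in self.runs:                    # oldest first, newer overwrite
            for k, v in run:
                if lo <= k < hi:
                    out[k] = v
        for k, v in self.buf.items():
            if lo <= k < hi:
                out[k] = v
        return sorted(out.items())

store = TieredStore(buf_size=4)
for k in (5, 1, 9, 3, 7, 2, 8, 4):
    store.put(k, str(k))
print(store.range(2, 6))   # [(2, '2'), (3, '3'), (4, '4'), (5, '5')]
```

Updates never touch existing runs in place, and range queries walk each run sequentially - which is the flavour of IO behaviour the text above attributes to the real structure.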

[1] Model number: INTEL SSDSA2M160G2GC, firmware revision: 2CV102M; writes use Linux AIO direct to the device with queue depth 32.
[2] http://research.microsoft.com/apps/pubs/?id=63596
[3] http://www.hpl.hp.com/personal/Craig_Soules/papers/fast03.pdf
[4] http://portal.acm.org/citation.cfm?id=765851.765854
[5] A. Twigg et al., Stratified B-trees and versioned dictionaries, USENIX HotStorage’11, 2011.
[6] A. Byde, A. Twigg, Stratified B-trees and versioned dictionaries (version with proofs), arXiv.org, 2011.


About Acunu.

Acunu is re-engineering the storage stack from the ground up for the age of Massive Data. Based on fundamental algorithms research and world-class engineering, the Acunu Platform allows applications such as Apache Cassandra and Hadoop, along with many others, to (1) drive today’s commodity hardware harder than ever before, including many-core architectures, SSDs and large SATA drives; (2) exploit new features in the Acunu Core (such as fast cloning and versioning); and (3) obtain predictable, reliably high performance. Storage is the key to understanding Massive Data, and gaining competitive advantage. The Acunu Open Platform lets companies do this more quickly, easily and cheaply.

Acunu was founded in 2009 by researchers and engineers from Cambridge, Oxford, and several well-known high-tech companies. We are backed by some of Europe’s top VCs, with total funding of over $5.0M. We are based in London and California.

Founders.

Dr Tim Moreton, CEO: Tim is an expert in distributed file systems. He holds a PhD from Cambridge, where he built a distributed file system for the Xen project. He was previously at Tideway (now BMC), where he was lead engineer on a number of data center projects.

Dr Andy Twigg, CTO: Andy has an outstanding track record of theoretical and applied computing research. He has held positions at Cambridge University, Microsoft Research, Thomson Research and Oxford University. His 2006 PhD on compact routing algorithms was nominated for the BCS Best Dissertation Award. He holds a Junior Research Fellowship at Oxford University, where he is a member of the CS department.

Tom Wilkie, VP Engineering: Tom was one of the first UK employees at XenSource before its acquisition by Citrix in 2007. He worked on the XenCenter management stack and numerous customer projects. He has a BA in Computer Science from Cambridge.

Dr John Wilkes, Technical Advisor: John led the Storage Systems group at HP Labs for 15 years, before moving to Google in 2008. He received his PhD from Cambridge in 1984 and an Outstanding Contribution award from SNIA in 2001, and was made an ACM Fellow in 2002.