alluxio (formerly tachyon): open source memory speed virtual distributed storage

Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage

October 2016

Gene Pang

About Me and Alluxio, Inc.

• Team members from Google, Palantir, Uber, Yahoo with years of distributed systems development experience

• Graduated from Stanford University, UC Berkeley, CMU, Peking University, and Tsinghua, with CS masters or PhDs

• Top 9 committers of the Alluxio open source project

AlluxioTeam

Gene Pang, Software Engineer, Alluxio Maintainer

Ph.D. from UC Berkeley AMPLabPreviously on Google F1 team

Twitter: @unityxx

• Andreessen HorowitzInvestors

AGENDA

• Alluxio Open Source Status and History

• Alluxio Overview

• Alluxio Use Cases

• What’s Next?

HISTORY

• Started at UC Berkeley AMPLab In Summer 2012• Original named as Tachyon

• Open Sourced in 2013• Apache License 2.0• Latest Stable Release: Alluxio 1.2.0• Next Release (Alluxio 1.3.0) soon!

• Rebranded as Alluxio in 2016

OPEN SOURCE ALLUXIO

• One of the fastest growing open-source projects in the big data ecosystem

• Currently over 300 contributors from over 100 organizations

• Welcome to join our community!

Popular Open Source Projects’ Growth

Alluxio

BIG DATA ECOSYSTEM TODAYBIG DATA ECOSYSTEM WITH ALLUXIOBIG DATA ECOSYSTEM YESTERDAY

FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System

Enabling any application to access data from any storage system at memory-speed

BIG DATA ECOSYSTEM ISSUES

GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface

• Memory is getting Faster, Larger, and Cheaper

• Memory price as halving every 18 months

• Disk throughput increasing slowly

TECHNOLOGY TRENDS

Top left chart: https://lazure2.wordpress.com/2013/07/02/20-years-of-samsung-new-management-as-manifested-by-the-latest-june-20th-galaxy-ativ-innovations/

Top right chart: people.eecs.berkeley.edu/~istoica/classes/cs294/15/notes/02-TechnologyTrends.ppt

Bottom chart: jcmit.com/

File System APISoftware Only

ALLUXIO ATTRIBUTES

Memor y-Speed Virtual Distributed Storage

Scale out architecture

Virtualizes across different storage

systems, providing a unified namespace

Memory-speed access to data

Server A

A p p l i c a t i o n s

Server B

Server Z

Server C

A p p l i c a t i o n sA l l u x i o A l l u x i o A l l u x i oA l l u x i o

ALLUXIO SOLUTION DEPLOYMENT

St o ra g e B Sto ra g e C Sto ra g e ZSto ra g e A

ALLUXIO BENEFITS

UnificationNew workflows across any data in any storage system

PerformanceHigh performance data access

FlexibilityWork with the compute and storage frameworks of your choice

CostGrow compute and storage systems independently

USE CASE 1 – Accelerate I/O to/from Remote Storage

• Compute and Storage Separation• Advantages• Meet different compute and storage hardware

requirements efficiently• Scale compute and storage independently• Store data in Traditional filers/SANs and object

stores cost effectively• Compute on data in existing storage via Big Data

Computational frameworks• Disadvantage• Accessing data requires remote I/O

Use Case without Alluxio

Storage

Low latency, memory throughput

High latency, network throughput

Use Case with Alluxio

Storage

AlluxioKeeping data in Alluxio accelerates data access

CASE STUDY

Baidu File System

The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.

- Shaoshan Liu, Baidu

RESULTS

• Data queries are now 30x faster with Alluxio

• Alluxio cluster run stably, providing over 50TB of RAM space

• By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds

Accelerate Access to Remote Storage

• 200+ nodes deployment

• 2+ petabytes of storage

• Mix of memory + HDD

USE CASE 2 – Share Data Across Jobs at Memory Speed

• Architectures Requiring Shared Data• Pipelines: output of one job is input of the next job• Different applications, jobs, or contexts read the

same data• Disadvantage• Sharing data requires I/O

Use Case without Alluxio

Storage

MapReduce Spark

Network I/O

Disk I/O

I/O slows down

sharing

Use Case with Alluxio

Storage

MapReduce Spark

Sharing data with Alluxio via memory

Alluxio

CASE STUDY

Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity.

- Henry Powell, Barclays

RESULTS

• Barclays workflow iteration time decreased from hours to seconds

• Alluxio enabled workflows that were impossible before

• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds

Relational Database

Share Data Across Jobs at Memory-Speed

• 6 node deployment

• 1TB of storage

• Memory only

USE CASE 3 - Transparently Manage Data Across Storage Systems

• Reasons• Most enterprises have multiple storage systems• New (better, faster, cheaper) storage systems arise

• Disadvantage• Managing data across systems can be difficult

Use Case Explained

Storage

Alluxio

Spark MapReduce Spark

Storage Storage

Flexible,

simple

no application changes,

new mount point

CASE STUDY

We’ve been running Alluxio in production for over 9 months, resulting in 15x speedup on average, and 300x speedup at peak service times.

- Xueyan Li, Qunar

RESULTS

• Alluxio’s unified namespace enables different applications and frameworks to easily interact with their data from different storage systems

• Improved the performance of their system with 15x – 300x speedups

• Tiered storage feature manages various storage resources including memory, SSD and disk

Transparently Manage Data Across Different Storage Systems

• 200+ nodes deployment

• 6 billion logs (4.5 TB) daily

• Mix of Memory + HDD

What’s Next?

• Contact: gene@alluxio.com or info@alluxio.com • Twitter: @Alluxio• Websites: www.alluxio.com and www.alluxio.org• Alluxio Github: www.github.com/Alluxio/alluxio

Thank you!

alluxio (formerly tachyon): open source memory speed virtual distributed storage

Software

alluxio (formerly tachyon) - snia · 2019-12-21 · alluxio...

tachyon therapy

accelerating machine learning pipelines with alluxio at...

tachyon-the general quiz - finals

exp tachyon

alluxio (formerly tachyon): the journey thus far and the...

alluxio (formerly tachyon): unify data at memory speed at...

alluxio keynote at strata+hadoop world beijing 2016

tachyon-the general quiz - prelims

getting started with alluxio + spark + s3

tachyon code 1 extr

sdi/istc seminar · alluxio, formerly tachyon, is an open...

基于alluxio提升spark和hadoop hdfs的系统性能与 …...

alluxio: the missing piece of on-demand clusters at alluxio...

© 2017 new infrared technologies, s.l. ... · drilling...

open superstring field theory applied to tachyon...

alluxio mesos meetup - smack to smaack

cosmology of the string tachyon - university of · pdf...

tachyon inﬂation in teleparallel gravity

tachyon web - christopher pike