tachyon: an open source memory-centric distributed storage system

Post on 11-Jan-2017

2.630 Views

Category:

Technology

2 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Haoyuan Li, Tachyon Nexushaoyuan@tachyonnexus.com

September 30, 2015 @ Strata and Hadoop World NYC 2015

An Open Source Memory-Centric Distributed Storage System

Outline

•  Open Source

•  Introduction to Tachyon

•  New Features

•  Getting Involved

2

Outline

•  Open Source

•  Introduction to Tachyon

•  New Features

•  Getting Involved

3

History •  Started at UC Berkeley AMPLab –  From summer 2012 –  Same lab produced Apache Spark and Apache Mesos

•  Open sourced –  April 2013 –  Apache License 2.0 –  Latest Release: Version 0.7.1 (August 2015)

•  Deployed at > 100 companies

4

Contributors Growth

5

v0.4!Feb ‘14

v0.3!Oct ‘13

v0.2 Apr ‘13

v0.1 Dec ‘12

v0.6!Mar ‘15

v0.5!Jul ‘14

v0.7!Jul ‘15

1 3 15

30

46

70

111

Contributors Growth

6

> 150 Contributors (3x increment over the last Strata NYC)

> 50 Organizations

Contributors Growth

7

One of the Fastest Growing Big Data Open Source Project

Thanks to Contributors and Users!

8

One Tachyon ProductionDeployment Example

•  Baidu (Dominant Search Engine in China, ~ 50 Billion USD Market Cap)

•  Framework: SparkSQL •  Under Storage: Baidu’s File System •  Storage Media: MEM + HDD •  100+ nodes deployment •  1PB+ managed space •  30x Performance Improvement

9

Outline

•  Open Source

•  Introduction to Tachyon

•  New Features

•  Getting Involved

10

Tachyon is an Open Source

Memory-centricDistributed

Storage System 11

12

Why Tachyon?

Performance Trend: Memory is Fast

•  RAM throughput increasing exponentially

•  Disk throughput increasing slowly

13

Memory-locality key to interactive response times

Price Trend: Memory is Cheaper

source:  jcmit.com  14

Realized by many…

15

16

Is the Problem Solved?

17

Missing a Solution for the Storage Layer

A Use Case Example with -

•  Fast, in-memory data processing framework – Keep one in-memory copy inside JVM – Track lineage of operations used to derive data – Upon failure, use lineage to recompute data

map

filter map

join reduce

Lineage Tracking

18

Issue 1

19

Data Sharing is the bottleneck in analytics pipeline:Slow writes to disk

Spark Job1

Spark mem block manager

block 1

block 3

Spark Job2

Spark mem block manager

block 3

block 1

HDFS / Amazon S3 block 1

block 3

block 2

block 4

storage engine & execution engine same process (slow writes)

Issue 1

20

Spark Job

Spark mem block manager

block 1

block 3

Hadoop MR Job

YARN

HDFS / Amazon S3 block 1

block 3

block 2

block 4

Data Sharing is the bottleneck in analytics pipeline:Slow writes to disk

storage engine & execution engine same process (slow writes)

Issue 1 resolved with Tachyon

21

Memory-speed data sharingamong jobs in different

frameworks execution engine & storage engine same process (fast writes)

Spark Job

Spark mem

Hadoop MR Job

YARN

HDFS / Amazon S3 block 1

block 3

block 2

block 4

HDFS  disk  

block  1  

block  3  

block  2  

block  4  Tachyon!in-memory

block 1

block 3 block 4

Issue 2

22

Spark Task

Spark memory block manager

block 1

block 3

HDFS / Amazon S3 block 1

block 3

block 2

block 4

execution engine & storage engine same process

Cache loss when process crashes

Issue 2

23

crash

Spark memory block manager

block 1

block 3

HDFS / Amazon S3 block 1

block 3

block 2

block 4

execution engine & storage engine same process

Cache loss when process crashes

HDFS / Amazon S3

Issue 2

24

block 1

block 3

block 2

block 4

execution engine & storage engine same process

crash

Cache loss when process crashes

HDFS / Amazon S3 block 1

block 3

block 2

block 4 Tachyon!in-memory

block 1

block 3 block 4

Issue 2 resolved with Tachyon

25

Spark Task

Spark memory block manager

execution engine & storage engine same process

Keep in-memory data safe,even when a job crashes.

Issue 2 resolved with Tachyon

26

HDFS  disk  

block  1  

block  3  

block  2  

block  4  

execution engine & storage engine same process

Tachyon!in-memory

block 1

block 3 block 4

crash

HDFS / Amazon S3 block 1

block 3

block 2

block 4

Keep in-memory data safe,even when a job crashes.

HDFS / Amazon S3

Issue 3

27

In-memory Data Duplication & Java Garbage Collection

Spark Job1

Spark mem block manager

block 1

block 3

Spark Job2

Spark mem block manager

block 3

block 1

block 1

block 3

block 2

block 4

execution engine & storage engine same process (duplication & GC)

Issue 3 resolved with Tachyon

28

No in-memory data duplication,much less GC

Spark Job1

Spark mem

Spark Job2

Spark mem

HDFS / Amazon S3 block 1

block 3

block 2

block 4

execution engine & storage engine same process (no duplication & GC)

HDFS  disk  

block  1  

block  3  

block  2  

block  4  Tachyon!in-memory

block 1

block 3 block 4

Previously Mentioned

•  A memory-centric storage architecture

•  Push lineage down to storage layer

29

Tachyon Memory-Centric Architecture

30

Tachyon Memory-Centric Architecture

31

Lineage in Tachyon

32

Outline

•  Open Source

•  Introduction to Tachyon

•  New Features

•  Getting Involved

33

1) Eco-system: Enable new workload in any storage;

Work with the framework of your choice;

34

2) Tachyon running in production environment,

both in the Cloud and on Premise.

35

Use Case: Baidu

•  Framework: SparkSQL •  Under Storage: Baidu’s File System •  Storage Media: MEM + HDD •  100+ nodes deployment •  1PB+ managed space •  30x Performance Improvement

36

Use Case: a SAAS Company

•  Framework: Impala

•  Under Storage: S3

•  Storage Media: MEM + SSD

•  15x Performance Improvement

37

Use Case: an Oil Company

•  Framework: Spark

•  Under Storage: GlusterFS

•  Storage Media: MEM only

•  Analyzing data in traditional storage

38

Use Case: a SAAS Company

•  Framework: Spark

•  Under Storage: S3

•  Storage Media: SSD only

•  Elastic Tachyon deployment

39

40

What if data size exceeds memory capacity?

41

3) Tiered Storage:Tachyon Manages More Than DRAM

MEM SSD

HDD

Faster

Higher Capacity

42

Configurable Storage Tiers

MEM only

MEM + HHD

SSD only

43

4) Pluggable Data Management Policy

Evict stale data to lower tier

Promote hot data to upper tier

44

Pin Data in Memory

5) Transparent Naming

45

6) Unified Namespace

46

More Features

•  7) Remote Write Support •  8) Easy deployment with Mesos and Yarn •  9) Initial Security Support •  10) One Command Cluster Deployment •  11) Metrics Reporting for Clients, Workers,

and Master

47

12) More Under Storage Supports

48

Reported Tachyon Usage

49

Outline

•  Open Source

•  Introduction to Tachyon

•  New Features

•  Getting Involved

50

Memory-Centric Distributed Storage

Welcome to try, contact, and collaborate!

51

JIRA New Contributor Tasks

•  Team consists of Tachyon creators, top contributors

•  Series A ($7.5 million) from Andreessen Horowitz

•  Committed to Tachyon Open Source

52

53

Strata NYC 2015

•  Welcome to visit us at our booth #P18.

•  Check out other Tachyon related talks. –  First-ever scalable, distributed deep learning architecture

using Spark and Tachyon •  Christopher Nguyen (Adatao, Inc.), Vu Pham (Adatao, Inc) •  2:05pm–2:45pm Thursday, 10/01/2015

–  Faster time to insight using Spark, Tachyon, and Zeppelin •  Nirmal Ranganathan (Rackspace Hosting) •  2:05pm–2:45pm Thursday, 10/01/2015

54

•  Try Tachyon: http://tachyon-project.org

•  Develop Tachyon: https://github.com/amplab/tachyon

•  Meet Friends: http://www.meetup.com/Tachyon

•  Get News: http://goo.gl/mwB2sX

•  Tachyon Nexus: http://www.tachyonnexus.com •  Contact us: haoyuan@tachyonnexus.com

55

top related