
HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases

Salman Niazi1, Mahmoud Ismail1, Steffen Grohsschmiedt3, Mikael Ronström4

Seif Haridi1,2, Jim Dowling1,2

1 KTH - Royal Institute of Technology 2 RISE SICS - Swedish Institute of Computer Science

3 Spotify 4 Oracle

www.hops.io

Introduction

The Hadoop Distributed File System (HDFS) is the most popular open-source platform for storing large volumes of data. However, HDFS' design introduces two scalability bottlenecks. First, the Namenode architecture places practical limits on the size of the namespace (files/directories). HDFS' second main bottleneck is a single global lock on the namespace that ensures the consistency of the file system by limiting concurrent access to the namespace to a single writer or multiple readers.

HopsFS

HopsFS is an open-source, next-generation distribution of HDFS and a drop-in replacement for it. It replaces the main scalability bottleneck in HDFS, the single-node in-memory metadata service, with a no-shared-state distributed metadata service built on a NewSQL database. By removing the metadata bottleneck in Apache HDFS, HopsFS enables the following:

• Significantly larger cluster sizes, storing 37 times more metadata.
• More than an order of magnitude higher throughput (16x – 37x, where 37x is the throughput for higher write rates).
• Significantly lower client latencies for large clusters.
• Multiple stateless Namenodes.
• Instant failover between the Namenodes.
• Tinker-friendly metadata.

[Figure: Metadata partitioning in NDB for the path /user/foo.txt. The root (/) and the immediate children of each directory (/user, /user/foo.txt, /user/bar.txt, /user/foo.tar) are distributed over NDB datanodes NDB-DN1 to NDB-DN4 by parent inode ID, while the file inode related metadata tables (Quota, PRB, Inv, RUC, CR, ERB, LU, Block, Lease, Replica, URB) for a given file are co-located on a single NDB datanode.]

HopsFS and HDFS Throughput for Spotify Workload

[Figure: Throughput (ops/sec, 200K to 1.6M) vs. number of Namenodes (1 to 60) for HopsFS using 2, 4, 8, and 12 node NDB clusters, HopsFS using a 12 node NDB cluster with hotspots, and the HDFS Spotify workload. Inset: 25K to 100K ops/sec range for 1 to 5 Namenodes.]

HopsFS Architecture

HopsFS provides multiple stateless Namenodes. The Namenodes can serve requests from both HopsFS and HDFS clients; however, only HopsFS clients provide load balancing between the Namenodes using random, round-robin, and sticky policies.
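As a rough illustration of these client-side policies, here is a minimal sketch of how a client might choose a Namenode; the class and method names are hypothetical, not the actual HopsFS client API.

import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of client-side Namenode selection; not the actual HopsFS client API.
class NamenodeSelector {
    enum Policy { RANDOM, ROUND_ROBIN, STICKY }

    private final List<String> namenodes;   // e.g. ["nn1:8020", "nn2:8020", ...]
    private final Policy policy;
    private final Random random = new Random();
    private final AtomicInteger nextIndex = new AtomicInteger(0);
    private volatile String stickyChoice;   // fixed Namenode for the STICKY policy

    NamenodeSelector(List<String> namenodes, Policy policy) {
        this.namenodes = namenodes;
        this.policy = policy;
        this.stickyChoice = namenodes.get(random.nextInt(namenodes.size()));
    }

    String pickNamenode() {
        switch (policy) {
            case RANDOM:      return namenodes.get(random.nextInt(namenodes.size()));
            case ROUND_ROBIN: return namenodes.get(Math.floorMod(nextIndex.getAndIncrement(), namenodes.size()));
            default:          return stickyChoice;  // STICKY: reuse the same Namenode until it fails
        }
    }

    // Because all Namenodes are stateless and share the database-backed metadata,
    // a client recovers from a failure by simply picking another Namenode.
    void onFailure(String failed) {
        if (policy == Policy.STICKY && failed.equals(stickyChoice)) {
            stickyChoice = namenodes.get(random.nextInt(namenodes.size()));
        }
    }

    public static void main(String[] args) {
        NamenodeSelector selector = new NamenodeSelector(
                List.of("nn1:8020", "nn2:8020", "nn3:8020"), Policy.ROUND_ROBIN);
        for (int i = 0; i < 5; i++) System.out.println(selector.pickNamenode());
    }
}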

[Figure: HopsFS architecture. HopsFS/HDFS clients send metadata operations to multiple stateless Namenodes (NN 1 to NN N), which store the file system metadata in MySQL Cluster (NDB) and manage the Datanodes (DN 1 to DN N).]

Metadata Partitioning

All the inodes in a directory are partitioned using the parent inode ID; therefore, all the immediate children of the /user directory are stored on NDB-DN-3 for efficient directory listing, for example, ls /user. The file inode related metadata for /user/foo.txt is stored on NDB-DN-4 for efficient file read operations, for example, cat /user/foo.txt.
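A minimal sketch of this partitioning rule, assuming a simple hash over a fixed number of NDB datanodes; the helper names and inode IDs below are illustrative, not the actual NDB schema.

// Hypothetical sketch of HopsFS-style partition-key selection over NDB datanodes.
// Directory listings co-locate sibling inodes (partition by parent inode ID);
// file-related tables co-locate a file's blocks/replicas/leases (partition by the file's inode ID).
class MetadataPartitioning {
    static final int NUM_DATANODES = 4;  // e.g. NDB-DN-1 .. NDB-DN-4

    // Inode rows are distributed by the parent's inode ID, so `ls /user`
    // touches a single NDB datanode.
    static int datanodeForInodeRow(long parentInodeId) {
        return Math.floorMod(Long.hashCode(parentInodeId), NUM_DATANODES);
    }

    // Block, Replica, Lease, ... rows are distributed by the file's own inode ID,
    // so `cat /user/foo.txt` reads all file-related metadata from one NDB datanode.
    static int datanodeForFileMetadata(long fileInodeId) {
        return Math.floorMod(Long.hashCode(fileInodeId), NUM_DATANODES);
    }

    public static void main(String[] args) {
        long userDirId = 2;   // inode ID of /user (hypothetical)
        long fooTxtId  = 5;   // inode ID of /user/foo.txt (hypothetical)
        System.out.println("ls /user hits NDB-DN-" + (datanodeForInodeRow(userDirId) + 1));
        System.out.println("cat /user/foo.txt hits NDB-DN-" + (datanodeForFileMetadata(fooTxtId) + 1));
    }
}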

Memory    HDFS (files stored)    HopsFS (files stored)
1 GB      2.3 million            0.69 million
50 GB     115 million            34.5 million
100 GB    230 million            69 million
200 GB    460 million            138 million
500 GB    Does Not Scale         346 million
1 TB      Does Not Scale         708 million
25 TB     Does Not Scale         17 billion

HopsFS and HDFS Metadata Scalability

[Figure: End-to-end latency, Time (ms, 0 to 210) vs. number of clients (1000 to 6000) for HopsFS and HDFS. Inset: 0 to 10 ms range for 100 to 500 clients.]

HopsFS and HDFS End-to-end Latency

[Figure: Throughput (ops/sec, 50K to 600K) vs. time (sec, 20 to 230) for HopsFS and HDFS during Namenode failures.]

HopsFS and HDFS Namenode Failover

Vertical lines represent Namenode failures.

Advancing Data Stream Analytics with Apache Flink®

Paris Carbone, Gyula Fóra, Seif Haridi, Vasiliki Kalavri, Marius Melzer, Theodore Vasiloudis
<parisc@kth.se> <Gyula.Fora@king.com> <haridi@kth.se> <kalavri@kth.se> <marius.melzer@ri.se> <tvas@kth.se>

A NEW STATE OF THE ART IN DATA STREAMING

• Lightweight, consistent end-to-end processing
• Dynamic reconfiguration and application management

CONSISTENT CONTINUOUS PROCESSING WITH PIPELINED SNAPSHOTS

[Figure: Pipelined snapshotting in a stream processor. Input streams are divided into epochs (epoch n, n-1, n-2, n-3); the corresponding snapshots (snap-1, snap-2, snap-3) move from in-progress to pending to committed, and a committed snapshot can be used to roll back or to update the application.]

1) Divide the computation into epochs.
2) Capture operator states after each epoch, without stopping the processing.

State is pre-partitioned in hash(K) space into key-groups; each key's local state (e.g. alice, bob) lives in exactly one key-group.

Reconfiguration scenarios: scale in/out, failure recovery, pushing bug fixes, application A/B testing, platform migration.
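As a rough illustration of the key-group scheme, the sketch below mirrors the kind of hash(K)-space assignment Flink uses, but the helper names are illustrative rather than the actual Flink API.

// Illustrative sketch of pre-partitioning state in hash(K) space into key-groups.
class KeyGroups {
    // The hash(K) space is split into a fixed number of key-groups; this number
    // bounds the maximum parallelism and never changes after the job starts.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // On reconfiguration (scale in/out, failure recovery, ...), whole key-groups are
    // reassigned to operator instances; individual keys never need to be re-hashed.
    static int operatorIndexFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int kg = keyGroupFor("alice", maxParallelism);
        System.out.println("alice -> key-group " + kg
                + " -> operator " + operatorIndexFor(kg, maxParallelism, 4)   // 4 parallel instances
                + " (after scale-out to 8: operator " + operatorIndexFor(kg, maxParallelism, 8) + ")");
    }
}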

Apache Flink™: Stream and Batch Processing in a Single Engine. P. Carbone, S. Ewen, S. Haridi, A. Katsifodimos, V. Markl, K. Tzoumas. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.

State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing. P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, K. Tzoumas. Proceedings of the VLDB Endowment 10 (12), 1718-1729.

[Figure: The Apache Flink stack: Setup (cluster, backend, metrics), Runner (dataflow runtime: consistent state, event-time progress), Core API (DataStream, DataSet: fluid API, partitioned streams), and libraries (SQL, Table, CEP, Graphs, ML).]

FAST SLIDING WINDOW AGGREGATION

• Sliding window aggregation can be very expensive
• Existing optimisations apply to limited window types
• ‘Cutty’ redefines stream windows for optimal processing
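To make the cost argument concrete, here is a minimal sketch of the general idea behind shared (sliced) window aggregation, not the Cutty algorithm itself: raw records are pre-aggregated once into non-overlapping slices, and each window (or each concurrent query over the same stream) is answered by combining slice partials.

import java.util.ArrayList;
import java.util.List;

// Sketch of aggregate sharing for sliding windows: pre-aggregate the stream into
// non-overlapping slices, then assemble each window from slice partials instead of
// re-reducing raw records per window.
class SlicedSlidingSum {
    public static void main(String[] args) {
        long[] values = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8};
        int slice = 2;           // records per slice (e.g. the common slide of all queries)
        int windowSlices = 3;    // window length = 3 slices = 6 records

        // 1) One pass over the raw records produces the slice partials.
        List<Long> partials = new ArrayList<>();
        for (int i = 0; i < values.length; i += slice) {
            long sum = 0;
            for (int j = i; j < Math.min(i + slice, values.length); j++) sum += values[j];
            partials.add(sum);
        }

        // 2) Every window is answered by combining a handful of partials,
        //    so the raw records are reduced only once.
        for (int w = 0; w + windowSlices <= partials.size(); w++) {
            long windowSum = 0;
            for (int s = w; s < w + windowSlices; s++) windowSum += partials.get(s);
            System.out.println("window starting at record " + (w * slice) + ": sum = " + windowSum);
        }
    }
}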

[Figure: (left) Throughput (records/sec, 0k to 4500k) vs. number of queries (20 to 100) for Cutty and Pairs+RA. (right) Total reduce calls (10^4 to 10^11, log scale) vs. number of queries (1 to 100) for Cutty (eager), Pairs+Cutty (lazy), Pairs, RA, and Naive.]

Cutty: Aggregate Sharing for User-Defined Windows. P. Carbone, J. Traub, A. Katsifodimos, S. Haridi, V. Markl. ACM CIKM - 25th International Conference on Information and Knowledge Management.

SYSTEM SUPPORT FOR GRAPH STREAM MINING

• People process graphs inefficiently:
  1. Load: read the graph from disk and partition it in memory
  2. Compute: read and mutate the graph state
  3. Store: write the final graph state back to disk
• This is slow, expensive and redundant
• We propose a new way to process graphs continuously

[Figure: Edge additions are aggregated into a graph summary; algorithms run on the summary to produce results (R1, R2, ~R).]

1) Single-pass summaries:

edgeStream.aggregate(new Summary(window, fold, combine, lower))

2) Neighbour aggregation and iterations on stream windows (src → win → loop → sink):

graphstream.window(…)
    .iterateSyncFor(10, InputFunction(), StepFunction(), OutputFunction())

graphstream.window(…)
    .applyOnNeighbors(FindPairs())
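As an illustration of the single-pass summary idea, here is a small, self-contained sketch (plain Java, not the gelly-streaming API) that maintains a connected-components summary with union-find over a stream of edge additions, without ever storing the full graph.

import java.util.HashMap;
import java.util.Map;

// Sketch of a single-pass graph summary: a union-find structure folded over a
// stream of edge additions answers connectivity queries from the summary alone.
class ConnectedComponentsSummary {
    private final Map<Long, Long> parent = new HashMap<>();

    private long find(long v) {
        parent.putIfAbsent(v, v);
        long root = v;
        while (parent.get(root) != root) root = parent.get(root);
        parent.put(v, root);                 // one-hop path compression
        return root;
    }

    // fold/combine step: merge the components of the two endpoints of each new edge
    void addEdge(long src, long dst) {
        parent.put(find(src), find(dst));
    }

    long componentOf(long v) { return find(v); }

    public static void main(String[] args) {
        ConnectedComponentsSummary summary = new ConnectedComponentsSummary();
        long[][] edgeStream = {{1, 2}, {2, 3}, {4, 5}};
        for (long[] e : edgeStream) summary.addEdge(e[0], e[1]);
        System.out.println(summary.componentOf(1) == summary.componentOf(3));  // true
        System.out.println(summary.componentOf(1) == summary.componentOf(5));  // false
    }
}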

github.com/vasia/gelly-streaming
