ruby and distributed storage systems

Ruby for Distributed Storage Systems RubyKaigi 2017: Sep 20, 2017 Satoshi Tagomori (@tagomoris) Treasure Data, Inc.

Upload: satoshi-tagomori

Post on 23-Jan-2018


TRANSCRIPT

Page 1: Ruby and Distributed Storage Systems

Ruby for Distributed Storage Systems

RubyKaigi 2017: Sep 20, 2017 Satoshi Tagomori (@tagomoris)

Treasure Data, Inc.

Page 2: Ruby and Distributed Storage Systems

Satoshi Tagomori (@tagomoris)

Fluentd, MessagePack-Ruby, Norikra, Woothee, ...

Treasure Data, Inc.

Page 3: Ruby and Distributed Storage Systems
Page 4: Ruby and Distributed Storage Systems

-45°

Page 5: Ruby and Distributed Storage Systems

Ruby for Distributed Storage Systems

Page 6: Ruby and Distributed Storage Systems

Ruby for and Distributed Storage Systems

Page 7: Ruby and Distributed Storage Systems

Ruby and Performance

• Web? or Not?

• Disk & Network I/O
• "I/O spends most of time on servers"... is it real?
• Storages are getting faster and faster (SSD, NVMe, ...)
• Networks too (10GbE, fast network in Cloud, ...)

Page 8: Ruby and Distributed Storage Systems

Storage Systems

• Disk I/O
• Network I/O
• Serialization / Deserialization (json, msgpack, ...)
• read/write data from/to disk
• parse/generate HTTP request/response
• Indexing (update, search)
• Timer
• Threads + Locks

Page 9: Ruby and Distributed Storage Systems

Distributed Storage Systems

• Data replication
• Checksum
• Asynchronous network I/O
• Quorum
• More Threads + Locks

Page 10: Ruby and Distributed Storage Systems

Replication w/ 3 replicas

• Create 3 replicas of data, including local storage

[Diagram: write path on the node that receives the request]
• accept request to write data
• write the data into local storage (1)
• send requests to replicate data
• receive responses to replicate data (3)
• send response to write data

Page 11: Ruby and Distributed Storage Systems

Replication in Quorum Systems: In Action

• Create at least 2 replicas of data (max 3), including local storage

[Diagram: write path on the node that receives the request]
• accept request to write data, and write it locally (1)
• create 2 threads to send requests to replicate data
• receive a successful response to replicate data (2)
• send response to write data
• discard the thread for the other node
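
The quorum write flow on this slide can be sketched in Ruby with plain threads and a Queue. This is only a minimal sketch under assumptions: the class name QuorumWriter and the callables standing in for local storage and remote nodes are hypothetical, not Bigdam-pool code, and "discarding" the slow thread is modeled here by killing it.

```ruby
# Hypothetical sketch: QuorumWriter and the node callables are illustrative
# names, not taken from Bigdam-pool.
class QuorumWriter
  # local:   callable that stores data locally, returns true on success
  # remotes: callables, each sending a replication request to one remote node
  def initialize(local:, remotes:, quorum: 2)
    @local = local
    @remotes = remotes
    @quorum = quorum
  end

  # Returns true once `quorum` replicas (including the local one) exist.
  def write(data)
    return false unless @local.call(data) # replica (1): local storage
    acks = 1
    return true if acks >= @quorum

    results = Queue.new
    threads = @remotes.map do |remote|
      Thread.new do
        ok = begin
          remote.call(data)
        rescue StandardError
          false # a failed node simply does not count toward the quorum
        end
        results << ok
      end
    end

    # Wait for responses one by one and stop as soon as the quorum is reached.
    @remotes.size.times do
      acks += 1 if results.pop
      break if acks >= @quorum
    end

    # Assumption: the slower thread is killed. A real system might instead
    # let it finish in the background.
    threads.each { |t| t.kill if t.alive? }
    acks >= @quorum
  end
end
```

With one fast and one slow remote, `write` returns as soon as the fast node answers, which is the latency benefit the slide illustrates.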

Page 12: Ruby and Distributed Storage Systems

Bigdam

Page 13: Ruby and Distributed Storage Systems

Bigdam

• Brand new data ingestion pipeline in Treasure Data
• Huge data
• Extraordinarily large number of connections / requests
• Many edge endpoints on the planet

Page 14: Ruby and Distributed Storage Systems

Bigdam: Edge locations on the earth + the Central location

Page 15: Ruby and Distributed Storage Systems

Bigdam components

[Diagram: components labeled with their developers: @narittan, @tagomoris, @nalsh, @k0kubun, @komamitsu_tw]

Page 16: Ruby and Distributed Storage Systems

Bigdam-pool

• OSS (in future... not yet)
• Distributed key-value storage
• for buffer pool in Bigdam
• to build S3-free data ingestion pipeline

Page 17: Ruby and Distributed Storage Systems

Bigdam-pool: Small Buffers

• Small buffers (MBs)
• Write: append support for many small chunks (KBs)
• Read: secondary index to query/read many buffers at once
• Short buffer lifetime: minutes (create - append - read - delete)
• Buffers store ids of chunks (for deduplication)

[Diagram: four buffers, each holding several chunks, keyed by account_id, database, table]
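
The buffer behavior described on this slide (append small chunks, skip chunks whose id was already stored, read many buffers through a secondary index) can be modeled in memory as below. This is a rough sketch under assumptions: BufferPool and its method names are hypothetical, and the real Bigdam-pool stores buffers on local disk rather than in a Hash.

```ruby
require "set"

# Hypothetical in-memory model; class and method names are illustrative.
class BufferPool
  Buffer = Struct.new(:key, :data, :chunk_ids)

  def initialize
    @buffers = {} # buffer name -> Buffer
    @index = {}   # [account_id, database, table] -> Set of buffer names
  end

  # Appends a chunk; returns false if this chunk id was already stored.
  def append(name, key, chunk_id, bytes)
    buffer = (@buffers[name] ||= Buffer.new(key, String.new, Set.new))
    # Set#add? returns nil when the id is already present (deduplication).
    return false unless buffer.chunk_ids.add?(chunk_id)

    buffer.data << bytes
    (@index[key] ||= Set.new) << name
    true
  end

  # Secondary index lookup: contents of all buffers stored under the key.
  def read_by_key(key)
    names = @index[key]
    return [] unless names

    names.map { |name| @buffers[name].data }
  end

  # Removes a buffer and its index entry once it has been consumed.
  def delete(name)
    buffer = @buffers.delete(name)
    return unless buffer

    names = @index[buffer.key]
    names.delete(name)
    @index.delete(buffer.key) if names.empty?
  end
end
```

Keeping chunk ids inside the buffer means a retried append of the same chunk is a no-op, which is what makes client-side retries safe.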

Page 18: Ruby and Distributed Storage Systems

Bigdam-pool: Replication

• Replication in a cluster
• without maintaining a replication factor

• Clients send requests to all living nodes

Page 19: Ruby and Distributed Storage Systems

Bigdam-pool: Buffer Transferring over Clusters

[Diagram: Edge location → Central location]
• Buffer committed (size or timeout)
• Transferred over the Internet, using HTTPS or HTTP/2
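
The "size or timeout" commit rule can be expressed as a small predicate. A minimal sketch, assuming hypothetical names and thresholds (the slides do not give the actual values):

```ruby
# Hypothetical policy object; thresholds are supplied by the caller.
class CommitPolicy
  def initialize(max_bytes:, max_age_seconds:)
    @max_bytes = max_bytes
    @max_age_seconds = max_age_seconds
  end

  # A buffer is committed when it is large enough or old enough.
  def commit?(size_bytes:, created_at:, now: Time.now)
    size_bytes >= @max_bytes || (now - created_at) >= @max_age_seconds
  end
end
```

Passing `now` explicitly keeps the rule easy to test without sleeping.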

Page 20: Ruby and Distributed Storage Systems

written in Java

Page 21: Ruby and Distributed Storage Systems

Designing Bigdam

• Architecture Design - split a system into 5 microservices
• consistency, availability
• performance (how to scale it out?)
• deployment, cost
• API Design
• Mocking
• Interface Test
• Integration Test

Page 22: Ruby and Distributed Storage Systems

Mocking Bigdam using Ruby

• Mocking
• build mock servers of all components
• implement all public APIs between components
• Find/add missing parameters required
• Prepare to develop components in parallel
• Mocked using Ruby, Sinatra
• public APIs - it's just a Webapp
• fast and easy to do :D
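
A mock of one component's public API might look like the sketch below. The slide names Sinatra; to stay runnable without gems, this sketch swaps in a bare Rack-style callable (an object whose `call(env)` returns status, headers and body). The path `/chunks` and the JSON fields are invented for illustration and are not Bigdam's real API.

```ruby
require "json"
require "stringio"

# Hypothetical mock server; endpoint and field names are assumptions.
class MockPoolApp
  REQUIRED = %w[buffer_key chunk_id].freeze

  attr_reader :received

  def initialize
    @received = [] # payloads accepted so far, for later inspection
  end

  # Rack-style entry point: returns [status, headers, body].
  def call(env)
    unless env["REQUEST_METHOD"] == "POST" && env["PATH_INFO"] == "/chunks"
      return respond(404, { "error" => "not found" })
    end

    payload = JSON.parse(env["rack.input"].read)
    # Reporting missing parameters early is how gaps in an API design show up.
    missing = REQUIRED - payload.keys
    return respond(400, { "missing" => missing }) unless missing.empty?

    @received << payload
    respond(200, { "ok" => true })
  end

  private

  def respond(status, body)
    [status, { "content-type" => "application/json" }, [JSON.generate(body)]]
  end
end
```

Because the mock only checks the shape of requests, each team can develop against it before the real component exists.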

Page 23: Ruby and Distributed Storage Systems

Interface/Integration Tests of Bigdam using Ruby

• Interface tests:
• verify all public APIs are implemented correctly
• Integration tests:
• verify the whole pipeline can import data correctly
• Written in Ruby, test-unit
• less code to serialize/deserialize various req/res
• readable test cases
• fast and easy to do :D

Page 24: Ruby and Distributed Storage Systems

And,

Page 25: Ruby and Distributed Storage Systems

Bigdam-pool-ruby

• Port bigdam-pool from Java to Ruby
• Experiment to find out whether Ruby is good enough

Page 26: Ruby and Distributed Storage Systems
Page 27: Ruby and Distributed Storage Systems

Bigdam-pool-ruby

• Perfectly compatible with Java implementation
• Public API, Private API
• Data formats on local storage, of secondary index
• Under development
• only supports standalone mode, for now

Page 28: Ruby and Distributed Storage Systems

Studies: Serialization / Deserialization

• Every network API call requires it
• parsing HTTP request
• parsing request content body (json/msgpack)
• building response content body (json/msgpack)
• building HTTP response
• Should be parallelized on CPU cores

Page 29: Ruby and Distributed Storage Systems

Studies: Asynchronous Network I/O

• EventMachine? Cool.io? Celluloid::IO?
• 🤔
• I want to use only async network I/O at a time! (not disk, not timer)
• Event driven I/O library?
• Thread pools + callback?
• or any idea?

Page 30: Ruby and Distributed Storage Systems

Threading / Timers

• ExecutorService in Java is very useful...
• Fixed / non-fixed thread pools with Queue
• (and some other executor models)
• Runner of Runnable tasks
• "Runnable task" is just like a lambda w/o args
• To be implemented as Gem?
• Queue and SizedQueue look useful for it
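
A fixed thread pool in the spirit of Java's ExecutorService can be built on Queue or SizedQueue, as the slide suggests. The following is a minimal sketch only; FixedThreadPool and its methods are hypothetical names, not an existing gem's API.

```ruby
# Hypothetical fixed-size pool; a task is any callable without arguments,
# which plays the role of Java's Runnable.
class FixedThreadPool
  # queue_size: when given, a SizedQueue makes submit block while the
  # queue is full (simple backpressure).
  def initialize(size, queue_size: nil)
    @queue = queue_size ? SizedQueue.new(queue_size) : Queue.new
    @workers = Array.new(size) do
      Thread.new do
        # Queue#pop returns nil once the queue is closed and drained.
        while (task = @queue.pop)
          begin
            task.call
          rescue StandardError => e
            warn "task failed: #{e.class}: #{e.message}"
          end
        end
      end
    end
  end

  # Enqueues a lambda/proc or a block for execution by a worker thread.
  def submit(task = nil, &block)
    @queue << (task || block)
    self
  end

  # Stops accepting tasks, runs the queued ones, then joins the workers.
  def shutdown
    @queue.close
    @workers.each(&:join)
  end
end

pool = FixedThreadPool.new(4)
done = Queue.new
8.times { |i| pool.submit { done << i } }
pool.shutdown
puts done.size
```

Closing the queue is what lets the workers exit cleanly: no sentinel objects are needed to signal shutdown.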

Page 31: Ruby and Distributed Storage Systems

Queue#peek

Get the head object w/o removing it from queue

https://github.com/ruby/ruby/pull/1698
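
Whether a built-in `Queue#peek` is available depends on the Ruby version, so the sketch below does not rely on it. It shows the behavior the slide describes using a hypothetical wrapper class, PeekableQueue, built on Mutex and ConditionVariable.

```ruby
# Hypothetical class; Ruby's built-in Queue is not modified here.
class PeekableQueue
  def initialize
    @items = []
    @lock = Mutex.new
    @not_empty = ConditionVariable.new
  end

  def push(item)
    @lock.synchronize do
      @items << item
      @not_empty.signal
    end
    self
  end
  alias << push

  # Returns the head without removing it, or nil when the queue is empty.
  def peek
    @lock.synchronize { @items.first }
  end

  # Blocks until an item is available, then removes and returns it.
  def pop
    @lock.synchronize do
      @not_empty.wait(@lock) while @items.empty?
      @items.shift
    end
  end

  def size
    @lock.synchronize { @items.size }
  end
end
```

One plausible use is a timer thread that peeks at the head task to see whether it is due, and pops it only when it is.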


Page 33: Ruby and Distributed Storage Systems

MonitorMixin#mon_locked? and #mon_owned?

• Mutex#owned? exists

https://github.com/ruby/ruby/pull/1699
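
A typical use for these predicates is asserting that a private helper only runs while the lock is held. A small sketch, assuming a Ruby version that ships `MonitorMixin#mon_owned?`; the Counter class is purely illustrative.

```ruby
require "monitor"

# Illustrative class, not from the talk.
class Counter
  include MonitorMixin

  attr_reader :value

  def initialize
    super() # MonitorMixin needs its own initialization
    @value = 0
  end

  def increment
    synchronize { unsafe_add(1) }
  end

  private

  # Must only run while the calling thread holds the monitor.
  def unsafe_add(n)
    raise "lock not held" unless mon_owned?
    @value += n
  end
end

# The Mutex counterpart mentioned on the slide:
lock = Mutex.new
lock.synchronize { lock.owned? } # true inside the block
```

Checks like this turn a silent locking mistake into an immediate error during development.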

Page 34: Ruby and Distributed Storage Systems

Resource Control

Make sure to release resources: try-with-resources in Java
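
In Ruby the closest idiom is a block combined with `ensure`, which is also what the block form of `File.open` does. The helper below is a hypothetical sketch of that pattern, not code from the talk.

```ruby
require "stringio"

# Hypothetical helper: yields the resource and always closes it afterwards,
# similar in intent to Java's try-with-resources.
def with_resource(resource)
  yield resource
ensure
  resource.close if resource.respond_to?(:close)
end

io = StringIO.new("payload")
content = with_resource(io) { |r| r.read }
puts content    # prints "payload"
puts io.closed? # prints "true"
```

The `ensure` clause runs whether the block returns normally or raises, so the resource is released on both paths.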


Page 36: Ruby and Distributed Storage Systems

Typing?

• Defining APIs
• Rubyists (including me) MAY be using: [string, integer, boolean, string, ...]
• Rubyists (including me) MAY be using: {"time": unix_time (but sometimes float)}
• Explicit definitions do no harm when designing APIs
• JSON Schema or something similar may help us...
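
One lightweight way to make such a definition explicit in plain Ruby is a Struct that validates its fields on construction. This is a sketch with invented field names, shown as one possible alternative to a schema language, not the approach used in Bigdam.

```ruby
# Hypothetical request definition; field names are illustrative only.
ChunkRequest = Struct.new(:account_id, :table, :time, keyword_init: true) do
  # Builds a request from a parsed JSON hash and validates it.
  def self.from_hash(hash)
    new(account_id: hash["account_id"], table: hash["table"], time: hash["time"]).validate!
  end

  def validate!
    raise TypeError, "account_id must be an Integer" unless account_id.is_a?(Integer)
    raise TypeError, "table must be a String" unless table.is_a?(String)
    # Rejecting Float here stops "unix time, but sometimes float" from slipping through.
    raise TypeError, "time must be an Integer unix time" unless time.is_a?(Integer)
    self
  end
end
```

Named fields replace the positional array from the slide, and the type checks document what each field is allowed to be.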

Page 37: Ruby and Distributed Storage Systems

Typing: in logging and others

https://bugs.ruby-lang.org/issues/13913

Page 38: Ruby and Distributed Storage Systems

Process Built-in Application Servers

• Distributed Storage Systems:
• Background worker threads
• Timers
• Communication workers to other nodes
• Various async operation workers
• Public API request handlers
• Private API request handlers (inter-nodes)
• Startup/Shutdown hooks
• It's NOT just a web application, but it handles HTTP requests

Page 39: Ruby and Distributed Storage Systems

https://github.com/tagomoris/bigdam-pool-ruby

Page 40: Ruby and Distributed Storage Systems

https://github.com/tagomoris/bigdam-pool-ruby

NOT YET

Page 41: Ruby and Distributed Storage Systems

"Why Do You Want to Write Such Code in Ruby?"

Page 42: Ruby and Distributed Storage Systems

"Why Do You Want to Write Such Code in Ruby?"

"Because I WANT TO DO IT!"

Page 43: Ruby and Distributed Storage Systems

"Why Do You Want to Write Such Code in Ruby?"

"Because I WANT TO DO IT!"

"... And we already have Fluentd :P"

Thank you.

@tagomoris