Treasure Data and OSS

Masahiro Nakagawa, Feb 7, 2015

dots. Summit 2015

Who are you?

> Masahiro Nakagawa
  > github/twitter: @repeatedly

> Treasure Data, Inc.
  > Senior Software Engineer
  > Fluentd / td-agent developer

> I love OSS :)
  > D language - Phobos committer
  > Fluentd - Main maintainer
  > MessagePack / RPC - D and Python (only RPC)
  > The organizer of several meetups (Presto, DTM, etc.)
  > etc…

Company background

•  Founded 2011 in Mountain View, CA
   –  The first cloud service for the entire data pipeline
   –  Including: Acquisition, Storage, & Analysis

•  Provide a “Cloud Data Service”
   –  Fast Time to Value
   –  Cloud Flexibility and Economics
   –  Simple and Well Supported

•  Treasure Data has 100+ customers in production
   –  Incl. Fortune 500 companies
   –  400k new records / second
   –  Almost 9 trillion records loaded
   –  Variety of use cases and verticals

The Treasure Data Team

Hiro Yoshikawa – CEO, open source business veteran

Kaz Ohta – CTO, founder of the world’s largest Hadoop user group

Sada Furuhashi – Software Architect, MessagePack / Fluentd author

Notable Investors

Othman Laraki, ex-VP of Growth at Twitter

Jerry Yang, founder of Yahoo!

Yukihiro “Matz” Matsumoto, creator of the “Ruby” programming language

James Lindenbaum, founder of Heroku

Sierra Ventures (Tim Guleri), leading venture capital firm in Big Data

TD Service Architecture

[Diagram: data flows through three stages, Acquire, Store, and Analyze, and query results are pushed to external systems]

> Data sources: Web Log, App Log, Sensor, CRM, ERP, RDBMS, POS
> Acquire
  > Streaming Collector: Treasure Agent (Server), SDK (JS, Android, iOS, Unity)
  > Bulk Uploader: Embulk, TD Toolbelt
> Store
  > Plazma DB: Flexible, Scalable, Columnar Storage
  > @AWS or @IDCF
> Analyze
  > SQL-based query
  > Batch / Reliability
  > Ad-hoc / Low latency
  > REST API, ODBC / JDBC (SQL, Pig)
> Result Push (send query result)
  > KPI Dashboard
  > BI Tools: Tableau, Motion Board, etc.
  > Metric Insights
  > Other Products: RDBMS, Google Docs, AWS S3, FTP Server, etc.
> Benefits: Time to Value, Connectivity, Economy & Flexibility, Simple & Supported

Data Acquisition

Log collecting in TD

> Treasure Agent
  > Fluentd based log collector
> Embulk
> JavaScript SDK
> Mobile SDK (iOS, Android, Unity)

[Fluentd overview figure: Structured logging, Reliable forwarding, Pluggable architecture]

http://fluentd.org/

Fluentd

> Data collector for unified logging layer
  > Streaming data transfer based on JSON
  > Written in Ruby

> Gem based various plugins
  > http://www.fluentd.org/plugins

> Working in production
  > http://www.fluentd.org/testimonials
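To make "structured logging" and "streaming data transfer based on JSON" concrete, here is a minimal sketch of a Fluentd-style event, which consists of a tag, a time, and a record. The helper name `make_event` and the sample access-log fields are assumptions for illustration, not part of Fluentd itself.

```python
import json
import time


def make_event(tag, record, timestamp=None):
    """Build a Fluentd-style event: a tag, a Unix time, and a JSON-serializable record."""
    if timestamp is None:
        timestamp = int(time.time())
    return [tag, timestamp, record]


# Hypothetical access-log entry expressed as structured data instead of a raw text line.
# A fixed timestamp is used so the output is reproducible.
event = make_event(
    "apache.access",
    {"method": "GET", "path": "/index.html", "code": 200},
    timestamp=1423267200,
)
line = json.dumps(event)
print(line)
```

Because the record is a key-value structure, downstream systems can filter or aggregate on fields such as `code` without parsing free text.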


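Data Analytics Flow

[Diagram: Data source → Collect → Store → Process → Visualize → Reporting / Monitoring]

Data Analytics Flow

[Diagram: Store and Process are covered by Cloudera, Hortonworks, and Treasure Data; Visualize is covered by Tableau, Excel, and R. These stages have become easier and take a shorter time. Collect is marked "???"]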

Divide & Conquer & Retry

[Diagram: a stream is divided into batches; a batch that hits an error is retried on its own, repeatedly if needed, independently of other streams]
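The divide, conquer, and retry idea can be sketched as a small buffered output. This is an illustration only: the class name `BufferedOutput`, its parameters, and the flaky destination are hypothetical and do not reflect Fluentd's actual buffer API.

```python
import time


class BufferedOutput:
    """Collects records into fixed-size chunks and retries each failed chunk independently."""

    def __init__(self, send, chunk_size=3, max_retries=4, base_wait=0.01):
        self.send = send              # callable that delivers one chunk; raises IOError on failure
        self.chunk_size = chunk_size
        self.max_retries = max_retries
        self.base_wait = base_wait
        self.pending = []

    def emit(self, record):
        self.pending.append(record)

    def flush(self):
        """Divide pending records into chunks; return the chunks that could not be delivered."""
        chunks = [self.pending[i:i + self.chunk_size]
                  for i in range(0, len(self.pending), self.chunk_size)]
        self.pending = []
        return [chunk for chunk in chunks if not self._send_with_retry(chunk)]

    def _send_with_retry(self, chunk):
        for attempt in range(self.max_retries + 1):
            try:
                self.send(chunk)
                return True
            except IOError:
                if attempt < self.max_retries:
                    time.sleep(self.base_wait * (2 ** attempt))  # exponential backoff
        return False


# Hypothetical flaky destination: fails on its first two calls, then succeeds.
delivered = []
calls = {"n": 0}


def flaky_send(chunk):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise IOError("temporary network error")
    delivered.append(list(chunk))


out = BufferedOutput(flaky_send)
for i in range(7):
    out.emit({"id": i})
failed = out.flush()
```

Only the chunk that hit the error is resent, so a temporary failure does not force the whole stream to be replayed.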

Core

> Divide & Conquer
> Buffering & Retrying
> Error handling
> Message routing
> Parallelism

Plugins

> read / receive data
  > from API, database, command, etc…
> write / send data
  > to API, database, alert, graph, etc…

Architecture (v0.12 or later)

> Engine (not pluggable)
> Input
  > Forward
  > File tail
  > ...
> Parser
> Filter
  > grep
  > record_transformer
  > …
> Output
  > Forward
  > File
  > ...
> Buffer
  > File
  > Memory
> Formatter
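The split between a fixed engine and pluggable inputs, filters, and outputs can be sketched as follows. This is a toy model under stated assumptions: the `Engine` class and its methods are hypothetical, and the `fnmatch`-style wildcard is a stand-in, not Fluentd's actual match syntax.

```python
import fnmatch


class Engine:
    """Routes events through filters to outputs by matching tags against patterns."""

    def __init__(self):
        self.filters = []   # (pattern, function) pairs, applied in registration order
        self.outputs = []   # (pattern, function) pairs, first match wins

    def add_filter(self, pattern, func):
        self.filters.append((pattern, func))

    def add_output(self, pattern, func):
        self.outputs.append((pattern, func))

    def emit(self, tag, record):
        for pattern, func in self.filters:
            if fnmatch.fnmatchcase(tag, pattern):
                record = func(record)
                if record is None:      # a filter may drop the event
                    return
        for pattern, func in self.outputs:
            if fnmatch.fnmatchcase(tag, pattern):
                func(tag, record)
                return


engine = Engine()
# grep-like filter: keep only error records.
engine.add_filter("app.*", lambda r: r if r.get("level") == "error" else None)
# record_transformer-like filter: add a field to every surviving record.
engine.add_filter("app.*", lambda r: {**r, "host": "web01"})

stored = []
engine.add_output("app.*", lambda tag, r: stored.append((tag, r)))

engine.emit("app.web", {"level": "info", "msg": "ok"})        # dropped by the first filter
engine.emit("app.web", {"level": "error", "msg": "boom"})     # filtered, enriched, stored
engine.emit("sys.cron", {"level": "error", "msg": "other"})   # no matching output
```

The engine only knows about tags and callables, so new sources and destinations can be added without changing the routing code.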

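Before (M x N)

[Diagram: each of M data sources is connected directly to each of N destinations]

After (M + N)

[Diagram: sources and destinations each connect once to Fluentd, or Embulk, in the middle]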
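The M x N versus M + N comparison is simple arithmetic, shown here with hypothetical counts. The function names and the numbers 6 and 5 are assumptions chosen for illustration.

```python
def point_to_point(sources, destinations):
    """Every source needs its own integration with every destination."""
    return sources * destinations


def via_collector(sources, destinations):
    """Each source and each destination needs only one plugin for the shared collector."""
    return sources + destinations


# Hypothetical deployment: 6 kinds of data sources, 5 kinds of destinations.
m, n = 6, 5
before = point_to_point(m, n)
after = via_collector(m, n)
print(before, after)
```

With these counts the number of integrations to maintain drops from 30 to 11, and adding one more destination costs one plugin instead of six.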

Other Fluentd related OSS

> Treasure Agent
  > https://github.com/treasure-data/omnibus-td-agent

> Fluentd Forwarder
  > https://github.com/fluent/fluentd-forwarder
  > Simple forwarder for Windows / Leaf node

> Fluentd UI
  > https://github.com/fluent/fluentd-ui
  > Management web UI

Other OSS products

> Scribed (C++)
  > Developed by Facebook
  > Not maintained

> Apache Flume (Java)
  > Mainly for Hadoop HDFS / HBase

> Logstash (JRuby)
  > Mainly for Elasticsearch

Embulk

> Bulk Loader version of Fluentd
  > Pluggable architecture
  > JRuby, JVM languages (TBD)
  > High performance parallel processing

> Share your script as a plugin
  > https://github.com/embulk

http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed

[Diagram: Embulk sits in the middle and bulk-loads data between systems through plugins on both sides. Systems shown: HDFS, MySQL, Amazon S3, CSV Files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis]

✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behaviour
✓ Idempotent retrying
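Idempotent retrying means that running a bulk-load task a second time leaves the destination in the same state as running it once. The sketch below illustrates the idea only; `Destination`, `run_bulk_load`, and the file names are hypothetical and are not Embulk's API.

```python
class Destination:
    """Toy destination that commits each task's rows under the task's ID."""

    def __init__(self):
        self.committed = {}

    def commit(self, task_id, rows):
        # Writing under a deterministic key makes a repeated commit overwrite, not duplicate.
        self.committed[task_id] = list(rows)


def run_bulk_load(tasks, dest, fail_once=()):
    """Run every task and re-run failed ones.

    `fail_once` lists task IDs that simulate a crash right after committing.
    Returns the total number of attempts.
    """
    remaining = dict(tasks)
    crashed = set()
    attempts = 0
    while remaining:
        for task_id, rows in list(remaining.items()):
            attempts += 1
            dest.commit(task_id, rows)
            if task_id in fail_once and task_id not in crashed:
                crashed.add(task_id)   # crash before the success is recorded
                continue               # the task stays in `remaining` and is retried
            del remaining[task_id]
    return attempts


tasks = {"file-001.csv": [1, 2], "file-002.csv": [3], "file-003.csv": [4, 5, 6]}
dest = Destination()
attempts = run_bulk_load(tasks, dest, fail_once={"file-002.csv"})
total_rows = sum(len(rows) for rows in dest.committed.values())
```

The second task is committed twice, yet the destination still holds six rows, because each commit is keyed by the task rather than appended blindly.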

Computing Framework

3 query engines in TD

> Hive (HiveQL, Batch)
  > for ETL and large jobs
  > Hivemall for machine learning

> Pig (Pig Latin, Batch)
  > DataFu for data mining and statistics

> Presto (SQL, Short batch)
  > for ad hoc queries

Hadoop

> Distributed computing framework
  > Consists of many components…

http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/

http://nosqlessentials.com/

Apache Tez

> Low level framework for YARN applications
  > New Query Engine
  > Provide good IR for Hive, Pig and more

> Task and DAG based pipelining

[Diagram: a Task consists of Input → Processor → Output; Tasks are connected into a DAG]

http://tez.apache.org/
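Task and DAG based execution can be pictured with a small scheduling sketch. It uses Python's standard `graphlib` module rather than Tez itself, and the task names describe a hypothetical plan with two aggregations feeding a join.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical DAG: each task maps to the set of tasks it depends on.
dag = {
    "scan_a": set(),
    "scan_b": set(),
    "group_a": {"scan_a"},
    "group_b": {"scan_b"},
    "join": {"group_a", "group_b"},
    "order_by": {"join"},
}


def run_in_waves(dag):
    """Return lists of tasks; tasks in one list have no unmet dependencies and could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = run_in_waves(dag)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Expressing the whole plan as one graph lets a scheduler start each task as soon as its inputs are ready, instead of chaining separate jobs.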

Hive on MR vs. Hive on Tez

    SELECT g1.x, g1.avg, g2.cnt
    FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
    JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
    ON (g1.x = g2.x)
    ORDER BY avg;

[Diagram, MapReduce: GROUP a BY a.x, GROUP b BY b.x, JOIN (a, b), and ORDER BY each run as a separate map (M) and reduce (R) job, and intermediate results are written to HDFS between jobs]

[Diagram, Tez: GROUP BY a.x, GROUP BY x, JOIN (a, b), and ORDER BY run as one graph of map (M) and reduce (R) tasks, without HDFS writes between them]

Avoid unnecessary HDFS write!

http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
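To see what the example query computes, the sketch below runs an equivalent statement on SQLite through Python's standard `sqlite3` module. This is not Hive: it assumes `AVG` as the averaging function, names the count column `cnt`, and uses made-up rows for tables `a` and `b`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER, y REAL);
    CREATE TABLE b (x INTEGER, y REAL);
    INSERT INTO a VALUES (1, 10.0), (1, 20.0), (2, 5.0);
    INSERT INTO b VALUES (1, 1.0), (2, 1.0), (2, 1.0), (2, 1.0);
""")

# Two independent aggregations, a join on x, then a global sort.
query = """
    SELECT g1.x, g1.avg, g2.cnt
    FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
    JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
      ON (g1.x = g2.x)
    ORDER BY avg;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The two subqueries do not depend on each other, which is why an engine is free to run them in parallel before the join.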

Other OSS products

> Apache Spark
  > Mainly for in-memory processing
  > Spark ecosystem is now growing

> Apache Flink
  > Mainly for iterative processing

> Microsoft’s Dryad
  > This was too early for humankind…

Presto

A distributed SQL query engine for interactive data analysis against GBs to PBs of data.

Presto overview

> Open sourced by Facebook
  > http://prestodb.io/
  > written in Java

> Built-in useful features
  > Connectors
  > Machine Learning
  > Window function
  > Approximate query
  > etc…

> Used by Netflix, Dropbox, Treasure Data, Qubole, Airbnb, LINE, GREE, Scaleout, etc.

[Diagram: a batch analysis platform (HDFS, Hive) feeds a separate visualization platform (PostgreSQL, etc.) through a daily/hourly batch; commercial BI tools and dashboards run interactive queries against the visualization platform]

Problems with two platforms:

✓ Less scalable
✓ Extra cost
✓ More work to manage 2 platforms
✓ Can’t query against “live” data directly

[Diagram: the path from HDFS / Hive through a daily/hourly batch into PostgreSQL, etc. is compared with Presto, which lets the dashboard run interactive queries against HDFS / Hive]

[Diagram: Presto as a data analysis platform, SQL on any data sets: HDFS, Hive, Cassandra, MySQL, Commercial DBs. Dashboards and commercial BI tools connect to Presto]

✓ IBM Cognos
✓ Tableau
✓ ...

MapReduce vs. Presto

MapReduce

> Write data to disk
> Wait between stages

[Diagram: map → disk → reduce → disk → map → disk → reduce]

Presto

> All stages are pipelined
  ✓ No wait time
  ✓ No fault-tolerance
> memory-to-memory data transfer
  ✓ No disk IO
  ✓ Data chunk must fit in memory

[Diagram: tasks pass data directly to downstream tasks]
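The difference between waiting between stages and pipelining can be shown with a toy comparison. Neither function is MapReduce or Presto; the names `staged` and `pipelined` and the doubling "map" step are assumptions made for illustration.

```python
def staged(records):
    """MapReduce-like: a stage materializes its full output before the next stage starts."""
    log = []
    mapped = []
    for r in records:
        log.append(f"map {r}")
        mapped.append(r * 2)
    total = 0
    for m in mapped:            # starts only after every map call is done
        log.append(f"reduce {m}")
        total += m
    return total, log


def pipelined(records):
    """Presto-like: the downstream stage consumes each item as soon as it is produced."""
    log = []

    def map_stage():
        for r in records:
            log.append(f"map {r}")
            yield r * 2         # handed over in memory, nothing is stored in between

    total = 0
    for m in map_stage():
        log.append(f"reduce {m}")
        total += m
    return total, log


staged_total, staged_log = staged([1, 2, 3])
piped_total, piped_log = pipelined([1, 2, 3])
print(staged_log)
print(piped_log)
```

Both versions produce the same total, but the pipelined log interleaves map and reduce steps. The trade-off matches the slide: no intermediate copy exists to restart from if a step fails.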

Other OSS products

> Cloudera Impala
  > Mainly for HDFS / HBase

> Apache Drill
  > More flexible architecture

> Apache Tajo
  > For building data warehouse

Visualization

Hmm…

> There are no popular OSS products
  > We don’t focus on developing a visualization tool for now

> Commercial BI tools are popular
  > Tableau, Motion Board, etc.
  > The next presentation may cover this area in depth

Treasure Data resources

> https://github.com/treasure-data
  > perfectqueue, perfectsched, etc.

> https://sql.treasuredata.com/
  > HiveQL syntax checker

> https://examples.treasuredata.com/
  > Query catalog

http://blog.treasuredata.com/2014/11/26/12-open-source-software-innovations-from-treasure-data-engineers/

Check: treasuredata.com

Cloud service for the entire data pipeline

http://treasure-data.com
