Masahiro Nakagawa, Feb 7, 2015
dots. Summit 2015
Treasure Data and OSS
Who are you?
> Masahiro Nakagawa > github/twitter: @repeatedly
> Treasure Data, Inc. > Senior Software Engineer > Fluentd / td-agent developer
> I love OSS :) > D language - Phobos committer > Fluentd - Main maintainer > MessagePack / RPC - D and Python (only RPC) > The organizer of several meetups (Presto, DTM, etc) > etc…
Company background
• Founded 2011 in Mountain View, CA
– The first cloud service for the entire data pipeline
– Including: Acquisition, Storage, & Analysis
• Provide a “Cloud Data Service”
– Fast Time to Value
– Cloud Flexibility and Economics
– Simple and Well Supported
• Treasure Data has 100+ customers in production
– Incl. Fortune 500 companies
– 400k new records / second
– Almost 9 Trillion records loaded
– Variety of use cases and verticals
The Treasure Data Team
Hiro Yoshikawa – CEO, Open source business veteran
Kaz Ohta – CTO, Founder of world’s largest Hadoop Group
Sada Furuhashi – Software Architect, MessagePack / Fluentd Author
Notable Investors
Othman Laraki Ex-VP of Growth at Twitter
Jerry Yang Founder of Yahoo!
Yukihiro “Matz” Matsumoto Creator of “Ruby” programming language
James Lindenbaum Founder of Heroku
Sierra Ventures - Tim Guleri Leading venture capital firm in Big Data
TD Service Architecture
Time to Value
Send query result Result Push
Acquire Analyze Store
Plazma DB Flexible, Scalable, Columnar Storage
Web Log
App Log
Sensor
CRM
ERP
RDBMS
Treasure Agent(Server) SDK(JS, Android, iOS, Unity)
Streaming Collector
Batch / Reliability
Ad-hoc /Low latency
KPI$
KPI Dashboard
BI Tools
Other Products
RDBMS, Google Docs, AWS S3, FTP Server, etc.
Metric Insights
Tableau, Motion Board, etc.
POS
REST API ODBC / JDBC (SQL, Pig)
Bulk Uploader
Embulk,TD Toolbelt
SQL-based query
@AWS or @IDCF
Connectivity
Economy & Flexibility Simple & Supported
Data Acquisition
Log collecting in TD
> Treasure Agent > Fluentd based log collector
> Embulk > JavaScript SDK > Mobile SDK (iOS, Android, Unity)
Structured logging
Reliable forwarding
Pluggable architecture
http://fluentd.org/
Fluentd
> Data collector for unified logging layer > Streaming data transfer based on JSON > Written in Ruby
> Gem based various plugins > http://www.fluentd.org/plugins
> Working in production > http://www.fluentd.org/testimonials
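As a rough illustration of "streaming data transfer based on JSON", the sketch below models one log event as a tag, a timestamp, and a JSON record. The tag / time / record shape follows Fluentd's event model; the helper name `make_event` and the sample field names are made up for this example and are not part of any Fluentd API.

```python
import json
import time

def make_event(tag, record, timestamp=None):
    """Build a (tag, time, record) event. `make_event` is a hypothetical helper."""
    if timestamp is None:
        timestamp = int(time.time())
    return (tag, timestamp, record)

# An unstructured log line, and the same information as a structured record.
raw_line = "GET /index.html 200"
event = make_event(
    "web.access",                                   # tag, used for routing
    {"method": "GET", "path": "/index.html", "status": 200},
    timestamp=1423303200,                           # sample epoch seconds
)

tag, ts, record = event
payload = json.dumps(record, sort_keys=True)        # what would go over the wire
print(tag, ts, payload)
```

With a structured record, downstream consumers can filter on `status` or `path` directly instead of re-parsing `raw_line`.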
Data Analytics Flow
Collect Store Process Visualize
Data source
Reporting
Monitoring
Data Analytics Flow
Store Process
Cloudera
Hortonworks
Treasure Data
Collect Visualize
Tableau
Excel
R
easier & shorter time
???
Divide & Conquer & Retry
error retry
error retry retry
retry
Batch
Stream
Other stream
Core Plugins
> Divide & Conquer
> Buffering & Retrying
> Error handling
> Message routing
> Parallelism
> read / receive data
> from API, database, command, etc…
> write / send data
> to API, database, alert, graph, etc…
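The "Divide & Conquer" and "Buffering & Retrying" ideas can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Fluentd's actual buffer implementation: `FlakySink`, `split_into_chunks`, and `flush_with_retry` are hypothetical names, and the backoff parameters are arbitrary.

```python
import time

class FlakySink:
    """Hypothetical destination that fails its first `failures` writes."""
    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def write(self, chunk):
        if self.failures > 0:
            self.failures -= 1
            raise IOError("temporary failure")
        self.received.extend(chunk)

def split_into_chunks(records, chunk_size):
    """Divide: cut the stream into small chunks that can be retried independently."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def flush_with_retry(chunk, sink, max_retries=5, base_wait=0.01):
    """Conquer and retry: resend one chunk, waiting longer after each failure."""
    for attempt in range(max_retries + 1):
        try:
            sink.write(chunk)
            return attempt  # number of retries that were needed
        except IOError:
            if attempt == max_retries:
                raise
            time.sleep(base_wait * (2 ** attempt))  # exponential backoff

sink = FlakySink(failures=2)
records = [{"id": i} for i in range(10)]
retries = [flush_with_retry(c, sink) for c in split_into_chunks(records, 4)]
print(retries, len(sink.received))
```

Because a failure only affects the chunk being flushed, the remaining chunks are sent once and nothing is lost or duplicated in this toy setup.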
Architecture (v0.12 or later)
Engine
Input
Filter Output
Buffer
> grep > record_transformer > …
> Forward > File tail > ...
> Forward > File > ...
Output
> File > Memory
not pluggable
Formatter Parser
Before (M x N)
After (M + N)
or Embulk
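The "M x N" versus "M + N" comparison is simple arithmetic: without a common layer every source needs its own integration with every destination, while with a collector in the middle each system connects once. A small sketch, using example source and destination names picked from this deck:

```python
def direct_integrations(sources, destinations):
    """Before: every source talks to every destination (M x N)."""
    return len(sources) * len(destinations)

def hub_integrations(sources, destinations):
    """After: each system connects once to the common layer (M + N)."""
    return len(sources) + len(destinations)

sources = ["Web Log", "App Log", "CRM", "RDBMS"]       # M = 4
destinations = ["HDFS", "Amazon S3", "Elasticsearch"]  # N = 3

before = direct_integrations(sources, destinations)
after = hub_integrations(sources, destinations)
print(before, after)
```

The gap widens as systems are added: one more source costs N new integrations in the first layout and one in the second.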
Other Fluentd related OSS
> Treasure Agent
> https://github.com/treasure-data/omnibus-td-agent
> Fluentd Forwarder > https://github.com/fluent/fluentd-forwarder
> Simple forwarder for Windows / Leaf node
> Fluentd UI > https://github.com/fluent/fluentd-ui
> Management web UI
Other OSS products
> Scribed (C++) > Developed by Facebook > No longer maintained
> Apache Flume (Java) > Mainly for Hadoop HDFS / HBase
> Logstash (JRuby) > Mainly for Elasticsearch
Embulk
> Bulk Loader version of Fluentd > Pluggable architecture
> JRuby, JVM languages (TBD) > High performance parallel processing
> Share your script as a plugin > https://github.com/embulk
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
HDFS
MySQL
Amazon S3
Embulk
CSV Files
SequenceFile
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behaviour ✓ Idempotent retrying
Plugins Plugins
bulk load
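The combination of "Parallel execution", "Data validation", and "Idempotent retrying" can be sketched as follows. This is an illustration of the idea only, not Embulk's plugin API: `load_task`, `bulk_load`, the `part-NNNN` key scheme, and the dictionary used as a destination are all assumptions for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def load_task(task_id, rows, destination):
    """Write one task's rows under a key derived only from task_id.

    Re-running the same task overwrites the same key, so a retry
    does not create duplicates (idempotent retrying).
    """
    valid = [r for r in rows if isinstance(r.get("id"), int)]  # data validation
    destination[f"part-{task_id:04d}"] = valid
    return len(valid)

def bulk_load(tasks, destination, workers=4):
    """Run the tasks in parallel and return the loaded row count per task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(load_task, i, rows, destination)
                   for i, rows in enumerate(tasks)]
        return [f.result() for f in futures]

tasks = [[{"id": 1}, {"id": 2}], [{"id": "bad"}, {"id": 3}]]
destination = {}
counts = bulk_load(tasks, destination)
counts_again = bulk_load(tasks, destination)  # simulate retrying the whole job
print(counts, sorted(destination))
```

Running the job twice leaves the destination in the same state as running it once, which is what makes a blind retry safe.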
Computing Framework
3 query engines in TD
> Hive (HiveQL, Batch) > for ETL and large jobs > Hivemall for machine learning
> Pig (Pig Latin, Batch) > DataFu for data mining and statistics
> Presto (SQL, Short batch) > for Ad hoc queries
Hadoop
> Distributed computing framework > Consists of many components…
http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
http://nosqlessentials.com/
> Low level framework for YARN applications > New Query Engine > Provide good IR for Hive, Pig and more
> Task and DAG based pipelining
Apache Tez
Input Processor Output
Task DAG
http://tez.apache.org/
Hive on MR vs. Hive on Tez
MapReduce Tez
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
[Figure: map (M) and reduce (R) task graphs for MapReduce and Tez; the MapReduce graph has HDFS writes between stages]
Avoid unnecessary HDFS write!
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x)
ORDER BY avg;
GROUP b BY b.x
GROUP a BY a.x
JOIN (a, b)
ORDER BY
GROUP BY x
GROUP BY a.x
JOIN (a, b)
ORDER BY
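To make "Task and DAG based pipelining" concrete, the sketch below runs a two-group-by, join, order-by plan as a small DAG, handing each intermediate result to the next vertex in memory instead of writing it out between jobs. It is a toy model and not how Tez is implemented: the vertex names, the sample tables `a` and `b`, and the helper functions are all assumptions. It uses `graphlib` from the Python standard library (3.9+).

```python
from collections import defaultdict
from graphlib import TopologicalSorter

# Tiny stand-ins for tables a(x, y) and b(x, y).
a = [("k1", 2.0), ("k1", 4.0), ("k2", 10.0)]
b = [("k1", "p"), ("k2", "q"), ("k2", "r")]

def group_avg(rows):
    groups = defaultdict(list)
    for x, y in rows:
        groups[x].append(y)
    return {x: sum(ys) / len(ys) for x, ys in groups.items()}

def group_count(rows):
    counts = defaultdict(int)
    for x, _ in rows:
        counts[x] += 1
    return dict(counts)

def join(g1, g2):
    return [(x, g1[x], g2[x]) for x in g1 if x in g2]

def order_by_avg(rows):
    return sorted(rows, key=lambda r: r[1])

# vertex name -> (function of its inputs, names of upstream vertices)
vertices = {
    "group_a": (lambda inputs: group_avg(a), []),
    "group_b": (lambda inputs: group_count(b), []),
    "join": (lambda inputs: join(inputs["group_a"], inputs["group_b"]),
             ["group_a", "group_b"]),
    "order_by": (lambda inputs: order_by_avg(inputs["join"]), ["join"]),
}

graph = {name: set(deps) for name, (_, deps) in vertices.items()}
results = {}
for name in TopologicalSorter(graph).static_order():
    func, deps = vertices[name]
    results[name] = func({d: results[d] for d in deps})  # in-memory hand-off

print(results["order_by"])
```

The two group-by vertices have no dependency on each other, so a real engine is free to run them concurrently; only `join` has to wait for both.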
Other OSS products
> Apache Spark > Mainly for on-memory processing > Spark ecosystem is now growing
> Apache Flink > Mainly for iterative processing
> Microsoft’s Dryad > This was premature for human beings…
Presto
A distributed SQL query engine for interactive data analysis against GBs to PBs of data.
Presto overview
> Open sourced by Facebook
> http://prestodb.io/ > written in Java
> Built-in useful features > Connectors > Machine Learning > Window function > Approximate query > etc…
> Used by Netflix, Dropbox, Treasure Data, Qubole, Airbnb, LINE, GREE, Scaleout, etc
HDFS
Hive
PostgreSQL, etc.
Daily/Hourly Batch
Interactive query
Commercial BI Tools
Batch analysis platform Visualization platform
Dashboard
HDFS
Hive
Daily/Hourly Batch
Interactive query
✓ Less scalable ✓ Extra cost
Commercial BI Tools
Dashboard
✓ More work to manage 2 platforms
✓ Can’t query against “live” data directly
Batch analysis platform Visualization platform
PostgreSQL, etc.
HDFS
Hive Dashboard
Presto
PostgreSQL, etc.
Daily/Hourly Batch
HDFS
Hive
Dashboard
Daily/Hourly Batch
Interactive query
Interactive query
Presto
HDFS
Hive
Dashboard
Daily/Hourly Batch
Interactive query
Cassandra MySQL Commercial DBs
SQL on any data sets
Commercial BI Tools
✓ IBM Cognos ✓ Tableau ✓ ...
Data analysis platform
All stages are pipe-lined ✓ No wait time ✓ No fault-tolerance
MapReduce vs. Presto
MapReduce Presto
map map
reduce reduce
task task
task task
task
task
memory-to-memory data transfer ✓ No disk IO ✓ Data chunk must fit in memory
task
disk
map map
reduce reduce
disk
disk
Write data to disk
Wait between stages
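The difference between the two execution models can be imitated with lists and generators. This is only an analogy under stated assumptions, not Presto or MapReduce code: `staged` materializes each stage's full output before the next stage starts (standing in for "write data to disk" and "wait between stages"), while `pipelined` streams each record through all stages without intermediate storage.

```python
def staged(records):
    """MapReduce-like: each stage finishes and stores its output first."""
    mapped = [r * 2 for r in records]             # full intermediate result
    filtered = [r for r in mapped if r % 3 == 0]  # second full intermediate result
    return sum(filtered)

def pipelined(records):
    """Presto-like: stages are chained; records flow through one at a time."""
    mapped = (r * 2 for r in records)             # nothing computed yet
    filtered = (r for r in mapped if r % 3 == 0)  # still lazy
    return sum(filtered)                          # pulls records through the chain

data = range(1, 11)
print(staged(data), pipelined(data))
```

Both functions return the same answer; the trade-off matches the slide: the pipelined version has no stored intermediate result to restart from if something fails midway.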
Other OSS products
> Cloudera Impala > Mainly for HDFS / HBase
> Apache Drill > More flexible architecture
> Apache Tajo > For building data warehouse
Visualization
Hmm…
> There are no popular OSS products
> We don’t focus on developing a visualization tool for now
> Commercial BI tools are popular
> Tableau, Motion Board, etc.
> Maybe the next presentation will talk about this area in depth
Treasure Data resources
> https://github.com/treasure-data > perfectqueue, perfectsched, etc
> https://sql.treasuredata.com/ > HiveQL syntax checker
> https://examples.treasuredata.com/ > Query catalog
http://blog.treasuredata.com/2014/11/26/12-open-source-software-innovations-from-treasure-data-engineers/
Check: treasuredata.com
Cloud service for the entire data pipeline