con2342 rubin
DESCRIPTION
adfaTRANSCRIPT
![Page 1: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/1.jpg)
Hadoop and MySQL for Big Data
Alexander Rubin October 9, 2013
![Page 2: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/2.jpg)
www.percona.com
About Me
• Working with MySQL for over 10 years – Started at MySQL AB, Sun Microsystems, Oracle
(MySQL Consulting) – Worked at Hortonworks – Joined Percona in 2013
Alexander Rubin, Principal Consultant, Percona
![Page 3: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/3.jpg)
www.percona.com
Agenda
• Hadoop Use Cases • Inside Hadoop • Big Data with Hadoop • MySQL and Hadoop Integration • Star Schema benchmark
![Page 4: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/4.jpg)
www.percona.com
Hadoop: when it makes sense
BIG DATA
![Page 5: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/5.jpg)
www.percona.com
Big Data • Volume
• Petabytes • Variety
• Any type of data – usually unstructured/raw data • No normalization
• Velocity • Data is collected at a high rate
http://en.wikipedia.org/wiki/Big_data#Definition
“3Vs” of big data
![Page 6: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/6.jpg)
www.percona.com
Hadoop Use Cases
Clickstream: a data trail for a user visiting website Task: store and analyze
![Page 7: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/7.jpg)
www.percona.com
Hadoop Use Cases
Risk modeling: fraud prevention (banks, credit cards) • Store all customer activity • Run analysis to identify credit card fraud
![Page 8: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/8.jpg)
www.percona.com
Hadoop Use Cases
Point-of-Sale Analysis
![Page 9: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/9.jpg)
www.percona.com
Hadoop Use Cases
Stocks / Historical prices analysis
![Page 10: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/10.jpg)
www.percona.com
Inside Hadoop
![Page 11: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/11.jpg)
www.percona.com
Where is my data?
• HDFS = Data is spread between MANY machines
HDFS (Distributed File System)
Data Nodes
![Page 12: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/12.jpg)
www.percona.com
Inside Hadoop
MapReduce (Distributed Programming Framework)
HDFS (Distributed File System)
Hive (SQL)
PIG (scripting)
HB
ase
(Key
-Val
ue
NoS
QLD
B)
• HDFS = Data is spread between MANY machines • Write files, “append-only” mode
![Page 13: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/13.jpg)
www.percona.com
Inside Hadoop
MapReduce (Distributed Programming Framework)
HDFS (Distributed File System)
Hive (SQL)
PIG (scripting)
HB
ase
(Key
-Val
ue
NoS
QLD
B)
• MapReduce: Reads and Process data • Code is executed on the DATA NODES inside HDFS
![Page 14: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/14.jpg)
www.percona.com
Inside Hadoop
MapReduce (Distributed Programming Framework)
HDFS (Distributed File System)
Hive (SQL)
PIG (scripting)
HB
ase
(Key
-Val
ue
NoS
QLD
B)
• Hive = executes SQL and translates into MapReduce job • PIG = scripting language, translates into MapReduce job
![Page 15: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/15.jpg)
www.percona.com
Inside Hadoop
MapReduce (Distributed Programming Framework)
HDFS (Distributed File System)
Hive (SQL)
PIG (scripting)
HB
ase
(Key
-Val
ue
NoS
QLD
B)
• Hbase = Key/Value store, No-SQL
![Page 16: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/16.jpg)
www.percona.com
Hive
• SQL level for Hadoop • Translates SQL to Map-Reduce jobs • Schema on Read – does not check the
data on load
![Page 17: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/17.jpg)
www.percona.com
Hive Example
hive> create table lineitem ( l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity double, l_extendedprice double, l_discount double, l_tax double, l_returnflag string, l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string, l_shipinstruct string, l_shipmode string, l_comment string) row format delimited fields terminated by '|';
![Page 18: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/18.jpg)
www.percona.com
Hive Example
hive> create external table lineitem ( l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity double, l_extendedprice double, l_discount double, l_tax double, l_returnflag string, l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string, l_shipinstruct string, l_shipmode string, l_comment string) row format delimited fields terminated by '|’ location ‘/ssb/lineorder/’;
![Page 19: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/19.jpg)
www.percona.com
Impala: Faster SQL Not based on Map-Reduce, directly get data from HDFS
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/Installing-and-Using-Impala.html
![Page 20: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/20.jpg)
www.percona.com
Hadoop vs MySQL
![Page 21: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/21.jpg)
www.percona.com
Hadoop vs. MySQL for BigData
• Indexes • Partitioning • Sharding
• Full table scan • Partitioning • Map/Reduce
![Page 22: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/22.jpg)
www.percona.com
Hadoop (vs. MySQL)
• No indexes • All processing is full scan • BUT: distributed and parallel
• No transactions • High latency (usually)
![Page 23: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/23.jpg)
www.percona.com
Indexes for Big Data challenge
• Creating an index for Petabytes of data? • Updating an index for Petabytes of data? • Reading a terabyte index? • Random read of Petabyte?
Full scan in parallel is better for big data
![Page 24: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/24.jpg)
www.percona.com
ETL vs ELT
1. Extract data from external source
2. Transform before loading
3. Load data into MySQL
1. Extract data from external source
2. Load data into Hadoop
3. Transform data/Analyze data/Visualize data;
![Page 25: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/25.jpg)
www.percona.com
OLTP / Web site
BI / Data Analysis
Hadoop Example: Load Data I
Hadoop Cluster
MySQL
ELT
![Page 26: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/26.jpg)
www.percona.com
OLTP / Web site
BI / Data Analysis
Archiving to Hadoop
Hadoop Cluster
MySQL
ELT
Goal: keep 100G
Can store Petabytes for archiving
![Page 27: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/27.jpg)
www.percona.com
Hadoop Example: Load Data II
Hadoop Cluster
Flume
Web Server Clickstream logs
Filesystem
Grab file
Reports
![Page 28: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/28.jpg)
www.percona.com
Hadoop and MySQL
• Apache Sqoop: http://sqoop.apache.org/ • Used to import and export data between Hadoop and MySQL
![Page 29: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/29.jpg)
www.percona.com
Hadoop and MySQL Together Integrating MySQL and Hadoop
![Page 30: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/30.jpg)
www.percona.com
Integration: Hadoop -> MySQL
MySQL Server Hadoop Cluster
![Page 31: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/31.jpg)
www.percona.com
Hadoop -> MySQL: Sqoop $ sqoop export --connect jdbc:mysql://mysql_host/db_name --table orders --export-dir /results/
MySQL Server Hadoop Cluster
![Page 32: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/32.jpg)
www.percona.com
Hadoop -> MySQL: Sqoop Not real-time: run from a cronjob
• Export from files on HDFS to MySQL • To export from HIVE
Hive tables = files inside HDFS $ sqoop export!--connect jdbc:mysql://mysql_host/db_name --table orders!--export-dir /user/hive/warehouse/orders.db!--input-fields-terminated-by '\001’!• Hive uses ^A (\001) as a field delimiter
![Page 33: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/33.jpg)
www.percona.com
Integration: MySQL -> Hadoop
Hadoop Cluster MySQL Server
![Page 34: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/34.jpg)
www.percona.com
MySQL -> Hadoop: Sqoop
$ sqoop import --connect jdbc:mysql://mysql_host/db_name --table ORDERS --hive-import
MySQL Server Hadoop Cluster
![Page 35: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/35.jpg)
www.percona.com
MySQL to Hadoop: Hadoop Applier
Only inserts are supported
Hadoop Cluster Replication Master
Regular MySQL Slaves
MySQL Applier (reads binlogs from Master)
![Page 36: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/36.jpg)
www.percona.com
MySQL to Hadoop: Hadoop Applier
• Download from: http://labs.mysql.com/ • Alpha version right now • We need to write code how to process data
![Page 37: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/37.jpg)
www.percona.com
Star Schema benchmark Variation of TPC-H with star-schema database design http://www.percona.com/docs/wiki/benchmark:ssb:start
![Page 38: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/38.jpg)
www.percona.com
Cloudera Impala
![Page 39: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/39.jpg)
www.percona.com
Impala and Columnar Storage Parquet: Columnar Storage for Hadoop
• Column-oriented storage • Supports compression • Star Schema Benchmark example
• RAW data: 990GB • Parquet data: 240GB
http://parquet.io/ http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_parquet.html
![Page 40: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/40.jpg)
www.percona.com
Impala Benchmark: Table 1. Load data into HDFS $ hdfs dfs –put /data/ssb/lineorder* /ssb/lineorder/
http://www.percona.com/docs/wiki/benchmark:ssb:start
![Page 41: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/41.jpg)
www.percona.com
Impala Benchmark: Table 2. Create external table CREATE EXTERNAL TABLE lineorder_text ( LO_OrderKey bigint ,
LO_LineNumber tinyint , LO_CustKey int,
LO_PartKey int, ... ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/ssb/lineorder';
http://www.percona.com/docs/wiki/benchmark:ssb:start
![Page 42: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/42.jpg)
www.percona.com
Impala Benchmark: Table 3. Convert to Parquet Impala-shell> create table lineorder like lineorder_text STORED AS PARQUETFILE; Impala-shell> insert into lineorder select * from lineorder_text;
http://www.percona.com/docs/wiki/benchmark:ssb:start
![Page 43: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/43.jpg)
www.percona.com
Impala Benchmark: SQLs
select sum(lo_extendedprice*lo_discount) as revenue from lineorder, dates where lo_orderdate = d_datekey and d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;
http://www.percona.com/docs/wiki/benchmark:ssb:start
Q1.1
![Page 44: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/44.jpg)
www.percona.com
Impala Benchmark: SQLs
select c_nation, s_nation, d_year, sum(lo_revenue) as revenue from customer, lineorder, supplier, dates where lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_orderdate = d_datekey and c_region = 'ASIA’ and s_region = 'ASIA' and d_year >= 1992 and d_year <= 1997 group by c_nation, s_nation, d_year order by d_year asc, revenue desc;
Q3.1
![Page 45: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/45.jpg)
www.percona.com
Partitioning for Impala / Hive Partitioning by Year CREATE TABLE lineorder ( LO_OrderKey bigint , LO_LineNumber tinyint , LO_CustKey int, LO_PartKey int, ... ) partitioned by (year int); select * from lineorder where … year = '2013’
… will only read the “2013” partition (much less disk IOs!)
Will create a “virtual” field on “year”
![Page 46: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/46.jpg)
www.percona.com
Impala Benchmark CDP Hadoop with Impala
• 10 m1.xlarge Amazon EC2 • RAW data: 990GB • Parquet data: 240GB
![Page 47: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/47.jpg)
www.percona.com
Impala Benchmark
Scanning 1Tb of row data 50-200 seconds 10 xlarge nodes on Amazon EC2
![Page 48: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/48.jpg)
www.percona.com
Impala vs Hive Benchmark
10 xlarge nodes on Amazon EC2 Impala and Hive
![Page 49: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/49.jpg)
www.percona.com
Amazon Elastic Map Reduce
![Page 50: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/50.jpg)
www.percona.com
Hive on Amazon S3
hive> create external table lineitem ( l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity double, l_extendedprice double, l_discount double, l_tax double, l_returnflag string, l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string, l_shipinstruct string, l_shipmode string, l_comment string) row format delimited fields terminated by '|’ location 's3n://data.s3ndemo.hive/tpch/lineitem';
![Page 51: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/51.jpg)
www.percona.com
Amazon Elastic Map Reduce
• Store data on S3 • Prepare SQL file (create table, select, etc) • Run Elastic Map Reduce
• Will start N boxes then stop them • Results loaded to S3
![Page 52: CON2342 Rubin](https://reader034.vdocuments.net/reader034/viewer/2022042517/577cc0551a28aba7118fb97a/html5/thumbnails/52.jpg)
www.percona.com
Questions?
Thank you!
Blog: http://www.arubin.org