Using MySQL, Hadoop and Spark for Data Analysis
Alexander Rubin, Principal Architect, Percona
September 21, 2015


Page 1:

Using MySQL, Hadoop and Spark for Data Analysis

Alexander Rubin, Principal Architect, Percona
September 21, 2015

Page 2:

About Me

•  Working with MySQL for over 10 years
   – Started at MySQL AB, later Sun Microsystems and Oracle (MySQL Consulting)
   – Worked at Hortonworks (a Hadoop company)
   – Joined Percona in 2013

Alexander Rubin, Principal Consultant, Percona

Page 3:

Agenda

•  Why Hadoop?
•  Why Spark?
•  Hadoop
   – Big Data analytics with Hadoop
   – Star Schema benchmark
   – MySQL and Hadoop integration

•  Spark examples

Page 4:

Why Hadoop and not MySQL?

•  Petabytes of data
•  Unstructured/raw data
•  No normalization in the first place
•  Data is collected at a high rate

http://en.wikipedia.org/wiki/Big_data#Definition

Page 5:

Why Spark?

•  Claimed to be faster: in-memory processing
•  Direct access to data sources (e.g. MySQL):

   >>> df = sqlContext.load(source="jdbc",
   ...                      url="jdbc:mysql://localhost?user=root",
   ...                      dbtable="ontime.ontime_sm")

•  Native R (and Python) integration

Page 6:

Inside Hadoop

Page 7:

Where is my data?

•  HDFS = data is spread across MANY machines
•  Amazon S3 is also supported

[Diagram: HDFS (Distributed File System) distributing file blocks across data nodes]

Page 8:

The Famous Picture!

[Diagram: the Hadoop stack – Hive (SQL) and Impala (faster SQL) alongside other components (Pig, HBase, etc.), on top of MapReduce (distributed programming framework), on top of HDFS (distributed file system)]

•  HDFS = data is spread across MANY machines
•  Write files in "append-only" mode

Page 9:

Page 10:

Hive

•  SQL layer on top of Hadoop
•  Translates SQL into MapReduce jobs
•  Schema on read: does not check the data on load

Page 11:

Hive Example

hive> create table lineitem (
        l_orderkey int,
        l_partkey int,
        l_suppkey int,
        l_linenumber int,
        l_quantity double,
        l_extendedprice double,
        ...
        l_shipmode string,
        l_comment string)
      row format delimited fields terminated by '|';

Page 12:

Hive Example

hive> create external table lineitem ( ... )
      row format delimited fields terminated by '|'
      location '/ssb/lineorder/';

Page 13:

Impala: Faster SQL

•  Not based on MapReduce: reads data directly from HDFS

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/Installing-and-Using-Impala.html

Page 14:

Hadoop vs MySQL

Page 15:

Hadoop vs. MySQL for Big Data

MySQL          | Hadoop
---------------+-----------------
Indexes        | Full table scan
Partitioning   | Partitioning
"Sharding"     | Map/Reduce

Page 16:

Hadoop (vs. MySQL)

•  No indexes
•  All processing is a full scan
•  BUT: distributed and parallel
•  No transactions
•  High latency (usually)

MySQL: 1 query = 1 CPU core

Page 17:

Indexes (B-tree) and the Big Data challenge

•  Creating an index over petabytes of data?
•  Updating an index over petabytes of data?
•  Reading a terabyte-sized index?
•  Random reads across a petabyte?

A full scan in parallel is better for big data.
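A minimal PySpark sketch of that idea (the HDFS path is a placeholder, not from the slides): the "query" is a full scan with no index, but each worker scans only its own blocks, so the whole scan runs in parallel.

# Parallel full scan instead of an index lookup; path is hypothetical.
lines = sc.textFile("hdfs:///wikistats/pagecounts-raw/")  # one partition per HDFS block
en = lines.filter(lambda l: l.startswith("en "))          # predicate runs on every worker
print en.count()                                          # whole dataset scanned, in parallel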

Page 18:

ETL vs. ELT

ETL (the MySQL way):
1.  Extract data from the external source
2.  Transform the data before loading
3.  Load the data into MySQL

ELT (the Hadoop way):
1.  Extract data from the external source
2.  Load the data into Hadoop (as is)
3.  Transform/analyze/visualize the data in place, with full parallelism
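A hedged ELT sketch in PySpark (the HDFS path and the field positions in the '|'-delimited file are assumptions for illustration): the raw file is loaded untouched, and the transform runs at query time, in parallel.

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

# Load step: the raw '|'-delimited export was copied to HDFS as is.
raw = sc.textFile("hdfs:///ssb/lineorder/")

# Transform step, deferred to read time; field positions are hypothetical.
rows = raw.map(lambda l: l.split("|")) \
          .map(lambda p: Row(orderkey=int(p[0]), quantity=float(p[8])))
df = sqlContext.createDataFrame(rows)
df.registerTempTable("lineorder")
print sqlContext.sql("select sum(quantity) from lineorder").collect()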

Page 19:

Example: Loading wikistat into MySQL

1.  Extract data from the external source (uncompress!)
2.  Load the data into MySQL and transform it

Wikipedia page counts: >10TB to download

load data local infile '$file'
into table wikistats.wikistats_full
CHARACTER SET latin1
FIELDS TERMINATED BY ' '
(project_name, title, num_requests, content_size)
set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'),
    title_md5 = unhex(md5(title));

http://dumps.wikimedia.org/other/pagecounts-raw/

Page 20:

Load timing per hour of wikistat

•  InnoDB: 52.34 sec
•  MyISAM: 11.08 sec (+ indexes)
•  1 hour of wikistats loads in about 1 minute
•  1 year will load in ~6 days (8,765.81 hours in 1 year x ~1 min/hour)
•  6 years => more than 1 month to load

Page 21:

Loading wikistat into Hadoop

•  Just copy the files to HDFS… and create the Hive structure on top
•  How fast is a search? It depends on the number of nodes.

•  Example: a 1,000-node Spark cluster
   – 4.5TB, 104 billion records
   – Execution time: 45 sec to scan all 4.5TB of data

•  http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf

Page 22:

Amazon Elastic MapReduce (EMR)

Page 23:

Hive on Amazon S3

hive> create external table lineitem (
        l_orderkey int,
        l_partkey int,
        l_suppkey int,
        l_linenumber int,
        l_quantity double,
        l_extendedprice double,
        l_discount double,
        l_tax double,
        l_returnflag string,
        l_linestatus string,
        l_shipdate string,
        l_commitdate string,
        l_receiptdate string,
        l_shipinstruct string,
        l_shipmode string,
        l_comment string)
      row format delimited fields terminated by '|'
      location 's3n://data.s3ndemo.hive/tpch/lineitem';

Page 24:

Amazon Elastic MapReduce (EMR)

•  Store the data on S3
•  Prepare a SQL file (create table, select, etc.)
•  Run Elastic MapReduce: it will start N boxes, run the job, then stop them
•  Results are loaded back to S3

Page 25:

Hadoop and MySQL Together

Integrating MySQL and Hadoop

Page 26:

Archiving to Hadoop

[Diagram: MySQL (OLTP / web site; goal: keep 100GB-1TB) feeds a Hadoop cluster (BI / data analysis) via ELT; Hadoop can store petabytes for archiving]

Page 27:

Integration: Hadoop -> MySQL

[Diagram: data flowing from the Hadoop cluster into the MySQL server]

One common route for this direction is Sqoop's export mode, which writes HDFS data into a MySQL table.

Page 28:

MySQL -> Hadoop: Sqoop

$ sqoop import --connect jdbc:mysql://mysql_host/db_name \
    --table ORDERS --hive-import

[Diagram: Sqoop copying the ORDERS table from the MySQL server into the Hadoop cluster]
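A hedged follow-up, not on the slide: --hive-import registers the table in the Hive metastore, so the PySpark shell's HiveContext (sqlContext) should be able to query it directly, along these lines:

# Query the Sqoop-imported table like any other Hive table.
res = sqlContext.sql("select count(*) as cnt from ORDERS")
print res.collect()[0].cnt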

Page 29:

MySQL to Hadoop: Hadoop Applier

•  Only inserts are supported

[Diagram: a replication master feeds regular MySQL slaves and, in parallel, the MySQL Hadoop Applier, which reads binlogs from the master and writes them to the Hadoop cluster]

Page 30:

MySQL to Hadoop: Hadoop Applier

•  Download from: http://labs.mysql.com/
•  Still an alpha version right now
•  You have to write your own code for how to process the data
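Since the applier only ships raw insert events, the "process the data" part might look like this deliberately rough PySpark sketch (the HDFS path, the comma delimiter, and the column layout are all assumptions, not the applier's documented output):

# Read the text files the applier wrote into HDFS (all names hypothetical).
lines = sc.textFile("hdfs:///user/hive/warehouse/db_name/ORDERS/")
rows = lines.map(lambda l: l.split(","))  # parse the delimited insert events
print rows.count()                        # e.g. count rows replicated so far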

Page 31:

Apache Spark

Page 32:

Start Spark (no Hadoop)

root@thor:~/spark/spark-1.4.1-bin-hadoop2.6# ./sbin/start-master.sh
root@thor:~/spark/spark-1.4.1-bin-hadoop2.6# less ../logs/spark-root-org.apache.spark.deploy.master.Master-1-thor.out

15/08/25 11:21:21 INFO Master: Starting Spark master at spark://thor:7077
15/08/25 11:21:21 INFO Master: Running Spark version 1.4.1
15/08/25 11:21:21 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/08/25 11:21:21 INFO MasterWebUI: Started MasterWebUI at http://10.60.23.188:8080

root@thor:~/spark/spark-1.4.1-bin-hadoop2.6# ./sbin/start-slave.sh spark://thor:7077

Page 33:

Prepare Spark

root@thor:~/spark/spark-1.4.1-bin-hadoop2.6# cat env.sh
export MASTER=spark://`hostname`:7077
export SPARK_CLASSPATH=/root/spark/mysql-connector-java-5.1.36-bin.jar
# optional
export SPARK_WORKER_MEMORY=6g
export SPARK_MEM=6g
export SPARK_DAEMON_MEMORY=6g
export SPARK_DAEMON_JAVA_OPTS="-Dspark.executor.memory=6g"

The SPARK_CLASSPATH line is what makes the MySQL JDBC driver visible to Spark's jdbc data source.

Page 34:

PySpark and MySQL

root@thor:~/spark/spark-1.4.1-bin-hadoop2.6# ./bin/pyspark

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Python version 2.7.3 (default, Jun 22 2015 19:33:41)
SparkContext available as sc, HiveContext available as sqlContext.

Page 35:

PySpark and MySQL

df = sqlContext.load(source="jdbc",
                     url="jdbc:mysql://localhost?user=root",
                     dbtable="world.City")
df.registerTempTable("City")
res = sqlContext.sql("select * from City order by Population desc limit 10")
for x in res.collect():
    print x.Name, x.Population
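A hedged companion sketch (the target table name is made up; it assumes the same MySQL JDBC setup): in Spark 1.4 the DataFrameWriter can also push a result back into MySQL.

# Write the top-10 result back into MySQL; the table name is hypothetical.
res.write.jdbc(url="jdbc:mysql://localhost?user=root",
               table="world.TopCities", mode="overwrite")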

Page 36:

SparkSQL and MySQL

root@thor:~/spark/spark-1.4.1-bin-hadoop2.6# ./bin/spark-sql 2>error.log

CREATE TEMPORARY TABLE City
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://localhost?user=root",
  dbtable "world.City"
);

select * from City order by Population desc limit 10;

Page 37:

PySpark and GZIP file

from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

# Load a text file and convert each line to a Row.
# (sc.textFile reads the .gz file directly; no manual uncompress step.)
lines = sc.textFile("/home/consultant/wikistats/dumps.wikimedia.org/other/pagecounts-raw/2008/2008-01/pagecounts-20080112-120000.gz")
parts = lines.map(lambda l: l.split(" "))
wiki = parts.map(lambda p: Row(project=p[0], url=p[1], uniq_visitors=int(p[2]), total_visitors=int(p[3])))

# Infer the schema, and register the DataFrame as a table.
schemaWiki = sqlContext.createDataFrame(wiki)
schemaWiki.registerTempTable("wikistats")
res = sqlContext.sql("SELECT * FROM wikistats limit 100")
for x in res.collect():
    print x.project, x.url, x.uniq_visitors

Page 38:

PySpark and GZIP file

# Load a whole month of files and convert each line to a Row.
lines = sc.textFile("/home/consultant/wikistats/dumps.wikimedia.org/other/pagecounts-raw/2008/2008-01/")
parts = lines.map(lambda l: l.split(" "))
wiki = parts.map(lambda p: Row(project=p[0], url=p[1]))

# Infer the schema, and register the DataFrame as a table.
schemaWiki = sqlContext.createDataFrame(wiki)
schemaWiki.registerTempTable("wikistats")
res = sqlContext.sql("SELECT url, count(*) as cnt FROM wikistats where project='en' group by url order by cnt desc limit 10")
for x in res.collect():
    print x.url, x.cnt
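A small hedged addition, not on the slide: when several queries will run over the same registered table, caching it in memory avoids re-reading and re-parsing the gzipped files each time.

# Cache the registered table before running repeated queries.
sqlContext.cacheTable("wikistats")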

Page 39:

PySpark and GZIP file

top output while the job runs: most cores are busy scanning in parallel.

Cpu0  : 94.4%us,  0.0%sy,  0.0%ni,   5.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  5.7%us,  0.0%sy,  0.0%ni,  92.4%id,  0.0%wa,  0.0%hi,  1.9%si,  0.0%st
Cpu2  : 95.0%us,  0.0%sy,  0.0%ni,   5.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 94.9%us,  0.0%sy,  0.0%ni,   5.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.6%us,  0.0%sy,  0.0%ni,  99.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 94.3%us,  0.0%sy,  0.0%ni,   5.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 94.3%us,  0.0%sy,  0.0%ni,   5.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 95.0%us,  0.0%sy,  0.0%ni,   5.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 94.4%us,  0.0%sy,  0.0%ni,   5.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  : 94.3%us,  0.0%sy,  0.0%ni,   5.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 : 94.4%us,  0.0%sy,  0.0%ni,   5.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 : 94.3%us,  0.0%sy,  0.0%ni,   5.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  1.3%us,  0.0%sy,  0.0%ni,  98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 : 95.0%us,  0.0%sy,  0.0%ni,   5.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni, 100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 : 94.3%us,  0.0%sy,  0.0%ni,   5.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu16 : 94.3%us,  0.0%sy,  0.0%ni,   5.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 : 94.3%us,  0.0%sy,  0.0%ni,   5.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 : 94.3%us,  0.0%sy,  0.0%ni,   5.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 : 94.9%us,  0.0%sy,  0.0%ni,   5.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 :  0.0%us,  0.0%sy,  0.0%ni, 100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu21 : 94.9%us,  0.0%sy,  0.0%ni,   5.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 : 94.9%us,  0.0%sy,  0.0%ni,   5.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.0%us,  0.0%sy,  0.0%ni, 100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49454372k total, 40479496k used,  8974876k free,   357360k buffers