splice machine-bloor-webinar-data-lakes

Getting Started with Hadoop: Operational Data Lake. Rich Reimer, VP, Product Management [email protected]

Upload: edgar-alejandro-villegas

Post on 21-Jul-2015


Category:

Data & Analytics



TRANSCRIPT

Page 1: Splice machine-bloor-webinar-data-lakes

Getting Started with Hadoop: Operational Data Lake

Rich Reimer, VP, Product Management

[email protected]    

Page 2: Splice machine-bloor-webinar-data-lakes

The Big Squeeze

Data growing much faster than IT budgets

Source: 2013 IBM Briefing Book

Source: Gartner, Worldwide IT Spending Forecast, 3Q13 Update

Page 3: Splice machine-bloor-webinar-data-lakes

Traditional RDBMS Giants Overwhelmed…

Scale-up becoming cost-prohibitive

Splice Machine | Proprietary & Confidential

Page 4: Splice machine-bloor-webinar-data-lakes

Scale-Out: The Future of Databases

Dramatic improvement in price/performance

Scale Up (increase server size) vs. Scale Out (more small servers)

Page 5: Splice machine-bloor-webinar-data-lakes

What is a Data Lake?

•  Scale-out technology based on Hadoop

•  Data stored in native formats

Page 6: Splice machine-bloor-webinar-data-lakes

Schema on Ingest vs. Schema on Read

§  Even “schemaless” MongoDB requires “schema”: “10 Things You Should Know About Running MongoDB At Scale”

•  By Asya Kamsky, Principal Solutions Architect at MongoDB

•  Item #1: “have a good schema and indexing strategy”

Schema on Ingest

Schema on Read

•  Schema on Read if you only use data a few times a year

•  Structured data should always remain structured

•  Add schema if data used regularly

Data Stream Application
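
The trade-off on this slide can be illustrated with a minimal Python sketch. It is not from the webinar: the record layout, the `orders` table, and the function names are assumptions made for illustration, and SQLite stands in for any SQL store. With schema on ingest, a malformed record is rejected once, at write time; with schema on read, every query has to cope with it again.

```python
import json
import sqlite3

# Hypothetical raw feed: three records, one with a malformed amount.
RAW_LINES = [
    '{"order_id": 1, "amount": "19.99"}',
    '{"order_id": 2, "amount": "n/a"}',
    '{"order_id": 3, "amount": "5.00"}',
]


def parse_record(line):
    """Apply the schema to one raw record: (int order_id, float amount)."""
    record = json.loads(line)
    return int(record["order_id"]), float(record["amount"])


def ingest_with_schema(lines):
    """Schema on ingest: validate each record once, at write time."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
    )
    rejected = []
    for line in lines:
        try:
            row = parse_record(line)
        except ValueError:
            rejected.append(line)  # bad record is caught before it is stored
            continue
        conn.execute("INSERT INTO orders VALUES (?, ?)", row)
    return conn, rejected


def total_on_read(lines):
    """Schema on read: keep raw text and interpret it on every query."""
    total, skipped = 0.0, 0
    for line in lines:
        try:
            _, amount = parse_record(line)
        except ValueError:
            skipped += 1  # every reader has to handle the bad record again
            continue
        total += amount
    return total, skipped
```

Calling `ingest_with_schema(RAW_LINES)` returns a connection whose `orders` table holds only the two valid rows, plus the one rejected line; `total_on_read(RAW_LINES)` reaches the same total but repeats the parsing and error handling on each call, which is why the slide suggests adding schema once data is used regularly.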

Page 7: Splice machine-bloor-webinar-data-lakes

Who Are We?

THE ONLY HADOOP RDBMS

Replace your old RDBMS with a scale-out SQL database

•  Affordable, Scale-Out

•  ACID Transactions

•  No Application Rewrites

10x Better Price/Perf

Page 8: Splice machine-bloor-webinar-data-lakes

Reference Architecture: Operational Data Lake

Offload real-time reporting and analytics from expensive OLTP and DW systems

[Diagram labels: OLTP Systems (ERP, CRM, Supply Chain, HR); Stream or Batch Updates; Operational Data Lake; ETL; Data Warehouse; Datamart; Operational Reports & Analytics; Real-Time, Event-Driven Apps; Ad Hoc Analytics; Executive Business Reports]

Page 9: Splice machine-bloor-webinar-data-lakes

Streamlining the Structured Data Pipeline in Hadoop

Traditional Hadoop Pipeline

[Diagram labels: Source Systems (ERP, CRM); Sqoop; Apply Inferred Schema; Stored as flat files; SQL Query Engines; BI Tools]

vs.

Streamlined Hadoop Pipeline

[Diagram labels: Source Systems (ERP, CRM); Existing ETL Tool; Stored in same schema; BI Tools]

Advantages

•  Reduced operational costs with less complexity

•  Reduced processing time and errors with fewer translations

•  Real-time updates for data cleansing

•  Better SQL support

Page 10: Splice machine-bloor-webinar-data-lakes

Streamlining and Hardening the ETL Processing Pipeline

Gracefully handle data quality issues and failed queries without full data reloads

Issue: Handle Data Quality Issues (e.g., duplicates)

Hadoop Issues: Hours to correct
✗ Run slow MapReduce job to de-dupe
✗ Reload entire data set (hours)

Splice Machine Solution: Seconds to correct
✓ Insert fails due to constraint violation
✓ Rollback flawed updates if necessary
✓ Reject, replace, or merge duplicates with incremental update (ms to sec)

Issue: Update/Delete Data

Hadoop Issues: Hours to correct
✗ Reload entire data set (hours)
✗ Writers block readers

Splice Machine Solution: Seconds to correct
✓ Correct data and do incremental update (ms to sec)
✓ Consistent view of data even with many concurrent updates
✓ Writers don’t block readers

Issue: ETL Failure

Hadoop Issues: Hours to correct
✗ Reload entire data set (hours)
✗ Miss ETL window, leading to either delayed reports or stale data

Splice Machine Solution: Seconds to correct
✓ Rollback failed step
✓ Retry failed step and continue

Issue: Fast Query Speeds

Hadoop Issues:
✗ Results typically no faster than seconds because data stored in random formats
✗ MapReduce

Splice Machine Solution:
✓ Results possible in milliseconds because data stored in highly optimized format
✓ No MapReduce
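
The “seconds to correct” column relies on ordinary relational features: constraints, transactional rollback, and row-level updates. The sketch below shows those three behaviors in Python, using the standard library's SQLite module as a stand-in. It is not Splice Machine code, and the `customers` table and helper names are hypothetical.

```python
import sqlite3


def load_batch(conn, rows):
    """Load one ETL step atomically; returns True if the step committed."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.executemany(
                "INSERT INTO customers (id, name) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate key: nothing from this batch was stored


def merge_row(conn, row):
    """Replace a duplicate in place instead of reloading the data set."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)", row
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

load_batch(conn, [(1, "Ann"), (2, "Bob")])
# This batch contains a duplicate id, so the whole step is rolled back.
ok = load_batch(conn, [(3, "Cy"), (2, "Bob v2")])
# Retry the failed step: insert the clean row, then merge the duplicate.
load_batch(conn, [(3, "Cy")])
merge_row(conn, (2, "Bob v2"))
```

The primary key rejects the duplicate at insert time, the failed step leaves no partial rows behind, and the correction touches a single row. A flat-file pipeline without constraints or transactions would typically need a separate de-duplication pass or a full reload to reach the same state.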

Page 11: Splice machine-bloor-webinar-data-lakes

Complementing Existing Hadoop-Based Data Lakes

Optimizing storage and querying of structured data as part of ELT or Hadoop query engines

[Diagram labels: OLTP Systems (ERP, CRM, Supply Chain, HR); Structured Data; Unstructured Data; HCATALOG; Pig]

SCHEMA ON INGEST: Streamlined, structured-to-structured integration

SCHEMA BEFORE READ: Repository for structured data or metadata from ELT process on unstructured data

SCHEMA ON READ: Ad-hoc Hadoop queries across structured and unstructured data

Page 12: Splice machine-bloor-webinar-data-lakes

Case Study: Operational Data Lake

Overview

•  Computer technology corporation

•  Update database technology for:

•  ODS layer replacement

•  ETL processing and analysis of Omniture data

•  Real-time OLTP for Global Tech Support app

Challenges

•  Oracle and Teradata too expensive to scale

•  Many Oracle queries couldn’t complete

•  Can only hold 7 days’ worth of data in Oracle

•  Missing ETL window with current Hadoop data lake

Solution Diagram

[Diagram labels: OLTP Systems (ERP, CRM, Supply Chain); (400TB)]

Benefits

•  75% less cost with commodity scale out

•  Incremental ETL processing gracefully handles data quality issues

•  5x-10x faster, completing queries on which Oracle failed

Page 13: Splice machine-bloor-webinar-data-lakes

Reference Architecture: Unified Customer Profile

Improve marketing ROI with deeper customer intelligence and better cross-channel coordination

[Diagram labels: Unified Customer Profile (aka DMP); Operational Reports for Campaign Performance; Social Feeds; Web/eCommerce Clickstreams; Website Datamart; Stream or Batch Updates; BI Tools; Demand Side Platform (DSP); Ad Exchange; 1st Party/CRM Data; 3rd Party Data (e.g., Acxiom); Ad Perf. Data (e.g., DoubleClick); Email Mktg Data; Call Center Data; POS Data; Email Marketing App; Ad Hoc Audience Segmentation; BI Tools]

Page 14: Splice machine-bloor-webinar-data-lakes

Campaign Management: Harte-Hanks

Overview

•  Digital marketing services provider

•  Unified Customer Profile

•  Real-time campaign management

•  Complex OLTP and OLAP environment

Challenges

•  Oracle RAC too expensive to scale

•  Queries too slow, even up to ½ hour

•  Getting worse: expect 30-50% data growth

•  Looked for 9 months for a cost-effective solution

Solution Diagram

[Diagram labels: Cross-Channel Campaigns; Real-Time Personalization; Real-Time Actions]

Initial Results

•  ¼ cost with commodity scale out

•  3-7x faster through parallelized queries

•  10-20x price/perf with no application, BI or ETL rewrites

Page 15: Splice machine-bloor-webinar-data-lakes

Proven Building Blocks: Hadoop and Derby

APACHE DERBY

§  ANSI SQL-99 RDBMS

§  Java-based

§  ODBC/JDBC Compliant

APACHE HBASE/HDFS

§  Auto-sharding

§  Real-time updates

§  Fault-tolerance

§  Scalability to 100s of PBs

§  Data replication

   

Page 16: Splice machine-bloor-webinar-data-lakes

Typical Database Workloads

Operational Applications
Typical Databases: MySQL, Oracle, MongoDB
Use Cases: OLTP (ERP, CRM), Websites
Typical Users: Customers, Operational Employees
Workload Strengths: High concurrency of small reads/writes; Range queries

Operational Reporting & Analytics
Typical Databases: MySQL, Oracle
Use Cases: Operational Datastores
Typical Users: Operational Employees
Workload Strengths: Parameterized reports against real-time data; Range queries

Ad-Hoc Analytics
Typical Databases: Greenplum, Paraccel, Netezza
Use Cases: Exploratory Analytics, Data Mining
Typical Users: Analysts, Data Scientists
Workload Strengths: Complex queries requiring full table scans

Enterprise Data Warehouses
Typical Databases: Teradata, Oracle, Sybase IQ
Use Cases: Enterprise Reporting
Typical Users: Managers, Executives
Workload Strengths: Parameterized reports against historical data

Page 17: Splice machine-bloor-webinar-data-lakes

Use Cases

•  Internet of Things

•  Operational Data Lake

•  Digital Marketing

•  Personalized Medicine

•  Fraud Detection

Page 18: Splice machine-bloor-webinar-data-lakes

Operational Data Lake: Great On-Ramp to Big Data

§  Clear Business Value Now

•  Replace obsolete Operational Data Stores (ODSs)

•  Existing use cases, not just a science project

•  Hadoop RDBMS: inexpensive to store data

§  Incremental On-Ramp to Big Data

•  Start with structured data and then expand to unstructured

•  Add schema when needed

Page 19: Splice machine-bloor-webinar-data-lakes

Getting Started with Hadoop: Operational Data Lake

Rich Reimer, VP, Product Management

[email protected]