datavirtuality - beyond the data lake


Upload: dataconomy-media

Post on 08-Jan-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: datavirtuality - Beyond the data lake

US Office:

1355 Market Street, #488

San Francisco, CA 94103

German Office:

Katharinenstr. 15

04109 Leipzig, Germany

Beyond the Data Lake Simplifying data integration for the modern age

Matthias Korn | Technical Consultant

[email protected]

Page 2: datavirtuality - Beyond the data lake

2 The Challenge

Gartner 2014: “VARIETY is the biggest challenge.”

“When asked about the dimensions of data organizations struggle with most, 49% answered variety, while 35% answered volume and 16% velocity.”

Page 3: datavirtuality - Beyond the data lake

3 Integration using the Data Warehouse

Data is integrated by copying it into a central repository

Approach: ETL process

Structure is applied in the repository

BI users query Data Marts
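To make the copy-based flow on this slide concrete, here is a minimal sketch of an ETL run into a central repository with a data mart on top. It assumes an in-memory SQLite database as a stand-in for the warehouse. The source rows and the names `dwh_orders` and `mart_revenue_by_customer` are hypothetical and not taken from the talk.

```python
import sqlite3

# Hypothetical source rows, standing in for an operational system.
source_orders = [
    {"id": 1, "customer": " alice ", "amount_cents": 1250},
    {"id": 2, "customer": "BOB", "amount_cents": 990},
    {"id": 3, "customer": "Alice", "amount_cents": 500},
]

def extract():
    """Extract: read the rows out of the source system."""
    return list(source_orders)

def transform(rows):
    """Transform: normalize names and convert cents to currency units."""
    return [
        (r["id"], r["customer"].strip().lower(), r["amount_cents"] / 100)
        for r in rows
    ]

def load(conn, rows):
    """Load: copy the cleaned rows into the central repository."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dwh_orders "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO dwh_orders VALUES (?, ?, ?)", rows)

def build_data_mart(conn):
    """Expose a subject-oriented view for BI users (the data mart)."""
    conn.execute(
        "CREATE VIEW IF NOT EXISTS mart_revenue_by_customer AS "
        "SELECT customer, SUM(amount) AS revenue "
        "FROM dwh_orders GROUP BY customer"
    )

warehouse = sqlite3.connect(":memory:")  # assumption: SQLite plays the DWH
load(warehouse, transform(extract()))
build_data_mart(warehouse)

mart_rows = warehouse.execute(
    "SELECT customer, revenue FROM mart_revenue_by_customer ORDER BY customer"
).fetchall()
print(mart_rows)
```

Note that the structure (types, cleaned names) is fixed before loading, so any change in the source model means changing `transform` and reloading.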

Page 4: datavirtuality - Beyond the data lake

4 Why do so many DWH projects fail?

Slow data-to-actionable-insights (6 to 9+ months)

77% failure rate*

Inflexible; costly modifications

Labour-intensive setup and maintenance

Page 5: datavirtuality - Beyond the data lake

5 Data Lake – getting data in is pretty easy…

Page 6: datavirtuality - Beyond the data lake

6 …but making sense of it is the challenge

Business User

Page 7: datavirtuality - Beyond the data lake

7 Approaches to data fishing

Situation improved with YARN

Apache Mahout, HBase, Hive, Pig and MapReduce

Data Marts are created

BI users' reporting tools query Data Marts

Wait, didn't they do this before already?

Page 8: datavirtuality - Beyond the data lake

8 “Transform” just changed its position: ETL -> ELT

Data Marts have to be created by Data Scientists

BI users can't do new things

No permission concept

A lot of the stored data is never used, eating up the savings from low storage costs
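The shift from ETL to ELT can be sketched as follows: raw data is loaded untouched, and the transform step runs afterwards inside the store. This is an illustration only. It assumes an in-memory SQLite database in place of the data lake, and the table and view names (`raw_orders`, `mart_revenue`) are hypothetical.

```python
import sqlite3

lake = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the lake

# Extract + Load: raw rows land as they are, with no cleaning applied.
lake.execute(
    "CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER)"
)
lake.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, " alice ", 1250), (2, "BOB", 990), (3, "Alice", 500)],
)

# Transform: happens after loading, when someone defines the data mart.
lake.execute(
    """
    CREATE VIEW mart_revenue AS
    SELECT lower(trim(customer)) AS customer,
           SUM(amount_cents) / 100.0 AS revenue
    FROM raw_orders
    GROUP BY lower(trim(customer))
    """
)

elt_rows = lake.execute(
    "SELECT customer, revenue FROM mart_revenue ORDER BY customer"
).fetchall()
print(elt_rows)
```

The cleaning logic is the same as in a classic ETL job; it has only moved behind the load step, and someone still has to write it before BI users can query the mart.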

Page 9: datavirtuality - Beyond the data lake

9 The Logical Data Warehouse

Introduced by Gartner in 2012

Adds virtualization of data sources

Adds distributed processes

Logically consistent, subject-oriented integration of time-variant data

Via data management infrastructure

Page 10: datavirtuality - Beyond the data lake

10 Logical Data Warehouse (LDW)

Page 11: datavirtuality - Beyond the data lake

11 What does the Logical Data Warehouse do?

LDW knows where the data is stored instead of copying it

Repositories are used for data sources that are too slow

Presents all data in a single virtual database

Quickly reacts to changes in data models of source systems
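The idea of a single virtual database over several sources can be approximated in a few lines. The sketch below is not DataVirtuality's API. It assumes SQLite's `ATTACH DATABASE` as a stand-in for source connectors, with two hypothetical sources (`crm` and `shop`) and a temporary view as the virtual layer.

```python
import sqlite3

# One connection plays the role of the logical layer.
ldw = sqlite3.connect(":memory:")

# Assumption: each attached database represents a separate source system.
ldw.execute("ATTACH DATABASE ':memory:' AS crm")
ldw.execute("ATTACH DATABASE ':memory:' AS shop")

ldw.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
ldw.execute(
    "CREATE TABLE shop.orders "
    "(id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
ldw.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")]
)
ldw.executemany(
    "INSERT INTO shop.orders VALUES (?, ?, ?)",
    [(10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5)],
)

# The virtual layer: a view that knows where the data lives but copies nothing.
# (SQLite only lets TEMP views reference tables in other attached databases.)
ldw.execute(
    """
    CREATE TEMP VIEW customer_revenue AS
    SELECT c.name AS customer, SUM(o.amount) AS revenue
    FROM crm.customers AS c
    JOIN shop.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    """
)

query = "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
virtual_rows = ldw.execute(query).fetchall()

# A change in a source is visible on the next query, with no reload step.
ldw.execute("INSERT INTO shop.orders VALUES (13, 2, 2.5)")
rows_after_change = ldw.execute(query).fetchall()
print(virtual_rows, rows_after_change)
```

Because the view is resolved at query time, the second result already includes the new order, which is the behaviour the slide contrasts with copy-based integration.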

Page 12: datavirtuality - Beyond the data lake

12 Advantages of the Logical Data Warehouse

Real-time data available and ready for analysis

Immediately productive

Different use cases supported: Exploration, data manipulation and batch processing

Data Model creation not tied to physical database: Logical Data Model!

Permission concept implemented

Web service access using virtualization

Write back to the connected data sources

Page 13: datavirtuality - Beyond the data lake

13 Example data flow in an LDW

BI frontend aware of all data sources - creates SQL statement

Distributed query taking place

Performance optimization engine replicates data only if needed
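The "replicate only if needed" step can be pictured with a small sketch. It is a guess at the general idea, not the product's actual optimizer: the class `VirtualTable`, the simulated `source_cost_ms`, and the 100 ms threshold are all hypothetical.

```python
class VirtualTable:
    """Hypothetical wrapper: reads from the source until it proves too slow."""

    def __init__(self, name, fetch_from_source, source_cost_ms, threshold_ms=100):
        self.name = name
        self.fetch_from_source = fetch_from_source
        self.source_cost_ms = source_cost_ms  # simulated latency, not measured
        self.threshold_ms = threshold_ms
        self.replica = None                   # local copy, created on demand
        self.source_reads = 0

    def rows(self):
        # Serve from the replica once one exists.
        if self.replica is not None:
            return self.replica
        self.source_reads += 1
        data = self.fetch_from_source()
        # Replicate only when the source is slower than the threshold.
        if self.source_cost_ms > self.threshold_ms:
            self.replica = list(data)
        return data


fast = VirtualTable("crm", lambda: [("Alice",), ("Bob",)], source_cost_ms=20)
slow = VirtualTable("weblogs", lambda: [("GET /",), ("POST /",)], source_cost_ms=900)

for _ in range(3):
    fast.rows()
    slow.rows()

print(fast.source_reads, fast.replica)  # fast source is queried live each time
print(slow.source_reads, slow.replica)  # slow source is read once, then cached
```

The sketch never refreshes the replica; a real engine would also need an invalidation or refresh policy, which the slides do not describe.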

Page 14: datavirtuality - Beyond the data lake

DataVirtuality Thanks for your attention!

Connect with us at any time.

[email protected]