the inside scoop on hadoop - neudesic · pdf filethe inside scoop on hadoop ... * a modern...

36
The Inside Scoop on Hadoop Orion Gebremedhin National Solutions Director – BI & Big Data , Neudesic LLC. VTSP – Microsoft Corp. [email protected] [email protected] @OrionGM

Upload: nguyenkhanh

Post on 06-Feb-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

The Inside Scoop on Hadoop Orion Gebremedhin National Solutions Director – BI & Big Data , Neudesic LLC. VTSP – Microsoft Corp.

[email protected] [email protected]

@OrionGM

Page 2: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

The Inside Scoop on Hadoop

Page 3: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

•  Understanding Hadoop •  Big Data Solution Deployment Models •  Architecting the Modern Data Warehouse •  Summary

Topics Covered

Page 4: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Understanding Hadoop

Page 5: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Big Data = Hadoop?

* A Modern Data Architecture with Apache Hadoop, Hortonworks Inc. 2014

Page 6: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

The Fundamentals of Hadoop

Hadoop evolved directly from commodity scientific supercomputing clusters developed in the 1990s. Hadoop consists of a parallel execution framework called •  Map/Reduce and •  Hadoop Distributed File System (HDFS).

Page 7: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Latest Developments

Page 8: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

HDFS

•  Very high fault tolerance •  Can not be updated but corrections can be appended •  File blocks are replicated multiple types

Three types nodes: Name Node (Directory) Backup Node ( checkpoint) Data Node-actual data

Page 9: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

MapReduce

•  A programing framework for library and runtime. just like .NET

•  Map Function - Take a task and break it down into small tasks

•  Reduce Function - Combine the partial answers and find the combined list

•  Master (Job Tracker) •  Is where you submit a query. Manages the Task Trackers which do the actual

Map or Reduce task. •  Workers (Task Trackers)

•  Do the work, just as each nodes in the cluster have a data node, they also have a task tracker

Page 10: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Basics of MapReduce

400 bills 1 bill/ sec

= 400 Seconds

Page 11: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Basics of MapReduce

1 bill/ sec

1 bill/ sec

= 200 Seconds

= 200 Seconds

200 Bills

200 Bills

Total =200 seconds

Page 12: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Basics if MapReduce

= 100 Seconds 100 bills

100 bills

100 bills

100 bills = 100 Seconds

= 100 Seconds

= 100 Seconds

Total =100 seconds

Page 13: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Basics of MapReduce

Query

Result

Name Node/Job Tracker Query

Data Nodes/Task Trackers

Page 14: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

HDFS and MapReduce

The Main Node: runs the Job tracker and The name node controls the files. Each node runs two processes: Task Tracker and Data Node

Page 15: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Hive and Pig

MapReduce •  Java

•  write many lines of code

Pig •  Mostly used by yahoo •  highly used for data

processing •  Shares some constructs with

SQL e.g. filtering, selecting, grouping, and ordering. But syntax is very different from sql.

•  Is more Verbose •  Needs a lot of training for

users with limited procedural programming background.

•  Gives you more control over the flow of data.

Hive •  Mostly used by Facebook for

analytic purposes •  Used for analytics •  Relatively easier for

developers with SQL experience.

•  Less control over optimization of data flows compared to Pig

•  Not as efficient as MapReduce •  Higher productivity for data scientists and developers

Page 16: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

HDFS

* A Modern Data Architecture with Apache Hadoop, Hortonworks Inc. 2014

Page 17: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Big Data Solution Deployment Models

Page 18: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Major Players in Big Data

•  Hortonworks •  Cloudera •  MapR •  Pentaho •  Amazon (AWS) •  …

Apache Foundation

Page 19: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Hortonworks

•  June 2011 funded by $23 million from Yahoo! and Benchmark Capital as an independent company

•  Horton the Elephant - Horton Hears a Who!

•  Employs contributors to project Apache Hadoop

•  October 2011 partnered with Microsoft : Azure and Windows Server .

•  Cloudera founded in October 2008…started the effort to be Microsoft Azure Certified in October 2014.

Page 20: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

HDP User Interface

Page 21: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Page 22: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Deployment Models

•  On Premise Deployment •  Microsoft Analytics Platform System (APS) •  Oracle Big Data Appliance •  Hortonworks Data Platform (HDP) •  Cloudera's CDH •  Pivotal Data Computing Appliance (DCA)

•  Big Data as a service •  HDInsight •  Cloudera on AWS •  Amazon RedShift •  Amazon Elastic MapReduce

Page 23: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

HDInsight: Hadoop As A Cloud Service

Page 24: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

HDFS

* A Modern Data Architecture with Apache Hadoop, Hortonworks Inc. 2014

Page 25: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

HDInsight Versions

Page 26: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Architecting the Modern Data Warehouse

Page 27: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

The ETL Automation Model

* A Modern Data Architecture with Apache Hadoop, Hortonworks Inc. 2014

Page 28: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

The ETL Automation Model

Page 29: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

BI-Integration Model

Page 30: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Data  Sources

Reporting  &  Analytics

Staging  

Mainframes  and  proprietary  data  sources

Flat  Files  sourcesSQL  Server

HDFS

Hadoop

Data  Integration  tool

Enterprise  Data  Models

EnterpriseStaging  Area

EDW

EnterpriseData  Warehouse

Charts  and  DashboardsReports

Oracle,  DB2,  etc

Self-­‐Service  BI(Excel  Pivot  Reports,  etc)

Hybrid BI-Integration Model

Page 31: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Hybrid BI-Integration Model

EDW_neuBikes

Extract  to  BLOB

Upload

Integration(SSIS,  Data  Model)

www.neuBikes.comSource:  weblogs

Power  BI  Data  Source:  EDW

Power  BI  Data  Source:  Azure  Blob  Store

On-­‐Prem  Data  Warehouse(SQL,  SSAS-­‐Tabular,  SSAS  OLAP)

Container:  ssugstaging

Compute:  HDInsightCluster:ssughadoopContainer:ssughadoopNodes:  1

Databases:  SQL  Server,  Oracle,  etc.

Third  Party  applications

Flat  Files

Container:  ssugedw

Page 32: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Hybrid BI-Integration Model

Page 33: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Page 34: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Summary

Page 35: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Summary

•  Understand your data growth to determine when to “Scale-Out”.

•  Determine the right tool for the workload you have.

•  Choose the right deployment of Big Data Solutions

•  Hybridize, do not start from scratch!

Page 36: The Inside Scoop on Hadoop - Neudesic · PDF fileThe Inside Scoop on Hadoop ... * A Modern Data Architecture with Apache Hadoop, ... Hybrid BI-Integration Model

© Copyright 2015, Neudesic. All rights reserved. © Copyright 2015, Neudesic. All rights reserved.

Questions and Discussion

Questions?