The BI Guy's Little Guide to Big Data

MAKING BUSINESS INTELLIGENT www.pragmaticworks.com
The BI Guy's Little Guide to Big Data
Chris Price, Senior BI Consultant, @BluewaterSQL

Upload: chris-price

Post on 15-Jan-2015


Category:

Technology



DESCRIPTION

You know Pig is more than a farm animal and that Hive is not some ultra-hip bar. You're beyond the buzzwords and the word-count demos. Now… you're ready to figure out how it all fits in. In this session we will review common integration scenarios, proven patterns and best practices for integrating Big Data solutions into your existing data warehouse and BI architecture. Learn how you too can ride the Big Data wave without reinventing the wheel, both enhancing the information you currently deliver and solving problems that were previously unapproachable.

TRANSCRIPT

The BI Guy's Little Guide to Big Data

Chris Price, Senior BI Consultant
@BluewaterSQL

Introductions…

Chris Price, Senior BI Consultant with Pragmatic Works
  Author
  Regular speaker
  Data geek & super dad

@BluewaterSQL | http://bluewatersql.wordpress.com | cprice@pragmaticworks.com

Big Data: Data Explosion

As recently as 2000, only ¼ of data was digital
  The rest was paper, film or other analog media
According to IBM, 90% of data was created in the last 2 years
  Data volume now growing 10x every 5 years
  Approximately 85% from new sources
Consumerization
  4.3 connected devices per adult
  27% use social media input

Big Data

[Figure: data volume (megabytes, gigabytes, terabytes, petabytes) plotted against data complexity (variety and velocity), with three groups of sources]
Traditional: payables, payroll, inventory, contacts, orders, campaigns
Web: web logs, search marketing, recommendations, affiliates, advertising, mobile, collaboration, eCommerce
Big Data: service logs, spatial & GPS coordinates, data market feeds, eGov feeds, weather, text/image, click stream, wikis/blogs, sensors, RFID/devices, SMS, HD audio/video

Source: Brian Mitchel, TechEd 2013

Big Data is, well… Big

Drove $28b in IT investment in 2012
  Expected to grow to $34b in 2014
Challenges
  Data volumes (hardware/storage economics)
  Data diversity (multiple types & sources)
  Data velocity (real-time)
  User expectations
How do we plan/integrate/…?

Agenda

Hadoop landscape
Current BI/DW landscape
BI/DW & Hadoop intersection
  Tools/techniques/strategies

Hadoop Ecosystem

Hadoop on Windows

HDInsight on Windows Azure
  Seamlessly scale in the cloud
  Backed by Azure Storage Vault (ASV)
Hortonworks Data Platform (HDP)
  On-premise
  Based on HDFS

Current Landscape

[Diagram: three layers]
Client Tools: Reporting Services, SharePoint, Microsoft Applications
BI/DW System: DW, Cubes
Data Sources: Traditional Sources (CRM/ERP/LOB/Web)

Future Landscape

[Diagram: three layers, with Hadoop and new sources added]
Client Tools: Reporting Services, SharePoint, Microsoft Applications
BI/DW System: DW, Cubes
Data Sources: Traditional Sources (CRM/ERP/LOB/Web)
Added: Hadoop; New Sources (Email, Logs, Social Media, Sensor)

Business Scenario

[Diagram labels: DW, Cube, Hadoop/HDFS, Reporting Tools, Sensor Data; connectors: ODBC (three), Sqoop, Flume, WebHDFS]

What about Azure?

[Diagram labels: DW, Cube, Hadoop, Reporting Tools, Azure Blob Storage; connectors: ODBC (three), Sqoop, AzCopy]

Tools, Techniques & Strategies

Enterprise Data Services
  WebHDFS
  Sqoop
  HCatalog
  Pig/Hive
Enterprise Operational Services
  Oozie
Other
  Windows Azure Blob Storage & AzCopy
  Hive ODBC
  PolyBase

WebHDFS

Born from HFTP; intended as its replacement
  Widely used by Yahoo
High-performance, first-class native protocol using an industry-standard RESTful mechanism
Complete interface for reading, writing & managing files
Supports secure authentication
Data locality - requests are sent to data nodes

WebHDFS - Get Example

Request:
  curl -i -L "http://host:port/webhdfs/v1/foo/bar?op=OPEN"

Response:
  HTTP/1.1 307 TEMPORARY_REDIRECT
  Content-Type: application/octet-stream
  Location: http://datanode:50075/webhdfs/v1/foo/bar?op=OPEN&offset=0
  Content-Length: 0

  HTTP/1.1 200 OK
  Content-Type: application/octet-stream
  Content-Length: 22

  Hello, webhdfs user!

WebHDFS - More Examples

Rename request:
  curl -i -X PUT "http://host:port/webhdfs/v1/foo/bar?op=RENAME&destination=/foo/bar2"

Create directory request:
  curl -i -X PUT "http://host:port/webhdfs/v1/foo2?op=MKDIRS"
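
The WebHDFS request URLs above all follow one pattern: the /webhdfs/v1 prefix, an absolute HDFS path, then an op parameter plus any operation-specific arguments. A minimal Python sketch of that pattern follows. The helper name webhdfs_url, the host name and port 50070 are illustrative assumptions, not part of the talk, and the sketch only builds URLs without sending any request.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS v1 URL of the form http://host:port/webhdfs/v1<path>?op=..."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # Keep "/" unescaped in parameter values so paths stay readable.
    query = urlencode({"op": op, **params}, safe="/")
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical name node address; substitute your own cluster's host and port.
HOST, PORT = "namenode.example.com", 50070

open_url = webhdfs_url(HOST, PORT, "/foo/bar", "OPEN")
rename_url = webhdfs_url(HOST, PORT, "/foo/bar", "RENAME", destination="/foo/bar2")
mkdirs_url = webhdfs_url(HOST, PORT, "/foo2", "MKDIRS")

print(open_url)
print(rename_url)
print(mkdirs_url)
```

An OPEN request sent to such a URL is answered with a 307 redirect to a data node, which is why the curl Get example passes -L.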

Sqoop

Tool designed to efficiently move data between Hadoop (Hive & HBase) and an RDBMS
  Importing (single and all tables)
  Exporting
  Eval (query execution)
  Merge (multiple HDFS datasets)
  Incremental imports
Generates MapReduce jobs
Can control the level of parallelism
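
As a rough illustration of a single-table import with controlled parallelism and an optional incremental load, the Python sketch below assembles a Sqoop command line. The flags used (--connect, --table, --target-dir, --num-mappers, --incremental, --check-column, --last-value) are standard Sqoop 1 import options; the helper name, JDBC URL, table and column names are hypothetical, and the sketch only builds the argument list rather than running Sqoop.

```python
def sqoop_import_args(jdbc_url, table, target_dir, num_mappers=4,
                      check_column=None, last_value=None):
    """Return the argument list for a `sqoop import` of one table."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        # Number of parallel map tasks used for the import.
        "--num-mappers", str(num_mappers),
    ]
    if check_column is not None:
        if last_value is None:
            raise ValueError("last_value is required for an incremental import")
        # Append-mode incremental import: only rows whose check column
        # exceeds the last imported value are fetched.
        args += [
            "--incremental", "append",
            "--check-column", check_column,
            "--last-value", str(last_value),
        ]
    return args


# Hypothetical SQL Server source and HDFS target.
full_load = sqoop_import_args(
    "jdbc:sqlserver://dbhost:1433;databaseName=Sales",
    "Orders", "/data/orders", num_mappers=8)
incremental = sqoop_import_args(
    "jdbc:sqlserver://dbhost:1433;databaseName=Sales",
    "Orders", "/data/orders", check_column="OrderID", last_value=1000)

print(" ".join(full_load))
print(" ".join(incremental))
```

The resulting list could be handed to subprocess.run on a machine where Sqoop is installed.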

Sqoop Demo

HCatalog/Hive/Pig

HCatalog - metadata & table management
  Users interact with a set of defined tables
  Abstracts away the where/how of data storage
  Allows for consistent access
Pig - ETL/data transformation
  Scripting in Pig Latin
  Java user-defined functions (Piggybank/DataFu)
Hive - SQL-like interface
  Allows ad-hoc queries for data summarization and analysis
  ODBC connector
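
To make "ad-hoc queries for data summarization" concrete, the plain-Python sketch below computes what a Hive query of the form SELECT device, AVG(reading) FROM sensor_readings GROUP BY device would return. The table name, its columns and the sample rows are hypothetical; on a cluster Hive would run the query as distributed jobs rather than as a local loop like this one.

```python
from collections import defaultdict

# Hypothetical rows of a sensor_readings table: (device, reading).
rows = [
    ("pump-1", 10.0),
    ("pump-2", 4.0),
    ("pump-1", 14.0),
    ("pump-2", 6.0),
]


def average_by_device(rows):
    """Group readings by device and return the mean reading per device."""
    totals = defaultdict(lambda: [0.0, 0])  # device -> [sum, count]
    for device, reading in rows:
        totals[device][0] += reading
        totals[device][1] += 1
    return {device: total / count for device, (total, count) in totals.items()}


averages = average_by_device(rows)
print(averages)
```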

Demo: Pig & Hive

Oozie

Scalable, reliable, extensible workflow management system/job scheduler
Triggered by
  Time
  Data availability
Can run and orchestrate multiple jobs
  MapReduce and Streaming MapReduce
  Hive
  Pig
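
The orchestration idea can be sketched without Oozie itself: jobs declare which jobs they wait for, and a job is released only once its input data is present. The Python below is a conceptual illustration only, not Oozie's API or workflow format; the job names and HDFS paths are hypothetical, and it relies on graphlib from the Python 3.9+ standard library.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it must wait for.
dependencies = {
    "mapreduce_parse": set(),
    "pig_cleanse": {"mapreduce_parse"},
    "hive_summarize": {"pig_cleanse"},
}


def run_order(deps):
    """Return an execution order in which every job follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


def inputs_available(required_paths, present_paths):
    """Data-availability trigger: True once every required input path exists."""
    return set(required_paths) <= set(present_paths)


order = run_order(dependencies)
ready = inputs_available(["/logs/2013-06-01"], ["/logs/2013-06-01", "/logs/2013-05-31"])
print(order, ready)
```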

Windows Azure Blob Storage

Also called Azure Storage Vault (ASV)
Persistent, highly scalable storage with built-in geo-replication
Azure HDInsight clusters are wired for ASV
  On-premise HDP uses HDFS
Separates data from compute nodes
  Clusters can be created and dropped, minimizing costs
  Multiple clusters can share data
The Azure Flat (Quantum 10) mesh grid network is the key
  Violates the principle of data locality but outperforms HDFS and Azure competitors

Windows Azure Blob Storage

Source: http://dennyglee.com/2013/03/18/why-use-blob-storage-with-hdinsight-on-azure

AzCopy

Copies files to and from Windows Azure Blob Storage
Similar to Robocopy
Command line:
  AzCopy C:\Beer https://stg.blob.core.windows.net/data/Beer /destKey:<MyKey> /S /V
Recursively (/S) copies all files in the Beer directory with verbose (/V) logging

AzCopy Demo

PolyBase

Part of Parallel Data Warehouse; allows integration of relational and non-relational data
Creates external tables via an HDFS bridge
Allows on-the-fly joins within SQL Server
Supports parallel
  Imports from HDFS
  Exports to HDFS

Resources

Bloggers
  Denny Lee: http://dennyglee.com
  Carl Nolan: http://blogs.msdn.com/b/carlnol/archive/tags/hadoop+streaming
  Cindy Gross: http://blogs.msdn.com/b/cindygross/archive/tags/big+data
Books
  Hadoop: The Definitive Guide - Tom White
  Programming Pig - Alan Gates
  Programming Hive - Edward Capriolo
  Hadoop MapReduce Cookbook - Srinath Perera
Links to this presentation
  http://bluewatersql.wordpress.com/resources
  http://www.slideshare.net/bluewatersql/big-dataguide

Thank you

@BluewaterSQL | http://bluewatersql.wordpress.com | cprice@pragmaticworks.com

Page 5: The BI Guy's Little Guide to Big Data

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Big Data is wellhellipBig Drove $28b in IT investment in 2012

Expected to grow to $34b in 2014 Challenges

Data Volumes (HardwareStorage Economics) Data Diversity (Multiple Types amp Sources) Data Velocity (Real-Time) User-Expectations

How do we planintegratehelliphellip

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Agenda Hadoop Landscape Current BIDW Landscape BIDW amp Hadoop Intersection

ToolsTechniquesStrategies

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Hadoop Ecosystem

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Hadoop on Windows HDInsight on Windows Azure

Seamlessly scale in the cloud Backed by Azure Storage Vault (ASV)

Hortonworks Data Platform (HDP) On-Premise Based on HDFS

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Current Landscape

Clie

nt T

ools

Reporting Services SharePoint Microsoft Applications

DATA

SO

URC

ES

Traditional Sources (CRMERPLOBWeb)

BID

W S

yste

m

DW Cubes

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Clie

nt T

ools

BID

W S

yste

m

DW

Reporting Services SharePoint Microsoft Applications

DATA

SO

URC

ES

Traditional Sources (CRMERPLOBWeb)

Cubes

Future Landscape

Hadoop

New Sources (Email Logs Social Media Sensor)

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Business Scenario

DW Cube

HadoopHDFS

ODBCODBC

Sqoo

p

OD

BC

Reporting Tools

Flume

Sensor DataWebHDFS

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

What about Azure

DW Cube

Hadoop

ODBCODBC

Sqoo

p

OD

BC

Reporting Tools

AzCopy

Azure Blob Storage

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Tool Techniques amp Strategies

Enterprise Data Services WebHDFS Sqoop Hcatalog PigHive

Enterprise Operational Services Oozie

Other Windows Azure Blob Storage amp AzCopy Hive ODBC Polybase

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

WebHDFS
  • Born from HFTP; intended as a replacement
  • Widely used by Yahoo
  • High-performance, first-class native protocol using an industry-standard RESTful mechanism
  • Complete interface for reading, writing & managing files
  • Supports secure authentication
  • Data locality – requests are sent to the data nodes

WebHDFS – Get Example

Request:
curl -i -L "http://host:port/webhdfs/v1/foo/bar?op=OPEN"

Response:
HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://datanode:50075/webhdfs/v1/foo/bar?op=OPEN&offset=0
Content-Length: 0

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 22

Hello, webhdfs user!

WebHDFS – More Examples

Rename request:
curl -i -X PUT "http://host:port/webhdfs/v1/foo/bar?op=RENAME&destination=/foo/bar2"

Create directory request:
curl -i -X PUT "http://host:port/webhdfs/v1/foo2?op=MKDIRS"
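
The WebHDFS requests above all share one URL shape: a host and port, the /webhdfs/v1 prefix, a file path, and an op query parameter. A minimal Python sketch of that pattern follows. The helper name webhdfs_url, the host namenode.example.com, port 50070, and the paths are illustrative assumptions, not values from the talk. The sketch only builds the URLs and does not contact a cluster.

```python
from urllib.parse import quote, urlencode

# HTTP verb used by each of the WebHDFS operations shown on the slides.
HTTP_METHOD = {"OPEN": "GET", "RENAME": "PUT", "MKDIRS": "PUT"}


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS URL: http://host:port/webhdfs/v1/<path>?op=<OP>&..."""
    # `op` goes first, followed by any operation-specific parameters.
    query = urlencode({"op": op.upper(), **params})
    return f"http://{host}:{port}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


# Hypothetical NameNode address and paths, for illustration only.
NAMENODE = ("namenode.example.com", 50070)

open_url = webhdfs_url(*NAMENODE, "/foo/bar", "OPEN")
rename_url = webhdfs_url(*NAMENODE, "/foo/bar", "RENAME", destination="/foo/bar2")
mkdirs_url = webhdfs_url(*NAMENODE, "/foo2", "MKDIRS")

for op, url in (("OPEN", open_url), ("RENAME", rename_url), ("MKDIRS", mkdirs_url)):
    print(HTTP_METHOD[op], url)
```

The destination path is percent-encoded by urlencode, so it appears as %2Ffoo%2Fbar2 in the generated URL.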

Sqoop
  • Tool designed to efficiently move data between Hadoop (Hive & HBase) and an RDBMS
      • Importing (single and all tables)
      • Exporting
      • Eval (query execution)
      • Merge (multiple HDFS datasets)
      • Incremental imports
  • Generates MapReduce jobs
  • Can control the level of parallelism
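
To make the import, incremental-import, and parallelism points concrete, here is a rough Python sketch that assembles a sqoop import command line. The helper name sqoop_import_args, the JDBC URL, the table, and the target directory are hypothetical. The flags are standard Sqoop 1 import options. Credentials are left out, and the sketch only builds the argument list without running Sqoop.

```python
def sqoop_import_args(jdbc_url, table, target_dir, num_mappers=4,
                      check_column=None, last_value=None):
    """Return the argv list for a `sqoop import` of one table into HDFS."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        # Level of parallelism: number of map tasks that split the import.
        "--num-mappers", str(num_mappers),
    ]
    if check_column is not None:
        # Incremental import: fetch only rows whose check column
        # is greater than the last value already imported.
        args += ["--incremental", "append",
                 "--check-column", check_column,
                 "--last-value", str(last_value)]
    return args


# Hypothetical source database, table, and HDFS target directory.
full_load = sqoop_import_args(
    "jdbc:sqlserver://dwserver:1433;databaseName=Sales",
    "Orders", "/data/orders", num_mappers=8)
delta_load = sqoop_import_args(
    "jdbc:sqlserver://dwserver:1433;databaseName=Sales",
    "Orders", "/data/orders", check_column="OrderID", last_value=1000)

print(" ".join(full_load))
print(" ".join(delta_load))
```

Building the command as a list rather than a single string keeps each argument separate, so it could be handed to a process launcher without shell quoting problems.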

Sqoop Demo

HCatalog/Hive/Pig
  • HCatalog – Metadata & table management
      • Users interact with a set of defined tables
      • Abstracts away the where/how of data storage
      • Allows for consistent access
  • Pig – ETL/Data Transformation
      • Scripting: Pig Latin
      • Java User-Defined Functions (Piggybank/DataFu)
  • Hive – SQL-like interface
      • Allows ad-hoc queries for data summarization and analysis
      • ODBC Connector

Demo: Pig & Hive

Oozie
  • Scalable, reliable, extensible workflow management system/job scheduler
  • Triggered by time and data availability
  • Can run and orchestrate multiple jobs
      • MapReduce and Streaming MapReduce
      • Hive
      • Pig
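
The two trigger conditions can be illustrated with a small conceptual sketch in Python. This is not Oozie's API: real Oozie coordinators are declared in XML and check HDFS, whereas this sketch uses a local temporary directory. The function name ready_to_run and the dates are assumptions. The _SUCCESS marker file mirrors the default done flag Oozie looks for in an input dataset directory.

```python
import os
import tempfile
from datetime import datetime


def ready_to_run(nominal_time, now, input_dir, done_flag="_SUCCESS"):
    """True once the scheduled time has passed AND the input data is marked complete."""
    time_reached = now >= nominal_time
    data_available = os.path.exists(os.path.join(input_dir, done_flag))
    return time_reached and data_available


with tempfile.TemporaryDirectory() as input_dir:
    nominal = datetime(2013, 6, 1, 2, 0)   # hypothetical scheduled run time
    now = datetime(2013, 6, 1, 2, 5)

    # Time has arrived, but the upstream job has not written its done flag yet.
    before_flag = ready_to_run(nominal, now, input_dir)

    # Simulate the upstream job finishing and writing the done flag.
    open(os.path.join(input_dir, "_SUCCESS"), "w").close()
    after_flag = ready_to_run(nominal, now, input_dir)

    # Data is present, but the scheduled time has not been reached.
    too_early = ready_to_run(nominal, datetime(2013, 6, 1, 1, 0), input_dir)

print(before_flag, after_flag, too_early)  # False True False
```

Only the middle case runs, since both the time condition and the data condition must hold.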

Windows Azure Blob Storage
  • Also called Azure Storage Vault (ASV)
  • Persistent, highly scalable storage with built-in geo-replication
  • Azure HDInsight clusters are wired for ASV
      • On-premise HDP uses HDFS
  • Separates data from compute nodes
      • Clusters can be created and dropped, minimizing costs
      • Multiple clusters can share data
  • The Azure Flat (Quantum 10) mesh grid network is the key
      • Violates the principle of data locality but outperforms HDFS and Azure competitors

Windows Azure Blob Storage

Source: http://dennyglee.com/2013/03/18/why-use-blob-storage-with-hdinsight-on-azure/

AzCopy
  • Copies files to and from Windows Azure Blob Storage
  • Similar to Robocopy
  • Command line:
    AzCopy C:\Beer https://stg.blob.core.windows.net/data/Beer /destKey:<MyKey> /S /V
  • Recursively (/S) copies all files in the Beer directory with verbose (/V) logging

AzCopy Demo

PolyBase
  • Part of Parallel Data Warehouse; allows integration of relational and non-relational data
  • Creates external tables via an HDFS bridge
  • Allows on-the-fly joins within SQL Server
  • Supports parallel imports from HDFS and exports to HDFS

Resources
  • Bloggers
      • Denny Lee: http://dennyglee.com
      • Carl Nolan: http://blogs.msdn.com/b/carlnol/archive/tags/hadoop+streaming
      • Cindy Gross: http://blogs.msdn.com/b/cindygross/archive/tags/big+data
  • Books
      • Hadoop: The Definitive Guide - Tom White
      • Programming Pig - Alan Gates
      • Programming Hive - Edward Capriolo
      • Hadoop MapReduce Cookbook - Srinath Perera
  • Links to this presentation
      • http://bluewatersql.wordpress.com/resources
      • http://www.slideshare.net/bluewatersql/big-dataguide

Thank you

@BluewaterSQL
http://bluewatersql.wordpress.com
cprice@pragmaticworks.com

Page 10: The BI Guy's Little Guide to Big Data

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Clie

nt T

ools

BID

W S

yste

m

DW

Reporting Services SharePoint Microsoft Applications

DATA

SO

URC

ES

Traditional Sources (CRMERPLOBWeb)

Cubes

Future Landscape

Hadoop

New Sources (Email Logs Social Media Sensor)

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Business Scenario

DW Cube

HadoopHDFS

ODBCODBC

Sqoo

p

OD

BC

Reporting Tools

Flume

Sensor DataWebHDFS

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

What about Azure

DW Cube

Hadoop

ODBCODBC

Sqoo

p

OD

BC

Reporting Tools

AzCopy

Azure Blob Storage

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Tool Techniques amp Strategies

Enterprise Data Services WebHDFS Sqoop Hcatalog PigHive

Enterprise Operational Services Oozie

Other Windows Azure Blob Storage amp AzCopy Hive ODBC Polybase

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

WebHDFS

Born from HFTP intended as a replacement Widely used by Yahoo

High performance first class native protocol using industry standard RESTful mechanism

Complete interface for reading writing amp managing files

Supports secure authentication Data Locality ndash requests sent to data nodes

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

WebHDFS ndash Get Example

Requestcurl -i -L httphostportwebhdfsv1foobarop=OPEN

ResponseHTTP11 307 TEMPORARY_REDIRECT Content-Type applicationoctet-stream Location httpdatanode50075webhdfsv1foobarop=OPENampampoffset=0 Content-Length 0

HTTP11 200 OK Content-Type applicationoctet-streamContent-Length 22

Hello webhdfs user

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

WebHDFS ndash More Examples

Rename Requestcurl -i -X PUT httphostportwebhdfsv1foobarop=RENAMEampampdestination=foobar2

Create Directory Requestcurl -i -X PUT httphostportwebhdfsv1foo2op=MKDIRS

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Sqoop

Tool designed to efficiently move data between Hadoop (Hive amp Hbase) and RDBMS Importing (single and all tables) Exporting Eval (Query Execution) Merge (Multiple HDFS datasets) Incremental Imports

Generates MapReduce jobs Can control the level of parallelism

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Sqoop Demo

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

HCatalogHivePig

Hcatalog ndash Metadata amp table management Users interact with a set of defined tables Abstracts away the wherehow of data storage Allows for consistent access

Pig ndash ETLData Transformation Scripting Pig Latin Java User-Defined Functions (PiggybankDataFu)

Hive ndash SQL-like interface Allows ad-hoc queries for data summarizations

and analysis ODBC Connector

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Demo Pig amp Hive

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Oozie

Scalable Reliable Extensible Workflow Management SystemJob Scheduler

Triggered by Time Data Availability

Can run and orchestrate multiple jobs MapReduce and Streaming MapReduce Hive Pig

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Windows Azure Blob Storage

Also called Azure Storage Vault (ASV) Scalable persistent highly-scalable storage with

built-in geo-replication Azure HDInsight clusters are wired for ASV

On-Premise HDP uses HDFS Separates data from compute nodes

Clusters can be created and dropped minimizing costs Multiple clusters can share data

The Azure Flat (Quantum 10) mesh grid network is the key Violates the principal of data locality but out-performs

HDFS and Azure competitors

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Windows Azure Blob Storage

Source httpdennygleecom20130318why-use-blob-storage-with-hdinsight-on-azure

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

AzCopy

Windows Azure Blob Storage Copies files to and from

Similar to Robocopy Command-lineAzCopy CBeer httpsstgblobcorewindowsnetdataBeer destKeyltMyKeygt S V

Recursively (S) copies all files in the Beer directory with Verbose (V) logging

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

AzCopy Demo

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

PolyBase

Part of Parallel Data Warehouse allows integration of relational and non-relational data

Creates external tables via a HDFS bridge Allows on-the-fly joins within SQL Server

Supports parallel Imports from HDFS Exports to HDFS

Resources

Bloggers
  • Denny Lee: http://dennyglee.com
  • Carl Nolan: http://blogs.msdn.com/b/carlnol/archive/tags/hadoop+streaming
  • Cindy Gross: http://blogs.msdn.com/b/cindygross/archive/tags/big+data

Books
  • Hadoop: The Definitive Guide - Tom White
  • Programming Pig - Alan Gates
  • Programming Hive - Edward Capriolo
  • Hadoop MapReduce Cookbook - Srinath Perera

Links to this presentation
  • http://bluewatersql.wordpress.com/resources
  • http://www.slideshare.net/bluewatersql/big-dataguide

Thank you

@BluewaterSQL
http://bluewatersql.wordpress.com
cprice@pragmaticworks.com

Page 11: The BI Guy's Little Guide to Big Data

Business Scenario

[Diagram labels: Sensor Data, Flume, WebHDFS, Hadoop/HDFS, Sqoop, ODBC, DW, Cube, Reporting Tools]

What about Azure?

[Diagram labels: AzCopy, Azure Blob Storage, Hadoop, Sqoop, ODBC, DW, Cube, Reporting Tools]

Tools, Techniques & Strategies

  • Enterprise Data Services
      • WebHDFS
      • Sqoop
      • HCatalog
      • Pig/Hive
  • Enterprise Operational Services
      • Oozie
  • Other
      • Windows Azure Blob Storage & AzCopy
      • Hive ODBC
      • PolyBase

WebHDFS

  • Born from HFTP; intended as a replacement
      • Widely used by Yahoo
  • High-performance, first-class native protocol using an industry-standard RESTful mechanism
  • Complete interface for reading, writing & managing files
  • Supports secure authentication
  • Data locality - requests are sent to the data nodes

WebHDFS - Get Example

Request:
curl -i -L "http://host:port/webhdfs/v1/foo/bar?op=OPEN"

Response:
HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://datanode:50075/webhdfs/v1/foo/bar?op=OPEN&offset=0
Content-Length: 0

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 22

Hello, webhdfs user!
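The exchange above has two steps: the name node answers the OPEN request with a 307 redirect, and the data node named in the Location header serves the bytes. A rough Python sketch of both halves follows. The helper names `build_open_url` and `redirect_location` and the port 50070 are assumptions for illustration, and no network call is made.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def build_open_url(host: str, port: int, hdfs_path: str, offset: Optional[int] = None) -> str:
    """Build the WebHDFS URL that reads a file (op=OPEN)."""
    params = {"op": "OPEN"}
    if offset is not None:
        params["offset"] = offset
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{urlencode(params)}"


def redirect_location(raw_headers: str) -> Optional[str]:
    """Return the Location header of a 307 response, or None for any other status."""
    lines = raw_headers.strip().splitlines()
    if not lines:
        return None
    status = lines[0].split()
    if len(status) < 2 or status[1] != "307":
        return None
    for line in lines[1:]:
        # Split on the first colon only, so the URL's own colons survive.
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            return value.strip()
    return None


# Step 1: the request sent to the name node (host and port are placeholders).
print(build_open_url("host", 50070, "/foo/bar"))

# Step 2: the name node's reply points at the data node that holds the block.
reply = (
    "HTTP/1.1 307 TEMPORARY_REDIRECT\n"
    "Content-Type: application/octet-stream\n"
    "Location: http://datanode:50075/webhdfs/v1/foo/bar?op=OPEN&offset=0\n"
    "Content-Length: 0\n"
)
print(redirect_location(reply))
```

With `curl -L` (or any HTTP client that follows redirects) the second request happens automatically; the sketch only makes the hand-off visible.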

WebHDFS - More Examples

Rename request:
curl -i -X PUT "http://host:port/webhdfs/v1/foo/bar?op=RENAME&destination=/foo/bar2"

Create directory request:
curl -i -X PUT "http://host:port/webhdfs/v1/foo2?op=MKDIRS"
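Every WebHDFS call follows the same pattern: an HTTP verb plus a URL whose `op` parameter names the operation. The sketch below captures that pattern for a handful of operations. The function `webhdfs_request`, the lookup table name and the base URL are hypothetical; it only builds the request and does not send it.

```python
from urllib.parse import quote, urlencode

# HTTP verb used by each WebHDFS operation (a small subset).
HTTP_METHOD = {
    "OPEN": "GET",
    "LISTSTATUS": "GET",
    "MKDIRS": "PUT",
    "RENAME": "PUT",
    "DELETE": "DELETE",
}


def webhdfs_request(base: str, hdfs_path: str, op: str, **params):
    """Return (http_method, url) for one WebHDFS operation."""
    op = op.upper()
    # safe="/" keeps HDFS paths in parameter values readable.
    query = urlencode({"op": op, **params}, safe="/")
    return HTTP_METHOD[op], f"{base}/webhdfs/v1{quote(hdfs_path)}?{query}"


# The two requests from the slide, with a placeholder name node address.
base = "http://host:50070"
print(webhdfs_request(base, "/foo/bar", "RENAME", destination="/foo/bar2"))
print(webhdfs_request(base, "/foo2", "MKDIRS"))
```

The returned pair can be handed to any HTTP client, for example `curl -i -X <method> "<url>"`.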

Sqoop

  • Tool designed to efficiently move data between Hadoop (Hive & HBase) and an RDBMS
      • Importing (single and all tables)
      • Exporting
      • Eval (query execution)
      • Merge (multiple HDFS datasets)
      • Incremental imports
  • Generates MapReduce jobs
  • Can control the level of parallelism
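As an illustration of the import, incremental-import and parallelism points, the following Python sketch assembles a `sqoop import` command line. The helper `sqoop_import_args`, the JDBC URL, the table and the column names are all made up for the example; the flags are standard Sqoop 1 options, but check them against the Sqoop version you run.

```python
import shlex


def sqoop_import_args(jdbc_url, table, target_dir, mappers=4,
                      split_by=None, check_column=None, last_value=None):
    """Assemble the argument list for a Sqoop table import."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        # Level of parallelism: number of map tasks reading from the RDBMS.
        "--num-mappers", str(mappers),
    ]
    if split_by:
        # Column used to divide the table among the mappers.
        args += ["--split-by", split_by]
    if check_column:
        # Incremental import: only rows whose check column exceeds last_value.
        args += ["--incremental", "append", "--check-column", check_column]
        if last_value is not None:
            args += ["--last-value", str(last_value)]
    return args


# Hypothetical incremental load of an Orders table using 8 parallel mappers.
cmd = sqoop_import_args(
    "jdbc:sqlserver://dbhost:1433;databaseName=Sales",
    "Orders",
    "/data/orders",
    mappers=8,
    split_by="OrderID",
    check_column="OrderID",
    last_value=1000,
)
print(" ".join(shlex.quote(a) for a in cmd))
```

Building the command as a list and quoting it at the end keeps the semicolon in the JDBC URL from being interpreted by the shell.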

Sqoop Demo

HCatalog/Hive/Pig

  • HCatalog - Metadata & table management
      • Users interact with a set of defined tables
      • Abstracts away the where/how of data storage
      • Allows for consistent access
  • Pig - ETL/data transformation scripting
      • Pig Latin
      • Java user-defined functions (Piggybank/DataFu)
  • Hive - SQL-like interface
      • Allows ad-hoc queries for data summarization and analysis
      • ODBC connector

Page 18: The BI Guy's Little Guide to Big Data

MAKING BUSINESS INTELLIGENT wwwpragmaticworkscom

Sqoop Demo

Page 19: The BI Guy's Little Guide to Big Data

HCatalog / Hive / Pig

HCatalog - Metadata & table management
  Users interact with a set of defined tables
  Abstracts away the where/how of data storage
  Allows for consistent access

Pig - ETL/Data Transformation
  Scripting: Pig Latin
  Java User-Defined Functions (Piggybank/DataFu)

Hive - SQL-like interface
  Allows ad-hoc queries for data summarization and analysis
  ODBC Connector
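
To give a flavor of the ad-hoc summarization Hive is used for, here is a minimal sketch. It is not Hive: Python's built-in sqlite3 module stands in for the SQL-like interface so the snippet runs anywhere, and the weblogs table, its columns, and the sample rows are all hypothetical. A HiveQL query of the same shape would read much the same.

```python
import sqlite3

def summarize_hits(rows):
    """Return (page, hit_count, total_bytes) per page, busiest page first."""
    # sqlite3 is only a stand-in for Hive; the table layout is hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weblogs (page TEXT, status INTEGER, bytes INTEGER)")
    conn.executemany("INSERT INTO weblogs VALUES (?, ?, ?)", rows)
    query = """
        SELECT page, COUNT(*) AS hits, SUM(bytes) AS total_bytes
        FROM weblogs
        WHERE status = 200
        GROUP BY page
        ORDER BY hits DESC, page
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

# Hypothetical sample data: (page, HTTP status, bytes sent)
sample = [
    ("/home", 200, 512),
    ("/home", 200, 498),
    ("/beer", 200, 2048),
    ("/beer", 404, 0),
]
print(summarize_hits(sample))
```

The point is the shape of the work: filter, group, aggregate, sort, all expressed declaratively rather than as hand-written MapReduce.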

Page 20: The BI Guy's Little Guide to Big Data

Demo: Pig & Hive

Page 21: The BI Guy's Little Guide to Big Data

Oozie

Scalable, Reliable, Extensible Workflow Management System/Job Scheduler

Triggered by
  Time
  Data Availability

Can run and orchestrate multiple jobs
  MapReduce and Streaming MapReduce
  Hive
  Pig
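
The two ideas on this slide, a trigger and an ordered chain of jobs, can be sketched as a toy model. This is not Oozie's API (Oozie itself is configured with XML definitions, which are not shown here); the function names, job names, and paths are hypothetical, and this toy assumes a run needs both its start time and its input data.

```python
from datetime import datetime

def is_triggered(now, start_time, required_inputs, available_inputs):
    """A run is released once its start time has passed and all inputs exist."""
    return now >= start_time and set(required_inputs) <= set(available_inputs)

def run_workflow(actions):
    """Run actions in order; stop at the first failure, as a simple workflow would."""
    completed = []
    for name, action in actions:
        if not action():
            break
        completed.append(name)
    return completed

# Hypothetical three-step workflow: each action returns True on success.
actions = [
    ("mapreduce-clean", lambda: True),
    ("pig-transform", lambda: True),
    ("hive-summarize", lambda: True),
]

start = datetime(2013, 6, 1, 2, 0)
now = datetime(2013, 6, 1, 2, 5)
if is_triggered(now, start, ["/data/2013-06-01"], ["/data/2013-06-01"]):
    print(run_workflow(actions))
```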

Page 22: The BI Guy's Little Guide to Big Data

Windows Azure Blob Storage

Also called Azure Storage Vault (ASV)
Persistent, highly scalable storage with built-in geo-replication
Azure HDInsight clusters are wired for ASV
  On-Premise HDP uses HDFS
Separates data from compute nodes
  Clusters can be created and dropped, minimizing costs
  Multiple clusters can share data
The Azure Flat (Quantum 10) mesh grid network is the key
  Violates the principle of data locality but outperforms HDFS and Azure competitors

Page 23: The BI Guy's Little Guide to Big Data

Windows Azure Blob Storage

Source: http://dennyglee.com/2013/03/18/why-use-blob-storage-with-hdinsight-on-azure

Page 24: The BI Guy's Little Guide to Big Data

AzCopy

Copies files to and from Windows Azure Blob Storage
Similar to Robocopy
Command-line:

  AzCopy C:\Beer https://stg.blob.core.windows.net/data/Beer /destKey:<MyKey> /S /V

Recursively (/S) copies all files in the Beer directory with Verbose (/V) logging
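
For scripted uploads, the same command can be assembled programmatically. The sketch below only builds the argument list and does not run anything. It assumes the slash-prefixed switch syntax (/destKey:, /S, /V) of the AzCopy release current at the time of this talk; build_azcopy_command is a hypothetical helper, not part of AzCopy.

```python
def build_azcopy_command(source, destination, dest_key, recursive=True, verbose=True):
    """Assemble an AzCopy argument list in the form shown on the slide."""
    # Assumed legacy syntax: AzCopy <source> <destination> /destKey:<key> [/S] [/V]
    args = ["AzCopy", source, destination, "/destKey:" + dest_key]
    if recursive:
        args.append("/S")  # copy subdirectories recursively
    if verbose:
        args.append("/V")  # verbose logging
    return args

cmd = build_azcopy_command(
    r"C:\Beer",
    "https://stg.blob.core.windows.net/data/Beer",
    "<MyKey>",  # placeholder, as on the slide; never hard-code a real key
)
print(" ".join(cmd))
```

On a machine with AzCopy installed, the list could be handed to subprocess.run(cmd, check=True) once the placeholder is replaced with a real storage key.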

Page 25: The BI Guy's Little Guide to Big Data

AzCopy Demo

Page 26: The BI Guy's Little Guide to Big Data

PolyBase

Part of Parallel Data Warehouse; allows integration of relational and non-relational data

Creates external tables via an HDFS bridge
  Allows on-the-fly joins within SQL Server

Supports parallel
  Imports from HDFS
  Exports to HDFS
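
The idea of an on-the-fly join between relational rows and a delimited file can be illustrated in a few lines. This is not PolyBase, where the work is done in SQL against external table definitions; it is a plain-Python sketch of the concept, and the customer rows, the pipe-delimited click records, and the function name are all hypothetical.

```python
import csv
import io

def join_on_the_fly(customers, hdfs_text):
    """Join relational rows (dicts with customer_id) to delimited click records."""
    by_id = {c["customer_id"]: c for c in customers}
    joined = []
    # The string stands in for a delimited file that would live in HDFS.
    for customer_id, url in csv.reader(io.StringIO(hdfs_text), delimiter="|"):
        customer = by_id.get(int(customer_id))
        if customer is not None:  # inner join: drop clicks with no matching customer
            joined.append((customer["name"], url))
    return joined

customers = [{"customer_id": 1, "name": "Ann"}, {"customer_id": 2, "name": "Bob"}]
clicks = "1|/home\n2|/beer\n3|/cart\n"
print(join_on_the_fly(customers, clicks))
```

The file is read where it sits and matched against the relational side at query time, rather than being loaded into the warehouse first.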

Page 27: The BI Guy's Little Guide to Big Data

Resources

Bloggers
  Denny Lee: http://dennyglee.com
  Carl Nolan: http://blogs.msdn.com/b/carlnol/archive/tags/hadoop+streaming
  Cindy Gross: http://blogs.msdn.com/b/cindygross/archive/tags/big+data

Books
  Hadoop: The Definitive Guide - Tom White
  Programming Pig - Alan Gates
  Programming Hive - Edward Capriolo
  Hadoop MapReduce Cookbook - Srinath Perera

Links to this Presentation
  http://bluewatersql.wordpress.com/resources
  http://www.slideshare.net/bluewatersql/big-dataguide

Page 28: The BI Guy's Little Guide to Big Data

Thank you

@BluewaterSQL
http://bluewatersql.wordpress.com
cprice@pragmaticworks.com
