Microsoft Big Data @ SQLUG 2013

BIG DATA Wesley Backelant Technology Advisor Microsoft @WesleyBackelant Nathan Bijnens Big Data Consultant DataCrunchers @nathan_gs

Upload: nathan-bijnens

Post on 26-Jan-2015

Page 1: Microsoft Big Data @ SQLUG 2013

BIG DATA

Wesley Backelant

Technology Advisor

Microsoft

@WesleyBackelant

Nathan Bijnens

Big Data Consultant

DataCrunchers

@nathan_gs

Page 2: Microsoft Big Data @ SQLUG 2013

AGENDA

• Big Data

• Hadoop (& Ecosystem)

• How does it fit in the Microsoft world?

• Demo

• Resources

• Q&A

Page 3: Microsoft Big Data @ SQLUG 2013

THE WORLD OF DATA IS CHANGING

Page 4: Microsoft Big Data @ SQLUG 2013

TODAY A NEW SET OF QUESTIONS IS BEING ASKED OF THE BUSINESS:

• How do I optimize my fleet based on weather and traffic patterns?

• How do I better predict future outcomes?

• What’s the social sentiment for my brand or products?

Page 5: Microsoft Big Data @ SQLUG 2013

TRANSFORMATION OF ONLINE MARKETING

BLOGS.FORBES.COM/DAVEFEINLEIB

Page 6: Microsoft Big Data @ SQLUG 2013

TRANSFORMATION OF OPERATIONS

BLOGS.FORBES.COM/DAVEFEINLEIB

Page 7: Microsoft Big Data @ SQLUG 2013

TRANSFORMATION OF CUSTOMER SERVICE

BLOGS.FORBES.COM/DAVEFEINLEIB

Page 8: Microsoft Big Data @ SQLUG 2013

TRANSFORMATION OF ENERGY

Page 9: Microsoft Big Data @ SQLUG 2013

TRANSFORMATION OF FRAUD DETECTION

Then… Now…

Page 10: Microsoft Big Data @ SQLUG 2013

NEW HARDWARE APPROACH

Traditional

Exotic HW

• Big central servers

• SAN

• RAID

Hardware reliability

Limited scalability

Big Data

Commodity HW

• racks of pizza boxes

• Ethernet

• JBOD

Unreliable HW

Scales further

Cost effective

Page 11: Microsoft Big Data @ SQLUG 2013

NEW SOFTWARE APPROACH

Traditional

Monolithic

• Centralized

• RDBMS

Schema first

Proprietary

Big Data

Distributed

- storage & compute nodes

Raw data

Page 12: Microsoft Big Data @ SQLUG 2013

HADOOP & BIG DATA ECOSYSTEM

HDFS

MapReduce

Page 13: Microsoft Big Data @ SQLUG 2013
Page 14: Microsoft Big Data @ SQLUG 2013

HDFS

Page 15: Microsoft Big Data @ SQLUG 2013

HDFS

Page 16: Microsoft Big Data @ SQLUG 2013
Page 17: Microsoft Big Data @ SQLUG 2013

MAPREDUCE

Page 18: Microsoft Big Data @ SQLUG 2013

MAPREDUCE

Page 19: Microsoft Big Data @ SQLUG 2013

MAPREDUCE

Page 20: Microsoft Big Data @ SQLUG 2013

HIVE

Page 21: Microsoft Big Data @ SQLUG 2013

HIVE

A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

– Ideal for ad hoc querying

– Query execution via MapReduce.

Key Building Principles:

– SQL

– Extensibility

– Types

– Functions

– Scripts

Page 22: Microsoft Big Data @ SQLUG 2013

HIVE

It supports many SQL features like:

– Data partitioning

– Aggregations

– Grouping

– Joins

Page 23: Microsoft Big Data @ SQLUG 2013

HIVE

And it’s extendable using UDFs.

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hive UDF that lower-cases a string value.
public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        // Pass NULL through instead of throwing.
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}

There are many UDFs published by external parties, for:

- Loading / Saving (SerDe)

- Field Transformations

Page 24: Microsoft Big Data @ SQLUG 2013
Page 25: Microsoft Big Data @ SQLUG 2013

HADOOP PIG: INTRO

Pig is a high level data flow language.

Page 26: Microsoft Big Data @ SQLUG 2013

HADOOP PIG: 3 COMPONENTS

• Pig Latin

• Grunt

• PigServer

Page 27: Microsoft Big Data @ SQLUG 2013

HADOOP PIG

data = LOAD 'employee.csv' USING PigStorage(',') AS (
    first_name:chararray,
    last_name:chararray,
    age:int,
    wage:float,
    department:chararray
);

Page 28: Microsoft Big Data @ SQLUG 2013

HADOOP PIG

grouped_by_department = GROUP data BY department;

total_wage_by_department =
    FOREACH grouped_by_department
    GENERATE
        group AS department,
        COUNT(data) AS employee_count,
        SUM(data.wage) AS total_wage;

total_ordered = ORDER total_wage_by_department BY total_wage;

total_limited = LIMIT total_ordered 10;

Page 29: Microsoft Big Data @ SQLUG 2013

HADOOP PIG

DUMP total_limited;

STORE total_limited INTO '/test/';

Page 30: Microsoft Big Data @ SQLUG 2013

UDF

● Custom Load and Store classes.

   ● HBase

   ● ProtocolBuffers

   ● CombinedLog

● Custom extraction, e.g. date, ...

● Take a look at the PiggyBank.

Page 31: Microsoft Big Data @ SQLUG 2013
Page 32: Microsoft Big Data @ SQLUG 2013

HBASE

A distributed, versioned, column-oriented

database.

• Main features:

• Horizontal scalability

• Machine failure tolerance

• Row-level atomic operations, including compare-and-swap ops like incrementing counters

• Augmented key-value schemas: the user can group columns into families, which are configured independently

• Multiple clients like its native Java library, Thrift, and REST

• Upcoming Security

Page 33: Microsoft Big Data @ SQLUG 2013
Page 34: Microsoft Big Data @ SQLUG 2013

STORM

Page 35: Microsoft Big Data @ SQLUG 2013

STORM

Page 36: Microsoft Big Data @ SQLUG 2013

STORM

• Message passing.

• Distributed processing.

• Horizontally scalable.

• Incremental algorithms.

• Fast.

• Data in motion.

Page 37: Microsoft Big Data @ SQLUG 2013

STORM

[Diagram: Nimbus, ZooKeeper, and three Worker Nodes, each with one Supervisor and three Workers]

Page 38: Microsoft Big Data @ SQLUG 2013

STORM

• Tuple

• Stream

Page 39: Microsoft Big Data @ SQLUG 2013

STORM

• Spout

• Bolt

Page 40: Microsoft Big Data @ SQLUG 2013

STORM

• Grouping
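
The Storm terms on the last few slides (tuple, stream, spout, bolt, grouping) can be illustrated with a small in-memory sketch. This is plain Java and deliberately does not use the Storm API: the class `MiniTopology`, its `CountBolt`, and the hash-based `fieldsGrouping` function are hypothetical names chosen for illustration. It assumes a "fields"-style grouping, where tuples with the same field value are always routed to the same bolt instance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory illustration of stream, bolt and grouping; not the Storm API. */
final class MiniTopology {

    /** A bolt: consumes tuples (here plain words) and keeps a running count. */
    static final class CountBolt {
        final Map<String, Integer> counts = new HashMap<>();

        void execute(String word) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    /** Grouping: the same field value always maps to the same task index. */
    static int fieldsGrouping(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    /** Plays a finite stream (what a spout would emit) through numTasks bolts. */
    static List<CountBolt> run(List<String> stream, int numTasks) {
        List<CountBolt> bolts = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            bolts.add(new CountBolt());
        }
        for (String tuple : stream) {
            bolts.get(fieldsGrouping(tuple, numTasks)).execute(tuple);
        }
        return bolts;
    }

    public static void main(String[] args) {
        List<CountBolt> bolts = run(List.of("a", "b", "a", "c", "a"), 3);
        for (int i = 0; i < bolts.size(); i++) {
            System.out.println("bolt " + i + ": " + bolts.get(i).counts);
        }
    }
}
```

Because every occurrence of a word lands on the same bolt, each bolt can keep its counts locally without coordinating with the others.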

Page 41: Microsoft Big Data @ SQLUG 2013

A DATA SYSTEM

Page 42: Microsoft Big Data @ SQLUG 2013

DATA IS MORE THAN INFORMATION

Not all information is equal. Some information is derived from other pieces of information.

Page 43: Microsoft Big Data @ SQLUG 2013

DATA IS MORE THAN INFORMATION

Eventually you will reach the most ‘raw’ form of information. This is the information you hold true, simply because it exists.

Let’s call this ‘data’, very similar to ‘event’.

Page 44: Microsoft Big Data @ SQLUG 2013

EVENTS

Everything we do generates events:

• Pay with Credit Card

• Commit to Git

• Click on a webpage

• Tweet

Page 45: Microsoft Big Data @ SQLUG 2013

EVENTS - BEFORE

Events used to manipulate

the master data.

Page 46: Microsoft Big Data @ SQLUG 2013

EVENTS - AFTER

Today, events are the master

data.

Page 47: Microsoft Big Data @ SQLUG 2013

DATA SYSTEM

Let’s store everything.

Page 48: Microsoft Big Data @ SQLUG 2013

EVENTS

Data is Immutable

Page 49: Microsoft Big Data @ SQLUG 2013

EVENTS

Data is Time Based

Page 50: Microsoft Big Data @ SQLUG 2013

CAPTURING CHANGE TRADITIONALLY

Before:

Person   Location
Nathan   Antwerp
Geert    Dendermonde
John     Ghent

After:

Person   Location
Nathan   Ghent
Geert    Dendermonde
John     Ghent

Page 51: Microsoft Big Data @ SQLUG 2013

CAPTURING CHANGE

Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
Nathan   Ghent         2013-02-03

Person   Location      Timestamp
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02

Page 52: Microsoft Big Data @ SQLUG 2013

QUERY

The data you query is often transformed, aggregated, ... It is rarely used in its original form.

Page 53: Microsoft Big Data @ SQLUG 2013

QUERY

Query = function ( data )

Page 54: Microsoft Big Data @ SQLUG 2013

NUMBER OF PEOPLE LIVING IN EACH CITY.

Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
Nathan   Ghent         2013-02-03

Location      Count
Ghent         2
Dendermonde   1
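
The idea that a query is a function over all (immutable, time-based) data can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the names `CityCount`, `LocationEvent` and `peoplePerCity` are hypothetical, and the query is assumed to mean "count each person once, in the city of their most recent event".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of "query = function(all data)" over immutable, timestamped events. */
final class CityCount {

    /** One immutable fact: a person lived in a city as of a given date. */
    static final class LocationEvent {
        final String person;
        final String city;
        final String date; // ISO yyyy-MM-dd, so string order equals date order

        LocationEvent(String person, String city, String date) {
            this.person = person;
            this.city = city;
            this.date = date;
        }
    }

    /** Counts people per city, using only each person's most recent event. */
    static Map<String, Integer> peoplePerCity(List<LocationEvent> events) {
        // Step 1: keep the latest event per person; nothing is updated in place.
        Map<String, LocationEvent> latest = new HashMap<>();
        for (LocationEvent e : events) {
            latest.merge(e.person, e,
                (old, cur) -> cur.date.compareTo(old.date) > 0 ? cur : old);
        }
        // Step 2: aggregate the current locations into a count per city.
        Map<String, Integer> counts = new TreeMap<>();
        for (LocationEvent e : latest.values()) {
            counts.merge(e.city, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<LocationEvent> events = List.of(
            new LocationEvent("Nathan", "Antwerp", "2005-01-01"),
            new LocationEvent("Geert", "Dendermonde", "2011-10-08"),
            new LocationEvent("John", "Ghent", "2010-05-02"),
            new LocationEvent("Nathan", "Ghent", "2013-02-03"));
        System.out.println(peoplePerCity(events)); // prints {Dendermonde=1, Ghent=2}
    }
}
```

Since the events are never overwritten, the same function can be re-run at any time, or run against a subset of the events to answer the question as of an earlier date.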

Page 55: Microsoft Big Data @ SQLUG 2013

QUERY

All Data → Query

Page 56: Microsoft Big Data @ SQLUG 2013

QUERY: PRECOMPUTE

All Data → Precomputed View → Query

Page 57: Microsoft Big Data @ SQLUG 2013

LAYERED ARCHITECTURE

Speed Layer

Batch Layer

Serving Layer

Page 58: Microsoft Big Data @ SQLUG 2013

LAYERED ARCHITECTURE

[Diagram: Incoming Data, HDInsight, Column Store, SQL, Query]

Page 59: Microsoft Big Data @ SQLUG 2013

BATCH LAYER

Page 60: Microsoft Big Data @ SQLUG 2013

BATCH LAYER

[Diagram: Incoming Data, HDInsight, Column Store]

Page 61: Microsoft Big Data @ SQLUG 2013

BATCH LAYER

Unrestrained computation.

Page 62: Microsoft Big Data @ SQLUG 2013

BATCH LAYER

Horizontally scalable.

Page 63: Microsoft Big Data @ SQLUG 2013

BATCH LAYER

High Latency.

Let’s pretend temporarily that update latency doesn’t matter.

Page 64: Microsoft Big Data @ SQLUG 2013

BATCH LAYER

Stores the master copy of the data set... append only.

Page 65: Microsoft Big Data @ SQLUG 2013

BATCH LAYER

Page 66: Microsoft Big Data @ SQLUG 2013

BATCH: VIEW GENERATION

[Diagram: Master Dataset → MapReduce → View #1, View #2, View #3]

Page 67: Microsoft Big Data @ SQLUG 2013

MAPREDUCE

1. Take a large problem and divide it into sub-problems

2. Perform the same function on all sub-problems

3. Combine the output from all sub-problems

[Diagram: MAP: DoWork() DoWork() DoWork()… REDUCE: Output]
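
The three steps above can be sketched as an in-memory word count. This is a minimal illustration of the divide / apply / combine pattern in plain Java, not the Hadoop MapReduce API: the class `MiniMapReduce` and its methods are hypothetical names, and each input line is assumed to be one sub-problem.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** In-memory illustration of the map and reduce steps; not the Hadoop API. */
final class MiniMapReduce {

    /** Map: the same function applied to one sub-problem (one line of text). */
    static Map<String, Integer> map(String line) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Reduce: combine the partial outputs of all sub-problems. */
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    /** Divide the input into lines, map each line independently, then reduce. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map<String, Integer>> partials = lines.parallelStream()
            .map(MiniMapReduce::map)
            .collect(Collectors.toList());
        return reduce(partials);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data", "big hadoop data", "hadoop")));
        // prints {big=2, data=2, hadoop=2}
    }
}
```

The map calls share no state, which is what lets a real cluster run them on different machines; only the reduce step needs to see results from more than one sub-problem.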

Page 68: Microsoft Big Data @ SQLUG 2013

BATCH VIEW DATABASE

Read-only database.

No random writes required.

Page 69: Microsoft Big Data @ SQLUG 2013

BATCH LAYER

[Diagram: a timeline up to "Now"; older data is absorbed into Batch Views, the most recent part (just a few hours of data) is not yet absorbed.]

We are not done yet…

Page 70: Microsoft Big Data @ SQLUG 2013

SPEED LAYER

Page 71: Microsoft Big Data @ SQLUG 2013

OVERVIEW

[Diagram: Incoming Data, HDInsight, Column Store, SQL]

Page 72: Microsoft Big Data @ SQLUG 2013

SPEED LAYER

Stream processing.

Page 73: Microsoft Big Data @ SQLUG 2013

SPEED LAYER

Continuous computation.

Page 74: Microsoft Big Data @ SQLUG 2013

SPEED LAYER

Transactional.

Page 75: Microsoft Big Data @ SQLUG 2013

SPEED LAYER

Storing a limited window of data.

Compensating for the last few hours of data.
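
A real-time view that only covers the data the batch layer has not absorbed yet could look roughly like the sketch below. It is plain Java for illustration, not tied to Storm or any database: `SpeedLayerView`, `onEvent` and `expireBefore` are hypothetical names, the view is assumed to be a simple count per key, and events are assumed to arrive in timestamp order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Sketch of a real-time view over a limited, sliding window of recent events. */
final class SpeedLayerView {

    private static final class Event {
        final String key;
        final long timestamp;

        Event(String key, long timestamp) {
            this.key = key;
            this.timestamp = timestamp;
        }
    }

    private final Deque<Event> window = new ArrayDeque<>(); // oldest event first
    private final Map<String, Long> counts = new HashMap<>();

    /** Incremental update for one incoming event (assumes timestamp order). */
    void onEvent(String key, long timestamp) {
        window.addLast(new Event(key, timestamp));
        counts.merge(key, 1L, Long::sum);
    }

    /** Drops events older than the point up to which the batch views are complete. */
    void expireBefore(long absorbedUpTo) {
        while (!window.isEmpty() && window.peekFirst().timestamp < absorbedUpTo) {
            Event old = window.removeFirst();
            counts.computeIfPresent(old.key,
                (k, c) -> c > 1 ? Long.valueOf(c - 1) : null); // null removes the key
        }
    }

    long count(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        SpeedLayerView view = new SpeedLayerView();
        view.onEvent("Ghent", 10L);
        view.onEvent("Ghent", 20L);
        view.expireBefore(15L); // the batch layer has caught up to t=15
        System.out.println(view.count("Ghent")); // prints 1
    }
}
```

Each time a batch run finishes, `expireBefore` can be called with the new batch horizon, so the real-time view stays small no matter how large the master data set grows.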

Page 76: Microsoft Big Data @ SQLUG 2013

SPEED LAYER

All the complexity is isolated in the

Speed layer. If anything goes wrong,

it’s auto-corrected.

Page 77: Microsoft Big Data @ SQLUG 2013

CAP

You have a choice between:

• Availability

   • Queries are eventually consistent.

• Consistency

   • Queries are consistent.

Page 78: Microsoft Big Data @ SQLUG 2013

EVENTUAL ACCURACY

Some algorithms are hard to

implement in real time. For those

cases we could estimate the results.

Page 79: Microsoft Big Data @ SQLUG 2013

SPEED LAYER

[Diagram: Incoming Data → Real Time View 1, Real Time View 2]

Page 80: Microsoft Big Data @ SQLUG 2013

SPEED LAYER VIEWS

• The views are stored in a read & write database.

   • MS SQL Server

   • Column Store

   • Cassandra

   • …

• Much more complex than a read-only view.

Page 81: Microsoft Big Data @ SQLUG 2013

SERVING LAYER

Page 82: Microsoft Big Data @ SQLUG 2013

OVERVIEW

[Diagram: Incoming Data, HDInsight, Column Store, SQL, Query]

Page 83: Microsoft Big Data @ SQLUG 2013

SERVING LAYER

This layer queries the Batch & Real Time views and merges them.
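
For views that hold additive values such as counts, the merge can be as simple as summing the two views per key. The sketch below is plain Java under that assumption; `ServingLayer` and `merge` are hypothetical names, and non-additive views (distinct counts, averages) would need a different merge function.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch of a serving-layer merge for additive views (e.g. counts per key). */
final class ServingLayer {

    /** Returns batch + real-time values per key; neither input is modified. */
    static Map<String, Long> merge(Map<String, Long> batchView,
                                   Map<String, Long> realTimeView) {
        Map<String, Long> merged = new TreeMap<>(batchView);
        realTimeView.forEach((key, value) -> merged.merge(key, value, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> batch = Map.of("Ghent", 2L, "Dendermonde", 1L);
        Map<String, Long> realTime = Map.of("Ghent", 1L, "Antwerp", 1L);
        System.out.println(merge(batch, realTime));
        // prints {Antwerp=1, Dendermonde=1, Ghent=3}
    }
}
```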

Page 84: Microsoft Big Data @ SQLUG 2013

SERVING LAYER

[Diagram: Real Time Views + Batch Views → Merge]

Page 85: Microsoft Big Data @ SQLUG 2013

SERVING LAYER

Polybase is a great fit.

Page 86: Microsoft Big Data @ SQLUG 2013

OVERVIEW

Page 87: Microsoft Big Data @ SQLUG 2013

OVERVIEW

[Diagram: Incoming Data, HDInsight, Column Store, SQL, Query]

Page 88: Microsoft Big Data @ SQLUG 2013

LAMBDA ARCHITECTURE

• Can discard any view, batch and real time, and just recreate

everything from the master data.

• Mistakes are corrected via recomputation.

• Write bad data? Remove the data & recompute.

• Bug in view generation? Just recompute the view.

• Data storage is highly optimized.

Page 90: Microsoft Big Data @ SQLUG 2013

WHAT IS MICROSOFT DOING ON

THE BI & DEVELOPMENT SIDE

Page 91: Microsoft Big Data @ SQLUG 2013

INSIGHTS FROM ANY DATA, ANY SIZE, ANYWHERE


Page 92: Microsoft Big Data @ SQLUG 2013

WE DELIVER INSIGHTS TO EVERYONE BY ENABLING BIG DATA

ANALYSIS WITH FAMILIAR END USER TOOLS

Hive add-in for Excel

Interaction and analysis of

unstructured data in Hadoop

Benefits

Key Features

Page 93: Microsoft Big Data @ SQLUG 2013

UNLOCKING IMMERSIVE INSIGHTS FROM ALL DATA

WITH MICROSOFT BI TOOLS

Hive ODBC Driver integrates Hadoop

to SQL Server Analysis Services,

PowerPivot, and Power View

Familiar self service BI tools

Benefits

Key Features

Page 94: Microsoft Big Data @ SQLUG 2013

WHILE DRAMATICALLY SIMPLIFYING PROGRAMMING

ON HADOOP

Integration with .NET and

new JavaScript libraries for

Hadoop

JS

MapReduce

programs

in JavaScript

Simplified

Programming

Deploy JavaScript Hadoop

jobs from a simple web

browser on any supported

device

Simplified Deployment of

MapReduce jobs

Benefits

Key Features

Page 95: Microsoft Big Data @ SQLUG 2013

WE MANAGE STREAMING DATA WITH STREAMINSIGHT

Benefits

Key Features

StreamInsight SQL StreamInsight

Page 96: Microsoft Big Data @ SQLUG 2013

WHAT IS MICROSOFT DOING ON

THE HADOOP & INTEGRATION SIDE?

Page 97: Microsoft Big Data @ SQLUG 2013

Appliances

Reference Architectures

Dell Parallel Data Warehouse

HP Enterprise Data Warehouse

Dell Quickstart Data Warehouse

HP Business Data Warehouse

WE MANAGE RELATIONAL DATA WITH MICROSOFT

ENTERPRISE DATA WAREHOUSE SOLUTIONS

Fast Track for

Page 98: Microsoft Big Data @ SQLUG 2013

INTRODUCING POLYBASE

Fundamental Breakthrough in Data Processing

Single Query; Structured and Unstructured

• Query and join Hadoop tables with Relational Tables

• Use Standard SQL language

• Select, From, Where

Existing SQL Skillset

No IT Intervention

Save Time and Costs

SQL Server 2012 PDW Powered by PolyBase

SQL

Analyze All Data Types

Page 99: Microsoft Big Data @ SQLUG 2013

Benefits

Key Features

AND SUPPORT UNSTRUCTURED DATA WITH ENTERPRISE

CLASS HADOOP ON PREMISE AND IN THE CLOUD

Page 100: Microsoft Big Data @ SQLUG 2013

Benefits

Key Features

MICROSOFT BRINGS THE SIMPLICITY AND MANAGEABILITY

OF WINDOWS AND SQL SERVER TO HADOOP

Page 101: Microsoft Big Data @ SQLUG 2013

MICROSOFT DELIVERS BIG DATA THROUGH OPEN

PLATFORM AND A RICH PARTNER ECOSYSTEM

Benefits

Key Features

Page 102: Microsoft Big Data @ SQLUG 2013

BIG DATA DEMO: FROM DATA TO INSIGHTS!

Simplicity

Analysis with familiar

tools

Collaboration on

insights

Page 103: Microsoft Big Data @ SQLUG 2013

THANK YOU!!!

Page 104: Microsoft Big Data @ SQLUG 2013

RESOURCES

• Microsoft Big Data Solution: www.microsoft.com/bigdata

• Windows Azure: www.windowsazure.com/en-us/home/scenarios/big-data

• Try Now: https://www.hadooponazure.com

• HDInsight For Windows Beta Download: http://hortonworks.com/download/

• HDInsight Services For Windows: http://social.technet.microsoft.com/wiki/contents/articles/6204.hdinsight-services-for-windows.aspx#videos

• Hadoop in PowerPivot: http://social.technet.microsoft.com/wiki/contents/articles/6294.how-to-connect-excel-powerpivot-to-hive-on-azure-via-hiveodbc.aspx

• Hadoop in SSIS: http://msdn.microsoft.com/en-us/library/jj720569.aspx

• Hurricane Sandy: http://sqlcat.com/sqlcat/b/msdnmirror/archive/2013/02/01/hurricane-sandy-mash-up-hive-sql-server-powerpivot-amp-power-view.aspx

• Hadoop PowerShell: http://blogs.msdn.com/b/cindygross/archive/2012/08/23/how-to-install-the-powershell-cmdlets-for-apache-hadoop-based-services-for-windows.aspx

• SQL Server BCP to Hive: http://blogs.msdn.com/b/cindygross/archive/2012/09/28/load-sql-server-bcp-data-to-hive.aspx

• Internal vs External Table Hive: http://blogs.msdn.com/b/cindygross/archive/2013/02/06/hdinsight-hive-internal-and-external-tables-intro.aspx

• Microsoft.NET SDK for Hadoop: http://hadoopsdk.codeplex.com/

• Twitter Analytics Example: http://twitterbigdata.codeplex.com/

Page 105: Microsoft Big Data @ SQLUG 2013

DATACRUNCHERS

We enable companies in envisioning, defining and implementing a data

strategy.

A one-stop-shop for all your Big Data needs.

The first Big Data Consultancy agency in Belgium.