Microsoft Big Data @ SQLUG 2013

BIG DATA

Wesley Backelant

Technology Advisor

Microsoft

@WesleyBackelant

Nathan Bijnens

Big Data Consultant

DataCrunchers

@nathan_gs

AGENDA

• Big Data

• Hadoop (& Ecosystem)

• How does it fit in the Microsoft world?

• Demo

• Resources

• Q&A

THE WORLD OF DATA IS CHANGING

How do I optimize

my fleet based on

weather and traffic

patterns?

How do I better

predict future

outcomes?

What’s the social

sentiment for my

brand or products?

TODAY A NEW SET OF QUESTIONS IS BEING ASKED OF

THE BUSINESS:

TRANSFORMATION OF ONLINE MARKETING

BLOGS.FORBES.COM/DAVEFEINLEIB

TRANSFORMATION OF OPERATIONS


TRANSFORMATION OF CUSTOMER SERVICE


TRANSFORMATION OF ENERGY

TRANSFORMATION OF FRAUD DETECTION

Then… Now…

NEW HARDWARE APPROACH

Traditional

Exotic HW

• Big central servers

• SAN

• RAID

Hardware reliability

Limited scalability

Big Data

Commodity HW

• racks of pizza boxes

• Ethernet

• JBOD

Unreliable HW

Scales further

Cost effective

NEW SOFTWARE APPROACH

Traditional

Monolithic

• Centralized

• RDBMS

Schema first

Proprietary

Big Data

Distributed

- storage & compute nodes

Raw data

HADOOP & BIG DATA ECOSYSTEM

HDFS

MapReduce

MAPREDUCE

HIVE

A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

– Ideal for ad hoc querying

– Query execution via MapReduce.

Key Building Principles:

– SQL

– Extensibility

– Types

– Functions

– Scripts

HIVE

It supports many SQL features like:

– Data partitioning

– Aggregations

– Grouping

– Joins

HIVE

And it’s extendable using UDFs.

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple Hive UDF that lower-cases a string column.
public final class Lower extends UDF {
    // Hive calls evaluate() for each row; a null input yields a null output.
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}

There are many UDFs published by external parties, for:

- Loading / Saving (SerDe)

- Field Transformations

HADOOP PIG: INTRO

Pig is a high level data flow language.

HADOOP PIG: 3 COMPONENTS

• Pig Latin

• Grunt

• PigServer

HADOOP PIG

-- PigStorage() splits on tabs by default; a CSV file needs an explicit comma.
data = LOAD 'employee.csv' USING PigStorage(',') AS (
    first_name:chararray,
    last_name:chararray,
    age:int,
    wage:float,
    department:chararray
);

HADOOP PIG

grouped_by_department = GROUP data BY department;

-- After GROUP, the fields of the bag are addressed as data.field.
total_wage_by_department =
    FOREACH grouped_by_department
    GENERATE
        group AS department,
        COUNT(data) AS employee_count,
        SUM(data.wage) AS total_wage;

total_ordered = ORDER total_wage_by_department BY total_wage;

total_limited = LIMIT total_ordered 10;
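For readers less familiar with Pig Latin, the same data flow can be sketched in plain Python. This only illustrates what the script computes, not how Pig executes it on a cluster. The sample rows and the function name `total_wage_by_department` are made up for the example.

```python
from collections import defaultdict

# Hypothetical sample rows with the same fields as employee.csv:
# (first_name, last_name, age, wage, department)
employees = [
    ("Ann", "Peeters", 34, 3000.0, "sales"),
    ("Bob", "Janssens", 45, 4000.0, "sales"),
    ("Cis", "Maes", 29, 3500.0, "it"),
]

def total_wage_by_department(rows, limit=10):
    """Group by department, count and sum wages, order by total, keep `limit` rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[4]].append(row)                         # GROUP data BY department
    totals = [
        (dept, len(members), sum(r[3] for r in members))   # FOREACH ... GENERATE
        for dept, members in groups.items()
    ]
    totals.sort(key=lambda t: t[2])                        # ORDER ... BY total_wage (ascending)
    return totals[:limit]                                  # LIMIT ... 10

result = total_wage_by_department(employees)
print(result)
```

Each tuple in `result` corresponds to one row of `total_limited`: department, employee count, total wage.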

HADOOP PIG

DUMP total_limited;

STORE total_limited INTO '/test/';

UDF

● Custom Load and Store classes.

● HBase

● ProtocolBuffers

● CombinedLog

● Custom extraction

e.g. date, ...

● Take a look at the PiggyBank.

HBASE

A distributed, versioned, column-oriented

database.

• Main features:

• Horizontal scalability

• Machine failure tolerance

• Row-level atomic operations including compare-and-swap ops like

incrementing counters

• Augmented key-value schemas, the user can group columns into families which

are configured independently

• Multiple clients like its native Java library, Thrift, and REST

• Upcoming Security

STORM

• Message passing.

• Distributed processing.

• Horizontally scalable.

• Incremental algorithms.

• Fast.

• Data in motion.

STORM

[Diagram: Nimbus and ZooKeeper, with three Worker Nodes; each Worker Node runs a Supervisor and three Workers.]

STORM

• Tuple

• Stream

STORM

• Spout

• Bolt

STORM

• Grouping
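To make the vocabulary concrete, here is a toy, single-process Python sketch of these concepts: a spout emits a stream of tuples, and a fields grouping routes each tuple to one of several bolt instances by hashing a field. It does not use the Storm API; every name in it (`sentence_spout`, `CountBolt`, `fields_grouping`) is invented for illustration.

```python
from collections import Counter
from zlib import crc32

def sentence_spout():
    """Spout: the source of a stream; here it emits one-field tuples (word,)."""
    for sentence in ["big data", "data in motion", "big streams"]:
        for word in sentence.split():
            yield (word,)

class CountBolt:
    """Bolt: consumes tuples and keeps an incrementally updated count."""
    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        self.counts[tup[0]] += 1

def fields_grouping(tup, n_tasks):
    """Grouping: tuples with the same field value always reach the same task."""
    return crc32(tup[0].encode("utf-8")) % n_tasks

bolts = [CountBolt() for _ in range(3)]
for tup in sentence_spout():
    bolts[fields_grouping(tup, len(bolts))].execute(tup)

# Because of the fields grouping, each word is counted by exactly one bolt,
# so the per-bolt counters can simply be added together.
total = sum((b.counts for b in bolts), Counter())
print(total)
```

In a real topology the bolt instances would run in different worker processes; the grouping is what keeps per-key state on a single instance.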

A DATA SYSTEM

DATA IS MORE THAN INFORMATION

Not all information is equal. Some information is derived from other pieces of information.

DATA IS MORE THAN INFORMATION

Eventually you will reach the most ‘raw’

form of information. This is the information you hold true, simply because it exists.

Let’s call this ‘data’, very similar to ‘event’.

EVENTS

Everything we do generates events:

• Pay with Credit Card

• Commit to Git

• Click on a webpage

• Tweet

EVENTS - BEFORE

Events used to manipulate

the master data.

EVENTS - AFTER

Today, events are the master

data.

DATA SYSTEM

Let’s store everything.

EVENTS

Data is Immutable

EVENTS

Data is Time Based

CAPTURING CHANGE TRADITIONALLY

Person Location

Nathan Antwerp

Geert Dendermonde

John Ghent

Person Location

Nathan Ghent

Geert Dendermonde

John Ghent

CAPTURING CHANGE

Person Location Time

Nathan Antwerp 2005-01-01

Geert Dendermonde 2011-10-08

John Ghent 2010-05-02

Nathan Ghent 2013-02-03





QUERY

The data you query is often

transformed, aggregated, ... It is rarely used in its original form.

QUERY

Query = function ( data )

NUMBER OF PEOPLE LIVING IN EACH CITY.

Person Location Time




Nathan Ghent 2013-02-03

Location Count

Ghent 2

Dendermonde 1
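The idea that a query is a function over all data can be sketched in a few lines of Python. This is a minimal illustration using the four rows from the CAPTURING CHANGE table; the function name `people_per_city` is an assumption made for the example.

```python
from collections import Counter

# Immutable, time-based facts: (person, location, time). Rows are only appended.
events = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("John", "Ghent", "2010-05-02"),
    ("Nathan", "Ghent", "2013-02-03"),
]

def people_per_city(all_data):
    """Query = function(all data): count people by their latest known location."""
    latest = {}
    # ISO dates sort chronologically as strings, so later facts overwrite earlier ones.
    for person, location, _time in sorted(all_data, key=lambda e: e[2]):
        latest[person] = location
    return Counter(latest.values())

print(people_per_city(events))
```

Nothing is updated in place: Nathan's move is a new fact, and the query decides which fact is current.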

QUERY

All Data → Query

QUERY: PRECOMPUTE

All Data → Precomputed View → Query

LAYERED ARCHITECTURE

Speed Layer

Batch Layer

Serving Layer

LAYERED ARCHITECTURE

[Diagram: Incoming Data, HDInsight, Column Store, SQL, Query]

BATCH LAYER

BATCH LAYER

[Diagram: Incoming Data, HDInsight, Column Store]

BATCH LAYER

Unrestrained computation.

BATCH LAYER

Horizontally scalable.

BATCH LAYER

High latency.

Let’s pretend temporarily that update latency doesn’t matter.

BATCH LAYER

Stores the master copy of the data set, append only.

BATCH LAYER

BATCH: VIEW GENERATION

[Diagram: Master Dataset → MapReduce → View #1, View #2, View #3]

1. Take a large problem and divide it into sub-problems

2. Perform the same function on all sub-problems

3. Combine the output from all sub-problems
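The three steps can be sketched as a single-process word count in Python. This is a minimal illustration of the pattern, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are chosen for the example.

```python
from collections import defaultdict

def map_phase(chunk):
    """The same function applied to every sub-problem: emit (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the intermediate values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine the outputs of all sub-problems for one key."""
    return key, sum(values)

def word_count(chunks):
    # Steps 1 and 2: the input is already divided into chunks; map each one.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    # Step 3: combine the mapped output per key.
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["Big Data", "big clusters", "data data"])
print(counts)
```

On a cluster the chunks would be blocks of a file spread over many nodes, and the map calls would run in parallel next to the data.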

[MAPREDUCE diagram: the MAP step runs DoWork() on every sub-problem; the REDUCE step combines the results into the Output.]

BATCH VIEW DATABASE

Read only database.

No random writes required.

BATCH LAYER

[Timeline: data absorbed into Batch Views, followed by data not yet absorbed, up to Now.]

We are not done yet…

Just a few hours of data.

SPEED LAYER

OVERVIEW

[Diagram: Incoming Data, HDInsight, Column Store, SQL]

SPEED LAYER

Stream processing.

SPEED LAYER

Continuous computation.

SPEED LAYER

Transactional.

SPEED LAYER

Storing a limited window of data.

Compensating for the last few hours of data.
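A minimal Python sketch of that idea: a real-time view that is updated per event and forgets events once the batch layer has absorbed them. The class `RealTimeView` and its methods are hypothetical names, and integer timestamps stand in for real event times.

```python
from collections import Counter

class RealTimeView:
    """Incrementally maintained counts for events the batch layer has not absorbed yet."""

    def __init__(self):
        self.events = []          # limited window of recent (timestamp, key) events
        self.counts = Counter()

    def on_event(self, timestamp, key):
        """Continuous computation: update the view as each event arrives."""
        self.events.append((timestamp, key))
        self.counts[key] += 1

    def expire(self, absorbed_until):
        """Drop events the batch views now cover (timestamp <= absorbed_until)."""
        kept = []
        for timestamp, key in self.events:
            if timestamp <= absorbed_until:
                self.counts[key] -= 1
            else:
                kept.append((timestamp, key))
        self.events = kept
        self.counts = +self.counts  # unary plus drops keys whose count fell to zero

view = RealTimeView()
view.on_event(1, "Ghent")
view.on_event(2, "Antwerp")
view.on_event(3, "Ghent")
view.expire(absorbed_until=2)
print(view.counts)
```

Because the window is small, the view fits in a fast read & write store, at the price of more complicated update logic than a read-only batch view.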

SPEED LAYER

All the complexity is isolated in the

Speed layer. If anything goes wrong,

it’s auto-corrected.

CAP

You have a choice between:

• Availability

• Queries are eventually consistent.

• Consistency

• Queries are consistent.

EVENTUAL ACCURACY

Some algorithms are hard to

implement in real time. For those

cases we could estimate the results.

SPEED LAYER

[Diagram: Incoming Data feeds Real Time View 1 and Real Time View 2.]

SPEED LAYER VIEWS

• The views are stored in a read & write database.

• MS SQL Server

• Column Store

• Cassandra

• …

• Much more complex than a read only view.

SERVING LAYER

OVERVIEW

[Diagram: Incoming Data, HDInsight, Column Store, SQL, Query]

SERVING LAYER

This layer queries the Batch & Real

Time views and merges the results.

SERVING LAYER

[Diagram: Real Time Views and Batch Views are combined by a Merge step.]
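For an additive aggregate such as a count, the merge can be as simple as the Python sketch below. The function name `merge_views` and the numbers are hypothetical; other aggregates (averages, distinct counts) need a merge rule of their own.

```python
def merge_views(batch_view, real_time_view):
    """Serving layer: combine precomputed batch counts with recent real-time counts."""
    merged = dict(batch_view)
    for key, count in real_time_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

# Hypothetical views: the batch view covers everything up to the last batch run,
# the real-time view covers only the last few hours.
batch_view = {"Ghent": 120, "Antwerp": 300}
real_time_view = {"Ghent": 2, "Dendermonde": 1}

print(merge_views(batch_view, real_time_view))
```

The merge is only correct if the two views cover disjoint time ranges, which is why the speed layer drops data once the batch layer has absorbed it.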

SERVING LAYER

Polybase is a great fit.

OVERVIEW


[Diagram: Incoming Data, HDInsight, Column Store, SQL, Query]

LAMBDA ARCHITECTURE

• Can discard any view, batch and real time, and just recreate

everything from the master data.

• Mistakes are corrected via recomputation.

• Write bad data? Remove the data & recompute.

• Bug in view generation? Just recompute the view.

• Data storage is highly optimized.

MICROSOFT BIG DATA


WHAT IS MICROSOFT DOING ON

THE BI & DEVELOPMENT SIDE?

INSIGHTS FROM ANY DATA, ANY SIZE, ANYWHERE


WE DELIVER INSIGHTS TO EVERYONE BY ENABLING BIG DATA

ANALYSIS WITH FAMILIAR END USER TOOLS

Hive add-in for Excel

Interaction and analysis of

unstructured data in Hadoop

Benefits

Key Features

UNLOCKING IMMERSIVE INSIGHTS FROM ALL DATA

WITH MICROSOFT BI TOOLS

Hive ODBC Driver integrates Hadoop

to SQL Server Analysis Services,

PowerPivot, and Power View

Familiar self service BI tools

Benefits

Key Features

WHILE DRAMATICALLY SIMPLIFYING PROGRAMMING

ON HADOOP

Integration with .NET and

new JavaScript libraries for

Hadoop

JS

MapReduce

programs

in JavaScript

Simplified

Programming

Deploy JavaScript Hadoop

jobs from a simple web

browser on any supported

device

Simplified Deployment of

MapReduce jobs

Benefits

Key Features

WE MANAGE STREAMING DATA WITH STREAMINSIGHT

Benefits

Key Features

StreamInsight SQL StreamInsight

WHAT IS MICROSOFT DOING ON

THE HADOOP & INTEGRATION SIDE?

Appliances

Reference Architectures

Dell Parallel Data Warehouse

HP Enterprise Data Warehouse

Dell Quickstart Data Warehouse

HP Business Data Warehouse

WE MANAGE RELATIONAL DATA WITH MICROSOFT

ENTERPRISE DATA WAREHOUSE SOLUTIONS

Fast Track for

Fundamental Breakthrough in Data Processing

INTRODUCING POLYBASE

Single Query; Structured and Unstructured

• Query and join Hadoop tables with Relational Tables

• Use Standard SQL language

• Select, From, Where

Existing SQL Skillset

No IT Intervention

Save Time and Costs

SQL Server 2012

PDW Powered

by PolyBase

SQL

Analyze All Data Types

Benefits

Key Features

AND SUPPORT UNSTRUCTURED DATA WITH ENTERPRISE

CLASS HADOOP ON PREMISE AND IN THE CLOUD

Benefits

Key Features

MICROSOFT BRINGS THE SIMPLICITY AND MANAGEABILITY

OF WINDOWS AND SQL SERVER TO HADOOP

MICROSOFT DELIVERS BIG DATA THROUGH OPEN

PLATFORM AND A RICH PARTNER ECOSYSTEM

Benefits

Key Features

BIG DATA DEMO: FROM DATA TO INSIGHTS!

Simplicity

Analysis with familiar

tools

Collaboration on

insights

THANK YOU!!!

RESOURCES

• Microsoft Big Data Solution: www.microsoft.com/bigdata

• Windows Azure: www.windowsazure.com/en-us/home/scenarios/big-data

• Try Now: https://www.hadooponazure.com

• HDInsight For Windows Beta Download: http://hortonworks.com/download/

• HDInsight Services For Windows: http://social.technet.microsoft.com/wiki/contents/articles/6204.hdinsight-services-for-windows.aspx#videos

• Hadoop in PowerPivot: http://social.technet.microsoft.com/wiki/contents/articles/6294.how-to-connect-excel-powerpivot-to-hive-on-azure-via-hiveodbc.aspx

• Hadoop in SSIS: http://msdn.microsoft.com/en-us/library/jj720569.aspx

• Hurricane Sandy: http://sqlcat.com/sqlcat/b/msdnmirror/archive/2013/02/01/hurricane-sandy-mash-up-hive-sql-server-powerpivot-amp-power-view.aspx

• Hadoop PowerShell: http://blogs.msdn.com/b/cindygross/archive/2012/08/23/how-to-install-the-powershell-cmdlets-for-apache-hadoop-based-services-for-windows.aspx

• SQL Server BCP to Hive: http://blogs.msdn.com/b/cindygross/archive/2012/09/28/load-sql-server-bcp-data-to-hive.aspx

• Internal vs External Table Hive: http://blogs.msdn.com/b/cindygross/archive/2013/02/06/hdinsight-hive-internal-and-external-tables-intro.aspx

• Microsoft.NET SDK for Hadoop: http://hadoopsdk.codeplex.com/

• Twitter Analytics Example: http://twitterbigdata.codeplex.com/


DATACRUNCHERS

We enable companies in envisioning, defining and implementing a data

strategy.

A one-stop-shop for all your Big Data needs.

The first Big Data Consultancy agency in Belgium.