Microsoft Big Data @ SQLUG 2013
BIG DATA
Wesley Backelant
Technology Advisor
Microsoft
@WesleyBackelant
Nathan Bijnens
Big Data Consultant
DataCrunchers
@nathan_gs
AGENDA
• Big Data
• Hadoop (& Ecosystem)
• How does it fit in the Microsoft world?
• Demo
• Resources
• Q&A
THE WORLD OF DATA IS CHANGING
TODAY A NEW SET OF QUESTIONS IS BEING ASKED OF THE BUSINESS:
• How do I optimize my fleet based on weather and traffic patterns?
• How do I better predict future outcomes?
• What’s the social sentiment for my brand or products?
TRANSFORMATION OF ONLINE MARKETING
BLOGS.FORBES.COM/DAVEFEINLEIB
TRANSFORMATION OF OPERATIONS
TRANSFORMATION OF CUSTOMER SERVICE
TRANSFORMATION OF ENERGY
TRANSFORMATION OF FRAUD DETECTION
Then… Now…
NEW HARDWARE APPROACH
Traditional
Exotic HW
• Big central servers
• SAN
• RAID
Hardware reliability
Limited scalability
Big Data
Commodity HW
• racks of pizza boxes
• Ethernet
• JBOD
Unreliable HW
Scales further
Cost effective
NEW SOFTWARE APPROACH
Traditional
Monolithic
• Centralized
• RDBMS
Schema first
Proprietary
Big Data
Distributed
- storage & compute nodes
Raw data
HADOOP & BIG DATA ECOSYSTEM
HDFS
MapReduce
HDFS
MAPREDUCE
HIVE
A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
– Ideal for ad hoc querying
– Query execution via MapReduce.
Key Building Principles:
– SQL
– Extensibility: types, functions, scripts
HIVE
It supports many SQL features like:
– Data partitioning
– Aggregations
– Grouping
– Joins
HIVE
And it’s extensible using UDFs.
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hive UDF that lower-cases a string column.
public final class Lower extends UDF {
    // Hive calls evaluate() for each row; a NULL input yields NULL.
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}
There are many UDFs published by external parties, for:
- Loading / Saving (SerDe)
- Field Transformations
HADOOP PIG: INTRO
Pig is a high level data flow language.
HADOOP PIG: 3 COMPONENTS
• Pig Latin
• Grunt
• PigServer
HADOOP PIG
-- PigStorage() splits on tabs by default; a comma-separated file needs ','.
data = LOAD 'employee.csv' USING PigStorage(',') AS (
    first_name:chararray,
    last_name:chararray,
    age:int,
    wage:float,
    department:chararray
);
HADOOP PIG
grouped_by_department = GROUP data BY department;
-- After GROUP, each row holds the key (group) and a bag named data;
-- fields inside the bag are projected with a dot: data.wage.
total_wage_by_department =
    FOREACH grouped_by_department
    GENERATE
        group AS department,
        COUNT(data) AS employee_count,
        SUM(data.wage) AS total_wage;
total_ordered = ORDER total_wage_by_department BY total_wage;
total_limited = LIMIT total_ordered 10;
HADOOP PIG
DUMP total_limited;
STORE total_limited INTO '/test/';
UDF
● Custom Load and Store classes.
  ● HBase
  ● ProtocolBuffers
  ● CombinedLog
● Custom extraction
  e.g. date, ...
● Take a look at the PiggyBank.
HBASE
A distributed, versioned, column-oriented database.
Main features:
• Horizontal scalability
• Machine failure tolerance
• Row-level atomic operations including compare-and-swap ops like incrementing counters
• Augmented key-value schemas, the user can group columns into families which are configured independently
• Multiple clients like its native Java library, Thrift, and REST
• Upcoming Security
STORM
• Message passing.
• Distributed processing.
• Horizontally scalable.
• Incremental algorithms.
• Fast.
• Data in motion.
STORM
[Diagram: Nimbus, Zookeeper, and three Worker Nodes, each with a Supervisor and three Workers]
STORM
• Tuple
• Stream
STORM
• Spout
• Bolt
STORM
• Grouping
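The Storm vocabulary above (tuple, stream, spout, bolt, grouping) can be illustrated with a toy, single-process simulation in Python. This is a sketch only and not the Storm API: the names sentence_spout, split_bolt, CountBolt, fields_grouping and run_topology are hypothetical, and the sample sentences are made up. It assumes a fields-style grouping, where tuples with the same field value always go to the same bolt task.

```python
import zlib
from collections import Counter

def sentence_spout():
    """Spout: the source of a stream; emits one tuple per sentence."""
    for sentence in ["big data", "big clusters", "data in motion"]:
        yield (sentence,)

def split_bolt(stream):
    """Bolt: consumes tuples and emits new ones (one tuple per word)."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def fields_grouping(tup, num_tasks):
    """Grouping: choose a task from a stable hash of the tuple's first field."""
    return zlib.crc32(tup[0].encode("utf-8")) % num_tasks

class CountBolt:
    """Bolt task that keeps an incremental count per word."""
    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        self.counts[tup[0]] += 1

def run_topology(num_tasks=2):
    """Wire spout -> split bolt -> (fields grouping) -> counting bolt tasks."""
    tasks = [CountBolt() for _ in range(num_tasks)]
    for tup in split_bolt(sentence_spout()):
        tasks[fields_grouping(tup, num_tasks)].execute(tup)
    return tasks
```

Because the grouping hashes on the word, every occurrence of a given word lands on the same task, so each task can keep its counts locally without coordinating with the others.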
A DATA SYSTEM
DATA IS MORE THAN INFORMATION
Not all information is equal. Some information is derived from other pieces of information.
DATA IS MORE THAN INFORMATION
Eventually you will reach the most ‘raw’ form of information. This is the information you hold true, simply because it exists.
Let’s call this ‘data’, very similar to ‘event’.
EVENTS
Everything we do generates events:
• Pay with Credit Card
• Commit to Git
• Click on a webpage
• Tweet
EVENTS - BEFORE
Events used to manipulate
the master data.
EVENTS - AFTER
Today, events are the master
data.
DATA SYSTEM
Let’s store everything.
EVENTS
Data is Immutable
EVENTS
Data is Time Based
CAPTURING CHANGE TRADITIONALLY
Person   Location
Nathan   Antwerp
Geert    Dendermonde
John     Ghent

Person   Location
Nathan   Ghent
Geert    Dendermonde
John     Ghent
CAPTURING CHANGE
Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
Nathan   Ghent         2013-02-03

Person   Location      Timestamp
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
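One thing the time-based, append-only table makes possible is asking where someone lived at a given moment, which the overwritten table cannot answer. A minimal Python sketch, assuming the rows from the slide are held as (person, location, date) tuples; the function name location_as_of is hypothetical.

```python
from datetime import date

# Append-only event log: rows are only ever added, never updated in place.
events = [
    ("Nathan", "Antwerp", date(2005, 1, 1)),
    ("Geert", "Dendermonde", date(2011, 10, 8)),
    ("John", "Ghent", date(2010, 5, 2)),
    ("Nathan", "Ghent", date(2013, 2, 3)),
]

def location_as_of(events, person, when):
    """Return the person's most recent location recorded on or before `when`."""
    known = [(ts, loc) for name, loc, ts in events if name == person and ts <= when]
    return max(known)[1] if known else None

print(location_as_of(events, "Nathan", date(2012, 1, 1)))
print(location_as_of(events, "Nathan", date(2013, 6, 1)))
```

The first call looks at a date before Nathan's move and the second at a date after it, so the same log answers both questions.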
QUERY
The data you query is often
transformed, aggregated, ... It is rarely used in its original form.
QUERY
Query = function ( data )
NUMBER OF PEOPLE LIVING IN EACH CITY.
Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
Nathan   Ghent         2013-02-03

Location      Count
Ghent         2
Dendermonde   1
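"Query = function ( data )" can be made concrete with the table above. The Python sketch below is one possible way to write that function, assuming each person currently lives at the location of their latest event; the name people_per_city is hypothetical.

```python
from collections import Counter
from datetime import date

events = [
    ("Nathan", "Antwerp", date(2005, 1, 1)),
    ("Geert", "Dendermonde", date(2011, 10, 8)),
    ("John", "Ghent", date(2010, 5, 2)),
    ("Nathan", "Ghent", date(2013, 2, 3)),
]

def people_per_city(events):
    """Query over all data: count people by their most recent location."""
    latest = {}
    # Replay events in time order so later events overwrite earlier ones.
    for person, location, _ts in sorted(events, key=lambda e: e[2]):
        latest[person] = location
    return dict(Counter(latest.values()))

print(people_per_city(events))
```

The function never modifies the events; it derives the view from them, so it can be rerun whenever new events are appended.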
QUERY
All Data → Query
QUERY: PRECOMPUTE
All Data → Precomputed View → Query
LAYERED ARCHITECTURE
Speed Layer
Batch Layer
Serving Layer
LAYERED ARCHITECTURE
[Diagram: Incoming Data, HDInsight, Column Store, SQL, Query]
BATCH LAYER
BATCH LAYER
[Diagram: Incoming Data, HDInsight, Column Store]
BATCH LAYER
Unrestrained computation.
BATCH LAYER
Horizontally scalable.
BATCH LAYER
High Latency.
Let’s pretend temporarily that update latency doesn’t matter.
BATCH LAYER
Stores the master copy of the data set, append only.
BATCH LAYER
BATCH: VIEW GENERATION
Master Dataset → View #1, View #2, View #3
MAPREDUCE
1. Take a large problem and divide it into sub-problems
2. Perform the same function on all sub-problems
3. Combine the output from all sub-problems
[Diagram: MAP runs DoWork() on each sub-problem; REDUCE combines the results into Output]
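The three steps (divide, apply the same function, combine) can be sketched in plain Python. This is a single-machine illustration, not Hadoop's MapReduce API, and it leaves out the shuffle by key that a real job performs; the function names and the sample records are hypothetical.

```python
from collections import Counter
from functools import reduce

def split_into_chunks(records, chunk_size):
    """Step 1: divide the large problem into sub-problems."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def map_chunk(chunk):
    """Step 2: the same function on every sub-problem (count locations)."""
    return Counter(location for _person, location in chunk)

def reduce_counts(left, right):
    """Step 3: combine two partial results."""
    return left + right

def map_reduce(records, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    partials = [map_chunk(c) for c in chunks]  # on a cluster these run in parallel
    return dict(reduce(reduce_counts, partials, Counter()))

records = [("Nathan", "Ghent"), ("Geert", "Dendermonde"), ("John", "Ghent")]
print(map_reduce(records))
```

Because each chunk is processed independently and the combine step is associative, the chunks could be handled by different machines in any order.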
BATCH VIEW DATABASE
Read only database. No random writes required.
BATCH LAYER
[Timeline: data absorbed into Batch Views, then data not yet absorbed (just a few hours of data), up to Now]
We are not done yet…
SPEED LAYER
OVERVIEW
[Diagram: Incoming Data, HDInsight, Column Store, SQL]
SPEED LAYER
Stream processing.
SPEED LAYER
Continuous computation.
SPEED LAYER
Transactional.
SPEED LAYER
Storing a limited window of data. Compensating for the last few hours of data.
SPEED LAYER
All the complexity is isolated in the
Speed layer. If anything goes wrong,
it’s auto-corrected.
CAP
You have a choice between:
• Availability
  • Queries are eventually consistent.
• Consistency
  • Queries are consistent.
EVENTUAL ACCURACY
Some algorithms are hard to
implement in real time. For those
cases we could estimate the results.
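As one possible example of estimating instead of computing exactly (not taken from the talk), counting distinct items over a stream can be approximated with a k-minimum-values sketch: hash every item to a number between 0 and 1, keep only the k smallest hashes, and estimate the distinct count from how small the k-th one is. The class name DistinctEstimator and the choice of k are assumptions for illustration.

```python
import hashlib
import heapq

class DistinctEstimator:
    """K-minimum-values sketch: estimate a distinct count from the k smallest hashes."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # max-heap (negated values) holding the k smallest hashes
        self._members = set()  # hashes currently in the heap, to skip duplicates

    @staticmethod
    def _hash01(item):
        """Map an item to a pseudo-random float in [0, 1)."""
        digest = hashlib.sha256(item.encode("utf-8")).hexdigest()
        return int(digest, 16) / 2 ** 256

    def add(self, item):
        h = self._hash01(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the current largest of the k smallest hashes.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct hashes seen: exact
        return int((self.k - 1) / -self._heap[0])
```

The memory used is bounded by k regardless of stream length, and a larger k trades memory for a tighter estimate; the batch layer can later replace the estimate with the exact value.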
SPEED LAYER
Incoming Data → Real Time View 1, Real Time View 2
SPEED LAYER VIEWS
• The views are stored in a read & write database.
  • MS SQL Server
  • Column Store
  • Cassandra
  • …
• Much more complex than a read only view.
SERVING LAYER
OVERVIEW
[Diagram: Incoming Data, HDInsight, Column Store, SQL, Query]
SERVING LAYER
This layer queries the Batch & Real
Time views and merges them.
SERVING LAYER
Real Time Views + Batch Views → Merge
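A minimal Python sketch of the merge step, assuming both views hold additive counts per key and cover non-overlapping time ranges (the batch view up to the last batch run, the real time view only the hours since). The function name merge_views and the sample numbers are hypothetical.

```python
from collections import Counter

def merge_views(batch_view, realtime_view):
    """Combine precomputed batch counts with counts from the speed layer."""
    merged = Counter(batch_view)
    merged.update(realtime_view)  # adds the counts key by key
    return dict(merged)

batch_view = {"Ghent": 2, "Dendermonde": 1}  # hypothetical: absorbed by the batch layer
realtime_view = {"Ghent": 1, "Antwerp": 1}   # hypothetical: last few hours only

print(merge_views(batch_view, realtime_view))
```

Simple addition only works for views that can be combined this way, such as counts and sums; other aggregates need a merge function of their own.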
SERVING LAYER
Polybase is a great fit.
OVERVIEW
[Diagram: Incoming Data, HDInsight, Column Store, SQL, Query]
LAMBDA ARCHITECTURE
• Can discard any view, batch and real time, and just recreate
everything from the master data.
• Mistakes are corrected via recomputation.
• Write bad data? Remove the data & recompute.
• Bug in view generation? Just recompute the view.
• Data storage is highly optimized.
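"Write bad data? Remove the data & recompute" can be sketched as follows. This is an illustration under simple assumptions: the master data is a list of records, the view is a count per location, and the names build_view and recompute_without are hypothetical.

```python
def build_view(master_data):
    """Batch view: number of records per location, always built from scratch."""
    view = {}
    for record in master_data:
        view[record["location"]] = view.get(record["location"], 0) + 1
    return view

def recompute_without(master_data, is_bad):
    """Drop the bad records from the master data, then rebuild the view."""
    cleaned = [r for r in master_data if not is_bad(r)]
    return cleaned, build_view(cleaned)

master = [{"location": "Ghent"}, {"location": "Ghent"}, {"location": "??"}]
cleaned, view = recompute_without(master, lambda r: r["location"] == "??")
print(view)
```

Since the view is never patched in place, a bug in build_view is fixed the same way: correct the function and run it again over the master data.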
WHAT IS MICROSOFT DOING ON
THE BI & DEVELOPMENT SIDE
INSIGHTS FROM ANY DATA, ANY SIZE, ANYWHERE
WE DELIVER INSIGHTS TO EVERYONE BY ENABLING BIG DATA
ANALYSIS WITH FAMILIAR END USER TOOLS
Hive add-in for Excel
Interaction and analysis of unstructured data in Hadoop
Benefits
Key Features
UNLOCKING IMMERSIVE INSIGHTS FROM ALL DATA
WITH MICROSOFT BI TOOLS
Hive ODBC Driver integrates Hadoop to SQL Server Analysis Services, PowerPivot, and Power View
Familiar self service BI tools
Benefits
Key Features
WHILE DRAMATICALLY SIMPLIFYING PROGRAMMING
ON HADOOP
Integration with .NET and new JavaScript libraries for Hadoop
MapReduce programs in JavaScript
Simplified Programming
Deploy JavaScript Hadoop jobs from a simple web browser on any supported device
Simplified Deployment of MapReduce jobs
Benefits
Key Features
WE MANAGE STREAMING DATA WITH STREAMINSIGHT
Benefits
Key Features
StreamInsight SQL StreamInsight
WHAT IS MICROSOFT DOING ON
THE HADOOP & INTEGRATION SIDE?
WE MANAGE RELATIONAL DATA WITH MICROSOFT
ENTERPRISE DATA WAREHOUSE SOLUTIONS
Appliances
Reference Architectures
Dell Parallel Data Warehouse
HP Enterprise Data Warehouse
Dell Quickstart Data Warehouse
HP Business Data Warehouse
Fast Track for
INTRODUCING POLYBASE
Fundamental Breakthrough in Data Processing
Single Query; Structured and Unstructured
• Query and join Hadoop tables with Relational Tables
• Use standard SQL language: SELECT, FROM, WHERE
Existing SQL Skillset
No IT Intervention
Save Time and Costs
Analyze All Data Types
SQL Server 2012 PDW Powered by PolyBase
Benefits
Key Features
AND SUPPORT UNSTRUCTURED DATA WITH ENTERPRISE
CLASS HADOOP ON PREMISE AND IN THE CLOUD
Benefits
Key Features
MICROSOFT BRINGS THE SIMPLICITY AND MANAGEABILITY
OF WINDOWS AND SQL SERVER TO HADOOP
MICROSOFT DELIVERS BIG DATA THROUGH OPEN
PLATFORM AND A RICH PARTNER ECOSYSTEM
Benefits
Key Features
BIG DATA DEMO: FROM DATA TO INSIGHTS!
Simplicity
Analysis with familiar tools
Collaboration on insights
THANK YOU!!!
RESOURCES
• Microsoft Big Data Solution: www.microsoft.com/bigdata
• Windows Azure: www.windowsazure.com/en-us/home/scenarios/big-data
• Try Now: https://www.hadooponazure.com
• HDInsight For Windows Beta Download: http://hortonworks.com/download/
• HDInsight Services For Windows: http://social.technet.microsoft.com/wiki/contents/articles/6204.hdinsight-services-for-windows.aspx#videos
• Hadoop in PowerPivot: http://social.technet.microsoft.com/wiki/contents/articles/6294.how-to-connect-excel-powerpivot-to-hive-on-azure-via-hiveodbc.aspx
• Hadoop in SSIS: http://msdn.microsoft.com/en-us/library/jj720569.aspx
• Hurricane Sandy: http://sqlcat.com/sqlcat/b/msdnmirror/archive/2013/02/01/hurricane-sandy-mash-up-hive-sql-server-powerpivot-amp-power-view.aspx
• Hadoop PowerShell: http://blogs.msdn.com/b/cindygross/archive/2012/08/23/how-to-install-the-powershell-cmdlets-for-apache-hadoop-based-services-for-windows.aspx
• SQL Server BCP to Hive: http://blogs.msdn.com/b/cindygross/archive/2012/09/28/load-sql-server-bcp-data-to-hive.aspx
• Internal vs External Table Hive: http://blogs.msdn.com/b/cindygross/archive/2013/02/06/hdinsight-hive-internal-and-external-tables-intro.aspx
• Microsoft.NET SDK for Hadoop: http://hadoopsdk.codeplex.com/
• Twitter Analytics Example: http://twitterbigdata.codeplex.com/
DATACRUNCHERS
We enable companies in envisioning, defining and implementing a data
strategy.
A one-stop-shop for all your Big Data needs.
The first Big Data Consultancy agency in Belgium.