big data in the real world

Big Data in the Real World Orlando PASS October 2013 http://www.pssug.org Mark Kromer http://www.kromerbigdata.com @kromerbigdata @mssqldude

Upload: mark-kromer

Post on 15-Jan-2015


DESCRIPTION

Here I talk about examples and use cases for Big Data and Big Data Analytics, and how we accomplished massive-scale sentiment, campaign and marketing analytics for Razorfish using a collection of database, Big Data and analytics technologies.

TRANSCRIPT

Page 1: Big Data in the Real World

Big Data in the Real World

Orlando PASS, October 2013, http://www.pssug.org

Mark Kromer, http://www.kromerbigdata.com, @kromerbigdata, @mssqldude

Page 2: Big Data in the Real World

‣What is Big Data?

‣The Big Data and Apache Hadoop environment

‣Big Data Analytics

‣SQL Server in the Big Data world

‣Microsoft + Hortonworks (Yahoo!) = HDInsight

What we’ll (try to) cover today


Page 3: Big Data in the Real World

Big Data 101

‣ 3 V’s

‣ Volume – Terabyte records, transactions, tables, files

‣ Velocity – Batch, near-time, real-time (analytics), streams.

‣ Variety – Structured, unstructured, semi-structured, and all of the above in a mix

‣ Text Processing

‣ Techniques for processing and analyzing unstructured (and structured) LARGE files

‣ Analytics & Insights

‣ Distributed File System & Programming

Page 4: Big Data in the Real World

‣ Big Data ≠ NoSQL

‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing

‣ Facebook, for example, uses HBase from the Hadoop stack

‣ NoSQL does not have to be Big Data

‣ Big Data ≠ Real Time

‣ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value

‣ Use in-memory analytics for real time insights

‣ Big Data ≠ Data Warehouse

‣ I still refer to large multi-TB DWs as “VLDB”

‣ Big Data is about crunching stats in text files for discovery of new patterns and insights

‣ Use the DW to aggregate and store the summaries of those calculations for reporting

Mark’s Big Data Myths

Page 5: Big Data in the Real World

‣ Batch Processing

‣ Commodity Hardware

‣ Data Locality, no shared storage

‣ Scales linearly

‣ Great for large text file processing, not so great on small files

‣ Distributed programming paradigm

Page 6: Big Data in the Real World

Popular Hadoop Distributions

Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft Hadoop on Azure

Page 7: Big Data in the Real World

Popular NoSQL Distributions

Transactional-based, not analytics schemas

Page 8: Big Data in the Real World

Popular MPP Distributions

Big Data as distributed, scale-out, sharded data stores

Page 9: Big Data in the Real World

Big Data Analytics Web Platform - Example

[Diagram: analytics platform in four layers: Data Sources, Data Mastering, Data Warehouse & Analytics, Presentation. Labeled components: Attribution, Segmentation, Stacking Effect, Media Level Data Warehouse, Audience Level Data Warehouse, Big Data Sandboxes, Data Mapping, Business Rules, External & Extended Data, Tableau & Pentaho, MapReduce Jobs]

Page 10: Big Data in the Real World

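using System.Text.RegularExpressions;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsForPageMap : MapperBase
{
    // Placeholder values (not shown on the slide); set them to match your W3C log layout.
    private const int expected = 14;  // assumed number of fields in a complete record
    private const int pagePos = 4;    // assumed zero-based index of the requested-page field
    private const string hit = "1";   // each complete record counts as one hit

    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);

        // Split the log line on runs of whitespace.
        var parts = Regex.Split(inputLine, "\\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }

        // Emit (page, "1") so the reducer can total the hits per page.
        context.EmitKeyValue(parts[pagePos], hit);
    }
}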

MapReduce Framework (Map)

Page 11: Big Data in the Real World

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

// Sums the hit counts emitted by the mapper for each page.
public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

// Ties the mapper and reducer together; input and output locations come from environment variables.
public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true; // clear any previous output before the job runs
        return retVal;
    }
}

MapReduce Framework (Reduce & Job)
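The map and reduce steps on the two slides above can be tried without a cluster. What follows is a minimal local sketch using standard Unix tools rather than the .NET SDK: `awk` plays the mapper, `sort` stands in for the shuffle, and a second `awk` plays the reducer. The function names, the five-field record layout, and the page position (field 4) are assumptions chosen for illustration only.

```shell
#!/bin/sh
# Local sketch of a total-hits job: map -> shuffle (sort) -> reduce.
# EXPECTED_FIELDS and PAGE_FIELD are illustrative assumptions about the log
# layout, not values taken from the slides.
EXPECTED_FIELDS=5
PAGE_FIELD=4

# Mapper: skip incomplete records, emit "<page><TAB>1" for each complete one.
map_hits() {
    awk -v n="$EXPECTED_FIELDS" -v p="$PAGE_FIELD" 'NF == n { print $p "\t1" }'
}

# Reducer: input arrives sorted by key, so sum each run of equal keys.
reduce_hits() {
    awk -F '\t' '
        NR > 1 && $1 != key { print key "\t" total; total = 0 }
        { key = $1; total += $2 }
        END { if (NR > 0) print key "\t" total }
    '
}

# Full pipeline: sort plays the role of the shuffle phase.
total_hits() {
    map_hits | LC_ALL=C sort | reduce_hits
}

# Demo: the short line is dropped, just as the mapper drops incomplete records.
# Prints "/about.html<TAB>1" followed by "/index.html<TAB>2".
printf '%s\n' \
    '2013-10-01 12:00:01 GET /index.html 200' \
    '2013-10-01 12:00:02 GET /about.html 200' \
    'broken line' \
    '2013-10-01 12:00:03 GET /index.html 200' | total_hits
```

On a real cluster the framework performs the sort and distributes the map and reduce work across nodes; the pipe only mimics the data flow on one machine.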

Page 12: Big Data in the Real World

‣ Linux-style shell commands (hadoop fs) to access data in HDFS

‣ Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv

‣ List files in HDFS:

‣ c:\Hadoop>hadoop fs -ls /import

Found 1 items

-rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv

‣ View file in HDFS:

‣ c:\Hadoop>hadoop fs -cat /import/sales.csv

Kromer,123,5,55

Smith,567,1,25

Jones,123,9,99

James,11,12,1

Johnson,456,2,2.5

Singh,456,1,3.25

Yu,123,1,11

‣ Now, we can work on the data with MapReduce, Hive, Pig, etc.

Get Data into Hadoop

Page 13: Big Data in the Real World

create external table ext_sales

(

  lastname string,

  productid int,

  quantity int,

  sales_amount float

)

row format delimited fields terminated by ',' stored as textfile location '/user/makromer/hiveext/input';

LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;

Use Hive for Data Schema and Analysis
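Once the table is loaded, a typical first analysis is a per-product total, which in HiveQL would be a GROUP BY on productid with SUM(sales_amount). As a rough local cross-check of what such a query should return for the small sample file, the same aggregation can be sketched with `awk`. The helper name `sales_by_product` is hypothetical, and the sketch assumes the four-column layout of sales.csv shown earlier.

```shell
#!/bin/sh
# Sum sales_amount (column 4) per productid (column 2) for a CSV laid out
# like sales.csv: lastname,productid,quantity,sales_amount.
# sales_by_product is a hypothetical helper name used only for this sketch.
sales_by_product() {
    awk -F ',' '
        NF == 4 { total[$2] += $4 }                      # ignore malformed rows
        END { for (id in total) print id "," total[id] }
    ' "$1" | LC_ALL=C sort -t ',' -k1,1n                 # order by productid
}
```

Run as `sales_by_product sales.csv` against a local copy of the file. Comparing this output with the Hive result on a small sample is a cheap way to catch delimiter or column-order mistakes in the table definition before running against the full data set.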

Page 14: Big Data in the Real World

‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1

‣ > hadoop fs -cat /user/mark/customers/part-m-00000

‣ > 5,Bob Smith

‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3

‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)

‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.

Sqoop: Data transfer to & from Hadoop & SQL Server

Page 15: Big Data in the Real World

SQL Server Big Data – Data Loading

Amazon HDFS & EMR

Data Loading

Amazon S3 Bucket

Page 16: Big Data in the Real World

Role of NoSQL in a Big Data Analytics Solution

‣ Use NoSQL to store data quickly without the overhead of RDBMS

‣ HBase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few

‣ Why NoSQL?

‣ In the world of “Big Data”

‣ “Schema later”

‣ Ignore ACID properties

‣ Drop data into key-value store quick & dirty

‣ Worry about query & read later

‣ Why NOT NoSQL?

‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface

‣ SQL Server and NoSQL

‣ Not a natural fit

‣ Use HDFS or your favorite NoSQL database

‣ Consider turning off SQL Server locking mechanisms

‣ Focus on writes, not reads (read uncommitted)

Page 17: Big Data in the Real World

‣ SQL Server Database

‣ SQL 2012 Enterprise Edition

‣ Page Compression

‣ 2012 Columnar Compression on Fact Tables

‣ Clustered Index on all tables

‣ Auto Update Statistics Asynchronously

‣ Partition Fact Tables by month and archive data with sliding window technique

‣ Drop all indexes before nightly ETL load jobs

‣ Rebuild all indexes when ETL completes

‣ SQL Server Analysis Services

‣ SSAS 2012 Enterprise Edition

‣ 2008 R2 OLAP cubes partition-aligned with DW

‣ 2012 in-memory tabular cubes

‣ All access through MSMDPUMP or SharePoint

SQL Server Big Data Environment

Page 18: Big Data in the Real World

‣Columnstore

‣Sqoop adapter

‣PolyBase

‣Hive

‣In-memory analytics

‣Scale-out MPP

SQL Server Big Data Analytics Features

Page 19: Big Data in the Real World


[Diagram labels: Sensors, Devices, Bots, Crawlers, ERP, CRM, LOB Apps; Unstructured and Structured Data; Hadoop on Windows Azure; Hadoop on Windows Server; Parallel Data Warehouse; Connectors; BI Platform (SSRS, SSAS); Familiar End User Tools: Excel with PowerPivot, Embedded BI, Predictive Analytics; Data Market Place / Data Market; Petabytes of Data (Unstructured); Hundreds of TB of Data (Structured)]

Microsoft’s Data Solution – Big Data & PDW

Page 20: Big Data in the Real World

MICROSOFT BIG DATA

[Diagram labels: Discover, Combine, Refine; Relational, Non-relational, Streaming, Analytical; immersive data experiences; connecting with the world's data; any data, any size, anywhere; Self-Service, Collaboration, Corporate Apps, Devices; Parallel Data Warehouse, Microsoft HDInsight Server, HDInsight Service, StreamInsight, PowerPivot, Power View]

Page 21: Big Data in the Real World

Windows Azure HDInsight Service

Microsoft HDInsight Server

Expanded Partnership

Page 22: Big Data in the Real World

Microsoft .NET Hadoop APIs

‣ WebHDFS

‣ LINQ to Hive

‣ MapReduce

‣ C#

‣ Java

‣ Hive

‣ Pig

‣ http://hadoopsdk.codeplex.com/

‣ SQL on Hadoop

‣ Cloudera Impala

‣ Teradata SQL-H

‣ Microsoft PolyBase

‣ Hadapt

Page 23: Big Data in the Real World

Data Movement to the Cloud

‣Use Windows Azure Blob Storage

• Already stored in 3 copies

• Hadoop can read from Azure blob storage

• Allows you to upload while using no Hadoop network or CPU resources

‣Compress files

• Hadoop can read Gzip

• Uses less network resources than uncompressed

• Costs less for direct storage costs

• Compress directories where source files are created as well.
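The compression advice above is easy to try before any upload. The following is a small sketch, assuming a standard `gzip` on the machine where the source files are produced. The helper name `gzip_for_upload` is hypothetical, and the upload step itself is left out because it depends on the storage tooling in use.

```shell
#!/bin/sh
# Compress a file with gzip before upload and report the size change.
# gzip_for_upload is a hypothetical helper name; it writes <file>.gz next to
# the source and leaves the original file in place.
gzip_for_upload() {
    src=$1
    gzip -c "$src" > "$src.gz" || return 1
    # $(( ... )) strips the padding some wc implementations add.
    before=$(( $(wc -c < "$src") ))
    after=$(( $(wc -c < "$src.gz") ))
    echo "$src: $before bytes -> $after bytes"
}
```

Calling `gzip_for_upload sales.csv` leaves `sales.csv.gz` ready to copy to blob storage. Text-heavy logs and CSV files usually shrink considerably, but very small files may not, so it is worth checking the reported sizes rather than assuming a saving.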


Page 24: Big Data in the Real World

‣ What is a Big Data approach to Analytics?

‣ Massive scale

‣ Data discovery & research

‣ Self-service

‣ Reporting & BI

‣ Why do we take this Big Data Analytics approach?

‣ TBs of change data in each subject area

‣ The data in the sources are variable and unstructured

‣ SSIS ETL alone couldn’t keep up or handle complexity

‣ SQL Server 2012 columnstore and tabular SSAS 2012 are key to using SQL Server for Big Data

‣ With the configs mentioned previously, SQL Server works great

‣ Analytics on Big Data also requires Big Data Analytics tools

‣ Aster, Tableau, PowerPivot, SAS, Parallel Data Warehouse

Wrap-up