Azure Stream Analytics: Analyze Data in Motion

Post on 16-Apr-2017

Category:

Technology

Stream Analytics: Analyze your data in motion

Deepthi Anantharam

Technology Evangelist

@deananth

Ruhani Arora

Technology Evangelist

@infinitydlimit

The need for evolution – Identified 2 years ago

… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing.

– Gartner, “The State of Data Warehousing in 2012”

The “Traditional” Data Warehouse

[Diagram: Data sources (OLTP, ERP, CRM, LOB), ETL, Data warehouse, BI and analytics]

Pressures on the traditional data warehouse:
1. Increasing data volumes
2. New data sources & types (non-relational data: devices, web, sensors, social)
3. Cloud-born data
4. Real-time data

Evolving Approaches to Analytics

[Diagram, built up over four slides]
• Traditional path: OLTP, ERP and LOB sources feed an ETL tool (SSIS, etc.), which extracts the original data, transforms it, and loads the transformed data into an EDW (SQL Server, Teradata, etc.). BI tools, data marts, data lake(s), dashboards and apps sit on top.
• New sources: devices, web, sensors and social data are ingested (EL) as original data into scale-out storage & compute (HDFS, Blob Storage, etc.), then transformed and loaded.
• Streaming data feeds real-time data analytics.

Agenda
• ETL with new sources of data
  • Azure Data Factory
• Analytics with new sources of data
  • Azure Stream Analytics

Azure Data Factory Overview
• New Azure service for data developers & IT
• Compose data processing, storage and movement services to create & manage analytics pipelines
• Initially focused on Azure & hybrid movement to/from on-premises SQL Server. Over time, it will expand to more storage & processing systems

• Rich, simple end-to-end pipeline monitoring and management

Operationalizing Information Production With Data Factory

Example Scenario: Customer Profiling (game usage analytics)

Customer Profiling – Game Usage Analytics

2277,2013-06-01 02:26:54.3943450,111,164.234.187.32,24.84.225.233,true,8,1,2058
2277,2013-06-01 03:26:23.2240000,111,164.234.187.32,24.84.225.233,true,8,1,2058-2123-2009-2068-2166
2277,2013-06-01 04:22:39.4940000,111,164.234.187.32,24.84.225.233,true,8,1,
2277,2013-06-01 05:43:54.1240000,111,164.234.187.32,24.84.225.233,true,8,1,2058-225545-2309-2068-2166
2277,2013-06-01 06:11:23.9274300,111,164.234.187.32,24.84.225.233,true,8,1,223-2123-2009-4229-9936623
2277,2013-06-01 07:37:01.3962500,111,164.234.187.32,24.84.225.233,true,8,1,
2277,2013-06-01 08:12:03.1109790,111,164.234.187.32,24.84.225.233,true,8,1,234322-2123-2234234-12432-344323
…

Log Files Snippet (10s of TBs per day in cloud storage)

User Table

UserID  FirstName  LastName   State       …
2277    Pratik     Patel      Oregon
664432  Dave       Nettleton  Washington
8853    Mike       Flasko     California

New User Activity Per Week By Region

profileid  day       state       duration  rank  weaponsused  interactedwith
1148       6/2/2013  Oregon      216       33    1            5
1004       6/2/2013  Missouri    22        40    6            2
292        6/1/2013  Georgia     201       137   1            5
1059       6/2/2013  Oregon      27        104   5            2
675        6/2/2013  California  65        164   3            2
1348       6/3/2013  Nebraska    21        95    5            2
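To make "per week by region" concrete, here is a minimal Python sketch that groups the sample rows by ISO week and state. The deck does not say which measure is aggregated, so summing duration is an assumption, and the function name is hypothetical. This is plain Python for illustration, not a Data Factory or Hive activity.

```python
from collections import defaultdict
from datetime import datetime

# Rows taken from the sample table: (profileid, day, state, duration)
ROWS = [
    (1148, "6/2/2013", "Oregon", 216),
    (1004, "6/2/2013", "Missouri", 22),
    (292, "6/1/2013", "Georgia", 201),
    (1059, "6/2/2013", "Oregon", 27),
    (675, "6/2/2013", "California", 65),
    (1348, "6/3/2013", "Nebraska", 21),
]

def duration_per_week_and_state(rows):
    """Sum duration per (ISO week number, state). Assumed aggregation."""
    totals = defaultdict(int)
    for _profile_id, day, state, duration in rows:
        week = datetime.strptime(day, "%m/%d/%Y").isocalendar()[1]
        totals[(week, state)] += duration
    return dict(totals)

weekly = duration_per_week_and_state(ROWS)
for (week, state), total in sorted(weekly.items()):
    print(week, state, total)
```

The two Oregon rows fall in the same week and are combined into one total, which is the kind of roll-up a "Join & Aggregate" step would produce at scale.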

Terminologies
• Linked Services
• Data Sets
• Pipeline
• Diagram View

Steps
• Create a Data Factory
• Add Data Sources
• Define Tables and Pipelines
• Deploy & Start
• Monitor and Manage

Example: Game Logs, Customer Profiling

[Diagram, built up over several slides: Azure Data Factory]
• Sources: On Premises SQL Server (New User View) and Azure Blob Storage (1000’s Log Files, Game Usage)
• Pipeline: Copy “NewUsers” to Blob Storage, giving Cloud New Users
• Pipeline: Mask & Geo-Code, runs on HDInsight, with Game Usage and Geo Dictionary, giving Geo Coded Game Usage
• Pipeline: Join & Aggregate, runs on HDInsight, giving New User Activity

Step 3: Define Tables & Pipelines

[Slide: “GeoCoded Game Usage” Table]

[Slide: Pipeline Definition, with two Activity entries]

PowerShell

# Deploy table
New-AzureDataFactoryTable -DataFactory "GameTelemetry" -File NewUserActivityPerRegion.json

# Deploy pipeline
New-AzureDataFactoryPipeline -DataFactory "GameTelemetry" -File NewUserTelemetryPipeline.json

# Start pipeline (the start time is quoted because it contains a space)
Set-AzureDataFactoryPipelineActivePeriod -Name "NewUserTelemetryPipeline" -DataFactory "GameTelemetry" -StartTime "10/29/2014 12:00:00"

Incremental Data Production

[Diagram: datasets are produced in time slices. One is Hourly (12-1, 1-2, 2-3); Dataset2 and Dataset3 are shown with Daily slices (Monday, Tuesday, Wednesday). A Hive Activity takes GameUsage and GeoCodeDictionary and produces Geo-CodedGameUsage.]
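The idea of producing data slice by slice can be sketched in a few lines of Python. This is only an illustration of time slicing, not how Data Factory implements it; the function name `time_slices` is hypothetical, and the start time is borrowed from the pipeline's active period shown earlier in the deck.

```python
from datetime import datetime, timedelta

def time_slices(start, end, step):
    """Yield (slice_start, slice_end) pairs covering [start, end)."""
    current = start
    while current < end:
        yield current, min(current + step, end)
        current += step

# Three hourly slices starting at the pipeline's start time
hourly = list(time_slices(datetime(2014, 10, 29, 12),
                          datetime(2014, 10, 29, 15),
                          timedelta(hours=1)))
for slice_start, slice_end in hourly:
    print(slice_start, "->", slice_end)
```

Because each slice is an independent unit of work, a failed or late slice can be rerun on its own, which is what "incremental rerun" refers to.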

Custom Actions
• Allows running any .NET code wrapped within an ADF activity
• Can be used to connect to new sources/destinations
• Can be used to create custom transformation activities
• Example: Invoke Azure ML model
• SDK for custom activity creation:

Data Factory – Available Today

Coordination:
• Rich scheduling
• Complex dependencies
• Incremental rerun

Authoring:
• JSON & PowerShell/C#

Management:
• Lineage
• Data production policies (late data, rerun, latency, etc.)

Hub: Azure Hub (HDInsight + Blob storage)
• Activities: Hive, Pig, C#
• Data Connectors: Blobs, Tables, Azure DB, On Prem SQL Server, MDS

Analyze your data in motion

What is Streaming Data?

Data in Motion

Data at Rest

Azure Stream Analytics

Real-time stream processing Near infinite cloud scale

Managed real-time analytics

Mission-critical reliability and scale

Rapid development

Point of Service Devices
Self Checkout Stations
Kiosks
Smart Phones
Slates/Tablets
PCs/Laptops
Servers
Digital Signs
Diagnostic Equipment
Remote Medical Monitors
Logic Controllers
Specialized Devices
Thin Clients
Handhelds
Security
POS Terminals
Automation Devices
Vending Machines
Kinect
ATM

Stream Analytics

How do customers create a real-time streaming solution?

Customers using ASA?

Using Azure Analytic Service

[Diagram stages: Data Source, Collect, Process, Deliver, Consume]

Event Inputs
• Event Hub
• Azure Blob

Reference Data
• Azure Blob

Azure Stream Analytics
• Transform: temporal joins, filter, aggregates, projections, windows, etc.
• Enrich
• Correlate

Outputs
• SQL Azure
• Azure Blobs
• Event Hub
• Table Storage

Consume
• BI Dashboards
• Predictive Analytics
• Azure Storage

Sample Scenario : Toll Station

EntryStream - Data about vehicles entering toll stations

TollId  EntryTime                     LicensePlate  State  Make    Model    Type  Weight
1       2014-10-25T19:33:30.0000000Z  JNB 7001      NY     Honda   CRV      1     3010
1       2014-10-25T19:33:31.0000000Z  YXZ 1001      NY     Toyota  Camry    2     3020
3       2014-10-25T19:33:32.0000000Z  ABC 1004      CT     Ford    Taurus   2     3800
2       2014-10-25T19:33:33.0000000Z  XYZ 1003      CT     Toyota  Corolla  2     2900
1       2014-10-25T19:33:34.0000000Z  BNJ 1007      NY     Honda   CRV      1     3400
2       2014-10-25T19:33:35.0000000Z  CDE 1007      NJ     Toyota  4x4      1     3800
…       …                             …             …      …       …        …     …

ExitStream - Data about cars leaving toll stations

TollId  ExitTime                      LicensePlate
1       2014-10-25T19:33:40.0000000Z  JNB 7001
1       2014-10-25T19:33:41.0000000Z  YXZ 1001
3       2014-10-25T19:33:42.0000000Z  ABC 1004
2       2014-10-25T19:33:43.0000000Z  XYZ 1003
…       …                             …

ReferenceData - Commercial vehicle registration data

LicensePlate  RegistrationId  Expired
SVT 6023      285429838       1
XLZ 3463      362715656       0
QMZ 1273      876133137       1
RIV 8632      992711956       0
…             …               …
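As an illustration of correlating two streams, the following Python sketch joins the EntryStream and ExitStream sample rows on TollId and LicensePlate and computes how long each vehicle spent at the toll. It is a plain in-memory simulation under stated assumptions, not Stream Analytics code; the helper names `parse` and `seconds_in_toll` are hypothetical.

```python
from datetime import datetime

def parse(ts):
    # The sample timestamps carry 7 fractional digits, more than %f accepts,
    # so only the whole-second part is parsed here.
    return datetime.strptime(ts[:19], "%Y-%m-%dT%H:%M:%S")

# (TollId, EntryTime, LicensePlate) from the EntryStream sample
ENTRIES = [
    (1, "2014-10-25T19:33:30.0000000Z", "JNB 7001"),
    (1, "2014-10-25T19:33:31.0000000Z", "YXZ 1001"),
    (3, "2014-10-25T19:33:32.0000000Z", "ABC 1004"),
    (2, "2014-10-25T19:33:33.0000000Z", "XYZ 1003"),
    (1, "2014-10-25T19:33:34.0000000Z", "BNJ 1007"),
    (2, "2014-10-25T19:33:35.0000000Z", "CDE 1007"),
]
# (TollId, ExitTime, LicensePlate) from the ExitStream sample
EXITS = [
    (1, "2014-10-25T19:33:40.0000000Z", "JNB 7001"),
    (1, "2014-10-25T19:33:41.0000000Z", "YXZ 1001"),
    (3, "2014-10-25T19:33:42.0000000Z", "ABC 1004"),
    (2, "2014-10-25T19:33:43.0000000Z", "XYZ 1003"),
]

def seconds_in_toll(entries, exits):
    """Match each entry with the exit of the same toll and plate."""
    exit_by_key = {(toll, plate): parse(t) for toll, t, plate in exits}
    result = {}
    for toll, t, plate in entries:
        exit_time = exit_by_key.get((toll, plate))
        if exit_time is not None:  # vehicles with no exit row yet are skipped
            result[plate] = (exit_time - parse(t)).total_seconds()
    return result

durations = seconds_in_toll(ENTRIES, EXITS)
print(durations)
```

In a real streaming job the join would also be bounded in time, since neither stream ever ends; this sketch sidesteps that by working on a finite sample.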

Query Language - Overview

DML Statements: SELECT, FROM, WHERE, GROUP BY, HAVING, CASE, JOINS, UNION

Scaling Functions: WITH, PARTITION BY

Date and Time Functions: DATENAME, DATEPART, DAY, MONTH, YEAR, DATETIMEFROMPARTS, DATEDIFF, DATEADD

Windowing Extensions: Tumbling Window, Hopping Window, Sliding Window

Aggregate Functions: SUM, COUNT, AVG, MIN, MAX

String Functions: LEN, CONCAT, SUBSTRING, CHARINDEX, PATINDEX

Tumbling Windows

SELECT TollId, COUNT(*)
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 10)

Count the total number of vehicles entering each toll booth every interval of 10 seconds.

[Figure: A 10-second Tumbling Window. Events on a time axis from 0 to 30 seconds are grouped into consecutive, non-overlapping 10-second windows.]
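The tumbling-window count can be simulated in plain Python to see which events land in which window. This is a sketch, not Stream Analytics itself: the function name is hypothetical, timestamps are reduced to the seconds part of the EntryStream sample, and treating each window as (start, end], with the end included, is an assumption about boundary handling.

```python
from collections import Counter

# (TollId, second) pairs from the EntryStream sample (19:33:30 to 19:33:35)
EVENTS = [(1, 30), (1, 31), (3, 32), (2, 33), (1, 34), (2, 35)]

def tumbling_counts(events, size):
    """Count events per (key, window_end) for non-overlapping windows."""
    counts = Counter()
    for key, ts in events:
        window_end = -(-ts // size) * size  # round ts up to a multiple of size
        counts[(key, window_end)] += 1
    return dict(counts)

tumbling = tumbling_counts(EVENTS, 10)
print(tumbling)
```

Every event belongs to exactly one window, so the counts across all windows add up to the number of events.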

Hopping Windows

SELECT COUNT(*), TollId
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, HoppingWindow(second, 10, 5)

Count the number of vehicles entering each toll booth over every 10-second interval; update results every 5 seconds.

[Figure: A 10-second Hopping Window with a 5-second “Hop”. On a time axis from 0 to 30 seconds, each window overlaps the previous one by 5 seconds.]
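A hopping window differs from a tumbling one in that windows overlap, so one event can be counted more than once. The following plain-Python sketch simulates a 10-second window that hops every 5 seconds. The function name is hypothetical, the timestamps are the seconds part of the EntryStream sample, and the (start, end] boundary rule is an assumption.

```python
from collections import Counter

# (TollId, second) pairs from the EntryStream sample (19:33:30 to 19:33:35)
EVENTS = [(1, 30), (1, 31), (3, 32), (2, 33), (1, 34), (2, 35)]

def hopping_counts(events, size, hop):
    """Count events per (key, window_end); windows end at multiples of hop."""
    counts = Counter()
    for key, ts in events:
        end = -(-ts // hop) * hop  # first window end at or after ts
        while end - size < ts:     # window (end - size, end] still contains ts
            counts[(key, end)] += 1
            end += hop
    return dict(counts)

hopping = hopping_counts(EVENTS, 10, 5)
print(hopping)
```

With a 10-second size and a 5-second hop, each event falls into two windows, so the counts sum to twice the number of events.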

Sliding Windows

Give me the count of all the toll booths which have served more than 10 vehicles in the last 10 seconds

[Figure: A 10-second Sliding Window; time axis from 0 to 25 seconds.]

SELECT TollId, COUNT(*)
FROM EntryStream ES
GROUP BY TollId, SlidingWindow(second, 10)
HAVING COUNT(*) > 10
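The sliding-window query can be approximated in plain Python by looking back over the last 10 seconds each time an event arrives. This is a simplified sketch: the function name is hypothetical, it only evaluates when an event arrives (a simplifying assumption about when results are produced), and the threshold is lowered from 10 to 2 so the small EntryStream sample triggers a result.

```python
from collections import defaultdict, deque

# (TollId, second) pairs from the EntryStream sample, ordered by time
EVENTS = [(1, 30), (1, 31), (3, 32), (2, 33), (1, 34), (2, 35)]

def sliding_alerts(events, size, threshold):
    """Emit (key, ts, count) whenever a key has more than `threshold`
    events in the window (ts - size, ts]."""
    recent = defaultdict(deque)
    alerts = []
    for key, ts in events:
        window = recent[key]
        window.append(ts)
        while window[0] <= ts - size:  # drop events that slid out of the window
            window.popleft()
        if len(window) > threshold:
            alerts.append((key, ts, len(window)))
    return alerts

alerts = sliding_alerts(EVENTS, 10, 2)
print(alerts)
```

Only toll 1 exceeds the lowered threshold, at the moment its third vehicle arrives within 10 seconds.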

Real-time analytics

• Intake millions of events per second
  • Process data from connected devices/apps
  • Integrated with highly scalable publish-subscribe ingestor
• Easy processing on continuous streams of data
  • Transform, augment, correlate, temporal operations
  • Detect patterns and anomalies in streaming data
• Correlate streaming with reference data

Programmatic Access with REST APIs

The full functionality of Azure Stream Analytics is available through REST APIs.
• Enables programmatic access
• Useful for automation through scripting
• Embed in other applications/tools

Jobs Management
• Create Job, Delete Job
• Start Job, Stop Job
• List Jobs, Update Job

Input and Output Management
• Create Input / Output, Delete Input / Output
• List Input / Output, Update Input / Output

Transformations Management
• Create Transformation, Delete Transformation
• Get Transformation, Update Transformation

Demo: Scaling, Monitoring & Logging

Scaling Concepts – Partitions

[Diagram: Event Hub partitions (PartitionId = 1, 2, 3) are each processed separately by Stream Analytics, producing Step Result 1, Step Result 2 and Step Result 3.]

SELECT COUNT(*) AS Count, TollBoothId
FROM EntryStream PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 3), TollBoothId
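The effect of partitioning can be pictured with a small Python simulation in which each partition keeps its own counters, so partitions never need to coordinate. The event values and toll booth names below are made up for illustration, the function name is hypothetical, and the (start, end] window boundary is an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical events: (PartitionId, TollBoothId, seconds since start)
EVENTS = [(1, "A", 10), (1, "A", 170), (2, "A", 20), (2, "B", 190)]

def partitioned_counts(events, window_seconds):
    """Count per (TollBoothId, window_end) independently in each partition."""
    per_partition = defaultdict(Counter)
    for partition_id, toll_booth_id, ts in events:
        window_end = -(-ts // window_seconds) * window_seconds
        per_partition[partition_id][(toll_booth_id, window_end)] += 1
    return {p: dict(c) for p, c in per_partition.items()}

by_partition = partitioned_counts(EVENTS, 180)  # 3-minute tumbling window
print(by_partition)
```

Note that toll booth "A" is counted separately in partitions 1 and 2. Each partition yields its own step result, and a further step would be needed to combine them into one total per toll booth.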

ADF & ASA

• Preview services
• Offer the ability to tackle new-age problems in processing and analyzing data
• Scale, Speed, Economy

Recommended/related sessions

1. Inside Azure Storage – Options, abstractions and Best Practices: Data, Sabha2, 11.00 AM – 11.55 AM tomorrow
2. Choosing Right platform for BigData: Data, Sabha2, 3.00 PM to 3.55 PM tomorrow
3. Practical Machine Learning: Data, Sabha2, 4.15 to 5.10 Today

References

Related references for you to expand your knowledge on the subject

Azure Stream Analytics Documentation: http://azure.microsoft.com/en-in/documentation/services/stream-analytics/
Stream Analytics Query Language Reference: https://msdn.microsoft.com/en-us/library/azure/dn834998.aspx
Azure Portal: http://azure.microsoft.com
Azure Updates: http://azure.microsoft.com/blog/
Microsoft Virtual Academy: aka.ms/mva
Developer Network: msdn.microsoft.com/

Azure Support

Must know resources to get online help for Azure.

Azure Support Options: http://azure.microsoft.com/en-us/support/options/
Azure Support Plans: http://azure.microsoft.com/en-us/support/plans/

Ask questions & get answers
• Post questions in the Azure forums
• Tag questions with the keyword Azure.

Azure Vidyapeeth

A platform for learning – Choose your topic, choose your time

• Register to attend Azure Vidyapeeth Live webinars at www.aka.ms/azure-vidyapeeth
• Collect free $100 Azure gift pass by registering for our Azure Vidyapeeth series at the Expo zone!
• Point your mobile phone here to download the Azure Vidyapeeth Mobile App: www.aka.ms/av-app

Thank you

Twitter: @deananth @infinitydlimit

Follow us online

Pricing (Today)

Query Language

• You write declarative queries in SQL: no code compilation, easy to author and deploy
• Unified programming model: brings together event streams, reference data and machine learning extensions
• Temporal semantics: all operators respect, and some use, the temporal properties of events
• Built-in operators and functions: these should (mostly) look familiar if you know relational databases. Filters, projections, joins, windowed (temporal) aggregates, text and date manipulation

Why Event Processing in the Cloud?

• Event data is already in the Cloud
• Event data is globally distributed
• Reduced TCO
• Scale
• Managed service, not infrastructure

Bring the processing to the data, not the data to the processing!

Streamed data is naturally non-local!

Application Components

Components of an Azure Stream Analytics Application

INPUT (source of events)
• Azure Event Hubs
• Azure Blob Storage
• Reference Data

Stream Analytics Query
• Query runs continuously against incoming stream of events

OUTPUT (result of query)
• Azure SQL DB
• Azure Event Hubs
• Azure Blob Storage

Events
• Have a defined schema and are temporal (sequenced in time)
