Big Data Analytics in the Cloud with Microsoft Azure

www.globalbigdataconference.com Twitter : @bigdataconf

Upload: mark-kromer

Post on 09-Jan-2017

Category:

Technology


TRANSCRIPT

Page 1: Big Data Analytics in the Cloud with Microsoft Azure

www.globalbigdataconference.com Twitter: @bigdataconf

Page 2: Big Data Analytics in the Cloud with Microsoft Azure

Big Data Analytics in the Cloud

Microsoft Azure

Cortana Intelligence Suite

Mark Kromer

Microsoft Azure Cloud Data Architect

@kromerbigdata @mssqldude

Page 3: Big Data Analytics in the Cloud with Microsoft Azure

What is Big Data Analytics?

Tech Target: “… the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.”

Techopedia: “… the strategy of analyzing large volumes of data, or big data. This big data is gathered from a wide variety of sources, including social networks, videos, digital images, sensors, and sales transaction records. The aim in analyzing all this data is to uncover patterns and connections that might otherwise be invisible, and that might provide valuable insights about the users who created it. Through this insight, businesses may be able to gain an edge over their rivals and make superior business decisions.”

Requires lots of data wrangling and Data Engineers

Requires Data Scientists to uncover patterns from complex raw data

Requires Business Analysts to provide business value from multiple data sources

Requires additional tools and infrastructure not provided by traditional database and BI technologies

Why Cloud for Big Data Analytics?

• Quick and easy to stand up new, large big data architectures
• Elastic scale
• Metered pricing
• Quickly evolve architectures to rapidly changing landscapes
• Prototype, tear down

Page 4: Big Data Analytics in the Cloud with Microsoft Azure

Big Data Analytics Tools & Use Cases vs. “Traditional BI”

Traditional BI

• Sales reports
• Post-campaign marketing research & analysis
• CRM reports
• Enterprise data assets
• Can’t miss any transactions, records or rows
• DWs
• Relational Databases
• Well-defined and formatted data sources
• Direct connections to OLTP and LOB data sources
• Excel
• Well-defined business semantic models
• OLAP cubes
• MDM, Data Quality, Data Governance

Big Data Analytics

• Sentiment Analysis
• Predictive Maintenance
• Churn Analytics
• Customer Analytics
• Real-time marketing
• Avoid simply siphoning off data for BI tools
• Architect multiple paths for data pipelines: speed, batch, analytical
• Plan for data of varying types, volumes and formats
• Data can/will land at any time, any speed, any format
• It’s OK to miss a few records and data points
• NoSQL
• MPP DWs
• Hadoop, Spark, Storm
• R & ML to find patterns in masses of data lakes

Page 5: Big Data Analytics in the Cloud with Microsoft Azure

A few basic fundamentals

• Key Values / JSON / CSV
• Compress files
• Columnar
• Land raw data fast
• Data Wrangle/Munge/Engineer
• Find patterns
• Prepare for business models
• Present to business decision makers

Big Data Analytics in the Cloud

Collect and land data in lake

Process data pipelines (stream, batch, analysis)

Presentation Layer: Surface knowledge to business decision makers
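
The three stages on this slide (land raw data in the lake, process it, surface it to decision makers) can be sketched in miniature. This is only an illustration using the Python standard library; the file name `events.jsonl.gz`, the `device` and `temp` fields, and all function names are hypothetical and do not come from the slides.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict


def land_raw(events, path):
    """Land raw events fast: compressed JSON lines, no schema enforced."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")


def process(path):
    """Wrangle on read: skip malformed records, average temp per device."""
    totals = defaultdict(lambda: [0.0, 0])
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if "device" not in record or "temp" not in record:
                continue  # tolerate a few unusable records
            totals[record["device"]][0] += record["temp"]
            totals[record["device"]][1] += 1
    return {device: s / n for device, (s, n) in totals.items()}


def present(summary):
    """Presentation layer: one readable line per device."""
    return [f"{device}: {avg:.1f}" for device, avg in sorted(summary.items())]


raw_events = [
    {"device": "a", "temp": 20.0},
    {"device": "a", "temp": 22.0},
    {"device": "b", "temp": 30.0},
    {"note": "malformed record"},
]
lake_path = os.path.join(tempfile.mkdtemp(), "events.jsonl.gz")
land_raw(raw_events, lake_path)
report = present(process(lake_path))
```

The raw file is written without validation and the schema is applied only when reading, which mirrors the "land raw data fast" and "it's OK to miss a few records" points from the earlier slides.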

Page 6: Big Data Analytics in the Cloud with Microsoft Azure

Microsoft Azure Big Data Analytics

Cortana Intelligence Suite

Azure Data Platform-at-a-glance

Page 7: Big Data Analytics in the Cloud with Microsoft Azure

Action: People; Automated Systems; Apps (Web, Mobile, Bots)

Intelligence: Cortana; Bot Framework; Cognitive Services

Dashboards & Visualizations: Power BI

Information Management: Event Hubs; Data Catalog; Data Factory

Machine Learning and Analytics: HDInsight (Hadoop and Spark); Stream Analytics; Data Lake Analytics; Machine Learning

Big Data Stores: SQL Data Warehouse; Data Lake Store

Data Sources: Apps; Sensors and devices; Data

Page 8: Big Data Analytics in the Cloud with Microsoft Azure
Page 9: Big Data Analytics in the Cloud with Microsoft Azure

Azure Data Factory

What it is: A pipeline system to move data in, perform activities on data, move data around, and move data out

When to use it:
• Create solutions using multiple tools as a single process
• Orchestrate processes - Scheduling
• Monitor and manage pipelines
• Call and re-train Azure ML models

Page 10: Big Data Analytics in the Cloud with Microsoft Azure

ADF Components

Page 11: Big Data Analytics in the Cloud with Microsoft Azure

ADF Logical Flow

Page 12: Big Data Analytics in the Cloud with Microsoft Azure

Example – Customer Churn

Azure Blob Storage

Call Log Files

Customer Table

On Premises Data Mart

Call Log Files

Customer Table

Azure DB

Customer Churn Table

Act (Visualize)

Azure Data Factory:

Activity: a processing step (Hadoop job, custom code, ML model, etc.)

Data Set: a collection of files, a DB table, etc.

Pipeline: a logical group of activities

Data Sources

Data Sources: Customer, Call Details

Pipeline stages: Ingest, Transform & Analyze, Publish

Activities: Move; Transform, Combine, etc.; Analyze

Output: Customers Likely to Churn
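
The three terms defined on this slide (data set, activity, pipeline) can be modeled in a few lines. The sketch below is a conceptual illustration only and is not the Azure Data Factory API; the class names, the `combine` and `analyze` functions, the sample rows, and the "two or more dropped calls" rule are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Activity:
    """A processing step that reads input data sets and writes one output."""
    name: str
    inputs: List[str]
    output: str
    run: Callable[..., list]


@dataclass
class Pipeline:
    """A logical group of activities, executed in order."""
    name: str
    activities: List[Activity] = field(default_factory=list)

    def execute(self, datasets: Dict[str, list]) -> Dict[str, list]:
        for activity in self.activities:
            args = [datasets[name] for name in activity.inputs]
            datasets[activity.output] = activity.run(*args)
        return datasets


def combine(customers, calls):
    """Attach the number of dropped calls to each customer row."""
    dropped = {}
    for call in calls:
        if call["dropped"]:
            dropped[call["cid"]] = dropped.get(call["cid"], 0) + 1
    return [{**c, "dropped_calls": dropped.get(c["cid"], 0)} for c in customers]


def analyze(combined):
    """Stand-in for an ML model: flag customers with 2+ dropped calls."""
    return [c["name"] for c in combined if c["dropped_calls"] >= 2]


churn = Pipeline("customer_churn", [
    Activity("Combine", ["customer_table", "call_log_files"], "customer_calls", combine),
    Activity("Analyze", ["customer_calls"], "likely_to_churn", analyze),
])

datasets = {
    "customer_table": [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bob"}],
    "call_log_files": [
        {"cid": 1, "dropped": True},
        {"cid": 1, "dropped": True},
        {"cid": 2, "dropped": False},
    ],
}
result = churn.execute(datasets)
```

Each activity names the data sets it consumes and the one it produces, so the pipeline is just an ordered list of steps over a shared set of named data sets.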

Page 13: Big Data Analytics in the Cloud with Microsoft Azure

Simple ADF

• Business Goal: Transform and analyze web logs each month
• Design Process: Transform raw web logs using a Hive query, storing the results in Blob Storage

Web logs loaded to Blob

HDInsight Hive query to transform log entries

Files ready for analysis and use in Azure ML

Page 14: Big Data Analytics in the Cloud with Microsoft Azure

Azure SQL Data Warehouse

What it is: A scaling data warehouse service in the cloud

When to use it:
• When you need a large-data BI solution in the cloud
• MPP SQL Server in the cloud
• Elastic scale data warehousing
• When you need pause-able scale-out compute

Page 15: Big Data Analytics in the Cloud with Microsoft Azure

Elastic scale & performance

Real-time elasticity

Resize in <1 minute

On-demand compute

Expand or reduce as needed

Pause the data warehouse to save on compute costs, e.g. pause during non-business hours
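
To get a feel for what pausing during non-business hours means, here is a small worked calculation. The schedule (10 hours a day, Monday to Friday) is a hypothetical example, and the sketch counts compute hours only; it assumes nothing about actual prices.

```python
HOURS_PER_WEEK = 7 * 24


def compute_hours_saved(active_hours_per_day, active_days_per_week):
    """Return (billed_hours, paused_hours, fraction_saved) for one week."""
    billed = active_hours_per_day * active_days_per_week
    paused = HOURS_PER_WEEK - billed
    return billed, paused, paused / HOURS_PER_WEEK


# Hypothetical schedule: running 10 hours a day, Monday to Friday.
billed, paused, saved = compute_hours_saved(10, 5)
summary = f"billed {billed}h, paused {paused}h, {saved:.0%} of compute hours saved"
```

Under that assumed schedule the warehouse runs 50 of the 168 hours in a week, so roughly 70% of compute hours are not billed.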

Page 16: Big Data Analytics in the Cloud with Microsoft Azure

Storage can be as big or small as required

Users can execute niche workloads without re-scanning data

Elastic scale & performance

Scale

Page 17: Big Data Analytics in the Cloud with Microsoft Azure

Logical overview

Control

Compute

Storage

Page 18: Big Data Analytics in the Cloud with Microsoft Azure

SELECT COUNT_BIG(*) FROM dbo.[FactInternetSales];

(The diagram repeats this query across the Control and Compute nodes.)

Control

Compute
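
The repeated query on this slide illustrates the MPP idea: each compute node counts its own share of the rows and the control node combines the partial results. The following is a minimal sketch of that pattern, not the service's actual implementation; the node count of 4, the CRC32 hash, and the `SalesOrderNumber` values are assumptions made for the example.

```python
import zlib


def distribute(rows, key, node_count):
    """Hash-distribute rows across compute nodes by a key column."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % node_count
        nodes[bucket].append(row)
    return nodes


def count_big(nodes):
    """Each compute node counts locally; the control node sums the partials."""
    partial_counts = [len(node_rows) for node_rows in nodes]
    return sum(partial_counts), partial_counts


rows = [{"SalesOrderNumber": f"SO{n}"} for n in range(1000)]
nodes = distribute(rows, "SalesOrderNumber", node_count=4)
total, partials = count_big(nodes)
```

Because every row lands on exactly one node, the sum of the partial counts equals the count over the whole table, regardless of how evenly the hash spreads the rows.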

Page 19: Big Data Analytics in the Cloud with Microsoft Azure

Azure Data Lake

What it is: Data storage (WebHDFS) and distributed data processing (Hive, Spark, HBase, Storm, U-SQL) engines

When to use it:
• Low-cost, high-throughput data store
• Non-relational data
• Larger storage limits than Blobs

Page 20: Big Data Analytics in the Cloud with Microsoft Azure

Ingest all data regardless of requirements

Store all data in native format without schema definition

Do analysis using analytic engines like Hadoop and ADLA

Interactive queries

Batch queries

Machine Learning

Data warehouse

Real-time analytics

Devices

Page 21: Big Data Analytics in the Cloud with Microsoft Azure

Azure Data Lake (Store, HDInsight, Analytics)

[Diagram: Analytics (ADL Analytics with U-SQL; ADL HDInsight with Hive) on YARN, over Storage (Store, WebHDFS)]

Page 22: Big Data Analytics in the Cloud with Microsoft Azure

Introducing ADLS

Azure Data Lake Store: a hyper-scale repository for big data analytics workloads

No limits to SCALE

Store ANY DATA in its native format

HADOOP FILE SYSTEM (HDFS) for the cloud

Optimized for analytic workload PERFORMANCE

ENTERPRISE GRADE authentication, access control, audit, encryption at rest

Page 23: Big Data Analytics in the Cloud with Microsoft Azure

Enterprise-grade

Limitless scale

Productivity from day one

Easy and powerful data preparation

All data

Page 24: Big Data Analytics in the Cloud with Microsoft Azure

Developing big data apps

Author, debug, & optimize big data apps in Visual Studio

Multiple languages: U-SQL, Hive, & Pig

Seamlessly integrate .NET

Page 25: Big Data Analytics in the Cloud with Microsoft Azure

Work across all cloud data

Azure Data Lake Analytics

Azure SQL DW

Azure SQL DB

Azure Storage Blobs

Azure Data Lake Store

SQL DB in an Azure VM

Page 26: Big Data Analytics in the Cloud with Microsoft Azure

What is U-SQL?

A hyper-scalable, highly extensible language for preparing, transforming and analyzing all data

Allows users to focus on the what—not the how—of business problems

Built on familiar languages (SQL and C#) and supported by a fully integrated development environment

Built for data developers & scientists

Page 27: Big Data Analytics in the Cloud with Microsoft Azure

U-SQL language philosophy

Declarative query and transformation language:
• Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL analytics functions
• Optimizable, scalable

Operates on unstructured & structured data:
• Schema on read over files
• Relational metadata objects (e.g. database, table)

Extensible from ground up:
• Type system is based on C#
• Expression language is C#
• User-defined functions (U-SQL and C#)
• User-defined types (U-SQL/C#) (future)
• User-defined aggregators (C#)
• User-defined operators (UDO) (C#)

U-SQL provides the parallelization and scale-out framework for user code:
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER

Expression-flow programming style:
• Easy-to-use functional lambda composition
• Composable, globally optimizable

Federated query across distributed data sources (soon)

REFERENCE ASSEMBLY MyDB.MyAssembly;

CREATE TABLE T(cid int, first_order DateTime, last_order DateTime, order_count int, order_amount float);

// Schema on read: column names and types are applied while extracting.
@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();

@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();

// Per-customer order summary; C# expressions are used in the predicates.
@j = SELECT c.cid, MIN(o.odate) AS firstorder, MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt, SUM(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;

OUTPUT @j TO "/output/result.txt" USING new MyData.Write();

INSERT INTO T SELECT * FROM @j;
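
For readers who do not know U-SQL, the data flow of the script above (extract with a schema, left outer join, filter, group and aggregate) can be mimicked in plain Python. This is a rough sketch of the semantics only: the CSV contents are invented sample data, the user-defined `MyNamespace.MyFunction` predicate is left out, and the custom outputter and the `INSERT` are not modeled.

```python
import csv
import io
from datetime import datetime

# Hypothetical file contents standing in for orders.txt and customers.txt.
ORDERS_CSV = "1,10,2016-01-05,25.0\n2,10,2016-03-01,75.0\n3,20,2016-02-10,40.0\n"
CUSTOMERS_CSV = "10,Ann,New York\n20,Bob,Boston\n30,Cy,Newark\n"


def parse_date(text):
    return datetime.strptime(text, "%Y-%m-%d")


def extract(text, schema):
    """Schema on read: apply column names and types while reading the file."""
    rows = []
    for values in csv.reader(io.StringIO(text)):
        rows.append({name: cast(v) for (name, cast), v in zip(schema, values)})
    return rows


orders = extract(ORDERS_CSV, [("oid", int), ("cid", int),
                              ("odate", parse_date), ("amount", float)])
customers = extract(CUSTOMERS_CSV, [("cid", int), ("name", str), ("city", str)])

result = []
for c in customers:
    if not c["city"].startswith("New"):
        continue  # WHERE c.city.StartsWith("New")
    # LEFT OUTER JOIN: customers without orders are kept with empty aggregates.
    matched = [o for o in orders if o["cid"] == c["cid"]]
    result.append({
        "cid": c["cid"],
        "firstorder": min((o["odate"] for o in matched), default=None),
        "lastorder": max((o["odate"] for o in matched), default=None),
        "ordercnt": len(matched),
        "totalamount": sum(o["amount"] for o in matched) if matched else None,
    })
```

The loop makes explicit what the declarative script leaves to the engine: which rows are scanned, how the join is evaluated, and where the aggregation happens.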

Page 28: Big Data Analytics in the Cloud with Microsoft Azure

Expression-flow programming style

Automatic "in-lining" of U-SQL expressions – whole script leads to a single execution model

Execution plan that is optimized out-of-the-box and w/o user intervention

Per-job and user-driven parallelization

Detail visibility into execution steps, for debugging

Heat map functionality to identify performance bottlenecks

Page 29: Big Data Analytics in the Cloud with Microsoft Azure

“Unstructured” Files
• Schema on Read
• Write to File
• Built-in and custom Extractors and Outputters
• ADL Storage and Azure Blob Storage

EXTRACT Expression

@s = EXTRACT a string, b int
     FROM "filepath/file.csv"
     USING Extractors.Csv();

• Built-in Extractors: Csv, Tsv, Text with lots of options
• Custom Extractors: e.g., JSON, XML, etc.

OUTPUT Expression

OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();

• Built-in Outputters: Csv, Tsv, Text
• Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io)

Filepath URIs

• Relative URI to default ADL Storage account: "filepath/file.csv"
• Absolute URIs:

• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"

• WASB: "wasb://container@account/filepath/file.csv"
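
The three path forms listed above differ only in scheme and authority, which a short parser makes visible. This sketch uses Python's `urllib.parse` and follows the URI shapes exactly as written on the slide; the function name `describe_path` and the returned dictionary keys are hypothetical.

```python
from urllib.parse import urlparse


def describe_path(path):
    """Classify a file path as relative, ADLS, or WASB, per the slide's forms."""
    parsed = urlparse(path)
    if parsed.scheme == "adl":
        # adl://<account>.azuredatalakestore.net/<path>
        return {"store": "ADLS",
                "account": parsed.hostname.split(".")[0],
                "path": parsed.path}
    if parsed.scheme == "wasb":
        # wasb://<container>@<account>/<path>
        return {"store": "WASB",
                "container": parsed.username,
                "account": parsed.hostname,
                "path": parsed.path}
    if parsed.scheme == "":
        # No scheme: relative to the default ADL Storage account.
        return {"store": "default", "path": "/" + path.lstrip("/")}
    raise ValueError(f"unsupported scheme: {parsed.scheme}")


relative = describe_path("filepath/file.csv")
adls = describe_path("adl://account.azuredatalakestore.net/filepath/file.csv")
wasb = describe_path("wasb://container@account/filepath/file.csv")
```

All three resolve to the same logical path, `/filepath/file.csv`; only the storage account (and, for WASB, the container) changes.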

Page 30: Big Data Analytics in the Cloud with Microsoft Azure

Expression-flow Programming Style

• Automatic "in-lining" of U-SQL expressions – whole script leads to a single execution model.

• Execution plan that is optimized out-of-the-box and w/o user intervention.

• Per job and user driven level of parallelization.

• Detail visibility into execution steps, for debugging.

• Heatmap like functionality to identify performance bottlenecks.

Page 31: Big Data Analytics in the Cloud with Microsoft Azure

Visual Studio integration

Page 32: Big Data Analytics in the Cloud with Microsoft Azure

What can you do with Visual Studio?

Visualize and replay progress of job

Fine-tune query performance

Visualize physical plan of U-SQL query

Browse metadata catalog

Author U-SQL scripts (with C# code)

Create metadata objects

Submit and cancel U-SQL jobs

Debug U-SQL and C# code

Page 33: Big Data Analytics in the Cloud with Microsoft Azure

Plug-in

Page 34: Big Data Analytics in the Cloud with Microsoft Azure

Authoring U-SQL queries

Visual Studio fully supports authoring U-SQL scripts

While editing, it provides:

IntelliSense

Syntax color coding

Syntax checking

Contextual Menu

Page 35: Big Data Analytics in the Cloud with Microsoft Azure

Job execution graph

After a job is submitted, its progress through the different execution stages is shown and updated continuously

Important stats about the job are also displayed and updated continuously

Page 36: Big Data Analytics in the Cloud with Microsoft Azure

Job diagnostics

Diagnostics information is shown to help with debugging and performance issues

Page 37: Big Data Analytics in the Cloud with Microsoft Azure

HDInsight: Cloud Managed Hadoop

What it is: Microsoft’s implementation of Apache Hadoop (as a service) that uses Blobs for persistent storage

When to use it:
• When you need to process large scale data (PB+)
• When you want to use Hadoop or Spark as a service
• When you want to compute data and retire the servers, but retain the results
• When your team is familiar with the Hadoop Zoo

Page 38: Big Data Analytics in the Cloud with Microsoft Azure

Hadoop and HDInsight

Using the Hadoop Ecosystem to process and query data

Page 39: Big Data Analytics in the Cloud with Microsoft Azure

Microsoft Azure Big Data Analytics

Cortana Intelligence Suite

HDInsight Tools for Visual Studio

Page 40: Big Data Analytics in the Cloud with Microsoft Azure
Page 41: Big Data Analytics in the Cloud with Microsoft Azure
Page 42: Big Data Analytics in the Cloud with Microsoft Azure
Page 43: Big Data Analytics in the Cloud with Microsoft Azure
Page 44: Big Data Analytics in the Cloud with Microsoft Azure

Deploying HDInsight Clusters

• Cluster Type: Hadoop, Spark, HBase and Storm
  • Hadoop clusters: for query and analysis workloads
  • HBase clusters: for NoSQL workloads
  • Spark clusters: for in-memory processing, interactive queries, stream, and machine learning workloads
• Operating System: Windows or Linux
• Can be deployed from Azure portal, Azure Command Line Interface (CLI), or Azure PowerShell and Visual Studio
• A UI dashboard is provided to the cluster through Ambari
• Remote Access through SSH, REST API, ODBC, JDBC
  • Remote Desktop (RDP) access for Windows clusters

Page 45: Big Data Analytics in the Cloud with Microsoft Azure

Azure Machine Learning

What it is: A multi-platform environment and engine to create and deploy Machine Learning models and APIs

When to use it:
• When you need to create predictive analytics
• When you need to share Data Science experiments across teams
• When you need to create callable APIs for ML functions
• When you also have R and Python experience on your Data Science team

Page 46: Big Data Analytics in the Cloud with Microsoft Azure

Creating an Experiment

Get/Prepare Data

Build/Edit Experiment

Create/Update Model

Evaluate Model Results

Build and Model

Create Workspace

Deploy Model

Consume Model

Page 47: Big Data Analytics in the Cloud with Microsoft Azure

Basic Azure ML Elements

Import Data

Preprocess

Algorithm

Train Model

Split Data

Score Model
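
The elements listed on this slide form a short chain: import, preprocess, split, pick an algorithm, train, score. The sketch below walks that chain in plain Python so the order of the steps is concrete. It is not Azure ML code; the synthetic data, the one-threshold "algorithm", and the every-third-row holdout are all assumptions chosen to keep the example tiny and deterministic.

```python
def import_data():
    """Hypothetical data: (dropped_calls, churned) pairs; None marks a gap."""
    rows = [(n, n >= 3) for n in range(6)] * 5
    rows.append((None, False))
    return rows


def preprocess(rows):
    """Drop rows with a missing feature value."""
    return [row for row in rows if row[0] is not None]


def split_data(rows, every):
    """Hold out every `every`-th row for testing; train on the rest."""
    train = [row for i, row in enumerate(rows) if i % every != every - 1]
    test = [row for i, row in enumerate(rows) if i % every == every - 1]
    return train, test


def accuracy(threshold, rows):
    """Fraction of rows where 'x >= threshold' matches the label."""
    return sum((x >= threshold) == label for x, label in rows) / len(rows)


def train_model(train_rows):
    """Algorithm: pick the threshold that best fits the training rows."""
    candidates = sorted({x for x, _ in train_rows})
    return max(candidates, key=lambda t: accuracy(t, train_rows))


def score_model(threshold, test_rows):
    """Evaluate the trained threshold on the held-out rows."""
    return accuracy(threshold, test_rows)


rows = preprocess(import_data())
train_rows, test_rows = split_data(rows, every=3)
threshold = train_model(train_rows)
test_accuracy = score_model(threshold, test_rows)
```

The point of the split is that `score_model` only ever sees rows that `train_model` did not, which is what makes the reported accuracy meaningful.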

Page 48: Big Data Analytics in the Cloud with Microsoft Azure
Page 49: Big Data Analytics in the Cloud with Microsoft Azure
Page 50: Big Data Analytics in the Cloud with Microsoft Azure

Power BI

What it is: Interactive report and visualization creation for computing and mobile platforms

When to use it:
• When you need to create and view interactive reports that combine multiple datasets
• When you need to embed reporting into an application
• When you need customizable visualizations
• When you need to create shared datasets, reports, and dashboards that you publish to your team

Page 51: Big Data Analytics in the Cloud with Microsoft Azure

Microsoft Azure Big Data Analytics

Cortana Intelligence Suite

Common architectural patterns

Page 52: Big Data Analytics in the Cloud with Microsoft Azure

Big Data Analytics – Data Flow

DATA

Business apps

Custom apps

Sensors and devices

INTELLIGENCE ACTION

People

Preparation, Analytics and Machine Learning

Azure Data Lake Store

Ingestion

Bulk Ingestion

Event Ingestion

Discovery

Azure Data Catalog

Visualization

Power BI

HDInsight

Data Lake Analytics

Page 53: Big Data Analytics in the Cloud with Microsoft Azure

Event Ingestion Patterns

Business apps

Custom apps

Sensors and devices

Events Events

Azure Data Lake Store

Transformed Data

Real Time Dashboards

Power BI

Raw Events

Azure Event Hubs

Kafka

Event Collection

Azure Stream Analytics

Spark Streaming

Stream Processing
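
The stream processing step in this pattern turns raw events into transformed data for dashboards. One common way to do that is a windowed aggregation; the slide does not say which transformation is used, so the tumbling-window count below is an assumption, as are the event shape `(timestamp_seconds, device)` and the 10-second window.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per device in fixed, non-overlapping time windows."""
    counts = Counter()
    for timestamp, device in events:
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, device)] += 1
    return dict(counts)


# Hypothetical raw events: (timestamp in seconds, device id).
raw_events = [(0, "a"), (3, "a"), (9, "b"), (10, "a"), (19, "b"), (20, "b")]
transformed = tumbling_window_counts(raw_events, 10)
```

Each event belongs to exactly one window, so the per-window counts always add up to the number of raw events.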

Page 54: Big Data Analytics in the Cloud with Microsoft Azure

Bulk Ingestion and Preparation

Business apps

Custom apps

Sensors and devices

Azure Data Lake Store

Prepared Data (Structured)

Raw Data

Bulk Load

Azure Data Factory

Prepared Data (Unstructured)

Data Preparation

Batch Analytics

Interactive Analytics

Power BI Notebooks

Spark on HDInsight

Azure SQL DW

Azure Data Catalog

Page 55: Big Data Analytics in the Cloud with Microsoft Azure

Data Transformation

Data Collection

Presentation and action

Queuing System

Data Storage

Big Data Lambda Architecture

Azure Search

Data analytics (Excel, Power BI, Looker, Tableau)

Web/thick client dashboards

Devices to take action

Event hub

Event & data producers

Applications

Web and social

Devices

Live Dashboards

DocumentDB

MongoDB

SQL Azure

ADW

HBase

Blob Storage

Kafka/RabbitMQ/ActiveMQ

Event hubs

Azure ML

Storm / Stream Analytics

Hive / U-SQL

Data Factory

Sensors

Pig

Cloud gateways(web APIs)

Field gateways

Page 56: Big Data Analytics in the Cloud with Microsoft Azure

Get started today!

http://aka.ms/cisolutions

Cortana Intelligence Solutions

Page 57: Big Data Analytics in the Cloud with Microsoft Azure

Cortana Intelligence Solutions: Discover

http://aka.ms/cisolutions

Page 58: Big Data Analytics in the Cloud with Microsoft Azure

Cortana Intelligence Solutions: Try

Page 59: Big Data Analytics in the Cloud with Microsoft Azure

Cortana Intelligence Solutions: Deploy

Page 60: Big Data Analytics in the Cloud with Microsoft Azure

Instructions and Next Steps: Customize