Big Data Analytics in the Cloud with Microsoft Azure
TRANSCRIPT
www.globalbigdataconference.com | Twitter: @bigdataconf
Big Data Analytics in the Cloud: Microsoft Azure
Cortana Intelligence Suite
Mark Kromer, Microsoft Azure Cloud Data Architect
@kromerbigdata / @mssqldude
What is Big Data Analytics?
TechTarget: “… the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.”
Techopedia: “… the strategy of analyzing large volumes of data, or big data. This big data is gathered from a wide variety of sources, including social networks, videos, digital images, sensors, and sales transaction records. The aim in analyzing all this data is to uncover patterns and connections that might otherwise be invisible, and that might provide valuable insights about the users who created it. Through this insight, businesses may be able to gain an edge over their rivals and make superior business decisions.”
• Requires lots of data wrangling by Data Engineers
• Requires Data Scientists to uncover patterns in complex raw data
• Requires Business Analysts to derive business value from multiple data sources
• Requires tools and infrastructure beyond traditional database and BI technologies
Why Cloud for Big Data Analytics?
• Quick and easy to stand up new, large big data architectures
• Elastic scale
• Metered pricing
• Quickly evolve architectures for rapidly changing landscapes
• Prototype, then tear down
Big Data Analytics Tools & Use Cases vs. “Traditional BI”
Traditional BI
• Sales reports
• Post-campaign marketing research & analysis
• CRM reports
• Enterprise data assets
• Can’t miss any transactions, records or rows
• DWs
• Relational databases
• Well-defined and formatted data sources
• Direct connections to OLTP and LOB data sources
• Excel
• Well-defined business semantic models
• OLAP cubes
• MDM, Data Quality, Data Governance
Big Data Analytics
• Sentiment analysis
• Predictive maintenance
• Churn analytics
• Customer analytics
• Real-time marketing
• Avoid simply siphoning off data for BI tools
• Architect multiple paths for data pipelines: speed, batch, analytical
• Plan for data of varying types, volumes and formats
• Data can/will land at any time, any speed, any format
• It’s OK to miss a few records and data points
• NoSQL
• MPP DWs
• Hadoop, Spark, Storm
• R & ML to find patterns in massive data lakes
• Key values / JSON / CSV
• Compress files
• Columnar formats
• Land raw data fast
• Data wrangle/munge/engineer
• Find patterns
• Prepare for business models
• Present to business decision makers
A few basic fundamentals
Big Data Analytics in the Cloud
Collect and land data in lake
Process data pipelines (stream, batch, analysis)
Presentation layer: surface knowledge to business decision makers
Microsoft Azure Big Data Analytics
Cortana Intelligence Suite
Azure Data Platform-at-a-glance
Action: People, Automated Systems
Apps: Web, Mobile, Bots
Intelligence: Dashboards & Visualizations, Cortana, Bot Framework, Cognitive Services, Power BI
Information Management: Event Hubs, Data Catalog, Data Factory
Machine Learning and Analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
Big Data Stores: SQL Data Warehouse, Data Lake Store
Data Sources: Apps, Sensors and devices, Data
Azure Data Factory
What it is: A pipeline system to move data in, perform activities on data, move data around, and move data out
When to use it:
• Create solutions using multiple tools as a single process
• Orchestrate processes and scheduling
• Monitor and manage pipelines
• Call and re-train Azure ML models
ADF Components
ADF Logical Flow
Example – Customer Churn
[Diagram] Call log files and a customer table from an on-premises data mart are copied into Azure Blob Storage; the pipeline produces a customer churn table in an Azure DB, which is then visualized (Act).
Azure Data Factory:
• Activity: a processing step (Hadoop job, custom code, ML model, etc.)
• Data set: a collection of files, a DB table, etc.
• Pipeline: a logical group of activities
In the churn example, customer call details flow from the data sources through the pipeline stages (Ingest → Transform & Analyze → Publish) to produce the customers likely to churn.
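As a rough illustration of that logical model (not the real ADF SDK — all class and field names below are made up), a pipeline is just an ordered group of activities that each transform a data set:

```python
# Minimal sketch of ADF's logical model: datasets flow through activities
# grouped into a pipeline. Illustrative names only, not the ADF API.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Dataset:
    name: str
    rows: list = field(default_factory=list)

@dataclass
class Activity:
    name: str
    run: Callable[[Dataset], Dataset]  # one processing step

@dataclass
class Pipeline:
    name: str
    activities: List[Activity]

    def execute(self, source: Dataset) -> Dataset:
        data = source
        for act in self.activities:    # activities run in order
            data = act.run(data)
        return data

# Ingest -> Transform, as in the churn example
ingest = Activity("ingest", lambda d: Dataset("raw", d.rows))
transform = Activity("transform",
                     lambda d: Dataset("churn", [r for r in d.rows if r["calls"] > 3]))
pipeline = Pipeline("churn-pipeline", [ingest, transform])

result = pipeline.execute(
    Dataset("calls", [{"cid": 1, "calls": 5}, {"cid": 2, "calls": 1}]))
print(result.rows)  # [{'cid': 1, 'calls': 5}]
```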
Simple ADF
• Business Goal: transform and analyze web logs each month
• Design Process: transform raw web logs with a Hive query, storing the results in Blob Storage
Flow: web logs loaded to Blob → HDInsight Hive query transforms log entries → files ready for analysis and use in Azure ML
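A tiny Python analog of what that Hive transform step might do — aggregate raw log lines into per-page hit counts ready for downstream analysis. The log format here is an assumption for illustration, not taken from the deck:

```python
# Hedged sketch: parse assumed "timestamp page status" log lines and
# count successful hits per page, as the Hive query in this example
# might before results land in Blob Storage.
from collections import Counter

raw_logs = [
    "2016-01-01T10:00:00 /index.html 200",
    "2016-01-01T10:00:01 /cart.html 200",
    "2016-01-01T10:00:02 /index.html 404",
]

hits = Counter()
for line in raw_logs:
    _, page, status = line.split()
    if status == "200":            # keep successful requests only
        hits[page] += 1

print(dict(hits))  # {'/index.html': 1, '/cart.html': 1}
```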
Azure SQL Data Warehouse
What it is: A scaling data warehouse service in the cloud
When to use it:
• When you need a large-data BI solution in the cloud
• MPP SQL Server in the cloud
• Elastic-scale data warehousing
• When you need pause-able scale-out compute
Elastic scale & performance
• Real-time elasticity: resize in under one minute, on-demand compute, expand or reduce as needed
• Pause the data warehouse to save on compute costs, e.g. pause during non-business hours
• Storage can be as big or small as required
• Users can execute niche workloads without re-scanning data
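To see why pausing matters, here is a back-of-the-envelope cost comparison. The rates are made-up placeholders (not Azure pricing); the point is that compute and storage are billed separately, so paused hours stop only the compute meter:

```python
# Hypothetical rates, for illustration only.
compute_rate = 1.25   # $/hour while compute is running (assumed)
storage_cost = 50.0   # $/month, billed regardless of pause state (assumed)

# Running 24x7 for a 30-day month:
always_on = compute_rate * 24 * 30 + storage_cost

# Paused outside business hours: 10 hours/day, 22 workdays:
business_hours_only = compute_rate * 10 * 22 + storage_cost

print(always_on, business_hours_only)  # 950.0 325.0
```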
Logical overview
[Diagram] A control node distributes each query across the compute nodes; every compute node runs the same statement against its own distribution of the storage:

SELECT COUNT_BIG(*)
FROM dbo.[FactInternetSales];
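The scatter-gather idea behind that COUNT query can be sketched in a few lines of Python — the "control node" fans the work out to each "compute node" and sums the partial counts. This is a conceptual illustration, not the actual engine:

```python
# MPP COUNT(*) sketch: local counts per distribution, then a global sum.
from concurrent.futures import ThreadPoolExecutor

distributions = [            # rows hash-distributed across compute nodes
    [("A", 1), ("B", 2)],
    [("C", 3)],
    [("D", 4), ("E", 5), ("F", 6)],
]

def local_count(rows):       # runs independently on each compute node
    return len(rows)

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(local_count, distributions))

total = sum(partials)        # control node aggregates the partial results
print(total)  # 6
```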
Azure Data Lake
What it is: Data storage (WebHDFS) and distributed data processing engines (Hive, Spark, HBase, Storm, U-SQL)
When to use it:
• Low-cost, high-throughput data store
• Non-relational data
• Larger storage limits than Blobs
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis using analytic engines like Hadoop and ADLA
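"Store in native format, apply schema at read time" is the schema-on-read pattern. A minimal Python sketch of the idea — the raw text lands untouched, and column names and types are applied only when an engine reads it:

```python
# Schema-on-read sketch: the lake keeps the raw bytes; the reader
# supplies the schema. Column names/types below are illustrative.
import csv, io

raw = "1,Alice,Seattle\n2,Bob,Portland\n"   # landed as-is, no schema

schema = [("cid", int), ("name", str), ("city", str)]

def read_with_schema(text, schema):
    for row in csv.reader(io.StringIO(text)):
        yield {name: cast(val) for (name, cast), val in zip(schema, row)}

rows = list(read_with_schema(raw, schema))
print(rows[0])  # {'cid': 1, 'name': 'Alice', 'city': 'Seattle'}
```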
[Diagram] One store serves many workloads — interactive queries, batch queries, machine learning, data warehousing, real-time analytics, and device data — through WebHDFS APIs and YARN resource management, with U-SQL via ADL Analytics and Hive via ADL HDInsight over ADL Storage.
Azure Data Lake (Store, HDInsight, Analytics)
No limits to SCALE
Store ANY DATA in its native format
HADOOP FILE SYSTEM (HDFS) for the cloud
Optimized for analytic workload PERFORMANCE
ENTERPRISE GRADE authentication, access control, audit, encryption at rest
Azure Data Lake Store
A hyperscale repository for big data analytics workloads
Introducing ADLS
Enterprise-grade
Limitless scale
Productivity from day one
Easy and powerful data preparation
All data
Developing big data apps
Author, debug, & optimize big data apps in Visual Studio
Multiple languages: U-SQL, Hive, & Pig
Seamlessly integrate .NET
Work across all cloud data
Azure Data Lake Analytics
Azure SQL DW, Azure SQL DB, Azure Storage Blobs
Azure Data Lake Store
SQL DB in an Azure VM
What is U-SQL?
A hyper-scalable, highly extensible language for preparing, transforming and analyzing all data
Allows users to focus on the what—not the how—of business problems
Built on familiar languages (SQL and C#) and supported by a fully integrated development environment
Built for data developers & scientists
U-SQL language philosophy
Declarative query and transformation language:
• Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, and SQL analytics functions
• Optimizable, scalable
Operates on unstructured & structured data:
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from the ground up:
• Type system is based on C#
• Expression language is C#
• User-defined functions (U-SQL and C#)
• User-defined types (U-SQL/C#) (future)
• User-defined aggregators (C#)
• User-defined operators (UDOs) (C#)
U-SQL provides the parallelization and scale-out framework for user code:
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINERS
Expression-flow programming style:
• Easy-to-use functional lambda composition
• Composable, globally optimizable
Federated query across distributed data sources (soon)
REFERENCE ASSEMBLY MyDB.MyAssembly;

CREATE TABLE T (cid int, first_order DateTime, last_order DateTime,
                order_count int, order_amount float);

@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();

@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();

@j = SELECT c.cid,
            MIN(o.odate) AS firstorder,
            MAX(o.odate) AS lastorder,
            COUNT(o.oid) AS ordercnt,
            SUM(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New")
           && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;

OUTPUT @j TO "/output/result.txt" USING new MyData.Write();

INSERT INTO T SELECT * FROM @j;
Expression-flow programming style
• Automatic "in-lining" of U-SQL expressions – the whole script leads to a single execution model
• Execution plan that is optimized out-of-the-box and without user intervention
• Per-job and user-driven parallelization
• Detailed visibility into execution steps, for debugging
• Heat-map functionality to identify performance bottlenecks
“Unstructured” Files
• Schema on read
• Write to file
• Built-in and custom Extractors and Outputters
• ADL Storage and Azure Blob Storage
EXTRACT Expression
@s = EXTRACT a string, b int
     FROM "filepath/file.csv"
     USING Extractors.Csv();
• Built-in Extractors: Csv, Tsv, Text with lots of options
• Custom Extractors: e.g., JSON, XML, etc.
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in Outputters: Csv, Tsv, Text
• Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io)
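The extractor/outputter pattern — one function turns a stream into rows, another serializes rows back out — can be sketched in Python. The JSON-lines pair below mirrors the "e.g., JSON" custom-extractor bullet; it is an analogy, not U-SQL's actual C# interfaces:

```python
# Python analog of U-SQL's Extractor/Outputter pattern (illustrative).
import json, io

def jsonl_extract(stream):            # custom "Extractor": stream -> rows
    for line in stream:
        if line.strip():
            yield json.loads(line)

def jsonl_output(rows, stream):       # custom "Outputter": rows -> stream
    for row in rows:
        stream.write(json.dumps(row) + "\n")

src = io.StringIO('{"a": 1}\n{"a": 2}\n')
rows = [r for r in jsonl_extract(src) if r["a"] > 1]   # the "script" body

out = io.StringIO()
jsonl_output(rows, out)
print(out.getvalue())  # {"a": 2}
```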
Filepath URIs
• Relative URI to default ADL Storage account: "filepath/file.csv"• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"
Visual Studio integration
What can you do with Visual Studio?
• Visualize and replay the progress of a job
• Fine-tune query performance
• Visualize the physical plan of a U-SQL query
• Browse the metadata catalog
• Author U-SQL scripts (with C# code)
• Create metadata objects
• Submit and cancel U-SQL jobs
• Debug U-SQL and C# code
Plug-in
Authoring U-SQL queries
Visual Studio fully supports authoring U-SQL scripts
While editing, it provides:
• IntelliSense
• Syntax color coding
• Syntax checking
• Contextual menu
• …
Job execution graph
After a job is submitted, its progress through the different execution stages is shown and updated continuously
Important stats about the job are also displayed and updated continuously
Job diagnostics
Diagnostic information is shown to help with debugging and performance issues
HDInsight: Cloud-Managed Hadoop
What it is: Microsoft’s implementation of Apache Hadoop (as a service) that uses Blobs for persistent storage
When to use it:
• When you need to process large-scale data (PB+)
• When you want to use Hadoop or Spark as a service
• When you want to compute data and retire the servers, but retain the results
• When your team is familiar with the Hadoop zoo
Hadoop and HDInsight
Using the Hadoop Ecosystem to process and query data
Microsoft Azure Big Data Analytics
Cortana Intelligence Suite
HDInsight Tools for Visual Studio
Deploying HDInsight Clusters
• Cluster type: Hadoop, Spark, HBase or Storm
  • Hadoop clusters: for query and analysis workloads
  • HBase clusters: for NoSQL workloads
  • Spark clusters: for in-memory processing, interactive queries, stream, and machine learning workloads
• Operating system: Windows or Linux
• Can be deployed from the Azure portal, Azure Command Line Interface (CLI), Azure PowerShell, or Visual Studio
• A UI dashboard is provided for the cluster through Ambari
• Remote access through SSH, REST API, ODBC, JDBC
• Remote Desktop (RDP) access for Windows clusters
Azure Machine Learning
What it is: A multi-platform environment and engine to create and deploy machine learning models and APIs
When to use it:
• When you need to create predictive analytics
• When you need to share data science experiments across teams
• When you need to create callable APIs for ML functions
• When you also have R and Python experience on your data science team
Creating an Experiment
Create Workspace → Get/Prepare Data → Build/Edit Experiment → Create/Update Model → Evaluate Model Results → Deploy Model → Consume Model
Basic Azure ML Elements
Import Data
Preprocess
Algorithm
Train Model
Split Data
Score Model
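The elements above can be sketched end-to-end in plain Python: import, split, train, score, evaluate. A trivial 1-D least-squares line stands in for the algorithm; in Azure ML these steps are composed visually as modules. The data and metric below are invented for illustration:

```python
# Experiment flow sketch: Import -> Split -> Train -> Score -> Evaluate.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8)]  # Import Data (made up)

train, test = data[:4], data[4:]                           # Split Data

n = len(train)                                             # Train Model: least squares
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x, _ in train))
intercept = my - slope * mx

scores = [(x, slope * x + intercept) for x, _ in test]     # Score Model

mae = sum(abs(pred - y)                                    # Evaluate Model Results
          for (_, pred), (_, y) in zip(scores, test)) / len(test)
print(round(mae, 2))
```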
Power BI
What it is: Interactive report and visualization creation for computing and mobile platforms
When to use it:
• When you need to create and view interactive reports that combine multiple datasets
• When you need to embed reporting into an application
• When you need customizable visualizations
• When you need to create shared datasets, reports, and dashboards that you publish to your team
Microsoft Azure Big Data Analytics
Cortana Intelligence Suite
Common architectural patterns
Big Data Analytics – Data Flow
[Diagram] Data from business apps, custom apps, and sensors and devices is ingested (bulk ingestion and event ingestion) into Azure Data Lake Store; prepared and analyzed with HDInsight, Data Lake Analytics, and machine learning; made discoverable through Azure Data Catalog; and visualized in Power BI to drive intelligence and action for people.
Event Ingestion Patterns
[Diagram] Events from business apps, custom apps, and sensors and devices flow into event collection (Azure Event Hubs, Kafka) and on to stream processing (Azure Stream Analytics, Spark Streaming); raw events land in Azure Data Lake Store, while transformed data feeds real-time dashboards in Power BI.
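The hot/cold split in that pattern can be sketched in a few lines: raw events are kept untouched for the lake, while a tumbling-window aggregate (the kind of computation Stream Analytics or Spark Streaming would run) feeds a live dashboard. Event shape and window size are assumptions for illustration:

```python
# Event-ingestion sketch: cold path stores raw events, hot path computes
# a 10-second tumbling-window average per device. Illustrative only.
from collections import defaultdict

events = [  # (epoch_seconds, device_id, temperature) -- made-up telemetry
    (0, "dev1", 20.0), (5, "dev1", 22.0),
    (12, "dev2", 30.0), (14, "dev1", 24.0),
]

raw_store = list(events)                 # cold path: every raw event, as-is

window = defaultdict(list)               # hot path: 10-second tumbling windows
for ts, dev, temp in events:
    window[(ts // 10, dev)].append(temp)

dashboard = {k: sum(v) / len(v) for k, v in window.items()}
print(dashboard)  # {(0, 'dev1'): 21.0, (1, 'dev2'): 30.0, (1, 'dev1'): 24.0}
```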
Bulk Ingestion and Preparation
[Diagram] Azure Data Factory bulk-loads raw data from business apps, custom apps, and sensors and devices into Azure Data Lake Store. Data preparation yields prepared data: unstructured, for batch and interactive analytics with Spark on HDInsight and notebooks, and structured, loaded into Azure SQL DW — all cataloged in Azure Data Catalog and surfaced through Power BI. The stages: data collection, queuing, data storage, data transformation, presentation and action.
Big Data Lambda Architecture
[Diagram] Event and data producers (applications, web and social, devices, sensors) send events through cloud gateways (web APIs) and field gateways into Event Hubs or Kafka/RabbitMQ/ActiveMQ. The speed layer (Storm / Stream Analytics) feeds live dashboards and devices that take action; the batch layer (Hive / U-SQL, Pig, Data Factory, Azure ML) lands data in stores such as DocumentDB, MongoDB, SQL Azure, ADW, HBase, and Blob Storage, which serve Azure Search, data analytics tools (Excel, Power BI, Looker, Tableau), and web/thick-client dashboards.
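The lambda architecture's core idea — merge an accurate-but-stale batch view with a fast view of recent events at query time — fits in a short sketch. The page-count data is invented for illustration:

```python
# Lambda-architecture sketch: batch view + speed view, merged by the
# serving layer at query time. Illustrative data and names only.
batch_view = {"pageA": 100, "pageB": 40}   # batch layer: complete, but stale
speed_view = {"pageA": 3, "pageC": 1}      # speed layer: recent, not yet batched

def serve(key):
    # serving layer: batch result plus the real-time delta
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(serve("pageA"), serve("pageB"), serve("pageC"))  # 103 40 1
```

When the next batch run completes, its output replaces `batch_view` and the overlapping entries are dropped from `speed_view`, so the merged answer stays consistent.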
Get started today!
http://aka.ms/cisolutions
Cortana Intelligence Solutions
Cortana Intelligence Solutions: Try
Cortana Intelligence Solutions: Deploy
Instructions and Next Steps: Customize