Azure Data Lake Analytics Deep Dive
TRANSCRIPT
![Page 1: Azure Data Lake Analytics Deep Dive](https://reader034.vdocuments.net/reader034/viewer/2022052405/5871644e1a28ab58758b5077/html5/thumbnails/1.jpg)
Ilyas F
Azure Solution Architect @ 8KMiles
Twitter: @ilyas_tweets
LinkedIn: https://in.linkedin.com/in/ilyasf
Azure Data Lake Analytics Deep Dive
2016/05/17
Agenda
Origins
• Cosmos
• Futures
Layers & Components
• Storage
• Parallelization
• Job Scheduling
• Query Execution
• Performance
• Demo
Quick Recap
The 3 Azure Data Lake Services
• HDInsight: clusters as a service
• Analytics: big data queries as a service
• Store: hyper-scale storage optimized for analytics
Currently in PREVIEW. General Availability later in 2016.
U-SQL: A new language for Big Data
• Familiar syntax to millions of SQL & .NET developers
• Unifies the declarative nature of SQL with the imperative power of C#
• Unifies structured, semi-structured, and unstructured data
• Distributed query support over all data
History
Bing needed to…
• Understand user behavior
And do it…
• At massive scale
• With agility and speed
• At low cost
So they built… Cosmos.

Cosmos supports batch jobs, interactive queries, machine learning, and streaming, and is used by thousands of developers.
Pricing
Key ADL Analytics Components
ADL Account Configuration
An ADL Analytics Account has:
• Links to ADL Store accounts (one of them is the default)
• Links to Azure Blob Stores
• A Job Queue
• A U-SQL Catalog (metadata and data)

An ADL Store IS REQUIRED for ADL Analytics to function.

Key settings (defaults):
• Max Concurrent Jobs = 3
• Max ADLAUs per Job = 20
• Max Queue Length = 200
If you want to change the defaults, open a Support ticket.
Simplified Workflow
Job submission: Job Front End, Compiler Service, Job Queue, Job Scheduler
Job execution: Job Manager, YARN, U-SQL Runtime (vertex execution)
The U-SQL Catalog supplies metadata to the pipeline.
Goal: Understanding a U-SQL (Batch) Job
Azure Data Lake Analytics (ADLA) Demo
Job Properties
Job Graph
Job Scheduling: States, Queue, Priority
Job Status in Visual Studio
UX job states map to internal states:
• Preparing (internal: New, Compiling): the script is being compiled by the Compiler Service.
• Queued (internal: Queued, Scheduling): all jobs enter the queue. Are there enough ADLAUs to start the job? If yes, those ADLAUs are allocated for the job.
• Running and Finalizing (internal: Starting, Running): the U-SQL runtime is executing the code on one or more ADLAUs, or finalizing the outputs.
• Ended (Succeeded, Failed, Cancelled) (internal: Ended): the job has concluded.
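The progression can be modelled as an ordered list of UX states (a toy sketch of the states above, not the service's actual implementation):

```python
# UX job states in the order a job moves through them; 'Ended' is terminal.
UX_ORDER = ["Preparing", "Queued", "Running", "Finalizing", "Ended"]

def next_state(state: str) -> str:
    """Advance to the next UX state; a job that has ended stays ended."""
    i = UX_ORDER.index(state)
    return UX_ORDER[min(i + 1, len(UX_ORDER) - 1)]

state = "Preparing"
while state != "Ended":
    state = next_state(state)
print(state)  # every job eventually concludes in 'Ended'
```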
Why does a Job get Queued?
Local cause:
• The queue is already at Max Concurrency.
Global cause (very rare):
• System-wide shortage of ADLAUs
• System-wide shortage of bandwidth
* If the global conditions are met, a job will be queued even if the queue is not at its Max Concurrency.
State History
The Job Queue
The queue is ordered by job priority: lower numbers mean higher priority (1 = highest).
When a job is at the top of the queue, it will start running alongside the other running jobs.
Defaults: Max Running Jobs = 3, Max Tokens (ADLAUs) per job = 20, Max Queue Size = 200.
Priority Doesn't Preempt Running Jobs
Jobs A, B, and C are all running and have very low priority (pri = 1000). A new job X arrives with pri = 1. X will NOT preempt the running jobs; X will have to wait.
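The queue behaviour can be sketched as a toy scheduler (hypothetical Python; `JobQueue`, `submit`, and `finish` are illustrative names, not an ADLA API). Priority decides who leaves the queue next, but never evicts a running job:

```python
import heapq

class JobQueue:
    """Toy model of the ADLA job queue: priority-ordered, non-preemptive.

    max_running=3 mirrors the default Max Concurrent Jobs. Lower numbers
    mean higher priority (1 = highest). Illustration only.
    """
    def __init__(self, max_running=3):
        self.max_running = max_running
        self.running = []   # jobs currently holding ADLAUs
        self.queue = []     # (priority, seq, name) min-heap
        self._seq = 0       # tie-breaker: FIFO among equal priorities

    def submit(self, name, priority):
        heapq.heappush(self.queue, (priority, self._seq, name))
        self._seq += 1
        self._try_start()

    def _try_start(self):
        # Start queued jobs only while there is a free running slot.
        while self.queue and len(self.running) < self.max_running:
            _, _, name = heapq.heappop(self.queue)
            self.running.append(name)

    def finish(self, name):
        self.running.remove(name)
        self._try_start()

q = JobQueue(max_running=3)
for job in ["A", "B", "C"]:
    q.submit(job, priority=1000)   # three low-priority jobs start running
q.submit("X", priority=1)          # high priority, but no preemption
print(q.running)                   # ['A', 'B', 'C'] -- X waits
q.finish("A")
print(q.running)                   # X starts only after a slot frees up
```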
U-SQL Job Compilation
U-SQL Compilation Process
The Compiler & Optimizer, consulting the U-SQL Metadata Service, produces the compilation output in the job folder:
• C# code, compiled into a managed DLL
• C++ code, compiled into an unmanaged DLL
• The algebra
• Other files (system files, deployed resources)
These outputs are deployed to the vertices.
The Job Folder
Inside the default ADL Store:
/system/jobservice/jobs/Usql/YYYY/MM/DD/hh/mm/JOBID
For example:
/system/jobservice/jobs/Usql/2016/01/20/00/00/17972fc2-4737-48f7-81fb-49af9a784f64
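Given that layout, a job's folder can be computed from its submission time and job ID; a small illustrative helper (not part of any official SDK):

```python
from datetime import datetime, timezone

def job_folder(job_id: str, submitted: datetime) -> str:
    """Build the job-folder path inside the default ADL Store, following
    the /system/jobservice/jobs/Usql/YYYY/MM/DD/hh/mm/JOBID layout."""
    return submitted.strftime("/system/jobservice/jobs/Usql/%Y/%m/%d/%H/%M/") + job_id

# Reproduce the example path from the slide above.
path = job_folder("17972fc2-4737-48f7-81fb-49af9a784f64",
                  datetime(2016, 1, 20, 0, 0, tzinfo=timezone.utc))
print(path)
```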
Job Folder Contents
• C# code generated by the U-SQL Compiler
• C++ code generated by the U-SQL Compiler
• Cluster plan, a.k.a. "Job Graph", generated by the U-SQL Compiler
• User-provided .NET assemblies
• User-provided U-SQL script
Resources
Blue items: the output of the compiler
Grey items: U-SQL runtime bits
You can download all the resources, or a specific resource.
Query Execution: Plans, Vertices, Stages, Parallelism, ADLAUs
Query Life
A query passes through: Visual Studio or the Portal / API (submission), the Front-End Service, the Job Scheduler & Queue, the Compiler and Optimizer, Vertex Scheduling, and the Runtime.
How does the Parallelism number relate to Vertices?
What does "Vertices" mean?
What is this?
Logical → Physical Plan
Each square ("a vertex") represents a fraction of the total work.
Vertices in each SuperVertex (a.k.a. "Stage") perform the same operation on different parts of the same data.
Vertices in a later stage may depend on vertices in an earlier stage.
Stage Details
• 252 pieces of work (vertices)
• Average vertex execution time
• 4.3 billion rows
• Data read & written
Automatic Vertex Retry
A vertex failed, but was retried automatically, and the overall stage completed successfully.
A vertex might fail because:
• A router is congested
• Hardware failed (e.g. a hard drive died)
• The VM had to be rebooted
The U-SQL job will automatically schedule the vertex on another VM.
ADLAUs: Azure Data Lake Analytics Units
Parallelism N = N ADLAUs
1 ADLAU ≈ a VM with 2 cores and 6 GB of memory
Efficiency: Cost vs. Latency
Profile isn’t loaded
Profile is loaded now
Click Resource usage
Blue: Allocation
Red: Actual running
Smallest estimated time when given 2425 ADLAUs:
1410 seconds = 23.5 minutes
Model with 100 ADLAUs:
8709 seconds ≈ 145.2 minutes
JobCost = 5¢ + (minutes × ADLAUs × ADLAU cost per minute)
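The trade-off between the two modelled runs above can be made concrete with the cost formula. A minimal sketch in Python; the per-minute ADLAU rate is a placeholder, since the slide does not state the actual price:

```python
def job_cost(minutes: float, adlaus: int, adlau_cost_per_min: float,
             base_fee: float = 0.05) -> float:
    """JobCost = 5c + (minutes x ADLAUs x ADLAU cost per minute)."""
    return base_fee + minutes * adlaus * adlau_cost_per_min

# The two modelled runs from the surrounding slides, priced with a
# placeholder rate of $0.01 per ADLAU-minute (illustrative only):
fast = job_cost(minutes=1410 / 60, adlaus=2425, adlau_cost_per_min=0.01)
slow = job_cost(minutes=8709 / 60, adlaus=100, adlau_cost_per_min=0.01)
print(f"2425 ADLAUs: {1410 / 60:6.1f} min, ${fast:8.2f}")
print(f" 100 ADLAUs: {8709 / 60:6.1f} min, ${slow:8.2f}")
```

With these numbers the 100-ADLAU run is roughly six times slower but roughly four times cheaper, which is exactly the cost-versus-latency trade-off the surrounding slides illustrate.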
Allocation
Allocating 10 ADLAUs for a 10-minute job:
Cost = 10 min × 10 ADLAUs = 100 ADLAU-minutes
On the chart, the blue line shows the allocated ADLAUs over time.
Over Allocation: Consider Using Fewer ADLAUs
You are paying for the area under the blue line (allocated ADLAUs), but only using the area under the red line (actual usage).
Vertex Execution
Store Basics
A very big file is split apart into Extents.
• Extents can be up to 250 MB in size.
• For availability and reliability, extents are replicated (3 copies).
• Splitting into extents enables parallelized reads.
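A quick sketch of why bigger files parallelize better, assuming the 250 MB maximum extent size above (illustrative Python, not the store's actual placement logic):

```python
import math

EXTENT_SIZE = 250 * 1024 * 1024   # extents can be up to 250 MB

def extent_count(file_size_bytes: int) -> int:
    """Number of extents a file splits into: the upper bound on how many
    vertices can read the file in parallel."""
    return max(1, math.ceil(file_size_bytes / EXTENT_SIZE))

for gb in (1, 10, 100):
    print(f"{gb:>3} GB file -> {extent_count(gb * 1024**3)} extents")
```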
Parallel Writing
Front-end machines for a web service upload their log files simultaneously to Azure Data Lake.
As file size increases, there are more extents, and therefore more opportunities for parallelism: each extent can be read by its own vertex.
The importance of partitioning input data
Search engine clicks data set
A log of how many clicks a certain domain got within a session:

| SessionID | Domain | Clicks |
| --- | --- | --- |
| 3 | cnn.com | 9 |
| 1 | whitehouse.gov | 14 |
| 2 | facebook.com | 8 |
| 3 | reddit.com | 78 |
| 2 | microsoft.com | 1 |
| 1 | facebook.com | 5 |
| 3 | microsoft.com | 11 |
Data Partitioning Compared
File: keys (Domain) are scattered across the extents; Extents 1, 2, and 3 each hold a mix of FB, WH, and CNN rows.
U-SQL table partitioned on Domain: the keys are now "close together" (for example, FB in Extent 1, WH in Extent 2, CNN in Extent 3), and the index tells U-SQL exactly which extents contain a given key.
How did we create and fill that table?

CREATE TABLE SampleDBTutorials.dbo.ClickData
(
    SessionId int,
    Domain string,
    Clicks int,
    INDEX idx1                       // Name of index
    CLUSTERED (Domain ASC)           // Column to cluster by
    // PARTITIONED BY HASH (Region)  // Column to partition by
);

INSERT INTO SampleDBTutorials.dbo.ClickData
SELECT *
FROM @clickdata;
Find all the rows for cnn.com

// Using a file
@ClickData =
    EXTRACT SessionId int,
            Domain string,
            Clicks int
    FROM "/clickdata.tsv"
    USING Extractors.Tsv();

@rows = SELECT * FROM @ClickData WHERE Domain == "cnn.com";

OUTPUT @rows TO "/output.tsv" USING Outputters.Tsv();

// Using a U-SQL table partitioned by Domain
@ClickData = SELECT * FROM MyDB.dbo.ClickData;

@rows = SELECT * FROM @ClickData WHERE Domain == "cnn.com";

OUTPUT @rows TO "/output.tsv" USING Outputters.Tsv();
File: because "cnn.com" could be anywhere, all extents must be read. Extents 1, 2, and 3 (each a mix of CNN, FB, and WH rows) each get their own Read → Filter → Write.
U-SQL table partitioned by Domain: thanks to "Partition Elimination", the job only reads from the extent that is known to have the relevant key, so a single Read → Filter → Write handles the CNN extent.
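A toy model of partition elimination (illustrative Python; representing each extent as the set of keys it may contain is an assumption made for the sketch):

```python
# Unpartitioned file: every extent may contain every key.
file_extents = [
    {"keys": {"cnn.com", "facebook.com", "whitehouse.gov"}},
    {"keys": {"cnn.com", "facebook.com", "whitehouse.gov"}},
    {"keys": {"cnn.com", "facebook.com", "whitehouse.gov"}},
]
# Table partitioned on Domain: one key per extent.
table_extents = [
    {"keys": {"facebook.com"}},
    {"keys": {"whitehouse.gov"}},
    {"keys": {"cnn.com"}},
]

def extents_to_read(extents, key):
    """An extent can be skipped only if the index proves the key is absent."""
    return [e for e in extents if key in e["keys"]]

print(len(extents_to_read(file_extents, "cnn.com")))   # 3 -- must scan all
print(len(extents_to_read(table_extents, "cnn.com")))  # 1 -- elimination
```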
How many clicks per domain?

@rows =
    SELECT Domain,
           SUM(Clicks) AS TotalClicks
    FROM @ClickData
    GROUP BY Domain;
File (each extent a mix of CNN, FB, and WH rows): every extent is read and partially aggregated; the partial results are then partitioned by key, and a full aggregate and write run per key. The partition (shuffle) step is expensive!
U-SQL table partitioned by Domain (FB in Extent 1, WH in Extent 2, CNN in Extent 3): each extent is read, fully aggregated, and written independently; no repartitioning is needed.
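The partial-then-full aggregation on the file side can be sketched in Python (an illustration of the two-phase pattern, not the U-SQL runtime), using rows shaped like the sample click data:

```python
from collections import Counter

# Rows as (domain, clicks); keys are mixed across extents, as in a raw file.
extents = [
    [("cnn.com", 9), ("facebook.com", 8), ("whitehouse.gov", 14)],
    [("reddit.com", 78), ("microsoft.com", 1)],
    [("facebook.com", 5), ("microsoft.com", 11)],
]

def partial_agg(rows):
    """Phase 1: one vertex per extent sums clicks within its own extent."""
    c = Counter()
    for domain, clicks in rows:
        c[domain] += clicks
    return c

# Phase 2: the partial sums are partitioned by key and combined per domain.
partials = [partial_agg(rows) for rows in extents]
totals = sum(partials, Counter())
print(totals["facebook.com"])  # 13: combined across two extents
```

With the table partitioned by domain, phase 2 disappears: each extent's partial sum is already the full sum for its key.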
High-Level Performance Advice
• Learn U-SQL: leverage native U-SQL constructs first.
• UDOs are evil: the optimizer can't optimize UDOs the way it can pure U-SQL code.
• Understand your data: volume, distribution, partitioning, growth.
Questions?