02 data lake - wug Česká republika
![Page 2: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/2.jpg)
Key Message
With the increase of computing power, electronic devices and accessibility to the Internet, more data than ever is being produced, collected and transmitted.
Organizations have recognized the power of data analysis, but are struggling to manage the massive amounts of information they have.
Facebook collects 250 TB a day
Data production a day: ±10 ZB
35-40 ZB expected in 2020
Currently 0.5% used for analysis
![Page 3: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/3.jpg)
![Page 4: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/4.jpg)
![Page 5: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/5.jpg)
Introducing Data Lakes
![Page 6: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/6.jpg)
Two approaches to information management for analytics
Bottom-up (inductive)
Observation
Pattern
Theory
Hypothesis
What will happen?
How can we make it happen?
Predictive analytics
Prescriptive analytics
What happened?
Why did it happen?
Descriptive analytics
INFORMATION
Diagnostic analytics
OPTIMIZATION
Top-down (deductive)
Confirmation
Theory
Hypothesis
Observation
![Page 7: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/7.jpg)
The data lake uses a bottom-up approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis using analytic engines like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
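The three bottom-up steps on this slide (ingest everything, store it in native format, analyze later) can be illustrated with a minimal Python sketch. It is only an analogue of the idea, not anything taken from Azure Data Lake itself: the sample events, the `read_with_schema` helper and the `temperature_schema` mapping are all hypothetical names made up for this example.

```python
import json

# Hypothetical raw events, kept exactly as they arrived
# (native format, no schema enforced at ingestion time).
RAW_EVENTS = [
    '{"device": "sensor-1", "temp": "21.5", "ts": "2017-03-01T10:00:00"}',
    '{"device": "sensor-2", "humidity": 40}',
    '{"device": "sensor-1", "temp": "22.0", "ts": "2017-03-01T10:05:00"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    Keeps only records that contain every requested field and converts
    each field with the type given in the schema.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(name in record for name in schema):
            yield {name: cast(record[name]) for name, cast in schema.items()}


# The analysis question decides the schema, not the ingestion step.
temperature_schema = {"device": str, "temp": float}
readings = list(read_with_schema(RAW_EVENTS, temperature_schema))
average_temp = sum(r["temp"] for r in readings) / len(readings)
```

The humidity-only event stays in the raw store untouched; it is simply skipped by this particular read and remains available to a later analysis that asks a different question.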
![Page 8: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/8.jpg)
Data warehousing uses a top-down approach
Implement data warehouse
Physical design
ETL development
Reporting and analytics development
Install and tune
Reporting and analytics design
Dimension modeling
ETL design
Set up infrastructure
Understand corporate strategy
Data sources
Gather requirements
Business requirements
Technical requirements
![Page 9: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/9.jpg)
Options for Big Data in Azure
HDInsight: Azure-managed VM cluster running Hadoop
Data Lake: Big Data as a Service
Virtual Machines: User-managed VM cluster running Hadoop
Clusters
Serverless
Data Lake Analytics
Data Lake Storage Gen1
![Page 10: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/10.jpg)
Challenges involved in implementing a data lake
Performance and Scale
• Storage bottlenecks
• IoT sources – small writes
• Price-performance
• Data grows independently
Security
• Compliance challenges
• Effectively control access
• Corporate policies
Data Silos
• Data spans sources
• Inefficiency in colocation
Analytics
• Open interfaces to data
• Variety of analytics tools
![Page 11: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/11.jpg)
Azure Data Lake (ADL)
Analytics
Storage
Azure Data Lake Analytics
Azure Data Lake Store
![Page 12: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/12.jpg)
Azure Data Lake demystified
Azure Data Lake
Data Lake Storage Gen1: Storage optimized for analytics
Data Lake Analytics: Analytics job service
• Petabyte size files and trillions of objects
• Provide I/O capacity to bandwidth-hungry apps like HDI, Cloudera and Hortonworks
• Encryption on disk
• File & Folder ACLs
• Start in seconds, scale instantly, pay per job
• Develop massively parallel programs with simplicity
• Leverage open source (Hadoop, Spark, Python, R…)
• Debug & optimize big data programs with ease
• Virtualize your analytics
• Always encrypted
• Integrated Azure Active Directory
• Role-based access control
• Enterprise-grade support
![Page 13: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/13.jpg)
Built on open source
Analytics
Storage
YARN
ADL Analytics ADL HDInsight
WebHDFS
ADL Store
Hive
U-SQL
![Page 14: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/14.jpg)
Data lakes in Azure Cloud
Hadoop cluster
HDFS/WebHDFS API
Azure Data Lake
WebHDFS API
Azure HDInsight
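Because both a Hadoop cluster and Azure Data Lake are shown behind a WebHDFS API, the same style of REST call can address either one. The sketch below only builds request URLs in the standard WebHDFS layout (`/webhdfs/v1/<path>?op=<OPERATION>`) and sends nothing over the network. The `webhdfs_url` helper and the host name `example-lake.invalid` are assumptions made for illustration; a real call would also need authentication, which is not shown.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, scheme="https", **params):
    """Build a WebHDFS REST URL: <scheme>://<host>/webhdfs/v1/<path>?op=<OP>&..."""
    query = urlencode({"op": op, **params})
    return f"{scheme}://{host}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


# Hypothetical host name; substitute the endpoint of your own cluster or store.
HOST = "example-lake.invalid"

# List a folder, then open a file starting at byte offset 0.
list_url = webhdfs_url(HOST, "/input", "LISTSTATUS")
open_url = webhdfs_url(HOST, "/input/orders.txt", "OPEN", offset=0)
```

Keeping URL construction in one helper means that switching between a cluster and a managed store is, in this sketch, only a change of host.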
![Page 15: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/15.jpg)
Bigger picture
• Cortana Analytics is a fully managed big data and advanced analytics suite that enables you to transform your data into intelligent action
Business scenarios: Recommendations, customer churn, forecasting
Perceptual intelligence: Face, vision, speech, text
Personal digital assistant
Cortana
Dashboards and visualizations
Power BI
Machine Learning and Analytics
Azure Machine Learning
Azure Stream Analytics
DATA
Business apps
Custom apps
Sensors and devices
INTELLIGENCE ACTION
People
Automated systems
Big data stores
Azure SQL Data Warehouse
Information management
Azure Data Factory
Azure Data Catalog
Azure Event Hub
Azure Data Lake Store
Azure HDInsight (Hadoop)
Azure Data Lake Analytics
![Page 16: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/16.jpg)
Big data made easy
![Page 17: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/17.jpg)
Understanding ADL Store
![Page 18: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/18.jpg)
What is Azure Data Lake (ADL) Store?
ADL Store HDInsight
ADL Analytics
Machine Learning
Spark
R
Devices
![Page 19: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/19.jpg)
Comparing ADLS with Azure Blob Storage
Azure Data Lake Store: Optimized for analytics
Azure Blob storage: Bulk storage of files, cold data storage
![Page 20: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/20.jpg)
How do you start using ADL?
ADL Analytics
Data
Azure Storage blobs
ADL Store
… and so on
Log in to the Azure portal
Create an ADL Analytics account (90 seconds, free)
Write a U-SQL script and submit it to the ADL Analytics account
The U-SQL job reads and writes data
![Page 21: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/21.jpg)
DEMO
Azure Data Lake Store
![Page 22: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/22.jpg)
Understanding ADLA
![Page 23: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/23.jpg)
Azure Data Lake Analytics service
§ Built on Apache YARN
§ Scales dynamically with the turn of a dial
§ Pay by the query
§ Supports Azure Active Directory for access control, roles, and integration with on-premises identity systems
§ Built with U-SQL to unify the benefits of SQL with the power of C#
§ Processes data across Azure
A new distributed analytics service
![Page 24: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/24.jpg)
Key Benefits of ADLA
Supports Azure Active Directory
Includes U-SQL
Processes data across multiple Azure data sources
![Page 25: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/25.jpg)
Simplified management and administration
§ Web-based management in Azure portal
§ Automate tasks using PowerShell
§ Role-based Access Control with Azure Active Directory
§ Monitor service operations and activity
![Page 26: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/26.jpg)
DEMO
Azure Data Lake Analytics
![Page 27: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/27.jpg)
What is U-SQL?
A hyper-scalable, highly extensible language for preparing, transforming, and analyzing all data
Allows users to focus on the what—not the how—of business problems
Built on familiar languages (SQL and C#, SCOPE) and supported by a fully integrated development environment
Built for data developers and scientists
![Page 28: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/28.jpg)
U-SQL fundamentals
All the familiar SQL clauses
SELECT | FROM | WHERE
GROUP BY | JOIN | OVER
Operate on unstructured and structured data
Relational metadata objects
![Page 29: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/29.jpg)
U-SQL queries: General pattern
Read | Process | Store
Read: EXTRACT or SELECT produces a RowSet
Process: SELECT… FROM… WHERE…
Store: OUTPUT or INSERT writes a RowSet
Sources and targets: Azure Data Lake | Azure Storage blobs | Azure SQL DB
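The read, process, store pattern on this slide can be mimicked in plain Python to make the flow of rowsets concrete. This is a rough analogue under stated assumptions, not U-SQL: the `extract`, `process` and `store` functions, the two-column layout and the sample data are all invented for the illustration.

```python
import csv
import io


def extract(text):
    """Read step: turn tab-separated text into a rowset (list of dicts)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [{"customer": c, "amount": float(a)} for c, a in reader]


def process(rows, min_amount):
    """Process step: filter rows, then total the amount per customer."""
    totals = {}
    for row in rows:
        if row["amount"] >= min_amount:
            totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return [{"customer": c, "total": t} for c, t in sorted(totals.items())]


def store(rows):
    """Store step: write the rowset back out as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row["customer"], row["total"]])
    return buffer.getvalue()


# Hypothetical input: customer name and order amount per line.
SAMPLE = "alice\t10.0\nbob\t2.5\nalice\t5.0\n"
result = store(process(extract(SAMPLE), min_amount=3.0))
```

Each step takes a rowset and hands one on, so the filter-and-aggregate stage in the middle can change without touching how data is read or written.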
![Page 30: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/30.jpg)
.NET integration and extensibility
U-SQL expressions are full C# expressions
Reuse .NET code in your own assemblies
Use C# to define your own:
Types | Functions | Joins | Aggregators | IO (Extractors, Outputters)
U-SQL | .NET
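In U-SQL these extensions are written in C#. The Python sketch below only illustrates the general idea of plugging a user-defined function and a user-defined aggregator into a grouped query. The names `normalize_customer`, `Range` and `group_aggregate` are hypothetical and do not correspond to any U-SQL or .NET API.

```python
from collections import defaultdict


def normalize_customer(name):
    """User-defined scalar function: tidy up a customer name."""
    return name.strip().title()


class Range:
    """User-defined aggregator: spread between largest and smallest value."""

    def __init__(self):
        self.low = None
        self.high = None

    def accumulate(self, value):
        self.low = value if self.low is None else min(self.low, value)
        self.high = value if self.high is None else max(self.high, value)

    def terminate(self):
        return self.high - self.low


def group_aggregate(rows, key, value, aggregator_cls):
    """Group rows by key(row) and feed value(row) into one aggregator per group."""
    groups = defaultdict(aggregator_cls)
    for row in rows:
        groups[key(row)].accumulate(value(row))
    return {k: agg.terminate() for k, agg in groups.items()}


# Hypothetical (customer, amount) rows with inconsistent name formatting.
rows = [(" alice ", 10.0), ("ALICE", 4.0), ("bob", 7.0)]
spread = group_aggregate(
    rows,
    key=lambda r: normalize_customer(r[0]),
    value=lambda r: r[1],
    aggregator_cls=Range,
)
```

The grouping code knows nothing about what the function or the aggregator does, which is the same separation that lets custom code slot into a declarative query.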
![Page 31: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/31.jpg)
Query across Azure data sources
READ WRITE
Azure Data Lake Store
Azure Storage blobs
Azure SQL database
Azure SQL data warehouse
![Page 32: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/32.jpg)
Visual Studio and U-SQL integration
Visualize and replay progress of job
Fine-tune query performance
Visualize physical plan of U-SQL query
Browse metadata catalog
Author U-SQL scripts (with C# code)
Create metadata objects
Submit and cancel U-SQL jobs
Debug U-SQL and C# code
![Page 33: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/33.jpg)
@rows =
    EXTRACT OrderId int, Customer string, Date DateTime, Amount float
    FROM "/input/orders.txt"
    USING Extractors.Tsv();

OUTPUT @rows
TO "adl://mylake/orders_copy.txt"
USING Outputters.Tsv();
Apply schema on read
From a file in a data lake
Easy delimited text handling
Write out
Rowset
Read the input, write it directly to output (just a simple copy)
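For readers without an ADL Analytics account, the same copy can be approximated locally in Python. This is a sketch under assumptions rather than an equivalent of the U-SQL runtime: the column names follow the script on this slide, while the `SCHEMA` list, the `extract_tsv` and `output_tsv` helpers, the ISO date format and the sample rows are hypothetical.

```python
import csv
import io
from datetime import datetime

# Schema applied on read: column name plus a converter for each field.
SCHEMA = [
    ("OrderId", int),
    ("Customer", str),
    ("Date", datetime.fromisoformat),
    ("Amount", float),
]


def extract_tsv(text, schema):
    """Parse tab-separated text into typed rows, failing on malformed lines."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(fields) != len(schema):
            raise ValueError(f"expected {len(schema)} columns, got {len(fields)}")
        rows.append({name: cast(raw) for (name, cast), raw in zip(schema, fields)})
    return rows


def output_tsv(rows, schema):
    """Write typed rows back out as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        values = [row[name] for name, _ in schema]
        # Dates are written in ISO format so the copy matches the input.
        writer.writerow(
            [v.isoformat() if isinstance(v, datetime) else v for v in values]
        )
    return buffer.getvalue()


# Hypothetical sample standing in for /input/orders.txt.
SAMPLE = "1\tContoso\t2017-03-01T00:00:00\t19.99\n2\tFabrikam\t2017-03-02T12:30:00\t5.0\n"
orders = extract_tsv(SAMPLE, SCHEMA)
copy = output_tsv(orders, SCHEMA)
```

The input text carries no type information; the types exist only because the reader supplies a schema, which is the schema-on-read behavior the slide points out.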
![Page 34: 02 data lake - WUG Česká republika](https://reader031.vdocuments.net/reader031/viewer/2022030101/621c5b0e68facf121032f65b/html5/thumbnails/34.jpg)
Follow up
§ Course 20775A: Performing Data Engineering on Microsoft HDInsight
§ Course 20774A: Perform Cloud Data Science with Azure Machine Learning
§ Course 20776A: Performing Big Data Engineering on Microsoft Cloud Services