introducing u-sql (sqlpass 2016)

Introducing U-SQLA Language that Simplifies Big

Data Processing

Michael Rys, Principal Program Manager, Microsoft

@MikeDoesBigData, [email protected]

Please silence cell phones

The Data Lake approachIngest all data regardless of requirements

Store all data in native format without schema definition

Do analysisUsing analytic engines like Hadoop and ADLA

Interactive queriesBatch queries

Machine LearningData warehouse

Real-time analytics

Devices

Introducing Azure Data LakeBig Data Made Easy

WebHDFS

YARN

U-SQL

ADL Analytics

ADL HDInsight

1

1

1

1

1

1 1

1

1

1

1

1

Store

HiveAnalytics

Storage

Azure Data Lake (Store, HDInsight, Analytics)

Teaser:Azure Data Lake and U-SQL at Scale

Session Objectives And TakeawaysSession Objective(s): • Introduce U-SQL: The Why and What• Show the philosophy of U-SQL• Demonstrate the power, scale and simplicity of U-SQL

Key Takeaways:• You understand why U-SQL is the best language for Big Data Processing• Understand how U-SQL scripts process data in a highly scalable way• You can use U-SQL to process unstructured and structured data• You can use U-SQL’s C# integration to extend your big data processing with custom-code• You can explain some main differences between U-SQL and T-SQL• You know what data sources can be joined in U-SQL• You can use VisualStudio’s ADL tooling to explore and analyze highly scaled out U-SQL jobs

Some sample use casesDigital Crime Unit – Analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks by analyzing log records with complex custom algorithmsImage Processing – Large-scale image feature extraction and classification using custom codeShopping Recommendation – Complex pattern analysis and prediction over shopping records using proprietary algorithms

Characteristics of Big Data Analytics•Requires processing of any type of data •Allow use of custom algorithms•Scale to any size and be efficient

Status Quo:SQL for Big Data

Declarativity does scaling and parallelization for you

Extensibility is bolted on and not “native” hard to work with anything other

than structured data difficult to extend with custom

code

Status Quo:Programming Languages for Big Data

Extensibility through custom code is “native”

Declarativity is bolted on and not “native” User often has to

care about scale and performance

SQL is 2nd class within string Often no code reuse/

sharing across queries

Why U-SQL? Declarativity and Extensibility are equally native to the language!

Get benefits of both!Makes it easy for you by unifying:• Unstructured and structured data

processing• Declarative SQL and custom imperative

Code (C#)• Local and remote Queries• Increase productivity and agility from Day

1 and at Day 100 for YOU!

The origins of U-SQL

U-SQLSCOPE

Hive

T-SQL/ ANSI SQL

SCOPE – Microsoft’s internal Big Data language• SQL and C# integration model• Optimization and Scaling model• Runs 100’000s of jobs daily

Hive• Complex data types (Maps, Arrays)• Data format alignment for text files

T-SQL/ANSI SQL• Many of the SQL capabilities (windowing functions,

meta data model etc.)

Query data where it lives Benefits• Avoid moving large amounts of data across the

network between stores• Single view of data irrespective of physical

location • Minimize data proliferation issues caused by

maintaining multiple copies• Single query language for all data• Each data store maintains its own sovereignty• Design choices based on the need• Push SQL expressions to remote SQL sources

• Projections• Filters• Joins

U-SQL Query

Query

Query

Query

WriteAzure Storage Blobs

Azure SQL in VMs

Azure SQL DB

Azure Data Lake Analytics

Query

Azure SQL Data Warehouse

Quer

y

Writ

e

Azure Data Lake Storage

Easily query data in multiple Azure data stores without moving it to a single store

Show me U-SQL!

https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis


12Expression-flow Programming Style

Automatic "in-lining" of U-SQL expressions – whole script leads to a single execution model.

Execution plan that is optimized out-of-the-box and w/o user intervention.

Per job and user driven level of parallelization.

Detail visibility into execution steps, for debugging.

Heatmap like functionality to identify performance bottlenecks.

• Schema on Read• Write to File• Built-in and custom

Extractors and Outputters• ADL Storage and Azure Blob

Storage

“Unstructured” Files EXTRACT Expression@s = EXTRACT a string, b int FROM "filepath/file.csv"USING Extractors.Csv(encoding: Encoding.Unicode);

• Built-in Extractors: Csv, Tsv, Text with lots of options• Custom Extractors: e.g., JSON, XML, etc. (see http://usql.io)

OUTPUT ExpressionOUTPUT @sTO "filepath/file.csv"USING Outputters.Csv();

• Built-in Outputters: Csv, Tsv, Text• Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io)

Filepath URIs• Relative URI to default ADL Storage account: "filepath/file.csv"

• Absolute URIs:• ADLS:

"adl://account.azuredatalakestore.net/filepath/file.csv"• WASB: "wasb://container@account/filepath/file.csv"

U-SQL extensibilityExtend U-SQL with C#/.NET

Built-in operators, function, aggregates

C# expressions (in SELECT expressions)

User-defined aggregates (UDAGGs)

User-defined functions (UDFs)

User-defined operators (UDOs)

Extending U-SQL!




Managing Assemblies

• CREATE ASSEMBLY db.assembly FROM @path;• CREATE ASSEMBLY db.assembly FROM byte[];

• Can also include additional resource files

• REFERENCE ASSEMBLY [account.]db.assembly;

• Referencing .Net Framework Assemblies• Always accessible system namespaces:

• U-SQL specific (e.g., for SQL.MAP)• All provided by system.dll system.core.dll

system.data.dll, System.Runtime.Serialization.dll, mscorelib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)

• Add all other .Net Framework Assemblies with:REFERENCE SYSTEM ASSEMBLY [System.XML];

• Enumerating Assemblies• Powershell command• U-SQL Studio Server Explorer

• DROP ASSEMBLY db.assembly;

Create assembliesReference assembliesEnumerate assembliesDrop assemblies

VisualStudio makes registration easy!

USING clause 'USING' csharp_namespace | Alias '=' csharp_namespace_or_class.

Examples: DECLARE @ input string = "somejsonfile.json";

REFERENCE ASSEMBLY [Newtonsoft.Json];REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

@data0 = EXTRACT IPAddresses string FROM @input USING new JsonExtractor("Devices[*]");

USING json = [Microsoft.Analytics.Samples.Formats.Json.JsonExtractor];

@data1 = EXTRACT IPAddresses string FROM @input USING new json("Devices[*]");

Allows shortening and disambiguating C# namespace and class names

Show me File Sets!




• Simple Patterns• Virtual Columns• Only on EXTRACT for now

(On OUTPUT by early 2017)

File Sets Simple pattern language on filename and path@pattern string = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";

• Binds two columns date and suffix• Wildcards the filename• Limits on number of files

(Current limit 800-3000 is increased in special preview)

Virtual columnsEXTRACT

name string, suffix string // virtual column, date DateTime // virtual column

FROM @patternUSING Extractors.Csv();

• Refer to virtual columns in query predicates to get partition elimination

• Warning gets raised if no partition elimination was found

Create shareable data and codehttps://github.com/Azure/usql/tree/master/Examples/TweetAnalysis



Meta Data Object ModelADLA Account/Catalog

Database

Schema

[1,n]

[1,n]

[0,n]

tables views TVFs

C# Fns C# UDAgg

ClusteredIndex

partitions

C# Assemblies

C# Extractors

Data Source

C# ReducersC# Processors

C# CombinersC# Outputters

Ext. tables

User objects

Refers toContains Implemented and named by

Procedures

Creden-tials

MD Name C# Name

C# Applier

Table Types

Legend

Statistics

C# UDTs

• Naming• Discovery• Sharing• Securing

U-SQL Catalog Naming• Default Database and Schema context: master.dbo• Quote identifiers with []: [my table]• Stores data in ADL Storage /catalog folder

Discovery• Visual Studio Server Explorer• Azure Data Lake Analytics Portal• SDKs and Azure Powershell commands

Sharing• Within an Azure Data Lake Analytics account• Across ADLA accounts that share same primary ADLS

accounts:• Referencing Assemblies• Calling TVFs and referencing tables and views

Securing• Secured with AAD principals at catalog and Database

level

• Views for simple cases• TVFs for parameterization

and most cases

VIEWs and TVFs

Views

CREATE VIEW V AS EXTRACT…CREATE VIEW V AS SELECT …

• Cannot contain user-defined objects (e.g. UDF or UDOs)!

• Will be inlined

Table-Valued Functions (TVFs)

CREATE FUNCTION F (@arg string = "default") RETURNS @res [TABLE ( … )] AS BEGIN … @res = … END;

• Provides parameterization• One or more results• Can contain multiple statements• Can contain user-code (needs assembly reference)• Will always be inlined • Infers schema or checks against specified return

schema

ProceduresCREATE PROCEDURE P (@arg string = "default“) ASBEGIN …; OUTPUT @res TO …; INSERT INTO T …;END;

• Provides parameterization• No result but writes into file or table• Can contain multiple statements• Can contain user-code (needs assembly reference)• Will always be inlined • Can contain DDL (but no CREATE, DROP

FUNCTION/PROCEDURE)

Allows encapsulation of U-SQL scripts

• Script variables for scalars• Overwritable defaulting• Constant compile-time vs

runtime expression evaluation

Script Variables & Parameters DECLARE @variable SqlArray<int> = new SqlArray<int>{1,2};DECLARE @variable = new SqlArray<int>{1,2};

• Provides named and typed scalar expressions• Option to infer the type of the scalar variable

DECLARE EXTERNAL @parameter = "string value";

• Provides overwriteable defaulting of a scalar variable• Allows external parameter models (e.g., Azure Data

Factory)

DECLARE CONST @const_expression = "my "+@parameter;

• Checks and guarantees that expression is evaluated at compile time, otherwise errors.

• CREATE TABLE• CREATE TABLE AS SELECT

Tables CREATE TABLE T (col1 int , col2 string , col3 SQL.MAP<string,string> , INDEX idx CLUSTERED (col2 ASC) PARTITION BY (col1) DISTRIBUTED BY HASH (driver_id) );

• Structured Data, built-in Data types only (no UDTs)• Clustered Index (needs to be specified): row-oriented• Fine-grained distribution (needs to be specified):

• HASH, DIRECT HASH, RANGE, ROUND ROBIN• Addressable Partitions (optional)

CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…;CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);

• Infer the schema from the query• Still requires index and distribution (does not

support partitioning)

When to use Tables Benefits of Table clustering and distribution• Faster lookup of data provided by distribution and

clustering when right distribution/cluster is chosen• Data distribution provides better localized scale out• Used for filters, joins and grouping

Benefits of Table partitioning• Provides data life cycle management (“expire” old

partitions)• Partial re-computation of data at partition level• Query predicates can provide partition elimination

Do not use when…• No filters, joins and grouping• No reuse of the data for future queries

If in doubt: use sampling (e.g., SAMPLE ANY(x)) and test.

• ALTER TABLE ADD/DROP COLUMN

Evolving TablesALTER TABLE T ADD COLUMN eventName string;

ALTER TABLE T DROP COLUMN col3;

ALTER TABLE T ADD COLUMN result string, clientId string, payload int?;

ALTER TABLE T DROP COLUMN clientId, result;

• Meta-data only operation• Existing rows will get

• Non-nullable types: C# data type default value (e.g., int will be 0)

• Nullable types: null

Let’s do some SQL with U-SQL!https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis




U-SQLJoins

Join operators

• INNER JOIN• LEFT or RIGHT or FULL OUTER JOIN• CROSS JOIN• SEMIJOIN

• equivalent to IN subquery• ANTISEMIJOIN

• Equivalent to NOT IN subqueryNotes

• ON clause comparisons need to be of the simple form: rowset.column == rowset.columnor AND conjunctions of the simple equality comparison

• If a comparand is not a column, wrap it into a column in a previous SELECT

• If the comparison operation is not ==, put it into the WHERE clause

• turn the join into a CROSS JOIN if no equality comparison

Reason: Syntax calls out which joins are efficient

U-SQLAnalytics

Windowing Expression

Window_Function_Call 'OVER' '(' [ Over_Partition_By_Clause ]

[ Order_By_Clause ] [ Row _Clause ]')'.

Window_Function_Call :=Aggregate_Function_Call

| Analytic_Function_Call| Ranking_Function_Call.

Windowing Aggregate Functions

ANY_VALUE, AVG, COUNT, MAX, MIN, SUM, STDEV, STDEVP, VAR, VARP

Analytics Functions

CUME_DIST, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK, LEAD, LAG

Ranking Functions

DENSE_RANK, NTILE, RANK, ROW_NUMBER

“Top 5”sSurprises for SQL Users

• AS is not as• C# keywords and SQL keywords

overlap• Costly to make case-insensitive ->

Better build capabilities than tinker with syntax

• = != ==• Remember: C# expression language

• null IS NOT NULL• C# nulls are two-valued

• PROCEDURES but no WHILE• No UPDATE, DELETE, nor MERGE

Show me U-SQL UDOs!

https://github.com/Azure/usql/tree/master/Examples/ImageApp

https://blogs.msdn.microsoft.com/azuredatalake/2016/06/27/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos/

Start Time - End Time - User Name5:00 AM - 6:00 AM - ABC5:00 AM - 6:00 AM - XYZ8:00 AM - 9:00 AM - ABC8:00 AM - 10:00 AM - ABC10:00 AM - 2:00 PM - ABC7:00 AM - 11:00 AM - ABC9:00 AM - 11:00 AM - ABC11:00 AM - 11:30 AM - ABC11:40 PM - 11:59 PM - FOO11:50 PM - 0:40 AM - FOO

Start Time - End Time - User Name5:00 AM - 6:00 AM - ABC5:00 AM - 6:00 AM - XYZ7:00 AM - 2:00 PM - ABC11:40 PM - 0:40 AM - FOO

Copyright

Camera Make

Camera Model

Thumbnail

Michael Canon 70D

Michael Samsung S7

https://github.com/Azure/usql/tree/master/Examples/ImageApp



User-Defined ExtractorsUser-Defined Outputters User-Defined Processors

• Take one row and produce one row• Pass-through versus transforming

User-Defined Appliers• Take one row and produce 0 to n rows• Used with OUTER/CROSS APPLY

User-Defined Combiners• Combines rowsets (like a user-defined join)User-Defined Reducers

• Take n rows and produce m rows (normally m<n)

Scaled out with explicit U-SQL Syntax that takes a UDO instance (created as part of the execution):

• EXTRACT• OUTPUT• PROCESS • COMBINE• REDUCE

What are UDOs?Custom Operator ExtensionsScaled out by U-SQL

UDO Tips and Warnings

• Tips when Using UDOs:• READONLY clause to allow pushing predicates through

UDOs• REQUIRED clause to allow column pruning through UDOs• PRESORT on REDUCE if you need global order• Hint Cardinality if it does choose the wrong plan

• Warnings and better alternatives:• Use SELECT with UDFs instead of PROCESS• Use User-defined Aggregators instead of REDUCE• Learn to use Windowing Functions (OVER expression)

• Good use-cases for PROCESS/REDUCE/COMBINE:• The logic needs to dynamically access the input and/or

output schema. E.g., create a JSON doc for the data in the row where the columns are not known apriori.

• Your UDF based solution creates too much memory pressure and you can write your code more memory efficient in a UDO

• You need an ordered Aggregator or produce more than 1 row per group

U-SQL Language Philosophy

Declarative Query and Transformation Language:• Uses SQL’s SELECT FROM WHERE with GROUP

BY/Aggregation, Joins, SQL Analytics functions• Optimizable, ScalableExpression-flow programming style:• Easy to use functional lambda composition • Composable, globally optimizableOperates on Unstructured & Structured Data• Schema on read over files• Relational metadata objects (e.g. database, table)Extensible from ground up:• Type system is based on C#• Expression language IS C#• User-defined functions (U-SQL and C#)• User-defined Aggregators (C#)• User-defined Operators (UDO) (C#)

U-SQL provides the Parallelization and Scale-out Framework for Usercode• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER,

COMBINER, APPLIERFederated query across distributed data sources

REFERENCE MyDB.MyAssembly;

CREATE TABLE T( cid int, first_order DateTime , last_order DateTime, order_count int , order_amount float, ... );

@o = EXTRACT oid int, cid int, odate DateTime, amount float FROM "/input/orders.txt" USING Extractors.Csv();

@c = EXTRACT cid int, name string, city string FROM "/input/customers.txt" USING Extractors.Csv();

@j = SELECT c.cid, MIN(o.odate) AS firstorder , MAX(o.date) AS lastorder, COUNT(o.oid) AS ordercnt , AGG<MyAgg.MySum>(c.amount) AS totalamount FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10 GROUP BY c.cid;

OUTPUT @j TO "/output/result.txt"USING new MyData.Write();

INSERT INTO T SELECT * FROM @j;

Unifies natively SQL’s declarativity and C#’s extensibilityUnifies querying structured and unstructuredUnifies local and remote queriesIncrease productivity and agility from Day 1 forward for YOU!Sign up for an Azure Data Lake account and join the Public Preview http://www.azure.com/datalake and give us your feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!

This is why U-SQL!

http://www.azure.com/datalake

http://aka.ms/adlfeedback

http://aka.ms/u-sql-survey

Additional Resources Blogs and community page:

• http://usql.io (U-SQL Github)• http://blogs.msdn.microsoft.com/mrys/• http://blogs.msdn.microsoft.com/azuredatalake/ • https://channel9.msdn.com/Search?term=U-SQL#ch9Sear

ch

Documentation and articles:• http://aka.ms/usql_reference• https://azure.microsoft.com/en-us/documentation/services

/data-lake-analytics/• https://msdn.microsoft.com/en-us/magazine/mt614251

ADL forums and feedback• http://aka.ms/adlfeedback• https://social.msdn.microsoft.com/Forums/azure/en-US/ho

me?forum=AzureDataLake

• http://stackoverflow.com/questions/tagged/u-sql

http://usql.io/

http://blogs.msdn.microsoft.com/mrys/

https://blogs.msdn.microsoft.com/azuredatalake/

https://channel9.msdn.com/Search?term=U-SQL#ch9Search

https://channel9.msdn.com/Search?term=U-SQL#ch9Search

http://aka.ms/usql_reference

https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/

https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/

https://msdn.microsoft.com/en-us/magazine/mt614251

http://aka.ms/adlfeedback

https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake

https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake

http://stackoverflow.com/questions/tagged/u-sql

Explore Everything PASS Has to Offer

FREE ONLINE WEBINAR EVENTS FREE 1-DAY LOCAL TRAINING EVENTS

LOCAL USER GROUPS AROUND THE WORLD

ONLINE SPECIAL INTEREST USER GROUPS

BUSINESS ANALYTICS TRAINING

VOLUNTEERING OPPORTUNITIES

PASS COMMUNITY NEWSLETTER

BA INSIGHTS NEWSLETTER FREE ONLINE RESOURCES

http://www.sqlpass.org/Events/24HoursofPASS.aspx


http://www.sqlsaturday.com/

http://www.sqlsaturday.com/

http://www.sqlpass.org/PASSChapters.aspx

http://www.sqlpass.org/PASSChapters/VirtualChapters.aspx


http://www.sqlpass.org/PASSChapters/VirtualChapters.aspx







Session Evaluations

ways to access

Go to passSummit.com

Download the GuideBook App and search: PASS Summit 2016

Follow the QR code link displayed on session signage throughout the conference venue and in the program guide

Submit by 5pmFriday November 6th toWIN prizes

Your feedback is important and valuable. 3

Thank You Learn more from

Michael [email protected] or follow

@MikeDoesBigData

introducing u-sql (sqlpass 2016)

Data & Analytics