cse6331: cloud computing · frameworks emerging in the hadoop ecosystem a tool for comparing these...

25
CSE6331: Cloud Computing Leonidas Fegaras University of Texas at Arlington c 2016 by Leonidas Fegaras MRQL

Upload: others

Post on 17-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

CSE6331: Cloud Computing

Leonidas Fegaras

University of Texas at Arlingtonc©2016 by Leonidas Fegaras

MRQL

Page 2: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Apache MRQL

Apache MRQL (incubating)

MRQL: a Map-Reduce Query Language

Web site: http://mrql.incubator.apache.org/

Wiki: http://wiki.apache.org/mrql/

History:

Fall 2010: started at UTA as an academic research project

March 2013: enters Apache Incubation

4 releases under Apache so far: latest MRQL 0.9.6

CSE6331: Cloud Computing MRQL 2

Page 3: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Motivation

Map-Reduce is not the only player in the Hadoop ecosystem any more

designed for batch processingnot well-suited for some big data workloads:real-time analytics, continuous queries, iterative algorithms, ...

Alternatives:

Spark, Flink, Pregel, Giraph, ...

New distributed stream processing engines:

Spark Streaming, Flink Streaming, Storm, S4, Samza, ...

CSE6331: Cloud Computing MRQL 3

Page 4: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

The new API-Based Distributed Processing Frameworks

They provide powerful abstractions as building blocks

They hide the details of parallel and distributed computing

They automate fault tolerance

. . . but, steep learning curve

Hard to develop, optimize, and maintain non-trivial applicationscoded in a programming language

Too many competing Big Data processing systems

Hard to tell which one of these systems will prevail in the near future

applications coded in one of these paradigms may have to be rewrittenas technologies evolve

... or you can express your applications in a query language that isindependent of the underlying distributed platform!

CSE6331: Cloud Computing MRQL 4

Page 5: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

MRQL: Design Objectives

Wanted to develop a powerful and efficient query processing systemfor complex data analysis applications on big data

more powerful than existing query languagesable to capture most complex data analysis tasks declarativelyable to work on read-only, raw (in-situ), complex dataHDFS as the physical storage layerplatform-independent:

the same query can run on multiple platforms on the same clusterallowing developers to experiment with various platforms effortlessly

efficient!

We envision MRQL to be:

a common front-end for the multitude of distributed processingframeworks emerging in the Hadoop ecosystema tool for comparing these systems (functionality & performance)

CSE6331: Cloud Computing MRQL 5

Page 6: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Is this yet Another SQL for Map-Reduce?

MRQL is NOT SQL!

MRQL is an SQL-like query language for large-scale, distributed dataanalysis on a computer cluster

Unlike SQL, MRQL supports

a richer data model (nested collections, trees, ...)arbitrary query nestingmore powerful query constructsuser-defined types and functions

MRQL queries can run on multiple distributed processing platforms

currently Apache Hadoop MapReduce, Hama, Spark, Flink, and (soon)Storm/Trinder

The MRQL syntax and semantics have been influenced by

modern database query languages (mostly, XQuery and ODMG OQL)functional programming languages (sequence comprehensions, algebraicdata types, type inference)

CSE6331: Cloud Computing MRQL 6

Page 7: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Language Features

The MRQL query language:

provides a rich type system that supports hierarchical data and nestedcollections uniformly

general algebraic datatypes – can be recursive

JSON and XML are user-defined types

pattern matching over data constructions (similar to Scala match)local type inference (similar to Scala)

allows nested queries at any level and at any place

no need for awkward nulls and outer-joins

supports UDFs

provided that they don’t have side effects

allows to operate on the grouped data using queries

as is done in XQuery and Pigimproves SQL group-by with aggregation

CSE6331: Cloud Computing MRQL 7

Page 8: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Language features

The MRQL query language:

supports custom aggregations and reductions using UDFs

provided they have certain properties (associativity & commutativity)

supports iteration declaratively

to capture iterative algorithms, such as PageRank

supports custom parsing and custom data fragmentation

provides syntax-directed construction and deconstruction of data

to capture domain-specific languages

CSE6331: Cloud Computing MRQL 8

Page 9: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

The MRQL Data Model

MRQL supports the following types:

a basic type: bool, short, int, long, float, double, string.

a tuple (t1, . . . , tn),

a record < A1 : t1, ...,An : tn >,

a list (sequence) [t] or list(t),

a bag (multiset) {t} or bag(t),

a user-defined type

an algebraic data type T

CSE6331: Cloud Computing MRQL 9

Page 10: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Algebraic Data Types

A data type T is defined at the top level:

data T = C1: t1 | ... | Cn: tn;

where C1, ..., Cn are data constructors and t1, ..., tn are types

It can be recursive

For example, an integer list can be defined as follows:

data IList = Cons: (int,IList) | Nil: ();

Then, Cons(1,Cons(2,Nil())) constructs the list [1,2]

XML is a predefined data type, defined as:

data XML = Node: ( String, bag( (String,String) ),

list(XML) )

| CData: String;

For example, <a x=’1’><b>text</b></a> is constructed using

Node(’a’,{(’x’,’1’)},[Node(’b’,{},[CData(’text’)])])

CSE6331: Cloud Computing MRQL 10

Page 11: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Patterns

Patterns are used in select-queries and case statements

A pattern can be:

a pattern variable that matches any data and binds the variable to data,a constant basic value,a * that matches any data,a data construction C (p1, . . . , pn),a tuple (p1, . . . , pn),a record < A1 : p1, . . . ,An : pn >,a list [p1, . . . , pn]

A case statement takes the following form:

case e { p1: e1 | ... | pn: en }

Example:

case e { Node(*,*,Node(’a’,*,cs)): cs | *: [] }

CSE6331: Cloud Computing MRQL 11

Page 12: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Accessing Data Sources

Parsing text files:

source(line,’employee.txt’,’,’,

type ( < name: string, dno: int, phone: any,address: string > ))

Loading binary data:

source(binary,’graph.bin’)

Parsing XML Documents:

XMark = source(xml,’xmark.xml’,{’person’});

Also, syntax to parse JSON documents and define custom parsers

Statements to dump the results of query e to a binary or a CSV textfile fname:

store fname from e;

dump fname from e;

CSE6331: Cloud Computing MRQL 12

Page 13: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

The Select-Query Syntax

The select-query syntax takes the form: ([ ... ] is optional)

select [ distinct ] e

from p in e, ..., p in e

[ where e ]

[ group by p: e [ having e ] ]

[ order by e [ limit e ] ]

where e are expressions/queries and p are patternsExample:

select (n,cn)

from < name: n, children: cs > in Employees,

< name: cn > in cs

Same as:

select (e.name,c.name)

from e in Employees,

c in e.children

CSE6331: Cloud Computing MRQL 13

Page 14: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

More Examples

select ( d, c, sum(s) )

from <dno:dn,salary:s> in Employees

group by (d,c): ( dn, salary>=100000 )

having avg(s) >= 80000

select (k,sum(o.TOTALPRICE))

from o in Orders,

c in Customers

where o.CUSTKEY=c.CUSTKEY

group by k: c.NAME;

select (c.NAME,avg(select o.TOTALPRICE

from o in Orders

where o.CUSTKEY=c.CUSTKEY))

from c in Customers;

CSE6331: Cloud Computing MRQL 14

Page 15: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Simple example: matrix multiplication

A sparse matrix X is represented as a bag of (Xij , i , j).

Zij =∑k

Xik ∗ Ykj

select ( sum(z), i, j )

from (x,i,k) in X, (y,k,j) in Y, z = x*y

group by i, j

CSE6331: Cloud Computing MRQL 15

Page 16: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

An XML example

Group all persons according to their interests and the number of openauctions they watch. For each such group, return the number of personsin the group:

select ( cat, os, count(p) )

from p in XMARK,

i in p.profile.interest

group by cat: i.@category,

os: count(p.watches.@open_auctions)

CSE6331: Cloud Computing MRQL 16

Page 17: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Another XML example

Invert the citation graph in DBLP by grouping the items by their citationsand by ordering these groups by the number of citations they received:

select ( select text(a.title)

from a in DBLP

where a.@key = x,

count(a) )

from a in DBLP,

c in a.cite

group by x: text(c)

order by inv(count(a))

CSE6331: Cloud Computing MRQL 17

Page 18: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Example: k-means clustering

Derive k clusters from a set of points P:

repeat centroids = ...

step select < X: avg(s.X), Y: avg(s.Y) >

from point in Points

group by k: (select c from c in centroids

order by distance(point,c))[0]

CSE6331: Cloud Computing MRQL 18

Page 19: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Example: the PageRank algorithm

Simplified PageRank:

A graph node is associated with a PageRank and its outgoing links:

< id: 23, rank: 0.0, adjacent : { 10, 45, 35 } >

Propagate the PageRank of a node to its outgoing links;each node gets a new PageRank by accumulating the propagatedPageRanks from its incoming links:

repeat nodes = ...step select < id: m.id, rank: n.rank, adjacent : m.adjacent >

from n in (select < id: key, rank: sum(c.rank) >from c in ( select < id: a,

rank: n.rank/count(n.adjacent) >from n in nodes,

a in n.adjacent )group by key: c. id ),

m in nodeswhere n.id = m.id

CSE6331: Cloud Computing MRQL 19

Page 20: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Architecture

Query translation stages:

1 type inference

2 query translation and normalization

3 simplification

4 algebraic optimization

5 plan generation

6 plan optimization

7 compilation to Java code

CSE6331: Cloud Computing MRQL 20

Page 21: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Algebraic operators

Algebraic operations on bags:

groupBy ( X: {(κ, α)} ) : {(κ, {α})}flatMap ( f: α→ {β}, X: {α} ) : {β}reduce ( ⊕: (α, α)→ α , X: {α} ) : αunion ( X: {α}, Y: {α} ) : {α}

Join: coGroup ( X: {(κ, α)}, Y: {(κ, β)} ) : {(κ, {α}, {β})}map-reduce(m,r,X) = flatMap(r,groupBy(flatMap(m,X)))

List operations: orderBy, append

Iteration: repeat ( f: {α} → {α}, X: {α} ) : {α}

CSE6331: Cloud Computing MRQL 21

Page 22: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Query optimization

The query optimizer:

uses a cost-based optimization framework to map algebraic terms toefficient workflows of physical operations

handles dependent joins (used for nested collections)

unnest deeply nested queries and converts them to join plans

CSE6331: Cloud Computing MRQL 22

Page 23: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Current Work: Distributed Stream Processing

Support for continuous queries over multiple streams of data

Data come in incremental batches ∆X

Batch streaming based on sliding windows

Query q(X1,X2;Y ) over one invariant and two streaming data sources

CSE6331: Cloud Computing MRQL 23

Page 24: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Current Work: Distributed Stream Processing

select (k,avg(p.Y))from p in stream(binary,"points")

group by k: p.X

Currently, works on Spark Streaming

Soon, on Flink Streaming and Spark/Trinder

CSE6331: Cloud Computing MRQL 24

Page 25: CSE6331: Cloud Computing · frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) CSE6331: Cloud Computing MRQL 5. Is this yet

Current Work: Incremental Query Processing

Problem: translate any batch program (eg, PageRank) to anincremental program automatically

Solution: Break the query q(X1,X2;Y ) = g(f (X1,X2;Y )) such as:

f (X1 ]∆X1,X2 ]∆X2;Y ) = f (X1,X2;Y ) ⊗ f (∆X1,∆X2;Y )

Requires program analysis & transformation

CSE6331: Cloud Computing MRQL 25