quark: a purely-functional scala dsl for data processing & analytics

Post on 16-Apr-2017








Quark: A Purely-Functional Scala DSL for Data Processing & Analytics

John A. De Goes

@jdegoes - http://degoes.net

Apache Spark

Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Spark Sucks

— Functional-ish

— Exceptions, typecasts

— SparkContext

— Serializable

— Unsafe type-safe programs

— Second-class support for databases

— Dependency hell (>100)

— Painful debugging

— Implementation-dependent performance

Why Does Spark Have to Suck?

Computation

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" ")) // <---- Where Spark goes wrong
  .map(word => (word, 1))                              // <---- Where Spark goes wrong
  .reduceByKey(_ + _)                                  // <---- Where Spark goes wrong


— Purely functional

— No exceptions, no casts, no nulls

— No global variables

— No serialization

— Safe type-safe programs

— First-class support for databases

— Few dependencies

— Better debugging

— Implementation-independent performance

Rule #1 in Functional Programming

Don't solve the problem, describe the solution.

AKA the "Do Nothing" rule

=> Don't compute, embed a compiled language into Scala


Quark is a Scala DSL built on Quasar Analytics, a general-purpose compiler for translating data processing over semi-structured data into efficient plans that execute 100% inside the target infrastructure.

val textFile = Dataset.load("...")
val counts = textFile.flatMap(line => line.typed[Str].split(" "))
  .map(word => (word, 1))
  .reduceByKey(_.sum)

More Quark

Compilation

val dataset = Dataset.load("/prod/profiles")

val averageAge = dataset.groupBy(_.country[Str]).map(_.age[Int]).reduceBy(_.average)

Quark Targets

One DSL to Rule Them All

— MongoDB

— Couchbase

— MarkLogic

— Hadoop / HDFS

— Add your connector here!

Both Quark and Quasar Analytics are purely-functional, open source projects written in 100% Scala.


How To DSL

Adding Integers

sealed trait Expr
final case class Integer(v: Int) extends Expr
final case class Addition(l: Expr, r: Expr) extends Expr

def int(v: Int): Expr = Integer(v)
def add(l: Expr, r: Expr): Expr = Addition(l, r)

add(add(int(1), int(2)), int(3)) : Expr

def interpret(e: Expr): Int = e match {
  case Integer(v)     => v
  case Addition(l, r) => interpret(l) + interpret(r)
}
def serialize(v: Expr): Json = ???
def deserialize(v: Json): Expr = ???
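A minimal runnable sketch of the integer DSL above. Because the slide leaves `serialize` unimplemented and `Json` undefined, a plain-text `render` function stands in as the second interpreter; `render` and the `IntDslDemo` wrapper are illustrative names, not part of the talk. The point it illustrates is that one description can be handed to several interpreters.

```scala
object IntDslDemo {
  sealed trait Expr
  final case class Integer(v: Int) extends Expr
  final case class Addition(l: Expr, r: Expr) extends Expr

  def int(v: Int): Expr = Integer(v)
  def add(l: Expr, r: Expr): Expr = Addition(l, r)

  // Interpreter 1: evaluate the description to an Int.
  def interpret(e: Expr): Int = e match {
    case Integer(v)     => v
    case Addition(l, r) => interpret(l) + interpret(r)
  }

  // Interpreter 2 (hypothetical stand-in for `serialize`): render the tree as text.
  def render(e: Expr): String = e match {
    case Integer(v)     => v.toString
    case Addition(l, r) => s"(${render(l)} + ${render(r)})"
  }

  // Building the program computes nothing; it only describes the sum.
  val program: Expr = add(add(int(1), int(2)), int(3))

  def main(args: Array[String]): Unit = {
    println(interpret(program))
    println(render(program))
  }
}
```

Nothing is added until `interpret` walks the tree, and the same tree can be rendered or shipped elsewhere instead.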

How To DSL

Adding Strings

sealed trait Expr
final case class Integer(v: Int) extends Expr
final case class Addition(l: Expr, r: Expr) extends Expr // Uh, oh!
final case class Str(v: String) extends Expr
final case class StringConcat(l: Expr, r: Expr) extends Expr // Uh, oh!
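To make the "Uh, oh!" concrete, here is a small sketch of what goes wrong with an untyped `Expr`. The interpreter and the `UntypedDslDemo` wrapper are assumptions added for illustration; the slide only shows the data types. With a single `Expr` type the interpreter can only return `Any` and has to cast, so an ill-typed program compiles and fails at runtime.

```scala
import scala.util.Try

object UntypedDslDemo {
  sealed trait Expr
  final case class Integer(v: Int) extends Expr
  final case class Addition(l: Expr, r: Expr) extends Expr
  final case class Str(v: String) extends Expr
  final case class StringConcat(l: Expr, r: Expr) extends Expr

  // Without a type index the result type is unknown, so casts are unavoidable.
  def interpret(e: Expr): Any = e match {
    case Integer(v) => v
    case Str(v)     => v
    case Addition(l, r) =>
      interpret(l).asInstanceOf[Int] + interpret(r).asInstanceOf[Int]
    case StringConcat(l, r) =>
      interpret(l).asInstanceOf[String] + interpret(r).asInstanceOf[String]
  }

  val wellTyped: Expr = Addition(Integer(1), Integer(2))

  // Adding a string to an integer is accepted by the compiler.
  val illTyped: Expr = Addition(Str("one"), Integer(2))

  def main(args: Array[String]): Unit = {
    println(interpret(wellTyped))
    println(Try(interpret(illTyped)).isFailure)
  }
}
```

This is the same class of runtime failure the talk criticizes in Spark: exceptions and typecasts in programs that type-check.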

How To DSL

Phantom Type

sealed trait Expr[A]
final case class Integer(v: Int) extends Expr[Int]
final case class Addition(l: Expr[Int], r: Expr[Int]) extends Expr[Int]
final case class Str(v: String) extends Expr[String]
final case class StringConcat(l: Expr[String], r: Expr[String]) extends Expr[String]

def interpret[A](e: Expr[A]): A = e match {
  case Integer(v)         => v
  case Addition(l, r)     => interpret(l) + interpret(r)
  case Str(v)             => v
  case StringConcat(l, r) => interpret(l) ++ interpret(r)
}
def serialize[A](v: Expr[A]): Json = ???
def deserialize[Z](v: Json): Expr[A] forSome { type A } = ???

How To DSL

GADTs in Scala still have bugs

SI-8563, SI-9345, SI-6680


How To DSL

Finally Tagless

trait Expr[F[_]] {
  def int(v: Int): F[Int]
  def str(v: String): F[String]
  def add(l: F[Int], r: F[Int]): F[Int]
  def concat(l: F[String], r: F[String]): F[String]
}

trait Dsl[A] {
  def apply[F[_]](implicit F: Expr[F]): F[A]
}

def int(v: Int): Dsl[Int] = new Dsl[Int] {
  def apply[F[_]](implicit F: Expr[F]): F[Int] = F.int(v)
}

def add(l: Dsl[Int], r: Dsl[Int]): Dsl[Int] = new Dsl[Int] {
  def apply[F[_]](implicit F: Expr[F]): F[Int] = F.add(l.apply[F], r.apply[F])
}
// ...

How To DSL

Finally Tagless

type Id[A] = A

def interpret: Expr[Id] = new Expr[Id] {
  def int(v: Int): Id[Int] = v
  def str(v: String): Id[String] = v
  def add(l: Id[Int], r: Id[Int]): Id[Int] = l + r
  def concat(l: Id[String], r: Id[String]): Id[String] = l + r
}

add(int(1), int(2)).apply(interpret) // Id(3)

final case class Const[A, B](a: A)

def serialize: Expr[Const[Json, ?]] = ???
def deserialize[F[_]: Expr](json: Json): F[A] forSome { type A } = ???
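The two finally-tagless slides can be assembled into one runnable sketch. Several things here are assumptions rather than content from the talk: the `str` and `concat` constructors are written to follow the pattern of `int` and `add`; `Rendered` is a plain type alias standing in for `Const[Json, ?]` (it avoids both the kind-projector plugin and the undefined `Json` type by rendering to `String`); and the interpreters are made implicit so they can be selected by type.

```scala
object TaglessDemo {
  // The algebra: every interpreter picks its own result wrapper F.
  trait Expr[F[_]] {
    def int(v: Int): F[Int]
    def str(v: String): F[String]
    def add(l: F[Int], r: F[Int]): F[Int]
    def concat(l: F[String], r: F[String]): F[String]
  }

  // A program is a value that can be run against any interpreter.
  trait Dsl[A] {
    def apply[F[_]](implicit F: Expr[F]): F[A]
  }

  def int(v: Int): Dsl[Int] = new Dsl[Int] {
    def apply[F[_]](implicit F: Expr[F]): F[Int] = F.int(v)
  }
  def str(v: String): Dsl[String] = new Dsl[String] {
    def apply[F[_]](implicit F: Expr[F]): F[String] = F.str(v)
  }
  def add(l: Dsl[Int], r: Dsl[Int]): Dsl[Int] = new Dsl[Int] {
    def apply[F[_]](implicit F: Expr[F]): F[Int] = F.add(l.apply[F], r.apply[F])
  }
  def concat(l: Dsl[String], r: Dsl[String]): Dsl[String] = new Dsl[String] {
    def apply[F[_]](implicit F: Expr[F]): F[String] = F.concat(l.apply[F], r.apply[F])
  }

  // Interpreter 1: direct evaluation.
  type Id[A] = A
  implicit val evaluate: Expr[Id] = new Expr[Id] {
    def int(v: Int): Id[Int] = v
    def str(v: String): Id[String] = v
    def add(l: Id[Int], r: Id[Int]): Id[Int] = l + r
    def concat(l: Id[String], r: Id[String]): Id[String] = l + r
  }

  // Interpreter 2: ignore the result type and accumulate a textual rendering.
  final case class Const[A, B](a: A)
  type Rendered[B] = Const[String, B] // assumption: String in place of Json
  implicit val render: Expr[Rendered] = new Expr[Rendered] {
    def int(v: Int): Rendered[Int] = Const(v.toString)
    def str(v: String): Rendered[String] = Const("\"" + v + "\"")
    def add(l: Rendered[Int], r: Rendered[Int]): Rendered[Int] =
      Const(s"(${l.a} + ${r.a})")
    def concat(l: Rendered[String], r: Rendered[String]): Rendered[String] =
      Const(s"(${l.a} ++ ${r.a})")
  }

  val program: Dsl[Int] = add(add(int(1), int(2)), int(3))
  val evaluated: Int = program.apply[Id]
  val rendered: String = program.apply[Rendered].a

  def main(args: Array[String]): Unit = {
    println(evaluated)
    println(rendered)
  }
}
```

An ill-typed program such as `add(str("one"), int(2))` is rejected by the compiler, and no pattern matching on a GADT is involved.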

Quark 101

The Building Blocks

— **Type**. Represents a reified type of an element in a dataset.

— **Dataset[A]**. Represents a dataset, produced by successive application of set-level operations (SetOps). Describes a directed-acyclic graph.

— **MappingFunc[A, B]**. Represents a function from A to B that is produced by successive application of mapping-level operations (MapOps) to the input.

— **ReduceFunc[A, B]**. Represents a reduction from A to B, produced by application of reduction-level operations (ReduceOps) to the input.

Let's Build Us a Mini-Quark!

Mini-Quark

Type System

sealed trait Type
object Type {
  final case class Unknown() extends Type
  final case class Timestamp() extends Type
  final case class Date() extends Type
  final case class Time() extends Type
  final case class Interval() extends Type
  final case class Int() extends Type
  final case class Dec() extends Type
  final case class Str() extends Type
  final case class Map[A <: Type, B <: Type](key: A, value: B) extends Type
  final case class Arr[A <: Type](element: A) extends Type
  final case class Tuple2[A <: Type, B <: Type](_1: A, _2: B) extends Type
  final case class Bool() extends Type
  final case class Null() extends Type

  type UnknownMap = Map[Unknown, Unknown]
  val UnknownMap: UnknownMap = Map(Unknown(), Unknown())

  type UnknownArr = Arr[Unknown]
  val UnknownArr: UnknownArr = Arr(Unknown())

  type Record[A <: Type] = Map[Str, A]
  type UnknownRecord = Record[Unknown]
}
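A short usage sketch for the reified type system, using a trimmed copy of `Type` so the block stands alone. The `show` function and the `TypeDemo` wrapper are illustrative additions, not part of the talk. It shows that each type exists both at the type level (in the Scala signature) and at the value level (as a case class instance that can be inspected).

```scala
object TypeDemo {
  sealed trait Type
  object Type {
    final case class Unknown() extends Type
    final case class Int() extends Type
    final case class Str() extends Type
    final case class Map[A <: Type, B <: Type](key: A, value: B) extends Type
    final case class Arr[A <: Type](element: A) extends Type

    type Record[A <: Type] = Map[Str, A]
  }

  // Hypothetical helper: walk the value-level representation of a type.
  def show(t: Type): String = t match {
    case Type.Unknown() => "Unknown"
    case Type.Int()     => "Int"
    case Type.Str()     => "Str"
    case Type.Map(k, v) => s"Map[${show(k)}, ${show(v)}]"
    case Type.Arr(e)    => s"Arr[${show(e)}]"
  }

  // A record whose fields are arrays of integers, described as a value.
  val ages: Type.Record[Type.Arr[Type.Int]] =
    Type.Map(Type.Str(), Type.Arr(Type.Int()))

  def main(args: Array[String]): Unit =
    println(show(ages))
}
```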

Mini-Quark

Set-Level Operations

sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]
}

sealed trait Dataset[A] {
  def apply[F[_]](implicit F: SetOps[F]): F[A]
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}

sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]

  def map[A, B](v: F[A], f: ???) // What goes here?
}

Mini-Quark

Mapping: Attempt #1

sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]

  def map[A, B](v: F[A], f: F[A] => F[B]) // Doesn't really work...
}

Mini-Quark

Mapping: Attempt #2

sealed trait MappingFunc[A, B] {
  def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]
}
trait MappingOps[F[_]] {
  def str(v: String): F[Type.Str]

  def project[K <: Type, V <: Type](v: F[Type.Map[K, V]], k: F[K]): F[V]

  def add(l: F[Type.Int], r: F[Type.Int]): F[Type.Int]

  def length[A <: Type](v: F[Type.Arr[A]]): F[Type.Int]

  ...
}
object MappingFunc {
  def id[A]: MappingFunc[A, A] = new MappingFunc[A, A] {
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[A] = v
  }
}

Mini-Quark

Mapping: Attempt #2

trait SetOps[F[_]] {
  def read(path: String): F[Unknown]

  def map[A, B](v: F[A], f: MappingFunc[A, B]): F[B] // Yay!!!
}

Mini-Quark

Dataset: Mapping

sealed trait Dataset[A] {
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: ???): Dataset[B] = ??? // What goes here???
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}

Mini-Quark

Dataset: Mapping Attempt #1

sealed trait Dataset[A] { self =>
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
    def apply[F[_]](implicit F: SetOps[F]): F[B] = F.map(self.apply, f)
  }
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}

// dataset.map(_.length) // Cannot ever work!
// dataset.map(v => v.profits[Dec] - v.losses[Dec]) // Cannot ever work!

Mini-Quark

Dataset: Mapping Attempt #2

sealed trait Dataset[A] { self =>
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: MappingFunc[A, A] => MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
    def apply[F[_]](implicit F: SetOps[F]): F[B] = F.map(self.apply, f(MappingFunc.id[A]))
  }
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}

// dataset.map(_.length) // Works with right methods on MappingFunc!
// dataset.map(v => v.profits[Dec] - v.losses[Dec]) // Works with right methods on MappingFunc!
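The Mini-Quark pieces so far can be wired together into a small end-to-end sketch. Much of it is assumed for the sake of a runnable example and is not Quark's actual API: the `typed` operation (modelled on `line.typed[Str]` from the earlier Quark snippet), the `ArrSyntax` extension that supplies `length`, and the `Plan` interpreter, which renders the description as text instead of compiling it for a backend.

```scala
object MiniQuarkDemo {
  sealed trait Type
  object Type {
    final case class Unknown() extends Type
    final case class Int() extends Type
    final case class Str() extends Type
    final case class Arr[A <: Type](element: A) extends Type
  }

  trait MappingOps[F[_]] {
    def typed[A, B](v: F[A]): F[B] // assumption: assert the element type
    def length[E <: Type](v: F[Type.Arr[E]]): F[Type.Int]
  }

  trait MappingFunc[A, B] { self =>
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]

    def typed[C]: MappingFunc[A, C] = new MappingFunc[A, C] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[C] =
        F.typed[B, C](self.apply[F](v))
    }
  }
  object MappingFunc {
    def id[A]: MappingFunc[A, A] = new MappingFunc[A, A] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[A] = v
    }
    // Hypothetical syntax: `length` exists only when the output is an array.
    implicit class ArrSyntax[A, E <: Type](f: MappingFunc[A, Type.Arr[E]]) {
      def length: MappingFunc[A, Type.Int] = new MappingFunc[A, Type.Int] {
        def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[Type.Int] =
          F.length(f.apply[F](v))
      }
    }
  }

  trait SetOps[F[_]] {
    def read(path: String): F[Type.Unknown]
    def map[A, B](v: F[A], f: MappingFunc[A, B]): F[B]
  }

  trait Dataset[A] { self =>
    def apply[F[_]](implicit F: SetOps[F]): F[A]

    // The user receives the identity function and extends it with DSL methods.
    def map[B](f: MappingFunc[A, A] => MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
      def apply[F[_]](implicit F: SetOps[F]): F[B] =
        F.map(self.apply[F], f(MappingFunc.id[A]))
    }
  }
  object Dataset {
    def read(path: String): Dataset[Type.Unknown] = new Dataset[Type.Unknown] {
      def apply[F[_]](implicit F: SetOps[F]): F[Type.Unknown] = F.read(path)
    }
  }

  // Illustrative interpreter: render the whole description as a string.
  final case class Plan[A](text: String)

  implicit val planMapping: MappingOps[Plan] = new MappingOps[Plan] {
    def typed[A, B](v: Plan[A]): Plan[B] = Plan(v.text)
    def length[E <: Type](v: Plan[Type.Arr[E]]): Plan[Type.Int] =
      Plan(s"length(${v.text})")
  }
  implicit val planSet: SetOps[Plan] = new SetOps[Plan] {
    def read(path: String): Plan[Type.Unknown] = Plan(s"read($path)")
    def map[A, B](v: Plan[A], f: MappingFunc[A, B]): Plan[B] =
      Plan(s"map(${v.text}, row => ${f.apply[Plan](Plan[A]("row")).text})")
  }

  val lengths: Dataset[Type.Int] =
    Dataset.read("/prod/tags").map(_.typed[Type.Arr[Type.Str]].length)

  val rendered: String = lengths.apply[Plan].text

  def main(args: Array[String]): Unit = println(rendered)
}
```

The lambda passed to `map` never runs against data. It runs once, against the identity `MappingFunc`, and what comes back is a description that any `MappingOps` interpreter can consume.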

Mini-Quark

Dataset: Mapping Binary Operators

val netProfit = dataset.map(v => v.netRevenue[Dec] - v.netCosts[Dec])

Mini-Quark

MappingFuncs Are Arrows!

trait MappingFunc[A <: Type, B <: Type] extends Dynamic { self =>
  import MappingFunc.Case

  def apply[F[_]: MappingOps](v: F[A]): F[B]

  def >>> [C <: Type](that: MappingFunc[B, C]): MappingFunc[A, C] = new MappingFunc[A, C] {
    def apply[F[_]: MappingOps](v: F[A]): F[C] = that.apply[F](self.apply[F](v))
  }

  def + (that: MappingFunc[A, B])(implicit W: NumberLike[B]): MappingFunc[A, B] = new MappingFunc[A, B] {
    def apply[F[_]: MappingOps](v: F[A]): F[B] = MappingOps[F].add(self(v), that(v))
  }

  def - (that: MappingFunc[A, B])(implicit W: NumberLike[B]): MappingFunc[A, B] = new MappingFunc[A, B] {
    def apply[F[_]: MappingOps](v: F[A]): F[B] = MappingOps[F].subtract(self(v), that(v))
  }
  ...
}
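A runnable sketch of arrow-style composition, under simplifying assumptions that are not from the talk: `Rec` and `Dec` are bare phantom types rather than members of the `Type` hierarchy, `field` is an explicit constructor standing in for the `Dynamic`-based field access, `NumberLike` is reduced to an empty evidence trait, and `Plan` renders the composed function as text.

```scala
object ArrowDemo {
  // Phantom element types (stand-ins for the talk's Type hierarchy).
  sealed trait Rec
  sealed trait Dec

  // Evidence that arithmetic is meaningful for B (assumed shape).
  trait NumberLike[B]
  object NumberLike {
    implicit val decIsNumber: NumberLike[Dec] = new NumberLike[Dec] {}
  }

  trait MappingOps[F[_]] {
    def field[A, B](v: F[A], name: String): F[B]
    def subtract[B](l: F[B], r: F[B]): F[B]
  }

  trait MappingFunc[A, B] { self =>
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]

    // Sequential composition: feed this function's output into `that`.
    def >>>[C](that: MappingFunc[B, C]): MappingFunc[A, C] = new MappingFunc[A, C] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[C] =
        that.apply[F](self.apply[F](v))
    }

    // Pointwise subtraction: both sides see the same input.
    def -(that: MappingFunc[A, B])(implicit W: NumberLike[B]): MappingFunc[A, B] =
      new MappingFunc[A, B] {
        def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B] =
          F.subtract(self.apply[F](v), that.apply[F](v))
      }
  }

  // Hypothetical constructor for a field projection.
  def field[A, B](name: String): MappingFunc[A, B] = new MappingFunc[A, B] {
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B] = F.field[A, B](v, name)
  }

  // Illustrative interpreter: render the function body as text.
  final case class Plan[A](text: String)
  implicit val planOps: MappingOps[Plan] = new MappingOps[Plan] {
    def field[A, B](v: Plan[A], name: String): Plan[B] = Plan(s"${v.text}.$name")
    def subtract[B](l: Plan[B], r: Plan[B]): Plan[B] = Plan(s"(${l.text} - ${r.text})")
  }

  val totals: MappingFunc[Rec, Rec] = field[Rec, Rec]("totals")
  val revenue: MappingFunc[Rec, Dec] = field[Rec, Dec]("revenue")
  val costs: MappingFunc[Rec, Dec] = field[Rec, Dec]("costs")

  val netProfit: MappingFunc[Rec, Dec] = totals >>> (revenue - costs)
  val rendered: String = netProfit.apply[Plan](Plan[Rec]("row")).text

  def main(args: Array[String]): Unit = println(rendered)
}
```

Because `-` demands a `NumberLike[B]`, subtracting two `MappingFunc[Rec, Rec]` values fails to compile in this sketch, since no such evidence exists for `Rec`.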

Mini-Quark

Applicative Composition

                    MappingFunc[A, B]
                  A -----------------------------B
                   \                            /
                    \                          /
                     \                        /
                      \                      /
                        MappingFunc[A, B ⊕ C]
                       \                    /
      MappingFunc[A, C] \                  /
                         \                /
                                 C
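One way to read the diagram is that two functions sharing an input, `MappingFunc[A, B]` and `MappingFunc[A, C]`, combine into a single function producing both results. The sketch below takes `B ⊕ C` to be a pair; that reading, along with the `zip` and `pair` names, the phantom types, and the `Plan` renderer, is an assumption made for illustration and not something the slide states.

```scala
object ZipDemo {
  // Phantom element types used only for illustration.
  sealed trait Rec
  sealed trait Str
  sealed trait Num

  trait MappingOps[F[_]] {
    def field[A, B](v: F[A], name: String): F[B]
    def pair[B, C](l: F[B], r: F[C]): F[(B, C)]
  }

  trait MappingFunc[A, B] { self =>
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]

    // Combine two functions over the same input into one that yields both results.
    def zip[C](that: MappingFunc[A, C]): MappingFunc[A, (B, C)] =
      new MappingFunc[A, (B, C)] {
        def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[(B, C)] =
          F.pair(self.apply[F](v), that.apply[F](v))
      }
  }

  // Hypothetical constructor for a field projection.
  def field[A, B](name: String): MappingFunc[A, B] = new MappingFunc[A, B] {
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B] = F.field[A, B](v, name)
  }

  // Illustrative interpreter: render the function body as text.
  final case class Plan[A](text: String)
  implicit val planOps: MappingOps[Plan] = new MappingOps[Plan] {
    def field[A, B](v: Plan[A], name: String): Plan[B] = Plan(s"${v.text}.$name")
    def pair[B, C](l: Plan[B], r: Plan[C]): Plan[(B, C)] = Plan(s"[${l.text}, ${r.text}]")
  }

  val name: MappingFunc[Rec, Str] = field[Rec, Str]("name")
  val age: MappingFunc[Rec, Num] = field[Rec, Num]("age")

  val both: MappingFunc[Rec, (Str, Num)] = name.zip(age)
  val rendered: String = both.apply[Plan](Plan[Rec]("row")).text

  def main(args: Array[String]): Unit = println(rendered)
}
```

The binary operators `+` and `-` on `MappingFunc` have the same shape: both operands are applied to one shared input and the results are merged by a single `MappingOps` call.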

Learn More

— Finally Tagless: http://okmij.org/ftp/tagless-final/

— Quark: https://github.com/quasar-analytics/quark

— Quasar: https://github.com/quasar-analytics/quasar


@jdegoes - http://degoes.net
