Quark: A Purely-Functional Scala DSL for Data Processing & Analytics
John A. De Goes
@jdegoes - http://degoes.net
Apache Spark
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
Spark Sucks
— Functional-ish
— Exceptions, typecasts
— SparkContext
— Serializable
— Unsafe type-safe programs
— Second-class support for databases
— Dependency hell (>100)
— Painful debugging
— Implementation-dependent performance
Why Does Spark Have to Suck?
Computation
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" ")) // <---- Where Spark goes wrong
  .map(word => (word, 1))                               // <---- Where Spark goes wrong
  .reduceByKey(_ + _)                                   // <---- Where Spark goes wrong
WWFPD?
— Purely functional
— No exceptions, no casts, no nulls
— No global variables
— No serialization
— Safe type-safe programs
— First-class support for databases
— Few dependencies
— Better debugging
— Implementation-independent performance
Rule #1 in Functional Programming
Don't solve the problem, describe the solution.
AKA the "Do Nothing" rule
=> Don't compute, embed a compiled language into Scala
Quark
Compilation
Quark is a Scala DSL built on Quasar Analytics, a general-purpose compiler for translating data processing over semi-structured data into efficient plans that execute 100% inside the target infrastructure.
val textFile = Dataset.load("...")
val counts = textFile.flatMap(line => line.typed[Str].split(" "))
  .map(word => (word, 1))
  .reduceByKey(_.sum)
More Quark
Compilation
val dataset = Dataset.load("/prod/profiles")
val averageAge = dataset
  .groupBy(_.country[Str])
  .map(_.age[Int])
  .reduceBy(_.average)
Quark Targets
One DSL to Rule Them All
— MongoDB
— Couchbase
— MarkLogic
— Hadoop / HDFS
— Add your connector here!
Both Quark and Quasar Analytics are purely-functional, open source projects written in 100% Scala.
https://github.com/quasar-analytics/
How To DSL
Adding Integers
sealed trait Expr
final case class Integer(v: Int) extends Expr
final case class Addition(l: Expr, r: Expr) extends Expr
def int(v: Int): Expr = Integer(v)
def add(l: Expr, r: Expr): Expr = Addition(l, r)
add(add(int(1), int(2)), int(3)) : Expr
def interpret(e: Expr): Int = e match {
  case Integer(v) => v
  case Addition(l, r) => interpret(l) + interpret(r)
}
def serialize(v: Expr): Json = ???
def deserialize(v: Json): Expr = ???
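The point of the "Do Nothing" rule is that one description can be given several meanings. A minimal, self-contained sketch of that idea follows; `render` is a hypothetical second interpreter standing in for the unimplemented `serialize`, and the wrapping object name is only for illustration.

```scala
object InitialEncoding {
  // The program is plain data: nothing is computed when it is built.
  sealed trait Expr
  final case class Integer(v: Int) extends Expr
  final case class Addition(l: Expr, r: Expr) extends Expr

  def int(v: Int): Expr = Integer(v)
  def add(l: Expr, r: Expr): Expr = Addition(l, r)

  // Interpreter 1: evaluate the description to an Int.
  def interpret(e: Expr): Int = e match {
    case Integer(v)     => v
    case Addition(l, r) => interpret(l) + interpret(r)
  }

  // Interpreter 2 (hypothetical): render the same description as text.
  def render(e: Expr): String = e match {
    case Integer(v)     => v.toString
    case Addition(l, r) => s"(${render(l)} + ${render(r)})"
  }

  def main(args: Array[String]): Unit = {
    val program = add(add(int(1), int(2)), int(3))
    println(interpret(program)) // 6
    println(render(program))    // ((1 + 2) + 3)
  }
}
```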
How To DSL
Adding Strings
sealed trait Expr
final case class Integer(v: Int) extends Expr
final case class Addition(l: Expr, r: Expr) extends Expr // Uh, oh!
final case class Str(v: String) extends Expr
final case class StringConcat(l: Expr, r: Expr) extends Expr // Uh, oh!
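Why "Uh, oh!"? With a single untyped `Expr`, nothing stops an integer addition from receiving a string. The sketch below is one way to see the consequence, under the assumption that the interpreter reports type errors at runtime; the `Value` type and the error messages are hypothetical and not part of the talk.

```scala
object UntypedStrings {
  sealed trait Expr
  final case class Integer(v: Int) extends Expr
  final case class Addition(l: Expr, r: Expr) extends Expr
  final case class Str(v: String) extends Expr
  final case class StringConcat(l: Expr, r: Expr) extends Expr

  // Hypothetical runtime values: the interpreter can no longer return a plain Int.
  sealed trait Value
  final case class IntValue(v: Int) extends Value
  final case class StrValue(v: String) extends Value

  // Type errors that the compiler could not catch surface here, at runtime.
  def interpret(e: Expr): Either[String, Value] = e match {
    case Integer(v) => Right(IntValue(v))
    case Str(v)     => Right(StrValue(v))
    case Addition(l, r) =>
      (interpret(l), interpret(r)) match {
        case (Right(IntValue(a)), Right(IntValue(b))) => Right(IntValue(a + b))
        case (Left(err), _)                           => Left(err)
        case (_, Left(err))                           => Left(err)
        case _                                        => Left("Addition expects two integers")
      }
    case StringConcat(l, r) =>
      (interpret(l), interpret(r)) match {
        case (Right(StrValue(a)), Right(StrValue(b))) => Right(StrValue(a + b))
        case (Left(err), _)                           => Left(err)
        case (_, Left(err))                           => Left(err)
        case _                                        => Left("StringConcat expects two strings")
      }
  }

  // This compiles, even though it makes no sense.
  val illTyped: Expr = Addition(Integer(1), Str("a"))
}
```

Here `interpret(illTyped)` yields a `Left`, which is exactly the kind of failure the typed encodings on the following slides rule out at compile time.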
How To DSL
Phantom Type
sealed trait Expr[A]
final case class Integer(v: Int) extends Expr[Int]
final case class Addition(l: Expr[Int], r: Expr[Int]) extends Expr[Int]
final case class Str(v: String) extends Expr[String]
final case class StringConcat(l: Expr[String], r: Expr[String]) extends Expr[String]
def interpret[A](e: Expr[A]): A = e match {
  case Integer(v) => v
  case Addition(l, r) => interpret(l) + interpret(r)
  case Str(v) => v
  case StringConcat(l, r) => interpret(l) ++ interpret(r)
}
def serialize[A](v: Expr[A]): Json = ???
def deserialize(v: Json): Expr[A] forSome { type A } = ???
How To DSL
GADTs in Scala still have bugs
SI-8563, SI-9345, SI-6680
FRIENDS DON'T LET FRIENDS USE GADTS IN SCALA.
How To DSL
Finally Tagless
trait Expr[F[_]] {
  def int(v: Int): F[Int]
  def str(v: String): F[String]
  def add(l: F[Int], r: F[Int]): F[Int]
  def concat(l: F[String], r: F[String]): F[String]
}
trait Dsl[A] {
  def apply[F[_]](implicit F: Expr[F]): F[A]
}
def int(v: Int): Dsl[Int] = new Dsl[Int] {
  def apply[F[_]](implicit F: Expr[F]): F[Int] = F.int(v)
}
def add(l: Dsl[Int], r: Dsl[Int]): Dsl[Int] = new Dsl[Int] {
  def apply[F[_]](implicit F: Expr[F]): F[Int] = F.add(l.apply[F], r.apply[F])
}
// ...
How To DSL
Finally Tagless
type Id[A] = A
def interpret: Expr[Id] = new Expr[Id] {
  def int(v: Int): Id[Int] = v
  def str(v: String): Id[String] = v
  def add(l: Id[Int], r: Id[Int]): Id[Int] = l + r
  def concat(l: Id[String], r: Id[String]): Id[String] = l + r
}
add(int(1), int(2)).apply(interpret) // 3 (Id[Int] is just Int)
final case class Const[A, B](a: A)
def serialize: Expr[Const[Json, ?]] = ???
def deserialize[F[_]: Expr](json: Json): F[A] forSome { type A } = ???
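Putting the finally tagless pieces together, here is a self-contained sketch with two interpreters for the same `Dsl` value. The `Printed` alias and the `show` interpreter are hypothetical: they use `Const[String, A]` as a stand-in for the `Const[Json, ?]` serializer, which the slides leave unimplemented, and a type alias replaces the `?` syntax so no compiler plugin is assumed.

```scala
object TaglessFinal {
  trait Expr[F[_]] {
    def int(v: Int): F[Int]
    def str(v: String): F[String]
    def add(l: F[Int], r: F[Int]): F[Int]
    def concat(l: F[String], r: F[String]): F[String]
  }

  // A program is a value that can be run against any interpreter F.
  trait Dsl[A] {
    def apply[F[_]](implicit F: Expr[F]): F[A]
  }

  def int(v: Int): Dsl[Int] = new Dsl[Int] {
    def apply[F[_]](implicit F: Expr[F]): F[Int] = F.int(v)
  }
  def str(v: String): Dsl[String] = new Dsl[String] {
    def apply[F[_]](implicit F: Expr[F]): F[String] = F.str(v)
  }
  def add(l: Dsl[Int], r: Dsl[Int]): Dsl[Int] = new Dsl[Int] {
    def apply[F[_]](implicit F: Expr[F]): F[Int] = F.add(l.apply[F], r.apply[F])
  }
  def concat(l: Dsl[String], r: Dsl[String]): Dsl[String] = new Dsl[String] {
    def apply[F[_]](implicit F: Expr[F]): F[String] = F.concat(l.apply[F], r.apply[F])
  }

  // Interpreter 1: direct evaluation.
  type Id[A] = A
  val evaluate: Expr[Id] = new Expr[Id] {
    def int(v: Int): Id[Int] = v
    def str(v: String): Id[String] = v
    def add(l: Id[Int], r: Id[Int]): Id[Int] = l + r
    def concat(l: Id[String], r: Id[String]): Id[String] = l + r
  }

  // Interpreter 2 (hypothetical): ignore the result type and accumulate text.
  final case class Const[A, B](a: A)
  type Printed[A] = Const[String, A]
  val show: Expr[Printed] = new Expr[Printed] {
    def int(v: Int): Printed[Int] = Const(v.toString)
    def str(v: String): Printed[String] = Const("\"" + v + "\"")
    def add(l: Printed[Int], r: Printed[Int]): Printed[Int] =
      Const(s"(${l.a} + ${r.a})")
    def concat(l: Printed[String], r: Printed[String]): Printed[String] =
      Const(s"(${l.a} ++ ${r.a})")
  }

  val program: Dsl[Int] = add(int(1), int(2))
  val result: Int = program.apply[Id](evaluate)           // 3
  val text: String = program.apply[Printed](show).a       // (1 + 2)
}
```

Ill-typed programs such as `add(int(1), str("a"))` are rejected by the compiler, with no GADT pattern matching involved.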
Quark 101
The Building Blocks
— **Type**. Represents a reified type of an element in a dataset.
— **Dataset[A]**. Represents a dataset, produced by successive application of set-level operations (SetOps). Describes a directed-acyclic graph.
— **MappingFunc[A, B]**. Represents a function from A to B that is produced by successive application of mapping-level operations (MapOps) to the input.
— **ReduceFunc[A, B]**. Represents a reduction from A to B, produced by application of reduction-level operations (ReduceOps) to the input.
Let's Build Us a Mini-Quark!
Mini-Quark
Type System
sealed trait Type
object Type {
  final case class Unknown() extends Type
  final case class Timestamp() extends Type
  final case class Date() extends Type
  final case class Time() extends Type
  final case class Interval() extends Type
  final case class Int() extends Type
  final case class Dec() extends Type
  final case class Str() extends Type
  final case class Map[A <: Type, B <: Type](key: A, value: B) extends Type
  final case class Arr[A <: Type](element: A) extends Type
  final case class Tuple2[A <: Type, B <: Type](_1: A, _2: B) extends Type
  final case class Bool() extends Type
  final case class Null() extends Type

  type UnknownMap = Map[Unknown, Unknown]
  val UnknownMap : UnknownMap = Map(Unknown(), Unknown())

  type UnknownArr = Arr[Unknown]
  val UnknownArr : UnknownArr = Arr(Unknown())

  type Record[A <: Type] = Map[Str, A]
  type UnknownRecord = Record[Unknown]
}
Mini-Quark
Set-Level Operations
sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]
}
Mini-Quark
Dataset
sealed trait Dataset[A] {
  def apply[F[_]](implicit F: SetOps[F]): F[A]
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}
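A `Dataset` only becomes useful once some `SetOps[F]` gives it meaning. The sketch below assumes a hypothetical interpreter, `Plan`, that renders the dataset as a textual plan instead of running anything; the `Plan` type, the `READ` syntax, and the `describe` helper are invented for illustration and are not Quark's API.

```scala
object MiniQuarkRead {
  sealed trait Type
  object Type {
    final case class Unknown() extends Type
  }

  trait SetOps[F[_]] {
    def read(path: String): F[Type.Unknown]
  }

  trait Dataset[A] {
    def apply[F[_]](implicit F: SetOps[F]): F[A]
  }
  object Dataset {
    def read(path: String): Dataset[Type.Unknown] = new Dataset[Type.Unknown] {
      def apply[F[_]](implicit F: SetOps[F]): F[Type.Unknown] = F.read(path)
    }
  }

  // Hypothetical interpreter: the "result" is a description of the plan.
  final case class Plan[A](text: String)
  val planSetOps: SetOps[Plan] = new SetOps[Plan] {
    def read(path: String): Plan[Type.Unknown] = Plan(s"READ($path)")
  }

  // Runs a dataset against the plan interpreter and returns the rendered text.
  def describe[A](d: Dataset[A]): String = d.apply[Plan](planSetOps).text
}
```

For example, `MiniQuarkRead.describe(MiniQuarkRead.Dataset.read("/prod/profiles"))` returns `READ(/prod/profiles)`. A real backend would supply a `SetOps` instance that emits a query for the target database instead.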
Mini-Quark
Mapping
sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]

  def map[A, B](v: F[A], f: ???) // What goes here?
}
Mini-Quark
Mapping: Attempt #1
sealed trait SetOps[F[_]] {
  def read(path: String): F[Unknown]

  def map[A, B](v: F[A], f: F[A] => F[B]) // Doesn't really work...
}
Mini-Quark
Mapping: Attempt #2
sealed trait MappingFunc[A, B] {
  def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]
}
trait MappingOps[F[_]] {
  def str(v: String): F[Type.Str]

  def project[K <: Type, V <: Type](v: F[Type.Map[K, V]], k: F[K]): F[V]

  def add(l: F[Type.Int], r: F[Type.Int]): F[Type.Int]

  def length[A <: Type](v: F[Type.Arr[A]]): F[Type.Int]

  ...
}
object MappingFunc {
  def id[A]: MappingFunc[A, A] = new MappingFunc[A, A] {
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[A] = v
  }
}
Mini-Quark
Mapping: Attempt #2
trait SetOps[F[_]] {
  def read(path: String): F[Unknown]

  def map[A, B](v: F[A], f: MappingFunc[A, B]): F[B] // Yay!!!
}
Mini-Quark
Dataset: Mapping
sealed trait Dataset[A] {
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: ???): Dataset[B] = ??? // What goes here???
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}
Mini-Quark
Dataset: Mapping Attempt #1
sealed trait Dataset[A] { self =>
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
    def apply[F[_]](implicit F: SetOps[F]): F[B] = F.map(self.apply, f)
  }
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}
// dataset.map(_.length) // Cannot ever work!
// dataset.map(v => v.profits[Dec] - v.losses[Dec]) // Cannot ever work!
Mini-Quark
Dataset: Mapping Attempt #2
sealed trait Dataset[A] { self =>
  def apply[F[_]](implicit F: SetOps[F]): F[A]

  def map[B](f: MappingFunc[A, A] => MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
    def apply[F[_]](implicit F: SetOps[F]): F[B] = F.map(self.apply, f(MappingFunc.id[A]))
  }
}
object Dataset {
  def read(path: String): Dataset[Unknown] = new Dataset[Unknown] {
    def apply[F[_]](implicit F: SetOps[F]): F[Unknown] = F.read(path)
  }
}
// dataset.map(_.length) // Works with right methods on MappingFunc!
// dataset.map(v => v.profits[Dec] - v.losses[Dec]) // Works with right methods on MappingFunc!
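The trick in Attempt #2 is that the lambda receives the identity `MappingFunc` and builds a bigger one by calling methods on it. A self-contained sketch follows. It makes several simplifying assumptions: the mapping operations `field` and `asInt` are hypothetical stand-ins for Quark's `Dynamic`-based `v.profits[Dec]` syntax, and `Plan` is an invented interpreter that prints the plan rather than executing it.

```scala
object MiniQuarkMapping {
  sealed trait Type
  object Type {
    final case class Unknown() extends Type
    final case class Int() extends Type
  }

  // Mapping-level algebra (hypothetical, simplified operations).
  trait MappingOps[F[_]] {
    def field(v: F[Type.Unknown], name: String): F[Type.Unknown]
    def asInt(v: F[Type.Unknown]): F[Type.Int]
  }

  trait MappingFunc[A, B] {
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]
  }
  object MappingFunc {
    def id[A]: MappingFunc[A, A] = new MappingFunc[A, A] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[A] = v
    }
    // The "right methods on MappingFunc": each one extends the function being built.
    implicit class UnknownSyntax[A](val func: MappingFunc[A, Type.Unknown]) {
      def field(name: String): MappingFunc[A, Type.Unknown] =
        new MappingFunc[A, Type.Unknown] {
          def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[Type.Unknown] =
            F.field(func.apply[F](v), name)
        }
      def asInt: MappingFunc[A, Type.Int] = new MappingFunc[A, Type.Int] {
        def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[Type.Int] =
          F.asInt(func.apply[F](v))
      }
    }
  }

  trait SetOps[F[_]] {
    def read(path: String): F[Type.Unknown]
    def map[A, B](v: F[A], f: MappingFunc[A, B]): F[B]
  }

  trait Dataset[A] { self =>
    def apply[F[_]](implicit F: SetOps[F]): F[A]
    def map[B](f: MappingFunc[A, A] => MappingFunc[A, B]): Dataset[B] = new Dataset[B] {
      def apply[F[_]](implicit F: SetOps[F]): F[B] =
        F.map(self.apply[F], f(MappingFunc.id[A]))
    }
  }
  object Dataset {
    def read(path: String): Dataset[Type.Unknown] = new Dataset[Type.Unknown] {
      def apply[F[_]](implicit F: SetOps[F]): F[Type.Unknown] = F.read(path)
    }
  }

  // Hypothetical printing interpreter for both algebras.
  final case class Plan[A](text: String)
  val planMapping: MappingOps[Plan] = new MappingOps[Plan] {
    def field(v: Plan[Type.Unknown], name: String): Plan[Type.Unknown] =
      Plan(s"${v.text}.$name")
    def asInt(v: Plan[Type.Unknown]): Plan[Type.Int] = Plan(s"int(${v.text})")
  }
  val planSet: SetOps[Plan] = new SetOps[Plan] {
    def read(path: String): Plan[Type.Unknown] = Plan(s"READ($path)")
    def map[A, B](v: Plan[A], f: MappingFunc[A, B]): Plan[B] = {
      val body = f.apply[Plan](Plan[A]("row"))(planMapping)
      Plan(s"MAP(${v.text}, row => ${body.text})")
    }
  }

  def describe[A](d: Dataset[A]): String = d.apply[Plan](planSet).text

  val ages: Dataset[Type.Int] = Dataset.read("/prod/profiles").map(_.field("age").asInt)
}
```

With these definitions, `describe(ages)` returns `MAP(READ(/prod/profiles), row => int(row.age))`: the Scala lambda was never run against data, only used to assemble a description.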
Mini-Quark
Dataset: Mapping Binary Operators
val netProfit = dataset.map(v => v.netRevenue[Dec] - v.netCosts[Dec])
Mini-Quark
MappingFuncs Are Arrows!
trait MappingFunc[A <: Type, B <: Type] extends Dynamic { self =>
  import MappingFunc.Case

  def apply[F[_]: MappingOps](v: F[A]): F[B]

  def >>> [C <: Type](that: MappingFunc[B, C]): MappingFunc[A, C] = new MappingFunc[A, C] {
    def apply[F[_]: MappingOps](v: F[A]): F[C] = that.apply[F](self.apply[F](v))
  }

  def + (that: MappingFunc[A, B])(implicit W: NumberLike[B]): MappingFunc[A, B] = new MappingFunc[A, B] {
    def apply[F[_]: MappingOps](v: F[A]): F[B] = MappingOps[F].add(self(v), that(v))
  }

  def - (that: MappingFunc[A, B])(implicit W: NumberLike[B]): MappingFunc[A, B] = new MappingFunc[A, B] {
    def apply[F[_]: MappingOps](v: F[A]): F[B] = MappingOps[F].subtract(self(v), that(v))
  }
  ...
}
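To see `>>>` and `+` at work, here is a reduced, self-contained sketch. It is written under assumptions: `NumberLike` is modelled as a plain marker type class, `toDec` is a hypothetical conversion added so there is something to compose with, and `Code` is an invented printing interpreter. `Dynamic` and `Case` from the slide are omitted.

```scala
object MappingArrows {
  sealed trait Type
  object Type {
    final case class Int() extends Type
    final case class Dec() extends Type
    final case class Str() extends Type
  }

  // Hypothetical evidence that a type supports arithmetic.
  sealed trait NumberLike[A <: Type]
  object NumberLike {
    implicit val intNumberLike: NumberLike[Type.Int] = new NumberLike[Type.Int] {}
    implicit val decNumberLike: NumberLike[Type.Dec] = new NumberLike[Type.Dec] {}
  }

  trait MappingOps[F[_]] {
    def add[A <: Type](l: F[A], r: F[A])(implicit W: NumberLike[A]): F[A]
    def toDec(v: F[Type.Int]): F[Type.Dec] // hypothetical conversion
  }

  trait MappingFunc[A <: Type, B <: Type] { self =>
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]

    // Sequential composition: feed this function's output into `that`.
    def >>>[C <: Type](that: MappingFunc[B, C]): MappingFunc[A, C] =
      new MappingFunc[A, C] {
        def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[C] =
          that.apply[F](self.apply[F](v))
      }

    // Pointwise addition: both functions see the same input.
    def +(that: MappingFunc[A, B])(implicit W: NumberLike[B]): MappingFunc[A, B] =
      new MappingFunc[A, B] {
        def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B] =
          F.add[B](self.apply[F](v), that.apply[F](v))
      }
  }
  object MappingFunc {
    def id[A <: Type]: MappingFunc[A, A] = new MappingFunc[A, A] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[A] = v
    }
    val toDec: MappingFunc[Type.Int, Type.Dec] = new MappingFunc[Type.Int, Type.Dec] {
      def apply[F[_]](v: F[Type.Int])(implicit F: MappingOps[F]): F[Type.Dec] = F.toDec(v)
    }
  }

  // Hypothetical printing interpreter.
  final case class Code[A](text: String)
  val codeOps: MappingOps[Code] = new MappingOps[Code] {
    def add[A <: Type](l: Code[A], r: Code[A])(implicit W: NumberLike[A]): Code[A] =
      Code(s"(${l.text} + ${r.text})")
    def toDec(v: Code[Type.Int]): Code[Type.Dec] = Code(s"dec(${v.text})")
  }
  def render[A <: Type, B <: Type](f: MappingFunc[A, B]): String =
    f.apply[Code](Code[A]("x"))(codeOps).text

  val doubled: MappingFunc[Type.Int, Type.Int] =
    MappingFunc.id[Type.Int] + MappingFunc.id[Type.Int]
  val pipeline: MappingFunc[Type.Int, Type.Dec] = doubled >>> MappingFunc.toDec

  // Does not compile: there is no NumberLike[Type.Str].
  // MappingFunc.id[Type.Str] + MappingFunc.id[Type.Str]
}
```

Here `render(doubled)` gives `(x + x)` and `render(pipeline)` gives `dec((x + x))`.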
Mini-Quark
Applicative Composition
          MappingFunc[A, B]
A ----------------------------- B
 \                             /
  \                           /
   \                         /
    \                       /
     \ MappingFunc[A, B ⊕ C]/
      \                   /
MappingFunc[A, C]        /
        \               /
         \             /
                C
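One reading of the diagram: two mapping functions that share an input `A` can be combined into a single function that produces both results. The sketch below assumes that `B ⊕ C` is represented as a pair type, and it uses a hypothetical combinator name, `zip`, plus hypothetical `tuple2` and `length` operations; none of these names are confirmed by the slides.

```scala
object MappingZip {
  sealed trait Type
  object Type {
    final case class Int() extends Type
    final case class Str() extends Type
    final case class Tuple2[A <: Type, B <: Type](left: A, right: B) extends Type
  }

  trait MappingOps[F[_]] {
    def tuple2[A <: Type, B <: Type](l: F[A], r: F[B]): F[Type.Tuple2[A, B]]
    def length(v: F[Type.Str]): F[Type.Int] // hypothetical operation
  }

  trait MappingFunc[A <: Type, B <: Type] { self =>
    def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[B]

    // Hypothetical combinator: run both functions on the same input, pair the results.
    def zip[C <: Type](that: MappingFunc[A, C]): MappingFunc[A, Type.Tuple2[B, C]] =
      new MappingFunc[A, Type.Tuple2[B, C]] {
        def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[Type.Tuple2[B, C]] =
          F.tuple2[B, C](self.apply[F](v), that.apply[F](v))
      }
  }
  object MappingFunc {
    def id[A <: Type]: MappingFunc[A, A] = new MappingFunc[A, A] {
      def apply[F[_]](v: F[A])(implicit F: MappingOps[F]): F[A] = v
    }
    val length: MappingFunc[Type.Str, Type.Int] = new MappingFunc[Type.Str, Type.Int] {
      def apply[F[_]](v: F[Type.Str])(implicit F: MappingOps[F]): F[Type.Int] = F.length(v)
    }
  }

  // Hypothetical printing interpreter.
  final case class Code[A](text: String)
  val codeOps: MappingOps[Code] = new MappingOps[Code] {
    def tuple2[A <: Type, B <: Type](l: Code[A], r: Code[B]): Code[Type.Tuple2[A, B]] =
      Code(s"[${l.text}, ${r.text}]")
    def length(v: Code[Type.Str]): Code[Type.Int] = Code(s"length(${v.text})")
  }
  def render[A <: Type, B <: Type](f: MappingFunc[A, B]): String =
    f.apply[Code](Code[A]("x"))(codeOps).text

  val both: MappingFunc[Type.Str, Type.Tuple2[Type.Str, Type.Int]] =
    MappingFunc.id[Type.Str].zip(MappingFunc.length)
}
```

Here `render(both)` gives `[x, length(x)]`. Because neither side depends on the other's result, a backend is free to evaluate them in one pass over the input.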
Learn More
— Finally Tagless: http://okmij.org/ftp/tagless-final/
— Quark: https://github.com/quasar-analytics/quark
— Quasar: https://github.com/quasar-analytics/quasar
THANK YOU
@jdegoes - http://degoes.net