Big Data Scala by the Bay: Interactive Spark in Your Browser


INTERACTIVE SPARK IN YOUR BROWSER

Romain Rigaux romain@cloudera.com
Erick Tryzelaar erickt@cloudera.com

GOAL OF HUE

WEB INTERFACE FOR ANALYZING DATA WITH APACHE HADOOP

SIMPLIFY AND INTEGRATE

FREE AND OPEN SOURCE

→ WEB “EXCEL” FOR HADOOP

VIEW FROM 30K FEET

[Diagram: you, your colleagues, and even that friend who uses IE9 ;) all reach Hadoop through the Hue web server.]

WHY SPARK?

SIMPLER (PYTHON, STREAMING, INTERACTIVE…)

OPENS UP DATA TO SCIENCE

SPARK → MR

Apache Spark

Spark Streaming

MLlib (machine learning)

GraphX (graph)

Spark SQL

WHY IN HUE? MARRIED WITH FULL HADOOP ECOSYSTEM (Hive Tables, HDFS, Job Browser…)

WHY IN HUE? Multi-user, YARN, impersonation/security. Not yet-another-app-to-install.

HISTORY V1: OOZIE

THE GOOD
• It works

THE BAD
• Submit through Oozie
• Slow

HISTORY V2: SPARK IGNITER

THE GOOD
• It works better

THE BAD
• Compile Jar
• Batch

HISTORY V3: NOTEBOOK

THE GOOD
• It works even better
• Scala / Python / R shells
• Jar / Py batches
• Notebook UI
• YARN

THE BAD
• Still new

GENERAL ARCHITECTURE

[Diagram: web part (the Hue notebook, made of snippets) → backend part (Livy) → multiple Spark sessions running on YARN.]

WEB ARCHITECTURE

[Diagram: the browser talks AJAX (create_session(), execute()…) to the Hue server. A Common API fronts the engine-specific APIs: Scala/Spark snippets go over REST to Livy (/sessions, /sessions/{sessionId}/statements); Hive snippets go over Thrift to HS2 (OpenSession(), ExecuteStatement()); likewise Pig and others.]

LIVY SPARK SERVER

• REST Web server in Scala

• Interactive Spark Sessions and Batch Jobs

• Type Introspection for Visualization

• Running sessions in YARN or locally

• Backends: Scala, Python, R

• Open Source: https://github.com/cloudera/hue/tree/master/apps/spark/java


LIVY WEB SERVER ARCHITECTURE

[Diagram, built up over seven animation steps (1-7): the Livy Server contains Scalatra, a Session Manager, and Sessions; each Session's Spark Client submits to the YARN Master; one YARN node hosts the Spark Interpreter and its SparkContext, while further YARN nodes host the Spark Workers. The numbered steps trace a request from the Livy Server out to YARN and back.]

SESSION CREATION AND EXECUTION

% curl -XPOST localhost:8998/sessions \
    -d '{"kind": "spark"}'

{
  "id": 0,
  "kind": "spark",
  "log": [...],
  "state": "idle"
}

% curl -XPOST localhost:8998/sessions/0/statements \
    -d '{"code": "1+1"}'

{
  "id": 0,
  "output": {
    "data": { "text/plain": "res0: Int = 2" },
    "execution_count": 0,
    "status": "ok"
  },
  "state": "available"
}
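The same two calls can be scripted. Below is a minimal Scala sketch, assuming a Livy server on localhost:8998 and Java 11's built-in HTTP client; the post helper is hypothetical, not part of Livy:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivyRestSketch {
  private val client = HttpClient.newHttpClient()

  // Hypothetical helper: POST a JSON body, return the raw response body.
  private def post(url: String, json: String): String = {
    val request = HttpRequest.newBuilder(URI.create(url))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(json))
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    // 1. Create an interactive Spark session.
    println(post("http://localhost:8998/sessions", """{"kind": "spark"}"""))
    // 2. Execute a statement in session 0 (poll until the session is idle first).
    println(post("http://localhost:8998/sessions/0/statements", """{"code": "1+1"}"""))
  }
}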

LIVY INTERPRETERS

Scala, Python, R…

INTERPRETERS

• Pipe stdin/stdout to a running shell (see the sketch after this list)

• Execute the code / send to Spark workers

• Perform magic operations

• One interpreter per language

• “Swappable” with other kernels (python, spark…)

Interpreter

> println(1 + 1)
2
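As a toy illustration of the pipe-stdin/stdout bullet above (not Livy's actual code), this sketch drives a long-running /bin/sh the way Livy drives a language shell, using an echoed sentinel to find the end of each command's output:

import java.io.{BufferedReader, InputStreamReader, PrintWriter}

object ShellPipeSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a POSIX shell is available; it stands in for the Scala shell.
    val proc   = new ProcessBuilder("/bin/sh").redirectErrorStream(true).start()
    val stdin  = new PrintWriter(proc.getOutputStream, true)
    val stdout = new BufferedReader(new InputStreamReader(proc.getInputStream))

    // "Execute" code by writing it to the shell's stdin, then read
    // stdout until a sentinel marks the end of this command's output.
    def execute(code: String): List[String] = {
      stdin.println(code)
      stdin.println("echo __DONE__")
      Iterator.continually(stdout.readLine())
        .takeWhile(line => line != null && line != "__DONE__")
        .toList
    }

    println(execute("echo $((1 + 1))")) // List(2)
    stdin.println("exit")
    proc.waitFor()
  }
}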

INTERPRETER FLOW

[Sequence diagram: curl or Hue sends "1+1" to the Livy Server, which forwards it to the Livy Session's Interpreter; the Interpreter evaluates it to "2", and the server replies with { "data": { "application/json": "2" } }.]

INTERPRETER FLOW CHART

[Flow chart: receive lines → split lines → for each line: if it is a magic line, execute the magic, otherwise execute the line. On Success, continue with the next line if any are left, else send the output to the server. On Incomplete, merge the line with the next line and retry. On Error, send the output to the server.]
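A minimal sketch of that loop, assuming the executeLine and result types shown on the following slides; the incomplete-line merging here is illustrative, not Livy's exact code:

sealed trait ExecuteResult
case class ExecuteComplete(output: String) extends ExecuteResult
case class ExecuteIncomplete(output: String) extends ExecuteResult
case class ExecuteError(output: String) extends ExecuteResult

object FlowSketch {
  // Toy stand-in for the interpreter: a line ending in "+" is incomplete.
  def executeLine(code: String): ExecuteResult =
    if (code.trim.endsWith("+")) ExecuteIncomplete(code)
    else ExecuteComplete(s"executed: $code")

  // Execute lines one by one; on Incomplete, merge with the next line and
  // retry; on Error, stop; on Success, carry on until no lines are left.
  @annotation.tailrec
  def executeLines(lines: List[String], last: ExecuteResult): ExecuteResult =
    lines match {
      case Nil => last
      case line :: rest =>
        executeLine(line) match {
          case ExecuteIncomplete(_) =>
            rest match {
              case next :: tail => executeLines(s"$line\n$next" :: tail, last)
              case Nil          => ExecuteError("incomplete statement")
            }
          case err: ExecuteError => err
          case ok                => executeLines(rest, ok)
        }
    }

  def main(args: Array[String]): Unit =
    println(executeLines(List("1 +", "1", "println(2)"), ExecuteComplete("")))
}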

LIVY INTERPRETERS

trait Interpreter {
  def state: State
  def execute(code: String): Future[JValue]
  def close(): Unit
}

sealed trait State
case class NotStarted() extends State
case class Starting() extends State
case class Idle() extends State
case class Running() extends State
case class Busy() extends State
case class Error() extends State
case class ShuttingDown() extends State
case class Dead() extends State
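To make the contract concrete, here is a hypothetical echo interpreter against that trait, assuming json4s (whose JValue the trait returns) is on the classpath:

import scala.concurrent.Future
import org.json4s.JValue
import org.json4s.JsonDSL._

// Illustrative only: an "interpreter" that echoes the code back.
class EchoInterpreter extends Interpreter {
  private var _state: State = NotStarted()

  def state: State = _state

  def execute(code: String): Future[JValue] = {
    _state = Busy()
    // Shape the result like a statement's output payload.
    val result: JValue = "text/plain" -> s"echo: $code"
    _state = Idle()
    Future.successful(result)
  }

  def close(): Unit = _state = Dead()
}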

SPARK INTERPRETER

class SparkInterpreter extends Interpreter {
  …
  private var _state: State = NotStarted()
  private val outputStream = new ByteArrayOutputStream()
  private var sparkIMain: SparkIMain = _

  def start() = {
    ...
    _state = Starting()
    sparkIMain = new SparkIMain(new Settings(),
      new JPrintWriter(outputStream, true))
    sparkIMain.initializeSynchronous()
    ...

SPARK INTERPRETER

private var sparkContext: SparkContext = _

def start() = {
  ...
  val sparkConf = new SparkConf(true)
  sparkContext = new SparkContext(sparkConf)
  sparkIMain.beQuietDuring {
    sparkIMain.bind("sc", "org.apache.spark.SparkContext",
      sparkContext, List("""@transient"""))
  }
  _state = Idle()
}
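Because sc is pre-bound into the shell, user code can use the SparkContext immediately; for example, a statement like this (illustrative) runs against the YARN-hosted workers:

> sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
res0: Long = 50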

EXECUTING SPARK

private def executeLine(code: String): ExecuteResult = {
  code match {
    case MAGIC_REGEX(magic, rest) =>
      executeMagic(magic, rest)
    case _ =>
      scala.Console.withOut(outputStream) {
        sparkIMain.interpret(code) match {
          case Results.Success    => ExecuteComplete(readStdout())
          case Results.Incomplete => ExecuteIncomplete(readStdout())
          case Results.Error      => ExecuteError(readStdout())
        }
        ...

INTERPRETER MAGIC

private val MAGIC_REGEX = "^%(\\w+)\\W*(.*)".r

private def executeMagic(magic: String, rest: String): ExecuteResponse = {
  magic match {
    case "json"  => executeJsonMagic(rest)
    case "table" => executeTableMagic(rest)
    case _       => ExecuteError(f"Unknown magic command $magic")
  }
}

INTERPRETER MAGIC

private def executeJsonMagic(name: String): ExecuteResponse = {
  sparkIMain.valueOfTerm(name) match {
    case Some(value: RDD[_]) =>
      ExecuteMagic(Extraction.decompose(Map(
        "application/json" -> value.asInstanceOf[RDD[_]].take(10))))
    case Some(value) =>
      ExecuteMagic(Extraction.decompose(Map(
        "application/json" -> value)))
    case None =>
      ExecuteError(f"Value $name does not exist")
  }
}

TABLE MAGIC

val lines = sc.textFile("shakespeare.txt")
val counts = lines.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  sortBy(-_._2).
  map { case (w, c) => Map("word" -> w, "count" -> c) }

%table counts

"application/vnd.livy.table.v1+json": {
  "headers": [
    { "name": "count", "type": "BIGINT_TYPE" },
    { "name": "name", "type": "STRING_TYPE" }
  ],
  "data": [
    [ 23407, "the" ],
    [ 19540, "I" ],
    [ 18358, "and" ],
    ...
  ]
}

JSON MAGIC

val lines = sc.textFile("shakespeare.txt")
val counts = lines.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  sortBy(-_._2).
  map { case (w, c) => Map("word" -> w, "count" -> c) }

%json counts

{
  "id": 0,
  "output": {
    "application/json": [
      { "count": 506610, "word": "" },
      { "count": 23407, "word": "the" },
      { "count": 19540, "word": "I" },
      ...
    ]
  ...
}

COMING SOON

• Stability and scaling
• Security
• iPython/Jupyter backends and file format

DEMO TIME

TWITTER

@gethue

USER GROUP

hue-user@

WEBSITE

http://gethue.com

LEARN

http://learn.gethue.com

THANKS!
