
CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2017

Doc 16 Spark, Cluster, AWS EMR Nov 7, 2017

Copyright ©, All rights reserved. 2017 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA. OpenContent (http://www.opencontent.org/opl.shtml) license defines the copyright on this document.

SMACK

2

Hot topic in the Bay Area

Scala, Spark
Apache Mesos - distributed system kernel
Apache Akka - highly concurrent, distributed, resilient message-driven applications on the JVM
Apache Cassandra - distributed database
Apache Kafka - distributed messaging system

Towards AWS

3

Need the Spark program packaged in a jar file

Issues
  Packaging in a jar
  Running in a local cluster of one machine
  Logging
  File references

Spark Program & Packaging in Jar

4

Put program in object

Packaging in a jar file
  Package your code, not the Spark jars - Spark adds ~200 MB
  By hand using the jar command
  Using sbt (see the sketch below for keeping Spark out of an assembled jar)
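If you build a single assembled ("fat") jar, for example with the sbt-assembly plugin (an assumption here; the slides only use sbt package), a common way to keep the ~200 MB of Spark jars out of the package is to mark the Spark dependency as provided:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.11"

// Spark is supplied by the cluster at runtime, so keep it out of the assembled jar
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"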

Why Jar Size Matters

5

[Diagram: the jar file goes to the master, which ships it to each slave]

Jar File & Spark Jars

6

When running a Spark program, Spark supplies all the Spark dependencies

If your jar file does not contain the Spark jars, then it cannot run by itself

If your jar file does contain the Spark jars, then
  It can run by itself
  It can run in Spark
  But you are passing an unneeded 200 MB to each slave

Need to include all other needed resources in your jar file

Sample Program

7

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1,2,3,4))
    rdd.saveAsTextFile("SimpleAppOutput")
    sc.stop()
  }
}

build.sbt

8

name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.11"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"

File Structure

9

simpleApp
simpleApp/build.sbt
src/
src/main
src/main/scala
src/main/scala/SimpleApp.scala

Compiling the Example Using sbt

10

from the simpleApp directory

->sbt package
[info] Updated file /Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/project/build.properties: set sbt.version to 1.0.2
[info] Loading project definition from /Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/project
[info] Updating {file:/Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/project/}simpleapp-build...
[info] Done updating.
[warn] Run 'evicted' to see detailed eviction warnings
...
[info] Compiling 1 Scala source to /Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/target/scala-2.11/classes ...
[info] Done compiling.
[info] Packaging /Users/whitney/Courses/696/Fall17/SparkExamples/simpleApp/target/scala-2.11/simple-project_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 14 s, completed Nov 4, 2017 4:24:36 PM

Note size of Jar file

11

Running in Temp Spark Runtime

12

->spark-submit target/scala-2.11/simple-project_2.11-1.0.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/04 16:30:13 INFO SparkContext: Running Spark version 2.2.0
....
17/11/04 16:30:15 INFO SparkContext: Successfully stopped SparkContext
17/11/04 16:30:15 INFO ShutdownHookManager: Shutdown hook called
17/11/04 16:30:15 INFO ShutdownHookManager: Deleting directory /private/var/folders/br/q_fcsjqc8xj9qn0059bctj3h0000gr/T/spark-8930a3ab-b041-4ed4-8203-fc8369b9c374

I put SPARK_HOME/bin & SPARK_HOME/sbin on my path
Set SPARK_HOME

setenv SPARK_HOME /Java/spark-2.2.0-bin-hadoop2.7

run SPARK_HOME/bin/spark-submit from simpleApp

Starting a Spark Cluster of One

13

Command SPARK_HOME/sbin/start-master.sh

->start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /Java/spark-2.2.0-bin-hadoop2.7/logs/spark-whitney-org.apache.spark.deploy.master.Master-1-air-6.local.out

Master Web Page

14

localhost:8080 127.0.0.1:8080 0.0.0.0:8080

Starting slave on local machine

15

->start-slave.sh spark://air-6.local:7077
starting org.apache.spark.deploy.worker.Worker, logging to /Java/spark-2.2.0-bin-hadoop2.7/logs/spark-whitney-org.apache.spark.deploy.worker.Worker-1-air-6.local.out

Command SPARK_HOME/sbin/start-slave.sh

Master Web Page

16

Submitting Job to Spark on Cluster

17

->spark-submit --master spark://air-6.local:7077 target/scala-2.11/simple-project_2.11-1.0.jar

run SPARK_HOME/bin/spark-submit from simpleApp

Master Web Page

18

Application Page

19

Starting/Stopping Master/Slave

20

Commands in SPARK_HOME/sbin

->start-master.sh

->start-slave.sh spark://air-6.local:7077

->stop-master.sh

->stop-slave.sh

->start-all.sh

->stop-all.sh

spark-submit

21

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Spark Properties

22

name, master, logging, memory, etc.

https://spark.apache.org/docs/latest/configuration.html

name - displayed in Spark Master Web page
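As a sketch (not from the slides), a few properties set programmatically; spark.executor.memory and spark.eventLog.enabled are standard Spark configuration keys, and the same keys can be passed with --conf on spark-submit:

import org.apache.spark.SparkConf

// Minimal sketch of setting properties in code (Spark 2.2 API)
val conf = new SparkConf()
  .setAppName("Simple Application")       // name shown on the Master web page
  .set("spark.executor.memory", "2g")     // memory per executor
  .set("spark.eventLog.enabled", "true")  // record Spark events for later inspection

// Equivalent on the command line:
//   spark-submit --conf spark.executor.memory=2g --conf spark.eventLog.enabled=true ...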

master

23

Master URL - Meaning

local - Run Spark locally with one worker thread
local[K] - Run Spark locally with K worker threads
local[K,F] - Run Spark locally with K worker threads and F maxFailures
local[*] - Run Spark locally with as many worker threads as logical cores on your machine
local[*,F] - Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures
spark://HOST:PORT - Connect to the given Spark standalone cluster master
spark://HOST1:PORT1,HOST2:PORT2 - Connect to the given Spark standalone cluster with standby masters with ZooKeeper
mesos://HOST:PORT - Connect to the given Mesos cluster
yarn - Connect to a YARN cluster in client or cluster mode

Examples

24

->spark-submit target/scala-2.11/simple-project_2.11-1.0.jar
(Start Spark master-slave using default values)

->spark-submit --master spark://air-6.local:7077 \
    target/scala-2.11/simple-project_2.11-1.0.jar
(Submit job to existing master)

->spark-submit --master "local[*]" target/scala-2.11/simple-project_2.11-1.0.jar
(Start Spark master-slave using all cores)

Setting Properties

25

In precedence order (highest first)
  In the program
  On the submit command
  In the config file (spark-defaults.conf)

Setting master in Code

26

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1,2,3,4))
    rdd.saveAsTextFile("SimpleAppOutput")
    sc.stop()
  }
}

Don't set the master in code
  It overrides the value on the command line and in the config file
  So you will not be able to change master settings without recompiling

Warning

27

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1,2,3,4))
    rdd.saveAsTextFile("SimpleAppOutput")
    sc.stop()
  }
}

Spark will not overwrite existing files. If you run this a second time without removing the output files you get an exception.
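One way to avoid the exception (a sketch, not from the slides) is to delete any previous output through the Hadoop FileSystem API before saving:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.fs.{FileSystem, Path}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // Recursively delete old output so saveAsTextFile does not throw
    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.delete(new Path("SimpleAppOutput"), true)
    sc.parallelize(List(1, 2, 3, 4)).saveAsTextFile("SimpleAppOutput")
    sc.stop()
  }
}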

Using Intellij

28

Using Intellij

29

Edit build.sbt file to add libraryDependencies

name := "Your Project"

version := "0.1"

scalaVersion := "2.11.11"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"

SBT

30

http://www.scala-sbt.org

Commands
  clean
  update - update dependencies
  compile
  package - generate jar file
  test
  run - not useful with Spark

Issue - Debugging

31

Debugger not available for program running on cluster

Print statements - don't count on seeing them from the slaves

Logging - Spark uses log4j 1.2

1/2 of Default Output

32

->spark-submit --master spark://air-6.local:7077 simpleappintell_2.11-0.1.jar
log4j:WARN No appenders could be found for logger (root).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
cat in the hat
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/04 22:16:37 INFO SparkContext: Running Spark version 2.2.0
17/11/04 22:16:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/04 22:16:38 INFO SparkContext: Submitted application: Simple Application
17/11/04 22:16:38 INFO SecurityManager: Changing view acls to: whitney
17/11/04 22:16:38 INFO SecurityManager: Changing modify acls to: whitney
17/11/04 22:16:38 INFO SecurityManager: Changing view acls groups to:
17/11/04 22:16:38 INFO SecurityManager: Changing modify acls groups to:
17/11/04 22:16:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(whitney); groups with view permissions: Set(); users with modify permissions: Set(whitney); groups with modify permissions: Set()
17/11/04 22:16:38 INFO Utils: Successfully started service 'sparkDriver' on port 52153.
17/11/04 22:16:38 INFO SparkEnv: Registering MapOutputTracker
17/11/04 22:16:38 INFO SparkEnv: Registering BlockManagerMaster
17/11/04 22:16:38 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/04 22:16:38 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/04 22:16:38 INFO DiskBlockManager: Created local directory at /private/var/folders/br/q_fcsjqc8xj9qn0059bctj3h0000gr/T/blockmgr-f07bc14c-79a1-4402-aa1f-8df995460e47
17/11/04 22:16:38 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/11/04 22:16:38 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/04 22:16:38 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/04 22:16:39 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040
17/11/04 22:16:39 INFO SparkContext: Added JAR file:/Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/target/scala-2.11/simpleappintell_2.11-0.1.jar at spark://192.168.0.102:52153/jars/simpleappintell_2.11-0.1.jar with timestamp 1509858999020
17/11/04 22:16:39 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://air-6.local:7077...
17/11/04 22:16:39 INFO TransportClientFactory: Successfully created connection to air-6.local/192.168.0.102:7077 after 23 ms (0 ms spent in bootstraps)
17/11/04 22:16:39 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20171104221639-0004

Log4j

33

Log Levels
  OFF (most specific, no logging)
  FATAL (most specific, little data)
  ERROR
  WARN
  INFO
  DEBUG
  TRACE (least specific, a lot of data)
  ALL (least specific, all data)

Can specify the level per package or per class

Can determine the log format and the location of the output

Setting Level in Code

34

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, LogManager, Logger}

object SimpleApp {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val log = LogManager.getRootLogger
    log.info("Start")
    println("cat in the hat")
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1,2,3,4))
    rdd.saveAsTextFile("SimpleAppOutput2")
    log.info("End")
    sc.stop()
  }
}

Output

35

->spark-submit --master spark://air-6.local:7077 simpleappintell_2.11-0.1.jar
log4j:WARN No appenders could be found for logger (root).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
cat in the hat
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/05 12:04:37 INFO root: End

Again - do you want to set the log level in code?

36

Can set the level in a config file - Spark ships a template at $SPARK_HOME/conf/log4j.properties.template

By default Spark will look for $SPARK_HOME/conf/log4j.properties, but that file is not part of the program

Quiet Log config

37

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org=WARN
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

Master Logging vs Slave Logging

38

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, LogManager, PropertyConfigurator, Logger}

object SimpleApp {
  def main(args: Array[String]) {
    val log = LogManager.getRootLogger
    log.info("Start")
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10)
    val stringRdd = rdd.map { value =>
      log.info(value)
      value.toString
    }
    log.info("End")
    sc.stop()
  }
}

[Diagram labels: the log calls outside the map run on the master; the log call inside the map runs on the slaves]

Error on running: the log4j logger is not serializable, so using it inside the map fails

Serializable Logger

39

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{LogManager, Logger}

object DistributedLogger extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
}

Main

40

object SimpleApp {
  def main(args: Array[String]) {
    val log = LogManager.getRootLogger
    log.info("Start")
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10)
    val result = rdd.map { i =>
      DistributedLogger.log.warn("i = " + i)
      i + 10
    }
    result.saveAsTextFile("SimpleAppOutput")
    log.info("End")
    sc.stop()
  }
}

Running

41

->spark-submit target/scala-2.11/simpleappintell_2.11-0.1.jar
17/11/06 16:59:40 INFO root: Start
17/11/06 16:59:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[Stage 0:> (0 + 0) / 8]17/11/06 16:59:44 WARN DistributedLogger$: i = 7
17/11/06 16:59:44 WARN DistributedLogger$: i = 8
17/11/06 16:59:44 WARN DistributedLogger$: i = 9
17/11/06 16:59:44 WARN DistributedLogger$: i = 6
17/11/06 16:59:44 WARN DistributedLogger$: i = 3
17/11/06 16:59:44 WARN DistributedLogger$: i = 4
17/11/06 16:59:44 WARN DistributedLogger$: i = 1
17/11/06 16:59:44 WARN DistributedLogger$: i = 5
17/11/06 16:59:44 WARN DistributedLogger$: i = 2
17/11/06 16:59:44 WARN DistributedLogger$: i = 10
17/11/06 16:59:44 INFO root: End

Logging DataFrames

42

To log operations applied to DataFrame rows, you need to use a UDF
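A minimal sketch (not from the slides) of logging from inside a DataFrame transformation by wrapping the logger call in a UDF; it reuses the serializable DistributedLogger object shown earlier:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object DataFrameLogging {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.appName("DataFrame Logging").getOrCreate()
    import spark.implicits._

    val df = (1 to 10).toDF("value")

    // The UDF body runs on the executors, so it uses the serializable logger
    val logAndAddTen = udf { (i: Int) =>
      DistributedLogger.log.warn("i = " + i)
      i + 10
    }

    df.withColumn("result", logAndAddTen(col("value"))).show()
    spark.stop()
  }
}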

Amazon Elastic Map-Reduce (EMR)

43

Hadoop, Hive, Spark, etc on Cluster

Predefined set of languages/tools available

Can create cluster of machines

https://aws.amazon.com
  Create new account
  Get 12 months of free access

AWS Free Tier

44

12 months free

EC2 - compute instances
  740 hours per month
  Billed in hour increments
  Billed per instance

S3 - storage
  5 GB
  20,000 GET requests

RDS - MySQL, PostgreSQL, SQL Server
  20 GB
  750 hours

EC2 Container Registry - Docker images
  500 MB

Students and I were charged last year

AWS Educate

45

https://aws.amazon.com/education/awseducate/

SDSU is an institutional member

Students get $100 credit

EC2 Pricing

46

Price Per Hour (On Demand / Spot)

  m1.medium   $0.0047
  m1.large    $0.0?
  m1.xlarge   $0.352
  m3.xlarge   $0.0551
  m4.large    $0.1 (On Demand) / $0.0299 (Spot)
  c1.medium   $0.0132
  c1.xlarge   $0.057

Basic Outline

47

Develop & test Spark locally

Upload program jar file & data to S3

Configure & launch cluster using
  AWS Management Console
  AWS CLI
  SDKs

Monitor cluster

Make sure you terminate cluster when done

Simple Storage System - S3

48

Files are stored in buckets

Bucket names are global

Supports
  s3 - files divided into blocks
  s3n - native file access

Accessing files (see the Spark access sketch below)
  S3 console
  Third party tools
  REST
  Java, C#, etc.
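As a sketch (the bucket name and keys below are hypothetical, not from the slides), a Spark job can read and write S3 objects directly with s3n:// URLs once the AWS credentials are set on the Hadoop configuration; on EMR the credentials are normally supplied for you:

import org.apache.spark.{SparkConf, SparkContext}

object S3Access {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("S3 Access"))

    // Hypothetical credentials; not needed when running on EMR with an IAM role
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    val lines = sc.textFile("s3n://my-example-bucket/input/data.txt")
    lines.map(_.toUpperCase).saveAsTextFile("s3n://my-example-bucket/output")
    sc.stop()
  }
}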

Amazon S3

49

S3 Creating a Bucket

50

S3 Costs

51

AWS Free Usage Tier

New AWS customers receive each month for one year 5 GB of Amazon S3 storage in the Standard Storage class, 20,000 Get Requests, 2,000 Put Requests, and 15 GB of data transfer out

Standard StorageStandard - Infrequent

Access StorageGlacier Storage

First 50 TB / month $0.023 per GB $0.0125 per GB $0.004 per GB

Next 450 TB / month $0.022 per GB $0.0125 per GB $0.004 per GB

Over 500 TB / month $0.021 per GB $0.0125 per GB $0.004 per GB

S3 Objects

52

Objects contain
  Object data
  Metadata

Size
  1 byte to 5 gigabytes per object

Object data
  Just bytes
  No meaning associated with the bytes

Metadata
  Name-value pairs to describe the object
  Some HTTP headers used, e.g. Content-Type

S3 Buckets

53

Namespace for objects

No limit on the number of objects per bucket

Only 100 buckets per account

Each bucket has a name
  Up to 255 bytes long
  Cannot be the same as an existing bucket name of any S3 user

Bucket Names

54

Bucket names must
  Contain lowercase letters, numbers, periods (.), underscores (_), and dashes (-)
  Start with a number or letter
  Be between 3 and 255 characters long
  Not be in an IP address style (e.g., "192.168.5.4")

To conform with DNS requirements, Amazon recommends (see the sketch below)
  Bucket names should not contain underscores (_)
  Bucket names should be between 3 and 63 characters long
  Bucket names should not end with a dash
  Bucket names cannot contain dashes next to periods (e.g., "my-.bucket.com" and "my.-bucket" are invalid)

Key

55

Unique identifier for an object within a bucket

Object URL

http://bucketName.s3.amazonaws.com/Key

http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl

Bucket = doc
Key = 2006-03-01/AmazonS3.wsdl
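As a tiny illustration of that URL form (the helper is made up for this example):

// Build the object URL shown above from a bucket name and key
def objectUrl(bucket: String, key: String): String =
  s"http://${bucket}.s3.amazonaws.com/${key}"

// objectUrl("doc", "2006-03-01/AmazonS3.wsdl")
//   == "http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl"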

Access Control Lists (ACL)

56

Each bucket has an ACL
  Determines who has read/write access

Each object can have an ACL
  Determines who has read/write access

An ACL consists of a list of grants

A grant contains (see the sketch below)
  One grantee
  One permission
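A purely illustrative model of that structure (these are not the AWS SDK types):

// An ACL is a list of grants; each grant pairs one grantee with one permission
case class Grant(grantee: String, permission: String)
case class Acl(grants: List[Grant])

// Example: the owner has full control, everyone else may read
val bucketAcl = Acl(List(
  Grant("owner", "FULL_CONTROL"),
  Grant("AllUsers", "READ")))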

S3 Data Consistency Model

57

Updates to a single object at a key in a bucket are atomic

But a read after a write may return the old value
  Changes may take time to propagate

No object locking
  If two writes to the same object occur at the same time
  The one with the later timestamp wins

CAP Theorem

58

The CAP theorem says that in a distributed system you cannot have all three of
  Consistency
  Availability
  tolerance to network Partitions

Consistency

59

[Diagram: Machine 1 and Machine 2 both start with A = 2; after an update Machine 1 still has A = 2 while Machine 2 has A = 3 - not consistent]

Partition

60

[Diagram: Machine 1 and Machine 2 both hold A = 2, but the network between them is partitioned]

Machine 1 cannot talk to machine 2

But how does machine 1 tell the difference between no connection and a very slow connection or busy machine 2?

Latency

61

Latency - the time between making a request and getting a response

Distributed systems always have latency

In practice we detect a partition by latency

When there is no response within a given time frame, assume we are partitioned
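A minimal sketch (not from the slides) of detecting a partition by latency: if a request gets no response within the time limit, assume we are partitioned:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Try

// Treat a request that does not answer within `limit` as a partition
def assumePartitioned[T](request: => T, limit: FiniteDuration): Boolean =
  Try(Await.result(Future(request), limit)).isFailure

// assumePartitioned({ Thread.sleep(5000); "pong" }, 1.second)  => true (no response in time)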

Available

62

[Diagram: Machine 1 and Machine 2 both hold A = 2, but a client cannot access the value of A]

What does "not available" mean?
  No connection
  Slow connection
  What is the difference?

Some say highly available - meaning low latency

In practice availability and latency are related

Consistency over Latency

63

[Diagram: a "Set A to 3" request arrives at Machine 1; A is locked on both machines, the new value is sent to Machine 2, then A is unlocked; both machines end with A = 3]

Write requests are queued until A is unlocked

Increased latency, but the system is still available

Latency over Consistency

64

[Diagram: a "Set A to 3" request is applied immediately on Machine 1, so for a while Machine 1 has A = 3 while Machine 2 still has A = 2; once the update reaches Machine 2 both machines have A = 3]

Write requests are accepted immediately

Low latency, but the system is temporarily inconsistent

Latency over Consistency - Write Conflicts

65

[Diagram: starting from A = 2 on both machines, "Set A to 3" arrives at Machine 1 while "Subtract 1 from A" arrives at Machine 2; after the updates cross, Machine 1 has A = 3 and Machine 2 has A = 1, and the final value of A is unclear]

Need a policy to make the system consistent
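One such policy, last-writer-wins (the rule S3 applies when two writes to the same object happen at the same time), sketched in a few lines:

case class Write(value: Int, timestamp: Long)

// Last-writer-wins: keep the write with the later timestamp
def resolve(a: Write, b: Write): Write =
  if (a.timestamp >= b.timestamp) a else b

// resolve(Write(3, timestamp = 100), Write(1, timestamp = 105))  => Write(1, 105)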

Partition

66

[Diagram: Machine 1 and Machine 2 start with A = 2 but are partitioned; "Set A to 3" is applied on Machine 1 and "Subtract 1 from A" on Machine 2, leaving A = 3 on one side and A = 1 on the other; a policy is needed to make the system consistent once the partition heals]

CAP Theorem

67

Not a theorem

Too simplistic
  What is availability?
  What is a partition of the network?

Misleading

The intent of CAP was to focus designers' attention on the tradeoffs in distributed systems

How to handle partitions in the network
  Consistency
  Latency
  Availability

CAP & S3

68

S3 favors latency over consistency

Running Program on AWS EMR

69

Make sure program runs locally

Create jar file containing code
  Make sure the jar file contains a manifest

Create S3 bucket(s) for
  jar file
  logs
  input
  output

Upload jar & data files to S3

Test Program - SimpleApp

70

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.LogManager

object SimpleApp {
  def main(args: Array[String]) {
    val log = LogManager.getRootLogger
    log.info("Start")
    if (args.length < 1) {
      log.error("Missing argument")
      return
    }
    val outputFile = args(0)
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10)
    rdd.saveAsTextFile(outputFile)
    log.info("End")
    sc.stop()
  }
}

Packaging SimpleApp using SBT

71

->sbt package
[info] Loading settings from idea.sbt ...
[info] Loading global plugins from /Users/whitney/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/project
[info] Loading settings from build.sbt ...
[info] Set current project to SimpleAppIntell (in build file:/Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/)
[success] Total time: 2 s, completed Nov 6, 2017 4:05:00 PM

In project directory

Packaging SimpleApp using SBT

72

In project directory

->sbt
[info] Loading settings from idea.sbt ...
[info] Loading global plugins from /Users/whitney/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/project
[info] Loading settings from build.sbt ...
[info] Set current project to SimpleAppIntell (in build file:/Users/whitney/Courses/696/Fall17/SparkExamples/simpleAppIntell/)
[info] sbt server started at 127.0.0.1:4172
sbt:SimpleAppIntell> package
[success] Total time: 2 s, completed Nov 6, 2017 4:06:33 PM
sbt:SimpleAppIntell>

I use the SBT shell as it is faster when repeating operations

Result of SBT package

73

Note: I renamed the jar file to simpleapp.jar

Contents of simpleapp.jar

74

Manifest-Version: 1.0
Implementation-Title: SimpleAppIntell
Implementation-Version: 0.1
Specification-Vendor: default
Specification-Title: SimpleAppIntell
Implementation-Vendor-Id: default
Specification-Version: 0.1
Implementation-Vendor: default
Main-Class: SimpleApp

MANIFEST.MF

Note

When running SimpleApp locally
  Don't need to use --class
  Spark finds the main class from the manifest

When running on AWS
  Need to use --class (see example below)
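For example (the bucket name and field layout here are hypothetical, not taken from the slides), the EMR step would be configured roughly as:

Spark-submit options:   --class SimpleApp
Application location:   s3://my-example-bucket/simpleapp.jar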

Running Program on AWS EMR

75

Make sure program runs locally

Create jar file containing code
  Make sure the jar file contains a manifest

Create S3 bucket(s) for
  jar file
  logs
  input
  output

Upload jar & data files to S3

My S3 Buckets

76

Spark on AWS - EMR Console

77

78

You can either use the Spark option in Quick Options or use Advanced Options

Advanced Options

79

Spark Application Setup

80

You have to give --class ClassName in Spark-submit options

81

Using the custom jar option - useful when cloning steps

Output

82

Warning on AWS

83

It can take 5-10 minutes to start cluster

Logs do not show your logging statements

When you configure steps incorrectly they fail; the error messages are not very helpful

SSH to your Master Node

84

Create Amazon EC2 Key pair

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair

Instructions

Open EC2 Dashboard - Select Key Pairs

SSH to your Master Node

85

In Create Cluster - Quick Options

SSH to your Master Node

86

Click for Instructions

Command-line Tools

87

Flintrock
  Open-source command-line tool for launching Apache Spark clusters
  https://github.com/nchammas/flintrock

aws cli
  Amazon's command-line tool
  https://aws.amazon.com/cli/
