Scala and Spark are Ideal for Big Data - Data Science Pop-Up Seattle

#datapopupseattle

Scala and Spark are Ideal for Big Data

John Nestor, Sr Architect, 47 Degrees

47deg

Upload: domino-data-lab

Post on 13-Jan-2017


UNSTRUCTURED Data Science POP-UP in Seattle

www.dominodatalab.com

Produced by Domino Data Lab

Domino’s enterprise data science platform is used by leading analytical organizations to increase productivity, enable collaboration, and publish models into production faster.

Scala and Spark are Ideal for Big Data

John Nestor, 47 Degrees

Seattle Unstructured Data Science Pop-Up, October 7, 2015

www.47deg.com

Why Scala?

• Strong typing

• Concise elegant syntax

• Runs on JVM (Java Virtual Machine)

• Supports both object-oriented and functional programming

• Scales from small, simple programs to large parallel distributed systems

• Easy to cleanly extend with new libraries and DSLs

• Ideal for parallel and distributed systems

Scala: Strong Typing and Concise Syntax

• Strong typing like Java.

    • Compile-time checks

    • Better modularity via strongly typed interfaces

    • Easier maintenance: types make code easier to understand

• Concise syntax like Python.

    • Type inference. The compiler infers most types that had to be explicit in Java.

    • Powerful syntax that avoids much of the boilerplate of Java code (see next slide).

• Best of both worlds: safety of strong typing with conciseness (like Python).
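
To make the type inference point concrete, here is a minimal sketch. The object and value names (TypeInferenceDemo, count, greeting, squares, repeat) are made up for illustration and are not from the talk.

```scala
object TypeInferenceDemo {
  // No type annotations: the compiler infers Int, String and Seq[Int].
  val count = 3
  val greeting = "hello"
  val squares = Seq(1, 2, 3).map(n => n * n)

  // Parameter types are still declared, so interfaces stay strongly typed.
  def repeat(text: String, times: Int): String = text * times

  // The next line would be rejected by the compiler, not at run time:
  // val wrong: Int = greeting

  def main(args: Array[String]): Unit = {
    println(squares)
    println(repeat(greeting, count))
  }
}
```

Uncommenting the `wrong` line is a quick way to see the compile-time check in action.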

Scala Case Class

• Java version

    class User {
        private String name;
        private int age;
        public User(String name, int age) { this.name = name; this.age = age; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }
    User joe = new User("Joe", 30);

• Scala version

    case class User(name: String, var age: Int)
    val joe = User("Joe", 30)
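
The one-line case class also brings members that the Java version would need to write by hand. The sketch below reuses the User class from the slide; the wrapper object CaseClassDemo and the describe helper are hypothetical names added for illustration.

```scala
object CaseClassDemo {
  // Same definition as on the slide: age is mutable because of `var`.
  case class User(name: String, var age: Int)

  val joe = User("Joe", 30)

  // Generated by the compiler: structural equality and copy.
  val sameAsJoe = User("Joe", 30)
  val olderJoe = joe.copy(age = 31)

  // Case classes can also be taken apart with pattern matching.
  def describe(user: User): String = user match {
    case User(name, age) if age >= 18 => s"$name is an adult"
    case User(name, _)                => s"$name is a minor"
  }

  def main(args: Array[String]): Unit = {
    println(joe)              // a readable toString is generated too
    println(joe == sameAsJoe) // compares field values, not references
    println(describe(olderJoe))
  }
}
```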

Functional Scala

• Anonymous functions: (a: Int, b: Int) => a + b

• Functions that take and return other functions.

• Rarely need variables or loops

• Immutable collections: Seq[T], Map[K,V], …

    • Works well with concurrent or distributed systems

    • Natural for functional programming

• Functional collection operations (a small sample)

    • map, flatMap, reduce, …

    • filter, groupBy, sortBy, take, drop, …
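
The points above can be sketched with the standard collections alone. The sample data and the names FunctionalDemo, add, twice and words are invented for this example.

```scala
object FunctionalDemo {
  // An anonymous function bound to a name.
  val add = (a: Int, b: Int) => a + b

  // A function that takes a function and returns a new one.
  def twice(f: Int => Int): Int => Int = x => f(f(x))

  // An immutable collection: every operation returns a new collection.
  val words = Seq("spark", "scala", "akka", "kafka", "sbt")

  val lengths     = words.map(_.length)
  val tokens      = Seq("big data", "fast data").flatMap(_.split(" "))
  val totalLength = lengths.reduce(add)
  val shortWords  = words.filter(_.length < 5)
  val byInitial   = words.groupBy(_.head)
  val twoShortest = words.sortBy(_.length).take(2)
  val withoutHead = words.drop(1)

  def main(args: Array[String]): Unit = {
    println(twice(_ + 3)(10))
    println(byInitial)
    println(twoShortest)
  }
}
```

No variables are reassigned and no loops are written; each step is an expression over the previous result.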

Scala Availability and Support

• Open Source

• Typesafe provides support. Founded by Martin Odersky, who designed Scala.

• IDEs: IntelliJ IDEA and Eclipse

• Libraries: lots now and more every day

    • ScalaNLP - Epic (natural language processing)

• Major Scala users: LinkedIn, Twitter, Goldman Sachs, Coursera, Angie's List, Whitepages

• Major systems written in Scala: Spark, Kafka

Typesafe Scala Components

• Scala Compiler (includes REPL)

• Scala Standard Libraries

• SBT - Scala Build Tool

• Play - scalable web applications

• Scala.js - compiles Scala to JavaScript

• Akka - for parallel and distributed computation

• Spray - high performance asynchronous TCP/HTTP library

• Spark - Typesafe also supports Spark

• Slick - for SQL database access

• ConductR - Scala deployment/devops tool

• Reactive Monitoring (Beta)

Why Spark?

• Supports not only batch but also (near) real-time processing

• Fast - keeps data in memory as much as possible

    • Often 10x to 100x faster than Hadoop

• A clean, easy-to-use API

    • A richer set of functional operations than just map and reduce

• A foundation for a wide set of integrated data applications

• Can recover from failures - recomputation or (optional) replication

• Scales out to handle very large data sets and to reduce run time

Spark RDDs

• RDD[T] - resilient distributed dataset

    • typed (elements must be serializable)

    • immutable

    • ordered

    • can be processed in parallel

    • lazy evaluation - permits more global optimizations

• Rich set of functional operations (a small sample)

    • map, flatMap, reduce, …

    • filter, groupBy, sortBy, take, …
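
A word count is a common way to see these RDD properties together. The sketch below is not from the talk: it assumes the Spark core library is on the classpath, runs in local mode, and uses made-up names (RddWordCount, countWords) and sample input.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddWordCount {
  // Counts words across all lines using RDD operations.
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    // parallelize turns a local collection into an RDD[String].
    val rdd = sc.parallelize(lines)

    // Transformations are lazy: nothing is computed yet.
    val counts = rdd
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // collect() is an action: it triggers the computation and
    // returns the results to the driver program.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM using all available cores.
    val conf = new SparkConf().setAppName("rdd-word-count").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val result = countWords(sc, Seq("scala and spark", "spark and big data"))
      result.foreach { case (word, n) => println(s"$word: $n") }
    } finally {
      sc.stop()
    }
  }
}
```

The chain of flatMap, map and reduceByKey reads much like the Scala collection code earlier in the deck, which is the point of the "natural fit" argument.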

Spark Components

• Spark Core

    • Scalable multi-node cluster

    • Failure detection and recovery

    • RDDs and functional operations

• MLlib - for machine learning

    • linear regression, SVMs, clustering, collaborative filtering, dimension reduction

    • more on the way!

• GraphX - for graph computation

• Streaming - for near real-time

• DataFrames - for SQL and JSON
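
As a rough illustration of the DataFrames component, the sketch below builds a small DataFrame and queries it with SQL. It assumes Spark SQL is on the classpath and uses the SparkSession entry point, which was introduced in Spark 2.0, after this talk was given. The names DataFrameDemo and namesOlderThan and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameDemo {
  // Returns the names of users older than minAge, using a SQL query.
  def namesOlderThan(spark: SparkSession, minAge: Int): Seq[String] = {
    import spark.implicits._

    // A DataFrame with named columns, built from a local collection.
    // JSON files can be loaded into the same structure with spark.read.json(path).
    val users = Seq(("Joe", 30), ("Ann", 41), ("Raj", 25)).toDF("name", "age")

    // Register the DataFrame so it can be referenced from SQL.
    users.createOrReplaceTempView("users")

    spark
      .sql(s"SELECT name FROM users WHERE age > $minAge ORDER BY name")
      .as[String]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-demo")
      .master("local[*]")
      .getOrCreate()
    try println(namesOlderThan(spark, 28))
    finally spark.stop()
  }
}
```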

Spark Availability and Support

• Open Source - top-level Apache project

• Over 750 contributors from over 200 organizations

• Can process multiple petabytes on clusters of over 8000 nodes

• Databricks: Matei Zaharia, who wrote the original Spark, is a founder and CTO

• Packages (more every day)

    • Zeppelin - Scala notebooks

    • Cassandra, Kafka connectors

Clusters and Scalability

• Scala Akka clusters (process distribution, microservices)

    • message passing

    • remote Actors

• Spark clusters (data distribution)

    • local

    • Standalone (optionally with ZooKeeper)

    • Apache Mesos

    • Hadoop YARN

    • all of the above can run on Amazon and Google clouds
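
In application code, the choice among the Spark cluster options above mostly comes down to the master URL passed to Spark. The sketch below models that choice with plain Scala and no Spark dependency. The type and function names (SparkMasters, ClusterManager, masterUrl) and the host name are hypothetical; the ports shown are the usual defaults, and the plain "yarn" value assumes Spark 2.0 or later (earlier releases used "yarn-client" and "yarn-cluster").

```scala
object SparkMasters {
  // One case per cluster option listed on the slide.
  sealed trait ClusterManager
  case object Local extends ClusterManager
  final case class Standalone(host: String, port: Int = 7077) extends ClusterManager
  final case class Mesos(host: String, port: Int = 5050) extends ClusterManager
  case object Yarn extends ClusterManager

  // Builds the master URL string for the chosen cluster manager.
  def masterUrl(manager: ClusterManager): String = manager match {
    case Local                  => "local[*]" // all cores of the local machine
    case Standalone(host, port) => s"spark://$host:$port"
    case Mesos(host, port)      => s"mesos://$host:$port"
    case Yarn                   => "yarn"     // cluster location comes from the Hadoop config
  }

  def main(args: Array[String]): Unit = {
    // "master.example.com" is a placeholder host.
    Seq(Local, Standalone("master.example.com"), Mesos("master.example.com"), Yarn)
      .foreach(m => println(masterUrl(m)))
  }
}
```

Because the trait is sealed, the compiler warns if a new cluster option is added without updating masterUrl.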

Why Scala for Spark?

• Why not Python, R, or Java for Spark?

    • Spark is written in Scala

    • The Scala source code is itself important Spark documentation

    • Spark is best extended in Scala

    • The primary API for Spark is Scala

    • The functional features of Scala and Spark are a natural fit and are easiest to use in Scala

• If you want to build scalable, high-performance production code based on Spark: R by itself is too specialized, Python is too slow, and Java is tedious to write and maintain

Demo

Seattle Resources

• Seattle Meetups

    • Scala at the Sea Meetup: http://www.meetup.com/Seattle-Scala-User-Group/

    • Seattle Spark Meetup: http://www.meetup.com/Seattle-Spark-Meetup/

• Seattle Training: Spark and Typesafe Scala Classes: http://www.47deg.com/events#training

• UW Scala Professional Certificate Program: http://www.pce.uw.edu/certificates/scala-functional-reactive-programming.html

@datapopup #datapopupseattle

Thank You To Our Sponsors