Apache Spark Introduction


Page 1: Apache Spark Introduction

Spark Conf Taiwan 2016

Page 2: Apache Spark Introduction

Apache Spark

Rich Lee
2016/9/21

Page 3: Apache Spark Introduction

Agenda

Spark overview

Spark core: RDD

Spark application development: Spark Shell

Zeppelin

Application

Page 4: Apache Spark Introduction
Page 5: Apache Spark Introduction
Page 6: Apache Spark Introduction

Spark Overview

Apache Spark is a fast and general-purpose cluster computing system.

Key Features: Fast, Ease of Use, General-purpose, Scalable, Fault Tolerant

Logistic regression in Hadoop and Spark

Page 7: Apache Spark Introduction
Page 8: Apache Spark Introduction

Spark Overview

Cluster Mode:

Local

Standalone

Hadoop YARN

Apache Mesos
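A minimal sketch (not from the slides) of how the cluster mode is chosen when building a SparkContext in an application; the standalone, YARN, and Mesos master URLs below are placeholder assumptions:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ClusterModeExample")
  .setMaster("local[4]")             // local mode with 4 worker threads
  // .setMaster("spark://host:7077") // standalone cluster (placeholder host)
  // .setMaster("yarn")              // Hadoop YARN
  // .setMaster("mesos://host:5050") // Apache Mesos (placeholder host)

val sc = new SparkContext(conf)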

Page 9: Apache Spark Introduction

Spark Overview

Page 10: Apache Spark Introduction

Spark Overview

Spark High Level Architecture:

Driver Program

Cluster Manager

Worker Node

Executor

Task
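As a rough illustration (not from the slides) of how these pieces divide the work: the driver program owns the SparkContext and builds jobs, the cluster manager allocates executors on the worker nodes, and each executor runs the individual tasks.

// Runs in the driver program: defines the RDD and its lineage.
val numbers = sc.parallelize(1 to 1000000)

// The function passed to map is serialized, shipped to the executors,
// and applied partition by partition as tasks on the worker nodes.
val squares = numbers.map(x => x.toLong * x)

// The action returns a single result to the driver program.
println(squares.reduce(_ + _))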

Page 11: Apache Spark Introduction
Page 12: Apache Spark Introduction

Spark Overview

Install and startup

Download

http://spark.apache.org/downloads.html

Start Master and Worker

./sbin/start-all.sh

http://localhost:8080

Start History server

./sbin/start-history-server.sh hdfs://localhost:9000/spark/directory

http://localhost:18080
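The history server only has something to display if applications write event logs. A rough sketch of the relevant settings (normally placed in conf/spark-defaults.conf, expressed here as the equivalent SparkConf calls, reusing the HDFS directory from the command above):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")                               // applications write event logs
  .set("spark.eventLog.dir", "hdfs://localhost:9000/spark/directory")  // same directory the history server reads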

Start Spark-Shell

./bin/spark-shell --master "spark://RichdeMacBook-Pro.local:7077"

./bin/spark-shell --master local[4]

Page 13: Apache Spark Introduction

RDD: Resilient Distributed Dataset

RDD represents a collection of partitioned data elements that can be operated on in parallel. It is the primary data abstraction mechanism in Spark.

Partitioned, Fault Tolerant, Interface, In Memory
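A small sketch (not in the slides) of the partitioned property as seen from the Spark shell; the partition count of 8 is an arbitrary assumption:

val data = sc.parallelize(1 to 10000, 8)  // request 8 partitions
println(data.getNumPartitions)            // how many partitions the data is split into
println(data.partitions.length)           // the same information via the partitions array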

Page 14: Apache Spark Introduction

RDD

Create RDD

parallelize

val xs = (1 to 10000).toList
val rdd = sc.parallelize(xs)

textFile

val lines = sc.textFile("/input/README.md")
val lines = sc.textFile("file:///RICH_HD/BigData_Tools/spark-1.6.2/README.md")

HDFS - "hdfs://"

Amazon S3 - "s3n://"

Cassandra, HBase

Page 15: Apache Spark Introduction

RDD

Page 16: Apache Spark Introduction

RDD

Transformation

Creates a new RDD by performing a computation on the source RDD.

map

val lines = sc.textFile("/input/README.md")
val lengths = lines map { l => l.length }

flatMap

val words = lines flatMap { l => l.split(" ") }

filter

val longLines = lines filter { l => l.length > 80 }
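A short sketch (not in the slides) showing that transformations chain and are evaluated lazily; nothing runs until an action is called:

val lines = sc.textFile("/input/README.md")
val wordLengths = lines
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)
  .map(_.length)
// No job has run yet; the count action below triggers the computation.
println(wordLengths.count())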

Page 17: Apache Spark Introduction

RDD

Action

Returns a value to the driver program.

first

val numbersRdd = sc.parallelize(List(10, 5, 3, 1))
val firstElement = numbersRdd.first

max

numbersRdd.max

reduce

val sum = numbersRdd.reduce((x, y) => x + y)
val product = numbersRdd.reduce((x, y) => x * y)
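A few more common actions, as a sketch (not in the slides); collect and take both bring data back to the driver, and take is the safer choice for large RDDs:

println(numbersRdd.count())         // 4
println(numbersRdd.take(2).toList)  // List(10, 5)
numbersRdd.collect().foreach(println)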

Page 18: Apache Spark Introduction

RDD

Filter log example:

val logs = sc.textFile("path/to/log-files")
val errorLogs = logs filter { l => l.contains("ERROR") }
val warningLogs = logs filter { l => l.contains("WARN") }
val errorCount = errorLogs.count
val warningCount = warningLogs.count

(Diagram: the log RDD is filtered into an error RDD and a warn RDD, and count is called on each.)

Page 19: Apache Spark Introduction

RDD

Caching

Stores an RDD in memory or on storage. When an application caches an RDD in memory, Spark stores it in the executor memory on each worker node; each executor keeps in memory the RDD partitions that it computes.

cache

persist: MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER
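A small sketch (not in the slides) of how a storage level is passed to persist; cache() is shorthand for persist with MEMORY_ONLY:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("path/to/log-files")
logs.persist(StorageLevel.MEMORY_AND_DISK)  // keep partitions in memory, spilling to disk when they do not fit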

Page 20: Apache Spark Introduction

RDD

Cache example:

val logs = sc.textFile("path/to/log-files")
val errorsAndWarnings = logs filter { l => l.contains("ERROR") || l.contains("WARN") }
errorsAndWarnings.cache()
val errorLogs = errorsAndWarnings filter { l => l.contains("ERROR") }
val warningLogs = errorsAndWarnings filter { l => l.contains("WARN") }
val errorCount = errorLogs.count
val warningCount = warningLogs.count

Page 21: Apache Spark Introduction
Page 22: Apache Spark Introduction

Spark Application Development

Spark Shell

Zeppelin

Application (Java/Scala): spark-submit
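A minimal sketch (not in the slides) of a standalone Scala application as it would be packaged and launched with spark-submit; the object name and argument handling are assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // Unlike the shell, a standalone application creates its own SparkContext;
    // the master is usually supplied by spark-submit rather than hard-coded.
    val conf = new SparkConf().setAppName("WordCountApp")
    val sc = new SparkContext(conf)

    val counts = sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))
    sc.stop()
  }
}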

Page 23: Apache Spark Introduction

Spark Application Development: WordCount

val textFile = sc.textFile("/input/README.md")
val wcData = textFile.flatMap(line => line.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

wcData.collect().foreach(println)
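As a small follow-up sketch (not in the slides), the counts can be sorted by frequency so that only the most common words are printed:

wcData.sortBy(_._2, ascending = false).take(10).foreach(println)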

Page 24: Apache Spark Introduction

Commercial break: Taiwan Hadoop User Group

https://www.facebook.com/groups/hadoop.tw/

Taiwan Spark User Group

https://www.facebook.com/groups/spark.tw/