Spark Introduction, Part 2
TRANSCRIPT
![Page 1: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/1.jpg)
Lightning-fast cluster computing
![Page 2: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/2.jpg)
A quick review
![Page 3: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/3.jpg)
Problem
![Page 4: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/4.jpg)
Solution
MapReduce?
![Page 5: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/5.jpg)
Turn everything into MapReduce!
But how would you turn a SQL query like this into MapReduce?

```sql
SELECT LAT_N, CITY, TEMP_F
FROM STATS, STATION
WHERE MONTH = 7 AND STATS.ID = STATION.ID
ORDER BY TEMP_F;
```
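To get a feel for why this is awkward, here is a rough pure-Python sketch (no Hadoop involved) of how that query could be decomposed into map / shuffle / reduce phases. The table and column names follow the query; the sample rows are invented for illustration.

```python
STATS = [  # (ID, MONTH, TEMP_F) -- made-up sample rows
    (1, 7, 78.0), (2, 7, 91.5), (1, 6, 70.1),
]
STATION = [  # (ID, CITY, LAT_N) -- made-up sample rows
    (1, "Denver", 39.7), (2, "Phoenix", 33.4),
]

# Map phase: tag each record with its join key (ID); apply MONTH = 7 early.
mapped = [(id_, ("stats", temp)) for id_, month, temp in STATS if month == 7]
mapped += [(id_, ("station", city, lat)) for id_, city, lat in STATION]

# Shuffle: group records by key, as the framework does between map and reduce.
groups = {}
for key, rec in mapped:
    groups.setdefault(key, []).append(rec)

# Reduce phase: join the two sides within each key group.
joined = []
for key, recs in groups.items():
    stats_side = [r for r in recs if r[0] == "stats"]
    station_side = [r for r in recs if r[0] == "station"]
    for (_, temp) in stats_side:
        for (_, city, lat) in station_side:
            joined.append((lat, city, temp))

# ORDER BY TEMP_F needs yet another pass (a second MapReduce job in practice).
result = sorted(joined, key=lambda row: row[2])
```

Even this toy version takes three distinct passes for one line of SQL, which is the pain point the slide is making.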
![Page 6: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/6.jpg)
Turn everything into MapReduce!
What about machine learning / data analysis jobs like this one?
"From the nationwide real-estate transaction data published monthly since 2007, pick out the 5 meaningful variables among the 140 that could have an effect." "Oh, and the deadline is tomorrow."
![Page 7: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/7.jpg)
And look at the code. (This merely counts words…)

```java
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
```
![Page 8: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/8.jpg)
Tools naturally get better as time goes by.
![Page 9: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/9.jpg)
Generality
Its set of high-level tools lets you handle all of these workloads in one place.
![Page 10: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/10.jpg)
Easy to use.
Supports Java, Scala, and Python.
```python
text_file = spark.textFile("hdfs://...")

text_file.flatMap(lambda line: line.split()) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda a, b: a + b)
```
Word count in Spark's Python API
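The same pipeline can be mimicked in plain Python (no Spark) to show what each chained call computes; `lines` here is made-up sample input, while in Spark the identical logic runs partitioned across a cluster.

```python
lines = ["to be or not", "to be"]  # stand-in for the HDFS text file

# flatMap: split every line and flatten the results into one stream of words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each distinct word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n
```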
![Page 11: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/11.jpg)
It runs in all kinds of distributed environments.
● It runs on Hadoop, on Mesos, standalone, or in the cloud.
● It can also pull data from HDFS, Cassandra, HBase, S3, and more.
![Page 12: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/12.jpg)
It's fast, too.
Up to 100x faster than Hadoop MapReduce when run in memory, and 10x faster on disk.
Logistic regression in Hadoop and Spark
![Page 13: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/13.jpg)
It even has its own web UI….
![Page 14: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/14.jpg)
About Spark
● It is a tool, not a library. ○ You define the jobs you want done on top of this tool ○ and then execute them.
![Page 15: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/15.jpg)
Starting from standalone mode
Part 2: Let's try it!
![Page 16: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/16.jpg)
vagrant up / vagrant ssh
![Page 17: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/17.jpg)
spark-shell
![Page 18: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/18.jpg)
pyspark: the Python Spark shell
![Page 19: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/19.jpg)
Wordcount: Scala

```scala
val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out.txt")
```
![Page 20: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/20.jpg)
Wordcount: Scala

```scala
val f = sc.textFile("README.md")
```

```scala
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
```

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
![Page 21: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/21.jpg)
Wordcount: Scala

```scala
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
```
![Page 22: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/22.jpg)
Wordcount: Scala

```scala
val wc = f.flatMap(l => l.split(" "))
```
: split each line into individual words
![Page 23: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/23.jpg)
Wordcount: Scala

```scala
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1))
```
: build (key, value) pairs of (each word, 1)
![Page 24: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/24.jpg)
Wordcount: Scala

```scala
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
```
: then add them all up, key by key.
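To make the `_ + _` part concrete, here is a plain-Python sketch (not Spark code) of how `reduceByKey` applies the function pairwise, first inside each partition and then across partitions, which is why the function must be associative. The two partitions and their contents are hypothetical.

```python
combine_fn = lambda a, b: a + b  # plays the role of `_ + _` in the Scala code

def combine(pairs):
    """Fold a list of (key, value) pairs into a dict using combine_fn."""
    out = {}
    for k, v in pairs:
        out[k] = combine_fn(out[k], v) if k in out else v
    return out

partitions = [
    [("spark", 1), ("is", 1), ("spark", 1)],  # partition 0 (hypothetical)
    [("fast", 1), ("spark", 1)],              # partition 1 (hypothetical)
]

# Map-side combine inside each partition, then merge the partial results.
partials = [combine(p) for p in partitions]
merged = combine([(k, v) for d in partials for k, v in d.items()])
```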
![Page 25: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/25.jpg)
![Page 26: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/26.jpg)
Wordcount: Scala

```
scala> wc.take(20)
...
finished: take at <console>:26, took 0.081425 s
res6: Array[(String, Int)] = Array((package,1), (For,2), (processing.,1), (Programs,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), ("yarn-cluster",1), (graph,1), (Hive,2), (storage,1))
```
![Page 27: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/27.jpg)
Wordcount: Scala

```scala
wc.saveAsTextFile("wc_out.txt")
```
Save the result to a file.
![Page 28: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/28.jpg)
![Page 29: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/29.jpg)
What happens if we run the code above like this?
Simplifying Big Data Analysis with Apache Spark (Matei Zaharia, April 27, 2015)
![Page 30: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/30.jpg)
Let's move from disk to memory.
![Page 31: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/31.jpg)
In other words, you just hand each node in the cluster its share of the work and the commands to run.
![Page 32: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/32.jpg)
Spark Model
● Writing programs that transform data step by step
● Resilient Distributed Datasets (RDDs) ○ Collections of objects, stored in memory or on disk, distributed across the cluster ○ Built through parallel transformations (map, filter, …) ○ Automatically rebuilt on failure
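A toy class (deliberately not Spark's API) can illustrate the RDD idea sketched above: transformations only record lineage, an action materializes the data, and a lost partition can be rebuilt by replaying that lineage.

```python
class ToyRDD:
    """Minimal lineage-tracking dataset; an illustration, not Spark."""

    def __init__(self, source, parent=None, op=None):
        self.source, self.parent, self.op = source, parent, op

    def map(self, fn):          # transformation: records lineage only
        return ToyRDD(None, parent=self, op=("map", fn))

    def filter(self, pred):     # transformation: records lineage only
        return ToyRDD(None, parent=self, op=("filter", pred))

    def collect(self):          # action: walk the lineage and compute
        if self.parent is None:
            return list(self.source)
        data = self.parent.collect()  # recomputing from the parent is
        kind, fn = self.op            # exactly how a lost partition recovers
        if kind == "map":
            return [fn(x) for x in data]
        return [x for x in data if fn(x)]

rdd = ToyRDD(range(5)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; calling rdd.collect() triggers the whole chain.
```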
![Page 33: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/33.jpg)
Making interactive Big Data Applications Fast AND Easy
Holden Karau
![Page 34: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/34.jpg)
![Page 35: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/35.jpg)
![Page 36: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/36.jpg)
![Page 37: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/37.jpg)
![Page 38: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/38.jpg)
![Page 39: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/39.jpg)
![Page 40: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/40.jpg)
![Page 41: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/41.jpg)
![Page 42: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/42.jpg)
![Page 43: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/43.jpg)
![Page 44: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/44.jpg)
![Page 45: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/45.jpg)
![Page 46: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/46.jpg)
![Page 47: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/47.jpg)
![Page 48: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/48.jpg)
![Page 49: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/49.jpg)
![Page 50: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/50.jpg)
![Page 51: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/51.jpg)
![Page 52: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/52.jpg)
![Page 53: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/53.jpg)
![Page 54: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/54.jpg)
![Page 55: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/55.jpg)
Each line of code is an RDD!

```scala
val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out.txt")
```
![Page 56: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/56.jpg)
Supported operations
![Page 57: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/57.jpg)
Built-in libraries
● A wide range of functionality is packaged as RDD operations
● Caching + the DAG model are efficient enough to run these workloads
● Bundling all the libraries into a single program is faster than stitching separate systems together
![Page 58: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/58.jpg)
![Page 59: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/59.jpg)
![Page 60: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/60.jpg)
![Page 61: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/61.jpg)
MLlib
Vectors, Matrices = RDD[Vector]
Iterative computation

```python
points = sc.textFile("data.txt").map(parsePoint)
model = KMeans.train(points, 10)
model.predict(newPoint)
```
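KMeans is iterative: every pass re-reads the same points, which is exactly where caching the RDD in memory pays off. Below is a minimal pure-Python version of one Lloyd iteration on 1-D points, as an illustration only (not MLlib code); the points and initial centers are made up.

```python
points = [1.0, 1.5, 8.0, 9.0]   # hypothetical 1-D data
centers = [0.0, 10.0]           # hypothetical initial centers

def closest(p, centers):
    """Index of the center nearest to point p."""
    return min(range(len(centers)), key=lambda i: abs(p - centers[i]))

# Assignment step: map each point to its nearest center.
assign = {}
for p in points:
    assign.setdefault(closest(p, centers), []).append(p)

# Update step: move each center to the mean of its assigned points.
centers = [sum(ps) / len(ps) for _, ps in sorted(assign.items())]
```

In Spark, both steps run as RDD transformations over the cached point set, repeated until the centers stop moving.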
![Page 62: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/62.jpg)
GraphX
Represents graphs as RDDs of vertices and edges.
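A plain-Python sketch of that representation, with hypothetical vertices and edges, computing out-degrees the way one would with a `reduceByKey` over the edge list:

```python
# GraphX keeps a graph as two distributed collections; locally,
# two plain lists stand in for them (sample data is invented).
vertices = [(1, "alice"), (2, "bob"), (3, "carol")]  # (id, attribute)
edges = [(1, 2), (1, 3), (2, 3)]                     # (src, dst)

# Out-degree: emit (src, 1) per edge, then sum by key.
out_degree = {}
for src, _dst in edges:
    out_degree[src] = out_degree.get(src, 0) + 1
```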
![Page 63: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/63.jpg)
![Page 64: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/64.jpg)
Conclusion
We want to unify all of your data sources, workloads, and environments.
![Page 65: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/65.jpg)
Q&A