Spark Introduction, Part 2
TRANSCRIPT
![Page 1: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/1.jpg)
Lightning-fast cluster computing
![Page 2: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/2.jpg)
A quick review
![Page 3: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/3.jpg)
Problem
![Page 4: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/4.jpg)
Solution
MapReduce?
![Page 5: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/5.jpg)
Turn everything into MapReduce!
But how would you turn a SQL query like this into MapReduce?

```sql
SELECT LAT_N, CITY, TEMP_F
FROM STATS, STATION
WHERE MONTH = 7 AND STATS.ID = STATION.ID
ORDER BY TEMP_F;
```
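To get a feel for why this is awkward, here is a rough pure-Python sketch (no Hadoop involved) of how that query could be decomposed into map / shuffle / reduce phases. The table and column names follow the query; the sample rows are invented for illustration.

```python
STATS = [  # (ID, MONTH, TEMP_F) -- made-up sample rows
    (1, 7, 78.0), (2, 7, 91.5), (1, 6, 70.1),
]
STATION = [  # (ID, CITY, LAT_N) -- made-up sample rows
    (1, "Denver", 39.7), (2, "Phoenix", 33.4),
]

# Map phase: tag each record with its join key (ID); apply MONTH = 7 early.
mapped = [(id_, ("stats", temp)) for id_, month, temp in STATS if month == 7]
mapped += [(id_, ("station", city, lat)) for id_, city, lat in STATION]

# Shuffle: group records by key, as the framework does between map and reduce.
groups = {}
for key, rec in mapped:
    groups.setdefault(key, []).append(rec)

# Reduce phase: join the two sides within each key group.
joined = []
for key, recs in groups.items():
    stats_side = [r for r in recs if r[0] == "stats"]
    station_side = [r for r in recs if r[0] == "station"]
    for (_, temp) in stats_side:
        for (_, city, lat) in station_side:
            joined.append((lat, city, temp))

# ORDER BY TEMP_F needs yet another pass (a second MapReduce job in practice).
result = sorted(joined, key=lambda row: row[2])
```

Even this toy version takes three distinct passes for one line of SQL, which is the pain point the slide is making.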
![Page 6: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/6.jpg)
Turn everything into MapReduce!
What about machine learning / data analysis jobs like this one?
"From the nationwide real-estate transaction data published monthly since 2007, pick out the 5 meaningful variables among the 140 that could have an effect." "Oh, and the deadline is tomorrow."
![Page 7: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/7.jpg)
And look at the code. (This merely counts words…)

```java
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
```
![Page 8: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/8.jpg)
Tools naturally get better as time goes by.
![Page 9: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/9.jpg)
Generality
Its set of high-level tools lets you handle all of these workloads in one place.
![Page 10: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/10.jpg)
Easy to use.
Supports Java, Scala, and Python.
```python
text_file = spark.textFile("hdfs://...")

text_file.flatMap(lambda line: line.split()) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda a, b: a + b)
```
Word count in Spark's Python API
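The same pipeline can be mimicked in plain Python (no Spark) to show what each chained call computes; `lines` here is made-up sample input, while in Spark the identical logic runs partitioned across a cluster.

```python
lines = ["to be or not", "to be"]  # stand-in for the HDFS text file

# flatMap: split every line and flatten the results into one stream of words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each distinct word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n
```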
![Page 11: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/11.jpg)
It runs in all kinds of distributed environments.
● It runs on Hadoop, on Mesos, standalone, or in the cloud.
● It can also pull data from HDFS, Cassandra, HBase, S3, and more.
![Page 12: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/12.jpg)
It's fast, too.
Up to 100x faster than Hadoop MapReduce when run in memory, and 10x faster on disk.
Logistic regression in Hadoop and Spark
![Page 13: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/13.jpg)
It even has its own web UI….
![Page 14: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/14.jpg)
About Spark
● It is a tool, not a library. ○ You define the jobs you want done on top of this tool ○ and then execute them.
![Page 15: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/15.jpg)
Starting from standalone mode
Part 2: Let's try it!
![Page 16: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/16.jpg)
vagrant up / vagrant ssh
![Page 17: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/17.jpg)
spark-shell
![Page 18: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/18.jpg)
pyspark: the Python Spark shell
![Page 19: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/19.jpg)
Wordcount: Scala

```scala
val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out.txt")
```
![Page 20: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/20.jpg)
Wordcount: Scala

```scala
val f = sc.textFile("README.md")
```

```scala
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
```

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
![Page 21: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/21.jpg)
Wordcount: Scala

```scala
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
```
![Page 22: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/22.jpg)
Wordcount: Scala

```scala
val wc = f.flatMap(l => l.split(" "))
```
: split each line into individual words
![Page 23: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/23.jpg)
Wordcount: Scala

```scala
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1))
```
: build (key, value) pairs of (each word, 1)
![Page 24: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/24.jpg)
Wordcount: Scala

```scala
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
```
: then add them all up, key by key.
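To make the `_ + _` part concrete, here is a plain-Python sketch (not Spark code) of how `reduceByKey` applies the function pairwise, first inside each partition and then across partitions, which is why the function must be associative. The two partitions and their contents are hypothetical.

```python
combine_fn = lambda a, b: a + b  # plays the role of `_ + _` in the Scala code

def combine(pairs):
    """Fold a list of (key, value) pairs into a dict using combine_fn."""
    out = {}
    for k, v in pairs:
        out[k] = combine_fn(out[k], v) if k in out else v
    return out

partitions = [
    [("spark", 1), ("is", 1), ("spark", 1)],  # partition 0 (hypothetical)
    [("fast", 1), ("spark", 1)],              # partition 1 (hypothetical)
]

# Map-side combine inside each partition, then merge the partial results.
partials = [combine(p) for p in partitions]
merged = combine([(k, v) for d in partials for k, v in d.items()])
```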
![Page 25: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/25.jpg)
![Page 26: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/26.jpg)
Wordcount: Scala

```
scala> wc.take(20)
...
finished: take at <console>:26, took 0.081425 s
res6: Array[(String, Int)] = Array((package,1), (For,2), (processing.,1), (Programs,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), ("yarn-cluster",1), (graph,1), (Hive,2), (storage,1))
```
![Page 27: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/27.jpg)
Wordcount: Scala

```scala
wc.saveAsTextFile("wc_out.txt")
```
Save the result to a file.
![Page 28: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/28.jpg)
![Page 29: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/29.jpg)
What happens if we run the code above like this?
Simplifying Big Data Analysis with Apache Spark (Matei Zaharia, April 27, 2015)
![Page 30: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/30.jpg)
Let's move from disk to memory.
![Page 31: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/31.jpg)
In other words, you just hand each node in the cluster its share of the work and the commands to run.
![Page 32: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/32.jpg)
Spark Model
● Writing programs that transform data step by step
● Resilient Distributed Datasets (RDDs) ○ Collections of objects, stored in memory or on disk, distributed across the cluster ○ Built through parallel transformations (map, filter, …) ○ Automatically rebuilt on failure
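A toy class (deliberately not Spark's API) can illustrate the RDD idea sketched above: transformations only record lineage, an action materializes the data, and a lost partition can be rebuilt by replaying that lineage.

```python
class ToyRDD:
    """Minimal lineage-tracking dataset; an illustration, not Spark."""

    def __init__(self, source, parent=None, op=None):
        self.source, self.parent, self.op = source, parent, op

    def map(self, fn):          # transformation: records lineage only
        return ToyRDD(None, parent=self, op=("map", fn))

    def filter(self, pred):     # transformation: records lineage only
        return ToyRDD(None, parent=self, op=("filter", pred))

    def collect(self):          # action: walk the lineage and compute
        if self.parent is None:
            return list(self.source)
        data = self.parent.collect()  # recomputing from the parent is
        kind, fn = self.op            # exactly how a lost partition recovers
        if kind == "map":
            return [fn(x) for x in data]
        return [x for x in data if fn(x)]

rdd = ToyRDD(range(5)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; calling rdd.collect() triggers the whole chain.
```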
![Page 33: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/33.jpg)
Making interactive Big Data Applications Fast AND Easy
Holden Karau
![Page 34: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/34.jpg)
![Page 35: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/35.jpg)
![Page 36: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/36.jpg)
![Page 37: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/37.jpg)
![Page 38: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/38.jpg)
![Page 39: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/39.jpg)
![Page 40: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/40.jpg)
![Page 41: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/41.jpg)
![Page 42: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/42.jpg)
![Page 43: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/43.jpg)
![Page 44: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/44.jpg)
![Page 45: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/45.jpg)
![Page 46: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/46.jpg)
![Page 47: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/47.jpg)
![Page 48: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/48.jpg)
![Page 49: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/49.jpg)
![Page 50: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/50.jpg)
![Page 51: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/51.jpg)
![Page 52: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/52.jpg)
![Page 53: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/53.jpg)
![Page 54: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/54.jpg)
![Page 55: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/55.jpg)
Each line of code is an RDD!

```scala
val f = sc.textFile("README.md")
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wc.saveAsTextFile("wc_out.txt")
```
![Page 56: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/56.jpg)
Supported operations
![Page 57: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/57.jpg)
Built-in libraries
● A wide range of functionality is packaged as RDD operations
● Caching + the DAG model are efficient enough to run these workloads
● Bundling all the libraries into a single program is faster than stitching separate systems together
![Page 58: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/58.jpg)
![Page 59: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/59.jpg)
![Page 60: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/60.jpg)
![Page 61: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/61.jpg)
MLlib
Vectors, Matrices = RDD[Vector]
Iterative computation

```python
points = sc.textFile("data.txt").map(parsePoint)
model = KMeans.train(points, 10)
model.predict(newPoint)
```
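KMeans is iterative: every pass re-reads the same points, which is exactly where caching the RDD in memory pays off. Below is a minimal pure-Python version of one Lloyd iteration on 1-D points, as an illustration only (not MLlib code); the points and initial centers are made up.

```python
points = [1.0, 1.5, 8.0, 9.0]   # hypothetical 1-D data
centers = [0.0, 10.0]           # hypothetical initial centers

def closest(p, centers):
    """Index of the center nearest to point p."""
    return min(range(len(centers)), key=lambda i: abs(p - centers[i]))

# Assignment step: map each point to its nearest center.
assign = {}
for p in points:
    assign.setdefault(closest(p, centers), []).append(p)

# Update step: move each center to the mean of its assigned points.
centers = [sum(ps) / len(ps) for _, ps in sorted(assign.items())]
```

In Spark, both steps run as RDD transformations over the cached point set, repeated until the centers stop moving.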
![Page 62: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/62.jpg)
GraphX
Represents graphs as RDDs of vertices and edges.
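A plain-Python sketch of that representation, with hypothetical vertices and edges, computing out-degrees the way one would with a `reduceByKey` over the edge list:

```python
# GraphX keeps a graph as two distributed collections; locally,
# two plain lists stand in for them (sample data is invented).
vertices = [(1, "alice"), (2, "bob"), (3, "carol")]  # (id, attribute)
edges = [(1, 2), (1, 3), (2, 3)]                     # (src, dst)

# Out-degree: emit (src, 1) per edge, then sum by key.
out_degree = {}
for src, _dst in edges:
    out_degree[src] = out_degree.get(src, 0) + 1
```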
![Page 63: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/63.jpg)
![Page 64: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/64.jpg)
Conclusion
We want to unify all of your data sources, workloads, and environments.
![Page 65: Spark 소개 2부](https://reader030.vdocuments.net/reader030/viewer/2022020113/587110521a28abac6d8b5985/html5/thumbnails/65.jpg)
Q&A