Apache Spark session

Post on 10-Aug-2015





  1. Sandeep Giri, Hadoop. Apache Spark: a fast and general engine for large-scale data processing. Really fast Hadoop: up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Builds on paradigms similar to Hadoop's and integrates with Hadoop.
  2. Apache Spark
  3. INSTALLING ON YARN (already installed on hadoop1.knowbigdata.com)
     Log in as root:
         wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.4.tgz
         tar zxvf spark-1.1.0-bin-hadoop2.4.tgz && rm spark-1.1.0-bin-hadoop2.4.tgz
         mv spark-1.1.0-bin-hadoop2.4 /usr/lib/
         cd /usr/lib; ln -s spark-1.1.0-bin-hadoop2.4/ spark
     Log in as student:
         /usr/lib/spark/bin/pyspark
  4. SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET (RDD)
     A collection of elements partitioned across the cluster:
         lines = sc.textFile('hdfs://hadoop1.knowbigdata.com/user/student/sgiri/wordcount/input/big.txt')
     An RDD can be persisted in memory.
     An RDD recovers automatically from node failures.
     An RDD can hold any data type, with a special dataset type for key-value pairs.
     An RDD supports two types of operations: transformations and actions.
     Each element of the RDD, across the cluster, is run through the map function.
  5. SPARK - TRANSFORMATIONS
         JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
             public Integer call(String s) { return s.length(); }
         });
     A transformation creates a new dataset from an existing one. The result can be kept in memory with persist() or cache().
  6. SPARK - TRANSFORMATIONS
     map(func): Return a new distributed dataset formed by passing each element of the source through a function func. Analogous to FOREACH in Pig.
     filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
     flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
     groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
     See more: sample, union, intersection, distinct, reduceByKey, sortByKey, join
     https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html
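The semantics of the four transformations described on this slide can be sketched on an ordinary Python list. This is a local, single-machine illustration and not Spark code: the sample data and the helper name `group_by_key` are assumptions made for the example, and real RDD transformations are lazy and run across the cluster.

```python
from collections import defaultdict

# Hypothetical sample data, standing in for the lines of an RDD.
lines = ["to be or", "not to be"]

# map(func): exactly one output element per input element.
line_lengths = list(map(len, lines))

# filter(func): keep only the elements for which func returns True.
long_lines = list(filter(lambda s: len(s) > 8, lines))

# flatMap(func): each input element yields 0 or more outputs, flattened into one list.
words = [word for line in lines for word in line.split(" ")]

def group_by_key(pairs):
    """Local stand-in for groupByKey: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# groupByKey: (K, V) pairs become (K, list of V) pairs.
grouped = group_by_key((word, 1) for word in words)

print(line_lengths)  # [8, 9]
print(long_lines)    # ['not to be']
print(words)         # ['to', 'be', 'or', 'not', 'to', 'be']
print(grouped)       # {'to': [1, 1], 'be': [1, 1], 'or': [1], 'not': [1]}
```

Note how map preserves the number of elements while flatMap changes it, which is why the word count slides later in the deck split lines with flatMap rather than map.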
  7. SPARK - ACTIONS
         int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
             public Integer call(Integer a, Integer b) { return a + b; }
         });
     An action returns a value to the driver.
  8. SPARK - ACTIONS
     reduce(func): Aggregate the elements of the dataset using a function that takes two arguments and returns one. The function must be commutative and associative so that it can be computed in parallel.
     count(): Return the number of elements in the dataset.
     collect(): Return all elements of the dataset as an array at the driver. Use only for small outputs.
     take(n): Return an array with the first n elements of the dataset. Not parallel.
     See more: first(), takeSample(), takeOrdered(), saveAsTextFile(path)
     https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html
  9. SPARK - EXAMPLE - REDUCE SUM FUNCTION
         // Single node
         lines = ["san giri g", "san giri", "giri", "bhagwat kumar", "mr. shashank sharma", "anto"]
         lineLengths = [10, 8, 4, 13, 19, 4]
         sum = ???
         // Node 1
         lines = ["san giri g", "san giri", "giri"]
         lineLengths = [10, 8, 4]
         totalLength = [18, 4]
         totalLength = 22   // sum or min or max or sqrt(a*a + b*b)
         // Node 2
         lines = ["bhagwat kumar"]
         lineLengths = [13]
         totalLength = 13
         // Node 3
         lines = ["mr. shashank sharma", "anto"]
         lineLengths = [19, 4]
         totalLength = 23
         // Driver node
         lineLengths = [22, 13, 23]
         lineLength = [35, 23]
         lineLength = [58]
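The per-node reduction on this slide can be imitated locally to check why the reduce function has to be commutative and associative. The sketch below is plain Python, not Spark: `distributed_reduce` is a hypothetical helper, and the three-way split simply copies the node assignment shown on the slide.

```python
from functools import reduce
from operator import add, sub

lines = ["san giri g", "san giri", "giri",
         "bhagwat kumar", "mr. shashank sharma", "anto"]

# Same split as on the slide: node 1, node 2, node 3.
partitions = [lines[0:3], lines[3:4], lines[4:6]]

def distributed_reduce(func, partitioned_values):
    """Reduce each partition on its own, then reduce the partial results at the 'driver'."""
    partials = [reduce(func, part) for part in partitioned_values]
    return reduce(func, partials)

line_lengths = [len(s) for s in lines]
partitioned_lengths = [[len(s) for s in part] for part in partitions]

# Commutative and associative functions give the same answer however the data is split.
for func in (add, min, max):
    assert reduce(func, line_lengths) == distributed_reduce(func, partitioned_lengths)

# Subtraction is neither, so the partitioned result no longer matches the single-node one.
single_node_diff = reduce(sub, line_lengths)
distributed_diff = distributed_reduce(sub, partitioned_lengths)
print(single_node_diff == distributed_diff)  # False
```

The subtraction case shows what goes wrong when the requirement is ignored: the result then depends on how the elements happen to be partitioned.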
  10. SPARK - SHARED MEMORY - Broadcast Variables
          Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});
          broadcastVar.value(); // returns [1, 2, 3]
      Create a broadcast variable with broadcast() and read it with broadcast.value().
  11. SPARK - SHARED MEMORY - Accumulators
          Accumulator<Integer> accum = sc.accumulator(0);
          sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> accum.add(x));
          accum.value(); // returns 10
      Accumulators are only added to (for example += 10, += 20), through an associative operation.
      Associative: (2+3)+4 = 2+(3+4) = 9
  12. Word count example
          # Import regular expressions
          import re
          # Load the file
          lines = sc.textFile('hdfs://hadoop1.knowbigdata.com/user/student/sgiri/wordcount/input/big.txt')
          # Split each line into words
          fm = lines.flatMap(lambda line: line.split(" "))
          # Lowercase each word, keep only alphanumerics, and pair it with a count of 1
          m = fm.map(lambda word: (re.sub(r"[^A-Za-z0-9]*", "", word.lower()), 1))
          # Run reduce
          counts = m.reduceByKey(lambda a, b: a + b)
          counts.count()
          counts.saveAsTextFile('hdfs://hadoop1.knowbigdata.com/user/student/sgiri/wordcount/output/spark')
  13. WordCount with accumulator and broadcast
          import re
          lines = sc.textFile('hdfs://hadoop1.knowbigdata.com/user/student/sgiri/wordcount/input/big.txt')
          # Common words to drop, shared with every worker
          common = sc.broadcast({"a":1, "an":1, "the":1, "this":1, "that":1, "of":1, "is":1})
          # Counts how many (word, 1) pairs pass through the filter
          accum = sc.accumulator(0)
          fm = lines.flatMap(lambda line: line.split(" "))
          m = fm.map(lambda word: (re.sub(r"[^A-Za-z0-9]*", "", word.lower()), 1))
          def filterfunc(k):
              accum.add(1)
              return k[0] not in common.value
          cleaned = m.filter(filterfunc)
          cleaned.take(10)
          counts = cleaned.reduceByKey(lambda a, b: a + b)
          counts.count()
          counts.saveAsTextFile('hdfs://hadoop1.knowbigdata.com/user/student/sgiri/wordcount/output/spark')
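To try the word count logic without a cluster, the same steps can be run on a small in-memory list. This is a minimal local sketch, not PySpark: the function name `word_count`, the sample sentences, and the plain counter `seen` (standing in for the accumulator) are all assumptions made for the illustration, and a plain set stands in for the broadcast variable.

```python
import re
from collections import Counter

# Stands in for the broadcast dictionary of common words.
common = {"a", "an", "the", "this", "that", "of", "is"}

def word_count(lines, stop_words):
    """Count cleaned words that are not stop words; also report how many words were examined."""
    seen = 0            # plays the role of the accumulator
    counts = Counter()  # plays the role of reduceByKey(lambda a, b: a + b)
    for line in lines:
        for word in line.split(" "):  # flatMap step
            key = re.sub(r"[^A-Za-z0-9]*", "", word.lower())  # map step
            seen += 1
            if key not in stop_words:  # filter step
                counts[key] += 1
    return counts, seen

# Hypothetical input, standing in for big.txt.
sample = ["This is the house", "the house of Jack."]
counts, seen = word_count(sample, common)

print(dict(counts))  # {'house': 2, 'jack': 1}
print(seen)          # 8
```

In real Spark the accumulator is only updated once an action such as count() or saveAsTextFile() forces the filter to run, and it can be incremented more than once per element if a stage is recomputed, so treat its value as a diagnostic rather than an exact count.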