Big Data Analytics with Spark & Cassandra_
JUG Stuttgart 01/2016
Matthias Niehoff
• Cassandra
• Spark
• Spark & Cassandra
• Spark Applications
• Spark Streaming
• Spark SQL
• Spark MLLib
Agenda_
2
Cassandra
3
•Distributed database
•Highly Available
•Linear Scalable
•Multi Datacenter Support
•No Single Point Of Failure
•CQL Query Language • Similar to SQL • No joins and no aggregates
• Eventual Consistency ("Tunable Consistency")
Cassandra_
4
Distributed Data Storage_
5
[Diagram: Cassandra token ring with four nodes; Node 1 owns tokens 1-25, Node 2: 26-50, Node 3: 51-75, Node 4: 76-0]
CQL - Querying Language With Limitations_
6
SELECT * FROM performer WHERE name = 'ACDC' -> ok
SELECT * FROM performer WHERE name = 'ACDC' AND country = 'Australia' -> not ok
SELECT country, COUNT(*) AS quantity FROM artists GROUP BY country ORDER BY quantity DESC -> not supported

Table performer: name (PK), genre, country
Spark
7
•Open Source & Apache project since 2010
•Data processing Framework • Batch processing • Stream processing
What Is Apache Spark_
8
•Fast • up to 100 times faster than Hadoop • a lot of in-memory processing • linearly scalable by adding more nodes
• Easy • Scala, Java and Python APIs • clean code (e.g. with lambdas in Java 8) • extensive API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count
• Fault-Tolerant • easily reproducible
Why Use Spark_
9
•RDDs – Resilient Distributed Datasets • read-only description of a collection of objects • distributed among the cluster (in memory or on disk) • defined through transformations • allows automatic rebuild on failure
•Operations • Transformations (map, filter, reduce, ...) -> new RDD • Actions (count, collect, save)
•Only actions start processing!
Easily Reproducible?_
10
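To make the last point concrete, a minimal sketch (assuming a Spark shell where sc is available): the two transformations below only record the RDD lineage, and nothing is computed until the count() action runs.

// Transformations: only the RDD description is built up, no processing yet
val numbers = sc.parallelize(1 to 1000)
val evens = numbers.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// Action: triggers the actual computation of the whole chain
println(doubled.count()) // 500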
•Partitions • describes the partitions (e.g. one per Cassandra partition)
•Dependencies • dependencies on parent RDDs
•Compute • the function to compute the RDD's partitions
•(Optional) Partitioner • how is the data partitioned? (hash, range, ...)
•(Optional) Preferred Locations • where to get the data (e.g. list of Cassandra node IPs)
Properties Of An RDD_
11
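As a sketch of how these five properties surface in code: a custom RDD in the Spark 1.x Scala API overrides roughly the members below. The class name and the ??? bodies are placeholders, not a working data source.

import org.apache.spark.{Dependency, Partition, Partitioner, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative only: the five RDD properties as overridable members (sc is assumed)
class MyRDD(sc: SparkContext) extends RDD[String](sc, Nil) {
  // Partitions: one entry per partition (e.g. one per Cassandra partition)
  override protected def getPartitions: Array[Partition] = ???
  // Compute: produce the elements of a single partition
  override def compute(split: Partition, context: TaskContext): Iterator[String] = ???
  // Dependencies: parent RDDs (empty here, as for a source RDD)
  override protected def getDependencies: Seq[Dependency[_]] = Nil
  // Optional partitioner: how keys map to partitions (hash, range, ...)
  override val partitioner: Option[Partitioner] = None
  // Optional preferred locations: where a partition's data lives (e.g. Cassandra replica IPs)
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}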
RDD Example_
12
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

scala> linesWithSpark.count()
res0: Long = 126
Reproduce RDDs Using A Tree_
13
[Diagram: RDD lineage tree. A data source feeds rdd1 and rdd2; transformations such as map(..), filter(..), sample(..) and union(..) derive rdd3 to rdd6; count() actions produce val1 to val3; cache() marks an intermediate RDD for reuse]
•Transformations • map, flatMap • sample, filter, distinct • union, intersection, cartesian
•Actions • reduce • count • collect, first, take • saveAsTextFile • foreach
Spark Transformations & Actions_
14
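A small sketch combining several of the listed operations (Spark shell assumed; collect results are unordered because the data is distributed):

val a = sc.parallelize(Seq(1, 2, 3, 4))
val b = sc.parallelize(Seq(3, 4, 5))

a.union(b).distinct().collect() // transformations + action: Array(1, 2, 3, 4, 5), order may vary
a.intersection(b).collect()     // Array(3, 4), order may vary
a.reduce(_ + _)                 // action: 10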
Run Spark In A Cluster_
15
•Memory • a lot of data in memory • more memory -> less disk I/O -> faster processing • minimum 8 GB / node
•Network • communication between driver, cluster manager & workers • important for reduce operations • 10 Gigabit LAN or better
•CPU • little communication between threads • good to parallelize • minimum 8-16 cores / node
What About Hardware?_
16
•Master Web UI (8080)
How To Monitor? (1/3)_
17
•Worker Web UI (8081)
How To Monitor? (2/3)_
18
•Application Web UI (4040)
How To Monitor? (3/3)_
19
(key: [atomic, collection, object], value: [atomic, collection, object])

val fluege = List(("Thomas", "Berlin"), ("Mark", "Paris"), ("Thomas", "Madrid"))
val pairRDD = sc.parallelize(fluege)
pairRDD.filter(_._1 == "Thomas").collect.foreach(t => println(t._1 + " flog nach " + t._2))
Pair RDDs_
20
key (not unique) – value
• Parallelization! • keys are used for partitioning • pairs with different keys are distributed across the cluster
• Efficient processing of • aggregate by key • group by key • sort by key • joins and union based on keys
Why Use Pair RDDs_
21
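Building on the fluege pair RDD from slide 20, a minimal sketch of aggregation by key; reduceByKey shuffles pairs so that all values of a key meet on one node:

// flights per person: ("Thomas", 2), ("Mark", 1)
pairRDD
  .mapValues(_ => 1)  // ("Thomas", 1), ("Mark", 1), ("Thomas", 1)
  .reduceByKey(_ + _) // aggregate per key
  .collect
  .foreach(println)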
RDD Dependencies_
22
"Narrow" (pipeline-able)
• map, filter, union
• join on co-partitioned data
RDD Dependencies_
23
"Wide" (shuffle)
• groupBy on non-partitioned data
• join on non-co-partitioned data
Spark Demo
24
Spark & Cassandra
25
Use Spark And Cassandra In A Cluster_
26
[Diagram: a Spark client submits the Spark driver; the Spark master coordinates Spark worker nodes (WN), each co-located with a Cassandra node (C*)]
Two Datacenter - Two Purposes_
27
[Diagram: two datacenters. DC1 (Online) contains only Cassandra nodes; DC2 (Analytics) contains Cassandra nodes co-located with Spark worker nodes and a Spark master]
•Spark Cassandra Connector by DataStax • https://github.com/datastax/spark-cassandra-connector
• Cassandra tables as Spark RDDs (read & write)
• Mapping of C* tables and rows onto Java/Scala objects
• Server-side filtering ("where")
• Compatible with • Spark ≥ 0.9 • Cassandra ≥ 2.0
•Clone & compile with SBT or download from Maven Central
Connecting Spark With Cassandra_
28
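Outside the shell, the same connection is configured on the SparkConf before the context is created. A minimal sketch; the app name and host are assumptions, the property key is the one used throughout these slides:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spark-cassandra-demo")                  // assumed name
  .set("spark.cassandra.connection.host", "127.0.0.1") // Cassandra contact point
val sc = new SparkContext(conf)
// the import above adds sc.cassandraTable(...) via implicits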
• Start the shell:
bin/spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0.jar --conf spark.cassandra.connection.host=localhost
• Import the Cassandra classes:
scala> import com.datastax.spark.connector._
Use The Connector In The Shell_
29
• Read a complete table:
val movies = sc.cassandraTable("movie", "movies") // returns CassandraRDD[CassandraRow]
• Read selected columns:
val movies = sc.cassandraTable("movie", "movies").select("title", "year")
• Filter rows:
val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'")
• Access columns in the result set:
movies.collect.foreach(r => println(r.get[String]("title")))
Read A Cassandra Table_
30
Read As Tuple
val movies = sc.cassandraTable[(String, Int)]("movie", "movies").select("title", "year")
val movies = sc.cassandraTable("movie", "movies").select("title", "year").as((_: String, _: Int))
// both result in a CassandraRDD[(String, Int)]
Read A Cassandra Table_
31
Read As Case Class
case class Movie(title: String, year: Int)
sc.cassandraTable[Movie]("movie", "movies").select("title", "year")
sc.cassandraTable("movie", "movies").select("title", "year").as(Movie)
Read A Cassandra Table_
32
•Every RDD can be saved
• Using Tuples
val tuples = sc.parallelize(Seq(("Hobbit", 2012), ("96 Hours", 2008)))
tuples.saveToCassandra("movie", "movies", SomeColumns("title", "year"))
• Using Case Classes
case class Movie(title: String, year: Int)
val objects = sc.parallelize(Seq(Movie("Hobbit", 2012), Movie("96 Hours", 2008)))
objects.saveToCassandra("movie", "movies")
Write Table_
33
// Load and format as pair RDD
val pairRDD = sc.cassandraTable("movie", "director").map(r => (r.getString("country"), r))

// Directors per country, sorted
pairRDD.mapValues(v => 1).reduceByKey(_ + _).sortBy(_._2, ascending = false).collect.foreach(println)

// or, unsorted
pairRDD.countByKey().foreach(println)

// All countries
pairRDD.keys
Pair RDDs With Cassandra_
34
Table director: name text (K), country text
• Joins can be expensive as they may require shuffling
val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))
val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))
movies.join(directors) // RDD[(String, (CassandraRow, CassandraRow))]
Pair RDDs With Cassandra - Join
35
Table director: name text (K), country text
Table movie: title text (K), director text
•Automatically on read
•Not automatically on write • non-shuffling Spark operations -> writes are local • shuffling Spark operations -> either fan-out writes to Cassandra, or repartitionByCassandraReplica("keyspace", "table") before the write
• Joins with data locality
Using Data Locality With Cassandra_
36
sc.cassandraTable[CassandraRow](KEYSPACE, A)
  .repartitionByCassandraReplica(KEYSPACE, B)
  .joinWithCassandraTable[CassandraRow](KEYSPACE, B)
  .on(SomeColumns("id"))
•cassandraCount() • executes the count as a Cassandra query • instead of loading the table into memory and counting in Spark (see the one-line sketch below)
•spanBy(), spanByKey() • group data by Cassandra partition key • need no shuffling • should be preferred over groupBy/groupByKey
CREATE TABLE events (year int, month int, ts timestamp, data varchar, PRIMARY KEY (year,month,ts));
sc.cassandraTable("test","events").spanBy(row=>(row.getInt("year"),row.getInt("month")))sc.cassandraTable("test","events").keyBy(row=>(row.getInt("year"),row.getInt("month"))).spanByKey
Further Transformations & Actions_
37
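For completeness, a one-line sketch of the cassandraCount() variant mentioned above, reusing the movie keyspace from the earlier examples:

// The count is executed by Cassandra; rows are not loaded into Spark memory
val movieCount = sc.cassandraTable("movie", "movies").cassandraCount()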
Spark & Cassandra Demo
38
Create an Application
39
•Normal Scala Application
•SBT as build tool
•source in src/main/scala-2.10
•assembly.sbt in root and project directory
•build.sbt in root directory
• sbt assembly to build
Scala Application_
40
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.3.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.3.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.3.1" % "provided"
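A minimal sketch of the assembly.sbt mentioned above; the plugin version is an assumption for an sbt 0.13 setup:

// project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")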
•Normal Java Application
•Java 8!
•MVN as build tool
•source in src/main/java
•in pom.xml • dependencies (spark-core, spark-streaming, spark-mllib, spark-cassandra-connector) • assembly-plugin or shade-plugin
• mvn clean install to build
Java Application_
41
•Special classes for Java
SparkConf conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("Java")
    .set("spark.cassandra.connection.host", "127.0.0.1");

JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1L));

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
rdd.filter(e -> e % 2 == 0).foreach(System.out::println);
Java Specials_
42
•Special classes for Java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.*;

CassandraTableScanJavaRDD<CassandraRow> table =
    javaFunctions(sc.sparkContext()).cassandraTable("keyspace", "table");

CassandraTableScanJavaRDD<Entity> table =
    javaFunctions(sc.sparkContext()).cassandraTable("keyspace", "table", mapRowTo(Entity.class));

javaFunctions(someRDD).writerBuilder("keyspace", "table", mapToRow(Entity.class)).saveToCassandra();
Java Specials - Cassandra_
43
Spark SQL
44
• SQL queries with Spark (SQL & HiveQL) • on structured data • on DataFrames • every result of Spark SQL is a DataFrame • all operations of the generic RDDs are available
• Supports (even on non-primary-key columns) • joins • union • group by • having • order by
Spark SQL_
45
val sqlContext = new SQLContext(sc)
val persons = sqlContext.jsonFile(path)

// Show the schema
persons.printSchema()

persons.registerTempTable("persons")

val adults = sqlContext.sql("SELECT name FROM persons WHERE age > 18")
adults.collect.foreach(println)
Spark SQL - JSON Example_
46
{"name":"Michael"}{"name":"Jan","age":30}{"name":"Tim","age":17}
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("musicdb")

val result = csc.sql("SELECT country, COUNT(*) AS anzahl " +
  "FROM artists GROUP BY country " +
  "ORDER BY anzahl DESC")

result.collect.foreach(println)
Spark SQL - Cassandra Example_
47
Spark SQL Demo
48
Spark Streaming
49
• Real-time processing using micro-batches
• Supported sources: TCP, S3, Kafka, Twitter, ...
• Data as Discretized Stream (DStream)
• Same programming model as for batches
• All operations of the generic RDD & SQL & MLLib
• Stateful operations & sliding windows (sketched below)
Stream Processing With Spark Streaming_
50
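A hedged sketch of the sliding windows mentioned above, assuming a DStream[String] named lines (created e.g. as in the example that follows, with org.apache.spark.streaming._ imported): count occurrences per value over the last 30 seconds, recomputed every 10 seconds.

val windowedCounts = lines
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()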
import org.apache.spark.streaming._

val ssc = new StreamingContext(sc, Seconds(1))
val stream = ssc.socketTextStream("127.0.0.1", 9999)
stream.map(x => 1).reduce(_ + _).print()

ssc.start()

// await manual termination or error
ssc.awaitTermination()

// manual termination
ssc.stop()
Spark Streaming - Example_
51
•Maintain State for each key in a DStream: updateStateByKey
Spark Streaming - Stateful Operations_
52
def updateAlbumCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  val newCount = runningCount.getOrElse(0) + newValues.size
  Some(newCount)
}

val countStream = stream.updateStateByKey[Int](updateAlbumCount _)

// stream is a DStream of pair RDDs
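Note that stateful operations like updateStateByKey also require a checkpoint directory where Spark can persist the running state between batches. A one-line sketch; the path is an assumption, a reliable location such as HDFS is advisable in production:

ssc.checkpoint("checkpoint/")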
•One receiver -> one node • start more receivers and union them:

val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()

• Received data is split up into blocks • 1 block => 1 task • blocks per batch = batch interval / block interval
• Repartition data to distribute it over the cluster (sketched below)
Spark Streaming - Parallelism_
53
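A hedged sketch of the repartition step mentioned above: spread the received blocks over more partitions before expensive processing. The partition count is an assumption; tune it to the cores in your cluster.

val rebalanced = unifiedStream.repartition(20)
rebalanced.count().print()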
Spark Streaming Demo
54
Spark MLLib
55
• Fully integrated in Spark • Scalable • Scala, Java & Python APIs • Use with Spark Streaming & Spark SQL
• Packages various algorithms for machine learning
• Includes • Clustering • Classification • Prediction • Collaborative Filtering
• Still under development • performance, algorithms
Spark MLLib_
56
MLLib Example - Clustering_
57
[Diagram: scatter plot of data points (age vs. income), first as an unstructured set of data points, then grouped into meaningful clusters]
// Load and parse data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into 2 classes using KMeans with 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)

// Evaluate clustering by computing the Sum of Squared Errors
val SSE = clusters.computeCost(parsedData)
println("Sum of Squared Errors = " + SSE)
MLLib Example - Clustering (using KMeans)_
58
MLLib Example - Classification_
59
MLLib Example - Classification_
60
// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run the training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
MLLib Example - Classification (Linear SVM)_
61
// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
MLLib Example - Classification (Linear SVM)_
62
MLLib Example - Collaborative Filtering_
63
// Load and parse the data (user id, item id, rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)
MLLib Example - Collaborative Filtering using ALS_
64
// Evaluate the model on the rating data
val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }

val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}

val ratesAndPredictions = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)

val MSE = ratesAndPredictions.map {
  case ((user, product), (r1, r2)) =>
    val err = r1 - r2
    err * err
}.mean()
println("Mean Squared Error = " + MSE)
MLLib Example - Collaborative Filtering using ALS_
65
Use Cases
66
• In particular for huge amounts of external data
• Support for CSV, TSV, XML, JSON and other formats
Use Cases for Spark and Cassandra_
67
Data Loading
case class User(id: java.util.UUID, name: String)

val users = sc.textFile("users.csv")
  .repartition(2 * sc.defaultParallelism)
  .map(line => line.split(",") match {
    case Array(id, name) => User(java.util.UUID.fromString(id), name)
  })

users.saveToCassandra("keyspace", "users")
Validate consistency in a Cassandra database:
• syntactic • uniqueness (only relevant for columns not in the PK; sketched below) • referential integrity • integrity of duplicates
• semantic • business or application constraints • e.g. at least one genre per movie, a maximum of 10 tags per blog post
Use Cases for Spark and Cassandra_
68
Validation & Normalization
• Modelling, Mining, Transforming, ...
• Use cases • recommendation • fraud detection • link analysis (social networks, web) • advertising • data stream analytics (-> Spark Streaming) • machine learning (-> Spark MLLib)
Use Cases for Spark and Cassandra_
69
Analyses (Joins, Transformations,..)
• Changes on existing tables • a new table is required when changing the primary key • otherwise changes can be performed in place
• Creating new tables • data derived from existing tables • support for new queries
• Use the CassandraConnector in Spark (sketched below)
Use Cases for Spark and Cassandra_
70
Schema Migration
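A sketch of such a migration with the connector (keyspace, table and column names are assumptions): read the existing table, re-key the data, and write it into a new table with a different primary key.

sc.cassandraTable("shop", "users_by_id")
  .map(row => (row.getString("name"), row.getUUID("id")))
  .saveToCassandra("shop", "users_by_name", SomeColumns("name", "id"))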
Thank you for your attention!
71
Questions?
Matthias Niehoff, IT-Consultant
codecentric AG Zeppelinstraße 2 76185 Karlsruhe, Germany
mobile: +49 (0) 172.1702676 matthias.niehoff@codecentric.de
www.codecentric.de blog.codecentric.de
matthiasniehoff