Big Data Analytics with Spark & Cassandra
TRANSCRIPT
![Page 1: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/1.jpg)
Big Data Analytics with Spark & Cassandra_
JUG Stuttgart 01/2016
Matthias Niehoff
![Page 2: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/2.jpg)
• Cassandra
• Spark
• Spark & Cassandra
• Spark Applications
• Spark Streaming
• Spark SQL
• Spark MLLib
Agenda_
2
![Page 3: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/3.jpg)
Cassandra
3
![Page 4: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/4.jpg)
• Distributed database
• Highly Available
• Linearly Scalable
• Multi-Datacenter Support
• No Single Point Of Failure
• CQL Query Language
  • Similar to SQL
  • No joins and no aggregates
• Eventual Consistency ("Tunable Consistency")
Cassandra_
4
![Page 5: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/5.jpg)
Distributed Data Storage_
5
[Figure: token ring with four nodes (Node 1 to Node 4) and the token ranges 1-25, 26-50, 51-75 and 76-0]
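The ring on this slide can be sketched in a few lines of plain Scala. This is only an illustration: the assignment of ranges to nodes (Node 1 owns 1-25, Node 2 owns 26-50, Node 3 owns 51-75, Node 4 owns 76-0) and the toy hash function are assumptions, and Cassandra's default partitioner hashes the partition key with Murmur3 onto a far larger token space.

```scala
object TokenRingSketch {
  // Toy token function (assumption): maps a partition key onto tokens 0..99.
  def token(partitionKey: String): Int =
    Math.floorMod(partitionKey.hashCode, 100)

  // Assumed ownership of the four ranges shown on the slide.
  def nodeFor(token: Int): String = token match {
    case t if t >= 1 && t <= 25  => "Node 1"
    case t if t >= 26 && t <= 50 => "Node 2"
    case t if t >= 51 && t <= 75 => "Node 3"
    case _                       => "Node 4" // 76..99 and 0: the range wraps around
  }

  def main(args: Array[String]): Unit =
    Seq("ACDC", "Metallica", "Queen").foreach { key =>
      val t = token(key)
      println(s"$key -> token $t -> ${nodeFor(t)}")
    }
}
```

Because every row with the same partition key gets the same token, it always lands on the same node, which is why a lookup by partition key is cheap.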
![Page 6: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/6.jpg)
CQL - Querying Language With Limitations_
6
SELECT * FROM performer WHERE name = 'ACDC' -> ok
SELECT * FROM performer WHERE name = 'ACDC' and country = 'Australia' -> not ok
SELECT country, COUNT(*) as quantity FROM artists GROUP BY country ORDER BY quantity DESC -> not supported
Table performer: name (PK), genre, country
![Page 7: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/7.jpg)
Spark
7
![Page 8: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/8.jpg)
• Open source since 2010, Apache project since 2013
• Data processing framework
  • Batch processing
  • Stream processing
What Is Apache Spark_
8
![Page 9: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/9.jpg)
• Fast
  • Up to 100 times faster than Hadoop MapReduce
  • A lot of in-memory processing
  • Scales linearly with more nodes
• Easy
  • Scala, Java and Python API
  • Clean code (e.g. with lambdas in Java 8)
  • Rich API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count
• Fault-Tolerant
  • Easily reproducible
Why Use Spark_
9
![Page 10: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/10.jpg)
• RDDs (Resilient Distributed Datasets)
  • Read-only description of a collection of objects
  • Distributed across the cluster (in memory or on disk)
  • Determined through transformations
  • Can be rebuilt automatically on failure
• Operations
  • Transformations (map, filter, union, ...) -> new RDD
  • Actions (reduce, count, collect, save)
• Only actions start processing!
Easily Reproducible?_
10
![Page 11: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/11.jpg)
• Partitions
  • Describes the partitions (e.g. one per Cassandra partition)
• Dependencies
  • Dependencies on parent RDDs
• Compute
  • The function to compute the RDD's partitions
• (Optional) Partitioner
  • How is the data partitioned? (Hash, Range, ...)
• (Optional) Preferred Location
  • Where to get the data (e.g. list of Cassandra node IPs)
Properties Of An RDD_
11
![Page 12: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/12.jpg)
RDD Example_
12
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
scala> linesWithSpark.count()
res0: Long = 126
![Page 13: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/13.jpg)
Reproduce RDDs Using A Tree_
13
[Figure: lineage tree. A data source feeds rdd1 to rdd6 through map(..), filter(..), union(..), sample(..) and cache(); count() actions produce val1, val2 and val3.]
![Page 14: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/14.jpg)
• Transformations
  • map, flatMap
  • sample, filter, distinct
  • union, intersection, cartesian
• Actions
  • reduce
  • count
  • collect, first, take
  • saveAsTextFile
  • foreach
Spark Transformations & Actions_
14
![Page 15: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/15.jpg)
Run Spark In A Cluster_
15
![Page 16: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/16.jpg)
• Memory
  • A lot of data in memory
  • More memory -> less disk IO -> faster processing
  • Minimum 8 GB per node
• Network
  • Communication between driver, cluster manager & workers
  • Important for reduce operations
  • 10 Gigabit LAN or better
• CPU
  • Little communication between threads
  • Parallelizes well
  • Minimum 8 to 16 cores per node
What About Hardware?_
16
![Page 17: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/17.jpg)
• Master Web UI (port 8080)
How To Monitor? (1/3)_
17
![Page 18: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/18.jpg)
• Worker Web UI (port 8081)
How To Monitor? (2/3)_
18
![Page 19: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/19.jpg)
• Application Web UI (port 4040)
How To Monitor? (3/3)_
19
![Page 20: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/20.jpg)
(key, value): each side is one of [atomic, collection, object]; keys need not be unique
// fluege = flights
val fluege = List(("Thomas", "Berlin"), ("Mark", "Paris"), ("Thomas", "Madrid"))
val pairRDD = sc.parallelize(fluege)
pairRDD.filter(_._1 == "Thomas").collect.foreach(t => println(t._1 + " flew to " + t._2))
Pair RDDs_
20
![Page 21: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/21.jpg)
• Parallelization!
  • Keys are used for partitioning
  • Pairs with different keys are distributed across the cluster
• Efficient processing of
  • aggregate by key
  • group by key
  • sort by key
  • joins, union based on keys
Why Use Pair RDDs_
21
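How keys drive partitioning can be sketched without a cluster. The helper names below are hypothetical; the rule itself (non-negative `hashCode` of the key modulo the number of partitions) is the idea behind Spark's `HashPartitioner`, so this is a plain-collections approximation rather than Spark code.

```scala
object HashPartitionSketch {
  // Non-negative hashCode modulo the partition count.
  def partitionFor(key: Any, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Groups key/value pairs by the partition their key maps to.
  def distribute[K, V](pairs: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    pairs.groupBy { case (k, _) => partitionFor(k, numPartitions) }

  def main(args: Array[String]): Unit = {
    val flights = List(("Thomas", "Berlin"), ("Mark", "Paris"), ("Thomas", "Madrid"))
    distribute(flights, 4).foreach { case (partition, pairs) =>
      println(s"partition $partition: $pairs")
    }
  }
}
```

All pairs that share a key end up in the same partition, so aggregating, grouping or joining by key can proceed per partition once the data is laid out this way.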
![Page 22: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/22.jpg)
RDD Dependencies_
22
"Narrow" (pipeline-able)
• map, filter
• union
• join on co-partitioned data
![Page 23: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/23.jpg)
RDD Dependencies_
23
"Wide" (shuffle)
• groupBy on non-partitioned data
• join on non-co-partitioned data
![Page 24: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/24.jpg)
Spark Demo
24
![Page 25: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/25.jpg)
Spark & Cassandra
25
![Page 26: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/26.jpg)
Use Spark And Cassandra In A Cluster_
26
[Figure: cluster layout with a Spark client, the Spark driver, a Spark master, four Spark worker nodes (WN) and four Cassandra (C*) nodes]
![Page 27: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/27.jpg)
Two Datacenter - Two Purposes_
27
[Figure: two datacenters. DC1 - Online: four Cassandra (C*) nodes. DC2 - Analytics: four Cassandra (C*) nodes, four Spark worker nodes (WN) and a Spark master.]
![Page 28: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/28.jpg)
• Spark Cassandra Connector by DataStax
  • https://github.com/datastax/spark-cassandra-connector
• Cassandra tables as Spark RDD (read & write)
• Mapping of C* tables and rows onto Java/Scala objects
• Server-side filtering ("where")
• Compatible with
  • Spark ≥ 0.9
  • Cassandra ≥ 2.0
• Clone & compile with SBT or download from Maven Central
Connecting Spark With Cassandra_
28
![Page 29: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/29.jpg)
• Start the shell
bin/spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0.jar --conf spark.cassandra.connection.host=localhost
• Import Cassandra classes
scala> import com.datastax.spark.connector._
Use The Connector In The Shell_
29
![Page 30: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/30.jpg)
• Read complete table
val movies = sc.cassandraTable("movie", "movies") // returns CassandraRDD[CassandraRow]
• Read selected columns
val movies = sc.cassandraTable("movie", "movies").select("title", "year")
• Filter rows
val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'")
• Access columns in result set
movies.collect.foreach(r => println(r.get[String]("title")))
Read A Cassandra Table_
30
![Page 31: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/31.jpg)
Read As Tuple
val movies = sc.cassandraTable[(String, Int)]("movie", "movies").select("title", "year")
val movies = sc.cassandraTable("movie", "movies").select("title", "year").as((_: String, _: Int))
// both result in a CassandraRDD[(String, Int)]
Read A Cassandra Table_
31
![Page 32: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/32.jpg)
Read As Case Class
case class Movie(title: String, year: Int)
sc.cassandraTable[Movie]("movie", "movies").select("title", "year")
sc.cassandraTable("movie", "movies").select("title", "year").as(Movie)
Read A Cassandra Table_
32
![Page 33: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/33.jpg)
• Every RDD can be saved
• Using tuples
val tuples = sc.parallelize(Seq(("Hobbit", 2012), ("96 Hours", 2008)))
tuples.saveToCassandra("movie", "movies", SomeColumns("title", "year"))
• Using case classes
case class Movie(title: String, year: Int)
val objects = sc.parallelize(Seq(Movie("Hobbit", 2012), Movie("96 Hours", 2008)))
objects.saveToCassandra("movie", "movies")
Write Table_
33
![Page 34: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/34.jpg)
// Load and format as pair RDD
val pairRDD = sc.cassandraTable("movie", "director").map(r => (r.getString("country"), r))
// Directors per country, sorted
pairRDD.mapValues(v => 1).reduceByKey(_ + _).sortBy(-_._2).collect.foreach(println)
// or, unsorted
pairRDD.countByKey().foreach(println)
// All countries (one entry per row; add .distinct for unique values)
pairRDD.keys
Pair RDDs With Cassandra_
34
Table director: name text (K), country text
![Page 35: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/35.jpg)
• Joins can be expensive as they may require shuffling
val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))
val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))
movies.join(directors) // RDD[(String, (CassandraRow, CassandraRow))]
Pair RDDs With Cassandra - Join_
35
Table director: name text (K), country text
Table movie: title text (K), director text
![Page 36: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/36.jpg)
• Automatically on read
• Not automatically on write
  • Spark operations without shuffling -> writes are local
  • Spark operations with shuffling
    • Fan-out writes to Cassandra
    • repartitionByCassandraReplica("keyspace", "table") before write
• Joins with data locality
Using Data Locality With Cassandra_
36
sc.cassandraTable[CassandraRow](KEYSPACE, A)
  .repartitionByCassandraReplica(KEYSPACE, B)
  .joinWithCassandraTable[CassandraRow](KEYSPACE, B)
  .on(SomeColumns("id"))
![Page 37: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/37.jpg)
• cassandraCount()
  • Utilizes a Cassandra query
  • vs. loading the table into memory and counting there
• spanBy(), spanByKey()
  • Group data by Cassandra partition key
  • Do not need shuffling
  • Should be preferred over groupBy/groupByKey
CREATE TABLE events (year int, month int, ts timestamp, data varchar, PRIMARY KEY (year, month, ts));
sc.cassandraTable("test", "events").spanBy(row => (row.getInt("year"), row.getInt("month")))
sc.cassandraTable("test", "events").keyBy(row => (row.getInt("year"), row.getInt("month"))).spanByKey
Further Transformations & Actions_
37
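Why spanBy can skip the shuffle is easier to see on a plain collection: it only merges consecutive elements that share a key, which is enough when rows already arrive ordered by partition and clustering key. The `spanBy` helper and the `Event` class below are hypothetical stand-ins written for this sketch, not the connector's implementation.

```scala
object SpanBySketch {
  final case class Event(year: Int, month: Int, data: String)

  // Groups only adjacent elements with equal keys; no global regrouping happens.
  def spanBy[T, K](rows: Seq[T])(key: T => K): Seq[(K, Seq[T])] =
    rows.foldLeft(Vector.empty[(K, Vector[T])]) { (acc, row) =>
      val k = key(row)
      acc.lastOption match {
        case Some((lastKey, group)) if lastKey == k => acc.init :+ (k -> (group :+ row))
        case _                                      => acc :+ (k -> Vector(row))
      }
    }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(2015, 1, "a"), Event(2015, 1, "b"),
      Event(2015, 2, "c"), Event(2016, 1, "d"))
    spanBy(events)(e => (e.year, e.month)).foreach(println)
  }
}
```

If equal keys were not adjacent, this approach would emit the same key more than once, which is the trade-off compared with groupBy.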
![Page 38: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/38.jpg)
Spark & Cassandra Demo
38
![Page 39: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/39.jpg)
Create an Application
39
![Page 40: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/40.jpg)
•Normal Scala Application
•SBT as build tool
•source in src/main/scala-2.10
•assembly.sbt in root and project directory
•build.sbt in root directory
• sbt assembly to build
Scala Application_
40
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.3.0"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.3.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.3.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.3.1" % "provided"
![Page 41: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/41.jpg)
• Normal Java Application
• Java 8!
• Maven as build tool
• Source in src/main/java
• In pom.xml
  • Dependencies (spark-core, spark-streaming, spark-mllib, spark-cassandra-connector)
  • assembly-plugin or shade-plugin
• mvn clean install to build
Java Application_
41
![Page 42: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/42.jpg)
• Special classes for Java
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("Java").set("spark.cassandra.connection.host", "127.0.0.1");
JavaSparkContext sc = new JavaSparkContext(conf);
// Reuse the existing context; a second SparkContext in the same JVM is rejected
JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(1L));
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
// A lambda avoids capturing the non-serializable System.out stream
rdd.filter(e -> e % 2 == 0).foreach(e -> System.out.println(e));
Java Specials_
42
![Page 43: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/43.jpg)
• Special classes for Java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.*;
CassandraTableScanJavaRDD<CassandraRow> table = javaFunctions(sc.sparkContext()).cassandraTable("keyspace", "table");
CassandraTableScanJavaRDD<Entity> table = javaFunctions(sc.sparkContext()).cassandraTable("keyspace", "table", mapRowTo(Entity.class));
javaFunctions(someRDD).writerBuilder("keyspace", "table", mapToRow(Entity.class)).saveToCassandra();
Java Specials - Cassandra_
43
![Page 44: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/44.jpg)
Spark SQL
44
![Page 45: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/45.jpg)
• SQL queries with Spark (SQL & HiveQL)
  • On structured data
  • On DataFrames
  • Every result of Spark SQL is a DataFrame
  • All operations of generic RDDs available
• Supports (even on non-primary-key columns)
  • Joins
  • Union
  • Group By
  • Having
  • Order By
Spark SQL_
45
![Page 46: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/46.jpg)
val sqlContext = new SQLContext(sc)
val persons = sqlContext.jsonFile(path)
// Show the schema
persons.printSchema()
persons.registerTempTable("persons")
val adults = sqlContext.sql("SELECT name FROM persons WHERE age > 18")
adults.collect.foreach(println)
Spark SQL - JSON Example_
46
Input (one JSON object per line):
{"name": "Michael"}
{"name": "Jan", "age": 30}
{"name": "Tim", "age": 17}
![Page 47: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/47.jpg)
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("musicdb")
// anzahl = count
val result = csc.sql("SELECT country, COUNT(*) as anzahl " +
  "FROM artists GROUP BY country " +
  "ORDER BY anzahl DESC")
result.collect.foreach(println)
Spark SQL - Cassandra Example_
47
![Page 48: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/48.jpg)
Spark SQL Demo
48
![Page 49: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/49.jpg)
Spark Streaming
49
![Page 50: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/50.jpg)
• Real Time Processing using micro batches
• Supported sources: TCP, S3, Kafka, Twitter,..
• Data as Discretized Stream (DStream)
• Same programming model as for batches
• All operations of generic RDDs, Spark SQL and MLLib available
• Stateful Operations & Sliding Windows
Stream Processing With Spark Streaming_
50
![Page 51: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/51.jpg)
import org.apache.spark.streaming._
val ssc = new StreamingContext(sc, Seconds(1))
val stream = ssc.socketTextStream("127.0.0.1", 9999)
stream.map(x => 1).reduce(_ + _).print()
ssc.start()
// await manual termination or error
ssc.awaitTermination()
// manual termination
ssc.stop()
Spark Streaming - Example_
51
![Page 52: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/52.jpg)
• Maintain state for each key in a DStream: updateStateByKey
Spark Streaming - Stateful Operations_
52
def updateAlbumCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  val newCount = runningCount.getOrElse(0) + newValues.size
  Some(newCount)
}
val countStream = stream.updateStateByKey[Int](updateAlbumCount _)
Here, stream is a DStream of pair RDDs.
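To see what the update function does over time, the micro batches can be simulated with plain collections. This is a sketch under assumptions: the `run` helper is hypothetical and only mimics the contract that the update function is called once per batch for every key that has new values or existing state.

```scala
object StatefulCountSketch {
  // Same update function as on the slide: adds the number of new values to the running count.
  def updateAlbumCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
    Some(runningCount.getOrElse(0) + newValues.size)

  // Folds a sequence of batches into per-key state, one update call per key and batch.
  def run(batches: Seq[Seq[(String, Int)]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
      val newValuesByKey = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
      val keys = state.keySet ++ newValuesByKey.keySet
      keys.flatMap { k =>
        updateAlbumCount(newValuesByKey.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
      }.toMap
    }

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq(("ACDC", 1), ("Queen", 1), ("ACDC", 1)),
      Seq(("ACDC", 1)))
    println(run(batches))
  }
}
```

Returning None from the update function would drop the key from the state, which is how stale keys can be expired.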
![Page 53: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/53.jpg)
• One receiver -> one node
  • Start more receivers and union them
val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
• Received data will be split up into blocks
  • 1 block => 1 task
  • Number of blocks per batch = batch interval / block interval
• Repartition data to distribute it over the cluster
Spark Streaming - Parallelism_
53
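The block formula translates directly into a small calculation. The helper below is a hypothetical illustration; the 200 ms used in the example matches the default of spark.streaming.blockInterval, while the batch intervals are made-up sample values.

```scala
object BlockCountSketch {
  // Number of blocks (and therefore tasks) one receiver produces per batch.
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long): Long = {
    require(batchIntervalMs > 0 && blockIntervalMs > 0, "intervals must be positive")
    batchIntervalMs / blockIntervalMs
  }

  def main(args: Array[String]): Unit = {
    // A 2 s batch with 200 ms blocks gives 10 tasks per receiver and batch.
    println(blocksPerBatch(batchIntervalMs = 2000, blockIntervalMs = 200))
    // With 5 receivers, as in the Kafka example, that is 50 tasks per batch.
    println(5 * blocksPerBatch(batchIntervalMs = 2000, blockIntervalMs = 200))
  }
}
```

If the result is lower than the number of available cores, a shorter block interval, more receivers or an explicit repartition can raise the parallelism.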
![Page 54: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/54.jpg)
Spark Streaming Demo
54
![Page 55: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/55.jpg)
Spark MLLib
55
![Page 56: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/56.jpg)
• Fully integrated in Spark
  • Scalable
  • Scala, Java & Python APIs
  • Use with Spark Streaming & Spark SQL
• Packages various algorithms for machine learning
• Includes
  • Clustering
  • Classification
  • Prediction
  • Collaborative Filtering
• Still under development
  • Performance, algorithms
Spark MLLib_
56
![Page 57: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/57.jpg)
MLLib Example - Clustering_
57
[Figure: scatter plot of income over age; a set of data points is grouped into meaningful clusters]
![Page 58: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/58.jpg)
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Load and parse data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// Cluster the data into 2 classes using KMeans with 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)
// Evaluate clustering by computing the Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
MLLib Example - Clustering (using KMeans)_
58
![Page 59: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/59.jpg)
MLLib Example - Classification_
59
![Page 60: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/60.jpg)
MLLib Example - Classification_
60
![Page 61: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/61.jpg)
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils
// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
MLLib Example - Classification (Linear SVM)_
61
![Page 62: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/62.jpg)
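import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
// Clear the default threshold so that predict returns raw scores instead of 0/1 labels.
model.clearThreshold()
// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)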
MLLib Example - Classification (Linear SVM)_
62
![Page 63: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/63.jpg)
MLLib Example - Collaborative Filtering_
63
![Page 64: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/64.jpg)
import org.apache.spark.mllib.recommendation.{ALS, Rating}
// Load and parse the data (user id, item id, rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)
MLLib Example - Collaborative Filtering using ALS_
64
![Page 65: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/65.jpg)
// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}
val ratesAndPredictions = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPredictions.map { case ((user, product), (r1, r2)) =>
  val err = r1 - r2
  err * err
}.mean()
println("Mean Squared Error = " + MSE)
MLLib Example - Collaborative Filtering using ALS_
65
![Page 66: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/66.jpg)
Use Cases
66
![Page 67: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/67.jpg)
• In particular for huge amounts of external data
• Support for CSV, TSV, XML, JSON and others
Use Cases for Spark and Cassandra_
67
Data Loading
case class User(id: java.util.UUID, name: String)
val users = sc.textFile("users.csv")
  .repartition(2 * sc.defaultParallelism)
  .map(line => line.split(",") match { case Array(id, name) => User(java.util.UUID.fromString(id), name) })
users.saveToCassandra("keyspace", "users")
![Page 68: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/68.jpg)
Validate consistency in a Cassandra database
• Syntactic
  • Uniqueness (only relevant for columns not in the PK)
  • Referential integrity
  • Integrity of the duplicates
• Semantic
  • Business or application constraints
  • e.g. at least one genre per movie, a maximum of 10 tags per blog post
Use Cases for Spark and Cassandra_
68
Validation & Normalization
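The checks named on this slide can be sketched as small functions over plain collections. Everything here is hypothetical: the `Movie` and `BlogPost` types and the helper names are made up for illustration, and in a real job the same predicates would presumably be applied to RDDs read from Cassandra.

```scala
object ValidationSketch {
  // Hypothetical row types, not taken from the talk.
  final case class Movie(title: String, genres: Set[String])
  final case class BlogPost(id: Int, tags: Seq[String])

  // Syntactic check: values that occur more than once in a column outside the primary key.
  def duplicates[T](values: Seq[T]): Set[T] =
    values.groupBy(identity).collect { case (v, vs) if vs.size > 1 => v }.toSet

  // Semantic check: every movie needs at least one genre.
  def moviesWithoutGenre(movies: Seq[Movie]): Seq[String] =
    movies.filter(_.genres.isEmpty).map(_.title)

  // Semantic check: a blog post may carry at most `max` tags.
  def postsWithTooManyTags(posts: Seq[BlogPost], max: Int = 10): Seq[Int] =
    posts.filter(_.tags.size > max).map(_.id)

  def main(args: Array[String]): Unit = {
    println(duplicates(Seq("a@x.org", "b@x.org", "a@x.org")))
    println(moviesWithoutGenre(Seq(Movie("Hobbit", Set("Fantasy")), Movie("96 Hours", Set.empty))))
    println(postsWithTooManyTags(Seq(BlogPost(1, Seq.fill(11)("tag")), BlogPost(2, Seq("spark")))))
  }
}
```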
![Page 69: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/69.jpg)
• Modelling, Mining, Transforming, ...
• Use Cases
  • Recommendation
  • Fraud Detection
  • Link Analysis (Social Networks, Web)
  • Advertising
  • Data Stream Analytics (Spark Streaming)
  • Machine Learning (Spark ML)
Use Cases for Spark and Cassandra_
69
Analyses (Joins, Transformations,..)
![Page 70: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/70.jpg)
• Changes on existing tables
  • New table required when changing primary key
  • Otherwise changes could be performed in-place
• Creating new tables
  • Data derived from existing tables
  • Support new queries
• Use the Cassandra connector in Spark
Use Cases for Spark and Cassandra_
70
Schema Migration
![Page 71: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/71.jpg)
Thank you for your attention!
71
![Page 72: Big data analytics with Spark & Cassandra](https://reader034.vdocuments.net/reader034/viewer/2022042520/5877cea11a28ab39588b747b/html5/thumbnails/72.jpg)
Questions?
Matthias Niehoff, IT-Consultant
codecentric AG Zeppelinstraße 2 76185 Karlsruhe, Germany
mobile: +49 (0) 172.1702676
www.codecentric.de blog.codecentric.de
matthiasniehoff