Intro to Spark and Spark SQL


DESCRIPTION

"Intro to Spark and Spark SQL" talk by Michael Armbrust of Databricks at AMP Camp 5

TRANSCRIPT

1. Intro to Spark and Spark SQL
AMP Camp 2014
Michael Armbrust - @michaelarmbrust

2. What is Apache Spark?
Fast and general cluster computing system, interoperable with Hadoop, included in all major distros.
Improves efficiency through:
> In-memory computing primitives
> General computation graphs
Improves usability through:
> Rich APIs in Scala, Java, Python
> Interactive shell
Up to 100× faster (2-10× on disk); 2-5× less code.

3. Spark Model
Write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs):
> Collections of objects that can be stored in memory or on disk across a cluster
> Parallel functional transformations (map, filter, ...)
> Automatically rebuilt on failure

4. More than Map & Reduce
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...

5. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

    val lines = spark.textFile("hdfs://...")
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split("\t")(2))
    messages.cache()

    messages.filter(_.contains("foo")).count()
    messages.filter(_.contains("bar")).count()
    . . .

[Diagram: the driver sends tasks to three workers, each holding a block of the base RDD (lines) and an in-memory cache of the transformed RDD (messages); actions return results to the driver.]
Result: scaled full-text search of Wikipedia to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).

15. Spark SQL reuses the best parts of Shark
(Shark's drawbacks: limited integration with Spark programs; the Hive optimizer was not designed for Spark.)
Borrows:
> Hive data loading
> In-memory column store
Adds:
> RDD-aware optimizer
> Rich language interfaces

16. Adding Schema to RDDs
Spark + RDDs: functional transformations on partitioned collections of opaque objects.
SQL + SchemaRDDs: declarative transformations on partitioned collections of tuples.
[Diagram: a row of opaque User objects vs the same data laid out as (Name, Age, Height) tuples.]

17. SchemaRDDs: More than SQL
Unified interface for structured data: SQL, {JSON}, Parquet, MLlib, JDBC, ODBC.
Image credit: http://barrymieny.deviantart.com/

18. Getting Started: Spark SQL
SQLContext/HiveContext:
> Entry point for all SQL functionality
> Wraps/extends existing spark context

    from pyspark.sql import SQLContext
    sqlCtx = SQLContext(sc)

19. Example Dataset
A text file filled with people's names and ages:

    Michael, 30
    Andy, 31

20. RDDs as Relations (Python)

    from pyspark.sql import Row

    # Load a text file and convert each line to a dictionary.
    lines = sc.textFile("examples/.../people.txt")
    parts = lines.map(lambda l: l.split(","))
    people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

    # Infer the schema, and register the SchemaRDD as a table.
    peopleTable = sqlCtx.inferSchema(people)
    peopleTable.registerAsTable("people")

21. RDDs as Relations (Scala)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    // Define the schema using a case class.
    case class Person(name: String, age: Int)

    // Create an RDD of Person objects and register it as a table.
    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    people.registerAsTable("people")
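(Aside, not part of the original slides: a minimal end-to-end sketch of the Scala flow in slides 19-21, assuming the Spark 1.1-era API in which sql() returns a SchemaRDD; the people.txt file name and the teenager query are taken from slides 19 and 23, and the local[*] master setting is an assumption for running standalone.)

    // Sketch only: assumes Spark 1.1-era APIs (SchemaRDD, registerAsTable)
    // and a people.txt file formatted as in slide 19.
    import org.apache.spark.{SparkConf, SparkContext}

    case class Person(name: String, age: Int)

    object PeopleDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("PeopleDemo").setMaster("local[*]"))
        val sqlContext = new org.apache.spark.sql.SQLContext(sc)
        import sqlContext._  // implicit conversion from RDD[Person] to SchemaRDD

        // Parse "name, age" lines into Person objects and register them as a table.
        val people = sc.textFile("people.txt")
          .map(_.split(","))
          .map(p => Person(p(0), p(1).trim.toInt))
        people.registerAsTable("people")

        // The result of sql() is itself a SchemaRDD, so ordinary RDD
        // operations (map, collect, ...) apply to it directly.
        val teenagers = sqlContext.sql(
          "SELECT name FROM people WHERE age >= 13 AND age <= 19")
        teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

        sc.stop()
      }
    }

Because SchemaRDDs are just RDDs, the SQL result feeds straight into functional transformations with no conversion step in between.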
22. RDDs as Relations (Java)

    public class Person implements Serializable {
      private String _name;
      private int _age;
      public String getName() { return _name; }
      public void setName(String name) { _name = name; }
      public int getAge() { return _age; }
      public void setAge(int age) { _age = age; }
    }

    JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(sc);
    JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
      new Function<String, Person>() {
        public Person call(String line) throws Exception {
          String[] parts = line.split(",");
          Person person = new Person();
          person.setName(parts[0]);
          person.setAge(Integer.parseInt(parts[1].trim()));
          return person;
        }
      });
    JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);

23. Querying Using SQL

    # SQL can be run over SchemaRDDs that have been registered
    # as a table.
    teenagers = sqlCtx.sql("""
      SELECT name FROM people WHERE age >= 13 AND age <= 19""")

38. Code Generation
Expressions are compiled to Scala code using quasiquotes:

    def generateCode(e: Expression): Tree = e match {
      case Attribute(ordinal) =>
        q"inputRow.getInt($ordinal)"
      case Add(left, right) =>
        q"""
        {
          val leftResult = ${generateCode(left)}
          val rightResult = ${generateCode(right)}
          leftResult + rightResult
        }
        """
    }

39. Code Generating a + b

    val left: Int = inputRow.getInt(0)
    val right: Int = inputRow.getInt(1)
    val result: Int = left + right
    resultRow.setInt(0, result)

> Fewer function calls
> No boxing of primitives

40. Performance Comparison
[Chart: milliseconds to evaluate 'a + a + a' one billion times, comparing interpreted evaluation, hand-written code, and code generated with Scala reflection.]

41. Performance vs. Shark
[Chart: TPC-DS query runtimes for Interactive, Reporting, and Deep Analytics query groups, comparing Shark and Spark SQL.]

42. What's Coming in Spark 1.2?
> MLlib pipeline support for SchemaRDDs
> New APIs for developers to add external data sources
> Full support for Decimal and Date types
> Statistics for in-memory data
> Lots of bug fixes and improvements to Hive compatibility

43. Questions?
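(Aside, not part of the original slides: to make slides 38-39 concrete, here is a self-contained sketch contrasting interpreted expression evaluation with the straight-line code that generation produces. Expression, Attribute, and Add are simplified stand-ins for Spark SQL's real classes, and the "generated" version is written out by hand rather than emitted with quasiquotes.)

    // Simplified stand-ins for the expression tree of slide 38; these are
    // illustrative types, not Spark SQL's actual Expression hierarchy.
    sealed trait Expression
    case class Attribute(ordinal: Int) extends Expression
    case class Add(left: Expression, right: Expression) extends Expression

    object CodegenDemo {
      // Interpreted evaluation: one virtual dispatch per tree node, and the
      // Any return type boxes every intermediate primitive value.
      def eval(e: Expression, inputRow: Array[Int]): Any = e match {
        case Attribute(ordinal) => inputRow(ordinal)
        case Add(left, right) =>
          eval(left, inputRow).asInstanceOf[Int] +
            eval(right, inputRow).asInstanceOf[Int]
      }

      def main(args: Array[String]): Unit = {
        val inputRow = Array(1, 2)
        val expr = Add(Attribute(0), Attribute(1))  // the "a + b" of slide 39

        println(eval(expr, inputRow))  // tree-walking interpreter: prints 3

        // What generated code looks like for the same tree (slide 39):
        // no pattern matching, no recursion, no boxing.
        val left: Int = inputRow(0)
        val right: Int = inputRow(1)
        val result: Int = left + right
        println(result)  // prints 3
      }
    }

This is why slide 40's benchmark favors generated code: the interpreter pays dispatch and boxing costs at every node on every row, while the generated version collapses the tree into a few primitive operations.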