Anatomy of Spark Catalyst - Part 2
Journey from DataFrame to RDD
https://github.com/phatak-dev/anatomy-of-spark-catalyst
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Recap of Part 1
● Query Plan
● Logical Plans
● Analysis
● Optimization
● Spark Plan
● Spark Strategies
● Custom strategies
Code organization of Spark SQL
● The code of Spark SQL is organized into the following projects:
○ Catalyst
○ Core
○ Hive
○ Hive-Thrift server
○ Unsafe (external to Spark SQL)
● We are focusing on the Catalyst code in this session
● Core depends on Catalyst
● Catalyst depends on Unsafe for the Tungsten code
Introduction to Catalyst
● An implementation-agnostic framework for manipulating trees of relational operators and expressions
● It defines all the expressions, logical plans and optimization APIs for Spark SQL
● The Catalyst API is independent of RDD evaluation, so Catalyst operators and expressions can be evaluated without the RDD abstraction
● Introduced in Spark version 1.3
Catalyst in Spark SQL
[Diagram: Hive queries go through the Hive parser (HiveQL), Spark SQL queries go through the Spark SQL parser, and the DataFrame and Dataset DSLs all feed into a DataFrame; Catalyst then turns the DataFrame into Spark RDD code.]
Recap of Part 1
● Trees
● Expressions
● Data types
● Row and InternalRow
● Eval
● CodeGen
● Janino
DataFrame Internals
● Each DataFrame is internally represented as a logical plan in Spark
● These logical plans are converted into physical plans that execute on the RDD abstraction
● The building blocks of plans, such as expressions and operators, come from the Catalyst library
● We need to understand the internals of Catalyst in order to understand how a given query is formed and executed
● Ex: DFExample.scala
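These stages are observable from the public DataFrame API. The sketch below is a minimal stand-in for DFExample.scala (not the repository's actual code), assuming Spark 2.0's SparkSession and spark-sql on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object DFExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DFExample")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value").filter($"id" > 1)

    // the journey from DataFrame to RDD, stage by stage
    println(df.queryExecution.logical)       // parsed logical plan
    println(df.queryExecution.analyzed)      // after analysis
    println(df.queryExecution.optimizedPlan) // after Catalyst optimizations
    println(df.queryExecution.sparkPlan)     // selected physical (Spark) plan

    spark.stop()
  }
}
```

Each printed plan corresponds to one stage discussed in the following slides.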
Query Plan API
● Root of all plans
● Two plan types: LogicalPlan and SparkPlan
● outputSet signifies the attributes output by this plan
● inputSet signifies the attributes input to this plan
● schema signifies the StructType associated with the output of the plan
● Provides special functions like transformExpressions and transformExpressionsUp to manipulate the expressions in a plan
Logical Plan
Logical Plan
● A type of query plan which focuses on building plans out of Catalyst operators and expressions
● Independent of the RDD abstraction
● Focuses on analysis of the plan for correctness
● Also responsible for resolving the attributes before they are evaluated
● The three default types of logical plans are:
○ LeafNode, UnaryNode and BinaryNode
● Ex: LogicalPlanExample
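A logical plan can be built directly from Catalyst's classes with no RDD involved. A sketch along the lines of LogicalPlanExample (internal API; class locations vary slightly across Spark versions):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, GreaterThan, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LocalRelation}
import org.apache.spark.sql.types.IntegerType

// a single-column relation and a filter over it, built by hand
val id = AttributeReference("id", IntegerType)()
val relation = LocalRelation(id)
val plan = Filter(GreaterThan(id, Literal(10)), relation)

// the plan is a tree, so the tree APIs from Part 1 apply to it
println(plan.treeString)
```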
Tree manipulation of Logical Plan
● As we did earlier with expression trees, we can manipulate logical plans using the tree APIs
● All these manipulations are represented as Rules
● A rule takes a plan and gives you a new plan
● Rather than transform and transformUp, we will be using transformExpressions and transformExpressionsUp for manipulating these trees
● Ex: FilterLogicalPlanManipulation
Understanding Plan Manipulation
[Diagram: a Filter whose condition is Equals(AttributeRef(id), Literal(true)) sits over a local relation (LR); transformExpressionUp reduces the condition to Literal(true), and a subsequent transform removes the now-trivial Filter, leaving LR alone.]
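The two-step rewrite in the diagram can be sketched against Catalyst's internal API (assumes spark-catalyst on the classpath; `flag` is a hypothetical boolean attribute standing in for the diagram's `id`):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LocalRelation, LogicalPlan}
import org.apache.spark.sql.types.BooleanType

val flag = AttributeReference("flag", BooleanType)()
val plan: LogicalPlan = Filter(EqualTo(flag, Literal(true)), LocalRelation(flag))

// step 1: rewrite the expression trees inside the plan
val constantCondition = plan transformExpressionsUp {
  case EqualTo(_, Literal(true, BooleanType)) => Literal(true)
}

// step 2: rewrite the plan tree itself, dropping the trivial filter
val simplified = constantCondition transform {
  case Filter(Literal(true, BooleanType), child) => child
}
println(simplified) // just the LocalRelation
```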
Analysis of Logical Plan
Analysing
● Analysis of a logical plan is a step that includes:
○ Resolving relations (Spark SQL)
○ Resolving attributes
○ Resolving functions
○ Analysing the correctness of the structure
● Analysis makes sure all the information is extracted before a logical plan can be executed
● Analyzer is the interface that implements the analysis
● Ex: AnalysisExample
Analysis in SQL
● Whenever we use the SQL API for manipulating DataFrames, we work with UnresolvedRelation
● UnresolvedRelation is a logical relation which needs to be resolved from the catalog
● The catalog is a dictionary of all registered tables
● Part of the analysis is to resolve these unresolved relations and provide the appropriate relation types
● Ex: UnResolvedRelationExample
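The resolution is visible by comparing the plans before and after analysis. A sketch using the Spark 2.0 API, assuming a SparkSession `spark` and a DataFrame `df` are already in scope (the table name `sales` is illustrative):

```scala
// register the DataFrame so the SQL parser produces an UnresolvedRelation
df.createOrReplaceTempView("sales")
val sqlDf = spark.sql("select * from sales")

// before analysis: the table is only a name (an UnresolvedRelation node)
println(sqlDf.queryExecution.logical)

// after analysis: the relation has been looked up in the catalog
println(sqlDf.queryExecution.analyzed)
```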
Logical Plan Optimization
Optimizer
● One of the important parts of Spark Catalyst is implementing the optimizations on logical plans
● All these optimizations are represented using Rules which transform the logical plans
● All the code for the optimizations resides in the Optimizer.scala file
● In our example, we see how filter push-down works for a logical plan
● For more information on optimization, refer to the Anatomy of DataFrame talk [1] from the references
● Ex: PushPredicateExample
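Because each optimization is just a Rule[LogicalPlan], a single rule can be applied to a hand-built plan in isolation. A sketch using the built-in BooleanSimplification rule (internal API; assumes spark-catalyst on the classpath):

```scala
import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, GreaterThan, Literal}
import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LocalRelation}
import org.apache.spark.sql.types.IntegerType

val id = AttributeReference("id", IntegerType)()
val relation = LocalRelation(id)

// the condition `id > 1 AND true` carries a redundant conjunct
val plan = Filter(And(GreaterThan(id, Literal(1)), Literal(true)), relation)

// applying the rule drops the redundant `true`, leaving Filter(id > 1)
println(BooleanSimplification(plan))
```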
Providing custom optimizations
● Until Spark 2.0, we needed to change the Spark source code to change the optimizations
● As Dataset becomes the core abstraction in 2.0, the ability to tweak Catalyst optimizations becomes important
● So from Spark 2.0, Spark exposes the ability to add user-defined rules at runtime, which makes the Spark optimizer more configurable
● For more information about defining and adding custom rules, refer to the Spark 2.0 talk [2] from the references
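The Spark 2.0 extension point is experimental.extraOptimizations. A minimal sketch; `MyCustomRule` is a deliberately trivial placeholder:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// placeholder rule: a real one would pattern match on plan nodes or
// expressions and return a rewritten plan
object MyCustomRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// injected rules run in addition to the optimizer's built-in batches
spark.experimental.extraOptimizations = Seq(MyCustomRule)
```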
Spark Plan
SparkPlan
● The physical plan of Spark SQL, which lives in the core package
● Defines two abstract methods:
○ doPrepare
○ doExecute
● Specifies helper collect methods like:
○ executeCollect, executeTake
● Three node types: LeafNode, UnaryNode and BinaryNode
● org.apache.spark.sql.execution.SparkPlan
Logical Plan to Spark Plan
● Let's look at converting our logical plans to Spark plans
● On sqlContext there is a SparkPlanner, which helps us do the conversion
● A single logical plan can result in multiple physical plans
● Every physical plan has an execute method, which produces an RDD[InternalRow] (consuming its children's RDD[InternalRow]s)
● Ex: LogicalToPhysicalExample
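From the user side the conversion is visible on queryExecution. A sketch, assuming a DataFrame `df` built on a local SparkSession:

```scala
val qe = df.queryExecution

// the planner turns the optimized logical plan into a physical plan
println(qe.optimizedPlan) // logical side
println(qe.sparkPlan)     // physical side, chosen by the planner

// executing the physical plan yields the underlying RDD[InternalRow]
val internalRows = qe.executedPlan.execute()
println(internalRows.count())
```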
QueryPlanner
● The interface for converting logical plans to physical plans
● Holds the list of strategies applied for the conversion
● Each strategy has a plan method, which is chained like the rules on the logical plan side
● QueryPlanner also extends from TreeNode, which supports all the tree traversals
● PlanLater is a strategy which gives a lazy effect
● org.apache.spark.sql.catalyst.planning.QueryPlanner
Spark Strategies
● A set of strategies which implement the query planner to turn logical plans into Spark plans
● The different strategies are:
○ BasicOperators
○ Aggregation
○ DefaultJoin
● We execute these strategies in sequence to generate the final result
● Ex: org.apache.spark.sql.core.phyzicalplan.SparkStrategyExample
SparkPlanner
● The user-facing API for converting logical plans to Spark plans
● It lists all the strategies to execute on a given logical plan
● Calling the plan method generates all the physical plans using the above strategies
● These physical plans can be executed using the execute method on a physical plan
● org.apache.spark.sql.execution.SparkPlanner
Understanding Filter Strategy
● All the code for the filter strategy lives in basicOperators.scala
● It uses mapPartitionsInternal for filtering data over RDD[InternalRow]
● A comparison expression is converted to a predicate using the newPredicate method, which uses code generation
● Once we have the predicate, we can use the Scala filter to filter the data from the RDD
● The filtered RDD[InternalRow] is returned from the strategy
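Stripped of Spark specifics, the operator's shape is simple: a (normally code-generated) predicate function applied lazily to each partition's iterator of rows. A pure-Scala analogue, with rows simplified to Seq[Any]:

```scala
// stand-in for InternalRow
type Row = Seq[Any]

// stand-in for the code-generated predicate returned by newPredicate
val predicate: Row => Boolean = row => row.head.asInstanceOf[Int] > 1

// each partition is an iterator of rows, filtered lazily as in
// mapPartitionsInternal { iter => iter.filter(predicate) }
val partition: Iterator[Row] = Iterator(Seq(1, "a"), Seq(2, "b"), Seq(3, "c"))
val filtered = partition.filter(predicate).toList

println(filtered) // List(List(2, b), List(3, c))
```

The real operator does exactly this per partition, except the rows are binary InternalRows and the predicate is compiled by Janino.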
Custom strategy
● Just as we can write custom rules for logical optimization, we can also add custom strategies
● Many connectors, like MemSQL and MongoDB, add custom strategies to optimize reads, filters etc.
● Developers can add custom strategies using the sqlContext.experimental.extraStrategies object
● You can look at a simple custom strategy from MemSQL at the link below
● http://bit.ly/2bwnUxF
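A minimal sketch of the hook; `MyStrategy` here deliberately matches nothing, so planning falls through to the built-in strategies, whereas a real connector would match its own logical operators and return custom SparkPlan nodes:

```scala
import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

object MyStrategy extends Strategy {
  // returning Nil means "this strategy does not apply to this plan"
  def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.experimental.extraStrategies = Seq(MyStrategy)
```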
References
● [1] Dataframe Anatomy talk: https://www.youtube.com/watch?v=iKOGBr-kOks
● [2] Spark 2.0 talk: https://www.youtube.com/watch?v=GhZ-XPGyXiM
● MemSQL custom strategy: https://github.com/memsql/memsql-spark-connector/tree/master/connectorLib/src/main/scala/org/apache/spark/sql/memsql