ibis: scaling python analytics on hadoop and impala

© Cloudera, Inc. All rights reserved.

Ibis: Scaling Python Analytics on Hadoop and Impala
Wes McKinney, Strata-Hadoop World NYC, 2015-09-30
@wesmckinn


Me

• R&D at Cloudera
• Serial creator of structured data tools / user interfaces
• Mathematician, MIT ’07
• “Professional SQL programmer” 2007-2010 (@ AQR)
• Created pandas (Python library) in 2008
• Wrote bestseller Python for Data Analysis, 2012
• Founder of DataPad


Python is popular…

• Python has become a standard language of data science
• Why is it popular?
  • Maximizes productivity for data engineers and data scientists
  • Build robust software and do interactive data analysis with 100% Python code
  • Easy to learn, and makes for happy and productive data teams
  • Large, diverse open source development community
  • Comprehensive libraries: data wrangling, ML, visualization, etc.

• Main use case: data science & engineering Swiss Army knife on small-to-medium size data


…but Python does not scale today

• Python ecosystem confined to single-node analysis
  • Great for smaller data sets
  • Requires sampling or aggregations for larger data
  • Distributed tools compromise in various ways

• Extracting samples or aggregations for larger data means:
  • “Scales” by losing more fidelity
  • Additional ETL overhead to extract samples/aggregations
  • Loss of productivity with multiple languages, tools, etc.
  • Blocks certain analysis and use cases


Industry Analytics vs Scientific Computing

Industry Analytics
• Heterogeneous data
• Flat tables and JSON
• Spark / MapReduce
• SQL
• DFS-friendly / streaming data formats
• More physical machines

Scientific Computing
• Homogeneous data
• Multidimensional arrays
• HPC tools
• Linear algebra
• Scientific data formats (e.g. HDF5)
• Fewer physical machines

Some simplistic generalizations

Python: heavy investment, generally

Python: light investment, generally


Industry Analytics: Python’s existential crisis

Source: Wikipedia


Our (Python’s) biggest mistake: approaching Big Data like a scientific computing problem


pandas

• Hugely popular Python table / “data frame” library
• Labeled table, array, and time series data structures

• Popular for data preparation, ETL, and in-memory analytics
• Built using Python’s scientific computing stack
• User API / domain specific language
• Bespoke in-memory analytics / relational algebra engine
• IO interfaces (CSV, SQL, etc.)
• Expanded data type system (beyond NumPy)

• Supports flat data only (or semi-structured data that can be flattened)


Many SQL engines

… and more


The “Great Decoupling” for Big Data

UI: Ibis, SQL, Spark API, …

Compute: Analytic SQL, Spark, MapReduce

Storage: HDFS, Kudu, HBase


A sample big data architecture

[Diagram. Labels: Application data, Kafka (×4), HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, SQL, User]


Big data architectures are currently dominated by JVM languages, with increasing amounts of C++

Python/R/Julia don’t have much of a “seat at the table”


Nested / Complex types support

• Arrays, structs, maps, and unions as first-class value types
• Analyze JSON-like data directly without flattening or normalization
• Most new SQL engines have some level of support
  • Impala
  • Presto
  • Drill
  • BigQuery
  • Spark SQL
  • Hive
  • …
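
The difference between flattening nested data and treating an array as a first-class value can be sketched in plain Python. This is only an illustration: the records and field names (`order_id`, `items`, `sku`, `qty`) are made up, and no SQL engine is involved.

```python
import json

# Hypothetical JSON-like records: each order carries an array of structs.
raw = """
[{"order_id": 1, "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}]},
 {"order_id": 2, "items": [{"sku": "a", "qty": 5}]},
 {"order_id": 3, "items": []}]
"""
orders = json.loads(raw)

# Approach 1: flatten (normalize) into one row per (order, item) first.
flat_rows = [
    {"order_id": o["order_id"], "sku": it["sku"], "qty": it["qty"]}
    for o in orders
    for it in o["items"]
]
qty_by_order_flat = {}
for row in flat_rows:
    key = row["order_id"]
    qty_by_order_flat[key] = qty_by_order_flat.get(key, 0) + row["qty"]

# Approach 2: keep the array as a value and aggregate it in place.
qty_by_order_nested = {
    o["order_id"]: sum(it["qty"] for it in o["items"]) for o in orders
}

print(qty_by_order_flat)    # {1: 3, 2: 5}
print(qty_by_order_nested)  # {1: 3, 2: 5, 3: 0}
```

Note that the flattened version silently drops order 3, which has an empty array; working on the nested value directly keeps it.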


Ibis in a nutshell

• For Python programmers doing analytics in industry
• Project Blog: http://blog.ibis-project.org
• Joint project with Impala team @ Cloudera
• Apache-licensed, open source: http://github.com/cloudera/ibis
• Crafting a compelling Python-on-Hadoop user experience
• Remove SQL coding from user workflows
• Develop high performance Python extension APIs


Ibis in a nutshell, cont’d

• Composable Python DSL (“Ibis expressions”) makes hand-coding SQL SELECT statements unnecessary
• Ibis for SQL Programmers: http://docs.ibis-project.org/sql.html
• Development roadmap targets Impala (C++ / LLVM) query engine
  • … but SQL compiler toolchain is general purpose

• Currently supports Impala and SQLite, with other dialects coming soon
• We welcome external contributors for other Analytic SQL engines
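
To make the idea of a composable expression DSL that compiles to SQL concrete, here is a toy sketch. It is explicitly not the Ibis API: `Table`, `Expr`, `lit`, and the `sales` table are hypothetical names invented for this illustration, and the compiler handles only a filter, a group-by, and a sum. It runs the generated SELECT against the standard library's `sqlite3`.

```python
import sqlite3


def lit(value):
    """Render a Python value (or another Expr) as SQL text."""
    if isinstance(value, Expr):
        return value.sql
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return repr(value)


class Expr:
    """Toy column expression (hypothetical; NOT the Ibis API)."""

    def __init__(self, sql):
        self.sql = sql

    def __gt__(self, other):
        return Expr("({} > {})".format(self.sql, lit(other)))

    def sum(self):
        return Expr("SUM({})".format(self.sql))

    def name(self, alias):
        return Expr("{} AS {}".format(self.sql, alias))


class Table:
    """Toy table expression; every method returns a new Table, so calls compose."""

    def __init__(self, name, where=(), keys=(), metrics=()):
        self.name, self.where, self.keys, self.metrics = name, where, keys, metrics

    def __getitem__(self, column):
        return Expr(column)

    def filter(self, predicate):
        return Table(self.name, self.where + (predicate,), self.keys, self.metrics)

    def group_by(self, *keys):
        return Table(self.name, self.where, self.keys + keys, self.metrics)

    def aggregate(self, *metrics):
        return Table(self.name, self.where, self.keys, self.metrics + metrics)

    def compile(self):
        columns = [e.sql for e in self.keys + self.metrics] or ["*"]
        sql = "SELECT {} FROM {}".format(", ".join(columns), self.name)
        if self.where:
            sql += " WHERE " + " AND ".join(e.sql for e in self.where)
        if self.keys:
            sql += " GROUP BY " + ", ".join(e.sql for e in self.keys)
        return sql


# Build the query in Python; no SQL is written by hand.
t = Table("sales")
expr = (t.filter(t["amount"] > 10)
         .group_by(t["region"])
         .aggregate(t["amount"].sum().name("total")))
sql = expr.compile()
print(sql)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 5.0), ("east", 20.0), ("west", 30.0), ("west", 15.0)])
rows = sorted(con.execute(sql).fetchall())
print(rows)  # [('east', 20.0), ('west', 45.0)]
```

Because the expression is only turned into text at `compile()`, the same Python code could in principle target a different SQL dialect by swapping the compiler, which is the sense in which a SQL compiler toolchain can be general purpose.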


Benefits of Ibis

• Maximize developer productivity
  • Mirrors single-node Python experience
  • Solve big data problems without leaving Python
  • Leverage Python skills, ecosystem, and tools

• Python as first-class language for Hadoop
  • Full-fidelity analysis without extractions
  • Python analysis at any scale
  • Native hardware speeds for a broad set of use cases


Brief interactive demo


Ibis/Impala Joint Roadmap

• More natural data modeling
  • Complex types support

• Integration with full Python data ecosystem
  • Advanced analytics + machine learning
  • Enable use of performance computing tools

• User extensibility with native performance
  • In-memory columnar format
  • Python-to-LLVM IR compilation

• Workflow and usability tools


Executing data science languages in the compute layer

UI: Ibis, SQL, Spark API, …

Compute: Analytic SQL, Spark, MapReduce

Storage: HDFS, Kudu, HBase

Python, R, Julia, …?


Enabling interoperability with big data systems

• Distributed / MPP query engines: implemented in a host language
  • Typically C/C++ or Java/Scala

• User-defined functions (UDFs) through various means
  • Implement in host language
  • Implement in user language through some external language protocol (often RPC-based)

• External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)


What are UDFs good for?

• Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines

• Custom data transformations
• Custom domain logic (date / time / data types)
• Custom data types
• Custom aggregations (incl. machine learning / statistics expressible as reductions)


Why are external UDFs slow?

• Serialization / deserialization overhead
• Scalar vs vectorized computations
• RPC overhead
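
The three costs listed above can be made visible with a small in-process simulation. This is a sketch under stated assumptions: `ExternalWorker` is a hypothetical stand-in for a UDF process behind an RPC boundary, `pickle` stands in for whatever wire format a real protocol would use, and `add_tax` is an invented UDF. No real network or query engine is involved.

```python
import pickle


class ExternalWorker:
    """Stand-in for a UDF process reached over a (hypothetical) RPC boundary."""

    def __init__(self, func):
        self.func = func
        self.calls = 0         # number of boundary crossings
        self.bytes_moved = 0   # serialized bytes in both directions

    def call(self, payload):
        request = pickle.dumps(payload)            # serialize on the engine side
        result = self.func(pickle.loads(request))  # deserialize, run user code
        response = pickle.dumps(result)            # serialize the answer
        self.calls += 1
        self.bytes_moved += len(request) + len(response)
        return pickle.loads(response)


def add_tax(price):
    """Hypothetical scalar UDF."""
    return price * 1.25


prices = [float(i) for i in range(1000)]

# Scalar protocol: one round trip per value.
scalar = ExternalWorker(add_tax)
out_scalar = [scalar.call(p) for p in prices]

# Batched protocol: one round trip for the whole column.
batch = ExternalWorker(lambda column: [add_tax(p) for p in column])
out_batch = batch.call(prices)

print(out_scalar == out_batch)                # True: same answer either way
print(scalar.calls, batch.calls)              # 1000 1
print(scalar.bytes_moved, batch.bytes_moved)  # per-value framing adds up
```

Both paths compute the same result; the scalar path simply pays the serialization and call overhead a thousand times instead of once.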


Example: Vectorization for interpreted languages

SUM(CASE WHEN x > y THEN x ELSE x + y END)
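
A minimal sketch of what vectorization buys an interpreter, using the expression above. The tuple-based expression tree and both evaluators are hypothetical constructs for this illustration, not any engine's internals. The counter records how many times the interpreter dispatches on a tree node.

```python
# The CASE expression as a nested-tuple tree (a made-up mini-format).
EXPR = ("case", ("gt", "x", "y"), "x", ("add", "x", "y"))

dispatches = {"row": 0, "vector": 0}


def eval_row(node, row):
    """Interpret the tree for one row; every node is re-dispatched per row."""
    dispatches["row"] += 1
    if isinstance(node, str):
        return row[node]
    op = node[0]
    if op == "gt":
        return eval_row(node[1], row) > eval_row(node[2], row)
    if op == "add":
        return eval_row(node[1], row) + eval_row(node[2], row)
    if op == "case":
        if eval_row(node[1], row):
            return eval_row(node[2], row)
        return eval_row(node[3], row)
    raise ValueError(op)


def eval_vector(node, cols):
    """Interpret the tree once; each operator processes whole columns."""
    dispatches["vector"] += 1
    if isinstance(node, str):
        return cols[node]
    op = node[0]
    if op == "gt":
        a, b = eval_vector(node[1], cols), eval_vector(node[2], cols)
        return [p > q for p, q in zip(a, b)]
    if op == "add":
        a, b = eval_vector(node[1], cols), eval_vector(node[2], cols)
        return [p + q for p, q in zip(a, b)]
    if op == "case":
        cond = eval_vector(node[1], cols)
        then = eval_vector(node[2], cols)
        other = eval_vector(node[3], cols)
        return [t if c else o for c, t, o in zip(cond, then, other)]
    raise ValueError(op)


cols = {"x": [1, 5, 3, 8], "y": [4, 2, 3, 1]}
rows = [dict(zip(cols, vals)) for vals in zip(*cols.values())]

total_row = sum(eval_row(EXPR, r) for r in rows)
total_vec = sum(eval_vector(EXPR, cols))

print(total_row, total_vec)  # 24 24
print(dispatches)            # row dispatches grow with the data; vector stays at 8
```

The vectorized evaluator walks the tree once regardless of row count, so interpreter overhead is amortized over the column. In this sketch the per-column loops are still Python list comprehensions; in a real engine they would be tight native loops.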


Vectorized vs Interpreted perf


How to make them fast?

• Common runtime memory representation for tabular data
• Shared-memory (zero-copy or memcpy-only) external UDF protocol
• Vectorized UDF interface (for interpreted languages)
• Impala is uniquely positioned to play well with Ibis
  • Best-in-class performance and scalability
  • C++ and LLVM-based (JIT compiler) runtime
• Unified, efficient data interchange amongst Ibis, Impala, and Kudu will enable high performance real-time analytics from Python
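
The zero-copy idea can be illustrated inside a single Python process with the standard library's buffer protocol. This is only an analogy for a shared-memory UDF protocol: `udf_sum` is a hypothetical vectorized UDF, `pickle` stands in for a copying protocol, and no cross-process sharing actually takes place here.

```python
from array import array
import pickle

# One column of float64 values in a contiguous buffer.
column = array("d", [1.5, 2.5, 3.0, 4.0])

# Copying path: serialize, ship, deserialize (a fresh copy of the data).
copied = pickle.loads(pickle.dumps(column))

# Zero-copy path: hand the UDF a view onto the very same memory.
view = memoryview(column)


def udf_sum(values):
    """Hypothetical vectorized UDF: consumes a whole column at once."""
    return sum(values)


# Mutate the original buffer to show which consumer shares memory with it.
column[0] = 10.0

print(udf_sum(view))    # 19.5, the view sees the update
print(udf_sum(copied))  # 11.0, the copy does not
```

Sharing works here because both sides agree on the memory layout (contiguous native float64), which is the role a common runtime memory representation plays between an engine and a UDF runtime.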


Memory representation

• Many query engines are standardizing on in-memory columnar representation of materialized transient data
  • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
  • Apache Drill: https://drill.apache.org/faq/

• Industry-standard serialization format: Apache Parquet
  • https://parquet.apache.org/


Serialization vs In-memory

• Serialization formats (e.g. Parquet)
  • Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
  • Do not consider random access or in-memory analytics as a goal

• No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)
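
A small sketch of why a serialized layout and an in-memory layout serve different goals. The wire format below is a made-up, length-prefixed record encoding chosen for brevity; it is not Parquet or any of the formats named above, and the records are invented.

```python
import struct
from array import array

# Hypothetical (name, score) records.
records = [("ann", 3.5), ("bartholomew", 1.0), ("cy", 7.25)]


def encode(recs):
    """Toy serialized form: length-prefixed UTF-8 name, then a float64."""
    out = bytearray()
    for name, score in recs:
        raw = name.encode("utf-8")
        out += struct.pack("<I", len(raw)) + raw + struct.pack("<d", score)
    return bytes(out)


def score_at(blob, index):
    """Random access must first step over every earlier record."""
    offset = 0
    for _ in range(index):
        (n,) = struct.unpack_from("<I", blob, offset)
        offset += 4 + n + 8
    (n,) = struct.unpack_from("<I", blob, offset)
    (score,) = struct.unpack_from("<d", blob, offset + 4 + n)
    return score


blob = encode(records)

# In-memory columnar form: one contiguous, fixed-width array per column.
scores = array("d", (s for _, s in records))

print(score_at(blob, 2))  # 7.25, found by scanning from the start
print(scores[2])          # 7.25, found by direct indexing
```

The compact byte stream is convenient to write and ship, while the fixed-width column supports direct indexing and tight loops, which is the trade-off the slide describes.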


Standardized in-memory columnar (IMC)

• Compact in-memory representation for semistructured data
• Part of Impala’s upcoming dev roadmap
• Some prior IMC-for-SQL work: Apache Drill
• Standardized memory representation means data can be shared without serialization
• Create a canonical C/C++ implementation for use in Python / R / Julia


Ibis’s Vision

• Uncompromised Python experience
  • 100% Python end-to-end user workflows
  • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)

• Interactive at big data scale
  • Full-fidelity analysis without extractions
  • Scalability for big data
  • Native hardware speeds for a broad set of use cases


Thank you
Wes McKinney
@wesmckinn
Views are my own
