PySpark Essentials (20170630 Sapporo DB Analytics Showcase)
TRANSCRIPT
![Page 1: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/1.jpg)
PySpark
![Page 2: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/2.jpg)
▸ facebook : Ryuji Tamagawa
▸ Twitter : tamagawa_ryuji
▸ FB
![Page 3: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/3.jpg)
![Page 4: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/4.jpg)
8
![Page 6: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/6.jpg)
▸ pandas PyData
▸ Spark Scala Java
Spark
▸ TB
![Page 7: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/7.jpg)
![Page 8: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/8.jpg)
▸ Spark Hadoop
▸ PySpark
▸ PySpark
▸ Spark/Hadoop PyData
PySpark
![Page 9: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/9.jpg)
Spark Hadoop
![Page 10: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/10.jpg)
Spark Hadoop
[Figure: evolution of the stack across three columns labeled Hadoop 0.x, Hadoop 1.x, and Hadoop 2.x + Spark. Components shown: OS, HDFS, MapReduce, Hive etc., HBase, YARN, Impala (SQL), Mesos, Windows, and Spark (Spark Streaming, MLlib, GraphX, Spark SQL).]
![Page 11: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/11.jpg)
Spark Hadoop
[Figure: Hadoop MapReduce compared with Spark. Labels on the MapReduce side: map JVM, reduce JVM, HDFS. Labels on the Spark side: Executor JVM, RDD, HDFS, and functions f1 to f7.]
![Page 12: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/12.jpg)
Spark Hadoop
Spark
▸ Hadoop MapReduce
▸ Spark API MapReduce API
▸ Hadoop
![Page 13: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/13.jpg)
PySpark
![Page 14: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/14.jpg)
PySpark
(Py)Spark
▸ / Spark
▸ PyData
▸ Spark
▸ Spark Hadoop
PyData
PySpark
![Page 15: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/15.jpg)
PySpark
▸
▸ SSD
▸ CPU
▸ Parquet, S3
CPU
![Page 16: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/16.jpg)
Spark 1.2 PySpark …
(Py)Spark
![Page 17: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/17.jpg)
PySpark
![Page 18: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/18.jpg)
PySpark
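RDD API and DataFrame API
▸ RDD (Resilient Distributed Dataset) = Spark's core abstraction, a distributed collection of (Java) objects
▸ DataFrame: a tabular structure built on top of RDDs, comparable to R's data.frame
▸ Spark 2.x centers on DataFrame (see Learning PySpark): ML, Structured Streaming, GraphFrames, TensorFrames
▸ In Python, the gap between the RDD API and the DataFrame API is larger than in Scala / Java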
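As a rough illustration of the two APIs named on this slide, the sketch below computes the same per-key total twice, once with the RDD API and once with the DataFrame API. It is not taken from the talk: the "key,amount" line format, the column names `key` and `amount`, and all function names are hypothetical, and it assumes a working PySpark installation and an existing SparkSession passed in as `spark`.

```python
def parse_line(line):
    """Split a 'key,amount' text line into a (key, amount) pair."""
    key, amount = line.split(",")
    return key.strip(), int(amount)


def totals_with_rdd(spark, lines):
    """RDD API: parse_line and the lambda run in Python worker processes."""
    rdd = spark.sparkContext.parallelize(lines)
    pairs = rdd.map(parse_line).reduceByKey(lambda a, b: a + b)
    return dict(pairs.collect())


def totals_with_dataframe(spark, lines):
    """DataFrame API: the aggregation is planned and executed inside the JVM."""
    from pyspark.sql import functions as F  # lazy import; requires PySpark

    rows = [parse_line(line) for line in lines]  # parsed on the driver for brevity
    df = spark.createDataFrame(rows, ["key", "amount"])
    result = df.groupBy("key").agg(F.sum("amount").alias("total")).collect()
    return {row["key"]: row["total"] for row in result}
```

Given a session created with `SparkSession.builder.getOrCreate()`, both functions should return the same dictionary for an input such as `["a,1", "b,2", "a,3"]`.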
![Page 19: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/19.jpg)
PySpark
[Figure: two cluster diagrams, one captioned "RDD API PySpark" and one captioned "DataFrame API PySpark". Each shows a Driver JVM, Storage, and three Worker nodes, each with an Executor JVM and a Python VM.]
![Page 20: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/20.jpg)
PySpark
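▸ RDD API: data is passed between the Executor JVM and the Python VM
▸ DataFrame API: processing stays inside the JVM
▸ Python UDFs still need the Python VM
▸ UDFs can also be written in Scala or Java
▸ With Spark 2.x, prefer the DataFrame API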
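To make the UDF point concrete, here is a hedged sketch of one string normalization written two ways: as a Python UDF, where rows are shipped to a Python VM, and with built-in column functions, which are evaluated in the JVM. The column names `name` and `name_norm` and all function names are hypothetical; the sketch assumes PySpark is installed and `df` is a Spark DataFrame with a string column `name`.

```python
def normalize(name):
    """Plain Python logic: trim whitespace and lower-case, passing None through."""
    return None if name is None else name.strip().lower()


def with_python_udf(df):
    """Wrap normalize() as a Python UDF; rows are serialized to a Python VM."""
    from pyspark.sql import functions as F  # lazy import; requires PySpark
    from pyspark.sql.types import StringType

    normalize_udf = F.udf(normalize, StringType())
    return df.withColumn("name_norm", normalize_udf(F.col("name")))


def with_builtin_functions(df):
    """Built-in functions only, so no Python VM round trip is needed.

    Equivalent to the UDF for space-padded input: F.trim removes spaces,
    while Python's strip() also removes tabs and newlines.
    """
    from pyspark.sql import functions as F  # lazy import; requires PySpark

    return df.withColumn("name_norm", F.lower(F.trim(F.col("name"))))
```

When a built-in function covers the logic, the second form avoids the per-row serialization cost of the first.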
![Page 21: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/21.jpg)
Spark PyData
![Page 22: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/22.jpg)
Spark PyData
Spark PyData
▸ Spark
▸ Python PyData
▸ Parquet
▸ Apache Arrow
![Page 23: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/23.jpg)
Spark PyData
PyData
![Page 24: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/24.jpg)
Spark PyData
PyData
▸ Anaconda (Python), Blaze ("NumPy and pandas interface to Big Data"), dask, Bokeh
▸ Canopy (Python), IPython
▸ matplotlib, nose
▸ numba (JIT), NumPy, SciPy
▸ Statsmodels, SymPy
▸ pandas, scikit-image, scikit-learn
![Page 25: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/25.jpg)
Spark PyData
▸ CSV JSON
▸ Spark Parquet
▸ Performance comparison of different file formats and storage engines in the Hadoop ecosystem
▸ Parquet Python
▸ fastparquet pyarrow
▸ Parquet
![Page 26: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/26.jpg)
Spark PyData
Parquet
https://parquet.apache.org/documentation/latest/
I/O
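One commonly cited reason Parquet reduces I/O is that it stores data column by column, so a reader can fetch only the columns it needs. A minimal sketch with pyarrow follows. The function names are hypothetical, `path` and `columns` are placeholders supplied by the caller, and it assumes pandas and pyarrow are installed.

```python
def read_columns(path, columns):
    """Read only the requested columns of a Parquet file into a pandas DataFrame."""
    import pyarrow.parquet as pq  # lazy import; requires pyarrow

    table = pq.read_table(path, columns=list(columns))
    return table.to_pandas()


def unused_columns(all_columns, wanted):
    """Columns that a column-pruned read can skip, in their original order."""
    wanted = set(wanted)
    return [c for c in all_columns if c not in wanted]
```

With a row-oriented format such as CSV, every column listed by `unused_columns` would still have to be read and parsed.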
![Page 27: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/27.jpg)
Spark PyData
Spark:
df = spark.read.csv(csvFilename, header=True, schema=theSchema).coalesce(20)
df.write.save(filename, compression='snappy')

fastparquet:
import pandas as pd
from fastparquet import write
pdf = pd.read_csv(csvFilename)
write(filename, pdf, compression='UNCOMPRESSED')

pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq
arrow_table = pa.Table.from_pandas(pdf)
pq.write_table(arrow_table, filename, compression='GZIP')
![Page 28: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/28.jpg)
Spark PyData
▸ pandas CSV Spark
Spark pandas
…
▸ Spark - pandas
▸ pandas → Spark …
▸ Apache Arrow
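The hand-off between Spark and pandas discussed on this slide can be sketched as below. This is an illustration under assumptions, not the speaker's code: the function names and the `max_rows` guard are hypothetical (the guard is there because `toPandas()` collects every row into driver memory), and the Arrow configuration key shown applies to Spark 2.3 and 2.4, which were released after this talk.

```python
def spark_to_pandas(sdf, max_rows=100000):
    """Collect at most max_rows rows of a Spark DataFrame into pandas on the driver."""
    return sdf.limit(max_rows).toPandas()


def pandas_to_spark(spark, pdf):
    """Distribute a pandas DataFrame as a Spark DataFrame."""
    return spark.createDataFrame(pdf)


def enable_arrow(spark):
    """Opt in to Arrow-based conversion between Spark and pandas.

    Key for Spark 2.3/2.4; Spark 3.x renamed it to
    'spark.sql.execution.arrow.pyspark.enabled'.
    """
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
```

Filtering and aggregating in Spark first, then converting only the reduced result, keeps the pandas side small.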
![Page 29: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/29.jpg)
Spark PyData
Apache Arrow
▸ Apache Arrow
▸ PyData / OSS
▸ /
https://arrow.apache.org
![Page 30: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/30.jpg)
Spark PyData
Wes blog
▸ pandas Apache Arrow
▸ Blog
▸ PyData Blog
Wes OK
▸ 2017 : pandas, Arrow, Feather, Parquet, Spark, Ibis
http://qiita.com/tamagawa-ryuji/items/deb3f63ed4c7c8065e81
![Page 31: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/31.jpg)
PySpark
![Page 32: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/32.jpg)
▸ pandas PySpark
▸ PySpark DataFrame API
▸ Parquet
CSV
Parquet
▸ UI: Jupyter Notebook
[Figure: diagram with the labels Parquet, PySpark (DataFrame API), pandas / PyData, Jupyter Notebook, and CSV.]
![Page 33: PySparkの勘所(20170630 sapporo db analytics showcase)](https://reader034.vdocuments.net/reader034/viewer/2022042510/5a6551bd7f8b9a5a2a8b4acb/html5/thumbnails/33.jpg)