![Page 1: Spark Tutorial with Set Up and Basic File Processingeecs.csuohio.edu/~sschung/cis612/CIS612_SparkBasic... · 2019-04-22 · -rw-r--r-- -rw-r--r-- 1 stacey supergroup 1 stacey supergroup](https://reader033.vdocuments.net/reader033/viewer/2022050308/5f7082b23d4ec2594518d6be/html5/thumbnails/1.jpg)
Spark Tutorial with Set Up and Basic File Processing
CIS 612
1) Download Apache Spark release 2.4.1 from http://spark.apache.org/downloads.html
Choose the package built without Hadoop ("Pre-built with user-provided Apache Hadoop"), since Hadoop is already installed and configured on my system
2) Follow Apache documentation for setting up Spark with your own Hadoop installation:
http://spark.apache.org/docs/latest/hadoop-provided.html
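Modify SPARK_DIST_CLASSPATH to include Hadoop's package jars by adding an entry in conf/spark-env.sh. When the hadoop binary is on your PATH, the page above suggests the entry: export SPARK_DIST_CLASSPATH=$(hadoop classpath)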
3) Put JSON files on HDFS to use in Spark
4) Run Spark Shell
5) Get a SQLContext to be able to use Spark SQL
6) Import the business100.json file
7) Import the review100B.json file
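Steps 5 to 7 can be sketched in the Spark shell roughly as follows. This is a minimal sketch, not the exact commands from the original screenshots: the HDFS directory in dataDir is a placeholder for wherever the files were put in step 3, and the names session, business and review are chosen here for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SparkSession}

// spark-shell already provides a session; getOrCreate() returns that same one.
val session: SparkSession = SparkSession.builder().getOrCreate()

// Step 5: the SQLContext that belongs to the session.
val sqlContext: SQLContext = session.sqlContext

// Placeholder (assumption): replace with the HDFS directory used in step 3.
val dataDir = "hdfs:///path/to/json"

// Steps 6 and 7: by default Spark expects one JSON object per line
// and infers the schema by scanning the file.
val business: DataFrame = sqlContext.read.json(s"$dataDir/business100.json")
val review: DataFrame = sqlContext.read.json(s"$dataDir/review100B.json")
```

In Spark 2.x the session itself can read files too (session.read.json), so the SQLContext is mainly kept here to mirror the step as written.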
8) Show schema of business100
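One way to show the schema is printSchema(), sketched below. The file path is a placeholder, and the sketch reloads the file so that it runs on its own; in a session where business100 is already loaded, only the last two statements are needed.

```scala
import org.apache.spark.sql.SparkSession

// spark-shell already provides a session; getOrCreate() returns that same one.
val sqlContext = SparkSession.builder().getOrCreate().sqlContext

// Placeholder path (assumption): point this at the file put on HDFS in step 3.
val business = sqlContext.read.json("hdfs:///path/to/json/business100.json")

// Prints the inferred schema as an indented tree,
// one line per field with its type and nullability.
business.printSchema()

// The same information as data: list the top-level column names.
business.schema.fieldNames.foreach(println)
```

Nested JSON objects appear as struct fields in the tree, which is why the printed schema can run over several pages.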
9) Show schema of review100B
10) Register the DataFrames (called SchemaRDDs in Spark releases before 1.3) as temporary tables so they can be queried with Spark SQL from Scala
11) Query the businesses table to find businesses rated higher than 4 stars
12) Create a table of the business IDs rated higher than 4 stars, then join it to the review table on business ID to get the reviews of these highly rated businesses
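Steps 10 to 12 might look like the sketch below. It assumes the files follow the Yelp dataset layout, with columns business_id, name and stars in the business file and business_id in the review file; check the schemas printed in steps 8 and 9 and adjust the names if they differ. The view names and the HDFS path are placeholders chosen for illustration. createOrReplaceTempView is used because registerTempTable is deprecated in Spark 2.x.

```scala
import org.apache.spark.sql.SparkSession

// spark-shell already provides a session; getOrCreate() returns that same one.
val sqlContext = SparkSession.builder().getOrCreate().sqlContext

// Placeholder (assumption): replace with the HDFS directory used in step 3.
val dataDir = "hdfs:///path/to/json"
val business = sqlContext.read.json(s"$dataDir/business100.json")
val review = sqlContext.read.json(s"$dataDir/review100B.json")

// Step 10: expose both DataFrames to SQL under illustrative view names.
business.createOrReplaceTempView("business")
review.createOrReplaceTempView("review")

// Step 11: businesses rated higher than 4 stars (column names are assumed).
val highlyRated = sqlContext.sql(
  "SELECT business_id, name, stars FROM business WHERE stars > 4")
highlyRated.show()

// Step 12: register the result, then join it to the reviews on business_id.
highlyRated.createOrReplaceTempView("highly_rated")
val goodReviews = sqlContext.sql(
  """SELECT r.*
    |FROM review r
    |JOIN highly_rated h ON r.business_id = h.business_id""".stripMargin)
goodReviews.show()
```

The temporary views live only as long as the Spark session, so they have to be registered again after restarting the shell.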
13) Find funny reviews from this list of good businesses (more than 5 votes for “funny”)
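A possible query for step 13 is sketched below. It assumes the review records store vote counts in a nested votes struct with a funny field, as in older Yelp dataset releases; if the schema shown in step 9 has funny as a top-level column, use r.funny instead. The HDFS path, view names and the selected text column are also assumptions.

```scala
import org.apache.spark.sql.SparkSession

// spark-shell already provides a session; getOrCreate() returns that same one.
val sqlContext = SparkSession.builder().getOrCreate().sqlContext

// Placeholder (assumption): replace with the HDFS directory used in step 3.
val dataDir = "hdfs:///path/to/json"
sqlContext.read.json(s"$dataDir/business100.json").createOrReplaceTempView("business")
sqlContext.read.json(s"$dataDir/review100B.json").createOrReplaceTempView("review")

// Reviews of businesses rated above 4 stars that got more than 5 "funny" votes.
// Assumed columns: business.stars, review.votes.funny, review.text.
val funnyReviews = sqlContext.sql(
  """SELECT r.business_id, r.votes.funny AS funny_votes, r.text
    |FROM review r
    |JOIN business b ON r.business_id = b.business_id
    |WHERE b.stars > 4 AND r.votes.funny > 5""".stripMargin)

// truncate = false prints the full review text instead of the first 20 characters.
funnyReviews.show(20, false)
```

This folds the star filter into a single join rather than reusing the table from step 12; either form should select the same reviews.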