Page 1: Spark Tutorial with Set Up and Basic File Processingeecs.csuohio.edu/~sschung/cis612/CIS612_SparkBasic... · 2019-04-22 · -rw-r--r-- -rw-r--r-- 1 stacey supergroup 1 stacey supergroup

Spark Tutorial with Set Up and Basic File Processing

CIS 612

1) Download Apache Spark release 2.4.1 from http://spark.apache.org/downloads.html

Choose the binary without Hadoop, as Hadoop is already installed and configured on my system

2) Follow Apache documentation for setting up Spark with your own Hadoop installation:

http://spark.apache.org/docs/latest/hadoop-provided.html

Modify SPARK_DIST_CLASSPATH to include Hadoop’s package jars by adding an entry in conf/spark-env.sh

3) Put JSON files on HDFS to use in Spark

4) Run Spark Shell

5) Get a SQLContext to be able to use SparkSQL
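
A minimal sketch of this step, assuming the Spark 2.4.1 shell started in step 4. In the 2.x shell a SparkSession named `spark` and a SparkContext named `sc` are already defined, so the SQLContext can be taken from the session rather than constructed by hand.

```scala
// Paste into spark-shell, where `spark` (SparkSession) and `sc` (SparkContext) are predefined.
import org.apache.spark.sql.SQLContext

// Reuse the SQLContext that belongs to the shell's session.
val sqlContext: SQLContext = spark.sqlContext

// The Spark 1.x form, `new SQLContext(sc)`, still compiles in 2.4 but is deprecated.
```

The same `read` and `sql` methods are also available directly on `spark`, so either entry point works for the later steps.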

6) Import business100.json file

7) Import review100B.json file
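
Steps 6 and 7 might look like the following sketch. The two paths are assumptions: they are written as relative paths, which resolve against the user's HDFS home directory when HDFS is the default filesystem, so adjust them to wherever the files were put in step 3. The sketch also assumes each file holds one JSON object per line, which is what `read.json` expects by default.

```scala
// Paste into spark-shell, where `spark` is predefined.
val sqlContext = spark.sqlContext

// Hypothetical locations of the files uploaded in step 3.
val businessPath = "business100.json"
val reviewPath = "review100B.json"

// read.json infers a schema by scanning the file and returns a DataFrame.
val business = sqlContext.read.json(businessPath)
val review = sqlContext.read.json(reviewPath)

// Peek at a few rows to confirm the load worked.
business.show(5)
review.show(5)
```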

8) Show schema of business100

9) Show schema of review100B
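
Steps 8 and 9 can be sketched with `printSchema`, which prints the inferred schema as a tree. The file paths are the same hypothetical ones used when importing the files; the printed field names depend on the actual JSON and are not reproduced here.

```scala
// Paste into spark-shell, where `spark` is predefined.
val sqlContext = spark.sqlContext

// Hypothetical paths; adjust to the HDFS locations from step 3.
val business = sqlContext.read.json("business100.json")
val review = sqlContext.read.json("review100B.json")

// Print each inferred schema: field names, types, and nullability.
business.printSchema()
review.printSchema()
```

Nested JSON objects show up as `struct` fields in this output, which matters later when a query needs to reach into one of them.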

10) Register the DataFrames (called SchemaRDDs in Spark releases before 1.3) as tables so they can be queried with Spark SQL
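
A sketch of the registration step for Spark 2.x, where `createOrReplaceTempView` replaces the older, deprecated `registerTempTable`. The table name `businesses` follows the wording of step 11; `reviews` and the file paths are assumptions.

```scala
// Paste into spark-shell, where `spark` is predefined.
val sqlContext = spark.sqlContext

// Hypothetical paths; adjust to the HDFS locations from step 3.
val business = sqlContext.read.json("business100.json")
val review = sqlContext.read.json("review100B.json")

// Temporary views live only for the lifetime of this Spark session.
business.createOrReplaceTempView("businesses")
review.createOrReplaceTempView("reviews") // assumed table name

// List the tables now visible to SQL queries.
sqlContext.tableNames().foreach(println)
```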

11) Query the businesses table to find businesses rated higher than 4 stars
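
One way this query might be written. The column names `business_id`, `name`, and `stars` are assumptions about the JSON fields; check them against the schema printed in step 8 and rename as needed.

```scala
// Paste into spark-shell, where `spark` is predefined.
val sqlContext = spark.sqlContext

// Hypothetical path; registers the data under the table name used in this step.
sqlContext.read.json("business100.json").createOrReplaceTempView("businesses")

// Assumed columns: business_id, name, stars.
val highlyRated = sqlContext.sql(
  "SELECT business_id, name, stars FROM businesses WHERE stars > 4")

highlyRated.show()
```

Note that `stars > 4` is a strict comparison, so businesses rated exactly 4 stars are excluded.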

12) Create a table of the business IDs rated higher than 4 stars, then join it to the review table on business ID to get the reviews of these highly rated businesses
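
A sketch of this two-part step under stated assumptions: the join key is a column named `business_id` present in both files, the rating column is `stars`, and the view names `reviews` and `top_businesses` are hypothetical.

```scala
// Paste into spark-shell, where `spark` is predefined.
val sqlContext = spark.sqlContext

// Hypothetical paths and table names.
sqlContext.read.json("business100.json").createOrReplaceTempView("businesses")
sqlContext.read.json("review100B.json").createOrReplaceTempView("reviews")

// Part 1: a table holding only the IDs of businesses rated above 4 stars.
val topIds = sqlContext.sql("SELECT business_id FROM businesses WHERE stars > 4")
topIds.createOrReplaceTempView("top_businesses")

// Part 2: inner join keeps only reviews whose business is in that table.
val topReviews = sqlContext.sql("""
  SELECT r.*
  FROM reviews r
  JOIN top_businesses t ON r.business_id = t.business_id
""")

topReviews.show(5)
```

Registering the ID list as its own view mirrors the step as described; the same result could also be produced with a subquery in a single statement.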

13) Find funny reviews from this list of good businesses (more than 5 votes for “funny”)
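
A sketch of the final filter. It assumes the review JSON stores vote counts in a nested `votes` struct with a `funny` field, along with `business_id` and `text` columns; if the schema printed in step 9 shows `funny` as a top-level column instead, replace `r.votes.funny` with `r.funny`. Paths and table names are hypothetical.

```scala
// Paste into spark-shell, where `spark` is predefined.
val sqlContext = spark.sqlContext

// Hypothetical paths and table names.
sqlContext.read.json("business100.json").createOrReplaceTempView("businesses")
sqlContext.read.json("review100B.json").createOrReplaceTempView("reviews")

// Reviews of businesses rated above 4 stars that received more than 5 "funny" votes.
// Assumed fields: reviews.votes.funny, reviews.text, business_id in both tables.
val funnyReviews = sqlContext.sql("""
  SELECT r.business_id, r.votes.funny AS funny_votes, r.text
  FROM reviews r
  JOIN (SELECT business_id FROM businesses WHERE stars > 4) t
    ON r.business_id = t.business_id
  WHERE r.votes.funny > 5
""")

funnyReviews.show(10)
```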