Post on 25-Jul-2020

Spark Tutorial with Set Up and Basic File Processing

CIS 612

1) Download Apache Spark release 2.4.1 from http://spark.apache.org/downloads.html

Choose the binary without Hadoop (the "Pre-built with user-provided Apache Hadoop" package), since Hadoop is already installed and configured on this system

2) Follow Apache documentation for setting up Spark with your own Hadoop installation:

http://spark.apache.org/docs/latest/hadoop-provided.html

Modify SPARK_DIST_CLASSPATH to include Hadoop’s package jars by adding an entry in conf/spark-env.sh

3) Put JSON files on HDFS to use in Spark

4) Run Spark Shell

5) Get a SQLContext to be able to use SparkSQL

6) Import business100.json file

7) Import review100B.json file

8) Show schema of business100

9) Show schema of review100B

10) Register the DataFrames (called SchemaRDDs before Spark 1.3) as temporary tables to be able to query them with Spark SQL

11) Query the businesses table to find businesses rated higher than 4 stars

12) Create a table of the business IDs rated higher than 4 stars, then join it to the review table on business ID to get the reviews of these highly rated businesses

13) Find funny reviews from this list of good businesses (more than 5 votes for “funny”)
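
Steps 5 through 13 can be sketched as a single Spark shell session. This is a minimal sketch under several assumptions, not the exact commands from the original slides: the HDFS paths are hypothetical, the view names (businesses, reviews, top_businesses, top_reviews) are chosen here for illustration, and the column names (business_id, name, stars, text, votes.funny) assume the files follow the older Yelp academic dataset layout. Check the output of the schema steps and adjust the names if your files differ.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists as `spark`; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("YelpTutorialSketch").getOrCreate()

// Step 5: get a SQLContext (still available in Spark 2.4, backed by the session).
val sqlContext = spark.sqlContext

// Steps 6-7: read the JSON files (one JSON object per line).
// ASSUMPTION: hypothetical HDFS paths; use the location chosen in step 3.
val business = sqlContext.read.json("hdfs:///user/spark/business100.json")
val review = sqlContext.read.json("hdfs:///user/spark/review100B.json")

// Steps 8-9: print the schemas inferred from the JSON.
business.printSchema()
review.printSchema()

// Step 10: register the DataFrames as temporary views for Spark SQL.
business.createOrReplaceTempView("businesses")
review.createOrReplaceTempView("reviews")

// Step 11: businesses rated higher than 4 stars.
// ASSUMPTION: columns business_id, name, stars exist in business100.json.
val topRated = sqlContext.sql(
  "SELECT business_id, name, stars FROM businesses WHERE stars > 4")
topRated.show()

// Step 12: register the highly rated businesses, then join to the reviews.
topRated.createOrReplaceTempView("top_businesses")
val topReviews = sqlContext.sql(
  """SELECT r.*
    |FROM reviews r
    |JOIN top_businesses b ON r.business_id = b.business_id""".stripMargin)
topReviews.createOrReplaceTempView("top_reviews")

// Step 13: reviews with more than 5 "funny" votes.
// ASSUMPTION: votes is a struct with a numeric field named funny.
val funny = sqlContext.sql(
  """SELECT business_id, text, votes.funny AS funny_votes
    |FROM top_reviews
    |WHERE votes.funny > 5""".stripMargin)
funny.show()
```

The sketch uses createOrReplaceTempView because registerTempTable is deprecated in Spark 2.x. Selecting r.* in the join keeps only the review columns, which avoids a duplicate business_id column in the result.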
