
Post on 30-Apr-2020


Joining 2 JSON files and Loading the Results in HBase

CIS 612

Paul Webster

The task for Lab 4_2 Part 2 was to install HBase, use Hive to load the Yelp ‘business.json’ and ‘review.json’ files into Hive tables, then join the two tables and store the result in an HBase table.

First, I installed HBase, which I downloaded from https://hbase.apache.org/downloads.html. Hive and Hadoop were already installed from Part 1 and Part 2, so I didn’t need to install them. I ran into several problems when installing HBase, which were fixed by changing the configuration file.

Next, I uploaded ‘business.json’ and ‘review.json’ to their own directories on Hadoop (‘/LAB4_2/business’ and ‘/LAB4_2/review’, the locations the staging tables read from). Below is the screen capture of these commands:
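The screen capture is not reproduced in this transcript. As a rough sketch, an upload of this kind can be issued from the Hive prompt through its `dfs` passthrough (the same arguments also work after `hdfs dfs` in a terminal). The HDFS directories match the LOCATION clauses of the staging tables in this report; the local file paths are assumptions.

```sql
-- Create one HDFS directory per file, so that each staging table's LOCATION
-- contains only its own JSON file.
dfs -mkdir -p /LAB4_2/business;
dfs -mkdir -p /LAB4_2/review;

-- Local paths are assumed; adjust them to wherever the Yelp files were unpacked.
dfs -put business.json /LAB4_2/business/;
dfs -put review.json /LAB4_2/review/;

-- List the uploaded files to confirm both copies arrived.
dfs -ls -R /LAB4_2;
```

Keeping each file in its own directory matters because an external table reads every file under its LOCATION.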

Next, I ran the following commands in Hive, which created the Hive table (and the HBase table behind it) containing the Yelp ‘business.json’ data and displayed the Hive table.

-- HBase-backed Hive table; business_id is mapped to the HBase row key.
DROP TABLE IF EXISTS hive_table_business;

CREATE TABLE hive_table_business(
  business_id string, name string, address string, city string,
  state string, postal_code string, latitude string, longitude string,
  stars string, review_count string, is_open string, attributes string,
  categories string, hours string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
  ":key,cf1:name,cf1:address,cf1:city,cf1:state,cf1:postal_code,cf1:latitude,cf1:longitude,cf1:stars,cf1:review_count,cf1:is_open,cf1:attributes,cf1:categories,cf1:hours")
TBLPROPERTIES ("hbase.table.name" = "hive_table_business");

-- Staging table: each line of business.json is read as a single string.
DROP TABLE IF EXISTS staging;

CREATE EXTERNAL TABLE staging (json STRING)
LOCATION '/LAB4_2/business';

-- Extract each JSON field and write it to the HBase-backed table.
INSERT OVERWRITE TABLE hive_table_business SELECT
  get_json_object(json, "$.business_id") AS business_id,
  get_json_object(json, "$.name") AS name,
  get_json_object(json, "$.address") AS address,
  get_json_object(json, "$.city") AS city,
  get_json_object(json, "$.state") AS state,
  get_json_object(json, "$.postal_code") AS postal_code,
  get_json_object(json, "$.latitude") AS latitude,
  get_json_object(json, "$.longitude") AS longitude,
  get_json_object(json, "$.stars") AS stars,
  get_json_object(json, "$.review_count") AS review_count,
  get_json_object(json, "$.is_open") AS is_open,
  get_json_object(json, "$.attributes") AS attributes,
  get_json_object(json, "$.categories") AS categories,
  get_json_object(json, "$.hours") AS hours
FROM staging;

SELECT * FROM hive_table_business;

Below is the output:

Next, I opened the HBase shell and ran the command “scan 'hive_table_business'” to display the HBase table that was created. Below is the output:
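As an aid to reading the column mapping used for this table: the entries in `hbase.columns.mapping` pair up with the Hive columns by position. The first Hive column is bound to `:key` (the HBase row key), and each later column is stored in the `cf1` column family under the qualifier given. A minimal sketch with a hypothetical three-column table (`mapping_demo` is invented for illustration and is not part of the lab):

```sql
-- Hypothetical table, used only to show the positional mapping.
DROP TABLE IF EXISTS mapping_demo;

CREATE TABLE mapping_demo(id string, name string, city string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name,cf1:city")
TBLPROPERTIES ("hbase.table.name" = "mapping_demo");

-- Resulting correspondence (one mapping entry per Hive column, in order):
--   id   -> :key       (HBase row key)
--   name -> cf1:name
--   city -> cf1:city

-- The table parameters in this output include the mapping string.
DESCRIBE FORMATTED mapping_demo;

-- Remove the demo table again.
DROP TABLE mapping_demo;
```

The mapping string should stay on one line without spaces between entries, since the Hive HBase integration notes warn that whitespace is interpreted as part of the column name.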

Next, I ran the following commands in Hive, which created the Hive table (and the HBase table behind it) containing the Yelp ‘review.json’ data and displayed the first 10 entries of the Hive table.

-- HBase-backed Hive table; review_id is mapped to the HBase row key.
DROP TABLE IF EXISTS hive_table_review;

CREATE TABLE hive_table_review(
  review_id string, user_id string, business_id string, stars string,
  useful string, funny string, cool string, reviewtext string,
  reviewdate string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
  ":key,cf1:user_id,cf1:business_id,cf1:stars,cf1:useful,cf1:funny,cf1:cool,cf1:reviewtext,cf1:reviewdate")
TBLPROPERTIES ("hbase.table.name" = "hive_table_review");

-- Point the staging table at the review data.
DROP TABLE IF EXISTS staging;

CREATE EXTERNAL TABLE staging (json STRING)
LOCATION '/LAB4_2/review';

-- The JSON fields "text" and "date" are stored as reviewtext and reviewdate.
INSERT OVERWRITE TABLE hive_table_review SELECT
  get_json_object(json, "$.review_id") AS review_id,
  get_json_object(json, "$.user_id") AS user_id,
  get_json_object(json, "$.business_id") AS business_id,
  get_json_object(json, "$.stars") AS stars,
  get_json_object(json, "$.useful") AS useful,
  get_json_object(json, "$.funny") AS funny,
  get_json_object(json, "$.cool") AS cool,
  get_json_object(json, "$.text") AS reviewtext,
  get_json_object(json, "$.date") AS reviewdate
FROM staging;

SELECT * FROM hive_table_review LIMIT 10;

Below is the output:

Next, I opened the HBase shell and ran the command “scan 'hive_table_review', {'LIMIT' => 5}” to display the first 5 entries of the HBase table that was created. Below is the output:
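One optional way to sanity-check a load like this, sketched here as an addition to the lab steps, is to compare the number of staged JSON lines with the number of rows that reached the HBase-backed table. HBase keeps one row per key, so the loaded row count would be expected to equal the number of distinct `review_id` values. The sketch assumes `staging` still points at '/LAB4_2/review'.

```sql
-- Lines read from review.json, and how many distinct row keys they contain.
SELECT COUNT(*) AS staged_lines,
       COUNT(DISTINCT get_json_object(json, '$.review_id')) AS distinct_keys
FROM staging;

-- Rows that actually exist in the HBase-backed table.
SELECT COUNT(*) AS loaded_rows FROM hive_table_review;
```

If `staged_lines` is larger than `distinct_keys`, some lines share a key and only one of them can survive in HBase.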

Next, I joined the two tables created above into a single table, stored in HBase and queryable from Hive, using the following code:

-- HBase-backed Hive table for the join result; business_id is the row key.
DROP TABLE IF EXISTS hive_table_join;

CREATE TABLE hive_table_join(
  business_id string, name string, address string, city string,
  state string, postal_code string, latitude string, longitude string,
  starsbusiness string, review_count string, is_open string,
  attributes string, categories string, hours string, review_id string,
  user_id string, starsreview string, useful string, funny string,
  cool string, reviewtext string, reviewdate string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
  ":key,cf1:name,cf1:address,cf1:city,cf1:state,cf1:postal_code,cf1:latitude,cf1:longitude,cf1:starsbusiness,cf1:review_count,cf1:is_open,cf1:attributes,cf1:categories,cf1:hours,cf1:review_id,cf1:user_id,cf1:starsreview,cf1:useful,cf1:funny,cf1:cool,cf1:reviewtext,cf1:reviewdate")
TBLPROPERTIES ("hbase.table.name" = "hive_table_join");

-- Turn off automatic map-join conversion so Hive runs a regular shuffle join.
set hive.auto.convert.join=false;

-- Both inputs have a stars column, so each is renamed in the output.
INSERT OVERWRITE TABLE hive_table_join SELECT
  L.business_id,
  L.name,
  L.address,
  L.city,
  L.state,
  L.postal_code,
  L.latitude,
  L.longitude,
  L.stars AS starsbusiness,
  L.review_count,
  L.is_open,
  L.attributes,
  L.categories,
  L.hours,
  R.review_id,
  R.user_id,
  R.stars AS starsreview,
  R.useful,
  R.funny,
  R.cool,
  R.reviewtext,
  R.reviewdate
FROM hive_table_business L JOIN hive_table_review R ON
  (L.business_id = R.business_id);

Below is the output:

Next, in Hive, I displayed the first 5 entries of the joined table using the command “select * from hive_table_join limit 5”. Below is the output:

Next, in HBase, I displayed the first 5 entries of the joined table using the command “scan 'hive_table_join', {'LIMIT' => 5}”. Below is the output:
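One caveat about the join table, with a hedged sketch of an alternative. `hive_table_join` maps `business_id` to `:key`, and HBase stores a single row per key. When a business has several reviews, all joined rows for that business share one key, so later writes replace earlier ones and the table would be expected to retain at most one review per business. Keying on `review_id`, which differs for every joined row, avoids this. The table name `hive_table_join_by_review` is hypothetical and not part of the lab.

```sql
-- Compare the size of the join with what the HBase-backed table retained.
SELECT COUNT(*) AS joined_rows
FROM hive_table_business L JOIN hive_table_review R
  ON (L.business_id = R.business_id);

SELECT COUNT(*) AS stored_rows FROM hive_table_join;

-- Alternative layout (assumption): review_id comes first, so it maps to :key.
DROP TABLE IF EXISTS hive_table_join_by_review;

CREATE TABLE hive_table_join_by_review(
  review_id string, business_id string, name string, address string,
  city string, state string, postal_code string, latitude string,
  longitude string, starsbusiness string, review_count string,
  is_open string, attributes string, categories string, hours string,
  user_id string, starsreview string, useful string, funny string,
  cool string, reviewtext string, reviewdate string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
  ":key,cf1:business_id,cf1:name,cf1:address,cf1:city,cf1:state,cf1:postal_code,cf1:latitude,cf1:longitude,cf1:starsbusiness,cf1:review_count,cf1:is_open,cf1:attributes,cf1:categories,cf1:hours,cf1:user_id,cf1:starsreview,cf1:useful,cf1:funny,cf1:cool,cf1:reviewtext,cf1:reviewdate")
TBLPROPERTIES ("hbase.table.name" = "hive_table_join_by_review");

set hive.auto.convert.join=false;

-- Same join as before; only the column order (and therefore the key) changes.
INSERT OVERWRITE TABLE hive_table_join_by_review SELECT
  R.review_id, L.business_id, L.name, L.address, L.city, L.state,
  L.postal_code, L.latitude, L.longitude, L.stars, L.review_count,
  L.is_open, L.attributes, L.categories, L.hours,
  R.user_id, R.stars, R.useful, R.funny, R.cool, R.reviewtext, R.reviewdate
FROM hive_table_business L JOIN hive_table_review R
  ON (L.business_id = R.business_id);
```

If `stored_rows` comes out well below `joined_rows`, that gap is the set of reviews collapsed onto a shared `business_id` key.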