Tamr on Google Cloud Platform: E-Commerce Tutorial

Upload: dinhtram

Post on 13-Feb-2017

Overview

In this tutorial, we’ll be working with sources from an e-commerce company’s customer account database and web analytics data warehouse. In particular, we’ll be looking at data that was collected when the company was running a promotion on athletic shoes. Using Tamr on Google Cloud Platform, we’ll join these sources together, do some data cleansing, and finally push our new dataset via Google Dataflow to BigQuery, Google’s fully managed, NoOps data analytics service. There, we will ask questions of the data that might inform how the company could optimize its marketing strategies. For example, if the join reveals that 18-25 year old customers most often come to the site via a social media promotion, and buy running shoes more than any other shoe type, the company might embed running shoe advertisements into social media.

Customer data from the warehouse includes the following fields, which come both from forms filled in when site visitors 1) create an account and 2) make a purchase, and from their interactions on the site (ui):

+ first_name (1)

+ last_name (1)

+ birthdate (1)

+ age (1)

+ age_group (1)

+ billing_street (2)

+ billing_city (2)

+ billing_state (2)

+ billing_zip (2)

+ tracking_cookie (ui)

+ customer_ID (ui)

+ date_last_visit (ui)

+ credit_card (2 -- value encrypted for security)

+ date_last_purchase (2)

+ is_premium_member -- has made a purchase (2)

+ is_senior_discount_member -- age group 65 and up (1)

+ num_visits (ui)

The company’s web analytics data is derived directly from raw user interactions with their website, and contains the following fields:

+ session_id (unique identifier of each customer session)

+ tracking_cookie

+ IP_address

+ user_has_account

+ user_account_ID

+ num_items_in_cart

+ items_in_cart

+ cart_ID

+ product_department

+ product_ID

+ product_category

+ Price

+ promotion_type

+ referral_site_id

+ return_prospect_rating

+ purchase_is_gift

+ shipping_category

+ billing_street

+ billing_zip

Signing Into Tamr & Google Cloud Platform

To get started, register with Tamr and sign into Google Cloud Platform at gcp-preview.tamr.com

+ If you don’t have an account with Google Cloud Platform, you can go through the Tamr portion of the tutorial, but will not be able to push your dataset to BigQuery.

+ If you don’t have a Google Cloud Platform account but would like to register for one, select the “Free Trial” option at the bottom of the Google Cloud Platform sign-in page.

For more information on registration with Tamr and Google Cloud Platform, see [link to documentation section “Getting Started”]

Viewing, Joining, and Cleansing with Tamr

Once signed in, a prompt to select a source will appear:

+ Select the “Tamr Sample Data” project and “tamr-gcp-sample-data” bucket

+ From this bucket, select the Customer.csv source

Once you have selected a source, you will land on a screen that lists all of the fields in the Customer.csv source. To view the data within each field, click “Add All”. This generates a preview of the data in each field, along with some information about the density of each field, indicated by the green bar. The longer the green bar, the fewer empty records there are within that field.
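As a rough illustration of what the green density bar represents, the fraction of non-empty records per field can be computed as in the sketch below. This is not Tamr's implementation; the sample rows and the `field_density` helper are hypothetical and only stand in for a few Customer.csv fields.

```python
import csv
import io

# Hypothetical sample rows standing in for a few Customer.csv fields.
SAMPLE_CSV = """first_name,age,age_group
Ana,34,26-35
Ben,,
Cy,70,65+
Dee,,18-25
"""


def field_density(rows, fields):
    """Return the fraction of non-empty values for each field."""
    if not rows:
        return {field: 0.0 for field in fields}
    density = {}
    for field in fields:
        filled = sum(1 for row in rows if (row.get(field) or "").strip())
        density[field] = filled / len(rows)
    return density


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
density = field_density(rows, reader.fieldnames)

# Print a text version of the bar: a longer bar means fewer empty records.
for field, value in density.items():
    print(f"{field:<12} {'#' * round(value * 20)} {value:.0%}")
```

In this sample, `first_name` is fully populated, while `age` is filled in for only half of the rows, so its bar would be half as long.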

Next, we add a source from the web analytics warehouse, called Purchases.csv.

+ Select “Add a Source”

+ Make sure the “Tamr Sample Data” project and “tamr-gcp-sample-data” bucket are still selected

+ From this bucket, select the Purchases.csv source

In order to join these sources together, Tamr needs a join key, which is a field that contains the same data across the two sources. A prompt will appear in which we must enter the join key. In this case, we will use the company’s unique identifier for customers, called User_account_id in our Purchases source and customer_id in our Customer source.
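To make the idea of a join key concrete, here is a minimal sketch of a key-based join over two small in-memory tables. It assumes an inner join (rows without a match are dropped) and lowercase column names; the tutorial does not say which join type Tamr performs, and the `inner_join` helper and sample rows are hypothetical.

```python
def inner_join(left_rows, right_rows, left_key, right_key):
    """Join two lists of dicts where left_key equals right_key."""
    # Index the left table by its key so each right row is matched quickly.
    index = {}
    for row in left_rows:
        key = row.get(left_key)
        if key:  # skip rows that have no join key value
            index.setdefault(key, []).append(row)

    joined = []
    for right in right_rows:
        for left in index.get(right.get(right_key), []):
            merged = dict(left)
            # Fields that exist in both sources (for example billing_zip)
            # take the right-hand value in this sketch.
            merged.update(right)
            joined.append(merged)
    return joined


# Hypothetical rows; the column names follow the tutorial's field lists.
customers = [
    {"customer_id": "C1", "age": "22", "age_group": "18-25"},
    {"customer_id": "C2", "age": "41", "age_group": "36-45"},
]
purchases = [
    {"user_account_id": "C1", "product_category": "running"},
    {"user_account_id": "C1", "product_category": "tennis"},
    {"user_account_id": "C9", "product_category": "hiking"},
]

joined = inner_join(customers, purchases, "customer_id", "user_account_id")
for row in joined:
    print(row["customer_id"], row["user_account_id"], row["product_category"])
```

Every joined row carries the same value in `customer_id` and `user_account_id`, which is the same alignment you look for when checking the two key fields side by side in the Tamr interface.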

To verify that the join worked, search for these two fields and drag and drop them into your new dataset. You will see that their records are aligned and identical, indicating a successful join.

From here, you may select any fields that you would like to include in your new dataset. Some interesting fields to use for queries in BigQuery include:

+ age

+ age_group

+ referral_site_id

+ product_category

+ Price

+ return_prospect_rating

+ promotion_type

In order to clean up this new dataset, you can use Tamr’s transformation functionality to:

+ Remove blank rows

+ Replace all blank cells in an attribute with a certain value

An interesting transformation to try out involves removing rows that are blank for “age” and “age_group”. By doing this, we guarantee that there will be age information about all customers in our new dataset. To do this, select the two attributes and click “Transform”:

Once you are finished selecting and transforming attributes in your new dataset, select “Apply Formatting” and you can view the results. Then, you can push the dataset to BigQuery by clicking the “Move my Data to Google BigQuery” button.
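The two clean-up transformations described in this section can be sketched in a few lines of Python. This only illustrates the logic and is not how Tamr implements it. It reads "blank for age and age_group" as blank in both fields; the helper names, sample rows, and the `"none"` placeholder value are all hypothetical.

```python
def is_blank(value):
    """Treat None, empty strings, and whitespace-only strings as blank."""
    return value is None or str(value).strip() == ""


def drop_rows_blank_in_all(rows, fields):
    """Remove rows in which every listed field is blank."""
    return [row for row in rows if not all(is_blank(row.get(f)) for f in fields)]


def fill_blanks(rows, field, value):
    """Return copies of the rows with blank cells in `field` set to `value`."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left unchanged
        if is_blank(row.get(field)):
            row[field] = value
        filled.append(row)
    return filled


# Hypothetical joined rows.
rows = [
    {"customer_id": "C1", "age": "22", "age_group": "18-25", "promotion_type": "social"},
    {"customer_id": "C2", "age": "", "age_group": "", "promotion_type": "email"},
    {"customer_id": "C3", "age": "", "age_group": "65+", "promotion_type": ""},
]

cleaned = drop_rows_blank_in_all(rows, ["age", "age_group"])
cleaned = fill_blanks(cleaned, "promotion_type", "none")
for row in cleaned:
    print(row)
```

Under the stricter reading, where a row is removed if either field is blank, replace `all` with `any` in `drop_rows_blank_in_all`.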

Running a Google Cloud Dataflow Job

After you click the “Move my Data to Google BigQuery” button, your job will take a few minutes to start. Once it is running, you can see the progress of your Cloud Dataflow job within the Google Cloud Platform console by clicking the “Submitted” link that appears when you click “Publish to BigQuery,” shown below:

Once the job has finished running, click “BigQuery” in the left menu. You can verify that the job has finished by clicking “View Logs” on the right of the screen and watching the state of the job diagram (shown below).

Analysis in BigQuery

Once you arrive in BigQuery, find your new project and dataset in the left menu. Some questions you might look into include:

+ Did age groups tend to associate with particular promotion types or referral sites?

+ Which shoe type did each age group buy most?

+ Which age group had the most prospects with a high return_prospect_rating?
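As one possible starting point for the second question, the sketch below runs a GROUP BY query against an in-memory SQLite table from Python. SQLite is used only as a stand-in so the example is self-contained; in BigQuery you would reference your own project and dataset, and details of the SQL dialect may differ. The table name `joined`, the sample rows, and the assumption that product_category holds the shoe type are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE joined (age_group TEXT, product_category TEXT, promotion_type TEXT)"
)
# Hypothetical sample rows standing in for the dataset pushed from Tamr.
conn.executemany(
    "INSERT INTO joined VALUES (?, ?, ?)",
    [
        ("18-25", "running", "social"),
        ("18-25", "running", "social"),
        ("18-25", "tennis", "email"),
        ("65+", "walking", "email"),
        ("65+", "walking", "email"),
        ("65+", "running", "social"),
    ],
)

# Count purchases per age group and product category.
QUERY = """
SELECT age_group, product_category, COUNT(*) AS purchases
FROM joined
GROUP BY age_group, product_category
ORDER BY age_group, purchases DESC
"""

top_category = {}
for age_group, category, purchases in conn.execute(QUERY):
    # Rows arrive sorted by count within each age group, so the first
    # row seen for a group is its most purchased category.
    top_category.setdefault(age_group, (category, purchases))

for age_group, (category, purchases) in top_category.items():
    print(age_group, category, purchases)
```

Swapping product_category for promotion_type or referral_site_id in the SELECT and GROUP BY clauses gives a comparable starting point for the first question.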

For help writing SQL queries in BigQuery, check out Google’s BigQuery documentation.