Data Warehouse Implementation for Liquor Consumption in Iowa and Missouri States


Upload: mayur-mane

Post on 14-Jan-2017


Page 1: Data warehouse implementation for liquor consumption in IOWA and missouri states

1 | P a g e

Data Warehouse Implementation for Liquor Consumption

in Iowa and Missouri States

Student Name: Mayur Kishor Mane

Student ID: x15009009

Course: MSc in Data Analytics

National College of Ireland, Dublin.

Introduction:

In the modern era of Big Data, the continuous growth of data puts pressure on storage devices. Old data is often discarded to make room for new data, yet company decisions frequently depend on statistics covering all processes to date, and recovering discarded data is a major challenge. A good data warehouse addresses this historical data storage problem. It helps to build business-critical information from the enormous amount of data generated both in the past and in real time. A data warehouse is also expected to incorporate multiple datasets from different departments, to handle unused or bad data, and to deliver better report performance.

In this project I collected three datasets from three different sources and analyzed them to find information for the liquor business in the states of Missouri and Iowa. Business intelligence is essentially the conversion of data into information. Here the information comes from the analysis of the three datasets, and it helps to improve the liquor business and to make future decisions easier and more effective.

The steps used to implement the data warehouse are as follows:

1. Collecting datasets from different data sources

2. Design of data warehouse (DWH)

3. Design of ETL (Extract, Transform and Load) steps

4. Development of ETL

5. Data loading into data warehouse

The software tools used to implement the data warehouse are as follows:

1. SQL Server Management Studio (SSMS)

2. SQL Server Integration Services (SSIS)

3. SQL Server Analysis Services (SSAS)

4. SQL Server Reporting Services (SSRS)

5. Tableau

Collection of datasets:

To build a business-focused data warehouse we need good-quality data sources that provide data in a proper format and without wrong values. For the liquor consumption data I used the three data sources below.

Data Source 1: I used the official open data website of the US government to find liquor-related data for Missouri. It is structured data that includes liquor license details and distributor details with city, address, and contact details. The URL for this data is given below.

URL: https://catalog.data.gov/dataset/missouri-active-alcohol-license-data-af6fa/resource/f78c96f4-e9f2-414c-8a1f-07b9a67a1908

Data Source 2: I used another official government website, which contains data for the state of Iowa specifically. It is structured data that includes liquor product details, store details, and liquor sales data such as how many bottles were sold and at what price. The URL for this data is given below.

URL: https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy

Data Source 3: This data source is the social networking website Twitter. From it I fetched tweets about different liquors based on their commonly used names. After fetching the tweets I performed a simple sentiment analysis, which gives an idea of consumers' perspective on liquors in these two states in particular.

Data Warehouse Architecture:

The liquor consumption data warehouse is implemented using Ralph Kimball's approach, which is a bottom-up approach based on dimensional modelling. This technique is quicker to implement than Inmon's approach to data warehouse architecture.

This DWH focuses on increasing business value in the liquor industry.

It produces dimensionally structured data that is delivered to the liquor business.

It manages the liquor data lifecycle incrementally rather than taking a galactic big-bang approach.

Advantages of using Kimball's approach for the DWH architecture:

This approach offers many advantages over Inmon's top-down approach and over the mixed approach, which combines bottom-up and top-down. A few of the advantages I noticed while implementing the DWH for the liquor consumption datasets are listed below.

1. It is process oriented, which helps to focus on the business process flow.

2. It is easy to identify the core elements and improve the analysis reports.

3. It is a rapid process, which means the time required to build the DWH is low.

4. Because of the multiple dimensions, the datasets can be viewed in many ways to answer business-related questions.

The figure below shows the dimensional model of the liquor DWH:

Fig.1

Schema Structure of DWH:

A star schema has been used in the implementation of the liquor consumption DWH. With this schema, multidimensional modelling is much easier than with the snowflake and fact constellation schemas. The fact table sits at the center of the diagram and all dimensions are connected directly to it.

Fact Table:

In a DWH the fact table depends on the dimension tables. It holds the data that is going to be analyzed and refers to the dimension tables for the descriptive details. It has two main types of attributes: foreign keys and measures. The foreign keys connect the fact table to the dimension tables, and the measures hold the data that will be analyzed.

In the liquor consumption DWH, the fact table contains pack, date, and sale-related information as well as tweet sentiment analysis data. All of these columns help to answer the case studies.

Dimension Table:

A dimension is a collection of reference data about measurable events. These events are known as facts and are stored in the fact table; the dimensions in this DWH are product, vendor, sale, date and volume. Each dimension table has a primary key column whose values are numeric and unique.

The star schema for a multidimensional DWH is shown in the figure below.

Fig.2

Benefits of Star Schema:

The star schema is easy to implement; its other benefits are listed below.

It makes the ETL process easier when implementing the DWH.

Metric analysis becomes easier with this schema model.

Data efficiency increases because the dimension tables connect directly to the fact table.

Queries are simple to write and execute.

Fewer foreign keys are needed, which makes processing faster.

Each dimension needs only one dimension table, which reduces the size of the model.
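
To illustrate why queries against a star schema stay simple, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. The table layout is cut down to two dimensions, and the column names, vendor names and sale amounts are made-up placeholders, not figures from this report.

```python
import sqlite3

# In-memory database with a cut-down star schema (assumed column names).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, CategoryName TEXT);
CREATE TABLE DimVendor  (VendorKey  INTEGER PRIMARY KEY, VendorName TEXT);
CREATE TABLE FactSales  (ProductKey INTEGER, VendorKey INTEGER, SaleDollars REAL);

-- Placeholder rows for illustration only.
INSERT INTO DimProduct VALUES (1, 'Canadian Whiskies'), (2, 'Vodka 80 Proof');
INSERT INTO DimVendor  VALUES (1, 'Vendor A'), (2, 'Vendor B');
INSERT INTO FactSales  VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 75.0);
""")

# One join per dimension is enough, because every dimension
# is connected directly to the fact table.
QUERY = """
SELECT p.CategoryName, SUM(f.SaleDollars) AS TotalSales
FROM FactSales AS f
JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
GROUP BY p.CategoryName
ORDER BY TotalSales DESC
"""

sales_by_category = conn.execute(QUERY).fetchall()
for category, total in sales_by_category:
    print(category, total)
```

The same pattern, with a join to a different dimension table, would answer a per-vendor or per-date question.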

Design of the Data Warehouse for Liquor Consumption in Iowa and Missouri States

In the liquor consumption data warehouse I used a star schema, which means every dimension is connected directly to the fact table through its primary key. The dimensions were chosen based on the case studies I am going to answer. I need to find out how much liquor is consumed in Iowa and Missouri in terms of number of packs, and which liquor is consumed most, so that the import and export of particular liquors in these two states can be managed.

Star schema for liquor consumption data warehouse is as follows:

Figure.3

In Figure 3 above, DimDate, DimVolume, DimProduct, DimVendor and DimSales are the dimension tables, whereas FactSales is the fact table. Every dimension table has a primary key with unique values, and that key is stored in the fact table to connect the two tables. The fact table therefore holds all the key values coming from the dimension tables together with the measure values used in the analysis of the data.
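
As a rough sketch of this design, the tables can be declared as follows. It uses Python's sqlite3 module as a stand-in for SQL Server, and every column name is an assumption inferred from the table descriptions in this report; the actual definitions are only visible in the figures.

```python
import sqlite3

# All column names below are assumptions inferred from the table descriptions;
# the report's actual SQL Server definitions are not shown in the text.
SCHEMA = """
CREATE TABLE DimDate    (DateKey    INTEGER PRIMARY KEY, DateUpdate TEXT);
CREATE TABLE DimVolume  (VolumeKey  INTEGER PRIMARY KEY, BottleVolumeMl INTEGER, BottlesSold INTEGER);
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, CategoryName TEXT, ItemDescription TEXT);
CREATE TABLE DimVendor  (VendorKey  INTEGER PRIMARY KEY, VendorNumber INTEGER, VendorName TEXT);
CREATE TABLE DimSales   (SalesKey   INTEGER PRIMARY KEY, Pack INTEGER, SaleDollars REAL);

-- The fact table stores one foreign key per dimension plus the measures.
CREATE TABLE FactSales (
    DateKey        INTEGER REFERENCES DimDate(DateKey),
    VolumeKey      INTEGER REFERENCES DimVolume(VolumeKey),
    ProductKey     INTEGER REFERENCES DimProduct(ProductKey),
    VendorKey      INTEGER REFERENCES DimVendor(VendorKey),
    SalesKey       INTEGER REFERENCES DimSales(SalesKey),
    PositiveTweets INTEGER,
    NegativeTweets INTEGER,
    Pack           INTEGER,
    SaleDollars    REAL
);
"""

def build_schema():
    """Create the star schema in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

conn = build_schema()
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
foreign_keys = conn.execute("PRAGMA foreign_key_list(FactSales)").fetchall()
print(tables)
print(len(foreign_keys), "foreign keys on FactSales")
```

Each dimension contributes exactly one foreign key to FactSales, which is what gives the diagram its star shape.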

DimDate:

It contains a date update column holding the date of each liquor record for the respective state. This attribute helps to find out how much liquor was consumed on which date.

Fig.4

DimVolume:

It contains the volume key, the bottle volume in ml and the number of bottles sold. These attributes help to find out how much liquor was sold in milliliters and how many bottles were sold.

Fig.5

DimProduct:

It contains the product key, invoice number, liquor category, liquor name, item description and item number. These attributes help to find out which liquor has been sold, along with its description.

Fig.6

DimVendor:

It contains the vendor key, vendor name and vendor number. These attributes help to identify the vendors by name and number.

Fig.7

DimSales:

It contains the pack, the sale in dollars, the actual bottle cost and the bottle cost at the retail store. These attributes help to find out how much money was spent on liquor along with the quantity of liquor packs consumed.

Fig.8

FactSales:

It contains the key values of all dimensions and measures such as positive and negative tweet counts, packs, sale in dollars and the update date on which the liquor was sold. The case studies are answered from the fact table. Their answers will help to improve the liquor business in future; for example, we can manage the number of bottles that need to be delivered to a particular vendor, or see how much has been contributed to revenue based on sales statistics.

Fig.9

Extract, Transform and Load (ETL) Process

In computing, ETL is a process used in database systems that fetches data from homogeneous as well as heterogeneous data sources. It is an important step in implementing a data warehouse because different databases need to be integrated. Extraction makes this easier, and the transformation and loading of the data onto the server can then be done in the same ETL application. In this project the ETL process is performed with the SQL Server Integration Services (SSIS) tool.

The ETL process for liquor consumption is implemented using SSIS. SSMS is also required, because we must first connect to the database engine before starting the ETL implementation. In the SSMS database engine we need a database in which to implement the ETL.

Fig.10

In Figure 10 above, MAYUR\MAYUR is the SQL Server name, and the databases are added on that server. The database used here is Liquor1, which is also visible in the figure. The server type of the connected server is the SQL Server database engine.

Extraction

Extraction is the method used to pull datasets from different sources so that they can later be used in the data warehouse environment. It is a time-consuming step of the ETL process, because the datasets may come in various formats and contain discrepancies, and resolving these takes time. The extraction method depends on the kind of datasets chosen and on the business needs.

The initial step in ETL is the extraction of data, which covers how the datasets are collected from multiple data sources. In this DWH I used three different data sources, as explained in the introduction.

Data Source 1: I used the official open data website of the US government to find liquor-related data for Missouri. It is structured data that includes liquor license details and distributor details with city, address, and contact details. The URL for this data is given below.

URL: https://catalog.data.gov/dataset/missouri-active-alcohol-license-data-af6fa/resource/f78c96f4-e9f2-414c-8a1f-07b9a67a1908

Fig.11

This data contains class "E" license information for liquor businesses involved in the manufacture, shipping, and/or sale of liquor in the State of Missouri, covering January 1, 2012 to April 21, 2016.

Data Source 2: I used another official government website, which contains data for the state of Iowa specifically. It is structured data that includes liquor product details, store details, and liquor sales data such as how many bottles were sold and at what price. The URL for this data is given below.

URL: https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy

Fig.12

This data contains the liquor purchase information of Iowa class "E" liquor licensees by liquor product, sale and date of purchase, from January 1, 2012 to November 7, 2014. It is used to analyze the total liquor sales in Iowa of individual liquor products at various stores.

Data Source 3: This data source is the social networking website Twitter. From it I fetched tweets about different liquors based on their commonly used names. After fetching the tweets I performed a simple sentiment analysis, which gives an idea of consumers' perspective in these two states in particular.

To fetch the tweets I used RStudio Version 0.99.893 (© 2009-2016 RStudio, Inc.), which includes open source components. The screen below shows the RStudio window while the code to fetch tweets from Twitter was executing.

Fig.13

The packages below must be installed in RStudio and loaded, as shown, before running the code to fetch the tweets.

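library(twitteR)   # client for the Twitter API

library(stringr)   # string manipulation

library(ggplot2)   # plotting

library(plyr)      # data manipulation (load before dplyr)

library(dplyr)     # data manipulation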

These packages are required to connect to Twitter through the Twitter API and to make data management easier.

The complete code is uploaded to github.com so that it can be accessed from anywhere, on any device, in future.

The location of the R code for the tweet sentiment analysis is given below.

R-Code-for-tweet-sentiment-analysis/README.md
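
The simple sentiment analysis mentioned above can be sketched as a word-list count. This is not the author's R code, which is not reproduced in this report; it is a minimal Python illustration, and the word lists and sample tweets are hypothetical.

```python
import re
from collections import Counter

# Hypothetical word lists; a real analysis would use a full opinion lexicon.
POSITIVE_WORDS = {"love", "great", "smooth", "best", "good"}
NEGATIVE_WORDS = {"hate", "awful", "harsh", "worst", "bad"}

def score_tweet(text):
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    return positive - negative

def count_sentiment(tweets):
    """Count how many tweets are positive, negative or neutral."""
    counts = Counter()
    for tweet in tweets:
        score = score_tweet(tweet)
        if score > 0:
            counts["positive"] += 1
        elif score < 0:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    return counts

# Made-up sample tweets for illustration.
sample_tweets = [
    "I love this vodka, so smooth",
    "Worst whiskey ever, harsh and bad",
    "Bought a bottle of rum today",
]
counts = count_sentiment(sample_tweets)
print(dict(counts))
```

Counts of this kind, computed per liquor, correspond to the positive and negative tweet measures stored in the fact table.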

Loading:

The loading step includes building the dimension and fact tables; the extraction step must be completed first so that the required datasets are available. Once these tables are loaded on the server, the performance of the business intelligence data can be improved through aggregation.

Fig.14

In this step we collect the different databases and load them into staging databases. Staging is created for cleaning, splitting or merging data, creating derived columns, and changing data types so that the formatting is consistent. In the liquor consumption data warehouse I created three staging tables, as shown in the figure below.

Fig.15
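
The kind of staging transformation described above (cleaning, type changes and a derived column) can be sketched in Python as follows. In the project these steps are done in SSIS; the column names and the sample row here are assumptions made for illustration.

```python
from datetime import datetime

def clean_staging_row(raw):
    """Clean one raw row: trim text, convert types and add a derived column."""
    bottles_sold = int(raw["Bottles Sold"])
    bottle_volume_ml = int(raw["Bottle Volume (ml)"])
    return {
        # Data type change: text date to ISO date string.
        "date": datetime.strptime(raw["Date"].strip(), "%m/%d/%Y").date().isoformat(),
        # Cleaning: trim whitespace and normalize capitalization.
        "category": raw["Category Name"].strip().title(),
        "bottles_sold": bottles_sold,
        "bottle_volume_ml": bottle_volume_ml,
        # Data type change: currency text to a number.
        "sale_dollars": float(raw["Sale (Dollars)"].replace("$", "").replace(",", "")),
        # Derived column: total volume sold in milliliters.
        "volume_sold_ml": bottles_sold * bottle_volume_ml,
    }

# Hypothetical raw row; the column names are assumed, not taken from the source files.
raw_row = {
    "Date": "11/07/2014",
    "Category Name": "  CANADIAN WHISKIES ",
    "Bottles Sold": "12",
    "Bottle Volume (ml)": "750",
    "Sale (Dollars)": "$1,234.50",
}
cleaned = clean_staging_row(raw_row)
print(cleaned)
```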

Staging Databases:

In Figure 16 below I imported three Excel file sources, which hold the data described below.

Iowa: Liquor sale data in Iowa state

Missouri: Liquor sale data in Missouri state

Twitter: Incorporated sentiment analysis tweets of all 59 liquors

Fig.16

Loading Dimension Tables:

I loaded a total of 5 dimensions, as required to complete the case studies. Each dimension is connected to the fact table through a primary key assigned to one of its columns. The diagrams below help to understand the dimension loading procedure.

Fig.17

Fig.18

Loading Fact Table:

The fact table depends on the measures and on the key values that connect it to the dimension tables. The fact table is important for data analysis, because without it it is difficult to deploy a cube for Analysis Services. The fact table in SSMS is shown in Figure 9.

For each dimension, the ETL maintains a special surrogate key lookup table. Hence the fact table can be updated whenever a new entity arrives in a dimension table.
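
A surrogate key lookup of this kind can be sketched in Python as follows. In the project the lookup is handled inside the SSIS package; the dictionaries, field names and sample rows here are hypothetical and only illustrate the idea.

```python
def get_or_create_key(lookup, natural_key):
    """Return the surrogate key for natural_key, assigning the next integer if it is new."""
    if natural_key not in lookup:
        lookup[natural_key] = len(lookup) + 1
    return lookup[natural_key]

# One lookup table per dimension, mapping the source value to its surrogate key.
vendor_lookup = {}
product_lookup = {}

# Made-up staging rows for illustration.
staging_rows = [
    {"vendor": "Vendor A", "category": "Canadian Whiskies", "sale_dollars": 100.0},
    {"vendor": "Vendor B", "category": "Canadian Whiskies", "sale_dollars": 50.0},
    {"vendor": "Vendor A", "category": "Vodka 80 Proof", "sale_dollars": 75.0},
]

# Each fact row keeps only the surrogate keys and the measure.
fact_rows = [
    {
        "VendorKey": get_or_create_key(vendor_lookup, row["vendor"]),
        "ProductKey": get_or_create_key(product_lookup, row["category"]),
        "SaleDollars": row["sale_dollars"],
    }
    for row in staging_rows
]

print(vendor_lookup)
print(fact_rows)
```

A vendor seen for the first time receives a new key in the lookup table, so later fact rows for that vendor reuse the same key.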

Fig.19

Control Flow Diagram:

The control flow mainly consists of Execute SQL tasks and Data Flow tasks, which initiate the individual tasks or processes. The control flow for the liquor consumption data warehouse is shown below.

Fig.20

Deploying Cube:

Using SSAS I deployed an OLAP cube, which makes it possible to browse the data collected on the SQL Server. This tool supports multi-dimensional analysis of the database, and the cube can also be explored in a pivot table using the spreadsheet option.
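
Conceptually, a cube pre-aggregates a measure for every combination of dimension values. The following Python sketch shows that idea only; it is not how SSAS is implemented, and the rows and amounts are made-up placeholders.

```python
from collections import defaultdict
from itertools import combinations

def build_cube(rows, dimensions, measure):
    """Sum the measure for every subset of dimensions (all roll-up levels)."""
    cube = defaultdict(float)
    for row in rows:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                # A cell is identified by its (dimension, value) pairs.
                cell = tuple((dim, row[dim]) for dim in dims)
                cube[cell] += row[measure]
    return dict(cube)

# Made-up fact rows for illustration.
rows = [
    {"state": "Iowa", "category": "Canadian Whiskies", "sale": 100.0},
    {"state": "Iowa", "category": "Vodka 80 Proof", "sale": 75.0},
    {"state": "Missouri", "category": "Canadian Whiskies", "sale": 50.0},
]
cube = build_cube(rows, ["state", "category"], "sale")

grand_total = cube[()]
iowa_total = cube[(("state", "Iowa"),)]
print(grand_total, iowa_total)
```

Browsing the cube then amounts to looking up a cell, which is why pivot-style reports over a deployed cube respond quickly.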

Fig.21

Creating Dashboard:

To create the dashboard I used Tableau, a data visualization tool. With it I prepared the answers to all my case studies and published them on the internet so that anyone can access the dashboard and get the information. The dashboard can be accessed at the URL below.

URL: https://public.tableau.com/profile/publish/LiquorConsumptionReport/FinalLiquorReport#!/publish-confirm

The picture below shows the window that appears after loading the database from SQL Server into Tableau.

Fig.22

Advantages of Tableau for data reporting:

It is easy to deploy a cube on this application.

Dashboards can be created.

Processing is fast because of the low dependency on the server.

The GUI makes it easy to explore the database.

SSRS:

After successful deployment of the cube, we can use SSRS for reporting or for analyzing the database. With this application we can pull information out of the data warehouse using the pivot table option.

This is the last step in implementing the data warehouse. It delivers the final results, based on historical datasets, on which future business decisions can be made.

The advantages of SSRS are as follows:

Results are very easy to export to Excel and other file formats.

Parameters make it possible to return whatever result the user wants.

Flexible reporting is possible with SSRS.

Reports are visually clear, so the statistics of the database can be understood at a glance.

Business Case Studies:

These case studies help to improve the liquor business, particularly in Iowa and Missouri, and also help to make critical decisions under uncertainty.

Case 1: Business Case Report

Which liquor category has the highest sales in Iowa and Missouri?

Fig.23

Analysis of report:

From the graph above we can conclude that Canadian whiskies sell the most in terms of volume in ml, in dollars and in number of packs. The next best-selling liquor product after Canadian whiskies is blended whiskies.

Based on these statistics we can decide how much liquor should be imported to or exported from Iowa and Missouri according to the volume consumed. We can also see how much liquor sales have contributed to revenue and which particular liquor contributes most to the revenue of the liquor business.

The total number of liquors here is 59, meaning that 59 different liquor categories are included in this DWH with their respective descriptions and unique numbers.

Case 2: Business Case Report

Finding the liquor sales distribution for different vendors in Iowa and Missouri.

Fig.24

Analysis of report:

From the treemaps above we can conclude that the best-performing vendor in the liquor business is Diageo Americas. We can see how many dollars Diageo Americas has earned and how many packs it sold to do so. The same can be done for any vendor present in Iowa and Missouri.

With this report we can keep an eye on the competition in the liquor business and see who has good strategies for selling liquor and who does not. These numbers can also help in building a business network in the large liquor industry, because we can identify the top 10 vendors together with their sales, which helps to reach the highest level in the liquor market.

The total vendor count here is 54, which means this DWH holds information on all 54 vendors with their unique numbers, sale amounts and the number of liquor packs they have sold.

Case 3: Business Case Report

Which liquor category do consumers like most in Iowa and Missouri?

Fig.25

From the packed bubble chart above we can conclude that Vodka 80 Proof is the most loved liquor in Iowa and Missouri, as it has the highest number of positive tweets. We can also check which liquor has the most negative tweets by changing a parameter in the dashboard.

Using this report we can manage our liquor inventories. It also lets us form a new business strategy by procuring the liquor that has the highest demand in the market. The report gives a glance at consumers' perspective on different liquors.

Here we have positive and negative tweet counts for all 59 liquor categories, which helps to distinguish them from each other.

Case 4: Business Case Report

If a particular vendor has the highest or lowest sales, what is the reason behind that result?

Fig.26

From the horizontal bar graph above we can see, for any vendor with high liquor sales, which of its products bring in the profit. Brown-Forman Corporation, for example, achieves its highest liquor sales through its Tennessee whiskies.

Using this technique we can adjust business strategies, such as which liquor to keep on sale and which not to keep. Liquor has no expiry date, but it does need inventory space; the graph above therefore lets us monitor liquor quantities closely, which results in better inventory management.

The same analysis can be done for all 54 vendors present in Iowa and Missouri with their individual product statistics.
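
The drill-down behind this case study, from a vendor to the categories that make up its sales, can be sketched in Python as follows. The vendor names and sale amounts are made-up placeholders and not figures from the dashboard.

```python
from collections import defaultdict

def sales_by_vendor_and_category(rows):
    """Total the sale amount per vendor and, within each vendor, per category."""
    totals = defaultdict(lambda: defaultdict(float))
    for vendor, category, dollars in rows:
        totals[vendor][category] += dollars
    return totals

def top_category(totals, vendor):
    """Return the category contributing the most sales for the given vendor."""
    categories = totals[vendor]
    return max(categories, key=categories.get)

# Made-up (vendor, category, sale in dollars) rows for illustration.
rows = [
    ("Vendor A", "Tennessee Whiskies", 300.0),
    ("Vendor A", "Vodka 80 Proof", 40.0),
    ("Vendor B", "Canadian Whiskies", 120.0),
    ("Vendor A", "Tennessee Whiskies", 60.0),
]
totals = sales_by_vendor_and_category(rows)
best_for_vendor_a = top_category(totals, "Vendor A")
print(best_for_vendor_a, totals["Vendor A"][best_for_vendor_a])
```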