Data Warehouse Implementation for Liquor Consumption
in Iowa and Missouri States
Student Name: Mayur Kishor Mane
Student ID: x15009009
Course: MSc in Data Analytics
National College of Ireland, Dublin.
Introduction:
In the modern era of Big Data, data grows continuously and puts pressure on storage devices. Old data
is often discarded as new data arrives, but when a company has to base a decision on statistics of
everything processed to date, recovering that discarded data becomes a major challenge. A good data
warehouse overcomes this historical data storage problem. It helps to derive business-critical
information from the enormous amount of data generated in the past and in real time. Beyond this, a
data warehouse is also required to integrate multiple datasets from different departments, to handle
unused or bad data, and to deliver better report performance.
In this project I collected three datasets from three different sources and analyzed them to find
information useful to the liquor business in the states of Missouri and Iowa. Business intelligence is
the conversion of data into information. Here the information comes from the analysis of the three
datasets, and it helps to improve the liquor business and to make future decisions easier and more
effective.
The steps used to implement the data warehouse are as follows:
1. Collecting datasets from different data sources
2. Design of data warehouse (DWH)
3. Design of ETL (Extract, Transform and Load) steps
4. Development of ETL
5. Data loading into data warehouse
The software tools used to implement the data warehouse are as follows:
1. SQL Server Management Studio (SSMS)
2. SQL Server Integration Services (SSIS)
3. SQL Server Analysis Services (SSAS)
4. SQL Server Reporting Services (SSRS)
5. Tableau
Collection of datasets:
To build a business-focused data warehouse we need good-quality data sources that provide data in a
proper format and without wrong values. For the liquor consumption data I used the three data sources
below.
Data Source 1: I used an official website of the US government to find liquor-related data for Missouri.
It is structured data that includes liquor license details and distributor details with city, address and
contact details. The URL for this data is given below.
URL: https://catalog.data.gov/dataset/missouri-active-alcohol-license-data-af6fa/resource/f78c96f4-e9f2-414c-8a1f-07b9a67a1908
Data Source 2: I used another official US website, which contains data for the state of Iowa specifically.
It is structured data that includes liquor product details, store details and liquor sales data such as
how many bottles were sold and at what price. The URL for this data is given below.
URL: https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy
Data Source 3: This data source is the social networking website Twitter. From this website I fetched
tweets about different liquors based on their commonly used names. After fetching the tweets I ran a
simple sentiment analysis, which gives an idea of consumers' perspective on liquors in these two states
in particular.
Data Warehouse Architecture:
The liquor consumption data warehouse is implemented using Ralph Kimball's approach, which is a
bottom-up approach based on dimensional modelling. This technique is quicker to implement than Inmon's
approach to data warehouse architecture.
This DWH focuses on increasing business value in the liquor industry.
It produces dimensionally structured data that is delivered to the liquor business.
It manages the liquor data lifecycle incrementally rather than taking a galactic big-bang approach.
Advantages of using Kimball's approach for DWH architecture:
This approach offers many advantages over Inmon's top-down approach and over the mixed approach, which
combines bottom-up and top-down. A few of the advantages I noticed while implementing the DWH for the
liquor consumption datasets are listed below.
1. It is process oriented, which helps to focus on the business process flow.
2. It is easy to grasp the core elements and so to produce better analysis reports.
3. It is a rapid process, meaning the time required to build the DWH is low.
4. Because of the multiple dimensions, the datasets can be viewed in many ways to answer
business-related questions.
The figure below shows the dimensional model of the liquor DWH:
Fig.1
Schema Structure of DWH:
A star schema has been used to implement the liquor consumption DWH. With this schema, multidimensional
modelling is much easier than with snowflake and fact constellation schemas. Here the fact table sits
at the center of the diagram and all dimensions are connected directly to it.
Fact Table:
In a DWH the fact table depends on the dimension tables. It holds the data to be analyzed and refers to
the dimension tables for the descriptive context of that data. It mainly has two types of attributes:
foreign keys and measures. The foreign keys connect the fact table to the dimension tables, and the
measures hold the values that will be analyzed.
In the liquor consumption DWH, the fact table contains pack, date and sale information as well as tweet
sentiment analysis data. All of these columns help to answer the case studies.
Dimension Table:
A dimension is a collection of reference data about measurable events. These events are known as facts
and are stored in the fact table. Examples of dimensions here are product, vendor, sale, date and
volume. Each dimension table has a primary key column that contains unique numerical values.
The star schema for a multidimensional DWH is shown in the figure below.
Fig.2
Benefits of Star Schema:
The star schema is easy to implement. Its other benefits are listed below.
It makes the ETL process easier when implementing the DWH.
Metric analysis is easier with this schema model.
Data efficiency increases because the dimension tables connect directly to the fact table.
Query execution is simple.
Fewer foreign keys are used, which makes processing faster.
Each dimension needs only one dimension table, which reduces the size of the model.
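To illustrate the point about simple query execution, here is a minimal sketch in Python that uses an
in-memory SQLite database as a stand-in for SQL Server. Only the table names DimProduct and FactSales
come from this report. The column names and the figures are made up for the example.

```python
import sqlite3

# Hypothetical miniature star schema: one dimension table and one fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, Category TEXT);
CREATE TABLE FactSales  (ProductKey INTEGER REFERENCES DimProduct(ProductKey),
                         SaleDollars REAL);
INSERT INTO DimProduct VALUES (1, 'CANADIAN WHISKIES'), (2, 'VODKA 80 PROOF');
INSERT INTO FactSales  VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")


def sales_by_category(connection):
    """Return (category, total sale) rows, largest total first."""
    query = """
        SELECT p.Category, SUM(f.SaleDollars) AS TotalSale
        FROM FactSales AS f
        JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
        GROUP BY p.Category
        ORDER BY TotalSale DESC
    """
    return connection.execute(query).fetchall()


print(sales_by_category(conn))
```

Because every dimension joins directly to the fact table, a report of this kind needs only one join per
dimension it uses.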
Design of the Data Warehouse for Liquor Consumption in Iowa and Missouri States
In the liquor consumption data warehouse I used a star schema, which means every dimension is connected
directly to the fact table through its primary key. The dimensions were chosen based on the case
studies I want to answer. I need to find out how much liquor is consumed in Iowa and Missouri in terms
of number of packs, and which liquor is consumed most, so that the import and export of particular
liquors in these two states can be managed.
The star schema for the liquor consumption data warehouse is as follows:
Fig.3
In Fig.3 above, DimDate, DimVolume, DimProduct, DimVendor and DimSales are dimension tables, whereas
FactSales is the fact table. Every dimension table has a primary key holding unique values, and that
primary key is stored in the fact table to connect the two tables. The fact table therefore holds all
key values coming from the dimension tables, plus the measure values used in the analysis of the data.
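As a minimal sketch of this schema, the Python snippet below creates the five dimension tables and the
fact table in an in-memory SQLite database. SQLite stands in for SQL Server here, and every column name
is an assumption based on the attribute descriptions in this report, not the actual table definitions.

```python
import sqlite3

# All column names are assumed from the attribute descriptions in the text.
SCHEMA = """
CREATE TABLE DimDate    (DateKey    INTEGER PRIMARY KEY, DateUpdate TEXT);
CREATE TABLE DimVolume  (VolumeKey  INTEGER PRIMARY KEY, BottleVolumeMl INTEGER,
                         BottlesSold INTEGER);
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, InvoiceNumber TEXT,
                         Category TEXT, LiquorName TEXT,
                         ItemDescription TEXT, ItemNumber INTEGER);
CREATE TABLE DimVendor  (VendorKey  INTEGER PRIMARY KEY, VendorName TEXT,
                         VendorNumber INTEGER);
CREATE TABLE DimSales   (SalesKey   INTEGER PRIMARY KEY, Pack INTEGER,
                         SaleDollars REAL, BottleCost REAL, BottleRetail REAL);
CREATE TABLE FactSales (
    DateKey        INTEGER REFERENCES DimDate(DateKey),
    VolumeKey      INTEGER REFERENCES DimVolume(VolumeKey),
    ProductKey     INTEGER REFERENCES DimProduct(ProductKey),
    VendorKey      INTEGER REFERENCES DimVendor(VendorKey),
    SalesKey       INTEGER REFERENCES DimSales(SalesKey),
    PositiveTweets INTEGER,
    NegativeTweets INTEGER,
    Pack           INTEGER,
    SaleDollars    REAL
);
"""


def create_star_schema(connection):
    """Create the dimension tables first, then the fact table that references them."""
    connection.executescript(SCHEMA)


def table_names(connection):
    """List the tables in the database in alphabetical order."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
create_star_schema(conn)
print(table_names(conn))
```

The fact table carries one foreign key per dimension plus the measures, which is what gives the diagram
its star shape.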
DimDate:
It contains a date update column holding the date of each liquor record for the respective state. This
attribute helps to find out how much liquor was consumed on which date.
Fig.4
DimVolume:
It contains the volume key, the bottle volume in ml and the number of bottles sold. These attributes help
to find out how much liquor volume was sold in milliliters and how many bottles were sold.
Fig.5
DimProduct:
It contains the product key, invoice number, liquor category, liquor name, item description and item
number. These attributes help to find out which liquor has been sold and its description.
Fig.6
DimVendor:
It contains the vendor key, vendor name and vendor number. These attributes help to identify the vendors
and their respective names.
Fig.7
DimSales:
It contains the pack, the sale in dollars, the actual bottle cost and the bottle cost at the retail store.
These attributes help to find out how much money was spent on liquor consumption, along with the
quantity of liquor packs consumed.
Fig.8
FactSales:
It contains the key values of all dimensions and measures such as positive and negative tweets, packs,
sale in dollars and the update date on which the liquor was sold. The case studies are worked out from
the fact table. The answers to the case studies will help to improve the liquor business in future. For
example, we can manage the number of bottles that need to be delivered to a particular vendor, or see
how much each product contributes to revenue based on the sales statistics.
Fig.9
Extract, Transform and Load (ETL) Process
In computing, ETL is a process used in database systems to fetch data from homogeneous as well as
heterogeneous data sources. The ETL process is an important step in implementing a data warehouse
because different databases need to be integrated. Once the extraction step is done, the transformation
and loading of data onto the server can be carried out in the same ETL application. In this project the
ETL process for liquor consumption is implemented with the SQL Server Integration Services (SSIS) tool.
SSMS is also required, because we must first connect to the database engine before starting to implement
the ETL. In the SSMS database engine we need a database in which the ETL is implemented.
Fig.10
In Fig.10 above we can see that MAYUR\MAYUR is the SQL Server name and that the databases are added on
that server. The database used here is Liquor1, which is also visible in the figure. The connected
server is of the Database Engine type.
Extraction
Extraction is the method used to extract datasets from different sources so that they can later be used
in the data warehouse environment. It is a time-consuming step of the ETL process, because the datasets
can come in various formats and contain discrepancies, and resolving these takes time. The extraction
method depends on the kind of datasets chosen and on the business needs.
The initial step in ETL is the extraction of data, which covers how the datasets are collected from
multiple data sources. In this DWH I used three different data sources, as explained in the introduction.
Data Source 1: I used an official website of the US government to find liquor-related data for Missouri.
It is structured data that includes liquor license details and distributor details with city, address and
contact details. The URL for this data is given below.
URL: https://catalog.data.gov/dataset/missouri-active-alcohol-license-data-af6fa/resource/f78c96f4-e9f2-414c-8a1f-07b9a67a1908
Fig.11
This data contains class “E” license information for liquor businesses involved in the manufacture,
shipping and/or sale of liquor in the State of Missouri, from January 1, 2012 to April 21, 2016.
Data Source 2: I used another official US website, which contains data for the state of Iowa specifically.
It is structured data that includes liquor product details, store details and liquor sales data such as
how many bottles were sold and at what price. The URL for this data is given below.
URL: https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy
Fig.12
This data holds the liquor purchase information of Iowa class “E” liquor licensees by liquor product,
sale and date of purchase, from January 1, 2012 to Nov 7, 2014. It is used to analyze the total liquor
sales in Iowa of individual liquor products at various stores.
Data Source 3: This data source is the social networking website Twitter. From this website I fetched
tweets about different liquors based on their commonly used names. After fetching the tweets I ran a
simple sentiment analysis, which gives an idea of consumers' perspective in these two states in particular.
To fetch the tweets I used RStudio Version 0.99.893 (© 2009-2016 RStudio, Inc.), which includes open
source components. The screen below shows the RStudio window while the code to fetch tweets from
Twitter was executing.
Fig.13
The packages below must be installed and then loaded in RStudio before running the code that fetches the tweets.
library(twitteR)
library(stringr)
library(ggplot2)
library(plyr)
library(dplyr)
These packages are required to connect to Twitter through the Twitter API and to make data management
easier.
I have uploaded the whole code to github.com so that it can be accessed from anywhere on any device in
future.
The location of the R code for the tweet sentiment analysis is given below.
R-Code-for-tweet-sentiment-analysis/README.md
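The R code itself is not reproduced in this report. As a rough illustration of what a simple,
lexicon-based sentiment analysis can look like, here is a minimal Python sketch. The word lists, the
function names and the sample tweets are all hypothetical, and the scoring rule (positive words minus
negative words) is an assumption, not necessarily the rule used in the R code.

```python
import re

# Hypothetical word lists; a real analysis would use a full opinion lexicon.
POSITIVE_WORDS = {"love", "great", "smooth", "best"}
NEGATIVE_WORDS = {"hate", "awful", "harsh", "worst"}


def classify_tweet(text):
    """Label a tweet by counting positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = (sum(word in POSITIVE_WORDS for word in words)
             - sum(word in NEGATIVE_WORDS for word in words))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def count_sentiment(tweets):
    """Count how many tweets fall into each sentiment label."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for tweet in tweets:
        counts[classify_tweet(tweet)] += 1
    return counts


sample_tweets = [
    "I love this smooth vodka",
    "Worst whiskey ever, awful",
    "Bought a bottle of rum today",
]
print(count_sentiment(sample_tweets))
```

Counts of this kind, taken per liquor, correspond to the positive and negative tweet measures stored in
the fact table.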
Loading:
The loading step includes building the dimension and fact tables. The extraction step must be completed
first so that the required datasets are available. Once these tables are loaded on the server, the
performance of the business intelligence data can be improved through aggregation.
Fig.14
In this step we collect the different datasets and load them into staging databases. The staging area
is created for cleaning, splitting or merging data, creating derived columns and changing data types so
that the formatting is consistent. In the liquor consumption data warehouse I created three staging
tables, as shown in the figure below.
Fig.15
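The kind of staging work described above (cleaning, changing data types and creating a derived column)
can be sketched in a few lines of Python. In this project the work is done in SSIS, so this is only an
illustration. The field names and the sample row are hypothetical.

```python
def clean_staging_row(raw):
    """Clean one raw row: trim text, convert types and add a derived column."""
    bottles = int(raw["bottles_sold"])
    # Strip the currency symbol and thousands separators before converting.
    retail = float(raw["bottle_retail"].replace("$", "").replace(",", ""))
    return {
        "category": raw["category"].strip().upper(),
        "bottles_sold": bottles,
        "bottle_retail": retail,
        # Derived column: total sale value for this row.
        "sale_dollars": round(bottles * retail, 2),
    }


raw_row = {"category": " Canadian Whiskies ", "bottles_sold": "12", "bottle_retail": "$10.50"}
print(clean_staging_row(raw_row))
```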
Staging Databases:
In Fig.16 below I imported three Excel file sources, which contain the data described below.
Iowa: liquor sale data for the state of Iowa
Missouri: liquor sale data for the state of Missouri
Twitter: consolidated tweet sentiment analysis for all 59 liquors
Fig.16
Loading Dimension Tables:
I loaded a total of 5 dimensions, as required to complete the case studies. Each dimension is connected
to the fact table through the primary key assigned to one of its columns. The diagrams below help to
understand the dimension loading procedure.
Fig.17
Fig.18
Loading Fact Table:
The fact table depends on the measures and on the key values that connect it to the dimension tables.
The fact table is important for data analysis because without it, it is difficult to deploy a cube for
Analysis Services. The fact table in SSMS is shown in Fig.9.
For each dimension the ETL maintains a dedicated surrogate key lookup table, so the fact table can be
updated whenever a new entity arrives in a dimension table.
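The idea of a surrogate key lookup can be sketched as follows. This is a simplified in-memory
illustration in Python, not the SSIS Lookup component used in the project. The class name, the vendor
identifiers and the sale amounts are hypothetical.

```python
class SurrogateKeyLookup:
    """Maps each natural (source-system) key to a warehouse surrogate key."""

    def __init__(self):
        self._keys = {}

    def key_for(self, natural_key):
        """Return the existing surrogate key, or assign the next integer."""
        if natural_key not in self._keys:
            self._keys[natural_key] = len(self._keys) + 1
        return self._keys[natural_key]


vendor_lookup = SurrogateKeyLookup()
staged_sales = [("VENDOR-A", 120.0), ("VENDOR-B", 45.5), ("VENDOR-A", 10.0)]

# Each fact row stores the surrogate key instead of the source identifier.
fact_rows = [(vendor_lookup.key_for(vendor), sale) for vendor, sale in staged_sales]
print(fact_rows)
```

A vendor seen for the first time gets a new key, while a vendor seen before reuses its key, so fact rows
always point at exactly one dimension row.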
Fig.19
Control Flow Diagram:
The control flow mainly consists of Execute SQL Tasks and Data Flow Tasks, which initiate the individual
tasks or processes. The control flow for the liquor consumption data warehouse is shown below.
Fig.20
Deploying Cube:
Using SSAS I deployed an OLAP cube, which allows browsing of the data collected on the SQL Server. This
tool supports multi-dimensional analysis of the database, and the results can also be placed in a pivot
table using the spreadsheet option.
Fig.21
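What a pivot table over the cube computes can be sketched in plain Python: a measure is summed for every
combination of two dimension values. This is only an illustration of the idea, not of how SSAS works
internally. The function name, field names and sample rows are hypothetical.

```python
from collections import defaultdict


def pivot_sum(rows, row_field, column_field, value_field):
    """Sum value_field for every (row_field, column_field) combination."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[row_field]][row[column_field]] += row[value_field]
    # Convert the nested defaultdicts into plain dictionaries.
    return {row_key: dict(columns) for row_key, columns in table.items()}


sample_rows = [
    {"vendor": "Vendor A", "category": "CANADIAN WHISKIES", "sale_dollars": 120.0},
    {"vendor": "Vendor A", "category": "VODKA 80 PROOF", "sale_dollars": 30.0},
    {"vendor": "Vendor B", "category": "CANADIAN WHISKIES", "sale_dollars": 45.5},
    {"vendor": "Vendor A", "category": "CANADIAN WHISKIES", "sale_dollars": 10.0},
]
print(pivot_sum(sample_rows, "vendor", "category", "sale_dollars"))
```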
Creating Dashboard:
To create a dashboard I used Tableau, a data visualization application. With it I prepared the answers
to all my case studies and published them on the internet so that anyone can access the dashboard and
get the information. To access the dashboard, please use the URL below.
URL: https://public.tableau.com/profile/publish/LiquorConsumptionReport/FinalLiquorReport#!/publish-confirm
The window below appears after the database is loaded from SQL Server into Tableau.
Fig.22
Advantages of Tableau for data reporting:
It is easy to use a deployed cube in this application
Dashboards can be created
Processing is fast due to low dependency on the server
The GUI makes it easy to explore the database
SSRS:
After the cube has been deployed successfully we can use SSRS to report on or analyze the database. With
this application we can pull information out of the data warehouse using the pivot table option.
This is the last step in the implementation of the data warehouse. It gives the final result used to
make future business decisions based on historical datasets.
Advantages of SSRS are as follows:
Results are very easy to export to Excel and other file formats
Parameters let a report return whatever result the user wants
Flexible reporting is possible with SSRS
Reports look good, and a single glance is enough to understand the statistics of the database
Business Case Studies:
These case studies help to improve the liquor business, particularly in Iowa and Missouri, and also help
to make critical decisions under uncertainty.
Case 1: Business Case Report
Which liquor category has the maximum sales in Iowa and Missouri?
Fig.23
Analysis of report:
From the graph above we can conclude that Canadian whiskies sell the most in terms of volume in ml,
sales in dollars and number of packs. After Canadian whiskies, the best-selling liquor product is
blended whiskies.
Based on these statistics we can decide how much liquor should be imported into or exported from Iowa
and Missouri according to the volume consumed. We can also see how much liquor sales have contributed
to revenue and which particular liquor contributes most to the revenue of the liquor business.
The total number of liquors here is 59, meaning that 59 different liquor categories are included in this
DWH with their respective descriptions and unique numbers.
Case 2: Business Case Report
Finding the liquor sales distribution for different vendors in Iowa and Missouri.
Fig.24
Analysis of report:
From the treemaps above, we can conclude that the best-performing vendor in the liquor business is
Diageo Americas. We can see how many dollars Diageo Americas has earned and how many packs it sold to
do so. The same can be done for any vendor present in Iowa and Missouri.
With this report we can keep an eye on the liquor market and see who has good strategies for selling
liquor and who has not. These numbers also help to build a business network in the liquor industry,
because we can identify the top 10 vendors with their sales, which helps in reaching the highest level
of the liquor market.
The total vendor count here is 54, meaning that this DWH holds information on all 54 vendors with their
unique numbers, sale amounts and the number of liquor packs they have sold.
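Ranking vendors by total sales, as in the top 10 mentioned above, can be sketched in Python as follows.
The report itself produces this ranking in Tableau. Apart from the vendor name Diageo Americas, the
vendor names and all amounts below are made up for the example.

```python
from collections import Counter


def top_vendors(fact_rows, n=10):
    """Return the n vendors with the highest total sales, largest first."""
    totals = Counter()
    for vendor, sale_dollars in fact_rows:
        totals[vendor] += sale_dollars
    return totals.most_common(n)


sample_fact_rows = [
    ("Diageo Americas", 100.0),
    ("Vendor B", 40.0),
    ("Diageo Americas", 20.0),
    ("Vendor C", 60.0),
]
print(top_vendors(sample_fact_rows, n=2))
```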
Case 3: Business Case Report
Which liquor category do consumers love most in Iowa and Missouri?
Fig.25
From the packed bubble figure above, we can conclude that Vodka 80 Proof is the most loved liquor in
Iowa and Missouri, as it has the highest number of positive tweets. We can also check which liquor has
the highest number of negative tweets by changing a parameter in the dashboard.
Using this report we can manage our liquor inventories. We can also form a new business strategy by
procuring the liquor that has the highest demand in the market. The report gives a glance at consumers'
perspective on different liquors.
Here we have positive and negative tweet counts for all 59 liquor categories, which helps to distinguish
them from each other.
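The dashboard parameter that switches between positive and negative tweets amounts to choosing which
measure to maximize. A minimal Python sketch of that selection is given below. The function name is
hypothetical and the tweet counts are invented, so they do not reproduce the figures behind Fig.25.

```python
def top_category(tweet_counts, sentiment="positive"):
    """Return the liquor category with the most tweets of the chosen sentiment."""
    if sentiment not in ("positive", "negative"):
        raise ValueError("sentiment must be 'positive' or 'negative'")
    return max(tweet_counts, key=lambda category: tweet_counts[category][sentiment])


# Invented counts for three of the liquor categories.
sample_counts = {
    "VODKA 80 PROOF": {"positive": 40, "negative": 5},
    "CANADIAN WHISKIES": {"positive": 25, "negative": 12},
    "BLENDED WHISKIES": {"positive": 10, "negative": 3},
}
print(top_category(sample_counts))
print(top_category(sample_counts, sentiment="negative"))
```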
Case 4: Business Case Report
If a particular vendor has the highest or lowest sales, what is the reason behind that result?
Fig.26
From the horizontal bar graph above, we can see which products bring profit to a vendor that has the
greatest success in liquor sales. Brown-Forman Corporation has the maximum liquor sales because of its
Tennessee whiskies.
Using this technique we can change our business strategy and decide which liquor to keep for sale and
which not to keep. Liquor has no expiry date, but it does need storage space in the inventory, so with
the graph above we can monitor liquor quantities closely, which results in better liquor inventory
management. The same analysis can be done for all 54 vendors present in Iowa and Missouri with their
individual product statistics.