CSC 177 Data Warehouse and Mining Project

Pooja Vora
Vishma Shah
Guided by Prof. Meiliu Lu

Agenda

• Data Warehouse Project
  • Introduction
  • Background
  • Scope of study
  • Implementation
    • Data Cleaning and Preprocessing
    • Data Mart

• Data Mining Project
  • Introduction
  • Background
  • Scope of study
  • Implementation
    • Data Mining

• Learning Experience
• Future Scope
• References

Data Warehouse Introduction

• The objective of our project is to create a data mart with star schema

• Data mart will be used to find answers related to various company key factors and statistics.

Background

• Source website: Navathe company schema

• Dataset: Company dataset

• Fact table: 7 attributes, 1000 entries

Scope Of Study

• Data Preprocessing
  • Microsoft Office Excel
  • Microsoft SQL Server

• Data Mart
  • Microsoft SQL Server, Visio, convertCSVtoSQL

• OLAP Operations
  • SQL Server queries

Implementation

• Data Cleaning & Preprocessing

• Data Mart

• OLAP Operations

Data Cleaning & Preprocessing

The company schema used the tables defined by Navathe. We added a few dimensions for analytical processing and created a fact table to form a star schema.

Data Mart

• Our data mart has 5 dimension tables and one fact table, which together form a star schema.

• The fact table has around 1000 rows with details such as ssn, project, work_id, etc.
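As a rough illustration of the star-schema layout, the sketch below builds a tiny in-memory version with Python's sqlite3 module. The table names EmpFactTable, DimTime, and DimEmp_work_record and the columns date_key, work_id, date_year, date_month, NumberOfProduct, and ssn come from the slides; the column types, the sample rows, and the use of SQLite instead of SQL Server are assumptions, and only two of the five dimensions are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimTime (
    date_key   INTEGER PRIMARY KEY,
    date_year  INTEGER,
    date_month INTEGER
);
CREATE TABLE DimEmp_work_record (
    work_id         INTEGER PRIMARY KEY,
    NumberOfProduct INTEGER
);
-- Fact table: each row holds keys that point into the dimension tables.
CREATE TABLE EmpFactTable (
    ssn      TEXT,
    date_key INTEGER REFERENCES DimTime(date_key),
    work_id  INTEGER REFERENCES DimEmp_work_record(work_id)
);
""")

# Made-up sample rows, only to make the joins visible.
conn.executemany("INSERT INTO DimTime VALUES (?, ?, ?)",
                 [(1, 2014, 1), (2, 2014, 2)])
conn.executemany("INSERT INTO DimEmp_work_record VALUES (?, ?)",
                 [(10, 4), (11, 6), (12, 3)])
conn.executemany("INSERT INTO EmpFactTable VALUES (?, ?, ?)",
                 [("111", 1, 10), ("222", 1, 11), ("111", 2, 12)])

# Join the fact table to its dimensions and total the products per month.
monthly = conn.execute("""
    SELECT t.date_year, t.date_month, SUM(w.NumberOfProduct)
    FROM EmpFactTable f
    JOIN DimTime t ON f.date_key = t.date_key
    JOIN DimEmp_work_record w ON f.work_id = w.work_id
    GROUP BY t.date_year, t.date_month
    ORDER BY t.date_year, t.date_month
""").fetchall()
print(monthly)
```

Every analytical query follows the same shape: start from the fact table, join to the dimensions it needs, then group or filter on dimension attributes.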

Star Schema

Data Mart Question-Answers

• How many products were produced over the months?
  • Roll-up

• How do we find an employee's current project?
  • Slicing on the employee dimension

• How do we find statistics for the days on which more than 5 products were produced?
  • Dicing on the product and work dimensions

• On which days was a particular product produced, and how many units?
  • Scoping
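To make the difference between slicing and dicing concrete, here is a minimal sketch over a handful of already-joined fact rows. The row layout, field names (project, product, day, made), and values are hypothetical; only ssn and the "more than 5 products" condition are taken from the slides.

```python
# Hypothetical fact rows after joining to the employee, product, and work dimensions.
rows = [
    {"ssn": "111", "project": "P1", "product": "A", "day": "2014-01-05", "made": 4},
    {"ssn": "111", "project": "P1", "product": "B", "day": "2014-01-06", "made": 7},
    {"ssn": "222", "project": "P2", "product": "A", "day": "2014-01-06", "made": 6},
    {"ssn": "222", "project": "P2", "product": "B", "day": "2014-02-01", "made": 2},
]

def slice_cube(rows, dimension, value):
    """Slice: fix a single dimension to one value."""
    return [r for r in rows if r[dimension] == value]

def dice_cube(rows, predicate):
    """Dice: keep rows that satisfy conditions on two or more dimensions."""
    return [r for r in rows if predicate(r)]

# Slice on the employee dimension: which project does employee 111 work on?
sliced = slice_cube(rows, "ssn", "111")
projects = {r["project"] for r in sliced}

# Dice on product and work: days on which more than 5 units of product A were made.
diced = dice_cube(rows, lambda r: r["product"] == "A" and r["made"] > 5)
days = [r["day"] for r in diced]

print(projects, days)
```

In SQL the same operations would be WHERE clauses on one dimension (slice) or on several dimensions at once (dice).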

OLAP Operations Example

• Roll-up

    select t.date_year, t.date_month, sum(w.NumberOfProduct) as 'No. Of Products'
    from EmpFactTable f, DimTime t, DimEmp_work_record w
    where f.date_key = t.date_key and f.work_id = w.work_id
    group by date_year, date_month with rollup

date_year   date_month   No. Of Products
2014        1            980
2014        2            761
2014        3            1274
2014        4            240
2014        NULL         3255
NULL        NULL         3255

winning month
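The subtotal rows that a rollup adds can be reproduced by hand. The sketch below takes the four monthly totals from the result table and computes the per-year subtotal and the grand total in plain Python; the function name and the use of None in place of SQL NULL are choices made for this illustration, not part of the project.

```python
# Monthly totals copied from the query result: (year, month) -> products.
monthly = {(2014, 1): 980, (2014, 2): 761, (2014, 3): 1274, (2014, 4): 240}

def rollup(monthly):
    """Return detail rows, one subtotal row per year, and a grand-total row."""
    result = []
    grand_total = 0
    for year in sorted({y for y, _ in monthly}):
        year_total = 0
        for (y, month), total in sorted(monthly.items()):
            if y == year:
                result.append((year, month, total))
                year_total += total
        result.append((year, None, year_total))   # subtotal: month rolled up
        grand_total += year_total
    result.append((None, None, grand_total))      # grand total: year rolled up too
    return result

rows = rollup(monthly)
for row in rows:
    print(row)

# The month with the highest total in this data.
best_month = max(monthly, key=monthly.get)
print(best_month)
```

The last two rows match the NULL rows of the query output: 3255 for the year 2014 and 3255 overall, since the data covers a single year.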

Quiz

Which dimension was used for slicing the cube?
• Employee
• Time
• Work
• Product

Answer - Employee

Data Mining Project

Introduction

• Perform data mining on a data set to discover knowledge

• Apply data mining algorithms using tools

• Compare the performance of the algorithms across these tools

• Compare the performance of the tools

Background

• Source Website – www.data.gov

• Dataset: Consumer complaints

• Data: 14 attributes, 55000 entries (data from 2012 to 2014)


Scope Of Study

• Data Preprocessing
  • Microsoft Office Excel
  • Tools (Weka, RapidMiner)

• Data Mining
  • Tools: Weka, RapidMiner
  • Algorithms: K-Means, Naïve Bayes

Implementation

• Data Cleaning & Preprocessing

• Data Mining

• Tools Comparison

Data Cleaning & Preprocessing

• Data Cleaning - Replaced missing values with “unknown”

• Data selection – Selected two months of consumer complaints data (Sept, Oct) for mining

• Sample data: 3000 rows selected
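The three preprocessing steps above can be sketched in a few lines of Python. The project did this work in Excel, Weka, and RapidMiner, so this is only an illustration: the field name date_received, the record layout, and the use of random sampling with a fixed seed are all assumptions.

```python
import random
from datetime import date

def preprocess(records, months=(9, 10), sample_size=3000, seed=0):
    """Keep Sept/Oct records, fill missing values, and cap the sample size."""
    selected = []
    for rec in records:
        # Data selection: keep only complaints received in the chosen months.
        if rec["date_received"].month not in months:
            continue
        # Data cleaning: replace missing values with the string "unknown".
        selected.append({k: ("unknown" if v in (None, "") else v)
                         for k, v in rec.items()})
    # Sampling: assumed here to be a random draw when more rows are available.
    if len(selected) > sample_size:
        selected = random.Random(seed).sample(selected, sample_size)
    return selected

# Hypothetical records for illustration.
records = [
    {"date_received": date(2014, 9, 3), "product": "Mortgage", "state": ""},
    {"date_received": date(2014, 8, 1), "product": "Loan", "state": "CA"},
    {"date_received": date(2014, 10, 9), "product": None, "state": "NY"},
]
cleaned = preprocess(records)
print(cleaned)
```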

Data Mining

We used one classification algorithm and one clustering algorithm.

Classification – Naïve Bayes
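For readers unfamiliar with Naïve Bayes, the following is a minimal sketch of the categorical variant with add-one (Laplace) smoothing. It is not the Weka or RapidMiner implementation, and the attributes (product, submission channel) and class labels in the example are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, label) -> value counts
    attribute_values = defaultdict(set)   # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            attribute_values[i].add(value)
    return class_counts, value_counts, attribute_values

def predict(model, row):
    """Return the label with the highest smoothed log-posterior."""
    class_counts, value_counts, attribute_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)                      # log prior
        for i, value in enumerate(row):
            seen = value_counts[(i, label)][value]
            # Add-one smoothing keeps unseen values from zeroing the product.
            score += math.log((seen + 1) / (count + len(attribute_values[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (product, submitted via) -> company response.
rows = [("Mortgage", "Web"), ("Mortgage", "Phone"), ("Credit card", "Web"),
        ("Credit card", "Web"), ("Mortgage", "Web"), ("Credit card", "Phone")]
labels = ["explanation", "explanation", "relief",
          "relief", "explanation", "relief"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("Mortgage", "Web")))
print(predict(model, ("Credit card", "Phone")))
```

The "naïve" part is the assumption that attributes are independent given the class, which is why the per-attribute log-probabilities are simply added.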

Clustering – K-means
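K-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below shows that loop on a few made-up 2-D points. Using the first k points as initial centroids is an assumption made to keep the run deterministic; the tools may initialize differently, and the mostly categorical complaint attributes would first need a numeric encoding.

```python
import math

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with the first k points as starting centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, assignment

# Made-up points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
print(centroids)
```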

Data Mining Demo

Tools Comparison: K-Means

• RapidMiner
• Weka

Tools Comparison: Naïve Bayes

• RapidMiner
• Weka

Quiz

Which clustering algorithm was used for data mining?
• K-Means
• EM

Answer – K-means

Learning Experience

• Learned analytical processing through the data mart project
• Improved our knowledge of database statistics
• Learned to extract information from query results
• Learned different data mining tools such as Weka and RapidMiner
• Improved our understanding of various algorithms and their practical implementation through tools
• Learned to make sense of the results obtained from the tools

Future Scope

• Data Warehouse

• Create a snowflake schema by introducing a dimension such as employee type (contractor/full-time), then extend the analytical processing to further statistics

• Data Mining

• Implement other algorithms and try other tools such as Orange

References

• Elmasri and Navathe, Fundamentals of Database Systems, 6th Edition, Addison-Wesley

• OLAP Courseware http://athena.ecs.csus.edu/~olap/olap/introduction.php

• DM dataset http://www.data.gov/consumer/

• Data Mining Courseware http://athena.ecs.csus.edu/~datamini

• https://rapidminer.com/wpcontent/uploads/2013/10/RapidMiner_RapidMinerInAcademicUse_en.pdf





Questions….
