Data, Data Everywhere, Yet...
We can’t find the data we need
  data is scattered over the network
We can’t get the data we need
  need an expert to get the data
We can’t understand the data we found
  available data is poorly documented
We can’t use the data we found
  data needs to be transformed from one form to another
What Is a Data Warehouse?
Definition by Inmon
  “A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process”
Data Warehouse: Subject-Oriented
  Organized around major subjects, such as customer, product, sales
Data Warehouse: Integrated
  Constructed by integrating multiple, heterogeneous data sources
    relational databases, flat files, on-line transaction records
  Data cleaning and data integration techniques are applied
    Ensure consistency in naming conventions, attribute measures, etc. among different data sources
    When data is moved to the warehouse, it is converted
Data Warehouse: Time-Variant
  The time horizon for the data warehouse is significantly longer than that of operational systems
    Operational database: current value data
    Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Data Warehouse: Non-Volatile
  Operational update of data does not occur in the data warehouse environment
  Requires only two operations in data accessing:
    initial loading of data and access of data
Data Warehouse vs. Operational DBMS
OLTP (On-Line Transaction Processing)
  Major task of traditional relational DBMS
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (On-Line Analytical Processing)
  Major task of data warehouse system
  Data analysis and decision making
From Tables and Spreadsheets to Data Cubes
  A data warehouse is based on a multidimensional data model, which views data in the form of a data cube
  A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
    Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
    Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
  Star schema
    A fact table in the middle connected to a set of dimension tables
  Snowflake schema
    A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  Fact constellations
    Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Example of Star Schema
  Sales Fact Table
    Keys: time_key, item_key, branch_key, location_key
    Measures: unit_sold, euros_sold, avg_sales
  Time: time_key, day, day_of_the_week, month, quarter, year
  Item: item_key, item_name, brand, type, supplier_type
  Branch: branch_key, branch_name, branch_type
  Location: location_key, street, city, province_or_street, country
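A star schema like the one above can be tried out in a small in-memory database. The following is a minimal sketch using Python's built-in sqlite3 module. Only two of the four dimensions are included, the table names time_dim, item_dim and sales_fact are assumptions made for the example, and all rows are invented.

```python
import sqlite3

# Hypothetical in-memory warehouse: column names follow the slide,
# table names and sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE sales_fact (
    time_key   INTEGER REFERENCES time_dim(time_key),
    item_key   INTEGER REFERENCES item_dim(item_key),
    unit_sold  INTEGER,
    euros_sold REAL
);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "Q1", 2023), (2, "Q2", 2023)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?)",
                 [(10, "TV", "Acme"), (11, "PC", "Acme")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 10, 2, 800.0), (1, 11, 1, 900.0), (2, 10, 3, 1200.0)])


def euros_by_item(connection):
    """Join the fact table to the item dimension and total the measure."""
    return connection.execute("""
        SELECT i.item_name, SUM(f.euros_sold)
        FROM sales_fact AS f
        JOIN item_dim AS i ON i.item_key = f.item_key
        GROUP BY i.item_name
        ORDER BY i.item_name
    """).fetchall()


star_result = euros_by_item(conn)
```

The fact table holds only keys and measures, so every descriptive attribute (item name, quarter) is reached through a single join to a dimension table.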
Example of Snowflake Schema
  Sales Fact Table
    Keys: time_key, item_key, branch_key, location_key
    Measures: unit_sold, euros_sold, avg_sales
  Time: time_key, day, day_of_the_week, month, quarter, year
  Item: item_key, item_name, brand, type, supplier_key
    Supplier: supplier_key, supplier_type
  Branch: branch_key, branch_name, branch_type
  Location: location_key, street, city_key
    City: city_key, city, province_or_street, country
Example of Fact Constellation
  Sales Fact Table
    Keys: time_key, item_key, branch_key, location_key
    Measures: unit_sold, euros_sold, avg_sales
  Shipping Fact Table
    Keys: time_key, item_key, shipper_key, from_location, to_location
    Measures: euros_sold, unit_shipped
  Time: time_key, day, day_of_the_week, month, quarter, year
  Item: item_key, item_name, brand, type, supplier_key
  Branch: branch_key, branch_name, branch_type
  Location: location_key, street, city, province_or_street, country
  Shipper: shipper_key, shipper_name, location_key, shipper_type
A Sample Data Cube
  [Figure: a three-dimensional sales cube. Dimensions: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), Country (Ireland, France, Germany, sum). Highlighted cell: total annual sales of TV in Ireland.]
Typical OLAP Operations
  Roll up (drill-up): summarize data
    by climbing up hierarchy or by dimension reduction
  Drill down (roll down): reverse of roll-up
    from higher level summary to lower level summary or detailed data, or introducing new dimensions
  Slice and dice
    project and select
  Pivot (rotate)
    reorient the cube, visualization, 3D to series of 2D planes
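The operations above can be sketched on a tiny cube held in a Python dictionary. This is an illustration only: the cell values are invented, the function names are assumptions, and roll-up is shown in its dimension-reduction form (climbing a hierarchy is not covered).

```python
from collections import defaultdict

# Hypothetical cube cells: (quarter, product, country) -> sales. Values are invented.
cube = {
    ("1Qtr", "TV", "Ireland"): 10,
    ("2Qtr", "TV", "Ireland"): 20,
    ("1Qtr", "PC", "Ireland"): 5,
    ("1Qtr", "TV", "France"): 7,
}


def roll_up(cells, keep):
    """Roll up by dimension reduction: keep the dimensions at the given
    indexes and sum over the dropped ones."""
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[i] for i in keep)] += value
    return dict(totals)


def slice_cube(cells, dim, value):
    """Slice: select the cells where one dimension equals a single value."""
    return {k: v for k, v in cells.items() if k[dim] == value}


def dice(cells, allowed):
    """Dice: select on several dimensions at once.
    allowed maps a dimension index to the set of accepted values."""
    return {k: v for k, v in cells.items()
            if all(k[d] in values for d, values in allowed.items())}


def pivot(cells, order):
    """Pivot: reorder the dimensions of every cell key."""
    return {tuple(k[i] for i in order): v for k, v in cells.items()}


by_product_country = roll_up(cube, keep=(1, 2))
```

Drill-down is the reverse direction: it needs the detailed cells, which is why the sketch keeps the original cube and derives summaries from it rather than overwriting it.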
Data Warehouse Architecture
  [Figure: data warehouse architecture. Sources: Relational Databases, Legacy Data, Purchased Data, ERP Systems. Components: Extraction, Cleansing, Optimized Loader, Data Warehouse Engine, Metadata Repository, Analyze, Query.]
Data Warehouse Architecture
Data Extraction - gathering data from multiple heterogeneous sources.
Data Cleaning - finding and correcting errors in the data.
Data Transformation - converting data from legacy format to warehouse format.
Data Loading - sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
Refreshing - propagating updates from the data sources to the warehouse.
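The first four steps above can be sketched as a chain of small Python functions. Everything here is hypothetical: the two sources, the field names and the sample values are invented, cleaning is reduced to dropping rows that cannot be parsed, and refreshing is left out.

```python
import csv
import io

# Two hypothetical heterogeneous sources: a CSV export and a list of records.
CSV_SOURCE = "name,amount\n alice ,10.5\nbob,n/a\n"
LEGACY_SOURCE = [{"name": "CAROL", "amount": "7"}]


def extract():
    """Extraction: gather rows from both sources into one list of dicts."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(dict(row) for row in LEGACY_SOURCE)
    return rows


def clean(rows):
    """Cleaning: trim names and drop rows whose amount cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        cleaned.append({"name": row["name"].strip(), "amount": amount})
    return cleaned


def transform(rows):
    """Transformation: convert to an assumed warehouse format."""
    return [{"customer": r["name"].title(), "amount_eur": r["amount"]} for r in rows]


def load(rows, warehouse):
    """Loading: append, sort, and build a simple index by customer."""
    warehouse.extend(rows)
    warehouse.sort(key=lambda r: r["customer"])
    return {r["customer"]: r for r in warehouse}


warehouse = []
index = load(transform(clean(extract())), warehouse)
```

Keeping each step as its own function mirrors the slide: a refresh could rerun the same chain on new source rows.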
Data Warehouse Models
  Enterprise warehouse
    collects all of the information about subjects spanning the entire organization
  Data Mart
    a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
Introduction to Data Mining
What Motivated Data Mining?
  We are drowning in data, but starving for knowledge!
What Is Data Mining?
  Data mining (knowledge discovery from data)
    Extraction of interesting (implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  Alternative names
    Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Why Data Mining? Potential Applications
  Data analysis and decision support
    Market analysis and management
      Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
    Risk analysis and management
      Forecasting, customer retention, quality control, competitive analysis
    Fraud detection and detection of unusual patterns (outliers)
Integration of Multiple Technologies
  [Figure: Data Mining shown as drawing on Machine Learning, Database Management, Artificial Intelligence, Statistics, Visualization, and Algorithms.]
What Can Data Mining Do?
  Cluster
  Classify
    Categorical, Regression
  Summarize
    Summary statistics, Summary rules
  Link Analysis / Model Dependencies
    Association rules
  Detect Deviations
Clustering
  Find groups of similar data items
  “Group people with similar travel profiles”
    Clusters: (George, Patricia), (Jeff, Evelyn, Chris), (Rob)
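As one possible illustration of grouping people with similar travel profiles, here is a minimal k-means sketch on a single feature. The slide does not name an algorithm, so k-means is an assumption, as are the trips-per-year numbers and the starting centers.

```python
# Hypothetical travel profile: trips per year for each person (invented values).
trips = {"George": 2, "Patricia": 3, "Jeff": 10, "Evelyn": 11, "Chris": 12, "Rob": 30}


def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one numeric feature.
    values maps a name to a number; centers is the list of starting centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each person in the group of the nearest center.
        groups = [[] for _ in centers]
        for name, value in values.items():
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(name)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(groups):
            if members:
                centers[i] = sum(values[m] for m in members) / len(members)
    return groups


clusters = kmeans_1d(trips, centers=[0, 10, 30])
```

Unlike classification, no group labels are supplied in advance; the groups emerge from the similarity of the values alone.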
Classification
  Find ways to separate data items into pre-defined groups
  A bank loan officer wants to analyse the data in order to know which customers (loan applicants) are risky and which are safe.
  [Figure: Training Data with known Groups; the tool produces a classifier.]
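The loan example can be sketched with a very simple classifier learned from labeled training data. A nearest-centroid rule is used here purely as an assumption for illustration; the slide does not specify a method, and the features (income and debt, in thousands) and all values are invented.

```python
# Hypothetical training data: ((income, debt), group). Values are invented.
training_data = [
    ((60, 5), "safe"),
    ((80, 10), "safe"),
    ((20, 30), "risky"),
    ((30, 40), "risky"),
]


def train(examples):
    """Produce the classifier: one mean feature vector (centroid) per group."""
    totals = {}
    for features, label in examples:
        sums, count = totals.get(label, ([0.0] * len(features), 0))
        totals[label] = ([s + f for s, f in zip(sums, features)], count + 1)
    return {label: [s / count for s in sums]
            for label, (sums, count) in totals.items()}


def classify(centroids, features):
    """Assign a new applicant to the group with the nearest centroid."""
    def squared_distance(label):
        return sum((c - f) ** 2 for c, f in zip(centroids[label], features))
    return min(centroids, key=squared_distance)


classifier = train(training_data)
decision = classify(classifier, (75, 8))
```

The groups ("safe", "risky") are fixed before training, which is what separates this task from clustering.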
Association Rules
  Identify dependencies in the data:
    X makes Y likely
  Indicate significance of each dependency
  “Find groups of items commonly purchased together”
    People who purchase X are likely to purchase Y
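One common way to quantify "X makes Y likely" is with support and confidence; the slide does not name these measures, so treat them as an assumption. A minimal sketch over invented shopping baskets:

```python
# Hypothetical shopping baskets (invented data).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, x, y):
    """Of the transactions containing x, the fraction that also contain y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


bread_milk_support = support(baskets, {"bread", "milk"})
bread_to_milk = confidence(baskets, {"bread"}, {"milk"})
```

Here bread and milk appear together in 2 of 4 baskets (support 0.5), and 2 of the 3 bread baskets also contain milk (confidence 2/3).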
Deviation Detection
  Find unexpected values
  Uses:
    Failure analysis
    Anomaly discovery for analysis
  “Find unusual occurrences in stock prices”
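A simple way to flag unusual values is a z-score test: mark any value that lies more than a chosen number of standard deviations from the mean. This is only one possible technique, assumed here for illustration; the prices and the threshold of 2.0 are invented.

```python
import statistics

# Hypothetical daily closing prices (invented data).
prices = [100, 101, 99, 100, 102, 98, 100, 130]


def find_deviations(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mean) / spread > threshold]


outliers = find_deviations(prices)
```

A single extreme value also inflates the mean and the standard deviation, so on real data a more robust measure (for example one based on the median) may be preferable.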
Knowledge Discovery (KDD) Process
  Data mining: core of the knowledge discovery process
  [Figure: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation.]
Knowledge Process
  1. Data cleaning - to remove noise and inconsistent data
  2. Data integration - to combine multiple sources
  3. Data selection - to retrieve relevant data for analysis
  4. Data transformation - to transform data into appropriate form for data mining
  5. Data mining
  6. Evaluation
  7. Knowledge presentation
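The seven steps can be read as a pipeline in which each stage feeds the next. The sketch below is a toy illustration under invented assumptions: two hypothetical purchase sources, frequency counting standing in for the mining step, and a minimum count as the evaluation criterion.

```python
from collections import Counter

# Two hypothetical sources of purchase records (invented data).
source_a = [{"customer": "ann", "item": "tv"}, {"customer": "bob", "item": None}]
source_b = [{"customer": "cat", "item": "TV"}, {"customer": "dan", "item": "pc"}]


def clean(rows):
    """1. Data cleaning: drop records with a missing item."""
    return [r for r in rows if r["item"] is not None]


def integrate(*sources):
    """2. Data integration: combine the sources into one list."""
    return [row for source in sources for row in source]


def select(rows):
    """3. Data selection: keep only the field relevant to the analysis."""
    return [r["item"] for r in rows]


def transform(items):
    """4. Data transformation: normalize case so equal items match."""
    return [item.lower() for item in items]


def mine(items):
    """5. Data mining: here, simply count how often each item occurs."""
    return Counter(items)


def evaluate(patterns, min_count):
    """6. Evaluation: keep only patterns that occur often enough."""
    return {p: c for p, c in patterns.items() if c >= min_count}


def present(patterns):
    """7. Knowledge presentation: format the surviving patterns."""
    return [f"{p}: {c}" for p, c in sorted(patterns.items())]


combined = integrate(clean(source_a), clean(source_b))
report = present(evaluate(mine(transform(select(combined))), min_count=2))
```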
Knowledge Process
  Although data mining is only one step in the entire process, it is an essential one since it uncovers hidden patterns for evaluation
Knowledge Process
  Based on this view, the architecture of a typical data mining system may have the following major components:
    Database, data warehouse, world wide web, or other information repository
    Database or data warehouse server
    Data mining engine
    Pattern evaluation module
    User interface