Introduction to Knowledge Discovery in Databases

Chapter 1 : Presented By :- Kartik N. Kalpande.

Upload: kartik-kalpande-patil

Post on 13-Apr-2017


TRANSCRIPT

Page 1: Introduction-to-Knowledge Discovery in Database

Chapter 1

Presented by: Kartik N. Kalpande

Page 2: Introduction-to-Knowledge Discovery in Database

What is Knowledge Acquisition?
• Also known as: data mining, knowledge discovery, knowledge extraction, information discovery, information harvesting, etc.
• The process of discovering useful information, hidden patterns, or rules in large quantities of data (non-trivial, previously unknown).
• Carried out by automatic or semi-automatic means; at this scale it is effectively impossible to find the patterns by manual methods.

Page 3: Introduction-to-Knowledge Discovery in Database

Why Knowledge Acquisition?

Page 4: Introduction-to-Knowledge Discovery in Database

Why Knowledge Acquisition?
• Data explosion (tremendous amounts of data available)
• Data is being warehoused
• Computing power
• Competitive pressure
• Hard disks nowadays have capacities of more than 100 GB

Page 5: Introduction-to-Knowledge Discovery in Database

Is Data Mining Appropriate for My Problem?
Four general questions to consider:
• Can we clearly define the problem?
• Does potentially meaningful data exist?
• Does the data contain hidden knowledge, or is the data factual and useful for reporting purposes only?
• Will the cost of processing the data be less than the likely increase in profit from applying any knowledge gained from the data mining project?

Page 6: Introduction-to-Knowledge Discovery in Database

Traditional Approaches
• Traditional database queries access a database using a well-defined query written in a language such as SQL.
• The query output consists of data from the database.
• The output is usually a subset of the database.
[Diagram: SQL query, DBMS, DB]
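The point about query output can be made concrete with a minimal sketch that uses Python's built-in sqlite3 module. The `customers` table, its columns, and its rows are invented for illustration; none of them come from the slides.

```python
import sqlite3

# Hypothetical in-memory table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (name TEXT, annual_income INTEGER, owns_home INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ana", 32000, 1), ("Ben", 52000, 0), ("Cho", 28000, 1)],
)

# A well-defined query: the answer is a subset of the rows already stored.
rows = conn.execute(
    "SELECT name, annual_income FROM customers WHERE owns_home = 1 ORDER BY name"
).fetchall()
conn.close()
print(rows)
```

The query only returns facts that were already stored; it does not propose any new pattern.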

Page 7: Introduction-to-Knowledge Discovery in Database

Data Mining or Data Query?
Four general types of knowledge can be defined to help us determine when data mining is appropriate:
• Shallow knowledge
• Multidimensional knowledge
• Hidden knowledge
• Deep knowledge

Page 8: Introduction-to-Knowledge Discovery in Database

Shallow Knowledge
• Factual in nature
• Can be easily stored and manipulated in a database
• Database query languages such as SQL are excellent tools for extracting shallow knowledge from data

Page 9: Introduction-to-Knowledge Discovery in Database

Multidimensional Knowledge
• Also factual
• Data are stored in a multidimensional format
• On-line Analytical Processing (OLAP) tools are used on multidimensional data
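As a rough illustration of what an OLAP tool does with multidimensional data, the sketch below "rolls up" a tiny cube along chosen dimensions. The dimensions (region, quarter, product), the numbers, and the helper name `roll_up` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical fact table: (region, quarter, product) -> units sold.
sales = {
    ("North", "Q1", "tent"): 10,
    ("North", "Q2", "tent"): 14,
    ("South", "Q1", "tent"): 7,
    ("South", "Q1", "stove"): 3,
}

def roll_up(cube, keep):
    """Aggregate the cube over every dimension whose index is not in `keep`."""
    totals = defaultdict(int)
    for dims, units in cube.items():
        totals[tuple(dims[i] for i in keep)] += units
    return dict(totals)

by_region = roll_up(sales, keep=(0,))
by_region_quarter = roll_up(sales, keep=(0, 1))
print(by_region)
print(by_region_quarter)
```

The totals are still factual summaries of stored data, which is why the slide groups this kind of knowledge with shallow knowledge rather than with hidden patterns.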

Page 10: Introduction-to-Knowledge Discovery in Database

Hidden Knowledge
• Patterns or regularities in data that cannot be easily found using a database query language such as SQL
• Data mining algorithms can uncover such patterns.

Page 11: Introduction-to-Knowledge Discovery in Database

Deep Knowledge
• Knowledge stored in a database that can only be found if we are given some direction about what we are looking for
• Current data mining tools are not able to locate deep knowledge.

Page 12: Introduction-to-Knowledge Discovery in Database

What can computers learn?
• Four levels of learning can be differentiated (Merrill & Tennyson, 1977):
  - Facts: simple statements of truth
  - Concepts: sets of objects, symbols, or events grouped together because they share certain characteristics
  - Procedures: step-by-step courses of action to achieve a goal
  - Principles: the highest level of learning; general truths or laws that are basic to other truths

Page 13: Introduction-to-Knowledge Discovery in Database

What can computers learn?
• Computers are good at learning concepts.
• Concepts are the output of a data mining session.
• There are three common concept views:
  a. Classical view
  b. Probabilistic view
  c. Exemplar view

Page 14: Introduction-to-Knowledge Discovery in Database

Three Concept Views
a. Classical View:
• Definite defining properties
• These properties determine whether an individual item is an example of a particular concept.
• Crisp, leaving no room for misinterpretation.
• Example: Good Credit Rating
  IF Annual Income >= 30,000
  & Years at Current Position >= 5
  & Owns Home = True
  THEN Good Credit Risk = True
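The classical-view rule translates almost word for word into code. This is a minimal sketch; the function and parameter names are chosen here for illustration, while the thresholds are the ones on the slide.

```python
def is_good_credit_risk(annual_income, years_at_position, owns_home):
    """Classical-view rule from the slide: every condition must hold."""
    return bool(annual_income >= 30_000 and years_at_position >= 5 and owns_home)

print(is_good_credit_risk(32_000, 6, True))
print(is_good_credit_risk(27_000, 4, True))
```

Because the rule is crisp, an applicant who misses a threshold by a small margin is rejected just as firmly as one who misses it by a wide margin.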

Page 15: Introduction-to-Knowledge Discovery in Database

Three Concept Views
b. Probabilistic View:
• Concepts are represented by properties that are probable of concept members.
• The assumption is that people store and recall concepts as generalizations created from observations of individual instances.
• Cannot be applied directly to reach an answer, but can be used to support the decision-making process.
• Associates a probability of membership with a specific classification.

Page 16: Introduction-to-Knowledge Discovery in Database

Three Concept Views
b. Probabilistic View:
• Example: Good Credit Rating
  - The mean annual income for individuals who consistently make loan payments on time is $30,000.
  - Most individuals who are good credit risks have been working for the same company for at least five years.
  - The majority of good credit risks own their own home.
• A homeowner with an annual income of $27,000 who has been employed at the same position for 4 years might be classified as a good credit risk with a probability of 0.85.

Page 17: Introduction-to-Knowledge Discovery in Database

Three Concept Views
c. Exemplar View:
• A given instance is determined to be an example of a particular concept if it is similar enough to a set of one or more known examples of the concept.
• The assumption is that people store and recall likely concept exemplars, which are then used to classify new instances.
• A probability of concept membership can be associated with each classification.

Page 18: Introduction-to-Knowledge Discovery in Database

Three Concept Views
c. Exemplar View:
• Example:
  Exemplar #1: Annual Income = 32,000; Number of Years at Current Position = 6; Homeowner
  Exemplar #2: Annual Income = 52,000; Number of Years at Current Position = 16; Renter
  Exemplar #3: Annual Income = 28,000; Number of Years at Current Position = 12; Homeowner
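One way to picture the exemplar view is a "close enough to a stored example" test. In the sketch below the three exemplars are taken from the slide, but the distance function, its scaling constants, and the threshold are assumptions invented for illustration, as is the reading that every exemplar stands for a good credit risk.

```python
# Exemplars copied from the slide: (annual income, years at position, homeowner?).
# Assumption: each one is a known example of the "good credit risk" concept.
EXEMPLARS = [(32_000, 6, True), (52_000, 16, False), (28_000, 12, True)]

def distance(a, b):
    """Hypothetical dissimilarity: scaled gaps, plus 1 if home status differs."""
    income_gap = abs(a[0] - b[0]) / 10_000   # one unit per 10,000 of income
    years_gap = abs(a[1] - b[1]) / 5         # one unit per 5 years
    home_gap = 0 if a[2] == b[2] else 1
    return income_gap + years_gap + home_gap

def matches_concept(instance, threshold=1.0):
    """An instance matches if it is close enough to at least one exemplar."""
    return min(distance(instance, e) for e in EXEMPLARS) <= threshold

print(matches_concept((27_000, 4, True)))
print(matches_concept((15_000, 1, False)))
```

Unlike the crisp classical rule, the outcome here depends on how similarity is measured, so a different scaling or threshold could change the classification.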

Page 19: Introduction-to-Knowledge Discovery in Database

What can be mined?

Page 20: Introduction-to-Knowledge Discovery in Database

Concepts that can be mined?

a. Classes:
• Stored data is used to locate data in predetermined groups.
• E.g., a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order.

Page 21: Introduction-to-Knowledge Discovery in Database

Concepts that can be mined?

b. Clusters:
• Data items are grouped by logical relationships.
• E.g., data can be mined to identify market segments or customer affinities.

Page 22: Introduction-to-Knowledge Discovery in Database

Concepts that can be mined?

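c. Associations:
• Data can be mined to identify associations between items.
• E.g., the beer-diaper example (the often-retold retail anecdote in which the two products are found to be bought together) is typical of associative mining.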
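Associations of this kind are usually scored with support and confidence. The sketch below computes both over a handful of made-up market baskets; the baskets and item names are hypothetical and are not data from any real study.

```python
# Hypothetical market baskets; the items and counts are made up.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "beer"}))        # 2 of the 5 baskets
print(confidence({"diapers"}, {"beer"}))   # 2 of the 3 diaper baskets
```

A rule such as "diapers implies beer" is normally kept only if both numbers clear thresholds chosen by the analyst.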

Page 23: Introduction-to-Knowledge Discovery in Database

Concepts that can be mined?

d. Sequential patterns:
• Data is mined to anticipate behavior patterns and trends.
• E.g., an outdoor equipment retailer could predict the likelihood of a backpack purchase based on sales of sleeping bags or hiking shoes.

Page 24: Introduction-to-Knowledge Discovery in Database

Multidisciplinary
[Diagram: KDD and Data Mining shown together with the fields they draw on: Databases, Statistics, Pattern Recognition, Machine Learning, AI, and Neurocomputing]

Page 25: Introduction-to-Knowledge Discovery in Database

Disciplines of Data Mining
[Diagram: Data Mining surrounded by Information Retrieval, Algorithms, Machine Learning, Visualization, Statistics, and Database Systems]

Page 26: Introduction-to-Knowledge Discovery in Database

Data Mining Models & Tasks
Data Mining
• Predictive
  - Classification
  - Regression
  - Time Series Analysis
  - Prediction
• Descriptive
  - Clustering
  - Summarization
  - Association Rules
  - Sequence Discovery

Page 27: Introduction-to-Knowledge Discovery in Database

Predictive Model
• Makes predictions about values of data using known results found from different data
• Or based on the use of other historical data
• Examples: credit card fraud, breast cancer early warning, terrorist acts, tsunamis, etc.

Page 28: Introduction-to-Knowledge Discovery in Database

Predictive Model
• Performs inference on the current data to make predictions.
• We know what to predict (based on historical data).
• Never 100% accurate
• Concentrates more on the input-output relationship (x, f(x))
• Typical questions:
  - Which customers are likely to buy this product in the next four months?
  - What kinds of transactions are likely to be fraudulent?
  - Who is likely to drop this paper?

Page 29: Introduction-to-Knowledge Discovery in Database

Predictive Model
[Figure: scatter plot of Profit (RM) against months; "x" marks the current data, and "O ?" marks the future data to be predicted]
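The profit-versus-months picture can be sketched as a least-squares line fitted to current data and extended one month ahead. The profit figures below are invented and deliberately lie on a straight line so the arithmetic is easy to follow; real data would scatter around the line, which is one reason such predictions are never 100% accurate.

```python
# Hypothetical monthly profits (RM); the numbers are invented.
months = [1, 2, 3, 4, 5, 6]
profit = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(months, profit)
forecast_month_7 = slope * 7 + intercept   # the "O ?" point in the figure
print(forecast_month_7)
```

A straight line is only one possible choice of f(x); the slides do not say which model the figure assumes.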

Page 30: Introduction-to-Knowledge Discovery in Database

Descriptive Model
• Identifies patterns or relationships in data.
• Serves as a way to explore the properties of the data examined, not to predict new properties
• Always requires a domain expert
• Examples:
  - Segmenting marketing areas
  - Profiling student performance

Page 31: Introduction-to-Knowledge Discovery in Database

Descriptive Model
• Discovers new patterns inside the data
• We may not have any idea what the data looks like
• Explores the properties of the data examined
• Patterns at various granularities (e.g., student: university -> faculty -> program -> major)
• Typical questions:
  - What is the data?
  - What does it look like?
  - What does the data suggest for advertising to a group of customers?

Page 32: Introduction-to-Knowledge Discovery in Database

Descriptive Model
[Figure: scatter plot of Results against major; points marked x, o, and y fall into three clusters labeled Group 1, Group 2, and Group 3]

Page 33: Introduction-to-Knowledge Discovery in Database

Views of DM
• Data to be mined: data warehouse, WWW, time series, textual, spatial, multimedia, transactional
• Knowledge to be mined: classification, prediction, summarization, trend
• Techniques utilized: database, machine learning, visualization, statistics
• Applications adapted: marketing, demographic segmentation, stock analysis

Page 34: Introduction-to-Knowledge Discovery in Database

DM in Action
• Medical applications (clinical diagnosis, drug analysis)
• Business (marketing segmentation & strategies, insolvency prediction, loan risk assessment)
• Education (online learning)
• Internet (search engines)
• Etc.

Page 35: Introduction-to-Knowledge Discovery in Database

Data Mining Methodology
Hypothesis Testing vs Knowledge Discovery
• Hypothesis testing
  - Top-down approach
  - Attempts to substantiate or disprove a preconceived idea
• Knowledge discovery
  - Bottom-up approach
  - Starts with the data and tries to get it to tell us something we didn't already know

Page 36: Introduction-to-Knowledge Discovery in Database

Data Mining Methodology
Hypothesis Testing
• Generate good ideas
• Determine what data would allow these hypotheses to be tested
• Locate the data
• Prepare the data for analysis
• Build computer models based on the data
• Evaluate the computer models to confirm or reject the hypotheses

Page 37: Introduction-to-Knowledge Discovery in Database

Data Mining Methodology
Knowledge Discovery: Directed
• Identify sources of pre-classified data
• Prepare the data for analysis
• Select appropriate KD techniques based on the data characteristics and the data mining goal
• Divide the data into training, testing, and evaluation sets
• Use the training dataset to build the model
• Tune the model by applying it to the test dataset
• Take action based on the data mining results
• Measure the effect of the action taken
• Restart the DM process, taking advantage of new data generated by the action taken
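The step that divides the data into training, testing, and evaluation sets might look like the sketch below. The 60/20/20 proportions, the fixed seed, and the function name are assumptions; the slides do not prescribe any particular split.

```python
import random

def split_dataset(records, train=0.6, test=0.2, seed=0):
    """Shuffle a copy of the records and cut it into three disjoint parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],                    # used to build the model
        shuffled[n_train:n_train + n_test],    # used to tune the model
        shuffled[n_train + n_test:],           # held back for final evaluation
    )

training, testing, evaluation = split_dataset(range(100))
print(len(training), len(testing), len(evaluation))
```

Shuffling before cutting avoids a split that simply mirrors the order in which records were stored.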

Page 38: Introduction-to-Knowledge Discovery in Database

Data Mining Methodology
Knowledge Discovery: Undirected
• Identify available data sources
• Prepare the data for analysis
• Select appropriate undirected KD techniques based on the data characteristics and the data mining goal
• Use the selected technique to uncover hidden structure in the data
• Identify potential targets for directed KD
• Generate new hypotheses to test

Page 40: Introduction-to-Knowledge Discovery in Database

Revision: Two Approaches in Data Mining
Data Mining
• Predictive (predicts future values)
  - Classification
  - Regression
  - Time Series Analysis
  - Prediction
• Descriptive (defines relationships among data)
  - Clustering
  - Summarization
  - Association Rules
  - Sequence Discovery

Page 41: Introduction-to-Knowledge Discovery in Database

Knowledge Discovery Process

Page 42: Introduction-to-Knowledge Discovery in Database

Knowledge Discovery Process
1.0 Selection
• The data needed for the data mining process may be obtained from many different, heterogeneous data sources.
• Examples:
  - Business transactions
  - Scientific data
  - Video and pictures

Page 43: Introduction-to-Knowledge Discovery in Database

Knowledge Discovery Process
2.0 Pre-processing
• Main idea: ensure that the data is clean (high-quality data).
• The data to be used by the process may contain incorrect or missing values.
• There may be anomalous data from multiple sources involving different data types and metrics.
• Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted (often using data mining tools).
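A small sketch of the cleaning step: replicated records are dropped and missing values are supplied. Filling with the mean is only one simple option picked for illustration, and the record layout (`id`, `grade`) is hypothetical.

```python
# Hypothetical student records; None marks a missing grade.
records = [
    {"id": 1, "grade": 3.2},
    {"id": 2, "grade": None},
    {"id": 2, "grade": None},   # replicated record
    {"id": 3, "grade": 2.8},
]

def clean(rows):
    """Drop records with a repeated id, then fill missing grades with the mean."""
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(dict(row))   # copy so the input is left untouched
    known = [r["grade"] for r in unique if r["grade"] is not None]
    mean_grade = sum(known) / len(known)
    for r in unique:
        if r["grade"] is None:
            r["grade"] = mean_grade
    return unique

cleaned = clean(records)
print(cleaned)
```

The sketch assumes at least one grade is known; with none, there would be no mean to fill in.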

Page 44: Introduction-to-Knowledge Discovery in Database

Knowledge Discovery Process
3.0 Transformation
• Data from different sources must be converted into a common format for processing.
• Some data may be encoded or transformed into more usable formats.
• Examples: data cleaning, data integration, data transformation, data reduction, and data discretization
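Two common transformations can be sketched in a few lines: rescaling numbers to a shared range and discretizing them into labeled bands. The incomes reuse values that appear in the exemplar example; the cut points and band labels are assumptions made for this illustration.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]   # assumes hi > lo

def discretize(value, cut_points, labels):
    """Return the label of the first bin whose upper cut point exceeds the value."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]   # value is at or above the last cut point

incomes = [28_000, 32_000, 52_000]
scaled = min_max_scale(incomes)
bands = [discretize(v, [30_000, 50_000], ["low", "medium", "high"]) for v in incomes]
print(scaled, bands)
```

Scaling matters when a technique combines attributes measured in very different units, such as income and years of employment.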

Page 45: Introduction-to-Knowledge Discovery in Database

Knowledge Discovery Process
4.0 Data Mining
• Main idea: use intelligent methods to extract patterns and knowledge from the database.
• This step applies algorithms to the transformed data to generate the desired results.
• The heart of the KD process (where unknown patterns are revealed).
• Examples of algorithms: Regression (classification, prediction), Neural Networks (prediction, classification, clustering), Apriori algorithm (association rules), K-Means (clustering), K-Nearest Neighbor (classification), Decision Trees (classification), Instance-Based Learning (classification).
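To give a feel for one of the listed algorithms, here is a bare-bones K-Means on one-dimensional data. It is a teaching sketch under simplifying assumptions (numbers rather than vectors, fixed starting centers, a fixed number of iterations), not a production implementation; the points are invented.

```python
def k_means_1d(points, centers, iterations=10):
    """Plain k-means on numbers: assign to the nearest center, then re-average."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # hypothetical measurements
centers, clusters = k_means_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```

No class labels are supplied: the two groups emerge from the data alone, which is what makes clustering a descriptive task.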

Page 46: Introduction-to-Knowledge Discovery in Database

Knowledge Discovery Process
5.0 Interpretation/Evaluation
• How the data mining results are presented to users is extremely important, because the usefulness of the results depends on it.
• Examples of presentation techniques: graphical, geometric, icon-based, pixel-based, hierarchical, hybrid

Page 47: Introduction-to-Knowledge Discovery in Database

Case Study: Predicting FSK Final-Year Students' Performance
Student database (contains 30,000 records; academics, activities)
• Selection: selected records {matric, PMK, grades}, only 2,000 records (contains incomplete records, etc.)
• Pre-processing: clean records {replace the missing values, remove the replicated records}
• Transformation: a neural network is used, so the data is transformed into numerical form
• Data mining: generated model Y = w1*x1 + w2*x2 + b1 (pattern for performance prediction)
• Interpretation & evaluation: testing result 90% correct; accept model
• Knowledge (apply model)
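The generated model Y = w1x1 + w2x2 + b1 and the "percent correct" check can be sketched as follows. The slides do not say what x1 and x2 stand for or what the trained weights are, so the weights, the 0.5 decision threshold, and the four test examples below are all hypothetical, and the resulting score has no connection to the 90% reported in the case study.

```python
def predict(x1, x2, w1, w2, b1):
    """The slide's model: Y = w1*x1 + w2*x2 + b1."""
    return w1 * x1 + w2 * x2 + b1

def accuracy(examples, weights, threshold=0.5):
    """Share of examples whose thresholded prediction equals the known label."""
    hits = sum(
        (predict(x1, x2, *weights) >= threshold) == label
        for x1, x2, label in examples
    )
    return hits / len(examples)

# Hypothetical weights and test examples: (x1, x2, performed well?).
weights = (0.5, 0.5, 0.0)
test_set = [(1.0, 1.0, True), (0.2, 0.4, False), (0.9, 0.3, True), (0.1, 0.1, True)]
print(accuracy(test_set, weights))
```

Whether a given score is good enough to accept the model is a judgment made at the interpretation and evaluation stage.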