CSE 8392 SPRING 1999 DATA MINING:

PART I

Professor Margaret H. Dunham

Department of Computer Science and Engineering

Southern Methodist University

Dallas, Texas 75275

(214) 768-3087

fax: (214) 768-3085

email: mhd@seas.smu.edu

www: http://www.seas.smu.edu/~mhd

January 1999

CSE 8392 Spring 1999 2

CSE8392 SPRING 1999 OUTLINE

• Course Objective: To examine Data Mining concepts. A database perspective (rather than AI or statistics) is taken.

• I. Introduction and Related Topics

• II. Core Topics

• III. Advanced Topics

• IV. Case Studies

• V. Student Presentations

• VI. Summary and Future Trends

INTRODUCTION AND RELATED TOPICS

• Section Objective: Provide an introduction to data mining concepts. Briefly examine related concepts and background topics.

• Historical Perspective

– Gleaning Knowledge from the Data

– User Expectations increase as amount/sophistication of collected data increases.

– Reality vs Extracted Data

[Figure: Reality vs. extracted data. Labels: Reality, Query, Information, Physical View, Database View]

• Related Topics (to be covered)

– Knowledge Discovery

– Information Retrieval

– Fuzzy Sets

– Data Warehousing and OLAP

– Dimensional Modeling

Data Mining Overview

• What is Data Mining?

– Definition: Fayyad, p. 9

– A.k.a.

• Exploratory data analysis

• Unsupervised pattern recognition

• Data driven discovery

• Deductive learning

• Data Mining determines patterns in the data

– Non-trivial

– Valid

– Novel

– Potentially useful

– Interesting

– General and simple

– Understandable

DM Techniques (R[1])

• DM involves many different algorithms to accomplish different things. All have the following components in common.

– Model (Must fit a model to the data.)

• Function/Purpose

• Representation

– Preference Criteria (How to choose one model over another?)

– Search Algorithm (How to search the data)

• Example (Loan Data, fig 1.1 p6 in Fayyad):

– Model: Classification, Linear Function

– Preference: What best fits data? (Fig 1.2 or 1.4)

– Search Algorithm: Linear search of database
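
The three components above can be sketched on a toy version of the loan example. The data values, the single-feature threshold model (a one-dimensional stand-in for the linear function), and the helper names are illustrative assumptions, not taken from Fayyad.

```python
# Toy loan records: (income in $1000s, repaid loan?). Values are invented.
loans = [(15, False), (22, False), (30, False), (38, True), (45, True), (60, True)]

def errors(threshold, data):
    """Preference criterion: number of records the model misclassifies."""
    return sum((income >= threshold) != repaid for income, repaid in data)

def fit_threshold(data):
    """Search: one linear pass over candidate thresholds taken from the data."""
    candidates = sorted(income for income, _ in data)
    return min(candidates, key=lambda t: errors(t, data))

threshold = fit_threshold(loans)

def predict(income):
    """Model: classify an applicant by a linear cut on income."""
    return income >= threshold
```

A different preference criterion (for example, penalizing false approvals more heavily) would select a different threshold from the same data and the same search.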

DM Model Functions (R[1])

• Classification - Map data into predefined groups

• Regression - Map data to real valued prediction variable

• Clustering - Map data into groups defined by data itself

• Summarization - Map subsets of data into simple description

• Dependency Modeling - Identify dependencies among data items

• Link Analysis - Identify other relationships among data (association rules)

• Sequence Analysis - Identify sequential patterns in data

DM Historical Perspective

• Late 70’s: Spreadsheet analysis

• 80’s: Transactional databases support data storage and retrieval

• Early 90’s: Growing interest in end user support (a.k.a. decision support)

– Issue: transactional databases are not designed for decision support

• Mid 90’s: Dedicated data warehouses for decision support and multidimensional analysis

• Late 90’s: Proliferation; new concepts (data marts)

• DM Tools: Neovista, Red Brick

Data Mining Metrics

• Berson, Tables 17-1, 17-2, 17-3, p. 347

• Accuracy

• Clarity

• Dirty Data

• Dimensionality

• Raw Data (Preprocessing)

• RDBMS embedding

• Scalability

• Speed

• Validation

DM Issues

• Overfitting

• Outliers

• Closed World Assumption

• Database schemas and database models

• Algorithms for data mining

• Interpretation and visualization of results

• Size of databases

• Multimedia data, Spatio-Temporal Data

• Changing data

• Integration

• DM Applications

– Market basket analysis

– Stock analysis and selection

– Fraud detection and prevention

– Crisis prediction and prevention

KNOWLEDGE DISCOVERY IN DATABASES (KDD)

• “Overall process of discovering useful knowledge from data.” (p28 in R[1])

• Defn: R[1] p 30

• Steps Fig 1, p29 R[1] (Fig 1.3 in Fayyad)

• Data Mining is one step in KDD process

• KDD objective is not usually clear or exact. May require spending time with the customer to understand needs.

• Data usually has problems - needs cleaning

– Incorrect/missing data

– Extract from multiple sources and compare

– Delete anomalous data and sources

– Different data types/metrics
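
A minimal sketch of the cleaning issues listed above, assuming hypothetical records that carry a height in either feet or metres. The field names, the plausible-range limits, and the `clean` helper are assumptions made for illustration.

```python
FEET_TO_METRES = 0.3048

def clean(records):
    """records: list of dicts with 'id', 'height' and 'unit' ('ft' or 'm')."""
    cleaned = []
    for rec in records:
        height, unit = rec.get("height"), rec.get("unit")
        if height is None or unit not in ("ft", "m"):
            continue  # incorrect or missing data: drop the record
        # Different metrics: convert everything to metres.
        metres = height * FEET_TO_METRES if unit == "ft" else height
        if not 0.3 <= metres <= 3.0:
            continue  # anomalous value outside an assumed plausible range
        cleaned.append({"id": rec.get("id"), "height_m": round(metres, 3)})
    return cleaned
```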

FUZZY SETS and LOGIC

• Set membership described by a real valued membership function with values in [0,1]

• Ex: Set of all tall people

• Set membership function: f(x)=x is tall iff height(x)>6 ft.

• Note that this is a simple classification problem. As in the Loan example, the results are not exact.

• Basis of many classification and clustering approaches

• In a conventional DB how do you retrieve all tall people?

– Three valued logic: True, False, Maybe

– Multi-valued logic: More than 2 values
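
The contrast between the crisp "tall" rule and a fuzzy membership function can be sketched as follows. The linear ramp and its breakpoints (5.5 ft and 6.5 ft) are assumptions chosen for illustration; other membership shapes are equally valid.

```python
def tall_crisp(height_ft):
    """Crisp set: a person is either tall or not (height > 6 ft)."""
    return 1.0 if height_ft > 6.0 else 0.0

def tall_fuzzy(height_ft, low=5.5, high=6.5):
    """Fuzzy set: membership rises linearly from 0 at `low` to 1 at `high`."""
    if height_ft <= low:
        return 0.0
    if height_ft >= high:
        return 1.0
    return (height_ft - low) / (high - low)

def retrieve_tall(people, cutoff=0.5):
    """Return names whose fuzzy membership in 'tall' is at least `cutoff`."""
    return [name for name, height in people if tall_fuzzy(height) >= cutoff]
```

With the crisp rule a person of exactly 6 ft is not tall at all, while the fuzzy function assigns that person a partial membership, so a query can retrieve "fairly tall" people by lowering the cutoff.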

Fuzzy Logic

• Reasoning with uncertainty

• Extends multivalued logic; allows user to communicate using imprecise concepts, e.g.

– “good” and “bad”

– “close to” and “far away”

• Avoids brittleness of rule based reasoning by introducing degrees of set membership

– Allows for smoother transition between classification sets in the domain

– Example

• Berson figure 16.2, page 325

INFORMATION RETRIEVAL

• Store and retrieve documents based on fuzzy queries

• Predecessor of web based access

• Ex: Store information about all articles in all IEEE Transactions journals and retrieve all documents dealing with heaps.

• Overview

– Conventional IR Systems

– Query Structures (Keywords)

– Matching (Multivalued logic)

– Measures

– Text Analysis Techniques

– IR Related Topics

Conventional IR Systems

• Library card catalogs

• Documents (Library Science)

– Formatted

– Unformatted (Text)

– Mixed

• Document Surrogates

– Identifiers

– Titles, names, and dates

– Abstracts, extracts, reviews

– Summaries of Numerical Data

– Image Descriptions

IR Queries

• Query Structures

– Matching Criteria

– Boolean Queries

– Vector

– Fuzzy

– Natural Language

• Logical combination of keywords

• Weight associated with keywords

• Similarity measures

Similarity Measures

– Document Vector: $D_i = (d_{i1}, d_{i2}, \ldots, d_{in})$

– Different Measures, e.g. the inner product: $Sim(D_i, D_j) = \sum_{k=1}^{n} d_{ik} d_{jk}$

– Salton and McGill, Introduction to Modern Information Retrieval, 1984, McGraw-Hill, pp. 201-204.

– Similarity uses:

• Document-Document

• Query-Query

• Document-Query
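
Treating each document (or query) as a vector of term weights, the inner-product similarity can be sketched as below. The cosine function is added as one common normalized variant. It is an assumption that it is among the measures this slide intended, and the function names are illustrative.

```python
import math

def inner_product(d_i, d_j):
    """Sum over k of d_ik * d_jk for two equal-length term vectors."""
    return sum(a * b for a, b in zip(d_i, d_j))

def cosine(d_i, d_j):
    """Inner product divided by the vector lengths; 0.0 if either is all zeros."""
    norm = math.sqrt(inner_product(d_i, d_i)) * math.sqrt(inner_product(d_j, d_j))
    return inner_product(d_i, d_j) / norm if norm else 0.0
```

The same functions serve all three uses listed above, since a query is represented as a term vector in the same space as the documents.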

IR Document/Query Matching

• Matching Process

– Relevance and Similarity Measures

– Boolean based matching

• Logical match

– Vector based matching

• Threshold match

– Probabilistic Match

• $P(\text{relevant}) = \frac{n}{N}$, where $n$ is the number of relevant documents and $N$ is the total number of documents

– Fuzzy Matching

– Proximity Matching

– Weighting

– Relative Importance of Items
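
A minimal sketch of two of the matching styles above: a Boolean logical match on keywords, and a vector match that keeps documents whose score meets a threshold. The parameter names (`all_of`, `any_of`, `none_of`) and the use of the inner product as the score are assumptions for illustration.

```python
def boolean_match(doc_terms, all_of=(), any_of=(), none_of=()):
    """Logical match: AND over all_of, OR over any_of, NOT over none_of."""
    terms = set(doc_terms)
    return (all(t in terms for t in all_of)
            and (not any_of or any(t in terms for t in any_of))
            and not any(t in terms for t in none_of))

def threshold_match(query_vec, doc_vecs, threshold):
    """Vector match: ids of documents scoring >= threshold, best first."""
    scores = {doc_id: sum(q * d for q, d in zip(query_vec, vec))
              for doc_id, vec in doc_vecs.items()}
    hits = [doc_id for doc_id, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda doc_id: -scores[doc_id])
```

A Boolean match returns an unranked yes/no answer per document, whereas the threshold match yields a ranking, which is why weighting matters only for the latter.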

IR Matching

• Scaling

– Impact of Sample Size

– Clustering

– Centroids

• Measures

– Precision

– Recall
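
The two measures can be computed from the set of retrieved documents and the set of relevant documents. This sketch uses the standard definitions (precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved). The function name and the convention of returning 0.0 for an empty set are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Retrieving more documents tends to raise recall and lower precision, so the two are usually reported together.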

IR Indexing

• Text Analysis

– Indexing is the assignment of keywords or terms that represent document content

• Originally a library science problem that has grown with the advent of web based searches

– Indexing types

• Automated vs. manual

• Controlled vs. uncontrolled

• Single term vs. terms in context

• Deep vs. shallow

IR Indexing

• General Steps

– 1. Assignment of terms or concepts capable of representing content

– 2. Assignment of a weight or value to each term

• Indexing

– Vector based

• Start with excerpts, remove high frequency words

– Stop list

– Thesaurus

• Compute discrimination values of terms
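
The first indexing steps above (dropping stop-list words, then mapping variants to a common term through a thesaurus) can be sketched as follows. The tiny stop list and thesaurus are invented for illustration, and the sketch stops at raw term counts; it does not compute discrimination values.

```python
from collections import Counter

STOP_LIST = {"the", "a", "of", "and", "in", "is", "to"}   # assumed, tiny
THESAURUS = {"heaps": "heap", "priority-queue": "heap"}    # assumed mapping

def index_terms(text):
    """Return term counts after stop-list removal and thesaurus mapping."""
    words = text.lower().split()
    kept = [THESAURUS.get(w, w) for w in words if w not in STOP_LIST]
    return Counter(kept)
```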

IR Retrieval

• Retrieval or Classification

– Vector based

• Same starting point as with indexing

• Compute weighting factors

• Assign to each document a weighted term vector

– Similarity Measures

• Measure similarity between document/query

• Results normalized to the range from 0 to 1

IR Retrieval

– Inverse Document Frequency

• Assumes the importance of a term is proportional to its occurrence frequency within a document, and inversely proportional to the total number of documents in which the term occurs.

• Also used for similarity measurement

– Inverted Indexing of Document

– Concept Hierarchy

• DAG of concepts

• Follow nodes from general to more specific

• Tag articles with low level concepts so that each may be distinguished from ancestors
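
Inverted indexing and inverse-document-frequency weighting, mentioned above, can be sketched together. The logarithmic form of the weight is one common choice and is an assumption here, as are the function names and the representation of each document as a list of terms.

```python
import math
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def tf_idf(term, doc_id, docs, index):
    """Term frequency in the document times log(N / document frequency)."""
    tf = docs[doc_id].count(term)
    df = len(index.get(term, ()))
    return tf * math.log(len(docs) / df) if df else 0.0
```

A term that occurs in every document gets weight 0 under this form, reflecting that it cannot distinguish one document from another.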

IR Related Topics

• Information Retrieval Related Topics

– Text Analysis

– Fuzzy Sets

– Extending Databases

– Hypertext

– Digital Libraries

– Data Mining

• Web based browsers

DATA WAREHOUSING AND OLAP

– Preparations for Mining: Data Warehousing

• Extracting the data (from RDBMS)

• Storing the data

– Data warehouse or data mart

• Cleansing the data

• Mining the data

– Often with multidimensional queries

• Definition

– Blend of technologies

– Integration

– Enables Strategic Use of Data

• Architecture

– Figure 6.1, page 116
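
The multidimensional queries mentioned under "Mining the data" can be illustrated by a roll-up over a small fact table. The dimensions, the sales measure, and all values below are hypothetical; a real warehouse would run such queries in the database or an OLAP engine rather than in application code.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales). Values are invented.
facts = [
    ("East", "widgets", "Q1", 100),
    ("East", "gadgets", "Q1", 50),
    ("West", "widgets", "Q1", 70),
    ("East", "widgets", "Q2", 120),
]
DIMENSIONS = ("region", "product", "quarter")

def roll_up(rows, by):
    """Sum the sales measure, grouping by the chosen dimensions."""
    positions = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in positions)] += row[-1]
    return dict(totals)
```

Calling `roll_up(facts, ["region"])` aggregates away product and quarter, while `roll_up(facts, ["product", "quarter"])` aggregates away region; each choice of dimensions is one view of the same facts.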

DW Migration

• Migration from Relational Database to Data Warehouse

– Differences (Relational vs. Data Warehouse)

– Procedure for Migration

• Extraction

• Cleanup

• Transformation

• Migration

• Issues

– Multiple sources

– Database Heterogeneity

– Data Heterogeneity

DW Design

• Data Warehouse Design Considerations - Nine Step Method:

– Subject Matter

– Fact Table contents

– Dimensioning

– Fact Selection

– Precalculations

– Rounding out dimension table

– Duration selection

– What about change?

– Query priorities

• Technical Considerations

– Hardware

– Communications Infrastructure

– Data Structures
