data mining (overview)

Download Data Mining (overview)

Post on 04-Nov-2014

522 views

Category:

Documents

0 download

Embed Size (px)

DESCRIPTION

 

TRANSCRIPT

  • 1. Data Mining (overview)

2. Presentation overview

  • Introduction
  • Association Rules
  • Classification
  • Clustering
  • Similar Time Sequences
  • Similar Images
  • Outliers
  • WWW
  • Summary

3. Background

  • Corporations have huge databases containing a wealth of information
  • Business databases potentially constitute a goldmine of valuable business information
  • Very little functionality in database systems to support data mining applications
  • Data mining: The efficient discovery of previously unknown patterns in large databases

4. Applications

  • Fraud Detection
  • Loan and Credit Approval
  • Market Basket Analysis
  • Customer Segmentation
  • Financial Applications
  • E-Commerce
  • Decision Support
  • Web Search

5. Data Mining Techniques

  • Association Rules
  • Sequential Patterns
  • Classification
  • Clustering
  • Similar Time Sequences
  • Similar Images
  • Outlier Discovery
  • Text/Web Mining

6. Examples of DiscoveredPatterns

  • Association rules
    • 98% of people who purchase diapers also buy beer
  • Classification
    • People with age less than 25 and salary > 40k drive sports cars
  • Similar time sequences
    • Stocks of companies A and B perform similarly
  • Outlier Detection
    • Residential customers for telecom company with businesses at home

7. Association Rules

  • Given:
      • A database of customer transactions
      • Each transaction is a set of items
  • Find all rulesX => Ythat correlate the presence of one set of itemsXwith another set of itemsY
      • Example: 98% of people who purchase diapers and baby food also buy beer.
      • Any number of items in the consequent/antecedent of a rule
      • Possible to specify constraints on rules (e.g., find only rules involving expensive imported products)

8. Association Rules

  • Sample Applications
      • Market basket analysis
      • Attached mailing in direct marketing
      • Fraud detection for medical insurance
      • Department store floor/shelf planning

9. Confidence and Support

  • A rule must have some minimum user-specifiedconfidence
    • 1 & 2 => 3 has 90% confidence if when a customer bought 1 and 2, in 90% of cases, the customer also bought 3.
  • A rule must have some minimum user-specifiedsupport(how frequently the rule occurs)
    • 1 & 2 => 3 should hold in some minimum percentage of transactions to have business value

10. Example

  • For minimum support = 50%, minimum confidence = 50%, we have the following rules
    • 1 => 3 with 50% support and 66% confidence
    • (1&3 happened in 50% of cases, but whenever 1 happened only in 2/3 of cases 3 happened too)
    • 3 => 1 with 50% support and 100% confidence
  • (3&1 happened in 50% of cases, but whenever 3 happened 1 happened too)

11. Quantitative Association Rules

  • Quantitative attributes (e.g. age, income)
  • Categorical attributes (e.g. make of car)
  • [Age: 30..39] and [Married: Yes] => [NumCars:2]

min support = 40% min confidence = 50% Definition? 12. Temporal Association Rules

  • Candescribe the rich temporal character in data
  • Example:
    • {diaper} -> {beer} (support = 5%, confidence = 87%)
    • Support of this rule may jump to 25% between 6 to 9 PM weekdays
  • Problem: How to find rules that follow interesting user-defined temporal patterns
  • Challenge is to design efficient algorithms that do much better than finding every rule in every time unit

13. Correlation Rules

  • Association rules do not capture correlations
  • Example:
      • Suppose 90% customers buy coffee,25% buy tea and 20% buy both tea and coffee
      • {tea} => {coffee} has high support 0.2 and confidence 0.8
      • {tea, coffee} are not correlated
      • expected support of customers buying both is 0.9 * 0.25 = 0.225

14. Sequential Patterns

  • Given:
      • A sequence of customer transactions
      • Each transaction is a set of items
  • Find all maximal sequential patterns supported by more than a user-specified percentage of customers
  • Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction
      • 10% is the support of the pattern

15. Classification

  • Given:
      • Database of tuples, each assigned a class label
  • Develop a model/profile for each class
      • Example profile (good credit): (25

Recommended

View more >