Page 1: G54DMT – Data Mining Techniques and Applications jqb/G54DMT jqb/G54DMT Dr. Jaume Bacardit jqb@cs.nott.ac.uk

G54DMT – Data Mining Techniques and Applications

http://www.cs.nott.ac.uk/~jqb/G54DMT

Dr. Jaume Bacardit
jqb@cs.nott.ac.uk

Topic 3: Data Mining
Lecture 5: Regression and Association Rules

Some slides from chapter 5 of Data Mining. Concepts and Techniques by Han & Kamber


Outline of the lecture

• Regression
  – Definition
  – Representations

• Association rules
  – Definition
  – Methods

• Resources


Regression

• Regression problems are supervised problems where the output variable is continuous
• Many techniques with different names are included in this category
  – Regression
  – Function approximation
  – Modelling
  – Curve-fitting
• Given an input vector X and a corresponding output y, we want to find a function f such that y’ = f(X) is as close as possible to the true y


Evaluating regression

• Supervised learning: we know the true outputs, so we check how different they are from the predicted ones
  – Mean Absolute Error
  – Mean Squared Error
  – Root Mean Squared Error
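The three error measures listed above can be sketched in a few lines of plain Python. The function names and the toy values of `y_true` and `y_pred` are assumptions chosen for illustration only; they do not come from the lecture.

```python
import math

def mean_absolute_error(y_true, y_pred):
    # Average of |true - predicted| over all examples
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    # Average of (true - predicted)^2; penalises large errors more heavily
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    # Square root of the MSE, expressed in the same units as y
    return math.sqrt(mean_squared_error(y_true, y_pred))

# Hypothetical true outputs and predictions
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
```

RMSE is often preferred for reporting because it is in the units of the output variable, while MAE is less sensitive to a few large errors.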


Linear Regression

• Most classic (and widespread in statistics) type of regression
• f(X) is modelled as
  – $y' = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$

http://upload.wikimedia.org/wikipedia/en/thumb/1/13/Linear_regression.png/400px-Linear_regression.png


Linear regression

• Simple but limited in expressive power
  – The same model would apply to these four datasets

http://en.wikipedia.org/wiki/Anscombe%27s_quartet


Linear regression

• How to find W?
  – Many mathematical methods available
    • Least squares
    • Ridge regression
    • Lasso
    • Etc.
  – We can also use some kind of metaheuristic (e.g. a Genetic Algorithm)
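As a rough sketch of the first two methods in the list, least squares and ridge regression can both be written as the solution of the normal equations. Everything below is an illustrative assumption rather than the lecture's own code: the helper names `solve_linear_system` and `fit_linear`, the toy data, and the choice of leaving the intercept w0 unpenalised in the ridge case.

```python
def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

def fit_linear(X, y, ridge=0.0):
    """Return [w0, w1, ..., wn]; ridge=0.0 gives ordinary least squares."""
    rows = [[1.0] + list(x) for x in X]  # leading 1 models the intercept w0
    d = len(rows[0])
    # Normal equations: (R^T R + ridge * I) w = R^T y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    for j in range(1, d):  # assumption: the intercept is not penalised
        A[j][j] += ridge
    return solve_linear_system(A, b)

# Hypothetical data generated from y = 1 + 2*x1 - 1*x2
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]]
y = [1.0 + 2.0 * x1 - 1.0 * x2 for x1, x2 in X]

w_ols = fit_linear(X, y)                # recovers the generating weights
w_ridge = fit_linear(X, y, ridge=10.0)  # penalised weights are shrunk
```

The ridge penalty trades a little bias for smaller weights, which tends to help when inputs are correlated or examples are scarce. Lasso has no closed-form solution of this kind and needs an iterative solver.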


Polynomial regression

• More complex and sophisticated functions
  – $y = w_0 + w_1 x + w_2 x^2 + \dots$
  – $y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + \dots$

• Now the job is double
  – Choosing the correct function (human inspection may help)
  – Adjusting the weights of the model

• Still, would a single mathematical function fit any type of data?
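One way to read the two formulas on this slide is that a polynomial model is still linear in its weights once the inputs are expanded into new features. The sketch below shows only that expansion step; the function names and the example weights are hypothetical.

```python
def poly_features_1d(x, degree):
    # [1, x, x^2, ..., x^degree] for the single-input polynomial
    return [x ** k for k in range(degree + 1)]

def interaction_features(x1, x2):
    # [1, x1, x2, x1*x2] for the two-input model with an interaction term
    return [1.0, x1, x2, x1 * x2]

def predict(weights, features):
    # The model output is a plain weighted sum of the expanded features
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical weights: y = 1 + 2x + 3x^2 evaluated at x = 2
y_quadratic = predict([1.0, 2.0, 3.0], poly_features_1d(2.0, 2))

# Hypothetical weights: y = 1 + x1 + x2 + x1*x2 evaluated at (2, 3)
y_interaction = predict([1.0, 1.0, 1.0, 1.0], interaction_features(2.0, 3.0))
```

Because the output is a weighted sum of the expanded features, the weights can be adjusted with the same methods used for linear regression. Choosing which terms to include remains the separate, harder part of the job.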


Piece-wise regression

• Input space is partitioned into regions
• A local regression model is generated from the training examples that fall inside each region
  – Example: approximating a sine function with linear regressions

(Butz, 2010)
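The sine example can be sketched with the simplest possible partition: equal-width intervals, each with its own least-squares line. This is only an illustration under that assumption. The methods named in the lecture learn the partition themselves, and the function names and sample sizes here are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Closed-form least squares for y' = w0 + w1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    return my - w1 * mx, w1

def fit_piecewise(xs, ys, lo, hi, n_regions):
    """Split [lo, hi] into equal-width regions and fit one line per region."""
    width = (hi - lo) / n_regions
    buckets = [([], []) for _ in range(n_regions)]
    for x, y in zip(xs, ys):
        i = min(int((x - lo) / width), n_regions - 1)
        buckets[i][0].append(x)
        buckets[i][1].append(y)
    models = [fit_line(bx, by) for bx, by in buckets]
    return lo, width, models

def predict_piecewise(model, x):
    lo, width, models = model
    i = min(max(int((x - lo) / width), 0), len(models) - 1)
    w0, w1 = models[i]  # local model of the region containing x
    return w0 + w1 * x

# 200 samples of one period of the sine function
xs = [2 * math.pi * i / 199 for i in range(200)]
ys = [math.sin(x) for x in xs]

def rmse(model):
    sq = sum((predict_piecewise(model, x) - y) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sq / len(xs))

single = fit_piecewise(xs, ys, 0.0, 2 * math.pi, 1)  # one global line
eight = fit_piecewise(xs, ys, 0.0, 2 * math.pi, 8)   # eight local lines
```

Comparing `rmse(single)` with `rmse(eight)` shows how much a handful of local linear models improves on a single global line for a curved target.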


Piece-wise regression

• How to partition the input space
  – Using a series of rules
    • With a (hyper)rectangular condition (XCSF)
    • With a (hyper)ellipsoidal condition (XCSF, LWPR)
    • With a neural condition (XCSF)
  – Using a tree-like structure (CART, M5)

• How to perform the regression process for each local approximation
  – Pick any of the functions discussed before
  – Plus some truly non-linear methods (SVR)


Piece-wise approximation with hyperellipsoids

• Using XCSF (Wilson, 02) with hyperellipsoid conditions (Butz et al, 08)

[Figure panels: Test function; XCSF’s population]

(Stalph et al, 2010)


Other regression methods

• Neural networks
  – An MLP is natively a regression method
    • Classification is done by discretising the output of the network
  – It is proven that an MLP with enough hidden nodes can approximate any continuous function

• Support Vector Regression
  – As in SVM, depending on the kernel we get linear or non-linear regression
  – The margin specifies a tube around the approximated function. All points inside the tube have their errors ignored
  – Support Vectors are the points that lie on or outside the tube
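The tube described above corresponds to what is usually called the epsilon-insensitive loss. A minimal sketch follows, with `epsilon` as the half-width of the tube; the function name and the numeric values are assumptions for illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    # Errors smaller than epsilon fall inside the tube and cost nothing;
    # larger errors are charged only for the part that sticks out
    return max(0.0, abs(y_true - y_pred) - epsilon)

inside_tube = epsilon_insensitive_loss(1.0, 1.05, epsilon=0.1)
outside_tube = epsilon_insensitive_loss(1.0, 1.5, epsilon=0.1)
```

A wider tube ignores more points and typically yields fewer support vectors, at the price of a coarser fit.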


Association Rules

• Association rules try to find frequent patterns in the dataset that appear together
• It can use the class label but it does not have to, so we can consider it an unsupervised learning paradigm

• Two types of elements being generated
  – Association rules: they have an antecedent and a consequent
  – Frequent itemsets: they just have an antecedent

• Both antecedent and consequent are logic predicates (generally of conjunctive form)


Association rules mining

Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)


Origin of Association Rules

• These methods were originally employed to analyse shopping carts

• Database is specified as a set of transactions. Each of them includes one or more of a set of items

• A frequent itemset is a set of items that appears in many transactions

• These databases are extremely sparse

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E


Beers and diapers

• An urban myth about association rules says that, when they were applied to a very large volume of shopping carts, they discovered a very simple pattern
  – “Customers that buy beer also tend to buy diapers”

• This story has changed through time. You can find an article about it here

• It is a good example of data mining, as it was able to find an unexpected pattern


Why Is Freq. Pattern Mining Important?

• Discloses an intrinsic and important property of data sets
• Forms the foundation for many essential data mining tasks
  – Association, correlation, and causality analysis
  – Sequential, structural (e.g., sub-graph) patterns
  – Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  – Classification: associative classification
  – Cluster analysis: frequent pattern-based clustering
  – Data warehousing: iceberg cube and cube-gradient
  – Semantic data compression: fascicles
  – Broad applications


Evaluation of association rules

• Support
  – Percentage of examples covered by the predicate in the antecedent
  – Applies to both association rules and frequent itemsets

• Confidence
  – Percentage of the examples matched by the antecedent that also match the consequent
  – Only applies to association rules

• Typically, the user specifies a minimum support and confidence and the algorithm finds all rules above the thresholds
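Both measures can be sketched on the four-transaction database used elsewhere in the lecture. The function names are hypothetical. Confidence is computed here as the support of antecedent and consequent together divided by the support of the antecedent, which is one common way to express the definition above.

```python
transactions = [
    {"A", "C", "D"},       # Tid 10
    {"B", "C", "E"},       # Tid 20
    {"A", "B", "C", "E"},  # Tid 30
    {"B", "E"},            # Tid 40
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Among transactions matching the antecedent, the fraction that
    # also match the consequent
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

sup_be = support({"B", "E"}, transactions)            # 3 of 4 transactions
conf_b_e = confidence({"B"}, {"E"}, transactions)     # every B also has E
conf_c_e = confidence({"C"}, {"E"}, transactions)     # 2 of the 3 C's have E
```

With a minimum confidence of, say, 0.8, the rule B → E would be kept and C → E discarded, even though both itemsets are reasonably frequent.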


Scalable Methods for Mining Frequent Patterns

• The downward closure property of frequent patterns
  – Any subset of a frequent itemset must be frequent
  – If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  – i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}

• Scalable mining methods: Three major approaches
  – Apriori (Agrawal & Srikant @VLDB’94)
  – Freq. pattern growth (FPgrowth: Han, Pei & Yin @SIGMOD’00)
  – Vertical data format approach (Charm: Zaki & Hsiao @SDM’02)


Apriori: A Candidate Generation-and-Test Approach

• Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)

• Method:

– Initially, scan DB once to get the frequent 1-itemsets

– Generate length (k+1) candidate itemsets from length k frequent itemsets

– Test the candidates against DB

– Terminate when no frequent or candidate set can be generated


The Apriori Algorithm—An Example

Supmin = 2

Database TDB

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: C1

Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1

Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1)

Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan: C2 with counts

Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2

Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from L2)

Itemset
{B, C, E}

3rd scan: L3

Itemset     sup
{B, C, E}   2


The Apriori Algorithm

• Pseudo-code:

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
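The pseudo-code can be turned into a short runnable sketch. This is one possible reading, not the reference implementation: the function name `apriori`, the use of absolute counts for `min_support` (matching Supmin = 2 in the example), and the simple join-then-prune candidate generation are all assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: count single items to obtain L1
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 1
    while current:  # loop while Lk is not empty
        # Join: unions of two frequent k-itemsets that have size k + 1
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # Scan the database to count the surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# The database TDB from the worked example, with Supmin = 2
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_support=2)
```

On this database the sketch yields the same L1, L2 and L3 as the worked example: four frequent items, four frequent pairs and the single triple {B, C, E}. The prune step is where the downward closure property saves work, since {A, B, C} and {A, C, E} are discarded without a database scan.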


Resources

• “The Elements of Statistical Learning” by Hastie et al. contains a lot of detail about statistical regression
• List of regression and association rules methods in KEEL
• Weka also contains both kinds of methods
• Chapter 5 of the Han and Kamber book is all about association rules (Han created the FPgrowth method)
• Review of evolutionary algorithms for association rule mining


Questions?