AI Week 14 Machine Learning: Introduction to Data Mining Lee McCluskey, room 3/10 Email lee@hud.ac.uk http://scom.hud.ac.uk/scomtlm/cha2555/

Upload: colin-wells

Post on 16-Jan-2016


Page 1

AI Week 14

Machine Learning: Introduction to Data Mining

Lee McCluskey, room 3/10

Email lee@hud.ac.uk

http://scom.hud.ac.uk/scomtlm/cha2555/

Page 2

Artform Research Group

Data Mining: from Machine Learning and Databases

DM involves discovering patterns in large databases or data warehouses for different purposes. It is the science of extracting meaningful information from (large) databases.

Two Types of Learning: Data Mining can be

- “Learning from Example” (Classification), where we want to learn the features that are characteristic of a class, e.g. environmental conditions that lead to an earthquake.
  -- Classes can be binary, e.g. spam or not-spam
  -- Classes can be many, e.g. classification of documents

- “Learning from Observation” (Knowledge Discovery), where we have lots of observations and we want the DM to discover interesting patterns.

We might want to analyse “raw data” (e.g. points in space) to see if there are any patterns, or analyse records and discover patterns in a Relational DB (e.g. a data warehouse).

Page 3

Data Mining

Predominantly the techniques used in DM are SYNTACTIC and STATISTICAL.

Applications: Data mining and knowledge discovery techniques have been applied to many areas, including:

- Market analysis and Retail
- Decision support
- Financial analysis
- Discovering environmental trends
- Disease analysis
- Traffic trend analysis

We will focus on learning RULES

Page 4

Data Mining of Rules: Example Inputs

Input to Data Mining Algorithms:

Sets of records, e.g. like database records. For example:

- a shopping list might be considered a record where data fields are “nominal”

- an environmental observation (temp, wind speed, pressure, wind direction, time) where data fields are more complex, e.g. real numbers

For Classification Rule Mining, the input also includes a class we are interested in characterising (depending on type of learning).
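The two kinds of record described above might be represented as follows. This is only an illustrative sketch: the field names, units, sample values and the class label "storm" are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass

# A shopping list as a record with nominal fields: each item is just a name,
# with no ordering or arithmetic defined on it (hypothetical items).
shopping_record = frozenset({"bread", "milk", "eggs"})


@dataclass(frozen=True)
class Observation:
    """An environmental observation with more complex (numeric) fields.

    Field names and units are assumptions made for this sketch.
    """
    temp_c: float
    wind_speed: float
    pressure: float
    wind_direction: str
    time: str


# Hypothetical sample values.
observation = Observation(
    temp_c=11.5,
    wind_speed=7.2,
    pressure=1008.0,
    wind_direction="SW",
    time="2016-01-16T09:00",
)

# For classification rule mining, each record is paired with a class label.
labelled_observation = (observation, "storm")
```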

Page 5

Data Mining of Rules: Example Outputs

Classification Rule Mining: Each record is input with a label for the class C(i) it is an example of, and the OUTPUT is a (set of) classification rules

Features => C(1)
…
Features => C(n)

that can be used in the future to put a record into a class.

Association Rule Mining: A set of the most common association rules between features within records is output, e.g. if a record with a certain set of features {x, y, z, …} is found, then it is likely that the following are also present: {a, b, c, …}.
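As a rough illustration of association rule mining, the sketch below counts how often items occur together and keeps rules of the form "if x is present then y is likely present". The measures "support" (fraction of records containing both items) and "confidence" (fraction of records containing x that also contain y) are the usual names for "most common" and "likely", but they, the thresholds, and the sample baskets are assumptions not taken from the slides. For brevity only single-item left and right sides are considered.

```python
from collections import Counter
from itertools import combinations


def association_rules(records, min_support=0.5, min_confidence=0.8):
    """Return (lhs, rhs, support, confidence) tuples for single-item rules."""
    n = len(records)
    counts = Counter()
    for record in records:
        items = sorted(set(record))
        # Count every single item and every pair of items in this record.
        for size in (1, 2):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1

    rules = []
    for itemset, count in counts.items():
        if len(itemset) != 2 or count / n < min_support:
            continue  # keep only pairs that occur often enough
        for lhs in itemset:
            (rhs,) = itemset - {lhs}
            confidence = count / counts[frozenset([lhs])]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, count / n, confidence))
    return sorted(rules)


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"milk"},
]
rules = association_rules(baskets)
print(rules)  # [('eggs', 'bread', 0.5, 1.0)]
```

Here every basket containing eggs also contains bread, so "eggs => bread" is kept, while "bread => eggs" holds in only two of three bread baskets and is dropped.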

Page 6

Data Mining and Data Cleansing

Data Mining is often part of a larger process aimed at getting more out of data warehouses, and involves data cleansing.

Data cleansing is the process of identifying and removing or correcting corrupted or missing records from a database. This makes the data consistent with other similar data sets in the database. E.g. the process may remove invalid post codes or spurious extreme values (e.g. -999999.999).
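A minimal cleansing pass might look like the sketch below. It only removes bad records (it does not attempt to correct them). The record layout, the field names `postcode` and `reading`, and the simplified postcode pattern are assumptions for illustration; a real validator would need the full postcode rules.

```python
import re

# Sentinel values that stand for "no real measurement" (from the example above).
SENTINELS = {-999999.999}

# Simplified UK-style postcode pattern; an assumption for this sketch only.
POSTCODE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}$")


def is_clean(record):
    """Return True if the record has a plausible postcode and a real reading."""
    reading = record.get("reading")
    postcode = record.get("postcode")
    if reading is None or reading in SENTINELS:
        return False
    if postcode is None or not POSTCODE.match(postcode.strip().upper()):
        return False
    return True


def cleanse(records):
    """Keep only the records that pass every check."""
    return [r for r in records if is_clean(r)]


# Hypothetical sample records.
raw = [
    {"postcode": "HD1 3DH", "reading": 12.5},
    {"postcode": "XXXX", "reading": 3.0},             # invalid postcode
    {"postcode": "HD1 3DH", "reading": -999999.999},  # spurious extreme value
    {"postcode": "HD1 3DH", "reading": None},         # missing value
]
cleaned = cleanse(raw)
```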

Page 7

Classification Rule Mining: Rule Induction and Use

Training Data

RowId  A1  A2  Class
1      x1  y1  c1
2      x1  y2  c2
3      x1  y1  c2
4      x1  y2  c1
5      x2  y1  c2
6      x2  y1  c1
7      x2  y3  c2
8      x1  y3  c1
9      x2  y4  c1
10     x3  y1  c1

Test Data

RowId  A1  A2  Class
1      x1  y1
2      x2  y4
3      x1  y1

[Figure: diagram with the labels Training Data, Classification algorithm, Classification Rules, Test Data and Prediction Accuracy]

Page 8

Classification Rule Mining - jargon

A classification rule LHS => C is built up from examples (and counter examples) of a class C.

A rule …

-- covers an example if the features of LHS are present in the example.

-- is characteristic if it covers all members of a class

-- is maximally characteristic if it contains the largest LHS to cover all members of a class

-- is discriminating if it covers NO counter examples (= examples of other classes, if classes are disjoint)

Page 9

Classification Rule Mining - jargon

[Figure: an Example space with points marked E and X, and a Hypothesis Space showing a characteristic hypothesis, a discriminating hypothesis, and the VERSION SPACE, the set of all characteristic and discriminating hypotheses]

Page 10

Classification Rule Mining – example

Size = medium, colour = green, shape = square => c1
Size = small, colour = red, shape = square => c1
Size = small, colour = blue, shape = circle => c1
Size = small, colour = green, shape = triangle => c2
Size = large, colour = white, shape = circle => c2

We aim to find “hypotheses” that are:

Characteristic and Discriminating
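The terms "covers", "characteristic" and "discriminating" can be checked mechanically against the five examples above. The sketch below assumes a rule's LHS is a simple conjunction of feature = value tests, stored as a dictionary; that representation and the function names are assumptions made for illustration, not something the slides prescribe.

```python
# Each example is (features, class), copied from the five examples above.
EXAMPLES = [
    ({"size": "medium", "colour": "green", "shape": "square"}, "c1"),
    ({"size": "small", "colour": "red", "shape": "square"}, "c1"),
    ({"size": "small", "colour": "blue", "shape": "circle"}, "c1"),
    ({"size": "small", "colour": "green", "shape": "triangle"}, "c2"),
    ({"size": "large", "colour": "white", "shape": "circle"}, "c2"),
]


def covers(lhs, features):
    """A rule covers an example if every feature in its LHS is present in it."""
    return all(features.get(name) == value for name, value in lhs.items())


def is_characteristic(lhs, cls, examples):
    """True if the rule covers all members of the class."""
    return all(covers(lhs, f) for f, c in examples if c == cls)


def is_discriminating(lhs, cls, examples):
    """True if the rule covers no counter examples (members of other classes)."""
    return not any(covers(lhs, f) for f, c in examples if c != cls)


for lhs in ({}, {"shape": "square"}, {"size": "small"}):
    print(lhs,
          is_characteristic(lhs, "c1", EXAMPLES),
          is_discriminating(lhs, "c1", EXAMPLES))
```

Under this representation the empty LHS is characteristic of c1 but not discriminating, while shape = square is discriminating but not characteristic, since it misses the blue circle. The three c1 examples share no single feature value, so no one conjunctive rule of this form is both.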

Page 11

Classification Rule Mining: Use

Typically two sets of data are used in data mining:

1. Training Set
2. Validation Set

These sets are randomly selected.

A classifier is a set of classification rules. These are formed on set (1.) and tested on set (2.) to find out their accuracy.

In the technique of cross-validation the sets are swapped round: the training set becomes the validation set and vice versa.
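The procedure above can be sketched as follows: split the records at random, form rules on one half, measure accuracy on the other, then swap the halves and average. The learner here is a deliberately simple stand-in (for each combination of feature values, predict the most common class seen in training, falling back to the overall most common class); it, the function names and the 50/50 split are assumptions for illustration, not the algorithm used in the lecture. It also assumes neither half is empty.

```python
import random
from collections import Counter, defaultdict


def split(records, fraction=0.5, seed=0):
    """Randomly divide records into a training set and a validation set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


def learn_rules(training):
    """Toy learner: map each LHS (tuple of feature values) to its majority class."""
    by_lhs = defaultdict(Counter)
    for features, label in training:
        by_lhs[features][label] += 1
    default = Counter(label for _, label in training).most_common(1)[0][0]
    rules = {lhs: counts.most_common(1)[0][0] for lhs, counts in by_lhs.items()}
    return rules, default


def accuracy(classifier, validation):
    """Fraction of validation records whose class the rules predict correctly."""
    rules, default = classifier
    correct = sum(1 for features, label in validation
                  if rules.get(features, default) == label)
    return correct / len(validation)


def two_fold_cross_validation(records, seed=0):
    """Train on each half in turn, validate on the other, and average."""
    a, b = split(records, 0.5, seed)
    return (accuracy(learn_rules(a), b) + accuracy(learn_rules(b), a)) / 2


# Hypothetical records in the form ((A1, A2), class).
records = [
    (("x1", "y1"), "c1"), (("x1", "y2"), "c2"), (("x1", "y1"), "c2"),
    (("x1", "y2"), "c1"), (("x2", "y1"), "c2"), (("x2", "y1"), "c1"),
    (("x2", "y3"), "c2"), (("x1", "y3"), "c1"), (("x2", "y4"), "c1"),
    (("x3", "y1"), "c1"),
]
score = two_fold_cross_validation(records)
```

Fixing the seed keeps the random split repeatable, which makes accuracy figures comparable between runs.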

Page 12

Conclusions

Data Mining is a powerful set of techniques to help analyse data, and discover hidden knowledge

• There is a growing amount of data available.

• DM has many applications.

• DM can be supervised or unsupervised.