AI Week 14 Machine Learning: Introduction to Data Mining
Lee McCluskey, room 3/10
Email [email protected]
http://scom.hud.ac.uk/scomtlm/cha2555/
Artform Research Group
Data Mining: from Machine Learning and Databases
DM involves discovering patterns in large databases or data warehouses for different purposes. It is the science of extracting meaningful information from (large) databases.
Two Types of Learning: Data Mining can be
- “Learning from Example” (Classification), where we want to learn the features that are characteristic of a class, e.g. the environmental conditions that lead to an earthquake.
  -- Classes can be binary, e.g. spam or not spam
  -- Classes can be many, e.g. classification of documents
- “Learning from Observation” (Knowledge Discovery), where we have lots of observations and we want the DM to discover interesting patterns.
We might want to analyse “raw data” (e.g. points in space) to see if there are any patterns, or analyse records and discover patterns in a relational DB (e.g. a data warehouse).
Data Mining
Predominantly the techniques used in DM are SYNTACTIC and STATISTICAL.
Applications: Data mining and knowledge discovery techniques have been applied to many areas, including:
- Market analysis and retail
- Decision support
- Financial analysis
- Discovering environmental trends
- Disease analysis
- Traffic trend analysis
We will focus on learning RULES
Data Mining of Rules: Example Inputs
Input to Data Mining Algorithms:
Sets of records, e.g. like database records. For example
- a shopping list might be considered a record where data fields are “nominal”
- an environmental observation (temp, wind speed, pressure, wind direction, time) where data fields are more complex, e.g. real numbers
Classification Rule Mining: the input also includes a class we are interested in characterising (depending on type of learning)
Data Mining of Rules: Example Outputs
Classification Rule Mining: Each record is input with a label for the class C(i) it is an example of, and the OUTPUT is a (set of) classification rules
  Features => C(1)
  …
  Features => C(n)
that can be used in the future to put a record into a class.
Association Rule Mining: A set of the most common association rules between features within a record is output, e.g. if a record with a certain set of features {x, y, z, …} is found, then it is likely that the features {a, b, c, …} are also present.
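The idea of an association rule {x} => {a} being "likely" can be made concrete by counting. The sketch below is only an illustration: the shopping records are invented, and the measures named support and confidence, along with their thresholds, are assumptions that the slides do not define.

```python
from itertools import combinations

# Hypothetical shopping-basket records; the items are invented for illustration.
records = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    return sum(itemset <= r for r in records) / len(records)

def confidence(lhs, rhs, records):
    """Of the records containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs, records) / support(lhs, records)

def pair_rules(records, min_support=0.4, min_confidence=0.7):
    """Return single-item rules (x, y, support, confidence) meaning {x} => {y}."""
    items = sorted(set().union(*records))
    rules = []
    for a, b in combinations(items, 2):
        for x, y in ((a, b), (b, a)):
            s = support({x, y}, records)
            if s >= min_support:
                c = confidence({x}, {y}, records)
                if c >= min_confidence:
                    rules.append((x, y, s, c))
    return rules

for x, y, s, c in pair_rules(records):
    print(f"{x} => {y}  (support {s:.2f}, confidence {c:.2f})")
```

Only pairs of single items are considered here to keep the sketch short; a practical miner would also search larger feature sets.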
Data Mining and Data Cleansing
Data Mining is often part of a larger process aimed at getting more out of data warehouses, and this process involves data cleansing.
Data cleansing is the process of identifying and removing or correcting corrupted or incomplete records in a database. This makes the data consistent with other similar data sets in the database. E.g. the process may remove invalid post codes or spurious extreme values (e.g. -999999.999).
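A minimal sketch of such a cleansing step, assuming records are dictionaries with hypothetical "postcode" and "temp" fields. The sentinel value is the one quoted above; the postcode pattern is a deliberately simplified assumption, not a full validator, and the sample records are invented.

```python
import re

# Assumed sentinel for a spurious value, and an assumed, simplified postcode pattern.
SENTINEL = -999999.999
POSTCODE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}$")

def is_clean(record):
    """True if the record has a plausible postcode and a usable temperature."""
    temp = record.get("temp")
    postcode = record.get("postcode") or ""
    return (
        temp is not None
        and temp != SENTINEL
        and POSTCODE.match(postcode.strip().upper()) is not None
    )

def cleanse(records):
    """Keep only the records that pass is_clean (removal rather than correction)."""
    return [r for r in records if is_clean(r)]

# Invented sample records.
records = [
    {"postcode": "HD1 3DH", "temp": 12.5},
    {"postcode": "HD1 3DH", "temp": -999999.999},  # spurious extreme value
    {"postcode": "12345", "temp": 9.0},            # invalid postcode
    {"postcode": "LS2 9JT", "temp": None},         # missing value
]
print(cleanse(records))
```

This version only removes bad records; correcting them (for example, by imputing a missing value) would be a separate design choice.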
Classification Rule Mining: Rule Induction and Use
Training Data
RowId  A1  A2  Class
1      x1  y1  c1
2      x1  y2  c2
3      x1  y1  c2
4      x1  y2  c1
5      x2  y1  c2
6      x2  y1  c1
7      x2  y3  c2
8      x1  y3  c1
9      x2  y4  c1
10     x3  y1  c1

Test Data
RowId  A1  A2  Class
1      x1  y1
2      x2  y4
3      x1  y1

[Diagram labels: Training Data, Classification algorithm, Classification Rules, Test Data, Class, Prediction Accuracy]
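One way to picture the induce-then-use cycle is the sketch below, which uses the training and test rows from this slide. The learner is an assumption chosen for brevity (one rule per observed A1, A2 combination, with majority voting), not the algorithm the slides have in mind; the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Training rows (A1, A2, class) and unlabelled test rows (A1, A2) from the slide.
TRAINING = [
    ("x1", "y1", "c1"), ("x1", "y2", "c2"), ("x1", "y1", "c2"),
    ("x1", "y2", "c1"), ("x2", "y1", "c2"), ("x2", "y1", "c1"),
    ("x2", "y3", "c2"), ("x1", "y3", "c1"), ("x2", "y4", "c1"),
    ("x3", "y1", "c1"),
]
TEST = [("x1", "y1"), ("x2", "y4"), ("x1", "y1")]

def induce_rules(rows):
    """Return ({(A1, A2): class}, default_class) using majority voting."""
    overall = Counter(cls for _, _, cls in rows)
    default = overall.most_common(1)[0][0]
    votes = defaultdict(Counter)
    for a1, a2, cls in rows:
        votes[(a1, a2)][cls] += 1
    rules = {}
    for lhs, counts in votes.items():
        # Ties are broken in favour of the class that is more common overall.
        rules[lhs] = max(counts, key=lambda c: (counts[c], overall[c]))
    return rules, default

def classify(record, rules, default):
    """Apply the matching rule, or fall back to the default class."""
    return rules.get(record, default)

rules, default = induce_rules(TRAINING)
predictions = [classify(r, rules, default) for r in TEST]
print(predictions)
```

Because the test rows carry no class labels on the slide, prediction accuracy cannot be computed here; it would require labelled test data.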
Classification Rule Mining - jargon
A classification rule LHS => C is built up from examples (and counter-examples) of a class C
A rule …
-- covers an example if the features of LHS are present in the example.
-- is characteristic if it covers all members of a class
-- is maximally characteristic if it contains the largest LHS to cover all members of a class
-- is discriminating if it covers NO counter-examples (= examples of other classes, if classes are disjoint)
Classification Rule Mining - jargon
[Diagram: an example space with points marked E and X, and a hypothesis space showing a characteristic hypothesis, a discriminating hypothesis, and the VERSION SPACE (the set of all characteristic and discriminating hypotheses)]
Classification Rule Mining – example
Size = medium, colour = green, shape = square => c1
Size = small, colour = red, shape = square => c1
Size = small, colour = blue, shape = circle => c1
Size = small, colour = green, shape = triangle => c2
Size = large, colour = white, shape = circle => c2
We aim to find “hypotheses” that are:
Characteristic and Discriminating
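The terms "covers", "characteristic" and "discriminating" can be checked mechanically against the five examples on this slide. The sketch below assumes a rule's LHS is a conjunction of feature = value tests stored as a dictionary; that representation and the function names are illustrative choices.

```python
# The five labelled examples from the slide, as (features, class) pairs.
EXAMPLES = [
    ({"size": "medium", "colour": "green", "shape": "square"}, "c1"),
    ({"size": "small", "colour": "red", "shape": "square"}, "c1"),
    ({"size": "small", "colour": "blue", "shape": "circle"}, "c1"),
    ({"size": "small", "colour": "green", "shape": "triangle"}, "c2"),
    ({"size": "large", "colour": "white", "shape": "circle"}, "c2"),
]

def covers(lhs, features):
    """A rule covers an example if every feature in its LHS is present."""
    return all(features.get(k) == v for k, v in lhs.items())

def is_characteristic(lhs, cls, examples):
    """True if the rule covers all members of the class."""
    return all(covers(lhs, f) for f, c in examples if c == cls)

def is_discriminating(lhs, cls, examples):
    """True if the rule covers no counter-examples (members of other classes)."""
    return not any(covers(lhs, f) for f, c in examples if c != cls)

for lhs in ({"shape": "square"}, {"size": "small"}, {}):
    print(lhs,
          "characteristic:", is_characteristic(lhs, "c1", EXAMPLES),
          "discriminating:", is_discriminating(lhs, "c1", EXAMPLES))
```

On this data, shape = square is discriminating for c1 but not characteristic, and the empty LHS is characteristic but not discriminating. The three c1 examples share no feature value, so under this representation no single conjunctive rule is both; a set of rules would be needed to describe c1.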
Classification Rule Mining: Use
Typically two sets of data are used in data mining:
1. Training Set
2. Validation Set
These sets are randomly selected.
A classifier is a set of classification rules. These are formed on set (1.) and tested on set (2.) to find out their accuracy.
The technique of cross-validation is where the sets are swapped round: the training set becomes the validation set and vice versa.
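The train, validate and swap procedure might be sketched as follows. The learner is a stand-in that always predicts the most common training class, chosen only to keep the example short; averaging the two accuracies is an assumption about how the swapped results are combined, and the sample data is invented.

```python
import random
from collections import Counter

def majority_classifier(training):
    """Stand-in learner: always predicts the most common training class."""
    most_common = Counter(cls for _, cls in training).most_common(1)[0][0]
    return lambda features: most_common

def accuracy(classifier, labelled):
    """Fraction of labelled records that the classifier gets right."""
    return sum(classifier(f) == cls for f, cls in labelled) / len(labelled)

def two_fold_cross_validation(data, learn, seed=0):
    """Split randomly in half, train on each half in turn, average accuracy."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    first, second = shuffled[:half], shuffled[half:]
    scores = [
        accuracy(learn(first), second),  # train on set 1, validate on set 2
        accuracy(learn(second), first),  # sets swapped round
    ]
    return sum(scores) / len(scores)

# Invented (features, class) records for illustration.
data = [
    (("x1", "y1"), "c1"), (("x1", "y2"), "c2"), (("x2", "y1"), "c1"),
    (("x2", "y3"), "c2"), (("x1", "y3"), "c1"), (("x2", "y4"), "c1"),
]
print(two_fold_cross_validation(data, majority_classifier))
```

Any function that maps a training set to a classifier can be passed as `learn`, so a rule-induction algorithm could replace the stand-in without changing the validation code.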
Conclusions
Data Mining is a powerful set of techniques that help to analyse data and discover hidden knowledge.
• There is a growing amount of data available.
• DM has many applications.
• DM can be supervised or unsupervised.