data mining practical machine learning tools and techniques slides for chapter 3 of data mining by...
TRANSCRIPT
![Page 1: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/1.jpg)
Data MiningPractical Machine Learning Tools and Techniques
Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank andM. A. Hall
![Page 2: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/2.jpg)
2Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Input: Concepts, instances, attributes
Terminology●What’s a concept?
Classification, association, clustering, numeric prediction
●What’s in an example? Relations, flat files, recursion
●What’s in an attribute? Nominal, ordinal, interval, ratio
●Preparing the input ARFF, attributes, missing values, getting to know data
![Page 3: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/3.jpg)
3Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Terminology
●Components of the input:Concepts: kinds of things that can be learned
●Aim: intelligible and operational concept descriptionInstances: independent examples of a concept
●Note: more complicated forms of input are possibleAttributes: measuring aspects of an instance
●We will focus on nominal and numeric ones
![Page 4: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/4.jpg)
4Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
What’s a concept?
●Styles of learning:• Classification learning: predicting a discrete
class• Association learning: detecting associations
between features• Clustering: grouping similar instances into
clusters• Numeric prediction: predicting a numeric
quantity
●Concept: thing to be learned●Concept model/hypothesi: ● output of learning scheme
![Page 5: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/5.jpg)
5Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
What’s in an attribute?
●Each instance is described by a fixed predefined set of features, its “attributes”●Possible attribute types (“levels of measurement”):NominalOrdinalInterval
![Page 6: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/6.jpg)
6Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Nominal quantities
Values are distinct symbolsValues themselves serve only as labels or namesNominal comes from the Latin word for name
Example: attribute “outlook” from weather dataValues: “sunny”,”overcast”, and “rainy”
No relation is implied among nominal values (no ordering or distance measure)●Only equality tests can be performed
![Page 7: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/7.jpg)
7Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Ordinal quantities
Imposes order on values, but: no distance between values defined●Example: attribute “temperature” in weather dataValues: “hot” > “mild” > “cool”●Example rule:
temperature < hot play = yes●Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)
![Page 8: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/8.jpg)
8Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Interval quantities
Interval quantities are not only ordered but measured in fixed and equal units●Example 1: attribute “temperature” expressed in degrees Fahrenheit●Example 2: attribute “year”●Difference of two values makes sense●Sum or product doesn’t make senseZero point is not defined!
![Page 9: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/9.jpg)
9Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Attribute types used in practice
Most schemes accommodate just two levels of measurement: nominal and ordinal
Nominal attributes are also called “categorical”, ”enumerated”, or “discrete”But: “enumerated” and “discrete” imply order●Special case: dichotomy (“boolean” attribute)
Ordinal attributes are called “numeric”, or “continuous”But: “continuous” implies mathematical continuity
![Page 10: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/10.jpg)
10Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Metadata
Information about the data that encodes background knowledge:•Source•Date recorded•Type•Range•Resolution
![Page 11: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/11.jpg)
11Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
The ARFF format
%
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
![Page 12: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/12.jpg)
12Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Additional attribute types
●ARFF supports string attributes:
Similar to nominal attributes but list of values is not pre-specified●It also supports date attributes:
Uses the ISO-8601 combined date and time format yyyy-MM-dd-THH:mm:ss
@attribute description string
@attribute today date
![Page 13: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/13.jpg)
13Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Sparse data●In some applications most attribute values in a dataset are zeroE.g.: word counts in a text categorization problem
●ARFF supports sparse data
●This also works for nominal attributes (where the first value corresponds to “zero”)
0, 26, 0, 0, 0 ,0, 63, 0, 0, 0, “class A”
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, “class B”
{1 26, 6 63, 10 “class A”}
{3 42, 10 “class B”}
![Page 14: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/14.jpg)
14Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Nominal vs. ordinal
●Attribute “age” nominal
●Attribute “age” ordinal(e.g. “young” < “pre-presbyopic” < “presbyopic”)
If age = young and astigmatic = noand tear production rate = normalthen recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age pre-presbyopic and astigmatic = noand tear production rate = normalthen recommendation = soft
![Page 15: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/15.jpg)
15Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Missing values
Frequently indicated by out-of-range entriesTypes: unknown, unrecorded, irrelevantReasons:
● malfunctioning equipment● changes in experimental design● collation of different datasets● measurement not possible
●Missing value may have significance in itselfMay need to be “imputed” or estimated Instances or attributes may need to be removed
![Page 16: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/16.jpg)
16Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Inaccurate values
●Often data has not been collected for data analytics●Result: garbage in garbage out●Reasons for errors:
● Typographical errors in nominal edits needed● Measurement errors in numeric attributes
outliers need to be identified● Equipment failure regular calibrations
important● Errors may be deliberate (e.g. wrong zip codes)● Other problems: duplicate records, stale data
![Page 17: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall](https://reader035.vdocuments.net/reader035/viewer/2022071709/56649cf55503460f949c427f/html5/thumbnails/17.jpg)
17Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)
Getting to know the data
●Simple visualization tools are very useful
Nominal attributes: histogramsIs the distribution consistent with background knowledge?
Numeric attributes: graphsAny obvious outliers?
●2-D and 3-D plots show dependencies●Need to consult domain experts●Too much data to inspect? Take a sample!