DATA PREPARATION: Basic Definitions
Data Set (input): Collection of data objects and their attributes used as input for a machine learning scheme.
Data object (instance, record, case, sample, observation): An individual, independent data example of the concept to be learned, characterized by a number of attributes.
Attribute (feature): Property or characteristic of an object.
Model (concept): Pattern or description that is to be learned.
DATA PREPARATION: Attribute Types
Attribute value: Measurement of the quantity of that particular attribute.
Two basic attribute types: Qualitative and Quantitative.
Qualitative (categorical): Attributes that lack the properties of numbers.
Quantitative (numeric): Attributes that are represented by numbers and have the properties of numbers.
DATA PREPARATION: Attribute Types
Attribute types are further distinguished by the number of values they can take: Discrete versus continuous.
Discrete: A discrete attribute can have values from only a finite or countably infinite set of values. Examples: Male/female, ages
Continuous: A continuous attribute can have values from an uncountable set of values such as the real numbers. Examples: Temperature, weight, distance, time
DATA PREPARATION: Attribute Types
Nominal attribute: Qualitative names providing only enough information to distinguish objects from each other. No order or distance measure is implied.
Ordinal attribute: Qualitative names providing enough information to rank their order (Example: small, medium, large), but not enough to measure distance.
Interval attribute: Values are ordered, and differences between values are meaningful and measurable.
Ratio attribute: Both differences and ratios are meaningful and measurable.
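The four attribute types differ in which operations are meaningful. The following is a minimal sketch of that idea; all variable names and sample values are hypothetical and chosen only for illustration.

```python
# Hypothetical example values, one list per attribute type.
eye_color = ["brown", "blue", "brown", "green"]      # nominal
shirt_size = ["small", "large", "medium", "small"]   # ordinal
temp_celsius = [10.0, 20.0, 30.0]                    # interval
weight_kg = [40.0, 80.0]                             # ratio

# Nominal: only equality tests (and counts) are meaningful.
same_color = eye_color[0] == eye_color[2]

# Ordinal: an explicit rank makes ordering meaningful,
# but the gaps between ranks carry no distance information.
SIZE_RANK = {"small": 0, "medium": 1, "large": 2}
sorted_sizes = sorted(shirt_size, key=SIZE_RANK.get)

# Interval: differences are meaningful, ratios are not
# (20 degrees Celsius is not "twice as hot" as 10 degrees Celsius).
temp_difference = temp_celsius[1] - temp_celsius[0]

# Ratio: a true zero exists, so ratios are meaningful as well.
weight_ratio = weight_kg[1] / weight_kg[0]
```

Keeping the rank of an ordinal attribute in an explicit mapping, rather than relying on alphabetical order, avoids accidentally sorting "large" before "medium".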
DATA PREPARATION: Data Set Characteristics
Dimensionality: Number of attributes possessed by the data set instances.
Sparsity: Sparse data sets are those in which most attribute values of the objects are zero.
Resolution: The degree of discernible detail of an attribute value, that is, how finely an attribute is measured.
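Dimensionality and sparsity can be computed directly once a data set is stored as a list of rows. This is a small sketch under that assumption; the function names and the sample matrix are hypothetical.

```python
def dimensionality(matrix):
    """Number of attributes per instance (columns of the matrix)."""
    return len(matrix[0]) if matrix else 0


def sparsity(matrix):
    """Fraction of attribute values that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


# Hypothetical data set: 3 instances, 4 attributes, mostly zeros.
data = [
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 2],
]

n_attributes = dimensionality(data)   # 4
zero_fraction = sparsity(data)        # 9 of 12 values are zero -> 0.75
```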
DATA PREPARATION: Data Sets
Sources of Data Sets: Databases; Web sites; Streaming data.
DATA PREPARATION: Data Sets
Data Input Formats: Data records; Text; Graph-based; Data matrix; Ordered data; Spatial data; Visual inputs; Video inputs.
DATA PREPARATION: Record Data
Record Data: Data that consists of a collection of records, each of which consists of a fixed set of attributes.
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
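One way to hold record data in a program is as a list of dictionaries, one per record, all sharing the same keys. The sketch below encodes the table above that way; the key names are an assumption made for this example, and incomes are stored in thousands.

```python
# Key names are hypothetical; incomes are in thousands (125K -> 125).
COLUMNS = ("tid", "refund", "marital_status", "taxable_income_k", "cheat")
ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Every record has the same fixed set of attributes.
records = [dict(zip(COLUMNS, row)) for row in ROWS]

# Example query: which records are labeled Cheat = Yes?
cheater_tids = [r["tid"] for r in records if r["cheat"] == "Yes"]
```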
DATA PREPARATION: Data Matrix
Data Matrix: If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
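Because each row of a data matrix is a point in n-dimensional space, geometric operations such as distance apply directly. The sketch below assumes the two rows of the example are read in the column order Projection of x Load, Projection of y Load, Distance, Load, Thickness, and uses `math.dist` (Python 3.8 or later) for the Euclidean distance.

```python
import math

# m = 2 objects (rows), n = 5 numeric attributes (columns).
# Column order is an assumption based on the example table.
matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(matrix)
n = len(matrix[0])

# Euclidean distance between the two objects, treated as points.
distance = math.dist(matrix[0], matrix[1])
```

In practice the attributes would usually be rescaled first, since a distance computed on raw values is dominated by the attributes with the largest numeric range.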
DATA PREPARATION: Document Data
Document Data: Each document becomes a 'term' vector. Each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
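A term vector of this kind can be built by counting words against a fixed vocabulary. The following is a minimal sketch; the function name, the whitespace tokenization, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text.

    Tokenization is deliberately naive: lowercase, split on whitespace.
    """
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


vocabulary = ["team", "coach", "play", "ball", "score"]
document = "Team play matters and the team will play and play"

vector = term_vector(document, vocabulary)
```

Terms outside the vocabulary are simply ignored, and terms that never occur get a count of zero, which is why document data sets tend to be sparse.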
DATA PREPARATION: Transaction Data
Transaction Data: A special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
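Since each transaction is a set of items, Python sets are a natural representation. The sketch below stores the five transactions above and counts how many contain a given item set; the helper name `support_count` is hypothetical.

```python
# Transactions keyed by TID; each value is the set of purchased items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


diaper_milk = support_count({"Diaper", "Milk"}, transactions)
beer_bread = support_count({"Beer", "Bread"}, transactions)
```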
DATA PREPARATION: Graph Data
Graph Data (figure omitted: example graph)
DATA PREPARATION: Chemical Data
Chemical Data: Benzene molecule, C6H6
DATA PREPARATION: Ordered Data
Ordered Data: Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCCCGCAGGGCCCGCCCCGCGCCGTCGAGAAGGGCCCGCCTGGCGGGCGGGGGGAGGCGGGGCCGCCCGAGCCCAACCGAGTCCGACCAGGTGCCCCCTCTGCTCGGCCTAGACCTGAGCTCATTAGGCGGCAGCGGACAGGCCAAGTAGAACACGCGAAGCGCTGGGCTGCCTGCTGCGACCAGGG
DATA PREPARATION: Ordered Data Spatio-Temporal Data
Surface Air Temperature over North America, January--February 2014 https://www.youtube.com/watch?v=VCCkyOTIS3o
DATA PREPARATION: ARFF format
ARFF: Attribute-Relation File Format.
See Weka Documentation: http://weka.wikispaces.com/ARFF
XRFF (eXtensible attribute-Relation File Format): An XML-based extension of the ARFF format.
See Weka Documentation: http://weka.wikispaces.com/XRFF
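An ARFF file has a header (`@relation` and one `@attribute` line per attribute, with either a type such as `numeric` or a list of nominal values in braces) followed by an `@data` section of comma-separated rows. The sketch below writes such text by hand; the function name `to_arff`, the relation name, and the attribute names are hypothetical, and quoting of values that contain spaces or commas is not handled.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind) pairs, where kind is either a type
    string such as "numeric" or a list of nominal values.
    rows: list of value tuples in attribute order.
    Note: values needing quotes (spaces, commas) are not handled here.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "tax",
    [
        ("refund", ["Yes", "No"]),
        ("marital_status", ["Single", "Married", "Divorced"]),
        ("taxable_income", "numeric"),
        ("cheat", ["Yes", "No"]),
    ],
    [("Yes", "Single", 125, "No"), ("No", "Married", 100, "No")],
)
```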
DATA PREPARATION: Data Conversion
Weka supports other data input types via its file converters: C4.5; CSV; LibSVM; SVM light; Binary serialized instances.
DATA PREPARATION: Data Conversion
What if the desired data does not fit any of Weka's input types? Translate manually (only works for small data sets), or write your own specialized script. The problem is not unique to Weka.
Data conversion is an often underappreciated problem: Data collection; Algorithm requirement mismatch.
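A specialized conversion script is often only a few lines long. As one hedged illustration, the sketch below turns numeric CSV text into the sparse LIBSVM text layout (`label index:value ...`, with 1-based indices and zero values omitted); the function name and the sample data are hypothetical, and every feature column is assumed to be numeric.

```python
import csv
import io


def csv_to_libsvm(csv_text, label_column):
    """Convert numeric CSV text to LIBSVM-style sparse lines.

    Assumes a header row and purely numeric feature columns.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    feature_names = [n for n in reader.fieldnames if n != label_column]
    out = []
    for row in reader:
        parts = [row[label_column]]
        for index, name in enumerate(feature_names, start=1):
            value = float(row[name])
            if value != 0:  # sparse format: zeros are left out
                parts.append(f"{index}:{value:g}")
        out.append(" ".join(parts))
    return "\n".join(out)


sample_csv = "label,x1,x2,x3\n1,0,2.5,0\n-1,1,0,3\n"
libsvm_text = csv_to_libsvm(sample_csv, "label")
```

For larger data sets the same loop can read from and write to files line by line instead of building strings in memory.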
DATA PREPARATION: Data Quality
Measurement errors: Noise; Artifacts; Equipment limitations.
Data collection procedure errors: Human error; Precision; Bias; Accuracy.
DATA PREPARATION: Data Quality
Handling Data: Outliers; Missing or incomplete values (Estimate? Ignore?); Inaccurate values.
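The two options for missing values, ignoring them or estimating them, can be contrasted in a few lines. In this sketch the sample values are hypothetical, `None` marks a missing value, and the estimate is simply the mean of the observed values, which is only one of many possible strategies.

```python
from statistics import mean

# Hypothetical attribute values; None marks a missing measurement.
incomes = [125.0, None, 70.0, 120.0, None, 60.0]

# Option 1, ignore: drop the missing values (fewer observations remain).
ignored = [value for value in incomes if value is not None]

# Option 2, estimate: fill each gap with the mean of the observed values.
fill_value = mean(ignored)
estimated = [fill_value if value is None else value for value in incomes]
```

Ignoring shrinks the data set, while estimating keeps every record at the cost of inserting values that were never measured.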
DATA PREPARATION: Data Quality
Multiple data sources: Inconsistent data (how to handle?)
Duplicate data
Age of data
Data relevance
DATA PREPARATION
KNOW YOUR DATA
NEXT STEP
Data Preprocessing