data preparation: basic definitions

Post on 07-Jan-2016

DATA PREPARATION: Basic Definitions

Data Set (input): Collection of data objects and their attributes used as input for a machine learning scheme.

Data object (instance, record, case, sample, observation): An individual, independent data example of the concept to be learned, characterized by a number of attributes.

Attribute (feature): Property or characteristic of an object.

Model (concept): Pattern or description that is to be learned.

DATA PREPARATION: Attribute Types

Attribute value: Measurement of the quantity of that particular attribute.

Two basic attribute types: Qualitative and Quantitative.

Qualitative (categorical): Attributes that lack the properties of numbers.

Quantitative (numeric): Attributes that are represented by numbers and have the properties of numbers.

DATA PREPARATION: Attribute Types

Attribute types are further distinguished by the number of values they can take: discrete versus continuous.

Discrete: A discrete attribute can have values from only a finite or countably infinite set of values. Examples: Male/female, ages

Continuous: A continuous attribute can have values from an uncountable set of values such as the real numbers. Examples: Temperature, weight, distance, time

DATA PREPARATION: Attribute Types

Nominal attribute: Qualitative names providing only enough information to distinguish objects from one another. No order or distance measure is implied.

Ordinal attribute: Qualitative names providing enough information to rank their order (Example: small, medium, large), but not enough to measure distance.

Interval attribute: Ordered and value differences are meaningful and measurable.

Ratio attribute: Both differences and ratios are meaningful and measurable.
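One way to see the difference between the four types is to look at which operations make sense for each. The following is a minimal sketch; the sample values and the `SIZE_ORDER` mapping are assumptions made up for illustration and do not come from the slides.

```python
# Nominal: only equality tests are meaningful.
marital_a, marital_b = "Single", "Married"
same_status = (marital_a == marital_b)               # False

# Ordinal: values can be ranked, but rank differences are not distances.
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}   # hypothetical encoding of the order
sizes = ["large", "small", "medium"]
ranked = sorted(sizes, key=SIZE_ORDER.__getitem__)   # ['small', 'medium', 'large']

# Interval: differences are meaningful, ratios are not (the zero point is arbitrary).
temp_morning_c, temp_noon_c = 10.0, 20.0
warming = temp_noon_c - temp_morning_c               # 10.0 degrees Celsius

# Ratio: both differences and ratios are meaningful (there is a true zero).
weight_a_kg, weight_b_kg = 40.0, 80.0
times_heavier = weight_b_kg / weight_a_kg            # 2.0
```

Note that 20 degrees Celsius is not "twice as warm" as 10 degrees Celsius, which is why temperature in Celsius is treated here as an interval attribute rather than a ratio attribute.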

DATA PREPARATION: Data Set Characteristics

Dimensionality: Number of attributes possessed by the data set instances.

Sparsity: Sparse data sets are those in which most object attributes are zero.

Resolution: The degree of discernible detail in an attribute value, that is, how finely an attribute is measured.
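Dimensionality and sparsity can both be computed directly from a data set stored as rows of numbers. The sketch below assumes every row has the same length; the helper names `dimensionality` and `sparsity` and the sample rows are illustrative, not part of the slides.

```python
def dimensionality(data):
    """Number of attributes per instance (all rows are assumed to have equal length)."""
    return len(data[0])


def sparsity(data):
    """Fraction of attribute values that are zero."""
    total = sum(len(row) for row in data)
    zeros = sum(1 for row in data for value in row if value == 0)
    return zeros / total


# Illustrative rows in the style of term-count data.
rows = [
    [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
]

n_attributes = dimensionality(rows)   # 10
zero_fraction = sparsity(rows)        # 0.5
```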

DATA PREPARATION: Data Sets

Sources of Data Sets: databases, web sites, streaming data.

DATA PREPARATION: Data Sets

Data Input Formats: data records, text, graph-based, data matrix, ordered data, spatial data, visual inputs, video inputs.

DATA PREPARATION: Record Data

Record Data: Data that consists of a collection of records, each of which consists of a fixed set of attributes.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
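Record data of this kind is commonly stored as delimited text and read into one dictionary per record. The sketch below writes the first three records of the table as CSV and parses them with Python's standard `csv` module. The helper `parse_income` and the reading of "K" as thousands are assumptions.

```python
import csv
import io

# The first three records of the table above, written as CSV text.
RAW = """Tid,Refund,Marital Status,Taxable Income,Cheat
1,Yes,Single,125K,No
2,No,Married,100K,No
3,No,Single,70K,No
"""


def parse_income(text):
    """Convert a value such as '125K' to 125000 (assumes K means thousands)."""
    return int(text.rstrip("K")) * 1000


records = []
for row in csv.DictReader(io.StringIO(RAW)):
    row["Tid"] = int(row["Tid"])
    row["Taxable Income"] = parse_income(row["Taxable Income"])
    records.append(row)

# Every record carries the same fixed set of attributes.
attribute_names = list(records[0].keys())
```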

DATA PREPARATION: Data Matrix

Data Matrix: If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.

Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

Thickness  Load  Distance  Projection of y load  Projection of x load
1.1        2.2   16.22     6.25                  12.65
1.2        2.7   15.22     5.27                  10.23
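Viewing each object as a point makes geometric operations available. The sketch below stores the slide's two example objects as a list of rows and computes the Euclidean distance between them. Choosing Euclidean distance is an assumption for illustration, since the slides do not name a distance measure.

```python
import math

# Column names and values as recovered from the slide's example matrix.
columns = ["Thickness", "Load", "Distance", "Projection of y load", "Projection of x load"]
matrix = [
    [1.1, 2.2, 16.22, 6.25, 12.65],
    [1.2, 2.7, 15.22, 5.27, 10.23],
]

m = len(matrix)       # number of objects (rows)
n = len(matrix[0])    # number of attributes (columns)

# Treat the two rows as points in n-dimensional space (math.dist needs Python 3.8+).
distance = math.dist(matrix[0], matrix[1])
```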

DATA PREPARATION: Document Data

Document Data: Each document becomes a 'term' vector. Each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
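A term vector can be built by counting words against a fixed vocabulary. The following is a minimal sketch; the function `term_vector`, the shortened vocabulary, and the sample sentence are made up for illustration, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


vocabulary = ["team", "coach", "play", "ball", "score"]
document = "The coach told the team to play the ball and the team did play"  # made-up text

vector = term_vector(document, vocabulary)   # [2, 1, 2, 1, 0]
```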

DATA PREPARATION: Transaction Data

A special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store: the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID Items

1 Bread, Coke, Milk

2 Beer, Bread

3 Beer, Coke, Diaper, Milk

4 Beer, Bread, Diaper, Milk

5 Coke, Diaper, Milk
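Because a transaction is a set of items rather than a fixed list of attributes, a set per transaction is a natural representation. The sketch below encodes the five transactions from the table and shows two typical queries; the variable names are illustrative.

```python
from collections import Counter

# The five transactions from the table above, each stored as a set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count in how many transactions each item appears.
item_counts = Counter(item for items in transactions.values() for item in items)

# Find the transactions that contain both Diaper and Milk.
both = [tid for tid, items in transactions.items() if {"Diaper", "Milk"} <= items]
```

Here `item_counts["Milk"]` is 4 and `both` is `[3, 4, 5]`.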

DATA PREPARATION: Graph Data

Graph Data [Figure: example graph with numeric labels]

DATA PREPARATION: Chemical Data

Chemical Data: Benzene molecule, C6H6.

DATA PREPARATION: Ordered Data

Ordered Data: Genomic sequence data.

GGTTCCGCCTTCAGCCCCGCGCCCGCAGGGCCCGCCCCGCGCCGTCGAGAAGGGCCCGCCTGGCGGGCGGGGGGAGGCGGGGCCGCCCGAGCCCAACCGAGTCCGACCAGGTGCCCCCTCTGCTCGGCCTAGACCTGAGCTCATTAGGCGGCAGCGGACAGGCCAAGTAGAACACGCGAAGCGCTGGGCTGCCTGCTGCGACCAGGG

DATA PREPARATION: Ordered Data Spatio-Temporal Data

Surface Air Temperature over North America, January-February 2014: https://www.youtube.com/watch?v=VCCkyOTIS3o

DATA PREPARATION: ARFF format

ARFF: Attribute-Relation File Format.

See Weka Documentation: http://weka.wikispaces.com/ARFF

XRFF (eXtensible attribute-Relation File Format): An XML-based extension of the ARFF format.

See Weka Documentation: http://weka.wikispaces.com/XRFF
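An ARFF file is plain text: a header with `@relation` and `@attribute` declarations, followed by `@data` and comma-separated rows, with `%` starting a comment. The sketch below shows a hypothetical file built from the record-data example and a deliberately small reader. The function `read_simple_arff` is not part of Weka and handles only unquoted names, with no sparse rows and no missing values.

```python
ARFF_TEXT = """% Hypothetical file built from the record-data example
@relation tax
@attribute refund {Yes,No}
@attribute marital_status {Single,Married,Divorced}
@attribute taxable_income numeric
@attribute cheat {Yes,No}
@data
Yes,Single,125000,No
No,Married,100000,No
"""


def read_simple_arff(text):
    """Parse a small ARFF text into (attributes, rows); supports a subset of ARFF only."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):       # skip blank lines and comments
            continue
        if in_data:
            rows.append(line.split(","))
        elif line.lower().startswith("@attribute"):
            _, name, type_spec = line.split(None, 2)
            attributes.append((name, type_spec))
        elif line.lower().startswith("@data"):
            in_data = True                         # everything after this is data
    return attributes, rows


attributes, rows = read_simple_arff(ARFF_TEXT)
```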

DATA PREPARATION: Data Conversion

Weka supports other data input types via converters: C4.5, CSV, LibSVM, SVM light, binary serialized instances.

DATA PREPARATION: Data Conversion

What if the desired data does not fit any of Weka's input types? Translate manually (only works for small data sets). Write your own specialized script. The problem is not unique to Weka.

Data conversion is an often underappreciated problem: data collection, algorithm requirement mismatch.
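As an example of a specialized conversion script, the sketch below turns CSV text into ARFF text. The function `csv_to_arff` is hypothetical, and its type rule is an assumption: a column is declared numeric if every value parses as a number and nominal otherwise. It also assumes names and values contain no spaces, commas, or quotes.

```python
import csv
import io


def is_number(value):
    """Return True if the text parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row to ARFF text (simplified type inference)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    lines = [f"@relation {relation}"]
    for index, name in enumerate(header):
        column = [row[index] for row in data]
        if all(is_number(value) for value in column):
            lines.append(f"@attribute {name} numeric")
        else:
            labels = ",".join(sorted(set(column)))   # nominal: list the distinct labels
            lines.append(f"@attribute {name} {{{labels}}}")
    lines.append("@data")
    lines.extend(",".join(row) for row in data)
    return "\n".join(lines) + "\n"


arff = csv_to_arff("refund,income,cheat\nYes,125000,No\nNo,95000,Yes\n", "tax")
```

A script like this works for small, clean files. Quoting, missing values, and date attributes would each need extra handling.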

DATA PREPARATION: Data Quality

Measurement errors: noise, artifacts, equipment limitations.

Data collection procedure errors: human error, precision, bias, accuracy.

DATA PREPARATION: Data Quality

Handling Data: outliers; missing or incomplete values (estimate? ignore?); inaccurate values.
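The two options for missing values, and one possible outlier check, can be sketched as follows. The income values (in thousands) follow the record-data example with one value replaced by `None` to simulate a missing entry. Mean imputation and the rule of three median absolute deviations are common heuristics chosen here for illustration and are not prescribed by the slides.

```python
from statistics import mean, median

# Taxable income in thousands; None marks a (simulated) missing value.
incomes = [125, 100, 70, None, 95, 60, 220, 85, 75, 90]

# Option 1: ignore instances with a missing value.
observed = [x for x in incomes if x is not None]

# Option 2: estimate the missing value, here with the mean of the observed values.
fill = mean(observed)
imputed = [fill if x is None else x for x in incomes]

# Flag possible outliers: more than 3 median absolute deviations from the median.
center = median(observed)
mad = median(abs(x - center) for x in observed)
outliers = [x for x in observed if abs(x - center) > 3 * mad]   # [220]
```

Whether a flagged value such as 220 is an error or a legitimate extreme case is a judgment that depends on the data source.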

DATA PREPARATION: Data Quality

Multiple data sources: inconsistent data (how to handle?)

Duplicate data

Age of data

Data relevance
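When records arrive from several sources, duplicates and inconsistencies can be detected by comparing records that share a key. The sketch below uses two hypothetical sources keyed by `Tid`. It keeps the first record seen, drops exact duplicates, and only reports conflicts, because the right resolution policy depends on the application.

```python
# Two hypothetical sources describing the same customers, keyed by "Tid".
source_a = [
    {"Tid": 1, "Marital Status": "Single"},
    {"Tid": 2, "Marital Status": "Married"},
]
source_b = [
    {"Tid": 2, "Marital Status": "Married"},    # exact duplicate of a record in source_a
    {"Tid": 1, "Marital Status": "Divorced"},   # conflicts with source_a
]

merged = {}       # Tid -> first record seen for that Tid
conflicts = []    # Tids whose records disagree across sources

for record in source_a + source_b:
    tid = record["Tid"]
    if tid not in merged:
        merged[tid] = record
    elif merged[tid] != record:
        conflicts.append(tid)    # inconsistent data: needs a resolution policy
    # identical records are duplicates and are dropped
```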

DATA PREPARATION

KNOW YOUR DATA

NEXT STEP

Data Preprocessing
