measurements and data. topics types of data distance measurement data transformation forms of data...
DESCRIPTION
Types of Measurement Ordinal, – e.g., excellent=5, very good=4, good=3… Nominal – e.g., color, religion, profession – Need non-metric methods Ratio – e.g., weight – has concatenation property, two weights add to balance a third: 2+3 = 5 Interval – e.g., temperature, calendar timeTRANSCRIPT
![Page 1: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/1.jpg)
Measurements and Data
![Page 2: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/2.jpg)
Topics
• • • • •
Types of DataDistance MeasurementData TransformationForms of DataData Quality
![Page 3: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/3.jpg)
Types of Measurement• Ordinal,
– e.g., excellent=5, very good=4, good=3…• Nominal
– e.g., color, religion, profession– Need non-metric methods
• Ratio– e.g., weight– has concatenation property, two weights add to balance a
third: 2+3 = 5• Interval
– e.g., temperature, calendar time
![Page 4: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/4.jpg)
Examples of Metrics
• Euclidean Distance dE
– Standardized (divide by variance)
– Weighted dWE
• Minkowski measure– Manhattan Distance
• Mahanalobis Distance dM
– Use of Covariance• Binary data Distances
![Page 5: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/5.jpg)
Use of Covariance in Distance
• Similarities between cups
• Suppose we measure cup-height 100 timesand diameter only once– height will dominate although 99 of the height
measurements are not contributing anything• They are very highly correlated• To eliminate redundancy we need a data-driven method
– approach is to not only to standardize data in eachdirection but also to use covariance betweenvariables
![Page 6: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/6.jpg)
• A scalar value to measure how x and y vary together
€
Covariance between two Scalar Variables
• Large positive value– if large values of x tend to be associated with large values of y and small values of x
with small values of y• Large negative value
– if large values of x tend to be associated with small values of y
• With d variables can construct a d x d matrix of covariances
![Page 7: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/7.jpg)
Correlation CoefficientValue of Covariance is dependent upon ranges of x and y
Dependency is removed bydividing values of x by their standard deviationand values of y by their standard deviation
![Page 8: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/8.jpg)
Correlation MatrixHousing related variablesacross city suburbs (d=11)11 x 11 pixel image (White 1, Black -1)Columns 12-14 have values -1,0,1 forpixel intensity referenceRemaining represent corrrelation matrix
Reference for -1, 0,+1
Variables 3 and 4 are highly negativelycorrelated with Variable 2Variable 5 is positively correlated with Variable 11Variables 8 and 9 are highly correlated
![Page 9: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/9.jpg)
![Page 10: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/10.jpg)
Generalizing Euclidean Distance
Minkowski or Lλ metric
• λ = 2 gives the Euclidean metric• λ = 1 gives the Manhattan or City-block metric
• λ = ∞ yields
![Page 11: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/11.jpg)
Distance Measures for Binary Data• Most obvious measure is Hamming Distance normalized by number of bits
Proportion of variableson which objects have same value
• If we don’t care about irrelevant properties had by neither object we haveJaccard Coefficient
Example: two documentsdo not have certain terms
• Dice Coefficient extends this argument– If 00 matches are irrelevant then 10 and 01 matches should have half relevance
![Page 12: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/12.jpg)
Transforming the Data
Model depends on form of data
If Y is a function of X2 then we could use
quadratic function or choose U= X2 and use a linear fit
![Page 13: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/13.jpg)
V1is non-linearlyRelated to V2
V1
V3=1/V2is linearlyrelated to V1
V2
![Page 14: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/14.jpg)
Variance increases
Square root transformationkeeps the variance constant
![Page 15: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/15.jpg)
Forms of DataStandard Data (Data Matrix)
Multirelational Data
String• Sequence of symbols from a finite alphabet
Event Sequence• Sequence of pairs of the form {event,occurrence time}
![Page 16: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/16.jpg)
Name Department Age Salary
Department Budget Manager
Multirelational Data(multiple data matrices)
Payroll Database
Name
Department Table
Name
Can be combined together to form a data matrix with fieldsname, department-name, age, salary, budget, managerOr create as many rows as department-namesFlattening requires needless replication (Storage issues)
![Page 17: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/17.jpg)
Data Quality for Individual Measurements
• Data Mining Depends on Quality of data• Many interesting patterns discovered may
result from measurement inaccuracies.• Sources of error
– Errors in measurement– Carelessness– Instrumentation failure
![Page 18: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/18.jpg)
Precision and Accuracy• Precise Measurement
– Small variability (measured by variance)– Repeated measurements yield same value– Many digits of precision is not necessarily
accurate (results of calculations give many digits)
• Accurate– Not only small variability but close to true value
![Page 19: Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality](https://reader036.vdocuments.net/reader036/viewer/2022062600/5a4d1bb27f8b9ab0599cd13d/html5/thumbnails/19.jpg)
Data Quality for Collections of Data
• Collections of Data– Much of statistics is concerned with inference from
a sample to a population– How to infer things from a fraction about entire
population– Two sources of error:
• sample size and bias