Erroneous Distribution Data Identification Using Outlier Detection Techniques

W. Zhuang, Y. Zhang, J.F. Grassle

Rutgers, the State University of New Jersey, USA

Overview

Review of OBIS DQ issues
Review of existing DQ methods
Case study: detecting outliers in multidimensional data
Discussion and future directions

Data Quality (DQ)

DQ problems can be generated in every step of the data life cycle:

DQ problems (I)

Data gathering: instrument failures; false identifications; geo-referencing
Data storage: key metadata missing; erroneous data entry; database default values masquerading as real values

DQ problems (II)

Data delivery: data corruption due to encoding conversion
Data integration: duplicated records
Data retrieval: missing values
Data analysis/cleaning: inappropriate models used, etc.
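
As an illustration of the duplicated-records problem that arises during data integration, here is a minimal sketch that groups records by a normalized key. The field names (`scientific_name`, `latitude`, `longitude`, `date`), the rounding precision, and the sample records are assumptions made for this example and are not taken from the OBIS data.

```python
from collections import defaultdict

def find_duplicates(records,
                    key_fields=("scientific_name", "latitude", "longitude", "date"),
                    precision=4):
    """Return lists of record indices that share the same normalized key."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        key = []
        for field in key_fields:
            value = rec.get(field)
            if isinstance(value, str):
                # Collapse whitespace and ignore case so trivial variants match.
                value = " ".join(value.split()).lower()
            elif isinstance(value, float):
                # Round coordinates so tiny numeric differences do not hide duplicates.
                value = round(value, precision)
            key.append(value)
        groups[tuple(key)].append(idx)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

# Hypothetical records for illustration only.
records = [
    {"scientific_name": "Gadus morhua", "latitude": 44.5,
     "longitude": -63.2, "date": "2001-06-01"},
    {"scientific_name": "gadus  morhua ", "latitude": 44.50001,
     "longitude": -63.2, "date": "2001-06-01"},
    {"scientific_name": "Homarus americanus", "latitude": 43.1,
     "longitude": -65.0, "date": "2001-06-02"},
]
duplicates = find_duplicates(records)
```

Rounding is a crude matching rule: two values that fall on either side of a rounding boundary will not be grouped, so candidates found this way are best treated as suggestions for review.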

DQ solving: a process-based approach

DQ solving is an essential component of data analysis and thus part of the data life cycle

A. It builds a foundation for analysis and modeling
B. It provides feedback to improve the whole data life cycle
C. It could lead to more DQ problems if not carefully executed

DQ solving methods

Harvest metadata close to data
Built-in integrity check and double data entry
Model-based approach:
a) statistical
b) heuristic
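
The integrity-check and model-based ideas above can be sketched as follows. This is a hypothetical example, not the checks used in the OBIS study: the record fields, the (0, 0) default-value rule, and the depth values are assumptions, and the statistical rule is the commonly used modified z-score based on the median absolute deviation, with its conventional constants.

```python
from statistics import median

def integrity_problems(record):
    """Heuristic checks: return a list of rule violations for one record."""
    problems = []
    lat, lon = record.get("latitude"), record.get("longitude")
    if lat is None or lon is None:
        problems.append("missing coordinates")
    else:
        if not -90 <= lat <= 90:
            problems.append("latitude out of range")
        if not -180 <= lon <= 180:
            problems.append("longitude out of range")
        if lat == 0 and lon == 0:
            # A database default may be masquerading as a real position.
            problems.append("possible default value (0, 0)")
    return problems

def mad_outliers(values, threshold=3.5):
    """Statistical check: indices whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical values for illustration only.
bad_latitude = integrity_problems({"latitude": 95.0, "longitude": 10.0})
zero_point = integrity_problems({"latitude": 0.0, "longitude": 0.0})
depth_flags = mad_outliers([10, 12, 11, 13, 12, 11, 500])
```

Both functions only flag candidates; whether a flagged record is actually wrong still has to be decided by someone who knows the data.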

OBIS DQ Study

Metadata-related problems
DQ on scientific names
Integrity checking
Redundant records detection

Outlier detection: a case study

Outliers sometimes represent erroneous data

We are examining data mining tools for detecting erroneous data points

DBSCAN: a clustering tool

DBSCAN is density-based in feature space
It deals with high dimensional data
There is no need to specify cluster numbers
It identifies outliers during the clustering process
It is a fast algorithm and freely available
M. Ester, H.P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases

A diagram of DBSCAN

[Figure: points labeled Core, Border, and Outlier; Eps = 1 unit, MinPts = 5]
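
The three point types in the diagram can be reproduced with a minimal sketch. It assumes the radius in the diagram is DBSCAN's Eps parameter (1 unit) with MinPts = 5, counts a point as part of its own neighborhood, and uses made-up coordinates. It only classifies points; a full DBSCAN run would also link core points into clusters.

```python
from math import dist

def classify_points(points, eps=1.0, min_pts=5):
    """Label each point 'core', 'border', or 'outlier' (DBSCAN point types)."""
    # Neighborhood of each point: indices of all points within eps,
    # the point itself included.
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    # Core points have at least min_pts points in their neighborhood.
    core = {i for i, nbrs in enumerate(neighbors) if len(nbrs) >= min_pts}
    labels = []
    for i, nbrs in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            # Not dense enough itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("outlier")
    return labels

# Illustrative data (not from the study): a tight group of five points,
# one point on its edge, and one point far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5),
          (1.4, 0.0), (10.0, 10.0)]
labels = classify_points(points)
```

With these points the five grouped points come out as core, the point at (1.4, 0.0) as border, and the distant point as an outlier. The pairwise distance loop is quadratic in the number of points, so it is suitable only for small examples.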

Total points distribution

[Figure: map of the whole dataset; vertical axis -90 to 90, horizontal axis -180 to 180]

Result from DBSCAN

[Figure: map of cluster points and outliers; vertical axis -90 to 90, horizontal axis -180 to 180]

Limitation of the method

Geographical outliers may be used to identify erroneous points in survey data, but may not be good for museum collections or literature-based data records.

Other methods to identify erroneous distribution data? How about using environmental data as proxies?

Can we get some more information?

[Figure: map with legend entries dcsn, dcso, dosn, doso; vertical axis -90 to 90, horizontal axis -180 to 180]

Limitations of using environmental variables

Risk of imposing a rigid model at the time of pre-processing
Risk of losing valuable outliers
Risk of circular logic in later analyses

Discussions

Why don’t you use more environmental variables?

Can you use DBSCAN on environmental variables directly?

Possible improvements

Define multiple methods as DQ components
Assign bootstrap weights
Present outlier candidates to experts
Update weights based on user feedback
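
One way these four steps might fit together is sketched below. It is a hypothetical design, not the authors' implementation: the two detectors, the equal starting weights, the record fields, and the halving rule applied after expert feedback are all assumptions chosen for illustration.

```python
def score(record, detectors, weights):
    """Weighted fraction of detectors that flag the record as suspicious."""
    total = sum(weights.values())
    flagged = sum(weights[name] for name, fn in detectors.items() if fn(record))
    return flagged / total

def update_weights(record, is_error, detectors, weights, rate=0.5):
    """Shrink the weight of every detector that disagreed with the expert."""
    for name, fn in detectors.items():
        if fn(record) != is_error:
            weights[name] *= rate
    return weights

# Hypothetical DQ components, each returning True when a record looks wrong.
detectors = {
    "range": lambda r: not -90 <= r["latitude"] <= 90,
    "zero": lambda r: r["latitude"] == 0 and r["longitude"] == 0,
}
weights = {"range": 1.0, "zero": 1.0}  # assumed equal starting weights

record = {"latitude": 0.0, "longitude": 0.0}
before = score(record, detectors, weights)  # only "zero" flags this record

# Suppose an expert reviews the candidate and confirms it is a valid record.
update_weights(record, is_error=False, detectors=detectors, weights=weights)
after = score(record, detectors, weights)
```

After the expert's feedback the "zero" detector carries less weight, so the same record scores lower the next time candidates are ranked. The expert's verdict remains the deciding input; the weights only change the order in which candidates are presented.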

Summary

Many data quality problems can arise during the whole data life cycle.

Preliminary checking can eliminate a lot of simple errors.

Expert knowledge should be integrated and be the decisive factor when it comes to DQ solving.

Data mining techniques may act as metal detectors so that experts can focus on a narrowed-down group of candidates.
