Erroneous Distribution Data Identification Using Outlier Detection Techniques
W. Zhuang, Y. Zhang, J.F. Grassle
Rutgers, the State University of New Jersey, USA
Overview
Review of OBIS DQ issues
Review of existing DQ methods
Case study: detecting outliers in multidimensional data
Discussion and future directions
Data Quality (DQ)
DQ problems can be generated in every step of the data life cycle:
DQ problems (I)
Data gathering: instrument failures; false identifications; geo-referencing
Data storage: key metadata missing; erroneous data entry; database default values masquerading as real values
DQ problems (II)
Data delivery: data corruption due to encoding conversion
Data integration: duplicated records
Data retrieval: missing values
Data analysis/cleaning: inappropriate models used, etc.
DQ solving: a process-based approach
DQ solving is an essential component of data analysis and thus part of the data life cycle
A. It builds a foundation for analysis and modeling
B. It provides feedback to improve the whole data life cycle
C. It could lead to more DQ problems if not carefully executed
DQ solving methods
Harvest metadata close to data
Built-in integrity checks and double data entry
Model-based approach:
a) statistical
b) heuristic
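As an illustration of a built-in integrity check, the sketch below validates coordinate ranges and flags values that may be database defaults masquerading as real values. The field names (`lat`, `lon`, `id`) and the sentinel values are assumptions made for this example; they are not taken from OBIS.

```python
# Hypothetical sentinel values that a database might store in place of a missing number.
SUSPECT_DEFAULTS = {0.0, -9999.0, 9999.0}

def check_record(record):
    """Return a list of integrity problems found in one occurrence record."""
    problems = []
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        problems.append("missing coordinate")
        return problems
    if not -90.0 <= lat <= 90.0:
        problems.append("latitude out of range")
    if not -180.0 <= lon <= 180.0:
        problems.append("longitude out of range")
    # Both coordinates equal to a sentinel (for example 0, 0) is suspicious, not proof of error.
    if lat in SUSPECT_DEFAULTS and lon in SUSPECT_DEFAULTS:
        problems.append("coordinates look like database defaults")
    return problems

records = [
    {"id": 1, "lat": 40.5, "lon": -74.4},
    {"id": 2, "lat": 0.0, "lon": 0.0},
    {"id": 3, "lat": 95.0, "lon": 10.0},
    {"id": 4, "lat": None, "lon": 12.0},
]
report = {r["id"]: check_record(r) for r in records}
```

Records with a non-empty problem list are candidates for review rather than automatic deletion, since a point at (0, 0) can occasionally be genuine.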
OBIS DQ Study
Metadata-related problems
DQ on scientific names
Integrity checking
Redundant records detection
Outlier detection: a case study
Outliers sometimes represent erroneous data
We are examining data mining tools for detecting erroneous data points
DBSCAN: a clustering tool
DBSCAN is density-based in feature space
It deals with high-dimensional data
There is no need to specify the number of clusters
It identifies outliers during the clustering process
It is a fast algorithm and freely available
M. Ester, H.P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases
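To make the clustering and outlier steps concrete, here is a minimal pure-Python sketch of the DBSCAN idea. It is not the implementation used in the case study: the brute-force neighbor search, the Euclidean distance, the sample points, and the parameter values `eps=1.5` and `min_pts=3` are all assumptions chosen for illustration.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may be relabeled later as a border point
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)    # j is also core, so keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
outliers = [i for i, label in enumerate(labels) if label == NOISE]
```

The number of clusters is never passed in; it falls out of the density parameters, and any point left labeled `NOISE` is an outlier candidate. For real latitude and longitude data, plain Euclidean distance on degrees is a simplification, and a great-circle distance may be more appropriate.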
A diagram of DBSCAN
[Figure: core, border, and outlier points; neighborhood radius (Eps) = 1 unit, MinPts = 5]
Total points distribution
[Figure: map of the whole dataset; longitude (-180 to 180) vs. latitude (-90 to 90)]
Result from DBSCAN
[Figure: map of cluster points and outliers; longitude (-180 to 180) vs. latitude (-90 to 90)]
Limitation of the method
Geographical outliers may be used to identify erroneous points in survey data, but may not be suitable for museum collections or literature-based data records.
Other methods to identify erroneous distribution data? How about using environmental data as proxies?
Can we get some more information?
[Figure: map with legend dcsn, dcso, dosn, doso; longitude (-180 to 180) vs. latitude (-90 to 90)]
Limitations of using environmental variables
Risk of imposing a rigid model at the time of pre-processing
Risk of losing valuable outliers
Risk of circular logic in later analyses
Discussions
Why don’t you use more environmental variables?
Can you use DBSCAN on environmental variables directly?
Possible improvements
Define multiple methods as DQ components
Assign bootstrap weights
Present outlier candidates to experts
Update weights based on user feedback
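One way the four steps above could fit together is sketched below. The slides do not specify the weighting scheme, so everything here is an assumption: the detector names other than DBSCAN, the equal starting weights standing in for "bootstrap weights", and the multiplicative update rule with its `rate` parameter.

```python
def score_candidates(flags, weights):
    """flags: {detector: set of flagged record ids}. Returns {record id: weighted score}."""
    scores = {}
    for detector, flagged in flags.items():
        for rec_id in flagged:
            scores[rec_id] = scores.get(rec_id, 0.0) + weights[detector]
    return scores

def update_weights(flags, weights, verdicts, rate=0.5):
    """verdicts: {record id: True if the expert confirmed an error}.
    A detector is rewarded when it agrees with the expert and penalized otherwise."""
    new = dict(weights)
    for rec_id, is_error in verdicts.items():
        for detector, flagged in flags.items():
            agreed = (rec_id in flagged) == is_error
            new[detector] *= (1 + rate) if agreed else (1 - rate)
    total = sum(new.values())
    return {d: w / total for d, w in new.items()}    # renormalize so weights sum to 1

# Hypothetical DQ components and the records each one flagged.
flags = {
    "dbscan": {"r1", "r2"},
    "range_check": {"r2"},
    "environment": {"r3"},
}
weights = {d: 1 / len(flags) for d in flags}         # equal starting weights (assumption)
scores = score_candidates(flags, weights)
ranked = sorted(scores, key=scores.get, reverse=True)    # order shown to experts
weights = update_weights(flags, weights, {"r2": True, "r3": False})
```

In this sketch the expert's verdict stays decisive: the weights only change the order in which candidates are presented, not whether a record is kept or removed.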
Summary
Many data quality problems can arise during the whole data life cycle.
Preliminary checking can eliminate a lot of simple errors.
Expert knowledge should be integrated and be the decisive factor when it comes to DQ solving.
Data mining techniques may act as metal detectors so that experts can focus on a narrowed-down group of candidates.