2014-may-07. what is the problem? what have others done? what is our solution? does it work? outline...

2014-May-07

Upload: maud-melton

Post on 16-Jan-2016


TRANSCRIPT

Page 1: 2014-May-07. What is the problem? What have others done? What is our solution? Does it work? Outline 2

2014-May-07

Page 2

What is the problem?

What have others done?

What is our solution?

Does it work?

Outline

Page 3

What is the problem?

• Linked Open Data (LOD):
  ▫ Realizing the Semantic Web by interlinking existing but dispersed data

• Main components of LOD:
  ▫ URIs to identify things
  ▫ RDF to describe data
  ▫ HTTP to access data

Page 4

Datasets: 295
Triples: over 30,000,000,000 (30 B)
Links: over 500,000,000 (500 M)

What is the problem?

Page 5

Inclusion Criteria for publishing and interlinking datasets into LOD cloud

• resolvable http/https URIs

• Presented in one of the standard formats of Semantic Web (RDF, RDFa, RDF/XML, Turtle, N-Triples)

• Contains at least 1000 triples

• Connected via at least 50 RDF links to the existing datasets of LOD

• Accessible via RDF crawling, RDF dump, or SPARQL endpoint

Is the dataset ready to publish?
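The inclusion criteria above are mechanical enough to check in code. The following is a minimal sketch, assuming a hypothetical `DatasetSummary` record whose field names are invented for illustration; the thresholds (1000 triples, 50 links) and the accepted formats come from the list above. It only inspects the URI scheme and does not test whether a URI actually resolves.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Thresholds and accepted values taken from the inclusion criteria above.
MIN_TRIPLES = 1000
MIN_LINKS = 50
ACCEPTED_FORMATS = {"RDF", "RDFa", "RDF/XML", "Turtle", "N-Triples"}
ACCESS_METHODS = {"crawling", "dump", "sparql"}


@dataclass
class DatasetSummary:
    """Hypothetical summary record; the field names are illustrative only."""
    sample_uris: list
    data_format: str
    triple_count: int
    link_count: int
    access_methods: set


def unmet_criteria(ds):
    """Return a list of the inclusion criteria that the summary fails."""
    problems = []
    # Only the scheme is checked; real resolvability would need an HTTP request.
    if not all(urlparse(uri).scheme in ("http", "https") for uri in ds.sample_uris):
        problems.append("URIs must be http/https")
    if ds.data_format not in ACCEPTED_FORMATS:
        problems.append("format is not a standard Semantic Web format")
    if ds.triple_count < MIN_TRIPLES:
        problems.append("fewer than 1000 triples")
    if ds.link_count < MIN_LINKS:
        problems.append("fewer than 50 RDF links to existing LOD datasets")
    if not (ds.access_methods & ACCESS_METHODS):
        problems.append("no crawling, dump, or SPARQL access")
    return problems


# Made-up candidate dataset used only to exercise the check.
candidate = DatasetSummary(
    sample_uris=["http://example.org/resource/1"],
    data_format="Turtle",
    triple_count=4236,
    link_count=12,
    access_methods={"dump"},
)
problems = unmet_criteria(candidate)
```

A dataset passes when `unmet_criteria` returns an empty list. Note that these criteria say nothing about data quality, which is the gap the talk goes on to address.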

What is the problem?

Page 6

Idea of the LOD: Publishing first, improving later

Results in: quality problems in the published datasets

Missing link: data quality evaluation before release

What is the problem?

Page 7

Data quality in the context of LOD

Validators
• General validators
• Parsing and syntax
• Accessibility / dereferenceability

Quality assessment of published data
• Classifying quality problems of LOD
• Using metadata for quality assessment
• Filtering poor-quality data (WIQA)
• Semantic annotation using ontologies

What have others done?

Page 8

Limitations of related works:

• Syntax validation, not quality evaluation
• Not scalable
• Not fully automated
• Evaluation only after publishing

What have others done?

Page 9

What is our solution?

Proposing a set of metrics for inherent quality assessment of datasets before interlinking them to the LOD cloud

Page 10

1. Selecting Inherent Quality Dimensions
2. Proposing Metrics
3. Developing a Quality Model
4. Theoretical Validation
5. Empirical Evaluation
6. Quality Prediction

What is our solution?

Page 11

Studying data quality models

Defining inherent quality of LOD

Selecting the basic model (ISO-25012)

Mapping quality dimensions of ISO to LOD

1. Selecting Inherent Quality Dimensions

Page 12

Inherent Quality of LOD

• Interlinking
• Completeness
• Semantic Accuracy
• Syntactic Accuracy
• Uniqueness
• Consistency

1. Selecting Inherent Quality Dimensions

Page 13

Defining metrics using GQM (Goal-Question-Metric)

Implementing an automated tool

Formal definition

2. Proposing Metrics

Example:
Goal: Assessment of the consistency of a dataset in the context of LOD
Question: What is the degree of conflict in the context of data values?
Metric: The number of functional properties with inconsistent values
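The example metric can be illustrated with a short sketch. This is not the authors' tool: it assumes triples are plain `(subject, property, object)` tuples, that the set of functional properties is supplied by the caller, and that two values conflict whenever their strings differ. The `ex:` names are made up.

```python
from collections import defaultdict


def inconsistent_functional_properties(triples, functional_properties):
    """Count functional properties that have conflicting values.

    A functional property may have at most one value per subject, so a
    subject carrying two or more distinct values is treated as a conflict.
    """
    values = defaultdict(set)  # (subject, property) -> distinct objects seen
    for subject, prop, obj in triples:
        if prop in functional_properties:
            values[(subject, prop)].add(obj)
    # A property is counted once, however many subjects conflict on it.
    conflicting = {prop for (_, prop), objs in values.items() if len(objs) > 1}
    return len(conflicting)


# Made-up data: ex:birthDate is declared functional, ex:knows is not.
triples = [
    ("ex:alice", "ex:birthDate", "1980-01-01"),
    ("ex:alice", "ex:birthDate", "1981-05-05"),  # conflicting value
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:carol"),        # not functional, so no conflict
    ("ex:bob", "ex:birthDate", "1975-03-03"),
]
count = inconsistent_functional_properties(triples, {"ex:birthDate"})
```

Comparing values as strings is a simplification; a real implementation would have to decide how to treat equivalent literals and resources linked by `owl:sameAs`.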

Page 14

LODQM: Linked Open Data Quality Model

• 6 quality dimensions
• 32 metrics

3. Developing LODQM

Page 15

Using a theoretical measurement framework

Identifying properties of desirable metrics

Validating metrics

4. Theoretical Validation

Metric type   No. of metrics   Null-Value   Non-Negativity   Symmetry   Monotonicity   Disjoint Module Additivity   Merging   Cohesive Modules
Complexity    29               √            √                √          √              n/a                          _         _
Cohesion      2                √            √                _          √              _                            _         √
Coupling      1                √            √                _          √              n/a                          √         _
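To make the property names in the table concrete, here is a rough sketch of how three of them (null value, non-negativity, monotonicity) could be spot-checked for a metric. The `triple_count` metric and the `check_properties` helper are hypothetical and chosen only for illustration; checking one example is not a proof, whereas the theoretical validation in the talk concerns the formal definitions.

```python
def triple_count(dataset):
    """Toy metric used only for illustration: the number of distinct triples."""
    return len(set(dataset))


def check_properties(metric, smaller, larger):
    """Spot-check three measurement properties on one example.

    `smaller` must be a subset of `larger`; passing this check on one
    example does not prove that the metric has the property in general.
    """
    if not set(smaller) <= set(larger):
        raise ValueError("smaller must be a subset of larger")
    return {
        # Null value: an empty dataset measures zero.
        "null_value": metric([]) == 0,
        # Non-negativity: the metric never goes below zero.
        "non_negativity": metric(smaller) >= 0 and metric(larger) >= 0,
        # Monotonicity: adding triples does not decrease the value.
        "monotonicity": metric(smaller) <= metric(larger),
    }


# Made-up triples for the example.
small = [("ex:a", "ex:p", "ex:b")]
large = small + [("ex:b", "ex:p", "ex:c")]
report = check_properties(triple_count, small, large)
```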

Page 16

5.1 Selecting several real datasets from LOD
5.2 Calculation of the metrics values for datasets
5.3 Metrics interdependency study
5.4 Manipulating the quality of the datasets
5.5 Comparing the trends of metrics over two observations
5.6 Collecting experts’ subjective perception on quality dimensions
5.7 Correlation study between metrics and quality dimensions

5. Empirical Evaluation

Page 17


Datasets                           No. of triples   No. of instances   No. of classes   No. of properties
FAO Water Areas                    10,730           586                31               19
Water Economic Zones               29,193           1,074              113              127
Large Marine Ecosystems            12,012           716                21               31
Geopolitical Entities              22,725           312                88               101
ISSCAAP Species Classification     398,166          25,253             52               93
Species Taxonomic Classification   319,490          11,741             33               26
Commodities                        56,420           2,788              10               19
Vessels                            4,236            240                6                22

5. Empirical Evaluation √

Page 18


5. Empirical Evaluation

Page 19

Selecting several real datasets from LOD

Calculation of the metrics values for datasets

Metrics interdependency Study

Manipulating the quality of the datasets using heuristics

Comparing the trends of Metrics over two observations

Collecting experts’ subjective perception on quality dimensions

Correlation study between metrics and quality dimensions

√√

5. Empirical Evaluation

Result:
• Three pairs of metrics are correlated: {IFP, Im_DT}, {Im_DT, Sml_Cls}, {Inc_Prp_Vlu, IF}
• The others are independent
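An interdependency study of this kind can be sketched as a pairwise correlation over the metric values measured on each dataset. The slides do not say which correlation coefficient or cut-off was used, so Pearson's r and the 0.8 threshold below are assumptions, and the metric names and numbers are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant metric carries no correlation information
    return cov / sqrt(var_x * var_y)


def correlated_pairs(metric_values, threshold=0.8):
    """Return metric pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for a, b in combinations(sorted(metric_values), 2):
        r = pearson(metric_values[a], metric_values[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Made-up values of three hypothetical metrics over five datasets.
metric_values = {
    "metric_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "metric_b": [2.1, 3.9, 6.2, 8.0, 9.8],
    "metric_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = correlated_pairs(metric_values)
```

With only a handful of datasets, as in the study's sample of eight, such coefficients are noisy, so a significance test or a rank-based coefficient may be a safer choice in practice.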

Page 20


5. Empirical Evaluation

Page 21


5. Empirical Evaluation

Page 22


5. Empirical Evaluation

Result:
• Only one pair of quality dimensions is correlated: {Interlinking, Syntactic accuracy}
• The others are independent

Page 23

Applying the PCA method to select the highly correlated metrics

Developing predictive models

Assessing the quality of new datasets using the models

6. Quality Prediction

Result: 20 out of 32 metrics are selected

Using Neural Network Method: MultiLayerPerceptron

Dataset    No. of triples   No. of instances   Domain
Geonames   6,590            699                Geography
IMDB       866              291                Movie
Anatomy    6,449            6,449              Anatomy
Citeseer   948,770          173,963            Publication
FAO        248,731          28,098             Food Science

Page 24

6. Quality Prediction

Page 25

Conclusion on Metrics

Definable
• Proposed by GQM (32)
• Formally defined (32)

Valid
• Theoretically validated (32)

Practical
• Implemented (32)

Correlated with quality
• Experts (28)
• Correlation study (27)
• PCA (20)

Predictability
• MLP (20)

Page 26

Thank you for your attention and comments