How to clean data less through Linked (Open Data) approach?


Upload: andrea-huang

Post on 08-Apr-2017


Page 1: How to clean data less through Linked (Open Data) approach?

How to Clean data Less through

Linked (Open Data) Approach

Andrea Wei-Ching Huang
Institute of Information Science, Academia Sinica, Taipei, Taiwan

Dec. 7 2015 @ IIS R101

1. Data Quality: data, metadata, linked data

2. The case of 840,000 cc licensed data

3. How LOD approach can improve data quality?

Page 2

1. Data Quality: data, metadata, linked data

Page 3

| Information Quality: Stvilia et al. (2007), 22 dimensions | Data Quality: Batini et al. (2009), 28 dimensions | Metadata Quality: Tani et al. (2013), 10 parameters | Linked Data Quality: Zaveri et al. (2016), 18 dimensions | Data Quality Vocabulary: W3C (2015), 10 dimensions |
|---|---|---|---|---|
| Naturalness (I) | | | Interoperability (RP) | Statistics |
| Accessibility (R) | Accessibility | Accessibility | Availability (A) | Availability |
| Accuracy (R) | Accuracy | Accuracy (S) | Semantic Accuracy (I) | Accuracy |
| Accuracy/Validity (I) | Applicability | Pertinence | Syntactic Validity (I) | |
| | Appropriate amount of data | | | |
| Complexity (R) | Clarity | | | |
| Precision/Completeness (R) | Completeness | Completeness (S) | Completeness (I) | Completeness |
| Informativeness/Redundancy (R) | Comprehensiveness | | Understandability (C) | |
| Informativeness/Redundancy (I) | Conciseness | | Conciseness (I) | |
| Structural Consistency (I) | Consistency | Similarity | Consistency (I) | Consistency |
| | Convenience | | | |
| Structural Consistency (R) | Correctness | | | |
| Verifiability (R) | Credibility | | Trustworthiness (C) | Credibility |
| Currency (I) | Currency | | | |
| Semantic Consistency (I) | Derivation Integrity | | | |
| | Ease of operation | | | Processability |
| Naturalness (R) | Interactivity | Conformance (S) | Interlinking (A) | Conformance |
| Semantic Consistency (R) | Interpretability | | Interpretability (RP) | |
| Precision/Completeness (I) | Maintainability | Preservability | | |
| Complexity (I) | Objectivity | | | |
| Relevance/Aboutness (R) | Relevancy | Relevance | Relevancy (C) | Relevance |
| Authority (Reputational) | Reputation | | | |
| Security (R) | Security | | Security (A) | |
| | Speed | | Performance (A) | |
| | Timeliness | Timeliness | Timeliness (C) | Timeliness |
| | Traceability | | RP Conciseness (RP) | |
| Cohesiveness (I) | Uniqueness | Significance | | |
| | Usability | | Licensing (A) | |
| Volatility (R) | Volatility | | | |
| | | | Versatility (RP) | |

(I): Intrinsic; (R): Relational; (S): Metadata Spec.; (RP): Representational; (A): Accessibility; (C): Contextual

Page 4

1. Accessibility/Availability
2. Accuracy
3. Completeness
4. Consistency
5. Credibility/Trustworthiness
6. Relevance
7. Timeliness

These 7 dimensions/parameters are common ground across the five frameworks.

Quantitative and qualitative methodologies are used in combination.

Page 5

Metadata Quality: Problems & Solutions (1)

Record Problems

Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62

Page 6

Metadata Quality: Problems & Solutions (2)

Dublin Core Semantic Problems

Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of information professionals." Journal of Information Science 35.6 (2009): 727-739.

• Type is a subjective value.
• Source is a confusing field. It is difficult to apply it consistently.
• Creator can be very varied and it can be tricky determining exactly who the creator is.
• The information from the publisher is vague.
• The different roles of a contributor cannot be defined.
• There is often great ambiguity in terms of Type and Relation.
  o between Format and Type.
  o between Creator, Publisher, and Contributor.
  o between Source and Relation.
• The high degree of difficulty (55.3%) engendered by the Relation field
  o discernment of the dynamic and interrelated nature of information objects presents challenges in using the Relation element.

Page 7

Metadata Quality: Problems & Solutions (3)

Current Solutions

Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts." Information Processing & Management 49.6 (2013): 1194-1205.

Tani et al. (2013): Summary of metadata quality approaches

Metadata guidelines, standards and application profiles
  Pros: potentially effective; if shared among organizations, they promote cross-organization interoperability
  Cons: challenging to agree on between different organizations; often end up being complex combinations of features reflecting the interests of many disparate parties; they infringe the autonomy of the entities adopting them

Metadata evaluation approaches (analytic-oriented and crowdsourcing-oriented)
  Pros: helpful to identify specific problems
  Cons: based on community-specific criteria

Semi-automatic metadata generation approaches
  Pros: helpful to deal with the data deluge
  Cons: human assessment

Metadata cleaning, enhancement, augmentation approaches
  Pros: fundamental to enable cross-community exploitation of metadata
  Cons: information loss; information inconsistency

Page 8

2. The case of 840,000 CC-licensed data
in the Union Catalogue of Digital Archives Taiwan

Page 9

“Fitness for Use” is the Key:

Data Quality (DQ) Definition for Digital Data

Nicholas R. Chrisman (1986): “Digital data can adapt to a broader range of uses with a broader range of special demand, … The root of data abuse is not in the quality of the data, but in the awareness and understanding of the quality of the data. By converting to the fitness for use approach, the problem of data abuse is moved from producer to consumer (data user).”

W3C Data Quality Vocabulary (2015):

“...quality lies in the eye of the beholder; that there is no objective, ideal definition of it. Some datasets will be judged as low-quality resources by some data consumers, while they will perfectly fit others' needs.”

Quality from the perspectives of the supply and demand sides, e.g. Data Publishers, Certification Agencies, Data Aggregators and Data Consumers.

Pragmatic, user-specific, context-dependent

Page 10

“Fitness for Use” in different contexts:

[Diagram: data lifecycle from physical object to Linked Open Data]
Pipeline: physical object → digital object → digital collection → digital aggregation & publication → reusing & semantic representation
Steps: Creation, Conversion 1, Conversion 2, Conversion 3, Conversion 4, Clean & Enrich, Conversion 5
Formats: XML/HTML, TEXT/Image, XLSX/Table/HTML, CSV, CSV, RDF/Turtle
Stages: Local Curation (90 projects); Digital Archive Curation (1 portal); Linked Open Data (globally linked & semantically represented)
Schemes: locally developed schemes; DC 15 elements as the requirement for the Union Catalog
Contexts: CONTEXT I, CONTEXT II, CONTEXT III

Globally linked, machine-accessible semantics & domain knowledge vocabularies are needed for LOD.

Page 11

[Diagram: data quality work along the same lifecycle]
Pipeline: physical object → digital object → digital collection → digital aggregation & publication → reusing & semantic representation
Formats: XML/HTML, TEXT/Image, XLSX/Table/HTML, CSV, CSV, RDF/Turtle
Stages: Local Curation (90 projects); Digital Archive Curation (1 portal); Linked Open Data (globally linked & semantically represented)
Activities: Metadata Generation; Provide metadata guidelines & standard (DC 15); Linked Data Generation
Data Quality (label repeated three times in the diagram)

Approaches:
  Metadata evaluation approaches
  Semi-automatic metadata generation approaches
  Metadata cleaning, enhancement, augmentation approaches

Information loss? Interpretation problems? Time & resource cost?

Page 12

Problems identified in the case of 840,000 cc data

1. Confusion over Dublin Core element definitions
2. Name Ambiguity
3. Inconsistent Encoding
4. Semantic Overlaps (semantic overloading)
5. Duplicate Records
6. Insufficient Element Usage (missing semantics)
7. Errors / Mistakes / Others

Page 13

Considerations in the case of 840,000 cc data for LOD

1. We are not the data creators. Can we clean/revise the data “correctly”?
   Keep the original CSV data open.
   Publish revised/cleaned data as diff/mapping files.

2. How can we prevent “information loss”?
   Mapping activities often result in information loss.
   Reconsider the value of broken links.

3. Limited resources and time to handle the cleaning tasks.
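The idea of keeping the original CSV open and publishing corrections as separate diff/mapping files can be sketched as follows. This is a minimal illustration only, not the project's actual tooling: the column names, record ids, values and the layout of the diff file are all assumptions made for the example.

```python
import csv
import io

# Hypothetical raw export; ids, columns and values are invented for illustration.
RAW_CSV = """id,title,date
A001,Fish specimen,2012/13/01
A002,Another specimen,2012-05-02
"""

# Hypothetical "diff" file: one correction per line, stored apart from the raw data.
DIFF_CSV = """id,field,old,new,note
A001,date,2012/13/01,2012-01-13,day and month swapped
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, leaving the input untouched."""
    return list(csv.DictReader(io.StringIO(text)))

def apply_diffs(rows, diffs):
    """Return corrected copies of rows; a diff applies only if `old` still matches."""
    by_key = {(d["id"], d["field"]): d for d in diffs}
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy, so the raw row is never modified
        for field, value in row.items():
            diff = by_key.get((row["id"], field))
            if diff and diff["old"] == value:
                fixed[field] = diff["new"]
        cleaned.append(fixed)
    return cleaned

raw_rows = load_rows(RAW_CSV)
cleaned_rows = apply_diffs(raw_rows, load_rows(DIFF_CSV))
```

Because the correction lives in its own file, anyone who disagrees with it can ignore or replace it and still start from the same raw data.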

Page 14

3. How Linked (Open Data) approach can improve data quality?

Page 15

1. Raw data and new data (cleaned data, semantically refined data) can benefit from the Open Data approach:

   • Creation of new data based on combining data.
   • External quality checks of data (validation).
   • Sustainability of data (no data loss).
   • The ability to merge, integrate and mesh public and private data.

Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government." Information Systems Management 29.4 (2012): 258-268.

Page 16

2. Using SPARQL Queries to identify problems:

Identify DQ problems before RDF is generated:

Use the W3C mapping language R2RML and an RDF validation framework (RDFUnit) on the mapping definitions, allowing publishers to catch and correct violations before they even happen. (Dimou et al., 2015)

Identify DQ problems after RDF is generated:

Use SPARQL with publicly shared LOD resources (e.g. DBpedia, GeoNames) as reference data to identify problems. (Fürber and Hepp, 2010)

Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225.

Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic Web-ISWC 2015. Springer International Publishing, 2015. 133-149.
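The "after RDF is generated" check can be illustrated without a triple store. The sketch below is plain Python standing in for what a SPARQL query against a shared reference source would do; it is not SPARQL and not the method of the cited papers. The reference set, the item identifiers and the statements are hypothetical.

```python
# Trusted reference set, standing in for names one would look up in a shared
# LOD source such as GeoNames. The entries are illustrative only.
REFERENCE_PLACES = {"Taipei", "Kenting", "Tainan"}

# Generated statements as (subject, predicate, object) tuples; hypothetical data.
TRIPLES = [
    ("item:1", "dc:coverage", "Taipei"),
    ("item:2", "dc:coverage", "Taipai"),  # misspelled on purpose
    ("item:3", "dc:title", "A fish"),
    ("item:3", "dc:coverage", "Kenting"),
]

def find_unknown_values(triples, predicate, reference):
    """Return (subject, value) pairs whose value is missing from the reference set."""
    return [(s, o) for s, p, o in triples if p == predicate and o not in reference]

def find_missing_property(triples, predicate):
    """Return subjects that have no statement at all for the given predicate."""
    subjects = {s for s, _, _ in triples}
    having = {s for s, p, _ in triples if p == predicate}
    return sorted(subjects - having)

unknown_places = find_unknown_values(TRIPLES, "dc:coverage", REFERENCE_PLACES)
untitled_items = find_missing_property(TRIPLES, "dc:title")
```

The first function corresponds to a value check against reference data, the second to a completeness check. In a real setting both would typically be expressed as queries over the published triples.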

Page 17

3. Use Vocabularies, Ontologies & LOD Knowledge Bases:

To improve data quality at every step of a dataset's lifecycle (e.g. W3C Data Quality Vocabulary).

To enrich data semantics and increase the value of reused and refined data.

  • Collaborative representation and use of quality-relevant knowledge
  • Automatic identification of conflicting data requirements
  • Semantic definition of data
  • Use of Semantic Web data as trusted reference data
  • Content integration with ontologies

The five points above are summarized from Fürber and Hepp (2013): "Using Semantic Web Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin Heidelberg, 2013. 141-161.
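As one possible reading of how a quality vocabulary can travel with the data, the sketch below computes a simple completeness score over the 15 Dublin Core elements and writes it out as a `dqv:QualityMeasurement` in Turtle. The DQV terms used (`dqv:computedOn`, `dqv:isMeasurementOf`, `dqv:value`) come from the W3C vocabulary. The metric itself, the example URIs and the sample record are assumptions made for illustration.

```python
DC_ELEMENTS = [
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
]

def dc_completeness(record):
    """Fraction of the 15 DC elements that carry a non-empty value."""
    filled = sum(1 for name in DC_ELEMENTS if record.get(name, "").strip())
    return filled / len(DC_ELEMENTS)

def dqv_measurement(dataset_uri, metric_uri, value):
    """Serialize one dqv:QualityMeasurement as a Turtle snippet."""
    return (
        "@prefix dqv: <http://www.w3.org/ns/dqv#> .\n"
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n\n"
        "[] a dqv:QualityMeasurement ;\n"
        f"    dqv:computedOn <{dataset_uri}> ;\n"
        f"    dqv:isMeasurementOf <{metric_uri}> ;\n"
        f'    dqv:value "{value:.2f}"^^xsd:decimal .\n'
    )

# Hypothetical record with 12 of the 15 elements filled in.
record = {name: "x" for name in DC_ELEMENTS}
for empty in ("coverage", "relation", "source"):
    record[empty] = ""

score = dc_completeness(record)
turtle = dqv_measurement(
    "http://example.org/item/1",                 # hypothetical item URI
    "http://example.org/metric/dcCompleteness",  # hypothetical metric URI
    score,
)
```

Publishing the score as a statement, rather than silently filtering low-scoring records, leaves the fitness-for-use judgment to each data consumer.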

Page 18

http://www.w3.org/TR/vocab-dqv/

The importance of provenance and metadata quality. (Carata, Lucian, et al. 2014)

Page 19

The Story of A Fish

http://catalog.digitalarchives.tw/item/00/5f/ca/d5.html

Parapercis kentingensis

Page 20

[Diagram: lifecycle of record 6277845, identified as http://URI of this Fish/6277845]
Years shown: 2012, 2015, 2016
Formats shown: TEXT/Image, XLSX/Table/HTML, XML/HTML, CSV (raw data published as open data)
Vocabulary prefixes shown: prov, r4r, schema, cc, odw

(1) Metadata (DC 15): 12/15 triples (statements)

(2) Provenance: 12/15 triples (statements) + one “diff” triple (diagram labels: err, new, wikidata)

(3) Mapping replaces cleaning: + one “time mapping” triple (diagram labels: err, new, time)

(4) Refined Version: semantically enriched by using domain vocabularies like Darwin Core Terms

(5) When the raw CSV and the DC 15 triples (DC 15 Version) are published, it is easy for others to detect errors, and to reuse and enrich the data according to their own fitness for use and interpretations. Even if there are errors from the beginning, more statements about this fish (6277845) can thus be generated by the interests of the community.

At this stage, place information is described in Description rather than in Coverage. It should be cleaned and mapped to external resources like GeoNames and TaiwanPlaceName, by us or by others, when time and resources are available.

Page 21

How we clean data less through Linked (Open Data) Approach

1. Keep original CSV data open.
2. Clean less, map more: publish revised/cleaned data as diff/mapping files.
3. Publish the original DC 15 statements as 15 triples and provide Provenance information.
4. Assign each item resource a URI.
5. Use domain vocabulary to enrich the resource (e.g. dwc).
6. Map and link to external databases to enrich statements (GeoNames, TaiwanPlaceNames, Encyclopedia of Life).
7. More errors or meanings will be stated by third parties and through crowdsourcing, for their own interests.
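Publishing the DC statements as triples, giving each item a URI, and pointing back to the raw file can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the released conversion: the base URI and CSV location are placeholders (the slides do not give the real ones), the sample field values are invented, and `prov:wasDerivedFrom` is chosen here as one plausible way to record provenance.

```python
DC_NS = "http://purl.org/dc/elements/1.1/"
PROV_NS = "http://www.w3.org/ns/prov#"
BASE_URI = "http://example.org/item/"           # hypothetical base for item URIs
SOURCE_CSV = "http://example.org/raw/data.csv"  # hypothetical location of the open CSV

def literal(text):
    """Escape a string as an N-Triples literal."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'

def row_to_ntriples(row):
    """Turn one record into N-Triples: one line per non-empty DC element, plus provenance."""
    subject = f"<{BASE_URI}{row['id']}>"
    lines = []
    for element, value in row.items():
        if element == "id" or not value.strip():
            continue  # skip the key column and empty fields; nothing is invented
        lines.append(f"{subject} <{DC_NS}{element}> {literal(value)} .")
    # Point back to the untouched raw file so the original stays traceable.
    lines.append(f"{subject} <{PROV_NS}wasDerivedFrom> <{SOURCE_CSV}> .")
    return lines

# Sample row: the id and species name appear in the slides; which column holds
# them, and the "type" value, are assumptions for this example.
row = {"id": "6277845", "title": "Parapercis kentingensis", "coverage": "", "type": "specimen"}
triples = row_to_ntriples(row)
```

Empty fields simply produce no statement, which is how a record can end up with fewer triples than the 15 available elements.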

Page 22

REFERENCES

1. Batini, Carlo, et al. "Methodologies for data quality assessment and improvement." ACM Computing Surveys (CSUR) 41.3 (2009): 16.
2. Chrisman, Nicholas R. "Obtaining information on quality of digital data." Proc. AutoCarto London. Vol. 1. 1986.
3. Carata, Lucian, et al. "A primer on provenance." Communications of the ACM 57.5 (2014): 52-60.
4. Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic Web-ISWC 2015. Springer International Publishing, 2015. 133-149.
5. Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225.
6. Fürber, Christian, and Martin Hepp. "Using Semantic Web Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin Heidelberg, 2013. 141-161.
7. Hooland, Seth van, and Ruben Verborgh. Linked data for libraries, archives and museums. (2014).
8. Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government." Information Systems Management 29.4 (2012): 258-268.
9. Manus, Susan. "The Value of a Broken Link." (2012). http://blogs.loc.gov/digitalpreservation/2012/03/the-value-of-a-broken-link/
10. Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of information professionals." Journal of Information Science 35.6 (2009): 727-739.
11. Stvilia, Besiki, et al. "A framework for information quality assessment." Journal of the American Society for Information Science and Technology 58.12 (2007): 1720-1733.
12. Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts." Information Processing & Management 49.6 (2013): 1194-1205.
13. W3C. Data Quality Vocabulary (2015). http://www.w3.org/TR/vocab-dqv/
14. Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62.
15. Zaveri, Amrapali, et al. "Quality assessment for linked open data: A survey." Semantic Web 7.1 (2016).

Page 23

Merry Christmas Happy New Year

We will release the DC 15 Versions and the Refined Version (Biology) shortly.