Quantifying the bias in data links
DESCRIPTION
An approach to identify how much a Linked Data dataset is biased, using statistical methods and the links between datasets. 28/11/2014 @EKAW2014, Linköping, Sweden
TRANSCRIPT
![Page 1: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/1.jpg)
Quantifying the bias in data links
Ilaria Tiddi, Mathieu d’Aquin, Enrico Motta
![Page 2: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/2.jpg)
The problem
• Linked Data datasets are biased
• Bias = the information is unevenly distributed
• To detect such a bias, the information distribution in the
dataset should be compared to an unbiased one (ground
truth), which is not available
• Our proposal is to use information coming from the
connected datasets to approximate such a comparison
![Page 3: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/3.jpg)
Is bias a problem?
• LMDB is biased towards old movies (i.e., it mostly contains information about old movies)
• A recommender system would therefore produce results biased towards old movies
• This bias needs to be identified, both to properly assess the results of Linked Data systems and to compensate for the bias.
![Page 4: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/4.jpg)
Motivation
• Dedalo: using Linked Data to explain patterns
• Pattern
• Students of the Open University enroll in Health&Social Care
courses more often around Manchester than in other places
• Explanation
• Health&Social Care courses are popular in Manchester because it is
in the Northern Hemisphere
• In DBpedia, information about the location of places is incomplete,
and this incompleteness is unevenly distributed, i.e. there is a bias
![Page 5: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/5.jpg)
Identifying the bias
• Measure how much a dataset is biased when compared to another one
• Use the projection S of the dataset into the dataset D it is connected to
• compare the distribution of property values for the entities in D
• with the distribution for the entities in S (the dataset projection)
[Diagram labels: Dataset, D, S; link types: owl:sameAs, rdfs:seeAlso, skos:exactMatch, …]
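The projection step can be sketched in a few lines of Python. This is only an illustration under assumptions: the link triples, the entity names and the `project` helper are hypothetical, and the slides do not say how the projection is actually computed.

```python
# Link properties treated as "same entity" links (taken from the slide's list).
LINK_PROPERTIES = {"owl:sameAs", "rdfs:seeAlso", "skos:exactMatch"}


def project(links, link_properties=LINK_PROPERTIES):
    """Return the entities that the source dataset links to via link_properties."""
    return {target for _, prop, target in links if prop in link_properties}


# Hypothetical links from a source dataset into D (made-up identifiers).
links = [
    ("lmdb:film/1", "owl:sameAs", "db:FilmA"),
    ("lmdb:film/2", "owl:sameAs", "db:FilmB"),
    ("lmdb:film/2", "foaf:page", "http://example.org/page"),  # not an equivalence link
]

# Hypothetical entities of D.
d_entities = {"db:FilmA", "db:FilmB", "db:FilmC"}

# S is the part of D that the source dataset points to.
s_entities = project(links) & d_entities
```

Here `s_entities` is a subset of `d_entities`, so any property available for D can be compared between the two sets.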
![Page 6: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/6.jpg)
Example: is LMDB biased?
• Compare dc:subject values for the entities in D and in S
• LMDB is biased towards black and white movies
• Same for dbp:released
• LMDB is biased towards older movies
![Page 7: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/7.jpg)
Bias detection proposition
• Use SPARQL to build pairs of value distributions in S and D
• Given
• two populations (the values in D and in S) and
• the same observation (an RDF property)
dc:subject(D) = {dbCat:ScienceFictionMovies, dbCat:Black&WhiteMovies}
dc:subject(S) = {dbCat:Black&WhiteMovies}
• Use statistical t-tests, which are commonly used to compare observations
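One possible way to turn the two value sets into comparable populations is sketched below. It assumes that each entity contributes a 0/1 observation per value (does the entity carry this value or not); the slides do not state the exact encoding, so this is an assumption. An in-memory dictionary stands in for the SPARQL results, and all entity names and the `indicator_samples` helper are hypothetical.

```python
def indicator_samples(entity_values, entities, values):
    """For each value, list a 1 or 0 per entity: does the entity carry that value?"""
    ordered = sorted(entities)
    return {
        value: [1 if value in entity_values.get(entity, set()) else 0 for entity in ordered]
        for value in values
    }


# Hypothetical dc:subject values for three entities of D.
subjects = {
    "db:FilmA": {"dbCat:ScienceFictionMovies"},
    "db:FilmB": {"dbCat:Black&WhiteMovies"},
    "db:FilmC": {"dbCat:Black&WhiteMovies"},
}
d_entities = set(subjects)
s_entities = {"db:FilmB", "db:FilmC"}  # the projection S, a subset of D

all_values = sorted(set().union(*subjects.values()))
in_d = indicator_samples(subjects, d_entities, all_values)
in_s = indicator_samples(subjects, s_entities, all_values)

# One (population in D, population in S) pair per value of the property.
pairs = {value: (in_d[value], in_s[value]) for value in all_values}
```

Each entry of `pairs` is then a pair of samples for the same observation, which is the input shape a two-sample test expects.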
![Page 8: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/8.jpg)
T-Tests of statistical significance
• Test whether there is a significant difference between two populations
• calculate the probability p of observing such a difference if it were due to chance alone
• state a null hypothesis (i.e. the difference is due to chance)
• there is no bias in a property
• and an alternate hypothesis (the one you want to prove)
• there is bias in a property
• if p is below 0.05, one can reject the null hypothesis
• the lower p, the stronger the evidence that the property is biased
• Rank the properties according to p to find the most biased ones
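The test-and-rank step might look like the following sketch. It computes Welch's t statistic with the standard library and, as a simplification, approximates the two-sided p-value with a normal distribution instead of Student's t, which is only reasonable for large samples. The sample data, the property and value names, and both function names are made up for illustration; the slides do not say which t-test variant was used.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_p_value(sample_d, sample_s):
    """Two-sided p-value for a difference in means (normal approximation)."""
    standard_error = sqrt(
        variance(sample_d) / len(sample_d) + variance(sample_s) / len(sample_s)
    )
    if standard_error == 0:
        # Both samples are constant: identical means give no evidence of bias.
        return 1.0 if mean(sample_d) == mean(sample_s) else 0.0
    t = (mean(sample_d) - mean(sample_s)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(t)))


def rank_by_bias(pairs, alpha=0.05):
    """Keep the pairs with p < alpha, lowest p (strongest evidence) first."""
    scored = [(key, welch_p_value(d, s)) for key, (d, s) in pairs.items()]
    return sorted((item for item in scored if item[1] < alpha), key=lambda item: item[1])


# Hypothetical 0/1 observations: (population in D, population in S).
pairs = {
    ("dbp:released", "before 1950"): ([1] * 20 + [0] * 80, [1] * 45 + [0] * 5),
    ("dc:subject", "dbCat:Drama"): ([1] * 30 + [0] * 70, [1] * 15 + [0] * 35),
}
ranked = rank_by_bias(pairs)
```

In this made-up data only the first pair differs between D and S, so it is the only one that passes the 0.05 threshold. For small samples, a proper Student's t distribution should replace the normal approximation.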
![Page 9: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/9.jpg)
Experiments and results
• 30 datasets and 54 pairs from the DataHub [1]
• Varying in the number of entities in S (from 30 to 60,000 approx.)
• Varying in domain (multi-domain, biomedical, computer science, education, geography…)
[1] http://datahub.io/
![Page 10: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/10.jpg)
When results are expected…
• NLFinland, places in Finland (connected to DBpedia)

| class | prop | value | p |
|---|---|---|---|
| db:Place | dc:subject | db:CitiesAndTownsInFinland | p < 1.00e-15 |
| db:Place | dbp:latd | (average) 40.5 | p < 1.00e-15 |
| db:Place | dbp:longd | (average) 24.6 | p < 1.00e-15 |

• NLSpain, bibliographic Spanish data (connected to DBpedia)

| class | prop | value | p |
|---|---|---|---|
| db:MusicalArtist | db:birthPlace | db:Spain | p < 1.13e-13 |
| db:Writer | dbp:nationality | db:Spanish | p < 4.64e-03 |
![Page 11: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/11.jpg)
…when results are less expected
• Uniprot, biomedical data (connected to Bio2RDF/BioPax/DrugBank)

| class | prop | value | p |
|---|---|---|---|
| up:Protein | up:isolatedFrom | uptissue:Brain | p < 1.33e-04 |

• RED, writers data (connected to DBpedia)

| class | prop | value | p |
|---|---|---|---|
| db:Agent | db:genre | db:Novel | p < 1.00e-15 |
| db:Agent | db:genre | db:Poetry | p < 1.00e-15 |
| db:Agent | db:deathCause | db:Suicide | p < 1.00e-15 |
![Page 12: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/12.jpg)
Conclusions and future work
• Identifying the bias in a dataset is important
• Approach:
• use information from the connected datasets
• run statistical t-tests on the distributions of the values of a property
• rank properties based on the probability of being biased
• Evaluating Dedalo’s performance on Google Trends
Please participate!
http://linkedu.eu/dedalo/eval/
![Page 13: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/13.jpg)
ilaria.tiddi@open.ac.uk
@IlaTiddi http://linkedu.eu/dedalo/eval/
Thank you for your attention
Questions?
![Page 14: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/14.jpg)
Dedalo: explaining clusters with Linked Data
• Linked Data are a graph
• nodes : URIs
• edges : RDF properties
• Some nodes of a cluster reach the same node through a walk
Walk = a chain of RDF properties
• Such a walk can be an explanation for the cluster
ExplC = a chain of properties and one final entity
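The idea of walks ending at a shared node can be illustrated with a small graph traversal. This is a sketch under assumptions, not Dedalo's implementation: the adjacency-dictionary graph, the entity names and both function names are hypothetical, and a bounded breadth-first expansion stands in for the actual search.

```python
from collections import defaultdict


def walk_endpoints(graph, start, max_length):
    """Yield (chain of properties, final entity) for every walk from start."""
    frontier = [((), start)]
    for _ in range(max_length):
        next_frontier = []
        for chain, node in frontier:
            for prop, target in graph.get(node, ()):
                step = (chain + (prop,), target)
                yield step
                next_frontier.append(step)
        frontier = next_frontier


def shared_explanations(graph, cluster, max_length=2):
    """Map each (chain, final entity) to the cluster items that reach it."""
    reached = defaultdict(set)
    for item in cluster:
        for explanation in walk_endpoints(graph, item, max_length):
            reached[explanation].add(item)
    # Keep only explanations shared by more than one item.
    return {expl: items for expl, items in reached.items() if len(items) > 1}


# Hypothetical graph: node -> list of (RDF property, target node).
graph = {
    "ex:film1": [("dc:subject", "cat:Cyberpunk")],
    "ex:film2": [("dc:subject", "cat:SpaceOpera")],
    "cat:Cyberpunk": [("skos:broader", "cat:ScienceFiction")],
    "cat:SpaceOpera": [("skos:broader", "cat:ScienceFiction")],
}
shared = shared_explanations(graph, ["ex:film1", "ex:film2"])
```

In this toy graph both films reach `cat:ScienceFiction` through the chain `dc:subject`, `skos:broader`, so that chain plus its final entity is the only shared candidate explanation.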
![Page 15: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/15.jpg)
Dedalo: explaining clusters with Linked Data
• A* iterative search
• Entropy drives the search while the graph is expanded
• The F-score of ExplC improves at each iteration
ExplC = “movies whose subject is a subcategory of Science Fiction”
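Scoring a candidate explanation with an F-score can be sketched as follows. The slides do not define how the F-score of ExplC is computed, so this assumes the usual precision/recall formulation: the items that reach the explanation's final entity are compared against the items that belong to the cluster. The item names and candidate labels are hypothetical, and the A* search and entropy heuristic are not shown.

```python
def f_score(explained, cluster):
    """Harmonic mean of precision and recall of an explanation for a cluster."""
    explained, cluster = set(explained), set(cluster)
    true_positives = len(explained & cluster)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(explained)
    recall = true_positives / len(cluster)
    return 2 * precision * recall / (precision + recall)


# Hypothetical cluster and the items covered by two candidate explanations.
cluster = {"ex:film1", "ex:film2", "ex:film3"}
candidates = {
    "subject -> broader -> ScienceFiction": {"ex:film1", "ex:film2", "ex:film3"},
    "subject -> Cyberpunk": {"ex:film1", "ex:film2", "ex:film4"},
}

scores = {name: f_score(items, cluster) for name, items in candidates.items()}
best = max(scores, key=scores.get)
```

An iterative search could keep the best-scoring candidate found so far and expand the graph further only where a better score seems reachable.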
![Page 16: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/16.jpg)
Knowledge Discovery
The process of identifying patterns in data [1]
Patterns are usually interpreted by the experts
Linked Data can be used to automatically interpret patterns
open, shared, multi-domain, connected knowledge
raw data → clean data → Patterns → Knowledge
[1] Fayyad, 1998.
![Page 17: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/17.jpg)
Contribution
The bias needs to be identified when building Linked Data systems
We propose a process to identify and measure the bias, based on statistical methods
A recommender system based on DBpedia (any kind of movies)
DBpedia is linked to the Linked Movies Database (’30s movies)
The recommendation might be compromised
![Page 18: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/18.jpg)
Motivation
• Students are interested in Health&Social Care since they live in the Northern Hemisphere
• What about the other counties?
• are they connected to the “Northern Hemisphere” entity?
• There must be a bias: the information is unevenly distributed
• Solution: weighting properties to rebalance the unevenness
![Page 19: Quantifying the bias in data links](https://reader033.vdocuments.net/reader033/viewer/2022052910/559cdb461a28aba0408b45ff/html5/thumbnails/19.jpg)
ilaria.tiddi@open.ac.uk
@IlaTiddi
THANK YOU VERY MUCH!
Questions?