Data Quality in Real Estate
Dimitris Kontokostas, Andy van der Hoeven, Samur Araujo
Amsterdam, Sep 14th 2017, LDQ Workshop, SEMANTiCS Conference
About Geophy
● Goal: map all buildings in the world
● Provide a quality score for each building
○ Based on location, building status, history, environmental metrics, etc.
● Semantic platform
○ RDF eases the data integration process
● Team of 45, aiming to double by next year
Real Estate is a very complex domain
Really!
Possible constraints on addresses?
● An address will start with, or at least include, a building number.
● When there is a building number, it will be all-numeric.
● No buildings are numbered zero
● Well, at the very least no buildings have negative numbers
● A building number will only be used once per street
● A building will only have one number
● A building name won't also be a number
● [...] https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses
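A minimal sketch of why such constraints are risky, assuming a naive rule that combines the first three bullets (the address starts with an all-numeric, non-zero building number). The rule, the function name and the sample addresses are made up for illustration; the linked article argues that assumptions of this kind do not hold for real addresses.

```python
import re

# Hypothetical naive rule: the address must start with an all-numeric,
# non-zero building number followed by whitespace and more text.
NAIVE_RULE = re.compile(r"^[1-9][0-9]*\s+\S")


def passes_naive_rule(address: str) -> bool:
    """Return True if the address satisfies the naive rule above."""
    return NAIVE_RULE.match(address) is not None


# Made-up sample addresses, for illustration only.
samples = [
    "12 Example Street",           # plain numeric building number
    "12A Example Street",          # building number with a letter suffix
    "0 Example Road",              # building numbered zero
    "Rose Cottage, Example Lane",  # building name, no number
]

# Addresses the naive rule would flag as violations.
rejected = [a for a in samples if not passes_naive_rule(a)]
```

Only the first sample passes; the other three would be reported as violations even though each has a plausible shape for a real address.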
Geophy [set of] ontologies
● 13 ontologies (+ 9 external)
● 125 Classes
○ Buildings
○ Addresses
○ Companies
○ [...]
● 720 properties
○ 500 datatype
○ 160 relation properties
● Growing...
Quality is expensive
● Quality of source data
○ Free, open, closed data sources, etc.
● Data clean up process
○ Violations, deduplication, precision, etc.
○ How much time and effort can one afford?
How much quality is good enough?
● Fitness for use
Quality of ...
● Source data
○ Accuracy of the source
● Translation of source data
○ RDF mappings, RML, D2RQ, scripts, etc.
● Model design
○ Modelling quality
○ Data fitting on schema
● Model definition
○ Mapping of model on RDFS, OWL, ShEx|SHACL Shapes, etc.
○ Semantics, i.e. RDFS, OWL DL/RL/Full, etc.
Evolution & quality
● Data evolves
● so do ontologies
● so do RDF mappings
● so does code
● so do SPARQL queries
● so do constraints
http://aligned-project.eu
Scaling quality ...
● Thousands of triples
● Millions of triples
● Billions of triples
● ?
Try to move validation into the K range, i.e. thousands of triples (when possible)
Validate closer to the source
● Validate the model
● Validate the RDF mappings
● Validate RDF mapping excerpts
● Validate instance data
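One possible way to validate an RDF mapping before any instance data is generated is to check that every term the mapping uses is declared in the ontology. The sketch below assumes both term sets have already been extracted; the function name and the `ex:` terms are hypothetical and are not taken from the Geophy ontologies.

```python
def undeclared_terms(mapping_terms, ontology_terms):
    """Return terms used by a mapping that the ontology does not declare."""
    return sorted(set(mapping_terms) - set(ontology_terms))


# Hypothetical terms, for illustration only.
ontology_terms = {"ex:Building", "ex:hasAddress", "ex:buildingNumber"}
mapping_terms = {"ex:Building", "ex:hasAdress", "ex:buildingNumber"}  # typo in hasAdress

# Terms the mapping uses but the ontology never declares.
missing = undeclared_terms(mapping_terms, ontology_terms)
```

A check like this runs on a handful of terms rather than on the generated triples, so a typo in the mapping is caught before it is repeated across the whole dataset.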
Automate, automate & automate
Can you spot the error?
rdfs:label ⇒ rdf:langString
● :foo rdfs:label "foo @en" .
● :foo rdfs:label "foo"@en .
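An error like this one can be caught automatically. The sketch below is a heuristic only, and assumes each label has already been parsed into its lexical form and its language tag (`None` when there is no tag); the function name and the regular expression are illustrative, not part of any RDF library.

```python
import re

# Heuristic: the string itself ends in "@" plus something shaped like a
# language tag, e.g. "foo @en". This can produce false positives.
TAG_INSIDE_STRING = re.compile(r"\s*@[A-Za-z]{2,3}(-[A-Za-z0-9]+)*\s*$")


def label_issues(lexical_form, language):
    """Return a list of problems found in a single rdfs:label value."""
    issues = []
    if language is None:
        issues.append("missing language tag")
    if TAG_INSIDE_STRING.search(lexical_form):
        issues.append("language tag inside the string")
    return issues


# The two literals from the slide, as (lexical form, language tag) pairs.
wrong = label_issues("foo @en", None)
right = label_issues("foo", "en")
```

The first literal is reported twice: it carries no language tag, and its text ends in something that looks like one. The second literal passes.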
CI/CD is your buddy
● Integrate validation with your CI/CD
○ Choose tools & technologies wisely
○ Jenkins, Travis, Gitlab, TeamCity
● Fail the build until data issues are fixed
● Data integration validation checks
○ Standalone datasets can pass CI
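Failing the build can be as simple as returning a non-zero exit code, which CI servers treat as a failed step. A minimal sketch, assuming the violations have already been collected as a list of messages by an earlier validation step; the function name is hypothetical.

```python
import sys


def report_and_exit_code(violations):
    """Print each violation to stderr and return a process exit code."""
    for violation in violations:
        print(f"VIOLATION: {violation}", file=sys.stderr)
    # 0 means success; any non-zero value marks the CI step as failed.
    return 1 if violations else 0
```

A validation script would end with `sys.exit(report_and_exit_code(violations))`, so the build stays red until the list is empty.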
Thank you for your attention
Questions?