ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data
© Siemens AG 2015. All rights reserved
Collecting, integrating, enriching and republishing open city data as linked data
Stefan Bischof – Siemens, WU Vienna
Christoph Martin – WU Vienna
Axel Polleres – WU Vienna
Patrik Schneider – WU Vienna, TU Vienna
Page 2 October 2015 Corporate Technology © Siemens AG 2015. All rights reserved
Which city is the best? Compare cities!
What we have: European Green City Index
Use data to compare cities
Idea: Exploit available open data on cities to compute comparable indicators
Use standard Semantic Web technologies for:
ontology-based data integration (including lightweight provenance, temporal and spatial context)
data refinement and enrichment (approximating missing values, resolve quality issues)
data publication (SPARQL, LOD, WebUI)
City Data Pipeline -> comparable city indicators
Integrated Open Data is very sparse
[Figure: cities × indicators matrices; 51% of values missing in one, 97% of values missing in the other]
But we need base indicators for all cities to compute comparable indicators
How can we fill in missing values?
Get more data – which makes data even sparser
Use domain knowledge …
Try to automatically fill in values …
Use domain knowledge to predict missing values
Eurostat: 62 equations for derived indicators (e.g., population density)
Unit conversions (e.g., QUDT ontology)
Use materialization or query rewriting for value computation [ESWC13]
Covers only few indicators
How can we get more domain knowledge?
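A minimal sketch of how an equation such as population density could be materialized for one city record. The field names (population, area_km2, area_sq_mi, population_density) and the helper functions are hypothetical and are not taken from Eurostat, QUDT, or the pipeline itself; only the formula density = population / area and the square-mile conversion factor are standard.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[float]]

# 1 mile = 1.609344 km exactly, so 1 square mile = 2.589988110336 km^2.
KM2_PER_SQ_MI = 2.589988110336


def square_miles_to_km2(area_sq_mi: float) -> float:
    """Unit conversion of the kind a units ontology would describe."""
    return area_sq_mi * KM2_PER_SQ_MI


def population_density(population: Optional[float],
                       area_km2: Optional[float]) -> Optional[float]:
    """Derived indicator: inhabitants per km^2, or None if an input is missing."""
    if population is None or area_km2 is None or area_km2 <= 0:
        return None
    return population / area_km2


def materialize(record: Record) -> Record:
    """Return a copy of the record with derivable values filled in."""
    out = dict(record)
    # Step 1: convert units if only the non-metric area is known.
    if out.get("area_km2") is None and out.get("area_sq_mi") is not None:
        out["area_km2"] = square_miles_to_km2(out["area_sq_mi"])
    # Step 2: apply the equation if the derived indicator is still missing.
    if out.get("population_density") is None:
        out["population_density"] = population_density(
            out.get("population"), out.get("area_km2"))
    return out
```

A record that lacks both area fields keeps population_density as None, which mirrors the slide's point that equations cover only a few indicators.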
Use machine learning to predict missing values
Deploy and combine a portfolio of different regression methods:
• Multiple linear regression (MLR)
• K-nearest neighbour (KNN)
• Random forest decision trees (RFD)
Validation: 10-fold cross validation
Quality measure to pick the best method/indicator: normalized root mean square error in %
However: many/most machine learning methods need more or less complete training data!
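The validation and selection step can be sketched as follows. The slide only says "normalized root mean square error in %"; dividing the RMSE by the range of the observed values is an assumption here, as is the interleaved fold assignment. All function names are hypothetical. A "method" is any function that takes training rows and targets and returns a predict function, so MLR, KNN or RFD implementations could be plugged in.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Rows = List[List[float]]
Method = Callable[[Rows, List[float]], Callable[[List[float]], float]]


def rmse_percent(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE divided by the range of the actual values, in percent (assumed normalization)."""
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("actual values have zero range")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return 100.0 * math.sqrt(mse) / value_range


def k_fold_splits(n: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k (train, test) pairs; fold i holds i, i+k, i+2k, ..."""
    splits = []
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        splits.append((train, test))
    return splits


def cross_validated_rmse_percent(method: Method, X: Rows, y: List[float],
                                 k: int = 10) -> float:
    """Collect one out-of-fold prediction per row, then score all of them once."""
    predictions = [0.0] * len(y)
    for train, test in k_fold_splits(len(y), k):
        predict = method([X[j] for j in train], [y[j] for j in train])
        for j in test:
            predictions[j] = predict(X[j])
    return rmse_percent(y, predictions)


def pick_best(methods: Dict[str, Method], X: Rows, y: List[float],
              k: int = 10) -> Tuple[str, float]:
    """Return the name and RMSE% of the method with the lowest cross-validated error."""
    scores = {name: cross_validated_rmse_percent(m, X, y, k)
              for name, m in methods.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]
```

The sketch assumes complete rows in X and at least k rows, which is exactly the restriction the last line of the slide points out.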
Approach 1: Complete subset regression
For each target indicator
• Find top-k predictors based on correlation matrix and form a complete subset
• Apply all methods (MLR, KNN, RFD), compute RMSE% and select best method
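A rough sketch of the two steps above, assuming that "based on correlation matrix" means ranking candidate predictors by absolute Pearson correlation with the target, computed over the cities where both values exist. The table layout (indicator name mapped to one value per city, None for missing) and all function names are hypothetical. Only a plain KNN regressor is shown as an example method; MLR and RFD would be applied to the same complete subset.

```python
import math
from typing import Dict, List, Optional, Tuple

Table = Dict[str, List[Optional[float]]]  # indicator -> one value per city


def pearson(xs: List[Optional[float]], ys: List[Optional[float]]) -> float:
    """Pearson correlation over the cities where both indicators are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 3:
        return 0.0
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def top_k_predictors(table: Table, target: str, k: int) -> List[str]:
    """Rank all other indicators by |correlation| with the target and keep k."""
    others = [name for name in table if name != target]
    others.sort(key=lambda name: abs(pearson(table[name], table[target])),
                reverse=True)
    return others[:k]


def complete_subset(table: Table, target: str,
                    predictors: List[str]) -> Tuple[List[List[float]], List[float]]:
    """Keep only the cities where the target and every predictor are present."""
    X, y = [], []
    for i, t in enumerate(table[target]):
        row = [table[p][i] for p in predictors]
        if t is not None and all(v is not None for v in row):
            X.append(row)
            y.append(t)
    return X, y


def knn_predict(X: List[List[float]], y: List[float],
                query: List[float], n_neighbours: int = 3) -> float:
    """KNN regression: mean target of the closest rows (features left unscaled)."""
    order = sorted(range(len(X)), key=lambda j: math.dist(X[j], query))
    nearest = order[:n_neighbours]
    return sum(y[j] for j in nearest) / len(nearest)
```

Raising k tends to shrink the complete subset, because every additional predictor can rule out more cities.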
[Figure: cities × indicators matrix with MLR, KNN and RFD applied to a complete subset]
Approach 1: How many predictors needed?
Approach 2: Principal component regression
Fill in missing values with neutral values wrt. PCA [Roweis’97]
For each target indicator
• Find top-k predictors among the PCs based on correlation matrix
• Again: apply MLR, KNN, RFD, compute RMSE% and select the best method
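A simplified sketch of the fill-and-project idea, under loud assumptions: the "neutral value" is taken to be the column mean (0.0 after standardization, so it adds nothing to the covariance sums), and the first principal component is found by plain power iteration. This is a stand-in for illustration and not the algorithm of the cited reference; all function names are hypothetical, and every indicator is assumed to have at least one observed value.

```python
import math
from typing import List, Optional


def standardize_and_fill(columns: List[List[Optional[float]]]) -> List[List[float]]:
    """Z-score each indicator over its observed values; missing values become 0.0."""
    filled = []
    for col in columns:
        seen = [v for v in col if v is not None]
        mean = sum(seen) / len(seen)
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        std = math.sqrt(sum((v - mean) ** 2 for v in seen) / len(seen)) or 1.0
        filled.append([0.0 if v is None else (v - mean) / std for v in col])
    return filled


def first_component(columns: List[List[float]], iterations: int = 200) -> List[float]:
    """Leading eigenvector of the covariance matrix via power iteration."""
    n = len(columns[0])
    cov = [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns]
           for ci in columns]
    # Asymmetric start vector, normalized to unit length.
    vec = [1.0 / (i + 1) for i in range(len(columns))]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    return vec


def component_scores(columns: List[List[float]], component: List[float]) -> List[float]:
    """Project every city onto the component; the scores can serve as a predictor."""
    n = len(columns[0])
    return [sum(component[j] * columns[j][i] for j in range(len(columns)))
            for i in range(n)]
```

Because every city gets a score, the scores form a complete predictor column even where the original indicators had gaps, which is why this approach can predict more missing values than the complete subset approach.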
[Figure: cities × indicators matrix transformed into principal components, with MLR, KNN and RFD applied]
Approach 2: How many predictors needed?
Cross-dataset prediction 1/2: (How) can this be used for cross-dataset prediction?
Con:
• Not great: avg. RMSE for both directions over 10%
• Could transfer a "bias" from one dataset's context to the other
Pro:
• For some indicators it works quite well
Cross-dataset prediction 2/2: Pairwise linear regression can be used to "learn ontology mappings" from values
Compare the values of each Eurostat indicator with each UN indicator
Find linear dependencies between pairs (equations) or equal pairs (equivalent properties)
Robust linear regression is necessary to handle outliers
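One way to sketch the pairwise comparison is with the Theil-Sen estimator, a standard robust regression that takes the median of all pairwise slopes. The slide does not say which robust method was used, so this choice is an assumption, as are the function names and the tolerance used to call two indicators "equivalent".

```python
from itertools import combinations
from statistics import median
from typing import List, Optional, Tuple


def theil_sen(xs: List[Optional[float]],
              ys: List[Optional[float]]) -> Optional[Tuple[float, float]]:
    """Robust line fit y = slope * x + intercept over cities present in both datasets."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(pairs, 2) if x2 != x1]
    if not slopes:
        return None  # fewer than two overlapping cities with distinct x values
    slope = median(slopes)
    intercept = median([y - slope * x for x, y in pairs])
    return slope, intercept


def classify(slope: float, intercept: float, tol: float = 0.05) -> str:
    """Slope near 1 and intercept near 0 suggest equivalent properties;
    anything else is reported as a candidate linear equation.
    The absolute tolerance is a hypothetical choice and depends on the value scale."""
    if abs(slope - 1.0) <= tol and abs(intercept) <= tol:
        return "equivalent"
    return "linear"
```

For example, an indicator reported in thousands in one dataset and in absolute numbers in the other would come out with a slope of about 1000, even if one city has a grossly wrong value, because a single outlier cannot move the median of the slopes.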
Conclusion on various things we tried:
Approach 1: complete subsets
good results (0.25 RMSE%), but covers only a few cities/indicators
Approach 2: principal component regression
predicts more missing values, but quality is not always good
Cross-dataset prediction in general:
interesting ("highest gain"), but bad error rates with methods tested so far
Ontology learning from instance data:
several "conjectured" relationships derivable, but needs datasets with overlapping cities (usable to "confirm/reject" manual mappings)
US Census??
Now what's Semantic Web/Linked Data here?
City Data Pipeline Dataset available openly! http://citydata.wu.ac.at/
Data accessible as Linked Open Data, via SPARQL endpoint, and WebUI
Original values including data source for each value
Predicted values including error estimates
Current and future work
Ongoing
• Encode data in RDF Data Cube vocabulary and provenance in PROV
• Use other methods, e.g., SVM or robust linear regression for PCR
Future Work
• Some form of time-series analysis
• Add more data sources (Carbon Disclosure Project, QuerioCity)
• Integrate GIS data sources (OSM, Linked Geo Data)
Last, but not least, our assumption/driver: predictions get better the more open data we integrate.