ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

© Siemens AG 2015. All rights reserved

Collecting, integrating, enriching and republishing open city data as linked data

Stefan Bischof – Siemens, WU Vienna

Christoph Martin – WU Vienna

Axel Polleres – WU Vienna

Patrik Schneider – WU Vienna, TU Vienna

October 2015, Corporate Technology

Which city is the best? Compare cities!


What we have: European Green City Index



Use data to compare cities

Idea: Exploit available open data on cities to compute comparable indicators

Use standard Semantic Web technologies for:

ontology-based data integration (including lightweight provenance, temporal and spatial context)

data refinement and enrichment (approximating missing values, resolve quality issues)

data publication (SPARQL, LOD, web UI)

Comparable city indicators

City Data Pipeline


Integrated Open Data is very sparse

[Figure: cities × indicators data matrices, annotated "51% values missing" and "97% values missing"]

But we need base indicators for all cities to compute comparable indicators


How can we fill in missing values?

Get more data – which makes data even sparser

Use domain knowledge …

Try to automatically fill in values …


Use domain knowledge to predict missing values

Eurostat: 62 equations for derived indicators (e.g., population density)

Unit conversions (e.g., QUDT ontology)

Use materialization or query rewriting for value computation [ESWC13]

Covers only a few indicators

How can we get more domain knowledge?
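The equation-based approach above can be illustrated with a minimal sketch. The indicator names, the example cities, and the small unit table (a stand-in for a units ontology such as QUDT) are hypothetical and only serve the illustration; the sketch does not reproduce any of the 62 Eurostat equations.

```python
# Sketch: derive a missing indicator from base indicators via a known equation.
# Indicator names, cities, and the unit table are hypothetical.

# Conversion factors to square kilometres (stand-in for a units ontology).
AREA_TO_KM2 = {"km2": 1.0, "ha": 0.01, "m2": 1e-6}


def to_km2(value, unit):
    """Convert an area value to square kilometres."""
    return value * AREA_TO_KM2[unit]


def derive_population_density(city):
    """Return inhabitants per km2, or None if a base indicator is missing."""
    population = city.get("population")
    area = city.get("area")  # (value, unit) pair or None
    if population is None or area is None:
        return None
    area_km2 = to_km2(*area)
    if area_km2 <= 0:
        return None
    return population / area_km2


cities = {
    "A": {"population": 1_800_000, "area": (41_500, "ha")},
    "B": {"population": 500_000, "area": None},
}
densities = {name: derive_population_density(c) for name, c in cities.items()}
```

City B stays without a value because one base indicator is missing, which is the limitation noted above: equations only help where the base indicators exist.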


Use machine learning to predict missing values

Deploy and combine a portfolio of different regression methods:

• Multiple linear regression (MLR)

• K-nearest neighbour (KNN)

• Random forest decision trees (RFD)

Validation: 10-fold cross validation

Quality measure to pick the best method/indicator: normalized root mean square error in %

However: many/most machine learning methods need more or less complete training data!
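A minimal sketch of the selection procedure described on this slide: each method is scored by 10-fold cross-validation, and the one with the lowest normalized RMSE in % is picked. Two points are assumptions: the RMSE is normalized by the range of the actual values (the slide does not state the normalization), and the three single-predictor methods are simple stand-ins for the MLR/KNN/RFD portfolio. All function names are hypothetical.

```python
import math
import random


def rmse_percent(actual, predicted):
    """RMSE divided by the range of the actual values, in percent (assumed)."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    value_range = max(actual) - min(actual)
    return 100.0 * math.sqrt(mse) / value_range


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_linear(xs, ys):
    """Ordinary least squares with a single predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_knn(xs, ys, k=3):
    """k-nearest neighbour: average the targets of the k closest points."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def cross_validate(fit, xs, ys, folds=10, seed=0):
    """Return RMSE% over all held-out predictions of a k-fold split."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    actual, predicted = [], []
    for f in range(folds):
        test = set(idx[f::folds])
        train = [i for i in idx if i not in test]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            actual.append(ys[i])
            predicted.append(model(xs[i]))
    return rmse_percent(actual, predicted)


# Synthetic, noise-free data in which y depends linearly on x.
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 5.0 for x in xs]
methods = {"mean": fit_mean, "linear": fit_linear, "knn": fit_knn}
scores = {name: cross_validate(fit, xs, ys) for name, fit in methods.items()}
best = min(scores, key=scores.get)
```

On this synthetic data the linear model wins by construction; on real indicators the best method may differ per indicator, which is why the choice is made per target.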


Approach 1: Complete subset regression

For each target indicator

• Find top-k predictors based on correlation matrix and form a complete subset

• Apply all methods (MLR, KNN, RFD), compute RMSE% and select best method
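The first step (top-k predictors plus complete subset) might look as follows. This is a sketch under assumptions: the data layout (city → indicator → value or None), the use of pairwise-complete Pearson correlation, and all names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlation(data, a, b):
    """Correlate indicators a and b over the cities that have both values."""
    pairs = [(row[a], row[b]) for row in data.values()
             if row.get(a) is not None and row.get(b) is not None]
    if len(pairs) < 3:
        return 0.0
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])


def complete_subset(data, target, k):
    """Pick the k predictors most correlated with target, keep complete rows."""
    indicators = sorted({i for row in data.values() for i in row} - {target})
    ranked = sorted(indicators,
                    key=lambda i: abs(pairwise_correlation(data, target, i)),
                    reverse=True)
    predictors = ranked[:k]
    columns = predictors + [target]
    # Keep only cities with a value for every selected column.
    rows = {city: [row[c] for c in columns]
            for city, row in data.items()
            if all(row.get(c) is not None for c in columns)}
    return predictors, rows


data = {
    "c1": {"pop": 100.0, "cars": 52.0, "rain": 7.0, "waste": 210.0},
    "c2": {"pop": 200.0, "cars": 98.0, "rain": 3.0, "waste": 390.0},
    "c3": {"pop": 300.0, "cars": 151.0, "rain": 9.0, "waste": 620.0},
    "c4": {"pop": 400.0, "cars": None, "rain": 4.0, "waste": 800.0},
    "c5": {"pop": 500.0, "cars": 249.0, "rain": 6.0, "waste": None},
}
predictors, rows = complete_subset(data, target="waste", k=2)
```

Only three of the five cities survive in this toy example, which illustrates why the complete subset shrinks quickly as k grows.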

[Figure: cities × indicators matrix; the complete subset is passed to MLR, KNN and RFD]


Approach 1: How many predictors needed?


Approach 2: Principal component regression

Fill in missing values with neutral values w.r.t. PCA [Roweis’97]

For each target indicator

• Find top-k predictors among the PCs based on correlation matrix

• Again: apply MLR, KNN, RFD, compute RMSE% and select the best method
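A rough sketch of the fill-and-project step, under one stated assumption: "neutral value" is read here as the column mean, i.e. 0 after standardization, so that a missing entry does not pull a city along any component. The sketch computes only the first principal component by power iteration; it is not the method of [Roweis’97], and all names and values are hypothetical.

```python
import math


def standardize_and_fill(matrix):
    """Z-score each column over its observed values; missing (None) -> 0.0.

    After standardization the column mean is 0, so 0.0 is the assumed
    neutral fill value.
    """
    n_cols = len(matrix[0])
    columns = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        mean = sum(observed) / len(observed)
        var = sum((v - mean) ** 2 for v in observed) / len(observed)
        std = math.sqrt(var) or 1.0
        columns.append([0.0 if row[j] is None else (row[j] - mean) / std
                        for row in matrix])
    # Transpose back to rows (cities) x columns (indicators).
    return [list(row) for row in zip(*columns)]


def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the covariance matrix via power iteration.

    Note: a start vector orthogonal to the leading eigenvector would not
    converge to it; the all-ones start is fine for this example.
    """
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec


def project(rows, component):
    """Score of each city on the given component."""
    return [sum(a * b for a, b in zip(row, component)) for row in rows]


# Rows are cities, columns are indicators; None marks a missing value.
matrix = [
    [1.0, 2.0, None],
    [2.0, 4.1, 1.0],
    [3.0, None, 2.0],
    [4.0, 8.2, 2.9],
    [5.0, 9.9, 4.2],
]
filled = standardize_and_fill(matrix)
pc1 = first_principal_component(filled)
pc_scores = project(filled, pc1)
```

Every city now has a score on the component, so the component can serve as a complete predictor column for the regression methods.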

[Figure: cities × indicators matrix transformed into principal components, which are passed to MLR, KNN and RFD]


Approach 2: How many predictors needed?


Cross-dataset prediction 1/2: (How) can this be used for cross-dataset prediction?

[Figure: data matrix, cities × indicators]



Con:

• Not great: avg. RMSE for both directions over 10%

• Could transfer a "bias" from one dataset's context to the other

Pro:

• For some indicators it works quite well


Cross-dataset prediction 2/2: Pairwise linear regression can be used to "learn ontology mappings" from values

Compare the values of each Eurostat indicator with each UN indicator

Find linear dependencies between pairs (equations) or equal pairs (equivalent properties)

Robust linear regression is necessary to handle outliers
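The pairwise comparison could be sketched as below. The slides do not name the robust regression method, so the Theil-Sen estimator (median of pairwise slopes) is swapped in here as one well-known robust option. The tolerance, the classification labels, and the example values are hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen(xs, ys):
    """Robust line fit: median of pairwise slopes, median-based intercept."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
              if x2 != x1]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


def classify_pair(xs, ys, tol=0.05):
    """Label an indicator pair as 'equivalent', 'linear', or 'unrelated'."""
    slope, intercept = theil_sen(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    scale = median(abs(y) for y in ys) or 1.0
    if median(residuals) / scale > tol:
        return "unrelated", slope, intercept
    if abs(slope - 1.0) <= tol and abs(intercept) <= tol * scale:
        return "equivalent", slope, intercept
    return "linear", slope, intercept


# Same quantity in thousands vs. in units, with one outlier in the last city.
in_thousands = [100.0, 250.0, 400.0, 800.0, 1200.0]
in_units = [100000.0, 250000.0, 400000.0, 800000.0, 5.0]
scaled = classify_pair(in_thousands, in_units)

# Two indicators that carry identical values for the overlapping cities.
same = classify_pair([10.0, 20.0, 30.0, 40.0], [10.0, 20.0, 30.0, 40.0])
```

Despite the outlier, the first pair yields a slope of 1000, suggesting a unit-conversion equation; the second pair is a candidate for an equivalent-property mapping. Such results remain conjectures to be confirmed or rejected manually.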

[Figure: data matrix, cities × indicators]


Conclusion on various things we tried:

Approach 1: complete subsets

good results (0.25 RMSE%), but covers only a few cities/indicators

Approach 2: principal component regression

predicts more missing values, but quality is not always good

Cross dataset prediction in general:

interesting ("highest gain"), but bad error rates with the methods tested so far

Ontology learning from instance data:

several "conjectured" relationships derivable (usable to "confirm/reject" manual mappings), but needs datasets with overlapping cities

US Census??


Now what's Semantic Web/Linked Data here?

City Data Pipeline Dataset available openly! http://citydata.wu.ac.at/

Data accessible as Linked Open Data, via SPARQL endpoint, and WebUI

Original values including data source for each value

Predicted values including error estimates


Current and future work

Ongoing

• Encode data in RDF Data Cube vocabulary and provenance in PROV

• Use other methods, e.g., SVM or robust linear regression for PCR

Future Work

• Some form of time-series analysis

• Add more data sources (Carbon Disclosure Project, QuerioCity)

• Integrate GIS data sources (OSM, Linked Geo Data)

...Last, but not least:

Our assumption/driver: predictions get better the more open data we integrate ...