ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Upload: stefan-bischof

Post on 07-Feb-2017

TRANSCRIPT

Page 1: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

© Siemens AG 2015. All rights reserved

Collecting, integrating, enriching and republishing open city data as linked data

Stefan Bischof – Siemens, WU Vienna
Christoph Martin – WU Vienna
Axel Polleres – WU Vienna
Patrik Schneider – WU Vienna, TU Vienna

Page 2: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Page 2 October 2015 Corporate Technology © Siemens AG 2015. All rights reserved

Which city is the best? Compare cities!

Page 3: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

What we have: European Green City Index

Page 4: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Use data to compare cities

Idea: Exploit available open data on cities to compute comparable indicators

Use standard Semantic Web technologies for:

ontology-based data integration (including lightweight provenance, temporal and spatial context)

data refinement and enrichment (approximating missing values, resolving quality issues)

data publication (SPARQL, LOD, web UI)

City Data Pipeline » Comparable city indicators

Page 5: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Integrated Open Data is very sparse

[Figure: cities × indicators matrix. Annotations: 51% values missing; 97% values missing]

But we need base indicators for all cities to compute comparable indicators

Page 6: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

How can we fill in missing values?

Get more data – which makes data even sparser

Use domain knowledge …

Try to automatically fill in values …

Page 7: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Use domain knowledge to predict missing values

Eurostat: 62 equations for derived indicators (e.g., population density)

Unit conversions (e.g., QUDT ontology)

Use materialization or query rewriting for value computation [ESWC13]

Covers only a few indicators

How can we get more domain knowledge?
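
As a rough illustration of the materialization idea, the sketch below treats each equation (and each unit conversion) as a rule that derives one indicator from others and applies the rules until nothing new can be computed. The indicator names, the rule table and the function `materialize` are hypothetical and not taken from the Eurostat equations or from QUDT; only the population density example comes from the slide.

```python
from typing import Callable, Dict, Optional, Tuple

# One city's indicator values; None marks a missing value.
City = Dict[str, Optional[float]]

# Hypothetical rule table: target -> (input indicators, function).
EQUATIONS: Dict[str, Tuple[Tuple[str, ...], Callable[..., float]]] = {
    "population_density_per_km2": (
        ("population", "area_km2"),
        lambda population, area_km2: population / area_km2,
    ),
    # A unit conversion is just another equation (1 km2 = 100 ha).
    "area_km2": (("area_ha",), lambda area_ha: area_ha / 100.0),
}


def materialize(city: City) -> City:
    """Apply the equations repeatedly until no new value can be derived."""
    result = dict(city)
    changed = True
    while changed:
        changed = False
        for target, (inputs, fn) in EQUATIONS.items():
            if result.get(target) is not None:
                continue  # never overwrite an existing value
            args = [result.get(name) for name in inputs]
            if any(arg is None for arg in args):
                continue  # at least one input is still missing
            try:
                result[target] = fn(*args)
            except ZeroDivisionError:
                continue  # e.g. area of 0: leave the target missing
            changed = True
    return result


# Made-up numbers for an imaginary city, for illustration only.
example_city = {"population": 1000.0, "area_ha": 500.0}
derived = materialize(example_city)
```

Here the density can only be computed after the area has been converted, which is why the rules are applied in a loop rather than once.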

Page 8: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Use machine learning to predict missing values

Deploy and combine a portfolio of different regression methods:

• Multiple linear regression (MLR)

• K-nearest neighbour (KNN)

• Random forest decision trees (RFD)

Validation: 10-fold cross validation

Quality measure to pick the best method/indicator: normalized root mean square error in %

However: many/most machine learning methods need more or less complete training data!
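
The validation loop can be sketched roughly as follows. This is a simplified illustration, not the authors' setup: a one-predictor least-squares line stands in for MLR, random forests are left out to keep the example dependency-free, and normalizing the RMSE by the range of the observed values is an assumption, since the slide does not say which normalization is used. All function names are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Predictor = Callable[[float], float]
Trainer = Callable[[Sequence[float], Sequence[float]], Predictor]


def nrmse_percent(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE divided by the range of the actual values, in percent."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return 100.0 * math.sqrt(mse) / (max(actual) - min(actual))


def train_linear(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Least-squares line through the points (one-predictor stand-in for MLR)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)  # assumes xs are not all equal
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def train_knn(xs: Sequence[float], ys: Sequence[float], k: int = 3) -> Predictor:
    """Average target value of the k training points closest in x."""
    pairs = list(zip(xs, ys))

    def predict(x: float) -> float:
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def cross_validate(trainer: Trainer, xs: Sequence[float],
                   ys: Sequence[float], folds: int = 10) -> float:
    """Predict every sample from a model that never saw it, then score."""
    n = len(xs)  # assumes n > folds so every training set is non-empty
    predictions = [0.0] * n
    for fold in range(folds):
        train_idx = [i for i in range(n) if i % folds != fold]
        model = trainer([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in range(fold, n, folds):  # the held-out samples of this fold
            predictions[i] = model(xs[i])
    return nrmse_percent(ys, predictions)


def pick_best(methods: Dict[str, Trainer], xs: Sequence[float],
              ys: Sequence[float]) -> Tuple[str, float]:
    """Return the method with the lowest cross-validated NRMSE%."""
    scores = {name: cross_validate(t, xs, ys) for name, t in methods.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]
```

Running `pick_best` once per target indicator gives both the method to use for that indicator and an error estimate to publish alongside the predicted values.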

Page 9: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Approach 1: Complete subset regression

For each target indicator

• Find top-k predictors based on correlation matrix and form a complete subset

• Apply all methods (MLR, KNN, RFD), compute RMSE% and select best method

[Figure: cities × indicators matrix; a complete subset is passed to MLR, KNN and RFD]
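
The selection step of Approach 1 might look roughly like this: rank the other indicators by absolute correlation with the target, keep the top k, and restrict training to the cities where the target and all chosen predictors are present. The table layout and the function names are assumptions made for this sketch; the correlation is computed over pairwise-complete cities, which the slide does not specify.

```python
import math
from typing import Dict, List, Optional, Sequence

# indicator name -> one value per city (same city order in every list)
Table = Dict[str, List[Optional[float]]]


def pearson(a: Sequence[Optional[float]], b: Sequence[Optional[float]]) -> float:
    """Pearson correlation over the cities where both values are present."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(pairs) < 3:
        return 0.0  # too little overlap to say anything
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return sxy / math.sqrt(sxx * syy)


def top_k_predictors(table: Table, target: str, k: int) -> List[str]:
    """The k indicators most strongly (anti-)correlated with the target."""
    others = [name for name in table if name != target]
    others.sort(key=lambda name: abs(pearson(table[target], table[name])),
                reverse=True)
    return others[:k]


def complete_subset(table: Table, target: str,
                    predictors: Sequence[str]) -> List[int]:
    """Indices of cities that have the target and every predictor."""
    columns = [table[target]] + [table[p] for p in predictors]
    n_cities = len(table[target])
    return [i for i in range(n_cities)
            if all(column[i] is not None for column in columns)]
```

A larger k tends to shrink the complete subset, because every additional predictor can only remove cities from it.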

Page 10: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Approach 1: How many predictors needed?

Page 11: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Approach 2: Principal component regression

Fill in missing values with neutral values w.r.t. PCA [Roweis’97]

For each target indicator

• Find top-k predictors among the PCs based on correlation matrix

• Again: apply MLR, KNN, RFD, compute RMSE% and select the best method

[Figure: cities × indicators matrix » principal components » MLR, KNN, RFD]
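
To make the idea of "neutral" values concrete, here is a deliberately simplified sketch. It is not the EM procedure of the cited reference: it assumes that a missing entry is replaced by its column mean, so that after centering it is 0 and contributes nothing to the covariance sums, and it extracts only the first principal component by power iteration. Function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]  # one city, one entry per indicator


def fill_with_column_means(rows: Sequence[Row]) -> List[List[float]]:
    """Mean-fill missing entries and center each column.

    A filled entry becomes 0 after centering, which is the (assumed)
    sense in which it is neutral for PCA.
    """
    n_cols = len(rows[0])
    centered = [[0.0] * n_cols for _ in rows]
    for j in range(n_cols):
        present = [row[j] for row in rows if row[j] is not None]
        mean = sum(present) / len(present) if present else 0.0
        for i, row in enumerate(rows):
            centered[i][j] = 0.0 if row[j] is None else row[j] - mean
    return centered


def first_principal_component(centered: Sequence[Sequence[float]],
                              iterations: int = 200) -> List[float]:
    """Dominant eigenvector of X^T X via power iteration."""
    n_cols = len(centered[0])
    # Asymmetric start vector; a start exactly orthogonal to the
    # dominant direction would not converge (unlikely in practice).
    v = [1.0 / (j + 1) for j in range(n_cols)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        new_v = [sum(row[j] * s for row, s in zip(centered, scores))
                 for j in range(n_cols)]
        norm = math.sqrt(sum(c * c for c in new_v))
        if norm == 0.0:
            break  # all-zero data: keep the current vector
        v = [c / norm for c in new_v]
    return v


def project(centered: Sequence[Sequence[float]],
            component: Sequence[float]) -> List[float]:
    """Score of every city on the given component."""
    return [sum(x * w for x, w in zip(row, component)) for row in centered]
```

The scores returned by `project` are complete for every city, so they can serve as predictors for the regression methods even where the original indicators were missing.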

Page 12: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Approach 2: How many predictors needed?

Page 13: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Cross-dataset prediction 1/2: (How) can this be used for cross-dataset prediction?

[Figure: cities × indicators matrix]

Page 14: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Cross-dataset prediction 1/2: (How) can this be used for cross-dataset prediction?

Con:

• Not great: avg. RMSE for both directions over 10%

• Could transfer a "bias" from one dataset's context to the other

Pro:

• For some indicators it works quite well

Page 15: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Cross-dataset prediction 2/2: Pairwise linear regression can be used to "learn ontology mappings" from values

Compare the values of each Eurostat indicator with each UN indicator

Find linear dependencies of pairs (equations) or equal pairs (equivalent properties)

Robust linear regression is necessary to handle outliers

[Figure: cities × indicators matrix]
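
One way to sketch the pairwise comparison is with the Theil-Sen estimator, which takes the median of all pairwise slopes and is therefore not thrown off by a few outlying cities. The slides do not say which robust regression was used, so Theil-Sen is an assumption here, as are the function names and the thresholds `rel_tol` and `min_support`.

```python
from itertools import combinations
from statistics import median
from typing import Optional, Sequence, Tuple


def theil_sen(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Robust line fit: median pairwise slope, then median intercept."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
              if x2 != x1]
    slope = median(slopes)
    intercept = median([y - slope * x for x, y in zip(xs, ys)])
    return slope, intercept


def conjecture_mapping(
    xs: Sequence[Optional[float]],
    ys: Sequence[Optional[float]],
    rel_tol: float = 0.05,
    min_support: float = 0.8,
) -> Optional[Tuple[str, float, float]]:
    """Conjecture a relationship between two indicators from their values.

    Returns ("equivalent" | "linear", slope, intercept), or None if too few
    overlapping cities exist or too few of them lie close to the fitted line.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 3 or len({x for x, _ in pairs}) < 2:
        return None
    px, py = zip(*pairs)
    slope, intercept = theil_sen(px, py)
    close = sum(1 for x, y in pairs
                if abs(y - (slope * x + intercept)) <= rel_tol * max(abs(y), 1e-12))
    if close / len(pairs) < min_support:
        return None
    typical = median([abs(y) for y in py])
    same = abs(slope - 1.0) <= rel_tol and abs(intercept) <= rel_tol * typical
    return ("equivalent" if same else "linear", slope, intercept)
```

For example, if one dataset reports population in thousands and the other as an absolute count, the fitted slope is close to 1000 with an intercept near 0, which suggests an equation rather than an equivalent property. Such results are conjectures that still need manual confirmation.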

Page 16: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Conclusion on various things we tried:

Approach 1: complete subsets

• good results, 0.25 RMSE%

• covers only a few cities/indicators

Approach 2: principal component regression

• predicts more missing values

• quality is not always good

Cross-dataset prediction in general:

• interesting ("highest gain")

• bad error rates with methods tested so far

Ontology learning from instance data:

• several "conjectured" relationships derivable

• needs datasets with overlapping cities (usable to "confirm/reject" manual mappings)

US Census??

Page 17: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Now what's Semantic Web/Linked Data here?

City Data Pipeline Dataset available openly! http://citydata.wu.ac.at/

Data accessible as Linked Open Data, via SPARQL endpoint, and WebUI

Original values including data source for each value

Predicted values including error estimates
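
Querying a SPARQL endpoint over HTTP can be sketched with the standard library alone. The endpoint URL below is a placeholder, since the slides give only the project home page and not the endpoint path, and the query is a generic triple pattern because the dataset's vocabulary is not shown here. The `query` parameter and the JSON results media type follow the SPARQL 1.1 Protocol; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

# Placeholder endpoint: replace with the actual SPARQL endpoint URL.
ENDPOINT = "http://example.org/sparql"

# Generic query; real queries would use the dataset's own vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_query_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as an HTTP GET request URL."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_select(endpoint: str, query: str, timeout: float = 10.0) -> List[Dict]:
    """Send a SELECT query and return the result bindings as dictionaries."""
    request = urllib.request.Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


# Example call (needs network access and a real endpoint):
# for row in run_select(ENDPOINT, QUERY):
#     print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```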

Page 18: ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data

Current and future work

Ongoing

• Encode data in RDF Data Cube vocabulary and provenance in PROV

• Use other methods, e.g., SVM or robust linear regression for PCR

Future Work

• Some form of time-series analysis

• Add more data sources (Carbon Disclosure Project, QuerioCity)

• Integrate GIS data sources (OSM, Linked Geo Data)

Last, but not least, our assumption/driver:

Predictions get better the more open data we integrate.