Under the Hood: How GeoNames Aggregates Over 35 Sources into One Data Set

Post on 17-Jan-2015

DESCRIPTION

Speaker: Marc Wick

TRANSCRIPT

Page 1: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

GeoNames is ... aggregator of free geo data

I am ... Marc Wick, self-employed software engineer, Switzerland

GeoNames: "Under the Hood: How GeoNames Aggregates Many Sources into One Data Set"

Page 2: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

GeoNames, Marc Wick Web 2.0 Expo - 8. Nov 2007 Berlin 2

GeoNames Feature Density Map

Page 3: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

GeoNames - Gazetteer

Pragmatic, useful, ease of use
Over 6.5 million features
CC-BY licence
9 feature classes

Page 4: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Screenshot: Berlin

Page 5: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Origins and Goal

Proprietary application
Team up
Contribute modifications to a central database
Applications switch to GeoNames from proprietary aggregation

Page 6: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Challenge

A lot of data IS available
Many providers
Languages
Scripts

Page 7: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

GeoNames Ambassadors

GeoNames contact
Speak local language
Know local situation

Page 8: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Data Sources

National mapping agencies
Statistical offices
Postal codes
National Geospatial-Intelligence Agency (NGA)
Applications using GeoNames
− Data files
− Manual modifications

Page 9: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

US vs Europe

US data is freely available
European data is not available
Rest of the world?
Consequences

Page 10: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Page 11: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Future of geodata availability

We believe basic geodata will be free in most countries

Why:
− Economy
− Traffic policy and road safety (road signs)

Page 12: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Page 13: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Free Availability is only a First Step

Page 14: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Who aggregates data

GeoNames
Supranational mapping agencies
Supranational organisations

INSPIRE

Page 15: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Problems and Solutions I

Problems:
− Shape / GML
− Datum reprojection

Solutions:
− FWTools / GDAL/OGR
− PostGIS / EPSG / native tools / custom implementation
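
To make "custom implementation" of a datum reprojection concrete, here is a minimal sketch that converts Swiss CH1903/LV03 grid coordinates to WGS84 with the polynomial approximation published by swisstopo. The choice of CH1903 is an assumption for illustration only; the slides do not say which datums GeoNames handled this way. The function name is hypothetical and the coefficients should be checked against the official swisstopo document before real use.

```python
def ch1903_to_wgs84(easting: float, northing: float) -> tuple[float, float]:
    """Approximate CH1903/LV03 (metres) -> WGS84 (latitude, longitude) in degrees.

    Illustrative sketch only: accuracy is roughly at the metre level, which is
    usually enough for gazetteer points but not for surveying.
    """
    # Shift to the projection origin (Bern) and scale to units of 1000 km.
    y = (easting - 600_000.0) / 1_000_000.0
    x = (northing - 200_000.0) / 1_000_000.0

    # Polynomial series; both results are in units of 10000 arc seconds.
    lon = (2.6779094
           + 4.728982 * y
           + 0.791484 * y * x
           + 0.1306 * y * x ** 2
           - 0.0436 * y ** 3)
    lat = (16.9023892
           + 3.238272 * x
           - 0.270978 * y ** 2
           - 0.002528 * x ** 2
           - 0.0447 * y ** 2 * x
           - 0.0140 * x ** 3)

    # Convert 10000 arc seconds to degrees: 10000 / 3600 = 100 / 36.
    return lat * 100.0 / 36.0, lon * 100.0 / 36.0
```

For most sources a general-purpose tool such as GDAL/OGR or PostGIS is the safer route; a hand-written formula like this only makes sense when a source uses a single, well-documented national grid.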

Page 16: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Problems and Solutions II

Problems:
− Feature codes not 1:1
− Non-ASCII
− Country codes
− Admin1 codes

Solutions:
− Pattern matching
− Transliteration
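
A minimal sketch of the transliteration step for Latin-script names, assuming the goal is a plain-ASCII form for matching and search. It uses Unicode decomposition from the Python standard library; the `FALLBACK` table and the function name are hypothetical, and this is not necessarily how GeoNames does it.

```python
import unicodedata

# Letters that Unicode decomposition cannot split into base letter + accent.
# This short table is an illustrative assumption, not a complete mapping.
FALLBACK = {
    "ß": "ss", "ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE",
    "ł": "l", "Ł": "L", "đ": "d", "Đ": "D",
}


def to_ascii(name: str) -> str:
    """Return an ASCII approximation of a Latin-script place name."""
    out = []
    for ch in name:
        if ch in FALLBACK:
            out.append(FALLBACK[ch])
            continue
        # NFKD splits e.g. "ü" into "u" plus a combining diaeresis;
        # the combining marks are then dropped.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    # Anything still outside ASCII (e.g. Cyrillic, Arabic) is discarded.
    return "".join(out).encode("ascii", "ignore").decode("ascii")
```

Names in non-Latin scripts come out empty here; those need per-script romanization tables, which this sketch does not attempt.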

Page 17: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Place name matching

Geocoding
Distance
Feature type and feature code
Reverse geocoding, compare name similarity
− Levenshtein distance
− Letter pair similarity
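
The two name-similarity measures named on this slide can be sketched as follows. Letter pair similarity is taken here to mean the Dice coefficient over adjacent character pairs, which is an assumption about what the slide refers to; the function names are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def letter_pairs(text: str) -> Counter:
    """Multiset of adjacent letter pairs, computed per word."""
    pairs = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def letter_pair_similarity(a: str, b: str) -> float:
    """Dice coefficient on letter pairs: 1.0 = identical, 0.0 = nothing shared."""
    pa, pb = letter_pairs(a), letter_pairs(b)
    total = sum(pa.values()) + sum(pb.values())
    if total == 0:
        return 0.0
    shared = sum((pa & pb).values())  # multiset intersection
    return 2.0 * shared / total
```

In a matching pipeline these scores would be combined with the distance and feature-code checks listed above, for example by only comparing names of candidates that lie within a few kilometres of each other.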

Page 18: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Page 19: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Wikipedia GeoTemplates

Proliferation of geo formats
No consensus, anarchy
Examples
− <geo>48 46 36 N 121 48 51 W</geo>
− {{coor d|48.7767|N|121.8142|W|}}
− Berlin: |lat_deg = 52|lat_min = 31
− ... (Any template you could possibly think of is used somewhere)
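
A minimal sketch of what parsing these templates involves, covering only the first two example formats on the slide. The regular expressions and function names are hypothetical; a real extractor needs many more patterns, including the infobox-style `lat_deg` / `lat_min` parameters, which are not handled here.

```python
import re

# <geo>48 46 36 N 121 48 51 W</geo> : degrees, minutes, seconds, hemisphere
GEO_TAG = re.compile(
    r"<geo>\s*(\d+)\s+(\d+)\s+(\d+)\s+([NS])\s+"
    r"(\d+)\s+(\d+)\s+(\d+)\s+([EW])\s*</geo>")

# {{coor d|48.7767|N|121.8142|W|}} : decimal degrees, hemisphere
COOR_D = re.compile(
    r"\{\{coor d\|([\d.]+)\|([NS])\|([\d.]+)\|([EW])\|[^}]*\}\}")


def signed(value, hemisphere):
    """South and west are negative in decimal degrees."""
    return -value if hemisphere in ("S", "W") else value


def parse_coordinates(wikitext):
    """Return (lat, lon) in decimal degrees, or None if no known format matches."""
    m = GEO_TAG.search(wikitext)
    if m:
        d1, m1, s1, h1, d2, m2, s2, h2 = m.groups()
        lat = int(d1) + int(m1) / 60 + int(s1) / 3600
        lon = int(d2) + int(m2) / 60 + int(s2) / 3600
        return signed(lat, h1), signed(lon, h2)
    m = COOR_D.search(wikitext)
    if m:
        lat, h1, lon, h2 = m.groups()
        return signed(float(lat), h1), signed(float(lon), h2)
    return None
```

Both slide examples describe the same point: the degrees/minutes/seconds form converts to roughly 48.7767 N, 121.8142 W.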

Page 20: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Alternate Names

...
Italian: Berlino
English: Berlin
Arabic: برلين
Korean: 베를린
Thai: เบอร์ลิน
Russian: Берлин
Chinese: 柏林
Marathi: बर्लिन
... (ca. 100 names)

Page 21: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Postal codes

Geocode – postal code numeric distance
Accuracy, completeness

ScribbleMaps by Robert Kosara

Page 22: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Page 23: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Page 24: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Data Dump

Flat CSV files
Simple format
Ease of use
Full daily dump
Daily modifications
RDF
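
A minimal sketch of reading such a flat dump file. It assumes the tab-separated, 19-column layout described in the readme that currently accompanies the GeoNames dump, which may differ from the layout at the time of this talk. The sample row holds illustrative placeholder values, not real dump content, and the function name is hypothetical.

```python
import csv

# Column order assumed from the current GeoNames dump readme.
COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames",
    "latitude", "longitude", "feature_class", "feature_code",
    "country_code", "cc2", "admin1_code", "admin2_code",
    "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]


def read_dump(lines):
    """Yield one dict per feature from an iterable of tab-separated lines."""
    # QUOTE_NONE: names may contain quote characters that are not CSV quoting.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        record = dict(zip(COLUMNS, row))
        record["latitude"] = float(record["latitude"])
        record["longitude"] = float(record["longitude"])
        yield record


# Illustrative row only; the values are placeholders for the example.
SAMPLE = ("2950159\tBerlin\tBerlin\tBerlino,Berlín\t52.52437\t13.41053\t"
          "P\tPPLC\tDE\t\t16\t\t\t\t0\t\t0\tEurope/Berlin\t2007-11-08")

for feature in read_dump(SAMPLE.splitlines()):
    print(feature["name"], feature["feature_code"],
          feature["latitude"], feature["longitude"])
```

Because each line is a self-contained record, the same reader works for the full dump and for a daily modifications file, assuming both share the column layout.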

Page 25: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Web Services

Search
− Ranking
  TF-IDF
  Relevancy
− I18n
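
To illustrate the TF-IDF idea behind the search ranking, here is a toy scorer over a handful of place names. It is a textbook formulation, assumed for illustration; it is not Lucene's scoring formula and not the GeoNames ranking, which also weighs relevancy signals the slides do not detail. The function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf_rank(query, documents):
    """Return the documents matching the query, best TF-IDF score first."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scored = []
    for index, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue
            tf = counts[term] / len(tokens)              # term frequency
            idf = math.log(n_docs / df[term]) + 1.0      # rarity of the term
            score += tf * idf
        if score > 0:
            scored.append((score, index))

    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [documents[index] for _, index in scored]
```

With names as documents, a shorter name that consists mostly of the query term scores higher than a long name that merely contains it, which is why pure TF-IDF needs additional relevancy signals for place search.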

Page 26: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Page 27: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Hierarchy Web Services

Hierarchy
Child
Neighbour
Sibling
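
A minimal sketch of how a client might address these four services. It only builds request URLs and sends nothing. The base URL, the JSON endpoint names and the `demo` username follow the GeoNames web service documentation as published today and are assumptions relative to the 2007 talk; the function name and the example id are for illustration.

```python
from urllib.parse import urlencode

# Assumption: the current public endpoint, not necessarily the 2007 one.
BASE_URL = "http://api.geonames.org"

SERVICES = {
    "hierarchy": "hierarchyJSON",    # all ancestors of a feature
    "children": "childrenJSON",      # direct children, e.g. states of a country
    "neighbours": "neighboursJSON",  # neighbouring countries
    "siblings": "siblingsJSON",      # features sharing the same parent
}


def hierarchy_url(service, geoname_id, username="demo"):
    """Build the request URL for one of the hierarchy services."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service}")
    query = urlencode({"geonameId": geoname_id, "username": username})
    return f"{BASE_URL}/{SERVICES[service]}?{query}"


# Example id only; look up the geonameId of the feature you actually need.
print(hierarchy_url("children", 2921044))
```

Fetching the URL and decoding the response is left out on purpose, since that depends on account limits and on the response layout of the live service.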

Page 28: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Architecture diagram (components)

Elevation data: GTOPO30, SRTM3
JDBC
Database: Postgres (PostGIS)
Lucene: full-text index, TF-IDF
Tomcat (Java)
Apache: mod_rewrite
JSON, jdom.org (XML), ROME (RSS)
JMS: ActiveMQ

Page 29: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Libraries

Java
Drupal
Ruby
PHP
Perl
Python
Lisp

Page 30: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Synchronization

Daily dump
Daily modifications
JMS
RDF dump, periodically

Page 31: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Linked Data

Page 32: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Applications using GeoNames

Thousands of applications
Search
Site navigation
Geocoding

Page 33: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Page 34: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Page 35: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Page 36: Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set

Thank you for your attention.