Page 1: Retrieving Location-based Data on the Web Andrei Tabarcea, 14.02.2011

Retrieving Location-based Data on the Web

Andrei Tabarcea, 14.02.2011


Introduction

The goal is to find services and points of interest close to the user’s location

We call this “location-based search”

We try to find location information in web pages


MOPSI Search


MOPSI Search Results

Locally Managed Database

Users’ Collection

Open Web Searches

Combination of search results


Location Information in Webpages

- Site hosting information (owner address, server address etc.)

- HTML tags (geo-tags, address-tags, vcards for Google Maps etc.)

- Addresses, postal codes, phone numbers

- Well-known places


Main Challenges

- Find location information in webpages

- Find relevant information related to the found location information


Ad-Hoc Georeferencing

• The problem is how to extract and validate location data from semi-structured text

• Postal address is the most common location data found

• Our goal is to give geographical coordinates to services mentioned in web-pages

• We call this method ad-hoc georeferencing

<HTML>
<HEAD profile="http://geotags.com/geo">
<META name="geo.position" content="62.35;29.44">
<META name="geo.region" content="FI">
<META name="geo.placename" content="Joensuu">
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="http://www.joensuu.fi/tkt/sivutyyli.css" type="text/css">
<TITLE>Pages of Pasi Fränti</TITLE>
</HEAD>

VS.
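Geo meta tags like those in the page head shown on this slide can be read with a standard HTML parser. The following is a minimal sketch using Python's built-in `html.parser`; the names `GeoMetaParser` and `extract_geo` are hypothetical and are not part of the MOPSI system.

```python
from html.parser import HTMLParser


class GeoMetaParser(HTMLParser):
    """Collects <meta name="geo.*" content="..."> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.geo = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name.startswith("geo.") and attr.get("content"):
            self.geo[name] = attr["content"]


def extract_geo(html):
    """Return (lat, lon, placename, region); missing values are None."""
    parser = GeoMetaParser()
    parser.feed(html)
    lat = lon = None
    position = parser.geo.get("geo.position")
    if position:
        try:
            lat_text, lon_text = position.split(";")
            lat, lon = float(lat_text), float(lon_text)
        except ValueError:
            lat = lon = None  # malformed position string
    return lat, lon, parser.geo.get("geo.placename"), parser.geo.get("geo.region")
```

Pages without such tags yield only `None` values, which is the case where address detection in the page text becomes necessary.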


Extracting the Information

For each link:

- Extract plain text from html-file

- Detect street names by using gazetteer

- Extract additional service information

- Gather results as list

For result list:

- Evaluate relevance

- Arrange by distance

- Purge overlapping results

- Show results

- (Optionally) Save results
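The "arrange by distance" and "purge overlapping results" steps can be sketched as follows. This is only an illustration under stated assumptions: the slides do not define when two results overlap, so the sketch treats results closer than 20 metres to an already kept result as overlapping, and the names `Result`, `haversine_km` and `arrange_and_purge` are hypothetical. Relevance evaluation is left out.

```python
import math
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def arrange_and_purge(results, user_lat, user_lon, overlap_km=0.02):
    """Sort results by distance to the user and drop near-duplicates.

    overlap_km is an assumed threshold, not a value from the slides.
    """
    ordered = sorted(
        results, key=lambda x: haversine_km(user_lat, user_lon, x.lat, x.lon)
    )
    kept = []
    for cand in ordered:
        overlaps = any(
            haversine_km(cand.lat, cand.lon, k.lat, k.lon) < overlap_km for k in kept
        )
        if not overlaps:
            kept.append(cand)
    return kept
```

Because the list is sorted first, the result that survives among overlapping ones is the one nearest to the user.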


Problems

- How to evaluate relevance?

- Mixed keyword meanings

- No relation between keywords and addresses


Mobile Search Engine

Search Engine consists of:

• User interface

• Core server software

• Geocoded street-name database

[Diagram. Components: Mobile application, Web user interface, Core server software, Geocoded street-name database. Arrow labels: Address, Keyword, Coordinates; Keyword, Coordinates; Search results; Coordinates.]


Core Server software

[Diagram. Components: Relevant municipalities detector, Page parser, Georeferencing module, Address and description detector, Address validator, Geocoded database. Data labels: Keyword; Municipalities; Municipalities list; <keyword, municipality> query; Result links; Word list; Addresses; Coordinates; Keyword, Address, Coordinates; Results list; Sorted results list.]


Street-address Detection

• We use a rule-based pattern matching algorithm

• The detection of street-names is the starting point of the algorithm

• An address-block candidate is constructed by detecting typical address elements (street names, numbers, postal codes, telephone numbers and municipal names)

• Address block candidates are validated using the gazetteer
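A rule-based detector of this kind can be sketched with a regular expression plus a gazetteer lookup. The sketch below is an assumption-laden illustration, not the algorithm from the slides: the pattern is tuned only to the Finnish-style example "Koskikatu 17 80100 JOENSUU", telephone numbers are ignored, and `GAZETTEER`, `ADDRESS_RE` and `detect_addresses` are hypothetical names with toy data taken from the examples in this deck.

```python
import re

# Hypothetical miniature gazetteer: street name -> municipalities where it exists.
GAZETTEER = {
    "koskikatu": {"joensuu"},
    "freesenkatu": {"helsinki"},
}

# Candidate pattern: capitalised word, house number, optional 5-digit postal
# code, optional capitalised word that may be a municipality.
ADDRESS_RE = re.compile(
    r"\b(?P<street>[A-ZÅÄÖ][a-zåäö]+)\s+(?P<number>\d+[a-zA-Z]?)"
    r"(?:[,\s]+(?P<postal>\d{5}))?"
    r"(?:[,\s]+(?P<city>[A-ZÅÄÖ][A-Za-zÅÄÖåäö]+))?"
)


def detect_addresses(text, gazetteer=GAZETTEER):
    """Return validated (street, number, postal, city) tuples found in plain text."""
    found = []
    for m in ADDRESS_RE.finditer(text):
        street = m.group("street")
        municipalities = gazetteer.get(street.lower())
        if municipalities is None:
            continue  # street name not in the gazetteer: reject the candidate
        city = m.group("city")
        if city is not None and city.lower() not in municipalities:
            city = None  # trailing word is not a municipality known for this street
        address = (street, m.group("number"), m.group("postal"), city)
        if address not in found:  # the same address often appears twice on a page
            found.append(address)
    return found
```

Starting from the street name keeps the gazetteer lookup cheap: most capitalised words followed by a number are discarded by a single dictionary access.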


Title Detection

- Title detection (or company detection) is a Named Entity Recognition problem

- We designed a 2-step system to detect titles associated with addresses:

- Step 1: Fast dictionary match

- Step 2: Use a classifier to detect the title
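The control flow of such a two-step system might look like the sketch below. The slides do not say what the dictionary contains or which classifier is used, so both are assumptions here: `known_titles` is assumed to be a set of lowercase service names, `classifier` is a stand-in callable, and `detect_title` is a hypothetical name.

```python
def detect_title(candidate_words, known_titles, classifier):
    """Two-step title detection for the words found near an address.

    candidate_words: list of words found near the address.
    known_titles: set of lowercase known service names (assumed dictionary).
    classifier: callable taking the word list and returning a title or None
                (stand-in for a trained classifier).
    Returns (title, step), where step names the step that produced the title.
    """
    # Step 1: fast dictionary match, trying the longest word sequences first.
    n = len(candidate_words)
    for length in range(n, 0, -1):
        for start in range(0, n - length + 1):
            phrase = " ".join(candidate_words[start:start + length])
            if phrase.lower() in known_titles:
                return phrase, "dictionary"
    # Step 2: no dictionary hit, so defer to the classifier.
    return classifier(candidate_words), "classifier"
```

The classifier is only invoked when the cheap lookup fails, which is the point of putting the dictionary match first.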


Title Extractor

Usually, the text before the address holds relevant information

Joen Pizza Special Y-tunnus: 2129577-6 Käyntiosoite: Koskikatu 17 80100 JOENSUU Postiosoite Koskikatu 17 80100 JOENSUU Puhelin: 013-220246 Virallinen toimiala: Kahvila-ravintolat

address

words before the address
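Collecting the words before a detected address can be sketched in a few lines. The function name `words_before` and the window size `max_words` are hypothetical; the slides do not state how many words are considered.

```python
def words_before(text, address, max_words=6):
    """Return up to max_words words preceding the first occurrence of address.

    max_words must be positive. Returns an empty list if address is not found.
    """
    pos = text.find(address)
    if pos == -1:
        return []
    return text[:pos].split()[-max_words:]
```

On the example above, the six words before "Koskikatu 17 80100 JOENSUU" include the title "Joen Pizza Special" but also the business ID and the label "Käyntiosoite:", which suggests why a further step is needed to pick the title out of the candidate words.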


The Problem

- Results for keyword “kahvila” (Finnish for “café”), address: “Freesenkatu 1, Helsinki”

No title


System Architecture

[Diagram. Components: Dataset Collection, HTML parser, Dictionary matching, Title extractor, Classifier, Evaluator. Data labels: HTML pages; Parsed HTML; Match; No match; Title candidate; Tagged and hand-checked data; Training data; Evaluation data; Statistics; TITLE.]


Parsing HTML pages

- Current solution extracts text from HTML pages

- We don’t yet exploit the fact that the data comes from web pages, which carry structure and visual layout beyond plain text

- Proposed future solution:

  - Visual segmentation of web pages

  - Detection of the address block

  - Nearest-neighbor search considering text and visual characteristics

Joen Pizza Special Y-tunnus 2129577-6 Käyntiosoite Koskikatu 17 80100 JOENSUU Postiosoite Koskikatu 17 80100 JOENSUU Puhelin: 013-220246 Virallinen toimiala Kahvila-ravintolat


Questions

Thank you