![Page 1: Extracting Metadata for Spatially- Aware Information Retrieval on the Internet Clough, Paul University of Sheffield, UK Presented By Mayank Singh](https://reader037.vdocuments.net/reader037/viewer/2022110405/56649edb5503460f94bea519/html5/thumbnails/1.jpg)
Extracting Metadata for Spatially-Aware Information Retrieval on the Internet
Clough, Paul
University of Sheffield, UK
Presented By
Mayank Singh
Overview:
• The importance of the experiment.
• Introduction to SPIRIT and GATE.
• Techniques employed – Geo-Parsing and Geo-Coding.
• Pros
• Cons
• What it leads to.
The importance of the experiment:
• A novel system.
• Geospatial information extraction from Web documents.
• Annotating the retrieved documents with the spatial data.
• Using the annotated documents to power a working GIR system.
How does it work (summary)
Extracting geospatial references from a document involves:
– Identifying geographic references (geo-parsing)
– Assigning them spatial co-ordinates (geo-coding)
– Factors influencing the above: speed, reliability, flexibility and multilingualism.
Introduction to SPIRIT
• SPIRIT: Spatially-Aware Information Retrieval on the Internet
• The main aim of the project is to create tools and
techniques to help people find information that
relates to specified geographical locations.
1 TB crawl of about 9 million web documents focused on the UK, Germany, France and Switzerland, supported by an ontology of places.
Relevance ranking of web documents catering to the needs of:
• Documents referring to some place of interest
• Digital geospatial resources
GATE
GATE is a Java suite for Natural Language Processing tasks, widely used in the area of Information Extraction. ANNIE (A Nearly-New Information Extraction system), the GATE component employed by SPIRIT, is the highlight of this experiment.
ANNIE
• Tokenizer
• Gazetteer
• Sentence splitter
• Part-of-speech tagger
• Named-Entity transducer
Spatial Markup
Sources of spatial markup:
• OS – Ordnance Survey (UK, point)
• TGN – Getty Thesaurus of Geographic Names (Global, point)
• SABE – Seamless Administrative Boundaries of Europe (Europe, polygon)
Geo-Parsing
• Named-Entity Recognition – lists + rules
• List lookup alone is inefficient
• Gazetteer lookup first, then contextual evidence to confirm each candidate.
• JAPE (Java Annotation Patterns Engine) – rules defined in terms of entities identified within GATE.
• Rules are language independent (using Systran system)
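The two-step idea above (gazetteer lookup, then contextual evidence) can be sketched in plain Python. This is only an illustration, not GATE or JAPE code: the gazetteer entries, the cue words and the function name `geo_parse` are all assumptions made for the example, and the actual SPIRIT rules are not given in the slides.

```python
import re

# Hypothetical mini-gazetteer; real entries would come from OS, TGN or SABE.
GAZETTEER = {"sheffield", "reading", "bath", "london"}

# Hypothetical context cues: words suggesting that the next token is a place.
PLACE_CUES = {"in", "near", "at", "from", "to"}


def geo_parse(text):
    """Return (token, token_index) pairs accepted as place names."""
    tokens = re.findall(r"[A-Za-z]+", text)
    hits = []
    for i, tok in enumerate(tokens):
        # Step 1: gazetteer lookup (case-insensitive).
        if tok.lower() not in GAZETTEER:
            continue
        # Step 2: contextual evidence. Here: the token is capitalised
        # and the preceding word is a locative cue.
        capitalised = tok[0].isupper()
        cued = i > 0 and tokens[i - 1].lower() in PLACE_CUES
        if capitalised and cued:
            hits.append((tok, i))
    return hits


print(geo_parse("I was reading a book in Sheffield near Bath"))
```

In this toy rule set the lowercase word "reading" matches the gazetteer but is rejected by the context check, while "Sheffield" and "Bath" are accepted.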
Hurdles faced
• Filtering out commonly used words – especially those used in a non-geographical sense.
• Using person-name list to filter out ambiguity between places and names.
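A minimal sketch of the two filters named above, assuming a small stop list of common words and a small person-name list. Both lists, the title set and the function names are hypothetical; the slides do not describe how SPIRIT implements these filters.

```python
# Hypothetical lists; a real system would use much larger resources.
COMMON_WORDS = {"reading", "bath", "nice"}
PERSON_FIRST_NAMES = {"jack", "paris", "chelsea"}
TITLES = {"mr", "mrs", "ms", "dr"}


def is_probably_person(tokens, i):
    """True if the token before position i is a title or a known first name."""
    if i == 0:
        return False
    prev = tokens[i - 1].lower().rstrip(".")
    return prev in PERSON_FIRST_NAMES or prev in TITLES


def filter_places(tokens, candidate_indices):
    """Keep only candidate indices that survive both filters."""
    kept = []
    for i in candidate_indices:
        tok = tokens[i]
        # Filter 1: lowercase common word used in a non-geographical sense.
        if tok.islower() and tok in COMMON_WORDS:
            continue
        # Filter 2: candidate looks like part of a person name.
        if is_probably_person(tokens, i):
            continue
        kept.append(i)
    return kept


tokens = "Jack London enjoyed reading in London".split()
print(filter_places(tokens, [1, 3, 5]))
```

Here the first "London" is dropped because it follows a known first name, "reading" is dropped as a common word, and only the final "London" is kept.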
Geo-Coding
• Gazetteer lookup to assign co-ordinates
• Removing ambiguity in place names: by feature hierarchy and feature type provided by OS.
• Actual grounding done by SABE and OS.
• TGN used to resolve global ambiguity.
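As a rough illustration of disambiguation by feature hierarchy and feature type, the sketch below prefers a gazetteer entry whose parent region is also mentioned in the document, and otherwise falls back to a ranking of feature types. The entries, identifiers, approximate co-ordinates, ranking and the function name `geo_code` are assumptions for the example; the slides do not spell out the exact rules used with OS, SABE and TGN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    uid: str           # hypothetical identifier
    name: str
    feature_type: str  # e.g. "city", "town", "village"
    parent: str        # next level up in the feature hierarchy
    lat: float
    lon: float


# Assumed preference order when nothing else disambiguates.
FEATURE_RANK = {"city": 0, "town": 1, "village": 2}

# Hypothetical gazetteer with approximate, illustrative co-ordinates.
GAZ = {
    "newport": [
        Entry("gaz:1", "Newport", "city", "Wales", 51.58, -2.99),
        Entry("gaz:2", "Newport", "town", "Isle of Wight", 50.70, -1.29),
    ],
}


def geo_code(name, context_names, gazetteer=GAZ):
    """Pick one gazetteer entry for name, or None if it is not listed."""
    entries = gazetteer.get(name.lower(), [])
    if not entries:
        return None
    context = {c.lower() for c in context_names}
    # Hierarchy: prefer entries whose parent is mentioned in the document.
    in_context = [e for e in entries if e.parent.lower() in context]
    pool = in_context or entries
    # Feature type: among the remaining entries take the best-ranked type.
    return min(pool, key=lambda e: FEATURE_RANK.get(e.feature_type, len(FEATURE_RANK)))


print(geo_code("Newport", ["Isle of Wight"]))
print(geo_code("Newport", []))
```

With "Isle of Wight" in the document context the town entry is chosen; with no context the fallback ranking selects the city entry.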
Experimental Setup
• Total annotated collection of about 8.8 million pages
• 22 out of the top 50 domains from Europe
• About 1.6 million documents containing 5-10 unique footprints selected. A further 10% chosen from these, and then only those from the UK (130)
• All geographic names (1864) manually identified and stored as a benchmark
Geo-parsing Results
SPIRIT + SABE + OS:
• Correct – 1340
• Missing – 479
• False Hits – 596
• Precision – 0.6966
• Recall – 0.7820
• F1 – 0.7184
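For reference, the textbook definitions of precision, recall and F1 in terms of correct, false-hit and missing counts can be written as below. The slides do not state the exact matching criteria behind the reported figures, so the function name `prf` and the counts in the usage line are purely illustrative.

```python
def prf(correct, false_hits, missing):
    """Standard precision, recall and F1 from raw counts."""
    returned = correct + false_hits   # everything the system marked
    expected = correct + missing      # everything in the benchmark
    precision = correct / returned if returned else 0.0
    recall = correct / expected if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only, not taken from the experiment.
p, r, f = prf(correct=8, false_hits=2, missing=2)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```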
Geo-Coding Results
• TGN ineffective due to global scope – 1021 found, 68% ambiguous.
• UK SABE good – 942 found, 11% ambiguous.
• 1137 places assigned a UID correctly, that is, with not only the correct geographic sense but also the correct resource order.
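One plausible reading of "found" and "ambiguous" is sketched below: a name is found if the gazetteer lists it at all, and ambiguous if it has more than one entry. This interpretation, the function name `ambiguity_rate` and the toy data are assumptions, not the paper's stated procedure.

```python
def ambiguity_rate(names, gazetteer):
    """Return (number of names found, fraction of found names that are ambiguous)."""
    found = [n for n in names if gazetteer.get(n.lower())]
    ambiguous = [n for n in found if len(gazetteer[n.lower()]) > 1]
    rate = len(ambiguous) / len(found) if found else 0.0
    return len(found), rate


# Toy gazetteer: each name maps to a list of candidate entry identifiers.
toy_gazetteer = {"newport": ["a", "b"], "sheffield": ["c"]}
print(ambiguity_rate(["Newport", "Sheffield", "Atlantis"], toy_gazetteer))
```

A gazetteer with global scope lists more places sharing a name, which is consistent with the higher ambiguity reported for TGN than for the UK-only SABE data.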
Conclusions
• Promising, with a success rate of 89%.
• Geo-parsing can be improved by enhancing gazetteer matching methods and filtering of non-geographic entries.
• Geo-coding can be improved by finding better methods for combining geographic resources.
Pros
• Novel system and high success rate.
• Towards a geospatial search engine.
• Spatial markup resources in abundance.
Cons
• Ambiguity (geographical)
• Matching correct geographical sense.
• Large overhead required to build such systems.
• Inherent NLP problems.
What it all leads to
• Creating geographical ontology to assist in GIR (Challenges and Resources for Evaluating Geographical IR - Bruno Martins, Mário J. Silva and Marcirio Silveira Chaves Faculdade de Ciências da Universidade de Lisboa 1749-016 Lisboa, Portugal)
• More focused Local and topical search (Urban Web Crawling - Dirk Ahlers OFFIS Institute for Information Technology Oldenburg, Germany; Susanne Boll University of Oldenburg Germany)
References
• Extracting Metadata for Spatially-Aware Information Retrieval on the Internet - Clough, Paul
• GATE - http://gate.ac.uk/overview.html
• SPIRIT - http://www.geo-spirit.org/project_full.html
• Challenges and Resources for Evaluating Geographical IR - Bruno Martins, Mário J. Silva and Marcirio Silveira Chaves Faculdade de Ciências da Universidade de Lisboa 1749-016 Lisboa, Portugal
• Urban Web Crawling - Dirk Ahlers OFFIS Institute for Information Technology Oldenburg, Germany; Susanne Boll University of Oldenburg Germany