edin pelagios

www.inf.ed.ac.uk Institute for Language, Cognition and Computation The Edinburgh Geoparser and Chalice Claire Grover Kate Byrne, Richard Tobin, Jo Walsh

Upload: cegrover

Post on 24-Jan-2015


DESCRIPTION

Talk about the Edinburgh Geoparser and the Chalice project at the Pelagios workshop in London on 24/03/11.

TRANSCRIPT

Page 1: Edin pelagios

www.inf.ed.ac.uk

Institute for Language, Cognition and Computation

The Edinburgh Geoparser and Chalice

Claire Grover, Kate Byrne, Richard Tobin, Jo Walsh

Page 2: Edin pelagios


Overview of the Edinburgh Geoparser

•  System to automatically recognise place names in text and disambiguate them with respect to a gazetteer (e.g. Athens, Springfield).

•  Patchy development over the past few years, funded by a variety of projects and applied to a range of data sets:

– GeoCrossWalk

– BOPCRIS

– GeoDigRef (Histpop, BOPCRIS, BL)

– Embedding GeoCrossWalk (Stormont Papers)

– SYNC3 (online news)

– Chalice (EPNS)

– Unlock

•  Main concern has been to keep it generally usable while applying it to specific data sets.

Page 3: Edin pelagios


Overview of the Edinburgh Geoparser

[Pipeline diagram]

Input: .txt .html .xml

Geotagging: Format conversion, Tokenisation, POS tagging, Lemmatisation, Named Entity Recognition (output: .geotagged.xml)

Georesolution: Gazetteer lookup, Resolution (files: .geotagged.xml, .gaz.xml)
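The two stages named on this slide, geotagging and georesolution, can be sketched roughly as below. This is a toy illustration, not the Geoparser's code: the function names, the gazetteer-membership test standing in for POS tagging, lemmatisation and named entity recognition, the first-candidate resolution rule and the tiny in-memory gazetteer (with approximate coordinates) are all assumptions made for the example.

```python
import re

# Hypothetical toy gazetteer: name -> candidate (label, lat, lon) entries.
# Coordinates are approximate and for illustration only.
TOY_GAZETTEER = {
    "Athens": [("Athens, Greece", 37.98, 23.73),
               ("Athens, Georgia, USA", 33.96, -83.38)],
    "Edinburgh": [("Edinburgh, Scotland", 55.95, -3.19)],
}


def tokenise(text):
    """Split text into alphabetic word tokens (stand-in for tokenisation)."""
    return re.findall(r"[A-Za-z]+", text)


def geotag(text):
    """Stage 1: return the tokens treated as place names.

    Assumption: a token is a place name if the toy gazetteer knows it.
    """
    return [tok for tok in tokenise(text) if tok in TOY_GAZETTEER]


def georesolve(place_names):
    """Stage 2: look each name up and choose one candidate.

    Assumption: the first candidate wins; a real resolver would rank them.
    """
    resolved = {}
    for name in place_names:
        candidates = TOY_GAZETTEER.get(name, [])
        if candidates:
            resolved[name] = candidates[0]
    return resolved


tagged = geotag("The workshop moved from Edinburgh to Athens.")
resolved = georesolve(tagged)
print(tagged)
print(resolved)
```

Keeping the stages separate mirrors the diagram: the geotagging output can be inspected or replaced before any gazetteer is consulted.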

Page 8: Edin pelagios

Evaluation (2009)

SpatialML (gold geotagging)        GeoNames   Unlock
No. of place names                 3628       3628
No. for which gaz entries found    3538       3049
Correct within 5km                 2946       2143
As % of total                      81.2%      59.0%

SpatialML (end-to-end)             GeoNames
No. of place names                 3628
No. for which gaz entries found    2923
Correct within 5km                 2504
As % of total                      69.0%
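The "correct within 5km" criterion and the "% of total" rows can be illustrated with a short sketch. The slides do not say how distance was measured, so the great-circle (haversine) formula used here is an assumption, as are the function names; the counts are the ones reported above.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def is_correct(predicted, gold, threshold_km=5.0):
    """True if the predicted (lat, lon) lies within the threshold of gold."""
    return haversine_km(*predicted, *gold) <= threshold_km


def percent_of_total(correct, total):
    """Share of all place names that were resolved correctly."""
    return 100.0 * correct / total


TOTAL = 3628
REPORTED = [
    ("GeoNames, gold geotagging", 2946),
    ("Unlock, gold geotagging", 2143),
    ("GeoNames, end-to-end", 2504),
]
for label, correct in REPORTED:
    print(f"{label}: {percent_of_total(correct, TOTAL):.2f}%")
```

Recomputed from the counts, the GeoNames figures agree with the slide to one decimal place; the Unlock figure comes out at about 59.07%, which the slide shows as 59.0%.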

Page 9: Edin pelagios


Current Development Issues

•  Open source release

•  Increased configurability

–  Input formats: plain text, HTML, simple XML, ...

– User’s own text analysis: paragraphs, sentences, word tokens, place name mark-up

– Output formats: map visualisation, text mark-up, …

– User input: constrain by area, bounding box, …

•  Choice of gazetteer: GeoNames, Unlock, geonames-local, Pleiades+, Chalice historical gazetteer, ...

•  Performance monitoring/evaluation against test sets
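One of the user inputs listed above, constraining resolution by a bounding box, might work along these lines. This is a sketch only: the class and function names are hypothetical, the coordinates are approximate, the fall-back rule is an assumption, and boxes that cross the antimeridian are not handled.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    lat: float
    lon: float


class BoundingBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float

    def contains(self, c: Candidate) -> bool:
        """True if the candidate's coordinates fall inside the box."""
        return (self.south <= c.lat <= self.north
                and self.west <= c.lon <= self.east)


def constrain(candidates, box):
    """Prefer candidates inside the box; keep all of them if none are inside."""
    inside = [c for c in candidates if box.contains(c)]
    return inside or list(candidates)


# Rough box around Great Britain (approximate, for illustration only).
GB = BoundingBox(west=-8.0, south=49.5, east=2.0, north=61.0)
candidates = [
    Candidate("Springfield, Illinois, USA", 39.80, -89.64),
    Candidate("Springfield, Fife, Scotland", 56.29, -3.06),
]
print(constrain(candidates, GB))
```

Falling back to the full candidate list when nothing lies inside the box treats the constraint as a preference rather than a hard filter; a hard filter would simply return `inside`.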

Page 10: Edin pelagios


GAP project: Pleiades+

•  Based on the Pleiades set of ancient place names but extended in two ways:

•  by matching Pleiades place names against GeoNames place names in the same location and adding the GeoNames alternative names to the Pleiades+ list:

– adds three alternative names for the single Pleiades entry for "Autricum" ("Chartrez", "Chartres", "Shartr"), because "Autricum" is present in both Pleiades and GeoNames, with the same approximate location

•  at run-time, by looking up place names found in the text against GeoNames (as well as against Pleiades+) and then using the alternative names from GeoNames to match against the Pleiades+ list

– Pleiades has no entry for "Egypt". We look up the name in GeoNames and use its alternative names (which include "Aegyptus") to match back against Pleiades (which does include "Aegyptus"). (We don't want to simply take places directly from GeoNames because, when we tried it, we were swamped with irrelevant modern places having names corresponding to ancient toponyms.)
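The two extension steps can be sketched with the slide's own examples. The dictionaries below are toy stand-ins, not the real Pleiades or GeoNames data, and the function names are hypothetical. The same-location check of the first step is not shown: `CO_LOCATED_ALT_NAMES` is assumed to hold only GeoNames entries already judged to share a name and approximate location with a Pleiades entry.

```python
# Toy Pleiades list of ancient names (assumption: a plain set of names).
PLEIADES = {"Autricum", "Aegyptus"}

# Step 1 input: GeoNames alternative names for co-located, same-name entries.
CO_LOCATED_ALT_NAMES = {"Autricum": ["Chartrez", "Chartres", "Shartr"]}

# Step 2 input: GeoNames alternative names available at run-time.
GEONAMES_ALT_NAMES = {"Egypt": ["Aegyptus"]}


def build_pleiades_plus(pleiades, co_located_alt_names):
    """Static step: map every known name to its Pleiades entry."""
    plus = {name: name for name in pleiades}
    for name in pleiades:
        for alt in co_located_alt_names.get(name, []):
            plus.setdefault(alt, name)
    return plus


def lookup(name, plus, geonames_alt_names):
    """Run-time step: direct match first, then via GeoNames alternative names.

    A GeoNames place is never returned itself; it only bridges to Pleiades+.
    """
    if name in plus:
        return plus[name]
    for alt in geonames_alt_names.get(name, []):
        if alt in plus:
            return plus[alt]
    return None


PLEIADES_PLUS = build_pleiades_plus(PLEIADES, CO_LOCATED_ALT_NAMES)
for query in ["Chartres", "Egypt", "Springfield"]:
    print(query, "->", lookup(query, PLEIADES_PLUS, GEONAMES_ALT_NAMES))
```

Because `lookup` returns only Pleiades+ entries, a modern name with no ancient counterpart yields no match, which is the behaviour the slide argues for.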

Page 11: Edin pelagios


Chalice

•  Connecting Historical Authorities with Linked Data, Contexts, and Entities.

•  Funded under the JISC jiscEXPO programme on "exposing digital content for education and research".

•  The project is exploring the viability of creating a historical gazetteer from digitized volumes from the English Place-Name Society (EPNS).

•  Partners:

– CDDA, Queen’s University, Belfast

– School of Informatics, Edinburgh

– EDINA, Edinburgh

– CeRch, King’s College London

•  Informatics role is to adapt our existing text mining/geoparsing technology to convert the textual documents that are output from OCR into structured data.

Page 12: Edin pelagios


Chalice data

•  Cheshire

– Cheshire Part I. EPNS Volume 44, 1970

– Cheshire Part II. EPNS Volume 45, 1970

– Cheshire Part III. EPNS Volume 46, 1971

– Cheshire Part IV. EPNS Volume 47, 1972

– Cheshire Part V (1:i). EPNS Volume 48, 1981

– Cheshire Part V (1:ii). EPNS Volume 54, 1981

•  Small samples from:

– Berkshire, Buckinghamshire (Vol. 2), Cambridgeshire (Vol. 19), Derbyshire (Vols. 27-29), Hertfordshire (Vol. 15)

•  Shropshire: Pimhill Hundred (born digital)

Page 13: Edin pelagios


EPNS

•  Parishes are usually organised by the hundred to which they belong.

•  Towns and villages are usually referred to as townships and are organised by the parish to which they belong.

•  Township descriptions often contain relatively unstructured information about smaller associated places such as buildings, bridges, lanes, woods and farms.

•  Township descriptions also frequently contain separately marked sections of information about field names and street names.

•  River names and major road names are described separately from the inhabited place descriptions.

•  Place names are the primary object of interest; their descriptions contain alternative names and spellings that have been attested in historical sources, and the etymology of names or name parts.

•  In Chalice we focus on capturing parishes, townships, sub-townships and attestations. We don’t deal with hundreds, field names, street names, rivers, roads etc.
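The structure Chalice aims to capture (parishes containing townships, townships containing sub-townships, each with attestations) could be modelled along these lines. The class and field names are hypothetical and not the project's actual schema. Neston and Willaston are taken from the example entry in these slides; the sub-township and the attestation values are placeholders, not real EPNS data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attestation:
    form: str    # historical spelling as printed
    date: str    # date or date range as printed
    source: str  # abbreviation of the historical source


@dataclass
class Place:
    name: str
    kind: str  # "parish", "township" or "sub-township"
    attestations: List[Attestation] = field(default_factory=list)
    children: List["Place"] = field(default_factory=list)

    def walk(self, depth=0):
        """Yield (depth, place) pairs for this place and everything below it."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Placeholder values only: no real attested forms are reproduced here.
placeholder = Attestation("EXAMPLE-FORM", "EXAMPLE-DATE", "EXAMPLE-SOURCE")
farm = Place("Example Farm", "sub-township")
willaston = Place("Willaston", "township", [placeholder], [farm])
neston = Place("Neston", "parish", children=[willaston])

for depth, place in neston.walk():
    print("  " * depth + f"{place.kind}: {place.name}")
```

Hundreds, field names, street names, rivers and roads are deliberately absent from the model, matching the scope stated above.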

Page 15: Edin pelagios

The start of the entry for the township of Willaston in the parish of Neston in Wirral Hundred.

Page 22: Edin pelagios

Issues

•  OCR quality needs to be high: not just recognising characters correctly but getting font and layout information right. Failure to recognise bold and small caps fonts or the difference between a line break and a paragraph break can lead to major errors in the recognition process.

•  EPNS volumes vary in the use of layout and font to indicate structure (e.g. Cheshire parishes are signaled by centering combined with numbering with Roman numerals, while Hertfordshire ones are unnumbered but centered and in bold font). In some volumes potentially useful information is contained in footnotes.

•  Different volumes reflect different decisions about where place name information should be put. In most cases the information about the parish name occurs next to the town in the parish that has the same name. In the Shropshire text some place name information occurs in an earlier volume and is not subsequently repeated, e.g. the description of the parish of Baschurch, containing a township of the same name, has no attestation or etymological information provided because the name was discussed in Part 1.
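The dependence on layout and font can be made concrete with a sketch of per-volume parish-heading rules. Everything here is hypothetical: the `OcrLine` representation, the function name, the example heading text and in particular the "numeral plus full stop" pattern, which is a guess at the printed form rather than something the slides state.

```python
import re
from dataclasses import dataclass


@dataclass
class OcrLine:
    text: str
    centered: bool = False
    bold: bool = False


# Assumed printed form: a Roman numeral, a full stop, then the parish name.
ROMAN_PREFIX = re.compile(r"^[ivxlcdm]+\.\s+\S", re.IGNORECASE)


def is_parish_heading(line, volume):
    """Apply the volume-specific layout rule described in the slides."""
    if volume == "Cheshire":
        # Centered and numbered with a Roman numeral.
        return line.centered and bool(ROMAN_PREFIX.match(line.text))
    if volume == "Hertfordshire":
        # Unnumbered, centered and in bold.
        return line.centered and line.bold
    raise ValueError(f"no heading rule defined for volume: {volume}")


cheshire = OcrLine("iv. Example Parish", centered=True)
herts = OcrLine("Example Parish", centered=True, bold=True)
herts_bold_lost = OcrLine("Example Parish", centered=True, bold=False)

print(is_parish_heading(cheshire, "Cheshire"))
print(is_parish_heading(herts, "Hertfordshire"))
print(is_parish_heading(herts_bold_lost, "Hertfordshire"))
```

The third case shows why OCR font information matters: if the bold attribute is lost, a Hertfordshire-style heading is no longer recognised and the parish boundary is missed.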

Page 28: Edin pelagios

Thank you!