HiTiME project description
Christian Roosendaal ([email protected]),
Vyacheslav Tykhonov ([email protected]),
HiTiME System developers IISH Amsterdam
HiTiME prototype data flow
Diagram labels (connected by steps 1 to 7): CMS (Drupal, WordPress, …), Source data, Input DB, NER, Knowledge Base, Training sets from IISH archives
Processing module
● Check for new documents
● Split into words
● Store in DB
Entity Recognize module
● Retrieve document tokens
● Send to NER by telnet
● If token is recognized entity → store in DB
Meanings module
● Look for sequences of entities
● Replace with known composite entities
IISH systems integration
HiTiME application
- Persons
- Organizations
- Locations
- Dates
- Professions
Clio-Infrastructure
● Infrastructure to store data from different systems
● Connect dates and locations with datasets
● Find relevant documents in time/location domain
● Visualize trends relevant to documents
LINKS: database with 8000+ professions
● Create training sets
Evergreen library system
● Create training sets for authority records
● Improve MARC21 records
Search (search.iisg.nl)
● Improve metadata
● Extend functionality with new filters
PID service
● Store entities
Knowledge base
● Export data to e.g. RDF, OWL, XML
OCR application
● Scans, posters, archives
External applications
● BWSA
● Timeline
● Visual Mets
System design
Diagram labels: Input data, HiTiME core, KB
doc_id      last_modified   data
Document 1  12-13-12 12:04  “Petrus Alma is great...”
Document 2  12-13-12 11:37  “...”
doc_id      last_modified   data
Document 1  12-13-12 12:04  “<person>Petrus Alma</person> is great...”
Document 2  12-13-12 11:37  “...”
● HiTiME core checks for new or updated documents in input database
● Input database can be any type of database with timestamps
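The check for new or updated documents could look roughly like the sketch below. It is only an illustration: the use of SQLite, the table name `documents`, the function name `fetch_changed_documents`, and the ISO-style timestamps (reading the slide's "12-13-12" as 13 December 2012) are all assumptions, not details from the slides.

```python
import sqlite3

def fetch_changed_documents(conn, last_checked):
    """Return rows whose last_modified timestamp is later than last_checked."""
    cur = conn.execute(
        "SELECT doc_id, last_modified, data FROM documents "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_checked,),
    )
    return cur.fetchall()

# Hypothetical input database; any database with timestamps would do.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, last_modified TEXT, data TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        (1, "2012-12-13 12:04", "Petrus Alma is great..."),
        (2, "2012-12-13 11:37", "..."),
    ],
)

# ISO-formatted timestamps sort correctly as plain strings.
changed = fetch_changed_documents(conn, "2012-12-13 12:00")
for doc_id, last_modified, data in changed:
    print(doc_id, last_modified, data)
```

Storing the time of the last check and passing it in on the next run keeps the core independent of the CMS that fills the input database.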
Database design (1/2)
Example string: “Petrus Alma is great”
Split text into words and store words separately in table:
doc_id  word_id  word
0       0        Petrus
0       1        Alma
0       2        is
0       3        great
Store coordinates of each word in coordinate table:
doc_id  sentence_id  position  word_id  meaning_flag  identity_id
0       0            0         0        0
0       0            1         1        0
0       0            2         2        0
0       0            3         3        0
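A minimal sketch of the word table and coordinate table, assuming SQLite and a crude split on "." and whitespace. The table names `words` and `coordinates` and the function `store_document` are hypothetical; the column names follow the slide.

```python
import sqlite3

def store_document(conn, doc_id, text):
    """Split text into sentences and words and fill both tables."""
    word_id = 0
    # Crude splitting: sentences on ".", words on whitespace.
    sentences = [s for s in text.split(".") if s.strip()]
    for sentence_id, sentence in enumerate(sentences):
        for position, word in enumerate(sentence.split()):
            conn.execute(
                "INSERT INTO words VALUES (?, ?, ?)", (doc_id, word_id, word)
            )
            # meaning_flag starts at 0, identity_id is still unknown (NULL).
            conn.execute(
                "INSERT INTO coordinates VALUES (?, ?, ?, ?, 0, NULL)",
                (doc_id, sentence_id, position, word_id),
            )
            word_id += 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (doc_id INTEGER, word_id INTEGER, word TEXT)")
conn.execute(
    "CREATE TABLE coordinates (doc_id INTEGER, sentence_id INTEGER, "
    "position INTEGER, word_id INTEGER, meaning_flag INTEGER, identity_id INTEGER)"
)

store_document(conn, 0, "Petrus Alma is great")
words = conn.execute(
    "SELECT doc_id, word_id, word FROM words ORDER BY word_id"
).fetchall()
coords = conn.execute("SELECT * FROM coordinates ORDER BY word_id").fetchall()
```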
Database design (2/2)
Processing of text by NER. Output of NER:
“Petrus” → PERS
“Alma” → PERS
“is” → 0
“great” → 0
Store in decision table:
word_id  NER   Frog  Heidel  UCTO  Decision
0        PERS                      PERS
1        PERS                      PERS
Update meaning_flag in coordinate table:
doc_id  sentence_id  position  word_id  meaning_flag  identity_id
0       0            0         0        1
0       0            1         1        1
0       0            2         2        0
0       0            3         3        0
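The two steps, storing the decision and updating meaning_flag, might be sketched as follows. The NER output is a hard-coded stand-in for the service, and the table names, the `record_decisions` function, and the use of SQLite are assumptions for illustration only.

```python
import sqlite3

# Stand-in for the NER output on the slide; a real run would query the NER service.
NER_OUTPUT = {"Petrus": "PERS", "Alma": "PERS", "is": None, "great": None}

def record_decisions(conn, doc_id, ner_output):
    """Store recognized words in the decision table and flag them."""
    rows = conn.execute(
        "SELECT word_id, word FROM words WHERE doc_id = ? ORDER BY word_id",
        (doc_id,),
    ).fetchall()
    for word_id, word in rows:
        label = ner_output.get(word)
        if label is None:
            continue  # not a recognized entity
        # With NER as the only tool, its label is also the decision.
        conn.execute(
            "INSERT INTO decisions VALUES (?, ?, NULL, NULL, NULL, ?)",
            (word_id, label, label),
        )
        conn.execute(
            "UPDATE coordinates SET meaning_flag = 1 "
            "WHERE doc_id = ? AND word_id = ?",
            (doc_id, word_id),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (doc_id INTEGER, word_id INTEGER, word TEXT)")
conn.execute(
    "CREATE TABLE coordinates (doc_id INTEGER, sentence_id INTEGER, "
    "position INTEGER, word_id INTEGER, meaning_flag INTEGER, identity_id INTEGER)"
)
conn.execute(
    "CREATE TABLE decisions (word_id INTEGER, ner TEXT, frog TEXT, "
    "heidel TEXT, ucto TEXT, decision TEXT)"
)
for i, word in enumerate("Petrus Alma is great".split()):
    conn.execute("INSERT INTO words VALUES (0, ?, ?)", (i, word))
    conn.execute("INSERT INTO coordinates VALUES (0, 0, ?, ?, 0, NULL)", (i, i))

record_decisions(conn, 0, NER_OUTPUT)
flags = [
    row[0]
    for row in conn.execute(
        "SELECT meaning_flag FROM coordinates ORDER BY word_id"
    )
]
decisions = conn.execute(
    "SELECT word_id, ner, decision FROM decisions ORDER BY word_id"
).fetchall()
```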
Improvement: Integration of FROG, UCTO and HeidelTime
● Prototype only uses NER, and crude methods to split raw text into sentences and words
● Splitting can be made more reliable with UCTO and FROG
● Time expressions are not recognized in prototype → HeidelTime
Improvement: Disambiguation of recognized entities (1/2)
Word       NER  Frog  Heidel  ...  Decision
Amsterdam  LOC                     LOC
Amsterdam is a location. Seems right, but what if the text means the VOC ship “Amsterdam”?
Improvement: Disambiguation of recognized entities (2/2)
NER can be trained to improve accuracy. By making use of differently trained NERs we can build an Expert System:
Word       NER  Frog  Heidel  NER2  NER3  Decision
Amsterdam  LOC                SHIP  BAND  ?
Final decision can be made based on priorities of trained models. Our idea is to assign lowest priorities to wide scope models.
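One way to read the priority idea is sketched below. The priority numbers, the `decide` function, and the rule for ties are all assumptions made for illustration; the slides do not specify how priorities are assigned or how conflicts between equally ranked models are resolved.

```python
def decide(labels, priority):
    """Pick the label of the highest-priority model that produced one.

    Returns None when no model answered or when the top-priority
    models disagree with each other.
    """
    answered = {model: label for model, label in labels.items() if label}
    if not answered:
        return None
    top = max(priority[model] for model in answered)
    candidates = {
        label for model, label in answered.items() if priority[model] == top
    }
    return candidates.pop() if len(candidates) == 1 else None

# Labels for "Amsterdam" from three differently trained models.
labels = {"NER": "LOC", "NER2": "SHIP", "NER3": "BAND"}

# Hypothetical priorities: the wide-scope general model ranks lowest.
ranked = {"NER": 0, "NER2": 2, "NER3": 1}
tied = {"NER": 0, "NER2": 1, "NER3": 1}

print(decide(labels, ranked))  # SHIP
print(decide(labels, tied))    # None: the specialised models disagree
```

When only the general model answers, its label is used, so a specialised model overrides the general one only when it actually recognizes the word.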
Ships
● Amsterdam (VOC ship), an 18th century cargo ship
● MS Amsterdam, a cruise ship owned and operated by Holland America Line
Music
● Amsterdam (band), a pop band from the United Kingdom
● "Amsterdam" (Jacques Brel song), a song by Jacques Brel
Improvement: “composite” entities (1/2)
In our prototype:
“Petrus Alma is great”
“Petrus” → Recognized as person
“Alma” → Recognized as person
Should be:
“Petrus Alma is great”
“Petrus Alma” → Recognized as one person
Improvement: “composite” entities (2/2)
Possible solution: Keep track of known entities in separate entities table:
identity_id  name              type
0            Petrus Alma       PERS
1            Aron van Dam      PERS
2            Frederik Feringa  PERS
Search for sequences of recognized entities in coordinate table:
doc_id  sentence_id  position  word_id  meaning_flag  identity_id
0       0            0         0        1             0
0       0            1         1        1             0
0       0            2         2        0
0       0            3         3        0
Sequence found: “Petrus Alma”
Compare these sequences with entities in entities table:
Final decision about entity:
identity_id  name         type
0            Petrus Alma  PERS
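The sequence search and comparison could be sketched as below, using in-memory rows instead of database tables. The function `link_composites` and the exact-match rule are assumptions: a run of adjacent flagged words is linked only if the whole run equals a known entity name, and word_id is treated as unique within the single example document.

```python
from itertools import groupby

def link_composites(rows, words, entities):
    """Assign identity_id to runs of adjacent flagged words matching an entity.

    rows: coordinate rows as dicts, ordered by doc_id, sentence_id, position.
    words: mapping word_id -> word.
    entities: mapping entity name -> identity_id.
    Returns the list of matched entity names.
    """
    matched = []

    def run_key(row):
        return (row["doc_id"], row["sentence_id"], row["meaning_flag"])

    # groupby yields consecutive rows that share document, sentence and flag.
    for (_, _, flag), run in groupby(rows, key=run_key):
        run = list(run)
        if flag != 1:
            continue
        name = " ".join(words[row["word_id"]] for row in run)
        if name in entities:
            for row in run:
                row["identity_id"] = entities[name]
            matched.append(name)
    return matched

words = {0: "Petrus", 1: "Alma", 2: "is", 3: "great"}
rows = [
    {"doc_id": 0, "sentence_id": 0, "position": p, "word_id": p,
     "meaning_flag": flag, "identity_id": None}
    for p, flag in enumerate([1, 1, 0, 0])
]
entities = {"Petrus Alma": 0, "Aron van Dam": 1, "Frederik Feringa": 2}

matched = link_composites(rows, words, entities)
identity_ids = [row["identity_id"] for row in rows]
```

Matching only whole runs keeps the sketch short; a fuller version would also have to try sub-sequences when two entities stand next to each other.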
BWSA application before processing
BWSA application after processing