boa - how to integrate your language
BOA extracts knowledge (binary relations) from unstructured data such as free text. This tutorial uses Korean as an example to show how to adapt the BOA approach to your language.
AKSW, Universität Leipzig
BOA - How To Integrate Your Language
Daniel Gerber, Axel-Cyrille Ngonga Ngomo
Bootstrapping the Data Web
AKSW@KAIST - http://boa.aksw.org - 17.01.2012
General Overview
๏ Corpus
๏ Indexing
๏ Background Knowledge
๏ Surface forms
๏ Korean features
๏ Search & Scoring
๏ RDF extraction
๏ Evaluation
1. Create a corpus in your language
๏ At least 25M sentences
๏ Chunked into one sentence per line
๏ No HTML
๏ UTF-8?
๏ For later coreference resolution, the resource URL needs to be available
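The corpus requirements above can be sketched in a few lines of Python. This is a minimal illustration, not BOA's actual preprocessing: the regex-based tag stripping, the naive sentence splitter, and the tab-separated "URL, sentence" output layout are all assumptions made for this example. A real pipeline would use a proper sentence boundary detector for the target language.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                 # crude HTML tag matcher (assumption)
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")  # naive sentence boundary (assumption)


def clean_document(raw: str) -> list[str]:
    """Strip markup and return the document as a list of sentences."""
    text = TAG_RE.sub(" ", raw)               # no HTML
    text = html.unescape(text)                # decode entities such as &amp;
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return [s for s in SENTENCE_END_RE.split(text) if s]


def write_corpus(documents, path):
    """Write one sentence per line, UTF-8 encoded.

    `documents` is an iterable of (url, raw_text) pairs. Keeping the URL
    next to each sentence is a hypothetical layout that preserves the
    resource URL for later processing steps.
    """
    with open(path, "w", encoding="utf-8") as out:
        for url, raw in documents:
            for sentence in clean_document(raw):
                out.write(f"{url}\t{sentence}\n")
```

Calling `clean_document` on a small HTML fragment returns the plain sentences, one list entry per sentence.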
2. Corpus indexing
๏ Apache Lucene 3.4.0
๏ Set of >20 UTF-8 RegEx filters
๏ Whitespace Analyzer
➡ No stemming
➡ Tokenization on whitespace
➡ Stop-words included in index
➡ Lowercase version
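To make the indexing choices concrete, here is a small Python stand-in for the behaviour listed above: whitespace tokenization, no stemming, stop-words kept, lowercased tokens. It does not use the Lucene API. The function names and the in-memory inverted index are assumptions chosen for illustration only.

```python
from collections import defaultdict


def whitespace_tokens(sentence: str) -> list[str]:
    """Lowercase and split on whitespace only: no stemming, no stop-word removal."""
    return sentence.lower().split()


def build_index(sentences):
    """Map every token to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for token in whitespace_tokens(sentence):
            index[token].add(sentence_id)
    return index


def search(index, query: str) -> set:
    """Return ids of sentences containing all query tokens (boolean AND)."""
    tokens = whitespace_tokens(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

Because punctuation stays attached to tokens under pure whitespace splitting, "Honolulu." and "Honolulu" are different index terms in this sketch. That is one reason to normalize the corpus before indexing.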
3. Background knowledge I
ObjectProperties vs. DatatypeProperties
3. Background knowledge II
Field | Line #1 | Line #2
URI1 | http://dbpedia.org/resource/South_Korea | http://dbpedia.org/resource/KAIST
Label1 | 대한민국 | 한국 과학 기술원
Property | http://dbpedia.org/ontology/capital | http://dbpedia.org/ontology/country
URI2 | http://dbpedia.org/resource/Seoul | http://dbpedia.org/resource/South_Korea
Label2 | 서울 | 대한민국
Domain | http://dbpedia.org/ontology/PopulatedPlace | (none)
Range | http://dbpedia.org/ontology/PopulatedPlace | http://dbpedia.org/ontology/Country
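One way to hold such background knowledge in code is a small record type with the seven fields shown above. The sketch below is an assumption about a possible serialization (one tab-separated line per triple, empty field for a missing domain or range). The slides do not specify BOA's actual file format, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackgroundTriple:
    uri1: str
    label1: str
    prop: str
    uri2: str
    label2: str
    domain: Optional[str]   # None when the property has no declared domain
    range_: Optional[str]   # None when the property has no declared range


def parse_line(line: str, sep: str = "\t") -> BackgroundTriple:
    """Parse one line of the hypothetical seven-column format."""
    fields = [f.strip() for f in line.rstrip("\n").split(sep)]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    uri1, label1, prop, uri2, label2, domain, range_ = fields
    return BackgroundTriple(uri1, label1, prop, uri2, label2,
                            domain or None, range_ or None)
```

The second example line from the table, with its empty domain, would parse to a record whose `domain` is `None`.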
4. Surface form generation
๏ DBpedia Spotlight
➡ Labels
➡ Redirects
➡ Disambiguation
๏ Datatype Properties
➡ Person XY is born on 1st of October in 1972.
➡ Person XY is born on 1 October in 1972.
➡ Person XY is born on a Thursday in 1972
๏ Find and create those surface forms
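For datatype properties, the same value can appear in text in several written forms. The following sketch generates the two English date variants from the examples above, plus one common Korean date form. The function names are hypothetical, and the set of variants is illustrative rather than complete. Each language needs its own list.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def ordinal(day: int) -> str:
    """Return the English ordinal for a day of month, e.g. 1 -> '1st'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def english_date_forms(d: date) -> list[str]:
    """Surface forms matching the English example sentences."""
    month = MONTHS[d.month - 1]
    return [
        f"{ordinal(d.day)} of {month} in {d.year}",
        f"{d.day} {month} in {d.year}",
    ]


def korean_date_form(d: date) -> str:
    """Year-month-day form with the 년/월/일 markers (one variant only)."""
    return f"{d.year}년 {d.month}월 {d.day}일"
```

A form like "a Thursday in 1972" cannot be matched by string variants of the exact date alone, so surface form generation has to decide how much precision loss to tolerate.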
5. Korean feature extraction
Language Dependent
๏ ReVerb
๏ Wordnet Distance
๏ ?
๏ ?
Language Independent
๏ # of words
๏ # of stopwords
๏ # of occurrences
๏ ?
๏ ?
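The three count-based features named above are simple to compute once a stop-word list for the target language exists. The sketch below assumes whitespace tokenization and a case-insensitive substring match for occurrences. The function name and the returned dictionary keys are hypothetical, not BOA identifiers.

```python
def pattern_features(pattern: str, stopwords: set, sentences: list) -> dict:
    """Compute simple count features for one pattern string."""
    tokens = pattern.lower().split()
    needle = pattern.lower()
    return {
        # number of whitespace-separated words in the pattern
        "num_words": len(tokens),
        # how many of those words are in the supplied stop-word list
        "num_stopwords": sum(1 for t in tokens if t in stopwords),
        # number of sentences in which the pattern string occurs
        "num_occurrences": sum(1 for s in sentences if needle in s.lower()),
    }
```

The open "?" slots are where features specific to Korean would be added, since ReVerb and WordNet distance target English.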
6. Pattern search and scoring
Barack Obama was born in Honolulu.
➡ was born in
버락 오바마는 호놀룰루에서 태어났습니다.
Subject? Object? Predicate?
Named Entity Disambiguation!
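The word-order question can be made concrete with a small sketch that locates two entity labels in a sentence and returns the text between them and the text after the second one. The labels are passed in directly here, which sidesteps the named entity disambiguation step. The function name is hypothetical and plain string search is an assumption for illustration.

```python
from typing import Optional, Tuple


def split_around_labels(sentence: str, first_label: str,
                        second_label: str) -> Optional[Tuple[str, str]]:
    """Return (text between the labels, text after the second label).

    Returns None if the labels do not occur in this order.
    """
    start_first = sentence.find(first_label)
    if start_first == -1:
        return None
    end_first = start_first + len(first_label)
    start_second = sentence.find(second_label, end_first)
    if start_second == -1:
        return None
    end_second = start_second + len(second_label)
    between = sentence[end_first:start_second].strip()
    after = sentence[end_second:].strip()
    return between, after
```

For the English sentence the predicate sits between the labels ("was born in"). For the Korean sentence only the particle "는" sits between them, and the predicate "에서 태어났습니다." follows the second label. A pattern search for Korean therefore cannot rely on the between-text alone.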
7. RDF extraction
Barack Obama was born in Honolulu.
➡ Pattern: was born in
➡ Triple: Barack Obama, dbpedia-owl:birthPlace, Honolulu
버락 오바마는 호놀룰루에서 태어났습니다.
➡ Pattern: 에서 태어났습니다.
➡ Property: dbpedia-owl:birthPlace
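As a rough sketch of the extraction step, a learned pattern can be written as a template with subject and object slots and matched against a sentence to yield a triple. The `{s}`/`{o}` template syntax, the label-to-URI dictionary (a stand-in for entity disambiguation), and the Korean template built from the example sentence are all assumptions for illustration. BOA's internal pattern representation is not shown on the slides.

```python
BIRTH_PLACE = "http://dbpedia.org/ontology/birthPlace"


def extract_triple(sentence, template, prop, label_to_uri):
    """Try every ordered pair of known labels in the template.

    Returns a (subject URI, property URI, object URI) tuple or None.
    """
    for s_label, s_uri in label_to_uri.items():
        for o_label, o_uri in label_to_uri.items():
            if s_uri == o_uri:
                continue  # subject and object must be different resources
            if template.format(s=s_label, o=o_label) in sentence:
                return (s_uri, prop, o_uri)
    return None


def to_ntriples(triple) -> str:
    """Serialize a triple of URIs as one N-Triples line."""
    s, p, o = triple
    return f"<{s}> <{p}> <{o}> ."


# Hypothetical label dictionary covering both languages.
LABELS = {
    "Barack Obama": "http://dbpedia.org/resource/Barack_Obama",
    "Honolulu": "http://dbpedia.org/resource/Honolulu",
    "버락 오바마": "http://dbpedia.org/resource/Barack_Obama",
    "호놀룰루": "http://dbpedia.org/resource/Honolulu",
}
```

With the templates `"{s} was born in {o}"` and `"{s}는 {o}에서 태어났습니다"`, both example sentences yield the same triple, which is the point of mapping language-specific patterns to one shared property.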
8. Evaluation
1. Select properties P to evaluate (T100)
2. Query DBpedia for triples (and labels) with p ∈ P
3. Find sentences containing the labels
4. Assess whether the triple is expressed in the sentence
➡ Gold standard with 1000 annotated sentence/triple pairs
5. Run one BOA iteration on Gold Standard
6. Measure Precision/Recall/F-Measure
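The final measurement step can be sketched as a set comparison between the triples extracted by one BOA iteration and the gold standard triples. Representing triples as hashable tuples and returning 0.0 for empty sets are choices made for this example.

```python
def precision_recall_f1(extracted: set, gold: set):
    """Compare extracted triples against gold standard triples."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, three extracted triples of which two appear among four gold triples give a precision of 2/3, a recall of 1/2 and an F-measure of 4/7.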
Necessary resources for new language
๏ 50M sentences (preferably general knowledge)
๏ Sentence Boundary Disambiguation
๏ Part-of-speech tagger (helpful)
๏ Named Entity Recognition
๏ Named Entity Disambiguation
๏ Labels for resources
๏ SPARQL endpoint
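The last two resources work together: the SPARQL endpoint supplies triples for a property along with resource labels in the target language. The sketch below only builds the query string and does not send it anywhere. The function name, variable names, and default limit are assumptions. The language tag "ko" selects Korean labels.

```python
def build_label_query(property_uri: str, lang: str = "ko", limit: int = 1000) -> str:
    """Build a SPARQL query for triples of one property with labels in `lang`."""
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?sLabel ?o ?oLabel WHERE {{
  ?s <{property_uri}> ?o .
  ?s rdfs:label ?sLabel .
  ?o rdfs:label ?oLabel .
  FILTER (lang(?sLabel) = "{lang}" && lang(?oLabel) = "{lang}")
}}
LIMIT {limit}
""".strip()
```

Such a query covers object properties only. For datatype properties the object is a literal without a label, so the surface forms have to be generated instead.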
LOD2 Presentation . 02.09.2010 . http://lod2.eu
Thank you! Questions?
Daniel Gerber
Johannisgasse 26, Room 5-21
04103 Leipzig, Germany
SIMBA@AKSW
http://bis.informatik.uni-leipzig.de/DanielGerber
http://boa.aksw.org
http://code.google.com/p/boa