Learning dictionaries from unannotated data
Post on 05-Jul-2015
Learning dictionaries from unannotated data.
Hristo Tanev
OPTIMA action
GlobeSec unit, IPSC
Outline of the talk
What are semantic dictionaries
Ontopopulis – a system for learning semantic dictionaries
Ontopopulis in use
Conclusions
NLP and dictionaries
Natural Language Processing (NLP) systems map a natural language text into some structured representation which is related, in one way or another, to the human understanding of language:
President Obama is meeting tonight with Apple CEO Steve Jobs → {Obama: PER; Apple: ORG; Steve Jobs: PER}
This process is often multi-level and complex, and it requires knowledge about language and the world: dictionaries, grammars, ontologies, …
Semantic dictionaries
Semantic dictionaries map words or phrases into domain-specific semantic classes:
boat : VEHICLE
gun : WEAPON
engineer : PERSON
swine flu : DISEASE
nice : POSITIVE_ADJECTIVE
Many NLP systems use semantic dictionaries:
Information extraction – [PEOPLE] in a [VEHICLE] (two people in a boat)
Opinion mining – lists of positive and negative words and phrases
Semantic dictionaries are one of the simplest ways to represent knowledge
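As a data structure, a semantic dictionary is just a map from phrases to class labels. A minimal Python sketch (the entries are taken from the examples above; the lookup function is illustrative, not part of any described system):

```python
# Minimal semantic dictionary: phrases mapped to semantic classes.
SEMANTIC_DICT = {
    "boat": "VEHICLE",
    "gun": "WEAPON",
    "engineer": "PERSON",
    "swine flu": "DISEASE",
    "nice": "POSITIVE_ADJECTIVE",
}

def tag_phrase(phrase):
    """Return the semantic class of a phrase, or None if unknown."""
    return SEMANTIC_DICT.get(phrase.lower())

print(tag_phrase("boat"))       # VEHICLE
print(tag_phrase("Swine flu"))  # DISEASE
```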
Semantic dictionaries
Semantic dictionaries are the most language- and domain-specific resources of NLP systems
They can be very large
They are expensive to create in terms of time and resources
They require domain and linguistic expertise
Ontopopulis – an automatic system for learning semantic dictionaries
The system is based on a modification of the weakly supervised method described in [Tanev and Magnini, "Weakly Supervised Approaches for Ontology Population", in Ontology Learning and Population: Bridging the Gap between Text and Knowledge, 2008]
The system is multilingual and knowledge-poor: it uses just an unannotated corpus and a list of stop words
In contrast with state-of-the-art systems for learning semantic classes, Ontopopulis does not use any language-specific processing
It is written in Java and requires about 10–20 minutes per pair (or triple) of classes
System architecture
Extraction of contextual features
Seed: train, bus, truck, car
Text collection
Contextual features: driver of the X : 2.6; X plowed : 2.2; X was parked : 2.2; stopped a X : 2.2; collided with another X : 2.1; …
New term extraction
New terms: vehicle, van, lorry, taxi, minibus
Stopwords
Ontopopulis – basic steps
Ontopopulis takes as input a small set of seed keywords for each semantic class we want to learn
The system learns contextual features (n-grams which co-occur immediately before or after the seed terms) and chooses the most reliable ones for each class
Optionally, the user can validate the contextual features
New terms are learnt for each semantic class using the validated features
The current version of the system is tuned not to learn named entities
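The feature-learning step above can be sketched as collecting the n-grams immediately adjacent to each seed occurrence. A simplified Python illustration (the system itself is written in Java; whitespace tokenisation, the window size and the tiny corpus are assumptions, and the real system additionally scores, filters and optionally validates the features):

```python
from collections import Counter

def contextual_features(corpus, seeds, n=3):
    """Collect n-grams occurring immediately before or after a seed term.

    Returns a Counter of context patterns, with X marking the seed slot."""
    features = Counter()
    seeds = set(seeds)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok in seeds:
                left = tokens[max(0, i - n):i]    # up to n tokens before
                right = tokens[i + 1:i + 1 + n]   # up to n tokens after
                if left:
                    features[" ".join(left) + " X"] += 1
                if right:
                    features["X " + " ".join(right)] += 1
    return features

corpus = [
    "the driver of the bus was arrested",
    "a truck was parked near the station",
]
feats = contextual_features(corpus, {"bus", "truck", "car"})
print(feats["driver of the X"])  # 1
```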
Ontopopulis – an example: learning types of vehicles
Input: three seed sets of words which refer to vehicles – watercraft, aircraft and land vehicles
Watercraft: ferryboat, ship, boat, yacht
Aircraft: helicopter, airplane, Airbus
Land vehicles: train, bus, truck, car
Contextual features
The system searches for the seed keywords in a corpus, finds contextual features and scores them
Watercraft top contextual features:
X capsized : 1.94; seizure of the X : 1.53; X and its crew : 1.52; missing after a X : 1.42; X was intercepted : 1.39; X ran aground : 1.33; born on a X : 1.31; …
Aircraft top contextual features:
crash of the X : 2.02; X that crashed : 1.54; wreckage of the X : 1.40; pieces of the X : 1.20; aboard the X : 1.08; X has crashed : 0.99; X pilot : 0.89; …
Land vehicle top contextual features:
driver of the X : 2.68; X plowed : 2.29; X was parked : 2.24; stopped a X : 2.20; collided with another X : 2.15; travel by X : 2.09; X was travelling : 2.01; …
Scoring the contextual features
seeds(watercraft) = {boat, ferryboat, ship, yacht}
PMI(f, s) – Pointwise Mutual Information of f and s

weight1(f, class) = Σ_{s ∈ seeds(class)} [ freq(f, s) / (freq(f, s) + 3) ] · PMI(f, s)

weightN(f, class) = weight1(f, class) / max_{f1 ∈ features(class)} weight1(f1, class)

weight(f, class) = weightN(f, class) · weightN(f, class) / Σ_{class1 ∈ classes} weightN(f, class1)
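The three scoring steps can be sketched in Python (the system itself is Java; the class names, frequencies and PMI values below are invented toy statistics, and the damping constant 3 follows the formula on the slide):

```python
# Toy corpus statistics (assumptions, not real data).
FREQ = {("X capsized", "boat"): 9, ("aboard the X", "boat"): 3,
        ("driver of the X", "bus"): 6}
PMI = {("X capsized", "boat"): 2.0, ("aboard the X", "boat"): 1.0,
       ("driver of the X", "bus"): 2.5}
SEEDS = {"watercraft": ["boat"], "land_vehicle": ["bus"]}
FEATURES = {"watercraft": ["X capsized", "aboard the X"],
            "land_vehicle": ["driver of the X"]}

def freq(f, s):
    return FREQ.get((f, s), 0)

def pmi(f, s):
    return PMI.get((f, s), 0.0)

def weight1(f, cls):
    """Raw weight: frequency-damped PMI summed over the class seeds."""
    return sum(freq(f, s) / (freq(f, s) + 3.0) * pmi(f, s)
               for s in SEEDS[cls])

def weightN(f, cls):
    """Weight normalised by the strongest feature of the class."""
    best = max(weight1(g, cls) for g in FEATURES[cls])
    return weight1(f, cls) / best

def weight(f, cls):
    """Final weight: penalises features that also fire for other classes."""
    wn = weightN(f, cls)
    return wn * wn / sum(weightN(f, c) for c in FEATURES)

# "X capsized": weight1 = 9/12 * 2.0 = 1.5; it is the top watercraft
# feature, so weightN = 1.0; it never co-occurs with land-vehicle seeds,
# so the cross-class sum is 1.0 and the final weight stays 1.0.
```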
Extracting new terms
Text collection is scanned for contextual features
The n-grams which appear in the feature slots are considered term candidates
Weighting term candidates:

weight(t, class) = Σ_{f ∈ features(class) ∩ features(t)} weight(f, class) · [ freq(f, t) / (freq(f, t) + 3) ] · PMI(f, t)

Term candidates are ordered by decreasing weight
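The candidate-weighting step can be sketched in Python (the feature weights, frequencies and PMI values below are toy numbers, and the data layout is an assumption, not the system's internals):

```python
# Toy inputs (assumptions): class features with their learnt weights, and
# the features each term candidate was observed with.
CLASS_FEATURES = {"land_vehicle": ["driver of the X", "X was parked"]}
TERM_FEATURES = {"van": ["driver of the X", "X was parked"],
                 "flight": ["aboard the X"]}
FWEIGHT = {("driver of the X", "land_vehicle"): 2.7,
           ("X was parked", "land_vehicle"): 2.2}
FREQ = {("driver of the X", "van"): 6, ("X was parked", "van"): 3}
PMI = {("driver of the X", "van"): 2.0, ("X was parked", "van"): 1.0}

def term_weight(t, cls):
    """Score a candidate by the class features it shares; each shared
    feature contributes feature weight * damped frequency * PMI."""
    shared = set(CLASS_FEATURES[cls]) & set(TERM_FEATURES.get(t, []))
    return sum(FWEIGHT[(f, cls)]
               * FREQ.get((f, t), 0) / (FREQ.get((f, t), 0) + 3.0)
               * PMI.get((f, t), 0.0)
               for f in shared)

def rank(candidates, cls):
    """Order term candidates by decreasing weight, as on the slide."""
    return sorted(candidates, key=lambda t: term_weight(t, cls), reverse=True)

# term_weight("van", "land_vehicle") = 2.7*(6/9)*2.0 + 2.2*(3/6)*1.0 = 4.7
```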
Extracting new terms
Top 20 terms for watercraft (75% accuracy):
vessel (392.09), ferry (130.92), arctic sea (111.52), boats (70.10), fishing boat (51.92), flight (51.55), ships (45.06), freighter (38.48), vessels (37.95), shuttle (37.84), tanker (33.97), cargo ship (30.93), craft (30.58), cargo (24.77), submarine (22.62), trawler (21.74), princess ashika (20.79), liner (20.16), fishing vessel (20.10), cruise ship (19.56)
Extracting new terms Top 20 terms for aircraft (70% accuracy)
plane (386.09), aircraft (214.52), jet (116.91), airbus a330 (110.66), air france (107.73), airliner (65.72), chopper (65.07), flight (63.81), yemenia (58.86), a330 (51.36), shuttle (34.75), jetliner (33.00), airbus a310 (30.02), a310 (8.78), planes (26.20), passenger plane (25.64), passenger jet (24.67), france plane (24.02), caspian airlines (22.12), france jet (20.79)
Extracting new terms Top 20 terms for land vehicles (80% accuracy)
vehicle (379.70), van (172.63), lorry (153.93), taxi (116.57), minibus (99.67), motorcycle (83.46), trailer (75.80), minivan (72.84), tractor (63.04), pickup truck (56.14), jeep (47.48), pickup (44.18), suv (43.51), cars (36.60), tanker (35.88), motorbike (35.74), driver (34.64), bakkie (31.45), passenger (29.59), passenger bus (27.71)
Ontopopulis vs. Google Sets
Google Sets, given the same seed set for land vehicles, extracted 20 new terms and reached 30% accuracy (vs. 80% for Ontopopulis):
boat, airplane, taxi, helicopter, plane, airport, air, bicycle, buggy, aircraft, coach, suv, ferry, motorcycle, robot, transport, tips, time travel, tank, planes, rail
Ontopopulis in a multilingual environment
Italian – learning a list of dangerous and potentially dangerous substances
Input: sostanze pericolose ("dangerous substances"), rifiuti pericolosi ("hazardous waste"), uranio ("uranium"), scorie nucleari ("nuclear waste")
Output (top 20, 70% accuracy):
rifiuti speciali (41.65), materiale (20.61), amianto (17.72), rifiuti tossici (13.75), spazzatura (13.11), esplosivo (11.41), cocaina (11.24), gpl (10.00), immondizia (9.88), sigarette (9.41), carburante (9.07), rifiuti provenienti (8.85), rifiuti radioattivi (8.62), prodotti (8.51), sostanze chimiche (8.34), materiali (7.94), scorie radioattive (7.93), alimenti (7.80), rifiuti solidi (7.54), prodotti caseari (7.47)
Using Ontopopulis for event extraction
We use Ontopopulis to learn terms which we then put into the domain-specific dictionaries of our event extraction system NEXUS
Some rules which make reference to semantic classes:
Rules for parsing person-reference noun phrases, such as two engineers
Rules which detect the weapons used: ucciso con (una | un) [WEAPON] ("killed with a [WEAPON]", e.g. ucciso con una pistola)
Detection of the vehicles used: [PEOPLE] in (un | una) [VEHICLE] (e.g. due persone in una imbarcazione, "two people in a boat")
Drug trafficking: traffico di [DRUGS] (e.g. traffico di ketamina, "ketamine trafficking")
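Rules of this shape can be sketched as regular expressions instantiated from the learnt dictionaries. A Python illustration of the vehicle rule (the dictionary entries and the numeral prefix handling are small illustrative assumptions, not NEXUS internals):

```python
import re

# Illustrative dictionary entries (assumptions, not the real NEXUS lists).
VEHICLE = ["imbarcazione", "autobus", "camion"]
PEOPLE = ["persone", "passeggeri"]

def make_pattern(words):
    """Turn a dictionary class into a regex alternation."""
    return "(?:" + "|".join(map(re.escape, words)) + ")"

# [PEOPLE] in (un | una) [VEHICLE]
vehicle_rule = re.compile(
    r"(?:due|tre|\d+)?\s*" + make_pattern(PEOPLE) +
    r"\s+in\s+(?:un|una)\s+" + make_pattern(VEHICLE))

m = vehicle_rule.search("due persone in una imbarcazione")
print(m is not None)  # True
```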
Using Ontopopulis for event classification
NEXUS uses combinations of word classes to recognize event types. For example:
Words of class Crime near words like arrest trigger the Arrest event type
Words of class Political person near words like kill trigger the Assassination event type
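The trigger logic can be sketched as a proximity test between a dictionary class and a trigger word. A Python illustration (the class contents, trigger lists and window size are assumptions, not NEXUS internals):

```python
# Illustrative dictionaries (assumptions).
CRIME = {"robbery", "theft", "smuggling"}
TRIGGERS = {"Arrest": {"arrest", "arrested", "detained"}}

def classify_event(tokens, window=5):
    """Fire an event type when a Crime word occurs near a trigger word."""
    tokens = [t.lower() for t in tokens]
    for i, tok in enumerate(tokens):
        if tok in CRIME:
            nearby = set(tokens[max(0, i - window): i + window + 1])
            for event, trigger_words in TRIGGERS.items():
                if trigger_words & nearby:
                    return event
    return None

print(classify_event("two men arrested for robbery".split()))  # Arrest
```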
We learned different crisis-related semantic classes for English, French, Italian, Spanish, Portuguese and Arabic
Some of these classes were: Disasters, Humanitarian crises, Law-enforcement authorities, Political person, Infrastructure, Crimes, Vehicles, Heavy weapons, Drugs
Learning event-related classes for Spanish and Portuguese: evaluation
Accuracy (%) in the top 20 acquired terms:

Class         | Spanish | Portuguese
Person        | 95      | 90
Weapon        | 60      | 60
Politician    | –       | 75
Vehicle       | –       | 85
Watercraft    | –       | 70
Edged weapon  | –       | 20
Crime         | –       | 85
Building      | –       | 75
Using Ontopopulis for summarization
TAC'10: aspect-driven summarization – a summary plus aspects such as Damage, Countermeasures, etc.
With Ontopopulis we automatically created lists of damages, disaster and military countermeasures, crime charges, and resources
Using the damages and countermeasures dictionaries improved the average aspect-based Pyramid score by 0.12; the crime charges and resources dictionaries decreased the average aspect-based Pyramid score by 0.09
Using Ontopopulis for opinion mining
Dictionaries of positive and negative words and phrases play a central role in opinion mining systems
Such dictionaries are difficult to find, especially for languages other than English
With Ontopopulis, we learned subjective words for English and Spanish
After manual cleaning, these words were plugged into our opinion mining system
Using Ontopopulis for opinion mining
Learning positive and negative words
Positive seed set: nice, pleasant, convenient, beautiful
Learnt positive words (top 33, accuracy 97%; * marks errors):
fun, wonderful, lovely, comfortable, safe, interesting, simple, easy, unique, enjoyable, reliable, friendly, exciting, affordable, accessible, *difficult, happy, decent, efficient, funny, healthy, warm, productive, clean, attractive, helpful, perfect, great, secure, intuitive, gentle, cool, sustainable
Using Ontopopulis for opinion mining
Negative seed set: unpleasant, ugly, inconvenient
Learnt negative words (top 33, accuracy 88%; * marks errors):
uncomfortable, *simple, sad, difficult, disturbing, painful, terrible, shocking, emotional, embarrassing, horrible, frightening, awful, fundamental, harsh, unfortunate, unpalatable, complicated, *historical, cruel, *universal, hard, *honest, scary, brutal, dangerous, obvious, ugly head, bizarre, awkward, eternal, bitter, *absolute
A tasty conclusion
Input: risotto, crepes, ratatouille, roasted chicken
Output: soup (9.42), pasta (8.50), salad (4.14), sauce (3.93), juice (3.60), seafood (3.55), syrup (3.32), barbecue (3.04), pizza (2.99), cooked (2.93)
Conclusions
Ontopopulis is nearly unsupervised; it requires just a small input seed set
It is language- and domain-independent
Results vary between semantic classes; accuracy is typically above 70% among the top 20 acquired terms
Manual supervision is still necessary; however, we found it easier to clean an already acquired dictionary than to create one manually
It is efficient: on a state-of-the-art PC it requires about 10 minutes per class
It is multiplatform – written entirely in Java
Application potential
Thank you!