Multi-lingual Concept Extraction with Linked Data and Human-in-the-Loop
Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski, Steve Welch
IBM Research
Motivation
§ extract information from a novel corpus
§ what are the relevant concepts in the domain?
§ limited domain and language knowledge
§ IDEA: combine statistical techniques with user-in-the-loop
Domain Learning Assistant
• Start with a small number of seeds (1)
• Get suggestions of new surface forms
• The user accepts or rejects each suggestion
Finding concept candidates
The safety and efficacy of filgrastim are similar in adults and children receiving cytotoxic chemotherapy
La eficacia y la seguridad del filgrastim son similares en los adultos y en los niños tratados con quimioterapia citotóxica
La sicurezza e l’efficacia del filgrastim sono simili negli adulti e nei bambini sottoposti a chemioterapia citotossica
Die Wirksamkeit und Unbedenklichkeit von Filgrastim ist bei Erwachsenen und bei Kindern, die eine zytotoxische Chemotherapie erhalten, vergleichbar
Finding concept candidates
Plasma elimination half-life of oral pravastatin is 1.5 to 2 hours.
L’emivita plasmatica di eliminazione del pravastatin orale è compresa tra un’ora e mezzo e due ore.
Finding concept candidates
Candidates: {eggs, flour}
“mix eggs and flour” → mix <candidate> and <candidate>
mix <candidate> and <candidate> → “mix sugar and butter”
Candidates: {eggs, flour, sugar, butter}
“melt the butter” → melt the <candidate>
…
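The bootstrapping step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a pattern is just the tokens immediately to the left and right of a known term (the slide shows whole-phrase templates), and the names `tokenize`, `extract_patterns`, `apply_patterns` and the `window` parameter are hypothetical.

```python
import re

def tokenize(sentence):
    # Lowercase word tokenizer; \w is Unicode-aware in Python 3.
    return re.findall(r"\w+", sentence.lower())

def context(tokens, i, window):
    # Tokens immediately left and right of position i.
    left = tuple(tokens[max(0, i - window):i])
    right = tuple(tokens[i + 1:i + 1 + window])
    return left, right

def extract_patterns(sentences, lexicon, window=1):
    """Collect the contexts in which known lexicon terms occur."""
    patterns = set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, tok in enumerate(tokens):
            if tok in lexicon:
                patterns.add(context(tokens, i, window))
    return patterns

def apply_patterns(sentences, patterns, lexicon, window=1):
    """Return unknown tokens that occur in one of the known contexts."""
    found = set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, tok in enumerate(tokens):
            if tok not in lexicon and context(tokens, i, window) in patterns:
                found.add(tok)
    return found

corpus = ["mix eggs and flour", "mix sugar and butter", "melt the butter"]
seeds = {"eggs", "flour"}
patterns = extract_patterns(corpus, seeds)
new_candidates = apply_patterns(corpus, patterns, seeds)  # {"sugar", "butter"}
```

Nothing in the sketch is specific to English. Because it relies only on token positions, the same functions run unchanged on text in another language.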
Finding concept candidates
Candidates: {uova, farina}
“amalgamare uova e farina” → amalgamare <candidate> e <candidate>
amalgamare <candidate> e <candidate> → “amalgamare zucchero e burro”
Candidates: {uova, farina, zucchero, burro}
“sciogliere il burro” → sciogliere il <candidate>
…
Multi-lingual experiment
HYPOTHESIS: same behavior, regardless of the language
§ we start with very few seeds (one could be sufficient) for each language
§ we extract context patterns and use them to generate new candidates
§ we ask the user to accept/reject the candidates
§ we repeat for a fixed number of iterations in all languages
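The procedure above amounts to a short loop, sketched here under stated assumptions. The function `run_loop` and its callbacks `suggest` and `accept` are hypothetical names; `accept` stands in for the user's decision and could equally be a lookup in a reference list. The toy `neighbours` table only replaces a real candidate generator so that the example runs on its own.

```python
def run_loop(seeds, suggest, accept, iterations):
    """Grow a lexicon; return it with its size after each iteration."""
    lexicon = set(seeds)
    rejected = set()
    growth = []
    for _ in range(iterations):
        # Never re-ask about terms that were already judged.
        candidates = suggest(lexicon) - lexicon - rejected
        for term in sorted(candidates):
            if accept(term):
                lexicon.add(term)
            else:
                rejected.add(term)
        growth.append(len(lexicon))
    return lexicon, growth

# Toy stand-in for pattern-based suggestion (illustrative data only).
neighbours = {
    "irbesartan": {"losartan", "tablet"},
    "losartan": {"valsartan"},
}

def suggest(lexicon):
    out = set()
    for term in lexicon:
        out |= neighbours.get(term, set())
    return out

# Simulated user: accept exactly the terms in a reference list.
reference = {"irbesartan", "losartan", "valsartan"}
lexicon, growth = run_loop({"irbesartan"}, suggest, reference.__contains__, 3)
# growth == [2, 3, 3]
```

Recording the lexicon size per iteration gives the discovery-growth curve that can then be compared across languages.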
Multi-lingual experiment: Drug Discovery
§ DATA: parallel corpus from the European Medicines Agency (EMEA)
  § documents related to medicinal products
  § translations into 22 official languages of the European Union
  § 1,500 documents for most of the languages
  § we used 4 languages (en, es, it, de)
§ TASK: build a lexicon of clinical drugs
  § user-in-the-loop simulated by constructing a Gold Standard (GS) of drug names extracted from Linked Open Data (we used DBpedia, http://dbpedia.org)
Drug Discovery: One seed
§ initial seeds: single seed
  § one drug name which appears in each corpus (e.g. “irbesartan”)
§ 20 iterations
§ learning curves for all languages are comparable
Discovery growth for glimpse for English (en), Italian (it), Spanish (es) and German (de). Average correlation amongst all languages r = 0.998.
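One plausible reading of "average correlation amongst all languages" is the mean of the pairwise Pearson coefficients between the per-language growth curves. The sketch below assumes that reading. The helper names are hypothetical and the curve values are placeholders, not figures from the experiment.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_pairwise_r(curves):
    """Average Pearson r over all unordered pairs of curves."""
    pairs = list(combinations(curves.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Placeholder lexicon sizes per iteration (NOT data from the talk).
curves = {
    "en": [1, 2, 3, 4],
    "it": [2, 4, 6, 8],
    "es": [1, 3, 5, 7],
    "de": [3, 4, 5, 6],
}
r = mean_pairwise_r(curves)
```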
Drug Discovery: Linked Data seeds
§ initial seeds: 20% of available Linked Data (DBpedia)
§ 5-fold validation (randomly selected 20%, same drugs for all languages)
§ the choice of initial seeds does not impact the results
Discovery growth with 5-fold cross validation on the EMEA dataset using DBpedia as seeds. Each plot shows the discovery growth for each of the randomly generated 5 folds and reports the Pearson correlation (r) amongst them.
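A seed split of this kind could be produced as follows. This is a sketch under assumptions: `seed_folds` and its parameters are hypothetical names, and the term list is synthetic. Building the folds once and reusing them for every language keeps the seed drugs identical across languages, as the slide requires.

```python
import random

def seed_folds(terms, k=5, rng_seed=0):
    """Shuffle the terms and deal them into k disjoint folds."""
    items = sorted(terms)                      # fixed order before shuffling
    random.Random(rng_seed).shuffle(items)     # reproducible shuffle
    return [set(items[i::k]) for i in range(k)]

# Synthetic stand-in for the list of drug names (illustrative only).
gold = {f"drug{i}" for i in range(10)}
folds = seed_folds(gold)

for fold in folds:
    seeds = fold              # 20% of the list is used as initial seeds
    held_out = gold - fold    # the rest remains to be discovered
```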
Drug Discovery: benefit of Linked Data
§ glimpse → one manually provided seed
§ glimpseLD → Linked Data seeds
§ in 10 iterations glimpseLD can cover the same lexicon that would take more than 20 iterations with glimpse
Human-in-the-loop experiment with a subject matter expert (physician)
Multi-lingual experiment: Colors
§ DATA: Twitter stream, 1st-14th of January 2016, languages: En, De, Es, It
  § tweets containing at least one mention of a color
  § gold standard lists of colors from Wikidata and DBpedia
§ dataset sizes balanced across languages
  § 155,828 tweets per language
§ TASK: expand the lexicon of colors
§ user-in-the-loop: 4 native speakers, 10 iterations
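The slides do not say how the per-language datasets were balanced. One simple option, shown here purely as an assumption, is to downsample every language to the size of the smallest collection. The function name `balance` and the toy corpora are hypothetical.

```python
import random

def balance(corpora, rng_seed=0):
    """Downsample each language to the size of the smallest corpus."""
    rng = random.Random(rng_seed)
    n = min(len(docs) for docs in corpora.values())
    # random.sample draws n distinct items without replacement.
    return {lang: rng.sample(docs, n) for lang, docs in corpora.items()}

# Toy corpora (illustrative only).
corpora = {
    "en": ["t1", "t2", "t3", "t4", "t5"],
    "de": ["t6", "t7", "t8"],
    "es": ["t9", "t10", "t11", "t12"],
}
balanced = balance(corpora)
```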
Multi-lingual experiment: Colors
§ new color items extracted from Twitter data:
  § German: 5
  § Italian: 5
  § English: 19
  § Spanish: 22
    § azulgrana
    § rojo vivo
    § “limn” (in place of the color limón)
Conclusions
WHAT
§ knowledge resources are never complete/exhaustive
§ construct/improve dictionaries from text corpora
HOW
§ iterative and purely statistical algorithm
§ no feature extraction required
§ comparable behavior for different languages
§ organically incorporates human feedback
@AnLiGentile