George Paliouras, May 2014: BioASQ (National Center for Scientific Research ‘Demokritos’)

TRANSCRIPT

  • Slide 1

George Paliouras, May 2014
www.bioasq.org
National Center for Scientific Research ‘Demokritos’
BioASQ
Intelligent Information Management, Targeted Competition Framework, ICT-2011.4.4(d)
A challenge on large-scale biomedical semantic indexing and question answering

  • Slide 2

Biomedical articles per year

  • Slide 3

Questions of biomedical experts

Yes/No question: Are there any DNMT3 proteins present in plants?
Exact answer: Yes
Ideal answer: Yes. The plant DOMAINS REARRANGED METHYLTRANSFERASE2 (DRM2) is a homolog of the mammalian de novo methyltransferase DNMT3. DRM2 contains a novel arrangement of the motifs required for DNA methyltransferase catalytic activity.

  • Slide 4

Questions of biomedical experts

Factoid question: What is the methyl donor of DNA (cytosine-5)-methyltransferases?
Exact answer: S-adenosyl-L-methionine
Ideal answer: S-adenosyl-L-methionine (AdoMet, SAM) is the methyl donor of DNA (cytosine-5)-methyltransferases. DNA (cytosine-5)-methyltransferases catalyze the transfer of a methyl group from S-adenosyl-L-methionine to the C-5 position of cytosine residues in DNA.

  • Slide 5

Questions of biomedical experts (III)

List question
Ideal answer: In 1955, the production of itaconic acid was firstly described for Ustilago maydis. Some Aspergillus species, like A. itaconicus and A. terreus, show the ability to synthesize this organic acid and A. terreus can secrete significant amounts to the media. Itaconic acid is mainly supplied by biotechnological processes with the fungus Aspergillus terreus. Cloning of the cadA gene into the citric acid producing fungus A. niger showed that it is possible to produce itaconic acid also in a different host organism.
Exact answer: Aspergillus terreus, Aspergillus niger, Ustilago maydis
Question: Which species may be used for the biotechnological production of itaconic acid?

  • Slide 6

Questions of biomedical experts (III)

Summary question: How do histone methyltransferases cause histone modification?
Exact answer: -
Ideal answer: Histone methyltransferases (HMTs) are responsible for the site-specific addition of covalent modifications on the histone tails, which serve as markers for the recruitment of chromatin organization complexes. There are two major types of HMTs: histone-lysine N-methyltransferases and histone-arginine N-methyltransferases. The former methylate specific lysine (K) residues such as 4, 9, 27, 36, and 79 on histone H3 and residue 20 on histone H4. The latter methylate arginine (R) residues such as 2, 8, 17, and 26 on histone H3 and residue 3 on histone H4. Depending on what residue is modified and the degree of methylation (mono-, di- and tri-methylation), lysine methylation of histones is linked to either transcriptionally active or silent chromatin.

  • Slide 7

  • Slide 8

Finding relevant snippets

  • Slide 9

Not only texts: ontologies, linked data

  • Slide 10

  • Slide 11

Information from structured data

List question: Which forms of cancer is the Tpl2 gene associated with?

Related concepts:
  http://www.disease-ontology.org/api/metadata/DOID:162 (cancer)
  http://www.uniprot.org/uniprot/M3K8_RAT (TPL2 synonym)

Related RDF triple:
  Subject: http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/3003 (lung cancer)
  Predicate: http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/associatedGene
  Object: http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/TPL2
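
To make the subject/predicate/object structure of the related RDF triple concrete, here is a minimal Python sketch that holds the triple from the slide and prints it as one N-Triples line. The `Triple` type, the `to_ntriples` helper, and the `DISEASOME` prefix constant are hypothetical names introduced for illustration; they are not part of any BioASQ tool, and the sketch only handles triples whose three terms are all IRIs.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement whose three terms are all IRIs (assumption)."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(triple: Triple) -> str:
    """Serialize an IRI-only triple as a single N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


# Common prefix of the three URIs shown on the slide (hypothetical constant).
DISEASOME = "http://www4.wiwiss.fu-berlin.de/diseasome/resource"

# The triple from the slide: lung cancer --associatedGene--> TPL2.
tpl2_triple = Triple(
    subject=f"{DISEASOME}/diseases/3003",
    predicate=f"{DISEASOME}/diseasome/associatedGene",
    obj=f"{DISEASOME}/genes/TPL2",
)

line = to_ntriples(tpl2_triple)
print(line)
```

A system answering the list question would, under this reading, look for all triples with the `associatedGene` predicate and the TPL2 gene as object, then report their subjects as candidate diseases.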
  • Slide 12

  • Slide 13

BioASQ Vision

Make sure this knowledge is used to the benefit of patients
Need to make it accessible to biomedical experts
Search is not effective enough
Push research in automated answering of questions
A challenge for such systems can achieve a multiplying effect

  • Slide 14

What is BioASQ?

A challenge funded by the European Union (FP7).

Task a: Hierarchical text classification
  Organizers distribute new unclassified PubMed articles.
  Participants assign MeSH terms to the articles.
  Evaluation based on annotations of PubMed curators.

Task b: IR, QA, summarization
  Organizers distribute English biomedical questions.
  Participants provide: relevant articles, snippets, concepts, triples, exact answers, ideal answers.
  Evaluation: both automatic (GMAP, MRR, ROUGE etc.) and manual (by biomedical experts).

  • Slide 15

The challenge: Task a, Task b

  • Slide 16

  • Slide 17

Behind the scenes

  • Slide 18

BioASQ Platform

  • Slide 19

Datasets

Task b data contain gold articles, snippets, concepts, triples, exact and ideal answers prepared by biomedical experts from around Europe.

Task a      1st challenge   2nd challenge
Training    10,876,004      12,628,968
Test        83490           71950

Task b      1st challenge   2nd challenge
Training    29              310
Test        281             500

  • Slide 20

Data sources

They include both text and structured info.
PubMed abstracts, PubMed Central articles, MeSH.
Gene Ontology, UniProt, Jochem, Disease Ontology.
  • Slide 21

Annotation: questions and queries

  • Slide 22

Annotation: snippets

  • Slide 23

Annotation: answers

  • Slide 24

Assessment: relevance of material

  • Slide 25

Assessment: information in answers

  • Slide 26

BioASQ social network

  • Slide 27

Oracle

  • Slide 28

Oracle

  • Slide 29

Two cycles

2013 Schedule: March 2013, June 2013, August 2013, September 2013
2014 Schedule: February 2014, March 2014, May 2014, September 2014

The official challenge is over, but Task a continues to run each week.
An oracle for task b will be available soon.
Oracles will remain available.
Third cycle is being designed.

  • Slide 30

Challenge participants so far

  • Slide 31

Challenge participants in each cycle

  • Slide 32

Evaluation measures

Task a: Hierarchical text classification
  Flat measures for multi-label classification: Accuracy, MiF, MaF, EBF
  Hierarchical measures: LCA-F (new), HF

Task b: IR, QA, summarization
  Phase A: standard IR measures: mean precision, mean recall, mean F-measure, MAP (used for winners selection), G-MAP
  Phase B:
    Exact answers (based on type): accuracy (yes/no), strict/lenient accuracy, MRR (factoid), mean F-measure (list)
    Ideal answers: manual scores from the experts {Readability, Repetition, Information Precision and Recall}, plus ROUGE

  • Slide 33

First year technology/results overview

Task 1a
  Mainly SVMs and learning-to-rank.
  Mostly flat classification, ignoring class taxonomy.
  Mediocre results by hierarchical methods.
  One of the systems outperformed NLM's system.

Task 1b
  Phase A (retrieve relevant documents, concepts, snippets, triples): low performance (compared to baselines).
  Phase B (formulate exact and ideal answers): poor performance for exact answers (except for yes/no questions); high performance for ideal answers (paragraph-sized summaries), but starting with gold documents, snippets etc.

Large scope for improvements, esp. in Task 1b.

  • Slide 34

Exact answer results (batch 2/3)

  • Slide 35

Ideal answer results (batch 2/3)

  • Slide 36

Results task a: flat measures

  • Slide 37

Results task a: hierarchical

  • Slide 38

First challenge prizes

  • Slide 39

Sustainability

BioASQ Oracle
Software release and installation instructions
Benchmark datasets
BioASQ social network
Involvement of the biomedical community in the process
Attracting sponsors for prizes
Making the challenge viable, at very low cost, after the end of the project

  • Slide 40

Project Consortium

1. National Centre for Scientific Research Demokritos - NCSR D (EL)
2. Transinsight GmbH - TI (D)
3. Université Joseph Fourier - UJF (F)
4. University Leipzig - ULEI (D)
5. Université Pierre et Marie Curie Paris 6 - UPMC (F)
6. Athens University of Economics and Business Research Centre - AUEB-RC (EL)

  • Slide 41

Project Consortium

  • Slide 42

Get in touch!
BioASQ workshop @CLEF (Sheffield, Sept 14)
Visit www.bioasq.org
Follow @BioASQ

  • Slide 43

Useful Links

BioASQ Annotation & assessment tools:
  http://at.bioasq.org/
  http://assess.bioasq.org/
  https://github.com/AKSW/BioASQ-AT
BioASQ social network:
  http://sn.bioasq.org/
  https://github.com/AKSW/BioASQ-SN
BioASQ platform:
  http://bioasq.lip6.fr/
BioASQ Oracles:
  http://bioasq.lip6.fr/oracle/

A. Kosmopoulos, I. Partalas, E. Gaussier, G. Paliouras, I. Androutsopoulos, Evaluation Measures for Hierarchical Classification: a unified view and novel approaches. Data Mining and Knowledge Discovery (To appear)
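
As a small illustration of the flat measures listed for Task a on the evaluation-measures slide, the sketch below computes a micro-averaged F-measure (MiF) over multi-label predictions. It assumes the standard definition, pooling true positives, false positives and false negatives over all articles before computing F; the function name `micro_f1` and the example labels are hypothetical, and this is not the official BioASQ evaluation code.

```python
from typing import Iterable, Set


def micro_f1(gold: Iterable[Set[str]], predicted: Iterable[Set[str]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all articles, then compute F1."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)   # labels correctly assigned
        fp += len(pred_labels - gold_labels)   # labels assigned but not in gold
        fn += len(gold_labels - pred_labels)   # gold labels that were missed
    denominator = 2 * tp + fp + fn
    # Convention chosen here (assumption): score 0.0 when there is nothing to count.
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical MeSH-style labels for two articles (illustrative only).
gold = [{"Humans", "DNA Methylation"}, {"Neoplasms"}]
predicted = [{"Humans"}, {"Neoplasms", "Mice"}]

score = micro_f1(gold, predicted)
print(f"MiF = {score:.3f}")
```

Because the counts are pooled before averaging, frequent labels dominate this score, which is one reason the slide lists it alongside a macro-averaged variant (MaF) and hierarchical measures.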