

The Computational Linguistics Summarization Pilot task @ TAC 2014

Kokil Jaidka†, Muthu Kumar Chandrasekaran*‡, Min-Yen Kan*‡, Ankur Khanna‡

† Nanyang Technological University

* Dept. of Computer Science, National University of Singapore

‡ Web, IR / NLP Group, National University of Singapore


Scientific Document Summarization

I have an abstract. I am done!

Photo Credits Dennis Jarvis @flickr

TAC Biomedsumm Track - The Computational Linguistics Pilot Task, 21 April 2023


Outline

• Citation based extractive summaries

• Facetted summaries

• Automatic literature review

• CL development corpus

• Annotation

• TAC 2015: CL-Summ track

• Acknowledgements


Scientific Document Summarization

• Abstracts
  – Authors' own summary.

• Citation summary
  – The scientific community creates summaries of research papers when citing them, but…

• Facetted summaries
  – Capture all aspects of a paper.


Citation summary & facets

Image credits Ken Ammi @flickr


Facetted summaries

Structured abstract: common in the Medicine, Biomed, and Bioinformatics domains


Facets & Argumentative zones


Scientific Document Summarization: Citation based extractive summaries

Scope of Citation

• Qazvinian, V., & Radev, D. R. "Identifying non-explicit citing sentences for citation-based summarization" (ACL, 2010)

• Abu-Jbara, Amjad, and Dragomir Radev. "Reference scope identification in citing sentences" (ACL, 2012)

Coherence

• Abu-Jbara, Amjad, and Dragomir Radev. "Coherent citation-based summarization of scientific papers" (ACL, 2011)


Scientific Document Summarization & Automatic Literature Review


Free to access at: http://acl-arc.comp.nus.edu.sg/


SciSumm Corpus

• 10 reference papers (topics), randomly sampled from the ACL ARC corpus.

• Up to 10 citing papers per reference paper, including those outside ACL ARC.
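The corpus layout described on this slide can be pictured as a small data structure. This is only a sketch, not the organizers' tooling: the `Topic` class, the `build_corpus` helper and the input mapping are hypothetical, and keeping the first ten citing papers is an assumption, since the slide does not say how citing papers were chosen.

```python
import random
from dataclasses import dataclass, field

MAX_CITING = 10   # "up to 10 citing papers per reference paper"
NUM_TOPICS = 10   # "10 reference papers or topics"


@dataclass
class Topic:
    """One topic: a reference paper plus the papers that cite it."""
    reference_paper: str
    citing_papers: list = field(default_factory=list)


def build_corpus(citations, num_topics=NUM_TOPICS, seed=0):
    """Randomly sample reference papers and cap their citing papers.

    `citations` maps a reference-paper ID to the IDs of the papers citing it
    (a hypothetical input format). The seed only makes the sample repeatable.
    """
    rng = random.Random(seed)
    candidates = sorted(citations)
    chosen = rng.sample(candidates, min(num_topics, len(candidates)))
    # Assumption: keep the first MAX_CITING citing papers of each topic.
    return [Topic(rp, list(citations[rp])[:MAX_CITING]) for rp in chosen]
```

Each `Topic` then holds everything an annotator or a system needs for one reference paper.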


Annotation pipeline

[Diagram: annotation pipeline. Input papers pass through OCR and section parsing (ParsCit's SectLabel module), yielding XML files with labelled sections such as the abstract; these files are then annotated.]

Post-processing to Biomedsumm format:

1. Scripts from U. Colorado (Prabha)

2. Sentence-segmented version from U. Mich (Rahul)


Annotating the SciSumm corpus

• 3 annotators in all.

• Released data has one gold standard annotation per topic (reference paper).

• The discourse facet categories differ slightly from Biomedsumm's.


Tasks

• Task 1A: For each citance, identify the spans of text (cited text spans) in the reference paper (RP) that most accurately reflect the citance.

[Diagram: citing papers and the Reference Paper (RP). The citing text is called a citance.]
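To make Task 1A concrete, here is a minimal lexical-overlap baseline. It is not any participant's system: the Jaccard scoring, the crude tokenizer and the function names are all assumptions chosen for illustration. It ranks the sentences of the reference paper by word overlap with the citance.

```python
import re


def tokens(text):
    """Lowercase word tokens as a set (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_cited_spans(citance, rp_sentences, top_k=2):
    """Return indices of the top_k RP sentences most similar to the citance."""
    cit = tokens(citance)
    scored = [(jaccard(cit, tokens(s)), i) for i, s in enumerate(rp_sentences)]
    # Sort by descending similarity; ties go to the earlier sentence.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_k]]
```

A stronger system would likely add stemming, stop-word removal or term weighting; the sketch only shows the shape of the input and output.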


Tasks

• Task 1B: For each cited text span, identify what facet of the paper it belongs to, from a predefined set of facets.

[Diagram: citing papers and the Reference Paper (RP). Mark the cited text in the RP and provide its facet. The citing text is called a citance.]
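Task 1B is a classification problem over cited text spans. The slides do not list the predefined facets, so the labels and cue words below are placeholders invented for illustration, and the cue-counting rule is just one simple way such a classifier might look.

```python
import re

# Placeholder facet labels and cue words: assumptions for illustration only,
# not the task's predefined facet set.
FACET_CUES = {
    "method": {"propose", "algorithm", "model", "approach"},
    "results": {"results", "outperforms", "accuracy", "improvement"},
    "aim": {"goal", "aim", "investigate", "address"},
}


def classify_facet(span, default="method"):
    """Pick the facet whose cue words occur most often in the span."""
    words = re.findall(r"[a-z]+", span.lower())
    counts = {f: sum(w in cues for w in words) for f, cues in FACET_CUES.items()}
    best = max(counts, key=counts.get)
    # Fall back to the default label when no cue word matches at all.
    return best if counts[best] > 0 else default
```

In practice the facet set would come from the annotation guidelines, and a trained classifier would likely replace the hand-picked cues.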


Evaluation

• Small corpus: 10-fold cross-validated evaluation over the 10 documents.

• Task 1A scored by overlap with citances.

• Task 1B scored by overlap with reference text spans.
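The evaluation setup can be sketched as follows. The slide does not define the overlap measure, so scoring by set overlap of sentence IDs (precision, recall, F1) is an assumption; the function names are hypothetical. With 10 documents, 10-fold cross-validation amounts to holding out one document per fold.

```python
def overlap_f1(predicted, gold):
    """Precision, recall and F1 between two sets of sentence IDs."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def leave_one_out_folds(topics):
    """Yield (train, test) splits; with 10 topics this gives 10 folds."""
    for i, held_out in enumerate(topics):
        yield topics[:i] + topics[i + 1:], [held_out]
```

A system would be trained on each `train` split, scored on the held-out topic, and the per-fold scores averaged.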


Task & evaluation: highlights

• First corpus in the CL domain that incorporates prior research findings on citation-based summaries.

• 10 teams from 5 different countries participated in the evaluation.


Limitations

• No gold standard summaries yet

• OCR errors: We hope to have corrected them manually.

• But mainly, we need more annotated data!


TAC 2015: CL-Summ shared task

• Plans to roll out a full-fledged official shared task for the CL corpus.

• 20 training topics

• 10 test topics

• 3 annotations per summary.


TAC 2015: We need your help!

• We seek support from
  – the summarization community in general and
  – the CL community in particular
  to provide manpower for annotating the corpus.

• Great to have all participating teams contribute!


Acknowledgements

• Hoa Dang, NIST

• Lucy Vanderwende, MSR

• All Biomedsumm track participants.

• This research is partially supported by CSIDM


Questions? Thank you!