Page 1

BioChain: Using Lexical Chaining Approaches for Biomedical Text Summarization

Lawrence Reeve

INFO780 - Final Report – Summer 2005

Page 2

Discussions

BioChain
  Goal & Approach
  BioChain Process
  Evaluation
    Using other summarization systems
    Comparing abstract vs full-text
Summarization
  DUC 2004 System Examples
Summary

Page 3

BioChain Goal

Take biomedical abstract (or full text) and generate a summary:

Adjuvant Chemotherapy for Adult Soft Tissue Sarcomas of the Extremities and Girdles: Results of the Italian Randomized Cooperative Trial. (Frustaci et al, 2001)

Adjuvant chemotherapy for soft tissue sarcoma is controversial because previous trials reported conflicting results. The present study was designed with restricted selection criteria and high dose-intensities of the two most active chemotherapeutic agents.

Patients and Methods: Patients between 18 and 65 years of age with grade 3 to 4 spindle-cell sarcomas (primary diameter >= 5 cm or any size recurrent tumor) in extremities or girdles were eligible. Stratification was by primary versus recurrent tumors and by tumor diameter greater than or equal to 10 cm versus less than 10 cm. One hundred four patients were randomized, 51 to the control group and 53 to the treatment group (five cycles of 4'-epidoxorubicin 60 mg/m2 days 1 and 2 and ifosfamide 1.8 g/m2 days 1 through 5, with hydration, mesna, and granulocyte colony-stimulating factor).

Results: After a median follow-up of 59 months, 60 patients had relapsed and 48 died (28 and 20 in the treatment arm and 32 and 28 in the control arm, respectively). The median disease-free survival (DFS) was 48 months in the treatment group and 16 months in the control group (P = .04); and the median overall survival (OS) was 75 months for treated and 46 months for untreated patients (P = .03). For OS, the absolute benefit deriving from chemotherapy was 13% at 2 years and increased to 19% at 4 years (P = .04).

Conclusion: Intensified adjuvant chemotherapy had a positive impact on the DFS and OS of patients with high risk extremity soft tissue sarcomas at a median follow-up of 59 months. Therefore, our data favor an intensified treatment in similar cases. Although cure is still difficult to achieve, a significant delay in death is worthwhile, also considering the short duration of treatment and the absence of toxic deaths.

Page 4

BioChain Goal

Work done in conjunction with DUCoM
  Ari Brooks, M.D.
What's the latest, best information on cancer treatment?
  Current focus is on clinical trial papers
  Database of ~1,200 manually processed papers
Current goal: Summarize a single clinical trial paper
Ultimate goal: Summarize multiple clinical trial documents

Page 5

BioChain Approach

Apply methods/concepts from lexical chaining:
  Cluster (chain) words together based on semantic relatedness
  Words are chained together based on word 'senses' (concepts)
Lexical chaining…
  identifies lexical cohesion
    property causing sentences to 'hang together' (Morris & Hirst, 1991)
  captures core themes of a text (aboutness)
  is an intermediate format
Example (Doran et al., 2004):
  "The house contains an attic. The home is a cabin."
  Lexical chain: dwelling {house, attic, home, cabin}
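The house/home example can be sketched in a few lines of Python. This is only an illustration: the `WORD_SENSE` table is a hypothetical stand-in for a real thesaurus lookup, and only the four words and the "dwelling" chain come from the slide.

```python
from collections import defaultdict

# Hypothetical word-to-sense table standing in for a thesaurus lookup.
# Only the "dwelling" grouping is taken from the slide's example.
WORD_SENSE = {
    "house": "dwelling",
    "attic": "dwelling",
    "home": "dwelling",
    "cabin": "dwelling",
}

def build_chains(words):
    """Group words that share a sense into one chain per sense."""
    chains = defaultdict(list)
    for word in words:
        sense = WORD_SENSE.get(word.lower())
        if sense is not None:  # skip words with no known sense
            chains[sense].append(word.lower())
    return dict(chains)

text = "The house contains an attic. The home is a cabin."
tokens = [t.strip(".") for t in text.split()]
chains = build_chains(tokens)
print(chains)
```

A real chainer would also have to choose among several candidate senses per word; the fixed one-sense table sidesteps that step.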

Page 6

Implemented Using UMLS

Key UMLS resources used:
  Metathesaurus
    Maps terms into concepts
  Semantic Network
    organizes related concepts
  MetaMap Transfer Application
    text-to-concept mapping tool

Page 7

BioChain Process

Page 8

Source Text Input

Abstract or full text from PubMed
Need to identify noun phrases within each sentence
  concepts are derived from noun phrases using vocabulary in Metathesaurus
Sentences must be sequentially ordered
PDF conversion issues
  Columns
  Captions
  Bibliography
  Reference numbers
  Images of documents
  Text tables

Page 9

MetaMap Transfer

Maps noun phrases
  to UMLS Metathesaurus concepts
  to UMLS Semantic Types

[Figure: sample MetaMap Transfer output with callouts for Sentence/Phrase, Candidate Concepts, Candidate Scores, Final Mapping, Concept, and Semantic Type(s)]

Source: http://mmtx.nlm.nih.gov/runMMTx.shtml

Page 10

UMLS Metathesaurus

Vocabulary database:
  Contains concepts, terms and relationships
  Incorporates more than 100 source vocabularies (SNOMED-CT, CPT, others)
  1 million concepts
  5 million terms
links alternative terms of the same concept together
identifies relationships between different concepts
  co-occurrence
  parent, child, sibling
  synonymy
(National Library of Medicine, 2005d)

Page 11

UMLS Metathesaurus

[Figure: Metathesaurus diagram with callouts for Concept and Terms]

Source: http://www.nlm.nih.gov/research/umls/meta2.html

Page 12

UMLS Semantic Network

Provides:
  categorization of all concepts in the UMLS Metathesaurus
  relationships between concepts
Consists of:
  135 semantic types
  54 relationships
(National Library of Medicine, 2005d)

Page 13

UMLS Semantic Network

Source: http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html

Page 14

Concept Chaining

Use semantic network to link together related concepts:
  Ex: T081 - Quantitative (semantic type)
    High dose (concept)
    cm (concept)
    Size (concept)
    Median Statistical Measurement (concept)
MetaMap Transfer:
  Noun phrase → concept → semantic type
BioChain:
  Semantic type → concept, concept, concept
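The inversion from "noun phrase → concept → semantic type" to "semantic type → concepts" can be sketched as below. The triple layout and the noun phrases paired with each concept are assumptions made for illustration; the concept names and type ids come from the slides.

```python
from collections import defaultdict

# Illustrative (noun phrase, concept, semantic type id) triples, shaped like
# the output of a text-to-concept mapper. The phrase/concept pairings are
# assumed; the concept names and type ids appear on the slides.
mappings = [
    ("high dose-intensities", "High dose", "T081"),
    ("cm", "cm", "T081"),
    ("any size", "Size", "T081"),
    ("median follow-up", "Median Statistical Measurement", "T081"),
    ("Adjuvant chemotherapy", "Chemotherapy, Adjuvant", "T061"),
]

def chain_by_semantic_type(triples):
    """Invert phrase -> concept -> type into type -> [concepts]."""
    chains = defaultdict(list)
    for _phrase, concept, semantic_type in triples:
        chains[semantic_type].append(concept)
    return dict(chains)

chains = chain_by_semantic_type(mappings)
for type_id, concepts in chains.items():
    print(type_id, concepts)
```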

Page 15

Concept Chaining

Internal storage:
  Array of semantic types formed
    135 semantic types, each has a type id
      Ex: T061 - Therapeutic or Preventive Procedure
    135 entries indexed by semantic id
    Each semantic type entry holds a list of concepts found in the source text
Each concept instance in semantic type entry contains:
  Original noun phrase
  Sentence number
  Section (paragraph) number
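One possible shape for this storage is sketched below. The class and field names are hypothetical, and a dictionary keyed by type id is used in place of the 135-entry array described on the slide, purely to keep the example short.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptInstance:
    """One occurrence of a concept in the source text (names are hypothetical)."""
    concept: str   # Metathesaurus concept name
    phrase: str    # original noun phrase
    sentence: int  # sentence number
    section: int   # section (paragraph) number

@dataclass
class ChainStore:
    """Holds one list of concept instances per semantic type id."""
    entries: dict = field(default_factory=dict)

    def add(self, type_id, instance):
        # Create the entry for this semantic type on first use.
        self.entries.setdefault(type_id, []).append(instance)

store = ChainStore()
store.add("T061", ConceptInstance("Chemotherapy, Adjuvant", "Adjuvant Chemotherapy", 0, 0))
store.add("T061", ConceptInstance("Therapeutic procedure", "intensified treatment", 14, 4))
print(len(store.entries["T061"]))
```

A fixed array indexed by semantic id, as on the slide, avoids hashing and makes empty chains explicit; the dictionary only stores types that actually occur.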

Page 16

Sample Abstract (Frustaci et al, 2001)

Adjuvant Chemotherapy for Adult Soft Tissue Sarcomas of the Extremities and Girdles: Results of the Italian Randomized Cooperative Trial.

Adjuvant chemotherapy for soft tissue sarcoma is controversial because previous trials reported conflicting results. The present study was designed with restricted selection criteria and high dose-intensities of the two most active chemotherapeutic agents.

Patients and Methods: Patients between 18 and 65 years of age with grade 3 to 4 spindle-cell sarcomas (primary diameter >= 5 cm or any size recurrent tumor) in extremities or girdles were eligible. Stratification was by primary versus recurrent tumors and by tumor diameter greater than or equal to 10 cm versus less than 10 cm. One hundred four patients were randomized, 51 to the control group and 53 to the treatment group (five cycles of 4'-epidoxorubicin 60 mg/m2 days 1 and 2 and ifosfamide 1.8 g/m2 days 1 through 5, with hydration, mesna, and granulocyte colony-stimulating factor).

Results: After a median follow-up of 59 months, 60 patients had relapsed and 48 died (28 and 20 in the treatment arm and 32 and 28 in the control arm, respectively). The median disease-free survival (DFS) was 48 months in the treatment group and 16 months in the control group (P = .04); and the median overall survival (OS) was 75 months for treated and 46 months for untreated patients (P = .03). For OS, the absolute benefit deriving from chemotherapy was 13% at 2 years and increased to 19% at 4 years (P = .04).

Conclusion: Intensified adjuvant chemotherapy had a positive impact on the DFS and OS of patients with high risk extremity soft tissue sarcomas at a median follow-up of 59 months. Therefore, our data favor an intensified treatment in similar cases. Although cure is still difficult to achieve, a significant delay in death is worthwhile, also considering the short duration of treatment and the absence of toxic deaths.

Page 17

Concept Chain - Example

T061 - Therapeutic or Preventive Procedure: 6.0

  phrase: 'Adjuvant Chemotherapy'
  concept: Chemotherapy, Adjuvant
  sentence#0, section#0

  phrase: 'Adjuvant chemotherapy'
  concept: Chemotherapy, Adjuvant
  sentence#2, section#1

  phrase: 'primary diameter cm'
  concept: Primary operation (qualifier value)
  sentence#5, section#2

  phrase: 'Intensified adjuvant chemotherapy'
  concept: Chemotherapy, Adjuvant
  sentence#13, section#4

  phrase: 'intensified treatment'
  concept: Therapeutic procedure
  sentence#14, section#4

Callout labels: Metathesaurus Concepts; Semantic Type

Page 18

Chain Scoring

Each chain has a score
  Indicates the degree to which a semantic type is discussed in the text
Lexical chaining research identified 3 factors for strength: (Morris & Hirst, 1991)
  Reiteration: more repetition is better
  Density: shorter distance between concepts is better
  Length: longer chain length is better
Using method from University College Dublin (Doran, Stokes, Dunnion, & Carthy, 2004)
  Frequency of most frequent concept (reiteration) * number of unique concept occurrences
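A minimal sketch of the scoring rule follows. The slide does not spell out how "unique concept occurrences" is counted, so the code assumes it means the number of distinct concepts in the chain; another reading would give different scores. The function name and the toy chain are hypothetical.

```python
from collections import Counter

def chain_score(concepts):
    """Frequency of the most frequent concept times the number of distinct
    concepts (the 'distinct concepts' reading is an assumption)."""
    if not concepts:
        return 0.0
    counts = Counter(concepts)
    top_frequency = max(counts.values())  # reiteration
    return float(top_frequency * len(counts))

# Hypothetical chain: concept "A" occurs twice, "B" once.
toy_chain = ["A", "A", "B"]
score = chain_score(toy_chain)
print(score)
```

Multiplying the two factors rewards chains that are both repetitive and varied; a chain with a single concept repeated many times still scores only its frequency.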

Page 19

Chain Scoring (cont'd)

Assign a chain a score of 0 unless its semantic type is one of these:

Type ID   Type Name
T037      Injury or Poisoning
T051      Event
T052      Activity
T061      Therapeutic or Preventive Procedure
T062      Research Activity
T067      Phenomenon or Process
T081      Quantitative Concept
T169      Functional Concept
T170      Intellectual Product
T191      Neoplastic Process
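The filter amounts to a set-membership check before a score is kept. In this sketch the type ids are written zero-padded (T037 rather than T37) to match the ids used elsewhere in the slides; the function name is hypothetical and the sample scores are taken from the strong-chain example.

```python
# Semantic types that keep their score; ids zero-padded to three digits.
ALLOWED_TYPES = {
    "T037", "T051", "T052", "T061", "T062",
    "T067", "T081", "T169", "T170", "T191",
}

def apply_type_filter(scores):
    """Zero out the score of every chain whose semantic type is not allowed."""
    return {
        type_id: (score if type_id in ALLOWED_TYPES else 0.0)
        for type_id, score in scores.items()
    }

raw = {"T081": 14.0, "T061": 6.0, "T079": 4.0}
filtered = apply_type_filter(raw)
print(filtered)
```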

Page 20

Strong Chains

Strong chains identify 'best' semantic types in text
Lexical chaining research identifies 3 factors for strength: (Morris & Hirst, 1991)
  Reiteration: more repetition is better
  Density: shorter distance between concepts is better
  Length: longer chain length is better
Lexical chaining research generally uses:
  two standard deviations above the mean of the scores computed for every chain in the document (Barzilay and Elhadad, 1997)

Page 21

Strong Chains – Example

Top chains:
  T081-Quantitative Concept, score: 14.0
  T061-Therapeutic or Preventive Procedure, score: 6.0
  T169-Functional Concept, score: 6.0
  T079-Temporal Concept, score: 4.0
  T080-Qualitative Concept, score: 4.0
  T082-Spatial Concept, score: 4.0
  T073-Manufactured Object, score: 2.0
  T109-Organic Chemical, score: 2.0
  T170-Intellectual Product, score: 2.0
  T121-Pharmacologic Substance, score: 1.0

Strong chains (2 StdDev):
  Avg score: 1.6666666666666667
  Std Dev: 3.0671497204093914
  Strong Score: 7.80096610748545
  T081-Quantitative Concept: 14.0

Strong chains (1 StdDev):
  Avg score: 1.6666666666666667
  Std Dev: 3.0671497204093914
  Strong Score: 4.733816387076058
  T081-Quantitative Concept: 14.0
  T061-Therapeutic or Preventive Procedure: 6.0
  T169-Functional Concept: 6.0
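The thresholds in this example can be reproduced with the standard library. One assumption is needed: the ten listed scores sum to 45, and an average of 1.667 implies 27 chains in total, so the sketch assumes 17 further chains scored 0. It also assumes a population (not sample) standard deviation, which is what matches the slide's 3.067. The function name is hypothetical.

```python
from statistics import mean, pstdev

# Scores of the ten listed chains, keyed by semantic type id (from the slide).
top_scores = {
    "T081": 14.0, "T061": 6.0, "T169": 6.0, "T079": 4.0, "T080": 4.0,
    "T082": 4.0, "T073": 2.0, "T109": 2.0, "T170": 2.0, "T121": 1.0,
}

# Assumption: 17 more chains with score 0, giving 27 chains in total.
all_scores = list(top_scores.values()) + [0.0] * 17

def strong_chains(named_scores, population, n_std=2.0):
    """Return the threshold (mean + n_std * population std dev) and the
    chains whose score exceeds it."""
    threshold = mean(population) + n_std * pstdev(population)
    strong = {name: s for name, s in named_scores.items() if s > threshold}
    return threshold, strong

thr2, strong2 = strong_chains(top_scores, all_scores, n_std=2.0)
thr1, strong1 = strong_chains(top_scores, all_scores, n_std=1.0)
print(thr2, sorted(strong2))
print(thr1, sorted(strong1))
```

Lowering the cutoff from two standard deviations to one admits T061 and T169 in addition to T081, which is the trade-off the two result blocks illustrate.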

Page 22

Identifying Top Concepts

Part of sentence extraction process
Get top chains (top semantic types)
  based on chain strength
Perform frequency count on concepts within chains
  concept(s) with highest frequency is top concept
Another approach:
  Identify concept relationship types
    assign weight to each relationship type (synonymy, siblings, parent, child)
  Score each concept based on contribution to chain
  Choose highest scoring concept
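The frequency-count approach can be sketched with a counter over the concepts in one chain. The function name is hypothetical; the sample data is the T061 chain from the concept chain example.

```python
from collections import Counter

def top_concepts(chain_concepts):
    """Return the concept(s) with the highest frequency in a chain;
    ties are all returned."""
    counts = Counter(chain_concepts)
    if not counts:
        return []
    best = max(counts.values())
    return [concept for concept, n in counts.items() if n == best]

# Concepts of the T061 chain, in order of appearance.
t061 = [
    "Chemotherapy, Adjuvant",
    "Chemotherapy, Adjuvant",
    "Primary operation (qualifier value)",
    "Chemotherapy, Adjuvant",
    "Therapeutic procedure",
]
print(top_concepts(t061))
```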

Page 23

Sentence Extraction

Use extractive approach
Identify main concepts in text using semantic types
Identify which sentences discuss the main concepts the most
  Using chain strength and concept frequency

Page 24

Sentence Extraction – Examples

Top Concepts – 2 standard deviations

T081-Quantitative Concept
--------------
Concept: Median Statistical Measurement, sentence#9
Sentence: The median disease-free survival (DFS) was 48 months in the treatment group and 16 months in the control group (P = .04);

Concept: Median Statistical Measurement, sentence#10
Sentence: and the median overall survival (OS) was 75 months for treated and 46 months for untreated patients (P = .03).
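The extraction step shown here can be sketched as a lookup from top concepts to the sentences that contain them. The function name and data layout are hypothetical; sentences 9 and 10 are from the slide, while the "High dose" instance and its sentence number are assumed for illustration.

```python
def extract_sentences(sentences, instances, top):
    """Return, in document order, every sentence that contains a top concept.

    sentences: dict of sentence number -> sentence text
    instances: (concept, sentence number) pairs from the strong chains
    top:       set of top concept names
    """
    wanted = sorted({num for concept, num in instances if concept in top})
    return [sentences[num] for num in wanted]

sentences = {
    # Sentence number 3 is an assumption made for this example.
    3: "The present study was designed with restricted selection criteria "
       "and high dose-intensities of the two most active chemotherapeutic agents.",
    9: "The median disease-free survival (DFS) was 48 months in the treatment "
       "group and 16 months in the control group (P = .04);",
    10: "and the median overall survival (OS) was 75 months for treated and "
        "46 months for untreated patients (P = .03).",
}
instances = [
    ("High dose", 3),
    ("Median Statistical Measurement", 9),
    ("Median Statistical Measurement", 10),
]
summary = extract_sentences(sentences, instances, {"Median Statistical Measurement"})
print("\n".join(summary))
```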

Page 25

Evaluation

Qualitative
  Domain expert: Dr. Ari Brooks
  Provided concept filtering
Quantitative
  Concept chains: Compare abstract vs. full text (Silber and McCoy, 2002)
    Recall: Percentage of strong chains from the main text that are in the abstract
    Precision: Percentage of concept instances in the abstract that also appear in strong chains in the document
  Summarization:
    Compare with Word 2002, SweSum, Copernic
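The two measures can be sketched as simple ratios. This is one reading of the definitions above: recall is computed over semantic type ids and precision over concept instances, and a chain counts as "in the abstract" if its type occurs there at all. The function names and the sample data are hypothetical.

```python
def chain_recall(full_text_strong_types, abstract_types):
    """Share of the full text's strong chains whose type also occurs in the abstract."""
    if not full_text_strong_types:
        return 0.0
    found = full_text_strong_types & abstract_types
    return len(found) / len(full_text_strong_types)

def concept_precision(abstract_instances, strong_chain_concepts):
    """Share of the abstract's concept instances that appear in the
    full text's strong chains."""
    if not abstract_instances:
        return 0.0
    hits = sum(1 for c in abstract_instances if c in strong_chain_concepts)
    return hits / len(abstract_instances)

# Hypothetical data for one paper.
recall = chain_recall({"T081", "T061", "T169"}, {"T081", "T061", "T079"})
precision = concept_precision(["A", "A", "B", "C"], {"A", "B"})
print(round(recall, 2), round(precision, 2))
```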

Page 26

Evaluation

How similar are sentences extracted by BioChain to other systems?

Page 27

Evaluation

Do abstracts adequately represent the full-text?

Page 28

Evaluation

Avg p=0.90, r=0.92
Avg # of strong chains in full-text is 3
  Represents 2% of all possible semantic types
Avg unique UMLS concepts in abstract is 8
Avg 80% coverage of concepts in filter
Diversity test
  p=0.00, r=0.33

Page 29

DUC 2004 Summarization Approaches

Systems:
  News Story
  LAKE
  KMS
  GISTexter
All used extractive sentence approach

Page 30

DUC 2004 – News Story

C5.0 decision tree to predict words in a summary
Used 8 features:
  TF of word in document
  IDF of term in external news corpus
  position of word from start of document
  Lexical cohesion score between word and document
  Binary flags: noun, verb, adjective, noun phrase
Results:
  TF, word position and IDF have greatest impact on summary quality
  lexical cohesion adds little as feature in decision tree

Page 31

DUC 2004 – LAKE

keyphrase extraction approach
  extracts all uni-grams, bi-grams, tri-grams, and four-grams and filters them with part-of-speech patterns
  Naïve Bayes classifier, trained using manual keyphrases, identifies relevant keyphrases using:
    keyphrase head TF*IDF
    distance of keyphrase from the start of document
  Classifier identifies candidate phrases that maximize TF*IDF and occur at beginning of document
Results:
  Scored in middle of all submissions
  Add additional features that capture the semantic properties of keyphrases: lexical chains

Page 32

DUC 2004 – KMS

Text decomposed into a parse tree format
  identify noun phrases and score them based on a frequency analysis of terms in the noun phrases
Results:
  frequency-based approach performs better than systems based on other approaches
  Simple to implement

Page 33

DUC 2004 – GISTexter

computes weight for each term in collection
  based on term frequency in a relevant set of documents
Sentence score = sum of weights of each term in sentence
Top scoring sentences are then extracted
Results
  Performed among the best systems
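The scoring rule on this slide, sentence score as the sum of term weights, can be sketched as follows. The slide does not give the weighting formula, so raw term frequency over the relevant documents is assumed here, along with whitespace tokenization; the function names and sample texts are hypothetical.

```python
from collections import Counter

def term_weights(relevant_docs):
    """Weight of a term = its raw frequency across the relevant documents
    (an assumed weighting, chosen for simplicity)."""
    counts = Counter()
    for doc in relevant_docs:
        counts.update(doc.lower().split())
    return counts

def score_sentence(sentence, weights):
    """Sentence score = sum of the weights of its terms (unseen terms weigh 0)."""
    return sum(weights[term] for term in sentence.lower().split())

def top_sentences(sentences, weights, k=1):
    """Return the k highest-scoring sentences."""
    ranked = sorted(sentences, key=lambda s: score_sentence(s, weights), reverse=True)
    return ranked[:k]

docs = ["tumor growth slowed", "tumor size fell"]
weights = term_weights(docs)
candidates = ["tumor growth slowed", "patients were happy"]
best = top_sentences(candidates, weights, k=1)
print(best, score_sentence(best[0], weights))
```

Because the score is a plain sum, longer sentences tend to score higher; a real system may normalize for length.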

Page 34

Summary

Want to summarize biomedical texts (specifically oncology)
Use lexical chaining approaches with existing UMLS resources to identify the 'aboutness' of a text using concepts vs terms
Extract sentences containing strongest concepts within a semantic type chain
Result is an indicative summary of what text is about
Evaluation shows concept chaining is strong between human summary and full-text

Page 35

References

Afantenos, S. D., Karkaletsis, V., & Stamatopoulos, P. (2005). Summarization from Medical Documents: A Survey. Artificial Intelligence in Medicine, 33(2), 157-177.

Aronson, A. R. (2001). Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings of the AMIA Symposium 2001, 17-21.

Barzilay, R., & Elhadad, M. (1997). Using Lexical Chains for Text Summarization. In Proceedings of the Intelligent Scalable Text Summarization Workshop (ISTS'97), ACL, Madrid, Spain, 10-18.

Copernic Technologies, I. (2005). Copernic Summarizer. Canada. Retrieved August 7, 2005, from http://www.copernic.com

D'Avanzo, E., Magnini, B., & Vallin, A. (2004). Keyphrase Extraction for Summarization Purposes: The LAKE System at DUC-2004. Proceedings of the 2004 Document Understanding Conference, Boston, USA, Retrieved June 3, 2005,

Dalianis, H. (2000). SweSum - A Text Summarizer for Swedish (No. TRITA-NA-P0015). Stockholm, Sweden: NADA, KTH.

Doran, W., Stokes, N., Carthy, J., & Dunnion, J. (2004). Comparing Lexical Chain-based Summarisation Approaches using an Extrinsic Evaluation. Proceedings of the Global WordNet Conference (GWC 2004),

Doran, W. P., Stokes, N. S., Dunnion, J., & Carthy, J. (2004). Assessing the Impact of Lexical Chain Scoring Methods and Sentence Extraction Schemes on Summarization. Proceedings of the 5th International Conference on Intelligent Text Processing and Computational Linguistics CICLing-2004,

Doran, W., Stokes, N., Newman, E., Dunnion, J., Carthy, J., & Toolan, F. (2004). News Story Gisting at University College Dublin. Proceedings of the Document Understanding Conference (DUC-2004),

Page 36: 1 BioChain : Using Lexical Chaining Approaches for Biomedical Text Summarization Lawrence Reeve INFO780 - Final Report – Summer 2005

36

References, continuedReferences, continued Fellbaum, C. (1998). Fellbaum, C. (1998). WORDNET: An Electronic Lexical DatabaseWORDNET: An Electronic Lexical DatabaseThe MIT The MIT

Press.Press. Galley, M., & McKeown, K. (2003). Improving Word Sense Disambiguation in Galley, M., & McKeown, K. (2003). Improving Word Sense Disambiguation in

Lexical Chaining. Lexical Chaining. Proceedings of the Eighteenth International Joint Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Conference on Artificial Intelligence, Acapulco,Mexico, 1486-1488. Acapulco,Mexico, 1486-1488.

Lacatusu, F., Hickl, A., Harabagiu, S., & Nezda, L. (2004). Lite-GISTexter at DUC 2004. Proceedings of the 2004 Document Understanding Conference. Retrieved June 10, 2005.

Lin, C. (2005). Recall-Oriented Understudy for Gisting Evaluation (ROUGE). Retrieved August 20, 2005 from http://www.isi.edu/~cyl/ROUGE/

Litkowski, K. C. (2004). Summarization Experiments in DUC 2004. Proceedings of the 2004 Document Understanding Conference, Boston, USA. Retrieved June 5, 2005.

Microsoft Corporation. (2002). Microsoft Word 2002. Redmond, Washington, USA. Retrieved August 7, 2005, from http://office.microsoft.com

Morris, J., & Hirst, G. (1991). Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text. Computational Linguistics, 17(1), 21-43.

National Institute of Standards and Technology (NIST). (2005). Document Understanding Conferences. Retrieved August 20, 2005 from http://www-nlpir.nist.gov/projects/duc/

Silber, G. H., & McCoy, K. F. (2002). Efficiently Computed Lexical Chains as an Intermediate Representation for Automatic Text Summarization. Computational Linguistics, 28(4).

References, continued

SNOMED International. (2005). SNOMED Clinical Terms. Retrieved July 31, 2005 from http://www.snomed.org/

Turney, P. (2000). Learning algorithms for keyphrase extraction. Information Retrieval, 2(4), 303-336.

United States National Library of Medicine. (2005a). ClinicalTrials.gov. Retrieved July 31, 2005 from http://www.clinicaltrials.gov/

United States National Library of Medicine. (2005b). MetaMap Transfer. Retrieved July 31, 2005 from http://mmtx.nlm.nih.gov/

United States National Library of Medicine. (2005c). PubMed. Retrieved July 31, 2005 from http://www.ncbi.nlm.nih.gov/entrez/query.fcgi

United States National Library of Medicine. (2005d). Unified Medical Language System (UMLS). Retrieved July 5, 2005 from http://www.nlm.nih.gov/research/umls/

United States National Library of Medicine. (2004a). UMLS Metathesaurus Fact Sheet. Retrieved July 31, 2005 from http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html

United States National Library of Medicine. (2004b). UMLS Semantic Network Fact Sheet. Retrieved July 31, 2005 from http://www.nlm.nih.gov/pubs/factsheets/umlssemn.html