Summarization of XML Documents

Kondreddi Sarath Kumar

Outline

I. Motivation

II. System for XML Summarization

III. Ranking Model and Summary Generation

IV. User Evaluation

V. The Xoom tool and a few example summaries

VI. Conclusion

Motivation

[Figure labels: XML Document Collection (e.g., IMDB); XML Document]

Types of XML Document Summaries

1) Generic summary – summarizes the entire contents of the document.

2) Query-biased summary – summarizes those parts of the document that are relevant to the user’s query.

Aims

We aim for summaries that are:

• Generated automatically

• Highly constrained by size

• Highly informative

• High in coverage

Challenges

• Structure is as important as text

• Varying text length

System for XML Summarization

[Figure: system architecture. Components: Info Unit Generator; Ranking Unit (Tag Ranker, Text Ranker); Summary Generator. Data labels: XML Doc, Tag Units, Text Units, Corpus Statistics, Ranked Tag Units, Ranked Text Units, Summary Size, Summary.]

Information Units of an XML Document

Tag

- Regarded as metadata

- Can be highly redundant

Text

- An instance of the tag

- Much less redundant

- Varies in size

Ranking Unit

I. Tag Ranking

Typicality: How salient is the tag in the corpus?

E.g.: <title>

• Typical tags define the context of the document

• They occur regularly in most or all of the documents

• Quantified by the fraction of documents in which the tag occurs (df)

Specialty: Does the tag occur more or less frequently in this document than usual?

• Special tags denote a special aspect of the current document

• They occur many more or many fewer times in the current document than usual

• Quantified by the deviation from the average number of occurrences per document

The two scores are combined with a mixing weight $\alpha$:

$P(T_i) = \alpha \, P_{typ}(T_i) + (1 - \alpha) \, P_{spe}(T_i)$
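The tag ranking mixture can be illustrated with a minimal Python sketch. It is not the authors' implementation. The slides only say that typicality is the document fraction (df) and that specialty is a deviation from the average number of occurrences. The sketch therefore assumes an absolute deviation, and it assumes both scores are normalized over the tags of the current document. The names `rank_tags` and `normalize`, and the list-of-tag-names input format, are hypothetical.

```python
from collections import Counter


def normalize(scores):
    """Scale non-negative scores so they sum to 1 (uniform if all are zero)."""
    total = sum(scores.values())
    if total == 0:
        return {tag: 1.0 / len(scores) for tag in scores}
    return {tag: value / total for tag, value in scores.items()}


def rank_tags(doc_tags, corpus, alpha=0.8):
    """Return alpha * P_typ + (1 - alpha) * P_spe for each tag in doc_tags.

    doc_tags: list of tag names occurring in the current document.
    corpus:   list of such lists, one per document (hypothetical format).
    """
    n_docs = len(corpus)
    doc_freq = Counter()   # number of documents that contain each tag
    total_occ = Counter()  # total occurrences of each tag in the corpus
    for tags in corpus:
        counts = Counter(tags)
        doc_freq.update(counts.keys())
        total_occ.update(counts)

    in_doc = Counter(doc_tags)
    # Typicality: fraction of documents in which the tag occurs (df).
    typ = normalize({t: doc_freq[t] / n_docs for t in in_doc})
    # Specialty (assumption): absolute deviation from the corpus average.
    spe = normalize({t: abs(in_doc[t] - total_occ[t] / n_docs) for t in in_doc})
    return {t: alpha * typ[t] + (1 - alpha) * spe[t] for t in in_doc}


if __name__ == "__main__":
    toy_corpus = [["title", "actor", "actor"], ["title", "actor"], ["title", "trivia"]]
    print(rank_tags(toy_corpus[0], toy_corpus, alpha=0.8))
```

With `alpha=1.0` the ranking depends on typicality alone, and with `alpha=0.0` on specialty alone, which makes the effect of the weight easy to inspect on a toy corpus.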

II. Text Ranking

Two categories of text

1) Entities

2) Regular text

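Ranking is done based on context of occurrence: tag context, document context, and corpus context.

$P(t_j \mid D, T_i) = \lambda_1 \, P(t_j \mid D, c(T_i)) + \lambda_2 \, P(t_j \mid D) + (1 - \lambda_1 - \lambda_2) \, P(t_j \mid C)$

The three terms correspond to the tag context, the document context, and the corpus context, respectively. (The weight symbols are not legible in the source; $\lambda_1$ and $\lambda_2$ are placeholders.)

- No redundancy in tag context (e.g., actor names, genre)

- Redundancy in tag context (e.g., plots, goofs, trivia items)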
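The three-context scoring of a text term can be sketched as a linear interpolation. This is a rough illustration under stated assumptions, not the authors' estimator. Each context is represented as a plain list of tokens, probabilities are maximum-likelihood term frequencies, and the weights `lam1` and `lam2` (with their default values) are hypothetical placeholders. The function names are likewise invented for the example.

```python
def term_prob(term, tokens):
    """Maximum-likelihood estimate of P(term | tokens); 0.0 for empty input."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def text_score(term, tag_tokens, doc_tokens, corpus_tokens, lam1=0.5, lam2=0.3):
    """Interpolate tag-context, document-context and corpus-context estimates.

    lam1 weights the tag context, lam2 the document context, and the
    remaining mass (1 - lam1 - lam2) goes to the corpus context.
    """
    if lam1 < 0 or lam2 < 0 or lam1 + lam2 > 1:
        raise ValueError("weights must be non-negative and sum to at most 1")
    return (lam1 * term_prob(term, tag_tokens)
            + lam2 * term_prob(term, doc_tokens)
            + (1 - lam1 - lam2) * term_prob(term, corpus_tokens))


if __name__ == "__main__":
    tag_ctx = ["ship", "iceberg"]                  # tokens under one tag
    doc_ctx = ["ship", "iceberg", "love", "ship"]  # tokens in the whole document
    corpus_ctx = doc_ctx + ["heist", "casino", "ship", "war"]
    for term in ("ship", "casino"):
        print(term, text_score(term, tag_ctx, doc_ctx, corpus_ctx))
```

A term that is frequent in its own tag context and document scores higher than one that appears only elsewhere in the corpus, which is the behaviour the three-context ranking is meant to capture.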

Correlated tags and text

Related tag units are often siblings of each other, e.g., Actor and Role.

Inclusion Principle

Let $T_{sib} = \{T_1, T_2, T_3, \ldots, T_k\}$ and $rank(T_1) \ge rank(T_2) \ge \ldots \ge rank(T_k)$.

Case 1: $rank(T_1) = rank(T_2) = \ldots = rank(T_j)$ where $j \le k$

All of $\{T_1, \ldots, T_j\}$ have to be included at once. Choose a random $T_i$ from $\{T_1, \ldots, T_j\}$ and include its best-ranked text value and its siblings.

Case 2: $rank(T_1) > rank(T_2)$ (also for the other $T_i$s)

Only $T_1$ is included currently, with its best text value $t_1$. At a later stage, if $T_2$ is considered for inclusion, then the text value $t_2$ that is a sibling of $t_1$ is included, and so on.

Generation of Summary

Consider the following tag rank table:

Tag       Prob.
Actor     0.5
Keyword   0.3
Trivia    0.2

To generate a summary with 30 tags, 15 actor tags, 9 keyword tags, and 6 trivia tags would be required.

Tag       Required no. of tags   Available no. of tags
Actor     15                     30
Keyword   9                      2
Trivia    6                      15

Distribute the remaining “tag-budget” by re-normalizing the distribution of available tags.

Generating the summary with 30 tags

Step   Tag       Prob.   No. of tags available   No. of tags to be added   No. of tags added in the round (running total)

Round 1
1.1    actor     0.5     30                      15                        15
1.2    keyword   0.3     2                       9                         2
1.3    trivia    0.2     15                      6                         6
Total                                                                      23

Round 2
2.1    actor     0.715   15                      5                         5 (20)
-      keyword   0       0                       0                         0 (2)
2.2    trivia    0.285   15                      2                         2 (8)
Total                                                                      30
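The round-based distribution of the tag budget can be sketched in Python as follows. This is one possible reading of the procedure, not the authors' code. The function name `allocate_tag_budget`, the use of `round()` for each share, and the one-tag fallback when rounding yields no progress are all assumptions made for illustration.

```python
def allocate_tag_budget(probs, available, budget):
    """Distribute `budget` tags over tag types in proportion to `probs`.

    Each round gives every tag type its share of the remaining budget,
    capped by what is still available. Leftover budget is redistributed
    in the next round over the tag types that still have tags left,
    using the re-normalized probabilities.
    """
    added = {tag: 0 for tag in probs}
    remaining = budget
    while remaining > 0:
        open_tags = [t for t in probs
                     if probs[t] > 0 and available[t] - added[t] > 0]
        if not open_tags:
            break  # the document has no more tags to offer
        total_p = sum(probs[t] for t in open_tags)  # re-normalization constant
        round_budget = remaining
        progress = 0
        for t in open_tags:
            wanted = round(round_budget * probs[t] / total_p)
            taken = min(wanted, available[t] - added[t], remaining)
            added[t] += taken
            remaining -= taken
            progress += taken
        if progress == 0:
            # Assumption: if rounding gives every share zero, hand one tag
            # to the most probable open tag type so the loop terminates.
            best = max(open_tags, key=lambda t: probs[t])
            added[best] += 1
            remaining -= 1
    return added


if __name__ == "__main__":
    probs = {"actor": 0.5, "keyword": 0.3, "trivia": 0.2}
    available = {"actor": 30, "keyword": 2, "trivia": 15}
    print(allocate_tag_budget(probs, available, budget=30))
```

On the example from the slides (30 actor, 2 keyword, and 15 trivia tags available, budget 30), the first round adds 15, 2, and 6 tags, and the second round spreads the remaining 7 over actor and trivia, giving 20, 2, and 8 in total.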

User Evaluation

Dataset   No. of files   No. of unique tags   No. of documents used for evaluation
Movie     200,000        39                   8
People    150,000        11                   4

Summary configurations evaluated:

Dataset   Size   alpha
Movie     5      1, 0.8
Movie     10     1, 0.8, 0.6
Movie     20     1, 0.8, 0.6
People    5      1, 0.6
People    10     1, 0.6

Total: 64 + 16 = 80

• Automatically generated summaries (80) were mixed with human-generated summaries (32)

• Summaries were graded on a scale of 1-7, where 1 is extremely bad and 7 is perfect

• Six different evaluators; each summary was evaluated by at least three

User Evaluation

Tabulation of average and above-average grades (4-7)

Dataset   Size                   alpha = 1.0     alpha = 0.8     alpha = 0.6    Total (across alpha)
Movie     5                      8/8 (100%)      5/8 (62.5%)     -              13/16 (81.25%)
Movie     10                     8/8 (100%)      7/8 (87.5%)     1/8 (12.5%)    16/24 (66.6%)
Movie     20                     7/8 (87.5%)     7/8 (87.5%)     4/8 (50%)      18/24 (75%)
Movie     Total (across sizes)   23/24 (95.8%)   19/24 (79.1%)   5/16 (31.2%)   47/64 (73.4%)
People    5                      3/4 (75%)       -               1/4 (25%)      4/8 (50%)
People    10                     4/4 (100%)      -               4/4 (100%)     8/8 (100%)
People    Total (across sizes)   7/8 (87.5%)     -               5/8 (62.5%)    12/16 (75%)

Note: A grade is counted only if at least two evaluators agreed on it.

Xoom

A tool for exploring and summarizing XML documents

Exploration Mode

Xoom

Summarization Mode - Titanic.xml

Conclusion

• A fully automated XML summary generator

• Ranking of tags and text based on the ranking model

• Generation of summary from ranked tags & text within memory budget

• Xoom – a tool for exploring and summarizing XML documents

• User Evaluation

Publications

• Xoom: A tool for zooming in and out of XML Documents (Demo). Maya Ramanath and Kondreddi Sarath Kumar. Proc. of Intl. Conf. on Extending Database Technology (EDBT), St. Petersburg, Russia, March 2009.

• A Rank-Rewrite Framework for Summarizing XML Documents. Maya Ramanath and Kondreddi Sarath Kumar. 2nd Intl. Workshop on Ranking in Databases (DBRank, in conjunction with ICDE 2008), Cancun, Mexico, April 2008.

User Evaluation of Summaries

Link: http://www.mpi-inf.mpg.de/~ramanath/Summarization/

Thanks!

Appendix

Informativeness

Coverage

Why not tag-text pairs?

Ocean’s Eleven.xml - Summaries

Titanic.xml on OST Summarizer

Gern </actor> <role > Drowning man </role> </casting> <casting> <actor > Martin, Johnny (I) </actor> <role > Rescue boat crewman </role> </casting> <casting> <actor > Lynch, Don (II) </actor> <role > Frederick Spedden </role> </casting> <casting> <actor > Cameron, James (I) </actor> <role > Cameo appearance (steerage dancer) </role> </casting> <casting> <actor > Cragnotti, Chris </actor> <role > Victor Giglio </role> </casting> <casting> <actor > Kenny, Tony (I) </actor> <role > Deckhand </role> </casting> <casting> <actor > Campolo, Bruno </actor> <role > Second-class man </role> </casting> </cast> <misc> <miscEntry> <person > Abercrombie, Ian </person> <job > adr loop group </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Allen, Melinda </person> <job > assistant: James Cameron </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Altman, John (I) </person> <job > historical music advisor </job> </miscEntry> <miscEntry> <person > Altman, John (I) </person> <job > music arranger: period music </job> </miscEntry> <miscEntry> <person > Amorelli, Mike </person> <job > rigging gaffer </job> </miscEntry> <miscEntry> <person > Amorelli, Paul </person> <job > rigging best boy electric </job> </miscEntry> <miscEntry> <person > Anaya, Daniel </person> <job > grip </job> </miscEntry> <miscEntry> <person > Andrade, Maria Louise </person> <job > costumer </job> </miscEntry> <miscEntry> <person > Baker, Brett </person> <job > photo double: Leonardo DiCaprio </job> </miscEntry> <miscEntry> <person > Arvizu, Ricardo </person> <job > grip </job> </miscEntry> <miscEntry> <person > Bailes, Tim </person> <job > marine consultant </job> </miscEntry> <miscEntry> <person > Arneson, Charlie </person> <job > aquatic researcher </job> </miscEntry> <miscEntry> <person > Arneson, Charlie </person> <job > aquatic supervisor </job> </miscEntry> <miscEntry> <person > Arnold, Amy </person> <job > key set costumer: women </job> </miscEntry> <miscEntry> 
<person > Atkinson, Lisa (I) </person> <job > pre-production consultant </job> </miscEntry> <miscEntry> <person > Barius, Claudette </person> <job > additional still photographer: pre-production </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Baker, Jeanie </person> <job > costumer </job> </miscEntry> <miscEntry> <person > Barton, Roger </person> <job > associate editor </job> </miscEntry> <miscEntry> <person > Baker, Tom (VI) </person> <job > electrician </job> </miscEntry> <miscEntry> <person > Bass, Andy (I) </person> <job > assistant music engineer </job> </miscEntry> <miscEntry> <person > Barber, Jamie (I) </person> <job > first assistant camera: Halifax </job> <miscEntry> <person > Baylon, Hugo </person> <job > location assistant </job> </miscEntry> <miscEntry> <person > Bee, Guy Norman </person> <job > camera operator </job> </miscEntry> <miscEntry> <person > Benarroch, Ariel </person> <job > first assistant camera: second unit </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Bendt, Tony </person> <job > company grip </job> </miscEntry> <miscEntry> <person > Boccoli, Daniel </person> <job > apprentice editor </job> </miscEntry> <miscEntry> <person > Botham, Buddy </person> <job > generator operator </job> </miscEntry> <miscEntry> <person > Bonner, Kit </person> <job > naval consultant </job> </miscEntry> <miscEntry> <person > Blevins, Cha </person> <job > costumer </job> <extra > as Deborah 'Cha' Blevins </extra> </miscEntry> <miscEntry> <person > Bloom, Kirk </person> <job > second assistant camera </job> </miscEntry> <miscEntry> <person > Bolton, Paul </person> <job > electrician </job> </miscEntry> <miscEntry> <person > Bornstein, Bob </person> <job > music preparation </job> </miscEntry> <miscEntry> <person > Bozeman, Marsha </person> <job > costumer </job> </miscEntry> <miscEntry> <person > Broberg, David </person> <job > first assistant film editor </job> </miscEntry> <miscEntry> <person > Brady, Kenneth 
Patrick </person> <job > production assistant </job> </miscEntry> <miscEntry> <person > Bruno, Keri </person> <job > production assistant </job> </miscEntry> <miscEntry> <person > Bryan, Mitch (III) </person> <job > assistant video assist operator </job> </miscEntry> <miscEntry> <person > Bryce, Malcolm </person> <job > lamp operator </job> </miscEntry> <miscEntry> <person > Burdick, Geoff </person> <job > production associate </job> </miscEntry> <miscEntry> <person > Buckley, John (III) </person> <job > gaffer </job> </miscEntry> <miscEntry> <person > Cameron, James (I) </person> <job > director of photography: Titanic deep dive camera </job> </miscEntry> <miscEntry> <person > Cameron, James (I) </person> <job > special camera equipment designer </job> </miscEntry> <miscEntry> <person > Cameron, Michael (II) </person> <job > special deep ocean camera system </job> </miscEntry> <miscEntry> <person > Byall, Bruce </person> <job > grip </job> </miscEntry> <miscEntry> <person > Byron, Carol Sue </person> <job > additional production accountant </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Canedo, Luis </person> <job > rigging electrician </job> <extra > as Jose

User Evaluation of Summaries – IMDB Dataset

Files used

Dataset   Filenames
Movie     American Beauty, Ocean’s Eleven, Kill Bill Part II, Saving Private Ryan, The Last Samurai, The Usual Suspects, Titanic, A Space Odyssey
People    Brad Pitt, Matt Damon, Ben Affleck, Leonardo DiCaprio

User Evaluation of Summaries – IMDB Dataset

Xoom