
Summarization of XML Documents

Kondreddi Sarath Kumar

Outline

I. Motivation

II. System for XML Summarization

III. Ranking Model and Summary Generation

IV. User Evaluation

V. Xoom tool and a few example summaries

VI. Conclusion

Motivation

XML Document Collection (e.g., IMDB)

XML Document

Types of XML Document Summaries

1) Generic summary – summarizes the entire contents of the document.

2) Query-biased summary – summarizes those parts of the document that are relevant to the user's query.

Aims

We aim at summaries that are:

• Generated automatically

• Highly constrained by size

• Highly informative

• High coverage

Challenges

• Structure is as important as text

• Varying text length

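System for XML Summarization

[Figure: system architecture. The XML document enters the Info Unit Generator, which emits tag units and text units. The Ranking Unit (Tag Ranker and Text Ranker, drawing on corpus statistics) turns these into ranked tag units and ranked text units. The Summary Generator combines the ranked units with the requested summary size to produce the summary.]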

Information Units of an XML Document


Tag

- Regarded as metadata

- Can be highly redundant

Text

- An instance of the tag

- Much less redundant

- Varies in size
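The split into tag units and text units can be illustrated with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` parser; the function name `info_units` and the sample document are hypothetical and are not taken from the system described here.

```python
import xml.etree.ElementTree as ET


def info_units(xml_string):
    """Return (tag_units, text_units) for one XML document.

    tag_units:  every element name, in document order (repeats kept,
                since tags can be highly redundant).
    text_units: (tag, text) pairs for elements that carry text.
    """
    root = ET.fromstring(xml_string)
    tag_units, text_units = [], []
    for elem in root.iter():
        tag_units.append(elem.tag)
        text = (elem.text or "").strip()
        if text:
            text_units.append((elem.tag, text))
    return tag_units, text_units


# Hypothetical sample document for illustration only.
sample = (
    "<movie><title>Titanic</title>"
    "<cast><actor>A</actor><actor>B</actor></cast></movie>"
)
tags, texts = info_units(sample)
```

Here the `actor` tag appears twice among the tag units, while each text unit is a distinct instance of its tag.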

Ranking Unit

I. Tag Ranking

Typicality: How salient is the tag in the corpus?

E.g.: <title>

• Typical tags define the context of the document

• Occur regularly in most or all of the documents

• Quantified by the fraction of documents in which the tag occurs (df)

Specialty: Does the tag occur more or less frequently in this document than usual?

• Special tags denote a special aspect of the current document

• Occur many more or many fewer times in the current document than usual

• Quantified by the deviation from the average number of occurrences per document

$P(T_i) = \alpha \, P_{typ}(T_i) + (1 - \alpha) \, P_{spe}(T_i)$
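A minimal sketch of how typicality and specialty could be combined into one tag score. The slides only say that typicality is the document frequency and that specialty is a deviation from the average occurrence count; the absolute deviation, the normalization of both parts to sum to 1, and the names `tag_scores`, `normalize`, and `alpha` are assumptions made for illustration.

```python
from collections import Counter


def normalize(weights):
    """Scale non-negative weights so they sum to 1 (uniform if all are 0)."""
    total = sum(weights.values())
    if total == 0:
        return {k: 1.0 / len(weights) for k in weights}
    return {k: v / total for k, v in weights.items()}


def tag_scores(corpus, doc, alpha=0.8):
    """Score the tags of `doc`; `corpus` is a list of tag lists, one per document."""
    n_docs = len(corpus)
    doc_freq = Counter()      # number of documents containing each tag
    occurrences = Counter()   # total occurrences of each tag in the corpus
    for tags in corpus:
        counts = Counter(tags)
        occurrences.update(counts)
        doc_freq.update(counts.keys())

    doc_counts = Counter(doc)
    # Typicality: fraction of documents in which the tag occurs (df).
    typicality = normalize({t: doc_freq[t] / n_docs for t in doc_counts})
    # Specialty (assumed form): absolute deviation from the average count.
    specialty = normalize(
        {t: abs(c - occurrences[t] / n_docs) for t, c in doc_counts.items()}
    )
    return {
        t: alpha * typicality[t] + (1 - alpha) * specialty[t]
        for t in doc_counts
    }


# Hypothetical toy corpus of three documents.
corpus = [["title", "actor", "actor"], ["title", "actor"], ["title", "trivia"]]
typical_only = tag_scores(corpus, corpus[0], alpha=1.0)
special_only = tag_scores(corpus, corpus[0], alpha=0.0)
```

With `alpha=1.0` the ubiquitous `title` tag ranks first; with `alpha=0.0` the `actor` tag, which occurs more often than usual in this document, ranks first.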

II. Text Ranking

Two categories of text

1) Entities

2) Regular text

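$P(t_j \mid D, T_i) = \lambda_1 \, P(t_j \mid D, c(T_i)) + \lambda_2 \, P(t_j \mid D) + (1 - \lambda_1 - \lambda_2) \, P(t_j \mid C)$

The three terms correspond to the tag context, the document context, and the corpus context, with mixing weights $\lambda_1$ and $\lambda_2$.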

Ranking is done based on context of occurrence.

- No redundancy in tag context (E.g.: actor names, genre)

- Redundancy in tag context (E.g.: plots, goofs, trivia items)
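The mixture over tag, document, and corpus context can be sketched as a simple interpolation. This is only an illustration: estimating each probability as a relative frequency, and the names `text_score`, `lam_tag`, and `lam_doc` with their default values, are assumptions and are not given in the slides.

```python
from collections import Counter


def text_score(text, tag_ctx, doc_ctx, corpus_ctx, lam_tag=0.5, lam_doc=0.3):
    """Interpolate the relative frequency of `text` over three contexts.

    Each context is a Counter of text values or terms; the corpus term
    receives the remaining weight 1 - lam_tag - lam_doc.
    """
    def rel_freq(counts):
        total = sum(counts.values())
        return counts[text] / total if total else 0.0

    lam_corpus = 1.0 - lam_tag - lam_doc
    return (
        lam_tag * rel_freq(tag_ctx)
        + lam_doc * rel_freq(doc_ctx)
        + lam_corpus * rel_freq(corpus_ctx)
    )


# Hypothetical counts: terms under one tag, in the document, in the corpus.
tag_ctx = Counter({"iceberg": 3, "ship": 1})
doc_ctx = Counter({"iceberg": 4, "ship": 4, "love": 2})
corpus_ctx = Counter({"iceberg": 5, "ship": 20, "love": 75})

iceberg = text_score("iceberg", tag_ctx, doc_ctx, corpus_ctx)
ship = text_score("ship", tag_ctx, doc_ctx, corpus_ctx)
```

A term that is frequent under the tag but rare in the corpus (`iceberg`) outranks one that is common everywhere (`ship`).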

Correlated tags and text

Related tag units are often siblings of each other, e.g., Actor and Role.

Inclusion Principle

Let $T_{sib} = \{T_1, T_2, T_3, \ldots, T_k\}$ and $rank(T_1) \geq rank(T_2) \geq \ldots \geq rank(T_k)$.

Case 1: $rank(T_1) = rank(T_2) = \ldots = rank(T_j)$ where $j \leq k$

- All of $\{T_1, \ldots, T_j\}$ have to be included at once.

- Choose a random $T_i$ from $\{T_1, \ldots, T_j\}$, include its best-ranked text value $t_i$ and its siblings.

Case 2: $rank(T_1) > rank(T_2)$ (also the other $T_i$'s)

- Only $T_1$ is currently included, with its best text value $t_1$.

- At a later stage, if $T_2$ is considered for inclusion, then the text value $t_2$ that is a sibling of $t_1$ is included, and so on.

Generation of Summary

Consider the following tag rank table:

Tag | Prob.
Actor | 0.5
Keyword | 0.3
Trivia | 0.2

To generate a summary with 30 tags, 15 actor tags, 9 keyword tags, and 6 trivia tags would be required.

Tag | Required no. of tags | Available no. of tags
Actor | 15 | 30
Keyword | 9 | 2
Trivia | 6 | 15

Distribute the remaining “tag-budget” by re-normalizing the distribution of available tags

Generating the summary with 30 tags (running total per tag in parentheses):

Step | Tag | Prob. | No. of tags available | No. of tags to be added | No. of tags added in the round
Round 1 | | | | |
1.1 | actor | 0.5 | 30 | 15 | 15
1.2 | keyword | 0.3 | 2 | 9 | 2
1.3 | trivia | 0.2 | 15 | 6 | 6
Total | | | | | 23
Round 2 | | | | |
2.1 | actor | 0.715 | 15 | 5 | 5 (20)
- | keyword | 0 | 0 | 0 | 0 (2)
2.2 | trivia | 0.285 | 15 | 2 | 2 (8)
Total | | | | | 30
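The re-normalization rounds above can be sketched as a small loop. This is an illustrative sketch, not the talk's implementation: the function name `allocate_tags`, the rounding rule, and the choice to give every open tag at least one slot per round (so the loop always makes progress) are assumptions.

```python
def allocate_tags(probs, available, budget):
    """Distribute a tag budget over tags in proportion to their probabilities.

    probs:     tag -> rank probability
    available: tag -> number of instances of that tag in the document
    budget:    total number of tags wanted in the summary
    """
    added = {tag: 0 for tag in probs}
    remaining = budget
    while remaining > 0:
        # Tags that still have unused instances; exhausted tags drop out.
        open_tags = {
            t: p for t, p in probs.items() if p > 0 and available[t] > added[t]
        }
        if not open_tags:
            break  # the document has fewer tags than the budget
        total = sum(open_tags.values())  # re-normalization constant
        round_budget = remaining
        for tag, p in sorted(open_tags.items(), key=lambda kv: -kv[1]):
            share = max(1, round(round_budget * p / total))
            take = min(share, available[tag] - added[tag], remaining)
            added[tag] += take
            remaining -= take
    return added


result = allocate_tags(
    probs={"actor": 0.5, "keyword": 0.3, "trivia": 0.2},
    available={"actor": 30, "keyword": 2, "trivia": 15},
    budget=30,
)
```

For the example table this yields 20 actor, 2 keyword, and 8 trivia tags: round 1 adds 15, 2, and 6, and round 2 splits the remaining 7 between actor and trivia.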

User Evaluation

Dataset | No. of files | No. of unique tags | No. of documents used for evaluation
Movie | 200,000 | 39 | 8
People | 150,000 | 11 | 4

Dataset | Size | alpha
Movie | 5 | 1, 0.8
Movie | 10 | 1, 0.8, 0.6
Movie | 20 | 1, 0.8, 0.6
People | 5 | 1, 0.6
People | 10 | 1, 0.6

Total: 8 × (2 + 3 + 3) + 4 × (2 + 2) = 64 + 16 = 80 automatically generated summaries

• Automatically generated summaries (80) were mixed with human-generated summaries (32)

• Summaries were graded on a scale of 1 to 7, where 1 is extremely bad and 7 is perfect

• Six different evaluators; each summary was evaluated by at least three

Tabulation of average and above-average grades (4-7)

Dataset | Size | alpha = 1.0 | alpha = 0.8 | alpha = 0.6 | Total (across alpha)
Movie | 5 | 8/8 (100%) | 5/8 (62.5%) | - | 13/16 (81.25%)
Movie | 10 | 8/8 (100%) | 7/8 (87.5%) | 1/8 (12.5%) | 16/24 (66.7%)
Movie | 20 | 7/8 (87.5%) | 7/8 (87.5%) | 4/8 (50%) | 18/24 (75%)
Movie | Total (across sizes) | 23/24 (95.8%) | 19/24 (79.2%) | 5/16 (31.25%) | 47/64 (73.4%)
People | 5 | 3/4 (75%) | - | 1/4 (25%) | 4/8 (50%)
People | 10 | 4/4 (100%) | - | 4/4 (100%) | 8/8 (100%)
People | Total (across sizes) | 7/8 (87.5%) | - | 5/8 (62.5%) | 12/16 (75%)

Note: Grades are shown only if at least 2 evaluators agreed on them.

Xoom

A tool for exploring and summarizing XML documents

Exploration Mode

Summarization Mode - Titanic.xml

Conclusion

• A fully automated XML summary generator

• Ranking of tags and text based on the ranking model

• Generation of summary from ranked tags & text within memory budget

• Xoom – a tool for exploring and summarizing XML documents

• User Evaluation

Publications

• Xoom: A tool for zooming in and out of XML Documents (Demo). Maya Ramanath and Kondreddi Sarath Kumar. Proc. of Intl. Conf. on Extending Database Technology (EDBT), St. Petersburg, Russia, March 2009.

• A Rank-Rewrite Framework for Summarizing XML Documents. Maya Ramanath and Kondreddi Sarath Kumar. 2nd Intl. Workshop on Ranking in Databases (DBRank, in conjunction with ICDE 2008), Cancun, Mexico, April 2008.

User Evaluation of Summaries

Link: http://www.mpi-inf.mpg.de/~ramanath/Summarization/

Thanks!

Appendix

Informativeness

Coverage

Why not tag-text pairs?

Ocean’s Eleven.xml - Summaries

Titanic.xml on OST Summarizer

Gern </actor> <role > Drowning man </role> </casting> <casting> <actor > Martin, Johnny (I) </actor> <role > Rescue boat crewman </role> </casting> <casting> <actor > Lynch, Don (II) </actor> <role > Frederick Spedden </role> </casting> <casting> <actor > Cameron, James (I) </actor> <role > Cameo appearance (steerage dancer) </role> </casting> <casting> <actor > Cragnotti, Chris </actor> <role > Victor Giglio </role> </casting> <casting> <actor > Kenny, Tony (I) </actor> <role > Deckhand </role> </casting> <casting> <actor > Campolo, Bruno </actor> <role > Second-class man </role> </casting> </cast> <misc> <miscEntry> <person > Abercrombie, Ian </person> <job > adr loop group </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Allen, Melinda </person> <job > assistant: James Cameron </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Altman, John (I) </person> <job > historical music advisor </job> </miscEntry> <miscEntry> <person > Altman, John (I) </person> <job > music arranger: period music </job> </miscEntry> <miscEntry> <person > Amorelli, Mike </person> <job > rigging gaffer </job> </miscEntry> <miscEntry> <person > Amorelli, Paul </person> <job > rigging best boy electric </job> </miscEntry> <miscEntry> <person > Anaya, Daniel </person> <job > grip </job> </miscEntry> <miscEntry> <person > Andrade, Maria Louise </person> <job > costumer </job> </miscEntry> <miscEntry> <person > Baker, Brett </person> <job > photo double: Leonardo DiCaprio </job> </miscEntry> <miscEntry> <person > Arvizu, Ricardo </person> <job > grip </job> </miscEntry> <miscEntry> <person > Bailes, Tim </person> <job > marine consultant </job> </miscEntry> <miscEntry> <person > Arneson, Charlie </person> <job > aquatic researcher </job> </miscEntry> <miscEntry> <person > Arneson, Charlie </person> <job > aquatic supervisor </job> </miscEntry> <miscEntry> <person > Arnold, Amy </person> <job > key set costumer: women </job> </miscEntry> <miscEntry> 
<person > Atkinson, Lisa (I) </person> <job > pre-production consultant </job> </miscEntry> <miscEntry> <person > Barius, Claudette </person> <job > additional still photographer: pre-production </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Baker, Jeanie </person> <job > costumer </job> </miscEntry> <miscEntry> <person > Barton, Roger </person> <job > associate editor </job> </miscEntry> <miscEntry> <person > Baker, Tom (VI) </person> <job > electrician </job> </miscEntry> <miscEntry> <person > Bass, Andy (I) </person> <job > assistant music engineer </job> </miscEntry> <miscEntry> <person > Barber, Jamie (I) </person> <job > first assistant camera: Halifax </job> <miscEntry> <person > Baylon, Hugo </person> <job > location assistant </job> </miscEntry> <miscEntry> <person > Bee, Guy Norman </person> <job > camera operator </job> </miscEntry> <miscEntry> <person > Benarroch, Ariel </person> <job > first assistant camera: second unit </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Bendt, Tony </person> <job > company grip </job> </miscEntry> <miscEntry> <person > Boccoli, Daniel </person> <job > apprentice editor </job> </miscEntry> <miscEntry> <person > Botham, Buddy </person> <job > generator operator </job> </miscEntry> <miscEntry> <person > Bonner, Kit </person> <job > naval consultant </job> </miscEntry> <miscEntry> <person > Blevins, Cha </person> <job > costumer </job> <extra > as Deborah 'Cha' Blevins </extra> </miscEntry> <miscEntry> <person > Bloom, Kirk </person> <job > second assistant camera </job> </miscEntry> <miscEntry> <person > Bolton, Paul </person> <job > electrician </job> </miscEntry> <miscEntry> <person > Bornstein, Bob </person> <job > music preparation </job> </miscEntry> <miscEntry> <person > Bozeman, Marsha </person> <job > costumer </job> </miscEntry> <miscEntry> <person > Broberg, David </person> <job > first assistant film editor </job> </miscEntry> <miscEntry> <person > Brady, Kenneth 
Patrick </person> <job > production assistant </job> </miscEntry> <miscEntry> <person > Bruno, Keri </person> <job > production assistant </job> </miscEntry> <miscEntry> <person > Bryan, Mitch (III) </person> <job > assistant video assist operator </job> </miscEntry> <miscEntry> <person > Bryce, Malcolm </person> <job > lamp operator </job> </miscEntry> <miscEntry> <person > Burdick, Geoff </person> <job > production associate </job> </miscEntry> <miscEntry> <person > Buckley, John (III) </person> <job > gaffer </job> </miscEntry> <miscEntry> <person > Cameron, James (I) </person> <job > director of photography: Titanic deep dive camera </job> </miscEntry> <miscEntry> <person > Cameron, James (I) </person> <job > special camera equipment designer </job> </miscEntry> <miscEntry> <person > Cameron, Michael (II) </person> <job > special deep ocean camera system </job> </miscEntry> <miscEntry> <person > Byall, Bruce </person> <job > grip </job> </miscEntry> <miscEntry> <person > Byron, Carol Sue </person> <job > additional production accountant </job> <extra > uncredited </extra> </miscEntry> <miscEntry> <person > Canedo, Luis </person> <job > rigging electrician </job> <extra > as Jose

User Evaluation of Summaries – IMDB Dataset

Files used

Dataset | Filename
Movie | American Beauty, Ocean's Eleven, Kill Bill Part II, Saving Private Ryan, The Last Samurai, The Usual Suspects, Titanic, A Space Odyssey
People | Brad Pitt, Matt Damon, Ben Affleck, Leonardo DiCaprio
