
Marti Hearst, SIMS 247

SIMS 247 Lecture 19: Visualizing Text and Text Collections

March 31, 1998

Today and Next Time
• Purposes of Text Visualization
• Why Text is Tough
• Visualizing Concept Spaces
  – For Collection Overviews
• Visualizing Query Specifications
  – Selecting Term Subsets
  – Viewing Metadata
• Visualizing Retrieval Results
  – Term Hit Distribution
  – Grouping of Retrieved Documents

Why Visualize Text?
• To help with Information Access
  – give an overview of a collection
  – show user what aspects of their interests are present in a collection
  – help user understand why documents were retrieved as a result of a query
• Text Data Mining
  – not much has been done in this yet
• Software Engineering
  – not technically text, but has some similar properties

Why Text is Tough
• Text is not pre-attentive
• Text consists of abstract concepts
  – which are difficult to visualize
• Text represents similar concepts in many different ways
  – space ship, flying saucer, UFO, figment of imagination
• Text has very high dimensionality
  – Tens or hundreds of thousands of features
  – Many subsets can be combined together
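To make the dimensionality point concrete, here is a minimal Python sketch. It assumes a plain bag-of-words representation (one feature per distinct word), which is only one possible choice. The three toy documents and the helper names (`tokenize`, `build_vocabulary`, `to_vector`) are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    """Return a sorted list of every distinct word in the collection."""
    return sorted({word for doc in docs for word in tokenize(doc)})

def to_vector(doc, vocabulary):
    """Represent one document as a vector of word counts over the vocabulary."""
    tokens = tokenize(doc)
    return [tokens.count(word) for word in vocabulary]

# Toy documents (hypothetical, for illustration only).
docs = [
    "The man walks the dog.",
    "The dog cavorts.",
    "A flying saucer is a space ship.",
]

vocabulary = build_vocabulary(docs)          # one dimension per distinct word
vectors = [to_vector(doc, vocabulary) for doc in docs]
print(len(vocabulary), "dimensions for", len(docs), "documents")
```

Even these three short sentences need 11 dimensions, and most entries in each vector are zero. A real collection's vocabulary grows to the tens or hundreds of thousands of features mentioned above.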

Text Meaning is NOT pre-attentive

SUBJECT PUNCHED QUICKLY OXIDIZED TCEJBUS DEHCNUP YLKCIUQ DEZIDIXO
CERTAIN QUICKLY PUNCHED METHODS NIATREC YLKCIUQ DEHCNUP SDOHTEM
SCIENCE ENGLISH RECORDS COLUMNS ECNEICS HSILGNE SDROCER SNMULOC
GOVERNS PRECISE EXAMPLE MERCURY SNREVOG ESICERP ELPMAXE YRUCREM
CERTAIN QUICKLY PUNCHED METHODS NIATREC YLKCIUQ DEHCNUP SDOHTEM
GOVERNS PRECISE EXAMPLE MERCURY SNREVOG ESICERP ELPMAXE YRUCREM
SCIENCE ENGLISH RECORDS COLUMNS ECNEICS HSILGNE SDROCER SNMULOC
SUBJECT PUNCHED QUICKLY OXIDIZED TCEJBUS DEHCNUP YLKCIUQ DEZIDIXO
CERTAIN QUICKLY PUNCHED METHODS NIATREC YLKCIUQ DEHCNUP SDOHTEM
SCIENCE ENGLISH RECORDS COLUMNS ECNEICS HSILGNE SDROCER SNMULOC

Why Text is Tough
• Abstract concepts are difficult to visualize
• Combinations of abstract concepts are even more difficult to visualize
  – time
  – shades of meaning
  – social and psychological concepts
  – causal relationships

Why Text is Tough

The Dog.

Why Text is Tough

The Dog.

The dog cavorts.

The dog cavorted.

Why Text is Tough

The man.

The man walks.

Why Text is Tough

The man walks the cavorting dog.

So far, we can sort of show this in pictures.

Why Text is Tough

As the man walks the cavorting dog, thoughts arrive unbidden of the previous spring, so unlike this one, in which walking was marching and dogs were baleful sentinels outside unjust halls.

How do we visualize this?

Why Text is Tough
• Language only hints at meaning
• Most meaning of text lies within our minds and common understanding
  – “How much is that doggy in the window?”
    • “how much”: social system of barter and trade (not the size of the dog)
    • “doggy” implies childlike, plaintive, probably cannot do the purchasing on their own
    • “in the window” implies behind a store window, not really inside a window; requires notion of window shopping

Why Text is Tough
• General categories have no standard ordering (nominal data)
• Categorization of documents by single topics misses important distinctions
• Consider an article about
  – NAFTA
  – The effects of NAFTA on truck manufacture
  – The effects of NAFTA on productivity of truck manufacture in the neighboring cities of El Paso and Juarez

Why Text is Tough
• Other issues about language
  – ambiguous (many different meanings for the same words and phrases)
  – different combinations imply different meanings

Why Text is Tough
• I saw Pathfinder on Mars with a telescope.
• Pathfinder photographed Mars.
• The Pathfinder photograph mars our perception of a lifeless planet.
• The Pathfinder photograph from Ford has arrived.
• The Pathfinder forded the river without marring its paint job.

Why Text is Easy
• Text is easier when you have a lot of it
  – Highly redundant
  – Because people are good at finding associations, just about any simple algorithm can get “good” results for coarse tasks
    • Pull out “important” phrases
    • Find “meaningfully” related words
    • Create “summary” from document
  – Major problem: Evaluation
• People usually search on relatively coarse meanings

Why Text is Easy
• Pretty much any simple technique can pull out phrases that seem to characterize a document
• Most frequent words from a lecture last fall:

109 slide  69 to  37 view  37 version  37 graphic  37 first
37 back  36 previous  36 next  32 of  31 the
30 recall  28 relevant  27 precision  25 retrieved  25 documents
21 and  18 evaluate  15 a  13 what  13 vs  13 how
12 trec  12 is  12 high  12 for  10 relevance
10 queries  10 on  9 information  8 x  8 why
8 as  8 answer  7 search  7 maron  7 document
7 blair  6 top  6 results  6 measure
6 length  6 in  6 evaluation  6 curves
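A frequency list like the one above can be produced with a few lines of Python. This is a minimal sketch, assuming words are simply lowercase alphabetic runs; the `sample` string and the function name `word_frequencies` are invented for illustration and are not the lecture text that was actually counted.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercase alphabetic token occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Hypothetical sample text standing in for the lecture notes.
sample = ("recall and precision: recall of relevant documents "
          "vs precision of retrieved documents")

frequencies = word_frequencies(sample)
top = frequencies.most_common(3)   # the three highest counts
for word, count in top:
    print(count, word)
```

Note that function words such as "of" rank alongside the content words, which is the same effect visible in the slide's list.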

Why Text is Easy
• Same text, removing most frequent words in language and most frequent in this text:

30 recall  28 relevant  27 precision  25 retrieved  25 documents
18 evaluate  13 vs  12 trec  12 high  10 relevance
10 queries  9 information  8 x  8 answer  7 search
7 maron  7 document  7 blair  6 top  6 results
6 measure  6 length  6 evaluation  6 curves

• These words can act as a simple summary of the document
  – people are good at inferring the relations
  – redundancy in the word meanings
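The filtering step can be sketched as follows. The counts are a handful of entries copied from the frequency list in this lecture; the two stop lists and the function name `remove_stop_words` are assumptions made for illustration, since the slides do not say which lists were actually used.

```python
# A few counts copied from the lecture's frequency list.
counts = {"slide": 109, "to": 69, "view": 37, "of": 32, "the": 31,
          "recall": 30, "relevant": 28, "precision": 27, "and": 21}

# Assumed stop lists (hypothetical): common words of the language, and
# words that are frequent only because of this document's slide navigation.
LANGUAGE_STOP_WORDS = {"to", "of", "the", "and", "a", "is", "for", "on", "in", "as"}
DOCUMENT_STOP_WORDS = {"slide", "view", "version", "graphic",
                       "first", "back", "previous", "next"}

def remove_stop_words(counts, *stop_lists):
    """Drop every word found in any stop list; return the rest, highest count first."""
    excluded = set().union(*stop_lists)
    kept = {word: n for word, n in counts.items() if word not in excluded}
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)

summary = remove_stop_words(counts, LANGUAGE_STOP_WORDS, DOCUMENT_STOP_WORDS)
print(summary)
```

What remains (recall, relevant, precision) already reads like a rough topic label for a lecture on retrieval evaluation.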

Text Collection Overviews
• How can we show an overview of the contents of a text collection?
  – show info external to the docs
    • e.g., date, author, source, number of inlinks
    • does not show what they are about
  – show the meanings or topics in the docs
    • show a list of titles
    • show results of clustering words or documents
    • organize according to categories
  – how to show arbitrary subsets?

Showing Collection Overviews
• Showing the Documents
  – External Metadata
    • e.g., author, date, hyperlink connectivity
    • Does not show what the documents are about
  – Visualizations of Document Clusters
    • Mapping document clusters into nearby points
    • Networks with Force-Directed Placement
    • Kohonen Feature Maps
  – Zoomable “Landscapes”

Showing Collection Overviews
• Distinguish between
  – showing the documents
  – showing the words/concepts
• Distinguish between
  – a general overview
  – a query-centered view

Clustering for Collection Overviews
• Two main steps
  – cluster the documents according to the words they have in common
  – map the cluster representation onto an (interactive) 2D or 3D representation
• Since text has tens of thousands of features
  – the mapping to 2D loses a tremendous amount of information
  – only very coarse themes are detected
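The first step, clustering documents by the words they share, can be sketched in Python. This is only an illustration under stated assumptions: it uses cosine similarity over word counts and a greedy single-pass ("leader") clustering, a simple stand-in rather than the method of any system discussed in this lecture. The toy documents, the threshold value, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Map each lowercase word in the text to its count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count bags (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def leader_cluster(docs, threshold=0.3):
    """Greedy single-pass clustering.

    Each document joins the first cluster whose first member (its "leader")
    is at least `threshold` similar; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of document indices.
    """
    bags = [bag_of_words(doc) for doc in docs]
    clusters = []
    for i, bag in enumerate(bags):
        for cluster in clusters:
            if cosine(bag, bags[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy collection (hypothetical).
docs = [
    "trade agreement truck manufacture",
    "truck manufacture productivity trade",
    "dog walks man",
    "man walks cavorting dog",
]
clusters = leader_cluster(docs)
print(clusters)
```

The result groups the two trade documents and the two dog documents. The second step, mapping clusters to 2D or 3D, is where most of the information in the word vectors is lost.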

Clustering for Collection Overviews
– Scatter/Gather
  • show main themes as groups of text summaries
– Scatter Plots
  • show docs as points; closeness indicates nearness in cluster space
  • show main themes of docs as visual clumps or mountains
– Kohonen Feature maps
  • show main themes as adjacent polygons
– BEAD
  • show main themes as links within a force-directed placement network
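Force-directed placement can be illustrated with a small spring model. The sketch below is a generic version of the idea, not the BEAD algorithm itself: each pair of documents is joined by a spring whose rest length is an assumed target dissimilarity, and points are nudged until the layout settles. The target matrix, the circular starting layout, the step size, and the function names are all assumptions.

```python
import math

def force_directed_layout(target, iterations=500, step=0.05):
    """Place n points in 2D so pairwise distances approach target[i][j]."""
    n = len(target)
    # Assumed starting layout: points evenly spaced on a unit circle.
    pos = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for _ in range(iterations):
        forces = []
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                # Positive stretch pulls i toward j; negative pushes it away.
                stretch = dist - target[i][j]
                fx += stretch * dx / dist
                fy += stretch * dy / dist
            forces.append((fx, fy))
        # Apply all forces together so the update is order-independent.
        for i, (fx, fy) in enumerate(forces):
            pos[i][0] += step * fx
            pos[i][1] += step * fy
    return pos

def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Hypothetical dissimilarities: documents 0 and 1 are alike, as are 2 and 3.
target = [
    [0.0, 0.2, 1.0, 1.0],
    [0.2, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]
positions = force_directed_layout(target)
```

Similar documents end up as visual clumps. The targets cannot all be met exactly in two dimensions, which is a small-scale instance of the information loss noted earlier; a different starting layout can also settle into a different, worse arrangement.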

Scatter/Gather

Scatter Plot of Clusters (Chen et al. 97)

BEAD (Chalmers 97)

BEAD (Chalmers 96)

Example: Themescapes (Wise et al. 95)

Kohonen Feature Maps (Lin 92, Chen et al. 97)

(594 docs)

Galaxy of News (Rennison 95)

Visualizing Concept Overviews
• Huge 2D maps may be an inappropriate focus for information retrieval
  – cannot see what the documents are about
  – documents are forced into one position in semantic space
  – space is difficult to browse for IR purposes
• Perhaps more suited for pattern discovery
  – problem: often only one view on the space

How Useful are Graphical Clusters?
• A study (Kleiboemer et al. 96) compared
  – a system with 2D graphical clusters
  – a system with 3D graphical clusters
  – a system that shows textual clusters
• Novice users
• Only textual clusters were helpful (and they were difficult to use well)

Next Time
• Visualizing Query Term Specification
  – available words
  – available metadata
• Visualizing Retrieval Results