STREAMIT: Dynamic Visualization and Interactive Exploration of Text Streams

Upload: charis

Post on 22-Feb-2016

Jamal Alsakran, Kent State University, Ohio

Yang Chen, University of North Carolina - Charlotte

Ye Zhao (Presenter), Kent State University, Ohio

Jing Yang, University of North Carolina - Charlotte

Dongning Luo, University of North Carolina - Charlotte

Text Stream

Textual Data Explosion

Emails, news, messages, broadcasts, ...

Daily, hourly, minutely

Urgent need for efficient processing and analysis

Visualization is an effective approach

Text stream

Text collections constantly evolve as new documents continuously arrive

Keywords/topics not known in advance

Challenges to Visual Exploration

Temporal evolution

Existing topics

Emerging topics

Their relations

Clusters and Outliers

No pre-scanning of the collection and no presumed a priori knowledge

Live processing required

In contrast to traditional text database

Flexible user interaction for changing and adjusting information-seeking focus/preference

Process large volumes of texts in real time

STREAMIT System

Dynamic force-directed simulation

Naturally handle continuously inserted documents

Continual evolution

Continuous depiction and analysis of growing document collections

Automatic grouping and separating

No time window used

No abrupt change

Dynamic processing

Keyword vectors dynamically updated

No prerecorded scan

STREAMIT System (continued)

Interactive exploration

Live adjustment of visualization parameters

Dynamic keyword importance

Present the significance of a keyword at a certain time

Reflect changing user demand and interest

Scalable optimization

Fast computing

GPU acceleration

Animation and interaction

Easy user control and interaction tools

Related Work

Multidimensional scaling (MDS) & projection:

IN-SPIRE 99, InfoSky 02, Hipp 08, Exemplar-based 09

Temporal data trends

ThemeRiver 02, LensRiver 07, T-scroll 07, Meme-tracking 09, Themail 06, Topic-based 09

Text streams

TextPool 05, Moving time window Wong03, Eventriver 10, Text pipe 05

Force-based placement

Graph drawing 91, Chalmers96, Morrison02, etc.

System Overview

Potential and Similarity

Potential energy between pairs of document particles

The potential includes a control parameter

l_i and l_j are the locations of particles i and j

l_ij is the ideal distance between them

Ideal distance computed from document similarity

Cosine similarity

Larger similarity leads to a smaller ideal distance, moving documents closer to form clusters
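
The chain from keyword vectors to similarity, ideal distance, and pairwise potential can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the presenters' implementation: the linear mapping `1 - similarity` for the ideal distance, the spring-like quadratic form of the potential, and the names `ideal_distance`, `pair_potential`, and `alpha` are all hypothetical choices made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse keyword vectors (dict: keyword -> weight)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ideal_distance(a, b):
    # Assumption: the ideal distance shrinks linearly as similarity grows.
    return 1.0 - cosine_similarity(a, b)


def pair_potential(pos_i, pos_j, l_ij, alpha=1.0):
    # Assumption: quadratic (spring-like) potential scaled by a control
    # parameter alpha; it is zero when the particles sit at the ideal distance.
    d = math.dist(pos_i, pos_j)
    return alpha * (d - l_ij) ** 2


doc_a = {"obama": 2.0, "afghanistan": 1.0}
doc_b = {"obama": 1.0, "afghanistan": 1.0, "terrorism": 1.0}
doc_c = {"database": 3.0}

near = ideal_distance(doc_a, doc_b)  # shared keywords, small ideal distance
far = ideal_distance(doc_a, doc_c)   # no shared keywords, ideal distance 1.0
```

Documents that share keywords get a small ideal distance and are pulled into the same cluster, while unrelated documents are kept apart.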

Force-directed Model

Global potential function

Forces computed from minimization

Attract or repulse document particles
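
One way to read "forces computed from minimization" is as gradient descent on the global potential: each particle moves along the negative gradient, which attracts pairs that are too far apart and repels pairs that are too close. The sketch below assumes a global potential that sums a quadratic pair term over all pairs; that form, the step size, and the function names are hypothetical and chosen only for illustration.

```python
import math


def total_potential(positions, ideal, alpha=1.0):
    """Assumed global potential: sum of alpha * (d_ij - l_ij)^2 over all pairs."""
    total = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = math.dist(positions[i], positions[j])
            total += alpha * (d - ideal[i][j]) ** 2
    return total


def forces(positions, ideal, alpha=1.0):
    """Force on each particle = negative gradient of the assumed potential."""
    n = len(positions)
    result = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            # Positive magnitude: pair is too far apart, so i is pulled toward j.
            # Negative magnitude: pair is too close, so i is pushed away from j.
            magnitude = 2.0 * alpha * (d - ideal[i][j])
            fx, fy = magnitude * dx / d, magnitude * dy / d
            result[i][0] += fx
            result[i][1] += fy
            result[j][0] -= fx
            result[j][1] -= fy
    return result


def step(positions, ideal, step_size=0.05, alpha=1.0):
    """Move every particle a small distance along its force."""
    f = forces(positions, ideal, alpha)
    return [(x + step_size * fx, y + step_size * fy)
            for (x, y), (fx, fy) in zip(positions, f)]
```

Repeatedly calling `step` lets the layout relax gradually, which matches the slide's point that new documents can be inserted without abrupt changes.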

DYNAMIC KEYWORD IMPORTANCE

Cosine similarity can be improved by introducing importance

Importance I_k can be freely modified by users at any time

According to interest/preference

According to discovered knowledge from prior period

A powerful tool for users to manipulate layout and analyze data

Importance can also be assigned by an automatic scheme

E.g., for keyword k:

O_k: number of occurrences;

t_ek: last time it appears; t_sk: first time it appears;

n_k: the number of documents that contain the keyword
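
To make the effect of keyword importance concrete, here is a minimal sketch of a cosine similarity in which each keyword's contribution is scaled by its importance I_k. The exact weighting used by the system is not given in this transcript, so the form below (importance multiplying each term of the dot product and of the norms) and the names `weighted_cosine` and `default` are assumptions.

```python
import math


def weighted_cosine(a, b, importance, default=1.0):
    """Cosine similarity with per-keyword importance (assumed weighting scheme).

    a, b: sparse keyword vectors (dict: keyword -> weight).
    importance: dict keyword -> non-negative importance I_k.
    """
    def imp(k):
        return importance.get(k, default)

    dot = sum(imp(k) * a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(imp(k) * v * v for k, v in a.items()))
    norm_b = math.sqrt(sum(imp(k) * v * v for k, v in b.items()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_x = {"politics": 1.0, "terrorism": 1.0}
doc_y = {"politics": 1.0, "international relations": 1.0}

base = weighted_cosine(doc_x, doc_y, {})                       # all I_k = 1
shared_up = weighted_cosine(doc_x, doc_y, {"politics": 4.0})   # boost shared keyword
unshared_up = weighted_cosine(doc_x, doc_y, {"terrorism": 4.0})  # boost unshared keyword
```

Raising the importance of a keyword two documents share increases their similarity and pulls them together; raising the importance of a keyword only one of them has pushes them apart. This is how a single importance change can reshape the whole layout.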

Visualization Interface

Visualization Tools

Main window

Major layout

Animation Control Panel

Play, pause, stop

Drag by mouse

Keyword table

Dynamic update

Change importance

Document table

Text information

Labeling

Use text document titles

Reduce cluttering

Recent semantic titles

User controlled clutter levels

Group title label

Use color and opacity to display clear layout

User Interaction

Adjusting Keyword Importance

Grouping and Tracking Documents

Halo for topics of interest

Browsing and Tracking Keywords

Selection

Manual, example-based, keyword-based

Integrated shoebox for details

Case Study: New York Times News

Total article number: 230

Time period: between Jul. 19 and Sep. 18, 2010

About Barack Obama

Articles continuously injected, new keywords added to the keyword table, and their frequencies are updated on-the-fly

Keyword importance automatically assigned
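
The on-the-fly keyword table can be pictured as a dictionary that is updated once per arriving document, with no pre-scan of the collection. The sketch below tracks the per-keyword quantities named earlier in the talk (occurrences O_k, first and last appearance times t_sk and t_ek, and document count n_k). The class and field names are hypothetical, and the formula that turns these statistics into an importance value is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class KeywordStats:
    occurrences: int = 0      # O_k: total occurrences of the keyword
    first_seen: float = 0.0   # t_sk: time of first appearance
    last_seen: float = 0.0    # t_ek: time of most recent appearance
    doc_count: int = 0        # n_k: number of documents containing the keyword


class KeywordTable:
    """Keyword statistics maintained incrementally as documents arrive."""

    def __init__(self):
        self.stats = {}

    def add_document(self, keywords, timestamp):
        """Update the table with one document's keywords (repeats allowed)."""
        seen_in_doc = set()
        for k in keywords:
            s = self.stats.get(k)
            if s is None:
                # New keyword: it enters the table when first encountered.
                s = self.stats[k] = KeywordStats(first_seen=timestamp)
            s.occurrences += 1
            s.last_seen = timestamp
            if k not in seen_in_doc:
                s.doc_count += 1
                seen_in_doc.add(k)


table = KeywordTable()
table.add_document(["obama", "obama", "terrorism"], timestamp=1.0)
table.add_document(["obama", "afghanistan war"], timestamp=5.0)
```

Each update touches only the keywords of the arriving document, so the cost per insertion does not grow with the size of the collection.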

Case Study: New York Times News

136 news articles

High frequency keywords:

Politics and Government, International Relations, Terrorism

Increase the importance of International Relations

Highlight the group with Afghanistan War in pink halo (2)

Terrorism in orange halo (3)

All documents are shown

Terrorism becomes larger, and one item (an outlier) lies between Afghanistan War and Terrorism

Case Study: US NSF Award Abstracts

1000 National Science Foundation (NSF) IIS award abstracts

Funded between Mar. 2000 and Aug. 2003

Each document characterized by a set of keywords

Size of a document circle represents funding amount

Case Study: US NSF Award Abstracts

Aug. 1, 2000

95 projects

Sep. 1, 2000, 172 projects;

many large projects started;

Highlight Management in red and Database in green;

Increase their importance

Mar. 15, 2002, 672 projects;

many large projects started;

Highlight Sensor with halo;

(2) is an outlier far away from the other projects with halo

It is about just-in-time information retrieval on wearable computers

Case Study: Video on NSF Dataset

Performance Optimization

Initial positions of document particles affect computational steps and cost

Similarity Grid

New documents roughly inserted within the proximity of similar documents

Each grid cell has a special keyword vector consisting of the average keyword weights from the documents inside the cell
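
The similarity grid idea can be sketched as follows: each occupied cell keeps the average keyword vector of its documents, and a new document starts at the center of the cell whose average vector is most similar to it. This is an illustrative sketch only. The class name `SimilarityGrid`, the unit-square coordinates, and the use of plain cosine similarity are assumptions, and the sketch does not handle particles moving between cells as the simulation runs.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse keyword vectors (dict: keyword -> weight)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SimilarityGrid:
    """Grid over the unit square; each cell stores summed keyword weights."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.sums = {}    # (row, col) -> {keyword: summed weight}
        self.counts = {}  # (row, col) -> number of documents in the cell

    def cell_of(self, pos):
        x, y = pos
        col = min(int(x * self.cols), self.cols - 1)
        row = min(int(y * self.rows), self.rows - 1)
        return (row, col)

    def add(self, pos, vec):
        cell = self.cell_of(pos)
        sums = self.sums.setdefault(cell, {})
        for k, w in vec.items():
            sums[k] = sums.get(k, 0.0) + w
        self.counts[cell] = self.counts.get(cell, 0) + 1

    def average(self, cell):
        n = self.counts[cell]
        return {k: w / n for k, w in self.sums[cell].items()}

    def initial_position(self, vec):
        """Center of the most similar occupied cell, or None if the grid is empty."""
        best, best_sim = None, -1.0
        for cell in self.counts:
            sim = cosine(vec, self.average(cell))
            if sim > best_sim:
                best, best_sim = cell, sim
        if best is None:
            return None
        row, col = best
        return ((col + 0.5) / self.cols, (row + 0.5) / self.rows)


grid = SimilarityGrid(50, 50)
grid.add((0.13, 0.13), {"database": 1.0, "management": 1.0})
grid.add((0.87, 0.87), {"sensor": 1.0})
start = grid.initial_position({"sensor": 2.0, "network": 1.0})
```

Starting a new particle near similar documents means the force simulation needs fewer steps to settle, since the comparison is made against cell averages rather than against every document.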

Data set of 7100 documents

Performance Optimization

GPU acceleration

CUDA implementation of the N-body problem

Good performance achieved

NVIDIA Quadro NVS 295 GPU with 2GB texture memory

Intel Core2 1.8GHz CPU with 2GB RAM

GPU Performance

Experiments with 50 by 50 grid

Achieve good average speed

More importantly, maximum simulation time after document insertion on the GPU was less than a second

Fast for human perception and analysis

Discussion

The system can handle live text streams with a document arrival interval of around 1 second

On a consumer PC and graphics card

E.g., New York Times news averages 3 documents per hour, with a maximum of 8 documents per hour at peak time
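
As a quick worked check of the figures above, the arrival rates can be converted to mean arrival intervals and compared with the roughly 1-second interval the system is said to handle. The helper name is hypothetical, and the calculation assumes documents arrive evenly spaced within the hour.

```python
def mean_arrival_interval_seconds(docs_per_hour):
    """Mean time between documents, assuming evenly spaced arrivals."""
    return 3600.0 / docs_per_hour


average_interval = mean_arrival_interval_seconds(3)  # 1200.0 seconds
peak_interval = mean_arrival_interval_seconds(8)     # 450.0 seconds

# Ratio of the peak-time interval to the ~1 second interval from the slide.
headroom = peak_interval / 1.0
```

Even at peak time the news stream delivers a document only every 450 seconds, far slower than the arrival interval the system can sustain.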

A very large number of documents in the system will introduce visual clutter and hinder analysts' comprehension

Natural perception limit and device limit

Clutter reduction and simplification algorithms needed

Further increase the power

Advanced hardware

Hierarchical or multiple-resolution simulation

Conclusion

STREAMIT: An efficient visual exploration system for live text streams

Dynamic physical system

Keyword manipulation with importance

Visual tools

Acknowledgment:

National Science Foundation IIS-0915528, IIS-0916131 and NSFDACS10P1309.

Thanks!

Questions?

Figure labels: Text Document; Particles; Dynamic Keyword Importance