sequential document visualization based on hierarchical ... · descriptive pictures fig. 1 three...

10
TSINGHUA SCIENCE AND TECHNOLOGY ISSN ll 1007-0214 ll 01/12 ll pp1-16 Volume 17, Number 4, August 2012 Sequential Document Visualization Based on Hierarchical Parametric Histogram Curves Haidong Chen, Guizhen Wang, Dichao Peng ∗∗ , Wuheng Zuo , Wei Chen State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, China; Zhijiang College of Zhejiang University of Technology, Hangzhou 310024, China Abstract: Recently, sequential document visualization has attracted much attention for its superior capability in depicting the sequential semantic progression in a single document. However, existing methods commonly take abstractive visual forms such as texts, numbers, and glyphs, and require much user expertise for document exploration. In this paper we propose a sequential visualization to represent a single document with a two- dimensional picture-based storyline, which semantically enhances the comprehension of textual information. We introduce a new parametric modeling approach called the Hierarchical Parametric Histogram Curve (HPHC), which encodes the statistical progression locally and adaptively. By transforming an HPHC into the two-dimensional space with a new locality-preserving embedding algorithm, we create a mapping from points along the curve to descriptive pictures and generate the visualization result. The new representation expresses the primary content with a graphical form, and allows for efficient multi-resolution and focus+context exploration in a long document. Our approach compares favorably with previous work in that it is more intuitive and requires less user expertise. Informal evaluation shows that it is useful in quick document browsing, communication, and understanding, especially for people with low literacy skills. Key words: document visualization; sequential document visualization; pictorial communication; histogram curve Introduction Document visualization has been an important research topic in information visualization [16] . While much of the effort has been devoted to visualizing large document corpus, it is of equal, if not more, importance Received: 2012-05-01 ; Accepted: 2012-06-13 Supported by the National High-Tech Research and Devel- opment (863) Program of China (No. 2012AA120903); the National Natural Science Foundation of China (No. 61003193); Commonweal Project of Science and Technol- ogy Department of Zhejiang Province (No. 2011C21058) ∗∗ To whom correspondence should be addressed. E-mail: [email protected] Tel: 86-15157180966 to present visual summaries of a single (large) docu- ment. Different from most multi-document visualiza- tion approaches which focus on building and explor- ing document manifolds with various inter-document similarity metrics, the within-document visualization requires a detailed representation and reformulation to the underlying document components. State-of-the-art approaches [5,7,8] represent text units such as a word, a sentence, and a paragraph with visual abstractions such as a glyph, a knot in a graph, and a curve, respectively. Studies have shown the effectiveness of these methods towards revealing structural and hierarchical informa- tion for a document. Inspired by the fact that a traditional document is written and read orderly like an evolving story, the

Upload: others

Post on 14-Oct-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Sequential Document Visualization Based on Hierarchical ... · Descriptive pictures Fig. 1 Three main components of our approach technique of sequential document visualization[9]

TSINGHUA SCIENCE AND TECHNOLOGYISSNl l 1007 -0214l l 01 /12l l pp1 -16Volume 17, Number 4, August 2012

Sequential Document Visualization Based onHierarchical Parametric Histogram Curves∗

Haidong Chen, Guizhen Wang, Dichao Peng∗∗, Wuheng Zuo†, Wei Chen

State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, China;† Zhijiang College of Zhejiang University of Technology, Hangzhou 310024, China

Abstract: Recently, sequential document visualization has attracted much attention for its superior capability in

depicting the sequential semantic progression in a single document. However, existing methods commonly take

abstractive visual forms such as texts, numbers, and glyphs, and require much user expertise for document

exploration. In this paper we propose a sequential visualization to represent a single document with a two-

dimensional picture-based storyline, which semantically enhances the comprehension of textual information. We

introduce a new parametric modeling approach called the Hierarchical Parametric Histogram Curve (HPHC), which

encodes the statistical progression locally and adaptively. By transforming an HPHC into the two-dimensional

space with a new locality-preserving embedding algorithm, we create a mapping from points along the curve to

descriptive pictures and generate the visualization result. The new representation expresses the primary content

with a graphical form, and allows for efficient multi-resolution and focus+context exploration in a long document. Our

approach compares favorably with previous work in that it is more intuitive and requires less user expertise. Informal

evaluation shows that it is useful in quick document browsing, communication, and understanding, especially for

people with low literacy skills.

Key words: document visualization; sequential document visualization; pictorial communication; histogram curve

Introduction

Document visualization has been an important researchtopic in information visualization[1−6]. While muchof the effort has been devoted to visualizing largedocument corpus, it is of equal, if not more, importance

Received: 2012-05-01 ; Accepted: 2012-06-13∗ Supported by the National High-Tech Research and Devel-

opment (863) Program of China (No. 2012AA120903);the National Natural Science Foundation of China (No.61003193); Commonweal Project of Science and Technol-ogy Department of Zhejiang Province (No. 2011C21058)

∗∗ To whom correspondence should be addressed.E-mail: [email protected] Tel: 86-15157180966

to present visual summaries of a single (large) docu-ment. Different from most multi-document visualiza-tion approaches which focus on building and explor-ing document manifolds with various inter-documentsimilarity metrics, the within-document visualizationrequires a detailed representation and reformulation tothe underlying document components. State-of-the-artapproaches[5,7,8] represent text units such as a word, asentence, and a paragraph with visual abstractions suchas a glyph, a knot in a graph, and a curve, respectively.Studies have shown the effectiveness of these methodstowards revealing structural and hierarchical informa-tion for a document.

Inspired by the fact that a traditional document iswritten and read orderly like an evolving story, the

Page 2: Sequential Document Visualization Based on Hierarchical ... · Descriptive pictures Fig. 1 Three main components of our approach technique of sequential document visualization[9]

2 Tsinghua Science and Technology, August 2012, 17(4): 000-000

Input

document

Textual

preprocessing

Automatic sequential

representation

Visualization and

exploration

Descriptive pictures

Fig. 1 Three main components of our approach

technique of sequential document visualization[9] hasbeen developed to exploit the inherent sequentiality ofa document. It can greatly facilitate document segmen-tation, topic extraction, and information retrieval. Fortasks that demand efficient semantic presence like quickdocument browsing and understanding, however, cur-rent solutions with limited basic representations suchas numbers, texts, and glyphs may become inefficient.This is mainly because the capability to capture andvisualize the document semantics is limited by thetextual or glyph-based visual forms. Users have tobe very careful to align the visualization curve andthe texts, and to interact with the digital numbers inthe text panel that indicate the positions correspondingto the points on the curve. This may be extremelyundesirable for specific users such as children or onewith less literacy skills. Furthermore, mastering thosetools typically requires a fair amount of expertise on thebackground knowledge behind the visualization, suchas the PCA analysis, and thus involves a steep learningcurve.

The difficulty of understanding the information ina large amount of texts may be addressed by takingadvantage of the power of human perception. An in-tuitive method is to exploit the visual expressiveness ofpictures to reveal thematic patterns and topic transitionsin a document. With this method, the sequential processof reading a document can be enhanced by a parallelperceptual process that benefits greatly from complexhuman visual system[6,10].

In this paper, we propose a new multi-resolutionand sequential document modeling approach, which isequipped with a picture-based visual representation forgenerating effective visualization of a single (long) doc-ument. Our approach consists of three components: atextual preprocessing module; a sequential, parametric,and multi-resolution representation; visualization andinteractive exploration. Figure 1 shows the flowchartof the visual exploration system. The novelty of ourwork is a new sequential document representation andits reformulation to a two-dimensional representation inconjunction with the employment of a pictorial commu-nication mode.

The resulting visualization favorably lowers the ob-

stacles to the understanding of the users and shortensthe user exploration time by a great extent. A snapshotof the visualization of a section in the bookMy Life[11]

is shown in Figure 2.To summarize the contribution of this paper, we

overcome the limitation of the previous sequential doc-ument visualization scheme[9] with a more informativeand perceivable two-dimensional picture-based repre-sentation. The suite of techniques presented in thispaper, including the multi-scale content summarizationand focus+context visualization, can also be seamlesslyincorporated with a new parametric document represen-tation. And finally, our system allows users to navigatethe document in a set of creative modes with intuitivecontrols. Informal evaluation shows that our systemis useful in quick document browsing, communication,and understanding, especially for people with low liter-acy skills.

1 Related Work

Document visualization. A rough classification fordocument visualization techniques would be the onesthat visualize a document corpus and the ones thathandle a single document. For the visualization ofa document collection, researchers[1,2,4] seek to mapthe collection into a multivariate space, explore, andpresent the connections and relevance among docu-ments. Often, a text unit (a sentence, a paragraph, or adocument) is encoded with a knot in a graph or a glyph.Representative techniques and systems include GistIcons[7], DocuBurst[8], Phrase Nets[5], Topic Island[12],IN-SPIRE[13], Omniviz and Refviz[14].

On the other hand, visualizing a single documentreduces the scope to the texts in a document. Concern-ing the visualization, many efforts[15] have been madein text pattern mining, presentation, and spatialization.To enable the sequential analysis, the similarity amongdifferent segments of a document is judged. Theapproach presented in Ref. [9] represents a categoricalsequence as a smooth curve in the histogram space,and makes use of a set of differential geometry andsmooth analysis to detect thematic transition. Due tothe involvement of several local statistical modelingtechniques, the designed user interface is not well suitedfor people with low literacy skills.

Pictorial communication. Communication throughpictures eases user recognition by avoiding lexical anal-ysis and text understanding[10]. The idea that converts

Page 3: Sequential Document Visualization Based on Hierarchical ... · Descriptive pictures Fig. 1 Three main components of our approach technique of sequential document visualization[9]

Haidong Chen et al.: Sequential Document Visualization ... 3

Viewing controls

Document navigator

Pictorial visualization

Key frame popup2D MDS Overview map

Fig. 2 Visualization of a single long document with the start point highlighted in green and active point in blue. The layout ofthe user interface consists of a pictorial visualization window, a text navigator, a two-dimensional MDS view, an overview map,a key frame popup window, and a set of interactive viewing controls.

text to a perceivable scene is prevailing in many fields.For example, the WordsEye system[16] assembles aset of pre-defined models from token sets to build athree-dimensional world. Obviously, its focus is thespatial organization of the built scene. By leveragingcontent-based image analysis techniques, an intelligenttext illustration system[17] was generated. However,this algorithm does not address the spatialization is-sue. A nicely operated system[18] can automaticallyconvert descriptive texts to two-dimensional traffic ac-cidents. Aiming at communicating general texts, a text-to-picture framework was proposed in Refs. [19-21].However, the proposed approach tends to be feasiblefor a short segment, like a sentence. The goal of thispaper is to design an effective visualization scheme forcommunicating a single long document.

2 Hierarchical Parametric HistogramCurve

The representation of a document is a key component invisualizing its contents. We expect the representation tohave the following properties:• Considers the locality of words and captures medi-

um or long range sequential dependencies• Reflects the semantic variation within a document

under different levels of user interestBoth properties should be met in the context of

sequential document visualization. To model a verylong document, it is appropriate to assume that theportion of the document that is far away from thelocation of interest should be considered irrelevant, thusdoes not contribute to the modeling of local content.The second property results from the fact that documentcontent can be captured under multiple resolutions, andthe lower the level, the more the detail.

Page 4: Sequential Document Visualization Based on Hierarchical ... · Descriptive pictures Fig. 1 Three main components of our approach technique of sequential document visualization[9]

4 Tsinghua Science and Technology, August 2012, 17(4): 000-000

Traditional representations such as Bag Of Words(BOW) or n-gram either ignore the word ordering orare only able to capture short range sequential depen-dency. The recently introduced Locally Weighted BagOf Words (LOWBOW) representation[9,22] is able tocapture sequential progression within a document. Itis a non-parametric document modeling approach. Byvarying the bandwidth of the smoothing kernel, it cancapture the document content at various resolutions.However, at a fixed resolution level, it simply usesa uniform sampling scheme to discretize the originalcontinuous representation, thus no adaptiveness hasbeen considered there.

In the following, we introduce a parametric documentmodeling approach called Hierarchical Parametric His-togram Curve (HPHC). The steps to construct an HPHCrepresentation include building a list of local wordhistograms, approximating the underlying continuousprocess by a piecewise linear interpolation, and con-structing a hierarchical description by selecting featurepoints based on local curvature information. Figure 3shows a simple example. A word listD is discretized asa set of local histograms and linearly interpolated as acontinuous curve in a high dimensional space (here 3-D). To visualize this curve, each point can be consideredas a barycentric coordinate and is drawn as a point in atriangle.

First, we will review some basic notions related toBOW that constitute the basis of our representation.Given a documentD which contains a sequence ofwords{w1, . . . ,wN} wherewi ∈ V,∀i for some vocab-ulary setV = {v1, . . . ,v|V |}, the BOW representationcounts the number of times each word fromV appearsin the document, and uses this information to estimatethe underlying distribution. This results in a|V |-entrynormalized vector, whosek-th item is computed as1N ∑N

i=1 δi,k with δi,k being defined as

δi,k =

{

1, wi = vk,

0, otherwise(1)

Similar to LOWBOW, we assume that the distributionthat generates the document varies with the position inthe document. More specifically, for a documentD =

{w1, . . . ,wN} and some positionj ∈ {1, . . . ,N}, we usethe BOW assumption to estimate the distribution atjover a restricted portion of the document determined by

(1,2,1,1,2,1,3,3,3,2)

1 2 3 1 2 3 1 2 3

1

2 3

1 2 3

(a) (b)

Fig. 3 (a) Given a word list D =< 1,2,1,1,2,1,3,3,3,2 >,and a vocabularyV = {1,2,3}, the set of local histograms isdiscrete; (b) A piecewise linear histogram curve for a longdocument.

a window of size 2s+1, which yields

h j :=1

2s+1〈Σs

i=−sδ j+i,1, ..., Σsi=−sδ j+i,|V |〉 (2)

where δi,k is defined in Eq. (1). To preserve thesequentiality within a document,s is set to be biggerthan the length between each two consecutive positionssuch that every tow consecutive portions of documenthave overlaps.

For an arbitrary set of positions{t1, . . . , tM} ⊂

{1, . . . ,N} wheret1 = 1, tM = N andt1 < .. . < tM, andthe corresponding{h j1, ...,h jM}, we approximate theword distribution at any positiont ∈ [1,N] with thefollowing linear interpolation

CM(t) =M−1

∑i=1

(

ti+1− tti+1− ti

hi +t − ti

ti+1− tihi+1

)

1Bi(t) (3)

whereBi = [ti, ti+1) and1A(t) is the indicator functionwhich is 1 if t ∈ A, and 0 otherwise.

This linear approximation is critical for the secondproperty discussed above. The value ofM as wellas the realization ofM positions may vary to reflectthe change in document content. The larger theM,the finer resolution we get for the representation. Inaddition, if majority oftis are concentrated at regionswhere semantic meanings varies a lot, we make useof the adaptive features provided by our technique.In the following, we focus on the construction ofHPHC from an initial position configuration whereM0 = {t1, · · · , tM0}. Note the slight abuse of usingM0

once as a set and once as an integer will simplify thenotation below.

The hierarchical parametric histogram curve is es-sentially a set{CM0(t),CM1(t), . . . ,CMl (t)} with theadditional requirementM0 ⊃ M1 ⊃ . . . ⊃ Ml . CMi+1

is constructed fromCMi by using {t1, . . . , tMi} as thefeature points forCMi through the following steps:

Page 5: Sequential Document Visualization Based on Hierarchical ... · Descriptive pictures Fig. 1 Three main components of our approach technique of sequential document visualization[9]

Haidong Chen et al.: Sequential Document Visualization ... 5

• Choose a constantm that serves as a smoothingfactor influencing the level of detail.

• Compute the following measure of local curvaturefor positiont j ∈ Mi,

κ jk =

(

h j −h j+k)

·(

h j −h j−k)

|h j −h j+k||h j −h j−k|, k = 1,2, ...,m,

where κ jk varies from 1 for the sharpest angleto -1 for a straight line. The supportr j and thecorresponding curvatureκ j,r j for position t j arethen determined by the following property

κ j,m < κ j,m−1 < · · ·< κ j,r j ≥ κ j,r j−1.

• Perform a nonmaxima suppression[23] to detectfeature pointst j whereκ j,r j ≥ κs,rs for all s withno more thanr j distance fromj. The setMi+1 thencontains all those detected feature points.

• Updateh j for t j ∈ Mi+1 using Eq. (2) with anasymmetric window centered att j, ranging from

t j − s − ⌊t j−t j−1

2 ⌋ to t j + s + ⌊t j+1−t j

2 ⌋. EstimateCMi+1(t) using Eq. (3).

We repeat the above procedure and each time gen-erate a new curve under the controllable thresholdm.It is clear from the construction thatCMi(t) containsmore details thanCMi+1(t). By gradually enlarging thesmoothing factorm at each iteration, the entire processis able to discover features in multiple scales, makingit possible to capture the document characteristics suchas the semantic variation and thematic patterns. Figure4 depicts an HPHC representation for a 2-D curve in 4levels.

It is worth mentioning that the last step of updatingh j is essentially an adaptive kernel bandwidth selectionscheme that depends on semantic variation within thedocument. Since fewer feature points are selected forregions of low semantic variation, the effective windowsize is then enlarged due to relatively large distancesbetween adjacent point locations which results in moresmoothing, and vice versa.

3 Visualization and Exploration

The scatterplot[24] is a popular statistical graphics tech-nique that plots a collection of points. It gives a goodvisual interpretation to the relationship among points.In this section we describe a different way by linking aset of scattered points with a parametric and piecewiselinear curve and representing the points such that eachone indicates a portion of the document with the mostrelevant descriptive picture manually collected from the

(a) 3000 points (b) 526 points

(c) 158 points (d) 54 points

Fig. 4 Four consecutive hierarchies of a hierarchical 2-Dcurve representation. The rainbow color scheme from blueto red is adopted to encode the sequentiality.

internet. In case of no pictures, a simple point is drawninstead. This new form is able to explicitly expressthe document structures with a sequential curve. Togenerate a multi-resolution visualization for sequentialdocuments, two central problems need to be addressed:

(1) How can a sequential document be representedin a hierarchical visual form without the loss oflocality and sequentiality?

(2) How can as many flexibilities as possible be pro-vided to enable intuitive document exploration?

3.1 Locality-preserving two-dimensional reformulation

An HPHC is a high-dimensional curve, whose basic el-ements are the statistical word frequencies. Many well-studied low-dimensional embedding techniques can re-duce the dimensionality for effective visualization. Asthe dissimilarity among document pieces is the mainconsideration for exploring the thematic patterns andthe topic changes, we employ the MultidimensionalScaling[25] (MDS).

An MDS algorithm tries to assign locations in low-dimensional space, typically 2-D or 3-D space, whilepreserving the similarities of data points measured inhigh dimensional space. Suppose that we expect tocompute a set of two-dimensional pointsP = {pi =

(xi,yi)∈R2, i= 1,2,3, ...,N}, whereN is the item num-

ber in the high-dimensional space. The optimization of

Page 6: Sequential Document Visualization Based on Hierarchical ... · Descriptive pictures Fig. 1 Three main components of our approach technique of sequential document visualization[9]

6 Tsinghua Science and Technology, August 2012, 17(4): 000-000

P is fulfilled by minimizing stress energy

σ (P) = ∑i< j

(di j −ρi j)2 (4)

wheredi j andρi j denote the Euclidean distance betweenpi andp j, and the dissimilarity between thei-th andj-thitems, respectively.

Along the curveC(t), the local histogram list<h1,h2, ...,h|D| > represents the textual information ateach word with a local statistical measure. The docu-ment sequentiality is actually encoded in the generationmode of the local histogram list so that two consecutivehistograms share the most similar information. Otherimportant features are the curvatureκ and torsionτbased on the arc-length parametrization ofC(t) inR|V |[26]. Therefore, we define the dissimilarity as:

ξi j = ‖hi −h j‖+α‖κi −κ j‖+β‖τi − τ j‖ (5)

whereα andβ are two adjustable variants. They are0.4 and 0.2 respectively in our system.

In this paper, we assume that the robustness andaccuracy issues can be addressed by advanced MDStechniques[25] and focus on its use for curve layout.

When the point positions are obtained, we can com-pute a smooth two-dimensional curve with local s-moothing kernels other than the uniform kernel present-ed in Eq. (3), like the Gaussian, Tricube, Epanechnikov,and Cauchy kernels[24].

3.2 Multi-resolution visualization

For a substantially long document, the visual explo-ration should be made at different levels of detail. Forexample, an abstraction at the coarsest level with aboutone hundred words may roughly outline the content, buta more detailed summarization at a finer level providesan overview for each document section. However, thecomputational complexity of MDS becomes a majorconcern. To reduce the computational complexity ofMDS configuration, distributional MDS algorithm[27] isadopted.

First, a sufficiently fine levelCk(t) of the HPHClist < C0(t),C1(t),C2(t), ... > is determined so thatthe distance betweenC0(t) andCk(t) is below a giventhresholdε. The distance between a fine levelC j(t) anda coarse levelCk(t) is as follows:

dc(C j(t),Ck(t)) =i=N j

∑i=1

dp(hi j ,−−−−→hik hik+1) (6)

dp(p3,−−→p1p2) :=

(p3−p1) · (p2−p1)

‖p2−p1‖2 (7)

hik

hi j

hik+1

dist(hi j ,−−−−→hik hik+1)

Fig. 5 The distance from a point in a finer level to a linesegment in a coarser level.

whereN j is the point number inC j(t). hik andhik+1 aretwo neighboring points inCk(t), andhi j is betweenhikandhik+1 in C j(t). Figure 5 illustrates the concept of thedistance between a point and a line segment with a 2-Dexample.

Suppose that the selected level isk. We choose itsconsecutive levels, of which each coarser level is aclustered result built upon its previous level. The dis-tributional MDS algorithm[27] can then be employed toget a multi-level two-dimensional MDS configuration< Pk,Pk+1,Pk+2, ... >. By displaying this sequenceorderly, a progressive insight of the underlying docu-ment structure is gained. Figure 6 shows three levelsof a section selected from the bookMy Life[11]. Theconfiguration of this visualization is listed in the firstrow of Table 1.

3.3 Focus+Context visualization

For rapid understanding of a long document, both theoverview (context) and detailed depiction (focus) areneeded to investigate places with semantic transition-s. A well-structured layout would allow the usersto browse the entire content and focus on interestingregions. Based on the built hierarchical curve structure,we can easily construct a focux+context visualization.

Suppose, for example, the drawn curve under thecurrent viewing isC j(t). When a focusing operation isspecified by the user, the curve segment that intersectswith the focus region and the rearranged layout are de-termined. Subsequently, their projections inC j−1(t) areextracted, yielding a finer curve segment. Each picturein the segment is scaled in a way that is proportional toits distance to the focus center, as depicted in Figure 7.

3.4 Interactive document exploration with linked views

We have developed a user interface (see Figure 2)that integrates multiple exploration windows. Theleft side of the window is designed to navigate thetextual information of the underlying document whichis organized as a hierarchical tree through key wordsor chapter structures. The bottom left window pro-vides some basic interaction settings such as selectinglevels. The main window in the top right portion of

Page 7: Sequential Document Visualization Based on Hierarchical ... · Descriptive pictures Fig. 1 Three main components of our approach technique of sequential document visualization[9]

Haidong Chen et al.: Sequential Document Visualization ... 7

Fig. 6 Multi-resolution views of a section (Chapter 20-Chapter 28) in the bookMy Life[11].

Fig. 7 Focus+context visualization of a section ofTheAmerican Civil War. Pictures within selected segment are alsopresented in a key frame pop-up window.

the interface displays an HPHC curve and descriptivepictures located at the two-dimensional projection ofeach curve point. Selection, zooming, and browsingoperations are supported in this window. The key framepopup window and the overview map are shown in thebottom of the interface. In the key frame popup window,multiple picture candidates that relate to the underlyingcurve point are shown. These windows provide theuser a details-on-demand display of the document inconjunction with the main window. To some extent,the two-dimensional MDS view and the Overview Mapare redundant. However, the two-dimensional MDSview can help the user discover the point density of thelow-dimensional embedding. The Overview Map canprovide indications of the sequentiality of documentand the active text segment (The blue circle).

In particular, our system features the following oper-ations:• Rapid document browsing. The descriptive pic-

tures and the cuve lower the user’s workload, mak-ing the browsing suitable for anyone, especially

persons with low-literacy skills.• Document segmentation. The most coarse lev-

els in theHPHC representation suggest potentialsegmentations of a document. Users can alsorapidly justify the analysis by checking the textualinformation where the curve variations are large,and generate a pictorial summarization based onthe ones that pass the examination. Figure 8ashows an interesting example. The numbers denotethree important events during the contest for theGovernor of Arkansas in 1982: (1) Bill decided tocontest for the Governor. Thereafter he was busyin other affairs; (2) Hillary showed her support forBill by taking the Clinton family name during thecontest; (3) Bill was elected as the Governor forthe second time. Here, the dominant points in theHPHC clearly show sharp topic transitions, andlead to a potential document segmentation.

• Document pattern discovery.The sequentiality ofthe HPHC representation leads to a long evolvingpicture sequence, which is extremely suitable fordocuments with clear temporal or spatial relation-ships like the biography, geography, and histor-ical books. If there are some similar or periodthematic patterns or concepts in a document, ourmulti-resolution visualization scheme effectivelypreserves them while making a reasonable two-dimensional layout. An interesting example isshown in Figure 8b, where the curve intersects withitself indicating a repetitive event was detected.

4 Experiments and Discussion

We have implemented the proposed approach with Java.For efficiency, the computation for the dissimilaritymatrix and MDS configuration is accelerated by us-ing CUDA. All experiments are conducted on a PCequipped with a P4 2.4 GHz CPU, 3.5GB host memory,and a nVidia GTX 280 video card.

Page 8: Sequential Document Visualization Based on Hierarchical ... · Descriptive pictures Fig. 1 Three main components of our approach technique of sequential document visualization[9]

8 Tsinghua Science and Technology, August 2012, 17(4): 000-000

Resulting visualization Low-dimensional embedding

1

2

3

(a)

(b)

Fig. 8 Visualizing a section ofMy Life. (a) A piece ofHPHC that shows the process of Clinton’s second run for theGovernor of Arkansas in 1982. (b) A repeating pattern in thehphc reveals the entire procedure of the presidential electionpreparation of Bill Clinton in 1991.

4.1 Textual Preprocessing

Before the visualization, sequential processing of theinput document is needed. First, all words are convertedto lower case and non-letter symbols are removed. ThePorter stemming algorithm[28] is then employed to stemthe word list. To get a smoothly progressed list of localhistograms along the document, a small numeric value(e.g., 0.05) is added to zero-valued items.

Three books, namely,My Life[11] (abbr. ML),Robin-son Crusoe[29] (abbr. RC), andThe American CivilWar[30] (abbr. AW), were tested with our proposedmethod in this paper.

The first and second columns of Table 1 list theitem numbers of the initial input documents and theirstemmed versions. For each document, three levelswere constructed to build the textual hierarchy (key-words) and the hierarchical levels, as reported in thethird and fourth columns. The time (in minutes) andmemory (in megabyte) consumed for preprocessing arelisted in the last two columns.

Table 1 Statistics on the tested documents

Inputdocument

Stemmeddocument

Keywords

LOD Time Mem

ML 101526 44947 164 48/67/97 10 150RC 104318 36785 482 55/93/146 10 120AW 26237 13655 186 41/63/102 6 30

4.2 User Evaluation

The results for the documentMy Life have been demon-strated throughout this paper. We show the results foranother two documents in Figure 9 and Figure 10.

To evaluate the efficiency of our approach, we con-ducted a preliminary informal user evaluation that con-sisted of 10 volunteers who had no experience in thefield of text visualization. All participants were grad-uate students and three of them were female. Shortlybefore they used our toolkit to explore the three docu-ments, a 10-minute introduction to the user study wasgiven. Three document exploration modes were thenprovided, namely, the text-based navigation, the curveview, and the visualization with linked views. The firsttwo modes are provided by disabling related featureswithin our system. For each mode, an individualdocument was used as the testing data. The users werethen asked to answer the three questions listed below.Five categorical values ranging from 1 (very negative)to 5 (very positive) were given for each question andeach mode.• How long do you expect to roughly understand

what this book is about with the help of this mode?• Do you think that this mode can help you under-

stand this document?• Will you choose this mode when you are asked

to present this document to your grandmother orgrandfather?

The total user study took 40 minutes. The scoresfor each mode are listed in the order of the questions:the text-based navigation (2.4/3.6/2.0); the curve view(2.5/2.8/2.3), and the visualization with linked views(3.0/4.0/3.8). These statistics show a strong favorabilityfor the linked views visualization. All users foundthat the visualization with linked views mode veryinteresting and novel. They also agreed that the systemwas very useful for rapid document understanding andwere always willing to read the ones with pictures.Some users commented that our tool “clearly provided auseful complement to the hierarchical document struc-tures”. This evaluation also confirmed the interactiveperformance and plausible interactivity of our system

Page 9: Sequential Document Visualization Based on Hierarchical ... · Descriptive pictures Fig. 1 Three main components of our approach technique of sequential document visualization[9]

Haidong Chen et al.: Sequential Document Visualization ... 9

thanks to the combination of the sequential documentvisualization and the pictorial communication tech-niques. Furthermore, in certain situations, it providesa rapid method for identifying important documentstructures.

4.3 Discussions

Our HPHC representation shares many common com-ponents as the LOWBOW representation. For in-stance, both are built upon the local histogram list, andmaintain sequential and position information. How-ever, our visualization approach extensively consider-s the two-dimensional layout of a sequential curve,and is different from the LOWBOW-based documentvisualization[9] in several aspects:• LOWBOW is a sophisticated framework for repre-

senting sequential or categorical time series data.Our approach takes a much simpler form, and isflexible to undergo both the frequency-based orgeometric operations like simplification, query, andfrequency analysis.

• The HPHC representation is parametric, whilethe LOWBOW representation is non-parametricand globally smoothed by using a smoothing filterscheme. Although thehphc is formed by fittingthe discrete local histograms with a uniform kernel(Eq. (3)), all other smoothing kernels suitablefor the LOWBOW framework like the Gaussian,Epanechnikov, and Cauchy kernels, can also beemployed.

• The HPHC takes a hierarchical structure whosefinest level encodes every word, and whose coars-est level gives a high abstraction to the document.In contrast, the LOWBOW representation uses auniform sampling and achieves a multi-resolutionview by changing the size of the smoothing kernel.Thus, it can hardly afford adaptive exploration.

Furthermore, the selected images have an effect onthe resulting visualization. But automatically generat-ing pictures from texts is not the main focus of thispaper. All pictures used in this paper are manually col-lected from the internet and labeled by five volunteersin one week who are otherwise uninvolved to this work.However, more advanced techniques can be seamlesslyincorporated into our approach.

5 Conclusions

Document visualization is useful in browsing, search-ing, and analyzing hidden themes of textual archives. In

this paper we describe an integrated approach that sup-ports hierarchical visualization, interactive navigation,comparative analysis, as well as querying capabilities.With the combination of the one-dimensional documentsequentiality and the two-dimensional graphical forms,we create an effective visual exploration system that iscapable of providing users (e.g., a 12 years old child)with a powerful tool that is easy to use and requires littleexpertise by leveraging their abilities to understand thevisual forms.

For future directions, we would like to test the pro-posed scheme with other categorical time series datasuch as video sequences and financial data. We alsoexpect to extend the pictorial communication modefor the visualization of document corpus. We believethat our approach can help develop a new documentbrowsing technology, potentially reshaping the way inwhich users comprehend textual information.

References

[1] Fortuna B, Grobelnik M, Mladenic D. Visualization of textdocument corpus.Informatica Journal, 2005,29(4): 497-502.

[2] Havre S, Hetzler E, Whitney P, et al. ThemeRiver:Visualizing thematic changes in large document collec-tions. IEEE Transactions on Visualization and ComputerGraphics, 2002,8(1): 9-20.

[3] Rohrer R, Sibert J, Ebert D S. The shape of shakespeare:Visualizing text using implicit surfaces. In: IEEESymposium on Information Visualization. Washington,D.C., USA, 1998.

[4] Cui W, Liu S, Tan L, et al. TextFlow: Towardsbetter understanding of evolving topics in text.IEEETransactions on Visualization and Computer Graphics,2011,17(12): 2412-2421.

[5] Ham F V, Wattenberg M, Viegas F B. Mapping textwith phrase nets.IEEE Transactions on Visualization andComputer Graphics, 2009,15(6): 1169-1176.

[6] Wise J, Thomas J, Pennock K, et al. Visualizing the non-visual: Spatial analysis and interaction with informationfrom text documents. In: Information Visualization.Washington, D.C., USA, 1995: 51-58.

[7] DeCamp P, Frid-Jimenez A, Guiness J, et al. GistIcons: Seeing meaning in large bodies of literature. In:IEEE Symposium on Information Visualization, PosterCompendium, IEEE CS Press, 2005.

[8] Collins C, Carpendale S, Penn G. DocuBurst: Visualizingdocument content using language structure.ComputerGraphics Forum, 2009,28(3): 1039-1046.

[9] Mao Y, Dillon J, Lebanon G. Sequential documentvisualization. IEEE Transactions on Visualization andComputer Graphics, 2007,13(6):1208-1215.

[10] Tufte E. Visual Explanations: Images and quantities,evidence and narrative. Graphics Press, 1997.

Page 10: Sequential Document Visualization Based on Hierarchical ... · Descriptive pictures Fig. 1 Three main components of our approach technique of sequential document visualization[9]

10 Tsinghua Science and Technology, August 2012, 17(4): 000-000

Fig. 9 Visualizing three levels for the bookThe American Civil War[29]. The third row of Table 1 lists the configuration of thisexperiment.

Fig. 10 Visualizing three levels for the bookRobinson Crusoe[30]. The second row of Table 1 lists the configuration of thisexperiment.

[11] Clinton B. My Life. Vintage , 2005.[12] Miller N E, Wong P C, M. Brewster, et al. Topic islands-

a wavelet-based text visualization system. In: IEEEVisualization. North Carolina, United States, 1998.

[13] Pacific Northwest National Laboratory. IN-SPIRE Vi-sual Document Analysis. [Online]. Available: http://in-spire.pnnl.gov/, 2012.

[14] BioWisdom. Omniviz and Refviz. [Online]. Available:http://www.biowisdom.com/tag/omniviz/, 2012.

[15] Wong P C, Cowley W, Foote H, et al. Visualizingsequential patterns for text mining. In: IEEE Symposiumon Information Visualization, Washington, DC, USA,2000.

[16] Coyne B, Sproat R. WordsEye: An automatic text-to-sceneconversion system. In: ACM SIGGRAPH. Los Angeles,USA, 2001: 487-496.

[17] Joshi D, Wang Z, Li J. The story picturing engine: Asystem for automatic text illustration.ACM Transactionson Multimedia Computing, Communications, and Appli-cations, 2006,2(1): 1356-1363.

[18] Johansson R, Berglund A, Danielsson M, et al. Automatictext-to-scene conversion in the traffic accident domain. In:19th IJCAI. Edinburgh, Scotland, 2005: 1073-1078.

[19] Goldberg A B, Rosin J, Zhu X, et al. Toward text-to-picture synthesis. In: NIPS 2009 Symposium on AssistiveMachine Learning for People with Disabilities, 2009.

[20] Goldberg A, Zhu X, Dyer C R, et al. Easy as ABC?facilitating pictorial communication via semanticallyenhanced layout. In: Computational Natural LanguageLearning. Manchester, United Kingdom, 2008.

[21] Zhu X, Goldberg A, Eldawy M, et al. A text-to-picturesynthesis system for augmenting communication. In: TheIntegrated Intelligence Track of the Twenty-Second AAAIConference on Artificial Intelligence. Vancouver, Canada,2007.

[22] Lebanon G. Sequential document representations andsimplicial curves. In: Uncertainty in Artificial Intelligence(UAI), Arlington, Virginia, USA, 2006.

[23] Teh C H, Chin R T. On the detection of dominant points ondigital curves.IEEE Transactions on Pattern Analysis andMachine Intelligence, 1989,11(8): 859-872.

[24] Wilkinson L. The Grammar of Graphics. Springer, 1999.

[25] Borg I, Groenen P. Modern multidimensional Scaling:theory and applications. Springer, 1997.

[26] Spivak M. A Comprehensive Introduction to DifferentialGeometry. Perish Press, 1999.

[27] Quist M, Yona G. Distributional scaling: An algorithm forstructure preserving embedding of metric and nonmetricspaces.Journal of Machine Learning Research, 2004,5:399-420.

[28] Porter M F. An algorithm for suffix stripping.Readings inInformation Retrieval, 1980,14(3): 130-137.

[29] Glatthaar J, Krick R. The American Civil War. OspreyPublishing, 2001.

[30] Defoe D. Robinson Crusoe. W. Taylor, 1719.