finding structure in texts with topological data analysisncuwm/22ndannual/... · introduction...

Finding Structure in Texts with Topological Data Analysis

Calli Clay and Ella Graham

St. Catherine University

February 1, 2020

Calli Clay and Ella Graham (St. Kate’s) Finding Structure in Texts with TDA February 1, 2020 1 / 17

Introduction

Recently, analyzing data has become more complex because data sets are larger in size and higher in dimension

To address this complexity, we looked at determining the shape of a data set using an approach called topological data analysis

The Shape of a Data Set

Figure: Visualizing data sets (Dr. Pelatt). Two three-dimensional plots, titled "A Three Dimensional Data Set" and "Yet Another Three Dimensional Data Set", each with axes Variable One, Variable Two, and Variable Three.

Research Goals

Determine the efficiency of topological data analysis as a text analytics tool

Analyze poetry forms including the villanelle and sestina

Analyze music genres including rock music and pop music

Background

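Topology is the study of shapes, specifically the properties that are preserved when a shape is stretched or bent without tearing or gluing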

Figure: Transforming a coffee cup into a donut (Hood)

Background

Persistent homology is a common TDA method

A technique for approximating the topological features of a space in different dimensions

Has not been widely used for analyzing texts

Simplicial Complexes

Geometric representations of the shape of a data set

Simplices are the building blocks for simplicial complexes

Figure: Simplices (Huang)

Simplicial Complexes

We can think of point clouds as being sampled from a topological space

Simplices are used to turn point clouds into simplicial complexes

Accomplished with a Vietoris-Rips complex

Figure: Illustration of building a simplicial complex from a point cloud (Huang)
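
The construction above can be sketched in a few lines. This is a minimal Python illustration, not the authors' R code. It assumes the convention that a set of points spans a simplex when every pair of them lies within distance epsilon of each other; the function name vietoris_rips and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, epsilon, max_dim=2):
    """List the simplices of a Vietoris-Rips complex at scale epsilon.

    Each simplex is a tuple of point indices. A tuple is included when
    all of its points are pairwise within distance epsilon (assumed
    convention; some texts use 2 * epsilon instead).
    """
    simplices = []
    # k vertices form a (k - 1)-dimensional simplex
    for k in range(1, max_dim + 2):
        for combo in combinations(range(len(points)), k):
            if all(dist(points[i], points[j]) <= epsilon
                   for i, j in combinations(combo, 2)):
                simplices.append(combo)
    return simplices


# Hypothetical point cloud: three nearby points and one far away
sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
for simplex in vietoris_rips(sample, epsilon=1.5):
    print(simplex)
```

Growing epsilon adds edges and triangles, which is what makes it possible to track features across scales. Checking every subset is only practical for small point clouds.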

Persistent Homology

We use persistent homology to analyze the space that is represented by simplicial complexes

We calculate homology groups in each dimension

Dimension 0 represents components

Dimension 1 represents holes or loops

Dimension 2 and higher represent voids
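
Dimension 0 is simple enough to compute by hand, which makes it a useful illustration. The Python sketch below is an assumption-laden example rather than the authors' method: it assumes an edge joins two points once the scale reaches the distance between them, and it tracks components with a union-find structure. Every component is born at scale 0 and dies when it merges into another. The name h0_barcode is hypothetical, and higher dimensions need dedicated software such as Ripser, which the slides cite.

```python
from math import dist, inf


def h0_barcode(points):
    """Return dimension-0 persistence intervals as (birth, death) pairs."""
    n = len(points)
    # All pairwise edges, shortest first
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component representative
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge, so one of them dies at this scale
            parent[root_i] = root_j
            bars.append((0.0, length))
    if n > 0:
        # One component survives at every scale
        bars.append((0.0, inf))
    return bars


print(h0_barcode([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]))
```

Long bars correspond to clusters that stay separate over a wide range of scales; short bars are usually treated as noise.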

Barcodes

Visual representation of the persistent homology of a given text: each bar marks the range of scales over which one topological feature persists

Barcode Example with Poetry

Do not go gentle into that good night by Dylan Thomas

Bottleneck Distance

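Once each text file is visually represented by a barcode, we can compare their barcodes to find the bottleneck distance

Measures distance between the persistent homologies of two text files

$$W_\infty(X, Y) = \inf_{\eta : X \to Y} \, \sup_{x \in X} \lVert x - \eta(x) \rVert_\infty$$

Here X and Y are the persistence diagrams of the two texts, and η ranges over bijections between them, with points on the diagonal allowed as partners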

Wasserstein distance is another approach

Figure: Barcode 1 in Dimension 0

Figure: Barcode 2 in Dimension 0
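
To make the definition concrete, here is a brute-force Python sketch for very small barcodes. It is an illustration under stated assumptions, not a production algorithm: it assumes every interval is finite, it lets any interval be matched to the diagonal at a cost of half its length, and it tries every possible matching, so it is only usable for a handful of bars. The function name bottleneck_distance is hypothetical.

```python
from itertools import permutations
from math import inf


def bottleneck_distance(diagram_x, diagram_y):
    """Bottleneck distance between two small lists of (birth, death) pairs."""
    n, m = len(diagram_x), len(diagram_y)
    # Pad each side with diagonal slots so that every point has a partner:
    # indices >= n on the X side and >= m on the Y side mean "diagonal".
    size = n + m

    def cost(i, j):
        x_is_point, y_is_point = i < n, j < m
        if x_is_point and y_is_point:
            (b1, d1), (b2, d2) = diagram_x[i], diagram_y[j]
            return max(abs(b1 - b2), abs(d1 - d2))  # sup-norm distance
        if x_is_point:
            b, d = diagram_x[i]
            return (d - b) / 2  # distance from a point to the diagonal
        if y_is_point:
            b, d = diagram_y[j]
            return (d - b) / 2
        return 0.0  # diagonal matched to diagonal

    best = inf
    for matching in permutations(range(size)):
        worst = max((cost(i, matching[i]) for i in range(size)), default=0.0)
        best = min(best, worst)
    return best


# Two hypothetical dimension-0 barcodes with finite bars only
print(bottleneck_distance([(0.0, 1.0), (0.0, 3.0)], [(0.0, 1.0), (0.0, 2.0)]))
```

The inner max plays the role of the supremum in the formula and the outer min plays the role of the infimum over matchings.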

Process

Using R in RStudio, we:

Clean each text file

Represent each line of text with a word count vector

Stack these vectors as rows to form a word count matrix

Calculate a distance matrix composed of the pairwise distances between the rows (lines of text) of the word count matrix

Use R packages to calculate the persistent homology, create barcodes, and find pairwise bottleneck distances between barcodes
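
The first four steps can be sketched as follows. This is a Python illustration, not the authors' R workflow. The cleaning rule (lowercase, keep only letters and apostrophes), the choice of Euclidean distance, the function names, and the sample lines are all assumptions made for the example.

```python
import re
from math import dist


def clean(line):
    """Lowercase a line and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", line.lower())


def word_count_matrix(lines):
    """Return (vocabulary, matrix) with one row of word counts per line."""
    tokenized = [clean(line) for line in lines]
    vocabulary = sorted({word for words in tokenized for word in words})
    column = {word: k for k, word in enumerate(vocabulary)}
    matrix = []
    for words in tokenized:
        row = [0] * len(vocabulary)
        for word in words:
            row[column[word]] += 1
        matrix.append(row)
    return vocabulary, matrix


def distance_matrix(matrix):
    """Pairwise Euclidean distances between the rows of the matrix."""
    return [[dist(a, b) for b in matrix] for a in matrix]


# Hypothetical three-line text
lines = ["The cat sat.", "The cat sat!", "A dog ran"]
vocabulary, counts = word_count_matrix(lines)
print(vocabulary)
for row in distance_matrix(counts):
    print(row)
```

Lines that repeat exactly, as refrains do in a villanelle, end up at distance 0 from each other, so repetition in the text shows up directly in the distance matrix.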

Word Count Vectors with Song Lyrics

raindrops (an angel cried) by Ariana Grande

“When Raindrops fell down from the sky

the day you left me, an angel cried

oh, she cried, an angel cried

she cried”

Issues and Questions

Stop Words: Do they change word count vectors significantly?

Address with standard tf-idf technique (Wagner)

Defining Distance: Euclidean or Angular?

Algorithms: SIF (similarity filtration) or SIFTS (similarity filtration with time skeleton)? (Zhu)
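
The first two questions can be explored with a small experiment. The Python sketch below is illustrative only: it assumes one common tf-idf variant (count times log(N / document frequency)), defines angular distance as the angle between two vectors in radians, and uses hypothetical function names and data. The slides do not say which variant the authors used.

```python
from math import acos, dist, log, pi, sqrt


def tf_idf(matrix):
    """Reweight a word count matrix by count * log(N / document frequency)."""
    n_lines = len(matrix)
    n_words = len(matrix[0]) if matrix else 0
    doc_freq = [sum(1 for row in matrix if row[t] > 0) for t in range(n_words)]
    return [
        [row[t] * log(n_lines / doc_freq[t]) if doc_freq[t] else 0.0
         for t in range(n_words)]
        for row in matrix
    ]


def angular_distance(a, b):
    """Angle in radians between two vectors."""
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return pi / 2  # assumed convention for an all-zero vector
    cosine = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    # Clamp to guard against floating-point drift outside [-1, 1]
    return acos(max(-1.0, min(1.0, cosine)))


# Hypothetical counts: column 0 is a stop word that occurs in both lines
counts = [[2, 1, 0], [2, 0, 1]]
weighted = tf_idf(counts)
print(angular_distance(counts[0], counts[1]))
print(angular_distance(weighted[0], weighted[1]))
print(dist([1, 1], [2, 2]), angular_distance([1, 1], [2, 2]))
```

A word that appears in every line gets weight 0 under this variant, so the shared stop word no longer pulls the two lines together. The last line shows the difference between the two metrics: doubling every count changes the Euclidean distance but leaves the angle at essentially 0.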

Results

Analyzing poetry using persistent homology is more interesting than analyzing song lyrics

Upon further investigation, we may be able to accurately conclude that TDA is effective for the analysis of poetry

References

H. Edelsbrunner and J. Harer, Computational topology: an introduction. American Mathematical Soc., 2010.

X. Zhu, “Persistent homology: An introduction and a new text representation for natural language processing,” in Twenty-Third International Joint Conference on Artificial Intelligence, 2013.

H. Wagner, P. Dłotko, and M. Mrozek, “Computational topology in text mining,” in CT, pp. 68–78, Springer, 2012.

H.-L. Huang, X.-L. Wang, P. P. Rohde, Y.-H. Luo, Y.-W. Zhao, C. Liu, L. Li, N.-L. Liu, C.-Y. Lu, and J.-W. Pan, “Demonstration of topological data analysis on a quantum processor,” Optica, vol. 5, no. 2, pp. 193–198, 2018.

S. Gholizadeh, A. Seyeditabari, and W. Zadrozny, “Topological signature of 19th century novelists: Persistent homology in text mining,” Big Data and Cognitive Computing, vol. 2, no. 4, p. 33, 2018.

M. Hood, “When is a coffee mug a donut? Topology explains it,” 2016.

Ripser, https://live.ripser.org/.
