Data Science Lab 2017: From bag of texts to bag of clusters (Yevgen Terpil / Pavel…)
TRANSCRIPT
From bag of texts to bag of clusters
Paul Khudan, Yevgen Terpil ([email protected], [email protected])
Map of ML mentions
Mar 2017, collected by YouScan
[Map figure; cluster label: "conference, meetup"; sample mention: "Join us on May 13 at Data Science Lab…"]
Part 1: Classic approach
Word embeddings
Semantic representation of texts
1. Text (semi/un)supervised classification
2. Document retrieval
3. Topic insights
4. Text similarity/relatedness
Requirements
• Vector representation is handy
• Descriptive (not distinctive) features
• Language/style/genre independence
• Robustness to language/speech variance (word- and phrase- level synonymy, word order, newly emerging words and entities)
Prerequisites
• Token-based methods, although char-based ones are more robust
• Preprocessing and unification
• Tokenization
• Lemmatization?
BoW, Tf-idf and more
• Bag of Words: one-hot encoding over the observed dictionary
• TF-IDF: ‘term frequency’ × ‘inverse document frequency’ for term weighting (with various normalization schemes)
• Bag of n-grams: collocations carry more specific senses
• Singular Value Decomposition (SVD) of the original term-document matrix (compression with less relevant information loss):
◦ resolves inter-document relations: similarity
◦ resolves inter-term relations: synonymy and polysemy
◦ reduces dimensionality
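A minimal sketch of this pipeline with scikit-learn, on a made-up four-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "deep learning for text classification",
    "bag of words text classification",
    "stock market investment news",
    "market analysis and investment strategies",
]

# TF-IDF: one row per document over the observed dictionary
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)          # (n_docs, n_terms), sparse

# Truncated SVD (a.k.a. LSA) compresses the term-document matrix,
# capturing inter-term and inter-document relations in fewer dimensions
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X)           # (n_docs, 2), dense
print(X_lsa.shape)
```

The number of SVD components (2 here, for plotting) is a tunable parameter; a few hundred is more typical for retrieval.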
BoW, Tf-idf and more
Pros:
- easily interpretable
- easy to implement
- parameters are straightforward
Cons:
- not robust to language variance
- scales badly
- vulnerable to overfitting
[Map figures: TF-IDF + SVD (+ t-SNE). Cluster labels: "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta, women's queries", "Data Science Lab", "neural network", "artificial intelligence", "deep learning"]
Clustering
1. K-means
2. Hierarchical clustering
3. Density-Based Scan (DBSCAN)
K-means
• Separate all observations into K groups of equal variance
• Iteratively reassign points to the nearest cluster mean to minimize the inertia: the within-cluster sum-of-squares criterion
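A toy illustration with scikit-learn (the two blobs are made up for the example):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# two well-separated 2-D blobs of equal variance
points = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
                    rng.normal(3, 0.3, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.inertia_)   # the within-cluster sum of squares being minimized
```

`inertia_` is exactly the criterion named above; K-means cannot choose K for you, so the cluster count is a parameter to tune.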
Hierarchical clustering
• Build a hierarchy of clusters
• Bottom-up or top-down approach (agglomerative or divisive clustering)
• Various metrics for cluster dissimilarity
• Cluster count and contents depend on the chosen dissimilarity threshold
[Dendrogram figure. Clusters: a, bc, def]
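With SciPy, agglomerative (bottom-up) clustering on five made-up points shows how the cluster count follows the chosen threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.0],   # a, b: a tight pair
                   [1.0, 1.0], [1.1, 1.0],   # c, d: another tight pair
                   [5.0, 5.0]])              # e: far away

Z = linkage(points, method="average")        # agglomerative clustering

# cluster count depends on the dissimilarity threshold t
labels_tight = fcluster(Z, t=0.5, criterion="distance")  # 3 clusters
labels_loose = fcluster(Z, t=3.0, criterion="distance")  # 2 clusters
print(len(set(labels_tight)), len(set(labels_loose)))
```

`method="average"` is one of the dissimilarity metrics mentioned above; `single`, `complete`, and `ward` are common alternatives.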
Density-Based Scan (DBSCAN)
• Find areas of high density separated by areas of low density of samples
• Involves two parameters: epsilon and minimum points
• Epsilon sets the maximum distance for two points to be considered close enough
• Minimum points is the number of mutually close points required to form a new cluster
K-Means clusters
TF-IDF + SVD
Word embeddings
Word embeddings that capture semantics: word2vec family, fastText, GloVe
[Diagram: CBOW vs. Skip-gram architectures]
Word embeddings
Dimension-wise mean/sum/min/max over embeddings of words in text
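A sketch of this aggregation step, with toy 4-dimensional vectors standing in for real word2vec/fastText/GloVe embeddings:

```python
import numpy as np

# toy embeddings; in practice these come from a pretrained model
emb = {
    "deep":     np.array([0.9, 0.1, 0.0, 0.2]),
    "learning": np.array([0.8, 0.2, 0.1, 0.1]),
    "rocks":    np.array([0.1, 0.9, 0.3, 0.0]),
}

def text_vector(tokens, emb, agg=np.mean, dim=4):
    """Dimension-wise aggregation over the embeddings of in-vocabulary words."""
    vecs = [emb[t] for t in tokens if t in emb]   # skip OOV tokens
    return agg(vecs, axis=0) if vecs else np.zeros(dim)

v = text_vector(["deep", "learning", "rocks"], emb)
```

Swapping `agg=np.mean` for `np.sum`, `np.min`, or `np.max` gives the other variants named above.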
Word Mover’s Distance
Word embeddings
Pros:
- semantics is included
- moderately robust to language variance
- scales better, including OOV words
Cons:
- embedding source and quality matter
- meaning lives in vector relations (distance measures, separating planes), not in the vector values themselves
- meaning degrades quickly on moderate-to-large texts
- interpretation is tedious work
[Map figures: Word2Vec mean vs. TF-IDF + SVD. Cluster labels: "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta, women's queries", "Data Science Lab", "purchase, investments"]
Sense clusters
[Figure: a ~3000-dimensional sense vector for "картошка" (potato), e.g. (0, 0.9, 0, 0, 0.95, 0, 0.1, …), with dimensions labeled by senses such as "food", "time", "vegetables"]
• Find K cluster centers over target vocabulary embeddings
• Calculate distances (cosine measure) to cluster centers for each vocabulary word, ignore relatively small ones
• Use distances as new K-dimensional feature vector (word embedding)
• Aggregate embeddings
• Normalize?
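The five steps above, sketched with scikit-learn and random stand-in embeddings (the vocabulary, dimensions, and the 0.1 cutoff are all illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.RandomState(0)
vocab = ["potato", "soup", "bank", "loan", "money"]
emb = rng.normal(size=(len(vocab), 50))   # stand-in for real embeddings

# 1. K cluster centers over the target vocabulary embeddings
K = 3
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(emb)

# 2. cosine similarity of every word to every center,
#    ignoring relatively small values
sims = cosine_similarity(emb, km.cluster_centers_)   # (|V|, K)
sims[sims < 0.1] = 0.0

# 3. each word is now a K-dimensional sense vector
word2sense = dict(zip(vocab, sims))

# 4. aggregate over a text and 5. normalize
text = ["potato", "soup"]
v = np.mean([word2sense[w] for w in text], axis=0)
v /= np.linalg.norm(v) or 1.0
```

In the talk's setting K is around 3000, which is where the ~3k-dimensional text vectors below come from.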
Sense clusters
Pros:
- semantics is now tangible (expressed by concrete values in the vectors)
- meaning accumulates better in text vectors
- clusters can be retrofitted with sense interpretations for readability
Cons:
- drawbacks inherited from word embeddings
- chained complexity
- additional parameters to fiddle with
- vector length is higher (around 3k dimensions) -> bigger, heavier, more cumbersome
[Map figures: Word2Sense mean. Cluster labels: "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta, women's queries", "Data Science Lab", "purchase, investments"]
Doc2Vec
[Map figure: Doc2Vec. Cluster labels: "ODS course on Habr", "Google acquired Kaggle", "Yandex Crypta, women's queries"]
Part 2: Alternatives
Deep learning
[Map figure: K-Means representation. Cluster labels: "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta, women's queries", "Data Science Lab"]
Topic modeling
LDA
[Figure: LDA topics, e.g. "Google acquired Kaggle", "ODS course on Habr"]
Sequence-to-Sequence Models
Examples: Neural Machine Translation, Text Summarization
[Diagram: the encoder's output serves as a document vector]
Skip-Thought
Objective: encode a sentence into a sentence vector and predict the surrounding sentences
FastSent
Objective: sum the word embeddings of a sentence and predict the words of adjacent sentences via a softmax
Sentence representation
[Map figures: FastSent. Cluster labels: "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta, women's queries", "Data Science Lab", "purchase, investments", "conference, meetup"]
Sequential Denoising Autoencoder (SDAE)
Corrupt the sentence, then train the model to predict the original:
• Delete word: drop each word with probability p0 ∈ [0, 1]
• Swap bigram: swap adjacent word pairs with probability px ∈ [0, 1]
[Illustrated on the sentence "Google купил сервис для исследователей" / "Google bought a service for researchers"]
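The two corruption operations can be sketched as follows (the function name and probability values are illustrative):

```python
import random

def corrupt(tokens, p_delete=0.1, p_swap=0.1, rng=None):
    """SDAE-style noise: drop words with prob p0, swap bigrams with prob px."""
    rng = rng or random.Random(0)
    # delete each word with probability p_delete (p0 on the slide)
    kept = [t for t in tokens if rng.random() >= p_delete]
    # swap adjacent, non-overlapping word pairs with probability p_swap (px)
    out, i = [], 0
    while i < len(kept):
        if i + 1 < len(kept) and rng.random() < p_swap:
            out += [kept[i + 1], kept[i]]
            i += 2
        else:
            out.append(kept[i])
            i += 1
    return out

src = "Google bought a service for researchers".split()
noisy = corrupt(src, p_delete=0.3, p_swap=0.3, rng=random.Random(42))
# the autoencoder is trained to reconstruct `src` from `noisy`
```

With both probabilities at 0 the sentence passes through unchanged, which recovers a plain sequential autoencoder.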
[Map figures: SDAE. Cluster labels: "ODS course on Habr", "Google acquired Kaggle", "Yandex Crypta, women's queries", "Data Science Lab", "conference, meetup"]
Supervised evaluations (from "Learning Distributed Representations of Sentences from Unlabelled Data")
Unsupervised (relatedness) evaluations (from the same paper)
Links
Learning Distributed Representations of Sentences from Unlabelled Data: http://www.aclweb.org/anthology/N16-1162
FastSent, SDAE: https://github.com/fh295/SentenceRepresentation
Skip-Thought Vectors: https://github.com/ryankiros/skip-thoughts
Sense clusters: https://servponomarev.livejournal.com/10604, https://habrahabr.ru/post/277563/
Questions?