datascience lab 2017_from bag of texts to bag of clusters_Терпиль Евгений / Павел...

From bag of texts to bag of clusters

Upload: geekslab-odessa

Post on 22-Jan-2018


Page 1: DataScience Lab 2017_From bag of texts to bag of clusters_Терпиль Евгений / Павел Худан (Data Scientists / NLP Engineer at YouScan)

From bag of texts to bag of clusters

Page 3:

Map of ML mentions

Mar 2017, collected by YouScan

Page 4:

Map of ML mentions

conference, meetup

Page 5:

Map of ML mentions

We invite you to Data Science Lab on May 13…

conference, meetup

Page 6:

Part 1: Classic approach

Word embeddings

Page 7:

Semantic representation of texts

1. Text (semi/un)supervised classification

2. Document retrieval

3. Topic insights

4. Text similarity/relatedness

Page 8:

Requirements

• Vector representation is handy

• Descriptive (not distinctive) features

• Language/style/genre independence

• Robustness to language/speech variance (word- and phrase-level synonymy, word order, newly emerging words and entities)

Page 9:

Prerequisites

• Token-based methods (although char-based ones are more robust)

• Preprocessing and unification

• Tokenization

• Lemmatization?

Page 10:

BoW, Tf-idf and more

• Bag of Words: one-hot encoding over the observed dictionary

• TF-IDF: ‘term frequency’ * ‘inverse document frequency’ for term weighting (includes various normalization schemes)

• Bag of n-grams: collocations carry more specific senses

• Singular Value Decomposition (SVD) of the original term-document matrix (compression that loses mainly the less relevant information):

◦ resolves inter-document relations: similarity

◦ resolves inter-term relations: synonymy and polysemy

◦ reduces dimensionality
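
A minimal sketch of the TF-IDF weighting named on this slide, in plain Python. The exact variant is an assumption: term frequency is the count divided by document length, and inverse document frequency is log(N / df) with no smoothing. The slide itself notes that other normalization schemes exist. The toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # tf = count / document length, idf = log(N / df) (assumed variant)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical toy corpus, already tokenized.
docs = [
    ["google", "acquired", "kaggle"],
    ["kaggle", "hosts", "competitions"],
    ["google", "released", "tensorflow"],
]
vectors = tfidf(docs)
```

Terms that occur in fewer documents ("acquired") receive a higher weight than terms shared across documents ("google").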

Page 11:

BoW, Tf-idf and more

Pros

- easily interpretable

- easy to implement

- parameters are straightforward

Cons

- not robust to language variance

- scales badly

- vulnerable to overfitting

Page 12:

TF-IDF + SVD + TSNE

Map labels: ODS course on Habr; Google bought Kaggle; cancer tumor detection; Yandex Crypta, women's queries; Data Science Lab

Page 13:

TF-IDF + SVD

Map labels: neural network; artificial intelligence; deep learning

Page 14:

Clustering

1. K-means

2. Hierarchical clustering

3. Density-based clustering (DBSCAN)

Page 15:

K-means

• Separate all observations into K groups of equal variance

• Iteratively reassign each observation to the nearest cluster mean and recompute the means, minimizing the inertia (the within-cluster sum of squares)
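
The two alternating steps can be sketched in plain Python as follows. This is an illustrative version of the standard Lloyd iteration, not the speakers' code: the fixed iteration count, the initialization from sampled observations, and the toy points are all assumptions.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def kmeans(points, k, n_iter=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct observations
    for _ in range(n_iter):
        # Assignment step: attach every observation to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    # Inertia: within-cluster sum of squared distances to the nearest center.
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, inertia

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, inertia = kmeans(points, k=2)
```

On this toy data the centers settle on the means of the two groups; on real text vectors the result depends on initialization, so several restarts are usually compared by inertia.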

Page 16:

Hierarchical clustering

• Build a hierarchy of clusters

• Bottom-up or top-down approach (agglomerative or divisive clustering)

• Various metrics for cluster dissimilarity

• Cluster count and contents depend on the chosen dissimilarity threshold

Clusters: a, bc, def
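
A small bottom-up (agglomerative) sketch shows how the threshold decides the final clusters. Single linkage is assumed here as the dissimilarity between clusters, and the one-dimensional coordinates for the points a to f are invented so that a threshold of 2.0 gives the grouping a, bc, def.

```python
def agglomerative(points, threshold):
    """Single-linkage bottom-up clustering of {label: coordinate} points."""
    clusters = [[label] for label in points]

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(abs(points[a] - points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical coordinates chosen for illustration only.
points = {"a": 0.0, "b": 5.0, "c": 6.0, "d": 12.0, "e": 13.0, "f": 14.5}
print(agglomerative(points, threshold=2.0))  # [['a'], ['b', 'c'], ['d', 'e', 'f']]
```

Raising the threshold to 5.5 merges a into bc and leaves two clusters, which is the sense in which cluster count and contents depend on the threshold.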

Page 17:

Density-based clustering (DBSCAN)

• Find areas of high density separated by areas of low density of samples

• Involves two parameters: epsilon and minimum points

• Epsilon sets the maximum distance at which two points are still considered close enough

• Minimum points is the number of mutually close points required to form a new cluster
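
The roles of the two parameters can be seen in a compact sketch of the usual DBSCAN procedure. This is a simplified illustration with a brute-force neighbor search, not a production implementation. The toy points and parameter values are assumptions.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is also a core point: keep expanding
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

Unlike K-means, the number of clusters is not given in advance, and points in sparse regions are left unassigned as noise.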

Page 18:

K-Means clusters

TF-IDF + SVD

Page 19:

Word embeddings

Word embeddings that capture semantics: word2vec family, fastText, GloVe

CBOW, Skip-gram

Page 20:

Word embeddings

Page 21:

Word embeddings

Dimension-wise mean/sum/min/max over embeddings of words in text

Word Mover’s Distance
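
The dimension-wise aggregation can be sketched as below. The two-dimensional embeddings are invented toy values, and silently skipping out-of-vocabulary tokens is an assumed policy. In practice the vectors would come from a trained word2vec, fastText or GloVe model.

```python
# Dimension-wise pooling functions applied to one embedding dimension at a time.
POOLERS = {
    "mean": lambda column: sum(column) / len(column),
    "sum": sum,
    "min": min,
    "max": max,
}

def text_vector(tokens, embeddings, how="mean"):
    """Aggregate the embeddings of known tokens into one text vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]  # skip OOV tokens
    if not vectors:
        raise ValueError("no token of the text has an embedding")
    # zip(*vectors) yields one tuple per embedding dimension.
    return [POOLERS[how](column) for column in zip(*vectors)]

# Hypothetical toy embeddings for illustration only.
embeddings = {"google": [1.0, 4.0], "kaggle": [3.0, 0.0]}
tokens = ["google", "bought", "kaggle"]
print(text_vector(tokens, embeddings, "mean"))  # [2.0, 2.0]
print(text_vector(tokens, embeddings, "max"))   # [3.0, 4.0]
```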

Page 22:

Word embeddings

Pros

- semantics is included

- moderately robust to language variance

- scales better, including OOV

Cons

- embeddings source and quality?

- what carries the meaning is the relations between vectors (distance measures, separating planes), not the vector values themselves

- meaning degrades quickly on moderate-to-large texts

- interpretation is tedious work

Page 23:

Word2Vec mean

Map labels: ODS course on Habr; Google bought Kaggle; cancer tumor detection; Yandex Crypta, women's queries; Data Science Lab

Page 24:

Word2Vec mean

purchase, investments

Page 25:

TF-IDF + SVD

purchase, investments

Page 26:

Sense clusters

Page 27:

Sense clusters

Example: potato → 0 0.9 0 0 0.95 0 0.1 (vector length 3000; cluster labels in the figure: food, time, vegetables)

• Find K cluster centers over target vocabulary embeddings

• Calculate the cosine similarity to each cluster center for every vocabulary word, zeroing out relatively small values

• Use distances as new K-dimensional feature vector (word embedding)

• Aggregate embeddings

• Normalize?
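
The steps above can be sketched as follows, assuming the cluster centers have already been found (for example with K-means over the vocabulary embeddings). The two-dimensional centers, the toy embeddings, the `min_sim` cutoff and the use of the mean as the aggregation are all illustrative assumptions. The slide suggests about 3000 clusters in practice.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sense_vector(word_vec, centers, min_sim=0.3):
    """K-dimensional word representation: similarity to each cluster center."""
    sims = [cosine(word_vec, c) for c in centers]
    return [s if s >= min_sim else 0.0 for s in sims]  # drop small values

def text_sense_vector(tokens, embeddings, centers, min_sim=0.3):
    """Mean of the sense vectors of all known tokens in a text."""
    rows = [sense_vector(embeddings[t], centers, min_sim)
            for t in tokens if t in embeddings]
    return [sum(col) / len(col) for col in zip(*rows)]

# Hypothetical toy setup: two cluster centers ("food", "time").
centers = [[1.0, 0.0], [0.0, 1.0]]
embeddings = {"potato": [0.9, 0.1], "hour": [0.0, 1.0]}

potato = sense_vector(embeddings["potato"], centers)
doc = text_sense_vector(["potato", "hour"], embeddings, centers)
```

Here "potato" keeps a high value only in the "food" dimension, so each dimension of the resulting text vector can be read as the presence of one sense cluster.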

Page 28:

Sense clusters

Pros

- semantics is now carried by the vector values themselves (expressed by concrete values in vectors)

- meaning now accumulates in text vectors better

- it is possible to retrofit clusters on sense interpretations for readability

Cons

- drawbacks inherited from word embeddings

- chained complexity

- additional parameters to fiddle with

- vector length is higher (around 3k dimensions) -> bigger, cumbersome, heavier

Page 29:

Word2Sense mean

Map labels: ODS course on Habr; Google bought Kaggle; cancer tumor detection; Yandex Crypta, women's queries; Data Science Lab

Page 30:

purchase, investments

Word2Sense mean

Page 31:

Doc2Vec

Page 32:

Doc2Vec

Map labels: ODS course on Habr; Google bought Kaggle; Yandex Crypta, women's queries

Page 33:

Part 2: Alternatives

Deep learning

Page 34:

K-Means representation

Map labels: ODS course on Habr; Google bought Kaggle; cancer tumor detection; Yandex Crypta, women's queries; Data Science Lab

Page 35:

Topic modeling

Page 36:

LDA

Map labels: Google bought Kaggle; ODS course on Habr

Page 37:

Sequence-to-Sequence Models

document vector

Examples: Neural Machine Translation, Text Summarization

Page 38:

Skip-Thought

Figure labels: sentence vector, objective

Page 39:

FastSent

Figure labels: word embedding, sentence representation, softmax, objective

Page 40:

FastSent

Map labels: ODS course on Habr; Google bought Kaggle; cancer tumor detection; Yandex Crypta, women's queries; Data Science Lab

Page 41:

FastSent

purchase, investments

Page 42:

FastSent

conference, meetup

Page 43:

Sequential Denoising Autoencoder (SDAE)

Corrupt the sentence by:

• Delete word, with probability p0 ∈ [0, 1]

• Swap bigram, with probability px ∈ [0, 1]

and predict the original sentence

Example words in the figure: Google, bought, service, for, researchers
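
The corruption step alone (not the encoder-decoder that learns to undo it) can be sketched as a small noise function. The function name, the argument names and the choice to swap non-overlapping bigrams after deletion are assumptions made for illustration. The example sentence is an English rendering of the words in the figure.

```python
import random

def corrupt(tokens, p_delete, p_swap, rng):
    """Noise function: drop words, then swap neighbouring words."""
    # Delete each word independently with probability p_delete.
    kept = [t for t in tokens if rng.random() >= p_delete]
    # Swap each non-overlapping bigram with probability p_swap.
    for i in range(0, len(kept) - 1, 2):
        if rng.random() < p_swap:
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
    return kept

rng = random.Random(0)
sentence = ["google", "bought", "a", "service", "for", "researchers"]
noisy = corrupt(sentence, p_delete=0.1, p_swap=0.1, rng=rng)
# A training pair for the autoencoder would then be (noisy, sentence).
```

With both probabilities set to 0 the sentence passes through unchanged, which reduces the model to a plain sequential autoencoder.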

Page 44:

SDAE

Map labels: ODS course on Habr; Google bought Kaggle; Yandex Crypta, women's queries; Data Science Lab

Page 45:

conference, meetup

SDAE

Page 46:

Supervised evaluations

Learning Distributed Representations of Sentences from Unlabelled Data

Page 47:

Unsupervised (relatedness) evaluations

Learning Distributed Representations of Sentences from Unlabelled Data

Page 48:

Links

Learning Distributed Representations of Sentences from Unlabelled Data http://www.aclweb.org/anthology/N16-1162

FastSent, SDAE https://github.com/fh295/SentenceRepresentation

Skip-Thought Vectors https://github.com/ryankiros/skip-thoughts

Sense clusters https://servponomarev.livejournal.com/10604 https://habrahabr.ru/post/277563/

Page 49:

Questions?