Edward Kibardin presentation at the Chief Data Scientist Europe 2016

Post on 22-Jan-2018


Deep Learning and Topological Data Analysis for machine intelligence and predictive analytics

Unlabeled Data

The vast majority of data is unlabeled

•  Medical and DNA profiling
•  Images
•  Text
•  Stock market transactions
•  Customer activities
•  Sensor signals
•  System logs
•  Sound

Business questions to unlabeled data:

•  How many categories are in my dataset?
•  Which categories are the best for the business?
•  Why are some objects not like the others?
•  How can I contextualize new objects?
•  Is there a simpler way to describe my data?

Topological invariants

A topological invariant is a map f that assigns the same object to homeomorphic spaces, that is: $X \cong Y \Rightarrow f(X) = f(Y)$.

Homology is a machine that converts local data about a space into global algebraic structure.

Reference: Wikipedia, 2010.
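
As a small illustration of the definition above, the Euler characteristic V - E + F is a classic topological invariant. The sketch below is not from the slides: the function name and the two example surfaces are chosen here for illustration. It computes the invariant for two different triangulations of a sphere, which are homeomorphic and therefore should get the same value.

```python
from itertools import combinations

def euler_characteristic(triangles):
    """Return V - E + F for a surface given as a list of vertex triples."""
    faces = {frozenset(t) for t in triangles}
    edges = {frozenset(e) for t in faces for e in combinations(sorted(t), 2)}
    vertices = {v for t in faces for v in t}
    return len(vertices) - len(edges) + len(faces)

# Boundary of a tetrahedron: every 3-subset of 4 vertices is a face.
tetrahedron = list(combinations(range(4), 3))

# Boundary of an octahedron: poles 0 and 1, equator 2-3-4-5.
equator = [2, 3, 4, 5]
octahedron = [
    (pole, equator[i], equator[(i + 1) % 4])
    for pole in (0, 1)
    for i in range(4)
]

chi_tetra = euler_characteristic(tetrahedron)
chi_octa = euler_characteristic(octahedron)
print(chi_tetra, chi_octa)
```

Both surfaces are triangulated spheres, so the two values agree even though the vertex, edge and face counts differ.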

The Čech Complex

Combinatorial representations

Topological Data Analysis Pipeline

a.  Compute a combinatorial model approximating the structure of the underlying space
b.  Then compute topological invariants of this structure
c.  Represent these topological invariants in 2D space
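
Steps a and b of the pipeline can be sketched on a toy point cloud. This is a minimal sketch under stated assumptions, not the presenter's implementation. The combinatorial model is only the 1-skeleton (the graph) of a Vietoris-Rips complex at scale `eps`, which has the same edges as the Čech complex at radius `eps / 2`. The invariants are the Betti numbers of that graph. The function names, the point cloud and the value of `eps` are all chosen here for illustration, and step c (the 2D representation) is not covered.

```python
import math

def rips_graph(points, eps):
    """Edges (i, j) between points at distance <= eps (1-skeleton of a Rips complex)."""
    n = len(points)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(points[i], points[j]) <= eps
    ]

def betti_numbers(n_vertices, edges):
    """Betti numbers (b0, b1) of a graph: connected components and independent cycles."""
    parent = list(range(n_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n_vertices
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    b0 = components
    b1 = len(edges) - n_vertices + components  # E - V + (number of components)
    return b0, b1

# Toy point cloud: 8 points on a unit circle plus one far-away outlier.
circle = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8)) for k in range(8)]
cloud = circle + [(10.0, 10.0)]

edges = rips_graph(cloud, eps=0.8)
b0, b1 = betti_numbers(len(cloud), edges)
print(b0, b1)
```

At this scale the graph sees two components (the circle and the outlier) and one loop. A full pipeline would also add higher-dimensional simplices and track how the invariants change as `eps` varies.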

Morse Theory and Reeb Graph

Theorem: Suppose $h\colon X \to \mathbb{R}$ is a discrete Morse function. Then X is homotopy equivalent to a CW-complex with exactly one cell of dimension p for each critical simplex of dimension p.

Reference: Teng Ma; Zhuangzhi Wu; Pei Luo; Lu Feng. Reeb graph computation through spectral clustering, 2011.

Deep Generative Nets + TDA

1. Learning of deep generative model
2. Fine-tuning using topological loss

Case study: Netflix competition

A dataset from the Netflix open competition for the best collaborative filtering algorithm to predict user ratings for films:

•  100,480,507 ratings
•  480,189 users
•  17,770 movies
•  2.1 GB CSV file
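
The figures above imply a very sparse user-by-movie rating matrix. The short calculation below is added here as an illustration and uses only the numbers from the slide. It shows how sparse the matrix is and how many ratings an average user or movie has.

```python
ratings = 100_480_507
users = 480_189
movies = 17_770

cells = users * movies                 # size of the full user x movie matrix
density = ratings / cells              # fraction of cells that hold a rating
ratings_per_user = ratings / users
ratings_per_movie = ratings / movies

print(f"matrix cells:      {cells:,}")
print(f"density:           {density:.2%}")
print(f"ratings per user:  {ratings_per_user:.1f}")
print(f"ratings per movie: {ratings_per_movie:.1f}")
```

Only a little over 1% of the possible user and movie pairs carry a rating, which is one reason plain linear projections struggle on this data.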

Standard approach to cluster analysis

•  PCA
•  Hessian LLE
•  Isomap
•  Locally-Linear Embedding (LLE)
•  Local Tangent Space Alignment (LTSA)

Topological Result

Topological Result with Labels

Horror Movies

Science Fiction / Fantasy Series

Case study: 20 Newsgroups

20 Newsgroups academic dataset (semi-supervised)

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

•  18,820 documents
•  From 6 to 5000 words each
•  20 newsgroups (classes)

Newsgroups: alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc

Detailed topology (user group overlay)

Baseball cluster

“pitch” > 1.2: this must be baseball

Sample stemmed text from the cluster:

speed game margin realist chip ucdavi edu gari

built villanova huckabai basebal game and shade hour that damn long don plai hour game watch game for that long butt fall asleep and watch

channel surf pitch catch color

Motorcycles cluster

“bike” > 1.114: this must be motorcycles

Sample stemmed text from the cluster:

ride sixteen dai had put test drive honda final saturdai rain fact clear warm and sunni and wind di week ago long cool ride hawk cycl for test ride

had sold and deliv demo fifteen hour arriv and demo vfr bike lock showroom surround bike and

not like move todai even bike us dirt bike us street bike car and big tent full outlandishli fat tour bike trailer squeez park lot sort fat bike convent shelli and dave run msf each time classroom and back lot usual free cookout

distribut severli affect will bike perform such load cling back rest secur shift increas chanc surf

collect wisdom request can afford leather pant boot and jean can make you knee protector

rollerblad us bean and sell
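
The two cluster rules quoted on these slides (“pitch” > 1.2 and “bike” > 1.114) can be read as simple threshold tests on a per-document term weight. The slides do not say what quantity is being thresholded, so the sketch below assumes it is some weight per stemmed term. The `label_document` helper and the example weights are hypothetical and invented for illustration.

```python
# Rules quoted on the slides: (term, threshold, label).
RULES = [
    ("pitch", 1.2, "baseball"),
    ("bike", 1.114, "motorcycles"),
]

def label_document(term_weights, rules=RULES):
    """Return the label of the first rule whose term weight exceeds its threshold."""
    for term, threshold, label in rules:
        if term_weights.get(term, 0.0) > threshold:
            return label
    return None  # no rule fired; the document stays unlabeled

# Hypothetical weights for three stemmed documents (values invented for illustration).
docs = {
    "doc_a": {"pitch": 1.9, "game": 0.7},
    "doc_b": {"bike": 1.3, "ride": 0.9},
    "doc_c": {"pitch": 0.4, "bike": 0.2},
}
labels = {name: label_document(weights) for name, weights in docs.items()}
print(labels)
```

A document that triggers neither rule is left unlabeled, which matches the semi-supervised setting where most documents have no label.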

Result of learning first two groups

Figure labels: Labeled baseball, Unlabeled baseball, Labeled motorcycles, Unlabeled motorcycles, Autos, Pc.hardware, Mac.hardware

Result of learning five groups

Figure labels: Mac.hardware, Baseball, Pc.hardware, Autos, Motorcycles, Sci.med, Politics.misc, Politics.mideast, Hockey

Final result for 2nd layer

Figure labels: Motorcycles, Christian, Atheism, Religion.misc, Politics.guns, Politics.misc, Politics.mideast, Sci.crypt, Sci.med, Hockey, Baseball, Autos, Forsale, Mac.hardware, Electronics, Sci.space, Comp.graphics, Windows.x, Ms-windows.misc, Pc.hardware

Case study: Badoo

A subset of user activity in the United States. Aggregated activity metrics over two weeks in August 2014.

•  88,567 users
•  867 metrics

Data Transformation

Used aggregated representations of user activities per day:

•  Number of likes
•  Number of dislikes
•  Number of matches
•  Profiles visited
•  Photos uploaded
•  Number of messages sent (no content analysed)
•  Number of message replies
•  Interactions with different app features
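
The per-day aggregation described above can be sketched as counting events per user and day and laying the counts out in a fixed metric order. This is an illustrative sketch only. The event log format, the event names and the helper functions are assumptions made here and do not come from the Badoo data.

```python
from collections import Counter, defaultdict

# Hypothetical raw event log: (user_id, day, event_type).
events = [
    ("u1", "2014-08-01", "like"),
    ("u1", "2014-08-01", "like"),
    ("u1", "2014-08-01", "message_sent"),
    ("u1", "2014-08-02", "profile_visit"),
    ("u2", "2014-08-01", "dislike"),
    ("u2", "2014-08-01", "photo_upload"),
]

# Hypothetical metric names mirroring the bullet list on the slide.
METRICS = ["like", "dislike", "match", "profile_visit",
           "photo_upload", "message_sent", "message_reply"]

def aggregate_per_day(events):
    """Count events of each type per (user, day)."""
    counts = defaultdict(Counter)
    for user, day, event_type in events:
        counts[(user, day)][event_type] += 1
    return counts

def to_vector(counter, metrics=METRICS):
    """Fixed-order feature vector; metrics with no events become 0."""
    return [counter[m] for m in metrics]

daily = aggregate_per_day(events)
features = {key: to_vector(c) for key, c in daily.items()}
for key, vec in sorted(features.items()):
    print(key, vec)
```

Only counts enter the feature vectors, consistent with the slide's note that message content is not analysed.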

Messages sent / received

Users with high retention

Users grouped in retention clusters by using deep generative nets:

•  “Pretty boys”: users with a high score who received a lot of likes and messages in the first three days
•  “Dedicated”: users who invested much time in their profiles, were active on the site and received several messages in the first three days
•  “Curious”: users who invested less time in their profiles and sent lots of messages, sometimes being blocked by other users

On-line learning and prediction of user clusters

1. Configure integration
2. Perform segmentation
3. System performs classification
4. Report classification results

•  CSV API
•  JSON API
•  Database connector
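
Steps 3 and 4 above can be illustrated with a toy classifier that assigns new users to existing clusters and reports the result as JSON. This is not the DataRefiner API and not the method used in the case study, which the slides do not describe. It swaps in a plain nearest-centroid rule, and the feature names, centroid values and user IDs are all hypothetical.

```python
import json
import math

# Hypothetical cluster centroids over three made-up features:
# (likes received, messages sent, hours spent on profile). Values are illustrative only.
CENTROIDS = {
    "Pretty boys": (40.0, 5.0, 1.0),
    "Dedicated": (10.0, 8.0, 6.0),
    "Curious": (3.0, 30.0, 0.5),
}

def classify(user_vector, centroids=CENTROIDS):
    """Assign a user to the cluster with the nearest centroid (Euclidean distance)."""
    return min(centroids, key=lambda name: math.dist(user_vector, centroids[name]))

def report(users):
    """Classify each new user and return the results as a JSON string."""
    results = [{"user_id": uid, "cluster": classify(vec)} for uid, vec in users.items()]
    return json.dumps(results, indent=2)

# Hypothetical new users arriving after segmentation has been performed.
new_users = {"u101": (38.0, 4.0, 2.0), "u102": (2.0, 25.0, 1.0)}
print(report(new_users))
```

The segmentation step would supply the centroids in practice; here they are hard-coded so the example stays self-contained.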

Case study: Financial Articles

Understand the main topics of news and scientific articles on economics.

•  17,020 documents

Demo

Links

Topology and Data (Gunnar Carlsson): http://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/S0273-0979-09-01249-X.pdf

Discrete Morse Theory and Persistent Homology (Kevin P. Knudson): http://www.math.fsu.edu/~hironaka/FSUUF/knudson.pdf

Topological Persistence and Simplification (Herbert Edelsbrunner, David Letscher, Afra Zomorodian): http://math.uchicago.edu/~shmuel/AAT-readings/Data%20Analysis%20/PersTop.pdf

Extracting and Composing Robust Features with Denoising Autoencoders (Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol): http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf

info@datarefiner.com www.datarefiner.com
