understanding feature space in machine learning - data science pop-up seattle

31
#datapopupseattle Understanding Feature Space in Machine Learning Alice Zheng Director of Data Science, Dato RainyData Datoinc

Upload: domino-data-lab

Post on 21-Apr-2017

5.457 views

Category:

Data & Analytics


5 download

TRANSCRIPT

Page 1: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

#datapopupseattle

Understanding Feature Space in Machine Learning

Alice ZhengDirector of Data Science, Dato

RainyData Datoinc

Page 2: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

#datapopupseattle

UNSTRUCTUREDData Science POP-UP in Seattle

www.dominodatalab.com

DProduced by Domino Data Lab

Domino’s enterprise data science platform is used by leading analytical organizations to increase productivity, enable collaboration, and publish

models into production faster.

Page 3: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

Understanding Feature Space in Machine Learning

Alice Zheng, DatoOctober, 2015

3

Page 4: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

My journey so far

Applied+machine+learning+

(Data+science)

Build+ML+tools

Shortage+of+experts+

and+good+tools.

Page 5: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Why machine learning?

Model data.Make predictions.Build intelligent

applications.

Page 6: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

The machine learning pipeline

I+fell+in+love+the+instant+I+laid+

my+eyes+on+that+puppy.+His+

big+eyes+and+playful+tail,+his+

soft+furry+paws,+…

Raw+data

FeaturesModels

Predictions

Deploy+in+

production

Page 7: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Three things to know about ML• Feature = numeric representation of raw data• Model = mathematical “summary” of features• Making something that works = choose the right model

and features, given data and task

Page 8: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

Feature = Numeric representation of raw data

Page 9: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Representing natural text

It#is#a#puppy#and#it#is#extremely#cute.

What’s+important?+

Phrases?+Specific+

words?+Ordering?+

Subject,+object,+verb?

Classify:++

puppy+or+not?Raw+Text

{“it”:2,++

+“is”:2,++

+“a”:1,++

+“puppy”:1,++

+“and”:1,+

+“extremely”:1,+

+“cute”:1+}

Bag+of+Words

Page 10: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Representing natural text

It#is#a#puppy#and#it#is#extremely#cute.

Classify:++

puppy+or+not?Raw+Text Bag+of+Words

it 2

they 0

I 0

am 0

how 0

puppy 1

and 1

cat 0

aardvark 0

cute 1

extremely 1

… …

Sparse+vector+

representation

Page 11: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Representing images

Image+source:+“Recognizing+and+learning+object+categories,”++

Li+Fei[Fei,+Rob+Fergus,+Anthony+Torralba,+ICCV+2005—2009.

Raw+image:++

millions+of+RGB+triplets,+

one+for+each+pixel

Classify:++

person+or+animal?Raw+Image Bag+of+Visual+Words

Page 12: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Representing imagesClassify:++

person+or+animal?Raw+Image Deep+learning+features

3.29+

[15+

[5.24+

48.3+

1.36+

47.1+

[1.92

36.5+

2.83+

95.4+

[19+

[89+

5.09+

37.8

Dense+vector+

representation

Page 13: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Feature space in machine learning• Raw data ! high dimensional vectors• Collection of data points ! point cloud in feature space• Feature engineering = creating features of the appropriate

granularity for the task

Page 14: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

Visualizing Feature Space

Page 15: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

Crudely speaking, mathematicians fall into two categories: the algebraists, who find it easiest to reduce all problems to sets of numbers and variables, and the geometers, who understand the world through shapes. -- Masha Gessen, “Perfect Rigor”

Page 16: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Visualizing bag-of-words

puppy

cute

1

1

I+have+a+puppy+and+

it+is+extremely+cute

I#have#a#puppy#and#it#is#extremely#cute

it 1

they 0

I 1

am 0

how 0

puppy 1

and 1

cat 0

aardvark 0

zebra 0

cute 1

extremely 1

… …

Page 17: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Visualizing bag-of-words

puppy

cute

1

1

1

extremely

I+have+a+puppy+and++

it+is+extremely+cute

I+have+an+extremely++

cute+cat

I+have+a+cute++

puppy

Page 18: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Document point cloudword+1

word+2

Page 19: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

Model = Mathematical “summary” of features

Page 20: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

What is a summary?• Data ! point cloud in feature space• Model = a geometric shape that best “fits” the point cloud

Page 21: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Classification modelFeature+2

Feature+1

Decide+between+two+classes

Page 22: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Clustering modelFeature+2

Feature+1

Group+data+points+tightly

Page 23: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Regression modelTarget

Feature

Fit+the+target+values

Page 24: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

Visualizing Feature Engineering

Page 25: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

When does bag-of-words fail?

puppy

cat

2

1

1

have

I+have+a+puppy

I+have+a+cat

I+have+a+kitten

Task:+find+a+surface+that+separates++

documents+about+dogs+vs.+cats

Problem:+the+word+“have”+adds+fluff++

instead+of+information

I+have+a+dog+

and+I+have+a+pen

1

Page 26: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

Improving on bag-of-words• Idea: “normalize” word counts so that popular words are

discounted• Term frequency (tf) = Number of times a terms appears in a

document• Inverse document frequency of word (idf) =

• N = total number of documents• Tf-idf count = tf x idf

Page 27: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

From BOW to tf-idf

puppy

cat

2

1

1

have

I+have+a+puppy

I+have+a+cat

I+have+a+kitten

idf(puppy)+=+log+4+

idf(cat)+=+log+4+

idf(have)+=+log+1+=+0

I+have+a+dog+

and+I+have+a+pen

1

Page 28: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

From BOW to tf-idf

puppy

cat1

have

tfidf(puppy)+=+log+4+

tfidf(cat)+=+log+4+

tfidf(have)+=+0

I+have+a+dog+

and+I+have+a+pen,+

I+have+a+kitten

1

log+4

log+4

I+have+a+cat

I+have+a+puppy

Decision+surface

Tf[idf+flattens+

uninformative+

dimensions+in+the+

BOW+point+cloud

Page 29: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

‹#›

That’s not all, folks!• Geometry is the key to understanding feature space and

machine learning• Many other fun topics:- Feature normalization- Feature transformations-Model regularization

• Dato is hiring! [email protected]@RainyData,+@DatoInc

Page 30: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

#datapopupseattle

@datapopup #datapopupseattle

Page 31: Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle

#datapopupseattle

Thank You To Our Sponsors