NLP @ HomeAway datadaytexas2016

Upload: brent-schneeman

Post on 13-Apr-2017

TRANSCRIPT

Page 1: NLP @ HomeAway datadaytexas2016

NLP @ HomeAway

Page 2: NLP @ HomeAway datadaytexas2016

HomeAway

Key Facts

• 1,000,000+ global vacation rental listings
• 200,000,000+ vacation days / year
• Headquartered in Austin, TX
• ~190 countries, 22 languages
• Almost 2,000 employees worldwide

Page 3: NLP @ HomeAway datadaytexas2016
Page 4: NLP @ HomeAway datadaytexas2016

All those vacations … a lot of text

Reviews
• > 10,000,000

Property Descriptions
• > 1,000,000

Communications
• Real time between travelers and suppliers

We’ll look at Reviews and Descriptions

Page 5: NLP @ HomeAway datadaytexas2016

Clustering Reviews

Preparation
• Stopword removal
• Stemming
• Document vectors of tf-idf weighted terms

Cluster
• Cosine distance between doc vectors

… and then color by review rating
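The preparation and distance steps on this slide can be sketched roughly as follows. This is a minimal, standard-library illustration and not the code behind the talk: the stopword list, the `crude_stem` helper, the smoothed idf formula, and the sample reviews are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "to", "of", "very", "we"}


def crude_stem(token):
    """Very rough suffix stripping, standing in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, split on letters, drop stopwords, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # term frequency times a smoothed inverse document frequency
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Made-up reviews, purely for illustration.
reviews = [
    "The pool was clean and the beach was close.",
    "Clean pool, close to the beach.",
    "The oven was dirty and the grill was rusty.",
]
vectors = tfidf_vectors(reviews)
d_similar = cosine_distance(vectors[0], vectors[1])    # near 0: same vocabulary
d_different = cosine_distance(vectors[0], vectors[2])  # 1.0: no shared terms
```

A clustering step would then group documents whose pairwise distances are small; the rating is only used afterwards to color the points.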

Page 6: NLP @ HomeAway datadaytexas2016

What about this review?

Page 7: NLP @ HomeAway datadaytexas2016

That outlier...

The house situation is excellent, close to all facilities, restaurants, groceries, beach, stores, etc. The pool, the patio furniture, the deck, the beach chairs and the towels are very good for bathing and dining outside, The house offers enough space. We were disapointed by the old tv sets; the bathrooms need to be refreshed as well as the cupboard in the kitchen and the laundry room. We were expecting more. We already rented two other houses with HomeAway before of better quality. The other couple also rent something cleaner and nicer for a better price. The cleaning must have been done more metiscusly. The oven was very dirty. We found that kitchen pot and pans were chipped and old. There are many old stuff under the cupboard. The toaster heats properly only on one side. The BBQ grill was rusty; all the protection was gone on half the surface. We had problems twice with the internet. The manager/owner came once to try (without success) to repair the leaking sink. The bath was very slow to drain; a plumber came one morning and waited half an hour for the owner who never showed up, so no repair were done. The small carpets in the bathrooms were old, dirty and disgutting. In the yard, close to the pool, there were old mops, brooms, plastic plants that should all be sent to garbage. It's more a 3.5* than a 4*. There is a real potential for this house but now it seems a bit neglected. If you haven't seen other places, you don't know; the four of us can compare and we were all disapointed this time.

Page 8: NLP @ HomeAway datadaytexas2016

Negative reviews

Collocations
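One common way to surface collocations is to score adjacent word pairs by pointwise mutual information (PMI). The slide does not say which scoring method was used, so the sketch below is an assumption, and the `bigram_pmi` helper and the sample reviews are invented for illustration.

```python
import math
import re
from collections import Counter


def bigram_pmi(texts, min_count=2):
    """Rank adjacent word pairs by PMI, ignoring pairs seen fewer than min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI: how much more often the pair occurs than chance would predict
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up negative reviews, purely for illustration.
negative_reviews = [
    "the hot tub was broken and the carpet was dirty",
    "hot tub did not work and the oven was dirty",
    "the hot tub was cold and the place was dirty",
]
ranked = bigram_pmi(negative_reviews)
top_pair = ranked[0][0]  # the highest-scoring word pair
```

The `min_count` cutoff matters because PMI overrates pairs that occur only once; on a real review corpus it would be set much higher than 2.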

Page 9: NLP @ HomeAway datadaytexas2016

Traveler’s Hierarchy of Needs

Glass of Wine
Hustle and Bustle
Within Walking Distance
Open Floor Plan
Labor Day Weekend
Visitor Recently Left
Bring Your Own
Washer and Dryer
Pots and Pans

Sort of like Maslow’s

Page 10: NLP @ HomeAway datadaytexas2016

On to Property Descriptions

We have > 1,000,000 descriptions in many languages

• Fraud Detection

• Competitive Intelligence

Page 11: NLP @ HomeAway datadaytexas2016

COIN data provenance

Page 12: NLP @ HomeAway datadaytexas2016

Breckenridge (Blue is HomeAway)

Page 13: NLP @ HomeAway datadaytexas2016

Breckenridge Zoomed In

How do these two properties relate?

Page 14: NLP @ HomeAway datadaytexas2016

Trick Question! Same Property!

Page 15: NLP @ HomeAway datadaytexas2016

The property descriptions

HomeAway | The Other Guys

Page 16: NLP @ HomeAway datadaytexas2016

Why did we use descriptions?

• Geolocation good for “within 5000 meters”
• Image detection can be slow
• Computer Vision Day is next week…
• Similar descriptions seemed probable
  Consistent owner branding, easy to replicate
• Tech team wanted to use natural language processing
• Didn’t know if this would work when we began

Page 17: NLP @ HomeAway datadaytexas2016

How

• Draw Geo Bounding Box
• Filter on metadata
  Bedrooms, bathrooms, &c.
• Compare text
• Lather, rinse, repeat
• Select a duplicate, if any
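The steps on this slide could be wired together along the following lines. Everything here is a hypothetical sketch: the `Listing` fields, the box size in degrees, the 0.8 threshold, and the sample listings are assumptions, and the text comparison uses plain term counts where the talk describes tf-idf vectors.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Listing:
    listing_id: str
    lat: float
    lon: float
    bedrooms: int
    bathrooms: int
    description: str


def in_bounding_box(target, other, half_side_deg=0.05):
    """Step 1: keep only candidates inside a box drawn around the target."""
    return (abs(target.lat - other.lat) <= half_side_deg
            and abs(target.lon - other.lon) <= half_side_deg)


def same_metadata(target, other):
    """Step 2: filter on metadata such as bedrooms and bathrooms."""
    return (target.bedrooms == other.bedrooms
            and target.bathrooms == other.bathrooms)


def text_similarity(a, b):
    """Step 3: cosine similarity over raw term counts (tf-idf omitted for brevity)."""
    ca = Counter(re.findall(r"[a-z]+", a.lower()))
    cb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(n * cb[t] for t, n in ca.items())
    norm = (math.sqrt(sum(n * n for n in ca.values()))
            * math.sqrt(sum(n * n for n in cb.values())))
    return dot / norm if norm else 0.0


def find_duplicate(target, candidates, threshold=0.8):
    """Steps 4 and 5: repeat over candidates, select the best match if any."""
    best, best_score = None, threshold
    for other in candidates:
        if not in_bounding_box(target, other) or not same_metadata(target, other):
            continue
        score = text_similarity(target.description, other.description)
        if score >= best_score:
            best, best_score = other, score
    return best


# Made-up listings, purely for illustration.
ours = Listing("ha-1", 39.4817, -106.0384, 3, 2,
               "Cozy ski-in ski-out condo with hot tub near Main Street")
theirs = [
    Listing("og-7", 39.4820, -106.0380, 3, 2,
            "Cozy ski-in ski-out condo with hot tub, near Main Street!"),
    Listing("og-8", 39.4825, -106.0390, 3, 2,
            "Spacious cabin with mountain views and a large deck"),
    Listing("og-9", 40.0150, -105.2705, 3, 2,
            "Cozy ski-in ski-out condo with hot tub near Main Street"),
]
match = find_duplicate(ours, theirs)
```

Running the cheap geographic and metadata filters first keeps the number of text comparisons small, which is the expensive part.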

Page 18: NLP @ HomeAway datadaytexas2016

How, continued

Most similar property description

Page 19: NLP @ HomeAway datadaytexas2016

Methodology concerns

TF-IDF vectors, cosine distance work for duplicates and fraud, but

A little slow
Many vectors, many dimensions
Vocab size ~4500 tokens -> ~4500 dimensions
Millions of vectors

Page 20: NLP @ HomeAway datadaytexas2016

Cluster computing, better math to the rescue! (maybe just a brain?)

Spark Clusters (Scala)
Topic Modeling (LDA)
Not sure if it will work for duplication
Cosine, Jensen-Shannon, or Hellinger distances?
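For reference, the two less familiar distances named here can be computed between topic distributions as in the sketch below. It assumes each document is represented as a list of topic probabilities that sum to 1; the function names and the sample mixtures are illustrative.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    ) / math.sqrt(2)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2), in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))


# Made-up topic mixtures for three descriptions.
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.65, 0.25, 0.10]
doc_c = [0.05, 0.05, 0.90]

near = jensen_shannon_distance(doc_a, doc_b)
far = jensen_shannon_distance(doc_a, doc_c)
```

Both are symmetric and bounded, and unlike cosine distance they treat the vectors as probability distributions, which is why they are natural candidates for comparing LDA output.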

Page 21: NLP @ HomeAway datadaytexas2016

Topic Modeling, quickly

In natural language processing, Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

(Wikipedia)

“Pets”: Cat, Dog, Fish, Turtle, Hamster

“Demonic Invasion”: Cat, Dog, Mass, Hysteria, Sleeping, Together

“Weather”: Cat, Dog, Cold, Rain, Hot, Temperature
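To make "generative model" concrete, here is a toy sampler that runs the LDA story forwards using the three topics from this slide: draw a topic mixture for the document from a Dirichlet, then for each word pick a topic from that mixture and a word from that topic. Treating every topic as uniform over its listed words, and the values of `alpha` and `seed`, are assumptions for illustration. Fitting LDA means inferring the topics and mixtures from text, which this sketch does not do.

```python
import random

# Topics from the slide; each is assumed uniform over its words for simplicity.
TOPICS = {
    "Pets": ["cat", "dog", "fish", "turtle", "hamster"],
    "Demonic Invasion": ["cat", "dog", "mass", "hysteria", "sleeping", "together"],
    "Weather": ["cat", "dog", "cold", "rain", "hot", "temperature"],
}


def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet by normalising independent gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha=0.5, seed=None):
    """Generate one document: a per-document topic mixture plus its words."""
    rng = random.Random(seed)
    names = list(TOPICS)
    theta = sample_dirichlet(alpha, len(names), rng)
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=theta)[0]  # pick a topic for this word
        words.append(rng.choice(TOPICS[topic]))       # pick a word from that topic
    return dict(zip(names, theta)), words


mixture, words = generate_document(20, seed=1)
```

Note that "cat" and "dog" appear in all three topics, so a single word rarely identifies a topic; it is the co-occurrence pattern across a document that does.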

Page 22: NLP @ HomeAway datadaytexas2016

LDA Current results

Finding number of topics

220 topics, ~600K (en_US) descriptions (curvature)
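The slide only says "(curvature)", so how the topic count was actually chosen is not stated. One plausible reading is an elbow search: fit models at several topic counts, record a quality score for each, and pick the count where the curve bends most sharply. The sketch below assumes evenly spaced topic counts; the `max_curvature_point` helper and the numbers are invented.

```python
def max_curvature_point(topic_counts, scores):
    """Return the topic count where the discrete second difference is largest.

    Assumes topic_counts are evenly spaced, so the second difference
    is proportional to the curvature of the score curve.
    """
    if len(topic_counts) != len(scores) or len(scores) < 3:
        raise ValueError("need at least three (topic count, score) pairs")
    best_k, best_curvature = None, -1.0
    for i in range(1, len(scores) - 1):
        curvature = abs(scores[i - 1] - 2 * scores[i] + scores[i + 1])
        if curvature > best_curvature:
            best_k, best_curvature = topic_counts[i], curvature
    return best_k


# Made-up scores (lower is better), flattening out after 150 topics.
candidate_counts = [50, 100, 150, 200, 250]
made_up_scores = [1000.0, 800.0, 650.0, 640.0, 635.0]
elbow = max_curvature_point(candidate_counts, made_up_scores)
```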

Page 23: NLP @ HomeAway datadaytexas2016

LDA, continued

Page 24: NLP @ HomeAway datadaytexas2016

LDA Future

• Duplicates?
• Fraud Detection?
• Property topics in the “Vacation Rental” Space?
  - Marketing, SEO, UX

TOPIC 5: Neighborhood, Lovely, Backyard, Quiet, Residential

TOPIC 17: Beach, Pier, Boardwalk, Isle, Crescent

Page 25: NLP @ HomeAway datadaytexas2016

Logos

LingPipe

CoreNLP

Page 26: NLP @ HomeAway datadaytexas2016

Questions?

Brent Schneeman
Director, Data Science
HomeAway, [email protected]
@schnee