NLP @ HomeAway (Data Day Texas 2016)

Post on 13-Apr-2017

NLP @ HomeAway

HomeAway: Key Facts

• 1,000,000+ global vacation rental listings
• 200,000,000+ vacation days / year
• Headquartered in Austin, TX
• ~190 countries, 22 languages
• Almost 2,000 employees worldwide

All those vacations … a lot of text

We’re going to look at Reviews and Property Descriptions

Reviews
• > 10,000,000

Property Descriptions
• > 1,000,000

Communications
• Real time between travelers and suppliers

Clustering Reviews

Preparation
• Stopword removal
• Stemming
• Document vectors of tf-idf weighted terms

Cluster
• Cosine distance between doc vectors

… and then color by review rating
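
The preparation and distance steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the slides: the stopword list, the `crude_stem` suffix stripper (standing in for a real stemmer), the smoothed idf formula, and the three sample reviews are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "to", "of", "in", "we", "very"}


def crude_stem(token):
    """Very rough suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            # term frequency times a smoothed inverse document frequency
            t: (c / len(toks)) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, c in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Made-up sample reviews, for illustration only.
reviews = [
    "The pool was clean and the beach was close.",
    "Clean pool, close to the beach and restaurants.",
    "The oven was dirty and the sink was leaking.",
]
vectors = tfidf_vectors(reviews)
d_similar = cosine_distance(vectors[0], vectors[1])
d_different = cosine_distance(vectors[0], vectors[2])
```

A clustering step would then group reviews whose pairwise distances are small; here the two pool-and-beach reviews end up much closer to each other than either is to the complaint about the oven.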

What about this review?

That outlier...

The house situation is excellent, close to all facilities, restaurants, groceries, beach, stores, etc. The pool, the patio furniture, the deck, the beach chairs and the towels are very good for bathing and dining outside, The house offers enough space. We were disapointed by the old tv sets; the bathrooms need to be refreshed as well as the cupboard in the kitchen and the laundry room. We were expecting more. We already rented two other houses with HomeAway before of better quality. The other couple also rent something cleaner and nicer for a better price. The cleaning must have been done more metiscusly. The oven was very dirty. We found that kitchen pot and pans were chipped and old. There are many old stuff under the cupboard. The toaster heats properly only on one side. The BBQ grill was rusty; all the protection was gone on half the surface. We had problems twice with the internet. The manager/owner came once to try (without success) to repair the leaking sink. The bath was very slow to drain; a plumber came one morning and waited half an hour for the owner who never showed up, so no repair were done. The small carpets in the bathrooms were old, dirty and disgutting. In the yard, close to the pool, there were old mops, brooms, plastic plants that should all be sent to garbage. It's more a 3.5* than a 4*. There is a real potential for this house but now it seems a bit neglected. If you haven't seen other places, you don't know; the four of us can compare and we were all disapointed this time.

Negative reviews

Collocations

Traveler’s Hierarchy of Needs

• Glass of Wine
• Hustle and Bustle
• Within Walking Distance
• Open Floor Plan
• Labor Day Weekend
• Visitor Recently Left
• Bring Your Own
• Washer and Dryer
• Pots and Pans

Sort of like Maslow’s
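
The slides do not say how the collocations were scored. One common approach is pointwise mutual information (PMI) over adjacent word pairs, sketched below as an assumption rather than as the method used in the talk. The `bigram_pmi` helper, the `min_count` cutoff, and the sample sentences are all hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(token_lists, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (log base 2)."""
    unigrams = Counter()
    bigrams = Counter()
    for toks in token_lists:
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up review snippets, for illustration only.
snippets = [
    "great house within walking distance of the beach",
    "the house is within walking distance of great restaurants",
    "the beach house was great and the restaurants were great",
]
ranked = bigram_pmi([s.split() for s in snippets])
pmi = dict(ranked)
```

Pairs such as ("within", "walking") score higher than ("the", "beach") because "the" appears in many other contexts. Longer phrases like "Within Walking Distance" could be found by extending the same counting to three-word windows.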

On to Property Descriptions

We have > 1,000,000 descriptions in many languages

• Fraud Detection

• Competitive Intelligence

COIN data provenance

Breckenridge (Blue is HomeAway)

Breckenridge Zoomed In

How do these two properties relate?

Trick Question! Same Property!

The property descriptions

HomeAway | The Other Guys

Why did we use descriptions?

• Geolocation good for “within 5000 meters”
• Image detection can be slow
• Computer Vision Day is next week…
• Similar descriptions seemed probable: consistent owner branding, easy to replicate
• Tech team wanted to use natural language processing
• Didn’t know if this would work when we began

How

• Draw Geo Bounding Box
• Filter on metadata (bedrooms, bathrooms, &c.)
• Compare text
• Lather, rinse, repeat
• Select a duplicate, if any
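
The steps above can be sketched as a small filter-then-compare loop. Everything here is a hypothetical illustration: the `Listing` fields, the `find_duplicate` helper, the 0.8 similarity threshold, the coordinates, and the sample descriptions are assumptions, and plain term-count cosine similarity stands in for the TF-IDF weighting the talk describes.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Listing:
    listing_id: str
    lat: float
    lon: float
    bedrooms: int
    bathrooms: int
    description: str


def bounding_box(lat, lon, half_side_m=5000.0):
    """Approximate lat/lon box extending half_side_m metres in each direction."""
    dlat = half_side_m / 111_320.0  # rough metres per degree of latitude
    dlon = half_side_m / (111_320.0 * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def term_counts(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def find_duplicate(ours, candidates, min_similarity=0.8):
    """Return the most similar candidate that passes every filter, or None."""
    lat_lo, lat_hi, lon_lo, lon_hi = bounding_box(ours.lat, ours.lon)
    ours_vec = term_counts(ours.description)
    best, best_score = None, min_similarity
    for other in candidates:
        # Step 1: geo bounding box
        if not (lat_lo <= other.lat <= lat_hi and lon_lo <= other.lon <= lon_hi):
            continue
        # Step 2: metadata must match
        if (other.bedrooms, other.bathrooms) != (ours.bedrooms, ours.bathrooms):
            continue
        # Step 3: compare text, keeping the best score so far
        score = cosine_similarity(ours_vec, term_counts(other.description))
        if score >= best_score:
            best, best_score = other, score
    return best


# Made-up listings, for illustration only.
text = "Cozy ski-in ski-out condo near Main Street with hot tub and mountain views"
ours = Listing("ha-1", 39.4817, -106.0384, 2, 2, text)
others = [
    Listing("b1", 39.4827, -106.0390, 2, 2,
            "Cozy ski-in/ski-out condo near Main Street, hot tub and mountain views!"),
    Listing("b2", 39.4827, -106.0390, 3, 2, text),  # different bedroom count
    Listing("b3", 40.0000, -106.0390, 2, 2, text),  # outside the bounding box
    Listing("b4", 39.4820, -106.0380, 2, 2,
            "Spacious family cabin with large deck, fire pit and a game room"),
]
match = find_duplicate(ours, others)
no_match = find_duplicate(ours, [others[3]])
```

Running the cheap geographic and metadata filters first keeps the number of text comparisons small, which matters when the loop is repeated for every listing.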

How, continued

Most similar property description

Methodology concerns

TF-IDF vectors, cosine distance work for duplicates and fraud, but

A little slow
• Many vectors, many dimensions
• Vocab size ~4500 tokens -> ~4500 dimensions
• Millions of vectors

Cluster computing, better math to the rescue! (maybe just a brain?)
• Spark Clusters (Scala)
• Topic Modeling (LDA)
• Not sure if it will work for duplication
• Cosine, Jensen-Shannon, or Hellinger distances?
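
For reference, the three candidate distances can be written out for a pair of topic distributions (non-negative vectors that sum to 1). This is a minimal sketch using the standard definitions; the two example distributions are made up, and nothing here indicates which distance the team settled on.

```python
import math


def cosine_distance(p, q):
    """1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def hellinger_distance(p, q):
    """Hellinger distance, bounded in [0, 1] for probability vectors."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2.0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2), in [0, 1]."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


# Hypothetical topic mixtures for two property descriptions.
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.65, 0.25, 0.10]
distances = {
    "cosine": cosine_distance(doc_a, doc_b),
    "hellinger": hellinger_distance(doc_a, doc_b),
    "jensen_shannon": jensen_shannon_distance(doc_a, doc_b),
}
```

Hellinger and Jensen-Shannon are defined on probability distributions, which makes them a natural fit for LDA output, while cosine distance treats the topic mixture as an ordinary vector.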

Topic Modeling, quickly

In natural language processing, Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

(Wikipedia)

“Pets”: Cat, Dog, Fish, Turtle, Hamster

“Demonic Invasion”: Cat, Dog, Mass, Hysteria, Sleeping, Together

“Weather”: Cat, Dog, Cold, Rain, Hot, Temperature
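
To make the idea concrete, here is a toy collapsed Gibbs sampler for LDA in plain Python. It is a teaching sketch only: the talk used Spark, not this code, and the `lda_gibbs` function, its hyperparameters (`alpha`, `beta`, `n_iter`), and the tiny documents built from the cat-and-dog word groups above are all assumptions. Results on a corpus this small depend on the random seed.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0, n_top=3):
    """Toy collapsed Gibbs sampler; returns (doc-topic mixtures, top words per topic)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]           # tokens per (doc, topic)
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [vocab[i] for i in sorted(range(n_vocab), key=lambda i: -topic_word[k][i])[:n_top]]
        for k in range(n_topics)
    ]
    return theta, top_words


# Toy documents built from the slide's word groups, for illustration only.
docs = [
    "cat dog fish turtle hamster".split(),
    "dog cat hamster fish".split(),
    "cat dog mass hysteria sleeping together".split(),
    "mass hysteria dog cat together".split(),
    "cat dog cold rain hot temperature".split(),
    "rain cold temperature hot dog".split(),
]
theta, top_words = lda_gibbs(docs, n_topics=3)
```

Each row of `theta` is one document's mixture over the three topics, and `top_words` lists the most frequent words assigned to each topic. Note that "cat" and "dog" occur in every document, so they carry little information about which topic a document belongs to.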

LDA Current results

Finding number of topics

220 topics, ~600K (en_US) descriptions (curvature)

LDA, continued

LDA Future

• Duplicates?
• Fraud Detection?
• Property topics in the “Vacation Rental” space?
  - Marketing, SEO, UX

TOPIC 5: Neighborhood, Lovely, Backyard, Quiet, Residential

TOPIC 17: Beach, Pier, Boardwalk, Isle, Crescent

Logos

LingPipe

CoreNLP

Questions?

Brent Schneeman
Director, Data Science
HomeAway, Inc.
brent@homeaway.com
careers.homeaway.com
@schnee
