nlp @ homeaway datadaytexas2016
Post on 13-Apr-2017
NLP @ HomeAway
HomeAway
Key Facts
• 1,000,000+ global vacation rental listings
• 200,000,000+ vacation days / year
• Headquartered in Austin, TX
• ~190 countries, 22 languages
• Almost 2,000 employees worldwide
All those vacations … a lot of text
We’re going to look at Reviews and Property Descriptions
Reviews
• > 10,000,000
Property Descriptions
• > 1,000,000
Communications
• Real time between travelers and suppliers
We’ll look at Reviews and Descriptions
Clustering Reviews
Preparation
• Stopword removal
• Stemming
• Document vectors of tf-idf weighted terms
Cluster
• Cosine distance between doc vectors
… and then color by review rating
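The preparation and distance steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the pipeline used in the talk: the stopword list, the `crude_stem` helper, the smoothed idf formula, and the toy reviews are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "to", "of", "in", "we", "it"}


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf (one common variant, assumed here): log((1 + N) / (1 + df)) + 1
        vectors.append({
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Toy reviews, invented for illustration only.
reviews = [
    "The beach was close and the pool was clean",
    "Clean pool, close to the beach",
    "The oven was dirty and the grill was rusty",
]
vectors = tfidf_vectors(reviews)
d_similar = cosine_distance(vectors[0], vectors[1])
d_different = cosine_distance(vectors[0], vectors[2])
```

The resulting pairwise distances can be handed to any clustering or 2-D layout method; coloring each point by its review rating is then a plotting choice on top.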
What about this review?
That outlier...
The house situation is excellent, close to all facilities, restaurants, groceries, beach, stores, etc. The pool, the patio furniture, the deck, the beach chairs and the towels are very good for bathing and dining outside, The house offers enough space. We were disapointed by the old tv sets; the bathrooms need to be refreshed as well as the cupboard in the kitchen and the laundry room. We were expecting more. We already rented two other houses with HomeAway before of better quality. The other couple also rent something cleaner and nicer for a better price. The cleaning must have been done more metiscusly. The oven was very dirty. We found that kitchen pot and pans were chipped and old. There are many old stuff under the cupboard. The toaster heats properly only on one side. The BBQ grill was rusty; all the protection was gone on half the surface. We had problems twice with the internet. The manager/owner came once to try (without success) to repair the leaking sink. The bath was very slow to drain; a plumber came one morning and waited half an hour for the owner who never showed up, so no repair were done. The small carpets in the bathrooms were old, dirty and disgutting. In the yard, close to the pool, there were old mops, brooms, plastic plants that should all be sent to garbage. It's more a 3.5* than a 4*. There is a real potential for this house but now it seems a bit neglected. If you haven't seen other places, you don't know; the four of us can compare and we were all disapointed this time.
Negative reviews
Collocations
Traveler’s Hierarchy of Needs
• Glass of Wine
• Hustle and Bustle
• Within Walking Distance
• Open Floor Plan
• Labor Day Weekend
• Visitor Recently Left
• Bring Your Own
• Washer and Dryer
• Pots and Pans
Sort of like Maslow’s
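The slides do not say how the collocations were extracted. One common approach, assumed here purely for illustration, is to rank adjacent word pairs by pointwise mutual information (PMI), which rewards pairs that co-occur more often than their individual frequencies would predict. The `bigram_pmi` function and the toy texts are hypothetical.

```python
import math
import re
from collections import Counter


def bigram_pmi(texts, min_count=2):
    """Rank adjacent word pairs by PMI (log base 2), highest first.

    Pairs seen fewer than `min_count` times are skipped, because PMI is
    unreliable for rare events.
    """
    unigrams = Counter()
    bigrams = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy review snippets, invented for illustration only.
texts = [
    "pots and pans were old",
    "the pots and pans were chipped",
    "walking distance to the beach",
    "within walking distance of the pier",
    "the beach and the pier were great",
]
ranked = bigram_pmi(texts)
top_pair = ranked[0][0]
```

Three-word phrases such as those on the slide would need the same idea extended to trigrams, or a second pass that merges high-scoring pairs.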
On to Property Descriptions
We have > 1,000,000 descriptions in many languages
• Fraud Detection
• Competitive Intelligence
COIN data provenance
Breckenridge (Blue is HomeAway)
Breckenridge Zoomed In
How do these two properties relate?
Trick Question! Same Property!
The property descriptions
HomeAway | The Other Guys
Why did we use descriptions?
• Geolocation good for “within 5000 meters”
• Image detection can be slow
• Computer Vision Day is next week…
• Similar descriptions seemed probable
Consistent owner branding, easy to replicate
• Tech team wanted to use natural language processing
• Didn’t know if this would work when we began
How
• Draw Geo Bounding Box
• Filter on metadata
Bedrooms, bathrooms, &c.
• Compare text
• Lather, rinse, repeat
• Select a duplicate, if any
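The steps above can be sketched as a single function. Everything concrete here is an assumption for illustration: the `Listing` fields, the box size in degrees, the 0.8 threshold, and the listing IDs are hypothetical, and the text comparison uses cosine similarity on raw term counts rather than tf-idf weights to keep the example short.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Listing:
    # Hypothetical schema; the real metadata fields are not given in the slides.
    listing_id: str
    lat: float
    lon: float
    bedrooms: int
    bathrooms: float
    description: str


def in_bounding_box(candidate, center, half_lat, half_lon):
    """True if candidate lies inside a lat/lon box centered on `center`."""
    return (abs(candidate.lat - center.lat) <= half_lat
            and abs(candidate.lon - center.lon) <= half_lon)


def same_metadata(a, b):
    """Cheap filter applied before the more expensive text comparison."""
    return a.bedrooms == b.bedrooms and a.bathrooms == b.bathrooms


def cosine_similarity(text_a, text_b):
    """Cosine similarity on raw term counts (tf-idf omitted for brevity)."""
    counts_a = Counter(re.findall(r"[a-z]+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z]+", text_b.lower()))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm = (math.sqrt(sum(c * c for c in counts_a.values()))
            * math.sqrt(sum(c * c for c in counts_b.values())))
    return dot / norm if norm else 0.0


def find_duplicate(ours, candidates, half_lat=0.05, half_lon=0.05, threshold=0.8):
    """Return the most similar nearby candidate at or above threshold, or None."""
    best, best_score = None, threshold
    for other in candidates:
        if not in_bounding_box(other, ours, half_lat, half_lon):
            continue
        if not same_metadata(ours, other):
            continue
        score = cosine_similarity(ours.description, other.description)
        if score >= best_score:
            best, best_score = other, score
    return best


# Toy listings, invented for illustration only.
ours = Listing("ha-1", 39.48, -106.04, 3, 2.0,
               "Cozy ski condo steps from the lift with hot tub")
candidates = [
    Listing("other-1", 39.481, -106.041, 3, 2.0,
            "Cozy ski condo steps from the lift with a hot tub"),
    Listing("other-2", 40.00, -105.00, 3, 2.0,
            "Cozy ski condo steps from the lift with hot tub"),
    Listing("other-3", 39.48, -106.04, 5, 2.0,
            "Cozy ski condo steps from the lift with hot tub"),
    Listing("other-4", 39.482, -106.039, 3, 2.0,
            "Spacious log cabin on the river"),
]
match = find_duplicate(ours, candidates)
```

Ordering the filters from cheapest to most expensive means the text comparison only runs on the handful of candidates that survive the geographic and metadata checks.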
How, continued
Most similar property description
Methodology concerns
TF-IDF vectors, cosine distance work for duplicates and fraud, but
• A little slow
Many vectors, many dimensions
Vocab size ~4500 tokens -> ~4500 dimensions
Millions of vectors
Cluster computing, better math to the rescue! (maybe just a brain?)
• Spark Clusters (Scala)
• Topic Modeling (LDA)
Not sure if it will work for duplication
Cosine, Jensen-Shannon, or Hellinger distances?
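For reference, the three candidate distances can be written directly from their definitions. This sketch assumes each document is represented as a topic probability vector (non-negative entries summing to 1), as an LDA model would produce; the example vectors are made up.

```python
import math


def cosine_distance(p, q):
    """1 - cosine similarity; ignores that p and q are probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2), in [0, 1]."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


# Made-up topic mixtures over three topics, for illustration only.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

js_ab = jensen_shannon_distance(doc_a, doc_b)
js_ac = jensen_shannon_distance(doc_a, doc_c)
```

Hellinger and Jensen-Shannon treat the vectors as probability distributions and are bounded and symmetric, whereas cosine distance only looks at the angle between them.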
Topic Modeling, quickly
In natural language processing, Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
(Wikipedia)
“Pets”: Cat, Dog, Fish, Turtle, Hamster
“Demonic Invasion”: Cat, Dog, Mass, Hysteria, Sleeping, Together
“Weather”: Cat, Dog, Cold, Rain, Hot, Temperature
LDA Current results
Finding number of topics
220 topics, ~600K (en_US) descriptions
(curvature)
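The slide does not say which curve was examined or how its curvature was measured. As one possible reading, the sketch below picks the "elbow" of a decreasing model-fit score evaluated at evenly spaced topic counts, using the discrete second difference as a stand-in for curvature. The function name, the score values, and the topic counts are all hypothetical.

```python
def elbow_by_second_difference(topic_counts, scores):
    """Return the topic count where the curve bends most sharply.

    Assumes `topic_counts` are evenly spaced and `scores` is a
    lower-is-better fit measure that flattens out as topics are added.
    """
    if len(topic_counts) != len(scores) or len(scores) < 3:
        raise ValueError("need at least three (topic_count, score) pairs")
    best_index, best_bend = None, float("-inf")
    for i in range(1, len(scores) - 1):
        # Discrete second difference: large and positive at a convex elbow.
        bend = scores[i - 1] - 2.0 * scores[i] + scores[i + 1]
        if bend > best_bend:
            best_index, best_bend = i, bend
    return topic_counts[best_index]


# Made-up values for illustration; these are not results from the talk.
topic_counts = [10, 20, 30, 40, 50, 60]
scores = [100.0, 60.0, 35.0, 30.0, 27.0, 25.0]
chosen = elbow_by_second_difference(topic_counts, scores)
```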
LDA, continued
LDA Future
• Duplicates?
• Fraud Detection?
• Property topics in the “Vacation Rental” Space?
- Marketing, SEO, UX
TOPIC 5
Neighborhood
Lovely
Backyard
Quiet
Residential
TOPIC 17
Beach
Pier
Boardwalk
Isle
Crescent
Logos
LingPipe
CoreNLP
Questions?
Brent Schneeman
Director, Data Science
HomeAway, Inc.
brent@homeaway.com
careers.homeaway.com
@schnee