data science @ instacart

Data Science @ Instacart

Sharath RaoData Scientist / ManagerSearch and Discovery

Collaborators: Angadh Singh and Shishir Prasad

v

The Instacart Value Proposition

Groceries from stores you love

deliveredto your

doorstep

in as little as an hour

+ + + =

v

Customer Experience

Select a Store

Shop for Groceries

Checkout Select Delivery Time

Delivered to Doorstep

v

Shopper Experience

Accept Order Find the Groceries

Out for Delivery

Delivered to DoorstepScan Barcode

v

Four Sided Marketplace

Customers Shoppers

Products(Advertisers)

Search

Advertising

Shopping

Delivery

Customer Service

Inventory

Picking

Loyalty

Stores(Retailers)

v

Two topics today

A Recommendation System for Discovery

Using Data Science for out of stock mitigation

v

Online grocery vs Traditional e-commerceWeek 3Week 2

Online Grocery

Week 1

Traditional e-commerce

v

Grocery Shopping in “Low Dimensional Space”

Search

Restock

Explore

+

+

=

v

Why personalization at Instacart

Your storeEverybody’s store

v

Repeat purchases increase LTV of recommendations

$5.49

$549

Today A year later

1 +….+ 100

$549

$549

vDifferent recommendation systems address different needs

v

Personalized Top N recommendations

Promote broad-based discovery in a dynamic catalog

Including from stores customers may have never shopped

v

Run out of X?

Rank products by repurchase probability

v

Personalized recommendations of new products

when customers seek out what is new out there

Also addresses product cold start problems

v

Replacement Product Recommendations

Mitigate adverse impact of last-minute out of stocks

v

“Frequently bought with” Recommendations

Not necessarily consumed together

Help customers shop for complementary products

and try alternatives

Probablyconsumed together

vPersonalized Top N Recommendations

v

Learning from feedback

Traditionally collaborative filtering used explicit feedback to predict ratings

There may still bias in whether the user chooses to rate

Explicit Feedback Implicit Feedback

v

Learning from Explicit Feedback

• Explicit feedback may be more reliable but there is much less of it

• Less reliable if users rate based on aspirations instead of true preferences

vs

v

Implicit Feedback - trade-off quality and quantity

Stre

ngth

of e

vide

nce

Number of Events

v

Architecture

Event Data Score and Select Top N (Spark/EMR)

User/Product Factors

Event Data

Run-time ranking for diversity

Candidate Selection

ALS (Spark/EMR)

Generate User-Product

Matrix

v

A Matrix Factorization Formulation for Implicit Feedback

N Products

M U

sers

1

-

- 9

-

- -

3

20

User Product Matrix R; (M x N)

1

0

0 1

0

0 0

1

1binary preferences

Preference Matrix R; (M x N)

“Collaborative Filtering for Implicit Feedback” - Hu et. al

v

A Matrix Factorization Formulation for Implicit Feedback

~Y

XT

Product Factors (k x N)

User Factors (M x k)

1

0

0 1

0

0 0

1

1x

Preference Matrix R; (M x N)

v

Matrix Factorization from Implicit Feedback - The Intuition

#Purchases Preference p Confidence c

0 0 Low

1 1 Low

>>1 1 High

• Confidence increases linearly with purchases r • c = 1 + alpha * r

• alpha controls the marginal rate of learning from user purchases

• Key questions• How should the unobserved events be treated• How should one trade-off observed and the unobserved

v

Regularized Weighted Squared Loss

Confidence

User Factors Matrix

Product Factors Matrix

Preference Matrix Regularization

Solve using Alternating Least Squares

v

Architecture


Matrix

ALS (Spark/EMR)

Score and Select Top N (Spark/EMR)



Candidate Selection

Event Data

Event Data

v

Spark ALS Hyper-parameter Tuning

• rank k - diminishing returns after 150

• alpha - controls rate of learning from observed events

• iterations - ALS tends to converge within 5, seldom more than 10

• lambda - regularization parameter

v

Architecture


Matrix

ALS Matrix Factorization (Spark/EMR)

Candidate Selection




Event Data

Event Data

v

Scoring user and products

With millions of products and users, scoring every (user, product) pair is prohibitive

Two goals in selecting products to score

• Long tail which have not been discovered

• Products that have an a priori high purchase rate (popular)

~

v

Trade-off popularity and discovery in the tail

We start with simple stratified sampling

For each user, score N products

Sample h products from Head

Sample t products from tail

N ~ 10000 h ~ 3000 t ~7000

v

Tuning Spark For ALS

Understanding Spark execution model and its implementation of ALS helps

• Training is communication heavy1, set partitions <= #CPU cores

• Scoring is memory intensive

• Broad guidelines2 • Limit executor memory to 64GB • 5 cores per executor • Set executors based on data size

1 - http://apache-spark-user-list.1001560.n3.nabble.com/Error-No-space-left-on-device-tp9887p9896.html 2 - http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/

v

A/B Test Setup


Matrix

ALS (Spark/EMR)



Run-time diversity ranking

Candidate Selection

Event Data

Event Data

Weekly for past N months data

Weekly for users with recent activity

v

A/B Test Results

• Statistically significant increases• Items per order• GMV per order

• Total product sales spread over more categories

vOk, we have a recommendation system

Where do we go from here?

v

What else do you do with user and product factors?

Score (user, product) pair on demand

Get Top N similar users

Get Top N similar product

As features in other models

v

Products similar to “Haigs Spicy Hummus"

More “Spicy Hummus”

Spicy Salsas

Generated using Approximate Nearest Neighbor (“annoy” from Spotify)

v

What next

• Make recommendations more contextual

• Explain recommendations (“Because you did X”)

vMitigating the effect of out of stocks

v• what are out of stocks

• why do they happen

• how data science helps mitigate effects

v

Out of stocks - Customer Context

“Deliver Ice Cream from Whole Foods Market SOMA at 8 pm tomorrow”

v

Online services

Supply InfiniteLimited

Fulfi

llmen

t

Immediate

Future

v

Traditional E-commerce

• Manage inventory in warehouses optimized for quick fulfillment

• Customers only specify the “What”

• Disallow users from ordering out of stock products

• Set expectations• “3 day shipping” but will ship in 10 business days

v

On-demand delivery from local retailers

• Shoppers navigate a complex environment where products • may have run out • may be misplaced• may be damaged

• Customers specify “What”, “When” and “Where from”

• Improvise under uncertainty

vCustomers

Advertisers(brands)

Stores(Retailers)

lose revenue and trust of customers

Everybody loses when out of stocks happen

• don’t get exactly what they want

• must contemplate and/or communicate replacements

lose revenue and trust of customers

• waste time searching for items that aren’t in store

• context switch to searching and communicating replacements

Shoppers

v

Out of stock rate - an illustration

v

A probable solution

Do not show or allow customers to order items that are currently out of stock

v

A probable (but terrible) solution

• Customers really know these stores• “Missing” items is seen as a sign of an unreliable catalog/service

• May have been out of stock this morning but could be available when the order is fulfilled

• Sets up negative spirals

“I was there over the weekend. Please check behind

the cheeses aisle”

“Are you telling me they don’t carry

strawberries?”

v

Solution that works reasonably well

• Shoppers can see Instacart recommended replacements while shopping in the store

• Customers may also specify or choose from recommended replacements

• Relatively more flexibility with groceries

• Some services offer to cancel the order if an item isn’t available

v

Instacart Recommended Replacements

Flavor PackingSizeBrand Price

• Several product attributes matter

• Context matters, might benefit from personalization

• Must scale to millions of products

• Not always symmetric

• May be ok to replace X with gluten free X but not the other way around

Diet Info

v• Shoppers are trained to pick replacements

• But shoppers can benefit from algorithmic suggestions

• Many unfamiliar products in a vast catalog

• Validation for common products

• Finding replacements fast improves operational efficiency

Replacement Recommendations for Shoppers

v

• Customers can specify replacements while placing the order

• Can choose to communicate with the shopper in store to verify

Replacement Recommendations for Customers

v

What could we do if could predict item availability?

Customer location

Nearest store

Farther, but better availability

Controlling for retailer and quality, customer is indifferent to physical location

v

The Item Availability Prediction

Probability( Item in store | time, context)

What is probability that an item will be at the store when the shopper shows up to

look for it?

v

Item Availability as a Classification Problem

TIMESTAMP, ITEM IDENTIFIED, IN STORE?

• Millions of examples from historical data

• Feature Engineering

• historical availability at multiple resolutions• Eg: time since last “not found” event

• Item attributes• Eg: perishables restocked differently than personal care

• Temporal Features

v

Training and Scoring

Feature Extraction

XGBoost Training

ScoringFeature Extraction

Event Data

Event Data

Model Store

Weekly with over 2 months of training data

Cache availability scores

Score tens of millions of items every hour

v

Serving and Optimization Layer

Fulfillment Engine

OrderFulfillment plan: Store location, Shopper etc.

Items, eligible store locations

Availability scores

Active in production with an acceptable trade-off between

fulfillment efficiency and refund rate

v

Whats next

• Leverage model predictions for other features/data products• Avoid negative feedback loops!

• Biased training data• only have access to what is ordered through Instacart

• Tighter integrations with retailer data

• Scaling: continue to score a growing catalog at tight SLAs

WE’RE HIRING!

@sharathrao

vAppendix

v

Offline evaluation

• Ideally we want to evaluate user response to recommendations• But we will only know this from an live A/B test

• Recall based metrics are an offline proxy (albeit not the best)• Recall: “Fraction of purchased products covered among Top N

recommendations”• We only use this for hyper parameter tuning

v

Ensembles

Use different types of evidence and/or product metadata to easily create ensembles

User x Products Purchased

User x Products Viewed

User x Brands Purchased

Model or Linear Combination

…

v

What better promotes broad-based discovery

vs

v

Online ranking for diversity

“Diversity within sessions, Novelty across sessions”

“Establish trust in a fresh and comprehensive catalog”

“Less is more”

Cached list of ~1000 products

per userFinal list of

<100 products

promote diversity

v

Diversity

Top K products - ranked by score

Rank product categories by their median product score

> > >

v

Weighted sampling for diversity

Sample category in proportion to score

Within category, sample in proportion to product score

v

Architecture


Matrix

ALS (Spark/EMR)



Run-time diversity ranking

Candidate Selection

Event Data

Event Data

v

Out of stocks happen due to uncertainty in several places

Order fulfillment in (distant) future

Cannot hold inventory

Real-time inventory tracking across thousands of locations isn’t perfect (yet)

Customer might reschedule delivery

data science @ instacart

Internet