combining image and language to predict and understand the...
TRANSCRIPT
Back Ground
Results
• Cast the prediction problem to Regression and Classification
• Make usefulness prediction by combing two neural networks
• CNN on an image associated with the review
• Attention based Convolutional RNN on the text of the review
• Analyze Attention weights to gain insight into what make a review useful
Using 10.2K Reviews - Class 0: Reviews have 0
useful votes- Class 1: Reviews have
>9 useful votes- A subset of images from
the 200,000 business pictures provided by yelp
# of ConvRNN Layer V.S. Classification Accuracy
71%
73.25%
75.5%
77.75%
80%
0 1 2 3 4 5# of ConvRNN Layer V.S.
Regression Accuracy (RMS)
2.79
2.805
2.82
2.835
2.85
0 1 2 3 4
Classification Results (All Model )
Accu
racy
50%
57.5%
65%
72.5%
80%
Image+ConvRNN
ConvRNN
RNN+Attention
Bi-RNN
SVMRegression Results
(All Model )
RMSE
2.6
2.75
2.9
3.05
3.2
Image+ConvRNN
ConVRNN
RNN+Attention
Bi-RNN
Linear Regression
Combining Image and Language to Predict and Understand the Usefulness of Yelp ReviewsDavid Z Liu ([email protected])
Challenge
• Predicting a reviews usefulness is important and challenging
• Knowing the usefulness of a review in advance, businesses can recommend high quality and fresh reviews to customers and gain business insight
• For reviews with the same length, No obvious features that directly indicate the usefulness of a document
• Only a small amount of data have significant number of useful votes
Approach
Data Set
Link Image With Text - Link a business photo to a review
written about that business by match words in the image caption with words in the review text
- Pick the image with the highest Jaccard similarity
Model
• Input to the model:• Written review to the RNN.
Text are converted to 300 dimensional GloVe
• Associated image to the CNN, converted to 64X64X3
• Output from the model:• Classification or regression
prediction
64x64x3
32x
32x8
16x
16x
16
8x8x
64
4x4x
64
I Love food and Images
300x300x1
200x200x16
100x100x16
100
(0 = bidirectional RNN without attention)
Visualizing Attention Weight A typical review with the top 25 words greened base on attention intensity.
Image + ConvRNNBidirectional RNN Without Attention
Bidirectional RNN With Attention
Prediction
ConvRNN
Size corresponds to the attention weight