
Page 1: Visual7W  Grounded Question Answering in Images

Visual7W: Grounded Question Answering in Images

Yuke Zhu, Oliver Groth, Michael Bernstein, Li Fei-Fei

Slides by Issey Masuda Mora
Computer Vision Reading Group (09/05/2016)

[arXiv] [web] [GitHub]

Page 2: Visual7W  Grounded Question Answering in Images

Context

Page 3: Visual7W  Grounded Question Answering in Images

Visual Question Answering

Goal: predict the answer to a given question about an image

Page 4: Visual7W  Grounded Question Answering in Images

Motivation

A new Turing test? How do we evaluate an AI's image understanding?

Page 5: Visual7W  Grounded Question Answering in Images

Visual7W

Page 6: Visual7W  Grounded Question Answering in Images

The 7W

WHAT

WHERE

WHEN

WHO

WHY

HOW

WHICH

Questions: multiple-choice
4 candidates, only one correct
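
Since every question comes with exactly four candidates and one correct answer, evaluation reduces to ranking the candidates. A minimal sketch of that selection step, assuming a hypothetical score(image, question, candidate) function supplied by whatever model is being tested:

```python
# Minimal sketch of the 4-way multiple-choice protocol: a model exposes a
# scoring function, and the predicted answer is simply the highest-scoring
# candidate. The scoring function itself is hypothetical here.

def pick_answer(score, image, question, candidates):
    """Return the candidate with the highest model score (argmax over 4)."""
    return max(candidates, key=lambda c: score(image, question, c))

# Toy usage with a dummy scorer that prefers shorter answers.
if __name__ == "__main__":
    dummy_score = lambda img, q, c: -len(c)
    candidates = ["Two women", "A dog", "Three children", "A street vendor"]
    print(pick_answer(dummy_score, None, "Who is under the umbrella?", candidates))
```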

Page 7: Visual7W  Grounded Question Answering in Images

Grounding: image-text correspondences
Exploit the relation between image regions and nouns in the questions

Page 8: Visual7W  Grounded Question Answering in Images

The new answer is...
Question-Answer types:

● Telling questions: the answer is text

● Pointing questions: a new QA type introduced in this work, where the answer is an image region (bounding box)

Page 9: Visual7W  Grounded Question Answering in Images

Related work

Page 10: Visual7W  Grounded Question Answering in Images

Common approach

Example question: "Who is under the umbrella?"

Image → extract visual features
Question → embedding
Merge both → predict answer ("Two women")
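
A hedged PyTorch sketch of this common pipeline: CNN features for the image, an embedded question run through an LSTM, a simple merge, and a classifier over answers. Layer sizes, concatenation as the fusion step, and the classifier head are illustrative assumptions, not the exact architecture of any particular paper:

```python
# Illustrative sketch of the "common approach": image features + question
# encoding, merged, then classified over a fixed answer vocabulary.
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, img_dim=4096, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # word embedding
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)  # question encoder
        self.img_proj = nn.Linear(img_dim, hid_dim)              # project CNN features
        self.classifier = nn.Linear(hid_dim * 2, num_answers)    # merge -> answer scores

    def forward(self, img_feat, question_tokens):
        # Encode the question and keep the final hidden state.
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q_vec = h[-1]                                   # (batch, hid_dim)
        v_vec = torch.relu(self.img_proj(img_feat))     # (batch, hid_dim)
        merged = torch.cat([q_vec, v_vec], dim=1)       # simple fusion by concatenation
        return self.classifier(merged)                  # answer logits

# Toy forward pass with random inputs.
model = SimpleVQA(vocab_size=1000, num_answers=500)
logits = model(torch.randn(2, 4096), torch.randint(0, 1000, (2, 6)))
print(logits.shape)  # torch.Size([2, 500])
```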

Page 11: Visual7W  Grounded Question Answering in Images

The Dataset

Page 12: Visual7W  Grounded Question Answering in Images

Visual7W Dataset
Characteristics:

● 47,300 images from the COCO dataset
● 327,939 QA pairs
● 561,459 object bounding boxes spread across 36,579 categories
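
For orientation, a small sketch of iterating over the released annotations. The JSON layout assumed here (an "images" list whose entries carry "qa_pairs" with "question", "answer", "multiple_choices" and "type") and the file name are assumptions that should be checked against the files actually distributed on the project page:

```python
# Sketch of iterating the Visual7W telling-QA annotations; the schema and
# file name below are assumptions, verify them against the official release.
import json
from collections import Counter

with open("dataset_v7w_telling.json") as f:   # file name is an assumption
    data = json.load(f)

question_types = Counter()
for image in data["images"]:
    for qa in image["qa_pairs"]:
        question_types[qa["type"]] += 1                       # "what", "where", "who", ...
        candidates = qa["multiple_choices"] + [qa["answer"]]  # 3 wrong + 1 right

print(question_types.most_common())
```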

Page 13: Visual7W  Grounded Question Answering in Images

Creating the Dataset
Procedure:

● Write QA pairs
● 3 AMT workers evaluate each pair as good or bad
● Only pairs with at least 2 good evaluations are kept
● Write the 3 wrong answers (given the right one)
● Extract object names and draw a bounding box for each one

Page 14: Visual7W  Grounded Question Answering in Images

The Model

Page 15: Visual7W  Grounded Question Answering in Images

Attention-based model
Model for pointing questions
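
A simplified sketch of the idea behind the attention-based pointing model: the question encoding attends over a set of region features, and each of the four candidate boxes is scored against the attended visual summary. Dimensions, the dot-product attention and the scoring rule are assumptions for illustration, not the exact Visual7W architecture:

```python
# Simplified attention-style pointing model: attend over regions with the
# question encoding, then score each candidate box against the summary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointingModel(nn.Module):
    def __init__(self, q_dim=512, region_dim=2048, hid_dim=512):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hid_dim)
        self.r_proj = nn.Linear(region_dim, hid_dim)

    def forward(self, q_vec, region_feats, candidate_feats):
        # q_vec: (B, q_dim)   region_feats: (B, R, region_dim)
        # candidate_feats: (B, 4, region_dim) -- features of the 4 candidate boxes
        q = self.q_proj(q_vec).unsqueeze(1)          # (B, 1, H)
        r = self.r_proj(region_feats)                # (B, R, H)
        attn = F.softmax((q * r).sum(-1), dim=-1)    # attention weights over regions
        attended = (attn.unsqueeze(-1) * r).sum(1)   # (B, H) attended visual summary
        c = self.r_proj(candidate_feats)             # (B, 4, H)
        return (attended.unsqueeze(1) * c).sum(-1)   # one score per candidate box

model = PointingModel()
scores = model(torch.randn(2, 512), torch.randn(2, 10, 2048), torch.randn(2, 4, 2048))
print(scores.shape)  # torch.Size([2, 4]); argmax gives the predicted box
```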

Page 16: Visual7W  Grounded Question Answering in Images

Experiments & Results

Page 17: Visual7W  Grounded Question Answering in Images

Experiments
Different experiments were conducted depending on the information given to the subject:

● Only the question
● Question + image

Subjects/models:

● Human
● Logistic regression
● LSTM
● LSTM + attention model
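
For the logistic regression baseline, a generic bag-of-words sketch with scikit-learn that scores each (question, candidate) pair and picks the most probable candidate; the features used in the paper's actual baseline may differ:

```python
# Generic bag-of-words logistic-regression baseline for multiple-choice QA:
# each (question, candidate) pair is one example, labelled 1 if correct.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny toy training set: (question, candidate, is_correct)
train = [
    ("who is under the umbrella", "two women", 1),
    ("who is under the umbrella", "a dog", 0),
    ("what covers the ground", "snow", 1),
    ("what covers the ground", "sand", 0),
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([f"{q} {a}" for q, a, _ in train])
y = [label for _, _, label in train]

clf = LogisticRegression().fit(X, y)

# At test time, score every candidate and pick the most probable one.
question = "who is under the umbrella"
candidates = ["a dog", "two women", "sand", "snow"]
probs = clf.predict_proba(vectorizer.transform([f"{question} {c}" for c in candidates]))[:, 1]
print(candidates[probs.argmax()])
```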

Page 18: Visual7W  Grounded Question Answering in Images

Results

Page 19: Visual7W  Grounded Question Answering in Images

Conclusions

Page 20: Visual7W  Grounded Question Answering in Images

Conclusions

● A visual QA model has been presented
● Attention model to focus on local regions of the image
● Dataset created with groundings