Deep Neural Networks for Multimodal Learning

TRANSCRIPT

Deep Neural Networks for Multimodal Learning

Presented by: Marc Bolaños

Álvaro Peris, Francisco Casacuberta, Marc Bolaños, Petia Radeva

Multimodal Tasks

- Video Description
- Multimodal Translation
- Multimodal Description
- Visual Question Answering
- Dense Captioning
- Image Description

Example - Image Description / Multimodal Translation (five crowd-sourced captions for one image, plus a translation):

"Two young guys with shaggy hair look at their hands while hanging out in the yard."
"Two young, White males are outside near many bushes."
"Two men in green shirts are standing in a yard."
"A man in a blue shirt standing in a garden."
"Two friends enjoy time spent together."

"Dos hombres están en el jardín." ("Two men are in the garden.")

Example - Multimodal Description:

"A man is smiling at a stuffed lion."
"Un hombre sonríe a un león de peluche." ("A man smiles at a stuffed lion.")

Example - Visual Question Answering:

Q: What kind of store is this? A: bakery
Q: What number is the bus? A: 48

Multimodal Tasks - Basic NN Components

Convolutional Neural Network (CNN)

Long Short-Term Memory (LSTM): e.g., an LSTM encoder reads "where is the giraffe" and an LSTM decoder emits "dónde está la jirafa".

Attention Mechanism: given encoder states v_1, ..., v_n, the decoder at step t computes weights α_i(t) and feeds LSTM_t, together with the previous state LSTM_{t-1}, the context vector

    z_t = Σ_{i=1}^{n} α_i(t) v_i

[Task 1] Video Description

Example outputs: "Two men working on a high building", "Two teams are playing soccer".

Microsoft Video Description (MSVD) Dataset:
- 1,970 open-domain clips collected from YouTube.
- Annotated using a crowdsourcing platform.
- Variable number of captions per video.
- 80,000 different video-caption pairs.

Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, Saenko K. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729. 2014 Dec 15.

[Task 1] Video Description - Model

Encoder: a CNN extracts one feature vector per frame, and LSTMs (j = 1, ..., J) run over the frame features in both directions. Decoder: an LSTM (t = 1, ..., T), driven by a soft attention model over the encoder states, generates the sentence word by word ("Two elephants ... water").

Sentence generation: ŷ_t = argmax_y P(y | y_1, ..., y_{t-1}, x_1, ..., x_J)

Peris Á, Bolaños M, Radeva P, Casacuberta F. Video Description using Bidirectional Recurrent Neural Networks. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), in press, 2016.
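To make the attention-driven decoding concrete, here is a minimal NumPy sketch of one step of the loop above: additive attention weights over the encoder states, then a greedy argmax over the vocabulary. All names (W_enc, W_dec, v_a, W_out) and dimensions are illustrative assumptions rather than the authors' implementation, and the LSTM state update itself is elided.

```python
# Minimal sketch of soft attention + greedy word selection (illustrative only).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(encoder_states, h_prev, W_enc, W_dec, v_a):
    """Additive attention: score each encoder state v_i against the previous
    decoder state h_prev, softmax the scores into weights alpha_i(t), and
    return the context vector z_t = sum_i alpha_i(t) * v_i."""
    scores = np.tanh(encoder_states @ W_enc + h_prev @ W_dec) @ v_a  # (n,)
    alphas = softmax(scores)
    return alphas @ encoder_states, alphas                           # z_t, weights

def greedy_decode_step(h_prev, z_t, W_out, vocab):
    """Pick the next word as argmax_y P(y | y_1..y_{t-1}, x_1..x_J); here the
    distribution is a linear readout over [decoder state; context vector]."""
    logits = np.concatenate([h_prev, z_t]) @ W_out
    return vocab[int(np.argmax(softmax(logits)))]

# Toy usage: n=5 encoder states of size 8, decoder state of size 6.
rng = np.random.default_rng(0)
enc, h = rng.normal(size=(5, 8)), rng.normal(size=6)
z_t, _ = attention_context(enc, h, rng.normal(size=(8, 4)),
                           rng.normal(size=(6, 4)), rng.normal(size=4))
print(greedy_decode_step(h, z_t, rng.normal(size=(14, 4)),
                         ["two", "elephants", "water", "<eos>"]))
```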
[Task 1] Video Description - Results

(Compared against Yao et al.*)

- Bidirectional temporal mechanism (BLSTM): extracts information both in a past-to-future and in a future-to-past fashion.
- Attention mechanism: helpful for step-by-step sentence generation.

Future work:
- CNNs at a higher, temporal level (3D CNNs).

* Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4507-4515.

[Task 2] Visual Question Answering

VQA Dataset - Open-Ended question answering task:
- 200,000 images
- 3 questions per image
- 10 (short) answers per question, annotated by different users

Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2425-2433.

[Task 2] VIBIKNet for VQA

The question (e.g., "where is the giraffe behind the fence") goes through a text embedding (GloVe initialization) and a bidirectional LSTM (forward and backward passes, combined by element-wise summation). The image goes through a visual embedding: an L2-normalized KCNN feature vector. Both representations are fused by vector concatenation and passed to a softmax over the answers.

Answer generation: â = argmax_a P(a | q_1, ..., q_n, x)

Bolaños M, Peris Á, Casacuberta F, Radeva P. VIBIKNet: Visual Bidirectional Kernelized Network for Visual Question Answering. Challenge on Visual Question Answering, CVPR (no proceedings), 2016.

[Task 2] VIBIKNet for VQA - Results

                     Accuracy [%] on dev 2014          Accuracy [%] on test 2015
Model                Yes/No  Number  Other  Overall    Yes/No  Number  Other  Overall
LSTM                 79.00   38.16   33.68  52.88      -       -       -      -
BLSTM                79.13   38.26   33.52  52.96      78.30   38.88   38.97  54.86
BLSTM (train+dev)    -       -       -      -          78.88   36.33   40.27  56.10

- Classification models work better than generative models on datasets with simple answers.
- Models for compacting and jointly describing the information present in the images (KCNN) seem promising.
- The use of pre-trained but adaptable representations is crucial for small and medium-sized datasets.
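To illustrate the classification formulation favored above (rather than a generative decoder), here is a minimal Keras sketch of a VIBIKNet-style model: a GloVe-initializable embedding plus bidirectional LSTM over the question, fused with an L2-normalized image feature and classified with a softmax over candidate answers. Layer sizes and the fusion details are assumptions for the sketch, not the authors' exact architecture (their code is at www.github.com/MarcBS/VIBIKNet).

```python
# Illustrative VIBIKNet-style VQA classifier (not the authors' exact code).
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, MAX_LEN, EMB, HID, IMG_DIM, N_ANSWERS = 10000, 20, 300, 256, 1024, 1000

question = layers.Input(shape=(MAX_LEN,), dtype="int32", name="question")
image    = layers.Input(shape=(IMG_DIM,), name="kcnn_feature")

# Text branch: word embedding (GloVe-initialized in the talk) + BLSTM,
# with forward and backward directions combined by element-wise summation.
q = layers.Embedding(VOCAB, EMB, mask_zero=True)(question)
q = layers.Bidirectional(layers.LSTM(HID), merge_mode="sum")(q)

# Visual branch: L2-normalize the precomputed KCNN feature and project it.
v = layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=-1))(image)
v = layers.Dense(HID, activation="tanh")(v)

# Fuse both modalities by concatenation and classify: a = argmax_a P(a | q, x).
fused = layers.Concatenate()([q, v])
answer = layers.Dense(N_ANSWERS, activation="softmax")(fused)

model = Model([question, image], answer)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```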
[Task 3] Image Description

Image Description formulated as a translation problem: a CNN encodes the image, and an LSTM decoder generates the caption word by word ("A rally ... road").

Sentence generation: ŷ_t = argmax_y P(y | y_1, ..., y_{t-1}, x)

Basic initial tests on Flickr8k obtain BLEU = 20.2%.

[Task 4] Multimodal Translation

Translation problem aided by image information: the source sentence ("Two elephants ... water") is embedded and encoded by a bidirectional LSTM; the decoder, driven by a soft attention model, is additionally conditioned on KCNN image features and generates the translation ("Dos elefantes ... agua").

Sentence translation: ŷ_t = argmax_y P(y | y_1, ..., y_{t-1}, x, z_1, ..., z_J)

Basic initial tests on the Flickr30k ACL Task 1 challenge obtain METEOR = 41.2% (BLEU = 20.2).

Future Directions

We are working on adding several state-of-the-art architectures and ideas:

- Highway Networks
- Compact Bilinear Pooling
- Class Activation Maps

Srivastava RK, Greff K, Schmidhuber J. Highway networks. arXiv preprint arXiv:1505.00387. 2015 May 3.
Gao Y, Beijbom O, Zhang N, Darrell T. Compact bilinear pooling. arXiv preprint arXiv:1511.06062. 2015 Nov 19.
Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning Deep Features for Discriminative Localization. arXiv preprint arXiv:1512.04150. 2015 Dec 14.

Summary

Collaboration supported by the R-MIPRCV network:

- Stay of Marc Bolaños (CVC-UB) at UPV, 2015.
- Stay of Álvaro Peris (UPV) at UB, 2016.
- To be extended with the incorporation of UGR (stay of a PhD student, October 2016).

Publications and challenges:

- ICANN 2016
- CVPR 2016

Download VIBIKNet

www.github.com/MarcBS/VIBIKNet
www.ub.edu/cvub/marcbolanos
marc.bolanos@ub.edu

[Task 2] VIBIKNet for VQA - Kernelized CNN

An object detector proposes image regions; each region is passed through GoogLeNet to obtain a CNN feature vector. The per-region features are reduced with PCA, aggregated with a Gaussian Mixture Model into Fisher Vectors, and reduced again with PCA to produce the final KCNN feature vector.

Liu Z. Kernelized Deep Convolutional Neural Network for Describing Complex Images. arXiv preprint arXiv:1509.04581. 2015 Sep 15.
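A rough sketch of this pipeline under stated assumptions: region features (stubbed with random data in place of the object detector + GoogLeNet stage) are PCA-reduced, aggregated into a Fisher Vector with a diagonal-covariance GMM using the common means-only simplification, and PCA-reduced once more. All dimensions are illustrative, not those of Liu (2015).

```python
# Illustrative KCNN-style pipeline: CNN region features -> PCA -> FV -> PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(x, gmm):
    """Fisher vector w.r.t. the GMM means only (a common simplification):
    per component k, soft-assigned, variance-whitened residuals of x."""
    q = gmm.predict_proba(x)                      # (n_regions, K)
    diff = x[:, None, :] - gmm.means_[None]       # (n_regions, K, D)
    diff /= np.sqrt(gmm.covariances_[None])       # diagonal covariances
    fv = (q[:, :, None] * diff).sum(0) / (len(x) * np.sqrt(gmm.weights_)[:, None])
    return fv.ravel()                             # (K * D,)

# Stub: region_feats[i] stands in for GoogLeNet features of image i's regions.
rng = np.random.default_rng(0)
region_feats = [rng.normal(size=(15, 1024)) for _ in range(8)]
all_feats = np.vstack(region_feats)

pca1 = PCA(n_components=64).fit(all_feats)                    # first PCA
gmm = GaussianMixture(n_components=16, covariance_type="diag",
                      random_state=0).fit(pca1.transform(all_feats))

fvs = np.array([fisher_vector(pca1.transform(f), gmm) for f in region_feats])
pca2 = PCA(n_components=4).fit(fvs)                           # second PCA
kcnn_features = pca2.transform(fvs)                           # final KCNN vectors
```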
