CultMedia: Deep Learning for automatic description of images and video in DH
Technological Innovation for Digital Humanities, March 3rd 2018
Lorenzo Baraldi, Rita Cucchiara
University of Modena and Reggio Emilia
About us
Who
• 4 staff members (Rita Cucchiara, Costantino Grana,
Roberto Vezzani, Simone Calderara)
• 8 PhD students
• 7 research assistants and SW developers
• 3 (ex) spinoff companies
Collaborations with
• Facebook FAIR (F), Eurecom (F)
• Panasonic (USA)
• Ferrari (I), Maserati (I)
• MIUR, EU and Italian public bodies
• Italian SuperComputing Resource Allocation – CINECA
• Many SMEs
• Computer Vision Foundation, CVPL-IAPR, AIXIA
www.aimagelab.unimore.it
AImageLab UNIMORE and Ferrari S.p.A.
Research activity on Cultural Heritage
• Layout analysis and content classification on digitized manuscripts
• Browsing and retrieval systems
• Interaction with art
• Video, Vision and Language… teaching machines to understand Art
The “Treccani” project
• 35 volumes, published from 1929
• Digitized version from the original manuscripts
• Complex layouts with regions from different categories:
• Text
• Images
• Graphic
• Scores
• Tables
• Borderless tables
• Goal: a completely digitized and browsable version of the Encyclopedia.
A. Corbelli, L. Baraldi, F. Balducci, C. Grana, R. Cucchiara "Layout analysis and content classification in digitized books" IRCDL 2017
A. Corbelli, L. Baraldi, C. Grana, R. Cucchiara "Historical Document Digitization through Layout Analysis and Deep Content Classification" ICPR 2016
The “Treccani” project
Layout analysis
OCR on text regions
Region classification
• Text
• Images
• Graphic
• Formulas
• Scores
• Tables
• Borderless tables
• JSON output
• Interactive annotation interface
• Visualization interface
[Figures: ground truth vs. automatic result, side by side.]
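As a hedged sketch of the deep content classification step, the snippet below fine-tunes an ImageNet-pretrained CNN to label cropped page regions with the categories listed above. The backbone choice, layer sizes and hyperparameters are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch (not the project's code): fine-tune a pretrained CNN
# to classify cropped page regions into the categories listed above.
import torch
import torch.nn as nn
from torchvision import models

CATEGORIES = ["text", "image", "graphic", "formula",
              "score", "table", "borderless_table"]

model = models.resnet18(pretrained=True)             # ImageNet backbone
model.fc = nn.Linear(model.fc.in_features, len(CATEGORIES))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(regions, labels):
    """regions: (B, 3, 224, 224) cropped region images; labels: (B,) ids."""
    optimizer.zero_grad()
    loss = criterion(model(regions), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, each region produced by the layout analysis step would be cropped, classified, and written into the JSON page description.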
Annotation interface
The JSON description can be visualized and edited in every part through an interactive annotation interface.
Navigation interface
Homepage
Navigation interface
Single page visualization
Navigation interface
Digitized version with in-line graphic elements
Navigation interface
Automatically retrieved graphic elements
Navigation interface
Search by word
The «Rerum Novarum» project
Document browsing and interactive retrieval
Multi-digitization of illuminated manuscripts
• Layout segmentation
• Picture segmentation and tagging
• Search with relevance feedback
D. Borghesani, C. Grana, R. Cucchiara "Rerum Novarum: Interactive Exploration of Illuminated Manuscripts" ACM MM 2010
C. Grana, D. Borghesani, R. Cucchiara "Relevance feedback strategies for artistic image collections tagging" ICMR 2011
Interacting with Art
• Novel human-machine interfaces: new kinds of self-guided tours that can integrate information from the local environment, the web and social media.
• A wearable vision device for museum environments.
• Visitors can interact with the artwork by replicating the gestures and behaviors that they would use to ask a guide something about the artwork.
Algorithms:
• Hand segmentation
• Gesture Recognition
• Artwork Recognition
L. Baraldi, F. Paci, G. Serra, R. Cucchiara "Gesture Recognition using Wearable Vision Sensors to Enhance Visitors' Museum Experiences" IEEE Sensors, 2015
L. Baraldi, F. Paci, G. Serra, L. Benini, R. Cucchiara "Gesture Recognition in Ego-Centric Videos using Dense Trajectories and Hand Segmentation" CVPRW 2014
Artwork recognition
• An image processing algorithm runs on the wearable device and detects, in real time, the artwork the user is observing.
• The result of the processing is then sent to the processing center.
• A location service is used to speed up artwork identification.
[System diagram: head-mounted camera, wearable device, Bluetooth smart box, WiFi access point.]
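As a rough illustration of how such on-device recognition could work (an assumption, not the deployed algorithm), the sketch below matches local features of the current camera frame against a gallery of reference images; the location service mentioned above could shrink the gallery to nearby artworks.

```python
# Hypothetical sketch: identify the observed artwork by ORB feature matching
# against precomputed reference descriptors (illustrative only).
import cv2

orb = cv2.ORB_create(nfeatures=500)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def recognize(frame_gray, gallery):
    """gallery: dict mapping artwork name -> precomputed ORB descriptors."""
    _, des = orb.detectAndCompute(frame_gray, None)
    if des is None:
        return None
    best_name, best_score = None, 0
    for name, ref_des in gallery.items():
        matches = matcher.match(des, ref_des)
        # Count sufficiently close matches as a crude similarity score.
        score = sum(1 for m in matches if m.distance < 40)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```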
CultMedia: teaching machines to understand art
Project from the National Technological Cluster on Technologies for the Cultural Heritage, co-funded by the Italian Ministry of Education, University and Research (2017-2018)
A focus on multimedia data: video, images, digitized documents, computer graphics
Goals: high-quality and low-cost multimedia production, re-using existing materials and integrating multimedia data in cross-media storytelling
CultMedia
A disruptive improvement in the processes and services related to cultural heritage content production
Goals
handling the creation of multimedia video and new transmedia storytelling
providing large cost savings through the extended use of machine learning and artificial intelligence solutions for the reuse of existing multimedia material and its integration in new CH productions.
Research activities @ AImageLab
Video browsing, indexing, retrieval
Novel descriptors for video indexing
Bridging vision and language
… teaching machines to understand art!
Browsing (and reusing) video
NeuralStory: an Interactive Multimedia System for Video Indexing and Re-use
• Decomposition of the storytelling structure into coherent parts, to enhance browsing and retrieval (scene detection)
• Automatic annotation and retrieval of broadcast video
• Users can produce new storytelling by means of multi-modal presentations (re-use)
Online demo at: https://www.neuralstory.it
Video Decomposition and Indexing
Video Decomposition into meaningful parts
• A Deep Network learns a semantic embedding space, in which shots belonging to the same scene have lower Euclidean distances.
• This decomposition is the basis of the visualization interface, and also allows a fine-grained search inside video-clips.
Retrieval
• Leverages automatic annotation and a thumbnail selection strategy to provide semantically and aesthetically valuable results.
Video decomposition: our approach
Perceptual features (visual, audio, quantity of speech) and Semantic features (textual concepts, visual concepts)
A Deep Network learns a semantic embedding space, in which shots belonging to the same scene have lower Euclidean distances
A one-hour video is decomposed into coherent parts and can be browsed in less than one minute
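A minimal sketch of this metric-learning idea, assuming precomputed per-shot feature vectors; the architecture, dimensions and margin are illustrative, not the exact values used in the system.

```python
# Sketch: learn an embedding where shots of the same scene lie closer
# (in Euclidean distance) than shots of different scenes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShotEmbedder(nn.Module):
    def __init__(self, in_dim=4096, emb_dim=128):   # assumed dimensions
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)     # unit-norm embeddings

embedder = ShotEmbedder()
triplet = nn.TripletMarginLoss(margin=0.2)

def loss_on_triplet(anchor, positive, negative):
    """anchor/positive: shot features from the same scene;
    negative: a shot feature from a different scene."""
    return triplet(embedder(anchor), embedder(positive), embedder(negative))
```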
Retrieval: our approach
Hypothesis:
• In broadcast videos, the speaker describes what the video shows
• Retrieval driven by semantic concepts suggested in the transcript
Thumbnail selection
• An aesthetic ranking model using CNN activations and a small training set
Aesthetic-based selection
[Baraldi, Grana, Cucchiara, ACM ICMR 2016]
Aesthetic-based retrieval
“Selecting and ranking thumbnails according to learned perceptual features”
“…The idea of beauty comes from the perception of objects, their proportions, their harmony or unity among the parts, in the evenness of the line and purity of color…”
• Low-level characteristics, like color, edges and sharpness
• High-level features, such as the presence of a clearly visible object in the center
• An excellent match with the hierarchical nature of CNNs
• A ranking strategy which learns the relative importance of these features, given a dataset of user preferences
• VGG-16: more than 4000 convolutional filters, of different sizes
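A hedged sketch of such a ranking strategy: a small head scores pooled CNN activations, trained with a pairwise margin loss so that user-preferred thumbnails score higher. Sizes and names are assumptions, not the paper's exact model.

```python
# Illustrative aesthetic ranking head on top of CNN activations.
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Linear(4096, 256), nn.ReLU(),  # assumed 4096-d input
                       nn.Linear(256, 1))
ranking_loss = nn.MarginRankingLoss(margin=1.0)

def rank_step(feat_preferred, feat_other):
    """feat_preferred was judged more pleasing than feat_other by users."""
    s1 = scorer(feat_preferred).squeeze(-1)
    s2 = scorer(feat_other).squeeze(-1)
    target = torch.ones_like(s1)   # +1: s1 should outrank s2 by the margin
    return ranking_loss(s1, s2, target)
```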
Aligning and searching inside videos
Temporal Match Kernels
A novel compact descriptor for video alignment and retrieval (with a Fourier transform!)
Applications
• Temporal alignment of different videos
• Similarity between videos
• Searching for a piece of video in a video collection
• Searching for an artwork in a video collection
With Facebook AI Research, CVPR 2018
L. Baraldi, M. Douze, R. Cucchiara, H. Jégou "LAMV: Learning to align and match videos with kernelized temporal layers" CVPR 2018
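The Fourier trick can be made concrete: the dot-product similarity between two frame-descriptor sequences at every temporal offset is a circular cross-correlation, computable all at once with FFTs. The sketch below shows only this alignment core, not LAMV's learned kernelized layers.

```python
# Sketch of FFT-based temporal alignment of two descriptor sequences.
import numpy as np

def alignment_scores(x, y):
    """x, y: (T, D) per-frame descriptors, zero-padded to the same length T.
    Returns score[k] = sum_t <x[t], y[t + k]> for all circular offsets k."""
    X = np.fft.rfft(x, axis=0)
    Y = np.fft.rfft(y, axis=0)
    # Correlation theorem along time, then sum similarities over dimensions.
    return np.fft.irfft(np.conj(X) * Y, n=x.shape[0], axis=0).sum(axis=1)

def best_offset(x, y):
    """Temporal offset that best aligns y to x."""
    return int(np.argmax(alignment_scores(x, y)))
```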
Video re-use
Images, shots and scenes can be picked during watching. Selected clips can be used to create new multimodal slides, which can be enriched with text and images.
Decomposing the storytelling structure of a collection of videos enables the creation of new personalized storytellings.
From temporal segmentation to captioning
LSTM networks as language models
At training time: condition on the image and train to predict the next word given the previous (GT) words
[Diagram: an unrolled LSTM generating the caption “a dog carrying a frisbee in a field”, with the ground-truth words fed as inputs at each timestep.]
Using a vocabulary of more than 10,000 words
The image is fed only at the first timestep; the previous ground-truth word, at every following timestep
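A minimal sketch of this training scheme (teacher forcing), under assumed sizes and names: the image feature conditions the LSTM only at the first timestep, and the ground-truth previous word is the input at every later one.

```python
# Illustrative LSTM captioning training step (assumed dimensions).
import torch
import torch.nn as nn

VOCAB, EMB, HID, IMG = 10000, 256, 512, 2048

embed = nn.Embedding(VOCAB, EMB)
img_proj = nn.Linear(IMG, EMB)          # image feature acts as the first input
lstm = nn.LSTM(EMB, HID, batch_first=True)
out = nn.Linear(HID, VOCAB)
criterion = nn.CrossEntropyLoss()

def caption_loss(img_feat, gt_words):
    """img_feat: (B, IMG); gt_words: (B, L) ground-truth word ids."""
    inputs = torch.cat([img_proj(img_feat).unsqueeze(1),   # image at t = 0
                        embed(gt_words[:, :-1])], dim=1)   # GT words after
    hidden, _ = lstm(inputs)
    logits = out(hidden)                                   # (B, L, VOCAB)
    # At each timestep, predict the next ground-truth word.
    return criterion(logits.reshape(-1, VOCAB), gt_words.reshape(-1))
```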
Automatic annotation
Automatically generated captions will be useful for human search, for automatic search by query, and for future query-answering services.
L. Baraldi, C. Grana, R. Cucchiara, "Hierarchical Boundary-Aware Neural Encoder for Video Captioning" CVPR, 2017
M. Cornia, L. Baraldi, G. Serra, R. Cucchiara, "Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention" ACM TOMM, 2017
Generated caption: A woman is looking at a television screen.
Generated caption: A city with a large boat in the water.
Generated caption: A boat is in the water near a large mountain.
Generated caption: A woman in a red jacket is riding a bicycle.
Bridging vision and language in art
Vignette depicting Solomon receiving homage from the princes.
A round with a peacock in a fenced area.
Joseph dropped by the brothers in the well.
A round with two monkeys, one of whom holds a cherub in his arms.
Goals
• Understanding art
• Describing art in natural language
• Retrieving images with natural language queries
Challenges:
• Open research area also in natural images
• Domain shift: visual and textual elements are different from ordinary datasets
BibleVSA dataset
From the Borso d’Este Holy Bible:
Illuminated manuscript (640 pages)
Commentary describing the visual content of each of the illustrations, the decorations of the page, and the textual content itself.
Annotations of the alignment between parts of the commentary and the illustrations
Training of visual-semantic embeddings
Automatic alignment of visual and textual cultural data
Visualizing the domain shift
[Figure: embedding visualizations of visual data (ResNet-152, VGG-19) and textual data (FastText from Facebook AI Research, GloVe, Word2Vec), illustrating the domain shift.]
Building visual-semantic spaces in the DH domain
The unsupervised way
Relying only on the supervision given by non-DH datasets
…a metric learning loss, plus the constraint that the distributions of text and image embeddings should match (Maximum Mean Discrepancy, MMD)
L. Baraldi, M. Cornia, C. Grana, R. Cucchiara, “Aligning text and document illustrations: towards visually explainable Digital Humanities”, submitted to ICPR 2018
[Figure: the learned embedding space, without and with the MMD constraint.]
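A sketch of the two ingredients named above, under assumed batch shapes: a triplet-style metric learning loss on matched illustration/commentary pairs, plus a Gaussian-kernel MMD penalty that pulls the two embedding distributions together. Not the paper's exact formulation.

```python
# Illustrative metric learning + MMD objective for visual-semantic alignment.
import torch

def mmd(x, y, sigma=1.0):
    """Biased MMD^2 estimate with a Gaussian kernel.
    x: (N, D) image embeddings; y: (M, D) text embeddings."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def alignment_loss(img_emb, txt_emb, margin=0.2, lam=0.1):
    """Row i of img_emb and txt_emb is a matching pair; other rows in the
    batch serve as negatives (one per sample here, for brevity)."""
    pos = (img_emb - txt_emb).pow(2).sum(dim=1)
    neg = (img_emb - txt_emb.roll(1, dims=0)).pow(2).sum(dim=1)
    triplet = torch.clamp(margin + pos - neg, min=0).mean()
    return triplet + lam * mmd(img_emb, txt_emb)
```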
Building visual-semantic spaces in the DH domain
Automatic alignment on a single page
L. Baraldi, M. Cornia, C. Grana, R. Cucchiara, “Aligning text and document illustrations: towards visually explainable Digital Humanities”, submitted to ICPR 2018
The CultMedia dataset
We need more data, to tackle more tasks!
Creation of a (medium-to-)large-scale dataset oriented to the Cultural Heritage domain and suitable for automatic understanding tasks, such as:
Artwork identification and retrieval (I can detect, locate and identify an artwork)
Automatic artwork description and retrieval with natural language queries (I can describe an artwork, and retrieve similar ones from other natural language descriptions)
Detection of attributes and relationships inside the artwork (I can identify the people/objects represented in the artwork, and the relationships between them)
Visual grounding of descriptions (I can use that knowledge to ground and justify the descriptions)
Strong link with the re-use spirit of the project.
«A man with a hat holding a glass of wine»
Soldato con Calice (N. Tournier)
Data annotation (#1)
Temporal segmentation of the input video, to isolate the temporal extent of each artwork and the unrelated segments
Artwork detection, i.e. annotating the bounding box of the artwork frame by frame (exploiting the semi-automatic annotation given by the optical flow)
[Timeline: Unrelated | Artwork #1 | Walking | Artwork #2 | …]
Data annotation (#2)
Annotation with metadata, i.e. author, name of the artwork, year, style, …
Captions:
a. describing the content of the artwork without leveraging any cultural background
b. describing the content and the context of the artwork by leveraging a specific cultural background
«A man with a hat holding a glass of wine»
«A Caravaggesque painting in which a soldier seems to establish a cultured dialogue with the spectator, bringing echoes of the classical tradition of the myth of Bacchus into the daily life of an inn»
Data annotation (#3)
Annotation of the details: detection and description of the components of the artwork (objects and people) with their actions and attributes
Grounding of captions, i.e. connecting people, objects, attributes and actions in a natural language sentence: “Nerone is standing in front of Agrippina, who lies on a bed.”
Nerone (person), standing
Agrippina (person), lying
Annotation interface
Ad-hoc web-based annotation interfaces, also integrating existing platforms (VATIC)
Online auditing and control of the annotations.
Preliminary results
A first round of data collection and annotation took place in February
Three groups of five annotators
Each round is scheduled as follows:
1st day: training on the interface, collection and validation of sample annotations on synthetic data
2nd day: visit to the Estense Gallery (Modena), and collection of the data, before and after the visit
3rd to 5th day: annotation and cross-validation
During the first round:
around 2000 natural language descriptions
140 detailed annotations of artworks and their details
annotation of 200 short user-generated videos taken inside the museum.
Research activity on Cultural Heritage
• Layout analysis and content classification on digitized manuscripts
• Browsing and retrieval systems
• Interaction with art
• Video, Vision and Language… teaching machines to understand Art
Thank you! Questions?
[email protected]
[email protected]
http://aimagelab.ing.unimore.it
Thanks to the “Città Educante” Project (CTN01 00034 393801) of the National Technology Cluster on Smart Communities, co-funded by the Italian Ministry of Education, University and Research (MIUR).
Ongoing collaboration with Facebook AI Research (FAIR). Facebook has selected AImageLab as one of 15 world-class research labs in Europe.
Thanks to the “CultMedia” Project of the National Technology Cluster on Smart Communities, co-funded by the Italian Ministry of Education, University and Research (MIUR).