CultMedia: Deep Learning for automatic description of images and video in DH
Technological Innovation for Digital Humanities, March 3rd 2018
Lorenzo Baraldi, Rita Cucchiara
University of Modena and Reggio Emilia
About us
Who
• 4 staff members (Rita Cucchiara, Costantino Grana,
Roberto Vezzani, Simone Calderara)
• 8 PhD students
• 7 research assistants and SW developers
• 3 (ex) spinoff companies
Collaborations with
• Facebook FAIR (F), Eurecom (F)
• Panasonic (USA)
• Ferrari (I), Maserati (I)
• MIUR, EU and Italian public bodies
• Italian SuperComputing Resource Allocation – CINECA
• Many SMEs
• Computer Vision Foundation, CVPL-IAPR, AIXIA
www.aimagelab.unimore.it
AImageLab UNIMORE and Ferrari S.p.A.
Research activity on Cultural Heritage
• Layout analysis and content classification on digitized manuscripts
• Browsing and retrieval systems
• Interaction with art
• Video, Vision and Language… teaching machines to understand Art
The “Treccani” project
• 35 volumes, published from 1929
• Digitized version from the original manuscripts
• Complex layouts with regions from different categories:
• Text
• Images
• Graphic
• Scores
• Tables
• Borderless tables
• Goal: a completely digitized and browsable version of the Encyclopedia.
A. Corbelli, L. Baraldi, F. Balducci, C. Grana, R. Cucchiara "Layout analysis and content classification in digitized books" IRCDL 2017
A. Corbelli, L. Baraldi, C. Grana, R. Cucchiara "Historical Document Digitization through Layout Analysis and Deep Content Classification" ICPR 2016
The “Treccani” project
Layout analysis
OCR on text regions
Region classification
• Text
• Images
• Graphic
• Formulas
• Scores
• Tables
• Borderless tables
• JSON output
• Interactive annotation interface
• Visualization interface
[Figures: ground truth vs. automatic result, side by side.]
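As a hedged sketch of the deep content classification step, the snippet below fine-tunes an ImageNet-pretrained CNN to label cropped page regions with the categories listed above. The backbone choice, layer sizes and hyperparameters are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch (not the project's code): fine-tune a pretrained CNN
# to classify cropped page regions into the categories listed above.
import torch
import torch.nn as nn
from torchvision import models

CATEGORIES = ["text", "image", "graphic", "formula",
              "score", "table", "borderless_table"]

model = models.resnet18(pretrained=True)             # ImageNet backbone
model.fc = nn.Linear(model.fc.in_features, len(CATEGORIES))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(regions, labels):
    """regions: (B, 3, 224, 224) cropped region images; labels: (B,) ids."""
    optimizer.zero_grad()
    loss = criterion(model(regions), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, each region produced by the layout analysis step would be cropped, classified, and written into the JSON page description.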
Annotation interface
The JSON description can be visualized and edited in every part through an interactive annotation interface.
Navigation interface
Homepage
Navigation interface
Single page visualization
Navigation interface
Digitized version with in-line graphic elements
Navigation interface
Automatically retrieved graphic elements
Navigation interface
Search by word
The «Rerum Novarum» project
Document browsing and interactive retrieval
Multi-digitization of illuminated manuscripts
• Layout segmentation
• Picture segmentation and tagging
• Search with relevance feedback
D. Borghesani, C. Grana, R. Cucchiara "Rerum Novarum: Interactive Exploration of Illuminated Manuscripts" ACM MM 2010
C. Grana, D. Borghesani, R. Cucchiara "Relevance feedback strategies for artistic image collections tagging" ICMR 2011
Interacting with Art
• Novel human-machine interfaces: new kinds of self-guided tours that can integrate information from the local environment, the web and social media.
• A wearable vision device for museum environments.
• Visitors can interact with the artwork by replicating the gestures and behaviors that they would use to ask a guide something about the artwork.
Algorithms:
• Hand segmentation
• Gesture Recognition
• Artwork Recognition
L. Baraldi, F. Paci, G. Serra, R. Cucchiara "Gesture Recognition using Wearable Vision Sensors to Enhance Visitors' Museum Experiences" IEEE Sensors, 2015
L. Baraldi, F. Paci, G. Serra, L. Benini, R. Cucchiara "Gesture Recognition in Ego-Centric Videos using Dense Trajectories and Hand Segmentation" CVPRW 2014
Artwork recognition
• An image processing algorithm runs on the wearable device and detects, in real time, the artwork the user is observing.
• The result of the processing is then sent to the processing center.
• A location service is used to speed up artwork identification.
[System diagram: head-mounted camera, wearable device, Bluetooth smart box, WiFi access point.]
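As a rough illustration of how such on-device recognition could work (an assumption, not the deployed algorithm), the sketch below matches local features of the current camera frame against a gallery of reference images; the location service mentioned above could shrink the gallery to nearby artworks.

```python
# Hypothetical sketch: identify the observed artwork by ORB feature matching
# against precomputed reference descriptors (illustrative only).
import cv2

orb = cv2.ORB_create(nfeatures=500)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def recognize(frame_gray, gallery):
    """gallery: dict mapping artwork name -> precomputed ORB descriptors."""
    _, des = orb.detectAndCompute(frame_gray, None)
    if des is None:
        return None
    best_name, best_score = None, 0
    for name, ref_des in gallery.items():
        matches = matcher.match(des, ref_des)
        # Count sufficiently close matches as a crude similarity score.
        score = sum(1 for m in matches if m.distance < 40)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```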
CultMedia: teaching machines to understand art
Project from the National Technological Cluster on Technologies for the Cultural Heritage, co-funded by the Italian Ministry of Education, University and Research (2017-2018)
A focus on multimedia data: video, images, digitized documents, computer graphics
Goals: high-quality and low-cost multimedia production, re-using existing materials and integrating multimedia data in cross-media storytelling
CultMedia
A disruptive improvement in the processes and services related to cultural heritage content production
Goals
handling the creation of multimedia video and new transmedia storytelling
providing large cost savings through the extended use of machine learning and artificial intelligence solutions for the reuse of existing multimedia material and its integration in new CH productions.
Research activities @ AImageLab
Video browsing, indexing, retrieval
Novel descriptors for video indexing
Bridging vision and language
… teaching machines to understand art!
Browsing (and reusing) video
NeuralStory: an Interactive Multimedia System for Video Indexing and Re-use
• Decomposition of the storytelling structure into coherent parts, to enhance browsing and retrieval (scene detection)
• Automatic annotation and retrieval of broadcast video
• Users can produce new storytelling by means of multi-modal presentations (re-use)
Online demo at: https://www.neuralstory.it
Video Decomposition and Indexing
Video Decomposition into meaningful parts
• A Deep Network learns a semantic embedding space, in which shots belonging to the same scene have lower Euclidean distances.
• This decomposition is the basis of the visualization interface, and also allows a fine-grained search inside video-clips.
Retrieval
• Leverages automatic annotation and a thumbnail selection strategy to provide semantically and aesthetically valuable results.
Video decomposition: our approach
Perceptual features (visual, audio, quantity of speech) and Semantic features (textual concepts, visual concepts)
A Deep Network learns a semantic embedding space, in which shots belonging to the same scene have lower Euclidean distances
A one-hour video is decomposed into coherent parts and can be browsed in less than one minute
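A minimal sketch of this metric-learning idea, assuming precomputed per-shot feature vectors; the architecture, dimensions and margin are illustrative, not the exact values used in the system.

```python
# Sketch: learn an embedding where shots of the same scene lie closer
# (in Euclidean distance) than shots of different scenes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShotEmbedder(nn.Module):
    def __init__(self, in_dim=4096, emb_dim=128):   # assumed dimensions
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)     # unit-norm embeddings

embedder = ShotEmbedder()
triplet = nn.TripletMarginLoss(margin=0.2)

def loss_on_triplet(anchor, positive, negative):
    """anchor/positive: shot features from the same scene;
    negative: a shot feature from a different scene."""
    return triplet(embedder(anchor), embedder(positive), embedder(negative))
```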
Retrieval: our approach
Hypothesis:
• In broadcast videos, the speaker describes what the video shows
• Retrieval driven by semantic concepts suggested in the transcript
Thumbnail selection
• An aesthetic ranking model using CNN activations and a small training set
Aesthetic-based selection
[Baraldi, Grana, Cucchiara, ACM ICMR 2016]
Aesthetic-based retrieval
“Selecting and ranking thumbnails according to learned perceptual features”
“…The idea of beauty comes from the perception of objects, their proportions, their harmony or unity among the parts, in the evenness of the line and purity of color…”
• Low-level characteristics, like color, edges and sharpness
• High-level features, such as the presence of a clearly visible object in the center
• An excellent match with the hierarchical nature of CNNs
• A ranking strategy which learns the relative importance of these features, given a dataset of user preferences
• VGG-16: more than 4000 convolutional filters, of different sizes
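A hedged sketch of such a ranking strategy: a small head scores pooled CNN activations, trained with a pairwise margin loss so that user-preferred thumbnails score higher. Sizes and names are assumptions, not the paper's exact model.

```python
# Illustrative aesthetic ranking head on top of CNN activations.
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Linear(4096, 256), nn.ReLU(),  # assumed 4096-d input
                       nn.Linear(256, 1))
ranking_loss = nn.MarginRankingLoss(margin=1.0)

def rank_step(feat_preferred, feat_other):
    """feat_preferred was judged more pleasing than feat_other by users."""
    s1 = scorer(feat_preferred).squeeze(-1)
    s2 = scorer(feat_other).squeeze(-1)
    target = torch.ones_like(s1)   # +1: s1 should outrank s2 by the margin
    return ranking_loss(s1, s2, target)
```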
Aligning and searching inside videos
Temporal Match Kernels
A novel compact descriptor for video alignment and retrieval (with a Fourier transform!)
Applications
• Temporal alignment of different videos
• Similarity between videos
• Searching for a piece of video in a video collection
• Searching for an artwork in a video collection
With Facebook AI Research, CVPR 2018
L. Baraldi, M. Douze, R. Cucchiara, H. Jégou "LAMV: Learning to align and match videos with kernelized temporal layers" CVPR 2018
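The Fourier trick can be made concrete: the dot-product similarity between two frame-descriptor sequences at every temporal offset is a circular cross-correlation, computable all at once with FFTs. The sketch below shows only this alignment core, not LAMV's learned kernelized layers.

```python
# Sketch of FFT-based temporal alignment of two descriptor sequences.
import numpy as np

def alignment_scores(x, y):
    """x, y: (T, D) per-frame descriptors, zero-padded to the same length T.
    Returns score[k] = sum_t <x[t], y[t + k]> for all circular offsets k."""
    X = np.fft.rfft(x, axis=0)
    Y = np.fft.rfft(y, axis=0)
    # Correlation theorem along time, then sum similarities over dimensions.
    return np.fft.irfft(np.conj(X) * Y, n=x.shape[0], axis=0).sum(axis=1)

def best_offset(x, y):
    """Temporal offset that best aligns y to x."""
    return int(np.argmax(alignment_scores(x, y)))
```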
Video re-use
Images, shots and scenes can be picked during watching. Selected clips can be used to create new multimodal slides, which can be enriched with text and images.
Decomposing the storytelling structure of a collection of videos enables the creation of new personalized storytellings.
From temporal segmentation to captioning
LSTM networks as language models
At training time: condition on the image and train to predict the next word given the previous (GT) words
[Diagram: an unrolled LSTM generating the caption “a dog carrying a frisbee in a field”, with the ground-truth words fed as inputs at each timestep.]
Using a vocabulary of more than 10,000 words
The image is fed only at the first timestep; the previous ground-truth word, at every following timestep
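A minimal sketch of this training scheme (teacher forcing), under assumed sizes and names: the image feature conditions the LSTM only at the first timestep, and the ground-truth previous word is the input at every later one.

```python
# Illustrative LSTM captioning training step (assumed dimensions).
import torch
import torch.nn as nn

VOCAB, EMB, HID, IMG = 10000, 256, 512, 2048

embed = nn.Embedding(VOCAB, EMB)
img_proj = nn.Linear(IMG, EMB)          # image feature acts as the first input
lstm = nn.LSTM(EMB, HID, batch_first=True)
out = nn.Linear(HID, VOCAB)
criterion = nn.CrossEntropyLoss()

def caption_loss(img_feat, gt_words):
    """img_feat: (B, IMG); gt_words: (B, L) ground-truth word ids."""
    inputs = torch.cat([img_proj(img_feat).unsqueeze(1),   # image at t = 0
                        embed(gt_words[:, :-1])], dim=1)   # GT words after
    hidden, _ = lstm(inputs)
    logits = out(hidden)                                   # (B, L, VOCAB)
    # At each timestep, predict the next ground-truth word.
    return criterion(logits.reshape(-1, VOCAB), gt_words.reshape(-1))
```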
Automatic annotation
Automatically generated captions will be useful for human search, for automatic search by query, and for future query-answering services.
L. Baraldi, C. Grana, R. Cucchiara, "Hierarchical Boundary-Aware Neural Encoder for Video Captioning" CVPR, 2017
M. Cornia, L. Baraldi, G. Serra, R. Cucchiara, "Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention" ACM TOMM, 2017
Generated caption: A woman is looking at a television screen.
Generated caption: A city with a large boat in the water.
Generated caption: A boat is in the water near a large mountain.
Generated caption: A woman in a red jacket is riding a bicycle.
Bridging vision and language in art
Vignette depicting Solomon receiving homage from the princes.
A round with a peacock in a fenced area.
Joseph dropped by the brothers in the well.
A round with two monkeys, one of whom holds a cherub in his arms.
Goals
• Understanding art
• Describing art in natural language
• Retrieving images with natural language queries
Challenges:
• Open research area also in natural images
• Domain shift: visual and textual elements are different from ordinary datasets
BibleVSA dataset
From the Borso d’Este Holy Bible:
Illuminated manuscript (640 pages)
Commentary describing the visual content of each of the illustrations, the decorations of the page, and the textual content itself.
Annotations of the alignment between parts of the commentary and the illustrations
Training of visual-semantic embeddings
Automatic alignment of visual and textual cultural data
Visualizing the domain shift
[Figure: embedding visualizations of visual data (ResNet-152, VGG-19) and textual data (FastText from Facebook AI Research, GloVe, Word2Vec), illustrating the domain shift.]
Building visual-semantic spaces in the DH domain
The unsupervised way
Relying only on the supervision given by non-DH datasets
…a metric learning loss, plus the constraint that the distributions of text and image embeddings should match (Maximum Mean Discrepancy, MMD)
L. Baraldi, M. Cornia, C. Grana, R. Cucchiara, “Aligning text and document illustrations: towards visually explainable Digital Humanities”, submitted to ICPR 2018
[Figure: the learned embedding space, without and with the MMD constraint.]
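A sketch of the two ingredients named above, under assumed batch shapes: a triplet-style metric learning loss on matched illustration/commentary pairs, plus a Gaussian-kernel MMD penalty that pulls the two embedding distributions together. Not the paper's exact formulation.

```python
# Illustrative metric learning + MMD objective for visual-semantic alignment.
import torch

def mmd(x, y, sigma=1.0):
    """Biased MMD^2 estimate with a Gaussian kernel.
    x: (N, D) image embeddings; y: (M, D) text embeddings."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def alignment_loss(img_emb, txt_emb, margin=0.2, lam=0.1):
    """Row i of img_emb and txt_emb is a matching pair; other rows in the
    batch serve as negatives (one per sample here, for brevity)."""
    pos = (img_emb - txt_emb).pow(2).sum(dim=1)
    neg = (img_emb - txt_emb.roll(1, dims=0)).pow(2).sum(dim=1)
    triplet = torch.clamp(margin + pos - neg, min=0).mean()
    return triplet + lam * mmd(img_emb, txt_emb)
```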
Building visual-semantic spaces in the DH domain
Automatic alignment on a single page
L. Baraldi, M. Cornia, C. Grana, R. Cucchiara, “Aligning text and document illustrations: towards visually explainable Digital Humanities”, submitted to ICPR 2018
The CultMedia dataset
We need more data, to tackle more tasks!
Creation of a (medium-to-)large-scale dataset oriented to the Cultural Heritage domain and suitable for automatic understanding tasks, such as:
Artwork identification and retrieval (I can detect, locate and identify an artwork)
Automatic artwork description and retrieval with natural language queries (I can describe an artwork, and retrieve similar ones from other natural language descriptions)
Detection of attributes and relationships inside the artwork (I can identify the people/objects represented in the artwork, and the relationships between them)
Visual grounding of descriptions (I can use that knowledge to ground and justify the descriptions)
Strong link with the re-use spirit of the project.
«A man with a hat holding a glass of wine»
Soldato con Calice (N. Tournier)
Data annotation (#1)
Temporal segmentation of the input video, to isolate the temporal extent of each artwork and the unrelated segments
Artwork detection, i.e. annotating the bounding box of the artwork frame by frame (exploiting the semi-automatic annotation given by the optical flow)
[Timeline: Unrelated | Artwork #1 | Walking | Artwork #2 | …]
Data annotation (#2)
Annotation with metadata, i.e. author, name of the artwork, year, style, …
Captions:
a. describing the content of the artwork without leveraging any cultural background
b. describing the content and the context of the artwork by leveraging a specific cultural background
«A man with a hat holding a glass of wine»
«A Caravaggesque painting in which a soldier seems to establish a cultured dialogue with the spectator, bringing echoes of the classical tradition of the myth of Bacchus into the daily life of an inn»
Data annotation (#3)
Annotation of the details: detection and description of the components of the artwork (objects and people) with their actions and attributes
Grounding of captions, i.e. connecting people, objects, attributes and actions in a natural language sentence: “Nerone is standing in front of Agrippina, who lies on a bed.”
Nerone (person), standing
Agrippina (person), lying
Annotation interface
Ad-hoc web-based annotation interfaces, also integrating existing platforms (VATIC)
Online auditing and control of the annotations.
Preliminary results
A first round of data collection and annotation took place in February
Three groups of five annotators
Each round is scheduled as follows:
1st day: training on the interface, collection and validation of sample annotations on synthetic data
2nd day: visit to the Estense Gallery (Modena), and collection of the data, before and after the visit
3rd to 5th day: annotation and cross-validation
During the first round:
around 2000 natural language descriptions
140 detailed annotations of artworks and their details
annotation of 200 short user-generated videos taken inside the museum.
Research activity on Cultural Heritage
• Layout analysis and content classification on digitized manuscripts
• Browsing and retrieval systems
• Interaction with art
• Video, Vision and Language… teaching machines to understand Art
Thank you! Questions?
[email protected]
[email protected]
http://aimagelab.ing.unimore.it
Thanks to the “Città Educante” Project (CTN01 00034 393801) of the National Technology Cluster on Smart Communities, co-funded by the Italian Ministry of Education, University and Research (MIUR).
Ongoing collaboration with Facebook AI Research (FAIR). Facebook has selected AImageLab as one of 15 world-class research labs in Europe.
Thanks to the “CultMedia” Project of the National Technology Cluster on Smart Communities, co-funded by the Italian Ministry of Education, University and Research (MIUR).