Application of Audio and Video Processing Methods for Language Research
Przemyslaw Lenkiewicz, Peter Wittenburg, Oliver Schreer, Stefano Masneri, Daniel Schneider, Sebastian Tschöpel
Max Planck Institute for Psycholinguistics, Fraunhofer Heinrich Hertz Institute
The Language Archive – Max Planck Institute for Psycholinguistics
Nijmegen, The Netherlands
Application of Audio and Video Processing Methods for Language Research
Przemyslaw Lenkiewicz, Peter Wittenburg
Oliver Schreer, Stefano Masneri
Daniel Schneider, Sebastian Tschöpel
Max Planck Institute for Psycholinguistics
Fraunhofer Heinrich Hertz Institute
Fraunhofer IAIS Institute
AVATECH
Advancing Video and Audio Technology in Humanities research
AVATECH
Max Planck Institute for Psycholinguistics
Fraunhofer Heinrich Hertz Institute
Fraunhofer IAIS Institute
Annotations
Base of research analysis
Annotations – challenges
• Annotations are of different types, almost all manual
• Different quality, conditions – mostly bad
• Different languages – mostly minority languages
• Annotation time is anything between 10 and 100 times the length of the media
Manual Annotation Gap
We have around 200 TB of data at MPI, in particular digitized audio/video recordings, brain images, hand-tracking data, etc.
An increasing share of this data is neither described nor annotated.
[Figure: MPI Digital Archive. Archive size in terabytes (axis 0 to 300) per year, 2000 to 2013, shown for organized and annotated data and for not annotated data. Chart note: switch to lossless mJPEG2000, HD video and brain imaging.]
AVATecH Main Goals
• Reduce the time necessary for annotating.
• Develop communication interfaces and human-machine interfaces.
• Develop A/V processing algorithms.
Recognizers
• Small applications executed from ELAN
• They serve specific purposes and recognize specific things
• They usually create annotations or visualize things for you
• Aimed at tasks that can be trivial but time-consuming
RECOGNIZERS
Audio recognizers
• Audio segmentation
  – Autonomously splits the audio stream into homogeneous segments
  – Approach: model-free, based on clustering with the help of the Bayesian information criterion
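The idea of using the Bayesian information criterion (BIC) to find segment boundaries can be sketched as follows. This is a minimal illustration, not the AVATecH recognizer: it assumes a single one-dimensional feature per frame (for example frame energy), one Gaussian per segment, and hypothetical names (`delta_bic`, `best_change_point`, `margin`, `penalty_weight`). A split is accepted only if two Gaussians explain the data better than one after the BIC penalty for the extra parameters.

```python
import math


def log_var(xs):
    """Log of the maximum-likelihood variance, floored to avoid log(0)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return math.log(max(var, 1e-12))


def delta_bic(xs, split, penalty_weight=1.0):
    """Delta-BIC of modelling xs as two Gaussians split at `split` versus one."""
    n, n1, n2 = len(xs), split, len(xs) - split
    gain = 0.5 * (n * log_var(xs)
                  - n1 * log_var(xs[:split])
                  - n2 * log_var(xs[split:]))
    # A 1-D Gaussian has two free parameters (mean and variance),
    # so the second model pays for two extra parameters.
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty


def best_change_point(xs, margin=5, penalty_weight=1.0):
    """Return (index, delta_bic) of the best split, or None if no split helps."""
    scored = [(delta_bic(xs, i, penalty_weight), i)
              for i in range(margin, len(xs) - margin + 1)]
    if not scored:
        return None
    score, index = max(scored)
    return (index, score) if score > 0 else None


# Toy signal (assumed data): a quiet stretch followed by a loud stretch.
signal = [0.0, 0.1] * 10 + [5.0, 5.1] * 10
print(best_change_point(signal))
```

A real segmenter would typically use multi-dimensional spectral features and slide this test over a growing window; the sketch only shows the decision rule.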
Audio recognizers
• Audio segmentation: Goals
  – Find coherent parts in a recording
  – Detect speaker changes
  – Detect environment changes
  – Detect utterances
  – Preprocessing step for speaker ID, clustering
• Speech/Non-speech detection
  – Detects whether a segment contains speech or not
  – Approach: offline training of Gaussian Mixture Models for speech and non-speech; for each segment, the model with the highest likelihood is selected
  – Integrates a further user-driven feedback mechanism
• Local Speaker clustering
  – Joins and labels segments according to the underlying speaker
  – Approach: iterative calculation of the Bayesian Information Criterion and BIC-dependent merging of speech-segment combinations
  – Baseline tested on single documents with mediocre results; robustification needed
• Speaker Identification
  – Identifies well-known speakers from given speech segments
  – Approach: based on adapted Gaussian Mixture Models and probability functions
  – Currently developing a fast, iterative training workflow to train a speaker model for detection
• Language Independent Alignment
  – Accurate alignment between speech and text in a multilingual context.
• Query-by-example:
  – Accurate alignment between speech and text in a multilingual context.
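The speech/non-speech approach listed above (one Gaussian Mixture Model per class, pick the class whose model gives the segment the highest likelihood) can be illustrated with a small sketch. Everything concrete here is an assumption for illustration: the two-dimensional features, the model parameters in `MODELS` (which stand in for models that would be trained offline), and the function names. Training and the user feedback loop are not shown.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of x under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of all frames under a GMM.

    `gmm` is a list of (weight, mean, variance) components.
    """
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        peak = max(comps)  # log-sum-exp trick for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total


def classify_segment(frames, models):
    """Return the label of the model with the highest likelihood."""
    return max(models, key=lambda label: gmm_log_likelihood(frames, models[label]))


# Hypothetical, hand-picked parameters standing in for trained models.
MODELS = {
    "speech": [(0.5, [1.0, 0.5], [0.5, 0.5]),
               (0.5, [2.0, 1.5], [0.5, 0.5])],
    "non-speech": [(1.0, [-1.0, -1.0], [0.5, 0.5])],
}

segment = [[1.1, 0.6], [1.9, 1.4]]  # assumed feature frames of one segment
print(classify_segment(segment, MODELS))
```

Summing per-frame log-likelihoods treats frames as independent, which is the usual simplification when scoring a whole segment against each model.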
RECOGNIZERS EXAMPLE
Detect how many persons are in the video, detect who is speaking and when, create the appropriate number of tiers and annotations for all of them, and align their speech with the transcription from a text file.
Video recognizers
Shot detection/keyframe extraction
Skin color estimation
Hand/Head Detection & Tracking
We can calculate
• Boundaries of the gesture space
• Speed, acceleration of hand movement
• Segment recording into units:
  – Stroke
  – Hold
  – Retreat
• Hand movement related to body
• Which parts of speech overlap with gestures
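Some of the quantities listed above follow directly from the tracked hand positions. The sketch below is a rough illustration under stated assumptions: the track is a list of (x, y) pixel positions, one per frame, the frame rate is known, and the function names and the speed threshold are hypothetical. It separates only low-speed "hold" frames from movement; telling stroke from retreat would need more than speed alone.

```python
import math


def speeds(track, fps):
    """Hand speed between consecutive frames, in pixels per second."""
    return [math.hypot(q[0] - p[0], q[1] - p[1]) * fps
            for p, q in zip(track, track[1:])]


def accelerations(speed_values, fps):
    """Change of speed between consecutive speed samples, in pixels per second squared."""
    return [(b - a) * fps for a, b in zip(speed_values, speed_values[1:])]


def gesture_space(track):
    """Bounding box (min_x, min_y, max_x, max_y) covered by the hand."""
    xs, ys = zip(*track)
    return (min(xs), min(ys), max(xs), max(ys))


def label_holds(speed_values, threshold):
    """Mark each inter-frame interval as 'hold' (slow) or 'move'."""
    return ["hold" if s < threshold else "move" for s in speed_values]


# Assumed toy track: the hand moves for two frames, then stays still.
track = [(0, 0), (3, 4), (6, 8), (6, 8), (6, 8)]
v = speeds(track, fps=25)
print(v)
print(accelerations(v, fps=25))
print(gesture_space(track))
print(label_holds(v, threshold=10.0))
```

Relating hand movement to the body would amount to expressing the same track relative to the tracked head position instead of image coordinates.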
Hand/Head Detection & Tracking
• Demo (ellipses video)