Post on 18-Dec-2015
1
The ALERT System: Audiovisual Broadcast Speech Transcription for Selective Dissemination of
Multimedia Information
Gerhard Rigoll
Munich University of Technology
Institute for Human-Machine Communication
Munich, Germany
rigoll@ei.tum.de
2
General project dates
ALERT system for selective dissemination of multimedia information
• Official start: 01/2000, start of work: 03/2000, duration: 30 months
• Manpower effort: ~30 MY (man-years) ---> Budget: ~1.6 million euros EC funding
• Web site: http://alert.uni-duisburg.de
3
Media information flooding
[Figure labels: Internet, NEWS; supervision by information brokers]
4
Media monitoring in the ALERT project
[Diagram labels: Internet; NEWS information (sound, video, text), e.g. "today's headlines ..."; transcription; topic detection; example topic: TAXES; ALERT MESSAGE]
5
General project objectives
• To develop a demo system capable of identifying specific information in multimedia data consisting of text, audio and video streams, using
  • advanced speech recognition
  • video processing techniques
  • automatic topic detection algorithms
• The demonstrator shall
  • alert a user about the existence of requested information
  • send detailed information (on the client's further request): extracted text, annotated audio/video data and video clips
  • provide functionality in French, German and Portuguese
• The demo system will be evaluated mainly by industrial partners
6
The ALERT consortium
[Figure: consortium partners grouped under the labels integration, technologies, users]
7
WP structure (WP0-WP4)
[Gantt chart over months 1 to 16 with a "today" marker; legend: deliverable, milestone]
• WP0 Management: Management Committee meetings; Management Reports
• WP1 User needs, market study, specs: User needs and market study; Demonstrator specification
• WP2 Multilingual common structure: Pilot data available; Definition of the common structure; Common resources for all languages
• WP3 Information indexing and struct.: Automatic transcription; Audio- and video-based segmentation
• WP4 Automatic topic detection
8
WP structure (WP5-WP7)
[Gantt chart over months 1 to 16 with a "today" marker; legend: deliverable, milestone]
• WP5 System development and integ.: Infrastructure and Media Integration; Multimedia Document Structuration; Access and Interaction; System Integration
• WP6 Evaluation: Evaluation plans; Test and evaluation of the Portuguese system; Test and evaluation of the French system; Test and evaluation of the German system
• WP7 Exploitation and Dissemination: Scientific; Commercial
9
Collection of pilot corpus
• First step to set up similar resources
• Purpose: testbed for assessing methods for data collection, annotation and distribution
• Collection guidelines:
  • Minimum amount: 5 hours
  • Type of data: video, audio and annotation
  • Video format: MPEG1
  • Audio format: PCM linear, 16 kHz sampling rate, 16 bits/sample, mono, collected from antenna
  • Annotation based on LDC guidelines
  • Thematic orientation: news and interview shows
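The audio guideline above (linear PCM, 16 kHz, 16 bits/sample, mono) can be checked mechanically. The following is a minimal sketch using Python's standard `wave` module; the function names `meets_pilot_audio_spec` and `make_silence` are illustrative assumptions and not part of the ALERT tooling.

```python
import io
import wave

# Target format taken from the pilot-corpus collection guidelines.
SAMPLE_RATE_HZ = 16_000
SAMPLE_WIDTH_BYTES = 2   # 16 bits/sample
CHANNELS = 1             # mono


def meets_pilot_audio_spec(wav_file) -> bool:
    """Return True if the WAV data is 16 kHz, 16-bit, mono, uncompressed PCM."""
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getframerate() == SAMPLE_RATE_HZ
            and wav.getsampwidth() == SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == CHANNELS
            and wav.getcomptype() == "NONE"
        )


def make_silence(rate_hz: int, channels: int, seconds: float = 0.1) -> io.BytesIO:
    """Build a short silent 16-bit WAV in memory (hypothetical test helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(rate_hz)
        n_frames = int(rate_hz * seconds)
        wav.writeframes(b"\x00" * (n_frames * channels * SAMPLE_WIDTH_BYTES))
    buf.seek(0)
    return buf


print(meets_pilot_audio_spec(make_silence(16_000, 1)))
print(meets_pilot_audio_spec(make_silence(44_100, 2)))
```

The check accepts a file path or a file-like object, so the same function could be run over a directory of recordings before annotation starts.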
10
Collection of final databases
• Experimental results: recommendations for final corpus
  • quality: mp3, 32 kbps, 16 kHz, mono
• Minimum amount:
  • speech recognition: 50 hours (training), 3 hours (development), 3 hours (evaluation), word-labelled
  • topic detection: 300 hours, topic annotated
  • text corpus: 100 million words
• Full data set:
  • 1300 hours word or topic annotated
  • > 10k topic annotated summaries in German
  • text corpus: > 1 billion words
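To make the effect of the recommended coding concrete, the sketch below compares the raw audio storage for the pilot format (16 kHz, 16-bit, mono PCM) with mp3 at 32 kbps. It counts audio payload only; video, annotations and container overhead are ignored, and the helper name `audio_bytes` is an assumption made for illustration.

```python
def audio_bytes(hours: float, bits_per_second: int) -> int:
    """Audio payload size in bytes for a given duration and bit rate."""
    return int(hours * 3600 * bits_per_second / 8)


PCM_BPS = 16_000 * 16 * 1   # 16 kHz x 16 bits x mono = 256 kbit/s
MP3_BPS = 32_000            # recommended mp3 bit rate

pilot_pcm = audio_bytes(5, PCM_BPS)      # pilot corpus minimum, PCM
full_pcm = audio_bytes(1300, PCM_BPS)    # full data set if stored as PCM
full_mp3 = audio_bytes(1300, MP3_BPS)    # full data set as 32 kbps mp3

print(f"pilot corpus, PCM : {pilot_pcm / 1e6:.0f} MB")
print(f"full set, PCM     : {full_pcm / 1e9:.2f} GB")
print(f"full set, mp3     : {full_mp3 / 1e9:.2f} GB")
print(f"compression ratio : {PCM_BPS / MP3_BPS:.0f}:1")
```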
11
Comparison of coding schemes for broadcast speech databases
12
Multimedia data labeling and alert generation
[Flow diagram. Boxes: multimedia document; video/image processing; speech processing; automatic topic detection; match topics found against user profiles; multimedia document database; label database; alert specific users. Branch conditions: if video contained, if audio contained, if text contained. Edge labels: segmentation; topic keywords; video-based; transcription; best hypothesis; word graph.]
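The final step of this slide, matching the topics found against user profiles to alert specific users, can be sketched as a set intersection. The data model below (`UserProfile`, `generate_alerts`, the segment identifiers) is hypothetical and only illustrates the matching idea, not the demonstrator's actual interfaces.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """A user and the topics that user asked to be alerted about."""
    user: str
    topics: set[str] = field(default_factory=set)


def generate_alerts(segment_topics: dict[str, set[str]],
                    profiles: list[UserProfile]) -> list[tuple[str, str, list[str]]]:
    """Return (user, segment_id, matched_topics) for every profile hit."""
    alerts = []
    for segment_id, topics in segment_topics.items():
        for profile in profiles:
            matched = profile.topics & topics   # topics both detected and requested
            if matched:
                alerts.append((profile.user, segment_id, sorted(matched)))
    return alerts


profiles = [UserProfile("alice", {"TAXES"}), UserProfile("bob", {"ELECTIONS"})]
segment_topics = {"seg1": {"TAXES", "SPORT"}, "seg2": {"WEATHER"}}
alerts = generate_alerts(segment_topics, profiles)
print(alerts)
```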
13
Basic principle of video segmentation
Stochastic video model (based on HMMs)
[State diagram. Content classes: Begin, Newscaster, Interview, Report, Weather Forecast, End. Editing effects linking them: Cut, Dissolve, Wipe, Window Change. Composite class: Interview = Newscaster and Interviewed Person shots separated by cuts.]
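A stochastic video model of this kind can be decoded with the Viterbi algorithm: each content class is an HMM state, and the most likely state sequence yields the segmentation. The sketch below is a toy version under stated assumptions: three states only, discrete per-frame observations ("static", "motion", "map") instead of real image features, and invented probabilities. None of these values come from the ALERT system.

```python
import math

# Hypothetical content classes and model parameters (illustration only).
STATES = ["newscaster", "report", "weather"]
START = {"newscaster": 0.8, "report": 0.1, "weather": 0.1}
TRANS = {
    "newscaster": {"newscaster": 0.90, "report": 0.08, "weather": 0.02},
    "report":     {"newscaster": 0.10, "report": 0.88, "weather": 0.02},
    "weather":    {"newscaster": 0.05, "report": 0.05, "weather": 0.90},
}
EMIT = {
    "newscaster": {"static": 0.80, "motion": 0.15, "map": 0.05},
    "report":     {"static": 0.20, "motion": 0.75, "map": 0.05},
    "weather":    {"static": 0.20, "motion": 0.10, "map": 0.70},
}


def viterbi(observations):
    """Return the most likely state sequence for a list of observations."""
    # score[s] = log-probability of the best path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][obs]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


frames = ["static"] * 3 + ["motion"] * 3 + ["map"] * 3
print(viterbi(frames))
```

Segment boundaries are the positions where the decoded state changes; in a full model the editing effects (cut, dissolve, wipe) would be states of their own.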
14
Result of video-based segmentation
15
Combined video-audio segmentation
[Figure: two timelines over frames 0 to 20000, reference video segmentation and automatic video segmentation. Segment classes: weather forecast, intro, speaker, report, speaker with interview partner.]
16
Topic segmentation
Results: video-based detection of topic boundaries is feasible
• precision rate = 1 - insertion rate = 88.2 %
• recall rate = 1 - deletion rate = 82.2 %
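The two rates above can be computed from lists of reference and detected boundaries. The sketch below assumes that the insertion rate is the share of detected boundaries with no matching reference boundary, that the deletion rate is the share of reference boundaries that were missed, and that a match is allowed within a frame tolerance. The slide does not state these details, so they are assumptions, as is the greedy one-to-one matching.

```python
def boundary_scores(reference, detected, tolerance=0):
    """Return (precision, recall) for detected boundaries.

    Both lists must be non-empty. Each reference boundary can be matched
    by at most one detected boundary within `tolerance` frames.
    """
    unmatched = sorted(reference)
    hits = 0
    for boundary in sorted(detected):
        match = next((r for r in unmatched if abs(r - boundary) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    insertions = len(detected) - hits    # detected, but not in the reference
    deletions = len(reference) - hits    # in the reference, but missed
    precision = 1 - insertions / len(detected)
    recall = 1 - deletions / len(reference)
    return precision, recall


reference = [100, 500, 900, 1300]
detected = [105, 495, 700, 1310, 1500]
precision, recall = boundary_scores(reference, detected, tolerance=10)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```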
17
French BN speech recognizer
• continuous density HMM system
• 33 phones + 3 non-speech units (silence, filler words, breath)
• ~20% WER (on news)
• 65k dictionary
• automatic pronunciation with manual verification
• 58 hours acoustic training data, 350 million words of text
• RT decoding: 5700 states, 92k Gaussians
• 10xRT decoding: 11000 states, 350k Gaussians
• 4-gram language model: 15M bi-, 15M tri-, 13M four-grams
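The bigram, trigram and four-gram figures quoted for the language model are counts of distinct n-grams kept in the model. As a rough illustration of what is being counted, the sketch below collects n-grams from a toy corpus; the sentence padding with `<s>` and `</s>` is an assumed convention, and real language-model toolkits additionally apply count cutoffs and smoothing that are not shown here.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count n-grams over whitespace-tokenized sentences with boundary padding."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


corpus = ["le président a parlé", "le président a signé"]
bigrams = ngram_counts(corpus, 2)
fourgrams = ngram_counts(corpus, 4)
print(len(bigrams), "distinct bigrams,", len(fourgrams), "distinct four-grams")
```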
18
Portuguese BN speech recognizer
• Based on the AUDIMUS LVCSR system
• Hybrid system based on MLP/HMM techniques
• Combination of different acoustic models (product of posterior probabilities)
• 38 phones + silence, 57k dictionary
• 4-gram LM: 5M bi-, 12M tri-, 13M four-grams
• Trained on 13 h of BN data
• Results: 15xRT: F0: ~20%, All F: ~40 %
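Combining acoustic models by the product of posterior probabilities means multiplying, for each frame, the phone posteriors that the individual models output and renormalizing. The sketch below shows that operation for a single frame in the log domain; the function name, the probability floor and the unweighted product are assumptions, and the actual AUDIMUS combination may differ in detail.

```python
import math


def combine_posteriors(model_posteriors, floor=1e-12):
    """Multiply per-phone posteriors from several models and renormalize.

    `model_posteriors` is a list of dicts mapping phone -> posterior,
    all defined over the same phone set.
    """
    phones = model_posteriors[0].keys()
    # Sum of logs equals the log of the product; the floor avoids log(0).
    log_scores = {
        ph: sum(math.log(max(post[ph], floor)) for post in model_posteriors)
        for ph in phones
    }
    peak = max(log_scores.values())
    unnormalized = {ph: math.exp(s - peak) for ph, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {ph: v / total for ph, v in unnormalized.items()}


model_a = {"a": 0.6, "e": 0.4}
model_b = {"a": 0.5, "e": 0.5}
combined = combine_posteriors([model_a, model_b])
print(combined)
```

A model that is uncertain (uniform posteriors, as `model_b` here) leaves the other model's decision unchanged, while two confident models that agree reinforce each other.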
19
German Baseline Speech Recognition System
20
German BN speech recognizer
• continuous density HMM system
• 50 phones + 17 non-speech units (silence, filler words, breath, rustle, ...)
• ~20 % WER (initial DuDeutsch: >70 % WER)
• 100k dictionary
• initial pronunciation from CELEX, compound word construction
• 10xRT: 30-90k Gaussians
• 3-gram (cached) language model, 8M bi-, 16M trigrams
21
Evolution of the German system
system | phone models | #mixtures | WER
baseline German system, 100k, spontaneous speech | triphones | 31 780 | ~30%
baseline, not trained on broadcast data | triphones | 31 780 | 79.7%
baseline with broadcast language model | triphones | 31 780 | 72.3%
acoustic models trained on broadcast data | monophones | 1 722 | 54.3%
acoustic models optimized on broadcast data | triphones | 96 417 | 22.8%
22
Examples for German transcription results
(Ref = reference transcript, Hyp = recognizer output)
Ref: viele menschen auch heute noch in provisorischen notunterkünften .
Hyp: viele menschen auch heute noch einen froh wie so daß nur unterkünften
Ref: zweitausend beben ganze ortschaften zerstört .
Hyp: zweite außen beben also ortschaften so stück
Ref: muß der anreiz zur zusätzlichen privaten vorsorge erhöht werden .
Hyp: mußte ein reiz zum zusätzlichen privater vorsorge erhöht werden
Ref: zuschuß bekommen . dafür will sich die csu arbeitnehmerunion
Hyp: zuschuß bekommen dafür wie sich die csu arbeitnehmer im juni
Ref: mit rund zusätzlichen zwo komma fünf millionen mark muß der landkreis
Hyp: mitte und zusätzlichen zwo komma fünf millionen mal muß der lahn kreis
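The word error rates quoted in this talk are based on the word-level edit distance between reference and recognizer output. A minimal sketch of that computation follows, taking the first line of each example pair as the reference. Punctuation tokens are dropped by hand in the usage lines, which is an assumption about the scoring convention.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


ref = "zweitausend beben ganze ortschaften zerstört"
hyp = "zweite außen beben also ortschaften so stück"
print(f"WER = {word_error_rate(ref, hyp):.1%}")
```

Because insertions count as errors, the rate can exceed 100 % when the recognizer splits compounds into several shorter words, which is the typical failure in these examples.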
23
Automatic topic detection
Objectives:
• to automatically divide audio/video streams into topic-specific homogeneous segments
• automatic assignment of requested topics to distinct segments
Test set:
• 22 topics in 2956 training and 1284 test texts
• deletion of 150 stop words
• no stemming performed
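The test setup above (stop-word deletion, no stemming, one topic per text) can be illustrated with a small baseline classifier. The sketch below uses a multinomial naive Bayes model with add-one smoothing as a stand-in technique; it is not the project's topic detector. The stop-word list, class name and training texts are invented for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical short stop-word list; the slide reports 150 deleted stop words.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, drop stop words; no stemming."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


class NaiveBayesTopics:
    def __init__(self):
        self.word_counts = defaultdict(Counter)   # topic -> word frequencies
        self.doc_counts = Counter()               # topic -> number of texts
        self.vocabulary = set()

    def train(self, texts_with_topics):
        for text, topic in texts_with_topics:
            words = tokenize(text)
            self.word_counts[topic].update(words)
            self.doc_counts[topic] += 1
            self.vocabulary.update(words)

    def rank(self, text):
        """Return all topics, most likely first."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for topic, counts in self.word_counts.items():
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(self.doc_counts[topic] / total_docs)
            score += sum(math.log((counts[w] + 1) / denominator) for w in words)
            scores[topic] = score
        return sorted(scores, key=scores.get, reverse=True)


model = NaiveBayesTopics()
model.train([
    ("taxes rise in the budget", "TAXES"),
    ("government cuts income taxes", "TAXES"),
    ("rain and wind in the forecast", "WEATHER"),
    ("sunny weather forecast", "WEATHER"),
])
print(model.rank("new taxes in the budget"))
```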
24
New approach to topic detection
[Diagram elements: input text ("This is a text containing important topics."); word probabilities p(w1) p(w2) p(w3) ...; MMI neural net; VQ label shown as a one-hot vector [0 0 ... 0 1 0 0 ... 0]]
25
Results for clean text
[Bar charts, vertical axis 0 to 100: (1) comparison of new approach and standard system (bars: new approach, compared system); (2) comparison of feature quantization with k-means clustering and MMI neural net (bars: k-means, MMI)]
26
Partially corrupted text
Results with partially corrupted texts:
• some words are fragmented, similar to speech recognition output
• 22 topics in 3037 training and 1319 test texts
• no stop words
• no stemming
27
Results for corrupted text
[Bar charts, vertical axis 0 to 100, bars for new approach and compared system: (1) 173 topics, shown for 1 best, 2 best, 3 best and 4 best; (2) 22 topics]
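The "1 best" to "4 best" labels in the chart suggest an n-best evaluation, in which a text counts as correct if its true topic appears among the top n topics returned by the detector. That reading is an assumption, since the slide does not define the measure. A minimal sketch with invented rankings:

```python
def n_best_accuracy(ranked_topics, true_topics, n):
    """Share of texts whose true topic is among the first n ranked topics."""
    hits = sum(truth in ranking[:n]
               for ranking, truth in zip(ranked_topics, true_topics))
    return hits / len(true_topics)


# Hypothetical detector output: one ranked topic list per test text.
rankings = [
    ["TAXES", "SPORT", "WEATHER"],
    ["SPORT", "TAXES", "WEATHER"],
    ["WEATHER", "SPORT", "TAXES"],
]
truths = ["TAXES", "TAXES", "TAXES"]

for n in (1, 2, 3):
    print(f"{n}-best accuracy: {n_best_accuracy(rankings, truths, n):.2f}")
```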
28
Demonstrator specification (details)
[Architecture diagram; modules and their components as labelled]
• Inputs: TV/Internet broadcast news; radio broadcast news; Internet news texts
• DATA CAPTURE: Data Acquisition; Format Conversion & Compression; Program Identification; Program Descriptions Database
• VIDEO TOOLS: Video Extraction; Video Segmentation; Video Labelling: content classes, editing effects
• AUDIO TOOLS: Audio Extraction; Audio Segmentation; Audio Labelling: speaker, acoustic conditions, channel, language; Speech Transcription
• TOPIC DETECTION: Topic Detection; User Profiles Database
• DATA STORAGE: Database Interface
• ALERT GENERATION: User; Alert Generator: Web | Email | WAP Info | ...
• USER RETRIEVAL: Retrieval Interface
29
Publications
ICASSP 2001 (7/2001)
• LIMSI: Automatic transcription of compressed broadcast audio
• GMUD: New approaches to audio-visual segmentation of TV news for automatic topic retrieval
TREC-9 (11/2000)
• LIMSI: The LIMSI SDR system for TREC-9
argus press (11/2000)
• Observer: Observer Argus Media beteiligt sich am EU-Forschungsprojekt ALERT
ICSLP 2000 (10/2000)
• GMUD: Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches
• INESC: The Use of Syllable Segmentation Information in Continuous Speech Recognition Hybrid Systems Applied to the Portuguese Language
• INESC: Combination of Acoustic Models in Continuous Speech Recognition Hybrid Systems
30
Publications (II)
ICSLP 2000 (10/2000)
• LIMSI: Fast decoding for indexation of broadcast data
• LIMSI: Investigating text normalization and pronunciation variants for German broadcast transcription
ECDL 2000, 4th European Conference on Research and Advanced Technology for Digital Libraries (9/2000)
• INESC: Topic Detection in Read Documents
ASR 2000 (9/2000)
• INESC: A Decoder for Finite-State Structured Search Spaces
ICASSP 2000 (6/2000)
• GMUD: A Novel Error Measure for the Evaluation of Video Indexing Systems
31
Presentations
Schaufenster der Wissenschaft (3/2001)
• GMUD: Informationen aus Radio, Fernsehen und Internet: Automatische Themenerkennung in Multimedia-Daten
Euromap Informationstag (12/2000)
• GMUD: Das Projekt ALERT - Alert system for selective dissemination of multimedia information
IV Jornadas de Arquivo e Documentação (10/2000)
• INESC: Speech recognition and topic detection applied to alert systems for broadcast news
ASR 2000 (9/2000)
• GMUD: ALERT System for Selective Dissemination of Multimedia Information
Homme Technologie et Systèmes Complexes (6/2000)
• VECSYS: Parlez Naturellement, la Machine Vous Comprend
RIAO'2000 Content-based Multimedia Information Access (4/2000)
• VECSYS, LIMSI: An Audio Transcriber for Broadcast Document Indexation
32
Outlook
• use of additional data
• cross-talker situations
• enlarged number of topics
• improving rejection mechanisms of unknown topics (confidence for topics)
• detection of new topics
• summarization
• scalable summarization
• topic-dependent summarization