Seminar “Multimodal Rooms” (Multimodale Räume), Summer Semester 2003: Introduction, 7 May, Rainer Stiefelhagen
Multimodal Rooms
“Smart Rooms”, “Intelligent Environments”
Seminar, Summer Semester 2003
User Interfaces
• In the beginning: WIMPy computing
– Windows, Icons, Menus, Pointing
2nd Generation: Human-Machine Interaction
“Please show me… hm… all hotels in THIS area… er… part of the city”
• Speaking
• Pointing
• Gesturing
• Handwriting
• Drawing
• Presence / focus of attention
• Combination
– Speech + handwriting + gesture
– Repair
• Multimodal NLP & dialog
“Perceptual” User Interfaces
• Perceptive
– Human-like perceptual capabilities (what is the user saying, who is the user, where is the user, what is the user doing?)
• Multimodal
– People use multiple modalities to communicate (speech, gestures, facial expressions, …)
• Multimedia
– Text, graphics, audio and video
(Matthew Turk (Ed.), Proceedings of the 1998 Workshop on Perceptual User Interfaces)
Next: Pervasive Computing
Human-computer interaction is not the only exchange
Humans want to interact with other humans
– Computers in the Human Interaction Loop (CHIL)
– The transparent, invisible computer
– Computers need to be context-aware
– They should require little or no learning or attention
– They should be proactive rather than command-driven
– They should produce little or no distraction
– They should permit a mix of HCI and CHIL
Smart/Intelligent Rooms
• Use of computation to enhance everyday activity
• Integrate computers seamlessly into the real world (e.g. offices, homes)
• Use “natural” interfaces for communication (voice, gesture, etc.)
• The computer should adapt to the human, not vice versa!
Perception
• To respond appropriately, objects and rooms need to pay attention to
– People
– Context
• Machines have to be aware of their environment:
– Who, what, when, where and why?
• Interfaces must adapt to
– The overall situation
– The individual user
Intelligent Environments
• Classroom 2000 (Georgia Tech)
• Mozer’s Adaptive House
• Enhanced Meeting Rooms
• Kids Room (MIT)
• …
• Enhanced objects such as whiteboards, desks, chairs, …
• See also the Intelligent Environments Resource Page (http://www.research.microsoft.com/ierp/)
Intelligent Rooms, Univ. California, San Diego
Classroom 2000
• Capturing activity in a classroom
– Speaker’s voice
– Video
– Slides
– Handwritten notes
Classroom 2000
Presenting (recorded) lectures through a web-based interface
• Integration of slides, notes, audio, video
• Searching
• Adding supplementary material
Microsoft Easy Living Project
• XML-based distributed agent system
• Computer vision for person tracking and visual user interaction
• Multiple sensor modalities combined
• Use of a geometric model of the world to provide context
• Automatic or semi-automatic sensor calibration and model building
• Fine-grained events and adaptation of the user interface
• Device-independent communication and data protocols
• Ability to extend the system in many ways
Mozer’s Adaptive House
• Operated as an ordinary home
– Usual light switches, thermostats, doors, etc.
• Adjustments are measured and used to train the house to
– automatically adjust temperature
– adjust lighting
– choose music or TV channel
• The house infers the users’ desires from their actions and behaviours
Adaptive House (Mozer)
Sensors:
• Light Level
• Sound Level
• Temperature
• Motion
• Door status
• Window status
• Light settings
• Fan
• Heaters
• …
(M. Mozer, Univ. of Colorado, Boulder)
Issues in Perception
• Visual
– Face detection / tracking
– Body tracking
– Face recognition
– Gesture recognition
– Action recognition
– Gaze tracking / tracking the focus of attention
• Auditory
– Speech recognition
– Speaker tracking
– Auditory scene analysis
– Speaker identification
• Other: haptic, olfactory, …?
Enhanced Meeting Rooms
Capturing of Meetings
• Transcription
• Summarization
• Dialog processing
• Who was there?
• Who talked to whom?
Work at ISL
• Face tracking
• Facial feature tracking (eyes, nose, mouth)
• Head pose estimation / gaze tracking
• Lip-reading (audio-visual speech recognition)
• 3D person tracking
• Pointing gesture tracking
• Other modalities: speech (!!!, see John), dialogue, translation, handwriting, ...
Tracking of Human Faces
• A face serves different functions:
– identification
– perception of emotional expressions
• Human-computer interaction requires face tracking for:
– lip-reading
– eye/gaze tracking
– facial action analysis / synthesis
• Video conferencing / video telephony applications:
– tracking the speaker
– achieving low-bit-rate transmission
Demo: FaceTracker
Color Based Face Tracking
Human skin colors:
• cluster in a small area of a color space
• differ between people mainly in intensity!
• variance can be reduced by color normalization
• distribution can be characterized by a Gaussian model
Chromatic colors:
$$ r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B} $$
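The chromatic normalization and Gaussian skin model above can be sketched as follows. This is a minimal illustration, assuming 8-bit RGB pixels, a single 2D Gaussian fitted to a handful of hand-picked skin samples, and a Mahalanobis-distance threshold of 3.0; the function names, sample colors and threshold are illustrative assumptions, not the parameters of the FaceTracker demo.

```python
# Sketch: skin-color classification in chromatic (r, g) space.
# Sample colors and the threshold are illustrative assumptions.


def chromatic(rgb):
    """Map an (R, G, B) pixel to chromatic (r, g), discarding intensity."""
    r, g, b = rgb
    s = r + g + b
    if s == 0:
        return (0.0, 0.0)  # black pixel: chromaticity undefined
    return (r / s, g / s)


def fit_gaussian(points):
    """Mean vector and 2x2 covariance matrix of chromatic samples."""
    n = len(points)
    mr = sum(p[0] for p in points) / n
    mg = sum(p[1] for p in points) / n
    srr = sum((p[0] - mr) ** 2 for p in points) / n
    sgg = sum((p[1] - mg) ** 2 for p in points) / n
    srg = sum((p[0] - mr) * (p[1] - mg) for p in points) / n
    return (mr, mg), ((srr, srg), (srg, sgg))


def mahalanobis2(point, mean, cov):
    """Squared Mahalanobis distance of a chromatic point to the model."""
    (a, b), (_, d) = cov
    det = a * d - b * b  # must be > 0 (samples not all on one line)
    dr, dg = point[0] - mean[0], point[1] - mean[1]
    # inverse of [[a, b], [b, d]] is [[d, -b], [-b, a]] / det
    return (d * dr * dr - 2 * b * dr * dg + a * dg * dg) / det


def is_skin(rgb, mean, cov, threshold=3.0):
    """Classify a pixel as skin if it lies within `threshold` std. devs."""
    return mahalanobis2(chromatic(rgb), mean, cov) <= threshold ** 2


# Hypothetical hand-labelled skin samples (bright and dark variants).
skin_samples = [(200, 140, 110), (180, 120, 100), (220, 160, 130),
                (150, 100, 80), (210, 140, 120), (90, 60, 45)]
mean, cov = fit_gaussian([chromatic(p) for p in skin_samples])
```

Because the Gaussian lives in (r, g) space, a darker and a brighter pixel of the same hue map to the same point, which is what makes the model largely independent of the person and of overall brightness, while still sensitive to the color of the light source.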
Color Model
Advantages:
• very fast
• orientation-invariant
• stable object representation
• not person-dependent
• model parameters can be quickly adapted
Disadvantages:
• environment-dependent (light sources heavily affect the color distribution)
Tracking Gaze and Focus of Attention
• In meetings:
– to determine the addressee of a speech act
– to track the participants’ attention
– to analyze who was the center of attention
– for meeting indexing / retrieval
• Interactive rooms:
– to guide the environment’s focus to the right application
– to suppress unwanted responses
• Virtual collaborative workspaces (CSCW)
• Human-robot cooperation
• Cars (driver monitoring)
Tracking a User’s Focus of Attention
• Focus of attention tracking:
– To detect a person’s interest
– To know what a user is interacting with
– To understand the user’s actions/intentions
– To know whether a user is aware of something
• In meetings:
– to determine the addressee of a speech act
– to understand the dynamics of interaction
– for meeting indexing / retrieval
• Other areas:
– Smart environments
– Video conferencing
– Human-robot interaction
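One simple way to turn head orientation into a focus-of-attention estimate is to pick the person whose direction best matches the observer's head pan. The sketch below assumes that head pan is already available in the room's coordinate frame (0° along the +x axis, counter-clockwise positive) and that the participants' positions are known; `focus_target` and this nearest-angle rule are hypothetical simplifications, not the method used in the ISL work.

```python
import math


def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0..180)."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def focus_target(observer_xy, head_pan_deg, targets):
    """Name of the target whose bearing best matches the head pan.

    targets: dict mapping a name to an (x, y) position in the room frame.
    head_pan_deg: head orientation in the same frame (assumed convention:
    0 deg = +x axis, counter-clockwise positive).
    """
    ox, oy = observer_xy
    best_name, best_diff = None, float("inf")
    for name, (tx, ty) in targets.items():
        bearing = math.degrees(math.atan2(ty - oy, tx - ox))
        diff = angular_difference(head_pan_deg, bearing)
        if diff < best_diff:
            best_name, best_diff = name, diff
    return best_name


# Hypothetical seating around a table, observer at the origin.
participants = {"A": (2.0, 0.0), "B": (0.0, 2.0), "C": (-2.0, 0.0)}
```

For example, a head pan of 80° selects B, and −170° selects C. A real system would also have to cope with pose-estimation noise and with the fact that people partly turn their eyes rather than their heads.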
Head Pose Estimation
• Model-based approaches:
– Locate and track a number of facial features
– Compute head pose from 2D-3D correspondences (Gee & Cipolla '94, Stiefelhagen et al. '96, Jebara & Pentland '97, Toyama '98)
• Example-based approaches:
– Estimate new pose with a function approximator such as an ANN (Beymer et al. '94, Schiele & Waibel '95, Rae & Ritter '98)
– Use a face database to encode images (Pentland et al. '94)
Model-based Head Pose estimation
[Figure: image, 3D model and real world (axes X, Y, Z); feature tracking followed by pose estimation]
• Find correspondences between points in a 3D model and points in the image
• Iteratively solve a linear equation system to find the pose parameters (rx, ry, rz, tx, ty, tz)
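The iterative pose computation can be illustrated with a Gauss-Newton sketch: linearize the projection around the current pose, solve the resulting linear system for a pose update, and repeat. It assumes a pinhole camera with known focal length `f` (in pixels), a rotation built as R = Rz·Ry·Rx, a finite-difference Jacobian and a fixed iteration budget; these choices, the helper names and the toy face model are assumptions for illustration, not necessarily the exact formulation used in the cited work.

```python
import math


def rotation(rx, ry, rz):
    """Rotation matrix R = Rz @ Ry @ Rx (angles in radians, assumed order)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [[cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
            [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
            [-sy, cy * sx, cy * cx]]


def project(pose, model, f):
    """Pinhole projection of 3D model points under pose (rx, ry, rz, tx, ty, tz)."""
    R = rotation(*pose[:3])
    t = pose[3:]
    image = []
    for X in model:
        c = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
        image.append((f * c[0] / c[2], f * c[1] / c[2]))
    return image


def residuals(pose, model, observed, f):
    """Stacked (u, v) differences between projected and observed points."""
    res = []
    for (pu, pv), (u, v) in zip(project(pose, model, f), observed):
        res.extend((pu - u, pv - v))
    return res


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= k * M[c][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def estimate_pose(model, observed, f, initial, iterations=20, eps=1e-6):
    """Gauss-Newton: repeatedly solve the linearized system for a pose update."""
    pose = list(initial)
    for _ in range(iterations):
        r0 = residuals(pose, model, observed, f)
        # Finite-difference Jacobian, one column per pose parameter.
        cols = []
        for k in range(6):
            shifted = pose[:]
            shifted[k] += eps
            rk = residuals(shifted, model, observed, f)
            cols.append([(a - b) / eps for a, b in zip(rk, r0)])
        # Normal equations: (J^T J) delta = -J^T r
        JtJ = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(6)]
               for i in range(6)]
        Jtr = [sum(a * b for a, b in zip(cols[i], r0)) for i in range(6)]
        delta = solve(JtJ, [-v for v in Jtr])
        pose = [p + d for p, d in zip(pose, delta)]
    return pose


# Toy 3D facial feature points (units arbitrary) and a synthetic test pose.
face_model = [(-3, 2, 0), (3, 2, 0), (0, 0, -2), (-2, -3, 0), (2, -3, 0), (0, 5, 1)]
true_pose = [0.1, -0.2, 0.05, 1.0, -0.5, 50.0]
observed = project(true_pose, face_model, f=500.0)
estimate = estimate_pose(face_model, observed, 500.0, initial=[0, 0, 0, 0, 0, 45.0])
```

With noise-free correspondences and a nearby starting pose (in a tracker, typically the previous frame's estimate), the iteration recovers the pose used to generate the image points. With real data, every mislocalized feature enters the residuals directly, which is why the accuracy depends so strongly on feature localization.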
Demo: Facial Feature Tracking
Demo: Model-based Head Pose
Model-based Head Pose
• Pose estimation accuracy depends on correct feature localization!
• Problems:
– Choice of good features
– Occlusion due to strong head rotation
– Fast head movement
– Detection of tracking failure / re-initialization
– Requires good image resolution
• Video
Estimating Head Pose with ANNs
• Train a neural network to estimate head orientation
• A preprocessed image of the face is used as input
Network Architecture
• Input retina: up to 3 × 20 × 30 pixels = 1,800 units
• Hidden layer: 40 to 150 units
• Output: pan (or tilt)
Tracking People in a Panoramic View
[Figure: camera view, panoramic view and perspective view]
Training
• Separate nets for pan and tilt
• Trained with standard backpropagation with a momentum term
• Datasets:
– Training on 6,100 images from 12 users
– Cross-validation on 750 images from the same users
– Test on 750 images from the same users
• Additional user-independent test set:
– 1,500 images from two new users
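A minimal sketch of such a network and its training rule: one hidden layer of tanh units, a single linear output (pan or tilt), and online backpropagation with a momentum term. The activation functions, learning rate, momentum value and weight initialization are assumptions, since the slides do not specify them; a toy 2-input regression stands in for the 1,800-unit input retina, and `PoseNet` is a hypothetical name.

```python
import math
import random


class PoseNet:
    """One hidden layer of tanh units and a single linear output unit.

    On the slides the input is up to 1,800 pixel values and the hidden layer
    has 40 to 150 units; separate nets estimate pan and tilt.
    """

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(n_in)
        self.w1 = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
        self.b2 = 0.0
        # Momentum buffers: the previous change of every weight.
        self.v1 = [[0.0] * n_in for _ in range(n_hidden)]
        self.vb1 = [0.0] * n_hidden
        self.v2 = [0.0] * n_hidden
        self.vb2 = 0.0

    def forward(self, x):
        self.h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, self.h)) + self.b2

    def train_step(self, x, target, lr=0.01, momentum=0.5):
        """One online backprop update with momentum (squared-error loss)."""
        err = self.forward(x) - target  # dE/dy for E = 0.5 * err ** 2
        for j, h in enumerate(self.h):
            # Back-propagate through the output weight and tanh' = 1 - h^2.
            delta = err * self.w2[j] * (1.0 - h * h)
            self.v2[j] = momentum * self.v2[j] - lr * err * h
            self.w2[j] += self.v2[j]
            for i, xi in enumerate(x):
                self.v1[j][i] = momentum * self.v1[j][i] - lr * delta * xi
                self.w1[j][i] += self.v1[j][i]
            self.vb1[j] = momentum * self.vb1[j] - lr * delta
            self.b1[j] += self.vb1[j]
        self.vb2 = momentum * self.vb2 - lr * err
        self.b2 += self.vb2


def mean_squared_error(net, data):
    return sum((net.forward(x) - t) ** 2 for x, t in data) / len(data)


# Toy regression task standing in for (face image, pan angle) pairs.
data = [((a / 4, b / 4), 0.5 * a / 4 - 0.3 * b / 4)
        for a in range(-4, 5) for b in range(-4, 5)]
net = PoseNet(n_in=2, n_hidden=8)
error_before = mean_squared_error(net, data)
for epoch in range(200):
    for x, target in data:
        net.train_step(x, target)
error_after = mean_squared_error(net, data)
```

The momentum term reuses a fraction of the previous weight change, which smooths the noisy per-sample updates of online training. In practice the held-out cross-validation set would be used to decide when to stop training.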
Results
Average error in degrees (pan / tilt):

Input   Training set   Test set     New users
histo   6.6 / 5.0      9.4 / 6.9    11.3 / 9.1
edges   6.0 / 2.6      10.8 / 7.1   13.3 / 10.8
both    1.4 / 1.5      7.8 / 5.4    9.9 / 10.3

histo: histogram-normalized image used as input
edges: horizontal and vertical edge images used as input
both: histogram-normalized image plus edge images used as input
Demo
Spatial-Awareness in Smart Rooms
Tracking people indoors
• To focus sensors on people
• To resolve spatial relationships
• To avoid bumping into humans
• To analyze activity
Person Tracking
Vision based localization of people/objects:
• Single perspective: Pfinder, W3S, Hydra, etc.
• Multiple perspectives: AVIARY, Easy Living
Person Tracking in the ISL Smart Room
[Figure: cameras Cam0 to Cam3 observe the people; a feature extractor passes features to the tracking agent]
Person Tracking with Multiple Cameras
Goal: 3D tracking of people in rooms
• Segmentation of foreground objects in each image
• “3D intersection” of the rays through the object centers
• Kalman filter
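The “3D intersection” step can be sketched as a least-squares problem: rays from different cameras rarely meet exactly, so take the point that minimizes the summed squared distance to all rays. The sketch assumes each ray is given by a calibrated camera center and a direction through the segmented object's center; the function names and the camera positions are hypothetical.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(A, b):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(A)  # zero if all rays are parallel: no unique intersection
    x = []
    for k in range(3):
        Ak = [[b[i] if j == k else A[i][j] for j in range(3)] for i in range(3)]
        x.append(det3(Ak) / d)
    return x


def intersect_rays(rays):
    """Point minimizing the summed squared distance to all rays.

    rays: list of (origin, direction) pairs, e.g. a camera center and the
    viewing direction through the centroid of a segmented person.
    Solves sum_i (I - d_i d_i^T) p = sum_i (I - d_i d_i^T) o_i.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for origin, direction in rays:
        norm = math.sqrt(sum(c * c for c in direction))
        d = [c / norm for c in direction]
        for i in range(3):
            for j in range(3):
                m = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += m
                b[i] += m * origin[j]
    return solve3(A, b)


# Three hypothetical cameras viewing a person's center at (2, 1, 3).
rays = [((0, 0, 0), (2, 1, 3)), ((4, 0, 0), (-2, 1, 3)), ((0, 4, 0), (2, -3, 3))]
position = intersect_rays(rays)
```

Each term (I − d dᵀ) projects onto the plane perpendicular to one ray, so the system weights all cameras equally; the resulting 3D position would then be passed to the Kalman filter as a measurement.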
Adaptive Silhouette Extraction
Background subtraction:
• Adaptive Multi-Gaussian background model [Stauffer et al., CVPR 1998]
• Morphological operators smooth foreground output
• Connected components form silhouettes
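A simplified sketch of an adaptive multi-Gaussian background model for a single grayscale pixel, in the spirit of Stauffer et al. It assumes K = 3 components, learning rate α = 0.05, a background weight fraction of 0.7 and a 2.5σ match test; the class name and all parameter values are illustrative, and the mean/variance learning rate is simplified to α. A full implementation would run one such model per pixel (and per color channel).

```python
import math


class PixelMixture:
    """Adaptive mixture of Gaussians for one grayscale pixel (simplified).

    Follows the general scheme of Stauffer et al.; simplification: the
    mean/variance learning rate rho is set to alpha.
    """

    def __init__(self, k=3, alpha=0.05, bg_fraction=0.7, init_sd=30.0, min_var=1.0):
        self.k, self.alpha, self.bg_fraction = k, alpha, bg_fraction
        self.init_var, self.min_var = init_sd ** 2, min_var
        self.comps = []  # each component: [weight, mean, variance]

    def update(self, x):
        """Classify value x (True = foreground), then adapt the model."""
        # Most reliable components first: high weight, low variance.
        self.comps.sort(key=lambda c: c[0] / math.sqrt(c[2]), reverse=True)
        background, cum = [], 0.0
        for c in self.comps:
            background.append(c)
            cum += c[0]
            if cum > self.bg_fraction:
                break
        matched = next((c for c in self.comps
                        if abs(x - c[1]) <= 2.5 * math.sqrt(c[2])), None)
        foreground = matched is None or all(c is not matched for c in background)
        # Weights: w <- (1 - alpha) * w + alpha * M, with M = 1 only for the match.
        for c in self.comps:
            c[0] = (1 - self.alpha) * c[0] + (self.alpha if c is matched else 0.0)
        if matched is not None:
            rho = self.alpha
            matched[1] = (1 - rho) * matched[1] + rho * x
            matched[2] = max(self.min_var,
                             (1 - rho) * matched[2] + rho * (x - matched[1]) ** 2)
        else:
            new = [self.alpha, float(x), self.init_var]
            if len(self.comps) < self.k:
                self.comps.append(new)
            else:
                self.comps[-1] = new  # replace the least reliable component
        total = sum(c[0] for c in self.comps)
        for c in self.comps:
            c[0] /= total
        return foreground


pixel = PixelMixture()
for i in range(50):  # static background with slight noise around 100
    pixel.update(98 if i % 2 == 0 else 102)
```

Because a new value only becomes background after its component has accumulated enough weight, a person who stops moving is absorbed into the background gradually rather than instantly. The foreground decisions of all pixels form the mask that the morphological operators and connected-component step then clean up.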
Locating people
Location hypotheses:
[Figure: silhouettes a and b seen by cameras 1, 2 and 3 yield location hypotheses i) (X,Y) and ii) (X,Y)]
• Extract reference point: Centroid
• Use calibrated sensors to calculate absolute position
• Create a list of location hypotheses
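Extracting silhouettes and their centroids from a foreground mask can be sketched with breadth-first connected-component labeling. The sketch assumes a binary mask given as nested lists and 4-connectivity; `silhouettes` and `min_area` are hypothetical names. Each returned centroid is an image-space reference point that the calibrated cameras would then turn into a location hypothesis.

```python
from collections import deque


def silhouettes(mask, min_area=1):
    """Label 4-connected foreground regions and return their centroids.

    mask: list of rows of 0/1 values (foreground = 1).
    Returns a list of (area, (row_centroid, col_centroid)).
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    result = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill of one silhouette.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(pixels) >= min_area:  # ignore tiny noise blobs
                cy = sum(p[0] for p in pixels) / len(pixels)
                cx = sum(p[1] for p in pixels) / len(pixels)
                result.append((len(pixels), (cy, cx)))
    return result


mask = [[1, 1, 0, 0, 0],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1]]
blobs = silhouettes(mask)
```

Here the mask contains two silhouettes, an area-4 block centered at (0.5, 0.5) and an area-3 column centered at (2.0, 4.0). When two people overlap in one view their silhouettes merge into a single component with a single centroid, which is one of the tracking problems discussed below.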
Tracking people
Best Hypothesis Tracking:
• Match location hypotheses to tracks
• Smooth tracks with a Kalman filter
[Figure: hypotheses i) (X,Y) and ii) (X,Y) are assigned to Track 1 and Track 2]
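Best-hypothesis tracking can be sketched in two parts: a greedy assignment that pairs each track's predicted position with the nearest unused hypothesis, and a constant-velocity Kalman filter that smooths each coordinate of a track. The gating distance, the noise variances and the per-axis decoupling are assumptions made for brevity; `match_hypotheses` and `AxisKalman` are hypothetical names.

```python
import math


def match_hypotheses(predicted, hypotheses, gate=1.0):
    """Greedy best-hypothesis assignment.

    predicted: dict track name -> predicted (X, Y)
    hypotheses: list of (X, Y) location hypotheses
    Returns dict track name -> hypothesis index. Closest pairs are matched
    first; pairs farther apart than `gate` (assumed metres) stay unmatched.
    """
    pairs = sorted((math.dist(p, h), name, i)
                   for name, p in predicted.items()
                   for i, h in enumerate(hypotheses))
    assigned, used = {}, set()
    for dist, name, i in pairs:
        if dist > gate:
            break
        if name not in assigned and i not in used:
            assigned[name] = i
            used.add(i)
    return assigned


class AxisKalman:
    """Constant-velocity Kalman filter for one coordinate axis.

    State [position, velocity]; q and r are assumed process and
    measurement noise variances.
    """

    def __init__(self, pos, q=0.01, r=0.1):
        self.x = [pos, 0.0]
        self.P = [[1.0, 0.0], [0.0, 1.0]]
        self.q, self.r = q, r

    def predict(self, dt):
        p, v = self.x
        self.x = [p + dt * v, v]
        P = self.P
        # P <- F P F^T + q I, with F = [[1, dt], [0, 1]]
        self.P = [[P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + self.q,
                   P[0][1] + dt * P[1][1]],
                  [P[1][0] + dt * P[1][1], P[1][1] + self.q]]
        return self.x[0]

    def update(self, z):
        P = self.P
        s = P[0][0] + self.r                 # innovation variance (H = [1, 0])
        k0, k1 = P[0][0] / s, P[1][0] / s    # Kalman gain
        innovation = z - self.x[0]
        self.x = [self.x[0] + k0 * innovation, self.x[1] + k1 * innovation]
        # P <- (I - K H) P
        self.P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
                  [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
        return self.x[0]


tracks = {"track1": (0.0, 0.0), "track2": (3.0, 0.0)}
hyps = [(2.9, 0.2), (0.1, -0.1), (10.0, 10.0)]
assignment = match_hypotheses(tracks, hyps)

kx = AxisKalman(0.0)
for step in range(1, 201):  # person walking at 1 m/s, observed at 10 Hz
    kx.predict(dt=0.1)
    kx.update(0.1 * step)
```

In this example the two nearby hypotheses are assigned to the two tracks and the distant one stays unmatched (a candidate for a new track). Committing to a single best assignment per frame is what makes this scheme fragile when silhouettes merge, which motivates the probabilistic multi-hypothesis tracking listed among the countermeasures.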
Tracking Problems
Imperfect and merged silhouettes
Countermeasures:
• Better vision algorithms
• Probabilistic multi-hypothesis tracking
• Head as reference point
Reference point: Head
• Use the head as reference point instead of the centroid
• The head-based tracker has a significantly lower tracking error and false alarm rate
[Figure: bar chart of tracking error and false alarm rate (scale 0 to 0.1) for head vs. centroid as reference point]
Demo
Recognition of Pointing Gestures
• Goals:
– Recognize human pointing gestures
– Extract the pointing direction in 3D
• Application areas:
– Human-robot interaction
– Smart rooms
• Requirements:
– Person-independent
– Real-time operation
– Camera motion possible
Recognition of Pointing Gestures
[Figure: stereo camera; left and right images]
3D Tracker: Processing Steps
[Figure: camera image, skin color, disparity]
3D clustering of skin-colored pixels provides cues to the positions of the head and hands.
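The processing chain can be sketched for a rectified stereo pair: skin-colored pixels with a valid disparity are back-projected to 3D, then grouped so that nearby points form one blob. The focal length, baseline, principal point and clustering radius are assumed parameters, and single-linkage clustering is a stand-in chosen for brevity; the slides do not say which 3D clustering method was used.

```python
import math


def backproject(u, v, disparity, f, baseline, cx, cy):
    """3D point (X, Y, Z) of pixel (u, v) in a rectified stereo pair.

    f: focal length in pixels, baseline in metres, (cx, cy): principal point.
    Depth follows Z = f * baseline / disparity.
    """
    z = f * baseline / disparity
    return ((u - cx) * z / f, (v - cy) * z / f, z)


def cluster_points(points, radius):
    """Single-linkage clustering: points closer than `radius` share a cluster."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < radius:
                parent[find(i)] = find(j)
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    # Largest clusters first: candidates for head and hands.
    return sorted(groups.values(), key=len, reverse=True)


# Assumed camera: f = 500 px, baseline 0.12 m, principal point (320, 240).
center = backproject(320, 240, 20.0, f=500.0, baseline=0.12, cx=320.0, cy=240.0)
skin_points = [(0.0, 0.0, 3.0), (0.02, 0.0, 3.0), (0.5, 0.0, 3.0),
               (0.51, 0.01, 3.0), (1.0, 0.0, 3.0)]
clusters = cluster_points(skin_points, radius=0.1)
```

Working in 3D rather than in the image separates a hand from the face behind it even when their skin-colored regions overlap in the 2D view, because their depths differ.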
Gesture Recognition: Motion Phases
• Pointing gestures consist of three intuitively distinguishable motion phases:
– Begin
– Hold
– End
• Accurate localization of the hold phase is important for determining the pointing direction
Average duration of the motion phases:

Phase              μ [s]   σ [s]
Complete gesture   1.75    0.48
Begin              0.52    0.17
Hold               0.76    0.40
End                0.47    0.12
Gesture Recognition: Models
• The 3 phases are modeled with separate models
• Continuous HMMs with 2 Gaussians per state
• A null model serves as threshold for the phase models
• Training on hand-labeled data
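To score a feature sequence with one of the phase models, the forward algorithm can be run in log space. The sketch below assumes diagonal-covariance Gaussians, and its model parameters are toy values; the class and function names are hypothetical. In the scheme above, each phase model's score would be compared against the null model's score on the same data.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def safe_log(p):
    return math.log(p) if p > 0 else -math.inf


class GaussianMixtureHMM:
    """Continuous HMM whose states emit from small Gaussian mixtures.

    states[s]: list of (weight, mean, var) tuples (here 2 per state).
    trans[i][j]: transition probability i -> j; start[i]: initial probability.
    """

    def __init__(self, start, trans, states):
        self.log_start = [safe_log(p) for p in start]
        self.log_trans = [[safe_log(p) for p in row] for row in trans]
        self.states = states

    def log_emission(self, s, x):
        return logsumexp([math.log(w) + log_gauss_diag(x, m, v)
                          for w, m, v in self.states[s]])

    def log_likelihood(self, seq):
        """Forward algorithm in log space: log P(seq | model)."""
        n = len(self.states)
        alpha = [self.log_start[s] + self.log_emission(s, seq[0]) for s in range(n)]
        for x in seq[1:]:
            alpha = [logsumexp([alpha[i] + self.log_trans[i][j] for i in range(n)])
                     + self.log_emission(j, x) for j in range(n)]
        return logsumexp(alpha)


# Toy two-state left-to-right model with 1-D features (values are made up).
model = GaussianMixtureHMM(
    start=[1.0, 0.0],
    trans=[[0.7, 0.3], [0.0, 1.0]],
    states=[[(0.5, [0.0], [1.0]), (0.5, [0.5], [1.0])],
            [(0.5, [5.0], [1.0]), (0.5, [5.5], [1.0])]])
forward_score = model.log_likelihood([[0.1], [0.2], [5.1], [5.2]])
reversed_score = model.log_likelihood([[5.2], [5.1], [0.2], [0.1]])
```

Because the model is left-to-right, the sequence that visits the states in the trained order scores much higher than the reversed one. Working with log-probabilities avoids numerical underflow on long sequences.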
Gesture Recognition: Detection
• A pointing gesture is detected when three points in time $t_B < t_H < t_E$ are found such that
– $P_E(t_E) > P_B(t_E)$ and $P_E(t_E) > 0$
– $P_B(t_B) > P_E(t_B)$ and $P_B(t_B) > 0$
– $P_H(t_H) > 0$
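The detection rule can be checked with a single left-to-right scan. The sketch assumes that `pb`, `ph` and `pe` hold, per frame, the score of the begin, hold and end model relative to the null model, so that a positive value means the phase model beats the null model; the function name and the example scores are hypothetical.

```python
def detect_pointing(pb, ph, pe):
    """Return (t_B, t_H, t_E) satisfying the detection rule, or None.

    Taking the earliest valid t_B, then the earliest valid t_H after it,
    then the earliest valid t_E after that finds a triple whenever one exists.
    """
    n = len(pb)
    t_b = next((t for t in range(n) if pb[t] > pe[t] and pb[t] > 0), None)
    if t_b is None:
        return None
    t_h = next((t for t in range(t_b + 1, n) if ph[t] > 0), None)
    if t_h is None:
        return None
    t_e = next((t for t in range(t_h + 1, n) if pe[t] > pb[t] and pe[t] > 0), None)
    if t_e is None:
        return None
    return (t_b, t_h, t_e)


# Hypothetical per-frame scores relative to the null model.
pb = [-1.0, 2.0, 0.5, -1.0, -2.0]
ph = [-1.0, -1.0, 1.0, 0.5, -1.0]
pe = [-1.0, -1.0, -1.0, 0.2, 1.0]
detection = detect_pointing(pb, ph, pe)
```

Here the begin model wins at frame 1, the hold model is above the null model at frame 2, and the end model wins at frame 3. The hold frame is the one from which the pointing direction would then be read.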
Gesture Recognition: Features
• Feature vector: (r, Δθ, Δy)
• Experiments: cylindrical coordinates work better than spherical and Cartesian ones
• Hand position relative to the head: independent of the person's position in the room
• Δθ, Δy: no adaptation to the pointing targets seen in training
• Spline interpolation of the feature sequences to a constant 40 Hz
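The feature computation can be sketched as follows, assuming 3D head and hand positions with the y axis pointing up, and reading Δθ and Δy as frame-to-frame changes of the hand's azimuth and height around the head. Linear interpolation stands in here for the spline interpolation named above, to keep the example short; the function names are hypothetical.

```python
import math


def cylindrical_features(heads, hands):
    """Per-frame (r, dtheta, dy) of the hand relative to the head.

    heads, hands: equal-length lists of (x, y, z) positions with y pointing
    up (assumed). r is the horizontal head-hand distance; dtheta and dy are
    assumed to be frame-to-frame changes of the hand's azimuth and height.
    """
    cyl = []
    for (hx, hy, hz), (px, py, pz) in zip(heads, hands):
        dx, dy, dz = px - hx, py - hy, pz - hz
        cyl.append((math.hypot(dx, dz), math.atan2(dz, dx), dy))
    features = []
    for (r, theta, y), (_, theta_prev, y_prev) in zip(cyl[1:], cyl):
        # Wrap the azimuth change into [-pi, pi).
        dtheta = (theta - theta_prev + math.pi) % (2 * math.pi) - math.pi
        features.append((r, dtheta, y - y_prev))
    return features


def resample(times, values, rate=40.0):
    """Linearly resample a scalar sequence to a constant rate in Hz."""
    out, i, k = [], 0, 0
    while True:
        t = times[0] + k / rate
        if t > times[-1] + 1e-9:
            break
        # Advance to the segment [times[i], times[i + 1]] containing t.
        while i < len(times) - 2 and times[i + 1] < t:
            i += 1
        w = (t - times[i]) / (times[i + 1] - times[i])
        out.append(values[i] + w * (values[i + 1] - values[i]))
        k += 1
    return out


heads = [(0.0, 1.6, 0.0), (0.0, 1.6, 0.0)]
hands = [(0.3, 1.2, 0.0), (0.0, 1.3, 0.3)]
feats = cylindrical_features(heads, hands)
r_40hz = resample([0.0, 0.1], [0.0, 1.0])
```

Measuring the hand relative to the head removes the dependence on where the person stands, and using changes of angle and height instead of absolute values keeps the models from memorizing the particular targets pointed at during training.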
Pointing Direction
• Head-hand line
– Line of sight from eye to hand
– Easy to measure
• Forearm line
– Potentially superior when the arm is bent
– Harder to measure
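Given a pointing direction, the pointed-at location on a surface can be found by intersecting the pointing line with a plane. The sketch assumes the target surface (for example a wall) is known as a point and a normal vector in room coordinates; the function name and the coordinates are hypothetical, and the forearm line could be used in the same way by passing elbow and hand positions instead of head and hand.

```python
def pointing_target(head, hand, plane_point, plane_normal):
    """Intersect the head-hand line with a plane; None if there is no hit."""
    d = [b - a for a, b in zip(head, hand)]  # pointing direction
    denom = sum(n * di for n, di in zip(plane_normal, d))
    if abs(denom) < 1e-9:
        return None  # line parallel to the plane
    s = sum(n * (p - h) for n, p, h in zip(plane_normal, plane_point, head)) / denom
    if s <= 0:
        return None  # plane lies behind the person
    return tuple(h + s * di for h, di in zip(head, d))


# Hypothetical wall at x = 3 m; head at 1.7 m height, hand stretched forward.
target = pointing_target((0.0, 1.7, 0.0), (0.5, 1.5, 0.5),
                         (3.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

In this example the person points at the spot (3.0, 0.5, 3.0) on the wall. Small errors in the hand position are amplified with distance, which is why the choice between head-hand line and forearm line matters for distant targets.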
Audio-Visual Speech Recognition
Lip Tracking Module
• Feature-based
• Detects localization failures and automatically recovers from them
• Tracks facial features (pupils, nostrils, lips)
Audio-Visual Recognition
$$ \mathrm{hyp}_c = \lambda_a\,\mathrm{hyp}_a + \lambda_v\,\mathrm{hyp}_v, \qquad \lambda_a + \lambda_v = 1 $$
Combination methods:
• SNR weights
• Entropy weights
• Trained weights
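The weighted combination and two of the weighting schemes can be sketched as follows. The hypothesis scores are assumed to be log-probabilities per word, the SNR weight is an assumed linear ramp between 0 and 30 dB, and the entropy weight gives more weight to the stream whose posterior distribution is less uncertain; all names and ranges are illustrative. Trained weights would instead be fitted on held-out data.

```python
import math


def snr_weight(snr_db, low=0.0, high=30.0):
    """Acoustic weight lambda_a from SNR: a linear ramp clipped to [0, 1]."""
    return min(1.0, max(0.0, (snr_db - low) / (high - low)))


def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weight(p_audio, p_video):
    """lambda_a from entropies: the less uncertain stream gets more weight."""
    h_a, h_v = entropy(p_audio), entropy(p_video)
    if h_a + h_v == 0:
        return 0.5  # both streams fully confident: weight them equally
    return h_v / (h_a + h_v)


def combine(log_a, log_v, lam_a):
    """hyp_c = lambda_a * hyp_a + lambda_v * hyp_v with lambda_v = 1 - lambda_a."""
    lam_v = 1.0 - lam_a
    return {w: lam_a * log_a[w] + lam_v * log_v[w] for w in log_a}


scores = combine({"yes": math.log(0.6), "no": math.log(0.4)},
                 {"yes": math.log(0.2), "no": math.log(0.8)},
                 lam_a=0.5)
best = max(scores, key=scores.get)
```

With equal weights, the confident visual stream overrides the weakly preferred acoustic hypothesis here. In noisy conditions an SNR- or entropy-based λ_a shifts weight toward the visual stream automatically.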
Fusion Levels
• Word level (vote; decide based on A and V scores)
• Phoneme level (combine using different weighting schemes)
• Feature level (combine features)
Audio-Visual Speech
[Figure: recognition results (scale 0 to 100) for acoustic, visual and combined recognition under clean conditions, 16 dB SNR and 8 dB SNR]
Possible Topics
• Person tracking
• Gesture recognition
• Attentive interfaces
• Face detection
• Lip-reading (audio-visual speech recognition)
• Audio-visual tracking
• Emotion recognition
• Person identification
• Microphone arrays
• Sensor fusion
• Smart room infrastructure
• Intelligent camera control
• Self-calibration
• Other smart room projects (MIT, Georgia Tech, IM2)
• Other sensors: pressure, IR, etc.
• Speech recognition
– in meetings
– far-field
– efficient
• Microphone arrays