
Page 1: Tony TUNG @ Matsuyama Lab., Kyoto University 2007-2014

Dynamic Surface

Modeling & Applications

Tony Tung

Matsuyama Laboratory, Kyoto University

2005.07-2005.08

2008.06-2014.09

Page 2

Self-introduction

Interests: computer vision, pattern recognition,

shape modeling, human-computer interaction

Tony TUNG

Matsuyama Laboratory

Graduate School of Informatics, Kyoto University

2005/07/01 - 2005/08/31 : JSPS Summer program (postdoc)

2008/06/01 - 2010/01/31 : Postdoc + JSPS short-term postdoc

2010/02/01 - 2014/09/30 : Assistant Professor (CREST - Kawahara Laboratory)

KAKENHI Wakate B (×2)

JSPS AYAME

Microsoft Research Azure project

Contact: tonytung.org

Page 3

3D Video is:

- Free-viewpoint video

- Image-based system for full surface capture of objects in

motion

- Markerless technique

3D video: full 3D object in motion

3D Video project

Page 4

[Matsuyama et al., CVIU'04]

3D Video project

Applications: preservation of intangible cultural heritage,

medicine (e.g., gait analysis), entertainment (movies,

sport replay), etc.

Page 5

3D Video project

T. Matsuyama, S. Nobuhara, T. Takai, T. Tung

Springer 2012 (book)

Page 6

3D video framework

- Current 3D video studio (3rd) at Kyoto University

• Reconstruction space: 3 m × 3 m × 3 m

• Green background for chroma keying, fluorescent/LED lamps

• 16 video cameras, 1600×1200 @ 25 fps

• Grasshopper IEEE 1394b

• Synchronization by external trigger

• Geometrically calibrated

• PC cluster of 2 nodes (8 cameras per PC)

Page 7

3D video framework

• 3D video data = sequence of 3D mesh models

– Frame-by-frame reconstruction using multiview stereo techniques

Page 8

Page 9

3D Video Reconstruction [CVPR08] [ICCV09]

Page 10

3D video reconstruction

3D video reconstruction from multiview stereo:

[Matsuyama et al., CVIU’04] [Matsuyama et al., Springer’12]

++ using temporal cues:

[Tung et al., CVPR’08] [Tung et al., ICCV’09]

Page 11

3D video super resolution

Image-based super resolution of 3D video

[Tung et al., CVPR’08]

Page 12

Stereo probabilistic fusion

3D video reconstruction from wide baseline stereo and SfM

probabilistic fusion

[Tung et al., ICCV’09]

Page 13

Data size issue

• One or several subjects in the 3D video studio

• 3D surface reconstruction by MVS technique

• Volumetric graph-cuts (5mm resolution)

• Each 3D model = 1.5 MB (30,000 triangles)

• 5 min of 3D video = 11.25 GB

How can we manage this large amount of data?

- How to search (analysis, visualization)

- How to handle inconsistency (storage, transfer)
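The data-size figures above can be reproduced with a quick back-of-the-envelope check (a sketch using only the frame rate and per-model size quoted on these slides):

```python
# Raw 3D video data-size estimate, using the figures from the slide:
# 25 fps capture, ~1.5 MB per reconstructed mesh (~30,000 triangles).
FPS = 25
MODEL_MB = 1.5
MINUTES = 5

frames = FPS * 60 * MINUTES           # number of 3D models in the sequence
total_gb = frames * MODEL_MB / 1000   # MB -> GB (decimal units)
print(frames, total_gb)               # 7500 models, 11.25 GB
```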

Page 14

Topology Dictionary

for 3D Video Understanding [CVPR07] [CVPR09] [PAMI12]

Page 15

3D video sequence

Sequential reconstruction

(Inconsistent topology between frames)

Topology can be used to characterize 3D video data

Page 16

Topology dictionary for 3D video understanding

• Abstraction levels

– Topology-based shape description (frame level)

– Probabilistic motion graph modeling (sequence level)

• Applications

– Analysis: segmentation, annotation, action recognition

– Content-based encoding: summarization, skimming

– Data size compression: storage, streaming

[Tung et al., CVPR’09]

[Tung et al., PAMI’12]

[Matsuyama et al., Springer’12]

Page 17

Topology-based shape description

Morse theory

μ : S → ℝ, with μ a real continuous function

S : manifold surface (mesh surface)

Reeb graph = quotient space of the graph of μ in S × ℝ,

defined by the equivalence relation ~ :

∀(X, Y) ∈ S², X ~ Y ⇔ μ(X) = μ(Y) and X, Y belong to the same connected component of μ⁻¹(μ(X))

[Reeb, 1946]
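On a discrete mesh the quotient can be sketched directly: vertices are binned by the value of μ, connected components within a bin become Reeb-graph nodes, and mesh edges crossing between components become graph edges. A minimal sketch only; `adjacency` and `mu` are hypothetical inputs, and real implementations such as the multiresolution Reeb graph refine the intervals hierarchically.

```python
def discrete_reeb_graph(adjacency, mu, n_bins):
    """Quotient a mesh graph by level sets of mu: vertices fall into
    n_bins intervals of mu; each connected component inside an interval
    becomes one Reeb-graph node; nodes are linked when a mesh edge
    crosses between them.  adjacency: vertex -> set of neighbours."""
    lo, hi = min(mu.values()), max(mu.values())
    width = (hi - lo) / n_bins or 1.0
    bin_of = {v: min(int((mu[v] - lo) / width), n_bins - 1) for v in mu}

    comp, next_id = {}, 0
    for v in mu:                      # flood fill inside each bin
        if v in comp:
            continue
        comp[v] = next_id
        stack = [v]
        while stack:
            u = stack.pop()
            for w in adjacency[u]:
                if w not in comp and bin_of[w] == bin_of[u]:
                    comp[w] = next_id
                    stack.append(w)
        next_id += 1

    edges = {tuple(sorted((comp[u], comp[w])))
             for u in adjacency for w in adjacency[u] if comp[u] != comp[w]}
    return comp, edges

# A 4-vertex chain with mu = height, split into 2 intervals:
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
mu = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}
comp, edges = discrete_reeb_graph(adj, mu, 2)
```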

Page 18

Topology-based shape description

• Multiresolution Reeb graphs

[Hilaga et al., SIGGRAPH’01]

[Tung et al., CVPR’07]

- Automatic extraction of graphs

- R, t, scale invariant

- Homotopic

- Multiresolution coarse-to-fine matching

Page 19

Topology-based shape description

Page 20

Reeb graph evaluation

• Robustness to surface noise

Page 21

Reeb graph vs. skeleton

• “Automatic” 3D shape description

Page 22

Topology matching

- Invariance to rotation, translation and scale

- Matching using topological and geometrical

attributes (valence, relative area)

- Coarse-to-fine multiresolution strategy

- Similarity of two models M, N from similarity of topology-consistent node pairs {(mi, nj)} at every level of resolution:

SIM(M, N) = Σ_{r=0..R} Σ_{{i,j}} sim(mi, nj)

[Hilaga et al., SIGGRAPH’01]

[Tung et al., CVPR’07]
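The similarity sum can be written directly from the formula. A sketch: the matched pairs per resolution level are assumed given here, whereas the real matcher selects them coarse-to-fine using the valence and relative-area attributes.

```python
def reeb_similarity(matched_pairs_per_level, sim):
    """SIM(M, N): sum node-pair similarities sim(m_i, n_j) over all
    resolution levels r = 0..R, restricted to the topology-consistent
    matched pairs retained at each level."""
    return sum(sim(m, n)
               for pairs in matched_pairs_per_level
               for m, n in pairs)

# Toy example: two levels, node attributes are relative areas,
# pair similarity is the smaller of the two areas.
pairs = [[(1.0, 0.8)],                 # level r = 0 (coarsest)
         [(0.5, 0.5), (0.3, 0.2)]]     # level r = 1
score = reeb_similarity(pairs, sim=lambda m, n: min(m, n))
```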

Page 23

Performance evaluation

Pose retrieval in 3D video sequences

[Huang et al., 3DPVT'10]

Page 24

Topology clusters

• Dataset clustering using similarity evaluation

Distance matrix {1 - SIM}

Page 25

Topology clusters

• Dataset clustering using similarity evaluation

Repeated poses

Long poses

Short poses

Transitions

Distance matrix {1 - SIM}

Page 26

Topology clusters

• Clustering of (repetitive) atomic actions

Page 27

Topology clusters

• Clustering of (repetitive) atomic actions


Page 28

Topology clusters

• Clustering of (repetitive) atomic actions

Page 29

Topology clusters

• Motion graph structure SIGGRAPH’02: [Arikan&Forsyth] [Kovar et al.] [Lee et al.]

• using statistics on cluster size and occurrence


Page 30

Topology clusters


SUMMARIZATION

• Motion graph structure SIGGRAPH’02: [Arikan&Forsyth] [Kovar et al.] [Lee et al.]

• using statistics on cluster size and occurrence

Page 31

Topology clusters


3D VIDEO SKIMMING

• Motion graph structure SIGGRAPH’02: [Arikan&Forsyth] [Kovar et al.] [Lee et al.]

• using statistics on cluster size and occurrence

Page 32

3D video skimming

Page 33

3D video annotation

• Add semantic information to each topology cluster

Page 34

3D video skimming and annotation

Topology dictionary for 3D video understanding

[Tung et al., CVPR’09] [Tung et al., PAMI’12]

(captions should be automatically displayed with this video)

Page 35

CG models as prior

Page 36

3D video annotation

Page 37

3D video annotation

Topology dictionary for 3D video understanding

[Tung et al., CVPR’09] [Tung et al., PAMI’12]

(captions should be automatically displayed with this video)

Page 38

Invariant Surface Descriptor for

3D Video Encoding [ACCV12] [TVC14]

Page 39

3D video encoding

• 3D video data size is big

– Several GB for a few minutes of a high-resolution sequence

• Impractical for data storage/management

• Impractical for data streaming over network

• Data structure inconsistency prevents existing compression approaches from being efficient

Page 40

3D video encoding

Approach: Geometry image technique (3D to 2D transform)

• Cut open 3D meshes and re-parameterize on plane

• Apply lossless compression (2D video)

See [Gu et al., SIGGRAPH’02]

for synthetic data

Page 41

3D video encoding

Approach: Geometry image technique (3D to 2D transform)

• Cut open 3D meshes and re-parameterize on plane

• Apply lossless compression (2D video)

Solution: stabilize the cuts for optimal encoding

Page 42

Page 43

3D video encoding

Possible scenarios:

1. Meshes are consistent (share same connectivity)

• Synthetic datasets

2. Meshes are inconsistent (different connectivity, resolution)

• Tracking & remeshing [Cagniart et al., ECCV’10]

• Point-to-point surface alignment

“Geodesic mapping”

[Tung et al., CVPR’10][Tung et al., PAMI’14]

Page 44

3D video encoding

Possible scenarios:

1. Meshes are consistent (share same connectivity)

• Synthetic datasets

2. Meshes are inconsistent (different connectivity, resolution)

• Tracking & remeshing [Cagniart et al., ECCV’10]

• Point-to-point surface alignment [Tung et al., CVPR’10] [Tung et al., PAMI’14]

• Geometrical data are inconsistent in time (e.g., raw 3D video)

– Adaptive bitrate streaming (where resolution can vary)

Deformation invariant surface descriptor

[Tung et al., ACCV’12] [Tung et al., Vis. Comp.’14]

Page 45

Invariant shape descriptor

• Define a surface-based shape descriptor

– Graph defined on object’s surface

– Nodes are geodesically consistent across time

• E.g., surface extremal points

[Tung et al., IJSM05] [Tung et al., PAMI12]

Page 46

Invariant shape descriptor

• Define a surface-based shape descriptor

– Graph defined on object’s surface

– Nodes are geodesically consistent across time

• E.g., surface extremal points

– Edges join the nodes

• Defined as paths on the surface

• Maintained geodesically consistent across time

– Using the previous position of the path (vertices)

– Using the shortest path between nodes

• Probabilistic framework (MAP-MRF) to handle

surface non-rigid deformations

Page 47

Page 48

Invariant shape descriptor

1. Invariant to surface deformation and parametrization

2. One-shot parametrization

3. Usable as cut graphs

Page 49

Invariant shape descriptor

Page 50

3D video encoding

Invariant surface-based descriptor for 3D video encoding

[Tung et al., ACCV’12] [Tung et al., Vis. Comp.’14]

Page 51

Dynamic Surface Alignment[CVPR10] [PAMI14]

Page 52

Point-to-point surface alignment

For:

– Shape matching

(retrieval, comparison)

– Motion tracking

– Texture transfer

– …

– 3D video encoding

– Surface dynamics

Page 53

Point-to-point surface alignment

Appearance-based

• color, corners, local features; e.g., see [Ahmed et al., CVPR08]

Geometry-based

• local geometry properties

• mapping/diffusion functions: spherical [Starck et al., ICCV05], embedding [Bronstein et al., TVCG07], multiple maps [Kim et al., SIGGRAPH11], spectral matching [Lombaert et al., PAMI13]

• patch deformation [Cagniart et al., ECCV10]

Have to deal with:

- Inconsistent colors from multiple views

- Poor texture (e.g., solid color clothing)

- Surface noise

Usual process:

1. Find landmark points

2. Refine (interpolate)

Page 54

Point-to-point surface alignment

Sought: a surface mapping between S1 and S2, with a metric on each surface, such that the mapping is a diffeomorphism

Have to deal with:

- Inconsistent colors from multiple views

- Poor texture (e.g., solid color clothing)

- Surface noise

Page 55

Geodesic mapping

1. Define landmark points using geometry-

based approach

2. Choose the landmark points with

minimum ambiguity (coarse-to-fine

strategy)

3. Refine by propagation

Have to deal with:

- Inconsistent colors from multiple views

- Poor texture (e.g., solid color clothing)

- Surface noise

See preliminary work in [Tung et al., CVPR’10]

Page 56

Geodesic mapping

Model:

• Define a smooth bijective map between two manifolds

(S1, g1) and (S2, g2)

• g1 and g2 are geodesic distances

Geodesic consistency of v1 ∈ S1 and v2 ∈ S2 :

• Assuming two sets of N points B1 = {b1,…,bN} ⊂ S1 and B2 = {b’1,…,b’N} ⊂ S2

• ∀i ∈ {1,…,N}, |g1(v1, bi) − g2(v2, b’i)| ≤ ε

The global geodesic distance measures distortion between surface points w.r.t. the N points.
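The consistency condition translates into one line of code. A sketch: the two lists hold precomputed geodesic distances from a candidate pair (v1, v2) to the N landmarks; real distances would come from e.g. Dijkstra on the mesh graph.

```python
def geodesically_consistent(g1_to_landmarks, g2_to_landmarks, eps):
    """Slide's condition: v1 and v2 correspond if for every landmark
    index i, |g1(v1, b_i) - g2(v2, b'_i)| <= eps.
    Inputs are the distances from v1 (resp. v2) to each landmark."""
    return all(abs(a - b) <= eps
               for a, b in zip(g1_to_landmarks, g2_to_landmarks))

# Distances to N = 3 landmarks on two surfaces:
ok = geodesically_consistent([1.0, 2.0, 0.5], [1.05, 1.9, 0.5], eps=0.2)
bad = geodesically_consistent([1.0, 2.0], [1.0, 3.0], eps=0.2)
```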

Page 57

Geodesic mapping

Overview:

Bt ⊂ St and Bt+1 ⊂ St+1 are surface extremal points (see [Tung et al., PAMI2012])

Surface extremal points are critical points of the geodesic integral function μ

Page 58

Geodesic mapping

Overview:

Bt ⊂ St and Bt+1 ⊂ St+1 are surface extremal points (see [Tung et al., PAMI2012])

Geodesic consistency condition can be broken

when surfaces undergo non-rigid deformations!

Page 59

Geodesic mapping

Overview:

Bt ⊂ St and Bt+1 ⊂ St+1 are surface extremal points (see [Tung et al., PAMI2012])

Ambiguity degree A(v ∈ S) for point localization:

measure of the number of points geodesically consistent with v w.r.t. B ⊂ S

Geodesic consistency condition can be broken

when surfaces undergo non-rigid deformations!

Page 60

Geodesic mapping

Recursive mapping:

• Recursively choose Ni points in regions

of low ambiguity w.r.t. N landmarks

• Find corresponding points using N’ ≤ N

(N’ = max number of isoline intersections)

• Set N = Ni

Page 61

Geodesic mapping

N = Ni

Page 62

Geodesic mapping

• Refinement by MRF optimization:

Labeling problem

Global geodesic distance D_N w.r.t. B^t = {b_i^t} and B^{t+1} = {b_i^{t+1}} :

T_p(l_p) : orientation of (p, l_p)

Page 63

Geodesic mapping

• Experimental results: point-to-point surface alignment between consecutive frames

Page 64

Geodesic mapping point-to-point surface alignment

[Cagniart et al., ECCV10] as ground truth [Spectral method] = [Lombaert et al., PAMI13]

Page 65

Geodesic mapping point-to-point surface alignment

[Cagniart et al., ECCV10] as ground truth [Spectral method] = [Lombaert et al., PAMI13]

[Misreconstruction]

Page 66

Geodesic mapping point-to-point surface alignment

• Quantitative evaluations

(compared with [Lombaert et al., PAMI’13] and [Kim et al., SIGGRAPH’11])

Page 67

Geodesic mapping

• Topology change

Regions where no topology change occurred are not affected

[Kim et al., SIGGRAPH’11]

Page 68

Geodesic mapping

• Applications

Page 69

Intrinsic Characterization of

Dynamic Surface[CVPR13] [CVPR14]

Page 70

Natural object dynamics modeling

• Natural scenes are complex but contain statistical regularities

– e.g., water, fire, human actions, etc.

• Dynamics modeling has been used for complex

scene segmentation and classification

– Dynamic textures

• Linear Dynamical Systems (distances, BoS)

[Doretto, IJCV02] [Chan, CVPR05] [Ravichandran, CVPR09]

– Dynamic facial events

• Timing structure of LDS

[Kawashima et al., 2007~2010]

Page 71

Real-world surface dynamics

•Real-world objects in motion exhibit local deformation statistics

•Observation of intrinsic geometry

[Tung et al., CVPR13]

Page 72

Real-world surface dynamics

Bouncing sequence

Shape index observation across time [Koenderink, Vis. Comp. ‘92]

•Real-world objects in motion exhibit local deformation statistics

•Observation of intrinsic geometry

Page 73

Real-world surface dynamics

Samba sequence

Shape index observation across time [Koenderink, Vis. Comp. ‘92]

•Real-world objects in motion exhibit local deformation statistics

•Observation of intrinsic geometry

Page 74

Intrinsic geometry

• Local topology descriptor (Koenderink shape index):

s = (2/π) · arctan((k2 + k1) / (k2 − k1)) ∈ [−1, 1], where k1, k2 are the principal curvatures (k1 ≤ k2)

The shape index varies continuously with respect to surface deformation.
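The descriptor is a one-liner from the principal curvatures. A sketch: this uses one common form of Koenderink's formula (sign conventions vary between papers), and leaves umbilical points (k1 = k2), where the index is undefined, unhandled.

```python
import math

def shape_index(k1, k2):
    """Koenderink shape index from principal curvatures (k1 <= k2):
    s = (2/pi) * arctan((k2 + k1) / (k2 - k1)),  s in [-1, 1].
    Undefined at umbilical points where k1 == k2."""
    if k1 > k2:
        k1, k2 = k2, k1
    if k1 == k2:
        raise ValueError("shape index undefined at umbilical points")
    return (2.0 / math.pi) * math.atan((k2 + k1) / (k2 - k1))

# A symmetric saddle (k = -1, +1) maps to s = 0; a cylinder-like
# patch (k = 0, 1) maps to s = 0.5.
saddle = shape_index(-1.0, 1.0)
ridge = shape_index(0.0, 1.0)
```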

Page 75

Intrinsic geometry

• The average shape index variance gives information on deformation location and relative magnitude

• However, it does not contain information about

acceleration patterns or timing structure

Shape index variance averaged over the sequence.

Page 76

Surface deformation dynamics

• After surface alignment, surface points can be

tracked across time

• Observation of temporal variations of shape

index at each surface point

• Characterization per surface patch

Page 77

Surface deformation dynamics

• After surface alignment, surface points can be

tracked across time

• Observation of temporal variations of shape

index at each surface point

• Characterization per surface patch

Free sequence

Page 78

Surface deformation dynamics

• Dynamics modeling using Hybrid Linear Dynamical

System [Kawashima et al., ICIAP’07]

– Hidden state variable with Markovian dynamics

• Continuous hidden state variable x(t)

• Noisy measurements y(t)

– Linear-Gaussian model

• Y = { y(t) } : observations

• X = { x(t) } : hidden states in continuous state space

• Fi : transition matrix that models the dynamics of Di

• H : observation matrix mapping hidden states to system output by linear

projection

• gi : bias vector, vi(t) : measurement noise, w(t) : observation noise

[Doretto, IJCV02] [Chan, CVPR05] [Ravichandran, CVPR09]
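One LDS of this model can be simulated in a few lines. A minimal sketch with made-up 2-D dimensions, not the paper's fitted model: the hidden state follows x(t+1) = F x(t) + g + v(t) and is observed through y(t) = H x(t) + w(t).

```python
import random

def simulate_lds(F, g, H, x0, steps, noise=0.0, seed=0):
    """Simulate one linear dynamical system D_i:
        x(t+1) = F x(t) + g + v(t)    (hidden state, Markovian)
        y(t)   = H x(t) + w(t)        (noisy linear observation)
    Matrices are lists of rows; v, w are iid Gaussian, std `noise`."""
    rng = random.Random(seed)

    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    x, ys = list(x0), []
    for _ in range(steps):
        ys.append([h + rng.gauss(0.0, noise) for h in matvec(H, x)])
        x = [f + gi + rng.gauss(0.0, noise)
             for f, gi in zip(matvec(F, x), g)]
    return ys

# A contracting 2-D system observed through its first coordinate:
F = [[0.9, 0.0], [0.0, 0.5]]
g = [0.0, 0.0]
H = [[1.0, 0.0]]
ys = simulate_lds(F, g, H, x0=[1.0, 1.0], steps=3)
```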

Page 79

Surface deformation dynamics

• Dynamics modeling using Hybrid Linear Dynamical

System [Kawashima et al., ICIAP’07]

– Model LDS state durations and transitions (i.e., timing

structure)

Page 80

Surface deformation dynamics

– Model state durations and transitions (i.e., timing structure)

Page 81

Bag-of-Systems

• Keypoint classification using bag-of-systems

– Bag-of-feature framework

– Codebook obtained by k-medoid clustering

• Codewords accounting for timing distribution

– Soft-weighting accounting for relative state duration

• Classification using SVM with RBF kernel

– Rigid/non-rigid regions

Page 82

Rigidity-based classification

- Collection of N = 4 LDS per patch

- K=8 codewords

- For each sequence: 25% for training, 75% for testing

(methods compared: [Saisan, ’01], [Ravichandran, CVPR’09], ours)

[Tung et al., CVPR’13]

Page 83

Timing-based local descriptor

I = {overlapping intervals}

[Tung et al., CVPR’14]

• Preserve local structure of surface such as deformation

patterns between neighbor patches

Page 84

Timing-based local descriptor

I = {overlapping intervals}

Yi , Yj : observed signals

[Tung et al., CVPR’14]

• Histogram of timing:

Page 85

Bag-of-Timing paradigm

• Timings of local surface element dynamics are the words of a codebook

– Sparse histogram of dynamic state timings

– Find codewords using k-medoids algorithm

– Soft-weighting of descriptors

• Classification (SVM)/ segmentation of

descriptors

– Different rigidity levels

Page 86
Page 87
Page 88
Page 89

Rigidity-based surface segmentation

Page 90
Page 91

Surface dynamics

Page 92

Rigidity-based surface segmentation

Page 93

Dynamic face

Page 94

3D face dataset

Page 95

Dynamic face

Page 96

Cardiac datasets

Page 97
Page 98

Summary

• 3D video is a markerless surface capture technique for full 3D objects in motion

• 3D video reconstruction state-of-the-art

– Silhouette and stereo fusion

• Topology dictionary for 3D video understanding

– Shape description using Reeb graphs

– Sequence encoding by feature vector clustering

– Probabilistic motion graph model

• Applications: skimming, summarization, annotation, content-based description/encoding.

Page 99

Summary

• Invariant surface-based descriptor

– Geometry video approach

– Deformation invariant surface cut graph

– Probabilistic formulation

– Applications: 3D video data compression for transfer,

storage.

Page 100

Summary

• Point-to-point surface alignment of 3D video

data

– Recursive geodesic mapping

– Ambiguity measure

– Competitive with state-of-the-art

– Accuracy remains to be improved when topology changes occur

– Other intrinsic maps could be used

Page 101

Summary

• Deformable surface dynamics modeling

– Intrinsic surface properties are tracked across time

– Dynamics modeled using a set of LDS with timing

structure information (using Hybrid LDS)

– Timing-based local descriptor

– Applications: rigidity classification, segmentation with

respect to rigidity levels

• Deformation learning using a generative model

Page 102

Multimodal Interaction Dynamics

in Group Discussion

using a Smart Digital Signage[ECCVW12] [HCI13] [THMS14]

[ECCVW 08] [IJNCR14]

Page 103

Human-human interaction

• Human-human interactions for ambient systems

supervising human communications

• Multimodal sensing and analysis of multiparty

interaction for high-level understanding of

human interactions

• Speaker diarization / Visual information processing

• Annotation of comprehension and interest level

• New indexing scheme of speech archives

• Interaction-oriented approach (reaction)

• Non-verbal information (backchannels, nodding, gaze)

Page 104

Related work

• VACE Multimodal meeting corpus [Chen et al., MLMI’06]

• 6 people (round table)

• 12 stereo camera pairs, 3D Vicon IR system, microphones

• AMI meeting corpus [2007]

• 6 cameras, 24 microphones, whiteboard

• IMADE room (poster) [Kawahara et al., Interspeech’08]

• 1 presenter, 2 listeners

• 6-8 multiview video cameras, motion capture (12 markers on body and head), eye-tracking system with accelerometer, microphone array (8-19 channels) and headset

Page 105

Related work

Video capture at IMADE room

Page 106

Why poster sessions?

• Norm in conferences and open labs

• Mixture of lecture and meeting characteristics

• One main speaker with a small audience

• Real-time feedback (backchannels by audience)

• Interactive

• The audience can ask questions or make comments at any time

• Controllable (knowledge/familiarity) and yet real

Page 107

Overview

1. Multimodal capture system

2. Audio and Visual information processing

3. Multimodal interaction dynamics modeling

4. Experimental validation

• Joint-attention estimation

Page 108

Portable multimodal system

• 65” plasma screen

• 19-channel mic array + amplifier

• 6 multiple view video cameras

• Vision camera (UXGA, 25 fps), synchronized & calibrated

• 1 PC with GPU

(Figure: 65” display (160 cm width), 200 cm, 30-40 cm, microphone array)

Demo at IEEE ICASSP’12

Page 109

Multimodal data processing

Page 110

Audio information processing

• Speaker diarization

• Audio segmentation

• Speaker turns

1. Speech enhancement

2. Two GMM models for classification (256 components):

• Speech

• Noise

3. Training by EM [Gomez et al., IEEE Trans. ASLP 2010]

Page 111

Video information processing

• Online head motion tracking (for nodding and turning)

1. Face detection [Viola & Jones, CVPR’01]

• Face feature detection (nose)

2. Depth from stereo

3. Feature tracking using probabilistic model (particle

filter) [ECCVW08] [IJNCR14]

• Likelihood updated with color histograms and depth info

• Cope with missing frames, partial occlusions

Page 112

Video information processing

System demo at IEEE ICASSP2012

Page 113

A/V interaction

• Input: temporal data (e.g., head positions)

• Speaker diarization

• Head motion of each subject

• Dynamics modeling using HDS [Kawashima et al.,

NIPSw’10]

• System of LDS

• Transitions using a Finite State Machine

• Timing structure analysis

(event classification, multimodal interaction modeling)

Page 114

Modeling using HDS

• Linear Dynamical System Di

• Y = { y(t) } : observations

• X = { x(t) } : hidden states in continuous state space

• Fi : transition matrix that models the dynamics of Di

• H : observation matrix that maps hidden states to

system output by linear projection

• gi : bias vector, vi(t) : meas. noise, w(t): obs. noise

Page 115

Modeling using HDS (cont’d)

• Hybrid LDS

1. N LDS Di

2. FSM with N states: S = { qi }

– (N and the LDS parameters are estimated using EM)

• Interval-based representation

• Interval: Ik = < qi , tj >

• Duration: tj = ek - bk

[Kawashima et al., NIPSw’10]
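The interval-based representation above is a run-length view of the decoded FSM state sequence: each maximal run of one state q becomes an interval with a begin frame, end frame, and duration. A sketch with hypothetical frame-indexed states:

```python
def to_intervals(states):
    """Convert a per-frame FSM state sequence into intervals
    I_k = (q, b_k, e_k): each maximal run of one state q becomes one
    interval; the slide's duration is then e_k - b_k."""
    intervals = []
    for t, q in enumerate(states):
        if intervals and intervals[-1][0] == q:
            q0, b, _ = intervals[-1]
            intervals[-1] = (q0, b, t)     # extend the current run
        else:
            intervals.append((q, t, t))    # open a new interval
    return intervals

iv = to_intervals(["q1", "q1", "q2", "q2", "q2", "q1"])
```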

Page 116

Interaction modeling

• Interaction level between multimodal signals

i.e., number of occurrences of synchronized events wrt time

• The distribution of temporal differences of two signals Yk

and Yk’ is modeled by:

Z(Yk, Yk’) = Pr({ bk − bk’ = b, ek − ek’ = e } | {(Ik, Ik’) : [bk, ek] ∩ [bk’, ek’] ≠ ∅})

(Z represents synchronization wrt reaction time)
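An empirical version of Z can be sketched as a normalized histogram of begin/end differences over overlapping interval pairs (the interval endpoints here are hypothetical frame indices):

```python
from collections import Counter

def sync_distribution(I1, I2):
    """Empirical Z(Y_k, Y_k'): normalized histogram of the begin/end
    time differences (b_k - b_k', e_k - e_k'), taken over all interval
    pairs whose time spans overlap.  Intervals are (b, e) tuples."""
    counts = Counter()
    for b1, e1 in I1:
        for b2, e2 in I2:
            if max(b1, b2) <= min(e1, e2):       # overlapping spans
                counts[(b1 - b2, e1 - e2)] += 1
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()} if total else {}

# Two signals whose events consistently lag by one frame:
Z = sync_distribution([(0, 4), (10, 14)], [(1, 5), (11, 15)])
```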

Page 117

Experimental results

• Two scenarios with digital signage:

• Poster presentation

• Casual discussion

• Speaker/Audience interaction characterization

• A/V processing

• Multimodal interaction dynamics modeling using 6

states

• Insight about joint-attention

Page 118

Poster presentation

3min

Page 119

Multimodal interaction modeling

• IHDS with 6 modes for head motion

• LDS clustering & parameter optimization by EM

• LDS timing structure and speaker turn synchronization

Head motion dynamics vs. speech turns

Page 120

Joint-attention characterization

• Reaction occurrences to A/V stimuli

Audio stimuli Visual stimuli

Page 121

Casual discussion

3min

Page 122

Multimodal interaction modeling

Head motion dynamics vs. speech turns

Synchronized state

distribution

Page 123

Joint-attention estimation

Audio stimuli Visual stimuli

Page 124

Summary

• Multimodal system with digital signage (smart

poster) for human-human interaction analysis

• Mic array & multiview video

• Poster presentations (1 presenter, 2-3 listeners)

• Multimodal data interaction

• Speaker diarization & dynamical system modeling

(IHDS)

• Joint-attention in group discussion

• Non-verbal events generate more non-verbal

reactions compared to audio events

Page 125

tonytung.org