TAAI 2016 keynote talk: It is all about AI

Mark Liao, Institute of Information Science, Academia Sinica, Taiwan (TAAI 2016)

Upload: yi-shin-chen

Post on 21-Mar-2017


TRANSCRIPT

  • It is all about AI

    Mark Liao, Institute of Information Science, Academia Sinica, Taiwan (TAAI 2016)

  • Contents of this talk

    Automatic Concert Video Mashup

    Spatio-Temporal Learning of Basketball Offensive Strategies

  • Automatic Concert Video Mashup

    Mark Liao, Institute of Information Science, Academia Sinica, Taiwan

  • What is concert video mashup?

    Concert video mashup takes all the videos captured from different locations in a concert hall and converts them into a single complete, non-overlapping, seamless, high-quality video.

  • Why concert video mashup?

    To give people who could not attend the live concert a second chance to enjoy the performance at similar quality.

  • Many problems to be solved!

    The videos are captured without coordination, so missing and redundant segments are common.

    The order in which to watch the videos is often unclear.

    The videos are captured with handheld devices, so their visual and audio quality cannot be guaranteed.

  • Issues that need to be addressed

    - The order to watch
    - Visual quality optimization
    - Seamless sound track connection
    - No redundancy
    - No missing video segments
    - Mashup results follow the rules defined by the language of film

  • Potential Issues: The order to watch (1/5)

    Three video clips captured from three different angles and distances; clips 1 and 2 partially overlap, and clip 3 is independent.

  • Potential Issues: Multiple audio sequence alignment (2/5)

    Case 1: partially overlapped

    Case 2: no overlap

  • Potential Issues (3/5)

    Among three videos that are coherent in time, which one should be chosen? (3 different locations)

    -- Follow the rules of the language of film!

    Figure labels: Medium Shot, Long Shot, Extreme Long Shot

  • Potential Issues (4/5)

    Among several qualified video clips shot from the same distance, which one should be chosen?

    -- Visual quality? Audio quality?

    Figure labels: Extreme Long Shot, Extreme Long Shot

  • Potential Issues (5/5)

    How can the emotion, ideas, and art of a music director be brought into a concert video mashup process? Can a CNN learn facial emotion?

  • Previous Effort

    The research area closest to "automatic video mashup" is "summarization of multi-view videos".

    The objective of the latter is to produce a reduced set of abstracted videos or a key-frame sequence that represents the most prominent parts of the input videos.

  • Literature related to video mashup (1/3)

    [Shrestha et al.] formulate video mashup as an optimization problem.

    - Pros: optimizes visual quality and diversity constraints
    - Cons: does not take into account the professional view of a visual storytelling director

    P. Shrestha et al., "Automatic mashup generation from multiple-camera concert recordings," ACM MM, 2010.

  • Literature related to video mashup (2/3)

    [Wu et al.] use pre-defined rules to address the frequent-shot-change problem.

    - Pros: can solve part of the shot change problem
    - Cons: does not involve a visual storytelling director to guide the video mashup process

    Wu et al., "MoVieUp: Automatic mobile video mashup," IEEE TCSVT, 2015.

  • Literature related to video mashup (3/3)

    [Saini et al.] introduce visual storytelling rules by dividing the audience seats into six shooting locations and then calculating statistics of shot transitions and shot lengths from professionally edited videos.

    - Pros: a good start, since it introduces the views of professional experts
    - Cons: the shot types are defined by the authors themselves, not by the rules of the language of film

    Saini et al., "MoViMash: Online mobile video mashup," ACM MM, 2012.

  • Introduction

    An experienced movie director frequently uses camera work practices in visual storytelling.

    Figure labels: Intro, Verse, Chorus, Bridge, ...

  • Introduction

    Applications: Mashup, Emotion (music video)

  • Introduction

    According to the language of film [3], shot size is one of the basics of filmmaking.

    Figure labels: Long Shot, Close-Up

  • Introduction

    The definition of six types of shots [3].

  • Introduction

    By the definitions in the language of film [3], a concert video contains eight types of camera shots.

    Figure labels: Musical Instrument Shot (MIS), Audience Shot (ADS)

  • Introduction

    Two images from an official concert video of the song "93 Million Miles" by Jason Mraz, live in Hong Kong, 2012.

  • System Framework for Video Mashup

  • Shot Classification based on EW-Deep-CCM

    Error-Weighted Deep Cross-Correlation Model

  • Object Representation (VGG-Net)

    Object representation uses a 16-layer VGG-Net: we extract features from the output layer and the two fully-connected layers as the object representations. The feature dimensions are 1000-D, 4096-D, and 4096-D, respectively.

  • Object Representations (1/2)

    ImageNet1000 object representation

  • Object Representations (2/2)

  • Literature related to Fusion Strategy

    Early fusion

    Pros: Takes advantage of combining various feature cues.

    Cons: The high-dimensional feature set may easily suffer from data sparseness and strains computational resources.

  • Literature related to Fusion Strategy

    Late fusion

    Pros: Does not increase the dimensionality. Makes it possible to interpret the performance of the different classifiers and gain insight into the role of multiple modalities during emotional expression.

    Cons: The assumption of conditional independence among multiple modalities is inappropriate.
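
The difference between the two fusion strategies can be illustrated with a minimal sketch. It assumes the three VGG-Net representations from the earlier slide (1000-D, 4096-D, 4096-D), filled here with random placeholder values. Averaging posteriors is only one simple late-fusion rule, chosen for illustration; the function names and the example posteriors are hypothetical, and neither function is the EW-Deep-CCM.

```python
import numpy as np

def early_fusion(features):
    """Early fusion: concatenate per-layer feature vectors into one long vector."""
    return np.concatenate(features)

def late_fusion(posteriors):
    """Late fusion: average the class posteriors of separate classifiers."""
    return np.mean(posteriors, axis=0)

# Placeholder features with the dimensions quoted in the talk.
rng = np.random.default_rng(0)
out_feat, fc1_feat, fc2_feat = rng.random(1000), rng.random(4096), rng.random(4096)
fused = early_fusion([out_feat, fc1_feat, fc2_feat])
print(fused.shape)  # dimensionality grows to 1000 + 4096 + 4096

# Hypothetical posteriors over 3 shot types from 3 separate classifiers.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.5, 0.4, 0.1],
                       [0.2, 0.6, 0.2]])
combined = late_fusion(posteriors)
print(combined, combined.argmax())
```

Early fusion pays for the combined cues with a 9192-D input, while late fusion keeps each classifier small but treats their outputs as if they were independent.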

  • Shot Classification based on EW-Deep-CCM

    A novel fusion strategy, the Error-Weighted Deep Cross-Correlation Model (EW-Deep-CCM), is proposed to effectively combine the extracted multilayer object representations.

  • Experimental Results

    Comparison of shot type classification (other methods)

  • EW-Deep-CCM achieves only an 83% detection rate

    A 17% error remains, i.e., an error rate of about 1/6, which causes frequent shot changes.

  • A 17% error rate causes too many shot changes

  • Conditional Random Field-based (CRF) Approach

    1st trial: 30-frame fixed window size (not a systematic way to smooth the results)

    2nd trial: Recurrent Neural Network (RNN). Problem: an RNN needs pre-segmented data to produce its best results, but the generated shot type classification results are not well segmented.

    3rd trial: Conditional Random Field (CRF)
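
For reference, the first trial (fixed-window smoothing) might look like the following sketch. It assumes a centered majority vote over per-frame labels; the talk does not say how the 30-frame window was applied, so the function name, the centering, and the label codes C and L are illustrative assumptions.

```python
from collections import Counter

def majority_smooth(labels, window=30):
    """Replace each label by the most frequent label in a centered window.

    The window spans window // 2 frames on each side of the current frame
    and is clipped at the ends of the sequence.
    """
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        smoothed.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return smoothed

# A single misclassified frame (L) inside a run of C frames is voted away.
noisy = list("CCCCLCCCC")
print("".join(majority_smooth(noisy, window=5)))
```

A fixed window treats every frame the same way regardless of how shots actually start and end, which is one reading of why the slide calls it "not a systematic way" to smooth.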

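  • Our Method: Coherent-Net

    Shot Type Refinement (CRF)

  • Our Method: Coherent-Net Framework

    EW-Deep-CCM supplies $P(\mathbf{w}' \mid \mathbf{O})$ and shot type refinement (CRF) supplies $P(\mathbf{w} \mid \mathbf{w}')$; together they give $P(\mathbf{w} \mid \mathbf{O})$:

    $$P(\mathbf{w} \mid \mathbf{O}) = \sum_{\mathbf{w}'} P(\mathbf{w}, \mathbf{w}' \mid \mathbf{O}) \approx P(\mathbf{w} \mid \mathbf{w}')\, P(\mathbf{w}' \mid \mathbf{O}) = P(\mathbf{w} \mid \mathbf{w}') \prod_{n=1}^{N} P(w'_n \mid o_n)$$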

  • (EW-Deep-CCM)

    [Equation: $P(w' \mid o)$ written as a triple sum over $i = 1, \ldots, C$, $j = 1, \ldots, D$, and $k = 1, \ldots, K$ of terms built from the output-layer (out) and fully-connected-layer (fc) representations, with factors labeled empirical weight, cross-correlation, and likelihood (DNN posterior probability).]

    Diagram labels: EW-Deep-CCM gives $P(\mathbf{w}' \mid \mathbf{O})$; Shot Type Refinement (CRF) gives $P(\mathbf{w} \mid \mathbf{w}')$; the result is $P(\mathbf{w} \mid \mathbf{O})$.

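  • (CRF)

    Sequence from EW-Deep-CCM: $\mathbf{w}' = w'_1, \ldots, w'_{t-1}, w'_t$

    Refined sequence: $\mathbf{w} = w_1, w_2, w_3, \ldots, w_{t-1}, w_t$

    $$P(\mathbf{w} \mid \mathbf{w}') = \frac{1}{Z(\mathbf{w}')} \exp\Big(\sum_j \lambda_j F_j(\mathbf{w}, \mathbf{w}')\Big), \qquad Z(\mathbf{w}') = \sum_{\mathbf{w}} \exp\Big(\sum_j \lambda_j F_j(\mathbf{w}, \mathbf{w}')\Big)$$

    $$= \frac{1}{Z(\mathbf{w}')} \exp\Big(\sum_t \sum_j \lambda_j\, t_j(w_{t-1}, w_t, \mathbf{w}') + \sum_t \sum_j \mu_j\, s_j(w_t, \mathbf{w}')\Big)$$

    $$= \frac{1}{Z(\mathbf{w}')} \exp\Big(\sum_t \sum_{m,n \in S} \lambda_{mn}\, \mathbf{1}_{\{w_t = m\}} \mathbf{1}_{\{w_{t-1} = n\}} + \sum_t \sum_{m \in S} \sum_{o \in O} \mu_{om}\, \mathbf{1}_{\{w_t = m\}} \mathbf{1}_{\{w'_t = o\}}\Big)$$

    State-observation pair (unary potential): $s_j(w_t, \mathbf{w}') = 1$ when $w'_t = o$ and $w_t = m$; 0 otherwise.

    State transition (pairwise potential): $t_j(w_{t-1}, w_t, \mathbf{w}') = 1$ when $w_{t-1} = n$ and $w_t = m$; 0 otherwise.

    Example: the label sequence CLCCCC is refined to CCCCCC.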
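
The effect of combining a state-observation (unary) term with a state-transition (pairwise) term can be sketched with Viterbi decoding on a linear chain. This is only an illustration: the weights `stay` and `agree` are hand-set assumptions rather than trained CRF parameters from the talk, the function name is hypothetical, and the labels C and L are encoded as 0 and 1.

```python
import numpy as np

def smooth_labels(observed, n_states, stay=0.9, agree=0.8):
    """Most likely refined label sequence under a linear-chain model.

    observed: per-frame labels from the classifier (ints in [0, n_states)).
    stay:     assumed weight for keeping the same shot type between frames.
    agree:    assumed weight for the refined label matching the observed one.
    Requires n_states >= 2.
    """
    T = len(observed)
    # Pairwise log-potential trans[prev, cur]: favors no shot change.
    trans = np.full((n_states, n_states), np.log((1 - stay) / (n_states - 1)))
    np.fill_diagonal(trans, np.log(stay))
    # Unary log-potential emit[state, observation]: favors agreement.
    emit = np.full((n_states, n_states), np.log((1 - agree) / (n_states - 1)))
    np.fill_diagonal(emit, np.log(agree))

    score = emit[:, observed[0]].copy()
    back = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans          # cand[prev, cur]
        back[t] = cand.argmax(axis=0)          # best predecessor per state
        score = cand.max(axis=0) + emit[:, observed[t]]

    # Trace the best path backwards from the best final state.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

C, L = 0, 1
print(smooth_labels([C, L, C, C, C, C], n_states=2))  # isolated L is smoothed away
print(smooth_labels([C, C, C, L, L, L], n_states=2))  # a sustained change is kept
```

Unlike a fixed window, the pairwise term penalizes each shot change directly, so an isolated misclassification is removed while a sustained change of shot type survives.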

  • Experiments: Official Demo 1

    The song "Skyfall" by Adele, performed at the Oscars, 2013

  • Experiments: Official Demo 2

    The song "When I Was Your Man" by Bruno Mars, performed at BBC Radio 1's Big Weekend, 2013

  • System Framework for Video Mashup

  • Problem & Goal

    A concert video mashup process needs to align the videos taken by various audience members onto a common timeline.

  • Literature Review

    Audio fingerprinting

    Problems:

    - Originally designed for audio identification rather than time alignment.
    - Easily affected by audio signal distortion.

    Zhu et al. treat audio identification as an image matching problem (significant performance improvement).

    B. Zhu et al., "A novel audio fingerprinting method robust to time scale modification and pitch shifting," ACM MM, 2010.

  • Our Method

    We modified Zhu's method to address the multiple audio sequence alignment problem.

    Auditory image (spectrogram) construction: the 1-D audio signal (waveform) is converted by a short-time Fourier transform into a 2-D auditory image, i.e., a time-frequency representation (spectrogram).
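
A minimal sketch of the waveform-to-spectrogram step, using a short-time Fourier transform written directly with NumPy. The frame length, hop size, Hann window, and the synthetic 440 Hz test tone are illustrative assumptions; the talk does not give the parameters it used.

```python
import numpy as np

def spectrogram(signal, frame_len=1024, hop=256):
    """Magnitude spectrogram via a short-time Fourier transform.

    Returns an array of shape (frame_len // 2 + 1, n_frames):
    rows are frequency bins, columns are time frames.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Synthetic example: one second of a 440 Hz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
image = spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = image.mean(axis=1).argmax()
print(image.shape, peak_bin * sr / 1024)  # strongest frequency bin, in Hz
```

Once the audio is a 2-D array like this, image descriptors such as SIFT can be computed on it, which is what makes an image-matching formulation possible.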

  • Our Method

    Audio Sequences Alignment (1): Boundary candidate selection (based on SIFT alignment)

    $$BC = \begin{cases} \text{Yes}, & \text{if } D(a, b) < c \cdot D(a, b') \\ \text{No}, & \text{otherwise} \end{cases}$$

    - where a is a SIFT feature in audio sequence A, b is the closest feature to a in B, and b' is the second closest feature to a in B.

    BC: boundary candidate; D(.): Euclidean distance; c: a constant (c = 0.7)

    Yellow lines are boundary candidates.
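
The boundary candidate test can be sketched as a nearest versus second-nearest distance ratio check. The sketch below assumes the descriptors are already available as NumPy arrays (the SIFT extraction itself is not shown); the function name and the toy 2-D descriptors are hypothetical.

```python
import numpy as np

def boundary_candidates(feats_a, feats_b, c=0.7):
    """Return (index in A, index in B) pairs that satisfy D(a, b) < c * D(a, b').

    feats_a, feats_b: arrays of shape (n, d) and (m, d) with m >= 2.
    b is the closest and b' the second closest feature in B to each a.
    """
    # Pairwise Euclidean distances, shape (n, m).
    dists = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    rows = np.arange(len(feats_a))
    keep = dists[rows, nearest] < c * dists[rows, second]
    return [(int(i), int(nearest[i])) for i in rows[keep]]

# Toy descriptors: the first feature in A has one clear match in B,
# the second is equally close to two features in B and is rejected.
A = np.array([[0.0, 0.0], [5.0, 5.0]])
B = np.array([[0.0, 0.1], [10.0, 10.0], [5.0, 6.0], [5.0, 4.0]])
print(boundary_candidates(A, B))
```

The ratio check keeps only matches that are clearly better than the runner-up, which filters out ambiguous matches before the refinement step.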

  • Our Method

    Audio Sequences Alignment (2): Boundary candidate refinement.

    - A window distortion measure (WDM) is defined for refining each boundary candidate.

  • Our Method

    Audio Sequences Alignment (3): Final boundary decision.

    - The alignment result is determined by the refined boundary candidate with the minimum window distortion.

  • DEMO 1

    "I'm Yours" by Jason Mraz, live in Singapore, 2012, with context search (aligned in 49.8001 s)

    TimeLine 00:00:00 00:00:49.8001

    Recording #4

    Recording #5

    +0.4334 s

  • DEMO 2

    "All I Ask" by Adele, live at Birmingham Genting Arena, 2016, with context search (aligned in 53.2169 s)

    TimeLine 00:00:00 00:00:53.2169

    Recording #1

    Recording #2

    +0.5502 s

  • Demo - Multiple Audio Sequence Alignment Result

    TimeLine 00:00:00 00:00:52.4893 04:00:2277

    00:00:52.7667 03:58:8667

    Audience #1

    Audience #2

    Audience #3

  • Learning Professional Recording Skill

    Statistics learned: Initial Prob., Duration (frames/shot), Shot Transition (prob.)

    Diagram labels: Shot Type Refinement (CRF), Coherent-Net
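
Statistics of this kind can be estimated from a per-frame shot label sequence, as in the rough sketch below. It covers shot duration and shot transition probability for a single video; initial probabilities would need many videos and are left out. The function name and the label codes L, C, and M are hypothetical, and the talk does not describe its own estimation procedure.

```python
from collections import Counter, defaultdict
from itertools import groupby

def shot_statistics(frame_labels):
    """Mean shot duration (frames per shot) and shot transition probabilities."""
    # Collapse per-frame labels into (shot type, duration in frames) runs.
    shots = [(label, sum(1 for _ in run)) for label, run in groupby(frame_labels)]

    durations = defaultdict(list)
    for label, length in shots:
        durations[label].append(length)
    mean_duration = {k: sum(v) / len(v) for k, v in durations.items()}

    # Count transitions between consecutive shots and normalize per source type.
    pair_counts = Counter((a[0], b[0]) for a, b in zip(shots, shots[1:]))
    out_counts = Counter(src for src, _ in pair_counts.elements())
    transition = {(a, b): n / out_counts[a] for (a, b), n in pair_counts.items()}
    return mean_duration, transition

frames = list("LLLLCCLLMMLL")
mean_duration, transition = shot_statistics(frames)
print(mean_duration)
print(transition)
```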

  • System Framework for Video Mashup

  • Demo - Mashup Result

    mr#1

    mr#2

    mr#3

  • Spatio-Temporal Learning of Basketball Offensive Strategies

  • Motivations

    To develop an automatic tactics analysis tool for coaches, players, and the general public.

    To develop a new technique that can compete with existing tools, such as SportVU, at a much lower price.

  • Methodology Adopted

    To analyze group behavior directly from the court view of an NBA broadcast video.

    Detect and track each offense player, compute their trajectories, and map these trajectories from the court view to the tactic board for analysis.

  • Motivation (1)

  • Motivation (2)

  • Motivation (3)

    Unknown Offense Video Clip

    90% Screen Cut, 10% Princeton

  • 6 cameras above the court

    No close-up view

    Unable to see the details of plays

  • SportVU system: SportVU videos, tracked data

    Our tracking system: broadcast videos, tracked data (?)

  • Extracting features from an offense video clip?

    Automatic player detection

    Automatic player tracking

    Map extracted trajectories from the basketball court to the tactic board

  • Step 2: Derive correct player trajectories on panorama court (3/3)

  • Step 3: Map trajectories from panorama court to tactic board

  • What's next?

    Tactics analysis based on the spatiotemporal trajectories of 5 offense players

  • A Two-Stage Unsupervised Clustering for Tactic Analysis

    Stage 1: Unsupervised clustering of all available tactics based on their mutual distances

    Stage 2: Unsupervised clustering of all tactics grouped into the same cluster in Stage 1 (to separate the role of each offense player)

  • What techniques are needed?

    - A spatiotemporal model that can describe the group behavior of 5 offense players
    - Automatic clustering of group behaviors (screen-cut, Princeton, wing-wheel, etc.)
    - A representation of each group behavior
    - An appropriate metric to calculate the distance between two arbitrary tactics

  • Trajectory Set Representation

    $S$: the spatiotemporal matrix

    $P_{ij} = (x_{ij}, y_{ij})$: 2D coordinate of the j-th player in the i-th frame

    $V_j = [P_{1j}\ P_{2j}\ \cdots\ P_{Lj}]^T$

    $S = [V_1\ V_2\ V_3\ V_4\ V_5\ (V_6)]$
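
One way to hold this representation in code is an array of shape (L, players, 2), as in the sketch below. The array layout, the function name, and the synthetic straight-line tracks are assumptions made for illustration; indices are 0-based here, whereas the slide counts from 1.

```python
import numpy as np

def trajectory_set(tracks):
    """Stack per-player tracks into S with shape (L, n_players, 2).

    tracks: list of arrays of shape (L, 2), one (x, y) row per frame.
    S[i, j] corresponds to P_ij and S[:, j] to the column vector V_j.
    """
    return np.stack(tracks, axis=1)

# Synthetic tracks: player j moves along x at constant speed on the line y = j.
L, n_players = 4, 5
tracks = [np.column_stack([np.arange(L) + j, np.full(L, j)])
          for j in range(n_players)]
S = trajectory_set(tracks)
print(S.shape)   # (frames, players, coordinates)
print(S[2, 3])   # position of player 3 in frame 2
```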

  • Distance Measure of Trajectory Set

    Problems:

    - Different time durations between 2 clips
    - Ordering of column vectors

  • Trajectory Set Distance Matrix

    S1 = [V1 V2 V3 V4 V5], S2 = [U1 U2 U3 U4 U5]
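
The two problems named above (unequal durations and unknown column ordering) can be handled in a sketch like the following. It is an assumption-laden illustration, not the talk's metric: dynamic time warping is used for the duration mismatch, and the column ordering is resolved by trying every player permutation and keeping the cheapest one. All function names are hypothetical.

```python
import itertools
import numpy as np

def dtw(a, b):
    """Dynamic time warping cost between two trajectories of shape (n, 2) and (m, 2)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def set_distance(S1, S2):
    """Distance matrix between the columns of S1 and S2, plus the best matching.

    S1, S2: arrays of shape (frames, players, 2); frame counts may differ.
    Returns (D, perm, total), where D[i, j] = dtw(V_i, U_j), perm[i] is the
    column of S2 matched to column i of S1, and total is the summed cost.
    """
    k = S1.shape[1]
    D = np.array([[dtw(S1[:, i], S2[:, j]) for j in range(k)] for i in range(k)])
    perm = min(itertools.permutations(range(k)),
               key=lambda p: sum(D[i, p[i]] for i in range(k)))
    return D, perm, sum(D[i, perm[i]] for i in range(k))

# S2 is S1 with the players reordered and every frame repeated twice.
rng = np.random.default_rng(0)
S1 = rng.random((6, 5, 2))
S2 = np.repeat(S1[:, [2, 0, 1, 4, 3]], 2, axis=0)
D, perm, total = set_distance(S1, S2)
print(perm, total)
```

With 5 players there are only 120 permutations, so the exhaustive search is cheap; for larger sets an assignment solver would be the usual replacement.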

  • Clustering by Dominant Set

    Massimiliano Pavan and Marcello Pelillo, "Dominant Sets and Pairwise Clustering," PAMI, 2007.

    Tactic1

    Tactic2

    Tactic3
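
Dominant sets are commonly extracted with replicator dynamics on a pairwise similarity matrix, and a minimal sketch of that iteration follows. The similarity values, the stopping tolerance, and the support cutoff are illustrative assumptions; how the talk converts trajectory set distances into similarities is not stated.

```python
import numpy as np

def dominant_set(A, iters=1000, tol=1e-10, cutoff=1e-4):
    """Extract one dominant set from a similarity matrix using replicator dynamics.

    A: symmetric, non-negative similarity matrix with a zero diagonal.
    Returns the indices whose weight stays above `cutoff` at convergence.
    """
    n = A.shape[0]
    x = np.full(n, 1.0 / n)          # start from the barycenter of the simplex
    for _ in range(iters):
        Ax = A @ x
        new_x = x * Ax / (x @ Ax)    # replicator update keeps x on the simplex
        converged = np.abs(new_x - x).sum() < tol
        x = new_x
        if converged:
            break
    return [i for i in range(n) if x[i] > cutoff]

# Toy similarities: items 0-2 are mutually similar, items 3-4 only weakly related.
A = np.array([[0.0, 0.9, 0.9, 0.1, 0.1],
              [0.9, 0.0, 0.9, 0.1, 0.1],
              [0.9, 0.9, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.0, 0.2],
              [0.1, 0.1, 0.1, 0.2, 0.0]])
members = dominant_set(A)
print(members)
```

To obtain several clusters, the usual approach is to remove the extracted members from the matrix and run the iteration again on what remains.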

  • Second stage: how to model an offense strategy?

    8 different trajectory sets of right hawk, each consisting of 5 trajectories generated by 5 offense players

  • Clustering by Trajectory Distance

    Based on the distance between trajectories, one can separate each group of tactics into five groups of trajectories, each corresponding to a role (an offense player).

    Hawk

    Wing Wheel

    Princeton

  • Temporal Alignment

    Each role is modeled by its velocities along the x- and y-directions (DTW is used to solve the alignment problem).

  • The Built Model

  • Demo - Classification

    Hawk template

  • Demo - Classification

    Princeton template

  • Demo - Classification

    Wing wheel template

  • Thank you very much for listening
