segmentation*of*tv*shows* into*scenes* · hsv/stg*results* • corpus – first eight*episodesof...

26
SEGMENTATION OF TV SHOWS INTO SCENES USING SPEAKER DIARIZATION AND SPEECH RECOGNITION

Upload: others

Post on 13-Jun-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

SEGMENTATION  OF  TV  SHOWS  INTO  SCENES  

USING  SPEAKER  DIARIZATION  AND  SPEECH  RECOGNITION    

Page 2: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Context  •  Exponen<al  growth  of  video  content  

– Mostly  TV  and  web  

•  Content-­‐based  video  indexing  –  Query/browse  by  people  

•  REPERE  challenge  –  Query/browse  by  seman<c  concept  

•  GdR  ISIS  IRIM  project  •  NIST  TRECVid  Seman<c  Indexing  task  

•  Make  content  consump+on  easier  –  Automa<c  summariza<on    

•  PhD  thesis  with  IRIT  (Ph.  Ercolessi)  

2  

Page 3: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

(More)  Context  

•  Automa<c  summariza<on  of  TV  series  – At  the  episode  level  

•  Shot  segmenta,on  •  Scene  segmenta,on  •  Plot  (or  substory)  deinterlacing  •  Episode  summariza,on  

–  «  Previously,  on  Lost…  »  –  Browse  by  plot  

– At  the  collec<on  level  •  Cross-­‐episode  plot  •  Episode  summariza<on  wrt.  whole  collec<on  

3  

Page 4: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Shot,  scene,  sequence  and  plot  •  Shot  

–  a  part  of  a  film  between  two  camera  cuts  •  Scene  

–  (each  author  working  on  scene  segmenta+on  uses  its  own  defini+on)  –  group  of  consecu<ve  shots  –  temporal  con<nuity  /  unity  of  <me  –  seman<c  coherence  /  unity  of  ac<on  

•  Sequence  –  group  of  consecu<ve  scenes  –  seman<c  coherence  –  aka  story  or  topic  in  TV  news  

•  made  of  mul<ple  stories  introduced  by  the  anchor  •  Plot  

–  group  of  (not  necessarily  consecu<ve)  sequences  –  modern  TV  show  episodes  usually  have  mul<ple  interlaced  plots  –  two  plots  can  overlap  

4  

Page 5: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Outline  

•  Context  •  Defini<ons  &  nota<ons  •  Principle  – Scene  transi<on  graph  with  color  histograms  – Generalized  STG  

•  Mul<modal  extension  – Speaker  diariza<on  &  speech  recogni<on  – Mul<modal  fusion  

5  

Page 6: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Scene  segmenta<on  

•  Input:  shot  boundaries  

•  Output:  scene  boundaries  

•  Classifica<on  problem  on  shot  boundaries  –  precision,  recall,  F1-­‐measure  

6  

10 119876321 4 5 k kth shot

10 119876321 4 5 k kth shot

Page 7: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Color  

shot boundaries

1 1 1 1 12 2 2 2 23 3 3

automatic speech recognitiondon't look atme like that

Ally

it's true!

whatever

...

automatic speaker diarization

color histogram (one frame per second)

ASR

SD

HSV

...

...

...

...

...

... ...

...

•  Mul<ple  frame  per  shot,  one  color  histogram  per  frame  •                   :  minimum  distance  between  all  possible  pairs  of  

histograms  from  shots  i  and  j  dHSV

ij

7  

Page 8: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Scene  transi<on  graph  

10 119876321 4 5

k kth shot

8  

Segmenta<on  of  Video  by  Clustering  and  Graph  Analysis  /  Yeung  (1998)  

Page 9: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Nota<ons  

dissimilarity  between  shots  i  and  j  temporal  distance  between  shots  i  and  j  combined  distance  between  shots  i  and  j      temporal  distance  threshold  combined  distance  threshold    

tijdij

Dij

Dij =

⇢dij if tij < �t

+1 otherwise

�t

�d

9  

Page 10: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Scene  transi<on  graph  

10

11

9

8

7

6

3

2

1

45

k kth shot shot cluster

•  Step  1:  complete-­‐link  agglomera<ve  clustering     �d

10  

Page 11: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Scene  transi<on  graph  

10

11

9

8

7

6

3

2

1

45

k kth shot shot cluster edge

•  Step  2:  scene  transi<on  graph  

11  

Page 12: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Scene  transi<on  graph  

10

11

9

8

7

6

3

2

1

45 ||

scene 1 scene 2

k kth shot shot cluster || edgecut

•  Step  3:  cut-­‐edge  detec<on  

12  

Page 13: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

HSV/STG  Results  

•  Corpus  – First  eight  episodes  of  Ally  McBeal  TV  shows  – 5  hours  of  videos,  5564  shots  and  306  scenes  

•  Evalua<on  – Leave-­‐one-­‐episode-­‐out  cross  valida<on  – Two  thresholds                    and    

13  

Precision Recall F-Measure # ScenesHSV (STG) 0.256 0.533 0.449 461

�t �d

Page 14: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Scene  transi<on  graph  

•  Limita<on:  – Every  pair                                            leads  to  a  different  set  of  detected  scene  boundaries.  

– The  op<mal  values  are  very  dependent  on  the  video  

•  Proposi<on  (Sidiropoulos,  2011):  – Generalized  STG  

(�t,�d)

14  

Page 15: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Generalized  STG  

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Sce

ne

bou

nd

ary

pro

bab

ility

p

– Large  set  of  STGs  by  selec<ng  random                  and            – Scene  boundary  probability  

�t �d

15  

Page 16: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Generalized  STG  

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Sce

ne

bou

nd

ary

pro

bab

ility

p

t hreshold �

– Large  set  of  STGs  by  selec<ng  random                  and            – Scene  boundary  probability  – Unique  threshold  

�t �d

16  

Page 17: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

HSV/GSTG  Results  

•  Corpus  – First  eight  episodes  of  Ally  McBeal  TV  shows  – 5  hours  of  videos,  5564  shots  and  306  scenes  

•  Evalua<on  – Leave-­‐one-­‐episode-­‐out  cross  valida<on  

17  

Precision Recall F-Measure # ScenesHSV/STG 0.256 0.533 0.449 461HSV/GSTG 0.447 0.566 0.487 403

Page 18: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Mul<ple  modali<es,  mul<ple    

shot boundaries

1 1 1 1 12 2 2 2 23 3 3

automatic speech recognitiondon't look atme like that

Ally

it's true!

whatever

...

automatic speaker diarization

color histogram (one frame per second)

ASR

SD

HSV

...

...

...

...

...

... ...

...

dij

18  

Page 19: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Speaker  diariza<on  

shot boundaries

1 1 1 1 12 2 2 2 23 3 3

automatic speech recognitiondon't look atme like that

Ally

it's true!

whatever

...

automatic speaker diarization

color histogram (one frame per second)

ASR

SD

HSV

...

...

...

...

...

... ...

...

shot boundaries

1 1 1 1 12 2 2 2 23 3 3

automatic speech recognitiondon't look atme like that

Ally

it's true!

whatever

...

automatic speaker diarization

color histogram (one frame per second)

ASR

SD

HSV

...

...

...

...

...

... ...

...

•  Mul<ple  speakers  per  shot,  but  only  one  descriptor  –  Term    Frequency  /  Inverse  Document  Frequency  –  Speaker  Frequency  /  Inverse  Shot      Frequency  

•                   :  cosine  distance  between  TF-­‐IDF  vector    dSDij

19  

Page 20: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Automa<c  Speech  Recogni<on  

shot boundaries

1 1 1 1 12 2 2 2 23 3 3

automatic speech recognitiondon't look atme like that

Ally

it's true!

whatever

...

automatic speaker diarization

color histogram (one frame per second)

ASR

SD

HSV

...

...

...

...

...

... ...

...

•  Lemma<za<on  (tree-­‐tagger)  •                   :  cosine  distance  between  TF-­‐IDF  vector    

shot boundaries

1 1 1 1 12 2 2 2 23 3 3

automatic speech recognitiondon't look atme like that

Ally

it's true!

whatever

...

automatic speaker diarization

color histogram (one frame per second)

ASR

SD

HSV

...

...

...

...

...

... ...

...

dASRij

20  

Page 21: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Monomodal  GSTG  Results  

•  Corpus  – First  eight  episodes  of  Ally  McBeal  TV  shows  – 5  hours  of  videos,  5564  shots  and  306  scenes  

•  Evalua<on  – Leave-­‐one-­‐episode-­‐out  cross  valida<on  

21  

Precision Recall F-Measure # ScenesHSV (STG) 0.256 0.533 0.449 461HSV 0.447 0.566 0.487 403SD 0.157 0.562 0.240 1136ASR 0.105 0.572 0.175 1751

Page 22: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Monomodal  GSTG  approaches  

•  Limita<on:  – One  modality  cannot  solve  the  problem  on  its  own  

•  Proposi<on:  – Mul<modal  fusion  

22  

Page 23: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Mul<modal  Fusion  sceneboundaries

sceneboundaryprobabilities

GSTG threshold distancebetweenshots

front-endvideoshots

EARLYFUSION

LATEFUSION

INTERMEDIATEFUSION

•  Late  fusion  –  intersec<on            or  union  

•  Early  fusion  –     

•  Intermediate  fusion  –     

\ [

dij = wHSV · dHSV

ij + wSD · dSD

ij + wASR · dASR

ij

p = wHSV · pHSV + wSD · pSD + wASR · pASR23  

Page 24: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Mul<modal  Results  

24  

Fusion Precision Recall F-MeasureHSV (baseline) 0.447 0.566 0.487HSV \ SD 0.598 0.357 0.438HSV \ SD \ ASR 0.606 0.242 0.341HSV [ SD 0.180 0.770 0.288HSV [ SD [ ASR 0.121 0.851 0.210d(HSV) + d(SD) 0.445 0.599 0.499d(HSV) + d(SD) + d(ASR) 0.445 0.599 0.499p(HSV) + p(SD) 0.484 0.555 0.510p(HSV) + p(SD) + p(ASR) 0.488 0.622 0.539

Page 25: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

Conclusion  &  future  work  

•  Graph-­‐based  approach  to  segmenta<on  

•  Using  more  modali<es  is  beker  – Beker  use  of  ASR  output  – Add  visual  seman<c  concept  detec<on  

 

25  

Page 26: SEGMENTATION*OF*TV*SHOWS* INTO*SCENES* · HSV/STG*Results* • Corpus – First eight*episodesof Ally0McBealTVshows – 5 hoursof videos,5564 shots*and*306*scenes* • Evaluaon* –

What’s  next?  

•  At  the  episode  level  –  Shot  segmenta,on  –  Scene  segmenta,on  –  Plot  (or  story)  deinterlacing  –  Episode  summariza,on  

•  «  Previously,  on  Lost…  »  •  Browse  by  plot  

•  At  the  collec<on  level  –  Cross-­‐episode  plot  –  Episode  summariza<on  wrt.  whole  collec<on  

26