ARF @ MediaEval 2012: An Uninformed Approach to Violence Detection in Hollywood Movies

Page 1: ARF @ MediaEval 2012: An Uninformed Approach to Violence Detection in Hollywood Movies

An Uninformed Approach to Violence Detection in Hollywood Movies

*this work was partially supported under European Structural Funds EXCEL POSDRU/89/1.5/S/62557.

Bogdan IONESCU*2,4

[email protected]

Ionuț MIRONICĂ2

[email protected]

Jan SCHLÜTER+1

[email protected]

Markus SCHEDL3

[email protected]

ARF (Austria-Romania-France) team

1 Austrian Research Institute for Artificial Intelligence

2 University POLITEHNICA of Bucharest

3, 4 [institution logos; names not preserved in the transcript]

+this work was supported by the Austrian Science Fund (FWF) under project no. Z159.

Presentation outline

MediaEval - Pisa, Italy, 4-5 October 2012 1/13

• The approach

• Video content description & classification

• Experimental results

• Conclusions and future work

The approach

> challenge: find a way to tag violence in movies;

> what approach?

[Figure: correlation matrix between violence and concepts, computed on the ground truth, color scale from low to high; e.g. movie: Harry Potter; further panels: Armageddon, Kill Bill, The Wicker Man]

• different correlations between violence and concepts;

• high variability in appearance of violent scenes from movie to movie;

training a classifier on the ground truth to directly predict the violent frames is questionable.
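A minimal sketch of how such a per-movie correlation between violence and concept annotations could be computed. It assumes binary frame-level labels and uses the Pearson correlation (the phi coefficient for 0/1 data); the slides do not say which correlation measure was used, and the two movies and their labels below are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (phi coefficient for 0/1 labels)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant label carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def concept_correlations(violence, concepts):
    """Correlate the violence labels with each concept's labels for one movie."""
    return {name: pearson(violence, labels) for name, labels in concepts.items()}


# Hypothetical annotations (one value per frame) for two made-up movies.
movies = {
    "movie_a": ([0, 1, 1, 0, 0, 1], {"blood": [0, 1, 1, 0, 0, 1], "fire": [1, 0, 0, 1, 0, 0]}),
    "movie_b": ([0, 1, 1, 0, 0, 1], {"blood": [0, 0, 0, 0, 1, 0], "fire": [0, 1, 1, 0, 0, 0]}),
}
for title, (violence, concepts) in movies.items():
    print(title, concept_correlations(violence, concepts))
```

In this toy data the same concept correlates strongly with violence in one movie and not in the other, which is the kind of variability the slide points at.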

The approach: machine learning

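> approach: training

[Diagram: three-stage pipeline]

• low-level features: frame-level descriptors computed from the movies & ground truth (annotations);

• mid-level prediction: training concept predictors (blood, fire, screams) that output predictions (real values);

• predicting violence: training & optimizing a violence predictor that outputs yes/no (+ score).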

The approach: machine learning

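> approach: testing

[Diagram: the same three-stage pipeline applied to an unseen movie]

• low-level features: frame-level descriptors computed from the unseen movie;

• mid-level prediction: concept predictions (blood, fire, screams, …);

• predicting violence: violence yes/no (+ score).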
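The test-time pipeline can be sketched as follows. This is an illustration only: the concept and violence predictors are passed in as plain callables (stand-ins for the trained networks), and the 0.5 threshold and the toy predictors are assumptions, not values from the slides. The violence predictor receives the descriptors concatenated with the real-valued concept predictions.

```python
def predict_violence(frame_descriptors, concept_predictors, violence_predictor, threshold=0.5):
    """Two-stage prediction: concepts first, then violence from descriptors + concepts.

    frame_descriptors: list of per-frame feature vectors (lists of floats)
    concept_predictors: dict name -> callable(descriptor) -> real-valued score
    violence_predictor: callable(stacked_vector) -> real-valued score
    Returns one (decision, score) pair per frame.
    """
    results = []
    for descriptor in frame_descriptors:
        # Stage 1: mid-level concept predictions (real values, not thresholded).
        concept_scores = [predict(descriptor) for predict in concept_predictors.values()]
        # Stage 2: violence prediction on descriptors + concept predictions.
        stacked = list(descriptor) + concept_scores
        score = violence_predictor(stacked)
        results.append((score >= threshold, score))
    return results


# Toy stand-ins for trained models (hypothetical, for illustration only).
toy_concepts = {"blood": lambda d: d[0], "fire": lambda d: d[1]}
toy_violence = lambda v: sum(v) / len(v)

frames = [[0.9, 0.8], [0.1, 0.0]]
for decision, score in predict_violence(frames, toy_concepts, toy_violence):
    print(decision, round(score, 2))
```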

Video content description - audio

standard audio features (frame-level) [B. Mathieu et al., Yaafe toolbox, ISMIR’10, Netherlands]

• Linear Predictive Coefficients,

• Line Spectral Pairs,

• Mel-Frequency Cepstral Coefficients,

• Zero-Crossing Rate,

• spectral centroid, flux, rolloff, and kurtosis,

+ variance of each feature over a certain window.

[Figure: frame-level features f1, f2, …, fn along the time axis; global feature = mean & variance of the frame-level features, with var{f1}, var{f2}, …, var{fn} appended]
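A minimal sketch of the "mean & variance over a window" aggregation. The window length, the hop size, and the use of the population variance are assumptions; the slides only say "a certain window".

```python
def window_statistics(frames, window, hop):
    """Aggregate frame-level features into per-window global features.

    frames: list of per-frame feature vectors [f1, f2, ..., fn]
    Returns one vector per window: the n means followed by the n variances.
    """
    aggregated = []
    for start in range(0, len(frames) - window + 1, hop):
        block = frames[start:start + window]
        n_features = len(block[0])
        means = [sum(f[i] for f in block) / window for i in range(n_features)]
        # Population variance (divide by the window length); an assumption.
        variances = [
            sum((f[i] - means[i]) ** 2 for f in block) / window
            for i in range(n_features)
        ]
        aggregated.append(means + variances)
    return aggregated


# Four frames with two hypothetical features each, windows of two frames.
frames = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0], [7.0, 2.0]]
print(window_statistics(frames, window=2, hop=2))
```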

Video content description - visual

feature descriptors (frame-level)

• Histogram of Oriented Gradients (HoG) ~ counts occurrences of gradient orientation in localized portions of an image (20º per bin);

color descriptors (frame-level)

• Color naming histogram ~ projects colors onto 11 universal color names (black, blue, brown, grey, green, orange, pink, purple, red, white, and yellow);

[J. van de Weijer et al. IEEE TIP’09]

[B. Ionescu et al. IEEE ICASSP’06]

visual activity (frame-level)

[Figure: visual activity along the time axis; high values account for important visual changes ~ action]
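To make the "20º per bin" HoG description concrete, here is a simplified orientation histogram for one grayscale image. It is a sketch under stated assumptions: unsigned orientations in 0-180º (hence 9 bins), central-difference gradients, a single cell covering the whole image, and no block normalization. It is not the descriptor implementation used by the authors.

```python
import math


def orientation_histogram(image, bin_width_deg=20):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    image: 2D list of gray levels (rows of equal length).
    Returns a normalized histogram with 180 / bin_width_deg bins.
    """
    n_bins = 180 // bin_width_deg
    hist = [0.0] * n_bins
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = image[r][c + 1] - image[r][c - 1]  # horizontal central difference
            gy = image[r + 1][c] - image[r - 1][c]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            # Fold the orientation into [0, 180): opposite directions share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width_deg) % n_bins] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Gray levels increasing from left to right give a purely horizontal gradient.
ramp = [[10 * c for c in range(5)] for _ in range(5)]
print(orientation_histogram(ramp))
```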

Classifier: multi-layer perceptron

- training using back-propagation,

- use 'dropout' to reduce overfitting: a fraction of units is randomly omitted for each training case so a unit cannot rely on all other units being present. [G. Hinton et al. arXiv.org’12]

[Figure: network with an input layer of descriptor dimension, a hidden layer of 512 units, and 1-5 output units (~concept tags)]
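A minimal sketch of dropout on one hidden layer, to illustrate the idea rather than the authors' network. The sigmoid activation and the drop probability of 0.5 are assumptions; the slide only says "a fraction of units". During training each hidden unit is zeroed at random; at test time all units are kept and their outputs are scaled by the keep probability, which corresponds to the "mean network" of the cited dropout paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_layer(inputs, weights, biases, drop_prob=0.5, training=False, rng=None):
    """Forward pass through one hidden layer with dropout.

    weights: one row of input weights per hidden unit; biases: one per unit.
    """
    activations = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]
    if training:
        rng = rng or random
        # Randomly omit units so no unit can rely on the others being present.
        return [0.0 if rng.random() < drop_prob else a for a in activations]
    # Test time: keep every unit, scale by the keep probability.
    return [a * (1.0 - drop_prob) for a in activations]


# Tiny example: 2 inputs, 4 hidden units (the slide's network uses 512).
weights = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5], [0.0, 0.0]]
biases = [0.0, 0.0, 0.0, 0.0]
print(hidden_layer([1.0, 2.0], weights, biases, training=True, rng=random.Random(0)))
print(hidden_layer([1.0, 2.0], weights, biases, training=False))
```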

Experimental results: concept prediction

> validation of the concept predictor (on the 15 train movies);

> use concept ground truth;

leave-one-movie-out cross-validation

[Table: per-concept results; values not preserved in the transcript]

*results reported for an optimum threshold

• the purely visual concepts obtain high F-score mainly because they are rare,

• blood detector not that accurate (e.g. missed most blood in “Kill Bill”),

• best results for fire and explosions (prominent yellow tones), gunshots and screams.
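Leave-one-movie-out cross-validation can be sketched as below: each movie is held out once, and the model is trained on the frames of all other movies. The function name and the toy movie identifiers are illustrative, not taken from the slides.

```python
def leave_one_movie_out(movie_ids):
    """Yield (held_out_movie, train_indices, test_indices) for each movie.

    movie_ids: one movie identifier per frame (or per sample).
    """
    for held_out in sorted(set(movie_ids)):
        train = [i for i, m in enumerate(movie_ids) if m != held_out]
        test = [i for i, m in enumerate(movie_ids) if m == held_out]
        yield held_out, train, test


# Five samples from three hypothetical movies.
sample_movies = ["a", "a", "b", "c", "c"]
for movie, train_idx, test_idx in leave_one_movie_out(sample_movies):
    print(movie, train_idx, test_idx)
```

Splitting by movie rather than by frame keeps frames of the same movie from appearing in both the training and the validation set.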

Experimental results: violence prediction

> validation of the violence predictor (on the 15 train movies);

> input: descriptors + mid-level predictions (real numbers);

> use violence ground truth;

leave-one-movie-out cross-validation

                                                         prec.   rec.   F-sc.
optimal threshold                                        0.23    0.41   0.30
+ median filtering for predictions, optimal threshold    0.27    0.46   0.34
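A sketch of the post-processing and evaluation steps named above: median filtering of the frame-wise scores, precision/recall/F-score for a given threshold, and a search for the threshold with the best F-score. The filter width and the candidate thresholds are assumptions; the slides give neither.

```python
from statistics import median


def median_filter(scores, width=5):
    """Sliding median over the score sequence; windows are truncated at the edges."""
    half = width // 2
    return [median(scores[max(0, i - half): i + half + 1]) for i in range(len(scores))]


def precision_recall_f(predicted, truth):
    """Precision, recall and F-score for boolean predictions against boolean truth."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score


def best_threshold(scores, truth, candidates):
    """Candidate threshold that maximizes the F-score on the given data."""
    return max(candidates, key=lambda t: precision_recall_f([s >= t for s in scores], truth)[2])


# Hypothetical frame scores with one isolated spike, and their ground truth.
scores = [0.1, 0.9, 0.1, 0.8, 0.9, 0.7, 0.1]
truth = [False, False, False, True, True, True, False]
smoothed = median_filter(scores, width=3)
threshold = best_threshold(smoothed, truth, [0.3, 0.5, 0.7])
print(threshold, precision_recall_f([s >= threshold for s in smoothed], truth))
```

Choosing the threshold on the same data that is evaluated, as here, gives an optimistic figure, which is why the slide labels its numbers "optimal threshold".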

Experimental results: official runs

> segment/shot violence decision: assign each segment/shot its highest frame-wise prediction score, then threshold;

> segment-level results:

precision 0.28, recall 0.49, F-score 0.36, MAP@100 0.55;

> shot-level results:

results vary significantly with the movie
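The segment/shot decision rule from this slide can be sketched as follows. The shot boundaries, scores, and threshold below are made up for illustration; shots are given as (start, end) frame indices with the end exclusive.

```python
def shot_decisions(frame_scores, shots, threshold):
    """Assign each shot its highest frame-wise score, then threshold it.

    Returns one (is_violent, score) pair per shot.
    """
    decisions = []
    for start, end in shots:
        score = max(frame_scores[start:end])
        decisions.append((score >= threshold, score))
    return decisions


# Six hypothetical frame scores split into two shots of three frames each.
frame_scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.2]
print(shot_decisions(frame_scores, [(0, 3), (3, 6)], threshold=0.5))
```

Taking the maximum means a single high-scoring frame is enough to flag the whole shot.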

Experimental results: official runs

> shot-level comparative results:

[Figure: bar chart of the shot-level runs of all teams; axes: MAP@100 (0 to 0.7) and MAP (0 to 0.35). Runs in the order shown on the chart: DYNI-5, DYNI-1, DYNI-4, DYNI-3, TUB-5, DYNI-2, TEC-1, TUB-2, NII-5, TUB-4, TUB-1, TUB-3, NII-4, NII-1, NII-2, NII-3, LIG-2, LIG-4, LIG-3, LIG-1, TUM-5, TUM-3, TUM-2, TUM-4, TEC-2, TEC-4, TUM-1, ShanghaiHongkong-3, ShanghaiHongkong-4, ShanghaiHongkong-5, ShanghaiHongkong-2, TEC-5, TEC-3, ShanghaiHongkong-1, ARF-1]

Conclusions and future work

> fair performance for a naïve attempt at violence detection;

> a high baseline to be challenged by more sophisticated approaches;

> future work:

• investigate whether the concept predictions actually helped,

• investigate the contribution of the modalities,

• investigate dropout vs. classic learning.

thank you !

any questions ?