Sound Detection

Derek Hoiem

Rahul Sukthankar (mentor)

August 24, 2004

Objective

Learn a model of a sound object from a few (10-20) examples and distinguish it from all other sounds

Examples of sound classes: gunshots, screams, laughter, car horns, meows, dog barks, etc.

Applications

“Tell me if you hear a gunshot.” (monitoring)

“Get me video clips containing dogs barking.” (search and retrieval)

“What’s going on?” (scene understanding)

Why it's difficult

Sound classes have large variations

Sounds are often ambiguous without context

Overlaid “noise” obscures sound

Sound or not?

Car horn

Laser gun

Dog bark

Which of these sounds are not from their named classes?

Previous work

Sound Classification (Wold 1996, Casey 2001, etc.)

Categorize short sound clips

Reasonable accuracy (5-20% error)

Sound Detection (Defaux 2000, Piamsa-nga 1999)

Localize and recognize sound objects in long clips

Poor performance or assumption of unrealistic conditions (e.g., very quiet background)

Detection via Windowed Search

[Figure: a long track split into Clip 1, Clip 2, …, Clip N]

Break audio track into short overlapping clips

Clip Classifier

Independently classify short clips as object or non-object

Return locations of detected sound object
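The windowed search can be sketched as below. This is a minimal illustration, not the implementation from the talk: the function name `windowed_search`, the parameters `clip_len` and `hop`, and the energy-based stand-in classifier `loud` are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def windowed_search(
    track: Sequence[float],
    clip_len: int,
    hop: int,
    classify_clip: Callable[[Sequence[float]], bool],
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of clips classified as the object."""
    detections: List[Tuple[int, int]] = []
    if len(track) < clip_len:
        return detections
    # hop < clip_len makes consecutive clips overlap.
    for start in range(0, len(track) - clip_len + 1, hop):
        clip = track[start:start + clip_len]
        if classify_clip(clip):  # each clip is judged independently
            detections.append((start, start + clip_len))
    return detections


# Hypothetical stand-in for the clip classifier: "object" means high energy.
def loud(clip: Sequence[float]) -> bool:
    return sum(x * x for x in clip) / len(clip) > 0.5


track = [0.0] * 8 + [1.0] * 4 + [0.0] * 8
hits = windowed_search(track, clip_len=4, hop=2, classify_clip=loud)
print(hits)
```

Only the clip that lies entirely inside the loud burst is returned here; a real clip classifier would replace `loud`.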

Representation

meows

phone rings

Raw Representation

Time-frequency analysis: windowed Fourier transform

Extract power percentage in each band over time and total power over time

Features

Compute features used for classification
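A rough sketch of the raw representation: a windowed Fourier transform per frame, followed by the share of power in each band and the total power. The slides do not give the window type, frame size, or band layout, so the periodic Hann window, the equal-width bands, the exclusion of the DC bin, and all function names are assumptions. A naive DFT keeps the example dependency-free; an FFT library would normally be used.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def frame_power_spectrum(frame: Sequence[float]) -> List[float]:
    """Power in each non-negative frequency bin of one Hann-windowed frame."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(w * cmath.exp(-2j * math.pi * k * i / n)
                    for i, w in enumerate(windowed))
        power.append(abs(coeff) ** 2)
    return power


def band_power_representation(
    signal: Sequence[float], frame_len: int, hop: int, n_bands: int
) -> Tuple[List[List[float]], List[float]]:
    """Per frame: fraction of power in each band, and total power."""
    fractions: List[List[float]] = []
    totals: List[float] = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        power = frame_power_spectrum(signal[start:start + frame_len])
        m = len(power) - 1  # number of bins, skipping DC (bin 0)
        edges = [1 + (b * m) // n_bands for b in range(n_bands + 1)]
        bands = [sum(power[edges[b]:edges[b + 1]]) for b in range(n_bands)]
        total = sum(bands)
        totals.append(total)
        fractions.append([p / total if total > 0 else 0.0 for p in bands])
    return fractions, totals


# A pure tone centered on bin 6 of a 32-sample frame falls in band 1 of 4.
tone = [math.sin(2 * math.pi * 6 * i / 32) for i in range(64)]
fractions, totals = band_power_representation(tone, frame_len=32, hop=16, n_bands=4)
print([round(f, 3) for f in fractions[0]])
```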

Classification Features

Diverse feature set: different sound classes are distinctive in different ways

Means and standard deviations of power at different frequencies

Bandwidth, peaks, loudness, etc.

138 features in all
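To make the feature types concrete, the sketch below computes a handful of them from a per-frame band-power representation. It covers only a small illustrative subset, not the 138 features of the talk; the feature names and the input layout (`band_fractions[t][b]`, `total_power[t]`) are assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def clip_features(
    band_fractions: Sequence[Sequence[float]],
    total_power: Sequence[float],
) -> Dict[str, float]:
    """Summarize one clip with a few statistics over its frames."""
    n_bands = len(band_fractions[0])
    features: Dict[str, float] = {}
    for b in range(n_bands):
        series: List[float] = [frame[b] for frame in band_fractions]
        features[f"band{b}_mean"] = mean(series)   # mean power share of band b
        features[f"band{b}_std"] = pstdev(series)  # its variation over time
    features["loudness_mean"] = mean(total_power)
    features["loudness_std"] = pstdev(total_power)
    features["loudness_range"] = max(total_power) - min(total_power)
    # Index of the band holding the largest average share of power.
    features["peak_band"] = float(
        max(range(n_bands), key=lambda b: features[f"band{b}_mean"])
    )
    return features


example = clip_features([[0.5, 0.5], [1.0, 0.0]], [2.0, 4.0])
print(example)
```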

Classification by Decision Trees

Try to find simple rules that discriminate object from non-object

Each decision is based on a threshold of a feature value

Assign confidence based on likelihood of data for object and non-object classes at each leaf node

Decision nodes

Leaf Nodes
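The smallest case of such a tree is a single decision node with two leaves. The sketch below fits one threshold on one feature and gives each leaf a confidence from the weighted mass of object and non-object examples that reach it. The slides do not state the exact confidence formula, so the half log-ratio with a small smoothing constant `EPS` is an assumption, as are the function names and the toy data.

```python
import math
from typing import List, Optional, Sequence, Tuple

EPS = 1e-6  # smoothing so an empty class does not give an infinite confidence


def leaf_confidence(w_object: float, w_non_object: float) -> float:
    """Positive when object examples dominate the leaf, negative otherwise."""
    return 0.5 * math.log((w_object + EPS) / (w_non_object + EPS))


def fit_stump(
    values: Sequence[float],
    labels: Sequence[int],  # 1 = object, 0 = non-object
    weights: Optional[Sequence[float]] = None,
) -> Tuple[float, float, float]:
    """Return (threshold, confidence_if_below, confidence_if_above)."""
    n = len(values)
    if weights is None:
        weights = [1.0 / n] * n
    best = None
    xs = sorted(set(values))
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2
        # masses[side] = [non-object weight, object weight] reaching that leaf
        masses = {False: [0.0, 0.0], True: [0.0, 0.0]}
        for v, y, w in zip(values, labels, weights):
            masses[v > thr][y] += w
        err = sum(min(m) for m in masses.values())  # each leaf votes majority
        if best is None or err < best[0]:
            best = (err, thr,
                    leaf_confidence(masses[False][1], masses[False][0]),
                    leaf_confidence(masses[True][1], masses[True][0]))
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1], best[2], best[3]


def stump_confidence(stump: Tuple[float, float, float], value: float) -> float:
    thr, conf_below, conf_above = stump
    return conf_above if value > thr else conf_below


# Toy data: hypothetical "power amplitude range" values; 1 marks the object.
amp_range: List[float] = [0.1, 0.2, 0.3, 0.8, 0.9]
is_object: List[int] = [0, 0, 0, 1, 1]
stump = fit_stump(amp_range, is_object)
print(stump)
```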

Boosted Trees

Problem: One decision tree by itself may not be a great classifier

Solution: Use several trees, with each one focusing on the mistakes of previously learned trees

Adaboost:

Weight training data uniformly

Learn a decision tree classifier on weighted data

Re-weight data giving more weight to incorrectly classified examples

Final classification based on linear combination of confidences from all learned decision trees
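The AdaBoost steps above can be sketched as follows. To stay short, the example swaps full decision trees for depth-1 trees (stumps) and uses the discrete AdaBoost variant, where each tree casts a ±1 vote weighted by its accuracy rather than a per-leaf confidence. All function names and the toy data are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[float, int, float, int]  # (weighted error, feature, threshold, polarity)


def stump_predict(stump: Stump, row: Sequence[float]) -> int:
    _, feature, thr, polarity = stump
    return polarity if row[feature] > thr else -polarity


def fit_stump(X: Sequence[Sequence[float]], y: Sequence[int],
              w: Sequence[float]) -> Stump:
    """Threshold rule with the lowest weighted error; labels are +1 / -1."""
    best: Stump = (float("inf"), 0, 0.0, 1)
    for f in range(len(X[0])):
        xs = sorted({row[f] for row in X})
        # A cut below the minimum yields the two constant classifiers.
        cuts = [xs[0] - 1.0] + [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]
        for thr in cuts:
            for pol in (1, -1):
                cand: Stump = (0.0, f, thr, pol)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(cand, row) != yi)
                if err < best[0]:
                    best = (err, f, thr, pol)
    return best


def adaboost(X: Sequence[Sequence[float]], y: Sequence[int],
             n_rounds: int) -> List[Tuple[float, Stump]]:
    n = len(X)
    w = [1.0 / n] * n  # weight training data uniformly
    ensemble: List[Tuple[float, Stump]] = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)  # learn on the weighted data
        if stump[0] >= 0.5:
            break  # no better than chance; stop adding trees
        err = max(stump[0], 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified examples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def score(ensemble: Sequence[Tuple[float, Stump]], row: Sequence[float]) -> float:
    """Linear combination of the individual votes; sign gives the class."""
    return sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)


# No single threshold separates this 1-D toy set, but three stumps do.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, n_rounds=3)
predictions = [1 if score(ensemble, row) > 0 else -1 for row in X]
print(predictions)
```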

Examples of Decision Trees

Low percentage of power in low frequencies in mid-time of sound

Very high power amplitude range

Meow Gunshot

High power amplitude range

More complex tree that focuses on examples misclassified by tree above

Gunshot

Cascade of Classifiers

Goal: eliminate false positives with few false negatives in early stages

Advantages:

Allows use of large set of negative training examples

Improves classification speed

Dangers: cannot recover from false negatives

[Figure: Sound Clip → Stage 1 → Stage 2 → Stage 3 → Pass; pass rates labeled 5%, 2%, and 0.005% for stages 1, 2, and 3; a Fail at any stage rejects the clip]
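The control flow of the cascade can be sketched as below: a clip must pass every stage, and the first failure rejects it for good, which is why early false negatives cannot be recovered. The stage scorers and thresholds here are placeholders invented for the example; in the talk each stage would be a trained classifier.

```python
from typing import Callable, List, Sequence, Tuple

# Each stage is (scoring function, threshold); both are hypothetical here.
Stage = Tuple[Callable[[Sequence[float]], float], float]


def cascade_classify(stages: Sequence[Stage],
                     features: Sequence[float]) -> Tuple[bool, int]:
    """Return (accepted, number of stages evaluated)."""
    for i, (scorer, threshold) in enumerate(stages, start=1):
        if scorer(features) < threshold:
            return False, i  # rejected; later stages never see the clip
    return True, len(stages)


# Toy features: [loudness, amplitude_range].
stages: List[Stage] = [
    (lambda f: f[0], 0.2),         # stage 1: cheap loudness check
    (lambda f: f[1], 0.5),         # stage 2: amplitude range check
    (lambda f: f[0] + f[1], 1.2),  # stage 3: stricter combined check
]

quiet = cascade_classify(stages, [0.1, 0.9])
flat = cascade_classify(stages, [0.9, 0.4])
strong = cascade_classify(stages, [0.9, 0.9])
print(quiet, flat, strong)
```

Most negatives exit at the first stage after a single evaluation, which is where the speed advantage comes from.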

Results: Classification Error

[Figure: Average Error vs Stages in Cascade; y-axis 0.0% to 10.0%; x-axis stage 1, stage 2, stage 3; series: pos error, neg error]

Best Performance

Worst Performance

               stage 1          stage 2          stage 3
               pos     neg      pos     neg      pos     neg
meow           0.0%    1.4%     0.0%    1.2%     2.2%    0.8%
phone          0.0%    0.4%     4.3%    0.1%     5.9%    0.0%
car horn       0.0%    3.9%     0.6%    2.2%     3.6%    1.3%
door bell      1.4%    2.1%     2.1%    0.4%     6.3%    0.1%
swords         6.1%    1.3%     6.7%    0.1%     6.7%    0.0%
scream         0.3%    5.5%     2.7%    1.4%     5.3%    1.1%
dog bark       0.7%    1.0%     6.0%    0.3%     7.7%    0.2%
laser gun      0.0%    6.8%     4.4%    5.1%     6.7%    0.9%
explosion      4.1%    5.2%     7.5%    1.5%    12.0%    0.5%
light saber    4.8%    6.8%     9.7%    1.0%    13.9%    0.2%
gunshot        8.1%    6.1%    12.5%    2.3%    14.5%    1.1%
close door     7.9%    7.8%    14.5%    4.8%    17.6%    2.3%
male laugh     4.3%   14.7%     9.5%    9.7%    13.3%    7.0%
average        2.9%    4.4%     6.0%    2.2%     8.5%    1.1%

Results: ROC curves

Note: to approximate the negative error rate, divide the false positive count (FP) by 25,000

Results: Anecdotal

Gunshots Female Laugh Male Laugh

Swords Scream
