
Dulce Ponceleon

Searching Video Collections: Representation, Indexing, Browsing and Evaluation

Part I

Universidad de Chile December 2002


Searching Video Collections: Overview

Part I
• Introduction to Multimedia Information Retrieval
• Multimedia Representation
• Multimedia Indexing

Part II
• Audio Analysis
• Speech Indexing
• Query Formulation
• Multimedia Retrieval

Part III
• Browsing
• Distribution/Streaming
• Evaluation
• Multimedia IR Applications
• Conclusions


Searching Video Collections: Part I

• Introduction to Multimedia Information Retrieval
• Multimedia Representation
  • Visual Features (Still Images and Image Sequences): Color, Texture, Shape, Edges, Objects, Motion
• Multimedia Indexing
  • Video Segmentation: Shot-Boundary Detection, Effects Detection
• Beyond Basic Visual Features: Text, Face


What is Multimedia?

• Unstructured data types: text, images, audio, video
• Different from DBMS structured records
  • Name: <s>, Sex: <s>, Age: <I>, SSN: <I>…
• Structure in unstructured data
  • All unstructured data has content
  • Typically also has associated metadata
  • Text has layout and logical structure
  • Multimedia has complex spatial, temporal, and semantic structure


History: from Text IR to MMIR

• Library of Alexandria (3rd century BC)
  • 500,000 volumes, catalogues, classification
• First concordance of the Bible (13th century AD)
• Printing press (15th century)
• Johnson’s dictionary (1755)
• Dewey Decimal classification (1876)
• Punched card retrieval (1930’s)
• Luhn describes statistical retrieval/abstracting (1959)
• MEDLINE (1964, goes on-line in 1971)

*Adapted from a presentation © Bruce Croft


History: From Text IR to MMIR

• Cranfield effort defines evaluation (1966)
• DIALOG from Lockheed (1967)
• Salton’s book about SMART and IR (1968)
  • discusses many techniques that are used today
• Relevance ranking available (late 80’s)
• Large-scale probabilistic system (West, 1992)
• Google, Search Engines (1996)


Do you use Google?

Do you use Google once a day?

Do you use Google 10 times a day?

Do you Google?


Do you image Google?

Do you use Google image search?

More than once a day?

Do you video Google?


What comes to mind when we say MM Retrieval?

Keanu Reeves avoiding bullets

Helicopter Crash

i.e. Hollywood’s Multimedia Retrieval


Little Value in Indexing Published Content (?)

• Publishing implies
  • High production effort
  • Broad appeal
• Easy to manually annotate (once)
  • Somebody edits in a dissolve => they can add manual annotation
• No demand
  • People aren’t clamoring for image retrieval
  • “Give me Rock Hudson washing up on beach”


An MM Indexing Product

Bare Facts Video Guide

Indexes nudity in Hollywood videos

Very specialized


Kundi (.com-era startup)

• Hot Now button
  • A user identifies important content
• Voting/moderation scheme (à la Slashdot)

Notifications shared with other users

HotNow!


World is not Bleak!

• Cameras everywhere!
  • Number of sensors doubling every year
  • Fixed (webcams)
  • Mobile (on your person)

Security

Customer Relationship Management


Security

• Important in new world
• Find “interesting” events
  • Look for anomalies
  • Name that event
  • Look for secondary events


Customer Relationship Management

• High value in customization
• Imagine camera at store entrance
• Can we determine gender?

Suggest sale item in men’s clothing

Can we recognize previous customer?

Probably not well enough


Text vs. Multimedia


Properties of Multimedia
1. Visual Components
2. Spatial Components
3. Temporal Components
4. Ease of data entry
5. Well defined interaction unit?
6. Well defined semantic unit?

Data Type | 1 | 2 | 3 | 4              | 5 | 6
Text      | Y | Y | N | Easy           | Y | Y
Image     | Y | Y | N | Difficult      | N | N
Audio     | N | N | Y | Difficult      | N | N
Video     | Y | Y | Y | Very Difficult | N | N


What Type of Queries would you like to Answer?

[Foote99, Over01]
• Downhill skiing
• Scenes that include space shuttle launching
• Scenes with a yellow boat, pink flower
• People on the beach
• Speaker talking in front of the US flag
• Corn on the cob in a field
• Impact of heavy airliner landing on runways


What Type of Queries CAN you Answer Today?

• Use a sample image or video clip [Flickner95]
• Use basic art tools to express “a red object moving from the upper left to the lower right corner on a white background” [Dimitrova94, Chang98, Smith96, Yining98]
• Not at the semantic level desired


Two Fundamental Multimedia Retrieval Paradigms

Expression-based retrieval aka Query-by-Example [Foote99]

Semantic-based Retrieval based on automatically extracted metadata or manually annotated metadata [Barnard01]


What is Content Analysis?

• Analysis of low-level features
  • Basic features, physical properties
• Semantics for high-level abstractions
• Special algorithms borrowing from several disciplines
• Use of all media available
• Related Areas
  • signal processing, computer vision, speech recognition, pattern and image recognition, OCR, natural language, audio analysis


Query Based Multimedia IR System Overview

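[Diagram: the user's information need is represented as an audio-visual or text query, and the multimedia content is represented as indexed multimedia content. The system retrieves and computes similarity between the two, ranks the results, and the results are evaluated.]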


Similarity-based Image Search

• Manual annotations are far from suitable
  • subjective, feasible?
  • a picture is worth … how many keywords?
  • how much is a picture with no keywords worth?
• Typical automatic procedure
  • Use features to characterize images
  • Store feature vectors
  • Enable the user to start (limited!) semantic queries
  • Yield a set of resulting images based on distance of features
  • Smallest distance represents the best match
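The automatic procedure can be sketched as a nearest-neighbor ranking over stored feature vectors. This is a minimal illustration, assuming Euclidean distance and made-up 3-dimensional features; the function name `rank_by_distance` and the sample data are hypothetical, not part of any system named in these slides.

```python
import numpy as np

def rank_by_distance(query, database):
    """Return database indices sorted from best to worst match."""
    # Euclidean distance between the query vector and every stored vector.
    distances = np.linalg.norm(database - query, axis=1)
    order = np.argsort(distances)          # smallest distance first
    return order, distances[order]

# Hypothetical feature vectors (e.g. mean R, G, B of each image).
database = np.array([
    [0.9, 0.1, 0.1],   # image 0: mostly red
    [0.1, 0.8, 0.2],   # image 1: mostly green
    [0.8, 0.2, 0.1],   # image 2: reddish
])
query = np.array([1.0, 0.0, 0.0])          # "find red images"
order, dists = rank_by_distance(query, database)
```

The first entry of `order` is the best match; real systems replace the toy vectors with color, texture, or shape features.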


Analysis of Picture Sequences

• Goals
  • Recognition of objects
  • Recognition of camera motion
• Features
  • Object motion
  • Hints to semantics
  • Example: motion vs. non-motion sequences
• Recognition of motion in combination with segmentation
  • Tracking of object boundaries in subsequent frames yields higher segmentation performance than use of still images.


Visual Features in Multimedia

• Color, Color, Color
• Texture
• Shape
• Edges
• Object Outline
  • foreground vs. background
  • edge detection
• Motion Trajectories
• Higher-level Semantics
• Multimodal


Audio Features in Multimedia

• Features depend on audio category
  • Speech
  • Music
  • Sounds (e.g. explosions, street noise, etc.)
• Features
  • Energy, Loudness
  • Pitch
  • Cepstral Coefficients
  • Beat
  • Harmonics


Visual Features: Color

• Introduction
• Color Models
• Color Representations
• Color Features
• Similarity Measures


What is Color?

• It is a perceptual phenomenon
• Each color corresponds to a narrow band of wavelengths within the electromagnetic spectrum
  • Visible wavelengths: 400 – 700 nm range
  • 400 – 480 nm is blue, ~520 nm is green, 600 – 700 nm is red
• Human eye can distinguish 400,000 colors
• < 400 nm: ultraviolet and X-rays
• > 700 nm: infrared, microwaves, FM radio, TV, AM radio, etc.


Visual Features: Color

The Color Phenomenon
• The dominant wavelength of a light is called hue
• The intensity (energy) of a light is called luminance or brightness
• The amount of pure light (pink vs. red) is called saturation or purity
• Collectively, hue and saturation are referred to as chromaticity


Color Retrieval

• It is a global feature
• Independent of view and resolution
• No object/background segmentation is required
• Can handle deformation of objects
• Can handle articulated objects
• Color Coherence
• Color Layout
• Drawbacks: color constancy


Color Space

• The RGB Model
  • (red, green, blue) at different intensities
  • Used for active devices
• The YIQ Model
  • Developed by NTSC and used for the first color TV broadcast in 1953
  • Designed to be compatible with black & white TV
  • Luminance signal
    Y = 0.30 R + 0.59 G + 0.11 B
  • Two color difference signals
    I = 0.60 R - 0.28 G - 0.32 B
    Q = 0.21 R - 0.52 G + 0.31 B
  • The I and Q coefficients each sum to zero, so a gray pixel (R = G = B) carries no color-difference signal
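As a small worked example, the conversion can be written as one matrix product. The sketch below assumes the commonly quoted rounded NTSC coefficients (I = 0.60 R - 0.28 G - 0.32 B, Q = 0.21 R - 0.52 G + 0.31 B), chosen so that both color-difference signals vanish for gray input; the function name `rgb_to_yiq` is illustrative only.

```python
import numpy as np

# Rows give Y, I, Q as weighted sums of (R, G, B); rounded NTSC coefficients.
RGB_TO_YIQ = np.array([
    [0.30,  0.59,  0.11],   # Y: luminance
    [0.60, -0.28, -0.32],   # I: first color-difference signal
    [0.21, -0.52,  0.31],   # Q: second color-difference signal
])

def rgb_to_yiq(rgb):
    """Convert an (..., 3) array of RGB values to YIQ."""
    rgb = np.asarray(rgb, dtype=float)
    return rgb @ RGB_TO_YIQ.T

gray = rgb_to_yiq([0.5, 0.5, 0.5])   # luminance only, I and Q are ~0
red = rgb_to_yiq([1.0, 0.0, 0.0])    # first column of the matrix
```

A black & white receiver can use Y alone and ignore I and Q, which is the compatibility property the slide refers to.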


Color Space Linearity

• For color retrieval we need a measure of color difference
• RGB color space, each color (r, g, b)
• Drawbacks
  • It is not designed for humans
  • Mainly used for active display monitors
  • It is perceptually non-linear
• A linear color space is needed which corresponds to our perception


The CIE Color Space

• There are several perceptually linear (uniform) color spaces used in the color industry for quality control, such as CIE L*u*v*
• These are non-linear transformations of the RGB space
• They are device independent
• Euclidean distance can be used as a measure of similarity
• Empirical studies show that this is very close to human perception of color differences


Color Model towards Image Representation

Digital Image: 2D array of pixels

• 2D array of intensities
  • binary (1 bit/pixel), grayscale (8 bits/pixel) or color (24 bits/pixel)
• 2D array of codes
  • Code corresponds to RGB triple

134 135 132  12  15 ...
133 134 133 133  11 ...
130 133 132  16  12 ...
137 135  13  14  13 ...
140 135 134  14  12 ...


Color Modeled as Blocks

• Divide into 8x8 blocks and convert RGB to YUV
• Luminance (Y) and Chrominance (Cb, Cr)
  • Blue color difference Cb
  • Red color difference Cr
• Only half resolution needed from Chrominance


Discrete Cosine Transform

• Transform each block of 8x8 samples into a block of 8x8 spatial frequency coefficients
• Energy tends to be concentrated into a few significant coefficients
• Other coefficients are close to zero

DCT Basis
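The energy-compaction claim can be checked numerically. The following is a minimal sketch, assuming the orthonormal DCT-II (the variant commonly used in block-based image coding) and a synthetic smooth block; the helper names are illustrative.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix: row k is the k-th cosine basis vector."""
    k = np.arange(n).reshape(-1, 1)      # frequency index (rows)
    x = np.arange(n).reshape(1, -1)      # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos((2 * x + 1) * k * np.pi / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)           # the DC row uses a different scale
    return c

def dct2(block):
    """2-D DCT of a square block (transform rows, then columns)."""
    c = dct_matrix(block.shape[0])
    return c @ block @ c.T

def idct2(coeffs):
    """Inverse 2-D DCT; exact because the basis matrix is orthonormal."""
    c = dct_matrix(coeffs.shape[0])
    return c.T @ coeffs @ c

# Synthetic smooth 8x8 block: a horizontal brightness ramp.
block = np.tile(np.linspace(100, 170, 8), (8, 1))
coeffs = dct2(block)
energy = coeffs ** 2
top4_share = np.sort(energy.ravel())[-4:].sum() / energy.sum()
```

For this smooth block nearly all of the energy sits in four of the 64 coefficients, which is why the remaining ones can be coarsely quantized or dropped.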


Color and Color Mappings

Copyright by Smith&Chang 1996

RGB   HSV

Color Sets = binary vector representing color (good for regional color)


Color Representations

• Pair-wise
  • Represents color with a matrix of pixels
  • Computes changes at corresponding pixel locations
  • Advantage: it considers spatial location
  • Disadvantage: too low level, dependent on image size, not a concise representation
• Histogram Color Representation
  • Linearly re-quantize the contents into N levels
  • Simple method, used for video segmentation
• Cluster Color Representation


Color Similarity

• For (L, u, v) space, we can use Mahalanobis distance, where
  • Data = color
  • Correlation = perceptual similarity
• For HSV space, similarity is derived from the distance in the cylindrical HSV color space
• Histogram Quadratic Distance:
  • Introduced in QBIC project (IBM 1993)
  • Provides better similarity than “like-bin” comparison
  • Computationally expensive
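To see why a quadratic-form distance can beat a like-bin comparison, consider the sketch below. It assumes the usual form d = (h - g)^T A (h - g) with a bin-similarity matrix A[i, j] = 1 - d_ij / d_max; the three bin colors and all function names are hypothetical.

```python
import numpy as np

def similarity_matrix(bin_colors):
    """A[i, j] = 1 - d_ij / d_max, where d_ij is the distance between bin colors."""
    diff = bin_colors[:, None, :] - bin_colors[None, :, :]
    d = np.linalg.norm(diff, axis=2)
    return 1.0 - d / d.max()

def quadratic_distance(h, g, a):
    """Cross-bin distance: compares every bin with every other bin (O(n^2))."""
    z = h - g
    return float(z @ a @ z)

def like_bin_distance(h, g):
    """Bin-by-bin squared difference: ignores how similar the bin colors are."""
    return float(np.sum((h - g) ** 2))

# Three hypothetical bins: red, orange (perceptually near red), blue.
bin_colors = np.array([[1.0, 0.0, 0.0], [1.0, 0.5, 0.0], [0.0, 0.0, 1.0]])
a = similarity_matrix(bin_colors)
red = np.array([1.0, 0.0, 0.0])      # all pixels in the red bin
orange = np.array([0.0, 1.0, 0.0])   # all pixels in the orange bin
blue = np.array([0.0, 0.0, 1.0])     # all pixels in the blue bin
```

The like-bin distance rates red vs. orange and red vs. blue as equally different, while the quadratic form rates red closer to orange. The full matrix product over all bin pairs is where the extra computational cost comes from.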


What is Texture

• It is a perceptual phenomenon
• It is a region phenomenon (not a point phenomenon)
• Depends a lot on the scale
• Repeating patterns of local variations in image intensity which are too fine to be distinguished as a separate object


Visual Features: Texture

• Approaches
  • Statistical (coarseness, directionality, contrast) [Tamura78, Liu96]
  • Spectral [Ma96]
• Should be invariant to intensity, scale, orientation
• Natural scenes are challenging

Query Image

MIT’s Photobook Texture Matching


Tamura Texture Feature

• Primary Features
  • Contrast: related to picture quality, sharpness
  • Coarseness: coarse-grained vs. fine-grained
  • Directionality
• Secondary Features
  • Line-likeness (line-like vs. blob-like)
  • Regularity
  • Roughness


What is Shape?

• It is also a perceptual phenomenon
• A 2D shape descriptor should be invariant to translation, scale changes, rotation

Measures:


Visual Features: Shape

• Boundary-based Approach
  • Use contours, ignore interior
• Region-based Approach
  • Use interior details (holes, etc.) besides boundary details
• Can we reconstruct the object from the shape descriptors?


Shape Techniques Overview

[Diagram: taxonomy of shape description techniques. The labels recovered from the figure are listed below.]
• Shape Description: Boundary Based, Region Based
• Spatial Domain, Transform Domain
• Structural, Geometric
• Partial, Complete
• Corner Points, Chain Points, Shape Numbers, Perimeter, Area, Elongation, Compactness, Fourier Descriptors
• Contour Segments, Breakpoints
• Areas, Holes, Euler Number, Moment Invariants, Zernike Moments, Compactness, Elongation, Symmetry
• Primitives, Rules, 2D Strings
• Hough Transformation, Walsh Transform, Wavelet Transform


Region-Based Shape & Texture Matching

MIT’s Photobook:

FourEyes


Visual Features: Motion

• Align two images to achieve the best match.

• Determine motion between sequence images

Copyright Lucas & Kanade

Motion Field


Optical Flow

• Real-world object motion is transformed into color changes in images
• Efficient computation of motion vectors: use gray-value images
• Optical flow: motion of gray-value patterns in the image plane
  • first step: calculate the motion vector of each gray-value pixel
  • second step: calculate a continuous vector field (interpolation)


Optical Flow ...

• Constraints
  • both steps use constraints
  • both steps introduce motion vector failures
• Approaches
  • differential techniques (derivatives of gray values)
  • correlation-based techniques (correlation of regions)
  • energy-based techniques (velocity filters)
  • phase-based techniques (phase dependence with regard to band pass filters)
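A differential technique can be illustrated with a least-squares estimate over one window, in the spirit of Lucas & Kanade. This is a sketch under strong assumptions: a single smooth synthetic blob, a small sub-pixel shift, and one motion vector for the whole window; all names and test data are hypothetical.

```python
import numpy as np

def lucas_kanade_window(frame0, frame1):
    """Estimate one displacement (u, v) for a whole gray-value window."""
    mid = 0.5 * (frame0 + frame1)
    iy, ix = np.gradient(mid)             # spatial derivatives along rows (y), columns (x)
    it = frame1 - frame0                  # temporal derivative
    # Brightness constancy: ix * u + iy * v + it = 0 at every pixel.
    a = np.stack([ix.ravel(), iy.ravel()], axis=1)
    b = -it.ravel()
    (u, v), *_ = np.linalg.lstsq(a, b, rcond=None)
    return u, v

def gaussian_blob(cx, cy, size=41, sigma=6.0):
    """Synthetic gray-value pattern centered at (cx, cy)."""
    y, x = np.mgrid[0:size, 0:size]
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))

frame0 = gaussian_blob(20.0, 20.0)
frame1 = gaussian_blob(20.5, 20.3)        # pattern moved by (0.5, 0.3) pixels
u, v = lucas_kanade_window(frame0, frame1)
```

The estimate is only as good as the linearization: large displacements or windows whose gradients all point the same way make the least-squares system inaccurate or ill-conditioned.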


Optical Flow: Examples

[Figure panels: original, needle, flicker]


Optical Flow: Problems

• Correspondence Problem
• Aperture Problem, Solution of Aperture Problem
• Other Problems
  • Deformable Objects
  • Periodical Structures

[Figures: ambiguous matches between frames t0 and t1 for each problem]

• Optical Flow unreliable feature for content analysis!


Aperture Problem


Motion Estimation: Examples

[Figure panels: Block-based, Region-based, Pixel-based]

Pixel-based Motion Vector in Video Compression


Motion Vectors

• Modern compression algorithms for video calculate motion vectors for pixel blocks (examples: MPEG-1, MPEG-2, H.261, H.263).
• Block motion can be used to detect camera operations, but cannot be used to analyze object motion.
• Advantage: motion vectors are available without expensive calculation if encoder/decoder information is used


Motion Vectors

Example: famous MPEG test clip

Displacement Vectors

Velocity vector (flow vector)

ASSUMPTION: for a small time interval, velocity is constant


Local Motion:Motion Trajectory Extraction

• Object tracking through motion estimation
  • In spatial domain, 2D or 3D
  • In compressed domain, using motion vectors
• Trajectory representation using symbolic or analytical notation


Trajectory Representation and Retrieval

a) Trajectory motion pattern
b) B-Spline curve
c) Chain code
d) Differential chain code

[Dimitrova94]
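The two symbolic representations can be sketched in a few lines. The example assumes an 8-direction (Freeman-style) code with 0 = east, counting counter-clockwise, and a trajectory already sampled in unit grid steps; both are assumptions for illustration and not necessarily the encoding used in [Dimitrova94].

```python
# 8-direction codes: 0 = east, then counter-clockwise (y axis pointing up).
DIRECTIONS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
              (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code(points):
    """Encode a trajectory of unit grid steps as a list of direction codes."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        codes.append(DIRECTIONS[(x1 - x0, y1 - y0)])
    return codes

def differential_chain_code(codes):
    """Encode the change of direction between consecutive steps (mod 8)."""
    return [(b - a) % 8 for a, b in zip(codes, codes[1:])]

# Hypothetical trajectory of an object centroid on a grid.
trajectory = [(0, 0), (1, 0), (2, 1), (2, 2), (1, 2)]
codes = chain_code(trajectory)
diff = differential_chain_code(codes)
```

The differential code depends only on turns, so rotating the whole trajectory by a multiple of 45 degrees leaves it unchanged, which helps when matching motion patterns.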


MPEG-7 Visual Descriptors


Motion Activity: Motivation

• Need to capture “pace” or intensity of activity
• For example, draw distinction between
  • “High Action” segments such as chase scenes
  • “Low Action” segments such as talking heads
• Emphasize simple extraction and matching
• Use gross motion characteristics, thus avoiding object segmentation, tracking etc.
• Compressed domain extraction is important


MPEG-7 Motion Activity Descriptor

Attributes Used
• Intensity/Magnitude: 3 bits
• Spatial Characteristics: 16 bits
• Temporal Characteristics: 30 bits
• Directional Characteristics: 3 bits


MPEG-7 Motion Activity Descriptor

• Intensity
  • Expresses “pace” or intensity of action
  • Extracted by suitably quantizing variance of motion vector magnitude
• Direction
  • Expresses dominant direction if definable as one of a set of eight equally spaced directions
  • Extracted by using averages of angle (direction) of each motion vector
  • Useful where there is strong directional motion
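Both attributes can be sketched from a list of block motion vectors. The code below is an illustration only: the quantization thresholds are hypothetical (not the values from the MPEG-7 standard), the spread is measured as the standard deviation (the square root of the variance named above), and the dominant direction is taken from the average unit vector rather than a plain average of angles, to avoid wrap-around at 0/360 degrees.

```python
import numpy as np

# Hypothetical thresholds on the spread of motion-vector magnitudes (pixels).
HYPOTHETICAL_THRESHOLDS = [0.5, 2.0, 4.0, 8.0]

def intensity_level(motion_vectors, thresholds=HYPOTHETICAL_THRESHOLDS):
    """Quantize the spread of motion-vector magnitudes into levels 1..5."""
    magnitudes = np.linalg.norm(np.asarray(motion_vectors, dtype=float), axis=1)
    return int(np.digitize(magnitudes.std(), thresholds)) + 1

def dominant_direction(motion_vectors):
    """Return one of eight 45-degree sectors (0 = east), or None if no motion."""
    mv = np.asarray(motion_vectors, dtype=float)
    mags = np.linalg.norm(mv, axis=1)
    moving = mags > 0
    if not moving.any():
        return None
    units = mv[moving] / mags[moving, None]      # unit direction per vector
    mean_dx, mean_dy = units.mean(axis=0)
    angle = np.arctan2(mean_dy, mean_dx)
    return int(np.round(angle / (np.pi / 4))) % 8

talking_head = [[0, 0], [0, 0], [0, 0], [0, 0]]
chase_scene = [[0, 0], [10, 0], [0, 0], [0, 10]]
pan_right = [[3, 0], [5, 0], [4, 0]]
```

Because the inputs are motion vectors, both values can be derived from the compressed stream without decoding full frames.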


MPEG-7 Motion Activity Descriptor

Spatial Distribution: using run-lengths

• Captures the size and number of moving regions in the shot on a frame by frame basis
• Enables distinction between shots with one large region in the middle, such as talking heads, and shots with several small moving regions, such as aerial soccer shots
• Thus “sparse” shots have many long runs while “dense” shots do not have many long runs.

[Figure labels: short, medium, long runs]


Video Indexing

• Analysis of Still Image
  • Features: Color, Texture, Shape
  • Distance Metrics
• Analysis of Image Sequence
  • Segmentation
  • Cut Detection
  • Motion Vectors
  • Shot Transitions
  • Camera Operations
  • Scene Analysis
  • Selection of Keyframes
  • Shot Similarity

[Diagram: video → scenes → shots → frames]


Camera Motion Descriptors

• Camera track, boom, and dolly motion modes
• Camera pan, tilt and roll motion modes


Video Indexing: Multilayered Hierarchical Structure of a Video Clip

Copyright by J. Hunter 2001, Dublin Core and MPEG-7 Metadata for Video


Video Indexing: Semantic Units (Hierarchy)

• Object, Regions, Frames
• Shot: continuous sequence of frames captured from one camera
• Scene: one or more shots presenting different views of the same event (time or space related)
• Segment: one or more related scenes

Transitions
• Cut: an abrupt shot change that occurs in a single frame
• Dissolve: continuous transition, progressive linear combination
• Fade: a slow change in brightness, usually resulting in or starting with a solid black frame
• Wipe: pixels from the second shot replace those of the first shot in a regular pattern
• Others: special effects; editing tools can offer up to 200 effects


Video Indexing Example

Shots

Description     | Formats
Text            | Text
Camera Distance | Controlled Vocabulary
Camera Angle    | Controlled Vocabulary
Camera Motion   | Controlled Vocabulary
Duration        | secs, frames
Start Time      | secs, frame #, SMPTE
End Time        | secs, frame #, SMPTE
KeyFrame        | GIF, JPEG
Lighting        | Controlled Vocabulary
Open Trans      | Controlled Vocabulary
Close Trans     | Controlled Vocabulary

Scenes

Description | Formats
Text        | Text
Script      | Text
Transcript  | Text
Edit List   | Text
Duration    | secs, frames
Start Time  | secs, frame #, SMPTE
End Time    | secs, frame #, SMPTE
KeyFrame    | GIF, JPEG
Locale      | Text
Cast        | Text
Object      | Text

Dublin Core Metadata


Reliable Shot Detection

The three most commonly used transition types are:

• Abrupt Cuts, Hard Cuts
• Fades
• Dissolves


Cut Detection

• Cut: sudden change of image content between continuous shots
• Cut Detection: separate video into shots and calculate features for shots separately.

[Figure: frames along a time axis]


Shot Transitions

• Fade In
  • change of image content from monochrome color to image
  • example: fade from white/black
• Fade Out
  • change of image content from image to monochrome color
  • example: fade to white/black

[Figure: frames along a time axis]


What is Dissolve?

Dissolve: shot transition with image overlays

[Figure: frames along a time axis]


Types of Dissolve

Cross dissolve

Additive dissolve


Shot Boundary Detection

• Pixel Differences
• Statistical Differences
• Histograms
• Compression Differences
• Edge Tracking
• Motion Vectors

SMPTE 00:12:45:20


Pixel Differences: Basic Idea

• Compute the total number of pixels that change in value by more than a threshold $t$
• If this total is greater than a second threshold $T_b$, then a shot boundary is detected
• Drawbacks
  • Sensitive to camera motion (pan, zoom)
  • Sensitive to object motion
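The basic idea fits in a few lines. This is a minimal sketch, assuming 8-bit gray-value frames stored as NumPy arrays; the threshold values and the tiny 4x4 test frames are hypothetical and chosen only to make the behavior visible.

```python
import numpy as np

def pixel_difference_boundary(frame_a, frame_b, t=20, t_b=8):
    """Declare a shot boundary if more than t_b pixels change by more than t."""
    # Cast to int first so that uint8 subtraction cannot wrap around.
    diff = np.abs(frame_a.astype(int) - frame_b.astype(int))
    changed = int((diff > t).sum())
    return changed > t_b

dark = np.zeros((4, 4), dtype=np.uint8)
bright = np.full((4, 4), 200, dtype=np.uint8)
slightly_changed = dark.copy()
slightly_changed[0, 0] = 255          # a single changed pixel (noise or small object)
```

A pan or a large moving object changes many pixels at once, which is exactly the sensitivity listed as a drawback above.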


Pixel Differences: Improvements

• Basic method plus the use of a 3x3 averaging filter before the comparison [Zhang93]
• Divide image in 12 regions and find the best match for each region in a neighborhood around the region in the other image. Difference is the sum of the region differences. [Shahraray95]
• Chromatic images:
  • Change in gray level in 2nd image
  • Relatively constant for dissolves and fades
• Still sensitive to camera and object motion


Histogram Differences

• Use color/gray-scale histograms of pixels as a feature to detect shot boundaries
• Assumption: for the same background and same objects, there is very little change in the histogram
• Let $H_i(j)$ be the histogram value for the $j$-th bin of the $i$-th frame; then the difference is given by

  $CHD_i = \sum_j |H_i(j) - H_{i+1}(j)|$

• If the difference exceeds a threshold, $CHD_i > T_b$, a shot boundary is detected
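The histogram difference between consecutive frames, $CHD_i = \sum_j |H_i(j) - H_{i+1}(j)|$, can be computed as below. The sketch assumes 8-bit gray-value frames and 64 bins; the function names and the toy frames are illustrative.

```python
import numpy as np

def gray_histogram(frame, bins=64):
    """H(j): number of pixels whose gray value falls into bin j."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist

def chd(frame_i, frame_next, bins=64):
    """CHD_i: sum over bins of |H_i(j) - H_{i+1}(j)|."""
    h_i = gray_histogram(frame_i, bins)
    h_next = gray_histogram(frame_next, bins)
    return int(np.abs(h_i - h_next).sum())

dark = np.zeros((4, 4), dtype=np.uint8)
bright = np.full((4, 4), 200, dtype=np.uint8)
half = np.zeros((4, 4), dtype=np.uint8)
half[:, 2:] = 200                      # right half bright
half_flipped = half[:, ::-1]           # left half bright: same histogram
```

`chd(half, half_flipped)` is zero even though the images differ, which illustrates both the robustness to motion and the weakness that different images can share a histogram.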


Histograms: Example

Cut


Histograms: Difference GraphCuts

Threshold


Histogram-Based Cut Detection

Different images can have same histograms

Same Histogram

Same Histogram

Obvious example

Not so obvious example


Histogram-Based Cut Detection: Challenges

Different images can have similar histograms

Color values of subsequent images change significantly without a cut occurring

explosions

change of scene illumination

fast movement of large objects

Performance of histogram-based cut detection: between 90 and, in some cases, even 98 percent


Histogram Differences: Improvements

• A coarse quantization is good enough. Typically, a 6-bit code: the 2 highest-order bits of each of the R, G and B channels.
  • This leads to 64-bin histograms.
  • Good trade-off between accuracy and speed for shot boundary detection
• Threshold selection is crucial. The threshold $T_b$ depends very much on the content
• Gradual transitions: use two thresholds instead of one global threshold, one for abrupt cuts and one for special effects
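The 6-bit color code can be built with bit shifts. The sketch assumes 8-bit RGB frames of shape (height, width, 3) and packs the code as RRGGBB; the packing order and the function names are assumptions made for illustration.

```python
import numpy as np

def six_bit_codes(rgb_frame):
    """Map each 8-bit RGB pixel to a 6-bit code built from the top 2 bits per channel."""
    top = (rgb_frame.astype(np.uint8) >> 6).astype(np.intp)   # values 0..3
    return (top[..., 0] << 4) | (top[..., 1] << 2) | top[..., 2]

def histogram_64(rgb_frame):
    """64-bin color histogram over the 6-bit codes."""
    return np.bincount(six_bit_codes(rgb_frame).ravel(), minlength=64)

# Hypothetical 2x2 frame: red, white, black, black.
frame = np.array([[[255, 0, 0], [255, 255, 255]],
                  [[0, 0, 0], [0, 0, 0]]], dtype=np.uint8)
hist = histogram_64(frame)
```

With only 64 bins, comparing two frames costs 64 subtractions regardless of resolution once the histograms are built.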


Histogram Comparison

[Figure: talk show sequence. Frame numbers shown: 405, 459, 810 and 810, 972, 1026. Similarity measures shown: 0.4264, 0.4298 and 0.1602, 0.0383.]

Copyright Philips (MPEG-7 contribution)


Histogram Differences: Twin-Comparison Method

• Compute $CHD_i$ for all frames in the video
• Mark camera breaks where $CHD_i > T_b$
• Mark potential gradual transition subsequences $GT = \{[F_s, F_e]\}$ wherever $CHD_i > T_s$, with a lower threshold $T_s < T_b$
• For each gradual transition $[F_s, F_e]$, accumulate the frame-to-frame differences into $AC$
• If $AC > T_b$, then declare $[F_s, F_e]$ as a gradual transition
• This algorithm works well and is widely used
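The steps above can be sketched over a precomputed list of frame-to-frame differences. This is a simplified illustration: it sums the consecutive differences inside each candidate subsequence, whereas published versions of twin comparison compare later frames against the start frame; the function name and sample values are hypothetical.

```python
def twin_comparison(chd, t_s, t_b):
    """Return (cuts, gradual) from frame-to-frame differences chd[i].

    cuts: indices i with chd[i] > t_b.
    gradual: (start, end) index pairs whose accumulated difference exceeds t_b.
    """
    cuts, gradual = [], []
    start, acc = None, 0.0
    for i, d in enumerate(chd):
        if d > t_b:                        # abrupt change: camera break
            cuts.append(i)
            start, acc = None, 0.0
        elif d > t_s:                      # candidate gradual transition
            if start is None:
                start = i
            acc += d
        else:                              # candidate ends: check the accumulated value
            if start is not None and acc > t_b:
                gradual.append((start, i - 1))
            start, acc = None, 0.0
    if start is not None and acc > t_b:    # candidate running until the last frame
        gradual.append((start, len(chd) - 1))
    return cuts, gradual

# Hypothetical difference values: one cut, one dissolve, one short bump.
chd_values = [1, 1, 30, 1, 6, 7, 8, 6, 1, 6, 1]
cuts, gradual = twin_comparison(chd_values, t_s=5, t_b=20)
```

The short bump at index 9 exceeds the lower threshold but never accumulates past the upper one, so it is not reported.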


IBM’s CueVideo Shot Boundary Detection

SMPTE 00:12:45:20

• Detects cuts, dissolves, fades and other gradual changes
• Compares multiple pairs of frames: 1, 3 and 7 frames apart
• Processes decoded frames
  • Supports MPEG, QT, AVI, live feed, …
• No user-tuned parameters: allows batch processing
• Detection of flashes, bad frames
• One pass: allows live video processing

Copyright IBM Almaden


CueVideo Histogram Example:


Edge Change Ratio (ECR)

Properties

• Number of edge pixels in image $i$ and $(i-1)$: $s_i$ and $s_{i-1}$
• $E^{out}_{i-1}$: pixel in image $(i-1)$ is an edge pixel, pixel in image $i$ is not an edge pixel
• $E^{in}_i$: pixel in image $(i-1)$ is not an edge pixel, pixel in image $i$ is an edge pixel
• Use of broad edges (noise independence)
• Edge change ratio between images $i$ and $(i-1)$:

  $ECR_i = \max\left(\frac{E^{in}_i}{s_i}, \frac{E^{out}_{i-1}}{s_{i-1}}\right)$
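A possible computation of the edge change ratio, $ECR_i = \max(E^{in}_i / s_i, E^{out}_{i-1} / s_{i-1})$, on two binary edge images is sketched below. The "broad edges" are modeled here as a 3x3 dilation of the other frame's edge image, and the handling of frames without any edge pixels is an assumption of this sketch; the function names are illustrative.

```python
import numpy as np

def dilate(edges, radius=1):
    """Broaden edges: a pixel is set if any pixel within `radius` is an edge."""
    padded = np.pad(edges, radius)                 # pads with False
    out = np.zeros_like(edges)
    h, w = edges.shape
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def edge_change_ratio(prev_edges, cur_edges, radius=1):
    """ECR between boolean edge images of frames (i-1) and i."""
    s_prev, s_cur = int(prev_edges.sum()), int(cur_edges.sum())
    if s_prev == 0 or s_cur == 0:
        # Assumption of this sketch: no change if both are empty, full change otherwise.
        return 0.0 if s_prev == s_cur else 1.0
    e_in = int((cur_edges & ~dilate(prev_edges, radius)).sum())    # entering edge pixels
    e_out = int((prev_edges & ~dilate(cur_edges, radius)).sum())   # exiting edge pixels
    return max(e_in / s_cur, e_out / s_prev)

prev = np.zeros((8, 8), dtype=bool)
prev[:, 2] = True                                  # vertical edge at column 2
shifted = np.zeros((8, 8), dtype=bool)
shifted[:, 3] = True                               # same edge moved by one pixel
other = np.zeros((8, 8), dtype=bool)
other[6, :] = True                                 # unrelated horizontal edge
```

The one-pixel shift is absorbed by the broadened edges, while the unrelated edge image produces a high ratio.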


Computation of ECR: Example

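[Figure: edge images are computed for image (i-1) and image i. Each edge image is combined by AND with the inverted edge image of the other frame, giving the entering edge pixels $EC^{in}_i$ and the exiting edge pixels $EC^{out}_{i-1}$ (the ECR images), from which the ECR is computed.]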


ECR Cut Detection

[Figure: five plots of D over time, for the cases Inside Shot, Cut, Fade Out, Fade In, Dissolve]


ECR Cut Detection: Cuts

• If $ECR_i$ is the edge change ratio between frames $i$ and $(i-1)$, a cut is detected if $ECR_i \geq T$, where $T$ is a threshold
• Fast object and camera motion leads to high ECR values without cuts


ECR Cut Detection: Fade In, Fade Out

Fade out: number of edge pixels zero after last frame of sequence

Fade in: number of edge pixels zero before first frame of sequence

Fade In Fade Out


ECR Cut Detection: Problems

Fast object or camera motion

Explosions

Fades and dissolves

soft transitions are difficult to detect

other effects: wipe detection unreliable

Performance: typically between 90 and 95 percent


Shot-Boundary Detection: Conclusions

• Histogram-based techniques are good at recognizing cuts
• Standard deviation techniques are good at recognizing fades
• Dissolves are the most challenging
• Problems
  • Ground truth: experimental data must be analyzed manually
  • Database? Benchmarks?
  • Definition of a fade/dissolve


Text Detection: Applications

• Annotation and search of image and video libraries
  • TV, movie studios, advertising, and surveillance
• Automatic identification and logging of the beginning and end of key events based on captions
• Video Summarization
• Ticker Tape analysis
• Commercial Detection
• Sports Programs indexing


Text Detection: Design Decisions

• What kind of text occurrences?
  • Scene text
  • Overlay text
• With what style attributes?
  • Font size
  • Font type
  • Text color
• In what kind of media data?
  • Image-based
  • Video-based
• What should be achieved?
  • Localization
  • Segmentation
  • Recognition
• How will the results be used?
  • Indexing
  • Object-based video encoding

any

both


Example: MPEG-4 Text Extraction

• Locate text of any size at any position in images, web pages and videos
• Segment and recognize text
• Encode extracted text as rigid foreground object in MPEG-4 (with Yen-Kuang Chen)

[Figure: PSNR (Y), from 27.5 to 31.5, versus bit rate, from 160 to 195 KBits/sec, for Single VOP and Multiple VOP encoding]


Example:

Dec 25 1998

OCR result:


Text Detection Example - Latin Script


Text Detection: Korean Script Example


Text Extracted from Video


Face Detection


Pool of Features

=> ~130,000 features for 24x24 window


Rapid Computation

[Figure: rectangles in x/y image coordinates]

Rainer Lienhart, Jochen Maydt. An Extended Set of Haar-like Features for Rapid Object Detection. IEEE ICIP 2002, pp. 900-903, Sep. 2002.
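The figure for this slide is lost, so the following is an assumption about what it illustrated: rectangle sums computed from a summed-area table (integral image), the standard device for evaluating Haar-like features in constant time. The function names and the toy image are hypothetical.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; padded with one row and column of zeros."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle with top-left corner (x, y): four lookups."""
    return int(ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x])

def two_rect_feature(ii, x, y, w, h):
    """Left half minus right half of a rectangle (w must be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

img = np.arange(16).reshape(4, 4)      # toy 4x4 "image"
ii = integral_image(img)
```

After one pass to build the table, every rectangle sum costs four array lookups regardless of rectangle size, which is what makes a pool of over 100,000 features tractable.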


Cascade of Classifiers

• Premise
  • Size of feature pool (>100000) exceeds what any reasonable classifier can handle
  • Cascade of classifiers (special kind of decision tree) can outperform a single stage classifier because it can use more features at the same computational complexity
  • Use Boosting (Discrete/Real/Gentle AdaBoost, LogitBoost)

[Diagram: Input Pattern → Stage 1 → Stage 2 → … → Stage N → Object. The values shown at each stage are listed below.]

Stage   | Objects passed on, P(x|o) | Non-objects passed on, P(x|¬o) | Objects rejected so far, P(x|o)
Stage 1 | .998                      | .5                             | .002
Stage 2 | .998^2 = .996             | .5^2                           | .004
Stage N | .998^N ~ .90              | .5^N                           | ~ .1
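The cascade arithmetic is a simple product of per-stage rates. The sketch below assumes independent stages with the per-stage values from the slide (hit rate 0.998, false-alarm rate 0.5); the choice N = 50 is an assumption made here because it reproduces the overall hit rate of about 0.90 shown for stage N.

```python
def cascade_rates(n_stages, stage_hit_rate=0.998, stage_false_alarm=0.5):
    """Overall hit rate and false-alarm rate after n_stages independent stages."""
    return stage_hit_rate ** n_stages, stage_false_alarm ** n_stages

hit2, fa2 = cascade_rates(2)       # after stage 2
hit50, fa50 = cascade_rates(50)    # hypothetical N = 50
```

Each stage discards about half of the remaining background windows while losing very few objects, so the false-alarm rate falls exponentially with depth and the hit rate decays only slowly.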


Cascade Concept

Target Concept

Background removal in stage 1

Background removal in stage 2






Thank you for your attention




Edge Detection

• Basic Idea: 1st and 2nd derivative of an edge
  • The position of the edge can be estimated with the maximum of the 1st derivative or with the zero-crossing of the 2nd derivative
• Generalize the technique to calculate the derivative of a two-dimensional image
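The one-dimensional case is easy to check numerically. This is a minimal sketch, assuming a synthetic smooth step (a sigmoid centered between samples 10 and 11) and finite-difference derivatives; the function name is illustrative.

```python
import numpy as np

def edge_position_1d(signal):
    """Locate an edge by the 1st-derivative maximum and 2nd-derivative zero-crossings."""
    d1 = np.gradient(signal)                     # first derivative
    d2 = np.gradient(d1)                         # second derivative
    peak = int(np.argmax(np.abs(d1)))
    signs = np.sign(d2)
    # Index k is reported when d2 changes sign between samples k and k + 1.
    crossings = [int(k) for k in np.where(signs[:-1] * signs[1:] < 0)[0]]
    return peak, crossings

x = np.arange(21)
signal = 1.0 / (1.0 + np.exp(-(x - 10.5)))       # smooth step edge at x = 10.5
peak, crossings = edge_position_1d(signal)
```

Both estimates land next to the true edge position; in two dimensions the first derivative becomes the gradient magnitude and the second derivative is typically replaced by the Laplacian.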


Canny Edge Detector

• Designed to be an optimal edge detector (according to particular criteria)
• It takes as input a gray scale image
• It produces as output an image showing the positions of tracked intensity discontinuities


Canny Edge Detector

• Multi-stage process
  • The image is smoothed by Gaussian convolution
  • A simple 2-D first derivative operator highlights regions of the image with high first spatial derivatives; edges show up as ridges in this gradient magnitude image
  • The algorithm tracks along the top of these ridges and sets to zero all pixels that are not actually on the ridge top (non-maximal suppression)
  • The tracking process exhibits hysteresis
