video indexing and retrieval

00-04-25 Kyongil Yoon 1/39

Video Indexing and Video Indexing and RetrievalRetrieval

CMSC828KCMSC828K

Kyongil YoonKyongil Yoon


ContentsContents

Part 1 : SurveyPart 1 : Survey

““Multimedia Database Management Systems”Multimedia Database Management Systems”Guojun LuGuojun Lu

Chapter 7. Video Indexing and RetrievalChapter 7. Video Indexing and Retrieval

Part 2 : ExamplePart 2 : Example

Automatic Video Indexing via Object Motion AnalysisAutomatic Video Indexing via Object Motion AnalysisJonathan D. CourtneyJonathan D. CourtneyTexas InstrumentsTexas Instruments


IntroductionIntroduction

VideoVideo– A combination of text, audio, and images with a time dimensionA combination of text, audio, and images with a time dimension

Indexing and retrieval methodsIndexing and retrieval methods– Metadata-based methodMetadata-based method

– Text-based methodText-based method

– Audio-based methodAudio-based method

– Content-based methodContent-based method

• Video : A collection of independent images or framesVideo : A collection of independent images or frames

• Video : A sequence of groups of similar frames (shot-based)Video : A sequence of groups of similar frames (shot-based)

– Integrated approachIntegrated approach


Shot-Based Video …Shot-Based Video …

Video shot : logical unit or segmentVideo shot : logical unit or segment– Same sceneSame scene– Single camera motionSingle camera motion– A distinct event or an actionA distinct event or an action– A single indexable eventA single indexable event

QueryQuery– Which video?Which video?– What part of video?What part of video?

StepsSteps– Segment the video into shotsSegment the video into shots– Index each shotsIndex each shots– Apply a similarity measurement between queries and video shotsApply a similarity measurement between queries and video shots

Retrieve shots with high similaritiesRetrieve shots with high similarities


Shot Detections (Segmentation)Shot Detections (Segmentation)

SegmentationSegmentation– A process of dividing a video sequence into shotsA process of dividing a video sequence into shots

Key issue Key issue – Establishing suitable difference metricsEstablishing suitable difference metrics

– Techniques for applying themTechniques for applying them

TransitionTransition– Camera breakCamera break

– Dissolve, wipe, fade-in, fade-outDissolve, wipe, fade-in, fade-out


Basic Video Segment TechniquesBasic Video Segment Techniques

Sum of pixel-to-pixel differencesSum of pixel-to-pixel differences Color histogram differenceColor histogram difference

– To be tolerant with object motionTo be tolerant with object motion

– SDSDii = = jj||HHii(j)-H(j)-Hi+1i+1(j)(j)|| wherewhere i : frame number, j : gray leveli : frame number, j : gray level

Modification of color histogramModification of color histogram– SDSDii = = jj((((HHii(j)-H(j)-Hi+1i+1(j)(j)))22 / / HHi+1i+1(j)(j))) 22 test test

Selection of appropriate threshold - CriticalSelection of appropriate threshold - Critical– e.g.) The mean of the frame-to-frame differencee.g.) The mean of the frame-to-frame difference

+ small tolerance value + small tolerance value


Detecting Gradual ChangeDetecting Gradual Change

Fade-in, fade-out, dissolve, wipe, …Fade-in, fade-out, dissolve, wipe, … Twin-comparison techniqueTwin-comparison technique

– TTbb : Normal camera breaks : Normal camera breaksTTss : Potential frames of gradual change : Potential frames of gradual change

– If If TTbb < diff < diff shot boundaryshot boundaryTTss < diff < T < diff < Tbb accumulate differencesaccumulate differences diff < T diff < Tss nothingnothing

– If the accumulated value is greater than TIf the accumulated value is greater than Tbb, a gradual change is , a gradual change is detected.detected.

Detection techniques based on wavelet transformationDetection techniques based on wavelet transformation

Very hard to detect!Very hard to detect!


False Shot DetectionFalse Shot Detection

Camera panning, tilting, and zoomingCamera panning, tilting, and zooming– Motion analysis techniquesMotion analysis techniques– Camera movementsCamera movements

• Optical flow computed by block matching methodOptical flow computed by block matching method Illumination changeIllumination change

– Normalization of color images before carrying out shot detectionNormalization of color images before carrying out shot detection

1.1. RRii’ = R’ = Rii / Sqrt( / Sqrt( NN R Rii2 2 ), G), Gii’ = …, B’ = …, Bii’ = …’ = …

2.2. ChromaticityChromaticity

1)1) rrii’ = R’ = Rii’ / (R’ / (Rii’ + G’ + Gii’ + B’ + Bii’)’)

2)2) ggii’ = R’ = Rii’ / (R’ / (Rii’ + G’ + Gii’ + B’ + Bii’)’)

3.3. A combined histogram for r and g : CHI (Chromaticity histogram image)A combined histogram for r and g : CHI (Chromaticity histogram image)4.4. Reduce it to 16x16Reduce it to 16x165.5. 2D DCT2D DCT6.6. Pick only 36 significant DCT valuesPick only 36 significant DCT values7.7. Distances are calculated based on these valuesDistances are calculated based on these values


Other Shot DetectionOther Shot Detection

Motion removalMotion removal– Ideally, frame-to-frame distance should beIdeally, frame-to-frame distance should be

• Close to zero with very little variation within a shotClose to zero with very little variation within a shot

• Significantly larger than within-values between shotsSignificantly larger than within-values between shots

– However, within a shotHowever, within a shot

• Object motion, camera motion, other changesObject motion, camera motion, other changes

• Filter to remove the effects of camera/object motionFilter to remove the effects of camera/object motion

Based on edge detectionBased on edge detection Advanced camerasAdvanced cameras

– Recording extra information such as position, time, orientation, …Recording extra information such as position, time, orientation, …


Segmentation of Compressed VideoSegmentation of Compressed Video

Based on MPEG compressed videoBased on MPEG compressed video– DCT coefficientsDCT coefficients

– Motion informationMotion information

• E.g. # of bidirectional coded macro blocks in B frame, it is very likely E.g. # of bidirectional coded macro blocks in B frame, it is very likely shot boundary occurs around the B frameshot boundary occurs around the B frame

Based on VQ compressed videoBased on VQ compressed video


Video Indexing and RetrievalVideo Indexing and Retrieval

Shot detection is preprocessing for indexingShot detection is preprocessing for indexing

R (representative) framesR (representative) frames– One or more key frames for each shotOne or more key frames for each shot

– Retrieval is based on these framesRetrieval is based on these frames

Other informationOther information– Motion, objects, metadata, annotationMotion, objects, metadata, annotation


Based on R framesBased on R frames

An r frame captures the main content of the shotAn r frame captures the main content of the shot Image retrieval : color, shape, texture, …Image retrieval : color, shape, texture, … Choosing r framesChoosing r frames

– How many?How many?1.1. One per shotOne per shot2.2. The number of r frames according to their lengthThe number of r frames according to their length3.3. One per subshot/sceneOne per subshot/scene

– How to select?How to select?1.1. First frame of segmentFirst frame of segment2.2. An average frameAn average frame3.3. The frame whose histogram is closest to the average histogramThe frame whose histogram is closest to the average histogram4.4. Large background + all foregrounds superimposedLarge background + all foregrounds superimposed

– First frame + frame with large distanceFirst frame + frame with large distance


R frame base ignores temporal or motion informationR frame base ignores temporal or motion information Motion information is derived from optical flow or Motion information is derived from optical flow or

motion vectorsmotion vectors Parameters for indexingParameters for indexing

– Content : Content : talking head vs car crashtalking head vs car crash

– Uniformity : Uniformity : smoothness as a function of timesmoothness as a function of time

– Panning : Panning : horizontal camera movementhorizontal camera movement

– Tilting : Tilting : vertical camera movementvertical camera movement

Based on Motion InformationBased on Motion Information

Camera motionCamera motion–Pan, tilt, zoom, swing, (horizontal/vertical) shiftPan, tilt, zoom, swing, (horizontal/vertical) shift


Based on ObjectsBased on Objects

Content based representationContent based representation If one could find a way to distinguish individual If one could find a way to distinguish individual

objects throughout he sequence, …objects throughout he sequence, … In a still image, object segmentation is difficultIn a still image, object segmentation is difficult

In a video sequence, we can group pixels that move In a video sequence, we can group pixels that move together into an object.together into an object.

MPEG-4 object-based codingMPEG-4 object-based coding– How to representHow to represent

– NOT how to segment and detectNOT how to segment and detect


Based on OthersBased on Others

MetadataMetadata– DVD-SI : DVD service informationDVD-SI : DVD service information

Title, video type, directorsTitle, video type, directors

AnnotationAnnotation1.1. ManuallyManually

2.2. Associated transcripts or subtitlesAssociated transcripts or subtitles

3.3. Speech recognition on sound trackSpeech recognition on sound track

Integrated methodIntegrated method


Effective Video Representation and Effective Video Representation and AbstractionAbstraction

Useful to have effective representation and Useful to have effective representation and abstraction toolabstraction tool

How to show contents in a limited spaceHow to show contents in a limited space ApplicationsApplications

– Video browsingVideo browsing

– Presentation of video resultsPresentation of video results

– Reduce network bandwidth requirements and delayReduce network bandwidth requirements and delay

Then how?Then how?


Representation and AbstractionRepresentation and Abstraction

Topical or subject classificationTopical or subject classification– News : (local, international, finance, sport, weather)News : (local, international, finance, sport, weather)

Motion icon (Motion icon (miconmicon) or video icon) or video icon– Easy shot boundary representationEasy shot boundary representation

– Operations : browsing, slicing, extraction a subiconOperations : browsing, slicing, extraction a subicon

Video streamerVideo streamer ClipmapClipmap

– A window containing a collection of 3D miconsA window containing a collection of 3D micons

Hierarchical video browserHierarchical video browser


Representation and AbstractionRepresentation and Abstraction

StoryboardStoryboard– A collection of representative framesA collection of representative frames

MosaickingMosaicking– An algorithm to combine information from a number of framesAn algorithm to combine information from a number of frames

Scene transition graphScene transition graph– Node : image which represents one or more video shotsNode : image which represents one or more video shots

– Edge : the content and temporal flow of videoEdge : the content and temporal flow of video

Video skimmingVideo skimming– High-level video characterization, compaction, and abstractionHigh-level video characterization, compaction, and abstraction


Automatic Video Indexing via Object Motion AnalysisAutomatic Video Indexing via Object Motion AnalysisAs an Object Tracking ExampleAs an Object Tracking Example

Video indexingVideo indexing– The process of identifying important frames or objects in the video The process of identifying important frames or objects in the video

data for efficient playbackdata for efficient playback

– Scene cut detection, camera motion, object motionScene cut detection, camera motion, object motion

– Hierarchical segmentationHierarchical segmentation

Three stepsThree steps– Motion segmentation, object tracking, motion AnalysisMotion segmentation, object tracking, motion Analysis

EventsEvents– Appearance/DisappearanceAppearance/Disappearance

– Deposit/RemovalDeposit/Removal

– Entrance/ExitEntrance/Exit

– Motion/RestMotion/Rest


Motion SegmentationMotion Segmentation

Segmented Image CSegmented Image Cnn

– CCnn = ccomps( T = ccomps( Thh••k )k )

Th : binary image resulting from thresholding |ITh : binary image resulting from thresholding |Inn-I-I00||

TThh••k : morphological close operation on Tk : morphological close operation on Thh

– Reference frame IReference frame I00

Strong assumptions may fail whenStrong assumptions may fail when– Sudden lightingSudden lighting

– Gradual lightingGradual lighting

– Change of viewpointChange of viewpoint

– Objects in reference frameObjects in reference frame


Imperfectness of SegmentationImperfectness of Segmentation

All the possible problemsAll the possible problems– True objects will disappear temporarilyTrue objects will disappear temporarily

– False objectsFalse objects

– Separate objects will temporarily join togetherSeparate objects will temporarily join together

– Single objects will split into multiple regionsSingle objects will split into multiple regions


Object TrackingObject Tracking

TerminologyTerminology– SequenceSequence - ordered set of N frames - ordered set of N frames

SS = = {{FF00, F, F11, … F, … FN-1N-1}} : : FFii is i-th frameis i-th frame

– ClipClip CC = ( = (S, f, s, lS, f, s, l) : ) : FFff,F,Fll - first and last valid frame, - first and last valid frame, FFss - start frame - start frame

– FrameFrame FF : image : image II annotated with a timestamp annotated with a timestamp tt, , FFnn = ( = (IInn, t, tnn))

– ImageImage II : : r r x x cc array of pixel array of pixel

– TimestampTimestamp records the date and the time records the date and the time

V-objectV-object– Extracted by motion segmentation comparing a frame to a reference Extracted by motion segmentation comparing a frame to a reference

frameframe

– Label, centroid, bounding box, shape maskLabel, centroid, bounding box, shape mask

– VVnn = { = { VVnnpp; p = 1, … P; p = 1, … P } }


Object TrackingObject Tracking Tracking procedureTracking procedure

– Iterate (forward) step 1-3 for frames 0, 1, …, N-2Iterate (forward) step 1-3 for frames 0, 1, …, N-21. For each V-object, predict its position in next frame1. For each V-object, predict its position in next frame

6. 6. Determine secondary links from forward stepDetermine secondary links from forward step

7. Determine secondary links from backward step7. Determine secondary links from backward step

np = n

p + np(tn+1-tn)n

p = np + n

p(tn+1-tn)

2. For each V-object, determine the V-object in the next frame with centroid 2. For each V-object, determine the V-object in the next frame with centroid nearest to the predictionnearest to the prediction

3. For every pair, estimate forward velocity3. For every pair, estimate forward velocity

4. Do 1-3 in backward4. Do 1-3 in backward

- For all frames- For all frames

5. Determine primary links5. Determine primary linksfor mutual nearest neighborfor mutual nearest neighbor


Object TrackingObject Tracking

Following graph is producedFollowing graph is produced– Node - V-objectsNode - V-objects

– Primary links (mutually nearest)Primary links (mutually nearest)

– Secondary links (others)Secondary links (others)

F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14

1-dim rep.positions


Motion AnalysisMotion AnalysisV-object GroupingV-object Grouping

Group V-objects with difference levelsGroup V-objects with difference levels– Stem, Stem, MM

– Branch, Branch, BB

– Trail, Trail, LL

– Track, Track, KK

– Trace, Trace, EE

E E K K L L B B MM Each level implies a feature of the blobEach level implies a feature of the blob


V-object Grouping - StemV-object Grouping - Stem Maximal size path of two or more V-objects with no secondary linksMaximal size path of two or more V-objects with no secondary links MM = { = {VVii : : i = 1, 2i = 1, 2, … , … NNM}}

– outdegree(outdegree(VVii) = 1 for 1 ) = 1 for 1 ii < < MM

– indegree(indegree(VVii) = 1 for 1<) = 1 for 1<i i MM

– either either 11 = = 22 = … = = … = NNMM

or or 11 22 … … NNMM

Stationary/movingStationary/moving


AA

AD

D

B B B B/EE/F F

C C C C C H H H H

GG

II

KK

JJ

V-objects with constant/no movement without any change

V-objects with constant/no movement without any change


V-object Grouping - BranchV-object Grouping - Branch Maximal size path containing no secondary links and composed Maximal size path containing no secondary links and composed

with only one pathwith only one path BB = { = {VVii : : i = 1, 2, …, Ni = 1, 2, …, NBB}}

– outdegree(outdegree(VVii) = 1 for 1 ) = 1 for 1 ii < < BB

– indegree(indegree(VVii) = 1 for 1<) = 1 for 1<i i BB

Stationary(one stem) / moving(otherwise)Stationary(one stem) / moving(otherwise)

L M M M MM M

LL

OO

N N N N N Q Q Q Q T

PP

S

RR

T

S


V-objects without changeV-objects without change


V-object Grouping - TrailV-object Grouping - Trail

LL Maximal-size path without secondary linksMaximal-size path without secondary links Stationary/moving/unknown Stationary/moving/unknown

U V V V VV V

UU

UU

W W W W W W W W W W Y

XX

Y Z

YY

Y

Y

Z


V-objects with main movementV-objects with main movement


V-object Grouping - TrackV-object Grouping - Track KK = { = {LL11, G, G11, … ,L, … ,LNNK-1K-1

, G, GNNK-1K-1, L, LNNKK

}}– LLii : trail: trail– GGi i : connecting dipath with constant velocity: connecting dipath with constant velocity

through through H = { H = {VVllii, G, Gii, V, V11

i+1i+1}} wherewhere VVll

i i is the last object of is the last object of LLii and and VV11i+1 i+1 is the first object ofis the first object of L Li+1i+1

Stationary/moving/unknownStationary/moving/unknown


V-objects with movementguessing occluded part

V-objects with movementguessing occluded part


V-object Grouping - TraceV-object Grouping - Trace

EE Maximal size connected digraph of V-objectsMaximal size connected digraph of V-objects

1 2 2 2 22 2

11

11

1 1 1 1 1 1 1 1 1 1 1

11

1 1

11

1

1

1


Group of V-objects overlappedGroup of V-objects overlapped


EventsEvents

AppearanceAppearance - - an object emerges in the scenean object emerges in the scene Disappearance Disappearance - - an object disappears from the scenean object disappears from the scene Entrance Entrance - - moving object enters the scenemoving object enters the scene Exit Exit - - moving objects exits from the scenemoving objects exits from the scene Deposit Deposit - - an inanimate object is added to the scenean inanimate object is added to the scene Removal Removal - - an inanimate object is removed from the scenean inanimate object is removed from the scene Motion Motion - - an object at rest begins to movean object at rest begins to move Rest Rest - - a moving object comes to a stopa moving object comes to a stop (Depositor) (Depositor) - - a moving object adds an inanimate object to a moving object adds an inanimate object to

the scene the scene (Remover) (Remover) - - a moving object removes an inanimate object a moving object removes an inanimate object

from the scene from the scene


Annotating V-objectsAnnotating V-objects

Appearance 1. Head of track2. Indegree(V) > 0

1. Head of track2. Indegree(V) = 0



Adjacent to V-object with deposit tag

Adjacent from V-object with removal tag

1. Tail of stationary stem2. Head of moving stem1. Tail of moving stem2. Head of stationary stem

DisappearanceEntrance

Exit

Deposit

Removal

(Depositor)

(Remover)

Motion

Rest

1. Tail of track2. Outdegree(V) > 01. Head of track2. Indegree(V) = 01. Tail of track2. Outdegree(V) = 0

1. Tail of track2. Outdegree(V) = 0



Moving Stationary Unknown

V-object motion state


Example of AnnotationExample of Annotation


Entrance Entrance

EntranceExit

Exit

ExitDepositor/Deposit Removal/Remover

Motion Rest Appearance Disappearance


QueryQuery

Y = (Y = (CC, , TT, , VV, , RR, , EE))– CC : a video clip : a video clip

– T = (tT = (tii, t, tjj)) : a time interval within the clip : a time interval within the clip

– VV : V-object in the clip : V-object in the clip

– RR : a spatial region in the field of view : a spatial region in the field of view

– E E : an object motion event: an object motion event

Processing a queryProcessing a query– Keeps truncating domain with query parametersKeeps truncating domain with query parameters


Experimental ResultExperimental Result

3 videos, 900 frames, 18 objects, 44 events3 videos, 900 frames, 18 objects, 44 events

1 false negative, 10 false positive1 false negative, 10 false positive ConservativeConservative

Video 1Video 1 Video 2Video 2 Video 3Video 3

Inventory or Security Inventory or Security monitoringmonitoring300 frs, 10fr/sec300 frs, 10fr/sec5 objects, 10 events5 objects, 10 eventsentrance/exit, entrance/exit, deposit/removaldeposit/removal

retail customer retail customer monitoringmonitoring285 frames, 10 fr/sec285 frames, 10 fr/sec4 objects, 14 events4 objects, 14 eventsall eight eventsall eight events3 foreground objects in 3 foreground objects in ref. frameref. frameMost complicatedMost complicated

parking lot traffic parking lot traffic monitoringmonitoring315 frames, 3fr/sec315 frames, 3fr/sec9 objects, 20 events9 objects, 20 eventsmost noisymost noisy


Errors come fromErrors come from

Noise in the sequenceNoise in the sequence Assumption of constant trajectories of occluded Assumption of constant trajectories of occluded

objectsobjects No means to track objects through occlusion by fixed No means to track objects through occlusion by fixed

scene objectsscene objects


MosaickingMosaicking


Story board, Video MultiplexingStory board, Video Multiplexing

Show 20 minutes of video in 6 secondsShow 20 minutes of video in 6 seconds Loop all shots as thumbnails at same timeLoop all shots as thumbnails at same time Let the user focus on the interesting shotsLet the user focus on the interesting shots


MiconMicon

video indexing and retrieval

Documents

video shots

video sequence

ri ri gi bigi

frame number

frame distance

videovideo indexing

ri sqrt n ri2

key frames