
Page 1:

Hands On: Multimedia Methods for Large Scale Video Analysis

Dr. Gerald Friedland, [email protected]

Page 2:

Multimedia on the Internet is Growing

Raphaël Troncy: Linked Media: Weaving non-textual content into the Semantic Web, MozCamp, 03/2009.

Page 3:

Multimedia on the Internet is Growing

• YouTube claims 65k to 100k video uploads per day, or 48 to 72 hours of video uploaded every minute

• Flickr claims 1M image uploads per day

• Twitter: up to 120M messages per day

Page 4:

Why do we care?

• Consumer-produced multimedia allows empirical studies at a never-before-seen scale in various research disciplines such as sociology, medicine, economics, environmental sciences, computer science...

• Recent buzzword: BIGDATA

Page 5:

Problem

How can YOU effectively work on large scale multimedia data (without working at Google)?

Page 6:

What is this class about?

• Introduction to large-scale video processing from a CS point of view (application driven)

• Covers different modalities: Visual, Audio, Tags, Sensor Data

• Covers different side topics, such as Mechanical Turk for annotation

Page 7:

Content of this Class

• Visual methods for video analysis
• Acoustic methods for video analysis
• Meta-data and tag-based methods for video analysis
• Inferring from the social graph and collaborative filtering
• Information fusion and multimodal integration
• Coping with memory and computational issues
• Crowd sourcing for ground truth annotation
• Privacy issues and societal impact of video retrieval

Page 8:

Why you should be interested in the Topic

• Processing of large-scale, consumer-produced videos is a brand-new emerging field driven by:
  – Massive government funding (e.g. IARPA Aladdin, IARPA Finder, NSF BIGDATA)
  – Industry's needs to make consumer videos retrievable and searchable
  – Interesting academic questions

Page 9:

Some Examples of Research at ICSI

• Video Concept Detection
• Forensic User Matching
• Location Estimation
• Fast Approaches in Parlab

Page 10:

Video2GPS: Multimodal Location Estimation of Flickr Videos

Gerald Friedland, International Computer Science Institute, Berkeley, [email protected]

Page 11:

Definition


• Location Estimation = estimating the geo-coordinates of the origin of the content recorded in digital media.
• Here: a regression task
  – Where was the media recorded, in latitude, longitude, and altitude?

G. Friedland, O. Vinyals, and T. Darrell: "Multimodal Location Estimation", pp. 1245-1251, ACM Multimedia, Florence, Italy, October 2010.

Page 12:

Motivation 1


Training data comes for free on a never-before seen scale!

Portal               %      Total
YouTube (estimate)   3.0    3M
Flickr               4.5    180M

Allows tackling of tasks of never-before-seen difficulty?

Page 13:

Motivation 2


Location-based services are awesome!

Geocoordinates + Time = Unique ID

Page 14:

De-Motivation

Geo-Location enables Cybercasing!

G. Friedland and R. Sommer: "Cybercasing the Joint: On the Privacy Implications of Geotagging", Proceedings of the Fifth USENIX Workshop on Hot Topics in Security (HotSec 10), Washington, D.C., August 2010.

Page 15:

Placing Task

Automatically guess the location of a Flickr video:
• i.e., assign geo-coordinates (latitude and longitude)
• Using one or more of:
  – Visual/Audio content
  – Metadata (title, tags, description, etc.)
  – Social information

Page 16:

Consumer-Produced, Unfiltered Videos...

Page 17:

Data Description (2011)

• Training Data
  – 10k Flickr videos/metadata/visual keyframes (+features)/geo-tags
  – 6M Flickr photos/metadata/visual features
• Test Data
  – 5k Flickr videos/metadata/visual keyframes (+features)
  – no geotags
• Test/Training split: by UserID

Page 18:

Your Best Guess???

Page 19:

Our Approach


Tag-based Approach
+ Visual Approach
+ Acoustic Approach
-------------------
Multimodal Approach

Page 20:

Intuition for the Approach

Investigate the 'spatial variance' of a feature (a minimal sketch follows):
– Spatial variance is small: the feature is likely location-indicative
– Spatial variance is high: the feature is likely not indicative
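To make the intuition concrete, here is a minimal, hypothetical sketch (not the actual ICSI implementation) of how a per-tag spatial variance could be computed from geo-tagged training items; the `training_items` structure and the exact variance definition are assumptions for illustration.

```python
from collections import defaultdict
import numpy as np

def tag_spatial_variances(training_items):
    """Compute a simple spatial variance per tag.

    `training_items` is assumed to be an iterable of (tags, lat, lon) tuples
    taken from geo-tagged training videos/photos. The variance definition
    (sum of per-coordinate variances around the tag centroid) is illustrative;
    the lecture's exact definition may differ.
    """
    coords_per_tag = defaultdict(list)
    for tags, lat, lon in training_items:
        for tag in tags:
            coords_per_tag[tag].append((lat, lon))

    variances = {}
    for tag, coords in coords_per_tag.items():
        pts = np.asarray(coords, dtype=float)
        # Small value: occurrences cluster tightly -> likely location-indicative.
        variances[tag] = float(pts.var(axis=0).sum())
    return variances
```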

Page 21:

Example Revisited

Tags: ['pavement', 'ucberkeley', 'berkeley', 'greek', 'greektheatre', 'spitonastranger', 'live', 'video']

Page 22:

Tag-based Approach

• 'greektheatre' would be the correct answer; however, it was not seen in the development dataset
• 'ucberkeley' is the 2nd-best answer: estimation error 0.332 km (a selection sketch follows the table below)

Tag               Matches in training set   Spatial variance
pavement          2                         5.739
ucberkeley        4                         0.132
berkeley          14                        68.138
greek             0                         N/A
greektheatre      0                         N/A
spitonastranger   0                         N/A
live              91                        6453.109
video             2967                      6735.844
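The table suggests a simple decision rule: among the test video's tags that were seen in training, pick the one with the lowest spatial variance and use the centroid of its training occurrences as the estimate. The sketch below is a hypothetical illustration of that rule, reusing the `coords_per_tag`/`variances` structures from the previous sketch.

```python
import numpy as np

def estimate_location_from_tags(test_tags, coords_per_tag, variances):
    """Return (lat, lon) from the matched tag with the lowest spatial variance.

    Tags never seen in training (e.g. 'greektheatre' above) are skipped;
    returns None if no tag matches at all.
    """
    seen = [t for t in test_tags if t in coords_per_tag]
    if not seen:
        return None
    best = min(seen, key=lambda t: variances[t])  # 'ucberkeley' in the example
    return tuple(np.mean(coords_per_tag[best], axis=0))
```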

Page 23:

Visual Approach

• Assumption: similar images imply similar locations
• Based on related work: reduce location estimation to an image retrieval problem
• Use the median frame of the video as keyframe

Page 24:

Visual Features

• Color Histograms
  – L*a*b* (4, 14, 14 bins, respectively)
  – Chi-square distance
• GIST
  – 5x5 spatial resolution, 6 orientations and 4 scales
  – Euclidean distance
(A distance-function sketch follows the reference below.)

See: A. Oliva and A. Torralba: "Building the Gist of a Scene: The Role of Global Image Features in Recognition", Progress in Brain Research, Vol. 155, pp. 23-36, Elsevier, 2006.
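A sketch of the two distances named above and of a 1-NN lookup over a geo-tagged image database, assuming histograms and GIST descriptors are already extracted as NumPy arrays; the weighting parameters stand in for the scaling learned from training data (see the results discussion later in the deck) and are not the lecture's actual values.

```python
import numpy as np

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two L*a*b* color histograms."""
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def gist_distance(g1, g2):
    """Euclidean distance between two GIST descriptors."""
    return float(np.linalg.norm(g1 - g2))

def visual_1nn(query_hist, query_gist, database, w_hist=1.0, w_gist=1.0):
    """1-NN retrieval over (histogram, gist, (lat, lon)) triples."""
    best = min(database,
               key=lambda item: w_hist * chi_square_distance(query_hist, item[0])
                                + w_gist * gist_distance(query_gist, item[1]))
    return best[2]  # geo-coordinates of the closest training image
```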

Page 25:

(image-only slide)

Page 26:

(image-only slide)

Page 27:

Audio Approach

• Idea: Different places have different acoustic “signatures”

• Due to data sparsity: Initially only used for major cities

• Classification system similar to a GMM/UBM speaker-ID system, using MFCC19 features (a simplified sketch follows).
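A simplified sketch of the idea, assuming MFCC frames are already extracted (e.g. 19-dimensional). It trains one diagonal-covariance GMM per city with scikit-learn and scores a test clip by average log-likelihood; the real system's UBM training and MAP adaptation are omitted, and the component count is an arbitrary placeholder.

```python
from sklearn.mixture import GaussianMixture

def train_city_models(mfccs_per_city, n_components=32):
    """Train one GMM per city on its pooled MFCC frames.

    `mfccs_per_city` maps city name -> array of shape (n_frames, 19).
    A GMM/UBM system would MAP-adapt each city model from a universal
    background model instead of fitting each model from scratch.
    """
    return {
        city: GaussianMixture(n_components=n_components,
                              covariance_type="diag").fit(frames)
        for city, frames in mfccs_per_city.items()
    }

def classify_city(models, test_frames):
    """Return the city whose model gives the highest average log-likelihood."""
    return max(models, key=lambda city: models[city].score(test_frames))
```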

Page 28:

An Experiment

Listen!

• Which city was this recorded in? Pick one of: Amsterdam, Bangkok, Barcelona, Beijing, Berlin, Cairo, Cape Town, Chicago, Dallas, Denver, Duesseldorf, Fukuoka, Houston, London, Los Angeles, Lower Hutt, Melbourne, Moscow, New Delhi, New York, Orlando, Paris, Phoenix, Prague, Puerto Rico, Rio de Janeiro, Rome, San Francisco, Seattle, Seoul, Siem Reap, Sydney, Taipei, Tel Aviv, Tokyo, Washington DC, Zuerich
• Solution: Tokyo, highest confidence score!

Page 29:

Audio-Only Results

[Figure: DET curve for common-user city identification with 540 videos and 9,700 trials; miss probability (%) vs. false alarm probability (%). EER = 22%]

Page 30:

Problem: Bias

Page 31:

Geo-tagging: an estimation-theoretic ...

Observations:
• Images: [four example photos]
• Tags: {berkeley, sathergate, campanile}, {berkeley, haas}, {campanile}, {campanile, haas}, i.e. $\{t_k^1\}, \{t_k^2\}, \{t_k^3\}, \{t_k^4\}$

Estimate:
• Geolocations: $x_1, x_2, x_3, x_4$

Page 32:

Interpreting traditional approaches

Locations are random variables: $\{x_1, x_2, \ldots, x_N\}$

Traditional approaches estimate the probability of a location given the tags:

$$p(x_i \mid \{t_k^i\}) \propto \prod_k p(x_i \mid t_k^i)$$

where $p(x_i \mid t_k^i)$ is obtained from the training set.

Example: the distribution for the tag "washington" is depicted here. [map figure]

Location estimate:

$$\int x_i \, p(x_i \mid \{t_k^i\}) \, dx_i$$
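As a toy illustration of the formula above (not the lecture's implementation), one can discretize the world into a grid, multiply the per-tag distributions obtained from training, and take the expectation over the grid; the `grid` and `per_tag_density` inputs are hypothetical.

```python
import numpy as np

def traditional_estimate(grid, per_tag_density, test_tags):
    """Grid-based p(x | {t_k}) ∝ Π_k p(x | t_k), followed by its mean.

    `grid` is an (M, 2) array of candidate (lat, lon) points and
    `per_tag_density` maps tag -> length-M array of probabilities on the grid.
    """
    posterior = np.ones(len(grid))
    for tag in test_tags:
        if tag in per_tag_density:      # unseen tags simply contribute nothing
            posterior *= per_tag_density[tag]
    total = posterior.sum()
    if total == 0:                      # degenerate case: no grid point survives
        return grid.mean(axis=0)
    posterior /= total
    # Location estimate: expectation of the grid points under the posterior.
    return posterior @ grid
```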

Page 33:

Drawbacks

Data sparsity: Not all tags in the test set are available in the training set. Hence the estimate of $p(x_i \mid t_k^i)$ can be bad.

Sub-optimality: The approaches are suboptimal given the data.

What we ideally want:

$$p(x_1, x_2, \ldots, x_N \mid \{t_k^1\}, \{t_k^2\}, \ldots, \{t_k^N\})$$

The mean of the above distribution gives the best estimate of the locations, i.e. for each image we want $p(x_i \mid \{t_k^1\}, \{t_k^2\}, \ldots, \{t_k^N\})$.

Traditional algorithms only give: $p(x_i \mid \{t_k^i\})$

Page 34:

Cooperative geo-tagging

Intuition: Images in the training set having common tags have correlated geo-locations, captured by the joint distribution.

Joint probability modeling:

$$p(x_1, x_2, \ldots, x_N \mid \{t_k^1\}, \{t_k^2\}, \ldots, \{t_k^N\}) \propto \prod_i p(x_i \mid \{t_k^i\}) \prod_{(i,j)} p(x_i, x_j \mid \{t_k^i\} \cap \{t_k^j\})$$

$p(x_i \mid \{t_k^i\})$ is obtained from the training set as before.

The pairwise distribution given at least one common tag, $p(x_i, x_j \mid \{t_k^i\} \cap \{t_k^j\})$, is modeled as an indicator function $I(x_i = x_j)$: if the common tag has low spatial variance or occurs infrequently, e.g. if the common tag is "haas", it is very likely that the locations are the same.

Question: How do we estimate the optimal marginal distribution $p(x_i \mid \{t_k^1\}, \{t_k^2\}, \ldots, \{t_k^N\})$?
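One hypothetical way to realize the pairwise term is to add an edge between two items whenever they share a tag of low spatial variance, matching the indicator-function modeling above; the variance threshold is an illustrative placeholder, not a value from the lecture.

```python
from itertools import combinations

def build_edges(tag_sets, variances, var_threshold=1.0):
    """Return index pairs (i, j) sharing at least one low-variance tag.

    `tag_sets[i]` is the tag set of item i and `variances` maps tag ->
    spatial variance (e.g. from the earlier sketch).
    """
    edges = []
    for i, j in combinations(range(len(tag_sets)), 2):
        common = set(tag_sets[i]) & set(tag_sets[j])
        if any(variances.get(t, float("inf")) < var_threshold for t in common):
            edges.append((i, j))
    return edges
```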

Page 35:

Bayesian graphical framework

[Figure: example images with tag sets {berkeley, sathergate, campanile}, {berkeley, haas}, {campanile}, {campanile, haas}, connected as a graph.]

Node: geolocation of an image, with node potential $p(x_i \mid \{t_k^i\})$

Edge: correlated locations (e.g. a common tag)

Edge potential: strength of an edge, e.g. the posterior distribution of locations given common tags, $p(x_i, x_j \mid \{t_k^i\} \cap \{t_k^j\})$

Page 36:

Belief propagation updates

Iterative algorithm to approximate the posterior distribution $p(x_i \mid \{t_k^1\}, \{t_k^2\}, \ldots, \{t_k^N\})$.

Gaussian modeling: $p(x_i \mid \{t_k^i\}) \approx \mathcal{N}(\mu_i, \sigma_i^2)$

At iteration 0, each node calculates $(\mu_i, \sigma_i^2)$.

At iteration $t$, each node updates its location as a weighted mean of its previous location and that of its neighbors:

$$\frac{1}{(\sigma_i^{(t)})^2} = \frac{1}{(\sigma_i^{(t-1)})^2} + \sum_{k \in N(i)} \frac{1}{(\sigma_k^{(t-1)})^2}$$

$$\mu_i^{(t)} = (\sigma_i^{(t)})^2 \left[ \frac{\mu_i^{(t-1)}}{(\sigma_i^{(t-1)})^2} + \sum_{k \in N(i)} \frac{\mu_k^{(t-1)}}{(\sigma_k^{(t-1)})^2} \right]$$

The weights reflect the confidence in the measurements, i.e. the higher the spatial variance, the lower the weight. (A minimal code sketch of these updates follows.)
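A minimal sketch of the updates as reconstructed above, using synchronous precision-weighted averaging over a neighbor list; initial means and variances are assumed to come from the tag-based node potentials, and the iteration count is arbitrary.

```python
import numpy as np

def gaussian_bp(mu0, var0, neighbors, n_iters=10):
    """Iterative precision-weighted location updates.

    mu0:       (N, 2) initial per-node estimates (lat, lon)
    var0:      (N,)   initial per-node variances
    neighbors: list where neighbors[i] holds the indices adjacent to node i
    """
    mu = np.asarray(mu0, dtype=float).copy()
    var = np.asarray(var0, dtype=float).copy()
    for _ in range(n_iters):
        prec = 1.0 / var
        new_prec = prec.copy()
        weighted = mu * prec[:, None]
        for i, nbrs in enumerate(neighbors):
            for k in nbrs:
                new_prec[i] += prec[k]          # accumulate neighbor precisions
                weighted[i] += prec[k] * mu[k]  # precision-weighted neighbor means
        var = 1.0 / new_prec
        mu = weighted * var[:, None]
    return mu, var
```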

Page 37:

Belief propagation

[Figure: a small graph whose nodes carry Gaussian beliefs $(\mu_1, \sigma_1^2)$, $(\mu_2, \sigma_2^2)$, $(\mu_3, \sigma_3^2)$.]

Posterior mean and variance assuming Gaussian beliefs.

Audio-visual features are incorporated in modeling the edge and node potentials.

Page 38:

Incorporating Audio-Visual features

• GIST features are extracted for the images.
• MFCC features are extracted for the audio.
• These are now incorporated into the node and edge potentials as exponential distributions:

$$p(x_i, x_j \mid a_i, a_j) \propto \exp\left(-\frac{\|x_i - x_j\|}{\|a_i - a_j\|}\right)$$

where $a_i$ are the audio features associated with image $i$.

The intuition is that the closer the audio features are, the higher the probability that the geo-locations are close. Similarly, this can be included in the node potentials as well as for the visual features. (A sketch of this edge potential follows.)
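A small sketch of this edge potential; the epsilon guard against identical feature vectors is an implementation detail added here, not part of the slide.

```python
import numpy as np

def edge_potential(x_i, x_j, a_i, a_j, eps=1e-6):
    """Unnormalized exponential edge potential: similar features (small
    ||a_i - a_j||) strongly penalize placing the two locations far apart."""
    loc_dist = np.linalg.norm(np.asarray(x_i, float) - np.asarray(x_j, float))
    feat_dist = np.linalg.norm(np.asarray(a_i, float) - np.asarray(a_j, float))
    return float(np.exp(-loc_dist / (feat_dist + eps)))
```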

Page 39:

Multimodal Results: MediaEval 2011 Dataset

[Figure 3: The resulting accuracy of the algorithm as described in Section 5; count (%) of videos vs. distance between estimation and ground truth (10 m, 100 m, 1 km, 5 km, 10 km, 50 km, 100 km).]

[...] used as the final distance between frames. The scaling of the weights was learned by using a small sample of the training set and normalizing the individual distance distributions so that the standard deviation of each of them would be similar. We use 1-NN matching between the test frame and all the images in a 100 km radius around the 1 to 3 coordinates from the previous step. We pick the match with the smallest distance and output its coordinates as the final result.

This multimodal algorithm is less complex than previous algorithms (see Section 3), yet produces more accurate results on less training data. The following section analyzes the accuracy of the algorithm and discusses experiments to support individual design decisions.
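A rough, hypothetical sketch of the refinement step described above: restrict the visual 1-NN search to training images within 100 km of the tag-based candidate coordinates. Feature extraction and the learned distance scaling are abstracted into `distance_fn`, which is a placeholder name.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def refine_with_visual_1nn(candidates, query_features, train_images,
                           distance_fn, radius_km=100.0):
    """`candidates` are the 1 to 3 tag-based guesses; `train_images` is a list
    of (features, (lat, lon)) pairs; returns the refined (lat, lon)."""
    pool = [(feats, loc) for feats, loc in train_images
            if any(haversine_km(loc, c) <= radius_km for c in candidates)]
    if not pool:
        return candidates[0]  # fall back to the best tag-based guess
    best = min(pool, key=lambda item: distance_fn(query_features, item[0]))
    return best[1]
```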

6. RESULTS AND ANALYSIS

The evaluation of our results is performed by applying the same rules and using the same metric as in the MediaEval 2010 evaluation. In MediaEval 2010, participants were to build systems to automatically guess the location of the video, i.e., assign geo-coordinates (latitude and longitude) to videos using one or more of: video metadata (tags, titles), visual content, audio content, and social information. Even though training data was provided (see Section 4), any "use of open resources, such as gazetteers, or geo-tagged articles in Wikipedia was encouraged" [16]. The goal of the task was to come as close as possible to the geo-coordinates of the videos as provided by users or their GPS devices. The systems were evaluated by calculating the geographical distance from the actual geo-location of the video (assigned by a Flickr user, the creator of the video) to the predicted geo-location (assigned by the system). While it was important to minimize the distances over all test videos, runs were compared by finding how many videos were placed within a threshold distance of 1 km, 5 km, 10 km, 50 km and 100 km. For analyzing the algorithm in greater detail, here we also show distances of below 100 m and below 10 m. The lowest distance category is about the accuracy of a typical GPS localization system in a camera or smartphone.

First we discuss the results as generated by the algorithm described in Section 5. The results are visualized in Figure 3. The results shown are superior in accuracy to any system presented in MediaEval 2010.

[Figure 4: The resulting accuracy when comparing tags-only, visual-only, and multimodal location estimation, as discussed in Section 6.1; count (%) of videos vs. distance between estimation and ground truth.]

Also, although we added additional data to the MediaEval training set, which was legal under the rules explained above, we added less data than other systems in the evaluation, e.g. [21]. Compared to any other system, including our own, the system presented here is the least complex.

6.1 About the Visual Modality

Probably one of the most obvious questions is the impact of the visual modality. As a comparison, the image-matching-based location estimation algorithm in [8] started reporting accuracy at a granularity of 200 km. As can be seen in Figure 4, this is consistent with our results: using the location of the 1-best nearest neighbor in the entire database compared to the test frame results in a minimum accuracy of 10 km. In contrast, tag-based localization reaches accuracies of below 10 m. For the tags-only localization we modified the algorithm from Section 5 to output only the 1-best geo-coordinate centroid of the matching tag with the lowest spatial variance and skip the visual matching step. While the tags-only variant of the algorithm already performs well, using visual matching on top of the algorithm decreases the accuracy in the finer-granularity ranges but increases overall accuracy, as in total more videos can be classified below 100 km. Out of the 5091 test videos, using only tags, 3774 videos can be estimated correctly with an accuracy better than 100 km. The multimodal approach estimates 3989 correctly in the range below 100 km.

6.2 The Influence of Non-Disjoint User Sets

Each individual person has their own idiosyncratic method of choosing keywords for certain events and locations when they upload videos to Flickr. Furthermore, the spatial variance of the videos uploaded by one user is low on average. At the same time, a certain number of users upload many videos. Therefore, taking into account which user a video belongs to seems to have a higher chance of finding geographically related videos. For this reason, videos in the MediaEval 2010 test dataset were chosen to have a disjoint set of users from the training dataset. However, the additional training images provided for MediaEval 2010 are not user-disjoint [...]

Page 40:

How is the Class Organized?

• Once a week: lectures giving a fundamental introduction (from me)
• Once a week: project meetings
• Guest lectures by YouTube, Yahoo!, Intel, TU Delft

Page 41:

Hands-On Part

• Do a project on a large-scale data set
• Data accessible through ICSI accounts
• Additional computation: $100 in Amazon EC2 money

Page 42:

Lecture Material

• Background material for lectures: G. Friedland, R. Jain: Introduction to Multimedia Computing, to appear, Cambridge University Press, 2012.

• (Constantly changing) draft available at http://www.mm-creole.org

Page 43:

How do you receive Credit?

• 3 credits (graded or ungraded), based on:
  – 2 quizzes
  – a project
