


A spectrogram-based audio fingerprinting system for content-based copy detection

Chahid Ouali 1,2 & Pierre Dumouchel 1 & Vishwa Gupta 2

Received: 2 October 2014 / Revised: 2 November 2015 / Accepted: 17 November 2015
© Springer Science+Business Media New York 2015

Abstract This paper presents a novel audio fingerprinting method that is highly robust to a variety of audio distortions. It is based on an unconventional audio fingerprint generation scheme. The robustness is achieved by generating different versions of the spectrogram matrix of the audio signal, using a threshold based on the average of the spectral values to prune this matrix. We transform each version of this pruned spectrogram matrix into a 2-D binary image. Multiple versions of these 2-D images suppress noise to a varying degree. This varying degree of noise suppression improves the likelihood of one of the images matching a reference image. To speed up matching, we convert each image into an n-dimensional vector, and perform a nearest neighbor search based on this n-dimensional vector. We give results with two different feature parameters and their combination. We test this method on the TRECVID 2010 content-based copy detection evaluation dataset, and we also validate the performance on the TRECVID 2009 dataset. Experimental results show the effectiveness of these features even when the audio is distorted. We compare the proposed method to two state-of-the-art audio copy detection systems, namely the NN-based and Shazam systems. Our method by far outperforms the Shazam system for all audio transformations (or distortions) in terms of detection performance, number of missed queries and localization accuracy. Compared to the NN-based system, our approach reduces the minimal Normalized Detection Cost Rate (min NDCR) by 23 % and improves localization accuracy by 24 %.

Keywords Content-based copy detection · Audio fingerprints · Feature parameters · Spectrogram · TRECVID

Multimed Tools Appl
DOI 10.1007/s11042-015-3081-8

* Chahid Ouali
[email protected]

Pierre Dumouchel
[email protected]

Vishwa Gupta
[email protected]

1 ÉTS (École de Technologie Supérieure), Montreal, Canada
2 CRIM (Computer Research Institute of Montreal), Montreal, Canada


1 Introduction

The evolution of information technology has allowed the production of huge amounts of multimedia data. This has increased the need for powerful tools to handle this data in terms of identification, filtering and retrieval. In this context, audio copy detection, which consists of identifying duplicate (or near duplicate) audio content, has become an emerging and active research area due to its broad applications: broadcast monitoring, music identification, copyright control, law enforcement investigation, music library organization, etc.

Watermarking is one of the key technologies in multimedia copy detection [9]. This technique inserts invisible information into the original document to allow subsequent detection of that document. The major limitation of watermarking is that it is impossible to detect documents that are not watermarked. To cope with this problem, content-based copy detection (CBCD) has recently been introduced. The idea behind CBCD is that the content itself contains enough unique information to be used to detect copies. However, audio signals can be subjected to a variety of distortions that make audio copy detection difficult. In fact, operations such as audio compression, audio quality reduction and adding extraneous speech make the task very hard. Different approaches are therefore required to make copy detection robust to these transformations.

A typical CBCD system design is composed of two parts: (1) a method to extract compact signatures (the fingerprints) from an audio signal that describe the acoustic properties of the signal; (2) a method to search fingerprints of an unknown audio in a large dataset of fingerprints. The fingerprints should have discriminative power over imposter fingerprints, be invariant to distortions, be compact and be computationally simple [5].

A major difficulty in fingerprint design is how to balance detection accuracy and speed. Haitsma et al. introduced binary fingerprints that trade off accuracy for speed [8]. For every windowed interval of 11.6 milliseconds, they extract a 32-bit hash value corresponding to the energy differences along the frequency and time axes. The search simply uses a lookup table, allowing a very fast search. This approach achieved good results when used to identify distorted audio. However, its performance degrades against distortions such as addition of speech or bandwidth limitation. An improvement proposed by Saracoglu et al. [17] extracts 15 bits instead of 32 to better handle these audio transformations. The energy differences between two consecutive frequency bands have been the subject of several other works [13, 14].
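The Haitsma et al. sub-fingerprint can be sketched as follows. This is a minimal illustration, not the authors' code; the 33-band layout (yielding 32 difference bits) and the bit order are the commonly described form of the scheme and are assumed here:

```python
import numpy as np

def haitsma_subfingerprint(E_prev, E_curr):
    """One 32-bit sub-fingerprint from 33 band energies of two successive
    frames: the sign of the time-difference of the frequency-difference."""
    d_curr = E_curr[:-1] - E_curr[1:]   # 32 frequency differences, current frame
    d_prev = E_prev[:-1] - E_prev[1:]   # same for previous frame
    bits = (d_curr - d_prev) > 0        # 32 booleans, one per band pair
    value = 0
    for b in bits:
        value = (value << 1) | int(b)   # pack booleans into a 32-bit integer
    return value

# toy usage: two frames of 33 band energies each
rng = np.random.default_rng(0)
h = haitsma_subfingerprint(rng.random(33), rng.random(33))
assert 0 <= h < 2**32
```

A sequence of such 32-bit values, one per 11.6 ms frame, is what the lookup-table search indexes.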

In Gupta et al. [7], a robust audio fingerprinting system based on nearest-neighbor mapping is proposed. They use 12 Mel-Frequency Cepstral Coefficients (MFCC) plus energy and its delta coefficients as audio features. A nearest neighbor search is then performed between the reference frames and the query frames. This approach achieved good results on a very difficult evaluation task. Jiang et al. proposed a copy detection system that achieved excellent performance in the TRECVID 2010 and 2011 audio+video copy detection tasks [12]. However, the audio-only results are not available for comparison.

In more recent work, local spectral energies around salient points chosen from the maxima in the Mel-filtered spectra are selected [1]. Regions around each selected point are encoded to generate binary fingerprints. Compared to [8], this approach significantly improved the detection accuracy. The idea of constructing fingerprints based on spectrogram peaks had been used before in the Shazam system [19]. In that work, several time-frequency points are chosen from the spectrogram. A point is selected if it has higher energy than all its neighbors in a region centered on the point. Compact signatures representing peak pairs are then generated to form fingerprint hashes. In [11], Jégou et al. introduced an audio search system that presents results comparable to the Shazam system. They divide the audio signal into overlapping short-term windows of 25 ms taken every 10 ms. Audio descriptors for each window are computed using 64 filter banks (i.e., the dimensionality of one frame is 64). In order to make these descriptors more discriminative, they concatenate 3 successive filter-bank frames. Their scheme results in a compound descriptor of dimensionality 192 representing 85 ms. They then use an approximate nearest neighbor search between query descriptors and reference descriptors. This method achieved good results for audio+video copy detection, where these audio descriptors were combined with visual descriptors [2].
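The descriptor-concatenation step of [11] is straightforward to sketch: given a (frames × 64) matrix of filter-bank energies, each group of 3 successive frames is flattened into one 192-dimensional descriptor. The function name and the sliding-window grouping are our illustration, not code from the cited work:

```python
import numpy as np

def concat_descriptors(fb, k=3):
    """Concatenate k successive filter-bank frames into one descriptor.
    fb: (T, d) matrix of per-frame filter-bank energies, d=64 in [11]."""
    T, d = fb.shape
    # one descriptor per position of a k-frame sliding window
    out = np.stack([np.concatenate(fb[t:t + k]) for t in range(T - k + 1)])
    return out  # shape (T - k + 1, k * d)

fb = np.zeros((100, 64))          # 100 frames of 64 filter-bank energies
desc = concat_descriptors(fb)
assert desc.shape == (98, 192)    # 192-dimensional compound descriptors
```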

Unlike most approaches described above, Ke et al. [20] transformed the music identification problem into a 2-D computer vision problem. They learned a set of filters to create a compact representation for local regions of the spectrogram image. Thus, a spectrogram is transformed into a set of 32-bit vectors, and a classical hash table is used to perform the search step. Based on [20], Baluja et al. extended a wavelet-based approach used for near-duplicate image retrieval to the task of audio detection [3]. They extract spectral images from the spectrogram and compute Haar wavelets for every image. To reduce the effects of audio degradation, only the wavelets with the largest magnitude are selected. Zhu et al. [21] used the Scale Invariant Feature Transform (SIFT) technique to extract local image descriptors from the spectrogram. SIFT features are robust to image translation, which allowed them to treat the Time Scale Modification (TSM) and pitch shifting problems. Their system achieved promising identification results for audio stretched from 65 to 150 % of the original length.

In this paper, we introduce a novel audio fingerprinting method based on spectrogram images. This paper is an expanded version of papers [15, 16]. The idea behind the design of our system is that the spectrogram of an original audio and that of its copy look very similar. However, distortion may change visual information in the spectrogram. To reduce audio mismatch due to these distortions, we convert the spectrogram into binary images and generate different versions of fingerprints by keeping an incremental amount of signal information (based on a spectral energy threshold) for each version. By following this strategy, one of the spectrogram versions of the copy is more likely to be similar to the original. The novelty of this approach, compared to state-of-the-art CBCD systems, lies in the conversion of the spectrogram into a set of binary images and the derivation of multiple fingerprints from these images. To speed up the search, each version of the spectrogram image is converted into an n-dimensional vector and a nearest neighbor search is then performed using these n-dimensional vectors. This search is sped up using a Graphics Processing Unit (GPU) that significantly reduces the run time. This strategy reduced the min NDCR for most transformations and significantly improved the localization accuracy. Compared to our previous papers [15, 16], this paper includes new experiments that support the proposed approach. We also evaluate our system on the TRECVID 2009 dataset in addition to the TRECVID 2010 dataset. Besides the NN-based system, we also compare our system to the Shazam system.

The rest of the paper is structured as follows. Section 2 describes our audio fingerprinting system. Section 3 reports our experiments and compares our method to two state-of-the-art audio fingerprinting systems. Section 4 recapitulates our contributions.


2 System overview

The overall architecture of our system is shown in Fig. 1. First, spectrogram-based n-dimensional vectors per frame are generated from all the audio references. Second, different versions of these n-dimensional vectors per frame are created (by varying spectral energy thresholds) and stored in the reference fingerprint database. A query is processed in the same way, followed by a time shift step in order to produce several fingerprint versions with different speeds. Finally, the query fingerprints are searched in the reference fingerprint database to produce the search result. The following paragraphs present these steps in detail.

2.1 Spectrogram generation

First, we down-sample the audio signal to 8 kHz in order to make the fingerprint robust to transformations that may reduce the bandwidth of the audio signal. We apply a Hamming window of length 96 ms. We generate a spectrogram by computing the short-time Fourier transform in this 96 ms window. We reduce this spectrogram to 257 frequency bins in the frequency range from 500 Hz to 3000 Hz. We compute these 257 frequency bins every 3 ms.
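The spectrogram generation step can be sketched with a standard STFT, as below. The FFT size is not stated in the paper, so the exact number of bins retained in the 500-3000 Hz band here (which depends on the FFT size) will differ from the 257 bins reported; the window, hop and band limits follow the text:

```python
import numpy as np
from scipy.signal import stft

def audio_spectrogram(signal, fs=8000):
    """Magnitude spectrogram: 96 ms Hamming window, 3 ms hop,
    restricted to the 500-3000 Hz band (Section 2.1).
    The FFT size (= window length here) is our assumption."""
    nperseg = int(0.096 * fs)              # 768 samples = 96 ms
    hop = int(0.003 * fs)                  # 24 samples = 3 ms
    f, t, Z = stft(signal, fs=fs, window='hamming',
                   nperseg=nperseg, noverlap=nperseg - hop)
    band = (f >= 500) & (f <= 3000)        # keep only 500-3000 Hz bins
    return np.abs(Z[band, :])              # (bins, frames) intensity matrix

spec = audio_spectrogram(np.random.randn(8000))   # 1 s of audio at 8 kHz
assert spec.ndim == 2 and spec.shape[0] > 0
```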

2.2 Fingerprint generation

The previous step transforms the audio into a spectrogram represented by a matrix containing the intensity of the signal at any given time and frequency. The fingerprint generation step converts this spectrogram matrix to 2-D binary images using a sliding window of size w×h. These 2-D binary images are then converted into n-dimensional vectors as outlined below.

The spectrogram matrix of size w×h is converted into multiple n-dimensional vectors using varying spectral thresholds. We compute these n-dimensional vectors every av ms (frame advance of size av). The choice of window size and frame advance is discussed in Section 3.

We experiment with two different audio feature parameters derived using the global or local mean of the spectral values in the spectrogram: Global Mean or Local Mean fingerprints.

Fig. 1 System overview


2.2.1 Global mean fingerprint

Figure 2 illustrates the Global Mean fingerprint generation process. From the spectrogrammatrix of size w×h (i.e., the window frame), we compute the mean intensity value of thismatrix (global mean). Then, we replace the intensity values of this matrix by either 0 or 1 usingthis strategy: if the intensity is greater than this global mean then we replace it by 1, otherwisewe replace it with a 0. Thus, the global mean represents a threshold that generates onefingerprint version. We generate different versions of the fingerprint from the same spectro-gram matrix by using different thresholds derived from this global mean (e.g., 0.4×globalmean, 0.6×global mean). Figure 3 shows four versions generated for two reference frames andtwo query frames. In this figure, query1 is a transformed copy (bandwidth limit and single-band companding) of reference1, and query2 is a copy of reference2 mixed with speech. Notehow easy it is to match query1 version3 to reference1 version2. Similarly, query2 version4 andreference2 version4 are almost identical despite the fact that query2 is mixed with extraneousspeech.
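The multi-threshold binarization described here can be sketched in a few lines; the four threshold factors are the ones used later in the experiments (Section 3.3.1):

```python
import numpy as np

def global_mean_versions(window, factors=(1.0, 0.6, 0.4, 0.2)):
    """Binarize a w×h spectrogram window at several thresholds
    derived from its global mean intensity (Section 2.2.1)."""
    gm = window.mean()
    # one binary image per threshold factor; lower factors keep more 1's
    return [(window > f * gm).astype(np.uint8) for f in factors]

win = np.arange(12.0).reshape(3, 4)     # toy 3x4 "spectrogram window"
v1, v06, v04, v02 = global_mean_versions(win)
# lower thresholds retain more spectral information
assert v1.sum() <= v06.sum() <= v04.sum() <= v02.sum()
```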

This scheme results in fingerprints that are robust against a variety of transformations. Using the mean of intensities as a threshold allows us to select the most relevant information and discard the irrelevant information, for example, eliminating low-intensity noise from the fingerprint representation. In addition, the binary representation makes the fingerprint invariant to relative intensity value changes (e.g., overall amplification or reduction of energy, equalization, etc.). Thus, a value of 1 in the w×h binary spectrogram matrix (i.e., the binary image) denotes a time-frequency peak regardless of the real intensity value. This is similar to the Shazam system [19], where the amplitude component is eliminated and only peak positions are retained.

Fig. 2 Global Mean fingerprint extraction


2.2.2 Local mean fingerprint

Unlike the Global Mean feature parameter, Local Mean uses smaller blocks to compute the mean. In fact, we partition the w×h window into tiles of wb×hb each (see Fig. 4). We then compute the mean intensity value of each tile, and convert the intensity values in each tile to 0 or 1 following the strategy used for Global Mean features.

Figure 5 shows binary images of the same audio segment (1-s length) generated with Global Mean and Local Mean using a threshold equal to the mean. The query in this figure is a transformed version (with mp3 compression and multiband companding) of the reference. We compare Global Mean, which extracts "the most relevant information" over the whole w×h window, to Local Mean, which emphasizes the intensity values of a small portion of the window (wb×hb). In other words, Local Mean represents not only the highest intensity values in the window, like Global Mean, but also smaller local intensity values that are not relevant in the case of Global Mean. These smaller local intensities can be useful in minimizing the difference between query and reference images. In Fig. 5, we see that Local Mean can generate images for the reference and the query that are more similar than those generated by Global Mean.
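The tile-wise binarization can be sketched as follows; the tile size wb×hb is a tunable parameter in the paper, so the 4×4 default here is purely illustrative:

```python
import numpy as np

def local_mean_binarize(window, wb=4, hb=4):
    """Binarize each wb×hb tile of a spectrogram window against that
    tile's own mean intensity (Section 2.2.2)."""
    h, w = window.shape
    out = np.zeros_like(window, dtype=np.uint8)
    for r in range(0, h, hb):
        for c in range(0, w, wb):
            tile = window[r:r + hb, c:c + wb]
            # same thresholding strategy as Global Mean, per tile
            out[r:r + hb, c:c + wb] = (tile > tile.mean()).astype(np.uint8)
    return out

img = local_mean_binarize(np.random.rand(16, 16))
assert img.shape == (16, 16) and set(np.unique(img)) <= {0, 1}
```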

2.2.3 Fingerprint representation

We represent the resulting binary image (or the w×h spectrogram matrix if the spectrogram is not converted into binary images) using a simple n-dimensional fingerprint. We divide the matrix into n/2 horizontal slices and n/2 vertical slices. We then take the sum of the elements of each slice to obtain a vector of n dimensions (see the example given in Fig. 6 with n=48). This n-dimensional vector is the compact fingerprint representation of the w×h spectrogram matrix.

Fig. 3 Four different versions of the quantized spectrogram matrix generated for four different audio frames (1-s window frame). Note that the images in each line are not successive frames, but different versions of the same frame differing in spectral threshold (or global mean) for quantization. Note how hard it is to match version1 of query1 to any version of reference1. By contrast, we can easily match query1 version3 to reference1 version2, or query1 version4 to reference1 version3. We can observe the same thing in the second example (query2 and reference2)
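The slice-sum representation can be sketched as below; `np.array_split` handles the case where the image dimensions are not exact multiples of n/2:

```python
import numpy as np

def slice_fingerprint(binary_image, n=48):
    """Compact n-dimensional fingerprint: the sum of the elements in
    each of n/2 horizontal and n/2 vertical slices of the binary
    image (Section 2.2.3)."""
    h_slices = np.array_split(binary_image, n // 2, axis=0)
    v_slices = np.array_split(binary_image, n // 2, axis=1)
    return np.array([int(s.sum()) for s in h_slices] +
                    [int(s.sum()) for s in v_slices])

# a 257x333 all-ones image (h=257 bins, w=333 frames, as in Section 3.3)
fp = slice_fingerprint(np.ones((257, 333), dtype=np.uint8), n=48)
assert fp.shape == (48,)
assert fp.sum() == 2 * 257 * 333   # every element counted once per axis
```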

Fig. 4 Local Mean feature extraction

Fig. 5 Global Mean versus Local Mean Fingerprints


2.3 Retrieval

Once the fingerprints for the reference audio have been generated and stored in a database, a search algorithm finds the best matching audio segments between a query and the reference. We create query fingerprints in the same way as reference fingerprints. We noticed speed differences between some queries and their references (on the TRECVID 2010 dataset). To reduce these speed differences, we also produce query fingerprints that have been sped up or slowed down (see Section 3.3 for more detail). We change the speed by increasing/decreasing the sampling frequency of the audio file. This is done in the context of the TRECVID 2010 dataset, and not to solve the general problem of such audio speed change attacks.

During retrieval, each query fingerprint version is compared with all the reference fingerprint versions. Ideally, reference audio frames and their near-duplicate queries should have identical fingerprints among these versions. However, even when images look identical, their n-dimensional descriptors can differ slightly. Furthermore, distortions such as "mixed with speech" can make images very different. Thus, a natural choice is to use a similarity measure between two fingerprints that is robust to audio distortions.

Our similarity measure is similar to that used in [7]. We first label each reference frame with the frame number of its closest query frame. To find the closest query frame, we use the nearest neighbor algorithm with the Manhattan distance (i.e., absolute distance) as the measure of similarity. After the closest query frame has been found for each reference frame, the total number of fingerprints that match the query frame-synchronously is computed: we move the query over the reference and, for each alignment of the query to the reference, we count the number of reference fingerprints whose label matches the query frame number exactly.
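The labeling step can be sketched as a brute-force nearest neighbor search under the L1 distance (in the actual system this search is GPU-accelerated; the brute-force form below is only for illustration):

```python
import numpy as np

def label_reference_frames(ref_fps, query_fps):
    """Label each reference frame with the index of its closest query
    frame under the Manhattan (L1) distance (Section 2.3).
    ref_fps: (num_ref, n) fingerprints; query_fps: (num_query, n)."""
    # pairwise L1 distance matrix of shape (num_ref, num_query)
    d = np.abs(ref_fps[:, None, :] - query_fps[None, :, :]).sum(axis=2)
    return d.argmin(axis=1)            # closest query frame per reference frame

ref = np.array([[0, 0], [5, 5], [9, 9]], dtype=float)
qry = np.array([[1, 1], [8, 8]], dtype=float)
labels = label_reference_frames(ref, qry)
assert labels.tolist() == [0, 1, 1]
```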

We have improved the algorithm used in [7] to compute the number of matching frames between a query and a reference. A representation of the improved algorithm is shown in Fig. 7. In this figure, the reference frames on the horizontal axis are labeled with the nearest neighbor query frames. The row labeled "counts" shows the total counts for each alignment of the query to the reference. In this example, the best count is 5 and is obtained when the query is overlaid on the reference starting at the 4th frame. Algorithmically, these counts are obtained in this manner: for each frame j of the reference, we increment the count c(j−i)=c(j−i)+1, where i is the label of reference frame j (i.e., the closest query frame is i). Then, the best segment match is found by looking for the reference frame with the highest count.

Fig. 6 Fingerprint representation

Our algorithm differs from [7] as follows: with each reference frame, we associate the nearest query frame and the N nearest successive neighboring frames of the query. For example, in Fig. 7 (SCF-1: successive closest frames with N=1), the closest query frame to the fifth reference frame is frame 2. Thus, we update the count c(j−i) for all three query frames 1, 2 and 3 (frames 1 and 3 are successive neighboring frames). We proceed in this manner because the large overlap between frames generates similar fingerprints for successive frames. This similarity can lead the search algorithm to label a reference frame with a wrong query frame number. For example, if the correct closest query frame to the reference frame is frame n, then the matching algorithm may wrongfully label frame n−1 as its closest frame, since query frames n and n−1 are similar due to the large overlap. We discuss in detail the influence of the parameter SCF (number of successive closest frames) in Section 3.3.
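The alignment counting with SCF can be sketched as follows. With scf=0 this reduces to the base c(j−i) counting of [7]; with scf=N each of the N successive neighbors on both sides of the labeled query frame also votes:

```python
from collections import Counter

def best_alignment(labels, num_query_frames, scf=1):
    """Count frame-synchronous matches for every alignment of the query
    against the reference (Fig. 7). labels[j] is the closest query frame
    to reference frame j."""
    counts = Counter()
    for j, i in enumerate(labels):
        # the labeled query frame and its scf successive neighbors vote
        for q in range(max(0, i - scf), min(num_query_frames, i + scf + 1)):
            counts[j - q] += 1       # offset = reference frame minus query frame
    offset, count = counts.most_common(1)[0]
    return offset, count

# toy example: query frames 0..3 appear in the reference at offset 4
labels = [9, 9, 9, 9, 0, 1, 2, 3, 9, 9]
offset, count = best_alignment(labels, num_query_frames=10, scf=0)
assert offset == 4 and count == 4
```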

3 Experiments

This section covers an evaluation of our proposed system on the well-known TRECVID 2010 audio copy detection dataset. First, we present results for the Local Mean (LM) and Global Mean (GM) feature parameters. We then show that the combination of these two features results in a significantly lower min NDCR. We also study the influence of the number of successive closest frames (SCF) on system performance. In addition, we show the effectiveness of converting the spectrogram matrix into a set of binary images and the utility of using different fingerprint versions. We evaluate the effectiveness of the Global Mean fingerprints on both the TRECVID 2010 and TRECVID 2009 datasets. Finally, we compare our system to the NN-based [7, 10] and Shazam [19] audio fingerprinting systems. For the Shazam system, we used the implementation found in [6] with the default parameters.

Fig. 7 Search algorithm


3.1 Audio copy detection datasets

We ran experiments on the TRECVID 2009 and 2010 datasets provided by NIST [18]. The TRECVID 2010 dataset consists of a reference collection of more than 11,000 videos from Internet archives for a total of 400 h of video, whereas TRECVID 2009 contains 385 h of reference videos from roughly 800 BBC broadcasts. The two datasets are quite different. For both datasets, there are 201 original audio queries, each altered with 7 different transformations (see Table 1) for a total of 1407 queries. The length of the queries varies from 3 s to 3 min, and each query can be one of 3 types: (1) a transformed fragment of a reference video; (2) a query containing a transformed fragment of a reference video; (3) a fragment of an audio not in the reference video database. The query creation framework is described in [4].

When we examined the TRECVID 2010 dataset, we found other audio transformations not mentioned by NIST. For example, many queries are distorted: parts of the query signal have been replaced by small silent segments at different places (e.g., queries 3353, 3771, and 4200). This complicates the task, especially when the query is combined with other transformations such as "mixed with speech". Gupta et al. [7] overcame this problem by discarding all silent segments. In addition, some queries have undergone speed modification by speeding up or slowing down the query (e.g., query 3030: −180 % (2.8 times slower), 4247: −38 % (1.38 times slower), 3145: −6 % (1.06 times slower), 3056: +8 % (1.08 times faster), 3957: +23 % (1.23 times faster)).1

3.2 Evaluation metrics

The task in the TRECVID CBCD evaluation is to determine, for each query, whether it contains a segment from the reference database. Since the query may be embedded in a non-reference collection, the final result includes the following information (when a copy is detected): query start time, reference start time and reference finish time. To evaluate the accuracy of locating a copied fragment within a video, we use the F1 score (i.e., F-measure), defined as the harmonic mean of precision and recall. We use the minimal NDCR to evaluate detection effectiveness. NDCR is a weighted cost combination of the probability of missing a true copy and the false alarm rate (PMiss and RFA). In the TRECVID evaluations, different parameters were defined for two application profiles: the "balanced" and the "no false alarm" (NOFA) profiles. In the NOFA profile, which is the more difficult, the cost of an individual false alarm is 1000 times the cost of a missed query, while in the balanced profile both the missed query and the false alarm are assigned a cost of 1. We report results here using the NOFA profile.
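The two metrics can be written out explicitly. NDCR is conventionally expressed as PMiss plus a weighted false-alarm rate, with the weight derived from the cost ratio and a target copy rate; the cost values below match the profiles described here, but the target rate (0.5 copies/hour) is our assumption, not stated in this paper:

```python
def ndcr(p_miss, r_fa, c_miss=1.0, c_fa=1000.0, r_target=0.5):
    """Normalized Detection Cost Rate, schematically:
    NDCR = P_miss + beta * R_fa, with beta = C_fa / (C_miss * R_target).
    Defaults reflect the NOFA profile (false alarm costs 1000x a miss);
    r_target is an assumed target copy rate."""
    beta = c_fa / (c_miss * r_target)
    return p_miss + beta * r_fa

def f1_score(precision, recall):
    """F1: harmonic mean of precision and recall (localization accuracy)."""
    return 2 * precision * recall / (precision + recall)

assert f1_score(1.0, 1.0) == 1.0
assert ndcr(0.2, 0.0) == 0.2   # with no false alarms, NDCR is the miss rate
```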

3.3 Results and analysis

We noticed that many reference audio files in the TRECVID 2010 dataset have duplicates that skew the results. Therefore, we removed these duplicate audio files before evaluating any of the systems we tested.

1 These are only approximations of the speed difference between the query and the corresponding reference. For example, −180 % means that the query is approximately 2.8 times slower than the reference (1 s of the query corresponds to approximately 0.36 s of the reference).


3.3.1 Results for global mean fingerprint

By varying the threshold based on the Global Mean, we create four different fingerprints for each query and reference audio file. The thresholds used to create these versions are: the global mean of the w×h matrix, 0.6×global mean, 0.4×global mean and 0.2×global mean. As we reduce the threshold, each subsequent fingerprint version includes more spectral information than the previous one (there are more non-zero values in the resulting w×h binary image). Thus, the image generated with the threshold equal to the global mean contains less information than the image generated with a threshold of 0.6×global mean, and so on (see Fig. 3). To increase the likelihood that one query fingerprint version matches a reference fingerprint version, it is better to generate many versions with varying thresholds. However, the run time increases with the number of versions, leading us to use only a few versions.

In order to study the distribution of the quantized data (the number of 1's contained in the binary image) per version, Table 2 reports the percentage of 1's in the binary images averaged over all reference/query frames for each threshold described above.

From Table 2 we see that the 1's are quite sparse at the threshold of 1×mean. Therefore, thresholds above this value are not necessary, since higher values will have less discriminative information, which may lead to more false matches between the query and the reference frames. On the other hand, using a threshold of 0.2×mean generates almost 3 times more 1's, which seems to be sufficient for including additional discriminative binary images. Note that the data generated with this smallest threshold has 1's in 41.9 % of the query binary bins, and using a threshold below 0.2×mean would only increase the noise, especially when the query includes extraneous speech, making copy detection more difficult. The range from 0.2×mean to 1×mean seems sufficient to cover the significant spectral regions. Our experiments on a small number of queries using these 4 thresholds gave good results. We have used these 4 thresholds to evaluate our system on the TRECVID 2009 and 2010 datasets.

In addition to these 4 fingerprint versions for the reference and the query, we generate 8 more fingerprints for the query: 4 with 9 % slower audio and 4 with 9 % faster audio. This parameter (i.e., 9 %) is chosen to detect audio copies that have undergone a small speed variation. However, the problem becomes more difficult when the speed difference between the query and the reference is large. We found that TRECVID 2010 contains many queries that are transformed with different speeds (see Section 3.1). Our experiments show that changing the speed by ±9 % does not detect all the speed-transformed copies. For the TRECVID 2009 dataset, we did not generate additional fingerprints, as this dataset has not undergone any speed variation.
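The speed change via a sampling-frequency change can be mimicked by resampling the signal while keeping the nominal sampling rate; this is a sketch of the idea, not the authors' implementation:

```python
import numpy as np
from scipy.signal import resample

def change_speed(signal, factor):
    """Speed up (factor > 1) or slow down (factor < 1) an audio signal
    by resampling while keeping the nominal sampling rate, mimicking a
    sampling-frequency change (Section 2.3). A 9 % speed-up shortens
    the signal to 1/1.09 of its original length."""
    return resample(signal, int(round(len(signal) / factor)))

x = np.random.randn(8000)            # 1 s of audio at 8 kHz
fast = change_speed(x, 1.09)         # 9 % faster version
slow = change_speed(x, 1 / 1.09)     # 9 % slower version
assert len(fast) < len(x) < len(slow)
```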

Table 1 Audio query transformations

Transformation  Description
T1              Nothing
T2              mp3 compression
T3              mp3 compression and multiband companding
T4              bandwidth limit and single band companding
T5              mix with speech
T6              mix with speech, then multiband compress
T7              bandpass filter, mix with speech, compress


As stated above, the spectrogram matrix obtained from the audio signal is converted into a set of 2-D binary images. Each image is a quantized version of the spectrogram matrix of size w×h. In our experiment, we use a window of 1-s length (w=333 and h=257, which corresponds to 333 frames×257 frequency bins) taken every 24 ms (i.e., av=24). The choice of 1-s window size ensures that the frame contains enough information to be discriminative. The 24-ms frame advance is a compromise between having too many reference frames being the same (frame advance too short) and missing the alignment between the query and the reference (frame advance too long). In fact, the choice of frame advance, when generating fingerprints, impacts the system performance. A short frame advance generates many successive frames with identical fingerprints, especially when the spectral threshold is high (equal to the 1-s spectrogram matrix mean). This leads to many more false fingerprint matches during search. A larger frame advance avoids this problem. However, a large frame advance can cause problems in matching a reference frame to a query frame, because the start of the query may not be synchronized with the start of the reference, leading to many more poor matches. The step size of 24 ms chosen in our experiments seems to be a good compromise: increasing the frame advance beyond 24 ms increases min NDCR. However, when we examine our results in depth, we notice that wrong fingerprint matches still cause many false alarms, especially for short queries. Our use of coarse fingerprints amplifies this problem. A solution is to increase the dimension of these fingerprints, so that each fingerprint includes more detail about the image.
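The windowing described above can be sketched as below. The mapping of the 24-ms advance onto spectrogram columns (a hop of about 8 columns when 1 s spans 333 columns) and the framing function itself are our assumptions for illustration; the w, h and advance values are from the text.

```python
import numpy as np

# Parameters from the paper: each fingerprint frame covers a 1-s slice
# (w = 333 columns, h = 257 frequency bins); frames advance by 24 ms.
W, H = 333, 257
HOP = round(333 * 0.024)   # ~8 columns per 24-ms step (assumption)

def frame_spectrogram(spec, w=W, hop=HOP):
    """Yield overlapping w-column windows of an h x T spectrogram."""
    h, T = spec.shape
    return [spec[:, s:s + w] for s in range(0, T - w + 1, hop)]

spec = np.zeros((H, 333 * 3))          # toy spectrogram: 3 s of audio
frames = frame_spectrogram(spec)
```

Each resulting w×h window is then binarized and reduced to an n-dimensional vector for the nearest-neighbor search.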

To confirm this reasoning, Table 3 compares min NDCR and the number of missed queries for the Global Mean fingerprint using n=26 dimensions versus n=48 dimensions. As we can see, min NDCR is reduced for all transformations when we represent the binary image by 48 dimensions instead of 26, with a relative improvement of 21 %. Similarly, the total number of missed queries is reduced by 17 % when using 48 dimensions instead of 26. In this work, we evaluated our system using 26- and 48-dimensional fingerprints. In future work, we will evaluate many more values for the feature dimension in order to find the optimal number.

As expected, the performance of our system decreases from transformation T1 to T7 in terms of miss count and min NDCR. This difference in performance is especially noteworthy for the last three transformations. This is because these transformations (T5, T6 and T7) add irrelevant speech to the query, which makes them more difficult to match. If we compare the average min

Table 2  Percentage of 1's in the binary images averaged over all query/reference frames for Global Mean fingerprints

  Threshold               0.2×mean   0.4×mean   0.6×mean   1×mean
  Query version (%)       41.9       30.9       24.8       17.9
  Reference version (%)   25.5       15.6       13.8       8.9

Table 3  Performance of Global Mean with varying dimensions for different transforms

  Dimension        T1     T2     T3     T4     T5     T6     T7     Total/Average
  Min NDCR    26   0.097  0.104  0.216  0.194  0.313  0.425  0.328  0.239
              48   0.09   0.097  0.179  0.134  0.209  0.336  0.284  0.189
  Miss count  26   10     10     25     20     24     48     31     168
              48   10     10     17     17     24     35     26     139


NDCR for transformations that do not add irrelevant speech to those that do, we find that the average min NDCR goes up from 0.125 to 0.276 with 48 dimensions and from 0.152 to 0.355 when we use 26 dimensions.

In order to study the influence of the number of successive closest frames (SCF) on the system performance, Table 4 shows the min NDCR generated with the Global Mean feature parameter using different SCF values (SCF-# denotes the number of successive closest frames used in the search algorithm). From this table we notice that the worst result is achieved with SCF-0 (i.e., we do not take into consideration any frame before and after the closest frame). The best result is given by SCF-1, which reduces the average min NDCR for all transformations from 0.214 to 0.181. On the other hand, we notice that the min NDCR increases when we use two or more neighboring frames (SCF-2 and SCF-3). This can be explained by the fact that the count is updated not only for the real nearest query frame but also for the neighboring frames. For example, if the real nearest query frame to the reference frame is query frame 3, and the search algorithm found that the nearest query frame is frame 4, then the count is updated not only for frame 4 but also for frames 2, 3, 5 and 6 (SCF-2). Thus, as we increase the SCF value, we increase the likelihood of finding the real nearest neighbor, but we also increase the count for erroneous matches.
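The SCF counting rule described above can be sketched as a small helper. The function name and the Counter-based bookkeeping are illustrative assumptions; the neighbor-update behavior mirrors the example in the text.

```python
from collections import Counter

def update_counts(counts, nearest_query_frame, scf):
    """Increment the match count for the nearest query frame and for
    its `scf` neighbors on each side (SCF-0 credits only the frame
    itself)."""
    for f in range(nearest_query_frame - scf, nearest_query_frame + scf + 1):
        counts[f] += 1
    return counts

# With SCF-2, a hit on query frame 4 also credits frames 2, 3, 5 and 6,
# matching the example in the text.
counts = update_counts(Counter(), 4, 2)
```

This makes the trade-off visible: larger SCF values credit the true neighbor more often, but also inflate counts for spurious matches.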

To evaluate the effectiveness of converting the spectrogram into binary images, Table 5 shows the min NDCR obtained using the best parameters (SCF-1 and 48 dimensions) with and without converting the spectrogram matrix into binary images. In the case where the spectrogram matrix is not converted into a binary image, each w×h frame taken from the spectrogram matrix is directly converted into an n-dimensional vector. Thus, each element of this vector represents the sum of the intensity values contained in the horizontal or vertical slices of the spectrogram matrix (instead of the binary image). In other words, instead of converting the w×h binary image into an n-dimensional vector, we convert the w×h spectrogram matrix (the red rectangle in Fig. 2) into an n-dimensional vector as described in Section 2.2.3.
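The slice-sum conversion can be sketched as follows. This is a toy version under stated assumptions: the slice counts and the small image are illustrative (the paper uses n=26 or n=48 total dimensions on a 257×333 image), and the function assumes the image divides evenly into slices.

```python
import numpy as np

def slice_sums(image, n_rows, n_cols):
    """Convert an h x w image (binary or raw spectrogram) into an
    n-dimensional vector: sums over n_rows horizontal slices followed
    by sums over n_cols vertical slices."""
    h, w = image.shape
    row_part = image.reshape(n_rows, h // n_rows, w).sum(axis=(1, 2))
    col_part = image.reshape(h, n_cols, w // n_cols).sum(axis=(0, 2))
    return np.concatenate([row_part, col_part])

rng = np.random.default_rng(1)
img = (rng.random((8, 12)) > 0.5).astype(int)   # toy 8x12 binary image
vec = slice_sums(img, n_rows=4, n_cols=6)       # 4 + 6 = 10 dimensions
```

Since both halves of the vector partition the same image, each half sums to the total number of 1's, which is a handy sanity check.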

It can be seen from Table 5 that the average min NDCR over all transformations goes down from 0.666 (no quantization) to 0.181 with quantization into binary images. The degradation in min NDCR without quantization is not significant for audio transforms T1 and T2, but very significant for the rest of the transformations, especially for transforms that add irrelevant speech to the queries. The reason is that any intensity change of the spectrogram matrix is encoded into the fingerprint for the un-quantized version, making it sensitive to any noise added to the signal. Consequently, the distance between two fingerprints will be affected depending on the difference in the intensity values between them. On the other hand, converting the spectrogram matrix into binary images discards the real intensity values of the signal and reduces the impact of intensity value changes. Even when noise is added to the signal, the binary quantization strategy prevents the noise from being encoded into the

Table 4  Min NDCR generated with Global Mean feature parameter using different SCF values

  SCF   T1     T2     T3     T4     T5     T6     T7     Average
  0     0.075  0.142  0.201  0.149  0.201  0.425  0.306  0.214
  1     0.075  0.075  0.179  0.127  0.201  0.343  0.269  0.181
  2     0.082  0.097  0.179  0.134  0.216  0.366  0.246  0.188
  3     0.09   0.097  0.179  0.134  0.209  0.336  0.284  0.189


fingerprint when the spectral value does not exceed the threshold. Even when the noise forces the spectral value to exceed the threshold, its real intensity value is replaced by 1 in the binary image (regardless of the real intensity value), which reduces the possibility of obtaining a large distance between the transformed fingerprint and the original fingerprint.

The last line of Table 5 shows that the average min NDCR obtained with SCF-1 and 48 dimensions when using only one version of the binary image is 0.280, which is significantly higher than the 0.181 (see Table 4, SCF-1) obtained with four versions of the Global Mean feature parameter. This shows that our strategy of using four different Global Mean fingerprints derived from 4 different thresholds works very well.

Since we have optimized our system on the TRECVID 2010 dataset, we used the TRECVID 2009 dataset to validate the performance of our system using the Global Mean fingerprints. Table 6 shows the min NDCR, the number of missed queries and the F-measure achieved on the TRECVID 2009 dataset using Global Mean fingerprints with 48 dimensions and SCF-1.

From Table 6, we see that on TRECVID 2009 the Global Mean based fingerprints result in a min NDCR averaged over all transformations of 0.131. From a total of 1407 queries, our system missed only 106 queries and correctly detected 1301 queries (more than 92 % correct detection). The worst results in terms of min NDCR and missed queries are achieved with the difficult transforms T6 and T7, which add irrelevant speech to the queries. Nevertheless, good results are obtained for transform T5, which also adds irrelevant speech to the queries. This system achieved a good localization accuracy averaged over all transformations of 0.844 (1.0 is the best possible localization accuracy). Note that the F-measure (localization accuracy) did not degrade from transforms T1 to T7 and resulted in similar performance for all the transformations.

3.3.2 Local mean fingerprint results

Like Global Mean, we used a 1-s window frame with 24-ms frame advance with the Local Mean feature parameter, and we generated four different fingerprint versions. To convert the 1-s window into a binary image, we applied a tile of size wb=16×hb=12, for a total of 462 tiles (see Fig. 4 for the definitions of wb and hb).
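The tile-wise (Local Mean) binarization can be sketched as below. This is a minimal sketch under assumptions: the function name, the toy window size, and the handling of window edges (the 257×333 window does not divide evenly into 16×12 tiles, which we sidestep here) are ours; the tile dimensions and per-tile local-mean thresholding follow the text.

```python
import numpy as np

def binarize_local(window, hb=12, wb=16, factor=1.0):
    """Binarize a window tile by tile: each hb x wb tile is thresholded
    against factor * its own local mean (factor in {0.2, 0.4, 0.6, 1}
    per the paper). Assumes the window divides evenly into tiles."""
    h, w = window.shape
    out = np.zeros_like(window, dtype=np.uint8)
    for r in range(0, h - hb + 1, hb):
        for c in range(0, w - wb + 1, wb):
            tile = window[r:r + hb, c:c + wb]
            out[r:r + hb, c:c + wb] = tile > factor * tile.mean()
    return out

rng = np.random.default_rng(2)
window = rng.random((24, 32))           # toy window: 2 x 2 tiles
binary = binarize_local(window)
```

Thresholding each tile against its own mean makes the binary image sensitive to local spectral structure rather than the global energy level of the window.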

Table 5  Min NDCR per transformation obtained with the Global Mean feature with/without using binary images and with only 1 fingerprint version instead of 4

                          T1     T2     T3     T4     T5     T6     T7     Average
  Without binary images   0.104  0.097  0.784  0.806  0.97   0.948  0.955  0.666
  With binary image       0.082  0.075  0.403  0.254  0.194  0.545  0.41   0.280

Table 6  Min NDCR, number of missed queries and F-measure for Global Mean fingerprints on TRECVID 2009

                  T1     T2     T3     T4     T5     T6     T7     Total/Average
  Min NDCR        0.067  0.082  0.127  0.09   0.119  0.246  0.187  0.131
  Missed queries  9      10     16     11     14     23     23     106
  F-measure       0.857  0.861  0.833  0.857  0.833  0.845  0.828  0.844


In the search step with the Global Mean fingerprints, we compared each reference version to all query versions. However, for the Local Mean fingerprints, we compare each reference version only to the corresponding query version generated with the same threshold. This is because we have noticed that, unlike Global Mean, Local Mean gives poor scores when we compare reference fingerprints to query fingerprints generated using different thresholds. The thresholds used to create these versions are: the mean of the local spectrogram matrix segment (i.e., the 16×12 local matrix segment), 0.6×local mean, 0.4×local mean and 0.2×local mean.

Table 7 shows min NDCR for Local Mean fingerprints generated with different thresholds using two SCF values: SCF-1 and SCF-3.

As we can see from Table 7, for most transforms, min NDCR reduces as we increase the local mean based threshold. Fingerprints generated with a high threshold contain less information, but yield more matches between query and reference fingerprints. In fact, the lowest min NDCR is achieved with the threshold equal to the local mean for both SCF-1 and SCF-3 for many transforms. However, for some transformations, a threshold equal to the local mean gives bad results. This is the case, for example, with T1 (min NDCR=0.724) and T7 (min NDCR=0.642), where the min NDCR is very high compared to the rest of the Local Mean fingerprint versions. The reason is that some queries with music contain repeated music segments (i.e., the same music segment repeated within the same audio at different times). However, the ground-truth contains the start time and finish time of only one of these music segments. In other words, to be considered a true positive, the segment found by copy detection should overlap with this ground-truth (same start and finish times). For some of these queries, our system detects the correct audio segment with a high score, but the start and finish times do not overlap the segment in the ground-truth. In such a situation, the decision threshold that rejects false alarms becomes very high, resulting in a much higher min NDCR.

When we combine the results from the four Local Mean fingerprint versions (the "combined" rows in Table 7) by choosing the results with the highest matching count, min NDCR reduces. Notice that the problem of repeated music segments is solved when we combine all versions together, resulting in a significantly lower min NDCR.

The lowest average min NDCR over all transformations achieved by the Local Mean fingerprints is 0.224, compared to 0.181 achieved by the Global Mean fingerprints, which is 19 % lower than Local Mean. In fact, min NDCR for Global Mean is significantly lower than

Table 7  Min NDCR for Local Mean fingerprints for different thresholds and different SCF values when tested on the TRECVID 2010 dataset

  Threshold            T1     T2     T3     T4     T5     T6     T7     Average
  SCF-1  mean × 0.2    0.194  0.231  0.254  0.201  0.343  0.373  0.351  0.278
         mean × 0.4    0.201  0.216  0.216  0.201  0.328  0.336  0.313  0.259
         mean × 0.6    0.172  0.201  0.216  0.179  0.612  0.313  0.619  0.330
         mean          0.142  0.179  0.194  0.164  0.276  0.306  0.642  0.272
         combined      0.149  0.179  0.209  0.157  0.284  0.313  0.276  0.224
  SCF-3  mean × 0.2    0.194  0.246  0.231  0.201  0.336  0.373  0.358  0.277
         mean × 0.4    0.194  0.216  0.216  0.187  0.313  0.358  0.321  0.258
         mean × 0.6    0.187  0.194  0.209  0.187  0.313  0.321  0.284  0.242
         mean          0.724  0.179  0.194  0.179  0.276  0.299  0.5    0.336
         combined      0.149  0.187  0.201  0.164  0.269  0.313  0.478  0.252


for Local Mean for all transformations except T6. The poor performance of Local Mean relative to Global Mean was surprising, since our preliminary experiments on queries missed by Global Mean showed good results. When we examined our results in depth, we noticed that although Local Mean detected many queries missed by Global Mean, it also missed other queries detected by Global Mean. In fact, many of the queries missed by Local Mean are audio transformations where different parts of the audio signal have been replaced by silence. It seems that Global Mean is invariant to such a transformation. In the NN-based system [7], the silent segments are located using a voice activity detector, and then skipped when computing the matching counts.

3.3.3 Combination of results from global mean and local mean fingerprints

In order to lower the min NDCR, we combined the results of the Global Mean and Local Mean feature parameters (we used all versions of both features). We combined the results by first generating separately the best results for each feature parameter, and then keeping the results with the highest matching counts (matching counts as shown in Fig. 7). Table 8 shows the results of this combination using the SCF-1 and SCF-3 parameters. The lowest average min NDCR over all transformations is obtained with LM1.GM1 (the combination of Local Mean and Global Mean feature parameters using the SCF-1 value). LM3.GM1 also gives good results and achieved the lowest min NDCR for five transformations. Notice that these two configurations use Global Mean with the SCF-1 parameter, which gave the best results for a single feature parameter (see Table 4 for the Global Mean feature).

As mentioned above, results for combined features use all the versions of each feature parameter. We conducted another test using all versions of Global Mean and only one version of Local Mean (generated with a threshold of 0.6×local mean). We used SCF-1 for the Global Mean feature and SCF-3 for the Local Mean feature. This new configuration, denoted LM3.GM1*, reduced the average min NDCR from 0.156 (LM1.GM1) to 0.147 (LM3.GM1*). Figure 8 compares min NDCR for the LM3.GM1* configuration to the best results obtained for each feature parameter separately. LM3.GM1* achieved the lowest min NDCR for all transformations when compared with the Global Mean and Local Mean features separately. The average min NDCR over all transformations is reduced by 18 % compared to Global Mean and 34 % compared to Local Mean.
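The combination rule (keep, per query, the candidate with the highest matching count) can be sketched as follows. The tuple structure of a detection result is an illustrative assumption; the max-count selection is what the text describes.

```python
def combine_results(*result_lists):
    """Combine per-feature detection results by keeping, for each
    query, the candidate with the highest matching count. Each result
    is a (query_id, reference_id, count) tuple (assumed structure)."""
    best = {}
    for results in result_lists:
        for qid, ref, count in results:
            if qid not in best or count > best[qid][1]:
                best[qid] = (ref, count)
    return best

# Toy results from two feature parameters for two queries.
global_mean = [(1, "refA", 120), (2, "refB", 40)]
local_mean = [(1, "refC", 90), (2, "refD", 75)]
combined = combine_results(global_mean, local_mean)
```

Each query thus inherits whichever feature parameter happened to match it best, which is why the combination covers the failure modes of both features.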

3.3.4 Comparative audio copy detection systems

Table 9 compares the min NDCR of our Spectro system, obtained with the combination of Global Mean and Local Mean feature parameters (the LM3.GM1* configuration), with the NN-based [7] and Shazam [19] systems.

Table 8  Min NDCR for combined feature parameters on the TRECVID 2010 dataset

            T1     T2     T3     T4     T5     T6     T7     Average
  LM1.GM1   0.075  0.075  0.127  0.09   0.194  0.291  0.239  0.156
  LM1.GM3   0.09   0.097  0.149  0.112  0.209  0.291  0.276  0.175
  LM3.GM1   0.075  0.075  0.112  0.097  0.187  0.261  0.396  0.172
  LM3.GM3   0.09   0.097  0.142  0.097  0.201  0.276  0.261  0.166


The comparison of min NDCR for Spectro and NN-based shows that our system significantly outperforms the NN-based system for the first five transformations. The best min NDCR achieved is 0.075, for transforms T1 and T2, which is an improvement of 60 % over the NN-based system. Similarly, we achieved relative reductions of 42, 48 and 7 % for T3, T4 and T5, respectively. Furthermore, our system reduced the number of missed queries by half for all transformations that do not add extraneous speech (see Table 10). In fact, our system results in 32 % fewer missed queries over all transformations compared to the NN-based system. Even though our system gave fewer missed queries for T7, its min NDCR is higher than that of the NN-based system for T7. This happened because of a false alarm with a high matching count, resulting in a bad decision threshold. The average min NDCR achieved by our system is 0.147, representing a 23 % reduction in comparison to the NN-based system.

Secondly, the missed queries for T1 and T2 are either distorted with a large speed shift (e.g., +23 %, −180 %, −38 %, etc.), or the query consists only of silence (e.g., queries 3524 and 4315). The presence of silent audio queries can be explained by the fact that the TRECVID 2010 dataset was designed to evaluate combined audio+video copy detection; there was no separate audio-only copy detection evaluation. In addition to these queries, a large number of missed queries are short queries (less than 6 s). Short queries, when distorted by irrelevant

Fig. 8 Comparison of the best results for each feature parameter separately and for their combination

Table 9  Comparison of min NDCR for Spectro, NN-based and Shazam systems on the TRECVID 2010 dataset

                         T1     T2     T3     T4     T5     T6     T7     Average
  Spectro (LM3.GM1*)     0.075  0.075  0.112  0.097  0.187  0.261  0.224  0.147
  NN-based               0.179  0.187  0.194  0.187  0.201  0.194  0.209  0.193
  Shazam                 0.373  0.366  0.507  0.47   0.515  0.597  0.59   0.488


speech, become very challenging. In fact, speech added to these queries makes the original signal hardly perceptible by humans.

On the other hand, the Shazam system achieved mediocre performance, with a min NDCR averaged over all transformations of 0.488. This system missed 436 queries from a total of 1407 queries, compared to 117 and 173 missed by the Spectro and NN-based systems, respectively. When we look at the Shazam results, we find that a large number of missed queries are queries of type 2 (reference video embedded in a non-reference video). It seems that Shazam cannot handle queries of this type, unlike the Spectro and NN-based systems, and this leads to a relatively bad detection performance.

Finally, the F-measure for locating a query within a reference for the Spectro, NN-based and Shazam systems is compared in Table 11. We notice that Spectro outperforms both systems for all transformations and improves the F-measure averaged over all transformations by 24 % relative to NN-based and 6 % compared to Shazam.

In this paper we did not take runtime performance into consideration, since we have not optimized our implementation. However, we note that our approach uses multiple time-shifted query versions and several other fingerprint versions, so the processing time is multiplied by the number of these versions. To reduce the cost of these multiple versions, we used a threshold to determine whether a query needs to be processed with a time-shifted version. In other words, we first process the original query; if the score is below a threshold, we continue with a time-shifted query. This strategy decreases the processing time, and we can do the same for the other fingerprint versions. In fact, we notice that, for many queries, a good detection score can be obtained with only one query fingerprint version and one reference fingerprint version, especially for transformations T1 and T2.
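The early-exit strategy described above can be sketched as follows. All names, the toy `search` scoring, and the generic list of query versions are illustrative assumptions; the text only specifies "process the original query first, fall back to extra versions when the score is below a threshold".

```python
def detect_with_fallback(query, versions, search, threshold):
    """Search the original query first; generate and search the extra
    (e.g. time-shifted) versions only if the score stays below the
    threshold. `search` returns a matching score (assumed interface)."""
    best = search(query)
    if best >= threshold:
        return best                      # confident match, stop early
    for make_version in versions:
        best = max(best, search(make_version(query)))
        if best >= threshold:
            break                        # good enough, skip the rest
    return best

# Toy search: the score is just the query value; the one extra
# "version" doubles it.
score = detect_with_fallback(10, [lambda q: q * 2], lambda q: q, 15)
```

Queries that match well on the first pass (common for T1 and T2) thus pay for only one fingerprint version.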

4 Conclusion

In this paper, we describe an audio content-based copy detection system that uses fingerprints derived from a spectrogram matrix. The spectrogram matrix is converted into a binary representation by applying a spectral threshold to the spectral values in this matrix. We show that fingerprints extracted from this binary spectrogram matrix are robust to different audio

Table 10  Comparison of the number of missed queries for Spectro, NN-based and Shazam systems on the TRECVID 2010 dataset

                         T1   T2   T3   T4   T5   T6   T7   Total
  Spectro (LM3.GM1*)     10   10   13   12   22   26   24   117
  NN-based               22   25   25   24   25   24   28   173
  Shazam                 44   48   64   58   67   77   78   436

Table 11  Comparison of F-measure for Spectro, NN-based and Shazam systems on the TRECVID 2010 dataset

                         T1     T2     T3     T4     T5     T6     T7     Average
  Spectro (LM3.GM1*)     0.9    0.889  0.855  0.858  0.865  0.82   0.833  0.860
  NN-based               0.685  0.695  0.701  0.691  0.685  0.691  0.703  0.693
  Shazam                 0.765  0.789  0.825  0.795  0.803  0.841  0.834  0.807


transformations. We compare results using two different audio feature parameters derived using either the Global Mean or the Local Mean of the spectral values. Local Mean based feature parameters give higher min NDCR than the Global Mean based features, but still detect 25 queries missed by the Global Mean based features, including 10 queries degraded by a transformation that adds irrelevant speech to the query (the T6 transform). The relatively poor performance of the Local Mean based feature parameter is due to its sensitivity to silent audio segments within a query, unlike Global Mean, which is robust to such transformations. Combining results from the Global Mean and the Local Mean based feature parameters improves the average min NDCR by 18 % compared to the best results achieved using these feature parameters separately. We also show that using the SCF-1 parameter (which considers one frame before and after the closest query frame to the reference frame) during search significantly improves the average min NDCR for the Global Mean feature parameter.

We compared our system to two audio copy detection systems: NN-based [7] and Shazam [19]. The results of this comparison show that our system by far outperforms the Shazam system for all audio transformations. When compared to the NN-based system, our system achieved 23 % lower min NDCR, 24 % better localization accuracy and 32 % fewer missed queries.

References

1. Anguera X, Garzon A, Adamek T (2012) MASK: robust local features for audio fingerprinting. In: 2012 13th IEEE International Conference on Multimedia and Expo (ICME 2012), 9–13 July 2012, 455–460. Melbourne, VIC, Australia: IEEE Computer Society

2. Ayari M, Delhumeau J, Douze M, Jégou H, Potapov D, Revaud J, Schmid C, Yuan J (2011) INRIA@TRECVID'2011: copy detection & multimedia event detection. In: TRECVID workshop

3. Baluja S, Covell M (2007) Audio fingerprinting: combining computer vision & data stream processing. In: 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing, 15–20 April 2007, 213–216. Piscataway, NJ, USA: IEEE

4. Building Video Queries for TRECVID Copy Detection Task (2008) http://www-nlpir.nist.gov/projects/tv2010/TrecVid2008CopyQueries.pdf. Accessed January 2014

5. Cano P, Batlle E, Kalker T, Haitsma J (2002) A review of algorithms for audio fingerprinting. In: 2002 IEEE 5th Workshop on Multimedia Signal Processing, 9–11 Dec. 2002, 169–173. Piscataway, NJ, USA: IEEE

6. Ellis D (2009) Robust landmark-based audio fingerprinting. Available at: http://labrosa.ee.columbia.edu/~dpwe/resources/matlab/fingerprint

7. Gupta VN, Boulianne G, Cardinal P (2012) CRIM's content-based audio copy detection system for TRECVID 2009. Multimed Tools Appl 60(2):371–387

8. Haitsma J, Kalker T (2002) A highly robust audio fingerprinting system. In: ISMIR

9. Hartung F, Kutter M (1999) Multimedia watermarking techniques. Proc IEEE 87(7):1079–1107

10. Heritier M, Gupta V, Gagnon L, Boulianne G, Foucher S, Cardinal P (2009) CRIM's content-based copy detection system for TRECVID. In: Proc. TRECVID-2009. Gaithersburg, MD, USA

11. Jegou H, Delhumeau J, Jiangbo Y, Gravier G, Gros P (2012) Babaz: a large scale audio search system for video copy detection. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), 25–30 March 2012, 2369–2372. Kyoto, Japan

12. Jiang M, Fang S, Tian YH, Huang T, Gao W (2011) PKU-IDM @ TRECVID 2011 CBCD: content-based copy detection with cascade of multimodal features and temporal pyramid matching. In: TRECVID workshop

13. Lebosse J, Brun L, Pailles JC (2007) A robust audio fingerprint extraction algorithm. In: Proceedings of the Fourth IASTED International Conference on Signal Processing, Pattern Recognition and Applications, 14–16 Feb. 2007, 269–274. Anaheim, CA, USA: ACTA Press

14. Lezi W, Yuan D, Hongliang B, Jiwei Z, Chong H, Wei L (2012) Content-based large scale web audio copy detection. In: 2012 IEEE International Conference on Multimedia and Expo (ICME), 9–13 July 2012, 961–966. Los Alamitos, CA, USA: IEEE Computer Society

15. Ouali C, Dumouchel P, Gupta V (2014) A robust audio fingerprinting method for content-based copy detection. In: International Workshop on Content-Based Multimedia Indexing. Austria

16. Ouali C, Dumouchel P, Gupta V (2014) Robust features for content-based audio copy detection. In: Fifteenth Annual Conference of the International Speech Communication Association. Singapore

17. Saracoglu A, Esen E, Ates TK, Acar BO, Zubari U, Ozan EC, Ozalp E, Alatan AA, Ciloglu T (2009) Content based copy detection with coarse audio-visual fingerprints. In: 2009 Seventh International Workshop on Content-Based Multimedia Indexing (CBMI), 3–5 June 2009, 213–218. Piscataway, NJ, USA: IEEE

18. Smeaton AF, Over P, Kraaij W (2006) Evaluation campaigns and TRECVID. In: 8th ACM International Workshop on Multimedia Information Retrieval, MIR 2006, co-located with the 2006 ACM International Multimedia Conference, 26–27 October 2006, 321–330. Santa Barbara, CA, United States: Association for Computing Machinery

19. Wang ALC (2003) An industrial-strength audio search algorithm. In: International Conference on Music Information Retrieval (ISMIR), pp 7–13

20. Yan K, Hoiem D, Sukthankar R (2005) Computer vision for music identification. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 20–25 June 2005, vol. 1, 597–604. Los Alamitos, CA, USA: IEEE Computer Society

21. Zhu B, Li W, Wang Z, Xue X (2010) A novel audio fingerprinting method robust to time scale modification and pitch shifting. In: 18th ACM International Conference on Multimedia, MM'10, 25–29 October 2010, 987–990. Firenze, Italy: Association for Computing Machinery

Chahid Ouali received the B.Sc. degree in computer science from the University of Sfax (Tunisia) and the M.Ing. degree from the École de technologie supérieure (Canada). He is currently working toward the Ph.D. degree in the Department of Software and IT Engineering at the École de technologie supérieure, Quebec, Canada, and is also with the Centre de recherche informatique de Montréal (CRIM), Canada.

His research interests include pattern recognition, information retrieval and audio signal processing. He isworking on audio and video fingerprinting methods for content-based copy detection.


Pierre Dumouchel received the B.Eng. (McGill University), M.Sc. (INRS-Télécommunications) and Ph.D. (INRS-Télécommunications) degrees, and has over 25 years of experience in the fields of speech recognition, speaker recognition and emotion detection. Pierre is Director General at the École de technologie supérieure (ÉTS) of the Université du Québec, Canada.

Vishwa Gupta received the B.Tech. degree from the Indian Institute of Technology (IIT), Kharagpur, India, and the M.Sc. and Ph.D. degrees in electrical and computer engineering from Clemson University, Clemson, SC.

At Nortel, he worked on many speech recognition applications, including ADAS Plus for directory assistance automation. He published many papers and obtained several patents while at Nortel. At SpeechWorks, he was involved in research and development of applications as a member of the product team. At IBM, he worked on their speech recognition engine to incorporate state-of-the-art algorithms and features. At the Centre de Recherche Informatique de Montréal (CRIM), he has been active in speech recognition, speaker diarization, keyword spotting, content-based audio/video copy detection, and automated advertisement detection.
