Instituto Tecnológico y de Estudios Superiores de Monterrey
Campus Monterrey
School of Engineering and Sciences
Unsupervised Deep Learning Recurrent Model for Audio Fingerprinting
A dissertation presented by
Abraham Báez Suárez
Submitted to the School of Engineering and Sciences
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
In
Information Technologies and Communications
Major in Intelligent Systems
Monterrey, Nuevo León, June 26th, 2020
Dedication
“If you’re going through hell, keep going”
- Winston Churchill -
This dissertation is dedicated to my father, Benito Báez Sánchez, who supported me during the struggle of switching between countries, labs, and projects.
To my mother, Maria Gabriela Suárez Vega, who had her own difficulties during my Doctoral Program and came out victorious, and for her sweet concern when I had my first and only car accident in a foreign country, far from home.
To my siblings, Benito Báez Suárez and Gladys Gabriela Báez Suárez, whose family life has been its own Doctoral Program, yet who keep enjoying the perks of parenthood alongside their beloved ones.
Finally, this work is also dedicated to my beloved wife, Michelle Mary Faria; her constant support was a crucial factor in “going through hell”. Sharing the same
responsibilities and concerns helped me to enjoy even more this hard time filled with uncertainty. It was difficult to see the light at the end of the tunnel, but it was
easier to walk together towards the unknown.
Last but not least, this dissertation is also dedicated to the memory of my grandmother, Rebeca Vega Rivera, a strong person who raised five children and
withstood many circumstances until the very last moment.
Unfortunately, this dissertation must also be dedicated to the people who did not believe in me, who told me to give up because I would never become a researcher, and to those who hate foreigners and lack compassion. It is because of them that I learned what not to do to others, and that I kept going through hell.
Acknowledgements
I want to express my deepest gratitude to all those who contributed directly or indirectly to this work. I acknowledge the great support given by my advisor, Dr. Juan Arturo Nolazco Flores, who listened to me when I had no idea what to do with my research and supported me during the hard time of changing labs while abroad. To Dr. Ioannis A. Kakadiaris, who accepted me at his lab working on Computer Vision applied to Face Recognition across Aging, and to Dr. Weidong (Larry) Shi and Ms. Yang Lu, who allowed me to expand my horizons by learning Deep Learning and applying it to Signal/Speech Processing and Security, deploying our technology into real 9-1-1 Centers. I also acknowledge Dr. Omprakash Gnawali and Dr. Shou-Hsuan Stephen Huang, two great people who gave me guidance during the deployment of our 9-1-1 technology and feedback while writing the journal paper. I acknowledge the contributions of Mr. Nolan Shah and Mr. Daniel Sitonic: the former coded next to me, designing the experiments and the algorithm, while the latter was resourceful, building a dataset needed for the experiments and helping with every requested task.
Nevertheless, a Ph.D. program is not only research. I had the opportunity to enjoy my stay abroad as part of the Catholic Newman Center, where I met many cheerful people and shared my faith, and of the Cougar DanceSport Club, doing what I love the most in different styles. The Church Choir allowed me to improve my music reading (understanding the staff and notes) while singing. None of these experiences would have been possible if the International Student and Scholar Services Office (ISSSO) at the University of Houston had not helped me through all the hurdles of being an international student.
During my Ph.D. program, I had the fortune to be awarded a tuition scholarship granted by Instituto Tecnológico y de Estudios Superiores de Monterrey (ITESM) and a scholarship for living expenses by Consejo Nacional de Ciencia y Tecnología (CONACyT). Other organizations, such as the Department of Homeland Security (DHS) and the North Atlantic Treaty Organization (NATO), funded the research work and the acquisition of equipment. Furthermore, this work was also supported by the Amazon Web Services (AWS) Cloud Credits for Research¹ and the University of Houston I2C Lab².
¹ https://aws.amazon.com/research-credits
² http://i2c.cs.uh.edu
Unsupervised Deep Learning Recurrent Model for Audio Fingerprinting
by
Abraham Báez Suárez

Abstract
Audio fingerprinting techniques were developed to index and retrieve audio samples by comparing a content-based compact signature of the audio instead of the entire audio sample, thereby reducing memory and computational expense. Different techniques have been applied to create audio fingerprints; however, with the introduction of deep learning, new data-driven unsupervised approaches have become available.
This doctoral dissertation presents a Sequence-to-Sequence Autoencoder Model for Audio Fingerprinting (SAMAF), which improved hash generation through a novel loss function composed of three terms: Mean Square Error, minimizing the reconstruction error; Hash Loss, minimizing the distance between similar hashes and encouraging clustering; and Bitwise Entropy Loss, minimizing the variation inside the clusters.
The performance of the model was assessed with a subset of the VoxCeleb1 dataset, a "speech in-the-wild" dataset. Furthermore, the model was compared against three baselines: Dejavu, a Shazam-like algorithm; Robust Audio Fingerprinting System (RAFS), a Bit Error Rate (BER) methodology robust to time-frequency distortions and coding/decoding transformations; and Panako, a constellation-based algorithm adding time-frequency distortion resilience.
Extensive empirical evidence showed that our approach outperformed all the baselines in the audio identification task and in other classification tasks related to the attributes of the audio signal, with an economical hash size of either 128 or 256 bits per second of audio.
Additionally, the developed technology was deployed into two 9-1-1 Emergency Operation Centers (EOCs), located in Palm Beach County (PBC) and Greater Harris County (GHC), allowing us to evaluate its performance in real time in an industrial environment.
Keywords: Artificial Intelligence, Machine Learning, Deep Learning, Unsupervised Learning, Sequence-to-Sequence Autoencoder, Audio Fingerprinting, Audio Identification
List of Figures

Figure 2.1 Left: LSTM cell with all its gates (forget f_t, input i_t, and output o_t), the sigmoid σ and hyperbolic tangent tanh functions, and all their interactions. The inputs (previous cell state C_{t-1}, previous hidden state h_{t-1}, and input sequence x_t), the weights W_{f,i,o,C}, and the biases b_{f,i,o,C} are responsible for the creation of the updated cell state C_t and updated hidden state h_t. Rectangles denote layer operations; all other nodes are pointwise operations. Right: Mathematical definition of the gates and their interactions with the inputs to create the updated cell and updated hidden states. .......... 10
Figure 2.2 The general framework starts with a raw audio signal, which is
normalized through a preprocessing step. The MFCC features are extracted from
the normalized signal and then input to the Sequence-to-Sequence Autoencoder
(SA) model, which generates the audio fingerprint (hash representation). ............ 11
Figure 2.3 The Sequence-to-Sequence Autoencoder (SA) model is composed of two RNNs, the Encoder and the Decoder. The former receives the inputs x_i^a, which are combined with the LSTM cell states S_i^a. When the last state S_{i,T}^a is reached, it is passed to the latter as the initial previous state Σ_{i,t-1}^a. The output of the RNN Decoder, x̂_i^a, is a reconstructed representation of the original inputs x_i^a. The audio fingerprint (hash) is embedded into the last state S_{i,T}^a of the RNN Encoder .......... 12
Figure 2.4 Graphical explanation of the loss terms: the MSE Loss minimizes the reconstruction error, shaping the distribution of the audio fingerprints across the space; the Hash Loss minimizes the distance between similar hashes, bringing them together; and the Bitwise Entropy Loss minimizes the variation inside the clusters of similar hashes, tuning them to become more alike. .......... 14
Figure 3.1 The average Hamming distance (X-axis) vs. audio identification
accuracy (Y-axis) shows that the models with a structured hash space (small
average Hamming distance) are not necessarily the ones with higher performance
and vice versa; it depends on the parameters of the model. Random, Accuracy,
and Loss indicate the method by which the models were selected. Random refers
to instances of the SA model with randomly initialized weights, Accuracy refers to
selection by maximal audio identification performance, and Loss refers to selection
by minimal loss. [Better visualized in color] ........................................................... 22
Figure 3.2 Learning curves of the Loss functions terms. The top row refers to the
simplest model (1LST128) while the bottom row to the most complex model
(2LST256). [Better visualized in color] ................................................................... 25
Figure 3.3 VoxCeleb1 subset dataset benchmark using the best model,
2LST256M-Acc, to report classification performance. ........................................... 26
Figure 3.4 Benchmark of the time-frequency transformations on the VoxCeleb1
subset dataset. Rows are Pitch, Speed, and Tempo transformations, respectively.
Columns are audio sample length from 1 to 6 seconds. [Better visualized in color]
Note: RAFS and Panako 1-second curves are a flat line on the horizontal axis
because their performance is 0.00% ..................................................................... 29
Figure 3.5 Critical Difference (CD) diagram for the 36 classifiers. The classifiers with the lowest (best) rank are located on the right side. It is seen that 2LST256M-Acc possesses the best average rank, but it is not significantly different from the following top 26 classifiers (cluster). However, the data is not sufficient to conclude that those 26 classifiers are not similar to other clusters. In this case, this makes 2LST256M-Acc the absolute winner. The CD diagram uses a critical value of 3.837 (q_α), with a 5% significance level α. .......... 34
Figure 3.6 Critical Difference (CD) diagram for our best model (2LST256M-Acc) and the benchmarked models (Dejavu, Panako, and RAFS). It is seen that SAMAF is at the top with an average rank of 1.65, followed by Dejavu with an average rank of 2.04. It is not known, due to lack of data, whether Dejavu is statistically similar to SAMAF or to the other two methods (Panako and RAFS). The critical distance is 1.9148, with a critical value of 2.569 (q_α) using a 5% significance level α. .......... 35
Figure 4.1 The general framework of the testing environment starts with a call
generator in which information is filtered through a firewall and later managed by
the Session Border Controller (SBC), which sends the call to the Public Safety
Answering Point (PSAP). Once the call is answered, the Session Initiation Protocol
(SIP)/Real-time Transport Protocol (RTP) probe forwards the RTP packets, which
are analyzed by the call analyzer, also known as Acoustic Forensic Analysis
Engine (AFAE). ..................................................................................................... 37
Figure 4.2 PSAP Call taker login window .............................................................. 39
Figure 4.3 PSAP Dashboard displaying the attributes of the caller information .... 39
Figure 4.4 The block diagram of the deployment starts with a raw audio signal,
which is normalized through a preprocessing step. Depending on the analysis
either Spectrogram Image Features (SIF) or Mel-Frequency Cepstral Coefficients
(MFCCs) features are extracted from the normalized signal and then input either to
the Convolutional Neural Network (CNN) that outputs a classification label (Human,
Synthetic, or TTY) or input to a Sequence-to-Sequence Autoencoder (SA) model
which generates the audio fingerprint (hash representation). ................................ 40
Figure 4.5 Flowchart of the AFAE methodology, where it is observed that the first
filter ensures that there is information to be analyzed. Later, the outputs of the call
classifier and the duplicate detector are stored in a MySQL database .................. 41
Figure 4.6 Architecture of the different devices involved in the call analysis module
(AFAE) where it can be seen that a SIP trunk carries the information which reaches
a switch where the port mirroring occurs and a SIP/RTP probe segments and
preprocesses the packets, one call per thread, which are analyzed by our AFAE
module. The visualization of the analysis is displayed through a web interface .... 42
Figure 4.7 Web Interface. This is the main part for inputting the query data; the
records can be located using several filters such as identifier (external or internal),
timestamp or date range, classification result, ghost result, similarity score, and
amount of records to be displayed. ....................................................................... 44
Figure 4.8 Web Interface. Overview/Statistics of all the records found after the
submission of the query. From left to right, we see the number of total calls found,
number of human, synthetic, TTY, and empty calls, number of silent, non-silent,
and empty calls. Empty records are created either when the call is detected as a
silent call or there was an error during the analysis, and the only thing saved was
the ID of the call. ................................................................................................... 45
Figure 4.9 Web Interface. Results of the query sorted by timestamp. On the left,
the external ID, then the internal ID, and an audio console to play the calls. The
next column has the timestamp followed by the classification score and its
confidence score, and duplicate score. If the call is silent, no results are outputted,
as seen in the last row. .......................................................................................... 45
Figure 4.10 Web Interface. The per-call information is shown when the internal ID
is clicked. At the top, the call we want to analyze is displayed, and at the bottom, all
the calls that are similar to the one at the top are sorted from smallest to largest
MinDist (last column) where a small MinDist means greater similarity. ................. 46
Figure 4.11 Web Interface. The administrative interface provides the tools to partially or totally delete the records in the database and all the files created for the analysis .......... 47
List of Tables

Table 1.1 Knowledge-based methodologies for the creation of fingerprints in different types of
media .......................................................................................................................................... 6
Table 1.2 Machine Learning methodologies, focused on Deep Learning, for the creation of
fingerprints in different types of media ........................................................................................ 7
Table 3.1 Quantity of reference hashes stored in the Reference Database and quantity of
query hashes with respect to the audio length. ......................................................................... 20
Table 3.2 Average Hamming Distance in the VoxCeleb1 dataset with the different model
parameters such as the number of layers, hash size, strategy, loss function term, and audio
length. ....................................................................................................................................... 23
Table 3.3 Top-5 models in the VoxCeleb1 subset for audio identification and other
classification tasks based on the audio signal attributes. .......................................................... 24
Table 3.4 VoxCeleb1 subset benchmarked against baselines per transformation ................... 27
Table 3.5 Simplest (top) and most complex (bottom) SAMAF models for Known vs. Unknown
comparison in the audio identification task ............................................................................... 30
Table 3.6 Audio identification performance in the VoxCeleb1 dataset showing all the
classification results with the different model parameters such as the number of layers, hash
size, strategy, loss function term, and audio length. ................................................................. 31
Table 3.7 Transformation performance in the VoxCeleb1 dataset showing all the classification
results with the different model parameters such as the number of layers, hash size, strategy,
loss function term, and audio length. ........................................................................................ 32
Table 3.8 Average ranks of the SA model and its different parameters applying the Friedman
procedure ................................................................................................................................. 34
Table 3.9 Average Ranks of the proposed model and the benchmarked models ..................... 35
Table 4.1 Summary of primary submission results in the ASVspoof 2015 challenge and
comparison with the University of Houston results. Our approach ranks first, outperforming the runner-up by a factor of 18. .......... 41
Table A.1 Abbreviations ........................................................................................ 50
Table A.2 Acronyms .............................................................................................. 51
Table B.1 Variables and Symbols ......................................................................... 53
Table C.1 Speech Transformations ....................................................................... 56
Contents

Abstract .......... VI
List of Figures ...................................................................................................... VII
List of Tables ........................................................................................................ XI
1. Introduction ................................................................................................... 1
1.1 Motivation .......... 3
1.2 Problem Statement and Context .......... 4
1.3 Research Question .......... 5
1.4 Solution Overview .......... 7
1.5 Main Contributions .......... 8
1.6 Dissertation Organization .......... 8
2. Deep Unsupervised Audio Fingerprinting ................................................... 9
2.1 Deep Learning Review .......... 9
2.2 SAMAF: Sequence-to-Sequence Autoencoder Model for Audio Fingerprinting .......... 11
2.3 Hamming Distance Matching .......... 15
3. Experimental Results .................................................................................. 16
3.1 VoxCeleb1 Dataset .......... 16
3.2 Technical Details .......... 17
3.3 Experiments .......... 18
3.4 Results .......... 21
3.5 Statistical Analysis .......... 33
4. Deployment .................................................................................................. 36
4.1 History of the 9-1-1 System .......... 36
4.2 General Framework of the Testing Environment .......... 37
4.3 Call Classifier and Duplicate Detector .......... 38
4.4 Materials and Architecture .......... 40
4.5 Palm Beach County (PBC) .......... 42
4.6 Greater Harris County (GHC) .......... 43
4.7 Web Interface .......... 44
5. Conclusions ................................................................................................. 48
5.1 Future Work .................................................................................................. 49
Appendix A. Abbreviations and acronyms .......... 50
Appendix B. Variables and Symbols .......... 53
Appendix C. Speech Transformations .......... 56
Appendix D. 9-1-1 Test Scripts .......... 57
Appendix E. Python Script of the SAMAF methodology for the VoxCeleb1 subset dataset .......... 58
Bibliography ...................................................................................................... 120
Published papers .............................................................................................. 124
Curriculum Vitae ................................................................................................ 125
Chapter 1

Introduction

Audio is high dimensional in nature. Perceptually similar audio samples may have high variance in their data streams; as a result, comparing one audio sample against a corpus of other audio samples is memory and computationally expensive, and neither efficient nor effective. Tasks such as indexing & retrieval, audio identification, integrity verification, music information retrieval, scene recognition, and piracy monitoring may benefit from audio fingerprinting.

An audio fingerprint is a content-based compact signature (often a hash) that summarizes key information about the audio sample and commonly links audio samples to their corresponding metadata (e.g., sex, age, language, artist and song name, genre, among others). Ideally, an audio fingerprint-based system should identify a sample regardless of compression, distortion (e.g., pitching, equalization, D/A-A/D conversion, coders), or interference (e.g., background noise in the transmission channel).

Traditional cryptographic hashing algorithms such as Message Digest 5 (MD5), Cyclic Redundancy Check (CRC), and Secure Hash Algorithm (SHA) can be used to generate a compact signature of an audio sample. However, these hashes are fragile: a single modified bit in the data stream results in an entirely different signature. They are not robust to differences in compression, distortion, or interference, and as a result are not considered robust fingerprinting methods.

The design of an audio fingerprinting system should pay attention to (i) the method by which audio fingerprints are generated, (ii) the search and/or matching algorithm used, and (iii) the storage size of the audio fingerprints. The literature focuses on the different techniques capable of extracting meaningful audio fingerprints and on the procedure to search and match them against a collection of audio fingerprints stored in a database.
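The fragility of cryptographic hashes noted above is easy to demonstrate; a minimal Python sketch (the byte values are arbitrary placeholders, not real audio data):

```python
import hashlib

# Two byte streams differing in a single bit (values are arbitrary).
original = bytes([0x10, 0x20, 0x30, 0x40])
modified = bytes([0x11, 0x20, 0x30, 0x40])  # one bit flipped in the first byte

h1 = hashlib.md5(original).hexdigest()
h2 = hashlib.md5(modified).hexdigest()
print(h1)
print(h2)
# Avalanche effect: the two digests share essentially no structure, which is
# why cryptographic hashes fail as perceptual (robust) audio fingerprints.
```

A perceptual fingerprint must behave in exactly the opposite way: small changes in the input should produce small (or no) changes in the signature.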
The fingerprint size is chosen to be as small as possible without compromising the performance of the system.

The International Federation of the Phonographic Industry (IFPI) and the Recording Industry Association of America (RIAA) recognized the importance of audio fingerprinting systems [1]. Towards understanding how to evaluate them, a variety of metrics were defined, including accuracy, reliability, robustness, granularity, security, versatility, scalability, complexity, and fragility. Among these, accuracy, robustness, and granularity are the most tested in the literature. The metrics were defined by Cano et al. [1] as follows:

Accuracy: Number of correct, missed, and wrong identifications (false positives).

Reliability: In playlist generation, assessing whether or not a query is present in a repository is important for copyright enforcement organizations. If a song has
not been broadcast, it should not be identified as a match, even at the cost of missing actual matches. In other applications, such as automatic labeling of MP3 files, avoiding false positives is not mandatory.

Robustness: Accurately identify an item regardless of the level of compression, distortion, and/or interference in the transmission channel.

Granularity: Identify whole titles from excerpts a few seconds long. This requires dealing with shifting, i.e., the lack of synchronization between the extracted fingerprint and those stored in the database. It also adds complexity to the search, forcing a comparison over all possible alignments.

Security: Vulnerability of the solution to cracking or tampering. In contrast with the robustness requirement, the manipulations to deal with here are designed to fool the fingerprint identification algorithm.

Versatility: Ability to identify audio regardless of the audio format, and to use the same database for different applications.

Scalability: Performance with very large databases of titles or a large number of concurrent identifications. The database size affects the accuracy and complexity of the system.

Complexity: The computational costs of the fingerprint extraction, the size of the fingerprint, the complexity of the search and of the fingerprint comparison, and the cost of adding new items to the database, among others.

Fragility: Some applications, such as content integrity verification systems, may require the detection of changes in the content. This is contrary to the robustness requirement: the fingerprint should be robust to content-preserving transformations but not to other distortions.

Advances in Artificial Intelligence (AI) and Machine Learning (ML) have led to new approaches to processing high-dimensional data like audio, where Deep Learning (DL) is used to create representations with different levels of abstraction.
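The exhaustive-alignment search that the Granularity requirement imposes can be sketched as a brute-force bit-error scan over binary fingerprints. This is an illustrative toy, not the matching scheme of any specific system; the fingerprint length, excerpt position, and corruption rate are arbitrary assumptions:

```python
import numpy as np

def best_alignment(reference, query):
    """Brute-force scan: slide the short query fingerprint over the longer
    reference and return the offset with the lowest bit error rate (BER)."""
    n, m = len(reference), len(query)
    best_off, best_ber = -1, 1.0
    for off in range(n - m + 1):          # test every possible alignment
        ber = float(np.mean(reference[off:off + m] != query))
        if ber < best_ber:
            best_off, best_ber = off, ber
    return best_off, best_ber

rng = np.random.default_rng(0)
ref = rng.integers(0, 2, size=4096)       # reference fingerprint bits
qry = ref[1000:1256].copy()               # a 256-bit excerpt starting at 1000
qry[rng.choice(256, size=20, replace=False)] ^= 1  # corrupt ~8% of the bits

off, ber = best_alignment(ref, qry)
print(off, round(ber, 3))  # the true offset, 1000, wins despite the noise
```

Because random alignments average a BER near 0.5 while the true alignment stays near the corruption rate, the match survives; the cost, however, grows with every stored reference, which is exactly the complexity penalty described above.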
These often utilize Artificial Neural Networks (ANN) such as Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN). Deep learning-based architectures have begun to be used for the creation of audio, multimedia, and cross-modality fingerprints.

This research introduces an Unsupervised Deep Learning Recurrent Model for generating audio fingerprints. It addresses the audio identification task through a Sequence-to-Sequence Autoencoder (SA) model composed of two linked Recurrent Neural Networks (RNN). The first RNN is the Encoder, which maps the input data into a vector representation of fixed dimensionality, and the second is the Decoder, which maps the representation back to the original input data. The unsupervised nature of the model, the economical hash size (128/256 bits per second of audio),
the robustness in identifying transformed audio signals, and the capacity to generalize to different types of audio signals differentiate this model from other approaches. Experimental results confirmed that our methodology is robust to transformations in speech (VoxCeleb [2] dataset), achieving the highest accuracy of 66.56%, while the runner-up, Dejavu [3], an open-source Shazam-like implementation, achieved an accuracy of 47.32%; the other two baselines, Robust Audio Fingerprinting System (RAFS) [4], a Bit Error Rate (BER) methodology robust to time-frequency distortions and coding/decoding transformations, and Panako [5], a constellation-based algorithm adding time-frequency distortion resilience, achieved accuracies of 19.82% and 23.28%, respectively.

1.1 Motivation

The speed at which multimedia files spread through the internet makes it harder for producers to keep track of copyright infringements caused by practices such as piracy or unauthorized broadcasting, and it also increases the complexity of file management systems, which have to efficiently index and retrieve a vast corpus of multimedia files in a reasonable time window. Even more alarming, multimedia content can be maliciously modified, misleading users about its real intent (a trojan horse), manipulating information in favor of a certain political/philosophical viewpoint [6], and/or deceiving final users into believing they are interacting with a specific person [7]. In order to keep track of the content of multimedia files, protect the final user, and add related metadata, fingerprinting techniques were developed and optimized. These techniques efficiently compare the content of multimedia files against a repository without tampering with the data (unlike, e.g., watermarks [8] [9]). Improvements to resilience made systems such as Shazam [10], Philips [4], and Microsoft [11] popular, especially for the music information retrieval task.
In recent years, new methodologies have built on these former approaches while introducing novel strategies, such as Panako [5], which introduced robustness to time-frequency distortions. However, all these techniques were developed before the introduction of Deep Learning, and no significant efforts have been made with this approach in mind. Most recent Deep Learning approaches are not focused on audio fingerprinting; they have been used for video, image, and text fingerprinting, including cross-domain applications such as querying images using text [12] and vice versa, expanding the capabilities of multimedia fingerprinting. The lack of deep models for audio fingerprinting leaves a gap that we fill with a novel technique that clusters similar audio samples and restricts unalike audio samples from joining these clusters.
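The clustering technique alluded to here is the sequence-to-sequence autoencoder developed in Chapter 2. Its forward pass can be sketched with an untrained toy model: random weights, NumPy only, and illustrative dimensions (13-dimensional MFCC frames, a 128-unit state, sign binarization); none of this is the trained SAMAF configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell, forward pass only; weights stay random in this sketch."""
    def __init__(self, in_dim, hid_dim, rng):
        scale = 1.0 / np.sqrt(in_dim + hid_dim)
        # Stacked weights for the forget/input/output gates and the candidate.
        self.W = rng.normal(0.0, scale, (4 * hid_dim, in_dim + hid_dim))
        self.b = np.zeros(4 * hid_dim)
        self.h = hid_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        f = sigmoid(z[:self.h])            # forget gate f_t
        i = sigmoid(z[self.h:2 * self.h])  # input gate i_t
        o = sigmoid(z[2 * self.h:3 * self.h])  # output gate o_t
        g = np.tanh(z[3 * self.h:])        # candidate values
        c = f * c + i * g                  # updated cell state C_t
        h = o * np.tanh(c)                 # updated hidden state h_t
        return h, c

def sa_forward(frames, hid_dim=128, seed=0):
    """Encoder folds the (T, d) MFCC sequence into its last state; the hash is
    a binarized copy of that state; the decoder unrolls it back into T frames."""
    rng = np.random.default_rng(seed)
    T, d = frames.shape
    enc, dec = LSTMCell(d, hid_dim, rng), LSTMCell(d, hid_dim, rng)
    W_out = rng.normal(0.0, 1.0 / np.sqrt(hid_dim), (d, hid_dim))
    h = c = np.zeros(hid_dim)
    for t in range(T):                     # Encoder pass
        h, c = enc.step(frames[t], h, c)
    audio_hash = (h > 0).astype(np.uint8)  # e.g. a 128-bit fingerprint
    x, recon = np.zeros(d), []
    for _ in range(T):                     # Decoder pass, seeded with (h, c)
        h, c = dec.step(x, h, c)
        x = W_out @ h                      # reconstructed frame, fed back in
        recon.append(x)
    return audio_hash, np.stack(recon)

frames = np.random.default_rng(1).normal(size=(100, 13))  # ~1 s of MFCC frames
audio_hash, recon = sa_forward(frames)
print(audio_hash.shape, recon.shape)  # (128,) (100, 13)
```

Training would minimize the reconstruction error between `recon` and `frames` (plus the hash and entropy terms described in the abstract), so that similar audio yields nearby hashes; with random weights, only the shapes and data flow are meaningful.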
Chapter 1. Introduction 4
1.2 Problem Statement and Context

Audio samples are high dimensional in nature, and a direct comparison between them is memory and computationally expensive. Additionally, the internet makes it easy to send/receive audio files all around the world without verifying their origin and security, making it easy to spread malware and viruses as well as to infringe copyright regulations. Hence, a technique capable of comparing thousands of audio samples in a small fraction of time, robust to audio compression, distortion, and interference, may help towards the solution of this problem.

In the literature, there are a handful of papers that are the foundation of audio fingerprinting techniques. First, a survey on audio fingerprinting by Cano et al. [1] clarifies the concepts and different techniques of a typical audio fingerprinting system. Later, Shazam [10], the first system to successfully capitalize on this technology, is introduced and explained. After Shazam, other companies such as Philips [4] and Microsoft [11] focused their research on audio fingerprinting technology, building its foundation. Some other more recent approaches (Panako [5]) have been developed, introducing robustness to time-frequency distortions.

Wang [10] (Shazam) introduced the constellation algorithm, one of the most popular methods to extract audio fingerprints. It worked by generating sets of hash:time offset records; the searching and matching algorithm operated with time pairs distributed into bins. A match occurred when a significant number of time pairs were located in a bin corresponding to the track ID. The fingerprint was a 64-bit structure: 32 bits for the hash, and 32 bits for the time offset and track ID. The recognition rate reported on 10,000 tracks (8 KHz mono, 16-bit samples) considered additive noise and additive noise + GSM compression cases, whereas the search time on 20,000 tracks was found to be on the order of 5-500 milliseconds.
Haitsma and Kalker [4] (Philips) split the audio stream (mono, 5 KHz) into frames of 0.37 seconds length and applied a Fast Fourier Transform (FFT). A 32-bit sub-fingerprint was extracted every 11.5625 milliseconds, applying 33 non-overlapping bands (300 Hz-2 KHz) per frame. The concatenation of 256 sub-fingerprints is called a fingerprint-block, with a total length of 2.96 seconds. The bit value (0 or 1) is determined by computing the energy relation between frames and bands. The search algorithm is a lookup table containing all the 32-bit sub-fingerprints, and the matching is calculated through the Bit Error Rate (BER): a match is found if the BER falls below a threshold; otherwise, the song is declared not to be in the database. The fingerprint size was found to be 2,767.5676 bits per second.

Burges et al. [11] (Microsoft) split the audio stream (mono, 11.025 KHz) into frames of 372 milliseconds length with half-overlap and applied a Modulated Complex Lapped Transform (MCLT), ending with 2048 coefficients per frame. In the first layer, MCLT log magnitudes are projected to a 64-dimensional space with an Oriented Principal Component Analysis (OPCA) technique. Later, 32 of the resulting frames are concatenated to form another 2048-dimensional vector, which is
projected using a second layer. The final vector of 64 real values is the audio fingerprint created from 6.138 seconds of audio. The search algorithm used a lookup table in the first phase. In the second, the "fingerprint to different clip" average Euclidean distance was normalized to unity, enabling a single accept threshold for all fingerprints as part of the matching algorithm. The fingerprint size was found to be 667.3183 bits per second.

Six and Leman [5] (Panako) preprocessed the audio files with a sample rate of 8 KHz and 1 channel. They used TarsosDSP for the extraction of spectrogram features. Based on Shazam, peaks were obtained by finding the local maxima and then validating that they were also the maximum within a tile of dimensions ΔT × ΔF. They handled time stretching through event triplets, the same as symbolic fingerprinting [13], and added a pitch-shift resilience property through a Constant-Q Transform [14]. The matching algorithm was similar to Shazam but heavily modified to allow time and frequency distortions. Their approach was tested up to 10% distortion with queries of length 20, 40, and 60 seconds. The fingerprint size was found to be 1,024 bits per second.

In the state-of-the-art, fingerprints are not limited to audio but extend to other types of media such as video, images, text, and data. Methodologies that create media fingerprints with hand-crafted features and no meta-data are classified as knowledge-based algorithms, while the ones capable of creating their own features and being driven by data are called Machine Learning (ML) algorithms. Table 1.1 classifies the recent work of knowledge-based methodologies based on the type of media and the task they are solving. Table 1.2 also classifies the recent work of ML methodologies, with a focus on Deep Learning (DL), making the distinction between supervised and unsupervised approaches.
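To make the BER matching described for the Philips-style systems concrete, the sketch below computes the bit error rate between two binary fingerprint blocks. The 0.35 threshold and the toy data are illustrative assumptions, not the Philips implementation:

```python
import numpy as np

def bit_error_rate(fp_a: np.ndarray, fp_b: np.ndarray) -> float:
    """Fraction of differing bits between two binary fingerprint blocks."""
    assert fp_a.shape == fp_b.shape
    return float(np.mean(fp_a != fp_b))

def is_match(fp_a: np.ndarray, fp_b: np.ndarray, threshold: float = 0.35) -> bool:
    # A match is declared when the BER falls below the threshold.
    return bit_error_rate(fp_a, fp_b) < threshold

# Toy fingerprint-block: 256 sub-fingerprints of 32 bits each.
rng = np.random.default_rng(0)
block = rng.integers(0, 2, size=(256, 32))
noisy = block.copy()
noisy[rng.random(block.shape) < 0.05] ^= 1   # flip roughly 5% of the bits
```

With about 5% of the bits flipped, the BER stays well under the threshold, so the distorted block still matches its original.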
It is noted that classical methods are well developed for the creation of audio fingerprints, while other media types are mostly unexplored. In the case of DL, fingerprints for images are more studied than for audio. Hence, this work builds a DL strategy for the creation of audio fingerprints that are resilient to compression, time-frequency distortions, and other transformations such as echo, low/high pass filters, or noise.
1.3 Research Question

Based on the state-of-the-art review, it was discovered that Deep Learning systems had not been exploited for the creation of audio fingerprints. Therefore, the research question is formulated as follows: Is it possible to create a new audio fingerprinting system robust to audio compression, distortion, and interference, making use of edge technology, such as Deep Learning, while preserving a small hash size? The goal of this work is to answer this question.
Table 1.1 Knowledge-based methodologies for the creation of fingerprints in different types of media
Author | Media | Task | Algorithm
Anguera et al. [15] | Audio | Music Information Retrieval | MASK
Fan and Feng [16] | Audio | Music Information Retrieval | Shazam-based
Henaff et al. [17] | Audio | Music Information Retrieval | Constant-Q Transform
Haitsma and Kalker [4] | Audio | Music Information Retrieval | Philips
Wang [10] | Audio | Music Information Retrieval | Shazam
Sonnleitner and Widmer [18] | Audio | Music Information Retrieval | Quad
Six and Leman [5] | Audio | Music Information Retrieval | PANAKO
Özer et al. [19] | Audio | Audio Identification | Periodicity & SVD based
Burges et al. [11] | Audio | Audio Identification | Microsoft
Baluja and Covell [20] | Audio | Audio Identification | Google
Gupta et al. [21] | Audio | Piracy Monitoring | Philips-based
Hsieh et al. [8] | Audio | Piracy Monitoring | Low-bit encoding and parity method
Ouali et al. [22] | Audio | Piracy Monitoring | Binary 2D Spectrogram Image
Ouali et al. [22] | Video | Piracy Monitoring | V-Intensity & V-motion
Roopalakshmi and Reddy [23] | Video | Piracy Monitoring | Geometric Distortions & MFCC
Park et al. [24] | Image | Indexing and Retrieval | Neighbor Sensitive Hashing
Gaoy et al. [25] | Images | Indexing and Retrieval | Data Sensitive Hashing
Gaoy et al. [25] | Text | Indexing and Retrieval | Data Sensitive Hashing
Gaoy et al. [25] | Data | Indexing and Retrieval | Data Sensitive Hashing
Table 1.2 Machine Learning methodologies, focused on Deep Learning, for the creation of fingerprints in different types of media
Author | Media | Task | Approach | Algorithm
Kereliuk et al. [26] | Audio | Music Information Retrieval | Supervised | CNN
Li et al. [27] | Audio | Music Information Retrieval | Unsupervised | Kohonen
Petetin et al. [28] | Audio | Scene Recognition | Both | DBN & DNN
Amiriparian et al. [29] | Audio | Scene Recognition | Unsupervised | SA
Gu et al. [30] | Video | Indexing and Retrieval | Supervised | CNN & RNN
Nguyen et al. [31] | Images | Indexing and Retrieval | Supervised | DNN
Lai et al. [32] | Images | Indexing and Retrieval | Supervised | CNN
Hinton and Salakhutdinov [33] | Images | Indexing and Retrieval | Unsupervised | RBM
Liong et al. [34] | Images | Indexing and Retrieval | Both | DNN
Cao et al. [12] | Images | Cross Modal Indexing and Retrieval | Supervised | CNN & RNN
Cao et al. [12] | Text | Cross Modal Indexing and Retrieval | Supervised | CNN & RNN
Salakhutdinov and Hinton [35] | Text | Indexing and Retrieval | Unsupervised | RBM
Abbreviations: DNN - Deep Neural Network || DBN - Deep Belief Network || RBM - Restricted Boltzmann Machine || CNN - Convolutional Neural Network || RNN - Recurrent Neural Network || SA - Sequence-to-Sequence Autoencoder
1.4 Solution Overview

An audio fingerprinting system is characterized by the creation of a small content-based signature (hash) that represents the original audio sample, and by the property of creating similar signatures (hashes) regardless of the transformation/distortion an audio sample was subjected to. Furthermore, different audio samples must be characterized by different hashes that should be located far from each other in the hash space. Based on the previous statements, a dimensionality reduction technique was needed to create the hash representation, and a new loss function was formulated for the creation of the hash space. The former task was achieved by a Sequence-to-Sequence Autoencoder (SA) model based on the research of Chung et al. [36], who proved that the model was useful in speech applications, creating a fixed-dimension vector representation of variable-length audio segments; its unsupervised nature described the phonetic structure of the audio, giving audio segments that sounded similar a vector representation nearby in the space. The latter task was accomplished bearing in mind the relationship between three terms created to handle the creation of the hash space. The first term is the Mean Square Error (MSE) Loss, which minimized the error between the inputs (audio samples) and the outputs (reconstructions of the inputs using the created audio fingerprints). The second term is the Hash Loss, which minimized the distance between similar hashes and encouraged clustering. The third term is the Bitwise Entropy Loss, which minimized the variation inside the clusters.
These two ideas, the SA model and the novel loss function, were combined for the creation of a new audio fingerprinting system which, based on experimental results, proved to be a suitable solution for the problem previously stated.

1.5 Main Contributions

Our first contribution is the application of an unsupervised DL algorithm for the generation of audio fingerprints without the need for human annotation. Although the methodologies found in Li et al. [27], Petetin et al. [28], and Amiriparian et al. [29] are unsupervised, the first uses the unsupervised Kohonen Self-Organizing Map (SOM) as a classifier, the second uses the unsupervised Deep Belief Network (DBN) as pre-training for a DNN classifier, and the third uses the unsupervised SA model to create features to be input into a multi-layer perceptron for scene classification. Unlike the proposed framework, none of these works use an unsupervised approach specifically for the generation of audio fingerprints. Our second contribution is the creation of a loss function which, while minimizing the Sequence-to-Sequence Autoencoder (SA) reconstruction error, creates a hash (fingerprint) space where similar hashes are as close as possible, starting with a rough composition and then being refined with bitwise adjustment. For our third contribution, we provide experimental results that show a robust fingerprinting system with an economical hash size, 128/256 bits per second of audio, and small granularity (1-6 seconds). Finally, our fourth contribution is the deployment of this technology into two 9-1-1 Centers, Palm Beach County and Greater Harris County, where our system was tested in real-time in an industrial environment.

1.6 Dissertation Organization

Chapter 1 introduced the definitions, concepts, and motivations for the creation and use of audio fingerprinting systems. Furthermore, an unsupervised deep learning recurrent model and a novel loss function were introduced for the solution of the stated problem.
Chapter 2 briefly reviews the history of Deep Learning (DL) and goes deeper into the explanation of the SAMAF framework breaking it down into its different blocks and the mathematical formulation behind the methodology.
Chapter 3 presents important implementation details which allow other researchers to reproduce the results herein. Additionally, experimental results are displayed, and an explanation of the advantages and disadvantages is elaborated.
Chapter 4 describes the deployment process in Palm Beach and Greater Harris Counties, where the audio fingerprinting system was implemented as a duplicate detector for 9-1-1 calls, working in real-time in an industrial environment.
Chapter 5 summarizes the relevant findings, presents the conclusions, highlights the contributions, and introduces future work.
Chapter 2. Deep Unsupervised Audio Fingerprinting

Before introducing the mathematical formulation behind the created framework, it is imperative to introduce the foundations. Hence, the history of Deep Learning is briefly presented to summarize the key points of different architectures and to highlight the mathematical equations needed for the development of our system. Afterward, a detailed explanation of how to build our proposed model is presented.

2.1 Deep Learning Review

Deep Neural Networks (DNNs) are feedforward networks made of multiple processing layers that learn to represent data at multiple levels of abstraction. The backpropagation algorithm adjusts the internal parameters used to compute the representation in each layer from the representation in the previous layer. This allows the discovery of intricate structures in large datasets [37]. DNNs have improved the state-of-the-art in speech recognition [38] [39] and object detection [40], among other domains. Convolutional Neural Networks (CNNs) are a particular type of feedforward network that are much easier to train and generalize much better than networks with fully connected layers. This type of network excelled at object recognition, classifying 1.2 million high-resolution images (ImageNet LSVRC-2010 [41]) into 1000 different classes with a 37.5% error rate using the AlexNet architecture [42]. Although DNNs and CNNs have shown tremendous flexibility and power in abstraction, they have the limitation that inputs and outputs must be encoded with fixed dimensionality. In dynamic systems, where the structure of the data changes over time, a vector of fixed dimensionality encodes neither the current information nor the sequential relationship within the data. This type of data is called sequential data and can be found in audio, video, and text samples. Recurrent Neural Networks (RNNs) are networks capable of processing sequential data.
Their hidden neurons form a directed cycle consisting of a hidden state $h_t$ that functions as an internal memory at time $t$, enabling the capture of dynamic temporal information and the processing of variable-length sequences $x = (x_1, \dots, x_t, x_{t+1}, \dots, x_T)$. The hidden state $h_t$ is updated at time step $t$ by $h_t = f(h_{t-1}, x_t)$, where $f$ is a non-linear activation function. Even though RNNs are robust dynamic systems, training them to learn long-term dependencies is challenging due to the backpropagated gradients either growing or shrinking at each time step; therefore, over many time steps, they typically explode or vanish [43]. Long Short-Term Memory (LSTM) cells [44] provided a solution by incorporating memory units. These allow the network to learn when to forget, update, or output previous hidden states given new information. As a result, remembering long-term dependencies is the default behavior. Additional depth can be added to LSTMs by stacking them on top of each other, using the output of the LSTM in the previous layer $(\ell - 1)$ as the input to the LSTM in the current layer $\ell$. Figure 2.1 illustrates the composition of an LSTM cell, which requires
three inputs: the previous cell state $C_{t-1}$, the previous hidden state $h_{t-1}$, and the input sequence $x_t$; and returns two outputs: the updated cell state $C_t$ and the updated hidden state $h_t$. The weights $W_{f,i,o,C}$ and biases $b_{f,i,o,C}$ also need to be considered. The forget gate $f_t$ keeps/removes information from the previous cell state $C_{t-1}$, then the input gate $i_t$ decides which information is updated from the cell state candidate $\tilde{C}_t$; the two previous interactions are added to form the updated cell state $C_t$. Finally, the output gate $o_t$ decides which information from the updated cell state $C_t$ will create the updated hidden state $h_t$. The two types of functions presented in the illustration are the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, which maps the data into the range $(0, 1)$, and the hyperbolic tangent function $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\sigma(2x) - 1$, which maps the data into the range $(-1, 1)$. The rectangles denote layer operations; otherwise, operations are pointwise.

$$f_t = \sigma(W_f \cdot [h_{t-1} | x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1} | x_t] + b_i)$$
$$o_t = \sigma(W_o \cdot [h_{t-1} | x_t] + b_o)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1} | x_t] + b_C)$$
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
$$h_t = o_t * \tanh(C_t)$$
Figure 2.1 Left: LSTM cell with all its gates (forget f_t, input i_t, and output o_t), the sigmoid (σ) and hyperbolic tangent (tanh) functions, and all their interactions. The inputs (previous cell state C_{t-1}, previous hidden state h_{t-1}, and input sequence x_t), the weights W_{f,i,o,C}, and the biases b_{f,i,o,C} are responsible for the creation of the updated cell state C_t and updated hidden state h_t. The rectangles denote layer operations; otherwise, operations are pointwise. Right: Mathematical definition of the gates and their interactions with the inputs to create the updated cell and updated hidden states.

Nonetheless, it is not clear how to use RNNs when input and output sequences differ in length with complicated and non-monotonic relationships. Sequence-to-Sequence (S2S) models [45] [46] were developed to support this functionality, first applied to the machine translation task. The model consists of two RNNs trained jointly to maximize the conditional probability of the target sequence given a source sequence. The first RNN is the Encoder, which learns how to encode a variable-length sequence into a fixed-length vector representation, and the second RNN is the Decoder, which learns how to decode a given fixed-length vector representation back into a variable-length sequence. When the RNN Encoder and RNN Decoder are trained jointly to minimize the reconstruction error, the new model is called a Sequence-to-Sequence Autoencoder (SA). These types of models make minimal assumptions about the sequence structure; therefore, they are considered unsupervised approaches, which removes/reduces the amount of work needed for manual annotations.
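The LSTM gate equations above can be sketched as a single NumPy step; the toy dimensions, random weights, and dict-based parameter layout are illustrative assumptions, not the trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the gate equations of Figure 2.1.

    W and b are dicts keyed by 'f', 'i', 'o', 'C'; each W[k] maps the
    concatenation [h_{t-1} | x_t] to the hidden dimension.
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])       # forget gate
    i_t = sigmoid(W['i'] @ z + b['i'])       # input gate
    o_t = sigmoid(W['o'] @ z + b['o'])       # output gate
    C_cand = np.tanh(W['C'] @ z + b['C'])    # cell state candidate
    C_t = f_t * C_prev + i_t * C_cand        # updated cell state
    h_t = o_t * np.tanh(C_t)                 # updated hidden state
    return h_t, C_t

# Toy dimensions: 13 MFCC coefficients in, hidden size 4.
rng = np.random.default_rng(0)
M, Hdim = 13, 4
W = {k: rng.standard_normal((Hdim, Hdim + M)) * 0.1 for k in 'fioC'}
b = {k: np.zeros(Hdim) for k in 'fioC'}
h, C = np.zeros(Hdim), np.zeros(Hdim)
for t in range(5):                           # unroll over a short sequence
    h, C = lstm_step(rng.standard_normal(M), h, C, W, b)
```

Because $h_t = o_t * \tanh(C_t)$ with $o_t \in (0, 1)$, every element of the hidden state stays inside $(-1, 1)$, which is what later makes a simple sign threshold a natural binarization.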
2.2 SAMAF: Sequence-to-Sequence Autoencoder Model for Audio Fingerprinting
An overview of our framework is found in Figure 2.2. The raw audio signal $r$ was preprocessed, and the amplitude of the new signal $r_{pp} = (r_{pp_1}, \dots, r_{pp_N}) \in \mathbb{R}^N$ was normalized following (1). The Mel-Frequency Cepstral Coefficient (MFCC) features were created from the normalized signal $r_{norm}$. The Sequence-to-Sequence Autoencoder (SA) model processed the MFCC features and outputted the final representation of the audio fingerprint, a vector of 1's and -1's.

$$r_{norm} = \frac{r_{pp}}{\max(|r_{pp_1}|, \dots, |r_{pp_N}|)} \tag{1}$$
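Equation (1) translates directly into code; the absolute value inside the maximum is assumed here, since the text later states the amplitude is normalized to the range [-1, 1]:

```python
import numpy as np

def normalize_amplitude(r_pp: np.ndarray) -> np.ndarray:
    """Scale a preprocessed signal into [-1, 1], following Equation (1)."""
    peak = np.max(np.abs(r_pp))
    return r_pp / peak if peak > 0 else r_pp

signal = np.array([0.5, -2.0, 1.0, 0.25])
norm = normalize_amplitude(signal)
# norm == [0.25, -1.0, 0.5, 0.125]
```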
Figure 2.2 The general framework starts with a raw audio signal, which is normalized through a preprocessing step. The MFCC features are extracted from the normalized signal and then input to the Sequence-to-Sequence Autoencoder (SA) model, which generates the audio fingerprint (hash representation).

The main component of our general framework is the SA model. In Chung et al. [36], it is stated that the SA model proved to be useful in speech applications because the model created fixed-dimension vector representations of variable-length audio segments. Furthermore, the unsupervised learning nature of the model described the sequential phonetic structure of the audio, giving audio segments that sounded similar a vector representation nearby in the space. Hence, the SA model was chosen for the creation of audio fingerprints. The model was trained with a set of $A$ audio signals $\{X^a\}_{a=1,\dots,|A|}$, where $X^a$ represents the $a$-th audio signal after the raw audio signal has been preprocessed and the features have been extracted. Given $T$ as the number of steps in the RNN context and $I$ as the number of MFCC blocks an audio signal is split into, we have $X^a = (x_1^a, x_2^a, \dots, x_i^a, \dots, x_I^a)$, where $x_i^a = (x_{i,1}^a, x_{i,2}^a, \dots, x_{i,t}^a, x_{i,t+1}^a, \dots, x_{i,T}^a)$ and $x_{i,t}^a \in \mathbb{R}^M$ for all $1 \le i \le I$ and $1 \le t \le T$. The $M$-dimensionality is equal to the number of MFCC coefficients, and the sub-index $t$ also refers to the number of sequential MFCC features contained in an MFCC block; an audio fingerprint is created per MFCC block. A detailed representation of the SA model is illustrated in Figure 2.3, where the inputs to the RNN Encoder $x_i^a$ are located on the bottom left side. The previous state of the LSTM cell $S_{i,t-1}^a = (C_{i,t-1}^a, h_{i,t-1}^a)$ is composed of the
previous cell $C_{i,t-1}^a$ and previous hidden $h_{i,t-1}^a$ states, and interacts with the current inputs $x_{i,t}^a$, creating the updated state of the LSTM cell $S_{i,t}^a = (C_{i,t}^a, h_{i,t}^a)$. When the last state $S_{i,T}^a = (C_{i,T}^a, h_{i,T}^a)$ is reached, the last hidden state $h_{i,T}^a$ contains information from all the RNN Encoder inputs $x_i^a$. As a result, the last hidden state is a compact representation of the RNN Encoder inputs; this is the reason it is considered the feature representation for the audio fingerprint. To create the final representation of the audio fingerprint (the hash $H_{i,T}^a$), a threshold maps the last hidden state elements to the binary values $\{-1, 1\}$. Equation (2) formalizes the thresholding, stating that the last hidden state is mapped from a real space $\mathbb{R}^B$ to a Hamming space $\mathcal{H}^B$ ($B$-bit hash), whose binary values are 1 if $h_{i,T}^a[b] \ge 0$ and $-1$ otherwise.
Figure 2.3 The Sequence-to-Sequence Autoencoder (SA) model is composed of two RNNs, the Encoder and the Decoder. The former receives the inputs x_i^a, which are combined with the LSTM cell states S_i^a. When the last state S_{i,T}^a is reached, it is passed to the latter as the initial previous state Σ_{i,t-1}^a. The output of the RNN Decoder x̂_i^a is a reconstructed representation of the original inputs x_i^a. The audio fingerprint (hash) is embedded into the last state S_{i,T}^a of the RNN Encoder.
$$\mathrm{Threshold}(h_{i,T}^a): \mathbb{R}^B \mapsto \{-1, 1\}^B, \qquad h_{i,T}^a = (h_{i,T}^a[1], \dots, h_{i,T}^a[B]) \in \mathbb{R}^B$$
$$H_{i,T}^a[b] = \begin{cases} 1, & \text{if } h_{i,T}^a[b] \ge 0 \\ -1, & \text{otherwise} \end{cases} \tag{2}$$
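The thresholding of Equation (2) can be sketched in NumPy (the function name is hypothetical; the sign comparison follows the surrounding text):

```python
import numpy as np

def threshold_hash(h_last: np.ndarray) -> np.ndarray:
    """Map the last hidden state from R^B to the Hamming space {-1, 1}^B,
    following Equation (2): 1 where the element is >= 0, -1 otherwise."""
    return np.where(h_last >= 0, 1, -1)

h = np.array([0.7, -0.1, 0.0, -3.2])
fingerprint = threshold_hash(h)
# fingerprint == [1, -1, 1, -1]
```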
The RNN Decoder state structure $\Sigma_{i,t}^a = (\zeta_{i,t}^a, \eta_{i,t}^a)$ is the same as the RNN Encoder state structure $S_{i,t}^a = (C_{i,t}^a, h_{i,t}^a)$; the symbols differ only to highlight the distinction in the information flow. Although the Encoder and Decoder are structurally similar, they differ in that (i) the initial previous state $\Sigma_{i,t-1}^a$ of the Decoder is the last state $S_{i,T}^a$ of the Encoder, (ii) the Decoder input $\hat{x}_{i,t}^a$ is the output of its previous hidden state $O(\eta_{i,t-1}^a)$, commonly known as the feed-previous configuration, with the initial Decoder input $\hat{x}_{i,1}^a$ being the same as the initial Encoder input $x_{i,1}^a$, and (iii) the learning parameters, the weights $W$'s and biases $b$'s, are shared between the Encoder and the Decoder.

The quality of the deep unsupervised audio fingerprint depends on the learned mapping to a Hamming space $f(x_i^a): \mathbb{R}^{M \times T} \mapsto \mathcal{H}^B$. As the LSTM cell state size increases, the dimensionality of the Hamming space increases, and more information from the original audio signal can be retained by the model, improving the mapping so that similar audio fingerprints are nearby. As a result, the cell state size is a user-defined parameter that needs to be tuned empirically, and the function that maps the inputs into the Hamming space is shaped by the loss function of the SA model.

The loss function was designed to jointly train (with shared weights and biases) the RNN Encoder and RNN Decoder to generate a reconstruction sequence $\hat{X}^a$ as close as possible to the input sequence $X^a$. This is done by minimizing the Mean Square Error (MSE) between the reconstruction and the input; Equation (3) formalizes this idea. However, this term alone did not guarantee that similar audio fingerprints were nearby in the space. To further improve the spatial relationship, the Hash Loss term was added. In Equation (4), the set of all the last hidden states is represented by $h_T$, and the distance matrix is calculated over this set following the notation that the last hidden states $h_{i,T}$ and $h_{j,T}$ are not the same, therefore $i \neq j$. In order to minimize the distance over similar last hidden states, a second set, the similarity labels set $\sigma_T$, is created, describing how strongly they are related to each other. The formal definition can be found in Equation (5), where the cosine similarity metric $\cos\_sim(h_{i,T}, h_{j,T}) = \frac{h_{i,T} \cdot h_{j,T}}{\|h_{i,T}\| \, \|h_{j,T}\|}$ was used to bound the range of the similarity labels to $[0, 1]$, where zero meant that the relationship did not contribute, and one meant that the relationship fully contributed to the distance minimization problem. The Hash Loss was normalized by the total number of similarity labels different from zero.
$$\mathrm{MSELoss}(X) = \frac{1}{A} \sum_{a=1}^{A} \left\| \hat{X}^a - X^a \right\|^2 \tag{3}$$

$$\mathrm{HashLoss}(h_T) = \frac{\displaystyle\sum_{h_{i,T}, h_{j,T} \in h_T,\; i \neq j} \sigma_{ij,T} \left\| h_{i,T} - h_{j,T} \right\|^2}{\left| \{ \sigma_{ij,T} \in \sigma_T : \sigma_{ij,T} \neq 0 \} \right|} \tag{4}$$

$$\sigma_{ij,T} = \begin{cases} 1, & \text{if } \cos\_sim(h_{i,T}, h_{j,T}) \ge 0.55 \\ \frac{1}{0.55} \cos\_sim(h_{i,T}, h_{j,T}), & \text{if } 0 < \cos\_sim(h_{i,T}, h_{j,T}) < 0.55 \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
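A direct, unvectorized sketch of the Hash Loss of Equations (4)-(5); the helper names are hypothetical, and a real implementation would vectorize the pairwise terms:

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_label(a, b, tau=0.55):
    """Equation (5): 1 above tau, linearly scaled in (0, tau), else 0."""
    s = cos_sim(a, b)
    if s >= tau:
        return 1.0
    if s > 0:
        return s / tau
    return 0.0

def hash_loss(hidden_states, tau=0.55):
    """Equation (4): similarity-weighted squared distances between all pairs
    of last hidden states, normalized by the number of non-zero labels."""
    total, count = 0.0, 0
    n = len(hidden_states)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = similarity_label(hidden_states[i], hidden_states[j], tau)
            if w > 0:
                total += w * np.sum((hidden_states[i] - hidden_states[j]) ** 2)
                count += 1
    return total / count if count else 0.0

# Two nearby states and one dissimilar state (toy data).
h_T = [np.array([1.0, 1.0]), np.array([0.9, 1.1]), np.array([-1.0, 1.0])]
loss = hash_loss(h_T)
```

Only pairs with a non-zero similarity label contribute, so dissimilar hidden states are free to stay apart while similar ones are pulled together.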
To fine-tune the structure of each audio fingerprint w.r.t. its similar audio fingerprints, a Bitwise Entropy Loss was applied. First, the set of all hashes (audio fingerprints) was defined as $\mathbb{H}$. Then, $H_i \subseteq \mathbb{H}$ contained all the hashes $H_j$ that were similar to $H_i$,
including itself. Formally, it was stated that $H_i = \{ H_j \mid \cos\_sim(H_i, H_j) \ge 0.55 \}$, where $i, j = 1, \dots, N$ and $N$ is the total number of hashes in $\mathbb{H}$. Equation (6) refers to this idea: the Bitwise Entropy Loss is calculated over all the subsets $H_i$ whose cardinality is greater than or equal to two, and the term is normalized by the total number of such subsets. Equation (7) defines the binary entropy calculated per bit $b$, where each bit has two states $\nu \in \{-1, 1\}$. In information theory, the binary entropy requires the calculation of the empirical probability $P(H_i^b = \nu)$ that the $b$-th bit state is $\nu$. The entropy term is normalized by the total number of elements in each subset $|H_i|$ and by the number of bits $B$.
$$\mathrm{BitwiseEntropyLoss}(\mathbb{H}) = \frac{1}{\left| \{ i : |H_i| \ge 2 \} \right|} \sum_{i : H_i \subseteq \mathbb{H} \;\text{s.t.}\; |H_i| \ge 2} \mathrm{BinaryEntropy}(H_i) \tag{6}$$

$$\mathrm{BinaryEntropy}(H_i) = -\frac{1}{B} \cdot \frac{1}{|H_i|} \sum_{b=1}^{B} \sum_{\nu \in \{-1, 1\}} P(H_i^b = \nu) \log_2 P(H_i^b = \nu) \tag{7}$$
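Equations (6)-(7) can be sketched as follows; hash clusters are represented as rows of {-1, 1} matrices, and the conventional minus sign of the entropy is written explicitly:

```python
import numpy as np

def binary_entropy(H_i: np.ndarray) -> float:
    """Equation (7): per-bit entropy over a cluster of similar hashes H_i
    (rows are hashes with values in {-1, 1}), normalized by the number of
    bits B and the cluster size |H_i|."""
    n, B = H_i.shape
    total = 0.0
    for b in range(B):
        for v in (-1, 1):
            p = np.mean(H_i[:, b] == v)   # empirical probability P(H_i^b = v)
            if p > 0:
                total -= p * np.log2(p)
    return total / (B * n)

def bitwise_entropy_loss(clusters) -> float:
    """Equation (6): average binary entropy over clusters of size >= 2."""
    valid = [c for c in clusters if c.shape[0] >= 2]
    if not valid:
        return 0.0
    return sum(binary_entropy(c) for c in valid) / len(valid)

uniform = np.array([[1, -1], [1, -1]])    # identical hashes -> zero entropy
mixed = np.array([[1, -1], [-1, -1]])     # first bit disagrees -> positive entropy
```

A cluster whose members already agree bit-for-bit contributes nothing, so the gradient pressure falls entirely on the bits that still disagree inside each cluster.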
The final representation of the loss function is found in (8), where the three terms (i) $\mathrm{MSELoss}(X)$, (ii) $\mathrm{HashLoss}(h_T)$, and (iii) $\mathrm{BitwiseEntropyLoss}(\mathbb{H})$ are weighted by the constants $\alpha$, $\beta$, and $\gamma$, respectively. The weight values were empirically set to unity.

$$\mathrm{Loss} = \alpha \cdot \mathrm{MSELoss}(X) + \beta \cdot \mathrm{HashLoss}(h_T) + \gamma \cdot \mathrm{BitwiseEntropyLoss}(\mathbb{H}) \tag{8}$$

A visual explanation of the loss terms is presented in Figure 2.4: the MSE Loss minimizes the reconstruction error, modifying the distribution of the audio fingerprints all around the space; the Hash Loss minimizes the distance between similar hashes, bringing them together; and the Bitwise Entropy Loss minimizes the variation within the clusters of similar hashes, tuning them to become more alike.

Figure 2.4 Graphical explanation of the loss terms, where the MSE Loss minimizes the reconstruction error, modifying the distribution of the audio fingerprints all around the space; the Hash Loss minimizes the distance between similar hashes, bringing them together; and the Bitwise Entropy Loss minimizes the variation within the clusters of similar hashes, tuning them to become more alike.
As a summary, the general framework of SAMAF is presented in Algorithm 1, formalizing the steps to transform a raw audio signal r into an audio fingerprint H.

Algorithm 1: SAMAF – Sequence-to-Sequence Autoencoder Model for Audio Fingerprinting
Data: A set R containing A raw audio signals
Input: Raw Audio Signal R^a
Output: A set of Audio Fingerprints H^a created from Raw Audio Signal R^a
1 repeat
2     Raw Audio Signal R^a is preprocessed into R_pp^a
3     Mel-Frequency Cepstral Coefficients (MFCC) X^a are extracted from R_pp^a
4     The MFCC features are split into blocks x_i^a, where i = 1, ..., I; each block contains T sequential MFCC features
5     repeat
6         Each MFCC block x_i^a is processed by the Sequence-to-Sequence Autoencoder (SA) Model, creating an audio fingerprint H_i^a per block
7     until i = I
8     A set of Hashes H^a is created from the MFCC features
9 until a = A
2.3 Hamming Distance Matching

Identification of a query audio sample against a collection of audio samples was carried out by comparing the number of differing bits (Hamming distance) between the query hash H_q and each hash H_{c_i} in the collection. Labels were assigned to the query audio sample based on the comparison with the minimum Hamming distance. In case several hashes within the collection had the same minimum Hamming distance to the query audio sample, a voting strategy was used. The hash H_q was created following the same framework discussed in subsection 2.2. As part of our research, the length of a query audio sample was varied from 1-6 seconds. Since hashes were generated on one-second audio windows, fingerprints for samples longer than one second were constructed by concatenating sequential hashes. With a 0.25-second window step, a one-second audio sample resulted in one hash, two seconds resulted in five concatenated sequential hashes, 3 seconds in 9, 4 seconds in 13, 5 seconds in 17, and 6 seconds in 21.
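The Hamming distance matching with tie-breaking by vote can be sketched as follows; the labels and hashes are toy data, and a real collection would hold one entry per MFCC block:

```python
import numpy as np
from collections import Counter

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing positions between two {-1, 1} hashes."""
    return int(np.sum(a != b))

def match(query: np.ndarray, collection) -> str:
    """Return the label whose hash minimizes the Hamming distance to the
    query; ties are resolved by majority vote among the tied labels."""
    dists = [(hamming(query, h), label) for h, label in collection]
    best = min(d for d, _ in dists)
    tied = [label for d, label in dists if d == best]
    return Counter(tied).most_common(1)[0][0]

collection = [
    (np.array([1, 1, -1, -1]), "song_A"),
    (np.array([1, -1, -1, -1]), "song_B"),
    (np.array([-1, -1, 1, 1]), "song_C"),
]
label = match(np.array([1, 1, -1, 1]), collection)
# distances: 1 to song_A, 2 to song_B, 3 to song_C -> "song_A"
```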
Chapter 3. Experimental Results

This section covers the experimental settings and results. The experiments herein show the performance of the SA model with different parameters: the number of loss terms (i.e., Mean Square Error (MSE) Loss, Hash Loss, and Bitwise Entropy Loss), the number of layers (either 1 or 2), and the cell state size (either 128 or 256). Furthermore, to understand the robustness of each model against audio sample length (granularity), the length was varied from 1-6 seconds in one-second increments. Results were benchmarked against three baselines: 1) Dejavu [3], an open-source Shazam-like implementation, 2) RAFS [4], a Bit Error Rate (BER) methodology robust to time-frequency distortions, and 3) Panako [5], a constellation-based algorithm adding time-frequency distortion resilience. Additionally, randomly initialized SA models were evaluated to understand the intrinsic capacity of the SA model for identifying audio files and its ability to learn a mapping that boosts the identification (accuracy) performance.

3.1 VoxCeleb1 Dataset

In this work, the performance of the SA model was tested on speech in the wild with the VoxCeleb1 dataset [2]. Unlike most existing datasets that contain utterances under constrained conditions, the VoxCeleb dataset is a large-scale audio-visual dataset consisting of short clips of human speech extracted from interview videos uploaded to YouTube™. Characteristically, the utterances contain background noise, laughter, bouts of silence, and a wide range of accents and speech. It was expected that the diversity, the unstructured nature, and the large size of the dataset would provide a challenge and reduce the potential for experimental bias, contrary to more structured audio domains. The VoxCeleb1 dataset has a total of 153,516 audio files in WAV format with a 16 KHz sample rate, extracted from 22,168 videos. On average, 6.93 audio files were extracted per video, with a range from 1 to 239.
The dataset contains speech from 1,251 different celebrities, 561 female and 690 male, spanning 36 nationalities. The total number of audio hours is 351.60, with an average segment length of 8.25 seconds, ranging from 3.26 to 144.92 seconds. For this research, the VoxCeleb1 dataset had to be expanded by adding 21 transformations¹ (TruncateSilence, Noise0.05, Echo, Reverb, HighPassFilter, LowPassFilter, Reverse, Pitch0.5, Pitch0.9, Pitch1.1, Pitch1.5, Speed0.5, Speed0.9, Speed0.95, Speed1.05, Speed1.1, Speed1.5, Tempo0.5, Tempo0.9, Tempo1.1, Tempo1.5); the transformations' parameters can be found in Appendix C. Speed,
¹ Sound eXchange (SoX) was used to generate the transformations and to preprocess the audio signals [52].
Tempo and Pitch transformations were selected based on previous research [18]. Various studies have also evaluated their audio fingerprinting systems through performance against transformations [5] [15] [19]. The expansion of the VoxCeleb1 dataset increased the amount of data significantly; hence, a subset of the originals containing audio samples between 10 and 11 seconds long was selected. The new VoxCeleb1 subset, including transformations, was composed of 141,350 audio files: 6,425 original files and 134,925 transformation files. The audio files were taken from 4,563 videos, with an average of 1.41 audio samples per video and a range of 1 to 14. The dataset contains speech from 1,195 different celebrities, 537 female and 658 male, spanning 36 nationalities. The total number of audio hours is 437.73, with 18.66 hours of original audio and 419.07 hours of transformation audio. The transformation samples have an average length of 11.18 seconds.

3.2 Technical Details

As shown in Figure 2.2, the MFCC acoustic features were extracted from preprocessed audio signals. These features become the input to our SA model, where the audio fingerprints are created.

Preprocessing. The audio signals were preprocessed¹ and converted to 1-channel (mono) with a sample rate of 16 kHz. Next, their amplitude was normalized to the range [-1, 1] using Equation (1).

Feature Extraction. A Python library² was used to extract 13 Mel-Frequency Cepstral Coefficients (MFCCs) using a window length of 25 ms and a window step of 10 ms; the other parameters were set to their default values.

Model. The Sequence-to-Sequence Autoencoder (SA) Model³ is divided into an RNN Encoder and an RNN Decoder. The input vectors of the RNN Encoder are the previous hidden and cell states and the MFCC features; the dimensions of these vectors are either 128 or 256 (initialized to zero) and 13, respectively.
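A minimal sketch of the signal front end just described; since Equations (1) and (2) are referenced but not reproduced in this chapter, the min-max scaling and the sign-style threshold below are assumptions, and the MFCC call in the comment uses the python_speech_features parameters stated above:

```python
def normalize_amplitude(signal):
    """Scale a mono signal into the range [-1, 1]
    (assumed form of Equation (1))."""
    lo, hi = min(signal), max(signal)
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in signal]

def binarize(features, threshold=0.0):
    """Map a real-valued feature vector to a {-1, 1} hash
    (assumed form of the threshold in Equation (2))."""
    return [1 if f >= threshold else -1 for f in features]

# The 13 MFCCs would then come from, e.g.:
#   from python_speech_features import mfcc
#   feats = mfcc(signal, samplerate=16000,
#                winlen=0.025, winstep=0.01, numcep=13)
```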
With the input vectors, the updated hidden and cell states are calculated considering the weights (randomly initialized, uniformly distributed, and shared between encoder and decoder); this procedure is repeated 100 times (one second of audio is processed every 100 steps, or T-steps), creating the feature representation of the audio fingerprint in the last step. Later on, a threshold (Equation (2)) applied to the feature representation outputs the hash with its binary values {-1, 1}. In order to add depth to the model, the LSTM cells can be stacked, using the output of the LSTM cell in the previous layer (ℓ-1) as the input to the LSTM cell in the current layer ℓ. In this research, one
² python_speech_features was used for the creation of the MFCC features [53].
³ TensorFlow® was used to implement the SA Model [54].
and two layers were used. To accelerate the computation and to calculate other parameters (e.g., gradient and loss) with statistical variation, a batch size of 32 samples was used. In the first step of the RNN Decoder, the input vectors are the last hidden and cell states of the RNN Encoder and the MFCC features of the first step of the RNN Encoder. In the subsequent steps, the previous hidden and cell states of the RNN Decoder, as well as the previous output (feed-previous configuration), are used for the computation of the current step in the RNN Decoder (Figure 2.3). The SA model was trained with an Adagrad optimizer using a learning rate of 0.01 without learning rate decay. The number of epochs was set to 1,000; however, an early stopping technique, based on the evolution of the learning and performance curves, was applied manually. The loss function (Equation (8)) was tuned with the parameters α, β, and γ set to unity; these values were chosen empirically. The evaluation of the model during training was based on the results for one-second audio lengths.

Reference Database. MySQL was used for storing the audio fingerprints. The table contained the metadata (labels) of the audio file, the MFCC block ID, and the audio fingerprint (hash). This information was stored with the data types char, smallint(5) unsigned, and blob, respectively. During testing, a 176,717-entry database with a 128-bit hash size was 2.70 MB on disk, and with a 256-bit hash size it was 5.39 MB on disk.

Baselines. The baselines, Dejavu, RAFS, and Panako, were run using open-source implementations. Dejavu, a Shazam-like algorithm, was implemented by Will Drevo⁴, and RAFS and Panako were implemented by Joren and Leman⁵.

3.3 Experiments

The experiments were developed to gain insight into the SA model's audio identification performance under different parameters.
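As a quick sanity check, the reference-database sizes quoted in the Technical Details above follow directly from the entry count times the raw hash payload (metadata columns excluded; the function name is ours):

```python
def db_size_mb(entries, hash_bits):
    """Raw hash payload of a reference database in MiB."""
    return entries * hash_bits / 8 / (1024 * 1024)

# 176,717 entries: about 2.70 MB at 128 bits and 5.39 MB at 256 bits
```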
The different loss function terms (MSE, Hash, and Bitwise Entropy), the cell state size S (either 128 or 256), the audio sample length (1-6 seconds), the number of model layers (1-2), and the model selection method during training (smallest loss vs. largest accuracy) can all have a significant impact on the model's performance. Hence, the core of the experiments quantified the relationship between these variables, the structure of the hash space, and the performance on the audio identification task. Furthermore, to benchmark the SA
⁴ https://github.com/worldveil/dejavu
⁵ https://github.com/JorenSix/Panako
model, the performance of these models was compared against the performance of three baselines, Dejavu, RAFS, and Panako, using the VoxCeleb1 dataset.

The hash space was analyzed to understand the relationship between the different loss terms and audio identification performance. Our intuition was that a structured hash space should boost performance on the audio identification task. Furthermore, the analysis of the hash space gave us insight into the learning process, given the unsupervised nature of the SA model. The audio identification task was focused on recognizing variations of the original audio signal. In the case of the VoxCeleb1 subset, a transformation is considered successfully recognized when its metadata (speaker, video ID, filename) matches the metadata of its corresponding original audio signal.

The characteristics of the model, such as the number of layers and state size, could have a significant effect on the ability of the model to construct abstract features or express complex relationships between features. Since both of these are crucial to the development of an effective state (and therefore hash) that encodes the audio signal, they may positively or negatively impact the audio identification task. To determine this impact, the experiments were evaluated over different numbers of layers (1-2) and state sizes (128 or 256). The granularity of the system was tested over a range of one to six seconds, with hashes created over a window of one second and a step of 0.25 seconds. With a granularity of one second, the fingerprint was constructed with one hash; with two seconds of audio, with five concatenated sequential hashes; with three seconds, with nine concatenated sequential hashes; and so on. This means that, for an audio sample 10.015 seconds long and a granularity of one second, 37 fingerprints would be created, and a query against the database would be performed for each hash.
For a similarly sized audio sample and a granularity of two seconds, 33 fingerprints would be created and queried; for a granularity of three seconds, 29 fingerprints would be created and queried. Audio identification results are based on the exhaustive creation of hashes along the entire audio signal rather than on a finite number of random queries as done in other studies. With this testing methodology, results are reproducible and statistically more significant because they represent the complete audio signal instead of random parts of it.

The model was selected based on two different strategies: smallest loss or largest accuracy. The smallest-loss strategy selected the model with a better-constructed hash space because it minimized the reconstruction error (MSE Loss), the distance between similar hashes (Hash Loss), and the diversity within a cluster of similar hashes (Bitwise Entropy Loss). The largest-accuracy strategy observed training evaluation accuracies and selected the model with the best audio identification performance. Even though the hypothesis was that a structured hash space would deliver better identification results, the unsupervised nature of the SA model meant that it did not necessarily learn features that improved performance on the audio identification task, instead favoring features that better recognize other attributes of the audio signal (e.g., the speech signal of the speaker).
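Counting the Random baseline introduced later as a third selection strategy, the parameter grid described above enumerates the 36 evaluated configurations:

```python
from itertools import product

layers      = [1, 2]
state_sizes = [128, 256]
loss_terms  = ["M", "MH", "MHE"]
strategies  = ["Rnd", "Acc", "Loss"]  # random init, largest accuracy, smallest loss

# Every (layers, state size, loss, strategy) combination evaluated in this chapter
grid = list(product(layers, state_sizes, loss_terms, strategies))
```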
The samples in the dataset were split into three sets: Training (60%), Validation (20%), and Testing (20%). The subset of the VoxCeleb1 dataset (originals 10-11 seconds long) contains 6,425 audio files labeled as originals from 1,195 people. Twenty-one transformations were applied to the originals, resulting in 134,925 new audio files labeled as transformations. In sum, the training set contained 2,799 original files and 58,779 transformation files, and the validation and test sets contained 1,812 and 1,814 original files and 38,052 and 38,094 transformation files, respectively. In terms of hours per respective set, there were 190.74 hours of training audio, including originals and transformations, and 123.43 and 123.56 hours of validation and test audio, respectively, for a total of 437.73 hours. Furthermore, in terms of number of hashes, there were 2,529,518 hashes in the training set and 1,636,790 and 1,638,401 hashes in the validation and test sets, respectively, for a total of 5,804,709. While all Training set hashes were used during the learning process, just 2% of the Training and Validation set hashes were used for model evaluation to speed up the learning process. The training time was about 21 days per model.

During task performance evaluations, all original hashes (training, validation, and testing sets) were stored using MySQL, forming the "Reference Database". Each query hash H_q was compared against all hashes in the Reference Database and matched with the one with the minimum Hamming Distance. The generalization performance of machine learning algorithms needs to be evaluated against unseen data. Therefore, the models were evaluated with a Known set, composed of hashes from the training set, which were seen by the model during training, and an Unknown set, composed of hashes from the testing set, which were never seen by the model during training.
Since 36 different models were evaluated for each granularity from 1-6 seconds, the time to evaluate accuracy performance was reduced by constructing the Known and Unknown sets from only 1% of the training and 1% of the testing hashes, respectively. Table 3.1 displays the number of reference and query hashes with respect to the audio length.
Table 3.1 Quantity of reference hashes stored in the Reference Database and quantity of query hashes with respect to the audio length.
Length       | 1 sec   | 2 secs  | 3 secs  | 4 secs  | 5 secs  | 6 secs
#References  | 176,717 | 158,265 | 139,813 | 121,361 | 102,909 | 84,457
#Queries     | 41,561  | 37,513  | 33,465  | 29,417  | 25,369  | 21,321
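The minimum-Hamming-Distance lookup with its tie-breaking vote, as described above, can be sketched as follows (a minimal in-memory stand-in for the MySQL Reference Database; the names are ours, not from the dissertation's implementation):

```python
from collections import Counter

def hamming(a, b):
    """Hamming distance between two equal-length {-1, 1} hashes."""
    return sum(x != y for x, y in zip(a, b))

def identify(query, reference_db):
    """Label the query with the majority vote among the reference
    hashes tied at the minimum Hamming distance."""
    best = min(hamming(query, h) for _, h in reference_db)
    votes = Counter(label for label, h in reference_db
                    if hamming(query, h) == best)
    return votes.most_common(1)[0][0]
```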
Three baselines were used to benchmark our methodology. Dejavu is a Shazam-like approach that uses the constellation algorithm. It requires 2-channel (stereo) audio with a sample rate of 44.1 kHz in MP3 format. The number of hashes depends on the length of the audio signal and the number of peaks found in the frequency spectrum. RAFS downmixed audio files to 1 channel with a sample rate of 16 kHz, a window size of 2,048, and a step size of 64. The BER was kept at 0.35, with a threshold of 2,867 bits of error. The fingerprint size was found to be 8,000 bits per second. Panako, a variation of the constellation algorithm adding time-frequency distortion
resilience, preprocessed the audio files to 1 channel with a sample rate of 16 kHz, a window size of 256, and a step size of 64. The information between peaks was augmented with time stretching and pitch shifting, making it more robust to these types of transformations. The fingerprint size was found to be 32,768 bits per second. The RAFS and Panako default parameters did not work for small audio excerpts (1-6 seconds); hence, the parameters mentioned above had to be determined empirically. All other parameters were kept at the implementation defaults. All three techniques, Dejavu, RAFS, and Panako, encode the relationship between peaks in the hash. Our approach differs drastically in that the audio signal information is encoded directly in the hash. In addition to the three baselines, SA models with randomly initialized weights and biases were evaluated to understand the intrinsic capacity of the model and give insight into the effectiveness of the learning process (training). In our results, "Random" refers to these instances of the SA model.

3.4 Results

The outcome of the experiments describes the nature of the model, providing an objective comparison of how different parameters and strategies impact performance. Furthermore, a nomenclature will be used to refer to the models; for instance, in 1LST256MHE, the first characters specify the number of layers (1L/2L), the next set of characters specifies the size of the hash (ST128/ST256), and the remaining characters specify the additive construction of the loss function, where M stands for Mean Square Error Loss, H for Hash Loss, and E for Bitwise Entropy Loss. The analysis of the hash space with respect to audio identification performance is given in Figure 3.1. It was expected that a structured hash space would yield a lower average Hamming Distance relative to an unorganized hash space.
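As an aside, the nomenclature just introduced can be parsed mechanically (a small sketch; the regular expression simply mirrors the naming pattern described above):

```python
import re

def parse_model_name(name):
    """Split e.g. '1LST256MHE' into (layers, hash size, loss terms)."""
    m = re.fullmatch(r"([12])LST(128|256)(MHE|MH|M)", name)
    if m is None:
        raise ValueError(f"not a model name: {name!r}")
    return int(m.group(1)), int(m.group(2)), m.group(3)
```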
Per our observations, the SA model was the right choice for the generation of audio fingerprints because the Random strategy (left column) achieved good results even with an unorganized hash space. Although we hypothesized that a more structured hash space would result in better performance, in reality the audio identification performance correlates more with the model parameters than with the structure of the hash space. As seen in the first and second rows, the models with the smallest average Hamming Distance are not the ones with the highest performance, although the opposite can be seen in the third and final rows. Due to the unsupervised nature of the SA model, the framework parameters needed to be tuned to jointly build a structured hash space and achieve high audio identification performance. The model which accomplished both was 2LST128M-Accuracy. Table 3.2 complements the hash space analysis by reporting the average Hamming Distance of the proposed model under all the different parameters, such as the number of layers, hash size, strategy, loss function term, and audio length. An analysis of the best SA models is given over the different attributes of the audio signal. The performance of the SA models can be explained by the loss function terms and their evolution during the training process. As shown in Figure 3.2, the simplest (top) and most complex (bottom) models with the loss terms M/MH/MHE
were able to minimize the reconstruction error (MSE Loss). As expected, the clustering term (Hash Loss) was minimized when using the loss terms MH/MHE, but the M loss term followed a similar pattern only in the most complex model (2LST256). This behavior implies that the Mean Square Error term performed some measure of clustering intrinsically, without a term enforcing that behavior. Finally, the models which minimized variation inside the clusters (Bitwise Entropy Loss) reached, in general, high identification task accuracies.

Table 3.3 details the performance of the Top-5 models in the subset of the VoxCeleb1 dataset per attribute. In general, the model 2LST256M with an audio length of 4, 5, and 6 seconds, selected for highest accuracy, achieved good classification performance over the audio identification, transformation, speaker, and video attributes. However, a model with a smaller hash size, 2LST128M, also selected for highest accuracy, achieved similar performance, suggesting that this type of model can be used instead of the most accurate model if the technical implementation requires a smaller hash size. Additionally, 2-layer models with a hash size of 128/256 bits were always in the top-5, suggesting that more layers enhanced the encoding of the audio signal information.

Figure 3.1 The average Hamming distance (X-axis) vs. audio identification accuracy (Y-axis), with panels (a) Random, (b) Accuracy, and (c) Loss, shows that the models with a structured hash space (small average Hamming distance) are not necessarily the ones with higher performance, and vice versa; it depends on the parameters of the model. Random, Accuracy, and Loss indicate the method by which the models were selected: Random refers to instances of the SA model with randomly initialized weights, Accuracy refers to selection by maximal audio identification performance, and Loss refers to selection by minimal loss. [Better visualized in color]

Table 3.2 Average Hamming Distance in the VoxCeleb1 dataset with the different model parameters such as the number of layers, hash size, strategy, loss function term, and audio length.
1-Layer models:

Bits | Strategy | Loss | 1 sec | 2 secs | 3 secs | 4 secs | 5 secs | 6 secs
128  | Rnd      | M    |     5 |     70 |    148 |    232 |    318 |    406
128  | Rnd      | MH   |     5 |     68 |    143 |    223 |    305 |    389
128  | Rnd      | MHE  |     4 |     62 |    131 |    204 |    279 |    355
128  | Acc      | M    |     3 |     32 |     67 |    104 |    142 |    183
128  | Acc      | MH   |     0 |     11 |     25 |     39 |     55 |     72
128  | Acc      | MHE  |     2 |     24 |     51 |     79 |    109 |    139
128  | Loss     | M    |     3 |     33 |     68 |    105 |    144 |    185
128  | Loss     | MH   |     0 |      9 |     21 |     33 |     47 |     61
128  | Loss     | MHE  |     2 |     26 |     53 |     81 |    111 |    142
256  | Rnd      | M    |    13 |    139 |    290 |    449 |    613 |    780
256  | Rnd      | MH   |    12 |    134 |    279 |    434 |    593 |    755
256  | Rnd      | MHE  |    11 |    127 |    265 |    413 |    565 |    720
256  | Acc      | M    |     7 |     54 |    106 |    162 |    220 |    279
256  | Acc      | MH   |     7 |     56 |    110 |    168 |    227 |    288
256  | Acc      | MHE  |     3 |     29 |     60 |     92 |    126 |    161
256  | Loss     | M    |     6 |     53 |    105 |    160 |    215 |    272
256  | Loss     | MH   |     7 |     57 |    113 |    172 |    232 |    294
256  | Loss     | MHE  |     2 |     29 |     59 |     91 |    124 |    157

2-Layer models:

Bits | Strategy | Loss | 1 sec | 2 secs | 3 secs | 4 secs | 5 secs | 6 secs
128  | Rnd      | M    |     6 |     72 |    151 |    235 |    322 |    411
128  | Rnd      | MH   |     5 |     65 |    136 |    211 |    290 |    371
128  | Rnd      | MHE  |     7 |     78 |    161 |    248 |    338 |    431
128  | Acc      | M    |     8 |     64 |    126 |    192 |    260 |    330
128  | Acc      | MH   |     9 |     71 |    140 |    213 |    286 |    360
128  | Acc      | MHE  |    10 |     76 |    150 |    227 |    305 |    383
128  | Loss     | M    |     9 |     72 |    142 |    214 |    289 |    364
128  | Loss     | MH   |    10 |     81 |    158 |    240 |    322 |    405
128  | Loss     | MHE  |    13 |     94 |    184 |    275 |    368 |    460
256  | Rnd      | M    |    14 |    141 |    288 |    443 |    604 |    769
256  | Rnd      | MH   |    16 |    154 |    314 |    484 |    660 |    842
256  | Rnd      | MHE  |    15 |    148 |    301 |    463 |    631 |    803
256  | Acc      | M    |    25 |    161 |    309 |    464 |    622 |    784
256  | Acc      | MH   |    35 |    228 |    437 |    653 |    871 |   1091
256  | Acc      | MHE  |    36 |    230 |    441 |    658 |    876 |   1097
256  | Loss     | M    |    28 |    178 |    341 |    509 |    680 |    855
256  | Loss     | MH   |    38 |    244 |    466 |    696 |    928 |   1162
256  | Loss     | MHE  |    36 |    232 |    444 |    662 |    883 |   1105
Table 3.3 Top-5 models in the VoxCeleb1 subset for audio identification and other
classification tasks based on the audio signal attributes.

Audio Identification:
Layer | Size [bits] | Length [secs] | Focus | Loss | Acc [%]
2     | 256         | 5             | Acc   | M    | 59.3
2     | 256         | 4             | Acc   | M    | 58.7
2     | 128         | 5             | Acc   | M    | 58.6
2     | 256         | 5             | Loss  | M    | 58.3
2     | 128         | 6             | Acc   | M    | 58.2

Transformation:
Layer | Size [bits] | Length [secs] | Focus | Loss | Acc [%]
2     | 256         | 6             | Acc   | M    | 66.5
2     | 256         | 5             | Acc   | M    | 66.2
2     | 128         | 6             | Acc   | M    | 66.1
2     | 256         | 6             | Loss  | M    | 65.5
2     | 256         | 5             | Loss  | M    | 65.3

Speaker:
Layer | Size [bits] | Length [secs] | Focus | Loss | Acc [%]
2     | 256         | 5             | Acc   | M    | 59.9
2     | 256         | 4             | Acc   | M    | 59.5
2     | 128         | 5             | Acc   | M    | 59.2
2     | 128         | 6             | Acc   | M    | 58.8
2     | 256         | 5             | Loss  | M    | 58.8

Video:
Layer | Size [bits] | Length [secs] | Focus | Loss | Acc [%]
2     | 256         | 5             | Acc   | M    | 59.5
2     | 256         | 4             | Acc   | M    | 59.0
2     | 128         | 5             | Acc   | M    | 58.9
2     | 128         | 6             | Acc   | M    | 58.5
2     | 256         | 5             | Loss  | M    | 58.5

Abbreviations: M – Mean Square Error Loss || H – Hash Loss || E – Bitwise Entropy Loss
The SA models are benchmarked against other methodologies to highlight the relative performance improvement. The VoxCeleb1 subset benchmark is shown in Figure 3.3, where the best model for audio identification, 2LST256M-Acc, was
used for comparison purposes. It is worth noting that our SAMAF model outperformed all the other models, regardless of the audio length and the attributes of the audio signal. Its performance ranged from 58.19% to 66.56%, a difference of at least 15% with respect to the runner-up, Dejavu. Even though Panako (2014) is a more recent model robust to time-frequency distortions, its performance was similar to RAFS (2002) for 4, 5, and 6 seconds but poor for 1, 2, and 3 seconds. In the original publications, Panako was tested with an audio length of 20 seconds and RAFS with an audio length of 3 seconds, which could explain the experimental results.
Figure 3.2 Learning curves of the loss function terms. The top row refers to the simplest model (1LST128), while the bottom row refers to the most complex model (2LST256). [Better visualized in color]
Figure 3.3 VoxCeleb1 subset dataset benchmark using the best model, 2LST256M-Acc, to report classification performance: (a) Audio Identification Accuracy [%], (b) Transformations Accuracy [%], (c) Speaker Accuracy [%], (d) Video Accuracy [%].

The evaluation of an audio fingerprinting system is not complete without a performance breakdown over transformations. Table 3.4 reports these results alongside the three baselines. From the data, it is clear that, on average, SAMAF outperforms the baselines. Even when comparing results for queries without any transformation (Original vs. Original), the three baselines are unable to identify 100% of the audio samples. Extreme transformations such as Pitch0.5 and Speed0.5 cannot be handled by any method. Surprisingly, Dejavu has a higher resilience towards positive transformations such as Speed1.5 and Tempo1.5, where the other methods failed. Dejavu also performs significantly better than all other methods on Tempo0.5, and it is the only method capable of recognizing an audio file played in reverse. Among the strong points of SAMAF, it outperformed the baselines on all Pitch and Speed distortions except Speed0.5, Speed1.5, and Pitch0.5. RAFS and Panako were unable to identify queries with 1-second samples. Their performance improved with larger excerpts, but only for simple transformations such as Original, TruncateSilence, and the Low/High-Pass Filters. Neither RAFS nor Panako could handle the Echo, Pitch, Speed, Tempo, and Noise transformations, although Panako showed slight improvements as the audio length increased.

A visual description of the performance benchmark for time-frequency distortions is shown in Figure 3.4, where Pitch, Speed, and Tempo are dissected. The RAFS and Panako one-second curves are shown as a flat line over the horizontal axis because their accuracy is 0.00%. For Pitch, SAMAF outperformed the baselines; surprisingly, Pitch1.5 reached almost 90% accuracy while Pitch0.5 stayed at 0.0%. Dejavu, RAFS, and Panako only scored on the Original classification.
For Speed, a similar pattern was found: SAMAF performed better than the baselines and could handle positive distortions (Speed1.1, Pitch1.5, etc.) better than negative distortions (Tempo0.5, Pitch0.9, etc.). Dejavu, RAFS, and Panako all scored on the Original, with success dependent on the length of the audio. However, Dejavu outperformed SAMAF on Tempo changes. Both Dejavu and SAMAF can handle small positive/negative distortions, and Dejavu is the only method able to handle an extreme transformation such as Tempo1.5, with an accuracy of 60.0%. Even for the
counterpart transformation, Tempo0.5, Dejavu reached about 30.0% accuracy, making it the only technique resilient to variations in tempo. Although RAFS and Panako performed well only on the Original classification, Panako showed a slight performance improvement on positive and negative Tempo transformations for the 5- and 6-second curves.
From a machine learning perspective, it is important to know the performance of the model on known data (training set) and unknown data (testing set) to validate the generalization capacity of the model on unseen data. Table 3.5 reports this metric for the simplest (top) and most complex (bottom) models, where the difference between known and unknown data is not significant, meaning that the model is not overfitting; however, it cannot be stated that it is generalizing well either. Among the different loss terms, Mean Square Error (M) was the one achieving similar performance on known and unknown data, while MH and MHE are biased towards the known data. These results indicate that adding constraints to the hash space harms the generalization performance and that a regularization term might alleviate this issue.
Table 3.4 VoxCeleb1 subset benchmarked against baselines per transformation
Transformation 1 second 2 seconds 3 seconds
Panako Rafs Dejavu SAMAF Panako Rafs Dejavu SAMAF Panako Rafs Dejavu SAMAF
Original 0.00% 0.00% 65.20% 100.00% 15.29% 23.07% 90.67% 100.00% 60.70% 67.02% 96.72% 100.00%
TruncateSilence 0.00% 0.00% 65.18% 80.49% 11.75% 88.51% 90.45% 86.38% 53.55% 89.85% 96.63% 89.32%
Noise0.05 0.00% 0.00% 46.84% 21.50% 2.09% 1.52% 74.78% 62.32% 10.30% 1.65% 86.95% 74.77%
Echo 0.00% 0.00% 77.58% 70.78% 1.87% 17.89% 95.65% 97.11% 6.32% 24.59% 98.74% 98.86%
Reverb 0.00% 0.00% 64.18% 96.26% 1.07% 57.90% 90.86% 100.00% 8.43% 58.08% 97.08% 100.00%
HighPassFilter 0.00% 0.00% 61.58% 24.28% 4.30% 93.99% 89.34% 77.58% 21.36% 94.64% 96.50% 87.53%
LowPassFilter 0.00% 0.00% 58.47% 93.08% 13.52% 73.77% 85.50% 100.00% 56.86% 74.68% 93.08% 100.00%
Reverse 0.00% 0.00% 29.49% 2.04% 0.00% 0.00% 47.67% 7.16% 0.00% 0.00% 57.56% 9.10%
Pitch0.5 0.00% 0.00% 0.06% 0.00% 0.00% 0.00% 0.06% 0.00% 0.00% 0.00% 0.00% 0.00%
Pitch0.9 0.00% 0.00% 0.00% 97.22% 0.00% 0.00% 0.00% 100.00% 0.07% 0.00% 0.00% 100.00%
Pitch1.1 0.00% 0.00% 0.06% 99.43% 0.00% 0.00% 0.00% 100.00% 0.50% 0.00% 0.00% 100.00%
Pitch1.5 0.00% 0.00% 0.11% 4.03% 0.00% 0.00% 0.19% 34.90% 0.00% 0.00% 0.21% 66.52%
Speed0.5 0.00% 0.00% 0.03% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
Speed0.9 0.00% 0.00% 0.00% 14.81% 0.00% 0.00% 0.00% 40.19% 0.00% 0.00% 0.00% 49.13%
Speed0.95 0.00% 0.00% 0.00% 32.19% 0.42% 0.00% 0.00% 47.50% 2.07% 0.00% 0.00% 58.49%
Speed1.05 0.00% 0.00% 0.00% 30.01% 0.07% 0.00% 0.00% 49.50% 0.08% 0.00% 0.00% 61.99%
Speed1.1 0.00% 0.00% 0.06% 24.20% 0.00% 0.00% 0.00% 55.31% 0.00% 0.00% 0.00% 75.06%
Speed1.5 0.00% 0.00% 0.18% 0.09% 0.00% 0.00% 0.32% 0.11% 0.00% 0.00% 0.52% 0.00%
Tempo0.5 0.00% 0.00% 8.21% 3.55% 0.00% 0.00% 13.73% 6.36% 0.00% 0.00% 18.30% 5.54%
Tempo0.9 0.00% 0.00% 42.02% 30.13% 0.11% 0.00% 66.78% 55.91% 0.12% 0.00% 80.79% 73.79%
Tempo1.1 0.00% 0.00% 52.01% 31.51% 0.00% 0.07% 76.98% 64.36% 0.00% 0.00% 86.32% 84.00%
Tempo1.5 0.00% 0.00% 27.40% 4.02% 0.00% 0.00% 39.01% 11.97% 0.00% 0.00% 47.85% 12.10%
Average 0.00% 0.00% 27.21% 39.07% 2.29% 16.21% 39.18% 54.39% 10.02% 18.66% 43.51% 61.19%
Transformation 4 seconds 5 seconds 6 seconds
Panako Rafs Dejavu SAMAF Panako Rafs Dejavu SAMAF Panako Rafs Dejavu SAMAF
Original 91.22% 79.49% 98.19% 100.00% 97.61% 85.24% 98.84% 100.00% 98.03% 82.74% 98.82% 100.00%
TruncateSilence 82.27% 91.19% 97.83% 91.27% 90.20% 92.72% 98.55% 93.34% 93.96% 95.12% 98.72% 94.21%
Noise0.05 21.98% 1.57% 92.86% 81.26% 34.72% 1.94% 95.55% 84.62% 45.17% 1.77% 96.12% 86.83%
Echo 12.15% 24.32% 99.29% 99.78% 17.86% 27.18% 99.43% 99.92% 23.96% 28.45% 99.32% 99.90%
Reverb 21.58% 57.33% 98.36% 100.00% 36.78% 57.96% 98.94% 100.00% 51.87% 58.27% 99.06% 100.00%
HighPassFilter 43.51% 94.98% 97.62% 91.49% 63.21% 95.44% 98.74% 93.96% 74.51% 94.68% 98.71% 96.09%
LowPassFilter 89.92% 74.79% 96.55% 100.00% 97.14% 75.24% 97.68% 100.00% 98.03% 74.94% 98.35% 100.00%
Reverse 0.00% 0.00% 63.38% 10.65% 0.00% 0.00% 65.38% 11.00% 0.00% 0.00% 70.24% 11.86%
Pitch0.5 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
Pitch0.9 0.99% 0.00% 0.00% 100.00% 1.84% 0.00% 0.00% 100.00% 3.07% 0.00% 0.00% 100.00%
Pitch1.1 2.80% 0.00% 0.00% 100.00% 5.63% 0.00% 0.00% 100.00% 8.51% 0.00% 0.00% 100.00%
Pitch1.5 0.00% 0.00% 0.49% 80.02% 0.00% 0.00% 0.19% 87.05% 0.00% 0.00% 0.59% 89.56%
Speed0.5 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
Speed0.9 0.00% 0.00% 0.00% 46.00% 0.80% 0.00% 0.00% 39.21% 2.17% 0.00% 0.00% 32.23%
Speed0.95 3.05% 0.00% 0.00% 70.27% 3.72% 0.00% 0.00% 85.46% 5.08% 0.00% 0.00% 92.06%
Speed1.05 0.18% 0.00% 0.00% 76.90% 0.43% 0.00% 0.00% 92.85% 0.80% 0.00% 0.00% 98.94%
Speed1.1 0.00% 0.00% 0.00% 81.55% 0.00% 0.00% 0.00% 81.55% 0.00% 0.00% 0.00% 78.86%
Speed1.5 0.00% 0.00% 0.51% 0.00% 0.00% 0.00% 0.50% 0.26% 0.00% 0.00% 0.47% 0.00%
Tempo0.5 0.00% 0.00% 22.29% 5.74% 0.00% 0.00% 25.60% 5.01% 0.00% 0.00% 28.91% 4.73%
Tempo0.9 0.21% 0.00% 87.32% 80.15% 2.57% 0.00% 91.45% 77.13% 5.29% 0.00% 94.47% 73.82%
Tempo1.1 0.00% 0.10% 92.53% 91.40% 1.53% 0.12% 96.40% 94.12% 1.80% 0.00% 97.19% 92.65%
Tempo1.5 0.00% 0.00% 53.86% 9.86% 0.00% 0.00% 58.90% 10.94% 0.00% 0.00% 60.00% 12.50%
Average 16.81% 19.26% 45.50% 64.38% 20.64% 19.81% 46.64% 66.20% 23.28% 19.82% 47.32% 66.56%
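The Average rows in Table 3.4 are the unweighted mean over the 22 transformations; for instance, the 1-second SAMAF figure can be reproduced from the transcribed column:

```python
# 1-second SAMAF accuracies [%], in the row order of Table 3.4
samaf_1s = [
    100.00, 80.49, 21.50, 70.78, 96.26, 24.28, 93.08,  # Original..LowPassFilter
    2.04, 0.00, 97.22, 99.43, 4.03,                    # Reverse..Pitch1.5
    0.00, 14.81, 32.19, 30.01, 24.20, 0.09,            # Speed0.5..Speed1.5
    3.55, 30.13, 31.51, 4.02,                          # Tempo0.5..Tempo1.5
]
average = sum(samaf_1s) / len(samaf_1s)  # 39.07, matching the Average row
```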
Figure 3.4 Benchmark of the time-frequency transformations on the VoxCeleb1 subset dataset. Rows are the Pitch, Speed, and Tempo transformations, respectively; columns are audio sample lengths from 1 to 6 seconds. [Better visualized in color] Note: the RAFS and Panako 1-second curves are a flat line on the horizontal axis because their performance is 0.00%.

Finally, a detailed view of the performance of all the models can be found below, where the audio identification and transformation accuracy scores are exhaustively recorded for the full combination of the different parameters, such as the number of layers, hash size, strategy, loss function terms, and audio length.
Table 3.5 Simplest (top) and most complex (bottom) SAMAF models for Known vs. Unknown comparison in the audio identification task
Length
1LST128 ‐ Acc 1LST128 ‐ Loss
M MH MHE M MH MHE
KWN UNK KWN UNK KWN UNK KWN UNK KWN UNK KWN UNK
1 sec 14.83% 13.36% 3.54% 3.38% 13.91% 11.71% 15.01% 13.23% 2.16% 1.95% 13.44% 10.64%
2 secs 41.76% 40.45% 30.98% 30.83% 41.30% 38.31% 42.52% 40.37% 27.67% 25.07% 39.66% 36.08%
3 secs 51.31% 50.78% 39.79% 39.54% 49.85% 47.00% 52.47% 51.20% 35.86% 33.02% 46.22% 42.96%
4 secs 54.96% 55.62% 44.30% 43.29% 53.25% 50.88% 56.55% 55.91% 40.01% 37.23% 49.37% 46.32%
5 secs 56.82% 57.33% 46.39% 44.22% 54.82% 52.42% 57.91% 57.69% 42.48% 40.01% 51.19% 47.07%
6 secs 56.19% 57.10% 47.00% 44.69% 55.18% 52.54% 57.60% 57.88% 43.71% 40.82% 51.39% 47.60%
Length
2LST256 ‐ Acc 2LST256 ‐ Loss
M MH MHE M MH MHE
KWN UNK KWN UNK KWN UNK KWN UNK KWN UNK KWN UNK
1 sec 37.41% 35.76% 34.68% 32.91% 34.65% 32.65% 37.63% 35.26% 34.83% 32.88% 34.67% 32.50%
2 secs 50.85% 50.75% 45.95% 43.38% 45.86% 42.48% 51.07% 49.50% 45.35% 42.61% 46.08% 42.79%
3 secs 56.57% 56.48% 48.98% 46.45% 49.15% 45.65% 56.19% 54.76% 48.65% 45.59% 49.52% 46.03%
4 secs 58.70% 58.90% 50.18% 47.56% 50.53% 46.59% 58.41% 57.21% 49.97% 46.93% 50.81% 47.09%
5 secs 59.05% 59.73% 50.04% 47.44% 50.37% 46.68% 58.92% 57.56% 50.17% 47.27% 50.75% 46.94%
6 secs 57.99% 58.49% 49.07% 46.74% 49.35% 46.01% 57.60% 56.64% 49.40% 46.77% 49.55% 46.55%
Table 3.6 describes the audio identification performance in a subset of the VoxCeleb1 dataset, where the Mean Square Error (M) loss term and the Accuracy strategy achieved the best performance overall, regardless of the remaining model parameters. The improvement between a 128-bit and a 256-bit hash is minimal, suggesting that, if required, a small hash size can be used without diminishing overall performance. There is, however, a significant difference when comparing the performance per layer: 2-layer models clearly achieve higher results.

Table 3.7 describes the transformation performance in a subset of the VoxCeleb1 dataset. Once again, the Mean Square Error (M) loss term and the Accuracy strategy achieved the best performance overall, regardless of the remaining model parameters. Here there are significant differences between a small (128-bit) and a big (256-bit) hash size; it is believed that certain transformations were encoded either better or worse, thereby affecting the classification performance. When comparing the number of layers, the difference is not clear, suggesting that the hash size drives the transformation classification performance and that the number of layers has only a minor effect.
Table 3.6 Audio identification performance in the VoxCeleb1 dataset showing all the classification results with the different model parameters such as the number of layers, hash size, strategy, loss function term, and audio length. Model names follow the convention of Table 3.8: {1,2}LST{128,256}{loss term}-{strategy}, where the strategy is Rnd (Random), Acc (Accuracy), or Loss.

Model            1 sec  2 secs  3 secs  4 secs  5 secs  6 secs
1LST128M-Rnd      14.4   37.1   42.5   45.2   46.5   46.5
1LST128MH-Rnd     14.1   36.8   42.2   45.4   46.9   47.0
1LST128MHE-Rnd    12.8   36.6   42.4   44.9   46.0   45.8
1LST128M-Acc      14.3   41.2   51.1   55.2   57.0   56.5
1LST128MH-Acc      3.5   30.9   39.7   43.9   45.5   46.1
1LST128MHE-Acc    13.0   40.1   48.7   52.3   53.9   54.1
1LST128M-Loss     14.3   41.7   52.0   56.3   57.8   57.7
1LST128MH-Loss     2.1   26.6   34.7   38.9   41.5   42.6
1LST128MHE-Loss   12.3   38.3   44.9   48.2   49.6   49.9
1LST256M-Rnd      18.8   39.8   45.0   47.7   48.6   48.6
1LST256MH-Rnd     17.9   39.1   44.3   47.4   48.5   48.4
1LST256MHE-Rnd    17.2   37.7   42.0   44.5   45.8   45.3
1LST256M-Acc      24.0   45.4   51.8   54.6   55.8   55.8
1LST256MH-Acc     22.9   42.6   47.8   50.5   52.0   52.4
1LST256MHE-Acc    16.0   42.1   50.0   53.6   54.5   54.4
1LST256M-Loss     21.8   42.6   48.1   50.0   50.4   50.1
1LST256MH-Loss    21.8   40.5   45.1   47.7   49.1   49.3
1LST256MHE-Loss   12.4   37.8   43.8   46.7   48.1   48.5
2LST128M-Rnd      14.9   39.6   46.8   50.5   51.9   51.6
2LST128MH-Rnd     13.9   39.5   46.3   49.5   50.2   49.5
2LST128MHE-Rnd    16.5   43.5   51.0   54.1   54.7   53.6
2LST128M-Acc      26.6   47.3   54.3   57.8   58.7   58.3
2LST128MH-Acc     23.8   43.1   47.0   48.6   48.6   47.9
2LST128MHE-Acc    23.8   42.1   45.9   47.1   47.4   46.8
2LST128M-Loss     26.3   45.2   51.8   54.7   55.4   55.4
2LST128MH-Loss    24.3   42.5   46.2   47.6   47.7   47.3
2LST128MHE-Loss   24.5   40.5   43.1   44.0   43.9   43.3
2LST256M-Rnd      21.2   44.4   51.2   53.9   54.5   53.9
2LST256MH-Rnd     21.7   45.0   51.2   54.0   54.3   53.7
2LST256MHE-Rnd    21.9   45.6   52.3   55.1   55.9   54.8
2LST256M-Acc      36.8   50.8   56.5   58.8   59.3   58.2
2LST256MH-Acc     34.0   44.9   48.0   49.1   49.0   48.1
2LST256MHE-Acc    33.9   44.5   47.8   49.0   48.9   48.0
2LST256M-Loss     36.7   50.5   55.6   57.9   58.4   57.2
2LST256MH-Loss    34.1   44.3   47.4   48.8   49.0   48.4
2LST256MHE-Loss   33.8   44.8   48.1   49.3   49.2   48.4
Abbreviations: M - Mean Square Error Loss || MH - Mean Square Error Loss + Hash Loss || MHE - Mean Square Error Loss + Hash Loss + Bitwise Entropy Loss
Table 3.7 Transformation performance in the VoxCeleb1 dataset showing all the classification results with the different model parameters such as the number of layers, hash size, strategy, loss function term, and audio length. Model names follow the convention of Table 3.8: {1,2}LST{128,256}{loss term}-{strategy}, where the strategy is Rnd (Random), Acc (Accuracy), or Loss.

Model            1 sec  2 secs  3 secs  4 secs  5 secs  6 secs
1LST128M-Rnd      15.3   39.8   46.0   49.4   51.6   52.8
1LST128MH-Rnd     14.9   39.4   45.6   49.5   51.8   53.1
1LST128MHE-Rnd    13.6   39.2   45.7   49.0   50.9   51.9
1LST128M-Acc      15.3   44.3   55.3   60.4   63.3   64.2
1LST128MH-Acc      3.7   33.3   43.1   48.0   50.4   52.1
1LST128MHE-Acc    14.0   43.1   52.7   57.2   59.7   61.2
1LST128M-Loss     15.3   44.8   56.2   61.4   64.0   65.1
1LST128MH-Loss     2.2   28.8   37.8   42.7   46.1   48.4
1LST128MHE-Loss   13.2   41.0   48.5   52.6   55.0   56.4
1LST256M-Rnd      19.8   42.5   48.6   52.0   53.9   55.1
1LST256MH-Rnd     18.9   41.7   47.8   51.7   53.8   54.8
1LST256MHE-Rnd    18.2   40.2   45.4   48.7   50.9   51.4
1LST256M-Acc      25.7   48.5   55.9   59.7   62.1   63.4
1LST256MH-Acc     24.4   45.5   51.6   55.3   57.8   59.5
1LST256MHE-Acc    17.1   45.1   54.0   58.6   60.4   61.5
1LST256M-Loss     23.3   45.6   52.1   54.8   56.2   57.1
1LST256MH-Loss    23.3   43.2   48.6   52.1   54.6   56.1
1LST256MHE-Loss   13.3   40.5   47.4   51.1   53.5   55.0
2LST128M-Rnd      15.9   42.4   50.6   55.2   57.6   58.4
2LST128MH-Rnd     14.8   42.4   50.1   54.2   55.7   56.2
2LST128MHE-Rnd    17.6   46.6   55.1   59.0   60.5   60.8
2LST128M-Acc      28.4   50.7   58.7   63.3   65.1   66.1
2LST128MH-Acc     25.4   46.1   50.9   53.4   54.4   54.8
2LST128MHE-Acc    25.4   45.1   49.7   51.6   52.8   53.4
2LST128M-Loss     28.0   48.4   56.0   59.9   61.8   63.2
2LST128MH-Loss    25.9   45.5   50.0   52.3   53.3   54.1
2LST128MHE-Loss   26.2   43.4   46.7   48.3   49.0   49.5
2LST256M-Rnd      22.5   47.6   55.3   58.9   60.4   61.2
2LST256MH-Rnd     23.0   48.1   55.3   58.9   60.1   60.8
2LST256MHE-Rnd    23.2   48.8   56.4   60.2   61.9   62.2
2LST256M-Acc      39.1   54.4   61.2   64.4   66.2   66.6
2LST256MH-Acc     36.1   48.1   52.0   53.9   54.8   55.0
2LST256MHE-Acc    36.0   47.7   51.8   53.8   54.7   54.8
2LST256M-Loss     39.0   54.0   60.3   63.6   65.3   65.5
2LST256MH-Loss    36.2   47.4   51.4   53.6   54.8   55.2
2LST256MHE-Loss   35.9   48.0   52.2   54.1   55.0   55.2
Abbreviations: M - Mean Square Error Loss || MH - Mean Square Error Loss + Hash Loss || MHE - Mean Square Error Loss + Hash Loss + Bitwise Entropy Loss
3.5 Statistical Analysis

Statistical analysis was performed to validate the previous findings. The Friedman test, a non-parametric equivalent of the repeated-measures ANOVA, ranks the algorithms for each data set separately: the best-performing algorithm gets rank 1, the second rank 2, and so on; in case of ties, average ranks are assigned. It then compares the average ranks of the algorithms under the null hypothesis that all the algorithms are equivalent and hence their ranks should be equal. If the null hypothesis is rejected, we can proceed with a post-hoc test. The Nemenyi test is used when all classifiers are compared to each other: the performance of two classifiers is significantly different if their average ranks differ by at least the critical difference (CD), Equation (9).
CD = q_α · √( k(k+1) / (6N) )    (9)
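The ranking step of the Friedman procedure described above can be sketched in a few lines; the scores below are toy values, not results from this chapter:

```python
def friedman_average_ranks(score_table):
    """score_table[d][a] = score of algorithm a on dataset d (higher is better).
    Returns the average rank of each algorithm across datasets; rank 1 is best,
    and ties receive the average of the ranks they span."""
    n_alg = len(score_table[0])
    totals = [0.0] * n_alg
    for row in score_table:
        order = sorted(range(n_alg), key=lambda a: row[a], reverse=True)
        ranks = [0.0] * n_alg
        i = 0
        while i < n_alg:
            j = i
            # extend j over algorithms tied with the one at position i
            while j + 1 < n_alg and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
            for k in range(i, j + 1):
                ranks[order[k]] = avg
            i = j + 1
        for a in range(n_alg):
            totals[a] += ranks[a]
    return [t / len(score_table) for t in totals]

# Two toy datasets, three algorithms; the tie on the first dataset
# gives algorithms 1 and 2 the averaged ranks 2.5 each there.
ranks = friedman_average_ranks([[0.9, 0.5, 0.5],
                                [0.8, 0.6, 0.4]])
```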
An average-rank comparison of the SA model, with all its possible parameter combinations, using the transformation performance as a metric, is shown in Table 3.8. It is observed that the top-4 models all have two layers and rely on the first term of the loss function, the Mean Square Error Loss (M), to achieve a high average ranking. Moreover, the models using a 256-bit hash performed better than those using a 128-bit hash, and the Accuracy strategy proved to be a better fit than the Loss strategy. The first three models are close to one another, while all the remaining models trail them with an average rank at least twice as large.
When multiple classifiers are compared, the results of the post-hoc tests can be visually represented with a Critical Difference (CD) diagram (Figure 3.5). The top line in the diagram is the axis on which the average ranks of the classifiers are plotted; the axis is oriented so that the lowest (best) ranks are to the right, and the critical difference (Nemenyi test, Equation (9)) is shown above the graph. When comparing the models against each other, the groups of models that are not significantly different are connected. For our particular problem, the parameters of the Nemenyi test are q_α = 3.837, k = 36, and N = 6, where q_α is the critical value for a significance level α of 5%, k is the total number of classifiers/models, and N is the total number of databases. In our case, we used the VoxCeleb1 subset database 6 times, with audio lengths from 1 to 6 seconds; hence, each test is considered a different database, adding up to 6. The CD diagram shows that 2LST256M-Acc has the best average rank, but it is not significantly different from the following top 26 classifiers (cluster), including some Random models, although there is not enough data to conclude that the top 26 models are not similar to other clusters.
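Plugging the stated parameters into Equation (9) gives the critical differences used in this chapter; the q_α values are the critical values quoted in the text:

```python
import math

def nemenyi_cd(q_alpha, k, n):
    # CD = q_alpha * sqrt(k(k+1) / (6N)), Equation (9)
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

cd_all_models = nemenyi_cd(3.837, 36, 6)  # 36 SAMAF configurations, N = 6
cd_baselines = nemenyi_cd(2.569, 4, 6)    # SAMAF vs. the three baselines
print(round(cd_baselines, 4))  # → 1.9148, the critical distance in Figure 3.6
```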
Table 3.8 Average ranks of the SA model and its different parameters applying the Friedman procedure
Model            Average Rank | Model            Average Rank | Model            Average Rank
2LST256M‐Acc 1.00 1LST256MHE‐Acc 14.33 2LST128MHE‐Acc 23.67
2LST256M‐Loss 2.17 1LST256M‐Acc 15.00 1LST256M‐Rnd 24.00
2LST128M‐Acc 3.50 2LST256MH‐Acc 15.17 1LST128MHE‐Loss 24.83
2LST128M‐Loss 7.00 1LST256M‐Loss 15.50 1LST256MH‐Rnd 26.00
1LST256M‐Acc 7.17 2LST256MH‐Loss 16.17 2LST128MHE‐Loss 27.50
2LST256MHE‐Rnd 7.67 2LST256MHE‐Acc 16.50 1LST256MHE‐Loss 28.50
1LST128M‐Loss 10.67 1LST128MHE‐Acc 17.33 1LST128M‐Rnd 30.33
2LST256M‐Rnd 11.50 2LST128M‐Rnd 19.67 1LST128MH‐Rnd 30.83
2LST256MH‐Rnd 11.67 2LST128MH‐Acc 19.83 1LST256MHE‐Rnd 31.17
1LST128M‐Acc 12.17 2LST128MH‐Rnd 21.67 1LST128MHE‐Rnd 32.50
2LST128MHE‐Rnd 13.00 2LST128MH‐Loss 21.67 1LST128MH‐Acc 34.33
2LST256MHE‐Loss 14.17 1LST256MH‐Loss 21.83 1LST128MH‐Loss 36.00
Figure 3.5 Critical Difference (CD) diagram for the 36 classifiers. The classifiers with the lowest (best) ranks are located on the right side. 2LST256M-Acc possesses the best average rank, but it is not significantly different from the following top 26 classifiers (cluster); however, the data is not sufficient to conclude that those 26 classifiers are not similar to other clusters. In this case, 2LST256M-Acc is the absolute winner. The CD diagram uses a critical value q_α of 3.837 at a 5% significance level α.

Furthermore, our best model (2LST256M-Acc) was statistically compared with the benchmarked models (Dejavu, Panako, and RAFS), again carrying out the Nemenyi test. As seen in Table 3.9, SAMAF is on top with an average rank of 1.65, followed by Dejavu with an average rank of 2.04. Figure 3.6 visually complements this information, showing that there are two clusters (SAMAF-Dejavu and Dejavu-Panako-RAFS). With the given data, it is not possible to conclude whether
Dejavu is statistically similar to Panako and RAFS or to SAMAF. Even so, SAMAF holds the best average rank, and we conclude that it is the overall winner.
Table 3.9 Average ranks of the proposed model and the benchmarked models

Model  Average Rank
SAMAF 1.65
Dejavu 2.04
Panako 3.14
RAFS 3.17
Figure 3.6 Critical Difference (CD) diagram for our best model (2LST256M-Acc) and the benchmarked models (Dejavu, Panako, and RAFS). SAMAF is at the top with an average rank of 1.65, followed by Dejavu with an average rank of 2.04. Due to the lack of data, it is not known whether Dejavu is statistically similar to SAMAF or to the other two methods (Panako and RAFS). The critical distance is 1.9148, with a critical value q_α of 2.569 at a 5% significance level α.
Chapter 4. Deployment 36
Chapter 4

4. Deployment

This research, supported by the Department of Homeland Security (DHS), had to be accomplished in collaboration with personnel from Public Safety Answering Points (PSAPs) and 9-1-1 Emergency Operation Centers (EOCs). Therefore, the deployment of the created technology was tested in an industrial environment, producing results in real time.

4.1 History of the 9-1-1 System

On the National Emergency Number Association (NENA) website, it is noted that: “On February 16, 1968, Senator Rankin Fite completed the first 9-1-1 call made in the United States in Haleyville, Alabama. The serving telephone company was then Alabama Telephone Company. This Haleyville 9-1-1 system is still in operation today” [47]. Since then, the 9-1-1 system has been evolving to accommodate emerging technologies such as cellphones and the internet. At the beginning of the 9-1-1 system, all calls were made through fixed landlines via rotary-dial and/or push-button telephones. Hence, the dispatchers already knew the caller's phone number and location, considering that every phone number was registered to a geographically fixed address. With the introduction of cell phones, however, the location information was no longer straightforward to obtain: the source of the call had to be triangulated via the cell phone towers, giving only a rough estimate of the location. This triangulation method is still used today, causing a logistic problem when a cell phone tower's radius overlaps with one or several other towers. Later on, with the creation of the internet, a new technology for making phone calls was invented. A new standard, Voice over Internet Protocol (VoIP), was developed to make calls all around the world, wreaking havoc on the idea of tracking the location information and identifying the caller's number.
Enhanced 9-1-1 (E911) was the response to these emerging technologies, supporting wireless and VoIP communications. It was the first upgrade to the 9-1-1 infrastructure, automatically giving the dispatchers the caller's telephone number, also known as the callback number or Automatic Number Identification (ANI). In the case of cellphones, however, the cell towers first assign a pANI (Pseudo Automatic Number Identification), which is used to query a database that returns the ANI and, if available, the location information via a service called Automatic Location Information (ALI). However, the introduction of smartphones and the enhanced capabilities of the internet not only brought an improvement in the service but also set the environment for dangerous Telephony Distributed Denial-of-Service (TDDoS) attacks, which have crippled 9-1-1 services on several occasions. These attacks are characterized by denying the service to legitimate callers who may be in a life
and death situation, whereas malicious people operate undisturbed because it is not possible to alert the authorities. The National Telecommunications and Information Administration (NTIA) highlighted that: “As communication technologies have evolved to include wireless phones, text, and picture messaging, video chat, social media, and Voice over Internet Protocol (VoIP) devices, the public expects that 911 services will also be able to accept information from these communication methods. While efforts are underway across the nation to enable call centers to accept text messages, the future success of 911 in serving the public's needs will only be possible when Public Safety Answering Points (PSAPs) have transitioned to an Internet Protocol (IP)-based 911 system, commonly referred to as Next Generation 911 or NG911” [48]. The idea behind NG911 is to create a faster and more resilient system allowing digital information (e.g., voice, photos, videos, text messages) to reach first responders.

4.2 General Framework of the Testing Environment

Working towards the implementation of NG911, this research has focused on detecting possible threats and raising a flag when a call is malicious. In order to develop the system, several components were required: (i) a call generator, (ii) a VoIP Gateway/Firewall, (iii) a Session Border Controller (SBC), (iv) a media server/PSAP, (v) a Session Initiation Protocol (SIP)/Real-time Transport Protocol (RTP) probe, and (vi) a call analyzer, also known as the Acoustic Forensic Analysis Engine. Figure 4.1 displays the flow of the information, beginning at the call generator and ending with the call analyzer.
Figure 4.1 The general framework of the testing environment starts with a call generator whose traffic is filtered through a firewall and later managed by the Session Border Controller (SBC), which sends the call to the Public Safety Answering Point (PSAP). Once the call is answered, the Session Initiation Protocol (SIP)/Real-time Transport Protocol (RTP) probe forwards the RTP packets, which are analyzed by the call analyzer, also known as the Acoustic Forensic Analysis Engine (AFAE).

Call Generator. It can be implemented using various software packages (e.g., OpenSips, SIPp, SipTester), and its function is to establish the interval between calls, the
number of concurrent calls, and the total number of calls, among other functionalities. Source and destination IP addresses are needed to initiate the call.

VoIP Gateway/Firewall. Initially, the internet was built with data transmission in the form of bits in mind. However, a phone call is an analog signal that needs to be converted into a stream of bits so it can travel between the devices inside a network. With the introduction of the IP Private Branch Exchange (PBX), a dedicated VoIP Gateway was put in charge of handling all the communication requests coming from telephone-like devices.

SBC. Configuring the VoIP Gateway leaves a security breach that needs to be handled by a second device managing all the session requests (SIP signaling).

Media Server/PSAP. In order to process the audio or video streams associated with telephone calls or connections, a media server is needed. In our lab environment, a PSAP interface was developed to handle this step; once the call is answered, the media information is sent to a second device for further processing. Figure 4.2 shows the login window for the call takers, and Figure 4.3 displays the PSAP dashboard, which contains all the information related to the call.

SIP/RTP probe. This probe is in charge of the segmentation rules. It splits the calls into different RTP threads and regulates how long each session is kept alive. Other functionalities can be implemented inside the probe, for instance, analyzing the SIP information for further handling of TDDoS attacks.

AFAE. This is a Deep Learning module in charge of analyzing the content of the calls; it adds labels to them and scores how confident the system is when assigning those labels. It also keeps track of previous calls in order to detect duplicate calls, which might be a sign of a malicious call.
4.3 Call Classifier and Duplicate Detector

The modules were designed with Deep Learning to determine whether a call is made by a human, software, or a TeleTYpewriter (TTY) and to detect duplicate calls. A TTY is a particular device that lets people who are deaf, hard of hearing, or speech-impaired use the telephone to communicate by typing messages back and forth to one another instead of talking and listening; a TTY is required at both ends of the conversation. Based on the idea that malicious people try to cripple the 9-1-1 centers via a Telephony Distributed Denial-of-Service (TDDoS) attack or test them with “random calls,” the new capabilities of the call analyzer attempt to mitigate the damage caused by both approaches. If an attacker uses synthetic voices, replayed calls, or “random calls” to keep the call takers busy, the metadata generated by the call analyzer module raises flags alerting over the hidden intentions of the call. The system is mainly composed of three blocks: (i) Preprocessing, (ii) Feature Extraction, and (iii) Model. Figure 4.4 displays the block diagram for both systems:
call classifier (CNN) and duplicate detector (SA). Similar to the already explained SAMAF model, the call classifier goes through the same preprocessing stage, but only 2 seconds of the audio signal are kept. These 2 seconds are converted into a spectrogram, a 2D representation of the 1D audio signal, and deep learning image classification techniques are applied to determine the label that best fits each sample. The output of the CNN model is the probability of the call belonging to each of the three defined classes (human, synthetic, and TTY).
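The spectrogram step can be sketched as follows; the frame length, hop size, Hann window, and the 8 kHz sampling rate are illustrative assumptions rather than the exact deployment settings:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: frame the signal, window each frame, and take
    the FFT of each frame. Returns an array of shape (freq_bins, time_frames),
    i.e., the 2D image-like representation fed to the CNN."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Two seconds of audio at an assumed 8 kHz sampling rate,
# as kept by the call classifier's preprocessing stage.
audio = np.random.randn(2 * 8000)
spec = spectrogram(audio)
```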
Figure 4.2 PSAP Call taker login window
Figure 4.3 PSAP Dashboard displaying the attributes of the caller information
The deep learning image classification model used was AlexNet [42], which has 60 million parameters and 500,000 neurons. It consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers; softmax is applied to the last one. Based on the AlexNet neural network, the input and output layers were modified for an image resolution of 256x256 and only three outputs instead of 1,000. Non-saturating neurons, Rectified Linear Units (ReLU), were used to make training faster, in addition to the GPU implementation based on CUDA and cuDNN. The performance of the model was tested with the Automatic Speaker Verification Spoofing and Countermeasures Challenge 2015 (ASVspoof 2015) [49] dataset; the results of the ASVspoof 2015 challenge in comparison with the University of Houston results are shown in Table 4.1, where it can be noticed that our approach surpassed the best result by a factor
of 18. The deep learning duplicate detector is the same as the aforementioned SAMAF methodology, so no further explanation is required.
Figure 4.4 The block diagram of the deployment starts with a raw audio signal, which is normalized through a preprocessing step. Depending on the analysis, either Spectrogram Image Features (SIF) or Mel-Frequency Cepstral Coefficients (MFCCs) are extracted from the normalized signal and then input either to the Convolutional Neural Network (CNN), which outputs a classification label (Human, Synthetic, or TTY), or to the Sequence-to-Sequence Autoencoder (SA) model, which generates the audio fingerprint (hash representation).

Although the previous explanation covers the transformation of the audio signal up to the point where a categorical value and a compact signature are output, it does not clarify the overall system in terms of the information flow. Figure 4.5 is a flowchart representation of the information flow, where it is observed that the first filter is a silence detector, which ensures that there is information to be analyzed. Later, the call classifier and the duplicate detector are triggered in parallel, and the outputs of both systems (the class label and the confidence score in the case of the call classifier, and the hash and the similarity score in the case of the duplicate detector) are stored in a MySQL database. The content of this database can be queried through a Web Interface, facilitating the audio forensic analysis work.

4.4 Materials and Architecture

In order to analyze 9-1-1 calls in real time and to save as many calls as possible, it was necessary to understand how the system reacted under stress. Different tests, such as storage, memory, and process load analyses, were performed.
Based on empirical experiments, the server features, the amount of Random Access Memory (RAM), the Central Processing Unit (CPU) features, and the Graphics Processing Unit (GPU) features were chosen accordingly, the GPU being one of the most critical components because it is not possible to run a Deep Learning model if the hardware is not designed for that specific application. GPUs can accelerate the analysis by up to 100x over a CPU alone.
Table 4.1 Summary of primary submission results in the ASVspoof 2015 challenge and comparison with the University of Houston results. Our approach is the best one, surpassing the runner-up by a factor of 18.
Equal Error Rates (EERs) [%]
System ID  Known Attacks  Unknown Attacks  Average
UofH 0.0047 0.1297 0.0672
A 0.408 2.013 1.211
B 0.008 3.922 1.965
C 0.058 4.998 2.528
D 0.003 5.231 2.617
E 0.041 5.347 2.694
F 0.358 6.078 3.218
G 0.405 6.247 3.326
H 0.67 6.041 3.356
I 0.005 7.447 3.726
J 0.025 8.168 4.097
K 0.21 8.883 4.547
L 0.412 13.026 6.719
M 8.528 20.253 14.391
N 7.874 21.262 14.568
O 17.723 19.929 18.826
P 21.206 21.831 21.519
Average 3.621 10.042 6.832
STD 6.566 6.643 6.349
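The factor-of-18 claim follows directly from the average EERs in Table 4.1:

```python
uofh_avg = 0.0672      # University of Houston average EER (%)
runner_up_avg = 1.211  # System A, the best official challenge submission (%)

improvement_factor = runner_up_avg / uofh_avg
print(round(improvement_factor, 2))  # → 18.02
```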
Figure 4.5 Flowchart of the AFAE methodology, where it is observed that the first filter ensures that there is information to be analyzed. Later, the outputs of the call classifier and the duplicate detector are stored in a MySQL database
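The flow of Figure 4.5 can be sketched as below; the energy-based criterion, frame size, and threshold are assumptions, since the exact silence detector is not detailed here, and the two models are replaced by placeholders:

```python
import numpy as np

def is_silent(signal, frame_len=160, threshold=1e-3):
    # A call is treated as silent when every frame's mean energy
    # stays below the threshold (assumed criterion).
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    return bool(np.all((frames ** 2).mean(axis=1) < threshold))

def analyze_call(signal):
    # First filter: the silence detector ends the analysis early,
    # so nothing else is populated for silent calls.
    if is_silent(signal):
        return {"silent": True}
    # In the deployed system the call classifier and the duplicate
    # detector run in parallel; placeholders stand in for both models.
    return {"silent": False, "label": "human", "similarity": 0.0}
```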
Prior to the deployment in the two 9-1-1 Emergency Operation Centers (EOCs) located in Palm Beach County (PBC) and Greater Harris County (GHC), an agreement form was signed by all the parties involved, and the physical location of the servers was discussed. Our framework acts passively, meaning that we do not interact with their system; we only analyze the calls they send to our module (port mirroring). The system architecture is presented in Figure 4.6, where it is noted that the information travels through a SIP trunk to a switch where the port mirroring is applied. After that, this information is segmented and preprocessed by a SIP/RTP probe. Later, the preprocessed packets, one call per thread, are sent to our AFAE module, where the analysis is visualized via a web interface.
Figure 4.6 Architecture of the different devices involved in the call analysis module (AFAE): a SIP trunk carries the information to a switch where the port mirroring occurs, and a SIP/RTP probe segments and preprocesses the packets, one call per thread, which are then analyzed by our AFAE module. The visualization of the analysis is displayed through a web interface.

4.5 Palm Beach County (PBC)

During the deployment at Palm Beach County, two main objectives were tested: (i) the correct performance of the overall architecture of the system, including all the applicable connections and devices, and (ii) the accuracy of our call analyzer in detecting silent, human, synthetic, TTY, and duplicate calls. Initially, the system was going to be set up in a test environment; however, days before the deployment, one of our partners informed us that it was not possible to route emergency calls to the test PSAP because the SIP trunks associated with different Twilio (a cloud communications platform) accounts were being routed to the Sheriff's office. This happened because, in order to make an emergency call, the numbers needed to be associated with a physical address. Even though our partners and PBC worked on updating the address databases, it was believed that Twilio did not do so on their side. Routing emergency calls to the test PSAP is still an open issue that Twilio has not closed.
Plan B was to make test calls from a location close to the EOC where the call flow was low. Using a landline phone, test calls were dialed from the Manalapan PSAP. Unfortunately, these calls did not reach our AFAE module at first: it was discovered that there was a 50/50 chance that a call would be routed over the SIP trunk where the port mirroring was applied. Several calls were made, and little by little the database started to get populated. Different scripts were used for the test calls; two of them are reported in Appendix D. We also used pre-recorded audio, especially for the simulation of the TTY calls, played either through the speakers of a laptop or a cellphone. Working in an industrial environment revealed many unforeseen events, and our AFAE module did not perform well; most of the calls were classified as synthetic. The emergency calls were filtered, and frequencies higher than 4 kHz were cut off; hence, our DL models did not work well because the training data had double the frequency content, 8 kHz. Another unforeseen event was that calls were transferred between the first responders/dispatchers. Listening to the calls, it was realized that they follow a fixed script that is repeated in every transferred call, causing false positives in the duplicate detector module. However, we were able to verify the proper operation of the system architecture. A remote connection, a Virtual Private Network (VPN), was set up by AT&T so the AFAE module could be improved with the new data we started to gather. The server was left in their facilities until the end of the year 2018, and a watchdog was implemented to keep the software running in case it crashes.

4.6 Greater Harris County (GHC)

The deployment in GHC happened after the deployment in PBC, and we had the opportunity to retrain our call classifier with frequencies less than or equal to 4 kHz.
The main objectives of the deployment were the same as before: to test (i) the correct performance of the overall architecture of the system, including all the applicable connections and devices, and (ii) the accuracy of our call analyzer in detecting silent, human, synthetic, TTY, and duplicate calls. Greater Harris County also has its own test environment, where our AFAE module was installed. However, another unforeseen event did not allow us to test our module: the network communication between the devices in the test environment was configured over TCP instead of UDP. An attempt to change the communication protocol was unsuccessful; no testing was performed, and it was not possible to know whether the new 4 kHz model was successful. The proper operation of the system architecture, however, was verified. As in PBC, the server was left in their facilities until the end of the year 2018, and a VPN connection was established for further testing, with the opportunity to gather data to keep improving the current models. In the near future, the AFAE module will be moved to a data center where real calls will be analyzed.
4.7 Web Interface

After the analysis is executed, the results are stored in a MySQL database. However, reading the information directly from the database is time-consuming, not user-friendly, and might represent a security breach. An easy way to display the raw data was therefore developed through a web interface using the Bootstrap library. As can be seen in the next figures, several filters were created so the information one is interested in for forensic analysis can be retrieved easily. Figure 4.7 shows the main section for inputting the query data, where different types of filters can be found: the records can be identified by their internal or external ID, by a specific timestamp or a time range, by their classification result, by whether or not they were classified as ghost calls, or by their similarity score, and the final user can also specify the number of results to be displayed.
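The filters described above map naturally onto a single SQL query; the sketch below uses an in-memory SQLite stand-in with a hypothetical schema, whose table and column names are illustrative, not the deployed MySQL schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE calls (
    internal_id INTEGER PRIMARY KEY, external_id TEXT,
    ts TEXT, label TEXT, confidence REAL, similarity REAL, silent INTEGER)""")
con.executemany("INSERT INTO calls VALUES (?,?,?,?,?,?,?)", [
    (1, "ext-1", "2018-10-01 10:00:00", "human", 0.91, 0.12, 0),
    (2, "ext-2", "2018-10-01 10:05:00", "synthetic", 0.88, 0.95, 0),
    (3, "ext-3", "2018-10-01 10:07:00", None, None, None, 1),  # silent call
])

# Classification result, similarity threshold, and result limit,
# mirroring the filters offered by the web interface.
rows = con.execute("""SELECT internal_id FROM calls
    WHERE silent = 0 AND label = ? AND similarity >= ?
    ORDER BY ts LIMIT 25""", ("synthetic", 0.5)).fetchall()
```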
Figure 4.7 Web Interface. This is the main part for inputting the query data; the records can be located using several filters such as identifier (external or internal), timestamp or date range, classification result, ghost result, similarity score, and amount of records to be displayed.
An overview of the records found is displayed after the query is submitted. Figure 4.8 shows all the information related to the query, such as the number of records found; the number of calls classified as human, synthetic, TTY, or with an empty result; and the number of calls identified as silent, non-silent, or empty. As explained in the flowchart of the AFAE framework (Figure 4.5), if the call is identified as silent, the analysis ends and nothing else is populated in the classification stage. Another way to get an empty record is when there is an error in the system, or it crashes, and the only thing saved in the MySQL database was the ID of the call.
Figure 4.8 Web Interface. Overview/statistics of all the records found after the submission of the query. From left to right: the total number of calls found; the number of human, synthetic, TTY, and empty calls; and the number of silent, non-silent, and empty calls. Empty records are created either when the call is detected as a silent call or when there was an error during the analysis and the only thing saved was the ID of the call.

The results of the query are sorted by timestamp and, as seen in Figure 4.9, the external ID is displayed on the left, followed by the internal ID and an audio console to play the calls; each call has a length of 6 seconds. The next column has the timestamp relative to the server time, and then the classification result with its confidence score, as well as the duplicate score, are shown in case the call had content. Otherwise, there are no classification or duplicate scores, as shown in the last row.
Figure 4.9 Web Interface. Results of the query sorted by timestamp. On the left, the external ID, then the internal ID and an audio console to play the calls. The next column has the timestamp, followed by the classification result with its confidence score and the duplicate score. If the call is silent, no results are output, as seen in the last row.
A per-call analysis is displayed when the internal ID of the call is clicked. Figure 4.10 shows the main information for each call: at the top, the information of the clicked call; at the bottom, all the calls that are similar to it. These calls are sorted from smallest to largest MinDist (last column), where a smaller MinDist means greater similarity.
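The ranking above can be sketched with a toy MinDist computation. Here MinDist is taken to be the smallest Hamming distance between any pair of query and reference hashes, which is an assumption about its exact definition; the `hamming` and `rank_similar_calls` helpers are illustrative, not the deployed code:

```python
def hamming(h1, h2):
    # Number of differing bits between two equal-length bit strings
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))

def min_dist(query_hashes, ref_hashes):
    # Smallest Hamming distance between any query/reference hash pair
    return min(hamming(q, r) for q in query_hashes for r in ref_hashes)

def rank_similar_calls(query_hashes, candidates):
    # candidates: {internal_id: [hashes]}; smaller MinDist = greater similarity
    scored = [(cid, min_dist(query_hashes, hashes))
              for cid, hashes in candidates.items()]
    return sorted(scored, key=lambda c: c[1])
```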
Figure 4.10 Web Interface. The per-call information is shown when the internal ID is clicked. At the top, the call under analysis is displayed; at the bottom, all the calls similar to it are sorted from smallest to largest MinDist (last column), where a smaller MinDist means greater similarity.

The analysis requires storing the audio files, and the other generated files, on the hard drive. The folders containing such files, as well as the records in the database, can be purged manually through the administrative interface. As depicted in Figure 4.11, it is possible to delete records from the database older than a specific timestamp (PURGE button) or to delete all the records at once (RESET button). Additionally, all the files created for the analysis can be deleted (CLEAR CACHE button).
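The PURGE and RESET actions can be sketched as the SQL they would issue (the table and column names are assumptions; the CLEAR CACHE button would additionally delete the folders on disk, e.g. with `shutil.rmtree`):

```python
def purge_statement(table, before_ts=None):
    """RESET deletes everything; PURGE deletes records older than a timestamp."""
    if before_ts is None:
        return "TRUNCATE TABLE {};".format(table)                # RESET button
    return "DELETE FROM {} WHERE Timestamp < %s;".format(table)  # PURGE button
```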
Figure 4.11 Web Interface. The administrative interface provides the tools to partially or totally delete the records in the database and all the files created for the analysis.
Chapter 5. Conclusions

In this work, a Sequence-to-Sequence Autoencoder Model for Audio Fingerprinting (SAMAF) was developed to improve the creation of audio fingerprints through a new loss function made up of three terms: Mean Square Error Loss (M), Hash Loss (H), and Bitwise Entropy Loss (E). The model was evaluated over a subset of the VoxCeleb1 dataset. The performance of the SA model was benchmarked against three baselines: Dejavu (a Shazam-like algorithm), RAFS (a Bit Error Rate (BER)-based algorithm), and Panako (a variation of the constellation algorithm that adds time-frequency distortion resilience). Furthermore, the hash space vs. audio identification performance curves and the learning curves (training process) offered additional insight into the behavior and performance of the model. Additionally, the technology was deployed in two 9-1-1 Emergency Operation Centers (EOCs), located in Palm Beach County (PBC) and Greater Harris County (GHC), allowing real-time testing in industrial environments.

This study found that a structured hash space is related to the performance of the model, but not necessarily for better or worse; the selection and combination of different model parameters must be considered. Nevertheless, the model with a structured hash space delivering the best audio identification performance was composed of two LSTM layers and a 256-bit hash size, and was selected during training with the maximal-accuracy strategy. The capacity of the models was tested through the audio identification task, but the attributes an audio signal is composed of were also considered in additional classification evaluations. It was discovered that the unsupervised nature of the SA model, the number of layers, and the hash size significantly affected the classification performance.
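The three-term objective can be illustrated numerically. The exact forms of the Hash and Bitwise Entropy terms below are simplified assumptions for illustration, not the training code (Chapter 3 gives the actual definitions):

```python
import numpy as np

def samaf_style_loss(x, x_rec, h, alpha=1.0, beta=1.0, gamma=1.0):
    """x, x_rec: (T, F) input and reconstruction; h: (B,) real-valued hash
    before binarization. Returns a weighted sum of M-, H-, and E-style terms."""
    mse = np.mean((x - x_rec) ** 2)                  # M: reconstruction error
    hash_loss = np.mean((np.abs(h) - 1.0) ** 2)      # H: push bits toward +/-1
    p = np.clip((h + 1.0) / 2.0, 1e-7, 1.0 - 1e-7)   # per-bit pseudo-probability
    entropy = -np.mean(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))  # E
    return alpha * mse + beta * hash_loss + gamma * entropy
```

A perfect reconstruction with saturated ±1 bits drives all three terms toward zero, which is the behavior the learning curves track per term.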
Therefore, parameter selection can increase or decrease the classification performance per attribute, also affecting the classification scores of the audio identification task. An analysis of the learning curves from training showed the link between the progression of each loss term and audio identification performance. The audio identification task was performed on a subset of the VoxCeleb1 dataset. Our methodology achieved the highest score, 59.32%, while Dejavu's highest score was 42.66%, RAFS's 17.61%, and Panako's 20.49%. The SAMAF model consistently outperformed the baselines in all the classification tasks: audio identification, transformation, speaker, and video accuracy. A detailed description of the model performance per transformation was given in Table 3.4 and Figure 3.4, where SAMAF's best model (2LST256M-Acc), on average, performed better than the other methodologies. However, the runner-up, Dejavu, was the only method capable of either recognizing audio played in reverse or handling extreme transformations such as Speed1.5 and Tempo1.5. RAFS and Panako did not work for excerpts 1 second long or for the Pitch, Speed, and Tempo transformations. Lastly,
experimental results showed that 1) all baselines identified audio samples under simple transformations, but SAMAF did better on complex transformations, and 2) each methodology had a different tolerance to sample length in the audio identification task.

From a Machine Learning perspective, the analysis of Known and Unknown data contributed to a proper assessment of the generalization capacity of the model. Even though the models did not overfit, they did not generalize well, and adding more constraints to the creation of hashes (MH/MHE) diminished the generalization performance slightly.

During the deployment, it was learned that industrial environments carry many unforeseen events, which made testing the technology either difficult or impossible. In PBC, once our Audio Forensic Analysis Engine (AFAE) module started analyzing real calls, it was discovered that frequencies higher than 4 KHz are cut off, causing our analysis to classify all the calls as synthetic. The analysis failed because the DL models were trained with data containing double the frequency spectrum (up to 8 KHz). After retraining the models, we deployed in GHC. However, testing was not possible because the transmission protocol between the devices was TCP instead of UDP. On top of that, we still did not have the clearance to analyze real calls and have been waiting for the agreement forms to be signed.

5.1 Future Work

Even though the use of an unsupervised technique removes the labeling task, a semi-supervised technique might achieve higher performance, on the grounds that the current unsupervised approach is not only focused on the audio classification task but also on clustering the data by their different attributes (e.g., male, female, pitch, genre, bass, treble, among others). When an audio signal is rich in information, unsupervised approaches can get lost or not perform as expected.
However, the use of a supervised or semi-supervised approach implies the creation of labels, which is tedious and time-consuming, but it can achieve higher performance because the task of interest is directly specified.
Another approach is to modify the loss function to include more cues related to the audio signal attributes, so the audio classification task can improve its performance without the need to add labels. Regarding the deployment, the system is sensitive to the frequency content because the extracted features (SIFs and MFCCs) are based on it. Good DL models need to be trained with data that matches the extracted features. Hence, future work comprises retraining the models using data directly output by the application.
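A quick diagnostic for the 4 KHz cut-off encountered in PBC is to measure how much spectral energy survives above the cut-off frequency; this helper is an illustrative sketch, not part of the deployed AFAE code:

```python
import numpy as np

def highband_energy_ratio(signal, sample_rate, cutoff_hz=4000.0):
    """Fraction of the signal's spectral energy at or above cutoff_hz.
    A near-zero ratio on audio expected to span up to 8 KHz suggests
    the telephony path is discarding the upper band."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = spectrum.sum()
    return float(spectrum[freqs >= cutoff_hz].sum() / total) if total > 0 else 0.0
```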
Abbreviations and acronyms
Table A.1 Abbreviations
Abbreviation Description
AI Artificial Intelligence
ML Machine Learning
DL Deep Learning
ANN Artificial Neural Network
DNN Deep Neural Networks
CNN Convolutional Neural Network
RNN Recurrent Neural Network
DBN Deep Belief Network
RBM Restricted Boltzmann Machine
S2S Sequence-to-Sequence Model
SA Sequence-to-Sequence Autoencoder Model
MFCC Mel-Frequency Cepstral Coefficient
MD5 Message Digest 5
CRC Cyclic Redundancy Checking
DSH Data Sensitive Hashing
MSE Mean Square Error
M Mean Square Error Loss
MH Mean Square Error Loss + Hash Loss
MHE Mean Square Error Loss + Hash Loss + Bitwise Entropy Loss
PBC Palm Beach County
GHC Greater Harris County
DHS Department of Homeland Security
EOC Emergency Operation Center
RTP Real-time Transport Protocol
IP Internet Protocol
VoIP Voice over Internet Protocol
PBX Private Branch Exchange
VPN Virtual Private Network
TTY TeleTYpewriter
CPU Central Processing Unit
GPU Graphics Processing Unit
E911 Enhanced 9-1-1
NG911 Next Generation 9-1-1
TDDoS Telephony Distributed Denial-of-Service
AFAE Audio Forensic Analysis Engine
EER Equal Error Rate
w.r.t. with respect to
Table A.2 Acronyms
Acronym Description
SHA Secure Hash Algorithm
MIR Music Information Retrieval
SIP Session Initiation Protocol
PSAP Public Safety Answering Point
NENA National Emergency Number Association
NTIA National Telecommunications and Information Administration
ALI Automatic Location Identification
ANI Automatic Number Identification
pANI Pseudo Automatic Number Identification
SIF Spectrogram Image Features
ReLU Rectified Linear Unit
RAM Random Access Memory
Variables and Symbols
Table B.1 Variables and Symbols
Variable Description Units
𝒙 The whole sequential data ---
𝒕 Time step ---
𝑻 Total number of time steps ---
𝒙𝒕 Sequential data capture at time step t ---
𝑪 Cell state ---
𝒉 Hidden state ---
𝑺 Full state (cell and hidden state together) ---
𝑪𝒕−𝟏 Cell state previous to time step t ---
𝒉𝒕−𝟏 Hidden state previous to time step t ---
𝑺𝒕−𝟏 Full state previous to time step t ---
𝑪𝒕 Updated cell state at time step t ---
𝒉𝒕 Updated hidden state at time step t ---
𝑺𝒕 Full state at time step t ---
𝒇𝒕 Forget gate at time step t ---
𝒊𝒕 Input gate at time step t ---
𝒐𝒕 Output gate at time step t ---
𝑪̃𝒕 Cell state candidate at time step t ---
𝑾𝒇,𝒊,𝒐,𝑪 Weights of the forget, input, and output gates as well as the cell state ---
𝒃𝒇,𝒊,𝒐,𝑪 Biases of the forget, input, and output gates as well as the cell state ---
𝝈 Sigmoid Function [0,1]
𝒕𝒂𝒏𝒉 Hyperbolic Tangent Function [-1,1]
𝓵 Layer ---
𝓵 − 𝟏 Previous Layer ---
r Raw Audio Signal ---
𝒓𝒑𝒑 Preprocessed Raw Audio Signal ---
𝒓𝒏𝒐𝒓𝒎 Normalized preprocessed raw audio signal ---
𝑨 Set of A audio signals ---
𝑴 Number of MFCC coefficients ---
𝑰 Number of MFCC blocks an audio signal is split into ---
ℝ The Set of Real Numbers ---
𝑿 Extracted features from 𝒓𝒏𝒐𝒓𝒎 ---
𝑿𝒂 Extracted features from the a-th audio signal ---
𝒙𝒊𝒂 Extracted features from the a-th audio signal in the i-th MFCC block ---
𝒙𝒊,𝒕𝒂 Extracted features from the a-th audio signal in the i-th MFCC block at time step t ---
𝝂 Bit state {−𝟏, 𝟏}
𝒃 The 𝒃-th bit bit
𝑩 Total number of bits bit
𝑯𝒊,𝑻𝒂 Final representation of the audio fingerprint from the a-th audio signal in the i-th MFCC block {−𝟏, 𝟏}𝑩
𝓗 Hamming Space {−𝟏, 𝟏}𝑩
ℝ𝑩 ↦ 𝓗𝑩 Mapping of the space from a Real space with B dimensions to a Hamming space with B dimensions ---
𝐜𝐨𝐬_𝒔𝒊𝒎 Cosine Similarity ---
MSE Mean Square Error ---
M Mean Square Error Loss ---
𝜶 Weight for the Mean Square Error Loss ---
H Hash Loss ---
𝜷 Weight for the Hash Loss ---
E Bitwise Entropy Loss ---
𝜸 Weight for the Bitwise Entropy Loss ---
MH Mean Square Error Loss + Hash Loss ---
MHE Mean Square Error Loss + Hash Loss + Bitwise Entropy Loss ---
K Total number of algorithms/models (Nemenyi Test) ---
N Total number of Databases (Nemenyi Test) ---
𝒒𝜶 Critical Value ---
CD Critical Difference ---
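The ℝ𝑩 ↦ 𝓗𝑩 mapping in Table B.1 amounts to taking the sign of each hash component. A minimal sketch (treating zero as +1, which is an assumption about the boundary case):

```python
import numpy as np

def to_hamming(h):
    """Binarize a real-valued hash h in R^B into {-1, +1}^B."""
    return np.where(np.asarray(h) >= 0.0, 1, -1)

def hamming_distance(H1, H2):
    """Number of differing bits between two {-1, +1}^B fingerprints."""
    return int(np.sum(H1 != H2))
```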
Speech Transformations

Table C.1 describes the transformations applied over the VoxCeleb1 dataset and their parameters. These transformations were chosen based on the literature on existing audio fingerprinting techniques, to assess the robustness of the SA model in handling compression, distortion, and interference.
Table C.1 Speech Transformations
Transformation Description
Truncate Silence Removes silence (less than -60 dB)
Noise0.05 Adds white noise with an amplitude of 5% to the audio
Echo Repetition of the audio with a delay of 1 second and a decay factor of 0.5
Reverb Adds reverberation with room size of 75%, reverberance of 50%, damping of 50%, wet gain of -1 dB, stereo depth of 100%, and pre-delay of 10 ms
High Pass Filter Attenuates frequencies below 1 KHz
Low Pass Filter Attenuates frequencies above 3 KHz
Reverse Reverses the audio
Pitch 0.5 Pitch reduction to 50% of the original
Pitch 0.9 Pitch reduction to 90% of the original
Pitch 1.1 Pitch increase to 110% of the original
Pitch 1.5 Pitch increase to 150% of the original
Speed 0.5 Speed reduction to 50% of the original
Speed 0.9 Speed reduction to 90% of the original
Speed 0.95 Speed reduction to 95% of the original
Speed 1.05 Speed increase to 105% of the original
Speed 1.1 Speed increase to 110% of the original
Speed 1.5 Speed increase to 150% of the original
Tempo 0.5 Tempo reduction to 50% of the original
Tempo 0.9 Tempo reduction to 90% of the original
Tempo 1.1 Tempo increase to 110% of the original
Tempo 1.5 Tempo increase to 150% of the original
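Several of these transformations map directly onto SoX effects. The command builder below is a sketch: the exact tool and parameters used to generate Table C.1 are not specified here, and note that SoX's pitch effect takes cents, i.e. 1200·log2(factor):

```python
import math

def sox_command(infile, outfile, transform, factor=None):
    """Builds a SoX command line for a few of the Table C.1 transformations."""
    if transform == 'Pitch':
        effect = ['pitch', str(round(1200 * math.log2(factor)))]
    elif transform in ('Speed', 'Tempo'):
        effect = [transform.lower(), str(factor)]
    elif transform == 'HighPassFilter':
        effect = ['highpass', '1000']   # attenuate below 1 KHz
    elif transform == 'LowPassFilter':
        effect = ['lowpass', '3000']    # attenuate above 3 KHz
    elif transform == 'Reverse':
        effect = ['reverse']
    else:
        raise ValueError('unsupported transformation: ' + transform)
    return ['sox', infile, outfile] + effect
```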
9-1-1 Test Scripts

= Script 1 =
Call-taker: 9-1-1 what are you reporting?
Caller: I’d like to report a collision between a cyclist and a vehicle.
Call-taker: Where?
Caller: Hmm… Northeast from the Beltway just south of 40 Northeast 45th street
Call-taker: Are you involved in the accident?
Caller: Yeah, I was the cyclist
Call-taker: Ok, do you need a medic or an ambulance?
Caller: Hmm… I don’t think so
Call-taker: Is the driver of the vehicle still there with you?
Caller: Yeah…
Call-taker: You said you weren’t blocking the roadway
Caller: Oh… no…
= Script 2 =
Call-taker: 9-1-1, what are you reporting?
Caller: I’m reporting a drunk driver
Call-taker: Where at?
Caller: At fourteenth and aaah… it just hit the guardrail and it’s still driving it’s like
driving on the curb
Call-taker: Fourteenth and where?
Caller: Hmmm… fourteenth and aaah… Massachusetts
Call-taker: Ok, fourteenth Avenue South South Massachusetts. Is that correct?
Which direction is he going?
Caller: Eeeh… it’s going eeeh… just eeeh it’s going down fourteenth
Call-taker: Ok, and it jumped the curb
Caller: Yeah, and now like totally just hit a guardrail and I’m like… it’s still going,
though
Python Script of the SAMAF methodology for the VoxCeleb1 subset dataset
# v3.10.5

# Changes from v3.10.4
# > Added accuracy data pipelining
# >

import os
import sys
import math
import time
import shutil
import random
import numpy as np
import tensorflow as tf
from collections import defaultdict
import multiprocessing as mp
from functools import partial
import mysql.connector as mysqlc
from mysql.connector import errorcode
# from tensorflow.contrib.seq2seq.python.ops import decoder
from tensorflow.python.ops import variables

##### EDIT ME EVERY RUN!

DATABASE_PREFIX = 'testBaez'
# DATABASE_PREFIX = 'ELSDSRv2'

RUN_ID_MODEL = '2L_ST256_MSE+Hash+Entropy'

PRETRAINED = False  # [True | False] Train from scratch or with a pretrained model
LOG_DIR_MODELS_ROOT = '/mnt/'  # Only for pretrained models

RUN_ID = 'testBaez/' + str(int(time.time()))

##### Parameters

### Changable Parameters

STATE_SIZE = 256
NUM_LAYERS = 2

INCLUDE_HASH_LOSS = True
INCLUDE_ENTROPY_LOSS = True

TOWERS_LIST = [0, 1]
LOG_DIR = '/home/AudioFingerprinting/log/s2s/' + RUN_ID
LOG_DIR_MODELS = LOG_DIR_MODELS_ROOT + 'AudioFingerprinting/log/s2s/ELSDSR/' + RUN_ID_MODEL

MODEL = '221'  # Define the model to be loaded
# MODEL = '0' | '230' | '245'  # Define the model to be loaded

DO_CREATE_DB = True
DO_CREATE_TABLE = True

DO_SAVE_VISUALIZATION = False
DO_EVALUATION = False
DO_SAVE_HASHES = False

### Fixed Parameters

BATCH_SIZE = 32  # must be even (to split between 2 gpus)
GPU_BATCH_SIZE = int(BATCH_SIZE / 2)
# ^STATE_SIZE = 32  # Select from: [64, 128, 256]
NUM_EPOCHS = 1000  # How many epochs to run?
NUM_EPOCHS_VAL = 1  # Evaluate with the validation set every NUM_EPOCHS_VAL epochs
NUM_FEATS = 13  # Either 13 or 20
NUM_STEPS = 100  # This CAN NOT be changed.

# ^NUM_LAYERS = 1
SHARE_WEIGHTS = True
# ^TOWERS_LIST = [0, 1]
USE_PEEPHOLES = False

INITIAL_LEARNING_RATE = 0.01
DECAY_RATE = 1.0  # Decay rate for the learning rate
NUM_DECAY_STEP = 1.0  # Total number of times the decay rate is applied in 'NUM_EPOCHS' epochs
# USE_ADAGRAD = True  # True = Adagrad, False = SGD << UNUSED
# MOMENTUM = 0.9  # only for SGD << UNUSED

DATA_SUBSET = 0.02  # percent of dataset to use
DATA_SUBSET_sub = 0.01  # percent of dataset for VAL/ACC

LOSS_ALPHA = 1  # MSE Loss

# ^INCLUDE_HASH_LOSS = False  # include in loss function?
SHOW_HASH_LOSS = INCLUDE_HASH_LOSS or True  # if not included in loss function, show behavior?
HASH_LOSS_USES_SXH = 0  # 0 for S, 1 for X, 2 for H
HASH_LOSS_USES_CONSTANT_Sij = False
HASH_LOSS_NORMALIZED = True  # This SHOULD NOT be changed.
LOSS_BETA = 1  # Hash Loss

# ^INCLUDE_ENTROPY_LOSS = False  # include in loss function?
SHOW_ENTROPY_LOSS = INCLUDE_ENTROPY_LOSS or True  # if not included in loss function, show behavior?
LOSS_GAMMA = 1  # Bitwise Entropy Loss

NUM_HASHES_DICT = 1  # Number of adjacent hashes used to build the dictionary key
NUM_HASHES_CLASSIFICATION = 1  # Number of adjacent hashes to do a query (1 Hash = 1 sec, 9 Hashes = 3 secs)
NUM_CLASSIFICATION_POOLS = 32
CLASSIFICATION_ACCURACY_SUBSET_SIZE = 1.0

DO_ACCURACY = True
DO_EPOCH_ZERO = True

DISPLAY_COUNT = 50  # Show preliminary metrics every DISPLAY_COUNT batches
SAVE_VIS_EVERY_HOURS = 6
SAVE_MODEL_EVERY_HOURS = 6
SAVE_WITH_LOSS = True
SAVE_WITH_ACCURACY = True
SAVE_WITH_TRANSFORMATION = True
SAVE_ALL_MODELS = False
MAX_TO_KEEP = NUM_EPOCHS  # Maximum number of Models to be saved

tf.logging.set_verbosity('ERROR')
LOG_DEVICE_PLACEMENT = False

# DATA_LABELS_PATH = '/home/AudioFingerprinting/data/labels_VoxCeleb_drill1.csv'  # VoxCeleb
# DATA_LABELS_PATH = '/home/AudioFingerprinting/data/labels_VoxCeleb_drill2.csv'  # VoxCeleb
DATA_LABELS_PATH = '/home/AudioFingerprinting/data/labels_VoxCeleb_Subset_Complete.csv'  # VoxCeleb
DATABASE_PATH = "/mnt/AudioFingerprinting/data/VoxCeleb/Subset_tiny/13/"
AUDIO_DATABASE_PATH = "/mnt/AudioFingerprinting/data/VoxCeleb/Subset_pp/"

DATA_TRN = 0.6  # Just for clarification, not used
DATA_VAL = 0.2  # Just for clarification, not used
DATA_TST = 0.2  # Just for clarification, not used

""" Keep as Legacy """
# TRANSFORMATIONS = ['Echo', 'HighPassFilter', 'LowPassFilter', 'Noise0.05', 'Pitch0.5',
#     'Pitch0.9', 'Pitch1.1', 'Pitch1.5', 'Reverb', 'Reverse', 'Speed0.5',
#     'Speed0.9', 'Speed0.95', 'Speed1.05', 'Speed1.1', 'Speed1.5', 'Tempo0.5',
#     'Tempo0.9', 'Tempo1.1', 'Tempo1.5', 'TruncateSilence']
# SPEAKERS_MALE = ['MASM', 'MCBR', 'MFKC', 'MKBP', 'MLKH', 'MMLP', 'MMNA', 'MNHP',
#     'MOEW', 'MPRA', 'MREM', 'MRKO', 'MTLS']
# SPEAKERS_FEMALE = ['FAML', 'FDHH', 'FEAB', 'FHRO', 'FJAZ', 'FMEL', 'FMEV', 'FSLJ',
#     'FTEJ', 'FUAN']
# SENTENCES = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
# SPEAKERS = SPEAKERS_MALE
# SPEAKERS.extend(SPEAKERS_FEMALE)
# MERGED_ATTRIBUTES = [(SPEAKERS[i], SENTENCES[j]) for i in range(0, len(SPEAKERS)) for j in range(0, len(SENTENCES))]

# DATA_TRN = [('FMEL', 'b'), ('MCBR', 'e'), ('FTEJ', 'b'), ('FDHH', 'e'), ('FSLJ', 'b'),
#     ('FHRO', 'b'), ('MCBR', 'f'), ('FHRO', 'e'), ('MLKH', 'd'), ('MASM', 'a'),
#     ('FTEJ', 'd'), ('FMEL', 'f'), ('FSLJ', 'e'), ('MMNA', 'd'), ('FHRO', 'd'),
#     ('FHRO', 'c'), ('MMNA', 'e'), ('MREM', 'd'), ('MMLP', 'd'), ('MLKH', 'f'),
#     ('MOEW', 'f'), ('FSLJ', 'f'), ('MCBR', 'b'), ('FMEV', 'd'), ('MFKC', 'e'),
#     ('MMNA', 'f'), ('FMEV', 'f'), ('FEAB', 'b'), ('FEAB', 'c'), ('MASM', 'b'),
#     ('MCBR', 'c'), ('MPRA', 'd'), ('FJAZ', 'f'), ('MOEW', 'e'), ('FDHH', 'g'),
#     ('MMLP', 'b'), ('MASM', 'g'), ('MPRA', 'c'), ('FEAB', 'd'), ('MKBP', 'g'),
#     ('FMEL', 'd'), ('MFKC', 'a'), ('FAML', 'g'), ('FDHH', 'a'), ('FAML', 'c'),
#     ('MNHP', 'e'), ('FTEJ', 'e'), ('MNHP', 'f'), ('MFKC', 'b'), ('MKBP', 'b'),
#     ('MCBR', 'd'), ('MFKC', 'd'), ('FMEV', 'g'), ('FAML', 'a'), ('MLKH', 'g'),
#     ('FMEL', 'c'), ('FSLJ', 'd'), ('FJAZ', 'a'), ('MTLS', 'a'), ('MLKH', 'b'),
#     ('MLKH', 'c'), ('FEAB', 'a'), ('MMLP', 'g'), ('FSLJ', 'c'), ('MTLS', 'g'),
#     ('MREM', 'a'), ('MLKH', 'e'), ('MASM', 'f'), ('FAML', 'e'), ('FUAN', 'd'),
#     ('MPRA', 'f'), ('MFKC', 'c'), ('MRKO', 'g'), ('MRKO', 'd'), ('MCBR', 'g'),
#     ('MRKO', 'c'), ('MRKO', 'a'), ('MOEW', 'c'), ('FTEJ', 'f'), ('FMEV', 'c'),
#     ('FDHH', 'b'), ('FTEJ', 'a'), ('FMEL', 'g'), ('MNHP', 'c'), ('MFKC', 'f'),
#     ('MTLS', 'f'), ('MPRA', 'b'), ('MKBP', 'c'), ('FSLJ', 'g'), ('MCBR', 'a'),
#     ('MOEW', 'a'), ('FJAZ', 'b'), ('MRKO', 'e'), ('MKBP', 'a'), ('MREM', 'f'),
#     ('FHRO', 'f'), ('MREM', 'g')]
# DATA_VAL = [('FAML', 'f'), ('FMEV', 'a'), ('MREM', 'c'), ('FJAZ', 'd'), ('FDHH', 'c'),
#     ('MMLP', 'e'), ('MOEW', 'g'), ('FMEV', 'b'), ('MOEW', 'd'), ('MMLP', 'f'),
#     ('MNHP', 'a'), ('MASM', 'e'), ('FJAZ', 'c'), ('FAML', 'b'), ('FUAN', 'g'),
#     ('MREM', 'e'), ('FAML', 'd'), ('MASM', 'd'), ('MFKC', 'g'), ('MRKO', 'b'),
#     ('MLKH', 'a'), ('MPRA', 'g'), ('MMNA', 'b'), ('MMLP', 'a'), ('MTLS', 'b'),
#     ('MRKO', 'f'), ('MKBP', 'e'), ('FEAB', 'f'), ('FEAB', 'g'), ('MTLS', 'c'),
#     ('MKBP', 'd'), ('FHRO', 'g')]
# DATA_TST = [('FEAB', 'e'), ('MREM', 'b'), ('FMEL', 'e'), ('MMLP', 'c'), ('MPRA', 'a'),
#     ('MOEW', 'b'), ('FUAN', 'f'), ('MTLS', 'd'), ('FSLJ', 'a'), ('FUAN', 'c'),
#     ('MKBP', 'f'), ('MNHP', 'b'), ('FUAN', 'b'), ('FDHH', 'f'), ('FDHH', 'd'),
#     ('MTLS', 'e'), ('FMEV', 'e'), ('FUAN', 'a'), ('FJAZ', 'g'), ('MNHP', 'g'),
#     ('FHRO', 'a'), ('MASM', 'c'), ('MMNA', 'c'), ('FUAN', 'e'), ('FTEJ', 'c'),
#     ('FJAZ', 'e'), ('FTEJ', 'g'), ('MMNA', 'a'), ('MMNA', 'g'), ('MPRA', 'e'),
#     ('MNHP', 'd'), ('FMEL', 'a')]
""" Keep as Legacy """

# Parameter validation and manipulation
BATCH_SIZE = int(BATCH_SIZE / 2) * 2
if not (HASH_LOSS_USES_SXH >= 0 and HASH_LOSS_USES_SXH <= 2):
    print("INVALID HASH_LOSS_USES_SXH")
    exit(-10)
SAVE_VIS_EVERY_SECONDS = SAVE_VIS_EVERY_HOURS * 60 * 60
SAVE_MODEL_EVERY_SECONDS = SAVE_MODEL_EVERY_HOURS * 60 * 60

# Database Info
USERNAME = 'root'
PASSWORD = 'i2ccloud'


##### Definition

class Logger:
    def __init__(self, log_file_path):
        self.console = sys.stdout
        self.log_file_path = log_file_path
        self.file = open(log_file_path, 'a')

    def write(self, text):
        self.console.write(text)
        self.file.write(text)

    def flush(self):
        self.console.flush()
        self.file.flush()

    def close(self):
        self.console.close()
        self.file.close()


class DatabaseConverter(mysqlc.conversion.MySQLConverter):
    def _float32_to_mysql(self, value):
        return float(value)

    def _float64_to_mysql(self, value):
        return float(value)

    def _int32_to_mysql(self, value):
        return int(value)

    def _int64_to_mysql(self, value):
        return int(value)


class Database:
    def __init__(self):
        try:
            self.cnx = mysqlc.connect(user=USERNAME, password=PASSWORD)
            self.cnx.set_converter_class(DatabaseConverter)
            self.cursor = self.cnx.cursor()
        except mysqlc.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print('Error: SQL Access Denied, Invalid Usr/Pwd')
                raise Exception('Error: SQL Access Denied, Invalid Usr/Pwd')
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print('Error: SQL, Bad Database')
                raise Exception('Error: SQL, Bad Database')
            else:
                print('Error: SQL, {}'.format(err))
                raise Exception('Error: SQL, {}'.format(err))

    def CloseCursor(self):
        print('Entered Database.CloseCursor')
        self.cursor.close()
        self.cnx.close()
class DB_Fingerprinting(Database):
    def __init__(self):
        # print('Entered DB_Fingerprinting.__init__')
        Database.__init__(self)

    def db_exists(self, db_name):
        # print('Entered DB_Fingerprinting.db_exists')
        qry = 'SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = "{}";'.format(db_name)
        self.cursor.execute(qry)

        row_counter = 0
        for res in self.cursor:
            row_counter += 1

        if row_counter == 1:
            return True
        else:
            return False

    def create_DB(self, db_name):
        print('Creating Database: {}'.format(db_name))
        qry = 'CREATE DATABASE {};'.format(db_name)
        self.cursor.execute(qry)

    def drop_DB(self, db_name):
        print('Dropping Database: {}'.format(db_name))
        qry = 'DROP DATABASE {}'.format(db_name)
        self.cursor.execute(qry)

    def table_exists(self, db_name, table_name):
        print('Entered DB_Fingerprinting.table_exists')
        qry = 'SHOW TABLES FROM {} LIKE "{}";'.format(db_name, table_name)
        self.cursor.execute(qry)

        row_counter = 0
        for res in self.cursor:
            row_counter += 1

        if row_counter == 1:
            return True
        else:
            return False

    def create_table(self, db_name, table_name):
        print('')
        print('Creating Table {}.{}'.format(db_name, table_name))
        qry = 'CREATE TABLE {}.{} (Transformation char(25), Speaker char(7) NOT NULL, Video char(11) NOT NULL, AudioName char(5), Gender char(1), Segment SMALLINT UNSIGNED, ' \
              'Hash blob);'.format(db_name, table_name)
        self.cursor.execute(qry)

    def drop_table(self, db_name, table_name):
        print('Dropping Table {}.{}'.format(db_name, table_name))
        qry = 'DROP TABLE {}.{};'.format(db_name, table_name)
        self.cursor.execute(qry)

    # def insert_into_table(self, db_name, table_name, data, data_hash):
    #     # data: [Speaker, Sentence, Transformation, Segment]
    #     # data_hash: [Hash]
    #
    #     HashBin = np.where(np.array(data_hash) >= 1.0, 1, 0)
    #     # print('HashBin: {}'.format(HashBin))
    #
    #     # Converting Hashes into Hexadecimal (4-bit length encoding)
    #     bits_str = ''
    #     hash_hexa = ''
    #     hash_i_char = ['%s' % bit_int for bit_int in HashBin]  # From integer to character
    #     for bit_char in hash_i_char:  # Creating a string
    #         bits_str += bit_char
    #     for index in range(0, len(bits_str), 4):
    #         # Conversion to Hexadecimal
    #         hash_hexa += hex(int(bits_str[index:index + 4], 2)).split('x')[-1]
    #
    #     # print('INSERT INTO {}.{}: {} | {} | {} | {}'.format(db_name, table_name, data[0], data[3], data[2], hash_hexa))
    #     qry = 'INSERT INTO {}.{} (Speaker, Sentence, Transformation, Segment, Hash) VALUES (%s, %s, %s, %s, UNHEX(%s))'.format(
    #         db_name, table_name)
    #     self.cursor.execute(qry, (data[0], str(data[3]), data[2], hash_hexa))
    #     self.cnx.commit()

    def duplicates(self, db_name, table_name, qryhash):
        duplicates = []

        qry = 'SELECT Transformation, Speaker, Video, AudioName, Gender, Segment, BIT_COUNT(UNHEX(%s)^Hash) AS HD FROM {}.{} ORDER BY HD ASC'.format(
            db_name, table_name)
        self.cursor.execute(qry, (qryhash,))

        for (transformation, speaker, video, audioname, gender, segment, hd) in self.cursor:
            duplicates.append((str(transformation), str(speaker), str(video), str(audioname), str(gender), int(segment), int(hd)))

        return duplicates


def combine_gradients(tower_grads):
    # see documentation here:
    # https://github.com/tensorflow/models/blob/70b894bdf74c4deedafbbdd70c2454162837d5d2/tutorials/image/cifar10/cifar10_multi_gpu_train.py#L101
    # currently set for sum, not average
    # to change, switch between reduce_mean and reduce_sum

    average_grads = []
    # for grad_and_self.vars_ in zip(*tower_grads):
    for grad_and_vars_ in zip(*tower_grads):
        grads = []
        # for g, _ in grad_and_self.vars_:
        for g, _ in grad_and_vars_:
            expanded_g = tf.expand_dims(g, 0)
            grads.append(expanded_g)
        grad = tf.concat(axis=0, values=grads)
        grad = tf.reduce_mean(grad, 0)
        # v = grad_and_self.vars_[0][1]
        v = grad_and_vars_[0][1]
        grad_and_var = (grad, v)
        average_grads.append(grad_and_var)
    return average_grads


class Data:

    # Loads the labels per set from a csv file --> DATA_LABELS_PATH
    def load_labels():

        lbls_trn = []
        lbls_val = []
        lbls_tst = []

        with open(DATA_LABELS_PATH, "r") as lf:
            lines = lf.readlines()[1:]  # Skipping first line (headers)
            for line in lines:
                line = line[:-1].split(",")  # Removing '\n' character

                if line[-1] == 'TRN':
                    lbls_trn.append(line[:-1])
                elif line[-1] == 'VAL':
                    lbls_val.append(line[:-1])
                elif line[-1] == 'TST':
                    lbls_tst.append(line[:-1])
                else:
                    sys.exit('Set name must be: TRN | VAL | TST ')

        # print(len(lines))
        # print(lines[0:10])

        # print('')
        # print('Labelst Trn: ', len(lbls_trn))
        # print(lbls_trn[0:10])

        # print('')
        # print('Labelst Validation: ', len(lbls_val))
        # print(lbls_val[0:10])

        # print('')
        # print('Labelst Test: ', len(lbls_tst))
        # print(lbls_tst[0:10])

        return lbls_trn, lbls_val, lbls_tst

    # Loads the labels per set from a csv file --> DATA_LABELS_PATH
    def load_labels_orig():

        lbls_trn_orig = []
        lbls_val_orig = []
        lbls_tst_orig = []

        with open(DATA_LABELS_PATH, "r") as lf:
            lines = lf.readlines()[1:]  # Skipping first line (headers)
            for line in lines:
                line = line[:-1].split(",")  # Removing '\n' character
                # print(line)
                # print(line[0])
                # print(line[-1])
                if line[-1] == 'TRN' and line[0] == 'Original':
                    lbls_trn_orig.append(line[:-1])
                elif line[-1] == 'VAL' and line[0] == 'Original':
                    lbls_val_orig.append(line[:-1])
                elif line[-1] == 'TST' and line[0] == 'Original':
                    lbls_tst_orig.append(line[:-1])
                else:
                    pass
                    # sys.exit('Set name must be: TRN_orig | VAL_orig | TST_orig ')

        return lbls_trn_orig, lbls_val_orig, lbls_tst_orig

    def addTrans(labels):
        lbls_all = []

        transformations = ['Echo', 'Tempo1.5', 'Reverb', 'Noise0.05', 'Pitch1.5',
                           'Speed0.9', 'Pitch0.9', 'HighPassFilter', 'LowPassFilter', 'Speed0.5',
                           'Speed1.5', 'Pitch1.1', 'Tempo0.9', 'Speed1.1', 'Speed0.95', 'Tempo1.1',
                           'Tempo0.5', 'TruncateSilence', 'Speed1.05', 'Pitch0.5', 'Reverse']

        for labels_i in labels:
            lbls_all.append(labels_i)
            for t in transformations:
                lbls_all.append([t if labels_i[x] == 'Original' else labels_i[x] for x in range(len(labels_i))])

        # print(len(lbls_all))
        # print(lbls_all)

        return lbls_all

    def load_segment_list(labels):

        segments = []
        samples = []

        for l, label in enumerate(labels):
            i = 0
            EXPECTED_PATH = os.path.join(DATABASE_PATH[:-1], '/'.join(label[:-1]) + "_" + str(i) + ".csv")
            while os.path.isfile(EXPECTED_PATH):
                segments.append([l, i, EXPECTED_PATH])
                if i >= 75:
                    samples.append(
                        [l, i - 75, (os.path.join(DATABASE_PATH[:-1], '/'.join(label[:-1]) + "_" + str(i - 75) + ".csv")),
                         (os.path.join(DATABASE_PATH[:-1], '/'.join(label[:-1]) + "_" + str(i - 50) + ".csv")),
                         (os.path.join(DATABASE_PATH[:-1], '/'.join(label[:-1]) + "_" + str(i - 25) + ".csv")),
                         (os.path.join(DATABASE_PATH[:-1], '/'.join(label[:-1]) + "_" + str(i) + ".csv")),
                         label[0],
                         label[1],
                         label[2],
                         label[3],
                         label[4]])
466.                 i += 25
467.                 EXPECTED_PATH = os.path.join(DATABASE_PATH[:-1], '/'.join(label[:-1]) + "_" + str(i) + ".csv")
468.         return segments, samples
469.
470.     # Generate dataset distribution for sentences
471.     def gen_set_sentences(SENT):
472.         # Training ----> 5
473.         # Validation --> 1
474.         # Testing -----> 1
475.
476.         slen = len(SENT)
477.
478.         srand = random.sample(range(0, slen), slen)
479.
480.         training_set = [SENT[i] for i in srand[0:slen - 3]]
481.         validation_set = [SENT[srand[slen - 2]]]
482.         testing_set = [SENT[srand[slen - 1]]]
483.
484.         return training_set, validation_set, testing_set
485.
486.     # Generate dataset distribution for speakers
487.     def gen_set_speakers(MALE, FEMALE):
488.         # Training ----> Male(8) ; Female(6)
489.         # Validation --> Male(3) ; Female(2)
490.         # Testing -----> Male(2) ; Female(2)
491.
492.         mlen = len(MALE)
493.         flen = len(FEMALE)
494.
495.         Mrand = random.sample(range(0, mlen), mlen)
496.         Frand = random.sample(range(0, flen), flen)
497.
498.         Training_Set = [MALE[i] for i in Mrand[0:8]]
499.         Training_Set = Training_Set + [FEMALE[i] for i in Frand[0:6]]
500.
501.         Validation_Set = [MALE[i] for i in Mrand[8:11]]
502.         Validation_Set = Validation_Set + [FEMALE[i] for i in Frand[6:8]]
503.
504.         Testing_Set = [MALE[i] for i in Mrand[11:13]]
505.         Testing_Set = Testing_Set + [FEMALE[i] for i in Frand[8:10]]
506.         return Training_Set, Validation_Set, Testing_Set
507.
508.     def gen_set_attribute(ATTR):
509.         slen = len(ATTR)
510.
511.         tst_len = int(slen * 0.2)
512.         val_len = int(slen * 0.2)
513.         trn_len = slen - val_len - tst_len
514.
515.         srand = random.sample(range(0, slen), slen)
516.
517.         training_set = [ATTR[i] for i in srand[0:trn_len]]
518.         validation_set = [ATTR[i] for i in srand[trn_len:trn_len + val_len]]
519.         testing_set = [ATTR[i] for i in srand[trn_len + val_len:trn_len + val_len + tst_len + 1]]
520.
521.         return training_set, validation_set, testing_set
522.
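As a quick illustration of the shuffled, roughly 60/20/20 partitioning that gen_set_attribute performs above, the same logic can be sketched in plain Python (the helper name split_60_20_20 and its seed parameter are illustrative additions, not part of the codebase):

```python
import random

def split_60_20_20(items, seed=0):
    # Shuffle the indices, then carve off ~20% for test, ~20% for
    # validation, and the remaining ~60% for training -- the same
    # proportions as gen_set_attribute above.
    rng = random.Random(seed)
    idx = rng.sample(range(len(items)), len(items))
    tst_len = int(len(items) * 0.2)
    val_len = int(len(items) * 0.2)
    trn_len = len(items) - val_len - tst_len
    trn = [items[i] for i in idx[:trn_len]]
    val = [items[i] for i in idx[trn_len:trn_len + val_len]]
    tst = [items[i] for i in idx[trn_len + val_len:]]
    return trn, val, tst
```

Because the two 20% cuts are rounded down, the training partition absorbs any remainder, so the three parts always cover every item exactly once.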
523.     # Get list of files in trn,val,tst
524.     def gen_file_list_breakdown_speaker(trn, val, test):
525.
526.         trn_files = []
527.         val_files = []
528.         test_files = []
529.
530.         with open(DATA_INDEX, 'r') as File:
531.             infoFile = File.readlines()  # Reading all the lines from File
532.             for line in infoFile:  # Reading line-by-line
533.                 words = line.split()  # Splitting lines in words using space character as separator
534.                 if words[2] in trn:
535.                     trn_files.append(words)
536.                 elif words[2] in val:
537.                     val_files.append(words)
538.                 else:
539.                     test_files.append(words)
540.
541.         return trn_files, val_files, test_files
542.
543.     # Get list of files in trn,val,tst
544.     def gen_file_list_breakdown_attribute(trn, val, tst):
545.
546.         trn_files = []
547.         val_files = []
548.         test_files = []
549.
550.         with open(DATA_INDEX, 'r') as File:
551.             infoFile = File.readlines()  # Reading all the lines from File
552.             for line in infoFile:  # Reading line-by-line
553.                 words = line.split()  # Splitting lines in words using space character as separator
554.                 found = False
555.                 for e in trn:
556.                     if words[2] == e[0] and words[3] == e[1]:
557.                         found = True
558.                         trn_files.append(words)
559.                         break
560.
561.                 if not found:
562.                     for e in val:
563.                         if words[2] == e[0] and words[3] == e[1]:
564.                             found = True
565.                             val_files.append(words)
566.                             break
567.
568.                 if not found:
569.                     for e in tst:
570.                         if words[2] == e[0] and words[3] == e[1]:
571.                             found = True
572.                             test_files.append(words)
573.                             break
574.
575.                 if not found:
576.                     print('ERROR NOT FOUND: gen_file_list_breakdown_attribute(trn,val,tst)')
577.                     print(words)
578.                     exit(-10)
579.
580.         return trn_files, val_files, test_files
581.
582.     # Get list of pairs (filename, initial index of block) [block length is 100]
583.     def gen_mapping(files):
584.         mapList = []
585.
586.         for f in files:
587.             # print(f[0])
588.             i = 0
589.             while True:
590.                 path = f[0] + '_' + str(i) + '.csv'
591.                 if os.path.exists(path):
592.                     if i >= 75:  # (k * 25) + [0, 25, 50, 75] make up a block
593.                         mapList.append([f[0], i - 75, f[1], f[2], f[3]])
594.                 else:
595.                     break
596.                 i += 25
597.         return mapList
598.
599.     # Get batches of BATCH_SIZE from mapped files
600.     def gen_batches(labels_all, all_samples):
601.         samples = all_samples.copy()
602.         random.shuffle(samples)
603.         more = True
604.         while more:
605.             batch = []
606.             batch_labels = []
607.
608.             if len(samples) >= BATCH_SIZE:
609.                 batch_elements = [samples.pop() for _ in range(BATCH_SIZE)]
610.                 if len(samples) == 0:
611.                     more = False
612.             else:
613.                 batch_elements = [samples.pop() for _ in range(len(samples))]
614.                 diff = BATCH_SIZE - len(batch_elements)
615.                 if not diff == 0:
616.                     extra = random.sample(all_samples, diff)
617.                     batch_elements += extra
618.                 else:
619.                     print("error: unreachable state reached")
620.                     exit(-10)
621.                 more = False
622.
623.             if len(batch_elements) != BATCH_SIZE:
624.                 print("error: batch_elements improperly filling up")
625.                 exit(-10)
626.
627.             for sample in batch_elements:
628.                 m1 = np.genfromtxt(sample[2], delimiter=',')
629.                 m2 = np.genfromtxt(sample[3], delimiter=',')
630.                 m3 = np.genfromtxt(sample[4], delimiter=',')
631.                 m4 = np.genfromtxt(sample[5], delimiter=',')
632.                 m = np.concatenate((m1, m2, m3, m4))
633.
634.                 if m.shape[1] > NUM_FEATS:
635.                     print('WARNING: m.shape[1] > NUM_FEATS!!!')
636.                     m = np.delete(m, np.s_[NUM_FEATS:], axis=1)
637.
638.                 batch.append(m)
639.                 label = labels_all[sample[0]].copy()
640.                 label.append(sample[1])
641.                 batch_labels.append(tuple(label))
642.
643.             yield np.array(batch), batch_labels
644.
645.     def generate_dataset(mappedFiles):
646.         def _generator():
647.             for f in mappedFiles:
648.                 yield (f[0], f[1], f[2], f[3], f[4], f[5], f[6], f[7], f[8], f[9], f[10])
649.
650.         """ Nolan's _get_features """
651.         # def _get_features(filename, index, l1, l2, l3):
652.         #     f1 = tf.strings.join([filename, '_', tf.as_string(index), '.csv'])
653.         #     f2 = tf.strings.join([filename, '_', tf.as_string(index + 25), '.csv'])
654.         #     f3 = tf.strings.join([filename, '_', tf.as_string(index + 50), '.csv'])
655.         #     f4 = tf.strings.join([filename, '_', tf.as_string(index + 75), '.csv'])
656.         """ Nolan's _get_features """
657.
658.         def _get_features(fileID, index, f1, f2, f3, f4, l1, l2, l3, l4, l5):
659.             # f1 = tf.strings.join([filename, '_', tf.as_string(index), '.csv'])
660.             # f2 = tf.strings.join([filename, '_', tf.as_string(index + 25), '.csv'])
661.             # f3 = tf.strings.join([filename, '_', tf.as_string(index + 50), '.csv'])
662.             # f4 = tf.strings.join([filename, '_', tf.as_string(index + 75), '.csv'])
663.
664.             # f1 = tf.constant(f1, tf.string)
665.             # f2 = tf.constant(f2, tf.string)
666.             # f3 = tf.constant(f3, tf.string)
667.             # f4 = tf.constant(f4, tf.string)
668.
669.             # print(fileID)
670.             # print(index)
671.             # print(f1)
672.             # print(f2)
673.             # print(f3)
674.             # print(f4)
675.
676.             def _get_float_csv_tensor(fn):
677.                 csv_data_str = tf.io.read_file(fn)
678.                 csv_lines_str = tf.strings.split([csv_data_str], sep=tf.constant("\n"))
679.                 csv_lines_str_dns = csv_lines_str.values  # tf.sparse.to_dense(csv_lines_str, default_value=(["" for _ in range(NUM_FEATS)]))
680.                 csv_lines_masked = tf.boolean_mask(csv_lines_str_dns, tf.strings.length(csv_lines_str_dns) > 1)
681.                 csv_line_values = tf.io.decode_csv(csv_lines_masked, [[0.0] for _ in range(NUM_FEATS)])
682.                 output = tf.stack(csv_line_values, axis=1)
683.                 return output
684.
685.             m1 = _get_float_csv_tensor(f1)
686.             m2 = _get_float_csv_tensor(f2)
687.             m3 = _get_float_csv_tensor(f3)
688.             m4 = _get_float_csv_tensor(f4)
689.
690.             m = tf.concat([m1, m2, m3, m4], axis=0)
691.
692.             labels = (index, l1, l2, l3, l4, l5)
693.             # labels = tf.reshape(labels, [64, 6])
694.             # labels = tf.transpose(labels)
695.
696.             return m, labels
697.
698.         dataset = tf.data.Dataset.from_generator(_generator,
699.                                                  (tf.int64, tf.int64, tf.string, tf.string, tf.string, tf.string, tf.string, tf.string, tf.string, tf.string, tf.string),
700.                                                  output_shapes=(None, None, None, None, None, None, None, None, None, None, None))
701.         # dataset = tf.data.Dataset.from_generator(_generator, (tf.string, tf.int64, tf.string, tf.string, tf.string), output_shapes=(None, None, None, None, None))
702.         # dataset = dataset.shuffle(tf.constant(len(mappedFiles), dtype=tf.int64))  ### using shuffle here takes too much time
703.         dataset = dataset.map(_get_features, num_parallel_calls=4)  # ABS added: num_parallel_calls=4
704.         # dataset = dataset.map(_get_features)
705.         dataset = dataset.prefetch(BATCH_SIZE * 2)
706.         dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
707.
708.         ## DEBUG POINT
709.         # dataset = dataset.batch(64)
710.         # iter = dataset.make_initializable_iterator()
711.         # next_element = iter.get_next()
712.         # with tf.Session(graph=graph) as sess:
713.         #     sess.run(iter.initializer)
714.         #     for _ in range(5):
715.         #         elem = sess.run(next_element)
716.         #         elemalt = next(Data.gen_batches(mappedFiles))
717.         #         print(elem)
718.         #         print(elemalt)
719.         #         print('\n\n\n\n')
720.
721.         # # https://www.tensorflow.org/guide/datasets#creating_an_iterator
722.
723.         return dataset
724.
725.
726. class Loss:
727.
728.     # Graph definition for cosine similarity.
729.     # cosine(a, b) = dot(a, b) / mul(norm(a), norm(b))
730.     #
731.     def cosine(s1, s2):
732.         return tf.divide(tf.tensordot(s1, s2, axes=1), tf.multiply(tf.norm(s1), tf.norm(s2)))
733.
734.     # OLD (FUNCTION OF cosine(...))
735.     # Calculates similarity ~score~ between two tensors.
736.     # similarity(a, b) = { cosine(a, b) >= 0.55 : 1
737.     #                    { 0.55 > cosine(a, b) >= 0 : (1/0.55) * cosine(a, b)
738.     #                    { 0 > cosine(a, b) : 0
739.     # NEW (CONSTANT)
740.     # Calculates similarity ~score~ between two tensors.
741.     # similarity(a, b) = { cosine(a, b) >= 0.55 : 1
742.     #                    { 0.55 > cosine(a, b) : 0
743.     def similarity(ta, tb):
744.         cos_sim = Loss.cosine(ta, tb)
745.         if not HASH_LOSS_USES_CONSTANT_Sij:
746.             return tf.cond(cos_sim >= 0.55, lambda: tf.cast(1.0, tf.float64),
747.                            lambda: tf.cond(cos_sim >= 0.0, lambda: tf.cast((1.0 / 0.55) * cos_sim, tf.float64),
748.                                            lambda: tf.cast(0.0, tf.float64)))
749.         else:
750.             return tf.cond(cos_sim >= 0.55, lambda: tf.cast(1.0, tf.float64), lambda: tf.cast(0.0, tf.float64))
751.             # contribution threshold changes from |S > 0| to |S > 0.55|, must change this in loss calculation
752.
753.     # Graph definition for entropy.
754.     # entropy(C) = sum[c in C](
755.     #     - (c / sum(c)) * log_2(c / sum(c))
756.     # )
757.     # input C is a 2-D vector, output is a scalar
758.     def entropy(c):
759.
760.         # print("c.shape: ", c.shape)  ##DELETE!!!
761.
762.         base = tf.constant(2, dtype=tf.float64)
763.         s = tf.reduce_sum(c, axis=0)
764.         p = tf.div(c, s)
765.         o = tf.multiply(tf.divide(tf.log(p), tf.log(base)), p)
766.         o = tf.where(tf.is_nan(o), tf.zeros_like(o), o)
767.
768.         # print("o: ", o)  ##DELETE!!!
769.         # print("o.shape: ", o.shape)  ##DELETE!!!
770.
771.         # print("tf.multiply(tf.reduce_sum(o,axis=0), -1).shape: ", tf.multiply(tf.reduce_sum(o,axis=0), -1).shape)  ##DELETE!!!
772.         # print("return: ", tf.multiply(tf.reduce_sum(o,axis=0), -1))  ##DELETE!!!
773.
774.         return tf.multiply(tf.reduce_sum(o, axis=0), -1)
775.
776.     # Graph definition for per-batch similar counter.
777.     # For logging purposes only!
778.     def count_similars(hash_list):
779.         if HASH_LOSS_USES_CONSTANT_Sij:
780.             threshold = tf.constant(0.55, dtype=tf.float64)
781.             threshold_entropy = tf.constant(0.55, dtype=tf.float64)
782.         else:
783.             threshold = tf.constant(0.0, dtype=tf.float64)
784.             threshold_entropy = tf.constant(0.55, dtype=tf.float64)
785.
786.         zero = tf.constant(0, dtype=tf.int64)
787.         one = tf.constant(1, dtype=tf.int64)
788.         count = []
789.         count_entropy = []
790.
791.         num_hashes = len(hash_list)
792.         for i in range(num_hashes):
793.             for j in range(i + 1, num_hashes):
794.                 sim = Loss.similarity(hash_list[i], hash_list[j])
795.                 count.append(tf.cond(tf.greater_equal(sim, threshold), lambda: one, lambda: zero))  # Hash Loss
796.                 count_entropy.append(
797.                     tf.cond(tf.greater_equal(sim, threshold_entropy), lambda: one, lambda: zero))  # Entropy Loss
798.
799.         return tf.reduce_sum(tf.stack(count)), tf.reduce_sum(tf.stack(count_entropy))
800.
801.     def convert_real_features_to_integer_features(hashes):
802.         ones = tf.ones(tf.shape(hashes), dtype=tf.float64)
803.         neg_ones = tf.scalar_mul(-1, tf.ones(tf.shape(hashes), dtype=tf.float64))
804.         zeros = tf.zeros(tf.shape(hashes), dtype=tf.float64)
805.
806.         cond = tf.less(hashes, zeros)
807.         out = tf.where(cond, neg_ones, ones)
808.
809.         return out
810.
811.     # Graph definition for entropy loss ()
812.     # entropyloss(S) = sum(
813.     #     for each pair Si, Sj in S:
814.     #         similarity(Si, Sj) *
815.     # )
816.     #
817.     def entropy_loss(hash_list):  # input is a python list of tensors
818.         threshold = tf.constant(0.55, dtype=tf.float64)
819.         zero = tf.constant(0.0, dtype=tf.float64)
820.         one = tf.constant(1.0, dtype=tf.float64)
821.
822.         # print("len(hash_list): ", len(hash_list))  ##DELETE!!!
823.         # print("hash_list[0].shape: ", hash_list[0].shape)  ##DELETE!!!
824.
825.         entropies = []
826.
827.         for i in range(len(hash_list)):
828.             similar_bounded_hash_list = []
829.             similarity_scores = []
830.             similars_counter = tf.constant(0.0, tf.float64)
831.             # print("Start: ", similars_counter)
832.
833.             for j in range(len(hash_list)):
834.                 if i == j:
835.                     similar_bounded_hash_list.append(hash_list[j])
836.                     similarity_scores.append(zero)
837.                     similars_counter = tf.add(similars_counter, 1)
838.                 else:
839.                     sim = Loss.similarity(hash_list[i], hash_list[j])
840.                     similar_bounded_hash_list.append(
841.                         tf.cond(
842.                             tf.greater_equal(sim, threshold),
843.                             lambda: hash_list[j],
844.                             lambda: tf.zeros(hash_list[j].get_shape(), tf.float64)))
845.                     similars_counter = tf.add(similars_counter, tf.cond(tf.greater_equal(sim, threshold),
846.                                                                         lambda: one,
847.                                                                         lambda: zero))
848.
849.                     ## This code does not consider the number of elements per subset
850.                     # similar_bounded_hash_list.append(
851.                     #     tf.cond(
852.                     #         tf.greater_equal(sim, threshold),
853.                     #         lambda: hash_list[j],
854.                     #         lambda: tf.zeros(hash_list[j].get_shape(), tf.float64)))
855.
856.                     similarity_scores.append(sim)
857.
858.             similar_bounded_hash_list_stacked = tf.stack(similar_bounded_hash_list)
859.
860.             if similar_bounded_hash_list_stacked.shape[0] > 1:
861.                 bits_pos = tf.to_int64(tf.greater(similar_bounded_hash_list_stacked, zero))
862.                 bits_neg = tf.to_int64(tf.less(similar_bounded_hash_list_stacked, zero))
863.
864.                 # print("bits_pos.shape: ", bits_pos.shape)  ##DELETE!!!
865.
866.                 count_pos = tf.reduce_sum(bits_pos, axis=0)  # count greater 0
867.                 count_neg = tf.reduce_sum(bits_neg, axis=0)  # count less 0
868.
869.                 # print("count_pos.shape: ", count_pos.shape)  ##DELETE!!!
870.                 # print("End: ", similars_counter)
871.
872.                 # e = tf.divide(Loss.entropy(tf.cast(tf.stack([count_pos, count_neg]), tf.float64)), tf.cast(tf.shape(similar_bounded_hash_list_stacked)[0], tf.float64))  # Deprecated
873.                 e = tf.divide(Loss.entropy(tf.cast(tf.stack([count_pos, count_neg]), tf.float64)), similars_counter)
874.
875.                 # print("len(similar_bounded_hash_list): ", len(similar_bounded_hash_list))  ##DELETE!!!
876.                 # print("tf.cast(tf.shape(similar_bounded_hash_list_stacked)[0], tf.float64): ", tf.cast(tf.shape(similar_bounded_hash_list_stacked)[0], tf.float64))  ##DELETE!!!
877.                 # print("e: ", e)  ##DELETE!!!
878.
879.                 entropies.append(e)
880.
881.         # total_entropy = tf.reduce_sum(tf.stack(entropies))
882.         # total_entropy = tf.cast(total_entropy, tf.float64)
883.         total_entropy = tf.divide(tf.reduce_mean(tf.stack(entropies)), tf.constant(STATE_SIZE, dtype=tf.float64))
884.         return total_entropy
885.
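For intuition, the per-bit entropy that Loss.entropy computes over the positive/negative counts of similar hashes can be reproduced on concrete numbers with NumPy (a standalone sketch; the name bit_entropy is illustrative, and the TensorFlow graph version above remains the authoritative implementation):

```python
import numpy as np

def bit_entropy(counts):
    # counts: shape (2, num_bits) -- for each hash bit, how many similar
    # hashes have a positive vs. a negative value in that position.
    # entropy(C) = -sum_c (c / sum(c)) * log2(c / sum(c)), with 0*log2(0) := 0,
    # matching the tf.where(tf.is_nan(...)) guard in Loss.entropy.
    counts = np.asarray(counts, dtype=np.float64)
    p = counts / counts.sum(axis=0)
    terms = np.where(p > 0, p * np.log2(np.where(p > 0, p, 1.0)), 0.0)
    return -terms.sum(axis=0)
```

A bit that splits 50/50 across the similar hashes yields the maximum entropy of 1.0, while a bit on which all similar hashes agree yields 0.0 — which is why minimizing this term pushes the bits of similar fingerprints toward agreement.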
886.     # Graph definition for hash loss (preserving similarity between neighbors)
887.     # hashloss(S) = sum(
888.     #     for each pair Si, Sj in S:
889.     #         similarity(Si, Sj) *
890.     # )
891.     #
892.     def hash_loss(tensors_list):
893.         if HASH_LOSS_USES_CONSTANT_Sij:
894.             threshold = tf.constant(0.55, dtype=tf.float64)
895.         else:
896.             threshold = tf.constant(0.0, dtype=tf.float64)
897.         zero = tf.constant(0.0, dtype=tf.float64)
898.         one = tf.constant(1.0, dtype=tf.float64)
899.         n = len(tensors_list)
900.         counter = []
901.         res = []
902.
903.         for i in range(n):
904.             j = i
905.             while j < (n - 1):
906.                 if HASH_LOSS_USES_SXH == 0:
907.                     sim = Loss.similarity(tensors_list[i], tensors_list[j + 1])
908.                 elif HASH_LOSS_USES_SXH == 1:
909.                     sim = Loss.similarity(tf.reshape(tensors_list[i], [-1]), tf.reshape(tensors_list[j + 1], [-1]))
910.                 elif HASH_LOSS_USES_SXH == 2:
911.                     sim = Loss.similarity(tensors_list[i], tensors_list[j + 1])
912.                 else:
913.                     exit(-10)
914.                 counter.append(tf.cond(tf.greater_equal(sim, threshold), lambda: one, lambda: zero))
915.                 res.append(tf.multiply(sim, tf.square(tf.norm(tf.subtract(tensors_list[i], tensors_list[j + 1])))))
916.                 j += 1
917.         a = tf.reduce_sum(tf.cast(tf.stack(res, axis=0), tf.float64))
918.         if HASH_LOSS_NORMALIZED:
919.             counter = tf.reduce_sum(tf.cast(tf.stack(counter), tf.float64))
920.             counter = tf.cond(tf.equal(counter, zero), lambda: one, lambda: counter)
921.             # a = tf.divide(a, counter)
922.             a = tf.divide(a, tf.multiply(counter, tf.constant(STATE_SIZE, dtype=tf.float64)))
923.         return a
924.
925.     # Graph definition for mean squared error loss (preserving minimum required information)
926.     # hashloss(S) = sum(
927.     #     for each pair Si, Sj in S:
928.     #         similarity(Si, Sj) *
929.     # )
930.     #
931.     def mse_loss(enc_inputs, dec_predictions):
932.         # return tf.reduce_mean(tf.reduce_sum(tf.reduce_sum(tf.square(tf.subtract(enc_inputs, dec_predictions)), 2), 1))
933.         return tf.reduce_mean(tf.square(tf.subtract(enc_inputs, dec_predictions)))
934.
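The cosine and thresholded-similarity graph operations at the core of the Loss class reduce to the following plain NumPy arithmetic (a minimal sketch of the constant-Sij branch only; these cosine and similarity functions are standalone re-implementations for illustration, not the graph code itself):

```python
import numpy as np

def cosine(a, b):
    # cosine(a, b) = dot(a, b) / (norm(a) * norm(b))
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def similarity(a, b, threshold=0.55):
    # Constant-Sij variant: a pair contributes 1 when it is
    # "similar enough" (cosine >= 0.55), and 0 otherwise.
    return 1.0 if cosine(a, b) >= threshold else 0.0
```

Two identical vectors score 1, orthogonal vectors score 0, and the 0.55 cutoff decides which pairs count as neighbors in the hash and entropy losses above.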
935.
936. ALREADY_DEFINED_VARS = False
937. ALREADY_DEFINED_SUMM = False
938.
939.
940. class Seq2Seq:
941.     def __init__(self, num_layers=1, same_encoder_decoder_cell=False, bidirectional=False, attention=False):
942.         self.num_layers = num_layers
943.         self.same_encoder_decoder_cell = same_encoder_decoder_cell  # weight sharing
944.         # no bidirectional or attention support yet
945.
946.         # BIDIRECTIONAL: https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn
947.         # MULTILAYER BIDIRECTIONAL: https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/stack_bidirectional_dynamic_rnn
948.
949.         # ATTENTION: https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/AttentionCellWrapper
950.         # ATTENTION: https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/AttentionWrapper
951.
952.         return
953.
954.     def gen_variables(self):
955.         global ALREADY_DEFINED_VARS
956.
957.         with tf.variable_scope('Encoder', reuse=ALREADY_DEFINED_VARS):
958.             self.enc_cells = []
959.             self.states = []
960.
961.             for l in range(0, self.num_layers):
962.                 with tf.variable_scope('Layer' + str(l + 1)):
963.                     ## create encoder cell per layer
964.                     self.enc_cells.append(tf.contrib.rnn.LSTMCell(STATE_SIZE,
965.                                                                   state_is_tuple=True,
966.                                                                   use_peepholes=USE_PEEPHOLES,
967.                                                                   reuse=ALREADY_DEFINED_VARS))
968.
969.                     ## create states per layer, creates both encoder and decoder states [TODO: move outside Encoder variable scope]
970.                     with tf.name_scope('State'):
971.                         _enc_state_c = tf.get_variable('cell', shape=[GPU_BATCH_SIZE, STATE_SIZE],
972.                                                        initializer=tf.constant_initializer(0.0),
973.                                                        dtype=tf.float64)
974.                         _enc_state_h = tf.get_variable('hidden', shape=[GPU_BATCH_SIZE, STATE_SIZE],
975.                                                        initializer=tf.constant_initializer(0.0),
976.                                                        dtype=tf.float64)
977.                         self.states.append(tf.contrib.rnn.LSTMStateTuple(_enc_state_c, _enc_state_h))
978.
979.         ## declare multilayer rnn cell for encoder AND state tuple
980.         self.enc_multi_cell = tf.contrib.rnn.MultiRNNCell(self.enc_cells)
981.         self.states = tuple(self.states)
982.
983.         '''''  ## output for encoder, NOT USED ANYMORE
984.         with tf.name_scope('Output'):
985.             self.enc_W = tf.get_variable('weights', [STATE_SIZE, NUM_FEATS],
986.                                          initializer=tf.truncated_normal_initializer(stddev=0.01, dtype=tf.float64),
987.                                          dtype=tf.float64)
988.             self.enc_B = tf.get_variable('biases', [NUM_FEATS],
989.                                          initializer=tf.constant_initializer(0.0),
990.                                          dtype=tf.float64)
991.         '''
992.
993.         with tf.variable_scope('Decoder', reuse=ALREADY_DEFINED_VARS):
994.
995.             ## if sharing cells b/t enc and dec, then share the cells, otherwise initialize
996.             if self.same_encoder_decoder_cell:
997.                 self.dec_cells = self.enc_cells
998.             else:
999.                 self.dec_cells = []
1000.                 for l in range(0, self.num_layers):
1001.                     with tf.variable_scope('Layer' + str(l + 1)):
1002.                         self.dec_cells.append(tf.contrib.rnn.LSTMCell(STATE_SIZE,
1003.                                                                       state_is_tuple=True,
1004.                                                                       use_peepholes=USE_PEEPHOLES,
1005.                                                                       reuse=ALREADY_DEFINED_VARS))
1006.
1007.             ## declare multilayer rnn cell for decoder
1008.             self.dec_multi_cell = tf.contrib.rnn.MultiRNNCell(self.dec_cells)
1009.
1010.             ## declare weights and biases for output processing
1011.             with tf.name_scope('Output'):
1012.                 self.dec_W = tf.get_variable('weights', [STATE_SIZE, NUM_FEATS],
1013.                                              initializer=tf.truncated_normal_initializer(stddev=0.01, dtype=tf.float64),
1014.                                              dtype=tf.float64)
1015.                 self.dec_B = tf.get_variable('biases', [NUM_FEATS],
1016.                                              initializer=tf.constant_initializer(0.0),
1017.                                              dtype=tf.float64)
1018.
1019.         ALREADY_DEFINED_VARS = True
1020.
1021.     def gen_operations(self, enc_inputs):
1022.         ## run encoder for all steps
1023.         with tf.variable_scope('Encoder') as Encoder:
1024.             with tf.name_scope('Cell'):
1025.                 enc_outputs, enc_last_state = tf.nn.dynamic_rnn(self.enc_multi_cell, enc_inputs,
1026.                                                                 initial_state=self.states)
1027.         '''''  ## output for encoder, NOT USED ANYMORE
1028.         with tf.name_scope('Output'):
1029.             dec_input = tf.matmul(enc_last_state.h, self.enc_W) + self.enc_B
1030.         '''
1031.
1032.         ## create copy -- ensures decoder doesn't change same mem space
1033.         store_enc_lhs = tf.identity(enc_last_state[len(enc_last_state) - 1].h)
1034.
1035.         ## declare outputs
1036.         dec_stepwise_states = []
1037.         dec_predictions = []
1038.
1039.         with tf.variable_scope('Decoder'):
1040.             ## first decoder input
1041.             dec_input = tf.unstack(tf.transpose(enc_inputs, [1, 0, 2]))[0]
1042.
1043.             ## set initial decoder state
1044.             with tf.name_scope('Cell'):
1045.                 dec_state = enc_last_state
1046.
1047.             for step in range(NUM_STEPS):
1048.                 ## run decoder cell for each step (cell = all layers)
1049.                 with tf.name_scope('Step'):
1050.                     outputs, dec_state = self.dec_multi_cell(dec_input, dec_state)
1051.
1052.                 ## process output for the current step
1053.                 with tf.name_scope('Output'):
1054.                     dec_input = tf.matmul(dec_state[len(dec_state) - 1].h, self.dec_W) + self.dec_B
1055.                     dec_stepwise_states.append(outputs)
1056.                     dec_predictions.append(dec_input)
1057.
1058.             ## process outputs and states for return
1059.             dec_stepwise_states = tf.stack(dec_stepwise_states, axis=0)
1060.             dec_predictions = tf.stack(dec_predictions, axis=0)
1061.             dec_predictions = tf.transpose(dec_predictions, perm=[1, 0, 2], name='Predictions')
1062.
1063.         return dec_predictions, store_enc_lhs, enc_outputs, dec_stepwise_states
1064.
1065.     def gen_summaries(self):
1066.         global ALREADY_DEFINED_SUMM
1067.         if ALREADY_DEFINED_SUMM:
1068.             return
1069.         '''''
1070.         with tf.name_scope('Encoder'):
1071.             with tf.name_scope('Cell'):
1072.                 tf.summary.histogram('Weights', self.enc_cell.variables[0])
1073.                 tf.summary.histogram('Bias', self.enc_cell.variables[1])
1074.             with tf.name_scope('Output'):
1075.                 tf.summary.histogram('Weights', self.enc_W)
1076.                 tf.summary.histogram('Bias', self.enc_B)
1077.
1078.         with tf.name_scope('State'):
1079.             pass
1080.
1081.         with tf.name_scope('Decoder'):
1082.             with tf.name_scope('Cell'):
1083.                 tf.summary.histogram('Weights', self.dec_cell.variables[0])
1084.                 tf.summary.histogram('Bias', self.dec_cell.variables[1])
1085.             with tf.name_scope('Output'):
1086.                 tf.summary.histogram('Weights', self.dec_W)
1087.                 tf.summary.histogram('Bias', self.dec_B)
1088.         '''
1089.         ALREADY_DEFINED_SUMM = True
1090.
1091.
1092. class Hashes:
1093.
1094.     def hamming_distance(s1, s2):
1095.         """
1096.         Hamming Distance between hashes whose elements are either 1 or -1.
1097.         Return the Hamming distance between equal-length hashes.
1098.         """
1099.         if (s1.shape[0]) != (s2.shape[0]):
1100.             raise ValueError("Undefined for sequences of unequal length")
1101.         return int(np.sum((((s1 * s2) - 1) / 2) ** 2))
1102.         # return int(np.sum(np.logical_xor(s1, s2))), need to change below to use 0 and 1 instead of -1 and 1
1103.
1104.     def collisions(hashes):
1105.
1106.         hashes_dict = defaultdict(list)
1107.         counter = 0
1108.
1109.         for hash_i in hashes:
1110.             counter += 1
1111.             h = tuple(hash_i)
1112.             if not h in hashes_dict:
1113.                 hashes_dict[h].append(1)
1114.
1115.         l = len(hashes_dict)
1116.         num_collisions = counter - l
1117.
1118.         # print("Num Hashes: ", counter)
1119.         # print("Num Unique Hashes: ", l)
1120.         # print("Num Collisions: ", num_collisions)
1121.         # print("")
1122.         # print("Fraction of Unique Hashes: ", float(l/counter))
1123.         # print("Fraction of collisions: ", float(num_collisions/counter))
1124.
1125.         return counter, l, num_collisions, float(l / counter), float(num_collisions / counter)
1126.
1127.     def create_dict(sorted_hashes, num_hashes_classification,
1128.                     db_fingerprint, db_name, table_name):
1129.         hashes_dict = defaultdict(list)
1130.         num_hashes_dict = num_hashes_classification
1131.
1132.         for i in range(len(sorted_hashes) - num_hashes_dict + 1):
1133.             tmp = sorted_hashes[i:i + num_hashes_dict].copy()
1134.             if not num_hashes_dict == 1:
1135.                 for j in range(num_hashes_dict - 1):
1136.                     if not [tmp[j][k] for k in (0, 1, 2, 3)] == \
1137.                             [tmp[j + 1][k] for k in (0, 1, 2, 3)]:
1138.                         flag = True
1139.                         break
1140.                     else:
1141.                         flag = False
1142.             else:
1143.                 flag = False
1144.
1145.             if flag:
1146.                 continue
1147.             else:
1148.                 h = []
1149.
1150.                 for j in range(num_hashes_dict):
1151.                     h.extend(tmp[j][6].copy())
1152.                 h = tuple(h)
1153.                 # print('tmp[0][:-1]: ', tmp[0][:-1])
1154.                 hashes_dict[h].append(tmp[0][:-1])
1155.                 # print('%i:' % (i), 'hashes_dict: ', hashes_dict)
1156.
1157.                 # if DO_CREATE_TABLE:
1158.                 #     db_fingerprint.insert_into_table(db_name, table_name, tmp[0][:-1], list(h))
1159.
1160.         if DO_CREATE_TABLE:
1161.             insert_values = []
1162.
1163.             for k, v in hashes_dict.items():
1164.                 HashBin = np.where(np.array(k) >= 1.0, 1, 0)
1165.
1166.                 # Converting Hashes into Hexadecimal (4-bit length encoding)
1167.                 bits_str = ''
1168.                 hash_hexa = ''
1169.                 hash_i_char = ['%s' % bit_int for bit_int in HashBin]  # From integer to character
1170.
1171.                 for bit_char in hash_i_char:  # Creating a string
1172.                     bits_str += bit_char
1173.
1174.                 for index in range(0, len(bits_str), 4):
1175.                     # Conversion to Hexadecimal
1176.                     hash_hexa += hex(int(bits_str[index:index + 4], 2)).split('x')[-1]
1177.
1178.                 for v_ in v:
1179.                     # print(hash_hexa)
1180.                     # print(v_)
1181.                     # v_: [v_[0]:Transformation, v_[1]:Speaker, v_[2]:Video, v_[3]:AudioName, v_[4]:Gender, v_[5]:Segment]
1182.                     # print(list([v_[0], v_[3], v_[2]]) + list([hash_hexa]))
1183.                     insert_values.append(tuple(list([v_[0], v_[1], v_[2], v_[3], v_[4], v_[5]]) + list([hash_hexa])))
1184.
1185.             print('REFERENCE DATABASE')
1186.             print('\tlen(insert_values): ', len(insert_values))
1187.             print('\tsys.getsizeof(insert_values): %i [bytes]' % (sys.getsizeof(insert_values)))
1188.             # print(insert_values)
1189.
1190.             db_timer_init = time.time()
1191.
1192.             db_fingerprint.cursor.executemany(
1193.                 'INSERT INTO {}.{} (Transformation, Speaker, Video, AudioName, Gender, Segment, Hash) VALUES (%s, %s, %s, %s, %s, %s, UNHEX(%s));'.format(
1194.                     db_name, table_name), insert_values)
1195.             db_fingerprint.cnx.commit()
1196.
1197.             print('\tdb_fingerprint.cursor.executemany time: %f [secs]' % (time.time() - db_timer_init))
1198.
1199.         # print('len(hashes_dict): ', len(hashes_dict))
1200.
1201.         return hashes_dict
1202.
1203.     def create_queries(sorted_hashes, num_hashes_classification):
1204.         samples_query = []
1205.
1206.         num_hashes_class = num_hashes_classification
1207.
1208.         for i in range(len(sorted_hashes) - num_hashes_class + 1):
1209.             tmp = sorted_hashes[i:i + num_hashes_class]
1210.             if not num_hashes_class == 1:
1211.                 for j in range(num_hashes_class - 1):
1212.                     if not [tmp[j][k] for k in (0, 1, 2, 3)] == \
1213.                             [tmp[j + 1][k] for k in (0, 1, 2, 3)]:
1214.                         flag = True
1215.                         break
1216.                     else:
1217.                         flag = False
1218.             else:
1219.                 flag = False
1220.
1221.             if not flag:
1222.                 h = []
1223.
1224.                 for j in range(num_hashes_class):
1225.                     h.extend(tmp[j][6])
1226.                 h = tuple(h)
1227.                 result = list(tmp[0][:-1])
1228.                 result.append(h)
1229.                 samples_query.append(tuple(result))
1230.
1231.         return samples_query
1232.
1233.     def similarity(h, dictt, dict_keys):
1234.         dict_keys2 = dict_keys[:]
1235.         min_value = math.inf
1236.         NNclass = []
1237.
1238.         h = np.array(list(h))
1239.         h = h.astype(float).astype(int)
1240.         dict_keys = np.array(dict_keys)
1241.         dict_keys = dict_keys.astype(float).astype(int)
1242.
1243.         for i in range(len(dict_keys)):
1244.             dist = Hashes.hamming_distance(h, dict_keys[i])
1245.             if min_value >= dist:
1246.                 if min_value > dist:
1247.                     res = dictt[dict_keys2[i]].copy()
1248.                     # Appending HammingDistance = dist
1249.                     tmp_res = []
1250.                     for elem in res:
1251.                         tmp = list(elem)
1252.                         tmp.append(dist)
1253.                         tmp_res.append(tuple(tmp))
1254.                     NNclass = tmp_res.copy()
1255.                     # res = [(0):Speaker | (1):Sentence | (2):Transformation | (3):Segment | (4):HammingDistance]
1256.                     min_value = dist
1257.                 else:
1258.                     res = dictt[dict_keys2[i]].copy()
1259.                     # Appending HammingDistance = dist
1260.                     tmp_res = []
1261.                     for elem in res:
1262.                         tmp = list(elem)
1263.                         tmp.append(dist)
1264.                         tmp_res.append(tuple(tmp))
1265.                     a = list(tmp_res.copy())
1266.                     # res = [(0):Speaker | (1):Sentence | (2):Transformation | (3):Segment | (4):HammingDistance]
1267.                     # a = list(dictt[dict_keys2[i]].copy())
1268.                     NNclass += a[:]
1269.                     min_value = dist
1270.
1271.         return set(NNclass)
1272.
1273.     def classify_original(hash_query, db_name, table_name):
1274.         # print('Entered Hashes.classify_original()')
1275.         ## This code assumes NUM_HASHES_DICT == NUM_HASHES_CLASSIFICATION
1276.         if not NUM_HASHES_DICT == NUM_HASHES_CLASSIFICATION:
1277.             print('error: sanity check failed: NUM_HASHES_DICT == NUM_HASHES_CLASSIFICATION')
1278.             exit(-10)
1279.
1280.         HashBin = np.where(np.array(hash_query) >= 1.0, 1, 0)
1281.         # print('HashBin: {}'.format(HashBin))
1282.         # sys.exit('HashBin content')
1283.
1284.         # Converting Hashes into Hexadecimal (4-bit length encoding)
1285.         bits_str = ''
1286.         hash_hexa = ''
1287.         hash_i_char = ['%s' % bit_int for bit_int in HashBin]  # From integer to character
1288.         for bit_char in hash_i_char:  # Creating a string
1289.             bits_str += bit_char
1290.         for index in range(0, len(bits_str), 4):
1291.             # Conversion to Hexadecimal
1292.             hash_hexa += hex(int(bits_str[index:index + 4], 2)).split('x')[-1]
1293.
1294.         # print('Before tmp_db_fingerprint.duplicates(db_name, table_name, hash_hexa)')
1295.         # print('hash_hex: {}'.format(hash_hexa))
1296.         tmp_db_fingerprint = DB_Fingerprinting()
1297.         res = tmp_db_fingerprint.duplicates(db_name, table_name, hash_hexa)
1298.         # res = [(0):Transformation | (1):Speaker | (2):Video | (3):AudioName | (4):Gender | (5):Segment | (6):HammingDistance]
1299.
1300.         # # Printing out results of the query
1301.         # for i, res_i in enumerate(res):
1302.         #     print('{}: {}'.format(i, res_i))
1303.         #     if i+1 == 30:
1304.         #         break
1305.
1306.         filename = defaultdict(list)
1307.
1308.         for i, res_i in enumerate(res):
1309.             # print('i: {}'.format(i))
1310.             # print('res_i[0]: {}'.format(res_i[0]))
1311.             # print('res_i[1]: {}'.format(res_i[1]))
1312.             # print('res_i[2]: {}'.format(res_i[2]))
1313.             # print('res_i[3]: {}'.format(res_i[3]))
1314.             # print('res_i[4]: {}'.format(res_i[4]))
1315.             # print('res_i[5]: {}'.format(res_i[5]))
1316.             # print('res_i[6]: {}'.format(res_i[6]))
1317.             # print('res_i[-1]: {}'.format(res_i[-1]))
1318.
1319.             if i == 0:
1320.                 min_value = res_i[-1]
1321.                 # print('min_value: {}'.format(min_value))
1322.                 filename[(res_i[0], res_i[1], res_i[2], res_i[3], res_i[4])].append(1)
1323.                 # print('filename: {}'.format(filename))
1324.             elif min_value == res_i[-1]:
1325.                 filename[(res_i[0], res_i[1], res_i[2], res_i[3], res_i[4])].append(1)
1326.                 # print('filename: {}'.format(filename))
1327.             else:
1328.                 break
1329.
1330.         # filename = defaultdict(list)
1331.         # for elem in res:
1332.         #     filename[elem[0]].append(1)
1333.
1334.         max_votes = 0
1335.         filenameList = []
1336.         for k, v in filename.items():
1337.             # print('filename.items()')
1338.             # print('key: ', k)
1339.             # print('value: ', v)
1340.             if sum(v) >= max_votes:
1341.                 if sum(v) == max_votes:
1342.                     filenameList.append(k)
1343.                 else:
1344.                     filenameList = []
1345.                     filenameList.append(k)
1346.                 max_votes = sum(v)
1347.
1348.         # print('len(filenameList): ', len(filenameList))
1349.         # print('filenameList: ', filenameList)
1350.
1351.         if len(filenameList) != 1:
1352.             # More than one option means classification
1353.             # but we are returning the first value in the list (we are guessing)
            return (filenameList[0][0], filenameList[0][1], filenameList[0][2], filenameList[0][3], filenameList[0][4])
        else:
            # Return the only option
            return (filenameList[0][0], filenameList[0][1], filenameList[0][2], filenameList[0][3], filenameList[0][4])

    def classification_subtask(subset, a):
        db_name = a[0]
        table_name = a[1]
        hashes_classification = []

        for i in range(len(subset)):
            c_originals = Hashes.classify_original(subset[i][6],
                                                   db_name, table_name)
            hashes_classification.append([subset[i], c_originals])
        return hashes_classification

    def classification(hashes_dict, hashes_query, num_hashes_classification,
                       db_fingerprint, db_name, table_name):

        hashes_dict_sorted = sorted(hashes_dict, key=lambda attr: (attr[0], attr[1], attr[2], attr[3], attr[5]))
        hashes_query_sorted = sorted(hashes_query, key=lambda attr: (attr[0], attr[1], attr[2], attr[3], attr[5]))

        ## Creating database with KEY: Hash; Value: [LABEL+SEGMENT]
        database_db = Hashes.create_dict(hashes_dict_sorted.copy(), num_hashes_classification,
                                         db_fingerprint, db_name, table_name)

        print('\tNumElem database_db: ', len(database_db))

        ## Create list of keys
        dict_keys = []
        for k in database_db.keys():
            dict_keys.append(k)
        ## Create list of queries, subset
        queries_total = []
        # queries_original = Hashes.create_queries(hashes_dict_sorted, num_hashes_classification)  # Only Test Set
        queries_transformations = Hashes.create_queries(hashes_query_sorted, num_hashes_classification)
        # queries_total = queries_original + queries_transformations  # Only Test Set
        queries_total = queries_transformations

        queries_total = random.sample(queries_total, int(CLASSIFICATION_ACCURACY_SUBSET_SIZE * len(queries_total)))
        print('')
        print('QUERY DATABASE')
        print('\tNumElem Query: ', len(queries_total))
        print('\tsys.getsizeof(queries_total): %i [bytes]' % (sys.getsizeof(queries_total)))

        pool = mp.Pool(NUM_CLASSIFICATION_POOLS)
        thread_subset_size = int(len(queries_total) / NUM_CLASSIFICATION_POOLS)

        s = [queries_total[i * thread_subset_size: (
            (i + 1) * thread_subset_size if (i != NUM_CLASSIFICATION_POOLS - 1) else len(queries_total))] for i in
             range(NUM_CLASSIFICATION_POOLS)]

        qry_timer = time.time()

        hc = pool.map(partial(Hashes.classification_subtask, a=(db_name, table_name)), s)
        h = []
        for c in hc:
            h.extend(c)

        pool.close()
        pool.join()

        print('\tdb_fingerprint.cursor.execute(qry, (qryhash,)): %f [secs]' % (time.time() - qry_timer))
        print('')

        return h


class Evaluate:

    @staticmethod
    def train(mappedFiles, sess, epoch, summary, iterator):
        avg_lb = []
        avg_lb_mse = []
        avg_lb_hash = []
        avg_lb_entropy = []
        avg_simcount = []
        avg_simcount_entropy = []

        sess.run(iterator.initializer)
        iter_handle = sess.run(iterator.string_handle())

        full_start_time = start_time = time.time()
        batch_n = 0

        try:
            while True:
                r_lb, r_lb_mse, r_lb_hash, r_lb_entropy, \
                r_simc, r_simc_ent, \
                r_enc_lhs_0, r_enc_lhs_1, \
                r_enc_dec_0, r_enc_dec_1, \
                r_lr, \
                r_batch_data, \
                r_trn_var, sum_str, _ = sess.run(
                    [loss_batch, loss_batch_mse, loss_batch_hash, loss_batch_entropy,
                     similars_batch, similars_entropy_batch,
                     gstor_enc_lhs[0], gstor_enc_lhs[1],
                     gstor_dec_pre[0], gstor_dec_pre[1],
                     learning_rate,
                     inputs_next_batch,
                     training_variables, summary, train_step],
                    feed_dict={handle: iter_handle})

                if r_lb_hash is None:
                    r_lb_hash = -1
                if r_lb_entropy is None:
                    r_lb_entropy = -1

                avg_lb.append(r_lb)
                avg_lb_mse.append(r_lb_mse)
                avg_lb_hash.append(r_lb_hash)
                avg_lb_entropy.append(r_lb_entropy)
                avg_simcount.append(r_simc)
                avg_simcount_entropy.append(r_simc_ent)

                if (batch_n + 1) % DISPLAY_COUNT == 0:
                    duration = time.time() - start_time
                    print(' %04i\t%04i\t%.8f\t%f\t%f\t%f\t%f\t%f\t%f\t%f' %
                          (epoch, batch_n + 1, r_lr, r_lb, r_lb_mse, r_lb_hash, r_lb_entropy, r_simc, r_simc_ent, duration))
                    with open(LOG_DIR + '/training_batch.txt', 'a') as file:
                        file.write('%04i\t%04i\t%.8f\t%f\t%f\t%f\t%f\t%f\t%f\t%f\n' %
                                   (epoch, batch_n + 1, r_lr, r_lb, r_lb_mse, r_lb_hash, r_lb_entropy, r_simc, r_simc_ent,
                                    duration))
                    start_time = time.time()
                    batches_per_epoch = math.ceil(len(mappedFiles) / float(BATCH_SIZE))
                    counter = epoch * batches_per_epoch + (batch_n + 1)
                    summary_writer_trn.add_summary(sum_str, counter)
                    summary_writer_trn.flush()

                batch_n += 1
        except tf.errors.OutOfRangeError:
            pass

        full_duration = time.time() - full_start_time

        return np.mean(avg_lb), \
               np.mean(avg_lb_mse), \
               np.mean(avg_lb_hash), \
               np.mean(avg_lb_entropy), \
               np.mean(avg_simcount), \
               np.mean(avg_simcount_entropy), \
               full_duration

    @staticmethod
    def evaluate_train(mappedFiles, sess, epoch, summary, iterator):
        avg_lb = []
        avg_lb_mse = []
        avg_lb_hash = []
        avg_lb_entropy = []
        avg_simcount = []
        avg_simcount_entropy = []

        sess.run(iterator.initializer)
        iter_handle = sess.run(iterator.string_handle())

        full_start_time = start_time = time.time()
        batch_n = 0

        try:
            while True:
                r_lb, r_lb_mse, r_lb_hash, r_lb_entropy, \
                r_simc, r_simc_ent, \
                r_enc_lhs_0, r_enc_lhs_1, \
                r_enc_dec_0, r_enc_dec_1, \
                r_lr, \
                r_batch_data = sess.run(
                    [loss_batch, loss_batch_mse, loss_batch_hash, loss_batch_entropy,
                     similars_batch, similars_entropy_batch,
                     gstor_enc_lhs[0], gstor_enc_lhs[1],
                     gstor_dec_pre[0], gstor_dec_pre[1],
                     learning_rate,
                     inputs_next_batch],
                    feed_dict={handle: iter_handle})

                if r_lb_hash is None:
                    r_lb_hash = -1
                if r_lb_entropy is None:
                    r_lb_entropy = -1

                avg_lb.append(r_lb)
                avg_lb_mse.append(r_lb_mse)
                avg_lb_hash.append(r_lb_hash)
                avg_lb_entropy.append(r_lb_entropy)
                avg_simcount.append(r_simc)
                avg_simcount_entropy.append(r_simc_ent)

                if (batch_n + 1) % DISPLAY_COUNT == 0:
                    duration = time.time() - start_time
                    print(' %04i\t%04i\t%.8f\t%f\t%f\t%f\t%f\t%f\t%f\t%f' %
                          (epoch, batch_n + 1, r_lr, r_lb, r_lb_mse, r_lb_hash, r_lb_entropy, r_simc, r_simc_ent, duration))
                    with open(LOG_DIR + '/training_batch.txt', 'a') as file:
                        file.write('%04i\t%04i\t%.8f\t%f\t%f\t%f\t%f\t%f\t%f\t%f\n' %
                                   (epoch, batch_n + 1, r_lr, r_lb, r_lb_mse, r_lb_hash, r_lb_entropy, r_simc, r_simc_ent,
                                    duration))
                    start_time = time.time()

                batch_n += 1
        except tf.errors.OutOfRangeError:
            pass

        full_duration = time.time() - full_start_time

        return np.mean(avg_lb), \
               np.mean(avg_lb_mse), \
               np.mean(avg_lb_hash), \
               np.mean(avg_lb_entropy), \
               np.mean(avg_simcount), \
               np.mean(avg_simcount_entropy), \
               full_duration

    # @staticmethod
    # def evaluate_train_vis(mappedFiles, sess, epoch, iterator):
    #
    #     hashes = []
    #     hashes_labels = []
    #
    #     # Extracting training samples in batches
    #     sess.run(iterator.initializer)
    #     iter_handle = sess.run(iterator.string_handle())
    #
    #     full_start_time = start_time = time.time()
    #     batch_n = 0
    #
    #     try:
    #         while True:
    #             r_enc_lhs_0, r_enc_lhs_1, \
    #             # r_batch_data, r_batch_labels = sess.run( \
    #             r_batch_data = sess.run(
    #                 [gstor_enc_lhs[0], gstor_enc_lhs[1],
    #                  # inputs_next_batch, labels_next_batch], \
    #                  inputs_next_batch],
    #                 feed_dict={handle: iter_handle})
    #
    #             r_enc_lhs = np.concatenate([r_enc_lhs_0, r_enc_lhs_1])
    #             hashes.extend(r_enc_lhs)
    #             hashes_labels.extend(r_batch_labels)
    #
    #             batch_n += 1
    #     except tf.errors.OutOfRangeError:
    #         pass
    #
    #     # # Generate hash visualization
    #     # sess.run([hashes_vis_op2, hash_features_vis_op], feed_dict={hashes_vis_input: hashes})
    #     # for l in hashes_labels:
    #     #     with open(METADATA_PATH, 'a') as file:
    #     #         file.write(l[2] + '\t' + l[3] + '\t' + l[1] + '\t' + str(l[0]) + '\n')

    @staticmethod
    def accuracy(labels_all, segments_db, segments_query, db_fingerprint, db_name, table_name,
                 sess, epoch, num_hashes_class, stats, iter_trn_db, iter_trn_qr):

        start_time = time.time()

        hashes_db = []
        labels_db = []
        hashes_query = []
        labels_query = []

        ## How many batches? How many hashes to keep in the final batch?
        stat_num_batches_db = stats[0]     # int(len(segments_db) / BATCH_SIZE)
        stat_num_batches_query = stats[1]  # int(len(segments_query) / BATCH_SIZE)
        stat_num_missing_db = stats[2]     # len(segments_db) % BATCH_SIZE
        stat_num_missing_query = stats[3]  # len(segments_query) % BATCH_SIZE

        TRN_gen_batches = TRN_gen_batches_db_hashes = time.time()

        iterator_trn_db = iter_trn_db
        sess.run(iterator_trn_db.initializer)
        iter_handle_db = sess.run(iterator_trn_db.string_handle())
        batch_n_db = 0

        ## Generate db hashes
        try:
            while True:
                r_enc_lhs_0, r_enc_lhs_1, \
                r_batch_data, r_batch_labels = sess.run(
                    [gstor_enc_lhs[0], gstor_enc_lhs[1],
                     inputs_next_batch, labels_next_batch],
                    feed_dict={handle: iter_handle_db})

                labels_batch = []
                for k in range(len(r_batch_labels[0])):
                    lbl = [r_batch_labels[1][k],
                           r_batch_labels[2][k],
                           r_batch_labels[3][k],
                           r_batch_labels[4][k],
                           r_batch_labels[5][k],
                           r_batch_labels[0][k]]
                    labels_batch.append(lbl)

                if (batch_n_db + 1) <= stat_num_batches_db:
                    hashes_batch = np.concatenate([r_enc_lhs_0, r_enc_lhs_1])
                    h = np.where(hashes_batch >= 0.0, 1, -1)
                    hashes_db.extend(h)
                    labels_db.extend(labels_batch)
                elif stat_num_missing_db != 0:
                    hashes_batch = np.concatenate([r_enc_lhs_0, r_enc_lhs_1])
                    h = np.where(hashes_batch >= 0.0, 1, -1)
                    hashes_db.extend(h[0:stat_num_missing_db])
                    labels_db.extend(labels_batch[0:stat_num_missing_db])

                batch_n_db += 1
        except tf.errors.OutOfRangeError:
            pass

        print('')
        print('TRAINING ACCURACY')
        print('\tlen(segments_db): ', len(segments_db))
        print('\tTRN_gen_batches_db_hashes: %f [secs]' % (time.time() - TRN_gen_batches_db_hashes))

        iterator_trn_query = iter_trn_qr
        sess.run(iterator_trn_query.initializer)
        iter_handle_query = sess.run(iterator_trn_query.string_handle())
        batch_n_qr = 0

        TRN_gen_batches_query_hashes = time.time()
        ## Generate query hashes
        try:
            while True:
                r_enc_lhs_0, r_enc_lhs_1, \
                r_batch_data, r_batch_labels = sess.run(
                    [gstor_enc_lhs[0], gstor_enc_lhs[1],
                     inputs_next_batch, labels_next_batch],
                    feed_dict={handle: iter_handle_query})

                labels_batch = []
                for k in range(len(r_batch_labels[0])):
                    lbl = [r_batch_labels[1][k],
                           r_batch_labels[2][k],
                           r_batch_labels[3][k],
                           r_batch_labels[4][k],
                           r_batch_labels[5][k],
                           r_batch_labels[0][k]]
                    labels_batch.append(lbl)

                if (batch_n_qr + 1) <= stat_num_batches_query:
                    hashes_batch = np.concatenate([r_enc_lhs_0, r_enc_lhs_1])
                    h = np.where(hashes_batch >= 0.0, 1, -1)
                    hashes_query.extend(h)
                    labels_query.extend(labels_batch)
                elif stat_num_missing_query != 0:
                    hashes_batch = np.concatenate([r_enc_lhs_0, r_enc_lhs_1])
                    h = np.where(hashes_batch >= 0.0, 1, -1)
                    hashes_query.extend(h[0:stat_num_missing_query])
                    labels_query.extend(labels_batch[0:stat_num_missing_query])

                batch_n_qr += 1
        except tf.errors.OutOfRangeError:
            pass

        print('')
        print('\tlen(segments_query): ', len(segments_query))
        print('\tTRN_gen_batches_query_hashes: %f [secs]' % (time.time() - TRN_gen_batches_query_hashes))

        print('')
        print('\tTRN_gen_batches: %f [secs]' % (time.time() - TRN_gen_batches))

        ## Merge labels with hashes
        hl_db = []
        for i in range(len(labels_db)):
            entry = [labels_db[i][0].decode("utf-8"),
                     labels_db[i][1].decode("utf-8"),
                     labels_db[i][2].decode("utf-8"),
                     labels_db[i][3].decode("utf-8"),
                     labels_db[i][4].decode("utf-8"),
                     labels_db[i][5],
                     hashes_db[i]]
            hl_db.append(tuple(entry))

        # hl_db: [Transformation | PersonID | VideoID | AudioFile | Gender | Segment | Hash]

        hl_query = []
        for i in range(len(labels_query)):
            entry = [labels_query[i][0].decode("utf-8"),
                     labels_query[i][1].decode("utf-8"),
                     labels_query[i][2].decode("utf-8"),
                     labels_query[i][3].decode("utf-8"),
                     labels_query[i][4].decode("utf-8"),
                     labels_query[i][5],
                     hashes_query[i]]
            hl_query.append(tuple(entry))

        # hl_query: [Transformation | PersonID | VideoID | AudioFile | Gender | Segment | Hash]
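The thresholding used above for `hashes_db` and `hashes_query` maps real-valued encoder states onto ±1 bit vectors; a minimal standalone sketch of that binarization, together with the standard identity relating ±1 dot products to Hamming distance (the identity itself is background, not part of the listing):

```python
import numpy as np

# Sign-binarize encoder states into {-1, +1} hash bits, as done above
# with np.where(hashes_batch >= 0.0, 1, -1).
states = np.array([[0.7, -0.2, 0.0, -1.3],
                   [-0.5, 0.9, 1.1, 0.4]])
bits = np.where(states >= 0.0, 1, -1)

# For two +/-1 vectors of length n: Hamming distance = (n - a.b) / 2.
a, b = bits[0], bits[1]
hamming = (len(a) - int(a @ b)) // 2
```

This is why ±1 codes are convenient for retrieval: the Hamming distance used to rank fingerprint candidates reduces to an inner product.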
        if DO_SAVE_HASHES:
            ### Writing db hashes
            with open(LOG_DIR + "/hashes_db_Ep" + epoch + ".txt", 'a') as file:
                for hl_db_i in hl_db:
                    tmp_hash = np.array(hl_db_i[4])
                    file.write("%s\t%.4f\t%s\t%04i\t" % (
                        hl_db_i[0], hl_db_i[1], hl_db_i[2], hl_db_i[3]))
                    for bit in range(STATE_SIZE):
                        file.write("%i " % (int(tmp_hash[bit])))
                    file.write("\n")

            ### Writing query hashes
            with open(LOG_DIR + "/hashes_query_Ep" + epoch + ".txt", 'a') as file:
                for hl_query_i in hl_query:
                    tmp_hash = np.array(hl_query_i[4])
                    file.write("%s\t%.4f\t%s\t%04i\t" % (
                        hl_query_i[0], hl_query_i[1],
                        hl_query_i[2], hl_query_i[3]))
                    for bit in range(STATE_SIZE):
                        file.write("%i " % (int(tmp_hash[bit])))
                    file.write("\n")

        ## Calculate collision statistics
        hashes_collisiones_timer = time.time()
        stat_collisions_db = Hashes.collisions(hashes_db)
        # stat_collisions_combine_subset = Hashes.collisions(hashes_query)
        # Order is [NumHashes, NumUniqueHashes, NumCollisions, FractionUniqueHashes, FractionCollisions]
        print('')
        print('Hashes.collisions.timer: %f' % (time.time() - hashes_collisiones_timer))
        print('')

        ## Classify combine_subset hashes against db hashes
        stat_classification = Hashes.classification(hl_db, hl_query, num_hashes_class,
                                                    db_fingerprint, db_name, table_name)
        # Order is [(query), (db)]
        # Order is [(Transformation, Speaker, Video, AudioName, Gender, Segment, Hash),
        #           (Transformation, Speaker, Video, AudioName, Gender)]

        ## Calculate metrics
        acc_list = []
        acc_list_speaker = []  # check original speaker
        acc_list_video = []    # check original video
        acc_list_gender = []   # check original gender

        transformations_dict = defaultdict(list)

        for e in stat_classification:
            query = e[0]
            # query: [0:Transformation | 1:Speaker | 2:Video | 3:AudioName | 4:Gender | 5:Segment | 6:Hash]
            db = e[1]
            # db: [0:Transformation | 1:Speaker | 2:Video | 3:AudioName | 4:Gender]

            if query[1] == db[1]:
                acc_list_speaker.append(1)
            else:
                acc_list_speaker.append(0)
            if query[2] == db[2]:
                acc_list_video.append(1)
            else:
                acc_list_video.append(0)
            if query[4] == db[4]:
                acc_list_gender.append(1)
            else:
                acc_list_gender.append(0)
            if query[1] == db[1] and query[2] == db[2] and query[3] == db[3]:
                acc_list.append(1)
                transformations_dict[query[0]].append(1)
            else:
                acc_list.append(0)
                transformations_dict[query[0]].append(0)

        acc_speaker = sum(acc_list_speaker) / len(acc_list_speaker)
        acc_video = sum(acc_list_video) / len(acc_list_video)
        acc_gender = sum(acc_list_gender) / len(acc_list_gender)
        acc_ = sum(acc_list) / len(acc_list)

        # if acc_speaker + acc_sentence == 0:
        #     f1_score = 0
        # else:
        #     f1_score = 2 * ((acc_speaker * acc_sentence) / (acc_speaker + acc_sentence))

        transformations_acc = []
        for k, v in transformations_dict.items():
            transformations_acc.append(float(sum(v) / len(v)))
            print(k, '\t', sum(v), '\t', len(v), '\t', transformations_acc[-1])

        print('')
        print("Speaker Accuracy: ", acc_speaker)
        print("Video Accuracy: ", acc_video)
        print("Gender Accuracy: ", acc_gender)
        print("Accuracy: ", acc_)
        # print("F1 Score: ", f1_score)
        print("Transformations Accuracy: ", np.mean(transformations_acc))
        print('')

        duration = time.time() - start_time

        return acc_speaker, \
               acc_video, \
               acc_, \
               np.mean(transformations_acc), \
               acc_gender, \
               stat_collisions_db[0], \
               stat_collisions_db[1], \
               stat_collisions_db[2], \
               stat_collisions_db[3], \
               stat_collisions_db[4], \
               duration

    def evaluate_ValAndAcc(labels_all, samples_db, samples_query, db_fingerprint, db_name, sess, epoch, iterator_val):
        avg_lb = []
        avg_lb_mse = []
        avg_lb_hash = []
        avg_lb_entropy = []
        avg_simcount = []
        avg_simcount_entropy = []

        hashes = []
        labels = []

        mappedFiles = (samples_db + samples_query).copy()

        ## How many batches? How many hashes to keep in the final batch?
        numBatches = int(len(mappedFiles) / BATCH_SIZE)
        numMissingHashes = len(mappedFiles) % BATCH_SIZE

        full_start_time = start_time = time.time()
        sess.run(iterator_val.initializer)
        iter_handle_val = sess.run(iterator_val.string_handle())
        batch_n = 0

        try:
            while True:

                r_lb, r_lb_mse, r_lb_hash, r_lb_entropy, \
                r_simc, r_simc_ent, \
                r_enc_lhs_0, r_enc_lhs_1, \
                r_enc_dec_0, r_enc_dec_1, \
                r_batch_data, r_batch_labels = sess.run(
                    [loss_batch, loss_batch_mse, loss_batch_hash, loss_batch_entropy,
                     similars_batch, similars_entropy_batch,
                     gstor_enc_lhs[0], gstor_enc_lhs[1],
                     gstor_dec_pre[0], gstor_dec_pre[1],
                     inputs_next_batch, labels_next_batch],
                    feed_dict={handle: iter_handle_val})

                batch_labels = []
                for k in range(len(r_batch_labels[0])):
                    lbl = [r_batch_labels[1][k],
                           r_batch_labels[2][k],
                           r_batch_labels[3][k],
                           r_batch_labels[4][k],
                           r_batch_labels[5][k],
                           r_batch_labels[0][k]]
                    batch_labels.append(lbl)

                if r_lb_hash is None:
                    r_lb_hash = -1
                if r_lb_entropy is None:
                    r_lb_entropy = -1

                avg_lb.append(r_lb)
                avg_lb_mse.append(r_lb_mse)
                avg_lb_hash.append(r_lb_hash)
                avg_lb_entropy.append(r_lb_entropy)
                avg_simcount.append(r_simc)
                avg_simcount_entropy.append(r_simc_ent)

                if (batch_n + 1) <= numBatches:
                    hashes_batch = np.concatenate([r_enc_lhs_0, r_enc_lhs_1])
                    h = np.where(hashes_batch >= 0.0, 1, -1)
                    hashes.extend(h)
                    labels.extend(batch_labels)
                elif numMissingHashes != 0:
                    hashes_batch = np.concatenate([r_enc_lhs_0, r_enc_lhs_1])
                    h = np.where(hashes_batch >= 0.0, 1, -1)
                    hashes.extend(h[0:numMissingHashes])
                    labels.extend(batch_labels[0:numMissingHashes])

                if (batch_n + 1) % DISPLAY_COUNT == 0:
                    duration = time.time() - start_time
                    print(' %04i\t%04i\t%f\t%f\t%f\t%f\t%f\t%f\t%f' %
                          (epoch, batch_n + 1, r_lb, r_lb_mse, r_lb_hash, r_lb_entropy, r_simc, r_simc_ent, duration))
                    with open(LOG_DIR + '/validation_batch.txt', 'a') as file:
                        file.write('%04i\t%04i\t%f\t%f\t%f\t%f\t%f\t%f\t%f\n' %
                                   (epoch, batch_n + 1, r_lb, r_lb_mse, r_lb_hash, r_lb_entropy, r_simc, r_simc_ent,
                                    duration))
                    start_time = time.time()

                batch_n += 1
        except tf.errors.OutOfRangeError:
            pass

        full_duration = time.time() - full_start_time
        ## Splitting hashes into hashes_db and hashes_query and merging labels with hashes
        hashes_db = []
        hashes_query = []

        hl_db = []
        hl_query = []

        # labels order is [Transformation, Speaker, Video, AudioName, Gender, Segment, Hash]
        for i in range(len(labels)):
            entry = [labels[i][0].decode("utf-8"),
                     labels[i][1].decode("utf-8"),
                     labels[i][2].decode("utf-8"),
                     labels[i][3].decode("utf-8"),
                     labels[i][4].decode("utf-8"),
                     labels[i][5],
                     hashes[i]]

            if entry[0] == 'Original':
                hashes_db.append(hashes[i])
                hl_db.append(tuple(entry))
            else:
                hl_query.append(tuple(entry))
                hashes_query.append(hashes[i])

        # Order is [Transformation, Speaker, Video, Audio, Segment, Hash]

        if DO_SAVE_HASHES:
            ### Writing hl_db
            with open(LOG_DIR + "/VAL_hashes_db_Ep" + epoch + ".txt", 'a') as file:
                for hl_db_i in hl_db:
                    tmp_hash = np.array(hl_db_i[4])
                    file.write("%s\t%s\t%s\t%04i\t" % (
                        hl_db_i[2], hl_db_i[3], hl_db_i[1], hl_db_i[0]))
                    for bit in range(STATE_SIZE):
                        file.write("%i " % (int(tmp_hash[bit])))
                    file.write("\n")

            ### Writing hl_query
            with open(LOG_DIR + "/VAL_hashes_query_Ep" + epoch + ".txt", 'a') as file:
                for hl_query_i in hl_query:
                    tmp_hash = np.array(hl_query_i[4])
                    file.write("%s\t%.4f\t%s\t%04i\t" % (
                        hl_query_i[2], hl_query_i[3],
                        hl_query_i[1], hl_query_i[0]))
                    for bit in range(STATE_SIZE):
                        file.write("%i " % (int(tmp_hash[bit])))
                    file.write("\n")

        if DO_ACCURACY:

            start_time_acc = time.time()

            # Creating DATABASE.TABLE for AUDIO_LENGTH_CLASSIFICATION
            table_name_val = 'VAL_fingerprints_{}'.format(int(1))
            if DO_CREATE_TABLE:
                db_fingerprint.create_table(db_name, table_name_val)

            ## Calculate collision statistics
            hashes_collisiones_timer = time.time()
            stat_collisions_db = Hashes.collisions(hashes_db)
            # stat_collisions_combine_subset = Hashes.collisions(hashes_query)
            # Order is [NumHashes, NumUniqueHashes, NumCollisions, FractionUniqueHashes, FractionCollisions]
            print('')
            print('Hashes.collisions.timer: %f' % (time.time() - hashes_collisiones_timer))
            print('')

            ## Classify hl_query against hl_db
            num_hashes_class = 1  # For training it is always one
            stat_classification = Hashes.classification(hl_db, hl_query, num_hashes_class,
                                                        db_fingerprint, db_name, table_name_val)
            # Order is [(query), (db)]
            # Order is [(Transformation, Speaker, Video, AudioName, Gender, Segment, Hash),
            #           (Transformation, Speaker, Video, AudioName, Gender)]

            print('')
            print('Num Hashes Qry classification: ', len(stat_classification))
            print('')

            ## Calculate metrics
            acc_list_speaker = []  # check original speaker
            acc_list_video = []    # check original video
            acc_list_gender = []   # check original gender
            acc_list = []          # check accuracy
            transformations_dict = defaultdict(list)

            for e in stat_classification:
                query = e[0]
                # query: [0:Transformation | 1:Speaker | 2:Video | 3:AudioName | 4:Gender | 5:Segment | 6:Hash]
                db = e[1]
                # db: [0:Transformation | 1:Speaker | 2:Video | 3:AudioName | 4:Gender]
                if query[1] == db[1]:
                    acc_list_speaker.append(1)
                else:
                    acc_list_speaker.append(0)
                if query[2] == db[2]:
                    acc_list_video.append(1)
                else:
                    acc_list_video.append(0)
                if query[4] == db[4]:
                    acc_list_gender.append(1)
                else:
                    acc_list_gender.append(0)
                if query[1] == db[1] and query[2] == db[2] and query[3] == db[3]:
                    acc_list.append(1)
                    transformations_dict[query[0]].append(1)
                else:
                    acc_list.append(0)
                    transformations_dict[query[0]].append(0)

            acc_speaker = sum(acc_list_speaker) / len(acc_list_speaker)
            acc_video = sum(acc_list_video) / len(acc_list_video)
            acc_gender = sum(acc_list_gender) / len(acc_list_gender)
            acc_ = sum(acc_list) / len(acc_list)

            # if acc_speaker + acc_sentence == 0:
            #     f1_score = 0
            # else:
            #     f1_score = 2 * ((acc_speaker * acc_sentence) / (acc_speaker + acc_sentence))

            transformations_acc = []
            for k, v in transformations_dict.items():
                transformations_acc.append(float(sum(v) / len(v)))
                print(k, '\t', sum(v), '\t', len(v), '\t', transformations_acc[-1])

            acc_trans = np.mean(transformations_acc)

            print("Speaker Accuracy: ", acc_speaker)
            print("Video Accuracy: ", acc_video)
            print("Gender Accuracy: ", acc_gender)
            print("Accuracy: ", acc_)
            # print("F1 Score: ", f1_score)
            print("Transformations Accuracy: ", acc_trans)
            print('')

            duration_acc = time.time() - start_time_acc

            spk_acc_val = acc_speaker
            vdo_acc_val = acc_video
            acc_val = acc_
            # f1_val = f1_score
            trans_acc_val = acc_trans
            gender_acc_val = acc_gender
            duration_acc_val = duration_acc
            numHashes_acc_val = stat_collisions_db[0]
            uniqueHashes_acc_val = stat_collisions_db[1]
            numCollisions_acc_val = stat_collisions_db[2]
            fraqUniqueHashes_acc_val = stat_collisions_db[3]
            fraqCollisions_acc_val = stat_collisions_db[4]

        else:
            spk_acc_val = vdo_acc_val = acc_val = trans_acc_val = gender_acc_val = duration_acc_val = -1
            numHashes_acc_val = uniqueHashes_acc_val = numCollisions_acc_val = -1
            fraqUniqueHashes_acc_val = fraqCollisions_acc_val = -1

        return np.mean(avg_lb), np.mean(avg_lb_mse), np.mean(avg_lb_hash), np.mean(avg_lb_entropy), \
               np.mean(avg_simcount), np.mean(avg_simcount_entropy), \
               full_duration, \
               spk_acc_val, vdo_acc_val, acc_val, trans_acc_val, gender_acc_val, \
               numHashes_acc_val, uniqueHashes_acc_val, numCollisions_acc_val, \
               fraqUniqueHashes_acc_val, fraqCollisions_acc_val, \
               duration_acc_val


##### Logging

if not os.path.exists(LOG_DIR):
    os.makedirs(LOG_DIR)
logger = Logger(LOG_DIR + '/log.txt')
sys.stdout = logger

shutil.copyfile(os.path.realpath(__file__), (LOG_DIR + '/script.py'))

##### Run

print('Acoustic Hashing v3.10')
print('Run: ' + RUN_ID)
print('Time: ' + time.strftime('%a, %d %b %Y %H:%M:%S'))
print('=================================================')
print('Parameters')
print('  Per-Cycle Batch size:', BATCH_SIZE)
print('  Per-Tower Batch size:', GPU_BATCH_SIZE)
print('  State Size:', STATE_SIZE)
print('  Num Steps (BPTT):', NUM_STEPS)
print('  Num Features:', NUM_FEATS)
print('  Num Epochs:', NUM_EPOCHS)
print('  Num Epochs Validation:', NUM_EPOCHS_VAL)
print('  Num Layers:', NUM_LAYERS)
print('  Do Epoch 0:', DO_EPOCH_ZERO)
print('  Sharing Weights:', SHARE_WEIGHTS)
print('  Num Towers:', len(TOWERS_LIST))
print('  GPU List:', TOWERS_LIST)
if PRETRAINED:
    print('  PRETRAINED: ', PRETRAINED)
    print('  Directory Models:', LOG_DIR_MODELS)
    print('  Model Number:', MODEL)
else:
    print('  PRETRAINED: ', PRETRAINED)
print('  Using Peepholes:', USE_PEEPHOLES)
print('  Training')
print('    Optimizer: Adagrad')
print('      Initial:', INITIAL_LEARNING_RATE)
print('      DecayRate:', DECAY_RATE)
print('      Decays/Batch:', NUM_DECAY_STEP)
print('      Momentum: N/A')
print('    LossMSE: True')
print('      Alpha:', LOSS_ALPHA)
print('    LossHASH:', INCLUDE_HASH_LOSS)
print('      Beta:', LOSS_BETA)
if INCLUDE_HASH_LOSS:
    if HASH_LOSS_USES_SXH == 0:
        using_type = 'S'
    elif HASH_LOSS_USES_SXH == 1:
        using_type = 'X'
    elif HASH_LOSS_USES_SXH == 2:
        using_type = 'H'
    else:
        using_type = 'INVALID'
    print('      Using:', using_type)
    print('      Sij Constant:', HASH_LOSS_USES_CONSTANT_Sij)
    print('      Normalized:', HASH_LOSS_NORMALIZED)
print('    LossENTROPY:', INCLUDE_ENTROPY_LOSS)
print('      Gamma:', LOSS_GAMMA)
print('  Dataset')
print('    Database:', DATABASE_PATH)
print('    Labels:', DATA_LABELS_PATH)
print('    Audio Database:', AUDIO_DATABASE_PATH)
print('    Subset Size:', DATA_SUBSET)
print('    Subset Size sub:', DATA_SUBSET_sub)
print('    Accuracy:', DO_ACCURACY)
if (DO_ACCURACY):
    print('      Accuracy Sub-Subset Size:', CLASSIFICATION_ACCURACY_SUBSET_SIZE)
    print('      Pooling Threads:', NUM_CLASSIFICATION_POOLS)
    print('      Dictionary Key Size:', NUM_HASHES_DICT)
    print('      Classification Size:', NUM_HASHES_CLASSIFICATION)
print('  Logging')
print('    Directory:', LOG_DIR)
print('    Evaluate every', DISPLAY_COUNT, 'batches')
print('    Save')
print('      Frequency')
print('        Model', SAVE_MODEL_EVERY_HOURS, 'hours')
print('        Vis', SAVE_VIS_EVERY_HOURS, 'hours')
print('      Use')
print('        Loss', SAVE_WITH_LOSS)
print('        Accuracy', SAVE_WITH_ACCURACY)
print('    All_Models: ', SAVE_ALL_MODELS)
print('    Maximum Number of Models: ', MAX_TO_KEEP)
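The `HASH_LOSS_USES_SXH` switch reported in the parameter dump above selects which tensor feeds the hash loss (0 for the encoder state S, 1 for the reconstruction X, 2 for the binarized hash H). A minimal sketch of that mapping, with `sxh_label` as a hypothetical helper name:

```python
def sxh_label(hash_loss_uses_sxh):
    # Mirrors the S/X/H selection logic shown in the parameter dump:
    # any value outside {0, 1, 2} is reported as invalid.
    return {0: 'S', 1: 'X', 2: 'H'}.get(hash_loss_uses_sxh, 'INVALID')
```

A dict lookup with a default keeps the same semantics as the if/elif chain while making the invalid case explicit.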
print('=================================================')

print('')
print('Generating Datasets....')

LABELS_TRN, LABELS_VAL, LABELS_TST = Data.load_labels()

# """ REMOVE AFTER TESTING IS OVER """
# random.shuffle(LABELS_TRN)
# random.shuffle(LABELS_VAL)
# random.shuffle(LABELS_TST)
# """ REMOVE AFTER TESTING IS OVER """

print('\tDistribution')

print('')
print('\t\tTrn: ', len(LABELS_TRN))
print('\t\tLABELS_TRN[-3:]: ', LABELS_TRN[-3:])

print('')
print('\t\tVal: ', len(LABELS_VAL))
print('\t\tLABELS_VAL[-3:]: ', LABELS_VAL[-3:])

print('')
print('\t\tTst: ', len(LABELS_TST))
print('\t\tLABELS_TST[-3:]: ', LABELS_TST[-3:])

if DATA_SUBSET < 1.0:
    ntrn = int(len(LABELS_TRN) * DATA_SUBSET)
    nval = int(len(LABELS_VAL) * DATA_SUBSET)
    ntst = int(len(LABELS_TST) * DATA_SUBSET)

    LABELS_TRN = LABELS_TRN[0:ntrn]
    LABELS_VAL = LABELS_VAL[0:nval]
    LABELS_TST = LABELS_TST[0:ntst]

SEGMENTS_TRN, SAMPLES_TRN = Data.load_segment_list(LABELS_TRN)
SEGMENTS_VAL, SAMPLES_VAL = Data.load_segment_list(LABELS_VAL)
SEGMENTS_TST, SAMPLES_TST = Data.load_segment_list(LABELS_TST)

print('\tSegments')
print('\t\tTrn: ' + str(len(SEGMENTS_TRN)))
print('\t\tVal: ' + str(len(SEGMENTS_VAL)))
print('\t\tTst: ' + str(len(SEGMENTS_TST)))
print('\t\tTOT: ' + str(len(SEGMENTS_TRN) + len(SEGMENTS_VAL) + len(SEGMENTS_TST)))

print('\tSamples')
print('\t\tTrn: ' + str(len(SAMPLES_TRN)))
print('\t\tVal: ' + str(len(SAMPLES_VAL)))
print('\t\tTst: ' + str(len(SAMPLES_TST)))
print('\t\tTOT: ' + str(len(SAMPLES_TRN) + len(SAMPLES_VAL) + len(SAMPLES_TST)))

print('')
print('SEGMENTS')
print('SEGMENTS_TRN[0:3]: ', SEGMENTS_TRN[0:3])
print('SEGMENTS_VAL[0:3]: ', SEGMENTS_VAL[0:3])
print('SEGMENTS_TST[0:3]: ', SEGMENTS_TST[0:3])
print('SEGMENTS_TRN[-3:]: ', SEGMENTS_TRN[-3:])
print('SEGMENTS_VAL[-3:]: ', SEGMENTS_VAL[-3:])
print('SEGMENTS_TST[-3:]: ', SEGMENTS_TST[-3:])

print('')
print('SAMPLES')
print('SAMPLES_TRN[0:3]: ', SAMPLES_TRN[0:3])
print('SAMPLES_VAL[0:3]: ', SAMPLES_VAL[0:3])
print('SAMPLES_TST[0:3]: ', SAMPLES_TST[0:3])

print('')
print('\tOriginal+Transformations (Samples)')

SAMPLES_TRN_ORIG = []
SAMPLES_TRN_TRANS = []
SAMPLES_VAL_ORIG = []
SAMPLES_VAL_TRANS = []
SAMPLES_TST_ORIG = []
SAMPLES_TST_TRANS = []

for s in SAMPLES_TRN:
    if LABELS_TRN[s[0]][0] == 'Original':
        SAMPLES_TRN_ORIG.append(s)
    else:
        SAMPLES_TRN_TRANS.append(s)
for s in SAMPLES_VAL:
    if LABELS_VAL[s[0]][0] == 'Original':
        SAMPLES_VAL_ORIG.append(s)
    else:
        SAMPLES_VAL_TRANS.append(s)
for s in SAMPLES_TST:
    if LABELS_TST[s[0]][0] == 'Original':
        SAMPLES_TST_ORIG.append(s)
    else:
        SAMPLES_TST_TRANS.append(s)

print('\t\tTrn Orig: ' + str(len(SAMPLES_TRN_ORIG)))
print('\t\tTrn Transformations: ' + str(len(SAMPLES_TRN_TRANS)))
print('\t\tTrn Sum: ' + str(len(SAMPLES_TRN_ORIG) + len(SAMPLES_TRN_TRANS)))
print('\t\tVal Orig: ' + str(len(SAMPLES_VAL_ORIG)))
print('\t\tVal Transformations: ' + str(len(SAMPLES_VAL_TRANS)))
print('\t\tVal Sum: ' + str(len(SAMPLES_VAL_ORIG) + len(SAMPLES_VAL_TRANS)))
print('\t\tTst Orig: ' + str(len(SAMPLES_TST_ORIG)))
print('\t\tTst Transformations: ' + str(len(SAMPLES_TST_TRANS)))
print('\t\tTst Sum: ' + str(len(SAMPLES_TST_ORIG) + len(SAMPLES_TST_TRANS)))
print('\t\tTOT Orig: ' + str(len(SAMPLES_TRN_ORIG) + len(SAMPLES_VAL_ORIG) + len(SAMPLES_TST_ORIG)))
print('\t\tTOT Transformations: ' + str(len(SAMPLES_TRN_TRANS) + len(SAMPLES_VAL_TRANS) + len(SAMPLES_TST_TRANS)))
print('\t\tTOT Sum: ' + str(len(SAMPLES_TRN_ORIG) + len(SAMPLES_VAL_ORIG) + len(SAMPLES_TST_ORIG) +
                            len(SAMPLES_TRN_TRANS) + len(SAMPLES_VAL_TRANS) + len(SAMPLES_TST_TRANS)))

print("")
print('\tBatches Per Epoch')

trn_bpe = (len(SAMPLES_TRN) / float(BATCH_SIZE))
print('\t\tTrn: ' + str(trn_bpe))
val_bpe = (len(SAMPLES_VAL) / float(BATCH_SIZE))
print('\t\tVal: ' + str(val_bpe))
tst_bpe = (len(SAMPLES_TST) / float(BATCH_SIZE))
print('\t\tTst: ' + str(tst_bpe))

print('')
print('=================================================')

print('=================================================')

print('')
print('Generating Datasets SUBSETS!!!....')

LABELS_TRN_ORIG, LABELS_VAL_ORIG, LABELS_TST_ORIG = Data.load_labels_orig()

""" Selecting SUBSETS!!! """
random.shuffle(LABELS_TRN_ORIG)
random.shuffle(LABELS_VAL_ORIG)
random.shuffle(LABELS_TST_ORIG)

ntrn_sub = int(len(LABELS_TRN_ORIG) * DATA_SUBSET_sub)
nval_sub = int(len(LABELS_VAL_ORIG) * DATA_SUBSET_sub)
ntst_sub = int(len(LABELS_TST_ORIG) * DATA_SUBSET_sub)

LABELS_TRN_ORIG_sub = LABELS_TRN_ORIG[0:ntrn_sub]
LABELS_VAL_ORIG_sub = LABELS_VAL_ORIG[0:nval_sub]
LABELS_TST_ORIG_sub = LABELS_TST_ORIG[0:ntst_sub]

LABELS_TRN_sub = Data.addTrans(LABELS_TRN_ORIG_sub)
LABELS_VAL_sub = Data.addTrans(LABELS_VAL_ORIG_sub)
LABELS_TST_sub = Data.addTrans(LABELS_TST_ORIG_sub)

# print(len(LABELS_TRN_ORIG_sub), ' x22 = ', len(LABELS_TRN_sub))
# print(len(LABELS_VAL_ORIG_sub), ' x22 = ', len(LABELS_VAL_sub))
# print(len(LABELS_TST_ORIG_sub), ' x22 = ', len(LABELS_TST_sub))

print('\tDistribution_sub')

print('')
print('\t\tTrn_sub: ', len(LABELS_TRN_sub))
print('\t\tLABELS_TRN_sub[-3:]: ', LABELS_TRN_sub[-3:])

print('')
print('\t\tVal_sub: ', len(LABELS_VAL_sub))
print('\t\tLABELS_VAL_sub[-3:]: ', LABELS_VAL_sub[-3:])

print('')
print('\t\tTst_sub: ', len(LABELS_TST_sub))
print('\t\tLABELS_TST_sub[-3:]: ', LABELS_TST_sub[-3:])

SEGMENTS_TRN_sub, SAMPLES_TRN_sub = Data.load_segment_list(LABELS_TRN_sub)
SEGMENTS_VAL_sub, SAMPLES_VAL_sub = Data.load_segment_list(LABELS_VAL_sub)
SEGMENTS_TST_sub, SAMPLES_TST_sub = Data.load_segment_list(LABELS_TST_sub)

print('\tSegments_sub')
print('\t\tTrn_sub: ' + str(len(SEGMENTS_TRN_sub)))
print('\t\tVal_sub: ' + str(len(SEGMENTS_VAL_sub)))
print('\t\tTst_sub: ' + str(len(SEGMENTS_TST_sub)))
print('\t\tTOT_sub: ' + str(len(SEGMENTS_TRN_sub) + len(SEGMENTS_VAL_sub) + len(SEGMENTS_TST_sub)))

print('\tSamples_sub')
print('\t\tTrn_sub: ' + str(len(SAMPLES_TRN_sub)))
print('\t\tVal_sub: ' + str(len(SAMPLES_VAL_sub)))
print('\t\tTst_sub: ' + str(len(SAMPLES_TST_sub)))
print('\t\tTOT_sub: ' + str(len(SAMPLES_TRN_sub) + len(SAMPLES_VAL_sub) + len(SAMPLES_TST_sub)))

print('')
print('SEGMENTS_sub')
print('SEGMENTS_TRN_sub[0:3]: ', SEGMENTS_TRN_sub[0:3])
print('SEGMENTS_VAL_sub[0:3]: ', SEGMENTS_VAL_sub[0:3])
print('SEGMENTS_TST_sub[0:3]: ', SEGMENTS_TST_sub[0:3])
print('SEGMENTS_TRN_sub[-3:]: ', SEGMENTS_TRN_sub[-3:])
print('SEGMENTS_VAL_sub[-3:]: ', SEGMENTS_VAL_sub[-3:])
print('SEGMENTS_TST_sub[-3:]: ', SEGMENTS_TST_sub[-3:])

print('')
print('SAMPLES_sub')
print('SAMPLES_TRN_sub[0:3]: ', SAMPLES_TRN_sub[0:3])
print('SAMPLES_VAL_sub[0:3]: ', SAMPLES_VAL_sub[0:3])
print('SAMPLES_TST_sub[0:3]: ', SAMPLES_TST_sub[0:3])

print('')
print('\tOriginal+Transformations (Samples_sub)')

SAMPLES_TRN_ORIG_sub = []
SAMPLES_TRN_TRANS_sub = []
SAMPLES_VAL_ORIG_sub = []
SAMPLES_VAL_TRANS_sub = []
SAMPLES_TST_ORIG_sub = []
SAMPLES_TST_TRANS_sub = []

for s in SAMPLES_TRN_sub:
    if LABELS_TRN_sub[s[0]][0] == 'Original':
        SAMPLES_TRN_ORIG_sub.append(s)
    else:
        SAMPLES_TRN_TRANS_sub.append(s)
for s in SAMPLES_VAL_sub:
    if LABELS_VAL_sub[s[0]][0] == 'Original':
        SAMPLES_VAL_ORIG_sub.append(s)
    else:
        SAMPLES_VAL_TRANS_sub.append(s)
for s in SAMPLES_TST_sub:
    if LABELS_TST_sub[s[0]][0] == 'Original':
        SAMPLES_TST_ORIG_sub.append(s)
    else:
        SAMPLES_TST_TRANS_sub.append(s)

print('\t\tTrn_sub Orig: ' + str(len(SAMPLES_TRN_ORIG_sub)))
print('\t\tTrn_sub Transformations: ' + str(len(SAMPLES_TRN_TRANS_sub)))
print('\t\tTrn_sub Sum: ' + str(len(SAMPLES_TRN_ORIG_sub) + len(SAMPLES_TRN_TRANS_sub)))
print('\t\tVal_sub Orig: ' + str(len(SAMPLES_VAL_ORIG_sub)))
print('\t\tVal_sub Transformations: ' + str(len(SAMPLES_VAL_TRANS_sub)))
print('\t\tVal_sub Sum: ' + str(len(SAMPLES_VAL_ORIG_sub) + len(SAMPLES_VAL_TRANS_sub)))
print('\t\tTst_sub Orig: ' + str(len(SAMPLES_TST_ORIG_sub)))
print('\t\tTst_sub Transformations: ' + str(len(SAMPLES_TST_TRANS_sub)))
print('\t\tTst_sub Sum: ' + str(len(SAMPLES_TST_ORIG_sub) + len(SAMPLES_TST_TRANS_sub)))
print('\t\tTOT_sub Orig: ' + str(len(SAMPLES_TRN_ORIG_sub) + len(SAMPLES_VAL_ORIG_sub) + len(SAMPLES_TST_ORIG_sub)))
print('\t\tTOT_sub Transformations: ' + str(len(SAMPLES_TRN_TRANS_sub) + len(SAMPLES_VAL_TRANS_sub) + len(SAMPLES_TST_TRANS_sub)))
print('\t\tTOT_sub Sum: ' + str(len(SAMPLES_TRN_ORIG_sub) + len(SAMPLES_VAL_ORIG_sub) + len(SAMPLES_TST_ORIG_sub) +
                                len(SAMPLES_TRN_TRANS_sub) + len(SAMPLES_VAL_TRANS_sub) + len(SAMPLES_TST_TRANS_sub)))

print("")
print('\tBatches Per Epoch SUBSET!!!')

trn_bpe = math.ceil(len(SAMPLES_TRN_sub) / float(BATCH_SIZE))
print('\t\tTrn_sub: ' + str(trn_bpe))
val_bpe = math.ceil(len(SAMPLES_VAL_sub) / float(BATCH_SIZE))
print('\t\tVal_sub: ' + str(val_bpe))
tst_bpe = math.ceil(len(SAMPLES_TST_sub) / float(BATCH_SIZE))
print('\t\tTst_sub: ' + str(tst_bpe))

print('')
print('================================================================')
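Unlike the full-dataset counts earlier (which keep the fractional quotient), the subset batches-per-epoch above round up with `math.ceil`, so a trailing partial batch still counts as a pass. A minimal sketch of that convention:

```python
import math

def batches_per_epoch(num_samples, batch_size):
    # math.ceil, as used for the *_sub splits in the listing: any
    # remainder of samples contributes one additional (partial) batch.
    return math.ceil(num_samples / float(batch_size))
```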
print('')
print('Generating Logs....')
with open(os.path.join(LOG_DIR, 'training.txt'), 'a') as file:
    file.write(
        'Epoch\tTotal_Loss\tMSE_Loss\tHash_Loss\tEntropy_Loss\tSimilar_Count\tSimilar_Count_Entropy\tDuration[secs]\n')

with open(LOG_DIR + '/validation.txt', 'a') as file:
    file.write(
        'Epoch\tTotal_Loss\tMSE_Loss\tHash_Loss\tEntropy_Loss\tSimilar_Count\tSimilar_Count_Entropy\tDuration[secs]\n')

with open(LOG_DIR + '/training_batch.txt', 'a') as file:
    file.write(
        'Epoch\tBatch\tLearning_Rate\tTotal_Loss\tMSE_Loss\tHash_Loss\tEntropy_Loss\tSimilar_Count\tSimilar_Count_Entropy\tDuration[secs]\n')

with open(LOG_DIR + '/validation_batch.txt', 'a') as file:
    file.write(
        'Epoch\tBatch\tTotal_Loss\tMSE_Loss\tHash_Loss\tEntropy_Loss\tSimilar_Count\tSimilar_Count_Entropy\tDuration[secs]\n')

with open(LOG_DIR + '/training_acc.txt', 'a') as file:
    file.write(
        'Epoch\tNum_Hashes\tUnique_Hashes\tCollisions[%]\tSpkAcc[%]\tVideoAcc[%]\tAccuracy[%]\tTransAcc[%]\tGenderAcc[%]\tDuration[secs]\n')

with open(LOG_DIR + '/validation_acc.txt', 'a') as file:
    file.write(
        'Epoch\tNum_Hashes\tUnique_Hashes\tCollisions[%]\tSpkAcc[%]\tVideoAcc[%]\tAccuracy[%]\tTransAcc[%]\tGenderAcc[%]\tDuration[secs]\n')

# with open(os.path.join(LOG_DIR, 'models.txt'), 'a') as file:
#     file.write('Epoch\tSaveModel\tSaveLoss\tSaveAcc\tLossTrn\tLossVal\tAccTrn\tAccVal\n')

print('')
print('==========================')

print('')
print('Generating Graph....')

tf.reset_default_graph()

with tf.Graph().as_default():
    print('\tLearning Rate')
    with tf.variable_scope('LearningRate'):
        DECAY_STEP = np.ceil(trn_bpe * NUM_EPOCHS / NUM_DECAY_STEP)

        step_lr = tf.get_variable('step_lr', [],
                                  initializer=tf.zeros_initializer(),
                                  trainable=False,
                                  dtype=tf.float64)

        learning_rate = tf.train.exponential_decay(
            INITIAL_LEARNING_RATE,  # Base learning rate
            step_lr,                # Current index into the dataset
            DECAY_STEP,             # Decay step
            DECAY_RATE              # Decay rate
        )

    optimizer = tf.train.AdagradOptimizer(learning_rate)

    inputs = []
    models = []
    gstor_dec_pre = []
    gstor_enc_lhs = []
    gstor_enc_out = []
    gstor_dec_out = []
    losses = []
    losses_mse = []
    losses_hash = []
    losses_entropy = []
    similars = []
    similars_entropy = []
    gradients = []

    print('\tDataStaging')

    _dataset_trn = Data.generate_dataset(SAMPLES_TRN)
    handle = tf.placeholder(tf.string, shape=[])
    enc_inputs_iterator = tf.data.Iterator.from_string_handle(handle, _dataset_trn.output_types, _dataset_trn.output_shapes)
    inputs_next_batch, labels_next_batch = enc_inputs_iterator.get_next()
    # inputs_next_batch = enc_inputs_iterator.get_next()
    inputs_next_batch = tf.cast(inputs_next_batch, tf.float64)
    inputs_tower_batch = tf.split(inputs_next_batch, num_or_size_splits=len(TOWERS_LIST), axis=0)

    for tower_idx, device_id in enumerate(TOWERS_LIST):
        TOWER_TIME_INIT = time.time()
        with tf.device('/gpu:' + str(device_id)):
            print('\tTower' + str(device_id))
            print('\t\tDataInput' + str(device_id))
            # enc_inputs = tf.placeholder(tf.float64, [GPU_BATCH_SIZE, NUM_STEPS, NUM_FEATS], name='Encoder_Inputs')
            enc_inputs = inputs_tower_batch[tower_idx]
            enc_inputs.set_shape((GPU_BATCH_SIZE, NUM_STEPS, NUM_FEATS))
            inputs.append(enc_inputs)

            # model = BasicSeq2Seq()
            model = Seq2Seq(num_layers=NUM_LAYERS, same_encoder_decoder_cell=SHARE_WEIGHTS)
            models.append(model)

            print('\t\tVars' + str(device_id))
            with tf.name_scope('Variables'):
                model.gen_variables()

            with tf.name_scope('Tower' + str(device_id)):
                print('\t\tOps' + str(device_id))
                dec_predictions, enc_last_hidden_state, enc_outputs, dec_outputs = model.gen_operations(enc_inputs)
                gstor_dec_pre.append(dec_predictions)
                gstor_enc_lhs.append(enc_last_hidden_state)
                gstor_enc_out.append(enc_outputs)
                gstor_dec_out.append(dec_outputs)

                print('\t\tLoss' + str(device_id))
                with tf.variable_scope('tLoss'):
                    similar_count, similar_count_entropy = Loss.count_similars(tf.unstack(enc_last_hidden_state))
                    similars.append(similar_count)
                    similars_entropy.append(similar_count_entropy)

                    with tf.name_scope('MSE'):
                        loss_mse = Loss.mse_loss(enc_inputs, dec_predictions)
                        loss_mse = tf.multiply(tf.constant(LOSS_ALPHA, dtype=tf.float64), loss_mse)
                        loss = loss_mse
                        losses_mse.append(loss_mse)

                    if SHOW_HASH_LOSS:
                        with tf.name_scope('Hash'):
                            if HASH_LOSS_USES_SXH == 0:
                                loss_hash = Loss.hash_loss(tf.unstack(enc_last_hidden_state))
                            elif HASH_LOSS_USES_SXH == 1:
                                loss_hash = Loss.hash_loss(tf.unstack(dec_predictions))
                            elif HASH_LOSS_USES_SXH == 2:
                                loss_hash = Loss.hash_loss(
                                    tf.unstack(Loss.convert_real_features_to_integer_features(enc_last_hidden_state)))
                            else:
                                exit(-10)
                            loss_hash = tf.multiply(tf.constant(LOSS_BETA, dtype=tf.float64), loss_hash)
                            if INCLUDE_HASH_LOSS:
                                loss = loss + loss_hash
                            losses_hash.append(loss_hash)
                    if SHOW_ENTROPY_LOSS:
                        with tf.name_scope('Entropy'):
                            loss_entropy = Loss.entropy_loss(
                                tf.unstack(Loss.convert_real_features_to_integer_features(enc_last_hidden_state)))
                            loss_entropy = tf.multiply(tf.constant(LOSS_GAMMA, dtype=tf.float64), loss_entropy)
                            if INCLUDE_ENTROPY_LOSS:
                                loss = loss + loss_entropy
                            losses_entropy.append(loss_entropy)

                    losses.append(loss)

                print('\t\tGrad' + str(device_id))
                with tf.variable_scope('tGradient'):
                    gradient = optimizer.compute_gradients(loss)
                    gradients.append(gradient)

            print('\t\tSummary~Model' + str(device_id))
            model.gen_summaries()

        print('')
        print('\t\tTOWER_' + str(device_id) + ': %.2f seconds' % (time.time() - TOWER_TIME_INIT))
        print('')

    print('\tGradientAverage')
    with tf.variable_scope('GradientAverage'), tf.device('/gpu:' + str(TOWERS_LIST[0])):
        gradients_mean = combine_gradients(gradients)

    print('\tOptimizer')
    with tf.variable_scope('Optimizer') as optim, tf.device('/gpu:' + str(TOWERS_LIST[0])):
        train_step = optimizer.apply_gradients(gradients_mean, global_step=step_lr)
        training_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=optim.name)

    print('\tSummary')
    print('\t\tAccuracy')
    with tf.name_scope('EpochAccuracy'):
        num_unique_hashes_epoch = tf.placeholder(tf.int64, name='Num_Unique_Hashes_Epoch')
        tf.summary.scalar('Num_Unique_Hashes_Epoch', num_unique_hashes_epoch, collections=['epoch'])

        collisions_epoch = tf.placeholder(tf.float64, name='Collisions_Epoch')
        tf.summary.scalar('Collisions_Epoch', collisions_epoch, collections=['epoch'])

        speaker_epoch = tf.placeholder(tf.float64, name='Speaker_Epoch')
        tf.summary.scalar('Speaker_Epoch', speaker_epoch, collections=['epoch'])

        video_epoch = tf.placeholder(tf.float64, name='Video_Epoch')
        tf.summary.scalar('Video_Epoch', video_epoch, collections=['epoch'])
        accuracy_epoch = tf.placeholder(tf.float64, name='Accuracy_Epoch')
        tf.summary.scalar('Accuracy_Epoch', accuracy_epoch, collections=['epoch'])

        # f1score_epoch = tf.placeholder(tf.float64, name='F1_Score_Epoch')
        # tf.summary.scalar('F1_Score_Epoch', f1score_epoch, collections=['epoch'])

        # files_avg_epoch = tf.placeholder(tf.float64, name='Files_Avg_Epoch')
        # tf.summary.scalar('Files_Avg_Epoch', files_avg_epoch, collections=['epoch'])

        transformations_avg_epoch = tf.placeholder(tf.float64, name='Transformations_Avg_Epoch')
        tf.summary.scalar('Transformations_Avg_Epoch', transformations_avg_epoch, collections=['epoch'])

        gender_avg_epoch = tf.placeholder(tf.float64, name='Gender_Avg_Epoch')
        tf.summary.scalar('Gender_Avg_Epoch', gender_avg_epoch, collections=['epoch'])

    print('\t\tSimilars')
    with tf.name_scope('BatchSimilars'), tf.device('/gpu:' + str(TOWERS_LIST[0])):
        similars_batch = tf.add_n(similars)
        tf.summary.scalar('BatchSimilars_Hash', similars_batch)
        similars_entropy_batch = tf.add_n(similars_entropy)
        tf.summary.scalar('BatchSimilars_Entropy', similars_entropy_batch)

    print('\t\tLoss')
    with tf.name_scope('BatchLoss'), tf.device('/gpu:' + str(TOWERS_LIST[0])):
        loss_batch = tf.divide(tf.add_n(losses), tf.constant(len(TOWERS_LIST), dtype=tf.float64))
        tf.summary.scalar('Total', loss_batch)
        loss_batch_mse = tf.divide(tf.add_n(losses_mse), tf.constant(len(TOWERS_LIST), dtype=tf.float64))
        tf.summary.scalar('MSE', loss_batch_mse)
        if SHOW_HASH_LOSS:
            loss_batch_hash = tf.divide(tf.add_n(losses_hash), tf.constant(len(TOWERS_LIST), dtype=tf.float64))
            tf.summary.scalar('Hash', loss_batch_hash)
        else:
            loss_batch_hash = tf.no_op()
        if SHOW_ENTROPY_LOSS:
            loss_batch_entropy = tf.divide(tf.add_n(losses_entropy), tf.constant(len(TOWERS_LIST), dtype=tf.float64))
            tf.summary.scalar('Entropy', loss_batch_entropy)
        else:
            loss_batch_entropy = tf.no_op()

    with tf.name_scope('EpochLoss'):
        loss_total = tf.placeholder(tf.float64, name='Total_Loss')
        tf.summary.scalar('Total_Loss', loss_total, collections=['epoch'])
        mse_loss_epoch = tf.placeholder(tf.float64, name='MSE_Loss_Epoch')
        tf.summary.scalar('MSE_Loss_Epoch', mse_loss_epoch, collections=['epoch'])
        hash_loss_epoch = tf.placeholder(tf.float64, name='Hash_Loss_Epoch')
        tf.summary.scalar('Hash_Loss_Epoch', hash_loss_epoch, collections=['epoch'])
        entropy_loss_epoch = tf.placeholder(tf.float64, name='Entropy_Loss_Epoch')
        tf.summary.scalar('Entropy_Loss_Epoch', entropy_loss_epoch, collections=['epoch'])

    with tf.name_scope('EpochSimilars'):
        simcnt_hash_epoch = tf.placeholder(tf.float64, name='Similars_Count_Hash_Epoch')
        tf.summary.scalar('Similars_Count_Hash_Epoch', simcnt_hash_epoch, collections=['epoch'])
        simcnt_entropy_epoch = tf.placeholder(tf.float64, name='Similars_Count_Entropy_Epoch')
        tf.summary.scalar('Similars_Count_Entropy_Epoch', simcnt_entropy_epoch, collections=['epoch'])

    print('\t\tPredictions')
    with tf.name_scope('Outputs'), tf.device('/gpu:' + str(TOWERS_LIST[0])):
        combined_encoder_outputs = tf.stack(gstor_enc_out)
        tf.summary.histogram('EncoderOutputs', combined_encoder_outputs)

        combined_decoder_outputs = tf.stack(gstor_dec_out)
        tf.summary.histogram('DecoderOutputs', combined_decoder_outputs)

        combined_decoder_predictions = tf.stack(gstor_dec_pre)
        tf.summary.histogram('DecoderPredictions', combined_decoder_predictions)

        combined_encoder_last_state = tf.stack(enc_last_hidden_state)
        tf.summary.histogram('EncoderLastHiddenState', combined_encoder_last_state)

    # print('\tVisualization~VariableDefinition')
    # with tf.name_scope('HashesVis'), tf.device('/gpu:' + str(TOWERS_LIST[0])):
    #     hash_features_vis = tf.get_variable('hash_features', [trn_bpe * BATCH_SIZE, STATE_SIZE], dtype=tf.float64)
    #     hashes_vis = tf.get_variable('hashes', [trn_bpe * BATCH_SIZE, STATE_SIZE], dtype=tf.int64)
    #     hashes_vis_input = tf.placeholder(tf.float64, [trn_bpe * BATCH_SIZE, STATE_SIZE], name='HashesVisInput')
    #     hash_features_vis_op = hash_features_vis.assign(hashes_vis_input)  # ADDED!!! A.BAEZ
    #     hashes_vis_op = tf.to_int64(tf.greater_equal(hashes_vis_input, tf.constant(0, dtype=tf.float64)))
    #     hashes_vis_op2 = hashes_vis.assign(tf.where(tf.equal(hashes_vis_op, tf.constant(0, dtype=tf.int64)),
    #                                                 tf.multiply(tf.ones_like(hashes_vis_op), -1), hashes_vis_op))

    #####

    print('\tSummary~Merge')
    summary = tf.summary.merge_all()
    summary_epoch = tf.summary.merge_all('epoch')

    print('\tSaver')

    saved_variables = variables._all_saveable_objects()
    # print("Saved Variables BEFORE: ", saved_variables)
    # for elem in saved_variables[:]:
    #     if elem == hashes_vis or elem == hash_features_vis:
    #         saved_variables.remove(elem)

    # print("Saved Variables AFTER: ", saved_variables)
    saver = tf.train.Saver(saved_variables, max_to_keep=MAX_TO_KEEP)

    print('\tConfig')
    config = tf.ConfigProto(log_device_placement=LOG_DEVICE_PLACEMENT, allow_soft_placement=True)
    config.gpu_options.allow_growth = True

    print('\tSession')
    if PRETRAINED:
        init_op = tf.group(tf.local_variables_initializer(), tf.global_variables_initializer())
        sess = tf.Session(config=config)
        sess.run(init_op)

        print('\t\tLoading Model')
        LOAD_MODEL = LOG_DIR_MODELS + '/model/model-' + MODEL + '.ckpt'
        saver.restore(sess, LOAD_MODEL)
    else:
        sess = tf.Session(config=config)
        sess.run(tf.global_variables_initializer())

    print('\tSummaryWriter')
    with tf.variable_scope('Summaries'):
        summary_writer_trn = tf.summary.FileWriter(LOG_DIR + '/trn', sess.graph)
        summary_writer_val = tf.summary.FileWriter(LOG_DIR + '/val', sess.graph)

    # print('\tVisualization')
    # if not os.path.exists(LOG_DIR + '/vis'):
    #     os.makedirs(LOG_DIR + '/vis')
    # if not os.path.exists(LOG_DIR + '/model'):
    #     os.makedirs(LOG_DIR + '/model')
    # saver_vis = tf.train.Saver([hashes_vis, hash_features_vis], max_to_keep=NUM_EPOCHS)

    all_total_losses_trn = []
    all_total_losses_val = []
    all_accuracies_trn = []
    all_accuracies_val = []
    highest_accuracy_trn = 0.0
    highest_accuracy_val = 0.0
    highest_acc_trans_trn = 0.0
    highest_acc_trans_val = 0.0
    l_trn = float('inf')
    l_val = float('inf')
    last_saved_vis = int(time.time())
    last_saved_model = int(time.time())
    ## Dataset and Iterator definition

    SAMPLES_TRN_SHUFFLED = SAMPLES_TRN[:]
    random.shuffle(SAMPLES_TRN_SHUFFLED)
    dataset_trn = Data.generate_dataset(SAMPLES_TRN_SHUFFLED)
    iterator_trn = dataset_trn.make_initializable_iterator()

    def _do_list_modulo(segs, bs):
        # segs should be a copy; split off the largest multiple of bs
        # and return it together with the remainder.
        len_segs = len(segs)
        k = int(len_segs / bs) * bs
        tfds = segs[:k]
        rem = segs[k:]
        return tfds, rem

    segments_db_a, segments_db_b = _do_list_modulo(SAMPLES_TRN_ORIG_sub[:], BATCH_SIZE)
    if len(segments_db_b) > 0:
        db_last_batch = [segments_db_b[z % len(segments_db_b)] for z in range(BATCH_SIZE)]
        extended_segments_db = segments_db_a + db_last_batch
    else:
        extended_segments_db = segments_db_a
    dataset_trn_db = Data.generate_dataset(extended_segments_db)
    iterator_trn_db = dataset_trn_db.make_initializable_iterator()

    segments_qr_a, segments_qr_b = _do_list_modulo(SAMPLES_TRN_TRANS_sub[:], BATCH_SIZE)
    if len(segments_qr_b) > 0:
        qr_last_batch = [segments_qr_b[z % len(segments_qr_b)] for z in range(BATCH_SIZE)]
        extended_segments_qr = segments_qr_a + qr_last_batch
    else:
        extended_segments_qr = segments_qr_a
    dataset_trn_query = Data.generate_dataset(extended_segments_qr)
    iterator_trn_qr = dataset_trn_query.make_initializable_iterator()

    stat_num_batches_db = int(len(SAMPLES_TRN_ORIG_sub) / BATCH_SIZE)
    stat_num_batches_query = int(len(SAMPLES_TRN_TRANS_sub) / BATCH_SIZE)
    stat_num_missing_db = len(SAMPLES_TRN_ORIG_sub) % BATCH_SIZE
    stat_num_missing_query = len(SAMPLES_TRN_TRANS_sub) % BATCH_SIZE
    acc_data_stats = (stat_num_batches_db, stat_num_batches_query, stat_num_missing_db, stat_num_missing_query)

    comb_val = (SAMPLES_VAL_ORIG_sub[:] + SAMPLES_VAL_TRANS_sub[:]).copy()
    segments_val_a, segments_val_b = _do_list_modulo(comb_val, BATCH_SIZE)
    if len(segments_val_b) > 0:
        val_last_batch = [segments_val_b[z % len(segments_val_b)] for z in range(BATCH_SIZE)]
        extended_segments_val = segments_val_a + val_last_batch
    else:
        extended_segments_val = segments_val_a
    dataset_val_query = Data.generate_dataset(extended_segments_val)
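The `_do_list_modulo` plus wrap-around pattern above is applied three times (database, query, validation): the list is split at the largest multiple of the batch size, and any remainder is repeated cyclically until it fills one extra batch. The combined behavior can be sketched as a single function (`pad_to_full_batches` is a hypothetical name, not in the script):

```python
def pad_to_full_batches(segs, bs):
    # Sketch of _do_list_modulo plus the wrap-around padding used above:
    # keep the largest multiple of bs, then, if a remainder exists,
    # cycle its entries to build one additional full batch.
    k = (len(segs) // bs) * bs
    full, rem = segs[:k], segs[k:]
    if rem:
        full = full + [rem[z % len(rem)] for z in range(bs)]
    return full
```

For example, 10 items with a batch size of 4 yield 12 items: two full batches plus the remainder `[8, 9]` cycled into a third batch. The duplicated tail items are therefore scored more than once.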
iterator_val = dataset_val_query.make_initializable_iterator()

## Run Epochs

for epoch in range(0 if DO_EPOCH_ZERO else 1, NUM_EPOCHS + 1):

    # Creating Database
    db_name = '{}_{}'.format(DATABASE_PREFIX, RUN_ID_MODEL.replace('+', '_'))
    db_fingerprint = DB_Fingerprinting()
    if DO_CREATE_DB:
        if db_fingerprint.db_exists(db_name):
            db_fingerprint.drop_DB(db_name)
            db_fingerprint.create_DB(db_name)
        else:
            db_fingerprint.create_DB(db_name)

    print('')
    print('EPOCH', epoch)

    print(' Training...')

    # Train

    if DO_EVALUATION:
        if epoch > 0:
            TRN_train_timer = time.time()
            epoch_loss_total, epoch_loss_mse, epoch_loss_hash, epoch_loss_entropy, \
            hash_simcnt, entropy_simcnt, duration_trn = \
                Evaluate.train(SAMPLES_TRN[:], sess, epoch, summary, iterator_trn)
            print('')
            print('TRN_train_timer: %f [secs]' % (time.time() - TRN_train_timer))
            print('')
        else:
            TRN_eval_timer = time.time()
            epoch_loss_total, epoch_loss_mse, epoch_loss_hash, epoch_loss_entropy, \
            hash_simcnt, entropy_simcnt, duration_trn = \
                Evaluate.evaluate_train(SAMPLES_TRN[:], sess, epoch, summary, iterator_trn)
            print('')
            print('TRN_eval_timer: %f [secs]' % (time.time() - TRN_eval_timer))
            print('')
    else:
        epoch_loss_total = epoch_loss_mse = epoch_loss_hash = epoch_loss_entropy = -1
        hash_simcnt = entropy_simcnt = duration_trn = -1

    with open(LOG_DIR + '/training.txt', 'a') as file:
        file.write('%04i\t%f\t%f\t%f\t%f\t%f\t%f\t%f\n' % (epoch,
                                                           epoch_loss_total,
                                                           epoch_loss_mse,
                                                           epoch_loss_hash,
                                                           epoch_loss_entropy,
                                                           hash_simcnt,
                                                           entropy_simcnt,
                                                           duration_trn))
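The `training.txt` log written above is a plain tab-separated file, one row per epoch. A small illustrative reader (the column names below are assumptions inferred from the `'%04i\t%f...'` format string; only the field order comes from the listing) might be:

```python
def read_training_log(path):
    """Parse the tab-separated per-epoch training log into a list of dicts."""
    fields = ['epoch', 'loss_total', 'loss_mse', 'loss_hash',
              'loss_entropy', 'hash_simcnt', 'entropy_simcnt', 'duration']
    rows = []
    with open(path) as log_file:
        for line in log_file:
            values = line.rstrip('\n').split('\t')
            row = dict(zip(fields, values))
            row['epoch'] = int(row['epoch'])       # written as %04i
            for key in fields[1:]:                 # remaining columns are %f floats
                row[key] = float(row[key])
            rows.append(row)
    return rows
```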
    # Training Accuracy
    if DO_ACCURACY:

        # Creating DATABASE.TABLE for AUDIO_LENGTH_CLASSIFICATION
        table_name_trn = 'TRN_fingerprints_{}'.format(int(1))
        if DO_CREATE_TABLE:
            db_fingerprint.create_table(db_name, table_name_trn)

        TRN_acc_timer = time.time()

        spk_acc_trn, vdo_acc_trn, acc_trn, trans_acc_trn, gender_acc_trn, \
        numHashes_acc_trn, uniqueHashes_acc_trn, numCollisions_acc_trn, \
        fraqUniqueHashes_acc_trn, fraqCollisions_acc_trn, duration_acc_trn = \
            Evaluate.accuracy(LABELS_TRN_sub[:], SAMPLES_TRN_ORIG_sub[:], SAMPLES_TRN_TRANS_sub[:],
                              db_fingerprint, db_name, table_name_trn, sess, epoch, 1,
                              acc_data_stats, iterator_trn_db, iterator_trn_qr)
        print('')
        print('TRN_acc_timer: %f [secs]' % (time.time() - TRN_acc_timer))
        print('')

        with open(LOG_DIR + '/training_acc.txt', 'a') as file:
            file.write('%04i\t%05i\t%05i\t%f\t%f\t%f\t%f\t%f\t%f\t%f\n' % (epoch,
                                                                           numHashes_acc_trn,
                                                                           uniqueHashes_acc_trn,
                                                                           fraqCollisions_acc_trn,
                                                                           spk_acc_trn,
                                                                           vdo_acc_trn,
                                                                           acc_trn,
                                                                           # f1_trn,
                                                                           trans_acc_trn,
                                                                           gender_acc_trn,
                                                                           duration_acc_trn))
    else:
        spk_acc_trn = vdo_acc_trn = acc_trn = trans_acc_trn = gender_acc_trn = duration_acc_trn = -1
        uniqueHashes_acc_trn = fraqCollisions_acc_trn = -1

    # Training Summary
    sum_str1 = sess.run([summary_epoch], feed_dict={loss_total: epoch_loss_total,
                                                    mse_loss_epoch: epoch_loss_mse,
                                                    hash_loss_epoch: epoch_loss_hash,
                                                    entropy_loss_epoch: epoch_loss_entropy,
                                                    simcnt_hash_epoch: hash_simcnt,
                                                    simcnt_entropy_epoch: entropy_simcnt,
                                                    num_unique_hashes_epoch: uniqueHashes_acc_trn,
                                                    collisions_epoch: fraqCollisions_acc_trn,
                                                    speaker_epoch: spk_acc_trn,
                                                    video_epoch: vdo_acc_trn,
                                                    accuracy_epoch: acc_trn,
                                                    # f1score_epoch: f1_trn,
                                                    # files_avg_epoch: files_acc_trn,
                                                    transformations_avg_epoch: trans_acc_trn,
                                                    gender_avg_epoch: gender_acc_trn})
    summary_writer_trn.add_summary(sum_str1[0], epoch)
    summary_writer_trn.flush()

    print(' Summary: %04i\t%f\t%f\t%f\t%f\t%f\t%f\t%f\t%f\t%f' % (epoch,
                                                                  epoch_loss_total,
                                                                  epoch_loss_mse,
                                                                  epoch_loss_hash,
                                                                  epoch_loss_entropy,
                                                                  hash_simcnt,
                                                                  entropy_simcnt,
                                                                  acc_trn,
                                                                  duration_trn,
                                                                  duration_acc_trn))
    all_total_losses_trn.append(epoch_loss_total)
    all_accuracies_trn.append(acc_trn)

    if ((epoch + 1) % NUM_EPOCHS_VAL == 0 or epoch == 0):
        print('')
        print(' > Validation...')

        epoch_loss_total, epoch_loss_mse, epoch_loss_hash, epoch_loss_entropy, \
        hash_simcnt, entropy_simcnt, duration_val, \
        spk_acc_val, vdo_acc_val, acc_val, trans_acc_val, gender_acc_val, \
        numHashes_acc_val, uniqueHashes_acc_val, numCollisions_acc_val, \
        fraqUniqueHashes_acc_val, fraqCollisions_acc_val, duration_acc_val = Evaluate.evaluate_ValAndAcc(
            LABELS_VAL_sub[:], SAMPLES_VAL_ORIG_sub[:], SAMPLES_VAL_TRANS_sub[:],
            db_fingerprint, db_name, sess, epoch, iterator_val)

        with open(LOG_DIR + '/validation.txt', 'a') as file:
            file.write('%04i\t%f\t%f\t%f\t%f\t%f\t%f\t%f\n' % (epoch,
                                                               epoch_loss_total,
                                                               epoch_loss_mse,
                                                               epoch_loss_hash,
                                                               epoch_loss_entropy,
                                                               hash_simcnt,
                                                               entropy_simcnt,
                                                               duration_val))

        with open(LOG_DIR + '/validation_acc.txt', 'a') as file:
            file.write('%04i\t%05i\t%05i\t%f\t%f\t%f\t%f\t%f\t%f\t%f\n' % (epoch,
                                                                           numHashes_acc_val,
                                                                           uniqueHashes_acc_val,
                                                                           fraqCollisions_acc_val,
                                                                           spk_acc_val,
                                                                           vdo_acc_val,
                                                                           acc_val,
                                                                           trans_acc_val,
                                                                           gender_acc_val,
                                                                           duration_acc_val))
        # Validation Summary
        sum_str2 = sess.run([summary_epoch], feed_dict={loss_total: epoch_loss_total,
                                                        mse_loss_epoch: epoch_loss_mse,
                                                        hash_loss_epoch: epoch_loss_hash,
                                                        entropy_loss_epoch: epoch_loss_entropy,
                                                        simcnt_hash_epoch: hash_simcnt,
                                                        simcnt_entropy_epoch: entropy_simcnt,
                                                        num_unique_hashes_epoch: uniqueHashes_acc_val,
                                                        collisions_epoch: fraqCollisions_acc_val,
                                                        speaker_epoch: spk_acc_val,
                                                        video_epoch: vdo_acc_val,
                                                        accuracy_epoch: acc_val,
                                                        transformations_avg_epoch: trans_acc_val,
                                                        gender_avg_epoch: gender_acc_val})

        summary_writer_val.add_summary(sum_str2[0], epoch)
        summary_writer_val.flush()

        print(' Summary: %04i\t%f\t%f\t%f\t%f\t%f\t%f\t%f\t%f\t%f' % (epoch,
                                                                      epoch_loss_total,
                                                                      epoch_loss_mse,
                                                                      epoch_loss_hash,
                                                                      epoch_loss_entropy,
                                                                      hash_simcnt,
                                                                      entropy_simcnt,
                                                                      acc_val,
                                                                      duration_val,
                                                                      duration_acc_val))
        all_total_losses_val.append(epoch_loss_total)
        all_accuracies_val.append(acc_val)

    ### SAVING: Metrics are saved starting from Epoch 1
    SHOULD_SAVE_MODEL = False
    if epoch > 0:
        # USING LOSS:     l_trn >= all_total_losses_trn[-1] and l_val >= all_total_losses_val[-1]
        # USING ACCURACY: acc_trn >= highest_accuracy_trn and acc_val >= highest_accuracy_val
        if SAVE_WITH_ACCURACY:
            SHOULD_SAVE_MODEL_ACC = acc_trn >= highest_accuracy_trn and acc_val >= highest_accuracy_val
        else:
            SHOULD_SAVE_MODEL_ACC = False
        if SAVE_WITH_LOSS:
            SHOULD_SAVE_MODEL_LOSS = l_trn >= all_total_losses_trn[-1] and l_val >= all_total_losses_val[-1]
        else:
            SHOULD_SAVE_MODEL_LOSS = False
        if SAVE_WITH_TRANSFORMATION:
            SHOULD_SAVE_MODEL_ACC_TRANS = trans_acc_trn >= highest_acc_trans_trn and trans_acc_val >= highest_acc_trans_val
        else:
            SHOULD_SAVE_MODEL_ACC_TRANS = False
        # using another metric? insert it here!

        SHOULD_SAVE_MODEL = SHOULD_SAVE_MODEL_ACC or SHOULD_SAVE_MODEL_LOSS or SHOULD_SAVE_MODEL_ACC_TRANS
        # SHOULD_SAVE_MODEL = SHOULD_SAVE_MODEL_ACC and SHOULD_SAVE_MODEL_LOSS

    if SHOULD_SAVE_MODEL:
        l_trn = all_total_losses_trn[-1]
        l_val = all_total_losses_val[-1]
        highest_accuracy_trn = acc_trn
        highest_accuracy_val = acc_val
        highest_acc_trans_trn = trans_acc_trn
        highest_acc_trans_val = trans_acc_val

        # with open(os.path.join(LOG_DIR, 'models.txt'), 'a') as file:
        #     file.write(str(epoch) + '\t' + str(SHOULD_SAVE_MODEL) + '\t' + str(SHOULD_SAVE_MODEL_LOSS) + '\t' + str(SHOULD_SAVE_MODEL_ACC))
        #     file.write('\t' + str(all_total_losses_trn[-1]) + '\t' + str(all_total_losses_val[-1]) + '\t' + str(acc_trn) + '\t' + str(acc_val))
        #     file.write('\n')

    current_time = int(time.time())

    if (epoch == 0) or SAVE_ALL_MODELS or SHOULD_SAVE_MODEL or (
            SAVE_MODEL_EVERY_SECONDS + last_saved_model <= current_time):
        print(' > Saving Model...')

        CHECKPOINT_PATH = os.path.join(LOG_DIR, 'model', 'model-' + str(epoch) + '.ckpt')
        saver.save(sess, CHECKPOINT_PATH)  # global_step=epoch, do not include this, it is only used for naming
        last_saved_model = current_time

    # if (epoch == 0) or SAVE_ALL_MODELS or SHOULD_SAVE_MODEL or (
    #         SAVE_VIS_EVERY_SECONDS + last_saved_vis <= current_time):
    #     print(' > Saving Visualization...')
    #
    #     VISUALIZATION_PATH = os.path.join(LOG_DIR, 'vis', str(epoch))
    #     if not os.path.exists(VISUALIZATION_PATH):
    #         os.makedirs(VISUALIZATION_PATH)
    #
    #     METADATA_PATH = os.path.join(VISUALIZATION_PATH, 'metadata-hashes.tsv')
    #     with open(METADATA_PATH, 'a') as file:
    #         file.write('Speaker\tSentence\tTransformation\tSegment\n')
    #
    #     Evaluate.evaluate_train_vis(SAMPLES_TRN[:], sess, epoch)
    #
    #     VISUALIZATION_CHECKPOINT_PATH = os.path.join(VISUALIZATION_PATH, 'model.ckpt')
    #     saver_vis.save(sess, VISUALIZATION_CHECKPOINT_PATH)
    #
    #     PROJECTOR_DIR = LOG_DIR + '/vis/' + str(epoch)
    #     if not os.path.exists(PROJECTOR_DIR):
    #         os.makedirs(PROJECTOR_DIR)
    #     with open(PROJECTOR_DIR + '/projector_config.pbtxt', 'a+') as file:
    #         file.write('embeddings {\n')
    #         file.write('  tensor_name: "hashes"\n')
    #         file.write('  metadata_path: "' + METADATA_PATH + '"\n')
    #         file.write('}\n')
    #         file.write('embeddings {\n')
    #         file.write('  tensor_name: "hash_features"\n')
    #         file.write('  metadata_path: "' + METADATA_PATH + '"\n')
    #         file.write('}\n')
    #         file.write('model_checkpoint_path: "' + VISUALIZATION_CHECKPOINT_PATH + '"\n')
    #     last_saved_vis = current_time
    # else:
    #     os.remove(METADATA_PATH)
    #     os.rmdir(VISUALIZATION_PATH)
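The checkpointing policy in the loop above combines up to three save criteria, each guarded by a flag, and requires training and validation to improve together. A condensed, standalone restatement of that decision (the flag and metric names follow the listing, but this function itself is illustrative, not part of the original code) could be:

```python
def should_save_model(epoch, metrics, best, flags):
    """Return True when the current epoch beats the best-so-far on any enabled criterion.

    metrics/best are dicts with keys: acc_trn, acc_val, loss_trn, loss_val,
    trans_trn, trans_val. flags holds the SAVE_WITH_* booleans from the listing.
    """
    if epoch == 0:
        return False  # metrics are tracked starting from epoch 1

    # Accuracy criterion: both splits must match or exceed the best accuracies.
    by_acc = flags.get('SAVE_WITH_ACCURACY', False) and \
        metrics['acc_trn'] >= best['acc_trn'] and metrics['acc_val'] >= best['acc_val']
    # Loss criterion: both splits must match or fall below the best (lowest) losses.
    by_loss = flags.get('SAVE_WITH_LOSS', False) and \
        best['loss_trn'] >= metrics['loss_trn'] and best['loss_val'] >= metrics['loss_val']
    # Transformation-accuracy criterion, analogous to the accuracy one.
    by_trans = flags.get('SAVE_WITH_TRANSFORMATION', False) and \
        metrics['trans_trn'] >= best['trans_trn'] and metrics['trans_val'] >= best['trans_val']

    # The listing ORs the enabled criteria (an AND variant is left commented out).
    return by_acc or by_loss or by_trans
```

When this returns True, the loop updates the best-so-far values and writes a checkpoint; an additional time-based rule (`SAVE_MODEL_EVERY_SECONDS`) forces a periodic save regardless.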
Published papers

A. Báez-Suárez, N. Shah, J. A. Nolazco-Flores, S. S. Huang, O. Gnawali and W. Shi, "SAMAF: Sequence-to-Sequence Autoencoder Model for Audio Fingerprinting," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 2, pp. XXX-XXX, 2020.
Curriculum Vitae

Abraham Báez Suárez was born in Veracruz, Mexico. He earned his Bachelor's degree in Mechatronics Engineering from the Instituto Tecnológico y de Estudios Superiores de Monterrey (ITESM), Monterrey Campus, in May 2009. During his bachelor's studies, Abraham took part in an exchange program with Fachhochschule Esslingen in Germany from September 2007 until July 2008, where he continued his coursework while learning about German culture and language. There he joined a multidisciplinary project in which he was in charge of designing the blades of a scale-model wind turbine, which was tested in a wind tunnel to collect data for the best cost-efficiency curve of the material.

After finishing his bachelor's degree, he was accepted into the graduate program in Intelligent Systems and was awarded the Master's degree in May 2011. His thesis applied Genetic Algorithms to damage localization in composite structures. He then joined Whirlpool, a multinational leader in home appliances. At first his work involved designing mechanical parts for the ice-and-water system of refrigerator units on cost-reduction projects; later he designed the harnesses of refrigerator units within the controls system for cost, quality, new-model-line, and mega projects.

He returned to ITESM in the summer of 2013 as a research assistant and, a few months later, joined the Doctoral program, pursuing the Computer Science degree. During his Doctorate, he had the opportunity to visit the University of Houston as a Research Scholar. Arriving in August 2014, he extended his stay, building research ties in two different labs, focused first on Computer Vision and later on Speech Processing.

Currently, he works at the Mayo Clinic in Rochester, Minnesota, USA, applying Artificial Intelligence algorithms to the fight against cancer and the early detection of diseases. He will graduate from the Doctoral program in June 2020.

This document was typed using Microsoft Word 2016 by Abraham Báez Suárez.