
BOOSTED LOCALITY SENSITIVE HASHING: DISCRIMINATIVE BINARY CODES FOR SOURCE SEPARATION

Sunwoo Kim, Haici Yang, Minje Kim

Indiana University
Department of Intelligent Systems Engineering

Bloomington, IN 47408
[email protected], [email protected], [email protected]

ABSTRACT

Speech enhancement tasks have seen significant improvements with the advance of deep learning technology, but at the cost of increased computational complexity. In this study, we propose an adaptive boosting approach to learning locality sensitive hash codes, which represent audio spectra efficiently. We use the learned hash codes for single-channel speech denoising tasks as an alternative to a complex machine learning model, particularly to address resource-constrained environments. Our adaptive boosting algorithm learns simple logistic regressors as the weak learners. Once trained, their binary classification results transform each spectrum of test noisy speech into a bit string. Simple bitwise operations calculate the Hamming distance to find the K-nearest matching frames in the dictionary of training noisy speech spectra, whose associated ideal binary masks are averaged to estimate the denoising mask for that test mixture. Our proposed learning algorithm differs from AdaBoost in the sense that the projections are trained to minimize the distances between the self-similarity matrix of the hash codes and that of the original spectra, rather than the misclassification rate. We evaluate our discriminative hash codes on the TIMIT corpus with various noise types, and show comparable performance to deep learning methods in terms of denoising performance and complexity.

Index Terms— Speech Enhancement, Locality Sensitive Hashing, AdaBoost

1. INTRODUCTION

Deep learning-based source separation models have improved single-channel speech denoising performance, nearing the ideal ratio masking (IRM) results [1, 2], or sometimes exceeding them [3]. However, as model sizes grow, so do the challenges of deploying such large models on resource-constrained devices, even just for feedforward inference. Solving this issue is critical in real-life scenarios for devices that require speaker separation and noise cancellation for quality speech communication.

There is ongoing research on the tradeoff between complexity and accuracy, which is a priority for mobile and embedded applications. Optimization of convolutional operations has been explored to build small, low-latency models [4, 5]. Pruning less active weights [6] and filters [7] is another popular approach to reducing network complexity. The other dimension of network compression is to reduce the number of bits used to represent the network

This material is based upon work supported by the National Science Foundation under Award Number 1909509.

parameters, sometimes down to just one bit [8, 9]; one such binarized model has shown promising performance in speech denoising [10].

In this paper we take another route to source separation by redefining the problem as a K-nearest neighbor (KNN) search task: for a given test mixture, separation is done by finding the nearest mixture spectra in the training set, and consequently their corresponding ideal binary mask vectors. However, the complexity of the search process increases linearly with the size of the training data. We expedite this tedious process by converting the query and database spectra into hash codes to exploit bitwise matching operations. To this end, we start from locality sensitive hashing (LSH), which constructs hash functions such that similar data points are more likely to collide in the hashed space, or, in other words, to be more similar in terms of Hamming distance [11, 12]. While simple and effective, the random projection-based nature of the LSH process is not trainable, limiting its performance when one uses it for a specific problem.

We propose a learnable, but still projection-based, hash function, Boosted LSH (BLSH), so that separation is done in a binary space learned in a data-driven way [13]. BLSH reduces the redundancy in the randomly generated LSH codes by relaxing the independence assumption among the projection vectors and learning them sequentially in a boosting manner such that they complement one another, an idea shown in search applications [14, 15]. BLSH learns a set of linear classifiers (i.e., perceptrons) whose binary classification results serve as a hash code. To learn the sequence of binary classifiers we employ the adaptive boosting (AdaBoost) technique [16], while we redefine the original classification-based AdaBoost algorithm so that it works on our hashing problem for separation. Since the binary representation is meant to improve the quality of the hash code-based KNN search during separation, the objective of our training algorithm is to maximize the representativeness of the hash codes by minimizing the loss between the two self-similarity matrices (SSM) constructed from the original spectra and from the hash codes.

We evaluate BLSH on the single-channel denoising task and empirically show that, with respect to efficiency, our system compares favorably to deep learning architectures and generalizes well over unseen speakers and noises. Since binary codes can be cheaply stored and the KNN search is expedited with bitwise operations, we believe this to be a good alternative for the speaker enhancement task where efficiency matters.

BLSH can be seen as an embedding technique with a strong constraint that the embedding has to be binary. Finding embeddings that preserve semantic similarity is a popular goal in many disciplines. In natural language processing, Word2Vec [17] or GloVe


Algorithm 1 KNN source separation
1: Input: x, H, Y                ▷ a test mixture vector and the dictionary
2: Output: y                     ▷ a denoising mask vector
3: Initialize an empty set N = ∅ and A_min = 0
4: for t ← 1 to T do
5:     if S_cos(x, H_{t,:}) > A_min then
6:         Replace the farthest neighbor index in N with t
7:         Update A_min ← min_{k∈N} S_cos(x, H_{k,:})
8: return y ← (1/K) Σ_{k∈N} Y_{k,:}

[18] methods use pairwise metric learning to retrieve a distributed contextual representation that retains complex syntactic and semantic relationships within documents. Another model that trains on similarity information is the Siamese network [19, 20], which learns to discriminate a pair of examples. Utilizing similarity information has also been explored in the source separation community by posing denoising as a segmentation problem in the time-frequency plane, with the assumption that the affinities between time-frequency regions could condense complex auditory features together [21]. Inspired by studies of perceptual grouping [22], in [21] local affinity matrices were constructed out of cues specific to speech. Then, spectral clustering segments the weighted combination of similarity matrices to unmix speech mixtures. On the other hand, deep clustering learns a neural network encoder that produces discriminant spectrogram embeddings, whose objective is to approximate the ideal pairwise affinity matrix induced from ideal binary masks (IBM) [23]. ChimeraNet extended this work by utilizing deep clustering as a regularizer for TF-mask approximation [24].

2. THE KNN SEARCH-BASED SOURCE SEPARATION

2.1. Baseline 1: Direct Spectral Matching

Consider a masking-based source separation method that maintains a large dictionary of training examples and searches only for the KNN to infer the mask. We assume that if the mixture frames are similar, so are the sources in the mixture as well as their IBMs.

Let H ∈ R^{T×D} be the normalized feature vectors from T frames of training mixture examples, e.g., noisy speech spectra. T can be a potentially very large number, as it grows exponentially with the number of sources. Out of many potential choices, we are interested in short-time Fourier transform (STFT) and mel-spectra as the feature vectors. For example, if H is from STFT on the training mixture signals, D corresponds to the number of subbands F in each spectrum, while for mel-spectra D < F. H is normalized with the L2 norm. We also prepare the corresponding IBM matrix, Y ∈ {0, 1}^{T×F}, whose dimension F matches that of STFT. For a test mixture spectrum out of STFT, x ∈ C^F, our goal is to estimate a denoising mask, y ∈ R^F, to recover the source by masking, y ⊙ x. While masking is applied to the complex STFT spectrum x, the KNN search can be done in the D-dimensional feature space, x ∈ R^D, e.g., D < F for mel-spectra.

Algorithm 1 describes the KNN source separation procedure. We use the notation S_cos as the affinity function, e.g., the cosine similarity function. For each frame x in the mixture signal, we find the K closest frames in the dictionary (lines 4 to 7), which form the set of KNN indices, N = {τ1, τ2, ..., τK}. Using them, we find the corresponding IBM vectors from Y and take their average (line 8).
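As a concrete illustration, the procedure can be sketched in NumPy. This is a hypothetical, simplified version: it ranks all dictionary frames with `argsort` instead of maintaining the running neighbor set N and threshold A_min from Algorithm 1, and all names are illustrative.

```python
import numpy as np

def knn_separate(x, H, Y, K=10):
    """Average the IBMs of the K dictionary frames most similar to x."""
    x = x / (np.linalg.norm(x) + 1e-12)   # match the L2-normalized dictionary rows
    sims = H @ x                          # cosine similarity, since rows of H are unit norm
    nearest = np.argsort(-sims)[:K]       # indices of the K most similar frames
    return Y[nearest].mean(axis=0)        # averaged IBMs form a soft denoising mask

# toy usage with a random dictionary (T=100 frames, D=6 features)
rng = np.random.default_rng(0)
H = rng.standard_normal((100, 6))
H /= np.linalg.norm(H, axis=1, keepdims=True)
Y = rng.integers(0, 2, size=(100, 6)).astype(float)
mask = knn_separate(rng.standard_normal(6), H, Y)
```

Since each mask entry is a mean of K binary values, the result is a soft mask in [0, 1].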

Complexity: The search procedure requires a linear scan of all real-valued feature vectors in H, giving O(QDT), where Q stands for the floating-point precision (e.g., Q = 64 for double precision). This procedure is restrictive since T needs to be large for quality source separation.

Fig. 1. The KNN-based source separation process using LSH codes.

2.2. Baseline 2: LSH with Random Projections

We can reduce the storage overhead and expedite Algorithm 1 using hashed spectra and the Hamming similarity between them. We define L random projection vectors as P ∈ R^{L×D}, where the l-th bit in the codes takes the form

H̄_{:,l} = sgn(H P_{l,:}^⊤ + b_l)    (1)

with b_l as a bias term. Applying the same P onto H and x, we obtain bipolar binary codes H̄ ∈ {−1,+1}^{T×L} and x̄ ∈ {−1,+1}^L, respectively. Now we tweak Algorithm 1 by having it take the binary feature vectors. Moreover, the Hamming similarity replaces the similarity function; it counts the number of matching bits in a pair of binary feature vectors: S_Ham(a, b) = Σ_l I(a_l, b_l)/L, where I(x, y) = 1 iff x = y. Otherwise, the algorithm is the same. Fig. 1 overviews the separation process of baseline 2.
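A minimal sketch of this baseline, under the assumption of Gaussian random projections (all names are illustrative, not the paper's implementation):

```python
import numpy as np

def lsh_hash(X, P, b):
    """Bipolar hash codes sgn(X P^T + b), as in Eq. (1)."""
    codes = np.sign(X @ P.T + b)
    codes[codes == 0] = 1.0               # break the rare sign(0) ties to +1
    return codes

rng = np.random.default_rng(0)
T, D, L = 100, 6, 16
P = rng.standard_normal((L, D))           # L random projection vectors
b = rng.standard_normal(L)                # per-bit bias terms
H = rng.standard_normal((T, D))           # stand-in for normalized mixture spectra
H_codes = lsh_hash(H, P, b)               # dictionary codes in {-1,+1}^{T x L}
x_code = lsh_hash(rng.standard_normal((1, D)), P, b)[0]
ham_sims = (H_codes == x_code).mean(axis=1)   # Hamming similarity to every dictionary frame
```

The `ham_sims` vector can then replace the cosine similarities in Algorithm 1.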

Randomness in the projections, however, dampens the quality of the hash bits, necessitating longer codes. This is detrimental in terms of query time, the computational cost of projecting the query into hash codes, and the storage overhead from the large number of projections. The lackluster quality of the codes originates from the data-blind nature of random projections [25, 26]. BLSH addresses this issue by learning each projection as a weak binary classifier.

Complexity: Since the same Algorithm 1 is used, the dependency on T remains the same, while we can reduce the complexity from O(QDT) to O(LT) if L < QD. The procedure can be significantly accelerated with supporting hardware, as the Hamming similarity calculation is done through bitwise AND and pop counting.
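The bitwise trick can be sketched as follows: pack the bipolar bits into bytes so that a Hamming distance becomes an XOR plus a population count (the AND-based similarity mentioned above is the complementary view; this is an illustrative sketch, not the paper's implementation):

```python
import numpy as np

# byte -> number of set bits, precomputed once
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def pack(code):
    """Pack a bipolar code in {-1,+1}^L into bytes (8 bits per byte)."""
    return np.packbits(np.asarray(code) > 0)

def hamming_distance(pa, pb):
    """Count differing bits between two packed codes via XOR + popcount."""
    return int(POPCOUNT[np.bitwise_xor(pa, pb)].sum())

a = pack([ 1, -1,  1,  1, -1,  1, -1, -1])
b = pack([ 1,  1,  1, -1, -1,  1, -1,  1])
print(hamming_distance(a, b))   # the codes differ in 3 bit positions, prints 3
```

Each comparison now touches L/8 bytes instead of L floating-point values, which is where the hardware speedup comes from.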

3. THE PROPOSED BOOSTED LSH TRAINING ALGORITHM FOR SOURCE SEPARATION

We set up our optimization problem with an objective to enrich the quality of the hash codes. Since the code vector is a collection of binary classification results, during training the projection vectors are directed to minimize the discrepancies between the pairwise


Algorithm 2 BLSH training
1: Input: H                              ▷ dictionary of training examples
2: Output: P                             ▷ set of projections
3: W ← uniform vector of 1/(T×T)         ▷ W ∈ R^{L×T×T}
4: P ← random initialization             ▷ P ∈ R^{L×D}
5: β ← vector of zeros                   ▷ β ∈ R^L
6: for l ← 1 to L do
7:     P_l ← argmin_{P_l} Σ_{t1,t2} D([H̄_{:,l} H̄_{:,l}^⊤]_{t1,t2}, [HH^⊤]_{t1,t2}) W_{l,t1,t2}
8:     ε_l = Σ_{t1,t2} D([H̄_{:,l} H̄_{:,l}^⊤]_{t1,t2}, [HH^⊤]_{t1,t2}) W_{l,t1,t2}
9:     β_l ← (1/2) ln((1 − ε_l)/ε_l)
10:    W_{l,:,:} = W_{l−1,:,:} ⊙ exp(β_l D(H̄_{:,l−1} H̄_{:,l−1}^⊤, HH^⊤))
11: return P

affinity relationships among the original spectra in H and those we construct from their corresponding hash codes H̄.

Hence, given the hash string H̄_{:,l} from the l-th projection, we express the loss function in terms of the dissimilarity between the self-similarity matrices (SSM) as follows:

Σ_{t1,t2} D([H̄_{:,l} H̄_{:,l}^⊤]_{t1,t2}, [HH^⊤]_{t1,t2})    (2)

for a given distance metric D, such as element-wise cross-entropy. We scale the bipolar binary SSM to the range of the ground truth's, such that H̄_{:,l} H̄_{:,l}^⊤ ∈ {0, 1}^{T×T}. With this objective the learned binary hash codes can be more compact and representative than the ones from a random projection. There can be many different solutions to this optimization problem, such as solving it directly for the set of projection vectors P, or spectral hashing, which learns the hash codes directly with no assumed projection process [26]. The proposed BLSH algorithm employs an adaptive boosting mechanism to learn the projection vectors one by one.
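A sketch of the loss (2) for one projection, assuming element-wise cross-entropy as D, with the bipolar code SSM rescaled into {0, 1} and the spectral SSM clipped into [0, 1] (the clipping and all names are our assumptions):

```python
import numpy as np

def ssm_loss(h, H, W=None, eps=1e-7):
    """Weighted dissimilarity (2) between the code SSM and the spectral SSM.
    h: bipolar codes in {-1,+1}^T from one projection; H: L2-normalized spectra (T x D)."""
    S_code = (np.outer(h, h) + 1.0) / 2.0        # rescale the bipolar SSM into {0, 1}
    S_true = np.clip(H @ H.T, 0.0, 1.0)          # ground-truth self-similarity in [0, 1]
    D = -(S_true * np.log(S_code + eps)          # element-wise cross-entropy
          + (1.0 - S_true) * np.log(1.0 - S_code + eps))
    if W is None:                                # uniform pairwise weights by default
        W = np.full_like(D, 1.0 / D.size)
    return float((D * W).sum())

rng = np.random.default_rng(0)
H = rng.standard_normal((30, 6))
H /= np.linalg.norm(H, axis=1, keepdims=True)
h = np.sign(H @ rng.standard_normal(6))
loss = ssm_loss(h, H)
```

Passing a non-uniform W is what lets the boosting rounds described next reweight individual (t1, t2) pairs.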

We reformulate AdaBoost [16], an adaptive boosting strategy for classification, which learns efficient weak learners in the form of linear classifiers that complement those learned in previous iterations. It is an adaptive basis function model whose weak learners form a weighted sum to achieve the final prediction, in our case an approximation of the original self-similarity that minimizes the total error as follows:

Σ_{t1,t2} D([Σ_{l=1}^{L} β_l H̄_{:,l} H̄_{:,l}^⊤]_{t1,t2}, [HH^⊤]_{t1,t2}),    (3)

where β_l is the weight for the l-th weak learner.

In AdaBoost, the l-th projection is trained to focus more on the previously misclassified examples by assigning an exponentially larger weight to them. We redesign this part by making the sample weights exponentially large for the too-different pairs of SSM elements as in (2). The SSM weights W ∈ R^{L×T×T} are initialized with uniform values over all (t1, t2) pairs, and then updated after adding every projection, such that each pairwise location in the SSM is weighted element-wise based on the exponentiated discrepancy made by the current weak learner as follows:

W_{l,:,:} = W_{l−1,:,:} ⊙ exp(β_l D(H̄_{:,l−1} H̄_{:,l−1}^⊤, HH^⊤))    (4)

The effect of this update rule is to tune the weights with respect to the distances between the bitwise and original SSMs on a given metric.

Fig. 2. Self-affinity matrices of varying L hash codes and the original time-frequency bins: (a) L = 5, (b) L = 50, (c) L = 200, (d) ground truth SSM.

Weights of the elements with the largest distances computed from the (l−1)-th projection are amplified, and vice versa for close pairs. The l-th weak learner can then use these weights to concentrate on harder examples. This approach pursues a complementary l-th projection, thus rapidly increasing the approximation quality in (3) with relatively few projections L. The final boosted objective for the l-th projection is formulated as

ε_l = Σ_{t1,t2} D([H̄_{:,l} H̄_{:,l}^⊤]_{t1,t2}, [HH^⊤]_{t1,t2}) W_{l,t1,t2}    (5)

Under this objective, the first few projections index the majority of the original features, thereby dramatically reducing the overall storage of projections, the computation from projecting elements, and the length of the hashed bit strings. Given the learned l-th projection and sample weights, we obtain the weights over the projections as

β_l = (1/2) ln((1 − ε_l)/ε_l)    (6)
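As a quick sanity check of (6): a projection with weighted error ε_l = 0.1 receives weight β_l = ½ ln(0.9/0.1) ≈ 1.099, while a chance-level ε_l = 0.5 receives β_l = 0, exactly as in AdaBoost:

```python
import math

def beta(eps):
    # learner weight from Eq. (6); assumes 0 < eps < 1
    return 0.5 * math.log((1.0 - eps) / eps)

print(round(beta(0.1), 3))   # 1.099
print(beta(0.5))             # 0.0
```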

Algorithm 2 summarizes the procedure. Fig. 2 shows the complementary nature of the projections and their convergence behavior. After learning the projection matrix P, the rest of the test-time source separation process is the same as in Algorithm 1.
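The full boosting loop might be sketched as below. This toy version swaps in a squared-error D with a tanh relaxation for the per-projection optimization (line 7 of Algorithm 2), and renormalizes W after each update, an AdaBoost convention the paper does not spell out; every detail beyond Eqs. (4)-(6) is our assumption.

```python
import numpy as np

def blsh_train(H, L, steps=30, lr=0.05, seed=0):
    """Toy BLSH: learn L projections sequentially, boosting hard SSM pairs."""
    rng = np.random.default_rng(seed)
    T, Dim = H.shape
    S_true = np.clip(H @ H.T, 0.0, 1.0)          # target SSM in [0, 1]
    W = np.full((T, T), 1.0 / (T * T))           # uniform pairwise weights
    P, betas = [], []
    for _ in range(L):
        p = rng.standard_normal(Dim)
        for _ in range(steps):                   # crude gradient descent on a tanh relaxation
            h = np.tanh(H @ p)
            S = (np.outer(h, h) + 1.0) / 2.0     # relaxed code SSM in (0, 1)
            G = W * (S - S_true)                 # gradient of weighted squared error w.r.t. S
            dh = (G + G.T) @ h / 2.0             # chain rule through S = (h h^T + 1)/2
            p -= lr * ((dh * (1.0 - h ** 2))[:, None] * H).sum(axis=0)
        h = np.sign(H @ p)                       # harden the learned projection's bits
        S_bin = (np.outer(h, h) + 1.0) / 2.0     # code SSM in {0, 1}
        D = (S_bin - S_true) ** 2                # per-pair discrepancy
        eps_l = float((D * W).sum())             # weighted error, Eq. (5)
        beta = 0.5 * np.log((1.0 - eps_l + 1e-7) / (eps_l + 1e-7))  # Eq. (6)
        W = W * np.exp(beta * D)                 # boost hard pairs, Eq. (4)
        W /= W.sum()                             # renormalize (AdaBoost convention)
        P.append(p)
        betas.append(beta)
    return np.array(P), np.array(betas)

rng = np.random.default_rng(1)
H = rng.standard_normal((20, 5))
H /= np.linalg.norm(H, axis=1, keepdims=True)
P, betas = blsh_train(H, L=4)
```

At test time, the returned P would hash both dictionary and query exactly as in baseline 2.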

Complexity: BLSH does not change the run-time complexity O(LT) of baseline 2. However, we expect that BLSH can outperform LSH with smaller L thanks to the boosting mechanism.

4. EXPERIMENTS

4.1. Experimental Setup

Fig. 3. SDR over number of projections on the open set.

To investigate the effectiveness of BLSH on the speech enhancement task, we evaluate our model on the TIMIT dataset. A training set consisting of 10 hours of noisy utterances was generated by randomly selecting 160 speakers from the train set and mixing each utterance with different non-stationary noise signals at 0 dB signal-to-noise ratio (SNR), namely {birds, casino, cicadas, computer keyboard, eating chips, frogs, jungle, machine guns, motorcycles, ocean} [27]. For both subsets, we select half of the speakers as male and the other half as female. There are 10 short utterances per speaker recorded at a 16 kHz sampling rate. The projections are learned on the training set, and the output hash codes are saved as the dictionary. Two hours of cross-validation data were generated similarly from the training set and used to evaluate the source separation performance in the closed speaker and noise experiments (closed mixture set). One hour of evaluation data was mixed from 110 unseen speakers in the TIMIT test set and unseen noise sources from the DEMAND database, which consists of domestic, office, indoor public places, transportation, nature, and street recordings (open mixture set) [28]. The test set mixtures are at 0 dB SNR and gender-balanced. We apply a Short-Time Fourier Transform (STFT) with a Hann window of 1024 samples and a hop size of 256. During projection training, the mixture speech was segmented into minibatches of 1000 frames, such that the self-similarity matrices are of reasonable size. For evaluation, we calculate the SDR, SIR, and SAR with the BSS_Eval toolbox [29].
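The data generation and feature extraction above can be sketched as follows; the mixing helper and signal names are our own placeholders, with only the 0 dB SNR target and STFT parameters (Hann window of 1024, hop 256) taken from the setup:

```python
import numpy as np
from scipy.signal import stft

def mix_at_0db(speech, noise):
    """Scale the noise so the mixture has 0 dB SNR, then add it to the speech."""
    noise = noise[:len(speech)]
    gain = np.sqrt(np.sum(speech**2) / np.sum(noise**2))  # equalize energies
    return speech + gain * noise, gain

sr = 16000                              # 16 kHz sampling rate
rng = np.random.RandomState(0)
speech = rng.randn(sr)                  # placeholder 1-second signals
noise = rng.randn(sr)
mixture, gain = mix_at_0db(speech, noise)

# Hann window of 1024 samples, hop size 256 (noverlap = 1024 - 256)
f, t, X = stft(mixture, fs=sr, window='hann', nperseg=1024, noverlap=1024 - 256)
mag = np.abs(X)                         # magnitude spectra used as frame features
```

Each column of `mag` is one time frame; these are the vectors that get hashed into bit strings and stored in the dictionary.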

4.2. Discussion

Fig. 3 shows the improvement in performance over the number of projections on the open mixture set. All KNN-based approaches were performed with K = 10 to narrow the focus to BLSH behavior with respect to the number of projections. We compare two sub-sampled dictionaries, 10% and 1%, to expedite the KNN search and to reduce the spatial requirement of the hashed dictionary. We repeat the test ten times on different randomly sampled subsets; Fig. 3 shows the averages and the respective standard deviations (shades). First of all, we see that the larger dictionaries with 10% of the data generally outperform the smaller ones. Enlarging the hash code length L also generally improves the performance until it saturates. More importantly, BLSH consistently outperforms the random projection method even with less data utilization (BLSH learned from 1% of the data always outperforms LSH with the 10% subset), showcasing the merit of the boosting mechanism. The random projections also show a noticeable sensitivity to the randomness of the sub-sampling procedure: their standard deviation is significantly larger, especially in the 1% case, whereas that of BLSH remains tighter.

For comparison, Table 1 shows the results at L = 150 in the 10% case.

Table 1. Evaluation metrics for different separation methods.

                  SDR              SAR              SIR
Method       Closed   Open    Closed   Open    Closed   Open
Oracle        15.79  21.92     17.19  23.13     25.22  28.48
BLSH          10.20  10.04     11.57  12.12     17.22  18.20
Random         9.69   8.75     11.14  11.31     16.60  15.02
KNN (STFT)    10.72   9.66     11.76  12.11     18.77  16.81
FC            12.17  12.04     13.39  13.67     19.37  21.22
LSTM          13.37  12.65     14.79  14.31     19.68  21.77

In the table, the proposed BLSH method consistently outperforms the random projection approach, and also KNN on the raw STFT magnitudes in the open mixture case. Although the oracle IRM was higher in the open case, all systems perform worse than in the closed case. This suggests that the unseen DEMAND noise sources are actually easier to separate, yet difficult to generalize to for models trained on mixtures with the ten training noise types. In particular, we found that for unseen noise types, such as public cafeteria sounds, the KNN systems on the raw STFT magnitudes generalize poorly. Regardless, the proposed BLSH generalizes more robustly to the open mixture set than both the random projection method and the direct KNN on the raw STFT, demonstrating that the learned hash codes are more representative than the raw Fourier coefficients or the basic LSH codes.

We also compare the BLSH system with deep neural network models: a 3-layer fully connected network (FC) and a bidirectional long short-term memory (LSTM) network [30] with two layers. Both networks have 1024 hidden units per layer. Unsurprisingly, both networks outperform BLSH. However, considering the significantly faster bitwise inference of BLSH, more than 10 dB of SDR improvement is still a promising performance.

Given these preliminary results, we believe further improvements can come from several directions. For example, by combining the βl values with the Hamming similarity computed from the l-th hash codes, the separation could take into account the relative contributions of the learned projections. In addition, our approach does not employ information along the time axis, since each projection is applied on a frame-by-frame basis; this could be improved by applying projections to multiple frames with windowing. Finally, enriching the dictionary to include more disparate noise types, as well as optimizing training using kernel methods, are potential candidates for future work.
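The first suggested extension, weighting each bit's agreement by its projection confidence βl instead of counting all bits equally, could be sketched as below; this is a hypothetical variant, not something evaluated in the paper, and all names are ours:

```python
import numpy as np

def weighted_hamming_similarity(query_bits, dict_bits, betas):
    """Beta-weighted bit-agreement score per dictionary frame.

    query_bits: (L,) and dict_bits: (N, L), both in {0, 1};
    betas: (L,) confidence weights of the learned projections, Eq. (6).
    A plain Hamming similarity is the special case betas = ones(L).
    """
    agree = (dict_bits == query_bits).astype(float)  # (N, L) bitwise agreements
    return agree @ betas                             # weighted score per frame
```

Frames matching the high-confidence projections would then rank above frames matching only the weak ones, at the cost of replacing pure bitwise popcounts with a small weighted sum.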

5. CONCLUSION

In this paper, we proposed an adaptively boosted hashing algorithm called Boosted LSH for the source separation problem using nearest neighbor search. The model trains linear classifiers sequentially under a boosting paradigm, culminating in a better approximation of the original self-similarity matrix with shorter hash strings. With the learned projections, the K nearest matching frames in the hashed dictionary to the test frame codes are found with efficient bitwise operations, and a denoising mask is estimated as the average of the associated IBMs. We showed through experiments that our proposed framework achieves comparable performance against deep learning models in terms of denoising quality and complexity.1,2

1 The code is open-sourced at https://github.com/sunwookimiub/BLSH.
2 Audio examples can be found at https://saige.sice.indiana.edu/research-projects/BWSS-BLSH.


6. REFERENCES

[1] A. Narayanan and D. L. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 7092–7096.

[2] D. L. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.

[3] K. Zhen, M. S. Lee, and M. Kim, “Efficient context aggregation for end-to-end speech enhancement using a densely connected convolutional and recurrent network,” arXiv preprint arXiv:1908.06468, 2019.

[4] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.

[5] M. Wang, B. Liu, and H. Foroosh, “Factorized convolutional neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 545–553.

[6] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.

[7] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.

[8] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” arXiv preprint arXiv:1603.05279, 2016.

[9] D. Soudry, I. Hubara, and R. Meir, “Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights,” in Advances in Neural Information Processing Systems (NIPS), 2014.

[10] M. Kim and P. Smaragdis, “Bitwise neural networks for efficient single-channel source separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.

[11] P. Indyk and R. Motwani, “Approximate nearest neighbor: Towards removing the curse of dimensionality,” in Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), 1998, pp. 604–613.

[12] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the Twentieth Annual Symposium on Computational Geometry. ACM, 2004, pp. 253–262.

[13] Z. Li, H. Ning, L. Cao, T. Zhang, Y. Gong, and T. S. Huang, “Learning to search efficiently in high dimensions,” in Advances in Neural Information Processing Systems, 2011, pp. 1710–1718.

[14] X. Liu, C. Deng, Y. Mu, and Z. Li, “Boosting complementary hash tables for fast nearest neighbor search,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[15] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu, “Complementary hashing for approximate nearest neighbor search,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 1631–1638.

[16] Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm,” in Proceedings of the International Conference on Machine Learning (ICML), 1996, vol. 96, pp. 148–156.

[17] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.

[18] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

[19] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in ICML Deep Learning Workshop, 2015, vol. 2.

[20] J. Bromley, I. Guyon, Y. LeCun, E. Sackinger, and R. Shah, “Signature verification using a ‘siamese’ time delay neural network,” in Advances in Neural Information Processing Systems (NIPS), 1994, pp. 737–744.

[21] F. R. Bach and M. I. Jordan, “Learning spectral clustering, with application to speech separation,” Journal of Machine Learning Research, vol. 7, pp. 1963–2001, 2006.

[22] M. Cooke and D. P. W. Ellis, “The auditory organization of speech and other sources in listeners and computational models,” Speech Communication, vol. 35, no. 3-4, pp. 141–177, 2001.

[23] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Mar. 2016.

[24] Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux, and N. Mesgarani, “Deep clustering and conventional networks for music separation: Stronger together,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Mar. 2017, pp. 61–65.

[25] R. Salakhutdinov and G. Hinton, “Semantic hashing,” International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.

[26] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in Neural Information Processing Systems, 2009, pp. 1753–1760.

[27] Z. Duan, G. J. Mysore, and P. Smaragdis, “Online PLCA for real-time semi-supervised source separation,” in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2012, pp. 34–41.

[28] J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,” in Proceedings of Meetings on Acoustics ICA2013. ASA, 2013, p. 035081.

[29] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[30] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.