

SMALL-FOOTPRINT KEYWORD SPOTTING WITH GRAPH CONVOLUTIONAL NETWORK

Xi Chen1, Shouyi Yin1, Dandan Song2, Peng Ouyang2, Leibo Liu1, Shaojun Wei1

1Tsinghua University 2TsingMicro Co. Ltd.

ABSTRACT

Despite the recent successes of deep neural networks, it remains challenging to achieve high-precision keyword spotting (KWS) on resource-constrained devices. In this study, we propose a novel context-aware and compact architecture for the keyword spotting task. Based on residual connections and a bottleneck structure, we design a compact and efficient network for KWS. To leverage the long-range dependencies and global context of the convolutional feature maps, a graph convolutional network is introduced to encode the non-local relations. Evaluated on the Google Speech Commands Dataset, the proposed method achieves state-of-the-art performance and outperforms prior works by a large margin at lower computational cost.

Index Terms— keyword spotting, graph convolutional network, small-footprint.

1. INTRODUCTION

With the rapid development of speech recognition in recent years, speech interfaces on smart devices, such as Google Search by Voice [1], intelligent loudspeakers and mobile assistants, are becoming increasingly popular. These smart devices enable users to obtain a fully hands-free experience by continuously listening for specific keywords to initiate voice input. Keyword spotting (KWS), or keyword detection, is a task that aims to detect pre-defined keywords in a stream of audio. A practical on-device KWS module should be highly accurate, low-latency and small-footprint, while operating on low-power, resource-constrained devices.

Conventional studies of the KWS task focus on keyword/filler hidden Markov models (HMMs) [2, 3, 4, 5, 6]. In these approaches, HMMs are trained for keyword and non-keyword speech segments, respectively. At runtime, the Viterbi search algorithm is applied to find the best path in the decoding graph, which can be computationally expensive due to the complexity of the HMM topology. Several KWS systems use a large vocabulary continuous speech recognizer (LVCSR) to generate rich lattices and search for the keyword among all possible paths in the lattices [7, 8, 9, 10].

In recent years, deep neural networks (DNNs) have been shown to yield significant improvements on the KWS task. In [11], Chen et al. propose a small-footprint approach called Deep KWS. Specifically, a DNN-based acoustic model is trained to directly predict the frame-level posteriors of sub-keywords, followed by a posterior handling method that produces a final confidence score. This effective idea outperforms the traditional keyword/filler HMMs and is highly attractive for devices requiring a small footprint and low latency.

After that, more advanced architectures have been successfully applied to the KWS task as alternatives to the DNN-based acoustic model. In [12, 13], convolutional neural networks (CNNs) are utilized under limited memory footprint and computational resource scenarios. In [14], Tang et al. introduce the idea of the deep residual network (ResNet) to achieve a trade-off between model footprint and prediction performance. These methods are computationally efficient but are limited by local receptive fields and capture only short-range context. Various attempts have also been made to build KWS systems with recurrent neural networks (RNNs) [15, 16, 17, 18, 19], which are capable of modeling longer temporal context information. However, RNNs may suffer from state saturation when facing a continuous input stream, increasing computational cost and detection latency.

In this work, we aim to address the aforementioned limitations of convolution by incorporating long-range context information with a graph convolutional network (GCN) [20]. The non-local relations can be encoded through a densely-connected GCN module with an attention mechanism, which estimates the feature context by message passing on the underlying graph. Intuitively, the feature at a certain position is updated by aggregating features at all positions with a weighted summation, where the weights are decided by the feature similarities between the corresponding two positions. Thus, any two positions with similar features can contribute to each other's improvement regardless of their spatial distance, which is different from the progressive behavior of recurrent and convolutional operations.

Inspired by ResNet [21], we propose a compact and efficient convolutional network (denoted as CENet) by utilizing the bottleneck architecture with a narrow structure. The bottleneck architecture is proposed in [21] to reduce model complexity by introducing a 1 × 1 convolutional layer,



which is responsible for reducing and restoring dimensions. Different from prior works, we also investigate the effect of a narrower structure for the small-footprint KWS task.

By combining the contextual feature augmentation module (GCN) with the CENet, a compact but efficient model (denoted as CENet-GCN) is proposed for KWS. We validate our method on the Google Speech Commands Dataset [22] with a series of comprehensive experiments. The empirical results and ablative study demonstrate the superiority of our method over the prior state-of-the-art approaches with much fewer parameters and a simpler network structure. The main contributions of this work can be summarized as follows:

• We propose a compact and efficient convolutional network for small-footprint KWS by utilizing the bottleneck structure.

• We introduce the GCN to capture long-range dependencies and achieve contextual feature augmentation.

2. METHOD

We describe our method for building a compact and efficient network to achieve a small-footprint KWS system. First, we give a brief introduction to the KWS task and the preprocessing strategies for audio feature generation in Sec. 2.1. Next, we detail our compact and efficient network structure based on residual connections and the bottleneck structure in Sec. 2.2. Finally, we describe the method of feature enhancement using the GCN in Sec. 2.3. To our knowledge, we are the first to apply the GCN approach to the KWS task.

2.1. KWS Task and Preprocessing

KWS is the task of detecting pre-defined keywords in an audio stream. A CNN-based KWS system includes two phases: 1) a pre-processing phase, and 2) a CNN-based classifier.

In the pre-processing phase, a band-pass filter with a 20 Hz-4 kHz passband is first applied to reduce noise. After that, the Mel-Frequency Cepstral Coefficient (MFCC) feature is constructed using a 30 ms window size and a 10 ms frame shift. We denote the input MFCC feature fed into the neural network as I. In this work I ∈ R^{t×f}, where t is 101 and f is 40; f is the dimension of the MFCC feature and t is the number of frames.
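For illustration, this feature extraction step can be sketched as follows; the file name, the 16 kHz sampling rate, and the use of the mel filterbank range as a rough stand-in for the band-pass filter are our assumptions, not details given in the paper:

```python
import librosa

# A minimal sketch of the preprocessing described above; "keyword.wav" is hypothetical.
audio, sr = librosa.load("keyword.wav", sr=16000)        # one-second utterance
mfcc = librosa.feature.mfcc(
    y=audio, sr=sr, n_mfcc=40,                            # f = 40 coefficients
    n_fft=int(0.030 * sr),                                # 30 ms window
    hop_length=int(0.010 * sr),                           # 10 ms frame shift
    fmin=20, fmax=4000,                                   # rough stand-in for the 20 Hz-4 kHz band-pass
)
I = mfcc.T                                                # shape (t, f) = (101, 40) for a 1 s clip
```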

The KWS task can be formulated as a classification problem:

y = F(I,Θ) (1)

where y ∈ R^l is the prediction for the audio sequence, F is the mapping function from the input feature space to the label space (implemented with a neural network in our method), and l is the number of classes to be classified. Our goal is to learn the network parameters Θ of the mapping function F.

Fig. 1. Framework of our CENet.

2.2. Compact and Efficient Neural Network

Our goal is to learn a KWS classifier with a compact structure and low computational cost. Motivated by the prior works [14, 21], we follow the key idea of ResNet [21], which postulates that learning the residual of a mapping is easier than learning the original mapping, to design a family of models for the KWS task in resource-limited environments. It is worth noting that ResNet [21] replaces the basic block (two 3 × 3 convolutional layers) with a bottleneck block, which consists of 1 × 1, 3 × 3 and 1 × 1 convolutions, for efficiency in deep networks.

Our basic CENet is built with three kinds of blocks: 1) an initial block, 2) bottleneck blocks and 3) connection blocks. In line with our goal of building a small-footprint system, we choose small channel numbers to avoid overfitting and heavy computational cost.

• Initial block is designed to generate the feature representations from the MFCC feature; it includes a 3 × 3 bias-free convolutional layer, a batch normalization layer and a ReLU activation function. To reduce the spatial size of the convolutional feature map, we add a 2 × 2 average pooling layer at the end of this block.

• Bottleneck block is introduced to realize the residual function with lower model complexity. For each residual function, we use a stack of 3 layers instead of a single convolutional layer. The three layers are 1 × 1, 3 × 3 and 1 × 1 convolutions, where the 1 × 1 layers are responsible for reducing and then restoring dimensions, leaving the 3 × 3 layer a bottleneck with smaller input/output dimensions.


Table 1. Configuration of the CENet baselines. (1, 16) means the numbers of input and output channels are 1 and 16. In each block, the repeated bottleneck units (×n) are followed by one connection block (×1).

CENet-6 (16.2K parameters)
  Initial: [3×3, (1, 16)]
  Block-1: [1×1, (16, 8); 3×3, (8, 8); 1×1, (8, 16)] ×1, [1×1, (16, 8); 3×3, (8, 8); 1×1, (8, 32)] ×1
  Block-2: [1×1, (32, 8); 3×3, (8, 8); 1×1, (8, 32)] ×1, [1×1, (32, 8); 3×3, (8, 8); 1×1, (8, 48)] ×1
  Block-3: [1×1, (48, 12); 3×3, (12, 12); 1×1, (12, 48)] ×1, [1×1, (48, 12); 3×3, (12, 12); 1×1, (48, 64)] ×1

CENet-24 (44.3K parameters)
  Initial: [3×3, (1, 16)]
  Block-1: [1×1, (16, 8); 3×3, (8, 8); 1×1, (8, 16)] ×7, [1×1, (16, 8); 3×3, (8, 8); 1×1, (8, 32)] ×1
  Block-2: [1×1, (32, 8); 3×3, (8, 8); 1×1, (8, 32)] ×7, [1×1, (32, 8); 3×3, (8, 8); 1×1, (8, 48)] ×1
  Block-3: [1×1, (48, 12); 3×3, (12, 12); 1×1, (12, 48)] ×7, [1×1, (48, 12); 3×3, (12, 12); 1×1, (48, 64)] ×1

CENet-40 (61K parameters)
  Initial: [3×3, (1, 16)]
  Block-1: [1×1, (16, 8); 3×3, (8, 8); 1×1, (8, 16)] ×15, [1×1, (16, 8); 3×3, (8, 8); 1×1, (8, 32)] ×1
  Block-2: [1×1, (32, 8); 3×3, (8, 8); 1×1, (8, 32)] ×15, [1×1, (32, 8); 3×3, (8, 8); 1×1, (8, 48)] ×1
  Block-3: [1×1, (48, 12); 3×3, (12, 12); 1×1, (12, 48)] ×7, [1×1, (48, 12); 3×3, (12, 12); 1×1, (48, 64)] ×1


• Connection block is a special bottleneck block used to increase dimensions and reduce the size of the feature map via a convolutional layer with a stride of 2. A connection block is used at the end of each stage. A minimal sketch of these blocks is given after this list.
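The following PyTorch sketch illustrates the three block types and a CENet-6-style stack. The exact ordering of batch normalization, activation, and the shortcut projection is our assumption; it is not specified in the text:

```python
import torch
import torch.nn as nn

class InitialBlock(nn.Module):
    """3x3 bias-free conv + BN + ReLU, followed by 2x2 average pooling."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.pool = nn.AvgPool2d(2)

    def forward(self, x):
        return self.pool(torch.relu(self.bn(self.conv(x))))

class Bottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 residual unit; with stride=2 it acts as a connection block."""
    def __init__(self, c_in, c_mid, c_out, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.ReLU(),
            nn.Conv2d(c_mid, c_mid, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU(),
            nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out),
        )
        # project the shortcut when the shape changes (connection block)
        self.shortcut = (nn.Identity() if stride == 1 and c_in == c_out else
                         nn.Sequential(nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(c_out)))

    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))

class CENet6(nn.Module):
    """Smallest variant in Table 1: one bottleneck + one connection block per stage."""
    def __init__(self, n_classes=12):
        super().__init__()
        self.features = nn.Sequential(
            InitialBlock(1, 16),
            Bottleneck(16, 8, 16), Bottleneck(16, 8, 32, stride=2),    # stage 1
            Bottleneck(32, 8, 32), Bottleneck(32, 8, 48, stride=2),    # stage 2
            Bottleneck(48, 12, 48), Bottleneck(48, 12, 64, stride=2),  # stage 3
        )
        self.pool = nn.AdaptiveAvgPool2d(1)                            # global average pooling
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                                              # x: (B, 1, 101, 40) MFCC maps
        return self.fc(self.pool(self.features(x)).flatten(1))         # logits; softmax at inference

logits = CENet6()(torch.randn(2, 1, 101, 40))                          # sanity check: shape (2, 12)
```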

With these three types of blocks, we are able to design several variants of CENet at different model complexities. Following the standard ResNet architecture, our CENet is designed in a multi-stage scheme. The whole architecture consists of one initial block and three stages; each stage includes several bottleneck blocks and one connection block. The network ends with a global average pooling layer and a fully-connected layer with a softmax function.

We assume that the parameter size of a convolutional layer is c_in × c_out × k², where c_in, c_out and k are the input channel number, output channel number and kernel size, respectively. Given an input feature map of size c_in × h × w, where h and w are the height and width, the number of multiplications for one-pass inference is h × w × k² × c_in × c_out. Both the parameter count and the multiplication count of the network are therefore sensitive to the channel number. In order to control the footprint of the network, we adopt small channel numbers at different depths.
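These two counting rules can be written out directly; the feature-map size in the example is illustrative only:

```python
def conv_params(c_in, c_out, k):
    """Parameters of a bias-free k x k convolution, as in the text: c_in * c_out * k^2."""
    return c_in * c_out * k * k

def conv_mults(c_in, c_out, k, h, w):
    """Multiplications for one pass over an h x w map: h * w * k^2 * c_in * c_out."""
    return h * w * k * k * c_in * c_out

# e.g. a 3x3 convolution mapping 8 -> 8 channels on a 25 x 10 map (sizes are hypothetical)
print(conv_params(8, 8, 3), conv_mults(8, 8, 3, 25, 10))   # 576 params, 144000 mults
```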

Our base model is built with a simple structure following the design above. Each stage comprises a few bottleneck blocks and a connection block. The number of channels increases from 16 to 64 through the 3 stages, which is experimentally shown to be efficient and effective in balancing the complexity and capacity of the model. We keep this equally narrow structure and explore the impact of depth by proposing three model variants. The details of our architecture are given in Table 1.

2.3. Contextual Feature Augmentation with GCN

Modeling non-local relations in feature representations is a fundamental problem in visual recognition, which enables us to capture long-range dependencies between scene entities.

Since CNN-based methods are now widely used for the KWS task, we explore contextual feature augmentation to help representation learning for KWS in this work. A promising strategy is to adopt the GCN to model the non-local relations of the convolutional features. Below we start with a brief introduction of the GCN method for estimating global context information. Then, the feature augmentation strategy that combines the GCN with our proposed CENet architecture is detailed.

We formulate the global contextual feature estimation [23] in the form of a GCN. Formally, let X = [x_1, · · · , x_N]ᵀ be a set of convolutional features, where x_i ∈ R^c is a c-channel feature vector and i indexes the spatial location of the feature vector. In the KWS setting, we have a conv-feature map defined on a 2D grid, giving X a size of N × c with N = w × h.

Then, a fully-connected graph G = (V, E) is built with N nodes v_i ∈ V and edges (v_i, v_j) ∈ E, ∀i < j (which means the graph is undirected), to represent the non-local relations between the features. The GCN assigns x_i as the input to node v_i and computes the feature representation of each node through a message passing process. The non-local relation can be defined as:

x̂_i = σ( (1 / Z_i(X)) ∑_{j=1}^{N} g(x_i, x_j) Wᵀ x_j )    (2)

where x̂_i represents the updated feature representation at node i, σ is an element-wise activation function (e.g., ReLU), and g is a distance measurement function encoding the pairwise relation. Z_i(X) is a normalization factor for location i, and W ∈ R^{c×c} is a weight matrix defining a linear mapping that encodes the message from node j. We can write the updating equation Eq. 2 in matrix form:

X̂ = σ(A(X) X W)    (3)

where A(X) ∈ R^{N×N} denotes the affinity matrix of the graph, in which A_{i,j} = (1 / Z_i(X)) g(x_i, x_j). It is easy to extend this updating procedure to multiple iterations by unrolling the message passing into a multi-layer network [24]. We focus on the single-iteration setting in the remainder of this section for notational clarity. The message propagation mechanism is illustrated in Fig. 2.


Fig. 2. Illustration of message propagation mechanism

As the updated feature x̂_i integrates context information from the entire feature set X through the message passing process, we can use it as an estimation of the non-local context for feature x_i. A common strategy for defining the graph affinity matrix is based on the similarity of neighboring node features. [23] proposes multiple choices of the pairwise function g:

• Gaussian:

g(x_i, x_j) = e^{x_iᵀ x_j}    (4)

where x_iᵀ x_j is the dot-product similarity and the normalization factor is Z_i(X) = ∑_{∀j} g(x_i, x_j).

• Embedded Gaussian:

g(x_i, x_j) = e^{θ(x_i)ᵀ φ(x_j)}    (5)

where θ(x_i) = W_θ x_i and φ(x_j) = W_φ x_j are two embeddings, and Z_i(X) = ∑_{∀j} g(x_i, x_j).

Specifically, we adopt the embedded Gaussian with a softmax function to measure the pairwise similarity. Thus Eq. 3 can be rewritten as:

X̂ = σ(softmax(X W_θᵀ W_φ Xᵀ) X W)    (6)

This feature augmentation strategy takes all positions into account when modeling the global context, which is different from a fully connected (fc) layer. Given the context feature X̂, we adopt a simple feature augmentation strategy to integrate the contextual representation with the original convolutional feature. We use a weighted summation of the two features as in [25]:

X_a = γ X̂ + X    (7)

where γ is a scaling parameter to be learned and X_a is the augmented feature.

We can easily insert the GCN module into our CENet. In practice, we insert the module at the end of each stage to encode long-range dependencies at different levels. Specifically, we use only a one-layer GCN to incorporate the non-local relations while maintaining a small model complexity. A minimal sketch of this module is given below.
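The following PyTorch sketch implements such a one-layer context module with the embedded-Gaussian affinity of Eq. 6 and the γ-weighted augmentation of Eq. 7. Using 1 × 1 convolutions for W_θ, W_φ and W, and initializing γ to zero, are our assumptions:

```python
import torch
import torch.nn as nn

class GCNContext(nn.Module):
    """One-layer graph context module: embedded-Gaussian affinity (Eq. 6) followed by
    the gamma-weighted augmentation of Eq. 7. A sketch under our own assumptions."""

    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels, 1, bias=False)  # W_theta embedding
        self.phi = nn.Conv2d(channels, channels, 1, bias=False)    # W_phi embedding
        self.w = nn.Conv2d(channels, channels, 1, bias=False)      # W, the message transform
        self.gamma = nn.Parameter(torch.zeros(1))                  # learnable scale, starts at 0

    def forward(self, x):                                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        n = h * w                                                  # N graph nodes, one per position
        theta = self.theta(x).view(b, c, n)
        phi = self.phi(x).view(b, c, n)
        affinity = torch.softmax(theta.transpose(1, 2) @ phi, dim=-1)   # (B, N, N), softmax over j
        msg = self.w(x).view(b, c, n).transpose(1, 2)              # (B, N, C): W x_j for every node
        context = torch.relu(affinity @ msg)                       # Eq. 3: sigma(A(X) X W)
        context = context.transpose(1, 2).reshape(b, c, h, w)
        return self.gamma * context + x                            # Eq. 7: X_a = gamma * X_hat + X

# hypothetical usage on a stage-sized feature map; output keeps the input shape
out = GCNContext(32)(torch.randn(2, 32, 25, 10))
```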

3. EXPERIMENT

In this section, we conduct a series of experiments on the Google Speech Commands Dataset [22] to validate the effectiveness of our method. First, we present the dataset and the implementation details in Sec. 3.1. Quantitative results and an ablation study are given in Sec. 3.2. In Sec. 3.3, the augmented convolutional feature map is visualized to study the effectiveness of the GCN module. We also plot receiver operating characteristic (ROC) curves for a comprehensive analysis.

3.1. Experimental Configuration

We use the Google Speech Commands Dataset to evaluate our method. The dataset consists of 65,000 one-second fixed-length utterances covering 30 short words from thousands of people. Following the experimental setting of [13, 14], we focus on discriminating 12 commands: "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", unknown, or silence. The dataset is split 80:10:10 into training, test and validation sets by the SHA1-hashed names of the audio files, so there are no overlapping speakers between the training, test and validation sets. To augment the data, we randomly add background noise from the dataset to the training audio with a probability of 0.8, with the SNR randomly selected from [5, 15] dB. A random time shift of Y ms is applied before transforming the audio to MFCCs, where Y ∼ Uniform[−100, 100].
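The augmentation described above can be sketched as follows; the function name and the assumption that the noise clip is at least as long as the utterance are ours:

```python
import numpy as np

def augment(clean, noise, sr=16000, p_noise=0.8):
    """Apply a random +/-100 ms time shift, then mix in background noise with prob p_noise."""
    shift = np.random.randint(-sr // 10, sr // 10 + 1)   # Y ~ Uniform[-100, 100] ms, in samples
    out = np.roll(clean, shift)
    if shift > 0:
        out[:shift] = 0.0                                # zero out the wrapped-around samples
    elif shift < 0:
        out[shift:] = 0.0
    if np.random.rand() < p_noise:
        snr_db = np.random.uniform(5.0, 15.0)
        start = np.random.randint(0, len(noise) - len(out) + 1)
        n = noise[start:start + len(out)]
        p_sig = np.mean(out ** 2) + 1e-10
        p_noi = np.mean(n ** 2) + 1e-10
        # scale the noise so that 10 * log10(signal power / scaled noise power) == snr_db
        out = out + n * np.sqrt(p_sig / (p_noi * 10 ** (snr_db / 10)))
    return out
```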

For the model architecture, our system is implemented with the PyTorch auto-differentiation framework. There are three variants of the CENet models, and we insert the GCN module at the end of each stage. The learnable parameters of our models are randomly initialized. Our models are trained with the SGD optimizer for a total of 350 epochs. We use the "poly" learning rate policy, where the current learning rate equals the base rate multiplied by (1 − iter/max_iter)^power. We set the base learning rate to 0.01 and the power to 0.9. The batch size is 64 and the L2 weight decay is 10^−3.
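The optimizer and the "poly" schedule can be set up as below; the placeholder model, the total iteration count, and stepping the scheduler once per iteration are our assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 12)                        # placeholder; a CENet variant would go here
base_lr, power, max_iter = 0.01, 0.9, 100000     # max_iter is a hypothetical total iteration count

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, weight_decay=1e-3)
# "poly" policy: lr = base_lr * (1 - iter / max_iter) ** power
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1.0 - it / max_iter) ** power)

# inside the training loop:
#   loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```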

3.2. Quantitative Results

We report the quantitative results for the aforementioned models in Table 2. Following the evaluation method in [14], we use class accuracy as the evaluation metric, measured as the fraction of correct predictions. We also compare the number of parameters and multiplications with other baseline methods.

3.2.1. Comparison with Prior Works

We first investigate the different CENet variants compared with recent works. The results indicate that our CENet outperforms state-of-the-art methods with fewer parameters and less computation.

Concretely, our base model CENet-24 achieves performance comparable to res15 [14].


Table 2. Performance of CENet and baseline methods.

Model                #Param.   Mult.    Results
trad-fpool13 [12]    1.37M     125M     90.5%
tpool2 [12]          1.09M     103M     91.7%
one-stride1 [12]     954K      5.76M    77.9%
res15 [14]           238K      894M     95.8%
res8 [14]            110K      30M      94.1%
res15-narrow [14]    42.6K     160M     94.0%
res8-narrow [14]     19.9K     5.65M    90.1%
DS-CNN-S [13]        38.6K     5.4M     94.4%
DS-CNN-M [13]        189.2K    19.8M    94.9%
DS-CNN-L [13]        497.6K    56.9M    95.4%
CENet-6              16.2K     1.95M    93.9%
CENet-24             44.3K     8.51M    95.6%
CENet-40             61K       16.18M   96.4%

Table 3. Test accuracy and footprint of different models (results in brackets use fbank input features).

Model           #Param.   Mult.    Results
CENet-6         16.2K     1.95M    93.9% (94.9%)
CENet-GCN-6     27.6K     2.55M    95.2% (95.7%)
CENet-24        44.3K     8.51M    95.6% (96.1%)
CENet-GCN-24    55.6K     9.11M    96.5% (96.5%)
CENet-40        60.9K     16.18M   96.4% (96.7%)
CENet-GCN-40    72.3K     16.78M   96.8% (97.0%)

Compared with res15 [14], our model achieves a 5× reduction in model complexity with only 44K parameters and a more than 100× reduction in multiplications in the feed-forward inference pass. This result indicates the effectiveness of the bottleneck architecture in building an efficient network.

Moreover, we also investigate the impact of depth by changing the number of blocks in the different stages. We first reduce the number of blocks in each stage by a factor of 4 to obtain CENet-6, which is the most compact network in this study. Compared with other compact models, such as one-stride1 in [12] and res8-narrow in [14], CENet-6 achieves 93.9% accuracy with only 16.2K parameters, outperforming both by a large margin.

Finally, we stack more layers to build a more powerful variant. It is worth noting that the model complexity is sensitive to the number of blocks in the last stage due to its large channel number, so we only increase the number of blocks in the first two stages to obtain CENet-40, which achieves a higher accuracy of 96.4%. Despite these changes, CENet-40 still maintains a small number of parameters (61K) and operations (16.18M). All three models demonstrate the effectiveness and efficiency of our CENet.

3.2.2. Results of the CENet with GCN Module

To validate the contextual feature augmentation mechanism, we evaluate the performance of the CENet-GCN models.

Table 4. Results of CENet with the GCN module inserted at different stages.

Model           Stage         #Param.   Mult.    Results
CENet-6         -             16.2K     1.95M    93.9%
CENet-GCN-6     Stage-1       17.8K     2.27M    94.3%
CENet-GCN-6     Stage-2       19.8K     2.13M    95.0%
CENet-GCN-6     Stage-3       22.5K     2.05M    94.4%
CENet-GCN-6     Stage-1,2,3   27.6K     2.55M    95.2%

We insert the GCN module at the end of each stage to obtain CENet-GCN-6, CENet-GCN-24 and CENet-GCN-40, respectively. The CENet-GCN models are trained in the same way as CENet without additional supervision, and the results are shown in Table 3.

Interestingly, CENet-6 achieves a 1.3% improvement with the GCN module at a cost of roughly 10K additional parameters; with only 27.6K parameters, CENet-GCN-6 reaches accuracy comparable to res15 and CENet-24. CENet-GCN-24 even outperforms CENet-40 with a more compact model and less computation.

Compared with the plain CENet models, CENet-GCN-24 and CENet-GCN-40 also achieve improvements of 0.9% and 0.4%, respectively. Since the deeper networks already have a larger capacity for contextual feature learning and a larger effective receptive field, the performance gain decreases as the network deepens. The results of the three model variants demonstrate that the GCN module is capable of incorporating contextual information to improve feature learning with few additional parameters and a small computational cost.

To compare our method with other state-of-the-art approaches [14] under the same setting, we adopt the MFCC feature as the input in all our experiments. We also evaluate the impact of the input feature. The MFCC feature is de-correlated in the frequency domain, whereas the fbank feature retains the time-frequency correlations of the spectral representation. In principle, convolutional layers can learn better from the fbank feature because they can leverage these spectral-temporal correlations. We therefore also evaluate our method with the fbank feature as the input and report the results inside the brackets in Table 3. Compared with the MFCC results, fbank input yields a 0%-1.0% improvement across all model variants.

3.2.3. Different Stages with the GCN Module

Furthermore, we investigate at which stage the GCN module should be added to augment the features with contextual information. We insert one GCN module into different residual stages, based on the CENet-6 backbone. The quantitative results in Table 4 show the effect of the proposed GCN for each individual stage and for all stages.

Note that incorporating the GCN module into stage-1 alone yields only a small gain over the base model; we attribute this to the lower-level feature maps lacking semantic information.


Fig. 3. Visualization of the feature maps. The first row is the spectrogram of the audio sequence. The second row is the averaged feature map of stage-1 before the GCN. The last row is the averaged feature map of stage-1 after the GCN. Feature maps are generated from the trained CENet-GCN-6 model.

Meanwhile, stage-2 equipped with a single GCN module achieves a 1.1-point improvement in accuracy. Moreover, it is straightforward to incorporate our GCN module into multiple residual stages in order to augment the feature maps with multi-level contextual information. Our results show that adding the GCN module to all three stages (1, 2 and 3) together reaches 95.2% accuracy, a 1.3-point gain over the baseline. These quantitative results indicate that our GCN achieves a sizable improvement at small additional computational cost.

3.3. Visualization and Analysis

3.3.1. Visualization of Feature Maps

We visualize the convolutional feature maps in Fig. 3 to better understand the contextual feature augmentation mechanism proposed in our work. The convolutional feature maps are averaged along the channel dimension and interpolated to the same size as the mel-scale spectrogram for visualization. Compared with the feature map before the GCN, it is evident that the GCN module, by utilizing contextual information, helps highlight the most discriminative regions and enlarges the gap between voiced and non-voiced regions. A short sketch of this visualization step follows.
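A minimal sketch of the channel-averaging and resizing step; the feature-map size is hypothetical:

```python
import torch
import torch.nn.functional as F

# feat: a (1, C, H, W) stage-1 feature map taken from a trained model; the size here is hypothetical
feat = torch.randn(1, 16, 50, 20)
avg = feat.mean(dim=1, keepdim=True)                       # average along the channel dimension
vis = F.interpolate(avg, size=(101, 40), mode="bilinear",  # resize to the (t, f) spectrogram grid
                    align_corners=False)
heatmap = vis[0, 0].numpy()                                # ready to plot next to the spectrogram
```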

3.3.2. ROC Curve Analysis

Furthermore, we plot receiver operating characteristic (ROC) curves for extensive analysis. The x axis is the false alarm rate (FAR) and the y axis is the false reject rate (FRR), representing the probability of false positives and the probability of false negatives, respectively.

Fig. 4. ROC curve of CENet and CENet-GCN models.

Fig. 5. ROC curve for different models

The sensitivity threshold is defined as the minimum probability at which a class is considered positive during evaluation. Curves for each keyword are computed by sweeping the sensitivity threshold over [0.0, 1.0] and are then averaged vertically to produce the overall curve for a particular model. The model with the smaller area under the curve (AUC) is the better one. Curves of all variants of our CENet and CENet-GCN models are plotted in Fig. 4. The ROC curves demonstrate the effectiveness of the GCN module, which is consistent with the quantitative results in Table 3. A sketch of how a per-keyword curve can be computed is given below.
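The per-keyword curve computation can be sketched as follows; the array names, sizes, and the evaluation grid are our own:

```python
import numpy as np

def frr_far_curve(scores, is_keyword, thresholds):
    """scores: (N,) posteriors for one keyword; is_keyword: (N,) boolean ground truth."""
    far, frr = [], []
    pos, neg = is_keyword, ~is_keyword
    for t in thresholds:
        accept = scores >= t                       # the class is considered positive above the threshold
        far.append(np.mean(accept[neg]))           # false alarms among non-keyword samples
        frr.append(np.mean(~accept[pos]))          # false rejects among keyword samples
    return np.array(far), np.array(frr)

# Hypothetical usage: sweep the sensitivity threshold over [0, 1]; the per-keyword FRR values
# would then be averaged at a common FAR grid ("vertical" averaging) to get the overall curve.
thresholds = np.linspace(0.0, 1.0, 101)
scores = np.random.rand(1000)
is_keyword = np.random.rand(1000) < 0.1
far, frr = frr_far_curve(scores, is_keyword, thresholds)
```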

We compare our most compact model and our most powerful model with prior works, as shown in Fig. 5. Both of our models outperform the baseline methods by a large margin.

4. CONCLUSION

We introduce a novel efficient network for the KWS task, which leverages the power of residual connections with the bottleneck structure and the graph convolutional network method. Based on the proposed network structure, we build several variants of our model with different complexities. Our models are evaluated on the Google Speech Commands Dataset. Our basic CENet models outperform the current state-of-the-art methods with fewer parameters and a simpler network structure. Going further, we introduce the graph convolutional network to the KWS task to encode feature context; models combined with the graph convolutional network perform even better than our basic models.


5. REFERENCES

[1] Johan Schalkwyk, Doug Beeferman, Francoise Beaufays, Bill Byrne, Ciprian Chelba, Mike Cohen, Maryam Kamvar, and Brian Strope, "Your word is my command: Google search by voice: A case study," in Advances in Speech Recognition, pp. 61–90. Springer, 2010.

[2] J Robin Rohlicek, William Russell, Salim Roukos, and Herbert Gish, "Continuous hidden Markov modeling for speaker-independent word spotting," in Proc. ICASSP. IEEE, 1989, pp. 627–630.

[3] Richard C Rose and Douglas B Paul, "A hidden Markov model based keyword recognition system," in Proc. ICASSP. IEEE, 1990, pp. 129–132.

[4] JG Wilpon, LG Miller, and P Modi, "Improvements and applications for key word recognition using hidden Markov modeling techniques," in Proc. ICASSP. IEEE, 1991, pp. 309–312.

[5] Marius-Calin Silaghi and Herve Bourlard, "Iterative posterior-based keyword spotting without filler models," in Proc. ASRU. IEEE, 1999, pp. 213–216.

[6] Marius-Calin Silaghi, "Spotting subsequences matching an HMM using the average observation probability criteria with application to keyword spotting," in Proc. AAAI, 2005, pp. 1118–1123.

[7] David RH Miller, Michael Kleber, Chia-Lin Kao, Owen Kimball, Thomas Colthurst, Stephen A Lowe, Richard M Schwartz, and Herbert Gish, "Rapid and accurate spoken term detection," in Eighth Annual Conference of the International Speech Communication Association, 2007.

[8] Siddika Parlak and Murat Saraclar, "Spoken term detection for Turkish broadcast news," in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008, pp. 5244–5247.

[9] Jonathan Mamou, Bhuvana Ramabhadran, and Olivier Siohan, "Vocabulary independent spoken term detection," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2007, pp. 615–622.

[10] Mark JF Gales, Kate M Knill, Anton Ragni, and Shakti P Rath, "Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED," in Spoken Language Technologies for Under-Resourced Languages, 2014.

[11] Guoguo Chen, Carolina Parada, and Georg Heigold, "Small-footprint keyword spotting using deep neural networks," in Proc. ICASSP. IEEE, 2014, pp. 4087–4091.

[12] Tara N Sainath and Carolina Parada, "Convolutional neural networks for small-footprint keyword spotting," in Proc. INTERSPEECH. ISCA, 2015, pp. 1478–1482.

[13] Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra, "Hello edge: Keyword spotting on microcontrollers," arXiv preprint arXiv:1711.07128, 2017.

[14] Raphael Tang and Jimmy Lin, "Deep residual learning for small-footprint keyword spotting," in Proc. ICASSP. IEEE, 2018, pp. 5484–5488.

[15] Santiago Fernandez, Alex Graves, and Jurgen Schmidhuber, "An application of recurrent neural networks to discriminative keyword spotting," in Proc. ICANN. Springer, 2007, pp. 220–229.

[16] Martin Woellmer, Bjoern Schuller, and Gerhard Rigoll, "Keyword spotting exploiting long short-term memory," Speech Communication, vol. 55, no. 2, pp. 252–265, 2013.

[17] Pallavi Baljekar, Jill Fain Lehman, and Rita Singh, "Online word-spotting in continuous speech with recurrent neural networks," in Proc. SLT. IEEE, 2014, pp. 536–541.

[18] Ming Sun, Anirudh Raju, George Tucker, Sankaran Panchapagesan, Gengshen Fu, Arindam Mandal, Spyros Matsoukas, Nikko Strom, and Shiv Vitaladevuni, "Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting," in Proc. SLT. IEEE, 2016, pp. 474–480.

[19] Yanzhang He, Rohit Prabhavalkar, Kanishka Rao, Wei Li, Anton Bakhtin, and Ian McGraw, "Streaming small-footprint keyword spotting using sequence-to-sequence models," in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 474–481.

[20] Thomas N. Kipf and Max Welling, "Semi-supervised classification with graph convolutional networks," in Proc. ICLR, 2017.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proc. CVPR. IEEE, 2016, pp. 770–778.

[22] Pete Warden, "Speech commands: A dataset for limited-vocabulary speech recognition," arXiv preprint arXiv:1804.03209, 2018.

[23] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He, "Non-local neural networks," in Proc. CVPR. IEEE, 2018, pp. 7794–7803.


[24] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl, "Neural message passing for quantum chemistry," in Proc. ICML, 2017.

[25] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu, "Dual attention network for scene segmentation," in Proc. CVPR. IEEE, 2019, pp. 3146–3154.