

Speech Communication 47 (2005) 220–242

www.elsevier.com/locate/specom

Automatic speech recognition over error-prone wireless networks

Zheng-Hua Tan *, Paul Dalsgaard, Børge Lindberg

Centre for TeleInFrastructure (CTIF), Speech and Multimedia Communication (SMC), Aalborg University,

Niels Jernes Vej 12, DK-9220 Aalborg, Denmark

Received 19 December 2004; received in revised form 5 May 2005; accepted 11 May 2005

Abstract

The past decade has witnessed a growing interest in deploying automatic speech recognition (ASR) in communication networks. Networks such as wireless networks present a number of challenges due to, e.g., bandwidth constraints and transmission errors. The introduction of distributed speech recognition (DSR) largely eliminates the bandwidth limitations, and the presence of transmission errors becomes the key robustness issue. This paper reviews the techniques that have been developed for ASR robustness against transmission errors.

In the paper, a model of network degradations and robustness techniques is presented. These techniques are classified into three categories: error detection, error recovery and error concealment (EC). A one-frame error detection scheme is described and compared with a frame-pair scheme. As opposed to vector-level techniques, a technique for error detection and EC at the sub-vector level is presented. A number of error recovery techniques such as forward error correction and interleaving are discussed in addition to a review of both feature-reconstruction and ASR-decoder based EC techniques. To enable the comparison of some of these techniques, evaluation has been conducted on the basis of the same speech database and channel. Special attention is given to the unique characteristics of DSR as compared to streaming audio, e.g. voice-over-IP. Additionally, a technique for adapting ASR to the varying quality of networks is presented. The frame-error-rate is here used to adjust the discrimination threshold with the goal of optimising out-of-vocabulary detection.

This paper concludes with a discussion of the applicability of different techniques based on the channel characteristics and the system requirements.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Distributed speech recognition; Channel error robustness; Out-of-vocabulary detection

0167-6393/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2005.05.007

This work has been supported by the FACE project, the CNTK consortium project and the EU 6th framework project MAGNET.
* Corresponding author. Tel.: +45 9635 8686; fax: +45 9815 1583.
E-mail addresses: [email protected] (Z.-H. Tan), [email protected] (P. Dalsgaard), [email protected] (B. Lindberg).



1. Introduction

To facilitate the access to services over wireless networks there is often a demand to include automatic speech recognition (ASR) as a key component in the user interface (Cox et al., 2000; Lee and Lee, 2001). However, terminals of today are often hand-held devices with limited battery life, computing power and memory size, which all taken together constitute a challenge for the implementation of any complex ASR system into these devices. For ASR systems associated with large databases, security and consistency are additional challenges to be taken into account for terminal-based implementation (Rose et al., 2003). In addition, the 'always-on' facility of communication networks offers improved opportunities for operating ASR modules using a distributed architecture in which only the front-end processing requires specific porting and implementation into the hand-held devices. Therefore, the development of network-based ASR emerges as a recent trend.

To enable low bit-rate data transmission in distributed architectures, speech coding is applied to conduct data compression. Lossy speech codecs may, however, significantly degrade the performance of ASR (Haavisto, 1999). A way to eliminate this degradation is to introduce distributed speech recognition (DSR) (Pearce, 2000, 2004), and DSR has been an important research focus within ASR during the last decade. In the client-server DSR system architecture, the ASR processing is split into the client-based front-end feature extraction and the server-based back-end recognition, where data are transmitted between the two parts via heterogeneous networks. Comparative studies have shown superior performance of DSR to codec-based ASR (Kelleher et al., 2002; Kiss, 2000).

Although DSR eliminates the degradations originating from the speech compression algorithms, the transmission of data across networks still brings a number of problems to speech recognition technology, in particular transmission errors. This paper analyses the characteristics of transmission error degradations and attempts to model the degradations and the corresponding error-robustness techniques. In order to reduce transmission error degradation, client-driven recovery and server-based concealment techniques are applied within DSR systems (where the client is always at the sender side and the server at the receiver side) in addition to error detection techniques. Client-based techniques, e.g. retransmission, interleaving and forward error correction (FEC), may result in recovering a large amount of transmission errors (e.g. Hsu and Lee, 2004; James and Milner, 2004). However, these methods generally have a number of disadvantages such as additional delay, increased bandwidth and computational overhead (Perkins et al., 1998). Server-based error concealment (EC) techniques exploit the redundancy in the transmitted signal and may be used independently of, or in combination with, client-based techniques. When used in combination, the purpose is to handle the remaining errors that a purely client-based technique fails to recover.

Server-based EC for DSR may be conducted either through feature-reconstruction or through modification of the ASR-decoder. Feature-reconstruction EC generally employs one of the following techniques: splicing, substitution (with silence, noise or source-data), repetition or interpolation (Boulis et al., 2002; Milner and Semnani, 2000). Tan et al. (2003b, 2004a) recently proposed a partial splicing and a sub-vector concealment technique that both demonstrate better performance. A group of statistical-based techniques has been developed which all exploit statistical information about speech for feature-reconstruction (Gomez et al., 2004; James et al., 2004). In wireless communications, a number of studies exploit the reliability information of the received bits to improve feature-reconstruction (Haeb-Umbach and Ion, 2004; Peinado et al., 2003).
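The repetition and interpolation schemes mentioned above can be sketched as follows. This is a minimal illustration with our own function name and a boolean loss mask, not the ETSI algorithm:

```python
import numpy as np

def conceal(frames, lost):
    """Replace frames flagged as lost, using repetition at the edges of the
    utterance and linear interpolation when both neighbours of a burst are
    intact. `frames` is a (T, D) array of feature vectors; `lost` is a
    boolean mask of length T."""
    out = frames.copy()
    T = len(frames)
    t = 0
    while t < T:
        if not lost[t]:
            t += 1
            continue
        start = t                      # first lost frame of the burst
        while t < T and lost[t]:
            t += 1
        end = t                        # first good frame after the burst
        left = start - 1               # last good frame before the burst
        if left < 0:                   # burst at the start: repeat right edge
            out[start:end] = frames[end]
        elif end >= T:                 # burst at the end: repeat left edge
            out[start:end] = frames[left]
        else:                          # both neighbours intact: interpolate
            for i in range(start, end):
                a = (i - left) / (end - left)
                out[i] = (1 - a) * frames[left] + a * frames[end]
    return out
```

Pure repetition corresponds to replacing the interpolation branch with a copy of the nearest error-free neighbour.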

In the ASR-decoding stage, the effect of transmission errors can be mitigated by integration-based EC techniques. Both (Bernard and Alwan, 2002; Weerackody et al., 2002) integrate the reliability of the channel-decoded features into the recognition process, where the Viterbi decoding algorithm is modified such that contributions made by observation probabilities associated with features estimated from erroneous features are decreased. Endo et al. (2003) apply a theory of missing features to error-robust ASR where erroneous features generate constant contributions to the Viterbi decoding with the goal of neutralising the effect from these features.
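A generic sketch of such reliability-weighted Viterbi decoding (our own minimal formulation, not the cited authors' exact algorithms) scales each frame's log observation score by a per-frame weight; a weight of zero makes the frame's contribution constant across states, as in the missing-feature approach:

```python
import numpy as np

def weighted_viterbi(log_b, log_A, log_pi, gamma):
    """Viterbi decoding with per-frame reliability weights.
    log_b:  (T, N) log observation probabilities per frame and state
    log_A:  (N, N) log transition matrix
    log_pi: (N,) initial log state probabilities
    gamma:  (T,) weights; 1.0 = trusted frame, 0.0 = treat as missing."""
    T, N = log_b.shape
    delta = log_pi + gamma[0] * log_b[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[i, j]: best path into j via i
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + gamma[t] * log_b[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With uniform transitions, a frame flagged erroneous (gamma = 0) no longer pulls the path towards the state its corrupted observation happens to favour.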

In addition to these error-robustness methods, ASR systems may benefit from adaptation to the varying quality of networks in order to optimise their performance. This has been demonstrated in an experiment that introduces a frame-error-rate (FER) based out-of-vocabulary (OOV) detection method (Tan et al., 2003a).

This paper gives a survey of the robustness issues related to network degradations and presents a number of analyses and experiments with a focus on transmission error robustness.

The paper is organized as follows. Section 2 briefly reviews network-based speech recognition and the ETSI–DSR standards. Section 3 analyses the characteristics of transmission errors, presents a model and categorizes error-robustness techniques. Section 4 presents a number of error detection methods with an emphasis on error detection at different levels. A number of client-based error recovery techniques are investigated in Section 5, and Section 6 presents server-based EC techniques. Experimental evaluations are presented in Section 7. Section 8 presents details of the FER-dependent OOV detection method. Concluding remarks are given in Section 9.

2. Speech recognition over networks

The deployment of ASR in networks requires specific attention due to a number of factors, such as the more complicated architecture, the limited resources in the terminals, the bandwidth constraints and the transmission errors. These network-linked issues have focussed the research in network-based ASR on front-end processing for remote speech recognition, on source coding and on channel coding and EC.

2.1. Remote ASR

The integration of ASR into network environments may be implemented as either a terminal- or network-based architecture. Because of their different advantages, both architectures are expected to be used in connection with the deployment of future services. An overview and comparison of these architectures is presented in (Viikki, 2001). Since numerous devices already exist and their types and number are still increasing at an incredible rate, the limited resources in the devices and the demand for services with user-friendly interfaces are motivating factors for introducing remote (network-based) ASR. This paper limits its attention to the network-based solutions only. To enable remote speech recognition, data representing speech may be transmitted from the input device to the server as either coded speech or as ASR features, resulting in two types of network-based ASR: server-only ASR and DSR.

In the server-only ASR approach, the client compresses input speech via conventional speech coders and transmits the coded speech to the server (Kiss, 2000). One way of decoding is that the server re-synthesises the speech, conducts feature-extraction and subsequently performs recognition. The influence of speech coding algorithms on ASR performance, e.g. voice-over-IP (VoIP), Global System for Mobile Communications (GSM) and Universal Mobile Telecommunication System (UMTS) codecs, has been extensively investigated in the literature (e.g. 3GPP TR 26.943, 2004; Besacier et al., 2001; Fingscheidt et al., 2002; Kelleher et al., 2002; Lilly and Paliwal, 1996; Mayorga et al., 2003; Pearce, 2004). As the quality of the re-synthesised speech is highly dependent on the speech coder, a low bit-rate speech coder may cause significant degradations in recognition performance (Haavisto, 1998). Although the deployment of acoustic models trained in matched conditions for each individual coding scheme results in substantial improvement, degradation in ASR performance is still observed at bit-rates below 16 kbps (Euler and Zinke, 1994). Another way is to estimate the feature set directly from the bit-stream of the speech coder without re-synthesising the coded speech (Huerta, 2000; Huerta and Stern, 1998; Kim and Cox, 2000, 2001; Pelaez-Moreno et al., 2001).

The degradations observed in the above studies have led to the introduction of DSR with the goal of avoiding the degradations from lossy speech compression. In DSR, speech features suitable for recognition are calculated and quantized in the client and transmitted to the server, where they are decoded, followed by appropriate EC, and then processed by the recogniser as shown in Fig. 1. To provide reliable communication between the client and the server, the deployment of selected channel coding/decoding schemes together with appropriate EC schemes is needed. This architecture provides a good trade-off between bit-rate and recognition performance (Bernard and Alwan, 2002).

DSR standards have been produced within ETSI in the STQ Aurora DSR working group—often known as Aurora. The first standard for the well-known cepstral features was published in 2000 with the aim of handling the degradations of ASR over mobile channels caused by both lossy speech coding and transmission errors (ETSI ES 201 108, 2000). The goal was also to provide front-end standards to enable interoperability over mobile networks (Pearce, 2004). It has been experimentally justified that DSR outperforms adaptive multi-rate (AMR) codecs according to both 'Aurora tests' (Kelleher et al., 2002) and a number of extensive industrial tests organised by 3GPP (3GPP TS 26.235 V6.1.0, 2004). When transmission errors are introduced, speech codecs produce even more degradation since those coding algorithms generally have inter-frame dependency (Pearce, 2004). The strength of DSR is that the DSR frames are generally independent and thus more robust to error-prone channels.

2.2. Source coding

The goal of source coding is to compress information for transmission over bandwidth-limited channels. One common class of coding schemes for DSR applies vector quantization (VQ) to ASR features. Split VQ and scalar quantization used to compress Mel-frequency cepstral coefficients (MFCCs) were evaluated by Digalakis et al. (1999). In the split VQ, each cepstral vector was partitioned into sub-vectors and each sub-vector was independently quantized by using its own codebook. It was concluded that split VQ has lower storage and computational requirements as compared to full VQ and that split VQ performs significantly better than scalar quantization at any bit-rate. By using split VQ and a bit-allocation method, it was found that 2000 bps is sufficient for the transmission of the 13 cepstral coefficients to achieve ASR performance corresponding to unquantized coefficients. The ETSI–DSR standards use a particular form of split VQ.

Fig. 1. Block diagram of DSR system. (Client front-end: feature extraction y[m], source coding, channel coding; error-prone channel; server back-end: channel decoding, source decoding, error concealment, ASR decoding.)
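The split VQ scheme described above (partition the vector, search each sub-vector's own codebook, concatenate codewords at the decoder) can be sketched as follows; the dimensions and random codebooks are illustrative, not those of any standard:

```python
import numpy as np

def split_vq_encode(vec, codebooks):
    """Quantize one feature vector with split VQ: partition it into
    sub-vectors and pick the nearest codeword in each sub-vector's own
    codebook. Returns one index per sub-vector."""
    indices = []
    pos = 0
    for cb in codebooks:               # cb has shape (num_codewords, sub_dim)
        sub = vec[pos:pos + cb.shape[1]]
        dists = np.sum((cb - sub) ** 2, axis=1)   # squared Euclidean distance
        indices.append(int(np.argmin(dists)))
        pos += cb.shape[1]
    return indices

def split_vq_decode(indices, codebooks):
    """Reconstruct the vector by concatenating the selected codewords."""
    return np.concatenate([cb[i] for i, cb in zip(indices, codebooks)])
```

Because each sub-vector has its own small codebook, the search and storage cost grows with the sum rather than the product of the codebook sizes, which is the advantage over full VQ noted above.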

Another class of source coding is transform coding, in which features are transformed to remove the correlation in the features and where quantization is applied in the transformed domain. Transform coding usually gives better performance than quantizing in the original domain. Milner and Shao (2003) study both Karhunen–Loève and discrete cosine transform (DCT) for the compression of MFCC features. Zhu and Alwan (2001) use a 2D-DCT to exploit inter-frame (temporal) correlations between speech features. Specifically, feature vectors from 12 frames are grouped together to form one block of features, which is then transformed by 2D-DCT. Since the DCT compacts energy into the low-order components, a low bit-rate is achieved by setting the lowest-energy components in each block to zero and quantizing the nonzero components only. Recognition performance is maintained even at 634 bps, though at the expense of a block-sized delay.
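A minimal sketch of such 2D-DCT block compression follows, keeping only the largest-magnitude transform components; the block size and the `keep` parameter are illustrative, and the actual quantization of the surviving components is omitted:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix, so the inverse transform is the transpose."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)
    return C

def compress_block(block, keep):
    """2D-DCT a (frames x coefficients) feature block, zero all but the
    `keep` largest-magnitude transform components, and invert."""
    Ct, Cd = dct_matrix(block.shape[0]), dct_matrix(block.shape[1])
    coef = Ct @ block @ Cd.T                     # forward 2D-DCT
    thresh = np.sort(np.abs(coef), axis=None)[-keep]
    coef[np.abs(coef) < thresh] = 0.0            # discard low-energy terms
    return Ct.T @ coef @ Cd                      # inverse 2D-DCT
```

For slowly varying feature trajectories most of the energy lands in the low-order components, so a small `keep` changes the reconstructed block very little, which is the energy-compaction property the cited scheme relies on.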

Paliwal and So (2004) exploited the multi-frame Gaussian mixture model-based block quantizer for the coding of MFCC features. The strengths of the block quantizer are computational simplicity and bit-rate scalability.

The inter-frame coding schemes discussed above all exploit the correlation across consecutive MFCC features. As a result, an error in one frame has considerable impact on the quality of the following frames. Due to the removal of the inherent redundancy that exists in speech features, a low bit-rate source coding method is highly sensitive to transmission errors. To some extent, there is a trade-off between error-resistance and low bit-rate (achieved by the removal of redundancy)—sometimes called the "no free lunch theorem" (Ho, 1999): coding efficiency multiplied by robustness is constant.

Considerable work has been conducted on investigating how to allocate the bits among, e.g., sub-vectors given an overall bit-rate. Digalakis et al. (1999) allocate bits among sub-vectors by using the word-error-rate (WER) as a metric. An iterative bit-allocation method is deployed in (Hsu and Lee, 2004) but driven by syllable error rates. Srinivasamurthy et al. (2004) propose a bit-allocation algorithm based on a mutual information measure, which is superior to the conventional mean square error (MSE) metric. The justification for using the mutual information measure is that the goal in DSR is to ensure minimal degradation in classification performance rather than minimal MSE.
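One common way to realise such bit allocation is a greedy search; the sketch below is illustrative only (the distortion callables stand in for whatever per-sub-vector metric is used, e.g. MSE, WER or mutual information, as in the cited work):

```python
def greedy_allocate(distortion, total_bits, max_bits=8):
    """Greedy bit allocation: repeatedly grant one bit to the sub-vector
    whose distortion metric improves most. `distortion[i](b)` returns the
    distortion of sub-vector i when coded with b bits."""
    n = len(distortion)
    bits = [0] * n
    for _ in range(total_bits):
        # improvement from adding one more bit to each sub-vector
        deltas = [
            distortion[i](bits[i]) - distortion[i](bits[i] + 1)
            if bits[i] < max_bits else -1.0
            for i in range(n)
        ]
        best = max(range(n), key=lambda i: deltas[i])
        bits[best] += 1
    return bits
```

With a convex distortion model the greedy rule recovers the familiar result that higher-variance sub-vectors receive more bits.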

2.3. Channel coding and error concealment

While source coding aims at compressing information, channel coding techniques attempt to protect (detect and/or correct) information from distortions (Bossert, 2000). Channel coding is defined as an error-control technique used for reliable data delivery across error-prone channels by means of adding redundancy to the data (Sklar and Harris, 2004). In the context of DSR, considerable research has been conducted aimed at exploring the potential of channel coding and EC techniques (e.g. Bernard, 2002; Boulis et al., 2002; James and Milner, 2004; Milner, 2001; Peinado et al., 2003; Potamianos and Weerackody, 2001; Tan et al., 2004a). In addition, some work investigates the joint design of source and channel coding (Riskin et al., 2001; Weerackody et al., 2001).

2.4. The ETSI–DSR standards

The ETSI–DSR basic front-end defines the feature-extraction processing together with an encoding scheme (ETSI ES 201 108). The front-end processing produces a 14-element vector consisting of log energy (logE) in addition to 13 MFCCs ranging from c0 to c12, computed every 10 ms. Each feature vector is compressed using split VQ. The split VQ algorithm groups two features (either {ci and ci+1, i = 1, 3, ..., 11} or {c0 and logE}) into a feature-pair sub-vector, resulting in seven sub-vectors in one vector. Each sub-vector is quantized using its own split VQ codebook. The size of each codebook is 64 (6 bits) for the feature-pairs {ci and ci+1} and 256 (8 bits) for {c0 and logE}, resulting in a total of 44 bits for each vector.

Two quantized frames—in this paper equivalent to a vector—are grouped together and protected by a 4-bit cyclic redundancy check (CRC), creating a 92-bit frame-pair. Twelve frame-pairs are combined and appended with overhead bits resulting in an 1152-bit multi-frame. Multi-frames are concatenated into a 4800 bps bit-stream for transmission. The decoding algorithm at the server conducts two calculations to determine whether or not a frame-pair is received with errors, namely a CRC test and a data consistency test. The ETSI standard uses a repetition scheme in its EC processing to replace erroneous vectors.
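The bit-stream arithmetic above can be checked directly. Note that the 48 overhead bits per multi-frame are inferred here from the stated totals (1152 = 12 x 92 + 48), not quoted from the standard:

```python
# Payload arithmetic of the ETSI ES 201 108 bit-stream described above.
sub_vector_bits = [6] * 6 + [8]          # six {ci, ci+1} pairs + {c0, logE}
frame_bits = sum(sub_vector_bits)        # 44 bits per 10 ms frame

frame_pair_bits = 2 * frame_bits + 4     # two frames + 4-bit CRC = 92 bits

# 12 frame-pairs plus overhead bits (48, inferred) form one multi-frame.
multiframe_bits = 12 * frame_pair_bits + 48

# A multi-frame carries 24 frames x 10 ms = 240 ms of speech.
bit_rate = multiframe_bits * 100 // 24   # bits per second, integer arithmetic
```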

The ETSI–DSR standard serves as a baseline for presenting various techniques and a platform for comparing a number of experiments in this paper with the baseline results.

A noisy acoustical environment is in the very nature of mobile services. Environmental noise and speaker variation are key issues to be taken into consideration for the success of ASR applications (Rose, 2004). Speech enhancement techniques—used to counteract the detrimental effects of acoustic noise—are mainly applied in the time and frequency domain. However, as features sent to the server-based recognition are commonly cepstral coefficients, the speech enhancement techniques must be applied at the client side. This motivated an update by ETSI to the basic front-end to include noise-robustness techniques, leading to the publication of the advanced front-end (ETSI ES 202 050, 2002).

On the basis of requests for server-side speech reconstruction and for enabling improved tonal language recognition, an additional extended version of the basic front-end was later issued, including fundamental frequency information (ETSI ES 202 211, 2003; Ramabadran et al., 2004; Sorin et al., 2004). The inclusion of speech reconstruction in DSR implies a convergence of speech coding and DSR feature-extraction, though a difference still exists, namely that the optimisation criterion for speech coding is perceptual quality while it is ASR performance for DSR feature-extraction. The combination of the advanced front-end and the server-side speech reconstruction has resulted in the extended advanced front-end ETSI ES 202 212 (2003). Presently, the DSR extended advanced front-end is selected by the 3rd Generation Partnership Project (3GPP) as the codec for speech enabled services (3GPP TS 26.235). Standards have also been agreed in the Internet Engineering Task Force (IETF) to define the Real-time Transport Protocol (RTP) payload formats for these DSR codecs (Xie and Pearce, 2004).

3. Modelling transmission error degradation

Network-linked degradations are mainly caused by the occurrence of transmission errors. This section analyses the characteristics of transmission errors, presents a model of error degradation and categorizes error-robustness techniques.

3.1. Transmission channel types and simulation

Two types of network connections exist, circuit-switched and packet-switched data channels, over which the DSR client and server are interlinked. Transmission errors in connection with packet-switched networks occur in the form of packets that are lost (also known as erasure errors), or packets that are delayed and therefore discarded in a real-time application (Perkins et al., 1998). Packet loss and delay are mainly caused by congestion at routers. Bit errors seldom happen in packet-switched networks. In contrast, bit errors occur more frequently during transmission over circuit-switched mobile networks, resulting in bit errors in the speech data streams. When a speech frame is detected as erroneous and is dropped by the receiver, the situation is equivalent to a packet loss.

Considerable efforts have been spent on investigating and simulating various network degradation phenomena, mainly transmission errors. In general, DSR system performance is evaluated either in channel simulations or in real transmission. Channels are simulated in three ways: by adding random errors, by adding burst errors or by using simulated networks. Random errors are intuitively tested in simulations by randomly adding errors to the speech bit-stream or by randomly dropping packets with a given probability. An AWGN (Additive White Gaussian Noise) channel generates random bit errors while a Rayleigh fading channel produces bit errors that occur in bursts. The well-known two-state Gilbert model is widely used to generate error bursts (Gilbert, 1960; Kanal and Sastry, 1978). Two states are used to model error-free transmissions (good) and erroneous transmissions (bad), respectively. Milner and James (2004) suggested using a three-state Markov model. The basic idea of the three-state model is to add an extra state associated with the bad state, allowing for some short error-free periods to occur during intervals of error bursts. Tests showed that the three-state model traces real-network packet losses better.
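The two-state Gilbert channel can be simulated in a few lines; the function name and parameter values below are illustrative:

```python
import random

def gilbert_losses(n, p_gb, p_bg, loss_in_bad=1.0, seed=0):
    """Simulate the two-state Gilbert channel sketched above. Frames in the
    'good' state are delivered; in the 'bad' state a frame is lost with
    probability `loss_in_bad`. p_gb and p_bg are the good->bad and
    bad->good transition probabilities. Returns a list of loss flags."""
    rng = random.Random(seed)
    bad = False
    lost = []
    for _ in range(n):
        # stay in 'bad' with probability 1 - p_bg; enter it with p_gb
        bad = rng.random() < (1.0 - p_bg if bad else p_gb)
        lost.append(bad and rng.random() < loss_in_bad)
    return lost
```

In steady state (with loss_in_bad = 1) the loss rate is p_gb / (p_gb + p_bg) and the mean burst length is 1 / p_bg, so bursty channels are obtained by making p_bg small relative to the target loss rate.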

In many investigations the widely used GSM circuit-switched channel error patterns (EPs) are applied (Pearce, 2004). The EPs are realistic in the sense that they include a merging of both random and burst errors. The three EPs are EP1, EP2 and EP3, corresponding to carrier-to-interference (C/I) ratios of 10 dB, 7 dB and 4 dB and to average bit error rates of 0.0049%, 0.18% and 3.55%, respectively.

The literature references a number of network simulators. Hsu and Lee (2004) use a General Packet Radio Service (GPRS) simulator where a large number of complicated transmission phenomena have been considered, such as the propagation model and multi-path fading. Tan et al. (2003a) applied UMTS statistics. The UMTS statistics are provided from a system-level network simulator which is able to simulate a large variety of scenarios and user deployments and to extract realistic performance statistics regarding packet error rate or blocking probability.

Besides evaluations in channel simulations, real-world network transmissions have also been tested. Mayorga et al. (2003) have investigated the effect of both packet loss simulated by the Gilbert model and packet loss in real transmission. A strong correlation between WER and packet loss rate has been observed in simulated conditions. In real conditions, however, the same correlation did not occur.

3.2. Modelling transmission error degradation

The traditional model of the degradation of speech signals is depicted in Fig. 2 (Acero, 1993; Huang et al., 2001). The speech signal is corrupted by both linear distortion and additive noise. To compensate for these distortions, three categories of noise-robustness techniques are introduced. These are noise-resistant features, speech enhancement and speech model compensation for noise, as shown in Fig. 2 (Gong, 1995).

Fig. 2. Architectural model of degradation and robustness techniques.

Fig. 3. Architectural model of network degradations and robustness techniques against transmission errors.

The link between feature-extraction and the ASR-decoding module in the context of DSR is broken by one or more network connections that each introduces additional transmission errors into the data streams. The architectural model of degradations involving both acoustic noise and transmission errors is illustrated in Fig. 3. The figure demonstrates how errors may be introduced into the speech stream and shows which type of error-robustness techniques can be applied.

Note that the general goal of feature-extraction remains to be the estimation of features that are robust to acoustic noise, since noise-robustness often is the dominating factor for the degradations in ASR performance (Rose, 2004; Sukkar et al., 2002). The degradation originating from transmission errors is handled by a number of techniques such as error control and concealment. As the MFCC features are extensively used and have proved to be successful for ASR (Davis and Mermelstein, 1980), MFCCs are used for most DSR front-ends. One exception to this is the adoption of perceptual linear prediction (PLP) by Bernard and Alwan (2001). The advantage of applying PLP is a low bit-rate, since PLP coefficients can be quantized at 400 bps (Gunawan and Hasegawa-Johnson, 2001).

Source coding and decoding modules are added to compress speech features to meet bandwidth constraints as shown in Fig. 3. Data transferred over networks are subject to various error sources that cause changes in the stream of speech data either at the bit or the packet level, depending on the individual transmission channel. Channel coding/decoding is deployed with the aim of enabling error detection and recovery and thus providing reliable transmission across networks. Deployment of server-based EC techniques is a necessity in order to reduce the impact of transmission errors still remaining in the speech features.

3.3. Categorization of error-robustness techniques

As compared to acoustic noise, transmission errors have distinctive characteristics and influence speech signals and features differently. Firstly, transmission errors occur at discrete-time frame values whereas the acoustic noise influences the speech signal (and the derived speech features) as a running process. Secondly, transmission errors in general affect the speech data in the cepstral domain whereas the environmental noise affects the time–frequency domain. Thirdly, transmission errors are introduced into the speech signal after the feature-extraction process, which allows for applying methods for client-based error control prior to transmission. These characteristics are the foundations for applying different compensation techniques for handling transmission errors and environmental noise.

Generally, error-robustness techniques in DSR have been developed in three manifestations. Firstly, DSR features are protected by traditional error control and recovery techniques such as FEC, interleaving and joint source and channel coding with the aim of lossless recovery. These techniques require the participation of the client and are termed client-based recovery. Secondly, DSR shares a class of robustness techniques with audio (Perkins et al., 1998) and video (Wang and Zhu, 1998) transmission over networks, where feature-reconstruction EC is applied to generate an estimate of the original signal. Thirdly, since the final receiver of DSR features is the ASR-decoder, transmission errors can be further mitigated in the ASR-decoding process through modification of the decoder, either by marginalisation according to missing data theory (Cooke et al., 2001) or by weighted Viterbi decoding (Yoma et al., 1998). Transmission over networks may leave speech features completely lost or partially corrupted, which makes missing data techniques well suited for combating transmission errors in DSR (Potamianos and Weerackody, 2001). The taxonomy of these techniques is shown in Fig. 4.

Fig. 4. The taxonomy of error-robustness techniques: client-based error recovery, comprising error detection (parity check, CRC, checksum; block and convolutional codes), active methods (ARQ) and passive methods (FEC, interleaving, joint coding); and server-based error concealment, comprising feature-reconstruction EC (insertion, interpolation, statistical), soft-feature decoding and ASR-decoder EC.

DSR applications have strict end-to-end delay constraints and are generally requested to operate using the RTP protocol (Schulzrinne et al., 2003). Automatic repeat request (ARQ) is an error control mechanism in which a retransmission is requested by the server when an error is detected, resulting in a long delay. ARQ can be applied at various network protocol layers, e.g. the transport layer or the application layer. ARQ at the transport layer is generally deployed by network operators and it is in most cases not possible to turn this feature off. ARQ at the application layer, however, is not recommended for DSR, and passive recovery techniques, e.g. interleaving, become the preferred client-based methods for correcting errors. However, ASR can tolerate a certain amount of distortion in the speech features, so server-based EC techniques such as feature-reconstruction and ASR-decoder EC are applicable for concealing any remaining errors. Before any EC techniques can be implemented, error detection methods should first be applied to reveal whether and where a transmission error has been introduced (Wang and Zhu, 1998). Detection can be realised either by adding redundancy at the client-side or by exploiting the redundancy inherent in the signal itself purely at the server side. These techniques are thoroughly reviewed in the following sections.

4. Error detection

Two types of error detection methods exist: those exploiting redundancy added by channel coding and those exploiting the redundancy of the speech features themselves.

In channel coding, redundant bits are added to the data being transmitted and these are subsequently used by the decoder to determine whether or not errors have been introduced into the data. Two classes of such techniques are used: parity check and block check. Parity check is a simple character-based error detection method that is seldom used today for reliable communications. Two of the most commonly used block check methods are checksum and CRC. In the block check methodology, data are segmented into blocks and an additional check block is appended to each data block at the client. At the server side, the check block information is used to identify whether or not there is an error in the data block. When an error is detected, EC is conducted rather than a retransmission.

For circuit switched networks, the ETSI–DSR standards (Pearce, 2004) apply CRC as the major error detection scheme, together with an additional scheme in which a data consistency test exploits the characteristics of the speech features. For packet switched networks, the CRC is still kept as part of the payload for two reasons. Firstly, this enables interoperability with circuit switched networks, as a circuit switched network could potentially be connected to a network gateway that encodes the ETSI–DSR features into the RTP payload. If the features are sent in the RTP payload over a packet switched network, the RTP header information is then used for error detection. Secondly, in the Internet Protocol version 4 (IPv4), IP packets are dropped if any error is detected in the payload, in contrast to the Internet Protocol version 6 (IPv6), in which it may be possible to transport RTP payloads that contain errors, thus making the CRC internal to the ETSI–DSR payload useful for error detection.

In general, error detection by adding header information and/or FEC codes at the client side is more reliable than error detection exploiting redundancy in the signal at the server, although at the cost of additional bandwidth (Wang and Zhu, 1998).

Tan et al. (2004c) raise the problem of the size (measured in bits) of a data block for error detection and concealment. Error rates corresponding to the size of a data block are calculated as a function of the bit error rate (BER) of random errors according to the following formula:

Error Rate = 1 − (1 − BER)^bits        (1)

where bits is the number of bits in the data block.

The formula shows that for a given BER value, the smaller the number of bits, the lower the error rate of the data block. This has motivated the introduction of two methods: one-frame based error protection (Tan and Dalsgaard, 2002) and sub-vector based error detection and concealment (Tan et al., 2004a), which are channel coding based and feature characteristic based methods, respectively.
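The effect of Eq. (1) can be sketched numerically. The function below is a direct transcription of the formula; the two block sizes are those quoted later for the ETSI frame-pair (92 bits) and one-frame (48 bits) schemes:

```python
def block_error_rate(ber: float, bits: int) -> float:
    """Eq. (1): probability that a block of `bits` bits contains
    at least one error, given a random bit error rate `ber`."""
    return 1.0 - (1.0 - ber) ** bits

# At 1% random BER, a 92-bit frame-pair is in error far more often
# than a 48-bit one-frame (roughly 60% versus 38%).
frame_pair = block_error_rate(0.01, 92)
one_frame = block_error_rate(0.01, 48)
```

This illustrates why shrinking the protected block lowers the block error rate at a fixed BER.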

4.1. Frame-pair versus one-frame

Within the ETSI–DSR standards, two quantized frames are grouped together and protected with a 4-bit CRC block, together forming a 92-bit frame-pair. This method results in the entire frame-pair being labelled erroneous even if only a single bit error occurs in the frame-pair packet. To overcome this, a one-frame based error protection scheme was deployed to protect each frame by its own 4-bit CRC block, which together generates a 48-bit one-frame (Tan and Dalsgaard, 2002; Tan et al., 2003a). The one-frame scheme causes the overall probability of one frame in error to be lower, as shown in Fig. 5 (at the cost of only a marginal increase in bit-rate, from 4800 bps to 5000 bps). The data in Fig. 5(a) are obtained by applying Eq. (1) with a range of BER values of bit errors with Gaussian distribution, and the results in Fig. 5(b) come from calculations on the basis of the GSM EPs, where errors occur in bursts.

Fig. 5. Percentage error rates of frame-pair, one-frame (vector) and sub-vectors versus different channels. (a) Error rates versus random BER values. (b) Error rates versus GSM EPs.

4.2. Vector versus sub-vector

As one feature vector consists of seven sub-vectors, the error rates of the low-bit sub-vectors are significantly lower than those of both frame-pair and one-frame, as shown in Fig. 5. As compared to 92 bits for the frame-pair and 48 bits for the one-frame, sub-vector1 (corresponding to [ci, ci+1], i = 1, 3, …, 11) and sub-vector2 (corresponding to [c0, logE]) are represented by 6 bits and 8 bits, respectively. However, since no channel coding based error detection is applied at the sub-vector level, error detection at this level can only make use of feature characteristics, e.g. by a data consistency test on each pair of the sub-vectors. This is realistic due to the temporal correlation between speech features in consecutive frames, caused partly by vocal tract inertia and partly by the overlapping in the feature-extraction procedure.

Given that n denotes the frame number and V the feature vector, each vector is formatted as

V^n = [c_1^n, c_2^n, …, c_12^n, c_0^n, logE^n]^T
    = [[c_1^n, c_2^n], …, [c_11^n, c_12^n], [c_0^n, logE^n]]^T
    = [[S_0^n]^T, [S_1^n]^T, …, [S_6^n]^T]^T        (2)

where S_j^n (j = 0, 1, …, 6) denotes the jth sub-vector in frame n (ETSI ES 201 108).

The consistency test is conducted across consecutive frame-pair vectors [V^n, V^{n+1}] such that each sub-vector S_j^n from V^n is compared with its corresponding sub-vector S_j^{n+1} from V^{n+1}. If any of the two decoded features in a feature-pair sub-vector does not possess a minimal continuity criterion, the sub-vector is classified as inconsistent. Specifically, both sub-vectors S_j^n and S_j^{n+1} in a frame-pair are classified as inconsistent if

(d(S_j^{n+1}(0), S_j^n(0)) > T_j(0)) or (d(S_j^{n+1}(1), S_j^n(1)) > T_j(1))        (3)

where d(x, y) = |x − y|, and S_j^n(0), S_j^{n+1}(0) and S_j^n(1), S_j^{n+1}(1) are the first and second elements, respectively, in the feature-pair sub-vectors S_j^n and S_j^{n+1} as given in (2); otherwise, they are classified as consistent. The thresholds T_j(0) and T_j(1) are constants given on the basis of measuring the statistics of error-free speech features.

This test generates a consistency matrix that discriminates between consistent and inconsistent sub-vectors. Inconsistent sub-vectors are replaced by their nearest neighbouring consistent sub-vectors whereas consistent sub-vectors are kept unchanged (Tan et al., 2004a).
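A minimal sketch of the consistency test of Eq. (3) and the replacement step might look as follows. The threshold values are placeholders (the real constants are measured on error-free features), and "nearest consistent neighbour" is simplified to the previous frame:

```python
def consistency_flags(v_n, v_n1, thresholds):
    """Eq. (3) per sub-vector pair: a pair is consistent only if both
    elements change by no more than their thresholds between frames.
    v_n, v_n1: lists of 7 sub-vectors, each a (first, second) tuple."""
    flags = []
    for j in range(7):
        d0 = abs(v_n1[j][0] - v_n[j][0])
        d1 = abs(v_n1[j][1] - v_n[j][1])
        flags.append(d0 <= thresholds[j][0] and d1 <= thresholds[j][1])
    return flags

def conceal(frames, flags_per_frame):
    """Replace inconsistent sub-vectors by a nearby consistent one
    (simplified here to the previous frame's sub-vector); consistent
    sub-vectors are kept unchanged."""
    out = [list(f) for f in frames]
    for t in range(1, len(out)):
        for j in range(7):
            if not flags_per_frame[t][j]:
                out[t][j] = out[t - 1][j]
    return out
```

In the paper's scheme the replacement may use either the preceding or the following consistent sub-vector, whichever is nearer; the sketch keeps only the backward direction for brevity.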

5. Error recovery—client-based techniques

Aimed at lossless repair, error recovery techniques require the participation of the client, including both source and channel coding. Channel coding such as FEC plays an important role in error recovery. On one hand, the processes of both FEC and EC rely on redundant information: FEC techniques add redundancy to the speech signal using a channel code, whereas EC techniques exploit the redundancy in the signal itself. On the other hand, source coding removes redundancy from the speech signal to obtain a high compression rate, which hinders the recovery and concealment of errors. One solution is to deliberately keep some redundancy in the coded signal to enable better error recovery and concealment (Wang and Zhu, 1998), termed error-resistant source coding. Another solution is to jointly design the source and channel coder. In a broader sense, layered coding (LC) and multiple description coding (MDC) fall into both solutions.

Interleaving is a method that attempts to rearrange burst errors into a set of random errors, thereby enabling FEC and EC to become more effective.

5.1. Forward error correction

In using FEC techniques, redundant information is transmitted along with the original data to allow the server to detect and correct errors in the data without any reference to the client (Carle and Biersack, 1997). Two classes of techniques exist: block encoding and convolutional encoding (Bossert, 2000; Sklar and Harris, 2004). Block coding encodes a block of k information bits into a block of n coded bits for transmission, and the codes are thus referred to as (n, k) codes. The (n − k) redundant bits determine the error correction capability of the code. Commonly used block codes are Hamming codes, Golay codes, BCH codes and Reed–Solomon codes. Convolutional coding considers the entire stream of data as one single codeword. As a result, the encoded data depend not only on the current bits but also on the previous bits.
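As a concrete instance of an (n, k) block code, a Hamming (7,4) encoder/decoder is sketched below; it corrects any single bit error per 7-bit codeword. This is an illustration of the principle only, not one of the particular codes used in the cited DSR systems:

```python
def hamming74_encode(d):
    """Encode 4 data bits into a (7,4) Hamming codeword
    [p1, p2, d1, p3, d2, d3, d4] with even parity."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one bit error and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3   # 0 = no error, else 1-based error position
    if pos:
        c[pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

The syndrome (s1, s2, s3) directly encodes the position of a single flipped bit, which is what makes the correction step a single table-free lookup.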

A convolutional code is used in combination with unequal error protection (UEP) to protect MFCC features in (Potamianos and Weerackody, 2001). However, block codes are preferred over convolutional codes for DSR systems due to the independence between blocks, smaller delay and lower complexity. The ETSI–DSR standard applies a Golay code to protect the most important information in the data stream, e.g. the header information. Bernard and Alwan (2002) use linear block codes mainly for error detection, with a limited capacity to conduct bit error correction. Boulis et al. (2002) apply Reed–Solomon codes to achieve graceful degradation of ASR performance over packet-erasure networks. Hsu and Lee (2004) introduce BCH coding into DSR.

5.2. Multiple description coding and layered coding

Unlike most conventional coders, both MDC and LC encode a source into two or more sub-streams that can be delivered on separate channels in order to exploit channel diversity. MDC encodes the signal source into sub-streams (also called descriptions) of equal importance, in the sense that each description can independently reproduce the original signal at some basic quality (Goyal, 2001). The quality incrementally increases when more descriptions are received. In contrast, LC generates one base layer stream and several enhancement layer streams. The base layer stream is the most important and can provide a basic level of quality. The enhancement layer streams can refine the quality of the signal reconstructed from the base layer stream but are useless on their own. According to (Wang et al., 2002), MDC is more effective for applications with strict delay constraints. On the other hand, LC is a good choice when retransmission of the base layer or UEP over different channels is feasible.

Kim and Kleijn (2004) observed that MDC generally outperforms Reed–Solomon based FEC. Zhong et al. (2002) compare the popular G.729 standard against a number of MDC schemes for recognition of VoIP speech and demonstrate the superior performance of the MDC schemes. Aimed at scalable DSR, Srinivasamurthy et al. (2001) present a layered scheme encompassing two layers: the base layer encodes speech using a coarse DPCM (differential pulse code modulation) loop while the enhancement layer encodes the quantization error introduced by the coarse DPCM loop.

5.3. Joint source and channel coding

According to the well-known source–channel separation theorem proposed by Shannon (1948), the source and channel coder can be designed separately without loss of optimality. The underlying assumptions are stationary sources and channels and unlimited complexity and processing delay of the source and channel coder. Since these assumptions do not hold for most real-world applications, there is a need for joint design of source and channel coding that exploits the characteristics of the source or source coder to provide better error protection.

In (Weerackody et al., 2001), UEP is applied to speech data by partitioning the data bit stream into classes of different error sensitivity. Riskin et al. (2001) introduce an unequal loss protection (ULP) algorithm to assign unequal amounts of FEC to different sub-vectors to minimise WER.

5.4. Interleaving

FEC and EC techniques are efficient in counteracting errors randomly distributed in the data stream but fail to manage burst errors. Interleaving techniques have therefore been broadly applied in communication systems as an optional addition to error correction codes to counteract the effect of burst errors, at the cost of delay. On the basis of interleaving, channel coding is confronted with only a set of random errors converted from burst errors.

Specifically, interleaving is a method that rearranges the ordering of a sequence of code symbols in order to spread burst errors over multiple codewords for efficient error recovery and concealment (Ramsey, 1970). At the server, the counterpart, de-interleaving, restores the reordered sequence to its original order. A common way to implement interleaving is to divide symbol sequences into blocks corresponding to a two-dimensional array, and to read symbols in by rows and out by columns. Extensive work has been done by James and Milner (2004) to deploy interleaving in DSR.
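The row-in/column-out block interleaver just described can be sketched in a few lines. This is a simplified illustration of the principle, not the specific interleaver of James and Milner:

```python
def interleave(symbols, rows, cols):
    """Write a rows x cols block row by row, read it out column by column."""
    assert len(symbols) == rows * cols
    return [symbols[r * cols + c] for c in range(cols) for r in range(rows)]

def deinterleave(symbols, rows, cols):
    """De-interleaving is interleaving with the dimensions swapped."""
    return interleave(symbols, cols, rows)
```

A burst of consecutive errors in the transmitted (interleaved) stream falls within one column of the block and therefore lands in different rows, i.e. is spread over separated positions of the restored sequence.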

5.5. Discussion of client-based techniques

Although the principles of the client-based techniques reviewed in this section differ, they share a common attribute, namely the participation of the client with the aim of exploiting the characteristics of channels and signals. The deployment of client-based techniques is always a trade-off between the achieved performance and the required resources. For example, FEC trades bandwidth for redundancy, MDC trades multiple channels for uncorrelated transmission errors among descriptions, and interleaving trades delay for a random distribution of errors. Therefore, the employment of client-based techniques is highly dependent on networks and applications. One disadvantage of client-based techniques is their weak compatibility. Further discussions and comparisons are presented in Section 7.

6. Error concealment—server-based techniques

Lossless error recovery is required in data transmission, as even a single bit error may cause the entire data block to be discarded. In contrast, a certain amount of distortion in the speech features can be tolerated by the ASR-decoder. This fact makes EC a feasible method to complement client-based error recovery techniques: it mitigates the effect of remaining transmission errors without requesting a retransmission. EC generally exploits the strong temporal correlation residing in speech features and utilises statistical information about speech.

An EC scheme often relies on a set of error-free features received before and/or after erroneous or lost features, giving either one- or two-sided EC schemes. In real-time speech and audio streaming, one-sided schemes are often used since discontinuities in a re-synthesised signal may perceptually annoy the user, especially during burst intervals. In DSR applications, however, the discontinuities do not disturb the ASR engine, although waiting for subsequent features may increase the end-to-end delay. Therefore, two-sided EC techniques are favoured in DSR because of their significantly superior performance.

The aim of EC is in general to create a substitution for a lost or erroneous packet that is as close to the original as possible. This type of concealment technique is termed feature-reconstruction EC. In applications like voice transmission, the source and the sink of the communication channel are the voice and the human ear, respectively. In the context of DSR, however, the source is the speech features and the sink is the ASR-decoder. Therefore, EC may be conducted during the recognition decoding process as well, which is unique to DSR. Specifically, the ASR-decoder may be modified to handle degradations introduced by transmission errors, termed ASR-decoder EC.

6.1. Insertion-based techniques

Insertion-based EC techniques refer to a class of simple techniques that reconstruct lost packets without taking the signal characteristics into consideration (Perkins et al., 1998). An erroneous frame is substituted by inserting silence, noise, an estimated value (for example a mean value over training data), or a repetition of a neighbouring frame.

In applying splicing, a number of consecutive erroneous frames are simply dropped. A side effect of employing splicing is a decrease in the Viterbi decoding time caused by the shorter feature stream (Kim and Cox, 2001). Boulis et al. (2002) reported that the mean-value substitution (estimated over all training data) outperforms splicing and silence substitution.

The ETSI–DSR standard applies a repetition scheme in which the first half of a series of erroneous frames is replaced with a copy of the last correct frame before the error and the second half with a copy of the first correct frame following the error.
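The repetition rule can be sketched as follows, a simplified illustration that assumes every burst of errors is bounded by correctly received frames and treats each frame as an opaque value:

```python
def repetition_ec(frames, errors):
    """ETSI-style repetition EC (sketch): within a burst of erroneous
    frames, the first half is replaced by the last correct frame before
    the burst and the second half by the first correct frame after it.
    Assumes each burst is bounded by correctly received frames."""
    out = list(frames)
    i, n = 0, len(frames)
    while i < n:
        if errors[i]:
            j = i
            while j < n and errors[j]:
                j += 1                      # burst spans frames [i, j)
            mid = i + (j - i + 1) // 2      # first half gets the earlier frame
            for k in range(i, mid):
                out[k] = frames[i - 1]
            for k in range(mid, j):
                out[k] = frames[j]
            i = j
        else:
            i += 1
    return out
```

For an odd-length burst the split of "first half" and "second half" is a convention; the sketch assigns the middle frame to the earlier side.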

In the partial splicing scheme presented in (Tan et al., 2003b), erroneous frames are partly substituted by a repetition of neighbouring frames and partly by splicing. It can be shown that partial splicing under certain assumptions is equivalent to a weighted Viterbi decoding algorithm.

On the basis of error detection at the sub-vector level as presented in Section 4.2, the sub-vector EC (Tan et al., 2004) can be considered a repetition EC at the sub-vector level.

6.2. Interpolation-based techniques

Interpolation accounts for the changing characteristics of the signal and particularly exploits the temporal correlation present in the speech feature stream to aid the reconstruction of speech features.

The most commonly used interpolation technique applies a polynomial interpolation as an estimate of the erroneous frames (Milner and Semnani, 2000). For DSR applications, repetition has been experimentally shown to perform better than linear interpolation (Pearce, 2004; Peinado et al., 2003; Tan et al., 2003b). Tan et al. (2004b) further conduct a comparative study aimed at revealing the causes of this, which confirms a difference between traditional signal-reconstruction and feature-reconstruction for ASR. James and Milner (2004) propose to use cubic interpolation instead of linear interpolation, which shows better performance than both repetition and linear interpolation.
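For contrast with repetition, a linear-interpolation variant can be sketched as below, again assuming each burst is bounded by correct frames and, for brevity, treating features as scalars (a real front-end would interpolate each vector component):

```python
def linear_interp_ec(frames, errors):
    """Linear interpolation EC (sketch): erroneous frames are replaced by
    values interpolated between the last correct frame before a burst and
    the first correct frame after it."""
    out = list(frames)
    i, n = 0, len(frames)
    while i < n:
        if errors[i]:
            j = i
            while j < n and errors[j]:
                j += 1                      # burst spans frames [i, j)
            left, right = frames[i - 1], frames[j]
            span = j - i + 1
            for k in range(i, j):
                w = (k - i + 1) / span      # weight grows towards the right edge
                out[k] = (1.0 - w) * left + w * right
            i = j
        else:
            i += 1
    return out
```

The sketch makes the contrast with repetition concrete: interpolation smooths across the burst, whereas repetition holds the boundary values, and, as noted above, the smoother trajectory does not necessarily help the recogniser.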

6.3. Statistical-based techniques

Neither insertion- nor interpolation-based techniques use a priori information about the speech features, though the mean-value substitution replaces lost packets by the mean estimated over all training data. A number of recently developed techniques deploy statistical knowledge about the speech source by introducing such knowledge into the concealment process.

Statistical-based techniques use a priori knowledge of the full data to estimate the missing data (Ramakrishana, 2000). In particular, the maximum a posteriori (MAP) estimation is a technique that estimates the missing data with the aim of maximising their likelihood conditioned on the observed data and the distribution of the full data. On the basis of the MAP technique, James et al. (2004) reconstruct lost speech vectors by employing both the correctly received vectors and statistical information such as mean and variance calculated from a set of training utterances (assuming a Gaussian distribution). The method outperforms cubic interpolation, in particular when the packet loss rate is high. MAP estimation, however, is in general computationally expensive due to its need to invert large covariance matrices.
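Under a Gaussian model the MAP estimate of the missing components coincides with the conditional mean. A minimal sketch with a hypothetical low-dimensional feature (the mean and covariance would in practice be estimated from training utterances):

```python
import numpy as np

def map_gaussian_fill(x, missing, mu, Sigma):
    """MAP/conditional-mean estimate of the missing elements of a jointly
    Gaussian vector: E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o).
    x: observed vector (values at missing positions are ignored).
    missing: boolean mask of missing positions."""
    m = np.asarray(missing)
    o = ~m
    S_oo = Sigma[np.ix_(o, o)]
    S_mo = Sigma[np.ix_(m, o)]
    est = mu[m] + S_mo @ np.linalg.solve(S_oo, x[o] - mu[o])
    out = x.copy()
    out[m] = est
    return out
```

The `np.linalg.solve` call is where the cost noted above appears: the observed-block covariance must effectively be inverted, which becomes expensive for long loss bursts.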

In (Gomez et al., 2003), a data-source model mitigation technique is presented for DSR over lossy packet channels. The technique models the data-source through transition probabilities from one sequence of quantized sub-vectors to another. For example, in the first-order data-source model, comprehensive tables containing every combination of two split VQ indices are built for forward estimation and for backward estimation, respectively. Tables for forward estimation are constructed by searching each index in the training database and averaging the sequences of indices following it, while tables for backward estimation are built by searching each index in the training database and averaging the sequences of indices preceding it. Sequences in lost packets are reconstructed by means of the trained data-source model and the received sequences preceding and following the lost packets. Performance improvement has been observed, particularly for high packet loss rates, as compared to repetition EC. Similarly, Lee et al. (2004) proposed an N-gram model approach for packet loss concealment in a VoIP application. This work showed that trigram predictive models consistently outperform the repetition-based method in terms of distortion. Both of the above referenced techniques have low computational cost but high memory requirements.

The MAP estimation has a high computational complexity, whilst the memory requirements of the data-source technique are high. Gomez et al. (2004) combine the data-source model technique and the MAP estimation technique to offer a trade-off between memory and computational resource requirements.

6.4. Soft-feature decoding based techniques

In wireless communications, a number of studies exploit information about the reliability of the received bits. Specific channels and channel coding/decoding algorithms are often specified so that soft-decision channel decoding is applicable (Bernard, 2002; Haeb-Umbach and Ion, 2004; Peinado et al., 2003; Potamianos and Weerackody, 2001). The reliability information is used either for feature-reconstruction or in combination with weighted Viterbi decoding, which takes this information into account during the ASR-decoding process.

A minimum mean square error (MMSE) estimation and hidden Markov model (HMM) based EC are proposed in (Peinado et al., 2001, 2003, 2005). Previously applied for EC in speech coding (Fingscheidt and Vary, 2001), MMSE estimation models the speech source as a Markov process to exploit the correlation between consecutive frames. Peinado et al. (2003) first deploy MMSE based on soft-feature decoding. Significant further improvement is then obtained by considering previously received vectors in the estimation of the current feature vector, resulting in the HMM-based EC.

6.5. ASR-decoder based techniques

Speech features are, after transmission over networks, subject to being missing or unreliable. In addition to the reconstruction of these features, the ASR-decoder can provide complementary means to handle the degradation of the speech features by integrating the reliability of the channel-decoded features into the recognition process. Two well-known noise-robustness techniques match this purpose: marginalisation used in missing data theory (Cooke et al., 2001) and soft-decoding (weighted Viterbi decoding when HMMs are applied) (Yoma et al., 1998). As compared with ASR in noisy environments, where identifying the reliability of spectral features is difficult, the advantages in this context are that the missing features, or the reliability of each feature, are known from the channel decoding (Potamianos and Weerackody, 2001), and that this information is available in the cepstral domain.

In (Endo et al., 2003), missing data theory is applied such that erroneous features make constant contributions to the Viterbi decoding, with the aim of neutralising these features. James et al. (2004a) show that the missing data technique is superior to a number of feature-reconstruction methods. Although neither marginalisation nor splicing utilises unreliable features, marginalisation preserves the time information (HMM state transitions remain possible in the intervals of unreliable features), so that much better performance is obtained (Bernard, 2002).

In weighted Viterbi decoding, exponential weighting factors are introduced into the calculation of the likelihood based on the probability of the speech observations, such that the contributions made by observation probabilities are decreased or neutralised if the features are estimated from erroneous frames.

The weighting factor may be computed either from the reliability measure of the received bits, when available, or from an estimated value when hard-decision channel coding is applied. The first method requires the evaluation of the reliability of the decoded feature from soft-decision channel coding, i.e. assuming a known bit probability (Bernard and Alwan, 2002; Weerackody et al., 2002). The second method is applicable to a wider range of channels, including channels characterised by packet loss, so that the range of channel conditions is extended from wireless to IP-based (Bernard and Alwan, 2002). Cardenal-Lopez et al. (2004) compare a constant weighting factor with a time-varying weighting factor to cope with the fact that the longer the burst, the less effective the repetition technique.
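A minimal weighted Viterbi sketch follows, in which each frame's emission log-likelihood is scaled by a reliability weight gamma_t (gamma_t = 1 trusts the frame fully, gamma_t = 0 neutralises it). The model quantities are illustrative placeholders, not those of any cited system:

```python
import math

def weighted_viterbi_score(log_trans, log_emit, gammas):
    """Best-path log-score with per-frame emission weights.
    log_trans[i][j]: log transition probability from state i to state j.
    log_emit[t][j]:  log observation likelihood of state j at frame t.
    gammas[t]:       reliability weight in [0, 1] for frame t."""
    n = len(log_trans)
    delta = [gammas[0] * log_emit[0][j] for j in range(n)]
    for t in range(1, len(log_emit)):
        delta = [
            max(delta[i] + log_trans[i][j] for i in range(n))
            + gammas[t] * log_emit[t][j]
            for j in range(n)
        ]
    return max(delta)
```

With all weights set to zero the emission terms drop out and only transition information constrains the path, which is the neutralisation limit described above; intermediate weights discount unreliable frames gradually.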

6.6. Discussion of server-based techniques

The following may be concluded about the five classes of server-based EC techniques reviewed in this section. One of the advantages of server-based EC is that no modification of the client side of DSR is required, signifying compatibility with the existing ETSI–DSR standards. Insertion- and interpolation-based techniques are traditional techniques widely used in many applications such as audio and video transmission. Among them, repetition EC has shown good performance at low complexity. Statistical-based techniques take advantage of a priori knowledge of the speech features and show slightly better performance than repetition, however at the expense of either high computational cost or high memory requirements. Soft-feature decoding based techniques achieve the highest performance but generally at a high computational cost. Finally, ASR-decoder based techniques are unique to DSR and can be applied in combination with other EC techniques. Detailed performance comparisons and discussions are presented in the next section.

7. Performance evaluation

In the literature, various techniques have been developed but evaluated on a number of different speech databases and channel simulations, which makes the comparison of these techniques difficult. In this evaluation, the same database and the same channel condition are used in order to effectively compare the performance of some of the techniques outlined above.

7.1. Experimental settings

The Aurora 2 database (Pearce and Hirsch, 2000) has been selected for this purpose. The database is the TI digit database artificially distorted by adding noise and using a simulated channel distortion. Whole-word models are created for all digits using the HTK recogniser. Each of the digit whole-word models has 16 HMM states with three Gaussian mixtures per state. The silence model has three HMM states with six Gaussian mixtures per state. A one-state short pause model is tied to the second state of the silence model. The word models used in the experiments are trained on clean speech; no acoustic noise is added to the test data, and transmission errors are the only cause of the decrease in server-side recognition performance.

Fig. 6. WER (%) across the error-robustness techniques for EP3 for Test Set A (No CRC, Interpolation, Repetition, RS(32,16), One-frame, Sub-vector, Interleaving12, H-MAP, Interleaving24, H-FBMMSE, MDC, Error-free).

The three GSM EPs are commonly used for the error-robustness evaluation of speech coding algorithms and DSR schemes. As EP1 and EP2 generally do not cause noticeable performance degradation (Tan et al., 2003a), EP3 is specifically chosen for this evaluation. It should be noted that the experiments conducted here use this error pattern for circuit-switched GSM channels. In the case of packet-switched channels such as GSM GPRS (General Packet Radio Service), the lower levels of the protocol stack allow for retransmission, and transmission errors will occur as packet loss and at a lower rate.

7.2. Experimental results and discussions

A number of techniques are tested, ranging from client-based to server-based techniques. The tested client-based techniques include Reed–Solomon based FEC, interleaving, MDC and the one-frame scheme presented in Section 4.1, all of which are used in combination with repetition EC. The implemented Reed–Solomon code is RS(32,16) with 8-bit symbols, where 16 information symbols are encoded into 32 coded symbols, giving the capability of correcting 8 symbol errors or 16 symbol erasures in the code word. Two interleaving schemes are applied: Interleaving12, in which a sequence of 12 vectors is grouped into one block, and Interleaving24, where a sequence of 24 vectors is grouped. Interleaving is implemented simply by reading odd-numbered features first and even-numbered features second from the blocks. As a result, Interleaving12 has a maximum delay of 5 vectors or 50 ms (with a 10 ms frame shift) and Interleaving24 has a 110 ms maximum delay. In applying MDC, two descriptions are generated, namely the odd-numbered and the even-numbered. Each description is encoded at 2600 bps, comprising 2200 bps speech data, 200 bps header information and 200 bps CRC information. The two description encodings are transmitted over two uncorrelated channels, both of which are simulated by EP3.
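The odd/even block interleaver described above can be sketched as follows. This is a minimal illustration of the idea, not the implementation used in the experiments; the function names are illustrative:

```python
def interleave(block):
    """Odd/even block interleaver: odd-numbered vectors (1-based, i.e.
    Python indices 0, 2, 4, ...) are read out first, then the
    even-numbered ones."""
    return block[0::2] + block[1::2]

def deinterleave(received):
    """Invert the interleaver for a block of known length."""
    half = (len(received) + 1) // 2
    block = [None] * len(received)
    block[0::2] = received[:half]
    block[1::2] = received[half:]
    return block

block = list(range(12))  # one Interleaving12 block of 12 feature vectors
assert interleave(block) == [0, 2, 4, 6, 8, 10, 1, 3, 5, 7, 9, 11]
assert deinterleave(interleave(block)) == block
```

A burst hitting consecutive transmitted vectors is thereby spread over alternating positions in the original block, so each lost vector retains error-free neighbours for concealment.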

The evaluated feature-reconstruction techniques encompass repetition (Aurora baseline), linear interpolation, splicing and sub-vector EC. The scheme without CRC error detection (No CRC) is also evaluated. In this case, transmission errors remain in the speech features and are passed through to the ASR-decoder. The result for error-free transmission is shown as well. The results for statistical-based techniques are cited from (Peinado et al., 2003), in which H-FBMMSE and H-MAP represent forward–backward MMSE with hard decisions and MAP with hard decisions, respectively. The detailed WER results for Test Set A are shown in Fig. 6.
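The repetition and linear-interpolation baselines can be sketched as follows. This is a simplified sketch assuming at least one error-free frame per utterance; the burst-splitting rule mirrors the Aurora-style nearest-neighbour repetition, but the function name and interface are illustrative assumptions, not the ETSI implementation:

```python
import numpy as np

def conceal(features, bad, mode="repetition"):
    """Feature-reconstruction EC. features: (T,) or (T, D) array;
    bad: boolean mask marking erroneous frames. 'repetition' copies
    the nearer good neighbour; 'interpolation' interpolates linearly
    between the good frames bounding each burst."""
    out = features.astype(float).copy()
    T = len(features)
    t = 0
    while t < T:
        if not bad[t]:
            t += 1
            continue
        start = t
        while t < T and bad[t]:
            t += 1
        end = t                  # erroneous burst spans frames [start, end)
        left = start - 1         # last good frame before the burst (may be -1)
        for i in range(start, end):
            if left < 0:                 # burst at sequence start
                out[i] = out[end]
            elif end >= T:               # burst at sequence end
                out[i] = out[left]
            elif mode == "repetition":
                # nearer good neighbour wins (Aurora-style repetition)
                out[i] = out[left] if (i - left) <= (end - i) else out[end]
            else:                        # linear interpolation
                w = (i - left) / (end - left)
                out[i] = (1 - w) * out[left] + w * out[end]
    return out
```

For a burst of two frames between good frames valued 0 and 3, repetition yields 0 and 3 while interpolation yields 1 and 2, illustrating why interpolation can over-smooth trajectories during long bursts.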

A more detailed comparison is presented in Table 1 in terms of WER, bandwidth requirement and computational complexity. It is observed that the performance when applying MDC approaches that of the error-free channel, but at the additional requirement of multiple channels. H-FBMMSE and H-MAP both provide very low WER values, but at the cost of very high computational complexity, indicating that they may not be applicable to real-time applications without a reduction in computation. The interleaving schemes achieve good performance, however at the expense of added delay. Compared to the above techniques, sub-vector EC gives lower performance, but it introduces neither extra complexity nor extra resource requirements. The one-frame scheme shows performance superior to the Aurora frame-pair scheme at only a marginal increase in bandwidth. RS(32,16) gives a performance close to the one-frame scheme but requires more bandwidth and higher computation. It is noticed that Reed–Solomon based FEC aims at correcting errors, whilst the one-frame scheme merely increases the capability of detecting errors. This confirms that for DSR applications, channel coding should focus on error detection rather than error correction, as also observed in (Bernard and Alwan, 2002).

Table 1
Performance comparison of some error-robustness techniques for EP3 for Test Set A

Technique              WER (%)   Bit-rate (bps)   Complexity   Compatible with ETSI–DSR standards
Splicing               24.00     4800             Low          Yes
No CRC                  8.88     4600             Low          No
Linear interpolation    7.35     4800             Low          Yes
Repetition (Aurora)     6.70     4800             Low          Yes
Marginalisation         –        4800             Low          Yes
Weighted Viterbi        –        4800             Low          Yes
RS(32,16)               3.45     9600             High         No
One-frame               3.41     5000             Low          No
Sub-vector              2.65     4800             Low          Yes
Interleaving12          2.43     4800             Low          No
H-MAP                   1.91     4800             High         Yes
Interleaving24          1.74     4800             Low          No
H-FBMMSE                1.34     4800             High         Yes
MDC                     1.04     5200             Low          No
Error-free              0.95     4800             –            –

With regard to ASR-decoder EC, Endo et al. (2003) show that marginalisation offers performance superior to linear interpolation. Bernard (2002) demonstrates that repetition is superior to marginalisation for random packet-loss conditions, while marginalisation may outperform repetition when the average burst lengths are large. Repetition in combination with weighted Viterbi gives better performance than both repetition alone and marginalisation for all conditions. On the basis of these references only, marginalisation and a weighted Viterbi technique are included and ranked in Table 1 for performance comparison.

The experiments show that linear interpolation gives lower ASR performance than repetition, as discussed in Section 6.2. In the case of No CRC, no compensation (i.e. EC) is conducted, so that erroneous features are fed directly to the ASR-decoder, resulting in reduced ASR performance. Splicing as described in Section 6.1 (which is equivalent to no compensation when packet losses occur) gives the lowest performance and is therefore not an applicable technique. It should be noted that the client-based techniques, including Reed–Solomon, one-frame, No CRC, interleaving and MDC, are not compatible with the existing ETSI–DSR standards, while all other techniques compared in Table 1 are server-based and can therefore be used at the server side without requiring modifications of the standards.

It is generally verified by the experiments that ASR performance can be improved by introducing a number of error recovery and concealment techniques. Depending on the techniques applied, for transmission over severely error-prone channels, as demonstrated by applying EP3, the degradation in recognition performance is still a matter of concern compared to the baseline performance with no transmission errors. The following section focuses on the design of a potential method by which it is possible to modify the processing of the recogniser to further raise the overall performance of the ASR system.

8. Recogniser adaptation to transmission quality

The techniques reviewed in the preceding sections are designed with the goal of maintaining maximal ASR performance. This section presents research on adapting the server-side recogniser to the highly varying quality of the network in order to further optimise the overall performance.

One example of such optimisation is the introduction of frame-error-rate (FER) based out-of-vocabulary (OOV) detection (Tan et al., 2003a). The method is based on the observation that transmission errors influence the acoustic likelihood and thus affect the optimal threshold setting for discrimination between in-vocabulary words and OOV words.

In this section experiments are based on the basic ETSI–DSR standard. The database selected for both training and testing is the Danish SpeechDat 2 database DA-FDB 4000, which covers speech from 4000 Danish speakers collected over the fixed network. A part of the database is used for the training of tri-phone models and one filler model, each having three HMM states, and each state having a mixture of 32 Gaussians. The independent test data consist of isolated Danish digits (11 words, including two different pronunciations of the Danish digit '1') used as in-vocabulary words and city names (449 words) used as OOV words. The recogniser applied is the HTK-based SpeechDat/COST249 reference recogniser (Lindberg et al., 2000). The baseline WER (no transmission errors) for the Danish digits is 0.2%.

Fig. 7. PDFs of the log-likelihood ratios for in-vocabulary and OOV words.

8.1. Effect of errors on likelihood ratio distribution

OOV detection is a statistical hypothesis testing problem in which a decision algorithm accepts or rejects a hypothesis (e.g. Rahim et al., 1997; Lleida and Rose, 2000). Given a speech signal observation sequence O, the algorithm tests the null hypothesis H0 against the alternative hypothesis H1. H0 represents one of the in-vocabulary words and H1 represents OOV words modelled by one filler model. A likelihood ratio LR(O) based on the null and alternative hypotheses is used to detect OOV words. The test rejects the H0 hypothesis if

LR(O) = p(O|H0) / p(O|H1) < T    (4)

where T is the threshold of the test, and p(O|H0) and p(O|H1) are the probabilities of the H0 and H1 hypotheses, respectively.
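In practice the test is applied in the log domain, where the ratio becomes a difference of log-likelihoods. A minimal sketch (the function name and the example log-likelihood values are illustrative assumptions):

```python
def is_oov(loglik_h0, loglik_h1, log_threshold):
    """Likelihood-ratio test of the form above: reject the
    in-vocabulary hypothesis H0 (i.e. flag the utterance as OOV) when
    log p(O|H0) - log p(O|H1) < log T."""
    return (loglik_h0 - loglik_h1) < log_threshold

# hypothetical log-likelihoods from the word and filler models
assert is_oov(-420.0, -400.0, 0.0)        # filler model wins: OOV
assert not is_oov(-380.0, -400.0, 0.0)    # word model wins: in-vocabulary
```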

Transmission errors may, however, adversely affect the likelihood ratios of both the in-vocabulary words and OOV words. Fig. 7 shows the probability density functions (PDFs) of the log-likelihood ratios of the in-vocabulary words and OOV words from the experiments for an error-free channel and a channel with a 2% BER value.

The figure shows that the occurrence of transmission errors changes the PDFs of the log-likelihood ratios in two ways. Firstly, the standard deviations of the distributions increase with increasing BER values. This has the effect of weakening the discrimination between in-vocabulary and OOV words. Secondly, the shift in the mean value of the distributions affects the optimal threshold setting for OOV detection. A fixed threshold setting, as normally used in the context of error-free transmission, may therefore fail to maintain the balance between the false rejection and false acceptance rates.

8.2. FER-dependent threshold for OOV detection

A potential way of maintaining the balance is to adjust the threshold in accordance with the FER, which is a measure of the instantaneous transmission error rate. The FER is calculated on the basis of the CRC information in the data stream. This results in a FER-dependent threshold that optimises the OOV detection.

The threshold is modelled as a fourth-order polynomial function of the FER. The FER values are calculated from the BER values according to Eq. (1). To estimate the coefficients of the polynomial, five experiments (with BER values ranging from 0.1% to 2%) were conducted using a development database consisting of 282 digit utterances and 249 city-name utterances. The thresholds for each of these experiments are chosen with the optimisation target of keeping the false rejection rate approximately constant across a range of BER values.
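The fitting step can be sketched as follows. The development-set operating points below are hypothetical placeholders, not the values from the experiments; only the structure (a degree-4 polynomial mapping FER to threshold) follows the method described above:

```python
import numpy as np

# Hypothetical development-set points: for each observed FER, the
# threshold that kept the false-rejection rate near the target.
fer_dev = np.array([0.02, 0.05, 0.10, 0.20, 0.35])
thr_dev = np.array([-1.0, -1.4, -2.0, -3.1, -4.6])

# Fit the fourth-order polynomial threshold model T(FER); five points
# determine a degree-4 polynomial exactly.
coeffs = np.polyfit(fer_dev, thr_dev, deg=4)

def threshold_for(fer):
    """Evaluate the FER-dependent OOV-detection threshold."""
    return float(np.polyval(coeffs, fer))
```

At test time, the FER estimated from the CRC error detection is simply passed to `threshold_for` before each OOV decision.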

The test data for the experiments described below are the remaining 200 digit and 200 city-name utterances from the same database. During testing, the FER is estimated using the CRC error detection and then used to adjust the threshold of the OOV detection via the fourth-order polynomial function.

Fig. 8 shows that the OOV detection algorithm using the FER-dependent threshold approximately maintains the false rejection rate of in-vocabulary words within the range from 4% to 6%, whereas the false rejection rate using a fixed threshold varies widely, from 4.5% to 20%. The experiments were targeted at a false rejection rate of 5%.

Fig. 8. False rejection rate versus BER values.

Fig. 9. False acceptance rate versus BER values.

By maintaining an almost constant false rejection rate, the false acceptance rate increases, as shown in Fig. 9. Threshold setting is, however, in general a trade-off between false rejection and false acceptance, and therefore other design criteria (such as equal error rate requirements) could be the basis for the FER-dependent OOV detection.

The experimental results shown above demonstrate that it is feasible to adapt the behaviour of the back-end recogniser and that it is possible to further improve the overall ASR performance according to the varying quality of the network in question.

9. Conclusion

In this paper, the developments and trends of incorporating ASR technology into wireless networks have been reviewed. Emphasis has been placed on robustness techniques against transmission errors introduced by error-prone communication channels. Three classes of techniques have been presented in detail, namely error detection, client-based error recovery and server-based error concealment. Error detection can be accomplished either by adding redundancy at the client side or by exploiting the redundancy inherent in the signal itself purely at the server side. It has been pointed out that it is important to identify the proper size of a data block for error detection and the subsequent concealment.

For transmission over severely error-prone channels, deployment of client-based error recovery techniques is important in order to achieve high ASR performance. The deployment of client-based techniques will distinguish DSR systems from one another, e.g. by not being compatible with the existing ETSI–DSR standards. Based on the work conducted in this paper, the following comments apply:

• Although FEC protection is essential to e.g. the header information in a DSR stream, FEC such as a Reed–Solomon code is not effective for protecting speech features, as the overhead introduced by FEC is superfluous for error-free channels and not useful for channels with burst errors.

• Since the ASR-decoder can tolerate a certain amount of distortion in the speech features, especially if caused by independent errors, it is a significant advantage to convert burst errors into independent (random) errors. This makes the deployment of interleaving techniques attractive, although at the expense of inherent delay.

• Another technique able to circumvent burst errors is MDC. When multiple channels are available, MDC is highly recommended due to its excellent performance.

As pointed out in this review, it is not possible to recover from all transmission errors by deploying client-based techniques only. Server-based EC techniques are therefore employed for handling the remaining errors. On the basis of the experience gained from the experiments, the following overall comments are presented for server-based EC techniques:

• For circuit-switched channels where errors occur at the bit level, error-free information is potentially available within the erroneous frames (vectors) for the EC process, and it is therefore strongly recommended to exploit the remaining error-free information within the erroneous vectors. The sub-vector based EC scheme and the MMSE estimation are examples of such techniques, for which the experiments have shown performance superior to conventional techniques.

• Statistical-based techniques that utilise a priori information about the speech signal show improved performance, although at the expense of large memory requirements and/or high computational complexity.

• In addition to feature-reconstruction techniques, ASR-decoder based EC techniques are unique to DSR and provide good performance.

• The applicability of server-based EC depends on the trade-off between the achieved performance and the computational complexity. Some of these techniques may not be practical for real-time applications because of their inherent processing delay and computational cost. For future research, combinations of these techniques are relevant to pursue.
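The sub-vector principle from the first comment above can be sketched as follows. This is a simplified illustration of repairing only the damaged parts of a frame, not the algorithm of Tan et al. (2004a); the function name, the one-value-per-sub-vector representation and the nearest-neighbour repair rule are illustrative assumptions:

```python
import numpy as np

def subvector_conceal(frames, bad):
    """Sub-vector EC sketch. frames has shape (T, S), one value per
    sub-vector; bad is a boolean (T, S) mask marking individual
    erroneous sub-vectors. Each bad sub-vector is repaired from the
    nearest frame whose corresponding sub-vector is error-free, so
    intact data inside an otherwise erroneous frame is retained
    rather than discarded with the whole vector."""
    T, S = frames.shape
    out = frames.copy()
    for s in range(S):
        good = [t for t in range(T) if not bad[t, s]]
        if not good:
            continue                      # no error-free reference exists
        for t in range(T):
            if bad[t, s]:
                nearest = min(good, key=lambda g: abs(g - t))
                out[t, s] = frames[nearest, s]
    return out
```

Contrast this with vector-level repetition, which would overwrite every sub-vector of an erroneous frame, including the ones that arrived intact.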

In general, each technique has its own strengths and weaknesses, so the selection of techniques to be deployed depends on the expected channel characteristics and the system requirements.

In addition to robustness techniques, the introduction of adaptation schemes may be worthwhile to exploit due to the dynamic nature of network transmission. Adaptation may be implemented in two ways. Firstly, the schemes chosen for source coding/decoding, channel coding/decoding and EC may be adaptive in order to optimise the trade-off between the required network and computational resources and the achieved performance. Secondly, the recogniser and spoken language modules can be made adjustable to the quality of the network to obtain optimal overall performance. Preliminary experiments have verified this adaptation concept for the OOV detection task.

Acknowledgments

The authors appreciate the insightful comments and suggestions from the reviewers, which have significantly improved the presentation of the paper.

References

Acero, A., 1993. Acoustical and Environmental Robustness in Automatic Speech Recognition. Kluwer Academic Publishers, Boston.

Bernard, A., 2002. Source and channel coding for speech transmission and remote speech recognition. PhD dissertation, University of California, Los Angeles.

Bernard, A., Alwan, A., 2001. Source and channel coding for remote speech recognition over error-prone channels. In: Proc. ICASSP01, USA, May 2001.

Bernard, A., Alwan, A., 2002. Low-bitrate distributed speech recognition for packet-based and wireless communication. IEEE Trans. Speech Audio Process. 10 (8), 570–579.

Besacier, L., Bergamini, C., Vaufreydaz, D., Castelli, E., 2001. The effect of speech and audio compression on speech recognition performance. In: Proc. IEEE Multimedia Signal Processing Workshop, 2001.

Boulis, C., Ostendorf, M., Riskin, E.A., Otterson, S., 2002. Graceful degradation of speech recognition performance over packet-erasure networks. IEEE Trans. Speech Audio Process. 10 (8), 580–590.

Bossert, M., 2000. Channel Coding for Telecommunications. John Wiley & Sons.

Cardenal-Lopez, A., Docio-Fernandez, L., Garcia-Mateo, C., 2004. Soft decoding strategies for distributed speech recognition over IP networks. In: Proc. ICASSP04.

Carle, G., Biersack, E.W., 1997. Survey of error recovery techniques for IP-based audio–visual multicast applications. IEEE Network 11 (6), 24–36.

Cooke, M., Green, P., Josifovski, L., Vizinho, A., 2001. Robust automatic speech recognition with missing and unreliable acoustic data. Speech Commun. 34, 267–285.

Cox, R.V., Kamm, C.A., Rabiner, L.R., et al., 2000. Speech and language processing for next-millennium communications services. Proc. IEEE 88 (8), 1314–1337.

Davis, S., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoustics Speech Signal Process. 28 (4), 357–366.

Digalakis, V., Neumeyer, L., Perakakis, M., 1999. Quantization of cepstral parameters for speech recognition over the World Wide Web. IEEE J. Select. Areas Commun. 17, 82–90.

Endo, T., Kuroiwa, S., Nakamura, S., 2003. Missing feature theory applied to robust speech recognition over IP networks. In: Proc. Eurospeech03, Geneva, Switzerland.

ETSI Standard ES 201 108, 2000. Distributed speech recognition; front-end feature extraction algorithm; compression algorithms.

ETSI Standard ES 202 050, 2002. Distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithm.

ETSI Standard ES 202 211, 2003. Distributed speech recognition; extended front-end feature extraction algorithm; compression algorithm, back-end speech reconstruction algorithm.

ETSI Standard ES 202 212, 2003. Distributed speech recognition; extended advanced front-end feature extraction algorithm; compression algorithm, back-end speech reconstruction algorithm.

Euler, S., Zinke, J., 1994. The influence of speech coding algorithms on automatic speech recognition. In: Proc. ICASSP94.

Fingscheidt, T., Vary, P., 2001. Softbit speech decoding: a new approach to error concealment. IEEE Trans. Speech Audio Process. 9 (3), 1–11.

Fingscheidt, T., Aalburg, S., Stan, S., Beaugeant, C., 2002. Network-based vs. distributed speech recognition in adaptive multi-rate wireless systems. In: Proc. ICSLP02.

Gilbert, E.N., 1960. Capacity of a burst-noise channel. Bell Syst. Tech. J. 39, 1253–1266.

Gomez, A.M., Peinado, A.M., Sanchez, V., Rubio, A.J., 2003. A source model mitigation technique for distributed speech recognition over lossy packet channels. In: Proc. Eurospeech03.

Gomez, A.M., Peinado, A.M., Sanchez, V., Milner, B.P., Rubio, A.J., 2004. Statistical-based reconstruction methods for speech recognition in IP networks. In: Proc. Robust2004.

Gong, Y., 1995. Speech recognition in noisy environments: a survey. Speech Commun. 16, 261–291.

Goyal, V.K., 2001. Multiple description coding: compression meets the network. IEEE Signal Process. Mag. 18 (5), 74–93.

3GPP TR 26.943 V6.0.0, 2004. Recognition performance evaluations of codecs for speech enabled services (SES) (Release 6).

3GPP TS 26.235 V6.1.0, 2004. Packet switched conversational multimedia applications; default codecs (Release 6).

Gunawan, W., Hasegawa-Johnson, M., 2001. PLP coefficients can be quantized at 400 bps. In: Proc. ICASSP.

Haavisto, P., 1998. Audio–visual signal processing for mobile communications. In: Proc. European Signal Processing Conference, Island of Rhodes, Greece.

Haavisto, P., 1999. Speech recognition for mobile communications. In: Proc. Robust Methods for Speech Recognition in Adverse Conditions, Tampere, Finland.

Haeb-Umbach, R., Ion, V., 2004. Soft features for improved distributed speech recognition over wireless networks. In: Proc. ICSLP04.

Ho, Y.-C., 1999. The no free lunch theorem and the human–machine interface. IEEE Control Syst. 1999 (June), 8–10.

Huerta, J., 2000. Robust Speech Recognition in GSM Codec Environments. PhD thesis, CMU.

Huerta, J., Stern, R., 1998. Speech recognition from GSM codec parameters. In: Proc. ICSLP98.

Hsu, W.-H., Lee, L.-S., 2004. Efficient and robust distributed speech recognition (DSR) over wireless fading channels: 2D-DCT compression, iterative bit allocation, short BCH code and interleaving. In: Proc. ICASSP04.

Huang, X.D., Acero, A., Hon, H.-W., 2001. Spoken Language Processing. Prentice Hall, New Jersey, USA.

James, A.B., Milner, B.P., 2004. An analysis of interleavers for robust speech recognition in burst-like packet loss. In: Proc. ICASSP04, Montreal, Quebec, Canada.

James, A.B., Gomez, A., Milner, B.P., 2004. A comparison of packet loss compensation methods and interleaving for speech recognition in burst-like packet loss. In: Proc. ICSLP04.

Kanal, L.N., Sastry, A.R.K., 1978. Models for channels with memory and their applications to error control. Proc. IEEE 66 (7), 724–744.

Kelleher, H., Pearce, D., Ealey, D., Mauuary, L., 2002. Speech recognition performance comparison between DSR and AMR transcoded speech. In: Proc. ICSLP02, Denver, USA.

Kim, H.K., Cox, R.V., 2000. Bitstream-based feature extraction for wireless speech recognition. In: Proc. ICASSP00, Turkey.

Kim, H.K., Cox, R.V., 2001. A bitstream-based front-end for wireless speech recognition on IS-136 communications system. IEEE Trans. Speech Audio Process. 9, 558–568.

Kim, M.Y., Kleijn, W.B., 2004. Comparison of transmitter-based packet-loss recovery techniques for voice transmission. In: Proc. ICSLP04.


Kiss, I., 2000. A comparison of distributed and network speech recognition for mobile communication systems. In: Proc. ICSLP00, Beijing, China.

Lee, M., Zitouni, I., Zhou, Q., 2004. On an N-gram model approach for packet loss concealment. In: Proc. ICSLP04.

Lee, L.-S., Lee, Y., 2001. Voice access of global information for broad-band wireless: technologies of today and challenges of tomorrow. Proc. IEEE 89 (1), 41–57.

Lilly, B.T., Paliwal, K.K., 1996. Effect of speech coders on speech recognition performance. In: Proc. ICSLP96.

Lindberg, B., Johansen, F.T., Warakagoda, N., et al., 2000. A noise robust multilingual reference recogniser based on SpeechDat(II). In: Proc. ICSLP00.

Lleida, E., Rose, R.C., 2000. Utterance verification in continuous speech recognition: decoding and training procedures. IEEE Trans. Speech Audio Process. 8 (2), 126–139.

Mayorga, P., Besacier, L., Lamy, R., Serignat, J.-F., 2003. Audio packet loss over IP and speech recognition. In: Proc. ASRU03, Virgin Islands, USA.

Milner, B., Semnani, S., 2000. Robust speech recognition over IP networks. In: Proc. ICASSP00, Turkey.

Milner, B., 2001. Robust speech recognition in burst-like packet loss. In: Proc. ICASSP01, USA.

Milner, B.P., James, A.B., 2004. An analysis of packet loss models for distributed speech recognition. In: Proc. ICSLP04.

Milner, B.P., Shao, X., 2003. Low bit-rate feature vector compression using transform coding and non-uniform bit allocation. In: Proc. ICASSP03.

Paliwal, K.K., So, S., 2004. Scalable distributed speech recognition using multi-frame GMM-based block quantization. In: Proc. ICSLP04.

Pearce, D., 2000. Enabling new speech driven services for mobile devices: an overview of the ETSI standards activities for distributed speech recognition front-ends. In: Proc. AVIOS00: The Speech Applications Conference, San Jose, USA.

Pearce, D., 2004. Robustness to transmission channel: the DSR approach. In: Proc. Robust2004, Norwich, UK.

Pearce, D., Hirsch, H., 2000. The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: Proc. ICSLP00, Beijing, China.

Peinado, A.M., Sanchez, V., Segura, J.C., Perez-Cordoba, J.L., 2001. MMSE-based channel error mitigation for distributed speech recognition. In: Proc. Eurospeech01.

Peinado, A., Sanchez, V., Perez-Cordoba, J., de la Torre, A., 2003. HMM-based channel error mitigation and its application to distributed speech recognition. Speech Commun. 41, 549–561.

Peinado, A., Sanchez, V., Perez-Cordoba, J., Rubio, A., 2005. Efficient MMSE-based channel error mitigation techniques. Application to distributed speech recognition over wireless channels. IEEE Trans. Wireless Commun. 4 (1), 14–19.

Pelaez-Moreno, C., Gallardo-Antolin, A., Diaz-de-Maria, F., 2001. Recognizing voice over IP: a robust front-end for speech recognition on the World Wide Web. IEEE Trans. Multimedia, June 2001.

Perkins, C., Hodson, O., Hardman, V., 1998. A survey of packet loss recovery techniques for streaming audio. IEEE Network 12, 40–48.

Potamianos, A., Weerackody, V., 2001. Soft-feature decoding for speech recognition over wireless channels. In: Proc. ICASSP01, USA.

Ramabadran, T., Sorin, A., McLaughlin, M., Chazan, D., Pearce, D., Hoory, R., 2004. The ETSI extended distributed speech recognition (DSR) standards: server-side speech reconstruction. In: Proc. ICASSP04, Montreal, Quebec, Canada.

Ramakrishana, B.R., 2000. Reconstruction of incomplete spectrograms for robust speech recognition. PhD dissertation, Carnegie Mellon University.

Rahim, M.G., Lee, C.-H., Juang, B.-H., 1997. Discriminative utterance verification for connected digits recognition. IEEE Trans. Speech Audio Process. 5 (3), 266–277.

Ramsey, J.L., 1970. Realization of optimum interleavers. IEEE Trans. Inform. Theory 16 (3), 338–345.

Riskin, E.V., Boulis, C., Otterson, S., Ostendorf, M., 2001. Graceful degradation of speech recognition performance over lossy packet networks. In: Proc. Eurospeech01.

Rose, R.C., 2004. Environmental robustness in automatic speech recognition. In: Proc. Robust2004.

Rose, R.C., Arizmendi, I., Parthasarathy, S., 2003. An efficient framework for robust mobile speech recognition services. In: Proc. ICASSP03, Hong Kong, China.

Schulzrinne, H., Casner, S., Frederick, R., Jacobson, V., 2003. RTP: A Transport Protocol for Real-Time Applications. IETF Audio/Video Transport WG, RFC 3550.

Shannon, C.E., 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423, 623–656.

Sklar, B., Harris, F.J., 2004. The ABCs of linear block codes. IEEE Signal Process. Mag. 2004 (July), 14–35.

Sorin, A., Ramabadran, T., Chazan, D., et al., 2004. The ETSI extended distributed speech recognition (DSR) standards: client side processing and tonal language recognition evaluation. In: Proc. ICASSP04.

Srinivasamurthy, N., Ortega, A., Narayanan, S., 2001. Efficient scalable speech compression for scalable speech recognition. In: Proc. Eurospeech01, Aalborg, Denmark.

Srinivasamurthy, N., Ortega, A., Narayanan, S., 2004. Enhanced standard compliant distributed speech recognition (Aurora encoder) using rate allocation. In: Proc. ICASSP04.

Sukkar, R.A., Chengalvarayan, R., Jacob, J.J., 2002. Unified speech recognition for landline and wireless environments. In: Proc. ICASSP02.

Tan, Z.-H., Dalsgaard, P., 2002. Channel error protection scheme for distributed speech recognition. In: Proc. ICSLP02, Denver, USA.

Tan, Z.-H., Dalsgaard, P., Lindberg, B., 2003a. OOV-detection and channel error protection for distributed speech recognition over wireless networks. In: Proc. ICASSP03, Hong Kong, China.


Tan, Z.-H., Dalsgaard, P., Lindberg, B., 2003b. Partial splicing packet loss concealment for distributed speech recognition. IEE Electron. Lett. 39 (22), 1619–1620.

Tan, Z.-H., Dalsgaard, P., Lindberg, B., 2004a. A subvector-based error concealment algorithm for speech recognition over mobile networks. In: Proc. ICASSP04, Montreal, Quebec, Canada.

Tan, Z.-H., Lindberg, B., Dalsgaard, P., 2004b. A comparative study of feature-domain error concealment techniques for distributed speech recognition. In: Proc. Robust2004, Norwich, UK.

Tan, Z.-H., Dalsgaard, P., Lindberg, B., 2004c. On the integration of speech recognition into personal networks. In: Proc. ICSLP04, Jeju Island, Korea.

Viikki, O., 2001. ASR in portable wireless devices. In: Proc. ASRU01, Madonna di Campiglio, Italy.

Wang, Y., Zhu, Q.-F., 1998. Error control and concealment for video communication: a review. Proc. IEEE 86 (5), 974–997.

Wang, Y., Panwar, S., Lin, S., Mao, S., 2002. Wireless video transport using path diversity: multiple description vs. layered coding. In: Proc. IEEE ICIP 2002.

Weerackody, V., Reichl, W., Potamianos, A., 2001. Speech recognition for wireless applications. In: Proc. IEEE Int. Conf. Commun. 2001.

Weerackody, V., Reichl, W., Potamianos, A., 2002. An error-protected speech recognition system for wireless communications. IEEE Trans. Wireless Commun. 1 (2), 282–291.

Xie, Q., Pearce, D., 2004. RTP Payload Formats for ETSI ES 202 050, ES 202 211, and ES 202 212 Distributed Speech Recognition Encoding, June 2004. Available from: <http://www.ietf.org/internet-drafts/draft-ietfavt-rtp-dsr-codecs-03.txt>.

Yoma, N.B., McInnes, F.R., Jack, M.A., 1998. Weighted Viterbi algorithm and state duration modelling for speech recognition in noise. In: Proc. ICASSP98.

Zhong, X., Arrowood, J., Moreno, A., Clements, M., 2002. Multiple description coding for recognizing voice over IP. In: Proc. IEEE Digital Signal Processing Workshop.

Zhu, Q., Alwan, A., 2001. An efficient and scalable 2D DCT-based feature coding scheme for remote speech recognition. In: Proc. ICASSP01.