

Extreme Learning Machines and Kernel Deep Convex Networks for Speech and Vision Tasks
Ahmed Karanath (B13104), Kansul Mahrifa (B13123)

Mentor: Dr. A. D. Dileep
School of Computing and Electrical Engineering, Indian Institute of Technology Mandi

Abstract
We explore two related approaches to the tasks of speech emotion recognition (SER), speaker identification (Spk-Id), and scene classification: a neural network based on the Extreme Learning Machine (ELM) algorithm, and the Kernel Deep Convex Network (KDCN). We propose using these approaches to classify varying-length patterns and to speed up training compared to back-propagation techniques. We also propose a novel approach to classifying varying-length patterns using dynamic kernels in KDCN without compromising on training time, while giving results comparable to other methods used for these tasks.

Introduction
Image and speech data consist mostly of varying-length samples. Hence there is a need for alternative methods to classify varying-length patterns in speech and images.

Figure: Speech signal waveforms of two short-duration utterances of the word "me". These signals are recorded at a sampling rate of 16 kHz.

Figure: Image of a coast from the MIT8 scene dataset. Each feature vector corresponds to a patch on the image.

Extreme Learning Machine

Figure: Schematic diagram of a single-layer feedforward neural network (SLFN) on which the ELM algorithm is used.

• In this algorithm [1], the input weights w_i above are randomly assigned and never updated. The output weights β are then calculated analytically as β = H†T, where H† is the Moore-Penrose generalized inverse of the hidden-layer output matrix H and T is the target matrix (a minimal sketch follows this list).

• This is extended to a kernel-based version [2]. For a test example X, the output function f(X) of the kernel-based extreme learning machine (KELM) is given as:

f(X) = [K(X, X_1), . . . , K(X, X_N)] (I/C + K)⁻¹ T    (1)

where K(X, X_j) is the kernel function between X and the j-th training example, playing the role of the hidden neurons of the SLFN; K is the N × N kernel matrix over the training set, C is a regularization parameter, and T is the target matrix.

• In this kernel version, static kernels such as the linear, polynomial, and Gaussian kernels are used.
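For concreteness, the following is a minimal NumPy sketch of both training rules. It is an illustrative sketch, not the implementation behind the poster's results: the tanh activation, the Gaussian kernel, and the hyperparameter defaults (n_hidden, C, gamma) are assumptions. T is the one-hot target matrix, and the predicted class of a test example is the argmax of f(X).

```python
import numpy as np

def elm_train(X, T, n_hidden=100, seed=0):
    # Basic ELM: hidden weights are random and never trained;
    # only the output weights beta are solved for analytically.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)            # hidden-layer output matrix H
    beta = np.linalg.pinv(H) @ T      # beta = H†T via the pseudo-inverse
    return W, b, beta

def gaussian_kernel(A, B, gamma=0.1):
    # Pairwise Gaussian kernel between the rows of A and B.
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kelm_train(X, T, C=10.0, gamma=0.1):
    # KELM: solve (I/C + K) alpha = T once; no hidden layer to tune.
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(len(X)) / C + K, T)

def kelm_predict(X_new, X_train, alpha, gamma=0.1):
    # Implements Eq. (1): f(X) = [K(X, X_1), ..., K(X, X_N)] (I/C + K)^-1 T
    return gaussian_kernel(X_new, X_train, gamma) @ alpha
```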

Issues:
• Neural networks, including ELM-based networks, require a fixed input size and cannot handle varying-length patterns.
• The hidden layer size has to be tuned by trial and error.

Proposed Solution
Classify varying-length patterns directly with ELM using dynamic kernels (a sketch of one such kernel is given below). The hidden layer dimensionality need not be known, and the number of hidden nodes need not be tuned in this case. This method is named dynamic kernel ELM (DKELM).
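As one concrete illustration of a dynamic kernel, the sketch below implements a simplified codebook-based intermediate matching kernel (in the spirit of CIGMMIMK; the poster's FK and PSK variants are computed differently). The virtual centers (e.g. k-means centroids of all training frames), the Gaussian base kernel, and gamma are assumptions. It compares two feature-vector sets of different lengths directly, and the resulting Gram matrix can be fed to the KELM solver sketched above.

```python
import numpy as np

def intermediate_matching_kernel(Xa, Xb, centers, gamma=0.1):
    # Xa: (Ta, d) and Xb: (Tb, d) are variable-length sets of local
    # feature vectors. For each virtual center, pick the closest vector
    # in each set and accumulate a Gaussian base kernel between the pair.
    k = 0.0
    for c in centers:
        ia = np.argmin(((Xa - c) ** 2).sum(axis=1))   # closest frame in Xa
        ib = np.argmin(((Xb - c) ** 2).sum(axis=1))   # closest frame in Xb
        k += np.exp(-gamma * ((Xa[ia] - Xb[ib]) ** 2).sum())
    return k

def dynamic_gram_matrix(examples, centers, gamma=0.1):
    # Gram matrix over variable-length examples; this matrix replaces the
    # fixed-size hidden layer, so no hidden dimensionality is ever tuned.
    n = len(examples)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = intermediate_matching_kernel(
                examples[i], examples[j], centers, gamma)
    return K
```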

Kernel Deep Convex Network

The Kernel Deep Convex Network (KDCN) [3] is a neural network composed by stacking shallow neural-network modules.
Architecture:
• Each module in KDCN has an input layer, a hidden layer, and an output layer.

• The input to the higher modules is the concatenation of the input data and the outputs of the lower modules.

• Concatenating the outputs from the lower modules helps prevent over-fitting on the training data (a sketch of the stacking follows the figure below).

Figure: Schematic diagram of KDCN with three modules.
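The following is a simplified sketch of the stacking scheme, assuming KELM-style kernel ridge regression modules with a Gaussian kernel; the number of modules, C, and gamma are illustrative values, and the modules in [3] may differ in detail.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=0.1):
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kdcn_train(X, T, n_modules=3, C=10.0, gamma=0.1):
    # Each module maps its input Z to class scores Y; the next module's
    # input is the raw data X concatenated with all lower modules' scores.
    Z, outputs, modules = X, [], []
    for _ in range(n_modules):
        K = gaussian_kernel(Z, Z, gamma)
        alpha = np.linalg.solve(np.eye(len(Z)) / C + K, T)
        Y = K @ alpha                    # this module's output scores
        modules.append((Z, alpha))
        outputs.append(Y)
        Z = np.hstack([X] + outputs)     # the concatenation step above
    return modules
```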

Issues:
• Concatenating the output of a module with the input data at the different levels is not possible for varying-length patterns.

Proposed Solution
A linear combination of kernel matrices, calculated separately on the input data and on the outputs from the previous modules, is formed and given as the input kernel to the next module (see the sketch below). This method is named the dynamic kernel deep convex network (DKDCN).
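A minimal sketch of this combination step is given below; the mixing weight rho and the Gaussian kernel on the previous module's outputs are assumed choices, not values from the poster.

```python
import numpy as np

def dkdcn_next_kernel(K_input, Y_prev, rho=0.5, gamma=0.1):
    # K_input: (N, N) dynamic-kernel Gram matrix on the raw variable-length
    #          inputs (e.g. FK, PSK, or an intermediate matching kernel).
    # Y_prev:  (N, n_classes) fixed-length outputs of the previous module.
    sq = (Y_prev ** 2).sum(1)
    K_output = np.exp(-gamma * (sq[:, None] + sq[None, :]
                                - 2 * Y_prev @ Y_prev.T))
    # The linear combination replaces the concatenation used in plain KDCN,
    # since variable-length inputs cannot be concatenated with Y_prev.
    return rho * K_input + (1 - rho) * K_output
```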

Results
Datasets used:
• Scene classification: MIT8 scene and Vogel-Schiele datasets

• Speech emotion recognition: EmoDB and FAU-AEC datasets

• Speaker identification: NIST-SRE corpora

Dynamic Kernel |   MIT8 scene    |  Vogel-Schiele
               | DKELM   DKDCN   | DKELM   DKDCN
FK             | 82.62   82.63   | 75.18   74.47
PSK            | 63.50   62.90   | 56.02   56.21
CIGMMIMK       | 75.20   75.30   | 65.25   65.20
SPSK           |   -       -     |   -       -
SLPMK          |   -       -     |   -       -

Table: Comparison of classification accuracies (in %) of DKELM and DKDCN (3-layer) based classifiers on image data.

Dynamic Kernel |        Speech Emotion Recognition         | Speaker Identification
               |        EmoDB        |       FAU-AEC       |
               | KELM   DKDCN  SVM   | KELM   DKDCN  SVM   | KELM   DKDCN  SVM
FK             | 88.0   86.45  87.05 | 63.67    -    61.54 | 89.14  88.20  88.54
PSK            | 88.40  86.0   87.46 | 64.9     -    62.54 | 87.18    -    86.18
CIGMMIMK       | 78.0   79.0   85.62 | 62.71    -    62.48 | 88.78  87.35  88.54
SLPSK          | 91.40    -    92.6  | 65.60    -    66.29 | 91.21    -    91.67

Table: Comparison of classification accuracies (in %) of KELM, DKDCN (3-layer), and SVM based classifiers on speech data.

Future Work
• The dashes ('-') in the above tables indicate ongoing work; these results will be reported on completion.
• Study the effect of different types of intermediate kernels in DKDCN, and explore ways to improve the results.

References
[1] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: Theory and applications. Neurocomputing, 70:489–501, 2006.
[2] Alexandros Iosifidis and Moncef Gabbouj. On the kernel extreme learning machine speedup. Pattern Recognition Letters, 68(P1):205–210, December 2015.
[3] Niharjyoti Sarangi and C. Chandra Sekhar. Automatic image annotation using convex deep learning models. In ICPRAM 2015, Portugal. SCITEPRESS - Science and Technology Publications, Lda.