Image2speech: Automatically generating audio descriptions of images

MARK HASEGAWA-JOHNSON1, ALAN BLACK2, LUCAS ONDEL3, ODETTE SCHARENBORG4, AND FRANCESCO CIANNELLA5

1University of Illinois, Urbana, IL, USA
2Carnegie Mellon University, Pittsburgh, PA, USA
3Brno University of Technology, Brno, Czech Republic
4Centre for Language Studies, Radboud University, Nijmegen, Netherlands
5Cisco Systems, San Francisco, CA, USA

Compiled February 7, 2018

The image2speech task generates a spoken description of an image. We present baseline experiments in which the neural net used is a sequence-to-sequence model with attention, and the speech synthesizer is Clustergen. Speech is generated from four different types of segmentations: two that require a language with known orthography (words and first-language phones), and two that do not (pseudo-phones and second-language phones). BLEU scores and token error rates indicate that the task can be performed with better than chance accuracy. Informal perusal of the output (phone strings, word strings, and synthesized audio) suggests that the audio contains complete, intelligible words organized into intelligible sentences, and that the most salient errors are caused by mis-recognition of objects and actions in the image. © 2018 International Science and General Applications

INTRODUCTION

This paper proposes a new task for artificial intelligence: the generation of a spoken description of an image. The automatic generation of text is the topic of natural language processing (NLP), whereas the analysis of images is the topic of the field of computer vision. Great advances have been made in both fields, and recently they have been combined into a new research field: img2txt [1]. However, many of the world’s languages do not have a written form [2], and therefore many people do not have access to these and other speech and NLP technologies. In this work, we propose a new research task: image2speech, which is similar to img2txt, but can reach people whose language does not have a natural or easily used written form. An image2speech system should generate a spoken description of an image directly, without first generating text.

The methodology we propose converts image feature vectors into speech unit sequences. In order to implement this pipeline, four standard open-source software toolkits are used. First, the VGG16 [3, 4] visual object recognizer converts each image into a sequence of image feature vectors. Second, the XNMT [5] machine translation toolkit accepts these image feature vectors as inputs, and generates speech units as output. Third, the Clustergen [6] speech synthesis toolkit generates audio from each speech unit sequence. Fourth, in order to train a synthetic speech voice, Clustergen needs a corpus of audio files, each of which is transcribed using some type of discrete symbolic units; automatic speech recognition (ASR) systems based on Kaldi [7], Eesen [8], and AMDTK [9] perform this transcription.

The complete image2speech system is trained using a corpus of (image, description) pairs, where each description is an audio file containing a spoken description of the image. Four different types of speech units are tested. Two types of unit sequences are generated using an ASR system trained on the same language as is used for the image2speech task. This system provides an upper-bound performance on the image2speech task; because it uses the same language during training and testing, it could never be applied to a language without orthography. These two types of unit sequences are word and phone sequences, referred to as Words and L1-Phones (first-language phones), respectively. The other two types of unit sequences are used to build systems for a language lacking orthography: they are generated by ASR systems that are not trained on transcribed speech in the target language, but rather without relying on any transcription of the target speech. These unit sequences are phone sequences produced by an ASR system trained on another language, referred to as L2-Phones (second-language phones), and pseudo-phones. Pseudo-phones are approximately phone-sized segmental units, extracted without supervision using an unsupervised acoustic unit discovery system [9].

EXPERIMENTAL METHODS

Fig. 1 gives an overview of the experimental methods used in this paper. image2speech models were trained using (image, audio) pairs drawn from the Flickr8k, MSCOCO, Flicker-Audio, and SPEECH-COCO corpora. Each image is represented as a sequence of 196 vectors, each of dimension 512, created from the last convolutional layer of the VGG16 network. Audio files are converted to units via Kaldi forced alignment (Words and L1-Phones) or via Eesen or AMDTK phone recognition (L2-Phones and Pseudo-Phones). XNMT then learns to convert a sequence of image feature vectors into a sequence of speech units, while Clustergen learns to convert speech units into audio.

Data

Experiments in this paper used two databases: the Flickr8k image captioning corpus with its associated Flicker-Audio speech corpus, and the MSCOCO image captioning corpus with its associated SPEECH-COCO speech corpus.

The Flickr8k Corpus [10] includes five text captions for each of 8000 images, as well as links to the images. Text captions were written by crowd workers hired on Amazon Mechanical Turk. There is considerable variability among the captions provided for each image. For example, the five different captions available for the first image in the corpus (image 1000268201_693b08cb0e) are:

A child in a pink dress is climbing up a set of stairs in an entry way.
A girl going into a wooden building.
A little girl climbing into a wooden playhouse.
A little girl climbing the stairs to her playhouse.
A little girl in a pink dress going into a wooden cabin.

Fig. 1. Experimental methods. XNMT is first trained using (image, sequence) pairs, and Clustergen using (sequence, audio) pairs, where the sequences are generated using Kaldi, Eesen, or AMDTK applied to the audio. A test image is then passed through VGG16, XNMT, and Clustergen to generate its audio description.

In 2015, Harwath and Glass [11] proposed a data retrieval system in which speech files are used to retrieve corresponding images from a large image database, and vice versa. In order to make their proposal possible, they hired crowd workers on Mechanical Turk to read aloud the 40,000 captions from the Flickr8k corpus. The resulting set of 40,000 spoken captions is distributed as the Flicker-Audio corpus.

The audio files in the Flicker-Audio corpus were recorded by crowd workers on Mechanical Turk, and are therefore noisy and highly variable. Background noise is present in most files, and in some files the signal-to-noise ratio is quite low. Some files are more than ten seconds long, apparently because the worker forgot to turn off his or her microphone after recording the audio file. We have not listened to all of the audio files; of those we audited, all contained a read version of the text caption, but some contained pronunciation errors, omission of function words, and similar small mistakes. Because of the high variability of the Flicker-Audio corpus, we decided also to work with a corpus of synthetic speech, described in the next paragraph.

The Microsoft COCO (Common Objects in COntext) corpus was initially developed as an object detection corpus [12]. After the initial release of the corpus, text captions for 150,000 of the images (four captions each) were acquired using methods similar to those pioneered by the Flickr8k corpus, making MSCOCO the largest database available for training img2txt systems [13]. The SPEECH-COCO spoken transcriptions [14] were created using eight different synthetic voices reading the MSCOCO text transcriptions. All eight synthetic voices were created from low-noise recordings of professional broadcast announcers; most listeners cannot tell that the speech is synthetic. Because the speech is synthetic, exact time alignment of the phones and words is available, and is distributed with the corpus.

Experiments in this paper did not use the entire SPEECH-COCO corpus, because we did not have enough computing time to train a neural network using the whole corpus. Instead, experiments reported in this paper used a subset of MSCOCO with training, validation, and test corpora sized to match those of Flickr8k: 6000 training images, 1000 validation images, and 1000 test images. When an image is part of the training or validation corpus, all of its captions are used; thus experiments using the MSCOCO corpus had a training corpus containing 24,000 image-audio pairs (6000 distinct images), while the Flickr8k training corpus included 30,000 image-audio pairs (6000 distinct images).

Image Features

ImageNet [15] is an image database organized according to the WordNet [16] noun hierarchy. The creators of ImageNet posted requests for images on crowdsourcing platforms: crowd workers were each given a noun from the WordNet hierarchy, and asked to provide example images. ImageNet currently has 14 million images, provided as examples of 22,000 nouns.

The ILSVRC (ImageNet Large Scale Visual Recognition Challenge) has been held annually since 2010. The best single-network solution in ILSVRC 2014 Sub-task 2a, “Classification+localization with provided training data,” was a 13-layer convolutional neural network (CNN) [3]; implementations in TensorFlow ([4], used in this paper) and Keras [17] are now redistributed as the VGG16 network.

VGG16 is a 13-layer CNN, followed by a two-layer fully-connected network (FCN). The entire network (13 convolutional layers and 2 fully connected layers) was trained using the 14 million images of ImageNet. When the five largest outputs of the softmax layer are chosen as hypotheses, that top-5 list contains the reference class label of a test image 92.7% of the time. The first layer generates a 224 × 224 × 64 output tensor, by applying a set of 64 different 3 × 3 × 3-pixel filters to the input image. The second layer also generates a 224 × 224 × 64 output tensor, by applying 64 different 3 × 3 × 64-pixel filters. Outputs of the second layer are pooled using 2 × 2 max pooling. The next two convolutional layers generate 112 × 112 × 128 output tensors, followed by another layer of 2 × 2 max pooling. The next three convolutional layers generate 56 × 56 × 256 output tensors, followed by another max pooling layer, followed by three 28 × 28 × 512 convolutional layers, followed again by max pooling, followed by three 14 × 14 × 512 convolutional layers. The last convolutional layer is composed of 512 channels, each of which is a 14 × 14 image; it is useful to interpret this layer as a set of 14 × 14 = 196 feature vectors of dimension 512. Each feature vector is the nonlinear transformation of a 40 × 40-pixel sub-image, which is to say, about 3% of the original 224 × 224 input image. The last convolutional layer is again max-pooled, then reshaped into a single length-4096 vector, which is passed through two fully connected layers before generating a 1000-dimensional output probability vector, specifying the posterior probability that the image contains any of the 1000 object classes required by the ILSVRC 2014 competition.
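
As a concrete illustration of how the 196 × 512 feature sequence can be extracted, the following sketch uses the Keras distribution of VGG16 rather than the TensorFlow port of [4] used in our experiments; the layer name block5_conv3 and the preprocessing calls are assumptions tied to that Keras implementation.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

# VGG16 without the fully connected "top": the output of block5_conv3 is the
# last convolutional layer, a 14 x 14 x 512 tensor for a 224 x 224 input.
base = VGG16(weights="imagenet", include_top=False)
extractor = tf.keras.Model(inputs=base.input,
                           outputs=base.get_layer("block5_conv3").output)

def image_to_feature_sequence(path):
    """Return a (196, 512) array: one 512-dim vector per spatial cell."""
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    fmap = extractor.predict(x)[0]           # shape (14, 14, 512)
    return fmap.reshape(-1, fmap.shape[-1])  # shape (196, 512)
```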

Speech Units

Systems were trained and tested using four different types of speech units: Words, L1-Phones, L2-Phones, and Pseudo-Phones. The baseline models, with Word-sequence outputs, are standard img2txt networks, e.g., comparable to the result reported in [10]. In these networks, the output vocabulary is the set of all distinct words in the training corpus: 7993 words in Flickr8k, 7476 in MSCOCO8k. The experimental systems generate phone outputs: L1-Phone, L2-Phone, and Pseudo-Phone. The Flickr8k and MSCOCO L1-Phone systems both use English phone sets, but with slightly different sizes: 54 for Flickr8k, 52 for MSCOCO. The L2-Phone system (only tested for Flickr8k) contains 38 phones. The Pseudo-Phone system (only tested for MSCOCO) was adjusted to produce 103 phones, as pseudo-phone sets of about this size have been useful in previous experiments [9].

Words and L1-Phones are aligned to the image2speech training audio files using ASR forced alignment trained in the target language; speech segmentations of this kind can therefore only be performed in a language that has a writing system. The two databases used in this paper were transcribed in two different ways. The larger corpus, SPEECH-COCO [14], is distributed with phone transcriptions (phones in this database are transcribed using a phonetic alphabet based on X-SAMPA). The Flicker-Audio corpus [11] is not distributed with phonetic transcriptions, but text transcriptions are available [10]; from these, aligned L1-Phone transcriptions were generated using the KIT English transcription system [18]. As a result, the L1-Phones for Flickr8k are the ARPABET phones of [18].

Forced alignment of words and L1-Phones was implemented using the Kaldi speech recognition toolkit. Kaldi [19] is an open-source toolkit that integrates methods for training deep neural net acoustic models with weighted finite-state transducer (WFST) language models. Acoustic features consisted of MFCCs (13 features), stacked with ±3 frames of context (13 × 7 = 91 features), reduced to 40 dimensions using LDA followed by fMLLR. The acoustic features are observed by a neural network trained using TensorFlow, with a WFST-based search graph incorporated as part of the neural network training error criterion. The WFST representation of speech sequence information is outlined in [20]. In this framework, the acoustic model is specified by a probabilistic mapping from acoustic signals to a sequence of discrete symbols, and a WFST H mapping these symbol sequences to triphone sequences. The other WFSTs in the framework are C, which maps triphone sequences down to monophone sequences; a pronunciation model L; and a language model G. Since our tasks involve phone recognition, L is essentially an identity mapping and G is a phone N-gram model.
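
The context-stacking step described above (13-dimensional MFCC frames spliced with ±3 neighbouring frames to give 91 dimensions) can be sketched as follows; this is an illustrative reimplementation with assumed array shapes, not the Kaldi code, and it omits the LDA and fMLLR transforms that Kaldi estimates during training.

```python
import numpy as np

def splice_frames(mfcc, context=3):
    """Stack each 13-dim MFCC frame with its +/- `context` neighbours.

    mfcc: array of shape (T, 13). Returns an array of shape (T, 91),
    i.e. 13 x (2 * context + 1) features per frame, with edge frames repeated.
    """
    T = mfcc.shape[0]
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

spliced = splice_frames(np.random.randn(500, 13))  # shape (500, 91)
```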

L2-Phone transcription does not use any information about the writing system of the target language, and could therefore be used for a language that lacks any writing system. In this method, an ASR system is first trained in a different language (in our case, Dutch). The L2 ASR is then used to generate a phone transcription of the target audio. In the experiments reported in this paper, the Eesen toolkit [8] was first used to train a Dutch ASR system. Dutch phones were then mapped to English phones using linguistic knowledge only, without the use of any English writing or transcriptions, and the English-adapted Dutch ASR was used to transcribe English audio. Thus the ASR has some prior exposure to English audio, but has no knowledge of English text [21].

Eesen [8] is an end-to-end neural speech recognizer, trained using the CTC (connectionist temporal classification) training criterion [22]. The CTC training criterion scores a recurrent neural network by using a dynamic programming step to align the network output to a reference transcription; CTC therefore does not require time-aligned reference transcriptions. The dynamic programming step in CTC often fuses knowledge from two sources: the neural network and an external language model. Eesen is the first published CTC implementation designed to use an external language model in the form of a weighted finite-state transducer (WFST). Eesen is therefore able to use a language model that combines information from many different sources.
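
A minimal sketch of the CTC criterion itself is shown below, using PyTorch's built-in CTC loss rather than Eesen; the tensor shapes and toy sizes are assumptions, and the WFST decoding that distinguishes Eesen is not shown.

```python
import torch
import torch.nn as nn

# Toy setup: T acoustic frames, batch of N utterances, C output symbols
# (C - 1 phones plus the CTC blank, index 0 here).
T, N, C = 50, 4, 40
logits = torch.randn(T, N, C)                 # unnormalised network outputs
log_probs = logits.log_softmax(dim=-1)        # CTCLoss expects log-probabilities

targets = torch.randint(1, C, (N, 10))        # reference phone sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# The dynamic-programming alignment of outputs to the reference happens inside
# the loss, so no time-aligned transcription is needed.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```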

Pseudo-Phones were generated using the acoustic unit discovery (AUD) system of [9], with two major modifications. First, the truncated Dirichlet process of [9] was replaced by a symmetric Dirichlet distribution since, as pointed out in [23], the symmetric Dirichlet distribution provides a good yet simple approximation of the Dirichlet process. Second, to cope with the relatively large database, the variational Bayes inference algorithm originally used in [9] was replaced with the faster stochastic variational Bayes inference algorithm. It was found experimentally that these modifications, while considerably speeding up training, yield a negligible drop in accuracy. The source code of the AUD model is available at https://github.com/amdtkdev/amdtk.
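
The first modification can be illustrated in a few lines of NumPy: a K-component symmetric Dirichlet with concentration gamma/K approximates a Dirichlet process with concentration gamma as K grows. This is a toy illustration of the prior only, not the AMDTK inference code; K and gamma are assumed values.

```python
import numpy as np

K, gamma = 103, 1.0                  # 103 pseudo-phone units, as in the text
rng = np.random.default_rng(0)

# One draw of unit weights from the symmetric Dirichlet approximation of a
# Dirichlet process: weights sum to 1 and only a few units carry most mass.
weights = rng.dirichlet(np.full(K, gamma / K))
print(weights.sum(), np.sort(weights)[-5:])
```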

Machine Translation Toolkit

XNMT (the eXtensible Neural Machine Translation toolkit [5]) was used to implement and train the image2speech system. XNMT specializes in the training of sequence-to-sequence neural networks, which read in a sequence of inputs and then generate a different sequence of outputs.

XNMT is based on DyNet [24], a library for the training of neural networks with variable-length inputs. Prior to DyNet, most neural network modeling toolkits assumed that every training and test input is exactly the same size. DyNet introduced a new type of graph compilation, dynamic compilation, in which each layer of the neural net is represented as a compiled function rather than a compiled data structure. This dynamic compilation method permits the toolkit to train neural networks using training examples each of which has a different size. XNMT is designed so that existing components can be easily re-arranged to run new experiments, and new components can be easily added. Available components are categorized as embedders (e.g., one-hot, linear, and continuous vector embedders), encoders (e.g., CNN, LSTM, and pyramidal LSTM encoders), attention models (e.g., dot product, bilinear, and MLP attention models), decoders (e.g., an MLP decoder applied to the state vector of the encoder), and error metrics (e.g., BLEU, cross-entropy, word error rate). Among other applications, the flexibility of XNMT has been demonstrated in the use of attention models to select between neural and phrase-based translation probability vectors, a method that has particular utility in the translation of low-frequency content words [25].

The image2speech model learned by XNMT is a sequence-to-sequence model, composed of an encoder, an attender, and a decoder. The encoder is a one-layer bidirectional LSTM (implemented using XNMT’s PyramidalLSTM model), with a 128-dimensional state vector. The attender is a three-layer perceptron, implemented using XNMT’s StandardAttender model. For each combination of an input LSTM state vector and an output LSTM state vector (128 dimensions each), the attender uses a three-layer perceptron (two hidden layers of 128 nodes each) to compute a similarity score. The decoder is another three-layer perceptron (1024 nodes per hidden layer), whose input is the attention-weighted summation of all input LSTM state vectors, concatenated with the state vector of the output LSTM. The output of the decoder is a softmax with a number of output nodes equal to the size of the speech unit vocabulary.
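
The sketch below reproduces the attender/decoder geometry described above in PyTorch rather than XNMT/DyNet; the layer sizes follow the text, while the class names, the tanh nonlinearities, and the example vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MLPAttender(nn.Module):
    """Scores each encoder state against the current decoder state with a
    perceptron having two 128-node hidden layers, then forms a context vector
    as the attention-weighted sum of encoder states."""
    def __init__(self, enc_dim=128, dec_dim=128, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim + dec_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, enc_states, dec_state):
        # enc_states: (T, enc_dim); dec_state: (dec_dim,)
        T = enc_states.size(0)
        pairs = torch.cat([enc_states, dec_state.expand(T, -1)], dim=1)
        alpha = torch.softmax(self.mlp(pairs).squeeze(-1), dim=0)  # (T,)
        context = alpha @ enc_states                               # (enc_dim,)
        return context, alpha

class UnitDecoder(nn.Module):
    """Maps [context; decoder state] to a distribution over speech units."""
    def __init__(self, enc_dim=128, dec_dim=128, hidden=1024, vocab=54):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim + dec_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, vocab))

    def forward(self, context, dec_state):
        return torch.log_softmax(self.mlp(torch.cat([context, dec_state])), dim=-1)

# One decoding step over the 196 image-feature encodings of a single image.
enc_states, dec_state = torch.randn(196, 128), torch.randn(128)
context, alpha = MLPAttender()(enc_states, dec_state)
unit_log_probs = UnitDecoder()(context, dec_state)   # (54,) log-probabilities
```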

Speech Synthesis Toolkit

Text-to-speech synthesis is typically a four-stage process. First, the text is converted to a graph of symbolic phonetic descriptors. Second, the duration of each unit in the phonetic graph is predicted. Third, the mel-cepstrum [26], pitch, and multi-band excitation [27] are predicted using a dynamic model such as an HMM (hidden Markov model, [28]) or RNN (recurrent neural network, [29]), or by applying separate discrete-to-continuous mapping algorithms to each frame of the synthetic utterance [6]. Fourth, the speech signal is generated by inverting the mel-cepstral transform [26] and exciting it with the specified excitation.

In our application of the image2speech task to languages without orthography, we do not have this symbolic training material available. Clustergen [6] is particularly applicable to the problems considered in this paper because it is able to generate intelligible and pleasant synthetic voices from very small training corpora, using an arbitrary discrete labeling of the corpus that need not include any traditional type of phoneme [30].

It is possible to synthesize speech in an application like this because the Clustergen speech synthesis algorithm differs from most other speech synthesis algorithms: there is no pre-determined set of speech units, and there is no explicit dynamic model. Instead, every frame in the training database is viewed as an independent exemplar of a mapping from discrete inputs to continuous outputs, and a machine learning algorithm (e.g., a regression tree [6] or random forest [31]) is applied to learn the mapping. Discrete inputs include standard speech synthesis predictors such as the phone sequence and prosodic context, as well as variables uniquely available to Clustergen such as the timing of the predicted frame with respect to segment boundaries at every prosodic level. Continuous outputs include the excitation and pitch, the mel-cepstrum, and a representation of the dynamic trajectory of the mel-cepstrum (its local slope and curvature); the mel-cepstrum of the continuous signal is then synthesized using trajectory overlap-and-add.
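
The frame-level mapping can be sketched as follows; this is a toy stand-in for Clustergen using scikit-learn, in which the discrete context of each frame (here just a phone label and a position index, both invented for the example) is one-hot encoded and a single regression tree predicts a continuous parameter vector per frame.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy training data: one row per frame; discrete context = (phone id, frame
# position within the phone); target = a 25-dim vector standing in for the
# mel-cepstrum, pitch, and excitation parameters of that frame.
contexts = np.array([[phone, pos] for phone in range(10) for pos in range(20)])
frames = rng.normal(size=(len(contexts), 25))

encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(contexts).toarray()
tree = DecisionTreeRegressor(max_depth=12).fit(X, frames)  # multi-output tree

# Synthesis: predict a parameter vector for every frame of a new unit sequence.
new_contexts = np.array([[3, pos] for pos in range(20)])
predicted = tree.predict(encoder.transform(new_contexts).toarray())  # (20, 25)
```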

RESULTS

Fig. 2 shows the training loss (cross-entropy) of sequence-to-sequence networks trained (using XNMT) to generate speech unit sequences from image features. Training loss is measured in bits per output symbol. Word-generating models start with a training loss much higher than that of any phone-generating model, apparently because the number of distinct words is larger than the number of distinct phones. The training loss of the Word networks falls below those of the L2-Phone and Pseudo-Phone networks after some training, apparently because Words are more predictable than Pseudo-Phones or L2-Phones. The Word models never achieve training losses below those of the L1-Phone models. In fact, the Word and L1-Phone models converge to very similar endpoints, suggesting that the L1-Phone network may be learning the same thing as the Word model: it might be learning to generate a sequence of phones that always corresponds to complete words.

Fig. 2. Training loss (cross-entropy, bits/symbol) of sequence-to-sequence networks trained to generate speech unit sequences from image features.

Table 1 shows BLEU scores (higher is better) and unit error rates (UER; lower is better) of four experimental systems and two baselines, measured on the validation and test sets of the Flicker-Audio and MSCOCO8k corpora. For Word-generating systems, UER is the word error rate; for phone-generating systems, UER is the phone error rate. The rank ordering of the experimental systems is roughly the same in Table 1 as in Fig. 2, though the Word-based system achieves a very poor unit error rate on the Flickr8k test corpus. Both the L2-Phone (on Flickr8k) and Pseudo-Phone (on MSCOCO) systems suffer UER > 100%. The L1-Phone systems, however, demonstrate unit error rates that are significantly better than chance (where “chance” is the error rate of a system that always generates the majority phone label).
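
For reference, a unit error rate of this kind is conventionally computed as the Levenshtein (edit) distance between the hypothesis and reference unit sequences, normalized by the reference length; the sketch below is a generic implementation of that convention, not the authors' scoring script, and the example strings are invented.

```python
def unit_error_rate(ref, hyp):
    """ref, hyp: lists of units (words or phones). Returns UER as a fraction."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub/del/ins
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: 1 substitution + 1 deletion over 5 reference units -> UER = 40%.
print(unit_error_rate("two men are riding a".split(), "two man are riding".split()))
```

Because insertions are counted, the UER of a long, wrong hypothesis can exceed 100%, as observed for the L2-Phone and Pseudo-Phone systems.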

Synthetic speech examples have been generated by the Clustergen algorithm for some of the L1-Phone network outputs. The quality of the audio examples has not yet been quantified. Informal listening suggests that the generated audio is not perfectly natural, but is composed of intelligible words arranged into intelligible sentences, despite the poor UER and BLEU results.

Figs. 3 and 4 show examples of two images from the validation subset of the Flickr8k corpus. For each image, three transcriptions are shown: two of the five available reference transcriptions (to give the reader a feeling for the difference among reference transcriptions), and one transcription generated by the L1-Phone image2speech network. As shown, the network is able to generate a phone string that is composed entirely of intelligible words, sequenced in an intelligible and semantically reasonable sentence.

Table 1. BLEU scores (%) and unit error rates (UER, %) achieved in two baseline img2txt experiments (Word outputs) and four image2speech experiments (L1-Phone, L2-Phone, and Pseudo-Phone outputs), on both validation and test sets. ** means UER < chance (Student’s t, Chebyshev standard error, p < 0.001; chance = 90.2% for Flickr8k, 88.7% for MSCOCO).

                        Validation         Test
Dataset, Targets        BLEU    UER        BLEU    UER
Flickr8k, Words          4.7%   91.3%       3.7%   130%
Flickr8k, L1-Phones     13.7    87.9       13.7     84.9**
Flickr8k, L2-Phones      5.4   115          6.1    101
MSCOCO, Words            4.8                5.5     88.5
MSCOCO, L1-Phones       15.1               16.3     78.8**
MSCOCO, Pseudo-Ph.       2.2                1.4    123

Ref #1: The boy +um+ laying face down on a skateboard is being pushed along the ground by +laugh+ another boy.

Ref #2: Two girls +um+ play on a skateboard +breath+ in a court +laugh+ yard.

Network (L1-Phones): SIL +BREATH+ SIL T UW M EH N AA R R AY D IX NG AX R EH D AE N W AY T SIL R EY S SIL.

Network (L1-Phones translated to words): Two men are riding a red and white race.

Fig. 3. Example #1 from the Flickr8k corpus. The table lists two of the reference transcriptions, and the output of the L1-Phone image2speech system. The last line, a “translation” from phones back into words, is provided for the convenience of the reader; it was not used or generated by the experiment.

Flickr8k Example #2

Ref #1: A boy +laugh+ in a blue top +laugh+ is jumping off some rocks in the woods.

Ref #2: A boy +um+ jumps off a tan rock.

Network: SIL +BREATH+ SIL EY M AE N IH Z JH AH M P IX NG IH N DH AX F AO R EH S T SIL.

Network (L1-Phones translated to words): A man is jumping in the forest.

Fig. 4. Example #2 from the Flickr8k corpus. The table lists two of the reference transcriptions, and the output of the L1-Phone image2speech system. The last line, a “translation” from phones back into words, is provided for the convenience of the reader; it was not used or generated by the experiment.

In these two examples, the phone strings shown can be read as English sentences that mislabel boys as men, but are otherwise almost plausible descriptions of the images: “Two men are riding a red and white race,” and “A man is jumping in the forest.”

Figs. 5 and 6 show similar examples from the SPEECH-COCO corpus. In the first example, the network has generated the sentence “A group of people standing on a beach,” which is perfectly correct. In the second example, the network generated “A person working at a metal down, and a red fire hydrant.” It is interesting that the neural net has noticed something about the image (the person working in the background) that was not noticed by either of the human transcribers.

CONCLUSIONS

This paper proposes a new task for artificial intelligence: image2speech, the task of generating spoken descriptions of input images, with no intermediate text. Image2speech is trained using a database of paired images and audio descriptions. Experimental results are presented using the Flicker-Audio and SPEECH-COCO corpora. Measured UER scores are better than chance, but less than perfect. Informal perusal of results shows that image2speech is able to generate intelligible words, and to sequence them into intelligible sentences. Future work will examine whether it is possible to create phone sets like L2-Phones or Pseudo-Phones that can be used with a lower error rate for this application; for example, we may try to use phone sets from a mismatched crowdsourcing experiment [32] as an intermediate stage between speech and images in a zero-resource language.

Ref #1: A group of men enjoying the beach, standing in the waves or surfing.

Ref #2: A group of people standing on a beach next to the ocean.

Network: # @ g r uu p @ v p ii p l= s t a n d i ng o n @ b ii ch #

Network (L1-Phones translated to words): A group of people standing on a beach.

Fig. 5. Example #1 from the MSCOCO corpus. The table lists two reference transcriptions, and the output of the L1-Phone image2speech system. The last line, a “translation” from phones back into words, is provided for the convenience of the reader; it was not used or generated by the experiment.

Ref #1: A, a black and white photo of a fire hydrant near a building.

Ref #2: Aa, a fire hydrant that is out next to a house.

Network: # @ p @@ s n= w oo k i ng @ t @ m e d l= d au n @ n d @ r e d f ai r h ai d r @ n t #.

Network (L1-Phones translated to words): A person working at a metal down and a red fire hydrant.

Fig. 6. Example #2 from the MSCOCO corpus. The table lists two of the reference transcriptions, and the output of the L1-Phone image2speech system. The last line, a “translation” from phones back into words, is provided for the convenience of the reader; it was not used or generated by the experiment.

REFERENCES

1. J.-Y. Pan, H.-J. Yang, P. Duygulu, and C. Faloutsos, “Automatic image captioning,” in “Proc. IEEE Internat. Conf. Multimedia and Expo (ICME),” (2004).

2. G. Adda, S. Stüker, M. Adda-Decker, O. Ambouroue, L. Besacier, D. Blachon, M. Bonneau-Maynard, P. Godard, F. Hamlaoui, D. Idiatov, G.-N. Kouarata, L. Lamel, E.-M. Makasso, A. Rialland, M. Van de Velde, F. Yvon, and S. Zerbian, “Breaking the unwritten language barrier: The BULB project,” in “Proceedings of the SLTU-2016 5th Workshop on Spoken Language Technologies for Under-resourced Languages,” (2016).

3. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” https://arxiv.org/abs/1409.1556 (2014). Accessed: 2017-09-14.

4. D. Frossard, “VGG16 in TensorFlow,” https://www.cs.toronto.edu/~frossard/post/vgg16/ (2016). Accessed: 2017-09-14.

5. G. Neubig, “eXtensible Neural Machine Translation,” https://github.com/neulab/xnmt (2017). Accessed: 2017-09-14.

6. A. Black, “CLUSTERGEN: A statistical parametric speech synthesizer using trajectory modeling,” in “Proc. Internat. Conf. Spoken Language Process. (ICSLP),” (2006), pp. 1762–1765.

7. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in “Proc. of ASRU,” (2011).

8. Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in “IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU),” (2015).

9. L. Ondel, L. Burget, and J. Cernocký, “Variational inference for acoustic unit discovery,” Procedia Comput. Sci. 2016, 80–86 (2016).

10. C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using Amazon’s Mechanical Turk,” in “Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk,” (2010).

11. D. Harwath and J. Glass, “Deep multimodal semantic embeddings for speech and images,” in “2015 IEEE Automatic Speech Recognition and Understanding Workshop,” (Scottsdale, Arizona, USA, 2015), pp. 237–244.

12. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick, “Microsoft COCO: Common objects in context,” Lect. Notes Comput. Sci. 8693, 740–755 (2014).

13. X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. Zitnick, “Microsoft COCO captions: Data collection and evaluation server,” https://arxiv.org/abs/1504.00325 (2015). Downloaded 2017-09-14.

14. W. Havard, L. Besacier, and O. Rosec, “SPEECH-COCO: 600k visually grounded spoken captions aligned to MSCOCO data set,” in “ISCA Workshop on Grounding Language Understanding (GLU2017),” (2017).

15. J. Deng, K. Li, M. Do, H. Su, and L. Fei-Fei, “Construction and Analysis of a Large Scale Image Ontology,” in “Vision Sciences Society,” (2009).

16. C. Fellbaum, WordNet: An Electronic Lexical Database (MIT Press, Cambridge, MA, 1998).

17. F. Chollet, “VGG16 model for Keras,” https://github.com/fchollet/keras/ (2017). Accessed: 2017-09-14.

18. K. Kilgour, M. Heck, M. Müller, M. Sperber, S. Stüker, and A. Waibel, “The 2014 KIT IWSLT speech-to-text systems for English, German and Italian,” in “Internat. Worksh. Spoken Language Translation (IWSLT),” (Lake Tahoe, 2014), pp. 73–79.

19. D. Povey, A. Ghoshal et al., “The Kaldi speech recognition toolkit,” Proc. ASRU (2011).

20. M. Mohri, F. Pereira, and M. Riley, “Speech recognition with weighted finite-state transducers,” in “Springer Handbook of Speech Processing,” (Springer, 2008), pp. 559–584.

21. O. Scharenborg, F. Ciannella, S. Palaskar, A. Black, F. Metze, L. Ondel, and M. Hasegawa-Johnson, “Building an ASR system for a low-resource language through the adaptation of a high-resource language ASR system: Preliminary results,” (2017). In review.

22. A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in “ICML ’06: Proc. 23rd International Conference on Machine Learning,” (2006), pp. 369–376.

23. K. Kurihara, M. Welling, and Y. Teh, “Collapsed variational Dirichlet process mixture models,” in “Proc. IJCAI,” (2007).

24. G. Neubig, C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A. Anastasopoulos, M. Ballesteros, D. Chiang, D. Clothiaux, T. Cohn, K. Duh, M. Faruqui, C. Gan, D. Garrette, Y. Ji, L. Kong, A. Kuncoro, G. Kumar, C. Malaviya, P. Michel, Y. Oda, M. Richardson, N. Saphra, S. Swayamdipta, and P. Yin, “DyNet: The dynamic neural network toolkit,” https://arxiv.org/pdf/1701.03980.pdf (2017). Accessed: 2017-09-14.

25. P. Arthur, G. Neubig, and S. Nakamura, “Incorporating discrete translation lexicons into neural machine translation,” in “Empirical Methods in Natural Language Processing (EMNLP),” (2016).

26. T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An adaptive algorithm for mel-cepstral analysis of speech,” in “Internat. Conf. Acoust. Speech and Sign. Process. (ICASSP),” (1992).

27. T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Mixed excitation for HMM-based speech synthesis,” in “Proc. Eurospeech,” (2001), pp. 2263–2266.

28. K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” in “Proc. ICASSP,” (2000).

29. S.-H. Chen, S.-H. Hwang, and Y.-R. Wang, “An RNN-based prosodic information synthesizer for Mandarin text-to-speech,” IEEE Trans. Speech Audio Process. 6, 226–239 (1998).

30. P. Muthukumar and A. Black, “Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis,” in “Proc. ICASSP,” (2014).

31. A. Black and P. Muthukumar, “Random forests for statistical speech synthesis,” in “Proc. Interspeech,” (2015), pp. 1211–1215.

32. M. Hasegawa-Johnson, P. Jyothi, D. McCloy, M. Mirbagheri, G. di Liberto, A. Das, B. Ekin, C. Liu, V. Manohar, H. Tang, E. C. Lalor, N. Chen, P. Hager, T. Kekona, R. Sloan, and A. K. Lee, “ASR for under-resourced languages from probabilistic transcription,” IEEE/ACM Trans. Audio, Speech Lang. 25, 46–59 (2017).

Mark Hasegawa-Johnson has been on the faculty at the University of Illinois since 1999, where he is currently a Professor of Electrical and Computer Engineering. He received his Ph.D. in 1996 at MIT, with a thesis titled "Formant and Burst Spectral Measures with Quantitative Error Models for Speech Sound Classification," after which he was a post-doc at UCLA from 1996-1999. Prof. Hasegawa-Johnson is a Fellow of the Acoustical Society of America, and a Senior Member of IEEE and ACM. He is currently Treasurer of ISCA, and Senior Area Editor of the IEEE Transactions on Audio, Speech and Language. He has published 254 peer-reviewed journal articles and conference papers in the general area of automatic speech analysis, including machine learning models of articulatory and acoustic phonetics, prosody, dysarthria, non-speech acoustic events, audio source separation, and under-resourced languages.

Prof. Alan W Black is a world leader in the area of speech synthesis. He is a principal author of the Festival Speech Synthesis System (http://festvox.org/festival), a free software system which has been used by a large number of academic and industrial groups throughout the world. He was the pioneer of unit selection speech synthesis, where appropriate sub-word units are automatically selected from large databases of natural speech. This helped move the field of speech synthesis from a rule-driven approach to a data-driven approach. Prof. Black has over 170 refereed published papers in many aspects of speech and language technologies. He was area program chair for the 5th ISCA Workshop on Speech Synthesis, and General co-chair for Interspeech 2006. With Keiichi Tokuda he initiated the annual Blizzard Challenge evaluation for corpus-based synthesis techniques, the largest speech synthesis evaluation forum. He was an elected member of the IEEE Speech and Language Technical Committee and the ISCA Board.

Lucas Ondel is currently a PhD student at the Brno University of Technology. He received a bachelor's degree in Mobile and Embedded Systems from Université Claude Bernard Lyon 1 in Lyon and a master's degree in software engineering from Université d'Avignon et des Pays de Vaucluse in Avignon. His current research interests include, among others, Bayesian machine learning and novel acoustic modeling techniques applied to low-resourced languages.

Odette Scharenborg (PhD) is a visiting researcher at the Department of Intelligent Systems, Delft University of Technology, Netherlands (formerly an associate professor at the Centre for Language Studies, Radboud University Nijmegen, The Netherlands). Her research interest focuses on narrowing the gap between automatic and human spoken-word recognition, and she has published widely in both fields. She co-organised the Interspeech 2008 Consonant Challenge, which aimed at promoting comparisons of human and machine speech recognition in noise. She initiated the EU Marie Curie ITN "Investigating Speech Processing In Realistic Environments" (2012-2015). In 2017, she co-led a six-week Frederick Jelinek Memorial Summer Workshop on the topic of the automatic discovery of grounded linguistic units for languages without orthography. She recently finished a five-year project funded by the Dutch NWO investigating non-native spoken-word recognition in noise. She is currently an elected member of the ISCA Board, where she co-chairs the Conferences and Technical Committees.

Francesco Ciannella is currently a Natural Language Generation architect at Cisco Systems. He received his Master of Science in Intelligent Information Systems at the Language Technologies Institute, Carnegie Mellon University. Francesco does research in Human-Computer Interaction, Data Mining and Artificial Intelligence. He was involved in the project 'Baby Talk - Language acquisition in children from 6 to 18 months,' and in the project "The Speaking Rosetta Stone: Discovering Grounded Linguistic Units for Languages without Orthography," which was selected for funding as a Jelinek Speech and Language Technology (JSALT) workshop in summer 2017.