
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2249–2254, Lisbon, Portugal, 17-21 September 2015. ©2015 Association for Computational Linguistics.

Learning Word Meanings and Grammar for Describing Everyday Activities in Smart Environments

Muhammad Attamimi 1, Yuji Ando 1, Tomoaki Nakamura 1, Takayuki Nagai 1, Daichi Mochihashi 2, Ichiro Kobayashi 3, Hideki Asoh 4

1 The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi, Tokyo, Japan
2 Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo, Japan
3 Ochanomizu University, 2-1-1 Otsuka, Bunkyo-ku, Tokyo, Japan
4 National Institute of Advanced Industrial Science and Technology, 1-1-1 Umezono, Tsukuba, Ibaraki, Japan

{m att, ando, naka t}@apple.ee.uec.ac.jp, [email protected], [email protected], [email protected], [email protected]

Abstract

If intelligent systems are to interact with humans in a natural manner, the ability to describe daily life activities is important. To achieve this, sensing human activities by capturing multimodal information is necessary. In this study, we consider a smart environment for sensing activities with respect to realistic scenarios. We next propose a sentence generation system from observed multimodal information in a bottom-up manner using multilayered multimodal latent Dirichlet allocation and Bayesian hidden Markov models. We evaluate the grammar learning and sentence generation as a complete process within a realistic setting. The experimental result reveals the effectiveness of the proposed method.

1 Introduction

Describing daily life activities is an important ability of intelligent systems. In fact, we can use this ability to achieve a monitoring system that is able to report on an observed situation, or to create an automatic diary of a user. Recently, several studies have been performed to generate sentences that describe images using Deep Learning (Vinyals et al., 2014; Fang et al., 2014; Donahue et al., 2014; Kiros et al., 2015). Although these results were good, we are interested in unsupervised frameworks. This is necessary to achieve a system that can adapt to the user, that is, one that can learn a user-unique language and generate it automatically. Moreover, the use of crowdsourcing should be avoided to respect the privacy of the user. Regarding this, studies on sentence generation from RGB videos have been discussed in (Yu and Siskind, 2013; Regneri et al., 2013). A promising result for language learning has been shown in (Yu and Siskind, 2013), and a quite challenging effort to describe cooking activities was made in (Regneri et al., 2013). However, these studies rely only on visual information, while we aim to build a system that is able to describe everyday activities using multimodal information. To realize such a system, we need to consider two problems. The first problem is the sensing of daily life activities. In this paper, we utilize a smart house (Motooka et al., 2010) for sensing human activities. Thanks to the smart house, multimodal information such as visual, motion, and audio data can be captured. The second problem to be tackled is verbalization of the observed scenes. To solve this problem, a multilayered multimodal latent Dirichlet allocation (mMLDA) was proposed in (Attamimi et al., 2014).

In this paper, we propose a sentence generation system from observed scenes in a bottom-up manner using mMLDA and a Bayesian hidden Markov model (BHMM) (Goldwater and Griffiths, 2007). To generate sentences from scenes, we need to consider the words that represent the scenes and their order. Here, mMLDA is used to infer words for given scenes. To determine the order of words, inspired by (Kawai et al., 2014), a probabilistic grammar that considers syntactic information is learned using the BHMM. In this study, the order of concepts is generated by sampling from the learned grammar. The word selection for each generated concept is then performed using the observed data. Moreover, a language model that represents the relationship between words is also used to calculate


Figure 1: Language learning and sentence generation system.

Figure 2: Multimodal information acquisition.

the transition probability between them. Considering the transition probability at the word level, a lattice of word candidates corresponding to the concept sequence can be generated. Therefore, sentence generation can be thought of as the problem of finding the word sequence with the highest probability in the lattice of word candidates, which can be solved by the Viterbi algorithm. Finally, sampling from the learned grammar is performed multiple times to generate sentence candidates, and the most probable one is selected.

2 Proposed method

2.1 Overview

Figure 1 illustrates the overall system of the proposed language learning and sentence generation. In this study, we use a smart environment for sensing multimodal information. The system shown in Figure 2 is part of a smart house (Motooka et al., 2010) that is used to capture multimodal information.

Figure 3: Graphical model of mMLDA (an integrated concept above the object, motion, and place concepts; each lower concept has its observed modality — object, angle, or position — and word nodes).

Here, an RFID tag is attached to an object to enable the object information to be read using a wearable tag reader. To capture motion, five sensors, each consisting of a 3-axis acceleration and a 3-axis gyroscope sensor, are attached to the upper body, as shown in Figure 2. Moreover, a particle filter-based human tracker (Glas et al., 2007) applied to four laser range finders is used to estimate the location of a person while performing an action. This setup is designed to demonstrate that language can be learned and generated from real human actions. Ultimately, our goal is sensing based on image recognition.

The acquired multimodal data are then processed, resulting in bag-of-words (BoW) and bag-of-features (BoF) (Csurka et al., 2004) representations. Using mMLDA (see Section 2.2), various concepts can be formed from the multimodal data. Given teaching sentences, the connection between words and concepts can be learned based on mMLDA and the BHMM, which is trained with mutual information (MI) providing the initial values. In addition, a bigram model of words is estimated and used as a score when reordering the words inferred from multimodal information according to the grammar. A morphological analyzer for segmenting the words in a sentence is also necessary in the proposed system; we use the publicly available parser MeCab (Kudo et al., 2004). In the future, we plan to use the unsupervised morphological analysis technique proposed in (Mochihashi et al., 2009).
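For concreteness, the following is a minimal sketch (not the authors' code) of the text side of this pipeline: segmenting Japanese teaching sentences with MeCab and counting the words into a BoW vector. It assumes the mecab-python3 binding; the function names and simplified vocabulary handling are our own.

```python
# Hedged sketch: MeCab word segmentation and BoW counting for teaching sentences.
import MeCab
from collections import Counter

tagger = MeCab.Tagger("-Owakati")        # "wakati" output: space-separated surface forms

def segment(sentence):
    """Split one teaching sentence into words."""
    return tagger.parse(sentence).split()

def bow_vector(sentences, vocab):
    """Count the words of all teaching sentences for one scene into a BoW vector."""
    counts = Counter(w for s in sentences for w in segment(s))
    return [counts.get(w, 0) for w in vocab]
```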

2.2 mMLDA

Figure 3 shows the graphical model of mMLDA used in this paper. Here, $z$ represents the integrated category (concept), whereas $z_O$, $z_M$, and $z_P$ represent the object, motion, and place concepts, respectively. In the bottom layer (lower panel of Figure 3), $w^m \in \{w^o, w^{wO}, w^a, w^{wM}, w^l, w^{wP}\}$ represents the multimodal information obtained from each object, motion, and place. Here, $w^o$, $w^a$, and $w^l$ denote the multimodal information obtained respectively from the object used in an action, the motion of the person while using the object, and the location of the action. Further, $w^{wC} \in \{w^{wO}, w^{wM}, w^{wP}\}$ denotes word information obtained from the teaching sentences. Observation information is acquired using the system shown in Figure 2. A brief explanation of each observation follows.

For object information, an $N_o$-dimensional vector $w^o = (o_1, o_2, \cdots, o_{N_o})$ is used, where $N_o$ denotes the number of objects. In this vector, each $o_i$ takes a value of 0 or 1, and $o_i$ is set to 1 if the object with index $i$ is observed. Moreover, all of the teaching sentences are segmented into words and represented by a BoW as word information. Motion is segmented according to the object used, and a sequence of 15-dimensional feature vectors is acquired for each motion. Using BoF, the acquired feature vectors are vector quantized, resulting in a 70-dimensional vector. The acquired two-dimensional human positions are processed using BoF to construct a 10-dimensional vector as place information.

In mMLDA, the latent variables that represent the upper and lower concepts, $z$ and $z_C \in \{z_O, z_M, z_P\}$, are learned simultaneously. Gibbs sampling is applied to the marginalized posterior probability of the latent variables to learn the model from the observed data $w^m$ (Attamimi et al., 2014).
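As an illustration of the BoF step described above, here is a minimal sketch (not the authors' code) of vector quantizing motion feature sequences with k-means. The 15-dimensional features and 70-codeword codebook follow the text; the use of scikit-learn and all names are our own assumptions.

```python
# Hedged sketch: bag-of-features (BoF) by k-means vector quantization (Csurka et al., 2004).
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(feature_sequences, n_codewords=70, seed=0):
    """Learn a codebook from all training motion feature vectors (each row is 15-dim)."""
    all_features = np.vstack(feature_sequences)
    return KMeans(n_clusters=n_codewords, random_state=seed).fit(all_features)

def bof_histogram(feature_sequence, codebook):
    """Quantize one motion segment into a fixed-length BoF count vector."""
    codes = codebook.predict(feature_sequence)            # one codeword per frame
    return np.bincount(codes, minlength=codebook.n_clusters)

# Usage: one histogram per motion segment (segmented by the object in use).
# codebook = build_codebook(train_segments)               # train_segments: list of (T_i, 15) arrays
# w_a = bof_histogram(train_segments[0], codebook)        # 70-dim motion observation
```

The same quantization, with a 10-codeword codebook over the two-dimensional positions, would yield the place observation.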

2.3 Language learning and generation

2.3.1 Word inference

In this study, word information is obtained from the teaching sentences and employed for all concepts, as shown in Figure 3. Considering that appropriate words to express each concept exist, a criterion to measure the correlation between words and concepts is needed. At the start of grammar learning, MI, which measures the mutual dependence of two stochastic variables, is used. A word is therefore considered to express a category when the MI between the word and the category is large; on the other hand, a word with small MI is identified as a functional word. This determination is used only as an initial value in the syntactic learning and need not be strictly correct. Once the grammar is learned, we can utilize the BHMM's parameters $P(w^w \mid c)$ to infer a word $w^w$ from the observed data $w^m_{\mathrm{obs}}$ as
$$P(w^{wC} \mid w^m_{\mathrm{obs}}, c) \propto \max_k P(w^{wC} \mid c)\, P(w^{wC} \mid k)\, P(k \mid w^m_{\mathrm{obs}}, c),$$
where $P(w^{wC} \mid k)$ and $P(k \mid w^m_{\mathrm{obs}}, c)$ can be estimated from mMLDA (Attamimi et al., 2014), $k$ is a category of concept $c' \in \{\mathrm{object}, \mathrm{motion}, \mathrm{place}\}$, and $c \in \{c', \mathrm{functional}\}$. It should be noted that $P(w^{wC} \mid k)$ and $P(k \mid w^m_{\mathrm{obs}}, c)$ are treated as uniform distributions for "functional," since they cannot be inferred from the observed data using mMLDA; in this case, we rely on the syntactic information learned by the BHMM.
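The following is a hedged sketch of this word-inference rule. The three probability tables (the BHMM emission $P(w \mid c)$, the mMLDA word distribution $P(w \mid k)$, and the category posterior $P(k \mid w^m_{\mathrm{obs}}, c)$) are assumed to be given as arrays; the names are ours.

```python
# Hedged sketch of P(w | obs, c) ∝ max_k P(w|c) P(w|k) P(k|obs, c).
import numpy as np

def infer_words(P_w_given_c, P_w_given_k, P_k_given_obs, top_K=3):
    """Return indices of the top-K words for one concept c and one observation.

    P_w_given_c:   (V,)   BHMM emission P(w | c)
    P_w_given_k:   (V, K) mMLDA word distribution per category k
    P_k_given_obs: (K,)   category posterior for the observed scene, P(k | obs, c)
    """
    # max over categories k of P(w|k) P(k|obs,c), per word w, times the emission P(w|c)
    score = P_w_given_c * np.max(P_w_given_k * P_k_given_obs[None, :], axis=1)
    return np.argsort(score)[::-1][:top_K], score

# For the "functional" concept, P(w|k) and P(k|obs,c) are uniform, so the
# score reduces to the BHMM emission P(w|c) alone.
```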

2.3.2 Grammar learning using BHMM

Thanks to mMLDA and the BHMM, appropriate words to represent the observed information can be inferred. Given a teaching sentence as a sequence of words, the BHMM can infer the corresponding sequence of concepts. In the learning phase, the MI-based concept selection for each word is used as the initial value of the BHMM. Here, the grammar is defined as the concept transition probability $P(C_t \mid C_{t-1})$, which is estimated using Gibbs sampling, where $C_t \in c$ represents the concept of the $t$-th word in the sentence. In addition, a language model representing the bigram model of words in the teaching sentences is also used for generating sentences.
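To make these two ingredients concrete, here is a sketch (under assumed data structures, not the authors' code) of the MI-based initialization of word-concept assignments and of estimating the concept-transition "grammar" by counting; the full model resamples the assignments with the BHMM's Gibbs sampler rather than fixing them.

```python
# Hedged sketch: MI-based initialization and concept transition estimation.
import numpy as np

def word_concept_mi(counts):
    """counts[w, k] = co-occurrence count of word w with concept category k.
    Returns each word's contribution to the total word-category MI, used to
    rank how strongly a word associates with content concepts."""
    p = counts / counts.sum()
    pw, pk = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        mi = np.where(p > 0, p * np.log(p / (pw * pk)), 0.0)
    return mi.sum(axis=1)

def transition_probs(concept_sequences, n_concepts, smoothing=0.1):
    """Estimate P(C_t | C_{t-1}) from concept-tagged teaching sentences (BOS/EOS included)."""
    counts = np.full((n_concepts, n_concepts), smoothing)
    for seq in concept_sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev, cur] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```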

Motion           Object                                 Place
Drink (1)        Juice (1)                              Sofa (1)
                 Tea (4)                                Dining room (2)
Eat (2)          Cookies (2)                            Dining room (2)
                 Chocolate (3)                          Living room (3)
Shake (3)        Tea (4)                                Sofa (1)
                 Dressing (5)                           Kitchen (4)
Pour (4)         Tea (4)                                Kitchen (4)
                 Juice (1)                              Living room (3)
Put on (5)       Dressing (5)                           Dining room (2)
                 Honey (6)                              Kitchen (4)
Throw (6)        Ball (7)                               Sofa (1)
                 Plushie (8)                            Bedroom (5)
Wipe (7)         Dustcloth (9)                          Kitchen (4)
                 Tissue (10)                            Dining room (2)
Turn on (8)      Remote control (air conditioner) (11)  Living room (3), Bedroom (5)
Open (turn) (9)  Tea (4)                                Living room (3)
                 Honey (6)                              Dining room (2)
Wrap (10)        Plastic wrap (12)                      Dining room (2)
                 Aluminum foil (13)                     Kitchen (4)
Hang (11)        Shirt (14)                             Bedroom (5)
                 Parka (15)                             Living room (3)
Write on (12)    Notebook (16)                          Bedroom (5)
                 Textbook (17)                          Sofa (1)
Open (13)        Refrigerator (18)                      Kitchen (4)
                 Microwave (19)                         Kitchen (4)
                 Closet (20)                            Bedroom (5)
Read (14)        Textbook (17)                          Bedroom (5)
                 Magazine (21)                          Sofa (1)
Spray (15)       Deodorizer (22)                        Living room (3), Bedroom (5)
Scrub (16)       Scourer (23)                           Kitchen (4)
                 Sponge (24)                            Kitchen (4)

Table 1: Object, motion, and place correspondences (numbers in parentheses represent the category index).


Figure 4: Examples of (a) actual images, (b) captured multimodal information, and (c) generated sentences for three scenes (Eating, Throwing, and Hanging). In each example, B, P, and G indicate the sentence structure in Japanese grammar generated by the baseline method, generated by the proposed method, and the correct sentence, respectively; the bottom line gives the meaning of the generated sentence, and words marked in red were generated incorrectly. For the Eating scene, G and P are {dining room, in, cookies, the, eat} ("Eating the cookies in the dining room."), while B is {dining room, with, cookies, the, eat, eat}. For the Throwing scene, G and P are {sofa, on, ball, the, throw} ("Throwing the ball on the sofa."), while B is {sofa, with, ball, to, throw, throw}. For the Hanging scene, G and P are {living room, in, parka, the, hang} ("Hanging the parka in the living room."), while B is {living room, in, hang}.

2.3.3 Sentence generation for observed scenes

First, concepts are sampled $N$ times from the beginning-of-sentence symbol "BOS" until the end-of-sentence symbol "EOS" according to the learned grammar. Let the $n$-th ($n \in \{1, 2, \cdots, N\}$) sequence of concepts that excludes "BOS" and "EOS" be $C^n = \{C^n_1, \cdots, C^n_t, \cdots, C^n_{T_n}\}$, where $T_n$ denotes the number of sampled concepts, which corresponds to the length of the sampled sentence.

Next, the word that corresponds to each concept $C^n_t$ is estimated. Here, for the given observed information $w^m_{\mathrm{obs}}$, the top-$K$ words that correspond to concept $C^n_t$ and have the highest probabilities, $w^n_t = \{w^n_{t1}, w^n_{t2}, \cdots, w^n_{tK}\}$, are used. Hence, the set of all words for a sequence of concepts $C^n$ can be written as $W^n = \{w^n_1, w^n_2, \cdots, w^n_{T_n}\}$. Therefore, $K^{T_n}$ patterns of candidate sentences can be considered for $C^n$ and the corresponding words $W^n$. Each candidate sentence $S^n$ is selected from these patterns and has the following probability:
$$P(S^n \mid C^n, W^n, w^m_{\mathrm{obs}}) \propto \prod_t P(C^n_t \mid C^n_{t-1})\, P(w^n_t \mid w^m_{\mathrm{obs}}, C^n_t)\, P(w^n_t \mid w^n_{t-1}). \quad (1)$$

For the observed information, the most probable sentence is selected from the $N$ sequences of concepts with their sets of words. Here, the sentence $S^n$ that maximizes Eq. (1) is determined for each sequence of concepts. Because many patterns of $S^n$ exist, the Viterbi algorithm is applied to reduce the computational cost of finding the most probable sentence. Thus, the set consisting of the highest-probability sentence for each sequence of concepts can be written as $S = \{S^1, \cdots, S^n, \cdots, S^N\}$.

We can select the final sentence from $S$ by choosing the most probable candidate. However, long sentences tend to have low probability and are therefore less likely to be selected. To cope with this problem, the adjustment coefficient
$$\ell(S^n) = \frac{L_{\max} - L_{S^n}}{\sum_n^N L_{S^n}} \sum_n^N \log P(S^n \mid C^n, W^n, w^m_{\mathrm{obs}})$$
is introduced, where $L_{S^n}$ denotes the length of sentence $S^n$ and $L_{\max}$ represents the maximum sentence length in $S$. Using $\ell(S^n)$, the adjusted logarithmic probability of a sentence is calculated as $\log \hat{P}(S^n \mid C^n, W^n, w^m_{\mathrm{obs}}) = \log P(S^n \mid C^n, W^n, w^m_{\mathrm{obs}}) + \omega\, \ell(S^n)$, where $\omega$ is a weight that controls the length of the sentences; a large weight leads to longer sentences. The final sentence $S$ is determined as $S = \mathrm{argmax}_{S^n \in S} \log \hat{P}(S^n \mid C^n, W^n, w^m_{\mathrm{obs}})$.
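The core of this step is a Viterbi search over the word lattice of one sampled concept sequence, scored with Eq. (1). The following is a minimal sketch under our own assumptions about how the probability tables are stored (log-space arrays); it is not the authors' implementation.

```python
# Hedged sketch: Viterbi over the candidate-word lattice for one concept sequence.
import numpy as np

def viterbi_sentence(concept_logtrans, word_logobs, bigram_logprob):
    """
    concept_logtrans: (T,)      log P(C_t | C_{t-1}) for the sampled concept sequence
    word_logobs:      (T, K)    log P(w_{t,j} | obs, C_t) for K candidate words per slot
    bigram_logprob:   (T, K, K) log P(w_{t,j} | w_{t-1,i}); row 0 of slot 0 is used
                                for the first slot (previous word = BOS), by assumption
    Returns the best candidate index per slot and the sentence log score.
    """
    T, K = word_logobs.shape
    delta = concept_logtrans[0] + word_logobs[0] + bigram_logprob[0, 0, :]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # score of reaching word j at slot t from word i at slot t-1
        cand = delta[:, None] + bigram_logprob[t] + word_logobs[t][None, :]
        back[t] = np.argmax(cand, axis=0)
        delta = np.max(cand, axis=0) + concept_logtrans[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(delta))

# The final sentence is then chosen among the N sampled concept sequences after
# adding the length-adjustment term omega * l(S^n) to each sentence's log score.
```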

3 Experiments

The acquisition system shown in Figure 2 was used to capture multimodal information from human actions. Table 1 shows the actions, which were performed twice by each of three subjects, resulting in a total of 195 multimodal data with 1170 sentences. We then divided the data into training data (99 multimodal data with 594 sentences) and test data (96 multimodal data with 576 sentences). Some examples of the acquired multimodal data are shown in Figure 4(b). Using the training data, various concepts were formed by mMLDA, and the categorization accuracies for object, motion, and place were 100.00%, 52.53%, and 95.96%, respectively. Motion similarity was responsible for the false categorization of motion concepts.


                       # of words   Baseline   Proposed
w/o functional words       78        65.38%     73.08%
w/ functional words        98          –        68.37%

Table 2: Concept selection results.

Since our goal is to generate sentences from observed scenes, these results are used as a reference rather than being compared with a baseline.

To evaluate the concept selection of words, the 98 words in the teaching sentences were used. We compared the results of concept selection with hand-labeled ones. Table 2 shows the accuracy of concept selection. Here, we excluded the functional words (leaving 78 words) for a fair comparison with the baseline method (Attamimi et al., 2014). One can see that better results are achieved by the proposed method. It is clear that concept selection is improved by using the BHMM, indicating that a better grammar can be learned with this model.

Next, the learned grammar was used to generate sentences. To reduce the randomness of the results, sentence generation was conducted 10 times for each data item. To verify sentence generation quantitatively, we evaluated the sentences automatically using the BLEU score (Papineni et al., 2002). Figure 5 depicts the 2- to 4-gram BLEU scores. Since functional words are not considered in (Attamimi et al., 2014), we used our grammar and performed the sentence generation proposed in (Attamimi et al., 2014) as the baseline method. One can see from the figure that the BLEU scores of the proposed method outperform those of the baseline method in all cases. It can be said that the sentences generated by the proposed method are of better quality than those generated by the baseline method.
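As an illustration of this automatic evaluation, the sketch below computes 2- to 4-gram BLEU with NLTK. The choice of toolkit, the smoothing function, and the token-list data layout are our assumptions; the paper does not specify an implementation.

```python
# Hedged sketch: corpus-level BLEU at n-gram orders 2-4, as in Figure 5.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_n(references, hypotheses, n):
    """references: list (per scene) of lists of reference token lists;
    hypotheses: list of generated token lists, one per scene."""
    weights = tuple([1.0 / n] * n)                  # uniform weights up to n-grams
    chencherry = SmoothingFunction().method1        # avoid zero scores on short sentences
    return corpus_bleu(references, hypotheses, weights=weights,
                       smoothing_function=chencherry)

# scores = [bleu_n(refs, hyps, n) for n in (2, 3, 4)]
```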

Moreover, we also manually evaluated the generated sentences by asking four subjects (college students who understand Japanese) whether the sentences were: correct in both grammar and meaning (E1), grammatically correct but incorrect in meaning (E2), grammatically incorrect but correct in meaning (E3), or incorrect in both grammar and meaning (E4). The average rates of E1, E2, E3, and E4 are shown in Table 3. We can see that the proposed method outperforms the baseline method, giving high rates of E1 and E2 and a low rate of E4.

      Grammar     Meaning     Baseline            Proposed
E1    correct     correct     (23.21 ± 5.28)%     (45.39 ± 3.02)%
E2    correct     incorrect   (35.07 ± 9.32)%     (49.79 ± 3.77)%
E3    incorrect   correct     (11.34 ± 5.59)%     (2.79 ± 2.39)%
E4    incorrect   incorrect   (30.38 ± 10.54)%    (2.03 ± 2.10)%

Table 3: Evaluation results of generated sentences.

Figure 5: BLEU scores (2-gram, 3-gram, and 4-gram) of the sentences generated by the baseline and proposed methods.

Because we want to generate sentences that explain actions, incorrect motion inference would lead to incorrect sentence generation. Examples of E2 are "Eating the plastic wrap in the dining room" and "Opening the dressing in the kitchen." These sentences are grammatically correct but do not express the scenes correctly because the words that represent the motion are wrong. Hence, the misclassification that occurred in the motion concept formation was responsible for the incorrect meaning of the generated sentences. Figure 4(c) shows the sentences generated from the given scenes (Figure 4(a)). We can see that meaningful and natural sentences that explain the observed scenes can be generated using the proposed method.

4 Conclusion

In this paper, we proposed an unsupervised method to generate natural sentences from observed scenes in a smart environment using mMLDA and a BHMM. In the smart environment, multimodal information can be acquired for realistic scenarios. Thanks to mMLDA, various concepts can be formed, and an initial determination of functional words can be made by assuming a weak connection between concepts and words as measured by MI. The possibility that grammar can be learned by the BHMM by considering syntactic information has also been shown. We conducted experiments to verify the proposed sentence generation, and promising preliminary results were obtained. In future work, we aim to implement a nonparametric Bayesian model that will be able to estimate the number of concepts automatically.

Acknowledgments

This work is partly supported by JSPS KAKENHI 26280096.


References

Muhammad Attamimi, Muhammad Fadlil, Kasumi Abe, Tomoaki Nakamura, Kotaro Funakoshi, and Takayuki Nagai. 2014. Integration of Various Concepts and Grounding of Word Meanings Using Multi-layered Multimodal LDA for Sentence Generation. In Proc. of IEEE/RSJ International Conference on Intelligent Robots, pp. 2194–2201.

Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, and Cedric Bray. 2004. Visual Categorization with Bags of Keypoints. In Proc. of ECCV International Workshop on Statistical Learning in Computer Vision.

Jeffrey Donahue, Lisa Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2014. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. Technical Report No. UCB/EECS-2014-180.

Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, C. Lawrence Zitnick, and Geoffrey Zweig. 2014. From Captions to Visual Concepts and Back. In Proc. of IEEE International Conference on Computer Vision and Pattern Recognition.

Dylan F. Glas, Takahiro Miyashita, Hiroshi Ishiguro, and Norihiro Hagita. 2007. Laser Tracking of Human Body Motion Using Adaptive Shape Modeling. In Proc. of IEEE/RSJ International Conference on Intelligent Robots, pp. 602–608.

Sharon Goldwater and Thomas L. Griffiths. 2007. A Fully Bayesian Approach to Unsupervised Part-of-Speech Tagging. In Proc. of the 45th Annual Meeting of the Association for Computational Linguistics, pp. 744–751.

Yuji Kawai, Yuji Oshima, Yuki Sasamoto, Yukie Nagai, and Minoru Asada. 2014. Computational Model for Syntactic Development: Identifying How Children Learn to Generalize Nouns and Verbs for Different Languages. In Proc. of Joint IEEE International Conferences on Development and Learning and Epigenetic Robotics, pp. 78–84.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2015. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. In Trans. of the Association for Computational Linguistics.

Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying Conditional Random Fields to Japanese Morphological Analysis. In Proc. of Conference on Empirical Methods in Natural Language Processing, pp. 230–237.

Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling. In Proc. of the Association for Computational Linguistics, pp. 100–108.

Nobuhisa Motooka, Ichiro Shiio, Yuji Ohta, Koji Tsukada, Keisuke Kambara, and Masato Iguchi. 2010. Ubiquitous Computing House Project: Design for Everyday Life. Journal of Asian Architecture and Building Engineering, 8:77–82.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proc. of the Association for Computational Linguistics, pp. 311–318.

Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding Action Descriptions in Videos. Trans. of the Association for Computational Linguistics, 1:25–36.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2014. Show and Tell: A Neural Image Caption Generator. arXiv:1411.4555 [cs.CV].

Haonan Yu and Jeffrey M. Siskind. 2013. Grounded Language Learning from Video Described with Sentences. In Proc. of the Association for Computational Linguistics, pp. 53–63.
