
AI Sensing for Robotics using Deep Learning based Visual and Language Modeling

Yuvaram Singh
HCL Technologies Limited, Analytics CoE
Noida, India, 201304
[email protected]

Kameshwar Rao JV
HCL Technologies Limited, Analytics CoE
Noida, India, 201304
[email protected]

Abstract

An artificial intelligence (AI) system should be capable of processing sensory inputs to extract both task-specific and general information about its environment. However, most existing algorithms extract only task-specific information. In this work, an innovative approach to the problem of processing visual sensory data is presented, utilizing a convolutional neural network (CNN). It recognizes and represents the physical and semantic nature of the surroundings in both human-readable and machine-processable formats. This work utilizes an image captioning model to capture the semantics of the input image and a modular design to generate a probability distribution over semantic topics. It gives any autonomous system the ability to process visual information in a human-like way and generates insights that are hardly possible with a conventional algorithm. A model and a data collection method are proposed.

1 Introduction

In a world gifted with visible light facilitating information sharing, living creatures have developed organs for sensing light to understand their surroundings. In an autonomous system, this information is captured in the IR, UV, and visible spectrum with sophisticated sensors and is processed using complex algorithms. At the Consumer Electronics Show (CES) 2020, Samsung presented Ballie, a personalized robot intended to be aware of its surroundings and to control IoT devices around the home to improve the environment. With companies targeting to launch smart home robots capable of following voice commands, there is a need to develop a system that can automatically understand the semantics of the environment and take appropriate decisions on its own.

Figure 1: Variety of scenarios that a human can describe comfortably. (a) A person cutting cake while others cheer. (b) Fire fighters are trying to put out fire in the building.

Recent work in scene understanding involved construction of a knowledge graph for visual semantic understanding (Jiang et al., 2019). The authors used an ontology graph in combination with visual captioning to describe the scene. Another approach to functional scene understanding was introduced using semantic segmentation (Wald et al., 2018). All these scene understanding approaches make a system specialized to certain tasks and working environments, while failing to generalize across various types of situations and to capture human emotions.

It is more efficient to make decisions based on a structured description of the scene than to work on raw pixel information. Fig. 1 shows scenarios where a human can easily interpret the meaning of the scene. It is easy to tell from Fig. 1b that firefighters are trying to put out the fire in the building. The same holds for Fig. 1a: a human can understand and explain the scene easily through a language representation.

In this work we propose an AI sensing system that can semantically interpret the environmental conditions, objects, relations, and activities carried out in the visual feed. These interpretations are converted into text for human understanding and into a probability distribution for the control system to process and take decisions. The main intention of this work is to have a neural network based sensor processing unit capable of extracting semantic context while deployed on low-powered compute hardware.

This paper is organized as follows:

• Section 2 explains the various modules of the proposed approach.

• Section 3 discusses the dataset and the considerations to make while implementing this approach.

2 Proposed Approach

In this work, a modular approach is proposed to represent the semantic content of the outside world perceived through a vision sensor. A detailed flow diagram of the proposed method is shown in Fig. 2. It consists of three sub-modules: a CNN feature extractor, a language module, and an environment context probability detector module. It combines visual, language, and context detection modules to assist the control unit in making decisions based on non-task-specific environment details.

2.1 CNN Feature Extractor

This module processes the visual feed and converts each frame into a feature tensor (f) which is used to generate a semantic understanding of the surroundings. This feature tensor (f) encodes the information present in the incoming frame. A CNN based feature extractor (Xu et al., 2015) trained for the image classification task on the ImageNet dataset (Deng et al., 2009) is used. A variety of CNN based pre-trained architectures are available to be used as feature extractors. Architectures such as MobileNet (Sandler et al., 2018), ResNet (He et al., 2016), InceptionNet (Szegedy et al., 2015) and DenseNet (Huang et al., 2017) have their own benefits and drawbacks. Based on the deployment hardware, the expected response time, and the nature of the environment, a specific architecture can be chosen.
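The following is a minimal sketch of how such a feature extractor could be wired up, assuming PyTorch/torchvision with a MobileNetV2 backbone pre-trained on ImageNet; it is an illustration, not the authors' code, and the function name extract_features and the 224x224 input resolution are assumptions.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ImageNet-pretrained backbone; keep only the convolutional trunk as the
# feature extractor (any of the architectures listed above could be swapped in).
backbone = models.mobilenet_v2(pretrained=True).features.eval()

# Standard ImageNet preprocessing applied to each incoming camera frame.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(frame: Image.Image) -> torch.Tensor:
    # Convert one visual frame into the feature tensor f.
    x = preprocess(frame).unsqueeze(0)      # (1, 3, 224, 224)
    with torch.no_grad():
        f = backbone(x)                     # (1, 1280, 7, 7) for MobileNetV2
    return f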

2.2 Language Module

In this module, the information in the feature tensor (f) is extracted and represented in a human-interpretable language (l). This is achieved by using a Long Short-Term Memory unit (LSTM) (Sak et al., 2014), a deep neural network (DNN) for generating sequential output (Xu et al., 2015). A combination of a soft-attention mechanism (Xu et al., 2015) and an LSTM is used to describe the contents extracted from the frame (Vinodababu, 2018). This is a recursive step where the execution comes to a halt when the end token <end> is predicted or the maximum sentence length is reached.

l = {w_0, w_1, w_2, ..., w_n},  where w_i ∈ R^k

Here R^k is the vector space of the tokenized vocabulary and l is the generated word sequence. A byproduct of having a language representation is the explainability of actions.

Caption generation happens recursively; to sample a word w from R^k, the model goes through the following process (see Fig. 3).

At a time step t:

• The attention mechanism computes the mask m_t for the feature tensor f using f and the hidden state H_{t-1}.

• f weighted by m_t, combined with the previously predicted word w_{t-1}, is passed to the LSTM along with the hidden state H_{t-1} and cell state C_{t-1} from the previous step.

• The LSTM outputs a probability distribution over the words in the vocabulary.

This process is carried out until the end token <end> is predicted or the maximum caption length is reached. The effectiveness of this module depends on the generation of a dense caption for the scene.
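The per-step computation above can be sketched as follows. This is a simplified, single-image illustration assuming PyTorch; the class and function names (CaptionDecoder, generate_caption), the layer sizes, and the greedy word choice are assumptions rather than the exact architecture of Xu et al. (2015).

import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=1280, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att = nn.Linear(feat_dim + hidden_dim, 1)    # scores each spatial location
        self.lstm = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)       # distribution over the vocabulary

    def step(self, f, w_prev, h_prev, c_prev):
        # f: (num_locations, feat_dim) flattened feature tensor of one frame.
        # Attention mask m_t computed from f and the previous hidden state H_{t-1}.
        scores = self.att(torch.cat([f, h_prev.expand(f.size(0), -1)], dim=1))
        m_t = F.softmax(scores, dim=0)                    # (num_locations, 1)
        context = (m_t * f).sum(dim=0, keepdim=True)      # f weighted by m_t
        # Weighted features plus the previous word go into the LSTM with (H_{t-1}, C_{t-1}).
        x = torch.cat([context, self.embed(w_prev)], dim=1)
        h, c = self.lstm(x, (h_prev, c_prev))
        probs = F.softmax(self.fc(h), dim=1)              # probability of each word
        return probs, h, c

def generate_caption(decoder, f, start_id, end_id, max_len=20):
    # Greedy decoding until <end> is predicted or max_len is reached.
    h = torch.zeros(1, 512)                               # matches hidden_dim above
    c = torch.zeros(1, 512)
    w = torch.tensor([start_id])
    words = []
    for _ in range(max_len):
        probs, h, c = decoder.step(f, w, h, c)
        w = probs.argmax(dim=1)
        if w.item() == end_id:
            break
        words.append(w.item())
    return words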

2.3 Environment Context Detector

The verbal representation from the language module is used to generate probability distributions over various groups of semantic context. The input sequence is tokenized, vectorized, and converted into a probability distribution using a fully connected network. The module is constructed from one or more neural networks operating in parallel, each performing a prediction over a different context. Fig. 4 provides the overall view of this module, where different fully connected networks (FCN) are used for prediction. The captions are tokenized and vectorized to act as input. Here GloVe embeddings (Pennington et al., 2014) are used to vectorize the sentence. The activation of the output layer can be either softmax or sigmoid, depending on the nature of the data. The topics of the context should be decided based on the workspace and the preference of the robotics designer.


Figure 2: A block diagram of the proposed model.

Figure 3: A flow diagram explaining how the language module predicts the next word in the sequence (Xu et al., 2015).

Fig. 4 shows the block diagram of the environment context detector.

E = [c_0, c_1, ..., c_d]

where c_i is the prediction vector of the i-th context net and E is the collection of c vectors. Here d is the desired number of context nets. The generated probability distribution is sent to the control system, which takes the final decision whether to react or not. The proposed solution serves as an add-on to the existing control system.
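A minimal sketch of one possible implementation of this module is given below; it is an assumption, not the paper's code. Captions are vectorized by averaging pre-trained GloVe vectors (the glove lookup table is assumed to be loaded elsewhere), and each context topic gets its own small fully connected head with a softmax (exclusive classes) or sigmoid (multi-label) output.

import torch
import torch.nn as nn

class ContextNet(nn.Module):
    # One fully connected head predicting a single semantic context.
    def __init__(self, embed_dim, num_classes, multi_label=False):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                                nn.Linear(128, num_classes))
        self.act = nn.Sigmoid() if multi_label else nn.Softmax(dim=1)

    def forward(self, v):
        return self.act(self.fc(v))                    # c_i: prediction vector of this context

class EnvironmentContextDetector(nn.Module):
    # Parallel heads, one per context topic (environment, situation, mood, ...).
    def __init__(self, embed_dim, context_sizes):
        super().__init__()
        self.heads = nn.ModuleList([ContextNet(embed_dim, n) for n in context_sizes])

    def forward(self, caption_vec):
        # E = [c_0, c_1, ..., c_d], handed over to the control system.
        return [head(caption_vec) for head in self.heads]

def vectorize(caption, glove):
    # Average the GloVe vectors of the caption tokens (one simple choice);
    # glove is assumed to map token -> torch.Tensor and to cover at least one token.
    tokens = caption.lower().split()
    vecs = [glove[t] for t in tokens if t in glove]
    return torch.stack(vecs).mean(dim=0, keepdim=True)  # (1, embed_dim)

# Example: heads for environment (5 classes), mood (4 classes), human presence (2 classes).
detector = EnvironmentContextDetector(embed_dim=300, context_sizes=[5, 4, 2])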

3 Dataset and Considerations

The CNN feature extractor is a pre-trained model trained on the ImageNet dataset (Deng et al., 2009) for classifying 1000 objects. The language module is trained using the COCO image captioning dataset (Lin et al., 2014), which consists of images and captions in the target language. A BLEU-1 score of 70.7 is achieved for the language module.
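For reference, a BLEU-1 score of this kind can be computed with NLTK as in the short sketch below; the reference and hypothesis captions shown are placeholders, not the actual evaluation data.

from nltk.translate.bleu_score import corpus_bleu

# Each hypothesis caption is scored against its list of reference captions.
references = [[["a", "person", "cutting", "a", "cake"]]]   # placeholder references
hypotheses = [["a", "person", "is", "cutting", "cake"]]    # placeholder model outputs
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))  # unigram precision only
print(f"BLEU-1: {100 * bleu1:.1f}")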

The dataset for the environment context module is similar to a text sentiment classification dataset. The input is a sentence and the labels are one-hot vectors of the target class. A dataset is created from a portion of the COCO captions where the semantic context topics are environment, situation, mood, presence of humans, and objects in the scene, as shown in Fig. 4.

Figure 4: A block diagram of the environment context detector.

There are several considerations to take into account while adopting this method; a few of them are:

• On-board compute capability to carry out DNN calculations.

• The robot's deployment environment and its nature.

• The actual intention and task of the robot.

• How the control system should react to the generated probability distribution.

4 Conclusion

The main objective of this work is to use neural networks to understand and represent the physical environment around the system. This work serves as an add-on to the existing control system by providing an additional set of inputs capturing semantic meaning. An image captioning based approach is used to obtain the semantic content of the surroundings, and it is represented as a probability distribution.


References

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708.

Chen Jiang, Steven Lu, and Martin Jagersand. 2019. Constructing dynamic knowledge graph for visual semantic understanding and applications in autonomous robotics. arXiv preprint arXiv:1909.07459.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Hasim Sak, Andrew Senior, and Francoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9.

Sagar Vinodababu. 2018. a-PyTorch-Tutorial-to-Image-Captioning. https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning.

Johanna Wald, Keisuke Tateno, Jurgen Sturm, Nassir Navab, and Federico Tombari. 2018. Real-time fully incremental scene understanding on mobile platforms. IEEE Robotics and Automation Letters, 3(4):3402–3409.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057.