
SceneSeer: 3D Scene Design with Natural Language

Angel X. Chang, Mihail Eric, Manolis Savva, and Christopher D. Manning
Computer Science Department, Stanford University
{angelx, meric, msavva, manning}@cs.stanford.edu

[Figure 1 diagram: input text ("There is a living room with a red couch and a TV") flows through Parsing, Inference, and Generation stages, drawing on a spatial knowledge base (relative positions, support hierarchy, spatial relations, e.g. p(chair|desk)) learned from 3D scenes and 3D models, to produce output 3D scenes that are refined with textual interactions such as "Put a clock on the wall", "Put a painting on the wall", "Look at the painting", and "Move the painting to the right".]

Figure 1. Illustration of the SCENESEER architecture. Input text is parsed into a scene representation and expanded through inference using knowledge learned from a corpus of 3D scenes and 3D models. The representation is then used to generate a 3D scene that can be rendered and manipulated iteratively through textual commands.

ABSTRACT
Designing 3D scenes is currently a creative task that requires significant expertise and effort in using complex 3D design interfaces. This effortful design process stands in stark contrast to the ease with which people can use language to describe real and imaginary environments. We present SCENESEER: an interactive text to 3D scene generation system that allows a user to design 3D scenes using natural language. A user provides input text from which we extract explicit constraints on the objects that should appear in the scene. Given these explicit constraints, the system then uses a spatial knowledge base learned from an existing database of 3D scenes and 3D object models to infer an arrangement of the objects forming a natural scene matching the input description. Using textual commands, the user can then iteratively refine the created scene by adding, removing, replacing, and manipulating objects. We evaluate the quality of 3D scenes generated by SCENESEER in a perceptual evaluation experiment where we compare against manually designed scenes and simpler baselines for 3D scene generation. We demonstrate how the generated scenes can be iteratively refined through simple natural language commands.

INTRODUCTION
Designing 3D scenes is a challenging creative task. Expert users expend considerable effort in learning how to use complex 3D scene design tools. Still, immense manual effort is required, leading to high costs for producing 3D content in video games, films, interior design, and architectural visualization. Despite the conceptual simplicity of generating pictures from descriptions, systems for text-to-scene generation have only achieved limited success. How might we allow people to create 3D scenes using simple natural language?

Current 3D design tools provide a great amount of control over the construction and precise positioning of geometry within 3D scenes. However, most of these tools do not allow for intuitively assembling a scene from existing objects, which is critical for non-professional users. As an analogue, in real life few people are carpenters, but most of us have bought and arranged furniture. For the purposes of defining how to compose and arrange objects into scenes, natural language is an obvious interface. It is much easier to say "Put a blue bowl on the dining table" than to retrieve, insert, and orient a 3D model of a bowl. Text to 3D scene interfaces can empower a broader demographic to create 3D scenes for games, interior design, and virtual storyboarding.

Text to 3D scene systems face several technical challenges. Firstly, natural language is typically terse and incomplete. People rarely mention many facts about the world since these facts can often be safely assumed. Most desks are upright and on the floor, but few people would mention this explicitly. This implicit spatial knowledge is critical for scene generation but hard to extract. Secondly, people reason about the world at a much higher level than typical representations of 3D scenes (using the descriptive phrase "table against wall" vs. a 3D transformation matrix).


The semantics of objects and their approximate arrangement are typically more important than the precise and abstract properties of geometry. Most 3D scene design tools grew out of the traditions of Computer Aided Design and architecture, where precision of control and specification is much more important than for casual users. Traditional interfaces allow for comprehensive control but are typically not concerned with high-level semantics.

SCENESEER allows users to generate and manipulate 3D scenes at the level of everyday semantics through simple natural language. It leverages spatial knowledge priors learned from existing 3D scene data to infer implicit, unmentioned constraints and resolve view-dependent spatial relations in a natural way. For instance, given the sentence "there is a dining table with a cake", we can infer that the cake is most likely on a plate and that the plate is most likely on the table. This elevation of 3D scene design to the level of everyday semantics is critical for enabling intuitive design interfaces, rapid prototyping, and coarse-to-fine refinements.

In this paper, we present a framework for the text to 3D scene task and use it to motivate the design of the SCENESEER system. We demonstrate that SCENESEER can be used to generate 3D scenes from terse, natural language descriptions. We empirically evaluate the quality of the generated scenes with a human judgment experiment and find that SCENESEER can generate high quality scenes matching the input text. We show how textual commands can be used interactively in SCENESEER to manipulate generated 3D scenes.

BACKGROUND

3D Design Interfaces
Current research typically focuses on low-level interfaces for 3D modeling. There is much work on input methodologies and schemes for easing navigation in 3D [14, 15] and manipulation of primitives such as curves [17]. However, there is little work on 3D scene manipulation at a semantic level. Recent work has explored the avenue of parsing natural language for design problem definitions in the context of mechanical computer-aided design software [5]. Our motivation is similar, but we focus on the more open-ended setting of natural language 3D scene generation and manipulation.

Another line of work focuses on defining higher-level semantic APIs for scripting 3D animations and storyboards [7, 13]. This prior work finds that people overwhelmingly prefer higher-level manipulation and specification in terms of semantic concepts such as "in front", "to the left", and "on top of" rather than low-level manipulation. The focus of this work was on defining simple APIs for designing animated stories, whereas we focus on demonstrating a natural language interface for static 3D scene design. We are closer to early seminal work in placing simple geometric objects in 3D through textual commands [6].

More recently, Chaudhuri et al. [4] have demonstrated a novel 3D model design interface where users can control the desired values of high-level semantic attributes (e.g., "aerodynamic", "scary") to interactively combine object parts. We similarly allow users high-level semantic control, but our focus is on 3D scenes, where manipulating the layout of objects comes with a different set of challenges compared to object assembly.

Automatic Scene Layout
Recent research in computer graphics has focused on learning how to automatically lay out 3D scenes given training data. Prior work on scene layout has focused largely on room interiors and determining good furniture layouts by optimizing energy functions that capture the quality of a proposed layout. These energy functions are encoded from interior design guidelines [16] or learned from input scene data [10]. Knowledge of object co-occurrences and spatial relations is represented by simple models such as mixtures of Gaussians on pairwise object positions and orientations. Methods to learn scene structure have been demonstrated using various data sources, including simulation of human agents in 3D scenes [11, 12], and analysis of supporting contact points in scanned environments [18]. Though SCENESEER is also a system for generating 3D scenes, it focuses on providing an interactive natural language interface for this task.

Text to Scene Systems
Early work in textual interfaces for 3D scene manipulation has addressed simplified scenarios with micro-worlds consisting of simple geometric shapes. The SHRDLU [20] and PUT [6] systems were pioneers in parsing natural language instructions, but generalization of their approach to more realistic scenarios similar to the real world is challenging. More recent work on the WordsEye system [8, 9] and other similar approaches [19] has demonstrated more general text-driven 3D scene generation that can handle complex scenes. The authors compellingly show the benefit of text to scene generation but note that their systems are restricted due to a lack of implicit spatial knowledge. As a result, unnatural language, such as "the stool is 1 feet to the south of the table", must be used to fully specify the scene.

Most recently, Chang et al. [3] have demonstrated how linguistic and non-linguistic spatial knowledge can be learned directly from existing scene data and leveraged when parsing natural language. We build upon this prior work and focus on interactive scene design where the user does not have to specify all details of the scene at once, in contrast to prior text to scene systems such as WordsEye. With SCENESEER the user can dynamically manipulate a 3D scene with simple textual commands, resembling dialogue systems.

APPROACH OVERVIEW
How might we create an interactive text to 3D scene system? The user should be able to describe and manipulate a scene with concise natural language. To make this possible, we need to parse the input text to a set of explicitly provided constraints on objects and their arrangement. We also need to expand these constraints with implicit "common sense" facts (e.g., most of the time plates go on tables, not on the floor). Once we've inferred implicit constraints, we can generate a candidate scene that can be rendered and viewed by the user for further interactive refinement with textual commands.


Based on the systems presented in [2, 3], we frame the interactive text-to-scene problem in a probabilistic formulation that covers both scene generation and interactive scene operations. We show how these two previous systems can be viewed as specific instantiations of our framework.

SCENESEER relies on a spatial knowledge base that is learned from a corpus of 3D scenes and 3D models (see Figure 1). The learning procedure and components of this spatial knowledge base are based on the prior work of Chang et al. [3], and we use the same corpus of 3D scenes and component 3D models, consisting of about 12,490 mostly indoor objects with associated textual categories and tags. Our approach for extracting object relative positions, support hierarchies, support surfaces, and spatial relations is also based on that of Chang et al., and is described in more detail in the appendix.

We define our problem as the task of taking text describing a scene as input, and generating a plausible 3D scene described by that text as output. More concretely, based on the input text, we select objects from a dataset of 3D models and arrange them to generate output scenes. See Figure 1 for an illustration of the system architecture. We break the system down into several subtasks:

Semantic Parsing: Parse the input textual description of a concrete scene into a scene template consisting of constraints on the objects present and spatial relations between them.

Scene Inference: Automatically expand the scene template to account for implicit constraints not specified in the text.

Scene Generation: Using the above scene template and prior knowledge on the spatial arrangement of objects, sample the template, select a set of objects to be instantiated, and optimize their arrangement into a scene.

Scene Interaction: Allow the user to interactively adjust the scene through direct manipulation and textual commands.

This breakdown is useful for tackling the hard problem of text-to-scene and allows us to address subproblems independently or jointly. We choose this breakdown as it provides a separation between subproblems that fall within the graphics, NLP, and interaction design communities.

Representation
There is a big gap between the representations typically used for 3D object and scene rendering, and the high-level semantics that people assign to scenes. Here we define a scene template representation following Chang et al. [3]'s approach to make explicit the information necessary for connecting higher-level semantics to the lower-level geometric representations. Since natural language usually expresses high-level semantics, we can view the text-to-scene task as a problem of first extracting a higher-level scene template representation and then mapping it into a concrete geometric representation.

Scene Template
A scene template t = (O, C, C_s) consists of a set of object descriptions O = {o_1, ..., o_n} and constraints C = {c_1, ..., c_k} on the relationships between the objects. A scene template also has a scene type C_s. This provides a high-level yet unambiguous representation of scene structure.

Each object o_i has properties associated with it such as category label, basic attributes such as color and material, and number of occurrences in the scene. For constraints, we focus on spatial relations between objects, expressed as predicates of the form supported_by(o_i, o_j) or left(o_i, o_j) where o_i and o_j are recognized objects.

Geometric Scene
We refer to the concrete geometric representation of a scene as a "geometric scene". A geometric scene s consists of a set of 3D model instances {i_1, ..., i_n} where each model instance i_j = (m_j, T_j) is represented by model m_j in the model database and the transformation matrix T_j. The model represents the physical appearance (geometry and texture) of the object while the transformation matrix encodes the position, orientation, and scaling of the object. Working directly with such low-level representations of scenes is unnatural for people, which is a factor in the difficulty of learning current 3D scene design interfaces. We generate a geometric scene from a scene template by selecting appropriate models from a 3D model database and determining transformations that optimize their layout to satisfy spatial constraints.
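To make the two representations concrete, here is a minimal sketch of how a scene template and a geometric scene could be encoded as data structures. This is illustrative only, not the authors' implementation; the class and field names (ObjectDesc, SceneTemplate, ModelInstance, GeometricScene) are hypothetical and simply mirror the notation above.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import numpy as np


@dataclass
class ObjectDesc:
    """One object description o_i in a scene template."""
    category: str                                             # e.g. "plate"
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. {"color": "red"}
    count: int = 1                                            # number of occurrences


@dataclass
class SceneTemplate:
    """t = (O, C, C_s): objects, relational constraints, scene type."""
    objects: List[ObjectDesc]
    constraints: List[Tuple[str, int, int]]  # e.g. ("supported_by", i, j)
    scene_type: str = "room"


@dataclass
class ModelInstance:
    """i_j = (m_j, T_j): a 3D model id plus a 4x4 transform."""
    model_id: str
    transform: np.ndarray  # encodes position, orientation, and scale


@dataclass
class GeometricScene:
    instances: List[ModelInstance]


# Example template for "There is a sandwich on a plate."
template = SceneTemplate(
    objects=[ObjectDesc("sandwich"), ObjectDesc("plate")],
    constraints=[("on", 0, 1)],  # object 0 (sandwich) on object 1 (plate)
)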

Model Formulation
We take a probabilistic view and model the task as that of estimating the distribution of possible scenes P(s|u) given the input utterance u. This allows us to incorporate prior knowledge to perform inference, and to leverage recent advances in machine learning for learning from data. The distribution P(s|u) can be sampled from to generate plausible scenes to present to the user.

We can further decompose P(s|u) into:

P(s|u) = P(t|u) P(t'|t, u) P(s|t', t, u)

where P(t|u) is the probability of a scene template t given an utterance u, and t' is a completed scene template. In our pipelined system, we will assume that s is conditionally independent of t and u given t', and that t' is conditionally independent of u given t, to get

P(s|u) = P(t|u) P(t'|t) P(s|t')

A more sophisticated system would jointly estimate P(s|u). In our framework, Semantic Parsing corresponds to estimating P(t|u), Scene Inference to estimating P(t'|t), and Scene Generation to estimating P(s|t'). The deterministic parsing model of Chang et al. [3] represents P(t|u) as a delta probability at the selected t.
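The factorization above maps directly onto a three-stage pipeline. The following sketch shows that structure with stub functions; the function bodies are placeholders (the real stages use CoreNLP parsing, learned priors, and layout optimization), so only the control flow should be read as corresponding to the framework.

import random


def parse(utterance):
    """Semantic Parsing: estimate P(t|u). Here a deterministic stub that
    would return a scene template for the input text."""
    return {"objects": ["couch", "tv"], "constraints": [], "scene_type": "living room"}


def infer(template):
    """Scene Inference: estimate P(t'|t) by adding implicit objects and
    support constraints (stubbed)."""
    completed = dict(template)
    completed["objects"] = template["objects"] + ["room"]
    return completed


def generate(completed_template):
    """Scene Generation: estimate P(s|t') by selecting models and sampling
    a layout (stubbed with random placeholders)."""
    return {"instances": [(obj, random.random()) for obj in completed_template["objects"]]}


def text_to_scene(utterance):
    # P(s|u) = P(t|u) P(t'|t) P(s|t') under the pipeline independence assumptions.
    t = parse(utterance)
    t_complete = infer(t)
    return generate(t_complete)


scene = text_to_scene("There is a living room with a red couch and a TV")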

We decompose scene generation into two parts: Object Selection and Scene Layout. Object selection consists of identifying a likely set of models {m_j} from our model database given a complete scene template t. Scene layout is the problem of arranging the objects and finding appropriate transforms {T_j} for the models {m_j} and scene template t.

P(s|t) = P({m_j}|t) P({m_j, T_j}|{m_j}, t)


Again, a more sophisticated system may choose to estimate P(s|t) jointly. Note that in the equation above, {m_j} denotes the set of all models and {m_j, T_j} denotes the set of all (model, transformation matrix) tuples (m_j, T_j).

To further simplify the problem, we assume that we can select the model for each object independently. Thus, we have

P({m_j}|t) = ∏_j P(m_j | o_j)

This assumption clearly does not hold for many scenes and scene templates (e.g., matching furniture sets).

For scene layout, we estimate P({m_j, T_j}|{m_j}, t) using a layout score L described in more detail in the Appendix.

To support Scene Interactions, we follow a similar model, except now we are given a starting scene s and an utterance u, and we want to determine the desired scene s'. We can model this as estimating P(s'|s, u). For simplicity, we will assume that the utterance consists of one scene operation O that takes s and transforms it to s'. Thus we can break down P(s'|s, u) = P(s'|O, s, u) P(O|s, u). Here, given a specific starting scene s and the utterance u expressing a desired operation, we want to find the operation O that can be applied on s to yield s'. This problem can be decomposed into the parsing, inference, and generation steps that we use for regular scene generation as well. The difference is that we now estimate an operation O.

SEMANTIC PARSING
During semantic parsing we take the input text and create a scene template that identifies the objects and the relations between them. We follow the approach of Chang et al. [3] to process the input text using the Stanford CoreNLP pipeline (http://nlp.stanford.edu/software/corenlp.shtml), identify visualizable objects and physical properties of the objects, and extract dependency patterns for spatial relations between the objects. The parsed text is then deterministically mapped to the scene template or scene operations. An interesting avenue for future research is to automatically learn how to map text using more advanced semantic parsing methods.

SCENE INFERENCE
After we obtain a scene template with the explicitly stated constraints, we expand it to include inferred objects and implicit constraints. As an example, for each object in the scene, we use the support hierarchy prior P_support to find the most likely supporting parent object category. If there is no such parent object already in the scene, we add it to the scene. For instance, given the objects "computer" and "room", we infer that there should also be a "desk" supporting the "computer".

In addition to inferring missing objects, we infer the static support hierarchy for the objects in the scene. For each object, we first identify support constraints that must exist based on the scene template. For any objects that do not have a supporting parent, we sample P_support(C_p|C_c) restricted to the set of available object categories. In addition, we can infer missing objects based on the location (for instance, a bedroom typically contains a bed and a dresser).

Figure 2. Generated scene for "There is a sandwich on a plate." Note that the room, table, and other objects are not explicitly stated, but are inferred by the system.

Future work would involve enhancing this component with improved inference by identifying what objects must exist for each location, instead of sampling a random selection of objects. We can also consider additional contextual details such as the distinction between a messy desk with a variety of items versus a tidy desk with few items.
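As an illustration of the support-parent inference step, the sketch below repeatedly adds the most likely supporting parent category for any object that lacks one, using a hard-coded stand-in for the learned prior P_support. The probability values and the greedy argmax choice are assumptions made for the example; the system can also sample from P_support rather than always taking the mode.

# Hypothetical learned support prior P_support(parent_category | child_category),
# hard-coded here for illustration only.
P_SUPPORT = {
    "computer": {"desk": 0.8, "table": 0.15, "floor": 0.05},
    "plate":    {"table": 0.7, "desk": 0.2, "counter": 0.1},
    "desk":     {"room": 1.0},
    "table":    {"room": 1.0},
}


def infer_support_parents(objects):
    """Add the most likely supporting parent for any object that lacks one.

    `objects` is a list of category labels; returns the expanded category
    list plus the inferred (child, parent) support edges.
    """
    present = set(objects)
    edges = []
    queue = list(objects)
    while queue:
        child = queue.pop()
        prior = P_SUPPORT.get(child)
        if not prior:
            continue  # e.g. "room" needs no supporting parent
        parent = max(prior, key=prior.get)  # most likely supporting category
        if parent not in present:
            present.add(parent)
            queue.append(parent)  # the new parent may itself need support
        edges.append((child, parent))
    return sorted(present), edges


objects, support = infer_support_parents(["computer"])
# -> objects now include "desk" and "room"; support edges: computer->desk, desk->room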

SCENE GENERATION
Once we have a complete scene template, we need to select 3D models to represent the objects, trying to match any described attributes, and to arrange the models in the scene based on constraints. During this step we aim to find the most likely scene given the scene template and prior spatial knowledge.

Object Selection. We use the keywords associated with each object to query the model database. We select randomly from the top 10 results for variety and to allow the user to regenerate the scene with different models. This step can be enhanced to take into account correlations between objects (e.g., a lamp on a table should not be a floor lamp model).
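A rough sketch of this selection step, assuming a toy in-memory model database with tag lists; the scoring by keyword overlap and the database format are illustrative assumptions, not the actual retrieval engine.

import random


def select_model(keywords, model_db, top_k=10):
    """Rank models by keyword overlap with their tags and sample one of
    the top-k matches for variety (a sketch of the selection step)."""
    def score(entry):
        return len(set(entry["tags"]) & set(keywords))

    ranked = sorted(model_db, key=score, reverse=True)
    candidates = [m for m in ranked[:top_k] if score(m) > 0]
    return random.choice(candidates) if candidates else None


model_db = [
    {"id": "m101", "tags": ["bowl", "blue", "ceramic"]},
    {"id": "m102", "tags": ["bowl", "wooden"]},
    {"id": "m205", "tags": ["lamp", "floor"]},
]
chosen = select_model(["blue", "bowl"], model_db)  # one of the two bowls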

Object Layout. Given the selected models, the source scene template, and priors on spatial relations, we find an arrangement of the objects within the scene that maximizes the probability of the layout under the given scene template.

In order to compute how well the layout satisfies the constraints given by the scene template, we parameterize the layout of each object using semantically meaningful attributes: support parent parent_sup, support surface surf_sup, attachment surface surf_att, position on support surface pos_sup, orientation θ, and size σ. These semantic attributes allow us to do higher-level reasoning and interactions with the object.

We use a sampling approach to determine the position, orientation, and size of objects within the scene. We first traverse the support hierarchy in depth-first order, positioning the largest available child node and recursing. Child nodes are positioned by first selecting a supporting surface surf_sup on a candidate parent object through sampling of P_surfsup. Using P_surfatt, we find the most appropriate child attachment surface surf_att and orient the child node accordingly.


[Figure 3 panel inputs: "There is a desk and a computer." / "There is a desk with a chair and a poster. On the desk there is a red stapler." / "There is a living room with a red couch and a TV."]

Figure 3. Several example generated scenes for three different input descriptions.

Verb | Operation | Example text | Example parse | Influenced parameters
select | Select | select the chair on the right of the table | Select({chair}, {right(chair, table)}) | Sel
look, look at | LookAt | look at the lamp | LookAt({lamp}) | Cam
add, insert, place, put | Insert | add a lamp to the table | Insert({lamp}, {on(lamp, table)}) | +object
delete, remove | Remove | remove the lamp | Remove({lamp}) | −object
replace | Replace | replace the lamp with a vase | Replace({lamp}, {vase}) | +object, −object
move, place, put | Move | move the chair to the left | Move({chair}, {left(chair)}) | pos, parent_sup, surf_sup
enlarge, shrink | Scale | enlarge the bowl | Scale({bowl})(1.5) | σ

Table 1. Scene operations defined for our system. The natural language verbs are parsed to specific operations and subsets of the scene parameters are modified accordingly (last column).

We then sample possible placements (pos_sup, θ, σ) on surf_sup, ensuring that the node is not overhanging and that there are no collisions with other objects. Each sampled placement is scored with L. The resulting scene is rendered and presented to the user.
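The following sketch illustrates the depth-first, largest-first placement loop with rejection sampling in a heavily simplified 2D setting: support surfaces are reduced to axis-aligned rectangles, orientation and the layout score L are omitted, and all sizes and region extents are made-up example values.

import random


def layout(support_tree, sizes, regions, attempts=50):
    """Depth-first placement sketch: children of each node are placed
    largest-first within their parent's 2D support region (a stand-in for
    the sampled support surface), rejecting samples that collide with
    already-placed siblings."""
    placements = {}

    def place_children(parent):
        px_extent, py_extent = regions[parent]
        placed = []  # sibling bounding boxes on this support surface
        children = sorted(support_tree.get(parent, []),
                          key=lambda c: -sizes[c][0] * sizes[c][1])
        for child in children:
            w, d = sizes[child]
            for _ in range(attempts):
                # Bounded sampling keeps the child from overhanging the region.
                x = random.uniform(-(px_extent - w) / 2, (px_extent - w) / 2)
                y = random.uniform(-(py_extent - d) / 2, (py_extent - d) / 2)
                if all(abs(x - bx) >= (w + bw) / 2 or abs(y - by) >= (d + bd) / 2
                       for bx, by, bw, bd in placed):
                    placements[child] = (parent, x, y)  # position relative to parent
                    placed.append((x, y, w, d))
                    break
            if child in regions:
                place_children(child)  # recurse into objects that support others

    place_children("room")
    return placements


tree = {"room": ["table"], "table": ["plate"], "plate": ["sandwich"]}
sizes = {"table": (1.6, 0.9), "plate": (0.3, 0.3), "sandwich": (0.15, 0.1)}
regions = {"room": (5.0, 4.0), "table": (1.6, 0.9), "plate": (0.3, 0.3)}
print(layout(tree, sizes, regions))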

Figure 2 shows a generated scene along with the support hierarchy and input text. Even though the room, table, and other objects were not explicitly mentioned in the input, we infer that the plate is likely to be supported by a table and that there are likely to be other objects on the table. Without this inference, the user would need to be much more verbose, with text such as "There is a room with a table, a plate, and a sandwich. The sandwich is on the plate, and the plate is on the table."

Figure 3 shows several examples illustrating how we can generate many 3D scenes for the user to select from. Our system can infer many unstated but implicitly present objects (e.g., monitors are placed on the desks for functional computer setups). We can generate high quality scenes for localized arrangements such as the computer desks, for which there are many examples in the training 3D scene corpus. However, more complex scenes such as the living rooms exhibit some unnatural object positioning (e.g., the TV does not face the couch). For such cases, it is important to allow users to iteratively refine the generated scene.

SCENE INTERACTIONS
Once a scene is generated, the user can view the scene and manipulate it using both textual commands and mouse interaction. The system supports traditional 3D scene interaction mechanisms such as navigating the viewpoint with mouse and keyboard, and selection and movement of object models by clicking.

In addition, a user can give simple textual commands to select and modify objects, or to refine the scene.

To allow the user to interact with a generated scene via text, we define a set of high-level semantic operations for scene manipulation. These are higher-level operations than typically found in traditional scene editing software. Again, the text is treated as a set of constraints that we want the revised scene to satisfy, while trying to keep the revised scene as similar to the original scene as possible. More formally, given the original scene S and a scene operation O, we want to find the scene S' which is most similar to S, while still satisfying the constraints imposed by O.

To track the other elements of the scene, we maintain a scene state Z = (S, Sel, Cam) that consists of the scene S, the set of selected objects Sel, and the camera position Cam. Each operation can be defined as a function O : Z → Z'. We support several basic operations: Select changes the set of selected objects; Insert adds objects into the scene; Delete removes objects from the scene; Replace replaces objects in the scene with new objects; operations such as Move and Scale modify constraints on existing objects without changing the set of objects in the scene; and LookAt repositions the camera to focus on the selected objects. Table 1 summarizes the operations supported by our system. These basic operations demonstrate some simple scene manipulations through natural language. This set of operations can be extended, for example, to cover manipulation of parts of objects ("make the seat of the chair red").
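A minimal sketch of the scene state and of operations as functions Z → Z', using immutable records; the operation bodies are stubs that only update the object list, selection, and camera fields, whereas the real system also re-runs layout and rendering after each operation.

from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class SceneState:
    """Z = (S, Sel, Cam): the scene (here just object names), the current
    selection, and the camera position."""
    scene: Tuple[str, ...]
    selection: Tuple[str, ...] = ()
    camera: Tuple[float, float, float] = (0.0, 1.6, 3.0)


# Each operation is a function Z -> Z'; these are illustrative stubs.

def op_insert(state: SceneState, obj: str) -> SceneState:
    return replace(state, scene=state.scene + (obj,), selection=(obj,))


def op_remove(state: SceneState, obj: str) -> SceneState:
    return replace(state, scene=tuple(o for o in state.scene if o != obj), selection=())


def op_select(state: SceneState, obj: str) -> SceneState:
    return replace(state, selection=(obj,))


def op_look_at(state: SceneState, obj: str) -> SceneState:
    # A real LookAt would also optimize the view function f(c); here we
    # only record the selection change.
    return replace(state, selection=(obj,))


z = SceneState(scene=("couch", "tv"))
z = op_insert(z, "rug")    # "Put a rug in the room"
z = op_look_at(z, "rug")   # "Look at the rug"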

To interpret a textual scene interaction command, the system first parses the input text u into a sequence of scene operations (O_1, ..., O_k), identifying the resulting set of constraints that should hold. For each parsed scene operation O_i, the system then executes the scene operation by resolving the set of objects on which the operation should be performed and then modifying the scene state accordingly.


Figure 4. Left: initial scene. Right: after input "Look at vase", the camera zooms to the flower vase and the vase is selected (green highlight).


Command Parsing. We deterministically map verbs to possible actions as shown in Table 1. Multiple actions are possible for some verbs (e.g., "place" and "put" can refer to either Move or Insert). To differentiate, we assume new objects are introduced with the indefinite article "a" whereas old ones are modified with the definite article "the".
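A toy version of this verb mapping and article heuristic is sketched below; the whitespace tokenization and the VERB_OPS table are simplifications of Table 1, not the CoreNLP-based parser used by the system.

# Verb-to-operation map following Table 1 (simplified).
VERB_OPS = {
    "select": ["Select"], "look": ["LookAt"],
    "add": ["Insert"], "insert": ["Insert"],
    "place": ["Insert", "Move"], "put": ["Insert", "Move"],
    "delete": ["Remove"], "remove": ["Remove"],
    "replace": ["Replace"], "move": ["Move"],
    "enlarge": ["Scale"], "shrink": ["Scale"],
}


def parse_command(text):
    """Map the leading verb to candidate operations and, for ambiguous
    verbs like "place"/"put", apply the article heuristic: "a"/"an"
    introduces a new object (Insert), "the" refers to an existing one
    (Move)."""
    tokens = text.lower().split()
    verb = tokens[0]
    candidates = VERB_OPS.get(verb, [])
    if len(candidates) == 1:
        return candidates[0]
    if candidates == ["Insert", "Move"]:
        first_article = next((t for t in tokens[1:] if t in ("a", "an", "the")), None)
        return "Insert" if first_article in ("a", "an") else "Move"
    return candidates[0] if candidates else None


assert parse_command("put a clock on the wall") == "Insert"
assert parse_command("put the painting on the wall") == "Move"
assert parse_command("remove the lamp") == "Remove"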

Object Resolution. To allow interaction with the scene, we must resolve references to objects within a scene. Objects are disambiguated by category and view-centric spatial relations. In addition to matching objects by their categories, we use the WordNet hierarchy to handle hyponym or hypernym referents. Depending on the current view, spatial relations such as "left" or "right" can refer to different objects.

Camera Positioning. For the LookAt command, we first identify the set of objects being referred to. Given an utterance u of the form "look at X", we get the set of objects that are the most likely referent Sel = argmax_{o∈S} P(o|X). We perform a simple viewpoint optimization where we sample camera positions c and find the position that maximizes a view function f(c), giving us Cam_pos = argmax_c f(c).

Camera positions are sampled from 12 equally spaced points around the up-axis of the selected objects and at a fixed height slightly above the bounding box of the objects. The camera is targeted at the centroid of the selected objects. The view function we use is f(c) = vis_Sel(c) + b(scr_Sel(c)) + vis_all(c), where vis_Sel is the number of selected objects visible, vis_all is the percentage of all objects visible, and b is a function of scr_Sel, the percentage of the screen taken up by the selected objects (a linear ramp from b(0.2) = 0 to b(0.4) = 1).

This simple approach to camera positioning handles a task that is challenging for novice users of 3D design software to perform manually, yet critical for navigating and interacting with the environment.
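The sketch below shows the sampling scheme under simplifying assumptions: the 12 candidate positions lie on a circle around the target at an assumed viewing distance, and the view function f(c) is passed in as a callback because the visibility terms require a renderer.

import math


def choose_camera(target_center, target_radius, score_view, n_samples=12,
                  height_offset=0.5):
    """Sample camera positions on a circle around the up-axis of the
    selected objects, slightly above their bounding box, and keep the one
    with the highest view score. `score_view` stands in for f(c)."""
    cx, cy, cz = target_center
    dist = 2.5 * target_radius  # assumed viewing distance
    best = None
    for k in range(n_samples):
        angle = 2 * math.pi * k / n_samples
        cam = (cx + dist * math.cos(angle),
               cy + target_radius + height_offset,  # slightly above the bbox
               cz + dist * math.sin(angle))
        s = score_view(cam)
        if best is None or s > best[1]:
            best = (cam, s)
    return best[0]


# Toy scorer preferring positions with larger x (a placeholder for f(c)).
cam = choose_camera((0.0, 0.8, 0.0), 0.6, score_view=lambda c: c[0])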

Scene Modification. Based on the operation, we modify the scene by maximizing the probability of a new scene template given the requested action and the previous scene template.

The set of objects in the scene is modified according to theoperation:


Figure 5. Examples of the Replace operation. Top: "Replace the bowl with a red lamp". Middle: "Replace the table with a coffee table". Bottom: "Replace the table with a round table".

• Insert: Select a new object that satisfies the constraints from the model database and place it in the scene.
• Replace: Select a new object that satisfies the constraints from the model database and replace the objects to be replaced with the new object.
• Delete: Remove the old objects from the scene.

After the set of objects is determined, relayout is performed on the scene by attempting to satisfy the constraints while minimizing the change in position for all unmodified objects. For operations such as Move and Scale the set of objects remains the same, but their positions or sizes will need to change to satisfy the new constraints. When doing relayout, we do not vary all parameters, but just the set of influenced parameters for the operation (see Table 1).

Depending on the operation, operand objects, and their scene context, the resulting changes in the scene vary in complexity. The simplest operations only influence the target objects, while more complex operations require adjustment of supported objects or even surrounding objects. Figure 5 shows examples of the Replace operation with varying complexity. Replacing the bowl on the table with a table lamp is relatively easy since the system only needs to ensure the attachment point remains the same. However, when we replace the table with a coffee table in Figure 5 (middle), the bowl needs to be repositioned. In the bottom row of the figure we see a failure case due to the layout algorithm, since the chairs are not repositioned to accommodate the bigger round table. Instead the table is pushed to the side to avoid collisions. Manually performing a replace operation can be extremely tedious due to such object position dependencies, thus making this a valuable high-level operation. This illustrates how our approach attempts to bridge the semantic gap between high-level intent and low-level geometric manipulation. In our implementation, the scene operations are initiated by the user using text commands. It is also possible to incorporate these operations into a graphical user interface. For instance, Replace can be implemented in a GUI as a sequence of clicks to select an object, search a model database, and select a desired model to use as a replacement.


[Figure 6 panel inputs: 1: "There is a living room with a red couch and a TV." 2: "Put a rug in the room." 3: "Move the rug to the back." 4: "Enlarge the rug."]

Figure 6. A sequence of textual interactions to refine a living room scene by adding a rug and manipulating it.


Scene Refinement
With the set of high-level operations that we defined, a user can progressively refine a generated scene. Figure 6 shows a sequence of operations for adding a rug to the living room we generated earlier. Combined with traditional mouse and keyboard navigation, these textual interactions allow high-level scene editing operations that are interpreted within the current viewing context (e.g., moving the rug to the back of the room from the viewer's perspective in the third panel). Without such high-level operations, the necessary interaction would include several steps to find a rug model, place it, orient it, and manually adjust its position.

GENERATED SCENE EVALUATION
We evaluate the output of our system by asking people to judge how well generated scenes match given input descriptions. This is an appropriate initial evaluation since in a practical usage scenario, a scene matching the input description well would provide a good starting point for further refinement. We compare versions of our system contrasting the benefit of different components. We also establish an upper-bound baseline by asking people to manually design scenes corresponding to the same descriptions, using a simple scene design interface used in prior work [10].

Conditions
We compare seven conditions: basic, +sup, +sup+spat, +sup+prior, full, full+infer, and human. The basic condition is a simple layout approach which does not use any learned support, spatial relation, or placement priors.

Figure 7. Screenshot from our evaluation experiment. Participants were asked to rate how well the displayed scenes match the description on a 7-point Likert scale.

The conditions +sup, +sup+spat, +sup+prior, and full (which includes all three priors) test the benefit of adding these learned priors to the system. Finally, full+infer performs implicit inference for selecting and laying out additional objects, while human consists of the manually designed 3D scenes that people created given the descriptions. For each of these conditions we create a set of 50 scenes, one for each of the input textual descriptions. In total, we have 350 stimulus scene-description pairs to be rated by people during our experiment (see Figure 8 for example descriptions and scenes).

Participants
Participants were recruited online on Amazon Mechanical Turk and were required to be fluent speakers of English. We recruited a total of 97 participants for evaluating the quality of generated scenes with respect to reference textual descriptions. For the human condition, the scenes were created by a different set of 50 participants who were given the textual descriptions and asked to create a corresponding scene (see the last column of Figure 8).

Procedure
During the experiment, a randomly sampled set of 30 pairs of generated scenes and input textual descriptions was shown to each participant. The pairs were drawn from all conditions and presented in randomized order. The participants were asked to rate each pair on a 7-point Likert scale to indicate "how well the scene matches the description", with a score of 1 indicating a very bad match, and 7 indicating a very good match. The participants were instructed to consider three points in particular: (1) Are the objects mentioned in the description included in the scene? (2) Are the described relationships between the objects correct in the scene? and (3) Overall, does the scene fit the description? Figure 7 shows a screenshot of the UI that we used to carry out the experiment.

Design
The experiment was a within-subjects factorial design with the condition {basic, +sup, +sup+spat, +sup+prior, full, full+infer, human}, description {1...50}, and participant {1...97} as factors. The Likert rating for description-to-scene match was the dependent measure.


[Figure 8 panels: example scenes across conditions (basic, +sup, full, full+infer, human) for descriptions such as "There is a desk and a chair." and "There is a table and there are four chairs. There are four plates and there are four sandwiches."]

Figure 8. Some example textual descriptions and scenes generated by SCENESEER in different conditions, as well as scenes manually designed by people.

condition | mean rating
basic | 3.61
+sup | 4.22
+sup+spat | 4.72
+sup+prior | 4.90
full | 5.03
full+infer | 4.65
human | 5.89

Table 2. Average scene-description match ratings for each of the conditions in our experiment.


Results
In total, the 97 participants gave 2910 scene ratings for the 350 stimulus scene-description pairs, with 8.39 ratings on average per pair (standard deviation of 3.56 ratings).

Generated scene ratings
The mean ratings for each condition in our experiment are summarized in Table 2 and the rating distributions are plotted in Figure 9. We see that the basic condition receives the lowest average rating, while predictably the scenes designed by people receive the highest rating. Adding learned support, spatial relation parsing, and priors for relative position improves the ratings for the scenes generated by our system, and the full combined condition receives the highest average rating among the generated conditions. We note that the scenes generated with additional inferred objects (full+infer) actually receive a lower rating. We hypothesize that two factors may contribute to the lower rating for full+infer. Firstly, adding extra objects makes the scene layout more complicated and prone to exhibiting object selection or spatial relation errors. Secondly, inferred objects are not explicitly mentioned in the description, so participants may not have viewed them as significantly improving the quality of the scene (though they were instructed to consider additional objects positively if they add naturally to the appearance of the scene).


Figure 9. Distributions of scene-description match ratings for experiment conditions. The dark and light regions delineate the lower and upper quartiles respectively, and the whiskers extend to 150% of the interquartile range.

Statistical analysis
The statistical significance of the effect of the condition factor on the ratings was determined using a mixed effects model, with the condition as a fixed effect and the participant and description as random effects, since the latter two are drawn randomly from a large pool of potential participants and descriptions (we used the lme4 R package and optimized for maximum log-likelihood [1]). Most pairwise differences between conditions for mean scene rating were significant under Wilcoxon rank-sum tests with the Bonferroni-Holm correction (p < 0.05). The exceptions are the comparisons between +sup+spat, +sup+prior, and full, which were not significant.

Summary
The experimental results show that SCENESEER can generate plausible 3D scenes given input descriptions. The different components of the system that leverage learned support relations, spatial relationship parsing, and relative position priors all contribute towards improving the quality of the generated scenes. Implicitly inferred additional objects do not improve the scene-to-description match ratings given by people.

DISCUSSION

Limitations
The unoptimized implementation of our system takes several seconds to parse, infer, and generate a scene, depending mostly on the number of objects and relations expressed in the input.


Unfortunately, this makes interactive use slow in practical settings. However, the system can be easily optimized to within the range of real-time editing. A more important limitation is that we currently handle a limited domain of scenes. Parsing of scene descriptions with existing natural language processing methods is error-prone even for simple, everyday language, and the publicly available 3D model datasets are biased towards furniture and electronics, limiting the types of potential output scenes (e.g., "the flower garden outside the window" is a challenging case). Furthermore, generation of scenes with more organic, complex object geometry is challenging for our simplistic layout approach, and we have ignored the importance of stylistic consistency in object selection (e.g., different styles of chairs selected for the same table).

The current set of operations that we presented is constrained to fairly simple manipulations of the scene. An improved system would be able to process more complex interactions that require a deeper semantic understanding of the world. For example: "move the TV to face the couch" (requires knowledge of front sides and facing orientations), and "put a bowl of fruit on the second shelf of the bookcase" (requires knowledge of object parts and relative ordering from the bottom).

We have not explored the design choices involved in combining a high-level natural language interface such as SCENESEER with direct manipulation. Different operations might be more efficient to perform in one modality or the other, depending also on the particular scene context and viewpoint. Measuring the relative benefits of different combination strategies in specific tasks is an interesting avenue for future research.

Future Work
A promising direction for future work is to integrate systems such as SCENESEER with speech-to-text systems, allowing for voice-driven and more natural dialogue interactions. This could be particularly relevant in the context of increasingly popular virtual and augmented reality systems where text input is more cumbersome.

Leveraging the interactive nature of the system to improve spatial knowledge is an exciting avenue for future work. For instance, by observing where the user decides to manually place objects, we can improve our placement priors. A natural way for the user to provide feedback and corrections to system errors is through a dialogue-based interaction extending the imperative textual commands.

Another exciting direction for future work is to crowdsource the accumulation of spatial knowledge by scaling the system on web-based platforms. This will also provide an opportunity for broader user studies that can give insight into context-specific preferences for text versus direct manipulation interactions, and provide useful data for informing the design of future text-to-scene systems.

CONCLUSION
We presented SCENESEER, a text to 3D scene generation system with semantically-aware textual interactions for iterative 3D scene design. We evaluated the scenes generated by our system through a human judgment experiment and confirmed that we can generate scenes that match the input textual descriptions and approach the quality of manually designed scenes. We demonstrated that automatically generated 3D scenes can be used as starting points for iterative refinement through a set of high-level textual commands. Natural language commands are a promising mode of interaction for 3D scene design tasks since they abstract away much of the tedium of low-level manipulation. In this way, we can bridge the semantic gap between high-level user intent and low-level geometric operations for 3D scene design.

We believe that the domain of text to 3D scene design presents many challenging research problems at the intersection of computer graphics, HCI, and natural language processing. Our system presents a small step in the direction of enabling natural language interfaces for scene design. We believe that in the future, dialogue-based systems will make the scene design process as effortless as telling a story.

REFERENCES
1. Baayen, R. H., Davidson, D. J., and Bates, D. M. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language (2008).

2. Chang, A. X., Savva, M., and Manning, C. D. Interactive learning of spatial knowledge for text to 3D scene generation. In Proceedings of the ACL 2014 Workshop on Interactive Language Learning, Visualization, and Interfaces (ACL-ILLVI) (2014).

3. Chang, A. X., Savva, M., and Manning, C. D. Learning spatial knowledge for text to 3D scene generation. In EMNLP (2014).

4. Chaudhuri, S., Kalogerakis, E., Giguere, S., and Funkhouser, T. AttribIt: Content Creation with Semantic Attributes. In UIST (2013).

5. Cheong, H., Li, W., Shu, L., Tessier, A., Bradner, E., and Iorio, F. Natural language problem definition for computer-aided mechanical design. In CHI-DSLI Workshop 2014 Conference Proceedings (2014).

6. Clay, S. R., and Wilhelms, J. Put: Language-based interactive manipulation of objects. Computer Graphics and Applications (1996).

7. Conway, M., Audia, S., Burnette, T., Cosgrove, D., and Christiansen, K. Alice: Lessons learned from building a 3D system for novices. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM (2000), 486–493.

8. Coyne, B., and Sproat, R. WordsEye: An automatic text-to-scene conversion system. In SIGGRAPH (2001).

9. Coyne, R. E., Klapheke, A., Rouhizadeh, M., Sproat, R., and Bauer, D. Annotation tools and knowledge representation for a text-to-scene system.


10. Fisher, M., Ritchie, D., Savva, M., Funkhouser, T., and Hanrahan, P. Example-based synthesis of 3D object arrangements. SIGGRAPH Asia (2012).

11. Jiang, Y., Lim, M., and Saxena, A. Learning object arrangements in 3D scenes using human context. In ICML (2012).

12. Jiang, Y., and Saxena, A. Infinite latent conditional random fields for modeling environments through humans. RSS (2013).

13. Kelleher, C., and Pausch, R. Lessons learned from designing a programming system to support middle school girls creating animated stories. In Visual Languages and Human-Centric Computing, 2006. VL/HCC 2006. IEEE Symposium on, IEEE (2006), 165–172.

14. Khan, A., Mordatch, I., Fitzmaurice, G., Matejka, J., and Kurtenbach, G. ViewCube: A 3D orientation indicator and controller. In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, ACM (2008), 17–25.

15. McCrae, J., Mordatch, I., Glueck, M., and Khan, A. Multiscale 3D navigation. In Proceedings of the 2009 Symposium on Interactive 3D Graphics and Games, ACM (2009), 7–14.

16. Merrell, P., Schkufza, E., Li, Z., Agrawala, M., and Koltun, V. Interactive furniture layout using interior design guidelines. In SIGGRAPH (2011).

17. Owen, R., Kurtenbach, G., Fitzmaurice, G., Baudel, T., and Buxton, B. When it gets more difficult, use both hands: Exploring bimanual curve manipulation. In Proceedings of Graphics Interface 2005, Canadian Human-Computer Communications Society (2005), 17–24.

18. Rosman, B., and Ramamoorthy, S. Learning spatial relationships between objects. IJRR (2011).

19. Seversky, L. M., and Yin, L. Real-time automatic 3D scene generation from natural language voice and text descriptions. In Proceedings of the 14th Annual ACM International Conference on Multimedia (2006).

20. Winograd, T. Understanding natural language. Cognitive Psychology (1972).

APPENDIX
Scene Layout Score
The scene layout score is given by L = λ_obj L_obj + λ_rel L_rel, a weighted sum of an object arrangement score L_obj and a constraint satisfaction score L_rel, following Chang et al. [3]'s definition:

L_obj = Σ_{o_i} P_surf(S_n | C_{o_i}) Σ_{o_j ∈ F(o_i)} P_relpos(·)

L_rel = Σ_{c_i} P_rel(c_i)

where F(o_i) are the sibling objects and parent object of o_i. We use λ_obj = 0.25 and λ_rel = 0.75 for the results we present.
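A direct transcription of this score into code might look as follows, with the priors P_surf, P_relpos, and P_rel left as abstract callables supplied by the caller; the dummy probability values in the usage example are arbitrary.

def layout_score(objects, constraints, p_surf, p_relpos, p_rel,
                 lambda_obj=0.25, lambda_rel=0.75):
    """Weighted layout score L = lambda_obj * L_obj + lambda_rel * L_rel.

    `p_surf(o)`, `p_relpos(o, other)`, and `p_rel(c)` are assumed callables
    returning the corresponding prior probabilities for a candidate
    arrangement; `objects` maps each object to its siblings/parent F(o_i).
    This mirrors the formulas above but leaves the priors abstract."""
    l_obj = 0.0
    for obj, neighbours in objects.items():
        l_obj += p_surf(obj) * sum(p_relpos(obj, other) for other in neighbours)
    l_rel = sum(p_rel(c) for c in constraints)
    return lambda_obj * l_obj + lambda_rel * l_rel


# Toy usage with dummy priors.
score = layout_score(
    objects={"plate": ["table"], "sandwich": ["plate"]},
    constraints=[("on", "sandwich", "plate")],
    p_surf=lambda o: 0.9,
    p_relpos=lambda o, other: 0.5,
    p_rel=lambda c: 0.8,
)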

Spatial Knowledge Priors
We use similar definitions for spatial knowledge priors as those presented by Chang et al. [3], with updated support and attachment surface priors. Spatial knowledge priors are estimated using observations in scenes. To handle data sparsity we utilize our category taxonomy: if there are fewer than k = 5 support observations, we back off to a parent category in the taxonomy for more informative priors.

Object Occurrence Priors
Occurrence priors are given by simple Bayesian statistics of objects occurring in scenes: P_occ(C_o | C_s) = count(C_o in C_s) / count(C_s).

Support Hierarchy Priors
We estimate the probability of a parent category C_p supporting a given child category C_c as a simple conditional probability based on normalized observation counts: P_support(C_p | C_c) = count(C_c on C_p) / count(C_c).

Support and Attachment Surface Priors
Similarly, the parent support surface priors are given by: P_surfsup(s | C_c) = count(C_c on surface with s) / count(C_c). The parent supporting surface is featurized using the surface normal direction (up, down, horizontal) and whether the surface is interior (facing in) or exterior (facing out). For instance, a room has a floor, which is an upward-facing interior supporting surface.

The child attachment surface priors are given by: P_surfatt(s | C_c) = count(C_c attached at surface s) / count(C_c). Object attachment surfaces are featurized using the bounding box side: top, bottom, front, back, left, or right. For instance, posters are attached on their back side to walls, and rugs are attached on their bottom side to floors.

If there are no observations available, we use the model geometry to determine the support and attachment surface. For support surfaces we pick only upward-facing surfaces, while for attachment we assume 3D (blocky) objects are attached on the bottom, flat objects are attached on their back or bottom, and thin objects are attached on their side.

Relative Position Priors
We model the relative positions of objects based on their object categories and the current scene type: i.e., the relative position of an object of category C_obj is with respect to another object of category C_ref and for a scene type C_s. We condition on the relationship R between the two objects, whether they are siblings (R = Sibling) or child-parent (R = ChildParent), and define the relative position prior as P_relpos(x, y, θ | C_obj, C_ref, C_s, R), which we estimate using kernel density estimation.
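As a sketch of this estimation step, the code below fits a Gaussian kernel density estimate over a handful of made-up (x, y, θ) observations of a chair relative to a desk; the actual system conditions on category pair, scene type, and relationship R, which here is fixed implicitly by the choice of observations.

import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical observations of a chair's pose relative to a desk in office
# scenes: rows are (x, y, theta) in the desk's coordinate frame.
observations = np.array([
    [0.05, -0.60, 3.10],
    [-0.10, -0.55, 3.20],
    [0.00, -0.65, 3.05],
    [0.15, -0.58, 3.14],
    [-0.05, -0.62, 3.18],
]).T  # gaussian_kde expects shape (dims, n_samples)

# P_relpos(x, y, theta | C_obj=chair, C_ref=desk, C_s=office, R=Sibling)
# estimated with a Gaussian kernel density estimate.
p_relpos = gaussian_kde(observations)

# Density of a candidate placement: chair roughly in front of the desk, facing it.
candidate = np.array([[0.0], [-0.6], [3.14]])
print(p_relpos(candidate))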
