![Page 1: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/1.jpg)
Text to 3D Scene Generationwith Rich Lexical Grounding
IdibonJuly 23, 2015
“There is a desk and there is a notepad on the desk.There is a pen next to the notepad.”
Angel Chang Will Monroe Manolis SavvaChristopher Potts Christoper D. Manning
Stanford University
![Page 2: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/2.jpg)
[SPOILERS]
![Page 3: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/3.jpg)
The art of 3D scene design
![Page 4: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/4.jpg)
The art of 3D scene design Call of Duty: Advanced Warfare[Activision / Sledgehammer Games]
![Page 5: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/5.jpg)
Call of Duty: Advanced Warfare[Activision / Sledgehammer Games]
Toy Story 3[Disney / Pixar]
The art of 3D scene design
![Page 6: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/6.jpg)
Call of Duty: Advanced Warfare[Activision / Sledgehammer Games]
Toy Story 3[Disney / Pixar]
“Modern: Plywood, Plastic & Polished Metal”[Homedit Interior Design & Architecture]
The art of 3D scene design
![Page 7: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/7.jpg)
Generating 3D scenes from text
![Page 8: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/8.jpg)
Generating 3D scenes from text
TOYS’ POV -- An idyllic day care classroom, filled with the happy bustle of four- and five-year-olds, playing with toys -- dinosaurs, a baby doll, a pink Teddy bear, a Ken doll. ...
A Tonka Truck races forward, then backs up in a quick 180 arc, revealing a large pink Teddy bear, LOTSO, in its bed. Lotso taps a Tinker Toy cane and the truck bed rises, “dumping” him out. Like Bob Hope stepping off the links in Palm Springs, Lotso exudes an easy, cheerful charisma.
(Screenplay by Michael Arndt)
![Page 9: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/9.jpg)
Scene generation pipelineThere is a room with a wooden desk and a black lamp. There is a chair to the right of the desk.
(Chang et al., 2014)
![Page 10: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/10.jpg)
Scene generation pipelineThere is a room with a wooden desk and a black lamp. There is a chair to the right of the desk.
parsing
(Chang et al., 2014)
![Page 11: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/11.jpg)
Scene generation pipelineThere is a room with a wooden desk and a black lamp. There is a chair to the right of the desk.
parsing
object selection
(Chang et al., 2014)
![Page 12: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/12.jpg)
Scene generation pipelineThere is a room with a wooden desk and a black lamp. There is a chair to the right of the desk.
parsing
layout
object selection
(Chang et al., 2014)
![Page 13: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/13.jpg)
Handling lexical variety
sofa
couch
loveseat
dresser
chest of drawers
cabinet
![Page 14: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/14.jpg)
Identifying object mentions
Wood table and four wood chairs in the center of the room
![Page 15: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/15.jpg)
Wood table and four wood chairs in the center of the room
Can we fix this by learning from data?
Identifying object mentions
![Page 16: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/16.jpg)
Dataset
There is a bed and there is a chair next to the bed.
![Page 17: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/17.jpg)
Dataset
There is a bed and there is a chair next to the bed.
![Page 18: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/18.jpg)
Dataset
There is a bed and there is a chair next to the bed.
![Page 19: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/19.jpg)
Structure of a 3D scene
![Page 20: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/20.jpg)
{ 'modelID': '7bdc0aac', 'position': [118.545639, 97.979499, 3.098599], 'scale': 0.087807, 'rotation': -1.088704}
Structure of a 3D scene
![Page 21: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/21.jpg)
{ 'modelID': '7bdc0aac', 'position': [118.545639, 97.979499, 3.098599], 'scale': 0.087807, 'rotation': -1.088704}
Field Value
name ellington armchair
id 7bdc0aac
tags armchair, chair, ellington, haughton, sam, seating, woodmark
category Chair
wnlemmas armchair
unit 0.028974
up [0, 0, 1]
front [0, -1, 0]
Structure of a 3D scene
![Page 22: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/22.jpg)
{ 'modelID': '7bdc0aac', 'position': [118.545639, 97.979499, 3.098599], 'scale': 0.087807, 'rotation': -1.088704}
Field Value
name ellington armchair
id 7bdc0aac
tags armchair, chair, ellington, haughton, sam, seating, woodmark
category Chair
wnlemmas armchair
unit 0.028974
up [0, 0, 1]
front [0, -1, 0]
WordNet
human-tagged keywords &categories
size & orientationsuggestions
Structure of a 3D scene
![Page 23: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/23.jpg)
Dataset
There is a bed and there is a chair next to the bed.
Floor to ceiling windows on back wall. Green bed with two pillows and black blanket. Lights recessed into right side wall. Light wood flooring. A chair is in the upper right hand corner
There is a bed on the side of the room. There is a chair in the corner, next to the windows.
I see a bed and a chair.
The room has three windows on one wall. There is a red bed in the back of the room. Along side the bed is a side chair that is red and white.
This room has a bed with red bedding against the wall. Next to the bed is a chair.
there is a antique looking bed with red covers and pillows in a room. next to it is a recliner chair with red padding. also there are windows.
there is a bed with five pillows on it, and next to it is a chair
There is a bed in the room with two pillows and a small chair near to the right side of it.
There is a large grey bed in the bottom right corner of the room. Above the bed is a small black chair.
![Page 24: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/24.jpg)
Discrimination task
brown room with a refrigerator in the back corner
A B C
D E
![Page 25: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/25.jpg)
D
brown room with a refrigerator in the back corner
Discrimination task
![Page 26: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/26.jpg)
Learning lexical items
● One-vs.-all logistic regression● Features: 1{(language, object)}
– language: bag-of-words / bag-of-bigrams
– object: model id / categorybrownbrown roomroomroom withwith...
room01room027bdc0aaccat:Roomcat:Refrigerator...
![Page 27: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/27.jpg)
Discrimination results
Accuracy
Model ids only 71.5%
Model ids + categories 83.3%
● Accuracy (% correct scenes identified)
![Page 28: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/28.jpg)
Lexical grounding examples
text category
chair Chair
couch Couch
sofa Couch
fruit Bowl
bookshelf Bookcase
![Page 29: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/29.jpg)
Generate!
There is a room with a wooden desk and a black lamp. There is a chair to the right of the desk.
?
![Page 30: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/30.jpg)
Baseline
There is a room with a wooden desk and a black lamp. There is a chair to the right of the desk.
deskroom
chair
wooden desk
There is
a black
a
a wooden
black lamp
![Page 31: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/31.jpg)
Baseline
There is a room with a wooden desk and a black lamp. There is a chair to the right of the desk.
group by object
sum weights
![Page 32: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/32.jpg)
Baseline
There is a room with a wooden desk and a black lamp. There is a chair to the right of the desk.
deskroom
chair
wooden desk
There is
a black
a
a wooden
black lamp
![Page 33: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/33.jpg)
Baseline
There is a room with a wooden desk and a black lamp. There is a chair to the right of the desk.
deskroom
chair
wooden desk
There is
a black
a
a wooden
black lamp
![Page 34: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/34.jpg)
Baseline
There is a room with a wooden desk and a black lamp. There is a chair to the right of the desk.
choose top k(k = 4)
K = 4, average number of objects in human-constructed scenes
![Page 35: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/35.jpg)
Baseline
There is a room with a wooden desk and a black lamp. There is a chair to the right of the desk.
choose top k(k = 4)
No relationship enforced between objects! Combine with rule-based parser?
![Page 36: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/36.jpg)
Parsing + learned lexical grounding
there is a room with a wooden desk and a black lamp
![Page 37: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/37.jpg)
Parsing + learned lexical grounding
there is a room with a wooden desk and a black lamp
LampTableVase
c=argmaxc
∑ϕi∈ϕ( p )
θ(i , c)
![Page 38: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/38.jpg)
Parsing + learned lexical grounding
there is a room with a wooden desk and a black lamp
Lamp 2.304Table 0.622Vase -0.310
c=argmaxc
∑ϕi∈ϕ( p )
θ(i , c)
![Page 39: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/39.jpg)
Parsing + learned lexical grounding
there is a room with a wooden desk and a black lamp
Lamp 2.304Table 0.622Vase -0.310
c=argmaxc
∑ϕi∈ϕ( p )
θ(i , c) m=argmaxm∈c (λd ∑
ϕi∈ϕ(d )
θ(i , m)+λx ∑ϕi∈ϕ( x)
θ(i , m))
![Page 40: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/40.jpg)
Parsing + learned lexical grounding
there is a room with a wooden desk and a black lamp
Lamp 2.304Table 0.622Vase -0.310
c=argmaxc
∑ϕi∈ϕ( p )
θ(i , c) m=argmaxm∈c (λd ∑
ϕi∈ϕ(d )
θ(i , m)+λx ∑ϕi∈ϕ( x)
θ(i , m))
![Page 41: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/41.jpg)
Parsing + learned lexical grounding
there is a room with a wooden desk and a black lamp
Lamp 2.304Table 0.622Vase -0.310
c=argmaxc
∑ϕi∈ϕ( p )
θ(i , c) m=argmaxm∈c (λd ∑
ϕi∈ϕ(d )
θ(i , m)+λx ∑ϕi∈ϕ( x)
θ(i , m))
-0.302 0.460 -0.021
![Page 42: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/42.jpg)
Generated scene examples
A round table is in the center of the room with four chairs around the table. There is a double window facing west. A door is on the east side of the room.
![Page 43: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/43.jpg)
EvaluationIn between the doors and the window, there is a black couch with red cushions, two white pillows, and one black pillow. In front of the couch, there is a wooden coffee table with a glass top and two newspapers. Next to the table, facing the couch, is a wooden folding chair.
![Page 44: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/44.jpg)
Evaluation Results
Method Seeds
Random 2.03
Rule-based parser 5.44
Human-built 6.06
Turkers rated fidelity of generated sceneson a scale of 1 (poor) to 7 (good)
168 participants, average 4.2 ratings per scene-description pair
![Page 45: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/45.jpg)
Evaluation Results
Method Seeds Mturk
Random 2.03 1.68
Rule-based parser 5.44 3.15
Human-built 6.06 5.87
Turkers rated fidelity of generated sceneson a scale of 1 (poor) to 7 (good)
168 participants, average 4.2 ratings per scene-description pair
![Page 46: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/46.jpg)
Evaluation Results
Method Seeds Mturk
Random 2.03 1.68
Rule-based parser 5.44 3.15
Human-built 6.06 5.87
Turkers rated fidelity of generated sceneson a scale of 1 (poor) to 7 (good)
168 participants, average 4.2 ratings per scene-description pair
![Page 47: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/47.jpg)
Evaluation Results
Method Seeds Mturk
Random 2.03 1.68
Learned (k=4) 3.51 2.61
Rule-based parser 5.44 3.15
Human-built 6.06 5.87
Turkers rated fidelity of generated sceneson a scale of 1 (poor) to 7 (good)
168 participants, average 4.2 ratings per scene-description pair
![Page 48: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/48.jpg)
Evaluation Results
Method Seeds Mturk
Random 2.03 1.68
Learned (k=4) 3.51 2.61
Rule-based parser 5.44 3.15
Human-built 6.06 5.87
Turkers rated fidelity of generated sceneson a scale of 1 (poor) to 7 (good)
168 participants, average 4.2 ratings per scene-description pair
![Page 49: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/49.jpg)
Evaluation Results
Method Seeds Mturk
Random 2.03 1.68
Learned (k=4) 3.51 2.61
Rule-based parser 5.44 3.15
Combined 5.23 3.73
Human-built 6.06 5.87
Turkers rated fidelity of generated sceneson a scale of 1 (poor) to 7 (good)
168 participants, average 4.2 ratings per scene-description pair
![Page 50: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/50.jpg)
Evaluation Results
Method Seeds Mturk
Random 2.03 1.68
Learned (k=4) 3.51 2.61
Rule-based parser 5.44 3.15
Combined 5.23 3.73
Human-built 6.06 5.87
Turkers rated fidelity of generated sceneson a scale of 1 (poor) to 7 (good)
168 participants, average 4.2 ratings per scene-description pair
![Page 51: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/51.jpg)
Generated scene examplesIn between the doors and the window, there is a black couch with red cushions, two white pillows, and one black pillow. In front of the couch, there is a wooden coffee table with a glass top and two newspapers. Next to the table, facing the couch, is a wooden folding chair.
![Page 52: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/52.jpg)
Generated scene examplesIn between the doors and the window, there is a black couch with red cushions, two white pillows, and one black pillow. In front of the couch, there is a wooden coffee table with a glass top and two newspapers. Next to the table, facing the couch, is a wooden folding chair.
![Page 53: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/53.jpg)
Generated scene examplesIn between the doors and the window, there is a black couch with red cushions, two white pillows, and one black pillow. In front of the couch, there is a wooden coffee table with a glass top and two newspapers. Next to the table, facing the couch, is a wooden folding chair.
![Page 54: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/54.jpg)
Generated scene examplesIn between the doors and the window, there is a black couch with red cushions, two white pillows, and one black pillow. In front of the couch, there is a wooden coffee table with a glass top and two newspapers. Next to the table, facing the couch, is a wooden folding chair.
![Page 55: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/55.jpg)
Generated scene examplesIn between the doors and the window, there is a black couch with red cushions, two white pillows, and one black pillow. In front of the couch, there is a wooden coffee table with a glass top and two newspapers. Next to the table, facing the couch, is a wooden folding chair.
![Page 56: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/56.jpg)
Generated scene examplesIn between the doors and the window, there is a black couch with red cushions, two white pillows, and one black pillow. In front of the couch, there is a wooden coffee table with a glass top and two newspapers. Next to the table, facing the couch, is a wooden folding chair.
?
![Page 57: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/57.jpg)
Remaining Challenges
● Learning spatial relations
● Coreference
There in the middle is a table. On the table is a cup.
facing the couch
![Page 58: Will Monroe: Text to 3D scene generation with lexical grounding](https://reader030.vdocuments.net/reader030/viewer/2022032700/55d21b29bb61eb6c628b45ae/html5/thumbnails/58.jpg)
Thank you!
Dataset is publicly availablehttp://nlp.stanford.edu/data/text2scene.shtml