![Page 1: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/1.jpg)
Few-Shot Adaptive Video-to-Video Translation
Ting-Chun Wang
NVIDIA
![Page 2: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/2.jpg)
Recall the Motion Transfer Example
![Page 3: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/3.jpg)
Behind the Scenes…
![Page 4: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/4.jpg)
Disadvantages of vid2vid
• Separate models for each dataset
• Generalizing to new persons requires
  • Collecting new data
  • Training
[Figure: three separate models (model 1, model 2, model 3)]
![Page 5: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/5.jpg)
Wouldn’t it be great if…
• One model for all
• Dynamically determine the style at run time
  • Based on an exemplar image
![Page 6: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/6.jpg)
Adaptive Video-to-Video Translation
T.-C. Wang, M.-Y. Liu, A. Tao, G. Liu, J. Kautz, B. Catanzaro, “Few-shot Adaptive Video-to-Video Synthesis,” To appear at NeurIPS 2019.
![Page 7: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/7.jpg)
Adaptive vid2vid: overview
• Original vid2vid
  • Output frame = Hallucinated frame + Warped frame
• Adaptive vid2vid
  • Hallucinated frames are generated based on example images
  • Using a filter generation scheme
[Figure: pipeline diagram. Labels: example images, filter generation, filters, conv layers, hallucinated frame; frame t-1, input frame t, flow generation, warped frame (W); output frame t]
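The composition of the output frame can be sketched in a few lines. The slide writes it as "hallucinated frame + warped frame"; the sketch below assumes the sum is a per-pixel soft blend controlled by a mask in [0, 1], which is how the original vid2vid paper describes it. The function name `compose_frame`, the flat grayscale frames, and the mask values are illustrative assumptions, not taken from the slides.

```python
def compose_frame(hallucinated, warped, mask):
    """Blend two frames pixel by pixel.

    mask[i] = 1.0 keeps the hallucinated pixel, 0.0 keeps the warped pixel.
    Frames are flat lists of grayscale values to keep the sketch short.
    """
    if not (len(hallucinated) == len(warped) == len(mask)):
        raise ValueError("frames and mask must have the same length")
    return [m * h + (1.0 - m) * w for h, w, m in zip(hallucinated, warped, mask)]


hallucinated = [0.9, 0.8, 0.1, 0.2]   # synthesized from scratch
warped = [0.5, 0.5, 0.5, 0.5]         # previous output moved along the flow
mask = [1.0, 0.5, 0.0, 0.0]           # hypothetical soft mask
frame_t = compose_frame(hallucinated, warped, mask)
```

Where the mask is 1 the output relies entirely on the hallucinated frame, and where it is 0 it reuses the warped previous frame. In the adaptive variant, only the way the hallucinated frame is produced changes.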
![Page 8: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/8.jpg)
Adaptive vid2vid
• Based on SPADE (GauGAN)
  • Prior work: input semantics → encoder-decoder → output image
  • Instead: input semantics → spatially-varying normalization maps → used in every BatchNorm
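[Figure: SPADE block. Network input 𝑥 (label free) passes through a parameter-free Batch Norm; conv layers produce 𝛾 and 𝛽, which are applied element-wise to give network output 𝑦]

$$y = \frac{x - \mu}{\sigma} \cdot \gamma + \beta$$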
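A minimal sketch of the normalization formula, assuming a single feature channel from a single sample so that 𝜇 and 𝜎 are scalars over that channel. In SPADE the maps 𝛾 and 𝛽 come from conv layers applied to the input semantics; here they are written by hand. The function name `spade_normalize` and the `eps` term for numerical stability are assumptions of this sketch.

```python
import math


def spade_normalize(x, gamma, beta, eps=1e-5):
    """Apply y = (x - mu) / sigma * gamma + beta to one feature channel.

    x, gamma, beta are H x W nested lists. mu and sigma are scalars computed
    over the whole channel (the parameter-free normalization step), while
    gamma and beta vary per pixel.
    """
    values = [v for row in x for v in row]
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    sigma = math.sqrt(var + eps)
    return [
        [(x[i][j] - mu) / sigma * gamma[i][j] + beta[i][j]
         for j in range(len(x[0]))]
        for i in range(len(x))
    ]


x = [[1.0, 3.0], [5.0, 7.0]]
gamma = [[1.0, 1.0], [2.0, 2.0]]   # hypothetical per-pixel scale map
beta = [[0.0, 0.0], [0.5, 0.5]]    # hypothetical per-pixel shift map
y = spade_normalize(x, gamma, beta)
```

Because 𝛾 and 𝛽 differ per pixel, two locations with the same normalized activation can end up with different outputs, which is what lets the semantic layout steer every normalization layer.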
![Page 9: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/9.jpg)
Adaptive vid2vid
• Based on SPADE (GauGAN)
  • Prior work: input semantics → encoder-decoder → output image
  • Instead: input semantics → spatially-varying normalization maps → used in every BatchNorm
• Given an additional exemplar image
  • Dynamically configure the network weights in SPADE
• Generate spatially-varying, style-dependent normalization maps
  • Spatial info: from the input semantics
  • Style info: from the exemplar images
[Figure: SPADE block. Network input 𝑥 (label free), parameter-free Batch Norm, conv layers producing 𝛾 and 𝛽 applied element-wise, network output 𝑦]
![Page 10: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/10.jpg)
Dynamic Weight Generation
[Figure: the example image (used for weight generation) passes through AdaPool and fc layers to generate convolution filters. Dynamic convolution (∗) applies these filters to the input semantics to produce spatially-varying, style-dependent norm maps for the SPADE layers. The main image generator takes the input semantics as network input and produces the output image. Legend: normal convolution, filter generation, normalization, dynamic convolution.]
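The idea of generating filters from the example image and applying them to the input semantics can be sketched as follows. This is a heavily simplified illustration, not the architecture from the talk: pooling is a global average standing in for AdaPool, there is a single fc layer, and the dynamic convolution is 1x1 with one output channel. All function names and numeric values are hypothetical.

```python
def global_avg_pool(features):
    """Collapse each channel of a C x H x W tensor to one number."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]


def fully_connected(vec, weights, bias):
    """Plain fc layer: out[k] = bias[k] + sum_i weights[k][i] * vec[i]."""
    return [b + sum(w_i * v for w_i, v in zip(w, vec)) for w, b in zip(weights, bias)]


def dynamic_conv1x1(semantics, kernel):
    """1x1 convolution whose kernel was generated at run time.

    semantics is C_in x H x W, kernel has C_in entries, result is one H x W map.
    """
    height, width = len(semantics[0]), len(semantics[0][0])
    return [[sum(kernel[c] * semantics[c][i][j] for c in range(len(kernel)))
             for j in range(width)] for i in range(height)]


# Hypothetical features extracted from the example image (2 channels, 2 x 2).
example_features = [[[1.0, 1.0], [1.0, 1.0]],
                    [[0.0, 2.0], [2.0, 4.0]]]
# Hypothetical one-hot semantic map with 2 classes on a 2 x 2 grid.
semantics = [[[1.0, 0.0], [0.0, 1.0]],
             [[0.0, 1.0], [1.0, 0.0]]]
# Hypothetical fc parameters; in a trained model these would be learned.
fc_weights = [[1.0, 0.0], [0.0, 1.5]]
fc_bias = [0.0, 0.0]

style = global_avg_pool(example_features)              # style summary of the example
kernel = fully_connected(style, fc_weights, fc_bias)   # generated filter weights
gamma_map = dynamic_conv1x1(semantics, kernel)         # normalization map
```

The resulting map varies across space because the semantic classes differ per pixel, and it depends on style because the kernel values were derived from the example image. Swapping in a different example changes the kernel without retraining anything.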
![Page 11: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/11.jpg)
Utilizing Multiple Example Images
[Figure: input frames and their attention maps over the example images (example image 1: front, example image 2: back). Example pose 1, example pose 2, and the target pose yield attention maps through a softmax; the attention maps weight the example features to give the combined features.]
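The softmax-weighted combination of example features can be sketched as below. As a simplifying assumption, each image is reduced to a single pose vector and a single feature vector, and similarity is a plain dot product; the figure suggests the actual attention maps are spatial. The names `softmax` and `combine_examples` and all values are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_examples(target_pose, example_poses, example_features):
    """Weight each example's features by how well its pose matches the target."""
    scores = [sum(t * p for t, p in zip(target_pose, pose)) for pose in example_poses]
    weights = softmax(scores)
    dim = len(example_features[0])
    combined = [sum(w * feat[d] for w, feat in zip(weights, example_features))
                for d in range(dim)]
    return weights, combined


# Hypothetical pose embeddings: example 1 is a front view, example 2 a back view.
example_poses = [[1.0, 0.0], [0.0, 1.0]]
example_features = [[10.0, 0.0], [0.0, 10.0]]
target_pose = [2.0, 0.0]   # closer to the front view

weights, combined = combine_examples(target_pose, example_poses, example_features)
```

With a target pose that resembles the front view, most of the weight goes to example 1, so the combined features are dominated by that example. A back-facing target would shift the weight toward example 2.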
![Page 12: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/12.jpg)
Adaptive vid2vid: Training
• From a video
  • Randomly sample a clip
  • Randomly sample one or more additional reference frames
• Make the network generate the clip
  • Based on the reference frame(s)
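The sampling step can be sketched with frame indices alone. The sketch assumes the clip is contiguous and that reference frames are drawn from outside the clip; the slides only say the reference frames are sampled separately. The function name `sample_training_pair` and its parameters are hypothetical.

```python
import random


def sample_training_pair(num_frames, clip_len, num_refs, rng):
    """Pick a contiguous clip and reference frames from outside that clip."""
    if clip_len + num_refs > num_frames:
        raise ValueError("video too short for the requested clip and references")
    start = rng.randrange(num_frames - clip_len + 1)
    clip = list(range(start, start + clip_len))
    # Assumption: references must not overlap the clip being generated.
    outside = [i for i in range(num_frames) if i < start or i >= start + clip_len]
    refs = rng.sample(outside, num_refs)
    return clip, refs


rng = random.Random(0)  # seeded only to make the sketch repeatable
clip, refs = sample_training_pair(num_frames=100, clip_len=8, num_refs=2, rng=rng)
```

During training the network would be asked to reproduce the frames at `clip`, given their semantic inputs and the images at `refs` as examples.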
![Page 13: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/13.jpg)
Adaptive vid2vid: Testing
• Given an example image
• Finetune on the example image
  • Network output should be the same as the example
  • Only finetune for a few iterations
• For faces: normalize keypoints
  • To match those of the example image
  • To better preserve identity
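The "finetune for a few iterations" step can be illustrated with a toy model. This is only a sketch of the principle: the "generator" is a single scalar gain, the loss is a mean squared reconstruction error, and the step count and learning rate are arbitrary. None of these choices come from the slides.

```python
def finetune(weight, inputs, targets, steps=5, lr=0.05):
    """Run a few gradient steps on a mean squared reconstruction loss.

    The toy generator is a single gain: output = weight * input.
    """
    n = len(inputs)
    for _ in range(steps):
        grad = sum(2.0 * (weight * x - t) * x for x, t in zip(inputs, targets)) / n
        weight -= lr * grad
    return weight


def mse(weight, inputs, targets):
    """Mean squared error between the toy generator output and the targets."""
    return sum((weight * x - t) ** 2 for x, t in zip(inputs, targets)) / len(inputs)


inputs = [1.0, 2.0, 3.0]     # stands in for the example's semantic input
targets = [2.0, 4.0, 6.0]    # stands in for the example image itself
w_before = 1.0               # stands in for the pretrained weights
w_after = finetune(w_before, inputs, targets)
```

A handful of steps moves the pretrained weight most of the way toward reproducing the example, which mirrors the intent of the slide: adapt to the example quickly rather than retrain.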
![Page 14: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/14.jpg)
Results
• Semantic → Street view scenes
• Edges → Human faces
• Poses → Human bodies
![Page 15: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/15.jpg)
Street View Scenes
[Video: example images, input segmentations, synthesized videos]
![Page 16: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/16.jpg)
Edges → Faces
[Video: example images, input videos, synthesized videos]
![Page 17: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/17.jpg)
Edges → Faces
[Video: example image, input videos, extracted edges, synthesized result]
![Page 18: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/18.jpg)
Poses → Body
[Video: example images, input poses, synthesized videos]
![Page 19: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/19.jpg)
Poses → Body
![Page 20: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/20.jpg)
Poses → Body
![Page 21: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/21.jpg)
Poses → Body
[Video: example image, input videos, poses, synthesized]
![Page 22: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/22.jpg)
Conclusion
![Page 23: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/23.jpg)
Conclusion
• Generative adversarial networks (GANs): unconditional, conditional
• Supervised image translation: pix2pix, pix2pixHD, GauGAN
• Unsupervised image translation: UNIT, MUNIT, FUNIT
• Video translation: vid2vid, Adaptive vid2vid, vid2game
[Figure: network diagrams with blocks labeled F, G, and D]
![Page 24: Few-Shot Adaptive Video-to-Video Translation · pose 2 target pose combined features attention maps example features. ... Adaptive vid2vid: Testing •Given an example image •Finetune](https://reader034.vdocuments.net/reader034/viewer/2022043022/5f3dbdec5a58a949a167e4a4/html5/thumbnails/24.jpg)
THANK YOU
Questions?