Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis
ICCV 2023


Embodied View Synthesis. Given a long video of deformable objects captured by a handheld RGBD sensor, Total-Recon renders the scene from novel camera trajectories derived from the in-scene motion of actors: (1) egocentric cameras that simulate the point of view of a target actor (such as the pet) and (2) 3rd-person (or pet) cameras that follow the actor from behind. Our method also enables (3) 3D video filters that attach virtual 3D assets to the actor. Total-Recon achieves this by reconstructing the geometry, appearance, and root-body and articulated motion of each deformable object in the scene, as well as the background.

Abstract

We explore the task of embodied view synthesis from monocular videos of deformable scenes. Given a minute-long RGBD video of people interacting with their pets, we render the scene from novel camera trajectories derived from the in-scene motion of actors: (1) egocentric cameras that simulate the point of view of a target actor and (2) 3rd-person cameras that follow the actor. Building an automated system for embodied view synthesis of deformable scenes requires reconstructing the root-body and articulated motion of each actor in the scene, as well as a scene representation that supports free-viewpoint synthesis. Longer videos are more likely to capture the scene from diverse viewpoints (which helps reconstruction) but are also more likely to contain larger motions (which complicates reconstruction). To address these challenges, we present Total-Recon, the first method to photorealistically reconstruct deformable scenes from long monocular RGBD videos. Crucially, to scale to long videos, our method hierarchically decomposes the scene motion into the motion of each object, which itself is decomposed into global root-body motion and local articulations. To quantify such "in-the-wild" reconstruction and view synthesis, we collect ground-truth data for 11 challenging videos using a specialized stereo RGBD capture rig; on this benchmark, our method significantly outperforms prior art.

Total-Recon

Total-Recon represents the entire scene as a composition of neural fields, one for each deformable foreground object and the rigid background.


Total-Recon represents the entire scene as a composition of `M` neural fields: one for the rigid background and one for each of the `M-1` deformable foreground objects. (1) Each object field `j` is transformed into camera space by a rigid transformation, with an additional deformation field for each deformable object. (2) The posed object fields are then combined into a (3) composite field, which is volume-rendered into (4) color, depth, optical flow, and object silhouettes. Each rendered output defines a reconstruction loss supervised by a monocular RGBD video captured with a moving iPad Pro.
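To make the rendering step concrete, below is a minimal sketch of compositional volume rendering along one ray. The per-object field evaluators, sampling scheme, and silhouette definition are simplified placeholders for illustration, not Total-Recon's actual implementation.

```python
import numpy as np

def composite_render(object_fields, ray_o, ray_d, z_samples):
    """Volume-render one ray through a composition of posed object fields.

    A minimal sketch, not the paper's implementation: each entry of
    `object_fields` is a hypothetical callable mapping camera-space points
    (N, 3) to per-point densities (N,) and colors (N, 3); in Total-Recon
    these would be the M object fields after their rigid/deformation warps.
    """
    pts = ray_o + z_samples[:, None] * ray_d               # (N, 3) samples along the ray
    densities, colors = [], []
    for field in object_fields:
        sigma, rgb = field(pts)
        densities.append(sigma)
        colors.append(rgb)
    densities = np.stack(densities, 0)                      # (M, N)
    colors = np.stack(colors, 0)                            # (M, N, 3)

    # Composite field: densities add; colors are density-weighted averages.
    sigma_total = densities.sum(0)                          # (N,)
    rgb_total = (densities[..., None] * colors).sum(0) / (sigma_total[:, None] + 1e-8)

    # Standard volume-rendering weights along the ray.
    deltas = np.diff(z_samples, append=z_samples[-1] + 1e10)
    alpha = 1.0 - np.exp(-sigma_total * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = alpha * trans                                 # (N,)

    color = (weights[:, None] * rgb_total).sum(0)           # rendered color
    depth = (weights * z_samples).sum()                     # rendered (expected) depth
    sils = (weights[None] * densities / (sigma_total + 1e-8)).sum(1)  # per-object silhouettes (M,)
    return color, depth, sils
```

Rendered depth and per-object silhouettes fall out of the same ray weights, which is how a single RGBD frame can supervise color, depth, and masks jointly.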

Results

1) 3D Reconstruction and Applications

[Video grid: Input RGBD Video | 3D Scene Reconstruction | Egocentric View | 3rd-Person / Pet Follow | 3D Filter; sequences: Dog 1, Human 2]

We train Total-Recon to reconstruct the entire scene on a variety of RGBD videos. The egocentric and 3rd-person / pet-follow cameras are represented by the yellow and blue camera meshes in the mesh renderings, respectively. To showcase the 3D video filter, we attach a sky-blue unicorn horn to the forehead of the foreground object; the attachment is automatically propagated across all frames.
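A hedged sketch of the idea behind such a filter: once the actor's per-frame root-body pose is recovered, a virtual asset anchored in the actor's canonical frame can be carried along automatically. The function and argument names below are hypothetical, and articulated deformation of the attachment point is ignored for brevity.

```python
import numpy as np

def propagate_asset(anchor_canonical, root_poses):
    """Carry a virtual 3D asset attached to an actor across all frames.

    A minimal sketch with hypothetical names: `anchor_canonical` is the asset's
    attachment point (e.g., on the forehead) in the actor's canonical rest frame,
    and `root_poses` is a list of per-frame 4x4 canonical-to-camera root-body
    transforms. Articulated deformation of the attachment point is ignored here.
    """
    anchor_h = np.append(anchor_canonical, 1.0)             # homogeneous coordinates
    return [(T @ anchor_h)[:3] for T in root_poses]         # one camera-space point per frame
```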


2) Novel 6-DoF Trajectories Derived from In-Scene Motion

[Video grid: GT RGB (Train View) | Root-body (Bird's Eye View) | Egocentric Camera (Novel View) | 3rd-Pet Follow Camera (Novel View); sequences: Human 1 & Dog 1, Dog 1, Cat 1, Cat 3]

By hierarchically decomposing scene motion into the motion of each object, which itself is decomposed into root-body motion and local articulations, Total-Recon can automatically compute novel 6-DoF trajectories such as those traversed by egocentric cameras and 3rd-person (or 3rd-Pet) follow cameras. In turn, these trajectories enable embodied view synthesis. For the camera trajectories, the blue axis denotes the viewing direction and the green axis denotes the "up" direction.
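As a rough illustration of how a follow camera can be derived from a recovered root-body trajectory, the sketch below places a camera behind and above the actor and orients it along the actor's heading. The axis convention and offsets are assumptions chosen for illustration, not the paper's exact construction; an egocentric camera would instead be anchored directly at the actor's head with zero offsets.

```python
import numpy as np

def follow_camera_from_root(root_pose, offset_back=1.0, offset_up=0.3):
    """Derive a 3rd-person follow camera from an actor's root-body pose.

    A rough sketch, not the paper's exact construction: `root_pose` is a 4x4
    actor-to-world transform; treating column 1 as the actor's "up" axis and
    column 2 as its "forward" axis is a convention assumed here. The camera
    sits behind and above the actor and looks along the actor's heading.
    """
    R, t = root_pose[:3, :3], root_pose[:3, 3]
    forward, up = R[:, 2], R[:, 1]                         # assumed axis convention
    cam_center = t - offset_back * forward + offset_up * up

    # Look-at rotation: z-axis = viewing direction (blue), y-axis = "up" (green).
    z = forward / np.linalg.norm(forward)
    x = np.cross(up, z)
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)

    cam_pose = np.eye(4)
    cam_pose[:3, :3] = np.stack([x, y, z], axis=1)          # camera-to-world rotation
    cam_pose[:3, 3] = cam_center
    return cam_pose
```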


3) Object Removal

[Video grid: Novel View (GT) | Human removed (Rendered) | Pet removed (Rendered); sequences: Human 1 & Dog 1, Human 2 & Cat 1]

By representing the scene as a composition of objects, Total-Recon also enables object removal.
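In terms of the compositional rendering sketched in the method section above, removing an object is simply rendering with that object's field left out of the composition; the names below (`object_fields`, `human_id`, etc.) are the hypothetical ones from that sketch.

```python
# Hypothetical usage of the composite_render sketch from the method section:
# removing the human amounts to dropping its field from the composition.
fields_wo_human = [f for j, f in enumerate(object_fields) if j != human_id]
color, depth, sils = composite_render(fields_wo_human, ray_o, ray_d, z_samples)
```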


4) Novel View Synthesis: Comparisons to Baselines

[Video grid: Novel View (GT) | Ours (w/ depth) | D2NeRF (w/ depth) | D2NeRF (w/o depth) | HyperNeRF (w/ depth) | HyperNeRF (w/o depth); sequences: Cat 2, Human 1]

We compare Total-Recon (ours) to HyperNeRF, D2NeRF, and their depth-supervised variants on novel-view synthesis, using RGBD sequences captured with a stereo validation rig we built. While the baselines at best reconstruct only the background, our method reconstructs both the background and the moving foreground objects, demonstrating holistic scene reconstruction.


5) Ablation Study on Motion Modeling


| Methods | Optimizes Camera Poses | Deformation Field | Deformable Objects | Root-Body Initialization | Root-Body Motion |
|---|---|---|---|---|---|
| Ours | $$\checkmark$$ | NBS | $$\checkmark$$ | $$\checkmark$$ | $$\checkmark$$ |
| w/o cam. opt. | | NBS | $$\checkmark$$ | $$\checkmark$$ | $$\checkmark$$ |
| w/ SE(3)-field | $$\checkmark$$ | SE(3)-field | $$\checkmark$$ | $$\checkmark$$ | $$\checkmark$$ |
| w/o deform. field | $$\checkmark$$ | None | | $$\checkmark$$ | $$\checkmark$$ |
| w/o root-body init. | $$\checkmark$$ | NBS | $$\checkmark$$ | | $$\checkmark$$ |
| w/o root-body | | NBS | $$\checkmark$$ | | |
| w/o root-body (SE3) | | SE(3)-field | $$\checkmark$$ | | |

  • (w/o cam. opt.) Ablating camera-pose optimization does not qualitatively change the scene reconstruction.
  • (w/ SE(3)-field) Changing the deformation field from Total-Recon's NBS (neural blend skinning) function to an SE(3)-field results in minor artifacts in the foreground reconstruction.
  • (w/o deform. field) Removing the deformation field entirely produces coarse object reconstructions that fail to model moving body parts such as limbs.
  • (w/o root-body init.) Removing PoseNet-initialization of root-body poses results in noisy appearance and geometry, and sometimes even failed object reconstructions.
  • (w/o root-body) We do not visualize our method without root-body poses as this ablation does not converge.
  • (w/o root-body (SE3)) We perform another ablation that replaces the NBS function with the more flexible SE(3)-field, which does converge but breaks foreground reconstruction entirely, as evidenced by the ghosting artifacts.

These experiments justify our method's hierarchical motion representation, where object motion is decomposed into global root-body motion and local articulations.
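For concreteness, here is a minimal sketch of such a hierarchical warp: canonical points are first articulated by blend-skinned bone transforms and then moved rigidly by the root-body pose. The skinning weights would come from a learned skinning function in neural blend skinning; the shapes and names below are assumptions, not the paper's exact formulation.

```python
import numpy as np

def warp_canonical_to_camera(pts, root_pose, bone_poses, skin_weights):
    """Hierarchical motion sketch: blend-skinned articulation, then root-body motion.

    Assumptions, not the paper's exact formulation: `pts` are canonical-space
    points (N, 3); `bone_poses` are per-bone 4x4 canonical-space transforms
    (B, 4, 4); `skin_weights` (N, B) would come from a learned skinning function
    in neural blend skinning; `root_pose` is the global 4x4 object-to-camera
    (root-body) transform.
    """
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)     # (N, 4)

    # Local articulation: blend the per-bone transforms with skinning weights.
    per_bone = np.einsum('bij,nj->nbi', bone_poses, pts_h)            # (N, B, 4)
    articulated = np.einsum('nb,nbi->ni', skin_weights, per_bone)     # (N, 4)

    # Global root-body motion: rigidly move articulated points into camera space.
    return (root_pose @ articulated.T).T[:, :3]                       # (N, 3)
```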

[Video grid: Novel View (GT) | Ours | w/o cam. opt. | w/ SE(3)-field | w/o deform. field | w/o root-body init. | w/o root-body (SE3); sequences: Cat 2, Human 1]


6) Ablation Study on Depth Supervision

[Video grid: Novel View (GT) | Depth-supervised (Rendered) | No Depth Supervision (Rendered); sequences: Human 1 & Dog 1, Human 2 & Cat 1]

While removing depth supervision does not significantly hamper the rendered RGB, it induces several failure modes visible in the 3D reconstructions: (1) floating objects in the Human & Dog sequence, (2) objects that sink into the background in the Human & Cat sequence, and (3) lower overall reconstruction quality.
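Concretely, depth supervision amounts to penalizing the discrepancy between rendered and sensor depth; the sketch below shows one plausible form (a masked L1 term), which may differ from the loss actually used by Total-Recon.

```python
import numpy as np

def depth_loss(rendered_depth, sensor_depth, valid_mask):
    """Masked L1 between rendered and sensor depth.

    A minimal sketch; the exact loss and weighting used by Total-Recon may
    differ. `sensor_depth` comes from the RGBD capture and `valid_mask`
    excludes pixels with missing or unreliable depth readings.
    """
    err = np.abs(rendered_depth - sensor_depth)
    return (err * valid_mask).sum() / (valid_mask.sum() + 1e-8)
```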


Citation

Acknowledgements

We thank Nathaniel Chodosh, Jeff Tan, George Cazenavette, and Jason Zhang for proofreading our paper and Songwei Ge for reviewing our code. We thank Sheng-Yu Wang, Daohan (Fred) Lu, Tamaki Kojima, Krishna Wadhwani, Takuya Narihira, and Tatsuo Fujiwara as well for providing valuable feedback. This work is supported in part by the Sony Corporation, Cisco Systems, Inc., and the CMU Argo AI Center for Autonomous Vehicle Research.

The website template was borrowed from Jon Barron.