Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis
ICCV 2023


Embodied View Synthesis. Given a long video of deformable objects captured by a handheld RGBD sensor, Total-Recon renders the scene from novel camera trajectories derived from the in-scene motion of actors: (1) egocentric cameras that simulate the point of view of a target actor (such as the pet) and (2) 3rd-person (or pet) cameras that follow the actor from behind. Our method also enables (3) 3D video filters that attach virtual 3D assets to the actor. Total-Recon achieves this by reconstructing the geometry, appearance, and root-body and articulated motion of each deformable object in the scene, as well as the background.

Abstract

We explore the task of embodied view synthesis from monocular videos of deformable scenes. Given a minute-long RGBD video of people interacting with their pets, we render the scene from novel camera trajectories derived from the in-scene motion of actors: (1) egocentric cameras that simulate the point of view of a target actor and (2) 3rd-person cameras that follow the actor. Building an automated system for embodied view synthesis of deformable scenes requires reconstructing the root-body and articulated motion of each actor in the scene, as well as a scene representation that supports free-viewpoint synthesis. Longer videos are more likely to capture the scene from diverse viewpoints (which helps reconstruction) but are also more likely to contain larger motions (which complicates reconstruction). To address these challenges, we present Total-Recon, the first method to photorealistically reconstruct deformable scenes from long monocular RGBD videos. Crucially, to scale to long videos, our method hierarchically decomposes the scene motion into the motion of each object, which itself is decomposed into global root-body motion and local articulations. To quantify such "in-the-wild" reconstruction and view synthesis, we collect ground-truth data for 11 challenging videos using a specialized stereo RGBD capture rig; on this benchmark, our method significantly outperforms prior art.

Total-Recon

Total-Recon represents the entire scene as a composition of neural fields, one for each deformable foreground object and the rigid background.


Total-Recon represents the entire scene as a composition of `M` neural fields: one for the rigid background and one for each of the `M-1` deformable foreground objects. (1) Each object field `j` is transformed into camera space by a rigid transformation, with an additional deformation field for each deformable object. (2) The posed object fields are then combined into a (3) composite field, which is volume-rendered into (4) color, depth, optical flow, and object silhouettes. Each rendered output defines a reconstruction loss supervised by a monocular RGBD video captured with a moving iPad Pro.
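To make the rendering step concrete, below is a minimal sketch of compositional volume rendering along one ray. The per-object field evaluators, sampling scheme, and silhouette definition are simplified placeholders for illustration, not Total-Recon's actual implementation.

```python
import numpy as np

def composite_render(object_fields, ray_o, ray_d, z_samples):
    """Volume-render one ray through a composition of posed object fields.

    A minimal sketch, not the paper's implementation: each entry of
    `object_fields` is a hypothetical callable mapping camera-space points
    (N, 3) to per-point densities (N,) and colors (N, 3); in Total-Recon
    these would be the M object fields after their rigid/deformation warps.
    """
    pts = ray_o + z_samples[:, None] * ray_d               # (N, 3) samples along the ray
    densities, colors = [], []
    for field in object_fields:
        sigma, rgb = field(pts)
        densities.append(sigma)
        colors.append(rgb)
    densities = np.stack(densities, 0)                      # (M, N)
    colors = np.stack(colors, 0)                            # (M, N, 3)

    # Composite field: densities add; colors are density-weighted averages.
    sigma_total = densities.sum(0)                          # (N,)
    rgb_total = (densities[..., None] * colors).sum(0) / (sigma_total[:, None] + 1e-8)

    # Standard volume-rendering weights along the ray.
    deltas = np.diff(z_samples, append=z_samples[-1] + 1e10)
    alpha = 1.0 - np.exp(-sigma_total * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = alpha * trans                                 # (N,)

    color = (weights[:, None] * rgb_total).sum(0)           # rendered color
    depth = (weights * z_samples).sum()                     # rendered (expected) depth
    sils = (weights[None] * densities / (sigma_total + 1e-8)).sum(1)  # per-object silhouettes (M,)
    return color, depth, sils
```

Rendered depth and per-object silhouettes fall out of the same ray weights, which is how a single RGBD frame can supervise color, depth, and masks jointly.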

Results

1) 3D Reconstruction and Applications

[Video grid: Input RGBD Video | 3D Scene Reconstruction | Egocentric View | 3rd-Person / Pet Follow | 3D Filter; sequences: Dog 1, Human 2]

We train Total-Recon to reconstruct the entire scene on a variety of RGBD videos. The egocentric and 3rd-person / pet-follow cameras are represented by the yellow and blue camera meshes in the mesh renderings, respectively. To showcase the 3D video filter, we attach a sky-blue unicorn horn to the forehead of the foreground object; the attachment is automatically propagated across all frames.
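A hedged sketch of the idea behind such a filter: once the actor's per-frame root-body pose is recovered, a virtual asset anchored in the actor's canonical frame can be carried along automatically. The function and argument names below are hypothetical, and articulated deformation of the attachment point is ignored for brevity.

```python
import numpy as np

def propagate_asset(anchor_canonical, root_poses):
    """Carry a virtual 3D asset attached to an actor across all frames.

    A minimal sketch with hypothetical names: `anchor_canonical` is the asset's
    attachment point (e.g., on the forehead) in the actor's canonical rest frame,
    and `root_poses` is a list of per-frame 4x4 canonical-to-camera root-body
    transforms. Articulated deformation of the attachment point is ignored here.
    """
    anchor_h = np.append(anchor_canonical, 1.0)             # homogeneous coordinates
    return [(T @ anchor_h)[:3] for T in root_poses]         # one camera-space point per frame
```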


2) Novel 6-DoF Trajectories Derived from In-Scene Motion

[Video grid: GT RGB (Train View) | Root-body (Bird's Eye View) | Egocentric Camera (Novel View) | 3rd-Pet Follow Camera (Novel View); sequences: Human 1 & Dog 1, Dog 1, Cat 1, Cat 3]

By hierarchically decomposing scene motion into the motion of each object, which itself is decomposed into root-body motion and local articulations, Total-Recon can automatically compute novel 6-DoF trajectories such as those traversed by egocentric cameras and 3rd-person (or 3rd-Pet) follow cameras. In turn, these trajectories enable embodied view synthesis. For the camera trajectories, the blue axis denotes the viewing direction and the green axis denotes the "up" direction.
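As a rough illustration of how a follow camera can be derived from a recovered root-body trajectory, the sketch below places a camera behind and above the actor and orients it along the actor's heading. The axis convention and offsets are assumptions chosen for illustration, not the paper's exact construction; an egocentric camera would instead be anchored directly at the actor's head with zero offsets.

```python
import numpy as np

def follow_camera_from_root(root_pose, offset_back=1.0, offset_up=0.3):
    """Derive a 3rd-person follow camera from an actor's root-body pose.

    A rough sketch, not the paper's exact construction: `root_pose` is a 4x4
    actor-to-world transform; treating column 1 as the actor's "up" axis and
    column 2 as its "forward" axis is a convention assumed here. The camera
    sits behind and above the actor and looks along the actor's heading.
    """
    R, t = root_pose[:3, :3], root_pose[:3, 3]
    forward, up = R[:, 2], R[:, 1]                         # assumed axis convention
    cam_center = t - offset_back * forward + offset_up * up

    # Look-at rotation: z-axis = viewing direction (blue), y-axis = "up" (green).
    z = forward / np.linalg.norm(forward)
    x = np.cross(up, z)
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)

    cam_pose = np.eye(4)
    cam_pose[:3, :3] = np.stack([x, y, z], axis=1)          # camera-to-world rotation
    cam_pose[:3, 3] = cam_center
    return cam_pose
```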


3) Object Removal

[Video grid: Novel View (GT) | Human removed (Rendered) | Pet removed (Rendered); sequences: Human 1 & Dog 1, Human 2 & Cat 1]

By representing the scene as a composition of objects, Total-Recon also enables object removal.
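In terms of the compositional rendering sketched in the method section above, removing an object is simply rendering with that object's field left out of the composition; the names below (`object_fields`, `human_id`, etc.) are the hypothetical ones from that sketch.

```python
# Hypothetical usage of the composite_render sketch from the method section:
# removing the human amounts to dropping its field from the composition.
fields_wo_human = [f for j, f in enumerate(object_fields) if j != human_id]
color, depth, sils = composite_render(fields_wo_human, ray_o, ray_d, z_samples)
```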


4) Novel View Synthesis: Comparisons to Baselines

[Video grid: Novel View (GT) | Ours (w/ depth) | D2NeRF (w/ depth) | D2NeRF (w/o depth) | HyperNeRF (w/ depth) | HyperNeRF (w/o depth); sequences: Cat 2, Human 1]

We compare Total-Recon (ours) to HyperNeRF, D2NeRF, and their depth-supervised variants on novel-view synthesis, using RGBD sequences captured with a stereo validation rig we built. While the baselines at best reconstruct only the background, our method reconstructs both the background and the moving foreground objects, demonstrating holistic scene reconstruction.


5) Ablation Study on Motion Modeling


| Methods | Optimizes Camera Poses | Deformation Field | Deformable Objects | Root-Body Initialization | Root-Body Motion |
|---|---|---|---|---|---|
| Ours | $$\checkmark$$ | NBS | $$\checkmark$$ | $$\checkmark$$ | $$\checkmark$$ |
| w/o cam. opt. | | NBS | $$\checkmark$$ | $$\checkmark$$ | $$\checkmark$$ |
| w/ SE(3)-field | $$\checkmark$$ | SE(3)-field | $$\checkmark$$ | $$\checkmark$$ | $$\checkmark$$ |
| w/o deform. field | $$\checkmark$$ | None | | $$\checkmark$$ | $$\checkmark$$ |
| w/o root-body init. | $$\checkmark$$ | NBS | $$\checkmark$$ | | $$\checkmark$$ |
| w/o root-body | | NBS | $$\checkmark$$ | | |
| w/o root-body (SE3) | | SE(3)-field | $$\checkmark$$ | | |

  • (w/o cam. opt.) Ablating camera-pose optimization does not qualitatively change the scene reconstruction.
  • (w/ SE(3)-field) Changing the deformation field from Total-Recon's NBS (neural blend skinning) function to an SE(3)-field results in minor artifacts in the foreground reconstruction.
  • (w/o deform. field) Removing the deformation field entirely produces coarse object reconstructions that fail to model moving body parts such as limbs.
  • (w/o root-body init.) Removing PoseNet-initialization of root-body poses results in noisy appearance and geometry, and sometimes even failed object reconstructions.
  • (w/o root-body) We do not visualize our method without root-body poses as this ablation does not converge.
  • (w/o root-body (SE3)) We perform another ablation that replaces the NBS function with the more flexible SE(3)-field, which does converge but breaks foreground reconstruction entirely, as evidenced by the ghosting artifacts.

These experiments justify our method's hierarchical motion representation, where object motion is decomposed into global root-body motion and local articulations.
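For concreteness, here is a minimal sketch of such a hierarchical warp: canonical points are first articulated by blend-skinned bone transforms and then moved rigidly by the root-body pose. The skinning weights would come from a learned skinning function in neural blend skinning; the shapes and names below are assumptions, not the paper's exact formulation.

```python
import numpy as np

def warp_canonical_to_camera(pts, root_pose, bone_poses, skin_weights):
    """Hierarchical motion sketch: blend-skinned articulation, then root-body motion.

    Assumptions, not the paper's exact formulation: `pts` are canonical-space
    points (N, 3); `bone_poses` are per-bone 4x4 canonical-space transforms
    (B, 4, 4); `skin_weights` (N, B) would come from a learned skinning function
    in neural blend skinning; `root_pose` is the global 4x4 object-to-camera
    (root-body) transform.
    """
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)     # (N, 4)

    # Local articulation: blend the per-bone transforms with skinning weights.
    per_bone = np.einsum('bij,nj->nbi', bone_poses, pts_h)            # (N, B, 4)
    articulated = np.einsum('nb,nbi->ni', skin_weights, per_bone)     # (N, 4)

    # Global root-body motion: rigidly move articulated points into camera space.
    return (root_pose @ articulated.T).T[:, :3]                       # (N, 3)
```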

[Video grid: Novel View (GT) | Ours | w/o cam. opt. | w/ SE(3)-field | w/o deform. field | w/o root-body init. | w/o root-body (SE3); sequences: Cat 2, Human 1]


6) Ablation Study on Depth Supervision

[Video grid: Novel View (GT) | Depth-supervised (Rendered) | No Depth Supervision (Rendered); sequences: Human 1 & Dog 1, Human 2 & Cat 1]

While removing depth supervision does not significantly hamper the rendered RGB, it induces several failure modes visible in the 3D reconstructions: (1) floating objects in the Human & Dog sequence, (2) objects that sink into the background in the Human & Cat sequence, and (3) lower overall reconstruction quality.
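Concretely, depth supervision amounts to penalizing the discrepancy between rendered and sensor depth; the sketch below shows one plausible form (a masked L1 term), which may differ from the loss actually used by Total-Recon.

```python
import numpy as np

def depth_loss(rendered_depth, sensor_depth, valid_mask):
    """Masked L1 between rendered and sensor depth.

    A minimal sketch; the exact loss and weighting used by Total-Recon may
    differ. `sensor_depth` comes from the RGBD capture and `valid_mask`
    excludes pixels with missing or unreliable depth readings.
    """
    err = np.abs(rendered_depth - sensor_depth)
    return (err * valid_mask).sum() / (valid_mask.sum() + 1e-8)
```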


Citation

Acknowledgements

We thank Nathaniel Chodosh, Jeff Tan, George Cazenavette, and Jason Zhang for proofreading our paper and Songwei Ge for reviewing our code. We thank Sheng-Yu Wang, Daohan (Fred) Lu, Tamaki Kojima, Krishna Wadhwani, Takuya Narihira, and Tatsuo Fujiwara as well for providing valuable feedback. This work is supported in part by the Sony Corporation, Cisco Systems, Inc., and the CMU Argo AI Center for Autonomous Vehicle Research.

The website template was borrowed from Jon Barron.