Impossible Staircase
Generative View Stitching (GVS) can generate a navigation video through our variant of Oscar Reutersvärd's Impossible Staircase, where the video forms a visually continuous loop thanks to our proposed loop-closing mechanism.
Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses.
To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersvärd’s Impossible Staircase.
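As a rough sketch of the Omni Guidance idea (the notation and weighting below are ours, not necessarily the paper's exact formulation): writing the model's noise prediction for a noisy target chunk $x^{\text{tgt}}_t$ as $\epsilon_{\theta}$, the prediction conditioned on the neighboring past and future chunks is extrapolated away from the unconditioned one, in the spirit of classifier-free guidance,

$$\tilde{\epsilon}_{\theta}\left(x^{\text{tgt}}_t\right) = \epsilon_{\theta}\left(x^{\text{tgt}}_t\right) + w\left[\epsilon_{\theta}\left(x^{\text{tgt}}_t \mid x^{\text{past}}, x^{\text{future}}\right) - \epsilon_{\theta}\left(x^{\text{tgt}}_t\right)\right],$$

where $w$ is a hypothetical guidance weight that strengthens the conditioning on both the past and future.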
Given a pretrained DFoT video model with an 8-frame context window and a predefined camera trajectory:
Generative View Stitching can generate a 120-frame navigation video that is stable, collision-free, faithful to the conditioning trajectory, and frame-to-frame consistent, and that closes loops. Autoregressive Sampling, on the other hand, diverges due to collisions with the generated scene, is not faithful to the conditioning trajectory, and displays poor loop closure even when augmented with retrieval-augmented generation (RAG).
GVS can be scaled to longer videos given more test-time compute, as demonstrated in the following 1080-frame video that climbs an 18-story staircase. The entire video is collision-free, further demonstrating GVS's long-horizon stability.
Generative View Stitching (GVS) is a training-free diffusion stitching method that is compatible with any off-the-shelf video model trained with Diffusion Forcing. We first partition the target video into non-overlapping chunks shorter than the model's context window, then denoise every target chunk jointly with its neighboring chunks to condition on both the past and future. We use the denoised target chunk of every context window to update the noisy stitched sequence while discarding the denoised past and future conditioning chunks. We enhance the temporal consistency in stitching with Omni Guidance, which guides the original score function $\epsilon_{\theta}$ with the neighboring chunks to strengthen the conditioning on both the past and future.
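The following is a minimal sketch of the stitching sampler described above, assuming a hypothetical denoiser interface `denoise_fn(frames, cams, t, drop_neighbors)` that returns a noise prediction for every frame in the window; the chunk size, noise schedule, and final update rule are illustrative placeholders rather than the paper's exact settings.

```python
# Sketch of GVS-style parallel stitching with a schematic Omni Guidance term.
import numpy as np


def omni_guided_eps(denoise_fn, frames, cams, t, target_slice, w):
    """Schematic Omni Guidance: extrapolate the neighbor-conditioned prediction
    away from the prediction with the past/future chunks masked out (CFG-style)."""
    eps_cond = denoise_fn(frames, cams, t, drop_neighbors=False)
    eps_uncond = denoise_fn(frames, cams, t, drop_neighbors=True)
    eps = eps_uncond + w * (eps_cond - eps_uncond)
    return eps[target_slice]  # keep only the target chunk's prediction


def gvs_sample(denoise_fn, cameras, num_frames, frame_shape,
               chunk=2, window_frames=8, steps=50, w=2.0, seed=0):
    """Denoise all chunks of the target video in parallel (sketch).

    Each non-overlapping target chunk is denoised jointly with its past and
    future neighbor chunks inside one context window; only the target chunk's
    prediction is written back into the shared noisy stitched sequence."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_frames, *frame_shape))  # noisy stitched sequence

    for step in range(steps, 0, -1):
        t = step / steps
        eps_stitched = np.zeros_like(x)
        for s in range(0, num_frames, chunk):
            lo = max(0, s - chunk)               # start of past neighbor chunk
            hi = min(num_frames, s + 2 * chunk)  # end of future neighbor chunk
            assert hi - lo <= window_frames, "window must fit the model's context"
            tgt = slice(s - lo, s - lo + min(chunk, num_frames - s))
            eps_tgt = omni_guided_eps(denoise_fn, x[lo:hi], cameras[lo:hi], t, tgt, w)
            eps_stitched[s:s + eps_tgt.shape[0]] = eps_tgt
        # Toy Euler-style update of the whole sequence; the DFoT model's actual
        # sampler/scheduler would replace this line.
        x = x - (1.0 / steps) * eps_stitched
    return x
```

For instance, with an 8-frame context window a chunk size of 2 leaves room for one past and one future neighbor chunk (6 frames in total); the actual chunk and window configuration is a design detail of the method that this sketch does not pin down.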
Autoregressive (AR) Sampling collides with the generated scene, fails to dream up the desired staircase, and performs last-minute loop closure, resulting in discontinuities in scene appearance. StochSync, a diffusion stitching method for images, handles these tasks better, but it generates shape-shifting scenes that lack temporal consistency. GVS, on the other hand, avoids collisions, generates the desired staircase, and closes loops, all while maintaining temporal consistency.
@article{song2025gvs,
  author  = {Song, Chonghyuk and Stary, Michal and Chen, Boyuan and Kopanas, George and Sitzmann, Vincent},
  title   = {Generative View Stitching},
  journal = {arXiv preprint arXiv:2510.24718},
  year    = {2025},
}
This work was supported by the National Science Foundation under Grant No. 2211259, by the Singapore DSTA under DST00OECI20300823 (New Representations for Vision, 3D Self-Supervised Learning for Label-Efficient Vision), by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) under 140D0423C0075, by the Amazon Science Hub, by the MIT-Google Program for Computing Innovation, and by Sony Interactive Entertainment.