Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling


Haoyu Wu1*, Diankun Wu2*, Tianyu He1†, Junliang Guo1, Yang Ye1,
Yueqi Duan2, Jiang Bian1

1 Microsoft Research 2 Tsinghua University

Paper | Code (Coming Soon)

Abstract

Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over baseline methods.
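To make the two objectives concrete, below is a minimal PyTorch sketch of how they could be written. The tensor shapes, the regression head (`regressor`), and the exact reduction are illustrative assumptions, not the paper's implementation.

import torch.nn.functional as F

def angular_alignment_loss(diff_feat, geo_feat):
    # Angular Alignment: enforce directional consistency between the
    # diffusion model's intermediate features and the geometry-aware
    # target features via cosine similarity (both of shape (N, D)).
    return 1.0 - F.cosine_similarity(diff_feat, geo_feat, dim=-1).mean()

def scale_alignment_loss(diff_feat, geo_feat, regressor):
    # Scale Alignment: regress the *unnormalized* geometric features
    # from the *normalized* diffusion features, so scale information
    # is preserved even though the angular term is scale-invariant.
    pred = regressor(F.normalize(diff_feat, dim=-1))  # hypothetical regression head
    return F.mse_loss(pred, geo_feat)

In this reading, the angular term shapes feature directions while the scale term keeps the magnitude information that normalization would otherwise discard.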

Overview

Geometry Forcing equips video diffusion models with 3D awareness. We propose Geometry Forcing (GF), a simple yet effective paradigm that internalizes geometry-aware structure into video diffusion models by aligning their intermediate features with those of a pretrained geometric foundation model, i.e., VGGT. Compared to the baseline method, our method produces generations that are more consistent both temporally and geometrically. Features learned by the baseline model fail to reconstruct meaningful 3D geometry, whereas our method internalizes a 3D representation, enabling accurate 3D reconstruction from the intermediate features.
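As a rough sketch of how this alignment could sit inside training (reusing the loss functions above): a frozen VGGT supplies target features, a small projection head maps the diffusion model's intermediate features into that space, and the alignment terms are added to the usual denoising loss. The interfaces (`proj_head`, `scale_head`, a model that exposes intermediates, the weight `lambda_gf`) are assumptions for illustration.

import torch

def geometry_forcing_step(video, diffusion_model, vggt, proj_head, scale_head, lambda_gf=1.0):
    # Target geometry-aware features from the frozen foundation model.
    with torch.no_grad():
        geo_feat = vggt(video)
    # Assumed to return the denoising loss plus intermediate features.
    denoise_loss, mid_feat = diffusion_model(video)
    aligned = proj_head(mid_feat)
    gf_loss = (angular_alignment_loss(aligned, geo_feat)
               + scale_alignment_loss(aligned, geo_feat, scale_head))
    return denoise_loss + lambda_gf * gf_loss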

Main Results

Quantitative comparison on the RealEstate10K dataset for both short-term (16-frame) and long-term (256-frame) video generation. Our method (Geometry Forcing) achieves the best performance across all metrics. Bold values denote the best result and underlined values the second best. * indicates that the method is conditioned on the first frame only.

Visualization: 360° Rotation

[Animated demos 1 and 2: generated videos with paired initial-view and revisit-view frames, comparing Geometry Forcing (GF) against ground truth (GT).]

Qualitative comparison of camera view-conditioned video generation under full-circle rotation. Videos are generated from a single input frame and per-frame camera poses simulating a full 360° rotation. Geometry Forcing consistently revisits the starting viewpoint and generates meaningful intermediate frames.
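For readers who want to reproduce this kind of conditioning, the sketch below builds one plausible per-frame pose sequence for a full 360° orbit: camera-to-world matrices circling the scene center while looking at it. The orbit radius, axis conventions, and frame count are assumptions, not the paper's exact camera setup.

import numpy as np

def orbit_poses(num_frames=256, radius=2.0, height=0.0):
    # Camera-to-world poses (num_frames, 4, 4) for a full 360-degree
    # orbit around the origin, camera always looking at the scene center.
    poses = []
    for theta in np.linspace(0.0, 2.0 * np.pi, num_frames, endpoint=False):
        eye = np.array([radius * np.cos(theta), height, radius * np.sin(theta)])
        forward = -eye / np.linalg.norm(eye)  # view direction: toward the origin
        right = np.cross(np.array([0.0, 1.0, 0.0]), forward)
        right /= np.linalg.norm(right)
        up = np.cross(forward, right)
        c2w = np.eye(4)
        c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, up, forward, eye
        poses.append(c2w)
    return np.stack(poses)

With endpoint=False the final frame stops just short of the starting pose, so a consistent model should seamlessly revisit the initial viewpoint.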

Visualization: Consistency Across Frames

[Animated side-by-side comparisons of Geometry Forcing (GF) and ground truth (GT) videos.]

Videos generated by Geometry Forcing (GF) compared with ground truth (GT), showing navigation through a room from only the first frame and a camera pose sequence. Our method closely follows the camera trajectory and preserves the room's geometric structure, remaining highly consistent with the ground truth. In contrast, the baseline method (MineWorld) introduces noticeable artifacts and fails to maintain temporal and geometric consistency.

Citation

@article{wu2025geometryforcing,
  title={Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling},
  author={Wu, Haoyu and Wu, Diankun and He, Tianyu and Guo, Junliang and Ye, Yang and Duan, Yueqi and Bian, Jiang},
  journal={arXiv preprint arXiv:2507.07982},
  year={2025}
}