MotionCrafter

Dense Geometry and Motion Reconstruction with a 4D VAE

Ruijie Zhu1,2,* Jiahao Lu3 Wenbo Hu2† Xiaoguang Han4
Jianfei Cai5 Ying Shan2 Chuanxia Zheng1
1 NTU 2 ARC Lab, Tencent PCG 3 HKUST 4 CUHK(SZ) 5 Monash University

* Work done during an internship at Tencent ARC Lab
† Corresponding author
Paper Code 🤗 Hugging Face
Abstract
We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, together with a novel 4D VAE to effectively learn this representation. Unlike prior work that forces the 3D values and latents to align strictly with RGB VAE latents, despite their fundamentally different distributions, we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering improvements of 38.64% in geometry and 25.0% in motion reconstruction, all without any post-optimization.
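To make the joint representation concrete, below is a minimal sketch of how per-pixel point maps and scene flows can be stacked in a shared coordinate frame. The shapes and the displacement convention (scene flow as the per-pixel 3D offset to the next frame) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Illustrative shapes: T frames of dense per-pixel 3D point maps and 3D scene
# flows, both expressed in the same (shared) coordinate system.
T, H, W = 8, 64, 64
pointmaps  = np.random.randn(T, H, W, 3).astype(np.float32)  # XYZ per pixel
sceneflows = np.random.randn(T, H, W, 3).astype(np.float32)  # 3D offset per pixel

# Joint representation: concatenate geometry and motion along the channel axis.
joint = np.concatenate([pointmaps, sceneflows], axis=-1)      # (T, H, W, 6)

# Assuming scene flow is the per-pixel 3D displacement to the next frame,
# advecting the point map gives the corresponding 3D positions at t+1.
warped_next = pointmaps[:-1] + sceneflows[:-1]                # (T-1, H, W, 3)
print(joint.shape, warped_next.shape)
```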
Method
MotionCrafter Pipeline
Overview of MotionCrafter. We first train a novel 4D VAE (bottom right), consisting of a Geometry VAE and a Motion VAE, which jointly encode the point map and scene flow into a unified 4D latent representation. For the diffusion UNet, we leverage the pretrained VAE from SVD (Stable Video Diffusion) to encode the video into latents that serve as conditional inputs; these are channel-wise concatenated with our 4D latent to guide the denoising process. During diffusion training, noise is added only to the 4D latents. Note that we do not enforce the 4D latent distribution to align strictly with the original SVD VAE latent distribution; we find that this relaxed training strategy consistently improves the generalization of both the VAE and the diffusion UNet.
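The latent layout and conditioning can be summarized with a short, self-contained PyTorch sketch. The tiny Conv3d encoders, channel sizes, and single-step noising below are placeholders standing in for the actual Geometry/Motion/SVD VAEs and the diffusion schedule; this is an illustration of the data flow, not the released implementation.

```python
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Stand-in for a VAE encoder: downsamples spatially by 8x."""
    def __init__(self, c_in, c_latent):
        super().__init__()
        self.net = nn.Conv3d(c_in, c_latent, kernel_size=3, stride=(1, 8, 8), padding=1)

    def forward(self, x):              # x: (B, C, T, H, W)
        return self.net(x)             # (B, c_latent, T, H/8, W/8)

geometry_vae = DummyEncoder(3, 4)      # point maps  -> geometry latent
motion_vae   = DummyEncoder(3, 4)      # scene flows -> motion latent
svd_vae      = DummyEncoder(3, 4)      # RGB frames  -> conditioning latent

B, T, H, W = 1, 8, 64, 64
video      = torch.randn(B, 3, T, H, W)
pointmaps  = torch.randn(B, 3, T, H, W)
sceneflows = torch.randn(B, 3, T, H, W)

# Unified 4D latent: geometry and motion latents concatenated channel-wise.
z_4d = torch.cat([geometry_vae(pointmaps), motion_vae(sceneflows)], dim=1)

# Noise is added to the 4D latent only (single DDPM-style step for illustration).
alpha_bar = 0.7
z_4d_noisy = alpha_bar ** 0.5 * z_4d + (1 - alpha_bar) ** 0.5 * torch.randn_like(z_4d)

# The clean video latent from the (frozen) SVD VAE serves as the condition,
# attached by channel-wise concatenation before entering the denoising UNet.
with torch.no_grad():
    z_video = svd_vae(video)
unet_input = torch.cat([z_4d_noisy, z_video], dim=1)
print(unet_input.shape)                # torch.Size([1, 12, 8, 8, 8])
```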
Qualitative Performance
We present qualitative comparisons between MotionCrafter and the state-of-the-art method St4RTrack on in-the-wild sequences from the DAVIS dataset. Our approach consistently produces more accurate and detailed 4D geometry and motion reconstructions, demonstrating its effectiveness and generalization to in-the-wild data.
Qualitative comparisons (St4RTrack vs. Ours) on the DAVIS sequences Flamingo, Rollerblade, and Train.
Interactive Viewer
We provide an interactive viewer based on Viser to explore the reconstructed 4D geometry and motion. You can rotate, zoom, and pan the 3D scene to examine the details from different angles.
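For reference, here is a minimal sketch of a Viser-based viewer. The random points stand in for MotionCrafter's per-frame point maps, and the API names follow recent viser releases (they may differ in older versions); this is not the viewer shipped with the project.

```python
import time
import numpy as np
import viser

server = viser.ViserServer()   # serves a web viewer on localhost (default port 8080)

# Placeholder geometry: random points and colors in place of a reconstructed frame.
points = np.random.uniform(-1.0, 1.0, (10_000, 3))
colors = np.random.randint(0, 256, (10_000, 3), dtype=np.uint8)

server.scene.add_point_cloud(
    "/reconstruction/frame_000",
    points=points,
    colors=colors,
    point_size=0.01,
)

while True:                    # keep the server alive for the browser client
    time.sleep(1.0)
```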

BibTeX

@article{zhu2025motioncrafter,
    title={MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE},
    author={Zhu, Ruijie and Lu, Jiahao and Hu, Wenbo and Han, Xiaoguang and Cai, Jianfei and Shan, Ying and Zheng, Chuanxia},
    journal={arXiv preprint arXiv:2602.08961},
    year={2026}
}