Yume: An Interactive World Generation Model

Xiaofeng Mao1,2, Shaoheng Lin1, Zhen Li1, Chuanhao Li1, Wenshuo Peng1,
Tong He1, Jiangmiao Pang1, Mingmin Chi2, Yu Qiao1, Kaipeng Zhang1,3†‡
1Shanghai AI Laboratory, 2Fudan University, 3Shanghai Innovation Institute
We are looking for collaboration and self-motivated interns. Contact: zhangkaipeng@pjlab.org.cn.

Abstract

Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world that allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of that world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a carefully designed framework with four main components: camera motion quantization, the video generation architecture, an advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction via keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, a training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced into the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration through the synergistic optimization of adversarial distillation and caching mechanisms. We train Yume on the high-quality world exploration dataset Sekai, and it achieves remarkable results across diverse scenes and applications. All data, the codebase, and model weights are available at https://github.com/stdstu12/YUME. Yume will be updated monthly to work toward its original goal.

Introduction Video

Overview

Yume framework

Yume aims to create interactive, realistic, and dynamic worlds from images, text, or videos, enabling exploration and control via peripheral devices. This preview version generates dynamic worlds from images with keyboard-controllable navigation.

Technical Framework

1. Camera Motion Quantization
Quantized Camera Motion (QCM) translates camera trajectories into intuitive directional controls (forward/backward/left/right) and rotational actions (turn right/turn left/tilt up/tilt down) mapped to keyboard input. QCM embeds spatiotemporal context into control signals without additional learnable modules.
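As a rough illustration of the idea, the sketch below maps a per-frame camera delta to discrete keyboard-style actions by thresholding the dominant translation and rotation axes. The function name, thresholds, and axis conventions (+z forward, +x right, +yaw right, +pitch up) are illustrative assumptions, not the exact QCM implementation.

    def quantize_camera_motion(t_delta, r_delta, t_thresh=0.05, r_thresh=0.02):
        """Map one frame-to-frame camera delta to discrete keyboard actions.

        t_delta: (x, y, z) translation in the camera frame (assumed +z forward, +x right).
        r_delta: (yaw, pitch) rotation deltas in radians (assumed +yaw right, +pitch up).
        Returns a (move, turn) pair of discrete actions; either may be None.
        """
        x, _, z = t_delta
        yaw, pitch = r_delta

        move = None
        if max(abs(x), abs(z)) > t_thresh:
            if abs(z) >= abs(x):
                move = "forward" if z > 0 else "backward"
            else:
                move = "right" if x > 0 else "left"

        turn = None
        if max(abs(yaw), abs(pitch)) > r_thresh:
            if abs(yaw) >= abs(pitch):
                turn = "turn_right" if yaw > 0 else "turn_left"
            else:
                turn = "tilt_up" if pitch > 0 else "tilt_down"

        return move, turn

    # Example: a step that moves forward while turning slightly to the right.
    print(quantize_camera_motion((0.01, 0.0, 0.12), (0.04, -0.01)))
    # -> ('forward', 'turn_right')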

2. Video Generation Architecture
Masked Video Diffusion Transformer (MVDT) with frame memory enables infinite autoregressive generation. This overcomes text-based control limitations observed in prior work, maintaining consistency across long sequences.
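A minimal sketch of the autoregressive rollout with a sliding-window frame memory follows. Here denoise_chunk is a made-up stub standing in for the MVDT denoiser (the real model runs masked-diffusion sampling conditioned on the memory frames and the action); names, shapes, and the memory length are assumptions for illustration.

    import torch

    def denoise_chunk(memory_frames, action, chunk_len=8):
        """Stub denoiser: returns chunk_len new latent frames conditioned on the
        memory frames and a discrete action. A real model would run masked-diffusion
        sampling here instead of drawing random tensors."""
        c, h, w = memory_frames.shape[1:]
        return torch.randn(chunk_len, c, h, w)

    def rollout(first_frame, actions, memory_len=16):
        """Generate video chunk by chunk, feeding back the last memory_len frames
        as conditioning so the sequence can be extended indefinitely."""
        frames = first_frame.unsqueeze(0)          # (1, C, H, W)
        for action in actions:
            memory = frames[-memory_len:]          # sliding-window frame memory
            new_chunk = denoise_chunk(memory, action)
            frames = torch.cat([frames, new_chunk], dim=0)
        return frames

    video = rollout(torch.randn(4, 32, 32), actions=["forward", "turn_left"])
    print(video.shape)  # torch.Size([17, 4, 32, 32]): 1 seed frame + 2 chunks of 8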

3. Enhanced Sampling Mechanisms
Anti-Artifact Mechanism (AAM): training-free refinement of latent representations that enhances visual detail and suppresses artifacts.
Time Travel Sampling based on SDEs (TTS-SDE): stochastic-differential-equation-based sampling that uses future-frame guidance to maintain temporal coherence (a toy sketch of the time-travel resampling idea follows this list).
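To make the time-travel idea concrete, here is a toy reverse-SDE sampler on 1-D Gaussian data that periodically re-noises back a few noise levels and re-denoises, so cleaner late states can refine earlier decisions. The score function, noise schedule, and hyperparameters are illustrative assumptions and this is not Yume's TTS-SDE, which operates on video latents with a learned denoiser.

    import math
    import torch

    def gaussian_score(x, sigma):
        # Exact score of N(0, 1 + sigma^2): valid when the clean data is a unit Gaussian.
        return -x / (1.0 + sigma ** 2)

    @torch.no_grad()
    def time_travel_sde_sample(steps=60, travel_every=10, travel_back=3, sigma_max=10.0):
        sigmas = torch.linspace(sigma_max, 0.02, steps)
        x = torch.randn(4096) * sigma_max
        i, travelled = 0, set()
        while i < steps - 1:
            s, s_next = float(sigmas[i]), float(sigmas[i + 1])
            delta = s ** 2 - s_next ** 2                         # positive: noise level drops
            x = x + delta * gaussian_score(x, s)                 # reverse-SDE drift step
            x = x + math.sqrt(delta) * torch.randn_like(x)       # reverse-SDE diffusion step
            # Time travel: occasionally re-noise back a few levels and re-denoise.
            if (i + 1) % travel_every == 0 and (i + 1) not in travelled and i + 1 > travel_back:
                travelled.add(i + 1)
                j = i + 1 - travel_back
                s_back = float(sigmas[j])
                x = x + math.sqrt(s_back ** 2 - s_next ** 2) * torch.randn_like(x)
                i = j
            else:
                i += 1
        return x

    print(time_travel_sde_sample().std())  # should be close to 1 for the unit-Gaussian toy data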

4. Optimization Acceleration
Synergistic optimization combining adversarial distillation and caching mechanisms boosts sampling efficiency 3× while preserving visual fidelity.
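As one example of how caching can cut per-step cost, the sketch below reuses the output of an expensive sub-network across sampling steps and only refreshes it periodically (in the spirit of DeepCache-style feature reuse). The module split, refresh schedule, and update rule are illustrative assumptions rather than Yume's exact acceleration pipeline, and the adversarial-distillation half is omitted.

    import torch
    import torch.nn as nn

    class TinyDenoiser(nn.Module):
        """Toy denoiser split into a 'deep' part (expensive) and a 'head' (cheap)."""
        def __init__(self, dim=64):
            super().__init__()
            self.deep = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            self.head = nn.Linear(dim, dim)

        def forward(self, x, cached_deep=None):
            # Reuse the expensive deep features when a cached copy is provided.
            deep = cached_deep if cached_deep is not None else self.deep(x)
            return self.head(deep + x), deep

    @torch.no_grad()
    def sample_with_cache(model, x, steps=20, refresh_every=2):
        cache = None
        for step in range(steps):
            refresh = step % refresh_every == 0           # recompute deep features periodically
            out, deep = model(x, cached_deep=None if refresh else cache)
            if refresh:
                cache = deep
            x = x - 0.05 * out                            # dummy update rule for the sketch
        return x

    model = TinyDenoiser()
    print(sample_with_cache(model, torch.randn(2, 64)).shape)  # torch.Size([2, 64])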

Trained on the Sekai world exploration dataset, Yume achieves remarkable results across diverse scenes. All resources are open-source:

https://github.com/stdstu12/YUME

BibTeX

@article{mao2025yume,
    title={Yume: An Interactive World Generation Model},
    author={Xiaofeng Mao and Shaoheng Lin and Zhen Li and Chuanhao Li and Wenshuo Peng and Tong He and Jiangmiao Pang and Mingmin Chi and Yu Qiao and Kaipeng Zhang},
    journal={arXiv preprint arXiv:2507.17744},
    year={2025}
}