

Yume aims to create interactive, realistic, and dynamic worlds from images, text, or videos, enabling exploration and control via peripheral devices. This preview version generates dynamic worlds from images with keyboard-controllable navigation.
Technical Framework
1. Camera Motion Quantization
Quantized Camera Motion (QCM) translates continuous camera trajectories into intuitive directional controls (forward/backward/left/right) and rotational actions (turn right/turn left/tilt up/tilt down), each mapped to a keyboard key. QCM embeds spatiotemporal context into the control signals without additional learnable modules.
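The mapping can be pictured as a simple thresholding rule over per-frame pose changes. The sketch below is a minimal illustration, assuming relative poses given as a camera-frame translation (right, forward) plus yaw/pitch deltas in degrees; the thresholds, key bindings, and function names are hypothetical, not Yume's actual implementation.

```python
# Minimal sketch of quantized camera motion: per-frame pose deltas -> keyboard actions.
# Thresholds and key bindings are illustrative assumptions, not Yume's real values.

KEY_MAP = {
    "forward": "W", "backward": "S", "left": "A", "right": "D",
    "turn_left": "Left", "turn_right": "Right", "tilt_up": "Up", "tilt_down": "Down",
}

def quantize_motion(dx, dz, dyaw, dpitch, t_thresh=0.05, r_thresh=2.0):
    """Map one frame's camera pose change to discrete keyboard actions.

    dx, dz: camera-frame translation (rightward, forward), arbitrary units.
    dyaw, dpitch: rotation changes in degrees.
    """
    actions = []
    if abs(dz) > t_thresh:                      # forward/backward motion
        actions.append("forward" if dz > 0 else "backward")
    if abs(dx) > t_thresh:                      # lateral strafing
        actions.append("right" if dx > 0 else "left")
    if abs(dyaw) > r_thresh:                    # horizontal rotation
        actions.append("turn_right" if dyaw > 0 else "turn_left")
    if abs(dpitch) > r_thresh:                  # vertical tilt
        actions.append("tilt_up" if dpitch > 0 else "tilt_down")
    return [(a, KEY_MAP[a]) for a in actions]

# A small step forward while turning right:
print(quantize_motion(dx=0.01, dz=0.30, dyaw=4.0, dpitch=0.3))
# [('forward', 'W'), ('turn_right', 'Right')]
```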
2. Video Generation Architecture
A Masked Video Diffusion Transformer (MVDT) with frame memory enables infinite autoregressive generation: recently generated frames are retained as conditioning, which maintains consistency across long sequences and overcomes the text-based control limitations observed in prior work.
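The autoregressive structure can be sketched as a rolling window of conditioning frames. The code below uses a hypothetical `mvdt_denoise` placeholder standing in for the MVDT sampler and illustrative latent shapes; it shows only the generation loop, not the model itself.

```python
# Schematic autoregressive long-video generation with a rolling frame memory.
# `mvdt_denoise`, the shapes, and the window sizes are assumptions for this sketch.
from collections import deque
import torch

MEM_LEN, CHUNK_LEN, LATENT_C, H, W = 8, 4, 16, 32, 32

def mvdt_denoise(memory, control_token):
    """Placeholder for the MVDT sampling step: returns the next chunk of denoised latents."""
    return torch.randn(CHUNK_LEN, LATENT_C, H, W)

def generate(first_frame_latent, controls):
    memory = deque([first_frame_latent], maxlen=MEM_LEN)   # rolling frame memory
    video = [first_frame_latent]
    for control in controls:                               # e.g. ["forward", "turn_right", ...]
        chunk = mvdt_denoise(torch.stack(list(memory)), control)
        for frame in chunk:                                 # extend memory with new frames;
            memory.append(frame)                            # the oldest frames fall out
        video.extend(chunk)
    return torch.stack(video)

frames = generate(torch.randn(LATENT_C, H, W), ["forward", "forward", "turn_right"])
print(frames.shape)  # (1 + 3 * CHUNK_LEN, LATENT_C, H, W)
```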
3. Enhanced Sampling Mechanisms
• Anti-Artifact Mechanism (AAM): Training-free refinement of latent representations to enhance details
• Time-Travel SDE (TTS-SDE): Stochastic differential equation-based sampling that uses future-frame guidance to maintain temporal coherence (a sketch of the time-travel resampling idea follows this list)
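The sketch below illustrates the general time-travel resampling idea inside an SDE denoising loop: after a few denoising steps the sampler re-noises back to an earlier step and repeats the segment, so later, cleaner estimates can inform earlier ones. The score model, noise schedule, and jump sizes are placeholders and do not reproduce Yume's exact TTS-SDE formulation or its guidance signal.

```python
# Time-travel resampling in an Euler-Maruyama reverse-SDE sampler (illustrative only).
import torch

def sde_step(x, t, dt, score_fn):
    """One Euler-Maruyama step of a VP-style reverse SDE with a linear beta(t) schedule."""
    beta = 0.1 + 19.9 * t
    drift = -0.5 * beta * x - beta * score_fn(x, t)
    noise = torch.sqrt(torch.tensor(beta * dt)) * torch.randn_like(x)
    return x + drift * (-dt) + noise                        # integrate backward in time

def renoise(x, t, dt, steps):
    """Jump `steps` steps forward in diffusion time by re-adding matched noise."""
    beta = 0.1 + 19.9 * t
    for _ in range(steps):
        x = x * torch.sqrt(torch.tensor(1 - beta * dt)) \
            + torch.sqrt(torch.tensor(beta * dt)) * torch.randn_like(x)
    return x

def tts_sample(x_T, score_fn, num_steps=50, travel_every=10, travel_len=5):
    x, dt = x_T, 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = sde_step(x, t, dt, score_fn)
        if (i + 1) % travel_every == 0 and i + 1 < num_steps:
            x = renoise(x, t - dt, dt, travel_len)           # travel back in time...
            for j in range(travel_len):                      # ...and redo the segment
                x = sde_step(x, t - dt + (travel_len - j) * dt, dt, score_fn)
    return x

# Usage with a dummy score model on a latent "video chunk":
dummy_score = lambda x, t: -x                                # stands in for the trained network
sample = tts_sample(torch.randn(4, 16, 32, 32), dummy_score)
print(sample.shape)
```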
4. Optimization Acceleration
A synergistic optimization that combines adversarial distillation with caching mechanisms accelerates sampling by 3× while preserving visual fidelity.
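As a rough picture of the caching half, the toy module below recomputes its deep transformer blocks only every few denoising steps and reuses their cached contribution otherwise, trading a small accuracy loss for fewer forward passes. The block layout and cache policy are assumptions for illustration; the adversarial-distillation component and Yume's actual acceleration pipeline are not reproduced here.

```python
# Toy step-level feature caching for a stack of transformer blocks (illustrative only).
import torch
import torch.nn as nn

class CachedBlocks(nn.Module):
    def __init__(self, dim=64, depth=6, refresh_every=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(depth)]
        )
        self.refresh_every = refresh_every
        self._cached_residual = None            # deep-block contribution, reused between refreshes

    def forward(self, x, step):
        shallow = self.blocks[0](x)             # the shallow block always runs
        if step % self.refresh_every == 0 or self._cached_residual is None:
            deep = shallow
            for blk in self.blocks[1:]:         # recompute the deep blocks on refresh steps
                deep = blk(deep)
            self._cached_residual = deep - shallow   # cache what the deep blocks added
        return shallow + self._cached_residual  # otherwise reuse the cached contribution

model = CachedBlocks()
x = torch.randn(1, 16, 64)                      # (batch, tokens, dim) latent tokens
for step in range(4):                           # pretend denoising loop
    x = model(x, step)
print(x.shape)
```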
Trained on the Sekai world exploration dataset, Yume achieves remarkable results across diverse scenes. All resources are open-source: