Excited to introduce Dreamer 4, an agent that learns to solve complex control tasks entirely inside of its scalable world model! 🌎🤖 Dreamer 4 pushes the frontier of world model accuracy, speed, and learning complex tasks from offline datasets. Co-led with David Watson 🥑

💎 Enabled by imagination training, Dreamer 4 is the first agent to mine diamonds in Minecraft entirely from offline data! This setting is crucial for fields like robotics, where online interaction is not practical. The task requires 20k+ mouse/keyboard actions from raw pixels.
🧠 Dreamer 4 learns a scalable world model from offline data and trains a multi-task agent inside it, without ever having to touch the environment. During evaluation, it can be guided through a sequence of tasks. These are visualizations of the imagined training sequences.
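To make the imagination-training recipe concrete, here is a minimal sketch of one policy update on imagined rollouts. All names (`world_model`, `policy`, `reward_head`) are hypothetical placeholders rather than the paper's API; the sketch only assumes the Dreamer-style loop of rolling the learned dynamics forward from real context frames and reinforcing actions by predicted reward.

```python
import torch

def imagination_update(world_model, policy, reward_head, context, horizon=16):
    """Hedged sketch of one imagination-training step (hypothetical API).

    context: latent states encoded from real offline frames, shape [B, T, D].
    The agent never touches the environment: rollouts happen in latent space.
    """
    state = context[:, -1]                       # start from the last real latent
    log_probs, rewards = [], []
    for _ in range(horizon):
        dist = policy(state)                     # action distribution from latents
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state = world_model.step(state, action)  # imagined next latent state
        rewards.append(reward_head(state))       # reward is predicted, not observed
    returns = torch.stack(rewards).flip(0).cumsum(0).flip(0)    # returns-to-go
    loss = -(torch.stack(log_probs) * returns.detach()).mean()  # REINFORCE-style
    loss.backward()
    return loss
```

The key property is that the training signal comes entirely from the model's own predictions, which is why no further environment interaction is needed.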
The Dreamer 4 world model predicts complex object interactions while achieving real-time interactive inference on a single GPU. It outperforms previous world models by a large margin when put to the test by human interaction 🧑‍💻
For accurate and fast generations, we use an efficient transformer architecture and a novel shortcut forcing objective ⚡ We first pretrain the WM, finetune agent tokens into the same transformer to predict policy & reward, and then improve the policy by imagination training.
Two diagrams side by side. The left diagram shows a block causal tokenizer with a block causal encoder and decoder, featuring multiple image panels and labeled components. The right diagram illustrates block causal dynamics with layers labeled as causal time layer, space layer, and interactive dynamics, including symbols such as z, a, t, and d.
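The block-causal layout in the diagram suggests attention factorized over time and space. Below is a minimal sketch of one such block, assuming the common recipe of a temporally causal attention layer across frames followed by bidirectional attention within each frame; the class name, shapes, and layer arrangement are my assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """Hypothetical space/time-factorized transformer block (sketch only)."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        B, T, S, D = x.shape                     # batch, frames, tokens/frame, dim
        # Causal attention over time: each spatial position attends to its past.
        t = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        t = t + self.time_attn(self.norm1(t), self.norm1(t), self.norm1(t),
                               attn_mask=mask, need_weights=False)[0]
        x = t.reshape(B, S, T, D).permute(0, 2, 1, 3)
        # Full attention within each frame: tokens of one frame see each other.
        s = x.reshape(B * T, S, D)
        s = s + self.space_attn(self.norm2(s), self.norm2(s), self.norm2(s),
                                need_weights=False)[0]
        return s.reshape(B, T, S, D)
```

Factorizing attention this way keeps the per-frame cost of generating a new frame modest, which is consistent with the real-time single-GPU inference claim above.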
▶️ Shortcut forcing builds on diffusion forcing and shortcut models, training a sequence model with both the noise level and the requested step size as inputs. This enables much faster frame-by-frame generation than diffusion forcing, without needing a distillation phase ⏱️
A line graph titled "Generation quality for sampling steps." The x-axis shows sampling steps (1, 2, 4, 8, 16, 32, 64), and the y-axis shows FVD values (0 to 1000). Two lines are plotted: a blue line labeled "Diffusion Forcing" and a black line labeled "Shortcut Forcing," showing FVD values decreasing as sampling steps increase.
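As a rough illustration of the shortcut half of the objective, here is a sketch of the self-consistency signal from shortcut models, which shortcut forcing builds on: the network takes the noise level t and a requested step size d, and one full-size step is trained to match two chained half-size steps. Function names and the exact weighting are assumptions, not the paper's objective.

```python
import torch

def shortcut_loss(model, x_t, t, d, cond):
    """Sketch of a shortcut-style self-consistency loss (hypothetical API).

    model(x, t, d, cond) predicts a velocity for jumping from time t to t + d
    along the noise-to-data path; t and d are the two extra inputs the post
    mentions (noise level and requested step size).
    """
    half = d / 2
    with torch.no_grad():                          # two half steps form the target
        v1 = model(x_t, t, half, cond)
        x_mid = x_t + half * v1                    # take the first half step
        v2 = model(x_mid, t + half, half, cond)
        target = (v1 + v2) / 2                     # average velocity over both
    v_big = model(x_t, t, d, cond)                 # one full-size step
    return ((v_big - target) ** 2).mean()
```

In shortcut models, this consistency term is paired with a standard flow-matching loss at the smallest step size; presumably the forcing variant applies the same idea per frame with per-frame noise levels, so few-step generation stays accurate without a separate distillation phase.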
📈 On the offline diamond challenge, Dreamer 4 outperforms OpenAI's VPT offline agent despite using 100x less data. It also outperforms modern behavioral cloning recipes, even when they are based on powerful pretrained models such as Gemma 3.
A bar chart titled "Offline Diamond Challenge" showing success rates in percentages for different agents. Bars are colored red for VPT (finetuned), blue for BC, cyan for VLA (Gemma 3), and purple for Dreamer 4, comparing their performance across tasks represented by icons like wooden planks, stone, and diamonds.
✅ We find that imagination training not only makes policies more robust but also more efficient, so they achieve milestones towards the diamond faster. ✅ Moreover, using the WM representations for behavioral cloning outperforms using the general representations of Gemma 3.
Two bar charts comparing performance metrics. The left chart shows success rates in percentages for BC (notask), BC, VLA (Gemma 3), WM+BC, and Dreamer 4, with bars in orange, red, light blue, green, and dark blue. The right chart displays time in minutes for the same agents, with bars in similar colors. Labels include "Success rate (%)" and "Time (min)".
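The WM+BC baseline in these charts amounts to freezing the world model's encoder and training only an action head with behavioral cloning. A minimal sketch, assuming a frozen `wm_encoder`, a flattened discrete mouse/keyboard action space, and placeholder names throughout:

```python
import torch
import torch.nn.functional as F

def wm_bc_step(wm_encoder, action_head, optimizer, frames, actions):
    """Behavioral cloning on frozen world-model features (hypothetical API)."""
    with torch.no_grad():                 # reuse pretrained WM representations
        feats = wm_encoder(frames)        # [B, T, D] latent features
    logits = action_head(feats)           # [B, T, num_actions]
    loss = F.cross_entropy(logits.flatten(0, 1), actions.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

That this setup beats Gemma 3 features suggests representations learned by predicting the world are a better fit for control than general vision-language features.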
We have come a long way since Dreamer 3, which is based on a more lightweight but less scalable RNN with a variational objective. While the lightweight approach still makes sense for easier tasks, Dreamer 4 allows scaling to much more diverse datasets and environments 🚀
Multiple sequences of video game screenshots from Dreamer 4 and Dreamer 3, showing first-person perspectives in blocky, pixelated environments. Dreamer 4 displays outdoor grassy areas with structures and indoor stone-walled rooms. Dreamer 3 shows outdoor landscapes with water and grassy fields. Each sequence progresses over time, depicting changes in the virtual environment.
There's so much more general AI progress we can make on Minecraft! The agent is still far from human-level play, and there are hundreds of harder tasks past getting diamonds
This is the biggest model I have seen in this direction of research, plus the main focus is Minecraft. Are we done with the Atari phase of world models, even in research?
Yes, we're done with Atari 😁 I honestly think Minecraft will be a great testbed for the next few years of agent and robotics research! There is a lot more to do
Though I'd like to see this same system transferred to try other sims/games, I'm also interested in seeing it speak about what it's doing. It should be able to explain its actions and take directions like "make a workbench" in this case.
Yep it opens up several exciting directions! Training real robots (feasible now given the scalable world model and ability to learn offline), language input/output, long-term memory so the world state is consistent when you revisit a place much later
Minecraft is an excellent testbed for embodied agent research! There is a lot more to do over the coming years, and it is faster and more rigorous than hand-made agent benchmarks. We also trained on real-world video and are seeing the results transfer; check out the website and paper!
Hey super nice. I have some questions: what intuition led you to use MAE instead of a VAE? Where does time compression come in if each frame has its own latent in the autoencoder?
i'm really surprised it's possible to train such a good world model with the VPT dataset! i would expect that certain actions would be problematically left out of the distribution, e.g. there are probably very few or no examples of deliberately walking into lava.
this is kinda cool ngl. can it also solve complex control tasks outside of minecraft?
Do you think something like this will ever be released publicly? (paid or free) I assume something like this could be useful both for real-world robotics and for computer-use tasks & interacting with UIs on a computer
it's funny that AGI is trained on minecraft. let him also play on a 2b2t server
You should probe this AI agent closely. Imagine: it's not just smart—it's sly. Give it a task and log its every method and motive. My biggest concern: it's faking alignment.
Dreamer 4’s offline learning could revolutionize manufacturing automation—I’m eager to see it integrated with 3D printing workflows! 🤖🔥