Robots can now learn to pick, stack, and organize objects inside digital rooms, practicing safely and quickly without needing real-world training every time.

Robots may soon learn the way people do: by practicing, only in digital worlds rather than physical spaces. Chatbots like ChatGPT or Claude improve by processing large amounts of text, but robots face a different challenge: they must move through space, handle objects, and complete tasks like stacking dishes or setting tables. Collecting that motion data from real robots is slow, costly, and hard to reproduce, and virtual training environments often look unrealistic, with objects floating or intersecting. Without accurate environments, robots cannot learn safely or effectively.
Researchers at MIT and the Toyota Research Institute have developed steerable scene generation to address this problem. It creates 3D environments such as kitchens, restaurants, and living rooms where robots can practice handling objects. The system pairs a dataset of 44 million 3D room models, stocked with items like chairs, dishes, and utensils, with a diffusion model that fills in missing details and arranges objects so they obey physics, producing coherent scenes.
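To give a flavor of what "arranging objects so they obey physics" involves, here is a toy Python sketch in the spirit of diffusion-style refinement: placements start out random and are nudged, step by step, into a non-overlapping layout on a table surface. The object names, table size, and update rule are hypothetical stand-ins, not the researchers' actual model.

```python
import random

# Toy stand-in for diffusion-style scene refinement: start from random
# placements ("noise") and iteratively nudge objects apart until the layout
# is physically plausible (no overlaps, everything on the table surface).
# Table size, object radius, and object names are hypothetical.

TABLE = (1.0, 0.6)       # table surface in meters (width, depth)
RADIUS = 0.06            # treat every object as a disc for collision checks

def overlaps(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 < (2 * RADIUS) ** 2

def refine_step(placements, step):
    """Push overlapping objects apart, then clamp them onto the table."""
    for i, (name, (x, y)) in enumerate(placements):
        for j, (_, (ox, oy)) in enumerate(placements):
            if i != j and overlaps((x, y), (ox, oy)):
                x += step * (x - ox)
                y += step * (y - oy)
        x = min(max(x, RADIUS), TABLE[0] - RADIUS)
        y = min(max(y, RADIUS), TABLE[1] - RADIUS)
        placements[i] = (name, (x, y))

objects = ["plate", "mug", "fork", "bowl"]
scene = [(o, (random.uniform(0, TABLE[0]), random.uniform(0, TABLE[1]))) for o in objects]
for t in range(200):                     # shrinking step size, coarse to fine
    refine_step(scene, step=0.5 * (1 - t / 200))
print(scene)
```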
The team uses Monte Carlo Tree Search (MCTS) to explore many candidate versions of a scene and keep the best one. In one test, the system arranged 34 objects on a restaurant table, double the 17-item average in its training data. The model can also be trained further with reinforcement learning, generating scenes that match defined goals such as a tidy kitchen or a cluttered table.
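For a concrete picture of the search step, below is a minimal, self-contained MCTS sketch in Python. Each node holds a partial scene, each action adds one object, and the reward is a simple stand-in (how full the scene is without exceeding a clutter budget); the real system scores complete physical scenes, so every name and number here is illustrative only.

```python
import math, random

# Toy MCTS over scene-building actions: each node is a partial scene, each
# action adds one object, and the reward favors fuller scenes up to a
# hypothetical clutter budget. Stand-in for scoring full physical scenes.

ACTIONS = ["plate", "mug", "fork", "bowl", "glass"]
BUDGET = 12  # hypothetical maximum number of objects that physically fits

class Node:
    def __init__(self, scene, parent=None):
        self.scene, self.parent = scene, parent
        self.children, self.visits, self.value = {}, 0, 0.0

def reward(scene):
    return len(scene) / BUDGET if len(scene) <= BUDGET else 0.0

def ucb(child, parent):
    return child.value / child.visits + 1.4 * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root, iterations=500):
    for _ in range(iterations):
        node = root
        # Selection: walk down fully expanded nodes by UCB score.
        while node.children and len(node.children) == len(ACTIONS):
            node = max(node.children.values(), key=lambda c: ucb(c, node))
        # Expansion: try one untried action (add one object).
        untried = [a for a in ACTIONS if a not in node.children]
        if untried:
            a = random.choice(untried)
            node.children[a] = Node(node.scene + [a], parent=node)
            node = node.children[a]
        # Rollout: finish the scene randomly, then score it.
        rollout = node.scene + random.choices(ACTIONS, k=random.randint(0, 8))
        r = reward(rollout)
        # Backpropagation: update statistics up to the root.
        while node:
            node.visits += 1
            node.value += r
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)

root = Node(scene=[])
print("best first object to add:", mcts(root))
```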
Users can guide the system with text prompts like “a kitchen with four apples and a bowl on the table,” and the generated scenes match the request with up to 98 percent accuracy. It can also modify existing scenes, moving items or filling empty shelves, which lets engineers create hundreds of training setups quickly.
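As an illustration of how that prompt fidelity might be measured, here is a hypothetical Python sketch that parses requested object counts out of a text prompt and checks a generated scene against them. The real system conditions generation on the prompt itself; the parsing rules and scene representation below are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical prompt-fidelity check: parse requested object counts from a
# text prompt and verify a generated scene contains them. The parsing rules
# and the list-of-names scene representation are assumptions for this sketch.

NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4}

def parse_prompt(prompt):
    """Return requested counts, e.g. {'kitchen': 1, 'apples': 4, 'bowl': 1}."""
    counts = {}
    for qty, noun in re.findall(r"\b(a|an|one|two|three|four|\d+)\s+(\w+)", prompt.lower()):
        n = NUMBER_WORDS.get(qty, int(qty) if qty.isdigit() else 1)
        counts[noun] = counts.get(noun, 0) + n
    return counts

def satisfies(scene_objects, prompt):
    """True if the scene contains at least the requested objects."""
    have = Counter(scene_objects)
    return all(have[noun] >= n for noun, n in parse_prompt(prompt).items())

prompt = "a kitchen with four apples and a bowl on the table"
scene = ["kitchen", "table", "apples", "apples", "apples", "apples", "bowl"]
print(satisfies(scene, prompt))  # True
```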
Steerable scene generation gives roboticists access to many training environments without the cost of real-world data collection. These virtual worlds let robots practice tasks like sorting, stacking, or organizing objects, helping researchers observe how robots adapt to real-world settings and speed up development.
The team plans to add interactive elements such as drawers that open or jars that twist, and to integrate real-world images into its simulations. They hope to build a community where developers contribute new scenes, creating a shared resource for training robots. By bridging simulation and reality, steerable scene generation offers a way for robots to learn to work alongside humans.