A new AI framework gives robots stronger 3D spatial understanding, enabling them to translate complex human instructions into accurate physical actions without task-specific training.

Researchers have developed a new robotics framework that could significantly improve how machines understand and execute complex human instructions by connecting language understanding with spatial awareness in real-world environments. The system, called Retrieval-Augmented Manipulation (RAM), helps robots convert abstract instructions into precise three-dimensional actions.
The framework addresses a long-standing challenge in robotics. While modern robots powered by vision-language models (VLMs) can understand simple instructions such as placing one object on another, they often struggle when tasks involve complex spatial relationships, orientation requirements, or contextual understanding. These limitations become more evident in environments requiring precision handling and adaptive decision-making.
Developed by researchers from the Chinese University of Hong Kong, Zhejiang Humanoid Robot Innovation Center, and collaborating institutions, RAM introduces an object-centric approach that links semantic understanding with explicit 3D scene representation. Instead of relying only on image and text interpretation, the framework builds a detailed understanding of objects around the robot, including their location, shape, orientation, and relative positioning.
The process begins by analyzing visual data captured by onboard cameras. The system then generates a 3D representation of the scene and feeds this information back into the vision-language model. By adding spatial context, the framework enables robots to translate human instructions into realistic and physically achievable actions.
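To make the idea of feeding spatial context back into the model concrete, here is a minimal sketch in Python. It is not RAM's actual implementation, which the researchers' own materials describe. The names `SceneObject`, `is_on_top`, `scene_to_text`, and `build_prompt` are hypothetical, and the sketch assumes axis-aligned bounding boxes in a robot base frame measured in metres.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    """One detected object (hypothetical structure, for illustration only)."""
    name: str
    position: tuple   # (x, y, z) centre in metres, in an assumed robot base frame
    size: tuple       # (length, width, height) of an axis-aligned box, in metres
    yaw_deg: float    # rotation about the vertical axis, in degrees


def is_on_top(a, b, tol=0.02):
    """True if a's underside meets b's top face and a's centre lies over b's footprint.

    Simplification: yaw is ignored when checking the footprint.
    """
    a_bottom = a.position[2] - a.size[2] / 2
    b_top = b.position[2] + b.size[2] / 2
    over_x = abs(a.position[0] - b.position[0]) <= b.size[0] / 2
    over_y = abs(a.position[1] - b.position[1]) <= b.size[1] / 2
    return abs(a_bottom - b_top) <= tol and over_x and over_y


def scene_to_text(objects):
    """Serialize object poses and simple pairwise relations as plain text."""
    lines = []
    for obj in objects:
        x, y, z = obj.position
        lines.append(
            f"{obj.name}: centre=({x:.2f}, {y:.2f}, {z:.2f}) m, yaw={obj.yaw_deg:.0f} deg"
        )
    for a in objects:
        for b in objects:
            if a is not b and is_on_top(a, b):
                lines.append(f"{a.name} is on {b.name}")
    return "\n".join(lines)


def build_prompt(instruction, objects):
    """Combine the scene description with the human instruction for a VLM."""
    return f"Scene:\n{scene_to_text(objects)}\nInstruction: {instruction}"
```

Given a table and a mug resting on it, `build_prompt("rotate the mug so its handle faces left", objects)` would produce a text block listing each object's pose plus the line "mug is on table", which a vision-language model could read alongside the camera image.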
RAM also decomposes complex activities into smaller sub-goals, allowing robots to modify their actions dynamically if obstacles arise or environmental conditions change. Researchers demonstrated the framework on real robots in zero-shot settings, where the machines completed tasks they had not encountered during training.
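The sub-goal idea can be sketched as a simple execute-and-replan loop. This is an illustrative assumption about how such a loop might look, not the framework's published algorithm. `run_plan`, `execute`, `replan`, and `max_replans` are hypothetical names, and in a real system `execute` would drive the robot while `replan` would query the planner again.

```python
def run_plan(subgoals, execute, replan, max_replans=3):
    """Run sub-goals in order; when one fails, splice in a replacement sub-plan.

    execute(goal) -> bool      : hypothetical callback, True if the step succeeded
    replan(goal)  -> list[str] : hypothetical callback, new steps that replace `goal`
    max_replans                : total replanning budget, so the loop always ends
    """
    queue = list(subgoals)
    completed = []
    replans_left = max_replans
    while queue:
        goal = queue.pop(0)
        if execute(goal):
            completed.append(goal)
            continue
        if replans_left == 0:
            raise RuntimeError(f"gave up on sub-goal: {goal}")
        replans_left -= 1
        # Put the replacement steps at the front of the remaining plan.
        queue = list(replan(goal)) + queue
    return completed
```

For example, if "place mug on shelf" fails because a box is in the way, `replan` might return "move box aside" followed by the original step, and the loop continues from there. The fixed replanning budget is a design choice that keeps a persistently failing step from looping forever.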
The technology could support the next generation of household, industrial, and service robots, particularly in electronics manufacturing environments where precise object handling and adaptive task execution are increasingly critical.



