Thursday, July 18, 2024

Can Large Language Models Help Robots To Navigate?


MIT and MIT-IBM Watson AI Lab researchers have created a navigation method that converts visual inputs into text to guide robots through tasks using a language model.

A new navigation method uses language-based inputs to direct a robot through a multistep navigation task like doing laundry.
Credit: iStock

Someday, you might want a home robot to carry laundry to the basement, a task that requires it to combine verbal instructions with visual cues. However, this is challenging for AI agents: current systems rely on multiple complex machine-learning models and need extensive visual training data, which is hard to obtain.

Researchers from MIT and the MIT-IBM Watson AI Lab have developed a navigation method that translates visual inputs into text descriptions. A large language model then processes these descriptions to guide a robot through multistep tasks. Because the approach uses text captions instead of computationally intensive visual representations, it also allows extensive synthetic training data to be generated efficiently.


Solving a vision problem with language

The method uses a simple captioning model to translate the robot’s visual observations into text descriptions. These descriptions, along with the verbal instructions, are fed into a large language model, which then decides the robot’s next step. After each step, the model generates a scene caption that is used to update the robot’s trajectory, continually guiding it towards its goal. The information is standardized in templates and presented to the model as a series of choices based on the surroundings, such as moving towards a door or an office, which streamlines the decision-making process.
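The caption-template-choice loop described above can be illustrated with a short Python sketch. It is not the researchers' implementation: the template wording, the function names (build_prompt, parse_choice, next_step) and the llm callable are all assumptions made for illustration, and the language model is stood in for by a plain function that returns a string.

```python
import re
from typing import Callable, List

# Hypothetical template: the article says information is standardized in
# templates and shown as a series of choices, but does not give the wording.
PROMPT_TEMPLATE = (
    "Instruction: {instruction}\n"
    "Steps so far: {history}\n"
    "Current view: {caption}\n"
    "Options:\n{options}\n"
    "Reply with the number of the option to take."
)


def build_prompt(instruction: str, history: List[str],
                 caption: str, options: List[str]) -> str:
    """Fill the template with the instruction, trajectory, caption and choices."""
    numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    steps = " -> ".join(history) if history else "(start)"
    return PROMPT_TEMPLATE.format(instruction=instruction, history=steps,
                                  caption=caption, options=numbered)


def parse_choice(reply: str, options: List[str]) -> str:
    """Map the first integer in the model's reply back to one of the options."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no option number found in reply: {reply!r}")
    index = int(match.group()) - 1
    if not 0 <= index < len(options):
        raise ValueError(f"option {index + 1} is out of range")
    return options[index]


def next_step(instruction: str, history: List[str], caption: str,
              options: List[str], llm: Callable[[str], str]) -> str:
    """Ask the (assumed) language model for the next action and record it."""
    prompt = build_prompt(instruction, history, caption, options)
    action = parse_choice(llm(prompt), options)
    history.append(action)  # the text trajectory is updated after every step
    return action


if __name__ == "__main__":
    # Stand-in for a real language model: always picks the first option.
    fake_llm = lambda prompt: "1"
    trajectory: List[str] = []
    chosen = next_step(
        "Carry the laundry to the basement",
        trajectory,
        "You are in a hallway with a door and an office.",
        ["move towards the door", "move towards the office"],
        fake_llm,
    )
    print(chosen, trajectory)
```

In a real system the caption would come from an image-captioning model and llm would call an actual language model; both are deliberately left as plug-in points here.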

Advantages of language

When tested, this language-based navigation approach didn’t outperform vision-based methods, but it offered distinct advantages. It uses fewer resources, which allows rapid synthetic data generation, for instance creating 10,000 synthetic trajectories from only 10 real-world ones. Because it relies on natural language, the system is also easier for humans to understand, and its single type of input makes it versatile across different tasks. However, it loses some information that vision-based models capture, such as depth. Surprisingly, combining the language-based approach with vision-based methods improves navigation capabilities.
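To make the scale of that data generation concrete, here is a minimal Python sketch of expanding a handful of text trajectories into many synthetic ones. The article does not describe how the synthetic trajectories are produced, so the expand_trajectories function, the generate callable and the placeholder generator are hypothetical; in practice the generator would presumably be a language model prompted with each real trajectory.

```python
from typing import Callable, List

Trajectory = List[str]  # a trajectory is a sequence of text captions/actions


def expand_trajectories(
    seeds: List[Trajectory],
    generate: Callable[[Trajectory, int], Trajectory],
    per_seed: int,
) -> List[Trajectory]:
    """Produce per_seed synthetic trajectories from every real (seed) one."""
    synthetic: List[Trajectory] = []
    for seed in seeds:
        for variant in range(per_seed):
            synthetic.append(generate(seed, variant))
    return synthetic


def placeholder_generator(seed: Trajectory, variant: int) -> Trajectory:
    """Stand-in for a language model: tags each step instead of rewriting it."""
    return [f"{step} (variant {variant})" for step in seed]


if __name__ == "__main__":
    # 10 hypothetical real trajectories, each a short list of text steps.
    real = [[f"route {i}: leave room", f"route {i}: go downstairs"]
            for i in range(10)]
    data = expand_trajectories(real, placeholder_generator, per_seed=1000)
    print(len(data))  # 10 seeds x 1000 variants each
```

Because every trajectory is plain text, this kind of expansion needs no rendering or simulation of images, which is where the resource savings come from.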

Researchers aim to enhance their method by developing a navigation-focused captioner and exploring how large language models can demonstrate spatial awareness to improve navigation.

Nidhi Agarwal
Nidhi Agarwal is a journalist at EFY. She is an Electronics and Communication Engineer with over five years of academic experience. Her expertise lies in working with development boards and IoT cloud. She enjoys writing because it lets her share her knowledge and insights about electronics with like-minded techies.

