Researchers from MIT and the MIT-IBM Watson AI Lab have developed a new navigation method for home robots that combines language-based representations with visual signals to improve navigation performance in situations that lack enough visual data for training. The system converts visual representations into text captions that describe the robot's point of view, which are then fed into a large language model that predicts the actions the robot should take to fulfill user instructions. Because the representations are purely language-based, the researchers could efficiently generate synthetic training data with a large language model.
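As a rough illustration of how such text-only synthetic data might be produced, the sketch below asks a language model to imagine a step-by-step trajectory for a given instruction. The `llm` callable, the prompt wording, and the CAPTION/ACTION line format are assumptions made for illustration, not the researchers' actual setup.

```python
# A minimal sketch of text-only synthetic data generation, assuming a generic
# text-completion callable `llm`. The prompt wording and the CAPTION/ACTION
# format are illustrative assumptions, not the researchers' exact setup.
from typing import Callable, List, Tuple

def synthesize_trajectory(llm: Callable[[str], str], instruction: str) -> List[Tuple[str, str]]:
    """Ask a language model to imagine a step-by-step trajectory for an instruction."""
    prompt = (
        "You are simulating a home robot following an instruction.\n"
        f"Instruction: {instruction}\n"
        "Write one line per step in the form 'CAPTION: ... | ACTION: ...', where\n"
        "CAPTION describes what the robot would see and ACTION is one of:\n"
        "move forward, turn left, turn right, stop."
    )
    steps: List[Tuple[str, str]] = []
    for line in llm(prompt).splitlines():
        if "CAPTION:" in line and "| ACTION:" in line:
            caption, action = line.split("| ACTION:", 1)
            steps.append((caption.replace("CAPTION:", "").strip(), action.strip()))
    return steps
```

Each resulting (caption, action) pair can serve as a training example without any camera data at all, which is what makes this kind of data cheap to generate.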
The researchers turned to large language models, among the most powerful machine-learning models available, to address the complex task of vision-and-language navigation. They used a captioning model to generate text descriptions of the robot's visual observations, which were combined with language-based instructions to guide the robot through the multistep navigation process. At each step, the large language model outputs a caption of the scene the robot should see after completing that step and updates the trajectory history, guiding the robot toward its goal one step at a time. The overall aim was to substitute language for the visual data from the robot's camera in order to streamline the navigation process.
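The step-by-step loop can be pictured as something like the sketch below, in which a captioning model turns the current camera view into text and a language model chooses the next action and predicts the scene that should follow. Here `caption_image`, `llm`, the four-action space, and the prompt format are placeholders for illustration, not the researchers' actual interfaces.

```python
# A minimal sketch of the step-by-step navigation loop, assuming a captioning
# model (`caption_image`) and a language model (`llm`) supplied by the caller.
# The prompt format and the action set are assumptions for illustration.
from typing import Callable, List

ACTIONS = ["move forward", "turn left", "turn right", "stop"]

def navigation_step(llm: Callable[[str], str],
                    caption_image: Callable[[object], str],
                    instruction: str,
                    history: List[str],
                    current_view: object) -> str:
    """Caption the current view, ask the LLM for the next action, and update the history."""
    observation = caption_image(current_view)  # vision -> text
    prompt = (
        f"Instruction: {instruction}\n"
        "Trajectory so far:\n" + "\n".join(history) + "\n"
        f"Current view: {observation}\n"
        f"Pick the next action from {ACTIONS} and describe the scene the robot\n"
        "should see after taking it, as 'ACTION: ... | NEXT: ...'."
    )
    reply = llm(prompt)
    history.append(f"Saw: {observation} -> {reply}")  # the state is kept entirely in text
    action = reply.split("|", 1)[0].replace("ACTION:", "").strip()
    return action if action in ACTIONS else "stop"  # conservative fallback on unexpected output
```

Calling `navigation_step` repeatedly until it returns "stop" mirrors the one-step-at-a-time process described above, with the trajectory history carried forward purely as text.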
The research team created templates to present observation information to the model in a standardized form, enabling the robot to make choices based on its surroundings. By encoding visual observations as language descriptions, the researchers made it easier for the AI agent to understand the task and respond appropriately. This approach not only requires fewer computational resources to synthesize training data, but also provides more transparency, making it easier to assess why the robot may have failed to reach its goal. And because the system uses only language-based input, it can be applied across different tasks and environments without modification.
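A standardized observation template might look something like the following minimal sketch; the field names and phrasing are hypothetical and only meant to convey the idea of presenting every observation to the model in the same fixed form.

```python
# An illustrative observation template; the fields and wording are hypothetical
# stand-ins for whatever standardized form the researchers actually used.
OBSERVATION_TEMPLATE = (
    "You see {scene}. To your left is {left}, to your right is {right}, "
    "and ahead of you is {ahead}. Your goal: {goal}."
)

def render_observation(scene: str, left: str, right: str, ahead: str, goal: str) -> str:
    """Fill the template so every observation reaches the model in the same form."""
    return OBSERVATION_TEMPLATE.format(scene=scene, left=left, right=right, ahead=ahead, goal=goal)

# Example:
# render_observation("a kitchen", "a refrigerator", "an open doorway",
#                    "a dining table", "find the laundry room")
```

Keeping the wording fixed in this way is what lets the language model compare observations across steps and environments without any retraining.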
While the language-based method loses some depth information compared to vision-based models, the researchers found that combining language-based representations with vision-based methods actually improved the robot's navigation ability. Language can convey higher-level information that pure vision features do not easily capture. The researchers aim to explore this concept further and to develop a navigation-oriented captioner that could boost the method's performance. They also want to investigate how large language models can exhibit spatial awareness to improve language-based navigation.
This research demonstrates the potential of using language-based representations in combination with vision-based methods to enhance the navigation capabilities of home robots. By converting visual data into text descriptions and feeding them into a large language model, the researchers were able to generate synthetic training data efficiently and improve the robot’s ability to understand and respond to user instructions. Future studies will focus on leveraging large language models to further develop spatial awareness and optimize language-based navigation systems.