What just happened? Researchers at the Massachusetts Institute of Technology (MIT) have developed a new approach to training general-purpose robots, drawing inspiration from the success of large language models like GPT-4. Called Heterogeneous Pretrained Transformers (HPT), the approach allows robots to learn and adapt to a wide range of tasks, something that has been difficult to achieve to date.

The research could lead to a future where robots are not just specialized tools but flexible, general-purpose assistants that can quickly learn new skills and adapt to changing circumstances.

Traditionally, robot training has been a time-consuming and costly process, requiring engineers to collect specific data for each robot and task in controlled environments. As a result, robots often struggle to adapt to new situations or unexpected obstacles.

The MIT team's new technique combines large amounts of heterogeneous data from various sources into a single system capable of teaching robots a wide array of tasks.

At the heart of the HPT architecture is a transformer, a type of neural network. It processes inputs from various sensors, including vision and proprioception data, and converts them into a shared "language" that the AI model can understand and learn from.
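To make the idea of a shared "language" more concrete, here is a minimal Python sketch of how inputs of different sizes could be projected into tokens of one common width before a single model processes them. This is not the MIT team's code: the names `make_stem` and `tokenize`, the token width, the input sizes, and the random weights are all assumptions chosen purely for illustration.

```python
import random

TOKEN_DIM = 8  # assumed shared token width, picked only for this illustration


def make_stem(input_dim, token_dim=TOKEN_DIM, seed=0):
    """Return a hypothetical linear 'stem' that maps one raw feature vector to one token."""
    rng = random.Random(seed)
    # Random weights stand in for parameters that a real system would learn.
    weights = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(token_dim)]

    def stem(features):
        if len(features) != input_dim:
            raise ValueError(f"expected {input_dim} features, got {len(features)}")
        return [sum(w * x for w, x in zip(row, features)) for row in weights]

    return stem


def tokenize(observation, stems):
    """Project every modality in the observation into the shared token space."""
    tokens = []
    for modality, vectors in observation.items():
        stem = stems[modality]
        tokens.extend(stem(v) for v in vectors)
    return tokens


# Two modalities with different raw sizes (both sizes are made up).
stems = {
    "vision": make_stem(input_dim=16, seed=1),
    "proprioception": make_stem(input_dim=7, seed=2),
}

observation = {
    "vision": [[0.1] * 16, [0.2] * 16, [0.3] * 16],  # three image-patch feature vectors
    "proprioception": [[0.0, 0.5, -0.5, 1.0, 0.2, -0.1, 0.3]],  # one joint-state vector
}

tokens = tokenize(observation, stems)
print(len(tokens), "tokens, each of width", len(tokens[0]))
```

Once every input has the same width, a single downstream model can treat camera features and joint readings as one sequence, regardless of which robot or sensor produced them.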

"In robotics, people often claim that we don't have enough training data. But in my view, another big problem is that the data come from so many different domains, modalities, and robot hardware," said Lirui Wang, the lead author of the study and an electrical engineering and computer science (EECS) graduate student at MIT. "Our work shows how you'd be able to train a robot with all of them put together."

Wang's co-authors include fellow EECS graduate student Jialiang Zhao, Meta research scientist Xinlei Chen, and senior author Kaiming He, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the Conference on Neural Information Processing Systems.

One of the key advantages of the HPT approach is its ability to leverage a massive dataset for pretraining. The researchers assembled a pretraining collection of 52 datasets containing more than 200,000 robot trajectories across four categories, including human demonstration videos and simulations.

This pretraining allows the system to transfer knowledge effectively when learning new tasks, requiring only a small amount of task-specific data for fine-tuning.
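As a rough illustration of why fine-tuning can get by with little data, the toy Python sketch below keeps a stand-in "pretrained" feature function fixed and fits only a small linear head to five task-specific examples. Everything here is hypothetical: `pretrained_trunk`, `fit_head`, the made-up target function, and the choice to freeze the trunk are assumptions for illustration, not details taken from the research described above.

```python
def pretrained_trunk(x):
    """Stand-in for a frozen pretrained network: fixed features, never updated below."""
    return [x, x * x, 1.0]


def predict(head, x):
    """Apply the small task-specific head to the frozen trunk features."""
    return sum(w * f for w, f in zip(head, pretrained_trunk(x)))


def fit_head(examples, steps=2000, lr=0.1):
    """Fit only the linear head with gradient descent on mean squared error."""
    head = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grads = [0.0, 0.0, 0.0]
        for x, target in examples:
            feats = pretrained_trunk(x)
            error = predict(head, x) - target
            for i, f in enumerate(feats):
                grads[i] += 2 * error * f / len(examples)
        head = [w - lr * g for w, g in zip(head, grads)]
    return head


# Five made-up task-specific examples drawn from target = 3*x^2 - x + 0.5.
few_examples = [(x, 3 * x * x - x + 0.5) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

head = fit_head(few_examples)
print("prediction at x=0.25:", predict(head, 0.25))
```

Because the reusable features already exist, only three numbers need to be learned from the new data. The real system is far larger, but the division of labor between a large pretrained part and a small task-specific part is the point of the sketch.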

In both simulated and real-world tasks, the HPT method outperformed traditional training-from-scratch approaches by more than 20 percent. The HPT system showed improved performance even when faced with tasks significantly different from the pretraining data.

"This paper provides a novel approach to training a single policy across multiple robot embodiments," said David Held, an associate professor at Carnegie Mellon University's Robotics Institute who was not involved in the study. "This enables training across diverse datasets, enabling robot learning methods to significantly scale up the size of datasets that they can train on. It also allows the model to quickly adapt to new robot embodiments, which is important as new robot designs are continuously being produced."

The MIT researchers aim to enhance the HPT system by exploring how data diversity can boost its performance. They also plan to extend the system so that it can process unlabeled data, as large language models like GPT-4 do.

Wang and his colleagues have set an ambitious goal for the future of this technology. "Our dream is to have a universal robot brain that you could download and use for your robot without any training at all," Wang explained. "While we are just in the early stages, we are going to keep pushing hard and hope scaling leads to a breakthrough in robotic policies, like it did with large language models."

The Amazon Greater Boston Tech Initiative and the Toyota Research Institute partially funded this research.