Despite stunning AI progress in recent years, robots remain stubbornly dumb and limited. The ones found in factories and warehouses typically go through precisely choreographed routines without much ability to perceive their surroundings or adapt on the fly. The few industrial robots that can see and grasp objects can only do a limited number of things with minimal dexterity due to a lack of general physical intelligence.
More generally capable robots could take on a far wider range of industrial tasks, perhaps after minimal demonstrations. Robots will also need more general abilities in order to cope with the enormous variability and messiness of human homes.
General excitement about AI progress has already translated into optimism about major new leaps in robotics. Elon Musk’s car company Tesla is developing a humanoid robot called Optimus, and Musk recently suggested that it would be widely available for $20,000 to $25,000 and capable of doing most tasks by 2040.
Previous efforts to teach robots to do challenging tasks have focused on training a single machine on a single task because learning seemed untransferable. Some recent academic work has shown that with sufficient scale and fine-tuning, learning can be transferred between different tasks and robots. A 2023 Google project called Open X-Embodiment involved sharing robot learning between 22 different robots at 21 different research labs.
A key challenge with the strategy Physical Intelligence is pursuing is that robot training data does not exist at anything close to the scale of the text corpora used to train large language models. So the company has to generate its own data and devise techniques for learning more from a limited dataset. To develop π0, the company combined so-called vision language models, which are trained on images as well as text, with diffusion modeling, a technique borrowed from AI image generation, to enable a more general kind of learning.
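At a high level, pairing a vision-language backbone with a diffusion-style action generator means starting from pure noise and iteratively denoising it into a short sequence of robot actions, conditioned on what the model sees and is told. The sketch below illustrates that reverse-diffusion loop only; every name and number in it (the `predict_noise` stub, the feature size, the action horizon) is an illustrative assumption, not Physical Intelligence's actual π0 architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                                  # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)      # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x_t, t, cond):
    """Placeholder for a learned network that, conditioned on
    vision-language features, predicts the noise present at step t.
    A real system would use a trained neural net; this stub returns
    zeros just so the loop runs end to end."""
    return np.zeros_like(x_t)

def sample_actions(cond, horizon=16, action_dim=7):
    """Reverse diffusion: start from Gaussian noise and iteratively
    denoise it into an action chunk of shape (horizon, action_dim)."""
    x = rng.standard_normal((horizon, action_dim))
    for t in reversed(range(T)):
        eps = predict_noise(x, t, cond)
        # Standard DDPM-style mean update toward the denoised sample
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject a small amount of noise at intermediate steps
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

vlm_features = rng.standard_normal(512)  # stand-in for a VLM embedding
actions = sample_actions(vlm_features)
print(actions.shape)  # → (16, 7)
```

The appeal of this setup is that the vision-language side supplies broad semantic understanding from internet-scale pretraining, while the diffusion side handles the continuous, multi-step nature of motor commands.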
For robots to be able to take on any chore that a person asks them to do, such learning will need to be scaled up significantly. “There’s still a long way to go, but we have something that you can think of as scaffolding that illustrates things to come,” Levine says.
Source: Wired