Although AI systems have shown remarkable progress in text-related tasks (question answering, conversations, and so on), progress in spatial reasoning has been slow. For humans, spatial reasoning is so natural that it’s almost invisible. We can assemble furniture from pictorial instructions, pack odd-shaped objects into a backpack before an overnight trip, and decide, just by watching, whether an elderly person needs help crossing uneven terrain. For computers, these problems are all far out of reach.
Why is this problem so hard? One big difference between text-based tasks and spatial tasks is the data: the web has billions of examples of human conversations, including the exact words that a computer system would need to emit in order to continue a given conversation. However, people wouldn't describe exactly how to grasp a chair leg during assembly, at least not at the level of precision that would let a robot perform the same grasp. Robots can't learn to assemble furniture simply by reading about it.
Instead, we might expect that robots could learn by watching videos. After all, there are many videos online of furniture assembly, and people easily understand what's happening in the 3D physical world by watching them. This ability is remarkable if you think about it: from a 2D screen, people can infer how the parts move all the way from the box to the final assembly, and all the ways people grasp, turn, push, and pull along the way. Current computer systems are far from this. In fact, they have a hard time even tracking the parts from the box to the assembly, much less inferring the forces that make them move as they do.
To this end, we have introduced a new task in computer vision called TAP: Tracking Any Point. Given a video and a set of query points, i.e., 2D locations on any frame of the video, the algorithm outputs the locations those points correspond to on every other frame of the video.
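As a rough sketch of what that interface looks like (illustrative only, not the released TAPIR API; the function name and array shapes below are our own assumptions), a point tracker takes a video plus query points and returns a location for every point on every frame, usually along with an occlusion flag:

```python
import numpy as np

def track_any_point(video: np.ndarray, query_points: np.ndarray):
    """Hypothetical TAP interface (illustrative sketch, not the TAPIR API).

    Args:
      video: float array of shape [num_frames, height, width, 3].
      query_points: array of shape [num_queries, 3], each row (t, x, y):
        the frame index and 2D location at which a point is queried.

    Returns:
      tracks: array of shape [num_queries, num_frames, 2], the (x, y)
        location of each query point on every frame.
      occluded: bool array of shape [num_queries, num_frames], True where
        the point is not visible in that frame.
    """
    num_frames = video.shape[0]
    num_queries = query_points.shape[0]
    # A real model predicts these; zeros simply keep the sketch runnable.
    tracks = np.zeros((num_queries, num_frames, 2))
    occluded = np.zeros((num_queries, num_frames), dtype=bool)
    return tracks, occluded
```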
[Video: Input 1 | Input 2 | Output]
On the dress above, the output came from our recent algorithm, TAPIR, which we have released open-source. The tracks reveal how the dress accelerates and changes shape over time, which carries information about the underlying geometry and physics. TAPIR is fast and works well across a huge variety of real scenes. Below, we show a few more examples of the results it can achieve.
What’s even more remarkable, however, is that the model learned to do this from just five demonstrations (compared to the hundreds or thousands of demonstrations required by recent generative-model-based solutions, such as DeepMind's RoboCat). Furthermore, outside of these five demonstrations, the system has never even seen glue sticks, wooden blocks like these, or even the gear that it’s supposed to place the final assembly next to. It works because TAPIR can track any object, even objects that were never seen until the demonstrations, and the system simply imitates the motions revealed in the tracks.
Here’s how it works in a bit more detail. For this example, the goal is to insert four objects into a stencil, given five demonstrations (with different scene configurations) in which a human drives the robot to accomplish the task. We first track thousands of points using TAPIR. Next, we group points into clusters based on the similarity of their motion, using an algorithm we have made public on GitHub. Below, we show the results for one video, where the colors indicate the object each point has been assigned to.
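A minimal sketch of this grouping step (not the released algorithm; it uses k-means over per-track displacements as a stand-in, and the function name is our own):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_tracks_by_motion(tracks: np.ndarray, num_objects: int) -> np.ndarray:
    """Group point tracks that move similarly (illustrative sketch).

    Args:
      tracks: [num_points, num_frames, 2] (x, y) locations from a tracker.
      num_objects: how many motion groups (objects) to look for.

    Returns:
      labels: [num_points] integer cluster id for each track.
    """
    # Describe each track by its frame-to-frame displacement, so points that
    # move together get similar feature vectors no matter where they sit.
    displacements = np.diff(tracks, axis=1)             # [P, F-1, 2]
    features = displacements.reshape(len(tracks), -1)   # [P, 2*(F-1)]
    return KMeans(n_clusters=num_objects, n_init=10).fit_predict(features)
```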
Next, we have to figure out which objects are relevant for each motion. To achieve this we exploit a natural property of goal-directed motion: regardless of where objects start relative to the robot, at the end of a motion they tend to be at a specific place (i.e., the goal location). For example, we might note that in all demonstrations the robot first lines up to grasp the pink cylinder, and as a second step it always lines up with a particular spot on the stencil cutout, and so on. Below, we show the full set of objects discovered from the five demonstrations; at every moment, we show the points that the model believes are relevant at that particular stage of the motion.
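To make the idea concrete, here is a small sketch of that selection rule (a simplification under our own assumptions, not the exact RoboTAP procedure): given the positions of each cluster at the end of a motion segment in every demonstration, pick the cluster whose end positions are the most repeatable.

```python
import numpy as np

def most_goal_directed_cluster(end_positions_per_cluster: dict) -> int:
    """Pick the cluster whose end-of-segment position is most repeatable.

    Args:
      end_positions_per_cluster: maps cluster id -> array of shape
        [num_demos, 2], the cluster's mean (x, y) location at the end of
        the motion segment in each demonstration.

    Returns:
      The id of the cluster with the smallest spread across demonstrations,
      i.e. the one the motion appears to be aimed at.
    """
    def spread(positions: np.ndarray) -> float:
        # Total variance of the end positions across demonstrations.
        return float(np.var(positions, axis=0).sum())

    return min(end_positions_per_cluster,
               key=lambda c: spread(end_positions_per_cluster[c]))
```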
With these objects discovered, it’s time to run the robot. At every moment, the system knows which points are relevant based on the procedure above, regardless of where they are. It detects these points using TAPIR and then moves the gripper so that the motion of those points matches what was seen in the demonstration.
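A simplified sketch of that control loop, assuming a planar setting and hypothetical helper names such as `move_gripper` (the real system maps image-space errors into robot motion more carefully): track the relevant points in the live camera feed, compare them with where the same points sat at the corresponding stage of the demonstration, and command a motion that shrinks the difference.

```python
import numpy as np

def servo_step(live_points: np.ndarray, demo_points: np.ndarray,
               gain: float = 0.5) -> np.ndarray:
    """One step of point-based visual servoing (illustrative sketch).

    Args:
      live_points: [num_points, 2] current (x, y) image locations of the
        relevant points, as tracked in the live camera feed.
      demo_points: [num_points, 2] where those points were at the same
        stage of the demonstration.
      gain: proportional gain on the correction.

    Returns:
      A 2D correction, here in image coordinates; a real system converts
      this into gripper motion using camera calibration.
    """
    # Average offset between where the points are and where they should be.
    error = (demo_points - live_points).mean(axis=0)
    return gain * error

# Hypothetical usage inside a control loop (helper names are assumptions):
# while not task_done():
#     live = track_relevant_points(camera_frame())   # points from the tracker
#     delta = servo_step(live, demo_points_for_stage(stage))
#     move_gripper(delta)                            # hypothetical robot command
```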
Despite its simplicity, this system can solve a wide variety of tasks from a very small number of demonstrations. It works whether objects are rigid or not, and it even works when irrelevant objects are placed into the scene, something that prior learning-based robotic systems have struggled with.
Also in this direction, Inria and Adobe Research recently published VideoDoodles, a very fun project that lets users add annotations and animations to videos, which then follow the objects and backgrounds as if they shared the same 3D space. Under the hood, it's powered by a point tracking algorithm that was developed with the help of our TAP-Vid dataset (though, as an independent project, it doesn't use TAPIR, at least not yet).
[Video panels: Trajectories computed | Input image warped | Animation result after hole]
If you’re like most people, this looks like a regular brick wall. However, it has, in fact, been digitally edited. To see how it was edited, play the video below, which the image was taken from.
The artificial motion is visible for less than a second, and yet to most people, the edit jumps out. In fact, it typically looks like the bricks are sliding against one another, completely overriding the “common knowledge” that brick walls are solid. This suggests that your brain is tracking essentially all points at all times; you only notice it when some motion doesn’t match your expectations.
TAP is a new field in computer vision, but it’s one that we believe can serve as a foundation for better physical understanding in AI systems. If you’re curious to test it out, the TAPIR model code and weights are open source. We’ve released a base model and also an online model that can run in real time, which served as the basis for our robotics work. There’s a Colab notebook where you can try out TAPIR using Google’s GPUs, and also a live demo where you can run TAPIR in real time on the feed from your own webcam. We hope you find it useful.
Page authored by Carl Doersch, with valuable input from the other TAPIR, RoboTAP, and TAP-Vid authors: Yi Yang, Mel Vecerik, Jon Scholz, Todor Davchev, Joao Carreira, Guangyao Zhou, Ankush Gupta, Yusuf Aytar, Dilara Gokay, Larisa Markeeva, Raia Hadsell, Lourdes Agapito. We also thank Qianqian Wang for providing visuals of OmniMotion.