TAPIR: Towards Spatial Intelligence via Point Tracking

Although AI systems have shown remarkable progress in text-related tasks (question answering, conversations, etc.), in spatial reasoning, progress has been slow. For humans, spatial reasoning is so natural that it’s almost invisible. We can assemble furniture from pictorial instructions, arrange odd-shaped objects into a backpack before an overnight trip, and decide if an elderly person needs help crossing uneven terrain by watching them. For computers, these problems are all far out of reach.

Why is this problem so hard? One big difference between text-based tasks and spatial tasks is the data: the web has billions of examples of human conversations, including the exact words that a computer system would need to emit in order to continue a given conversation. However, people wouldn't describe exactly how to grasp a chair leg during assembly, at least not at a level of precision that would let a robot perform the same grasp. Robots can't learn to assemble furniture simply by reading about it.

Instead, we might expect that robots could learn by watching videos. After all, there are many videos online of furniture assembly, and people easily understand what's happening in the 3D physical world by watching them. This ability is remarkable if you think about it: from a 2D screen, people can infer how the parts move all the way from the box to the final assembly, and all the ways people grasp, turn, push, and pull along the way. Current computer systems are far from this. In fact, they have a hard time even tracking the parts from the box to the assembly, much less the forces that make them move as they do.

To this end, we have introduced a new task in computer vision called TAP: Tracking Any Point. Given a video and a set of query points, i.e., 2D locations on any frame of the video, the algorithm outputs the locations that those points correspond to on every other frame of the video.
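
To make the interface concrete, here is a minimal sketch of the data a TAP method consumes and produces. The function name and exact array layout are illustrative assumptions, not the API of the released TAPIR code.

```python
import numpy as np

def track_any_point(video, query_points):
    """Illustrative TAP interface (not the released TAPIR API).

    video:        float array of shape (num_frames, height, width, 3).
    query_points: float array of shape (num_queries, 3), each row (t, y, x),
                  i.e. a 2D location on frame t of the video.

    Returns:
      tracks:  float array (num_queries, num_frames, 2) giving the (y, x)
               position of each query point on every frame.
      visible: bool array (num_queries, num_frames), False where the point
               is occluded or out of view.
    """
    num_frames = video.shape[0]
    num_queries = query_points.shape[0]
    # A real tracker goes here; this placeholder just repeats each query
    # location on every frame so the shapes are concrete.
    tracks = np.tile(query_points[:, None, 1:], (1, num_frames, 1))
    visible = np.ones((num_queries, num_frames), dtype=bool)
    return tracks, visible
```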

[Video: input clips and the TAPIR tracking output on a dress.]

On the dress above, the output came from our recent algorithm called TAPIR, which we have released open-source. The output reveals how the dress accelerates and changes shape over time, which carries information about the underlying geometry and physics. TAPIR is fast and works well across a huge variety of real scenes. Below, we show a few more examples of the results it can achieve.

Robotics

So what can we use such a system for? One answer is robotics. Today we’re announcing a system for robotic manipulation that uses TAPIR as its foundation (called RoboTAP). Here’s RoboTAP solving a gluing task without human intervention.

What’s even more remarkable, however, is that the model learned to do this from just five demonstrations (compared with the hundreds or thousands of demonstrations required by recent generative-model-based solutions, such as DeepMind's RoboCat). Furthermore, outside of these five demonstrations, the system has never even seen glue sticks, wooden blocks like these, or even the gear that it’s supposed to place the final assembly next to. It works because TAPIR can track any object, even objects that were never seen until the demonstrations, and the system simply imitates the motions revealed in the tracks.

Here’s how it works in a bit more detail. For this example the goal is to insert four objects into a stencil, given five demonstrations (with different scene configurations) where a human drives the robot to accomplish the task. We first track thousands of points using TAPIR. Next, we group points into clusters based on the similarity of the motion, using an algorithm which we have made public on GitHub. Below, we show the results for one video, where the colors indicate the object that each point has been assigned to.
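
As a rough illustration of this grouping step (a simplified sketch, not the exact algorithm in the RoboTAP release), one can cluster tracks by how similarly they move, for example by comparing their frame-to-frame velocities:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_tracks_by_motion(tracks, num_clusters):
    """Group point tracks whose motion is similar.

    tracks: float array (num_points, num_frames, 2) of (y, x) positions,
            e.g. produced by a point tracker such as TAPIR.
    Returns an integer cluster label per point.
    """
    # Describe each track by its frame-to-frame velocity, so that points
    # moving together end up with similar descriptors.
    velocities = np.diff(tracks, axis=1)            # (P, F-1, 2)
    features = velocities.reshape(len(tracks), -1)  # (P, 2 * (F-1))

    # Agglomerative clustering on pairwise distances between velocity profiles.
    distances = pdist(features, metric='euclidean')
    tree = linkage(distances, method='average')
    return fcluster(tree, t=num_clusters, criterion='maxclust')
```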

Next we have to figure out which objects are relevant for each motion. To achieve this we exploit a natural property of goal-directed motion: regardless of where objects start relative to the robot, at the end of a motion they tend to be at a specific place (i.e., the goal location). For example, in all demonstrations the robot might first line up to grasp the pink cylinder, then, as a second step, line up with a particular spot on the stencil cutout, and so on. Below, we show the full set of objects that it discovers from the five demonstrations; at every moment, we show the points that the model believes are relevant at that particular stage of the motion.
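
In code, a simplified version of this selection rule might look like the following (an illustration of the idea, not the exact RoboTAP procedure): for each candidate cluster, measure how consistently it ends up in the same place at the end of a motion segment across demonstrations, and treat the most repeatable cluster as the one the motion was aimed at.

```python
import numpy as np

def most_goal_directed_cluster(end_positions):
    """Pick the cluster whose end-of-segment location is most repeatable.

    end_positions: float array (num_demos, num_clusters, 2) holding the mean
                   (y, x) of each cluster's points at the end of one motion
                   segment, in each demonstration.
    Returns the index of the cluster with the lowest positional variance
    across demonstrations, i.e. the likeliest goal-defining object.
    """
    # Variance of each cluster's final location across demonstrations,
    # summed over the y and x coordinates.
    variance = end_positions.var(axis=0).sum(axis=-1)  # (num_clusters,)
    return int(np.argmin(variance))
```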

With these objects discovered, it’s time to run the robot. At every moment, the system knows which points are relevant based on the procedure above, regardless of where they are. It detects these points using TAPIR and then moves the gripper so that the motion of those points matches what was seen in the demonstration.
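
The control law can be as simple as a proportional visual servo on the tracked points: compare where the relevant points are now with where they were at the corresponding moment of the demonstration, and nudge the gripper to shrink the difference. The sketch below is a minimal illustration of that idea, not the actual RoboTAP controller.

```python
import numpy as np

def servo_step(current_points, demo_points, gain=0.5):
    """One step of a proportional visual servo in image space.

    current_points: (N, 2) tracked (y, x) positions of the relevant points now.
    demo_points:    (N, 2) positions of the same points at the matching
                    moment of the demonstration.
    Returns a 2D image-space correction to apply to the gripper.
    """
    # Average displacement between where the points are and where the
    # demonstration says they should be.
    error = (demo_points - current_points).mean(axis=0)
    return gain * error  # a real system would map this into robot motion
```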

Despite its simplicity, this system can solve a wide variety of tasks from a very small number of demonstrations. It works whether objects are rigid or not, and it even works when irrelevant objects are placed into the scene, something that prior learning-based robotic systems have struggled with.

[Interactive gallery: select a task for more information.]

Dynamic 3D Reconstruction

Robotics isn’t the only domain that we’re interested in. Another exciting direction is building 3D models of individual scenes. Such 3D models could one day enable better augmented reality systems (especially when combined with tracking), or even allow users to create full 3D simulations starting from a single video. Recently, a team from Google Research, collaborating with Cornell, published an algorithm called OmniMotion, which can approximately reconstruct entire videos in 3D using point tracks. Because it was concurrent work, the paper relied on earlier, less performant point-tracking algorithms, but here we show a first example of OmniMotion running on top of TAPIR. For each video, the left shows the original video and the right shows the depth map, where blue indicates closer and red/orange indicates farther away:

Also in this direction, Inria and Adobe Research recently published VideoDoodles, a very fun project that allows users to add annotations and animations to videos that follow the objects and backgrounds as if they shared the same 3D space. Under the hood, it's powered by a point tracking algorithm that was developed with the help of our TAP-Vid dataset (though, as an independent project, it doesn't use TAPIR, at least not yet).

Video Generation

A final application we’re interested in is video generation. Although modern image generation models have improved dramatically in the last few years, video generation models still produce videos where objects tend to move in non-physical ways: they flicker in and out of existence, and the textures don’t stay in place as the objects move. We hypothesize that training video generation systems using point tracks can greatly reduce this kind of artifact: the generative model can directly check whether the motion is plausible, and can also check that two image patches representing the same point on the same surface depict the same texture.
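
To make that kind of check concrete, here is one hypothetical consistency score (our own illustration, not a component of any released system): sample the generated video at corresponding track locations over time and measure how much the colour changes; a high score suggests the texture is not staying attached to the surface.

```python
import numpy as np

def track_texture_consistency(video, tracks, visible):
    """Mean colour variance along visible portions of each track.

    video:   (num_frames, H, W, 3) float array.
    tracks:  (num_points, num_frames, 2) (y, x) positions from a tracker.
    visible: (num_points, num_frames) bool visibility flags.
    Lower values mean the texture stays put as the point moves.
    """
    scores = []
    for p in range(tracks.shape[0]):
        frames = np.nonzero(visible[p])[0]
        if len(frames) < 2:
            continue
        ys = np.clip(np.round(tracks[p, frames, 0]).astype(int), 0, video.shape[1] - 1)
        xs = np.clip(np.round(tracks[p, frames, 1]).astype(int), 0, video.shape[2] - 1)
        colours = video[frames, ys, xs]        # (num_visible_frames, 3)
        scores.append(colours.var(axis=0).mean())
    return float(np.mean(scores)) if scores else 0.0
```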

As a proof-of-concept, we’ve built a generative model that’s designed to animate still images. It’s a two-step procedure: given an image, the model first generates a set of trajectories that describe how the objects might move over time. In the second step, the model warps the input image according to the trajectories and then attempts to fill in the resulting holes. Both steps use diffusion models, one to generate the trajectories and one to generate the pixels. The training data for these models is computed automatically by TAPIR on a large video dataset.
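
The warping step in the second stage can be sketched as a simple forward splat: move each pixel from its starting location to wherever its trajectory says it is at the target time, leaving holes where nothing lands for the hole-filling model to complete. This is an illustrative simplification, not the diffusion-based model itself.

```python
import numpy as np

def forward_warp(image, start_points, end_points):
    """Warp an image by moving pixels along point trajectories.

    image:        (H, W, 3) float array.
    start_points: (N, 2) integer (y, x) pixel locations in the input image.
    end_points:   (N, 2) float (y, x) locations of the same points at some
                  later time, e.g. taken from generated trajectories.
    Returns the warped image and a mask of holes still to be filled.
    """
    height, width = image.shape[:2]
    warped = np.zeros_like(image)
    filled = np.zeros((height, width), dtype=bool)

    ys = np.clip(np.round(end_points[:, 0]).astype(int), 0, height - 1)
    xs = np.clip(np.round(end_points[:, 1]).astype(int), 0, width - 1)
    warped[ys, xs] = image[start_points[:, 0], start_points[:, 1]]
    filled[ys, xs] = True

    holes = ~filled  # regions the hole-filling model needs to complete
    return warped, holes
```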

In the visualization below, we start with a single example and generate two different plausible animations from it, demonstrating that our model understands that a single image is ambiguous. The first column shows the input image. The second column shows a visualization of the trajectories themselves on top of the input image: purples show tracks with little motion, whereas yellows show the tracks with the most motion. The third column animates the original image according to the trajectories using standard image warping. The fourth column shows the result after filling the holes with another diffusion model. Note that the hole filling wasn't the focus of our work; thus, unlike most concurrent work on video generation, we don't do any pre-training on images, so the filled-in regions are imperfect. We encourage you to consider whether the trajectories themselves are reasonable predictions of the future. You can use the gallery at the bottom to navigate between examples.

 
[Interactive gallery. Columns: input (single image); trajectories computed from the single image; input image warped using the trajectories; animation result after hole filling.]

Point Tracking's Versatility

Hopefully we’ve convinced you that point tracking is useful. However, you might still be wondering why we should track points rather than entire objects, as so much prior work has done. We argue that summarizing an entire object as a box (or segment) loses information about how the object rotates and deforms across time, which is important for understanding physical properties. Point tracking is also more objectively defined: if the task is to track a chair, but the seat comes off of the chair halfway through a video, should we track the seat or the frame? For points on the chair’s surface, however, there is no ambiguity: after all, the surface of a chair (and of every other solid object) is made up of atoms, and those atoms persist over time.

Another reason that TAP is interesting is that people seem to track points. To demonstrate this, take a look at this image.

If you’re like most people, this looks like a regular brick wall. However, it has, in fact, been digitally edited. To see how it was edited, play the video below that the image came from.

The artificial motion is visible for less than a second, and yet to most people, the edit jumps out. In fact, it typically looks like the bricks are sliding against one another, completely overriding the “common knowledge” that brick walls are solid. This suggests that your brain is tracking essentially all points at all times; you only notice it when some motion doesn’t match your expectations.

TAP is a new area of computer vision, but it’s one that we believe can serve as a foundation for better physical understanding in AI systems. If you’re curious to test it out, the TAPIR model code and weights are open source. We’ve released a base model and also an online model that can run in real time, which served as the basis for our robotics work. There’s a Colab notebook where you can try out TAPIR using Google’s GPUs, and also a live demo where you can run TAPIR in real time on the feed from your own webcam. We hope you find it useful.

Page authored by Carl Doersch, with valuable input from the other TAPIR, RoboTAP, and TAP-Vid authors: Yi Yang, Mel Vecerik, Jon Scholz, Todor Davchev, Joao Carreira, Guangyao Zhou, Ankush Gupta, Yusuf Aytar, Dilara Gokay, Larisa Markeeva, Raia Hadsell, Lourdes Agapito. We also thank Qianqian Wang for providing visuals of OmniMotion.