TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement

Google DeepMind · VGG, Department of Engineering Science, University of Oxford

TAPIR accurately tracks any desired point on a physical surface.

Abstract

We present a new model for Tracking Any Point (TAP) that effectively tracks a query point in a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Moreover, our model facilitates fast parallel inference on long video sequences. TAPIR can also run in an online fashion, tracking 256 points on a 256x256 video at roughly 40 fps, and can be flexibly extended to higher-resolution videos.

This visualization begins with dense TAPIR tracks. We segment the scene into foreground and background, remove background points, and compensate for camera motion to reveal how the objects move through the scene.

Video Summary

Demos

TAPIR is open-source. We provide two online Colab demos where you can try it on your own videos without installation: the first lets you run our best-performing TAPIR model and the second lets you run a model in an online fashion. Alternatively, you can clone our codebase and run TAPIR live, tracking points on your own webcam; with a modern GPU, this demo can run in real time. Have fun!

Architecture

TAPIR begins with our prior work, TAP-Net, to initialize a trajectory given a query point, and then uses an architecture inspired by Persistent Independent Particles (PIPs) to refine the initial estimate.

TAPIR architecture.

TAP-Net's per-frame initialization lets us replace the "chaining" procedure, which was the slowest part of PIPs. We furthermore replace the MLP-Mixer with a fully convolutional network, which allows us to remove PIPs' complex chunking procedure while improving performance. Finally, the model estimates its own uncertainty about each position, which improves performance and can also be useful in domains like 3D reconstruction, where confident errors can break downstream algorithms.
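To make this two-stage structure concrete, here is a toy, self-contained NumPy sketch: per-frame matching to initialize the trajectory, followed by iterative refinement from local correlations. It is purely illustrative and is not the TAPIR implementation; the names are our own, and the refinement step here is a simple soft-argmax over a local score window, whereas TAPIR feeds the local correlations to a learned, fully convolutional network that also updates the query feature and an uncertainty estimate.

```python
import numpy as np

def track_query_point(feats, query_feat, num_iters=4, radius=3):
    """Toy two-stage tracker in the spirit of TAPIR (illustrative only).

    feats:      [T, H, W, C] per-frame feature maps from any backbone.
    query_feat: [C] feature sampled at the query point in its query frame.
    Returns a (y, x) trajectory of shape [T, 2].
    """
    T, H, W, _ = feats.shape

    # Stage 1 -- matching: correlate the query feature with every frame
    # independently and take the best match as the initial estimate.
    scores = np.einsum('thwc,c->thw', feats, query_feat)               # [T, H, W]
    track = np.array([np.unravel_index(s.argmax(), s.shape) for s in scores],
                     dtype=np.float64)                                  # [T, 2]

    # Stage 2 -- refinement: look at correlations in a local window around the
    # current estimate and move toward their soft-argmax. (TAPIR instead passes
    # these local correlations to a convolutional network that updates the whole
    # trajectory, the query feature, and an uncertainty estimate.)
    for _ in range(num_iters):
        for t in range(T):
            y, x = int(round(track[t, 0])), int(round(track[t, 1]))
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            window = scores[t, y0:y1, x0:x1]
            w = np.exp(window - window.max())
            w /= w.sum()
            ys, xs = np.mgrid[y0:y1, x0:x1]
            track[t] = [(w * ys).sum(), (w * xs).sum()]
    return track
```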

TAP-Vid Performance

The TAP-Vid benchmark is a set of real and synthetic videos annotated with point tracks. The metric, Average Jaccard (AJ), measures accuracy in estimating both position and occlusion. Higher is better.
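As a rough reference, here is a minimal NumPy sketch of how Average Jaccard can be computed. It reflects our reading of the benchmark rather than the official evaluation code (which operates at 256x256 resolution, averages over pixel thresholds of 1, 2, 4, 8, and 16, and handles details such as the query frame); all names below are our own.

```python
import numpy as np

def average_jaccard(pred_xy, pred_visible, gt_xy, gt_visible,
                    thresholds=(1, 2, 4, 8, 16)):
    """Average Jaccard over distance thresholds (in pixels).

    pred_xy, gt_xy:           [N, T, 2] predicted / ground-truth positions.
    pred_visible, gt_visible: [N, T] boolean visibility (True = not occluded).
    """
    dist = np.linalg.norm(pred_xy - gt_xy, axis=-1)
    jaccards = []
    for thr in thresholds:
        within = dist <= thr
        # True positive: predicted visible, actually visible, and close enough.
        tp = (pred_visible & gt_visible & within).sum()
        # False positive: predicted visible, but occluded or too far away.
        fp = (pred_visible & ~(gt_visible & within)).sum()
        # False negative: a visible point that was missed (marked occluded or too far).
        fn = (gt_visible & ~(pred_visible & within)).sum()
        jaccards.append(tp / (tp + fp + fn))
    return float(np.mean(jaccards))
```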

Method     Kinetics   DAVIS   Kubric   RGB-Stacking
TAP-Net    46.6       38.4    65.4     59.9
PIPs       35.3       42.0    59.1     37.3
TAPIR      60.2       62.9    88.3     73.3

You can see that TAPIR provides a substantial boost in performance: roughly 20% absolute Average Jaccard over prior methods. To get a sense of how much this is, here are a few examples of our improvements over prior work on the DAVIS dataset.

Still Image Animation

One potential application of point tracking is to improve the temporal consistency and physical plausibility of video generation. In this proof of concept, we build a pipeline that takes a still image and produces a short animated clip. The pipeline has two stages: first, a diffusion model produces dense tracks given the input image; second, another model produces a video given the input image and the trajectories. TAPIR produces the training data for both stages from otherwise unlabeled videos. See our paper for details.

In the visualization below, we start with a single example and generate two different animations from it, demonstrating that our model understands that a single image is ambiguous. The first column shows the input image. The second column shows a visualization of the trajectories themselves on top of the input image: purple indicates tracks with little motion, whereas yellow indicates the tracks with the most motion. The third column animates the original image according to the trajectories using standard image warping (sketched after the gallery below). The fourth column shows the result after filling the holes with the second diffusion model. Note that hole filling wasn't the focus of our work; unlike most concurrent work on video generation, we don't do any pre-training on images, so the results are imperfect. We encourage you to consider whether the trajectories themselves are reasonable predictions of the future.

You can use the gallery at the bottom to navigate between examples.

 
[Interactive gallery — columns: input single image · trajectories computed from the single image · input image warped using the trajectories · animation result after hole filling]
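For those curious about the warping step in the third column, below is a minimal nearest-pixel forward-splatting sketch, assuming dense per-pixel trajectories have already been predicted for the input image. This is our own illustrative code, not the pipeline used in the paper; it ignores occlusion ordering and sub-pixel splatting, and the holes it leaves are exactly what the second-stage model is asked to fill in.

```python
import numpy as np

def forward_warp(image, tracks, frame_index):
    """Forward-splat a still image along dense point tracks (nearest pixel).

    image:  [H, W, 3] input image.
    tracks: [H, W, T, 2] predicted (x, y) position of every source pixel in
            every output frame (dense trajectories for each pixel of `image`).
    Returns the warped frame plus a boolean mask of holes to be filled later.
    """
    H, W, _ = image.shape
    out = np.zeros_like(image)
    filled = np.zeros((H, W), dtype=bool)

    xy = np.round(tracks[:, :, frame_index]).astype(int)   # [H, W, 2]
    # Keep only pixels whose track stays inside the frame.
    valid = ((xy[..., 0] >= 0) & (xy[..., 0] < W) &
             (xy[..., 1] >= 0) & (xy[..., 1] < H))
    xs, ys = xy[..., 0][valid], xy[..., 1][valid]

    # Splat source colors to their tracked positions; collisions are resolved
    # arbitrarily (last write wins) since there is no depth ordering here.
    out[ys, xs] = image[valid]
    filled[ys, xs] = True
    return out, ~filled
```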

Related Links

Point tracking is a new field with a few notable works released around the same time as ours.

RoboTAP is a DeepMind work that applies point tracking to robotics, specifically extremely low-shot imitation of complex robot behaviors from video.

Persistent Independent Particles was an inspiration for this work, and we'd like to thank Adam Harley for insightful discussions.

Tracking Everything Everywhere All At Once doesn't perform as well as TAPIR and is substantially slower, but it provides pseudo-3D reconstructions, and could potentially be used on top of TAPIR tracks to further improve performance.

Multi-Flow Tracking hypothesizes many flow fields between different pairs of frames and scores them; the multiple hypotheses lead to improved robustness.