Temporally Consistent Video Transformer
for Long-Term Video Prediction


Wilson Yan, Danijar Hafner, Stephen James, Pieter Abbeel

Paper | Code | Tweet

Abstract


Generating long, temporally consistent video remains an open challenge in video generation. Primarily due to computational limitations, most prior methods train on a small subset of frames and then extend to longer videos in a sliding-window fashion at generation time. Although these techniques may produce sharp videos, they have difficulty retaining long-term temporal consistency due to their limited context length. In this work, we present the Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics video prediction model that learns compressed representations to efficiently condition on long videos of hundreds of frames during both training and generation. We use a MaskGit prior for dynamics prediction, which enables sharper and faster generations compared to prior work. Our experiments show that TECO outperforms SOTA baselines on a variety of video prediction benchmarks, ranging from simple mazes in DMLab and large 3D worlds in Minecraft to complex real-world videos from Kinetics-600. In addition, to better understand the ability of video prediction models to maintain temporal consistency, we introduce several challenging video prediction tasks consisting of agents randomly traversing 3D scenes of varying difficulty. These tasks present a challenging benchmark for video prediction in partially observable environments, where a model must decide which parts of a scene to re-create versus invent, depending on its past observations or generations.
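To make the MaskGit-style prior concrete, below is a minimal PyTorch sketch of confidence-based iterative decoding of one frame's VQ codes. Everything here is an illustrative assumption rather than TECO's released implementation: the `prior` network, `MASK_ID`, `NUM_TOKENS`, and the 8-step cosine schedule are hypothetical placeholders.

```python
import math
import torch

MASK_ID = 1024          # hypothetical: index of the special [MASK] token
NUM_TOKENS = 64         # hypothetical: e.g. an 8x8 grid of VQ codes per frame
STEPS = 8               # hypothetical: number of iterative decoding steps

@torch.no_grad()
def maskgit_decode(prior, context):
    """Sample the next frame's VQ codes with MaskGit-style iterative decoding.

    prior:   hypothetical model; prior(context, frame) -> logits of shape
             (NUM_TOKENS, codebook_size) for the partially masked `frame`.
    context: compressed latent codes of the past frames (model-specific).
    """
    frame = torch.full((NUM_TOKENS,), MASK_ID, dtype=torch.long)
    for step in range(STEPS):
        logits = prior(context, frame)
        probs = logits.softmax(dim=-1)
        sampled = torch.multinomial(probs, 1).squeeze(-1)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        # Already-decoded tokens are kept; only masked positions compete.
        conf = torch.where(frame == MASK_ID, conf,
                           torch.full_like(conf, float("inf")))
        # Cosine schedule: fraction of tokens still masked after this step.
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / STEPS)
        num_masked = int(mask_ratio * NUM_TOKENS)
        # Commit every sampled token, then re-mask the least confident ones.
        frame = torch.where(frame == MASK_ID, sampled, frame)
        if num_masked > 0:
            _, lowest = conf.topk(num_masked, largest=False)
            frame[lowest] = MASK_ID
    return frame
```

Because the schedule drives the masked fraction to zero, every position is decoded by the final step; fewer steps than tokens is what makes this faster than autoregressive decoding.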

DMLab


Left: Video prediction samples of 300 frames conditioned on 144 (action-conditional)

Right: 3D visualizations of 300-frame videos conditioned on 36 frames. Video predictions use only RGB frames. We compute 3D scenes from the RGB frames by applying pre-trained depth and pose estimators and projecting the resulting points into a 3D point cloud
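The projection step above is standard pinhole back-projection. A minimal sketch follows, assuming the depth map, intrinsics `K`, and pose `cam_to_world` come from the pre-trained estimators mentioned in the caption; the function name and exact input shapes are illustrative.

```python
import torch

def backproject_to_pointcloud(rgb, depth, K, cam_to_world):
    """Lift one H x W frame into a world-space colored point cloud.

    rgb:          (H, W, 3) colors, e.g. a decoded prediction frame.
    depth:        (H, W) depth from a pre-trained depth estimator.
    K:            (3, 3) pinhole camera intrinsics (assumed known/estimated).
    cam_to_world: (4, 4) pose from a pre-trained pose estimator.
    Returns (H*W, 3) world points and their (H*W, 3) colors.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Pixel -> camera ray via the inverse intrinsics, scaled by depth.
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)
    cam = (torch.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    # Camera -> world via the homogeneous pose matrix.
    cam_h = torch.cat([cam, torch.ones(cam.shape[0], 1)], dim=-1)
    world = (cam_to_world @ cam_h.T).T[:, :3]
    return world, rgb.reshape(-1, 3)
```

Accumulating these points across all 300 frames (one pose per frame) yields the fused 3D scene visualization.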


Below: Video prediction samples for each method with 300 frames conditioned on 36


Minecraft


Below: Video prediction samples of 300 frames conditioned on 144 (action-conditional)


Below: Video prediction samples for each method with 300 frames conditioned on 36 (action-conditional)


Habitat


Below: Video prediction samples of 300 frames conditioned on 144 (action-conditional)


Below: Video prediction samples for each method with 300 frames conditioned on 36


Kinetics-600


Below: Video prediction samples of 100 frames conditioned on 20, with top-k sampling (left) and without (right)


Below: Video prediction samples for each method with 100 frames conditioned on 20, with top-k sampling (left) and without (right)
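For reference, top-k sampling restricts each categorical draw to the k most likely tokens before renormalizing, trading diversity for fewer low-probability artifacts. A minimal PyTorch sketch (function name and shapes are illustrative, not from the released code):

```python
import torch

def top_k_sample(logits, k):
    """Sample token ids after keeping only the k highest-scoring logits.

    logits: (..., vocab_size) unnormalized scores from the prior.
    k:      number of candidates to keep; k = vocab_size disables filtering.
    """
    topk_vals, _ = logits.topk(k, dim=-1)
    # Mask out everything below the k-th largest logit.
    cutoff = topk_vals[..., -1:]
    filtered = torch.where(logits < cutoff,
                           torch.full_like(logits, float("-inf")), logits)
    probs = filtered.softmax(dim=-1)
    flat = torch.multinomial(probs.reshape(-1, probs.shape[-1]), 1)
    return flat.reshape(probs.shape[:-1])
```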