Temporally Consistent Transformers for Video Generation


Wilson Yan, Danijar Hafner, Stephen James, Pieter Abbeel


Abstract


To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world. Current algorithms enable accurate predictions over short horizons but tend to suffer from temporal inconsistencies. When generated content goes out of view and is later revisited, the model invents different content instead. Despite this severe limitation, no established benchmarks exist for video generation with long temporal dependencies. In this paper, we curate 3 challenging video datasets with long-range dependencies by rendering walks through 3D scenes of procedural mazes, Minecraft worlds, and indoor scans. We perform a comprehensive evaluation of current models and observe their limitations in temporal consistency. Moreover, we introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time. By compressing its input sequence into fewer embeddings, applying a temporal transformer, and expanding back using a spatial MaskGit, TECO outperforms existing models across many metrics.
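The three stages named in the abstract (compress, temporal transformer, spatial expansion) can be summarized in a short sketch. Below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: all module choices and sizes are placeholders, the per-frame embedding is a simple mean-pool, and a plain deconvolution stands in for the spatial MaskGit that the real model uses to expand back.

```python
import torch
import torch.nn as nn

class TECOSketch(nn.Module):
    """Hypothetical illustration of TECO's three stages; not the paper's code."""
    def __init__(self, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        # 1) Compress each frame's latent grid into fewer embeddings
        #    (here: a strided conv halving each spatial side).
        self.compress = nn.Conv2d(dim, dim, kernel_size=2, stride=2)
        # 2) Temporal transformer over one embedding per frame, which keeps
        #    attention cost low over long horizons.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        # 3) Expand back to the full grid; the real model decodes each frame
        #    with a spatial MaskGit instead of this deconvolution.
        self.expand = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)

    def forward(self, z):
        # z: (batch, time, dim, H, W) latents from a pretrained frame encoder.
        B, T, C, H, W = z.shape
        h = self.compress(z.flatten(0, 1))            # (B*T, C, H/2, W/2)
        h = h.flatten(2).mean(-1).view(B, T, C)       # one embedding per frame
        h = self.temporal(h)                          # long-range temporal deps
        h = h.view(B * T, C, 1, 1).expand(-1, -1, H // 2, W // 2).contiguous()
        return self.expand(h).view(B, T, C, H, W)     # back to full resolution

out = TECOSketch()(torch.randn(1, 8, 256, 16, 16))
print(out.shape)  # torch.Size([1, 8, 256, 16, 16])
```

Compressing before the temporal transformer is what makes long sequences tractable: attention runs over one token per frame rather than a full spatial grid per frame.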

DMLab


Left: Video prediction samples of 300 frames conditioned on 144 (action-conditional)

Right: 3D visualizations of 300-frame videos conditioned on 36 frames. Video predictions use only RGB frames. We compute 3D scenes from RGB frames by applying pre-trained depth and pose estimators, and project the resulting points into a 3D point cloud.
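The back-projection step in that pipeline (depth + camera pose to point cloud) can be sketched as follows. This is an illustrative NumPy version with placeholder intrinsics and an identity pose, not the exact estimators or calibration used for the figures above.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """depth: (H, W); K: (3, 3) intrinsics; cam_to_world: (4, 4) pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                   # camera-frame rays
    pts_cam = rays * depth.reshape(-1, 1)             # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]            # world-frame 3D points

# Placeholder intrinsics and pose, purely for illustration.
K = np.array([[64.0, 0, 64], [0, 64.0, 64], [0, 0, 1]])
points = backproject(np.ones((128, 128)), K, np.eye(4))
print(points.shape)  # (16384, 3)
```

Accumulating the per-frame world-frame points across the predicted video yields the point cloud shown in the visualizations.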


Below: Video prediction samples for each method with 300 frames conditioned on 36


Minecraft


Below: Video prediction samples of 300 frames conditioned on 144 (action-conditional)


Below: Video prediction samples for each method with 300 frames conditioned on 36 (action-conditional)


Habitat


Below: Video prediction samples of 300 frames conditioned on 144 (action-conditional)


Below: Video prediction samples for each method with 300 frames conditioned on 36


Kinetics-600


Below: Video prediction samples of 100 frames conditioned on 20 with top-k sampling (left) and no top-k (right)


Below: Video prediction samples for each method with 100 frames conditioned on 20 with top-k sampling (left) and no top-k (right)
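The top-k sampling referenced in the two captions above keeps only the k most likely tokens at each generation step and renormalizes before sampling, which trades diversity for stability. A small sketch is shown below; k=50 is an arbitrary illustrative value, not the setting used for these samples.

```python
import torch

def sample_top_k(logits, k=50):
    topk_vals, topk_idx = torch.topk(logits, k, dim=-1)  # k best logits
    probs = torch.softmax(topk_vals, dim=-1)             # renormalize over them
    choice = torch.multinomial(probs, num_samples=1)     # sample within top-k
    return topk_idx.gather(-1, choice).squeeze(-1)       # map back to vocab ids

token = sample_top_k(torch.randn(1, 1024))  # one step over a 1024-token vocab
print(token.shape)  # torch.Size([1])
```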