We learn an aligned text-image token representation without paired text-image data, which can then be used for downstream VQA and classification tasks via in-context learning with an LLM
A vector-quantized latent dynamics video prediction model that learns compressed representations, enabling efficient conditioning on videos hundreds of frames long during both training and generation
An efficient and scalable video generation architecture that first learns a VQ-VAE to compress video data, and then learns a GPT-style transformer to model discrete latent codes
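The two-stage design above hinges on mapping continuous encoder outputs to discrete codebook indices, which the transformer then models autoregressively. A minimal NumPy sketch of that vector-quantization step (toy codebook and latents; names and shapes are illustrative, not the actual model's):

```python
import numpy as np

def quantize(z, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    z:        (n, d) array of encoder outputs
    codebook: (k, d) array of learned code vectors
    returns:  (discrete indices, quantized vectors)
    """
    # Squared Euclidean distance from every latent to every code.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)   # one discrete token per latent
    return idx, codebook[idx]    # tokens fed to the transformer, values fed to the decoder

# Toy usage: 4 latents near a 3-entry codebook in 2-D.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z = codebook[[0, 2, 1, 0]] + 0.01  # small perturbation of known codes
idx, zq = quantize(z, codebook)    # idx recovers [0, 2, 1, 0]
```

During training the VQ-VAE learns the codebook jointly with the encoder/decoder (with a straight-through gradient estimator); the GPT-style prior is then fit on the resulting index sequences.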
Performing image interpolation and semantic manipulation (e.g. hair color, smiling, gender) with a PixelCNN on facial images using Fisher scores as an embedding space
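The Fisher score of an input is the gradient of the model's log-likelihood with respect to its parameters, giving a fixed-length embedding per image. A toy sketch for a one-dimensional Gaussian (the actual work takes gradients through a PixelCNN's log-likelihood; the closed-form derivatives here are illustrative):

```python
import numpy as np

def fisher_score(x, mu, sigma2):
    """Fisher score of x under N(mu, sigma2): the gradient of
    log p(x; mu, sigma2) with respect to (mu, sigma2).

    Used as an embedding of x in the model's parameter space."""
    d_mu = (x - mu) / sigma2                            # d/d(mu) log p
    d_s2 = ((x - mu) ** 2 - sigma2) / (2 * sigma2 ** 2) # d/d(sigma2) log p
    return np.array([d_mu, d_s2])

# Embedding a scalar observation under a fitted Gaussian.
emb = fisher_score(1.5, mu=1.0, sigma2=1.0)  # -> [0.5, -0.375]
```

Interpolating or editing in this gradient space, then decoding back through the model, is what enables the semantic manipulations described above.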