IMO Proposal 4: Kumo Text-to-Video Foundation Model

1. Introduction

This research proposal outlines the development of Kumo, an open-source video foundation model that combines a 3D Variational Autoencoder (3D VAE) for efficient video compression and reconstruction with a Diffusion Transformer (DiT) for denoising and video generation. The synergy between these components enables Kumo to process large-scale video data with high computational efficiency while maintaining high quality and temporal consistency in video outputs. Through this architectural design, Kumo aims to set a new benchmark for open-source video generation models.


2. VAE Design

Videos inherently combine spatial and temporal information, leading to significantly larger data volumes compared to images. To address the computational challenges of modeling video data, Kumo integrates a 3D VAE as a video compression module. The main features of the 3D VAE include:

  • High Compression Ratio: Utilizes three-dimensional convolutions to effectively reduce data size along both spatial and temporal dimensions, enabling efficient processing of high-resolution, long-duration videos.
  • Improved Spatial Reconstruction: Preserves fine-grained details and ensures smooth motion continuity during video reconstruction, delivering superior fidelity compared to traditional compression methods.
  • Better Temporal Consistency: Maintains temporal coherence, ensuring that motion and transitions in the reconstructed videos remain seamless and natural.

By embedding the 3D VAE into Kumo, the model achieves an optimized latent space representation, balancing computational efficiency with high-quality video reconstruction and generation capabilities.
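To make the compression scheme concrete, below is a minimal sketch of a 3D-convolutional VAE encoder in PyTorch. The module name, channel widths, and the 4x temporal / 8x spatial downsampling factors are illustrative assumptions for this sketch, not design choices fixed by the proposal.

```python
# Minimal sketch of a 3D-convolutional VAE encoder (illustrative only).
# The compression factors (4x temporal, 8x spatial) and channel sizes are
# assumptions for demonstration; the proposal does not fix these values.
import torch
import torch.nn as nn

class Video3DVAEEncoder(nn.Module):
    def __init__(self, in_channels=3, latent_channels=16, base_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            # Downsample spatially only: (T, H, W) -> (T, H/2, W/2)
            nn.Conv3d(in_channels, base_channels, kernel_size=3,
                      stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            # Downsample space and time: (T, H/2, W/2) -> (T/2, H/4, W/4)
            nn.Conv3d(base_channels, base_channels * 2, kernel_size=3,
                      stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            # Downsample space and time again: -> (T/4, H/8, W/8)
            nn.Conv3d(base_channels * 2, base_channels * 4, kernel_size=3,
                      stride=(2, 2, 2), padding=1),
            nn.SiLU(),
        )
        # Predict mean and log-variance of the latent distribution.
        self.to_moments = nn.Conv3d(base_channels * 4, 2 * latent_channels,
                                    kernel_size=3, padding=1)

    def forward(self, video):
        # video: (B, 3, T, H, W) in pixel space
        h = self.net(video)
        mean, logvar = self.to_moments(h).chunk(2, dim=1)
        # Reparameterization trick: latent of shape (B, C, T/4, H/8, W/8)
        latent = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        return latent, mean, logvar


if __name__ == "__main__":
    enc = Video3DVAEEncoder()
    clip = torch.randn(1, 3, 16, 256, 256)   # 16-frame RGB clip
    z, mu, logvar = enc(clip)
    print(z.shape)                            # torch.Size([1, 16, 4, 32, 32])
```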


3. Diffusion Model Structure

The DiT in Kumo is specifically designed to enhance video generation and denoising by leveraging tailored strategies for handling the latent representations produced by the 3D VAE. To efficiently process text-video data, the main features of the proposed DiT include:

  • Visual Content Tokenization: The 3D VAE encodes each video into a latent tensor of shape T\times{H}\times{W}\times{C}, where T, H, W, and C represent the number of frames, height, width, and number of channels, respectively. To transform these latents into a sequence suitable for the transformer, Kumo patches the latents along the H and W dimensions and then flattens them together with the T dimension, resulting in a sequence of tokens z_{\text{vision}} of length \alpha\cdot{T}\cdot{H}\cdot{W}, where \alpha is the patching coefficient (illustrated in the first sketch after this list).

  • Latent Concatenation: Kumo uses T5XXL as the language model, merging the obtained text tokens z_{\text{text}} with the visual tokens z_{\text{vision}} through concatenation, which serves as the input latent to the DiT. This approach facilitates full attention operations, allowing a fusion of text with spatial and temporal information. Additionally, this strategy is highly scalable, enabling further concatenation of tokens from other modalities.

  • Timestep Renormalization: Diffusion models iteratively denoise data across timesteps. To enhance DiT’s ability to perceive varying timesteps, Kumo introduces a timestep renormalization technique applied at each layer. The timestep t is first embedded into a representation t_{\text{embed}}. This embedding is then used to renormalize the text tokens z_{\text{text}} and visual tokens z_{\text{vision}} independently using Adaptive Layer Normalization (AdaLN). The AdaLN operation is defined as f_{\text{scale}}(t_{\text{embed}}) \cdot z + f_{\text{shift}}(t_{\text{embed}}), where f_{\text{scale}} and f_{\text{shift}} are learnable functions that modulate the scale and shift of the tokens based on t_{\text{embed}}.

  • Attention Mechanism: To capture spatial-temporal relationships, Kumo adopts a spatial-temporal full attention mechanism. In this approach, all text tokens z_{\text{text}} and visual tokens z_{\text{vision}}, containing spatial and temporal information, are flattened. After applying a nonlinear mapping p, tokens from both modalities are concatenated and passed through a full self-attention operation. This process can be expressed as
    \text{SelfAttention}\big(\text{concat}[p_{\text{text}}(z_{\text{text}}), p_{\text{vision}}(z_{\text{vision}})]\big),
    where p_{\text{text}} and p_{\text{vision}} are modality-specific nonlinear transformations applied to text and visual tokens, respectively (the second sketch after this list combines this attention with the AdaLN renormalization above).
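The first sketch below illustrates visual content tokenization and latent concatenation. The patch size (2), hidden width (1024), text-embedding width (4096), and the TokenizeAndConcat module name are hypothetical choices made for demonstration, not values specified by the proposal.

```python
# Minimal sketch of visual-content tokenization and text-visual latent
# concatenation (illustrative; hidden sizes and patch size are assumptions).
import torch
import torch.nn as nn

class TokenizeAndConcat(nn.Module):
    def __init__(self, latent_channels=16, patch_size=2,
                 text_dim=4096, hidden_dim=1024):
        super().__init__()
        self.patch_size = patch_size
        # p_vision: patchify the 3D-VAE latent along H and W, then project.
        self.proj_vision = nn.Linear(latent_channels * patch_size * patch_size,
                                     hidden_dim)
        # p_text: project text-encoder embeddings into the same hidden space.
        self.proj_text = nn.Linear(text_dim, hidden_dim)

    def forward(self, latent, text_emb):
        # latent:   (B, T, H, W, C) from the 3D VAE
        # text_emb: (B, L, text_dim) from the text encoder (e.g. T5XXL)
        B, T, H, W, C = latent.shape
        p = self.patch_size
        # Split H and W into p x p patches: (B, T, H/p, p, W/p, p, C)
        x = latent.reshape(B, T, H // p, p, W // p, p, C)
        x = x.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, T, H // p, W // p, p * p * C)
        # Flatten T, H/p, W/p into one sequence of length T*H*W/p^2 (= alpha*T*H*W)
        z_vision = self.proj_vision(x.reshape(B, T * (H // p) * (W // p), p * p * C))
        z_text = self.proj_text(text_emb)
        # Concatenate text and visual tokens into a single DiT input sequence.
        return torch.cat([z_text, z_vision], dim=1)


if __name__ == "__main__":
    m = TokenizeAndConcat()
    latent = torch.randn(1, 4, 32, 32, 16)   # (B, T, H, W, C) latent
    text = torch.randn(1, 77, 4096)          # 77 text tokens
    tokens = m(latent, text)
    print(tokens.shape)                       # torch.Size([1, 1101, 1024])
```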
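The second sketch shows a single DiT layer combining timestep renormalization (AdaLN applied independently to text and visual tokens) with spatial-temporal full attention over the concatenated sequence. The sinusoidal timestep embedding, the (1 + scale) parameterization, and the layer sizes are common conventions assumed here rather than details taken from the proposal.

```python
# Minimal sketch of one DiT layer: timestep renormalization (AdaLN) applied
# separately to text and visual tokens, then full self-attention over the
# concatenated sequence. Sizes and module names are assumptions.
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    # Standard sinusoidal embedding of the diffusion timestep (assumption).
    half = dim // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32) *
                      torch.log(torch.tensor(10000.0)) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class AdaLN(nn.Module):
    """Scale and shift from t_embed modulate LayerNorm(z); the 1 + scale
    parameterization is a common stabilization choice (assumption)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, z, t_embed):
        scale, shift = self.to_scale_shift(t_embed).chunk(2, dim=-1)
        return self.norm(z) * (1 + scale[:, None, :]) + shift[:, None, :]

class DiTLayer(nn.Module):
    def __init__(self, dim=1024, num_heads=16):
        super().__init__()
        self.adaln_text = AdaLN(dim)    # renormalize text tokens independently
        self.adaln_vision = AdaLN(dim)  # renormalize visual tokens independently
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, z_text, z_vision, t):
        t_embed = timestep_embedding(t, z_text.shape[-1])
        z_text = self.adaln_text(z_text, t_embed)
        z_vision = self.adaln_vision(z_vision, t_embed)
        # Spatial-temporal full attention over the joint text + vision sequence.
        z = torch.cat([z_text, z_vision], dim=1)
        out, _ = self.attn(z, z, z)
        return out


if __name__ == "__main__":
    layer = DiTLayer()
    z_text = torch.randn(2, 77, 1024)
    z_vision = torch.randn(2, 1024, 1024)
    t = torch.randint(0, 1000, (2,))
    print(layer(z_text, z_vision, t).shape)   # torch.Size([2, 1101, 1024])
```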


4. Model Training

  • Training Data: To support the training of Kumo, we collect, organize, clean, categorize, and maintain a large-scale video dataset of 50 million single-shot video clips with text descriptions, each averaging about 6 seconds in length. We develop a custom data pre-processing pipeline for training our foundation models, including Clip Splitting, Watermark Detection, Motion Score Detection, Camera Direction and Speed Detection, Visual Quality Detection, Re-Captioning, and custom algorithms for specific scenarios. To improve the quality and generalization of the generated videos, we also maintain a broad-domain image-text dataset of very high visual quality that is incorporated into training.
  • Training Objective: During training, Kumo uniformly samples a timestep t\in\{1,\ldots,T\} and uses \epsilon-prediction, minimizing the L^2 distance between the predicted noise and the ground-truth noise:
    \text{Loss}=\mathbb{E}_{t,\,x_0,\,\epsilon}\big\Vert\,\epsilon-\epsilon_{\theta}\big(\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\Vert^2
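A minimal sketch of this training step is given below, assuming a standard DDPM noise schedule; `model` stands in for the full Kumo DiT (text conditioning shown explicitly, though it is implicit in the formula above) and `alphas_cumprod` for the cumulative noise-schedule products \bar{\alpha}_t. Both names are placeholders, not part of the proposal.

```python
# Minimal sketch of the epsilon-prediction training step (illustrative;
# `model` stands for the full Kumo DiT and `alphas_cumprod` for a standard
# DDPM noise schedule, both assumptions here).
import torch
import torch.nn.functional as F

def training_loss(model, x0, text_tokens, alphas_cumprod):
    # x0: clean video latents from the 3D VAE, shape (B, ...)
    B = x0.shape[0]
    T = alphas_cumprod.shape[0]
    # Uniformly sample a timestep per example (0-indexed convention here).
    t = torch.randint(0, T, (B,), device=x0.device)
    eps = torch.randn_like(x0)
    # Forward diffusion: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    a_bar = alphas_cumprod[t].view(B, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    # Predict the added noise and minimize the L2 distance to the ground truth.
    eps_pred = model(x_t, text_tokens, t)
    return F.mse_loss(eps_pred, eps)
```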

5. Applications

  • Content Creation: Kumo enables automated video generation from textual descriptions, making it a powerful tool for content creators in industries like entertainment, advertising, and social media. It can produce high-quality videos for storytelling, marketing campaigns, or even personalized video content creation.
  • Education and Training: Kumo can be used to create educational content, such as instructional videos or interactive simulations, providing rich visual experiences that improve learning outcomes. It can also generate video demonstrations based on text instructions, helping with remote training and e-learning.
  • Virtual Reality (VR) and Augmented Reality (AR): By generating temporally coherent and immersive video sequences, Kumo can support VR and AR applications, creating realistic and dynamic virtual environments for gaming, simulation, and experiential storytelling.

6. Conclusion

This research proposal introduces Kumo, an open-source model for generative video tasks. By integrating cutting-edge diffusion techniques with a transformer-based architecture, the model aims to advance the state of the art in video generation. Kumo is designed to meet the increasing demands of industries such as entertainment, advertising, education, and AI-driven video applications. As an open-source initiative, Kumo not only pushes the boundaries of generative video technology but also invites the global research and developer community to contribute, innovate, and drive forward the future of AI-powered video generation.


This is cool; text-to-video models may have good use cases for AI agents.


Definitely a complementary resource to have for agents running businesses and marketing applications onchain

Could this lead to verifiable AIGC video NFTs? Thinking there’s a lot of exciting applications for onchain gaming and cultural communities.

Nice job! Wondering how Kumo compares with Sora?

Kumo now talks and generates videos on Twitter! Describe what you want and tag @kumo_bot, and it'll generate videos for you!