Step-Video-T2V


Open-Source, 204-Frame Video Generation from Text.


Step-Video-T2V is the open-source text-to-video model series from StepFun. It offers generation of up to 204 frames, a high-compression Video-VAE, and video-based DPO for enhanced quality, and achieves SOTA results on Step-Video-T2V-Eval.

Zac Zuo

Hi everyone!

Sharing Step-Video-T2V, a new open-source text-to-video model from StepFun, one of the leading AI model developers in China. It pushes the boundaries of video generation, achieving state-of-the-art results with some impressive capabilities:

Key Features:

🎬 Long-Form Generation: Can generate videos up to 204 frames long.
⚙️ Advanced Architecture: Uses a DiT (Diffusion Transformer) with 3D full attention.
⏩ High-Compression VAE: A custom Video-VAE with 16x16 spatial and 8x temporal compression for efficiency (see the rough shape math after this list).
✨ Video-DPO: Uses Direct Preference Optimization (DPO) on videos to improve visual quality.
🌐 Bilingual: Supports both Chinese and English prompts.
🏆 SOTA Performance: Achieves top results on their new Step-Video-T2V Eval benchmark.
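
To make the compression ratios concrete, here is a minimal back-of-the-envelope sketch of how the stated 16x16 spatial and 8x temporal compression shrinks a 204-frame clip before the DiT sees it. The 544x992 resolution is an illustrative assumption, not something stated in this post, and the exact frame rounding depends on the VAE's implementation.

```python
# Rough latent-shape arithmetic for the Video-VAE described above.
# Assumption (not from the post): 544x992 output resolution; frame rounding is approximate.

frames, height, width = 204, 544, 992   # example clip size in pixels
t_comp, s_comp = 8, 16                  # 8x temporal, 16x16 spatial compression

latent_t = frames // t_comp             # ~25 latent frames (exact count depends on the VAE)
latent_h = height // s_comp             # 34
latent_w = width // s_comp              # 62

reduction = (frames * height * width) / (latent_t * latent_h * latent_w)

print(f"pixel video : {frames} x {height} x {width}")
print(f"latent video: {latent_t} x {latent_h} x {latent_w}")
# Per-position reduction, ignoring the change in channel count:
print(f"spatio-temporal reduction ≈ {reduction:.0f}x")
```

The takeaway is that the DiT with 3D full attention runs over a latent grid roughly 2,000x smaller than the raw pixel volume, which is what makes 204-frame generation tractable.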

There is a base model and a "Turbo" version, which leverages inference step distillation for increased speed.

You can try out text-to-video generation directly on their website.