Step-Video-T2V


Open-Source, 204-Frame Video Generation from Text.


Step-Video-T2V is the open-source text-to-video model series from StepFun. It offers generation of up to 204 frames, a high-compression Video-VAE, and video-based DPO for enhanced quality, and achieves SOTA results on Step-Video-T2V-Eval.

Zac Zuo

Hi everyone!

Sharing Step-Video-T2V, a new open-source text-to-video model from StepFun, one of the leading AI model developers in China. It pushes the boundaries of video generation, achieving state-of-the-art results with some impressive capabilities:

Key Features:

🎬 Long-Form Generation: Can generate videos up to 204 frames long.
⚙️ Advanced Architecture: Uses a DiT (Diffusion Transformer) with 3D full attention.
⏩ High-Compression VAE: A custom Video-VAE with 16x16 spatial and 8x temporal compression for efficiency (see the rough shape math after this list).
✨ Video-DPO: Uses Direct Preference Optimization (DPO) on videos to improve visual quality.
🌐 Bilingual: Supports both Chinese and English prompts.
🏆 SOTA Performance: Achieves top results on their new Step-Video-T2V Eval benchmark.
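
To make the compression ratios concrete, here is a minimal back-of-the-envelope sketch of how the stated 16x16 spatial and 8x temporal compression shrinks a 204-frame clip before the DiT sees it. The 544x992 resolution is an illustrative assumption, not something stated in this post, and the exact frame rounding depends on the VAE's implementation.

```python
# Rough latent-shape arithmetic for the Video-VAE described above.
# Assumption (not from the post): 544x992 output resolution; frame rounding is approximate.

frames, height, width = 204, 544, 992   # example clip size in pixels
t_comp, s_comp = 8, 16                  # 8x temporal, 16x16 spatial compression

latent_t = frames // t_comp             # ~25 latent frames (exact count depends on the VAE)
latent_h = height // s_comp             # 34
latent_w = width // s_comp              # 62

reduction = (frames * height * width) / (latent_t * latent_h * latent_w)

print(f"pixel video : {frames} x {height} x {width}")
print(f"latent video: {latent_t} x {latent_h} x {latent_w}")
# Per-position reduction, ignoring the change in channel count:
print(f"spatio-temporal reduction ≈ {reduction:.0f}x")
```

The takeaway is that the DiT with 3D full attention runs over a latent grid roughly 2,000x smaller than the raw pixel volume, which is what makes 204-frame generation tractable.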

There is a base model and a "Turbo" version, which leverages inference step distillation for increased speed.

You can try out text-to-video generation directly on their website.