LLaVA-Mini

LLaVA-Mini: Efficient Image and Video Large Multimodal Models

9 followers

LLaVA-Mini 👏 is an efficient LMM for image and video understanding that uses 1 vision token per image, offering: (1) ⏩ fast response (40 ms per image) and (2) 🖥️ lower VRAM usage (supports 3-hour video understanding on a 24 GB GPU).

Free

Launch tags: Productivity • Artificial Intelligence • GitHub

Launch Team

Hunter

📌

LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos. Guided by an interpretability analysis of how LMMs process vision tokens, LLaVA-Mini needs only 1 token to represent each image. This improves the efficiency of image and video understanding in three ways: computation (77% fewer FLOPs), response latency (reduced from 100 ms to 40 ms), and VRAM usage (reduced from 360 MB/image to 0.6 MB/image, enough to support 3-hour video processing).

Paper: https://arxiv.org/abs/2501.03895
Model: https://huggingface.co/ICTNLP/ll...
Code & Demo: https://github.com/ictnlp/LLaVA-...
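To make the memory claim concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the per-image figures quoted above (360 MB vs. 0.6 MB, 100 ms vs. 40 ms). The sampling rate of 1 frame per second is an assumption for illustration, not something stated in this post, and the function names are hypothetical. The estimate covers only the per-image memory and ignores model weights and other overhead.

```python
# Figures quoted in the description above.
BASELINE_MB_PER_IMAGE = 360.0
MINI_MB_PER_IMAGE = 0.6
BASELINE_LATENCY_MS = 100.0
MINI_LATENCY_MS = 40.0

# Assumption (not from the post): video frames are sampled at 1 frame per second.
ASSUMED_FPS = 1.0


def frames_in_video(hours: float, fps: float = ASSUMED_FPS) -> int:
    """Number of sampled frames in a video of the given length."""
    return int(hours * 3600 * fps)


def image_memory_gb(num_frames: int, mb_per_image: float) -> float:
    """Total per-image memory in GB (1 GB = 1000 MB), excluding model weights."""
    return num_frames * mb_per_image / 1000.0


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after`."""
    return 1.0 - after / before


if __name__ == "__main__":
    frames = frames_in_video(hours=3)
    print("frames:", frames)
    print("LLaVA-Mini GB:", image_memory_gb(frames, MINI_MB_PER_IMAGE))
    print("baseline GB:", image_memory_gb(frames, BASELINE_MB_PER_IMAGE))
    print("latency reduction:", relative_reduction(BASELINE_LATENCY_MS, MINI_LATENCY_MS))
```

Under the 1 fps assumption, a 3-hour video is 10,800 frames. At 0.6 MB per image that comes to roughly 6.5 GB, which leaves room on a 24 GB GPU, while 360 MB per image would need several terabytes. A different sampling rate scales both numbers linearly.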

1yr ago
