Jina AI was the only provider offering true multimodal embeddings — text and images in the same vector space — via a simple API. Our platform handles PDFs with figures, PPTX slides, and scanned documents, so cross-modal search (query with text, retrieve images, and vice versa) was non-negotiable. We evaluated OpenAI embeddings (text-only), Cohere (no multimodal at the time), and local models like CLIP (too much operational overhead). Jina CLIP v2 gave us 1024-dim multimodal vectors from a single API call, with solid reranking on top. Clean DX, great performance.
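To make the cross-modal flow concrete, here is a minimal sketch of how a mixed text-plus-image embedding request can be assembled and how retrieved image vectors are ranked against a text query by cosine similarity. The endpoint URL, payload field names, and the tiny placeholder vectors are assumptions for illustration, not a spec; real jina-clip-v2 vectors are 1024-dim floats.

```python
import math

# Assumed endpoint for illustration only; check the provider docs.
JINA_ENDPOINT = "https://api.jina.ai/v1/embeddings"

def build_payload(texts, image_urls, model="jina-clip-v2"):
    """Mixed text and image inputs in one request, each item tagged by modality.
    Field names here ({"text": ...}, {"image": ...}) are assumptions."""
    inputs = [{"text": t} for t in texts] + [{"image": u} for u in image_urls]
    return {"model": model, "input": inputs}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_images(query_vec, image_vecs):
    """Rank stored image embeddings by similarity to a text query embedding.
    Because both modalities share one vector space, this is plain cosine ranking."""
    scored = sorted(image_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in scored]

# Placeholder 3-dim vectors standing in for real 1024-dim API outputs.
query = [0.9, 0.1, 0.0]
images = {
    "figure_3.png": [0.85, 0.2, 0.1],
    "slide_7.png": [0.1, 0.9, 0.3],
}
print(rank_images(query, images))
```

The same ranking works in the other direction (image query against text chunks) because both modalities land in one space; in production the sorted scan would be replaced by a vector index.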