Gemini

Google's answer to GPT-4

4.8
121 reviews

4.6K followers

Google's largest and most capable AI model. Built from the ground up to be multimodal, Gemini can generalize and seamlessly understand, operate across, and combine different types of information, including text, images, audio, video and code.
This is the 10th launch from Gemini.
Agentic Vision in Gemini

Launching today
Agentic visual reasoning with code execution
Agentic Vision, a new capability introduced in Gemini 3 Flash, turns image understanding from a static act into an agentic process.

Zac Zuo

Hi everyone!

OK, really excited about this one because it takes a huge step forward in visual context.

Tested it by asking it to find all the red dots in an image. Instead of trying to "eyeball" it (which models usually fail at), Gemini 3 Flash realized that "counting by eye" is imprecise. So it decided to act like an engineer and write a professional OpenCV script to solve it accurately.

The logic flow was fascinating:

  • Task: Precision counting.

  • Reasoning: Visual models have error margins -> I should use Python tools.

  • Action: Filter pixels via HSV color space -> Use findContours to locate them.

This actually blew my mind. Natively realizing the "Perception → Reasoning → Action" loop in vision is critical for real-world apps.

The demos in Google AI Studio are also worth checking out. Definitely some of the most interesting and inspiring visual use cases I've seen.

Xiang Lei

With the 90% cost reduction mentioned, does this apply to multimodal inputs like huge image datasets used as part of a system prompt?