Fabien Belleinguer

🧪 Field Report: Comparing ChatGPT 5.4 vs Mistral Large vs Gemini 3.1 Pro on an E-Commerce Store

I wanted to share a quick benchmark we ran while building Fabio AI Chatbot, testing how different LLMs behave in a real product search scenario.

Instead of relying on synthetic benchmarks, we ran this on an actual WooCommerce store (~1000 products), using the exact same prompt for every model.

🔗 Live demo: https://fabio-plugins.com/demo_shop
🔗 Fabio AI Chatbot: https://fabio-plugins.com

🧠 Prompt:
"I am looking for a power bank that can last 10 hours and costs less than 50 USD"

🤖 Gemini 3.1 Pro
⏱ Response time: 12.6s
→ Returned 1 product
→ Comforto Power Bank Series 148 Gray (10h, $39.96)

🤖 Mistral Large 3
⏱ Response time: 2.7s
→ Returned 1 product
→ NovaTech Power Bank Series 100 Green (12h, $27.38)

🤖 ChatGPT 5.4
⏱ Response time: 5.2s
→ Returned 3 products
→ Comforto (10h, $39.96)
→ HomeEase (8h, $46.96)
→ UrbanNest (6h, $29.96)

📊 Test conditions:

  • Same dataset (WooCommerce, ~1000 products)

  • Same prompt

  • No manual post-processing

  • Focus on raw model behavior (speed + selection)
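
For anyone who wants to reproduce this kind of run, here is a minimal sketch of a timing harness that sends one prompt to several models and records what comes back. It is not the code we used in Fabio AI Chatbot: `time_model`, `run_benchmark`, and `fake_model` are hypothetical names, and the stub stands in for whatever call your own backend makes to each model.

```python
import time
from typing import Callable, Dict, List, Tuple

PROMPT = "I am looking for a power bank that can last 10 hours and costs less than 50 USD"

# A "model" here is any callable that takes the prompt and returns product names.
ModelFn = Callable[[str], List[str]]


def time_model(ask: ModelFn, prompt: str) -> Tuple[List[str], float]:
    """Call one model once and return (products, elapsed seconds)."""
    start = time.perf_counter()
    products = ask(prompt)
    elapsed = time.perf_counter() - start
    return products, elapsed


def run_benchmark(models: Dict[str, ModelFn], prompt: str) -> Dict[str, dict]:
    """Send the same prompt to every model, with no post-processing of the output."""
    results = {}
    for name, ask in models.items():
        products, elapsed = time_model(ask, prompt)
        results[name] = {"products": products, "seconds": elapsed}
    return results


def fake_model(prompt: str) -> List[str]:
    """Hypothetical stand-in; replace with a real call into your chatbot backend."""
    return ["Example Power Bank"]


if __name__ == "__main__":
    for name, r in run_benchmark({"stub-model": fake_model}, PROMPT).items():
        print(f"{name}: {len(r['products'])} product(s) in {r['seconds']:.2f}s")
```

A single call per model, as in this post, is noisy; repeating each call a few times and reporting the median would give steadier latency numbers.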

This kind of test has helped us understand how different models handle:

  • constraint-based queries

  • response latency

  • product selection patterns
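
To make the constraint-handling point concrete, here is a small sketch that checks each returned product against the two constraints in the prompt. The product data is copied from the results above. The `Product` class, `meets_constraints`, and `score` are hypothetical helpers, and reading "can last 10 hours" as "at least 10 hours" and "less than 50 USD" as a strict upper bound is my assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

MIN_HOURS = 10       # assumption: "can last 10 hours" means at least 10 hours
MAX_PRICE_USD = 50   # assumption: "less than 50 USD" is a strict upper bound


@dataclass
class Product:
    name: str
    hours: float
    price_usd: float


def meets_constraints(p: Product) -> bool:
    """True when the product satisfies both constraints from the prompt."""
    return p.hours >= MIN_HOURS and p.price_usd < MAX_PRICE_USD


# Results as reported in this post.
RESULTS: Dict[str, List[Product]] = {
    "Gemini 3.1 Pro": [Product("Comforto Power Bank Series 148 Gray", 10, 39.96)],
    "Mistral Large 3": [Product("NovaTech Power Bank Series 100 Green", 12, 27.38)],
    "ChatGPT 5.4": [
        Product("Comforto", 10, 39.96),
        Product("HomeEase", 8, 46.96),
        Product("UrbanNest", 6, 29.96),
    ],
}


def score(results: Dict[str, List[Product]]) -> Dict[str, Tuple[int, int]]:
    """Map each model to (products meeting both constraints, products returned)."""
    return {
        model: (sum(meets_constraints(p) for p in products), len(products))
        for model, products in results.items()
    }


if __name__ == "__main__":
    for model, (ok, total) in score(RESULTS).items():
        print(f"{model}: {ok}/{total} returned products meet both constraints")
```

Under that reading, Gemini 3.1 Pro and Mistral Large 3 each score 1/1, while ChatGPT 5.4 scores 1/3: all three of its picks are under 50 USD, but HomeEase (8h) and UrbanNest (6h) fall short of 10 hours.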

Happy to hear how others here are evaluating LLMs in production environments.
