Emmanuel Turlay

Airtrain.ai LLM Playground - Vibe-check many open-source and proprietary LLMs at once

A no-code LLM playground to vibe-check and compare quality, performance, and cost at once across a wide selection of open-source and proprietary LLMs: Claude, Gemini, Mistral AI models, OpenAI models, Llama 2, Phi-2, etc.

Emmanuel Turlay
Hello Product Hunt community! 🚀 We're very proud to introduce the Airtrain.ai LLM Playground, a no-code tool to prompt many open-source and proprietary LLMs at once: Claude, Gemma, GPT-4, Llama 2, Gemini, Phi-2, Mistral models, and more. Compare quality, cost, and performance. We built this playground to help AI enthusiasts and practitioners of all stripes easily “vibe check” popular LLMs. Key features include:
📌 Prompt multiple models at once
📌 18 models supported (8 open-source, 10 proprietary)
📌 Inference metrics (i/o token counts, throughput, inference cost)
📌 Persisted sessions (review and resume previous chat sessions)
We'd love for you to try it out and share your feedback with us. Feel free to ask any questions, and we'll be more than happy to answer them. Thanks so much for your support, and we hope you enjoy using the LLM Playground! ✨
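For a feel of what those inference metrics capture, here is a minimal sketch of running the same side-by-side comparison by hand in Python. It assumes the official openai and anthropic clients with API keys set in the environment, and the model names are just examples; the Playground handles all of this in the UI.
```python
# Rough sketch of the comparison the Playground automates: send one prompt
# to two providers and record token counts and output throughput.
# Assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are set in the environment.
import time

import anthropic
from openai import OpenAI

prompt = "Explain the difference between a mutex and a semaphore."

# --- OpenAI ---
openai_client = OpenAI()
start = time.perf_counter()
resp = openai_client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
elapsed = time.perf_counter() - start
print(
    f"gpt-4: {resp.usage.prompt_tokens} in / {resp.usage.completion_tokens} out, "
    f"{resp.usage.completion_tokens / elapsed:.1f} tok/s"
)

# --- Anthropic ---
anthropic_client = anthropic.Anthropic()
start = time.perf_counter()
msg = anthropic_client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=512,
    messages=[{"role": "user", "content": prompt}],
)
elapsed = time.perf_counter() - start
print(
    f"claude-3-sonnet: {msg.usage.input_tokens} in / {msg.usage.output_tokens} out, "
    f"{msg.usage.output_tokens / elapsed:.1f} tok/s"
)
```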
Chris McCann
Congrats on the launch! Curious, what open source models are you seeing get the most usage on the platform?
Emmanuel Turlay
@mccannatron Mistral and Llama 2 are neck and neck. Although note that "Mistral" here covers 2 open-source and 3 proprietary models.
Joy Larkin
Emmanuel and team, congratulations on the launch! I have to say, it is so very cool to be able to query Claude 3, GPT-4, and Gemini all at once. What inspired you to build this?
Emmanuel Turlay
@joy_larkin We heard from a lot of users that they wanted to try models other than the OpenAI ones but did not know where to start. It starts with evaluation, and evaluation starts with a vibe check: running a handful of prompts through various models to build an intuition about their relative quality, cost, and performance, before moving to batch evaluation across an entire test dataset (which we also offer :)
Will Jacob
Super interesting! Gotta say I love the inference metrics -- makes it way easier to compare costs than what I've been doing. Claude 3 is so pricey!
Joshua Bauer
@will_jacob Indeed Opus is quite expensive! Luckily Sonnet is still close to GPT-4 level but at a more reasonable price. It'll be interesting to have a chance to play with Haiku when Anthropic releases it, since that should be much more affordable.
Emmanuel Turlay
@will_jacob yeah those jumbo proprietary models are luxury!
Victoria Vassalotti
This is so cool! What pricing are you using?
Joshua Bauer
@victoria_vassalotti Thanks! It's per-token depending on the model. Everyone gets $10 on signup. You can see the detailed pricing here: https://docs.airtrain.ai/docs/in...
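As a rough illustration of how per-token pricing composes into an inference cost: multiply input and output token counts by their respective rates. The per-million-token prices below are placeholders for illustration, not Airtrain's actual rates; see the docs link above for those.
```python
# Illustrative only: compute the cost of one chat turn from token counts.
# The prices below are PLACEHOLDER values, not Airtrain's actual rates.
PRICE_PER_MILLION = {                   # (input $/1M tok, output $/1M tok)
    "gpt-4": (30.00, 60.00),            # placeholder
    "claude-3-sonnet": (3.00, 15.00),   # placeholder
}

def turn_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICE_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

print(f"${turn_cost('gpt-4', 1_200, 400):.4f}")  # 1,200 in / 400 out tokens
```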
Rogério Chaves
That’s very interesting, I really like the concept of “vibe checking”. Is it possible to upload many samples I have, or run a prompt multiple times at once? Since responses vary due to temperature, right?
Emmanuel Turlay
@rchaves Yes, to do this, go back to the task menu (New task in the top bar) and select "Evaluate Models". You can upload a CSV or JSONL file of up to 10k examples and configure the models you want to test and the metrics you care about. See docs here: https://docs.airtrain.ai/docs/ba...
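A minimal sketch of preparing such a JSONL file is below. The "prompt" field name is a hypothetical column for illustration; the exact schema the uploader expects is in the linked docs.
```python
# Write a small JSONL file of examples for batch evaluation:
# one JSON object per line, up to 10k examples.
import json

examples = [
    {"prompt": "Summarize the plot of Hamlet in two sentences."},
    {"prompt": "Write a Python function that reverses a linked list."},
    {"prompt": "Explain retrieval-augmented generation to a beginner."},
]

with open("eval_set.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```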
Alex Li
This is awesome. Really like the broad coverage across both open-source and closed LLMs. Good luck with the launch!
Joshua Bauer
@alex_li_mvp thanks! We noticed that not a lot of other tools allow you to have both of those side-by-side
Neil Krichi
Very cool! As a full stack dev who's primarily used OpenAI and Gemini models for code generation, this is super useful for comparing performance and quality of other LLMs.
Emmanuel Turlay
@neilkrichi Thank you, glad it's useful to you. How often do you consider switching models, and for what reasons? Do you ever run models locally, or on-prem?
Idriss Chebak
Congratulations on the launch! I find it very convenient to be able to compare Opus, Gemini and GPT-4 outputs and choose the best answer — especially in cybersecurity codegen.
Emmanuel Turlay
@idriss_chebak Thank you. It is pretty handy to get a number of different alternative implementations for the same coding prompt.
Frosina Lazarova
The side-by-side comparison is very nice, makes it really easy to compare the different models.
Idriss Chebak
@frosina_lazarova I would add, especially for codegen. If one model hallucinates an imported package, you can catch it by comparing with the other LLMs.
Hari
Congrats on the launch! The availability of open-source LLMs is only part of the solution - the real big missing piece for faster adoption of them is tools like these to easily evaluate, pick, and customize models. Looking forward to trying it out!
Harrison Johnson
Hey! This is really cool. Thanks for what you're doing for the LLM community. I think investments like this in accessibility around these models are going to be critical to fulfilling their full potential. Love the vibe check positioning too lol
Emmanuel Turlay
@harrisonjohnson Thanks Harrison, we're all about showing people that there is a world outside of GPT-4 :)
Snow W. Lee (Sungwon)
Congratulations on your launch, @emmanuel_turlay !
Arthur Lorotte de Banes
Congrats! I'm always wondering which LLM to use for different use cases 🧐
Emmanuel Turlay
@sysless Thanks! Give us your feedback after you've tried the Playground :)
Richard He
Very useful product, love it!
Zheng Hao Tan
Congrats on the launch, Emmanuel and team! This looks really amazing and I’ll give it a try soon :)
Noelle Tassey
Congrats on the launch! The side-by-side comparison is very nice, makes it really easy to compare the different models -- was really surprised how they stack up on certain tasks. Mistral did better than anticipated!
Emmanuel Turlay
@noelle_tassey1 Indeed, Mistral is the maverick!
Vivian Lee
Huge congrats on the launch!!
Jeremy Hindle
With a seemingly never-ending list of new models being released these days, this looks useful for evaluating both different versions of a model and alternate models over time for specific use cases.
Sydney Cohen
This is cool!