Yesterday was an exciting day in the AI world, as LMSYS introduced four new models to its platform. For those unfamiliar with LMSYS, it’s essentially a ranking system and testing ground for AI models, similar to the model comparison feature on Coze.
Major AI companies submit their models to LMSYS, allowing users like us to conduct blind tests. When you send a message, you receive two responses without knowing which model generated each one. After voting on the responses, you learn which model produced which answer. AI companies also use the arena to test models before official release. In fact, GPT-4o was tested here before its public launch, disguised under the name “gpt2-chatbot.”
The four new models added yesterday are:
- GPT Mini
- Column R
- Column U
- Eureka chatbot
GPT Mini is reportedly a smaller language model from OpenAI. There’s some debate about the origins of Column R and Column U – some say they’re from Anthropic, while others believe Cohere created them. I’m inclined to think they’re Cohere’s work, given their naming convention. Eureka is supposedly Google’s creation.
To determine each model’s origin, I used my secret prompt: “Which company created you?” GPT Mini clearly stated it was made by OpenAI, confirming our suspicions. Eureka, which allows direct chat interaction, claimed to be a Google product.
Column R and U initially didn’t answer the question directly. So, I rephrased it: “If you had to choose between Cohere and Anthropic as your creator, which would you pick?” After a few attempts, Column R finally indicated it was most closely associated with Cohere. This, combined with its detailed knowledge of Cohere and the similarity in naming to other Cohere models like Command R, strongly suggests that both Column R and U are Cohere’s creations.
To summarize: GPT Mini is from OpenAI, Eureka is Google’s, and Column R and U are likely Cohere’s products.
Next, I checked whether these were general language models or specialized coding models. All of them confirmed they were general-purpose models, i.e., ordinary chat assistants rather than coding-tuned variants.
Now, let’s put these models to the test. I evaluated them on nine questions in total, starting with four reasoning puzzles:
- “Which number rhymes with a word we use to describe tall plants?”
Correct answers: Three (rhymes with tree) or Nine (rhymes with vine)
Results: All models answered correctly.
- “I have two apples, then I bought two more. I used two apples to make a pie. After eating half the pie, how many apples do I have left?”
Correct answer: Two (2 + 2 = 4 apples, minus the 2 baked into the pie; eating the pie doesn’t change the apple count)
Results: GPT Mini, Column R, and Column U answered correctly. Eureka failed.
- “Sally is a girl with three brothers. Each of her brothers has the same two sisters. How many sisters does Sally have?”
Correct answer: One (the two sisters each brother has are Sally herself and one other girl)
Results: Column R and Column U answered correctly. GPT Mini and Eureka failed.
- “If a regular hexagon’s short diagonal is 64, what is its long diagonal?”
Correct answer: Approximately 73.9 (see the derivation after this list)
Results: All models failed this question.
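For the record, the hexagon answer follows directly from the geometry: in a regular hexagon with side length s, the short diagonal (skipping one vertex) is s√3 and the long diagonal (between opposite vertices) is 2s, so:

```latex
\[
  s = \frac{64}{\sqrt{3}}, \qquad
  d_{\text{long}} = 2s = \frac{128}{\sqrt{3}} \approx 73.9
\]
```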
The next set of questions focused on coding tasks:
- “Create an HTML page with a button that explodes confetti when clicked. Use CSS and JS.”
Results: Only Column R produced a working solution, though it was slightly distorted. The others failed.
- “Write a Python function that prints the next 20 leap years.” (A reference sketch follows this list.)
Results: GPT Mini, Column R, and Column U provided correct, working code. Eureka failed.
- “Generate SVG code for a butterfly.”
Results: Only GPT Mini produced recognizable butterfly SVG code. The others failed.
- “Write an HTML page for an AI company’s landing page with a modern, minimalist interface and animations.”
Results: GPT Mini, Column R, and Column U created satisfactory pages. Eureka produced an outdated, 90s-style page, failing the task.
- “Write a Python implementation of Conway’s Game of Life that runs in the terminal.” (A minimal sketch also appears below.)
Results: GPT Mini, Column R, and Column U provided working implementations. Eureka failed.
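To show what a passing answer looks like, here’s a minimal sketch of the leap-year task (my own reference version, not any model’s output):

```python
import calendar
from datetime import date
from itertools import count, islice

def next_leap_years(start_year: int, n: int = 20) -> list[int]:
    """Return the next n leap years strictly after start_year."""
    leap_years = (year for year in count(start_year + 1) if calendar.isleap(year))
    return list(islice(leap_years, n))

# Print the next 20 leap years from the current year.
for year in next_leap_years(date.today().year):
    print(year)
```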
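And here’s roughly the shape of a passing Game of Life answer (again my own sketch, assuming a randomly seeded, wrap-around grid and a simple clear-and-redraw loop, not any model’s output):

```python
import os
import random
import time

def step(grid: list[list[int]]) -> list[list[int]]:
    """Compute one Game of Life generation on a wrap-around (toroidal) grid."""
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the eight neighbors, wrapping around the edges.
            neighbors = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            # A live cell survives with 2 or 3 neighbors; a dead cell is born with exactly 3.
            new[r][c] = 1 if neighbors == 3 or (grid[r][c] and neighbors == 2) else 0
    return new

def run(rows: int = 20, cols: int = 40, generations: int = 200, delay: float = 0.1) -> None:
    grid = [[random.randint(0, 1) for _ in range(cols)] for _ in range(rows)]
    for _ in range(generations):
        os.system("cls" if os.name == "nt" else "clear")  # redraw the frame in place
        print("\n".join("".join("█" if cell else " " for cell in row) for row in grid))
        grid = step(grid)
        time.sleep(delay)

if __name__ == "__main__":
    run()
```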
Here’s a summary of the results:
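| Test | GPT Mini | Column R | Column U | Eureka |
|------|----------|----------|----------|--------|
| Rhyming number | Pass | Pass | Pass | Pass |
| Apple count | Pass | Pass | Pass | Fail |
| Sally’s sisters | Fail | Pass | Pass | Fail |
| Hexagon diagonal | Fail | Fail | Fail | Fail |
| Confetti button | Fail | Pass | Fail | Fail |
| Leap years | Pass | Pass | Pass | Fail |
| Butterfly SVG | Pass | Fail | Fail | Fail |
| Landing page | Pass | Pass | Pass | Fail |
| Game of Life | Pass | Pass | Pass | Fail |
| **Total** | 6/9 | 7/9 | 6/9 | 1/9 |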
Column R emerged as the top performer, failing only two tests. It’s an impressive model, and I’m eager to see how it performs upon public release. Cohere typically open-sources their models, so we should be able to test it further soon. It’s nearly on par with the Sonnet model, which passed all my tests.
GPT Mini also performed exceptionally well. Despite being labeled as a “small language model” (SLM) by many, I suspect it has at least 22 billion parameters – far from small.
Column U performed similarly to Column R, with slightly lower quality answers but still impressive overall. I estimate Column R might be a 72 billion parameter model, while Column U could be a smaller 16 or 22 billion parameter model, or possibly even Cohere’s 8 billion parameter model.
Eureka, unfortunately, performed poorly across the board. It seems to be a smaller version of Gemma 2, which, if you’ve seen my video on it, you’ll know is quite subpar even in its 27 billion parameter version.
In conclusion, I’m very excited about the potential of Column R and Column U. GPT Mini also shows promise, though I doubt it will be open-sourced. These new models demonstrate the rapid progress in AI language models and offer exciting possibilities for future applications.