Phi-3.5 Models: Microsoft’s Cutting-Edge AI Beats Llama-3.1!

Microsoft has launched several new models in its Phi series, numbered 3.5. This release includes three new models: Phi 3.5 Vision, Phi 3.5 Mini, and Phi 3.5 Mixture of Experts (MoE).

Among them, the largest model is the Mixture of Experts model, which is a 16×3.8B model with 6.6B active parameters. The Phi 3.5 Mini is a model with 3.8B parameters, while the Phi 3.5 Vision has 4.2B parameters.

Excitingly, these models are all compact enough to run on consumer hardware, and each features an impressive context window of approximately 128K tokens. Let’s introduce them from largest to smallest.

First up is the Phi 3.5 MoE, which is a Mixture of Experts model.

You can think of a Mixture of Experts model as a combination of several smaller expert sub-networks, each effectively specialized for certain kinds of input. A router examines your prompt and directs it to the most relevant experts. While this description isn’t exact, it’s fairly close.
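To make the routing idea concrete, here is a toy sketch of top-k expert routing in plain Python. This is not Phi’s actual implementation; the expert count (16), embedding size (4), and function names are all illustrative, and real MoE layers route per token inside a transformer block.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of router logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(token_vec, router_weights, k=2):
    """Score each expert for this token and pick the top-k.

    router_weights holds one weight vector per expert; its dot product
    with the token embedding is that expert's routing logit.
    """
    logits = [sum(w * x for w, x in zip(wv, token_vec)) for wv in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return top, [probs[i] for i in top]

# 16 toy experts over a 4-dimensional token embedding (hypothetical sizes).
random.seed(0)
router = [[random.gauss(0, 1) for _ in range(4)] for _ in range(16)]
experts, weights = route([0.5, -1.0, 0.3, 2.0], router, k=2)
print(experts)  # indices of the two experts this token is sent to
```

Because only the selected experts’ feed-forward networks actually run for a given token, a 16×3.8B model can get away with roughly 6.6B active parameters per token.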

In any case, at 16×3.8B parameters it is already quite substantial, though its total size isn’t particularly large compared to other Mixture of Experts models.

Now, let’s look at the benchmark data. This model performs comparably to Mistral Nemo or Llama 3.1 8B, and here’s how they stack up.

In the Arena Hard test, it outperformed Llama 3.1 8B but did not surpass Mistral Nemo, Gemma 2 9B, or GPT-4o mini.

However, in the Big Bench Hard test, it significantly outperformed all other models, including Gemini 1.5 Flash, and came very close to GPT-4o mini’s performance, which is quite impressive.

In the MMLU test, it also outperformed all other models, even exceeding GPT-4o mini, which is very cool.

In the MMLU Pro test, it surpassed all models except for Gemini 1.5 Flash and GPT-4o mini.

Its performance is similar in reasoning benchmarks, multilingual benchmarks, and mathematical benchmarks.

In the HumanEval test, its performance was also comparable. Although it couldn’t beat Gemini 1.5 Flash and GPT-4o mini, it outperformed all other models.

In the MBPP test, it beat all models except for GPT-4o mini, which is quite commendable.

You could say that, at least in benchmark tests, it can compete with Mistral Nemo or Llama 3.1 8B, which is quite good.

Now, let’s take a look at the Phi 3.5 Mini model. The Phi 3.5 Mini is a model with 3.8 billion parameters, making it very small, similar to the previous Phi 3 model.

In benchmark tests, it generally performs better across all categories than earlier models and is comparable to or slightly below Llama 3.1 8B.

Considering it is less than half the size of Llama 3.1 8B, that’s quite impressive.

Next is the Phi 3.5 Vision model, which is essentially the Phi 3.5 Mini model with visual capabilities. This model has 4.2B parameters, and its benchmark tests are similar to those of the Phi 3.5 Mini.

However, it also includes visual benchmark tests, where it performs exceptionally well.

In nearly all visual benchmark tests, this compact yet powerful vision model can compete with Gemini 1.5 Flash, which is fantastic.

Open-source visual models have always been scarce, so it’s exciting to see such a small yet powerful visual model emerging.

Now, let’s discuss how to use these models. The 3.8B Mini model can be accessed on Ollama.
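For reference, here is a minimal sketch of calling the Mini model through Ollama’s local REST API from Python. The model tag `phi3.5` and the default port 11434 are Ollama’s conventions at the time of writing; adjust them if your setup differs, and the function names here are mine.

```python
import json
import urllib.request

def build_request(prompt, model="phi3.5", stream=False):
    # JSON body for Ollama's /api/generate endpoint.
    return {"model": model, "prompt": prompt, "stream": stream}

def ask_phi(prompt, host="http://localhost:11434"):
    """Send a prompt to a locally running Ollama server and return the reply."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires the model pulled (`ollama pull phi3.5`) and the server running:
# print(ask_phi("Is 337 a prime number?"))
```

The same endpoint should work for the other models once their architectures are supported.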

However, the other models are currently unavailable there because their architectures differ slightly and still need support in llama.cpp for compatibility.

In any case, you can try these models for free on NVIDIA NIM, so be sure to check it out.

For the text models, I will try these 13 questions. I won’t use them to test the Vision model: it is essentially the Mini model with visual capabilities, I can’t find any good demo sites for it, and it isn’t available on Ollama.

Alright, let’s begin.

The first question is: Which country’s name ends with a sound like “lia,” and what is its capital? The answer should be Canberra, the capital of Australia. Let’s send the question and see the answer. By the way, the Mini is on the left, and the E model (MoE) is on the right.

Here’s the answer: the Mini model did not answer correctly, while the E model did. Thus, the Mini model is marked as failed, and the E model is marked as passed.

The next question is: Which number rhymes with the word we use to describe tall plants? The answer should be Three (3). Let’s send the question and check the answer. Here’s the response: the Mini model is incorrect, but the E model is correct. So we will mark the Mini as failed and the E model as passed.

The next question is: John has three boxes of pencils, with 12 pencils in each box. How many pencils does John have in total? The answer should be 36. Let’s send the question and check the answer. It looks like both models provided the correct answer, so this time both passed.

The next question is: Lucy has twice as many candies as Mike. If Mike has 7 candies, how many candies does Lucy have? The answer should be 14. Let’s send the question and check the answer. Both answers seem correct, so both passed again.

The next question is: Is the number 337 a prime number? The answer should be “Yes.” Let’s send the question and check. The Mini model was incorrect, while the E model was correct, so the Mini model is marked as failed, and the E model as passed.
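The check itself is simple; a correct answer amounts to trial division up to the square root, along these lines (the function name is mine):

```python
def is_prime(n):
    # Trial division up to sqrt(n); fine for small numbers like 337.
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

print(is_prime(337))  # True: no divisor among 3, 5, 7, 11, 13, 17
```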

The next question is: I have two apples and then buy two more. I baked a pie with two of the apples and ate half of the pie. How many apples do I have left? The answer should be two. Let’s send the question and check the answer. The Mini model was incorrect, while the E model was correct, so the Mini is marked as failed, and the E model as passed.

The next question is: Sally is a girl who has three brothers, and each brother has the same two sisters. How many sisters does Sally have? The answer should be one. Let’s send the question and check the answer. The Mini model was incorrect again, while the E model was correct, so we will mark the Mini as failed and the E model as passed.

The next question is: If a regular hexagon’s short diagonal is 64, what is its long diagonal? The answer should be 73.9. Let’s send the question and check the results. Here are the answers: both models failed to answer this question, so this time both are marked as failed.
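For reference, the geometry works out as follows: in a regular hexagon with side s, the short diagonal is s·√3 and the long diagonal is 2s, so long = 2 × short / √3. A quick check (the function name is mine):

```python
import math

def long_diagonal_from_short(short):
    # Regular hexagon: short diagonal = s * sqrt(3), long diagonal = 2 * s,
    # so long = 2 * short / sqrt(3).
    side = short / math.sqrt(3)
    return 2 * side

print(round(long_diagonal_from_short(64), 1))  # 73.9
```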

The next few questions are programming-related. The first question is: Create an HTML page with a button that releases confetti when clicked. You can use CSS and JS. Let’s send the question and check the answers. Here are the codes generated by both models.

First, let’s preview the Mini model’s version; it doesn’t look like confetti, so this is a failure. Now, let’s preview the E model’s version; it doesn’t work at all, so both models failed.

The next question is: Create a Python program that prints the next few leap years based on user input. Let’s send the question and check the answers. Here are the codes generated by both models. Running the Mini model’s code works fine. Running the E model’s code also works fine, so both passed.
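This task mainly tests whether the model knows the Gregorian leap-year rule; a correct answer looks something like this sketch (function names are mine, and a real script would read the start year with `input()`):

```python
def is_leap(year):
    # Gregorian rule: divisible by 4, except centuries not divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def next_leap_years(start, count=5):
    """Return the next `count` leap years strictly after `start`."""
    years = []
    y = start + 1
    while len(years) < count:
        if is_leap(y):
            years.append(y)
        y += 1
    return years

print(next_leap_years(2024))  # [2028, 2032, 2036, 2040, 2044]
```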

The next question is: Generate SVG code for a butterfly. Let’s send the question and check the answers. Here is the code. First, let’s preview the E model’s code; it doesn’t look like a butterfly, so this is a failure. Now, let’s preview the Mini model’s code; it also doesn’t look like a butterfly, so this time both models failed.
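For comparison, even a hand-rolled answer to this task doesn’t need much: two pairs of mirrored wing ellipses, a body, and antennae. This is one hypothetical minimal take, generated from Python; the proportions and colors are arbitrary.

```python
def butterfly_svg():
    # Mirrored ellipses for wings, a rounded rect for the body, lines for antennae.
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 160">'
        '<ellipse cx="70" cy="60" rx="45" ry="35" fill="orange"/>'
        '<ellipse cx="130" cy="60" rx="45" ry="35" fill="orange"/>'
        '<ellipse cx="75" cy="110" rx="35" ry="25" fill="gold"/>'
        '<ellipse cx="125" cy="110" rx="35" ry="25" fill="gold"/>'
        '<rect x="96" y="40" width="8" height="90" rx="4" fill="black"/>'
        '<line x1="100" y1="40" x2="85" y2="20" stroke="black"/>'
        '<line x1="100" y1="40" x2="115" y2="20" stroke="black"/>'
        "</svg>"
    )

print(butterfly_svg())
```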

The next question is: Create a landing page for an AI company. The page should include four sections: title, banner, features, and contact us. Ensure the page looks stylish and modern. Let’s send the question and check the answers.

Here are the codes generated by both models. First, let’s preview the Mini model’s generated page; it looks like an outdated design, so this is a failure. Now, let’s preview the E model’s generated page; it looks good, so this is a pass.

The last question is: Write a Python script for the Game of Life that runs in the terminal. Let’s send the question and check the answers. Here are the codes generated by both models. Running the Mini model’s code works well. Running the E model’s code also looks good, so both passed.
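For readers who want to compare, here is one compact take on a terminal Game of Life, using a sparse set of live cells; the structure and function names are mine, not either model’s output. The test pattern is a “blinker,” which oscillates with period 2.

```python
import time
from collections import Counter

def step(grid):
    """One Game of Life generation on a set of live (row, col) cells."""
    counts = Counter(
        (r + dr, c + dc)
        for r, c in grid
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A cell is alive next generation with exactly 3 neighbors,
    # or with 2 neighbors if it is already alive.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in grid)}

def render(grid, rows=10, cols=20):
    return "\n".join(
        "".join("#" if (r, c) in grid else "." for c in range(cols))
        for r in range(rows)
    )

if __name__ == "__main__":
    live = {(4, 9), (4, 10), (4, 11)}  # horizontal blinker
    for gen in range(5):
        print(f"--- generation {gen} ---")
        print(render(live))
        live = step(live)
        time.sleep(0.2)
```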

Here are the final results: as you can see, the E model only failed on three questions, while the Mini model failed on nine questions.

However, considering that the Mini model is 16 times smaller than the E model, such performance is acceptable, while the E model’s performance is exceptional, almost on par with high-end models, which is truly remarkable.

Moreover, it has only 6.6B active parameters, making it suitable for local execution, which is also fantastic. I believe it will soon gain support in Ollama.

What are the key features of Microsoft Phi-3.5 models?

The Microsoft Phi-3.5 models include three main variants: Phi-3.5-mini, Phi-3.5-MoE, and Phi-3.5-vision. Phi-3.5-mini is a compact model with 3.8 billion parameters designed for high-efficiency reasoning and code generation in resource-constrained environments. Phi-3.5-MoE uses a Mixture of Experts architecture with 16 experts and 6.6B active parameters, providing high performance, reduced latency, multilingual support, and robust safety measures. Phi-3.5-vision is a multimodal model that excels in tasks requiring the interpretation of both visual and textual data.

How does Phi-3.5 compare to other AI models like Llama-3.1?

Phi-3.5 outperforms Llama-3.1 in several benchmarks, particularly in reasoning and math tasks. Its advanced architecture, including the Mixture of Experts model, allows it to efficiently handle complex tasks while maintaining a smaller footprint compared to larger models.

Can I use Phi-3.5 models for real-time applications?

Yes, the Phi-3.5 models are optimized for real-time applications. Their architecture, especially the MoE variant, allows for reduced latency and efficient processing, making them ideal for applications that require quick responses, such as chatbots or interactive tools.

What industries can benefit from Microsoft Phi-3.5 models?

Various industries can leverage Phi-3.5 models, including healthcare, finance, education, and technology. Their capabilities in reasoning, language understanding, and multimodal processing enable innovative solutions, from automated customer service to advanced data analysis.

Are the Phi-3.5 models available for public use?

Yes, Microsoft has made the Phi-3.5 models available under an open-source MIT license. This allows developers and researchers to access, modify, and build upon these models, fostering innovation and collaboration in the AI community.
