Introduction
Microsoft's Phi models have been at the forefront of open-source Large Language Models. The Phi architecture has inspired many of the popular small open-source models we see today, including Phixtral, Phi-DPO, and others. With the Phi family, Microsoft pushed the LLM landscape a step forward by introducing Small Language Models, arguing that these are sufficient for many tasks. Now Microsoft has unveiled Phi 3, the next generation of Phi models, which improves further on the previous generation. In this article, we will walk through Phi 3 and test it with different prompts.
Learning Objectives
- Understand the advancements in the Phi 3 model compared to previous iterations.
- Learn about the different variants of the Phi 3 model.
- Explore the improvements in context length and performance achieved by Phi 3.
- Recognize the benchmarks where Phi 3 surpasses other popular language models.
- Understand how to download, initialize, and use the Phi 3 mini model.
This article was published as a part of the Data Science Blogathon.
Phi 3 – The Next Iteration of the Phi Family
Recently, Microsoft released Phi 3, showcasing its continued commitment to open-source AI. Microsoft has released two variants of Phi 3: one with a 4k context size and the other with a 128k context size. Both share the same architecture and have 3.8 Billion parameters, hence the name Phi 3 Mini. Microsoft has also announced two larger variants, a 7 Billion parameter version called Phi 3 Small and a 14 Billion parameter version called Phi 3 Medium, though these are still in training. All Phi 3 models come in an instruct version and are thus ready to be deployed in chat applications.
Unique Features
- Extended Context Length: Phi 3 extends the context length from 2k in Phi 2 to up to 128k, facilitated by the LongRope technique, with the default context length doubled to 4k.
- Training Data Size and Quality: Phi 3 is trained on 3.3 Trillion tokens, featuring larger and more advanced datasets compared to Phi 2.
- Model Variants:
- Phi 3 Mini: Trained on 3.3 Trillion tokens, with a 32k vocabulary size, using the Llama 2 tokenizer.
- Phi 3 Small (7B Version): Default context length of 8k, a 100k vocabulary size using the tiktoken tokenizer, and Grouped Query Attention with 4 queries sharing 1 key to reduce the memory footprint.
- Model Architecture and Training: Incorporates Grouped Query Attention to optimize memory usage; training starts with pretraining, followed by supervised fine-tuning and alignment with Direct Preference Optimization for responsible outputs. A toy sketch of grouped-query attention follows this list.
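To make the Grouped Query Attention idea concrete, below is a minimal, self-contained PyTorch sketch. The head counts here (8 query heads sharing 2 key/value heads) are illustrative only and do not reflect Phi 3's actual configuration; the point is simply that several query heads reuse one key/value head, which shrinks the key/value cache.

```python
import torch
import torch.nn.functional as F

# Toy grouped-query attention: 8 query heads share 2 key/value heads,
# i.e. groups of 4 query heads reuse the same key/value projections.
batch, seq_len, head_dim = 1, 16, 64
n_q_heads, n_kv_heads = 8, 2
group = n_q_heads // n_kv_heads  # query heads per shared KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Repeat each KV head so every query head in a group attends to the same K/V.
k = k.repeat_interleave(group, dim=1)  # (batch, n_q_heads, seq_len, head_dim)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```

Because only `n_kv_heads` key/value tensors need to be stored, the KV cache in this toy example is a quarter of the size it would be with standard multi-head attention.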
Benchmarks – Phi 3
Coming to the benchmarks, the Phi 3 Mini, i.e. the 3.8 Billion parameter model, has overtaken Google's Gemma 7B. It scores 68.8 on MMLU and 76.7 on HellaSwag, exceeding Gemma, which scores 63.6 on MMLU and 49.8 on HellaSwag, and even the Mistral 7B model, which scores 61.7 on MMLU and 58.5 on HellaSwag. Phi 3 has even surpassed the recently released Llama 3 8B model on both of these benchmarks.
It also surpasses these and other models on other popular evaluation suites such as WinoGrande, TruthfulQA, and HumanEval. In the table below, we can compare the scores of the Phi 3 family of models with those of other popular open-source large language models.

Getting Started with Phi 3
To get started with Phi 3, we need to follow a few steps. Let us dive deeper into each one.
Step 1: Downloading Libraries
Let’s start by downloading the following libraries.
```python
!pip install -q transformers huggingface_hub bitsandbytes accelerate
```
- transformers – We need this library to download the Large Language Model and work with it
- huggingface_hub – This provides the huggingface-cli tool, which lets us log in to Hugging Face and work with the official Hugging Face models
- bitsandbytes – We cannot run the full-precision 3.8 Billion parameter model on the free GPU instance of Colab, so we need this library to quantize the LLM to 4-bit
- accelerate – We need this to speed up GPU inference for Large Language Models
Now, before we start downloading the model, we need to define our quantization config. This is because we cannot load the entire full-precision model within the free Google Colab GPU, and even if it did fit, inference would be slow. So we will quantize the model to 4-bit precision and then work with it.
Step 2: Defining the Quantization Config
The configuration for this quantization can be seen below:
```python
import torch
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
```
- Here we start by importing the torch and the BitsAndBytesConfig from the transformers library.
- Then we create an instance of this BitsAndBytesConfig class and save it to the variable called config
- While creating this instance, we give it the following parameters.
- load_in_4bit: This specifies that we want to quantize the model to 4-bit precision, which greatly reduces its size.
- bnb_4bit_quant_type: This specifies the type of 4-bit quantization to use. Here we choose the normal float format (nf4), which has been shown to give better results.
- bnb_4bit_use_double_quant: Setting this to True quantizes the quantization constants that are internal to bitsandbytes, further reducing the model size.
- bnb_4bit_compute_dtype: This sets the datatype used when computing the forward pass through the model. On Colab, we can set it to brain float 16 (bfloat16), which tends to provide better results than regular float16.
Running this code will create our quantization configuration.
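As a rough back-of-the-envelope check of what 4-bit quantization buys us (an approximation, not an exact figure): the 3.8 Billion parameter Phi 3 Mini needs about 3.8B × 2 bytes ≈ 7.6 GB in 16-bit precision, but only about 3.8B × 0.5 bytes ≈ 1.9 GB at 4 bits, plus a small overhead for the quantization constants and activations. This leaves far more headroom on a free Colab GPU.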
Step 3: Downloading the Model
Now we are ready to download the model and quantize it using the configuration defined above. The code for this is as follows:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    quantization_config=config
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
```
Explanation:
– Import the necessary modules from the transformers library.
– Initialize the Phi-3-mini model with specific configurations.
– Create a tokenizer for the model.
– This code downloads and quantizes the Phi 3 Mini 4k-context instruct model based on the provided configuration (an optional memory-footprint check follows below).
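If you want to confirm that quantization actually shrank the model, transformers exposes a `get_memory_footprint()` method on loaded models. This optional check reuses the `model` object loaded above:

```python
# Optional: report the approximate memory used by the quantized model.
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Model memory footprint: {footprint_gb:.2f} GB")
```

You should see a figure in the low single-digit gigabytes, consistent with the rough calculation earlier.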
Next, we test the Phi-3-mini model with specific messages using the following code snippet:
```python
messages = [
    {"role": "user", "content": "A clock shows 12:00 p.m. now. How many degrees will the minute hand move in 15 minutes?"},
    {"role": "assistant", "content": "The minute hand moves 360 degrees in one hour (60 minutes). Therefore, in 15 minutes, it will move (15/60) * 360 degrees = 90 degrees."},
    {"role": "user", "content": "How many degrees does the hour hand move in 15 minutes?"}
]

# add_generation_prompt=True appends the assistant turn marker so the
# model knows to continue with its own reply.
model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
output = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)
print(decoded_output[0])
```
Explanation:
– Provide a list of messages to the model for inference.
– Apply the chat template to format the messages properly.
– Generate responses based on the input messages.
– Decode the model output back into readable text.
Running this code will showcase the model’s ability to understand and respond to the given conversation context accurately.
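Note that calling `batch_decode` on the full `output` tensor returns the prompt together with the model's reply. If you only want to print the newly generated answer, a small optional sketch (reusing the `model_inputs` and `output` variables from above) is:

```python
# Keep only the tokens generated after the prompt, then decode them.
new_tokens = output[:, model_inputs.shape[-1]:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0])
```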
You can also test the model with another question using the code snippet below:
```python
messages = [
    {"role": "user", "content": "If a plane crashes on the border of the United States and Canada, where do they bury the survivors?"},
]

model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
output = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)
print(decoded_output[0])
```
This code segment demonstrates the model's ability to answer a wide range of questions convincingly, including tricky ones; its responses are detailed and show a logical thought process. Let's pose another challenging question and see how the generated response turns out.
```python
messages = [
    {"role": "user", "content": "How many smartphones can a human eat?"},
]

model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
output = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)
print(decoded_output[0])
```
In this scenario, we posed a tricky question to the Phi-3-mini model, asking how many smartphones a human can eat. This test evaluates the model's common-sense reasoning. Phi 3 correctly identified the query as nonsensical rather than taking it literally, showing that it understood the question. This indicates that Phi-3-mini has been trained on a high-quality dataset blending common sense, reasoning, and mathematical concepts.
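As a final note on usage: the `pipeline` helper imported earlier was never used. If you prefer the higher-level API, a minimal sketch looks like the following; it assumes the quantized `model` and `tokenizer` loaded above, and the generation settings are illustrative rather than officially recommended values.

```python
from transformers import pipeline

# Wrap the already-loaded quantized model and tokenizer in a text-generation pipeline.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [
    {"role": "user", "content": "How many degrees does the hour hand move in 15 minutes?"},
]

# Format the chat into a plain prompt string, then generate only the new text.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
result = pipe(prompt, max_new_tokens=500, do_sample=True, return_full_text=False)
print(result[0]["generated_text"])
```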
### Conclusion
Phi-3 represents a significant advancement over Microsoft’s previous Phi-2 model, with a substantially increased context length of up to 128k tokens and minimal performance impact. It is trained on a larger and more comprehensive dataset, leading to improved performance in various tasks compared to other models. Its proficiency in handling complex questions and incorporating common sense reasoning positions Phi-3 as a promising model for diverse applications.
#### Key Takeaways
– Phi 3 excels in addressing practical scenarios, effectively handling challenging and ambiguous questions.
– Model Variants: Phi 3 offers different versions such as Mini (3.8B), Small (7B), and Medium (14B), catering to various use cases.
– Phi 3 outperforms other open-source models in benchmarks like MMLU and HellaSwag.
– The context size of Phi 3 is doubled compared to Phi 2, reaching 4k, and further extended to 128k using the LongRope method with minimal performance degradation.
– Phi 3 is trained on 3.3 Trillion Tokens from curated datasets, undergoing supervised fine-tuning and alignment with Direct Preference Optimization.
### Frequently Asked Questions
1. **Q1. What kind of prompts can I use with Phi 3?**
A. Phi 3 models are optimized for specific chat template formats. It is advisable to use this format when interacting with the model by utilizing the apply_chat_template function.
2. **Q2. What is Phi 3 and its model variants?**
A. Phi 3 is Microsoft’s next-generation model series, including Phi 3 Mini (3.8B), Small (7B), and Medium (14B) parameter models.
3. **Q3. Can I access Phi 3 for free?**
A. Yes, Phi 3 models are accessible for free on the Hugging Face platform, with the Phi 3 Mini (3.8B) model currently available for commercial use under specific licensing terms.
4. **Q4. How does Phi 3 perform with tricky questions?**
A. Phi 3 demonstrates strong capabilities in common-sense reasoning, effectively handling tricky questions involving humor and logic.
5. **Q5. Are there any changes in tokenizers for the new Phi models?**
A. Yes, while the Phi 3 Mini uses the Llama 2 tokenizer with a vocabulary size of 32k, the Phi 3 Small model introduces a new tokenizer with a vocabulary size expanded to 100k tokens.