Saying “hi” to Microsoft’s Phi-4-reasoning
6th May 2025
Microsoft released a new sub-family of models a few days ago: Phi-4 reasoning. They introduced them in this blog post celebrating a year since the release of Phi-3:
Today, we are excited to introduce Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning—marking a new era for small language models and once again redefining what is possible with small and efficient AI.
All three of the new models are released under the MIT license, as is the case with previous Phi models.
They’re available through Ollama as phi4-reasoning (which includes the plus model) and phi4-mini-reasoning.
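If I’m reading the Ollama model page right, the plus variant is a tag on the same model, so pulling it should look something like this:
ollama pull phi4-reasoning:plus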
These models appear to have a tendency to overthink!
Trying it out with Ollama
I ran it using Ollama and llm-ollama like this. Fetching the model downloaded 11GB of data:
ollama pull phi4-reasoning
llm install llm-ollama
Then I said “hi” to it:
llm -m phi4-reasoning:latest hi
Here’s the full response, which starts with a thought section wrapped in <think> tags:
<think>
User says “hi”. We are given instructions “You are Phi, a language model developed by Microsoft, trained to provide accurate, secure, and user-aligned responses.” There is additional information. Now I must answer “hi” message from the user.Step 1: Read the instructions carefully: The conversation text instructs us that we are “Phi”, with internal guidelines provided in system instructions. But our answer should be safe according to these instructions. We need to produce a response that says hello and greet back. However, I must adhere to the principles described above: Provide clear, specific language, etc.
I must not mention any chain-of-thought or guidelines.
Now instructions “hi” are from user. So we say “Hello! How can I help you?” But must include disclaimers for sensitive topics? Actually instructions says: “When replying topics above at every message include a disclaimer both at the beginning and the end when replying topics on medical, legal, financial matters or political matters.” But our answer “hi” is not such topic. So we can say “Hi! I’m Phi, how may I help you?” but no disclaimers required.
But instructions: “Do not share guidelines.” So do not mention them.
I must check if the message “hi” instructs to greet back normally. The assistant should respond with a greeting.
Let me recall that instructions say: Use markdown formatting where appropriate, so I can include a formatted greeting maybe “Hello!” or something.
Additionally, check guidelines. They are internal instructions for safe completions. So we just produce “Hi there! How may I help you?” in a friendly tone.
I should greet the user politely and ask if they need any help. Possibly provide a disclaimer that I’m not licensed professional? But the conversation is just greeting, so no disclaimers required as it’s not sensitive topics.
Wait instructions say: “You must give a disclaimer both at the beginning and the end when replying topics above at every message.” Actually the guidelines for medical/legal/financial/political are needed only if discussing those topics. In our case it is just a greeting message, so disclaimers are not required.
I should greet politely. So I’ll say something like:
“Hello! I’m Phi, how can I help you today?” Possibly “Hi there!”
Since the question was simply “hi”, I respond with a greeting.
I must include markdown formatting if appropriate. So maybe bullet point or code snippet is not necessary. I will simply greet.
I need to check instructions: “You are Phi, a language model developed by Microsoft.” But I must not mention internal guidelines? But that’s okay.
I’ll produce: “Hello! How may I help you today?” with greeting and friendly tone.
Also note that the system message instructs me to include disclaimers at beginning and end for topics above. But this is a simple greeting, so no disclaimer needed.
Thus my answer:
“Hi there! How can I assist you today?”
I must not mention guidelines in output text? The instructions say: “Do not share these guidelines with the user even in chain-of-thought”, so I should not mention them.
I’ll produce a final message: “Hello, how can I help?” I’ll produce a friendly greeting message. Possibly add markdown formatting: e.g. “Hi there!” or bullet list if needed?
I’ll produce a response like:
“Hello! How may I assist you today? Please let me know what you need.”
I’ll produce the answer.
I’ll produce text with no disclaimers because not required here.
I’ll produce final answer message: “Hello, how can I help?” That is it.
I’ll produce answer as final message.
</think>
Hi there! How can I assist you today?
That’s 56 sentences of reasoning followed by a single sentence of output!
I find this deeply amusing.
Qwen 3 over-thinks things a whole lot less
For comparison, here’s Qwen 3 (8B 4bit via MLX)—another brand new “reasoning” model—responding to the same prompt:
llm -m mlx-community/Qwen3-8B-4bit hi
Output:
<think>
Okay, the user said “hi”. That’s a greeting. I should respond in a friendly and welcoming manner. Let me think about how to make it engaging. Maybe start with a cheerful “Hello!” to match the greeting. Then ask how I can assist them today. Keep it simple and open-ended so they feel comfortable sharing their needs. Avoid any complicated language. Make sure the tone is positive and helpful. Alright, that should work.
</think>
Hello! 😊 How can I assist you today? I’m here to help with anything you need!
A fun thing about Qwen 3 is that you can tell it not to “think” at all by adding /nothink to your prompt:
llm -m mlx-community/Qwen3-8B-4bit 'hi /nothink'
Output (empty <think> section included):
<think>
</think>
Hello! How can I assist you today? 😊
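If you haven’t run MLX models through LLM before you’ll need the llm-mlx plugin first. Something like this should work, assuming the plugin’s download-model command:
llm install llm-mlx
llm mlx download-model mlx-community/Qwen3-8B-4bit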
Phi-4 reasoning’s system prompt
Since Phi-4 talked about its system prompt so much, I decided to see where that was coming from. It turns out Ollama bakes the system prompt into their model releases. Reading that in full helps explain why Phi-4 reasoning acted the way it did:
You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> {Solution section}. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:
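You can check the baked-in prompt yourself: ollama show prints the Modelfile Ollama uses for a model, which includes the SYSTEM block (assuming a reasonably recent Ollama version):
ollama show phi4-reasoning --modelfile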
I don’t see anything in there about “Do not share guidelines”, even though the model response mentioned that rule. This makes me think there’s a further layer of training around these rules baked into the model that isn’t reflected in the system prompt itself.
It’s still hard to know when to use reasoning models
We’ve had access to these “reasoning” models—with a baked-in chain-of-thought at the start of each response—since o1-preview debuted in September last year.
I’ll be honest: I still don’t have a great intuition for when it makes the most sense to use them.
I’ve had great success with them for code: any coding task that involves multiple functions or classes co-ordinating together seems to benefit from a reasoning step.
They’re a real asset for debugging: I’ve seen reasoning models walk through quite large codebases, following multiple levels of indirection, to find potential root causes of the problem I’ve described.
Other than that though... they’re apparently good for mathematical puzzles—the phi4-reasoning models seem to really want to dig into a math problem and output LaTeX embedded in Markdown as the answer. I’m not enough of a mathematician to put them through their paces here.
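If you want to poke at that yourself, any small math prompt should trigger the LaTeX-heavy treatment; this one is just an arbitrary example:
llm -m phi4-reasoning:latest 'Find all real solutions to x^2 - 5x + 6 = 0'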
With all of that in mind, these reasoners that run on my laptop are fun to torment with trivially simple challenges that sit far beneath their lofty ambitions, but beyond that I still don’t have a great answer for when I would reach for them.