OpenAI Unveils o3 and o4-mini: Smarter, Faster, More Capable Models

Image Source: ChatGPT-4o
OpenAI has introduced two new models—o3 and o4-mini—representing a major leap forward in the company’s o-series reasoning models. These releases mark a convergence between large-scale intelligence and real-world utility, offering a new generation of models that not only understand more but also do more.
“These are the smartest models we’ve released to date,” OpenAI said, describing the release as “a step change in ChatGPT’s capabilities for everyone.”
Smarter, Faster, More Capable
At the heart of this update is agentic tool use. For the first time, both o3 and o4-mini are trained not just to use tools, but to understand why, when, and in what order to use them. This enables them to respond to intricate prompts with more detailed, multimodal outputs—often in less than a minute.
With full tool access, the models can now:
Search the web for real-time information
Run code and analyze uploaded files with Python
Interpret and reason about images, diagrams, and charts
Generate new visuals through image synthesis
Chain these abilities together to complete complex workflows
The models are also trained to choose the right tool and format for their answers—such as tables, plots, citations, or graphs—improving clarity and utility.
This capability brings the o-series models closer to OpenAI’s vision of agentic AI: systems that can independently execute multi-step tasks and make intelligent decisions along the way.
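For developers, this kind of agentic tool selection is exposed through ordinary tool definitions. The snippet below is a minimal, hypothetical sketch using the OpenAI Python SDK’s Chat Completions function-calling interface; the tool name (run_python), its schema, and the prompt are illustrative assumptions, not part of OpenAI’s announcement.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical tool definition: the model decides whether, when, and how to call it.
# The tool name, schema, and prompt below are illustrative only.
tools = [{
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute a short Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "What is the 20th Fibonacci number? Use a tool if it helps."}],
    tools=tools,
)

# If the model chose to call the tool, its arguments arrive as a JSON string.
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(json.loads(tool_calls[0].function.arguments)["code"])
```

In a complete loop, the application would execute the returned snippet, append the result as a tool message, and let the model decide whether to call another tool or produce a final answer.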
The Power of o3
OpenAI’s new flagship model, o3, sets a new standard across core domains such as math, science, coding, business, and visual reasoning. It surpasses previous models by a wide margin on industry-standard benchmarks:
Codeforces Elo: 2706
SWE-bench Verified: 69.1% (whole solution accuracy)
MMMU (multimodal college-level questions): 82.9%
MathVista (visual math reasoning): 86.8%
CharXiv-Reasoning (scientific figures): 78.6%
Humanity's Last Exam (Expert-Level Questions Across Subjects - without tools): 20.32%
Humanity's Last Exam (Expert-Level Questions Across Subjects - with Python & Browsing Tools): 24.90%
Compared to OpenAI o1, o3 reduces major real-world task errors by 20%. It performs especially well in:
Visual analysis (images, charts, and scientific figures)
Hypothesis generation and evaluation (biology, math, and engineering)
Business/Consulting case analysis and creative ideation
Programming
Early testers described o3 as a “thought partner” capable of exploring new ideas, modeling uncertainty, and presenting alternative solutions with greater precision—particularly within biology, math, and engineering contexts.
o4-mini: Big Brains, Smaller Footprint
While o4-mini is optimized for speed and efficiency, it still delivers strong performance across reasoning tasks—particularly in math, coding, and visual analysis—making it a compelling option for high-throughput use cases.
Notable performance metrics for o4-mini:
AIME 2025 competition math benchmark (without tools): 92.7%
GPQA (PhD-Level Science Questions - without tools): 81.4%
Codeforces (Competitive Programming - with terminal): 2719 Elo
MathVista (Visual Math Reasoning): 84.3%
SWE-bench Verified (Software Engineering): 68.1%
Humanity's Last Exam (Expert-Level Questions Across Subjects - without tools): 14.28%
Humanity's Last Exam (Expert-Level Questions Across Subjects - with Python & Browsing Tools): 17.70%
Tool-Assisted Excellence in Math Benchmarks
Both OpenAI o4-mini and o3 demonstrate remarkable performance on the AIME 2025 competition math benchmark when given access to a Python interpreter—allowing them to run calculations, verify solutions, and reason through complex symbolic math programmatically.
o4-mini achieves:
99.5% pass@1 — solving nearly every problem correctly on the first attempt
100% consensus@8 — the majority answer across eight samples is correct on every problem
o3 shows similarly strong gains with tools:
98.4% pass@1
100% consensus@8
Why this matters: These results highlight the models’ ability to reason strategically with tools, not just solve math problems from memory. The models determine when to invoke Python, how to structure their logic, and how to format precise outputs—often in under a minute.
Important context: These tool-assisted results are not directly comparable to benchmarks run without tools. They reflect a different kind of intelligence—agentic reasoning—where the model acts as a capable problem-solver, combining code and logic to arrive at answers.
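For readers unfamiliar with these metrics, here is a minimal sketch of how pass@1 and consensus@k are typically computed; the sampled answers below are hypothetical, and OpenAI’s actual evaluation harness may differ from this simplified version.

```python
from collections import Counter

def pass_at_1(samples: list[str], reference: str) -> float:
    """Estimate pass@1 as the fraction of individual samples that are correct."""
    return sum(s == reference for s in samples) / len(samples)

def consensus_at_k(samples: list[str], reference: str, k: int = 8) -> bool:
    """consensus@k: take the majority answer across k samples and check it."""
    majority, _ = Counter(samples[:k]).most_common(1)[0]
    return majority == reference

# Hypothetical sampled answers to one AIME-style problem.
answers = ["204", "204", "204", "198", "204", "204", "204", "204"]
print(pass_at_1(answers, "204"))       # 0.875 — per-sample accuracy
print(consensus_at_k(answers, "204"))  # True  — the majority vote is correct
```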
This ability to combine symbolic reasoning with tool use is one of the clearest signs that OpenAI’s new models are evolving from static assistants into interactive agents capable of solving real-world, multi-step problems.
Like o3, o4-mini makes effective use of Python and browsing tools—particularly for technical tasks. Despite its efficiency-first design, it also outperforms o3-mini on both STEM and non-STEM benchmarks, making it an ideal choice for high-volume reasoning workloads.
External experts found both models to be significantly stronger at following instructions and generating useful, verifiable responses. These improvements stem from higher reasoning ability and the integration of web sources. Compared to previous generations, o3 and o4-mini also offer a more natural, conversational experience—drawing on memory and past interactions to make replies feel more personalized and context-aware.
“Thinking with Images”
One of the most profound upgrades is how these models reason visually. Users can upload photos, whiteboard sketches, or diagrams—and the model will not only “see” the content but use it as part of its reasoning chain.
This means:
Interpreting blurry, reversed, or low-resolution images
Zooming, rotating, or transforming images as part of problem-solving
Solving multimodal questions previously out of reach
This visual intelligence contributes to their top-tier performance on benchmarks like MathVista, MMMU, and CharXiv-Reasoning.
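As an illustration of how image inputs reach the model programmatically, the sketch below uses the OpenAI Python SDK’s image-input format for Chat Completions; the image URL and prompt are placeholders, and exact parameter support for o-series models may vary by account and SDK version.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical example: ask the model to reason over an uploaded diagram.
# The image URL and prompt are placeholders.
response = client.chat.completions.create(
    model="o4-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this whiteboard sketch imply about the system's bottleneck?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/whiteboard-sketch.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```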
Deep Reinforcement Learning at Scale
The o3 model was trained with a radically expanded reinforcement learning (RL) process, scaling inference-time reasoning and decision-making to new levels. OpenAI reports that model performance continues to improve the more the system is allowed to “think,” echoing earlier scaling trends observed in the GPT series.
Both o3 and o4-mini were also trained via RL to reason about tool use—not just how to use tools, but when to use them to achieve a desired outcome. This significantly boosts their ability to navigate open-ended problems and generate high-quality, context-appropriate outputs.
Expanded Safety Measures
With greater power comes greater scrutiny. OpenAI rebuilt its safety training dataset for these models, targeting high-risk areas like:
Biological threats and biorisk
Malware and cybersecurity abuse
Jailbreaks and prompt manipulation
A new reasoning-based safety monitor, trained on human-interpretable specifications, flagged ~99% of high-risk prompts during internal red-teaming. According to OpenAI’s Preparedness Framework, both models remain below the “High” capability threshold in biological & chemical, cybersecurity, and AI self-improvement risk categories. Full evaluation results are available in the accompanying system card.
Codex CLI: Terminal-Based Coding Agent
As part of this release, OpenAI introduced Codex CLI, a lightweight coding agent that runs from the terminal and is powered by o3 and o4-mini. It serves as a bridge between OpenAI’s models and users’ local computing environments. This open-source tool allows users to:
Interact with models from their local terminal
Run code with contextual understanding
Reason about sketches, screenshots, or code snippets
Automate workflows with file access
Codex CLI is fully open-source and available now on GitHub. OpenAI has also launched a $1M grant program to support developers using it, with awards of up to $25,000 in API credits; proposals can be submitted through OpenAI.
Availability
Starting today:
ChatGPT Plus, Pro, and Team users can access o3, o4-mini, and o4-mini-high. These models replace o1, o3-mini, and o3-mini-high in the model selector.
Enterprise and Edu customers gain access next week.
Free-tier users can try o4-mini by selecting “Think” in the ChatGPT composer before submitting their request.
For developers:
o3 and o4-mini are live on the Chat Completions and Responses APIs. (Please note: Some developers may be required to complete organization verification to gain access to these models.)
The Responses API adds support for reasoning summaries, preserves reasoning tokens around function calls for improved performance, and will soon integrate built-in tools like web search, file search, and code interpreter directly into the model’s reasoning process.
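As a rough illustration, the sketch below calls the Responses API with a reasoning-summary request via the OpenAI Python SDK; the model name, reasoning settings, and prompt are assumptions and may differ by SDK version and account access.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set and the org is verified for o-series access

# Hypothetical example of requesting a reasoning summary from the Responses API.
# The reasoning settings and model name are assumptions, not confirmed defaults.
response = client.responses.create(
    model="o4-mini",
    input="Compare o3 and o4-mini for a high-throughput code-review workload.",
    reasoning={"effort": "medium", "summary": "auto"},
)
print(response.output_text)
```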
What’s Next
OpenAI plans to release o3-pro in the coming weeks with full tool support for Pro users. This aligns with the broader strategy to unify the reasoning strength of the o-series with the natural, conversational fluency of GPT-series models.
Future models will increasingly blend:
Autonomous tool use
Advanced reasoning
Real-time, multimodal input/output
Memory-based personalization
Seamless conversation flow
OpenAI describes today's models as faster, smarter, and more capable than ever—but hints that future models will act more like true collaborators, blending reasoning, memory, and tool use into natural, flowing conversations.
What This Means
With the release of OpenAI o3 and o4-mini, the line between conversational AI and capable, autonomous agents is beginning to blur. These models are not just incrementally better—they represent a fundamental leap in how AI reasons, interacts with tools, and adapts to complex, real-world tasks. For the first time, a single model can decide when and how to combine web search, code execution, image understanding, and data analysis—solving multifaceted problems in under a minute with precision that rivals domain experts.
This marks a shift in how AI will be used across fields: not just answering questions, but serving as a true partner in workflows that span science, education, business strategy, software development, and more. The reinforced ability to "think with images," chain tools together, and self-reflect on reasoning allows these systems to go beyond previous limitations—offering responses that are not just useful, but verifiable, nuanced, and often creative.
For developers, this opens the door to more autonomous, intelligent applications with higher throughput and stronger instruction-following. For everyday users, it brings a more natural, intuitive assistant that feels both faster and more thoughtful. And for researchers, it offers an increasingly powerful lens through which to study reasoning itself.
The next wave of AI won’t just respond—it will collaborate, learn, and build with you.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.