OpenAI Unveils o3 and o4-mini: Smarter, Faster, More Capable Models

Image Source: ChatGPT-4o
OpenAI has introduced two new models—o3 and o4-mini—representing a major leap forward in the company’s o-series reasoning models. These releases mark a convergence between large-scale intelligence and real-world utility, offering a new generation of models that not only understand more but also do more.
“These are the smartest models we’ve released to date,” OpenAI said, describing the release as “a step change in ChatGPT’s capabilities for everyone.”
Smarter, Faster, More Capable
At the heart of this update is agentic tool use. For the first time, both o3 and o4-mini are trained not just to use tools, but to understand why, when, and in what order to use them. This enables them to respond to intricate prompts with more detailed, multimodal outputs—often in less than a minute.
With full tool access, the models can now:
Search the web for real-time information
Run code and analyze uploaded files with Python
Interpret and reason about images, diagrams, and charts
Generate new visuals through image synthesis
Chain these abilities together to complete complex workflows
The models are also trained to choose the right tool and format for their answers—such as tables, plots, citations, or graphs—improving clarity and utility.
This capability brings the o-series models closer to OpenAI’s vision of agentic AI: systems that can independently execute multi-step tasks and make intelligent decisions along the way.
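For developers, this kind of agentic tool selection is exposed through ordinary tool definitions. The snippet below is a minimal, hypothetical sketch using the OpenAI Python SDK’s Chat Completions function-calling interface; the tool name (run_python), its schema, and the prompt are illustrative assumptions, not part of OpenAI’s announcement.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical tool definition: the model decides whether, when, and how to call it.
# The tool name, schema, and prompt below are illustrative only.
tools = [{
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute a short Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "What is the 20th Fibonacci number? Use a tool if it helps."}],
    tools=tools,
)

# If the model chose to call the tool, its arguments arrive as a JSON string.
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(json.loads(tool_calls[0].function.arguments)["code"])
```

In a complete loop, the application would execute the returned snippet, append the result as a tool message, and let the model decide whether to call another tool or produce a final answer.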
The Power of o3
OpenAI’s new flagship model, o3, sets a new standard across core domains such as math, science, coding, business, and visual reasoning. It surpasses previous models by a wide margin on industry-standard benchmarks:
Codeforces Elo: 2706
SWE-bench Verified: 69.1% (whole solution accuracy)
MMMU (multimodal college-level questions): 82.9%
MathVista (visual math reasoning): 86.8%
CharXiv-Reasoning (scientific figures): 78.6%
Humanity's Last Exam (Expert-Level Questions Across Subjects - without tools): 20.32%
Humanity's Last Exam (Expert-Level Questions Across Subjects - with Python & Browsing Tools): 24.90%
Compared to OpenAI o1, o3 reduces major real-world task errors by 20%. It performs especially well in:
Visual analysis (images, charts, and scientific figures)
Hypothesis generation and evaluation (biology, math, and engineering)
Business/Consulting case analysis and creative ideation
Programming
Early testers described o3 as a “thought partner” capable of exploring new ideas, modeling uncertainty, and presenting alternative solutions with greater precision—particularly within biology, math, and engineering contexts.
o4-mini: Big Brains, Smaller Footprint
While o4-mini is optimized for speed and efficiency, it still delivers strong performance across reasoning tasks—particularly in math, coding, and visual analysis—making it a compelling option for high-throughput use cases.
Notable performance metrics for o4-mini:
AIME 2025 competition math benchmark (without tools): 92.7%
GPQA (PhD-Level Science Questions - without tools): 81.4%
Codeforces (Competitive Programming - with terminal): 2719 Elo
MathVista (Visual Math Reasoning): 84.3%
SWE-bench Verified (Software Engineering): 68.1%
Humanity's Last Exam (Expert-Level Questions Across Subjects - without tools): 14.28%
Humanity's Last Exam (Expert-Level Questions Across Subjects - with Python & Browsing Tools): 17.70%
Tool-Assisted Excellence in Math Benchmarks
Both OpenAI o4-mini and o3 demonstrate remarkable performance on the AIME 2025 competition math benchmark when given access to a Python interpreter—allowing them to run calculations, verify solutions, and reason through complex symbolic math programmatically.
o4-mini achieves:
99.5% pass@1 — solving nearly every problem correctly on the first attempt
100% consensus@8 — the majority answer across eight samples is correct on every problem
o3 shows similarly strong gains with tools:
98.4% pass@1
100% consensus@8
Why this matters: These results highlight the models’ ability to reason strategically with tools, not just solve math problems from memory. The models determine when to invoke Python, how to structure their logic, and how to format precise outputs—often in under a minute.
Important context: These tool-assisted results are not directly comparable to benchmarks run without tools. They reflect a different kind of intelligence—agentic reasoning—where the model acts as a capable problem-solver, combining code and logic to arrive at answers.
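For readers unfamiliar with these metrics, here is a minimal sketch of how pass@1 and consensus@k are typically computed; the sampled answers below are hypothetical, and OpenAI’s actual evaluation harness may differ from this simplified version.

```python
from collections import Counter

def pass_at_1(samples: list[str], reference: str) -> float:
    """Estimate pass@1 as the fraction of individual samples that are correct."""
    return sum(s == reference for s in samples) / len(samples)

def consensus_at_k(samples: list[str], reference: str, k: int = 8) -> bool:
    """consensus@k: take the majority answer across k samples and check it."""
    majority, _ = Counter(samples[:k]).most_common(1)[0]
    return majority == reference

# Hypothetical sampled answers to one AIME-style problem.
answers = ["204", "204", "204", "198", "204", "204", "204", "204"]
print(pass_at_1(answers, "204"))       # 0.875 — per-sample accuracy
print(consensus_at_k(answers, "204"))  # True  — the majority vote is correct
```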
This ability to combine symbolic reasoning with tool use is one of the clearest signs that OpenAI’s new models are evolving from static assistants into interactive agents capable of solving real-world, multi-step problems.
Like o3, o4-mini makes effective use of Python and browsing tools—particularly for technical tasks. Despite its efficiency-first design, it also outperforms o3-mini on both STEM and non-STEM benchmarks, making it an ideal choice for high-volume reasoning workloads.
External experts found both models to be significantly stronger at following instructions and generating useful, verifiable responses. These improvements stem from higher reasoning ability and the integration of web sources. Compared to previous generations, o3 and o4-mini also offer a more natural, conversational experience—drawing on memory and past interactions to make replies feel more personalized and context-aware.
“Thinking with Images”
One of the most profound upgrades is how these models reason visually. Users can upload photos, whiteboard sketches, or diagrams—and the model will not only “see” the content but use it as part of its reasoning chain.
This means:
Interpreting blurry, reversed, or low-resolution images
Zooming, rotating, or transforming images as part of problem-solving
Solving multimodal questions previously out of reach
This visual intelligence contributes to their top-tier performance on benchmarks like MathVista, MMMU, and CharXiv-Reasoning.
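As an illustration of how image inputs reach the model programmatically, the sketch below uses the OpenAI Python SDK’s image-input format for Chat Completions; the image URL and prompt are placeholders, and exact parameter support for o-series models may vary by account and SDK version.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical example: ask the model to reason over an uploaded diagram.
# The image URL and prompt are placeholders.
response = client.chat.completions.create(
    model="o4-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this whiteboard sketch imply about the system's bottleneck?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/whiteboard-sketch.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```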
Deep Reinforcement Learning at Scale
The o3 model was trained with a radically expanded reinforcement learning (RL) process, scaling inference-time reasoning and decision-making to new levels. OpenAI reports that model performance continues to improve the more the system is allowed to “think,” echoing earlier scaling trends observed in the GPT series.
Both o3 and o4-mini were also trained via RL to reason about tool use—not just how to use tools, but when to use them to achieve a desired outcome. This significantly boosts their ability to navigate open-ended problems and generate high-quality, context-appropriate outputs.
Expanded Safety Measures
With greater power comes greater scrutiny. OpenAI rebuilt its safety training dataset for these models, targeting high-risk areas like:
Biological threats and biorisk
Malware and cybersecurity abuse
Jailbreaks and prompt manipulation
A new reasoning-based safety monitor, trained on human-interpretable specifications, flagged ~99% of high-risk prompts during internal red-teaming. According to OpenAI’s Preparedness Framework, both models remain below the “High” capability threshold in biological & chemical, cybersecurity, and AI self-improvement risk categories. Full evaluation results are available in the accompanying system card.
Codex CLI: Terminal-Based Coding Agent
As part of this release, OpenAI introduced Codex CLI, a lightweight coding agent that runs from the terminal and is powered by o3 and o4-mini. It serves as a bridge between OpenAI’s models and users’ local computing environments. This open-source tool allows users to:
Interact with models from their local terminal
Run code with contextual understanding
Reason about sketches, screenshots, or code snippets
Automate workflows with file access
Codex CLI is fully open-source and available now on GitHub. OpenAI has also launched a $1M grant program to support developers using it, with awards of up to $25,000 in API credits; proposals can be submitted through OpenAI.
Availability
Starting today:
ChatGPT Plus, Pro, and Team users can access o3, o4-mini, and o4-mini-high. These models replace o1, o3-mini, and o3-mini-high in the model selector.
Enterprise and Edu customers gain access next week.
Free-tier users can try o4-mini by selecting “Think” in the ChatGPT composer before submitting their request.
For developers:
o3 and o4-mini are live on the Chat Completions and Responses APIs. (Please note: Some developers may be required to complete organization verification to gain access to these models.)
The Responses API adds support for reasoning summaries, preserves reasoning tokens around function calls for improved performance, and will soon integrate built-in tools like web search, file search, and code interpreter directly into the model’s reasoning process.
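As a rough illustration, the sketch below calls the Responses API with a reasoning-summary request via the OpenAI Python SDK; the model name, reasoning settings, and prompt are assumptions and may differ by SDK version and account access.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set and the org is verified for o-series access

# Hypothetical example of requesting a reasoning summary from the Responses API.
# The reasoning settings and model name are assumptions, not confirmed defaults.
response = client.responses.create(
    model="o4-mini",
    input="Compare o3 and o4-mini for a high-throughput code-review workload.",
    reasoning={"effort": "medium", "summary": "auto"},
)
print(response.output_text)
```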
What’s Next
OpenAI plans to release o3-pro in the coming weeks with full tool support for Pro users. This aligns with the broader strategy to unify the reasoning strength of the o-series with the natural, conversational fluency of GPT-series models.
Future models will increasingly blend:
Autonomous tool use
Advanced reasoning
Real-time, multimodal input/output
Memory-based personalization
Seamless conversation flow
OpenAI describes today's models as faster, smarter, and more capable than ever—but hints that future models will act more like true collaborators, blending reasoning, memory, and tool use into natural, flowing conversations.
What This Means
With the release of OpenAI o3 and o4-mini, the line between conversational AI and capable, autonomous agents is beginning to blur. These models are not just incrementally better—they represent a fundamental leap in how AI reasons, interacts with tools, and adapts to complex, real-world tasks. For the first time, a single model can decide when and how to combine web search, code execution, image understanding, and data analysis—solving multifaceted problems in under a minute with precision that rivals domain experts.
This marks a shift in how AI will be used across fields: not just answering questions, but serving as a true partner in workflows that span science, education, business strategy, software development, and more. The reinforced ability to "think with images," chain tools together, and self-reflect on reasoning allows these systems to go beyond previous limitations—offering responses that are not just useful, but verifiable, nuanced, and often creative.
For developers, this opens the door to more autonomous, intelligent applications with higher throughput and stronger instruction-following. For everyday users, it brings a more natural, intuitive assistant that feels both faster and more thoughtful. And for researchers, it offers an increasingly powerful lens through which to study reasoning itself.
The next wave of AI won’t just respond—it will collaborate, learn, and build with you.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.