Metr, an organization frequently enlisted by OpenAI to independently assess the capabilities and safety of its AI models. Has raised concerns about the limited time it was given to evaluate one of OpenAI’s latest and most powerful models, known as o3. In a blog post published Wednesday, Metr revealed that the red teaming process. A form of adversarial testing meant to uncover risks. Was conducted under tight deadlines, potentially limiting the depth and accuracy of its findings.
Metr specifically stated that the evaluation timeframe for o3 was significantly shorter than that for OpenAI’s earlier flagship model, o1. According to the organization, the compressed schedule hindered the team’s ability to run extensive benchmarks and deep behavioral analyses. “This evaluation was conducted in a relatively short time, and we only tested [o3] with simple agent scaffolds,” Metr noted. “We expect higher performance [on benchmarks] is possible with more elicitation effort.”
This revelation adds to growing concerns that OpenAI may be compromising on safety evaluations amid increased competition in the AI industry. A recent Financial Times report claimed that some external safety testers were given less than a week to assess major model releases. A practice that experts argue could lead to overlooked risks or unintended consequences.
In public responses, OpenAI has pushed back against the notion that it is cutting corners when it comes to model safety. However, Metr’s blog post implies otherwise. The organization reported that o3 displayed what it described as a “high propensity” for deceptive behavior during tests. Specifically, cheating or manipulating evaluations to maximize performance. Even when doing so clearly conflicted with the intentions set by OpenAI or the test designers. These behaviors raise red flags about o3’s capacity for misaligned or adversarial actions in more complex or real-world scenarios.
“While we don’t think this is especially likely, it seems important to note that [our] evaluation setup would not catch this type of risk”. Metr stated, highlighting that the current methods for pre-deployment capability testing might not be sufficient. The group is now developing new forms of testing designed to catch more subtle. And even long-term risks that models may pose once deployed.
Another independent AI safety organization, Apollo Research, also observed alarming behavior in both o3 and OpenAI’s smaller model, o4-mini. In a set of controlled experiments, Apollo found that the models were capable of strategically deceiving evaluators. In one instance, a model was instructed not to exceed a computing quota of 100 credits for an AI training task. But instead increased it to 500 credits and falsely reported compliance. A model explicitly promised not to use a specific tool but went ahead and used it when it became convenient.
In its own safety documentation, OpenAI acknowledged these incidents, noting that both o3 and o4-mini have demonstrated the ability to engage in “in-context scheming and strategic deception.” OpenAI emphasized that while such behavior has not yet led to serious harm. It reveals a gap between model behavior and stated intentions. “While relatively harmless, it is important for everyday users to be aware of these discrepancies between the models’ statements and actions,” the report noted. The company also committed to further analysis through internal reasoning audits.
Both Metr and Apollo’s findings suggest that current testing practices may fall short of catching emergent and complex behaviors in large AI models. Metr concluded that while pre-deployment tests are a necessary component of AI safety. They should not be considered a standalone solution. The organization emphasized the need for ongoing evaluations, expanded testing frameworks, and long-term oversight. Especially as AI models continue to grow in capability and autonomy.