While companies like Google and Meta push to integrate AI into software development, a new study from Microsoft Research is throwing cold water on just how effective these models really are, especially when it comes to debugging code.
According to the research, even top-tier AI models from OpenAI and Anthropic fail to reliably resolve bugs in a widely used benchmark called SWE-bench Lite. Despite bold claims about AI’s growing role in programming (Google CEO Sundar Pichai said AI now writes 25% of the company’s new code), the study shows that human developers remain far better at catching and fixing errors.
Microsoft researchers tested nine large language models using a prompt-based agent setup. Each AI agent had access to common debugging tools like a Python debugger, yet most struggled to solve even half of the assigned 300 debugging tasks. Among the top performers, Anthropic’s Claude 3.7 Sonnet managed a success rate of just 48.4%. OpenAI’s o1 and o3-mini followed behind with 30.2% and 22.1%, respectively.
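The article doesn’t reproduce the study’s harness, but the general shape of a prompt-based debugging agent is simple to picture: the model is handed a failing task, a small set of tools, and a step budget, and it alternates between gathering evidence and proposing a fix. Below is a minimal sketch of that loop; the tool set, prompt format, and stubbed query_model function are hypothetical, and the actual setup also exposed an interactive Python debugger, which this sketch omits for brevity.

```python
# Minimal sketch of a prompt-based debugging agent loop, assuming a pytest-style
# task. Tool names, prompt wording, and the query_model stub are invented for
# illustration; they are not the study's actual harness.
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DebugTask:
    repo_path: str     # checkout of the buggy project
    failing_test: str  # pytest node id that currently fails


def run_failing_test(task: DebugTask) -> str:
    """Tool: rerun the failing test and capture its output for the model."""
    result = subprocess.run(
        ["python", "-m", "pytest", task.failing_test, "-x", "--tb=long"],
        cwd=task.repo_path, capture_output=True, text=True,
    )
    return result.stdout + result.stderr


def read_file(task: DebugTask, rel_path: str) -> str:
    """Tool: let the model inspect a source file mentioned in the traceback."""
    return (Path(task.repo_path) / rel_path).read_text()


def query_model(prompt: str) -> str:
    """Stub for the LLM call; a real agent would call Claude, o1, o3-mini, etc."""
    raise NotImplementedError("plug in a model client here")


def debug_agent(task: DebugTask, max_steps: int = 10) -> str | None:
    """Alternate between gathering evidence and asking the model for a patch."""
    observation = run_failing_test(task)
    for _ in range(max_steps):
        reply = query_model(
            "You are debugging a failing test.\n"
            f"Latest observation:\n{observation}\n"
            "Respond with one of:\n"
            "  READ <relative/path.py>   (inspect a file)\n"
            "  PATCH <unified diff>      (propose a fix)"
        )
        if reply.startswith("PATCH"):
            return reply.removeprefix("PATCH").strip()  # candidate fix to apply and re-test
        if reply.startswith("READ"):
            observation = read_file(task, reply.removeprefix("READ").strip())
        else:
            observation = run_failing_test(task)
    return None  # no fix found within the step budget
```

Success on a benchmark like SWE-bench Lite then comes down to whether the patch returned by such a loop makes the failing test pass without breaking the rest of the suite.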
The underwhelming results were attributed to two main issues: poor tool usage and limited training data. While some models couldn’t figure out which tools to use or how to use them effectively, the more pressing issue, according to the researchers, was the lack of real-world debugging data in model training sets.
In particular, the models seem to lack what’s known as “trajectory data”—recordings of step-by-step human debugging sessions. Without that kind of data, AI struggles with the sequential decision-making involved in solving complex bugs. The researchers believe that fine-tuning on this specific type of data could dramatically improve AI debugging performance.
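To make “trajectory data” concrete: one way to think of it is as an ordered record of a human’s debugging actions and what they observed at each step, which a model could later be fine-tuned to imitate. The example below is a rough illustration only; the schema and field names are invented here, not taken from the Microsoft study.

```python
# Hypothetical record of one human debugging session (a "trajectory").
# The structure is illustrative; the study does not publish a schema.
debugging_trajectory = {
    "bug_report": "TypeError: unsupported operand type(s) for +: 'int' and 'str'",
    "failing_test": "tests/test_totals.py::test_sum_prices",
    "steps": [
        {"action": "run_test",       "observation": "traceback points to totals.py:42"},
        {"action": "set_breakpoint", "target": "totals.py:42"},
        {"action": "inspect",        "expression": "price", "observation": "'19.99' (str)"},
        {"action": "hypothesize",    "note": "price parsed from CSV but never cast to float"},
        {"action": "edit",           "patch": "price = float(row['price'])"},
        {"action": "run_test",       "observation": "test_sum_prices passed"},
    ],
}

# Fine-tuning on many such sequences would expose a model to the kind of
# step-by-step decision-making the researchers say is missing from training data.
```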
This study isn’t an outlier. Recent evaluations of other AI coding tools, like Devin, have also revealed troubling limitations. Devin, for instance, completed only 3 out of 20 benchmark tests in a recent analysis. Even though AI excels at generating code snippets and automating simple tasks, it often falls short on deeper logic-based problem-solving, where understanding code context is key.
Still, despite these limitations, enthusiasm for AI in development hasn’t waned. Tools like GitHub Copilot, OpenAI Codex, and Replit’s Ghostwriter continue to gain traction. Yet experts warn that companies must proceed with caution.
Letting AI handle code autonomously—especially debugging—could lead to security flaws, unpredictable errors, and longer development cycles. As the Microsoft study suggests, relying on AI for debugging at this stage is risky without human oversight.
Prominent tech leaders, including Microsoft’s own co-founder Bill Gates, have weighed in. Gates recently said he doesn’t believe AI will replace programmers anytime soon. Others, like Replit CEO Amjad Masad and IBM CEO Arvind Krishna, share the sentiment: coding will evolve, but it’s not going away.
The bottom line? AI tools are powerful assistants, but they’re still learning how to debug like humans. Until then, developers will remain critical gatekeepers of quality and security in the coding process.