New Benchmark Exposes Gap Between AI Coding Models
Datacurve's DeepSWE benchmark reveals GPT-5.5 leads at 70%, challenging the narrative that top coding models perform equally.
Datacurve has released a new coding benchmark that upends the cozy consensus in AI coding evaluations. DeepSWE puts models through 113 tasks across 91 open-source repositories spanning five programming languages. The result? GPT-5.5 sits on top with a 70% score.
That matters because existing benchmarks have painted a misleading picture for enterprise buyers, suggesting the leading models are basically interchangeable. DeepSWE says otherwise.
The benchmark draws its tasks from real-world open-source codebases rather than synthetic tests, which should give buyers a sharper view of actual model capability. A score spread that clearly separates contenders from pretenders is exactly what procurement teams need when millions of dollars are on the line.
OpenAI's GPT-5.5 claiming the top spot will raise eyebrows, and it will likely spark a fresh round of benchmark wars among rival labs.