As the year wraps up, Claude 4 faces intense scrutiny over its performance on the SWE-bench coding benchmark. People across various forums express skepticism about whether the model can truly meet the high bar set by its competitors amid growing criticism of current benchmarks.
What's the buzz? Users are questioning both Claude 4's capabilities and the effectiveness of SWE-bench itself. One user emphasized, "Any benchmark over 80 is solved. Focus should be on creating new benchmarks that are 0% solved." This criticism highlights a perceived stagnation, raising doubts about the adequacy of current evaluations.
Notably, reports indicate that the real SWE-bench score might hover around 72.x%, affected by how completions are generated and selected during testing. Another comment clarified this technical aspect: "Parallel test-time compute means the model generates multiple completions for the same input using different sampling seeds."
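To make the "parallel test-time compute" point concrete, here is a minimal Python sketch of best-of-N sampling with a verifier, under stated assumptions: the function names (`generate_completion`, `passes_tests`, `best_of_n`) and the selection logic are illustrative, not the actual SWE-bench or Anthropic harness, which runs real model calls and repository test suites.

```python
import random
from typing import Optional

def generate_completion(prompt: str, seed: int) -> str:
    """Stand-in for a model call; a real harness would query the model API
    with temperature > 0 so that different seeds yield different completions."""
    random.seed(seed)
    return f"candidate-patch-{random.randint(0, 9999)}"

def passes_tests(candidate: str) -> bool:
    """Stand-in verifier; SWE-bench-style evaluation applies the candidate
    patch to the repository and runs its test suite to get pass/fail."""
    return hash(candidate) % 3 == 0  # placeholder outcome

def best_of_n(prompt: str, n: int = 8) -> Optional[str]:
    """Parallel test-time compute: sample n completions for the same input
    using different sampling seeds, then keep one that the verifier accepts."""
    candidates = [generate_completion(prompt, seed) for seed in range(n)]
    for candidate in candidates:
        if passes_tests(candidate):
            return candidate
    return None  # no candidate passed; the task counts as unsolved

if __name__ == "__main__":
    print(best_of_n("Fix the reported bug in the example repo", n=8))
```

In practice the n samples are dispatched concurrently (hence "parallel"), and how the final answer is selected, by tests, a reward model, or majority vote, is exactly what commenters say can move a headline score from the low 70s toward 80%.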
Engagement on forums has revealed a mix of sentiments regarding Claude 4:
Benchmark Limits: Many users echo concerns about benchmarks not reflecting real-world application, like one who stated, "The last 20% matters the most. That's where the model goes from 'wow, this is cool' to 'holy cow, I can let this agent work unsupervised.'"
Competitor Comparisons: There's skepticism that Claude codes significantly better than rivals such as Gemini 2.5. One user said, "I'm actually skeptical that Claude codes that much better than Gemini."
Future Performance: While some hope for a leap to higher benchmarks in the future, others remain cautious. Specifically, a user noted, "Agreed, we're a step closer, but not nearly at the finish line."
"SWE-bench verified is not representative of what a software engineer does," one commenter summarized, reflecting a widespread concern.
Experts predict that ongoing developments could push SWE-bench scores beyond 80% by 2026. Users hope that expanded context windows, such as larger token limits, will lead to better accuracy.
📊 A majority of comments argue that benchmarks above 80% may not capture true problem-solving abilities.
🤔 Some users remain doubtful about Claude 4's superiority over Gemini 2.5.
⏳ Future updates could bridge the gap and potentially reshape expectations around coding benchmarks.
In these discussions, it's evident that while AI is gaining ground, the real challenge lies in aligning these technologies with practical needs and user expectations. The true test remains: will AI models keep pace with the demands of software engineering?