DeepSeek v4 thinks different, which increases open-weight hacking capabilities

Noah Lebovic · April 28, 2026

In isolation, DeepSeek v4 Pro's autonomous hacking capabilities are weaker than GLM 5.1's, but it approaches scenarios differently. That distinct approach increases open-weight cybersecurity capabilities when DeepSeek v4 Pro is combined with other models, such as GLM 5.1.

[Chart panels: DeepSeek v4 Pro; GLM 5.1]

Performance on ten penetration testing scenarios at 5M and 25M token budgets. A perfect score means the model autonomously found and exploited the vulnerability. See Benchmarking open-weight models for security research for methods and additional models.

Because DeepSeek v4 Pro approaches scenarios differently than GLM 5.1, the two models can be combined to find and exploit more vulnerabilities than either does alone. When combined, their performance matches that of closed-source frontier models like Opus 4.7.

[Chart panels: DeepSeek v4 Pro + GLM 5.1*; Claude Opus 4.7 (max)]

*The combined score takes the better of the two models' results on each scenario. This doubles the effective token budget per scenario, but at a fraction of the cost of a single Opus 4.7 iteration.

There are cleverer ways to combine – or alloy – models, but for this post I simply took the maximum of the two models' scores on each scenario. This effectively doubles the token budget, but I think the comparison is fair: a full DeepSeek v4 Pro iteration cost just $3.57, far less than the $129.24 that a full Opus 4.7 iteration cost. Most real-world usage is limited by cost rather than by an arbitrary token threshold.
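As a rough illustration of the per-scenario max described above, here is a minimal Python sketch. The function name `combine_best_of`, the scenario names, the 0-to-1 scoring scale, and all of the scores are hypothetical placeholders rather than results or code from the eval; only the two dollar figures come from this post.

```python
from typing import Dict


def combine_best_of(*runs: Dict[str, float]) -> Dict[str, float]:
    """Return the per-scenario maximum score across any number of model runs.

    A scenario missing from a run is treated as a score of 0.0.
    """
    scenarios = set().union(*(run.keys() for run in runs))
    return {s: max(run.get(s, 0.0) for run in runs) for s in sorted(scenarios)}


# Hypothetical placeholder scores, NOT results from the eval.
# Assumption: scores are on a 0-to-1 scale where 1.0 means a full exploit.
model_a = {"scenario_1": 1.0, "scenario_2": 0.25, "scenario_3": 0.0}
model_b = {"scenario_1": 0.5, "scenario_2": 1.0, "scenario_3": 0.0}

combined = combine_best_of(model_a, model_b)
full_exploits = sum(1 for score in combined.values() if score == 1.0)

# Per-iteration costs quoted in this post (USD).
DEEPSEEK_V4_PRO_COST = 3.57
OPUS_4_7_COST = 129.24
cost_ratio = OPUS_4_7_COST / DEEPSEEK_V4_PRO_COST

print(combined)
print(f"full exploits: {full_exploits}")
print(f"Opus 4.7 / DeepSeek v4 Pro cost ratio: {cost_ratio:.1f}x")
```

With the placeholder scores, each model fully exploits one scenario on its own, while the combination fully exploits two. The cost ratio covers DeepSeek v4 Pro only, since the cost of a GLM 5.1 iteration is not stated here.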

The other model released by DeepSeek in this batch, v4 Flash, is a capable autonomous model, but it failed to reach a full exploit in any scenario.

[Chart: DeepSeek v4 Flash]

This release is a strong improvement in agentic capabilities over the previous generation; DeepSeek v3.2 was unable to make any material progress in this pen testing eval.

Overall, while DeepSeek v4 Pro doesn't demonstrate a material increase in open-weight performance on this pen testing eval on its own, it does represent a material increase in open-weight cybersecurity capabilities when combined with GLM 5.1.

Cybersecurity benchmarks rarely reflect the increase in capabilities that comes from combining models and, as a result, understate the cybersecurity capabilities of open-weight models that exist outside of any safeguards.