Chinese models are sometimes better, even if they're distilled
Noah Lebovic · April 24, 2026
Many folks, including the White House and Anthropic, are accusing Chinese models of being distilled from American frontier labs' models. As part of these accusations, people often claim that distilled models are strictly inferior:
"Models developed from surreptitious, unauthorized distillation campaigns like this do not replicate the full performance of the original. They do, however, enable foreign actors to release products that appear to perform comparably on select benchmarks at a fraction of the cost."
– White House memorandum from April 23rd, 2026
This was true a few months ago, but it no longer consistently holds. GLM 5.1's cybersecurity capabilities are a good example of this.
GLM 5.1, released a few weeks ago, is alleged to be distilled from Opus 4.6. It outperformed Opus 4.6 on many public benchmarks – which is, admittedly, still in line with the White House memo. But it also outperformed Opus 4.6 on an internal cybersecurity evaluation that re-finds and exploits vulnerabilities we've previously discovered, including an account takeover at a major American bank.
[Chart: GLM 5.1 vs. Claude Opus 4.6] Performance on ten penetration testing scenarios at 5M and 25M token budgets. A perfect score means the model autonomously found and exploited the vulnerability. See Benchmarking open-weight models for security research for methods and additional models.
This evaluation is nigh impossible to game: the model either finds the exploit or it doesn't, and the vulnerabilities are not public information. The outperformance also holds for other Chinese models alleged to have distilled Opus 4.6, like Qwen 3.6 Plus: they regularly outperform the model they're alleged to have distilled from, even on real-world tasks.
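What makes this kind of evaluation hard to game is its binary, ground-truth scoring: each scenario is pass/fail with no partial credit. A minimal sketch of such a scoring harness (all names and numbers here are hypothetical, not the actual evaluation code):

```python
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    """Outcome of one penetration-testing scenario (hypothetical structure)."""
    scenario: str
    token_budget: int  # e.g. 5_000_000 or 25_000_000 tokens
    exploited: bool    # did the agent autonomously find AND exploit the vuln?

def score(results: list[ScenarioResult], budget: int) -> float:
    """Fraction of scenarios solved at a given token budget.

    Each scenario is strictly binary: the exploit either worked against the
    ground-truth vulnerability or it didn't, so there is nothing to
    pattern-match or memorize short of actually finding the bug.
    """
    at_budget = [r for r in results if r.token_budget == budget]
    if not at_budget:
        return 0.0
    return sum(r.exploited for r in at_budget) / len(at_budget)

# Illustrative (made-up) run: 10 scenarios at the 5M-token budget,
# of which the first 7 were successfully exploited.
results = [ScenarioResult(f"scenario-{i}", 5_000_000, i < 7) for i in range(10)]
print(score(results, 5_000_000))  # 0.7
```

The key design choice is that `exploited` is verified against a known vulnerability rather than judged by another model, which is what makes the score hard to inflate.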
[Chart: Qwen 3.6 Plus]
And on the point of "at a fraction of the cost", the small Qwen model – a quantized version of which can run on a laptop – also regularly outperforms Opus 4.6 in some cybersecurity scenarios.
[Chart: Qwen 3.6 35B A3B]
This shift happened recently. The last generation of small Qwen models could not even complete the evaluation, and the large Qwen 3.5 model vastly underperformed on this evaluation despite scoring well on public benchmarks.
[Chart: Qwen 3.5 397B A17B]
This set of evaluations is not representative of the full gamut of the models' abilities; the small Qwen model is not a drop-in replacement for Opus. And it is true that American labs still hold the frontier: Opus 4.7 and GPT 5.5 both outperform GLM 5.1.
[Chart: Opus 4.7 (max)]
Still, distilled models are not a strict subset of the teacher model's performance; they can exceed the teacher's abilities in many domains of interest, including cybersecurity.