We have been running Z.AI's GLM-5.2 on our infra since the weights were released on June 16, 2026. The conclusion is direct: it is the first open-weights frontier model that sustains production at parity with Claude and GPT, and it charges a tenth per token to do it. The open caught up with the closed — and charged 1/10 per token to prove it.
The architecture: MoE 744B-A40B and a 1-million-token window
The architecture is MoE 744B-A40B: 744 billion total parameters with approximately 40 billion active per token. The context window grew to 1 million tokens — up from 200K in GLM-5.1 — with a maximum output of 131,072 tokens. It is text-only; anyone who needs vision stays on GLM-5V-Turbo, which is a separate model.
The real gain is Deep Sparse Attention with IndexShare, which cuts FLOPs by 2.9x in long context, according to Z.AI. For inference workloads with extended context, that means significantly higher throughput on the same hardware. The combination of sparse MoE with IndexShare is what makes 1 million tokens sustainable without blowing the GPU budget — and we confirmed this on our infra: 500K-token contexts that were prohibitively expensive on closed models run within budget on GLM-5.2.
Benchmarks: what Z.AI reports vs. what third parties confirm
The code numbers are solid — and we tested them against our real baseline. According to Z.AI, self-reported without independent verification: Terminal-Bench 2.1 at 81.0 (Opus 4.8 scores 85.0; Gemini 3.1 Pro, 74) and SWE-bench Pro at 62.1 (SOTA for open weights). These are the numbers Z.AI itself publishes — not numbers verified by third parties.
The three that come from third parties are the ones that matter. FrontierSWE 74.4, ahead of GPT-5.5 at 72.6. PostTrainBench 34.3 vs. 28.4 for GPT-5.5. SWE-Marathon 13.0, second only to Opus globally. These are independent.
According to Artificial Analysis, the Intelligence Index is 51 — the highest of any open weights — GPQA-Diamond 89.5% (Z.AI reports 91.2%) and HLE 40.1% (Z.AI reports 40.5%). Small divergence, same direction: third-party numbers confirm the direction of the self-reported numbers, even if not the exact magnitude.
The cost: 10x cheaper than GPT and Claude
The cost is where it closes for us. The API charges $1.40/M input and $4.40/M output — roughly 10x cheaper than GPT and Claude. For high-volume inference workloads, the savings compound: in our usage, the monthly bill dropped to a fraction of what we paid on closed APIs, with no perceptible quality loss on code tasks.
For anyone without GPUs at home, Z.AI offers the Coding Plan: Lite at $18/mo, Pro at $72, and Max at $160 (with a 30% introductory discount). The Coding Plan runs inside Claude Code, Cursor, Cline, OpenCode, Roo Code, Kilo Code, and ZCode, with a 3x multiplier during the 14h-18h Beijing peak. It is the shortest path to test it in your workflow without committing infrastructure.
Deployment: three paths to production
Here is the split. Anyone with a rack grabs the weights from zai-org/GLM-5.2 on HuggingFace (BF16 and FP8), loads them on vLLM v0.23.0+ or SGLang v0.5.13.post1+, and is done. The MIT license allows commercial use, fine-tune, air-gap, and 80+ quants in llama.cpp and Ollama. Anyone without infra starts with the Coding Plan or the API.
For production inference in Latin America, the practical path is Microsoft Foundry brazilsouth or AWS Bedrock sa-east-1. There is no direct Z.AI operation in the region — access comes through cloud partners. We recommend starting with the API to validate the workload, then moving to self-hosted when volume justifies a dedicated rack.
Z.AI: Tsinghua spinoff, IPO, and Entity List
Z.AI is a 2019 spinoff from Tsinghua. The IPO on January 8, 2026, on the HKEX (ticker 2513) raised approximately $558M. The company was added to the Entity List on January 15, 2025; according to Z.AI, the inclusion "lacks factual basis" — the company's claim, not independent verification.
Carol Lin, Group VP and CEO of Zhipu International, ex-AWS, leads globalization. For Latin America, access arrives via Microsoft Foundry brazilsouth and AWS Bedrock sa-east-1; no direct Z.AI operation in the region. The cloud-partner distribution model is what makes GLM-5.2 viable for companies that do not want vendor lock-in.
Conclusion: the open caught up with the closed
We recommend GLM-5.2 for any engineer who needs frontier without locking in cost. The open caught up with the closed, and charged 1/10 per token to prove it. Third-party benchmarks confirm what Z.AI reports; the MIT license removes lock-in; and the cost per token makes viable what was prohibitive on closed APIs. The gap between open and closed is no longer about capability — it is about who controls the infrastructure and the budget.
At Tech86, we help companies deploy open-weights models on sovereign infrastructure — from downloading the weights to tuning throughput in production. Whether you start with the API to validate the workload or go straight to self-hosted on dedicated GPUs, the path is shorter than it has ever been. If your inference workload needs frontier without locking in cost, GLM-5.2 is the shortest path.