GLM 4.7 is INSANE For Software Dev...

AILABS-393 · JaLycsbTlHU · Watch on YouTube · Published December 23, 2025
Duration: 6:10
Views: 17,811
Likes: 342

Scores

Composite: 0.66
Freshness: 0.01
Quality: 0.85
Relevance: 1.00
1,322 words · Language: en · Auto-generated

The guys over at Z.ai just dropped GLM 4.7, and at $29 a year this is absurdly cheap for a model they claim hits 73% on SWE-bench, right up there with Sonnet 4.5. The timing isn't random. They're going public and need to show Western traction. They even did a live Q&A on Reddit, which I've never seen a Chinese AI lab do. But 4.6 had real problems. Is 4.7 actually fixed?

Hey everyone, if you're new here, this is AI Labs and welcome to another episode of Debunked, a series where we take AI tools and AI models, strip away the marketing hype, and show what they can actually do with real testing and honest results.

The new model is mainly improved through post-training, not architecture changes. It's heavily optimized for Claude Code, and the Z.ai team explicitly said this is their priority framework. Currently it's beating a lot of the top-tier models, including GPT-5, especially on coding benchmarks, and it's included in all of their coding plans. One additional thing they've added is the new MCP tools, which are not integrated directly; they're separate MCP servers. They've listed three right now, and for all of them to work you just need an API key. That's why they're included with the plan but separate from the model.

As far as the usage limits go, they're pretty much the same as they were on 4.6. But if you don't know what they were before, I actually generated a report on that. What's funny is I first tried to generate it with Gemini 3, and for some reason it wasn't able to give me a proper comparison of the plans. I went to Claude again and it researched it nicely. Basically, all you need to know is that on the entry-level plan you get 10 to 40 prompts in Claude Code, while with GLM Coding you're getting 120 prompts for just $3, which is a huge difference. This only increases as you go into the higher tiers, where the $200 plan gets you up to 800 prompts in that five-hour window with Claude, while $30 gets you 2,400. All of these rates are discounted for the first month, then they double. But if you're on the yearly plan, it's much more affordable.

Another significant benchmark was Humanity's Last Exam. For those who don't know, it's one of those unsaturated benchmarks, and most newer models still score low on it because it's genuinely difficult.

To actually test the UI, we have this prompt, which doesn't really focus on the architecture. It mainly focuses on the design logic the model is supposed to implement while also providing some design options. We can then see what it makes based on the company I'm proposing, which in this case is an AI-powered code review platform.

We also subscribed to the Max plan, and there are two ways you can actually connect it with Claude Code. In both cases you change a settings.json file: one applies globally, while the one inside your project only changes it for that project (a sketch of that configuration is shown below). We did this so we could actually compare it with Sonnet 4.5.

This is what Sonnet 4.5 came up with. The prompt is actually pretty good, and we've been using it to identify how these models build UI and how creative they are in doing that. It's simple vanilla JS, so we're not looking at the architecture right now, just the design. This is what GLM 4.7 came up with. In terms of the design it's pretty good, but it did make an error here where it didn't account for the length, which is why the artifacts are breaking up a little bit. Other than that, the design is solid.
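For reference, pointing Claude Code at GLM is typically done by overriding the Anthropic endpoint through environment variables in settings.json. A minimal sketch, assuming Z.ai's Anthropic-compatible endpoint and a placeholder key; check Z.ai's current documentation for the exact URL and variable names:

    {
      "env": {
        "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
        "ANTHROPIC_AUTH_TOKEN": "your-glm-api-key"
      }
    }

Put this in ~/.claude/settings.json to apply it to every project, or in .claude/settings.json inside a project to scope it to that project only.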
On the design side, though, I do not like these emojis at all. Sonnet did not use any emojis, which is good and matches the design language.

To actually test them both out, I have this pre-made Next.js project, which has a context initialized saying it needs to build a scalable, backend-ready UI. This part is important, because when I get to the reasons why GLM surprisingly performed better, it's going to come back to this point. Framer Motion and shadcn/ui components have been pre-installed for it to build the UI. Both models were asked to build the main browse page for a Netflix-like streaming platform, and they were told exactly what to build and what needs to be on the page.

If we're talking about the usability of the GLM model with Claude Code, one problem with GLM 4.6 was that it was extremely slow in code generation. That issue, in my experience, has not been solved; it's still extremely slow. But there is one change from GLM 4.6: that model didn't think, meaning it didn't show any thinking inside Claude Code. The detailed transcript you get here clearly shows thinking, which wasn't showing in 4.6. You can clearly see that the 4.7 model does think, so that's been fixed.

Other than that, there are some quirks you need to know. GLM 4.7 is not that autonomous; I found this during my testing. As you can see here, this GLM folder already has a UI benchmark folder in which it needs to implement the app, but it chose to ignore that. Although it was clearly written inside the context, it went ahead and made another Next.js app on its own. It didn't even initialize it; it just started writing code. Sometimes it does act really dumb. But after I corrected it and steered it in the right direction in terms of the implementation, this is what Claude created. Again, being the higher-end model, it's pretty good at UI. This is what GLM 4.7 created. Claude obviously created a better UI because, in our opinion, it's still better at design.

For the price, that is okay. But then I looked at the code and dug into it. Since both were told this was supposed to be backend-ready and that for now they needed to use mock data, the GLM model actually implemented a better architecture by placing all the mock data in one file. When we need to swap it out, we just change that one file, because all the imports point there, as opposed to what Claude implemented, where every component has its own import; when we actually implement the backend, we'll have to change all of those files one by one.

In terms of basic architecture and code quality, GLM actually did pretty well, and it surprised me, because 4.6 wasn't this good in my testing. The previous plan wasn't really justified, given how much I had to steer it and how many mistakes it made, but this one is definitely a huge leap. Those benchmarks are definitely justified by the testing I've done. I've also looked at a few other small things in the code, and GLM 4.7 is actually a good model.

Given these unexpected results, we're honestly recommending everyone get the $29-per-year plan. If you already have the $20 Claude plan, this is basically nothing in comparison. That said, it's still not a model you'd use for completely autonomous coding. Even though Claude really messed up the architecture here, it's good enough that it can correct and improve on that later. But with the small quirks GLM still has, we don't think it's a good idea to depend on it alone. That brings us to the end of this video.
If you'd like to support the channel and help us keep making videos like this, you can do so by using the super thanks button below. As always, thank you for watching and I'll see you in the next one.

Summary

This video reviews GLM 4.7, a coding-focused AI model from Z.ai, comparing it against Claude Sonnet 4.5 inside Claude Code. Despite improvements in thinking and code architecture, GLM 4.7 still has limitations in autonomy and speed, but it offers strong value for its low price, especially for backend-ready code generation.

Key Points

  • GLM 4.7 is a cost-effective AI model priced at $29/year, claiming 73% on SWE-bench and competing with top-tier models like Sonnet 4.5.
  • The model is optimized for Claude Code and ships with separate MCP servers that require an API key rather than being integrated directly.
  • GLM 4.7 shows improved thinking capability compared to 4.6, but remains slow in code generation.
  • It makes errors like ignoring project context and creating unnecessary apps, showing limited autonomy.
  • Despite UI design shortcomings, GLM 4.7 implemented a better architecture by centralizing mock data, improving scalability.
  • The model is not fully autonomous and requires steering, but offers better code quality than expected for its price.
  • Testing revealed GLM 4.7 produced a better backend-ready code structure than Claude in this comparison, making it a strong choice for developers.
  • The model is recommended for the $29 yearly plan, especially for users already on higher-tier models like Claude.

Key Takeaways

  • Evaluate AI models beyond marketing claims by testing real-world coding tasks, especially for backend readiness.
  • Consider the trade-off between cost and autonomy—GLM 4.7 is affordable but not fully autonomous.
  • Centralizing mock data improves scalability; prioritize architecture that simplifies future backend integration (see the sketch after this list).
  • Use separate MCP tools with API keys for enhanced functionality, but be aware they are not integrated directly.
  • Test models on specific use cases like UI design and code generation to assess real-world performance.
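
To illustrate the point about centralized mock data, here is a minimal TypeScript sketch; the file name, the Title type, and the getTitles helper are hypothetical and not taken from the video's actual project:

    // data/titles.ts - the one place mock data lives (hypothetical file name).
    export interface Title {
      id: string;
      name: string;
      genre: string;
    }

    const MOCK_TITLES: Title[] = [
      { id: "1", name: "Example Show", genre: "Drama" },
      { id: "2", name: "Example Movie", genre: "Action" },
    ];

    // Every component imports this function instead of defining its own data,
    // so wiring up the real backend later means editing only this file.
    export async function getTitles(): Promise<Title[]> {
      // return fetch("/api/titles").then((res) => res.json()); // real backend, later
      return MOCK_TITLES;                                       // mock data, for now
    }

The alternative the video criticizes, where each component declares or imports its own mock data, means touching every one of those components once the backend arrives.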

Primary Category

LLMs & Language Models

Secondary Categories

AI Tools & Frameworks, Programming & Development

Topics

GLM 4.7, Claude Code, SWE-bench, AI coding, UI generation, code architecture, MCP tools, pricing comparison, benchmark testing, AI model comparison

Entities

organizations
Z.ai, Anthropic, OpenAI

Sentiment

0.70 (Positive)

Content Type

comparison

Difficulty

intermediate

Tone

educational, analytical, debunking, objective, technical