GLM 4.6 Review: Better Than Claude for Coding? (Real Project Tests)
Just two months ago, Z.ai released one of the best coding models out there. It wasn't topping any benchmarks, but it was extremely cheap: it offered message quotas the way Claude Code does and completely undercut it on price. Now they've released another update, the GLM 4.6 model, and the internet blew up calling it one of the best models available. The benchmarks say so too, but benchmarks are never fully trustworthy, so I tested it myself to see whether, even at such low pricing, it's really worth switching. One of the biggest updates in this model is the context window expansion from 128k tokens to 200k tokens, a significant jump in capability. The rest of the updates claim the model is better than the previous version, which we'll verify by comparing it against Claude's models.

If you didn't already know, GLM offers coding plans that compete directly with Claude Code, and they've also introduced a GLM Coding Max plan that is honestly great value, as you'll see. It works with numerous coding tools, including Claude Code, OpenCode, Roo Code, Cline, and many others. Here's how the pricing compares. On Claude's most basic $20 plan, you get about 45 messages per hour. The GLM plans all show double pricing because the first month is discounted: the Lite plan is $3 for the first month, then $6 every month after. So for $6, you're getting anywhere from 50 to 120 messages per hour, depending on how much compute your prompts use. Here's where it gets interesting: Claude's Max plan costs $200 and gives you about 900 messages per hour, but GLM's Coding Max plan is just $60 for up to 2,400 messages per hour. Even on really complex prompts, you're getting comparable throughput to Claude for $140 less. GLM 4.5 was priced around the same, so it's great they haven't raised prices, but buying a new plan with every model release is starting to put a hole in my wallet. So here's a word from our sponsor.

CodeRabbit CLI. Imagine pushing changes late at night. Everything looks fine; then the pipeline fails. Painful, right? CodeRabbit CLI runs checks in your terminal before you ship. Run it on your project and it flags the stuff that breaks pipelines: errors, security issues, code smells, missing tests, and performance hits. Run it before you commit and it reviews staged and unstaged changes, and you can apply fixes in one click or hand them to an AI agent like Claude Code or Cursor CLI. Reviews happen locally in your terminal, not after a PR. It works with the languages you use, like JavaScript, TypeScript, Python, Java, Go, Rust, and more. Think of it as your safety net for production-ready commits.

The very first thing I wanted to test between these two was their actual ability in UX design, to evaluate the creativity of each model. The second test focuses more on instruction following: whether these models can implement code exactly as asked. So what did I do here? I asked them to implement a full inventory management system in HTML, CSS, and JavaScript. The prompt included only the user requirements, basically explaining how users would use the app and their journey flow. There was no technical information or design guidance; that part was entirely up to the model. I gave GLM the exact same prompt at the same time. One session is running on API usage billing while the other is on the Claude Max plan; one is using Sonnet 4.5 and the other GLM 4.6.
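Neither model's full output fits in a writeup like this, but for context, a vanilla HTML/CSS/JavaScript build of this kind usually reduces to a small persistent store plus some render functions. Here's a minimal TypeScript sketch of that storage layer; every name in it is illustrative, not taken from either model's output:

```typescript
// Minimal sketch of the storage layer a vanilla inventory app needs.
// All names here are illustrative -- this is not either model's code.

interface Item {
  id: string;
  name: string;
  quantity: number;
  assignedTo: string | null; // user the item is checked out to, if any
}

const STORAGE_KEY = "inventory-items";

// Load items from localStorage so data survives a page refresh.
function loadItems(): Item[] {
  const raw = localStorage.getItem(STORAGE_KEY);
  return raw ? (JSON.parse(raw) as Item[]) : [];
}

function saveItems(items: Item[]): void {
  localStorage.setItem(STORAGE_KEY, JSON.stringify(items));
}

// Add an item and persist it immediately.
function addItem(name: string, quantity: number): Item {
  const items = loadItems();
  const item: Item = {
    id: crypto.randomUUID(),
    name,
    quantity,
    assignedTo: null,
  };
  items.push(item);
  saveItems(items);
  return item;
}

// Assign an item to a user -- the "assignments" view described below
// is essentially a filter over this field.
function assignItem(id: string, user: string): void {
  const items = loadItems();
  const item = items.find((i) => i.id === id);
  if (item) {
    item.assignedTo = user;
    saveItems(items);
  }
}
```

Everything else in such an app (login screens, dashboards, routing) is rendering on top of a store like this, which is why the interesting differences between the two models show up in UX decisions rather than in the data handling.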
Both models ran, and here are the results: this is what GLM 4.6 came up with, and this is what Sonnet 4.5 came up with. The main goal is to judge the UX and UI, how each of them designed it, and what kinds of mistakes they made.

First, Sonnet 4.5. One thing I really like is how it listed the demo credentials: they're shown right in a popup, and if I want to log in as any of the users, I just click on them, the credentials fill in automatically, and I can sign in instantly. That's really good. The dashboard is pretty minimal; again, this is built purely in HTML, so there are no charts or animations or anything like that. We have our inventory, the assignments showing whom items are assigned to, and our user database. I also tried out the AI search functionality it added, and it actually looks pretty good. The page routing is set up correctly too, since we can move back and forth between pages without any issues.

Now for what GLM 4.6 built. It only gave us a small section for logging in as an admin, team lead, or employee. Logging in as admin, first impressions: the user experience is similar. One addition is a quick-actions button on the dashboard, which Sonnet's version didn't have. Other than that, it's mostly the same. However, the add-item button is really out of place; it doesn't belong where it sits. And in terms of UI, I don't like it at all. It's really lacking. Even the plain-HTML implementation was done much better by Sonnet 4.5. Looking through the different sections, it's very bare-bones: the UI wasn't well planned out, and there are a lot of mistakes the stronger model simply didn't make.

One more thing about the GLM build: I ended up making four commits. The first was the initial implementation. The second fixed an authentication loop where trying to log in would just redirect back to the login page, so it wouldn't let me sign in at all. After fixing that, I made another commit to add a mock inventory, which Sonnet 4.5 had already included on its own. But GLM added the mock inventory only on the dashboard: the items show up there, but not in the inventory page itself, which was really odd.

So in terms of creativity, it's pretty clear which model is better. If you want creative tasks or UI design, GLM 4.6 isn't the right choice. But because it's so cheap, I wanted to see whether it could still perform well in structured implementation. Let's face it: you're not just blindly building apps, you're building them with comprehensive context engineering. I wanted to see if, with proper context, the performance gap could be minimized, in which case GLM 4.6 might actually be worth using.

Before we move on to the second test, you need some context about what I'm building and the stack I'm using. I took the inventory management system idea and implemented it in an app called Desk. It's a dual inventory management system that handles both physical items and digital tools like service accounts and subscriptions (a rough sketch of that data model follows below). I'm using the BMAD method, a planned approach to building projects with AI agents: it pre-plans everything and makes the AI follow a structured flow, reducing hallucination and inconsistency. We already have a video on this method if you want to learn more about how it works.
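To make "dual inventory" concrete, here's a rough TypeScript sketch of what a model like Desk's might look like. These types are my own guesses for illustration, not Desk's actual schema:

```typescript
// Hypothetical sketch of a dual-inventory model -- field names are
// guesses for illustration, not the real Desk schema.

// Physical items: laptops, monitors, desks, and so on.
interface PhysicalItem {
  id: string;
  name: string;
  serialNumber: string;
  assignedTo: string | null; // employee id, if checked out
}

// Digital tools: service accounts and subscriptions.
interface DigitalTool {
  id: string;
  name: string;
  vendor: string;
  seatsTotal: number;
  seatsUsed: number;
  renewsOn: string; // ISO date of the next renewal
}

// The "dual" part: one inventory view over both kinds of asset.
type InventoryEntry =
  | { kind: "physical"; item: PhysicalItem }
  | { kind: "digital"; tool: DigitalTool };

// Example helper: seats remaining on a subscription.
function seatsRemaining(tool: DigitalTool): number {
  return tool.seatsTotal - tool.seatsUsed;
}
```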
The BMAD method divides projects into epics, and this one has four. Epic 1 covered the foundation and core infrastructure: setting everything up and installing dependencies. Epic 2 implemented the physical inventory management system. Epic 3 focuses on account and tool inventory management. Each epic is divided into stories so the tool can implement them one by one without overloading its context.

Sonnet 4.5 has implemented the first two epics and the entire project setup so far, and I'm genuinely impressed. Looking at the tech stack, this uses Svelte 5, which up until Sonnet 4 was quite difficult to work with. Recently, though, Sonnet 4.5 has become really good at Svelte 5, good enough that I can actually build functional apps with it. So for the front end, we're using Svelte 5. For the back end, Supabase, and we haven't set up anything manually: we used the Supabase MCP to handle everything in the project. Right now we have five tables fully set up, including authentication and all the required data structures. Svelte needs a component library, so we're using shadcn-svelte, a port of the original shadcn components from React. For the UI, I asked for a Notion-inspired design so the model wouldn't spend too much time figuring out a design direction. The authentication system is already set up, and Sonnet 4.5 implemented it successfully. There are still a few bugs, the UI isn't perfectly aligned, and overall it's quite minimal, but we'll improve that later.

What I want to do now is test the third epic using both Sonnet 4.5 and GLM 4.6. We know Sonnet 4.5 can handle it, but the real question is whether GLM 4.6 can implement it just as well, because if it can, it would truly be worth using. I started the implementation of the third epic, and this is what Sonnet 4.5 came up with. The UI isn't as polished as it would have been in React, because these models are primarily trained on React, but it still managed to set up everything correctly, just as we expected. The GLM model, on the other hand, wasn't able to set it up correctly. It created all the pages properly, and even the Svelte 5 code it wrote was technically correct, but there were problems with integration. At first, there was an issue integrating the account button and setting up the basic structure. Once I painstakingly fixed that through multiple reprompts, the account inventory page still ended up messed up. It managed to do around 90% of the work; to finish the remaining 10%, you'd need the Sonnet 4.5 model to properly integrate everything.

What that means is that if you plan on getting this model, you might still need the $20 Claude subscription alongside it. If you have an unlimited budget, you won't need GLM at all; but if you're worried about the bills stacking up from the $200 plan, this could actually be a decent option for you. Keep in mind that this entire test was done in Svelte mainly to figure out how well the model handles complex frameworks it hasn't been trained on extensively. If it were React, things would have been different: you've seen our GLM 4.5 demo, and that model performed exceptionally well, so this one would do great too. If React apps are all you're working on, this model is a killer option, and honestly, you don't need anything else.
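For reference, since the back end lives in Supabase, here's roughly what the client-side access layer of a Svelte 5 + Supabase project like this looks like. This sketch uses the real supabase-js API, but the table name, the Vite-style env variables, and the helper names are my assumptions, not Desk's actual code:

```typescript
// Minimal sketch of a Supabase access layer for a project like this.
// Table and column names are guesses, not the actual Desk schema.
import { createClient } from "@supabase/supabase-js";

// Assumes a Vite-style env setup for the project credentials.
const supabase = createClient(
  import.meta.env.VITE_SUPABASE_URL,
  import.meta.env.VITE_SUPABASE_ANON_KEY
);

// Sign-in helper -- the kind of auth flow the project already has wired up.
async function signIn(email: string, password: string) {
  const { data, error } = await supabase.auth.signInWithPassword({
    email,
    password,
  });
  if (error) throw error;
  return data.session;
}

// Fetch the account/tool inventory that Epic 3 adds.
async function listTools() {
  const { data, error } = await supabase
    .from("digital_tools") // hypothetical table name
    .select("*")
    .order("name");
  if (error) throw error;
  return data;
}
```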
That brings us to the end of this video. If you'd like to support the channel and help us keep making videos like this, you can do so using the Super Thanks button below. As always, thank you for watching, and I'll see you in the next one.
Summary
This video compares GLM 4.6 and Claude Sonnet 4.5 on real coding tasks. While GLM 4.6 is far cheaper and offers a larger context window, it underperforms in creative UI design and in complex integration work, especially outside React. It's better suited to structured, context-rich implementation than to creative or complex app development.
Key Points
- GLM 4.6 expands the context window from 128k to 200k tokens, and its Coding Max plan costs $60 for up to 2,400 messages per hour, versus Claude's $200 Max plan at about 900 messages per hour.
- GLM 4.6 was tested against Claude Sonnet 4.5 on two builds: a vanilla HTML/CSS/JavaScript inventory app and a structured inventory management project built with Svelte 5 and Supabase.
- In UX design and creativity, Sonnet 4.5 produced a more polished, functional UI with better layout and features like demo credentials, while GLM 4.6's UI was bare and poorly structured.
- GLM 4.6 struggled with complex integration tasks in Svelte 5, failing to properly set up the account inventory page despite correct syntax, requiring manual fixes.
- GLM 4.6 performed well in structured tasks when provided with clear context, suggesting it's effective for implementation-focused coding when used alongside a more capable model.
- The video highlights the importance of context engineering and structured development methods like BMAD to reduce hallucinations and improve AI agent performance.
- The comparison reveals that while GLM 4.6 is cost-effective, it may not replace higher-tier models like Claude Sonnet 4.5 for complex or creative coding tasks.
- The test used a dual inventory system (physical and digital tools) to evaluate model performance across different implementation challenges.
Key Takeaways
- Use GLM 4.6 for cost-effective, structured coding tasks when you have clear context, but rely on more capable models like Sonnet 4.5 for creative or complex UI design.
- Leverage context engineering and structured methods like BMAD to guide AI agents effectively and reduce errors in complex projects.
- When building apps in non-React frameworks like Svelte 5, test models thoroughly as performance varies significantly based on training data.
- Consider a hybrid approach: use cheaper models for implementation tasks and more expensive models for design or integration challenges.
- Evaluate AI models not just on price or benchmarks, but on real-world performance in your specific tech stack and use case.