Here's What They Didn't Tell You About Gemini 3

AILABS-393 · kyflIo3EKLw · Published November 18, 2025
Duration: 8:57
Views: 30,014
Likes: 710

Scores

Composite: 0.67
Freshness: 0.00
Quality: 0.87
Relevance: 1.00
1,932 words · Language: en · Auto-generated transcript

After waiting for so long and dealing with so many fake leaks, we finally got the official announcement from Google. They just released Gemini 3 along with a bunch of new AI-powered tools in their ecosystem. You've probably seen a lot of videos on this model already, but is it actually that good? Is it really the next best AI model out there? Hey everyone, if you're new here, this is AI Labs, and welcome to another episode of Debunked, a series where we take AI tools and AI models, strip away the marketing hype, and show what they can actually do with real testing and honest results.

Before moving on to the actual testing I did, I want to show you some important things from the blog post about Gemini 3. They start by describing how they've evolved since Gemini 1 and announce that Gemini 3 Pro is now going to be the default model in all of their main apps. They clearly state that it's a really big step on the path toward AGI, although every AI model provider has been saying that for the past year. One important claim is that it's the best model in the world for multimodal understanding, which I do think is true to some extent; the ecosystem has gotten really impressive, as I'll show you later in this video.

They compare it with their previous model and give some benchmarks. For the coding-specific ones, there's LiveCodeBench Pro. For this information, I actually gave the Gemini 3 model card to Claude and had it parse through it. LiveCodeBench tests competitive programming, and on that benchmark Gemini 3 significantly beats Claude 4.5, by nearly 1,000 points, and beats GPT-5.1 by 200 points. There's also the SWE-bench Verified benchmark, which tests real-world GitHub issues; Gemini 3 solved 67.2% of them. Claude calls it the benchmark for practical coding, but they all are until they're actually tested in a real environment. Here Claude 4.5 is actually ahead of the new Gemini model, by just 1%, but the leap from 2.5 Pro is pretty large. Then there's Terminal Bench 2, which measures how the model performs in a live terminal environment. You can clearly see how it compares with the other frontier models; according to their testing, it has a big lead, but we'll see. I also really love that Claude just gave me this content angle as well.

They also go over some new features, such as AI Mode in Search, which I think was already released. They promise it's the best vibe coding model they've built, and the benchmarks certainly show that. Another thing I really love is that vibe coding has become a mainstream term that everyone is using now. They also released Google Antigravity, a VS Code fork, which I have tested, and I'll be releasing a Debunked video on that as well.

To start, I wanted to test its UI generation capabilities without any kind of guidance, which really demonstrates how creative the model is. I tested it against the other models as well: Claude and GPT-5.1. I asked them to make a fully usable version of macOS, and they could use any stack. I specifically told them that I didn't want to extend this, so they had to keep that in mind, but the UI and the functionality needed to be implemented correctly. By the way, I'm in Antigravity right now, and there are a lot of new things in it, but the implementations you're about to see will show you that it's not that good. I first tried it in Antigravity, and this is what it came up with. Everything looked good, but it wasn't any different.
If I open up apps, you can see they're the same as the ones Claude generates. I thought it was because of Gemini, but I'm going to spoil it for you: it was entirely down to the performance of this agent right here. The same thing goes for the Gemini CLI, which also has Gemini 3 in it. Gemini CLI came up with this. It solved that small error you saw down there, but again, no menu bar. The icons look great and all, but other than that, it just wasn't any different. The reason I ran this test was that a video went really viral in which a user with early preview access to Gemini made a fully working macOS clone online, and it looked really good there. But with the agents Google provides us right now, the result was really disappointing.

Then I went straight into Google AI Studio and used the code assistant implemented there. This is what it came up with, and the UI looked so good. I used to think GPT-5 implemented really good UI, but this is on a whole other level. The animations were also really good. I think my computer's lagging a bit because I've got so many things open in the background, but other than that, you can clearly see that the animations are really smooth. This was honestly amazing. Another really fun thing: just out of curiosity, I asked it if the wallpaper was generated by Nano Banana, because the agents now have access to these image-generating models and generate the assets themselves. It said no, then went ahead and implemented a wallpaper generation feature by itself. If I go up into settings, I have this "generate with Nano Banana AI" form. If I type something, you can see that it generates that, and the quality of the image is really good (a rough sketch of what a call like that might look like follows at the end of this section). Honestly, the UI is phenomenal. For UI especially, I'm going to be releasing a lot of new videos on this. One thing to note: the previous implementations you saw were in HTML, while this one is in TypeScript.

I had to compare it properly with other models. This is what Claude made in the canvas view. Again, it's the simple UI that Claude already makes. Claude is honestly a really stable model. If you ask me, I'm not going to switch to Gemini; the reason I use Claude the most is Claude Code, the Claude desktop app, and the integrations they build. Moving on to what GPT-5.1 implemented, it's not that different. The UI is also a little better than Claude's. I also asked it to implement it in HTML, and that UI was much better: you can see from the gradients and the shadows and how they've been applied that this is a stronger result.

Next up, I wanted to test Gemini 3's coding capabilities. We did our one-shot prompting test, and now I wanted to see how it performs in real coding environments with proper stories. We all know it knows React, so I didn't want to test it on that. Instead, I had both models implement a macOS app. If you're familiar with Monkeytype, it's a really good typing practice tool, and I wanted to make a Mac app for it. I used Claude to do the planning: three phases, multiple stories in each phase, with eight stories in phase one. Both models got the same stories and followed the same instructions. For brevity, I didn't have all three phases implemented, because what I wanted to know, I had found out after the implementation of the first phase.
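As an aside on the wallpaper feature described above: the video doesn't show the app's source, but a minimal sketch of what such a call might look like with Google's @google/genai TypeScript SDK could be the following. The model id, the generateWallpaper helper, and the file output are assumptions for illustration; "Nano Banana" is the nickname for Google's Gemini image-generation model.

```typescript
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

// Assumed model id for "Nano Banana"; check Google's docs for the current name.
const IMAGE_MODEL = "gemini-2.5-flash-image-preview";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Hypothetical helper mirroring the app's "generate with Nano Banana AI" form:
// send the user's prompt, pull the image bytes out of the response, save them.
async function generateWallpaper(prompt: string, outPath = "wallpaper.png") {
  const response = await ai.models.generateContent({
    model: IMAGE_MODEL,
    contents: `Generate a desktop wallpaper: ${prompt}`,
  });

  // Generated images come back as base64-encoded inlineData parts.
  for (const part of response.candidates?.[0]?.content?.parts ?? []) {
    if (part.inlineData?.data) {
      fs.writeFileSync(outPath, Buffer.from(part.inlineData.data, "base64"));
      return outPath;
    }
  }
  throw new Error("Model returned no image");
}

generateWallpaper("a misty mountain range at dawn").then(console.log);
```

In the AI Studio app the bytes would presumably be set as the desktop wallpaper rather than written to disk, but the request and response shape would be the same.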
Right off the bat, I want to say that Gemini CLI was not a good experience for me. It was clunky and broke down a lot in the middle, although eventually it did get working consistently. I think the context window of this model in the Gemini CLI is a million tokens; I tried to look it up, but the model card didn't say anything. I don't think they shipped the smaller version with the 200k context window in the CLI, because Claude got up to story 7 of phase 1 and had to compact because its context window was full, while up to story 7 Gemini had only used 4% of its context window, which on a million-token window is roughly 40,000 tokens (a rough sketch of how you might measure this kind of usage yourself follows after the transcript). After they both got going, Gemini actually finished first, with a 20-minute lead; the model is really fast as well. I don't want people who love Gemini CLI to hate me, but I really didn't have a good experience with it.

Time to actually show you the apps they made. This is what Claude came up with. It's a fully functional version, but there are some really big errors: you can see that the UI hasn't been implemented properly, and the words are congested here. Other than that, the themes have been applied correctly, but you get stuck in the theme menu and can't go back. If we look at what Gemini implemented, the app is really beautiful. I really think this is the new best design model out there. If you compare it with the original, I'd say it's pretty close. The UI is really clean and minimal. There are small issues that need fixing, but other than that, it's really good.

During the build process, they both had problems. Claude's problem was that the words didn't appear on screen: the whole UI was visible, but the words weren't, and I had to reprompt it about four times before it fixed that. Gemini's problem took around 20 reprompts before it was actually fixed, which was really frustrating; the Gemini CLI finally ran a codebase agent, and that's when it got resolved. Another thing that really annoyed me about the Gemini CLI was that it would run build commands and then just wait 10 minutes for no reason, even when the build command was complete. One more thing: for the final polish we wanted all of these touches, and one in particular was sound effects, which Claude did not implement at all, meaning it didn't follow the basic instructions. If we look at what Gemini did, you can clearly hear that sound effects have been implemented, which I thought was really good.

To conclude, it's an excellent UI model. The 1 million token context window is impressive, but I don't know if they're going to offer it everywhere. Right now, I think the model is good, but as we saw with the coding benchmarks, I don't think it's going to be revolutionary. You're going to see people overhype it a lot, saying, "Oh, it's going to change everything," but I don't think that will be the case given the tools and agents that have already been built around a model like Claude; Claude Code is still the better overall experience. But the UI generation was absolutely amazing, so be on the lookout for more Gemini design videos, because this is truly an excellent model for UI design.

That brings us to the end of this video. If you'd like to support the channel and help us keep making videos like this, you can do so by using the Super Thanks button below. As always, thank you for watching, and I'll see you in the next one.
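On the context-usage aside in the transcript: if you want to measure this kind of usage yourself, a minimal sketch with the @google/genai TypeScript SDK might look like the following. The model id and the window size are assumptions, and the Gemini CLI may count tokens differently.

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Assumed figures: a 1M-token window and a Gemini 3 Pro model id.
const CONTEXT_WINDOW = 1_000_000;
const MODEL = "gemini-3-pro-preview";

// Count the tokens a conversation consumes before sending it, and report
// the total as a fraction of the assumed context window.
async function contextUsage(history: string) {
  const { totalTokens = 0 } = await ai.models.countTokens({
    model: MODEL,
    contents: history,
  });
  return { totalTokens, pctUsed: (100 * totalTokens) / CONTEXT_WINDOW };
}

contextUsage("...the conversation so far...").then(({ totalTokens, pctUsed }) =>
  console.log(`${totalTokens} tokens used, ${pctUsed.toFixed(1)}% of the window`)
);
```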

Summary

This video debunks marketing hype around Gemini 3 by testing its real-world performance in UI generation and coding tasks, revealing it excels in UI design but falls short in coding tools and agent reliability compared to competitors like Claude and GPT-5.1.

Key Points

  • Gemini 3 is positioned as a major step toward AGI and excels in multimodal understanding and UI generation.
  • It beats GPT-5.1 and Claude 4.5 on coding benchmarks like LiveCodeBench Pro, but Claude 4.5 remains narrowly ahead of it on SWE-bench Verified.
  • In real-world UI generation, Gemini 3 produces high-quality, animated interfaces in Google AI Studio, surpassing Claude and GPT-5.1.
  • The Gemini CLI tool is criticized for being clunky and unreliable, breaking down mid-run and idling for minutes after build commands had already completed.
  • Gemini 3 implemented sound effects in a macOS app, while Claude failed to follow basic instructions.
  • The model's 1 million token context window is impressive, but its practical utility is limited by current agent tools.
  • UI generation is the standout capability, with clean, functional designs that rival or exceed other models.
  • Testing reveals that while Gemini 3 is fast and powerful, its ecosystem tools are not yet mature enough for reliable coding workflows.
  • The video highlights that marketing claims often overhype AI models, and real-world testing reveals more nuanced performance.

Key Takeaways

  • Test AI models in real-world scenarios to see beyond marketing claims.
  • Use Google AI Studio for high-quality UI generation with Gemini 3, but avoid Gemini CLI for coding tasks.
  • Compare models not just on benchmarks, but on real functionality like sound effects and user experience.
  • Prioritize tools with reliable context management and agent support when choosing an AI coding platform.
  • Be cautious of overhyped AI releases—evaluate performance based on practical use cases.

Primary Category

LLMs & Language Models

Secondary Categories

AI Tools & Frameworks, Programming & Development, AI Engineering

Topics

Gemini 3, AI model comparison, coding benchmarks, UI generation, live coding, Google AI Studio, Gemini CLI, Claude 4.5, GPT-5.1, multimodal understanding

Entities

people
organizations
Google, OpenAI, Anthropic
products
Gemini 3, Gemini 3 Pro, Google AI Studio, Gemini CLI, Google Antigravity, Claude 4.5, GPT-5.1, Monkeytype
technologies
AI models, TypeScript, React, SwiftUI, AI agents, Terminal Bench 2, LiveCodeBench Pro, SWE-bench Verified, Nano Banana
domain_specific

Sentiment

0.40 (Positive)

Content Type

comparison

Difficulty

intermediate

Tone

educational, critical, technical, casual