I Have Never Seen Anything Like This
Can you actually train a model to be a better manager? Moonshot recently released Kimi 2.5 and called it the most powerful open-source model to date. That claim is already off because it's open-weight, not open source. There's a difference, but that's not the point here. Kimi 2.5 makes two claims that are actually worth testing. First, it says it was trained from the ground up to orchestrate agent swarms with up to 100 sub-agents running in parallel. The reinforcement learning setup doesn't just reward correct answers, but also how effectively the model distributes work across agents. Second, it claims visual agentic intelligence, and the creators say it generated extremely high-level animations from just a single prompt. Now, instead of users claiming they built something in one shot, it's the creators themselves claiming it. So we had one of our team members test both. Some of what we found lived up to the hype; some of it didn't.

As mentioned, Kimi 2.5 is marketed as an open-source model, but it isn't one. By the Open Source Initiative's definition, an open-source model means the code, training data, and methodology are publicly available, allowing anyone to inspect, modify, and distribute them. Kimi 2.5 is an open-weight model: only the final weights are released, with neither the training code nor the training data set made public. The weights are published so others can fine-tune, adapt, or deploy the model for their own projects.

The model's architecture is very similar to DeepSeek's mixture-of-experts architecture. It contains 1 trillion total parameters with only 32 billion activated per query. Does that mean we're not using the model at full capacity? No. It answers with the accuracy you'd expect from a 1 trillion parameter model, but with much lower processing power and cost. This gap between total and activated parameters is the key reason the model is claimed to be one of the fastest open-weight models out there: only a fraction of the parameters are used per query, which significantly speeds up inference and is also the core reason it's so cheap compared to other models.

They say this is a native multimodal model that delivers state-of-the-art coding and vision capabilities. But every model makes the same state-of-the-art claim, so our team had to verify it for ourselves, and we'll show you what we found. Before we move on to its genuinely unique capabilities, a word from the sponsor, Opera Neon. This is Opera's first agentic browser, designed specifically for power users ready to experience the future. Neon uses Tasks, which replace chaotic tabs with focused workspaces where the AI can analyze and act across multiple tabs within the same context. Imagine needing a quick utility for work: instead of opening an IDE, simply use Neon Make. Type a prompt like "make a cyberpunk Pomodoro timer" and the browser spawns a virtual machine to generate the agenda, write the code, and deploy the app instantly. It's a massive time-saver for daily workflows, allowing you to prototype concepts or automate research via Neon Do without ever breaking your flow. It acts like a junior developer built directly into the interface. I'll definitely be using these Neon Cards to automate my prompts. You can subscribe to Opera Neon today.
Don't just watch the agentic shift, be a part of it. The link is in the description.

The Kimi model is able to direct a swarm of agents, coordinating tasks among them. You might think Claude also does that, spawning multiple sub-agents based on the task at hand, but here's how this model is different. Kimi 2.5 has learned to self-direct an agent swarm of up to 100 sub-agents executing parallel workflows across 1,500 coordinated steps, trained through parallel-agent reinforcement learning. For those who don't know, reinforcement learning is a process where the model is rewarded when it performs well and penalized when it strays from the objective. Most models are rewarded on task performance alone; here, the model is also rewarded for how well it parallelizes steps and acts as an orchestrator. To put it simply, Kimi 2.5 is trained to be an orchestrator: its success criteria include its ability to create sub-agents and assign tasks. The orchestrator comes with built-in tools for creating sub-agents, assigning tasks, and related functions. It creates sub-agents for various tasks, assigns them the work, receives their results, and then coordinates everything into a final output.

According to Moonshot, this swarm method improves performance on complex tasks, and in internal evaluations it resulted in an 80% reduction in end-to-end runtime, which means it can take on much more complex, long-horizon tasks. They compared it against the best models for long-range tasks, namely Opus 4.5 and Kimi 2.5 without the swarm, and found that the Kimi 2.5 agent swarm surpassed both across their benchmarks. They were also able to save considerable time by using a swarm of agents instead of running a single agent.

Those were all their claims. To test them, we installed Kimi's new coding CLI, the coding agent released alongside this model. We had already built a UI and wanted to migrate it to a different component structure: the UI was built with shadcn/ui, and we wanted to rebuild it with Material UI. The project had multiple pages, so we asked Kimi to convert the entire project from shadcn to Material UI and to use agents to handle each page so the migration could run faster in parallel. It started exploring the directory, much like Claude Code does, and created a to-do list containing every page that needed to be converted to Material UI. It grouped similar pages together, such as the auth pages (sign up, login, and forgot password), to handle them more efficiently. It did spawn more agents than we were expecting, which we later found out was a bug in the CLI, the kind of thing you can expect from a brand-new product; in the end it used five agents for the task. It took around 15 minutes to complete, a time we had hoped the parallel agents would cut down. It finished by verifying and cleaning everything up. Some components were no longer used after the migration, and it removed those as well. It made sure all dependencies were installed and updated, including test files, and validated the rest. Once that was done, it removed every dependency that shadcn had required, leaving the project without any unused dependencies, something most agents tend to forget, bloating the project unnecessarily. It did tweak the UI slightly: the hero section originally had text and visuals side by side, but it changed them to be stacked vertically.
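To make the orchestrator pattern described above a little more concrete, here is a minimal sketch of how an orchestrator can fan work out to sub-agents in parallel and collect their results. This is not Moonshot's implementation or the CLI's internals: `run_subagent` is a hypothetical placeholder for whatever tool call actually spawns a worker, and the page groupings simply mirror our migration task as an example.

```python
import asyncio

# Hypothetical stand-in for the tool an orchestrator calls to spawn a worker.
# In Kimi 2.5's case this would be a built-in tool exposed to the model;
# here it is just a placeholder coroutine.
async def run_subagent(name: str, task: str) -> str:
    await asyncio.sleep(0.1)  # pretend the sub-agent is doing work
    return f"[{name}] done: {task}"

async def orchestrate(tasks: dict[str, str], max_parallel: int = 5) -> list[str]:
    """Fan tasks out to sub-agents, capping how many run at once,
    then collect every result for the orchestrator to merge."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def worker(name: str, task: str) -> str:
        async with semaphore:
            return await run_subagent(name, task)

    return await asyncio.gather(*(worker(n, t) for n, t in tasks.items()))

if __name__ == "__main__":
    # Grouping related pages, similar to how the CLI batched the auth pages.
    pages = {
        "auth-agent": "migrate signup/login/forgot-password pages to Material UI",
        "dashboard-agent": "migrate dashboard pages to Material UI",
        "landing-agent": "migrate landing page to Material UI",
    }
    results = asyncio.run(orchestrate(pages))
    print("\n".join(results))
```

Capping concurrency with a semaphore plays the same role as the CLI deciding how many agents to run at once; the orchestrator's real job is splitting the work sensibly and merging the results afterwards.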
Back to the migration: other than that one tweak, everything looked almost exactly the same, with just the components switched. Even though it was a big task, it only used 25% of the context window, which means it can run effectively as a long-running agent. So the agent swarm works, but it isn't always faster, and it will take longer on a large-scale codebase.

You've probably noticed we build a lot in these videos. All the prompts, the code, the templates, the stuff you'd normally have to pause and copy from the screen, it's all in our community, for this video and every video before it. Links in the description.

The key selling point of Kimi 2.5 is its visual agentic intelligence. It's claimed to be particularly strong on the front end: it can implement interactive layouts and rich animations, such as scrolling text effects. They provided multiple examples of animations, all of which were created well. Here's where it really stands out: Kimi 2.5 excels at coding with vision, going beyond text and image prompts. It can even take videos as input and generate code, making it one of the first models able to do so, which made explaining code flows much easier. This multimodal capability was not bolted on after training; it was integrated during model training. Most models add extra modalities only after their text capabilities are strong enough, which often leads to a trade-off between vision and text abilities. With Kimi 2.5's training methodology, that trade-off disappears and both capabilities improve together.

We had to test it ourselves. We screen recorded navigating around Notion's new-page interface and using slash commands. We kept the recording small because the documentation says videos are limited to 40 megabytes. We provided the path to the Notion recording and asked it to clone the website shown in the video. We didn't tell it in the prompt what the recording was, so it used its read-media-file tool to analyze the video. It concluded that the interface was Notion-like, identified all the features, and determined it was a Notion clone with a macOS-style window. Once it had listed what was in the file, it started implementing. If you're using video processing in your own projects, remember this: videos and images can exhaust the context window quickly, so be careful with large files and watch out for context bloat.

The replicated interface was accurate. The UI was editable, including page icons and features from Notion, even though some weren't fully functional at first. The slash commands weren't working yet, but the overall UI was accurate. It would have been better if the slash commands had been implemented, since they're a key part of the workflow, but this was a minor issue that could be fixed by iterating. So we gave it a prompt asking it to fix the issues we were having with the implementation. From there it self-iterated, implementing fixes, checking the results, and making sure the feature worked correctly without needing any additional prompts from us. That iteration eventually fixed the slash-command issue, making the whole interface feel like a functional Notion clone. So it is living up to the model's claims. After working through a few issues, we think it could be a cheaper alternative to Claude Code, given that Claude's plans are known to be expensive and Kimi's plans are priced lower. That brings us to the end of this video.
If you'd like to support the channel and help us keep making videos like this, you can do so by joining AI LABS Pro. As always, thank you for watching, and I'll see you in the next one.
Summary
This video evaluates Kimi 2.5, an open-weight model claiming to excel at orchestrating agent swarms and at visual agentic intelligence, testing its ability to manage complex tasks and generate code from video input.
Key Points
- Kimi 2.5 is an open-weight model, not open source, with 1 trillion total parameters but only 32 billion activated per query, enabling fast, cost-effective inference (see the routing sketch after this list).
- The model is trained to orchestrate up to 100 sub-agents in parallel, cutting end-to-end runtime by roughly 80% on complex tasks in Moonshot's internal evaluations.
- It uses reinforcement learning not only for correct outputs but also for effective task distribution among agents, making it a native agent orchestrator.
- Kimi 2.5 demonstrates strong visual agentic intelligence, capable of generating high-level animations and implementing interactive UIs from a single prompt.
- The model can process videos as input and generate functional code, making it one of the first to handle multimodal input like video for coding tasks.
- Testing showed Kimi 2.5 successfully migrated a UI from shadcn/ui to Material UI using parallel agents, though the speed-up was limited and may shrink further on large codebases.
- It accurately cloned a Notion-like interface from a video recording and self-iterated to fix issues like slash-command functionality; in the migration test it also cleaned up unused dependencies.
- The model's multimodal capabilities were integrated during training, avoiding the typical trade-off between vision and text performance.
- Kimi 2.5 offers a potentially cheaper alternative to expensive tools like Claude Code for coding and agent-based workflows.
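Since the first key point leans on the difference between total and activated parameters, here is a toy mixture-of-experts routing sketch showing why only the activated experts contribute to per-query compute. The expert count, top-k value, and sizes are purely illustrative; this is not Kimi 2.5's actual configuration or Moonshot's code.

```python
import numpy as np

# Toy mixture-of-experts router: a gate scores every expert per token,
# but only the top-k experts actually run. All sizes are illustrative.
NUM_EXPERTS = 16   # total experts (stand-in for "1T total parameters")
TOP_K = 2          # experts activated per token (stand-in for "32B activated")
HIDDEN = 8

rng = np.random.default_rng(0)
gate_weights = rng.normal(size=(HIDDEN, NUM_EXPERTS))
expert_weights = rng.normal(size=(NUM_EXPERTS, HIDDEN, HIDDEN))

def moe_forward(token: np.ndarray) -> np.ndarray:
    scores = token @ gate_weights                             # score all experts
    top = np.argsort(scores)[-TOP_K:]                         # keep only the k best
    probs = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over chosen experts
    # Only the selected experts do any work; the rest stay idle, which is why
    # activated parameters (not total) drive per-query compute and cost.
    return sum(p * (token @ expert_weights[i]) for p, i in zip(probs, top))

token = rng.normal(size=HIDDEN)
print(moe_forward(token).shape)  # (8,)
```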
Key Takeaways
- Use Kimi 2.5 for complex, multi-step tasks where agent orchestration can improve efficiency, especially when parallel processing is beneficial.
- Leverage Kimi 2.5's ability to generate code from video input to prototype or reverse-engineer UIs quickly.
- Be mindful of context window limitations when processing large media files like videos or high-resolution images (a simple pre-check sketch follows this list).
- Test agent swarms in controlled environments to understand their scalability and potential bugs in real-world use cases.
- Consider Kimi 2.5 as a cost-effective alternative to high-priced coding agents for development and automation tasks.
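For the media-file takeaway above, a tiny pre-flight size check can save a wasted request, since the documentation reportedly caps video inputs at 40 MB. This is a generic sketch: the path in the usage comment is hypothetical, and MAX_VIDEO_MB is a local constant, not an official API parameter.

```python
from pathlib import Path

# The documentation reportedly caps video inputs at 40 MB; treat that as a
# local constant here rather than an official API parameter.
MAX_VIDEO_MB = 40

def check_video_size(path: str, limit_mb: int = MAX_VIDEO_MB) -> bool:
    """Return True if the recording is small enough to send as model input."""
    size_mb = Path(path).stat().st_size / (1024 * 1024)
    if size_mb > limit_mb:
        print(f"{path} is {size_mb:.1f} MB, over the {limit_mb} MB limit; "
              "trim or re-encode the recording before sending it.")
        return False
    print(f"{path} is {size_mb:.1f} MB, OK to send.")
    return True

# Example (hypothetical path):
# check_video_size("recordings/notion-demo.mp4")
```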