Claude Opus 4.6 vs GPT-5.3 Codex

Channel: GregIsenberg (YouTube video gmSnQPzoYHA) · Published February 05, 2026
Duration: 48:55 · Views: 23,045 · Likes: 537

Scores: Composite 0.74 · Freshness 0.67 · Quality 0.87 · Relevance 1.00

8,486 words · Language: en · Auto-generated transcript

Today's a massive day because Anthropic just dropped Opus 4.6 and OpenAI answered with GPT-5.3 Codex. But which is the better model, how do you get started, and what are some tips and tricks to get the most out of them? Well, this episode is all about that. This is for the technical person who's trying to get the most out of these models, who doesn't just want hot takes, who wants tactical sauce. This episode of the pod is with my dear friend Morgan Linton. Morgan is one of the best engineers I know. He was an executive at Sonos. He's invested in a lot of AI companies, and he's building an AI company of his own. He's one of my first calls when I'm like, "Hey, which model's better?" So we put the models head-to-head, and there's a winner at the end. We rebuild Polymarket, a multi-billion dollar app, but with these models. So which is the better one? You'll find out by watching this episode, but you'll also learn to become a better AI developer, because you'll have these tips and tricks in your back pocket. >> I'm with one of my favorite people, Morgan Linton. You might not know him, but he is an incredible developer, founder, entrepreneur, investor. He does it all. But today, what I needed him to help me understand is this: Opus 4.6 just came out, and GPT-5.3 Codex just came out. Morgan, help me understand. What are people going to get out of this by the end of this episode? >> Yeah. Well, Greg, thanks for having me. Super exciting day, and it's moving fast. Opus 4.6 came out, and then Sam Altman put together a quick tweet, I want to say maybe 18 minutes later, announcing GPT-5.3 Codex. Me and, I think, everybody else have been jumping on them, playing around, figuring out the differences and all the little neat new settings there are in each of these.
By the end of this, you're going to know, first, how to make sure you're actually running Opus 4.6, and all the little details you can change in the settings.json file to use some of the cool features in Opus 4.6, especially agent teams, which is probably the feature I'm most excited about. You'll also understand why you might use one versus the other, because they each tackle a different engineering methodology. And then hopefully you'll see some cool stuff as we build some demos together that I've put together but haven't tried myself, so I'll be trying them live with you. We'll see how that goes. >> Cool. I think one of them is we're going to try to recreate Polymarket. >> Yes. >> And see which model performs best. >> Yeah. We're going to have both go head-to-head, each building its own version of Polymarket. >> So by the end of this episode, you'll have a pretty good understanding of how to use the models, when to use the models, and how to get started. Morgan, let's get into it. >> Cool. Right on. All right. So I took some notes, and essentially, with 5.3 Codex, I'll be showing that in the desktop app on Mac, because they're super excited about that, and I'm excited about it. I think if OpenAI wanted a demo done the right way, they'd want me to do it in their app. Whereas with Opus 4.6, I'd say the Anthropic team would want me to do it in the CLI. And there are a few configuration settings you want to make sure you get right when you're using Opus 4.6, or trying to, today, tomorrow, whenever it is you're jumping in to use it. I've seen a lot of people online today on Twitter saying, "It's weird, I'm having a problem. I'm supposed to have agent teams, but I don't see them," or "How do I know what version I'm running?"
So I thought, let's start by giving everybody a level playing field: I want to use Claude Code with Opus 4.6, so how do I make sure I'm doing that, and doing it correctly? Here are the initial to-dos everyone should have on their list. Just do an npm update and see if that does the trick. If that doesn't, and you're running an older version, run claude update. As of right now you should see 2.1.32; if you see 1-something, you're running an old version. Then what you want to do is go into your settings.json, and I'll show that here. So just do cd ~/.claude. >> So I bet there are people running the old model who don't even realize it. Probably bad results. >> No idea. Yeah. So make sure you go in here, cd to ~/.claude, and here's your settings.json. If you view it, here's essentially what you should see. For the model, if you want to be really specific about it, you can put in claude-opus-4-6; that'll lock it in. But because 4.6 is the newest model, you can also just set the model to opus, and that'll work. The key thing you want to do, and in my opinion the coolest feature they added with 4.6, is agent teams. I'm super excited to demo that with you. You have to make sure to turn that on, because it's an experimental feature, and that's probably the biggest confusion I'm seeing people have today with Opus 4.6: they are running Opus 4.6, they keep hearing about agent teams, they're giving it prompts like "build a team of agents to do this and this," and it's not quite doing it. That's because you do have to enable this. So you have to add an env setting, the Claude Code experimental agent teams flag, and set it equal to 1. Okay, nothing too crazy. Once you do that, all of that becomes possible. So with that in mind, you're pretty much ready to go there.
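Put together, the settings.json Morgan describes might look something like the sketch below. Treat the exact key names, especially the env-flag name, as assumptions reconstructed from the spoken description rather than verified documentation:

```json
{
  "model": "claude-opus-4-6",
  "env": {
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
  }
}
```

As noted above, `"model": "opus"` should also work, since 4.6 is the newest Opus at the time of recording.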
Then you can just run claude in the terminal and you're good. For people using the API, the one thing I did want to point out is a pretty cool new addition called adaptive thinking. Also, just to be clear, because I'm seeing confusion on this too: this is in the API, not in Claude Code itself. With adaptive thinking, to show it here, you're able to pick the level of effort you'd like the model to use. An effort level of max, by the way, is only going to work on 4.6. Here are the different levels. With max, Claude always thinks, with no constraints on thinking depth, and it's Opus 4.6 only, so requests using max on other models are going to return an error. So if you're calling the API, you set the effort level to max, and you get an error, you're probably not using Opus 4.6. Here's the example where you can see: I'm calling the API, I set the model to claude-opus-4-6, and here's where I set the effort. And this is another thing: if you're using existing API code, you may have the model set to opus-4-5, and when you adjust the effort to max, it gives you an error. All you need to do is bump the version and you're good. But this is a neat thing they've added to the API with 4.6 that's worth mentioning. The last thing I'd say: if you want to use split panes for agents, so agents show up in different panes, and you're using something like Warp, just make sure to install tmux. You can do this with brew install tmux. If you do that, it's going to default to auto, which usually means in-process, meaning the agents are all going to be working together in that same terminal window you have open. If you want split panes, you just need to update that setting in settings.json to split panes.
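A minimal sketch of the adaptive-thinking API call described above. The field names (`thinking`, `effort`) and the payload shape are assumptions reconstructed from the episode, not verified against official API docs; the sketch only builds the request body, it doesn't call the network:

```python
# Hypothetical payload builder for the adaptive-thinking call.
# Field names ("thinking", "effort") are assumptions from the episode.

def build_request(model: str, effort: str, prompt: str) -> dict:
    """Build a Messages-style request body carrying an effort level."""
    if effort == "max" and not model.startswith("claude-opus-4-6"):
        # Mirrors the server behavior described in the episode:
        # effort="max" is Opus 4.6 only and errors on other models.
        raise ValueError("effort='max' requires claude-opus-4-6")
    return {
        "model": model,
        "max_tokens": 4096,
        "thinking": {"effort": effort},
        "messages": [{"role": "user", "content": prompt}],
    }

# The "bump the version" fix from the episode: opus-4-5 plus max fails,
# switching the model string to claude-opus-4-6 succeeds.
payload = build_request("claude-opus-4-6", "max", "Plan this refactor.")
print(payload["thinking"])  # {'effort': 'max'}
```

The early `ValueError` just reproduces locally the error Morgan says the API returns, so a stale model string fails fast.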
I'm not going to go into super detail on that, but those are, I think, good housekeeping items to start with for anyone using Opus. And don't worry: really, all anybody needs to do, especially if you don't even want to use agent teams, is make sure you're updated to the newest version and that the model is Opus, and you'll be using Opus 4.6. >> Cool. >> So, that's that. Before I get into the differences between Opus 4.6 and Codex, I thought I'd actually read this, because it was posted on Hacker News four hours ago, and as I was reading it, I was thinking, that's the best way to explain it. So I'm just going to read this little section, because I think they do such a good job with it. This person is saying, "What's interesting to me is that GPT-5.3 and Opus 4.6 are diverging philosophically, and really in the same way that actual engineers and orgs have diverged philosophically." I think this really nails it. "With Codex 5.3, the framing is an interactive collaborator. You steer it mid-execution, stay in the loop, course-correct as it works. With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human. That feels like a reflection of a real split in how people think LLM-based coding should work. Some want tight human-in-the-loop control. Others want to delegate whole chunks of work and review the result." And honestly, I think that says it beautifully, and it nails the differences. Everybody wants to pick a winner, like, "Oh no, Opus 4.6 is better," or "Codex is better," but it's different; it depends on what your methodology is. And I think what we're seeing now, not just with vibe coding but with AI-powered engineering overall, is the question of how you want to work with agentic coding.
Do you want a totally autonomous experience where you're sending agents out to do work, or do you want to work with an LLM like another teammate and pair-program with it? That's where you're now seeing a divergence, and I think you're going to see a lot of teams using both, because Codex really is your collaborator, and what they've added with 5.3 is really good mid-execution steering. Whereas Opus 4.6 is probably the best of the best now at being able to say, "I want to spin up three or four agents, I want them to go do stuff, and don't bug me; I want to trust they're going to do good work," and it's able to deliver. >> So are you saying that in some ways it's just a preference, depending on how you work? There's no right or wrong, basically. You're not wrong to be an Opus person; it just might be a preference. >> Yeah. Well, and you might be both. >> Right. That's true. >> It might turn out you're both. That's why, not to disappoint people here, we're not going to end this with me saying, "And so the winner is..." It depends on what you want to do. Everyone has a different methodology. So I'll dive in and try to make this part fast, because I know the fun part is probably us going in, playing around with both of these, and having them go head-to-head to try to build a competitor to Polymarket in however much time we have. But I'll start going through these at a high level, for anyone who wants to know the core differences and why this is so interesting. So, with Opus 4.6: much bigger context window. You have a million-token context window here. Very strong coherence over entire documents and repos. Designed for "load the whole universe and reason over it." And 5.3?
They talk about large context, but it's not a headline feature. I actually went back and forth with it to get it to give me a number, and the number is around 200,000 tokens, which is not that impressive; smaller than I thought it would be. But that's okay: it's optimized for progressive execution rather than total recall, so that number matters less. It's optimized for deciding what to keep in working memory. At a high level, what that means is Claude is better when the task is "understand everything first, then decide." GPT-5.3 Codex is probably better when the task is "decide fast, act, iterate": more of that pair programming, mid-task change of behavior. For coding benchmarks: Opus 4.6 is really good at codebase comprehension, refactors with architectural sensitivity, explaining why a system behaves a certain way, and it has less of a tendency to YOLO-write code. >> Which is, I think, something everybody wants. >> Yeah, exactly. So that's good for everybody, but especially for vibe coders who are getting started and may not be able to identify hallucinations; Opus 4.6 is definitely going to perform better there. But for teams building in large codebases, like me and my team are doing, that's also really important. So kind of a win for everyone there. GPT-5.3 Codex did win on SWE-bench Pro and Terminal-Bench; overall, it scored better on coding benchmarks.
So, probably better end-to-end app generation. Claude is kind of your senior reviewer or staff engineer; GPT-5.3 is probably your founding engineer. On agentic behavior: with Opus 4.6, the key one is multi-agent orchestration; that's probably the bleeding-edge feature in 4.6. And with 5.3 Codex, it's really task-driven autonomy: build, test, modify without being asked. But then there's task steering: you can watch it, you can go in, like your buddy's coding and you can say, "Oh, wait, wait, man. Why are you doing this?" You can stop it, it'll go, "Okay," and then you can restart. You can really fix things inline. That's much harder to do with Opus. With Opus, you'll kind of be stopping it and then starting somewhat fresh, though it has a pretty big context window, so it knows what it did. Claude is really asking, "Should we do this?" GPT-5.3 is asking, "How fast can I ship this?" Right? >> It's so cool, because it almost feels like they're different people. You know what I mean? They have different styles. >> Yes, totally. It's a good way to look at it. It's like a different personality type, right? And then, failure modes. Opus 4.6 might overanalyze; it's got a much bigger context window. It can hesitate when requirements are ambiguous, and it can stop short of full execution. 5.3 Codex can be overconfident and can lock in a flawed assumption early, but you can steer it back in the right direction if that happens. So that's a high-level overview of the two. >> Cool. That's helpful. >> Yeah. So, should we just dive in? I haven't tested any of this, so I have zero canned demos, because I thought it'd be more fun to try something together and see what happens. So, should we try it?
>> Yeah. >> Okay. So, let's see. I'm going to start with Opus, and I've got these prompts preloaded. I'm giving different prompts because, like you said so well, it's like you're talking to different people. When I'm talking to Opus, I can tell Opus, "Build me a team," and here's what I want each member of the team to do. When I'm talking to Codex, I can't really tell it to build me a team, but I can tell it to think about stuff. So the prompt I'm going to give Opus is: build a competitor to Polymarket. Create an agent team to explore this from different angles: one teammate on technical architecture, one on understanding Polymarket and the ins and outs of prediction markets, one on UX, and one that just works on building really good tests to make sure everything works. For Codex, I'm going to give a slightly different but very similar prompt: still build a competitor to Polymarket, but now think deeply about technical architecture, understanding Polymarket and the ins and outs of prediction markets, good clean UX, and make sure it builds really good tests so everything works. And to be fair, I'm going to try to start these at around the same time. >> You're a fair guy, Morgan. >> I'm trying to keep it fair here, right? That's the only way to do it. Like I said, no winners or losers; it's just about... >> Just about letting everybody have a fair shot to play the game. >> Yeah. >> All right. So, let's see. I'm going to make different directories for these; let's just call this one opus-4-6-polymarket-competitor. All right, let's fire up claude in here. By the way, if you want to check what you're running, just to really make sure you're in a good place with the model, type /model. I can see here: claude-opus-4-6, right? So I'm good there. I'm going to take this prompt and copy it. Make sure this is all copied in correctly.
Okay, got that. I'm not going to hit enter yet; I'm making this totally fair. I don't want anyone at Anthropic or OpenAI to get upset with me. I want to be on good terms with both of them. >> Totally. Smart guy. >> Let's see. Oh, wait. Actually, you know what? I do want to create a new folder for this. >> But we are keeping it real. We're being objective. Neither myself nor Morgan is affiliated with either... well, actually, I don't know about you. I'm not affiliated. >> No, I'm not. >> Or OpenAI? >> Nope. >> Yeah. >> Nope. >> I love them both equally. How about that? >> Yeah. >> Okay. And I'm going to try to start them as close to the same time as I can. Enter. Go. All right, they're off to the races. >> So, what do you think's going to happen? >> That's a great question. Well, I know right now, because I told Opus 4.6 to build using different teammates, it's going to do that. You can see here it says, "I'll build a Polymarket competitor by launching parallel research agents first, then synthesizing their findings into a comprehensive implementation plan and codebase." This is brand new, right? If I did this with Opus yesterday, it wouldn't be possible. That's kind of the difference here. The way Codex is working is the way things have always worked, right? If you look, this is the individual person. It's not saying, "Okay, I'm going to launch all these different agents and compare what they say." It's saying, "Okay, I'm going to inspect the workspace."
This is your really detail-oriented, really senior founding engineer, like that earlier example, right? Whereas over here you can see it's already launched these agents, and now it wants to do web searches, and I'm going to let it do that. So multiple agents were asking to do web searches, and now it's launching all four research agents in parallel. So this is off. I've got my technical architecture agent, and these other agents are both doing web searches right now. One is looking at prediction-market order-book matching-engine architecture, so this one's learning about engine architecture for prediction markets. This one's looking at "Polymarket how it works, binary prediction market mechanics." The UX design agent is doing some design research, and then we've got some test research. Okay, now it's going to go to Polymarket, and let's really hope Polymarket doesn't block it, because that'll make things harder for it. Meanwhile, over here, Codex has figured out the repo is empty, so it's going to scaffold it from scratch, and it's starting: "I'm now wiring the core market math and trading engine." So it's interesting, right? Codex is out here building; it's building the engine. With Opus 4.6, it still has agents out there doing research work. >> Yeah, you really start to see just how different they are as they make progress. >> Yeah. And like I said, I haven't touched this before, so we don't know how long each of these will take. >> Totally. Yeah. And I guess one question I have is: is one model better for a beginner, a nontechnical vibe coder, or does it really matter? >> Yeah, it's a good question.
I mean, I think the fair answer would probably be Codex, because Codex edged out Opus 4.6 a little on some of those coding benchmarks and is kind of known for writing better production code. At the same time, one of the downsides, and like I said, I could only do this in a totally balanced way because they're so different: for a vibe coder, knowing when to interject and stop Codex and say, "Oh wait, you're doing this this way; can you instead look at doing it that way?" They're probably not going to know how to do that, right? And that's where maybe Opus 4.6 is better, where you could say, "Okay, spin up four or five agents and let them work with each other." >> Yeah. >> Okay. Codex is done. All right. So Codex built a competitor to Polymarket in 3 minutes and 47 seconds. >> And to be clear, Polymarket's a multi-billion dollar company. >> Yeah, I don't think this will work quite as well. But we'll see. So let's just check whether it worked first; I'll let this keep running here. It'll tell you at the end: it actually did the testing, and you can see it built a test suite. It has an LMSR math unit test suite, an engine behavior unit test suite, and an API integration test suite, and it passed 10 out of 10 tests. As far as what it built: it has this core LMSR market-maker engine, so coherent pricing, slippage, bounded-loss behavior, a trending engine. It built a REST API router, which is kind of interesting, because I didn't tell it it would have to build any of this in any way; it figured out the architecture on its own. Clean, responsive front end. All right, well, let's see if it actually works. So let's go here. I'll let this keep running; it has these four agents just running away. And I'm in here, and I'm going to do npm test.
All right: tests, 10; passed, 10. That looks good to me. >> Oh, yeah. >> npm start. All right, it's running. Let's see. Okay, here we go. So it looks like it has the ability... Greg, let's make you the first trader. All right. >> Okay. >> Say "add." Okay, you've got a thousand bucks. >> Okay. >> There we go. Not bad. All right, what market do you want to create? >> Well, Bitcoin, I think, as we speak, has crashed to what, 63,000 or something? >> Something like that, yeah. >> So I do like: will BTC be above 110K by... >> Yeah. Okay, by Dec 31. >> Pretty good. Yeah. Okay. >> Pretty good. >> I mean, that's almost double. >> Yeah, that'd be pretty good. >> It depends. Depends when you bought it, you know. If you bought it at 125K, then... >> Yeah. >> You're not so happy about it. >> Let's see. So then, I don't know what "resolution criterion source" would mean. I think I know what it's getting at, but I guess you could say: use CoinMarketCap as the source and resolve by looking at the price on the last day of December, just before midnight, I guess. Yeah, the price of BTC. All right. >> Okay. It looks like it's okay. So we've got it now. We'll use CoinMarketCap. >> Okay. So then you could do a Yes at 50%. So what do you think, yes or no? >> I mean, this isn't financial advice; this is purely for educational purposes. But I think so. I think yes. >> All right, that's a yes for Greg. Buy. Let's see, how many shares do you want to buy? You've got a thousand bucks. >> I want to put it all in. I'll put it all in this. >> I don't know how much it is per share. Let's see if a thousand is right. Okay. Yeah. Okay, trade executed. >> Okay. So it seems like it built something relatively functional here, as a prototype.
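The pricing behavior in that demo, a fresh market starting at 50% and a $1,000 order not buying 1,000 full-priced shares, is characteristic of the LMSR (logarithmic market scoring rule) engine Codex reported building. Here's a minimal sketch of the LMSR math; the function names and the liquidity parameter value are illustrative assumptions, since the demo's actual code isn't shown:

```python
# Illustrative LMSR math for a binary (YES/NO) prediction market.
# b is the liquidity parameter; the value here is an assumption.
import math

def lmsr_cost(q_yes: float, q_no: float, b: float = 100.0) -> float:
    """Cost function C(q) = b * ln(e^(q_yes/b) + e^(q_no/b))."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))

def yes_price(q_yes: float, q_no: float, b: float = 100.0) -> float:
    """Instantaneous YES price: e^(q_yes/b) / (e^(q_yes/b) + e^(q_no/b))."""
    e_yes = math.exp(q_yes / b)
    e_no = math.exp(q_no / b)
    return e_yes / (e_yes + e_no)

def buy_cost(q_yes: float, q_no: float, shares: float, b: float = 100.0) -> float:
    """Cost to buy `shares` YES shares: C(after) - C(before).
    The price rises as you buy, which is the slippage seen in the demo."""
    return lmsr_cost(q_yes + shares, q_no, b) - lmsr_cost(q_yes, q_no, b)
```

With no shares outstanding, `yes_price(0, 0)` is exactly 0.5, the 50% starting point Greg saw, and each YES purchase pushes the price up, so a large order costs more per share than the quoted price.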
I guess it actually has decremented. So, okay, a thousand shares was not... that ended up being about $24 that you spent. So you've got more money if you want to create another market. But it worked. It's not returning an error, and it shows the volume here. Interesting. All right, so let's go back. So far so good with that, I'd say. >> Let's see what's going on here. Okay, so first off, look at how many tokens. People have been talking about how token-hungry Opus is, and it's very token-hungry. Each one of these agents has used over 25,000 tokens. But let's see. They finished, right? The technical research around architecture is done, the prediction market research is done, the UX design research is done, the testing strategy is done. Now it's going to go build. So it's writing the package.json. >> Did you see the ads Anthropic launched about ads? >> Yes, I watched them all. They're hilarious. Although, actually, I guess Sam was not very happy about them today. I saw a tweet from Sam that was less than happy. I found them hilarious, but I also understand his side as well. >> It seems like Anthropic is sort of anti-ads for now, and ChatGPT is going to be introducing ads. >> Yes. >> And when I'm watching this and I'm seeing you go through 25,000 tokens, 25,000 tokens, 25,000 tokens, I'm like, yeah, of course Anthropic doesn't really need to, you know. >> Yeah. I mean, if you add that all up, you're talking about over 100,000 tokens used in doing this. So I think that's one of the very good things for investors in Anthropic, right? With agents, and agents now probably being the new killer feature in Opus... >> Yeah. >> You're going to take whatever token usage and multiply it by the number of agents. >> Exactly. It's actually really smart.
And I wonder if that was the thinking. Were they like, "How can we get people to use more tokens? Oh, we'll just spin up agents and design it like that"? Or did they think, "Okay, how can we design the system that's best for the use case," and then, "Okay, we'll monetize it like this"? I don't know. >> Yeah. Probably a combination of the two. I can tell you I've never used as many tokens in one day as today. >> So it's working. 100,000 tokens is roughly how much in US dollars? >> I don't know, because I have a Claude Max plan. >> Yeah. >> So I'm not paying per token, and we're not hitting any limits right now, so I'm not paying more than the $200. I can tell you that. >> Yeah. My guess is we're talking about the $200 Max plan. Do you remember approximately how many tokens you get? >> That's a good question. Let me check. Let me fire up Claude and ask it. Let's see here: how many tokens do I get? Estimate. Okay, here you go. Estimates: 45 million tokens per month of Sonnet. But let's see: what's your estimate for Opus 4.6? >> It's like they don't really want you to know. >> No, they're making it a little harder. Okay, they're not even going to tell me, actually. They're just going to say there's no public data; it's very new. Opus is roughly 5x more expensive. >> So then... >> If it's 45 million, divided by 5, about 10 million is probably the answer. >> Yeah. >> So then, if we're doing quick math, let's just say we spent 100,000 tokens. 100,000 divided by 10 million is, you know... >> Well, we're going to spend more than that, because look at this: we're now over 17,000 tokens on top of that in this next build. >> Okay.
So, still, let's say we use even a million tokens building a competitor to Polymarket right now; we're still only using a tenth of what the plan allows. That's not terrible. >> No, I mean, it's $20, which is like the price of a cocktail in Miami. >> Yeah, exactly. So, let's see. But now, as I say, I'm watching the tokens creep up. All right, so it's building the API routes now. >> I have a feeling this is going to be a better end result. >> I was actually just going to say that. Maybe it's just because there were four agents doing all the work beforehand and now it's doing the build, but it feels like we're going to see something very different when we load what it built. >> Yeah. We didn't give it any visual design direction, right? >> No. >> Do you recommend folks just get the MVP app out, play around with it on localhost, click some buttons, and then update the visual design from there? >> It's a good question. I'm about 50/50. Sometimes, if I have something in mind, especially if I want something on-brand, suppose I'm building something that's going to be in the same ecosystem as, say, opencloud.ai and moldbook.com, I'd probably say, "Hey, I want to design a site that looks somewhat similar to, or is inspired by, opencloud.ai and moldbook.com. Take a look at those sites and get inspiration." These models are great at doing stuff like that. >> Cool. I'm really excited to see what this is doing, though. I think we're now well over 200,000 tokens from what I can tell, but we're not at 10 million. >> We're not hitting any limits. We don't have to take out second mortgages. >> Yeah, there you go. >> It's still going, though. >> I guess we can tell... here's an interesting thing. In comparison, this is still going.
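Spelled out, the back-of-envelope math the hosts are doing looks like this. Every number is an on-air estimate (Claude itself declined to confirm Opus limits), not published pricing:

```python
# Rough plan-budget math from the conversation; all inputs are
# the hosts' estimates, not official figures.
PLAN_PRICE_USD = 200                  # Claude Max plan
SONNET_TOKENS_PER_MONTH = 45_000_000  # Claude's own estimate for Sonnet
OPUS_COST_MULTIPLIER = 5              # "Opus is roughly 5x more expensive"

# 45M / 5 = 9M, which the hosts round to ~10M Opus tokens per month.
opus_budget = 10_000_000

tokens_used = 1_000_000               # generous guess for the Polymarket build
fraction_of_plan = tokens_used / opus_budget
implied_cost = fraction_of_plan * PLAN_PRICE_USD

print(f"{fraction_of_plan:.0%} of the month's budget, about ${implied_cost:.0f}")
# → 10% of the month's budget, about $20
```

That's where the "a tenth of what it can do" and the $20-cocktail comparison both come from.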
Why don't we say, since the design looked kind of bland to me, right? >> Yeah. Yes, it did. >> "Can you spruce it up and make it look nicer?" Because we may as well have Codex working away too, right? >> Yeah. So you didn't really give it anything specific, like "it should look like square.com." >> No. So we'll see. >> We'll basically see if 5.3 has a little bit of taste. >> Yeah. So that's what it's saying now: "Okay, I'll upgrade the visual system without changing functionality: stronger typography, richer color direction, better card hierarchy, and purposeful motion." I don't know what that means, but we'll find out. All right. Okay, so now it's editing index.html. It looks like it's going to add motion and hover polish. Okay, cool. Look at that: this current task is over 30,000 tokens building the front-end UI. >> Okay, it's done. >> Yeah, Codex is fast, by the way. I mean, that's pretty darn fast. >> Yeah. >> So we should be able to just go here. It should have already automatically reloaded, but okay. >> All right. I mean... >> Not that different. >> It's not that different. >> Can I try something? >> Yeah, go for it. >> I'm going to say: "Okay, thank you, but this was a minor design refresh. I'm looking for a major one." >> There you go. Yeah. >> And then I'm going to say: "Pretend you are Jack Dorsey." >> Here you go. >> "How would he design this website to be clean, elegant, and full of interesting interactions?" >> Yeah. Great. >> Jack Dorsey, for people who don't know: co-founder of Twitter, now X, and of Square, now Block. >> Mhm. >> He's a design guy. He's just the first person who came to mind. >> Yeah, that's a good one. That's a good prompt. Let's see.
So, "I'll do a full visual rearchitect, not an incremental tweak: new layout language, stronger typography, monochrome-first palette." Interesting. "Interaction-driven cards." Da da da. Okay. You know what's interesting, and maybe we're not quite at AGI yet, is I would have kind of hoped that it would say, "Let me go find some art." Like, if you told me, "Hey, Morgan, can you redesign this like Jack Dorsey?" I would be like, "Yeah, let me go look at some articles about Jack Dorsey's design aesthetic." >> Exactly. >> I'm surprised it's not doing that. Instead, it's going like >> I am assuming it knows who Jack Dorsey is, although I don't know if it actually does. It just seems like it's really just taking >> this part of your question and going, "Oh, okay. Major refresh. I'll do that." >> Well, can't you ask it? Can't you say, "Do you know who Jack Dorsey is?" >> Let's see. I'm supposed to be able to cut it off in the middle. So, let me see. Yeah. Do you know who Jack Dorsey is? Let's see. Okay. So, here we go. This is the midstream test. It's thinking about it. 43,000 tokens over here. Okay. Yes. Okay. Here we go. Yes. Jack Dorsey is the co-founder of Twitter and founder of Block, formerly Square, with a design style that's typically minimal, restrained, and interaction-focused. >> Beautiful. >> Okay. All right. So, touché. It showed us. >> Yeah. >> Now, here's the weird thing. It looks like it's >> Is it complete, or do we have to say it's >> Complete? Right when I was saying that, like, are you done, or did you stop because I asked a question? >> Yeah, this is really interesting. I will say, like, the... oh, "I paused when you asked the question. The major redesign," >> Oh, >> "mostly. If you want to resume..." So that's weird. So, you ask a question and it just stops. But, so, like, yes, of course, continue. >> Yeah. >> Okay. >> So, that's actually some weird UX.
Like, it obviously should just continue after, >> right? >> Yeah, I would assume that. It's such a weird thing, because it said, yeah, if you want, I'll resume. Now, >> I will say I do like that you can edit things midstream. >> Yeah. >> Like, that's how my brain works. >> Totally. >> Yeah. This is using a ton of tokens. It is amazing to see the detail. I mean, this should be a work of art, whichever side this comes out on. >> All right. This is done. So, now let's see. We can go back to this and, okay, I mean, I'm not blown away, but it's okay. >> "Opinions become price in milliseconds. Trade conviction, not noise. Signal Market is designed for fast thesis iteration with transparent pricing." Okay. I mean, I would push it more, I think. >> Yeah. I guess. Yeah. Like >> go for it. >> I would say, that's not the Jack Dorsey I know. >> I don't know, Jack Dorsey. >> I was looking for a CAPS LOCK major upgrade. That might mean way more copy, way more images. >> Yeah, way more storytelling. >> Yeah, exactly. >> Etc., etc. >> Yeah. I'll just say, seriously, take your time, go nuts. >> Yeah. >> "What are credits?" >> Famous last words, Morgan. >> I know, right? It's like, oh, perfect. That's like a signal within the OpenAI headquarters, like, we finally got someone. >> Totally. It's a whale. >> All right. So, Opus has finished. >> Yeah. I have no idea how many credits I used. Actually, let me ask it: how many tokens in total did you use to put all of this together, including the four agents? And then we can... oh, it's using tokens to answer. >> That's so funny. >> Okay. It doesn't know. Let's see. Okay, here we go. Okay, it's estimating. It actually doesn't know, which is weird, because it should know. Although, I wonder if I can actually do /cost. Oh, here we go. Yeah. Okay. >> Yeah. >> Oh, it doesn't. Okay. "No need to monitor cost." Okay. So, they really don't want you to know.
>> Okay. It's guessing 150 to 250,000 tokens total. Okay. >> That's probably right. >> Okay. Sure. >> Okay. So, here's what it's done. First off, one really interesting thing here is, you know, Codex created 10 tests, right? Opus created 96 tests. >> So definitely a lot more detail on the testing side. And Opus called it Forecast, whereas Codex called it Signal Market. So, different names. "A Polymarket competitor is built and verified. Here's what each team member delivered." The architecture technical lead decided: modular monolith, Next.js 14 App Router, central limit order book, database schema, REST API. Okay. The prediction market domain expert: binary yes/no market where YES plus NO is always a dollar. Okay. Seeded markets across crypto, politics. Okay. The UX design lead: dark mode trading platform pages, green for yes, red for no. Okay. The testing QA lead did order book tests. Okay. So, here's how the tests are broken down: order book test, matching engine. Okay. All right. npm run dev to start the app. So, let's go in here. Oh, interesting. Okay. I don't want to say anything, actually. I've already given it to it. What's your initial take? >> I mean, my... hello, Jack Dorsey. You know what I'm saying? This is what I expected it to look like when we pushed Codex. >> Yeah, me too. >> This looks really clean. What happens when you hover over... >> Oh, yeah. Look at that. Yeah, >> it's got hover states. >> Hover states. And it's obviously got it organized, like those sports ones, you know, "Will the next Super Bowl have over 120 million viewers?" >> "Will AI pass the Turing test by 2027?" "Will the movie..." It's got stuff in there. Yeah. >> Yeah. This doesn't even feel like an MVP. >> Yeah. This is pretty wild, actually. >> And it created some stuff, you know, that we never talked to it about, right? Like a leaderboard, >> which it's already populated with some initial stuff, a portfolio section. >> Yeah. Interesting.
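The architecture the Opus team describes, a central limit order book over a binary market where YES and NO prices always sum to a dollar, can be sketched with a toy matching rule: a YES bid at price p can cross a NO bid at price q whenever p + q ≥ $1.00, since together they fully fund the $1 payout one side will receive. This is an illustrative sketch of the mechanism, not the generated app's actual code; all names here are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Order:
    side: str      # "YES" or "NO"
    price: float   # dollars per share, between 0 and 1
    qty: int       # number of shares

@dataclass
class BinaryOrderBook:
    """Toy CLOB for a binary market: YES at p matches NO at q when p + q >= 1.0."""
    yes_bids: list = field(default_factory=list)
    no_bids: list = field(default_factory=list)

    def submit(self, order: Order):
        fills = []
        if order.side == "YES":
            book, other = self.yes_bids, self.no_bids
        else:
            book, other = self.no_bids, self.yes_bids
        # Cross against the best (highest-priced) resting bids on the opposite side.
        other.sort(key=lambda o: o.price, reverse=True)
        while order.qty > 0 and other and order.price + other[0].price >= 1.0:
            resting = other[0]
            traded = min(order.qty, resting.qty)
            fills.append((order.side, order.price, traded))  # record the incoming fill
            order.qty -= traded
            resting.qty -= traded
            if resting.qty == 0:
                other.pop(0)
        if order.qty > 0:
            book.append(order)  # rest the unmatched remainder on the book
        return fills
```

For example, a YES bid at $0.60 rests unmatched; a later NO bid at $0.45 crosses it (0.60 + 0.45 ≥ 1.00) and both fill.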
So, let's see now. So, yeah, I mean, I'm more impressed with this. It was maybe worth the 150,000 tokens. >> Feeling better about it. "Will SpaceX land humans on Mars before 2030?" Only 8%, it thinks. Huh. Oh, yeah. Look at this, actually. Whoa. >> This is insane, bro. That's clean. >> That's clean. >> Yeah. >> I wasn't expecting to click in and actually get a well-designed page like this, >> huh? So, do I have to sign in to trade? I don't know if I'm going to be able to sign in, because I haven't set anything up. Let me just check. >> Well, you can sign up. It says, don't have an account? >> Oh, yeah. Sign up. I don't know if it's got it all connected, though. Let's see. >> All right. You know what? Actually, I'm gonna take the username Greg. I'm gonna steal your username. >> All right. Let's see. Okay. So, yeah, it's probably because, I was gonna say, the database isn't wired up yet, right? >> So, I'm not surprised that I would actually have to do that. I wasn't expecting to, but I get it. I mean, it's clean. This is pretty neat. >> Yep. >> Yeah. All right, let's see. I don't know about you, but this is the last chance I'm going to give Codex on the design side. It's out of opportunities here. >> Oh, here, it's funny, though: in the end, it's acting a little bit like Data from Star Trek on your question. "What are credits? In this context, credits usually means..." >> That wasn't quite it. That's good, though. Okay, so let's take a look and see. All right, here we go. The new version of Signal Market. >> Boom. >> Oh, wow. Okay, this is getting a little bit interesting. Let's see here. "Read the manifesto." >> Yeah. I mean, I don't hate it. >> It's definitely better. Yeah. >> I mean, it's got a lot going on. >> Yeah. >> But it's different, kind of terminal-style. It's different than, uh >> Yeah.
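The "was it worth 150,000 tokens" back-of-envelope above is easy to make concrete. The per-million-token rates below are placeholder assumptions for illustration, not official Opus 4.6 or Codex pricing; check each vendor's current pricing page.

```python
def estimate_cost(input_toks: int, output_toks: int,
                  in_rate: float = 15.0, out_rate: float = 75.0) -> float:
    """Rough API cost in dollars.

    in_rate / out_rate are assumed $ per million tokens (hypothetical
    defaults, not real published prices for any specific model).
    """
    return input_toks / 1e6 * in_rate + output_toks / 1e6 * out_rate
```

For example, 100k input and 150k output tokens at those assumed rates works out to $1.50 + $11.25, roughly the "price of a cocktail in Miami" range discussed earlier.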
It's different than any sort of prediction market app I've seen. >> Yeah. From a UX perspective, I just feel like this is so clean, though. >> This is so good. It's so fast. Yeah. I mean, like I said, I'm not going to say which one is... it's not that Opus is better than Codex or vice versa, but I would say, in this test, Opus won. >> Yeah. In this test, Opus won. >> Yeah. >> That's just the truth. >> Yeah. But we could give it another... I mean, you know, you never know. I think what's interesting about this is, I mean, Codex built it, I don't know how much faster. We can look at the timing on this video, but like 20 times faster or something, right? >> Yeah. >> Well, anything else you want to cover? I don't think we'll have time to do another example, but anything else you want to leave people with? >> Yeah. Let me see if there is anything else in here. I covered the adaptive thinking. Oh, I guess just on the orchestration, I would say this is the feature I'm probably the most excited about with Opus, and clearly we saw it working really well in this example. Just make sure to look at the documentation. It's all in the docs now. And it gives some examples as well, because it has this idea of comparing agent teams with subagents in terms of context, communication, and coordination, and kind of breaks this down. And then it has a sample prompt, and more of the details on the display mode. There's a lot of other stuff that I didn't go into there. That's probably what I would leave people with, because I think a lot of people are going to want to dive in and use agents with Opus 4.6, and they've got pretty good details on all the little tweaks that you can make with it. >> Amazing. Well, Morgan, I can't thank you enough for coming on. I hope people love this episode. I love talking to you.
You get super... >> Thank you for having me, Greg. It's a total honor. >> Yeah, it's just, I love how clearly you communicate to technical people, but also nontechnical people, and you're criminally underfollowed. So, I'm going to include >> I appreciate it >> links where you can find Morgan and follow him on X. He talks a lot about vibe coding over there. And Morgan, anything else, places that you want to leave people to go and check you out? >> Yeah, I mean, I'm the co-founder and CTO of Bold Metrics. I'll just give a little plug for us. We have AI technology that's used by apparel brands and retailers. So, if you're shopping online and want to find the right size, you see a find-my-size button. A lot of times, we power that, and we have really powerful machine learning models that update and adapt over time to help people find the right size, and give lots of really interesting data to lots of amazing brands and retailers that you probably all know and love. And me and my team, you know, we're using all of this tooling. Like, I had a meeting with my team this morning about Opus 4.6 and Codex, and I've given everybody access to both of these, and I actually have multiple teams trying them on current things we're working on, testing each to see which performs better. So, the one thing I encourage all engineering teams and engineering leaders to do is let your teams loose with this stuff. Let them try it. Some of this stuff is really cutting edge and really performant, and it gives us the opportunity to do better, more creative work. >> Yeah. Stop listening to us right now. >> Exactly. Stop. >> Yeah. >> X out of this YouTube or Spotify link... but actually, give us a like, a comment, and subscribe. Let us know if you liked this episode. Morgan, thanks again for coming on the show. This was a lot of fun. See you next time.
>> Thank you so much. Total honor.

Summary

This video compares Claude Opus 4.6 and GPT-5.3 Codex in a head-to-head test, demonstrating their different approaches to AI coding (Opus 4.6's autonomous agent teams versus Codex's interactive collaboration) and revealing which model excels at building a complex application like a Polymarket competitor.

Key Points

  • The video compares Claude Opus 4.6 and GPT-5.3 Codex in a live coding challenge to build a Polymarket competitor.
  • Opus 4.6 uses multi-agent teams for deep planning and autonomous execution, while Codex offers interactive collaboration with mid-execution steering.
  • Opus 4.6 excels in comprehensive testing, architectural understanding, and producing detailed, production-ready code.
  • Codex is faster and better at iterative coding, but requires more human oversight and can be overconfident.
  • Opus 4.6's agent teams feature is a key innovation, enabling complex tasks through coordinated, autonomous agents.
  • Both models have different strengths: Opus 4.6 for deep reasoning and full autonomy, Codex for rapid prototyping and human-in-the-loop control.
  • The test shows Opus 4.6 produced a more polished, complete application despite taking longer.
  • Users should configure their tools correctly to access Opus 4.6's features like agent teams and adaptive thinking.
  • Token usage is high with Opus 4.6 due to multi-agent workflows, but still within typical plan limits.
  • The choice between models depends on the user's coding style—autonomous delegation versus collaborative pair programming.
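The adaptive-thinking point above can be sketched as an API request. This only builds the request payload (it does not send anything); the model id string and the effort-to-budget mapping are illustrative assumptions, not documented values, so check Anthropic's extended-thinking docs for the exact parameters your model version supports.

```python
def build_request(prompt: str, effort: str = "high") -> dict:
    """Build a messages-API payload with an extended-thinking budget.

    The idea: scale how much the model 'thinks' to match task complexity.
    Model id and budget numbers are hypothetical placeholders.
    """
    budgets = {"low": 2_000, "medium": 8_000, "high": 32_000}
    return {
        "model": "claude-opus-4-6",  # placeholder model id; verify the real one
        "max_tokens": 64_000,        # must exceed the thinking budget
        "thinking": {"type": "enabled", "budget_tokens": budgets[effort]},
        "messages": [{"role": "user", "content": prompt}],
    }
```

A simple one-off design tweak might use `effort="low"`, while a "build a Polymarket competitor" task would justify `effort="high"`.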

Key Takeaways

  • Configure your settings.json to ensure you're using Opus 4.6 and enable experimental agent teams for multi-agent workflows.
  • Use Opus 4.6 for complex projects requiring deep planning and autonomous execution, especially when you want to delegate large tasks.
  • Use GPT-5.3 Codex for faster prototyping and when you want to maintain tight control and make mid-execution adjustments.
  • Leverage adaptive thinking in the API for Opus 4.6 to control the depth of model thinking based on task complexity.
  • Both models have different strengths; choose based on your workflow—autonomous agents or interactive collaboration.
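The first takeaway above mentions `settings.json`. A sketch of what that file might contain follows; the `model` key reflects how Claude Code is typically configured, but the exact model id string and the agent-teams flag name here are placeholders, so verify both against the current documentation before relying on them.

```json
{
  "model": "claude-opus-4-6",
  "env": {
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
  }
}
```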

Primary Category

AI Engineering

Secondary Categories

LLMs & Language Models AI Agents Programming & Development

Topics

Claude Opus 4.6 GPT-5.3 Codex AI model comparison agent teams autonomous agents interactive pair-programming Polymarket competitor live coding demo model configuration token usage

Entities

people
Greg Isenberg Morgan Linton
organizations
Anthropic OpenAI Bold Metrics
products
Claude Opus 4.6 GPT-5.3 Codex
technologies
Next.js

Sentiment

0.70 (Positive)

Content Type

comparison

Difficulty

intermediate

Tone

educational analytical entertaining technical enthusiast