Using Veo 3 to create AI-generated music videos, like a Tiny Desk concert with Notorious B.I.G.
It's like the most creative satisfaction I've had in my whole life. So, I generated all these clips in a pretty straightforward way. I used GPT-4o to help me with the prompts. Said, "Hey, help me capture grunge 1990s Seattle, inspired by some of these music videos." And then, as you can see, it gets progressively more camcorder-grimy. So, I generated all this stuff and then I threw it together into a music video. All right, let's watch it.
>> You get the patented Claire Vo raised-hands reaction on this one. I cannot believe this is AI-generated. It's so high quality. It's so specific in aesthetic, in wardrobe, and in emotion. You have inspired me. After this podcast, what music video am I going to make? It's so much fun.

Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today we have a fun and inspiring episode with Anish Acharya, general partner at Andreessen Horowitz and consumer AI investor. But we're not going to talk about portfolio companies or the future of AI. No, we're going to use AI to build music videos, analyze our bookshelves, and help us plan our personal finances. Let's get to it.

To celebrate 25,000 YouTube followers of How I AI, we're doing a giveaway. You can win a free year of my favorite AI products, including v0, Replit, Lovable, Bolt, Cursor, and of course ChatPRD, by leaving a rating and review on your favorite podcast app and subscribing on YouTube. To enter, simply go to howiaipod.com/giveaway, read the rules, leave us a review, and subscribe. Enter by the end of August, and we will announce our winners in September. Thanks for listening.

This episode is brought to you by Notion. Notion is now your do-everything AI tool for work. With new AI meeting notes, enterprise search, and research mode, everyone on your team gets a notetaker, researcher, doc drafter, and brainstormer. Your new AI team is here, right where your team already works. I've been a longtime Notion user and have been using the new Notion AI features for the last few weeks. I can't imagine working without them. AI meeting notes are a game-changer. The summaries are accurate, and extracting action items is super useful for stand-ups, team meetings, one-on-ones, customer interviews, and yes, podcast prep. Notion's AI meeting notes are now an essential part of my team's workflow. The fastest-growing companies like OpenAI, Ramp, Vercel, and Cursor all use Notion to get more done. Try all of Notion's new AI features for free by signing up with your work email at notion.com/howiai.

Anish, I am so excited to have you here, and let me tell you why. It is because I have spent the majority of this podcast talking about enterprise B2B product management, how to manage your manager, or manage yourself as a manager, or how to vibe code. That has been the topic of How I AI, and today we are just going to have a little bit more fun. So why did you start to come to these AI projects that are a little less work-related or technical, and actually just a little bit more fun? How did you get here?
>> Great. Well, I'm excited to have some fun today. I mean, I've been passionate about music forever. I think most of us are. I've been DJing and making music for 30 years. But music is very constrained. You know, there's only so many ways you can work with it. An example of that is if you look at a track that has all the instruments mixed down into a final MP3 or WAV file.
There's no way to just extract the vocal or just extract the drums. So you're really limited by a set of choices that were made in the studio. And with AI, you can do all this crazy stuff, like disentangle a track into just the vocals and just the instrumentation. So what really got me excited at first was everything you could do with AI and audio. And then that, of course, fed into all of the new video models and video gen and lip sync and all the new technologies we're seeing. So it's like the most creative satisfaction I've had in maybe my whole life.
>> Yeah, I agree with you. One of the things I have so much fun with on AI is, people are really worried that it takes away the most fun, most human, most creative parts of not just building things, but creating music, creating art, creating writing. And I, in fact, feel like it just gives me so many more tools, so much more breadth, so many more things I can play with and build. And so it really opens up this creative artist side of me in a way that has been really hard to access as an adult, also with limited time.
>> Yeah. It's actually a fun conversation we'll have over a glass of wine sometime. But if you look at music culture, music culture has kind of been defined by remix culture for the last 40 years. The mixtape, the cassette tape, was the first time that you could take the music and do something of your own with it. And then that, of course, evolved into hip-hop, which also sampled, and which also had a lot of suspicion on it. But sampling was the foundation of hip-hop. And I think AI is just the next manifestation of sampling, and it'll be as important for music as hip-hop was.
>> Well, we'll stop opining about AI and the arts. But the other thing that this remix culture makes me think about is the next step that we've seen in the past couple of years, which is audio and video remixing. These TikTok memes, these dances, these things where you're taking a snippet of creativity, turning it into your own thing, and then releasing it to the world in a new version. So I definitely think we're seeing this not just on the audio side, but also on the video side, which brings us to your use case. So tell me what you built, or what you created, maybe, and I'm excited to walk through how you got it done.
>> Amazing. Great. Tiny Desk is the best. If you haven't gotten into Tiny Desk, most people have seen it, it's just so cool. It's so fun, and of course, creativity loves constraints, and the constraints of Tiny Desk are incredible. There's a really good one from Clipse that just dropped last week, and there's an infinite number of them. It's a fun format. It's sort of like the Unplugged format of the '90s. So I love Tiny Desk, and I got to thinking about all the artists I'd want to see on Tiny Desk. And, you know, some of them are no longer able to be on Tiny Desk because they're not alive anymore. So that got me thinking about how I could do a Notorious B.I.G., Christopher Wallace, Tiny Desk. Do we have the tools and technologies, and of course, can we do it in a way that's respectful and not derivative? And I did it, and it seemed like it kind of worked. Maybe we can cut to it so your audience can check it out, and the workflow is pretty simple.
>> We'll do a little clip of it, I think, and then we can work through how it got there.
>> To all the ladies in the place with style and grace, allow me to lace these lyrical douches in your bushes, who rock grooves and make moves with all the mamis. The back of the club, sippin' Moët, is where you'll find me. The back of the club, mackin' hoes, my crew's behind me. Mad question askin', blunt passin', music blastin', but I just can't quit...
>> Okay, we love it. It's great. And you made that?
>> I did make it. Yes. And it took surprisingly little time. So let me show you exactly how I made it. I started with 4o. 4o is the best general-purpose multimodal model, in my opinion. I use it for everything. And I just asked it to generate an image of, and we're going to do Kurt Cobain today, that'll be fun, from Nirvana, of course, from when I was in high school, playing a Tiny Desk concert. So let's see what it comes up with.
>> While this is loading, you mentioned that 4o is the best kind of multimodal all-purpose model. I generally agree. 4o image gen had this super viral moment a couple of months ago when they released it. What do you feel like 4o image gen is particularly good at compared to some of the other image-gen models?
>> It's very good at prompt adherence. And I think that's because of the infrastructure underneath it. It's a different architecture from the diffusion-based models that preceded it. And BFL's Flux and a bunch of others do this now as well, and it's great. But I think it was just the most productive image model because you could manipulate it in such a fine-grained way.
>> Yep. And I remember the biggest improvement when the 4o image gen came out is that it could actually spell things and write letters out. That was a magical moment. So I have to call out that the NPR in the top corner of this image is actually done correctly.
>> Look, there he is with his guitar. Okay, I'm going to remove the guitar, actually, so that it is a cappella, because I think that might work a little bit better. But look, this is the vibe of Tiny Desk. It's as if you're seeing a photo from the '90s in the Tiny Desk studio. So I just love this, and I think we become so attuned to what's possible, we forget that this would be witchcraft three years ago. Witchcraft, right?
>> What is the purpose of this? Are you storyboarding? Are you creating an asset that's going to go into another tool? Why start with this flow?
>> So I'll talk through essentially what I'm going to do. There's this product called Hedra, which is, I think, the best way to take a still frame and add custom audio to it, so you create a video that is animated from the still frame and includes the audio with the right lip sync. There are a bunch of amazing tools to do this. Sync Labs is one of my absolute favorites as well. But Hedra is nice because it actually generates the video. So it does the text-to-video, or the frame-to-video, and then it also adds the audio. So what we're going to essentially do is take this frame, we're going to get the audio from YouTube, we're going to stem-separate the audio so we get the track we want, and then we're going to put them together in Hedra. And that's it.
>> This really is remix culture.
>> It's amazing, isn't it?
>> It is amazing. Okay. So the assets you really need to go into this video-gen lip-sync tool are two things.
You need a still image that can be used to generate the video, and then you need some sort of audio to sync it to. So I know we're looking at this music example, but what other examples have you seen people use this kind of workflow for?
>> I think we underestimated how useful it would be to add custom audio to video. One of the early examples was taking a speech that somebody was giving. Javier Milei did a really famous one, essentially changing the language to English and lip-syncing it; that went really viral a couple of years ago. And then, of course, you can imagine a character, a photo of a character that you generate, and then you want to animate them doing something and speaking at the same time. Stories are told this way, and these technologies make it really, really easy to do.
>> Oh, we got him. Great. Okay. So now he's got bad posture, but we'll allow it. Very grumpy.
>> I think he always did.
>> Yeah, exactly.
>> Okay. So now we've got Kurt. Now, Tiny Desk has a really specific acoustic aesthetic, which is that it sounds like live instrumentation. So for the Biggie example, I actually found a Biggie cover band playing live in Brooklyn, and I pulled that down from YouTube. And then I extracted the actual vocals from the Notorious B.I.G. and laid them over. But in this case, Nirvana did a really famous New York City Unplugged concert in '93. So there's video of them playing the way that they would, and audio the way that they would, on Tiny Desk. So that is right here.
>> Even in the same cardigan.
>> Even in the same cardigan. Isn't that amazing?
>> Yep.
>> Okay, so I use this nifty little tool called 4K Video Downloader, which is slightly sketchy, but that's okay.
>> I love these little utilities where you Google, "How do I get audio out of YouTube?" and you get the scariest-looking website possible, and you just cross your fingers that your computer won't go up in flames, and you download 4K Video Downloader.
>> Yes, my data is definitely going somewhere sketchy as a result of this. So for the vibe coders that are listening, I have a request for startup, which is: go find all these slightly scary little utilities and build me ones that are less sketchy-looking.
>> 100%, 100%. It's a great idea.
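For readers who want a less sketchy route to this step: yt-dlp is a widely used open-source downloader that handles the audio-extraction part. It isn't the tool used in the episode, and the URL and filename below are placeholders, but a minimal Python sketch of the download step looks something like this:

```python
# Minimal sketch: pull a YouTube video's audio with yt-dlp
# (open-source alternative; not the tool used in the episode).
# The URL is a placeholder, and the post-processor requires ffmpeg.
from yt_dlp import YoutubeDL

opts = {
    "format": "bestaudio/best",            # best audio-only stream available
    "outtmpl": "tinydesk_source.%(ext)s",  # output filename template
    "postprocessors": [{
        "key": "FFmpegExtractAudio",       # convert the download to plain audio
        "preferredcodec": "wav",           # WAV drops straight into an editor like Audition
    }],
}

with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=EXAMPLE"])
```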
Okay. So now we actually have this. So we've got the video. Yep. Now we're going to open Adobe Audition. This is a tool that people who have been working in computer audio have been using for 30-plus years. It used to be called Cool Edit Pro. It's completely beloved, and it's very, very easy to use, which is why so many of us use it. It was, of course, acquired by Adobe many years ago, and it's now called Audition. So I go to Audition and I take this video and I just drop it in. So here we actually have the audio from the video, which is really cool. I'm going to zoom in, and I see the first few seconds of it are blank, so let's just cut that out, because we don't want to hear that. Then we're going to zoom out and we're going to take, I don't know, let's take 15 seconds. And you can see the video in the bottom left corner there.
>> Oh, got it. So it's combining the audio and video just so you know exactly what you're syncing up to.
>> Exactly.
>> And I'm going to pretend like you're doing 15 seconds, because we're doing a very efficient podcast here. But one of the limitations I know, having used some of these audio- and video-gen tools, is you're getting small clips right now with what we're working with. And what I'm looking forward to is the day where I can have the hour-long Nirvana Unplugged Tiny Desk. But do you ever feel constrained by the length of the assets being generated, or the quality?
>> I mean, sort of, but again, I think creativity loves constraints. So, not to over-rotate on hip-hop, but the reason so many samples were used in hip-hop in creative ways in the '80s and '90s was that the actual drum machines and samplers had very limited sampling time. You could only sample a second of anything, so you couldn't really sample four bars. And that's why so many producers put tracks together that use these tiny one-second samples in surprising ways. And once we actually got the technology to sample for more time, we got less creativity, I would argue. So I sort of love the constraints that the technology gives us today.
>> Well, I also love my complaints. I'm like, isn't it annoying that you can't revive Nirvana, overlay their audio, and generate a completely fictional concert for longer than 15 seconds, in probably under a 30-minute podcast? My complaints are so ridiculous, because the idea of creating something like this even a year ago sounds, as you said, so impossible that we get spoiled once we get used to these tools.
>> 100%, right. Exactly. I mean, this stuff, we would have called it witchcraft three years ago. Okay, now there are two things you can do with this. If we wanted to do an a cappella-only version, for example, we can use a technology called Demucs. Demucs is this amazing technology that allows you to extract the vocals from any song. Here, I've forgotten what the actual command line is, so I just do this: I looked it up in Perplexity, "What's the actual way to extract two stems with Demucs?" We do this: demucs --two-stems=vocals, and then let's go find the path. Okay. So this command is going to take that audio file we saved of the first 15 seconds of this concert, and it's going to extract the vocals from the instrumentation. So this will be Kurt Cobain singing "Come as You Are" a cappella, which as far as I know has never happened, which is pretty cool.
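The command he looks up is short enough to reproduce. A minimal sketch of the stem-separation step, assuming Demucs is installed (pip install demucs) and using the placeholder filename from the earlier sketch:

```python
# Split a track into vocals and accompaniment with Demucs.
# --two-stems=vocals writes vocals.wav and no_vocals.wav
# under ./separated/<model_name>/<track_name>/.
import subprocess

subprocess.run(
    ["demucs", "--two-stems=vocals", "tinydesk_source.wav"],
    check=True,  # raise if separation fails
)
```

The vocals.wav output is the a cappella; no_vocals.wav is the instrumental.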
And then we simply come back here, and we say start frame, upload an image. Let's use this. Okay, that's our Kurt Cobain audio script. Upload audio, and let's actually use the full audio with all the instruments. Add to video, and then we just say, "man singing on tiny desk."
>> What I love about your prompting compared to other How I AI guests is every prompt has been sub-six words. You're very simple in terms of describing what you want, and you get high-quality outputs. So I don't know what that says about the prompt-engineering industrial complex, but...
>> Proof here that you can use simple prompts to get pretty cool stuff if the tool behind the scenes does the work for you.
>> I think you've got to give the AI the space as well. If you overly constrain it, it just really struggles to satisfy you. Whereas if you give it fewer constraints, sometimes it has unexpected results, but often they're unexpectedly delightful.
>> Well, that's what I've heard a lot from folks who come from more creative backgrounds. Designers in particular tend to be less precise in their prompting, because they want that exploration space that they can then narrow in on. So I really think your prompting technique can come into play based on what profession or background you're coming from. Engineers want the most precise: they not only want the code to work, they want the code to be written exactly how they would write it, so they're very precise in their prompting. Whereas I've found designers and more creative folks building different kinds of assets really like that wide-open space.
>> Totally. Yes. Exactly.
>> And while we're waiting for this to load, it might be interesting... I'm just looking at some of the options at the bottom here. You have different kinds of models you can use, including one that looks like they specifically fine-tuned for this, different aspect ratios, orientation, length, probably based on the script. And then the prompt says, "prompt your character with emotion and gesture." So I am very curious: if you put "angsty man singing" versus "cheerful man singing," would you get a different version here, even if the audio and image were the same?
>> It works really well. Absolutely. This is such a useful storytelling product. It's amazing. And when you combine it with other video-gen models like Veo 3, you can start to tell real stories.
>> Yeah. Okay, let's check it out.
>> All right. Pretty cool.
>> It's very good. It's very good.
>> Great. Very satisfying.
>> He even manages his mic well, you know, pulls back on some of those notes.
>> That's incredible. And so could you take different clips of the video, generate a string of these videos, and put them together in a longer-form version?
>> 100%. Yeah, I was actually inspired by this. So I put together a little mini music video for a different Nirvana track. Can I show it to you right now?
>> Yes, we would love to see it.
>> Okay. I used Veo 3 to generate the clips, and it turned out great, I think. Hold on one moment.
>> Yeah. And if you haven't tried Veo 3, it is pretty incredible. I mean, I can only generate like two and a half videos every day, you know, seven-second lengths or whatever. I'm still capped on usage, but the quality is really good. The physics are really good. It's one of my favorite video models to play with right now, just as a consumer. My experience with that model was very similar to my first experience with Midjourney, where the breadth of things coming out of the model was so incredible to me. So I highly recommend folks give that model a little spin.
>> It's amazing.
>> Yeah. You've got to get on Gemini Ultra, Claire, so you have more.
>> I have a household Gemini Ultra account, but my husband is the video-gen guy. So he's up there, and by the time I get to it, we've burned through some tokens. But you know, I spent all the money on Cursor. So, yeah.
>> Fair, fair. I know. My wife for the first time this month was like, "Babe, what is Cursor?" I'm like, don't worry about it. I know, all these little secret AI tools popping up on the credit card.
>> How I AI is now on Lenny's List, with my personal selection of the best AI engineering courses on Maven. You can spend months thinking and playing with AI before really integrating it into your workflow or shipping an actual AI feature. If you want to start building, then these hands-on Maven courses are for you. Learn directly from Aishwarya Naresh Reganti, MIT instructor and AI scientist at AWS, or Sander Schulhoff, who has authored research with OpenAI, Hugging Face, and Stanford. To pivot into an AI role or successfully lead your company's next AI initiative, visit maven.com/lenny to enroll now. Use code Lenny's for $100 off. That's maven.com/lenny to get ahead in the AI era and start building.

So these are all the videos I generated in Google Flow. I was trying to capture a 1990s high school band auditorium, you know, a little dystopian energy. I generated all these clips in a pretty straightforward way. I used GPT-4o to help me with the prompts, because, as you can see, this is actually the beginning of my generations, and this is the complete wrong energy. I don't know what this is, like early-'80s synth-pop or something. So then I went to GPT-4o and said, "Hey, help me capture grunge 1990s Seattle, inspired by some of these music videos." And then, as you can see, it gets progressively more camcorder and sort of grimy. So I generated all this stuff and then I threw it together into a music video, and I put the music behind it. I'll show it to you right now.
>> Amazing. So, just restating this: GPT-4o for helping you refine your prompts to get the aesthetic right, the phrasing right, give you some keywords. Veo to generate these shorter clips. And then do you put it together in Final Cut or something like that?
>> I put it together in Kapwing. Kapwing is so easy and so useful. Highly recommend.
>> TikTok girl, so I use CapCut.
>> Yeah, got to get on Kapwing. All right, let's watch it.
[music video clip plays]
>> That's it.
>> Okay. You get the patented Claire Vo raised-hands reaction on this one.
>> Love it.
>> I'm going to tell you the real truth. Something like this makes me almost want to cry, because I really got into technology wanting, like everybody, to make video games and make movies and work for Pixar or Disney. And it always felt so inaccessible to get these amazing ideas I had in my head into a thing. Could you film it? Could you access the people? Did you have the time? Did you have the music? Did you have the... You just put together this amazing music video.
>> Thank you.
>> I'm so impressed.
>> Thank you. It was so fun. It was so easy. And also, music videos are a lost art form.
>> Totally.
>> I'm so excited to see everybody making music videos for all their favorite tracks, because what a cool way to contribute, you know? And in no way does it actually dilute the original. I think it's a testament to the original and our appreciation of it.
>> No, it looks like a love letter. And I have to call out, when I was watching it, there's a lot of it that I think is incredible. I like how the cameras pan and zoom. The part that really got me was the sequential shots of the teenagers in the hall. And I was like, I cannot believe this is AI-generated. It's so high quality. It's so specific in an aesthetic, in a wardrobe, in a motion.
And it got me until, and again, Veo 3 is good at physics, until there's a guy with a pack of Camel cigarettes on his arm, and the cigarettes are halfway coming out.
>> Yes. Yes. Totally. That's right. Well, actually, the other funny artifact is if you look at the end, when the band is playing and a bunch of people are jumping out of the crowd.
>> Four people jump out of the crowd at the same time. They look the same, and they're making the exact same move; they look like acrobats at a circus or something. It's like the end of an '80s TV special where they all jump up with their leg out.
>> Totally. Yes.
>> That's amazing. You have inspired me, truly. After this podcast, I'm like, what music video am I going to make? It's so much fun.
>> Do it. Do it.
>> Music. I mean, music videos. You could do fake movie trailers.
>> Yes.
>> All sorts of documentaries. Because we're doing the fun, heart-and-soul-filling art stuff, but I also think the ability to create educational materials that are compelling and interesting with this technology is also right there.
>> I mean, if you look at fanfiction, fanfiction is enormous, because people want to contribute to the things they love, and now we get fanfiction for every medium. It's so cool.
>> Okay, sold. All right, that was just workflow number one. We're going to go pretty fast through workflow number two, which I think is a little bit more of a practical one, but still connected to the arts. So walk us through what your second workflow is.
>> Cool. Yeah, so one of the things that I think is really underhyped, underappreciated, and underused is all the multimodal capabilities. And the model that does this really well, actually, is Flash, Gemini Flash. It's great. It's one of the very few models that can do video analysis and ingestion. It can do all kinds of amazing things, and yet I don't see it being used out there a lot. I thought I would use it to create an app that would help me catalog my record collection, because, like every DJ, I've got so many records, and it's such a pain to keep track of them and know which ones I have and which ones I don't. So I did a very quick app on Friday that let me take a video of flipping through my record collection and then use Gemini to extract artist names, album names, photos. It's really, really cool. So I thought today we could do something similar, except for books.
>> This is amazing. And as we were talking about before we started recording, this is going to help me, because over here I have like a hundred books and a hundred records piled up on shelves that have definitely not been cataloged. So I can't wait to see what this looks like.
>> Perfect. I got you. Let's share. So here we are in Google AI Studio. I'm sure folks are familiar with AI Studio, but if you're not, I think it's the best product surface to interact with all the Gemini models, one of the best anyway, because it doesn't have all of the overhead and links and constraints that a lot of the other Gemini products have. This feels like somebody took a blank piece of paper and brought the best manifestation of the Gemini models forward. So I really love AI Studio. It's my starting point for all of these things. And then in AI Studio, you can see here, you can of course chat, you can stream with your phone or your webcam, you can generate media, and you can build apps.
This is a very good app builder, and I think it's the best way to build off-the-shelf apps that integrate with Google models. So here I've typed: "Create an app that takes a video of a person flipping through their book collection and extracts the author and title of every book shown." Then I give it a suggestion for how it could do it, which is: "You could do this by taking the video and first extracting the frames that show distinct books, and then have a vision model analyze those frames to extract the information. Make sure you extract every book shown, sequentially."
>> What I have to call out here is, what's interesting is people know these models exist, and they generally know some of the capabilities: vision, text-to-speech or speech-to-text, all this stuff. But what's really hard for people to do, and I appreciate you showing us, is think of novel ways to access the abilities of those models. I actually thought you were going to show us taking a picture of the shelf and cataloging it. But this idea of a video, and then extracting the frames... I just haven't changed my mental model to match these multimodal models in order to take advantage of things that can be more efficient, that allow you to do new things. And so I really think it's great that you're coming at this from: how could I solve this with audio, how could I solve this with video, how could I solve this with text, knowing that the models can do the hard work on the back end.
>> Thanks. Yeah, look, I completely agree. And video is, of course, so much richer than an image, and this is the way we bring a lot of the outside world online, I think. So I've been really inspired by video. I saw something on Twitter where somebody had set up a mini app that watched him shoot free throws and kept count. There are just so many ways this will be productive. I'm very passionate about AI for parents, and I've got kind of a neat video idea there as well. So, to me, there are the skeuomorphic technologies, which is using the new technology with the old assumptions, and then there are the native ways to use it, and this feels like a very native way to use the models.
>> Well, to connect the two things you said, the basketball shooting analysis and kids: my husband did upload every single one of our eight-year-old's basketball games to a video analysis to get each kid's stats.
>> No way.
>> Shooting percentages, all of it. They actually don't even keep score at this age, so he got it to get the scores.
>> I love that.
>> Yeah. So I totally love that.
>> Okay. So now we have an app.
>> Yeah. So I'm going to take a video here of me just flipping through my stack of books. Okay, I've taken the video.
>> Okay. And that took all of seven seconds.
>> Exactly. Yeah. Now, the one edge here that's kind of interesting is that it's really easy to get something working, but if you want to publish an app that a lot of other people can use, it then becomes more work.
>> Yeah.
>> It probably took me 15 minutes to create this for my record collection, at least to create the working demo, the primitive, but then it took me half a day to get it live so anybody could use it. And what's interesting about that is I feel like a lot of individuals are just going to build their own tools, and presume other people are going to build their own tools.
And so maybe this will just inspire somebody to build their own record-collection extractor, which might be faster than trying to find yours online and reusing something somebody else built.
>> I mean, the era of personal software is upon us, you know.
>> Totally. Okay. So what it's doing: taking this video, it's going to do frame-by-frame extraction of the video, again, something that is just so time-consuming, and then it's going to use the vision capabilities. What model do you know is behind the scenes of all this? You said Flash?
>> It's Flash, Flash 1.5.
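As a rough illustration of what an app like this does under the hood, here is a hypothetical sketch, not the code AI Studio generated: sample frames from the flip-through video with OpenCV, then ask Gemini Flash to read each cover. The model name, prompt, filename, and API key are all assumptions.

```python
# Hypothetical sketch of the bookshelf-cataloging pipeline:
# sample frames from the video, then have Gemini Flash read the covers.
import cv2
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-flash")

def sample_frames(path, every_n=30):
    """Grab every Nth frame of the flip-through video."""
    cap = cv2.VideoCapture(path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            # OpenCV returns BGR arrays; convert to RGB for PIL/Gemini
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        i += 1
    cap.release()
    return frames

for frame in sample_frames("bookshelf.mp4"):
    resp = model.generate_content(
        [frame, "List the author and title of every book visible in this image."]
    )
    print(resp.text)
```

Gemini can also ingest an uploaded video file directly, which may be closer to how the generated app works; explicit frame sampling just makes the per-book extraction step visible.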
And I can kind of skip ahead and show you what this looks like. So here's one I built yesterday with essentially the exact same prompt.
>> Yep. So let's run it in parallel and see if this one's any happier with us.
>> Okay.
>> And I did notice one was light mode and one was dark mode. Was that just...
>> Yeah, that's just some of the randomness of the models. Exactly.
>> Oh, I do have to say I like the progress indicator on the second one. It told me how many frames it's extracting. Oh, look at this.
>> So here we go. You know, this is the Chris Dixon book, the Paul Graham book, this very nerdy book that Marc asked me to read when I was hired. This is a really good Thomas Sowell history book. Anyway, this is my entire stack of books, every single one of them. You can see a photo, and it's extracted the author and the book name. And this is just a couple of prompts. That's it. And it generated it. So this is what's possible. And then if you go here to "deploy with Cloud Run," you get a deployed version of it that's actually running in the cloud, and now you can send this link to anyone. Now, this is going to cost you API credits, so maybe you want to be a little deliberate, but you're pretty much ready to go with this really sophisticated video-processing app that would have taken, I don't know, a month of time previously.
>> Yeah. Amazing. And so useful, because now I can figure out which of these also very nerdy books we've read. I also see some duplicates up there.
>> Totally. It's not perfect. Yeah, exactly. Well, actually, in this case, it's the photo that's duplicative.
>> It detected the Ben book and the Chris book separately. So, but yes.
>> I need this, man. I need this for the pile of kids' books I have up in my kids' closet, so they even remember what they have. Okay, this is great. Well, thank you so much for showing us these fun use cases. I have to call out, as we hop into our lightning round, one thing I noticed, which is you are using Comet.
>> I am using Comet.
>> Tell me a little bit more about why that new browser is your browser of choice and what you're getting out of it.
>> Comet is so good. I mean, I've been skeptical of the new-browser thing, because the ways to improve the browser in the past have been very incremental; ambitious, but there just wasn't that much surface area for new browser features. And now, with Comet from Perplexity, it can do a bunch of really incredible things. My favorite is what's called RPA, right? Which is where the models operate your browser on your behalf. So you've seen a bunch of examples of this, like, "Hey, go find me a flight and pay for it," which is interesting. The way I've been using it is in my finances. So I'll go into Robinhood and I'll say, "Hey, why don't you tell me how my portfolio is performing? Why don't you tell me where I could get stocks that have similar upside at a lower cost basis? What stock should I buy next? Are any of these meme stocks?" I mean, you can just go so deep. And look, I could probably figure that out by clicking around the website and downloading the data, but now I don't have to. So this assistant feature in Comet makes every website dramatically more useful, and it's been a big unlock for me.
>> I love this whole episode, because you've actually shown a couple of use cases, including talking about personal finances with Comet, that really are consumer use cases. Again, as I said at the beginning, we're doing a lot of "how do you work this inside of an enterprise, how do you write code with it?" But I think the real, underappreciated transformation is going to come in the consumer experience. I think we're so early. I mean, as somebody who does a podcast trying to educate people, I just realized we're so early on consumer adoption of AI. So I have a question for you, which is: if you could get, say, my mom, or one of my friends who's not in Silicon Valley, less in the middle of this, in a room and say, let me show you three things in 15 minutes that can totally change how you think about your life, or things you never knew were possible, what would those things be? What are the consumer-side things you're excited about?
>> So, I have kids, and parenting is on my mind all the time. And the ways my kids use models are amazing. For my four-year-old, ChatGPT reads her a bedtime story, but not just a bedtime story, one where she can ask infinite questions. You know, "What was the king's dragon's name? What color was it? Where did it come from? Did it have any kids?" She's really into unicorns and alicorns. "Tell me a story about an alicorn and a golden egg." So she can really interact with the bedtime story, and ChatGPT is far more patient and creative than we usually are. So that's one way. And look, she can't really use a computer otherwise, other than watching YouTube. And then for my son, he'll set up two figures, like Sandman and Spider-Man, then take a photo of them in ChatGPT or one of the other models and say, "Hey, who would win?" And then it'll do this whole, "Oh, Sandman would win in these conditions, but maybe Spider-Man does this." So they're able to play with the technology instead of just being broadcast to from technology, which is really new. That's the near-term stuff. In the longer term, I think the models can really help with a lot of social-emotional learning. If you look at the classroom, part of it, of course, is academics, but part of it is just teaching children to be good people in the world. And a lot of that comes from observing how they're behaving and interacting. And we never had a technology that could do that. If your kid went to a great school, there might be a second teacher in the classroom focused on social-emotional learning. So I think that's how AI shows up in the classroom: probably less homework helpers and assignment generation, and more observing the social dynamics in a classroom and helping kids be better people.
>> Yeah. Well, calling back to what we were saying earlier about trying to identify the AI-native way of doing things: I watch my children.
So I say that my children form my consumer AI thesis for me, because the other day my six-year-old was playing Minecraft and wanted to know how to do a command, and he literally went to my purse, picked up my Meta AI glasses, put them on, and said, "Hey Meta, how do I transport to the woodland mansion in Minecraft?" And I was like, wait. It's not "type it into ChatGPT." It's not even "ask Alexa." He took this physical device and put it on his face.
>> Amazing.
>> And asked this personal AI a question. And that just really opened my mind. Again, I think multimodal is going to change things. I think hardware is going to have a real place to play here. And this AI-native generation is going to think about accessing information and building things in a totally different way. So I am with you on all of that.
>> I love that. Yeah. And it's interesting, because we have been taught what computers can and can't do, but they haven't been taught any of those things. So when I generate an image, a Harry Potter image for my son, I'm like, "Wow, do you see how I just generated that?" He's like, "Dad, of course the computer can do that." So they just assume that everything's possible, and now everything kind of is.
>> Oh my gosh. As I say, when I had to walk uphill both ways for my internet...
>> That's right. You and me both.
>> We'll get you out of here. One last question I have to ask. You have had such success generating these complicated assets. But when AI is not listening to you, when it is giving you really poor results, what is your prompting technique to get it back on track?
>> I mean, I don't know if it's a prompting technique, but it's a mindset. Two things. One is: go with it. Let it take you to some strange, unexpected places, and you might be amazed at the results. I think the other is reducing this sunk-cost-fallacy thing where, you know, you create a GitHub branch, you try to do something really ambitious, and it's just falling over, over and over again. Just abandon the branch and start over, because you didn't actually do any work. You feel like you did work, because it did work, but that's not you doing work. And I think being a lot more willing to abandon approaches that aren't working is the sweet spot.
>> I completely agree. Well, thank you so much for showing us all these workflows. It was totally inspiring. I want to get off this podcast so I can go play. So thank you for making my day, and I know everybody's going to love the episode.
>> Thank you, Claire. Super fun.
>> Thanks so much for watching. If you enjoyed this show, please like and subscribe here on YouTube or, even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.
Summary
The episode explores creative AI workflows for generating music videos and cataloging physical media, demonstrating how tools like Veo 3, Gemini Flash, Hedra, and the Comet browser enable personal, artistic, and practical applications that were previously impossible.
Key Points
- Anish Acharya recreates the workflow behind his Notorious B.I.G. Tiny Desk concert, this time with Kurt Cobain: generating a still image with GPT-4o, separating vocals with Demucs, and lip-syncing the audio to video with Hedra.
- He creates a music video for a Nirvana track by generating 1990s Seattle grunge-style clips with Veo 3 (via Google Flow) and assembling them in Kapwing.
- He also demos a Gemini Flash-powered app, built in Google AI Studio, that analyzes video of a bookshelf to extract book titles and authors, demonstrating multimodal AI for personal organization.
- AI enables remix culture by allowing users to recombine audio and video elements in novel ways, such as extracting vocals or animating still images with custom audio.
- The speaker emphasizes that simple prompts can yield high-quality results when paired with powerful AI models, and that constraints can foster creativity.
- The episode highlights the potential of AI for consumer applications beyond enterprise use, such as parenting tools and personal finance management via Comet.
- The speaker notes that children naturally interact with AI as a tool for play and learning, suggesting a future where AI is seamlessly integrated into daily life.
Key Takeaways
- Use simple prompts with advanced AI models like GPT-4o and Veo 3 to generate high-quality, specific creative outputs.
- Combine multiple AI tools—such as image generation, audio extraction, and lip-syncing—to create complex media like music videos.
- Leverage multimodal AI models like Gemini Flash to analyze physical media, such as books or records, by processing video input.
- Explore AI-native applications, such as using AI to analyze videos of personal collections or to assist with parenting tasks.
- Be willing to abandon unproductive approaches and embrace unexpected results when working with AI.