How Emmy Award–winning filmmakers use AI to automate the tedious parts of documentaries

Channel: howiaipodcast
Video: 9ngbZwA_h00 (YouTube)
Published: November 16, 2025
Duration: 47:36
Views: 4,330
Likes: 111

Scores

Composite: 0.54
Freshness: 0.00
Quality: 0.88
Relevance: 1.00

9,462 words · Language: en · Auto-generated transcript

How did you think about what problems there were to solve in AI relative to your job and the people that you work with? And why did you start where you started?

>> Post-production is like a technical mess of media management. You have many different file types. You have images, you have archival footage that you're gathering, live footage that you may have filmed out in the field, interviews, transcripts. So it ends up being hundreds of hours of footage, tens of thousands of photos. The data management piece when you're dealing with all that different stuff is the mess that I have used AI to tackle. My goal was to automate this. For years, this has been manual data entry.

>> Automate away toil. That's what you want to do.

>> No one was going to make me this app. And so the ability to make an extremely specific app that makes a workflow for my team and my company easier, it's been an unbelievable moment.

Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today we have Tim McAleer, a producer at Ken Burns's Florentine Films who's responsible for the technology and processes that bring these amazing films to life. Instead of focusing on how AI can create creative for these films, we're actually going to talk about how Tim uses AI to build software products that make his post-production and research teams' lives a lot better. If you're working with images, video, sound, or just a lot of data, this episode is a great one for you. Let's get to it.

This episode is brought to you by Brex. If you're listening to this show, you already know AI is changing how we work in real, practical ways. Brex is bringing that same power to finance. Brex is the intelligent finance platform built for founders. With autonomous agents running in the background, your finance stack basically runs itself. Cards are issued, expenses are filed, and fraud is stopped in real time without you having to think about it. Add Brex's banking solution with a high-yield treasury account and you've got a system that helps you spend smarter, move faster, and scale with confidence. One in three startups in the US already runs on Brex. You can too at brex.com/howiai.

>> Tim, welcome to How I AI. I'm excited to have you here.

>> Thank you for having me.

>> What I love about what we're going to talk about today is that you work in a very interesting and creative industry putting out amazing content, and we're going to talk a little bit about how AI is impacting the creation side of things, but you've actually used AI to smooth out some of the challenges you've had on the production and post-production side of things. So I'm curious: how did you think about what problems there were to solve in AI relative to your job and the people that you work with? And why did you start where you started?

>> Yeah. I think most of the flashiest use cases of AI in creation or media and entertainment right now are often in generating full video content or images or whatever it is. But post-production specifically is like a technical mess of media management. Especially in non-fiction, you have many different file types, right? You have images, you have archival footage that you're gathering, live footage that you may have filmed out in the field, interviews, transcripts. So the data management piece when you're dealing with all that different stuff is the mess that I have used AI to tackle.
And I think that AI as a tool, versus AI for generation, is even more immediately applicable in our field at the moment.

>> Well, I have a very simple, humble little podcast, but even for us, we create a lot of research and longer content and we're editing it down. I'm just curious, with documentaries and non-fiction work, what do you think the ratio is of media captured, researched, and archived to what's actually published? Because that will maybe give us a sense of how much of this you have to grapple with to get a good piece of content at the end.

>> We have a thing in our industry called a shooting ratio. You can imagine that in a fiction series or, you know, a sitcom on air, I don't quite know what those shooting ratios would be, but you're working with a script, and so you're going to have a slightly lower ratio. In documentary, it can get quite high. I can tell you that we made a series about Muhammad Ali a few years ago. It was an 8-hour show. We gathered 20,000 still images in the database of just stills. I think it was over 100 hours of footage, because he had a lot of fights and that kind of thing, news footage. And then we also filmed, I want to say, 35 interviews for the piece. So it ends up being hundreds of hours of footage, tens of thousands of photos. And that's just one example, of a particularly famous individual, but that tends to be what it looks like for our shows.

>> So that's what you have to manage, make searchable, make usable by the entire production team. And you got inspired by ChatGPT and some of these early AI tools to do some of that. So, you want to hop in and show us the first use case?

>> Absolutely. So I'm going to start by showing you the end result before I go right to how I got here. On any film that we work on, we end up having some kind of database. This is a database where you can see the still images we've gathered. You can see there's a footage section, a music section, anything that might go into the film, and all the kind of stuff you might expect to see: descriptions, tags, a date on the thing, where we got it from. Some more technical detail is also going to appear over here. In any event, my goal was to automate this. For years, this has been manual data entry. I'm going to jump into Cursor now, but I do remember vividly, when I first started doing this, it was ChatGPT. I remember ChatGPT added image upload and it was this insane day for us. I was in the office with my colleague Clark and we were just throwing images at it and seeing the quality of the output. It was this aha moment where it was like, "Oh my god, this thing can see. How could we harness this text generation to use it for our database entry?" So I'm going to simulate that starting point and then we'll jump to where we're at today. Essentially, what it looked like at the beginning was we would throw something into GPT and we would say, "Hey, can you describe this?" and it would hallucinate a little bit, but it was so tempting to figure out a way to harness that that I started essentially writing little Python scripts with ChatGPT. At that time it was VS Code on one monitor and GPT on another. All right, I'm just going to go ahead and demo what that kind of looked like. I'm going to speak my prompts, if that's okay.
I use this tool called Super Whisper because it kind of cleans up my off-the-cuff dictation. So, I have an image here of a nice street somewhere in America, maybe mid-20th century. We're going to see what kind of description we get from AI. All right: "Write me a script that submits the JPEG at the root of this workspace to OpenAI for description. I want just a general visual description of what we can see in the image. Any API credentials you need are in a text file at the root of the folder." And what we can see here is that everything I just said got funneled through this app called Super Whisper, so it got funneled through a prompt that itself is cleaning up my messy vibe coding. I think it's clean enough, so we're going to go ahead and submit it.

>> And I see you're using Claude 4.5 Sonnet. Is that by choice or by default?

>> That is because I'm on a podcast right now, to be honest. I think this is a very easy task for AI; I could keep it on auto for this. I will say I switch between various Claude models depending upon the difficulty, and I do try to be cheap and stay on auto if I know that I'm asking for easy stuff.

>> Okay. So you're giving us a little bit of quality control here.

>> Yeah. I don't want it to mess up. We're live on air, you know.

>> Yeah.

>> All right. So it's telling me that I need to install some requirements. My guess is I have those requirements. It's got a submit-image script. Let's see what it did. Here we go. It's running, submitting this image to OpenAI for analysis. What kind of description will we get? There we go: "This image depicts a small rural main street from what appears to be the mid-20th century." We had guessed that. "There are a series of wooden storefronts, each with signs indicating there are local businesses." Okay, so this is great, and this is kind of what we were getting in those early days of GPT image upload. But the problem here is that you're making a film: you want to know which rural main street, what town we're in, what the exact year is, and you can't really just go with this kind of generic description. A lot of times we happen to know that images come with embedded metadata. If you're using your iPhone camera today, you know that maybe there's some metadata like GPS data, that kind of stuff. But archival images will often come with whatever notes people have scribbled onto them over time. So now I'm going to iterate on this one time and say: "I want you to add a step to this script. I want to scrape any available metadata from the file first and append that to the prompt." The goal here is that we are using any available metadata as a source of truth for what this image actually is, and not just guessing.

>> And so, just repeating that while this is running: what you're saying is, for this particular use case, you're working with a set of archival photos from sources that have probably embedded additional layers of metadata into them that you can read, that give more information. Which is different than, you know, scanning something or taking something off your phone, which I think we're going to look at a bit later. And so you're trying to harness the structured metadata off this file, which, if you go back to the tab that shows the image, we can't see with our human eyes, but our agent friends can read with their robot brains.
And you're using that information to then upgrade this script that is going to do all this AI analysis for you.

>> That's exactly right. And in this case it's going to be embedded metadata. I happen to know this is an image from the Library of Congress, so there's going to be some metadata on it, but it could also be something on the web. Where this eventually goes is: okay, I know that there's a website with information that may not be in the file, so how about you go and scrape the web and gather anything you can know about this. Because ultimately this is a journalistic endeavor. These shows get fact-checked. We want everything going into our database to be true and verifiable information. All right, so let's see how it did when it added that metadata check. You see it did a little bit of a scrape. It looks messy as hell, but somewhere in here we can see stuff like, yeah, archival information, and it's now going to use that. What we've generally found is that when you add those guardrails, when you give it information you know to be true about the image, it relies on that so much more than just what it can see. AI really wants to perform for us. It really wants to do a good job. And so when you give it the tools and the information to write a better description, it's going to be able to get there.

>> And I want to call out some things. We talked about using the Anthropic Claude models in particular for the actual coding of the script, but you're relying on the OpenAI models for the image analysis. Why OpenAI versus any other models? Do you stick with the one that you love, or was it the first one that did a good job for you, or do you feel like it's particularly good at image analysis? I'm curious why you select those different models for different use cases.

>> Yeah, it's mostly that it's the first one. They were the first to have a vision preview on their API. They did it before Claude, and I had built up enough of an infrastructure using that API call that the switching costs were too much, you know.

>> Yep.

>> All right. So let's see what we got this time.

>> It's much more detailed.

>> It is. It's much more detailed. So: "The image shows a street scene on the main street of Cascade, Idaho." There we go, we know where it is now. "Captured in 1941 by photographer Russell Lee." We've got photo credits. All right, so this is a great example: you add the guardrails and you're going to get more detail, but you're also just going to get facts, right? Before, I don't know if it's still up here somewhere, yeah, before it was "a small rural main street." Now it is the main street of Cascade, Idaho. And so we can imagine this getting duplicated in various ways. This image has embedded metadata; maybe for another it's a website that we're going and gathering it from. But effectively, this is where it all started. It started with a single Python script that I was running on my computer, and I was like, this is awesome. My database software is advanced enough to call external scripts. You can kind of use any database to do this, Airtable, whatever, but you just need something that has an API and that can call an external script or webhook or something. So this is where we started. And now I'm going to switch my screen share to a remote machine, a little Mac Mini that I have running in my office.
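For readers who want to try this, here is a minimal sketch of the kind of single-file Python script described above: read any embedded EXIF metadata, append it to the prompt as a source of truth, and ask a vision model for a description. It assumes the openai SDK and Pillow are installed and an OPENAI_API_KEY is set; the file name, prompt wording, and model choice are illustrative, not Tim's actual code.

```python
# Sketch: describe an archival image, using its embedded metadata as a source
# of truth rather than letting the model guess. Assumes `pip install openai pillow`.
import base64
from PIL import Image, ExifTags
from openai import OpenAI

IMAGE_PATH = "scan.jpg"  # hypothetical archival scan at the workspace root

def read_exif(path: str) -> str:
    """Pull any embedded EXIF tags so the model doesn't have to guess."""
    exif = Image.open(path).getexif()
    lines = []
    for tag_id, value in exif.items():
        tag = ExifTags.TAGS.get(tag_id, str(tag_id))
        lines.append(f"{tag}: {value}")
    return "\n".join(lines) or "(no embedded metadata found)"

def describe(path: str) -> str:
    client = OpenAI()
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "Describe this archival photograph for a documentary research database. "
        "Treat the embedded metadata below as the source of truth; do not invent "
        "places, dates, or names that it does not support.\n\n"
        f"Embedded metadata:\n{read_exif(path)}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any vision-capable model would do here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(describe(IMAGE_PATH))
```

The same shape scales up: swap the EXIF step for a web-scrape of the archive's catalog page and the rest of the script stays the same.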
And what this is, it's hard to convey at this moment, is a more complex Cursor workspace. Maybe I'll bop into the rules. Basically, what this is is a REST API, so that every image file, video file, music file, anything that ends up in that database that we looked at at the beginning, pings off of this REST API for all kinds of different metadata tasks. If I pop into the jobs folder here for a second, we can zero in on basically what we were just doing, but the current iteration of it. I call it "auto log," because the process of writing this in, the manual data entry we've done for years, is called logging. So it's not the cleverest name, but it fits. And you've got a five-step process here. First we're going to gather the info, meaning file specs: how big the image is, whether it's a JPEG or a TIFF. We're going to copy the file to our server and name it with our ID number. We're going to parse it for metadata: is there any metadata? If there is, great. But either way, we're going to look for more information on the web in step four here, scrape URL. And then once we know everything we could possibly know about that image, we're going to generate a description for it. And when you imagine how this might work for video, well, video is just 24 images a second plus some audio, so basically this just gets scaled up to deal with video files too.

>> Are you using the same model for video files? Are you extracting the stills and pushing them through OpenAI, or using a different model?

>> I use a different model. The video files require two levels. Most video AI models out there seem to do basically some version of frame sampling. It could be extremely expensive if you were sending all 24 images every second to an API, right? So I pull at 5-second intervals, because I'm cheap. Some others maybe pull in a smarter way, maybe at lighting changes or something like that; there are different ways of thinking about the frame sampling. For the frame captions themselves, I will use a cheap model, like GPT-5 nano. But then, and I can go in and show you a prompt here which maybe illustrates this, I have frame prompts, which basically ask for a caption of an individual still image extracted from video. And then I have a larger parent prompt. You can see that my prompts have gotten slightly more sophisticated over time. Basically, what this does is take every single frame caption we've extracted from a video file, append any of the audio we've transcribed from that video file, package it all up into this elaborate prompt, and send it to a reasoning model.

>> And the purpose of that is to say: these are all the video events that we have observed in this video, here is a massive text file of data, tell me what you think is happening in the video.

>> Got it.

>> Yeah.

>> Yeah. And maybe a tip from one of our other How I AI guests: I found that the Gemini models are quite good with video. It's actually what we use on our podcast raw recording to pull both highlight stills and a blog post that I put out. I process them through the Gemini models and have had a lot of success.

>> And it just pulls out the stills that might be...

>> It automatically pulls interesting stills.
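A rough sketch of the 5-second frame-sampling and captioning step Tim describes above, assuming ffmpeg is on the PATH and the openai SDK is installed; the file names, model choice, and prompt are illustrative stand-ins, not the production pipeline.

```python
# Sketch: pull one frame every 5 seconds with ffmpeg, then caption each frame
# with a cheap vision model. Assumes ffmpeg is installed and `pip install openai`.
import base64
import subprocess
from pathlib import Path
from openai import OpenAI

VIDEO = "clip.mp4"          # hypothetical input video
FRAME_DIR = Path("frames")  # one JPEG every 5 seconds lands here

def extract_frames(video: str, every_seconds: int = 5) -> list[Path]:
    FRAME_DIR.mkdir(exist_ok=True)
    # fps=1/5 asks ffmpeg for one frame every 5 seconds
    subprocess.run(
        ["ffmpeg", "-y", "-i", video,
         "-vf", f"fps=1/{every_seconds}",
         str(FRAME_DIR / "frame_%04d.jpg")],
        check=True,
    )
    return sorted(FRAME_DIR.glob("frame_*.jpg"))

def caption_frame(client: OpenAI, frame: Path) -> str:
    b64 = base64.b64encode(frame.read_bytes()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for a cheap per-frame captioning model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Caption this single frame extracted from a video."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    client = OpenAI()
    for i, frame in enumerate(extract_frames(VIDEO)):
        print(f"[{i * 5:>5}s] {caption_frame(client, frame)}")
```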
It actually gives me interesting stills, plus or minus 5 seconds, because sometimes the guest and I are looking ridiculous.

>> Yeah, of course.

>> So, tip to anybody out there with video who hasn't tried the Gemini models: I find those particularly good for this use case.

>> You might have just added something to our little road map here.

>> Well, and then I'm curious about the audio side of things. I play with the Gemini models for video, and this still makes tons of sense to me. Tell us a little bit about the audio side of things.

>> So for the audio, I now feel like I'm an OpenAI shill, everything I'm using is OpenAI except for the coding, which is interesting, but I think it's just habit. I use Whisper for audio. Whisper's an incredible open-source model for speech-to-text; even the medium-size model does a pretty good job. And what I do, and I can pop back into the database software to illustrate this: you can see frames pulled every five seconds, and there's a caption associated with each frame. This is a shot of an alligator in a swamp, so he doesn't have any audio, he wasn't talking. But I basically pull audio at 5-second increments, so that when we send those video events up to the reasoning model, we are sending a full transcript, but it's pegged to the moment in the video that it happened, if that makes sense.

>> Yep.

>> So the transcription is all happening on my back end over here. I think I could probably open up the console and see... there we go, someone just sent a job through not that long ago. I can come in here and see what my colleagues are doing as they ping my API all day long.

>> Great. So you're pairing a snapshot image every 5 seconds from a video, the 5-second transcript of the audio (speech to text via Whisper), and metadata if you have it, parsing that all together, and then getting a very robust description and analysis of the content, available back in this tool that you're using to archive, log, and manage all your assets.

>> Yeah. And like I said, that tool could be kind of agnostic. You could do it in a Google Sheet if that's what you like, but I like this; we've been using it for a while. Everything we just talked about is how we get to metadata that we can read, right? Generative metadata that we know is accurate because it's been put on these guardrails by our metadata extraction steps. And it also provides this nice visual for us: we can see what this thing is at a glance. But the next step, now that you have this API running in the background, is you can generate something that maybe I can't read but the AI can read pretty well, which is vector embeddings. I'll jump back to stills for this because I think it's maybe an easier illustration. Every asset in our database gets put through two modes of embedding. So we'll send the thumbnail through and run it against an open-source model.
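Before the conversation turns to embeddings, here is a minimal sketch of the audio half of the pipeline just described: transcribe with the open-source Whisper model, bin the transcript into 5-second windows, and package frame captions plus speech as "video events" for a reasoning model. It assumes `pip install openai-whisper openai`; the frame captions are assumed to come from a per-frame captioning step keyed by the same 5-second offsets, and the model names and prompt are illustrative.

```python
# Sketch: Whisper transcript binned into 5-second windows, merged with frame
# captions, then summarized by a reasoning model. Names are illustrative.
from collections import defaultdict
import whisper
from openai import OpenAI

def transcript_by_window(audio_path: str, window: int = 5) -> dict[int, str]:
    model = whisper.load_model("medium")   # even the medium model does a good job
    result = model.transcribe(audio_path)
    bins: dict[int, list[str]] = defaultdict(list)
    for seg in result["segments"]:         # each segment carries start/end times and text
        bins[int(seg["start"]) // window * window].append(seg["text"].strip())
    return {t: " ".join(parts) for t, parts in bins.items()}

def summarize(frame_captions: dict[int, str], speech: dict[int, str]) -> str:
    events = []
    for t in sorted(set(frame_captions) | set(speech)):
        events.append(
            f"t={t}s | frame: {frame_captions.get(t, '-')} | speech: {speech.get(t, '-')}"
        )
    prompt = (
        "These are all the video events we observed, one line per 5-second window.\n"
        "Tell me what you think is happening in this video, for a research database.\n\n"
        + "\n".join(events)
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model="o4-mini",  # stand-in for whichever reasoning model is used
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```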
I use CLIP for this, and I'll generate an embedding off of that. Then we'll send the description through, I use an OpenAI text model for this, and get an embedding for that, and then we'll fuse them. The purpose of that is that now we have the ability to discover things semantically. Prior to this, and I think in a lot of film production today, you're working with exact text search: if that description says "dog" but somebody typed in "puppy," you're not finding that image. And so this has been kind of the most exciting part of it. Not necessarily where I knew it was going when it started; I was just excited to generate a description, right? But now the ability to discover semantically is, I think, the most robust part of the system.

>> What I love about this, I mean, a couple of things. One, you've really pushed every step of the way. You could have stopped at "we got good descriptions," or "we got the structured metadata out and now I have a script that runs it." You could have stopped at images only, but you took it to video, and video and audio. You could have stopped at structured data only, but you went to embeddings to get semantic search. So I love the breadth of applicability of the AI in this process. But what I probably love more is that I doubt this was anybody's favorite part of their job. I doubt it was anybody's favorite part of their job to go read some Library of Congress metadata.

>> It used to be my job, so I can tell you firsthand: not my favorite part. And it's also, I think, the best argument I have for all the work I've done creating this system: the same people who used to write this data were the ones responsible for doing the research. So you've now freed them up to just look more, right? Maybe now we could gather 25,000 still images for the Muhammad Ali project, because you have that much more time. You're not just copying and pasting stuff off a website to put it in this form, you know?

>> Well, and you probably get to select better assets for your content from this big archive, because they're more discoverable, and because you have more confidence in the source and the content of that data. So I bet it up-levels the quality at the end of the day, because you have just much more data to work off of.

>> 100%. A real quick example of that, too: I'm going to use a Lincoln here, which is maybe not the best use of this image, but embeddings enable us to find things in ways we never would have thought to find them before. I have a button down here, and when I click it, what it's basically going to do is a reverse image search within our own collection. If I'm an editor and I like an image, and this is going to take a while because I'm not on site, but if I like an image, I can click the "find similar" button and it's just going to go and find every image that kind of has that vibe. You can see here we have a duplicate of this one, but then, there you go: it recognized the man and it started pulling in other portraits.

>> This episode is brought to you by Brex. If you're listening to this show, you already know AI is changing how we work in real, practical ways. Brex is bringing that same power to finance. Brex is the intelligent finance platform built for founders.
With autonomous agents running in the background, your finance stack basically runs itself. Cards are issued, expenses are filed, and fraud is stopped in real time without you having to think about it. Add Brex's banking solution with a high-yield treasury account and you've got a system that helps you spend smarter, move faster, and scale with confidence. One in three startups in the US already runs on Brex. You can too at brex.com/howiai.

>> I love this. Okay. So this is more of your archival and footage data, but you capture a lot of stuff in the field, where people are not sitting in front of Cursor or their desktop looking through these assets. And I know that you use some vibe coding and a creative approach to get more information about those assets. Could you walk us through that?

>> Yeah, so the next use case is an app that I developed for archival research in the field. I think we really pride ourselves on turning over every rock, on not just relying on what's digitized and available online, and on going and visiting physical archives. The process of visiting a physical archive is basically: you have a bunch of folders that you pull ahead of time, you arrive there, and your goal is just to snap low-resolution iPhone shots of everything you can possibly get. You're snapping the front of the image and you're snapping the back of the image, because the back is typically where there's going to be a scrawled description, or maybe an accession number, an ID number that the archive has added itself. And this process used to look like: you show up at the archive, you take iPhone snaps for two days, you get back to the office, you have the messiest camera roll you've ever had, and you cannot actually pair your fronts to your backs because somehow they got out of order along the way. So the goal was basically to make that process a little better. I vibe-coded this iOS app to deal with this problem. I tend to just speak in screens, maybe because I'm a visual person. The way I deal with it is I just think: okay, I see a screen that does this, and a screen that does this; I imagine a button that does this. And the purpose of this was basically: I want people to be able to create collections for each folder they're capturing. I want them to be able to snap a front and a back, the flip side of the image, so that they can easily associate those, so the file names associate them. And I want to immediately transcribe any information on the back and embed it into the original image. So now I have this app called Flip-Flop. I asked ChatGPT at the end of my dog walk to generate some kind of specs doc or requirements doc. It pretty much does it in one go; if you chat with it for 30 minutes, you can get a lot done. Then I fed this PRD to Claude Code, and this one it didn't build in one shot, but it certainly built the UI in one shot. And so I guess maybe we should just jump into the actual app.

>> Yeah, let's do it.

>> So Flip-Flop, which is my cute little name for it, is basically designed to capture those fronts and backs that I was talking about. You have three screens here. You've got a collection screen where you're going to create your folders. You've got a capture screen where you're going to take your images. And I'll just quickly highlight this part, which is where you kind of have your AI processing options.
So I allow people to define a separate prompt for what I call the flip side of the image, the front, and the flop side of the image, the back. In this example, I'm going to show you some photos of my dog, and the flop side of the image is going to have some text on it. Our prompts here are really just designed to get a decent caption from the image and to transcribe any text that we see on the back. So let's create a new collection. We're going to call it "How I AI," that's good enough. There's also an option here to add more context. You know, the AI loves context. You can imagine, if you're digitizing an entire collection of someone's personal letters or someone's portrait photographs, you would add that kind of thing here. But for now, we're just going to create a collection, tap into that collection, and capture. So here we go.

>> It's a screen share within a screen share.

>> We're going to not care about the glare too much. I'm going to capture the front side of this image of my dog Tony's third birthday. I now have the option to add notes if that's what I want to do, or I could just add a flop side of the image right here. And when I complete that, it will have, because it's lightning fast, already sent it up to OpenAI for a description and embedded it. And this is the really crucial thing, because you just saw the first system: it embedded it in the image metadata itself. So the flop details have the transcription "Tony's third birthday," and all of that will show up in what we call EXIF metadata, which is just the image metadata standard.

>> Got it. And just for people who may have passed this by: instead of simply generating the text description and storing that in a database relative to the original image you took, you actually now have this structured metadata on the image file itself, which, again, what a pain.

>> Oh, a giant pain. Yeah.

>> A pain to do manually. And so now, anytime anybody uses one of these images, even if they don't have access to this app, that image is embedded with that metadata.

>> 100%. So you could pull this onto any computer or any app, anything that can read underlying metadata, and it's going to be able to see that this was Tony's third birthday. And that's structured metadata in the sense that we've now structured the actual information about the image. But the other thing that's really crucial, honestly, is that we've structured the files themselves. You can see they're getting named in a particular way. So we've moved from camera-roll mess to files that are going to sort in your computer, that you're going to be able to import cleanly, and you're going to be able to distinguish easily what's the front of the image and what's the back of the image. And that has, I think, been the other unlock. I had two colleagues out in the field a couple of weeks ago and they came back with 1,400 images. I don't think that's only because they were able to use Flip-Flop to capture them, but I think Flip-Flop is certainly making the process easier since they've gotten back.

>> The thing that I want to call out for folks, maybe a general takeaway here, is that these AI models are so good with files, and code can do a lot of stuff with files. For a lot of the people we talk to, Markdown is the file type du jour these days, which is just a specially formatted text document.
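A minimal sketch of the metadata-embedding idea behind Flip-Flop, assuming Pillow; it writes the transcription of the back of a photo into the standard EXIF ImageDescription tag of the front image. Flip-Flop itself is an iOS/Swift app, so the file names and tag choice here are only illustrative.

```python
# Sketch: write the text transcribed from an item's back ("flop") into the EXIF
# metadata of the front ("flip") image, so any EXIF-aware tool can read it.
# Assumes `pip install pillow`; file names are hypothetical.
from PIL import Image

IMAGE_DESCRIPTION = 0x010E  # standard EXIF ImageDescription tag ID

def embed_description(front_path: str, transcription: str, out_path: str) -> None:
    img = Image.open(front_path)
    exif = img.getexif()
    exif[IMAGE_DESCRIPTION] = transcription
    img.save(out_path, exif=exif)

if __name__ == "__main__":
    embed_description(
        "001_flip.jpg",            # hypothetical front-of-photo capture
        "Tony's third birthday",   # text transcribed from the back
        "001_flip_tagged.jpg",
    )
    # Any EXIF-aware tool (Finder, exiftool, a DAM system) can now read the note.
    print(Image.open("001_flip_tagged.jpg").getexif().get(IMAGE_DESCRIPTION))
```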
But if you start to look at other file types and really understand what can be put in a particular file type, you can actually discover some pretty interesting things you can do with a combination of AI and coding to make those files much more useful for your use case. So this is one of those takeaways where I'm like, I haven't thought about what can be embedded in an image file or what can be embedded in a video file. And even just having ChatGPT or one of your general models answer, "Hey, I'm working with an image. How can I load it up with as much context and specificity as possible? What's available to me?" and then using that as a jumping-off point for what you do is a pretty interesting use case of AI.

>> I'm very familiar with stills' underlying metadata fields, but I didn't really know what was available in audio or what was available in video files, and I just go into Cursor and ask. Like, now we have a music workflow, which we're not going to look at, where we embed artist, album, kind of like licensing data into any music we consider for a film. I didn't know that there was a metadata field we could just store that in, but of course there is; somebody thought of this a long time ago.

>> Yep. Amazing. Okay, we have one last use case, which, Mom, if you're listening, I think you're going to like. My mom's a genealogist. So I think she's going to like this use case, but let's show it first and then I'll call out, Mom, where I think you can use it.

>> Okay. All right. So you can imagine, in our films we work with a lot of documents, and we're not always interested in the entire document. Sometimes we just want to transcribe part of it; maybe we want to translate and transcribe part of it. Take this newspaper document, for instance. Maybe the Arkansas State News is the article we're interested in. That's the transcript we want to be searchable; that's what our editor might want to consider for the film. We can't just put this in Adobe Acrobat and OCR the whole thing. It's not going to work. And even more than that, the quality of the image would not work with most OCR engines. But AI is really good at OCR of old documents. It's really good at handwriting. It's pretty good at translation, too. So I built, and we're not going to get into the building necessarily, but this is one of the few Xcode builds I had to do. This is a Swift build, a little Mac menu bar app. It's called OCR Party, which stems from the fact that we're just OCRing part of the image. You've got to have fun with these things. Let's see. We're going to open up that newspaper in OCR Party. We get a little preview window. Let's say what we actually want is "Coolidge seeks peace in the world." Let's zoom in a little bit and open up our cropping tool. This little thing down here is basically a choice between macOS Vision and an AI API call, and the purpose of that is because sometimes people don't trust AI, you might have heard. So I built that in as an option, essentially. I would think the AI option gets used more. But nevertheless, now you're going to select just the part of this article, or this paper, that you care about. And you can see there's a crease in the paper, there's a weird black mark here, but you can imagine we submit this for OCR.
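OCR Party itself is a Swift menu bar app, but the core call is easy to sketch in Python: crop the selected region of the scan and ask a vision model to transcribe just that part. This assumes Pillow and the openai SDK; the crop box, prompt, and model are illustrative.

```python
# Sketch: "OCR just part of the image" by cropping the user's selection and
# asking a vision model for a transcription. Assumes `pip install pillow openai`.
import base64
import io
from PIL import Image
from openai import OpenAI

def ocr_region(page_path: str, box: tuple[int, int, int, int]) -> str:
    """box is (left, upper, right, lower) in pixels, i.e. the user's crop."""
    crop = Image.open(page_path).crop(box)
    buf = io.BytesIO()
    crop.save(buf, format="JPEG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe the text in this newspaper clipping exactly. "
                         "If a word is obscured by damage, give your best guess."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    # Hypothetical crop around a single article on a scanned page
    print(ocr_region("newspaper_page.jpg", (120, 340, 880, 1400)))
```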
Now we have just that text that we pulled. We're also calling out for our editors where on the page they're going to find it, if they want to zoom in on it and crop to that particular article. I can't exactly remember what text we were looking at, but it certainly completed those sentences where there was a black marker, right? So AI was able to infer, to the best of its ability, what that sentence might have said. And if this ends up in a film, I can guarantee it would get fact-checked later. But for the purposes of gathering documents, thousands of documents, this ability to precisely OCR has been a nice little unlock for us.

>> One thing I also want to make sure people take away from this episode is that we've seen basically three form factors of apps. Yes, they've all used AI, but you've been able to swap between a Python API service that gets called by another software application or database, an iOS app that you can run on your phone, and a little desktop menu bar widget. And what I love about this moment in AI with regards to software engineering is, if you have basic software engineering practices and you know enough to be dangerous, you can vibe-code a Swift app to run on your local desktop.

>> A hyper-specific app. No one was going to make me this app. And so the ability to make an extremely specific app that makes a workflow for my team and my company easier, it's been an unbelievable moment.

>> Yeah, I would say the TAM for this app is, like, you.

>> Yeah. Yeah. I mean, I think I could sell it to maybe two colleagues.

>> Well, and then my mom. So what I was going to tell you is that my mom is a genealogist for the Daughters of the American Revolution, of which I am one. Fun fact about Claire.

>> Oh, no way.

>> And she does the lineage tracing. Do you know how many times she screenshots something and says, "Can you read this cursive? What in the world is this name?" And it's one name in a big image. And I'm like, yeah, I'm going to drop this into ChatGPT and I'll tell you what I think it says. I think AI's ability to read handwriting, old typefaces, and understand the nuances of spelling and things like that is just really, really interesting for these sorts of research use cases.

>> Yeah, we didn't look at a handwritten doc here, but that is definitely something happening at our company: the ability to read letters that we could not read before, and also just other languages, right? You have letters written in some kind of cursive scrawl from the 17th century that is now translated to English and made legible for you.

>> Amazing. Well, we've seen three great use cases. I am sure you are the hero on the team for this kind of stuff, because I can imagine...

>> People might be tired of hearing me talk about AI, but thank you.

>> Yeah, but I mean, this is hard stuff. It's tedious work to do. It requires a lot of time and a lot of detail orientation, and I'm sure people love using this information to produce amazing things, but it's probably not their favorite thing, zooming in and squinting at the text to try to get it as accurate as possible.

>> Trying to, you know, automate away painful processes, right?
Not the things people liked.

>> Automate away toil. That's what we want to do.

>> Yes, that's what we want to do. Okay. Well, we're going to do a couple of lightning-round questions, and then I'm going to get you out of here to go digitize a thousand more images. So the first thing I want to ask you about is your approach to learning. It seems like, from what I'm seeing, you're pretty fearless about new technologies, new things. I think this is such a critical moment for upskilling and learning. How do you think about learning in this moment?

>> I think one of the reasons I find tools like Cursor or Claude Code kind of intuitive is that, to me, there's a parallel with creative software. At various moments in my career I have been deep in Photoshop, or deep in Adobe Premiere or Avid Media Composer, whatever it is, and those softwares are so complex. They are like a maze of tool menus, and you end up on Reddit and on YouTube doing your research, trying to figure out how to accomplish the thing. And I think that's essentially what a lot of these tools are today, too. I've been on Cursor YouTube and Cursor Reddit and learned tips and tricks from the vibe-coding people of the internet. I think it sort of starts from knowing what could be done, or what's possible, and the path to get there is swifter than ever before.

>> What I like about this: I started my own fascination with technology in these creative tools. This is like Photoshop, where I would go, "How can I make my text look like liquid gold?" and follow these five-step graphics tutorials. And what I love about this moment in vibe coding, or AI-assisted engineering, is that coding feels so much more creative than technical, where these tools feel like creation engines to me more than functional tools to write code. I love that parallel, because it's what's made me so excited about technology my entire career, and I think it's why I'm so leaned in in this moment. It activates that same feeling of, oh, now I can make this thing that I didn't think I could make before.

>> I think there are a lot of people in my industry, too, who have a kind of creative brain and creative approach to these things, for whom looking at a Cursor window right now, when you have no idea what it is, is a little scary. But I actually think they are more well suited for the work than they might know.

>> Well, let's talk a little bit about your industry, because I know that the film and creative world is deeply skeptical of AI. Sometimes we wade into the waters of AI video generation on this podcast and get a little feedback, and I totally understand; I have family in the creative industry. I'm curious, what's your point of view on AI, particularly in the film world? What are you excited about, where do you think these kinds of concerns are really warranted, and where do you think the most practical applications are?

>> I think today it's sort of where we started at the top: the practical applications are more in tooling than they are in creation. But I do think the creation's going to get there. Today I play with all the generative video models; how can I not? They're super fun. They are not at professional-grade quality yet.
With the amount of time you spend throwing tokens at even the highest-end video models, you're not going to be able to match your shots that well. You're not going to be able to match the footage you shot yourself that well. So I don't think they're there yet. But I'll be honest, they're going to get there. They are still exciting to me, but I would separate a couple of things. In the non-fiction world, I think people should be careful. I think we should not be generating archival footage. We should not be trying to fool our viewers into thinking that there was video in 1750, you know, and I think that's the part that's a little scary. And then of course there's the job displacement aspect of things. If you film stuff for a living, you're definitely scared that people are going to be able to just use text to generate that same video you used to shoot. I don't think anybody has good answers to that part of it. But my approach has certainly just been: jump in and learn the tools. They are going to be here whether we want them to be or not, and I think they have a lot of practical benefits today that are less scary.

>> Yeah. And I'll say this honestly: of all the spaces, the one where I have the most job displacement concern is video generation, for non-archival, non-documentary, commercial use cases. You just see how applicable it could be. And the best advice I can give to people in this moment is: the more you learn the tools, the better off you will be, whether or not you love where the tools are taking us as an industry or as a culture. Knowledge is power. So the more you learn and understand, one, you can identify opportunities where it does add value, even in your creative process, and two, you're going to be differentiated in the market from a job perspective, because you're going to have a more robust sense of what's available in your industry. I think that stands for people in your industry, and I think it stands for people in my industry in technology. There is no harm in learning this stuff.

>> Yeah, absolutely. I also think there's a place in the process for it, which gives you a place to learn without thinking it needs to end up in the final product, right? You can use video models for storyboarding all day. You can maybe prove whether or not that shoot is worth spending the money on. Now you've learned how to use the video models a little bit, and you haven't necessarily displaced anyone, but you've made your production a little bit more efficient, a little smarter. Maybe you've shot better footage as a result of it.

>> Yes. But we're not generating fake archival footage.

>> We are not doing that. Definitely not doing that. And PBS, which is where most of our films end up, has a lot of guidelines around that, and I think that's a good thing. But it's the other stuff. It's commercial, it's visual effects. A lot of that stuff's going to get easier, and so it's coming one way or another.

>> Great. Well, last question I have to ask you: when you're on your dog walk with ChatGPT in voice mode and it's not listening to you or not giving you what you want.
What is your personal prompting technique? Especially because you use voice. I'm willing to type things to AI; I don't know if I'd be willing to say them. So what's your technique here?

>> It definitely is different when you have to say it out loud. I am super nice to the AI. I can vividly remember the one time I was mean to it. I don't know where this is all going; I'm going to be nice to all the models. What I do, for lack of a better way of describing it, is I just start over. I know that a lot of these things have ways of consolidating the context window now and sort of summarizing, but I will ask for what I call a "resume work" prompt. I'll say, "This isn't working. I want to resume work later with another AI dev. Can you give me a prompt with everything they'll need to know?" And typically what you'll find is that that prompt shows you where it was off. In its summarization of what it was doing, I'll go, "Oh, see, I wasn't asking for that. That's why we were not communicating." Then I'll take that resume-work prompt, prune it a little bit, pop it into another chat, and you'll find that you wish you hadn't beaten your head against the wall with the previous chat for 20 minutes.

>> You know, I am also team be-polite-to-your-AI, but then again, you hurt the one you love the most, and I've found myself occasionally getting testy. And you know when I stopped being mean to AI? It's when reasoning really started to show and I could see it reasoning about how upset I was.

>> Oh, it'll be like, "The user is mad at me right now."

>> "The user is really frustrated with me right now. I need to totally rethink my..." And I go, sweet baby AI, I'm sorry. I apologize. I'm not that mad at you. Okay, so: create a, you know, return-to-progress prompt, really get the summary, use that to understand whether there was some misunderstanding, improve it, and then just start fresh. That's great. Well, Tim, this has been super fun. So much for me to learn. I have tons of ideas, even just for my day-to-day life, about how I can use this. I have kids, so I probably have 30,000 photos.

>> Let me know if your mom wants OCR Party.

>> I will. She'll love it. Okay, Mom, I have gotten you your first vibe-coded app, direct from the podcast source. Tim, where can we find you and how can we be helpful?

>> Yeah. I'm not that active on social, to be honest, but I am on LinkedIn; you can find me there. I have a website that is itself a fun vibe-code project. You can find me at timmacle.com. I have a little chatbot there, the GP Tim. You can go chat with him and learn a little bit more about me and my work. Other than that, I would say tune in to Florentine Films' upcoming production. We have a series about the American Revolution coming out in November, on your local PBS station.

>> My kids are obsessed with the American Revolution, so...

>> Sounds like it's in the family.

>> Yeah, we will be big fans. Tim, this has been great. Thank you so much, and thanks for joining How I AI.

>> Thank you for having me.

>> Thanks so much for watching. If you enjoyed this show, please like and subscribe here on YouTube, or even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show.
You can see all our episodes and learn more about the show at howiipod.com. See you next time.

Summary

Tim McAleer, a producer at Ken Burns's Florentine Films, uses AI to automate tedious media management tasks in documentary production, transforming manual data entry into intelligent workflows for images, video, and audio using tools like OpenAI, Claude, Whisper, and custom apps.

Key Points

  • Tim McAleer uses AI to automate metadata tagging, description generation, and search for thousands of media assets in documentary production.
  • He built a Python API system that integrates image analysis, metadata extraction, web scraping, and AI-generated descriptions to create accurate, searchable media databases.
  • The system uses AI models like OpenAI's vision API for image analysis, Whisper for audio transcription, and CLIP for generating vector embeddings to enable semantic search (a sketch of the embedding fusion follows this list).
  • To improve field research, Tim created a mobile app called Flip-Flop that captures front/back images, transcribes text from the back, and embeds metadata directly into image files.
  • He built a desktop app called OCR Party to precisely extract text from old documents and newspapers, using AI for OCR and handwriting recognition.
  • Tim uses vibe coding with tools like Cursor and Claude Code to rapidly prototype AI-powered applications tailored to his team's specific needs.
  • The primary value of AI in his workflow is automating toil—replacing manual data entry with intelligent automation to free up researchers for higher-level work.
  • He emphasizes that AI's most practical applications in media are in tooling and automation rather than content generation, especially for non-fiction work.
  • The approach involves combining multiple AI models (OpenAI, Claude, Whisper, Gemini) and custom code to build robust, application-specific solutions.
  • Tim demonstrates that even non-technical creatives can use AI to build powerful tools by leveraging prompt engineering and AI-assisted coding tools such as Cursor and Claude Code.
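
A minimal sketch of the two-mode embedding and "find similar" idea referenced above, assuming the sentence-transformers CLIP wrapper, the openai SDK, and NumPy; the model names, the fusion method (concatenating normalized vectors), and brute-force cosine similarity are illustrative choices, not the production system.

```python
# Sketch: fuse an image (CLIP) embedding with a text (description) embedding,
# then do a cosine-similarity "find similar" search over the collection.
# Assumes `pip install sentence-transformers openai numpy pillow`.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer
from openai import OpenAI

clip = SentenceTransformer("clip-ViT-B-32")   # open-source CLIP for thumbnails
client = OpenAI()

def embed_asset(thumbnail_path: str, description: str) -> np.ndarray:
    img_vec = clip.encode(Image.open(thumbnail_path))
    txt_vec = np.array(
        client.embeddings.create(
            model="text-embedding-3-small", input=description
        ).data[0].embedding
    )
    # Normalize each modality, then fuse by concatenation
    img_vec = img_vec / np.linalg.norm(img_vec)
    txt_vec = txt_vec / np.linalg.norm(txt_vec)
    return np.concatenate([img_vec, txt_vec])

def find_similar(query: np.ndarray, library: dict[str, np.ndarray], k: int = 5):
    """Reverse 'vibe' search: rank assets by cosine similarity to the query."""
    scores = {
        asset_id: float(
            np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        )
        for asset_id, vec in library.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```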

Key Takeaways

  • Use AI to automate repetitive media management tasks like metadata tagging and description generation to save time and reduce errors.
  • Combine multiple AI models (vision, speech, text) with custom code to build specialized tools for your specific workflow needs.
  • Leverage metadata embedding to make media assets self-describing and searchable across different platforms and applications (a sketch of audio tagging with ID3 metadata follows this list).
  • Build custom apps (mobile, desktop) using vibe coding to solve specific problems in your production process.
  • Focus on using AI to automate toil—manual, tedious work—rather than trying to replace creative roles, to maximize practical value.
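
For audio, the same embed-it-in-the-file idea applies. Below is a minimal sketch using mutagen to write artist, album, and a licensing note into an MP3's ID3 tags; the music workflow is only mentioned in passing in the episode, so the field choices and file name are illustrative.

```python
# Sketch: embed artist/album/licensing info into a music file's own ID3 tags,
# as mentioned for the music workflow. Assumes `pip install mutagen` and that
# the MP3 already has an ID3 tag; file name and values are hypothetical.
from mutagen.easyid3 import EasyID3

def tag_track(path: str, artist: str, album: str, licensing_note: str) -> None:
    tags = EasyID3(path)
    tags["artist"] = artist
    tags["album"] = album
    tags["copyright"] = licensing_note  # "copyright" maps to the standard TCOP frame
    tags.save()

if __name__ == "__main__":
    tag_track(
        "cue_042.mp3",
        artist="Example Artist",
        album="Example Sessions",
        licensing_note="Cleared for broadcast, license #TBD",
    )
```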

Primary Category

AI Engineering

Secondary Categories

AI Tools & Frameworks, Machine Learning, Computer Vision

Topics

AI automation, documentary production, metadata extraction, AI for media management, OCR, vector embeddings, semantic search, custom AI tools, post-production workflows, archival research

Entities

people
Tim McAleer, Claire Vo
organizations
Ken Burns's Florentine Films, Brex, OpenAI, Anthropic, PBS, Library of Congress
products
ChatGPT, Claude, Claude Code, Cursor, Whisper, Gemini, Flip-Flop, OCR Party
technologies
CLIP, vector embeddings, EXIF metadata, OCR

Sentiment

0.80 (Positive)

Content Type

interview

Difficulty

intermediate

Tone

educational, technical, entertaining, inspirational, professional