Observability: the present and future, with Charity Majors

pragmaticengineer SvEjS4-2WJQ Watch on YouTube Published January 21, 2025
Duration
1:14:25
Views
79,189
Likes
1,614

14,445 words Language: en Auto-generated

The three pillars model, what is that?

The famous phrase goes, I think it was coined by Peter Bourgon back in like 2017, that observability has three pillars: metrics, logs, and traces. And a lot of vendors glommed onto this because, cynically speaking, they have a metrics product to sell, a logging product to sell, and a tracing product to sell. But it's actually kind of worse than that, right? Like, every request that enters your system, historically people have stored it in, sure, metrics storage and dashboards, and maybe structured logs and unstructured logs, and a tracing tool, and a profiling tool, and an analytics tool, you know, again and again and again and again and again. And what connects them? Nothing but you, the engineer sitting in the middle, going, "well, that shape looks like that shape, so they're probably the same thing," or maybe copy-pasting IDs. And the cost multiplier is obscene.

Charity Majors is an observability expert and the author of the book Observability Engineering. She worked as a software engineer at Parse, then at Facebook, and then co-founded Honeycomb, an observability startup. She believes that observability tools should be a lot easier and faster than they are today, and that every engineer should have the kind of magical observability experience that devs at Meta have with their internal tools. In today's episode, we go into what observability is and what things like the three pillars model and cardinality mean; what observability 2.0 is and why everyone is so excited about it; things engineering teams frequently get wrong about observability; and many more things worth knowing about modern observability practices. If you enjoy the show and would like to support it, please subscribe to the podcast on any podcast platform and on YouTube.

Charity, welcome to the podcast.

Thank you so much for having me. You and I go way back, and I feel like we don't get to talk enough, so this is a real treat for me.

We go way back. In fact, the way we started talking was so funny. We both, on our blogs,
were writing advice columns. The same advice question came to both of us, and we both answered, and we were like, "hey, hi there, friend."

Yeah, so someone asked something about, I think, measuring developer productivity, actually, funny enough. And then I think you wrote something and I wrote something, and we didn't know about each other, and then we discovered it later. I read your article and I'm like, "oh, I could have written the same." And I realized we're a little bit common souls.

Yeah, exactly. That was funny.

This episode was brought to you by Sonar, the creators of SonarQube Server, Cloud, IDE, and Community Build. Sonar helps prevent bugs, code quality, and security issues from reaching production, amplifies developers' productivity in concert with AI assistants, and improves the developer experience with streamlined workflows. Sonar analyzes all code regardless of who writes it, your internal team or GenAI, resulting in more secure, reliable, and maintainable software. Combining Sonar's AI Code Assurance capability and SonarQube with the power of AI coding assistants like GitHub Copilot, Amazon Q Developer, and Google Gemini Code Assist boosts developer productivity and ensures that code meets rigorous quality and security standards. Join over 7 million developers from organizations like IBM, NASA, Barclays, and Microsoft who use Sonar. Trust your developers, verify your AI-generated code. Visit sonarsource.com/pragmatic to try SonarQube for free today. That is sonarsource.com/pragmatic.

Trust isn't just earned, it's demanded. Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your governance, risk, and compliance program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in. Vanta can help you start or scale your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly. Plus, with automation and AI throughout the platform,
Vanta gives you time back so you can focus on building your company. Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001. With Vanta, they centralize security workflows, complete questionnaires up to five times faster, and proactively manage vendor risk. Join over 9,000 global companies to manage risk and prove security in real time. For a limited time, my listeners get $1,000 off Vanta at vanta.com/pragmatic. That is vanta.com/pragmatic for $1,000 off.

So to kick things off, can you tell us how you first got exposed to observability? Beforehand, when we talked, I remember that there was something about the startup that you worked at, Parse, and then it got acquired and you worked at Facebook, and then something happened there. How did this go?

Well, so around the time that Parse was getting acquired, you know, this was in the free money days, rocket-ship growth, and we were taking off. We had like 60-some thousand mobile apps when we got acquired; by the time I left, we had over a million mobile apps. And the traffic was so unpredictable. Like, a different app would hit the top 10 in iTunes, like, every day.

And this was back in 2013, 2014, right?

Yeah, exactly, like 2012, 2013. And we had built this on the Ruby on Rails stack, which was a smart enough idea at the time, but, you know, fixed pool of workers. And so any app takes off and suddenly, boom, Parse goes down, because all of the in-flight workers get caught up on threads. Any of the backends get slow, boom, Parse goes down. And I'm, like, the infrastructure, right? And as a reliability engineer this is just professionally humiliating for me, because we're just constantly going down. And I tried everything, you know, every tool out there. And the first glimmer of light that we got was when we got to Facebook. I was really cynical about it, you know, because all of the tools had been built in-house for in-house workloads; you have
the luxury of forcing people to use your stuff, you know, and I'm like, "this is not built for us." But we started feeding some data sets into a tool there called Scuba, and so we were able to slice and dice in real time on high-cardinality dimensions. Instead of just going, "oh God," because, like, Parse goes down, we're looking through the logs, and it might be full of a particular app's requests, but that doesn't necessarily mean that that's the app that caused it to go down. There might just be lots of stuff backing up, you know, it might be just a few very slow requests... you just don't know. And having the ability to break down and slice by things like app ID, user ID, raw query, normalized query, the amount of time it took for us to just pinpoint the cause and do something about it dropped like a rock: from hours (sometimes it would just recover and we'd never actually know what happened) to, like, seconds. It wasn't even an engineering problem anymore; it was like a support problem, right? And this was life-changing, right?

And so when Christine... so I'm coming at this from, like, an operational lens, and Christine, my co-founder, had written the Parse analytics product, and she had built it on top of Cassandra, and she was constantly being, like, professionally humiliated by the fact that with this product she had written for our users to understand their applications, you had to predefine in advance how to capture the data and what questions you'd be able to ask. And they'd be like, "but I need to ask this new question," and she'd be like, "ah, how embarrassing." She'd go and look it up by hand in Scuba and reply to them. So we had both had this experience from very different sides of the stack.

And so when we left Facebook... looking back on how ill-prepared we were to start: I had never heard the words "product-market fit," right? Deep in
infrastructure. We didn't know that categories existed or Gartner existed, so we were not setting out to build a category. We were setting out to try and explain this life-changing experience we had had. We knew it wasn't monitoring, we knew it wasn't logging, you know, which is why it took so much longer and was so much harder than it needed to be.

And then, just to get a sense: what was Scuba at Facebook, or how would you describe it? Because it sounds like it's a mix of just listening to events, but also monitoring and querying.

Scuba is this weird beast. It's kind of a columnar store, but it was in-memory. They wrote a white paper about it. One of my favorite things was that replication... it's a C++ binary, and it shelled out to rsync to do replication. It was quick and dirty. And the evolution of Scuba is actually an interesting one. They developed it about a decade earlier, when they were trying to get a handle on their MySQL issues. Like, there was a time when PHP and MySQL were, you know, just crashing all the time. Facebook was being professionally humiliated, right? Which is why I think it's so interesting: the genesis of Scuba, and thus Honeycomb, is not connected to the entire 20-, 30-year history of telemetry and monitoring and three pillars. It came out of complete left field. It's much closer to the history of business analytics, I think. So that's where it started.

Yeah. Now, jumping a little bit more back to basics: you've been in the observability field for a long time now, we can say now that it's called that. But what does observability mean to you? From a software engineer's point of view, how would you define it?

Yeah, you know, there are so many definitions floating around; I'm probably responsible for more than my fair share of them. But really it's about understanding your software. It's about understanding the intersection of your code, your systems, and your users, right?
That's all it is. And I feel like for a long time the observability field was obsessed with errors and bugs and outages and nines and crashes and all these things. And I feel like one of the directions that we're starting to pull the industry in is that it's not just about problems. The need to understand the impact of what you're putting out into the world is so much bigger and more interesting than that, which is part of why I feel like it's not just an operational tool, right? This is the tool that underpins your development feedback loops.

Yeah, and I guess maybe not just development, but at some level product as well, right? Like how your stuff works.

Exactly. And ultimately, Christine and I just did this exercise to refresh our mission and vision, and the phrase that we landed on was: we're here to help engineers explain and understand their software in the language of the business. I feel like there's this void. If you look at all of the C-level roles out there, the only one that has no template or definition is CTO; it's all over the map, right? And if you look at VPs and executive teams, engineering VPs have traditionally been kind of like the junior varsity league, right? They're not really in the core group. Why is that? It's because for so long engineers have been kind of the artists of the company: "you just trust us," right? We can't really justify... because what is it about executive teams? The point of having an executive team is that you're each other's first team and you co-own the most fundamental decisions about where companies invest their resources, which means you have to co-understand them, right? Everyone needs to understand enough about marketing to be like, "in the upcoming year our priorities are to move the needle in these ways, so we're going to allocate these," you know. For
engineering, it's just like, "we need 20% more people." "Why?" "Well, we just do." You know, that's about as deep as it goes. And so I feel like over the next five years or so, really helping engineering, and product, even design, get that first-class seat at the table, it all comes down to being able to explain and understand our work in the same language as everyone else, which is money.

So you're saying that, let's say, the CFO, so the finance team, the CMO, the marketing team, the CEO, operations, they can all explain: here's what we do, here's how that translates to the business, here's why we are important, here are the things that we're moving for the business, and if I get 20% more people, I can do this for the business, right? And then you're saying that engineering at a lot of startups, if I understand, doesn't really have that ability, and observability, or understanding how our stuff connects to the business, gets us there?

It's a translation layer, yeah. Being able to understand and reason about the work that you're doing and tie it back to top-level goals. You know, I don't think this is just a startup problem either. I talk to big execs, and nobody wants to say this in public, but... it's hard. It's really freaking hard.

Yeah. So going back to observability as a concept: it is an industry, and you've written a large book about it. Can we go through some common terms? If I'm a software engineer and I've heard about observability, what are some things that I should probably know about? Just before, you mentioned high cardinality, normalized query, but there's stuff like traces, instrumentation, sampling. What are some of the basics where you think, "look, if you're an engineer, figure these things out, look it up, read a book, talk to people, and then you'll be able to start"?

Yeah, I think it's important to know the difference between metrics... like, there's small-m metrics and big-M metrics,
right? We use the term metrics a lot just as a...

You said small-m metrics?

Small m, like the generic term for telemetry. We're just like, "oh, the metrics are blah blah blah." And then there's the metric, which is a number with some tags appended, right? And I think this causes a lot of confusion with folks. The metric is a very small, fast, efficient data type, but it's supremely limited because it doesn't store any contextual relationship data, right? I think one of the big shifts in the industry right now is away from the sort of three pillars model, where every request that enters your system you store in a bunch of different places, to a model where you have unified storage. So I think understanding the difference between the metric and structured data is pretty key. I think sampling is emerging as a really important lever, and it scares a lot of people, in large part because big logging vendors have put decades now into telling people "every log is sacred, don't drop a log line," you know, which always reminds me of the Monty Python "Every Sperm Is Sacred" sketch. Anyway, yeah, so I think sampling is important, I think understanding the data structures is important.

You mentioned something that I think is also kind of a given in the observability industry, or among people who do it: the three pillars model. What is that?

Yeah, so the famous phrase goes, I think it was coined by Peter Bourgon back in like 2017, that observability has three pillars: metrics, logs, and traces. And a lot of vendors glommed onto this because, cynically speaking, they have a metrics product to sell, a logging product to sell, and a tracing product to sell. But it's actually kind of worse than that, right? Like, every request that enters your system, historically people have stored it in, sure, metrics storage and dashboards, and maybe structured logs and unstructured logs, and a tracing tool, and a profiling tool, and an analytics tool, and, you know, again and again and again
and again and again. And what connects them? Nothing but you, the engineer sitting in the middle, going, "well, that shape looks like that shape, so they're probably the same thing," or maybe copy-pasting IDs, you know. And some of the bigger, fancier tools like Datadog have built sort of bridges: you can predefine, like, "this metric here ties into that log line over there." But again, you're in this situation where you're having to define in advance what information is going to be important and where you're going to need to connect it. And the cost multiplier is obscene, right? For some people, every request that enters their system, they're storing in 15 different tools for 15 different use cases. And the more of them you get, the more expensive it gets, the more sprawling it gets, and the harder it gets to correlate everything. So the 1.0 to 2.0 shift... like, data is data, there are a lot of different ways to climb the mountain, but fundamentally it's about moving from many sources of truth to unified storage.

Okay, so just so I get that right: observability 1.0 is this three pillars thing, the metrics, logs, and traces, which, you know, started like that, and apparently it is really good for selling a bunch of services, and you can sell expensive services with it as well. So what is 2.0? Because you mentioned unified storage, but can you expand on it? It just sounds a bit too simple. I mean, it sounds like people could have come up with this earlier, right?

Well, they have. They've had nice things on the business side for decades. They've been extremely expensive. But it's very much "the cobbler's children have no shoes" on the software side. It's just like, "well, we could make it work with our duct tape and baling wire." Like, I remember Vertica came out 15, 20 years ago, and, you know, the business... anyway. A lot of folks are talking about unified observability, but in a lot of situations, when you look under the hood, what they're talking about is either a unified bill or
at best a unified visualization. And the beautiful thing about having unified storage is you have no dead ends, right? You click on a log, you can turn it into a trace, right? You can visualize it over time, you can derive your metrics from it, you can derive your SLOs from it, you can take your SLO data, click on it, jump into it, see exactly which events are violating your SLO, and why, and what's different about them. But you're right, there is a lot more to it. I try to emphasize that because when I started writing about this, I started to notice all these other observability 2.0 articles popping up all over the place, where people are like, "oh yeah, we do that too, because blah blah blah." And it's like, all right. Some of the other things that I think are associated with 2.0... and by the way, I'm not trying to police this, like, "only Honeycomb does this." I don't think that at all. In fact, one of the most exciting things about this last year to me was that the batch of baby observability startups that we're seeing are no longer looking like cheaper Datadogs; they're looking more like cheaper Honeycombs, right? They're built on ClickHouse, they're built on columnar stores, they're OTel-native, they're using wide structured events organized around units of work. And I am so ecstatic. This is a better world for the industry, and I'm excited. But some other things that I think go along with this: I think it really does parallel the shift from observability being an ops tool, you know, organized around errors and downtime and outages and crashes, towards being something that underpins the whole development cycle. It's what underpins your feedback loops. And one of the most exciting use cases, I think, is in CI/CD: being able to visualize it as a trace, see where your tests are breaking, where your time is going. Because keeping that time between when you're building it and when it's
in production and you're looking at it, keeping that as small as possible, is like the most fundamental part of building great software and great teams, I think.

So when you say it underpins development, are you saying that with these new tools, either ones that exist today or that will exist, you're envisioning it as: I am coding my stuff, and as I push to CI/CD and, you know, there's a crash, something went wrong, I should be able to look at it? If it's the build that crashed, I should be able to look there; if the app crashed, I can just look at this, you know, magical dashboard, where in the past I had to look at the logs or, if I had a trace, look at the trace. So you're saying that I can just use this to actually develop faster, me as a dev?

Basically, yeah. I mean, it can be hard to generalize because there are so many different ways people do this thing. I think that having observability in the pre-production environment is good, but I really think so much of it is about accelerating time to value, accelerating time to insights, accelerating getting it into production. Some of the most exciting things that I think we're doing right now are about getting... you know, I mean, I usually have my "test in prod or live a lie" shirt on; today I have my database shirt on. But getting things into production immediately, right, and safely, like using progressive deployment, using feature flags. And the combination of observability, of the 2.0 framework, plus things like progressive deployments, canaries, feature flags, it's greater than the sum of its parts. Because when you can slice and dice and break down... like, say you get your code out within 15, 20 minutes, right? It's so fresh in your head: you know what you did, why you did it, how you did it, and you're looking...

But it's not like you're blasting it out to everyone immediately?

Of course, of course not. You ship it, you
know, behind a feature flag, you ship it to a canary, you ship it to 10%, and then you flip it on, you send some test requests, you know, you do it in this very controlled... it's like using a scalpel, right? You have this precision tooling that gives you confidence in moving fast.

So am I hearing it correctly that what you're saying is: when you have these modern and healthy engineering practices, things like feature flags, things like being able to test in production because you have some separation of tenancies, for example, if you have all of those things and then you invest in observability, trying to make it better, that is just a much bigger win than, let's say, just, "I don't know, let's get into this observability 2.0, whatever it is, let's modernize," which by itself is not going to be as big of a win, right?

Correct, correct. I use this metaphor a lot. I'm super blind, so it's like putting on your glasses before you go barreling down the freeway, right? When you're driving, you don't want to be just veering... it shouldn't feel like you're always course-correcting; it should feel like you're just driving, right? And when you're building, you should have these feedback loops that are so intuitive and so fast and so integrated that it feels like you're just building, right? That's the dream.

Yeah. Now, one other question I have related to devs: we talk about developers and observability, but how do you think o11y, observability, that's the short form, how does it relate to SRE, DevOps, and other roles? Because I feel that there are some companies where they're saying, "oh, observability, that is owned by our SRE team or our DevOps team." What's your view about this?

I think that, you know, obviously any vendor relationship, any product, has to have an owner. I think that's not necessarily a terrible idea. I think that a lot of the center of gravity is moving to platform
teams in a lot of places, because with platform, their remit is really managing that thin line, the fuzzy line, between our code and infrastructure code, right? It's permeable; somebody's got to own that line. But the other thing I like about the platform engineering model is that your customers are internal, right? It's a product-focused development organization, your customers are internal, and I like that model because some of the historical flaws of, you know, the SRE, DevOps, whatever models have been that, "oh, they own it, they own monitoring." And the platform model is so explicitly "you own your code and we're here to help you with that," and that is, I think, such a critical change.

I could not agree more. It's interesting how, when it's a platform team, suddenly it's not "they"; it reflects that you have skin in the game, and you cannot expect them to, for example, go on call for your stuff. Platform teams don't do that, right? They will build the tooling for you, but...

And to be clear, you know, ops is deep in my DNA, and it's not like SREs are going anywhere. The systems are only getting harder and more complex, and there's an area for expertise, and, you know, the consultative model is, I think, a great one. DevOps... if I might go on a little side rant here. Is there any term in computing that's been more contested than the word DevOps? I'm not sure. But whereas I feel like the DevOps philosophy of, you know, you work together, empathy, collaboration, is eternal and not restricted to software, I do feel like we're sort of in the waning days of the DevOps movement, because it's no longer considered a good thing to spin up a dev team and an ops team to then collaborate, right? Increasingly there are only engineers who write code and own their code in production. And I think this is really exciting. I think, you
know, we can understand why dev versus ops evolved, but it was always kind of a crazy idea that half your engineers could build the software and the other half would understand and operate it. That's just not a great way to break it out; it doesn't lead to excellence in either domain.

Yeah, I feel it's a little bit similar to waterfall, where people still talk about waterfall, but there is no waterfall. Waterfall used to be literally three- or four-year-long projects, but they don't exist, and so when people talk about waterfall, they talk about a two-month project. That is not waterfall. If it's one or two months, you can call it whatever you want. Which is a good thing, right? It's a good thing that we don't have waterfall projects anymore. Even government projects no longer take... I mean, most of them don't take, thankfully, that long.

Yeah, yeah. And to be clear, I get just as mad as anyone else when people are like, "DevOps is dead." No, it's not, that's the wrong takeaway. It's like any good movement eventually fulfills its purpose, right? And I feel like the movement of DevOps is in the fulfillment stage.

I agree with that. So I have a question I've been wanting to ask you: why do you think observability is so darn hard? Every single developer has their story of why observability is hard, if they actually did it, wherever they sit. First of all, what was your first time when you just realized, "this is just really hard, even though it shouldn't be"?

You know, if I can be honest, I actually always hated the observability and monitoring story. I would do anything to weasel out of being the one to own it, including, three companies in a row, I hired my friend Ben Hartshorne, who loves this stuff. I'd be like, "Ben, please come work with me, because I hate this." And he would come in and he would build all the graphs and I would bookmark them. Like, I have always hated this.

Oh my gosh, look where that got you.
I know, right? Anyway, you know, it's hard because software is hard, right? And it's like the first line of defense. Not only do you have to build the software, but you also have to have this sort of meta thread that's watching what you're doing and going, "what is future me going to need to know, or what is future me going to need to understand at 2 a.m.?" And that's just a muscle that takes a long time to build, right? And I also think that historically... some of my gripes about the past generation of tooling are just that it really expected people to master at least two disciplines. So many of the tools for so long required you to convert the code that you were writing into, like, physical resources: "what is this doing to my CPU and my RAM?" That's just, okay, this is a little too much to ask of your average software engineer, you know?

Okay, I'll ask an easier question. One of the biggest frustrations I think non-technical people like the CEO or CFO have is when they look at the bill of an organization: number one is going to be the cloud cost, or infrastructure, depending on if you're self-hosting; number two is almost always observability. And it just feels like it gets out of hand, and, you know, engineers are told to optimize the bill or make it lower. Why do you think the costs just get out of hand in so many cases? I think if they don't get out of hand, that is the exception. What makes it so darn expensive?

I mean, what makes it so darn expensive is the complexity of our systems and our high bar. Like, the easiest way to slash your observability bill is to give fewer shits about your customer experience, you know. And to be clear, there are companies who, every single request... delivery companies, banks, right, you don't really have a choice, you
have to understand every single request, so there's a certain built-in cost. And then there are, like, advertising companies, where it's more spray and pray, you know, you can get by on, like, buckets. And so I think that's a legit angle to look at. The second one that I would call out is the thing I mentioned before, which is just the multiplier effect: how many times are you storing a record of this data for every request that enters your system? You know, at Honeycomb we've been building this for, it'll be nine years on January 1.

Wow.

I know. We were so small for several years, and the last couple of years it's just taken off. But a big part of the driver from 1.0 to 2.0 is that suddenly money is not free anymore, and people are taking a look at the multiplier effect, and it's just unsustainable. No matter how tightly you're trying to keep control of cost, it's unsustainable if you're storing 15 different copies for every single request. The third one that I'll point out is the one that personally bothers me the most, which is cardinality. If you look at actual observability engineering teams using your traditional three pillars models, they usually, and this is true, spend an outright majority of their time trying to govern cardinality. You know, you can go to bed Friday night with a $200,000-a-month Datadog bill, make no code changes over the weekend, and wake up Monday with a $2-million-a-month Datadog bill, just because the cardinality changed out from under you.

Can we pause here? Because I don't think every software engineer will actually understand what cardinality is, especially if you've not worked with observability. Can we break it down, like, what does it mean? Because it's wicked important, I know, but the first time you meet it, it's not an obvious concept, is it?

Yeah, no, not at all. Cardinality refers to, I think the mathematicians call it, the
number of unique items in a set. Basically, it means that if you've got a collection of 100 million users, any unique ID is going to be the highest possible cardinality, so like request ID or, in America, Social Security numbers, right? If you have 100 million users, you have 100 million values. And then the lowest possible cardinality would be a field with just one value, like species equals human, right?

Yeah.

Now, the point of big-M metrics tools is that they are built to handle low-cardinality data, full stop. And there's this very traditional experience that everyone who uses these tools has: they start using it, they're happy with it, they append something like hostname, right? And then they get to have more than 100 hosts and suddenly it all breaks, or it becomes [unclear], and they're just like, what the hell?

And just to explain the behind-the-scenes of why it gets so expensive: there's no relational data, right? In time series databases, every unique combination of number and value you have to store again and again. So you have the term "custom metrics". I always thought that meant a custom metric that you've defined in your code, a line of code. No, it refers to unique combinations of metric values. So basically every single unique combination will take up more space. And this is, as you said, a rookie mistake I've seen when people use one of these many products: you're like, okay, I want to query for what city, what country this event happened in, and then they add IP address, and IP address is unique to everyone, and suddenly it just adds so much to your bill.

Yeah, it can, like, 100x overnight. And so world-class observability engineering teams end up spending an outright majority of their time
just trying to govern it, because the irony is that this is the most valuable data, right? The more unique the data is, the more identifying it is, which means the easier it is to debug your systems and understand what's happening. And it's not just storing the data, it's also being able to use it to slice and dice and break down and group by and explore. You just can't do this with stuff that's built on the metric data type; it's actually impossible. And this is what we talk about as observability 1.0.

So how does it change? This feels a little bit to me like, if I had to compare, COBOL, the programming language from the '60s: you actually had to worry about where your program goes on the tape, and you structured your code accordingly. And it kind of feels to me that if you're using these observability products, when you're thinking, I want to log stuff, you need to be thinking about how expensive this will be if I add this field, which is an IP or a website. Oh, I shouldn't add that, I should transform it. Which, I mean, in 2024 sounds a bit silly.

It does.

Because you're optimizing machine time even though machine time is cheap. But I guess storage is expensive. And so what's the solution?

The solution is we have to move away from tools that are backed by big-M metrics. We have to move towards tools that use structured data, where you can have high cardinality, where you can store lots of... Like, I've often said the bridge from 1.0 to 2.0 is logs, right? It's emitting fewer logs but wider logs. The wider your log, the more context you're attaching to each event, and the more context you have, the better your ability to identify outliers and correlate things. We have this thing called BubbleUp, which is just, any graph you have, you can be like,
what's that? Draw a little bubble around it, and we'll compute for all the dimensions inside the bubble versus the baseline outside the bubble. So you're like, what's this little spike? And you're like, oh, I see, all of these are for requests coming from Android devices, from this region, going to the secondary, with the batch size of blah blah blah, and it's taking this. So much debugging boils down to: here's the thing I care about, why? What are all the ways that it's different? And gathering up your data in this way... You ask why it's so expensive, and it's because the model does not fit the needs that we have for this data.

And do I understand correctly, because for software engineers it's always trade-offs, right? I think it's easy enough to understand that observability 1.0 optimizes so that when it stores the data, you can immediately query it and you will get it almost immediately; you don't need too much computation. When you're doing the opposite, okay, I can just store whatever, it's not going to expand my storage cost, I'm assuming there's a trade-off with, let's say, compute, right? You're going to compute it later, or you'll have post-processing or something, right? You don't get anything for free. You're shaking your head.

So, you're right, there are always trade-offs, but they might not be exactly the trade-offs that you think. This is made possible by the falling costs of storage and compute and all these things, absolutely. Metrics were optimized for a world where all of your resources were so expensive, right? You're just like, I can't afford to store this NGINX log, I'm going to derive some metrics and store that. But no, you should be able to slice and dice in real time. Another aspect of this is I don't think you can really move to a 2.0 world without taking your wide structured data and feeding it into a columnar store, because if you're using a traditional relational database, you have to define in advance
the indexes, the schemas, all these things. You want to be able to just, oh, this might be useful someday, drop it in, and immediately...

Yeah, for logging you shouldn't have to. It feels a royal pain to do it. It feels a little bit like what typically happens when you don't have good logging or good practices: you have a bug, something crashes with a customer, you realize you have zero logs and no way to look at it. So what do you do? You ship something to production, backend or mobile app, you add logs, and then you tell the customer, we now have logs, can you please update the app and try it again? I mean, it's doable, but it's pretty darn embarrassing.

Right. Capturing enough rich telemetry all the time that you can go back and just be like, oh, what did they do? Right.

Yeah. So, a common worry about observability these days. I mean, you could build your own observability, but unless you're Facebook or Google, you probably shouldn't do it, and they do it already. But there's this worry about vendor lock-in: thinking, okay, whatever I choose, I might be locked in, or should I try to choose a vendor that's not [unclear]? Now, you work for a vendor, so on one end you're going to be biased; on the other end, you often speak truth to power. How big of a deal do you think vendor lock-in is? Can you avoid it? Should you even want to avoid it? Is it even possible?

This is a rare spot of good news in the landscape. Historically this has been a huge problem. OpenTelemetry is changing everything. The goal with OpenTelemetry has always been that you instrument your code with OTel and you can basically take your fire hose and point it at whatever vendor you want, which is... there's a little bit, but it's 90, 95% true. This is a game changer. Forcing vendors to compete for your business based on being excellent and responsive and a good value, instead of keeping you locked in their ecosystem, is
and honestly, this is the first year where I've really seen this start to come true, and it's really exciting.

It's interesting, because my next note was exactly this: I made a note, OpenTelemetry. I've heard about it, and again, don't forget I'm a bit of an outsider for observability, right? I know software, but not the details of this. What is OpenTelemetry? You said how great it is, but what is it and why should we care about it?

Yeah, I mean, it's the inheritor to... you know, Google had, like, OpenCensus and OpenTracing, and both kind of flopped. Ben Sigelman and the folks at Lightstep actually specced this out. And get this: this is also the year where it overtook Kubernetes as the number one CNCF project. OpenTelemetry is now the top project in terms of commits and committers.

Yeah, it's huge.

It's amazing. And it's a few years old now. It gets critiqued a bit for being kind of big and bloated. It does the job it needs to do. I think what most people need to understand is this: it does a lot of jobs, it does them well, and increasingly you don't need to understand everything about it to get the value out. For a long time it was kind of funky and you had to really invest in understanding it. Increasingly it's getting to the point where it just accelerates. A lot of the value is, even if you don't think about OTel at all, if you get your data into an OTel-enabled pipeline, it gets consistent naming, consistent structure, there are semantic conventions and stuff. And what this means is that when it gets to the server side, your vendors can do amazing things with it. We can infer, we can derive, we can compute, because we know what the data is, right? So we can do a lot of really exciting things with it, and I think you're going to see a lot more of that in the next couple of years.

And so, just to get a sense of what exactly it is, I'm on the website and
it says that OpenTelemetry is a collection of APIs, SDKs, and tools, built to make telemetry portable and effective, and you can use it to instrument, generate, collect, and export telemetry data. So do I understand correctly that a lot of languages or frameworks have, I don't know, APIs that you can use, and then you can kind of plug in vendors underneath? Or how do the vendors come into the picture here?

You know, some vendors support their own collectors, and a lot of them are contributing back to the core project. But basically the idea is, however you do your instrumentation, it's trying to provide frameworks for consistency, right? It's a little bit too big to generalize. If you're like, does OTel do this? The answer is probably yes. But basically it's just getting people's telemetry into a consistent format with consistent naming and semantic conventions, which means that we can do a lot of great stuff with it.

Okay. So is it safe to say that if I'm working at, like, a midsize company or on a project that I know will need observability, or already needs it, it's kind of a safe bet to look at OpenTelemetry and see if at least parts of my pipeline can adhere to it? Because, A, this will hopefully make it a bit better, and B, if the question of portability comes up, it'll be way easier to move vendors, if it ever comes up, right? Because we know that a lot of companies will never move, but knowing that you could move also helps. I mean, I'm kind of talking a little bit against possibly your business, but for any business the best negotiation is saying, we could move if we wanted to. So let's talk: should we do a longer-term commitment? Do you have new features that you ship that are unique, that your competitors don't have, and that we need? And that kind of gets the vendors to
compete on the territory that they should be competing on. What are some common things that you've seen engineering teams get wrong about putting observability in place? Some of the most common ones.

Oh boy. Feeling like they don't need to start, they don't need to have any until it's in production and things start breaking. You really want to be developing with it. You want to get in the habit of understanding yourself. Like, what, you're going to attach a GDB to it? No, you want to debug your software the way you're going to debug it in production. Shift left, shift right, whatever you want to call it, you want to do that early. Other areas... A lot of folks get really attached to their dashboards, and unless your dashboard is dynamic and allows you to ask questions, I feel like it's a really poor view into your software. You want to be like, and then what, and then what, and then what? You want to be interacting with your data. If all you're doing is looking at static dashboards, I think it limits your ability to develop a rich mental model of your software, and it means there are things you don't think to ask or graph or dashboard, so you don't see them, right?

Yeah, it's interesting, because I built so many dashboards. Whenever you walk into an office where there's a team, usually they have a dashboard and it looks cool. All of the dashboards we've had always looked cool, and, I'll be honest, they were kind of useless. I mean, they were good when someone important walked by, a director or VP: oh, this is the team, here are their stats. And after a while, because we realized that's what we mostly used it for, we made sure that it had nice numbers and it was always green. It sounds silly to say it.

No, I get it, I get it. A public
dashboard, you didn't really... So, and then we had private ones where we actually had the real stuff.

Relatedly, I think that a habit that more and more teams are picking up, which I think is super important, is using SLOs as their entry point instead of using dashboards as their entry point.

Oh, so using them as an entry point for what?

For understanding, debugging, interacting. I have this sticker that I made; it's like, SLOs are the API for engineering teams. An SLO is your agreement. You're like, we will provide a level of service that we all agree, internally and externally, is good. This means we have a budget, right? We can use what's left over in our budget for running chaos engineering experiments. One of the things that we did at Honeycomb a few years ago: we had Kafka nodes that kept just vanishing on us, and it was really frustrating. So we took some of our SLO budget and we started killing Kafka machines every day and working on the automated recovery process. And to this day, every Monday we kill the oldest Kafka node. We just shoot it in the head, and so we're always testing the bootstrapping process, right? We stopped getting paged in the middle of the night because of Kafka nodes, because we're constantly testing this thing. But if you have SLOs, it's also the greatest hedge against micromanagement, I think, because if you're meeting your obligations, then how you spend your time as a team is below the fold. Nobody should care about that if you are meeting your obligations. It's also a way for you to negotiate and be like, hey, we're not meeting our obligations, so we have to put a hold on this feature work, because we have to do this reliability work to get ourselves back to a place where we can deliver on our obligations. So there are just so many ways that I think... and
an anti-pattern, I think, is when your SLOs are not derived from the same data that you're using to debug, when it's something that's out there on a satellite, not connected. It's like, well, now you have one more problem, right? You really want it to be like: here are my SLOs; oh, I don't understand why my budget is going down faster; click on it and see why.

I really like you saying that SLOs can be a way to avoid micromanagement, and it should be, because no one wants micromanagement. I think it's kind of fair, right? Micromanagement is warranted when you're doing a terrible job; then the manager or tech lead or director should come and look at you. But otherwise they should leave you alone, and I think that's kind of fair.

Right. Yeah, it's a good way to think of it.

I wish more teams would take this, and hopefully they can take this as an inspiration.

I think it's picking up steam. I see a lot of people really dealing with SLOs, and I feel like that wasn't true for a long time.

So with Honeycomb, the observability startup that you're building, what was a major exciting or interesting engineering challenge that you had to solve to actually build this product?

I mean, we had to write our own database.

Your own database? Really? You're kidding me. Why?

Because, you know, when we started in 2016... To be clear, I spent my entire career telling people: never write a database. Don't do it. If you think you want to write a database, trust me, you don't. We wrote a database. But Christine and I got started and we were just like, well, ClickHouse wasn't around, Snowflake was, like... And ironically, I guarantee if ClickHouse had been around, we would have used it, and I'm now really grateful that it wasn't and we didn't, because the data model is so custom to us, right? And being able to iterate
on it, add traces, add all these things, has been a real force multiplier for us. I will say, it's why we lost our earliest investor when I was CEO. I met with him a year in and I was like, well, we're starting to get some interest, but we're going to need more money. He's like, well, if you were going to succeed, you would have succeeded by now, and I'm not giving you any more money. And I'm like, well... And he's like, you know what, you shouldn't have spent all this time writing a database. You should have found product-market fit first and then written a database. And the thing is, as snotty as he was, he's right. He's absolutely right. That is the common wisdom; 99 times out of 100 that is the right, smart thing to do. We were too dumb to know better, so we accidentally did the right thing.

And so how did your database help you, and what's it called, even? It's internal to you, right? You never opened it up.

Yeah, yeah. It's called Retriever. All of our services are named after dogs. We have Dalmatian and Poodle and, yeah, Retriever and Basset and Hound and all these things. And so how has it helped us? I mean, it's...

What kind of database is it?

It's a columnar store. It's been through a few different evolutions. Actually, Sam Stokes gave a great talk at Strange Loop, I think in 2018, about the internals; people can go and look it up. And then a couple of years later, at the very last Strange Loop, Jessitron gave a talk about how we've evolved it, because at some point around 2020 or so we actually serverlessed our database. Initially we were using the columnar store all on local SSDs on EC2. But the vast majority of data that gets written to disk never gets queried by anyone, right? And it was really expensive. We were just like, you can't build a business on
this, where 99% of data never gets queried yet we have to pay to store it. And so Ian Wilkes, one of our principal engineers who's been here since the very beginning, actually moved the query planner to Lambda jobs. Shortly after the data gets laid down on the SSDs, we age it out to S3, and then we do this massive fan-out and merge at query time.

All right, that's pretty crazy that you built it. I mean, congrats.

Yeah. You know, I still tell people never write a database, but there's an asterisk: once in a while, you really can't.

Well, what I also like about this is that there are startups that succeed because they don't follow the beaten path. In fact, some of the more interesting ones I talk with, we covered some in the newsletter, I'll link it in the show notes below. Figma, for example, ignored the wisdom of launching in six months. They took three or four years to build their first version. Again, they burned a lot of money, but it was the right thing to do. There was another company, Antithesis, which built an advanced debugging tool for four years. I just talked to the CEO of Antithesis a few days ago, and they also built their own database. But they did something wild, which is they didn't write any tests; they're having their platform test it. It's wild. But again, he told me the same thing: people say don't write your own thing. So I think it's a little bit reassuring that the rules... when you know where you want to go, I mean, there's a chance that you might run out of money and whatnot, but then you just do it. Take all the advice, but you don't need to Ctrl+C, Ctrl+V to succeed.

Yeah, no, 100%. I'm a big fan of the innovation token metaphor: you've got, like, two or three innovation tokens as a startup, so spend them wisely. And we definitely spent two on our internal storage
engine.

Of the three that you had in total.

Back at the time. But it pays off. Every time we go to write a feature, we're not fighting our storage engine. We can build our storage engine to do what we want it to do, and it's actually an incredible force multiplier.

Okay, so let's jump into a pretty interesting topic, which is observability and AI. LLMs are a super hot topic these days, and they're everywhere. How do you think about observability and AI systems?

Yeah, such a good question. I feel like there are three places where AI really intersects with observability. Number one is when you're building and training a model. Number two is when you're developing with LLMs. And number three is the everyone problem: we're all now dealing with this influx of software of unknown origin. It used to be you could pretty much guarantee that someone, somewhere understood the software at some point, and you can no longer take that for granted. And that is so funny, I think, because it really harkens back to the origin story of Honeycomb at Parse. We had developers all over the world just writing snippets of JavaScript and uploading them, and we just had to make it work, right? Mongo queries: they'd write them and upload them, and we just had to make it work. And so many of the things we sort of forged in fire to understand this unknown software, it's like, oh, now this is the worldwide problem that everyone is having. So it's a little fun. I wrote a blog post about observability and AI last week.

Oh, cool.

Yeah, I don't think it's super mind-blowing or anything, but a couple of the conclusions that we're coming to: first of all, if you can compute the answer, you should probably compute the answer. We see a lot of observability vendors out there who are like, AI this and that, and it's like, yeah, but it's actually a guess, right? And if you had gathered the context, then you could have
computed it, and it would have been faster, cheaper, easier, and better. But instead you're like, we put AI on it, and it's a guess. That's not better. It's not better just because it has AI on it. AI can make things worse too. There are certain problems where AI is the right tool for the job; when it comes to calculation and computation, it's not one of them. Another thing that we're really seeing a lot of... we actually have some customers who are really sophisticated, like model builders and stuff, but everybody in the AI community is so tight-lipped. They don't want to talk about anything, and it's like, guys, come on. But I feel like one of the early lessons for Phillip and me has been that you can't have good AI observability in isolation. You have to have it embedded in good software observability, right? There are all these startups out there that are raising just buckets of cash to solve the problem of AI observability, and they're all focusing on the sort of self-contained models. And it's like, but the inputs come from all these different services and data and stuff. It's a trace, right? It's a trace-shaped problem. You have to be able to trace it all the way from all these inputs up here in software land, through the model, to the human feedback. It's a classic [unclear].

And by AI observability you mean that the problem is, okay, I have an LLM, or a system that uses LLMs in my software, and I want to add observability to it, to understand how it works. And you're saying that a lot of the startups are focused on, all right, let's just wrap the model, observe the inputs and outputs, and we'll see it. Whereas this thing is actually built around other parts of your software: you want to see the user interface, you want to connect to customer support tickets, that kind of stuff.

Yeah, absolutely, it's a trace-shaped problem. Yeah, it's a
trace-shaped problem, it's a high-cardinality-shaped problem, and it's a high-dimensionality-shaped problem. This is a software problem with non-deterministic elements; it's not an AI problem, is how I think of it.

Yeah, and there's going to be a lot of money in it for sure, just because of how much money there is in AI. But I really liked your third point. You said that the three buckets are: number one is observability for these LLM models, when companies are building AI; number two is observability for developers who are writing code using LLMs; and number three, which I think is the most important, is observability for this code generated by AI, which we know is hard to tell whether it's good quality or not, but there's just going to be more of it. And this is what you said: back at Parse you were used to adding observability for basically all sorts of code, like JavaScript and whatever, that people uploaded to power their mobile apps.

Yep, exactly. Production is where code meets reality. It doesn't matter how pretty it looks, it doesn't matter how great; you don't know if your code is good or not until you've watched it run in production. So I feel like, like so many things in the age of AI, this is not new; it's just a really intensified version of what we're actually dealing with.

Yeah, and what it tells me is that for any company that wants to use these new AI agents, or just more AI code, probably the prerequisite for doing that and not getting burned is to have good observability, so that you know when stuff breaks. And from there, the next step might be figuring out, can there be a feedback loop? Can I actually allow some of this AI to push to production, and all of those things? But without that, you're going to be flying blind, which is just stupid to do with something as unreliable as an LLM.

Exactly.

A
different topic, but also a pretty important one. Every company these days has a choice of building, buying, or using open source. In the case of observability, this usually comes down to: should we buy, or should we use open source? Because, again, building from scratch doesn't really make sense. All of these have upsides. Do you predict any of these will gain more momentum, based on what you're seeing? Do you see more companies going to use open source and try to host it, or maybe more of them giving up and going to vendors, maybe using OpenTelemetry, those kinds of things?

You know, insofar as OpenTelemetry is open source, I think it has a really bright future, and I'm so relieved to see it. When it comes to running your own open source versus using vendors, the main trend that I'm seeing in the space is consolidation. I think it's part of people just trying to get a handle on their bills, which totally makes sense, right? I feel like a reasonable benchmark for observability spend is 15 to 20% of your cloud spend, depending on the type of business or whatever. I think it's a good rule of thumb. But if you're paying five different vendors and each of them is gunning for that 15 to 20% of your cloud spend, that's just nuts, right? You can't do that. So there's a big consolidation in the industry. The only open-source vendor that's really involved is Grafana, right? And Grafana is doing very well; they just raised a huge round. I think, you know, different models. Although I guess I should count Prometheus too. But I think of Prometheus and Datadog as the last, best capital-M metrics products that have ever been or will ever be built. Nobody's ever going to try and launch another big project, because what's the point? These are mature, they're great,
honestly, and they're just very mature technologies. And metrics, for all the talking that I do, metrics have a place in the ecosystem, right? It's just that right now it's like 80% of what people use is metrics and 20% is structured data, and they need to invert that. It needs to be 80% structured data and 20% metrics. But metrics are still great for cheaply plotting trends over long periods of time, right? Or for counters; counters are a very essential metric thing. Or at a certain scale, which is a lot higher than a lot of people think, structured data is too expensive and you should use metrics. So there are niche use cases for metrics, and so Prometheus, I think, will continue to be a contender. I think Grafana is great, I think that, you know, Datadog, etc. But in general, most people are using vendors, and most people don't want to have to deal with this stuff under the hood. When everything breaks at 2 a.m., you don't want to also be dealing with a broken observability tool on top of your other broken software. So it makes sense to me, and I'm not saying that just because I'm a vendor. I mean, I would say that though, wouldn't I?

Yeah. A question I got from a reader on social media: what about frontend and mobile observability? Because when we talk about it, it feels like so much of it is about the backend.

Yes, yes. So we actually launched our RUM replacement this year for frontend, trying to do the same thing for RUM that we have done for the backend.

What does RUM stand for?

Real user monitoring. It's basically the frontend: it's organized around browser sessions and user sessions instead of backend requests. But I think it's so critical. I feel like so often the borders of tools are what create silos. So if you've got one team over here that's using a completely different view into their software than the other teams, it's like, you get
together and you spend more time arguing about the nature of reality than trying to solve the problem together. When you have a common... again, back to that unified storage, right? Many different views, many different entry points, but a unified view all the way from your mobile device or your browser to the database and back, it's such a powerful thing. Mobile is a different beast, and we've started dipping our toe into it this year. It's no surprise that mobile is kind of the only standalone solution out there that's sort of left, and I think a lot of folks are trying to figure out how... I mean, you come from a mobile background, so you probably know all the reasons for this better than I do.

Well, I mean, right, there was Crashlytics, which was... Yeah, it's such a weird thing, because what I've heard is Crashlytics, around like 2012 or something, was acquired by Twitter, and for a while it was kept alive and then it was destroyed. And two things happened there. I think they bought it for about $300 million, and apparently it was such a low return that the VCs really got spooked, and they never invested in another company that was doing this, because they said, if this is the biggest exit that was there... And it also always feels that mobile has kind of been left by itself. There are some tools, but none of them are really first class; none of them really come from the vendors that are doing the proper kind of backend observability monitoring. And I think everyone thinks it's such a small market, which I kind of disagree with, honestly.

It's not small. But I do have a theory for why this is, and it's because the build pipeline is so alien and different, because you can't do CI/CD, right? You've got the Apple Store gating, or you've got the Android diaspora, and the inability to fold it into the best practices of software development... I think that's why it's out on an island by itself.

No, mobile remains a little surprisingly archaic in
this, and a lot of it has to do with Apple, and then Google going along, not allowing binary code to be shipped directly without their permission. Obviously companies are, you know, like every team is going around, they're doing feature flags, and there's some JavaScript here, JavaScript there, but it's all hush-hush, under the... Apple kind of knows, but it's different. It's a fun world. Yeah, every area has its own challenges. Every area does, yeah. And the last question I got, again from someone on social media, from hus nine. This person wrote: I'm starting a new company today; what is the right time to start investing in observability, and how can I design for it upfront? You know, startup, fresh idea. I think it's as integral to building software as tests. I do. I think it's the same sort of thing: the best time to instrument is the best time to write tests. The best time to instrument is while you're writing the code. Anything that you try to slap on after the fact is not going to get at that original intent of what you were trying to do when you had it in your head. And I think that, done correctly, it actually accelerates your development. It doesn't hold you back; it accelerates your ability to get stuff out, to understand, to keep moving. And so, yeah, I would say as soon as the code that you're writing is real, something you intend to put in front of a user, as soon as you start thinking about writing tests, you should be thinking about writing observability. So it sounds to me like we're kind of saying, when you're prototyping and you're doing throwaway stuff, and again you wouldn't write any tests, don't worry about it, but when you're like, okay, this might go out... I really like the test analogy, and the interesting thing with tests is there's two types of people. When you say, oh, tests will speed you up in the long term, people who haven't written tests or haven't seen it,
they're like, that's BS, it takes time, it cannot speed me up, I'm just going to skip it. And then people who've seen it, they're like, no, no, no, you do not understand. And now you see the two startup founders: one of them is starting by writing the tests and they actually do get sped up, where the other one says it's silly. You know, there's that comic about pushing the car on the square wheel: someone brings a round one, saying, oh, let's change it. No, no, no, it will take too much time. Oh, so real. I feel this might be the case with observability: once you've seen it, you probably cannot unsee it. Exactly. So let's close with some rapid questions. I'll just shoot a question and you tell me whatever comes to mind. People told me I need to ask something around management for you, so I'll ask: do you like being an engineer or a manager more? I love being an engineer. I love it. You know, the early years of Honeycomb were really rough. I didn't expect to have to be CEO, and I wasn't a very good one, but so much of my identity came from being an engineer, and it was really, really hard for me to kind of move past that. I will say that when I was an engineering manager, I hated it, but now I look back and I'm like, oh, I kind of miss that, which is why I actually have this open calendar link where people can set up time to just kind of bring their problems to me and talk about them, and I get such a kick out of it. I miss those aspects of doing engineering management. But I think being an engineer is just... you get paid to solve puzzles all day. That dopamine hit of just figuring things out, making things work. Yeah, I would just go around feeling high all day. It was really fun. Yeah. What is a controversial thing that you believe is true? Oh, absolutely nothing, Gergely, how could you say such a thing to me? My... what am I thinking? All right, what are you thinking? A controversial thing that I believe to
be true. You know, I actually wrote an article, yesterday I got out something about founder mode. So I'm not sure if this counts as controversial or not, but I think it counts as controversial in the Silicon Valley, like YC: I do not think that it's a good idea for the CEO to have to approve everything that goes out. I think that is egotistical, I think it is wrong, I think it hobbles good decision-making and judgment in other parts of the organization. I wouldn't want to work for someone like that. You know, there's a group of folks who just idolize Steve Jobs, you know, Johnny, and I think that Steve Jobs was obviously a very successful person, and I think in large part despite the fact that he was a raging control freak. I think he was successful despite that, not because of that. It makes me really sad to see so many bright people be like, oh well, I should also be an ... and control freak. Think of how many brilliant, wonderful people probably left because they couldn't stand to be micromanaged in that way. Well, I think it also goes to show there's just so many ways to succeed, that there's no one way. Many ways, there's so many. And also, to be fair, so many personalities, right? So I'm going to just take a wild stab: you don't operate like that, right? Don't I? For better or for worse, I've been a worker and an employee much longer than I've been a founder and a C-level, and I don't know, it's so clear to me that the way that you bring out greatness in people is by supporting them and empowering them and giving them agency and giving them control. And yes, be in the details. The advice of "hire people and get out of their way" is terrible advice, because you need to be doing the work to create alignment, to make sure that you have a shared view of what good looks like, what great looks like, so that you can course correct early when you start to diverge. So yes, be in the details, but you don't take people's agency away from them. So if you
are not building an observability startup, what would you be doing? I would be a staff engineer someplace. In fact, that's what I plan to do next: just go be an engineer in someone's company, build stuff, at 5:00 p.m. I go home and I turn my brain off, and it's like, it's your problem now, bud. It's gonna be amazing. Oh, you know, for some reason I want to believe you, but I don't think you're going to... the turning off the brain at 5:00 p.m., I mean, specifically. Oh, that part, yeah, yeah, maybe not. I mean it, though: I plan on going and being an engineer for a while. I think you can only... like, I hate the term thought leader, it just makes me a little nauseous, and I really think that the farther detached you get from the work, the lower quality... I don't know. For me, at some point I need to circle back and do something with my hands in order to not cringe when I hear myself speaking. No, I absolutely hear you. So you're a big fan of whiskey; what is a current favorite? You know, for the longest time I really liked peaty scotches, and I'm still kind of coming to terms with my identity as being more of a bourbon person now. Oh yeah, no, interesting. I really like WhistlePig, and I really like the rye WhistlePig too. My favorite ever is impossible to find now, though: it's called George T. Stagg, and it is like 190 proof. It is so good. You can't find it, but they started making something called Stagg Jr., which is like 80% as good, so if you haven't tried it, try Stagg Jr. All right, and what's a non-fiction book you'd recommend for software engineers? The book is called Fluke, by Brian Klaas: Fluke: Chance, Chaos, and Why Everything We Do Matters. I'm not a religious person, but I do really look for ways of having meaning in my life, and the takeaway for me was that everything we do really does matter, because you don't know when the thing, like a stray comment that you drop in someone's
presence, sets off something in their head that changes their life, you know, or causes them to start a company, or causes them to get sober, or just, you know, all these ripple effects happen. The things that we do and say, yeah, 90% of them, maybe 95, don't actually set anything off, but things that we do really do matter. It's called Fluke. I've read it three times in the past year; cannot recommend it highly enough. That is a very strong recommendation. All right, now I'm going to get it. Thank you. Well, Charity, this was a really interesting and fun conversation, as usual. So much. It's so nice to see you. I wish we lived closer so we could get together more often. It's always a delight getting to spend a little bit of time just shooting the... Thank you to Charity for sharing all these interesting details. Charity is a prolific writer, and you can read more from her on her blog at charity.wtf, which is linked in the show notes below. For a deep dive on how to build an observability startup, and for details on how a scaleup managed to have a $65 million observability bill for a single year, see The Pragmatic Engineer deep dives linked in the show notes below. If you enjoyed this podcast, please do subscribe on your favorite podcast platform and on YouTube. Thank you, and see you in the next one.

Summary

Charity Majors discusses the evolution of observability from a fragmented, tool-heavy approach to a unified, data-centric model, emphasizing the importance of understanding software in business terms and the role of modern practices like SLOs and AI observability.

Key Points

  • Observability has evolved from the 'three pillars' model (metrics, logs, traces) to a unified storage approach that treats all telemetry data as structured events.
  • High cardinality data is crucial for debugging but is expensive in traditional metric-based systems; structured data allows for real-time exploration and correlation.
  • The shift to observability 2.0 includes unified storage, real-time querying, and integration with development workflows like CI/CD and feature flags.
  • OpenTelemetry enables vendor-agnostic instrumentation, reducing lock-in and enabling consistent data collection across different tools.
  • Observability should be integrated early in development ('shift left'), with SLOs as a key entry point for understanding system health.
  • Observability is not just for operations; it empowers engineering teams to explain their work in business terms and gain a seat at the executive table.
  • The cost of observability can spiral due to data duplication and high cardinality; teams must govern data carefully to avoid unsustainable bills.
  • AI observability requires tracing the entire software stack, not just the model, to understand inputs, outputs, and user feedback.
  • Many engineering teams make mistakes by relying on static dashboards and delaying observability implementation until production issues arise.
  • The future of observability involves deeper integration with development practices, including AI-generated code, which requires robust observability to ensure quality.
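
As a rough illustration of the points above about structured events and high cardinality, here is a minimal Python sketch. The event fields (`user_id`, `endpoint`, `build_id`, `duration_ms`, `status`) and the helper names are hypothetical, chosen only for illustration; they do not come from the interview or from any vendor's schema. The idea is that one wide event is stored per request, and counter-style views are derived afterwards by grouping on any field, including high-cardinality ones such as a user ID.

```python
from collections import Counter

# Hypothetical wide events: one structured record per request.
# All field names and values are illustrative assumptions.
events = [
    {"user_id": "u1", "endpoint": "/checkout", "build_id": "b42", "duration_ms": 120, "status": 200},
    {"user_id": "u2", "endpoint": "/checkout", "build_id": "b42", "duration_ms": 950, "status": 500},
    {"user_id": "u3", "endpoint": "/search", "build_id": "b41", "duration_ms": 80, "status": 200},
    {"user_id": "u2", "endpoint": "/checkout", "build_id": "b42", "duration_ms": 1010, "status": 500},
]


def count_by(records, field):
    """Derive a counter-style view from raw events by grouping on any field."""
    return Counter(r[field] for r in records)


def errors_by(records, field):
    """Count only failed requests (status >= 500), grouped by the given field."""
    return Counter(r[field] for r in records if r["status"] >= 500)


def cardinality(records, field):
    """Number of distinct values a field takes across the events."""
    return len({r[field] for r in records})


if __name__ == "__main__":
    print("requests per endpoint:", dict(count_by(events, "endpoint")))
    print("errors per user:", dict(errors_by(events, "user_id")))
    print("distinct users:", cardinality(events, "user_id"))
```

In a pre-aggregated metrics system, each distinct `user_id` would typically become its own time series, which is where the cost concern in the key points comes from; keeping the raw events lets the grouping be chosen at query time instead.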

Key Takeaways

  • Implement observability early in development to accelerate feedback loops and prevent production issues.
  • Move from fragmented tools to unified storage of structured telemetry data to enable real-time exploration and correlation.
  • Use SLOs as a foundational metric to align engineering work with business goals and avoid micromanagement.
  • Leverage OpenTelemetry to avoid vendor lock-in and ensure consistent data collection across your stack.
  • Govern high cardinality data carefully to prevent observability costs from spiraling out of control.
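
The SLO takeaway can be made concrete with a little arithmetic. This is a minimal sketch, assuming a request-based availability SLO; the function name `error_budget`, the 99.9% target, and the request counts are illustrative assumptions, not figures from the episode.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize how much of a request-based error budget has been used.

    slo_target is the fraction of requests that should succeed, e.g. 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # Failures the SLO tolerates over this window.
    allowed = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
        # Fraction of the budget consumed; above 1.0 means the SLO is breached.
        "budget_burned": failed_requests / allowed if allowed else float("inf"),
    }


# Hypothetical window: one million requests, 250 of which failed.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)

if __name__ == "__main__":
    for key, value in report.items():
        print(f"{key}: {value:.2f}")
```

A single number like the burned fraction is one way a team could report system health in terms the rest of the business can follow, without anyone inspecting individual dashboards.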

Primary Category

AI Engineering

Secondary Categories

Machine Learning, Programming & Development, Data Engineering

Topics

observability, three pillars model, cardinality, OpenTelemetry, AI observability, observability 2.0, platform teams, DevOps, SLOs, structured data, unified storage, AI agents, LLMs

Entities

people
Charity Majors, Gergely Orosz
organizations
Honeycomb, Sonar, Vanta, Pragmatic Engineer, Uber, Facebook, Parse, LightStep, CNCF, GitHub, Amazon, Google
products
technologies
domain_specific
technologies products frameworks concepts

Sentiment

0.70 (Positive)

Content Type

interview

Difficulty

intermediate

Tone

educational, technical, entertaining, inspirational, analytical