Evals are the new PRD: the complete Ankur Goyal episode

Check out the conversation on Apple, Spotify, and YouTube.

Why should anyone care about evals (1:43)

Aakash: When I think of experts in the eval space, you have to be right at the very top of the list. But some people just rely on vibe checks. I have had some product leaders on this podcast who have created amazing AI features that have helped their company bag the next $500 million to $1 billion valuation just on vibes. So why should anyone care about evals to begin with?

Ankur: I actually think vibe checks are a form of eval. There is this really popular Paul Graham essay, “Do Things That Don’t Scale.” Vibe checks are the do-things-that-don’t-scale analogue for evals. When you do a vibe check, you are using your AI product and then using a scoring function, which is your brain, to try to intuit whether the result is good or bad. And if it is not very good, you might tweak the prompt or try a different model or adjust how your agent is architected, whatever it may be, and then try again.

What happens is that once your product gets into production, more people start using it. You have more subject matter experts and engineers at your company that are actually contributing to its quality. Then the vibe check version of an eval stops scaling. You need a little bit more software and process and tooling to help you execute at a higher scale and with more predictable performance. That is where what we normally think of as evals start to come in. But I actually think it is a whole journey, and vibe checking is great. It is just one type of eval.

LLMs are imperfect yet capable (3:21)

Ankur: One of the really new things about AI development is there is this kind of magical thing, ether, that we have to deal with, which is an LLM. And an LLM, not unlike a person that you might hire or work with, is somewhat unpredictable. You don’t know if your app isn’t working because the LLM inherently doesn’t understand your task, or maybe you haven’t prompted or built around it well.

I remember a couple of years ago, right around when Braintrust started, a lot of my really smart friends who are LLM skeptics would say, “This LLM doesn’t understand C++,” or, “It doesn’t understand my specific task even though it works for demos.” I think nowadays people have mostly moved past that, but it illustrates the idea that it is hard to know where the responsibility lies.

What the most clever, successful AI builders have proven, Manus is a really good recent example of this, is that the alpha in building a good AI product is kind of understanding that LLMs are imperfect yet very capable, and figuring out how to work your way around that and make the most of what LLMs are able to do today with an eye towards what they can do in the future.

That is really where evals come into play. They are a good way for you to turn the imperfection of LLMs from being kind of a mystery or a burden into a really fun and engaging product and engineering challenge that you can actually overcome.

Evals are a durable investment (4:52)

Ankur: One of the things I come back to is that an eval is a relatively durable thing that you can invest in. Let’s say you are working on a new product area and you use the latest agent framework and use Opus 4.5, which is the cool model right now. All of that might change in a couple of weeks or a couple of months.

But if you invest in evals, and by that I mean you do a good job of understanding what your users are actually trying to do with the product, and then you encode that as data and scores and eval flow, then even as the models and agents and everything change, you have actually set yourself up to continue iterating and build on an investment that you made.

The companies that have started doing that effectively, they are actually building true differentiation. If you believe that the way that you have wired together your agent today is your differentiator, you are actually highly likely to fail because that is probably going to change in a couple of months. On the other hand, if you build really good evals, then you have built something that has a little bit more durability to it.

Aakash: I have been preaching this message to everybody, which is like the harness around your LLM, everything from the memory to the evals, that is actually a more durable moat because the model underneath that continues to evolve.

The role of the PM in defining evals (6:35)

Aakash: One of the interesting things you have here on this slide is you have the quotes from Mike Krieger and Kevin Weil, and I think it is super notable because they are the product leaders at these companies. What do you see as the role of the product manager in defining the evals? You are running one of the most used eval platforms out there. Are product managers the main user of it, or how do they interact with whoever the main user actually is?

Ankur: This is honestly not something I anticipated when we started. What we have seen is that if you are building an AI product, you are now able to solve problems that software couldn’t really solve before. And the people who really drive that level of creative thinking and software application outside the lines of what software did before are product managers.

Evals are core to product managers’ ability to do that. I have actually sort of shifted my thinking. I think of evals as kind of like the natural evolution of a PRD. If you look at a PRD in 2015, it is an unstructured document that is a spec that is meant to communicate how you should build something and what maybe the success criteria are for the product working.

Fast forward to now, I think the modern PRD is an eval, and it is actually something that an engineering team who maybe doesn’t know everything about the product or the problem that they are solving can use to quantify how well the software that they are building is able to solve the problem.

I think that actually means product managers are able to be a lot more effective because they go from providing kind of a qualitative spec that no one really follows and it is always kind of annoying to reconcile the PRD with the actual product into something that is very quantifiable. You can look at an eval and say, does this piece of software fit the eval or not? And oftentimes it will fit the eval and the product will still suck, and that means that it is actually on the product manager to then go and improve the evals. That is an area of leverage that product managers actually didn’t have before.

The Claude Code evals controversy (8:45)

Aakash: I would love to talk about this coding controversy that you referred to. This tweet blew up. I actually had somebody ping me about this the day it happened because they were like, my boss was telling me about this because I have been championing evals in my own company, but Claude Code is not using evals. Have we been doing it all wrong? This is literally affecting people’s jobs who are product managers and heads of product. How should they be dealing with this controversy?

Ankur: I think a lot of coding tools also don’t have product managers. And the reason is that the software engineers who work on the coding tools have relatively good intuition about what other software engineers want to do. I actually think the same principle applies here.

As I mentioned earlier, I think vibe checks are evals. I love swyx, he is a good friend, but I actually think this tweet is factually incorrect. The fact that other people, Boris and other folks at Anthropic, are using Claude Code and likely providing feedback about whether the model or Claude Code itself is solving their problem, that is a form of eval.

Sure, they don’t have it necessarily as a quantifiable process. Maybe they do by now. We don’t know or I certainly don’t know. Or maybe they don’t use a tool or whatever and follow what someone might think of as an eval, but I think they are doing evals. If someone is trying out the product and they are providing feedback, and then they are incorporating that feedback into iterating on the product, which I think they are, to me that is doing evals.

Now, why are they able to get away without a structured process that is somewhat multi-disciplinary with product managers and engineers? I think it is likely because the engineers are solving problems for other engineers and they are doing it at a company that is training the models that are also able to solve that problem. So it is totally verticalized and you don’t really need any third-party intuition to solve the problem.

Distance from the end user (11:34)

Ankur: If you go into another domain like an AI company that is applying an LLM to solve healthcare problems, I think you are in a totally different world because they are probably not making the LLM themselves. They probably have great software engineers who are passionate about healthcare but are not necessarily healthcare subject matter experts. And then of course there are product managers who are able to bridge from what engineers are working on to what patients or doctors, whoever the end user is, is actually experiencing.

My parents are both doctors, so I have a little bit of a soft spot for this use case. When I hang out with them and talk to them, I have almost no idea what they are talking about. They are using very specific jargon. They are talking about medical issues that are obviously very important and maybe can be assisted by software, but I just don’t have the intuition for that.

And so evals become a mechanism for product managers in this scenario to help glue together the unknowns of how an end user might actually interact with a piece of software into something that is tangible that an engineering and product team can use to iterate on and improve the quality of their product.

Aakash: Nailed it. In my opinion, if you are not the end user, it becomes more and more important. And the more distance you have from that end user, like in a healthcare setting, the more important it is to create the evals. I think also one thing that probably Claude Code benefits from is that Anthropic in their post-training is using a bunch of evals around coding. So even if Claude Code doesn’t have formalized evals, we know Anthropic does.

Ankur: Right, I think distance is the perfect way to think about it. If you imagine Anthropic, which is just an amazing organization bubbling with talent, you have the people training the models, pre-training, post-training, building the harness, building the product, the UI which is Claude Code for the harness, and the end users all inside of one set of four walls. And so the efficiency with which they are able to circulate feedback is very high, and therefore it may not need additional process or whatever to help facilitate the feedback.

As any of those points of distance starts to increase, you actually need a little bit more structure. One of the big use cases for Braintrust has actually been helping our customers collect evals that they can share with labs, so that labs can do a better job of implementing support for their use case. They need some ledger to be able to capture that information. Otherwise how are they going to communicate it?

How big is Braintrust today (14:27)

Aakash: You mentioned Braintrust. I wanted to ask you, how big is Braintrust today? What can you share, whether that is users, revenue, valuation?

Ankur: Braintrust is about 100 people. We have many hundreds of customers and many tens of thousands of organizations using the product. We actually have a pretty generous free plan which we intend to make even more generous over time. If you are a product manager or an engineer and you are working on a hobby project, we want you to be able to use Braintrust without having to really think about it.

Growth has just been absurd. Nowadays people are running about 10 times as many evals as they were this time last year. Today people log about twice as much data per day as they did the entire first year of Braintrust being in existence. And it has just been incredible.

What we have seen is that everything is growing in AI. Every individual LLM call is getting bigger. People are creating larger prompts. They are putting more context into their prompts. There are more LLM calls in every request that comes through because people are building agents and agents are doing research and interacting more frequently with users and doing much richer work. And then AI products are actually achieving real product-market fit. So the number of requests is also growing very rapidly. If you multiply those three things together, you get this incredible explosion of interesting data that we see flowing through Braintrust.

Aakash: As of I believe October 2024, so a year and change ago, it was reported that you were valued at $150 million in that fundraise. Can you give us a sense of what the scale of growth has been since then?

Ankur: I think we have been very fortunate. We were cash flow positive for a very long time, and so we have been able to utilize capital actually very effectively. I can’t share the very specific numbers now, but if you look at our revenue metrics and growth metrics, we are more than an order of magnitude in growth on literally every axis. And if you look at just consumption growth, it is multiple orders of magnitude of growth. So it has been a pretty wild 15 months since then.

Why top companies focus on evals (16:48)

Aakash: That is crazy, and I think it is a testament to companies that you have as customers like Vercel and Replit and Airtable being so keen on evals. Why are all the hottest companies so focused on evals?

Ankur: When we started Braintrust, we wanted to partner with entrepreneurs and builders who had companies that had pre-existing product-market fit and were earnestly investing in AI. I will just highlight Brian from Zapier for example. Zapier was our first customer. Brian is the CTO. He has been working on Zapier for a long time. When I met him, he basically introduced himself as a full-time AI engineer. Now this guy is super successful. He probably doesn’t have to work, but I have not seen anyone nerd out about AI as much as Brian does.

The reason we wanted to partner with these companies is that we knew they would only build and ship products that met a certain level of quality and they would hold themselves to a rigorous product-market fit bar, but they were very earnestly adopting AI. And that has very much turned out to be true.

Consider that these companies have pre-existing product-market fit, so they have to do things at some level of scale. They can’t simply rely on vibe checks, although they do a lot of vibe checking, as everyone should. They have enough product-market fit to actually drive real scale, and then they have products where, like, if Ramp doesn’t work, it is very bad. They don’t really have the leeway to screw things up. And so the standard for the quality of the products that they are shipping is very high.

You kind of mix those ingredients together and it is very obvious from first principles that you need to run evals and take observability very seriously to implement a good product.

Building an eval from scratch, live demo (18:48)

Aakash: I want to get a little bit more tactical for everybody. You have the stat which I think is pretty crazy, 12.8 experiments per day. What exactly are those tangibly? Like what are people doing that they are running this many experiments per day?

Ankur: One of the benefits of AI is that experimentation, which used to be something that you would only run in production, is now something that you can do offline as well. That is actually one of the things that I think contributes to so much rapid evolution of AI products.

You are absolutely right, if you had this non-deterministic problem that you had to solve, then you might have to AB test it. And doing an AB test is a very high fidelity but very expensive way to get feedback about whether a non-deterministic thing works or not. In AI, because you are able to do evals and actually iterate offline, you can do those experiments just on your laptop.

Aakash: What are the steps that we need to go through in order to define an experiment like this?

Ankur: This is straight from our docs. An eval consists of three things, and I think this is a very helpful framework because it allows you to simplify what might otherwise be kind of an overly complex or infinitely complex topic.

An eval is literally three things. Data, which is a set of inputs. So we are going to play with Linear’s MCP in a moment. An example of a piece of data could be “How many tasks do I have assigned to me?” That could be the input question, and then optionally you might have a ground truth answer like 12. You might not, which is totally fine.

The next part is a task. A task is something that takes an input and then generates an output. A task could be as simple as a single LLM call, like you could just take the question and paste it as a message into GPT 5 Nano and then get a response. Or it could be as complex as an agent. It might do some research or call an MCP server. It might call other LLMs, it might call APIs or vector databases, whatever it may be. At the end of the day though, it is going to produce some kind of output, and that is the thing that you evaluate.

Then the last thing is scores. Scores take the data, so they know about the input, they know about maybe the expected output. They take the output of the task. And then their job is to produce a number between 0 and 1. I think it is actually really important you normalize things between a fixed range, 0 and 1. The reason that is important is that it forces you to make everything comparable. So no matter what, a week from now or a month from now, when you run a new eval, you will be able to produce a score that is within the same range. And when you do that, that means you will be able to compare how the thing that you did today performs against the thing that you do tomorrow.
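The three parts described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Braintrust's SDK: the names `DATA`, `task`, `contains_expected`, `run_eval`, and `summarize` are all hypothetical, and the task is a canned stand-in for a real LLM call or agent.

```python
from statistics import mean
from typing import Callable, Optional

# Data: a set of inputs, each with an optional ground-truth answer.
DATA = [
    {"input": "How many tasks do I have assigned to me?", "expected": "12"},
    {"input": "Are there any overdue tasks?", "expected": None},
]


def task(question: str) -> str:
    """Hypothetical stand-in for an LLM call or agent: input in, output out."""
    canned = {"How many tasks do I have assigned to me?": "You have 12 tasks assigned."}
    return canned.get(question, "Happy to help with Linear. What would you like me to do?")


def contains_expected(question: str, output: str, expected: Optional[str]) -> Optional[float]:
    """Score in [0, 1]: 1.0 if the ground truth appears in the output, else 0.0.

    Returns None (score skipped) when the example has no ground truth.
    """
    if expected is None:
        return None
    return 1.0 if expected in output else 0.0


def run_eval(data: list, task_fn: Callable, scorers: list) -> list:
    """Run the task on every example and attach each scorer's result."""
    rows = []
    for example in data:
        output = task_fn(example["input"])
        scores = {}
        for scorer in scorers:
            value = scorer(example["input"], output, example.get("expected"))
            if value is None:
                continue
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{scorer.__name__} returned {value}, outside [0, 1]")
            scores[scorer.__name__] = value
        rows.append({"input": example["input"], "output": output, "scores": scores})
    return rows


def summarize(rows: list) -> dict:
    """Average each score across the rows where it was produced."""
    by_name: dict = {}
    for row in rows:
        for name, value in row["scores"].items():
            by_name.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in by_name.items()}


if __name__ == "__main__":
    print(summarize(run_eval(DATA, task, [contains_expected])))
```

Because every scorer is held to the same 0 to 1 range, the summary from today's run can be compared directly with the summary from a run next week, even if the task behind it has changed completely.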

Live demo, creating the data set (22:15)

Ankur: So we are going to create an eval entirely from scratch. There are no pre-written prompts, no pre-written data set, no pre-written scoring functions. This is going to be 100% live.

Aakash: For those who don’t know Linear, if you are in the Jira ecosystem, Linear is a competitor to Jira. So it is your task management tool.

Ankur: We use Linear. It has been a fantastic piece of software for us. Linear also uses Braintrust, so they are a good friend of ours and I think they have a really nice MCP server which is super cool. So let us say we are building a tool that allows us to ask questions about our task workload and understand what work we have to do.

So let us just write a really simple system prompt. “You are a helpful assistant who answers questions from Linear.” And let us create a data set and instead of creating it from scratch, let us just use Opus to help us create the data set.

Aakash: And for those of you who are wondering, MCP, the Model Context Protocol, is a standard interface, basically the API that LLMs can use to reach external tools and data. So it is allowing the tool we are looking at, Braintrust, to get access to the data inside of Linear.

Ankur: OK so we have got the initial test data from Opus. But I don’t love this data. It is asking questions about what Linear is. We are trying to build a bot that helps us ask questions about the workload. So let us actually tweak it. I want questions about my Linear project, for example what tasks are assigned to me.

So creating stronger test data, in this case making it more about tasks and kinds of tasks instead of just the high level it was before. Last but not least, remove the expected answers since we don’t know them.

Now what we can do is just hit run. So it is going to use GPT 5 Nano, which is one of my favorite models. It is super cheap and relatively fast.

OK so this doesn’t seem like a great answer. “What tasks are assigned to me?” “Happy to help with Linear. What would you like me to do?” “Are there any overdue tasks?” “I can help with questions about Linear’s usage.”

What we just did is a vibe check, and that means that we looked at some of these questions, we looked at the answers. I think these answers are pretty bad.

Creating the scoring function (26:03)

Ankur: Now before we actually try to improve them, what I would like to do is be able to quantify that, and that is where scoring comes in. The benefit of quantifying it is that we are of course going to vibe check the improved results as well, but the artifacts that we will produce by actually running these evals is something that our team could continue to use so that as we add more data, as we evolve the prompts, we will have a quantitative signal about whether we are improving the thing that we are trying to improve.

So now let us go back to Loop. By the way, Loop is the agent that is built into Braintrust, and it works kind of like Claude Code or Cursor. It has tools that are plugged into all of the nooks and crannies of our product and so it can interact with data and prompts and run evals and stuff for you.

So these answers aren’t great. They are vague and introductory. Can you create a scoring function that makes sure that the answers actually answer the question, and if they cite any information or include any facts about tasks, they cite a source?

Aakash: In the lore of the podcast, the prior few eval episodes that we had from Hamel Husain, Shreya Shankar, Aman Khan, they all warned against numerical scores. They said that we need to go for more of like a binary yes, no. Here we are going for a score. Can you talk to us about that?

Ankur: The simple way to think about it is that jumping into scores like 0.2 or 0.4 before you have really justified the need to do that is not a good idea. And in fact even though we are going to create numbers here, we are actually only going to create scores that fit a specific set of values. It only has three options.

I think it is important not to overcomplicate your scores, and I think if you are creating LLM-based scores, you shouldn’t ask the LLM to generate a number because that is not very clear. It is useful to have clear criteria, but I actually disagree that every score needs to be binary. I don’t think there is any real justification behind that. In fact I worked with the OpenAI team about a year ago and published a research cookbook that walks through somewhat scientifically what is a good thing to do and what is a bad thing to do and why.
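One way to read the advice above: have the judge model pick from a small set of labelled options, and map each label to a fixed number in code. The sketch below assumes a three-option rubric and the values 1.0, 0.5, and 0.0; the rubric wording, the `Choice:` reply format, and the function names are illustrative assumptions, not the scorer built in the demo.

```python
import re

# Hypothetical rubric: the judge LLM picks a letter, it never invents a number.
RUBRIC = """\
A: The response answers the question and cites a source for any task facts.
B: The response answers the question but states task facts without a source.
C: The response does not answer the question.
End your reply with a line of the form 'Choice: <letter>'."""

# The mapping from option to score lives in code (assumed values).
CHOICE_SCORES = {"A": 1.0, "B": 0.5, "C": 0.0}

_CHOICE_PATTERN = re.compile(r"Choice:\s*([ABC])\b")


def score_from_reply(judge_reply: str) -> dict:
    """Turn a judge reply ('<rationale> ... Choice: B') into a score record."""
    matches = list(_CHOICE_PATTERN.finditer(judge_reply))
    if not matches:
        raise ValueError("judge reply did not contain a valid 'Choice: A|B|C' line")
    last = matches[-1]  # use the final verdict if the rationale mentions others
    choice = last.group(1)
    return {
        "choice": choice,
        "score": CHOICE_SCORES[choice],
        "rationale": judge_reply[: last.start()].strip(),
    }


if __name__ == "__main__":
    reply = "It restates what it can do and never answers.\nChoice: C"
    print(score_from_reply(reply))
```

Keeping the rationale next to the score is what makes a result like "option C" inspectable later: you can read why the judge chose it before deciding whether the task or the scorer is at fault.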

Connecting to Linear MCP and iterating (29:51)

Ankur: OK good, so it is 0 everywhere. If you look at one of these, you can actually see why. We have the model actually tell you the rationale, and so it labeled it as option C and it says it doesn’t answer the question and it doesn’t provide any claims or specific facts.

So now let us have a little bit of fun and actually connect this thing to the Linear MCP. It is pretty easy to do that. You just click MCP and then Linear has a nice HTTP-based MCP so we can just put the URL in and it will actually authenticate to my Linear account and give me a bunch of tools.

Models can get somewhat overwhelmed by having a lot of tools, so I am going to remove some of these just to keep it simple.
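Trimming the tool list can be as simple as an allow-list applied before the tools are handed to the model. The sketch below is a generic illustration; the tool names and the dictionary shape are hypothetical and are not taken from Linear's MCP server.

```python
def select_tools(available: list, allowed_names: list) -> list:
    """Keep only the tools the task needs, preserving their original order."""
    allowed = set(allowed_names)
    known = {tool["name"] for tool in available}
    missing = allowed - known
    if missing:
        # Fail loudly so a renamed or removed tool is noticed immediately.
        raise ValueError(f"allow-listed tools not offered by the server: {sorted(missing)}")
    return [tool for tool in available if tool["name"] in allowed]


if __name__ == "__main__":
    # Hypothetical tool descriptors, as a server might advertise them.
    offered = [
        {"name": "list_issues", "description": "List issues with filters"},
        {"name": "get_issue", "description": "Fetch one issue by id"},
        {"name": "create_issue", "description": "Create a new issue"},
        {"name": "list_projects", "description": "List projects"},
    ]
    # A read-only question-answering bot has no need for write tools.
    print([t["name"] for t in select_tools(offered, ["list_issues", "get_issue"])])
```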

Aakash: That is an interesting insight around just selecting the tools that you need so that it doesn’t accidentally choose the wrong tool.

Ankur: OK great. So we will save that and let us try running it again.

Aakash: It is fun how fast you can iterate here, and so this might be like an example of those 13 experiments. They are constantly improving what they are working on and you just get the results so quickly.

Ankur: Exactly. Every time I click run, actually, it is essentially running an experiment.

OK great, so it didn’t actually do that well. Welcome to AI. It said, “Are there any overdue tasks?” And this model said, “I am ready to help with Linear tasks,” but it doesn’t actually do anything. It just says what it can do and it doesn’t really solve the problem.

Now there are a few things we can do. We could try a different model. We could try GPT 5 or GPT 5 Mini and see if we get better performance. We could try to improve the system prompt. We could say, “Don’t ask clarifying questions please, just use the tools and figure it out.” A third thing we could do is we could go and actually edit the questions. And then of course the fourth thing we can do is edit the scoring function.

Iterating with Loop and improving scores (35:21)

Ankur: I think the scoring function is too harsh. If the response contains any references to Braintrust tasks, then it has cited its sources. So this will go update the scoring function.

Aakash: Could we do the same for the system prompt? Right now we are still failing on 4 out of the 5, so can we add a few few-shot examples and specify which Linear MCP tools to use?

Ankur: Absolutely. So here you can see it has edited the criteria for the scorer. We can hit accept and it will update it. And we can also let it improve the prompt.

Aakash: Now that I am back to becoming a coder thanks to vibe coding tools after 16 years being away, I am like one step at a time.

Ankur: You know, there was this kind of watershed moment we saw with Claude 3.7, where it was the first time we saw that a model was able to look at its own work and improve. I think prior to that, the metaphor that I use is it was kind of like a dog looking at itself in the mirror. It didn’t really know whether it was a virtual representation of itself.

Sometimes models would assume the identity of the model that they were working on, or the prompt that they were working on. But when Claude 3.7 came out, things started to change, and that is actually when we shipped Loop. We had our own eval for this problem leading up to shipping Loop, and the eval performed terribly. And then finally Claude 3.7 came out and it was a huge jump, and we realized this product idea might actually work.

Aakash: So you should be thinking about what the future state products are going to be. Create the eval, then watch the models. Once the models hit the right set of quality, you can go ahead and release it.

Why you need evals that fail (39:12)

Ankur: Absolutely. I think one of the most important things is to have evals that fail. If you only have evals that succeed, then you don’t know what problems there are. That means you either don’t have a clear understanding of what problems your users are hitting, or you don’t have a clear understanding of what is impossible today. And I think it is very important to have both.

If you have evals that are failing, then when a new model comes out, the first thing you should do is just rerun those evals, and you will be surprised that every time a new model comes out, something interesting is going to happen.

Aakash: I heard some people who are running a coding tool, Gemini 3 Flash was somehow performing better on a lot of coding benchmarks than Gemini 3 Pro, but it was hallucinating more. These are the nuances where you need to have a full eval testing suite to really understand which metrics are improving versus which it is hurting.

Ankur: For sure, and I think as with any benchmark, an up does not necessarily mean good. An up just means that something interesting happened. And I think more often than not, when you see something interesting happen in a benchmark, including an improvement, it means that the benchmark itself is broken. But you should not necessarily hypothesize whether a benchmark is broken until you are able to reproduce it with some real data.

I am a big believer in doing really dumb, seemingly obvious things like just auto-generating silly questions about Linear tasks or whatever it is, and then running stuff and confronting the actual generated outputs with your intuition and using that moment as the opportunity to improve things, as opposed to spending a month creating a perfect golden data set.

Open evals to the whole team (41:02)

Aakash: So that is a real case for don’t silo your Braintrust licenses and user accounts to the AI engineers. Make sure that the PMs, maybe even the right go-to-market domain experts who really understand it, have access to the tool.

Ankur: About maybe 3 or 6 months ago, we sort of realized that Braintrust should not be constrained to the AI engineering team, and we removed user-based pricing. So there is no user-based pricing. It is just based on how many evals you run and how much data you log. You should just not worry about that.

Looks like this thing is cooking and it has made some serious progress. It has solved some problems like telling the model to use the tool. It has also solved the problem of the model asking for clarification. In chat-based use cases, models are post-trained to ask for clarifying questions. In the context of this demo, we are not giving it the opportunity to do that. We are just hoping that it generates a response from one question. And so it is really important that we tell it not to do that.

Aakash: I think it started with a partial score, then it moved, it said OK let me iterate on the system prompt again to get a full score. So it is really working through the problems.

Ankur: We touched all three parts of the workflow. We worked on the data set, we iterated it a little bit. We actually worked on the task function itself. We wrote a prompt, we picked a model, we changed the MCP tools that were available to it. And then we also created and iterated on a scoring function. We made an initial one, you pointed out that the scoring function was being a little bit too nitpicky.

Wow, it looks like we are now at a 0.75 across the board, which is a huge improvement from where we were before.

Offline vs online evals (43:36)

Aakash: What is the distinction between offline and online evals, and when should people be doing each?

Ankur: One of the cool things about the work that we did is we created a scorer, and even though we are using it in this playground, this isn’t the only place that you could use the scorer. If we go into the scorer list in Braintrust, you will see that we have the scorer right here and we can actually run it on real live logs and deploy it into production.

So every time we ask a question, it will actually run the scorer online. And I am loving this prompt. The tool usage hardcore.

Aakash: And again, you are a product manager, so I think you see this and you get some PRD vibes, right?

Ankur: That is what I mean. This is a much more quantifiable version of thinking about what a product should be. And it is really fun to actually be able to take a product intuition and quantify it and turn it into something really tangible.

What is happening is that every time I use a prompt in Braintrust or whatever my app is, I am going to be generating real live logs of my production application. Online evals are taking these scores that we built and running them on your real live user logs.

That is helpful for two reasons. The first is that it gives you insight into how well the same eval functions that you are using to test things offline are actually translating into real-world performance. So let’s say that offline we are able to achieve a score of 0.75, which is not bad. And then we run the same score online and we consistently see the result as 0.3. That means that maybe it is not actually working as well in the real world as we think it is working in our little simulation environment.

The second thing is that it becomes a really good flywheel for you to find examples that are worth including in your offline eval. When you see that the score is 0.3, then you can actually filter down to the examples that are not performing very well and then grab them and add them into that same data set that we were using to assess things.
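Both uses of online scores can be sketched together: measure the gap between offline and online averages, then pull the low-scoring production examples back into the offline data set. The log record shape, the 0.5 threshold, and the function names below are assumptions for illustration only.

```python
from statistics import mean


def score_gap(offline_scores: list, online_scores: list) -> float:
    """Positive gap means the offline eval looks better than production does."""
    return mean(offline_scores) - mean(online_scores)


def low_scoring_examples(logs: list, threshold: float = 0.5) -> list:
    """Turn production logs that scored below the threshold into dataset rows."""
    return [
        {"input": log["input"], "expected": None}  # ground truth unknown for live traffic
        for log in logs
        if log["score"] < threshold
    ]


def extend_dataset(dataset: list, new_examples: list) -> list:
    """Append new examples, skipping inputs the dataset already contains."""
    seen = {example["input"] for example in dataset}
    merged = list(dataset)
    for example in new_examples:
        if example["input"] not in seen:
            merged.append(example)
            seen.add(example["input"])
    return merged


if __name__ == "__main__":
    dataset = [{"input": "What tasks are assigned to me?", "expected": None}]
    logs = [
        {"input": "What tasks are assigned to me?", "score": 0.75},
        {"input": "Which UI bugs are blocking the release?", "score": 0.25},
    ]
    print(score_gap([0.75, 0.75], [log["score"] for log in logs]))
    print(extend_dataset(dataset, low_scoring_examples(logs)))
```

Rerunning the offline eval on the extended data set should then reproduce the weakness seen in production, which is the point at which it becomes something you can iterate on.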

How to maintain eval culture (47:40)

Aakash: How do you maintain trust in your eval system so people don’t bypass it when they are shipping new features?

Ankur: The best teams don’t think of evals as a gate. They think of it as a core part of their iterative loop of actually improving things. The best workflow looks like looking at real production examples.

In fact some of our customers have kind of like a ritual where every morning in standup, they will look at some examples from the previous day’s usage of their product. And then what they will do is they will reconcile what they see with those examples with what their evals have.

So let’s say that the scores are very low for questions related to our UI. It is like, maybe we don’t have that many questions related to UI tasks in our eval data set. So what they will do is find these novel patterns that have emerged from their logs and then add them to the data set. And maybe you do that in the morning, and then they will grind that day and actually try to improve the eval performance on the things that they noticed.

That becomes a really helpful way to prioritize what you should actually work on and what it means to actually succeed on a particular endeavor. Hey, it clearly looks like we are not doing well on questions related to UIs. Let’s bring in a bunch of those tasks, add them to our data sets, reproduce that problem in our evals, and then go and iterate on it until we are able to produce a better result.
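The standup ritual amounts to a coverage check: which kinds of questions score poorly in production and are also thin in the eval data set? A minimal sketch follows, assuming each log and each data set row already carries a `topic` tag; the tagging, the thresholds, and the function name are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def coverage_gaps(logs: list, dataset: list, max_score: float = 0.5, min_examples: int = 3) -> list:
    """Return topics that score poorly in production and are under-represented offline."""
    scores_by_topic = defaultdict(list)
    for log in logs:
        scores_by_topic[log["topic"]].append(log["score"])

    dataset_counts = Counter(example["topic"] for example in dataset)

    gaps = [
        topic
        for topic, scores in scores_by_topic.items()
        if mean(scores) <= max_score and dataset_counts[topic] < min_examples
    ]
    return sorted(gaps)


if __name__ == "__main__":
    yesterday = [
        {"topic": "ui", "score": 0.0},
        {"topic": "ui", "score": 0.5},
        {"topic": "workload", "score": 1.0},
    ]
    eval_dataset = [{"topic": "workload"}] * 5 + [{"topic": "ui"}]
    print(coverage_gaps(yesterday, eval_dataset))
```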

That is the best way to think about evals. If you think about evals instead as, I think there is a problem, let me edit my prompt to try to fix the problem and play with it on 3 examples, and then OK it seems like it is better, now let me go run a full eval run and see if I can ship this thing, I think you are not going to be as efficient because you are not thinking about the broader problem which is presented in the data set while you are actually making those iterations in the first place.

Where to go deeper (49:57)

Aakash: This was a less than 1 hour masterclass into evals. If people want to go deeper Ankur, where should they be going?

Ankur: You can reach out to us, braintrust.dev is our website. You can email me ankur@braintrust.dev or reach out to us on X or Discord. We also have a user conference coming up called Trace. So if you go to braintrust.dev/trace, you can see information about signing up. It is a zero bullshit practitioner-led conference. A bunch of talks from people from companies like Dropbox, Ramp, Notion, and other folks that we talked about earlier, who are just going to talk about how they are solving these problems.

Aakash: All right guys, for my money in 2026, whether or not you are building an AI feature right now, every PM should be learning this skill. I hope we got you excited enough to go out there and try this out, maybe with a free Braintrust account or something else, whatever platform you are using. Get out there, start iterating. You saw how fun it was. You saw how I was jumping in on how I wanted to do more of the system prompt. I think you will feel that same excitement once you get your hands into a tool like this.
