The AI System Design Interview – Full Mock with Aman

Introduction and the interview question (0:00)

Aakash: Product design is dead. That’s not clickbait. It’s actually what candidates are telling me after their AI PM interviews. AI product managers are making millions of dollars. At OpenAI, the average stock grant per employee is $1.5 million. At companies like Google and Meta, employees are getting huge stock grants. I was not really asked any of those conventional “make a fridge for blind people” kind of questions. It has moved to AI system design where they are not just testing product sense but they’re also trying to test your system design knowledge.

If you want to land one of the top AI product management roles at companies like OpenAI, Google, or Meta, and you want it to be at the high end of the pay range, the $1 million plus roles with $1.5 million in yearly stock grants, you should know that not just anyone can land them. So with Aman, we’ll walk you through what a great response looks like, and we hope you get the most out of it.

When it comes to the AI system design interview, they are looking for your ability to go deep on a technical topic. So we’re going to show you how to ace it with this mock interview question: build the system design for a churn reduction agent.

Clarifying questions (1:05)

Aman: So we need to build a churn reduction agent. Before we jump into solutioning, I have some clarifying questions. I know what churn generally means, but in this specific scenario does it have a specific meaning or a relation to a particular product?

Aakash: You can define churn. At a high level, there’s engagement churn, which is what we’ll see first: they tried the product, they used it, and then their engagement drops off. And then the lagging thing we’ll see is that they stop paying us.

Aman: All right. So we don’t have a more precise input for this interview. Also with regards to the scope of this, are we talking about users of any specific domain like mobile app versus desktop app or is there anything which needs to be understood here?

Aakash: I guess we want to consider all platform behavior. The important thing to think about is what is going to help us secure more revenue. And if we can get signals about mobile versus desktop usage and it’s relevant then I think you should focus on it.

Aman: Are there any constraints with regards to time or the effort or budget we have to launch this feature or product?

Aakash: I would say think about the best case scenario. What would be the ideal thing to design and don’t worry too much about constraints for this interview.

Aman: Cool. No constraints. I’ll try to keep it to 6 months if not more. Finally, I know the goal is churn reduction. Do we have any other goal associated with the product apart from churn reduction?

Aakash: That’s a good question. I guess just to build a good system, think about it as like its own code base, maybe its own repo. So how are we going to build it? Specifically the technical areas is what I’m most interested in.

Aman: The reason I ask is that a company does a lot of things because of its competitors. So if our competitors are trying to do something, is that one of the reasons we are doing this, or is there no such dependency? But it looks like in your case we just need to build an independent system and we don’t really have any other goals apart from churn.

Aakash: Yeah. Just the normal reasons. If we can identify churn we can take actions in the product to improve engagement, we can secure more revenue. It generally tends to be more profitable for us than focusing on acquisition.

Aman: Do you have any suggestions as to what kind of an industry or product should I go towards or is that up to me to decide?

Aakash: Just consider software.

Defining the product vision (4:47)

Aman: So with regards to the vision, customer care is an area where we are trying to use agentic AI to make our workflows more efficient. So I’ll take an example from the telecommunications industry because I think it will be easier for people to understand. In terms of the product, I’ll take a mobile app with the necessary agentic AI pipeline built in, which should help us reduce churn. This is my vision for the product. Are there any questions before I jump into the next section?

Aakash: Sure. Makes sense to kind of focus it in a specific real life more textured situation.

User segmentation and prioritization (5:49)

Aman: So what I’ll do now is start by talking about some of the users. I’ll prioritize one of them and go through the pain points. Then I’ll move to solutioning, where we’ll discuss the different AI solutions, and finally we’ll jump into system design, both the diagram and some of the metrics we can track. Does that sound good?

Aakash: Cool. Go for it.

Aman: So with regards to users of any product, there are different ways you can segment them. What I would like to do is segment by when they joined us. There are new users, power users, and users from a B2B perspective. A telecom company has B2B users as well.

Because we want to prioritize only one segment, I will stick to power users. The reason is that these are the users we want to take care of the most. The users who engage with you the most, who derive the most value from your product, you want to protect them and you want to ensure that their interests are well met. We do not want to lose them. So from a churn point of view I want to prioritize this segment of power users. Do you have any questions before I proceed to pain points?

Aakash: Yeah, I think power users make sense with our broader strategy.

User journey and pain points (7:40)

Aman: Before jumping into pain points, I would like to do something called a user journey. Pain points are often derived through the different phases of user journey. So I’m going to spend a few seconds to define what a user journey of a power user might look like.

This user already has been part of this network for a while and has been using our app and customer care. There are definitely going to be a lot of customer issues. So this user contacts customer care through phone, tries to get a ticket for the problem. Once that happens, the customer normally waits, reaches out again if required to customer care through email. Once the issue is resolved, normal system is restored and the user effectively starts using the app as well. On the app there are a lot of benefits provided these days for users. They can browse benefits and they can try to get Wi-Fi of the same company or avail other services.

There are a lot of pain points among these. Firstly, everyone hates calling customer care. You do not want to spend your time calling customer care if you are a power user. So one pain point is definitely huge amounts of time and effort taken for each of these customer care interactions, which also causes anxiety, frustration, and delay.

Apart from this, there is also the inconsistent experience. When you open the app, you might not be able to track your request across your customer call and app. That causes a lot of confusion.

Finally, one pain point is regarding benefits. There are a lot of users with different interests and we do not want to show irrelevant benefits. If you like dining, we do not want to show you a benefit for Wi-Fi. Irrelevance becomes a huge problem. We are not able to tap in the right value for the right user and that causes friction and loss of potential upsell.

Aakash: Yeah, I think these are good ones to focus on for this particular system design.

Prioritization of pain points (11:25)

Aman: With regards to prioritization, I like to consider alignment to my vision, frequency, and impact. I think the customer care pain point is a very highly recurring problem with huge impact on experience. So I’ll mark it as high.

Inconsistent experience is a big problem but even if we do solve it, it might not matter as much in the short term. It doesn’t directly align with our immediate vision which is to create a highly engaging product and reduce churn. So I’ll go medium here.

Irrelevance in benefits causes friction and loss of potential upsell. That is important, but when it comes to churn reduction I feel it might not be our top lever. So I’ll again go medium.

I’ll prioritize only one pain point, which is customer care, and I want to devise an agentic solution to resolve this kind of a pain point.

Aakash: How do we give our product teams the right signals that hey, this user is about to churn so we need to take an intervention? How do we get that early warning sign?

Aman: That’s a very proactive system which we want to build. Before the user calls, before three to four interactions, we are able to predict. That is something which I’ll be explaining more in my solution section.

Solution framing (14:14)

Aman: I think with regards to solutions there are a lot of different possibilities. One is a chatbot which tries to understand from your past call transcriptions and accordingly helps guide your solutions quickly. But I feel that might not be as engaging as possible because a human always prefers a more humanlike touch.

There is also voice. I feel voice might really be a very good scenario where you’re not just able to talk to the AI agent but you’re able to show your screen and actually have it end to end guide you and not just redirect you to a human agent.

For a more basic solution which might not require AI, we can make the app experience more gamified, showing features and benefits in a way that keeps the user motivated to stay rather than churn. But because at the core we want to improve the user experience, I would like to build a voice agent which predicts from your past data when the user is going to be in the red zone of churning and accordingly activates the system: sending more offers, sending more benefits, and also talking to the user end to end to solve problems as soon as possible.

Aakash: It’s going to be a big project. I don’t know if 6 months will be enough, it depends on resources.

Aman: I think 6 months is a big time frame. We can definitely map out an MVP version in 6 months and from there maybe craft a very good agent in 9 to 10 months.

Three pillars of any AI system (16:53)

Aman: Before jumping into our system design diagram I would like to talk about the different layers of an AI system. Each AI agent or bot has potentially three things: model, data, and memory. These three things are the pillars of any AI system.

In terms of model, you already have some of the best AI models available at your disposal. We could do a comparative analysis and basic testing on an engineering level and see which fits our use case. For voice, I might actually go to Gemini because I feel that has a great voice experience, but it can be anything based on how your engineering or ML team perceives this.

Now coming to data, this is probably the core. Without data your very powerful model is completely useless. You definitely need call transcriptions. But you also need different signals. For example, app usage, network status through the past 12 months, what has been your connectivity, are we able to provide you those benefits of traveling which our competitors are providing, what kind of a user are you. Having a lot of different data points helps you understand what kind of user we are dealing with. Putting them in a churn bucket would be helpful. If you are a high-risk churner, put in a red bucket. If not, maybe a green bucket.
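
As a rough illustration of the bucketing idea described here, the sketch below turns a few signals into a score and then into a red or green bucket. The signal names, weights, and the 0.6 threshold are all hypothetical and were not given in the interview; in practice the score would more likely come from a trained model than from hand-set weights.

```python
from dataclasses import dataclass


@dataclass
class UserSignals:
    # All fields are hypothetical signals, each normalized to the range 0..1.
    app_usage_drop: float        # 1.0 = usage fell to zero vs. the prior period
    network_issue_rate: float    # share of days with connectivity problems
    support_contact_rate: float  # 1.0 = contacting support very frequently


# Illustrative weights only; a real system would learn these from data.
WEIGHTS = {
    "app_usage_drop": 0.5,
    "network_issue_rate": 0.3,
    "support_contact_rate": 0.2,
}


def churn_score(s: UserSignals) -> float:
    """Weighted sum of the signals, clamped to 0..1."""
    raw = (
        WEIGHTS["app_usage_drop"] * s.app_usage_drop
        + WEIGHTS["network_issue_rate"] * s.network_issue_rate
        + WEIGHTS["support_contact_rate"] * s.support_contact_rate
    )
    return max(0.0, min(1.0, raw))


def churn_bucket(score: float, red_at: float = 0.6) -> str:
    """Map a score to the red/green buckets; the threshold is an assumption."""
    return "red" if score >= red_at else "green"


at_risk = UserSignals(app_usage_drop=0.9, network_issue_rate=0.7, support_contact_rate=0.5)
healthy = UserSignals(app_usage_drop=0.1, network_issue_rate=0.0, support_contact_rate=0.1)
print(churn_bucket(churn_score(at_risk)))
print(churn_bucket(churn_score(healthy)))
```

Keeping the score and the bucket as separate steps means the threshold can be tuned later without touching how the score is produced.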

We would also need some last minute retention offers. If you are in a red bucket and an AI agent is unable to resolve, we need some sort of a last minute retention offer specially for you.

Finally, memory. There are different types of memory available. There is episodic memory and session memory. We need to be efficient here. Episodic memory is highly important. What I mean is in cases where the user has ever tried to contact customer care through chat or through call, what the conversations looked like. That is probably the most important set of transcriptions or data we can analyze. If we cannot contextualize previous conversations it might be very difficult for us to provide a great solution. So I’ll stick to episodic memory for now. These are the three pillars of any AI bot.
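
One way to picture episodic memory is as a per-user log of past support interactions that gets rendered into context for the agent. The sketch below is a minimal in-memory version under that assumption; the class, field names, and text format are hypothetical, and a production system would persist this in a database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    channel: str    # e.g. "call", "chat", "email"
    summary: str    # short summary of the transcript
    resolved: bool


class EpisodicMemory:
    """Keeps past support interactions per user, oldest first."""

    def __init__(self) -> None:
        self._episodes: dict[str, list[Episode]] = defaultdict(list)

    def record(self, user_id: str, episode: Episode) -> None:
        self._episodes[user_id].append(episode)

    def recent(self, user_id: str, limit: int = 3) -> list[Episode]:
        """Return up to `limit` most recent episodes, newest first."""
        if limit <= 0:
            return []
        return list(reversed(self._episodes.get(user_id, [])[-limit:]))

    def context_for_agent(self, user_id: str, limit: int = 3) -> str:
        """Render recent history as plain text to prepend to an agent prompt."""
        lines = [
            f"- [{e.channel}] {e.summary} ({'resolved' if e.resolved else 'unresolved'})"
            for e in self.recent(user_id, limit)
        ]
        return "\n".join(lines) if lines else "No prior interactions."


memory = EpisodicMemory()
memory.record("u1", Episode("call", "Dropped calls at home", resolved=False))
memory.record("u1", Episode("email", "Follow-up on dropped calls", resolved=True))
print(memory.context_for_agent("u1"))
```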

Aakash: Just make sure you think about latency of the bot and how quickly it’s working. That’ll be one of the key areas we’ll need to pressure test.

Aman: Latency and overall performance whether you are fast and whether you’re accurate is important. A lot of times there are wrong prompts which are injected and bots are annoying and not really helpful. Response time from the bot as well as some sort of basic customer satisfaction score or NPS feedback might be important.

System design diagram (22:32)

Aman: So at the top we have our very basic interaction layer which is our mobile app. It connects to our orchestration layer. In any agentic AI system you do need an orchestration layer which coordinates across different agents and then sends the response to the mobile app.

In terms of agents there could be multiple agents. You definitely need a data analyst agent whose sole job is to go deep into your data, get all the key details and key points, and serve them.

Apart from this, we also need a very solid customer voice agent who responds to you based on the data.

And finally because we are supporting churn reduction, we would also need an executor agent whose sole task is to make executions of maybe the retention offers. In today’s world customer care has a lot of different departments. We want to try to replicate that entire system where we have a data analyst, we have an agent which talks to you, and we also have an agent who ultimately executes some of those actions whether it is retention offers or escalating to a human.

Underneath, they are connected to RAG, which is probably the most common way to retrieve data through a vector database. So we have our data layer, and we obviously have our model APIs, which can be called as and when we need them for predicting different data patterns or creating offers for users.
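
The flow in the diagram can be sketched as an orchestrator that calls the three agents in turn. This is only an illustration under stated assumptions: every agent here is a stub with made-up logic (no real model or database is called), and the rule of sending a retention offer to red-bucket users and escalating everyone else to a human is one plausible reading of the design, not something the interview specifies.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    user_id: str
    churn_bucket: str   # "red" or "green"


def data_analyst_agent(user_id: str) -> UserProfile:
    """Stub: would query the data layer / vector store for this user."""
    known_buckets = {"u-red": "red"}   # hypothetical lookup table
    return UserProfile(user_id, known_buckets.get(user_id, "green"))


def voice_agent(profile: UserProfile, message: str) -> dict:
    """Stub: would call a model API. Here a toy rule decides if it resolved."""
    resolved = "billing" in message.lower()   # purely illustrative
    return {"reply": f"Let me help with: {message}", "resolved": resolved}


def executor_agent(profile: UserProfile, resolved: bool) -> str:
    """Pick the follow-up action: nothing, a retention offer, or a human."""
    if resolved:
        return "none"
    return "retention_offer" if profile.churn_bucket == "red" else "escalate_to_human"


def orchestrate(user_id: str, message: str) -> dict:
    """Coordinate the three agents and return one response to the mobile app."""
    profile = data_analyst_agent(user_id)
    answer = voice_agent(profile, message)
    action = executor_agent(profile, answer["resolved"])
    return {"reply": answer["reply"], "action": action, "bucket": profile.churn_bucket}


print(orchestrate("u-red", "My network keeps dropping"))
```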

Aakash: Yeah, I want to dive a layer deeper. How are you going to do the churn signals? Are you going to use an LLM or ML models?

Aman: I think our data analyst agent will be dedicated to this. We are using different buckets: each user has already been placed into a churn category. The data analyst agent holds all those patterns for each user. So once you start interacting with a user, the customer voice agent and the executor agent know the user’s likelihood of churning and which bucket they fall into, and the voice agent takes actions accordingly. If the churn bucket is really high, if it is red, the system will quickly move to serve them a retention offer.

Metrics and monitoring (27:25)

Aman: Even though we have created a good solid system for MVP, metrics play a huge role. We need to focus on latency and performance because those are the end performance indicators. But let’s start with the model perspective.

Some metrics like recall will really be helpful to understand whether the model is able to use the data and give accurate outcomes. Ultimately it also gives us some understanding of how much our model is hallucinating.

Apart from the model, you also want to understand latency: what’s the average response time per user?

From an end user perspective, what is the overall NPS? Is the user really liking the performance? Is the problem getting resolved? What is the percentage of issues resolved by AI bot alone without escalations? If our AI bot is just escalating to human, then it’s not really effective. That’s something we need to understand.

Also our business metrics are hugely important. Retention is one definitely. Revenue is a huge factor driven by retention. Overall I think a good understanding of what a model is doing, how it’s performing, how the end user is seeing this, and what ultimately are the levers pushing for business impact is what could constitute a good set of metrics.
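
A few of the model and end-user metrics mentioned above can be computed along these lines. This is a minimal sketch: the function names and the shape of the conversation records are assumptions, and "containment rate" is used here as a label for the share of issues the bot resolves without escalation.

```python
def recall(y_true: list[int], y_pred: list[int]) -> float:
    """Share of actual churners (label 1) that the model flagged."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


def p95_latency(latencies_ms: list[float]) -> float:
    """95th percentile response time using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil of 0.95 * n
    return ordered[rank - 1]


def containment_rate(conversations: list[dict]) -> float:
    """Share of conversations resolved by the bot with no human escalation."""
    if not conversations:
        return 0.0
    contained = sum(1 for c in conversations if c["resolved"] and not c["escalated"])
    return contained / len(conversations)


conversations = [
    {"resolved": True, "escalated": False},
    {"resolved": True, "escalated": True},
    {"resolved": False, "escalated": True},
    {"resolved": True, "escalated": False},
]
print(recall([1, 1, 0, 1], [1, 0, 0, 1]))
print(p95_latency(list(range(1, 101))))
print(containment_rate(conversations))
```

Tracking a tail percentile alongside the average matters because a handful of very slow replies can frustrate users even when the average looks healthy.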

Failure modes and edge cases (29:44)

Aakash: Where does this system fall over? What would be some of the failure scenarios you would design around?

Aman: First, if our model is down or not working, we immediately need to redirect to a human. That’s a failsafe you definitely need to build. You do not want the user to just go away because your model is down. You need to redirect to a human within that bot itself so you don’t increase the churn percentage.

Apart from that it’s also important to understand latency. If the model is taking more than 30 seconds to reply on average, then it’s important for us to redirect to a human or try to understand how we can help the user reach their goal.

There is also the case where a model is repeating the same response or the user is asking the same questions and getting frustrated. It’s again important for us to understand when do we pass the baton to a human.

Overall in terms of failure cases, the AI agentic system will be the best case scenario, but we also want to prepare for failsafes and failure cases.
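
The three failure cases above can be folded into a single handoff check, sketched below. The 30-second limit comes from the answer itself; the repeat threshold, the function name, and the reason strings are assumptions made for illustration.

```python
from typing import Optional

LATENCY_LIMIT_S = 30.0   # from the interview: redirect if replies average over 30 seconds
MAX_REPEATS = 2          # assumption: hand off once the same reply has recurred this often


def should_hand_off(
    model_available: bool,
    avg_latency_s: float,
    bot_replies: list[str],
) -> Optional[str]:
    """Return the reason to hand off to a human, or None to keep the bot."""
    if not model_available:
        return "model_down"
    if avg_latency_s > LATENCY_LIMIT_S:
        return "too_slow"
    if bot_replies:
        # Count how many earlier replies match the latest one.
        last = bot_replies[-1].strip().lower()
        repeats = sum(1 for r in bot_replies[:-1] if r.strip().lower() == last)
        if repeats >= MAX_REPEATS:
            return "repeated_response"
    return None


print(should_hand_off(True, 2.0, ["Try restarting", "try restarting ", "Try restarting"]))
```

Returning a reason rather than a bare boolean lets the human agent see why the conversation was transferred.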

Scaling to 10x traffic (32:03)

Aakash: And how are you going to build your system to handle a 10x spike in traffic?

Aman: For example, in 6 months we launch our MVP, we are successful and now we want to scale it from testing mode to full-time production mode.

For going to 10x there are multiple things. From a model perspective, you would now need the bandwidth to call that model 10 times or 100 times more. You would basically need much stronger and denser servers to handle all these models. A lot of companies host them on-prem as well. Considering that it’s a huge company, we might switch from using the cloud to going on-prem so that we can control and manage every API call or model interaction much more quickly.

Apart from the model, because we’re going 10x, latency certainly becomes important. I feel latency might also improve if we handle it on premise.

Data becomes a huge bottleneck because now we suddenly go from the data of maybe a thousand users to maybe 10,000 or 100K plus users. We need to ensure that we are definitely using a vector database, if we are not already using one for the MVP. A vector database would be highly required because it’s much faster and more efficient than a normal database.
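
For context, what a vector database speeds up is similarity search over embeddings. The sketch below shows the core operation as a brute-force cosine-similarity lookup in plain Python; the three-dimensional vectors and document ids are toy values invented for illustration. A real vector database would typically avoid this linear scan by using an approximate nearest-neighbor index, which is where the gain at 10x scale would come from.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        store,
        key=lambda doc_id: cosine_similarity(query, store[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings; real ones would come from an embedding model.
store = {
    "dropped_calls_transcript": [0.9, 0.1, 0.0],
    "billing_dispute_transcript": [0.1, 0.9, 0.0],
    "roaming_offer": [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.2, 0.0], store, k=1))
```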

From a memory layer perspective, there are a lot of intelligent memory solutions which decide whether we need to save or process a certain kind of response at all. Mem0 is one I know of, and there are a lot of other solutions as well. Having an intelligent layer over memory helps you deal with so many requests much more quickly and ultimately improves latency. Deciding whether a given response needs to be saved or processed gets really important.
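
To make the "decide whether to save" idea concrete, here is a deliberately simple rule-based filter. It does not use or represent Mem0 or any other product's API; the small-talk list, the normalization, and the function name are all assumptions, and real memory layers often use a model rather than fixed rules to make this call.

```python
# Hypothetical list of messages that carry no information worth storing.
SMALL_TALK = {"hi", "hello", "thanks", "thank you", "ok", "okay", "bye"}


def should_store(message: str, already_stored: set[str]) -> bool:
    """Keep a message only if it is substantive and not a duplicate.

    Adds the normalized message to `already_stored` when it is kept.
    """
    normalized = " ".join(message.lower().split()).strip(" .!?")
    if not normalized or normalized in SMALL_TALK:
        return False
    if normalized in already_stored:
        return False
    already_stored.add(normalized)
    return True


seen: set[str] = set()
for msg in ["Thanks!", "My calls drop every evening.", "my calls  drop every evening"]:
    print(msg, "->", should_store(msg, seen))
```

Skipping small talk and duplicates keeps the memory store smaller, which means less data to search on every request.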

Post-interview feedback (35:26)

Aakash: That’s all the time we have unfortunately. Aman, thank you so much for walking through this interview. So now let’s jump out of the interview and review. What did Aman do well? What could he have done better? Aman, why don’t you go first and I’ll add on.

Aman: Asking clarifying questions always helps. I think with regards to the interview process, going deeper on users and pain points is very important. I could have spent more time on users, breaking down the user flow much better because a telecom company has so many different kinds of users. So it’s a very good opportunity to spend time there. I slightly missed out there.

With regards to solutioning, I think I did decently well, but some of the areas which you pointed out around latency, system requirements, tradeoffs, those are things I could have added by myself and ensured that I build an end to end system.

Aakash: I think the big challenge with this type of interview is time management. Overall, I wouldn’t overly call out the amount of time you spent on users as too small because you want to get to that system design diagram, which you got to.

Where I think you could have improved is technical fluency and depth, and delivery. In terms of technical fluency, sometimes when I would ask a question like “are you going to use an LLM or an ML model?”, what I’m hoping is for you to first confirm that you understand what I’m talking about and then talk about the pros and cons of each option.

An answer might sound tighter if it sounded like this. “That’s a great point. We don’t always want to use an LLM when an ML model will do because something like an XGBoost algorithm will be cheaper and less of a black box. So if I think about the pros and cons, an LLM might adapt to new data sources better and might be able to use agentic search, but it’ll be much more expensive and be more of a black box. So actually I think for the data analyst part of the agent, we should use an XGBoost model.”

That might be an example of a strong response.

The other area to improve is delivery. You have this crutch of “uh” which you use whenever you have a pause. So those are the two big areas to improve.

But I think people watching are going to get a lot out of this because now you’ve seen how to present a structure, how to have clarifying questions, how to deal with lack of definition from an interviewer and make assumptions, how to share your screen and draw a system design diagram even if you are a PM.

Aman: One thing you should definitely focus a lot on is asking questions to the interviewer and ensuring that it is engaging and not just a one-sided monologue or conversation. Ensure that you’re keeping your interviewer very engaged and walking through what you’re doing. Even if you share your screen, there is a tendency for people to just work through silently. Ensure that as you’re typing, you’re talking to your interviewer, trying to speak your mind aloud so that they can assess your thought process. Just go there and crush it.

Aakash: In terms of interviewers, I was more on the easy side. Some interviewers will ask more pressing questions and it’ll mess up your time management. So be really careful about how you handle your interviewer and your time management.

With that, we have other mock interviews on this YouTube channel. AI product design, AI success metrics, AI execution. Go check out those videos and subscribe if you’d like more like this. Comment down below what interview mock you would like to see next. Until the next episode, I’ll see you later.
