I’ve watched teams spend months building polished features that died on contact with reality. I’ve also watched scrappy PMs validate the same core idea with a doc, a fake door, and a week of customer calls, then earn the right to build.
That’s the job. Not shipping more. Reducing uncertainty before you ask the company to spend real money, real time, and real credibility.
The PM's Mandate: From 'Could We' to 'Should We'
A junior PM once showed me a roadmap with perfect logic, strong technical design, and zero evidence that customers cared. Another PM in the same org brought a one-page hypothesis, ten interview notes, and a rough demand test. I funded the second one.
That wasn’t because the idea was prettier. It was because the second PM understood the product management mandate. Move the team from “could we build this?” to “should we build this?”

Most product waste starts before engineering
PMs like to talk about execution risk. Missed deadlines. Scope creep. Bugs. Those matter.
But the more expensive mistake is building the wrong thing cleanly.
Approximately 90% of startups fail, and 42% of those failures are attributed to building products that nobody wants, according to Strategyzer’s summary of Testing Business Ideas and the startup post-mortem analyses it references. That’s the statistic every PM should tape to their laptop.
If you work in product discovery, this should reshape your default behavior. Your first instinct shouldn’t be writing PRDs. It should be identifying which assumption can kill the idea fastest.
For PMs earlier in their careers, this is also how you become trusted. Anyone can push a backlog. Fewer people can protect the company from avoidable waste. If you need a grounding in the broader discipline, Aakash Gupta has a useful primer on product discovery.
Practical rule: If your roadmap item starts with a solution instead of a risk, you’re already behind.
The best PMs test before they advocate
I’ve killed ideas I personally liked. I’ve backed ideas I initially doubted. Evidence has to outrank ego.
That’s why I’m opinionated about testing business ideas. It isn’t a startup ritual. It isn’t innovation theater. It’s the core operating system for serious PMs, especially in AI where teams can prototype quickly and still be wildly wrong about user behavior.
Three mindset shifts matter:
- Stop defending ideas. Your job is to interrogate them.
- Stop treating customer enthusiasm as validation. Real behavior matters more than polite feedback.
- Stop assuming internal excitement predicts market demand. It usually predicts internal excitement.
A PM who tests better gets promoted faster
Senior PMs don’t earn trust by sounding strategic. They earn it by making high-quality bets.
That means they know when to keep going, when to reframe, and when to kill the work before the org gets emotionally attached. Teams remember the PM who saved six months far longer than the PM who shipped another feature with mediocre adoption.
That’s the lens I’d use for the rest of this playbook. Not “how do I validate a startup idea?” The better question is, “how do I de-risk a roadmap decision before it becomes expensive, political, and hard to reverse?”
The Hypothesis-Driven PM: Framing Testable Assumptions
A PM once brought me a shiny pitch for an AI forecasting copilot. The demo was slick. Leadership loved it. Engineering was ready to start.
I asked for the hypothesis.
There wasn’t one. There was a solution, a few opinions, and a lot of organizational momentum. Six weeks later, after a handful of customer calls and one ugly pilot, we killed it. Not because the team was weak. Because nobody had written down what had to be true for the idea to deserve budget, headcount, and political capital.
That is the job here. Turn a promising idea into a set of claims that can survive contact with reality. If you want a sharper framework for writing them, Aakash Gupta’s piece on what makes a good hypothesis is useful.
Start with the three risks
I use the same framing with startup teams and big-company PMs, but internal teams need one adjustment. In a large organization, you are not only testing customer demand. You are also testing whether the company will support the change.
Break the idea into three risk buckets.
Desirability
Will a specific user change behavior to get this outcome?
For an AI analytics assistant, the question is not whether people say AI sounds helpful. It is whether an analyst, marketer, or sales manager will trust the answer enough to stop using their current workflow. In AI products, stated interest is cheap. Behavior is the test.
Feasibility
Can your team deliver the experience at a quality bar that holds up in production?
PMs get this wrong in AI all the time. They mistake model output for product quality. Feasibility work lives in retrieval, latency, permissions, evaluation, fallback behavior, auditability, and error handling. If the system gives a confident wrong answer in a high-stakes workflow, you do not have a product. You have a liability.
Viability
Does this idea deserve real investment from the business?
Inside a large company, viability includes revenue, retention, cost to serve, and something startup advice often ignores. Organizational willingness. Will sales carry it? Will support absorb it? Will legal approve it? Will an executive sponsor still care when the novelty wears off?
Testing Business Ideas gets this part right. Strong validation starts by making desirability, feasibility, and viability explicit before you choose experiments.
Use one hypothesis template
I want every PM to write assumptions in one format:
We believe [target customer] will [take a specific action] because [reason]. We’ll know we’re right when [observable metric or behavior].
This forces precision. It also exposes political theater fast.
Weak hypothesis:
- We believe data teams want AI-powered reporting.
Stronger hypothesis:
- We believe growth managers will ask an AI assistant to explain campaign variance because manual analysis is slow and repetitive. We’ll know we’re right when they return to the workflow without prompts and use it in their weekly review process.
The difference matters. The second version names the user, the behavior, the motivation, and the evidence. It gives you something you can test.
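The template can also be treated as a small data structure, which makes missing parts obvious. The sketch below is only an illustration: the `Hypothesis` class and its field names are hypothetical, chosen to mirror the four slots in the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One testable assumption, in the four-part template from this section."""
    target_customer: str  # who, specifically
    action: str           # the observable behavior we expect
    reason: str           # why we think they will do it
    evidence: str         # the metric or behavior that would confirm it

    def missing_parts(self) -> list[str]:
        """Return the names of any template slots left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def render(self) -> str:
        """Fill the template; refuse if any slot is empty."""
        missing = self.missing_parts()
        if missing:
            raise ValueError(f"Hypothesis is not testable yet; missing: {missing}")
        return (
            f"We believe {self.target_customer} will {self.action} "
            f"because {self.reason}. "
            f"We'll know we're right when {self.evidence}."
        )


strong = Hypothesis(
    target_customer="growth managers",
    action="ask an AI assistant to explain campaign variance",
    reason="manual analysis is slow and repetitive",
    evidence="they return to the workflow without prompts "
             "and use it in their weekly review process",
)
print(strong.render())

# The weak version has no reason and no evidence, so it cannot be rendered.
weak = Hypothesis("data teams", "want AI-powered reporting", "", "")
print(weak.missing_parts())  # ['reason', 'evidence']
```

The point of refusing to render an incomplete hypothesis is the same as in the prose: a claim with no stated evidence cannot fail, so it cannot be tested.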
Rank assumptions by consequence, not convenience
PMs under pressure often test whatever is easy to test. That is amateur behavior.
List the assumptions first. Then rank each one on two axes:
| Assumption | Risk if wrong | Current evidence |
|---|---|---|
| Users trust AI-generated explanations | High | Low |
| The model can interpret internal metrics accurately | High | Low |
| The workflow fits weekly reporting habits | Medium | Low |
| The company can package this into an upsell or retention play | High | Very low |
Start with the assumptions that are high risk and weakly supported. Do not start with the one that produces the cleanest demo. In big companies, teams waste quarters proving low-risk details while the core bet remains untested.
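A rough way to make that ordering mechanical is to sort by risk first and evidence second. The sketch below uses the rows from the table above; the numeric scales and the `rank_for_testing` helper are assumptions made for illustration, since the table only uses labels.

```python
from dataclasses import dataclass

# Ordinal scales are an assumption of this sketch; the table only uses labels.
RISK = {"Low": 1, "Medium": 2, "High": 3}
EVIDENCE = {"Very low": 0, "Low": 1, "Medium": 2, "High": 3}


@dataclass(frozen=True)
class Assumption:
    statement: str
    risk_if_wrong: str
    current_evidence: str


def rank_for_testing(assumptions):
    """Highest risk first; among equal risk, weakest evidence first."""
    return sorted(
        assumptions,
        key=lambda a: (-RISK[a.risk_if_wrong], EVIDENCE[a.current_evidence]),
    )


backlog = [
    Assumption("Users trust AI-generated explanations", "High", "Low"),
    Assumption("The model can interpret internal metrics accurately", "High", "Low"),
    Assumption("The workflow fits weekly reporting habits", "Medium", "Low"),
    Assumption("The company can package this into an upsell or retention play",
               "High", "Very low"),
]

for a in rank_for_testing(backlog):
    print(f"{a.risk_if_wrong:<6} | {a.current_evidence:<8} | {a.statement}")
```

With these scales, the packaging assumption comes out on top because it is both high risk and the least supported. Ties keep their original order, so the sort never hides a judgment call you still need to make by hand.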
A practical example for an AI PM
Say you want to launch an internal AI copilot for sales forecasting.
Write the assumptions like this:
- Desirability assumption: Sales managers will ask natural-language questions instead of exporting spreadsheets because they need faster answers in forecast reviews.
- Feasibility assumption: The system can answer forecast questions accurately enough that managers will use it in live planning conversations.
- Viability assumption: Revenue leadership will treat this as an operating workflow worth training, support, and process change.
That last assumption kills a lot of internal products. The user problem can be real. The technology can work. The initiative still dies because no leader wants the rollout burden, no function wants to own success metrics, and no team wants to retrain existing habits.
Write that risk down early. Test it like any other.
Write hypotheses so they can fail. If the statement cannot be disproven, it is useless.
What I tell teams to avoid
Three failure patterns show up constantly:
- Solution-first framing: “Users need a chatbot.” No. They need a better way to complete a job.
- Soft evidence: “Stakeholders liked the demo.” That only proves the demo was polished.
- Bundled assumptions: If one experiment tries to test trust, usability, pricing, and retention at once, you will learn almost nothing.
Before any experiment starts, I expect a one-page brief with:
- The user segment
- The exact assumption
- The risk type
- The smallest valid test
- The behavior that counts as evidence
- The decision the team will make after the result
If a PM cannot fill out that page clearly, the idea is still too fuzzy to fund.
The Experimentation Toolkit: From Interviews to Smoke Tests
A team once asked me for six engineers to build an AI forecasting assistant. I said no.
Instead, I asked them to put a fake entry point inside the planning workflow, interview ten sales managers who had missed their number, and manually answer a week of forecast questions behind the scenes. In two weeks, we learned three things that a quarter of product development would have hidden: managers wanted faster answers, they did not trust generated explanations without source data, and revenue ops refused to own the rollout. That saved us months of work and gave us a much better idea to fund.
That is the job here. Pick the cheapest test that can kill the riskiest assumption.

Pick the test that matches the risk
PMs inside large companies often choose experiments based on what is politically easy. A few stakeholder interviews. A polished prototype. A pilot with a friendly team. That is how weak ideas survive.
Choose based on uncertainty, not comfort. Early experiments should be cheap and fast. Later experiments should require real user behavior. If you are testing an AI concept, add one more filter: test trust before you test scale. A model that looks impressive in a demo can still fail the minute a user sees one wrong answer in a high-stakes workflow.
Use this matrix with your team and with your sponsors. It makes the tradeoffs visible.
Experiment Selection Matrix
| Experiment Type | Best for Testing | Cost & Time | Evidence Strength |
|---|---|---|---|
| Customer interviews | Problem clarity, current behavior, workflow pain | Low cost, fast to start | Qualitative, useful early |
| Surveys | Pattern checking across a broader group | Low to medium effort | Weak to moderate, depends on question quality |
| Landing page | Demand and message resonance | Low build, moderate setup | Moderate behavioral signal |
| Smoke test | Purchase or signup intent before build | Low to medium effort | Stronger than stated interest |
| Concierge MVP | Whether users value the outcome enough to return | High touch, manual work | Strong learning for desirability |
| Wizard of Oz | Whether the experience works before automation | Medium effort, hidden manual ops | Strong feasibility proxy |
If your team needs to sharpen interview execution, use this guide on how to conduct user interviews.
Low-fidelity discovery tools
These methods are for learning fast, not proving demand.
Customer interviews
Interviews help you understand pain, behavior, workarounds, buying context, and internal blockers. They do not prove that anyone will adopt or pay.
For internal products, talk to two groups separately. Talk to users about the job. Talk to budget owners and workflow owners about change management, compliance, training, and who gets blamed when the tool fails. Corporate politics kills more ideas than weak UX.
For AI products, I want teams to ask questions like these:
- Current workflow: “Show me how this gets done today.”
- Failure cost: “What breaks when this output is wrong, late, or missing?”
- Escalation path: “Who steps in when the system cannot be trusted?”
- Trust boundary: “What would make you reject an AI-generated answer immediately?”
- Approval friction: “Who has to sign off before this can be used in a live workflow?”
Ask for a recent example. Ask for artifacts. Ask who else is involved. Do not ask people to predict their future behavior.
Surveys
Surveys are useful only after you already understand the problem well enough to write precise questions. Use them to check whether a pattern is widespread. Do not use them to discover the pattern in the first place.
A bad survey gives you fake certainty at scale. A good one forces tradeoffs, tests recall of real behavior, and avoids vague prompts like “How valuable would this be?” If you are testing an internal AI workflow, include questions about trust, review burden, and willingness to change process. Those answers often matter more than excitement about the feature itself.
Surveys can size pain. They rarely settle whether a product should exist.
Higher-fidelity validation tools
Once you know the problem is real, move to tests that require action.
Landing page
A landing page is a message test with a behavioral signal attached. It works well when you need to know whether the promise is clear enough to get a click, signup, or demo request.
For AI ideas, sell the outcome, not the model. “Prepare a board-ready forecast summary in five minutes” is strong. “AI-powered forecasting copilot” is lazy and usually useless.
A good landing page has four parts:
- One painful job
- One clear promise
- One primary CTA
- One qualifier step after the click
That qualifier step matters. Inside a company, a pile of signups from curious employees means very little. Interest from the right function, with budget authority or workflow ownership, means a lot more.
Smoke test
A smoke test puts a real offer in front of users before the product exists. The click is the evidence.
Examples:
- “Start free trial”
- “Request access”
- “Book setup”
- “Upgrade to access AI summaries”
Inside an existing product, this is one of the best tests you can run. Put the entry point in the actual workflow. Track who clicks, what segment they belong to, and whether they complete the intake step. Then follow up with a short message that says access is limited or coming soon. You get demand data without building the feature.
This is also politically useful. A smoke test gives you something stronger than opinion when a senior leader asks why the team should fund the work.
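To show what "track who clicks, what segment they belong to, and whether they complete the intake step" might look like in practice, here is a minimal sketch. The event records, field names, and segment labels are all hypothetical; a real analytics export will look different.

```python
from collections import defaultdict

# Hypothetical event records from a fake-door entry point. The field names
# ("user", "segment", "event") are illustrative, not from any real tool.
events = [
    {"user": "u1", "segment": "sales_manager", "event": "click"},
    {"user": "u1", "segment": "sales_manager", "event": "intake_completed"},
    {"user": "u2", "segment": "sales_manager", "event": "click"},
    {"user": "u3", "segment": "analyst", "event": "click"},
    {"user": "u3", "segment": "analyst", "event": "intake_completed"},
    {"user": "u4", "segment": "curious_employee", "event": "click"},
]


def summarize_smoke_test(events):
    """Per segment: unique users who clicked, and how many finished intake."""
    clicked = defaultdict(set)
    completed = defaultdict(set)
    for e in events:
        if e["event"] == "click":
            clicked[e["segment"]].add(e["user"])
        elif e["event"] == "intake_completed":
            completed[e["segment"]].add(e["user"])

    summary = {}
    for segment, users in clicked.items():
        # Only count completions from users who also clicked.
        done = len(completed[segment] & users)
        summary[segment] = {
            "clicked": len(users),
            "completed_intake": done,
            "completion_rate": done / len(users),
        }
    return summary


for segment, stats in summarize_smoke_test(events).items():
    print(segment, stats)
```

Splitting by segment is the important design choice. A raw click count mixes target users with curious bystanders, and only the first group tells you anything about demand.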
Concierge MVP
A concierge MVP is my default recommendation for AI products with high workflow uncertainty. Deliver the result manually. Learn whether the outcome matters before you automate anything.
If you want to test an AI research assistant, have a human create the output using existing tools and internal data. If you want to test an internal support copilot, let an operator draft the answers while the user sees a guided experience. Then watch what happens next. Do people come back? Do they trust the output? Do they try to route more work through it? Does any team volunteer to own it?
That last question matters in big companies. A product without an operational owner is a demo, not a business.
Wizard of Oz
Wizard of Oz tests present an automated experience while a human does the work behind the curtain. They are useful when you need to test the workflow before you commit to building infrastructure, integrations, or model orchestration.
This method is especially strong for AI features because it separates two risks that teams often mix together: “Do users want this experience?” and “Can we automate it well enough?” Answer the first question before you spend time on the second.
What each tool is bad at
Every experiment has blind spots. Good PMs know them and design around them.
- Interviews are weak evidence for willingness to pay or sustained usage.
- Surveys are weak at predicting real behavior.
- Landing pages tell you little about retention or trust after first use.
- Smoke tests show interest, but they do not explain hesitation by themselves.
- Concierge MVPs can make ugly operations look acceptable for a short period.
- Wizard of Oz tests can hide technical and support complexity that will surface later.
If a team treats one test as proof of everything, stop them. The point of a toolkit is sequence.
My AI-era sequence for internal PMs
For AI ideas inside large organizations, I use a simple order:
- Interview users and workflow owners to identify pain, trust boundaries, and political blockers.
- Run a fake door or landing page to test message resonance and actual interest.
- Use concierge or Wizard of Oz delivery to test repeat value before automation.
- Review the operational owner question before funding buildout.
- Build the narrowest product slice only after the evidence is strong enough to justify engineering time.
That sequence works because it reflects reality. In startups, the market can kill the idea. Inside a company, the market and the org chart can kill it together. Test both.
The MVP is Not Your First Product
The term MVP is so widely misused that it’s almost useless.
Teams call a stripped-down product an MVP when it’s really just version one. They still invest real engineering time, real design cycles, and real launch energy. Then they act surprised when the “minimum” product is still expensive.
An MVP is a learning vehicle first. If it isn’t designed to answer a risky question quickly, it’s not doing its job.

Airbnb got this right
The canonical example still holds because it was so crude and so effective.
In 2008, Airbnb’s founders put up a simple website, rented out three air mattresses in their apartment during a conference, and generated $1,000, which validated the core hypothesis that people would pay to stay in a stranger’s home (Accel MG on testing business ideas with customers).
That was not a polished product. It was behavioral evidence.
That’s the standard. Your MVP should create learning disproportionate to the effort invested.
If you want to see more formats, Aakash Gupta has a useful roundup of minimum viable product examples.
The best MVPs don’t look impressive internally. They answer the scariest question cheaply.
There are two kinds of MVPs
PMs need to separate these clearly.
Learning MVP
This exists to test a hypothesis with minimal build.
Examples:
- A concierge workflow
- A fake door inside an existing product
- A manually generated AI report sent by email
- A prototype demo backed by human operations
Use this when uncertainty is still high.
Value-delivery MVP
This exists to deliver enough value for early adopters to use repeatedly.
Examples:
- A narrow self-serve feature in production
- A lightweight beta with limited integrations
- An internal tool rolled out to a small operational team
Use this when the core assumptions already have evidence and you now need to learn about product quality, onboarding, and repeat usage.
A simple decision tree
Ask these questions in order:
- Do we understand the problem fully? If no, don’t build. Interview and observe.
- Do we have evidence users want the outcome? If no, use a fake door, landing page, or manual offer.
- Do we know the experience can work at a basic quality level? If no, run Wizard of Oz or concierge tests.
- Do we need productized software to learn the next thing? If no, stay manual longer.
- Will a small shipped version create real usage signals we can’t get manually? If yes, build the narrowest possible value-delivery MVP.
This keeps teams from jumping into productization too early.
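The five questions above can be sketched as a function that stops at the first unresolved one. The `Evidence` class and `next_step` function are hypothetical names for illustration. One branch is an assumption: the text only covers the "yes" answer to the last question, so the sketch treats "no" as a reason to stay manual.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Honest yes/no answers to the five questions, in order."""
    understands_problem: bool
    users_want_outcome: bool
    experience_works_at_basic_quality: bool
    needs_software_to_learn_more: bool
    shipping_gives_signals_manual_cannot: bool


def next_step(e: Evidence) -> str:
    """Walk the questions in order and stop at the first unresolved one."""
    if not e.understands_problem:
        return "Don't build. Interview and observe."
    if not e.users_want_outcome:
        return "Use a fake door, landing page, or manual offer."
    if not e.experience_works_at_basic_quality:
        return "Run Wizard of Oz or concierge tests."
    if not e.needs_software_to_learn_more:
        return "Stay manual longer."
    if not e.shipping_gives_signals_manual_cannot:
        # Assumption: the text does not spell out this branch.
        return "Stay manual longer."
    return "Build the narrowest possible value-delivery MVP."


# Demand is proven, but nobody has checked whether the experience holds up.
print(next_step(Evidence(True, True, False, False, False)))
```

Because the checks run in order, a team cannot reach the build recommendation without first answering yes to every earlier question.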
Why this matters even more for AI
AI teams have a bad habit of building capability before proving need.
They demo a summarizer, classifier, copilot, or agent because they can. Then they search for a workflow where it belongs. That’s backwards.
For AI, I strongly prefer human-in-the-loop MVPs at the start. They answer critical questions:
- Is the output useful enough to matter?
- Where does trust break?
- Which edge cases force escalation?
- What level of explanation or provenance do users need?
Those are product questions, not model questions.
If a manually delivered version doesn’t create repeat pull, adding automation won’t rescue it. It’ll just help you scale disappointment.
Beyond Startups: Testing Ideas Inside Large Companies
Startup advice breaks down fast inside a big company.
A founder can wake up, change direction, and run a test by lunch. A PM at Google, Meta, or a large public company needs legal review, design bandwidth, analytics support, stakeholder alignment, and someone willing to absorb the political cost if the idea flops.
That changes how testing business ideas works. It doesn’t make it less important. It makes it more strategic.

Internal buy-in is its own risk category
A lot of product frameworks cover desirability, feasibility, and viability. In large companies, I add a fourth question:
Can this idea survive the organization?
That sounds cynical. It’s just accurate.
A 2023 McKinsey report found that 70% of corporate innovations fail due to poor internal buy-in, not lack of market fit, which is why PMs in large organizations should test organizational feasibility alongside customer demand (YouTube discussion referencing the McKinsey finding).
If nobody will fund it, support it, integrate it, sell it, or defend it in planning, the market doesn’t matter yet.
Run organizational feasibility experiments
Treat internal adoption like a product problem.
Here are the tests I use most:
- Executive pitch simulation: Present the idea as if you were asking for staffing today. Watch where leaders hesitate.
- Sales or support reaction test: Put a one-pager in front of frontline teams and ask whether they’d champion it or dread it.
- Dependency mapping workshop: Get engineering, legal, security, and ops in one room. Force hidden blockers into the open.
- Internal fake door: Offer the capability to one internal team before asking for broad rollout support.
These aren’t side quests. They’re core validation in enterprise environments.
In a large company, the first customer for a new idea is often the organization itself.
Stealth testing beats grand launches
If you’re an intrapreneurial PM, don’t ask for broad permission too early. Earn the right to ask later.
That usually means:
- Use existing surfaces instead of requesting net-new platform work.
- Prototype in docs, slides, Figma, or no-code tools before you pull engineers in.
- Borrow manual operations from a partner team for a limited pilot.
- Frame the work as risk reduction, not as a visionary moonshot.
Leaders rarely object to learning. They often object to speculative spending.
This matters a lot for AI projects. AI triggers both excitement and fear. Teams worry about hallucinations, governance, privacy, quality, and brand risk. A narrow test with visible controls is easier to approve than a broad “AI transformation” proposal.
Translate experiments into the language executives care about
Don’t walk into a review and say, “We’d like to run discovery.”
Answer these questions instead:
- What risk are we retiring?
- What decision will this test inform?
- What’s the smallest resource ask?
- What happens if the answer is no?
That last one is powerful. It shows discipline.
When PMs can prove they’ll stop if the evidence is weak, executives trust them more. Ironically, that often gets them more room to experiment.
Career advantage for enterprise PMs
This skill compounds.
The PM who can discover customer pain is useful. The PM who can also handle legal, security, GTM, and executive buy-in becomes hard to replace.
That’s the difference between someone who manages features and someone who shapes company bets.
The Decision Engine: Interpreting Signals and Making the Call
Testing business ideas isn’t about collecting artifacts. It’s about making decisions.
You need a decision engine that forces honesty. Mine has only three outputs: persevere, pivot, or kill.
Simple beats elegant here.
Use evidence thresholds before you run the test
Teams get political when they define success after the results arrive.
Decide in advance:
- What behavior counts as a positive signal
- What result is ambiguous
- What result means stop
If you’re running a landing page test, set the conversion benchmark before the page goes live. If you’re running a concierge test, write down the early retention signal you’re looking for before the first delivery. Use those thresholds to inform judgment, then add context from user quality and workflow fit.
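One way to keep yourself honest is to write the thresholds down as data before any results exist. The sketch below is an illustration only: `DecisionRule` is a hypothetical helper, and the numbers are placeholders, not recommended benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRule:
    """Thresholds agreed on before the test runs."""
    metric: str
    stop_below: float   # observed value at or below this means stop
    positive_at: float  # observed value at or above this is a positive signal

    def __post_init__(self):
        if self.stop_below >= self.positive_at:
            raise ValueError("stop_below must be lower than positive_at")

    def read(self, observed: float) -> str:
        """Classify a result as 'positive', 'ambiguous', or 'stop'."""
        if observed >= self.positive_at:
            return "positive"
        if observed <= self.stop_below:
            return "stop"
        return "ambiguous"


# Placeholder thresholds for illustration; pick your own before launch.
rule = DecisionRule(
    metric="intake completion rate in the target segment",
    stop_below=0.05,
    positive_at=0.20,
)

for observed in (0.25, 0.12, 0.03):
    print(observed, rule.read(observed))
```

The output of `read` is an input to the call, not the call itself. An "ambiguous" result is where context about user quality and workflow fit does the most work.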
Don’t let vanity metrics hijack the call
A lot of PMs get seduced by weak positives.
Examples:
- Plenty of signups, but low-intent users
- Strong interview enthusiasm, but no repeated behavior
- Exec excitement, but no team willing to operate the process
- Good demo feedback, but weak trust in real workflows
The question isn’t whether you found something encouraging. The question is whether you reduced the core uncertainty enough to justify the next investment.
Persevere, pivot, or kill
Persevere
Choose this when the key assumption got stronger and the next step is obvious.
That usually means the test showed real behavior, not just interest. Now you can increase fidelity, tighten the workflow, or invest in productization.
Pivot
Choose this when the problem is real but your framing, segment, channel, or solution shape is off.
A pivot is not a euphemism for denial. It should be explicit. What changed? User segment? Use case? Promise? Workflow? Business model?
Kill
This is the most underused move in product management.
Killing an idea after a disciplined test is a success. You learned cheaply. You protected the roadmap. You saved engineering from building something the org would regret.
Senior PMs don’t get promoted for never being wrong. They get promoted for getting to the truth quickly.
The leadership skill most PMs avoid
A PM who can say, “The evidence isn’t strong enough, and we should stop,” is far more valuable than a PM who keeps weak ideas alive through storytelling.
That’s especially true in AI. The demos are seductive. The pressure is high. The temptation to label every experiment “promising” is everywhere.
Don’t do that.
Use a clear framework. Write the assumptions. Pick the smallest honest test. Define the decision before you run it. Then make the call without flinching.
If you want to sharpen that muscle, Aakash Gupta has a broader set of decision-making frameworks that pair well with this kind of product judgment.
Your next step is straightforward. Pick one idea on your roadmap that feels exciting but under-evidenced. Write the top three assumptions. Classify them by desirability, feasibility, and viability. Then design one test you can launch in the next 24 hours.
That’s how strong PMs work. Not by shipping faster. By learning faster.