Alex K. · 6 min read

Should You Build an LLM Feature? Three Questions to Answer First

Most LLM features fail for predictable reasons: the economics don't work at scale, errors compound across multi-step workflows, or users realize they can get the same result by using ChatGPT directly. Three questions separate features that work from expensive mistakes.

Will errors compound?

An LLM with 95% accuracy sounds acceptable. Five LLM calls in sequence, each at 95% accuracy, gives you 77% reliability. That's the math of compounding errors, and it's why multi-step LLM workflows often fail in production even when each individual step works fine in testing.

The problem shows up whenever you chain operations together. An LLM extracts data from a document (95% accurate), then categorizes it (95% accurate), then generates a summary (95% accurate). Each step depends on the previous step being correct, so the success rates multiply and the errors compound.

Consider a document processing pipeline with four LLM steps: extract shipping details, validate addresses, categorize shipment type, generate routing instructions. If each step runs at 94% accuracy, the end-to-end accuracy drops to 78%. A 22% failure rate means manual review of everything, which defeats the automation.
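
The arithmetic is worth making concrete. Here's a minimal sketch in Python, using the illustrative accuracies above:

```python
# Minimal sketch: end-to-end reliability of a chained LLM pipeline.
# The step accuracies are the illustrative figures from this post,
# not measurements.

def end_to_end_accuracy(step_accuracies):
    """Probability that every step in the chain succeeds."""
    total = 1.0
    for acc in step_accuracies:
        total *= acc
    return total

print(end_to_end_accuracy([0.95] * 5))  # five steps at 95% -> ~0.77
print(end_to_end_accuracy([0.94] * 4))  # four steps at 94% -> ~0.78
```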

The fix is deterministic validation between LLM steps. After extraction, validate addresses against a database before passing to the next step. After categorization, check against a whitelist of valid types. The LLM still makes mistakes at each step, but validation catches errors before they compound.
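
In code, the pattern looks roughly like this. Everything below is a hypothetical sketch: the LLM calls are stubbed out, and the helper names and allowlist are stand-ins for your own business rules, not a real API.

```python
# Sketch of deterministic validation between LLM steps.

VALID_SHIPMENT_TYPES = {"ground", "air", "freight"}  # assumed allowlist

# --- stubs standing in for real LLM API calls (hypothetical) ---
def llm_extract_shipping_details(doc):
    return {"address": "1 Main St, Detroit, MI", "weight_kg": 12}

def llm_categorize_shipment(details):
    return "ground"

def llm_generate_routing(details, category):
    return f"Route {category} shipment to regional hub."

# --- deterministic checks (your own logic, not an LLM) ---
def address_db_lookup(address):
    return "Detroit" in address  # placeholder for a real database lookup

def flag_for_review(doc, reason):
    return f"NEEDS REVIEW: {reason}"

def process_document(doc):
    details = llm_extract_shipping_details(doc)     # LLM step 1
    if not address_db_lookup(details["address"]):   # validation gate
        return flag_for_review(doc, "address not found")

    category = llm_categorize_shipment(details)     # LLM step 2
    if category not in VALID_SHIPMENT_TYPES:        # validation gate
        return flag_for_review(doc, f"unknown category {category!r}")

    return llm_generate_routing(details, category)  # LLM step 3

print(process_document("...shipping manifest text..."))
```

Each gate stops a bad output before the next LLM call can build on it, which is what keeps the compounding math from applying to the whole chain.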

The question isn't whether your LLM is accurate enough. It's whether you can tolerate the compounded error rate when you chain multiple operations together, and whether you can add validation layers that prevent errors from propagating.

Do the unit economics work?

LLM costs are deceptive. The per-query cost looks small until you multiply by volume, and suddenly you're spending more on inference than you're saving in labor.

Take a customer support system using an LLM to draft responses. At $0.08 per query and 50,000 tickets monthly, that's $4,000 in API costs. If the support team's loaded cost is $30/hour and tickets take 8 minutes to resolve, that's $200,000 monthly in labor. Reducing resolution time from 8 minutes to 3 minutes saves $125,000 monthly. After subtracting API costs, you save $121,000. The economics work.

But change the variables. At 10,000 tickets monthly, you save $25,000 in labor and spend $800 on API calls—still worth it. At 5,000 tickets, savings drop to $12,500 against $400 in API costs. The per-ticket math stays favorable, but the absolute savings shrink while the fixed cost of building and maintaining the feature doesn't. Below some volume, the total savings no longer justify the project.
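
The break-even math is simple enough to script. A minimal sketch using the figures above:

```python
# Sketch of the support-ticket economics from this post. All inputs
# are the illustrative figures above; substitute your own.

def net_monthly_savings(tickets, cost_per_query, minutes_saved, hourly_rate):
    labor_saved = tickets * (minutes_saved / 60) * hourly_rate
    api_cost = tickets * cost_per_query
    return labor_saved - api_cost

for volume in (50_000, 10_000, 5_000):
    net = net_monthly_savings(volume, cost_per_query=0.08,
                              minutes_saved=5, hourly_rate=30)
    print(f"{volume:>6,} tickets/month -> net ${net:,.0f}")
# 50,000 -> $121,000; 10,000 -> $24,200; 5,000 -> $12,100
```

Note the sketch covers only variable costs; fold in the fixed cost of building and maintaining the feature to find your real break-even volume.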

The failure mode is building without running the math first. You need your actual usage volume, your average context length, and your actual cost of the manual process you're replacing. Without those numbers, you can't determine if the economics work.

Then there's pricing volatility. Model upgrades change costs unpredictably. One update might cut costs by 50%, the next might increase them. Your unit economics depend on someone else's pricing decisions.

Could they just use ChatGPT?

The hardest question: if your LLM feature is valuable, why wouldn't users just use ChatGPT directly instead of your product?

If your feature is "ask questions about your data," the user could copy their data into ChatGPT and ask questions there. If it's "generate a report based on these inputs," they could paste the inputs into ChatGPT and ask for a report. Your integration adds convenience, but it doesn't add capability.

The features that work are ones where the LLM needs context that's hard to manually provide. A legal research tool that pulls relevant case law from a proprietary database and uses an LLM to synthesize it delivers something ChatGPT can't. The user would need to manually find and paste dozens of cases to replicate it.

A financial analysis tool that connects to live transaction data and uses an LLM to identify anomalies works because the user can't easily export all their transaction data to ChatGPT every time they want an analysis.

A documentation system that uses an LLM to answer questions based on internal engineering docs works because ChatGPT doesn't have access to those docs, and copying them all into a prompt isn't practical.

The test is: can a user replicate your LLM feature in under 5 minutes by using ChatGPT directly? If yes, your feature adds convenience but not unique value. Users might pay for convenience, but they won't pay much, and they'll churn when a competitor offers the same convenience cheaper.

The features that command premium pricing are ones where the LLM has access to proprietary data, real-time information, or integrations that would take significant effort to replicate manually.

What actually works

Consider predictive maintenance for industrial equipment. An LLM analyzes sensor readings and maintenance logs to predict failures, with access to 18 months of historical data per machine and the ability to correlate failure patterns across similar equipment.

Could a technician use ChatGPT for this? Technically yes, but they'd need to export gigabytes of sensor data, format it correctly, and paste it into a prompt. The effort exceeds the value. The automated feature works because it eliminates that friction and runs continuously rather than on-demand.

The error compounding issue exists—the system extracts patterns from sensor data, categorizes failure types, and generates recommendations. The solution is validation that checks predictions against known failure signatures before surfacing them. When confidence is low, it flags for human review rather than making an automated decision.
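
The routing logic itself is small. A sketch, where the threshold value and the signature set are illustrative assumptions rather than figures from a real system:

```python
# Sketch of confidence-gated review routing. The threshold and the
# signature set are illustrative assumptions.

CONFIDENCE_THRESHOLD = 0.85
KNOWN_SIGNATURES = {"bearing_wear", "seal_leak", "motor_overheat"}

def route_prediction(prediction, confidence):
    # Deterministic check: only surface predictions that match a
    # known failure signature.
    if prediction not in KNOWN_SIGNATURES:
        return "human_review"
    # Confidence gate: low-confidence predictions go to a person.
    if confidence < CONFIDENCE_THRESHOLD:
        return "human_review"
    return "surface_to_technician"

print(route_prediction("bearing_wear", 0.92))  # surface_to_technician
print(route_prediction("bearing_wear", 0.60))  # human_review
```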

The unit economics work at 5,000 analyses per month at $0.12 each ($600 monthly), compared to hiring an additional data analyst at $8,000 monthly.

The three questions

Before building an LLM feature, work through these in order:

If you're chaining multiple LLM calls together, can you tolerate the compounded error rate? If not, can you add validation between steps?

Have you calculated the monthly API cost at your expected volume? Does it make sense relative to what you're saving?

Could users replicate your feature by using ChatGPT directly in under 5 minutes? If yes, why would they pay for your integration?

Most proposed LLM features fail at least one of these tests. When they do, the right answer is often not to build it, or to redesign it so the economics work and the value is defensible.


Need help figuring out if your LLM system will work? Email us at hello@detroitcomputing.com.