Architecting AI Systems Part 2: The AI Assistant That Needed Less AI Than We Expected

Part of our AIAppliedRight series, in which we reviewed an AI-first architecture and are working toward understanding “When does this system actually need AI?”

This is a follow-up to “Just Because You Have Content Doesn’t Mean You Need RAG.” In that post, four “AI” features turned out to be lookups, ranking and classification in disguise. One request, though, looked like the real thing.

The company wanted an AI assistant that customers could ask a question and get an answer drawn from everything the business had ever published: products, recipes, blogs, podcasts, buying guides.

“What should I use for sleep?”

“Is this safe to take with my medication?”

“What’s a good substitute if I don’t have this on hand?”

Unlike the other features we reviewed, this one wasn’t a recommendation or ranking problem in disguise. It involved open-ended language, an unstructured content library, and a real expectation that the system would respond in natural language. If anything in this engagement justified retrieval and generation, this looked like it.

The client had one persistent worry: what does this cost per conversation? An assistant people enjoy chatting with isn’t automatically an assistant that sells anything. A lot of usage would be browsing and curiosity with no purchase intent behind it — and a full retrieval-and-generation pipeline charges the same whether the conversation ends in a cart or not. They didn’t want to pay premium, per-turn LLM costs for engagement that was never going to convert.

So we started where everyone starts: assume full RAG. Embed the content, retrieve relevant passages for a question, hand them to an LLM, let it compose an answer.
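As a rough illustration of that starting assumption, the pipeline can be sketched in a few lines of Python. Everything here is a stand-in rather than what was built: `embed` is a toy bag-of-words counter instead of a real embedding model, `generate` is a hypothetical callable standing in for the LLM call, and the sample passages are invented for the example.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy embedding: term counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: List[str], k: int = 2) -> List[str]:
    """Return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def answer(question: str, passages: List[str],
           generate: Callable[[str], str], k: int = 2) -> str:
    """Full RAG turn: retrieve context, then pay for one generation call."""
    context = retrieve(question, passages, k)
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
    return generate(prompt)  # hypothetical LLM call, made on every turn


if __name__ == "__main__":
    # Invented sample content, for illustration only.
    library = [
        "blog post about winding down before sleep",
        "recipe for ginger tea",
        "buying guide for skin care",
    ]
    # Echo the prompt instead of calling a model.
    print(answer("winding down before sleep", library, generate=lambda p: p, k=1))
```

The point to notice is the last line of `answer`: in this design every message, whether it leads to a purchase or not, ends in a generation call.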

Then we looked at the questions people would actually ask, and they clustered. “What should I use for sleep,” “I can’t fall asleep,” and “something to help me wind down” are the same question wearing different words. So are the dozen ways to ask about a substitute, or safety, or what a product is good for.

This is where the nature of the platform mattered. It wasn’t a generic store that sells almost anything to almost anyone, facing a genuinely open-ended range of questions. It was domain-specific: health and wellness, with a defined focus on health and home use. A platform that focused has a finite, knowable space of things people ask about (sleep, digestion, immunity, skin, stress, and a handful of others), the same categories the business already organized its products and content around. The questions were never going to range further than the domain itself did.

That changes the problem entirely. Generating a novel answer from scratch is a job for an LLM over retrieved content. Recognizing that a question belongs to one of a known set of intents is a job for a classifier — smaller, cheaper, faster, and with no retrieval step at all.

So a small model classified each incoming question into an intent — a sleep-and-calm concern, a substitution request, a safety question, a product lookup — and handed it to a path already built for it elsewhere in the system: the same curated tags, lookup tables, and ranking logic running the rest of the platform. The assistant’s real job wasn’t answering. It was understanding what was being asked, and routing it.
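The classify-then-route shape described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system that was built: the keyword matcher stands in for the small classification model, and the intent names, trigger phrases, handler strings, and `rag_fallback` are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical intents and trigger phrases. In practice a small trained
# classifier would replace this keyword table.
INTENT_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "sleep_and_calm": ("sleep", "wind down", "calm"),
    "substitution": ("substitute", "instead of", "replace"),
    "safety": ("safe", "medication", "side effect"),
    "product_lookup": ("what is", "good for", "ingredient"),
}


def classify_intent(question: str) -> Optional[str]:
    """Return the best-matching known intent, or None for the long tail."""
    text = question.lower()
    scores = {
        intent: sum(keyword in text for keyword in keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Placeholder handlers: each would call a path that already exists
# (curated tags, lookup tables, ranking logic).
HANDLERS: Dict[str, Callable[[str], str]] = {
    name: (lambda question, name=name: "existing path: " + name)
    for name in INTENT_KEYWORDS
}


def rag_fallback(question: str) -> str:
    """Placeholder for the expensive retrieval-and-generation path."""
    return "long tail: retrieval + generation"


def route(question: str) -> str:
    """Classify first; only unknown intents reach the expensive path."""
    intent = classify_intent(question)
    handler = HANDLERS[intent] if intent is not None else rag_fallback
    return handler(question)


if __name__ == "__main__":
    for q in ("What should I use for sleep?",
              "Is this safe to take with my medication?",
              "Tell me about the history of your podcast"):
        print(q, "->", route(q))
```

The three phrasings of the sleep question from earlier in the post all land on the same intent, and only a question that matches nothing falls through to `rag_fallback`.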

That answered the cost question too. A classification call is a fraction of the cost of a full generation turn, and it runs on every message — including the ones that were only ever browsing. The expensive path gets invoked only for the genuine long tail: questions that fit no known intent and really do need retrieval and generation. The assistant’s cost stopped scaling with how much people enjoyed talking to it, and started scaling with how much of that talking actually needed a model to generate something new.
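The cost argument can be written as simple arithmetic. In the sketch below, the function name, its parameters, and every number in the usage example are assumptions chosen for illustration; the post gives no actual prices or traffic shares.

```python
from typing import Tuple


def cost_per_conversation(turns: int,
                          classify_cost: float,
                          generate_cost: float,
                          long_tail_share: float) -> Tuple[float, float]:
    """Compare routed cost with full-RAG cost for one conversation.

    classify_cost   -- assumed cost of one classification call
    generate_cost   -- assumed cost of one retrieval-and-generation turn
    long_tail_share -- assumed fraction of messages that fit no known intent
    """
    # Routed: every message is classified; only the long tail is generated.
    routed = turns * (classify_cost + long_tail_share * generate_cost)
    # Full RAG: every message pays for generation.
    full_rag = turns * generate_cost
    return routed, full_rag


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    routed, full_rag = cost_per_conversation(
        turns=10, classify_cost=0.001, generate_cost=0.02, long_tail_share=0.2)
    print(f"routed: {routed:.3f}  full RAG: {full_rag:.3f}")
```

Under this model, routing is cheaper whenever `classify_cost < (1 - long_tail_share) * generate_cost`, so the saving shrinks as more of the traffic turns out to be genuine long tail.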

This one is worth sitting with, because the assistant was the feature most likely to survive an architecture review with RAG intact. Open-ended language, unstructured content, a conversational interface — every signal pointed at retrieval. And most of it still came down to the same question as everything else: what does the system actually need to do, before you decide how to build it?

The assistant’s real job wasn’t answering. It was understanding what was being asked.

Team Cennest