Linguistic Exploration: Why Real Conversations Break Tidy Taxonomies

Posted Jan 28, 2026


Kara Hartnett

Enterprises talk about “understanding” users, but language rarely cooperates with the neat boxes we design in org charts, ticket categories, and product menus. 

In a recent Dialogue Architects episode, Rasa CTO and co-founder Alan Nichol walked through a pattern we keep seeing across support, employee experience, and internal tooling: the gap between how businesses structure work and how people actually speak.

This post pulls on that thread. It’s a linguistic look at why conversational systems succeed or fail, and what changes when you treat language as interaction instead of classification.

The first lesson many teams learn: people speak in fragments, not forms

Early on, Alan built systems that let teams “talk to their data,” asking questions that translated into structured queries. On paper, it sounded straightforward: natural language in, SQL out.

In practice, users almost never spoke the way the system designer expected.

They did not ask:

  • “What was the ROI on all campaigns by country for the last three weeks?”

They asked:

  • “How’s Facebook?”
  • “How’s Germany looking?”

Those short questions carry meaning, but they rely on shared context and implied structure. 

Linguists call this ellipsis: the user omits what both sides can infer. Humans do this constantly because it’s efficient. A conversational system only handles it well when it can ask follow-ups, track context across turns, and treat the exchange as a joint activity, not a single-shot query.

That’s where multi-turn behavior stops being a product feature and starts being the whole problem.
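The ellipsis pattern above can be sketched in code. This is a minimal, hypothetical illustration (the slot names and defaults are assumptions, not a real Rasa API): a short question like “How’s Germany looking?” only becomes a complete query once the system fills in the metric, time window, and prior constraints from tracked context.

```python
# A minimal sketch (hypothetical names) of resolving an elliptical
# question against tracked dialogue context. Defaults such as the
# metric and time window come from prior turns, not from the utterance.

from dataclasses import dataclass, field


@dataclass
class DialogueContext:
    """Slots carried across turns; the user never restates them."""
    metric: str = "roi"
    time_window: str = "last_3_weeks"
    filters: dict = field(default_factory=dict)


def resolve_elliptical(entity: str, kind: str, ctx: DialogueContext) -> dict:
    """Fill in everything the user omitted from the shared context."""
    ctx.filters[kind] = entity
    return {
        "metric": ctx.metric,          # implied, never spoken
        "window": ctx.time_window,     # carried over from earlier turns
        "filters": dict(ctx.filters),  # accumulated constraints
    }


ctx = DialogueContext()
q1 = resolve_elliptical("facebook", "channel", ctx)  # "How's Facebook?"
q2 = resolve_elliptical("germany", "country", ctx)   # "How's Germany looking?"
```

Note how the second question silently inherits the first question’s constraint: that accumulation is exactly what a single-shot query interface cannot do.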

Intent labels: a practical tool that collides with how people express needs

For years, many conversational stacks framed “understanding” as intent classification. You define categories, collect examples, train a model, and map each intent to a path.

That structure works best when:

  • the category set stays stable
  • users phrase requests in predictable ways
  • the business agrees on where each request belongs

Real conversations violate all three.

Alan described a failure mode we’ve seen repeatedly. A user’s request can be perfectly clear in their reality, yet still fail inside a system built around rigid categories.

A customer says, “You messed up my order.”

From the customer’s perspective, that statement makes complete sense. From the enterprise perspective, it can route to shipping issues, incorrect items, billing problems, refunds, order tracking, fraud checks, or policy exceptions. The ambiguity lives inside the organization, not in the sentence.
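A toy illustration makes the point concrete. The categories and keywords below are invented for demonstration, but the effect is real: “You messed up my order” plausibly matches several internal routes at once, and nothing in the sentence itself picks one.

```python
# Hypothetical routing categories: the utterance is clear to the
# customer, yet overlaps multiple internal buckets.

ROUTES = {
    "shipping_issue":  {"late", "lost", "shipping", "delivery"},
    "incorrect_item":  {"wrong", "messed", "order", "item"},
    "billing_problem": {"charged", "billing", "invoice", "messed"},
    "refund_request":  {"refund", "money", "back", "order"},
}


def candidate_routes(utterance: str) -> list[str]:
    """Return every category whose keywords overlap the utterance."""
    words = set(utterance.lower().replace(".", "").split())
    return sorted(r for r, kws in ROUTES.items() if words & kws)


matches = candidate_routes("You messed up my order")
# Three of the four categories match; the ambiguity is organizational.
```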

This tension shaped much of the early work we did in language understanding. Rasa NLU gave teams a way to build and control their own language pipelines at a time when most alternatives lived behind third-party APIs. Before the GPT era, teams used it in production across well over a hundred languages, proving that customizable NLU could work at scale.

Even as that approach succeeded, Alan was already clear that it could not be the final answer. Classification helps structure language, but conversations depend on context, follow-up, and shared understanding over time. Improving training data can smooth the edges, but it rarely resolves the deeper mismatch between how users describe problems and how businesses internally partition them.

That recognition pushed us to keep looking ahead. 

We were among the early teams experimenting with BERT-style models for dialogue understanding, contributing to work that treated meaning as contextual rather than categorical. The direction stayed consistent. Build practical language systems enterprises can trust, while moving closer to how people actually communicate across turns.

“Disambiguation” often means “help the user translate into our world”

A lot of conversational design work gets labeled as disambiguation. The system asks questions to narrow down which path applies.

That label hides an important linguistic reality: users do not walk in thinking, “I will now select a category.” They state what happened, what they want, and what they’re worried about. They also revise, backtrack, and add constraints midstream.

So the system has to do two things at once:

  • interpret the user’s language in context
  • guide the user into a resolution path that the business can execute safely

That second part is not language understanding. It’s conversation management: asking clarifying questions, confirming interpretations, repairing misunderstandings, and keeping shared context stable as the user changes direction.
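The split between the two jobs can be sketched as follows. This is an illustrative structure, not a real Rasa API: the interpreter only extracts what the user said, while the manager owns the resolution path, reading explicit state to decide what to do next.

```python
# A minimal sketch of keeping interpretation and conversation
# management separate. Function names are illustrative assumptions.

def interpret(utterance: str) -> dict:
    """Language layer: extract meaning, make no business decisions."""
    return {
        "text": utterance,
        "mentions_refund": "refund" in utterance.lower(),
    }


def manage(interpretation: dict, state: dict) -> str:
    """Conversation layer: pick the next action from explicit state."""
    if not state.get("order_id"):
        return "ask_order_id"      # cannot proceed safely yet
    if interpretation["mentions_refund"] and not state.get("confirmed"):
        return "confirm_refund"    # confirm before acting
    return "route_to_resolution"


state = {"order_id": None}
action = manage(interpret("I want a refund"), state)
```

Because the manager reads typed state rather than raw text, its behavior can be tested and audited independently of the language layer.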

The hard part comes when a system tries to push everything through a single classification step and then relies on brittle logic. Every new category changes the boundaries of existing categories. Every boundary change creates new edge cases. 

Over time, the system accumulates complexity that looks like progress, right up until one update triggers a regression across unrelated paths.

Confidence scores taught teams the wrong lessons

If you’ve built with supervised intent models, you’ve seen the confidence score problem. A system reports “80% confidence,” stakeholders interpret that as “pretty accurate,” and teams start managing quality by chasing higher numbers.

The number often measures the model’s internal preference among known labels, not the truth of the interpretation. Different vendors calculate it differently. Retraining can raise confidence without improving user outcomes. Teams end up debating thresholds instead of fixing the experience.
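A small example shows why the number can mislead. Softmax scores are a preference among known labels and always sum to 1, so even an out-of-scope utterance, where every logit is low and nearly flat, still hands one label a sizeable score.

```python
# Softmax over classifier logits: the scores always sum to 1, so some
# label "wins" even when the model has no good option.

import math


def softmax(logits: list[float]) -> list[float]:
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical out-of-scope input: all logits low and nearly flat,
# yet the top label still collects well over a third of the mass.
out_of_scope_logits = [0.4, 0.1, -0.2]
scores = softmax(out_of_scope_logits)
top = max(scores)
```

The score says which known label the model prefers, not whether any of them fits; that is the gap between “80% confidence” and “80% accurate.”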

A shift that LLMs pushed into the mainstream is that people can describe desired behavior directly in language. That brings its own risks, but it avoids one form of self-deception: laundering a product team’s assumptions through synthetic training data and then treating the model’s score as ground truth.

Why chaining model output into more model calls creates linguistic drift

Alan called out a pattern that shows up in a lot of modern “agent” designs: repeated LLM calls where each step consumes the previous step’s output.

From a language perspective, this invites drift:

  • each generation can introduce subtle reinterpretations
  • slight phrasing changes can cause downstream steps to take different paths
  • small noise compounds over a multi-step plan

When the system starts making decisions based on its own prior text rather than grounded state, it becomes difficult to debug. You end up asking: which step introduced the shift, and why did the system interpret its own output that way?

This ties back to a foundational principle of conversation design: language is a poor storage medium for state. Human dialogue works because both sides share context plus a world model, and they continuously repair misunderstandings. Software systems need explicit state and inspectable logic if you want reliability at enterprise scale.
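The contrast can be sketched directly. In the first pattern each step re-parses its predecessor’s prose, so a small rewording can flip downstream behavior; in the second, decisions read typed fields that can be inspected and diffed between steps. Function and field names here are illustrative assumptions.

```python
# Text-chained vs. state-chained steps: only the second leaves an
# inspectable trail of what each step actually decided.

def step_with_text(prior_text: str) -> str:
    """Text-chained: the next step must re-interpret generated prose."""
    # A tiny rewording here can change what downstream steps do.
    return f"Based on the note '{prior_text}', proceed with the refund."


def step_with_state(state: dict) -> dict:
    """State-chained: decisions read typed fields, not generated prose."""
    new_state = dict(state)  # keep prior state immutable for debugging
    if state["issue"] == "wrong_item" and state["within_policy"]:
        new_state["next_action"] = "issue_refund"
    return new_state


state = {"issue": "wrong_item", "within_policy": True}
result = step_with_state(state)
```

When something goes wrong in the state-chained version, you can diff `state` against `result` and see exactly which step changed which field, instead of asking why the system reinterpreted its own sentence.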

What stays constant even as the tech cycle swings

We’ve watched this space move from NLU APIs, to intent pipelines, to end-to-end neural models, to LLM-first stacks. A lot changed quickly, but some constraints did not. Across teams handling high-volume interactions, a few linguistic realities keep showing up:

  • users speak with implied context, shortcuts, and mid-course corrections
  • an organization’s internal structure rarely matches how users describe problems
  • “understanding” depends on the system’s ability to ask the right follow-up question at the right time
  • quality work comes from reviewing real conversations and building intuition about where users get stuck

Plenty of tooling can generate a polished demo in minutes. The work that moves business metrics still lives in the same place: the conversations themselves.

A practical takeaway: Audit your taxonomy like a linguist, not like a ticketing system

If you’re building or rebuilding an enterprise agent, try this approach when you review transcripts:

  • look for repeated user phrasings that do not map cleanly to your internal categories
  • identify where the user’s words express an experience (“my order is wrong”) while your system expects a process selection (“refund” vs “replacement”)
  • track where users ask elliptical questions (“how’s X?”) that require state, time windows, and a shared frame
  • separate language interpretation problems from conversation management problems, then address them differently
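The checklist above can become a simple script. The heuristics, category keywords, and field names below are assumptions for illustration: flag elliptical questions and utterances that match no internal category, so a reviewer knows which transcripts to read first.

```python
# A starting point for a linguist-style transcript audit. The
# keyword lists and the ellipsis pattern are illustrative assumptions.

import re

CATEGORY_KEYWORDS = {
    "refund": {"refund", "money back"},
    "replacement": {"replace", "replacement", "new one"},
}

# Elliptical questions of the form "How's X?" / "How's X looking?"
ELLIPTICAL = re.compile(r"^how'?s\s+\S+(\s+looking)?\??$", re.IGNORECASE)


def audit(utterances: list[str]) -> dict:
    """Flag utterances that need state or don't map to a category."""
    flags = {"elliptical": [], "unmapped": []}
    for u in utterances:
        if ELLIPTICAL.match(u.strip()):
            flags["elliptical"].append(u)
        lowered = u.lower()
        if not any(kw in lowered
                   for kws in CATEGORY_KEYWORDS.values() for kw in kws):
            flags["unmapped"].append(u)
    return flags


report = audit([
    "How's Germany looking?",
    "My order is wrong",
    "I want a refund",
])
```

Running something like this over a week of transcripts gives you a ranked reading list: the elliptical bucket shows where you need state, and the unmapped bucket shows where your taxonomy and your users’ language diverge.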

That separation is one reason we built Rasa’s dialogue engine the way we did: treat language as language and business logic as business logic. When those layers blur, teams end up debugging English instead of debugging software.

Try it yourself

If you want to see how we think about these problems in practice, try our guided starter experience, Hello Rasa. It’s designed to make system behavior inspectable as you build, so you can connect conversational choices back to real outcomes without guessing what the agent “meant.”
