Get the app
Back
AI • 05/21/26

Introducing Cleo’s custom router, trained on our own data

It’s part of a larger system that keeps message classification accurate as user behavior shifts and new agents come online.

We’re launching a new router for Cleo, a custom model built using our own conversational data. Cleo’s conversational layer is a collection of specialized agents, each handling a slice of what users talk to Cleo about. The router is the component that decides which specialist agent fields a given message, quickly enough that users don’t notice the extra step.

The router we’re shipping today is an encoder-only model trained on Cleo’s conversational traffic. It runs classification roughly 16× faster than the LLM-based router it replaces and beats GPT-5.4-nano on accuracy against our internal benchmarks. Training directly on real Cleo conversations is what gives the in-house model its edge. The classifier learns to recognize the specific kinds of messages users send to Cleo, which a general-purpose model can’t do.

How Cleo picks the right agent

Cleo’s router is a classification layer. Incoming messages pass through it, and the router decides which specialist agent should handle the reply before generation happens. Speed is a design constraint, as any latency the router adds directly affects how long a user waits for Cleo’s response.

Each routing decision is made before an agent starts replying, so switches between specialists don’t interrupt a response in progress. The system also assumes, correctly, that any classifier will occasionally be wrong. When a specialist receives a message better suited for a different specialist, it can call a handoff tool to pass the message to the correct agent, so an incorrect initial classification doesn’t force the user to restart the conversation.

The version of the router running on GPT-5.4-nano had a latency of roughly 800 ms per message. The replacement is a custom encoder-only model, trained on data from Cleo’s own conversational traffic and shaped around our agent taxonomy. It runs at around 50 ms, a 16× latency reduction.

Bar chart titled "Router latency: 16x reduction." The GPT-5.4-nano bar reaches roughly 800 ms; the in-house encoder bar reaches roughly 50 ms.

Comparison: GPT-5.4-nano vs. in-house encoder model.

Measuring router accuracy well requires more than one offline benchmark. We evaluate against two labeled data sets that answer different questions:

  1. The first is a human-annotated set of tool call scenarios where labels are direct: For a given message, did the router send it to the agent whose tools match the request? On that data set, accuracy is around 94%.

  2. The second set is broader, drawn from conversations that human reviewers have flagged as high quality. The labels here are less direct; we use overall conversation quality as a proxy for tool call accuracy. That gives us a noisier signal, but one we can apply to far more conversations. Accuracy on that data set is approximately 92%.

Most of the errors on the broader data set are not misrouting per se. Rather, they’re cases where the router made a defensible call (the conversation was clearly inside one domain), but the user’s most recent message introduced something that domain’s agent wasn’t equipped to handle. In such cases, the fix is usually to widen the agent’s tool scope.

How we keep improving the system

Four pieces of infrastructure let us keep upgrading Cleo’s router as models, architecture, and user behavior change.

Traffic mirroring

The primary router serves live traffic. A shadow model runs in parallel against the same conversations, making routing decisions the user never sees. Comparing the two streams shows us exactly where the primary model makes mistakes against the shadow’s predictions, which feeds back into the training data for the next round.

Running against user traffic enables us to respond to shifts in the mix of user messages over time. A benchmark built in January can’t always tell us whether the router is making good decisions on the conversations people are having with Cleo in April, for example, due to changes like a new feature or shifts in user queries. Traffic mirroring anchors our evaluations in real production traffic rather than a static snapshot.

Flow diagram of traffic mirroring. A user message is routed two ways: the primary router serves live traffic and returns the response to the user, while the same message is mirrored to a shadow model running in parallel with no user impact. The two routers' decisions are compared, and disagreements feed into a dataset that trains the next router version.

How we use traffic mirroring to train the next router.

Automated prompt optimization

As a manual loop, improving an agent’s prompt requires an engineer to look at failure cases, rewrite part of the prompt, deploy it, wait, look at the new failure cases, and repeat. To automate that loop, we feed in the current prompt, a data set of real conversations and their outputs, and human-annotated (and sometimes LLM-judged) evaluations of those conversations. 

From that input, we can generate candidate prompt revisions aimed at the specific failure modes the evaluation data exposes. Engineers then review and test the proposed revisions to decide what ships. Template variables remain intact throughout the optimization, so the process changes how the agent reasons within its domain without altering what data or tools it can access.

The evaluation pipeline

Two layers of evaluation run in parallel. Offline evaluation focuses on tool selection, or whether the router picked the right specialist for a given input. It runs against the human-annotated tool call data set and the broader reviewer-flagged set. Online evaluation runs during rollout. Both human reviewers and LLM judges annotate a sample of real conversations, flagging drift and edge cases missed by offline benchmarks.

Phased rollouts

Major changes, such as bringing a new specialist agent online or migrating to a new underlying model, go through a two-phase test before reaching all users. The first phase is an A/A test. The updated system is deployed with prompts and tools identical to the current one to isolate the impact of any infrastructure changes on engagement and conversion. Only once that passes does the A/B test go live with the actual behavioral change. Both phases run as gradual rollouts, monitoring for engagement, conversion, and conversation quality.

This is the pattern we’re using to migrate Cleo’s conversational agents from GPT-4.1 to GPT-5.4. Prompt wording, tool schemas, and parameter settings can all behave differently under a new model and have to be retuned before live traffic routes through the updated stack. We run offline optimization across configurations, identify the combinations that perform well, and A/B test the winners against the current production system.

What’s next

We’re working on several future directions that build on this foundation:

  • Complexity-based routing. Today, the router picks a specialist agent based on message topic. A natural extension is to choose a specialist based on both topic and complexity, so lighter models handle simple requests and reasoning models handle more involved ones.

  • In-house small language models. In addition to our in-house encoder, we’re building other Cleo-specific models for routing and for some specialist tasks. By building small models internally, we can optimize directly for the distribution of conversations Cleo actually sees, rather than relying on general-purpose models.

  • Tighter feedback loops. Traffic mirroring already produces the data used to train the next router version. The next step is to tighten that loop so that less time elapses between when the shadow model flags a mistake and when that data reaches the next training run.