
How I shipped aRTi AI Roleplay end-to-end at Rising Team

The architecture, evaluation harness, and latency tuning behind the LLM-powered practice tool that drove +29% AI adoption growth at Rising Team.

When we set out to ship aRTi AI Roleplay at Rising Team, the brief was simple to state and hard to deliver: let any manager rehearse a hard conversation, on demand, with an AI partner that knows their team.

This is the engineering story behind that product. The numbers, in case you're skimming: +8% AI feature usage in the first 24 hours, +29% adoption growth since launch, and a key contributor to platform-wide outcomes — up to 33% retention lift, 60–200% eNPS gain, and 9-out-of-10 likelihood-to-recommend across customer cohorts.

What I optimized for

Three constraints drove every architecture choice:

  1. Realism without cruelty. The AI partner has to push back like a real person — but never go off the rails. A bad turn here breaks trust permanently.
  2. Latency you forget about. Voice mode dies above ~800 ms time to first token (TTFT). Text mode dies above ~1.5 s. The architecture had to keep both well below those limits.
  3. Cost-per-session below a coach. Rising Team's whole pitch is "leadership development at less than 5% the cost of traditional coaches." The AI bill couldn't break that.

The architecture

The runtime path looks like this:

client → session orchestrator → retrieval (Pinecone) →
  LLM partner (OpenAI / Anthropic / Gemini) →
  evaluator pass (Anthropic) →
  voice (ElevenLabs, optional) →
  client
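
As a rough illustration of that path, here is a minimal sketch of one turn. Everything in it is an assumption made for the example: the `ROUTES` table, the function names, and the keyword check standing in for the evaluator are hypothetical, and the provider calls are stubbed rather than real SDK or LangChain calls.

```python
from dataclasses import dataclass

# Hypothetical routing table: flow name -> provider label.
ROUTES = {"roleplay": "openai", "reflection": "anthropic", "budget": "gemini"}


@dataclass
class Turn:
    partner_text: str
    provider: str
    passed_eval: bool


def retrieve_team_context(team_id: str) -> str:
    """Stand-in for the vector-store lookup (Pinecone in the post)."""
    return f"[context for team {team_id}]"


def call_partner(provider: str, context: str, manager_text: str) -> str:
    """Stand-in for the partner LLM call; returns canned pushback."""
    return f"({provider}) I hear you, but I disagree about: {manager_text}"


def evaluate_partner(partner_text: str) -> bool:
    """Stand-in for the evaluator pass. It judges the partner's text only."""
    banned = ("idiot", "stupid")  # toy check, not a real safety model
    return not any(word in partner_text.lower() for word in banned)


def run_turn(team_id: str, flow: str, manager_text: str) -> Turn:
    """One turn: route, retrieve, generate, then evaluate the partner."""
    provider = ROUTES.get(flow, "openai")
    context = retrieve_team_context(team_id)
    partner_text = call_partner(provider, context, manager_text)
    return Turn(partner_text, provider, evaluate_partner(partner_text))


if __name__ == "__main__":
    print(run_turn("team-42", "budget", "the missed deadline"))
```

The voice step is left out here because it is optional in the path above.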

A few things worth calling out:

  • Two LLM passes per turn. One generates the partner's response. A second, smaller model evaluates that response (the partner's, not the manager's). An unchecked partner response is where most "supportive but wrong" AI feedback comes from in this domain.
  • Multi-model orchestration. OpenAI for some flows, Anthropic for reflection and feedback, Gemini for cost-sensitive paths. LangChain holds the routing.
  • Pinecone retrieval narrows the prompt — it doesn't replace it. Team context grounds the partner; the system prompt still has to encode the kind of pushback that's productive vs. cruel.
  • Cache-aware prompts. The first ~2k tokens of the system prompt are stable across all of a manager's sessions. Wrapping that in prompt caching cuts per-turn cost dramatically.
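
The cache-aware point comes down to prompt layout: prompt caching generally only helps when the leading part of the prompt is byte-identical between calls. A minimal sketch, assuming a hypothetical `STABLE_PREFIX` and a `build_prompt` helper that keeps everything volatile (team context, conversation history) after the stable part. The prefix text is invented for the example, and no provider caching API is called.

```python
import hashlib

# Hypothetical stable instructions. In practice this would be the ~2k-token
# part of the system prompt that does not change between sessions.
STABLE_PREFIX = (
    "You are a roleplay partner for a manager rehearsing a hard conversation.\n"
    "Push back like a real person. Never insult or demean.\n"
)


def build_prompt(team_context: str, history: list[str]) -> tuple[str, str]:
    """Return (cacheable_prefix, volatile_suffix).

    Anything that varies per turn goes in the suffix so the prefix stays
    identical and therefore cacheable.
    """
    suffix = (
        "Team context:\n" + team_context
        + "\n\nConversation:\n" + "\n".join(history)
    )
    return STABLE_PREFIX, suffix


def prefix_fingerprint(prefix: str) -> str:
    """Short hash used to check that the prefix did not drift."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    p1, _ = build_prompt("Team A notes", ["Manager: hi"])
    p2, _ = build_prompt("Team B notes", ["Manager: hello", "Partner: hey"])
    print(prefix_fingerprint(p1) == prefix_fingerprint(p2))
```

A fingerprint check like this can run in CI to catch an edit that quietly puts a timestamp or user name into the cached region.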

The evaluation harness

We learned early that single-LLM eval ("did the partner respond well?") under-detects bad turns. So we built a deliberately adversarial harness:

  • A library of stress prompts designed to lure the partner into being too soft, too hostile, or going off-topic.
  • A panel of three different model families (Claude, GPT, Gemini) scoring each response on tone, realism, and safety.
  • A regression set that replays the worst real-world turns from production. Any model swap or prompt change has to clear it.
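
One way to make a panel like this "mean" is to aggregate pessimistically: take the lowest score any judge gives on each dimension, and fail the turn if any dimension falls below a threshold. The sketch below assumes a 1 to 5 scale and a threshold of 3. The scale, the threshold, the judge labels, and the sample scores are all illustrative assumptions, not details from the actual harness.

```python
DIMENSIONS = ("tone", "realism", "safety")
PASS_THRESHOLD = 3  # hypothetical cutoff on an assumed 1-5 scale

# Type alias: judge name -> {dimension -> score}
PanelScores = dict[str, dict[str, int]]


def aggregate(panel: PanelScores) -> dict[str, int]:
    """Pessimistic aggregation: the lowest score per dimension wins."""
    return {d: min(scores[d] for scores in panel.values()) for d in DIMENSIONS}


def passes(panel: PanelScores) -> bool:
    """A response passes only if every dimension clears the threshold."""
    return all(score >= PASS_THRESHOLD for score in aggregate(panel).values())


def clears_regression(results: list[PanelScores]) -> bool:
    """A prompt or model change must pass every replayed turn."""
    return all(passes(panel) for panel in results)


if __name__ == "__main__":
    good = {
        "claude": {"tone": 4, "realism": 4, "safety": 5},
        "gpt": {"tone": 4, "realism": 3, "safety": 5},
        "gemini": {"tone": 5, "realism": 4, "safety": 4},
    }
    # One judge flags safety, so the whole turn fails.
    bad = {**good, "gpt": {"tone": 4, "realism": 3, "safety": 2}}
    print(passes(good), passes(bad), clears_regression([good, bad]))
```

Taking the minimum rather than the mean means a single dissenting judge can block a change, which is the point of using different model families.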

Most production "AI evals" are too kind to their own product. Make yours mean.

The latency story

Voice mode tripled session length when we got it right, but only when TTFT stayed under ~800 ms. On day 1, the per-stage audit looked something like this:

  • Client → orchestrator: ~30 ms
  • Retrieval: ~120 ms
  • LLM generation: ~600–1100 ms
  • Evaluator pass: ~250 ms
  • ElevenLabs TTS: ~400–700 ms
  • Total worst case: ~2.2 s. Way too slow.
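
The arithmetic behind that total, as a small sketch. It assumes the stages run strictly one after another, which is what a worst-case sum implies. The stage names are labels chosen for the example, and the numbers are the ones from the audit above.

```python
# (best_ms, worst_ms) per stage, taken from the day-1 audit.
STAGES_MS = {
    "client_to_orchestrator": (30, 30),
    "retrieval": (120, 120),
    "llm_generation": (600, 1100),
    "evaluator_pass": (250, 250),
    "tts": (400, 700),
}

VOICE_BUDGET_MS = 800


def sequential_total(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best and worst cases, assuming no stage overlaps another."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


if __name__ == "__main__":
    best, worst = sequential_total(STAGES_MS)
    print(f"best {best} ms, worst {worst} ms, budget {VOICE_BUDGET_MS} ms")
```

Even the best case of a fully sequential pipeline comes to 1,400 ms, so shaving individual stages could not reach the budget. The stages have to overlap.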

What we changed:

  1. Streamed everything. Generation streams to TTS, TTS streams to the client, the client plays as bytes arrive.
  2. Ran the evaluator pass concurrently with TTS. By the time the manager finishes hearing a turn, that turn's safety check is already done.
  3. Pre-warmed the partner model with a short keep-alive at session start.
  4. Cached retrieval by team. Most manager sessions reuse the same retrieved team context for the whole session.
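
The second change is the least obvious one, so here is a minimal `asyncio` sketch of running the evaluator alongside audio streaming instead of before it. The function names, the sleeps standing in for network calls, and the keyword check are all assumptions for illustration. Nothing here calls ElevenLabs or a real evaluator model.

```python
import asyncio

events: list[str] = []  # records ordering so the overlap is visible


async def stream_tts(text: str) -> None:
    """Pretend to stream one audio chunk per word to the client."""
    events.append("tts:start")
    for _word in text.split():
        await asyncio.sleep(0.01)  # stand-in for sending one chunk
    events.append("tts:done")


async def evaluate(text: str) -> bool:
    """Pretend evaluator pass over the partner's text."""
    events.append("eval:start")
    await asyncio.sleep(0.005)  # stand-in for the evaluator model call
    events.append("eval:done")
    return "idiot" not in text.lower()  # toy check, not a real safety model


async def deliver_turn(text: str) -> bool:
    """Run TTS streaming and the evaluator concurrently; return the verdict."""
    _, ok = await asyncio.gather(stream_tts(text), evaluate(text))
    return ok


if __name__ == "__main__":
    verdict = asyncio.run(deliver_turn("I hear you but I disagree"))
    print(verdict, events)
```

The trade-off in this layout is that audio can start before the verdict is in, so a real system needs a policy for what happens when the evaluator fails mid-playback. The sketch does not model that.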

We landed under 800 ms TTFT in voice mode for ~95% of turns.

What I'd do differently

  • Start with the eval harness. We back-built ours after a couple of bad weeks. Should have been week-one work.
  • Build the cost dashboard before the feature. Per-session cost crept up before we noticed; a real-time dashboard would have flagged it the first day.
  • Don't ship voice and text simultaneously. Voice has a different UX bar. Text first, voice when you have latency budget.

What's next

Live AI Coaching, on-demand session summaries, and continued integration with the rest of aRTi's growth-plan loop. Plus a lot more time spent with the eval harness.

If you're working on AI products that managers actually use, reach out — happy to compare notes.