When we first demoed KPIcons to a prospect last fall, the VP of Sales asked a question I didn't expect: "how fast is the nudge?" I gave the founder-honest answer — "under two seconds, usually" — and he said, "if you can prove it, we'll buy." That was the moment I realized latency was a contract, not a metric.
Here's what we had to build to make "1.8 seconds p50" something we could write into an enterprise SLA and prove with a signed receipt, while sleeping at night knowing it would still be true at 10,000× scale.
The latency budget
Let's start with the target. A nudge is the chain: signal event → rule evaluation → nudge generation → channel delivery → receipt. We gave ourselves 2 seconds end-to-end, broken down like this:
- Signal capture — 300 ms
- Rule evaluation — 200 ms
- Nudge generation (LLM) — 900 ms
- Channel delivery (Slack / SFDC widget / Gmail) — 400 ms
- Signed receipt persistence — 200 ms
Total: 2 seconds. p50 target: 1.8s. p95: under 3s. Any nudge slower than 2s triggers an automatic alert, is retried once, and is logged in the IT Admin audit trail.
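To make the arithmetic concrete, here is a minimal Python sketch of the budget as data plus a per-nudge check. The stage keys and the `check_budget` helper are illustrative names, not our production code; the numbers are the ones from the list above.

```python
# Stage budgets in milliseconds, copied from the breakdown above.
STAGE_BUDGET_MS = {
    "signal_capture": 300,
    "rule_evaluation": 200,
    "nudge_generation": 900,
    "channel_delivery": 400,
    "receipt_persistence": 200,
}
END_TO_END_BUDGET_MS = sum(STAGE_BUDGET_MS.values())  # 2000


def check_budget(stage_timings_ms):
    """Return (total_ms, blown_stages, over_budget) for one nudge.

    stage_timings_ms maps stage name -> measured milliseconds.
    """
    total = sum(stage_timings_ms.values())
    blown = sorted(
        stage
        for stage, ms in stage_timings_ms.items()
        if ms > STAGE_BUDGET_MS[stage]
    )
    return total, blown, total > END_TO_END_BUDGET_MS
```

Note that a single stage can blow its own budget while the nudge as a whole still lands under 2 seconds, which is why the sketch reports both.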
Step 1: Signal capture via logical replication
The naive way to detect a pipeline dip is to poll. That'd be the wrong answer: polling adds up to one full polling interval of delay before the rule engine even sees the change. We use Postgres logical replication (wal2json) to emit every KPI-relevant DB change as an event within milliseconds of the write committing.
```sql
CREATE PUBLICATION kpi_events FOR TABLE deals, activities, kpi_history;
-- Debezium connector consumes WAL, emits Kafka events per change.
-- Downstream: rule-engine consumer filters by tenant + KPI spec.
```
Because logical replication is streaming, the signal arrives at the rule engine in 80–250 ms. We budgeted 300 ms to be safe.
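The downstream filter mentioned in the SQL comments could look roughly like the Python sketch below. It assumes a Debezium-style change event with `op`, `source.table`, `before`, and `after` fields. The `tenant_id` column and the `watched_columns` argument are hypothetical stand-ins for "tenant + KPI spec".

```python
KPI_TABLES = {"deals", "activities", "kpi_history"}  # tables in the publication


def is_relevant(event, tenant_id, watched_columns):
    """Decide whether a change event should reach the rule engine.

    Assumes a Debezium-style payload ('op', 'source.table', 'before', 'after').
    'tenant_id' as a row column is a hypothetical schema detail.
    """
    if event.get("source", {}).get("table") not in KPI_TABLES:
        return False
    row = event.get("after") or event.get("before") or {}
    if row.get("tenant_id") != tenant_id:
        return False
    if event.get("op") != "u":
        return True  # inserts and deletes always pass through
    before = event.get("before") or {}
    after = event.get("after") or {}
    # For updates, only pass the event on if a KPI-relevant column changed.
    # If 'before' is missing, every watched column counts as changed.
    return any(before.get(col) != after.get(col) for col in watched_columns)
```

Dropping irrelevant updates this early keeps the rule engine from spending its 200 ms on edits that cannot move a KPI.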
Step 2: Rule evaluation at the edge
The rule engine runs on Cloudflare Workers — as close to the user's region as physically possible. Each tenant's rule set (delegation rules, KPI thresholds, escalation paths) is distributed to the edge via Workers KV and refreshed on change via Durable Objects.
Why not Lambda? Cold starts. A Worker is warm in <1 ms; a Lambda can be 100+ ms on first invocation. Over 40 million events/day, that difference compounds.
Delegation rules as compiled predicates
Manager-authored rules ("when pipeline < $400K, ping Sarah…") are compiled once at save time into a predicate tree. Evaluation is O(log n) in the number of rules n, at up to several thousand rules per tenant. Each rule that fires emits a nudge intent to the generation layer.
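One way to get logarithmic lookup for threshold rules like the one quoted above is to keep each metric's thresholds sorted and binary-search them. The Python sketch below shows that idea only; it is not our predicate-tree compiler, and `ThresholdIndex` and the rule ids are hypothetical names.

```python
from bisect import bisect_right


class ThresholdIndex:
    """Index of rules shaped like 'fire when <metric> < <threshold>'."""

    def __init__(self):
        self._thresholds = {}  # metric -> sorted list of thresholds
        self._rule_ids = {}    # metric -> rule ids, same order as thresholds

    def add_rule(self, metric, threshold, rule_id):
        """Insert at save time, keeping both lists sorted by threshold."""
        thresholds = self._thresholds.setdefault(metric, [])
        rule_ids = self._rule_ids.setdefault(metric, [])
        i = bisect_right(thresholds, threshold)
        thresholds.insert(i, threshold)
        rule_ids.insert(i, rule_id)

    def fired(self, metric, value):
        """Return ids of rules whose threshold is strictly above value."""
        thresholds = self._thresholds.get(metric, [])
        # Binary search finds the first threshold > value in O(log n).
        start = bisect_right(thresholds, value)
        return self._rule_ids.get(metric, [])[start:]
```

Locating the cut point is O(log n); listing the k rules that fired adds O(k), which is unavoidable if all of them need a nudge intent.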
Step 3: LLM inference with two tricks
The hardest part of the budget is the LLM call — 900 ms for a persona-tuned nudge. Two tricks make this fit:
Trick 1: Templates with slot-fill
Most nudges don't need a full LLM call. Each of the 6 base personas has ~40 pre-authored nudge templates per KPI category. The LLM picks the best template (a 100-ms classifier call), then fills in deal-specific slots (account name, ACV, stage). Total: 300 ms median.
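The slot-fill half of this trick is simple enough to sketch in Python with the standard library's `string.Template`. The template texts, intent names, and the `pick_template` stand-in (a dictionary lookup where the real system makes a classifier call) are all made up for illustration.

```python
from string import Template

# Hypothetical templates; the real library has ~40 per KPI category.
TEMPLATES = {
    "stalled_deal": Template(
        "$account has sat in $stage for a while. "
        "Worth a check-in on the $acv deal?"
    ),
    "pipeline_dip": Template(
        "Pipeline dipped. $account ($acv, $stage) looks like the fastest one to move."
    ),
}


def pick_template(intent):
    """Stand-in for the classifier call: a plain dictionary lookup."""
    return TEMPLATES[intent]


def render_nudge(intent, slots):
    """Fill deal-specific slots into the chosen template."""
    # substitute() raises KeyError on a missing slot, so a half-filled
    # nudge never reaches the rep.
    return pick_template(intent).substitute(slots)
```

Failing loudly on a missing slot is a deliberate choice in this sketch: a nudge that reads "$account has sat in..." is worse than no nudge.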
Trick 2: Distilled persona models run regional
For latency-critical paths, distilled student models (7B-parameter range) run on dedicated GPU pods in the same region as the tenant. Zero cross-region hops, zero vendor-API variance. The trade-off is a slightly narrower vocabulary than frontier models — but for coaching nudges that's not the constraint.
Step 4: Channel delivery
Each channel has its own SLA quirks:
- Slack — Block Kit interactive message; Slack guarantees < 3s delivery. We target < 400 ms to Slack's API.
- Salesforce Lightning widget — WebSocket push over the Streaming API. Latency: 200–350 ms.
- Gmail add-on sidebar — Google Workspace quota-limited. Our p50 here is 500 ms; p95 sometimes hits 1s, which eats the latency budget.
If Gmail is projected to miss, the router automatically falls back to in-product notification and flags the receipt as "channel-degraded."
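A sketch of that fallback decision in Python is below. It assumes "projected to miss" means the time already spent, plus the projected delivery time, plus the receipt budget would exceed the 2-second deadline. That reading, the `route` function, and its parameter names are assumptions for illustration.

```python
def route(channel, elapsed_ms, projected_delivery_ms,
          receipt_ms=200, deadline_ms=2000):
    """Pick the delivery channel and the flag to stamp on the receipt.

    elapsed_ms: time already spent on capture, rules, and generation.
    projected_delivery_ms: estimated delivery latency for `channel`;
    how that estimate is produced is out of scope for this sketch.
    """
    projected_total = elapsed_ms + projected_delivery_ms + receipt_ms
    if channel == "gmail" and projected_total > deadline_ms:
        # Fall back and record why, so the receipt stays honest.
        return {"channel": "in_product", "flag": "channel-degraded"}
    return {"channel": channel, "flag": None}
```

For example, `route("gmail", 1000, 1000)` projects 2,200 ms and falls back, while `route("gmail", 900, 500)` projects 1,600 ms and stays on Gmail.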
Step 5: Signed receipts
Every completed nudge is appended to an Ed25519-signed ledger. The ledger is a Merkle-style hash chain: each receipt references the hash of the previous one, so the whole sequence is tamper-evident. Signing takes 0.8 ms. Persisting to the tenant ledger takes 180 ms to commit, with asynchronous replication for durability.
The key rotates every 90 days. Old receipts stay verifiable because we publish the public-key history with every audit export.
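The chaining and key-history ideas can be sketched with only the Python standard library. One loud caveat: the standard library has no Ed25519, so this sketch swaps in HMAC-SHA256 as a placeholder signature. Verification here therefore needs the secret key, whereas a real Ed25519 ledger is verified with published public keys. Field names and helper names are hypothetical.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # 'previous hash' for the first receipt


def _digest(body):
    # Canonical JSON so the same receipt always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def _sign(key, receipt_hash):
    # HMAC-SHA256 stands in for the Ed25519 signature in this sketch.
    return hmac.new(key, receipt_hash.encode(), hashlib.sha256).hexdigest()


def append_receipt(ledger, payload, key_id, key):
    """Append a receipt that commits to the previous receipt's hash."""
    body = {
        "prev": ledger[-1]["hash"] if ledger else GENESIS,
        "payload": payload,
        "key_id": key_id,
    }
    receipt_hash = _digest(body)
    ledger.append({**body, "hash": receipt_hash, "sig": _sign(key, receipt_hash)})


def verify_ledger(ledger, key_history):
    """Walk the chain; key_history maps key_id -> key, including rotated keys."""
    prev = GENESIS
    for r in ledger:
        body = {"prev": r["prev"], "payload": r["payload"], "key_id": r["key_id"]}
        if r["prev"] != prev or _digest(body) != r["hash"]:
            return False  # broken link or altered receipt
        expected = _sign(key_history[r["key_id"]], r["hash"])
        if not hmac.compare_digest(expected, r["sig"]):
            return False  # signature does not match the recorded key
        prev = r["hash"]
    return True
```

Because each receipt records its `key_id`, receipts signed before a rotation still verify as long as the key history shipped with the export includes the old key.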
What we measure, always
We emit four percentiles (p50, p95, p99, p99.9) per tenant per minute, plus a breakdown by stage. When any budget is blown, the on-call SRE gets paged within 15 minutes — and the affected tenant gets a status-page notification automatically.
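For readers who want the percentile math spelled out, here is a small Python sketch using the nearest-rank definition. The choice of nearest-rank over interpolation is an assumption for illustration, and `summarize` is a hypothetical helper, not our metrics pipeline.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # round() guards against float noise such as 999.0000000000001.
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 9)))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the four percentiles for one tenant-minute of latencies."""
    return {f"p{p:g}": percentile(latencies_ms, p) for p in (50, 95, 99, 99.9)}
```

Running `summarize` once per stage as well as on the end-to-end numbers gives the per-stage breakdown. Keep in mind that p99.9 only means something once a tenant-minute has on the order of a thousand samples.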
If you can't measure it, you don't have an SLA. If you can't prove it, you don't have a product.
Where we'd like to improve
Two honest admissions:
- Gmail is our weakest link. Google quotas make sub-400 ms hard. We're piloting a "parallel push" where we fire to both the Gmail add-on and a Slack DM simultaneously, and take whichever lands first. Experimental.
- Cold-start on distilled-persona pods. When a regional persona pod sits idle for hours, the first nudge takes ~2.5s instead of 1s. We keep warm replicas for active tenants, but the economics get expensive at long-tail scale. Open problem.
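For the "parallel push" experiment in the first admission, the race itself can be sketched with Python's `asyncio`. The `fake_send` coroutine is a stand-in for real channel clients, and `first_to_land` is a hypothetical name. The sketch shows the racing logic only, not our router.

```python
import asyncio


async def fake_send(delay_s, result):
    """Stand-in for a channel client: 'delivers' after delay_s seconds."""
    await asyncio.sleep(delay_s)
    return result


async def first_to_land(sends):
    """Run all sends concurrently; return (channel, result) of the first success.

    sends maps channel name -> coroutine that returns once delivery is confirmed.
    """
    tasks = {asyncio.ensure_future(coro): name for name, coro in sends.items()}
    pending = set(tasks)
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return tasks[task], task.result()
        raise RuntimeError("all channels failed")
    finally:
        # Cancel whatever is still in flight once a winner is known.
        for task in pending:
            task.cancel()
```

Usage: `asyncio.run(first_to_land({"gmail": fake_send(0.2, "gmail-ok"), "slack": fake_send(0.01, "slack-ok")}))`. One caveat: cancelling the slower task does not recall a message the provider has already accepted, so a real version would need to tolerate or de-duplicate double delivery.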
Ship, measure, improve, ship again. That's the whole job.