When we first demoed KPIcons to a prospect last fall, the VP of Sales asked a question I didn't expect: "how fast is the nudge?" I gave the founder-honest answer — "under two seconds, usually" — and he said, "if you can prove it, we'll buy." That was the moment I realized latency was a contract, not a metric.

Here's what we had to build to make "1.8 seconds p50" something we could write into an enterprise SLA, prove with a signed receipt, and sleep at night knowing it would still be true at 10,000× scale.

The latency budget

Let's start with the target. A nudge is the chain: signal event → rule evaluation → nudge generation → channel delivery → receipt. We gave ourselves 2 seconds end-to-end, broken down like this:

Total: 2 seconds. p50 target: 1.8s. p95: under 3s. Any nudge slower than 2s triggers an automatic alert, is retried once, and is logged in the IT Admin audit trail.

Step 1: Signal capture via logical replication

The naive way to detect a pipeline dip is to poll, which can add up to a full polling interval of delay before anything fires. Instead, we use Postgres logical replication (wal2json) to emit every KPI-relevant DB change as an event within milliseconds of the write committing.

CREATE PUBLICATION kpi_events FOR TABLE deals, activities, kpi_history;
-- Debezium connector consumes WAL, emits Kafka events per change.
-- Downstream: rule-engine consumer filters by tenant + KPI spec.

Because logical replication is streaming, the signal arrives at the rule engine in 80–250 ms. We budgeted 300 ms to be safe.
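Here's a minimal sketch of the downstream filter step, assuming the standard Debezium change-event envelope (`op`, `before`, `after`, `source.table`, `source.ts_ms`). The `tenant_id` column and the `extract_signal` helper are assumptions for illustration, not our actual schema or consumer code.

```python
from typing import Any, Optional

# Tables named in the publication above.
KPI_TABLES = {"deals", "activities", "kpi_history"}


def extract_signal(event: dict[str, Any], tenant_id: str) -> Optional[dict[str, Any]]:
    """Return a compact signal for the rule engine, or None if the event is irrelevant."""
    # The envelope may or may not be wrapped in a {"schema": ..., "payload": ...} object.
    payload = event.get("payload", event)
    source = payload.get("source", {})
    if source.get("table") not in KPI_TABLES:
        return None
    # Deletes carry only "before"; inserts and updates carry "after".
    row = payload.get("after") or payload.get("before")
    # Assumption: every KPI table has a tenant_id column.
    if row is None or row.get("tenant_id") != tenant_id:
        return None
    return {
        "table": source["table"],
        "op": payload.get("op"),  # "c" create, "u" update, "d" delete, "r" snapshot read
        "row": row,
        "committed_ms": source.get("ts_ms"),  # when the change was made in the database
    }
```

Carrying `committed_ms` forward lets later stages measure elapsed time against the original write rather than against their own clocks.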

Step 2: Rule evaluation at the edge

The rule engine runs on Cloudflare Workers — as close to the user's region as physically possible. Each tenant's rule set (delegation rules, KPI thresholds, escalation paths) is distributed to the edge via Workers KV and refreshed on change via Durable Objects.

Why not Lambda? Cold starts. A Worker is warm in <1 ms; a Lambda cold start can add 100+ ms on first invocation. At 40 million events/day, that difference compounds.

Delegation rules as compiled predicates

Manager-authored rules ("when pipeline < $400K, ping Sarah…") are compiled once at save time into a predicate tree. Evaluation is O(log n) in the number of rules, which holds up to several thousand rules per tenant. Each rule that fires emits a nudge intent to the generation layer.
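To make the O(log n) claim concrete, here's one way a logarithmic lookup can work for the simplest rule shape, `kpi < threshold`. This is an illustrative sketch, not our actual predicate tree: it keeps thresholds sorted at compile time and uses binary search at evaluation time. The `ThresholdIndex` class and the rule IDs are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class ThresholdIndex:
    """Rules of the form `kpi < threshold`, kept sorted by threshold."""
    thresholds: list[float] = field(default_factory=list)
    rule_ids: list[str] = field(default_factory=list)

    @classmethod
    def compile(cls, rules: list[tuple[str, float]]) -> "ThresholdIndex":
        # Done once at save time: sort (rule_id, threshold) pairs by threshold.
        ordered = sorted(rules, key=lambda r: r[1])
        return cls([t for _, t in ordered], [rid for rid, _ in ordered])

    def matching(self, value: float) -> list[str]:
        # A rule fires when value < threshold, i.e. threshold > value.
        # bisect_right returns the index of the first threshold strictly
        # greater than value, found in O(log n).
        start = bisect_right(self.thresholds, value)
        return self.rule_ids[start:]


index = ThresholdIndex.compile([
    ("ping-sarah", 400_000.0),
    ("escalate", 250_000.0),
    ("warn", 600_000.0),
])
fired = index.matching(300_000.0)  # pipeline at $300K
```

Finding the boundary is logarithmic; returning the k rules that fire is an additional O(k). Compound predicates would need a richer structure than a single sorted list.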

Step 3: LLM inference with two tricks

The hardest part of the budget is the LLM call — 900 ms for a persona-tuned nudge. Two tricks make this fit:

Trick 1: Templates with slot-fill

Most nudges don't need a full LLM call. Each of the 6 base personas has ~40 pre-authored nudge templates per KPI category. The LLM picks the best template (a 100-ms classifier call), then fills in deal-specific slots (account name, ACV, stage). Total: 300 ms median.
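The slot-fill step can be sketched with nothing more than the standard library. Everything here is assumed for illustration: the template text, the `pick_template` stand-in (a plain lookup where the real system makes a classifier call), and the shape of the `intent` dict.

```python
from string import Template

# Hypothetical template; the real library holds ~40 per persona per KPI category.
TEMPLATES = {
    "pipeline_dip": Template(
        "Heads up: $account ($stage, $acv) has gone quiet. What is the next step?"
    ),
}


def pick_template(intent: dict) -> str:
    """Stand-in for the classifier call; here just a lookup on KPI category."""
    return intent["kpi_category"]


def render_nudge(intent: dict) -> str:
    template = TEMPLATES[pick_template(intent)]
    slots = {
        "account": intent["account"],
        "stage": intent["stage"],
        "acv": f"${intent['acv']:,.0f}",  # format ACV as whole dollars
    }
    # substitute() raises KeyError on a missing slot, so a half-filled
    # nudge fails loudly instead of shipping with a raw placeholder.
    return template.substitute(slots)


nudge = render_nudge({
    "kpi_category": "pipeline_dip",
    "account": "Acme",
    "stage": "Negotiation",
    "acv": 120000,
})
```

The rendering itself is effectively free; in this design the latency lives almost entirely in the template-selection call.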

Trick 2: Distilled persona models run regionally

For latency-critical paths, distilled student models (7B-parameter range) run on dedicated GPU pods in the same region as the tenant. Zero cross-region hops, zero vendor-API variance. The trade-off is a slightly narrower vocabulary than frontier models — but for coaching nudges that's not the constraint.

Step 4: Channel delivery

Each channel has its own SLA quirks:

If Gmail is projected to miss its latency budget, the router automatically falls back to an in-product notification and flags the receipt as "channel-degraded."
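The fallback decision itself is small. Here's a hedged sketch, assuming the router has a projected latency per channel to compare against a budget. The `route` function, the channel names, and the 400 ms figure in the usage lines are illustrative, not our production values.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    channel: str
    degraded: bool  # True maps to the "channel-degraded" flag on the receipt


def route(preferred: str, projected_ms: dict[str, float], budget_ms: float,
          fallback: str = "in_product") -> Delivery:
    """Use the preferred channel unless its projected latency misses the budget."""
    # A channel with no projection is treated as a miss.
    if projected_ms.get(preferred, float("inf")) <= budget_ms:
        return Delivery(preferred, degraded=False)
    return Delivery(fallback, degraded=True)


on_time = route("gmail", {"gmail": 320.0}, budget_ms=400.0)
late = route("gmail", {"gmail": 950.0}, budget_ms=400.0)
```

Recording the degradation on the receipt, rather than only in logs, keeps the fallback visible to whoever audits the SLA later.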

Step 5: Signed receipts

Every completed nudge is appended to an Ed25519-signed ledger. The ledger is a hash chain (Merkle-style): each receipt references the hash of the previous one, so the whole sequence is tamper-evident. Signing takes 0.8 ms. Persisting to the tenant ledger takes 180 ms to commit, with asynchronous replication for durability.

The key rotates every 90 days. Old receipts stay verifiable because we publish the public-key history with every audit export.
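The chaining half of this is easy to show with the standard library. The sketch below covers only the hash chain and its verification; the Ed25519 signature over each entry is deliberately left out because it needs a cryptography library, and the receipt fields, `GENESIS` value, and function names are assumptions for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first receipt


def receipt_hash(body: dict, prev_hash: str) -> str:
    # Canonical JSON so the same body always hashes to the same value.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_receipt(ledger: list[dict], body: dict) -> dict:
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    entry = {"body": body, "prev_hash": prev_hash, "hash": receipt_hash(body, prev_hash)}
    ledger.append(entry)
    return entry


def verify(ledger: list[dict]) -> bool:
    """Recompute every link; any edited, dropped, or reordered receipt breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != receipt_hash(entry["body"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


ledger: list[dict] = []
for n in range(3):
    append_receipt(ledger, {"nudge_id": n, "latency_ms": 1700 + n})
intact = verify(ledger)
```

In a signed version, the signature would cover each entry's `hash`, and a verifier would pick the public key that was active when the receipt was written, which is why the key history ships with the export.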

What we measure, always

We emit four percentiles (p50, p95, p99, p99.9) per tenant per minute, plus a breakdown by stage. When any budget is blown, the on-call SRE gets paged within 15 minutes — and the affected tenant gets a status-page notification automatically.
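For reference, here's a minimal sketch of the per-window percentile summary, using the nearest-rank definition over one tenant's latency samples. The function names are hypothetical, and a production system would more likely use a streaming sketch or histogram than sort raw samples. The 1,800 ms and 3,000 ms defaults are the p50 and p95 targets from the budget section.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples in window")
    ordered = sorted(samples)
    # round() guards against float noise such as 999.0000000000001 before ceil().
    rank = math.ceil(round(p * len(ordered) / 100, 6))
    return ordered[max(rank, 1) - 1]


def summarize(samples_ms: list[float]) -> dict[str, float]:
    return {f"p{p:g}": percentile(samples_ms, p) for p in (50, 95, 99, 99.9)}


def over_budget(summary: dict[str, float], p50_ms: float = 1800, p95_ms: float = 3000) -> bool:
    return summary["p50"] > p50_ms or summary["p95"] >= p95_ms


window = [float(ms) for ms in range(1, 1001)]  # synthetic samples, 1..1000 ms
summary = summarize(window)
```

Note that p99.9 only means something once a window holds on the order of a thousand samples; for quieter tenants it collapses to the maximum.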

If you can't measure it, you don't have an SLA. If you can't prove it, you don't have a product.

Where we'd like to improve

Two honest admissions:

  1. Gmail is our weakest link. Google quotas make sub-400 ms hard. We're piloting a "parallel push" where we fire to both the Gmail add-on and a Slack DM simultaneously, and take whichever lands first. Experimental.
  2. Cold-start on distilled-persona pods. When a regional persona pod sits idle for hours, the first nudge takes ~2.5s instead of 1s. We keep warm replicas for active tenants, but the economics get expensive at long-tail scale. Open problem.
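The "parallel push" idea from item 1 is essentially a race. Here's a rough sketch with `asyncio`, assuming each channel exposes an async send function; `first_to_land` and the sender names are hypothetical, and the sleeps stand in for real delivery calls.

```python
import asyncio
from typing import Awaitable, Callable


async def first_to_land(senders: dict[str, Callable[[], Awaitable[None]]]) -> str:
    """Fire every channel at once and return the name of the first successful delivery."""
    tasks = {asyncio.create_task(send(), name=name) for name, send in senders.items()}
    try:
        while tasks:
            done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if task.exception() is None:  # skip channels that failed
                    return task.get_name()
        raise RuntimeError("all channels failed")
    finally:
        # Cancel whatever is still in flight once we have a winner (or gave up).
        for task in tasks:
            task.cancel()


async def send_gmail() -> None:
    await asyncio.sleep(0.2)   # stand-in for a slow add-on delivery


async def send_slack() -> None:
    await asyncio.sleep(0.01)  # stand-in for a fast DM delivery


winner = asyncio.run(first_to_land({"gmail": send_gmail, "slack": send_slack}))
```

Cancelling the losing task does not recall a message that has already left, so a real version also needs a story for duplicates, which is part of why this is still experimental.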

Ship, measure, improve, ship again. That's the whole job.


We're hiring a Senior Database Engineer to own more of the receipt ledger and peer-benchmark aggregation. If end-to-end latency SLAs sound fun, say hi.