Prowl Gateway — continuous, tamper-proof API quality

01 · The problem

One-shot benchmarks are announced. They rot.

Vendor knows when the test runs · score expires in days

Every other API benchmark in the agent ecosystem is a snapshot. Someone pays $1, an LLM runs a battery of canned tests, a number gets pinned to the service for a week or a month, and then the number rots. The vendor knows exactly when the test is happening because they wrote the guide. They can cram for the exam, so the score never reflects what an agent actually feels at 3am on a Tuesday.

Three things are wrong with that model:

It's announced. The vendor knows the measurement is happening, so the vendor's slowest path never gets tested.
It's stale. A score from last Thursday tells you nothing about the outage that started this morning.
It scales poorly. Re-running the bench costs LLM tokens. Continuous freshness via paid benchmarks is economically broken.

The fix is not to run benchmarks more often. The fix is to measure traffic that's happening anyway, and to do it on the vendor's actual production endpoints — not a sandbox they prepared.

02 · The anchor

The sample decision is rolled, not requested.

sha256 over (service, agent, salt-id, salt) · we re-verify on arrival

The trick is that the vendor doesn't decide which calls are observed, and neither do we. The agent's SDK rolls the dice locally, using a salt that Prowl publishes and rotates daily. Here's the exact computation:

# The agent's SDK does this before every outbound call.
roll   = sha256(service_id | agent_id | salt_id | salt)
bucket = first_32_bits(roll) / 2**32      # → float in [0, 1)

if bucket < service.sampling_rate:
    # route through Prowl — observable
    POST proxy.prowl.world/{slug}/{path}
        X-Prowl-Salt-Id: {salt.id}
        X-Prowl-Sample-Decision: {roll.hex()}
else:
    # call vendor directly — invisible to Prowl
    request(vendor_url, ...)

When the call hits the gateway, we have the salt too — so we recompute the same hash with the agent's ID and service ID and reject the call if the headers don't match. An agent that wants to skip being observed (or always be observed) can't, because the hash is deterministic and we re-verify it.
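A minimal Python sketch of the roll and its re-verification. It assumes the `|` in the formula is a literal separator between UTF-8 strings and that "first 32 bits" means the leading four bytes read big-endian; neither detail is pinned down above. The function names are illustrative rather than the SDK's real API, and the gateway-side check that the bucket falls under the sampling rate is also an assumption.

```python
import hashlib
import hmac


def sample_roll(service_id: str, agent_id: str, salt_id: str, salt: str) -> bytes:
    """sha256 over the four fields joined with '|' (assumed encoding)."""
    material = "|".join((service_id, agent_id, salt_id, salt)).encode("utf-8")
    return hashlib.sha256(material).digest()


def bucket(roll: bytes) -> float:
    """Map the first 32 bits of the hash to a float in [0, 1)."""
    return int.from_bytes(roll[:4], "big") / 2**32


def verify_decision(claimed_hex: str, service_id: str, agent_id: str,
                    salt_id: str, salt: str, sampling_rate: float) -> bool:
    """Gateway side: recompute the roll and compare it with the header value."""
    try:
        claimed = bytes.fromhex(claimed_hex)
    except ValueError:
        return False  # header is not valid hex
    expected = sample_roll(service_id, agent_id, salt_id, salt)
    # Constant-time comparison, then confirm the call was really selected
    # (the second check is an assumption about what the gateway enforces).
    return hmac.compare_digest(expected, claimed) and bucket(expected) < sampling_rate


# Hypothetical identifiers, for illustration only.
roll = sample_roll("svc-123", "agent-abc", "salt-001", "example-salt")
print(bucket(roll), verify_decision(roll.hex(), "svc-123", "agent-abc",
                                    "salt-001", "example-salt", 1.0))
```

Because every input is fixed for a given (service, agent, salt) triple, the SDK and the gateway always arrive at the same bucket, which is what makes the header checkable.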

The salt rotates every 24 hours with a 60-minute overlap so calls in flight don't break. Anyone can audit a past sample decision via POST /v1/sampling/verify — the salt is public after rotation.
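One way to read the rotation rule, sketched in Python: a salt is accepted from its activation time until 24 hours plus the 60-minute overlap have passed. That interpretation of "overlap" and the name `salt_is_accepted` are our assumptions, not the gateway's code.

```python
from datetime import datetime, timedelta, timezone

SALT_LIFETIME = timedelta(hours=24)
OVERLAP = timedelta(minutes=60)  # grace period for calls already in flight


def salt_is_accepted(activated_at: datetime, now: datetime) -> bool:
    """True while the salt is current or still inside its overlap window."""
    return activated_at <= now < activated_at + SALT_LIFETIME + OVERLAP


activated = datetime(2025, 1, 1, tzinfo=timezone.utc)
# 30 minutes after rotation: the previous salt is still inside the overlap.
print(salt_is_accepted(activated, activated + timedelta(hours=24, minutes=30)))
```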

Why this matters: the vendor can't bias the sample because they never see the current salt. The agent can't cheat because we re-verify the hash. The score is a function of real traffic, not of "the call the vendor wanted us to see."

03 · Four modes, one route

Pick how the gateway behaves for your service.

Service.proxy_modes · settable per service via vendor JWT

DEFAULT

`sampled`

The main mode. The agent's SDK rolls the dice locally; only a fraction of calls pass through. The vendor gets continuous quality measurement, paid for in monitoring credits (1 credit per observed call; paid benchmarks refill credits at 100 per $1).

MONETIZE

`x402_only`

Every call requires an x402 payment proof from the agent at $0.01 each. Prowl takes 10%, vendor gets 90%. Monetization-as-a-service for vendors that want pay-per-call usage without building billing.

RESERVED

`vault_only`

Reserved for scoped vault tokens — agents present short-lived credentials that Prowl translates into the vendor's real API key. M3+ territory; route reachable, policy conservative.

PILOT

`full`

Every call is forwarded, every call is logged. Useful for early-vendor pilots and for debugging the auth translation. No sampling guarantees and no payment enforcement.
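The credit and x402 figures quoted for `sampled` and `x402_only` reduce to simple arithmetic. A rough sketch using only those numbers (100 credits per $1, 1 credit per observed call, $0.01 per paid call, 10/90 split); the function names are ours, and the volume estimate assumes observed calls are a plain fraction of total traffic.

```python
from decimal import Decimal

CREDITS_PER_USD = 100
X402_PRICE_USD = Decimal("0.01")
PROWL_SHARE = Decimal("0.10")


def monitoring_cost_usd(total_calls: int, sampling_rate: Decimal) -> Decimal:
    """Expected credit spend in dollars for `sampled` mode."""
    observed = Decimal(total_calls) * sampling_rate  # 1 credit per observed call
    return observed / CREDITS_PER_USD


def x402_split(paid_calls: int) -> tuple[Decimal, Decimal]:
    """Return (Prowl's cut, vendor's cut) in dollars for `x402_only` mode."""
    gross = X402_PRICE_USD * paid_calls
    prowl = gross * PROWL_SHARE
    return prowl, gross - prowl


print(monitoring_cost_usd(100_000, Decimal("0.02")))
print(x402_split(1_000))
```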

Across every mode, the gateway strips Prowl-internal headers (X-Agent-Key, X-Prowl-*, the agent's Authorization) before forwarding. The vendor sees its own injected credential and the request body. It never sees who the agent is.
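The header rule can be sketched as a small filter. This is an illustration under assumptions: header names are matched case-insensitively, and the vendor credential is injected after stripping, so a vendor that authenticates via `Authorization` still receives its own value. `build_upstream_headers` and the example header names are hypothetical.

```python
INTERNAL_EXACT = {"x-agent-key", "authorization"}
INTERNAL_PREFIX = "x-prowl-"


def strip_internal_headers(headers: dict[str, str]) -> dict[str, str]:
    """Drop anything that would identify the agent to the vendor."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in INTERNAL_EXACT
        and not name.lower().startswith(INTERNAL_PREFIX)
    }


def build_upstream_headers(agent_headers: dict[str, str],
                           credential_header: str,
                           credential_value: str) -> dict[str, str]:
    """Strip first, then inject the vendor's own credential."""
    upstream = strip_internal_headers(agent_headers)
    upstream[credential_header] = credential_value
    return upstream


incoming = {
    "Authorization": "Bearer agent-token",
    "X-Agent-Key": "agent-key",
    "X-Prowl-Salt-Id": "salt-001",
    "Content-Type": "application/json",
}
print(build_upstream_headers(incoming, "X-Api-Key", "vendor-secret"))
```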

04 · One call, end to end

What actually happens between dice roll and response.

src/api/gateway.py · budget <30ms p99 overhead

Agent rolls the dice with the SDK's local sample function. Bucket lands below the rate. Agent sends the call to proxy.prowl.world/{slug}/v1/... with the X-Prowl-Salt-Id and X-Prowl-Sample-Decision headers.

Gateway recomputes sha256(service|agent|salt-id|salt) and rejects if mismatched (400). Cheat path closed.

Reputation gate. If service.min_reputation is set and the agent's score is below it, the call returns 403 with X-Prowl-Reason: below-min-reputation.

Monitoring credits. Atomic decrement of service.gateway_credits. Out of credits → 503 + X-Prowl-Reason: monitoring-credits-exhausted. Vendor refills via a paid benchmark.

Auth translation. Fetch the vendor's Fernet-encrypted credential, decrypt it, inject as the configured header or query param. Internal Prowl headers stripped on the way out.

Forward upstream with httpx (60s timeout, no follow-redirects). The vendor's response is captured byte-for-byte.

ProxyCall row written: method, path, request_bytes, response_status, response_bytes, latency_ms, mode, salt-id, decision-hash. ~1–2 ms.

Response returned with two added headers: X-Prowl-Proxy-Mode and X-Prowl-Proxy-Latency-Ms. Budget: <30 ms p99 of Prowl-attributable overhead.
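The three gates that run before the upstream call can be sketched as one function. This is a simplified in-memory illustration, not the contents of src/api/gateway.py: the `Service` class, `admit`, and the lock standing in for an atomic database decrement are all assumptions. Only the status codes and X-Prowl-Reason values come from the steps above.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class Service:
    gateway_credits: int
    min_reputation: Optional[float] = None
    _lock: threading.Lock = field(
        default_factory=threading.Lock, init=False, repr=False, compare=False
    )

    def take_credit(self) -> bool:
        """Decrement one credit under a lock; False when none are left."""
        with self._lock:
            if self.gateway_credits <= 0:
                return False
            self.gateway_credits -= 1
            return True


def admit(service: Service, decision_ok: bool,
          agent_reputation: float) -> Tuple[Optional[int], Optional[str]]:
    """Return (status, X-Prowl-Reason); (None, None) means forward upstream."""
    if not decision_ok:  # recomputed hash did not match the header
        return 400, None
    if service.min_reputation is not None and agent_reputation < service.min_reputation:
        return 403, "below-min-reputation"
    if not service.take_credit():
        return 503, "monitoring-credits-exhausted"
    return None, None


svc = Service(gateway_credits=1, min_reputation=0.5)
print(admit(svc, True, 0.9), admit(svc, True, 0.9))
```

Checking reputation before touching credits means a rejected agent never costs the vendor a credit, which matches the order of the steps.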

05 · From rows to signal

A row per call isn't a score. The pipeline is.

probe-health overlay · cheat audit · reputation

Every observed call becomes one ProxyCall row. Three downstream pipelines turn those rows into the public score you see on Prowl:

Probe-health overlay. The displayed score is capped continuously if the proxy's error rate over recent calls exceeds 30% (cap 30) or 80% (cap 50). Re-calculated every read — not on a paid bench.
Cheat audit (M5). Once per service per 24h, we look at the last 1,000 ProxyCalls. Two heuristics: error rate ≥ 50% → flag; latency p95/p5 ratio ≥ 5× → flag (the bimodal-latency tell — vendor probably serving fast-fakes to samples and slow to real clients). Min 10 samples before any flag. Flags surface as prowl_capped_audit on the catalog.
Agent reputation. The other side of the same data — the agent's behavior across services aggregates into a 7-dimension score (M1). High-reputation agents get first-dibs on critical benchmark directives; low-reputation agents get rate-limited on services with min_reputation.
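The two cheat-audit heuristics are simple enough to sketch. The thresholds, the 1,000-call window, and the 10-sample minimum come from the description above; everything else is an assumption made for illustration: what counts as an error (here, status ≥ 500), the nearest-rank percentile method, and the flag names.

```python
from typing import Callable, List, Sequence, Tuple

WINDOW = 1000
MIN_SAMPLES = 10
ERROR_RATE_FLAG = 0.5
LATENCY_RATIO_FLAG = 5.0


def nearest_rank(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile using integer arithmetic (assumed method)."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def cheat_audit(
    calls: Sequence[Tuple[int, float]],  # (response_status, latency_ms)
    is_error: Callable[[int], bool] = lambda status: status >= 500,
) -> List[str]:
    window = list(calls)[-WINDOW:]
    if len(window) < MIN_SAMPLES:
        return []  # too few samples to flag anything
    flags = []
    error_rate = sum(1 for status, _ in window if is_error(status)) / len(window)
    if error_rate >= ERROR_RATE_FLAG:
        flags.append("error-rate")
    latencies = sorted(latency for _, latency in window)
    p5, p95 = nearest_rank(latencies, 5), nearest_rank(latencies, 95)
    if p5 > 0 and p95 / p5 >= LATENCY_RATIO_FLAG:
        flags.append("bimodal-latency")
    return flags


# Ten fast and ten slow responses: the latency spread trips the ratio check.
print(cheat_audit([(200, 10.0)] * 10 + [(200, 1000.0)] * 10))
```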

Receipts (POST /v1/receipts/submit, M1) close the loop on multi-step tasks: agent and counterparty co-sign that "the delivery happened, here's how it went," feeding the same aggregation. Single-sig weighted 0.3, dual-sig weighted 1.0.
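As a hedged illustration of the receipt weights, here is a weighted mean in which dual-signed receipts count 1.0 and single-signed receipts count 0.3. The outcome scale in [0, 1] and the function name are hypothetical, and the real aggregation may combine receipts differently.

```python
from typing import Iterable, Optional, Tuple

SINGLE_SIG_WEIGHT = 0.3
DUAL_SIG_WEIGHT = 1.0


def weighted_outcome(receipts: Iterable[Tuple[float, int]]) -> Optional[float]:
    """receipts: (outcome in [0, 1], number of signatures). None if empty."""
    total = 0.0
    weight_sum = 0.0
    for outcome, signatures in receipts:
        weight = DUAL_SIG_WEIGHT if signatures >= 2 else SINGLE_SIG_WEIGHT
        total += outcome * weight
        weight_sum += weight
    return total / weight_sum if weight_sum else None


# One co-signed success and one self-reported failure.
print(weighted_outcome([(1.0, 2), (0.0, 1)]))
```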

06 · Vs. canned benchmarks

Same domain. Different surface being measured.

we're not Datadog · we're third-party-witnessed

                            Canned benchmarks         Prowl Gateway
Vendor knows it's measured  Yes (announced)           No (per-call dice roll)
Update cadence              Weeks (re-bench = $$$)    Per call
Cost to refresh             ~$0.05 LLM tokens         0 (already in flight)
Surface tested              Sandbox / cherry-picked   Production endpoint
Audit trail                 One bench log             Per-call ProxyCall row

We're not a replacement for Datadog, Honeycomb, or Sentry. Those live inside the vendor and watch the vendor's own requests. Prowl Gateway lives between agents and vendors and produces a public, third-party-witnessed signal. The two are complementary — vendors use one to improve, agents use the other to decide whether to call.

08 · Where we honestly are

What's shipped, and what's not yet.

M2 → M6 shipped · M7+ open

The gateway is shipped through M6 of the gateway+reputation plan. The route is live, the sampling protocol is enforced, the cheat audit runs every 24h, and a ProxyCall row is written for every call that passes through the gateway. But:

Not yet · being honest about it

Things this isn't today.

Real traffic is still small. The pilot vendors are a handful. The cheat audit's flag thresholds (50% error rate, p95/p5 ≥ 5×) are educated guesses that need real-data tuning before we trust them publicly.
Per-service pricing isn't wired. The x402_only mode uses a hardcoded $0.01/call default. A future migration moves it to Service.gateway_price_per_call_usd.
OAuth2 client-credentials translation is deferred. API keys, bearer tokens, custom headers, and query-param auth work today. Token-exchanged OAuth needs an extra hop we haven't built.
KS-test cheat detection is on paper, not in code. The current detector is ratio-based. A full distribution comparison against direct-call ground truth needs the SDK to emit a corroborating sample, which is M7+.

What's solid right now: the sampling crypto, the auth translation, the credit accounting, the ProxyCall capture, the min-reputation gate, the cheat-audit ratio path. Those have tests, those have been hit in prod, those work.

The bet: the long tail of agent traffic is going to need a neutral observability layer that neither the vendor nor the agent controls. The gateway is our attempt at building that layer in a way that doesn't depend on the vendor cooperating.

09 · What we need from you

Three asks. Pick the one that fits.

VENDOR · WITH A PUBLIC API

Turn on proxy_modes=sampled at 1–5% rate. The continuous score is real, you can disable it any time, and the data is yours via GET /v1/services/{id}/gateway.

Enable →

AGENT RUNTIME · BUILDER

The sampling protocol is ~15 lines. Implement it, get continuous quality signal on every service you call, and your agent's reputation starts accruing automatically.

SDK →

SKEPTIC · YOU THINK THE THREAT MODEL IS WRONG

Open an issue. "Your cheat-audit thresholds are nonsense because…" is more useful than a polite nod.

Push back →

A score that updates with every real call.

The score should age like milk, not wine.
