TensorWasm
availability-slow-burn
Alert: availability_http is burning the 30-day error budget at 6×
the budgeted rate, sustained across both a 30-minute and a 6-hour
window. Severity: page.
What this alert means
The HTTP API is returning 5xx responses to roughly 3.0% of all
incoming requests, and has been doing so steadily across at least the
last 30 minutes. At this rate, the full 30-day budget of ~216 minutes
of allowed downtime is consumed in about five days. This is the
"something is wrong but the building isn't on fire" alert: slow enough
that it would not have fired the 14.4× fast-burn page, fast enough
that ignoring it for a full day burns a fifth of the monthly budget.
It defends the same availability_http: 99.5% SLO as the fast-burn
alert (SLO.md §3) but catches slower degradations that the
fast-burn alert misses.
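The figures in this section follow from the SLO target and the burn-rate multiplier. A minimal sketch of the arithmetic, assuming a constant error rate over a 30-day window (the function names are illustrative and not part of TensorWasm):

```python
SLO_TARGET = 0.995   # availability_http target (SLO.md §3)
WINDOW_DAYS = 30     # rolling SLO window
BURN_RATE = 6        # multiplier at which this alert fires


def budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full downtime the error budget allows over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def days_to_exhaustion(burn_rate: float, window_days: int) -> float:
    """Days until the whole budget is spent at a constant burn rate."""
    return window_days / burn_rate


def threshold_error_rate(slo_target: float, burn_rate: float) -> float:
    """Error ratio that corresponds to the given burn rate."""
    return burn_rate * (1 - slo_target)


print(round(budget_minutes(SLO_TARGET, WINDOW_DAYS)))          # 216
print(days_to_exhaustion(BURN_RATE, WINDOW_DAYS))              # 5.0
print(round(threshold_error_rate(SLO_TARGET, BURN_RATE), 3))   # 0.03
```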
Symptoms users see
- A noticeable but non-catastrophic minority of API calls fail — retries usually succeed, but the client error rate is above baseline.
- CLI clients see intermittent `Error: server returned 500` or occasional connection-reset errors that clear on retry.
- Long-running batch consumers that don't retry start logging elevated failure counts in their own dashboards.
- The dashboard's HTTP error rate panel sits visibly above the 1% threshold line for a sustained period, without spiking high enough to be visually alarming.
First-look queries
# 1. Confirm: is the 30-minute error rate above the alert threshold?
sum(rate(tensor_wasm_http_requests_total{status=~"5.."}[30m]))
/
sum(rate(tensor_wasm_http_requests_total[30m]))
A value above 0.030 (= 6 × 0.005) confirms the alert. The
6-hour-window query (same expression with [6h]) confirms the
incident is steady-state, not a recent spike.
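Both windows have to exceed the threshold before the incident counts as a steady-state slow burn. The sketch below shows that check, assuming the query results arrive in the JSON shape returned by the Prometheus HTTP API's instant-query endpoint. The helper names and the sample payload are hypothetical.

```python
import json

THRESHOLD = 6 * 0.005  # 6x burn against a 0.5% error budget


def scalar_from_instant_query(body: str) -> float:
    """Extract the single sample value from an instant-query response."""
    payload = json.loads(body)
    if payload["status"] != "success":
        raise ValueError("query failed")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected one series, got {len(result)}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def slow_burn_confirmed(rate_30m: float, rate_6h: float) -> bool:
    """True only when both the short and the long window exceed the threshold."""
    return rate_30m > THRESHOLD and rate_6h > THRESHOLD


# Hypothetical response body for the 30-minute query.
sample = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1700000000,"0.034"]}]}}'
)
rate_30m = scalar_from_instant_query(sample)
print(slow_burn_confirmed(rate_30m, rate_6h=0.031))  # True
print(slow_burn_confirmed(rate_30m, rate_6h=0.012))  # False: a recent spike, not steady state
```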
# 2. Scope: which route is failing?
topk(3,
sum by (route) (
rate(tensor_wasm_http_requests_total{status=~"5.."}[30m])
)
)
The top three failing routes. A single dominant route narrows the search to one handler; an even split across many routes points at a shared dependency (snapshot disk, GPU, auth, DB).
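One way to make "dominant" concrete is to compare the top route's share of all 5xx traffic against a cut-off. The sketch below assumes the per-route rates have already been fetched. The 60% cut-off and the route names are illustrative assumptions, not values from this runbook.

```python
from typing import Dict, Optional


def dominant_route(rates_by_route: Dict[str, float], share: float = 0.6) -> Optional[str]:
    """Return the route carrying at least `share` of the 5xx rate, else None."""
    total = sum(rates_by_route.values())
    if total == 0:
        return None
    route, rate = max(rates_by_route.items(), key=lambda kv: kv[1])
    return route if rate / total >= share else None


# Hypothetical per-route 5xx rates (requests per second).
one_handler = {"/invoke": 4.2, "/snapshot": 0.3, "/health": 0.1}
shared_dep = {"/invoke": 1.1, "/snapshot": 1.0, "/health": 0.9}

print(dominant_route(one_handler))  # /invoke
print(dominant_route(shared_dep))   # None
```

If the input is the `topk(3, ...)` result rather than the full per-route breakdown, the share is relative to the top three routes only, so it overstates concentration when many routes are failing.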
# 3. Compare against the same window yesterday: is this new?
sum(rate(tensor_wasm_http_requests_total{status=~"5.."}[30m]))
/
sum(rate(tensor_wasm_http_requests_total[30m]))
-
sum(rate(tensor_wasm_http_requests_total{status=~"5.."}[30m] offset 1d))
/
sum(rate(tensor_wasm_http_requests_total[30m] offset 1d))
A large positive delta means the elevated error rate is new in the last 24 hours and worth correlating to a deploy or config change. A near-zero delta means the system has been like this for a while and the alert just crossed its window threshold.
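The day-over-day comparison can be read off mechanically once the four rates are known. This is a minimal sketch. The 0.005 tolerance that separates "large" from "near-zero" is an assumption chosen for illustration, and the sample rates are made up.

```python
def error_ratio(errors_per_s: float, requests_per_s: float) -> float:
    """5xx ratio for a window; 0.0 when the window saw no traffic."""
    return errors_per_s / requests_per_s if requests_per_s > 0 else 0.0


def classify_delta(ratio_now: float, ratio_yesterday: float,
                   tolerance: float = 0.005) -> str:
    """Label the burn from the difference between today and yesterday."""
    if ratio_now - ratio_yesterday > tolerance:
        return "new"       # correlate with deploys and config changes
    return "not new"       # the alert only just crossed its window threshold


now = error_ratio(3.4, 100.0)        # hypothetical: 3.4% of requests failing
yesterday = error_ratio(0.4, 100.0)  # hypothetical: 0.4% a day earlier
print(classify_delta(now, yesterday))  # new
```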
# 4. Are the failures concentrated on one tenant?
sum by (tenant) (rate(tensor_wasm_instance_terminations_total[30m]))
This uses an existing metric to spot a tenant that is repeatedly spawning and failing — a tenant-specific bug pattern is a common slow-burn cause.
Mitigation steps
The slow-burn rate is patient enough to leave room for diagnosis before mitigation. Stop after step 2 if the dashboard recovers.
1. Open the dashboard alongside the queries. Confirm the SLO summary `availability_http (30d)` panel is dropping, and identify the dominant failing route from query 2.
2. If the cause is obvious from logs, fix forward. `journalctl -u tensor-wasm --since '6 hours ago' | grep -iE 'error|warn' | tail -200` often surfaces a recurring message that points at the cause (e.g. a tenant exceeding quota repeatedly, an upstream timeout). A targeted config change is faster than a rollback if the root cause is genuinely known.
3. If a recent deploy correlates with the start of the burn, roll back. Compare deploy timestamps (`git log --since '6 hours ago'` on the release branch, or your deployment system's audit log) to the 6-hour query window. Use `rollback.md` if the correlation is plausible: the slow burn at 6× still consumes budget quickly enough that an unnecessary rollback is cheaper than a missed real cause.
4. If a single tenant is the cause, isolate them. Use the `tensor-wasm-cli` admin commands to revoke or rate-limit the offending tenant's bearer token: `tensor-wasm admin token revoke <token-id>` or set a stricter QPS in the per-tenant config. Document the action for the postmortem.
5. If no clear cause emerges in 30 minutes, escalate. This alert is patient but not infinite. See When to page.
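For the fix-forward step, counting identical messages is often quicker than reading 200 lines of grep output. The sketch below assumes journalctl's default short output format (`Mon DD HH:MM:SS host unit[pid]: message`). The sample log lines are invented for illustration and do not reflect real TensorWasm log messages.

```python
import re
from collections import Counter

# Strips the "timestamp host unit[pid]: " prefix of journalctl's short format.
_PREFIX = re.compile(r"^.*?\[\d+\]:\s*")
_LEVEL = re.compile(r"error|warn", re.IGNORECASE)


def top_recurring(lines, n=5):
    """Return the n most frequent error/warn messages with their counts."""
    counts = Counter()
    for line in lines:
        if _LEVEL.search(line):
            counts[_PREFIX.sub("", line.strip(), count=1)] += 1
    return counts.most_common(n)


# Hypothetical journal excerpt.
sample_lines = [
    "Jan 05 12:00:01 host tensor-wasm[811]: ERROR quota exceeded for tenant 42",
    "Jan 05 12:00:09 host tensor-wasm[811]: ERROR quota exceeded for tenant 42",
    "Jan 05 12:01:30 host tensor-wasm[811]: WARN upstream timeout after 30 s",
    "Jan 05 12:02:00 host tensor-wasm[811]: INFO request served",
]
for message, count in top_recurring(sample_lines):
    print(count, message)
```

Messages that embed a changing value (a request ID, a duration) will not collapse into one key. Masking those tokens is a possible refinement, at the cost of hiding details such as the tenant ID.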
Root-cause hypotheses
Ranked by base rate; the slow-burn alert tends to surface different root causes than the fast-burn one.
| Hypothesis | How to confirm | How to fix |
|---|---|---|
| One tenant is repeatedly hitting a quota or timeout, failing 100% of their calls | Per-tenant termination rate (query 4); `journalctl -u tensor-wasm \| grep '<tenant_id>'` | Revoke or rate-limit their token; engage the tenant; raise quota if legitimate |
| Snapshot disk degraded (slow but not yet failed); restore calls increasingly time out at 30 s | `iostat -x 5` shows high `%util` on the snapshot device; `tensor-wasm observe --once` shows restore P95 climbing | Move snapshots to a faster volume; investigate the storage backend's own metrics |
| Memory leak slowly approaching the per-process limit; per-call allocations start failing before OOM kill | RSS climbing in `systemctl status tensor-wasm` or `cat /proc/$(pidof tensor-wasm)/status \| grep VmRSS`; `tensor_wasm_active_instances` not draining | Restart `tensor-wasm`; file an issue with the RSS-over-time graph |
| Wasmtime engine accumulating compiled modules without eviction | `tensor_wasm_active_instances` high and stable; restart drops error rate immediately | Tune the engine's instance limit downward; consider periodic warm restart until the leak is fixed |
| Upstream dependency (auth backend, OTLP collector) slow and triggering tower middleware timeouts | `curl` directly against the dependency; tracing spans show long `http.request` parent with no child progress | Restore or temporarily bypass the dependency; tighten the middleware timeout if it is too loose |
When to page
Escalate to sev-1 if any of the following:
- The 30-minute error rate stays above 0.030 for more than two hours without identifying a hypothesis to act on.
- The slow burn upgrades to a fast burn (the `availability-fast-burn.md` alert starts firing). Handle that page first; it is louder for a reason.
- The dashboard's `availability_http (30d)` panel crosses below 99.3% (half the budget consumed in the rolling window).
- A single tenant is identified as the cause but cannot be isolated via token revocation (e.g. the auth subsystem is also degraded).
Postmortem checklist
- Save `journalctl -u tensor-wasm --since '<incident_start - 30m>' --until '<incident_end + 30m>' > /tmp/tensor-wasm-slow-burn.log`.
- Capture the Prometheus snapshot for the full 6-hour window plus 30 minutes either side.
- If a tenant was isolated during mitigation, file an issue against that tenant's onboarding ticket (or notify their owner channel) so the action is undone or formalised.
- Note the actual root cause in the issue's body even if the dashboard recovered before it was confirmed — a slow burn that auto-resolves is still worth attributing.
- Update the per-tenant config or quota defaults if the cause was a quota that was set too high.
- If the cause traces back to a release, follow up with a regression test added before the next release.
- Cross-reference the incident in the next CHANGELOG entry under "Operator-visible behaviour change" per `SLO.md` §9.
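The `<incident_start - 30m>` and `<incident_end + 30m>` placeholders in the journalctl command need concrete timestamps. Below is a small sketch for computing them, assuming the incident bounds are known as local times. The dates are hypothetical, and the output uses the `YYYY-MM-DD HH:MM:SS` form that journalctl accepts for `--since` and `--until`.

```python
from datetime import datetime, timedelta

PAD = timedelta(minutes=30)
FMT = "%Y-%m-%d %H:%M:%S"  # absolute timestamp form accepted by journalctl


def padded_window(start: datetime, end: datetime):
    """Return (since, until) strings padded by 30 minutes on each side."""
    return (start - PAD).strftime(FMT), (end + PAD).strftime(FMT)


# Hypothetical incident bounds.
since, until = padded_window(datetime(2024, 1, 5, 12, 0), datetime(2024, 1, 5, 15, 30))
print(f"journalctl -u tensor-wasm --since '{since}' --until '{until}'")
```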
Related
- `SLO.md` §3 (target), §4.1 (budget), §5.2 (alert query).
- `availability-fast-burn.md`: escalate here if this alert upgrades.
- `availability-very-slow-burn.md`: the long-window ticket-severity variant of the same SLI.
- `invoke-latency-spike.md`: a slow upstream often surfaces here as both elevated latency AND elevated 5xx rate; cross-check that runbook's queries.
- `rollback.md`: the rollback procedure referenced in mitigation step 3.
- `oncall-paging.md`: escalation path used in "When to page".
- `dashboards/README.md`: the HTTP traffic row's "Error rate (5xx) by route" panel is the primary visual.