TensorWasm

availability-slow-burn

Alert: availability_http is burning the 30-day error budget at 6× the budgeted rate, sustained across both a 30-minute and a 6-hour window. Severity: page.

What this alert means

The HTTP API is returning 5xx responses to 3.0% or more of all incoming requests, and has been doing so steadily for at least the last 30 minutes. At this rate, the full 30-day budget of ~216 minutes of allowed downtime is consumed in about five days. This is the "something is wrong but the building isn't on fire" alert: slow enough that it would not have fired the 14.4× fast-burn page, fast enough that ignoring it for an eight-hour working day bleeds roughly 7% of the monthly budget, and ignoring it for a full 24 hours bleeds 20%. It defends the same availability_http: 99.5% SLO as the fast-burn alert (SLO.md §3) but catches degradations that the faster alert misses.
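The budget figures in this section follow from the SLO target and the burn-rate multiplier. A minimal sketch of that arithmetic, using only numbers stated in this runbook (the variable and function names are illustrative, not part of any TensorWasm tooling):

```python
# Burn-rate arithmetic for the slow-burn alert.
SLO_TARGET = 0.995   # availability_http SLO (99.5%)
WINDOW_DAYS = 30     # rolling SLO window
BURN_RATE = 6.0      # slow-burn multiplier

error_budget = 1.0 - SLO_TARGET                         # allowed error fraction
budget_minutes = error_budget * WINDOW_DAYS * 24 * 60   # allowed "downtime"
alert_threshold = BURN_RATE * error_budget              # error rate that fires the alert
days_to_exhaustion = WINDOW_DAYS / BURN_RATE            # time until the budget is gone


def budget_fraction_burned(hours: float, burn_rate: float = BURN_RATE) -> float:
    """Fraction of the 30-day budget consumed after `hours` at `burn_rate`."""
    return burn_rate * hours / (WINDOW_DAYS * 24)


if __name__ == "__main__":
    print(f"budget: {budget_minutes:.0f} min, threshold: {alert_threshold:.3f}")
    print(f"exhausted in {days_to_exhaustion:.0f} days")
    print(f"8 h ignored: {budget_fraction_burned(8):.1%} of budget")
```

The same function gives the cost of leaving the alert unattended for any length of time, which helps when deciding whether to mitigate now or wait for a deploy window.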

Symptoms users see

  • A noticeable but non-catastrophic minority of API calls fail — retries usually succeed, but the client error rate is above baseline.
  • CLI clients see intermittent Error: server returned 500 or occasional connection-reset errors that clear on retry.
  • Long-running batch consumers that don't retry start logging elevated failure counts in their own dashboards.
  • The dashboard's HTTP error rate panel sits visibly above the 1% threshold line for a sustained period, without spiking high enough to be visually alarming.

First-look queries

# 1. Confirm: is the 30-minute error rate above the alert threshold?
sum(rate(tensor_wasm_http_requests_total{status=~"5.."}[30m]))
  /
sum(rate(tensor_wasm_http_requests_total[30m]))

A value above 0.030 (= 6 × 0.005) confirms the alert. The 6-hour-window query (same expression with [6h]) confirms the incident is steady-state, not a recent spike.
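The two-window condition can be expressed as a small predicate. This is only a sketch of the logic described above, assuming you have already read the 30-minute and 6-hour ratios off Prometheus; the function name is hypothetical and the real evaluation happens in the alerting rule, not in this code.

```python
ERROR_BUDGET = 0.005   # 1 - 0.995, from the availability_http SLO
BURN_RATE = 6.0        # slow-burn multiplier


def slow_burn_firing(err_30m: float, err_6h: float) -> bool:
    """True when both windows exceed the 6x burn threshold (0.030).

    The short window shows the problem is still happening now;
    the long window shows it is steady rather than a brief spike.
    """
    threshold = BURN_RATE * ERROR_BUDGET
    return err_30m > threshold and err_6h > threshold
```

A high 30-minute value with a low 6-hour value is a recent spike; the reverse is an incident that is already recovering.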

# 2. Scope: which route is failing?
topk(3,
  sum by (route) (
    rate(tensor_wasm_http_requests_total{status=~"5.."}[30m])
  )
)

This returns the three routes with the highest 5xx rate. A single dominant route narrows the search to one handler; an even split across many routes points at a shared dependency (snapshot disk, GPU, auth, DB).
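To make "dominant" versus "even split" concrete, here is a hedged sketch that takes per-route 5xx rates (for example, copied from the query result) and reports whether one route accounts for most of the failures. The 60% share cut-off and the route names in the usage note are assumptions for illustration, not values from this runbook.

```python
from typing import Dict, Optional


def dominant_route(rates: Dict[str, float],
                   share_threshold: float = 0.6) -> Optional[str]:
    """Return the route carrying at least `share_threshold` of all 5xx
    traffic, or None when failures are spread across routes.

    `share_threshold` is an assumed cut-off; tune it to your traffic mix.
    """
    total = sum(rates.values())
    if total <= 0:
        return None  # no failures recorded at all
    route, rate = max(rates.items(), key=lambda kv: kv[1])
    return route if rate / total >= share_threshold else None
```

For example, `dominant_route({"/v1/run": 9.0, "/v1/restore": 0.5})` returns `"/v1/run"`, while three routes failing at equal rates return `None`, which suggests looking at a shared dependency first.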

# 3. Compare against the same window yesterday: is this new?
sum(rate(tensor_wasm_http_requests_total{status=~"5.."}[30m]))
  /
sum(rate(tensor_wasm_http_requests_total[30m]))
  -
sum(rate(tensor_wasm_http_requests_total{status=~"5.."}[30m] offset 1d))
  /
sum(rate(tensor_wasm_http_requests_total[30m] offset 1d))

A large positive delta means the elevated error rate is new in the last 24 hours and worth correlating to a deploy or config change. A near-zero delta means the system has been like this for a while and the alert just crossed its window threshold.

# 4. Are the failures concentrated on one tenant?
sum by (tenant) (rate(tensor_wasm_instance_terminations_total[30m]))

This uses an existing metric to spot a tenant that is repeatedly spawning and failing — a tenant-specific bug pattern is a common slow-burn cause.

Mitigation steps

The slow-burn rate is patient enough to leave room for diagnosis before mitigation. Stop after step 2 if the dashboard recovers.

  1. Open the dashboard alongside the queries. Confirm the SLO summary availability_http (30d) panel is dropping, and identify the dominant failing route from query 2.
  2. If the cause is obvious from logs, fix forward. journalctl -u tensor-wasm --since '6 hours ago' | grep -iE 'error|warn' | tail -200 often surfaces a recurring message that points at the cause (e.g. a tenant exceeding quota repeatedly, an upstream timeout). A targeted config change is faster than a rollback if the root cause is genuinely known.
  3. If a recent deploy correlates with the start of the burn, roll back. Compare deploy timestamps (git log --since '6 hours ago' on the release branch, or your deployment system's audit log) to the 6-hour query window. Use rollback.md if the correlation is plausible — the slow burn at 6× still consumes budget quickly enough that an unnecessary rollback is cheaper than a missed real cause.
  4. If a single tenant is the cause, isolate them. Use the tensor-wasm-cli admin commands to revoke or rate-limit the offending tenant's bearer token: tensor-wasm admin token revoke <token-id> or set a stricter QPS in the per-tenant config. Document the action for the postmortem.
  5. If no clear cause emerges in 30 minutes, escalate. This alert is patient but not infinite. See When to page.
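For step 2, a recurring message is easier to spot when the log lines are grouped and counted instead of read with tail. A minimal sketch, assuming journalctl's default short output format (`<timestamp> <host> <unit>[<pid>]: <message>`) and assuming that collapsing digit runs is an acceptable way to group similar messages; the function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

_PREFIX = re.compile(r"^.*?\]: ")          # timestamp, host, unit[pid]
_NUMBER = re.compile(r"\d+")               # ids, durations, counters
_LEVEL = re.compile(r"error|warn", re.IGNORECASE)


def top_messages(lines: Iterable[str], n: int = 5) -> List[Tuple[str, int]]:
    """Count error/warn messages after stripping the journal prefix and
    replacing digit runs with 'N', most frequent first."""
    counts: Counter = Counter()
    for line in lines:
        message = _PREFIX.sub("", line.strip(), count=1)
        if _LEVEL.search(message):
            counts[_NUMBER.sub("N", message)] += 1
    return counts.most_common(n)
```

Pipe the journalctl output from step 2 into a script that calls `top_messages(sys.stdin)`. Note that normalising digits also merges numeric tenant ids, so use query 4 rather than this count when checking for a single noisy tenant.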

Root-cause hypotheses

Ranked by base rate; the slow-burn alert tends to surface different root causes than the fast-burn one.

| Hypothesis | How to confirm | How to fix |
| --- | --- | --- |
| One tenant is repeatedly hitting a quota or timeout, failing 100% of their calls | Per-tenant termination rate (query 4); journalctl -u tensor-wasm \| grep '<tenant_id>' | Revoke or rate-limit their token; engage the tenant; raise quota if legitimate |
| Snapshot disk degraded (slow but not yet failed): restore calls increasingly time out at 30 s | iostat -x 5 shows high %util on the snapshot device; tensor-wasm observe --once shows restore P95 climbing | Move snapshots to a faster volume; investigate the storage backend's own metrics |
| Memory leak slowly approaching the per-process limit; per-call allocations start failing before OOM kill | RSS climbing in systemctl status tensor-wasm or cat /proc/$(pidof tensor-wasm)/status \| grep VmRSS; tensor_wasm_active_instances not draining | Restart tensor-wasm; file an issue with the RSS-over-time graph |
| Wasmtime engine accumulating compiled modules without eviction | tensor_wasm_active_instances high and stable; restart drops error rate immediately | Tune the engine's instance limit downward; consider periodic warm restart until the leak is fixed |
| Upstream dependency (auth backend, OTLP collector) slow and triggering tower middleware timeouts | curl directly against the dependency; tracing spans show long http.request parent with no child progress | Restore or temporarily bypass the dependency; tighten the middleware timeout if it is too loose |
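For the memory-leak hypothesis, a single VmRSS reading says little; the trend matters. The sketch below parses the VmRSS line from the text of /proc/<pid>/status (a standard Linux format) and estimates growth between the first and last of a series of samples. How the samples are collected and what growth rate counts as a leak are left open; both helper names are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def parse_vmrss_kb(status_text: str) -> Optional[int]:
    """Extract resident set size in kB from /proc/<pid>/status contents."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def rss_growth_kb_per_hour(samples: List[Tuple[float, int]]) -> float:
    """Average growth between the oldest and newest sample.

    `samples` is a list of (unix_seconds, rss_kb), oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (rss1 - rss0) * 3600 / (t1 - t0)
```

A steadily positive rate over the 6-hour alert window, combined with tensor_wasm_active_instances not draining, supports the leak hypothesis; the collected samples are also the RSS-over-time data the fix column asks you to attach to the issue.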

When to page

Escalate to sev-1 if any of the following:

  • The 30-minute error rate stays above 0.030 for more than two hours and no actionable hypothesis has been identified.
  • The slow burn upgrades to a fast burn (the availability-fast-burn.md alert starts firing). Handle that page first — it is louder for a reason.
  • The dashboard's availability_http (30d) panel crosses below 99.3% (at that point the 99.5% SLO is already breached in the rolling window, i.e. the error budget is overspent).
  • A single tenant is identified as the cause but cannot be isolated via token revocation (e.g. the auth subsystem is also degraded).
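The four escalation conditions can be written down as a checklist function, which is useful if you want the on-call tooling to remind you when a threshold has been crossed. This is a sketch only: the `IncidentState` fields are hypothetical names for values the responder would fill in by hand, and the thresholds are the ones quoted in this section.

```python
from dataclasses import dataclass
from typing import List

ALERT_THRESHOLD = 0.030      # 6 x 0.005, from the first-look queries
AVAILABILITY_FLOOR = 0.993   # 30-day panel value named in this section


@dataclass
class IncidentState:
    err_30m: float                  # current 30-minute error ratio
    hours_above_threshold: float    # how long err_30m has exceeded 0.030
    has_actionable_hypothesis: bool
    fast_burn_firing: bool          # availability-fast-burn alert state
    availability_30d: float         # value of the availability_http (30d) panel
    tenant_isolation_failed: bool   # tenant identified but revocation did not work


def sev1_reasons(state: IncidentState) -> List[str]:
    """Return every escalation condition that currently holds."""
    reasons = []
    if (state.err_30m > ALERT_THRESHOLD
            and state.hours_above_threshold > 2
            and not state.has_actionable_hypothesis):
        reasons.append("above threshold for over two hours with no hypothesis")
    if state.fast_burn_firing:
        reasons.append("slow burn upgraded to fast burn")
    if state.availability_30d < AVAILABILITY_FLOOR:
        reasons.append("30-day availability below 99.3%")
    if state.tenant_isolation_failed:
        reasons.append("offending tenant cannot be isolated")
    return reasons
```

An empty list means none of the conditions holds yet; any non-empty result is a reason to escalate to sev-1.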

Postmortem checklist

  • Save journalctl -u tensor-wasm --since '<incident_start - 30m>' --until '<incident_end + 30m>' > /tmp/tensor-wasm-slow-burn.log.
  • Capture the Prometheus snapshot for the full 6-hour window plus 30 minutes either side.
  • If a tenant was isolated during mitigation, file an issue against that tenant's onboarding ticket (or notify their owner channel) so the action is undone or formalised.
  • Note the actual root cause in the issue's body even if the dashboard recovered before it was confirmed — a slow burn that auto-resolves is still worth attributing.
  • Update the per-tenant config or quota defaults if the cause was a quota that was set too high.
  • If the cause traces back to a release, follow up with a regression test added before the next release.
  • Cross-reference the incident in the next CHANGELOG entry under "Operator-visible behaviour change" per SLO.md §9.
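The first checklist item needs concrete timestamps padded by 30 minutes on each side. A small sketch that computes them in the `YYYY-MM-DD HH:MM:SS` form journalctl accepts for --since and --until; the function name is hypothetical, and the incident times are whatever you recorded during the response.

```python
from datetime import datetime, timedelta
from typing import Tuple

JOURNAL_FORMAT = "%Y-%m-%d %H:%M:%S"


def journal_window(incident_start: datetime,
                   incident_end: datetime,
                   pad: timedelta = timedelta(minutes=30)) -> Tuple[str, str]:
    """Return (--since, --until) values padded on both sides by `pad`."""
    if incident_end < incident_start:
        raise ValueError("incident_end precedes incident_start")
    since = (incident_start - pad).strftime(JOURNAL_FORMAT)
    until = (incident_end + pad).strftime(JOURNAL_FORMAT)
    return since, until
```

For an incident from 10:00 to 12:15 on 2024-01-05, this yields `2024-01-05 09:30:00` and `2024-01-05 12:45:00`, which slot directly into the journalctl command in the first item. The timestamps are interpreted in the host's local time zone unless you specify otherwise.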