Craton TensorWasm — Service Level Objectives (v0.3)
This document is the project's commitment to numeric availability,
latency, and error-rate targets for the public HTTP surface and the
kernel-dispatch path. It is the operator-facing contract: every alert
in docs/runbooks/ and every panel in
docs/dashboards/tensor-wasm-overview.json traces back to a Service
Level Indicator (SLI) and a Service Level Objective (SLO) defined
here.
Status: v0.3 gate. The targets are conservative, intentionally
under-promised, and tied to the host-only baselines measured today
(bench-results/baseline.json). They tighten in v0.4-v0.5 once the
S22 self-hosted CUDA runner replaces the modeled CUDA estimates with
measured numbers.
Contents
- Purpose
- SLI definitions
- SLO targets
- Error budget calculations
- Burn-rate alerts
- Dashboards
- Runbook mapping
- Disclosure: measured vs modeled
- How to change an SLO
1. Purpose
An SLO in this document is an externally-visible promise about
the behaviour of a running TensorWasm node: "the HTTP API is up at
least X% of the time, and 95% of /invoke calls finish within Y
milliseconds". When an SLO is missed for long enough that the error
budget is consumed, the operator is expected to act — roll back the
release, page the on-call, freeze deployments, or open an incident.
SLOs are about user-visible behaviour, not about internal
implementation details.
This is not the same as a bench target. The numbers in
PERFORMANCE.md are CI regression guards measured
in micro-benchmarks under controlled conditions: a single host with
no contention, Criterion warm-up, outlier rejection. The numbers in
this document are operational bounds under real load: noisy
neighbours, network jitter, OS scheduler stalls, log-rotation
hiccups, snapshot capture happening on the same disk. As a rule of
thumb, an SLO target is one to three orders of magnitude looser than
the corresponding bench median, because production includes
everything the bench excludes.
If a bench regression fires but an SLO does not, the regression is worth investigating but is not necessarily a user-visible problem. If an SLO fires but the benches are green, the production stack is hiding behaviour the benches do not exercise, which is a higher-priority signal.
2. SLI definitions
Each SLI is a single Prometheus expression returning a unitless
ratio (for availability and error-rate) or a duration in seconds (for
latency). The expressions assume the standard Prometheus scrape
interval (15 s) and standard tower-http instrumentation. As of W2.3,
every metric referenced below is emitted by the gateway — the PromQL
runs as-is against a live /metrics scrape.
The metrics that do exist today, audited against
crates/tensor-wasm-core/src/metrics.rs, are:
- tensor_wasm_active_instances (gauge)
- tensor_wasm_gpu_memory_used_bytes (gauge)
- tensor_wasm_gpu_memory_bytes_per_tenant{tenant_id} (gauge family, C3)
- tensor_wasm_kernel_dispatches_total (counter)
- tensor_wasm_kernel_latency_seconds (histogram, 14 buckets, 10 µs – 10 s)
- tensor_wasm_instance_spawns_total (counter)
- tensor_wasm_instance_terminations_total (counter)
- tensor_wasm_offload_success_total (counter)
- tensor_wasm_offload_fallback_total (counter)
- tensor_wasm_jobs_active (gauge, single series, C3)
The HTTP-level metrics referenced below
(tensor_wasm_http_requests_total,
tensor_wasm_http_request_duration_seconds) are emitted by the
tensor_wasm_api::http_metrics middleware as of W2.3. Labels are
method (GET, POST, DELETE), route (the matched axum route
template, e.g. /functions/:id/invoke, never the substituted value),
and status (the numeric HTTP status code as a string). The
PromQL expressions below are executable against a running gateway.
2.1 availability_http
Ratio of non-5xx HTTP responses to all HTTP responses over the trailing 30 days.
sum(rate(tensor_wasm_http_requests_total{status!~"5.."}[30d]))
/
sum(rate(tensor_wasm_http_requests_total[30d]))
Rationale for excluding 4xx: a 404 on GET /functions/<unknown>
or a 401 on a missing bearer token is a client-side error, not a
TensorWasm fault. We report client errors in the dashboard but do not
spend availability budget on them. 429 is similarly excluded: the
rate limiter doing its job is not a service failure.
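As a rough illustration of the 4xx exclusion, the sketch below computes the same ratio from per-status request counts for a single window. The counts and the function name are hypothetical; the real SLI is the PromQL expression above, evaluated over rates rather than raw counts.

```python
from typing import Dict


def availability(status_counts: Dict[str, int]) -> float:
    """Return the fraction of responses whose status is not 5xx.

    4xx responses, including 429, count as good, mirroring the
    status!~"5.." matcher in the expression above.
    """
    total = sum(status_counts.values())
    if total == 0:
        raise ValueError("no requests in the window")
    server_errors = sum(
        count
        for status, count in status_counts.items()
        if len(status) == 3 and status.startswith("5")
    )
    return (total - server_errors) / total


# Hypothetical per-status request counts for one window.
sample = {"200": 9_700, "401": 120, "404": 80, "429": 50, "500": 30, "504": 20}
ratio = availability(sample)
print(f"availability = {ratio:.4f}")
```

Only the 500 and 504 responses reduce the ratio here; the 401, 404, and 429 responses stay in the numerator.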
2.2 latency_http_healthz
P95 latency of GET /healthz over the trailing 5 minutes.
histogram_quantile(
0.95,
sum by (le) (
rate(tensor_wasm_http_request_duration_seconds_bucket{
route="/healthz",
method="GET"
}[5m])
)
)
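To make the P95 estimate concrete, here is a simplified Python model of how a quantile is read off cumulative histogram buckets by linear interpolation. It is a sketch of the general technique, not the Prometheus source, and the bucket bounds and counts are invented for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target
    rank. The last bucket must have an upper bound of +Inf.
    """
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must end with an le=+Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return ordered[-2][0] if len(ordered) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for /healthz, bounds in seconds.
healthz = [(0.001, 600), (0.005, 900), (0.01, 980), (0.05, 1000), (math.inf, 1000)]
p95 = histogram_quantile(0.95, healthz)
print(f"estimated P95 = {p95 * 1000:.3f} ms")
```

Because the estimate is interpolated, its precision depends on how close the bucket boundaries sit to the SLO threshold.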
/healthz is the liveness probe; it does no work and exists to
detect that the process is serving. Its P95 is a pure proxy for
"is the axum router stuck".
2.3 latency_http_invoke
P95 end-to-end latency of POST /functions/{id}/invoke over the
trailing 5 minutes. The 30-second per-call deadline (see
API.md middleware section) is the worst-case ceiling; the SLO
is far tighter than that for the common path.
histogram_quantile(
0.95,
sum by (le) (
rate(tensor_wasm_http_request_duration_seconds_bucket{
route="/functions/:id/invoke",
method="POST"
}[5m])
)
)
Two flavours of this SLO are tracked separately (see
SLO targets): one for host-only invocations
(guest never touches WASI-CUDA), one for invocations that issue at
least one kernel dispatch. They are distinguished post-hoc in the
dashboard via tenant-tagged join on
tensor_wasm_kernel_dispatches_total; the underlying histogram is
a single series.
2.4 latency_dispatch_serial
P95 of single-stream kernel dispatch latency (back-pressure permit
acquire → future poll → completion) over the trailing 5 minutes.
This is measured against the existing
tensor_wasm_kernel_latency_seconds histogram.
histogram_quantile(
0.95,
sum by (le) (
rate(tensor_wasm_kernel_latency_seconds_bucket[5m])
)
)
Note: the existing histogram does not distinguish serial from
concurrent dispatch, nor host-stub from real CUDA. Once the S22
runner lands and a real CUDA event-based sync replaces the immediate-
resolve stub described in RISKS.md, this SLI will split into
{dispatch_kind="serial"} and {dispatch_kind="concurrent"} series.
Until then, the SLO covers the aggregate.
2.5 error_rate_invoke
Ratio of 5xx responses on /invoke to all responses on /invoke,
trailing 5 minutes. Aligned with the API error envelope in
crates/tensor-wasm-api/API.md — wasmtime, internal, and
invoke_timeout (a 504 — counted here because it indicates a TensorWasm
failure to meet its own deadline) all roll up under "5xx for the
purposes of this SLI".
(
sum(rate(tensor_wasm_http_requests_total{
route="/functions/:id/invoke",
method="POST",
status=~"5.."
}[5m]))
)
/
(
sum(rate(tensor_wasm_http_requests_total{
route="/functions/:id/invoke",
method="POST"
}[5m]))
)
3. SLO targets
Targets are evaluated per calendar quarter. Each target is the floor the project commits to publicly; we expect to exceed it on healthy days, and we treat sustained miss as an incident.
| SLI | Target (v0.3) | Window | Headroom vs measured today |
|---|---|---|---|
| availability_http | 99.5% | 30 d rolling | No production telemetry yet; conservative for a pre-1.0 release |
| latency_http_healthz_P95 | ≤ 10 ms | 5 m rolling | Bench P50 ~30–60 µs (PERFORMANCE.md) → ~150× headroom |
| latency_http_invoke_P95 (host-only) | ≤ 100 ms | 5 m rolling | Bench e2e/create_function P50 ~40–80 µs → ~1000× headroom |
| latency_http_invoke_P95 (with GPU dispatch) | ≤ 500 ms | 5 m rolling | Modeled: CUDA event-sync path lands in v0.2; SLO ratifies in v0.3 |
| latency_dispatch_serial_P95 (host-only) | ≤ 50 µs | 5 m rolling | Baseline median 50 µs at dispatch/serial/10 → exactly the floor; see disclosure |
| latency_dispatch_serial_P95 (CUDA-host) | TBD (v0.4) | 5 m rolling | S22 numbers must land before this SLO can be set |
| error_rate_invoke | ≤ 1.0% | 5 m rolling | No production telemetry; budget chosen to align with availability |
3.1 Rationale
availability_http: 99.5% is "two nines and a half". It allows
~3 h 36 m of unavailability per 30 days. For v0.x this is deliberate
under-promising: TensorWasm has no high-availability story yet (single-
host runtime, no built-in failover), so the SLO should not pretend
that the binary's worst-case downtime is bounded by anything other
than the operator's deploy and restart cadence. v1.0 will tighten
this to 99.9% (≈43 m / month), once docs/UPGRADE.md lands the
rolling-restart procedure.
latency_http_healthz_P95: 10 ms is the operationally meaningful
bound — it includes the network hop from the load balancer, OS
scheduler, and the axum middleware stack. The bench P50 of ~30-60 µs
is the lower bound of what /healthz could ever achieve under
ideal conditions; the SLO covers the realistic case of a noisy
neighbour and a load balancer that sometimes sleeps.
latency_http_invoke_P95: 100 ms (host-only) / 500 ms (with GPU dispatch) is sized to comfortably cover a small-payload synchronous
invocation including instance spawn, single export call, and
terminate. The 500 ms variant adds room for one PTX load + one kernel
launch + one sync, modelled against the v0.2 CUDA event-sync target
of 5-20 µs per dispatch plus PCIe transfer cost for inputs/outputs.
These will tighten significantly once we have production telemetry.
latency_dispatch_serial_P95: 50 µs (host-only) is exactly the
baseline median from bench-results/baseline.json
(dispatch/serial/10, median 50 000 ns). The SLO matches the bench
floor on purpose: dispatch on the host-stub path is essentially a
Tokio semaphore round-trip, and an SLO that allows worse than 50 µs
P95 on the stub path would mask scheduler regressions. Note that the
bench figure is a P50 and the SLO is a P95: at the same numeric value
the SLO is the stricter bound, because 95% of dispatches (not 50%)
must finish within 50 µs. The CUDA-host variant of this
SLO is TBD because no measured CUDA dispatch latency exists in
the repo today; setting a number now would be a wild guess and we
prefer the honest "TBD" to a wrong number.
error_rate_invoke: ≤ 1.0% complements the availability SLO at
finer time resolution. 1% over a 5-minute window catches a fault
that an availability burn-rate alarm would not page on for hours.
3.2 Out of scope for v0.3
Quantities we deliberately do not set SLOs on at this gate:
- Snapshot capture/restore latency. Sized by payload, dominated by disk I/O. Measured in cold_start/* benches, alarmed via dashboard thresholds, but no SLO until v0.5.
- JIT compile latency. Cache-hit on a warm path is the common case; cache-miss is rare and bursty. Tracked in dashboard.
- Tenant context-switch latency. Below the 5-µs done-when target documented in PERFORMANCE.md; no SLO until multi-tenant deployments are common enough to need one.
- GPU memory pressure. A gauge, not a rate; alarmed via threshold, not via SLO.
4. Error budget calculations
The error budget is the amount of bad behaviour the SLO permits before it is considered violated. It is the operator's accounting unit for risk: if there is budget left, ship. If there is not, freeze deploys until the budget refills.
4.1 Availability budget
availability_http: 99.5% over 30 days = 0.5% downtime allowed.
30 d × 24 h × 60 m = 43 200 minutes per window
0.5% × 43 200 m = 216 minutes = 3 h 36 m
Per-day amortised budget: 216 / 30 = 7.2 minutes/day.
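The same arithmetic as a small Python sketch, using exact fractions to avoid floating-point noise. The function name is illustrative; the 99.9% figure is the v1.0 target mentioned in the rationale above.

```python
from fractions import Fraction

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(target_percent: str, window_days: int) -> Fraction:
    """Minutes of unavailability an availability target allows per window."""
    allowed_fraction = 1 - Fraction(target_percent) / 100
    return allowed_fraction * window_days * MINUTES_PER_DAY


budget = downtime_budget_minutes("99.5", 30)      # v0.3 target
per_day = budget / 30                             # amortised daily budget
tightened = downtime_budget_minutes("99.9", 30)   # planned v1.0 target

print(float(budget), float(per_day), float(tightened))
```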
4.2 Latency budget
For the latency SLOs, "budget consumption" is the fraction of
requests that fell above the SLO threshold during the rolling 5 m
window. Example: if up to 5% of /invoke calls took longer than 100 ms
in the last 5 m, the window's P95 is still at or below 100 ms, so
the SLO is met by definition (the 5% is the tail above P95,
not a violation). Latency SLOs are violated by the P95 of the
window itself exceeding the bound, not by individual slow calls.
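A minimal sketch of that distinction, using a nearest-rank P95 over raw samples. This is an assumption made for readability: the SLI itself is estimated from histogram buckets, and the sample windows below are invented.

```python
from typing import List


def p95_nearest_rank(samples_ms: List[float]) -> float:
    """Nearest-rank P95: smallest sample with at least 95% of samples at or below it."""
    if not samples_ms:
        raise ValueError("empty window")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


SLO_MS = 100.0

# Hypothetical windows of 100 /invoke calls each.
five_slow = [20.0] * 95 + [400.0] * 5  # 5% of calls above the bound
six_slow = [20.0] * 94 + [400.0] * 6   # 6% of calls above the bound

met = p95_nearest_rank(five_slow) <= SLO_MS
violated = p95_nearest_rank(six_slow) > SLO_MS
print(met, violated)
```

The first window contains slow calls yet still meets the SLO; the second crosses the 5% tail allowance and violates it.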
4.3 Error-rate budget
error_rate_invoke: ≤ 1.0% over 5 minutes is the
simultaneous-window view: at any moment, the past 5 m of
/invoke traffic must have fewer than 1% server-side failures.
For monthly accounting we report budget consumption as
1 - (1 - observed_5m_max) / (1 - SLO_target) summed over windows,
but the alerts care only about the rolling number.
4.4 Spending the budget
The budget exists to be spent on calculated risks. Examples of legitimate spend:
- Rolling out a feature flag to 10% of traffic and accepting up to X minutes of degradation in that slice.
- Running a destructive migration that briefly serves 503 while a tenant's state moves.
- Doing a planned maintenance restart during a low-traffic window.
Examples of illegitimate spend (these are bugs, not features):
- An unintentional infinite loop in a handler.
- An OOM crash because a deploy raised a memory limit without testing.
- A regression in a dependency picked up by a routine cargo update.
4.5 Rollback when budget consumed
If a release consumes more than 50% of the monthly availability budget in a single 24-hour window (i.e. >108 minutes of downtime or >50% of error budget on the latency/error SLOs), the operator should:
- Page the on-call (docs/runbooks/oncall-paging.md).
- Roll back to the last known-good release using the procedure in docs/runbooks/rollback.md.
- Open an incident retrospective; do not redeploy until the regression's root cause is identified and tested for.
- If the budget is fully consumed, freeze all non-rollback deploys until the next 30-day window opens. Security patches are the only exception; document them in the incident retro.
This is a self-imposed gate, not an automatic enforcement. The project trusts the operator to follow it.
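For operators who want to check the thresholds by hand, the following sketch encodes the availability side of the rule. It is an aid for the arithmetic only, not an enforcement mechanism; the function name and action strings are hypothetical.

```python
from typing import List

BUDGET_MINUTES = 216.0  # 0.5% of a 30-day window


def recommended_actions(release_downtime_24h_min: float,
                        window_downtime_min: float) -> List[str]:
    """Map observed downtime to the steps listed above (availability SLO only)."""
    actions: List[str] = []
    if release_downtime_24h_min > 0.5 * BUDGET_MINUTES:
        # More than half the monthly budget burned in one day by one release.
        actions += ["page on-call", "roll back", "open retrospective"]
    if window_downtime_min >= BUDGET_MINUTES:
        # Budget fully consumed for the current 30-day window.
        actions.append("freeze non-rollback deploys")
    return actions


print(recommended_actions(release_downtime_24h_min=110, window_downtime_min=110))
print(recommended_actions(release_downtime_24h_min=30, window_downtime_min=60))
```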
5. Burn-rate alerts
Following the Google SRE workbook multi-window multi-burn-rate pattern, three alert pairs cover fast, slow, and very-slow consumption of the availability budget. Each pair fires when both windows exceed the burn rate simultaneously, which suppresses brief spikes and short outages that auto-recover.
The burn rate is actual_error_rate / SLO_error_budget_rate. For a
99.5% SLO, the budget rate is 0.5% = 0.005; a burn rate of 14.4×
means errors at 7.2%, which would exhaust the 30-day budget in
~50 hours if sustained.
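The relationship between burn rate, error ratio, and time to exhaustion can be checked with a few lines of Python. The three burn rates are the ones used by the alerts in this section; the function names are illustrative.

```python
from fractions import Fraction

WINDOW_HOURS = 30 * 24             # 30-day SLO window
BUDGET_RATE = Fraction("0.005")    # 1 - 0.995


def error_ratio_at(burn: Fraction) -> Fraction:
    """Error ratio that corresponds to a given burn-rate multiple."""
    return burn * BUDGET_RATE


def hours_to_exhaustion(burn: Fraction) -> Fraction:
    """Hours until the whole 30-day budget is spent at a sustained burn rate."""
    return Fraction(WINDOW_HOURS) / burn


for label in ("14.4", "6", "1"):
    burn = Fraction(label)
    print(label, float(error_ratio_at(burn)), float(hours_to_exhaustion(burn)))
```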
5.1 Fast burn (page)
Errors at 14.4× the budgeted rate, sustained over both a 5-minute and a 1-hour window. At this rate, the monthly budget is gone in 50 hours.
(
sum(rate(tensor_wasm_http_requests_total{status=~"5.."}[1h]))
/
sum(rate(tensor_wasm_http_requests_total[1h]))
> (14.4 * 0.005)
)
and
(
sum(rate(tensor_wasm_http_requests_total{status=~"5.."}[5m]))
/
sum(rate(tensor_wasm_http_requests_total[5m]))
> (14.4 * 0.005)
)
Severity: page. Runbook: docs/runbooks/availability-fast-burn.md.
5.2 Slow burn (page)
Errors at 6× the budgeted rate (3%), sustained over 30 m and 6 h windows. Catches a degradation that is not fast enough to trigger the 14.4× page but would still exhaust the 30-day budget in 5 days if sustained.
(
sum(rate(tensor_wasm_http_requests_total{status=~"5.."}[6h]))
/
sum(rate(tensor_wasm_http_requests_total[6h]))
> (6 * 0.005)
)
and
(
sum(rate(tensor_wasm_http_requests_total{status=~"5.."}[30m]))
/
sum(rate(tensor_wasm_http_requests_total[30m]))
> (6 * 0.005)
)
Severity: page. Runbook: docs/runbooks/availability-slow-burn.md.
5.3 Very-slow burn (ticket)
Errors at 1× the budgeted rate sustained over 6 h and 3 d windows. At this rate the budget is depleted exactly at 30 days, so this is a "you are using all of your budget" warning rather than an emergency. Files a ticket, does not page.
(
sum(rate(tensor_wasm_http_requests_total{status=~"5.."}[3d]))
/
sum(rate(tensor_wasm_http_requests_total[3d]))
> (1 * 0.005)
)
and
(
sum(rate(tensor_wasm_http_requests_total{status=~"5.."}[6h]))
/
sum(rate(tensor_wasm_http_requests_total[6h]))
> (1 * 0.005)
)
Severity: ticket. Runbook: docs/runbooks/availability-very-slow-burn.md.
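The pairing logic of the three alerts above can be modelled as follows. This is a sketch of the evaluation rule only; the class and function names are hypothetical, and the observed ratios are made up.

```python
from dataclasses import dataclass

BUDGET_RATE = 0.005  # 1 - 0.995


@dataclass(frozen=True)
class BurnAlert:
    name: str
    burn: float
    long_window: str
    short_window: str
    severity: str


ALERTS = (
    BurnAlert("fast", 14.4, "1h", "5m", "page"),
    BurnAlert("slow", 6.0, "6h", "30m", "page"),
    BurnAlert("very-slow", 1.0, "3d", "6h", "ticket"),
)


def fires(alert: BurnAlert, long_ratio: float, short_ratio: float) -> bool:
    """An alert fires only when both windows exceed burn * budget rate."""
    threshold = alert.burn * BUDGET_RATE
    return long_ratio > threshold and short_ratio > threshold


fast = ALERTS[0]
sustained = fires(fast, long_ratio=0.08, short_ratio=0.09)    # both windows hot
brief_spike = fires(fast, long_ratio=0.02, short_ratio=0.09)  # only the 5 m window hot
recovered = fires(fast, long_ratio=0.08, short_ratio=0.001)   # outage already over
print(sustained, brief_spike, recovered)
```

Requiring both windows is what suppresses the brief spike and the already-recovered outage.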
5.4 Latency burn-rate alerts
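Latency SLOs compare the windowed P95 directly against the SLO bound rather than using the multi-burn-rate pattern, because the 5-m P95 is itself the burn-rate analogue (a window already smooths the per-request noise). The paging alerts additionally require the bound to be exceeded over a 1-hour window; the /healthz ticket uses a single 30-minute window.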
Fast latency-burn on /invoke (page)
histogram_quantile(
0.95,
sum by (le) (
rate(tensor_wasm_http_request_duration_seconds_bucket{
route="/functions/:id/invoke",
method="POST"
}[5m])
)
) > 0.5
and
histogram_quantile(
0.95,
sum by (le) (
rate(tensor_wasm_http_request_duration_seconds_bucket{
route="/functions/:id/invoke",
method="POST"
}[1h])
)
) > 0.5
The threshold 0.5 is in seconds, matching the 500 ms SLO for
invocations that issue GPU dispatch. The host-only 100 ms variant
fires a separate alert keyed off the same metric.
Severity: page. Runbook: docs/runbooks/invoke-latency-spike.md.
Sustained latency-burn on /healthz (ticket)
histogram_quantile(
0.95,
sum by (le) (
rate(tensor_wasm_http_request_duration_seconds_bucket{
route="/healthz",
method="GET"
}[30m])
)
) > 0.01
/healthz exceeding 10 ms P95 sustained over 30 m is usually a sign
of a stuck event loop, not an outage; ticket rather than page.
Severity: ticket. Runbook: docs/runbooks/healthz-slow.md.
5.5 Dispatch-latency burn (page)
histogram_quantile(
0.95,
sum by (le) (
rate(tensor_wasm_kernel_latency_seconds_bucket[5m])
)
) > 0.00005
and
histogram_quantile(
0.95,
sum by (le) (
rate(tensor_wasm_kernel_latency_seconds_bucket[1h])
)
) > 0.00005
0.00005 is 50 µs, matching the host-only latency_dispatch_serial_P95
SLO. This alert is meaningful only on host-only deployments; on
CUDA-host nodes the threshold must be widened to the CUDA-host
SLO once that number is set (v0.4 work).
Severity: page on host-only deployments. Runbook:
docs/runbooks/dispatch-latency-spike.md.
5.6 Why no separate error_rate_invoke alert
The fast/slow/very-slow burn alerts above operate on
tensor_wasm_http_requests_total across all routes. An
/invoke-specific burn alert would page on the same conditions a few
seconds earlier and would only duplicate pages for the same
underlying incidents. v0.4 may split the alerts per-route if the
operational data justifies it.
6. Dashboards
The reference Grafana dashboard committed at
docs/dashboards/tensor-wasm-overview.json (added in a sibling task
under the v0.3 milestone) renders each SLI above and overlays the
SLO threshold as a horizontal line. Panels:
- Availability: 30-day rolling, with the 99.5% target as a reference line and the consumed-budget bar in the corner.
- HTTP P50/P95/P99 latency by route: one panel per route in the API, with the SLO threshold marked for /healthz and /invoke.
- Error rate by route and status family: stacked, with the 1% threshold marked on /invoke.
- Kernel latency: P50/P95/P99 from tensor_wasm_kernel_latency_seconds, with the 50-µs host-only SLO threshold marked.
- Burn-rate: three panels, one per alert pair, with the alert threshold marked. A panel that crosses its threshold is the pre-page warning the operator sees in the dashboard.
- Active instances / GPU memory / spawn-terminate rate: capacity panels, no SLO threshold.
- JIT cache hit ratio / offload success vs fallback: efficiency panels, no SLO threshold.
Dashboard import is a single JSON upload; the file is meant to be checked in alongside this document and updated under the same PR process whenever an SLO changes.
7. Runbook mapping
Each alert in Section 5 links to a one-page
runbook with: what the alert means, what to check first, common
causes, mitigation steps, and the criteria for paging the next
escalation tier. The runbooks themselves are out of scope for this
document — they are tracked under the v0.3 milestone in
PATH-TO-V1.md.
| Alert | Severity | Runbook |
|---|---|---|
| Availability fast burn (14.4×, 5 m + 1 h) | Page | docs/runbooks/availability-fast-burn.md |
| Availability slow burn (6×, 30 m + 6 h) | Page | docs/runbooks/availability-slow-burn.md |
| Availability very-slow burn (1×, 6 h + 3 d) | Ticket | docs/runbooks/availability-very-slow-burn.md |
| /invoke latency spike (P95 > SLO, 5 m + 1 h) | Page | docs/runbooks/invoke-latency-spike.md |
| /healthz slow (P95 > 10 ms, 30 m) | Ticket | docs/runbooks/healthz-slow.md |
| Dispatch latency spike (P95 > 50 µs, 5 m + 1 h) | Page (host-only) | docs/runbooks/dispatch-latency-spike.md |
| Rollback procedure | (manual) | docs/runbooks/rollback.md |
| On-call paging procedure | (manual) | docs/runbooks/oncall-paging.md |
A runbook landing without a corresponding alert is fine; an alert landing without a runbook is a PR blocker.
8. Disclosure: measured vs modeled
The honesty bar mirrors PERFORMANCE.md: every
SLO target above is annotated with whether the headroom claim is
grounded in measurement, extrapolation, or guess.
| SLO | Grounded in | v0.3 follow-up |
|---|---|---|
| availability_http (99.5%) | Modeled. No production deployment exists yet; the number is a conservative pre-1.0 choice. | Recruit a v0.5 design partner per PATH-TO-V1 §6; replace the model with one month of observed data. |
| latency_http_healthz_P95 (10 ms) | Measured floor, modeled SLO. Bench e2e/healthz/get shows P50 ~30-60 µs (PERFORMANCE.md). The 10 ms SLO is a 150× ceiling chosen to bound real-world tail behaviour (network + scheduler), not the bench floor. | None; this is a stable v0.3 commitment. |
| latency_http_invoke_P95 host-only (100 ms) | Measured floor, modeled SLO. Bench e2e/create_function/post P50 ~40-80 µs. SLO at 100 ms gives ~1000× headroom for real payloads and instance spawn cost. | None; stable. |
| latency_http_invoke_P95 with GPU (500 ms) | Fully modeled. No GPU-host invocation latency has been measured; the number combines the modeled CUDA event-sync target (5-20 µs/dispatch) with a worst-case PCIe transfer estimate. | Replace with measured P95 once S22 runner produces real numbers; tighten in v0.4. |
| latency_dispatch_serial_P95 host-only (50 µs) | Measured. Baseline median 50 000 ns at dispatch/serial/10 in bench-results/baseline.json. The SLO equals the bench median, which is honest: it means we promise no worse than today's measured floor. | Tighten once tolerance_pct drops from 50% to 10% in v0.2 re-baseline. |
| latency_dispatch_serial_P95 CUDA-host | TBD. No measurement exists. | S22 runner produces a measured baseline → number set in v0.4 SLO update. Until then, the column reads "TBD". |
| error_rate_invoke (1.0%) | Modeled. No production telemetry. Chosen to align with the availability budget at finer time resolution. | Re-evaluate against design-partner data before v1.0. |
Metrics landed in W2.3 (crates/tensor-wasm-api/src/http_metrics.rs,
shipped alongside this document under the v0.3 milestone):
- tensor_wasm_http_requests_total{route,method,status} (counter)
- tensor_wasm_http_request_duration_seconds_bucket{route,method,status} (histogram)
- tensor_wasm_http_requests_in_flight{route,method} (gauge; capacity panel only, not an SLI)
The instrumentation point is a tower middleware (http_metrics_middleware)
wired into build_router outside bearer_auth, so 401 responses
are also counted — consistent with availability_http and the
burn-rate alerts in Section 5, which evaluate
the ratio over every HTTP response. The route label always carries
the axum route template (e.g. /functions/:id/invoke); the substituted
UUID is never emitted as a label value, and any unmatched path collapses
to route="unknown". Cardinality is bounded by a runtime allow-list
initialised at startup from the route templates registered in
build_router_with_audit; adding a new route to that builder requires
adding the same template to tensor_wasm_api::http_metrics::DEFAULT_ROUTE_ALLOWLIST
or the panel falls through to route="unknown". The PromQL above is
therefore executable today; the alerts in Section 5
can be loaded into Prometheus and will fire once production traffic
exists to exercise them.
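The cardinality bound described above can be pictured with a short Python model. The real middleware is Rust; the names below are hypothetical, and the two templates are simply the ones this document mentions, not the full allow-list.

```python
from typing import Optional

# Illustrative subset of route templates; not the real DEFAULT_ROUTE_ALLOWLIST.
ROUTE_ALLOWLIST = frozenset({"/healthz", "/functions/:id/invoke"})


def route_label(matched_template: Optional[str]) -> str:
    """Return the label value for a request's matched route template.

    Anything outside the allow-list, including requests that matched no
    route at all, collapses to "unknown" so label cardinality stays bounded.
    """
    if matched_template is not None and matched_template in ROUTE_ALLOWLIST:
        return matched_template
    return "unknown"


print(route_label("/functions/:id/invoke"))  # template is kept as-is
print(route_label("/functions/1234/invoke"))  # substituted path is not a template
print(route_label(None))                      # no route matched
```

The label is derived from the template, never from the request path, which is why a substituted identifier cannot leak into a label value.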
Metrics confirmed present today (audited against
crates/tensor-wasm-core/src/metrics.rs and
crates/tensor-wasm-api/src/http_metrics.rs):
- tensor_wasm_active_instances
- tensor_wasm_gpu_memory_used_bytes
- tensor_wasm_kernel_dispatches_total
- tensor_wasm_kernel_latency_seconds (histogram)
- tensor_wasm_instance_spawns_total
- tensor_wasm_instance_terminations_total
- tensor_wasm_offload_success_total
- tensor_wasm_offload_fallback_total
- tensor_wasm_http_requests_total{route,method,status} (counter, W2.3)
- tensor_wasm_http_request_duration_seconds{route,method,status} (histogram, W2.3)
- tensor_wasm_http_requests_in_flight{route,method} (gauge, W2.3)
- tensor_wasm_jobs_active (gauge, single series, C3: number of async-invocation jobs in Pending state in the API-layer job registry; not an SLI but feeds the dashboard capacity row)
- tensor_wasm_gpu_memory_bytes_per_tenant{tenant_id} (gauge family, C3: additive per-tenant breakdown of GPU memory reservation; the pre-existing single-series total at tensor_wasm_gpu_memory_used_bytes is preserved alongside)
Every SLO in this document is now enforceable against the metrics the
gateway emits. The latency_dispatch_serial_P95 SLO sits on
tensor_wasm_kernel_latency_seconds; the four HTTP-keyed SLOs
(availability_http, latency_http_healthz, latency_http_invoke,
error_rate_invoke) sit on the HTTP families landed in W2.3.
9. How to change an SLO
SLO targets are part of the project's public contract. Tightening an SLO is a non-breaking change to users (we promise more) but a potentially breaking change to operators (a node that met the old SLO may fail the new one). Loosening an SLO is the opposite. Both directions require:
- An RFC under rfcs/ per the process documented in rfcs/README.md. The RFC must include:
  - the current target, the proposed target, and the delta;
  - measured data supporting the change (a month of production telemetry, or new bench results, or both);
  - a list of operators consulted and their feedback;
  - the migration plan for any in-flight deployment that meets the old SLO but would miss the new one.
- A CHANGELOG.md entry under the next release's section, classified as "Operator-visible behaviour change". Tightening reads: "SLO <name> tightened from X to Y; see RFC #NNN." Loosening reads the same with a justification.
- Updated dashboard threshold in docs/dashboards/tensor-wasm-overview.json and the corresponding runbook revisions, all landing in the same PR as the SLO change.
Adding a new SLO follows the same process. Removing an SLO is the heaviest change — it requires the RFC plus a six-month deprecation window during which the SLO continues to be evaluated and reported but no longer pages.
Adjusting the burn-rate alert thresholds without changing the SLO target itself does not require an RFC, only a PR with the rationale. The alert is an operational knob; the SLO is the contract.
Related docs
- PATH-TO-V1.md: milestone gates; the v0.3 "SLOs published" criterion is satisfied by this document.
- PERFORMANCE.md: bench targets and the CI regression gate; the floor this document's SLOs sit above.
- OBSERVABILITY.md: tracing schema, OTLP setup, and the existing Prometheus exposition.
- crates/tensor-wasm-api/API.md: HTTP surface this document covers.
- crates/tensor-wasm-core/src/metrics.rs: source of truth for currently-emitted metric names.
- BENCHMARKING.md: external comparison methodology; out of scope for SLOs but cross-checks the floors.
- RISKS.md: v0.1.0 known limitations relevant to several "modeled" disclosures here.
Status: v0.3 gate. Targets are conservative pre-1.0 commitments and will tighten once production telemetry exists. The CUDA-host dispatch SLO is intentionally left TBD pending the S22 runner; prefer "TBD" to a guess.
Dashboards. The reference Grafana dashboard described in
Section 6 is committed at
docs/dashboards/tensor-wasm-overview.json
with an importer-facing companion at
docs/dashboards/README.md. The dashboard's
top row renders the five SLIs defined in
Section 2 — availability_http,
error_rate_invoke, latency_http_healthz_P95,
latency_http_invoke_P95, and latency_dispatch_serial_P95 — as
Stat panels with thresholds matching the targets in
Section 3. Panels whose backing metric is in the
"TODO" column of the dashboard's metric inventory (snapshot
histograms, JIT cache counters, back-pressure gauges, per-tenant
labelling on existing series) render "No data" until those follow-ups
land; no dashboard edit is required to bring them online. The HTTP
request counter, HTTP duration histogram, and HTTP in-flight gauge
landed in W2.3 and render real data today.