Deploying Craton TensorWasm
This document covers running TensorWasm in production: topology, sizing, configuration, and disaster recovery. If you're still onboarding, read GETTING-STARTED.md first; if you're writing the functions, WASM-DEVELOPER-GUIDE.md is your starting point.
1. Production topology
A typical TensorWasm production deployment looks like:
                    +---------------------+
                    |    Load Balancer    |
                    +----------+----------+
                               |
         +---------------------+---------------------+
         |                     |                     |
+--------+--------+   +--------+--------+   +--------+--------+
| tensor-wasm-api |   | tensor-wasm-api |   | tensor-wasm-api |
|      + GPU      |   |      + GPU      |   |      + GPU      |
|      + MPS      |   |      + MPS      |   |      + MPS      |
+--------+--------+   +--------+--------+   +--------+--------+
         |                     |                     |
         +---------------------+---------------------+
                               |
                +--------------+--------------+
                |                             |
          +-----+------+               +------+------+
          | Prometheus |               |  Jaeger /   |
          | + Grafana  |               |    Tempo    |
          +------------+               +-------------+
- tensor-wasm-api replicas sit behind a load balancer (any L7 proxy will do: nginx, Caddy, an ALB). Each replica is stateless.
- MPS (Multi-Process Service) runs as a daemon on each host that has a GPU, providing context isolation between concurrent tenants. See MPS-SETUP.md.
- Prometheus and Grafana are shared across replicas for metrics.
- Jaeger or Tempo is shared for traces.
The gateway tier holds no durable state — the only durable artifact is the snapshot store (covered below).
2. System requirements
| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 8+ cores |
| RAM | 8 GB | 16+ GB |
| GPU | sm_70, CUDA 11.8 | sm_80+, CUDA 12.0+ |
| Disk | HDD | SSD (snapshot I/O dominates cold-start at large payloads) |
| Network | 1 Gb | 10 Gb between gateway and snapshot storage |
The SSD recommendation is not cosmetic. At payload sizes above ~32 MiB, snapshot read latency dominates the end-to-end cold-start budget; HDDs add tens of milliseconds you can't get back. See COLD-START.md for the numbers.
3. Configuration
Craton TensorWasm is configured by environment variables. The full set:
| Variable | Default | Purpose |
|---|---|---|
| TENSOR_WASM_LOG | info | tracing-subscriber filter directive. |
| TENSOR_WASM_OTLP_ENDPOINT | unset | OpenTelemetry collector endpoint (e.g. http://otel:4317). |
| TENSOR_WASM_LISTEN_ADDR | 0.0.0.0:8080 | HTTP bind address for tensor-wasm serve. |
| CUDA_ROOT | autodetected | Override CUDA toolkit location (build- and run-time). |
| CUDA_ARCH | none (required for GPU builds) | Target GPU compute capability for emitted PTX; set to match your GPU (e.g. sm_80 for A100, sm_89 for L4). See CUDA-SETUP.md. |
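The table above lends itself to a pre-flight check that runs before the server starts. The following is a minimal sketch in Python, assuming the variable names and defaults listed in the table. The function name check_env and the validation rules (an sm_NN pattern for CUDA_ARCH, a host:port shape for the listen address) are illustrative assumptions, not something TensorWasm ships.

```python
import re

# Defaults copied from the configuration table; CUDA_ARCH has no default.
DEFAULTS = {
    "TENSOR_WASM_LOG": "info",
    "TENSOR_WASM_LISTEN_ADDR": "0.0.0.0:8080",
}


def check_env(env, gpu_build=True):
    """Return a list of human-readable problems found in `env` (a dict)."""
    problems = []
    merged = {**DEFAULTS, **env}

    # The listen address must look like host:port with a valid TCP port.
    host, sep, port = merged["TENSOR_WASM_LISTEN_ADDR"].rpartition(":")
    if not sep or not host or not port.isdigit() or not 0 < int(port) < 65536:
        problems.append("TENSOR_WASM_LISTEN_ADDR must be host:port")

    # CUDA_ARCH is required for GPU builds; the sm_NN spelling is assumed.
    arch = merged.get("CUDA_ARCH")
    if gpu_build and arch is None:
        problems.append("CUDA_ARCH is required for GPU builds")
    elif arch is not None and not re.fullmatch(r"sm_\d{2,3}a?", arch):
        problems.append(f"CUDA_ARCH {arch!r} does not look like sm_NN")

    # The OTLP endpoint is optional, but if set it should be a URL.
    otlp = merged.get("TENSOR_WASM_OTLP_ENDPOINT")
    if otlp is not None and not re.match(r"https?://", otlp):
        problems.append("TENSOR_WASM_OTLP_ENDPOINT should be an http(s) URL")

    return problems
```

Calling check_env(dict(os.environ)) in a container entrypoint and refusing to start on a non-empty result is one way to surface a missing CUDA_ARCH before the first request does.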
For zero-trust environments, set TENSOR_WASM_API_SNAPSHOT_HMAC_KEY (32-byte hex HMAC key) to authenticate snapshot bytes, and TENSOR_WASM_API_SNAPSHOT_REQUIRE_SIGNATURE=true to refuse unsigned snapshots (see SNAPSHOT-COMPATIBILITY.md).
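A 32-byte key in hex is 64 hexadecimal characters. The sketch below generates one and shows the general shape of signing and verifying a byte string with HMAC. It assumes HMAC-SHA256; this document does not say which hash TensorWasm uses or where the signature lives inside a snapshot, so sign and verify are hypothetical helpers, not the project's API.

```python
import hashlib
import hmac
import secrets


def new_hmac_key_hex():
    """Generate 32 random bytes, hex-encoded (64 characters)."""
    return secrets.token_hex(32)


def sign(key_hex, payload):
    """Return the HMAC-SHA256 tag of `payload` (assumed algorithm)."""
    key = bytes.fromhex(key_hex)
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(key_hex, payload, tag):
    """Compare in constant time so timing does not leak the expected tag."""
    return hmac.compare_digest(sign(key_hex, payload), tag)
```

The output of new_hmac_key_hex() is the kind of value TENSOR_WASM_API_SNAPSHOT_HMAC_KEY expects. Keep it in a secret manager rather than in a manifest checked into source control.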
4. Docker Compose stack
The repo ships a docker-compose.yml at its root that brings up tensor-wasm-api, Prometheus, Grafana, and Jaeger with a single command:
docker compose up
The compose file pins versions, wires up the OTLP exporter, and pre-loads a Grafana dashboard for the four key SLI metrics (see Monitoring below). It's the recommended way to evaluate a real TensorWasm deployment locally before promoting to Kubernetes.
5. Kubernetes
A Helm chart ships at deploy/helm/tensor-wasm/ — it is the recommended way to install on Kubernetes (helm install tensor-wasm ./deploy/helm/tensor-wasm -n tensor-wasm --create-namespace; see the chart's README.md for the full values reference). If you prefer a hand-rolled manifest, a minimal Deployment looks like:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensor-wasm-api
spec:
  replicas: 3
  selector:
    matchLabels: { app: tensor-wasm-api }
  template:
    metadata:
      labels: { app: tensor-wasm-api }
    spec:
      containers:
        - name: tensor-wasm-api
          image: ghcr.io/craton-co/craton-tensor-wasm-api:0.1.0
          args: ["serve", "--addr", "0.0.0.0:8080"]
          ports: [{ containerPort: 8080 }]
          env:
            - { name: TENSOR_WASM_LOG, value: "info" }
            - { name: TENSOR_WASM_OTLP_ENDPOINT, value: "http://otel-collector:4317" }
            - { name: CUDA_ARCH, value: "sm_80" }
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "16Gi"
              cpu: "8"
Pair with the standard nvidia-device-plugin DaemonSet for GPU scheduling.
6. TLS
Do not terminate TLS inside tensor-wasm-api for v0.1 — the axum stack is configured for plaintext H2C. Put a TLS-terminating proxy in front of it: nginx, Caddy, an ALB, or a service mesh sidecar. Any of these is more battle-tested than what we'd ship inline.
A minimal Caddyfile:
api.example.com {
    reverse_proxy tensor-wasm-api:8080
}
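That's it. Caddy obtains and renews a certificate over ACME automatically, provided api.example.com resolves to the host and ports 80 and 443 are reachable from the internet.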
7. Multi-tenant capacity planning
Capacity is bounded by GPU HBM, not CPU or system RAM. The arithmetic:
- Each instance reserves up to EngineConfig::max_memory_bytes (default 256 MiB) of linear memory, which lives in unified memory and is page-backed on the device.
- A 40 GiB A100 therefore supports a theoretical ceiling of 40 GiB / 256 MiB = 160 concurrently resident instances, minus headroom for driver state and PTX caches. In practice plan for 120.
- Per-tenant quotas are tracked through tensor-wasm-tenant via the consume_bytes budget on each tenant principal. Set quotas conservatively; you can always raise them.
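The arithmetic in the list above can be checked with a few lines of Python. This is a sketch for planning only: the function names are hypothetical, and the 25% headroom is simply the ratio implied by the 160 and 120 figures, not a measured value.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def instance_ceiling(hbm_bytes, max_memory_bytes=256 * MIB):
    """Theoretical resident-instance count if every instance uses its full reservation."""
    return hbm_bytes // max_memory_bytes


def planned_capacity(hbm_bytes, max_memory_bytes=256 * MIB, headroom=0.25):
    """Ceiling reduced by a fractional headroom for driver state and PTX caches."""
    return int(instance_ceiling(hbm_bytes, max_memory_bytes) * (1 - headroom))
```

For a 40 GiB device, instance_ceiling(40 * GIB) gives 160 and planned_capacity(40 * GIB) gives 120, matching the figures above. Raising max_memory_bytes lowers both numbers proportionally.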
If you need stronger isolation than MPS provides, run one tensor-wasm-api replica per tenant on a dedicated GPU. The replica is stateless, so this scales horizontally without coordination.
8. Backup and restore
Snapshots are the only durable artifact in a TensorWasm deployment. They contain:
- The Wasm module bytes
- The serialized TensorWasmLinearMemory
- The KernelRegistry state (loaded PTX, kernel ids)
- Engine config and tenant attribution metadata
Snapshots are portable across hosts of the same architecture — sm_80 to sm_80 is fine; sm_80 to sm_90 requires a rebuild because PTX is JITed per arch.
The snapshot schema is versioned. The reader does not migrate snapshots in place: a version mismatch is a hard error (within v0.x the reader version must equal the writer version). The supported upgrade path is to re-capture from the live instance under the new format — or, if the source instance is gone, run an older tensor-wasm binary against the old snapshot, restore the instance, then re-capture under the current binary. See SNAPSHOT-COMPATIBILITY.md. Always rehearse this on a staging snapshot before bumping the binary in production.
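The two portability rules (same GPU architecture, identical schema version) can be expressed as a pre-restore check. The sketch below is illustrative: SnapshotMeta and its fields are hypothetical stand-ins for whatever the snapshot header actually records, and restore_blockers is not part of the tensor-wasm CLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotMeta:
    # Hypothetical metadata fields; the real snapshot header may differ.
    schema_version: int
    cuda_arch: str  # e.g. "sm_80"


def restore_blockers(meta, reader_version, host_arch):
    """Return the reasons a snapshot cannot be restored on this host as-is."""
    blockers = []
    if meta.schema_version != reader_version:
        # Within v0.x the reader does not migrate: versions must match exactly.
        blockers.append(
            f"schema v{meta.schema_version} != reader v{reader_version}: "
            "restore with a matching binary, then re-capture"
        )
    if meta.cuda_arch != host_arch:
        blockers.append(
            f"snapshot built for {meta.cuda_arch}, host is {host_arch}: rebuild required"
        )
    return blockers
```

An empty list means neither rule objects. Running a check like this in the staging rehearsal catches a mismatch before the binary is bumped in production.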
Recommended rotation: snapshot every active instance hourly, with a 7-day retention window in your object store. Pair with periodic integrity checks on a random 1% sample — see the offline integrity check in BACKUP-RESTORE.md §7.1, which parses each blob through SnapshotReader::restore (validating the CRC32 and per-blob size caps) without restoring it to a live instance.
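The rotation policy reduces to two selections over the stored snapshots: which ones have aged out of the 7-day window, and which 1% to hand to the integrity check. A minimal sketch, assuming you can list object-store keys with their capture times; the function names and the dict shape are hypothetical.

```python
import random
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)


def expired(snapshots, now=None):
    """Keys of snapshots older than the retention window.

    `snapshots` maps an object-store key to its capture time (timezone-aware UTC).
    """
    now = now or datetime.now(timezone.utc)
    return sorted(k for k, ts in snapshots.items() if now - ts > RETENTION)


def integrity_sample(keys, fraction=0.01, rng=random):
    """Pick a random sample for the offline check; at least one key if any exist."""
    keys = sorted(keys)
    if not keys:
        return []
    return rng.sample(keys, max(1, round(len(keys) * fraction)))
```

Passing a seeded random.Random as rng makes a given run reproducible, which helps when a failed check needs to be replayed against the same blobs.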
9. Monitoring
TensorWasm exposes Prometheus metrics on /metrics. The four SLI metrics to alert on:
| Metric | What it tells you |
|---|---|
| tensor_wasm_active_instances | Capacity headroom. Alert when within 20% of your planned ceiling. |
| tensor_wasm_kernel_dispatches_total | Throughput. Sudden drops indicate a stuck instance or scheduler. |
| tensor_wasm_kernel_latency_seconds | Tail latency. p99 regressions usually mean GPU contention. |
| tensor_wasm_offload_success_total / tensor_wasm_offload_fallback_total | Auto-offload health. A growing fallback ratio means promoted kernels are failing on the device. |
The bundled Grafana dashboard groups these into a single overview page. For the full catalog — including the per-tenant breakdown — see OBSERVABILITY.md.
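Two of the alert conditions in the table are simple arithmetic over the metric values. In production they would normally live in Prometheus alerting rules; the Python sketch below only spells out the calculation, with hypothetical function names and sample values read from tensor_wasm_active_instances and the two offload counters.

```python
def capacity_alert(active_instances, planned_ceiling, margin=0.20):
    """True when active instances are within `margin` of the planned ceiling."""
    return active_instances >= planned_ceiling * (1 - margin)


def fallback_ratio(success_total, fallback_total):
    """Share of offload attempts that fell back; 0.0 when there were none."""
    attempts = success_total + fallback_total
    return fallback_total / attempts if attempts else 0.0
```

With a planned ceiling of 120, capacity_alert fires once the instance count passes 96. The fallback ratio is most useful computed over a time window (the increase of each counter), since lifetime totals hide a recent regression.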
10. Disaster recovery
Because the gateway tier is stateless, DR is essentially snapshot DR:
- Ship snapshots off-host. Mirror to S3, GCS, or your equivalent. The snapshot format is content-addressed by sha256, so the mirror is naturally deduplicating.
- Verify periodically. Run the offline snapshot integrity check (BACKUP-RESTORE.md §7.1) against a sample of stored snapshots on a schedule.
- Practice restore. A quarterly DR drill that restores a snapshot to a fresh region is cheaper than discovering schema drift the day you actually need it.
- Keep the binary in sync with the snapshot schema. A binary that can read your latest snapshots should be available in your registry at all times — not just rebuildable from source.
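Content addressing is what makes the mirror deduplicating: two snapshots with identical bytes map to the same key, so the second upload can be skipped. The sketch below illustrates the idea with a local directory standing in for S3 or GCS. It assumes the address is the sha256 of the whole file; the helper names are hypothetical and the real snapshot tooling may compute or expose the digest differently.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so large snapshots are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(snapshot_path, mirror_dir):
    """Copy a snapshot into `mirror_dir` under its sha256; skip if already present.

    Returns (destination, copied).
    """
    mirror_dir = Path(mirror_dir)
    mirror_dir.mkdir(parents=True, exist_ok=True)
    dest = mirror_dir / sha256_of(snapshot_path)
    if dest.exists():
        return dest, False  # identical content is already mirrored
    shutil.copyfile(snapshot_path, dest)
    return dest, True
```

The same digest doubles as a cheap verification step: re-hashing a mirrored object and comparing it with its key detects corruption in the store without parsing the snapshot.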
If both the gateway and a snapshot are intact, recovery is a single command: tensor-wasm snapshot restore --input <file>.tensor-wasm --as-instance <instance-id> --server <url> (see BACKUP-RESTORE.md §6.1). If only the Wasm module bytes survive, you can redeploy clean: you lose accumulated linear-memory state, but the function is back online.
For broader operational guidance, see the other runbooks in this directory.