Modern problems require modern solutions.
Modern AI and data workloads want two things that have always been at odds: the isolation of a sandbox — so you can run untrusted, multi-tenant code without it touching the host or its neighbours — and the raw throughput of the GPU, so the work actually finishes on time.
Traditional sandboxes give you safety but keep you on the CPU. Hand-written GPU code gives you speed but no isolation. TensorWasm gives you both in one runtime: sandboxed .wasm guests that dispatch real CUDA kernels through a typed host interface.
This isn't a whitepaper. The full path — Wasm guest → wasi:cuda → cuLaunchKernel → read results back — runs end-to-end on a real NVIDIA GPU, with tests asserting the GPU actually computed the right answer. On the pure-CPU path, throughput is statistically tied with upstream Wasmtime 45.
A snapshot subsystem captures and restores Wasm + GPU state, so cycling many small functions doesn't mean paying full instantiation cost every time.
TensorWasm is licensed under Apache-2.0 with a permissive trademark policy: commercial use, modification, and redistribution are all permitted. No open-core bait-and-switch.
Sandboxed by construction
Every workload is a WebAssembly module isolated by Wasmtime. Untrusted code stays in its lane — memory-safe, capability-gated, and deadline-enforced. No escape hatches.
GPU-native, not GPU-adjacent
Guests reach the GPU through a typed wasi:cuda interface. Wasm linear memory is backed by CUDA Unified Memory, so data is reachable from the GPU without a copy.
Multi-tenant from the first line
One process, many tenants — each with scoped bearer tokens, per-token rate limits, and per-tenant GPU memory quotas. Isolation is the architecture, not a deployment pattern bolted on.
Production-ready ops included
Prometheus metrics, end-to-end OpenTelemetry traces, a drop-in Grafana dashboard, structured audit logs, published SLOs, and one runbook per alert — all shipped in the repo.
The Technical Edge
Why experts choose TensorWasm
Typed wasi:cuda host interface
Guests perform explicit kernel dispatch today, with opt-in automatic offload on the roadmap. Wasm linear memory is backed by CUDA Unified Memory for zero-copy data sharing. Requires CUDA 12.0+ and SM_70+ for standard kernels; the CPU path runs anywhere Wasmtime runs.
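To make "explicit kernel dispatch through a typed interface" concrete, here is a minimal Rust sketch. Everything in it is hypothetical: `LaunchConfig`, `KernelHost`, `DispatchError`, and `CpuMockHost` are illustration-only names, not the actual wasi:cuda bindings. The mock runs the "kernel" on the CPU so the example works without a GPU. It shows only the shape of the idea: the guest passes typed launch geometry and slices, and the host validates them before anything is launched.

```rust
/// Hypothetical 1-D launch geometry, loosely mirroring the grid/block
/// arguments of cuLaunchKernel. Not the real wasi:cuda type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LaunchConfig {
    pub grid_x: u32,
    pub block_x: u32,
}

impl LaunchConfig {
    /// Smallest 1-D grid whose threads cover `n` elements.
    pub fn cover(n: u32, block_x: u32) -> Self {
        assert!(block_x > 0, "block size must be non-zero");
        let blocks = (n as u64 + block_x as u64 - 1) / block_x as u64;
        LaunchConfig { grid_x: blocks as u32, block_x }
    }

    /// Total number of threads the launch would create.
    pub fn threads(&self) -> u64 {
        self.grid_x as u64 * self.block_x as u64
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum DispatchError {
    LengthMismatch,
    GridTooSmall,
}

/// Hypothetical host interface a guest would call into.
pub trait KernelHost {
    fn vector_add(
        &self,
        cfg: LaunchConfig,
        a: &[f32],
        b: &[f32],
        out: &mut [f32],
    ) -> Result<(), DispatchError>;
}

/// CPU stand-in for the GPU: each loop iteration plays one thread.
pub struct CpuMockHost;

impl KernelHost for CpuMockHost {
    fn vector_add(
        &self,
        cfg: LaunchConfig,
        a: &[f32],
        b: &[f32],
        out: &mut [f32],
    ) -> Result<(), DispatchError> {
        // Validate on the host side before any "launch" happens.
        if a.len() != b.len() || a.len() != out.len() {
            return Err(DispatchError::LengthMismatch);
        }
        if cfg.threads() < a.len() as u64 {
            return Err(DispatchError::GridTooSmall);
        }
        for tid in 0..cfg.threads() as usize {
            // Same bounds check a real kernel would perform per thread.
            if tid < a.len() {
                out[tid] = a[tid] + b[tid];
            }
        }
        Ok(())
    }
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0];
    let b = [10.0f32, 20.0, 30.0, 40.0, 50.0];
    let mut out = [0.0f32; 5];
    let cfg = LaunchConfig::cover(a.len() as u32, 4);
    CpuMockHost.vector_add(cfg, &a, &b, &mut out).expect("dispatch failed");
    println!("{:?}", out);
}
```

A real GPU-backed host would implement the same trait by launching a compiled kernel. The point of the typed boundary is that malformed launches are rejected before they reach the device.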
Multi-tenant isolation & quotas
Scoped bearer tokens, per-token rate limits, and per-tenant GPU memory quotas, all served from a single fleet. An OpenAI-compatible gateway exposing /v1/completions and /v1/chat/completions with streaming responses sits in front, with auth and audit on every request.
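As a rough illustration of what per-token rate limits and per-tenant GPU memory quotas involve, the sketch below pairs a token-bucket limiter with a simple byte-accounting table. It is an assumption-laden example, not TensorWasm's implementation: `TokenBucket`, `GpuQuota`, and their methods are hypothetical names, the token bucket is one common rate-limiting technique chosen for illustration, and time is passed in explicitly to keep the behaviour deterministic.

```rust
use std::collections::HashMap;

/// Hypothetical token-bucket limiter, one instance per bearer token.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_secs: f64,
}

impl TokenBucket {
    pub fn new(capacity: f64, refill_per_sec: f64) -> Self {
        TokenBucket { capacity, tokens: capacity, refill_per_sec, last_secs: 0.0 }
    }

    /// Refill for the elapsed time, then try to spend one token.
    pub fn try_acquire(&mut self, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_secs = now_secs;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// Hypothetical per-tenant GPU memory accounting with one shared limit.
pub struct GpuQuota {
    limit_bytes: u64,
    used: HashMap<String, u64>,
}

impl GpuQuota {
    pub fn new(limit_bytes: u64) -> Self {
        GpuQuota { limit_bytes, used: HashMap::new() }
    }

    /// Reserve `bytes` for `tenant`; refuse if it would exceed the limit.
    pub fn reserve(&mut self, tenant: &str, bytes: u64) -> bool {
        let used = self.used.entry(tenant.to_string()).or_insert(0);
        match used.checked_add(bytes) {
            Some(total) if total <= self.limit_bytes => {
                *used = total;
                true
            }
            _ => false,
        }
    }

    /// Return previously reserved bytes to the tenant's budget.
    pub fn release(&mut self, tenant: &str, bytes: u64) {
        if let Some(used) = self.used.get_mut(tenant) {
            *used = used.saturating_sub(bytes);
        }
    }

    pub fn used_by(&self, tenant: &str) -> u64 {
        self.used.get(tenant).copied().unwrap_or(0)
    }
}

fn main() {
    let mut bucket = TokenBucket::new(2.0, 1.0);
    let burst: Vec<bool> = (0..3).map(|_| bucket.try_acquire(0.0)).collect();
    println!("burst at t=0: {:?}", burst);

    let mut quota = GpuQuota::new(100);
    println!("tenant a reserves 60: {}", quota.reserve("a", 60));
    println!("tenant a reserves 50 more: {}", quota.reserve("a", 50));
}
```

Each tenant's usage is tracked separately, so one tenant hitting its quota does not affect another's budget.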
Snapshots & fast cold-starts
TensorWasm is an 11-crate Rust workspace that wraps Wasmtime rather than forking it. Its snapshot subsystem captures and restores combined Wasm + GPU state, so high-churn, small-function workloads avoid full instantiation cost on every cycle.
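The capture-and-restore pattern described above can be sketched in a few lines of Rust. This is a toy model under stated assumptions, not the actual snapshot subsystem: `InstanceState`, `Snapshot`, and `instantiate` are hypothetical names, and both the Wasm linear memory and the GPU buffer are modelled as plain byte vectors. The idea it demonstrates is to pay for initialisation once, then hand each invocation a fresh copy of the warmed-up state.

```rust
/// Hypothetical combined state: guest linear memory plus one device
/// buffer, both modelled as byte vectors for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InstanceState {
    pub linear_memory: Vec<u8>,
    pub gpu_buffer: Vec<u8>,
}

/// An immutable copy of the state taken after initialisation.
pub struct Snapshot {
    state: InstanceState,
}

impl Snapshot {
    /// Copy the live state so later mutations cannot affect the snapshot.
    pub fn capture(live: &InstanceState) -> Self {
        Snapshot { state: live.clone() }
    }

    /// Produce a fresh, independent instance from the snapshot.
    pub fn restore(&self) -> InstanceState {
        self.state.clone()
    }
}

/// Stand-in for the expensive path (compile, link, run initialisers).
pub fn instantiate() -> InstanceState {
    InstanceState {
        linear_memory: vec![0xAB; 64],
        gpu_buffer: vec![0; 16],
    }
}

fn main() {
    // Pay the full instantiation cost once.
    let warm = instantiate();
    let snap = Snapshot::capture(&warm);

    // Each invocation starts from the snapshot instead of re-instantiating.
    for call in 0..3u8 {
        let mut inst = snap.restore();
        inst.gpu_buffer[0] = call + 1; // the guest mutates only its own copy
        assert_eq!(snap.restore(), warm); // the snapshot stays pristine
    }
    println!("3 invocations served from one snapshot");
}
```

A production design would likely avoid whole-buffer copies, for example with copy-on-write mappings, but the contract is the same: a restore always yields the captured state, regardless of what earlier invocations did.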