TensorWasm
Craton TensorWasm
A GPU-accelerated serverless WebAssembly runtime, in Rust.
Slide-style pitch deck for technical evaluators (CTOs, platform leads, SREs).
Every numeric claim and every "we have X" line links to a commit, a test, or
a doc — no aspirational marketing. Cross-checked against the v0.3.7 release
tag.
To present: pipe through marp, or read as plain markdown.
Slide breaks are ---. Speaker notes are HTML comments.
What it is (30 seconds)
Untrusted WebAssembly + explicit GPU dispatch + multi-tenant isolation + production observability — all in one runtime, sandboxed by Wasmtime, hot-pathed by CUDA, and shipped as Apache-2.0 source.
- Run any .wasm module that fits WASI Preview 2
- Talk to the GPU through a typed wasi:cuda host interface
- Multi-tenant by construction; one process serves many tenants
- HTTP API + CLI + structured audit log + OpenTelemetry trace propagation
- 9 tagged releases (v0.1.0 → v0.3.7), 0 audit problems open
The three things it actually does
Pitch points, each graded against tests/commits in the repo today.
| Promise | Where the proof lives | Honest grade |
|---|---|---|
| Zero-copy Wasm linear memory in CUDA UVM | crates/tensor-wasm-mem/ + is_uvm_backed() probe + 5 tests pinning the property | Proven structurally; B2 e2e correctness via real PTX |
| Explicit GPU dispatch from Wasm guests with typed argv | crates/tensor-wasm-wasi-gpu/ + 5/5 kernel_args_e2e tests + the B2 vector_add e2e test pass on real RTX 2060 | Proven end-to-end including vector_add correctness readback |
| Async-yielding dispatch on Tokio + Wasmtime | DispatchFuture::poll yields via 50 µs tokio sleep (B1); cuda-async proper waker path scaffolded (F3) for v0.4 cutover | Framework works; production-scale waker is v0.4 |
Anything not in this table is either out of v1.0 scope (see PATH-TO-V1.md "Anti-goals") or at an earlier stage than these three. We don't pitch what we haven't shipped.
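To make the third row of the table concrete, here is a minimal, std-only sketch of a poll-on-a-cadence future. It is not the repo's DispatchFuture: the names PollingDispatch and drive_with_cadence are hypothetical, the completion flag stands in for a GPU event, and the driver blocks a thread where the real path is described as sleeping on a Tokio timer.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::Duration;

/// Hypothetical stand-in for a kernel launch: `done` is flipped when the
/// (imaginary) GPU work completes.
pub struct PollingDispatch {
    done: Arc<AtomicBool>,
    polls: u32,
}

impl PollingDispatch {
    pub fn new(done: Arc<AtomicBool>) -> Self {
        Self { done, polls: 0 }
    }
}

impl Future for PollingDispatch {
    /// Resolves to the number of polls it took (for illustration only).
    type Output = u32;

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u32> {
        self.polls += 1;
        if self.done.load(Ordering::Acquire) {
            Poll::Ready(self.polls)
        } else {
            // No waker is registered here: the caller must re-poll on a fixed
            // cadence, which is the cost a real waker path would remove.
            Poll::Pending
        }
    }
}

/// Waker that does nothing; the driver below re-polls on its own schedule.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal driver: poll, sleep for `cadence`, repeat until ready.
pub fn drive_with_cadence<F: Future + Unpin>(mut fut: F, cadence: Duration) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = Pin::new(&mut fut).poll(&mut cx) {
            return out;
        }
        std::thread::sleep(cadence);
    }
}
```

Calling drive_with_cadence(PollingDispatch::new(flag), Duration::from_micros(50)) returns once another thread stores true into flag. The worst-case extra latency of this style is roughly one cadence interval per dispatch, which is why a waker-driven path is the usual next step.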
Architecture in one diagram
┌─────────────────┐
│ HTTP gateway │ axum: bearer + scoped-tokens + rate-limit + audit log
│ (tensor-wasm-api) │ W2.1 + W2.2 + W1.4 + W2.3 + W4.1 tracing
└────────┬────────┘
│
┌────────▼────────┐
│ Snapshot subsys │ zstd + bincode + 256 MiB bomb guard
│ (tensor-wasm-snapshot) │ W1.3 cross-version compat
└────────┬────────┘
│
┌────────▼────────┐
│ Tenant registry │ per-context streams + back-pressure cap=64
│ (tensor-wasm-tenant) │ C3 per-tenant GPU memory accounting
└────────┬────────┘
│
┌────────▼────────┐
│ Wasmtime exec │ async + epoch interruption
│ (tensor-wasm-exec) │ deadline enforcement
└────────┬────────┘
│
┌────────▼────────┐
│ wasi-cuda │ W1.1 typed-argv lowering (scalar + pointer)
│ (tensor-wasm-wasi-gpu) │ bounds-checked guest→device pointers
└────────┬────────┘
│ ┌─────────────────────┐
┌────────▼────────┐ │ tensor-wasm-jit │
│ UnifiedBuffer │◀──────│ PTX emit + BLAKE3 cache │
│ (tensor-wasm-mem) │ │ matmul / vector_add / conv2d
└────────┬────────┘ └─────────────────────┘
│
┌─────────┴─────────┐
│ 3 backings │
├───────────────────┤
│ cust (default) │ unified-memory feature
│ cudarc (W1.2) │ cudarc-backend feature; v0.5 fallback
│ cuda-oxide (O2) │ cuda-oxide-backend feature; v0.5 default contingent on v0.2
└───────────────────┘
See ARCHITECTURE.md for the full dep graph. Each component is a separate
crate in the workspace, and dependencies are only allowed to point in one direction (the ci target in the Makefile runs the check).
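The "256 MiB bomb guard" in the snapshot box can be pictured as a capped read over the decompressor's output. The sketch below is an assumption about the general technique, not the snapshot crate's code: read_capped and SNAPSHOT_CAP_BYTES are hypothetical names, and any std::io::Read stands in for a streaming zstd decoder.

```rust
use std::io::{self, Read};

/// Cap taken from the diagram (256 MiB); the constant name is hypothetical.
pub const SNAPSHOT_CAP_BYTES: u64 = 256 * 1024 * 1024;

/// Reads all of `decoder` into memory, failing once more than `cap` bytes
/// come out. `decoder` stands in for a streaming decompressor.
pub fn read_capped<R: Read>(decoder: R, cap: u64) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    // Allow one byte past the cap so "exactly cap" and "over cap" differ.
    let mut limited = decoder.take(cap.saturating_add(1));
    limited.read_to_end(&mut out)?;
    if out.len() as u64 > cap {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "decompressed snapshot exceeds size cap",
        ));
    }
    Ok(out)
}
```

The point of the design is that the limit applies to decompressed bytes, so a tiny input that expands without bound is rejected after at most cap + 1 bytes of output rather than after exhausting memory.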
What's measured today
bench-results/ on the dev RTX 2060 + Windows 11 WDDM + nightly-2026-04-03:
P50 latencies:
- tensor_wasm_kernel_dispatch/serial/100 → real CUDA path through the typed-argv pipeline; kernel_args_e2e proves correctness via c[i] == a[i] + b[i] readback
- e2e/healthz/get → 6.7 µs (W4.6 tail_latency bench)
- e2e/invoke_not_found/post → 28.1 µs (W4.6)
- dispatch/serial/100 busy-poll baseline → 400 ns P50 / 1.8 µs P99.9 (F3 dispatch_future_backends, busy-poll path)
- cold_start/restore/1 MiB snapshot → ~10 ms (W1.3 compat tests baseline)
End-to-end pure-CPU comparison:
- hyperfine vs wasmtime 45 on a 500M-iter integer loop (bench-results/hyperfine-vs-wasmtime.json):
  - tensor-wasm: 490.7 ± 13.2 ms
  - wasmtime: 501.2 ± 34.9 ms
  - Statistically tied (CIs overlap)
- Tracing + audit log + HTTP metrics overhead not measurable on the CLI path (CLI bypasses the HTTP layer where the instrumentation lives)
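The "statistically tied" reading of the hyperfine figures can be checked with a few lines of arithmetic. This is a rough sketch only: Summary, intervals_overlap, and relative_gap are hypothetical helpers, and overlapping mean ± spread ranges are an informal check, not a formal significance test.

```rust
/// A benchmark summary: mean and reported spread, both in milliseconds.
#[derive(Clone, Copy, Debug)]
pub struct Summary {
    pub mean_ms: f64,
    pub spread_ms: f64,
}

impl Summary {
    /// The (low, high) range covered by mean ± spread.
    pub fn interval(self) -> (f64, f64) {
        (self.mean_ms - self.spread_ms, self.mean_ms + self.spread_ms)
    }
}

/// True when the two mean ± spread ranges share at least one point.
pub fn intervals_overlap(a: Summary, b: Summary) -> bool {
    let (a_lo, a_hi) = a.interval();
    let (b_lo, b_hi) = b.interval();
    a_lo <= b_hi && b_lo <= a_hi
}

/// Signed gap of `a` relative to `baseline` (negative means `a` is faster).
pub fn relative_gap(a: Summary, baseline: Summary) -> f64 {
    (a.mean_ms - baseline.mean_ms) / baseline.mean_ms
}
```

With the figures above, the ranges are 477.5 to 503.9 ms and 466.3 to 536.1 ms, so they overlap, and the gap between the means is about 2.1% in tensor-wasm's favor.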
Important disclosure: Committed numbers were captured on a noisy Windows
developer host with CV > 5% per docs/BENCHMARKING.md "Methodology → CV
target" section. They are floor-only baselines suitable for regression
gating, not publication-grade external comparison numbers. The S22
self-hosted CUDA runner (.github/workflows/cuda.yml + the registration
runbook) produces publishable numbers once registered.
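The CV figure in the disclosure is the coefficient of variation: sample standard deviation divided by the mean. A minimal sketch of such a gate follows; the function names are hypothetical, and the 5% threshold is passed in as a parameter because the exact rule lives in docs/BENCHMARKING.md rather than here.

```rust
/// Coefficient of variation (sample standard deviation / mean).
/// Returns `None` for fewer than two samples or a zero mean.
pub fn coefficient_of_variation(samples: &[f64]) -> Option<f64> {
    let n = samples.len();
    if n < 2 {
        return None;
    }
    let mean = samples.iter().sum::<f64>() / n as f64;
    if mean == 0.0 {
        return None;
    }
    // Sample variance uses the n - 1 denominator.
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n as f64 - 1.0);
    Some(variance.sqrt() / mean.abs())
}

/// True when the run is quiet enough to meet `cv_target` (e.g. 0.05 for 5%).
pub fn meets_cv_target(samples: &[f64], cv_target: f64) -> bool {
    matches!(coefficient_of_variation(samples), Some(cv) if cv <= cv_target)
}
```

Applied to the summary figures quoted earlier, 34.9 / 501.2 is roughly 7% and 13.2 / 490.7 is roughly 2.7%, which is consistent with describing the host as noisy.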
What "it actually runs on CUDA" means
The repo has a permanent record of CUDA tests passing on real silicon at
bench-results/cuda-rtx2060-tests.txt:
2. Kernel-args E2E tests (W1.1 typed-argv marshalling against real GPU)
$ cargo test -p tensor-wasm-wasi-gpu --release --features cuda \
--test kernel_args_e2e -- --include-ignored
test scalar_argv_round_trips_through_launch_path ... ok
test pointer_argv_round_trips_through_launch_path ... ok
test pointer_argv_out_of_bounds_returns_invalid_pointer ... ok
test scalar_argv_real_cuda_launch ... ok
test pointer_argv_real_cuda_launch ... ok
test result: ok. 5 passed; 0 failed; 0 ignored
Plus the B2 wave's vector_add_end_to_end_real_ptx_real_kernel: builds a Wasm
guest with three f32[64] arrays in linear memory, encodes typed argv
[Ptr(a), Ptr(b), Ptr(c), U32(64)], launches against the canonical
kernels/vector_add.ptx, reads c back from linear memory, asserts
c[i] == 100.0 + 2*i for all 64 elements. This B2 end-to-end test
passes on RTX 2060 — on top of the 5-test kernel_args_e2e suite
quoted above — even though the kernel targets SM_80 (the CUDA driver
JIT'd it down to SM_75).
This is the test that proves the three pitch points work together on real
silicon for the non-wmma vector path. Scope caveat: the RTX 2060 is
SM_75, so the dev box cannot exercise the sm_80 wmma/MatMul Tensor-Core
lowering at all — that path stays unproven on hardware pending an SM_89
runner (see docs/HARDWARE-GATED-WORK.md § "Experimental wmma MatMul
lowering"). If you want to see the vector path run, that's the smoke test.
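The argv shape and the out-of-bounds test named above suggest a host-side check along the following lines. This is a sketch under stated assumptions, not the wasi-cuda implementation: KernelArg, ArgError, and validate_argv are hypothetical, pointers are modeled as a byte offset plus a byte length, and the CPU reference assumes inputs a[i] = i and b[i] = 100 + i, which the source does not state.

```rust
/// Hypothetical typed kernel argument, mirroring `[Ptr(a), Ptr(b), Ptr(c), U32(64)]`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum KernelArg {
    /// Byte offset into guest linear memory plus the byte length the kernel may touch.
    Ptr { offset: u32, len: u32 },
    U32(u32),
}

#[derive(Debug, PartialEq)]
pub enum ArgError {
    /// The pointer at `index` reaches past the end of linear memory.
    InvalidPointer { index: usize },
}

/// Checks every pointer argument against the size of guest linear memory.
pub fn validate_argv(argv: &[KernelArg], memory_len: u64) -> Result<(), ArgError> {
    for (index, arg) in argv.iter().enumerate() {
        if let KernelArg::Ptr { offset, len } = *arg {
            // Widen to u64 so offset + len cannot wrap around.
            let end = u64::from(offset) + u64::from(len);
            if end > memory_len {
                return Err(ArgError::InvalidPointer { index });
            }
        }
    }
    Ok(())
}

/// CPU reference for the readback assertion, assuming a[i] = i and b[i] = 100 + i.
pub fn expected_vector_add(n: usize) -> Vec<f32> {
    (0..n).map(|i| i as f32 + (100.0 + i as f32)).collect()
}
```

For three f32[64] arrays, each pointer covers 256 bytes, so an argv of offsets 0, 256, and 512 passes against a single 64 KiB Wasm page, while a pointer whose range runs past the end of memory is rejected before any launch is attempted.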
Where you'd put it in production today
Three deploy surfaces, all real:
- k8s:
deploy/k8s/ plain manifests (W2.7) or deploy/helm/tensor-wasm/ chart with image.backend toggle (F1 + C8)
- Nomad: deploy/nomad/ with both docker and raw_exec driver variants (W5.6)
- systemd / docker-compose: per docs/UPGRADE.md walkthroughs (W3.3)
Observability stack ships out-of-the-box:
- Prometheus: every metric self-documents via /metrics, including tensor_wasm_http_requests_total, tensor_wasm_http_request_duration_seconds, tensor_wasm_jobs_active, tensor_wasm_gpu_memory_bytes_per_tenant, tensor_wasm_build_info (W2.3 + C3 + W4.9)
- Grafana: drop-in dashboard at docs/dashboards/tensor-wasm-overview.json (W2.5)
- OpenTelemetry: W3C traceparent propagation end-to-end (W4.1 + C2 load test proving exact 4×N span emission under 64 concurrent invokes)
- Audit log: structured JSON to stdout or file via TENSOR_WASM_API_AUDIT_LOG (W2.2)
- Runbooks: per-alert response procedures under docs/runbooks/ (W2.6)
- SLOs: published in docs/SLO.md with burn-rate PromQL (W1.9)
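For readers unfamiliar with the traceparent header mentioned in the OpenTelemetry item, the sketch below parses the version-00 form defined by W3C Trace Context (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags). It is illustrative only: TraceParent and parse_traceparent are hypothetical names, the gateway's actual parsing is not shown in this deck, and the sketch deliberately rejects any version other than 00.

```rust
/// Parsed fields of a version-00 `traceparent` header.
#[derive(Debug, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: String,
    pub parent_id: String,
    pub sampled: bool,
}

/// True when `s` is exactly `len` lowercase hex digits.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parses `00-<trace-id>-<parent-id>-<flags>`; returns `None` on any violation.
pub fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let mut parts = header.trim().split('-');
    let (version, trace_id, parent_id, flags) =
        (parts.next()?, parts.next()?, parts.next()?, parts.next()?);
    // Version 00 defines exactly four fields.
    if version != "00" || parts.next().is_some() {
        return None;
    }
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(parent_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // All-zero IDs are invalid per the specification.
    if trace_id.bytes().all(|b| b == b'0') || parent_id.bytes().all(|b| b == b'0') {
        return None;
    }
    let flag_bits = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        // Bit 0 of the flags byte is the "sampled" flag.
        sampled: (flag_bits & 0x01) == 0x01,
    })
}
```

Propagation then amounts to keeping the trace ID, minting a new parent ID for each outgoing hop, and forwarding the flags unchanged.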
Where TensorWasm wins, where it doesn't
Per docs/BENCHMARKING.md "Where TensorWasm wins, where it won't" — honesty
on losses is the marketing strategy.
TensorWasm wins on:
- GPU-aware Wasm runtime (we are not aware of another Wasm runtime with this design)
- Multi-tenant Wasm with tenant-keyed GPU memory + per-token rate limiting
- Production observability out of the box (the W2-W4 wave)
- Snapshot-based cold-start when you're cycling many small Wasm functions
TensorWasm loses on:
- Pure-CPU Wasm execution speed vs hand-tuned Wasmer-LLVM: they spend more compile time, get faster runtime; if your workload is "I have one Wasm module I call a million times in a tight loop," Wasmer-LLVM is faster
- Cold-start vs raw Wasmtime Module::deserialize: our restore is a superset (snapshot decode + tenant state replay); a pure deserialize is a tighter loop
- Raw GPU dispatch latency vs hand-written C++: we pay the WASI-GPU bounds check + the back-pressure semaphore + the (current) 50 µs poll cadence; target post-v0.4 is 2-5× of cuLaunchKernel, not parity
- Pre-built FaaS at edge scale (workerd on Cloudflare's network): we're a runtime, not a CDN. Self-hosted comparison only.
If you need any of those things more than the four wins above, TensorWasm is the wrong choice. Saying so is what makes the wins credible.
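The tenant-keyed GPU memory accounting listed among the wins can be pictured as a per-tenant ledger with a quota check on every reservation. The sketch below is a simplified illustration, not the tenant crate's design: TenantLedger, QuotaError, and the single shared quota value are all assumptions made for the example.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum QuotaError {
    /// The request would push the tenant past its quota.
    OverQuota { requested: u64, available: u64 },
}

/// Hypothetical ledger: bytes of GPU memory currently attributed to each tenant.
pub struct TenantLedger {
    quota_bytes: u64,
    used: HashMap<String, u64>,
}

impl TenantLedger {
    /// Every tenant gets the same quota in this sketch.
    pub fn new(quota_bytes: u64) -> Self {
        Self { quota_bytes, used: HashMap::new() }
    }

    /// Records an allocation and returns the tenant's new total, or rejects it.
    pub fn reserve(&mut self, tenant: &str, bytes: u64) -> Result<u64, QuotaError> {
        let used = self.used_bytes(tenant);
        let available = self.quota_bytes.saturating_sub(used);
        if bytes > available {
            return Err(QuotaError::OverQuota { requested: bytes, available });
        }
        let total = used + bytes;
        self.used.insert(tenant.to_string(), total);
        Ok(total)
    }

    /// Records a free; never underflows if more is released than was reserved.
    pub fn release(&mut self, tenant: &str, bytes: u64) {
        if let Some(used) = self.used.get_mut(tenant) {
            *used = used.saturating_sub(bytes);
        }
    }

    /// Current usage for a tenant (zero if unknown).
    pub fn used_bytes(&self, tenant: &str) -> u64 {
        self.used.get(tenant).copied().unwrap_or(0)
    }
}
```

A per-tenant gauge such as tensor_wasm_gpu_memory_bytes_per_tenant would plausibly be fed from a structure like this, though that link is an inference from the metric name rather than something stated in the deck.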
The roadmap, in 4 honest sentences
- v0.3.x today: three CUDA backends scaffolded, real PTX dispatch proven on RTX 2060, observability + auth + ops complete, audit-closed.
- v0.4 next quarter: cuda-oxide v0.2 cutover (when it ships), Pliron Cranelift→PTX lowering for the first batch of ops, cuda-async-backed DispatchFuture waker, in-place UVM grow via cuMemAddressReserve. See docs/CUDA-OXIDE-CUTOVER.md for the 8-step executable runbook.
- v0.5-beta: external pen-test + at least one named or anonymized production design partner.
- v1.0: API freeze + release engineering. 12-16 months from today per PATH-TO-V1.md "Effort and timeline (caveated)" — every estimate is wrong, but the milestone exit criteria are the commitment, not the dates.
Try it in 5 minutes
git clone https://github.com/craton-co/craton-tensor-wasm
cd craton-tensor-wasm
cargo build --workspace --release # ~3 min on a cold cache
cargo test --workspace --release # ~2 min, 70 batches, 0 failures expected
# Run a Wasm function locally:
cargo run -p tensor-wasm-cli -- run \
tests/wasm-fixtures/matrix_multiply.wat
# Run the HTTP API:
cargo run --release --bin tensor-wasm -- serve --addr 0.0.0.0:8080
# Watch live metrics in another terminal:
cargo run -p tensor-wasm-cli -- observe
If you have CUDA + an NVIDIA GPU (SM_70+ for non-wmma kernels; SM_80+ for
wmma): add --features cuda to the test command and watch
vector_add_end_to_end_real_ptx_real_kernel actually compute on your GPU.
Get involved
If you're a developer: the W1.7 RFC process is live at rfcs/. Substantive
design changes go through it; bugfixes just open a PR. CONTRIBUTING.md has
the dev-loop setup.
If you're a sponsor / corporation: the v0.5 design-partner program is
open. Reciprocal engagement: you get early access + named credit in v1.0
release notes; we get production validation data. Apply via
security@craton.com.ar.
If you're a security researcher: SECURITY.md is the disclosure
contract. 72h acknowledgement, 90-day coordinated disclosure, public
findings table in docs/SECURITY-AUDIT-v0.5.md (post-pen-test).
If you're at a security firm reading this: the v0.5 external pen-test
is on the roadmap; reach out via security@craton.com.ar to discuss
commissioning it.
FAQ
Q: Is this production-ready?
A: For v1.0-grade production at the SLA published in docs/SLO.md: not
yet. v0.5-beta is the first beta the design-partner program runs against.
For staging / pilot / internal-tooling: yes, today, at v0.3.7. The CUDA
path runs end-to-end on real hardware; the HTTP API has 0 audit problems
open; the test suite has 70 passing batches and 0 failures.
Q: How does this compare to Wasmtime?
A: We wrap Wasmtime 45.x. We are not a fork. See docs/WASMTIME-FORK.md.
Pure CPU execution is within 5% of upstream Wasmtime per the dimension-1
hyperfine comparison (statistically tied at v0.3.7). Everything else — GPU,
multi-tenancy, snapshot, HTTP gateway, observability — is layered on top
and additive.
Q: How does this compare to Wasmer Edge / Spin / Fermyon?
A: Different problem. Those are FaaS platforms; we're a runtime they could
build on. See docs/MIGRATING-FROM-WASMTIME-WASMER.md for the 13-row
feature matrix and per-persona migration guides.
Q: AMD / Intel / Apple GPU support?
A: Out of scope for v1.0 by explicit decision (PATH-TO-V1.md "Anti-goals"). v2.x research item.
Q: What's the license?
A: Apache-2.0. Commercial use, modification, redistribution, sublicensing
all permitted. LICENSE + NOTICE at repo root. Trademark policy is
permissive — see docs/TRADEMARK.md.
Q: Who's behind it?
A: Sponsored by Craton Software Company. Maintainer roster + governance at
MAINTAINERS.md + GOVERNANCE.md. Several slots currently TBD by design
during the v0.x window; see MAINTAINERS.md "Placeholders are by design"
section.
Q: What's the catch?
A: Two real ones:
- CUDA-only for v1.0. NVIDIA hardware required for the GPU path. Pure-CPU path works anywhere Wasmtime works.
- Pinned Rust nightly (nightly-2026-04-03). Quarterly bump cadence per PATH-TO-V1 Open Decision #8. Moving to stable Rust is a v2 effort gated on Wasmtime dropping its own nightly needs.
One-slide summary
┌──────────────────────────────────────────────────────────────────┐
│ CRATON TENSORWASM v0.3.7 │
│ GPU-accelerated serverless WebAssembly runtime, in Rust │
│ │
│ PROVEN ON REAL SILICON (non-wmma vector path): │
│ Wasm→wasi-cuda→cuLaunchKernel→readback e2e tests pass │
│ (RTX 2060 = SM_75; sm_80 wmma/MatMul unproven, needs SM_89) │
│ 70 test batches, 0 failures, 0 audit problems │
│ 9 tagged releases (v0.1.0 → v0.3.7) │
│ │
│ SHIPPED, USABLE TODAY: │
│ Auth (bearer + scoped tokens), rate limit, audit log │
│ Prometheus + OTel + Grafana dashboard + runbooks │
│ Helm chart, k8s manifests, Nomad job spec, Dockerfile │
│ Reproducible builds + CycloneDX SBOM + cargo-deny │
│ CLI with completions + man pages + live observe dashboard │
│ │
│ WHERE WE WIN: │
│ GPU-aware Wasm (no direct peer we know of) │
│ Multi-tenant by construction │
│ Observability + ops out of the box │
│ │
│ WHERE WE LOSE (honest): │
│ Pure-CPU vs Wasmer-LLVM │
│ Cold-start vs raw Wasmtime deserialize │
│ Raw GPU dispatch vs hand-written C++ │
│ Edge CDN vs Cloudflare Workers │
│ │
│ NEXT: │
│ v0.4 cuda-oxide v0.2 cutover (runbook ready) │
│ v0.5 external pen-test + design partners (RFP + kit ready) │
│ v1.0 release engineering (paperwork only) │
│ │
│ https://github.com/craton-co/craton-tensor-wasm │
│ Apache-2.0 · security@craton.com.ar │
└──────────────────────────────────────────────────────────────────┘
Related docs
Everything pitched above has a doc behind it:
- README.md: the technical entry point
- ARCHITECTURE.md: the dep graph
- docs/PATH-TO-V1.md: the roadmap source-of-truth
- docs/SLO.md: what we commit to numerically
- docs/BENCHMARKING.md: how we measure (incl. anti-cheating rules)
- docs/CUDA-OXIDE-CUTOVER.md: v0.4 cuda-oxide v0.2 cutover runbook
- docs/tutorials/production-deployment.md: end-to-end production guide
- rfcs/0001-cuda-oxide-integration.md: the v0.5 default-flip decision
- CHANGELOG.md: what shipped in each release
Status: written against the v0.3.7 release tag (2026-05-28). Re-validate
all numeric claims when re-publishing — bench-results/*.json is the
authoritative source.