TensorWasm

Feature-Status Matrix — Canonical Source of Truth

This file is the single canonical status source for every major TensorWasm feature: whether it is wired, scaffold-only, hardware-gated, or planned, and behind which Cargo feature flag (if any). It exists to end the scaffold-vs-wired drift between README.md, CHANGELOG.md, and docs/OPENAI-COMPAT.md.

Workspace version: 0.3.7 (root Cargo.toml).

When this table and any prose elsewhere disagree, this table wins. README.md, CHANGELOG.md, and the per-feature spec docs defer to it for status; they may add narrative, but the Status column here is authoritative.

Status vocabulary

Term	Meaning
Wired	Reachable and functional on the default build or a documented feature flag — works end-to-end through the invoke / HTTP / store path it claims.
Landed	Implementation present and exercised, but the production-relevant verification (typically the GPU path) is `#[ignore]`d pending hardware; host-side behaviour runs today.
Scaffold	Compiles, surface-area-stable types + tests exist, but the call site returns a documented not-yet-wired sentinel (e.g. `NotYetWired` / `NotYetImplemented` / `FEATURE_NOT_EXPOSED`).
Hardware-gated	Code is written and reviewed but has never executed against real NVIDIA silicon; CI links the CUDA stubs and the proving test is `#[ignore = "requires CUDA hardware"]`. See `HARDWARE-GATED-WORK.md`.
Proven on hardware	The path has actually executed against real NVIDIA silicon and produced verified-correct output. The proving test is still `#[ignore = "requires CUDA hardware"]` (so it is skipped in GPU-less CI and needs a GPU runner / local `--features cuda --include-ignored` run) — `#[ignore]` here means "CI has no GPU", not "unproven".
Planned-v0.4	Not yet implemented; tracked for v0.4+.

Scaffold is deliberately not called "Wired": a green default-build test does not prove the driver call works. A path is only marked Proven on hardware once it has actually run on a GPU; an unrun CUDA path stays Hardware-gated. The end-to-end GPU launch path (`kernel_args_e2e` / `vector_add_end_to_end_real_ptx_real_kernel`) is Proven on hardware. This is the same claim `presentation.md` makes ("Proven on real silicon"), and this table now agrees with it.
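
The rules above can be encoded in a small sketch. The `Status` enum and its two helper methods are hypothetical names introduced here for illustration; nothing in this file says the codebase models the vocabulary this way. The sketch only captures two statements from the vocabulary: an `#[ignore]`d GPU test means CI has no GPU, and only Proven on hardware means the path has run on real silicon.

```rust
/// Illustrative model of the status vocabulary (hypothetical type, not a TensorWasm API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Wired,
    Landed,
    Scaffold,
    HardwareGated,
    ProvenOnHardware,
    PlannedV04,
}

impl Status {
    /// True only when the path has actually executed on a real GPU.
    pub fn ran_on_silicon(self) -> bool {
        matches!(self, Status::ProvenOnHardware)
    }

    /// True when the GPU-side proving test is `#[ignore]`d in GPU-less CI.
    /// Being ignored says nothing about whether the path is proven.
    pub fn gpu_test_is_ignored(self) -> bool {
        matches!(
            self,
            Status::Landed | Status::HardwareGated | Status::ProvenOnHardware
        )
    }
}
```

Under this model, `HardwareGated` and `ProvenOnHardware` both report an ignored GPU test, and only the latter reports a run on silicon, which is the distinction the vocabulary draws.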

Matrix

Feature	Crate(s)	Status	Cargo feature flag	Notes / Tracking-ID
Typed multi-value guest args	`tensor-wasm-exec`, `tensor-wasm-cli`, `tensor-wasm-api`	Wired	none (default)	T33. `WasmArg` enum + JSON↔`Val` codec; `--args <JSON>` plumbed end-to-end through CLI / HTTP `/invoke{,-async,-stream}` / `SpawnConfig::with_args`. `call_export_with_args` supersedes deprecated `call_export`.
`/invoke-stream` SSE streaming	`tensor-wasm-wasi-gpu`, `tensor-wasm-api`	Wired	none (default)	T34. Guest `wasi:tensor/host.emit-chunk` calls surface as SSE `event: chunk` frames via `StreamingContext`. Honors T36 cooperative deadlines (`DEADLINE-ELAPSED` → terminal `event: error`). Replaces the 0.3.7 single `not_yet_wired` frame. Spec: `STREAMING.md`.
Signed kernel registry / `DiskRegistry`	`tensor-wasm-jit` (+ `tensor-wasm-api` for HTTP)	Wired	`kernel-registry` (jit); `kernel-registry-api` (api)	T35. Disk-persisted `DiskRegistry` over `tensor-wasm-artifacts::DiskArtifactStore`, restart-safe, paginated (`list_paginated`, cap 1000), optional publisher allowlist. HTTP backend selected by `TENSOR_WASM_API_KERNEL_REGISTRY_DIR`. CLI `kernel publish|list|verify` is wired (B6.4): `publish` BLAKE3-hashes + signs a `KernelManifest` and POSTs it to `/kernels`, `list` GETs + renders the manifest table, `verify` re-computes the HMAC locally (constant-time) against a manifest blob on disk. Replaces the prior v0.3.6 scaffold that exited `FEATURE_NOT_EXPOSED` (3). Servers built without `--features kernel-registry-api` (the default) return `503 kernel_registry_not_configured`, which the CLI surfaces as a clear error. Source: `crates/tensor-wasm-cli/src/cmd/kernel.rs`. Signing envelope is v2 (T12). Spec: `KERNEL-REGISTRY.md`.
Cooperative epoch yield / deadlines	`tensor-wasm-wasi-gpu`, `tensor-wasm-exec`	Wired	none (default)	T36 plus the accompanying executor change. `wasi:scheduler/host@0.1.0` `SchedulerContext`; the executor's per-invocation `Instant` deadline drives both scheduler verdicts (`CONTINUE` / `DEADLINE-NEAR` / `DEADLINE-ELAPSED`) and `BackPressure` acquire rejection (`DEADLINE_NEAR_WINDOW = 50ms`). Spec: `COOPERATIVE-YIELD.md`.
Pre-instantiated instance pool	`tensor-wasm-exec`	Wired	none (default)	T37. `InstancePool` + `InstancePoolConfig` wired through the invoke path; per-`(tenant, module-hash)` channel with pre-spawn and reset-on-return. Spec: `INSTANCE-POOL.md`.
Differential correctness oracle	`tensor-wasm-jit`	Landed (host); GPU path hardware-gated	`differential-oracle`	T38. Proptest harness driving `DifferentialOracle` over matmul / vector_add / conv2d blueprints + per-kernel tolerance table. Host (Wasmtime CPU) verdicts run end-to-end today; CUDA GPU verdicts are `#[ignore]` pending the S22 self-hosted runner. Spec: `DIFFERENTIAL-ORACLE.md`.
Per-tenant GPU memory quotas (in-process)	`tensor-wasm-tenant`	Wired	none (default)	T39. `TenantContextBuilder::with_gpu_memory_bytes_cap` + `consume_gpu_bytes` / `release_gpu_bytes`. In-process counter is the primary accounting source. Spec: `GPU-QUOTAS.md`.
GPU memory quotas (`cuMemPool` + host-side cap)	`tensor-wasm-mem`	Hardware-gated	`gpu-mem-pool` (strict-superset of `cudarc-backend`)	T39. `TenantMemPool` gives each tenant a `cuMemPool` and enforces the cap host-side in `allocate` (CAS over `live_bytes` vs `cap_bytes`), routing allocations through `cuMemAllocFromPoolAsync` (`UnifiedBuffer::new_in_tenant_pool`). NB: `CU_MEMPOOL_ATTR_RELEASE_THRESHOLD` is a retention hint, NOT a driver-level cap — a hardware run (GPU-VALIDATION-2026-05-30 BUG-1) disproved the earlier driver-pin claim. Requires CUDA 11.2+; the driver-reject test is `#[ignore]`. Bypass-resistant for pool-routed allocations; a raw-driver-handle tenant is out of scope (see GPU-QUOTAS.md).
Artifact-backed snapshots	`tensor-wasm-snapshot`, `tensor-wasm-artifacts`	Wired (default; BREAKING for default writer)	`artifact-backing` (on by default)	T40. Default snapshot envelope flipped from legacy inline v3 to `DiskArtifactStore`-backed v4. Reads still accept legacy v2/v3 indefinitely. Opt out per-call via `capture_legacy()` or build `--no-default-features --features signed-snapshots`.
Unified content-addressed artifact store	`tensor-wasm-artifacts`	Wired	none (always-on crate; no `[features]`)	Roadmap #9. `ArtifactStore` trait + `InMemoryArtifactStore` + fully-implemented `DiskArtifactStore` (BLAKE3 content-hash + zstd + HMAC-SHA256 + 16-byte magic). Backs snapshots (T40) and the JIT disk cache (T30). Spec: `ARTIFACT-STORE.md`.
OpenAI-compat gateway	`tensor-wasm-api`	Wired	none (always-on routes)	T41. `/v1/completions` + `/v1/chat/completions` translate to internal invoke via `TENSOR_WASM_API_OPENAI_MODEL_MAP` (`model:uuid,…`); buffered or SSE. Closes the 0.3.7 `501 openai_not_yet_wired` scaffold. Caveats: argv marshalling calls `_start() -> ()` (no typed prompt arg yet); `usage` token counts are zeros (no tokenizer); multimodal content dropped — all deferred to v0.5. Spec: `OPENAI-COMPAT.md`.
CUDA backend — cust (UVM)	`tensor-wasm-mem`	Hardware-gated	`unified-memory` (pulls `cust`)	Default cust 0.3 backing. `cuMemAllocManaged` + `cuMemAdvise` + `cuMemPrefetchAsync`. Round-trip tests `#[ignore]`. The historical "default" GPU backing.
CUDA backend — cudarc	`tensor-wasm-mem`	Hardware-gated	`cudarc-backend` (pulls `cudarc`)	cust → cudarc migration spike (W1.2 / `CUDARC-SPIKE.md`). Parallel `UnifiedBuffer` impl. Allocation/prefetch tests `#[ignore]`.
CUDA backend — cuda-oxide	`tensor-wasm-mem`, `tensor-wasm-jit`	Scaffold	`cuda-oxide-backend`	RFC 0001 v0.5 cust-successor. Dep-less (no cuda-oxide crate pulled in); on `tensor-wasm-mem` exposes `CudaOxideUnifiedBuffer` returning the `NOT_YET_WIRED` sentinel; on `tensor-wasm-jit` pulls `pliron` 0.15 (crates.io) for the `pliron_dialect` scaffold. The git-pinned `experimental-cuda-oxide-host-backend` feature and its cuda-host/cuda-core/cuda-async git deps were removed this cycle pending crates.io publish (re-add per `CUDA-OXIDE-CUTOVER.md`).
Real CUDA host functions / kernel launch	`tensor-wasm-exec`, `tensor-wasm-wasi-gpu`, `tensor-wasm-snapshot`, `tensor-wasm-tenant`	End-to-end launch PROVEN ON HARDWARE; lower-level host fns still partly hardware-gated	`cuda`	Real `wasi:cuda` host fns, contexts, `cuLaunchKernel`, GPU snapshot restore. Typed argv lowering for scalar + pointer kernel args (W1.1). The full launch path (Wasm guest → `wasi:cuda` → `cuLaunchKernel` → results read back and asserted, `c[i]==a[i]+b[i]` from managed memory) is VERIFIED ON REAL SILICON: `kernel_args_e2e` (incl. `vector_add_end_to_end_real_ptx_real_kernel`) passes 8/8 on an RTX 2060 (cc 7.5, CUDA 13.2), re-confirmed 2026-06-01. These tests remain `#[ignore = "requires CUDA hardware"]` because CI has no GPU, not because the path is unproven; running them requires a GPU runner or a local `--features cuda --include-ignored` run. Individual lower-level cuda host fns still flagged UNVERIFIED-PENDING-HARDWARE in `HARDWARE-GATED-WORK.md` keep that flag unless covered by this e2e proof. See `GPU-VALIDATION-2026-05-30.md`.
JIT auto-offload pipeline	`tensor-wasm-jit`	Wired (pipeline always compiled); CUDA wiring hardware-gated	`auto-offload` gates extra CUDA-side wiring	The Cranelift-free detector → `BlockIR` → PTX-text pipeline is always compiled and runs on host; the feature only gates CUDA-side wiring tested under `--features cuda`. Spec: `AUTO-OFFLOAD.md`.
JIT MatMul / wmma PTX emission	`tensor-wasm-jit`	Scaffold (refused by default); wmma hardware-unverified	runtime `EmitConfig::enable_experimental_matmul` (no Cargo flag)	`MatMul` returns `EmitError::NotYetImplemented` by default. The `wmma.mma.sync` sm_80+ lowering fires only when `enable_experimental_matmul = true`, and even then the emitted PTX is hardware-unverified (the dev RTX 2060 is SM_75 and cannot run it; needs the SM_89 runner). See `HARDWARE-GATED-WORK.md` item 7.
pliron PTX pipeline (stages)	`tensor-wasm-jit`	Scaffold	`cuda-oxide-backend`; stage-2 `twasm.`→`llvm.` under `pliron-llvm-backend`	`pliron_dialect` / `pliron_lowering` stages return `NotYetImplemented` / `NotYetWired` sentinels. `pliron-llvm-backend` is a strict superset of `cuda-oxide-backend` and carries a hard `llvm-sys = "221"` dep (needs system LLVM 22.1). Spec: `PLIRON-PIPELINE.md`.
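
The GPU memory quotas row describes the cap as enforced host-side by a CAS over `live_bytes` vs `cap_bytes`. A minimal sketch of that pattern follows, using only the standard library. `CapCounter`, `try_reserve`, `release`, and `live` are hypothetical names for illustration; this is not the real `TenantMemPool`, and it performs no CUDA calls.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for host-side quota accounting (not the real `TenantMemPool`).
pub struct CapCounter {
    live_bytes: AtomicU64,
    cap_bytes: u64,
}

impl CapCounter {
    pub fn new(cap_bytes: u64) -> Self {
        Self { live_bytes: AtomicU64::new(0), cap_bytes }
    }

    /// Reserve `n` bytes. Returns the new live total, or `Err(current_live)`
    /// if the reservation would exceed the cap (or overflow `u64`).
    pub fn try_reserve(&self, n: u64) -> Result<u64, u64> {
        let mut live = self.live_bytes.load(Ordering::Acquire);
        loop {
            let next = match live.checked_add(n) {
                Some(next) if next <= self.cap_bytes => next,
                _ => return Err(live),
            };
            // CAS: only commit if no other thread changed `live_bytes` meanwhile.
            match self.live_bytes.compare_exchange_weak(
                live,
                next,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Ok(next),
                Err(observed) => live = observed, // retry against the fresh value
            }
        }
    }

    /// Return `n` bytes to the budget, saturating at zero.
    pub fn release(&self, n: u64) {
        let _ = self.live_bytes.fetch_update(Ordering::AcqRel, Ordering::Acquire, |live| {
            Some(live.saturating_sub(n))
        });
    }

    pub fn live(&self) -> u64 {
        self.live_bytes.load(Ordering::Acquire)
    }
}
```

The check and the update happen in one atomic step, so two concurrent reservations cannot both pass a cap that only has room for one. In a real allocator the reservation would precede the driver allocation and be released if that allocation fails; that ordering is an assumption of this sketch, not something the matrix states.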
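
The cooperative-deadline row says a per-invocation `Instant` deadline drives the `CONTINUE` / `DEADLINE-NEAR` / `DEADLINE-ELAPSED` verdicts with `DEADLINE_NEAR_WINDOW = 50ms`. The sketch below shows one way such a mapping could look. The `Verdict` enum and `verdict` function are hypothetical, and the boundary choices (exactly 50 ms remaining counts as near, zero remaining counts as elapsed) are assumptions; `COOPERATIVE-YIELD.md` is the authority on the real behaviour.

```rust
use std::time::{Duration, Instant};

/// Window taken from the matrix row above.
pub const DEADLINE_NEAR_WINDOW: Duration = Duration::from_millis(50);

/// Hypothetical mirror of the three scheduler verdicts.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Continue,
    DeadlineNear,
    DeadlineElapsed,
}

/// Classify the time remaining between `now` and `deadline`.
pub fn verdict(now: Instant, deadline: Instant) -> Verdict {
    match deadline.checked_duration_since(now) {
        // `None` means `now` is already past the deadline.
        None => Verdict::DeadlineElapsed,
        Some(remaining) if remaining.is_zero() => Verdict::DeadlineElapsed,
        Some(remaining) if remaining <= DEADLINE_NEAR_WINDOW => Verdict::DeadlineNear,
        Some(_) => Verdict::Continue,
    }
}
```

Passing `now` in as a parameter, rather than calling `Instant::now()` inside, keeps the classification deterministic and easy to test.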

Non-feature-flag capabilities (always on)

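Most of these capabilities ship by default and are configured at runtime (env/config) rather than through a Cargo feature gate; where a Cargo feature is involved, the Configuration column names it. They are included here so the matrix doubles as a complete status reference: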

Capability	Status	Configuration
Signed snapshots (HMAC-SHA256, v3)	Wired	`signed-snapshots` feature (on by default)
Capability-to-registry binding	Wired (unconditional; enforced even with `default-features = false`)	`strict-cap-binding` gates only the typed `*_strict` admin APIs
Scoped bearer tokens / audit log / per-token rate limit / HTTP metrics	Wired	env vars (`TENSOR_WASM_API_*`)
Async execution + epoch interruption	Wired	always-on behaviour, not a flag
NVIDIA MPS shared contexts	Runtime-selected	env/config, not a Cargo flag — see `MPS-SETUP.md`
OTLP exporter	Opt-in	`otlp` feature (`tensor-wasm-core`)
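
Several of these capabilities are configured through `TENSOR_WASM_API_*` environment variables. As one example, the OpenAI-compat row in the matrix gives the format of `TENSOR_WASM_API_OPENAI_MODEL_MAP` as `model:uuid,…`. The sketch below parses a string of that shape. `parse_model_map` is a hypothetical helper, and its choices are assumptions rather than documented behaviour: it splits each entry on the first `:`, trims whitespace, skips empty entries, and does not validate the UUID.

```rust
use std::collections::HashMap;

/// Parse a `model:uuid,model:uuid` string into a model-name → id map.
/// Hypothetical helper; the real gateway's parsing rules may differ.
pub fn parse_model_map(raw: &str) -> Result<HashMap<String, String>, String> {
    let mut map = HashMap::new();
    for entry in raw.split(',').map(str::trim).filter(|e| !e.is_empty()) {
        // Assumption: model names contain no ':' so the first ':' is the separator.
        let (model, id) = entry
            .split_once(':')
            .ok_or_else(|| format!("missing ':' in entry `{entry}`"))?;
        let (model, id) = (model.trim(), id.trim());
        if model.is_empty() || id.is_empty() {
            return Err(format!("empty model or id in entry `{entry}`"));
        }
        map.insert(model.to_string(), id.to_string());
    }
    Ok(map)
}
```

A caller would typically read the variable with `std::env::var("TENSOR_WASM_API_OPENAI_MODEL_MAP")` and pass the result to the parser, treating an unset variable as an empty map.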

Sources

../CHANGELOG.md [0.3.7] section and the T-task wire-up notes (T30, T33–T41, T8/T9/T12).
Per-crate [features] tables in each crates/*/Cargo.toml.
OPENAI-COMPAT.md — T41 wiring + v0.5 caveats.
HARDWARE-GATED-WORK.md — authoritative inventory of unverified-on-silicon paths.
The per-feature spec docs linked in the Notes column.