TensorWasm
Auto-Offload — Supported Patterns and Known Limitations
Auto-Offload — Supported Patterns and Known Limitations
Status: introduced in S14. The Cranelift-fork question is settled in
docs/WASMTIME-FORK.md; this document is the user-facing companion that
explains what code patterns TensorWasm will (and won't) auto-offload to the GPU.
Triggering auto-offload
Auto-offload is off by default. Enable it with the cargo feature flag:
cargo build --workspace --features tensor-wasm-jit/auto-offload
or via the HTTP API by submitting a function with "jit_auto_offload": true
(see crates/tensor-wasm-api/API.md, S17).
When enabled, candidate basic blocks are inspected at pipeline-driver time
by tensor_wasm_jit::detector::classify —
that is, by programs that call the tensor-wasm-jit pipeline directly, not by
Wasmtime at dispatch time. Blocks marked [DetectorVerdict::Offload] are
lowered to PTX via the pipeline described in this document and the result is
inserted into the kernel cache. The rest run on the standard Wasmtime CPU
path. In v0.1.0, an Offload verdict produces a cached PTX entry but does
not yet replace the Cranelift-compiled function body at runtime — see
the next section for why.
Integration status in v0.1.0
TensorWasm v0.1.0 ships the auto-offload pipeline as a library — the detector,
clif-lower, PTX emitter, kernel cache, and deopt guard are all production-
quality and exhaustively tested. What v0.1.0 does NOT yet ship is the
runtime swap: at execution time, Wasmtime continues to dispatch the
Cranelift-compiled function body. Replacing that body with a dispatch stub
into the cached PTX kernel requires per-block compile hooks in
wasmtime-cranelift that upstream does not currently expose — see
WASMTIME-FORK.md for the rationale of not forking.
Concretely:
- The pipeline runs ahead-of-time inside
tensor-wasm-jitand produces validated PTX kernels. Programs can drive it directly via the public API (tensor_wasm_jit::detector::classify→clif_lower::lower_block→ptx_emit::emit→cache::KernelCache::put). - The runtime-swap hook is intentionally deferred. When Wasmtime upstreams
the per-block hook (tracked in
WASMTIME-FORK.md), or when we explicitly decide to fork, the executor can begin consulting the cache at dispatch time. The pipeline output is then a drop-in. - Until then, the
tensor-wasm-jit/auto-offloadfeature flag toggles the pipeline on/off (controlling whether kernels are pre-emitted at all) but does not change runtime dispatch.
Supported patterns
1. Element-wise vector arithmetic on f32
Code that maps a single SIMD op across a statically-bounded loop is the sweet spot:
// In Wasm (wasm32-wasip1), after rustc + wasm-opt:
for i in 0..N { // N must be a constant
c[i] = a[i] + b[i];
}
The detector flags this as Offload when N ≥ 64 and the body is ≥ 80% v128.*.
2. Fused multiply-add (vector dot product, GEMV)
for i in 0..N {
sum[i] += a[i] * b[i]; // lowers to v128.fma
}
Lowered to fma.rn.f32 PTX instructions for sm_80+.
3. Tiled matrix multiply (f16 → f32 accumulator)
A naive (M, K) × (K, N) → (M, N) GEMM written with explicit blocks is
lowered to wmma.mma.sync.aligned.row.col.m16n16k16.f32.f16.f16.f32
(tensor-core path). M, N, K must each be a multiple of 16.
Known limitations
1. Control flow inside the candidate block
If a basic block contains Op::Branch or Op::Call, the lowering returns
LowerError::UnsupportedOp. The block falls back to the CPU path. Loops
with early exits will not auto-offload.
2. f64, integer SIMD, mixed precision
The PTX emitter currently emits only f32 (and f16 for tensor-core MatMul).
v128 ops over i32/i64 lanes are detected but trigger UnsupportedOp in the
lowering pass. Tracked as a follow-up.
3. Dynamic-trip-count loops
loop_trip_count = None (a Wasm loop with a runtime-determined count) is
never offloaded — the launch overhead beats any wins. This is intentional.
4. Cross-block dependencies
Each candidate is one basic block. Patterns that span multiple blocks (prologue + loop + epilogue) are split; only the loop body is considered. Most useful kernels are already inside a single basic block after Cranelift's pre-passes, so in practice this is rarely a problem.
Deopt-on-error behaviour
DeoptGuard tracks per-fingerprint deopt
state. A kernel that has been deopted in this process is not retried on
the GPU until the cache is invalidated. Numerical correctness is checked by
the executor on the first invocation of every newly-emitted kernel
(comparing against the Cranelift reference within a configurable tolerance —
default 1e-4 absolute for f32).
Metrics
The following Prometheus counters surface auto-offload behaviour:
tensor_wasm_offload_success_total— kernels that ran on GPU and passed correctness checks.tensor_wasm_offload_fallback_total— kernels that deopted, plus their reason label (cuda_error,numerical_divergence,assembly_failed).
Compute the success rate as
tensor_wasm_offload_success_total / (tensor_wasm_offload_success_total + tensor_wasm_offload_fallback_total).
Status: S14 of the plan. Expand as we add patterns in S19+.