TensorWasm
Writing CUDA Kernels for Craton TensorWasm
Writing CUDA Kernels for Craton TensorWasm
A practical guide for developers who already know CUDA and want to
ship a kernel that loads and dispatches under TensorWasm's
wasi:cuda surface. Covers explicit dispatch (PTX you write and
the guest loads at runtime), the auto-offload path (Cranelift
pattern detection emits PTX for you), and the shared machinery they
ride on: the W1.1 typed-argv wire format, host-side bounds checking,
and the launch contract from wit/wasi-cuda.wit.
Status: v0.3 workstream. The kernel-args lane in Section 6
shipped in W1.1 (March 2026). The auto-offload runtime-swap hook
(Section 11) is still gated behind the per-block Cranelift hook
tracked in docs/WASMTIME-FORK.md. Path C
(Section 5) is forward-looking scaffold per
RFC 0001.
Contents
- Audience and prerequisites
- Two dispatch surfaces
- Writing a PTX kernel by hand
- Compiling .cu to PTX
- Path C: Rust kernels via cuda-oxide
- Loading and launching from a wasm guest
- Bounds checking and pointer args
- Performance tips
- Common pitfalls
- Testing
- Auto-offload coverage
- Related
1. Audience and prerequisites
This guide is for a developer who has written at least one CUDA
kernel in .cu or hand-rolled PTX, understands the SIMT execution
model (warp = 32 threads, blocks on SMs), knows what cuLaunchKernel's
gridDim / blockDim mean, and wants the kernel to run inside a
Wasm module dispatched by TensorWasm. A working CUDA toolchain pinned
per docs/CUDA-SETUP.md (12.4 on the S22 runner;
12.0–13.2 on contributor dev boxes) is assumed. If you don't have a
CUDA-capable host, the launch path is still exercisable against the
no-CUDA stub (see Section 10).
You do not need Wasm internals beyond: a module imports
functions, exports a linear memory, and writes bytes into it before
calling imports. Host functions are declared in
wit/wasi-cuda.wit and implemented in
crates/tensor-wasm-wasi-gpu.
2. Two dispatch surfaces
TensorWasm exposes two ways for a Wasm guest's GPU work to reach the driver. They share machinery (the kernel registry, the bounds checker, the back-pressure semaphore) but the developer-facing experience is different.
2.1 Explicit dispatch (PTX written by hand)
The guest treats PTX as data: (1) compile .cu → PTX ahead of time;
(2) bundle or fetch the PTX; (3) call
wasi_cuda_load_ptx(ptx_bytes, entry_name) for a kernel_id; (4)
pack parameters using the W1.1 typed-argv wire format
(Section 3.3); (5) call
wasi_cuda_launch(kernel_id, grid, block, shared_mem, args); (6)
call wasi_cuda_sync() before reading results.
This is what most of this guide covers — the only path that ships with v0.2 and the only path where you control the PTX.
2.2 Auto-offload (Cranelift to PTX)
The guest writes nothing CUDA-aware. Plain Wasm SIMD (v128.*) loops
are inspected by tensor_wasm_jit::detector::classify
at JIT-pipeline time; recognized patterns (element-wise f32, FMA
GEMV / dot-product, tiled f16 → f32 matmul, 3x3 conv2d stencil) are
lowered to PTX blueprints and cached. See Section 11
and docs/AUTO-OFFLOAD.md. The auto-offload runtime
swap is not yet wired in v0.2 — the pipeline pre-emits PTX but
Wasmtime still dispatches the Cranelift body at runtime.
3. Writing a PTX kernel by hand
3.1 PTX version target
TensorWasm pins the toolchain to CUDA 12.4 on the S22 runner (see
docs/CUDA-SETUP.md §Required versions),
so PTX ISA 8.0 is the lowest common denominator your kernel must
declare. Anything above 8.0 will fail to JIT on the runner; the
fixtures under kernels/ all pin .version 8.0.
.version 8.0
.target sm_80
.address_size 64
Bumping .version is a build-host concern; bumping .target is a
runtime concern — see the SM matrix next.
3.2 SM target matrix
Mirrors CUDA-SETUP.md §SM-level compatibility matrix:
| Kernel kind | Minimum .target | Hardware floor |
|---|---|---|
| Scalar / vector arithmetic | sm_70 | V100 / Titan V |
cp.async / async-copy intrinsics | sm_80 | A100 |
wmma.mma.sync tensor-core matmul (f16 → f32) | sm_80 | A100 |
cp.async.bulk / tensor-memory accelerator | sm_90 | H100 |
Set .target sm_70 for broadest coverage. A .target exceeding the
running GPU's compute capability surfaces as MalformedPtx (-4) —
cust::module::Module::from_ptx fails the JIT compile inside ptxas.
3.3 Calling convention
In a C host you pass arguments via a void** that cuLaunchKernel
reads. A Wasm guest can't author a void** directly — it lives in
linear memory and only sees i32 offsets — so the W1.1 wire format
flattens the argument list into one tagged byte buffer that the host
parses, bounds-checks, and lowers into the void** for you.
The wire format is documented in full in
crates/tensor-wasm-wasi-gpu/src/kernel_args.rs:
| Tag byte | PTX type | Value bytes (LE) | Wire size |
|---|---|---|---|
0x01 | .s32 | 4 | 5 |
0x02 | .s64 | 8 | 9 |
0x03 | .f32 | 4 | 5 |
0x04 | .f64 | 8 | 9 |
0x05 | .u32 | 4 | 5 |
0x06 | .u64 | 8 | 9 |
0x07 | .ptr (pointer arg) | 4 (guest offset) + 4 (byte length) | 9 |
No inter-argument padding. The buffer is a flat concatenation of
(tag, value) records; the host reads exactly tag.value_bytes()
after each tag byte. The packing maps one-to-one onto a PTX .param
block — the PTX-side type drives the tag:
.visible .entry my_kernel(
.param .u64 a_ptr, // tag 0x07: ptr + len
.param .u64 b_ptr, // tag 0x07: ptr + len
.param .u32 n, // tag 0x05: u32
.param .f32 scale // tag 0x03: f32
)
Pointer args carry both the guest-memory offset and the byte
length. The host bounds-checks [ptr, ptr+len) against the guest's
linear memory; the checked window becomes a raw host pointer that —
under CUDA Unified Memory — doubles as a device address. The original
guest offset is preserved for log lines and tests
(LoweredArg::Ptr).
Sanity caps enforced by parse_argv:
MAX_KERNEL_ARGS = 128MAX_KERNEL_ARGS_BYTES = 4 * 1024(4 KiB)
Either violation surfaces as KernelArgsUnsupported (-10). The
caps are deliberately low so a malicious guest can't pin host memory
inside the parser (docs/RISKS.md).
3.4 Worked example: vector_add.ptx
The reference fixture
kernels/vector_add.ptx round-trips the
calling convention with three pointer args and one scalar:
.version 8.0
.target sm_80
.address_size 64
.visible .entry vector_add(
.param .u64 a_ptr,
.param .u64 b_ptr,
.param .u64 c_ptr,
.param .u32 n
)
{
.reg .pred %p<2>;
.reg .s32 %r<6>;
.reg .s64 %rd<11>;
.reg .f32 %f<4>;
ld.param.u64 %rd1, [a_ptr];
ld.param.u64 %rd2, [b_ptr];
ld.param.u64 %rd3, [c_ptr];
ld.param.u32 %r1, [n];
mov.u32 %r2, %ntid.x;
mov.u32 %r3, %ctaid.x;
mov.u32 %r4, %tid.x;
mad.lo.s32 %r5, %r2, %r3, %r4; // i = bd.x*bi.x + ti.x
setp.ge.s32 %p1, %r5, %r1;
@%p1 bra L_done;
cvt.s64.s32 %rd4, %r5;
shl.b64 %rd5, %rd4, 2; // i * sizeof(f32)
add.s64 %rd6, %rd1, %rd5;
add.s64 %rd7, %rd2, %rd5;
add.s64 %rd8, %rd3, %rd5;
ld.global.f32 %f1, [%rd6];
ld.global.f32 %f2, [%rd7];
add.f32 %f3, %f1, %f2;
st.global.f32 [%rd8], %f3;
L_done:
ret;
}
The four .param declarations correspond, in order, to four argv
records the guest packs:
[tag=0x07, a_offset:u32, a_len:u32, // a_ptr
tag=0x07, b_offset:u32, b_len:u32, // b_ptr
tag=0x07, c_offset:u32, c_len:u32, // c_ptr
tag=0x05, n:u32] // n
Total wire size: 9 + 9 + 9 + 5 = 32 bytes, well under the 4 KiB cap.
4. Compiling .cu to PTX
Most contributors write the source in .cu and let nvcc emit PTX,
rather than hand-rolling PTX. Both reach the same wasi_cuda_load_ptx
call site.
4.1 Basic compile
nvcc --ptx -arch=sm_75 vector_add.cu -o vector_add.ptx
--ptx stops at PTX; the output is UTF-8 text the guest can bundle
as-is. -arch=sm_75 is a concrete target: ship to a higher SM
and the driver JIT-recompiles; ship to a lower SM and the launch
fails.
4.2 Virtual vs concrete targets
For kernels that ship once and run on a range of GPUs, use the virtual form:
nvcc --ptx -arch=compute_70 vector_add.cu -o vector_add.ptx
compute_70 PTX is valid input for any concrete SM at or above 70;
the driver JIT picks the final code at load time. Use compute_70
for scalar / vector, compute_80 for wmma. To support both Turing
(no wmma) and Ampere+ (with wmma), compile two PTX modules and pick
at runtime — there is no wasi_cuda_get_device_capability function
yet (tracked in docs/PATH-TO-V1.md).
4.3 Validation and register usage
ptxas validates the PTX and reports register pressure:
ptxas --gpu-name=sm_75 -O3 -v vector_add.ptx
Sample output (vector_add on SM_75):
ptxas info : Compiling entry function 'vector_add' for 'sm_75'
ptxas info : Function properties for vector_add
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 10 registers, 376 bytes cmem[0]
Watch: registers above ~64 limit occupancy on SM_75 / SM_80, above 255 the kernel can't launch with a 1024-thread block; non-zero stack frame means nvcc spilled to local memory; cmem[0] above ~32 KiB eats into the 64 KiB per-launch cap. The host won't reject high register pressure but contended launches silently lose throughput.
5. Path C: Rust kernels via cuda-oxide
A third authoring path: write the kernel as a #[no_std] Rust
function annotated with cuda-oxide's #[cuda_module] macro, build it
with cargo oxide build, and load the emitted PTX through the same
wasi_cuda_load_ptx host fn that consumes Path A and Path B output.
The pitch is single-language ergonomics — no .ptx or .cu
sidecars, no nvcc-out-of-band step — paired with Rust's safety on
the host-facing pieces of the kernel (lifetimes for pointer args,
trait-checked numeric types) that Path A simply does not check.
The PTX bytes Path C emits are interchangeable with the bytes Paths
A and B emit; nothing downstream of load_ptx knows or cares which
path produced them.
5.1 Status (v0.3.1)
cuda-oxide is v0.1.0 alpha (NVIDIA Labs, released 2026-05-09).
The TensorWasm cuda-oxide-backend feature flag in tensor-wasm-mem
is a scaffold — the host-runtime surface is stubbed, and the
kernel-authoring side (cuda-device, cuda-macros) is not yet
wired into any in-tree example. The real implementation lands in
the v0.4 port tracked by
RFC 0001. The guidance
below is forward-looking; anyone exercising this path against the
v0.1.x crates today should expect API breakage at v0.2 and budget
for rework. See
crates/tensor-wasm-mem/README-cuda-oxide.md
for the host-runtime scaffold's current shape and feature-flag
matrix.
5.2 Toolchain prereqs
cuda-oxide pins nightly-2026-04-03 in its own
rust-toolchain.toml. TensorWasm's workspace now pins the same
channel (nightly-2026-04-03, bumped 2026-05-25 per RFC 0001
"Toolchain plan" step 3 from the prior nightly-2026-03-15 pin),
so Path C builds on the workspace default toolchain — no
RUSTUP_TOOLCHAIN override or branch-local rust-toolchain.toml
edit is required.
# Workspace default toolchain is fine; no override needed:
cargo oxide build --release
Components required on the override toolchain:
rust-src— cuda-oxide's codegen backend re-compilescorefor thenvptx64-nvidia-cudatarget.rustc-dev— cuda-oxide is arustccodegen backend and links againstrustc_publicinternals.llvm-tools-preview— supplies thellvm-*binaries thedialect-llvm→ LLVM IR → PTX stage shells out to.
The cargo oxide subcommand is installed via cargo:
cargo install cargo-oxide # TODO(v0.4): verify exact crate name once cuda-oxide v0.2 ships
5.3 Write a kernel in Rust
Skeleton, illustrative. The exact APIs in cuda_device::prelude may
shift between v0.1.x patch releases:
// src/kernels/vector_add.rs
#![no_std]
use cuda_device::prelude::*;
#[cuda_module]
pub mod vector_add {
pub unsafe fn vector_add(
a: *const f32,
b: *const f32,
out: *mut f32,
n: usize,
) {
let i = thread::index_x() as usize;
if i < n {
*out.add(i) = *a.add(i) + *b.add(i);
}
}
}
// TODO(v0.4): verify against cuda-oxide v0.2 — the prelude
// re-exports (thread::index_x, etc.) are still in flux.
The body mirrors the vector_add PTX fixture from
Section 3.4: one thread per
element, bounds check against n, single add.f32. The
#[cuda_module] macro is what flips the inner items onto the
nvptx64-nvidia-cuda codegen path; everything outside the macro
remains a normal host-side Rust crate.
5.4 Build to PTX
cargo oxide build --release --target nvptx64-nvidia-cuda
# outputs target/nvptx64-nvidia-cuda/release/vector_add.ptx
The compiler pipeline under the hood is
Rust source → rustc_public Stable MIR → dialect-mir → mem2reg → dialect-llvm → LLVM IR → PTX. The emitted .ptx is the same UTF-8
text format Path A authors by hand and Path B emits from nvcc; the
.entry symbol matches the inner-fn name (vector_add). Validate
register pressure with ptxas -v exactly as for Path B — see
Section 4.3.
5.5 Load and dispatch from TensorWasm
The PTX bytes are interchangeable with Paths A and B, so the guest
side is the existing load_ptx + launch sequence verbatim — see
Section 6.2 for the Rust guest and
Section 6.3 for the C guest. The
five-line shape:
const PTX: &[u8] = include_bytes!(
"../target/nvptx64-nvidia-cuda/release/vector_add.ptx");
let kid = unsafe { wasi_cuda_load_ptx(
PTX.as_ptr() as i32, PTX.len() as i32,
b"vector_add".as_ptr() as i32, 10) };
// argv packing + wasi_cuda_launch unchanged from Section 6.2.
Argv packing (the W1.1 wire format from Section 3.3) is unchanged: the host parses the same tagged byte stream regardless of how the PTX was authored.
5.6 Limitations vs Paths A/B today
- Toolchain split is the headline cost. Until the v0.4 workspace
bump, Path C lives on a different nightly than the rest of the
codebase; cross-cutting refactors that touch a Path C kernel and
any other crate require two
cargoinvocations. - v0.1.0 alpha churn. cuda-oxide explicitly warns about API
breakage between v0.1 patch releases. The
cuda_device::preludesurface, the#[cuda_module]macro inputs, and thecargo oxideCLI flags are all candidates for renaming before v0.2. - Kernel stdlib coverage.
cuda-deviceis materially smaller than the CUDA C++ runtime. Atomics, warp-shuffle intrinsics, cooperative-groups, and the bulk of<cuda/std/*>are not yet exposed; kernels that need them stay on Path A or Path B. - No SM_80 wmma yet. cuda-oxide has not lowered tensor-core
intrinsics (
wmma.mma.sync,cp.async.bulk) through its dialect pipeline. Per the SM matrix indocs/CUDA-SETUP.md, this rules out the wmma-blueprint matmul path; tensor-core kernels stay on Path A. - Pliron supply-chain footprint. cuda-oxide pulls Pliron via a
gitrevision pin;cargo-denyand the W3.6 reproducible-builds pipeline both need allowlist entries that the cust and nvcc paths do not require.
5.7 When to pick Path C vs A vs B
| Pick | When |
|---|---|
| A (hand-PTX) | Maximum control; warp-level tuning; tensor-core intrinsics (wmma, cp.async.bulk); SM-specific micro-optimisations the higher paths cannot express. |
| B (.cu via nvcc) | Existing CUDA C++ codebase or team CUDA expertise; full CUDA runtime headers (<cuda/std/*>, atomics, cooperative groups); kernels shared with non-TensorWasm consumers. |
| C (Rust via cuda-oxide) | Rust-native team that values single-language ergonomics; safety on pointer/lifetime bookkeeping; willing to track v0.1.x alpha churn and the toolchain split until v0.4. |
The choice is per kernel, not per project — the same crate can ship
Path A wmma kernels alongside Path C element-wise kernels, loaded
through the same wasi_cuda_load_ptx call site.
5.8 Future (v0.4+)
Path C exists today because v0.4 needs a documented author-side
surface, but the longer-term direction is that many users will
not write CUDA kernels at all. RFC 0001's Pliron lever (see the
module docs in
crates/tensor-wasm-jit/src/pliron_dialect.rs)
adds a fourth front-end to cuda-oxide's pipeline — Wasm → Cranelift IR → dialect-mir — so the auto-offload detector
(Section 11) can lower arbitrary pure-compute Wasm loops to PTX
through the same dialect machinery cuda-oxide uses for Rust source.
That expands auto-offload coverage from the three hand-written
blueprints today to anything the detector can prove safe. Path C
remains for the kernels that benefit from being written explicitly;
the Pliron lowering pass covers everything else.
6. Loading and launching from a wasm guest
The three host functions declared in
wit/wasi-cuda.wit:
load-ptx(ptx: list<u8>, entry: string) -> result<kernel-id, abi-error>launch(kernel, grid, block, shared-mem, args: list<u8>) -> result<_, abi-error>sync() -> result<_, abi-error>
In the raw function-import ABI (pre-Component-Model, what most guests use today):
wasi_cuda_load_ptx(ptx_ptr, ptx_len, entry_ptr, entry_len: i32) -> i64wasi_cuda_launch(kernel_id: i64, grid_x..block_z, shared_mem, args_ptr, args_len: i32) -> i32wasi_cuda_sync() -> i32
All names live under the import module wasi:cuda/host@0.2.0. See
crates/tensor-wasm-wasi-gpu/src/abi.rs
for the full constant set, including the wasi_cuda_last_error_len /
wasi_cuda_last_error_copy pair used to retrieve the host's last
recorded error string.
Note: the inline launch docs in wit/wasi-cuda.wit still describe
v0.1.0 behavior (rejects non-empty args with
kernel-args-unsupported). That comment is stale — W1.1 now accepts
typed argv; kernel-args-unsupported is reserved for sanity-cap
busts (Section 3.3).
6.1 Encoding the argv buffer
The argv buffer is a flat sequence of tagged records. Both example
guests below pack three pointer args plus a scalar count for the
vector_add kernel from Section 3.4.
The host-side helper
encode_argv
is the canonical byte-layout reference.
6.2 Worked example: Rust guest
wasm32-wasip1 guest. Build with cargo build --target wasm32-wasip1 --release.
// SPDX-License-Identifier: Apache-2.0
const PTX: &[u8] = include_bytes!("vector_add.ptx");
const ENTRY: &[u8] = b"vector_add";
const N: u32 = 1024;
const TAG_U32: u8 = 0x05;
const TAG_PTR: u8 = 0x07;
#[link(wasm_import_module = "wasi:cuda/host@0.2.0")]
extern "C" {
fn wasi_cuda_load_ptx(
ptx_ptr: i32, ptx_len: i32,
entry_ptr: i32, entry_len: i32,
) -> i64;
fn wasi_cuda_launch(
kernel_id: i64,
gx: i32, gy: i32, gz: i32, bx: i32, by: i32, bz: i32,
shared_mem: i32, args_ptr: i32, args_len: i32,
) -> i32;
fn wasi_cuda_sync() -> i32;
}
fn push_ptr(buf: &mut Vec<u8>, offset: u32, byte_len: u32) {
buf.push(TAG_PTR);
buf.extend_from_slice(&offset.to_le_bytes());
buf.extend_from_slice(&byte_len.to_le_bytes());
}
fn push_u32(buf: &mut Vec<u8>, v: u32) {
buf.push(TAG_U32);
buf.extend_from_slice(&v.to_le_bytes());
}
#[no_mangle]
pub extern "C" fn _start() {
// Allocate input/output buffers in linear memory.
let bytes_per_buf = (N as usize) * std::mem::size_of::<f32>();
let mut a = vec![0.0f32; N as usize];
let mut b = vec![0.0f32; N as usize];
let mut c = vec![0.0f32; N as usize];
for i in 0..(N as usize) { a[i] = i as f32; b[i] = (2*i) as f32; }
// Register the PTX. Entry must match the .entry symbol exactly.
let kid = unsafe {
wasi_cuda_load_ptx(
PTX.as_ptr() as i32, PTX.len() as i32,
ENTRY.as_ptr() as i32, ENTRY.len() as i32,
)
};
assert!(kid >= 0, "load_ptx failed: {kid}");
// Pack the argv buffer for (a, b, c, n).
let mut argv = Vec::with_capacity(9*3 + 5);
push_ptr(&mut argv, a.as_ptr() as u32, bytes_per_buf as u32);
push_ptr(&mut argv, b.as_ptr() as u32, bytes_per_buf as u32);
push_ptr(&mut argv, c.as_mut_ptr() as u32, bytes_per_buf as u32);
push_u32(&mut argv, N);
// Block dim must be a warp multiple (32). 256 threads/block is the
// modal sweet spot; grid = ceil(N / 256).
let block_x: i32 = 256;
let grid_x: i32 = ((N as i32) + block_x - 1) / block_x;
let rc = unsafe {
wasi_cuda_launch(kid, grid_x, 1, 1, block_x, 1, 1,
0, argv.as_ptr() as i32, argv.len() as i32)
};
assert_eq!(rc, 0, "launch failed: {rc}");
assert_eq!(unsafe { wasi_cuda_sync() }, 0);
// c[i] == a[i] + b[i] under CUDA UVM; under no-CUDA, launch
// returned NotAvailable (-1) and the assert above tripped.
assert!((c[7] - 21.0).abs() < 1e-5);
}
Two non-obvious things: a.as_ptr() as u32 assumes 32-bit Wasm
linear memory (wasm32-wasip1 default) — never as usize, on a
future 64-bit memory build it would silently widen. The argv buffer
has no length prefix or padding; the host walks one tag at a
time until args_len bytes are consumed.
6.3 Worked example: C guest
wasi-sdk C guest. Build with clang --target=wasm32-wasi -O2 guest.c -o guest.wasm.
// SPDX-License-Identifier: Apache-2.0
#include <stdint.h>
#include <string.h>
#include <assert.h>
#define IMPORT __attribute__((import_module("wasi:cuda/host@0.2.0")))
IMPORT int64_t wasi_cuda_load_ptx(int32_t pp, int32_t pl, int32_t ep, int32_t el);
IMPORT int32_t wasi_cuda_launch(int64_t kid,
int32_t gx, int32_t gy, int32_t gz, int32_t bx, int32_t by, int32_t bz,
int32_t shmem, int32_t ap, int32_t al);
IMPORT int32_t wasi_cuda_sync(void);
#define TAG_U32 0x05
#define TAG_PTR 0x07
#define N 1024
extern const uint8_t vector_add_ptx[];
extern const uint32_t vector_add_ptx_len;
static float a[N], b[N], c[N];
static uint8_t argv_buf[32]; /* 3 * 9 (ptr) + 5 (u32) */
static size_t push_ptr(uint8_t *p, uint32_t off, uint32_t len) {
*p++ = TAG_PTR; memcpy(p, &off, 4); p += 4; memcpy(p, &len, 4); return 9;
}
static size_t push_u32(uint8_t *p, uint32_t v) {
*p++ = TAG_U32; memcpy(p, &v, 4); return 5;
}
void _start(void) {
for (uint32_t i = 0; i < N; ++i) { a[i] = (float)i; b[i] = (float)(2*i); }
int64_t kid = wasi_cuda_load_ptx(
(int32_t)(uintptr_t)vector_add_ptx, (int32_t)vector_add_ptx_len,
(int32_t)(uintptr_t)"vector_add", 10);
assert(kid >= 0);
uint8_t *p = argv_buf;
p += push_ptr(p, (uint32_t)(uintptr_t)a, sizeof(a));
p += push_ptr(p, (uint32_t)(uintptr_t)b, sizeof(b));
p += push_ptr(p, (uint32_t)(uintptr_t)c, sizeof(c));
p += push_u32(p, N);
int32_t rc = wasi_cuda_launch(kid,
(N + 255) / 256, 1, 1, 256, 1, 1, 0,
(int32_t)(uintptr_t)argv_buf, (int32_t)(p - argv_buf));
assert(rc == 0);
assert(wasi_cuda_sync() == 0);
/* c[7] == 21.0f under CUDA. */
}
Identical packing: same tag bytes, same little-endian value bytes,
no padding. Any guest language that can write a &[u8] and call an
extern can drive the host.
7. Bounds checking and pointer args
Every pointer arg is bounds-checked before any CUDA call. Inside
parse_argv,
for each Ptr record the host computes end = guest_offset + len
and rejects with InvalidPointer (-2) if the add overflows u32
or end > caller_memory_length. Once the check passes, the kernel
dereferences the resolved host pointer — under CUDA Unified Memory it
doubles as a device address.
Cost: ~50 ns per pointer arg (overflow + compare + slice index). Negligible against the ~5 µs single-block launch overhead, but the 128-arg cap from Section 3.3 caps the worst-case overhead at a few microseconds.
Invariants:
- A zero-length pointer at the exact end of memory (
offset = memory_len,len = 0) is allowed — safe sentinel for "no buffer to pass" without conditional argv packing. - A non-zero
lenwithoffset == memory_lenis rejected. - The pointer's region may overlap the argv buffer; the host treats them as independent reads.
HAZARD: the resolved host pointer is captured into
cuLaunchKernel synchronously, before the wasmtime fiber suspends.
A subsequent memory.grow inside the same guest could relocate
linear memory, invalidating host pointers stored elsewhere. The
launch path is correct because CUDA has already consumed the pointer;
do not hold a LoweredArg::Ptr across a guest-callable boundary
(host.rs HAZARD note).
8. Performance tips
The launch path is thin (parse argv → bounds-check → cuLaunchKernel
→ spawn_blocking(stream.synchronize) → return), so the usual
CUDA-host performance disciplines apply.
Match the wave size. Launch dims so block_x * block_y * block_z
is a multiple of 32 (warp size). A 33-thread block spins up two warps
but wastes one full lane group. 256 threads/block is the modal sweet
spot.
Avoid divergent branches in tight loops. A divergent if
serializes lanes. The vector_add PTX from
Section 3.4 gets this right with
one setp.ge.s32 + early bra — all 32 warp lanes branch together.
Use shared memory for matmul tiles. Declare a .shared tile and
pass shared_mem (bytes) to wasi_cuda_launch; the host forwards
unchanged to cuLaunchKernel. 16x16 f32 (1 KiB) or 32x32 f32 (4 KiB)
per block are common; hardware cap is 48 KiB/block on SM_70+ (96 KiB
opt-in on SM_80+).
Profile with nsys:
nsys profile --stats=true --trace=cuda,osrt ./tensor-wasm run my_module.wasm
--stats=true summarizes per-kernel duration, launch overhead, and
sync wait. Compare against
bench-results/baseline.json for
regressions. For TensorWasm's own counters, use the W1.5
tensor-wasm-cli observe subcommand (docs/CLI.md).
Sizing. Each in-flight launch holds a back-pressure permit until
stream.synchronize returns; a full queue stalls new launches. The
formulas in
docs/CAPACITY-PLANNING.md §4
relate launches/sec to permit count to GPU memory budget.
9. Common pitfalls
KernelArgsUnsupported (-10) — Reserved for sanity-cap busts
only: argv buffer above 4 KiB or more than 128 records. The v0.1.0
contract (returned for any non-empty argv) is gone — W1.1 lowers
typed argv directly into cuLaunchKernel
(host.rs).
InvalidPointer (-2) — A pointer arg's [guest_offset, guest_offset + len) window is outside the guest's linear memory.
Common causes: (1) the length in the argv record is bytes, not
elements — a Vec<f32> of length 1024 is 4096 bytes; (2) growing a
Vec after packing argv, which reallocs and invalidates the recorded
offset. Pin buffers (Vec::with_capacity + no more pushes, or a
fixed-size array) before packing.
unsupported gpu architecture 'compute_80' on SM_75 — PTX
targets an architecture the GPU doesn't support. On RTX 2060 (SM_75),
.target sm_80+ fails at JIT time inside ptxas; the host surfaces
this as MalformedPtx (-4). Fix: re-emit with -arch=compute_70
or -arch=sm_75. See
CUDA-SETUP.md §SM-level compatibility
for the wmma-on-Turing caveat.
Forgetting to call sync — In v0.2 wasi_cuda_launch is
synchronous from the guest's POV (returns 0 after stream.synchronize
completes). But future extensions (multi-launch batching, async launch
— see PATH-TO-V1.md §v0.5)
will require explicit wasi_cuda_sync. Forward-compatible guests call
sync defensively before reading kernel-written buffers.
Entry name mismatch — The entry string passed to
wasi_cuda_load_ptx must match the .visible .entry foo(...) symbol
byte-for-byte (no nul terminator, no leading underscore, no namespace
mangling). Passing "_vector_add" when the PTX declares vector_add
returns MalformedPtx.
10. Testing
10.1 Host-side mock dispatch (no GPU required)
W1.1 added WasiCudaContext::last_lowered_args so argv lowering can
be tested end-to-end without a GPU. On a no-CUDA host,
wasi_cuda_launch returns NotAvailable (-1) after parsing the
argv, and the parsed Vec<LoweredArg> is recorded for inspection.
Representative test
(tests/kernel_args_e2e.rs):
let expected = vec![
LoweredArg::I32(-13),
LoweredArg::U32(0x1234_5678),
LoweredArg::ptr_for_encoding(/* guest_offset */ 256, /* len */ 64),
];
let argv = encode_argv(&expected);
let wat = build_launch_wat(&argv, /*offset=*/1024, kid);
// Build a wasmtime Store with WasiCudaContext, instantiate, call.
let rc = launch_fn.call_async(&mut store, ()).await.unwrap();
#[cfg(not(feature = "cuda"))]
assert_eq!(rc, AbiError::NotAvailable.code());
let recorded = store.data().wasi_cuda().last_lowered_args();
assert_eq!(recorded.len(), 3); // host_ptr re-resolves against live memory.
Use this pattern to verify a guest's argv packer, pin the contract on a kernel before shipping, and regression-test parser / bounds-checker changes. It does not exercise the kernel itself.
10.2 Real-GPU integration tests
Tests needing a real CUDA device are marked
#[ignore = "requires CUDA hardware"]:
#[tokio::test]
#[ignore = "requires CUDA hardware"]
async fn scalar_argv_real_cuda_launch() {
// Same shape as the mock-dispatch test, but launch returns 0 and
// the test reads back the output buffer.
}
The S22 self-hosted CUDA runner runs these via cargo test --features tensor-wasm-wasi-gpu/cuda -- --include-ignored. On a no-CUDA dev
box the bodies still compile, catching ABI drift early. Guideline:
keep the kernel-correctness assertion behind
#[cfg(feature = "cuda")]; keep the argv round-trip assertion
unconditional so it runs on every PR.
11. Auto-offload coverage
Pointer to docs/AUTO-OFFLOAD.md. The detector
recognizes element-wise f32 vector arithmetic (vector_add), fused
multiply-add (GEMV / dot-product), and tiled f16 → f32 matmul
(tensor-core path on SM_80+); a 3x3 conv2d stencil blueprint is
queued but not yet wired. The pipeline produces validated PTX and
caches it; the runtime swap that would replace the Cranelift body
is gated on a per-block hook
(docs/WASMTIME-FORK.md) — until that lands,
auto-offload pre-emits but does not replace. Blocks with control
flow, dynamic trip counts, or cross-block dependencies are skipped
(AUTO-OFFLOAD.md §Known limitations).
Write PTX when the kernel needs control flow, integer SIMD, f64,
custom shared-memory tiling, or intrinsics above the blueprint set.
Use auto-offload for element-wise loops or textbook matmul when
you'd rather write idiomatic Rust + v128 intrinsics. Use both
when the hot inner loop is a blueprint and surrounding logic is custom.
v0.4 expansion via Pliron. The blueprint set above is the
v0.3.x coverage ceiling because the detector emits PTX from three
hand-written templates. The v0.4 Pliron lowering pass tracked in
crates/tensor-wasm-jit/src/pliron_dialect.rs
(scaffold today; real lowering per RFC 0001 step 4) replaces the
template-emitter with a Cranelift IR → Pliron dialect-mir → ... → PTX pipeline that shares cuda-oxide's backend. Once it lands,
coverage expands from the three blueprints to arbitrary pure-compute
loops the detector can prove safe — see
RFC 0001 "Pliron lever and the auto-offload pipeline".
That is what shrinks Path C's audience over time: many users will
not write CUDA kernels at all, the lowering pass auto-offloads them.
12. Related
docs/CUDA-SETUP.md— toolchain, driver matrix, SM compatibility, troubleshooting.docs/AUTO-OFFLOAD.md— pattern list, detector verdicts, deopt behaviour.docs/RISKS.md— v0.1.0 known limitations including the superseded kernel-args contract.docs/CAPACITY-PLANNING.md— sizing formulas for launch rate, permits, GPU memory per tenant.docs/PERFORMANCE.md— measured dispatch medians.docs/BENCHMARKING.md— external-comparison methodology.docs/PATH-TO-V1.md— v0.2 / v0.3 milestone exit criteria.wit/wasi-cuda.wit— authoritative Component-Model interface. The inlinelaunchdocstring is stale (describes v0.1.0); the shipped behavior is whathost.rsandkernel_args.rsimplement.crates/tensor-wasm-wasi-gpu/src/kernel_args.rs— wire format, sanity caps, encoder / decoder.crates/tensor-wasm-wasi-gpu/src/host.rs— launch impl: validation, parsing,cuLaunchKernel, sync.crates/tensor-wasm-wasi-gpu/tests/kernel_args_e2e.rs— e2e tests for scalar / pointer argv and the#[ignore]pattern.rfcs/0001-cuda-oxide-integration.md— Path C provenance: the v0.5 cust-successor RFC that motivates thecuda-oxide-backendfeature and the v0.4 toolchain bump.crates/tensor-wasm-mem/README-cuda-oxide.md— user-facing reference for thecuda-oxide-backendfeature flag, current stub behaviour, and the toolchain-alignment status (the workspace pin matches cuda-oxide's; no override required).crates/tensor-wasm-jit/src/pliron_dialect.rs— module-doc scaffold for the v0.4 Wasm-to-dialect-mirlowering pass that expands auto-offload coverage beyond the three blueprints.
Updated for tensor-wasm v0.3.1 (PATH-TO-V1 W4.5 + RFC 0001 step 5).
Worked examples are pinned to the W1.1 wire format; re-render if the
tag table changes. Path C content is forward-looking against
cuda-oxide v0.1.0 alpha and will need a refresh once cuda-oxide v0.2
ships — see the inline TODO(v0.4) markers in Section 5.