TensorWasm

Wasm Developer Guide

This guide walks you through writing Wasm functions for TensorWasm, from a trivial add(a, b) through hand-tuned GPU kernels and the auto-offload fast path. If you haven't already deployed a hello-world, start with GETTING-STARTED.md.

1. The Wasm target

Craton TensorWasm targets wasm32-wasip1 — the WebAssembly System Interface, Preview 1. This is the modern, stable WASI target and the one you almost always want:

rustup target add wasm32-wasip1

wasm32-wasip1 gives your guest access to a curated set of host imports: clocks, random, filesystem (sandboxed), and TensorWasm's own wasi:cuda/host@0.2.0 for GPU work.

If you're writing a pure compute kernel with no I/O, you can also use wasm32-unknown-unknown. The output is smaller and the link is faster, but you lose all WASI imports — no clocks, no random, no GPU. Use it only when you genuinely need nothing from the host.

Target	WASI imports	`wasi:cuda` available	Typical use
`wasm32-wasip1`	Yes	Yes	Almost everything
`wasm32-unknown-unknown`	No	No	Tiny pure-compute kernels

2. Project layout

A minimal Cargo.toml for a TensorWasm function:

[package]
name = "my_fn"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]

crate-type = ["cdylib"] is what produces a .wasm file with C-ABI exports rather than a Rust rlib. Use this layout for compute libraries that the host calls into by export name.

If instead you want a WASI binary with a _start entry point — useful for one-shot batch jobs — use:

[[bin]]
name = "my_fn"
path = "src/main.rs"

TensorWasm handles both layouts; the choice is yours.

3. A minimal compute function

Here's a complete src/lib.rs exposing both a scalar add and a vectorized run:

#[no_mangle]
pub extern "C" fn add(a: f32, b: f32) -> f32 {
    a + b
}

/// Read `len` floats from `input_ptr`, write `len` doubled values to `out`.
#[no_mangle]
pub extern "C" fn run(input_ptr: *const f32, len: usize, out: *mut f32) {
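    // SAFETY: relies on the host passing pointers that are aligned for f32
    // and that each address `len` floats (len * 4 bytes) of guest memory.
    // The input and output ranges must not overlap.
    let input = unsafe { core::slice::from_raw_parts(input_ptr, len) };
    let output = unsafe { core::slice::from_raw_parts_mut(out, len) };
    // Walk both slices in lockstep; this avoids per-element bounds checks.
    for (dst, src) in output.iter_mut().zip(input) {
        *dst = *src * 2.0;
    }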
}

Build it:

cargo build --target wasm32-wasip1 --release

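The resulting module, target/wasm32-wasip1/release/my_fn.wasm, exports add and run. The host invokes them by name with the arguments you pass through the API. Note that add takes f32 parameters, which the JSON argument path described in 3.1 cannot supply; declare it with f64 (or i32) parameters if you plan to call it from the CLI or the HTTP API.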

Pointers (*const f32, *mut f32) are guest-relative offsets into the Wasm linear memory — they're plain i32 values at the ABI level. TensorWasm's host translates them transparently.

3.1 Typed exports — what the host accepts and returns

Exports invoked through tensor-wasm run --args …, the CLI's tensor-wasm invoke subcommand, and the HTTP API's POST /functions/{id}/invoke endpoint share a single argument-marshalling contract. The executor handles the four core Wasm numeric types as parameters and as results, although f32 parameters cannot be selected from JSON:

Wasm parameter type	JSON input shape	Rust signature example
`i32`	integer literal in `[i32::MIN, i32::MAX]`	`fn add(a: i32, b: i32) -> i32`
`i64`	integer literal outside `i32` range	`fn big(n: i64) -> i64`
`f32`	not selectable from JSON — wrap from `f64` if needed	`fn demote(x: f64) -> f32`
`f64`	non-integer numeric literal	`fn scale(x: f64) -> f64`

The conversion rules are deterministic:

1, 2, 42, -1 → i32. Anything that fits in 32 signed bits never escalates.
2147483648, 9999999999 → i64. Use this for wider counters or pointers from 64-bit guests.
1.5, 3.14, 1e10 → f64. JSON has no way to distinguish "f32" from "f64" literals, so the host always picks f64. If your export's signature is (f32) -> ..., the dynamic call rejects with a type mismatch.
Strings, arrays, null, booleans → rejected as invalid_args (400 on the HTTP path, non-zero exit on the CLI).
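The conversion rules above can be sketched as a small classifier. This illustrates the documented rules only and is not the executor's code: WasmArg and classify are hypothetical names, the input is the literal's source text rather than a parsed JSON value, and the character filter is looser than the full JSON number grammar. Integers too large for i64 fall through to f64 here, which is an assumption; this guide does not specify that case.

```rust
/// Wasm value kinds the JSON argument path can select (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WasmArg {
    I32(i32),
    I64(i64),
    F64(f64),
}

/// Classify one JSON numeric literal given as source text.
/// Hypothetical helper; the name and error value are illustrative.
pub fn classify(literal: &str) -> Result<WasmArg, &'static str> {
    // Accept only characters that can appear in a JSON number. This also
    // rejects strings, null, booleans, and Rust-only spellings such as "inf".
    let numeric = !literal.is_empty()
        && !literal.starts_with('+')
        && literal
            .bytes()
            .all(|b| b.is_ascii_digit() || b"-+.eE".contains(&b));
    if !numeric {
        return Err("invalid_args");
    }
    // Integer literals stay i32 when they fit and widen to i64 otherwise.
    if let Ok(n) = literal.parse::<i64>() {
        if n >= i32::MIN as i64 && n <= i32::MAX as i64 {
            return Ok(WasmArg::I32(n as i32));
        }
        return Ok(WasmArg::I64(n));
    }
    // Any other numeric literal (fraction or exponent) becomes f64.
    literal
        .parse::<f64>()
        .map(WasmArg::F64)
        .map_err(|_| "invalid_args")
}
```

Under these assumptions, classify("42") selects i32, classify("2147483648") selects i64, and classify("1e10") selects f64, matching the examples in the list.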

Result lists are mapped back to JSON symmetrically: i32/i64 → JSON integer, f32/f64 → JSON number. An export returning () produces an empty JSON array, which the CLI prints as the literal ok.

If you need to call a function with arguments the JSON shape cannot represent — f32 parameters, v128 SIMD values, references — write a thin wrapper export in your guest that takes JSON-representable arguments and demotes / packs them on entry. The executor side is intentionally narrow so the wire contract stays predictable.
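A minimal sketch of such a wrapper, assuming a guest routine that works in f32. Both scale_f32 and scale_json are hypothetical names chosen for illustration; the pattern is simply to accept f64 at the export boundary and demote on entry.

```rust
/// Hypothetical f32 routine. Exported as-is, it could not be called with
/// JSON arguments, because JSON numbers are passed as f64.
fn scale_f32(x: f32, factor: f32) -> f32 {
    x * factor
}

/// JSON-callable wrapper: takes f64 arguments, demotes them to f32 on entry,
/// and widens the result back to f64 for the return value.
#[no_mangle]
pub extern "C" fn scale_json(x: f64, factor: f64) -> f64 {
    f64::from(scale_f32(x as f32, factor as f32))
}
```

The demotion with `as f32` rounds to the nearest representable f32, so callers should expect f32 precision in the result even though the wire type is f64.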

4. Using wasi-cuda for explicit GPU kernels

When you've written a CUDA kernel by hand and compiled it to PTX, you can launch it from your Wasm guest using the wasi:cuda/host@0.2.0 import surface.

Declare the host imports your guest will call:

#[link(wasm_import_module = "wasi:cuda/host@0.2.0")]
extern "C" {
    fn wasi_cuda_load_ptx(
        ptx_ptr: i32, ptx_len: i32,
        entry_ptr: i32, entry_len: i32,
    ) -> i64;
    fn wasi_cuda_launch(
        kernel_id: i64,
        grid_x: i32, grid_y: i32, grid_z: i32,
        block_x: i32, block_y: i32, block_z: i32,
        shared_mem: i32,
        args_ptr: i32, args_len: i32,
    ) -> i32;
    fn wasi_cuda_sync() -> i32;
    fn wasi_cuda_last_error_len() -> i32;
}

A complete vector-add kernel that runs on the GPU:

static PTX: &[u8] = include_bytes!("../kernels/vector_add.ptx");

#[no_mangle]
pub extern "C" fn vector_add_gpu(
    a: *const f32, b: *const f32, out: *mut f32, len: usize,
) -> i32 {
    let entry = b"vector_add\0";
    let kernel_id = unsafe {
        wasi_cuda_load_ptx(
            PTX.as_ptr() as i32, PTX.len() as i32,
            entry.as_ptr() as i32, (entry.len() - 1) as i32,
        )
    };
    if kernel_id < 0 { return -1; }

    // Pack arguments using the W1.1 typed-argv wire format: a flat
    // concatenation of `(tag, value)` records with no padding. Each
    // pointer arg is tagged 0x07 and carries a guest offset (u32) plus
    // a byte length (u32); the `len` scalar is a u32 tagged 0x05. All
    // values are little-endian. See CUDA-KERNELS.md §3.3 for the full
    // tag table.
    const TAG_U32: u8 = 0x05;
    const TAG_PTR: u8 = 0x07;
    let buf_bytes = (len * core::mem::size_of::<f32>()) as u32;
    let mut args: Vec<u8> = Vec::with_capacity(9 * 3 + 5);
    for ptr in [a as u32, b as u32, out as u32] {
        args.push(TAG_PTR);
        args.extend_from_slice(&ptr.to_le_bytes());
        args.extend_from_slice(&buf_bytes.to_le_bytes());
    }
    args.push(TAG_U32);
    args.extend_from_slice(&(len as u32).to_le_bytes());

    let block = 256i32;
    let grid = ((len as i32) + block - 1) / block;

    let rc = unsafe {
        wasi_cuda_launch(
            kernel_id,
            grid, 1, 1,
            block, 1, 1,
            0,
            args.as_ptr() as i32, args.len() as i32,
        )
    };
    if rc != 0 { return -2; }

    unsafe { wasi_cuda_sync() }
}

Three things to notice:

PTX is embedded via include_bytes!. TensorWasm caches loaded PTX per-instance keyed by hash, so repeat loads of the same PTX are served from the cache.
kernel_id is opaque — treat it as a handle. It's only valid within the lifetime of the current instance.
wasi_cuda_sync() is explicit — kernel launches are asynchronous on the device. Always sync before reading results from host-shared memory.
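The argument packing inside vector_add_gpu can be factored into small helpers, which also makes the byte layout easy to check off-device. This is a sketch built only from the tags used in the example above (0x05 for a u32 scalar, 0x07 for a pointer); the helper names are hypothetical, and CUDA-KERNELS.md remains the reference for the full tag table.

```rust
const TAG_U32: u8 = 0x05;
const TAG_PTR: u8 = 0x07;

/// Append one pointer record: tag, guest offset, byte length (9 bytes total).
fn push_ptr(args: &mut Vec<u8>, offset: u32, byte_len: u32) {
    args.push(TAG_PTR);
    args.extend_from_slice(&offset.to_le_bytes());
    args.extend_from_slice(&byte_len.to_le_bytes());
}

/// Append one u32 scalar record: tag, value (5 bytes total).
fn push_u32(args: &mut Vec<u8>, value: u32) {
    args.push(TAG_U32);
    args.extend_from_slice(&value.to_le_bytes());
}

/// Build the argv for vector_add: three buffers of `len` f32 values,
/// followed by `len` itself. Offsets are guest-relative.
pub fn pack_vector_add_args(a: u32, b: u32, out: u32, len: u32) -> Vec<u8> {
    // Each f32 is 4 bytes; like the original example, this does not guard
    // against overflow for very large `len`.
    let byte_len = len * 4;
    let mut args = Vec::with_capacity(9 * 3 + 5);
    for offset in [a, b, out] {
        push_ptr(&mut args, offset, byte_len);
    }
    push_u32(&mut args, len);
    args
}
```

For three buffers and one scalar the result is always 32 bytes: three 9-byte pointer records followed by one 5-byte scalar record.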

If anything goes wrong, wasi_cuda_last_error_len() returns the length of a host-side error string; pair it with a call to wasi_cuda_last_error_copy (not declared in the import block above) to retrieve the message.

5. Auto-offload (opt-in)

Many compute loops don't need a hand-written PTX kernel — TensorWasm can detect SIMD-shaped Rust loops and promote them to GPU kernels at instantiation time.

The promotion criteria, at a high level:

The loop must use core::arch::wasm32 v128 SIMD intrinsics, or be auto-vectorized by rustc into v128 ops.
The estimated trip count must exceed the trip_count threshold (default 4096).
The v128-ratio — fraction of body ops that are SIMD — must exceed the configured threshold (default 0.5).

Both thresholds are configurable per deployment. Full details — including the IR pattern matcher and how to inspect promotion decisions — live in AUTO-OFFLOAD.md.
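As an illustration of a SIMD-shaped loop, here is a plain SAXPY written so that rustc has a good chance of auto-vectorizing it. This is a sketch, not a guarantee of promotion: it assumes the guest is built with SIMD enabled (for example RUSTFLAGS="-C target-feature=+simd128"), and whether the loop is actually offloaded still depends on the trip-count and v128-ratio thresholds described above. The function name is hypothetical.

```rust
/// Computes y[i] = a * x[i] + y[i] over two equal-length slices.
/// The body is straight-line arithmetic with no branches or calls, which is
/// the shape a compiler can most easily turn into vector operations.
pub fn saxpy(a: f32, x: &[f32], y: &mut [f32]) {
    // Checking the lengths once up front keeps the loop body free of
    // per-element bounds checks.
    assert_eq!(x.len(), y.len());
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi = a * *xi + *yi;
    }
}
```

With the default threshold, slices shorter than 4096 elements would not be expected to qualify, so small inputs run on the CPU path regardless.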

Auto-offload always keeps a CPU fallback path wired up, so a loop that the GPU rejects still runs on the CPU instead of failing.

6. Memory model

Your Wasm guest sees its own linear memory — a flat u8 buffer addressed by i32 offsets. From the guest's point of view, that's the entire universe.

Under the hood, TensorWasm backs that linear memory with a UnifiedBuffer that's also mapped into the CUDA device's address space. When you call wasi_cuda_launch with a pointer, you're passing the host a guest-relative offset; the host translates that into the equivalent device pointer for the kernel. The kernel reads and writes the same bytes your guest sees — no explicit cudaMemcpy needed.

The practical implications:

Allocate buffers with Vec<T> or Box<[T]> and pass .as_ptr() / .as_mut_ptr() directly to host calls. They're already in unified memory.
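Don't assume pointers are stable across a sync if you've grown a Vec in the meantime: the Vec may have reallocated, and linear memory may have been re-paged. Fetch .as_ptr() / .as_mut_ptr() again after any growth.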
Cross-instance buffer sharing is not supported; each instance owns its memory.
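One way to follow these rules is to size a buffer once and compute its offset only at the point of the host call. The sketch below assumes nothing about the host beyond the (offset, byte length) pair used in section 4; GuestBuffer and its methods are hypothetical names.

```rust
/// Hypothetical helper that owns a buffer and reports the
/// (offset, byte length) pair used for pointer arguments.
pub struct GuestBuffer {
    data: Vec<f32>,
}

impl GuestBuffer {
    /// Allocate `len` zeroed floats up front so the buffer never has to grow
    /// (and therefore never moves) while a kernel may be using it.
    pub fn zeroed(len: usize) -> Self {
        GuestBuffer { data: vec![0.0; len] }
    }

    /// Guest-relative offset of the first element. On wasm32 a pointer is
    /// already a 32-bit offset; on a 64-bit host this cast truncates, so the
    /// value is only meaningful inside a Wasm guest.
    pub fn offset(&self) -> u32 {
        self.data.as_ptr() as usize as u32
    }

    /// Size of the buffer in bytes, as carried in a pointer record.
    pub fn byte_len(&self) -> u32 {
        (self.data.len() * std::mem::size_of::<f32>()) as u32
    }

    /// Read access for inspecting results after a sync.
    pub fn as_slice(&self) -> &[f32] {
        &self.data
    }
}
```

Calling offset() right before each host call, rather than caching it, avoids passing a stale pointer if the buffer was ever replaced.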

7. Limits per instance

Limit	Default	Notes
`MAX_PTX_BYTES`	8 MiB	Per `wasi_cuda_load_ptx` call.
`MAX_KERNELS_PER_INSTANCE`	256	Across the instance's lifetime.
Epoch deadline	30 s	Wired from `SpawnConfig`; configurable per deployment.
Linear memory	256 MiB	`EngineConfig::max_memory_bytes`.

Hitting any of these terminates the instance with a clear diagnostic; they're guardrails, not silent truncations.
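If you would rather catch an oversized PTX blob at build time than at instantiation, a compile-time assertion is one option. The sketch below copies the 8 MiB default from the table into a guest-side constant, which is an assumption you would need to keep in sync with your deployment; the placeholder bytes stand in for include_bytes! so the example compiles without a kernel file.

```rust
/// Guest-side copy of the default MAX_PTX_BYTES from the limits table.
/// Assumption: the deployment uses the default value.
pub const MAX_PTX_BYTES: usize = 8 * 1024 * 1024;

/// Stand-in for `include_bytes!("../kernels/vector_add.ptx")`.
pub const PTX: &[u8] = b"// placeholder PTX text";

// Evaluated at compile time: the build fails if the embedded PTX is too
// large, instead of the instance being terminated at load time.
const _: () = assert!(PTX.len() <= MAX_PTX_BYTES);

/// Runtime variant for PTX that is generated or selected dynamically.
pub fn ptx_within_limit(ptx: &[u8]) -> bool {
    ptx.len() <= MAX_PTX_BYTES
}
```

The host still enforces its own limit; this check only moves the failure earlier for the common embedded-PTX case.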

8. Debugging

The single most useful knob is TENSOR_WASM_LOG:

TENSOR_WASM_LOG=debug cargo run --bin tensor-wasm -- run my_fn.wasm

At debug, the host logs every WASI-CUDA call with its arguments — PTX hash on load, grid/block dims on launch, return codes on every call. At trace, you also get per-arg byte dumps.

Tracing spans are grouped per instance, so when you're running under the HTTP server you can pivot in Jaeger from a single invocation down to every kernel it dispatched.

For deeper performance work, see PERFORMANCE.md and COLD-START.md.