Modern problems require modern solutions.
Most GPU dataframe engines chain precompiled kernels and bounce intermediates through global memory between each one. Craton Bolt takes the opposite approach: it compiles each SQL query into a single fresh NVIDIA PTX kernel at runtime, loads it through the CUDA driver, and runs the entire fused expression tree in registers.
The full pipeline — parse, plan, codegen, launch — is pure Rust. There is no C++ shim, no precompiled kernel library, and no FFI to a third-party query engine. Everything below cuModuleLoadData is NVIDIA's CUDA driver; everything above it is Rust.
GPU memory is borrow-checked. Allocations are typed handles, read access borrows a shared view, and writes take an exclusive one — and kernel launches require those borrows. Use-after-free, double-free, and mutable/shared aliasing across kernel boundaries become compile errors, exactly as Rust already guarantees for CPU memory.
JIT-compiling every query costs almost nothing: planning plus codegen stays under 25 µs regardless of dataset size. On a 50 M-row fused-arithmetic workload, Craton Bolt runs 32.4× faster than multi-threaded Polars, and on high-cardinality GROUP BY it beats both Polars and DuckDB. Craton Bolt is licensed under Apache-2.0 and is pre-1.0: the public API is still unstable, and it is not production-ready.
Kernel fusion via runtime PTX
One PTX kernel per query keeps the entire fused expression tree in registers — instead of chaining precompiled kernels and bouncing intermediates through global memory the way RAPIDS / cuDF do.
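To make the fusion idea concrete, here is a minimal sketch of lowering an expression tree into one kernel body. It is not Craton Bolt's code generator: the `Expr` type, the `lower` / `fuse` / `example` functions, and the `%col0` / `%dst` address registers are all hypothetical, and address arithmetic, thread indexing, and the kernel header are left out. What it shows is that every intermediate value lives in a virtual register, and global memory is touched only to load the inputs and store the final result.

```rust
// Hypothetical expression tree; NOT Craton Bolt's real IR.
enum Expr {
    Col(usize), // input column, identified by index
    Lit(f32),   // constant
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Hands out the next virtual f32 register name: %f0, %f1, ...
fn fresh(next: &mut u32) -> String {
    let r = format!("%f{}", *next);
    *next += 1;
    r
}

// Lowers a binary op: both operands first, then one register-to-register instruction.
fn bin(op: &str, a: &Expr, b: &Expr, out: &mut Vec<String>, next: &mut u32) -> String {
    let ra = lower(a, out, next);
    let rb = lower(b, out, next);
    let r = fresh(next);
    out.push(format!("{op} {r}, {ra}, {rb};"));
    r
}

// Appends PTX-style instructions for `e` and returns the register holding its value.
fn lower(e: &Expr, out: &mut Vec<String>, next: &mut u32) -> String {
    match e {
        Expr::Col(i) => {
            let r = fresh(next);
            // %col{i} stands in for a per-thread element address (assumed).
            out.push(format!("ld.global.f32 {r}, [%col{i}];"));
            r
        }
        Expr::Lit(v) => {
            let r = fresh(next);
            // PTX writes f32 immediates as 0f followed by the hex bit pattern.
            out.push(format!("mov.f32 {r}, 0f{:08X};", v.to_bits()));
            r
        }
        Expr::Add(a, b) => bin("add.f32", a, b, out, next),
        Expr::Mul(a, b) => bin("mul.f32", a, b, out, next),
    }
}

// Fuses the whole tree into one instruction list with a single final store.
fn fuse(e: &Expr) -> Vec<String> {
    let mut out = Vec::new();
    let mut next = 0;
    let result = lower(e, &mut out, &mut next);
    out.push(format!("st.global.f32 [%dst], {result};"));
    out
}

// Usage: (col0 + col1) * 2.0
fn example() -> Vec<String> {
    let e = Expr::Mul(
        Box::new(Expr::Add(Box::new(Expr::Col(0)), Box::new(Expr::Col(1)))),
        Box::new(Expr::Lit(2.0)),
    );
    fuse(&e)
}
```

For `(col0 + col1) * 2.0` the sketch emits two loads, three register-only instructions, and one store. A chain of precompiled kernels would instead write the `col0 + col1` intermediate to global memory and read it back.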
Borrow-checked GPU memory
GPU allocations are typed handles (GpuVec<T>), borrowed read-only or exclusively. Use-after-free, double-free, and shared/mutable aliasing across kernel boundaries are rejected at compile time — the same guarantees Rust makes for CPU memory.
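The borrowing pattern can be illustrated with a host-only mock. The type names `GpuVec`, `GpuView`, and `GpuViewMut` come from this page, but everything else here is an assumption: the method names (`from_host`, `view`, `view_mut`, `to_host`), the `launch_add` function, and the fact that the buffers are backed by a plain `Vec` so the sketch runs without a GPU. The real crate's API may look quite different.

```rust
// Host-backed stand-ins for illustration only; the real types own device memory.
struct GpuVec<T> {
    data: Vec<T>,
}

// Shared, read-only borrow of a buffer.
struct GpuView<'a, T> {
    data: &'a [T],
}

// Exclusive, writable borrow of a buffer.
struct GpuViewMut<'a, T> {
    data: &'a mut [T],
}

impl<T> GpuVec<T> {
    fn from_host(v: Vec<T>) -> Self {
        GpuVec { data: v }
    }
    fn view(&self) -> GpuView<'_, T> {
        GpuView { data: &self.data }
    }
    fn view_mut(&mut self) -> GpuViewMut<'_, T> {
        GpuViewMut { data: &mut self.data }
    }
    fn to_host(&self) -> Vec<T>
    where
        T: Clone,
    {
        self.data.clone()
    }
}

// A mock "kernel launch": it can only be called with live borrows of its buffers.
fn launch_add(a: GpuView<'_, f32>, b: GpuView<'_, f32>, out: GpuViewMut<'_, f32>) {
    for ((o, x), y) in out.data.iter_mut().zip(a.data).zip(b.data) {
        *o = x + y;
    }
}

fn demo() -> Vec<f32> {
    let a = GpuVec::from_host(vec![1.0, 2.0, 3.0]);
    let b = GpuVec::from_host(vec![10.0, 20.0, 30.0]);
    let mut out = GpuVec::from_host(vec![0.0; 3]);

    launch_add(a.view(), b.view(), out.view_mut());

    // Each of these would be rejected by the borrow checker if uncommented:
    // launch_add(out.view(), b.view(), out.view_mut()); // shared + exclusive alias
    // let v = a.view(); drop(a); launch_add(v, b.view(), out.view_mut()); // use-after-free

    out.to_host()
}
```

Because the views carry the lifetime of the `GpuVec` they borrow from, the aliasing and lifetime errors in the commented lines surface as ordinary Rust compile errors rather than runtime faults.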
Pure Rust, no C++ shim
The full pipeline — parse, plan, codegen, launch — is pure Rust on the raw CUDA driver API. No precompiled kernel library, and no FFI to a third-party query engine.
Broad SQL surface
Projection, filter, aggregates, GROUP BY, every join type, set ops, CTEs (including WITH RECURSIVE), window functions, ORDER BY / LIMIT / HAVING, plus Decimal128 and date/time arithmetic lowered to the GPU.
The Technical Edge
Why experts choose Craton Bolt
Runtime PTX codegen
Each query lowers to an SSA-shaped IR of ops, then to a fresh PTX module emitted in Rust and assembled to SASS by the driver (cuModuleLoadData). A KernelSpec-keyed LRU module cache (128-bit key) plus a self-invalidating on-disk cache means repeated query shapes skip recompilation. Targets sm_70 (Volta) and newer.
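The in-memory half of that caching scheme can be sketched as a small LRU map keyed by a 128-bit value. This is an illustration under stated assumptions, not the crate's implementation: `ModuleCache` and `get_or_compile` are hypothetical names, a `String` stands in for a loaded module handle, and how the 128-bit key is derived from a KernelSpec is not covered.

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical LRU cache: 128-bit key -> compiled module (a String stands in here).
struct ModuleCache {
    capacity: usize,
    modules: HashMap<u128, String>,
    order: VecDeque<u128>, // front = least recently used
    compiles: usize,       // how many times the compile closure actually ran
}

impl ModuleCache {
    fn new(capacity: usize) -> Self {
        ModuleCache {
            capacity,
            modules: HashMap::new(),
            order: VecDeque::new(),
            compiles: 0,
        }
    }

    // Returns the cached module for `key`, compiling it only on a miss.
    fn get_or_compile(&mut self, key: u128, compile: impl FnOnce() -> String) -> &str {
        if self.modules.contains_key(&key) {
            // Hit: remove the key from its old position; it is re-queued below.
            self.order.retain(|k| *k != key);
        } else {
            // Miss: evict the least recently used entry when the cache is full.
            if self.modules.len() == self.capacity {
                if let Some(old) = self.order.pop_front() {
                    self.modules.remove(&old);
                }
            }
            self.compiles += 1;
            self.modules.insert(key, compile());
        }
        self.order.push_back(key); // mark as most recently used
        &self.modules[&key]
    }
}
```

The linear `retain` keeps the sketch short; a production cache would more likely use an intrusive list or an existing LRU crate to make hits O(1).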
CUDA-Oxide memory model
GpuVec<T> owns device memory; GpuView and the !Sync / !Copy GpuViewMut borrow it. Buffers are Arrow-aligned, so results download straight into arrow-rs RecordBatches. Equality and LIKE over dictionary-encoded Utf8 fold into pure integer index-membership predicates on the GPU.
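The dictionary-folding trick is easy to show on the CPU. In this sketch (all names hypothetical, and a prefix match standing in for general LIKE), the string predicate runs once per distinct dictionary entry, and the per-row work reduces to an integer lookup. That per-row step is the part a GPU kernel would execute.

```rust
// Hypothetical dictionary-encoded string column: each row stores an index into `dict`.
struct DictColumn {
    dict: Vec<String>,
    indices: Vec<u32>,
}

// Evaluates the string predicate once per dictionary entry.
// Result: member[code] is true when that dictionary entry matches.
fn matching_codes(dict: &[String], pred: impl Fn(&str) -> bool) -> Vec<bool> {
    dict.iter().map(|s| pred(s.as_str())).collect()
}

// Per-row filter: a pure integer index-membership test, no string comparison.
fn filter_rows(col: &DictColumn, member: &[bool]) -> Vec<bool> {
    col.indices.iter().map(|&i| member[i as usize]).collect()
}

// Usage: the equivalent of `WHERE fruit LIKE 'ap%'`.
fn example() -> Vec<bool> {
    let col = DictColumn {
        dict: vec!["apple".into(), "apricot".into(), "banana".into()],
        indices: vec![0, 2, 1, 2, 0],
    };
    let member = matching_codes(&col.dict, |s| s.starts_with("ap"));
    filter_rows(&col, &member)
}
```

With three distinct strings and five rows, the string comparison runs three times regardless of row count; only the index lookups scale with the data.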
Tiered GROUP BY, joins & sort
Multi-tier shared-memory and hash-partitioned GROUP BY kernels; GPU hash-join build/probe for qualifying INNER / LEFT / RIGHT / FULL / CROSS shapes with host fallbacks; GPU bitonic and radix sort for ORDER BY.
Requires CUDA 12+ and an NVIDIA GPU with compute capability 7.0+; the crate type-checks anywhere via the cuda-stub feature.
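For readers unfamiliar with bitonic sort, the following is a sequential CPU sketch of the standard sorting network, not Craton Bolt's kernel. The function name `bitonic_sort` and the `u32` key type are assumptions, and the sketch requires a power-of-two input length. The network suits GPUs because the compare-and-swap pairs within one `(k, j)` stage are independent, so the inner `for` loop is the part that would map to parallel threads.

```rust
// Sequential bitonic sorting network (ascending). Panics unless `v.len()` is a power of two.
fn bitonic_sort(v: &mut [u32]) {
    let n = v.len();
    assert!(n.is_power_of_two(), "sketch only handles power-of-two lengths");

    let mut k = 2; // size of the bitonic sequences being merged
    while k <= n {
        let mut j = k / 2; // distance between compared elements
        while j > 0 {
            // Every iteration of this loop is independent of the others,
            // which is what a GPU would run as one thread per element.
            for i in 0..n {
                let l = i ^ j; // partner index
                if l > i {
                    let ascending = (i & k) == 0;
                    let out_of_order = if ascending { v[i] > v[l] } else { v[i] < v[l] };
                    if out_of_order {
                        v.swap(i, l);
                    }
                }
            }
            j /= 2;
        }
        k *= 2;
    }
}
```

Inputs that are not a power of two are typically padded with a sentinel maximum value before sorting; that step is omitted here.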