Craton Bolt

SQL strings go in. NVIDIA PTX comes out at runtime. The GPU does the rest.

License: Apache-2.0

32.4×

vs Polars · 50 M-row arithmetic

< 25 µs

Plan + lower + codegen / query

742

Integration & PTX-golden tests

v0.7.0 · pre-1.0

Active development

Modern problems require modern solutions.

Most GPU dataframe engines chain precompiled kernels and bounce intermediates through global memory between each one. Craton Bolt takes the opposite approach: it compiles each SQL query into a single fresh NVIDIA PTX kernel at runtime, loads it through the CUDA driver, and runs the entire fused expression tree in registers.
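To make the fusion idea concrete, here is a minimal sketch of lowering a small expression tree into one straight-line sequence of PTX-style instructions, where every intermediate lives in a virtual register and only the final result is stored. This is an illustration under stated assumptions, not Craton Bolt's code generator: `Expr`, `Emitter`, and `fuse` are hypothetical names, the `%col0` / `%out` operands are placeholder address registers, and the output is a kernel body fragment rather than a complete PTX module.

```rust
// Illustrative sketch only. `Expr`, `Emitter`, and `fuse` are hypothetical
// names and do not reflect Craton Bolt's real API or IR.

/// A tiny arithmetic expression over f32 input columns.
pub enum Expr {
    Col(usize), // value of input column `i` for the current row
    Lit(f32),   // constant
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Accumulates instructions and hands out fresh virtual registers.
pub struct Emitter {
    pub body: Vec<String>,
    next_reg: u32,
}

impl Emitter {
    pub fn new() -> Self {
        Emitter { body: Vec::new(), next_reg: 0 }
    }

    fn fresh(&mut self) -> String {
        let r = format!("%f{}", self.next_reg);
        self.next_reg += 1;
        r
    }

    /// Emits code for `e` and returns the register that holds its value.
    pub fn lower(&mut self, e: &Expr) -> String {
        match e {
            Expr::Col(i) => {
                let r = self.fresh();
                // `%col{i}` is a placeholder for the column's row address.
                self.body.push(format!("ld.global.f32 {}, [%col{}];", r, i));
                r
            }
            Expr::Lit(v) => {
                let r = self.fresh();
                // PTX spells f32 immediates as 0f followed by 8 hex digits.
                self.body.push(format!("mov.f32 {}, 0f{:08X};", r, v.to_bits()));
                r
            }
            Expr::Add(a, b) => {
                let (ra, rb) = (self.lower(a), self.lower(b));
                let r = self.fresh();
                self.body.push(format!("add.f32 {}, {}, {};", r, ra, rb));
                r
            }
            Expr::Mul(a, b) => {
                let (ra, rb) = (self.lower(a), self.lower(b));
                let r = self.fresh();
                self.body.push(format!("mul.f32 {}, {}, {};", r, ra, rb));
                r
            }
        }
    }
}

/// Lowers the whole tree into one instruction list with a single final store.
pub fn fuse(e: &Expr) -> Vec<String> {
    let mut em = Emitter::new();
    let out = em.lower(e);
    em.body.push(format!("st.global.f32 [%out], {};", out));
    em.body
}
```

For an expression such as `a * 2.0 + b`, this emits two loads, one constant move, one multiply, one add, and one store. The product never touches global memory, which is the point of fusing the tree into one kernel.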

The full pipeline — parse, plan, codegen, launch — is pure Rust. There is no C++ shim, no precompiled kernel library, and no FFI to a third-party query engine. Everything below cuModuleLoadData is NVIDIA's CUDA driver; everything above it is Rust.

GPU memory is borrow-checked. Allocations are typed handles: read access borrows a shared view, writes take an exclusive one, and kernel launches require those borrows. Use-after-free, double-free, and mutable/shared aliasing across kernel boundaries become compile errors, exactly as Rust already guarantees for CPU memory.

JIT-compiling every query costs almost nothing: planning, lowering, and codegen together stay under 25 µs per query, regardless of dataset size. On a 50 M-row fused-arithmetic workload, Craton Bolt runs 32.4× faster than multi-threaded Polars, and on high-cardinality GROUP BY it outperforms both Polars and DuckDB. The project is Apache-2.0 licensed and pre-1.0: the public API is still unstable, and it is not production-ready.

✦

Kernel fusion via runtime PTX

One PTX kernel per query keeps the entire fused expression tree in registers — instead of chaining precompiled kernels and bouncing intermediates through global memory the way RAPIDS / cuDF do.

✦

Borrow-checked GPU memory

GPU allocations are typed handles (GpuVec<T>), borrowed read-only or exclusively. Use-after-free, double-free, and shared/mutable aliasing across kernel boundaries are rejected at compile time — the same guarantees Rust makes for CPU memory.
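The borrowing pattern can be sketched with ordinary Rust lifetimes. The example below is a host-only stand-in, assuming a `Vec<T>` in place of real device memory; the method names (`from_host`, `view`, `view_mut`, `to_host`) and the `launch_add` function are hypothetical and chosen only for illustration, so the real crate's signatures may differ.

```rust
use std::cell::Cell;
use std::marker::PhantomData;

/// Owns a "device" allocation. A host Vec stands in for GPU memory here.
pub struct GpuVec<T> {
    data: Vec<T>,
}

/// Shared, read-only borrow of a GpuVec.
pub struct GpuView<'a, T> {
    data: &'a [T],
}

/// Exclusive, writable borrow of a GpuVec. It derives neither Clone nor Copy,
/// and the PhantomData<Cell<()>> marker makes the type !Sync.
pub struct GpuViewMut<'a, T> {
    data: &'a mut [T],
    _not_sync: PhantomData<Cell<()>>,
}

impl<T: Clone> GpuVec<T> {
    /// Hypothetical upload: copies host data into the allocation.
    pub fn from_host(host: &[T]) -> Self {
        GpuVec { data: host.to_vec() }
    }

    /// Any number of shared views may coexist.
    pub fn view(&self) -> GpuView<'_, T> {
        GpuView { data: &self.data }
    }

    /// Only one exclusive view may exist, and no shared views alongside it.
    pub fn view_mut(&mut self) -> GpuViewMut<'_, T> {
        GpuViewMut { data: &mut self.data, _not_sync: PhantomData }
    }

    /// Hypothetical download back to the host.
    pub fn to_host(&self) -> Vec<T> {
        self.data.clone()
    }
}

/// Stand-in for a kernel launch computing out[i] = a[i] + b[i].
/// The signature is what matters: inputs are shared, the output is exclusive.
pub fn launch_add(a: GpuView<'_, f32>, b: GpuView<'_, f32>, mut out: GpuViewMut<'_, f32>) {
    for ((o, x), y) in out.data.iter_mut().zip(a.data).zip(b.data) {
        *o = x + y;
    }
}

// The borrow checker rejects misuse such as the following (kept as comments
// so that the sketch compiles):
//
//   let mut v = GpuVec::from_host(&[1.0f32]);
//   launch_add(v.view(), v.view(), v.view_mut()); // shared + exclusive alias
//
//   let view = v.view();
//   drop(v);                                      // free while still borrowed
//   launch_add(view, ...);
```

A valid call passes two distinct inputs and a separate output, for example `launch_add(a.view(), b.view(), out.view_mut())`. Because the views carry the lifetime of the owning `GpuVec`, the allocation cannot be dropped or mutably re-borrowed while a launch still holds them.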

✦

Pure Rust, no C++ shim

The full pipeline — parse, plan, codegen, launch — is pure Rust on the raw CUDA driver API. No precompiled kernel library, and no FFI to a third-party query engine.

✦

Broad SQL surface

Projection, filter, aggregates, GROUP BY, every join type, set ops, CTEs (including WITH RECURSIVE), window functions, ORDER BY / LIMIT / HAVING, plus Decimal128 and date/time arithmetic lowered to the GPU.

The Technical Edge

Why experts choose Craton Bolt

Runtime PTX codegen

Each query lowers to an SSA-shaped IR of ops, then to a fresh PTX module emitted in Rust and assembled to SASS by the driver (cuModuleLoadData). A KernelSpec-keyed LRU module cache (128-bit key) plus a self-invalidating on-disk cache means repeated query shapes skip recompilation. Targets sm_70 (Volta) and newer.
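As a rough illustration of how a keyed module cache avoids recompilation, the sketch below implements a small LRU cache over `u128` keys. It is an assumption-laden toy, not the crate's implementation: `ModuleCache` and `get_or_compile` are hypothetical names, the module type is left generic, and how a KernelSpec is hashed into its 128-bit key is not shown.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical LRU cache mapping a 128-bit kernel key to a loaded module.
pub struct ModuleCache<M> {
    capacity: usize,
    modules: HashMap<u128, M>,
    order: VecDeque<u128>, // front = least recently used
}

impl<M> ModuleCache<M> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        ModuleCache {
            capacity,
            modules: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// Marks `key` as the most recently used entry.
    fn touch(&mut self, key: u128) {
        self.order.retain(|k| *k != key);
        self.order.push_back(key);
    }

    /// Returns the cached module, calling `compile` only on a miss.
    /// On a miss with a full cache, the least recently used entry is evicted.
    pub fn get_or_compile(&mut self, key: u128, compile: impl FnOnce() -> M) -> &M {
        if !self.modules.contains_key(&key) {
            if self.modules.len() == self.capacity {
                if let Some(evicted) = self.order.pop_front() {
                    self.modules.remove(&evicted);
                }
            }
            self.modules.insert(key, compile());
        }
        self.touch(key);
        &self.modules[&key]
    }

    pub fn contains(&self, key: u128) -> bool {
        self.modules.contains_key(&key)
    }
}
```

The linear `retain` in `touch` keeps the sketch short; a production cache would more likely use an intrusive list or an existing LRU crate so that hits stay O(1).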

CUDA-Oxide memory model

GpuVec<T> owns device memory; GpuView and the !Sync / !Copy GpuViewMut borrow it. Buffers are Arrow-aligned, so results download straight into arrow-rs RecordBatches. Equality and LIKE over dictionary-encoded Utf8 fold into pure integer index-membership predicates on the GPU.
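The dictionary folding described above can be sketched on the host. The idea, under the assumption of a plain dictionary-encoded column (a list of distinct strings plus one integer index per row), is to evaluate the string predicate once per dictionary entry and then reduce the per-row test to an integer lookup. The function names here are hypothetical, and a `starts_with` closure stands in for a `LIKE 'ap%'` pattern.

```rust
/// Evaluates a string predicate once per distinct dictionary value.
/// The result is a membership mask indexed by dictionary index.
pub fn fold_predicate(dictionary: &[&str], pred: impl Fn(&str) -> bool) -> Vec<bool> {
    dictionary.iter().map(|s| pred(s)).collect()
}

/// Per-row filter: no string comparison happens here, only an integer
/// lookup into the mask. This is the part that would run on the GPU.
pub fn filter_rows(indices: &[u32], mask: &[bool]) -> Vec<bool> {
    indices.iter().map(|&i| mask[i as usize]).collect()
}
```

With a dictionary of `["apple", "banana", "apricot"]` and row indices `[0, 1, 2, 1, 0]`, the predicate runs three times regardless of row count, and the per-row work is a single indexed load.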

Tiered GROUP BY, joins & sort

Multi-tier shared-memory and hash-partitioned GROUP BY kernels; GPU hash-join build/probe for qualifying INNER / LEFT / RIGHT / FULL / CROSS shapes with host fallbacks; GPU bitonic and radix sort for ORDER BY. Requires CUDA 12+ and an NVIDIA GPU with compute capability 7.0+; the crate type-checks anywhere via the cuda-stub feature.
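For readers unfamiliar with bitonic sort, the following is a textbook sequential version written in Rust, shown only to illustrate why the algorithm suits GPUs. It is not Craton Bolt's kernel: it runs on the host, handles only power-of-two lengths, and sorts `u32` keys in ascending order.

```rust
/// Textbook bitonic sort, ascending. The length must be a power of two.
/// Each (k, j) pass is a fixed set of independent compare-exchange steps,
/// so on a GPU the inner `for i` loop would map to one thread per element.
pub fn bitonic_sort(data: &mut [u32]) {
    let n = data.len();
    assert!(n.is_power_of_two(), "length must be a power of two");

    let mut k = 2; // size of the bitonic sequences being merged
    while k <= n {
        let mut j = k / 2; // distance between compared elements
        while j > 0 {
            for i in 0..n {
                let partner = i ^ j;
                if partner > i {
                    // Blocks alternate between ascending and descending.
                    let ascending = (i & k) == 0;
                    let out_of_order = if ascending {
                        data[i] > data[partner]
                    } else {
                        data[i] < data[partner]
                    };
                    if out_of_order {
                        data.swap(i, partner);
                    }
                }
            }
            j /= 2;
        }
        k *= 2;
    }
}
```

The comparison pattern depends only on the indices, never on the data, which is what lets every pass run as a uniform parallel step.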
