Craton Bolt
Changelog
All notable changes to this project will be documented here. The format follows Keep a Changelog and the project tries to follow Semantic Versioning once it leaves 0.x.
Note on version 0.2.0
There is no 0.2.0 release. The project jumped from 0.1.0 (2026-05-23) directly to 0.3.0 (2026-05-26) — a three-day span in which the scope grew well past what a single minor bump could honestly carry (multi-batch tables, INNER JOIN, DISTINCT / LIMIT / ORDER BY / HAVING / UNION, real cuda-stub, PTX cache, CI). Tagging an intermediate 0.2.0 would have been a paper milestone, so the version number was reserved and skipped.
[Unreleased]
Performance
- Tier-2 group-by host slot-walk (exec::groupby_tier2_common::collect_populated_slots_sorted): the post-reduce collection over the fixed NUM_PARTITIONS × BLOCK_GROUPS (~4.2M-entry) slot buffer now uses a fused single-pass serial scan (pre-sized output Vec, hoisted bounds) and, above a 256K-slot threshold, a std::thread::scope parallel scan (no new dependency). Output is byte-identical to the previous serial implementation (ordered chunk concatenation followed by the same stable sort), verified by a 1M-slot parity test; the collector's generic bound widened to T: Copy + Send + Sync.
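The ordered-chunk approach described above can be sketched as follows. This is a minimal illustration, not the project's collector: the function names, the `Option<T>` slot representation, and the `(slot_index, value)` output shape are assumptions made for the example.

```rust
use std::thread;

/// Serial reference: gather (slot_index, value) for every populated slot.
/// `Option<T>` stands in for "populated or empty" (an assumption of this sketch).
fn collect_populated_serial<T: Copy>(slots: &[Option<T>]) -> Vec<(usize, T)> {
    let mut out = Vec::with_capacity(slots.len()); // pre-sized output
    for (i, s) in slots.iter().enumerate() {
        if let Some(v) = *s {
            out.push((i, v));
        }
    }
    out.sort_by_key(|&(i, _)| i); // stable sort, same as the parallel path
    out
}

/// Parallel variant: each scoped thread scans one contiguous chunk, and the
/// per-chunk results are concatenated in chunk order before the same sort.
fn collect_populated_parallel<T: Copy + Send + Sync>(
    slots: &[Option<T>],
    n_threads: usize,
) -> Vec<(usize, T)> {
    let chunk = slots.len().div_ceil(n_threads.max(1)).max(1);
    let parts: Vec<Vec<(usize, T)>> = thread::scope(|scope| {
        let handles: Vec<_> = slots
            .chunks(chunk)
            .enumerate()
            .map(|(c, part)| {
                scope.spawn(move || {
                    let base = c * chunk; // global index of this chunk's first slot
                    part.iter()
                        .enumerate()
                        .filter_map(|(i, s)| (*s).map(|v| (base + i, v)))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        // Joining in spawn order keeps the concatenation ordered.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    let mut out: Vec<(usize, T)> = parts.into_iter().flatten().collect();
    out.sort_by_key(|&(i, _)| i);
    out
}
```

Because the chunks are joined in spawn order, the concatenated vector has the same element order as the serial scan, which is what makes a parity test against the serial path meaningful.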
Fixed
- Dictionary(Utf8) ORDER BY now sorts by string value (lexicographic) instead of by the raw dictionary index, which fixes wrong ordering for unordered dictionaries on the GPU sort path.
- Scalar MIN / MAX / SUM / AVG over an all-NULL or empty input now return SQL NULL (matching DuckDB / standard SQL) instead of a sentinel value (i64::MAX / ±inf) or 0.
- ILIKE now uses Unicode-aware per-character case folding, which fixes incorrect matches when case folding changes string length (e.g. the dotted-I İ) around _ and prefix / suffix / contains patterns.
- Predicate pushdown no longer pushes column-free (row-invariant) predicates into a single side of an OUTER join, removing a latent wrong-results hazard.
Changed
- De-duplicated StreamSet: cuda::async_copy now consumes the canonical cuda::buffer::StreamSet instead of carrying its own identical copy (~46 LOC removed). Stream tracking, Drop fencing, and the event-based deferred-free path are unchanged; buffer::StreamSet gained only additive pub(crate) accessors.
- Integer SUM overflow is now a hard error (BoltError::Type) rather than a silent wraparound; the behavior is documented in docs/SQL_REFERENCE.md.
- New typed BoltError::Unsupported variant for the no-GPU / cuda-stub case, which previously surfaced through the generic BoltError::Other.
- GpuView / GpuViewMut::byte_len use checked multiplication (overflow-safe), consistent with the other buffer types.
- PTX disk-cache key now incorporates an automatic codegen fingerprint emitted by build.rs, so the on-disk cache self-invalidates when the codegen source changes.
- Hardened the module-cache key with a Debug-injectivity guard test; disk-cache write-through failures are now logged.
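The two integer SUM behaviors in this release (NULL for all-NULL or empty input, hard error on overflow) can be sketched on the host as below. The `SumError` type and `checked_sum` function are hypothetical stand-ins for illustration; the project reports overflow as BoltError::Type, which is not reproduced here.

```rust
/// Hypothetical error type for this sketch (the project uses BoltError::Type).
#[derive(Debug, PartialEq)]
enum SumError {
    Overflow,
}

/// SUM over nullable i64 values.
/// Returns Ok(None) (SQL NULL) when there is no non-NULL input,
/// and Err(Overflow) instead of wrapping around.
fn checked_sum(values: &[Option<i64>]) -> Result<Option<i64>, SumError> {
    let mut acc: Option<i64> = None;
    for v in values.iter().flatten() {
        let cur = acc.unwrap_or(0);
        acc = Some(cur.checked_add(*v).ok_or(SumError::Overflow)?);
    }
    Ok(acc)
}
```

Keeping the accumulator as an `Option` is one way to tell "no rows contributed" apart from a genuine sum of 0 without a sentinel value.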
Docs
- Corrected docs/JIT_PIPELINE.md (LRU + 128-bit key, disk cache documented) and docs/SQL_REFERENCE.md (fused AVG, integer SUM overflow error, NULL / empty aggregate semantics, ESCAPE implemented, ILIKE documented).
- Added a "Rejected SQL constructs" section to docs/LIMITATIONS.md, created docs/CUDARC_ADOPTION.md, and refreshed the milestone framing in docs/COMPETITIVE_BENCHMARKING.md.
Packaging
- Removed internal scratch files from the published repo / tarball; cleaned up deny.toml and the example declarations.
Internal
- Documented the GPU-only performance items deliberately deferred pending on-hardware benchmarking (AVG sum+count reduce fusion, device-side compaction before the 52 MiB group-by D2H, adaptive spin back-off, pinned-memory pool); see reviews/PERF_BACKLOG.md. These change device behavior or emitted PTX and cannot be validated under the cuda-stub + host-oracle CI used on this branch.
0.7.0 - 2026-05-29
v0.7 turns the v0.6 carry-overs into live code paths. The themes are
the same three the v0.6 closing notes scheduled for "v0.7+": wiring the
KernelSpec module cache into real call sites, lighting up the
Decimal128 / Date / Timestamp GPU lowering boundaries, and landing
the GPU radix sort dispatch in the executor. It also widens the SQL
surface itself — set operations (EXCEPT / INTERSECT), host-side
window functions, non-recursive CTEs, uncorrelated subqueries,
JOIN ... USING / NATURAL, COUNT(DISTINCT), and GPU LIKE /
UPPER / LOWER / LENGTH. Later feature waves in the same release
widen it further — LATERAL and derived tables, WITH RECURSIVE,
correlated WHERE subqueries, VALUES as a row source,
generate_series, DISTINCT ON, named WINDOW / QUALIFY,
super-aggregates (ROLLUP / CUBE / GROUPING SETS), grouped
Decimal128 GPU aggregation, and FETCH / TOP / FOR UPDATE /
PREWHERE query-clause sugar. Items are grouped to mirror the v0.6
milestone headings.
Added — Types (Decimal128 / Date / Timestamp)
- Decimal128 GPU arithmetic: dual-register IR (Op::*128 + RegAlloc::assign_pair), GpuColumnData ingest, and Codegen wiring so +, -, * are reachable from lowering.
- Decimal128 comparisons (=, !=, <, >, <=, >=) lowered to GPU and reachable from WHERE predicates.
- SUM(Decimal128) via host-side reduction.
- Date32 / Timestamp arithmetic (Date − Date and Timestamp − Timestamp; Day-INTERVAL only) lowered to GPU.
Added — Aggregates
- Grouped STDDEV / VAR under GROUP BY via per-group host-side Welford.
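A per-group Welford accumulator can be sketched as follows. This is an illustrative host-side version, not the project's executor: the `Welford` struct, the `(key, value)` row shape, and returning NULL (`None`) for sample variance over fewer than two rows are assumptions of the sketch.

```rust
use std::collections::BTreeMap;

/// Running state for Welford's online variance algorithm.
#[derive(Default, Clone, Copy)]
struct Welford {
    n: u64,
    mean: f64,
    m2: f64, // sum of squared deviations from the running mean
}

impl Welford {
    fn push(&mut self, x: f64) {
        self.n += 1;
        let delta = x - self.mean;
        self.mean += delta / self.n as f64;
        self.m2 += delta * (x - self.mean);
    }

    /// Population variance; None (NULL) for an empty group.
    fn var_pop(&self) -> Option<f64> {
        (self.n > 0).then(|| self.m2 / self.n as f64)
    }

    /// Sample variance; None (NULL) when fewer than two rows were seen.
    fn var_samp(&self) -> Option<f64> {
        (self.n > 1).then(|| self.m2 / (self.n - 1) as f64)
    }
}

/// One accumulator per group key, fed in a single pass over the rows.
fn grouped_var_samp(rows: &[(i64, f64)]) -> BTreeMap<i64, Option<f64>> {
    let mut states: BTreeMap<i64, Welford> = BTreeMap::new();
    for &(key, x) in rows {
        states.entry(key).or_default().push(x);
    }
    states.into_iter().map(|(k, s)| (k, s.var_samp())).collect()
}
```

STDDEV is the square root of the corresponding variance, so the same state serves both aggregate families.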
Added — Join + Sort
- GPU radix sort integration in src/exec/sort.rs: single-key Int32 / Int64 ASC, plus multi-key and DESC support. Fixes the DESC pre-transform to use !(val ^ MIN) rather than a bare !val.
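To see why the DESC pre-transform needs the sign-bit flip, consider how a radix sort orders keys: as unsigned bit patterns. The sketch below uses hypothetical helper names and a host-side sort in place of the GPU kernel; only the two bit expressions come from the entry above.

```rust
/// Ascending key: flipping the sign bit maps i32 order onto u32 order.
fn asc_key(val: i32) -> u32 {
    (val ^ i32::MIN) as u32
}

/// Descending key: invert the ascending key so larger values sort first.
fn desc_key(val: i32) -> u32 {
    (!(val ^ i32::MIN)) as u32
}

/// The bare `!val` form: it reverses the unsigned order, but the unsigned
/// order of raw two's-complement bits is not the signed order.
fn naive_desc_key(val: i32) -> u32 {
    (!val) as u32
}
```

With the bare form, -1 becomes key 0 and therefore sorts ahead of every non-negative value, which is the kind of misordering the corrected transform avoids.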
Changed — KernelSpec module cache wiring
- KernelSpec extended to model aggregate / join / sort / compaction kernel kinds (sibling spec types).
- Cache wired into call sites: ScalarAggSpec (scalar reduction), HashJoinKernelSpec (10 gpu_join call sites), RadixSortKernelSpec (4 gpu_sort call sites), and CompactionKernelSpec (6 compaction kernels: prefix-scan + gather). Cache hits skip both codegen and PTXAS.
- Async memcpy rolled out to the remaining GROUP BY variants (tier2 / shmem / wide / valid) and async D2H for compact::download_mask (the WHERE filter path).
Changed — Lowering / validation
- WHERE predicate type-checking during SQL lowering (fixes LIKE on a non-Utf8 column).
Added — SQL surface (set ops, windows, CTEs, subqueries, joins)
- EXCEPT [ALL] / INTERSECT [ALL]: lowered to a binary LogicalPlan::SetOp node and executed host-side by src/exec/setops.rs. Set forms return distinct left rows; multiset (ALL) forms follow the SQL-standard max(0, lc - rc) / min(lc, rc) multiplicities. Row equality reuses the DISTINCT executor's row-key machinery (NULLs not distinct; ±0.0 canonicalised). UNION / EXCEPT / INTERSECT BY NAME rejected.
- Window functions (OVER): host-side executor (src/exec/window.rs) for ROW_NUMBER, RANK, DENSE_RANK, and SUM / AVG / MIN / MAX / COUNT aggregate windows. Default RANGE UNBOUNDED PRECEDING AND CURRENT ROW frame only; window functions must be top-level SELECT items. Explicit / non-default frames, named WINDOW, QUALIFY, and COUNT(DISTINCT) OVER rejected.
- Non-recursive CTEs (WITH name AS (...)): lowered against the left-to-right CTE scope and type-checked at the definition site. WITH RECURSIVE, CTE column-list aliases, and the materialization hint rejected.
- Uncorrelated subqueries: scalar (SELECT ...) and [NOT] IN (SELECT ...) in SELECT / WHERE, resolved to constants before physical lowering (src/exec/subquery_resolve.rs). A scalar subquery returning more than one row errors; IN folds to an OR / AND chain. Correlated subqueries, EXISTS, and derived tables in FROM rejected.
- JOIN ... USING (...) / NATURAL JOIN: desugared to equi left.col = right.col pairs and run through the existing join paths. Missing / ambiguous / duplicate USING columns and a NATURAL join with no common column are rejected.
- COUNT(DISTINCT col): supported as the sole SELECT item (no GROUP BY / HAVING / SELECT DISTINCT), lowered to COUNT(*) ∘ Distinct ∘ Project([col]) ∘ Filter(col IS NOT NULL) and executed via the new PhysicalPlan::CountRows node.
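The multiset rules for the ALL forms can be illustrated with single-column integer rows. This is a sketch under stated assumptions: the function names are hypothetical, rows are plain `i64` values rather than row keys, and the output is emitted in key order, which SQL does not require.

```rust
use std::collections::BTreeMap;

/// Count how many times each row value occurs.
fn multiplicities(rows: &[i64]) -> BTreeMap<i64, usize> {
    let mut m = BTreeMap::new();
    for &r in rows {
        *m.entry(r).or_insert(0) += 1;
    }
    m
}

/// EXCEPT ALL: each left row survives max(0, lc - rc) times.
fn except_all(left: &[i64], right: &[i64]) -> Vec<i64> {
    let (l, r) = (multiplicities(left), multiplicities(right));
    let mut out = Vec::new();
    for (&k, &lc) in &l {
        let rc = r.get(&k).copied().unwrap_or(0);
        out.extend(std::iter::repeat(k).take(lc.saturating_sub(rc)));
    }
    out
}

/// INTERSECT ALL: each row appears min(lc, rc) times.
fn intersect_all(left: &[i64], right: &[i64]) -> Vec<i64> {
    let (l, r) = (multiplicities(left), multiplicities(right));
    let mut out = Vec::new();
    for (&k, &lc) in &l {
        let rc = r.get(&k).copied().unwrap_or(0);
        out.extend(std::iter::repeat(k).take(lc.min(rc)));
    }
    out
}
```

For left = [1, 1, 1, 2, 3] and right = [1, 2, 2], the value 1 has lc = 3 and rc = 1, so EXCEPT ALL keeps two copies and INTERSECT ALL keeps one.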
Added — SQL surface (later 0.7 feature waves: LATERAL / recursive / VALUES / generate_series / super-aggregates)
- LATERAL derived tables: FROM left, LATERAL (SELECT ... WHERE x = left.col) AS d (and the CROSS / INNER JOIN LATERAL ... ON true and LEFT JOIN LATERAL ... ON true forms). Correlated; executed as a host nested-loop apply (dependent join), bounded by CRATON_MAX_APPLY_ROWS (default 100k). Leading / multiple / RIGHT / FULL LATERAL, a predicate other than ON true, and a column-list alias are rejected.
- Derived tables: a subquery in FROM (FROM (SELECT ...) AS alias) is planned recursively as a self-contained subtree and exposed under the (required) alias. A column-list alias (AS d(x, y)) is rejected.
- WITH RECURSIVE: host-orchestrated (execute_recursive_cte / execute_mutual_recursive_cte); the body must be <anchor> UNION [ALL] <recursive term>, evaluated to a fixpoint. Linear, non-linear, and mutual recursion are supported, plus an optional recursive-CTE column-list alias (WITH RECURSIVE c (a, b) AS ...). The iteration cap is CRATON_MAX_RECURSIVE_ITERATIONS. A recursive anchor that seeds from a recursive member, a self-reference buried in a subquery, and UNION BY NAME are rejected.
- Correlated WHERE subqueries: a single correlated subquery in a top-level WHERE conjunct, namely a correlated scalar comparison, EXISTS (semi-join), or NOT EXISTS (anti-join). Bounded by the same CRATON_MAX_APPLY_ROWS cap as LATERAL. More than one correlated subquery, a correlation inside an OR, and a correlated IN (SELECT ...) are rejected.
- VALUES as a row source (plan_values_query / execute_values_query): bare (VALUES (...), (...) [ORDER BY] [LIMIT]) and in FROM (SELECT ... FROM (VALUES (...)) AS t(a, b)). Per-column dtypes inferred by numeric widening; row count capped (CRATON_VALUES_MAX_ROWS, default 1,000,000).
- generate_series(start, stop[, step]): the one supported table-valued function in FROM (plan_generate_series_query / execute_generate_series_query); produces an inclusive non-nullable Int64 series. step defaults to 1; negative step descends; step = 0 errors. Row count capped (CRATON_GENERATE_SERIES_MAX_ROWS).
- DISTINCT ON (...) (Postgres extension): host-orchestrated (plan_distinct_on / execute_distinct_on). The base query runs with the DISTINCT ON keys prepended and its ORDER BY applied, then the engine keeps the first row per key (LIMIT applied after dedup). Keys must be simple column refs; combined with GROUP BY / HAVING or a computed-expression key, it is rejected.
- Named WINDOW clause and QUALIFY: WINDOW w AS (...) referenced via OVER w (including extension, OVER (w ORDER BY ...)), and QUALIFY <window-predicate> lowered to a Filter over the window projection. QUALIFY / named WINDOW combined with GROUP BY / aggregates is rejected.
- Super-aggregates: GROUP BY ROLLUP / CUBE / GROUPING SETS / ALL, the trailing WITH TOTALS / WITH ROLLUP / WITH CUBE modifiers, and the GROUPING() / GROUPING_ID() indicators. Expanded host-side at plan time into one grouping set per result row, rewritten as a UNION ALL of the per-set sub-plans (max 12 grouping columns).
- Combined COUNT(DISTINCT col) forms: two forms layered on the sole-item distinct-count base plan are now accepted (e.g. SELECT DISTINCT COUNT(DISTINCT col)).
- Grouped Decimal128 GPU aggregation: SUM / MIN / MAX over a Decimal128 column under GROUP BY lowered on-device (complements the scalar Decimal128 aggregation and arithmetic already in 0.7), and Decimal128 division added to the GPU arithmetic set.
- Query-clause sugar: FETCH and T-SQL TOP fold into LIMIT, FOR UPDATE / FOR SHARE are accepted as a no-op, and PREWHERE (ClickHouse-ism) folds into WHERE. A bare FROM a, b comma list desugars to a CROSS JOIN chain.
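The generate_series semantics listed above (inclusive bounds, descending on a negative step, an error on step = 0, and a row cap) can be sketched as a plain function. The signature, the `String` error type, and passing the cap as a parameter are assumptions of this example; the project reads its cap from CRATON_GENERATE_SERIES_MAX_ROWS.

```rust
/// Inclusive integer series from `start` to `stop`, stepping by `step`.
/// `max_rows` stands in for the configured row cap (an assumption here).
fn generate_series(
    start: i64,
    stop: i64,
    step: i64,
    max_rows: usize,
) -> Result<Vec<i64>, String> {
    if step == 0 {
        return Err("generate_series: step must not be zero".to_string());
    }
    let mut out = Vec::new();
    let mut cur = start;
    // Ascend while cur <= stop for a positive step, descend while cur >= stop otherwise.
    while (step > 0 && cur <= stop) || (step < 0 && cur >= stop) {
        if out.len() == max_rows {
            return Err(format!("generate_series: row cap {max_rows} exceeded"));
        }
        out.push(cur);
        match cur.checked_add(step) {
            Some(next) => cur = next,
            None => break, // stepping past the i64 range ends the series
        }
    }
    Ok(out)
}
```

A start beyond stop in the direction of travel yields an empty series rather than an error in this sketch; whether the engine does the same is not stated in the entry.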
Added — GPU string functions
- GPU LIKE / NOT LIKE over Utf8 columns: dictionary columns via dictionary-precompute → index membership; non-dictionary Utf8 via the new PhysicalPlan::StringLikeFilter device matcher (compile_like_match_kernel, EXACT / PREFIX / SUFFIX / CONTAINS). Host host_like fallback retained.
- GPU UPPER / LOWER: two-pass variable-width device output via PhysicalPlan::StringProject.
- GPU LENGTH: PhysicalPlan::StringLength (dictionary-gather, Int64 output).
- Host-side SUBSTRING / TRIM (TRIM BOTH / LEADING / TRAILING) executed end-to-end through a host projection.
- GPU NOT in a predicate lowered via Op::Not.
Internal
- Schema-converter consolidation: the plan↔Arrow schema converters are unified into exec::schema_convert.
- ScalarAggSpec dedup (collision between two sibling-spec additions); field references updated (dtype → input_dtype).
- Radix dispatch gate tests serialized via an override hook to remove env-var test contention; dead single-key wrapper / warning cleanup.
0.6.0 - 2026-05-28
This release covers milestones M1 (foundation), M3 (join + sort), M4
(types), M5 (observability + ergonomics), M6 (performance), M7 (API
stabilization), and M8 (freeze prep) from docs/PATH_TO_1.0.md. v0.5
brought the SQL surface up to "table stakes"; v0.6 turns to the
execution-layer plumbing, the type system, and the public-API shape
that 1.0 will freeze. Many of the new code paths are present but
intentionally not yet wired into the default execution hot path — see
the closing paragraph for the explicit carry-overs.
Added — M1 (Foundation)
- Engine::register_table_stream(name, schema, iter) in src/exec/engine.rs. v0.6 ships an eager implementation that drains the iterator into the existing in-memory table representation; the signature is future-compatible with a truly-lazy streaming path so callers won't need to rewrite their code when the lazy executor lands.
- Async memcpy + pinned host buffers piloted in the scalar aggregate executor (src/exec/aggregate.rs::upload_primitive_values_async). Per-shape rollout to the other executors is deferred to v0.7.
- KernelSpec-keyed module cache in src/exec/module_cache.rs, built and unit-tested. The cache skips both codegen and PTXAS on a hit. Call-site wiring is deferred to v0.7.
Added — M3 (Join + Sort)
- GPU radix-sort kernel scaffold for Int32 and Int64 in src/jit/sort_kernel_radix.rs. Env-gated via BOLT_GPU_SORT=1; not integrated into src/exec/sort.rs yet (that wiring is a v0.7 task).
- Non-equi join via nested-loop in src/exec/join.rs::execute_nested_loop_join. INNER only, capped at MAX_NESTED_LOOP_INNER_ROWS = 1024. Closes the long-standing non-equi gap for small-cardinality cases.
Added — M4 (Types)
- DataType::Decimal128(p, s) plumbed end-to-end through the logical plan + Arrow round-trip. Literal::Decimal128 carried through the parser and type-checker. CAST(int AS DECIMAL(p, s)) parses; GPU codegen rejects cleanly with "Decimal128 not yet lowered to GPU" until the runtime path lands.
- DataType::Date32 and DataType::Timestamp(TimeUnit, Option<&'static str>) with a TimeUnit enum. Literal::Date32(i32) and Literal::Timestamp(i64, unit, tz). DATE '...' and TIMESTAMP '...' literals parse. Timezones are interned via crate::plan::logical_plan::intern_timezone so DataType stays Copy.
Added — M5 (Observability + ergonomics)
- tracing crate dependency with spans on the full parse / plan / lower / codegen / ptx_load / launch / transfer / materialize pipeline. Span names catalogued in src/observability.rs. Off by default; opt-in via the consumer's tracing_subscriber.
- BoltError is now #[non_exhaustive] and gains a SqlWithSpan { msg, span: Range<usize> } variant plus a BoltError::span() accessor. sqlparser parse errors are wrapped via parse_error_to_bolt_error in src/plan/sql_frontend.rs.
- Did-you-mean suggestions in Schema::index_of, NameResolver::resolve_compound, and try_aggregate. Backed by a shared Levenshtein helper in src/plan/suggest.rs (edit distance capped at 2).
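A did-you-mean helper of the kind described above might look like the sketch below. The function names and the tie-breaking rule (first candidate with the smallest distance wins) are assumptions; only the Levenshtein metric and the cap of 2 come from the entry.

```rust
/// Classic two-row Levenshtein edit distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            let best = substitute.min(delete).min(insert);
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Closest candidate within edit distance 2, if any.
fn suggest<'a>(name: &str, candidates: &[&'a str]) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .map(|c| (levenshtein(name, c), c))
        .filter(|&(d, _)| d <= 2)
        .min_by_key(|&(d, _)| d)
        .map(|(_, c)| c)
}
```

Capping the distance keeps suggestions plausible: a misspelled `quantty` can be matched to `quantity`, while an unrelated name produces no suggestion at all.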
Added — M6 (Performance)
- Disk-backed PTX cache in src/jit/disk_cache.rs. Opt-in via the BOLT_PTX_CACHE_DIR=/path env var or a builder hook. Writes are atomic (tempfile + rename) so a partially-written cache entry can't poison subsequent runs.
- Criterion regression bench scaffold in benches/regression.rs. Three queries (scalar aggregate, GROUP BY, filter) measured at parse / lower / ptx_gen. cuda-stub invocation is documented; a >5% slowdown convention is established for the regression workflow.
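The write-then-rename pattern behind the atomic cache writes can be sketched with the standard library alone. The function name, the temp-file naming scheme, and the use of `std::fs` instead of any temp-file crate are assumptions of this example, not a description of disk_cache.rs.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Write `bytes` to `dir/name` so that readers see either the old entry
/// or the complete new one, never a partial file.
fn write_atomic(dir: &Path, name: &str, bytes: &[u8]) -> std::io::Result<()> {
    fs::create_dir_all(dir)?;
    // Temp file lives in the same directory so the rename stays on one filesystem.
    let tmp = dir.join(format!("{name}.tmp.{}", std::process::id()));
    let dst = dir.join(name);
    {
        let mut f = fs::File::create(&tmp)?;
        f.write_all(bytes)?;
        f.sync_all()?; // flush contents before publishing the entry
    }
    // rename replaces an existing destination file.
    fs::rename(&tmp, &dst)
}
```

If the process dies before the rename, only the temp file is left behind, so a later run never loads a truncated entry under the final name.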
Added — M7 (API stabilization)
- Engine::Builder (EngineBuilder) with knobs for device, memory_budget, persistent_cache, and enable_tracing. Engine::new and Engine::new_with_device are preserved as thin wrappers over the builder.
- Engine is now #[non_exhaustive] so future fields don't break downstream destructuring.
- DataFrame::collect(self, engine: &mut Engine) -> BoltResult<RecordBatch>: the #[doc(hidden)] tombstone is gone; collect now materializes through the new Engine::run_logical_plan entry point.
- PlanRewrite trait in src/plan/rewrite.rs. Engine stores rewrites: Vec<Box<dyn PlanRewrite>> and threads them through Engine::sql immediately before lower_physical. Builder / fluent hook: Engine::with_rewrite(self, r) -> Self.
- docs/API_SURFACE.md enumerates the public surface by stability tier, distinguishing the items 1.0 will freeze from the ones still subject to change.
Added — M8 (Freeze prep)
- docs/MIGRATION_GUIDE.md: covers 0.3 → 0.5 → 0.6 upgrade paths.
Added — Docs
- docs/USER_GUIDE.md: 10-minute-tutorial structure aimed at first-time users.
Notes — intentionally NOT in v0.6 (carry-overs for v0.7+)
The following items parse and type-check in v0.6 but reject at the GPU lowering boundary; the runtime paths are scheduled for v0.7+:
- GPU lowering for CASE, CAST, scalar string funcs, LIKE with ESCAPE, || in WHERE predicates, grouped STDDEV / VAR, Decimal128 arithmetic, and Date / Timestamp arithmetic.
- Per-executor async-memcpy wiring beyond the scalar aggregate pilot.
- KernelSpec cache integration into call sites (the cache is built and unit-tested; wiring is deferred).
- GPU radix sort integration in src/exec/sort.rs (the kernel scaffold exists; the dispatch is gated behind an env var and not yet selected by the planner).
- Disk PTX cache wiring through EngineBuilder::build: the env-var path works today, but the builder knob is not yet honored.
0.5.0 - 2026-05-28
This release covers the M2 milestone from docs/PATH_TO_1.0.md: SQL scalar
completeness. Version 0.4 is skipped — the M1 foundation work (streaming
tables, async Stage 2, KernelSpec cache) is deferred to a later release;
this cut focuses on bringing the SQL surface up to "table stakes" while
keeping the existing in-memory execution model.
Added — SQL scalar surface
- NOT <bool-expr>: new UnaryOp::Not variant routed through the host-side filter path (GPU lowering is a follow-up).
- <expr> [NOT] IN (v1, v2, …): desugared to an OR/AND chain of element-wise comparisons. Capped at 64 values; a large-list hash probe is a follow-up.
- <expr> [NOT] BETWEEN low AND high: desugared to (expr >= low) AND (expr <= high) (or the De Morgan inverse).
- CASE WHEN cond THEN val [WHEN…] [ELSE val] END: both plain and simple (with-operand) forms. Type-check unifies numeric arms via unify_numeric and requires exact match for non-numeric. Physical lowering rejects cleanly with "CASE not yet lowered to GPU".
- CAST(expr AS type): primitive numeric and boolean pairs only. Physical lowering rejects cleanly until the runtime conversion lands.
- COALESCE(a, b, …) and NULLIF(a, b): desugared to CASE.
- <expr> [NOT] LIKE 'pattern': constant-pattern LIKE with % and _ wildcards. Routes through the host-side host_like evaluator; fast paths for prefix / suffix / contains / exact shapes.
- String concat a || b: new BinaryOp::Concat operator, lowered through the host-side PhysicalPlan::Project executor for SELECT positions. WHERE-clause concat is rejected with a clear message.
- STDDEV_POP, STDDEV_SAMP, STDDEV aggregates (Welford on host). Scalar-aggregate only; GROUP BY support is a follow-up.
- VAR_POP, VAR_SAMP, VARIANCE aggregates (shared Welford state).
- UPPER, LOWER, LENGTH, SUBSTRING, CONCAT scalar functions surfaced via Expr::ScalarFn. Parser + type-check only; physical lowering rejects each with a "follow-up" message.
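A constant-pattern LIKE matcher with % and _ can be sketched as a small backtracking loop. This is a generic illustration, not the project's host_like: the function name is hypothetical, matching is case-sensitive, there is no ESCAPE handling, and none of the prefix / suffix / contains fast paths are modelled.

```rust
/// Match `text` against a SQL LIKE `pattern`:
/// `%` matches any run of characters (including none), `_` matches exactly one.
fn like(text: &str, pattern: &str) -> bool {
    let t: Vec<char> = text.chars().collect();
    let p: Vec<char> = pattern.chars().collect();
    let (mut ti, mut pi) = (0usize, 0usize);
    // Position to resume from after the most recent `%`:
    // (pattern index just past it, text index it currently absorbs up to).
    let mut star: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '%' {
            star = Some((pi + 1, ti));
            pi += 1;
        } else if pi < p.len() && (p[pi] == '_' || p[pi] == t[ti]) {
            ti += 1;
            pi += 1;
        } else if let Some((sp, st)) = star {
            // Mismatch: let the last `%` absorb one more character and retry.
            pi = sp;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    // Text is exhausted; any remaining pattern must be all `%`.
    p[pi..].iter().all(|&c| c == '%')
}
```

NOT LIKE is simply the negation of this result for non-NULL inputs.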
Added — SQL ergonomics
- Aggregate aliasing (SELECT SUM(x) AS total): the alias carries through the post-Aggregate Project and is visible to HAVING / ORDER BY.
- Qualified column references (t.col, alias.col): resolved against the FROM-tree, including JOIN aliases. Schema-qualified three-part names are rejected with a dedicated message.
- Post-aggregate scalar expressions (SUM(x) + 1, AVG(qty) * 2, (SUM(a) + SUM(b)) / 2): extracted as aggregate feeds + rewritten surface expression in a post-Aggregate Project.
- Case-insensitive identifiers: unquoted SQL idents fold to lowercase at parse time; schema lookup falls back to case-insensitive match when the lookup name is all-ASCII-lowercase. Quoted ("MyCol") identifiers preserve case and match verbatim.
Added — M1 foundation
- Validity propagation through primitive scalar aggregates: COUNT(col) now excludes NULLs via the bitmap; SUM / MIN / MAX / AVG host-strip NULL positions before the GPU reduction. The zero-null fast path (null_count == 0) remains a zero-copy primitive_to_gpu upload.
Notes
- The execution surface remains conservative: many of the items above parse and type-check, but the physical layer rejects them with a clear "not yet lowered to GPU" message until the corresponding kernel / host-side runtime path lands in a follow-up. The intent is to unblock third-party tooling (which can now generate the SQL it would naturally write) without claiming false execution coverage.
- This release also skips the original 0.4 milestone (streaming / async-memcpy Stage 2 / KernelSpec cache) for the same reason 0.2.0 was skipped: scope grew past what a single minor bump could carry, and scalar completeness was the more user-visible delta.
0.3.0 - 2026-05-26
Added
- INNER JOIN ... ON <equi predicate>: host-side hash join. Recursively executes both sides, builds a HashMap<JoinKey, Vec<row_idx>> on the smaller input, probes the larger, and materialises matches via arrow::compute::take. One join per SELECT; LEFT / RIGHT / FULL / CROSS and non-equi predicates are rejected at the parser. NULL keys never match (SQL NULL = NULL → UNKNOWN).
- DISTINCT, LIMIT [OFFSET], ORDER BY [ASC|DESC], HAVING, UNION [ALL]: full plan + parser + standalone executors (src/exec/{distinct,sort,limit}.rs). HAVING desugars to a Filter over the Aggregate; plain UNION lowers to Distinct(Union(..)), UNION ALL stays a flat Union. Executors are host-side for 0.3.x.
- Multi-batch tables: the engine accepts more than one RecordBatch per registered table and threads them through the new operators (was: single-batch only).
- Validity propagation through compact / gpu_compact: filter selection masks now carry per-row validity for downstream consumers.
- Warp-shuffle reduction path in agg_kernels.rs for the last 5 strides of the agg-kernel tree (replaces the all-stride __syncthreads + shared-memory reduction the TODO marker called out).
- 13 new offline e2e tests in tests/e2e_tests.rs covering the new operators: 9 for DISTINCT / LIMIT / ORDER BY / HAVING / UNION (plan shapes, ASC/DESC defaults, LIMIT -1 parse rejection) and 4 for INNER JOIN (single-key, multi-key, schema disambiguation, combined physical output schema).
- Engine::new_with_device(idx) for selecting a specific GPU on multi-GPU hosts. Engine::new() delegates to it with device 0.
- cuda-stub feature is now real: the #[link(name = "cuda")] block is gated and every FFI entry has a stub returning CUDA_ERROR_STUB, so cargo check --no-default-features --features cuda-stub works without the CUDA toolkit. [package.metadata.docs.rs] requests the feature so docs.rs builds the crate.
- Process-wide PTX cache in jit_compiler: FIFO at 256 entries, hashes the emitted PTX text and reuses the loaded CudaModule on a hit, skipping cuModuleLoadDataEx / PTXAS re-assembly.
- BoolNullable variant in the device-column enums propagates Arrow validity bitmaps for BooleanArray columns; the projection round-trip reconstructs a nullable BooleanArray on download. Filter / aggregate kernels still consume the values buffer only (TODO marker in engine.rs).
- New FFI bindings and safe wrappers for cuMemAllocHost_v2, cuMemFreeHost, cuMemcpyHtoDAsync_v2, cuMemcpyDtoHAsync_v2, cuMemsetD8_v2, cuMemsetD8Async.
- New CI workflow .github/workflows/ci.yml (Ubuntu + Windows × stable + 1.74) gated on cuda-stub, plus dependabot.yml, issue / PR templates, CODEOWNERS, and SECURITY.md.
- tests/ptx_golden_tests.rs: golden-snapshot smoke tests for emitted PTX (substring assertions on .target sm_70, atom.*, predicate gate, .restrict, sign-extension before atomic add, etc.).
- tests/parser_tests.rs: 17 negative parser tests covering DISTINCT, ORDER BY, LIMIT, HAVING, UNION, subqueries, JOIN, CTE, qualified column refs, integer-literal overflow, plus one positive bare-bool-predicate control.
- 10 offline aggregate / GROUP BY tests in tests/e2e_tests.rs covering SUM widening, COUNT(*), AVG, alias preservation, SELECT-order preservation, and i64::MIN literal handling.
- Host-only unit tests on src/cuda/buffer.rs, dictionary.rs, dictionary_any.rs, smart_ptrs.rs (via test-only new_host_only constructors). dictionary_any regains four previously #[ignore]'d dispatch tests via host-only execution.
- DCO file at repo root and DCO sign-off section in CONTRIBUTING.md.
- ROADMAP.md and docs/FAQ.md.
Changed
- LogicalPlan::Join::schema() and PhysicalPlan::Join::output_schema() now return the combined (left + right) schema with collision-safe naming: any right-side field whose name clashes with a left-side name is prefixed right.<col>, with a __2, __3, … suffix as a final uniqueness guard. Both methods share a single join_combined_schema helper so they can't drift. Previously the logical version concatenated without disambiguation (duplicate names) and the physical version returned only the left input.
- SUM(Int32) -> Int64 widening end-to-end (plan output dtype, scalar reducer, GROUP BY accumulator, kernel emits atom.global.add.s64 with cvt.s64.s32 sign extension). SUM(Int64), SUM(Float*) unchanged.
- Float-MIN/MAX GROUP BY launch in groupby_valid now passes 7 params (kernel ABI) instead of 11; integer / float-SUM variants keep all 11. debug_assert_eq! on arg count at each launch site.
- pub fn craton_bolt::sql() convenience deleted (it constructed an Engine with no tables, so it was unusable).
- pub struct Reg(pub u32) IR type: field demoted to pub(crate) with a new Reg::id() -> u32 accessor.
- BoltError::Cuda is now a tuple variant Cuda(String) (was a struct variant). Internal-only ergonomic; not part of the stable API.
- GpuBuffer::zeros uses cuMemsetD8 (no host alloc + memcpy).
- IR types (PhysicalPlan, KernelSpec, AggregateSpec, Op, Reg, Value, ColumnIO) and internal re-exports under exec::* / jit::* are marked #[doc(hidden)] for 0.3.x.
- Cargo.toml gains authors, repository, homepage, documentation, readme, keywords, categories, rust-version, [package.metadata.docs.rs]. log = "0.4" added as a runtime dep.
- LICENSE and NOTICE updated to "Copyright 2026 Craton Software Company"; NOTICE lists arrow-array, arrow-buffer, arrow-schema, and log explicitly.
- README gains badges (crates.io / docs.rs / CI / license / MSRV), a Platform support subsection, and Security / Releases sections. The string-subset claim is tightened to flag UPPER / LOWER / LENGTH / CONCAT as host-only Rust API, not SQL.
- docs/SQL_REFERENCE.md: explicit "Not yet supported (planned)" section; documented SUM(Int32) -> Int64 widening and all-NULL group semantics.
- docs/JIT_PIPELINE.md: predicate-gate snippet now matches the emitter byte-for-byte; per-instruction CC table.
- docs/ARCHITECTURE.md: GpuView corrected to Send-only / !Sync with rationale; IR-types stability disclaimer.
- build.rs: skips CUDA discovery under cuda-stub; picks the highest-version CUDA install on Windows; also searches lib64/stubs/ on Linux for driverless hosts (NVIDIA's CI shim).
- #[inline] on leaf accessors of GpuVec / GpuView / GpuViewMut.
- .ptr .global .restrict .align 16 on emitted kernel column-pointer params (enables PTXAS alias optimizations).
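The collision-safe naming rule for joined schemas, from the first entry in this list, might be sketched as below. This is one plausible reading of the rule, not the project's join_combined_schema: the function name, the string-only field model, and the assumption that left-side names are already unique are all specific to this example.

```rust
use std::collections::HashSet;

/// Combine left and right field names so every output name is unique.
/// Assumes the left-side names are already unique (a simplification).
fn combined_names(left: &[&str], right: &[&str]) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut out = Vec::new();
    for l in left {
        seen.insert(l.to_string());
        out.push(l.to_string());
    }
    for r in right {
        // A right-side name that clashes gets the `right.` prefix.
        let base = if seen.contains(*r) {
            format!("right.{r}")
        } else {
            r.to_string()
        };
        // Final guard: append __2, __3, ... until the name is unused.
        let mut name = base.clone();
        let mut n = 2;
        while seen.contains(&name) {
            name = format!("{base}__{n}");
            n += 1;
        }
        seen.insert(name.clone());
        out.push(name);
    }
    out
}
```

The numeric suffix only matters in unusual schemas, for example when the left side already contains a column literally named `right.id`.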
Fixed
- Aggregate output column order was silently rearranged: SELECT SUM(x), key FROM t GROUP BY key previously returned [key, sum_x] because the selected_keys projection was built but never wrapped around the Aggregate. SELECT order is now preserved via a top-level Project; aliases on group keys are honored.
- Windows linkage: dropped kind = "static" on the #[link(name = "cuda")] attribute. cuda.lib is an import library for nvcuda.dll, not a static archive.
- Soundness: GpuView is now !Sync (was unsoundly Sync). A concurrent writer kernel launched through GpuViewMut against the parent GpuVec would have raced a GpuView reader.
- Soundness: static mut INIT_RESULT in cuda_sys replaced with OnceLock<CUresult>; the previous pattern was a data race and a hard error under Rust 2024.
- 32-bit hosts: pointer-truncation bug in GpuBuffer::with_capacity alignment check; the idx-to-usize narrowing in DictionaryColumnI64::to_string_array.
- n_rows as u32 silent truncation across every executor launch site, fixed via a new n_rows_to_u32(n_rows) -> BoltResult<u32> helper.
- pack_keys UB shift: bare << replaced with wrapping_shl plus debug_assert!(shift + bit_width <= 64, ...).
- BooleanArray null/false conflation: upload now distinguishes null from false via the BoolNullable variant (round-trip works for projection; filter / agg kernels still see values only).
- __idx_<col> device→host→device bounce removed; the engine borrows the dictionary's existing GpuVec directly.
- Integer literal overflow in parse_number: a positive literal whose magnitude exceeds i64::MAX is now rejected with a clear error rather than silently demoted to Float64. The i64::MIN-magnitude literal -9223372036854775808 is preserved as Literal::Int64(i64::MIN).
- AVG over all-NULL group in groupby_valid now returns SQL NULL instead of 0.0 (matches SQL_REFERENCE.md).
- Test memory_tests: shared_view_is_send_but_not_sync assertion updated to match the new GpuView: !Sync contract.
- DataFrame builder: select / filter / group_by / agg now validate column references at builder time, deferring the first error via a String-typed first_error field surfaced through DataFrame::validation_error() and schema().
- physical_plan::lower now folds arbitrary Scan / Filter / Project chain shapes (was: only Scan or Filter(Scan)). DataFrame chains like scan().select().filter().select() no longer produce unlowerable plans.
- String literal rewriter: peels Alias wrappers on either side of BinaryOp::Eq; LiteralResolver::index_dtype lets i64-indexed dicts emit Int64 index columns rather than the hardcoded Int32.
- hash_kernels classic keys kernel: bounded probe loop with MAX_PROBE_FACTOR = 2; previously could spin forever on a full table.
- jit_compiler::from_ptx: uses cuModuleLoadDataEx with PTXAS info / error log buffers; failures now surface line numbers.
- build.rs Windows fallback picks the highest CUDA version on disk (was: first NTFS-ordered entry).
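Two of the fixes above share one idea: reject an out-of-range value instead of letting a cast or a float fallback hide it. The sketch below illustrates both with hypothetical signatures; the project's helper returns BoltResult<u32>, and its parse_number works on parser tokens rather than a sign flag plus digit string.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
struct RowCountError(usize);

/// Checked replacement for `n_rows as u32`: fails instead of truncating.
fn n_rows_to_u32(n_rows: usize) -> Result<u32, RowCountError> {
    u32::try_from(n_rows).map_err(|_| RowCountError(n_rows))
}

/// Integer-literal parsing that keeps i64::MIN and rejects anything wider,
/// rather than demoting an oversized literal to a float.
fn parse_int_literal(negative: bool, digits: &str) -> Result<i64, String> {
    let out_of_range = || format!("integer literal out of range: {digits}");
    let magnitude: u64 = digits.parse().map_err(|_| out_of_range())?;
    // 9223372036854775808 only fits in i64 when negated.
    if negative && magnitude == 1u64 << 63 {
        return Ok(i64::MIN);
    }
    let value = i64::try_from(magnitude).map_err(|_| out_of_range())?;
    Ok(if negative { -value } else { value })
}
```

Parsing the magnitude as `u64` first is what lets the i64::MIN case through, since its absolute value does not fit in an `i64`.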
Removed
- BoltError::Nvrtc variant (Craton Bolt uses cuModuleLoadDataEx, not NVRTC). The 4 jit_compiler.rs call sites migrated to Cuda.
- pub fn craton_bolt::sql(query) (broken; see Changed).
Deprecated
- DataFrame::collect() (use into_plan(); tombstone retained for 0.1 call-site compatibility).
Security
- (none yet; see SECURITY.md for the disclosure address.)
0.1.0 - 2026-05-23
Added
CUDA layer (src/cuda/)
- Raw CUDA driver FFI (cuda_sys.rs): context init, device discovery, memory alloc / free / memcpy, module load, stream create / destroy / sync, cuLaunchKernel.
- GpuBuffer<T> (buffer.rs): owned device allocation with Arrow's 64-byte alignment.
- GpuVec<T> / GpuView<'a, T> / GpuViewMut<'a, T> (smart_ptrs.rs): borrow-checked GPU memory. Kernel launches require borrows; use-after-free, double-free, and shared/mutable aliasing across kernel boundaries are rejected at compile time.
- DictionaryColumn (dictionary.rs): i32-indexed string dictionary with NULL at slot 0.
- DictionaryColumnI64 (dictionary_i64.rs): i64-indexed dictionary for columns with > i32::MAX unique strings.
- DictionaryColumnAny (dictionary_any.rs): unified enum picking i32/i64 by cardinality at construction.
Plan layer (src/plan/)
- LogicalPlan AST (logical_plan.rs): Scan / Filter / Project / Aggregate. Expr covers Column / Literal / Binary / Alias. Numeric type promotion follows the standard SQL rules.
- DataFrame builder (dataframe.rs): Polars-style lazy API.
- SQL frontend (sql_frontend.rs): sqlparser-based; supports SELECT with WHERE, GROUP BY, scalar aggregates.
- PhysicalPlan lowering (physical_plan.rs): produces fused KernelSpec with SSA-shaped op IR.
- StringPredicateRewriter (string_literal_rewrite.rs): rewrites col = 'literal' to __idx_col = i32/i64(idx) against registered dictionaries.
JIT layer (src/jit/)
- PTX codegen for projection (ptx_gen.rs): targets sm_70 / .version 7.5 / 64-bit addressing.
- PTX codegen for predicate-only kernels (scan_kernel.rs): materialises u8 keep-masks.
- Scalar reduction kernels (agg_kernels.rs): SUM / MIN / MAX / COUNT / AVG with per-block reduction + host-side cross-block finish.
- Hash GROUP BY kernels (hash_kernels.rs): single-pass open-addressing with atom.cas.b64 on the keys table.
- Float MIN/MAX via CAS loop (float_atomics.rs): closes the sm_70 gap for atom.global.{min,max}.f{32,64}.
- Sentinel-free GROUP BY kernels (valid_flag_kernels.rs): parallel slot_valid: u32[] table eliminates i64::MIN collision risk (notably Float64 -0.0).
- Sentinel-free float MIN/MAX (valid_flag_float.rs): combines the CAS loop with the valid-flag probe.
- Parallel prefix-scan + gather (prefix_scan.rs): Hillis-Steele per-block scan, host-side block-base reduction, per-dtype gather.
- Multi-pass prefix-scan (prefix_scan_multipass.rs): recursive scan over block_sums; unbounded row counts.
- CUDA module loader (jit_compiler.rs): cuModuleLoadData wrapper; PTX-to-cubin assembly happens inside the driver.
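The prefix-scan + gather compaction listed above can be simulated on the host to show the data flow. This is a single-block sketch with hypothetical function names: it models the Hillis-Steele doubling steps with a copy of the previous round, and it leaves out the cross-block base reduction and the per-dtype kernels.

```rust
/// Inclusive Hillis-Steele scan, simulated on the host.
/// Each round adds the element `stride` positions to the left, doubling
/// `stride` until it covers the whole input.
fn hillis_steele_inclusive(input: &[u32]) -> Vec<u32> {
    let mut cur = input.to_vec();
    let mut stride = 1;
    while stride < cur.len() {
        let prev = cur.clone(); // read last round's values, write this round's
        for i in stride..cur.len() {
            cur[i] = prev[i] + prev[i - stride];
        }
        stride *= 2;
    }
    cur
}

/// Compact `values` by a u8 keep-mask: the scan of the mask gives each
/// kept row its output position, and the gather writes it there.
fn compact(values: &[i64], mask: &[u8]) -> Vec<i64> {
    assert_eq!(values.len(), mask.len());
    let flags: Vec<u32> = mask.iter().map(|&m| u32::from(m != 0)).collect();
    let scan = hillis_steele_inclusive(&flags);
    let kept = scan.last().copied().unwrap_or(0) as usize;
    let mut out = vec![0i64; kept];
    for i in 0..values.len() {
        if mask[i] != 0 {
            // Inclusive scan minus one is the zero-based output slot.
            out[scan[i] as usize - 1] = values[i];
        }
    }
    out
}
```

On a GPU every iteration of the inner loops runs as one thread, which is why each kept row can compute its destination independently once the scan is available.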
Execution layer (src/exec/)
- Engine (engine.rs): top-level entry point. Holds the CUDA context, registered tables, dictionary registry. sql(query) returns a QueryHandle wrapping an Arrow RecordBatch.
- Scalar aggregate executor (aggregate.rs): primitive SUM / MIN / MAX / COUNT / AVG.
- Aggregate with pre kernel (agg_with_pre.rs): handles aggregates over expressions / filtered inputs.
- GROUP BY executor (groupby.rs): packed-i64-key path with composite-tuple decode.
- GROUP BY + pre (groupby_with_pre.rs): fused pre kernel + GROUP BY.
- Wide-key GROUP BY fallback (groupby_wide.rs): host-side reduction for > 64-bit composite keys.
- Sentinel-free GROUP BY (groupby_valid.rs): float-key safe path with bounded spin + spill.
- Stream + kernel launcher (launch.rs): CudaStream, KernelArgs, 1D launch helper.
- Host-side filter compaction (compact.rs): downloads mask, applies via arrow::compute::filter.
- GPU-side filter compaction (gpu_compact.rs): prefix-scan + gather, end-to-end on the GPU.
- GPU compaction multi-pass driver (gpu_compact_multipass.rs).
- Dictionary registry (dict_registry.rs): per-table dictionaries, drives the predicate rewrite at Engine::sql time.
- Bool / Utf8 aggregate executor (extended_agg.rs): host-side SUM(bool) / MIN(utf8) / etc.
- Host-side expression evaluator (expr_agg.rs): fallback when an aggregate input isn't a bare column ref.
- Dictionary-aware string ops (string_ops.rs): UPPER / LOWER / LENGTH / input_eq_literal.
- Variable-width-free CONCAT / SUBSTRING (string_ops_extended.rs).
- Bool / Utf8 device columns (string_col.rs).
Tests & benches
- Memory-safety tests (tests/memory_tests.rs): type-level proofs, compile-fail doctests, ignored live-GPU round-trips.
- End-to-end tests (tests/e2e_tests.rs): parser → plan → PTX-shape assertions; ignored live-GPU query verification.
- Criterion benchmarks (benches/query_benchmarks.rs): plan / lower / ptx_gen, CPU reference, Polars head-to-head, GPU engine path (gated behind BOLT_BENCH_GPU=1).
Build status
Compiles clean on Windows MSVC / Linux with CUDA Toolkit ≥ 12. cargo check --lib --tests --benches works on hosts without CUDA. cargo test requires cuda.lib on the linker path; cargo test -- --ignored requires an NVIDIA GPU with compute capability ≥ 7.0.
Known limitations
- No JOIN support. Single-table queries only.
- No NULL-aware GPU aggregates yet: COUNT counts every row, not just non-null. The host-side extended_agg path does honour nulls for Bool/Utf8.
- Variable-width string outputs (CONCAT producing genuinely new strings) work via host-side dictionary cross-product, not on the GPU.
- Polars head-to-head numbers are not yet published.