Craton Bolt Roadmap
This document tracks intentional gaps in the current release and the milestones
planned beyond it. For day-to-day progress, see CHANGELOG.md. For supported
SQL today, see docs/SQL_REFERENCE.md. For the full 1.0 plan, see
docs/PATH_TO_1.0.md. To install and build, see docs/INSTALL.md.
0.7.0 (current — pre-production, API stabilising)
v0.7 turns the v0.6 carry-overs into live code paths: it lights up the
Decimal128 / Date / Timestamp GPU lowering boundaries, wires the
KernelSpec module cache into real call sites, and lands the GPU radix-sort
dispatch in the executor. Highlights (see CHANGELOG.md for the full list):
- Decimal128 GPU arithmetic (+, -, *) and comparisons (=, !=, <, >, <=, >=, reachable from WHERE); SUM(Decimal128) via host-side reduction.
- Date32/Timestamp arithmetic (Date−Date, Timestamp−Timestamp, Day-INTERVAL only) lowered to GPU.
- Grouped STDDEV/VAR under GROUP BY via per-group host-side Welford.
- GPU radix sort integrated into src/exec/sort.rs: single-key Int32/Int64 ASC plus multi-key and DESC. Still opt-in via BOLT_GPU_SORT=1 (see docs/ENV_VARS.md); not yet planner-selected by default.
- KernelSpec module cache wired into call sites: scalar aggregate, hash-join, radix-sort, and compaction kernels now hit the cache (skipping both codegen and PTXAS); async memcpy rolled out to the remaining GROUP BY variants and the WHERE filter D2H path.
- WHERE-predicate type-checking during SQL lowering.
- SQL surface expansion (set ops, windows, CTEs, subqueries, joins): EXCEPT [ALL] / INTERSECT [ALL] (host-side), non-recursive CTEs (WITH), host-side window functions (ROW_NUMBER/RANK/DENSE_RANK/SUM/AVG/MIN/MAX/COUNT OVER, default frame only), uncorrelated scalar and [NOT] IN subqueries, JOIN ... USING / NATURAL JOIN, and COUNT(DISTINCT col) (sole SELECT item). GPU LIKE (dict + non-dict Utf8) and GPU UPPER/LOWER/LENGTH; host-side SUBSTRING/TRIM/CONCAT.
- SQL surface expansion (later 0.7 feature waves): LATERAL derived tables and plain derived tables (subquery in FROM); WITH RECURSIVE (linear, non-linear, and mutual recursion, optional column-list alias); a single correlated WHERE subquery (scalar comparison / EXISTS / NOT EXISTS); VALUES as a row source (bare and in FROM); the generate_series(start, stop[, step]) table-valued function; DISTINCT ON (...); named WINDOW clause + QUALIFY; super-aggregates (GROUP BY ROLLUP/CUBE/GROUPING SETS/ALL, WITH TOTALS/ROLLUP/CUBE, and GROUPING()/GROUPING_ID()); and query-clause sugar (FETCH / T-SQL TOP → LIMIT, FOR UPDATE / FOR SHARE no-op, PREWHERE → WHERE). The correlated-WHERE and LATERAL nested-loop apply paths are bounded by CRATON_MAX_APPLY_ROWS.
- Grouped Decimal128 GPU aggregation: SUM/MIN/MAX over a Decimal128 column under GROUP BY lowered on-device, plus Decimal128 division added to the GPU arithmetic set (complementing the scalar Decimal128 arithmetic and aggregation above).
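The "per-group host-side Welford" item above refers to Welford's online algorithm for variance. The following is a minimal sketch of that technique, not the executor's actual code: the `Welford` struct and the `grouped_var` helper are hypothetical names, keys are assumed to be `i64`, and the sample (n - 1) denominator is an assumption about what VAR returns.

```rust
use std::collections::HashMap;

/// Running statistics for one group (Welford's online algorithm).
#[derive(Debug, Default, Clone, Copy)]
struct Welford {
    count: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations from the mean
}

impl Welford {
    /// Folds one value into the running state in O(1).
    fn push(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean);
    }

    /// Sample variance (assumed n - 1 denominator); None below two rows.
    fn var_samp(&self) -> Option<f64> {
        if self.count < 2 {
            None
        } else {
            Some(self.m2 / (self.count - 1) as f64)
        }
    }

    /// Sample standard deviation, derived from the variance.
    fn stddev_samp(&self) -> Option<f64> {
        self.var_samp().map(f64::sqrt)
    }
}

/// Hypothetical helper: keeps one Welford state per GROUP BY key.
fn grouped_var(keys: &[i64], values: &[f64]) -> HashMap<i64, Welford> {
    let mut groups: HashMap<i64, Welford> = HashMap::new();
    for (&k, &v) in keys.iter().zip(values) {
        groups.entry(k).or_default().push(v);
    }
    groups
}
```

A single pass over the rows suffices, and the update avoids the cancellation problems of the naive sum-of-squares formula, which is the usual reason to pick Welford for this kind of reduction.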
What works (carried forward from 0.5)
- SQL → PTX → execution end-to-end for projection, filter, scalar aggregate, and GROUP BY (single/multi-column, packed and wide keys).
- DISTINCT, LIMIT [OFFSET], ORDER BY [ASC|DESC], HAVING, UNION [ALL], EXCEPT [ALL], and INTERSECT [ALL] (host-side executors for the non-GROUP-BY paths). Non-recursive CTEs (WITH).
- INNER, LEFT [OUTER], RIGHT [OUTER], FULL [OUTER], and CROSS joins (GPU fast path + host hash-join fallback), with ON / USING (...) / NATURAL constraints. Multiple joins per SELECT are permitted.
- Host-side window functions (OVER) and uncorrelated scalar / [NOT] IN subqueries.
- Borrow-checked GPU memory primitives (GpuVec/GpuView/GpuViewMut): use-after-free, double-free, and mutable/shared aliasing across kernel boundaries are compile-time errors.
- The full v0.5 SQL scalar surface (NOT, IN, BETWEEN, CASE, CAST, COALESCE/NULLIF, LIKE, ||, STDDEV/VAR, scalar string fns), parsed and type-checked. v0.7 landed GPU execution for grouped STDDEV/VAR, Decimal128 arithmetic and comparisons, and Date/Timestamp arithmetic; the remaining items (e.g. CASE / CAST / scalar string funcs on the GPU, LIKE with ESCAPE, || in WHERE) still reject cleanly at physical lowering.
- Dictionary-encoded Utf8, float GROUP BY with sentinel-free fallback, GPU-side filter compaction, process-wide PTX module cache, --features cuda-stub for CI / docs.rs.
New in 0.6.0 — M1 (Foundation)
- Engine::register_table_stream(name, schema, iter): eager implementation in v0.6, signature future-compatible with the lazy streaming path scheduled for v0.7.
- Async memcpy + pinned host buffers piloted in the scalar aggregate executor (upload_primitive_values_async).
- KernelSpec-keyed module cache built and unit-tested in src/exec/module_cache.rs (skips both codegen and PTXAS on a hit).
New in 0.6.0 — M3 (Join + Sort)
- GPU radix-sort kernel scaffold for Int32/Int64 in src/jit/sort_kernel_radix.rs. Env-gated via BOLT_GPU_SORT=1.
- Non-equi join via nested-loop in src/exec/join.rs::execute_nested_loop_join (INNER only; cap MAX_NESTED_LOOP_INNER_ROWS = 1024).
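For readers unfamiliar with radix sorting signed integers, the sketch below shows the general technique on the host: a stable least-significant-digit sort with 8-bit digits, a sign-bit flip so signed order matches unsigned order, and a bitwise complement for DESC. It is an illustration only; the function names are hypothetical and the GPU kernel in src/jit/sort_kernel_radix.rs may be structured differently.

```rust
/// Maps an i32 to a u32 whose unsigned order matches the signed order.
fn key_bits(x: i32) -> u32 {
    (x as u32) ^ 0x8000_0000
}

/// Stable LSD radix sort, 8 bits per pass (four passes for 32-bit keys).
fn radix_sort_i32(data: &mut Vec<i32>, descending: bool) {
    let mut src = std::mem::take(data);
    let mut dst = vec![0i32; src.len()];
    for pass in 0..4u32 {
        let shift = pass * 8;
        // Complementing the key reverses the order for DESC.
        let digit = |x: i32| -> usize {
            let bits = if descending { !key_bits(x) } else { key_bits(x) };
            ((bits >> shift) & 0xFF) as usize
        };
        // 1. Histogram of digit values.
        let mut counts = [0usize; 256];
        for &x in &src {
            counts[digit(x)] += 1;
        }
        // 2. Exclusive prefix sum: starting offset of each bucket.
        let mut offset = 0;
        for c in counts.iter_mut() {
            let n = *c;
            *c = offset;
            offset += n;
        }
        // 3. Stable scatter into the destination buffer.
        for &x in &src {
            let d = digit(x);
            dst[counts[d]] = x;
            counts[d] += 1;
        }
        std::mem::swap(&mut src, &mut dst);
    }
    // After an even number of passes the sorted data is back in `src`.
    *data = src;
}
```

The histogram, prefix-sum, and scatter steps are the parts that parallelise well, which is why this family of sorts is a common choice for GPU execution on fixed-width integer keys.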
New in 0.6.0 — M4 (Types)
- DataType::Decimal128(p, s) plumbed end-to-end through plan + Arrow round-trip; CAST(int AS DECIMAL(p, s)) parses.
- DataType::Date32 and DataType::Timestamp(TimeUnit, Option<&'static str>) with a TimeUnit enum. DATE '...' and TIMESTAMP '...' literals parse. Timezones interned via intern_timezone so DataType stays Copy.
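The point of interning timezones is that a `&'static str` is `Copy`, whereas an owned `String` is not, so the enum carrying it can stay `Copy`. A minimal sketch of one way to do this follows; `intern_name`, `Unit`, and `Ty` are hypothetical stand-ins, and leaking each distinct name once is an assumed strategy rather than a description of the real intern_timezone.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

/// Hypothetical interner: returns a `&'static str` for a timezone name,
/// leaking each distinct name exactly once so the reference never dangles.
fn intern_name(name: &str) -> &'static str {
    static POOL: OnceLock<Mutex<HashSet<&'static str>>> = OnceLock::new();
    let mut pool = POOL
        .get_or_init(|| Mutex::new(HashSet::new()))
        .lock()
        .unwrap();
    if let Some(&existing) = pool.get(name) {
        return existing;
    }
    let leaked: &'static str = Box::leak(name.to_owned().into_boxed_str());
    pool.insert(leaked);
    leaked
}

/// Stand-in for a time-unit enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Unit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Stand-in for a data-type enum: it derives Copy because the timezone
/// is an interned `&'static str` rather than an owned String.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Ty {
    Date32,
    Timestamp(Unit, Option<&'static str>),
}
```

The trade-off is that interned names are never freed; that is usually acceptable because the set of distinct timezone names a process sees is small and bounded.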
New in 0.6.0 — M5 (Observability + ergonomics)
- tracing dependency; spans on parse / plan / lower / codegen / ptx_load / launch / transfer / materialize. Off by default; opt-in via the consumer's tracing_subscriber. Catalogue in src/observability.rs.
- BoltError is now #[non_exhaustive] and gains a SqlWithSpan { msg, span: Range<usize> } variant plus a BoltError::span() accessor. sqlparser parse errors wrapped via parse_error_to_bolt_error.
- Did-you-mean suggestions in Schema::index_of, NameResolver::resolve_compound, and try_aggregate. Shared helper in src/plan/suggest.rs (Levenshtein capped at 2).
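As an illustration of "Levenshtein capped at 2", the sketch below computes edit distance and returns the closest candidate only when it is within a maximum distance. The function names and the tie-breaking rule (first closest candidate wins) are assumptions for this example, not the contents of src/plan/suggest.rs.

```rust
/// Levenshtein edit distance between two strings, counted per char.
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b_chars.iter().enumerate() {
            let subst = prev[j] + usize::from(ca != cb);
            let insert = cur[j] + 1;
            let delete = prev[j + 1] + 1;
            cur.push(subst.min(insert).min(delete));
        }
        prev = cur;
    }
    prev[b_chars.len()]
}

/// Returns the closest candidate within `max_distance` edits, if any.
fn suggest<'a>(input: &str, candidates: &[&'a str], max_distance: usize) -> Option<&'a str> {
    candidates
        .iter()
        .map(|&c| (levenshtein(input, c), c))
        .filter(|&(d, _)| d <= max_distance)
        .min_by_key(|&(d, _)| d)
        .map(|(_, c)| c)
}
```

Capping the distance keeps suggestions plausible: a misspelled column such as `pirce` can be matched to `price`, while an unrelated name yields no suggestion at all instead of a misleading one.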
New in 0.6.0 — M6 (Performance)
- Disk-backed PTX cache in src/jit/disk_cache.rs. Opt-in via the BOLT_PTX_CACHE_DIR=/path env var; writes are atomic.
- Criterion regression bench scaffold in benches/regression.rs covering scalar agg / GROUP BY / filter at parse / lower / ptx_gen.
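One common way to make cache writes atomic is to write to a temporary file in the same directory and then rename it over the final path. The sketch below shows that pattern with the standard library only; the one-file-per-key layout, the `.ptx` suffix, and the function names are assumptions for illustration, and the real src/jit/disk_cache.rs may use a different scheme.

```rust
use std::fs;
use std::io::Write;
use std::path::{Path, PathBuf};

/// Hypothetical layout: one file per cache key under the cache directory.
fn entry_path(dir: &Path, key: &str) -> PathBuf {
    dir.join(format!("{key}.ptx"))
}

/// Writes `ptx` so that readers see either the old entry or the complete
/// new one, never a partially written file.
fn write_atomic(dir: &Path, key: &str, ptx: &[u8]) -> std::io::Result<PathBuf> {
    fs::create_dir_all(dir)?;
    let final_path = entry_path(dir, key);
    // The temp file lives in the same directory so the rename stays on
    // one filesystem; the process id keeps concurrent writers apart.
    let tmp_path = dir.join(format!("{key}.ptx.tmp.{}", std::process::id()));
    {
        let mut tmp = fs::File::create(&tmp_path)?;
        tmp.write_all(ptx)?;
        tmp.sync_all()?; // flush contents to disk before publishing
    }
    fs::rename(&tmp_path, &final_path)?;
    Ok(final_path)
}

/// Reads a cache entry; a missing or unreadable file is treated as a miss.
fn read_entry(dir: &Path, key: &str) -> Option<Vec<u8>> {
    fs::read(entry_path(dir, key)).ok()
}
```

Treating any read failure as a miss keeps the cache strictly optional: the worst case is a recompile, never a failed query.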
New in 0.6.0 — M7 (API stabilization)
- Engine::Builder (EngineBuilder) with device, memory_budget, persistent_cache, enable_tracing knobs. Engine::new / Engine::new_with_device preserved as thin wrappers. Engine is now #[non_exhaustive].
- DataFrame::collect(self, &mut Engine) materializes through the new Engine::run_logical_plan. The 0.1-era #[doc(hidden)] tombstone is gone.
- PlanRewrite trait in src/plan/rewrite.rs. Engine stores rewrites: Vec<Box<dyn PlanRewrite>> and threads them through Engine::sql immediately before lower_physical. Engine::with_rewrite(self, r) registers a rewrite.
- docs/API_SURFACE.md enumerates the public surface by stability tier.
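The rewrite hook described above can be pictured as a list of boxed trait objects folded over the plan before lowering. The sketch below is self-contained and deliberately simplified: `Plan`, `MiniEngine`, `DropTrueFilters`, and the `rewrite(&self, Plan) -> Plan` method signature are all assumptions for illustration; only the `Vec<Box<dyn PlanRewrite>>` storage and the consuming `with_rewrite(self, r)` shape come from the notes above.

```rust
/// Hypothetical, heavily simplified logical plan.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Filter { predicate: String, input: Box<Plan> },
    Limit { n: usize, input: Box<Plan> },
}

/// Assumed shape of a rewrite hook; the real trait may differ.
trait PlanRewrite {
    fn rewrite(&self, plan: Plan) -> Plan;
}

/// Example rewrite: drop filters whose predicate is the literal TRUE.
struct DropTrueFilters;

impl PlanRewrite for DropTrueFilters {
    fn rewrite(&self, plan: Plan) -> Plan {
        match plan {
            Plan::Filter { predicate, input } if predicate == "TRUE" => self.rewrite(*input),
            Plan::Filter { predicate, input } => Plan::Filter {
                predicate,
                input: Box::new(self.rewrite(*input)),
            },
            Plan::Limit { n, input } => Plan::Limit {
                n,
                input: Box::new(self.rewrite(*input)),
            },
            scan => scan,
        }
    }
}

/// Stand-in engine that only stores and applies rewrites.
struct MiniEngine {
    rewrites: Vec<Box<dyn PlanRewrite>>,
}

impl MiniEngine {
    fn new() -> Self {
        Self { rewrites: Vec::new() }
    }

    /// Builder-style registration, mirroring `with_rewrite(self, r)`.
    fn with_rewrite(mut self, r: impl PlanRewrite + 'static) -> Self {
        self.rewrites.push(Box::new(r));
        self
    }

    /// Applies every registered rewrite in registration order.
    fn apply_rewrites(&self, plan: Plan) -> Plan {
        self.rewrites.iter().fold(plan, |p, r| r.rewrite(p))
    }
}
```

Applying rewrites in registration order makes the pipeline predictable: a later rewrite always sees the output of the earlier ones, which matters when rewrites are not commutative.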
New in 0.6.0 — M8 (Freeze prep) + Docs
- docs/MIGRATION_GUIDE.md covers the 0.3 → 0.5 → 0.6 upgrade path.
- docs/USER_GUIDE.md ships as a 10-minute tutorial.
Known limitations (not bugs) — as of 0.7.0
v0.7 closed most of the v0.6 carry-overs. What remains:
- Several v0.5 SQL scalar items still parse / type-check but reject at the physical layer: GPU lowering for CASE / CAST / scalar string funcs, LIKE with ESCAPE, and || in WHERE. (v0.7 did land Decimal128 arithmetic + comparisons, Date/Timestamp arithmetic, and grouped STDDEV/VAR.)
- The GPU radix sort is integrated into src/exec/sort.rs but is still opt-in via BOLT_GPU_SORT=1 rather than planner-selected by default.
- The disk PTX cache honours BOLT_PTX_CACHE_DIR; the EngineBuilder::persistent_cache knob is wired through the builder surface but does not yet drive EngineBuilder::build.
- The lazy streaming executor behind Engine::register_table_stream is still the eager drain implementation (the signature is future-compatible).
Beyond 0.7 — toward 1.0 (next)
With the v0.6 execution carry-overs largely landed in v0.7, the remaining pre-1.0 work is the last GPU-lowering gaps, planner-driven dispatch, and the freeze checklist:
Goals
- GPU lowering for the still-deferred scalar items: CASE WHEN ... END (predicated select) and CAST over documented primitive pairs. (UPPER/LOWER/LENGTH and LIKE already lower to GPU as of 0.7; SUBSTRING/TRIM/CONCAT remain host-side.)
- Planner-driven radix-sort dispatch: promote the integrated src/exec/sort.rs radix path from BOLT_GPU_SORT=1 opt-in to a default selected on size / dtype.
- EngineBuilder::persistent_cache wiring through EngineBuilder::build (today the env-var path is the only honoured surface).
- Security audit prep (M8 from docs/PATH_TO_1.0.md): dependency audit, public-surface review, and the freeze checklist needed before the 1.0 stabilisation window opens.
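"Predicated select" in the first goal means evaluating both CASE branches for every row and choosing between them with a mask instead of a branch, which suits lockstep GPU execution. The host-side sketch below illustrates the idea for a single WHEN over `i64` columns; the function name is hypothetical and NULL handling is deliberately left out.

```rust
/// Branch-free `CASE WHEN cond THEN t ELSE e END` over i64 columns.
/// Both inputs are already evaluated; a per-row mask picks the result.
fn case_when_select(cond: &[bool], then_vals: &[i64], else_vals: &[i64]) -> Vec<i64> {
    assert!(cond.len() == then_vals.len() && cond.len() == else_vals.len());
    cond.iter()
        .zip(then_vals)
        .zip(else_vals)
        .map(|((&c, &t), &e)| {
            // true -> all ones (-1), false -> all zeros.
            let mask = -(c as i64);
            (t & mask) | (e & !mask)
        })
        .collect()
}
```

The cost is that both branches are always computed, so this form is most attractive when the branch expressions are cheap and free of side effects such as division by zero.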
Stretch goals
- GPU hash join (the existing executor is host-side; a GPU-resident probe path is the natural next step).
- GPU lowering for LIKE with ESCAPE and || in WHERE predicates.
1.0 — public API freeze
See docs/PATH_TO_1.0.md for the detailed
milestone-by-milestone plan, acceptance criteria, open decisions, and
explicit exclusions. Headlines:
- All #[doc(hidden)] IR types (PhysicalPlan, KernelSpec, AggregateSpec, Op, Reg, Value, ColumnIO) either stabilised or replaced with a public builder surface.
- DataFrame::collect() becomes a real materialising terminal. (Landed in v0.6; kept here for the 1.0 acceptance checklist.)
- Stable Engine::sql contract; cuda-stub feature documented as a permanent CI helper rather than an experiment.
- Multi-platform: aarch64-linux (Jetson) tested in CI.
- Regression-CI green; ClickBench numbers published per release.