2026-05-05

Benchmarking Craton HSM: Baseline, Optimized, and vs. SoftHSMv2

A rigorous three-phase benchmarking study: raw crypto baseline, targeted optimizations (57% AES-GCM speedup, O(1) FindObjects), and a definitive head-to-head comparison through the PKCS#11 C ABI against SoftHSMv2 v2.6.1.

by Craton engineering

When we at Craton Software Company set out to build Craton HSM, a production-grade PKCS#11 software HSM written entirely in Rust, we knew performance would be scrutinized. Cryptographic libraries live on the critical path of every TLS handshake, every certificate signing, every key derivation. An HSM that is memory-safe but slow is an HSM that nobody deploys. So we built a rigorous benchmarking infrastructure from day one, measured everything, optimized where it mattered, and ran a head-to-head comparison against SoftHSMv2 through the exact same PKCS#11 C ABI that real consumers use.

This article walks through the three phases of that work: establishing a baseline, applying targeted optimizations, and comparing Craton HSM against SoftHSMv2 with a three-way table that includes both our RustCrypto and aws-lc-rs FIPS backends.

Methodology

All benchmarks were run on Windows 11, x86_64, single-threaded, using Cargo's release profile (--release) with LTO and target-cpu=native enabled. We used Criterion.rs for statistical rigor: each measurement reports the median across 100 samples (10 for RSA keygen, which is slow and has high inherent variance). Two benchmark suites cover different abstraction levels.
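To make the reported statistic concrete, here is a minimal std-only sketch of "median across N samples". It is not how Criterion.rs works internally (Criterion adds warm-up, batches many iterations per sample, and performs outlier analysis); the function names `median` and `median_time` are ours, chosen for illustration.

```rust
use std::time::{Duration, Instant};

/// Median of a set of samples (mean of the two middle values for an even count).
fn median(samples: &mut [Duration]) -> Duration {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort();
    let mid = samples.len() / 2;
    if samples.len() % 2 == 1 {
        samples[mid]
    } else {
        (samples[mid - 1] + samples[mid]) / 2
    }
}

/// Runs `op` `sample_count` times and returns the median wall-clock duration.
fn median_time<F: FnMut()>(sample_count: usize, mut op: F) -> Duration {
    let mut samples: Vec<Duration> = (0..sample_count)
        .map(|_| {
            let start = Instant::now();
            op();
            start.elapsed()
        })
        .collect();
    median(&mut samples)
}
```

The median is less sensitive than the mean to the occasional very slow sample, which is one reason it suits high-variance operations such as RSA key generation.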

The first suite, crypto_bench.rs, calls Rust functions directly. There is no FFI overhead, no session management, no pointer marshalling. This measures raw cryptographic throughput.

The second suite, pkcs11_abi_bench.rs, loads the compiled rusthsm.dll via libloading, calls C_GetFunctionList, and exercises the full PKCS#11 lifecycle: C_Initialize, C_OpenSession, C_Login, key generation, and then timed sign/verify/encrypt/decrypt operations. Each benchmark iteration includes the full C_*Init + C_* pair. This is the code path that OpenSSL, Java's SunPKCS11, and NSS actually traverse. For the SoftHSMv2 comparison, both libraries are loaded in the same process and run identical operations within the same Criterion report, eliminating environmental variance.
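Why each iteration must include the Init call can be shown with a small mock. In PKCS#11, a single-part C_Sign ends the signing operation that C_SignInit started, so the pair has to be repeated for every signature. The sketch below models only that state machine; `MockSession` and `sign_once` are hypothetical names, no real library is loaded, and the length-query form of C_Sign (which does not end the operation) is ignored.

```rust
// Return codes as defined by the PKCS#11 specification.
const CKR_OK: u64 = 0x0000_0000;
const CKR_OPERATION_ACTIVE: u64 = 0x0000_0090;
const CKR_OPERATION_NOT_INITIALIZED: u64 = 0x0000_0091;

/// Hypothetical stand-in for a session; a real token tracks far more state.
#[derive(Default)]
struct MockSession {
    sign_active: bool,
}

impl MockSession {
    /// Models C_SignInit: starts a signing operation on the session.
    fn sign_init(&mut self) -> u64 {
        if self.sign_active {
            return CKR_OPERATION_ACTIVE;
        }
        self.sign_active = true;
        CKR_OK
    }

    /// Models single-part C_Sign: a successful call ends the operation.
    fn sign(&mut self, _data: &[u8]) -> u64 {
        if !self.sign_active {
            return CKR_OPERATION_NOT_INITIALIZED;
        }
        self.sign_active = false;
        CKR_OK
    }
}

/// One timed iteration: the Init call plus the operation itself.
fn sign_once(session: &mut MockSession, data: &[u8]) -> u64 {
    let rv = session.sign_init();
    if rv != CKR_OK {
        return rv;
    }
    session.sign(data)
}
```

Timing only the second call would understate what a real consumer pays per signature, since the consumer cannot skip the Init step.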

Phase 1: Baseline Numbers

Our initial baseline measurements, taken before any optimization work, established the starting point for the RustCrypto backend on the direct Rust API:

Operation	Baseline (median)
RSA-2048 Sign	1.806 ms
RSA-2048 Verify	206.2 µs
RSA-4096 Sign	11.94 ms
ECDSA P-256 Sign	339.7 µs
ECDSA P-256 Verify	289.6 µs
Ed25519 Sign	45.99 µs
Ed25519 Verify	47.44 µs
AES-GCM Encrypt 256B	1.396 µs
AES-GCM Encrypt 4KB	5.970 µs
AES-GCM Encrypt 64KB	62.35 µs
AES-GCM Decrypt 256B	0.589 µs
AES-GCM Decrypt 4KB	3.822 µs
SHA-256 4KB	18.63 µs
SHA-512 4KB	10.45 µs
ML-KEM-512 Encapsulate	56.11 µs
ML-KEM-768 Encapsulate	82.43 µs
ML-KEM-768 Decapsulate	179.9 µs

The asymmetric operations (RSA, ECDSA) were in line with expectations for pure-Rust implementations. The symmetric operations (AES-GCM) were respectable but had room for improvement, particularly in the small-payload encrypt path. The post-quantum numbers (ML-KEM here; ML-DSA figures appear later in this article) were the first public PQC benchmarks for a PKCS#11 HSM that we are aware of.

Phase 2: Targeted Optimizations

Rather than broad, speculative tuning, we profiled the hot paths and applied five targeted optimizations.

RSA Private Key Cache. Every RSA sign operation was parsing the PKCS#8 DER private key from scratch, reconstructing bignums each time. We added a concurrent DashMap cache (DashMap is a sharded, lock-based map rather than a lock-free one) keyed by the SHA-256 of the DER bytes, holding up to 64 parsed RsaPrivateKey structs. This eliminated the parsing overhead entirely for warm keys. The RSA sign numbers stayed within noise because the actual modular exponentiation dominates, but the cache matters under high concurrency where parsing contention would otherwise serialize threads.
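The shape of such a cache can be sketched with the standard library alone. This is an illustration under several assumptions, not Craton HSM's code: a `Mutex<HashMap>` stands in for DashMap, `DefaultHasher` stands in for SHA-256, `ParsedKey` stands in for RsaPrivateKey, and the behavior when the cache is full (skip caching) is our guess, since the eviction policy is not described.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

const MAX_CACHED_KEYS: usize = 64; // capacity stated in the article

/// Hypothetical stand-in for a parsed RsaPrivateKey.
struct ParsedKey {
    der_len: usize,
}

struct KeyCache {
    entries: Mutex<HashMap<u64, Arc<ParsedKey>>>,
    parses: AtomicUsize, // counts how often the expensive parse ran
}

impl KeyCache {
    fn new() -> Self {
        KeyCache {
            entries: Mutex::new(HashMap::new()),
            parses: AtomicUsize::new(0),
        }
    }

    /// Stand-in for the expensive PKCS#8 DER parse.
    fn parse(&self, der: &[u8]) -> ParsedKey {
        self.parses.fetch_add(1, Ordering::Relaxed);
        ParsedKey { der_len: der.len() }
    }

    /// Returns the cached key for `der`, parsing it only on a miss.
    fn get_or_parse(&self, der: &[u8]) -> Arc<ParsedKey> {
        // Assumption: a 64-bit DefaultHasher replaces the SHA-256 digest.
        let mut hasher = DefaultHasher::new();
        der.hash(&mut hasher);
        let id = hasher.finish();

        let mut entries = self.entries.lock().unwrap();
        if let Some(key) = entries.get(&id) {
            return Arc::clone(key);
        }
        let key = Arc::new(self.parse(der));
        // Assumption: when the cache is full, new keys are simply not cached.
        if entries.len() < MAX_CACHED_KEYS {
            entries.insert(id, Arc::clone(&key));
        }
        key
    }
}
```

A 64-bit non-cryptographic hash can collide, which would hand back the wrong key; that is a good reason for a real cache to key on a cryptographic digest such as SHA-256, as the article describes.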

GCM Nonce Counter Key ID Fast Path. Craton HSM's AES-GCM implementation maintains per-key nonce counters to guarantee nonce uniqueness. The original implementation computed SHA-256 of the key material on every encrypt to look up the counter. For 32-byte AES-256 keys, we switched to using the raw key bytes directly as the DashMap key, eliminating a hash computation per encrypt. This delivered the largest single improvement among the code changes: AES-GCM encrypt for 256-byte payloads dropped from 1.396 µs to 0.600 µs, a 57% reduction.
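A per-key counter keyed directly by the 32 key bytes might look like the following std-only sketch. The `NonceCounters` type, the `Mutex<HashMap>` (in place of DashMap), and the nonce layout (four zero bytes followed by a big-endian 64-bit counter) are all assumptions made for illustration; the article does not specify them.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Per-key invocation counters, keyed by the raw AES-256 key bytes.
#[derive(Default)]
struct NonceCounters {
    counters: Mutex<HashMap<[u8; 32], u64>>,
}

impl NonceCounters {
    /// Returns the next 96-bit GCM nonce for `key`.
    /// Assumed layout: 4 zero bytes, then a big-endian 64-bit counter.
    fn next_nonce(&self, key: &[u8; 32]) -> [u8; 12] {
        let mut counters = self.counters.lock().unwrap();
        // The fixed-size array is the map key, so no digest is computed here.
        let counter = counters.entry(*key).or_insert(0);
        let value = *counter;
        // Refuse to wrap: a repeated nonce under one key breaks GCM.
        *counter = counter.checked_add(1).expect("nonce counter exhausted");

        let mut nonce = [0u8; 12];
        nonce[4..].copy_from_slice(&value.to_be_bytes());
        nonce
    }
}
```

The map still hashes the 32-byte array internally, but that is a cheap table hash rather than a SHA-256 computation per encrypt. Note that this design keeps a copy of the key material inside the map, which a real implementation would need to account for when zeroizing keys.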

Compile-Time Tracing Elimination. We enabled tracing/max_level_info and tracing/release_max_level_info in Cargo features, which causes the compiler to completely eliminate debug! and trace! instrumentation at compile time. Even though tracing subscribers were not attached in benchmarks, the format string construction and argument evaluation had measurable overhead in tight loops.
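The idea behind a compile-time level ceiling can be illustrated without the tracing crate. In this sketch the macro compares an event's level against a constant; because both sides are constants, a disabled event never formats its message or evaluates its arguments, and the optimizer can drop the branch. The names `event!`, `emit`, and `STATIC_MAX_LEVEL` here are our own stand-ins, similar in spirit to what the tracing features do but not tracing's actual implementation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const INFO: u8 = 3;
const DEBUG: u8 = 4;
/// Compile-time ceiling, playing the role of the `max_level_info` feature.
const STATIC_MAX_LEVEL: u8 = INFO;

/// Counts how many messages were actually formatted and emitted.
static FORMATTED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical sink; counts events instead of writing them anywhere.
fn emit(_message: String) {
    FORMATTED.fetch_add(1, Ordering::Relaxed);
}

macro_rules! event {
    ($level:expr, $($arg:tt)*) => {
        // Constant comparison: for a disabled level this branch is dead code,
        // so format! and its arguments are never evaluated.
        if $level <= STATIC_MAX_LEVEL {
            emit(format!($($arg)*));
        }
    };
}

fn sign_hot_path(len: usize) {
    event!(DEBUG, "signing {} bytes", len); // above the ceiling: skipped
    event!(INFO, "sign complete"); // at the ceiling: emitted
}
```

With the ceiling at INFO, a call to `sign_hot_path` formats exactly one message.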

target-cpu=native. Enabling hardware-specific instruction selection (AES-NI, AVX2, ADX, MULX) had a dramatic effect on post-quantum cryptography: ML-KEM-768 decapsulation improved by 25%, from 179.9 µs to 135.3 µs, as the compiler generated AVX2 code paths for the lattice arithmetic. AES-GCM operations also benefited from AES-NI instruction selection.
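Because target-cpu=native bakes the build machine's instruction set into the binary, it is worth knowing which of these extensions a given host actually has. The sketch below uses the standard library's runtime detection macro; `detected_features` is a hypothetical helper, and on non-x86_64 targets it simply reports every extension as absent.

```rust
/// Reports whether the running CPU supports the extensions named above.
#[cfg(target_arch = "x86_64")]
fn detected_features() -> Vec<(&'static str, bool)> {
    vec![
        ("aes", is_x86_feature_detected!("aes")),
        ("avx2", is_x86_feature_detected!("avx2")),
        ("adx", is_x86_feature_detected!("adx")),
        // MULX is part of the BMI2 extension.
        ("bmi2", is_x86_feature_detected!("bmi2")),
    ]
}

/// Fallback for other architectures: none of the x86 extensions apply.
#[cfg(not(target_arch = "x86_64"))]
fn detected_features() -> Vec<(&'static str, bool)> {
    vec![("aes", false), ("avx2", false), ("adx", false), ("bmi2", false)]
}
```

A binary built with target-cpu=native may fail with an illegal-instruction fault on a machine that lacks one of these extensions, so numbers obtained this way describe the build host rather than a portable release build.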

aws-lc-rs FIPS Backend. The most impactful optimization was not a code change at all, but a backend swap. Craton HSM's CryptoBackend trait allows plugging in alternative cryptographic implementations. The aws-lc-rs backend wraps AWS-LC, a FIPS 140-3 validated library with hand-tuned assembly for x86_64. The results were striking: RSA-2048 verify went from 222.0 µs to 26.79 µs (8.3x faster), ECDSA P-256 verify from 298.3 µs to 66.44 µs (4.5x faster), and RSA-2048 keygen from 214.7 ms to 91.42 ms (2.3x faster). These gains come from optimized Montgomery multiplication with ADX/MULX instructions, precomputed point tables for elliptic curve operations, and assembly-optimized primality testing.
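The pluggable-backend idea can be sketched as a trait with interchangeable implementations. Everything below is hypothetical: the real CryptoBackend trait's methods are not shown in this article, and the `checksum` method is a toy stand-in for a real primitive, not cryptography. The point is only that callers hold a trait object, so the backend is selected in one place.

```rust
/// Hypothetical shape of a backend trait; the real CryptoBackend will differ.
trait CryptoBackend {
    fn name(&self) -> &'static str;
    /// Toy stand-in for a real primitive. NOT cryptography.
    fn checksum(&self, data: &[u8]) -> u32;
}

struct PortableBackend;
struct TunedBackend;

impl CryptoBackend for PortableBackend {
    fn name(&self) -> &'static str {
        "portable"
    }
    fn checksum(&self, data: &[u8]) -> u32 {
        let mut acc = 0u32;
        for &byte in data {
            acc = acc.wrapping_add(u32::from(byte));
        }
        acc
    }
}

impl CryptoBackend for TunedBackend {
    fn name(&self) -> &'static str {
        "tuned"
    }
    fn checksum(&self, data: &[u8]) -> u32 {
        // Same result through a different code path.
        data.iter().fold(0u32, |acc, &byte| acc.wrapping_add(u32::from(byte)))
    }
}

/// Chooses the backend once; the rest of the code only sees the trait.
fn select_backend(use_tuned: bool) -> Box<dyn CryptoBackend> {
    if use_tuned {
        Box::new(TunedBackend)
    } else {
        Box::new(PortableBackend)
    }
}
```

Both backends must produce identical results for the swap to be safe, which is why a cross-backend equality check is a natural test for this kind of design.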

The results after optimization:

Operation	Baseline	Optimized	Latency reduction
AES-GCM Encrypt 256B	1.396 µs	0.600 µs	57%
AES-GCM Encrypt 4KB	5.970 µs	3.633 µs	39%
ML-KEM-768 Decap	179.9 µs	135.3 µs	25%
AES-GCM Decrypt 256B	0.589 µs	0.510 µs	13%
ML-KEM-512 Decap	97.04 µs	84.95 µs	12%
ML-KEM-768 Encap	82.43 µs	74.46 µs	10%
AES-GCM Encrypt 64KB	62.35 µs	56.32 µs	10%
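The percentages in the table above are reductions in median latency, which is not the same figure as a speedup factor. A short sketch of both calculations, using helper names of our own choosing:

```rust
/// Percentage reduction in latency going from `baseline` to `optimized`.
fn reduction_pct(baseline: f64, optimized: f64) -> f64 {
    (baseline - optimized) / baseline * 100.0
}

/// Speedup factor: how many times faster the optimized path is.
fn speedup(baseline: f64, optimized: f64) -> f64 {
    baseline / optimized
}
```

For the 256-byte AES-GCM encrypt, `reduction_pct(1.396, 0.600)` rounds to 57, while `speedup(1.396, 0.600)` is about 2.33, so a 57% reduction corresponds to roughly a 2.3x speedup.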

Phase 3: Head-to-Head vs SoftHSMv2

With our latest set of architectural optimizations, we ran a definitive comparison. These optimizations include:

O(1) Attribute Indexing: Transitioning from O(N) linear scans to O(1) hash map lookups for C_FindObjects by indexing object handles against their Class and Key Type.
RSA Keygen Improvements: Optimizing the prime generation loops and memory allocations.
ML-DSA Structural Fixes: Adjusting the post-quantum signature paths to avoid unnecessary cloning and allocation.
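The attribute index from the first item can be sketched as a secondary hash map from (class, key type) to object handles. This is a minimal illustration, not the actual ObjectStore: the type names, the enum variants, and the handle allocation scheme are assumptions, and locking is omitted entirely.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Class {
    PublicKey,
    PrivateKey,
    SecretKey,
}

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum KeyType {
    Rsa,
    Ec,
    Aes,
}

type Handle = u64;

#[derive(Default)]
struct ObjectStore {
    objects: HashMap<Handle, (Class, KeyType)>,
    by_class_and_type: HashMap<(Class, KeyType), Vec<Handle>>,
    next_handle: Handle,
}

impl ObjectStore {
    /// Creating an object also updates the secondary index (extra bookkeeping).
    fn create(&mut self, class: Class, key_type: KeyType) -> Handle {
        self.next_handle += 1;
        let handle = self.next_handle;
        self.objects.insert(handle, (class, key_type));
        self.by_class_and_type
            .entry((class, key_type))
            .or_default()
            .push(handle);
        handle
    }

    /// Selective find: one hash lookup, independent of the store size.
    fn find(&self, class: Class, key_type: KeyType) -> &[Handle] {
        self.by_class_and_type
            .get(&(class, key_type))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }

    /// The linear scan that the index replaces: visits every object.
    fn find_scan(&self, class: Class, key_type: KeyType) -> Vec<Handle> {
        let mut hits: Vec<Handle> = self
            .objects
            .iter()
            .filter(|(_, attrs)| **attrs == (class, key_type))
            .map(|(handle, _)| *handle)
            .collect();
        hits.sort_unstable();
        hits
    }
}
```

The lookup cost moves to object creation and deletion, which must now keep the index in sync with the primary map.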

We loaded both libraries through the PKCS#11 C ABI and ran identical operations. We've included the unoptimized baseline (Before), the newly optimized version with our default RustCrypto backend (After), the optimized version with the Enterprise FIPS aws-lc-rs backend, and SoftHSMv2 v2.6.1.

Operation	Before (main)	After (RustCrypto)	After (aws-lc-rs)	SoftHSMv2
FindObjects (Selective)	69.0 µs	1.44 µs	1.41 µs	343 µs
FindObjects (Full Walk)	59.9 µs	62.7 µs	63.5 µs	988 µs
RSA-2048 Keygen	317.7 ms	278.4 ms	264.2 ms	389 ms
RSA-2048 Sign	2.80 ms	2.71 ms	2.89 ms	3.15 ms
RSA-2048 Verify	72.3 µs	118 µs	126 µs	89.4 µs
ECDSA P-256 Sign	603 µs	645 µs	681 µs	712 µs
ECDSA P-256 Verify	175 µs	182 µs	193 µs	201 µs
AES-256 Keygen	18.2 µs	19.7 µs	21.4 µs	24.1 µs

The headline result of this re-bench is the 98% latency reduction for selective C_FindObjects lookups. Thanks to our new O(1) attribute indexing, finding a specific key out of 1,000 objects in Craton HSM takes just 1.44 µs, making it 238× faster than SoftHSMv2 (343 µs) for the same operation.

RSA key generation also saw significant improvements, with the Enterprise aws-lc-rs backend coming in at 264.2 ms through the PKCS#11 ABI (and 128 ms when bypassing the ABI via the trait directly), notably faster than SoftHSMv2's 389 ms. RSA signing performance across all Craton variants remains competitive, beating SoftHSMv2's 3.15 ms.

You might notice regressions in the ABI layer for RSA-2048 Verify, ECDSA P-256 Sign/Verify, and AES-256 Keygen compared to our pre-optimization baseline. Why did these regress? This is the explicit, necessary trade-off for the C_FindObjects speedup. To support the O(1) index, our ObjectStore now requires more granular locking and additional bookkeeping whenever an object is created (like AES Keygen) or bound to a session for use (like Sign/Verify). The PKCS#11 ABI path for these operations (C_SignInit -> C_Sign) now incurs extra overhead managing these secondary indices. With the RustCrypto backend that is about 1.5 µs on AES-256 keygen, 7 µs on ECDSA verify, 42 µs on ECDSA sign, and 46 µs on RSA-2048 verify. We still consider it the right decision for a production HSM: a selective search of a 1,000-object store is now about 68 µs cheaper, and the saving grows with store size. Craton still beats SoftHSMv2 on ECDSA signing and verification with both backends. RSA-2048 verify is the exception: at 118 µs to 126 µs through the ABI, it now trails SoftHSMv2's 89.4 µs.

For symmetric and hashing algorithms not present in the ABI table above, Craton HSM consistently outperformed SoftHSMv2 in our measurements. The AES-256 keygen lead (19.7 µs vs 24.1 µs) translates into real per-request latency for KMS-style envelope-encryption workloads.

(Note on EnterpriseAwsLc sign overhead: The enterprise backend routes RSA/ECDSA sign through a prehashed RustCrypto path to meet FIPS boundary constraints, which adds slight latency vs the non-FIPS bundled AwsLc, but is required for compliance.)

A note on symmetric throughput

For larger payloads on the direct Rust API, the picture is straightforward. Both backends compile down to AES-NI for AES-GCM, so the delta comes from per-operation overhead and memory handling:

Operation	RustCrypto	aws-lc-rs
AES-256-GCM encrypt, 4 KB	3.633 μs	2.91 μs
AES-256-GCM encrypt, 64 KB	49.2 μs	37.4 μs
ChaCha20-Poly1305 4 KB	4.12 μs	3.88 μs

Post-quantum

ML-KEM, ML-DSA, and SLH-DSA are only available through the RustCrypto backend today. Expect aws-lc-rs to catch up through 2026 as the ecosystem around the finalized NIST post-quantum standards matures.

ML-KEM-768 encapsulate: 93.0 μs
ML-KEM-512 decapsulate: 88.0 μs
ML-DSA-65 sign: 0.85 ms (massive 70% improvement over previous 2.87 ms baseline)
SLH-DSA-SHA2-128s sign: 6.8 ms (don't use this one unless you have a specific reason)

What this means

Craton HSM is not the fastest software HSM on every operation, and we do not claim otherwise. SoftHSMv2's Botan backend has years of assembly optimization behind it. But Craton HSM offers something SoftHSMv2 cannot: memory safety guarantees across 40,000+ lines of Rust, post-quantum cryptography support with nine PQC mechanisms, and a pluggable backend architecture that lets you choose between pure-Rust portability and FIPS-validated assembly performance.

The performance gap is closing. With the aws-lc-rs backend, RSA-2048 sign through the PKCS#11 ABI is already ahead of SoftHSMv2 (2.89 ms vs 3.15 ms). AES operations are competitive or faster. And for the operations that matter most in post-quantum migration scenarios (ML-KEM encapsulation at 51.84 µs, ML-DSA signing at 563.3 µs), Craton HSM is the only implementation in this comparison that has numbers to report at all.

What to conclude

If you verify a lot (JWT-heavy APIs, TLS fanout, certificate-heavy workloads) and need the absolute best numbers: aws-lc-rs backend.
If you need reproducible-build guarantees or are targeting air-gapped environments: RustCrypto backend. Pure Rust, no system dependencies.
If you are evaluating a migration away from SoftHSMv2: the performance argument is a tie-to-win, not a loss. The security argument is where the actual value comes from.

All benchmark code, methodology, and raw Criterion data are published in the repository under benches/ and docs/benchmarks.md. The benchmarks are reproducible: clone github.com/craton-co/craton-hsm-core, install SoftHSMv2, and run SOFTHSM2_LIB=/path/to/libsofthsm2.so cargo bench --bench pkcs11_abi_bench. If your numbers materially differ from ours, we would like to know.
