Craton HSM
Troubleshooting
Symptom-indexed failure modes observed in production deployments of Craton HSM. Each entry follows the same shape: symptom as reported by the client or surfaced in logs, diagnosis steps to confirm the cause, and the fix. For recurring operational procedures (key rotation, backup, restore) see ./runbook; for monitoring and audit-log verification see ./monitoring.
POST failure at startup
Symptom. Every PKCS#11 call returns CKR_GENERAL_ERROR immediately. grpcurl against the daemon also returns errors on every method.
Diagnosis. Check the process log for a power-on self-test failure line. At least one of the 17 tests (module integrity + 16 algorithm KATs) failed.
journalctl -u craton-hsm | grep -i self-test
docker logs craton-hsm 2>&1 | grep -i self-test
Causes.
- Binary modified since build — most often from a broken download or incomplete deployment copy.
- Build from a dirty working tree that excluded a cryptographic dependency update.
- Deployment of a binary from one target architecture onto another.
Fix. Re-download the official release, verify its signature per ./release-signing, verify its SHA-256 checksum, and redeploy. Do not attempt to patch the binary in place — a deliberately tampered binary that passes integrity verification but fails a KAT is the exact scenario POST is designed to catch, and the module is doing what it should.
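The checksum step can be scripted. The following is a minimal sketch, assuming the release publishes a hex-encoded SHA-256 digest alongside the binary; the function names are illustrative and not part of Craton HSM. It does not replace the signature check described in ./release-signing.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large binaries do not load into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare the file digest with the published digest (case-insensitive)."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

A mismatch here means the binary on disk is not the one that was published, and it should be treated as the cause of the self-test failure until shown otherwise.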
DRBG health-test failure
Symptom. Key generation fails with CKR_GENERAL_ERROR or CKR_FUNCTION_FAILED; the process log contains a DRBG health-test failure line. The module may transition to a permanent error state.
Diagnosis. Grep the process log for drbg. The health test runs periodically and on reseed.
Causes.
- A hardware entropy source (getrandom, /dev/urandom, BCryptGenRandom) returned a value that failed the continuous health test. Sustained failure indicates a faulty entropy source.
- Virtualization edge case: a VM restored from a snapshot with a stale PRNG state. The module's reseed policy normally handles this, but the period immediately after a snapshot restore is higher-risk.
Fix. Restart the daemon. If the failure recurs, treat it as potential key-quality compromise: rotate keys generated since the last known-good boot. On VMs, configure virtio-rng or the platform's equivalent entropy injection. Never disable the DRBG health test — it is the last layer that catches a failing RNG before it produces a weak key.
CKR_TOKEN_NOT_PRESENT
Symptom. Client receives CKR_TOKEN_NOT_PRESENT on C_OpenSession or C_GetTokenInfo. C_GetSlotList returns a slot but the slot reports no initialized token.
Diagnosis.
pkcs11-tool --module /path/to/libcraton_hsm.so --list-slots
pkcs11-tool --module /path/to/libcraton_hsm.so --list-token-slots
craton-hsm-admin status
Cause. The token has not been initialized on this deployment. This is the expected state of a fresh install, and also the expected state after a config change pointed [token].storage_path at a new (empty) directory.
Fix. Initialize the token:
craton-hsm-admin token init --label "Production HSM"
If the token was expected to be present — for example, after a restore — verify that storage_path in the running config matches the directory restored from backup. A mismatch points at a bad restore.
CKR_PIN_INCORRECT and lockout
Symptom. C_Login returns CKR_PIN_INCORRECT. After several failures it returns CKR_PIN_LOCKED. Between failures, the module rate-limits: each call takes longer than the last.
Diagnosis.
- The rate limiter is operating as designed — exponential backoff from 100 ms base, doubling per failure, capped at 5 s. Wait for the backoff to elapse before the next attempt.
- Count failed attempts from the audit log. When the count reaches [security].max_failed_logins, the next failure produces CKR_PIN_LOCKED.
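The backoff schedule described above can be worked out ahead of time, which helps when sizing client-side retry timeouts. This is a sketch only: it assumes the first failure incurs the 100 ms base delay and that the delay doubles on each consecutive failure. The function name and the indexing convention are assumptions, not the module's implementation.

```python
def backoff_delay_ms(failures, base_ms=100, cap_ms=5000):
    """Delay in milliseconds after `failures` consecutive failed logins."""
    if failures <= 0:
        return 0
    # Double the base once per failure after the first, then clamp to the cap.
    return min(base_ms * 2 ** (failures - 1), cap_ms)


# Delays for the first eight consecutive failures under these assumptions.
schedule = [backoff_delay_ms(n) for n in range(1, 9)]
# schedule == [100, 200, 400, 800, 1600, 3200, 5000, 5000]
```

Under these assumptions the 5 s cap is reached on the seventh consecutive failure.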
Fix.
- On CKR_PIN_INCORRECT, verify the PIN against the operator's source of truth. Check character encoding: PINs are bytes, and a PIN created with a specific locale that is then entered under a different one will mismatch.
- On CKR_PIN_LOCKED, the SO must reset:
craton-hsm-admin pin reset
Then deliver the new PIN through a channel independent of the HSM and record the reset in the operations log.
Fork-related crashes
Symptom. A child process created via fork() on Unix crashes with CKR_CRYPTOKI_NOT_INITIALIZED on its first PKCS#11 call. Tomcat, uWSGI, gunicorn, and other pre-fork servers hit this most often.
Cause. The library detects the PID mismatch between the parent (which called C_Initialize) and the child (which inherited the initialized state) and refuses to operate with the parent's DRBG state. This is intentional: sharing DRBG state across a fork would allow identical RNG output in parent and child.
Fix.
- Call C_Initialize in the child after fork(), not in the parent. For pre-fork servers this usually means moving the initialization hook from worker-startup to request-handling or to a post-fork callback (gunicorn's post_fork, uWSGI's post-fork hook, Tomcat's listener).
- Alternatively, use the gRPC daemon instead of the in-process library. The daemon is a separate process; the client-side gRPC library is fork-safe in the normal way.
See ../architecture/overview for the rationale.
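One way to make in-process initialization fork-safe is to initialize lazily, once per process, keyed on the PID. The sketch below is illustrative: ForkAwareModule and its initialize callable are hypothetical names, and the callable stands in for whatever the PKCS#11 binding in use exposes for C_Initialize.

```python
import os


class ForkAwareModule:
    """Runs an initializer once per process, re-running it after a fork."""

    def __init__(self, initialize):
        self._initialize = initialize  # callable that performs C_Initialize
        self._owner_pid = None         # PID that last ran the initializer

    def ensure_initialized(self):
        """Initialize if this process has not done so; return True if it ran."""
        pid = os.getpid()
        if self._owner_pid == pid:
            return False
        # Either first use, or we are a child that inherited the parent's state.
        self._initialize()
        self._owner_pid = pid
        return True
```

With gunicorn, ensure_initialized() could be called from the post_fork(server, worker) hook in the config file so that each worker initializes on its own; calling it at the start of request handling has the same effect at the cost of one PID comparison per request.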
TLS handshake failure on cratond
Symptom. Clients fail to connect with TLS errors — bad_certificate, unknown_ca, certificate_required, or certificate_unknown. The daemon log shows alert bad certificate or similar.
Diagnosis.
openssl s_client -connect <host>:5696 -showcerts \
-cert client.pem -key client.key -CAfile ca.pem
Read the chain printed by s_client and compare against the CA bundle the daemon trusts.
Common causes and fixes.
| Cause | Check | Fix |
|---|---|---|
| Certificate and key mismatch | `openssl x509 -in tls.crt -modulus -noout \| sha256sum` vs `openssl rsa -in tls.key -modulus -noout \| sha256sum` | Redeploy a certificate and key that belong to the same pair. |
| Certificate expired | openssl x509 -in tls.crt -enddate -noout | Rotate per ./runbook. |
| Client does not trust the server CA | Compare client trust store against issuing CA | Install the CA cert in the client trust store. |
| Intermediate missing from server bundle | openssl verify -CAfile bundle.pem server.crt | Concatenate the full chain into the bundle referenced by tls_cert. |
Permission errors on the storage path
Symptom. On startup the daemon fails with a permission-denied error against storage_path, or with Database is locked.
Diagnosis.
ls -la /var/lib/craton-hsm/store
lsof | grep store # another process holding the file
ps aux | grep craton-hsm
Causes and fixes.
- Wrong owner. The daemon runs as a dedicated user (craton-hsm, nobody, or the container UID) but storage_path is owned by root. Fix: chown -R craton-hsm:craton-hsm /var/lib/craton-hsm.
- Restrictive parent directory. ProtectSystem=strict or a read-only root filesystem can block writes even when the store directory itself is writable. Ensure ReadWritePaths= in the systemd unit includes the store directory.
- Another process holds the store. Only one process may open the encrypted object store at a time. Stop the other instance, or use craton-hsm-daemon as the single owner and let other clients connect over gRPC.
- Symlink in path. The loader rejects symlinks in storage_path components. Resolve the path to a real directory.
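To find which component of storage_path is a symlink, each prefix of the path can be tested in turn. This is a rough approximation for POSIX hosts; the loader's exact check is not documented here, and the function name is illustrative.

```python
import os


def symlink_components(path):
    """Return each prefix of `path` that is a symbolic link, outermost first."""
    # abspath normalizes the path without resolving symlinks.
    path = os.path.abspath(path)
    found = []
    current = os.path.sep
    for part in path.split(os.path.sep):
        if not part:
            continue
        current = os.path.join(current, part)
        if os.path.islink(current):
            found.append(current)
    return found
```

An empty result means no component is a symlink. Otherwise os.path.realpath(path) gives the resolved directory to put in the config instead.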
Related: permission errors on the audit-log path surface as Failed to open audit log: permission denied at startup. The file must be 0600 and owned by the daemon user. Either delete it to let the daemon create a fresh file, or chown and chmod the existing file.
Session exhaustion
Symptom. C_OpenSession returns CKR_SESSION_COUNT.
Diagnosis. The session count reached [token].max_sessions. The most common cause is a client that opens sessions per request and never closes them.
Fix.
- Fix the client to reuse a session or close it explicitly in a finally block.
- As an interim mitigation, raise max_sessions in the config and restart. This is a patch, not a solution: session leaks tend to grow.
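The close-in-finally pattern can be wrapped once and reused. This is a minimal sketch, assuming a binding whose library object exposes openSession(slot) and whose session exposes closeSession(), as PyKCS11 does; hsm_session is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def hsm_session(lib, slot):
    """Open a session on `slot` and always close it, even if the body raises."""
    session = lib.openSession(slot)
    try:
        yield session
    finally:
        # Runs on normal exit and on exceptions, so the session is never leaked.
        session.closeSession()
```

Client code then uses "with hsm_session(lib, slot) as session:" around each unit of work, and the session is returned to the pool when the block exits.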
Memory-lock warnings
Symptom. Startup log contains mlock failed: Operation not permitted (Unix) or the equivalent VirtualLock failure on Windows.
Cause. The process lacks the capability or privilege to lock memory.
Fix (Linux).
sudo setcap cap_ipc_lock=ep /usr/local/bin/craton-hsm-daemon
# or
echo "* soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
Fix (Windows). Grant the Lock pages in memory user right to the service account via secpol.msc → Local Policies → User Rights Assignment.
The warning is non-fatal but the module may page key material to swap under memory pressure. Treat it as a hardening gap, not a runtime incident.
Clock skew and date-based key lifecycle
Symptom. A key created with CKA_START_DATE in the immediate future is rejected as pre-activation, or a key with CKA_END_DATE in the past is rejected as deactivated, when the expected state is active.
Diagnosis.
date -u
timedatectl status # systemd
chronyc tracking # chrony
Cause. System time on the host has drifted out of the window expected by the key-lifecycle state machine. SP 800-57 lifecycle states are computed against wall-clock time; skew produces state transitions that surprise the application.
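The effect of skew is easiest to see by computing the state by hand. The sketch below assumes day-granularity dates compared in UTC, with inclusive start and end boundaries. The state names come from the symptom description above, while the function and its boundary rules are assumptions rather than the module's state machine.

```python
from datetime import date


def lifecycle_state(today, start_date=None, end_date=None):
    """Classify a key as pre-activation, active, or deactivated by date."""
    if start_date is not None and today < start_date:
        return "pre-activation"
    if end_date is not None and today > end_date:
        return "deactivated"
    return "active"


# A key that activates on 2 January, seen from a correct and a lagging clock.
start = date(2024, 1, 2)
on_time = lifecycle_state(date(2024, 1, 2), start_date=start)   # "active"
lagging = lifecycle_state(date(2024, 1, 1), start_date=start)   # "pre-activation"
```

Comparing the host's date -u output against the key's dates in this way shows whether the rejection is explained by the clock alone.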
Fix. Run NTP on every host running craton-hsm-daemon and target a drift budget under 100 ms. In regulated or low-latency deployments, use PTP with hardware timestamping and target under 10 ms. Monitor drift and alert on sustained skew exceeding the budget.
Note the interaction with TLS certificate validity — a host with a regressed clock will also fail to validate its TLS chain. Recovery requires fixing time before the daemon will start cleanly.
Audit log chain break
Symptom. The SIEM's chain verifier reports a mismatch between entry[n].previous_hash and SHA-256(entry[n-1]).
Diagnosis. Run the verification script from ./monitoring locally. A genuine break survives re-verification; a transient ingest-pipeline error does not.
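For orientation, the check the verifier performs can be sketched in a few lines. This is not the reference script from ./monitoring. It assumes one JSON object per line with a hex previous_hash field, and that the hash covers the raw bytes of the preceding line without its trailing newline. The real log format may differ.

```python
import hashlib
import json


def find_chain_breaks(lines):
    """Return 1-based indexes of entries whose previous_hash does not match."""
    breaks = []
    prev_digest = None
    for index, raw in enumerate(lines, start=1):
        raw = raw.rstrip(b"\n")
        entry = json.loads(raw)
        # The first entry has no predecessor, so there is nothing to compare.
        if prev_digest is not None and entry.get("previous_hash") != prev_digest:
            breaks.append(index)
        prev_digest = hashlib.sha256(raw).hexdigest()
    return breaks
```

Note that an edited entry is reported at the entry after it, because it is the successor's previous_hash that stops matching.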
Causes.
- Log was edited. Treat as a security incident.
- Log was truncated by a tool that did not respect the append-only contract.
- A rotation procedure used mv instead of copytruncate, leaving the daemon writing to the archived file under a different inode.
Fix. Preserve a forensic copy of the current and previous log segments before any remediation. Escalate to the security team. Rotate the token's keys if integrity cannot be re-established — a break of unknown origin is an untrusted operation history, and keys used during the untrusted window should be treated as potentially compromised.
Getting help
- Release notes and migration guidance for tightened validation — see ../getting-started/installation and the migration notes in each release.
- Chain-verification reference implementation — ./monitoring.
- Hardening checklist for suspected supply-chain events — ../security/hardening.
- FIPS-specific failure modes and the validated boundary — ../fips/self-tests.