Craton HSM

Troubleshooting

Symptom-indexed failure modes observed in production deployments of Craton HSM. Each entry follows the same shape: symptom as reported by the client or surfaced in logs, diagnosis steps to confirm the cause, and the fix. For recurring operational procedures (key rotation, backup, restore) see ./runbook; for monitoring and audit-log verification see ./monitoring.

POST failure at startup

Symptom. Every PKCS#11 call returns CKR_GENERAL_ERROR immediately. grpcurl against the daemon also returns errors on every method.

Diagnosis. Check the process log for a power-on self-test failure line. At least one of the 17 tests (module integrity + 16 algorithm KATs) failed.

journalctl -u craton-hsm | grep -i self-test
docker logs craton-hsm 2>&1 | grep -i self-test

Causes.

Binary modified since build — most often from a broken download or incomplete deployment copy.
Build from a dirty working tree that excluded a cryptographic dependency update.
Deployment of a binary from one target architecture onto another.

Fix. Re-download the official release, verify its signature per ./release-signing, verify its SHA-256 checksum, and redeploy. Do not attempt to patch the binary in place — a deliberately tampered binary that passes integrity verification but fails a KAT is the exact scenario POST is designed to catch, and the module is doing what it should.
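
The checksum step in the fix above can be scripted so it is not skipped during an incident. The following is a minimal sketch, assuming the release publishes a hex-encoded SHA-256 digest; the function name `sha256_matches` is illustrative and not part of any Craton HSM tooling. It complements the signature verification described in ./release-signing and does not replace it.

```python
import hashlib
import hmac


def sha256_matches(path: str, expected_hex: str) -> bool:
    """Return True if the file at `path` hashes to `expected_hex` (SHA-256)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large release archives are not loaded whole.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    # Normalize the published value: strip whitespace, accept upper-case hex.
    expected = expected_hex.strip().lower()
    return hmac.compare_digest(digest.hexdigest(), expected)
```

A `False` result means the artifact should be discarded and downloaded again, not repaired.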

DRBG health-test failure

Symptom. Key generation fails with CKR_GENERAL_ERROR or CKR_FUNCTION_FAILED; the process log contains a DRBG health-test failure line. The module may transition to a permanent error state.

Diagnosis. Grep the process log for drbg. The health test runs periodically and on reseed.

Causes.

An OS entropy source (getrandom, /dev/urandom, BCryptGenRandom) returned a value that failed the continuous health test. Sustained failure indicates a faulty entropy source.
Virtualization edge case: a VM restored from a snapshot with a stale PRNG state. The module's reseed policy normally handles this, but the period immediately after a snapshot restore is higher-risk.

Fix. Restart the daemon. If the failure recurs, treat it as potential key-quality compromise: rotate keys generated since the last known-good boot. On VMs, configure virtio-rng or the platform's equivalent entropy injection. Never disable the DRBG health test — it is the last layer that catches a failing RNG before it produces a weak key.

`CKR_TOKEN_NOT_PRESENT`

Symptom. Client receives CKR_TOKEN_NOT_PRESENT on C_OpenSession or C_GetTokenInfo. C_GetSlotList returns a slot but the slot reports no initialized token.

Diagnosis.

pkcs11-tool --module /path/to/libcraton_hsm.so --list-slots
pkcs11-tool --module /path/to/libcraton_hsm.so --list-token-slots
craton-hsm-admin status

Cause. The token has not been initialized on this deployment. This is the expected state of a fresh install, and also the expected state after a config change pointed [token].storage_path at a new (empty) directory.

Fix. Initialize the token:

craton-hsm-admin token init --label "Production HSM"

If the token was expected to be present — for example, after a restore — verify that storage_path in the running config matches the directory restored from backup. A mismatch points at a bad restore.

`CKR_PIN_INCORRECT` and lockout

Symptom. C_Login returns CKR_PIN_INCORRECT. After several failures it returns CKR_PIN_LOCKED. Between failures, the module rate-limits: each call takes longer than the last.

Diagnosis.

The rate limiter is operating as designed — exponential backoff from 100 ms base, doubling per failure, capped at 5 s. Wait for the backoff to elapse before the next attempt.
Count failed attempts from the audit log. When the count reaches [security].max_failed_logins, the next failure produces CKR_PIN_LOCKED.
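
The backoff schedule described above can be worked out ahead of time, which helps when deciding how long a retry script should wait. This is a sketch of the arithmetic only, assuming the first failure incurs the 100 ms base delay; the module's exact counting may differ, and the names below are illustrative.

```python
BASE_MS = 100   # base delay stated above
CAP_MS = 5000   # 5 s cap stated above


def backoff_delay_ms(consecutive_failures: int) -> int:
    """Delay before the next attempt after N consecutive failed logins.

    Assumption: failure 1 -> 100 ms, doubling per failure, capped at 5 s.
    """
    if consecutive_failures < 1:
        return 0
    return min(BASE_MS * 2 ** (consecutive_failures - 1), CAP_MS)


if __name__ == "__main__":
    # Prints [100, 200, 400, 800, 1600, 3200, 5000, 5000]
    print([backoff_delay_ms(n) for n in range(1, 9)])
```

Under these assumptions the cap is reached at the seventh consecutive failure.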

Fix.

On CKR_PIN_INCORRECT, verify the PIN against the operator's source of truth. Check character encoding — PINs are bytes, and a PIN created with a specific locale that is then entered under a different one will mismatch.
On CKR_PIN_LOCKED, the SO must reset:
```
craton-hsm-admin pin reset
```
Then deliver the new PIN through a channel independent of the HSM and record the reset in the operations log.
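
For the encoding check under CKR_PIN_INCORRECT, it helps to see how one visible PIN can map to different byte strings. The sketch below uses only the Python standard library; `pin_bytes` is an illustrative helper, and which encoding and normalization form the PIN was originally created with is something to confirm with the operator.

```python
import unicodedata


def pin_bytes(pin: str, encoding: str = "utf-8", form: str = "NFC") -> bytes:
    """Encode a PIN the way a client would before passing it to C_Login.

    The same visible text yields different bytes under different encodings
    or Unicode normalization forms, and the HSM compares bytes.
    """
    return unicodedata.normalize(form, pin).encode(encoding)


if __name__ == "__main__":
    pin = "caf\u00e9"
    print(pin_bytes(pin, "utf-8"))      # b'caf\xc3\xa9'
    print(pin_bytes(pin, "latin-1"))    # b'caf\xe9'
    print(pin_bytes(pin, form="NFD"))   # b'cafe\xcc\x81'
```

ASCII-only PINs encode identically under all three variants, which is why the problem tends to appear only with accented or non-Latin characters.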

`CKR_CRYPTOKI_NOT_INITIALIZED` after `fork()`

Symptom. A child process created via fork() on Unix fails with CKR_CRYPTOKI_NOT_INITIALIZED on its first PKCS#11 call. Tomcat, uWSGI, gunicorn, and other pre-fork servers hit this most often.

Cause. The library detects the PID mismatch between the parent (which called C_Initialize) and the child (which inherited the initialized state) and refuses to operate with the parent's DRBG state. This is intentional: sharing DRBG state across a fork would allow identical RNG output in parent and child.

Fix.

Call C_Initialize in the child after fork(), not in the parent. For pre-fork servers this usually means moving the initialization hook from worker-startup to request-handling or to a post-fork callback (gunicorn's post_fork, uWSGI's post-fork hook, Tomcat's listener).
Alternatively, use the gRPC daemon instead of the in-process library. The daemon is a separate process; the client-side gRPC library is fork-safe in the normal way.

See ../architecture/overview for the rationale.
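
One way to apply the first fix is to defer initialization until the first use inside each process and to repeat it whenever the PID changes. The sketch below shows that pattern in isolation. `ForkAwareClient` and the `initialize` callable are hypothetical: `initialize` stands in for whatever wrapper the application uses to call C_Initialize, and is not a Craton HSM API.

```python
import os
from typing import Any, Callable, Optional


class ForkAwareClient:
    """Lazily initializes a PKCS#11 wrapper once per process.

    `initialize` is a hypothetical callable that performs C_Initialize and
    returns whatever handle the application needs.
    """

    def __init__(self, initialize: Callable[[], Any],
                 getpid: Callable[[], int] = os.getpid) -> None:
        self._initialize = initialize
        self._getpid = getpid
        self._pid: Optional[int] = None
        self._handle: Any = None

    def handle(self) -> Any:
        pid = self._getpid()
        if self._pid != pid:
            # First use in this process (or first use after a fork):
            # initialize here, never reuse state inherited from the parent.
            self._handle = self._initialize()
            self._pid = pid
        return self._handle
```

With gunicorn, a `post_fork(server, worker)` hook in the configuration file that calls `client.handle()` would trigger initialization in each worker. The important point is that nothing calls `handle()` in the master process before workers are forked.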

TLS handshake failure on `cratond`

Symptom. Clients fail to connect with TLS errors — bad_certificate, unknown_ca, certificate_required, or certificate_unknown. The daemon log shows alert bad certificate or similar.

Diagnosis.

openssl s_client -connect <host>:5696 -showcerts \
    -cert client.pem -key client.key -CAfile ca.pem

Read the chain printed by s_client and compare against the CA bundle the daemon trusts.

Common causes and fixes.

Cause	Check	Fix
Certificate and key mismatch	`openssl x509 -in tls.crt -modulus -noout | sha256sum` vs `openssl rsa -in tls.key -modulus -noout | sha256sum`	Replace the certificate or the key so that the two digests match.
Certificate expired	`openssl x509 -in tls.crt -enddate -noout`	Rotate per ./runbook.
Client does not trust the server CA	Compare client trust store against issuing CA	Install the CA cert in the client trust store.
Intermediate missing from server bundle	`openssl verify -CAfile bundle.pem server.crt`	Concatenate the full chain into the bundle referenced by `tls_cert`.
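
The expiry check in the table prints a line of the form `notAfter=<date> GMT`. To turn that into a number suitable for alerting, the line can be parsed with Python's standard `ssl.cert_time_to_seconds`. This is a sketch; `days_until_expiry` is an illustrative name, and any alert threshold is a local policy decision.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(enddate_line: str, now: Optional[float] = None) -> float:
    """Days remaining before the notAfter date printed by `openssl x509 -enddate`.

    Accepts either the full line ("notAfter=Jan  5 09:34:43 2018 GMT") or
    just the date part. A negative result means the certificate has expired.
    """
    value = enddate_line.strip()
    if "=" in value:
        value = value.split("=", 1)[1]
    not_after = ssl.cert_time_to_seconds(value)
    if now is None:
        now = time.time()
    return (not_after - now) / 86400
```

Passing `now` explicitly makes the function usable in tests and when evaluating a certificate against a host whose clock is known to be wrong.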

Permission errors on the storage path

Symptom. On startup the daemon fails with a permission-denied error against storage_path, or with Database is locked.

Diagnosis.

ls -la /var/lib/craton-hsm/store
lsof | grep store    # another process holding the file
ps aux | grep craton-hsm

Causes and fixes.

Wrong owner. The daemon runs as a dedicated user (craton-hsm, nobody, or the container UID) but storage_path is owned by root. Fix: chown -R craton-hsm:craton-hsm /var/lib/craton-hsm.
Restrictive parent directory. ProtectSystem=strict or a read-only root filesystem can block writes even when the store directory itself is writable. Ensure ReadWritePaths= in the systemd unit includes the store directory.
Another process holds the store. Only one process may open the encrypted object store at a time. Stop the other instance, or use craton-hsm-daemon as the single owner and let other clients connect over gRPC.
Symlink in path. The loader rejects symlinks in storage_path components. Resolve the path to a real directory.
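
To find which component of storage_path is a symlink, each ancestor of the path can be tested in turn. This is a sketch of an equivalent pre-flight check, not the loader's own logic; the function name is illustrative.

```python
import os
from typing import List


def symlinked_components(path: str) -> List[str]:
    """Return every component of `path` (deepest first) that is a symlink."""
    found = []
    # abspath normalizes the path without resolving symlinks.
    current = os.path.abspath(path)
    while True:
        if os.path.islink(current):
            found.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return found
```

An empty list means no component is a symlink. Otherwise, point storage_path at the resolved location (`os.path.realpath` gives it) or replace the link with a real directory.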

Related: permission errors on the audit-log path surface as Failed to open audit log: permission denied at startup. The file must be 0600 and owned by the daemon user. Either delete it to let the daemon create a fresh file, or chown and chmod the existing file.

Session exhaustion

Symptom. C_OpenSession returns CKR_SESSION_COUNT.

Diagnosis. The session count reached [token].max_sessions. The most common cause is a client that opens sessions per request and never closes them.

Fix.

Fix the client to reuse a session or close it explicitly in a finally block.
As an interim mitigation, raise max_sessions in the config and restart. This is a patch, not a solution — session leaks tend to grow.
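
The "close it explicitly in a finally block" advice can be packaged once so that call sites cannot forget it. The sketch below is library-agnostic: `open_session` and `close_session` are hypothetical callables standing in for the client's wrappers around C_OpenSession and C_CloseSession.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def hsm_session(open_session: Callable[[], Any],
                close_session: Callable[[Any], None]) -> Iterator[Any]:
    """Open a session and guarantee it is closed, even if the body raises."""
    session = open_session()
    try:
        yield session
    finally:
        # Runs on normal exit and on exceptions, so the slot is always freed.
        close_session(session)
```

Usage is `with hsm_session(open_fn, close_fn) as session: ...`. For high request rates, holding one long-lived session per worker is usually the better choice, with this pattern reserved for short-lived tasks.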

Memory-lock warnings

Symptom. Startup log contains mlock failed: Operation not permitted (Unix) or the equivalent VirtualLock failure on Windows.

Cause. The process lacks the capability or privilege to lock memory.

Fix (Linux).

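# Option 1: grant the capability to the daemon binary.
sudo setcap cap_ipc_lock=ep /usr/local/bin/craton-hsm-daemon
# Option 2: raise the memlock limit. limits.conf applies to PAM sessions only;
# for a systemd service, set LimitMEMLOCK=infinity in the unit instead.
echo "* soft memlock unlimited" | sudo tee -a /etc/security/limits.conf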

Fix (Windows). Grant the Lock pages in memory user right to the service account via secpol.msc → Local Policies → User Rights Assignment.

The warning is non-fatal but the module may page key material to swap under memory pressure. Treat it as a hardening gap, not a runtime incident.

Clock skew and date-based key lifecycle

Symptom. A key created with CKA_START_DATE in the immediate future is rejected as pre-activation, or a key with CKA_END_DATE in the past is rejected as deactivated, when the expected state is active.

Diagnosis.

date -u
timedatectl status        # systemd
chronyc tracking          # chrony

Cause. System time on the host has drifted out of the window expected by the key-lifecycle state machine. SP 800-57 lifecycle states are computed against wall-clock time; skew produces state transitions that surprise the application.
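
To see how a small skew can flip the state, consider a simplified model of the date check. CKA_START_DATE and CKA_END_DATE carry calendar dates (CK_DATE), so the transition happens at a day boundary. The sketch below assumes the comparison is made against the UTC date and that the end date is inclusive; both are assumptions for illustration, and the function is not the module's state machine.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Optional


def lifecycle_state(now: datetime,
                    start_date: Optional[date],
                    end_date: Optional[date]) -> str:
    """Simplified date-based state: pre-activation, active, or deactivated."""
    today = now.astimezone(timezone.utc).date()
    if start_date is not None and today < start_date:
        return "pre-activation"
    if end_date is not None and today > end_date:
        return "deactivated"
    return "active"


if __name__ == "__main__":
    start = date(2025, 1, 2)
    true_now = datetime(2025, 1, 2, 0, 0, 0, 50_000, tzinfo=timezone.utc)
    skewed = true_now - timedelta(milliseconds=100)  # host clock 100 ms behind
    print(lifecycle_state(true_now, start, None))    # active
    print(lifecycle_state(skewed, start, None))      # pre-activation
```

In this model a host running 100 ms behind reports the key as pre-activation for the first 100 ms of its start date, which matches the symptom described above.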

Fix. Run NTP on every host running craton-hsm-daemon and target a drift budget under 100 ms. In regulated or low-latency deployments, use PTP with hardware timestamping and target under 10 ms. Monitor drift and alert on sustained skew exceeding the budget.

Note the interaction with TLS certificate validity — a host with a regressed clock will also fail to validate its TLS chain. Recovery requires fixing time before the daemon will start cleanly.

Audit log chain break

Symptom. The SIEM's chain verifier reports a mismatch between entry[n].previous_hash and SHA-256(entry[n-1]).

Diagnosis. Run the verification script from ./monitoring locally. A genuine break survives re-verification; a transient ingest-pipeline error does not.
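
The check being re-run has the following general shape. This sketch is not the script from ./monitoring. It assumes one JSON object per line, a hex-encoded `previous_hash` field, and a hash computed over the raw bytes of the previous line; the real serialization may differ, so use the reference implementation for any actual verification. `make_chain` exists only to build demo data.

```python
import hashlib
import json
from typing import List, Optional


def first_chain_break(lines: List[bytes]) -> Optional[int]:
    """Index of the first entry whose previous_hash does not match, or None."""
    for n in range(1, len(lines)):
        expected = hashlib.sha256(lines[n - 1]).hexdigest()
        recorded = json.loads(lines[n])["previous_hash"]
        if recorded != expected:
            return n
    return None


def make_chain(events: List[str]) -> List[bytes]:
    """Demo fixture: build a valid chain under the assumed entry format."""
    lines, prev = [], "0" * 64
    for event in events:
        line = json.dumps({"event": event, "previous_hash": prev}).encode()
        lines.append(line)
        prev = hashlib.sha256(line).hexdigest()
    return lines
```

The break is reported at the entry after the modified one: editing entry 1 leaves its own `previous_hash` valid but invalidates the hash recorded in entry 2. When preserving evidence, keep both the reported entry and its predecessor.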

Causes.

Log was edited. Treat as a security incident.
Log was truncated by a tool that did not respect the append-only contract.
A rotation procedure used mv instead of copytruncate, leaving the daemon writing to the renamed archive through its still-open file descriptor while the expected log path stays empty or missing.
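
The rotation cause can be reproduced in isolation: on POSIX systems, a process that holds a log file open keeps writing to that same file after it has been renamed. The sketch below demonstrates this with throwaway file names in a scratch directory. It does not touch a real audit log, and the function name is illustrative.

```python
import os
from typing import Tuple


def write_across_rename(directory: str) -> Tuple[str, bool]:
    """Write, rename the file while it is open, then write again (POSIX).

    Returns the archived file's content and whether the live path still exists.
    """
    live = os.path.join(directory, "audit.log")
    archived = os.path.join(directory, "audit.log.1")
    with open(live, "a") as fh:
        fh.write("entry-1\n")
        fh.flush()
        os.rename(live, archived)   # what `mv` does during rotation
        fh.write("entry-2\n")       # still lands in the renamed file
        fh.flush()
    with open(archived) as fh:
        archived_content = fh.read()
    return archived_content, os.path.exists(live)
```

Both entries end up in `audit.log.1` and no `audit.log` exists afterwards. A verifier that reads only the live path therefore sees a gap, even though no entry was altered.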

Fix. Preserve a forensic copy of the current and previous log segments before any remediation. Escalate to the security team. Rotate the token's keys if integrity cannot be re-established — a break of unknown origin is an untrusted operation history, and keys used during the untrusted window should be treated as potentially compromised.

Getting help

Release notes and migration guidance for tightened validation — see ../getting-started/installation and the migration notes in each release.
Chain-verification reference implementation — ./monitoring.
Hardening checklist for suspected supply-chain events — ../security/hardening.
FIPS-specific failure modes and the validated boundary — ../fips/self-tests.