Craton HSM

Operator Runbook

Day-2 procedures for a deployed Craton HSM: starting and stopping the daemon, rotating PINs, rotating TLS material, rotating and archiving the audit log, moving token state between hosts, and the disaster-recovery checklist. This page assumes the daemon is deployed per ../getting-started/installation and configured per ./configuration.

Daemon lifecycle

Start and stop

The craton-hsm-daemon binary takes an optional config path and runs in the foreground until signaled.

craton-hsm-daemon /etc/craton_hsm/craton_hsm.toml

Under systemd the service is managed as a unit:

sudo systemctl start craton-hsm
sudo systemctl stop craton-hsm
sudo systemctl restart craton-hsm
sudo systemctl status craton-hsm
journalctl -u craton-hsm -f

A typical unit file runs the daemon as a dedicated craton-hsm user with aggressive sandboxing — NoNewPrivileges, ProtectSystem=strict, ProtectHome, PrivateTmp, and ReadWritePaths limited to the data and log directories. The service should restart on-failure with a short RestartSec.
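The directives above can be sketched as a unit file. The binary, config, data, and log paths below are assumptions to adjust for your deployment, not shipped defaults:

```ini
# /etc/systemd/system/craton-hsm.service — illustrative sketch
[Unit]
Description=Craton HSM daemon
After=network-online.target
Wants=network-online.target

[Service]
User=craton-hsm
Group=craton-hsm
ExecStart=/usr/local/bin/craton-hsm-daemon /etc/craton_hsm/craton_hsm.toml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/lib/craton-hsm /var/log/craton-hsm

[Install]
WantedBy=multi-user.target
```

After installing or editing the unit, run systemctl daemon-reload before starting the service.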

Configuration reload

There is no in-place reload. Any change to craton_hsm.toml, to TLS material referenced by the config, or to the audit log path requires a daemon restart. A rolling restart of a replicated deployment is coordinated from the load balancer; drain traffic off one node at a time.

Health checks

There is no dedicated /healthz endpoint on the gRPC daemon. Probe the gRPC service directly.

# TCP-level probe
nc -z localhost 5696

# gRPC-level probe (requires grpcurl)
grpcurl -plaintext localhost:5696 craton_hsm.HsmService/GetTokenInfo

A successful GetTokenInfo response implies the process started, the power-on self-test (POST) passed, the object store opened, and TLS (if configured) is serving. Orchestrators should treat repeated GetTokenInfo failures as liveness failures and restart the pod or unit.
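For orchestrators without native gRPC probing, a small retry wrapper around the probe command is enough to distinguish a transient blip from a dead daemon. The helper below is a sketch, not shipped tooling; the retry count and one-second interval are arbitrary, and any probe command (nc or grpcurl) can be passed to it:

```shell
# probe_with_retries N CMD... — run CMD up to N times, one second apart,
# succeeding as soon as CMD does.
probe_with_retries() {
    tries="$1"; shift
    i=1
    while [ "$i" -le "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example liveness probe, three attempts:
# probe_with_retries 3 grpcurl -plaintext localhost:5696 craton_hsm.HsmService/GetTokenInfo
```

Treat the wrapper's nonzero exit status as the liveness failure signal for the unit or pod.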

Token initialization

A fresh deployment starts with an uninitialized token. Initialize it once, then protect the SO PIN in the same tier as other root-of-trust material.

craton-hsm-admin token init --label "Production HSM"
# Prompts for SO PIN, then confirm SO PIN.
craton-hsm-admin token info
craton-hsm-admin status --json

Re-initialization destroys every object in the token and resets both PINs. It is a recovery-only procedure; never run it on a production token except as the final step of a disaster-recovery drill.

PIN rotation

Rotate the SO PIN

craton-hsm-admin pin change --user-type SO
# Prompts: current SO PIN, new SO PIN, confirm new SO PIN.

Rotate the user PIN

craton-hsm-admin pin change --user-type USER
# Prompts: current user PIN, new user PIN, confirm new user PIN.

Unlock a locked user PIN

After max_failed_logins consecutive failures the user PIN is locked and C_Login returns CKR_PIN_LOCKED. Only the SO can clear the lock.

craton-hsm-admin pin reset
# Authenticates as SO, then prompts for a new user PIN.

Communicate the new PIN to the user through a channel independent of the HSM (e.g., an enterprise password vault), and record the reset in the operations log.

TLS certificate rotation

TLS material for craton-hsm-daemon lives at the paths given in [daemon].tls_cert and [daemon].tls_key. Rotate every 90 days, or sooner if your CA issues short-lived certificates.

  1. Issue the new certificate from the internal CA with the same Subject and SANs as the live one, and at least 30 days of overlap with the current notAfter.

  2. Stage the new files alongside the current ones as server.crt.new and server.key.new, mode 0600, owned by the daemon user.

  3. Validate offline:

    openssl x509 -in server.crt.new -noout -dates -text
    openssl x509 -in server.crt.new -modulus -noout | sha256sum
    openssl rsa  -in server.key.new -modulus -noout | sha256sum
    # The two modulus hashes must match (RSA keys only; for EC keys,
    # compare the certificate and key public keys instead).
    
  4. Rotate one node at a time: on a follower, atomically rename .new over the active paths and restart the daemon. Wait for the node to return GetTokenInfo successfully before moving on.

  5. Rotate the leader last. In a single-node deployment, accept a brief window of unavailability during restart.

  6. Revoke the old certificate via your CA's CRL and publish the updated CRL before the old notAfter.
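The modulus comparison in step 3 works only for RSA keys. A key-type-agnostic variant compares the public key extracted from each file; the helper below is a sketch, not part of the craton-hsm tooling:

```shell
# cert_key_match CERT KEY — succeed iff CERT's public key matches KEY's.
# Works for RSA and EC alike, unlike the modulus comparison.
cert_key_match() {
    cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 1
    key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 1
    [ "$cert_pub" = "$key_pub" ]
}

# cert_key_match server.crt.new server.key.new && echo "pair OK"
```

Run it against the staged .new files in step 3 before touching the active paths.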

Audit log rotation

The audit log at [audit].log_path is an append-only JSON-Lines file with chained SHA-256 hashes between entries. The daemon does not rotate it: rotation must preserve the chain.

Safe rotation uses copy-truncate, not move-rename: the daemon keeps its file handle on the same inode, the copy is archived out of process, and truncating in place lets the daemon continue appending, so the chain stays intact from its perspective.

# Daily rotation with logrotate (Unix)
# /etc/logrotate.d/craton-hsm
/var/log/craton-hsm/audit.jsonl {
    daily
    rotate 30
    missingok
    compress
    delaycompress
    copytruncate
    create 0600 craton-hsm craton-hsm
}

After rotation, verify the archived segment's hash chain before shipping it to long-term storage (see ./monitoring for the verification algorithm). Do not delete archived segments until their chain has been independently verified and checksum-signed.
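As a concrete illustration of the verification step, the sketch below assumes each JSONL entry carries a prev_hash field equal to the SHA-256 of the previous raw line. The actual field names and chaining rule are defined in ./monitoring, so treat this as the shape of the check rather than the authoritative algorithm:

```shell
# verify_chain FILE — walk a JSONL audit segment and check that each
# entry's prev_hash equals the SHA-256 of the previous raw line.
# Field name and chaining rule are assumptions; see ./monitoring.
verify_chain() {
    file="$1"
    prev=""
    lineno=0
    while IFS= read -r line; do
        lineno=$((lineno + 1))
        if [ -n "$prev" ]; then
            want=$(printf '%s' "$prev" | sha256sum | cut -d' ' -f1)
            got=$(printf '%s' "$line" |
                sed -n 's/.*"prev_hash" *: *"\([0-9a-f]*\)".*/\1/p')
            if [ "$got" != "$want" ]; then
                echo "chain break at line $lineno" >&2
                return 1
            fi
        fi
        prev="$line"
    done < "$file"
    echo "chain OK ($lineno entries)"
}
```

A nonzero exit pinpoints the first entry whose link does not verify, which is the line to investigate before the segment ships to long-term storage.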

Import and export of token state

Exporting for backup

craton-hsm-admin backup --pin <SO_PIN> --output backup-YYYYMMDD.enc

The backup is produced in the encrypted object-store format — private key material never leaves the cryptographic boundary in plaintext. Store it on an encrypted volume and, ideally, wrap it again with the organization's escrow key before long-term storage.
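One way to apply the escrow wrap is CMS encryption to the escrow certificate. The helper and filenames below are assumptions for illustration, not craton-hsm tooling:

```shell
# escrow_wrap BACKUP CERT — hypothetical second wrap: CMS-encrypt BACKUP
# to the escrow certificate, writing BACKUP.p7. Only the holder of the
# escrow private key can unwrap it.
escrow_wrap() {
    openssl cms -encrypt -binary -aes256 \
        -in "$1" -outform DER -out "$1.p7" "$2"
}

# escrow_wrap backup-YYYYMMDD.enc escrow.crt
# Unwrap during recovery (requires the escrow private key):
#   openssl cms -decrypt -inform DER -in backup-YYYYMMDD.enc.p7 \
#       -inkey escrow.key -out backup-YYYYMMDD.enc
```

Keep the escrow private key offline; the wrapped backup is only as recoverable as that key.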

Restoring to the same host

Stop the daemon, move the existing store aside, restore, restart.

sudo systemctl stop craton-hsm
mv /var/lib/craton-hsm/store /var/lib/craton-hsm/store.old
craton-hsm-admin restore --pin <SO_PIN> --input backup-YYYYMMDD.enc
sudo systemctl start craton-hsm

Restoring to a new host

Provision the new host with the same Craton HSM version, the same config file (except for deployment-specific paths), and the same SO PIN. Copy the backup over an encrypted channel, restore, and verify. See ./backup-recovery for the full procedure.

Kubernetes operations

For clusters deployed via the Helm chart:

# Deploy
helm install my-hsm deploy/helm/craton_hsm/ \
    --set image.tag=0.9.1 \
    --set tls.enabled=true \
    --set tls.secretName=craton-hsm-tls

# Upgrade to a new image
helm upgrade my-hsm deploy/helm/craton_hsm/ --set image.tag=0.9.2

# Logs and status
kubectl logs -l app=craton-hsm -f
kubectl get pods -l app=craton-hsm

Persistent storage requires persistence.enabled=true and a storage class backed by an encrypted volume (LUKS on bare metal, CMK-backed EBS on AWS, CMEK-backed PD on GCP). Without persistence the token is in-memory and every pod restart starts from an uninitialized token.
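A values fragment for an encrypted persistent deployment might look like the sketch below. The image.tag, tls.enabled, tls.secretName, and persistence.enabled keys appear above; storageClass and size are assumed chart keys to verify against the chart's own values.yaml:

```yaml
# values.yaml fragment — sketch; confirm key names against the chart.
image:
  tag: "0.9.1"
tls:
  enabled: true
  secretName: craton-hsm-tls
persistence:
  enabled: true
  storageClass: encrypted-gp3   # must map to an encrypted volume backend
  size: 1Gi                     # assumed key; size to your object store
```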

Disaster-recovery checklist

Run through this list any time a node is lost, compromised, or being rebuilt. Keep the completed checklist with the incident record.

  • Confirm the scope of the incident: single node, full cluster, suspected key compromise.
  • Isolate the affected host from the network if compromise is suspected.
  • Locate the most recent verified encrypted backup.
  • Verify the backup's integrity by decrypting and listing contents without restoring.
  • Provision a clean host with the same Craton HSM version and base OS image.
  • Install the same config file; update only deployment-specific paths.
  • Restore the backup per the procedure above.
  • Start the daemon and verify POST passed (journalctl -u craton-hsm | grep -i self-test).
  • Smoke-test a read operation (GetTokenInfo, object listing).
  • Verify audit-log continuity: the first entry in the restored log must chain from the last entry in the archived segment.
  • If compromise is suspected, rotate every key wrapped by the token immediately; invalidate any certificates signed by token keys.
  • Record the RTO achieved and compare against the target.
  • File the post-incident review within five business days.

See ./backup-recovery for the detailed recovery procedure and ./troubleshooting for diagnosis of specific failure modes.