Craton HSM

Monitoring and Audit

Craton HSM emits two primary observability signals: the tamper-evident audit log (JSON Lines, append-only, chained SHA-256) and structured process logs at RUST_LOG-controlled severity. There is no built-in Prometheus exporter in the core module as of v0.9.x; metrics are derived from the audit log and from process log lines. This page describes both signals, how to verify the chain, how to ship the log to a SIEM, and which events should page an operator.

Audit log format

The audit log lives at the path given by [audit].log_path (default craton_hsm_audit.jsonl). It is created with owner-only permissions (0600 on Unix, owner-only ACL on Windows) and must remain owned by the daemon's process user — a permission-denied failure on open is the most common audit-related startup error.

Each line is a single JSON object. The concrete fields are:

| Field | Type | Meaning |
| --- | --- | --- |
| timestamp | integer | Unix epoch seconds. |
| session_handle | integer | The PKCS#11 session that performed the operation. |
| operation | string | The operation name (Login, Logout, GenerateKey, Sign, Verify, Encrypt, Decrypt, DestroyObject, etc.). |
| key_id | integer or null | The key handle involved, if any. |
| result | object | {"Success": null} or {"Failure": "<CK_RV symbol>"}. |
| previous_hash | string | Hex SHA-256 of the previous entry's canonical serialization. |
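
As an illustration of the field table, the sketch below parses one audit line into a typed record. The AuditEntry class, the parse_entry helper, and the sample line are hypothetical and built only from the fields listed above; none of them ship with Craton HSM.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditEntry:
    """Hypothetical typed view of one audit line (not a Craton HSM API)."""
    timestamp: int
    session_handle: int
    operation: str
    key_id: Optional[int]
    ok: bool
    error: Optional[str]  # CK_RV symbol when the operation failed
    previous_hash: str


def parse_entry(line: str) -> AuditEntry:
    """Parse one JSON Lines record using the documented field names."""
    raw = json.loads(line)
    result = raw["result"]
    # result is {"Success": null} or {"Failure": "<CK_RV symbol>"}.
    ok = "Success" in result
    return AuditEntry(
        timestamp=raw["timestamp"],
        session_handle=raw["session_handle"],
        operation=raw["operation"],
        key_id=raw["key_id"],
        ok=ok,
        error=None if ok else result["Failure"],
        previous_hash=raw["previous_hash"],
    )


# Made-up sample line, for illustration only.
sample = (
    '{"timestamp": 1700000000, "session_handle": 7, "operation": "Sign", '
    '"key_id": 42, "result": {"Failure": "CKR_KEY_HANDLE_INVALID"}, '
    '"previous_hash": "' + "0" * 64 + '"}'
)
entry = parse_entry(sample)
print(entry.operation, entry.ok, entry.error)
```

Flattening result into ok and error keeps downstream filters simple while preserving the CK_RV symbol for failed operations.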

The log_level field in [audit] filters which operation classes are written:

| Level | Writes |
| --- | --- |
| all | Every operation. |
| crypto | Cryptographic operations only (Sign, Verify, Encrypt, Decrypt, GenerateKey, Digest). |
| auth | Login, Logout, PIN changes, lockouts. |
| admin | Object create/destroy, token init, PIN reset. |
| none | Nothing. Discouraged except in closed lab environments. |

For compliance purposes, always use all. The overhead is dominated by fsync, not field count.

Chained-hash integrity

Each entry's previous_hash field is SHA-256(canonical_bytes(previous_entry)). The first entry's previous_hash is a well-known all-zero hash. A mismatch anywhere in the chain means the log was truncated, reordered, or mutated after writing.

To verify a segment:

  1. Read all entries in order.
  2. Compute SHA-256(entry[n-1]) using the canonical serialization.
  3. Compare against entry[n].previous_hash.
  4. Report the line number and both hashes on mismatch.

A reference verifier in Python:

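import hashlib
import json
import sys

prev = "0" * 64  # the first entry chains to the well-known all-zero hash
count = 0
with open(sys.argv[1], encoding="utf-8") as log:
    for lineno, line in enumerate(log, start=1):
        entry = json.loads(line)
        if entry["previous_hash"] != prev:
            # prev is the hash we computed; the entry holds what was recorded.
            sys.exit(f"chain break at line {lineno}: "
                     f"expected {prev}, got {entry['previous_hash']}")
        # Canonical form: previous_hash field excluded, keys sorted.
        body = {k: v for k, v in entry.items() if k != "previous_hash"}
        prev = hashlib.sha256(
            json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
        ).hexdigest()
        count += 1
# Count explicitly so an empty file reports 0 instead of raising NameError.
print(f"verified {count} entries")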

Run verification on every archived segment before it leaves the host. Any verification failure is a security event and must be treated as a potential compromise — see ./troubleshooting.
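
For archived segments, a function form of the verifier can be easier to reuse. The sketch below assumes that a rotated segment continues the chain from the last entry of its predecessor; the source does not state how rotation interacts with the chain, so treat start_hash as a hypothetical parameter. The names verify_segment and entry_hash are illustrative, and the canonical form is the one used by the reference verifier.

```python
import hashlib
import json
from typing import Iterable, Tuple

GENESIS = "0" * 64  # all-zero hash expected by the very first entry


def entry_hash(entry: dict) -> str:
    """Hash an entry using the canonical form from the reference verifier."""
    body = {k: v for k, v in entry.items() if k != "previous_hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def verify_segment(lines: Iterable[str],
                   start_hash: str = GENESIS) -> Tuple[int, str]:
    """Verify one segment; return (entry count, hash to carry forward).

    Raises ValueError naming the line and both hashes on a chain break.
    """
    prev = start_hash
    count = 0
    for lineno, line in enumerate(lines, start=1):
        entry = json.loads(line)
        if entry["previous_hash"] != prev:
            raise ValueError(
                f"chain break at line {lineno}: "
                f"expected {prev}, got {entry['previous_hash']}"
            )
        prev = entry_hash(entry)
        count += 1
    return count, prev
```

Under that assumption, the second value returned for one segment is passed as start_hash when verifying the next, so a gap between segments is reported as a chain break at line 1.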

Dumping the audit log

The admin CLI dumps recent entries without touching the live log file:

craton-hsm-admin audit dump --last 100
craton-hsm-admin audit dump --last 50 --json

The --json form is suitable for piping into a SIEM ingest tool or an ad-hoc jq query.
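
As an example of an ad-hoc query, the sketch below filters dumped entries for failed operations on one key handle. It assumes that --json emits one JSON object per line with the same fields as the audit log, which the source implies but does not spell out; failed_operations is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def failed_operations(lines: Iterable[str], key_id: int) -> Iterator[dict]:
    """Yield failed entries that touched the given key handle."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in piped input
        entry = json.loads(line)
        failed = "Failure" in entry.get("result", {})
        if failed and entry.get("key_id") == key_id:
            yield entry
```

Passing sys.stdin as lines in a small wrapper script lets the output of craton-hsm-admin audit dump --last 50 --json be piped straight in.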

Process logs

Process-level logs go to stdout/stderr and are controlled with RUST_LOG. Recommended production setting:

RUST_LOG=craton_hsm=info

Reserve debug for incident response; trace produces sensitive internals (key handles, session details) and is development-only.

Severity conventions:

| Level | Used for |
| --- | --- |
| error | POST failure, DRBG health-test failure, audit-log open failure, object-store I/O failure. |
| warn | Memory-lock failure, near-exhaustion of AES-GCM nonce counter, degraded CRL state (enterprise add-on), TLS reload failure. |
| info | Startup, shutdown, successful self-test, token initialization. |
| debug | Session open/close, operation tracing; off by default. |

Log shipping to a SIEM

The audit log is JSON Lines; any JSON-aware collector can ship it without transformation. The two deployment patterns:

  • Sidecar collector (Docker, Kubernetes). Mount the audit log directory read-only into a Fluent Bit / Vector / Filebeat container; forward to the SIEM over mTLS. Parse each line as JSON and preserve the previous_hash field — it is needed upstream for offline chain verification.
  • Host agent (systemd). Point the existing host agent (rsyslog with imfile, journald forwarding, Splunk forwarder) at the log path. Run it as a user with read-only access to the log file. Do not give the agent write permission, or the chain integrity can no longer be trusted end to end.

Forwarding produces a duplicate of the log; it does not replace the need to retain the originals. Never truncate or delete a segment until the SIEM has confirmed ingest and the segment's chain has been verified.

Alert triggers

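The following conditions warrant operator attention, either as a page to the on-call operator or as a ticket. Most surface in process logs, the audit log, or both; the chain-break and certificate-expiry triggers come from the SIEM verifier and external monitoring.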

| Trigger | Source | Severity | Action |
| --- | --- | --- | --- |
| Power-on self-test failure at startup | Process log (error) | Page | All operations return CKR_GENERAL_ERROR; daemon is unusable. See ./troubleshooting. |
| DRBG health-test failure | Process log (error) | Page | Key generation cannot proceed; treat as potential key-quality compromise. |
| Audit-log chain break | SIEM verifier | Page | Potential tampering. Preserve forensic copies before any remediation. |
| Repeated CKR_PIN_INCORRECT from the same session | Audit log | Ticket | At threshold, CKR_PIN_LOCKED follows. |
| CKR_PIN_LOCKED event | Audit log | Ticket | SO intervention required. |
| Audit-log open failure | Process log (error) | Page | The daemon runs without audit; depending on policy this may require immediate shutdown. |
| AES-GCM nonce counter > 75% | Process log (warn) | Ticket | Schedule key rotation. |
| AES-GCM nonce counter > 95% | Process log (warn) | Page | Rotate immediately. |
| mlock / VirtualLock failure at startup | Process log (warn) | Ticket | Key material may be paged to swap; fix capabilities. |
| TLS certificate expiring in < 14 days | External monitoring | Ticket | Rotate per ./runbook. |
| TLS certificate expired | Process log on restart | Page | Clients cannot connect. |
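
The repeated-CKR_PIN_INCORRECT trigger can be detected from parsed audit entries along the following lines. This is a minimal sketch: pin_failure_alerts and its threshold default are hypothetical, and the threshold is an alerting value chosen by the operator, not the token's actual lockout limit.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def pin_failure_alerts(entries: Iterable[dict], threshold: int = 3) -> List[int]:
    """Return session handles that reached `threshold` CKR_PIN_INCORRECT results.

    `threshold` is a hypothetical alerting value, not the token's lockout limit.
    """
    failures: Dict[int, int] = defaultdict(int)
    alerted: List[int] = []
    for entry in entries:
        # Successful results have no "Failure" key, so .get() returns None.
        if entry["result"].get("Failure") != "CKR_PIN_INCORRECT":
            continue
        session = entry["session_handle"]
        failures[session] += 1
        if failures[session] == threshold:
            alerted.append(session)  # report each session only once
    return alerted
```

Counting per session_handle mirrors the wording of the trigger; a SIEM rule would normally add a time window as well, which is omitted here for brevity.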

Metrics

The core daemon does not expose a Prometheus endpoint. Operators typically derive the following metrics from audit-log tailing:

  • Operation rate by operation and result.
  • Failed-login rate per session.
  • Key-generation rate, broken down by mechanism.
  • Error-code histogram (count by CK_RV symbol).

A minimal Prometheus exporter can be implemented as a sidecar that tails the JSON Lines log and increments counters. For deployments that need native metrics, see the enterprise observability add-on referenced in ../enterprise/certified.
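
The counting half of such a sidecar might look like the sketch below. It only aggregates lines that have already been read; following the file as it grows and exposing the counters over HTTP are left out. The function names are hypothetical and nothing here is part of Craton HSM.

```python
import json
from collections import Counter
from typing import Iterable


def result_label(result: dict) -> str:
    """Map the result object to "Success" or the CK_RV symbol."""
    return "Success" if "Success" in result else result["Failure"]


def count_operations(lines: Iterable[str]) -> Counter:
    """Count audit entries by (operation, result label)."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        counts[(entry["operation"], result_label(entry["result"]))] += 1
    return counts


def error_histogram(counts: Counter) -> Counter:
    """Collapse per-operation counts into a count by CK_RV symbol."""
    hist: Counter = Counter()
    for (_operation, label), n in counts.items():
        if label != "Success":
            hist[label] += n
    return hist
```

The (operation, result) pairs map naturally onto labels of a single counter metric, and the error-code histogram is derived from the same counts rather than tracked separately.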

Correlating audit and process logs

The audit log carries session_handle but not the originating client identity; process logs carry client TLS identity (when mTLS is configured) but not session handle. To correlate a failed sign to a specific client:

  1. Find the failing audit entry; note its session_handle and timestamp.
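  2. Find the process-log line for the C_OpenSession call that created that handle. It is logged at or before the audit timestamp, and session open/close lines appear only at debug level.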
  3. The C_OpenSession line carries the TLS peer identity.

Joining the two streams in the SIEM at ingest time simplifies this — add the TLS peer identity as a derived field on every audit line for the duration of each session.
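
The lookup in the steps above can be sketched as follows. The process-log line format is not documented here, so the code assumes the C_OpenSession lines have already been parsed into records with the hypothetical keys timestamp, session_handle, and peer. It also assumes session handles can be reused over time, which is why it picks the most recent open at or before the audit timestamp.

```python
from typing import Iterable, Optional


def peer_for_entry(audit_entry: dict,
                   session_opens: Iterable[dict]) -> Optional[str]:
    """Return the TLS peer of the session that produced an audit entry.

    Each record in `session_opens` is assumed to be a parsed C_OpenSession
    process-log line with hypothetical keys: timestamp, session_handle, peer.
    """
    best: Optional[dict] = None
    for rec in session_opens:
        if rec["session_handle"] != audit_entry["session_handle"]:
            continue
        if rec["timestamp"] > audit_entry["timestamp"]:
            continue  # opened after the operation: a later reuse of the handle
        if best is None or rec["timestamp"] >= best["timestamp"]:
            best = rec  # keep the most recent open before the operation
    return best["peer"] if best else None
```

A None result means no matching session-open record was found, for example because the process log was not captured at debug level during that period.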