Craton HSM

Monitoring and Audit

Craton HSM emits two primary observability signals: the tamper-evident audit log (JSON Lines, append-only, chained SHA-256) and structured process logs at RUST_LOG-controlled severity. There is no built-in Prometheus exporter in the core module as of v0.9.x; metrics are derived from the audit log and from process log lines. This page describes both signals, how to verify the chain, how to ship the log to a SIEM, and which events should page an operator.

Audit log format

The audit log lives at the path given by [audit].log_path (default craton_hsm_audit.jsonl). It is created with owner-only permissions (0600 on Unix, owner-only ACL on Windows) and must remain owned by the daemon's process user — a permission-denied failure on open is the most common audit-related startup error.

Each line is a single JSON object. The concrete fields are:

Field	Type	Meaning
`timestamp`	integer	Unix epoch seconds.
`session_handle`	integer	The PKCS#11 session that performed the operation.
`operation`	string	The operation name (`Login`, `Logout`, `GenerateKey`, `Sign`, `Verify`, `Encrypt`, `Decrypt`, `DestroyObject`, etc.).
`key_id`	integer | null	The key handle involved, if any.
`result`	object	`{"Success": null}` or `{"Failure": "<CK_RV symbol>"}`.
`previous_hash`	string	Hex SHA-256 of the previous entry's canonical serialization.
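As an illustration of the field table above, the following sketch parses one audit line into a typed record. The `AuditEntry` class, the `parse_entry` helper, and the sample line are hypothetical and not part of Craton HSM; the sample values are invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditEntry:
    timestamp: int              # Unix epoch seconds
    session_handle: int
    operation: str
    key_id: Optional[int]       # None when no key was involved
    ok: bool                    # True for {"Success": null}
    error: Optional[str]        # CK_RV symbol when the operation failed
    previous_hash: str


def parse_entry(line: str) -> AuditEntry:
    """Parse one JSON Lines record using the fields documented above."""
    raw = json.loads(line)
    result = raw["result"]
    return AuditEntry(
        timestamp=raw["timestamp"],
        session_handle=raw["session_handle"],
        operation=raw["operation"],
        key_id=raw["key_id"],
        ok="Success" in result,
        error=result.get("Failure"),
        previous_hash=raw["previous_hash"],
    )


# Hypothetical sample line; every value here is invented.
sample = (
    '{"timestamp": 1700000000, "session_handle": 3, "operation": "Sign", '
    '"key_id": 17, "result": {"Failure": "CKR_KEY_HANDLE_INVALID"}, '
    '"previous_hash": "' + "0" * 64 + '"}'
)
entry = parse_entry(sample)
print(entry.operation, entry.ok, entry.error)
```

Flattening `result` into `ok` and `error` is a convenience choice for downstream queries; the original `result` object is what appears in the log.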

The log_level field in [audit] filters which operation classes are written:

Level	Writes
`all`	Every operation.
`crypto`	Cryptographic operations only (Sign, Verify, Encrypt, Decrypt, GenerateKey, Digest).
`auth`	Login, Logout, PIN changes, lockouts.
`admin`	Object create/destroy, token init, PIN reset.
`none`	Nothing — discouraged except in closed lab environments.

For compliance purposes, always use all. The overhead is dominated by fsync, not field count.

Chained-hash integrity

Each entry's previous_hash field is SHA-256(canonical_bytes(previous_entry)). The first entry's previous_hash is a well-known all-zero hash. A mismatch anywhere in the chain means the log was truncated, reordered, or mutated after writing.

To verify a segment:

Read all entries in order.
Compute SHA-256(entry[n-1]) using the canonical serialization.
Compare against entry[n].previous_hash.
Report the line number and both hashes on mismatch.

A reference verifier in Python:

import hashlib
import json
import sys

prev = "0" * 64  # the first entry chains from the all-zero hash
lineno = 0       # stays 0 if the file is empty
with open(sys.argv[1], encoding="utf-8") as log:
    for lineno, line in enumerate(log, start=1):
        entry = json.loads(line)
        if entry["previous_hash"] != prev:
            # "expected" is the hash we computed; "got" is what the log claims.
            sys.exit(f"chain break at line {lineno}: "
                     f"expected {prev}, got {entry['previous_hash']}")
        # Canonical form: previous_hash field excluded, keys sorted.
        body = {k: v for k, v in entry.items() if k != "previous_hash"}
        prev = hashlib.sha256(
            json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
        ).hexdigest()
print(f"verified {lineno} entries")

Run verification on every archived segment before it leaves the host. Any verification failure is a security event and must be treated as a potential compromise — see ./troubleshooting.
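To see how a break surfaces, the sketch below builds a three-entry chain in memory using the canonical form from the reference verifier, mutates the second entry, and reports where verification fails. The helper names (`entry_hash`, `first_break`, `make_entry`) are hypothetical. The `start_hash` parameter is an assumption about how a segment that does not begin at the first entry might be verified; this page does not specify segment boundaries.

```python
import hashlib
import json

ZERO_HASH = "0" * 64


def entry_hash(entry: dict) -> str:
    """Hash an entry with previous_hash excluded and keys sorted."""
    body = {k: v for k, v in entry.items() if k != "previous_hash"}
    data = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(data).hexdigest()


def first_break(entries, start_hash=ZERO_HASH):
    """Return the 1-based index of the first chain break, or None if intact."""
    prev = start_hash
    for index, entry in enumerate(entries, start=1):
        if entry["previous_hash"] != prev:
            return index
        prev = entry_hash(entry)
    return None


def make_entry(prev_hash, timestamp, operation):
    # Hypothetical helper that builds test entries; not part of Craton HSM.
    return {"timestamp": timestamp, "session_handle": 1, "operation": operation,
            "key_id": None, "result": {"Success": None},
            "previous_hash": prev_hash}


chain = []
prev = ZERO_HASH
for ts, op in [(1, "Login"), (2, "Sign"), (3, "Logout")]:
    entry = make_entry(prev, ts, op)
    chain.append(entry)
    prev = entry_hash(entry)

assert first_break(chain) is None

tampered = [dict(e) for e in chain]
tampered[1]["operation"] = "Verify"  # mutate the second entry after writing
print(first_break(tampered))         # prints 3: the break shows at the next entry
```

Because each entry stores the hash of its predecessor, a mutation is reported at the entry that follows the altered one. Under this scheme, a change to the final entry of a segment is only detectable once a later entry exists to reference it.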

Dumping the audit log

The admin CLI dumps recent entries without touching the live log file:

craton-hsm-admin audit dump --last 100
craton-hsm-admin audit dump --last 50 --json

The --json form is suitable for piping into a SIEM ingest tool or an ad-hoc jq query.
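For ad-hoc triage without jq, a small filter can list failed operations from the dump. This is a sketch that assumes `--json` emits one JSON object per line with the fields documented under "Audit log format"; the exact output shape of the CLI is not confirmed here, and the script name in the usage line is hypothetical.

```python
import json
import sys


def failures(lines):
    """Yield (operation, CK_RV symbol) for each failed operation."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        entry = json.loads(line)
        code = entry["result"].get("Failure")
        if code is not None:
            yield entry["operation"], code


# Read from stdin only when input is piped in, so an interactive run
# does not sit waiting on the terminal.
if __name__ == "__main__" and not sys.stdin.isatty():
    for operation, code in failures(sys.stdin):
        print(f"{operation}\t{code}")
```

Possible usage, under the same assumption: `craton-hsm-admin audit dump --last 50 --json | python failed_ops.py`.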

Process logs

Process-level logs go to stdout/stderr and are controlled with RUST_LOG. Recommended production setting:

RUST_LOG=craton_hsm=info

Reserve debug for incident response; trace produces sensitive internals (key handles, session details) and is development-only.

Severity conventions:

Level	Used for
`error`	POST failure, DRBG health-test failure, audit-log open failure, object-store I/O failure.
`warn`	Memory-lock failure, near-exhaustion of AES-GCM nonce counter, degraded CRL state (enterprise add-on), TLS reload failure.
`info`	Startup, shutdown, successful self-test, token initialization.
`debug`	Session open/close, operation tracing — off by default.

Log shipping to a SIEM

The audit log is JSON Lines; any JSON-aware collector can ship it without transformation. The two deployment patterns:

Sidecar collector (Docker, Kubernetes). Mount the audit log directory read-only into a Fluent Bit / Vector / Filebeat container; forward to the SIEM over mTLS. Parse each line as JSON and preserve the previous_hash field — it is needed upstream for offline chain verification.
Host agent (systemd). Point the existing host agent (rsyslog with imfile, journald forwarding, Splunk forwarder) at the log path. Run it as a user with read-only access to the log file. Do not give the agent write permission, or the chain integrity can no longer be trusted end to end.

Forwarding produces a duplicate of the log; it does not replace the need to retain the originals. Never truncate or delete a segment until the SIEM has confirmed ingest and the segment's chain has been verified.

Alert triggers

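The following conditions warrant either paging an on-call operator or opening a ticket, as indicated in the Severity column. Most surface in process logs, the audit log, or both; chain breaks are reported by the SIEM verifier, and certificate expiry warnings come from external monitoring.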

Trigger	Source	Severity	Action
Power-on self-test failure at startup	Process log (`error`)	Page	All operations return `CKR_GENERAL_ERROR`; daemon is unusable. See ./troubleshooting.
DRBG health-test failure	Process log (`error`)	Page	Key generation cannot proceed; treat as potential key-quality compromise.
Audit-log chain break	SIEM verifier	Page	Potential tampering. Preserve forensic copies before any remediation.
Repeated `CKR_PIN_INCORRECT` from the same session	Audit log	Ticket	At threshold, `CKR_PIN_LOCKED` follows.
`CKR_PIN_LOCKED` event	Audit log	Ticket	SO intervention required.
Audit-log open failure	Process log (`error`)	Page	The daemon runs without audit; depending on policy this may require immediate shutdown.
AES-GCM nonce counter > 75%	Process log (`warn`)	Ticket	Schedule key rotation.
AES-GCM nonce counter > 95%	Process log (`warn`)	Page	Rotate immediately.
`mlock` / `VirtualLock` failure at startup	Process log (`warn`)	Ticket	Key material may be paged to swap; fix capabilities.
TLS certificate expiring in < 14 days	External monitoring	Ticket	Rotate per ./runbook.
TLS certificate expired	Process log on restart	Page	Clients cannot connect.
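The "Repeated `CKR_PIN_INCORRECT` from the same session" trigger can be evaluated over parsed audit entries. The sketch below is one possible rule, not the product's lockout logic: the function name, the threshold of 3, and the 300-second window are hypothetical values chosen for illustration and should be tuned to local policy.

```python
from collections import defaultdict


def pin_failure_alerts(entries, threshold=3, window_seconds=300):
    """Return session handles with at least `threshold` CKR_PIN_INCORRECT
    failures inside a sliding window of `window_seconds`."""
    recent = defaultdict(list)  # session_handle -> timestamps of recent failures
    flagged = []
    for entry in entries:
        if entry["result"].get("Failure") != "CKR_PIN_INCORRECT":
            continue
        handle, now = entry["session_handle"], entry["timestamp"]
        # Keep only failures that are still inside the window, then add this one.
        times = [t for t in recent[handle] if now - t < window_seconds]
        times.append(now)
        recent[handle] = times
        if len(times) >= threshold and handle not in flagged:
            flagged.append(handle)
    return flagged
```

Entries are expected in log order, as dictionaries with the `session_handle`, `timestamp`, and `result` fields from the audit log format.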

Metrics

The core daemon does not expose a Prometheus endpoint. Operators typically derive the following metrics from audit-log tailing:

Operation rate by operation and result.
Failed-login rate per session.
Key-generation rate, broken down by mechanism.
Error-code histogram (count by CK_RV symbol).

A minimal Prometheus exporter can be implemented as a sidecar that tails the JSON Lines log and increments counters. For deployments that need native metrics, see the enterprise observability add-on referenced in ../enterprise/certified.
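The counting half of such a sidecar might look like the sketch below, which derives the operation-rate and error-code counters from audit lines. The function name `derive_counters` is hypothetical, and the sketch deliberately stops short of exposing an HTTP endpoint; a real exporter would feed these counts into a metrics client library of the operator's choice.

```python
import json
from collections import Counter


def derive_counters(lines):
    """Count operations by (operation, outcome) and failures by CK_RV symbol."""
    ops = Counter()     # (operation, "success" | "failure") -> count
    errors = Counter()  # CK_RV symbol -> count
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        code = entry["result"].get("Failure")
        outcome = "failure" if code is not None else "success"
        ops[(entry["operation"], outcome)] += 1
        if code is not None:
            errors[code] += 1
    return ops, errors
```

Rates follow from sampling these counters over time. The key-generation breakdown by mechanism is not covered here, because the documented audit fields do not include the mechanism.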

Correlating audit and process logs

The audit log carries session_handle but not the originating client identity; process logs carry the client TLS identity (when mTLS is configured) but not the session handle. To correlate a failed sign to a specific client:

Find the failing audit entry; note its session_handle and timestamp.
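Find the process-log line for the C_OpenSession call that opened that session. It precedes the audit entry rather than sharing its timestamp, and session open/close lines are logged at debug, which is off by default.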
The C_OpenSession line carries the TLS peer identity.

Joining the two streams in the SIEM at ingest time simplifies this — add the TLS peer identity as a derived field on every audit line for the duration of each session.
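One way to approximate that join is a nearest-preceding-timestamp match, sketched below. It assumes the C_OpenSession process-log lines have already been parsed into `(timestamp, peer_identity)` pairs sorted by timestamp; that parsing step, the function name, and the `tls_peer` field are hypothetical. The match is a heuristic: it can pick the wrong peer when several sessions open close together, and it does not handle a session handle being reused after close.

```python
from bisect import bisect_right


def attach_peer_identity(audit_entries, session_opens):
    """Annotate audit entries with a TLS peer identity, joined by timestamp.

    session_opens: list of (timestamp, peer_identity), sorted by timestamp.
    """
    open_times = [ts for ts, _ in session_opens]
    peer_for_handle = {}  # session_handle -> peer chosen at first sighting
    annotated = []
    for entry in audit_entries:
        handle = entry["session_handle"]
        if handle not in peer_for_handle:
            # First sighting of this session: take the latest open at or before it.
            i = bisect_right(open_times, entry["timestamp"])
            peer_for_handle[handle] = session_opens[i - 1][1] if i else None
        annotated.append({**entry, "tls_peer": peer_for_handle[handle]})
    return annotated
```

Entries with no preceding session-open record get `tls_peer` set to `None`, which makes gaps in the process log visible instead of silently guessing.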