Craton HSM
Monitoring and Audit
Craton HSM emits two primary observability signals: the tamper-evident audit log (JSON Lines, append-only, chained SHA-256) and structured process logs at RUST_LOG-controlled severity. There is no built-in Prometheus exporter in the core module as of v0.9.x; metrics are derived from the audit log and from process log lines. This page describes both signals, how to verify the chain, how to ship the log to a SIEM, and which events should page an operator.
Audit log format
The audit log lives at the path given by [audit].log_path (default craton_hsm_audit.jsonl). It is created with owner-only permissions (0600 on Unix, owner-only ACL on Windows) and must remain owned by the daemon's process user — a permission-denied failure on open is the most common audit-related startup error.
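Because a permission problem on the audit log is the most common startup failure, a pre-flight check can catch it before the daemon starts. The following is a minimal sketch for Unix hosts only; the function name and the idea of passing the daemon's UID explicitly are assumptions for illustration, not part of Craton HSM.

```python
import os
import stat


def audit_log_permission_problems(path: str, expected_uid: int) -> list[str]:
    """Return human-readable problems; an empty list means the file looks fine."""
    problems = []
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)
    # Any group or other permission bit violates the owner-only (0600) rule.
    if mode & 0o077:
        problems.append(f"mode is {mode:04o}, expected owner-only (0600)")
    if st.st_uid != expected_uid:
        problems.append(f"owner uid is {st.st_uid}, expected {expected_uid}")
    return problems
```

Calling `audit_log_permission_problems("craton_hsm_audit.jsonl", os.getuid())` from the daemon's own account should return an empty list on a healthy host.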
Each line is a single JSON object. The concrete fields are:
| Field | Type | Meaning |
|---|---|---|
| `timestamp` | integer | Unix epoch seconds. |
| `session_handle` | integer | The PKCS#11 session that performed the operation. |
| `operation` | string | The operation name (`Login`, `Logout`, `GenerateKey`, `Sign`, `Verify`, `Encrypt`, `Decrypt`, `DestroyObject`, etc.). |
| `key_id` | integer or null | The key handle involved, if any. |
| `result` | object | `{"Success": null}` or `{"Failure": "<CK_RV symbol>"}`. |
| `previous_hash` | string | Hex SHA-256 of the previous entry's canonical serialization. |
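To make the field table concrete, here is a small sketch that parses one audit line into a typed record. The `AuditEntry` class, the `parse_entry` helper, and the sample line are all made up for illustration; only the field names and the shape of `result` come from the table above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditEntry:
    timestamp: int
    session_handle: int
    operation: str
    key_id: Optional[int]
    ok: bool
    error: Optional[str]  # CK_RV symbol when the operation failed
    previous_hash: str


def parse_entry(line: str) -> AuditEntry:
    raw = json.loads(line)
    result = raw["result"]
    ok = "Success" in result
    return AuditEntry(
        timestamp=raw["timestamp"],
        session_handle=raw["session_handle"],
        operation=raw["operation"],
        key_id=raw.get("key_id"),
        ok=ok,
        error=None if ok else result["Failure"],
        previous_hash=raw["previous_hash"],
    )


# Made-up sample line, not real daemon output.
sample = (
    '{"timestamp": 1700000000, "session_handle": 7, "operation": "Sign", '
    '"key_id": 42, "result": {"Failure": "CKR_KEY_HANDLE_INVALID"}, '
    '"previous_hash": "' + "0" * 64 + '"}'
)
entry = parse_entry(sample)
print(entry.operation, entry.ok, entry.error)
```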
The log_level field in [audit] filters which operation classes are written:
| Level | Writes |
|---|---|
| `all` | Every operation. |
| `crypto` | Cryptographic operations only (`Sign`, `Verify`, `Encrypt`, `Decrypt`, `GenerateKey`, `Digest`). |
| `auth` | Login, Logout, PIN changes, lockouts. |
| `admin` | Object create/destroy, token init, PIN reset. |
| `none` | Nothing. Discouraged except in closed lab environments. |
For compliance purposes always use all. The overhead is dominated by fsync, not field count.
Chained-hash integrity
Each entry's previous_hash field is SHA-256(canonical_bytes(previous_entry)). The first entry's previous_hash is a well-known all-zero hash. A mismatch anywhere in the chain means the log was truncated, reordered, or mutated after writing.
To verify a segment:
- Read all entries in order.
- Compute `SHA-256(entry[n-1])` using the canonical serialization.
- Compare against `entry[n].previous_hash`.
- Report the line number and both hashes on mismatch.
A reference verifier in Python:
import hashlib
import json
import sys

prev = "0" * 64  # well-known all-zero hash expected by the first entry
lineno = 0       # stays 0 if the file is empty
with open(sys.argv[1], encoding="utf-8") as log:
    for lineno, line in enumerate(log, start=1):
        entry = json.loads(line)
        if entry["previous_hash"] != prev:
            # "expected" is the hash we computed; "got" is what the entry claims.
            sys.exit(f"chain break at line {lineno}: "
                     f"expected {prev}, got {entry['previous_hash']}")
        # Canonical form: previous_hash field excluded, keys sorted.
        body = {k: v for k, v in entry.items() if k != "previous_hash"}
        prev = hashlib.sha256(
            json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
        ).hexdigest()
print(f"verified {lineno} entries")
Run verification on every archived segment before it leaves the host. Any verification failure is a security event and must be treated as a potential compromise — see ./troubleshooting.
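An archived segment that is not the first one does not start from the all-zero hash, so a verifier needs the hash of the last entry of the preceding segment as its starting point. The sketch below shows one way to do that, and uses a synthetic chain to show where a mutation surfaces. It assumes the same canonical form as the reference verifier above; `first_break`, `build_chain`, and the sample entries are hypothetical names and data for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # all-zero hash carried by the very first entry


def entry_hash(entry: dict) -> str:
    """Hash an entry using the assumed canonical form (no previous_hash, sorted keys)."""
    body = {k: v for k, v in entry.items() if k != "previous_hash"}
    data = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(data).hexdigest()


def first_break(lines, start_hash=GENESIS):
    """Return the 1-based line number of the first chain break, or None."""
    prev = start_hash
    for lineno, line in enumerate(lines, start=1):
        entry = json.loads(line)
        if entry["previous_hash"] != prev:
            return lineno
        prev = entry_hash(entry)
    return None


def build_chain(bodies):
    """Build a synthetic chained log (demo helper only, not how the daemon writes)."""
    prev, out = GENESIS, []
    for body in bodies:
        entry = dict(body, previous_hash=prev)
        out.append(json.dumps(entry))
        prev = entry_hash(entry)
    return out


ok = {"Success": None}
chain = build_chain([
    {"timestamp": 1, "session_handle": 1, "operation": "Login", "key_id": None, "result": ok},
    {"timestamp": 2, "session_handle": 1, "operation": "Sign", "key_id": 9, "result": ok},
    {"timestamp": 3, "session_handle": 1, "operation": "Logout", "key_id": None, "result": ok},
])

# Verify a later segment by seeding it with the hash of the entry before it.
segment_start = entry_hash(json.loads(chain[0]))
segment_result = first_break(chain[1:], start_hash=segment_start)

# Mutate the second entry: the break is reported on the entry that follows it.
tampered = list(chain)
tampered[1] = tampered[1].replace('"Sign"', '"Verify"')
tamper_result = first_break(tampered)
```

Note that a mutated entry is detected at the following line, because that is where the stored `previous_hash` no longer matches. The last entry of a log therefore has no successor protecting it until the next entry is written.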
Dumping the audit log
The admin CLI dumps recent entries without touching the live log file:
craton-hsm-admin audit dump --last 100
craton-hsm-admin audit dump --last 50 --json
The --json form is suitable for piping into a SIEM ingest tool or an ad-hoc jq query.
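As an alternative to jq, a few lines of Python can filter the dump. This sketch assumes that `--json` emits one JSON object per line with the same fields as the audit log; that output shape is an assumption, as are the `failed_entries` helper and the sample lines.

```python
import json


def failed_entries(lines, operation=None):
    """Return parsed entries whose result is a Failure, optionally for one operation."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if "Failure" not in entry["result"]:
            continue
        if operation is None or entry["operation"] == operation:
            hits.append(entry)
    return hits


# Made-up lines standing in for dump output; previous_hash is a placeholder.
demo = [
    '{"timestamp": 10, "session_handle": 3, "operation": "Sign", "key_id": 5, '
    '"result": {"Success": null}, "previous_hash": "00"}',
    '{"timestamp": 11, "session_handle": 3, "operation": "Sign", "key_id": 6, '
    '"result": {"Failure": "CKR_KEY_HANDLE_INVALID"}, "previous_hash": "00"}',
    '{"timestamp": 12, "session_handle": 4, "operation": "Login", "key_id": null, '
    '"result": {"Failure": "CKR_PIN_INCORRECT"}, "previous_hash": "00"}',
]
failed_signs = failed_entries(demo, operation="Sign")
```

When piping from the CLI, pass `sys.stdin` as `lines` instead of the in-memory list.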
Process logs
Process-level logs go to stdout/stderr and are controlled with RUST_LOG. Recommended production setting:
RUST_LOG=craton_hsm=info
Reserve debug for incident response; trace exposes sensitive internals (key handles, session details) and is for development only.
Severity conventions:
| Level | Used for |
|---|---|
| `error` | POST failure, DRBG health-test failure, audit-log open failure, object-store I/O failure. |
| `warn` | Memory-lock failure, near-exhaustion of AES-GCM nonce counter, degraded CRL state (enterprise add-on), TLS reload failure. |
| `info` | Startup, shutdown, successful self-test, token initialization. |
| `debug` | Session open/close, operation tracing; off by default. |
Log shipping to a SIEM
The audit log is JSON Lines; any JSON-aware collector can ship it without transformation. The two deployment patterns:
- Sidecar collector (Docker, Kubernetes). Mount the audit log directory read-only into a Fluent Bit / Vector / Filebeat container; forward to the SIEM over mTLS. Parse each line as JSON and preserve the `previous_hash` field, which is needed upstream for offline chain verification.
- Host agent (systemd). Point the existing host agent (rsyslog with `imfile`, journald forwarding, Splunk forwarder) at the log path. Run it as a user with read-only access to the log file. Do not give the agent write permission, or the chain integrity can no longer be trusted end to end.
Forwarding produces a duplicate of the log; it does not replace the need to retain the originals. Never truncate or delete a segment until the SIEM has confirmed ingest and the segment's chain has been verified.
Alert triggers
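The following conditions warrant an alert: either a page to the on-call operator or a ticket, as the Severity column indicates. Most surface in process logs, the audit log, or both; the rest come from the SIEM verifier or external monitoring.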
| Trigger | Source | Severity | Action |
|---|---|---|---|
| Power-on self-test failure at startup | Process log (error) | Page | All operations return CKR_GENERAL_ERROR; daemon is unusable. See ./troubleshooting. |
| DRBG health-test failure | Process log (error) | Page | Key generation cannot proceed; treat as potential key-quality compromise. |
| Audit-log chain break | SIEM verifier | Page | Potential tampering. Preserve forensic copies before any remediation. |
| Repeated `CKR_PIN_INCORRECT` from the same session | Audit log | Ticket | At threshold, `CKR_PIN_LOCKED` follows. |
| `CKR_PIN_LOCKED` event | Audit log | Ticket | SO intervention required. |
| Audit-log open failure | Process log (error) | Page | The daemon runs without audit; depending on policy this may require immediate shutdown. |
| AES-GCM nonce counter > 75% | Process log (warn) | Ticket | Schedule key rotation. |
| AES-GCM nonce counter > 95% | Process log (warn) | Page | Rotate immediately. |
| `mlock` / `VirtualLock` failure at startup | Process log (warn) | Ticket | Key material may be paged to swap; fix capabilities. |
| TLS certificate expiring in < 14 days | External monitoring | Ticket | Rotate per ./runbook. |
| TLS certificate expired | Process log on restart | Page | Clients cannot connect. |
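The repeated `CKR_PIN_INCORRECT` trigger can be evaluated directly against audit lines. The sketch below is one possible detector; it assumes failed logins are recorded with `operation` set to `Login`, and the threshold of 3 is an arbitrary illustrative value, not the daemon's actual lockout threshold.

```python
import json
from collections import defaultdict


def pin_failure_alerts(lines, threshold=3):
    """Yield (session_handle, count) when a session reaches `threshold`
    consecutive CKR_PIN_INCORRECT results; a successful Login resets the streak."""
    streak = defaultdict(int)
    for line in lines:
        entry = json.loads(line)
        if entry["operation"] != "Login":
            continue
        session = entry["session_handle"]
        result = entry["result"]
        if result.get("Failure") == "CKR_PIN_INCORRECT":
            streak[session] += 1
            if streak[session] == threshold:
                yield session, threshold
        elif "Success" in result:
            streak[session] = 0


def _login(session, result):
    """Build a made-up Login line for the demo below."""
    return json.dumps({"timestamp": 0, "session_handle": session,
                       "operation": "Login", "key_id": None,
                       "result": result, "previous_hash": "00"})


bad = {"Failure": "CKR_PIN_INCORRECT"}
demo = [_login(5, bad), _login(6, bad), _login(5, bad),
        _login(6, {"Success": None}), _login(5, bad), _login(6, bad)]
alerts = list(pin_failure_alerts(demo))
```

Counting consecutive failures per session rather than total failures avoids alerting on a user who mistypes once between successful logins.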
Metrics
The core daemon does not expose a Prometheus endpoint. Operators typically derive the following metrics from audit-log tailing:
- Operation rate by `operation` and `result`.
- Failed-login rate per session.
- Key-generation rate, broken down by mechanism.
- Error-code histogram (count by `CK_RV` symbol).
A minimal Prometheus exporter can be implemented as a sidecar that tails the JSON Lines log and increments counters. For deployments that need native metrics, see the enterprise observability add-on referenced in ../enterprise/certified.
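The counting core of such a sidecar might look like the sketch below. It covers only the parse-and-count step and renders the result in the Prometheus text exposition format; following the file as it grows and serving the text over HTTP are left out. The metric name `craton_hsm_operations_total` and the label names are hypothetical choices, not names defined by Craton HSM.

```python
import json
from collections import Counter


def update_counters(counters: Counter, line: str) -> None:
    """Increment the (operation, outcome) counter for one audit line."""
    entry = json.loads(line)
    result = entry["result"]
    outcome = result["Failure"] if "Failure" in result else "Success"
    counters[(entry["operation"], outcome)] += 1


def render(counters: Counter) -> str:
    """Render counters as Prometheus text exposition lines."""
    out = ["# TYPE craton_hsm_operations_total counter"]
    for (operation, outcome), n in sorted(counters.items()):
        out.append(
            f'craton_hsm_operations_total{{operation="{operation}",'
            f'result="{outcome}"}} {n}'
        )
    return "\n".join(out) + "\n"


# Made-up audit lines for the demo.
demo = [
    '{"timestamp": 1, "session_handle": 1, "operation": "Sign", "key_id": 2, '
    '"result": {"Success": null}, "previous_hash": "00"}',
    '{"timestamp": 2, "session_handle": 1, "operation": "Sign", "key_id": 2, '
    '"result": {"Success": null}, "previous_hash": "00"}',
    '{"timestamp": 3, "session_handle": 2, "operation": "Login", "key_id": null, '
    '"result": {"Failure": "CKR_PIN_INCORRECT"}, "previous_hash": "00"}',
]
counters = Counter()
for line in demo:
    update_counters(counters, line)
text = render(counters)
```

Using the `CK_RV` symbol as the `result` label value gives the operation-rate and error-code-histogram metrics from a single counter family.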
Correlating audit and process logs
The audit log carries session_handle but not the originating client identity; process logs carry client TLS identity (when mTLS is configured) but not session handle. To correlate a failed sign to a specific client:
- Find the failing audit entry; note its `session_handle` and `timestamp`.
- Find the process-log line at the same timestamp for `C_OpenSession` on that handle.
- The `C_OpenSession` line carries the TLS peer identity.
Joining the two streams in the SIEM at ingest time simplifies this — add the TLS peer identity as a derived field on every audit line for the duration of each session.
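The join can be sketched as a lookup from session handle to peer identity. The process-log format is not specified on this page, so the sketch assumes the `C_OpenSession` lines have already been parsed into records with `timestamp`, `session_handle`, and `peer` fields; those field names and both helper functions are hypothetical. Because handles may be reused over time, it picks the most recent open at or before the audit entry's timestamp, which is also an assumption.

```python
from collections import defaultdict


def build_index(open_events):
    """Group session-open records by handle, ordered by timestamp."""
    index = defaultdict(list)
    for ev in sorted(open_events, key=lambda ev: ev["timestamp"]):
        index[ev["session_handle"]].append((ev["timestamp"], ev["peer"]))
    return index


def peer_for(index, audit_entry):
    """Return the peer of the latest open at or before the audit entry, else None."""
    peer = None
    for opened_at, candidate in index.get(audit_entry["session_handle"], []):
        if opened_at > audit_entry["timestamp"]:
            break
        peer = candidate
    return peer


# Made-up session-open records and audit entries.
opens = [
    {"timestamp": 200, "session_handle": 7, "peer": "CN=client-b"},
    {"timestamp": 100, "session_handle": 7, "peer": "CN=client-a"},
]
index = build_index(opens)
early = peer_for(index, {"session_handle": 7, "timestamp": 150})
late = peer_for(index, {"session_handle": 7, "timestamp": 250})
unknown = peer_for(index, {"session_handle": 9, "timestamp": 150})
```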