MSP for specialized stacks: what the generic shops miss

When your production systems include Postgres clusters, HSMs, and embedded gateways, the average MSP runbook is a liability, not an asset.

by Craton Engineering

Generic MSP offerings are built for the 80% case: Windows servers, Office 365, endpoint patching, AD, a helpdesk queue. That is a real market, and there are competent providers in it. It is also not a good fit for a production software stack that includes Postgres clusters replicating across three availability zones, an HSM handling cryptographic operations at line rate, and embedded gateways deployed on customer premises.

The mismatch

  • Alert coverage. A generic MSP monitors "is the server up." A production MSP monitors query latency P95, replication lag, HSM operation success rate, certificate expiry on every edge, and trace latency across the service mesh. The generic set will not page when the queries get 40% slower — but that is exactly when you need the page.
  • Change rhythm. A generic MSP's patching cadence is "Tuesdays." A production MSP's change cadence is dependency-driven: CVE comes out, assess impact, stage on canary, roll with regression tests. Sometimes that is the same day; sometimes it is six weeks of validation.
  • Post-mortems. A generic MSP delivers a ticket close-out. A production MSP delivers a written blameless post-mortem with a timeline, contributing factors, corrective actions with owners, and a scheduled review. The evidence artifact is the deliverable.
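The difference in alert coverage is concrete enough to show. A sketch of what the production-grade set looks like as Prometheus alerting rules; the metric names and thresholds here are illustrative placeholders (`probe_ssl_earliest_cert_expiry` is a real blackbox exporter metric, the others depend on your exporters), not a drop-in config:

```yaml
groups:
  - name: production-signals
    rules:
      - alert: QueryLatencyP95Degraded
        # Placeholder histogram metric; substitute your exporter's bucket name.
        expr: histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket[5m])) by (le)) > 0.25
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "P95 query latency above 250 ms for 10 minutes"
      - alert: ReplicationLagHigh
        # Placeholder metric name; postgres exporters expose lag under varying names.
        expr: pg_replication_lag_seconds > 30
        for: 5m
        labels:
          severity: page
      - alert: CertificateExpirySoon
        # probe_ssl_earliest_cert_expiry is exported by the blackbox exporter.
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: ticket
```

None of these fire on "is the server up" — they fire on the degradation that precedes the outage, which is the point.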

What a specialized MSP looks like

We run managed services for customers whose stacks we wrote, or whose stacks look like ours. That is the filter: if your production services run on Rust, Postgres, Redis, Kafka, Kubernetes, Prometheus, OpenTelemetry, and Grafana, we know those tools in production, not in theory. If your stack is WordPress and Windows file shares, we will honestly refer you elsewhere.

Typical engagement shape:

  • 24×7 on-call for defined production environments. SLA: P1 response in 15 minutes, P2 in 1 hour, P3 in 4 hours during business hours.
  • Monitoring stack operated end-to-end: we run your Prometheus / Loki / Tempo, not a SaaS layer over it.
  • Monthly dependency rhythm: CVE assessment, patching, regression testing. Written evidence delivered at the end of each month.
  • Quarterly DR exercise: full restore from backup, measured RPO/RTO, documented in a shareable report.
  • Per-environment pricing, not per-ticket. There is no incentive for us to create work.
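"Measured RPO/RTO" in a DR exercise reduces to interval arithmetic over three timestamps: last good backup, simulated failure, and service restored. A minimal sketch (the function name and the example timestamps are illustrative, not from any real drill):

```python
from datetime import datetime, timedelta

def measure_drill(last_backup: datetime, failure: datetime, restored: datetime):
    """Compute achieved RPO/RTO for a disaster-recovery exercise.

    RPO: worst-case data loss, the gap from the last good backup to the failure.
    RTO: outage duration, the gap from the failure to restored service.
    """
    rpo = failure - last_backup
    rto = restored - failure
    return rpo, rto

# Hypothetical drill: nightly backup at 02:00, failure injected at 09:30,
# full restore verified at 10:15.
rpo, rto = measure_drill(
    last_backup=datetime(2024, 3, 1, 2, 0),
    failure=datetime(2024, 3, 1, 9, 30),
    restored=datetime(2024, 3, 1, 10, 15),
)
print(rpo)  # 7:30:00 — up to 7.5 hours of data at risk
print(rto)  # 0:45:00 — 45 minutes to restore
```

The numbers go into the shareable report verbatim; if the achieved RPO is worse than what the backup schedule promises, that gap is a finding, not a footnote.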

The reference question

Ask every MSP: when was your last P1 incident, who paged, what broke, and what changed as a result? If the answer is a sales talk track, walk away. If the answer is a concrete story with names and commit hashes, you are in the right room.

If you are running a specialized production stack and evaluating a switch, we are happy to compare notes — even if the answer is "you already have the right partner."