SLA / SLO — UNIONAI

Service Level Objectives (SLO)

Metric	Target (best-effort)	Target (production goal)	Measurement
Uptime (availability)	best-effort	99.5%	`/health` probe (uptime-check), aggregated at /en/status
Latency (p95)	indicatively <500 ms	<300 ms	`/metrics` (Prometheus, incl. `relay_latency_ms`)
Error rate (5xx responses)	<1%	<0.5%	`/metrics` (Prometheus, error counters)
Backup	DB snapshots	daily + offsite	Fly / Postgres volume snapshots
RTO (recovery time objective)	best-effort	<1 h	Recovery procedure + post-incident report
RPO (recovery point objective)	best-effort	<24 h	DB snapshot frequency
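
To make the availability target concrete, the sketch below converts a percentage into a downtime budget. The 30-day window and the function name `allowed_downtime_minutes` are assumptions for illustration; the table above does not state the measurement window.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Return the downtime budget in minutes for an availability target.

    window_days is an assumed reporting window, not a value from the SLO table.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.5)
    print(f"99.5% over 30 days allows about {budget:.0f} minutes of downtime")
```

Under these assumptions a 99.5% target leaves roughly 216 minutes (3.6 hours) of downtime per 30 days, so only a few incidents that use the full <1 h RTO fit inside one window.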

Reliability architecture

Compute

Fly.io HA

The unionai-core application runs in high-availability mode on two machines in the iad region (Ashburn, USA). The Fly proxy distributes traffic across them, and rolling deploys minimise downtime during updates.

Data persistence

Postgres

The federation state (agent registry, memory anchors, audits, governance events) is stored in Postgres. Volume snapshots form the basis for backups and for the RTO/RPO objectives.

Fast layer

Redis + in-memory fallback

Redis handles rate limiting, caching and coordination. When Redis is unavailable, the system automatically switches to an in-memory fallback. Degradation is per-machine: each machine keeps its own state, so the service stays up at the cost of some shared state.
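
The fallback behaviour can be sketched as a rate limiter that tries a shared counter first and drops to a local one on failure. This is a minimal illustration, not the unionai-core implementation: `RedisUnavailable`, `FlakySharedCounter` and `RateLimiter` are hypothetical names, and the shared store is simulated in memory so the example runs without a Redis server.

```python
import time
from collections import defaultdict


class RedisUnavailable(Exception):
    """Hypothetical error raised when the shared store cannot be reached."""


class InMemoryCounter:
    """Per-process counter; its state is not shared across machines."""

    def __init__(self):
        self._counts = defaultdict(int)

    def incr(self, key: str) -> int:
        self._counts[key] += 1
        return self._counts[key]


class FlakySharedCounter(InMemoryCounter):
    """Stand-in for a Redis-backed counter that can be switched off."""

    def __init__(self):
        super().__init__()
        self.available = True

    def incr(self, key: str) -> int:
        if not self.available:
            raise RedisUnavailable("shared store unreachable")
        return super().incr(key)


class RateLimiter:
    """Fixed-window limiter that falls back to local state on store failure."""

    def __init__(self, shared, limit: int, window_s: int = 60, clock=time.time):
        self.shared = shared
        self.local = InMemoryCounter()
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.degraded = False

    def allow(self, client_id: str) -> bool:
        # One counter per client per time window (old windows are never
        # evicted here; a real limiter would expire them).
        window = int(self.clock() // self.window_s)
        key = f"{client_id}:{window}"
        try:
            count = self.shared.incr(key)
            self.degraded = False
        except RedisUnavailable:
            # Per-machine degradation: counts restart locally.
            self.degraded = True
            count = self.local.incr(key)
        return count <= self.limit
```

Note the trade-off this makes visible: after the switch the local counter starts from zero, so a client that had exhausted its shared quota is briefly allowed again on each machine. That is the "cost of some shared state" described above.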

Measurement and reporting

/health — health probe: status, region (iad), build_sha, DB and Redis state, release channel.
/metrics — export in Prometheus format: latency metrics, relay request/event counters, errors and timeouts, routing drift.
/en/status — status page aggregating SLOs and current service state.
/en/incidents — incident history and post-mortem reports.

SLO values are verifiable live: the /health probe and /metrics export are the source of truth for the /en/status page.
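
As an illustration of how the latency and error-rate targets could be checked, the sketch below evaluates already-collected numbers against the production goals from the SLO table (<300 ms p95, <0.5% 5xx). It does not read `/metrics` itself; the function names and the nearest-rank percentile method are assumptions, and a Prometheus setup would more typically derive p95 from histogram buckets.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latency samples."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def error_rate(total_requests: int, errors_5xx: int) -> float:
    """Fraction of requests answered with a 5xx status."""
    return errors_5xx / total_requests if total_requests else 0.0


def check_slo(samples_ms, total_requests, errors_5xx,
              p95_target_ms=300, error_target=0.005):
    """Return the names of breached objectives (empty list means all met)."""
    breaches = []
    # Targets are strict ("<"), so reaching the target value counts as a breach.
    if p95(samples_ms) >= p95_target_ms:
        breaches.append("latency_p95")
    if error_rate(total_requests, errors_5xx) >= error_target:
        breaches.append("error_rate_5xx")
    return breaches


if __name__ == "__main__":
    print(check_slo([120, 150, 180, 210, 450], total_requests=1000, errors_5xx=3))
```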

Escalation on SLO breach

Threshold breaches (availability drop, latency p95 above target, 5xx error rate above target, data loss beyond RPO) trigger incident registration and escalation. Every incident is recorded in the incident history with a severity class, status and post-closure report.

SEV-1 Critical — service unavailable or data loss; immediate response, recovery priority (RTO).
SEV-2 Serious — significant degradation (latency/error rate outside SLO) without full unavailability.
SEV-3 Minor — limited impact, workaround available; handled in planned mode.
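
The three severity classes can be read as a simple decision order, sketched below. The `Signal` fields and the `classify` function are hypothetical; the actual triage may weigh more factors than these three flags.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1 Critical"
    SEV2 = "SEV-2 Serious"
    SEV3 = "SEV-3 Minor"


@dataclass
class Signal:
    """Hypothetical summary of what monitoring observed during an incident."""
    service_available: bool
    data_loss: bool
    latency_or_errors_outside_slo: bool


def classify(signal: Signal) -> Severity:
    # Unavailability or data loss always wins: immediate response, RTO applies.
    if not signal.service_available or signal.data_loss:
        return Severity.SEV1
    # Service is up, but latency or error rate is outside the SLO.
    if signal.latency_or_errors_outside_slo:
        return Severity.SEV2
    # Limited impact with a workaround; handled in planned mode.
    return Severity.SEV3
```

For example, `classify(Signal(True, False, True))` yields `Severity.SEV2`: the service still responds, but it is outside its latency or error-rate objective.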
