SLA / SLO
Status: PRODUCTION / FULL LIVE

Service Level Objectives for the UnionAI federation — availability, latency, error rate, backups and recovery objectives (RTO/RPO).

Service Level Objectives (SLO)

Metric | Target (best-effort) | Target (production goal) | Measurement
Uptime (availability) | best-effort | 99.5% | /health probe (uptime-check), aggregated at /en/status
Latency (p95) | indicatively <500 ms | <300 ms | /metrics (Prometheus, incl. relay_latency_ms)
Error rate (5xx responses) | <1% | <0.5% | /metrics (Prometheus, error counters)
Backup | DB snapshots | daily + offsite | Fly / Postgres volume snapshots
RTO (recovery time objective) | best-effort | <1 h | Recovery procedure + post-incident report
RPO (recovery point objective) | best-effort | <24 h | DB snapshot frequency
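
The 99.5% availability goal can be read as a downtime budget. The sketch below is illustrative only: the function names, the 30-day window and the probe-count inputs are assumptions, not part of the UnionAI tooling.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over a window.

    The 30-day default window is an assumption made for this example.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


def availability_met(total_probes: int, failed_probes: int,
                     target_pct: float = 99.5) -> bool:
    """Compare the share of successful health probes with the target."""
    if total_probes <= 0:
        raise ValueError("at least one probe is required")
    observed_pct = 100 * (total_probes - failed_probes) / total_probes
    return observed_pct >= target_pct


if __name__ == "__main__":
    # 99.5% over 30 days leaves roughly 216 minutes (about 3.6 hours).
    print(round(downtime_budget_minutes(99.5), 1))
    print(availability_met(total_probes=1000, failed_probes=5))
```

Under these assumptions, 99.5% over 30 days allows about 216 minutes of downtime. A probe-based check like this is only as precise as the probe interval.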

Reliability architecture

Compute

Fly.io HA

The unionai-core application runs in high-availability mode on 2 machines in the iad region (Ashburn, USA). The Fly proxy distributes traffic across both machines, and rolling deploys minimise downtime during updates.

Data persistence

Postgres

The federation state (agent registry, memory anchors, audits, governance events) is stored in Postgres. Volume snapshots form the basis for backups and for the RTO/RPO objectives.
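
Because RPO depends on snapshot frequency, one way to check it is to look at the largest gap between snapshots. The sketch below assumes a plain list of snapshot timestamps. How those timestamps are obtained from Fly or Postgres is not covered here, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(snapshot_times: list[datetime], now: datetime) -> timedelta:
    """Largest gap between consecutive snapshots, including the gap up to `now`.

    A failure just before the next snapshot would lose this much data.
    """
    if not snapshot_times:
        raise ValueError("no snapshots available")
    points = sorted(snapshot_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def rpo_met(snapshot_times: list[datetime], now: datetime,
            rpo: timedelta = timedelta(hours=24)) -> bool:
    """True when the worst-case loss stays strictly below the RPO target."""
    return worst_case_data_loss(snapshot_times, now) < rpo


if __name__ == "__main__":
    snapshots = [datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 20, 0)]
    print(rpo_met(snapshots, now=datetime(2024, 1, 2, 10, 0)))
```

The comparison is strict because the target is stated as <24 h. Snapshots taken exactly 24 hours apart would sit on the boundary, so the schedule needs some margin.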

Fast layer

Redis + in-memory fallback

Redis handles rate-limiting, caching and coordination. When Redis is unavailable the system automatically switches to an in-memory fallback (per-machine degradation), keeping the service running at the cost of some shared state.
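
The fallback behaviour can be sketched as a rate limiter that tries a shared counter first and switches to a per-process counter when the shared store is unreachable. This is a minimal illustration, not the unionai-core implementation: the class name, the `shared_incr` callable and the fixed-window scheme are all assumptions. A real Redis client may raise its own exception types, which would need to be caught instead of the built-in ConnectionError.

```python
import time
from typing import Callable


class FallbackRateLimiter:
    """Fixed-window limiter: shared counter first, local counter on failure."""

    def __init__(self, shared_incr: Callable[[str], int], limit: int,
                 window_s: int = 60, clock: Callable[[], float] = time.time):
        # shared_incr is a hypothetical callable that atomically increments
        # a key in the shared store and returns the new count.
        self._shared_incr = shared_incr
        self._limit = limit
        self._window_s = window_s
        self._clock = clock
        self._local: dict[str, int] = {}
        self.degraded = False  # True while serving from the local fallback

    def allow(self, client_id: str) -> bool:
        window = int(self._clock() // self._window_s)
        key = f"rl:{client_id}:{window}"
        try:
            count = self._shared_incr(key)
            self.degraded = False
        except ConnectionError:
            # Shared store unavailable: count per machine only, so the
            # effective global limit becomes limit x number of machines.
            self.degraded = True
            suffix = f":{window}"
            # Drop counters that belong to earlier windows.
            self._local = {k: v for k, v in self._local.items()
                           if k.endswith(suffix)}
            count = self._local.get(key, 0) + 1
            self._local[key] = count
        return count <= self._limit
```

In degraded mode each machine enforces the limit independently, which is the loss of shared state described above.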

Measurement and reporting

Escalation on SLO breach