Service Level Objectives (SLOs)
| Metric | Target (best-effort) | Target (production goal) | Measurement |
|---|---|---|---|
| Uptime (availability) | best-effort | 99.5% | /health probe (uptime-check), aggregated at /en/status |
| Latency (p95) | <500 ms (indicative) | <300 ms | /metrics (Prometheus, incl. relay_latency_ms) |
| Error rate (5xx responses) | <1% | <0.5% | /metrics (Prometheus, error counters) |
| Backup | DB snapshots | daily + offsite | Fly / Postgres volume snapshots |
| RTO (recovery time objective) | best-effort | <1 h | Recovery procedure + post-incident report |
| RPO (recovery point objective) | best-effort | <24 h | DB snapshot frequency |
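To make the percentage targets concrete, the following minimal sketch converts them into budgets. The function names and the 30-day window are assumptions chosen for illustration; the table does not state a measurement window.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Downtime budget, in minutes, for an availability target over a window.

    window_days is an assumption; the SLO table does not define the window.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def allowed_errors(error_rate_pct: float, total_requests: int) -> int:
    """Number of 5xx responses that still fits an error-rate target."""
    return int(total_requests * error_rate_pct / 100)


if __name__ == "__main__":
    # Production goal from the table: 99.5% uptime, <0.5% 5xx responses.
    print(round(allowed_downtime_minutes(99.5), 1), "minutes of downtime per 30 days")
    print(allowed_errors(0.5, 1_000_000), "failed requests per million")
```

Under these assumptions, the 99.5% production goal allows roughly 216 minutes (3.6 hours) of downtime per 30 days, and the 0.5% error-rate goal allows 5,000 failed requests per million.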
Reliability architecture
Compute
Fly.io HA
The unionai-core application runs in high-availability mode on 2 machines in the iad region (Ashburn, USA). The Fly proxy distributes traffic across them, and rolling deploys minimise downtime during updates.
Data persistence
Postgres
The federation state (agent register, memory anchors, audits, governance events) is stored in Postgres. Volume snapshots form the basis for backups and RTO/RPO objectives.
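Because the RPO depends on snapshot frequency, the worst-case data loss equals the largest gap between consecutive snapshots. The sketch below checks a list of snapshot timestamps against that idea. The function name and the sample timestamps are hypothetical, and how snapshot times are obtained from the platform is not covered here.

```python
from datetime import datetime, timedelta
from typing import List


def worst_case_rpo(snapshots: List[datetime], now: datetime) -> timedelta:
    """Largest gap between consecutive snapshots, including the gap up to `now`."""
    if not snapshots:
        raise ValueError("no snapshots: RPO is unbounded")
    points = sorted(snapshots) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


if __name__ == "__main__":
    # Hypothetical snapshot history, for illustration only.
    history = [
        datetime(2024, 1, 1, 0, 0),
        datetime(2024, 1, 2, 0, 0),
        datetime(2024, 1, 3, 6, 0),  # this snapshot ran six hours late
    ]
    gap = worst_case_rpo(history, now=datetime(2024, 1, 3, 12, 0))
    print("worst-case RPO:", gap, "| meets <24 h:", gap < timedelta(hours=24))
```

Note that a strictly daily cadence sits right at the 24-hour boundary, so any late snapshot pushes the worst case past a <24 h target.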
Fast layer
Redis + in-memory fallback
Redis handles rate-limiting, caching and coordination. When Redis is unavailable the system automatically switches to an in-memory fallback (per-machine degradation), keeping the service running at the cost of some shared state.
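The fallback behaviour can be illustrated with a minimal rate-limiter sketch. This is not the unionai-core implementation: the class name, the `shared_incr` callable standing in for a Redis counter increment, and the use of the built-in `ConnectionError` are all assumptions. A real Redis client may raise its own exception type, which would need to be caught instead.

```python
import time
from typing import Callable, Dict


class FallbackRateLimiter:
    """Fixed-window limiter that prefers a shared counter and degrades to local state."""

    def __init__(
        self,
        shared_incr: Callable[[str], int],  # hypothetical: increments a shared counter, returns the new value
        limit: int,
        window_s: int = 60,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.shared_incr = shared_incr
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.local: Dict[str, int] = {}  # per-process counters used only while degraded
        self.degraded = False

    def allow(self, client_id: str) -> bool:
        window = int(self.clock() // self.window_s)
        key = f"{client_id}:{window}"
        try:
            count = self.shared_incr(key)
            self.degraded = False
        except ConnectionError:
            # Shared store unreachable: count locally on this machine only.
            self.degraded = True
            # Drop counters from earlier windows so memory does not grow.
            suffix = f":{window}"
            self.local = {k: v for k, v in self.local.items() if k.endswith(suffix)}
            count = self.local.get(key, 0) + 1
            self.local[key] = count
        return count <= self.limit
```

While degraded, each machine counts independently, so with 2 machines a client could be allowed up to twice the configured limit. That is the shared state given up in exchange for keeping the service running.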