High availability

A single toggle switches the whole platform between standalone (one replica per component) and high availability (multiple replicas, disruption budgets, and clustered databases). You don't tune each component by hand — the chart derives sensible replica counts from one decision.

Who this is for

Platform engineers sizing the platform for production. Standalone is fine for pilots and single teams; HA is for production traffic and many teams.

The single toggle

yaml

global:
  highAvailability: true

false (default) → standalone: one replica per component, no PodDisruptionBudgets, no anti-affinity. Lowest footprint; suitable for pilots and non-critical use.
true → high availability: multiple replicas, PodDisruptionBudgets, and soft pod anti-affinity so replicas spread across nodes.

What HA changes

When global.highAvailability is true, each component's replica count derives automatically:

Component	Standalone	High availability
Gateway (data plane)	1	2+
Control plane	1	2 (leader-elected for background jobs)
Web console	1	2
SSO (oauth2-proxy)	1	2
Keycloak	1	2
PostgreSQL (control plane)	1 instance	3-instance cluster with failover
Keycloak PostgreSQL	1 instance	3-instance cluster
Redis	1	3 (replication + Sentinel)
Observability	1	2+ (object storage backend)

You can still override any single component's count by setting its replicas value explicitly (e.g. gateway.replicas: 3); leave it unset to follow the global toggle.

Disruption budgets and spreading

In HA mode the chart renders PodDisruptionBudgets for the disruption-sensitive components and applies soft anti-affinity by hostname, so a node drain or rolling upgrade won't take all replicas of a component down at once.

Background jobs are leader-elected

The control plane runs periodic tasks — reconcile, price sync, audit pruning. With multiple replicas these are guarded by a single-holder lock in PostgreSQL, so they run once cluster-wide. Write-driven reconciles are idempotent and need no lock.

Stateful components in HA

PostgreSQL runs as a 3-instance CloudNativePG cluster with streaming replication and automatic failover. Pair this with Backup & DR.
Redis runs with replication and Sentinel for automatic master failover.
Observability uses an object-storage backend (observability.storage: object) in HA rather than local volumes — see Platform observability.

Sizing

HA roughly doubles to triples the standalone footprint. Confirm your cluster has the headroom from Requirements before enabling it, and remember that turning on the semantic features adds Qdrant and Ollama on top.

Next steps

Backup & DR — protect the PostgreSQL source of truth.
Platform observability — storage backend and retention in HA.
Upgrades — rolling upgrades with disruption budgets.

High availability ​

The single toggle ​

What HA changes ​

Disruption budgets and spreading ​

Stateful components in HA ​

Sizing ​