A TAK Server is the heart of a tactical common operating picture. Every Cursor on Target (CoT) event – every position report, every contact, every overlay – flows through it. When that server goes down, the shared picture freezes for every connected operator simultaneously. In garrison that is an inconvenience; in an operation it is a loss of situational awareness at the worst possible moment. High availability is therefore not a luxury feature of a serious TAK deployment – it is the difference between infrastructure you can put behind a mission and infrastructure you cannot. This article examines how to run TAK Server with no single point of failure: clustering the application tier, replicating the database, balancing client load, and engineering failover that completes in seconds with no operator action.
Why a single TAK Server node is a single point of failure
A default TAK Server installation is a single Java application backed by a single PostgreSQL/PostGIS database on a single host. It works, it is simple to stand up, and for a small unit it is entirely adequate. But that simplicity hides four distinct failure domains, any one of which takes down the whole picture: the application process can crash or run out of heap; the host can lose power, network, or disk; the database can corrupt, fill its volume, or deadlock; and the network path between clients and server can be severed. In a single-node design these are not independent – the loss of any one of them is total.
High availability means engineering each of those domains so that no single failure is fatal. The application tier becomes a cluster of interchangeable nodes. The database becomes a replicated primary-plus-standby set with automatic promotion. The client connection becomes a balanced, health-checked virtual endpoint rather than a hardcoded host. The result is a system where a node can be lost – to a crash, a patch reboot, or a destroyed server room – and the operators keep their picture.
TAK Server cluster mode: the messaging plane
The foundation of TAK Server high availability is cluster mode. In a single-node deployment, CoT events are routed through in-process queues – the messaging plane lives inside the one running application. In cluster mode, that plane is externalized onto a shared message fabric so that multiple TAK Server nodes can cooperate as one logical server. A CoT event published by a client connected to node A is distributed across the fabric and delivered to clients connected to node B without those clients ever knowing they are on different machines.
This decoupling is what makes the application tier horizontally scalable and fault-tolerant at the same time. Adding capacity for more clients or higher track density becomes a matter of adding nodes, the same scaling lever covered in TAK Server performance tuning. Surviving a node loss becomes automatic, because every node holds the same view of the shared subscription state. The nodes are stateless with respect to the picture – all durable state lives in the shared database and the message fabric, not in any single node's memory.
Stateless nodes, shared state
The cardinal design rule of a TAK Server cluster is that the application nodes must be stateless. Anything an operator would lose if a node disappeared must live outside the node: persisted CoT in the database, group and mission state in the database, and live subscription routing in the message fabric. When this rule holds, a node failure is a non-event for the data – the only cost is that the clients on the dead node must reconnect, and that cost is bounded by how quickly the load balancer and the clients notice the failure.
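The stateless-node rule can be illustrated with a toy model. The sketch below is not TAK Server code: `Fabric` and `Node` are hypothetical in-memory stand-ins for the shared message fabric and an application node, used only to show that a node which routes everything through the fabric holds nothing that is lost when it disappears.

```python
from collections import defaultdict


class Fabric:
    """Hypothetical in-memory stand-in for the shared message fabric."""

    def __init__(self):
        self.nodes = []

    def join(self, node):
        self.nodes.append(node)

    def publish(self, event, origin):
        # Fan the event out to every node, including the one that received it.
        for node in self.nodes:
            node.deliver(event, origin)


class Node:
    """A stateless application node: it holds client sockets, not the picture."""

    def __init__(self, name, fabric):
        self.name = name
        self.fabric = fabric
        self.clients = defaultdict(list)  # client id -> events delivered to it
        fabric.join(self)

    def connect(self, client_id):
        self.clients[client_id]  # register the client with an empty inbox

    def on_client_event(self, client_id, event):
        # Nothing is routed locally; every event goes through the fabric.
        self.fabric.publish(event, origin=client_id)

    def deliver(self, event, origin):
        for client_id, inbox in self.clients.items():
            if client_id != origin:  # do not echo the event back to its sender
                inbox.append(event)


fabric = Fabric()
node_a, node_b = Node("A", fabric), Node("B", fabric)
node_a.connect("alpha")
node_b.connect("bravo")
node_a.on_client_event("alpha", "position-report")
print(node_b.clients["bravo"])  # ['position-report']
```

A client on node B receives an event published through node A without either client knowing which node the other is attached to, which is the property the cluster depends on.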
Database high availability: the real bottleneck
Once the application tier is clustered, the PostgreSQL/PostGIS database becomes the dominant single point of failure – every clustered node depends on the same database, so an unprotected database undoes all the application-tier resilience. Database high availability rests on three components working together.
Streaming replication. A primary node accepts all writes and continuously ships its write-ahead log (WAL) to one or more standby nodes that replay it to stay current. A synchronous standby must acknowledge each commit before the primary reports success, so no acknowledged write is lost if the primary fails; the cost is roughly one network round trip of added latency per commit. An asynchronous standby adds no commit latency but trails the primary by a small, load-dependent lag, and whatever falls inside that lag can be lost on failover. A robust design uses at least one synchronous standby to meet the recovery point objective and additional asynchronous standbys for read scaling and disaster recovery.
Automatic failover. Replication alone does not fail over – something must detect that the primary is dead and promote a standby. Patroni, running as an agent on each database node, performs this role. It uses a distributed configuration store – etcd or Consul, deployed as an odd-numbered quorum of three or five members – to hold a leader lock. If the primary stops renewing its lock and the lock expires, the Patroni agents on the standbys compete for it, and the most up-to-date healthy standby is promoted to primary, typically within 5 to 15 seconds when the lock lifetime and polling intervals are tuned for fast detection.
Connection routing. TAK Server nodes must always connect to whichever database is currently the primary, without being reconfigured. A connection router in front of the database ports handles this: typically HAProxy, which health-checks the Patroni REST endpoint on each database node and sends write traffic only to the node that reports itself as leader, optionally paired with PgBouncer for connection pooling. From the TAK Server's perspective the database is a single stable virtual IP; the promotion of a standby behind that IP is invisible apart from a brief interruption while connections are re-established.
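The routing decision described above can be sketched in a few lines. This is an illustration of the idea, not a replacement for HAProxy: it assumes Patroni's REST API is reachable on each database host at port 8008 and that a `GET /primary` request returns HTTP 200 only on the current leader. The host names and the `find_primary` helper are hypothetical.

```python
import urllib.error
import urllib.request


def is_primary(host, port=8008, timeout=1.0):
    """Return True if the Patroni agent on this host reports itself as primary.

    Assumes Patroni's REST API answers GET /primary with 200 on the leader
    and a non-200 status elsewhere; the port is an assumption.
    """
    url = f"http://{host}:{port}/primary"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # A non-200 reply, a refused connection and a timeout all mean "not primary".
        return False


def find_primary(hosts, probe=is_primary):
    """Return the single host that claims leadership, or None.

    Zero leaders (mid-failover) and more than one leader (which should never
    happen) are both treated as "no safe write target".
    """
    leaders = [h for h in hosts if probe(h)]
    return leaders[0] if len(leaders) == 1 else None


# Example with a stubbed probe, so no network access is needed:
print(find_primary(["db1", "db2", "db3"], probe=lambda h: h == "db2"))  # db2
```

Refusing to pick a target when more than one node claims leadership is a deliberately conservative choice: a short write outage is preferable to writing to the wrong node.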
Recovery objectives drive the design
Two numbers govern the database tier. The recovery point objective (RPO) is how much data you can afford to lose; for committed CoT persistence the answer for a tactical system is usually zero, which mandates a synchronous standby. The recovery time objective (RTO) is how long a failover may take; with Patroni and a healthy quorum this is the 5-to-15-second promotion window. Define both explicitly before provisioning – they dictate whether you can tolerate asynchronous-only replication and how aggressive the failure detection timers must be.
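Both objectives can be turned into rough arithmetic before any hardware is provisioned. The model below is a simplification and not Patroni's own formula: the parameter names are hypothetical, and the numbers are illustrative rather than recommended settings, so check the permitted timer ranges in the Patroni documentation before borrowing them.

```python
def worst_case_write_outage(lock_ttl, standby_poll, promote, router_interval):
    """Rough upper bound, in seconds, on how long writes are unavailable.

    lock_ttl        - the leader lock must expire before anyone may take over
    standby_poll    - a standby notices the free lock on its next polling loop
    promote         - time for the chosen standby to finish promotion
    router_interval - time for the connection router to see the new primary
    """
    return lock_ttl + standby_poll + promote + router_interval


def commits_at_risk(synchronous, lag_seconds, commits_per_second):
    """Committed writes that could be lost if the primary dies right now."""
    if synchronous:
        return 0  # every acknowledged commit already exists on the standby
    return lag_seconds * commits_per_second


# Illustrative numbers only.
print(worst_case_write_outage(lock_ttl=10, standby_poll=2, promote=2, router_interval=1))  # 15
print(commits_at_risk(synchronous=False, lag_seconds=0.5, commits_per_second=200))  # 100.0
print(commits_at_risk(synchronous=True, lag_seconds=0.5, commits_per_second=200))  # 0
```

The first function makes the trade-off visible: the lock lifetime usually dominates the budget, so shortening it buys faster failover at the price of more sensitivity to transient network stalls.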
Load balancing and client reconnection
The client-facing tier ties the cluster together. ATAK and other TAK clients hold persistent, mutually authenticated TLS connections to the server. To make those connections survive a node loss, they must terminate on a virtual endpoint rather than a specific host.
A layer-4 load balancer – HAProxy, NGINX stream mode, or a cloud L4 balancer – presents a single virtual IP for the TLS client ports and distributes new connections across healthy TAK Server nodes using least-connections or round-robin. Active health checks against each node's status endpoint remove a failed node from rotation within a health-check interval. Critically, the balancer must operate at layer 4 and pass TLS through untouched, because TAK Server uses client-certificate mutual authentication that must terminate at the application node, not at the balancer.
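The selection logic of such a balancer is simple enough to sketch. The code below is a toy model of least-connections scheduling with health checks, not HAProxy or NGINX configuration; the `Backend` class and node names are hypothetical.

```python
class Backend:
    """One TAK Server node as the balancer sees it."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.active = 0  # currently open client connections


def run_health_checks(backends, check):
    """Update each backend's health using a caller-supplied check(name) -> bool."""
    for backend in backends:
        backend.healthy = check(backend.name)


def pick_backend(backends):
    """Least-connections choice among healthy backends only."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy TAK Server node available")
    return min(candidates, key=lambda b: b.active)


nodes = [Backend("tak-1"), Backend("tak-2"), Backend("tak-3")]
for _ in range(6):
    pick_backend(nodes).active += 1  # six clients connect
print([b.active for b in nodes])  # [2, 2, 2]

# tak-2 fails its health check and leaves rotation.
run_health_checks(nodes, check=lambda name: name != "tak-2")
print(pick_backend(nodes).name)  # tak-1
```

Note that the balancer never inspects the traffic itself. It only decides which node receives the TCP connection, which is what allows the TLS session and client certificate to pass through to the application node untouched.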
When a node fails, its clients' sockets drop. Each client detects the dead socket either immediately, when the node's operating system resets the connection, or through TCP keepalive when the host disappears without closing anything, and then reconnects to the virtual IP, where the balancer routes it to a surviving node. Because every node shares the same database and message fabric, the reconnected client immediately resynchronizes to the current picture. The perceived outage is governed mainly by the client keepalive interval and the balancer health-check cadence – tune both down to a few seconds and the operator sees a momentary "reconnecting" state, not a loss of awareness.
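The detection delay can be estimated from the keepalive settings. The sketch below assumes Linux-style TCP keepalive semantics, where a silent peer is declared dead after the idle time plus one probe interval per unanswered probe. The `perceived_outage` model and all numbers are illustrative assumptions, and whether a given TAK client exposes these socket options is not something this sketch establishes.

```python
import socket


def keepalive_detection_seconds(idle, interval, probes):
    """Time for TCP keepalive to declare a silent peer dead."""
    return idle + interval * probes


def perceived_outage(detection, balancer_removal, reconnect):
    """Simple model: the client cannot land on a good node until it has noticed
    the failure AND the balancer has removed the dead node from rotation."""
    return max(detection, balancer_removal) + reconnect


def enable_keepalive(sock, idle=3, interval=1, probes=2):
    """Turn on keepalive and, where the platform exposes them, tune the timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (("TCP_KEEPIDLE", idle), ("TCP_KEEPINTVL", interval), ("TCP_KEEPCNT", probes))
    for name, value in tuning:
        option = getattr(socket, name, None)  # not every platform defines all three
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


detect = keepalive_detection_seconds(idle=3, interval=1, probes=2)
print(detect)  # 5
# Balancer checks every 2 s and needs 2 failures; TLS reconnect takes about 1 s.
print(perceived_outage(detect, balancer_removal=4, reconnect=1))  # 6
```

Because the two detection paths run in parallel, shortening only one of them has no effect once it is already the faster of the two.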
Capacity planning matters here too. Size the cluster so that the loss of one node still leaves enough headroom to absorb the entire client population – the N+1 rule. If two nodes each run near saturation and one dies, every dropped client reconnects to the survivor and overwhelms it, turning a single-node failure into a cascade. Budget each node to carry its steady-state share plus a full failover share, and verify under load that a survivor can take the surge without exhausting heap or connection limits.
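The N+1 rule reduces to a one-line check. The function names and the client and capacity figures below are hypothetical; per-node capacity in particular has to come from load testing, not from this arithmetic.

```python
def survives_node_loss(total_clients, nodes, per_node_capacity, failures=1):
    """True if the surviving nodes can carry the entire client population."""
    survivors = nodes - failures
    if survivors <= 0:
        return False
    return total_clients <= survivors * per_node_capacity


def max_steady_state_utilization(nodes, failures=1):
    """Highest average per-node utilization that still leaves failover headroom."""
    return (nodes - failures) / nodes


# Two nodes rated for 1000 clients each, carrying 1800 clients in total:
print(survives_node_loss(1800, nodes=2, per_node_capacity=1000))  # False
# A third node restores the headroom:
print(survives_node_loss(1800, nodes=3, per_node_capacity=1000))  # True
print(max_steady_state_utilization(nodes=2))  # 0.5
```

The second function shows why small clusters are expensive to protect: a two-node cluster can only be run at half capacity, while each added node raises the safe utilization ceiling.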
Zero-downtime maintenance
The same machinery that survives unplanned failure also enables planned maintenance without an outage. To patch or upgrade a node, drain it: instruct the load balancer to stop sending new connections, let existing clients age out or reconnect elsewhere, then take the node down, update it, and return it to rotation. Rolling through the cluster one node at a time keeps the service continuously available. Minor-version database upgrades follow the same logic in reverse – upgrade standbys first, then perform a controlled switchover that promotes an already-upgraded standby, so the primary is never the node under the wrench. Major-version upgrades need a separate plan, because streaming replication requires the primary and its standbys to run the same major version. For everything else, this turns a maintenance window from a scheduled outage into a transparent rolling operation.
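The drain-and-roll procedure is easy to get wrong by hand, so it is worth expressing as a loop with an explicit stop condition. In the sketch below the four callables are hypothetical hooks that an operator would wire to the load balancer and the deployment tooling; nothing here calls a real TAK Server or balancer API.

```python
def rolling_upgrade(nodes, drain, upgrade, is_healthy, restore):
    """Upgrade nodes one at a time, halting at the first node that fails to recover.

    Returns the list of nodes that completed the cycle.
    """
    completed = []
    for node in nodes:
        drain(node)      # balancer stops sending new connections to this node
        upgrade(node)    # patch and restart while the other nodes carry the load
        if not is_healthy(node):
            raise RuntimeError(f"{node} unhealthy after upgrade; halting rollout")
        restore(node)    # back into rotation before the next node is touched
        completed.append(node)
    return completed


def db_upgrade_order(primary, standbys):
    """Standbys first; the old primary goes last, after a controlled switchover."""
    return list(standbys) + [primary]


log = []
rolling_upgrade(
    ["tak-1", "tak-2"],
    drain=lambda n: log.append(("drain", n)),
    upgrade=lambda n: log.append(("upgrade", n)),
    is_healthy=lambda n: True,
    restore=lambda n: log.append(("restore", n)),
)
print(log[:3])  # [('drain', 'tak-1'), ('upgrade', 'tak-1'), ('restore', 'tak-1')]
print(db_upgrade_order("db1", ["db2", "db3"]))  # ['db2', 'db3', 'db1']
```

Halting on the first unhealthy node matters more than it looks: continuing the loop after a bad upgrade removes a second node from rotation while the first is still out, which is exactly the capacity cascade the N+1 rule is meant to prevent.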
Key insight: In a properly clustered TAK deployment the application nodes are almost never the limiting factor for failover speed – surviving nodes already hold the full picture, so reconnection is near-instant. The real failover budget is spent on database promotion. If your RTO is missed, look at Patroni timers, the quorum store latency, and the synchronous-standby commit path before you add more application nodes.
Geographic redundancy and federation
Surviving a node loss is one thing; surviving the loss of an entire site is another. The instinct is to stretch a single cluster across two data centers, but the database write path makes a true active-active multi-region cluster impractical: synchronous replication across a high-latency link adds that round trip to every committed write, and asynchronous-only replication across regions reintroduces data-loss risk on failover.
The practical pattern is active-passive geographic redundancy. A full cluster – clustered application nodes, synchronous local standbys, local quorum – runs in the primary site. Asynchronous streaming replication feeds a warm standby cluster in a second site that can be promoted if the primary site is lost entirely. This bounds the cross-site RPO to the asynchronous replication lag while keeping the in-site write path fast.
Where independent sites need to share a picture without sharing a database, federation is the right tool rather than clustering. Federation links separate TAK Server clusters and exchanges CoT between them under policy – the mechanism covered in TAK Server federation setup. Clustering gives you a single resilient server; federation connects multiple resilient servers across organizational and geographic boundaries. A mature deployment uses both: each command runs a clustered, highly available TAK Server, and federation stitches those clusters into a theater-wide picture.
Operational discipline: testing failure
A high-availability design that has never failed over in anger is a hypothesis, not a capability. The single most important operational practice is to deliberately and repeatedly induce failure: kill an application node under load and measure client reconnect time; kill the database primary and confirm Patroni promotes a standby within RTO with zero loss of persisted CoT; partition the quorum store and verify the cluster refuses to split-brain. Run these drills against realistic client counts and track density, not an empty server. The objective is sub-30-second failover with no operator action – and the only way to trust that number is to have produced it under load, on purpose, many times.
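Drill results are only useful if they are recorded and judged the same way each time. The helper below is one possible way to do that, assuming each drill run yields a measured failover time in seconds; the function name, the sample figures and the choice to judge on the worst run instead of the average are all assumptions of this sketch.

```python
import statistics


def evaluate_drill(failover_seconds, rto_seconds=30.0):
    """Summarize repeated failover drills against the recovery time objective.

    The drill passes only if the slowest run beats the RTO, because an
    operator experiences the worst case, not the average.
    """
    if not failover_seconds:
        raise ValueError("at least one drill run is required")
    worst = max(failover_seconds)
    return {
        "runs": len(failover_seconds),
        "median": statistics.median(failover_seconds),
        "worst": worst,
        "passed": worst < rto_seconds,
    }


print(evaluate_drill([4.0, 6.5, 12.0]))
# {'runs': 3, 'median': 6.5, 'worst': 12.0, 'passed': True}
print(evaluate_drill([4.0, 31.0])["passed"])  # False
```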
Deploy TAK infrastructure built to stay up
TAKpilot packages clustered, replicated TAK Server with load balancing and automatic failover – engineered for tactical tempo where the picture cannot go dark. Zero-downtime upgrades, monitored health, and federation-ready in a single deployable package.
This analysis was prepared by Corvus Intelligence engineers who build mission-critical ISR and field applications for defense and government organizations.