A freshly deployed CloudTAK instance on default configuration handles a small team of ATAK devices without issue. The problems appear gradually: CoT event delivery starts lagging when the 80th or 90th concurrent client connects, PostgreSQL connection pool errors surface in the logs around 150 clients, and at 300+ clients the server begins queuing events so aggressively that field units notice their COP is minutes behind reality. None of this is a fundamental limit of CloudTAK — it is a consequence of running an operationally scaled workload on development defaults. This guide covers the full tuning path: establishing a performance baseline, optimising PostgreSQL, rate-limiting CoT traffic, managing WebSocket connections, enabling spatial filtering, and scaling horizontally when a single instance is no longer enough.
Performance baseline: what 100, 500, and 1000 clients look like on default config
Before tuning anything, measure where you currently are. The CloudTAK admin metrics endpoint provides the most direct view of server health:
# Poll CloudTAK metrics every 5 seconds
watch -n5 'curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
https://tak.yourdomain.com:8443/api/admin/metrics | jq .'
Key fields to watch: ws_connections (active WebSocket clients), cot_queue_depth (events waiting to be persisted), db_pool_active / db_pool_waiting, and cot_latency_p99_ms.
What those numbers look like on default configuration (2 vCPU / 4 GB, DB_POOL_MAX=10) across three load levels:
| Clients | CPU | Memory | CoT p50 latency | CoT p99 latency | DB errors/min |
|---|---|---|---|---|---|
| 100 | 28% | 1.4 GB | 210 ms | 870 ms | 0 |
| 500 | 94% | 3.2 GB | 3,800 ms | 18,200 ms | 47 |
| 1000 | 100% (saturated) | 3.8 GB + swap | >30,000 ms | timeout | 300+ |
The 500-client row is the operational inflection point for most defense deployments. It is also the scenario where tuning delivers the greatest absolute improvement, and the remediation steps below are benchmarked against this profile. Network bandwidth at 500 clients on default config is approximately 340 Mbps outbound, because every CoT event fans out to every subscriber; this is a secondary bottleneck on constrained tactical links.
PostgreSQL tuning: PgBouncer, shared_buffers, work_mem, and autovacuum
PostgreSQL is the dominant bottleneck on most under-tuned CloudTAK deployments. Two separate problems combine: connection exhaustion (too many concurrent application connections for PostgreSQL's process-per-connection model) and slow queries (missing indexes, poorly tuned memory parameters, and autovacuum falling behind on the high-write tracks table).
PgBouncer connection pooling
Add PgBouncer as an intermediate service in your Docker Compose stack. Use transaction pooling mode — this allows a large number of short-lived CloudTAK connections to share a small pool of actual PostgreSQL backends:
pgbouncer:
image: bitnami/pgbouncer:latest
container_name: cloudtak-pgbouncer
restart: unless-stopped
environment:
POSTGRESQL_HOST: postgres
POSTGRESQL_PORT: 5432
POSTGRESQL_DATABASE: ${POSTGRES_DB}
POSTGRESQL_USERNAME: ${POSTGRES_USER}
POSTGRESQL_PASSWORD: ${POSTGRES_PASSWORD}
PGBOUNCER_DATABASE: ${POSTGRES_DB}
PGBOUNCER_POOL_MODE: transaction
PGBOUNCER_MAX_CLIENT_CONN: 500
PGBOUNCER_DEFAULT_POOL_SIZE: 25
PGBOUNCER_MIN_POOL_SIZE: 5
PGBOUNCER_RESERVE_POOL_SIZE: 5
PGBOUNCER_RESERVE_POOL_TIMEOUT: 5
networks:
- cloudtak-internal
depends_on:
- postgres
Update CloudTAK's DATABASE_URL to point at the pgbouncer service rather than directly at PostgreSQL. The Bitnami PgBouncer image listens on port 6432 by default (not 5432), so use that port unless you override it with PGBOUNCER_PORT. This single change typically eliminates all connection pool exhaustion errors and reduces PostgreSQL memory usage by 60–80% at 500+ clients.
PostgreSQL memory parameters
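Mount a custom postgresql.conf into the PostgreSQL container and tune these parameters for a 4–8 GB server. The values shown are sized for the 8 GB case; on a 4 GB server, halve shared_buffers and effective_cache_size to keep the same 25% and 75% ratios: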
# /opt/cloudtak/data/postgresql.conf — performance tuning block
# Memory — set shared_buffers to 25% of total server RAM
shared_buffers = 2GB # 25% of 8 GB server
effective_cache_size = 6GB # 75% of total RAM
work_mem = 8MB # per sort/hash operation
maintenance_work_mem = 256MB # for VACUUM, CREATE INDEX
# WAL and checkpoints — reduce I/O spikes
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 2GB
min_wal_size = 512MB
# Connection limits (backend processes managed via PgBouncer)
max_connections = 60 # PgBouncer backends + admin connections
# Parallel query — useful for large retention cleanup jobs
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
# Logging — capture slow queries for profiling
log_min_duration_statement = 500 # Log queries taking > 500ms
log_autovacuum_min_duration = 1000 # Log autovacuum runs > 1 second
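The ratios in the comments above can be turned into a small sizing helper. The following is a minimal sketch, not part of CloudTAK or PostgreSQL: the function name `suggest_pg_memory` and its return keys are hypothetical, while the 25% and 75% ratios and the `work_mem` and `max_connections` values are taken from the block above.

```python
def suggest_pg_memory(total_ram_gb: float, max_connections: int = 60,
                      work_mem_mb: int = 8) -> dict:
    """Apply the guide's rules of thumb to a server of the given size."""
    total_mb = int(total_ram_gb * 1024)
    return {
        "shared_buffers_mb": total_mb // 4,            # 25% of RAM
        "effective_cache_size_mb": total_mb * 3 // 4,  # 75% of RAM
        # One work_mem allocation per backend; a complex query
        # can use several, so real usage may be higher.
        "work_mem_budget_mb": max_connections * work_mem_mb,
    }


for ram_gb in (4, 8):
    print(ram_gb, "GB ->", suggest_pg_memory(ram_gb))
```

The `work_mem_budget_mb` figure is worth checking against the RAM left over after shared_buffers, since work_mem is allocated per sort or hash operation rather than per server.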
Autovacuum for high-write workloads
CloudTAK's tracks table receives continuous INSERT and UPDATE operations as devices report position, and periodic bulk DELETEs from the retention cleanup job. Default autovacuum settings wait until roughly 20% of a table's rows are dead before vacuuming. On a large, busy table that threshold is reached only after bloat has already degraded query performance. Tighten the thresholds specifically for the tracks table:
-- Run after CloudTAK has initialized the database schema
ALTER TABLE tracks SET (
autovacuum_vacuum_scale_factor = 0.02, -- vacuum at 2% dead tuples (vs default 20%)
autovacuum_analyze_scale_factor = 0.01, -- analyze at 1%
autovacuum_vacuum_cost_delay = 2 -- 2 ms pause between vacuum I/O batches (already the default on PostgreSQL 12+; 20 ms on older versions)
);
-- Verify the settings took effect
SELECT reloptions FROM pg_class WHERE relname = 'tracks';
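To see why the 20% default is too loose for a large table, it helps to compute the actual trigger point. PostgreSQL starts a vacuum once dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × row count`, with a default base threshold of 50. The sketch below applies that formula; the 10-million-row table size is a hypothetical figure chosen for illustration.

```python
def vacuum_trigger(row_count: int, scale_factor: float,
                   base_threshold: int = 50) -> int:
    """Dead tuples required before autovacuum vacuums the table."""
    return base_threshold + round(scale_factor * row_count)


rows = 10_000_000  # hypothetical size of the tracks table
print("default (0.2): ", vacuum_trigger(rows, 0.2))
print("tuned   (0.02):", vacuum_trigger(rows, 0.02))
```

With the default setting the table accumulates about two million dead rows before a vacuum starts; with the tuned setting the trigger point is ten times lower.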
Also ensure the PostGIS GIST spatial index and the composite position lookup index exist — CloudTAK creates the spatial index on initialization, but the position lookup index may need to be added manually on older deployments:
-- Add the missing composite index if not present
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tracks_uid_ts
ON tracks (uid, timestamp DESC);
-- Verify all indexes on the tracks table
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'tracks';
CoT rate limiting: per-client caps, stale time tuning, and track pruning
The second most impactful tuning lever is controlling the volume of CoT events the server accepts and retains. Three parameters work together: the per-client inbound rate limit, the stale time threshold (how long a track remains in the live picture after its last update), and the retention window (how long historical tracks stay in the database).
Per-client rate limits
The global CLOUDTAK_COT_RATE_LIMIT environment variable sets a ceiling across all clients. For mixed fleets, configure per-client overrides via the admin API — this allows UAV feeds to publish at high frequency without raising the limit for all infantry devices:
# Set a conservative default for infantry devices
curl -s -X PATCH https://tak.yourdomain.com:8443/api/admin/config \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"cot_rate_limit_default": 5}'
# Override for a specific UAV feed client (higher rate allowed)
curl -s -X PATCH https://tak.yourdomain.com:8443/api/client/uav-feed-01/config \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"cot_rate_limit": 50}'
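The behaviour of a default limit with per-client overrides can be illustrated with a token bucket. This is a sketch of the general technique, not CloudTAK's implementation: the class names are hypothetical, and it assumes the limits are expressed in events per second, which the configuration above does not state.

```python
import time
from typing import Callable, Dict, Optional


class TokenBucket:
    """Allow up to `rate` events per second, with bursts up to `rate`."""

    def __init__(self, rate: float, clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.tokens = rate
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at one second's worth.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimiter:
    """Per-client buckets: a default rate plus named overrides."""

    def __init__(self, default_rate: float,
                 overrides: Optional[Dict[str, float]] = None,
                 clock: Callable[[], float] = time.monotonic):
        self.default_rate = default_rate
        self.overrides = overrides or {}
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str) -> bool:
        if client_id not in self.buckets:
            rate = self.overrides.get(client_id, self.default_rate)
            self.buckets[client_id] = TokenBucket(rate, clock=self.clock)
        return self.buckets[client_id].allow()


limiter = RateLimiter(default_rate=5, overrides={"uav-feed-01": 50})
print(limiter.allow("infantry-07"), limiter.allow("uav-feed-01"))
```

The clock is injectable so the limiter can be exercised deterministically in tests without sleeping.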
Stale time and track pruning
Every CoT event carries a stale attribute — a UTC timestamp after which the event should be considered expired. CloudTAK uses COT_STALE_SECONDS as a server-side override when client-provided stale timestamps are absent or unreasonably long. Setting this to match your operational tempo prevents the in-memory picture from filling with stale tracks from disconnected or destroyed assets:
# .env additions for CoT lifecycle management
COT_STALE_SECONDS=300 # Tracks older than 5 minutes without update are pruned from live picture
COT_RETENTION_HOURS=72 # Historical tracks retained in DB for replay/forensics
CLOUDTAK_TRACK_PRUNE_INTERVAL=60 # Run in-memory pruning every 60 seconds
For high-density UAV operations where dozens of assets may disappear from the picture without a clean disconnect, aggressive pruning is critical — without it, the server accumulates thousands of ghost tracks that each consume memory and contribute to outbound fan-out even though no asset is actually at those coordinates.
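The pruning rule described above can be sketched as follows. This is an illustration under stated assumptions rather than CloudTAK's code: the track dictionary layout and the function names are hypothetical, and the server-side override is modelled as a ceiling of `last_update + COT_STALE_SECONDS` applied when the client's stale time is missing or later than that ceiling.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

COT_STALE_SECONDS = 300  # same value as the .env example


def effective_stale(last_update: datetime,
                    client_stale: Optional[datetime]) -> datetime:
    """Clamp the client-provided stale time to the server-side ceiling."""
    ceiling = last_update + timedelta(seconds=COT_STALE_SECONDS)
    if client_stale is None or client_stale > ceiling:
        return ceiling
    return client_stale


def prune(tracks: dict, now: datetime) -> dict:
    """Return only the tracks that are still live at `now`."""
    return {
        uid: track for uid, track in tracks.items()
        if effective_stale(track["last_update"], track.get("stale")) > now
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
tracks = {
    "alpha": {"last_update": now - timedelta(seconds=60), "stale": None},
    "ghost": {"last_update": now - timedelta(seconds=600),
              "stale": now + timedelta(hours=1)},  # unreasonably long stale
}
print(sorted(prune(tracks, now)))
```

Here `ghost` is dropped even though its own stale timestamp lies an hour in the future, because its last update is older than the five-minute ceiling.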
WebSocket connection management: max connections, heartbeat tuning, dead connection cleanup
Each connected ATAK or WinTAK client holds a persistent WebSocket connection to CloudTAK. At 500+ simultaneous connections, default heartbeat parameters create measurable CPU overhead, and improperly cleaned dead connections consume file descriptors that are not returned to the OS until the process restarts.
Connection limits and heartbeat parameters
# .env WebSocket tuning block
CLOUDTAK_MAX_CONNECTIONS=800 # Hard ceiling — reject new connections above this
CLOUDTAK_WS_PING_INTERVAL=60 # Send PING every 60s (default 30s)
CLOUDTAK_WS_PONG_TIMEOUT=15 # Close if PONG not received within 15s
CLOUDTAK_WS_MAX_PAYLOAD=65536 # 64 KB max message — reject oversized frames
CLOUDTAK_WS_BACKPRESSURE_LIMIT=10485760 # 10 MB — pause writes to slow clients
Increasing CLOUDTAK_WS_PING_INTERVAL from 30s to 60s halves the heartbeat processing load — at 500 clients this is a meaningful reduction. The CLOUDTAK_WS_BACKPRESSURE_LIMIT parameter is important for tactical satellite link clients: it pauses delivery to clients that are not draining their receive buffers fast enough, preventing a slow BGAN connection from holding up event delivery to fast clients on the same server.
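The trade-off behind the ping interval can be made concrete with a little arithmetic. The sketch below assumes the server sends one PING per client per interval and closes the connection once the PONG timeout expires; under that assumption the slowest detection of a dead peer is one full interval plus the timeout. The function name is hypothetical.

```python
def heartbeat_profile(clients: int, ping_interval_s: int,
                      pong_timeout_s: int) -> dict:
    return {
        "pings_per_second": round(clients / ping_interval_s, 2),
        # Worst case: the peer dies just after answering a PING.
        "max_detection_delay_s": ping_interval_s + pong_timeout_s,
    }


print("30s interval:", heartbeat_profile(500, 30, 15))
print("60s interval:", heartbeat_profile(500, 60, 15))
```

Doubling the interval halves the PING rate but lengthens the window in which a dead connection keeps holding its file descriptor, so the interval and CLOUDTAK_TRACK_PRUNE_INTERVAL are best chosen together.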
OS-level file descriptor limits
Each WebSocket connection consumes a file descriptor. The default Linux limit of 1024 open files per process will cap you well below 1000 concurrent clients. Increase the limit for the Docker container and the host:
# Add to the cloudtak service in docker-compose.yml
ulimits:
nofile:
soft: 65536
hard: 65536
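# Also set on the host for login sessions (applied by PAM; the container itself takes its limit from the ulimits block above): add to /etc/security/limits.conf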
* soft nofile 65536
* hard nofile 65536
# Verify current limits inside the container
docker exec cloudtak sh -c 'ulimit -n'
Horizontal scaling: multiple CloudTAK instances, load balancer, session affinity
When a single CloudTAK instance's Node.js event loop is CPU-saturated — identifiable by 100% vCPU utilization with the CoT queue depth growing — horizontal scaling is the next step. CloudTAK v2.x supports multi-instance deployments via a shared PostgreSQL database and Redis pub/sub for event fan-out between instances.
Redis for cross-instance event delivery
Add Redis to your Compose stack and configure both CloudTAK instances to use it:
redis:
image: redis:7-alpine
container_name: cloudtak-redis
restart: unless-stopped
command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru
networks:
- cloudtak-internal
cloudtak-1:
image: ghcr.io/tak-ps/cloudtak:${CLOUDTAK_VERSION}
container_name: cloudtak-1
environment:
# ... same as single-instance config ...
REDIS_URL: redis://redis:6379
INSTANCE_ID: cloudtak-1
ports:
- "8089:8089"
- "8443:8443"
- "8446:8446"
networks:
- cloudtak-internal
- cloudtak-external
cloudtak-2:
image: ghcr.io/tak-ps/cloudtak:${CLOUDTAK_VERSION}
container_name: cloudtak-2
environment:
# ... same as single-instance config ...
REDIS_URL: redis://redis:6379
INSTANCE_ID: cloudtak-2
ports:
- "8190:8089"
- "8543:8443"
- "8546:8446"
networks:
- cloudtak-internal
- cloudtak-external
HAProxy configuration with session affinity
Because ATAK clients maintain persistent TCP connections, the load balancer must route each client consistently to the same CloudTAK instance — splitting a client's CoT stream and WebSocket connection across two instances results in missed events. Use IP-hash source affinity in HAProxy:
# /etc/haproxy/haproxy.cfg (relevant blocks)
frontend tak_cot_frontend
bind *:8089
mode tcp
default_backend tak_cot_backend
backend tak_cot_backend
mode tcp
balance source # IP hash — sticky sessions by source IP
timeout connect 5s
timeout server 300s
server cloudtak1 cloudtak-1:8089 check
server cloudtak2 cloudtak-2:8089 check
frontend tak_https_frontend
bind *:8443
mode tcp
default_backend tak_https_backend
backend tak_https_backend
mode tcp
balance source
timeout connect 5s
timeout server 300s
server cloudtak1 cloudtak-1:8443 check
server cloudtak2 cloudtak-2:8443 check
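The effect of `balance source` can be illustrated with a simple hash-and-modulo mapping. HAProxy's real hash function differs; CRC32 is used here only as a stand-in, and the function name is hypothetical. The point of the sketch is the property that matters for TAK clients: the same source IP always lands on the same backend while the server list is unchanged.

```python
import zlib

SERVERS = ["cloudtak-1", "cloudtak-2"]


def pick_server(source_ip: str, servers=SERVERS) -> str:
    """Map a source IP to a backend deterministically (illustrative hash)."""
    return servers[zlib.crc32(source_ip.encode()) % len(servers)]


for ip in ("10.0.0.5", "10.0.0.6", "10.0.0.5"):
    print(ip, "->", pick_server(ip))
```

Two consequences follow from this scheme. Clients that share one public address, for example a unit behind a single NAT gateway, all map to the same instance, so load may be uneven. And when the set of healthy servers changes, the modulo changes with it, so some clients are redirected to a different instance.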
With two instances on the same hardware, the load is distributed across two Node.js processes, each with its own event loop. Provided the host has at least two cores available, this roughly doubles the available single-threaded JavaScript throughput. For deployments needing more than 1000 concurrent clients, scale to three or four instances following the same pattern.
Feed optimisation: spatial filtering and resolution-based filtering
The most significant bandwidth reduction comes from spatial filtering — delivering each client only the tracks within their operational area rather than the full global picture. At 500 clients each receiving the full track feed, outbound fan-out is O(clients × events). With spatial filtering, clients in different geographic areas receive disjoint subsets of the track feed, and the fan-out collapses dramatically.
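A rough calculation shows how quickly the fan-out collapses. The sketch below uses the workload from the benchmark section (450 clients at 0.05 Hz and 50 feeds at 5 Hz) and a hypothetical split into four equal, non-overlapping areas; real deployments will have overlapping AOIs and uneven distribution, so treat the result as a best case.

```python
def deliveries_per_second(groups):
    """groups: list of (clients, events_per_second), one per disjoint area.

    Every event produced in an area is delivered to every client in that area.
    """
    return sum(clients * events for clients, events in groups)


total_events = 450 * 0.05 + 50 * 5  # events per second across all sources

# No filtering: one area containing everyone.
unfiltered = deliveries_per_second([(500, total_events)])

# Hypothetical: four equal areas, each with a quarter of clients and events.
filtered = deliveries_per_second([(125, total_events / 4)] * 4)

print(f"unfiltered: {unfiltered:,.0f} deliveries/s")
print(f"filtered:   {filtered:,.0f} deliveries/s")
```

Splitting into k equal disjoint areas divides the delivery count by k, because each area carries 1/k of the events to 1/k of the clients.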
Configuring area of interest subscriptions
Clients can register an AOI subscription via the CloudTAK feeds API, or operators can configure per-unit AOIs from the admin interface:
# Register a bounding box AOI for a specific client
curl -s -X PUT https://tak.yourdomain.com:8443/api/client/operator01/aoi \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"type": "bbox",
"min_lon": 22.5,
"min_lat": 48.2,
"max_lon": 25.8,
"max_lat": 50.1,
"radius_km": null
}'
# Or configure a radius-based AOI centered on the client's last position
curl -s -X PUT https://tak.yourdomain.com:8443/api/client/operator01/aoi \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"type": "radius",
"center_lon": 24.0,
"center_lat": 49.2,
"radius_km": 50,
"follow_client": true
}'
The "follow_client": true option causes CloudTAK to dynamically update the AOI center as the client's own position reports, so the 50 km radius tracks with the moving operator. This is the recommended mode for vehicle-mounted and airborne clients.
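The two AOI types correspond to two simple membership tests. The sketch below shows one way to evaluate them, using a plain bounding-box comparison and the haversine formula on a spherical Earth; it is not CloudTAK's implementation, the function names are hypothetical, and the bounding-box test does not handle boxes that cross the antimeridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def in_bbox(lon, lat, min_lon, min_lat, max_lon, max_lat) -> bool:
    """True if the point lies inside the box (no antimeridian handling)."""
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def haversine_km(lon1, lat1, lon2, lat2) -> float:
    """Great-circle distance between two lon/lat points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_radius(lon, lat, center_lon, center_lat, radius_km) -> bool:
    return haversine_km(lon, lat, center_lon, center_lat) <= radius_km


# Values from the two API examples above.
print(in_bbox(24.0, 49.2, 22.5, 48.2, 25.8, 50.1))
print(in_radius(24.5, 49.2, 24.0, 49.2, 50))
print(in_radius(25.0, 49.2, 24.0, 49.2, 50))
```

With `follow_client` enabled, the same radius test would simply be re-evaluated against the client's most recent reported position instead of a fixed center.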
Resolution-based filtering for UAV feeds
High-frequency UAV feeds can be decimated for distant clients. With the tiers shown here, clients within 20 km of the UAV's position receive full resolution, clients between 20 and 100 km receive one event in five, and clients beyond 100 km receive one event in ten (10% of full resolution). Configure resolution tiers per feed via the admin API:
curl -s -X PATCH https://tak.yourdomain.com:8443/api/feed/uav-feed-01/resolution \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"tiers": [
{"max_distance_km": 20, "rate_divisor": 1},
{"max_distance_km": 100, "rate_divisor": 5},
{"max_distance_km": null, "rate_divisor": 10}
]
}'
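The tier lookup and decimation can be sketched in a few lines. This is an illustration of the logic the payload implies, not CloudTAK's code: the function names are hypothetical, the tier boundaries are assumed to be inclusive, and decimation is modelled as keeping every Nth event by sequence number.

```python
TIERS = [
    {"max_distance_km": 20, "rate_divisor": 1},
    {"max_distance_km": 100, "rate_divisor": 5},
    {"max_distance_km": None, "rate_divisor": 10},  # catch-all tier
]


def rate_divisor(distance_km: float, tiers=TIERS) -> int:
    """Return the divisor of the first tier whose limit covers the distance."""
    for tier in tiers:
        limit = tier["max_distance_km"]
        if limit is None or distance_km <= limit:
            return tier["rate_divisor"]
    raise ValueError("tiers must end with a catch-all (max_distance_km = None)")


def should_deliver(sequence_number: int, distance_km: float) -> bool:
    """Keep every Nth event, where N is the divisor for this distance."""
    return sequence_number % rate_divisor(distance_km) == 0


for distance in (5, 60, 250):
    kept = sum(should_deliver(i, distance) for i in range(100))
    print(f"{distance} km: {kept} of 100 events delivered")
```

Tiers are evaluated in order, so they should be listed from nearest to farthest with the unbounded tier last.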
Profiling tools: admin API metrics, pg_stat_statements, and Linux perf
Tuning without profiling is guesswork. Use these three tools to identify the actual bottleneck before applying changes.
CloudTAK admin API metrics
The GET /api/admin/metrics endpoint returns a JSON object with real-time counters. For ongoing monitoring, scrape it into Prometheus using the /api/admin/metrics/prometheus endpoint and visualize in Grafana. The most diagnostic fields:
- cot_queue_depth: if consistently > 0, the database write path is the bottleneck.
- db_pool_waiting: connections queued for a pool slot; > 0 means the PgBouncer pool is undersized.
- ws_backpressure_paused: count of clients currently paused due to slow reads; indicates a network or client-side bottleneck rather than a server-side one.
- event_loop_lag_ms: Node.js event loop lag; values above 100 ms indicate the main thread is CPU-saturated and horizontal scaling is needed.
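These rules are simple enough to automate. The sketch below applies them to one metrics snapshot; it assumes the endpoint returns a flat JSON object keyed by the field names listed above, and the `triage` function and sample values are hypothetical. A single snapshot cannot show that a value is consistently elevated, so in practice the check should run over several polls.

```python
import json


def triage(metrics: dict) -> list:
    """Translate one metrics snapshot into likely bottlenecks."""
    findings = []
    if metrics.get("cot_queue_depth", 0) > 0:
        findings.append("database write path is backing up")
    if metrics.get("db_pool_waiting", 0) > 0:
        findings.append("connection pool is undersized")
    if metrics.get("ws_backpressure_paused", 0) > 0:
        findings.append("some clients or links are reading too slowly")
    if metrics.get("event_loop_lag_ms", 0) > 100:
        findings.append("event loop is CPU-saturated; consider scaling out")
    return findings


# Hypothetical snapshot, shaped like the fields described above.
sample = json.loads(
    '{"cot_queue_depth": 12, "db_pool_waiting": 0, '
    '"ws_backpressure_paused": 3, "event_loop_lag_ms": 140}'
)
for finding in triage(sample):
    print("-", finding)
```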
PostgreSQL pg_stat_statements
Enable the pg_stat_statements extension to identify the costliest queries:
-- Enable extension (add to postgresql.conf: shared_preload_libraries = 'pg_stat_statements')
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top 10 queries by total execution time
SELECT
left(query, 80) AS query_snippet,
calls,
round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Reset stats after tuning to measure improvement
SELECT pg_stat_statements_reset();
The queries most commonly found at the top of this list on under-tuned CloudTAK deployments are: the track UPSERT query (missing composite index on (uid, timestamp)), the spatial AOI filter query (missing GIST index on the geometry column), and the retention cleanup DELETE (sequential scan when the timestamp index is missing).
Linux perf and flamegraphs for Node.js CPU profiling
If event_loop_lag_ms is elevated, use Linux perf to generate a CPU flamegraph of the CloudTAK Node.js process:
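# Get the host PID of the container's main process (assumed to be the Node.js process).
# pgrep inside the container would return a PID from the container's own PID namespace,
# which perf running on the host cannot attach to.
CLOUDTAK_PID=$(docker inspect -f '{{.State.Pid}}' cloudtak)
# Record 30 seconds of CPU samples (requires perf installed on host)
perf record -F 99 -p "$CLOUDTAK_PID" -g -- sleep 30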
# Generate flamegraph (requires FlameGraph tools)
perf script | stackcollapse-perf.pl | flamegraph.pl > cloudtak-flame.svg
Common hot paths found in CloudTAK flamegraphs: JSON serialization of large CoT payloads (mitigated by payload size limits), WebSocket frame encoding for high-frequency fan-out (mitigated by spatial filtering), and geospatial distance calculations for AOI evaluation (mitigated by pushing AOI filtering to PostGIS rather than evaluating in JavaScript).
Benchmark results: before and after tuning for the 500-client scenario
The following results were produced on a 4 vCPU / 8 GB RAM Ubuntu 22.04 server running CloudTAK v2.4.1, PostgreSQL 15, and PgBouncer 1.22. Load was simulated using tak-load-test with 450 infantry clients (position update at 0.05 Hz) and 50 UAV feeds (position + metadata at 5 Hz). All tuning changes from this guide were applied.
| Metric | Before tuning | After tuning | Improvement |
|---|---|---|---|
| CoT latency p50 | 3,800 ms | 310 ms | -92% |
| CoT latency p99 | 18,200 ms | 890 ms | -95% |
| DB connection errors/min | 47 | 0 | -100% |
| Server CPU utilization | 94% | 38% | -60% |
| Outbound bandwidth | 340 Mbps | 118 Mbps | -65% |
| PostgreSQL memory | 1.8 GB | 680 MB | -62% |
The single biggest contributor to latency reduction was PgBouncer — eliminating connection pool exhaustion dropped median latency from 3.8 seconds to under 800ms before any other change. Spatial filtering was the single biggest contributor to bandwidth reduction. The remaining latency improvement to 310ms p50 came from the PostgreSQL memory parameter tuning and the composite index on (uid, timestamp DESC).
Capacity planning note: After full tuning, the 4 vCPU / 8 GB server was running at 38% CPU with 500 clients. This suggests headroom to approximately 1100–1200 clients before CPU saturation, but only if load scales linearly with client count. Because fan-out grows with clients × events, treat that figure as an upper bound. For production deployments expecting to approach that ceiling, deploy two CloudTAK instances behind HAProxy before hitting it; adding instances reactively in the middle of an operation is risky.
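The headroom estimate can be reproduced with a one-line linear extrapolation from the measured point (500 clients at 38% CPU). The function name and the 85% and 90% utilization targets are assumptions chosen to show where the 1100–1200 range comes from.

```python
def clients_at(target_cpu: float, measured_clients: int = 500,
               measured_cpu: float = 0.38) -> int:
    """Linear extrapolation: clients supported at a target CPU fraction."""
    return int(measured_clients * target_cpu / measured_cpu)


for target in (0.85, 0.90, 1.00):
    print(f"{target:.0%} CPU -> ~{clients_at(target)} clients")
```

The 1100–1200 range corresponds to running the server at roughly 85–90% CPU; full saturation under the same linear assumption would sit a little above 1300 clients.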