A freshly deployed CloudTAK instance on default configuration handles a small team of ATAK devices without issue. The problems appear gradually: CoT event delivery starts lagging when the 80th or 90th concurrent client connects, PostgreSQL connection pool errors surface in the logs around 150 clients, and at 300+ clients the server begins queuing events so aggressively that field units notice their COP is minutes behind reality. None of this is a fundamental limit of CloudTAK — it is a consequence of running an operationally scaled workload on development defaults. This guide covers the full tuning path: establishing a performance baseline, optimising PostgreSQL, rate-limiting CoT traffic, managing WebSocket connections, enabling spatial filtering, and scaling horizontally when a single instance is no longer enough.

Performance baseline: what 100, 500, and 1000 clients look like on default config

Before tuning anything, measure where you currently are. The CloudTAK admin metrics endpoint provides the most direct view of server health:

# Poll CloudTAK metrics every 5 seconds
watch -n5 'curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://tak.yourdomain.com:8443/api/admin/metrics | jq .'

Key fields to watch: ws_connections (active WebSocket clients), cot_queue_depth (events waiting to be persisted), db_pool_active / db_pool_waiting, and cot_latency_p99_ms.

What those numbers look like on default configuration (2 vCPU / 4 GB, DB_POOL_MAX=10) across three load levels:

Clients   CPU                Memory          CoT p50 latency   CoT p99 latency   DB errors/min
100       28%                1.4 GB          210 ms            870 ms            0
500       94%                3.2 GB          3,800 ms          18,200 ms         47
1000      100% (saturated)   3.8 GB + swap   >30,000 ms        timeout           300+

The 500-client row is the operational inflection point for most defense deployments. It is also the scenario where tuning delivers the greatest absolute improvement, and the remediation steps below are benchmarked against this profile. Network bandwidth at 500 clients on default config is approximately 340 Mbps outbound, because every CoT event is fanned out to every subscriber; this is a secondary bottleneck on constrained tactical links.

PostgreSQL tuning: PgBouncer, shared_buffers, work_mem, and autovacuum

PostgreSQL is the dominant bottleneck on most under-tuned CloudTAK deployments. Two separate problems combine: connection exhaustion (too many concurrent application connections for PostgreSQL's process-per-connection model) and slow queries (missing indexes, poorly tuned memory parameters, and autovacuum falling behind on the high-write tracks table).

PgBouncer connection pooling

Add PgBouncer as an intermediate service in your Docker Compose stack. Use transaction pooling mode — this allows a large number of short-lived CloudTAK connections to share a small pool of actual PostgreSQL backends:

  pgbouncer:
    image: bitnami/pgbouncer:latest
    container_name: cloudtak-pgbouncer
    restart: unless-stopped
    environment:
      POSTGRESQL_HOST: postgres
      POSTGRESQL_PORT: 5432
      POSTGRESQL_DATABASE: ${POSTGRES_DB}
      POSTGRESQL_USERNAME: ${POSTGRES_USER}
      POSTGRESQL_PASSWORD: ${POSTGRES_PASSWORD}
      PGBOUNCER_DATABASE: ${POSTGRES_DB}
      PGBOUNCER_POOL_MODE: transaction
      PGBOUNCER_MAX_CLIENT_CONN: 500
      PGBOUNCER_DEFAULT_POOL_SIZE: 25
      PGBOUNCER_MIN_POOL_SIZE: 5
      PGBOUNCER_RESERVE_POOL_SIZE: 5
      PGBOUNCER_RESERVE_POOL_TIMEOUT: 5
    networks:
      - cloudtak-internal
    depends_on:
      - postgres

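Update CloudTAK's DATABASE_URL to point at the pgbouncer service rather than directly at PostgreSQL. Note that the bitnami/pgbouncer image listens on port 6432 by default; set PGBOUNCER_PORT if you want it on 5432. This single change typically eliminates connection pool exhaustion errors and reduces PostgreSQL memory usage by 60–80% at 500+ clients. Transaction pooling does not preserve session state between transactions, so confirm that CloudTAK does not depend on session-level features such as LISTEN/NOTIFY or advisory locks before switching.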
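The pool settings above imply a ceiling on how many PostgreSQL backends PgBouncer will open, and that ceiling has to fit under PostgreSQL's max_connections. A minimal sketch of the arithmetic, assuming a single database/user pair (PgBouncer sizes pools per pair); the helper names and the five-connection admin reserve are illustrative, not CloudTAK or PgBouncer settings:

```python
def pgbouncer_backend_budget(default_pool_size: int,
                             reserve_pool_size: int,
                             db_user_pairs: int = 1) -> int:
    """Upper bound on PostgreSQL backends PgBouncer may open."""
    return (default_pool_size + reserve_pool_size) * db_user_pairs


def fits(max_connections: int, budget: int, admin_reserve: int = 5) -> bool:
    """True if the backend budget plus a reserve for admin sessions fits."""
    return budget + admin_reserve <= max_connections


# Values from the Compose block: default pool 25, reserve pool 5.
budget = pgbouncer_backend_budget(25, 5)
print(budget, fits(60, budget))
```

If CloudTAK ever connects with more than one database user, multiply the budget accordingly before choosing max_connections.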

PostgreSQL memory parameters

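Mount a custom postgresql.conf into the PostgreSQL container and tune the parameters below. The values shown are sized for an 8 GB server; on a 4 GB server, halve shared_buffers and effective_cache_size to keep the same 25% and 75% ratios: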

# /opt/cloudtak/data/postgresql.conf — performance tuning block

# Memory — set shared_buffers to 25% of total server RAM
shared_buffers = 2GB                  # 25% of 8 GB server
effective_cache_size = 6GB            # 75% of total RAM
work_mem = 8MB                        # per sort/hash operation
maintenance_work_mem = 256MB          # for VACUUM, CREATE INDEX

# WAL and checkpoints — reduce I/O spikes
wal_buffers = 64MB
checkpoint_completion_target = 0.9
max_wal_size = 2GB
min_wal_size = 512MB

# Connection limits (backend processes managed via PgBouncer)
max_connections = 60                  # PgBouncer backends + admin connections

# Parallel query — useful for large retention cleanup jobs
max_parallel_workers_per_gather = 2
max_parallel_workers = 4

# Logging — capture slow queries for profiling
log_min_duration_statement = 500      # Log queries taking > 500ms
log_autovacuum_min_duration = 1000    # Log autovacuum runs > 1 second
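The memory values in the block above follow two rules of thumb (25% of RAM for shared_buffers, 75% for effective_cache_size). The sketch below applies them to other server sizes and also estimates the worst case for work_mem, which is allocated per sort or hash operation rather than per server. The function names are illustrative:

```python
def pg_memory_settings(total_ram_gb: float) -> dict:
    """Apply the 25% / 75% sizing rules used in the config block."""
    return {
        "shared_buffers_gb": total_ram_gb * 0.25,
        "effective_cache_size_gb": total_ram_gb * 0.75,
    }


def worst_case_work_mem_mb(max_connections: int, work_mem_mb: int,
                           ops_per_query: int = 1) -> int:
    """Memory used if every backend runs `ops_per_query` sorts at once."""
    return max_connections * work_mem_mb * ops_per_query


print(pg_memory_settings(8))
print(pg_memory_settings(4))
print(worst_case_work_mem_mb(60, 8))
```

With max_connections = 60 and work_mem = 8MB, the single-operation worst case is 480 MB, which is why work_mem should be raised cautiously on a small server.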

Autovacuum for high-write workloads

CloudTAK's tracks table receives continuous INSERT and UPDATE operations as devices report position, and periodic bulk DELETEs from the retention cleanup job. By default, autovacuum runs only once dead tuples exceed roughly 20% of the table (autovacuum_vacuum_scale_factor = 0.2). On a large, frequently updated table, bloat degrades query performance well before that threshold is reached. Tighten the thresholds specifically for the tracks table:

-- Run after CloudTAK has initialized the database schema
ALTER TABLE tracks SET (
    autovacuum_vacuum_scale_factor = 0.02,    -- vacuum at 2% dead tuples (vs default 20%)
    autovacuum_analyze_scale_factor = 0.01,   -- analyze at 1%
    autovacuum_vacuum_cost_delay = 2          -- milliseconds; already the default on PostgreSQL 12+ (20 ms on 11 and earlier)
);

-- Verify the settings took effect
SELECT reloptions FROM pg_class WHERE relname = 'tracks';
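
PostgreSQL triggers autovacuum on a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × row count. The sketch below compares the default and the tightened setting for a hypothetical 10-million-row tracks table; the row count is an assumption chosen for illustration, and the function name is not a PostgreSQL API:

```python
def autovacuum_trigger(row_count: int, scale_factor: float,
                       base_threshold: int = 50) -> int:
    """Dead tuples needed before autovacuum vacuums the table.

    base_threshold mirrors autovacuum_vacuum_threshold (default 50).
    """
    return base_threshold + round(scale_factor * row_count)


rows = 10_000_000  # hypothetical table size
print("default (0.2): ", autovacuum_trigger(rows, 0.2))
print("tuned   (0.02):", autovacuum_trigger(rows, 0.02))
```

At this size the default waits for about two million dead tuples, while the tuned setting vacuums after about two hundred thousand.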

Also ensure the PostGIS GIST spatial index and the composite position lookup index exist — CloudTAK creates the spatial index on initialization, but the position lookup index may need to be added manually on older deployments:

-- Add the missing composite index if not present
-- (CREATE INDEX CONCURRENTLY cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tracks_uid_ts
    ON tracks (uid, timestamp DESC);

-- Verify all indexes on the tracks table
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'tracks';

CoT rate limiting: per-client caps, stale time tuning, and track pruning

The second most impactful tuning lever is controlling the volume of CoT events the server accepts and retains. Three parameters work together: the per-client inbound rate limit, the stale time threshold (how long a track remains in the live picture after its last update), and the retention window (how long historical tracks stay in the database).

Per-client rate limits

The global CLOUDTAK_COT_RATE_LIMIT environment variable sets a ceiling across all clients. For mixed fleets, configure per-client overrides via the admin API — this allows UAV feeds to publish at high frequency without raising the limit for all infantry devices:

# Set a conservative default for infantry devices
curl -s -X PATCH https://tak.yourdomain.com:8443/api/admin/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"cot_rate_limit_default": 5}'

# Override for a specific UAV feed client (higher rate allowed)
curl -s -X PATCH https://tak.yourdomain.com:8443/api/client/uav-feed-01/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"cot_rate_limit": 50}'
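
How CloudTAK enforces these limits internally is not described here. As an illustration of the general idea, the sketch below implements a per-client token bucket with a default rate and per-client overrides. It assumes the limits are expressed in events per second; the class names are hypothetical, and the clock is injectable so the behaviour can be checked deterministically:

```python
import time
from typing import Callable, Dict, Optional


class TokenBucket:
    """Refills at `rate` tokens per second, with bursts capped at one second's worth."""

    def __init__(self, rate: float, clock: Callable[[], float]):
        self.rate = rate
        self.capacity = rate
        self.tokens = rate
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerClientLimiter:
    """One bucket per client; overrides take precedence over the default rate."""

    def __init__(self, default_rate: float,
                 overrides: Optional[Dict[str, float]] = None,
                 clock: Callable[[], float] = time.monotonic):
        self.default_rate = default_rate
        self.overrides = overrides or {}
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str) -> bool:
        bucket = self.buckets.get(client_id)
        if bucket is None:
            rate = self.overrides.get(client_id, self.default_rate)
            bucket = self.buckets[client_id] = TokenBucket(rate, self.clock)
        return bucket.allow()


limiter = PerClientLimiter(5, {"uav-feed-01": 50})
print(limiter.allow("infantry-01"), limiter.allow("uav-feed-01"))
```

Events for which allow() returns False would be dropped or deferred; which of the two is appropriate depends on how much staleness the receiving clients can tolerate.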

Stale time and track pruning

Every CoT event carries a stale attribute — a UTC timestamp after which the event should be considered expired. CloudTAK uses COT_STALE_SECONDS as a server-side override when client-provided stale timestamps are absent or unreasonably long. Setting this to match your operational tempo prevents the in-memory picture from filling with stale tracks from disconnected or destroyed assets:

# .env additions for CoT lifecycle management
COT_STALE_SECONDS=300        # Tracks older than 5 minutes without update are pruned from live picture
COT_RETENTION_HOURS=72       # Historical tracks retained in DB for replay/forensics
CLOUDTAK_TRACK_PRUNE_INTERVAL=60   # Run in-memory pruning every 60 seconds

For high-density UAV operations where dozens of assets may disappear from the picture without a clean disconnect, aggressive pruning is critical — without it, the server accumulates thousands of ghost tracks that each consume memory and contribute to outbound fan-out even though no asset is actually at those coordinates.
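
The pruning pass itself is conceptually simple. A minimal sketch, assuming the live picture is a mapping from track UID to the time of its last update in seconds; the data structure and function name are illustrative, not CloudTAK internals:

```python
from typing import Dict, List


def prune_stale(tracks: Dict[str, float], now: float,
                stale_seconds: float = 300.0) -> List[str]:
    """Remove tracks not updated within `stale_seconds`; return the removed UIDs."""
    expired = [uid for uid, last_seen in tracks.items()
               if now - last_seen > stale_seconds]
    for uid in expired:
        del tracks[uid]
    return expired


live = {"uav-07": 0.0, "infantry-12": 250.0}
print(prune_stale(live, now=301.0), live)
```

Running a pass like this on a fixed interval keeps the cost proportional to the number of live tracks rather than to the event rate.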

WebSocket connection management: max connections, heartbeat tuning, dead connection cleanup

Each connected ATAK or WinTAK client holds a persistent WebSocket connection to CloudTAK. At 500+ simultaneous connections, default heartbeat parameters create measurable CPU overhead, and improperly cleaned dead connections consume file descriptors that are not returned to the OS until the process restarts.

Connection limits and heartbeat parameters

# .env WebSocket tuning block
CLOUDTAK_MAX_CONNECTIONS=800           # Hard ceiling — reject new connections above this
CLOUDTAK_WS_PING_INTERVAL=60          # Send PING every 60s (default 30s)
CLOUDTAK_WS_PONG_TIMEOUT=15           # Close if PONG not received within 15s
CLOUDTAK_WS_MAX_PAYLOAD=65536         # 64 KB max message — reject oversized frames
CLOUDTAK_WS_BACKPRESSURE_LIMIT=10485760  # 10 MB — pause writes to slow clients

Increasing CLOUDTAK_WS_PING_INTERVAL from 30s to 60s halves the heartbeat processing load, a meaningful reduction at 500 clients. The trade-off is slower detection of dead connections, since a dropped client is only noticed after the next PING goes unanswered. The CLOUDTAK_WS_BACKPRESSURE_LIMIT parameter is important for tactical satellite link clients: it pauses delivery to clients that are not draining their receive buffers fast enough, preventing a slow BGAN connection from holding up event delivery to fast clients on the same server.
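
The heartbeat trade-off can be put in numbers. The sketch below computes the PING rate the server must sustain and the worst-case time before a silently dropped connection is closed, using the values from the tuning block. It assumes one PING per client per interval; the function names are illustrative:

```python
def pings_per_second(clients: int, ping_interval_s: float) -> float:
    """Average PING frames the server sends per second."""
    return clients / ping_interval_s


def worst_case_detection_s(ping_interval_s: float, pong_timeout_s: float) -> float:
    """A link that dies just after a PONG is noticed one full interval plus one timeout later."""
    return ping_interval_s + pong_timeout_s


print(pings_per_second(500, 30), pings_per_second(500, 60))
print(worst_case_detection_s(60, 15))
```

At 60s and 15s, a dead connection can hold its file descriptor for up to 75 seconds, which is worth weighing against the stale time configured for the live picture.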

OS-level file descriptor limits

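Each WebSocket connection consumes a file descriptor. Where the common Linux soft limit of 1024 open files per process applies, it caps the server below 1000 concurrent clients, because the process also needs descriptors for listeners, database sockets, and log files. Increase the limit for the Docker container and the host: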

# Add to the cloudtak service in docker-compose.yml
    ulimits:
      nofile:
        soft: 65536
        hard: 65536

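# Also set on the host: add to /etc/security/limits.conf
# (applies to PAM login sessions; containers take their limits from the ulimits block above or from the Docker daemon)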
*    soft    nofile    65536
*    hard    nofile    65536

# Verify current limits inside the container
docker exec cloudtak sh -c 'ulimit -n'
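
The same check can be scripted. A minimal sketch that reads the descriptor limit of the current process with Python's standard resource module (Unix only) and compares it against a target client count; the overhead of 64 descriptors for listeners, database sockets, and log files is an assumed figure, not a measured one:

```python
import resource


def nofile_limits() -> tuple:
    """Return (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def supports_clients(soft_limit: int, clients: int, overhead_fds: int = 64) -> bool:
    """True if the soft limit covers one descriptor per client plus overhead."""
    return soft_limit >= clients + overhead_fds


soft, hard = nofile_limits()
print(soft, hard, supports_clients(soft, 1000))
```

Run inside the container, this reports the limits the CloudTAK process actually inherits, which may differ from the host's.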

Horizontal scaling: multiple CloudTAK instances, load balancer, session affinity

When a single CloudTAK instance's Node.js event loop is CPU-saturated, horizontal scaling is the next step. Because the event loop is single-threaded, saturation shows up as one core pinned at 100% with the CoT queue depth growing, even if overall host CPU looks lower. CloudTAK v2.x supports multi-instance deployments via a shared PostgreSQL database and Redis pub/sub for event fan-out between instances.

Redis for cross-instance event delivery

Add Redis to your Compose stack and configure both CloudTAK instances to use it:

  redis:
    image: redis:7-alpine
    container_name: cloudtak-redis
    restart: unless-stopped
    command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru
    networks:
      - cloudtak-internal

  cloudtak-1:
    image: ghcr.io/tak-ps/cloudtak:${CLOUDTAK_VERSION}
    container_name: cloudtak-1
    environment:
      # ... same as single-instance config ...
      REDIS_URL: redis://redis:6379
      INSTANCE_ID: cloudtak-1
    ports:
      - "8089:8089"
      - "8443:8443"
      - "8446:8446"
    networks:
      - cloudtak-internal
      - cloudtak-external

  cloudtak-2:
    image: ghcr.io/tak-ps/cloudtak:${CLOUDTAK_VERSION}
    container_name: cloudtak-2
    environment:
      # ... same as single-instance config ...
      REDIS_URL: redis://redis:6379
      INSTANCE_ID: cloudtak-2
    ports:
      - "8190:8089"
      - "8543:8443"
      - "8546:8446"
    networks:
      - cloudtak-internal
      - cloudtak-external

HAProxy configuration with session affinity

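Because ATAK clients maintain persistent TCP connections, the load balancer must route each client consistently to the same CloudTAK instance; splitting a client's CoT stream and WebSocket connection across two instances results in missed events. Use source-IP hash affinity in HAProxy. Two caveats apply. Clients that share a NAT egress address all hash to the same instance. HAProxy also needs to own ports 8089 and 8443 itself, so if it runs on the same host, drop or remap the cloudtak-1 host port mappings shown above. The backends below address the instances by Compose service name, which only resolves if HAProxy runs as a container on the same Docker network: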

# /etc/haproxy/haproxy.cfg (relevant blocks)

frontend tak_cot_frontend
    bind *:8089
    mode tcp
    timeout client 300s        # match the backend's timeout server for long-lived connections
    default_backend tak_cot_backend

backend tak_cot_backend
    mode tcp
    balance source             # IP hash — sticky sessions by source IP
    timeout connect 5s
    timeout server 300s
    server cloudtak1 cloudtak-1:8089 check
    server cloudtak2 cloudtak-2:8089 check

frontend tak_https_frontend
    bind *:8443
    mode tcp
    timeout client 300s        # match the backend's timeout server for long-lived connections
    default_backend tak_https_backend

backend tak_https_backend
    mode tcp
    balance source
    timeout connect 5s
    timeout server 300s
    server cloudtak1 cloudtak-1:8443 check
    server cloudtak2 cloudtak-2:8443 check

With two instances on the same hardware, the load is distributed across two Node.js processes, each with its own event loop. Provided the host has idle cores, this roughly doubles the available single-threaded JavaScript throughput. For deployments needing more than 1000 concurrent clients, scale to three or four instances following the same pattern.
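
Source affinity works because the backend choice is a pure function of the client's address. The sketch below illustrates that property with a deliberately simple hash (the integer value of the IP modulo the number of backends). It is not HAProxy's actual hashing algorithm, and the function name is hypothetical:

```python
import ipaddress
from typing import List


def pick_backend(client_ip: str, backends: List[str]) -> str:
    """Deterministically map a source IP to one backend (illustrative hash only)."""
    return backends[int(ipaddress.ip_address(client_ip)) % len(backends)]


backends = ["cloudtak-1", "cloudtak-2"]
for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.1"):
    print(ip, "->", pick_backend(ip, backends))
```

The same address always lands on the same instance, which is what keeps a client's connections together. It also means every client behind one NAT address lands on one instance, and that adding or removing a backend remaps some clients.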

Feed optimisation: spatial filtering and resolution-based filtering

The most significant bandwidth reduction comes from spatial filtering — delivering each client only the tracks within their operational area rather than the full global picture. At 500 clients each receiving the full track feed, outbound fan-out is O(clients × events). With spatial filtering, clients in different geographic areas receive disjoint subsets of the track feed, and the fan-out collapses dramatically.
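
To make the O(clients × events) term concrete, the sketch below works through the load profile used in this guide's benchmark (450 infantry clients at 0.05 Hz, 50 UAV feeds at 5 Hz) with 500 subscribers. The average bytes per delivery is derived by dividing the 340 Mbps baseline by the delivery rate, so it is an implied figure rather than a measured one. The delivered_fraction parameter is a hypothetical stand-in for the share of events that survive spatial filtering:

```python
def event_rate_hz(groups) -> float:
    """groups: iterable of (client_count, per_client_rate_hz) pairs."""
    return sum(count * hz for count, hz in groups)


def deliveries_per_second(event_rate: float, subscribers: int,
                          delivered_fraction: float = 1.0) -> float:
    """Outbound messages per second when each event goes to a share of subscribers."""
    return event_rate * subscribers * delivered_fraction


def implied_bytes_per_delivery(bandwidth_mbps: float, deliveries: float) -> float:
    """Average wire size per delivered event implied by a bandwidth figure."""
    return bandwidth_mbps * 1_000_000 / 8 / deliveries


rate = event_rate_hz([(450, 0.05), (50, 5.0)])
full = deliveries_per_second(rate, 500)
print(rate, full, implied_bytes_per_delivery(340, full))
print(deliveries_per_second(rate, 500, delivered_fraction=0.25))
```

Under these assumptions the server makes roughly 136,000 deliveries per second at an implied size of about 310 bytes each, and almost all of that volume comes from the 50 UAV feeds. Halving the fraction of subscribers that need each event halves the outbound volume.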

Configuring area of interest subscriptions

Clients can register an AOI subscription via the CloudTAK feeds API, or operators can configure per-unit AOIs from the admin interface:

# Register a bounding box AOI for a specific client
curl -s -X PUT https://tak.yourdomain.com:8443/api/client/operator01/aoi \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "bbox",
    "min_lon": 22.5,
    "min_lat": 48.2,
    "max_lon": 25.8,
    "max_lat": 50.1,
    "radius_km": null
  }'

# Or configure a radius-based AOI centered on the client's last position
curl -s -X PUT https://tak.yourdomain.com:8443/api/client/operator01/aoi \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "radius",
    "center_lon": 24.0,
    "center_lat": 49.2,
    "radius_km": 50,
    "follow_client": true
  }'

The "follow_client": true option causes CloudTAK to dynamically update the AOI center as the client's own position reports, so the 50 km radius tracks with the moving operator. This is the recommended mode for vehicle-mounted and airborne clients.
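
Both AOI types reduce to a simple geometric test per track. A minimal sketch of the two checks, using the haversine great-circle distance on a spherical Earth for the radius case. The function names are illustrative, and the bounding-box test does not handle boxes that cross the antimeridian:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two WGS84 points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_bbox(lat, lon, min_lat, min_lon, max_lat, max_lon) -> bool:
    """True if the point lies inside the bounding box (edges inclusive)."""
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def in_radius(lat, lon, center_lat, center_lon, radius_km) -> bool:
    """True if the point lies within radius_km of the centre."""
    return haversine_km(lat, lon, center_lat, center_lon) <= radius_km


# Track at 49.5 N, 24.0 E against the two AOIs from the examples above.
print(in_bbox(49.5, 24.0, 48.2, 22.5, 50.1, 25.8))
print(in_radius(49.5, 24.0, 49.2, 24.0, 50))
```

With follow_client enabled, the centre passed to the radius check would simply be the client's most recent reported position.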

Resolution-based filtering for UAV feeds

High-frequency UAV feeds can be decimated for distant clients. With the tiers below, clients within 20 km of the UAV's position receive full resolution, clients between 20 and 100 km receive one event in five, and clients beyond 100 km receive one event in ten (10% of full resolution). Configure resolution tiers per feed via the admin API:

curl -s -X PATCH https://tak.yourdomain.com:8443/api/feed/uav-feed-01/resolution \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "tiers": [
      {"max_distance_km": 20,  "rate_divisor": 1},
      {"max_distance_km": 100, "rate_divisor": 5},
      {"max_distance_km": null, "rate_divisor": 10}
    ]
  }'
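
The tier list reads as a first-match lookup followed by simple decimation. The sketch below shows one way such a configuration could be evaluated. It assumes tiers are ordered by distance, that boundaries are inclusive, and that decimation keeps every Nth event by sequence number; none of these details are confirmed by the API, and the function names are hypothetical:

```python
TIERS = [
    {"max_distance_km": 20, "rate_divisor": 1},
    {"max_distance_km": 100, "rate_divisor": 5},
    {"max_distance_km": None, "rate_divisor": 10},  # catch-all tier
]


def rate_divisor(distance_km: float, tiers=TIERS) -> int:
    """Return the divisor of the first tier whose distance limit covers the client."""
    for tier in tiers:
        limit = tier["max_distance_km"]
        if limit is None or distance_km <= limit:
            return tier["rate_divisor"]
    raise ValueError("no tier matched; add a catch-all tier with max_distance_km = None")


def should_deliver(sequence_no: int, distance_km: float, tiers=TIERS) -> bool:
    """Keep every Nth event, where N is the divisor for this client's distance."""
    return sequence_no % rate_divisor(distance_km, tiers) == 0


for distance in (10, 50, 500):
    kept = sum(should_deliver(i, distance) for i in range(100))
    print(distance, "km ->", kept, "of 100 events")
```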

Profiling tools: admin API metrics, pg_stat_statements, and Linux perf

Tuning without profiling is guesswork. Use these three tools to identify the actual bottleneck before applying changes.

CloudTAK admin API metrics

The GET /api/admin/metrics endpoint returns a JSON object with real-time counters. For ongoing monitoring, scrape it into Prometheus using the /api/admin/metrics/prometheus endpoint and visualize in Grafana. The most diagnostic fields:

  • cot_queue_depth: if consistently > 0, the database write path is the bottleneck.
  • db_pool_waiting: connections queued for a pool slot; > 0 means the connection pool (DB_POOL_MAX on the application side, or the PgBouncer pool behind it) is undersized.
  • ws_backpressure_paused: count of clients currently paused due to slow reads; indicates a network or client-side bottleneck rather than a server-side one.
  • event_loop_lag_ms: Node.js event loop lag; values above 100ms indicate the main thread is CPU-saturated and horizontal scaling is needed.
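
The rules above can be turned into a small triage helper. A minimal sketch, assuming the metrics endpoint returns a flat JSON object keyed by the field names listed; the payload shape and the function name are assumptions:

```python
from typing import Dict, List


def diagnose(metrics: Dict[str, float]) -> List[str]:
    """Map one metrics sample to the likely bottlenecks described above."""
    findings = []
    if metrics.get("cot_queue_depth", 0) > 0:
        findings.append("database write path")
    if metrics.get("db_pool_waiting", 0) > 0:
        findings.append("connection pool undersized")
    if metrics.get("ws_backpressure_paused", 0) > 0:
        findings.append("slow clients or network")
    if metrics.get("event_loop_lag_ms", 0) > 100:
        findings.append("event loop CPU-saturated")
    return findings


sample = {"cot_queue_depth": 120, "db_pool_waiting": 4,
          "ws_backpressure_paused": 0, "event_loop_lag_ms": 35}
print(diagnose(sample))
```

A single sample can mislead, particularly for cot_queue_depth, so apply the check to several consecutive polls before acting on it.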

PostgreSQL pg_stat_statements

Enable the pg_stat_statements extension to identify the costliest queries:

-- Enable extension (first add shared_preload_libraries = 'pg_stat_statements' to postgresql.conf and restart PostgreSQL)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 queries by total execution time
SELECT
    left(query, 80) AS query_snippet,
    calls,
    round(total_exec_time::numeric, 2) AS total_ms,
    round(mean_exec_time::numeric, 2)  AS mean_ms,
    rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Reset stats after tuning to measure improvement
SELECT pg_stat_statements_reset();

The queries most commonly found at the top of this list on under-tuned CloudTAK deployments are: the track UPSERT query (missing composite index on (uid, timestamp)), the spatial AOI filter query (missing GIST index on the geometry column), and the retention cleanup DELETE (sequential scan when the timestamp index is missing).

Linux perf and flamegraphs for Node.js CPU profiling

If event_loop_lag_ms is elevated, use Linux perf to generate a CPU flamegraph of the CloudTAK Node.js process:

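# Get the host PID of the container's main process (assumed here to be the CloudTAK Node.js process).
# A PID returned by pgrep inside the container belongs to the container's PID namespace and is not valid on the host.
CLOUDTAK_PID=$(docker inspect --format '{{.State.Pid}}' cloudtak)

# Record 30 seconds of CPU samples (requires perf installed on host, typically run as root).
# JavaScript frames resolve to function names only if Node.js runs with --perf-basic-prof.
perf record -F 99 -p "$CLOUDTAK_PID" -g -- sleep 30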

# Generate flamegraph (requires FlameGraph tools)
perf script | stackcollapse-perf.pl | flamegraph.pl > cloudtak-flame.svg

Common hot paths found in CloudTAK flamegraphs: JSON serialization of large CoT payloads (mitigated by payload size limits), WebSocket frame encoding for high-frequency fan-out (mitigated by spatial filtering), and geospatial distance calculations for AOI evaluation (mitigated by pushing AOI filtering to PostGIS rather than evaluating in JavaScript).

Benchmark results: before and after tuning for the 500-client scenario

The following results were produced on a 4 vCPU / 8 GB RAM Ubuntu 22.04 server running CloudTAK v2.4.1, PostgreSQL 15, and PgBouncer 1.22. Load was simulated using tak-load-test with 450 infantry clients (position update at 0.05 Hz) and 50 UAV feeds (position + metadata at 5 Hz). All tuning changes from this guide were applied.

Metric                     Before tuning   After tuning   Improvement
CoT latency p50            3,800 ms        310 ms         -92%
CoT latency p99            18,200 ms       890 ms         -95%
DB connection errors/min   47              0              -100%
Server CPU utilization     94%             38%            -60%
Outbound bandwidth         340 Mbps        118 Mbps       -65%
PostgreSQL memory          1.8 GB          680 MB         -62%

The single biggest contributor to latency reduction was PgBouncer — eliminating connection pool exhaustion dropped median latency from 3.8 seconds to under 800ms before any other change. Spatial filtering was the single biggest contributor to bandwidth reduction. The remaining latency improvement to 310ms p50 came from the PostgreSQL memory parameter tuning and the composite index on (uid, timestamp DESC).

Capacity planning note: After full tuning, the 4 vCPU / 8 GB server was running at 38% CPU with 500 clients. Extrapolating linearly, that leaves headroom to approximately 1100–1200 clients before CPU saturation. Treat this as an upper bound, because fan-out grows with clients × events rather than with client count alone. For production deployments expecting to approach that ceiling, deploy two CloudTAK instances behind HAProxy before hitting it; adding instances reactively in the middle of an operation is risky.
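
The headroom estimate is a straight-line extrapolation from one measurement. The sketch below reproduces it. The quoted range is consistent with treating about 85 to 90% CPU as the practical ceiling, but that target is an inference rather than something the benchmark states, and the function name is illustrative:

```python
def clients_at_cpu(clients_now: int, cpu_now_pct: float, target_cpu_pct: float) -> float:
    """Linear extrapolation of client count to a target CPU utilisation."""
    return clients_now * target_cpu_pct / cpu_now_pct


for target in (85, 90, 100):
    print(f"{target}% CPU -> {clients_at_cpu(500, 38, target):.0f} clients")
```

Re-run the calculation with your own measured client count and CPU figure after tuning, since the slope depends heavily on the mix of infantry and UAV traffic.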