Large language models running entirely on a local edge node — no internet, no cloud API, no data leaving the platform — are no longer a research curiosity. They are an operational reality for defense programs that need AI-assisted command and control, intelligence summarization, or autonomous decision support in environments where connectivity is a liability rather than an asset. This article covers the full stack: why cloud LLMs fail tactically, which hardware and models to choose, how to quantize and serve them efficiently, and how to secure the inference service on a classified edge node.
1. the connectivity problem
Cloud LLMs like GPT-4o or Claude Sonnet assume a stable, low-latency broadband connection. In tactical environments that assumption fails in at least three structurally distinct ways.
EMCON (Emissions Control) is the first and most fundamental constraint. When a unit goes silent to avoid electronic detection, all radio transmissions stop — satellite uplinks, cellular modems, tactical data links. A C2 assistant that routes every query to a cloud API becomes inoperable the moment EMCON is ordered. For dismounted infantry, special operations teams, and any platform operating in a contested electromagnetic environment, this is not an edge case; it is the default operating condition for significant portions of a mission.
MANET (Mobile Ad Hoc Network) bandwidth is the second constraint. Tactical MANETs typically operate at 1–10 Mbps aggregate with latency measured in hundreds of milliseconds and packet loss rates that make TCP streams unreliable. A single LLM API call carrying a 2,000-token context window at 4 bytes per token consumes 8 KB in the request alone; the response at 500 tokens adds another 2 KB. In isolation that is manageable. In practice, a network with 30 simultaneous users sending LLM queries every 30 seconds saturates a 1 Mbps link in under a minute, competing with voice, video, and command traffic.
Denied communications — jamming, terrain masking, or deliberate network isolation in contested areas — complete the picture. A GPS-denied, comms-denied forward operating position needs AI assistance precisely when cloud connectivity is least available. Offline LLM inference removes the dependency entirely: the model weights sit on the local node, inference runs on the local processor, and the application works identically whether the node is connected to HQ or isolated in a mountain valley.
There is a fourth issue that applies even when connectivity exists: OPSEC and data handling. Routing sensitive operational data — patrol routes, target nominations, intelligence reports — through commercial cloud APIs means that data traverses commercial infrastructure, is processed on commercial servers, and is subject to commercial terms of service and logging practices. For classified or sensitive-but-unclassified workloads, this is often simply not permissible. On-premise edge inference keeps the data on the platform.
2. hardware tiers for edge LLM inference
Not all edge hardware can run a useful LLM. The minimum viable configuration for a 7B-parameter model at interactive speeds (10+ tokens/second) requires roughly 8 GB of memory bandwidth, fast enough to stream model weights. The practical tiers in 2026 are as follows.
NVIDIA Jetson Orin NX 16GB is the primary recommendation for military edge LLM inference. 16 GB of unified LPDDR5 with 102 GB/s memory bandwidth, 1024 CUDA cores, and the full JetPack software stack (llama.cpp with CUDA backend compiles natively). Llama 3.1 8B at Q4_K_M quantization occupies approximately 5 GB and delivers 12–18 tokens/second — fast enough for interactive C2 queries. Power draw for sustained LLM inference is 15–20 W, within the envelope of most tactical platforms. The Orin NX runs at -25 to +85 °C (industrial variant), has a mature ecosystem of ruggedized carrier boards, and is available through defense channels.
Hailo-10 is Hailo's first LLM-capable NPU, delivering 40 TOPS at under 5 W with a dedicated transformer engine. Early benchmarks show Llama 3.2 3B at interactive speeds. For 7–8B models the Hailo-10 is more constrained than the Orin NX — memory bandwidth is the bottleneck — but the power profile is exceptional for genuinely power-constrained nodes (patrol base kits, small UAS command nodes, body-worn compute). Toolchain maturity for LLMs is still developing; budget engineering time for integration versus the more mature Jetson path.
Intel Arc A-series (Arc A770, 16GB) sits in the middle tier for vehicle-mounted or shelter-based edge servers. The A770 at 16GB GDDR6 delivers roughly 200 GB/s memory bandwidth and runs llama.cpp with the SYCL/OpenCL backend or vLLM with the XPU backend. Performance on Llama 3.1 8B is approximately 20–30 tokens/second at Q4_K_M. Power draw is 30–35 W. The trade-off versus Jetson is form factor: Arc requires a PCIe slot and a host CPU, making it a shelf unit rather than a module.
CPU-only (ARM Cortex / Apple Silicon equivalent class) remains a viable fallback for platforms that carry no discrete accelerator. An ARM Cortex-X4 cluster at 8 cores with LPDDR5 achieves roughly 3–6 tokens/second on Llama 3.1 8B Q4_K_M — below interactive threshold but usable for batch processing, background intelligence summarization, or asynchronous tasks that can tolerate multi-second latency. The key insight is that CPU-only inference works: it is slow, but it works offline on any hardware. For truly resource-constrained nodes, Llama 3.2 1B or 3B at Q4_K_M reduces requirements proportionally.
| Platform | Memory | Llama 8B tok/s | TDP (LLM) | Form factor |
|---|---|---|---|---|
| Jetson Orin NX 16GB | 16 GB LPDDR5 | 12–18 | 15–20 W | SOM module |
| Hailo-10 + host ARM | 8–16 GB shared | 6–10 (3B model) | <5 W NPU | M.2 / mPCIe |
| Intel Arc A770 16GB | 16 GB GDDR6 | 20–30 | 30–35 W | PCIe card |
| ARM CPU-only (8-core) | 8–16 GB LPDDR5 | 3–6 | 5–10 W | Any SBC |
3. model selection: Llama, Qwen, Mistral
The 7–8B parameter class is the operational sweet spot for military edge LLM inference in 2026. Models in this class fit in 4–6 GB at 4-bit quantization, run at interactive speeds on Jetson Orin NX, and score well enough on reasoning benchmarks to be genuinely useful for tactical tasks — summarizing intelligence reports, generating SALUTE format from raw observations, drafting fragmentary orders, answering doctrine queries.
Llama 3.1 8B (Meta AI, Apache 2.0 license) is the baseline recommendation. MMLU score of approximately 73% — competitive with models twice its size from two years prior. Context window of 128K tokens allows long intelligence documents and multi-turn C2 conversations without truncation. The instruct-tuned variant follows instructions reliably and responds well to structured output prompting. llama.cpp GGUF files are available in all quantization levels from the standard Hugging Face repositories, enabling air-gapped download and local deployment.
Qwen2.5 7B (Alibaba, Apache 2.0) scores slightly higher than Llama 3.1 8B on MMLU at approximately 74.2%, with notably stronger multilingual performance — relevant for coalition operations involving non-English-speaking partners. The model handles code generation and structured output reliably. Country-of-origin (China) is a relevant consideration for programs with strict sourcing requirements; verify with your security team before deploying on classified networks.
Mistral 7B v0.3 (Mistral AI, Apache 2.0) is the lightest-weight strong performer: approximately 62–65% MMLU, smaller KV cache than Llama, and efficient grouped-query attention that lowers memory bandwidth requirements. It is the preferred option for CPU-only nodes where every token/second counts. The lower MMLU score reflects its training dataset focus rather than a fundamental capability gap for most tactical tasks — for single-turn queries with structured output, the performance difference versus Llama 3.1 8B is operationally negligible.
4. quantization pipeline
Full-precision (FP16) 8B models require approximately 16 GB of GPU memory — too large for most tactical edge nodes. Quantization reduces the bit depth of model weights, trading a small quality loss for a large reduction in memory footprint and an improvement in inference speed.
GGUF (llama.cpp native format) is the recommended format for edge LLM deployment. It supports CPU, mixed CPU/GPU, and pure GPU inference from the same binary. Quantization levels are expressed as Q-codes: Q4_K_M (4-bit, K-quant method, medium) is the standard tactical choice — Llama 3.1 8B at Q4_K_M weighs approximately 4.9 GB and scores within 1–2% of FP16 on most benchmarks. Q8_0 (8-bit) weighs approximately 8.5 GB and is nearly lossless — preferred when the node has memory headroom and maximum output quality is required. Q2_K and Q3_K_S save memory but degrade structured output reliability noticeably; avoid below Q4 for operational use.
AWQ (Activation-aware Weight Quantization) targets NVIDIA GPU inference. It applies a per-channel scaling factor calibrated on a representative dataset before quantizing to INT4, preserving the most salient weights. AWQ models load via AutoAWQ or vLLM and deliver better perplexity at 4-bit than naive INT4 quantization. For Jetson Orin, the CUDA backend in llama.cpp handles GGUF equally well; AWQ becomes relevant when running on Orin AGX or a server-class edge system where vLLM's throughput optimizations (continuous batching, PagedAttention) matter for multiple concurrent users.
GPTQ is the older GPU quantization standard, supported by AutoGPTQ and a wide range of serving frameworks. Quality is marginally below AWQ at equivalent bit depth but tooling is mature. For new deployments in 2026, AWQ is preferred over GPTQ for GPU inference; GGUF remains the default for Jetson and mixed-hardware environments.
5. inference runtimes
llama.cpp is the foundation of edge LLM inference. Written in C++ with backends for CUDA (NVIDIA), Metal (Apple), OpenCL, SYCL (Intel), and pure CPU, it compiles on Jetson JetPack, Ubuntu ARM, and virtually any Linux system. The GGUF format is native. Latency on Llama 3.1 8B Q4_K_M: 12–18 tok/s on Orin NX 16GB with the CUDA backend, 3–6 tok/s on CPU-only ARM. Memory usage is predictable and well-documented. For a single-user, single-model edge node, llama.cpp accessed via its HTTP server (`llama-server`) is the correct default.
Ollama wraps llama.cpp with a local REST API, a CLI, and model management (pull, list, delete). The REST API (`POST /api/generate`, `POST /api/chat`) is straightforward to integrate into a C2 application. On an air-gapped node, disable the model registry and pre-load models from a local GGUF file: `ollama create mymodel -f Modelfile`. Ollama adds roughly 50–100 ms of overhead per request compared to raw llama.cpp; for interactive C2 use this is negligible. Run Ollama as a systemd service under a dedicated `ollama` user account.
vLLM is designed for high-throughput multi-user serving on NVIDIA GPUs. It implements PagedAttention (near-zero KV cache waste) and continuous batching (multiple requests processed simultaneously). On an edge server with an RTX 4090 or A-series GPU serving 10+ simultaneous users, vLLM outperforms llama.cpp by 3–5x in requests/second. For single-user tactical nodes, the overhead of vLLM's architecture is not justified. For a shared patrol-base inference server supporting a platoon-level network, vLLM is the right choice.
ExLlamaV2 specializes in extremely fast single-user NVIDIA inference using custom CUDA kernels and EXL2 quantization format. On a Jetson Orin AGX 64GB, ExLlamaV2 with Llama 3.1 8B EXL2 (4-bit) achieves approximately 30–40 tok/s — the fastest single-GPU edge throughput available. The trade-off is ecosystem: EXL2 format is ExLlamaV2-specific, tooling is more complex, and the project is maintained by a small team. For programs that need maximum throughput on a high-end Jetson, ExLlamaV2 is worth evaluating; for standard tactical nodes, llama.cpp or Ollama is simpler and more maintainable.
6. TAKpilot edge mode: automatic fallback to local LLM
TAKpilot is Corvus Intelligence's AI layer for ATAK and WinTAK environments. It connects a C2 operator's TAK client to an LLM backend for natural-language situation report summarization, course-of-action generation, and SALUTE report parsing. The dual-backend architecture is central to its design for military edge deployment.
In connected mode — forward HQ with a satellite uplink, a tactical operations center with fiber backhaul — TAKpilot routes inference requests to Claude Sonnet via the Anthropic API. This provides maximum model quality: long context, strong reasoning, reliable structured output, and up-to-date knowledge. The API call overhead (network latency plus model inference) is typically 1–3 seconds for a C2 query, acceptable for most HQ-level tasks.
TAKpilot continuously monitors connectivity state using a lightweight heartbeat against a configured endpoint. When connectivity drops below a configurable threshold — or the operator activates EMCON mode via a toggle in the TAK plugin UI — TAKpilot switches automatically to the local Llama 8B quantized model running on the same edge compute node. The switch is transparent to the application layer: the same REST interface, the same JSON response schema, no restart required. Switching latency is under 200 ms.
In edge mode, TAKpilot applies a set of automatic optimizations for the local model: system prompts are shortened to under 200 tokens (conserving context for the actual query), structured output is enforced via grammar constraints (see section 7), and response length is capped to reduce inference time. The operator sees a small indicator in the plugin UI showing the active backend — cloud or edge — but the workflow is otherwise identical.
7. prompt engineering for constrained models
A 7B quantized model is not GPT-4. It requires more careful prompt engineering to produce reliable, structured output for operational use. Three disciplines matter most.
Short system prompts. Every token in the system prompt consumes context window space and adds to the prefill computation. A cloud LLM can absorb a 2,000-token system prompt with role definition, doctrine references, and extensive examples. A Llama 8B Q4_K_M on an edge node should have a system prompt under 200 tokens. Focus on output format, role, and the most critical constraints — omit examples and background, which the model's training already covers.
Structured output with JSON schema enforcement. A 7B model is significantly more likely to produce malformed JSON than a frontier model. The mitigation has two layers: grammar-constrained generation (llama.cpp GBNF grammars, Ollama's format parameter) forces the model to produce only syntactically valid JSON; and application-layer validation with retry logic catches semantic errors. A typical implementation retries up to 3 times, feeding the validation error back as a correction prompt: "Your previous response was invalid because: [error]. Please correct and return only the JSON object." This combination achieves near-100% structured output compliance in practice.
Few-shot examples in the user turn, not the system prompt. Placing 1–2 input/output examples directly in the user message (rather than the system prompt) is more token-efficient and produces better results on smaller models. The model sees the examples immediately before generating its response, reducing the chance it forgets the format mid-generation.
Key insight: The gap between a frontier cloud LLM and a 7B quantized edge model is real but manageable for well-defined tactical tasks. Summarizing a SALUTE report, formatting a FRAGO, or classifying a short intelligence fragment are structured problems with bounded output spaces. Careful prompt engineering, grammar-constrained generation, and retry logic close most of the quality gap. Avoid deploying edge LLMs for open-ended reasoning tasks that require world knowledge, long-horizon planning, or multi-document synthesis — those tasks still benefit from routing to a cloud backend when available.
8. security for edge LLM deployment
An LLM running on a military edge node is a software service processing potentially sensitive operational data. It requires the same security discipline as any other mission-critical service.
Model integrity verification. LLM weights are large binary files delivered over potentially untrusted channels (USB drives, air-gap transfer). Before loading, compute a SHA-256 hash of every model file and compare against a known-good manifest signed by your PKI. An adversary who can substitute a model file — with a backdoored model that exfiltrates via steganographic output patterns, or simply with a degraded model to degrade capability — has a meaningful attack surface if integrity checking is absent. Verify on every load, not just at install time.
Service isolation. Run the inference service (llama.cpp server, Ollama) as a dedicated low-privilege OS user (e.g., llm-service) with no write access outside its working directory. On Linux, apply a restrictive AppArmor or SELinux policy that allows only the operations the service legitimately needs: read model files, listen on a Unix socket or localhost TCP port, write to a log directory. Deny all network egress on air-gapped nodes with an iptables rule scoped to the service user's UID.
Audit logging. Log every inference request and response to a tamper-evident audit trail. The minimum log record is: timestamp, requesting user/process, prompt hash (not plaintext — to preserve confidentiality while enabling auditing), response hash, inference latency, and the active model version. Write to an append-only log partition or forward to a syslog server on a separate network segment. This log is your forensic record if an operator claims the system produced an erroneous output that influenced a decision.
Network posture. On a deployed node, the LLM service should accept connections only from localhost or a designated internal network segment — never from the open network. If multiple clients (TAK plugin, C2 application, log aggregator) need to reach the service, use a Unix domain socket with filesystem permissions as the access control. Document the network topology in the system security plan and validate it against the deployed configuration at each maintenance cycle.
The combination of these four controls — integrity verification, service isolation, audit logging, and network restriction — provides a defensible security posture for an edge LLM service without requiring custom security software or significant engineering overhead. They map directly onto standard NIST SP 800-53 controls and should be straightforward to document for an ATO package.