A large language model that needs a cloud endpoint is useless at the tactical edge, because the network is the least reliable thing in the operational environment. A denied, degraded, intermittent, or limited-bandwidth link cannot carry a request to a remote model and return an answer inside the timeline a decision allows – and sending operational text off the platform creates both an emissions signature and a data-exposure risk. The alternative is to run the model where the data is: on the device, with the radio off, generating every token locally. This article covers how that is actually done – model selection, quantization, hardware budgets, the key-value cache, and the runtime details that decide whether on-device inference is responsive or unusably slow.

Why on-device, and what it costs you

The case for on-device inference is straightforward: it removes the network from the critical path. After the model is loaded into memory, there is no remote call, no dependency on a satellite or mesh-radio link, and no telemetry leaving the platform. Latency becomes predictable – a function of the hardware, the model, and the prompt length, not of a contested link. Prompts and outputs that may contain locations, unit designations, or intent never leave the device.

The cost is capability. A model that fits on a single edge device is far smaller than a frontier cloud model, and a 7B–8B parameter model is meaningfully less capable at open-ended reasoning than a model two orders of magnitude larger. The engineering discipline of on-device LLM work is therefore task scoping: matching a narrow, well-defined task to the smallest model that can perform it reliably, rather than expecting a small model to be a general assistant. Summarizing a contact report, classifying inbound messages by priority, extracting structured fields from free text, or answering questions over a local document set are all tasks a small model handles well. Multi-hop reasoning across a large, ambiguous context is where small models break down.

Model selection: smaller than you think

The instinct to load the largest model the hardware can physically hold is the most common mistake in edge LLM deployment. The largest model leaves no memory headroom for the context window, the key-value cache, or any concurrent workload sharing the device – and in autoregressive generation, a bigger model means fewer tokens per second. The correct starting point is the smallest model family plausibly capable of the task, validated against a held-out set of real examples before anything else is optimized.

As a rough mapping: 1B–3B parameter models are appropriate for templated extraction, classification, and short-form transformation; they run on modest hardware and generate quickly. 7B–8B models are the workhorse class for summarization, retrieval-augmented question answering, and constrained reasoning, and they fit comfortably on mid-tier edge accelerators once quantized. Beyond roughly 13B parameters, the memory and bandwidth demands generally exceed what a single ruggedized edge device can sustain at an interactive token rate, and the marginal capability rarely justifies the cost at the edge.

Quantization: the central trade

Quantization is the technique that makes on-device LLMs practical. A model is typically trained and distributed at 16-bit floating-point precision, but most of that precision is not needed for inference. Quantization re-encodes the weights at lower bit widths – 8, 5, or 4 bits, or lower still – which shrinks the memory footprint roughly in proportion and increases throughput, because the binding constraint on generation is memory bandwidth and fewer bytes per weight means fewer bytes to move per token.
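As a rough illustration of the footprint arithmetic, the sketch below estimates weight memory for a 7B-parameter model at several nominal bit widths. The function name is illustrative, and the figures are nominal: real quantized files carry per-block scale metadata and so run somewhat larger than the bit width alone suggests.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal widths only; block-quantized formats add scale metadata on top.
for bits in (16, 8, 5, 4):
    print(f"{bits:>2}-bit: {weight_footprint_gb(7e9, bits):.1f} GB")
```

The same function applies to any parameter count, so it can be used to compare a 3B and a 7B candidate before either is downloaded.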

The accuracy cost is non-linear, and understanding its shape is what separates a sound deployment from a brittle one. Eight-bit quantization is nearly lossless for almost every task. Four-bit quantization using a modern K-quant scheme (commonly labelled Q4_K_M) typically costs 1–3 percent on reasoning benchmarks while roughly halving the footprint relative to 8-bit – this is the default sweet spot for edge deployment. Below 4 bits, degradation accelerates: 3-bit and 2-bit builds can collapse on reasoning tasks even when they still produce fluent text, which makes them dangerous precisely because the failure is not obvious from a casual read.

The decisive point is that this trade is task-dependent and must be measured, not assumed. For extractive and templated tasks – pull the grid reference out of this message, classify this report – a 4-bit model performs close to the full-precision baseline because the task does not exercise the fragile parts of the model. For multi-step reasoning, the same 4-bit build may lose enough to matter. The only way to know is to run a held-out evaluation set through the full-precision baseline, the 8-bit build, and the 4-bit build, compare task-relevant metrics, and accept the smallest build whose loss is inside the operational tolerance. Choosing the right edge accelerator for that model is its own discipline – see our analysis of edge AI hardware selection for defense.
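The selection rule described above can be sketched in a few lines. Everything here is illustrative: the function name, the build labels, and the scores are hypothetical placeholders, not measured results. In practice the scores would come from running the held-out set through each build.

```python
from typing import Dict


def pick_smallest_build(
    scores: Dict[str, float],
    sizes_gb: Dict[str, float],
    baseline: str,
    max_loss: float,
) -> str:
    """Return the smallest build whose score is within max_loss of the baseline."""
    floor = scores[baseline] - max_loss
    acceptable = [name for name, score in scores.items() if score >= floor]
    return min(acceptable, key=lambda name: sizes_gb[name])


# Hypothetical held-out results for an extraction task (exact-match accuracy).
scores = {"fp16": 0.94, "q8": 0.94, "q4": 0.92}
sizes_gb = {"fp16": 14.0, "q8": 7.0, "q4": 3.5}

print(pick_smallest_build(scores, sizes_gb, baseline="fp16", max_loss=0.03))
print(pick_smallest_build(scores, sizes_gb, baseline="fp16", max_loss=0.01))
```

With these placeholder numbers, a tolerance of 0.03 admits the 4-bit build, while a tolerance of 0.01 forces the 8-bit build. The tolerance is an operational decision, not a property of the model.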

Quantization-aware training versus post-training quantization

Most edge deployments use post-training quantization: take an existing model and quantize the weights directly, with optional calibration on a small representative dataset. It is fast, requires no training infrastructure, and is good enough at 4 bits for the majority of tasks. Quantization-aware training – fine-tuning the model with quantization simulated in the forward pass – recovers more accuracy at very low bit widths but requires a training pipeline and suitable training data. For most fielded systems, post-training 4-bit quantization with calibration is the pragmatic choice; reserve quantization-aware training for the cases where sub-4-bit operation is forced by hardware limits.
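To make "quantize the weights directly" concrete, here is a minimal sketch of the simplest post-training approach: symmetric round-to-nearest quantization of one block of weights with a shared scale. This is an assumption-laden teaching example, not the K-quant scheme used in GGUF files and not a calibrated method; the function names and the sample weights are invented for illustration.

```python
from typing import List, Tuple


def quantize_block(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one block with a shared scale."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit codes
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0      # avoid dividing by zero
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes: List[int], scale: float) -> List[float]:
    """Reconstruct approximate weights from integer codes and the block scale."""
    return [c * scale for c in codes]


# Invented sample weights for one small block.
block = [0.12, -0.40, 0.05, 0.33, -0.07, 0.21, -0.28, 0.02]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(codes, f"max abs error = {worst:.4f}")
```

The reconstruction error per weight is bounded by half the block scale, which is why a single outlier in a block hurts every other weight in it. Production schemes spend extra bits on finer-grained scales to limit that effect.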

Hardware budgets and the memory-bandwidth wall

The hardware question for on-device LLMs is dominated by memory, not compute. Autoregressive generation produces one token at a time, and producing each token requires reading the entire set of model weights from memory. Throughput in tokens per second is therefore bounded by memory bandwidth divided by model size in bytes far more often than by raw arithmetic throughput. A device with abundant FLOPS but modest memory bandwidth will be slow on generation regardless of its compute rating.
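The bandwidth bound can be written down directly. The sketch below assumes a device with about 100 GB/s of memory bandwidth, a figure chosen only for illustration, and compares a 7B model at 16-bit and at 4-bit nominal size. The function name is hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode rate when each token reads all weights once."""
    return bandwidth_gb_s / model_gb


# Assumed for illustration: ~100 GB/s bandwidth; 14 GB (16-bit) vs 3.5 GB (4-bit).
for label, size_gb in (("16-bit", 14.0), ("4-bit", 3.5)):
    print(f"{label}: at most {decode_ceiling_tok_s(100.0, size_gb):.1f} tokens/s")
```

This is a ceiling, not a prediction: KV cache reads, kernel overhead, and thermal limits all pull the measured rate below it. It is still useful for ruling out hardware that cannot reach the target rate even in principle.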

The practical floor for a responsive 7B-class model is roughly 8 GB of memory accessible to the inference accelerator and enough bandwidth to sustain 10–20 tokens per second. A Jetson Orin NX in its 8 GB or 16 GB configuration sits squarely in this range, as does a small ruggedized x86 system with an integrated or discrete GPU. CPU-only inference is entirely viable for 1B–3B models and produces a few tokens per second on 7B models – acceptable for batch summarization that runs without an operator waiting, unacceptable for interactive use. Prefill (processing the prompt) and decode (generating the response) have different bottlenecks: prefill is compute-bound and parallel, decode is bandwidth-bound and sequential, so they must be measured separately when sizing hardware.
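One way to measure the two phases separately is to time a streaming generation call: the delay before the first token approximates prefill, and the rate of the remaining tokens approximates decode. The sketch below assumes the runtime exposes some callable that yields tokens as they are produced; `fake_stream` is a stand-in with artificial delays, not a real runtime API.

```python
import time
from typing import Callable, Iterable, Tuple


def time_generation(
    stream: Callable[[str], Iterable[str]], prompt: str
) -> Tuple[float, float]:
    """Return (seconds to first token, decode tokens per second) for one request."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream(prompt):
        count += 1
        if first is None:
            first = time.perf_counter()   # prefill ends roughly here
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = end - first
    decode_rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return first - start, decode_rate


def fake_stream(prompt: str):
    """Stand-in for a real runtime: a slow prefill, then steady decode."""
    time.sleep(0.05)                      # pretend prompt processing
    for i in range(5):
        if i:
            time.sleep(0.01)              # pretend per-token decode
        yield f"tok{i}"


ttft, rate = time_generation(fake_stream, "summarize this report")
print(f"time to first token: {ttft:.3f} s, decode: {rate:.0f} tokens/s")
```

Running this with short and long prompts shows how prefill time scales with prompt length while the decode rate stays roughly constant, which is the distinction that matters when sizing hardware.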

The key-value cache: the hidden memory cost

Weight size is the memory cost everyone budgets for; the key-value (KV) cache is the one that catches teams out. During generation the model caches the attention keys and values for every token already processed so it does not recompute them, and that cache grows linearly with context length. For a 7B model with standard multi-head attention at 16-bit precision the KV cache costs on the order of 0.5 MB per token, so an 8,000-token context adds roughly 4 GB on top of the weights – an amount that can match or exceed the quantized weights themselves. Models that use grouped-query attention store fewer key-value heads and need proportionally less per token, so the figure should be computed for the specific model. On constrained hardware, the KV cache, not the weights, is what caps the usable context length. Quantizing the KV cache to 8 or 4 bits cuts it to a half or a quarter of this size and is often the difference between a workable context budget and an out-of-memory failure. The corollary is to set the context length to the smallest value the task needs rather than the largest the model supports.
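The per-token figure follows from the model's shape. The sketch below assumes a 7B-class configuration of 32 layers and 32 key-value heads of dimension 128, which is a common layout but should be checked against the actual model card. The function name is illustrative.

```python
def kv_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_value: float
) -> float:
    """Bytes of keys plus values stored for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed shape: 32 layers, 32 KV heads, head dimension 128, 16-bit values.
per_token = kv_bytes_per_token(32, 32, 128, 2)
print(f"{per_token / 2**20:.2f} MiB per token")
print(f"{per_token * 8000 / 1e9:.2f} GB at 8,000 tokens")

# Same shape with 8 KV heads (grouped-query attention): one quarter the cost.
print(f"{kv_bytes_per_token(32, 8, 128, 2) / 2**20:.3f} MiB per token with 8 KV heads")
```

Changing `bytes_per_value` to 1 or 0.5 models an 8-bit or 4-bit cache, which makes the effect of cache quantization easy to compare against the available memory.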

Key insight: The binding constraint on on-device LLM deployment is rarely the size of the quantized weights – it is the key-value cache, which grows with context length and routinely exceeds the weight footprint. Budget the KV cache explicitly, quantize it when memory is tight, and set the context window to the smallest size the task requires. A deployment that fits the weights but ignores the cache will fail the first time an operator pastes a long document.
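Budgeting the cache explicitly means solving for the context length that fits, rather than discovering the limit in the field. The sketch below does that under stated assumptions: an 8 GB device, about 3.5 GB of 4-bit weights, a 1.5 GB reserve for the runtime and operating system, and 524,288 bytes of 16-bit cache per token. All four numbers are illustrative.

```python
def max_context_tokens(
    device_bytes: float,
    weight_bytes: float,
    reserve_bytes: float,
    kv_bytes_per_token: float,
) -> int:
    """Largest context whose KV cache fits after weights and a fixed reserve."""
    free = device_bytes - weight_bytes - reserve_bytes
    return max(0, int(free // kv_bytes_per_token))


GB = 1e9
# Assumed budget: 8 GB device, 3.5 GB weights, 1.5 GB reserved for everything else.
print(max_context_tokens(8 * GB, 3.5 * GB, 1.5 * GB, 524_288))       # 16-bit cache
print(max_context_tokens(8 * GB, 3.5 * GB, 1.5 * GB, 524_288 / 2))   # 8-bit cache
```

The result is the value to enforce as the configured context limit, so that an over-long input is rejected or truncated by the application instead of exhausting memory mid-generation.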

Runtime, packaging, and verified offline operation

The on-device runtime is the layer that loads the quantized weights, manages the KV cache, and exposes a generation interface to the application. A llama.cpp-based engine is the common choice because it runs the GGUF quantized format directly, supports CPU and accelerator back-ends, and has a small dependency footprint that suits a ruggedized image. Whatever the runtime, the model should be pinned in memory after first load so that the seconds-long load cost is paid once rather than on every request, and the application should treat generation latency as a first-class metric surfaced to the operator.

Packaging is where on-device claims are won or lost. The model file, the runtime, the tokenizer, and the prompt templates must all be present on the device image – nothing fetched at runtime. The only honest test of offline operation is to run the entire workflow with the radio off, in airplane mode, or on an isolated network, and confirm the system starts and answers with no reachable endpoint. Any hidden call – a tokenizer download, a telemetry beacon, a license check – must be found and removed, because at the edge it will fail silently and take the capability with it. Model updates ship by physical media or by an authenticated local sync, never by an assumption of connectivity.
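Alongside the radio-off test, a software guard can help locate hidden calls during development. The sketch below temporarily replaces the connect methods on Python's `socket.socket` so that any connection attempt raises immediately. It is a development aid under a clear assumption: it only sees connections made through Python's socket module in the same process, so it complements the physical test and does not replace it. `hidden_call` is a stand-in for an unexpected dependency.

```python
import socket
from contextlib import contextmanager


class NetworkAttempt(RuntimeError):
    """Raised when code under test tries to open a network connection."""


@contextmanager
def no_network():
    """Make any socket connect attempt fail loudly inside the with-block."""
    original_connect = socket.socket.connect
    original_connect_ex = socket.socket.connect_ex

    def blocked(self, address, *args):
        raise NetworkAttempt(f"unexpected connection attempt to {address!r}")

    socket.socket.connect = blocked
    socket.socket.connect_ex = blocked
    try:
        yield
    finally:
        # Always restore the real methods, even if the workflow raised.
        socket.socket.connect = original_connect
        socket.socket.connect_ex = original_connect_ex


def hidden_call():
    """Stand-in for a hidden dependency such as a telemetry beacon."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.connect(("192.0.2.1", 443))  # documentation-only address


try:
    with no_network():
        hidden_call()
except NetworkAttempt as exc:
    print("caught:", exc)
```

Wrapping the full load-and-generate workflow in `no_network()` during testing turns a silent field failure into a stack trace that points at the offending library.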

On-device LLMs also widen the attack surface in ways a cloud model does not, since the weights, prompts, and any retrieval corpus live on a device that may be captured. Prompt-injection through ingested documents, data exfiltration through crafted outputs, and tampering with the model file are all in scope and must be designed against – a subject covered in depth in our guide to LLM security for defense AI systems.
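One small piece of the tamper-resistance problem, checking the model file before it is loaded, can be sketched with the standard library. The assumption here is that a known-good SHA-256 digest is available from a source the device already trusts, such as a signed manifest; how that manifest is protected is outside this sketch, and a bare hash offers nothing if the reference value can be altered too. The function names are illustrative.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large model files are not read whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def model_is_untampered(path: Path, expected_hex: str) -> bool:
    """Compare the file digest with a known-good value in constant time."""
    return hmac.compare_digest(sha256_file(path), expected_hex.lower())
```

A runtime wrapper would call `model_is_untampered` before handing the path to the inference engine and refuse to start on a mismatch, logging the event locally for later review.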

This analysis was prepared by Corvus Intelligence engineers who build mission-critical edge AI and ISR systems for defense and government organizations.