Computer vision — the ability of a machine to interpret and understand visual data — has become one of the most operationally significant AI capabilities in modern defense systems. From UAV-mounted sensors that identify vehicles in real time to perimeter security systems that distinguish humans from animals at night, on-device computer vision is transforming how militaries collect, process, and act on visual intelligence.

Deploying computer vision on defense hardware is fundamentally different from deploying it in a commercial data center. The models must run on ruggedized, power-constrained hardware. They must operate across variable lighting, weather, and sensor conditions. They must meet latency requirements measured in milliseconds, not seconds. And they must fail gracefully rather than catastrophically when inputs fall outside the training distribution. This article covers the full pipeline: detection architecture, hardware platforms, optimization, multi-object tracking, and deployment engineering.

Detection Pipeline Architecture: From Frame to Bounding Box

A modern object detection pipeline for defense edge deployment consists of several sequential stages. The first stage is input preprocessing: resizing the incoming frame to the model's input resolution (typically 640×640 or 1280×1280 pixels), normalizing pixel values to the [0, 1] range, and optionally applying letterboxing to preserve aspect ratio without distortion. For thermal (LWIR) cameras, preprocessing includes additional normalization steps to compress the sensor's 14-bit or 16-bit dynamic range into the 8-bit or 16-bit input the model expects.
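
A minimal sketch of this stage, assuming OpenCV and NumPy and an 8-bit color (EO) frame; the function names, the 114-grey padding value, and the 640-pixel default are illustrative rather than prescriptive:

```python
import cv2
import numpy as np

def letterbox(frame: np.ndarray, size: int = 640):
    """Resize while preserving aspect ratio, padding the remainder with neutral grey."""
    h, w = frame.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(frame, (round(w * scale), round(h * scale)))
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)
    pad_x = (size - resized.shape[1]) // 2
    pad_y = (size - resized.shape[0]) // 2
    canvas[pad_y:pad_y + resized.shape[0], pad_x:pad_x + resized.shape[1]] = resized
    return canvas, scale, (pad_x, pad_y)   # scale and offsets are needed later to map boxes back

def preprocess(frame_bgr: np.ndarray, size: int = 640) -> np.ndarray:
    """BGR uint8 frame -> normalized 1x3xHxW float32 tensor in [0, 1]."""
    boxed, _, _ = letterbox(frame_bgr, size)
    rgb = cv2.cvtColor(boxed, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return np.transpose(rgb, (2, 0, 1))[None]
```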

The detection model itself — currently dominated by YOLO variants — takes the preprocessed frame as input and produces a set of candidate detections: each a bounding box (x, y, width, height), a class probability vector, and an objectness score. YOLOv8, released in 2023, introduced an anchor-free detection head that significantly improved small-object detection compared to YOLOv5 — a critical improvement for aerial reconnaissance where targets occupy only a few pixels. YOLOv9, with its Programmable Gradient Information (PGI) mechanism, further improves gradient flow during training and produces better generalization from limited labeled datasets.
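
For teams using the Ultralytics runtime rather than a hand-built pipeline, pulling detections from a frame looks roughly like the sketch below; the weights file and input image are placeholders, and this high-level API decodes the raw head outputs and applies NMS (described next) internally:

```python
import cv2
from ultralytics import YOLO

model = YOLO("yolov8m.pt")                   # placeholder weights; swap in mission-trained weights
frame_bgr = cv2.imread("sample_frame.png")   # placeholder input; in the field this comes from the sensor
results = model(frame_bgr, imgsz=640, conf=0.25)

boxes = results[0].boxes                     # post-NMS detections
for xyxy, score, cls_id in zip(boxes.xyxy, boxes.conf, boxes.cls):
    x1, y1, x2, y2 = xyxy.tolist()
    print(f"class={int(cls_id)} conf={float(score):.2f} box=({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")
```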

The final stage is post-processing with Non-Maximum Suppression (NMS). A detection model typically produces hundreds of overlapping candidate boxes; NMS filters these to the subset of highest-confidence, non-overlapping detections using an Intersection-over-Union (IoU) threshold (typically 0.45–0.65). On-device NMS implementation matters: a naive CPU-based NMS over 1,000 candidates at 30 fps can consume more compute than the model inference itself. TensorRT provides efficient GPU-accelerated NMS, and for ultra-low-power platforms, implementing NMS in hardware-accelerated kernels is essential.
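
For reference, a greedy NMS in plain NumPy, useful for understanding the cost profile and for platforms without an accelerated implementation; in production the TensorRT or vendor-provided kernels should be preferred:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Greedy NMS over [x1, y1, x2, y2] boxes; returns indices of kept detections."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with all remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop candidates overlapping the kept box
    return keep
```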

Hardware Platforms: Jetson, Hailo, and Movidius Compared

Three hardware families dominate defense edge AI deployments, each with distinct performance, power, and ecosystem characteristics:

NVIDIA Jetson AGX Orin is the performance leader in the ruggedized embedded GPU space. At 275 TOPS (INT8), it can run multiple large detection models simultaneously — for example, a YOLOv8-large model at 30+ fps while concurrently running a tracking algorithm and a separate classification model. The AGX Orin operates at 15W–60W depending on power mode, supports CUDA 11.4+, TensorRT 8.x, and DeepStream SDK for multi-camera pipelines. Its 64 GB LPDDR5 unified memory allows large model weights and large frame buffers to reside in memory simultaneously. For vehicle-mounted applications with a 100W+ power budget, the AGX Orin is the standard choice.

Hailo-8 and Hailo-8L occupy the low-power end of high-performance AI inference. The Hailo-8 delivers 26 TOPS at under 3W in PCIe M.2 or mPCIe form factor — making it viable for small UAV payloads and dismounted systems. The Hailo-8L (13 TOPS) reduces power further to ~1.5W. Hailo uses a proprietary Dataflow Architecture optimized for CNN inference, with the Hailo Model Zoo providing pre-compiled versions of YOLO variants optimized for the Hailo runtime. The trade-off: Hailo's ecosystem is narrower than NVIDIA's — custom model architectures require additional conversion effort via the Hailo Dataflow Compiler.

Intel Movidius Myriad X and its successors (deployed through the Intel OpenVINO toolkit) target tight integration of vision AI with Intel's camera and sensor ecosystem. The Myriad X delivers approximately 4 TOPS at ~1W, suitable for embedded vision applications. OpenVINO provides a model optimization and deployment pipeline supporting heterogeneous execution across CPU, GPU, VPU, and FPGA targets on Intel silicon. For programs using Intel RealSense depth cameras or integrated with Intel ISP pipelines, Movidius provides the tightest hardware integration.

Optimization: TensorRT INT8 Quantization and Layer Fusion

A YOLOv8-medium model trained in PyTorch with FP32 weights requires approximately 850 MB of memory and runs at about 8 fps on an NVIDIA Jetson Orin NX in its native form. After TensorRT optimization to INT8, the same model requires approximately 210 MB and runs at 65+ fps — an 8× throughput improvement and 4× memory reduction, with typically less than 1% mAP degradation on a representative calibration dataset.

TensorRT optimization involves three main techniques. INT8 quantization converts model weights and activations from 32-bit floating point to 8-bit integer representation, using a calibration dataset (typically 500–1,000 representative images) to determine the optimal quantization scale factors per layer. Layer fusion combines sequences of operations — convolution followed by batch normalization followed by ReLU activation — into a single optimized CUDA kernel, eliminating the memory bandwidth overhead of writing and reading intermediate results. Kernel auto-tuning evaluates multiple CUDA kernel implementations for each layer on the target GPU hardware and selects the fastest, accounting for the specific CUDA core count and memory hierarchy of the deployment device.
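
A condensed sketch of the offline engine-build step using the TensorRT 8.x Python API; the ONNX filename is a placeholder and the calibrator object is a placeholder defined in the later sketch. Layer fusion and kernel auto-tuning are not invoked explicitly; TensorRT performs them during the build:

```python
import tensorrt as trt  # assumes the TensorRT Python bindings are installed on the target

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("yolov8m.onnx", "rb") as f:        # placeholder ONNX export of the trained model
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # half precision where tensor cores support it
config.set_flag(trt.BuilderFlag.INT8)        # INT8 where calibration data allows it
config.int8_calibrator = calibrator          # placeholder; see the calibrator sketch below

# Layer fusion (conv + BN + activation) and per-layer kernel timing happen inside the build.
engine_bytes = builder.build_serialized_network(network, config)
with open("yolov8m_int8.engine", "wb") as f:
    f.write(engine_bytes)
```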

FP16 (half-precision) inference is often used as an intermediate optimization step between FP32 and INT8. FP16 requires no calibration dataset and delivers roughly a 2× speedup with negligible accuracy loss on Turing/Ampere GPU architectures that have native FP16 tensor core support.

Key insight: Calibration data quality is the primary determinant of INT8 accuracy. Using images from the deployment domain — matching sensor type, lighting conditions, and target classes — yields significantly better calibration results than using ImageNet or other generic datasets. For LWIR thermal inputs, calibrate exclusively with thermal imagery.
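
One way to wire deployment-domain imagery into INT8 calibration is a custom entropy calibrator; this sketch assumes PyCUDA for device transfers, batches already preprocessed to the model's input layout, and an illustrative cache path:

```python
import numpy as np
import tensorrt as trt
import pycuda.autoinit   # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda

class DomainCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds deployment-domain frames (e.g. preprocessed LWIR imagery) to TensorRT calibration."""

    def __init__(self, batches, cache_path: str = "calib.cache"):
        super().__init__()
        self.batches = iter(batches)          # each batch: float32 array of shape (1, 3, H, W)
        self.cache_path = cache_path
        self.device_mem = cuda.mem_alloc(batches[0].nbytes)

    def get_batch_size(self) -> int:
        return 1

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None                       # calibration data exhausted
        cuda.memcpy_htod(self.device_mem, np.ascontiguousarray(batch))
        return [int(self.device_mem)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_path, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_path, "wb") as f:
            f.write(cache)
```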

Multi-Object Tracking: DeepSORT, ByteTrack, and BoT-SORT

Object detection produces per-frame detections. Multi-object tracking (MOT) links these detections across frames to produce persistent tracks — each with a unique ID, trajectory history, and velocity estimate. For defense applications, tracking is as important as detection: a target that disappears behind an obstacle for 2–3 seconds must be re-identified correctly when it reappears, not assigned a new ID that breaks the engagement timeline.

DeepSORT (Deep Simple Online and Realtime Tracking) was the standard for several years. It uses Kalman filtering for trajectory prediction and a deep appearance feature extractor (a lightweight ReID model) to match detections to existing tracks across occlusions. The ReID model adds compute overhead but significantly improves re-identification after occlusion. DeepSORT works well when targets have distinct visual appearances but degrades in crowded scenes where many similar-looking targets cross paths.

ByteTrack improves on DeepSORT by using low-confidence detections (below the standard threshold) as additional association cues rather than discarding them. This dramatically reduces ID switches during partial occlusions, where a target's detection confidence drops temporarily. ByteTrack achieves state-of-the-art MOT metrics on standard benchmarks with lower computational cost than DeepSORT, making it a strong choice for edge deployment.
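
The core two-stage association can be sketched as follows, assuming SciPy for the Hungarian assignment and treating track_boxes as the Kalman-predicted positions of existing tracks; thresholds and function names are illustrative, and the real ByteTrack adds per-track state management on top:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_iou(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """IoU between every box in a (N,4) and b (M,4), xyxy format."""
    tl = np.maximum(a[:, None, :2], b[None, :, :2])
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter = np.clip(br - tl, 0, None).prod(axis=2)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def match(track_boxes, det_boxes, iou_gate=0.3):
    """Hungarian matching on IoU; returns matched (track, det) pairs and unmatched track indices."""
    if len(track_boxes) == 0 or len(det_boxes) == 0:
        return [], list(range(len(track_boxes)))
    iou = pairwise_iou(track_boxes, det_boxes)
    rows, cols = linear_sum_assignment(-iou)                  # maximize total IoU
    pairs = [(r, c) for r, c in zip(rows, cols) if iou[r, c] >= iou_gate]
    matched = {r for r, _ in pairs}
    return pairs, [t for t in range(len(track_boxes)) if t not in matched]

def associate_bytetrack(track_boxes, det_boxes, det_scores, high_thresh=0.5, low_thresh=0.1):
    """Two-stage association: high-confidence detections first, then leftover tracks
    against low-confidence detections, which keeps IDs alive through occlusion dips."""
    track_boxes = np.asarray(track_boxes, dtype=float).reshape(-1, 4)
    det_boxes = np.asarray(det_boxes, dtype=float).reshape(-1, 4)
    det_scores = np.asarray(det_scores, dtype=float)
    high = np.where(det_scores >= high_thresh)[0]
    low = np.where((det_scores >= low_thresh) & (det_scores < high_thresh))[0]
    first, leftover = match(track_boxes, det_boxes[high])
    second, unmatched = match(track_boxes[leftover], det_boxes[low])
    matches = [(t, int(high[d])) for t, d in first]
    matches += [(leftover[t], int(low[d])) for t, d in second]
    return matches, [leftover[t] for t in unmatched]          # unmatched tracks may be coasted or dropped
```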

BoT-SORT (Robust Associations Multi-Pedestrian Tracking) adds camera motion compensation to ByteTrack's framework. For a UAV-mounted camera that itself is moving and rotating, naive Kalman prediction fails because the apparent motion of a stationary target can be large due to camera ego-motion. BoT-SORT estimates camera motion from homography (using feature matching between frames) and compensates for it before running Kalman prediction, substantially improving tracking accuracy for airborne platforms.
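
A sketch of homography-based camera-motion compensation using ORB features and OpenCV; BoT-SORT's actual CMC module differs in detail, and the feature counts and RANSAC threshold here are illustrative:

```python
import cv2
import numpy as np

def estimate_camera_motion(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    """Estimate a frame-to-frame homography from ORB feature matches."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return np.eye(3)                                      # not enough texture: assume no motion
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]
    if len(matches) < 4:
        return np.eye(3)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H if H is not None else np.eye(3)

def warp_box(box_xyxy, H: np.ndarray) -> np.ndarray:
    """Move a predicted track box into the current frame's coordinates before association."""
    x1, y1, x2, y2 = box_xyxy
    corners = np.float32([[x1, y1], [x2, y1], [x2, y2], [x1, y2]]).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
    return np.array([warped[:, 0].min(), warped[:, 1].min(), warped[:, 0].max(), warped[:, 1].max()])
```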

Deployment Challenges: Thermal Inputs, Sensor Fusion, and Ruggedization

Deploying computer vision models from controlled test environments to operational field hardware introduces several challenges that are routinely underestimated in development.

IR and thermal input processing. Longwave infrared (LWIR) cameras operate in the 8–14 µm spectral band and produce 14-bit or 16-bit grayscale images that map temperature to intensity. The normalization approach matters significantly: simple min-max normalization across the full dynamic range washes out low-contrast targets. Adaptive histogram equalization (CLAHE) applied per-frame or per-region significantly improves target visibility in thermal imagery. Models trained on EO imagery must be retrained or fine-tuned on thermal data; cross-modal transfer does not work reliably.
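
A sketch of a per-frame thermal normalization step combining percentile clipping with CLAHE, using OpenCV; the clip percentiles and CLAHE parameters are illustrative and should be tuned per sensor:

```python
import cv2
import numpy as np

def normalize_thermal(frame_16bit: np.ndarray) -> np.ndarray:
    """Map a 14/16-bit LWIR frame to 8-bit with CLAHE instead of global min-max scaling."""
    # Clip outliers (dead pixels, sun glints) so they don't dominate the intensity range.
    lo, hi = np.percentile(frame_16bit, (1, 99))
    clipped = np.clip(frame_16bit, lo, hi)
    scaled = ((clipped - lo) / max(hi - lo, 1) * 255.0).astype(np.uint8)
    # CLAHE: local histogram equalization with a contrast ceiling, preserving low-contrast targets.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(scaled)
```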

Sensor fusion with LWIR and EO cameras. A common architecture pairs an EO camera (for classification detail and color discrimination) with an LWIR camera (for detection through camouflage and in low-light conditions). Fusing detections from two sensors requires extrinsic calibration (aligning their fields of view geometrically), temporal synchronization (ensuring frame timestamps align), and a fusion strategy — either early fusion (combining feature maps from both sensors), late fusion (combining detections from two independent models), or decision-level fusion (voting across independent detection outputs). Late fusion is the most common in deployed defense systems because it allows each sensor pipeline to be optimized and certified independently.
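
A simplified decision-level (late) fusion sketch, assuming both detection lists are already expressed in EO pixel coordinates (i.e., the LWIR boxes have been projected through the extrinsic calibration) and that each detection is a dict with 'box' and 'score' keys; the merge rule shown, keeping the EO box and the higher score, is just one reasonable choice:

```python
def box_iou(a, b):
    """IoU of two xyxy boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def late_fuse(eo_dets, lwir_dets_in_eo, iou_gate=0.5):
    """Merge overlapping EO/LWIR detections; unmatched detections pass through,
    so either sensor alone can still raise a detection."""
    fused, used = [], set()
    for eo in eo_dets:
        best, best_iou = None, iou_gate
        for j, ir in enumerate(lwir_dets_in_eo):
            iou = box_iou(eo["box"], ir["box"])
            if j not in used and iou > best_iou:
                best, best_iou = j, iou
        if best is not None:
            used.add(best)
            fused.append({"box": eo["box"],
                          "score": max(eo["score"], lwir_dets_in_eo[best]["score"]),
                          "sensors": ["EO", "LWIR"]})
        else:
            fused.append({**eo, "sensors": ["EO"]})
    fused += [{**ir, "sensors": ["LWIR"]} for j, ir in enumerate(lwir_dets_in_eo) if j not in used]
    return fused
```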

Ruggedized enclosures. IP67-rated enclosures (dust-tight, immersion-resistant) are the minimum for field-deployed computer vision hardware. MIL-STD-810H defines environmental test methods for shock, vibration, temperature cycling (operating range −40°C to +71°C for most ground vehicle applications, −54°C to +85°C for aviation), humidity, and altitude. Hardware must be qualified to the applicable MIL-STD test sequences before deployment. Thermal management within sealed enclosures — preventing GPU junction temperature from exceeding safe limits without fan or vented cooling — typically requires conduction cooling through the enclosure wall to a heat spreader or vehicle chassis.

Model update mechanisms in the field are a frequently overlooked deployment requirement. A model that performs well in summer vegetation may degrade significantly in winter or urban terrain. The deployment pipeline must support cryptographically signed model packages pushed to field devices via a secure update channel, with rollback capability if the new model degrades performance.
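
A minimal sketch of the signature check that should gate any model swap, using Ed25519 via the cryptography package; the file names, key handling, and the promote_model helper are hypothetical, and rollback logic is out of scope:

```python
from pathlib import Path
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_model_package(model_path: Path, sig_path: Path, pubkey_bytes: bytes) -> bool:
    """Verify a detached Ed25519 signature over a model package before it is loaded."""
    public_key = Ed25519PublicKey.from_public_bytes(pubkey_bytes)
    try:
        public_key.verify(sig_path.read_bytes(), model_path.read_bytes())
        return True
    except InvalidSignature:
        return False

# Only swap in the new engine if the signature checks out; otherwise keep the current model.
# if verify_model_package(Path("yolov8m_v2.engine"), Path("yolov8m_v2.sig"), trusted_key):
#     promote_model("yolov8m_v2.engine")   # hypothetical helper in the device's update agent
```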