Computer Vision for ISR Drones: Detection, Tracking, and the Real-Time Pipeline

A computer-vision pipeline on an ISR drone has one job: take photons hitting a sensor, turn them into geo-located tracks of objects that matter, and push those tracks to a command-and-control system fast enough that an operator — or another system — can act on them. Everything else is engineering overhead in service of that loop. This article walks through the pipeline end to end: the model architectures that detect, the algorithms that track, the sensor fusion that survives night and weather, the georeferencing math that makes a bounding box useful, and the edge-deployment realities that decide whether any of it works in the field.

For broader context on where this fits in the defense AI stack, see our complete guide to AI in defense and the sensor-edge analysis in sensor-to-shooter part 2.

1. The ISR CV Pipeline

The canonical pipeline is six stages: sensor capture (EO and IR), frame ingest and synchronization, detection, multi-object tracking, georeferencing, and C2 push. End to end, the budget on a tactical ISR platform is roughly 150–250 ms wall-clock from photon arrival to track update on the C2 surface. Anything beyond 300 ms breaks operator confidence — a moving vehicle at 60 km/h covers 5 metres in 300 ms.

The budget breakdown on a typical Jetson Orin NX-class platform: 16–33 ms for capture (depending on whether the sensor runs at 30 or 60 fps), 5–10 ms for ISP and demosaic, 15–40 ms for the detector forward pass, 3–8 ms for tracking association, 10–20 ms for georeferencing math, and 20–80 ms for the radio link to C2. The radio is usually the worst offender and the one the CV engineer cannot fix. Everything on-board must compress to compensate.
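A quick way to keep the budget honest is to carry the per-stage ranges as data and sum them. The sketch below uses only the figures quoted above; the function names are illustrative, not from any particular toolkit.

```python
# Per-stage latency ranges in milliseconds, copied from the breakdown above.
STAGE_BUDGET_MS = {
    "capture": (16, 33),
    "isp_demosaic": (5, 10),
    "detector": (15, 40),
    "tracker": (3, 8),
    "georeference": (10, 20),
    "radio_link": (20, 80),
}

def total_budget_ms(stages):
    """Return (best_case, worst_case) end-to-end latency in ms."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst

def onboard_share(stages, link_stage="radio_link"):
    """Fraction of worst-case latency spent on board (everything but the link)."""
    _, worst = total_budget_ms(stages)
    return (worst - stages[link_stage][1]) / worst

best, worst = total_budget_ms(STAGE_BUDGET_MS)
print(f"end-to-end: {best}-{worst} ms, on-board share {onboard_share(STAGE_BUDGET_MS):.0%}")
```

The itemised stages sum to 69–191 ms, so the worst case sits under the 250 ms ceiling only because nothing else (queueing, buffer copies, encode) has been counted yet. That remaining margin is an inference from the numbers, not a measured figure.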

Frame ingest synchronization matters more than first-timers expect. EO and IR sensors rarely share a frame clock. If your fusion logic assumes they do, you fuse a target's EO pixel at t with the IR pixel at t-16 ms — a vehicle at 30 m/s has moved half a metre. The pipeline must timestamp at the sensor, not at the consumer.
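One way to enforce this is to pair frames by sensor-side timestamp and refuse to fuse pairs whose skew is too large. The sketch below assumes both streams deliver sorted capture times in seconds; `pair_nearest` and the 5 ms tolerance are hypothetical choices for illustration.

```python
from bisect import bisect_left

def pair_nearest(eo_ts, ir_ts, max_skew_s=0.005):
    """Pair each EO timestamp with the nearest IR timestamp.

    Both lists hold sensor-side capture times in seconds, sorted ascending.
    Pairs whose skew exceeds max_skew_s are dropped rather than fused.
    """
    pairs = []
    for i, t in enumerate(eo_ts):
        k = bisect_left(ir_ts, t)
        # Candidates are the IR frames just before and just after t.
        candidates = [j for j in (k - 1, k) if 0 <= j < len(ir_ts)]
        if not candidates:
            continue
        j = min(candidates, key=lambda c: abs(ir_ts[c] - t))
        if abs(ir_ts[j] - t) <= max_skew_s:
            pairs.append((i, j))
    return pairs

def displacement_m(speed_mps, skew_s):
    """Ground distance a target covers during the timestamp skew."""
    return speed_mps * skew_s

print(pair_nearest([0.000, 0.033, 0.066], [0.002, 0.049, 0.068]))
print(displacement_m(30.0, 0.016))
```

In the example the middle EO frame has no IR partner within tolerance, so it is left unfused instead of being matched to a frame 16 ms away.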

2. Detection Architectures

The detector is the dominant compute and accuracy decision in the pipeline. Three families currently matter on ISR drones.

YOLOv8, v10, v11. The convolutional YOLO line remains the workhorse — Ultralytics' YOLOv8 and the newer YOLOv10 and v11 deliver 30–60 fps at 640×640 on Jetson Orin NX with INT8 quantization. YOLOv11n (nano) hits ~60 fps at acceptable mAP on aerial datasets; YOLOv11s (small) trades to ~30 fps with materially better small-object recall. YOLOv10 removes the NMS step entirely, shaving 3–5 ms of post-processing latency, which matters when every millisecond is contested.

RT-DETR. Baidu's real-time DETR is the transformer alternative: a query-based detector that decodes a fixed number of object queries and needs no NMS by design. On COCO, RT-DETR-L matches or slightly beats YOLOv8-L in mAP at comparable latency. On aerial imagery, global attention often handles dense small-object scenes (parked vehicles, infantry clusters) better than purely convolutional detectors. The cost is a larger model and trickier INT8 quantization, because attention layers degrade more under aggressive quantization than conv layers do.

The small-object problem. An ISR drone at 1500 m AGL with a 30° HFOV sees a person as roughly 6–10 pixels on a side, and fewer on a lower-resolution sensor. Standard object detectors trained on COCO-style imagery (where objects are typically >32 pixels) fail badly in this regime. The two practical fixes are tiling (split the frame into overlapping 640×640 patches, run inference per patch, then merge the detections back in full-frame image space) and training on aerial-specific datasets: VisDrone, DOTA, xView, and increasingly domain-specific synthetic data. See our synthetic data for defense AI training piece for the pipeline.
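The tiling step can be sketched in a few lines. This is a minimal version that assumes a fixed 128-pixel overlap (an illustrative value) and leaves out the cross-tile duplicate suppression that a real pipeline needs after mapping boxes back.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of overlapping tiles covering [0, length)."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last tile sits flush with the far edge
    return origins

def make_tiles(width, height, tile=640, overlap=128):
    """Return (x0, y0, x1, y1) windows that cover the full frame."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]

def to_frame_coords(box, tile_window):
    """Shift a tile-local (x0, y0, x1, y1) box into full-frame pixels."""
    tx, ty = tile_window[0], tile_window[1]
    x0, y0, x1, y1 = box
    return (x0 + tx, y0 + ty, x1 + tx, y1 + ty)

tiles = make_tiles(1920, 1080)
print(len(tiles), "tiles; last window:", tiles[-1])
```

A 1920×1080 frame becomes eight detector passes under these settings, which is why tiling multiplies the detector share of the latency budget and is usually restricted to a region of interest.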

3. Tracking Algorithms

Detection gives you bounding boxes per frame. Tracking turns those into identity-stable tracks across time, which is what a C2 system actually needs. The dominant on-board choices are ByteTrack, StrongSORT, and OC-SORT.

ByteTrack. Cheap, fast, and surprisingly robust. ByteTrack's insight is that low-confidence detections, which most trackers discard, are often real objects that are partially occluded or temporarily ambiguous. It associates high-confidence detections with existing tracks first, then matches the low-confidence boxes against the tracks left unmatched in a second pass. This recovers tracks that association on high-confidence boxes alone would drop. On a Jetson Orin NX the tracker adds <5 ms per frame.
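The two-pass idea can be illustrated with a simplified association routine. This sketch uses greedy IoU matching and takes predicted track boxes as given; the published tracker uses Kalman-predicted boxes and Hungarian assignment, and the thresholds here are placeholder values.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_match(tracks, dets, min_iou):
    """Greedily pair track ids with detection indices by descending IoU."""
    scored = sorted(
        ((iou(box, d["box"]), tid, di)
         for tid, box in tracks.items()
         for di, d in enumerate(dets)),
        key=lambda s: s[0],
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, tid, di in scored:
        if score < min_iou:
            break
        if tid in used_tracks or di in used_dets:
            continue
        matches[tid] = di
        used_tracks.add(tid)
        used_dets.add(di)
    return matches

def two_pass_associate(tracks, dets, high=0.5, low=0.1, min_iou=0.3):
    """tracks: {track_id: predicted_box}; dets: [{'box': ..., 'score': ...}]."""
    high_dets = [d for d in dets if d["score"] >= high]
    low_dets = [d for d in dets if low <= d["score"] < high]

    first = greedy_match(tracks, high_dets, min_iou)
    assigned = {tid: high_dets[di] for tid, di in first.items()}

    # Second pass: only tracks left unmatched may claim low-confidence boxes.
    leftover = {tid: box for tid, box in tracks.items() if tid not in first}
    second = greedy_match(leftover, low_dets, min_iou)
    assigned.update({tid: low_dets[di] for tid, di in second.items()})
    return assigned
```

Because low-confidence boxes can only extend existing tracks and never start new ones, the second pass recovers occluded targets without flooding the tracker with false positives.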

StrongSORT. An evolution of DeepSORT — Kalman filter for motion plus a re-identification appearance embedding. Better on ID-switch-prone scenes (vehicles passing each other, occlusion under tree cover) but the appearance embedding network adds 8–15 ms per frame and needs its own training data. Worth the cost when ID stability matters more than throughput, for example in convoy tracking.

OC-SORT. Observation-Centric SORT addresses a specific ByteTrack/SORT failure: when an object is lost for several frames, the Kalman filter coasts on prediction alone and its velocity estimate drifts. When the object is re-associated, OC-SORT re-estimates velocity from the actual observations on either side of the gap rather than trusting the filter prediction. On ISR footage with frequent occlusion (urban environments, forest edge) OC-SORT measurably reduces ID switches versus ByteTrack.
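The observation-centric correction can be sketched as follows. This is a simplified illustration, not the published implementation: it derives velocity from the last observation before the gap and the first one after it, and generates linearly interpolated "virtual" observations that a filter could be re-updated with. The `(t, x, y)` tuple layout is an assumption.

```python
def reestimate_velocity(last_obs, new_obs):
    """Velocity (vx, vy) from two (t, x, y) observations bracketing a gap."""
    dt = new_obs[0] - last_obs[0]
    if dt <= 0:
        raise ValueError("new observation must be later than the last one")
    return ((new_obs[1] - last_obs[1]) / dt, (new_obs[2] - last_obs[2]) / dt)

def virtual_observations(last_obs, new_obs, n_missed):
    """Linearly interpolated (t, x, y) points for the n_missed frames in the gap."""
    points = []
    for k in range(1, n_missed + 1):
        a = k / (n_missed + 1)
        points.append(tuple(p + a * (q - p) for p, q in zip(last_obs, new_obs)))
    return points

last_seen = (0.0, 100.0, 50.0)   # t in seconds, x and y in pixels
reacquired = (0.4, 120.0, 50.0)
print(reestimate_velocity(last_seen, reacquired))
print(virtual_observations(last_seen, reacquired, 3))
```

Feeding the virtual points through the filter's update step, in order, replaces the drifted state with one anchored to what was actually observed.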

The shaky-platform problem. All these trackers assume the camera-frame motion of an object is dominated by object motion. On a drone in turbulent air, ego-motion contributes most of the apparent pixel velocity. The fix is to track in a stabilized or world frame: either feed the tracker pre-stabilized frames (homography-based de-rotation against the IMU), or run the Kalman filter in georeferenced coordinates rather than image coordinates. The latter is more work but produces dramatically cleaner tracks.
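A back-of-envelope comparison shows why ego-motion dominates. The sketch assumes a pinhole camera, pure rotation, a target near the image centre moving across the line of sight, and illustrative values (1920-pixel-wide sensor, 5°/s residual rotation, 10 m/s target) that are not taken from any specific platform.

```python
import math

def focal_length_px(image_width_px, hfov_deg):
    """Pinhole focal length in pixels from image width and horizontal FOV."""
    return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

def ego_shift_px(rate_deg_s, dt_s, focal_px):
    """Per-frame image shift near the image centre from camera rotation alone."""
    return focal_px * math.tan(math.radians(rate_deg_s) * dt_s)

def target_shift_px(speed_mps, dt_s, slant_range_m, focal_px):
    """Per-frame image shift of a target moving across the line of sight."""
    return focal_px * (speed_mps * dt_s) / slant_range_m

f_px = focal_length_px(1920, 30.0)
ego = ego_shift_px(5.0, 1 / 30, f_px)
target = target_shift_px(10.0, 1 / 30, 1500.0, f_px)
print(f"ego-motion {ego:.1f} px/frame vs target {target:.1f} px/frame")
```

Under these assumptions the rotation contributes roughly ten pixels per frame while the vehicle contributes less than one, so an image-space motion model is mostly fitting the airframe, not the target.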

4. EO + IR Sensor Fusion

An EO-only ISR drone is a daytime platform. An IR-only drone resolves heat sources but cannot read a vehicle's markings, count personnel reliably at distance, or distinguish similar-temperature decoys. Operational ISR demands both, and demands they fuse.

Late fusion runs independent detectors on EO and IR streams and reconciles tracks downstream. Simpler to engineer, fails gracefully if one sensor degrades, but loses the cross-modal signal — a faint EO contact reinforced by a clear IR signature should produce a high-confidence track, and late fusion handles that awkwardly.

Early fusion stacks EO and IR channels into a single tensor and trains a detector across the combined input. Better cross-modal performance, but requires aligned data — which requires boresight calibration discipline. EO and IR optics rarely share a boresight; they need per-airframe calibration (typically a checkerboard or hot-target calibration before flight) and re-calibration after any maintenance event.

Day-night handoff. The most failure-prone moments are dusk and dawn, when EO contrast is collapsing and the IR scene is near thermal crossover, with objects and background at almost the same temperature. A good fusion pipeline gates per-sensor confidence by scene-level metrics (image-wide contrast, histogram statistics) and re-weights the fused detection accordingly, rather than trusting a fixed fusion weight 24 hours a day.
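One simple form of scene-level gating is to weight each sensor's confidence by its current image contrast. The sketch below uses RMS contrast and a linear weighting; both are illustrative choices, and a fielded system might learn the weighting instead.

```python
def rms_contrast(pixels):
    """RMS contrast of a flat list of intensities normalised to [0, 1]."""
    mean = sum(pixels) / len(pixels)
    return (sum((p - mean) ** 2 for p in pixels) / len(pixels)) ** 0.5

def fuse_confidence(eo_conf, ir_conf, eo_contrast, ir_contrast, floor=1e-6):
    """Contrast-weighted average of per-sensor detection confidences."""
    w_eo = max(eo_contrast, floor)  # floor avoids division by zero on flat scenes
    w_ir = max(ir_contrast, floor)
    return (w_eo * eo_conf + w_ir * ir_conf) / (w_eo + w_ir)

# Dusk example: EO contrast has collapsed, IR still has usable contrast.
print(fuse_confidence(eo_conf=0.4, ir_conf=0.9, eo_contrast=0.05, ir_contrast=0.25))
```

With EO contrast at a fifth of the IR value, the fused score leans on the IR detection instead of averaging the two equally.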

5. Georeferencing at Frame Rate

A bounding box in pixel coordinates is useless to a C2 system. It must be projected to a geographic coordinate (latitude, longitude, elevation) with an error ellipse. The inputs are the drone's position (GPS, often INS-fused), the drone's attitude (IMU), the gimbal pose relative to the airframe (gimbal encoders), the camera intrinsics (focal length, principal point, lens distortion), and a terrain model (ideally a DTED Level 2 or better DEM). The pixel is cast as a ray through the camera model and intersected with the terrain to obtain the ground point.
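The core of that projection is a ray-ground intersection. The sketch below is a minimal version under strong assumptions: an undistorted pinhole camera, a local East-North-Up frame in metres, flat terrain at a known height, and a single camera-to-ENU rotation matrix that the caller has already composed from attitude and gimbal angles. A DEM-backed version would march the ray against terrain heights instead.

```python
def pixel_ray_cam(u, v, fx, fy, cx, cy):
    """Ray through pixel (u, v) in camera axes: x right, y down, z forward."""
    return ((u - cx) / fx, (v - cy) / fy, 1.0)

def rotate(R, vec):
    """Apply a 3x3 rotation matrix (nested lists) to a 3-vector."""
    return tuple(sum(R[i][k] * vec[k] for k in range(3)) for i in range(3))

def intersect_ground(cam_pos_enu, ray_enu, ground_up_m=0.0):
    """Intersect a ray from the camera with the plane Up = ground_up_m."""
    e, n, u = cam_pos_enu
    de, dn, du = ray_enu
    if du >= 0:
        return None  # ray points at or above the horizon, no ground hit
    s = (ground_up_m - u) / du
    return (e + s * de, n + s * dn, ground_up_m)

# Nadir-looking camera: image right = East, image down = South, boresight = Down.
R_NADIR = [[1, 0, 0], [0, -1, 0], [0, 0, -1]]
FOCAL_PX = 3582.77  # roughly a 1920-pixel-wide sensor at 30 degrees HFOV

ray = rotate(R_NADIR, pixel_ray_cam(1060.0, 540.0, FOCAL_PX, FOCAL_PX, 960.0, 540.0))
print(intersect_ground((0.0, 0.0, 1500.0), ray))
```

A pixel 100 columns right of centre lands about 42 m east of the sub-aircraft point from 1500 m, which also gives a feel for how far a one-pixel error moves the ground fix.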

Two practical realities. First, georeferencing latency competes with detection latency. A naïve implementation that reads gimbal encoders and IMU at the C2 push moment introduces a 50–100 ms error against the actual frame timestamp — at 30 m/s ground speed that is 1.5–3 metres of position error. Encoder and IMU samples must be timestamped and interpolated to the frame's exposure midpoint.
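Interpolating telemetry to the exposure midpoint is a few lines once every sample carries its own timestamp. The sketch below handles one scalar channel with linear interpolation; it assumes sorted, distinct sample times, and it does not handle angle wrap-around (a quaternion pipeline would use slerp instead).

```python
from bisect import bisect_right

def exposure_midpoint(start_s, exposure_s):
    """Timestamp of the middle of the exposure window."""
    return start_s + 0.5 * exposure_s

def interp_sample(times, values, t):
    """Linearly interpolate a scalar telemetry channel at time t (seconds)."""
    if not times[0] <= t <= times[-1]:
        raise ValueError("t outside telemetry window; do not extrapolate")
    hi = min(bisect_right(times, t), len(times) - 1)
    lo = hi - 1
    frac = (t - times[lo]) / (times[hi] - times[lo])
    return values[lo] + frac * (values[hi] - values[lo])

# Gimbal pitch encoder sampled at 100 Hz (illustrative values, degrees).
enc_times = [0.00, 0.01, 0.02]
enc_pitch = [10.0, 10.4, 10.8]
t_mid = exposure_midpoint(start_s=0.012, exposure_s=0.004)
print(interp_sample(enc_times, enc_pitch, t_mid))
```

Refusing to extrapolate is deliberate: if the telemetry buffer does not bracket the frame, it is safer to delay the track update than to publish a position built from a guessed pose.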

Second, the error budget. At 1500 m slant range with 0.5° gimbal pose uncertainty, the ground-projected error is roughly 13 metres before you add GPS uncertainty, terrain model error, and timing skew. The realistic CEP for a well-engineered tactical-class system is 15–25 metres at typical ISR altitudes. Anything reported tighter than that is either heroic engineering or wishful thinking.
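The 13-metre figure follows directly from the geometry, and the other terms can be combined with it as a root-sum-square if they are treated as independent. In the sketch below only the pointing term comes from the text; the GPS, terrain, and timing values are placeholders chosen for illustration, and the RSS total is a rough one-sigma-style figure, not a formal CEP.

```python
import math

def pointing_error_m(slant_range_m, angle_err_deg):
    """Cross-range ground error from an angular pointing uncertainty."""
    return slant_range_m * math.tan(math.radians(angle_err_deg))

def rss(*terms):
    """Root-sum-square of independent error terms."""
    return math.sqrt(sum(t * t for t in terms))

pointing = pointing_error_m(1500.0, 0.5)
# Placeholder contributions in metres: GPS, terrain model, timing skew.
total = rss(pointing, 3.0, 5.0, 3.0)
print(f"pointing {pointing:.1f} m, combined {total:.1f} m")
```

The pointing term dominates the sum, which is why gimbal calibration and encoder timing usually buy more accuracy than a better GPS receiver.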

6. Model Selection for Edge Deployment

The compute platform constrains everything. The current ISR-drone-class options:

Jetson Orin Nano (8 GB): ~40 TOPS INT8, suitable for YOLOv8n/v11n at 640×640 plus a light tracker. Power envelope 7–15 W. Good for Group 1/2 platforms where the airframe cannot dissipate more.

Jetson Orin NX (16 GB): ~100 TOPS INT8. Runs YOLOv11n at ~60 fps or YOLOv11s at ~30 fps (the figures quoted in section 2), RT-DETR-R18 at ~30 fps, and StrongSORT with its appearance embedding. 10–25 W. The current sweet spot for tactical ISR.

Jetson AGX Orin (32/64 GB): up to ~275 TOPS INT8. Runs larger models and multiple streams (EO and IR simultaneously, without the two detectors contending for GPU time), and leaves headroom for additional CV tasks (change detection, classification heads). 15–60 W, so usually a Group 3 platform decision.

INT8 quantization realities. Float32 → INT8 typically delivers 3–4× inference speedup and 4× memory reduction, at a cost of 0.5–1.5 mAP points on well-quantized detectors. The gotchas: transformer attention quantizes worse than convolutions; calibration data must be representative of deployment imagery (calibrating on COCO and deploying on thermal IR is malpractice); and some custom layers fall back to FP16 or FP32, silently giving up the speedup. Our ONNX/TensorRT optimization guide covers the toolchain.
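The calibration point is easy to demonstrate with a toy quantizer. The sketch below uses symmetric per-tensor quantization with a max-magnitude scale, which is the simplest scheme and not necessarily what a production toolchain selects; the activation values are made up for illustration.

```python
def int8_scale(calibration_values):
    """Symmetric per-tensor scale from the calibration set's max magnitude."""
    return max(abs(v) for v in calibration_values) / 127.0

def quantize(values, scale):
    """Map floats to INT8 codes, clipping anything outside the calibrated range."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(codes, scale):
    """Map INT8 codes back to floats."""
    return [c * scale for c in codes]

def max_abs_error(values, scale):
    """Worst-case round-trip error for the given values and scale."""
    restored = dequantize(quantize(values, scale), scale)
    return max(abs(a - b) for a, b in zip(values, restored))

deploy_activations = [0.1, 1.9, 4.0]
matched = int8_scale([-3.5, 0.2, 4.0])      # calibration covers the deployment range
mismatched = int8_scale([-1.0, 0.5, 1.0])   # calibration range is too narrow
print(max_abs_error(deploy_activations, matched))
print(max_abs_error(deploy_activations, mismatched))
```

With matched calibration the error stays within one quantization step; with the narrow calibration set the largest activation is clipped to the calibrated maximum and most of its magnitude is lost.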

TensorRT vs ONNX Runtime. On Jetson, TensorRT is the right answer for production — engine builds tuned to the exact GPU SM count, INT8 calibration pipelines mature, kernel fusion aggressive. ONNX Runtime with the TensorRT execution provider is acceptable for development and gives 80–90% of TensorRT-native performance with a simpler deployment story. Pure CUDA EP loses 30–50%.

7. Real-Time Output to C2

The pipeline's product is a stream of geo-located, identity-stable tracks plus the full-motion video that produced them. The interoperable formats are well-defined.

CoT (Cursor-on-Target). XML-based event format, originated by MITRE, the lingua franca of TAK-ecosystem C2 (ATAK, WinTAK, iTAK). A CoT event encodes a point (lat/lon/height above ellipsoid, with circular and linear error values), a type code (e.g. a-h-G-U-C-I for a hostile ground infantry unit), and free-form detail. A drone publishing CoT every 0.5–1 s per tracked object integrates natively with operator displays.
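A track update can be serialised as a CoT event with the standard library alone. The sketch below fills the usual event and point attributes; the UID, coordinates, error values, 5-second stale window, and the `how` code are all illustrative assumptions, and a real integration should validate against the CoT schema used by the target TAK server.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

def cot_time(dt):
    """Format a timezone-aware datetime as a UTC timestamp with milliseconds."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"

def build_cot_event(uid, cot_type, lat, lon, hae_m, ce_m, le_m, now,
                    stale_s=5.0, how="m-g"):
    """Serialise one track update as a CoT event string."""
    event = ET.Element("event", {
        "version": "2.0",
        "uid": uid,
        "type": cot_type,
        "how": how,  # assumption: use the code that matches how the fix was derived
        "time": cot_time(now),
        "start": cot_time(now),
        "stale": cot_time(now + timedelta(seconds=stale_s)),
    })
    ET.SubElement(event, "point", {
        "lat": f"{lat:.7f}",
        "lon": f"{lon:.7f}",
        "hae": f"{hae_m:.1f}",  # height above ellipsoid, metres
        "ce": f"{ce_m:.1f}",    # circular (horizontal) error, metres
        "le": f"{le_m:.1f}",    # linear (vertical) error, metres
    })
    ET.SubElement(event, "detail")
    return ET.tostring(event, encoding="unicode")

xml_text = build_cot_event(
    uid="uav07-track-0042",  # hypothetical track identifier
    cot_type="a-h-G-U-C-I",
    lat=54.6872, lon=25.2797, hae_m=112.0, ce_m=20.0, le_m=15.0,
    now=datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
)
print(xml_text)
```

Keeping the UID stable per track is what lets the operator display update a marker in place instead of spawning a new one on every message.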

MISB ST 0903 VMTI. Video Moving Target Indicator, the NATO/MISB standard for embedding detection and track metadata in KLV alongside full-motion video. A VMTI local set, carried in the ST 0601 KLV metadata of the MPEG-TS stream, holds per-frame target lists with georeferenced position, velocity, and confidence. Required for any platform that needs to plug into NATO Class 1 ISR FMV consumers.

Message-bus patterns. Inside the airframe, ROS 2, Zenoh, or MQTT carry intermediate messages between the detector, tracker, georeferencer, and the radio downlink process. Zenoh's pub-sub-query model handles intermittent links well — the radio drops, the on-board store-and-forward holds tracks, and the C2 client catches up on reconnect.
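The store-and-forward behaviour is independent of the bus that carries it. The sketch below is transport-agnostic and does not use the Zenoh, ROS 2, or MQTT APIs; `StoreAndForward` and its `send` callback are hypothetical names, and the bounded queue drops the oldest track updates first when the outage outlasts the buffer.

```python
from collections import deque

class StoreAndForward:
    """Buffer outbound messages while the link is down, drain on reconnect."""

    def __init__(self, send, max_pending=1000):
        self._send = send  # callable(msg) -> True if the link accepted it
        self._pending = deque(maxlen=max_pending)  # oldest entries drop when full

    def publish(self, msg):
        """Queue a message and try to drain the queue immediately."""
        self._pending.append(msg)
        return self.flush()

    def flush(self):
        """Send queued messages in order; stop at the first failure."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)
```

Dropping the oldest entries suits track updates, where a stale position is worth less than a fresh one; a payload such as imagery chips might prefer the opposite policy.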

8. Field Realities

Everything above is the easy part. The hard part is keeping it working in the field.

Vibration. A 2 kg quadcopter at full throttle vibrates the camera mount at 100–200 Hz. Rolling-shutter sensors turn that vibration into wobble and skew within the frame; global-shutter sensors avoid this, but cost more and dissipate more. Either sensor type still blurs when the exposure is long relative to the vibration period. Detector accuracy on motion-blurred imagery drops 5–15 mAP points unless the training set includes motion-blurred samples.

Thermal. A Jetson Orin NX running at 100 TOPS dissipates 20+ W in a sealed payload that may itself be in direct sun at +45°C. Without active cooling, thermal throttling kicks in within 90 seconds — and a throttled GPU drops detector fps by 40–60%. Designing the payload thermal envelope is as much a CV-engineering concern as model choice.

Low-power modes. A loitering ISR mission may want the detector running at 5 fps during transit and 60 fps over the area of interest, dropping average power by 4–5×. The pipeline must support per-stage power gating — not just GPU clocks, but sensor frame rate, ISP path, and radio duty cycle. See AI ISR data triage for the on-board filtering side of this.

Model degradation across deployment. A detector trained on European summer imagery and deployed in -20°C Baltic winter sees a different world: snow-covered terrain reflectance changes EO statistics; cold engines emit less IR; foliage that hid vehicles in July is leafless in February. The realistic mitigation is continuous evaluation against new collected data and a re-training cadence measured in weeks, not the one-shot training-and-deploy model that lab work assumes.

An ISR drone CV pipeline is not a model — it is a system. The model is the smallest part. The latency budget, the calibration discipline, the C2 message format, the thermal design, and the re-training cadence are what decide whether the system works for the operator at the other end of the radio link.