Every tactical headquarters runs on SITREPs — situation reports that aggregate observations from platoon level upward into a coherent picture of what is happening on the battlefield. The problem is that a large fraction of those SITREPs still arrive as hand-drawn sketches on paper, photographed maps, annotated satellite printouts, or scanned forms. Before any of that information reaches the digital common operating picture (COP), it passes through a human operator who reads the document, identifies each tactical entity, transcribes grid references, and manually plots the unit or threat onto a screen. That manual re-entry step is the bottleneck, and it is one of the highest-leverage targets for AI vision in military operations today.

This article describes the full technical pipeline for automating SITREP processing with AI vision: from image ingestion and pre-processing through entity extraction, coordinate parsing, NATO symbol inference, and Cursor-on-Target (CoT) message generation for TAK placement. It covers where the pipeline can operate autonomously, where human confirmation is required, how it integrates with CloudTAK via TAKpilot, and what it takes to run it on edge hardware in disconnected environments.

The SITREP processing bottleneck

A field SITREP arriving at a battalion operations center typically takes one of several physical forms: a hand-drawn sketch on a grid overlay sheet, a photograph of a map with annotations written on it in grease pencil or marker, a scanned or photographed pre-printed form with fields filled by hand, or — increasingly — a photo taken by a soldier on a smartphone and transmitted via messaging app. Each of these requires the receiving operator to do the same things: identify the reporting unit's callsign, find the grid references for each observed entity, determine what type of entity it is (friendly, enemy, unknown; vehicle type, troop concentration, obstacle, fire position), and enter all of that into the digital COP.

Under calm conditions this process takes 3–8 minutes per SITREP. Under stress, at night, or during high-tempo operations when dozens of SITREPs may arrive per hour, it becomes a bottleneck that introduces dangerous staleness into the tactical picture. The operator's cognitive attention — which should be on interpretation and decision support — is consumed by transcription. Errors in transcription are common: transposed grid digits, misread callsigns, ambiguous symbol identification. The digital COP lags the actual situation by the time it takes to process the backlog.

AI vision models address this bottleneck by automating the transcription step. The operator uploads or forwards the document; the model extracts entities, resolves coordinates, identifies symbols, and generates a structured output ready for map placement. The operator's role shifts from transcriber to reviewer — confirming or correcting the model's output before committing it to the COP, a task that takes seconds rather than minutes.

Vision model pipeline: ingestion to structured extraction

The pipeline begins with image ingestion. Input formats include JPEG and PNG photographs, PDF scans, and occasionally video frames from a soldier's device. For multi-page PDFs, each page is rasterized to a high-resolution image (300 DPI minimum for form scans; 150 DPI acceptable for large-format map photographs where the relevant annotations are large). A metadata extraction step records any EXIF data — particularly timestamp and GPS coordinates if the image was taken on a smartphone — which can serve as a prior for the expected area of operations.

Pre-processing is the most impactful phase for extraction accuracy on degraded field documents. The pipeline applies, in order: de-skew using projection profile analysis or Hough line detection, correcting the document rotations of up to ±15° that are common in handheld photographs; contrast-limited adaptive histogram equalization (CLAHE) on the grayscale image to recover low-contrast pencil marks that global contrast enhancement would wash out; adaptive binarization (Sauvola algorithm) rather than global thresholding, which handles the uneven illumination typical of documents photographed under field lighting; morphological noise removal using an open/close pass sized to the expected minimum stroke width; and layout analysis to segment the document into text regions, symbol regions, and grid overlay regions before routing each to the appropriate processing module. This segmentation step is important: OCR models applied to tactical symbol regions produce meaningless output, and symbol classifiers applied to handwritten text fields produce incorrect symbol matches.
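For reference, the Sauvola threshold named above is computed per pixel from the local mean and standard deviation in a window around that pixel. The parameter values noted after the formula are commonly used defaults, not values confirmed for this pipeline:

```latex
T(x, y) = m(x, y)\left[1 + k\left(\frac{s(x, y)}{R} - 1\right)\right]
```

Here m(x, y) and s(x, y) are the local mean and standard deviation of the grey levels, R is the dynamic range of the standard deviation (128 for 8-bit images), and k is a sensitivity parameter, often chosen between 0.2 and 0.5. A pixel darker than T(x, y) is classified as ink. Because T follows the local mean, a shadowed corner of the page receives a lower threshold than a brightly lit one, which is what a single global threshold cannot do.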

Key insight: Layout analysis — separating text, symbols, and map grid regions before model inference — is the single most impactful pre-processing investment for SITREP vision pipelines. Routing each region type to the correct model eliminates a class of errors that cannot be corrected downstream.
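The routing idea can be sketched as a simple dispatch table. This is a minimal illustration, not TAKpilot's implementation: the `Region` type, the three handler functions, and the `operator_review` fallback are all hypothetical names, and the handlers are stubs standing in for real OCR, symbol, and grid modules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Region:
    """A segmented area of the page (hypothetical structure)."""
    kind: str                         # "text", "symbol", or "grid"
    bbox: Tuple[int, int, int, int]   # (x, y, width, height) in pixels


# Stub handlers: each stands in for a real processing module.
def run_ocr(region: Region) -> dict:
    return {"module": "ocr", "bbox": region.bbox}


def classify_symbol(region: Region) -> dict:
    return {"module": "symbol_classifier", "bbox": region.bbox}


def read_grid_overlay(region: Region) -> dict:
    return {"module": "grid_reader", "bbox": region.bbox}


ROUTES: Dict[str, Callable[[Region], dict]] = {
    "text": run_ocr,
    "symbol": classify_symbol,
    "grid": read_grid_overlay,
}


def route_regions(regions: List[Region]) -> List[dict]:
    """Send each region to the module for its type.

    Regions with an unrecognized type are not guessed at; they are
    handed to the operator instead (an assumption of this sketch).
    """
    results = []
    for region in regions:
        handler = ROUTES.get(region.kind)
        if handler is None:
            results.append({"module": "operator_review", "bbox": region.bbox})
        else:
            results.append(handler(region))
    return results
```

Keeping the routing in one table makes the failure mode explicit: a symbol crop can never reach the OCR model unless layout analysis mislabels it.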

Coordinate extraction: MGRS, UTM, and relative positions

Grid reference extraction is the most technically demanding part of SITREP processing because handwritten MGRS strings are ambiguous in multiple ways simultaneously. The format is: a Grid Zone Designator (a number 1–60 followed by a letter C–X, with I and O omitted), a two-letter 100 km square identifier, and an easting/northing numeric pair of equal length (2, 4, 6, 8, or 10 digits in total). A 10-digit MGRS string specifying 1-metre precision runs to 14 or 15 characters, hand-written by someone under stress, possibly on a moving vehicle and in low light.
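The format described above translates fairly directly into a regular expression. The sketch below is illustrative rather than the pipeline's actual validator: `MGRS_RE` and `parse_mgrs` are names chosen here, and the pattern assumes that the letters I and O never appear in a valid grid and that an optional leading zero on the zone number is acceptable.

```python
import re

# Zone 1-60 (optional leading zero), latitude band C-X without I and O,
# two-letter 100 km square, then 1 to 5 digit pairs (2, 4, 6, 8, 10 digits).
MGRS_RE = re.compile(
    r"^(?P<zone>0?[1-9]|[1-5][0-9]|60)"
    r"(?P<band>[C-HJ-NP-X])"
    r"(?P<square>[A-HJ-NP-Z][A-HJ-NP-V])"
    r"(?P<digits>(?:\d{2}){1,5})$"
)


def parse_mgrs(token: str):
    """Return the parsed parts of an MGRS string, or None if it is invalid."""
    compact = re.sub(r"\s+", "", token).upper()
    match = MGRS_RE.match(compact)
    if match is None:
        return None
    digits = match.group("digits")
    half = len(digits) // 2
    return {
        "zone": int(match.group("zone")),
        "band": match.group("band"),
        "square": match.group("square"),
        "easting": digits[:half],
        "northing": digits[half:],
        # 5 digits per axis is 1 m precision, 1 digit per axis is 10 km.
        "precision_m": 10 ** (5 - half),
    }
```

For example, `parse_mgrs("18S UJ 23371 06519")` yields a 1-metre-precision result, while a token with an odd number of digits or a band letter of I or O returns `None` and would be handed to the fuzzy correction stage.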

The extraction approach combines OCR output with a structured validator. After the text extraction stage produces raw token sequences from the text regions of the document, each token is tested against a regular-expression pattern for valid MGRS format. Tokens that match are recorded as high-confidence grid references. Tokens that partially match but fail validation are passed to a fuzzy correction module: edit-distance matching against a pre-computed lookup table of valid Grid Zone Designator and 100 km square combinations for the theatre of operations. A grid reference that fails clean parsing but matches a valid MGRS prefix within Levenshtein distance 2 is accepted with reduced confidence and flagged for operator review.
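A minimal sketch of the fuzzy correction step follows. It assumes the lookup table is a plain set of valid Grid Zone Designator plus 100 km square prefixes for the theatre (the three prefixes in `THEATRE_PREFIXES` are made-up examples), and it adds one rule not stated above: if two prefixes are equally close, the token is rejected rather than guessed. The function names are hypothetical.

```python
import re

# Hypothetical lookup table of valid GZD + 100 km square prefixes.
THEATRE_PREFIXES = {"18SUJ", "18SUH", "18STJ"}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_correct(token: str, valid_prefixes, max_distance: int = 2):
    """Repair a misread MGRS prefix; return None if no safe repair exists."""
    compact = token.replace(" ", "").upper()
    match = re.match(r"^(.*?)(\d+)$", compact)
    if match is None or not valid_prefixes:
        return None
    prefix, digits = match.groups()
    # The numeric part must still be a valid easting/northing pair.
    if len(digits) % 2 or not 2 <= len(digits) <= 10:
        return None
    scored = sorted((levenshtein(prefix, p), p) for p in valid_prefixes)
    best_distance, best_prefix = scored[0]
    if best_distance > max_distance:
        return None
    # Assumption of this sketch: a tie between prefixes is too risky to accept.
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None
    return {
        "mgrs": best_prefix + digits,
        "edit_distance": best_distance,
        "needs_review": best_distance > 0,
    }
```

A token such as `I8SUJ2337106519`, where the leading 1 was read as the letter I, is repaired to `18SUJ2337106519` with an edit distance of 1 and flagged for review. The edit distance is returned so that downstream confidence scoring can penalize corrected grids.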

UTM references (which some units use, particularly non-NATO forces or those operating legacy systems) are handled by a parallel extraction path. The validator checks for the zone number, the hemisphere letter, and an easting/northing pair in metres. The same path also accepts positions given as latitude and longitude in decimal-degree or degree-minute-second notation.

Relative position references — extremely common in hand-drawn sketches where an entity is placed at "400m NE of checkpoint BRAVO" rather than given an explicit grid — require spatial reasoning beyond regex matching. The pipeline uses a chain-of-thought prompt on a VLM (or a rule-based parser for disconnected edge deployment) to extract the anchor reference point, the bearing (interpreted from compass notation, cardinal, or intercardinal text), and the distance with unit. The anchor's resolved WGS-84 coordinate is then offset by the bearing and distance to compute a derived position. Derived coordinates carry an inflated circular error (CE) value — typically 100–500 m depending on the precision of the offset description — which is passed through to the CoT message so that TAK clients render an appropriate uncertainty ring on the map.
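The rule-based fallback can be approximated in a few lines. The sketch below is an illustration under stated assumptions: it handles only the "distance, 8-point compass direction, of, anchor" phrasing, it uses a spherical Earth model (adequate at these distances given the 100–500 m error budget), and `parse_relative` and `offset_position` are hypothetical names.

```python
import math
import re

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation

BEARINGS = {"N": 0.0, "NE": 45.0, "E": 90.0, "SE": 135.0,
            "S": 180.0, "SW": 225.0, "W": 270.0, "NW": 315.0}

REL_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(km|m)\s+(NE|NW|SE|SW|N|E|S|W)\s+of\s+(.+?)\s*$",
    re.IGNORECASE,
)


def parse_relative(text: str):
    """Return (distance_m, bearing_deg, anchor_name) or None."""
    match = REL_RE.match(text)
    if match is None:
        return None
    value, unit, direction, anchor = match.groups()
    distance_m = float(value) * (1000.0 if unit.lower() == "km" else 1.0)
    return distance_m, BEARINGS[direction.upper()], anchor


def offset_position(lat_deg, lon_deg, bearing_deg, distance_m):
    """Great-circle destination point from a start, bearing, and distance."""
    lat1, lon1 = math.radians(lat_deg), math.radians(lon_deg)
    theta = math.radians(bearing_deg)
    delta = distance_m / EARTH_RADIUS_M  # angular distance in radians
    lat2 = math.asin(math.sin(lat1) * math.cos(delta)
                     + math.cos(lat1) * math.sin(delta) * math.cos(theta))
    lon2 = lon1 + math.atan2(math.sin(theta) * math.sin(delta) * math.cos(lat1),
                             math.cos(delta) - math.sin(lat1) * math.sin(lat2))
    lon_out = (math.degrees(lon2) + 540.0) % 360.0 - 180.0  # wrap to [-180, 180)
    return math.degrees(lat2), lon_out
```

Given the anchor's resolved coordinate, `offset_position(anchor_lat, anchor_lon, *parse_relative("400m NE of checkpoint BRAVO")[:2])` produces the derived position. The inflated CE value is assigned separately, since it depends on how precise the description was rather than on the geometry.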

NATO symbology inference: matching hand-drawn symbols to MIL-STD-2525C

Tactical symbols in hand-drawn SITREPs range from careful, standards-compliant renderings to minimalist sketches that only loosely resemble the canonical APP-6/MIL-STD-2525C forms. A rectangle with crossed diagonals is probably an infantry unit, but an X scrawled over a symbol may instead be meant to mark a destroyed or eliminated entity. An arrow with a line through it may be an obstacle or a boundary. The vision pipeline must map these sketches to 15-character Symbol Identification Coding (SIDC) strings that encode affiliation, battle dimension, status, function, modifiers, and country code.

Symbol classification uses a CNN classifier trained on a synthetic dataset of APP-6/MIL-STD-2525C symbols rendered across a range of degradation conditions: varying stroke widths, rotation up to ±30°, incomplete rendering (simulating interrupted hand-drawing), and background noise typical of paper-over-map photography. The classifier is trained as a hierarchical problem: first predicting affiliation (friendly/hostile/neutral/unknown) and battle dimension (ground/air/sea/space/subsurface), then within each branch predicting the function code. This decomposition significantly reduces the classification search space at each stage.

The classifier outputs a ranked list of SIDC candidates with softmax probabilities. The top candidate above a configurable confidence threshold (default 0.80) is accepted for automatic processing. Below threshold, the entity is queued for operator confirmation: the UI presents the cropped symbol image alongside the top-3 candidates so the operator can select the correct one in a single tap. The confirmation interface is designed to be faster than manual entry even when every entity on the document needs confirmation, not just those below threshold.
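The accept-or-confirm decision reduces to a few lines. In this sketch the function name and return shape are hypothetical, the SIDC strings in the usage note are illustrative placeholders, and "above threshold" is read as "at or above".

```python
from typing import List, Tuple


def route_symbol(candidates: List[Tuple[str, float]],
                 threshold: float = 0.80) -> dict:
    """Auto-accept the top SIDC candidate or queue the top three for review.

    `candidates` is a list of (sidc, softmax_probability) pairs in any order.
    """
    if not candidates:
        raise ValueError("classifier returned no candidates")
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    top_sidc, top_prob = ranked[0]
    if top_prob >= threshold:
        return {"action": "auto_accept", "sidc": top_sidc, "confidence": top_prob}
    # Below threshold: the operator picks from the three best guesses.
    return {"action": "confirm", "options": [sidc for sidc, _ in ranked[:3]]}
```

For instance, a result of `[("SHGPUCI--------", 0.55), ("SHGPUCA--------", 0.30), ("SHGPUCF--------", 0.10)]` falls below the default threshold, so the three codes are returned in ranked order for the one-tap selection described above.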

CoT message generation: from entities to TAK placement

Once entities have been assigned coordinates and SIDC codes, they must be packaged for delivery to the TAK ecosystem. Cursor-on-Target (CoT) XML is the standard interchange format. Each CoT event has the following mandatory structure: a uid (unique identifier derived from the document identifier and entity sequence number), a type (the CoT type string derived from the SIDC code using the standard MIL-STD-2525C-to-CoT mapping table), a how attribute recording how the position was obtained, a time, start, and stale timestamp triplet, and a point element carrying the WGS-84 latitude, longitude, height above ellipsoid (hae), circular error (CE), and linear error (LE) values.
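A CoT event with the structure described above can be assembled with the standard library alone. The sketch below makes several assumptions that should be checked against the target TAK server: the SIDC-to-type conversion is a simplified stand-in for the full mapping table, `how="h-e"` (human-entered, estimated) and the 10-minute stale window are illustrative choices, and `9999999.0` is used as the conventional placeholder for unknown height and linear error. Function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

UNKNOWN = "9999999.0"  # placeholder for values the SITREP does not supply


def sidc_to_cot_type(sidc: str) -> str:
    """Simplified mapping: 'SHGPUCI--------' becomes 'a-h-G-U-C-I'."""
    affiliation = sidc[1].lower()
    dimension = sidc[2].upper()
    function = [char for char in sidc[4:10] if char != "-"]
    return "-".join(["a", affiliation, dimension, *function])


def _iso(moment: datetime) -> str:
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def build_cot_event(uid, sidc, lat, lon, ce_m, callsign=None,
                    now=None, stale_after=timedelta(minutes=10)) -> str:
    """Return one CoT <event> as an XML string."""
    now = now or datetime.now(timezone.utc)
    event = ET.Element("event", {
        "version": "2.0",
        "uid": uid,
        "type": sidc_to_cot_type(sidc),
        "how": "h-e",
        "time": _iso(now),
        "start": _iso(now),
        "stale": _iso(now + stale_after),
    })
    ET.SubElement(event, "point", {
        "lat": f"{lat:.6f}",
        "lon": f"{lon:.6f}",
        "hae": UNKNOWN,
        "ce": f"{ce_m:.1f}",   # drives the uncertainty ring on the map
        "le": UNKNOWN,
    })
    detail = ET.SubElement(event, "detail")
    if callsign:
        ET.SubElement(detail, "contact", {"callsign": callsign})
    return ET.tostring(event, encoding="unicode")
```

The `ce_m` argument is where an inflated circular error from a derived position would be passed through, so the uncertainty survives all the way to the client's map.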

Additional detail about the entity — callsign, unit designation, observer unit, observation time, remarks — is carried in the CoT detail element. The pipeline extracts callsign and unit designation from the text regions of the SITREP using named-entity recognition tuned for military unit naming conventions (alphanumeric callsigns, battalion-regiment-brigade hierarchy notation). Observation time is extracted from the document header if present, or defaults to the document ingestion timestamp with a confidence penalty applied.

The completed CoT XML bundle — one event per extracted entity — is delivered to the TAK server over TCP (for reliable delivery) or UDP multicast (for broadcast to all clients on the tactical network). TAK clients — ATAK on Android, WinTAK on Windows laptops, iTAK on iOS, CloudTAK in the browser — immediately render each entity at its specified coordinates using the appropriate MIL-STD-2525C symbol. The result is a SITREP that was a photograph 15–30 seconds ago appearing as a set of correctly symbolized icons on every operator's shared map.

TAKpilot implementation: vision pipeline integrated with CloudTAK

TAKpilot (corvusintell.com/takpilot) is Corvus Intelligence's TAK operations platform, which includes an integrated SITREP vision processing pipeline connected to CloudTAK. The workflow is designed around the operator confirmation step as the primary human-machine interaction point, rather than treating the vision model as a black box that writes directly to the COP.

An operator receives a SITREP photograph — via radio operator, messaging app forward, or direct upload — and uploads it to the TAKpilot interface. The file is transmitted to the TAKpilot processing backend, which runs the full vision pipeline: pre-processing, layout analysis, OCR, coordinate extraction and validation, symbol classification, callsign and unit extraction, and CoT generation. Processing time for a typical SITREP photograph is 8–20 seconds, depending on document complexity and whether the pipeline is running in cloud mode (VLM API) or edge mode (quantized local model).

The result is presented to the operator as a confirmation card: a structured table listing each detected entity with its extracted grid reference, symbol type (rendered as the MIL-STD-2525C icon), callsign, observation time, and a confidence indicator (green/amber/red) for each field. Entities with any field below threshold are highlighted and require individual confirmation; entities above threshold are pre-approved but can still be corrected. The operator can edit any field inline — correcting an OCR misread or changing a symbol assignment — before approving. Approving the card triggers TAKpilot to push the CoT bundle to the connected CloudTAK server.

The confirmation card design reflects the operational reality that zero-miss is more important than zero-latency: a missed entity on the tactical map is more dangerous than a 10-second confirmation delay. The interface is optimized for mobile (tablet) use so that operators working at a field terminal can complete confirmation with minimal keystrokes.

Accuracy and confidence scoring

Confidence scoring operates at two levels: field-level confidence (individual grid reference, symbol classification, callsign extraction) and entity-level confidence (the product of all field confidences, used for the auto-place vs. confirm routing decision).
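The entity-level score is a plain product of the field scores, which can be sketched as follows. The function names are hypothetical, and the entity-level threshold is left as a required argument because no default value is given here.

```python
import math
from typing import Dict, Tuple


def entity_confidence(field_confidences: Dict[str, float]) -> float:
    """Product of all field-level confidences for one entity."""
    if not field_confidences:
        raise ValueError("an entity needs at least one scored field")
    return math.prod(field_confidences.values())


def route_entity(field_confidences: Dict[str, float],
                 threshold: float) -> Tuple[str, float]:
    """Return ('auto_place' or 'confirm', entity_score)."""
    score = entity_confidence(field_confidences)
    action = "auto_place" if score >= threshold else "confirm"
    return action, score
```

Taking the product is deliberately conservative: an entity with fields scored 0.95, 0.90, and 0.90 ends up at about 0.77, lower than any single field, so one weak field is enough to send the whole entity to the operator.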

Grid reference confidence is computed from three factors: the OCR character-level confidence scores output by the text model, the edit distance from the nearest valid MGRS string (zero for clean parse, higher for fuzzy-corrected), and a spatial plausibility check against the theatre bounding box. A grid reference that parses cleanly, matches a valid MGRS string exactly, and falls within the expected area of operations scores above 0.92 and qualifies for auto-placement. One that required fuzzy correction or falls near the theatre boundary scores 0.65–0.85 and requires confirmation.

Symbol classification confidence is the softmax probability of the top SIDC candidate. In controlled evaluations on a test set of field-collected SITREP photographs, the classifier achieves top-1 accuracy of 87% at the function code level when confidence is above 0.80, falling to 61% below that threshold. This is why the 0.80 threshold for auto-acceptance is important: it separates a reliably correct region from an ambiguous one.

Ambiguous symbols — those for which the top-3 candidates are closely clustered (softmax spread less than 0.15) — are always routed to human confirmation regardless of the top-candidate score. Close clustering indicates genuine symbol ambiguity (the hand-drawn symbol is consistent with multiple tactical meanings) rather than low-quality input, and the correct resolution requires operator knowledge of the tactical context that the model does not have.
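One way to express the clustering rule in code is shown below. It assumes "spread" means the gap between the highest and lowest of the top-3 probabilities, which is an interpretation rather than a definition given above; the function name is hypothetical.

```python
from typing import Sequence


def is_ambiguous(probabilities: Sequence[float], max_spread: float = 0.15) -> bool:
    """True when the top-3 softmax probabilities are closely clustered."""
    top3 = sorted(probabilities, reverse=True)[:3]
    if len(top3) < 2:
        return False  # a single candidate cannot be clustered with anything
    return (top3[0] - top3[-1]) < max_spread
```

Under this reading, a distribution such as 0.40, 0.33, 0.27 is ambiguous, while 0.85, 0.10, 0.05 is not. Note that with a softmax output, a top candidate at or above 0.80 leaves at most 0.20 for all the others, so the rule cannot fire at the default threshold; it acts as a guard when the auto-placement threshold has been lowered for a session.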

Operational note: Auto-placement thresholds should be mission-configured, not hardcoded. During high-tempo phases where speed outweighs accuracy risk, threshold can be lowered. During consolidation or planning phases where COP accuracy is paramount, threshold should be raised and all entities confirmed. TAKpilot exposes threshold as a per-session operator setting.

Edge deployment: Jetson, CPU-only nodes, and disconnected operation

Cloud-connected SITREP processing (routing documents to a VLM API endpoint) achieves the highest extraction accuracy but introduces latency and a network dependency that cannot be relied on at the tactical edge. The TAKpilot vision pipeline is therefore designed to also run fully air-gapped on edge hardware.

NVIDIA Jetson AGX Orin is the primary target for a full-featured edge deployment. In its 64 GB unified-memory configuration, the node can run a quantized 7B-parameter vision-language model (LLaVA-1.6 or equivalent at INT4 via llama.cpp) for general entity extraction alongside a TensorRT-optimized symbol classifier. A single SITREP image processes in 8–15 seconds. The Jetson simultaneously serves as the CloudTAK node: TAKpilot and CloudTAK run as co-located services on the same device, with CoT delivery over loopback rather than a network hop. This colocation architecture is important for forward-deployed headquarters where the TAK server and the SITREP processing system are on the same ruggedized compute node.

CPU-only nodes, where GPU hardware is unavailable or the power budget is below Jetson levels, use a two-model pipeline: PaddleOCR with its PP-OCRv4 detection and recognition models for text extraction (roughly 1 second per page on a modern ARM64 core), and a lightweight MobileNetV3 symbol classifier at INT8 quantization for symbol recognition. The VLM step is omitted; relative position parsing falls back to the rule-based offset parser. This pipeline processes a SITREP in 3–6 seconds on a modern laptop CPU, or 8–20 seconds on a single-board ARM processor (Raspberry Pi 5 class). Extraction accuracy on complex documents is somewhat lower, but performance remains operationally useful for the most common SITREP formats.

Model updates in the field follow the same signed-package update mechanism described for other edge AI deployments: the update bundle is cryptographically signed, delivered via the TAKpilot management channel, and applied with automatic rollback if post-update accuracy metrics fall below baseline. Theatre-specific fine-tuning — adapting the symbol classifier to the particular hand-drawing conventions of the units in the area of operations — can be pushed to forward nodes as a model delta within 24 hours of receiving a labeled sample batch.

The transition between edge and cloud mode is transparent to the operator. When network connectivity is available, TAKpilot routes to the cloud pipeline for higher accuracy. When connectivity drops — as detected by a 5-second timeout on the API health check — it automatically falls back to the local model without operator intervention. The confirmation card UI is identical in both modes; only the processing time changes.
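The fallback logic can be sketched as below. This is an illustration only: the health endpoint URL is whatever the deployment exposes, the two pipeline callables are placeholders, and falling back when the cloud call itself fails mid-request is an assumption added here rather than documented behaviour.

```python
import urllib.request
from typing import Callable, Tuple


def cloud_healthy(health_url: str, timeout_s: float = 5.0) -> bool:
    """True if the health endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(health_url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTP errors, refused connections, and timeouts.
        return False


def process_sitrep(image_bytes: bytes,
                   cloud_pipeline: Callable[[bytes], object],
                   edge_pipeline: Callable[[bytes], object],
                   is_cloud_up: Callable[[], bool]) -> Tuple[str, object]:
    """Use the cloud pipeline when reachable, otherwise the local model.

    Returns (mode, result) so the UI can show which path was taken.
    """
    if is_cloud_up():
        try:
            return "cloud", cloud_pipeline(image_bytes)
        except OSError:
            pass  # link dropped mid-request; fall through to the edge model
    return "edge", edge_pipeline(image_bytes)
```

In use, `is_cloud_up` would typically be `lambda: cloud_healthy(url)` for the deployment's health URL. Passing it in as a callable keeps the routing testable without a network and lets the result be cached between uploads.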