How are relative positions on hand-drawn sketches handled when no explicit grid reference is written?

Many field sketches mark entities relative to a reference point (e.g., '400m NE of checkpoint BRAVO') rather than with explicit grid references. The vision pipeline first extracts any explicit anchor grid references on the sketch, then uses a spatial reasoning module — either a VLM with chain-of-thought prompting or a rule-based parser — to interpret directional offsets (bearing + distance) and compute a derived WGS-84 coordinate. Bearing is extracted from compass notation or cardinal/intercardinal text; distance is extracted from numeric tokens with unit detection (m, km, metres). Derived coordinates carry an inflated CE value reflecting the uncertainty of the offset interpretation.

Can the vision pipeline run fully offline on disconnected edge nodes?

Yes. On an NVIDIA Jetson AGX Orin with 64 GB unified memory, a quantized open-weight VLM (e.g., LLaVA-1.6 7B at INT4 via llama.cpp) can process a SITREP image in 8–15 seconds with acceptable extraction quality. For higher throughput or lower latency, a purpose-built pipeline of PaddleOCR plus a TensorRT-optimized symbol classifier runs at under 2 seconds per document on the same hardware. Both pipelines operate fully air-gapped with no external API dependencies, making them viable for forward-deployed nodes without internet connectivity.

What pre-processing steps improve extraction accuracy on degraded field documents?

The most impactful steps are: (1) de-skew using projection profile analysis or Hough line detection, correcting rotations up to ±15°; (2) adaptive binarization (Sauvola or Niblack) rather than global thresholding, which handles uneven illumination from photographed documents; (3) CLAHE to recover low-contrast pencil marks; (4) noise removal with a morphological open/close pass at a kernel sized to the expected minimum stroke width; and (5) layout analysis to segment text regions from symbol regions before routing each to the appropriate model, which prevents OCR from attempting to read tactical symbols as text.

Edge AI for Defense

AI vision for SITREP processing: automatic entity extraction and map placement

Q: Which AI vision model types are best suited for SITREP entity extraction?

Multimodal large vision-language models (VLMs) such as GPT-4o, Gemini 1.5 Pro, or open-weight alternatives like LLaVA perform well for structured extraction because they combine OCR, spatial reasoning, and symbol recognition in a single inference pass. For edge-deployed pipelines without cloud connectivity, a combination of a lightweight OCR model (PaddleOCR, Tesseract with LSTM) for text and a small YOLO variant fine-tuned on MIL-STD-2525C symbols handles entity detection at acceptable latency on Jetson-class hardware.

Q: How does the model parse MGRS grid references from handwritten text?

After the OCR stage extracts raw text tokens, a regular-expression validator checks each token against the MGRS format pattern: a Grid Zone Designator (a zone number 1–60 followed by a latitude band letter), a two-letter 100 km square identifier, and an easting/northing pair of equal digit length (2, 4, 6, 8, or 10 digits in total). Tokens that partially match the pattern but fail validation are passed to a fuzzy correction module that applies edit-distance matching against a pre-computed lookup table of valid GZD and square combinations for the theatre of operations. Confidence is scored by Levenshtein distance from the nearest valid MGRS string.

Q: How are hand-drawn NATO tactical symbols matched to MIL-STD-2525C codes?

The symbol region is cropped from the document image and passed through a CNN classifier trained on synthetic renderings of APP-6/MIL-STD-2525C symbol frames and icons. The classifier outputs a ranked list of SIDC (Symbol Identification Coding) candidates with confidence scores. The top candidate above a configurable threshold (default 0.80) is accepted; below threshold, the entity is flagged for operator confirmation. The classification training set must include hand-drawn and degraded versions of symbols, not only clean vector renderings, to achieve acceptable field accuracy.

Q: What is a CoT message and how does it carry SITREP entities to TAK?

Cursor-on-Target (CoT) is an XML schema originally developed for US DoD sensor-to-shooter interoperability. Each CoT event carries a unique ID, event type (which encodes the MIL-STD-2525C SIDC), a timestamp, and a point element containing WGS-84 latitude, longitude, and circular error (CE) in metres. TAK server and TAK clients (ATAK, WinTAK, iTAK, CloudTAK) ingest CoT messages over UDP multicast, TCP, or WebSocket and render the entity on the map at the specified coordinates. An extracted SITREP entity becomes a CoT event once its grid reference is converted to WGS-84 and its symbol code is mapped to a CoT type string.

Q: What is the TAKpilot vision pipeline and how does it connect to CloudTAK?

TAKpilot (corvusintell.com/takpilot) includes a document vision pipeline integrated with CloudTAK. An operator uploads a SITREP image or PDF through the TAKpilot interface; the file is passed to the vision processing backend, which runs entity extraction and returns a structured confirmation card listing each detected entity with its extracted grid reference, symbol code, callsign, and confidence score. The operator reviews and approves (or corrects) each entity, then triggers map placement — TAKpilot generates a CoT XML bundle and pushes it to the connected CloudTAK server, placing all approved entities on the shared tactical map simultaneously.

Q: When should entities be auto-placed vs held for human confirmation?

Entities with a grid reference confidence above 0.92 and a symbol classification confidence above 0.85 can typically be auto-placed with a low false-placement rate in controlled evaluations. Below either threshold, the entity should be queued for operator confirmation. Any entity whose extracted coordinates fall outside the expected theatre bounding box should always require confirmation regardless of individual confidence scores — this catches gross OCR errors such as transposed digits that happen to produce valid but wrong MGRS strings.

By Corvus Intelligence Engineering Team

May 29, 2026 12 min read

Every tactical headquarters runs on SITREPs – situation reports that aggregate observations from platoon level upward into a coherent picture of what is happening on the battlefield. The problem is that a large fraction of those SITREPs still arrive as hand-drawn sketches on paper, photographed maps, annotated satellite printouts, or scanned forms. Before any of that information reaches the digital common operating picture (COP), it passes through a human operator who reads the document, identifies each tactical entity, transcribes grid references, and manually plots the unit or threat onto a screen. That manual re-entry step is the bottleneck, and it is one of the highest-leverage targets for AI vision in military operations today.

This article describes the full technical pipeline for automating SITREP processing with AI vision: from image ingestion and pre-processing through entity extraction, coordinate parsing, NATO symbol inference, and CoT message generation for TAK placement. It covers where the pipeline can operate autonomously, where human confirmation is required, how it integrates with CloudTAK via TAKpilot, and what it takes to run it on edge hardware in disconnected environments.

The SITREP processing bottleneck

A field SITREP arriving at a battalion operations center typically takes one of several physical forms: a hand-drawn sketch on a grid overlay sheet, a photograph of a map with annotations written on it in grease pencil or marker, a scanned or photographed pre-printed form with fields filled by hand, or – increasingly – a photo taken by a soldier on a smartphone and transmitted via messaging app. Each of these requires the receiving operator to do the same things: identify the reporting unit's callsign, find the grid references for each observed entity, determine what type of entity it is (friendly, enemy, unknown; vehicle type, troop concentration, obstacle, fire position), and enter all of that into the digital COP.

Under calm conditions this process takes 3–8 minutes per SITREP. Under stress, at night, or during high-tempo operations when dozens of SITREPs may arrive per hour, it becomes a bottleneck that introduces dangerous staleness into the tactical picture. The operator's cognitive attention – which should be on interpretation and decision support – is consumed by transcription. Errors in transcription are common: transposed grid digits, misread callsigns, ambiguous symbol identification. The digital COP lags the actual situation by the time it takes to process the backlog.

AI vision models address this bottleneck by automating the transcription step. The operator uploads or forwards the document; the model extracts entities, resolves coordinates, identifies symbols, and generates a structured output ready for map placement. The operator's role shifts from transcriber to reviewer – confirming or correcting the model's output before committing it to the COP, a task that takes seconds rather than minutes.

Vision model pipeline: ingestion to structured extraction

The pipeline begins with image ingestion. Input formats include JPEG and PNG photographs, PDF scans, and occasionally video frames from a soldier's device. For multi-page PDFs, each page is rasterized to a high-resolution image (300 DPI minimum for form scans; 150 DPI acceptable for large-format map photographs where the relevant annotations are large). A metadata extraction step records any EXIF data – particularly timestamp and GPS coordinates if the image was taken on a smartphone – which can serve as a prior for the expected area of operations.

Pre-processing is the most impactful phase for extraction accuracy on degraded field documents. The pipeline applies: de-skew using projection profile analysis or Hough line detection, correcting document rotations of up to ±15° that are common in handheld photographs; adaptive binarization (Sauvola algorithm) rather than global thresholding, which handles the uneven illumination typical of documents photographed under field lighting; CLAHE to recover low-contrast pencil marks that global contrast enhancement would wash out; morphological noise removal using an open/close pass sized to the expected minimum stroke width; and layout analysis to segment the document into text regions, symbol regions, and grid overlay regions before routing each to the appropriate processing module. This segmentation step is important: OCR models applied to tactical symbol regions produce meaningless output, and symbol classifiers applied to handwritten text fields produce incorrect symbol matches.
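To make the adaptive binarization step concrete, here is a minimal pure-Python sketch of Sauvola thresholding on a tiny synthetic page. The function name, the 3×3 window, and the parameters k = 0.2 and R = 128 are illustrative assumptions, not values taken from the TAKpilot pipeline; a production system would more likely use an optimized image library and a much larger window.

```python
from math import sqrt


def sauvola_binarize(gray, window=3, k=0.2, r=128.0):
    """Binarize a grayscale image (list of rows, values 0-255).

    Uses Sauvola's local threshold T = m * (1 + k * (s / R - 1)), where m and s
    are the mean and standard deviation of the window around each pixel.
    Returns a same-sized grid of 0 (ink) and 255 (paper).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighbourhood around (x, y), clipped at the image border.
            patch = [
                gray[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(patch) / len(patch)
            std = sqrt(sum((p - mean) ** 2 for p in patch) / len(patch))
            threshold = mean * (1 + k * (std / r - 1))
            row.append(0 if gray[y][x] < threshold else 255)
        out.append(row)
    return out


# Synthetic page: paper brightness fades from 220 (left) to 100 (right),
# imitating uneven field lighting. Two faint strokes sit 70 levels below paper.
paper = [220 - 10 * x for x in range(13)]
page = [list(paper) for _ in range(5)]
page[2][2] -= 70    # stroke in the bright area: 200 -> 130
page[2][11] -= 70   # stroke in the shadowed area: 110 -> 40

local_result = sauvola_binarize(page)
global_result = [[0 if v < 128 else 255 for v in row] for row in page]
```

On this page a global threshold of 128 misses the stroke in the bright area and turns the whole shadowed edge into ink, while the local threshold marks only the two stroke pixels.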

Key insight: Layout analysis – separating text, symbols, and map grid regions before model inference – is the single most impactful pre-processing investment for SITREP vision pipelines. Routing each region type to the correct model eliminates a class of errors that cannot be corrected downstream.

Coordinate extraction: MGRS, UTM, and relative positions

Grid reference extraction is the most technically demanding part of SITREP processing because handwritten MGRS strings are ambiguous in multiple ways simultaneously. The format is: a Grid Zone Designator (a number 1–60 followed by a letter C–X, omitting I and O), a two-letter 100 km square identifier, and an easting/northing numeric pair of equal length (2, 4, 6, 8, or 10 digits in total). A 10-digit MGRS string, which specifies a position to 1-metre precision, runs to 15 characters with a two-digit zone, and it is often handwritten by someone under stress, on a moving vehicle, possibly in low light.

The extraction approach combines OCR output with a structured validator. After the text extraction stage produces raw token sequences from the text regions of the document, each token is tested against a regular-expression pattern for valid MGRS format. Tokens that match are recorded as high-confidence grid references. Tokens that partially match but fail validation are passed to a fuzzy correction module: edit-distance matching against a pre-computed lookup table of valid Grid Zone Designator and 100 km square combinations for the theatre of operations. A grid reference that fails clean parsing but matches a valid MGRS prefix within Levenshtein distance 2 is accepted with reduced confidence and flagged for operator review.
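The validate-then-correct flow described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `THEATRE_PREFIXES` table, and the way a token is split into prefix and digits are hypothetical, and the regular expression encodes only the general MGRS letter rules (no I or O), not which squares actually exist in a given zone.

```python
import re

# GZD: zone 1-60 plus band letter C-X (no I, O); 100 km square: column letter
# A-Z and row letter A-V (no I, O); then 2-10 digits in equal halves.
MGRS_RE = re.compile(
    r"^(?:0?[1-9]|[1-5][0-9]|60)[C-HJ-NP-X][A-HJ-NP-Z][A-HJ-NP-V](?:\d{2}){1,5}$"
)
DIGITS_RE = re.compile(r"^(?:\d{2}){1,5}$")

# Hypothetical lookup table of GZD + square combinations for the theatre.
THEATRE_PREFIXES = ["33UXP", "33UXQ"]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def parse_mgrs(token, theatre_prefixes, max_distance=2):
    """Return (mgrs, edit_distance, needs_review), or None if nothing plausible."""
    text = re.sub(r"\s+", "", token).upper()
    if MGRS_RE.match(text):
        return text, 0, False  # clean parse
    best = None
    for prefix in theatre_prefixes:
        # Allow the OCR'd prefix to be one character shorter or longer.
        for cut in range(max(1, len(prefix) - 1), len(prefix) + 2):
            head, digits = text[:cut], text[cut:]
            if not DIGITS_RE.match(digits):
                continue
            dist = levenshtein(head, prefix)
            if dist <= max_distance and (best is None or dist < best[1]):
                best = (prefix + digits, dist, True)  # corrected, flag for review
    return best


clean = parse_mgrs("33U XP 0480 1220", THEATRE_PREFIXES)
fixed = parse_mgrs("3BUXP04801220", THEATRE_PREFIXES)   # '3' misread as 'B'
junk = parse_mgrs("CHECKPOINT", THEATRE_PREFIXES)
```

Note that a misread which still yields a well-formed string (for example a different but valid square letter) passes the regular expression untouched; that case is only caught later by the theatre bounding-box check.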

UTM references (which some units use, particularly non-NATO forces or those operating legacy systems) are handled by a parallel extraction path. The validator checks for the zone number, the hemisphere or latitude band letter, and an easting/northing pair expressed in metres.

Relative position references – extremely common in hand-drawn sketches where an entity is placed at "400m NE of checkpoint BRAVO" rather than given an explicit grid – require spatial reasoning beyond regex matching. The pipeline uses a chain-of-thought prompt on a VLM (or a rule-based parser for disconnected edge deployment) to extract the anchor reference point, the bearing (interpreted from compass notation, cardinal, or intercardinal text), and the distance with unit. The anchor's resolved WGS-84 coordinate is then offset by the bearing and distance to compute a derived position. Derived coordinates carry an inflated circular error (CE) value – typically 100–500 m depending on the precision of the offset description – which is passed through to the CoT message so that TAK clients render an appropriate uncertainty ring on the map.
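A rule-based version of the offset step might look like the sketch below. It assumes the anchor has already been resolved to WGS-84, handles only abbreviated compass points (N, NE, and so on), and uses the spherical great-circle destination formula rather than a full ellipsoidal solution; for offsets of a few hundred metres the difference is far smaller than the inflated CE. The function names and the regular expression are illustrative, not the TAKpilot parser.

```python
import math
import re

BEARINGS = {"N": 0.0, "NE": 45.0, "E": 90.0, "SE": 135.0,
            "S": 180.0, "SW": 225.0, "W": 270.0, "NW": 315.0}
UNIT_METRES = {"m": 1.0, "metres": 1.0, "meters": 1.0, "km": 1000.0}
OFFSET_RE = re.compile(
    r"(?P<dist>\d+(?:\.\d+)?)\s*(?P<unit>km|metres|meters|m)\s+"
    r"(?P<dir>NE|NW|SE|SW|N|E|S|W)\s+of\s+(?P<anchor>.+)",
    re.IGNORECASE,
)
EARTH_RADIUS_M = 6371008.8  # mean Earth radius; spherical approximation


def parse_offset(text):
    """Return (distance_m, bearing_deg, anchor_name), or None if no match."""
    m = OFFSET_RE.search(text)
    if not m:
        return None
    metres = float(m["dist"]) * UNIT_METRES[m["unit"].lower()]
    return metres, BEARINGS[m["dir"].upper()], m["anchor"].strip()


def offset_position(lat, lon, bearing_deg, distance_m):
    """Destination point from a start position, bearing, and distance."""
    phi1, lam1 = math.radians(lat), math.radians(lon)
    theta = math.radians(bearing_deg)
    delta = distance_m / EARTH_RADIUS_M
    phi2 = math.asin(math.sin(phi1) * math.cos(delta)
                     + math.cos(phi1) * math.sin(delta) * math.cos(theta))
    lam2 = lam1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2),
    )
    return math.degrees(phi2), math.degrees(lam2)


# Hypothetical anchor table; in practice this comes from the extracted anchors.
anchors = {"checkpoint BRAVO": (50.0, 14.0)}

distance_m, bearing, anchor = parse_offset("400m NE of checkpoint BRAVO")
lat, lon = offset_position(*anchors[anchor], bearing, distance_m)
```

Text such as "about half a click north-east of the bridge" falls outside this pattern, which is where the VLM path earns its keep.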

NATO symbology inference: matching hand-drawn symbols to MIL-STD-2525C

Tactical symbols in hand-drawn SITREPs range from careful, standards-compliant renderings to minimalist sketches that only loosely resemble the canonical APP-6/MIL-STD-2525C forms. A rectangle crossed by two diagonals is probably a friendly infantry unit, and a dot above the frame probably marks squad echelon. An X drawn across an entire symbol more likely indicates a destroyed or eliminated entity, which is easy to confuse with the infantry icon in a rough sketch. An arrow with a line through it may be an obstacle or a boundary. The vision pipeline must map these sketches to 15-character Symbol Identification Coding (SIDC) strings that encode affiliation, battle dimension, status, function, modifiers, and country code.

Symbol classification uses a CNN classifier trained on a synthetic dataset of APP-6/MIL-STD-2525C symbols rendered across a range of degradation conditions: varying stroke widths, rotation up to ±30°, incomplete rendering (simulating interrupted hand-drawing), and background noise typical of paper-over-map photography. The classifier is trained as a hierarchical problem: first predicting affiliation (friendly/hostile/neutral/unknown) and battle dimension (ground/air/sea/space/subsurface), then within each branch predicting the function code. This decomposition significantly reduces the classification search space at each stage.

The classifier outputs a ranked list of SIDC candidates with softmax probabilities. The top candidate above a configurable confidence threshold (default 0.80) is accepted for automatic processing. Below threshold, the entity is queued for operator confirmation – the UI presents the cropped symbol image alongside the top-3 candidates so the operator can select the correct one in a single tap. The confirmation interface is designed to be faster than manual entry even when every entity on a document has to be confirmed, not only when a few fall below threshold.

CoT message generation: from entities to TAK placement

Once entities have extracted coordinates and assigned SIDC codes, they must be packaged for delivery to the TAK ecosystem. Cursor-on-Target (CoT) XML is the standard interchange format. Each CoT event has the following mandatory structure: a uid (unique identifier derived from the document identifier and entity sequence number), a type (the CoT type string derived from the SIDC code using the standard MIL-STD-2525C-to-CoT mapping table), a time, start, and stale timestamp triplet, and a point element carrying the WGS-84 latitude, longitude, height, circular error (CE), and linear error (LE) values.

Additional detail about the entity – callsign, unit designation, observer unit, observation time, remarks – is carried in the CoT detail element. The pipeline extracts callsign and unit designation from the text regions of the SITREP using named-entity recognition tuned for military unit naming conventions (alphanumeric callsigns, battalion-regiment-brigade hierarchy notation). Observation time is extracted from the document header if present, or defaults to the document ingestion timestamp with a confidence penalty applied.
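The event structure described above can be sketched with the Python standard library. Several details here are assumptions rather than statements about TAKpilot: the SIDC-to-type rule (lower-cased affiliation, battle dimension, then the non-blank function characters) is the commonly used convention, `how="h-e"`, the 30-minute stale window, and the `contact` detail element are illustrative choices, and 9999999.0 stands in for unknown height and linear error.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone


def sidc_to_cot_type(sidc):
    """Map a 15-character MIL-STD-2525C SIDC to a CoT type (assumed convention)."""
    affiliation = sidc[1].lower()                       # position 2: affiliation
    dimension = sidc[2].upper()                         # position 3: battle dimension
    function = [c for c in sidc[4:10] if c != "-"]      # positions 5-10: function ID
    return "-".join(["a", affiliation, dimension, *function])


def build_cot_event(doc_id, seq, sidc, lat, lon, ce_m, callsign, stale_minutes=30):
    """Build one CoT event as an XML string."""
    now = datetime.now(timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    event = ET.Element("event", {
        "version": "2.0",
        "uid": f"{doc_id}-{seq}",            # document id + entity sequence number
        "type": sidc_to_cot_type(sidc),
        "how": "h-e",                        # assumption: human-entered/estimated
        "time": now.strftime(fmt),
        "start": now.strftime(fmt),
        "stale": (now + timedelta(minutes=stale_minutes)).strftime(fmt),
    })
    ET.SubElement(event, "point", {
        "lat": f"{lat:.6f}",
        "lon": f"{lon:.6f}",
        "hae": "9999999.0",                  # height unknown
        "ce": f"{ce_m:.1f}",                 # circular error in metres
        "le": "9999999.0",                   # linear error unknown
    })
    detail = ET.SubElement(event, "detail")
    ET.SubElement(detail, "contact", {"callsign": callsign})
    return ET.tostring(event, encoding="unicode")


# Hostile ground infantry unit placed from a relative reference (inflated CE).
xml_text = build_cot_event("sitrep42", 1, "SHGPUCI--------",
                           50.002544, 14.003957, 250.0, "OP-3 sighting")
```

Observation time extracted from the document header would replace `now` for the `time` and `start` attributes when it is available.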

The completed CoT XML bundle – one event per extracted entity – is delivered to the TAK server over TCP (for reliable delivery) or UDP multicast (for broadcast to all clients on the tactical network). TAK clients – ATAK on Android, WinTAK on Windows laptops, iTAK on iOS, CloudTAK in the browser – immediately render each entity at its specified coordinates using the appropriate MIL-STD-2525C symbol. A SITREP that was a photograph 15–30 seconds earlier now appears as a set of correctly symbolized icons on every operator's shared map.

TAKpilot implementation: vision pipeline integrated with CloudTAK

TAKpilot (corvusintell.com/takpilot) is Corvus Intelligence's TAK operations platform, which includes an integrated SITREP vision processing pipeline connected to CloudTAK. The workflow is designed around the operator confirmation step as the primary human-machine interaction point, rather than treating the vision model as a black box that writes directly to the COP.

An operator receives a SITREP photograph – via radio operator, messaging app forward, or direct upload – and uploads it to the TAKpilot interface. The file is transmitted to the TAKpilot processing backend, which runs the full vision pipeline: pre-processing, layout analysis, OCR, coordinate extraction and validation, symbol classification, callsign and unit extraction, and CoT generation. Processing time for a typical SITREP photograph is 8–20 seconds, depending on document complexity and whether the pipeline is running in cloud mode (VLM API) or edge mode (quantized local model).

The result is presented to the operator as a confirmation card: a structured table listing each detected entity with its extracted grid reference, symbol type (rendered as the MIL-STD-2525C icon), callsign, observation time, and a confidence indicator (green/amber/red) for each field. Entities with any field below threshold are highlighted and require individual confirmation; entities above threshold are pre-approved but can still be corrected. The operator can edit any field inline – correcting an OCR misread or changing a symbol assignment – before approving. Approving the card triggers TAKpilot to push the CoT bundle to the connected CloudTAK server.

The confirmation card design reflects the operational reality that zero-miss is more important than zero-latency: a missed entity on the tactical map is more dangerous than a 10-second confirmation delay. The interface is optimized for mobile (tablet) use so that operators working at a field terminal can complete confirmation with minimal keystrokes.

Accuracy and confidence scoring

Confidence scoring operates at two levels: field-level confidence (individual grid reference, symbol classification, callsign extraction) and entity-level confidence (the product of all field confidences, used for the auto-place vs. confirm routing decision).

Grid reference confidence is computed from three factors: the OCR character-level confidence scores output by the text model, the edit distance from the nearest valid MGRS string (zero for clean parse, higher for fuzzy-corrected), and a spatial plausibility check against the theatre bounding box. A grid reference that parses cleanly, matches a valid MGRS string exactly, and falls within the expected area of operations scores above 0.92 and qualifies for auto-placement. One that required fuzzy correction or falls near the theatre boundary scores 0.65–0.85 and requires confirmation.

Symbol classification confidence is the softmax probability of the top SIDC candidate. In controlled evaluations on a test set of field-collected SITREP photographs, the classifier achieves top-1 accuracy of 87% at the function code level when confidence is above 0.80, falling to 61% below that threshold. This is why the 0.80 threshold for auto-acceptance is important: it separates a reliably correct region from an ambiguous one.

Ambiguous symbols – those for which the top-3 candidates are closely clustered (softmax spread less than 0.15) – are always routed to human confirmation regardless of the top-candidate score. Close clustering indicates genuine symbol ambiguity (the hand-drawn symbol is consistent with multiple tactical meanings) rather than low-quality input, and the correct resolution requires operator knowledge of the tactical context that the model does not have.

Operational note: Auto-placement thresholds should be mission-configured, not hardcoded. During high-tempo phases where speed outweighs accuracy risk, thresholds can be lowered. During consolidation or planning phases where COP accuracy is paramount, thresholds should be raised and all entities confirmed. TAKpilot exposes the thresholds as a per-session operator setting.
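The routing rules from this section can be collected into one small function with configurable thresholds. This is a sketch under assumptions: the class and field names are hypothetical, the defaults are the 0.92 and 0.85 auto-placement figures quoted in the FAQ, and "softmax spread" is interpreted as the gap between the first and third candidate probabilities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    grid: float = 0.92              # minimum grid reference confidence
    symbol: float = 0.85            # minimum top-1 symbol probability
    ambiguity_spread: float = 0.15  # minimum gap between top-1 and top-3


@dataclass
class Entity:
    lat: float
    lon: float
    grid_conf: float
    symbol_probs: list              # softmax probabilities, sorted descending
    callsign_conf: float = 1.0


def entity_confidence(e):
    """Entity-level confidence: product of the field-level confidences."""
    return e.grid_conf * e.symbol_probs[0] * e.callsign_conf


def route(e, bbox, t=Thresholds()):
    """Return 'auto' or 'confirm'. bbox = (min_lat, min_lon, max_lat, max_lon)."""
    min_lat, min_lon, max_lat, max_lon = bbox
    # Outside the theatre: always confirm, whatever the confidence scores say.
    if not (min_lat <= e.lat <= max_lat and min_lon <= e.lon <= max_lon):
        return "confirm"
    # Closely clustered top-3 candidates: genuine symbol ambiguity.
    top3 = e.symbol_probs[:3]
    if len(top3) == 3 and top3[0] - top3[2] < t.ambiguity_spread:
        return "confirm"
    if e.grid_conf < t.grid or top3[0] < t.symbol:
        return "confirm"
    return "auto"


theatre = (49.0, 13.0, 51.0, 15.0)  # hypothetical bounding box
good = Entity(50.0, 14.0, 0.95, [0.91, 0.05, 0.04])
far_away = Entity(10.0, 14.0, 0.99, [0.99, 0.005, 0.005])
ambiguous = Entity(50.0, 14.0, 0.95, [0.38, 0.34, 0.28])
high_tempo = Thresholds(grid=0.70, symbol=0.30)
```

With the default thresholds the ambiguity rule is effectively redundant, since a top-1 probability of 0.85 leaves little room for close runners-up; it starts to matter once an operator lowers the symbol threshold, as in the `high_tempo` setting.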

Edge deployment: Jetson, CPU-only nodes, and disconnected operation

Cloud-connected SITREP processing (routing documents to a VLM API endpoint) achieves the highest extraction accuracy but introduces latency and a network dependency that is unacceptable at the tactical edge. The TAKpilot vision pipeline is designed to run fully air-gapped on edge hardware.

NVIDIA Jetson AGX Orin is the primary target for a full-featured edge deployment. With 64 GB unified memory, the node can run a quantized 7B-parameter vision-language model (LLaVA-1.6 or equivalent at INT4 via llama.cpp) for general entity extraction alongside a TensorRT-optimized symbol classifier. A single SITREP image processes in 8–15 seconds. The Jetson simultaneously serves as the CloudTAK node – TAKpilot and CloudTAK run as co-located services on the same device, with CoT delivery over loopback rather than a network hop. This colocation architecture is important for forward-deployed headquarters where the TAK server and the SITREP processing system are on the same ruggedized compute node.

CPU-only nodes – where GPU hardware is unavailable or power-constrained below Jetson levels – use a two-model pipeline: PaddleOCR with its PP-OCRv4 detection and recognition models for text extraction (runs at ~1 second per page on a modern ARM64 core), and a lightweight MobileNetV3 symbol classifier at INT8 quantization for symbol recognition. The VLM step is omitted; relative position parsing falls back to the rule-based offset parser. This pipeline processes a SITREP in 3–6 seconds on a modern laptop CPU, or 8–20 seconds on a single-board ARM processor (Raspberry Pi 5 class). Extraction accuracy is somewhat lower on complex documents, but performance remains operationally useful for the most common SITREP formats.

Model updates in the field follow the same signed-package update mechanism described for other edge AI deployments: the update bundle is cryptographically signed, delivered via the TAKpilot management channel, and applied with automatic rollback if post-update accuracy metrics fall below baseline. Theatre-specific fine-tuning – adapting the symbol classifier to the particular hand-drawing conventions of the units in the area of operations – can be pushed to forward nodes as a model delta within 24 hours of receiving a labeled sample batch.

The transition between edge and cloud mode is transparent to the operator. When network connectivity is available, TAKpilot routes to the cloud pipeline for higher accuracy. When connectivity drops – as detected by a 5-second timeout on the API health check – it automatically falls back to the local model without operator intervention. The confirmation card UI is identical in both modes; only the processing time changes.
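The fallback behaviour can be sketched as a health check with a 5-second timeout followed by a simple routing function. The function names are hypothetical, the health endpoint URL would be deployment-specific, and the two pipelines are passed in as callables so the sketch stays independent of any particular model.

```python
import urllib.request

HEALTH_TIMEOUT_S = 5  # timeout quoted in the text above


def cloud_is_healthy(url, timeout=HEALTH_TIMEOUT_S):
    """Return True if the health endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, timeouts, and refused connections
        return False


def process_sitrep(image_bytes, cloud_fn, edge_fn, health_fn):
    """Use the cloud pipeline when reachable, otherwise the local model.

    Returns (mode, result) so the UI can show which path produced the card.
    """
    if health_fn():
        try:
            return "cloud", cloud_fn(image_bytes)
        except OSError:
            pass  # link dropped mid-request: fall through to the edge model
    return "edge", edge_fn(image_bytes)


# Stub pipelines for illustration only.
def fake_cloud(image_bytes):
    return {"entities": 3, "source": "vlm-api"}


def fake_edge(image_bytes):
    return {"entities": 3, "source": "local-model"}


def dropped_link(image_bytes):
    raise ConnectionError("uplink lost")


online = process_sitrep(b"...", fake_cloud, fake_edge, lambda: True)
offline = process_sitrep(b"...", fake_cloud, fake_edge, lambda: False)
mid_drop = process_sitrep(b"...", dropped_link, fake_edge, lambda: True)
```

In a deployment, `health_fn` would wrap `cloud_is_healthy` with the configured endpoint; injecting it keeps the routing logic testable without a network.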

This analysis was prepared by Corvus Intelligence engineers who build mission-critical software for defense and government organizations.

Frequently Asked Questions

What is the main bottleneck in manual SITREP processing?

The primary bottleneck is the manual re-entry step: an operator receives a hand-drawn sketch or photographed form, reads the grid references and unit symbols, and types them individually into a digital COP or TAK server. This can take 3–8 minutes per SITREP under calm conditions and much longer under stress, during which the tactical picture is stale and operator attention is diverted from decision-making.

Which AI vision model types are best suited for SITREP entity extraction?

Multimodal vision-language models (VLMs) such as GPT-4o or open-weight alternatives like LLaVA perform well for structured extraction because they combine OCR, spatial reasoning, and symbol recognition in a single inference pass. For edge-deployed pipelines without cloud connectivity, a combination of PaddleOCR for text and a small TensorRT-optimized symbol classifier handles entity detection at acceptable latency on Jetson-class hardware.

How does the model parse MGRS grid references from handwritten text?

OCR output tokens are validated against MGRS format patterns: Grid Zone Designator + two-letter 100 km square + equal-digit easting/northing pair. Tokens that partially match are passed to a fuzzy correction module using edit-distance matching against a lookup table of valid GZD and square combinations for the theatre. Confidence is scored by Levenshtein distance from the nearest valid MGRS string.

How are hand-drawn NATO tactical symbols matched to MIL-STD-2525C codes?

Symbol regions are classified by a CNN trained on synthetic APP-6/MIL-STD-2525C renderings with degradation augmentation. The classifier outputs a ranked list of SIDC candidates with confidence scores. Candidates above 0.80 confidence are accepted; below threshold, the entity is flagged for operator confirmation via a top-3 candidate selection UI.

What is a CoT message and how does it carry SITREP entities to TAK?

Cursor-on-Target (CoT) is an XML schema for US DoD sensor-to-shooter interoperability. Each CoT event carries a uid, a type (encoding the MIL-STD-2525C SIDC), timestamps, and a point element with WGS-84 lat/lon/CE. TAK clients (ATAK, WinTAK, CloudTAK) ingest CoT over UDP, TCP, or WebSocket and render the entity on the tactical map immediately.

What is TAKpilot's SITREP vision workflow?

An operator uploads a SITREP image to TAKpilot. The vision backend processes it in 8–20 seconds and returns a confirmation card listing each detected entity with extracted grid reference, symbol icon, callsign, and confidence indicators. The operator reviews and approves (or corrects), then triggers map placement — TAKpilot pushes a CoT bundle to CloudTAK and all entities appear on the shared tactical map simultaneously.

When should entities be auto-placed vs held for human confirmation?

Entities with grid reference confidence above 0.92 and symbol confidence above 0.85 can typically be auto-placed. Below either threshold, or whenever coordinates fall outside the theatre bounding box, the entity is held for confirmation. Ambiguous symbols — where the top-3 SIDC candidates are closely clustered — always require confirmation regardless of the top-candidate score.

How are relative positions handled when no explicit grid reference is written?

Relative references (e.g., "400m NE of checkpoint BRAVO") are handled by extracting an anchor grid reference, then parsing bearing and distance with a VLM chain-of-thought prompt or rule-based offset parser. The derived WGS-84 coordinate carries an inflated circular error (100–500 m) that TAK renders as an uncertainty ring on the map.

Can the vision pipeline run fully offline on edge nodes?

Yes. On a Jetson AGX Orin, a quantized 7B VLM (LLaVA-1.6 INT4) processes a SITREP in 8–15 seconds fully air-gapped. For higher throughput, PaddleOCR plus a TensorRT symbol classifier runs under 2 seconds per document. Both pipelines operate without external API dependencies, colocating with CloudTAK on the same edge node for forward-deployed headquarters.

What pre-processing steps most improve extraction accuracy on degraded documents?

The most impactful steps are: de-skew (Hough line detection, up to ±15°), adaptive binarization (Sauvola) for uneven illumination, CLAHE to recover pencil marks, morphological noise removal, and layout analysis to segment text from symbol regions before routing to separate models. Layout segmentation is the highest single-step improvement for reducing cross-region classification errors.