The most dangerous assumption a defense program can make about its AI systems is that adversaries will attack them the same way academic benchmarks do — with carefully constructed digital perturbations tested against held-out datasets. Operational military AI faces a broader and grimmer threat surface: nation-state actors with months of preparation time, insider access to training pipelines, and the ability to manipulate the physical environment that sensors observe. Understanding that threat surface, and closing it systematically, is the discipline of adversarial machine learning for defense.

Why adversarial attacks matter for military AI

When an AI model misclassifies in a commercial application, the cost is a degraded user experience or a lost sale. When an ISR classification model misidentifies a vehicle as civilian because an adversary placed a carefully engineered pattern on the roof, the operational consequence is categorically different. Military AI is embedded in decision loops where errors carry lethal or strategic weight — targeting, logistics authorization, personnel identification, signals analysis — and that consequence asymmetry is precisely what makes defense AI an attractive adversarial target.

The attack surface grows with every new AI deployment. A logistics AI that approves resupply routes can be manipulated through poisoned input data to approve routes that expose convoys to interdiction. An acoustic classifier on an unmanned sensor node can be fooled by RF signal injection into failing to flag hostile gunfire. An object detection model in a UAV feed can be evaded by a printed patch on a vehicle roof, causing the vehicle to transit an area undetected. None of these attacks require exploiting a software vulnerability in the traditional sense — they exploit the statistical properties of the model itself.

As defense organizations adopt AI more broadly — in predictive maintenance, personnel screening, intelligence triage, and command decision support — the aggregate adversarial attack surface compounds. A program that evaluated AI robustness only during initial procurement, without a continuous red team posture, is accumulating uncharted exposure with every new capability deployed. The threat is not hypothetical: independent research organizations have demonstrated physical-world adversarial attacks against production object detection models achieving attack success rates above 85% with no access to model weights.

Taxonomy of adversarial attacks

Adversarial attacks on AI systems divide into four principal categories, each targeting a different phase of the model lifecycle and requiring a different defensive posture.

Evasion attacks occur at inference time. The adversary constructs an input (an image, an audio sample, a text sequence) that is perceptually similar to a legitimate input but causes the model to produce an incorrect output. The model itself is not modified; only the input changes. Evasion attacks are the most studied class in academic literature, and the standard threat model in digital evasion benchmarks bounds the L-infinity norm of the perturbation, which limits how much any individual pixel can change. In practice, this threat model is relevant when an adversary can directly manipulate sensor inputs, for example by adding noise to a satellite image feed or modifying a document entering an NLP pipeline.

Poisoning attacks occur at training time. The adversary corrupts or augments the training data with samples that cause the model to learn a specific malicious behavior. In the targeted form most relevant here, the trained model performs normally on clean inputs but behaves incorrectly on inputs that carry the adversary's chosen trigger pattern. This form, also called a backdoor or trojan attack, is most relevant for defense when training data is sourced from open or insufficiently verified repositories. A nation-state adversary with the patience and capability to seed a small number of poisoned samples into a widely used pre-training dataset can implant backdoors into models built by multiple organizations from the same data source.

Model extraction attacks allow an adversary with query access to a deployed model to reconstruct a functional approximation of it through systematic probing. The extracted model can then be used to develop more effective evasion attacks without direct access to the original weights. This threat is relevant for defense AI deployed through API interfaces or accessible to external users — an adversary who can query the model thousands of times can build a surrogate that transfers attacks back to the production system.

Backdoor and trojan attacks — while a subset of poisoning — deserve separate emphasis because of their stealth properties. A backdoored model passes all standard accuracy tests. It behaves identically to a clean model on every input except those containing the trigger the adversary embedded during training. Detecting this class of attack requires dedicated techniques beyond standard evaluation.

The relevant threat actors for defense are nation-state adversaries (with the resources and patience for supply chain poisoning and long-duration physical-world attack preparation) and malicious insiders (with direct access to training pipelines, model weights, or deployment infrastructure). Commercial threat models that focus on opportunistic attackers are insufficient for this threat profile.

Physical-world adversarial examples

Physical-world adversarial attacks are the category most underestimated in defense AI deployments and the most practically dangerous in operational settings. They do not require access to the model — they require only the ability to modify objects, surfaces, or signals that the model's sensors will observe.

Adversarial patches are the most studied physical-world attack. A patch is a printed image, typically 20–30 cm in the largest dimension for vehicle-scale targets, designed using the Expectation over Transformation (EOT) technique, which optimizes the patch over a distribution of viewing angles, lighting conditions, distances, and print distortions so that it remains adversarial across all of them. When placed on a vehicle roof or hull, the patch causes object detection models to fail to locate or correctly classify the vehicle. Published research has reported attack success rates above 80% against production-grade detection models in ISR-relevant scenarios such as drone feeds at operational altitudes and optical sensors at medium range. The patch requires no ongoing adversary action once printed and placed.
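The EOT objective can be written compactly. The notation below is an assumption introduced for illustration and does not come from a specific paper's formulation: f is the detector, x a scene, δ the patch, A(x, δ) the scene with the patch applied, T a distribution over physical transformations (angle, lighting, distance, print error), y the true label, and 𝓛 the detection loss the adversary wants to maximize.

```latex
\delta^{*} \;=\; \arg\max_{\delta} \; \mathbb{E}_{t \sim T}
\Big[ \mathcal{L}\big( f\big( t( A(x, \delta) ) \big),\; y \big) \Big]
```

In practice the expectation is approximated by sampling transformations at each optimization step, which is why a patch optimized this way tends to survive conditions that a single-image digital perturbation does not.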

Adversarial camouflage patterns represent a more sophisticated extension. Rather than a discrete patch, the adversary designs a texture or camouflage pattern for an entire vehicle or personnel equipment set that is systematically adversarial against a target class of detection models. The pattern appears visually similar to standard military camouflage but produces consistent evasion of AI-based classification. This attack is more difficult to execute than a simple patch because it requires larger perturbation budgets and more extensive evaluation against diverse model versions, but its stealth advantage is correspondingly higher — it does not produce the visible anomaly of a printed patch.

RF signal injection into acoustic classifiers is a less publicized but operationally relevant physical-world attack. Acoustic gunshot detection and vehicle acoustic classification systems increasingly use neural network models to replace or augment traditional signal processing. An adversary with a directed RF emitter can inject carefully crafted interference that couples into the sensor's analog front end and appears in the digitized audio, causing the acoustic classifier to suppress the detection of genuine events or hallucinate false ones. The attack exploits the same statistical properties of the model as a digital perturbation attack, but the mechanism is electromagnetic rather than optical. Defending this attack class requires both model-level hardening and anomaly detection at the signal preprocessing stage.

Adversarial training and certified robustness

Adversarial training is the most empirically effective defense against evasion attacks, and it is the first control that should be applied to high-risk defense classifiers. The core idea is simple: augment the training set with adversarially perturbed examples, so that the model learns representations that are stable across the perturbations an adversary can generate.

The Projected Gradient Descent (PGD) adversarial training method approximates the strongest perturbation within a specified norm ball, typically L-infinity with epsilon = 8/255 for natural images (the appropriate budget must be derived from the operational threat model), and trains on those perturbed examples in each batch. The model is optimized not just to classify clean examples correctly but to classify the worst-case perturbation of each example correctly. On standard image benchmarks, a conventionally trained model typically retains close to zero accuracy under a PGD attack at this budget, whereas a PGD-trained model of the same architecture retains a substantial fraction of its accuracy. The exact gain depends on the dataset, architecture, and budget, and should be measured rather than assumed.
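The inner maximization step of PGD can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a training pipeline: the model is a toy logistic regression with hand-set weights `w` and bias `b` (hypothetical), inputs are scaled to [0, 1], and the step size `alpha` and step count are illustrative choices. In adversarial training, a loop like this would run on every batch and the model would then be updated on the perturbed examples.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(x, y, w, b):
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = predict_proba(x, w, b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(x, y, w, b, eps=8 / 255, alpha=2 / 255, steps=10):
    """Iteratively ascend the loss, projecting back into the L-infinity ball."""
    x_adv = list(x)
    for _ in range(steps):
        grad = input_gradient(x_adv, y, w, b)
        # Signed gradient ascent step on the loss.
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Project into [x - eps, x + eps] and the valid input range [0, 1].
        x_adv = [min(max(xa, xi - eps, 0.0), xi + eps, 1.0)
                 for xa, xi in zip(x_adv, x)]
    return x_adv
```

The projection step is what enforces the perturbation budget: however many steps are taken, no input coordinate moves more than `eps` from its original value.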

The TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization) loss function extends PGD adversarial training by explicitly penalizing the gap between the model's prediction on a clean example and its prediction on the adversarially perturbed version, measured as a KL divergence and weighted by a tunable coefficient. This gives a more controllable robustness-accuracy tradeoff than vanilla PGD training, and TRADES remains a widely used strong baseline for L-infinity adversarial training on image classifiers.
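The structure of the TRADES objective for a single example can be sketched as follows. This is an illustration that assumes the clean and adversarial class-probability vectors have already been computed; the function names and the default `beta` are hypothetical, and in real training the adversarial input is chosen to maximize the KL term before the loss is evaluated.

```python
import math

def cross_entropy(probs, label):
    """Standard classification loss on the clean prediction."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def trades_loss(clean_probs, adv_probs, label, beta=6.0):
    """Clean loss plus a weighted penalty on clean/adversarial disagreement."""
    natural_term = cross_entropy(clean_probs, label)
    robust_term = kl_divergence(clean_probs, adv_probs)
    return natural_term + beta * robust_term
```

Raising `beta` pushes the model toward stable predictions under perturbation at the expense of clean accuracy; lowering it does the opposite. That single coefficient is how the tradeoff is tuned against the operational threat model.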

Certified robustness methods, most notably randomized smoothing, offer a provable guarantee that the smoothed model's output cannot change within a specified L2 radius around a given input. A randomized smoothing classifier wraps a base classifier in Gaussian noise: it classifies an input by taking a majority vote over many noisy versions of the input, and the resulting classifier comes with a certified radius within which no perturbation can change the output. Because the vote is estimated by sampling, the certificate holds with high probability rather than absolutely. For defense deployments where a provable bound on adversarial sensitivity is more valuable than the highest empirical robust accuracy, certified robustness is the appropriate technique. The tradeoff is that certified robustness under L2 typically achieves lower clean accuracy than empirical adversarial training, and the certified bounds degrade for larger perturbation radii.
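A simplified version of the vote-and-certify procedure is sketched below. Several choices are assumptions made to keep the example dependency-free: it uses a Hoeffding lower confidence bound where published implementations generally use a tighter binomial bound, it reuses one set of samples for both selecting and certifying the class, and the names `smoothed_predict`, `sigma`, `n`, and `alpha` are illustrative. The radius formula, sigma times the inverse normal CDF of the lower bound on the top-class probability, is the standard L2 certificate for Gaussian smoothing.

```python
import math
import random
from collections import Counter
from statistics import NormalDist

def smoothed_predict(base_classifier, x, sigma=0.5, n=1000, alpha=0.001, seed=0):
    """Return (label, certified L2 radius), or (None, 0.0) to abstain."""
    rng = random.Random(seed)
    # Majority vote of the base classifier over Gaussian-noised copies of x.
    votes = Counter(
        base_classifier([xi + rng.gauss(0.0, sigma) for xi in x])
        for _ in range(n)
    )
    top_class, top_count = votes.most_common(1)[0]
    # One-sided Hoeffding bound: true top-class probability >= p_lower
    # with probability at least 1 - alpha (a conservative simplification).
    p_lower = top_count / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0  # Not confident enough to certify anything.
    radius = sigma * NormalDist().inv_cdf(p_lower)
    return top_class, radius
```

Abstention is part of the design: when the noisy votes are too close, or too few samples were drawn for the bound to clear one half, the classifier declines to issue a certificate rather than issuing a weak one.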

Every adversarial training approach incurs a clean accuracy cost, typically several percentage points on natural-image benchmarks and potentially higher for domain-specific defense datasets with limited training data. The correct approach for a defense deployment is not to select the technique with the best benchmark number but to evaluate the robustness-accuracy tradeoff against the specific operational threat model. A logistics AI that runs on a well-controlled internal data feed has a different acceptable tradeoff than an ISR classifier operating against an active adversary with physical access to the sensor environment.

Input preprocessing defenses

Input preprocessing defenses attempt to remove or detect adversarial perturbations before they reach the model, without modifying the model itself. They are particularly useful when adversarial training degrades accuracy unacceptably, when a model cannot be retrained (e.g., a deployed third-party component), or as a complementary layer alongside adversarial training.

Feature squeezing reduces the input's precision or resolution — for images, this means bit-depth reduction or spatial smoothing — to remove the high-frequency perturbations that most adversarial attacks rely on. If the model's output changes significantly between the original input and the squeezed version, this discrepancy is a signal that the input may be adversarial. Feature squeezing is computationally cheap and architecturally simple but is evaded by adaptive attacks that optimize the perturbation to survive squeezing.
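The detection logic can be sketched as follows, assuming pixel values scaled to [0, 1] and a `model` callable that returns a class-probability vector. The function names, the 3-bit depth, and the threshold are illustrative assumptions; a real deployment would calibrate the threshold on clean validation data.

```python
def reduce_bit_depth(pixels, bits=3):
    """Quantize each pixel in [0, 1] to 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    return [round(p * levels) / levels for p in pixels]

def squeeze_score(model, pixels, bits=3):
    """L1 distance between predictions on the original and squeezed input."""
    original = model(pixels)
    squeezed = model(reduce_bit_depth(pixels, bits))
    return sum(abs(a - b) for a, b in zip(original, squeezed))

def is_suspicious(model, pixels, threshold=0.5, bits=3):
    """Flag inputs whose prediction shifts sharply once squeezed."""
    return squeeze_score(model, pixels, bits) > threshold
```

A legitimate input should produce nearly the same prediction before and after squeezing, so its score stays near zero. An input whose classification depends on small, fine-grained pixel changes loses that dependence when quantized, and the resulting jump in the prediction is the detection signal.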

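JPEG compression as a preprocessing step has been extensively studied as an adversarial defense. Moderate JPEG compression destroys many gradient-based perturbations for two reasons: quantizing the discrete cosine transform coefficients discards much of the high-frequency detail that carries the perturbation, and the quantization step is non-differentiable, which blocks naive gradient computation through the preprocessing stage. The defense is weak against adaptive attacks, which approximate the gradient through the quantizer, but relevant against the majority of non-adaptive attacks in practical deployment scenarios.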

Local Intrinsic Dimensionality (LID) and Mahalanobis distance detectors operate at the feature level rather than the input level. They extract intermediate layer activations for a given input and compare them against the distribution of activations seen on clean training data. Adversarial inputs frequently produce activation patterns that are outliers in this distribution, even when the final classification is confident and plausible. These methods raise the bar relative to input-level preprocessing, but published adaptive attacks have bypassed detectors of this kind as well, so they are appropriate as one detection layer in a defense-in-depth architecture rather than as a standalone control.
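A stripped-down Mahalanobis-style check is sketched below. It makes two simplifying assumptions that should be stated loudly: it fits a single Gaussian rather than one per class, and it treats the covariance as diagonal, which reduces the Mahalanobis distance to a variance-normalized Euclidean distance. The function names and threshold are hypothetical, and the activations are assumed to have been extracted from an intermediate layer already.

```python
import math
from statistics import mean, pvariance

def fit_feature_stats(clean_activations):
    """Per-dimension mean and variance of activations on clean data."""
    columns = list(zip(*clean_activations))
    means = [mean(c) for c in columns]
    # Small constant avoids division by zero for constant features.
    variances = [pvariance(c) + 1e-6 for c in columns]
    return means, variances

def mahalanobis_diag(activation, means, variances):
    """Mahalanobis distance under a diagonal-covariance assumption."""
    return math.sqrt(sum((a - m) ** 2 / v
                         for a, m, v in zip(activation, means, variances)))

def is_outlier(activation, means, variances, threshold=3.0):
    """Flag activations far from the clean-data distribution."""
    return mahalanobis_diag(activation, means, variances) > threshold
```

A full-covariance, class-conditional version captures correlations between features that this sketch ignores, at the cost of estimating and inverting a covariance matrix per layer.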

Ensemble disagreement detection runs the input through multiple independently trained models and flags high disagreement between their outputs as a signal of adversarial manipulation. An adversarial example crafted to fool one model will often be less effective against a second model trained from a different initialization or on augmented data. Ensemble detection is computationally expensive but provides strong coverage against non-adaptive attacks and moderate coverage against transfer attacks.
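The disagreement signal itself is simple to compute. The sketch below assumes each ensemble member is a callable that returns a hard class label; the function names and the 25% threshold are illustrative.

```python
from collections import Counter

def disagreement(models, x):
    """Fraction of ensemble members that dissent from the majority label."""
    labels = [model(x) for model in models]
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - majority_count / len(labels)

def flag_input(models, x, max_disagreement=0.25):
    """Flag the input for review when the ensemble is too divided."""
    return disagreement(models, x) > max_disagreement
```

The computational cost noted above comes from running every member on every input; the scoring step adds almost nothing on top of that.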

Model governance for adversarial resilience

Technical defenses at the model level are necessary but not sufficient. A model that is robustly trained and protected by input preprocessing can still be undermined by governance failures — unauthorized model substitution, unauthorized access to inference endpoints, or deployment of a model version that preceded the most recent robustness evaluation cycle. Adversarial resilience requires governance controls that treat the model artifact with the same rigor applied to cryptographic keys and classified source code.

Model signing is the practice of attaching a cryptographic signature to a trained model artifact, so that any unauthorized modification between training and deployment is detectable. A model that has been tampered with — whether to insert a backdoor, downgrade to a less robust version, or substitute entirely — will fail signature verification at the deployment gate. Model signing should be paired with a broader supply chain security posture that extends provenance tracking from source code through training through deployment.
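The verification gate can be illustrated with Python's standard library. This sketch uses an HMAC with a shared secret purely because it needs no third-party packages; that is a stand-in, and a production signing scheme would normally use an asymmetric signature so that the deployment gate holds only a public key. The function names are hypothetical.

```python
import hashlib
import hmac

def sign_model(model_bytes: bytes, key: bytes) -> str:
    """Compute a keyed SHA-256 tag over the serialized model artifact."""
    return hmac.new(key, model_bytes, hashlib.sha256).hexdigest()

def verify_model(model_bytes: bytes, key: bytes, signature: str) -> bool:
    """Deployment gate: accept the artifact only if its tag matches."""
    expected = sign_model(model_bytes, key)
    # Constant-time comparison avoids leaking the tag through timing.
    return hmac.compare_digest(expected, signature)
```

Any change to the artifact, even a single byte, produces a different tag, so a tampered or substituted model fails verification before it is loaded.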

Role-based access control (RBAC) on inference endpoints limits which systems and users can query a deployed model. This directly constrains model extraction attacks: an adversary who cannot issue arbitrary queries to the model cannot build a surrogate by probing it. Defense AI inference endpoints should apply the same RBAC policies as any other sensitive API: strict authentication, logging of all inference requests, and rate limiting to impede the systematic probing required for model extraction.
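The logging and rate-limiting controls can be sketched as a per-client sliding window. The class name, parameters, and in-memory audit log are assumptions for illustration; authentication is presumed to have happened upstream, and timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque

class QueryRateLimiter:
    """Sliding-window limit on inference queries per authenticated client."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client_id -> recent timestamps
        self.audit_log = []                 # (timestamp, client_id, allowed)

    def allow(self, client_id, now):
        history = self._history[client_id]
        # Drop timestamps that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        allowed = len(history) < self.max_queries
        if allowed:
            history.append(now)
        # Every request is logged, including the ones that are refused.
        self.audit_log.append((now, client_id, allowed))
        return allowed
```

Logging refused requests matters as much as refusing them: a client that repeatedly hits the limit is exactly the probing pattern that extraction attempts produce.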

Model versioning and rollback ensures that every deployed model version is recorded and that the organization can rapidly revert to a previously validated version if a newly deployed model is found to have robustness deficiencies. Version management also enables precise blast-radius analysis when a new vulnerability is discovered in an adversarial training technique or a preprocessing defense: the organization can determine exactly which deployed models were affected and prioritize retraining accordingly.

A continuous red team evaluation cycle closes the feedback loop between threat research and deployment. The field of adversarial machine learning advances rapidly — new attack techniques regularly defeat defenses that were state of the art twelve months earlier. A defense organization that evaluates adversarial robustness only at initial deployment will find its posture degraded without any change to its own systems. A quarterly red team cadence for high-criticality AI functions, with a defined remediation process, is the minimum appropriate governance standard.

Red team evaluation methodology

Adversarial robustness evaluation for defense AI should follow a structured methodology that produces reproducible, comparable results across evaluation cycles and covers both digital and physical-world attack vectors.

For digital robustness benchmarking, the AutoAttack framework is the current standard. AutoAttack assembles a fixed ensemble of diverse, parameter-free attacks (APGD-CE, APGD-T, FAB-T, and Square Attack) and evaluates a model against all of them automatically, reporting the robust accuracy under the joint worst case. Single-attack evaluations are routinely misled by gradient masking, a phenomenon where a model's robustness appears high because gradients are uninformative rather than because the model is genuinely robust. AutoAttack counters this by including Square Attack, a score-based black-box attack that does not rely on gradients and therefore helps expose masking. Foolbox provides a complementary library of individual attacks that can be used for targeted investigation once AutoAttack identifies vulnerability.
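The "joint worst case" metric is worth making concrete, because it is stricter than any per-attack number. The sketch below is not the AutoAttack API; it is a hypothetical aggregation over per-sample outcomes, assuming each attack has already been run and its successes recorded as booleans.

```python
def robust_accuracy(clean_correct, attack_succeeded):
    """Fraction of samples classified correctly and broken by no attack.

    clean_correct:    list of bools, True if the clean sample is classified
                      correctly.
    attack_succeeded: dict mapping attack name to a list of bools, True if
                      that attack changed the prediction for the sample.
    """
    total = len(clean_correct)
    robust = 0
    for i in range(total):
        broken_by_any = any(results[i] for results in attack_succeeded.values())
        if clean_correct[i] and not broken_by_any:
            robust += 1
    return robust / total
```

A sample counts as robust only if it survives every attack in the ensemble, so the joint figure can never exceed the weakest single-attack figure and is usually lower. That is the number that should be recorded and compared across evaluation cycles.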

Physical-world evaluation requires a purpose-built protocol distinct from digital benchmarking. The evaluation team generates adversarial patches using the EOT method, targeting the specific sensor type, resolution, and altitude range of the operational deployment. Patches are printed at operationally relevant sizes, mounted on target objects, and evaluated under the same collection conditions used in deployment — including representative ranges, viewing angles, lighting conditions, and weather. The evaluation reports attack success rate (percentage of patched targets not detected or misclassified), clean detection rate on unpatched targets, and the transferability of patches to alternative model versions (to assess whether a patch developed against the current model version will remain effective after a model update).
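The two headline metrics from a field trial can be computed as shown below. The record layout is an assumption made for illustration: each trial is a `(patched, detected_correctly)` pair of booleans, one per observation of a target.

```python
def summarize_trials(trials):
    """Compute attack success rate and clean detection rate from field trials.

    trials: list of (patched, detected_correctly) boolean pairs.
    """
    patched = [detected for is_patched, detected in trials if is_patched]
    clean = [detected for is_patched, detected in trials if not is_patched]
    if not patched or not clean:
        raise ValueError("need at least one patched and one unpatched trial")
    return {
        # Share of patched targets that were missed or misclassified.
        "attack_success_rate": sum(not d for d in patched) / len(patched),
        # Share of unpatched targets that were detected correctly.
        "clean_detection_rate": sum(clean) / len(clean),
    }
```

Reporting the clean detection rate alongside the attack success rate keeps the result honest: a high attack success rate means little if the detector was already missing unpatched targets under the same collection conditions.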

The evaluation results should be recorded in the model card alongside the threat model assumptions, the perturbation budgets used, and the software versions of the attack frameworks. This documentation is the foundation for the governance review that follows and the baseline against which future evaluations are compared.

Key insight: The most underestimated attack vector in deployed military AI is not the white-box gradient attack that dominates academic research. It is the physical-world adversarial patch. A printed 20×20 cm adversarial patch placed on a vehicle roof has been reported to defeat production object detection models in ISR drone feeds with attack success rates above 85% in independent evaluations, without any access to the model's weights. Defending against physical-world attacks requires empirical robustness evaluation under physical patch protocols, not just digital perturbation benchmarks.

Assess the adversarial robustness of your defense AI pipeline

Corvus Intelligence engineers evaluate adversarial attack surface in deployed military AI systems — from ISR image classifiers to LLM-based intelligence triage — and implement hardening measures appropriate to the operational threat model.


This analysis was prepared by Corvus Intelligence engineers who build and evaluate mission-critical AI systems for defense and government organizations.