Red team vs blue team: running cyber defense exercises for military organizations

By Corvus Intelligence Engineering Team

June 4, 2026 · 9 min read

Every cyber commander eventually faces the same uncomfortable question after a security audit: if a nation-state adversary had been operating inside your networks for the past six months, would your SOC analysts have detected them? The honest answer, for most military organizations, is probably not – not because the defenders are incompetent, but because classified networks are complex, operational continuity requirements constrain aggressive defensive tooling, and threat actor tradecraft has evolved far beyond what signature-based detection alone can catch. Red team/blue team exercises exist to answer that question before an adversary does. They are among the highest-return investments a military cyber organization can make – and among the most poorly executed when planning discipline is absent.

This article covers the complete arc of a military cyber exercise: why defense networks present unique challenges, how to structure the exercise types along a maturity continuum, how the red and blue team roles are staffed and scoped, what technical infrastructure an effective exercise requires, and how to extract actionable improvements from the after-action review. The WARG multi-domain exercise platform supports integration of cyber domain events into broader joint exercise planning – a capability examined in the final section.

Why military cyber exercises differ from commercial ones

The principles of red team/blue team testing apply across sectors, but military networks introduce constraints and requirements that have no commercial analog. Understanding these distinctions is prerequisite to designing an exercise that produces useful training data rather than a scripted performance.

Classified network architecture. Military networks span multiple classification levels – from unclassified administrative systems through secret and above enclaves – and the boundaries between them are themselves high-value targets. A red team emulating an advanced persistent threat is not simply trying to exfiltrate data; it may be attempting cross-domain transfer that exploits a misconfigured data diode or a poorly implemented cross-domain solution. Exercises that operate only on unclassified ranges may miss the most operationally relevant attack paths.

Operational continuity requirements. A commercial penetration tester can bring down a web application for an hour without catastrophic consequence. A red team operating on a network supporting active command and control cannot interrupt operational traffic without mission impact that extends well beyond the exercise. This constraint forces a tradeoff: exercises on production-adjacent networks are more realistic but carry genuine operational risk; exercises on isolated cyber ranges are safer but may not expose the actual defensive gaps in operational systems. Experienced exercise designers address this with a layered approach – isolated range exercises for technique validation, limited-scope production-adjacent exercises for realistic environmental testing.

OPSEC during the exercise itself. The existence of a red team exercise is operationally sensitive information. An adversary who learns that a unit is conducting internal testing may adjust timing to avoid the exercise window, or may attempt to blend real intrusion activity into exercise noise. Exercise planning and communications should be handled at appropriate classification levels, and the circle of personnel who know the exercise timing should be kept as small as possible. Blue team personnel in particular should be kept outside it, because their detection training requires genuine uncertainty about whether observed activity is exercise-generated or real.

Legal authority for offensive techniques. Red team operators using offensive cyber tools against military networks require explicit written authority that does not exist by default. The Computer Fraud and Abuse Act in the United States, and equivalent statutes in allied nations, create criminal liability for unauthorized computer access regardless of the actor's affiliation. Establishing proper legal authority – command authorization documents, rules of engagement, and get-out-of-jail letters – before any red team activity begins is not bureaucratic overhead; it is the baseline that makes the exercise legally defensible.

Exercise types along the maturity continuum

Military cyber exercises range from low-cost tabletop discussions to full live-fire simulations on dedicated range infrastructure. Organizations at different maturity levels benefit from different exercise types, and the progression from tabletop to live-fire is itself a structured development path.

Tabletop exercises bring the incident response team together to walk through a scenario without any live technical activity. The facilitator presents a scenario – "you have received an alert that an endpoint on the command network has initiated an unusual outbound DNS query pattern; what do you do?" – and the team discusses their response process. Tabletops are inexpensive, require no technical infrastructure, and are highly effective at exposing process gaps: missing escalation procedures, undefined roles, decision-making ambiguities, and communication failures between the SOC and the incident commander. They produce no data on whether the detection tooling actually works, but they reveal whether the team knows how to use it.

Full simulation exercises involve a human red team actively operating against a target environment while the blue team defends in real time. The red team uses real offensive tools and techniques; the blue team uses their operational detection and response tooling. These exercises are the highest-fidelity training available short of responding to a real intrusion, and they are the only exercise type that produces realistic mean time to detect (MTTD) and mean time to respond (MTTR) metrics. They require the most planning, the most technical infrastructure, and the most rigorous legal authority documentation.

Live-fire on range networks uses a dedicated cyber range – an isolated network environment that mirrors production architecture without carrying operational traffic – as the exercise environment. This approach preserves operational continuity while allowing the red team to use the full spectrum of authorized techniques, including those that would cause service disruption on a production network. Cyber ranges can be on-premise, cloud-hosted, or provided by national-level training organizations. The investment in range infrastructure is significant but amortizable across many exercises per year.

Coalition exercises (CWIX-style) involve multiple allied nations operating together in a shared exercise environment. NATO's Coalition Warrior Interoperability eXercise (CWIX) model allows participating nations to test not only their internal defensive capabilities but also their ability to share threat intelligence and coordinate incident response across national and organizational boundaries. These exercises expose interoperability gaps – incompatible ticketing systems, incompatible indicator-sharing formats, language and terminology barriers in high-tempo incident response – that internal exercises cannot reveal.

Key insight: The most common failure in military cyber exercise programs is attempting a live red team exercise before the tabletop and process foundation is established. A blue team that has never walked through its incident response procedures in a tabletop will spend a live exercise discovering process gaps rather than training detection and response skills. The maturity progression – tabletop first, simulation second, live-fire third – is not optional.

Red team structure and threat actor emulation

The value of a red team exercise is directly proportional to how accurately the red team emulates the actual threat. A red team that uses techniques from five years ago, or that operates more noisily than a real APT because it lacks the tradecraft to be subtle, produces training data that does not prepare the blue team for the threat they actually face. Effective military red teams are structured around specific threat actor emulation rather than generic penetration testing.

For military networks in the NATO and Five Eyes context, the most operationally relevant threat actors are nation-state groups with demonstrated military network intrusion capability. APT28 (Fancy Bear, attributed to GRU Unit 26165) has a documented record of targeting military and government networks using spear-phishing, credential theft, and living-off-the-land techniques that minimize the footprint visible to endpoint detection. APT29 (Cozy Bear, attributed to SVR) operates with a longer dwell time and more patient operational tempo, often maintaining access for months before executing its mission objective. Red teams emulating these actors should operate from their documented TTP playbooks, using MITRE ATT&CK as the organizing framework.

Living-off-the-land (LotL) techniques are particularly important to emulate because they represent the detection challenge that defeats most signature-based defenses. A red team that uses only open-source exploit frameworks generates alerts that any commercial EDR product will catch; a red team that uses built-in or Microsoft-signed Windows administrative tools (PowerShell, WMI, scheduled tasks, and Sysinternals utilities such as PsExec) to perform lateral movement operates in the same detection-resistant manner as a sophisticated nation-state actor. The blue team's ability to distinguish malicious use of legitimate tools from routine administrative activity is the core competency that a well-designed exercise develops.
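
As a rough illustration of what distinguishing malicious from routine use of legitimate tools can look like in practice, the sketch below flags process-creation events that merit analyst review. The event field names (`image`, `parent_image`, `command_line`, `user`), the function `lotl_flags`, and the three heuristics are all assumptions chosen for illustration; a real detection stack would use its own schema and a far richer rule set.

```python
# Hypothetical heuristics for reviewing process-creation events.
OFFICE_PARENTS = {"winword.exe", "excel.exe", "outlook.exe"}
ADMIN_TOOLS = {"powershell.exe", "cmd.exe", "wmic.exe"}


def lotl_flags(event, admin_accounts=frozenset()):
    """Return the reasons, if any, that an event deserves analyst review."""
    image = event["image"].lower()
    parent = event["parent_image"].lower()
    cmd = event["command_line"].lower()
    reasons = []
    # Encoded PowerShell commands hide intent from casual log review.
    if image == "powershell.exe" and (" -enc" in cmd or "-encodedcommand" in cmd):
        reasons.append("encoded PowerShell command")
    # Office applications rarely have a routine reason to spawn admin tools.
    if image in ADMIN_TOOLS and parent in OFFICE_PARENTS:
        reasons.append("admin tool spawned by Office application")
    # Task creation by an account outside the known admin set is unusual.
    if image == "schtasks.exe" and "/create" in cmd and event["user"] not in admin_accounts:
        reasons.append("scheduled task created by non-admin account")
    return reasons


suspicious = lotl_flags({
    "image": "powershell.exe",
    "parent_image": "WINWORD.EXE",
    "command_line": "powershell.exe -enc SQBFAFgA",
    "user": "jdoe",
})
routine = lotl_flags({
    "image": "powershell.exe",
    "parent_image": "explorer.exe",
    "command_line": "powershell.exe -File backup.ps1",
    "user": "svc-backup",
})
print(suspicious, routine)
```

The same binary appears in both events; only the context (parent process and arguments) separates them, which is the judgment the exercise is meant to train.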

Command and control (C2) infrastructure should be purpose-built for the exercise rather than reusing commercial penetration testing frameworks in their default configuration, which is heavily signatured by network security products. DNS tunneling, HTTPS beaconing to domain-fronted infrastructure, and covert channels over allowed protocols (ICMP, legitimate cloud storage APIs) represent the C2 techniques that operational red teams use. Tooling options include CALDERA for automated TTP execution and, for manual C2 operations, Cobalt Strike or Havoc with customized profiles, run by red team operators with the appropriate training and authorization.

Blue team roles and SOC structure during an exercise

The blue team is not a homogeneous group during a cyber exercise – it comprises distinct roles that must coordinate under time pressure. Exercises that do not define these roles explicitly produce confused responses in which multiple analysts duplicate work or critical decisions stall because no one knows who holds the authority to make them.

SOC analysts (Tier 1 and 2) are the detection layer. Their exercise training objective is to triage alerts accurately, escalate confirmed suspicious activity promptly, and not dismiss genuine indicators of compromise as false positives. The exercise should generate realistic alert volume – not only red team activity but simulated background noise from routine network events – to train analysts under conditions that approximate operational load.

The incident commander holds decision authority during a declared incident. Their exercise training objective is to make correct triage decisions under incomplete information: when to invoke containment procedures that will cause service disruption, when to allow adversary activity to continue for intelligence collection purposes, and when to escalate to command authority. Incident commanders who have never practiced these decisions in a simulated environment reliably make suboptimal choices under the cognitive load of a real incident.

The forensics team reconstructs attack timelines after containment. Their exercise training objective is to produce an accurate timeline from available log data within a defined timeframe. The quality of the forensic reconstruction – whether they correctly identify the initial access vector, the full scope of lateral movement, and the data that was accessed – is a direct measure of the organization's ability to conduct post-incident remediation rather than simply closing the incident ticket.

Key insight: The incident commander role is the most under-trained position in most military SOC structures. SOC analysts receive regular technical training; incident commanders rarely practice decision-making under simulated incident pressure. A cyber exercise that runs all three blue team roles simultaneously – analyst, incident commander, forensics – produces far more training value than one that focuses exclusively on the detection layer.

Purple team for continuous improvement

The traditional red-vs-blue adversarial model produces a binary outcome: either the blue team detected the technique or it did not. Purple teaming modifies this model to produce continuous, collaborative improvement rather than a single measurement event. In a purple team exercise, red and blue team members work together – the red team executes a specific technique, the blue team attempts to detect it, and both teams immediately discuss what log data was generated, what detection rule would catch it, and what changes are needed in the detection stack. This process is iterated across the full TTP catalog.

Purple teaming is not a replacement for adversarial red team exercises – the training value of operating under genuine uncertainty without knowing what techniques will be used is irreplaceable for blue team development. Purple teaming is a complement that builds detection engineering capability faster than adversarial exercises alone. The cadence for most military organizations should be: adversarial red team exercise annually, purple team workshops quarterly, continuous automated adversary emulation using CALDERA on range networks as a persistent background activity.

Technical infrastructure for military cyber exercises

The technical infrastructure requirement for a military cyber exercise scales with exercise type. Tabletop exercises require only a meeting room and a scenario document. Full simulation exercises on an isolated cyber range require network infrastructure, virtualization platforms, logging aggregation, and tooling that together represent a significant up-front investment.

The cyber range should mirror the production network architecture as closely as possible – same operating system versions, same network segmentation model, same security tooling – because detection gaps that exist on the range almost certainly exist on the production network. Ranges built on generic templates rather than production clones produce exercise results that do not transfer to operational defensive improvements. Cloud-hosted cyber ranges (Azure Government, AWS GovCloud) have reduced the infrastructure cost of range deployment substantially, but the configuration effort of accurately modeling production architecture remains significant.

Attack simulation tooling for the red team should include: CALDERA for automated TTP execution and scenario chaining; Atomic Red Team for individual technique validation against the detection stack; and appropriate C2 frameworks authorized for the exercise scope. The MITRE ATT&CK Navigator provides a visual coverage map – overlaying which techniques are included in the exercise scenario against which techniques have confirmed detection coverage – that is the single most useful planning artifact for both red team scenario design and post-exercise remediation tracking.
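
The coverage overlay described above can be sketched as a small script that scores each exercise technique by whether a detection has been confirmed. The technique IDs are examples only, the function name `build_coverage_layer` is hypothetical, and the field names (`techniqueID`, `score`, `comment`) are assumed to match the ATT&CK Navigator layer format; a complete layer file also carries version metadata that should be taken from the Navigator documentation rather than from this sketch.

```python
import json


def build_coverage_layer(exercise_techniques, detected_techniques, name="Exercise coverage"):
    """Return a Navigator-style layer dict: score 1 = detection confirmed, 0 = gap."""
    detected = set(detected_techniques)
    techniques = []
    for technique_id in sorted(set(exercise_techniques)):
        covered = technique_id in detected
        techniques.append({
            "techniqueID": technique_id,
            "score": 1 if covered else 0,
            "comment": "detection confirmed" if covered else "no confirmed detection",
        })
    return {"name": name, "domain": "enterprise-attack", "techniques": techniques}


# Example IDs: PowerShell (T1059.001), WMI (T1047), scheduled task (T1053.005).
layer = build_coverage_layer(
    exercise_techniques=["T1059.001", "T1047", "T1053.005"],
    detected_techniques=["T1059.001"],
)
gaps = [t["techniqueID"] for t in layer["techniques"] if t["score"] == 0]
print(json.dumps(layer, indent=2))
print("Detection gaps:", gaps)
```

The `gaps` list doubles as the starting point for the post-exercise remediation backlog.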

Logging and SIEM configuration is the most common exercise infrastructure failure point. Exercises that generate no usable detection data because log sources were not feeding the SIEM, or because retention periods were too short to support post-exercise forensic reconstruction, produce no training value regardless of how well the red team executed. Verify log source coverage before the exercise begins, not after.
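
A pre-exercise log source check can be as simple as comparing the list of sources the scenario depends on against the last time each one was seen in the SIEM. The sketch below assumes the last-seen timestamps have already been exported from the SIEM into a dictionary; the source names, the 15-minute silence threshold, and the function `find_stale_sources` are illustrative assumptions. It does not cover retention periods, which need a separate check.

```python
from datetime import datetime, timedelta, timezone


def find_stale_sources(required_sources, last_seen, now, max_silence=timedelta(minutes=15)):
    """Return {source: reason} for every required source that is missing or silent."""
    problems = {}
    for source in required_sources:
        seen = last_seen.get(source)
        if seen is None:
            problems[source] = "never seen in SIEM"
        elif now - seen > max_silence:
            problems[source] = f"silent for {now - seen}"
    return problems


now = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "dc01-security": now - timedelta(minutes=2),
    "edr-endpoints": now - timedelta(hours=3),
}
required = ["dc01-security", "edr-endpoints", "dns-resolver"]

problems = find_stale_sources(required, last_seen, now)
for source, reason in problems.items():
    print(f"{source}: {reason}")
```

An empty result is a reasonable go/no-go criterion for exercise start.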

Scoring and metrics: measuring what matters

Mean time to detect (MTTD) – the interval from red team technique execution to confirmed blue team alert – is the primary quantitative metric for a military cyber exercise. It is computed per technique, not as a single exercise-wide average, because a blue team with excellent network detection and poor endpoint detection will show very different MTTD values across the technique spectrum. The per-technique breakdown is what drives targeted remediation rather than a generic "improve detection" recommendation.
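
A minimal sketch of the per-technique breakdown, assuming red team actions and confirmed blue team alerts have been matched by a shared action identifier during the after-action reconstruction. The data structures and the function name `mttd_by_technique` are hypothetical, and the timestamps are synthetic.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mttd_by_technique(executions, alerts):
    """Mean seconds from execution to confirmed alert, grouped by technique.

    executions: {action_id: (technique_id, executed_at)}
    alerts:     {action_id: confirmed_at}; undetected actions are absent
    """
    delays = defaultdict(list)
    for action_id, (technique_id, executed_at) in executions.items():
        confirmed_at = alerts.get(action_id)
        if confirmed_at is not None:
            delays[technique_id].append((confirmed_at - executed_at).total_seconds())
    return {technique_id: mean(values) for technique_id, values in delays.items()}


executions = {
    "a1": ("T1059.001", datetime(2026, 1, 15, 9, 0)),
    "a2": ("T1059.001", datetime(2026, 1, 15, 10, 0)),
    "a3": ("T1047", datetime(2026, 1, 15, 11, 0)),
}
alerts = {
    "a1": datetime(2026, 1, 15, 9, 10),   # detected after 10 minutes
    "a2": datetime(2026, 1, 15, 10, 30),  # detected after 30 minutes
}

mttd = mttd_by_technique(executions, alerts)
print(mttd)
```

Techniques that were never detected (here T1047) have no MTTD at all and drop out of the result; they are accounted for by the coverage rate rather than by an artificially large average.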

Mean time to respond (MTTR) – from confirmed alert to completed containment action – measures the effectiveness of the incident response process rather than the detection stack. High MTTD with low MTTR indicates a detection engineering problem. Low MTTD with high MTTR indicates a process or staffing problem. Both metrics are necessary to distinguish the type of remediation required.
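
The diagnostic logic in that paragraph can be written down directly. In the sketch below the target values (60 minutes for detection, 120 minutes for response) are placeholders, not figures from any standard, and the function name `diagnose` is hypothetical; each organization would set its own targets per network enclave.

```python
def diagnose(mttd_minutes, mttr_minutes, mttd_target=60.0, mttr_target=120.0):
    """Map an MTTD/MTTR pair to the remediation category it points toward."""
    slow_detect = mttd_minutes > mttd_target
    slow_respond = mttr_minutes > mttr_target
    if slow_detect and slow_respond:
        return "detection engineering and process/staffing"
    if slow_detect:
        return "detection engineering"
    if slow_respond:
        return "process or staffing"
    return "within targets"


print(diagnose(mttd_minutes=240, mttr_minutes=30))   # slow to detect, fast to respond
print(diagnose(mttd_minutes=10, mttr_minutes=600))   # fast to detect, slow to respond
```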

Data exfiltration simulation provides a mission-impact metric. The red team attempts to exfiltrate a synthetic data set (placeholder files labeled with the classification and data type of the target data, but containing no real sensitive content) and the exercise scores whether the exfiltration was detected and prevented, detected after the fact, or undetected entirely. This metric connects the technical exercise to the operational consequence that senior commanders understand: if this had been a real adversary, what would they have taken?
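
One way to record the three outcomes is a small scoring function applied to each exfiltration attempt. The sketch assumes each attempt has already been judged as detected or not, and blocked or not, during the after-action reconstruction; the file names, field names, and the function `score_attempt` are illustrative.

```python
from collections import Counter
from enum import Enum


class ExfilOutcome(Enum):
    PREVENTED = "detected and prevented"
    DETECTED_AFTER = "detected after the fact"
    UNDETECTED = "undetected"


def score_attempt(detected, blocked):
    """Classify one exfiltration attempt into the three outcomes."""
    if not detected:
        return ExfilOutcome.UNDETECTED
    return ExfilOutcome.PREVENTED if blocked else ExfilOutcome.DETECTED_AFTER


# Synthetic placeholder files only; no real content is involved.
attempts = [
    {"file": "synthetic_opord.docx", "detected": True, "blocked": True},
    {"file": "synthetic_roster.xlsx", "detected": True, "blocked": False},
    {"file": "synthetic_map.pdf", "detected": False, "blocked": False},
]

tally = Counter(score_attempt(a["detected"], a["blocked"]) for a in attempts)
for outcome in ExfilOutcome:
    print(f"{outcome.value}: {tally[outcome]}")
```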

Coverage rate – the percentage of red team techniques that generated any detection at all, regardless of MTTD – is the detection engineering completeness metric. A coverage rate below 60% indicates that the detection stack has significant blind spots that a sophisticated adversary can exploit freely. Organizations using the MITRE ATT&CK framework as their exercise planning basis should track coverage against the full technique matrix, not only against the techniques included in a specific exercise scenario.
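
The difference between scenario coverage and matrix coverage is easy to see in a short calculation. The technique lists below are small made-up examples (a real matrix contains hundreds of techniques), and `coverage_rate` is a hypothetical helper.

```python
def coverage_rate(reference_techniques, detected_techniques):
    """Fraction of the reference techniques that produced at least one detection."""
    reference = set(reference_techniques)
    if not reference:
        raise ValueError("reference technique list is empty")
    return len(reference & set(detected_techniques)) / len(reference)


scenario = ["T1059.001", "T1047", "T1053.005", "T1021.002", "T1003.001"]
matrix_subset = scenario + ["T1566.001", "T1071.004", "T1048", "T1027", "T1055"]
detected = ["T1059.001", "T1003.001"]

scenario_rate = coverage_rate(scenario, detected)
matrix_rate = coverage_rate(matrix_subset, detected)
print(f"scenario coverage: {scenario_rate:.0%}, matrix coverage: {matrix_rate:.0%}")
```

The same detections yield a lower figure when measured against the wider reference set, which is why the choice of denominator should be stated alongside any reported coverage rate.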

After-action review methodology

The after-action review is where exercise training value is realized. An exercise that produces no structured AAR produces no sustained improvement in defensive posture regardless of how well the technical execution went. Military organizations that apply the same AAR discipline to cyber exercises that they apply to kinetic training exercises close defensive gaps; those that conduct a brief debrief and return to routine operations do not.

The cyber exercise AAR should reconstruct the complete timeline from both perspectives simultaneously: every red team action with its exact timestamp, and every blue team detection or non-detection event at the corresponding time. Overlaying these timelines reveals the detection gaps visually – the intervals during which the red team was actively operating and generating log data that the blue team either did not see, did not triage, or did not escalate. For each gap, the AAR identifies the specific cause: missing detection rule, missing log source, alert dismissed as false positive, or alert generated but response process failed.
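
The timeline overlay and gap classification can be sketched as follows, assuming the red team log records whether each action produced log data and the blue team log records the disposition of each alert. The field names, status values, and functions are hypothetical; the four gap causes are the ones listed above.

```python
from datetime import datetime


def classify_gap(action, blue_event):
    """Return the cause of a detection gap, or None if the action was handled."""
    if blue_event is None:
        return "missing detection rule" if action["logged"] else "missing log source"
    if blue_event["status"] == "dismissed":
        return "alert dismissed as false positive"
    if blue_event["status"] == "no_response":
        return "alert generated but response process failed"
    return None


def overlay(red_actions, blue_events):
    """Build one row per red action, in time order, with the matching blue event."""
    rows = []
    for action in sorted(red_actions, key=lambda a: a["time"]):
        event = blue_events.get(action["id"])
        rows.append({
            "time": action["time"],
            "technique": action["technique"],
            "blue_time": event["time"] if event else None,
            "gap_cause": classify_gap(action, event),
        })
    return rows


red_actions = [
    {"id": "r2", "time": datetime(2026, 1, 15, 10, 30), "technique": "T1047", "logged": True},
    {"id": "r1", "time": datetime(2026, 1, 15, 9, 0), "technique": "T1059.001", "logged": True},
    {"id": "r3", "time": datetime(2026, 1, 15, 11, 0), "technique": "T1053.005", "logged": False},
]
blue_events = {
    "r1": {"time": datetime(2026, 1, 15, 9, 12), "status": "escalated"},
    "r2": {"time": datetime(2026, 1, 15, 10, 45), "status": "dismissed"},
}

rows = overlay(red_actions, blue_events)
for row in rows:
    print(row["time"], row["technique"], row["gap_cause"])
```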

Every identified gap must become a tracked remediation task with an owner and a deadline. Organizations that generate long lists of AAR findings without assigning owners and tracking closure have not improved their defensive posture – they have documented it. The remediation tracking process is the mechanism that converts exercise findings into operational security improvements. A follow-on tabletop or purple team session within 90 days of the main exercise should validate that critical remediations have been implemented before the defensive gaps identified in the exercise have had time to be exploited by an actual adversary.
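
A remediation tracker does not need to be elaborate to enforce the owner-and-deadline rule. The sketch below uses a hypothetical `RemediationTask` record and an `audit_tasks` helper to surface open findings that lack an owner or deadline and those that are overdue; the findings, owners, and dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class RemediationTask:
    finding: str
    owner: Optional[str]
    deadline: Optional[date]
    closed: bool = False


def audit_tasks(tasks, today):
    """Split open tasks into those lacking an owner or deadline and those overdue."""
    untracked = [t.finding for t in tasks
                 if not t.closed and (t.owner is None or t.deadline is None)]
    overdue = [t.finding for t in tasks
               if not t.closed and t.deadline is not None and t.deadline < today]
    return untracked, overdue


exercise_end = date(2026, 1, 15)
validation_due = exercise_end + timedelta(days=90)  # follow-on validation window

tasks = [
    RemediationTask("No rule for WMI lateral movement", "det-eng lead", date(2026, 2, 15)),
    RemediationTask("DNS logs not forwarded to SIEM", None, None),
    RemediationTask("Escalation path undefined", "SOC manager", date(2026, 2, 1), closed=True),
]

untracked, overdue = audit_tasks(tasks, today=date(2026, 3, 1))
print("Validation session due by:", validation_due)
print("Untracked:", untracked)
print("Overdue:", overdue)
```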

Planning cyber exercises at scale – integrating cyber domain training events into multi-domain joint exercises – benefits from exercise management platforms that coordinate scenarios, participants, and after-action data across domains. The WARG platform supports this multi-domain exercise planning capability, enabling cyber exercise events to be scheduled, staffed, and debriefed within the same planning framework as kinetic domain training events.

Key insight: The single most common cause of stagnant defensive posture in military SOC organizations is the failure to close the loop between exercise findings and remediation verification. An AAR that produces a list of findings without tracked owners and follow-on validation is an administrative document, not a training outcome. The exercise program is only as valuable as the remediation discipline that follows it.

This analysis was prepared by Corvus Intelligence engineers who build mission-critical software for defense and government organizations.

Frequently Asked Questions

How often should military organizations run red team cyber exercises?

Most defense security frameworks recommend a minimum of one full red team/blue team exercise per year for each mission-critical network enclave, with tabletop exercises conducted quarterly in between. High-priority networks supporting operational command and control warrant more frequent testing — semi-annual live exercises supplemented by continuous automated adversary emulation using tools like CALDERA. The frequency should scale with the sensitivity of the data processed and the operational impact of a successful intrusion. Organizations that cannot sustain internal red team capacity should prioritize at least one annual external red team engagement supported by a contracted threat emulation provider.

How can a military unit get red team capability without hiring internally?

Several options exist for organizations that cannot staff a permanent internal red team. First, national-level cyber defense agencies — CYBERCOM in the US, NCSC-affiliated units in allied nations — provide red team support to subordinate commands under a tasking framework. Second, contracted threat emulation providers hold requisite security clearances and can operate against classified enclaves under proper legal authority agreements. Third, coalition exercises such as CWIX and Locked Shields allow units to participate in multi-nation cyber exercises where the red team function is centrally managed. Fourth, automated adversary emulation platforms such as CALDERA can serve as a persistent low-cost red team supplement between live exercises, running continuous scenario-based attacks against isolated range networks without requiring a standing human red team.

What legal frameworks apply to red team operations against military networks?

Red team operations against military networks require explicit written authority before any activity begins. In the United States, the Computer Fraud and Abuse Act criminalizes unauthorized computer access, so that authority must come from command authorization granted through operational orders or a specific testing authorization document. NATO member operations are additionally governed by the applicable SOFA (Status of Forces Agreement) when multinational personnel are involved. The critical documents are: a rules of engagement document defining what systems are in scope and what attack techniques are authorized; a get-out-of-jail letter signed by the appropriate commander that red team operators carry during the exercise; and a deconfliction mechanism with any ongoing defensive operations to prevent exercise activity from triggering a genuine incident response. Failure to establish these authorities before exercise start is the most common legal exposure in military red team programs.

What is the difference between a red team exercise and a penetration test in a military context?

A penetration test is a scoped technical assessment: testers are given a defined target, a defined timeframe, and a defined objective — typically to find vulnerabilities in a specific system. It is an audit activity, not a training exercise. A red team exercise is an adversary simulation: the red team is given mission objectives and operates with broad latitude to use any technique a real threat actor would use, including social engineering, physical access attempts, and supply-chain attack simulation. For military organizations, the red team exercise is the primary training vehicle — it trains the blue team to detect and respond to realistic adversary behavior, whereas a penetration test trains system administrators to patch specific vulnerabilities. Both are necessary; neither substitutes for the other.

How do you measure success in a military cyber defense exercise?

The primary quantitative metrics are mean time to detect (MTTD) – the interval from red team technique execution to confirmed blue team alert, computed per technique – and mean time to respond (MTTR) – the interval from confirmed alert to completed containment action. Secondary metrics include the percentage of red team techniques that generated any detection at all (coverage rate), the false positive rate during the exercise window, and whether the red team achieved its stated mission objective before being evicted. Qualitative metrics from the after-action review capture decision quality: did the incident commander correctly triage the threat, were the right escalation decisions made, and was the forensic timeline reconstructed accurately post-exercise? A blue team that contains the red team quickly but misidentifies the attack vector has failed a critical training objective even if its MTTR is excellent.