Coalition Test Harnesses: How CWIX-Ready Code Is Tested Before Bydgoszcz

CWIX, the Coalition Warrior Interoperability eXercise, runs once a year at the Joint Force Training Centre in Bydgoszcz, Poland, under the direction of Allied Command Transformation. Three weeks. Dozens of nations and vendors. Real partner stacks plugged into the same fabric as yours. It is the largest venue where an honest test of NATO interoperability happens at scale, and it is unforgiving. Code that "passed conformance" at home routinely fails the first cross-vendor handshake on the floor.

The fix is not to wait for Bydgoszcz. The fix is a coalition test harness: a deliberately engineered rig that exercises your interop code against simulated partner stacks every time you push a commit. This article is an engineering walkthrough of what the harness contains, which tools belong inside it, and how to run it so the first day at CWIX is boring instead of catastrophic.

Why a Coalition Test Harness

The economics are blunt. A bug found at CWIX costs far more than a bug found in CI, plausibly by orders of magnitude. The team is on travel, the testing window is fixed, and the partner stack you need to retest with may not be available again until next year. Worse, partner trust erodes fast: a vendor whose gateway corrupts a Link 16 J3.2 air track on day one will be quietly steered around for the rest of the exercise. The accreditation outcome is downstream of that first impression.

The deeper problem is that CWIX is usually the first time a foreign stack ever sees your messages. You wrote to ADatP-3 chapter and verse. You ran a STANAG conformance tool. Your messages parse cleanly in your own emitter and your own consumer. None of that proves your code interoperates with a German JCHAT instance, a French SICF-NG gateway, or a US JREAP-C terminal driven by a different vendor's interpretation of the same STANAG. A coalition test harness shifts that first-contact event from Bydgoszcz to your laptop. See our complete guide to NATO interoperability for the wider picture.

Test Pyramid for Interop Code

Interop code deserves its own test pyramid. The shape is familiar; the layers are specific.

Unit: message struct. Pure tests over the wire-format parser and serializer. Round-trip a known-good byte buffer to an in-memory struct and back. Boundary fields (bit-packed enumerations in J-series, fixed-length identifiers in NFFI) deserve dedicated property-based tests using Hypothesis, jqwik, or fast-check. Coverage at this layer should be near total: these tests are cheap, and they catch the silent corruption bugs that humans never spot in a hex dump.
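A round-trip property check can be sketched with the standard library alone. The 32-bit layout below is invented for illustration and is not a real J-series word; `TrackWord`, `pack`, `unpack`, and `check_round_trip` are hypothetical names. With Hypothesis installed, the same property would normally be written as a `@given` test rather than a seeded loop.

```python
import random
from dataclasses import dataclass

# Hypothetical 32-bit layout, invented for illustration (NOT a real J-series word):
#   bits 0-18   track_number (19 bits)
#   bits 19-22  quality      (4 bits)
#   bits 23-25  identity     (3 bits)
#   bits 26-31  reserved     (must be zero)
FIELDS = (("track_number", 19), ("quality", 4), ("identity", 3))
RESERVED_BITS = 6


@dataclass(frozen=True)
class TrackWord:
    track_number: int
    quality: int
    identity: int


def pack(word: TrackWord) -> bytes:
    """Serialise to 4 big-endian bytes, refusing values that overflow a field."""
    value, shift = 0, 0
    for name, width in FIELDS:
        field = getattr(word, name)
        if not 0 <= field < (1 << width):
            raise ValueError(f"{name}={field} does not fit in {width} bits")
        value |= field << shift
        shift += width
    return value.to_bytes(4, "big")


def unpack(buf: bytes) -> TrackWord:
    """Parse 4 bytes, refusing wrong lengths and non-zero reserved bits."""
    if len(buf) != 4:
        raise ValueError(f"expected 4 bytes, got {len(buf)}")
    value = int.from_bytes(buf, "big")
    if value >> (32 - RESERVED_BITS):
        raise ValueError("reserved bits are set")
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (value >> shift) & ((1 << width) - 1)
        shift += width
    return TrackWord(**out)


def check_round_trip(samples: int = 2000, seed: int = 0) -> None:
    """Property: unpack(pack(w)) == w for boundary values and seeded random values."""
    rng = random.Random(seed)
    words = [TrackWord(0, 0, 0), TrackWord((1 << 19) - 1, 15, 7)]
    words += [
        TrackWord(rng.randrange(1 << 19), rng.randrange(16), rng.randrange(8))
        for _ in range(samples)
    ]
    for word in words:
        assert unpack(pack(word)) == word, word
```

Rejecting non-zero reserved bits is a deliberate strictness choice in this sketch; a partner's decoder may be looser or stricter, which is exactly what the higher layers of the pyramid are meant to expose.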

Integration: single-protocol round-trip. Boot the protocol stack in-process, send a generated message into your emitter, route it through a loopback transport, and assert the consumer reconstructs an equivalent struct. This is where you catch endianness mistakes, time-conversion bugs (date-time groups, UTC, GPS time, leap seconds), and coordinate errors (geodetic latitude/longitude on WGS-84 vs MGRS or UTM grid references). Use Testcontainers if a real broker (NATS, ActiveMQ, RabbitMQ) is in the path.
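One of the time-conversion bugs named above, the GPS/UTC offset, can be pinned down with a small helper and a round-trip assertion. This is a minimal sketch that assumes every timestamp falls after 1 January 2017, when the GPS-UTC offset became 18 seconds; a real implementation needs a maintained leap-second table. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)
SECONDS_PER_WEEK = 7 * 24 * 3600
# GPS time runs ahead of UTC by the leap seconds inserted since 1980.
# 18 s has applied since 2017-01-01; earlier dates need a lookup table.
GPS_UTC_OFFSET_S = 18


def gps_to_utc(week: int, seconds_of_week: float,
               leap_seconds: int = GPS_UTC_OFFSET_S) -> datetime:
    """Convert a GPS week number and seconds-of-week to an aware UTC datetime."""
    return GPS_EPOCH + timedelta(weeks=week, seconds=seconds_of_week - leap_seconds)


def utc_to_gps(moment: datetime,
               leap_seconds: int = GPS_UTC_OFFSET_S) -> Tuple[int, float]:
    """Convert an aware UTC datetime to (GPS week, seconds-of-week)."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime: refuse to guess the time zone")
    elapsed = (moment - GPS_EPOCH).total_seconds() + leap_seconds
    week, seconds_of_week = divmod(elapsed, SECONDS_PER_WEEK)
    return int(week), seconds_of_week
```

An integration test would feed a timestamp through the emitter and the consumer and assert that `gps_to_utc(*utc_to_gps(t)) == t`; an off-by-18-seconds result usually means one side skipped the offset.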

System: multi-protocol gateway. Most coalition systems are gateways. A track enters as Link 16 J3.2, exits as NFFI, and is mirrored to a CoT/MQTT feed for situational-awareness clients. The system layer wires the full pipeline together inside a Docker Compose stack or a k3d cluster, drives messages in, and asserts the cross-protocol invariants: track ID stability, position fidelity within tolerance, classification preservation.
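The three invariants can be expressed as one check over a protocol-neutral view of each output. The sketch below assumes each protocol adapter can project its message onto a hypothetical `TrackView`; the 25 m tolerance is an arbitrary placeholder, not a figure from any standard.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TrackView:
    """Protocol-neutral projection of a track (hypothetical type for this sketch)."""
    track_id: str
    lat_deg: float
    lon_deg: float
    classification: str


def haversine_m(a: TrackView, b: TrackView) -> float:
    """Great-circle distance in metres on a spherical Earth."""
    radius_m = 6371008.8  # mean Earth radius
    phi1, phi2 = math.radians(a.lat_deg), math.radians(b.lat_deg)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon_deg - a.lon_deg)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(h))


def invariant_violations(source: TrackView, outputs: Dict[str, TrackView],
                         tolerance_m: float = 25.0) -> List[str]:
    """Compare every gateway output with the source track; return readable failures."""
    problems = []
    for protocol, view in sorted(outputs.items()):
        if view.track_id != source.track_id:
            problems.append(f"{protocol}: track id {view.track_id!r} != {source.track_id!r}")
        distance = haversine_m(source, view)
        if distance > tolerance_m:
            problems.append(f"{protocol}: position drifted {distance:.1f} m")
        if view.classification != source.classification:
            problems.append(f"{protocol}: classification changed to {view.classification!r}")
    return problems
```

A system test then asserts `invariant_violations(source, {"nffi": ..., "cot": ...}) == []` after driving one message through the gateway.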

CWIX: real partner stacks. The apex of the pyramid is unavoidable: you cannot fully simulate a partner stack you have never seen. But the pyramid keeps that apex narrow. By the time you land in Bydgoszcz, the large majority of the bugs should already be dead.

Message Generators

A harness lives or dies on the realism of its generated traffic. Half-credible generators give false confidence.

Link 16 J-series generators. Build a parameterised generator per J-series message family: J2 PPLI, J3 surveillance (including the J3.2 air track), J7 information management, J12 mission management. Bit-level fidelity matters: a wrong reserved-field default will pass your decoder and fail a partner's. MIL-STD-6016 and STANAG 5516 are the reference; where you have access to a MIDS terminal or terminal simulator, check generated words against its captures. Wrap the generators in a fuzzer that varies declared classification, source track number (TN), and track quality.

NFFI emitters. NFFI 1.3 messages travel over the IP1 (TCP) and IP2 (UDP) interface profiles, or wrapped in web services (SIP3). Build emitters that produce both compliant and intentionally near-compliant payloads: partners' parsers vary in strictness, and your harness must expose your consumer's strictness too. The NFFI XSDs are the contract; validate every payload that is meant to be compliant against them before transmission, and record which rule each near-compliant payload breaks.

CoT and MQTT injection. Cursor-on-Target XML over TCP or MQTT is the lingua franca of tactical SA clients (ATAK, WinTAK, iTAK). Generate CoT events with realistic stale times, geo-fenced extents, and varied detail extensions. Mosquitto in a container handles the broker side; for higher fidelity, run TAK Server.

MIP4-IES message factories. The Multilateral Interoperability Programme's MIP4 Information Exchange Specification (whose MIP Information Model succeeds JC3IEDM at the data-model level) drives structured C2 exchange. MIP4 message factories are heavier, involving RDF triples and SPARQL-based assertion, but they are indispensable if your code touches a national C2 system.

Partner-Stack Simulators

No single simulator covers the spectrum. Combine them.

JREAP-C terminal simulators. JREAP (Joint Range Extension Applications Protocol, MIL-STD-3011) carries Link 16 messages over media other than the radio network; its Appendix C variant, JREAP-C, runs over IP. Several vendors ship JREAP-C terminal simulators; the US Navy's open NavyJTIDS test kit and commercial offerings from ViaSat or Ultra are common. Fidelity gap: timing. Real terminals expose TDMA time-slot and network-synchronisation dynamics that pure software simulators flatten.

JISR-Lite. NATO's Joint Intelligence, Surveillance and Reconnaissance reference implementation. Excellent for STANAG 4609 motion imagery metadata and STANAG 4559 CSD product query/retrieval. Run it in a VM; point your code at its endpoints. Fidelity gap: catalogue scale — real coalition CSDs hold orders of magnitude more products than the reference dataset.

NCI Server reference stacks. NCIA publishes reference implementations for several FMN spiral services — directory, messaging, situational awareness publish/subscribe. They are not certified partner stacks, but they expose the wire formats and authentication flows you must match. Fidelity gap: certificate trust chains — real FMN nodes terminate on PKI hierarchies you cannot perfectly replicate without coalition CAs.

Simulated FMN nodes. Spin up a minimal FMN node using NCIA's reference services plus a local PKI (for example step-ca from Smallstep) for the trust fabric. Configure FMN Spiral 4 or Spiral 5 service profiles depending on the exercise you are preparing for. Walk through this configuration with the discipline of an accreditation evidence pack; see CWIX accreditation.

Conformance Test Suites

NATO STANAG conformance reports are necessary but not sufficient. They prove your messages match the standard's syntactic and semantic rules. They do not prove a German partner will understand your meaning.

Run the suites anyway. ADatP-3 message catalogues ship with validators; STANAG 4774 confidentiality labels and STANAG 4778 metadata bindings have their own. NFFI XSD validation is non-negotiable. FMN compliance for each spiral is judged on documented evidence, so your harness should emit that evidence as a build artefact. Pair conformance reports with AQAP-2110 quality evidence (and its software supplement, AQAP-2210) to keep accreditation reviewers moving; see our AQAP-2110 walkthrough.

The gap between "passes the test" and "interops with humans" is closed only by partner-stack rehearsal. A J3.2 air track that conforms perfectly but uses a track number block colliding with a partner's allocation will fail human-judged interop on day one. Document allocation negotiations explicitly in your harness configuration; treat them as test data.
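Treating allocations as test data can be as simple as an overlap check that runs with every build. The sketch below assumes allocations are negotiated as five-digit octal track number blocks; the participant names and ranges are made up, and alphanumeric track number ranges are out of scope.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def parse_block(start_octal: str, end_octal: str) -> range:
    """Turn an inclusive octal block such as ('01000', '01777') into a range."""
    start, end = int(start_octal, 8), int(end_octal, 8)
    if start > end:
        raise ValueError(f"block {start_octal}-{end_octal} is reversed")
    return range(start, end + 1)


def find_collisions(allocations: Dict[str, range]) -> List[Tuple[str, str, str, str]]:
    """Return (owner_a, owner_b, first_octal, last_octal) for every overlapping pair."""
    collisions = []
    ordered = sorted(allocations.items(), key=lambda item: item[0])
    for (name_a, block_a), (name_b, block_b) in combinations(ordered, 2):
        low = max(block_a.start, block_b.start)
        high = min(block_a.stop, block_b.stop) - 1
        if low <= high:
            collisions.append((name_a, name_b, format(low, "05o"), format(high, "05o")))
    return collisions


# Illustrative allocations only; real blocks come from the negotiated network design.
ALLOCATIONS = {
    "own": parse_block("01000", "01777"),
    "partner_a": parse_block("01700", "02777"),
    "partner_b": parse_block("03000", "03777"),
}
```

Keeping `ALLOCATIONS` in the harness configuration means a renegotiated block shows up as a reviewed diff, and `find_collisions(ALLOCATIONS)` failing the build is the cheapest possible version of the day-one conversation.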

Continuous Integration for Interop

The harness has to run on every pull request. If it runs only nightly, the team has already accepted weeks of drift by the time CWIX arrives.

Bake the harness into a single CI job: GitHub Actions, GitLab CI, or Azure DevOps Pipelines all work. Use containerised simulators so the job is hermetic. Capture a deterministic message corpus — a curated set of J-series, NFFI, CoT, and MIP4 messages with known-good expected outcomes — and replay it every build. Snapshot-regression any wire-format output: a one-byte change in a serialiser is exactly the bug that breaks a partner.
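Corpus replay with snapshot regression fits in a few lines. This is a sketch under assumptions: the corpus is a dictionary of named messages, `serialise` stands for whatever serialiser is under test, and the snapshot is a JSON file of SHA-256 digests committed alongside the code. All names are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List


def digest(wire: bytes) -> str:
    """SHA-256 of the serialised bytes; any one-byte change alters it."""
    return hashlib.sha256(wire).hexdigest()


def replay_corpus(corpus: Dict[str, dict], serialise: Callable[[dict], bytes],
                  snapshot_path: Path, update: bool = False) -> List[str]:
    """Serialise every corpus message and compare digests with the stored snapshot.

    Returns the names of messages whose wire output changed, appeared, or vanished.
    With update=True the snapshot is rewritten instead (a deliberate, reviewed act).
    A missing snapshot raises FileNotFoundError rather than passing silently.
    """
    actual = {name: digest(serialise(message)) for name, message in corpus.items()}
    if update:
        snapshot_path.write_text(json.dumps(actual, indent=2, sort_keys=True))
        return []
    expected = json.loads(snapshot_path.read_text())
    return [name for name in sorted(set(actual) | set(expected))
            if actual.get(name) != expected.get(name)]
```

In CI the job calls `replay_corpus(...)` with `update=False` and fails on a non-empty result; regenerating the snapshot happens only in a commit that a reviewer can see.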

Provenance matters. Each harness run should emit a signed artefact bundle — conformance reports, message corpus version, simulator versions, your SBOM. Tie this into the supply-chain controls described in SBOM enforcement in defense pipelines.
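The bundle can start as a canonical manifest of artefact digests plus a signature over that manifest. In the sketch below HMAC-SHA256 stands in for the signature so the example stays inside the standard library; a real pipeline would use an asymmetric signing tool, and every name here is hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict


def build_manifest(artefacts: Dict[str, bytes], versions: Dict[str, str]) -> bytes:
    """Canonical JSON manifest: one SHA-256 digest per artefact, plus tool versions."""
    manifest = {
        "artefacts": {name: hashlib.sha256(data).hexdigest()
                      for name, data in artefacts.items()},
        "versions": versions,
    }
    # sort_keys and fixed separators make the bytes stable across runs.
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()


def sign_manifest(manifest: bytes, key: bytes) -> str:
    """HMAC-SHA256 as a placeholder for a real signature (assumption of this sketch)."""
    return hmac.new(key, manifest, hashlib.sha256).hexdigest()


def verify_manifest(manifest: bytes, key: bytes, signature: str) -> bool:
    """Constant-time comparison of the recomputed and the supplied signature."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)
```

The manifest would list the conformance reports, the corpus version, the simulator image versions, and the SBOM, so a reviewer can check any single file against the signed digest.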

Key insight: The harness is not a separate project. It is part of the codebase, versioned with the code, owned by the engineers who write the interop logic. Outsourced harnesses go stale; in-house harnesses evolve with every PR and catch regressions the day they land.

Negative Tests

Most interop bugs surface on the unhappy path. The harness must drive it deliberately.

Malformed messages. Truncate Link 16 J-series frames mid-field. Corrupt bit-packed enumerations to reserved values. Send NFFI payloads that deliberately violate the XSD. Your consumer should reject, log, and continue: never crash, never silently accept, never propagate.
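The truncation case generalises to any decoder: cut a known-good frame at every offset and require a controlled rejection each time. The frame format below (two-byte length, one-byte type, payload) is a toy invented for the sketch, as are `DecodeError` and the function names.

```python
import struct
from typing import Callable, List, Tuple


class DecodeError(Exception):
    """The only exception a well-behaved decoder may raise on bad input."""


def decode_frame(buf: bytes) -> dict:
    """Toy frame: 2-byte big-endian payload length, 1-byte type, then the payload."""
    if len(buf) < 3:
        raise DecodeError("header truncated")
    length, msg_type = struct.unpack_from(">HB", buf, 0)
    payload = buf[3:]
    if len(payload) != length:
        raise DecodeError(f"declared {length} payload bytes, got {len(payload)}")
    return {"type": msg_type, "payload": payload}


def truncation_failures(frame: bytes,
                        decode: Callable[[bytes], dict]) -> List[Tuple[int, str]]:
    """Truncate the frame at every offset; report cuts that crash or are accepted."""
    failures = []
    for cut in range(len(frame)):
        try:
            decode(frame[:cut])
        except DecodeError:
            continue  # controlled rejection: the desired outcome
        except Exception as exc:  # any other exception counts as a crash
            failures.append((cut, f"crashed with {type(exc).__name__}"))
        else:
            failures.append((cut, "silently accepted"))
    return failures
```

The same loop works for bit flips or reserved-value corruption by swapping the mutation; the assertion stays `truncation_failures(frame, decode) == []`.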

Security overlays. Vary STANAG 4774 confidentiality metadata: send a NATO SECRET-tagged message to a consumer cleared only to NATO RESTRICTED. The consumer must refuse and audit, not downgrade. STANAG 4778 binding violations — signature mismatch on a metadata-bound payload — must fail closed.

Classification mishandling. Cross-domain mistakes are career-ending in coalition operations. Inject mixed-classification batches into your gateway and assert that the high-water-mark rule holds: the whole batch is handled at the classification of its most sensitive message. Inject messages that lack classification metadata altogether; your code must reject them, never assign a default.
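The rules in the last two paragraphs reduce to an ordering on labels, a maximum over the batch, and a refusal to guess. The sketch below models a message as a dictionary with a `classification` key, which is a simplification: a real consumer would read the label from the bound confidentiality metadata and evaluate it against a policy, not a flat list.

```python
from typing import Iterable

# NATO classification levels in ascending order of sensitivity.
LEVELS = (
    "NATO UNCLASSIFIED",
    "NATO RESTRICTED",
    "NATO CONFIDENTIAL",
    "NATO SECRET",
    "COSMIC TOP SECRET",
)
RANK = {name: index for index, name in enumerate(LEVELS)}


class LabelError(Exception):
    """Raised when a label is missing or unrecognised; never default silently."""


def rank_of(message: dict) -> int:
    label = message.get("classification")
    if label not in RANK:
        raise LabelError(f"missing or unknown classification: {label!r}")
    return RANK[label]


def batch_classification(batch: Iterable[dict]) -> str:
    """High-water mark: the batch takes the level of its most sensitive message."""
    ranks = [rank_of(message) for message in batch]
    if not ranks:
        raise LabelError("empty batch has no classification")
    return LEVELS[max(ranks)]


def may_release(batch: Iterable[dict], consumer_clearance: str) -> bool:
    """True only if the consumer is cleared for the whole batch; no downgrading."""
    if consumer_clearance not in RANK:
        raise LabelError(f"unknown clearance: {consumer_clearance!r}")
    return RANK[batch_classification(batch)] <= RANK[consumer_clearance]
```

A negative test then asserts that a NATO SECRET message hidden in an otherwise NATO RESTRICTED batch blocks release of the entire batch, and that an unlabelled message raises instead of passing through.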

Time-skew edge cases. Clocks drift, GPS time and UTC diverge across leap seconds, and partner systems sometimes report time in fields that do not match their wire spec. Drive your harness with deliberately skewed timestamps (positive and negative) and assert your code clamps, rejects, or logs per requirement — never silently accepts a message dated next year.
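A skew policy is easiest to test when it is a pure function of the message time and the receiver's clock. The limits below (five seconds into the future, five minutes of age) are placeholders for whatever your requirement states, and `Verdict` and `check_timestamp` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT_FUTURE = "reject_future"
    REJECT_STALE = "reject_stale"


def check_timestamp(message_time: datetime, now: datetime,
                    max_future: timedelta = timedelta(seconds=5),
                    max_age: timedelta = timedelta(minutes=5)) -> Verdict:
    """Classify a message timestamp against the receiver's clock.

    Both datetimes must be timezone-aware; naive values are refused outright
    because guessing the zone is itself a silent time-skew bug.
    """
    if message_time.tzinfo is None or now.tzinfo is None:
        raise ValueError("naive datetime: timestamps must carry a time zone")
    skew = message_time - now
    if skew > max_future:
        return Verdict.REJECT_FUTURE
    if -skew > max_age:
        return Verdict.REJECT_STALE
    return Verdict.ACCEPT
```

The harness drives this with a fixed `now` and a sweep of offsets on both sides of each limit, so the boundary behaviour is asserted rather than assumed.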

CWIX Preparation

Six weeks out, the rehearsal cycle starts. Freeze the harness scope; no new features until after Bydgoszcz. Stand up an in-house "mini-CWIX": a closed event over two or three days where every team in the company that touches the system plugs in concurrently. The goal is not to find new bugs; the goal is to make the operations and travel teams fluent in the floor workflow before they meet a real partner.

Four weeks out, run a partner dress rehearsal. Coordinate with a friendly vendor or allied unit for a one-day virtual exchange. Even a single external connection exposes assumptions your harness baked in. Capture every pcap, every log; the lessons feed the next year's harness corpus.

Two weeks out, lock the artefact. Tag the build. Burn the simulator image set onto the laptops travelling to Bydgoszcz. Pre-stage every conformance report, every SBOM, every signed evidence bundle in the format the exercise's accreditation reviewers expect.

On the floor, the discipline is logging. Capture every byte on every interface, classified and unclassified, with synchronised clocks. Triage in real time but resolve nothing destructively — the value of CWIX is not the bugs you fix during the exercise; it is the bugs you fix in the harness afterwards so they never recur. The lessons-learned cycle, executed faithfully, is what fuels the next year's clean run and the year after that.