Building a defense fusion pipeline, part 1: sources, schemas, and the adapter layer

A fusion pipeline that operates reliably under load is not designed; it is iterated. The first iteration almost always fails for the same reason: insufficient discipline at the source-and-schema layer. Adapters leak source-specific concepts downstream into the fusion engine, the track schema is not stable enough to support evolution, the canonical model conflates concepts that should stay separate, and six months later the team is rewriting the fusion engine while operators are still using the broken one. This four-part series walks through how to avoid that outcome. Part 1 covers the foundation: cataloguing sources, designing the canonical track schema, and the adapter layer that keeps everything else clean.

The architectural framing for this series is in The Complete Guide to Defense Data Fusion. The C2-side equivalent – building the full C2 stack with fusion as one component – is the parallel series starting at Building a C2 System from Scratch, Part 1. This series goes narrow on the fusion-engine-plus-data-layer subsystem.

Step 1: catalogue the sources before writing code

The single highest-leverage activity at the start of a fusion programme is the source catalogue – a document describing every sensor, intelligence feed, and external input the platform will ingest. The catalogue is uninteresting to build, uninteresting to read, and critical to get right. It becomes the contract that every downstream component depends on.

The catalogue captures, for each source:

Source identity – stable identifier, friendly name, owning organization.
Wire format – ASTERIX category and version, STANAG 4586 release, AIS NMEA 0183, CoT XML schema version, NITF version, etc.
Transport – UDP multicast, TCP unicast, MQTT topic, HTTP webhook, file drop. Includes addressing, authentication, encryption posture.
Cadence – message rate at nominal load, peak rate, expected silence intervals.
Latency profile – observation time vs report time vs ingest time. Some sources are real-time; others have batch delays measured in hours.
Accuracy and uncertainty – what the spec claims, what the operational data shows, what the failure modes look like.
Classification posture – what classification level the source operates at, what compartments apply, what releasability rules govern the data.
Known failure modes – link drops, source-side outages, gradual degradation, deliberate manipulation possibilities.
Schema mappings – how each source field maps to the canonical track schema (filled in once the schema exists).

The catalogue is a versioned artefact, stored in the repository alongside source code, reviewed by the engineering team and (where appropriate) by the operational community whose sensors feed in. A new source is not "integrated" until it has a catalogue entry; this discipline alone prevents the most common multi-year refactor in fusion projects.
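One way to make the "no catalogue entry, no integration" rule checkable is to keep the catalogue as structured data and compare it against the configured sources. The sketch below is a minimal illustration under assumptions: the field names, the example source, and the `missing_catalogue_entries` helper are all hypothetical, and a real catalogue would carry the full set of attributes listed above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceCatalogueEntry:
    """One catalogue record. Field names are illustrative, not a standard."""
    source_id: str            # stable identifier
    friendly_name: str
    owner: str                # owning organization
    wire_format: str          # e.g. "AIS NMEA 0183"
    transport: str            # e.g. "UDP multicast"
    nominal_rate_hz: float    # cadence at nominal load
    peak_rate_hz: float
    classification: str
    known_failure_modes: tuple[str, ...] = ()
    # Filled in once the canonical schema exists: source field -> schema field.
    schema_mappings: dict[str, str] = field(default_factory=dict)


def missing_catalogue_entries(configured_sources, catalogue):
    """Return the configured source IDs that have no catalogue entry."""
    known = {entry.source_id for entry in catalogue}
    return sorted(set(configured_sources) - known)


catalogue = [
    SourceCatalogueEntry(
        source_id="ais-coastal-01",          # hypothetical source
        friendly_name="Coastal AIS receiver",
        owner="Example Maritime Authority",
        wire_format="AIS NMEA 0183",
        transport="UDP multicast",
        nominal_rate_hz=50.0,
        peak_rate_hz=400.0,
        classification="UNCLASSIFIED",
        known_failure_modes=("link drop", "spoofed positions"),
    )
]

# A deployment config that names a source nobody has catalogued yet.
print(missing_catalogue_entries(["ais-coastal-01", "radar-north-02"], catalogue))
# → ['radar-north-02']
```

Run as a repository check, a non-empty result would block the merge that introduces the uncatalogued source.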

The detailed treatment of source-integration challenges, particularly the multi-INT semantics that surface in defense, is in Defense Data Integration Challenges.

Step 2: design the canonical track schema

The track is the central data structure of any fusion platform. Every adapter produces tracks; every fusion decision updates tracks; every consumer reads tracks. The schema is the contract that the platform lives with for its operational life, typically 15-20 years. Spend a sprint getting it right; spend a week documenting it.

The minimum viable schema includes:

Track ID. Globally unique, stable across the track's lifetime, never reused. UUIDv7 or a typed prefix-plus-UUID is the safe default. The ID is opaque – it does not encode source, identity, or any other attribute that might change.
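A typed prefix-plus-UUID identifier can be sketched in a few lines. This example uses a random UUID (`uuid.uuid4`) as a stand-in, because a UUIDv7 generator is not available in the standard library of every Python version; the `trk` prefix and both function names are assumptions made for illustration.

```python
import uuid

TRACK_PREFIX = "trk"  # hypothetical type prefix


def new_track_id() -> str:
    """Opaque typed ID: prefix plus random UUID. Encodes no track attributes."""
    return f"{TRACK_PREFIX}_{uuid.uuid4()}"


def is_track_id(value: str) -> bool:
    """Check the prefix and that the remainder parses as a UUID."""
    prefix, sep, rest = value.partition("_")
    if prefix != TRACK_PREFIX or not sep:
        return False
    try:
        uuid.UUID(rest)
    except ValueError:
        return False
    return True


track_id = new_track_id()
print(track_id, is_track_id(track_id))
```

The prefix says only what kind of entity the ID names; nothing about source or identity is recoverable from it, which is what keeps it stable when those attributes change.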

Identity. A structured type with three sub-fields: type taxonomy (vessel, aircraft, vehicle, person, unit, signal, unclassified-other), subtype (per-domain finer-grained classification), and identifying attributes (hull number, tail number, callsign, MMSI, transponder ID). Identity is updated by fusion as evidence accumulates; the ID is not.

Position and uncertainty. Latitude, longitude, altitude in WGS84 by default. Uncertainty represented as either a covariance matrix (preferred for kinematic fusion) or a major/minor axis with bearing (acceptable for simpler use cases). Never a single uncertainty number – it loses the geometric information fusion needs.
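To see why a single number is not enough, it helps to convert a horizontal covariance into the major/minor-axis form. The sketch below assumes a 2x2 covariance expressed in a local east/north frame in square metres and returns a 1-sigma ellipse; the function name and frame convention are illustrative choices, not part of any particular schema.

```python
import math


def covariance_to_ellipse(var_e, var_n, cov_en):
    """Convert a 2x2 east/north covariance (m^2) to a 1-sigma error ellipse.

    Returns (semi_major_m, semi_minor_m, bearing_deg), where the bearing of
    the major axis is measured clockwise from north, in the range [0, 180).
    """
    mean = (var_e + var_n) / 2.0
    diff = (var_e - var_n) / 2.0
    root = math.hypot(diff, cov_en)
    # Eigenvalues of the symmetric 2x2 matrix are mean +/- root.
    semi_major = math.sqrt(mean + root)
    semi_minor = math.sqrt(max(mean - root, 0.0))
    # Orientation of the major axis, counter-clockwise from east.
    angle_from_east = 0.5 * math.atan2(2.0 * cov_en, var_e - var_n)
    bearing = (90.0 - math.degrees(angle_from_east)) % 180.0
    return semi_major, semi_minor, bearing


# Same amount of uncertainty, very different geometry:
print(covariance_to_ellipse(100.0, 4.0, 0.0))   # elongated east-west
print(covariance_to_ellipse(4.0, 100.0, 0.0))   # elongated north-south
```

Both inputs would collapse to the same scalar if uncertainty were stored as one number, yet they gate very differently against a second observation lying to the east.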

Kinematic state. Velocity vector, turn rate, derived course/speed for display. Time-tagged with the moment of estimation.

Source set. Which adapters contributed observations to this track, with per-source classification, releasability, and confidence. The source set is the foundation for classification propagation and audit. The detailed treatment is in Military Data Fusion Explained.

Three timestamps. Observation time (when the sensor saw the object), report time (when the message left the sensor), ingest time (when the platform received it). Conflating these is the most common bug source in fusion engineering. Operators need observation time; replay analytics need ingest time; the difference between them surfaces sensor latency for monitoring.
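Keeping the three times as separate, named fields makes the latency figures fall out as simple differences. The following is a minimal sketch; the class and property names are assumptions, and the only rule it enforces is that every timestamp is timezone-aware.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TrackTimes:
    observed_at: datetime   # when the sensor saw the object
    reported_at: datetime   # when the message left the sensor
    ingested_at: datetime   # when the platform received it

    def __post_init__(self):
        # Naive datetimes are the usual way the three times get conflated.
        for name in ("observed_at", "reported_at", "ingested_at"):
            if getattr(self, name).tzinfo is None:
                raise ValueError(f"{name} must be timezone-aware")

    @property
    def sensor_latency(self) -> timedelta:
        """Delay inside the sensor: observation to report."""
        return self.reported_at - self.observed_at

    @property
    def transport_latency(self) -> timedelta:
        """Delay on the link: report to ingest."""
        return self.ingested_at - self.reported_at

    @property
    def total_latency(self) -> timedelta:
        """Observation to ingest; the figure to monitor per source."""
        return self.ingested_at - self.observed_at


t0 = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
times = TrackTimes(t0, t0 + timedelta(seconds=2), t0 + timedelta(seconds=5))
print(times.sensor_latency, times.transport_latency, times.total_latency)
```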

Lifecycle state. Tentative, confirmed, mature, fading, lost. State machine details come in Part 2.

Classification envelope. Effective classification computed from the source set. Releasability tags computed from the intersection of source releasabilities. Compartment markings where applicable.
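The envelope computation can be sketched as a pure function over the source set. Two assumptions are made here and should be checked against the governing security policy: that the effective classification is the highest level among contributing sources, and that levels form a simple ordered list. The level names, country codes, and type names are illustrative only.

```python
from dataclasses import dataclass

# Hypothetical ordering, lowest to highest. Real marking schemes are richer.
LEVELS = ["UNCLASSIFIED", "RESTRICTED", "CONFIDENTIAL", "SECRET"]


@dataclass(frozen=True)
class SourceMarking:
    source_id: str
    classification: str
    releasable_to: frozenset


def classification_envelope(markings):
    """Return (effective_classification, releasability) for a source set."""
    if not markings:
        raise ValueError("a track must have at least one contributing source")
    # Assumption: the highest contributing level governs the track.
    effective = max((m.classification for m in markings), key=LEVELS.index)
    # Releasable only to parties every contributing source is releasable to.
    releasable = frozenset.intersection(*(m.releasable_to for m in markings))
    return effective, releasable


markings = [
    SourceMarking("radar-a", "SECRET", frozenset({"USA", "GBR"})),
    SourceMarking("ais-b", "UNCLASSIFIED", frozenset({"USA", "GBR", "FRA"})),
]
print(classification_envelope(markings))
```

Because the function depends only on the source set, the envelope can be recomputed whenever a source is added to or aged out of a track.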

Confidence and certainty. Track-level confidence as a single calibrated score. Per-attribute certainty where it materially differs – for instance, a track may have high position certainty but tentative identity.

Step 3: commit to additive schema evolution

The schema will evolve. New attributes will be needed; rare cases will surface that the original design did not anticipate. The discipline that keeps the platform operational through this evolution is additive-only versioning.

The rules:

New fields are optional. Existing consumers ignore them until they are upgraded. Producers fill them when relevant data is available.
Existing fields never change semantics. A field that means "speed in m/s" today must mean "speed in m/s" forever. A meaning change requires a new field, not an in-place change.
Removals are deprecations. A field marked deprecated is still in the schema; new producers stop writing it; new consumers stop reading it; old data continues to work indefinitely.
Breaking changes are major-version bumps. They happen – rarely. When they do, the migration is documented, tested, and coordinated across all consumers. A breaking change should occur at most once per platform lifetime, not once per release.
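The first two rules can be illustrated with a consumer that tolerates both older and newer producers. This is a sketch under assumptions: the message is a plain dictionary, and `TrackUpdate`, its fields, and `decode` are hypothetical names rather than part of any generated client library.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TrackUpdate:
    track_id: str
    speed_mps: float                        # meaning is frozen: speed in m/s
    turn_rate_dps: Optional[float] = None   # added later, optional by rule


def decode(message: dict) -> TrackUpdate:
    """Build a TrackUpdate, ignoring fields this consumer does not know yet."""
    known = {f.name for f in fields(TrackUpdate)}
    return TrackUpdate(**{k: v for k, v in message.items() if k in known})


# An old producer that predates turn_rate_dps still decodes.
old = decode({"track_id": "trk_1", "speed_mps": 12.5})

# A newer producer that writes a field this consumer has not adopted yet.
new = decode({"track_id": "trk_1", "speed_mps": 12.5,
              "turn_rate_dps": 3.0, "altitude_m": 1200.0})

print(old)
print(new)
```

Unknown fields are dropped rather than rejected, and missing optional fields default to `None`, so producers and consumers can be upgraded in either order.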

Wrap the schema in a code-generated client library shared by every consumer language. Schema-as-code prevents the slow divergence that otherwise produces "fusion platform v3.4 in service A, v3.6 in service B, v4.0 in service C" – the operational nightmare that every fusion programme will encounter without this discipline.

Key insight: The track schema is the platform's most consequential artefact. Schemas designed in week one to be additive survive 20 years of operational evolution. Schemas designed informally and refined later become the source of the multi-month refactor that ships every two years. Invest the sprint up front; reap the benefit for the platform's life.

Step 4: build the adapter layer with strict isolation

The adapter layer translates each source's native format to the canonical track schema. The architectural rule is brutal and worth memorizing: no sensor-specific concept leaks past the adapter. If your fusion engine code references ASTERIX categories, you have a leaky architecture. If your track store has a column for AIS message types, you have a leaky architecture. The rule is structural – break it once, and the cost compounds across years.

The structure of a well-designed adapter, in four layers:

Transport. The connector to the source. UDP socket, TCP listener, MQTT subscription, HTTP webhook, file watcher. Resilient to source-side failure: automatic reconnect with backoff, dropped-message accounting, telemetry exported to the platform's monitoring stack.
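Automatic reconnect with backoff is commonly implemented as a capped exponential delay with a little jitter. The sketch below shows that pattern with the sleep function injected so it can be exercised without waiting; the default values and both function names are assumptions, and dropped-message accounting and telemetry are left out for brevity.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, cap=30.0, jitter=0.1, rng=None):
    """Yield reconnect delays in seconds: exponential growth up to a cap."""
    if rng is None:
        rng = random.Random()
    delay = base
    while True:
        spread = delay * jitter
        yield delay + rng.uniform(-spread, spread)
        delay = min(delay * factor, cap)


def connect_with_backoff(connect, max_attempts=5, sleep=time.sleep, delays=None):
    """Call connect() until it succeeds or max_attempts is exhausted."""
    if delays is None:
        delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts:
                raise
            sleep(next(delays))


# Usage with a stand-in connector that fails twice before succeeding.
attempts = {"n": 0}

def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionRefusedError("source not reachable")
    return "connected"

waits = []  # record the delays instead of actually sleeping
result = connect_with_backoff(flaky_connect, sleep=waits.append,
                              delays=backoff_delays(jitter=0.0))
print(result, waits)
```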

Parser. Translates the wire format to a strongly-typed in-process structure. Validates against the format specification. Rejects malformed input loudly, with structured logging that surfaces the malformation, the source identifier, and the timestamp. Silent dropping of bad input is the wrong default – it hides operational issues that should be surfaced.
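Loud rejection can be sketched with a parser that either returns a typed structure or raises after emitting a structured log record. The wire format here, a `lat,lon,epoch` text line, is invented purely for illustration and does not correspond to any real sensor format; `RawPlot`, `MalformedInput`, and `parse_plot` are likewise hypothetical names. The log timestamp is left to the logging configuration.

```python
import json
import logging
from dataclasses import dataclass

log = logging.getLogger("adapter.parser")


class MalformedInput(ValueError):
    """Raised for any record that violates the (hypothetical) format."""


@dataclass(frozen=True)
class RawPlot:
    lat_deg: float
    lon_deg: float
    observed_epoch_s: float


def parse_plot(line: str, source_id: str) -> RawPlot:
    """Parse one 'lat,lon,epoch' record; log and raise on any violation."""
    try:
        lat_s, lon_s, t_s = line.strip().split(",")
        plot = RawPlot(float(lat_s), float(lon_s), float(t_s))
        in_range = (-90.0 <= plot.lat_deg <= 90.0
                    and -180.0 <= plot.lon_deg <= 180.0)
        if not in_range:
            raise ValueError("coordinates out of range")
    except ValueError as exc:
        # Structured record: what was wrong, from which source, and the input.
        log.error(json.dumps({
            "event": "malformed_input",
            "source_id": source_id,
            "reason": str(exc),
            "raw": line[:200],
        }))
        raise MalformedInput(str(exc)) from exc
    return plot


print(parse_plot("54.32,10.14,1714560000", "radar-north-02"))
```

The caller decides what to do with `MalformedInput` (count it, quarantine the record), but the parser itself never returns a partially valid structure.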

Normalizer. Maps source-specific fields to canonical-schema fields. Coordinate-system conversion (typically to WGS84). Timestamp normalization to UTC with the three-timestamp discipline. Identity-field normalization across the various ways the same hull number or callsign might be formatted in different sources.
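Two of the normalizer's jobs, timestamp and identity normalization, are small enough to sketch directly. The callsign rule below (uppercase, keep only letters and digits) is a deliberate simplification chosen for illustration; real normalization rules are source-specific and belong in the catalogue's schema mappings. Both function names are assumptions.

```python
import re
from datetime import datetime, timedelta, timezone


def normalize_callsign(raw: str) -> str:
    """Uppercase and strip everything except ASCII letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", raw.upper())


def to_utc(ts: datetime) -> datetime:
    """Convert an aware timestamp to UTC; refuse to guess for naive ones."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: the source time zone must be known")
    return ts.astimezone(timezone.utc)


# The same callsign as three sources might format it.
print({normalize_callsign(s) for s in ["ab-123", " AB 123 ", "Ab123"]})

local = datetime(2024, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))
print(to_utc(local).isoformat())
```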

Emitter. Publishes the canonical track update to the platform's message bus, tagged with source identifier, source classification, releasability, and a fresh ingest timestamp. The emitter is the only component in the adapter that knows about the platform; everything upstream of it is source-specific isolated code.

Each adapter runs as a separate service or process. They share a code-generated client library for the canonical schema, but no other code paths. Adding a new source means writing a new adapter; it does not touch any other component. The detailed integration patterns for the most common defense sources are in Integrating AIS and ADS-B into a Military Picture and the CoT side in Cursor on Target (CoT).

Step 5: wire the adapters to a durable message bus

Adapters publish to a durable, ordered, partitioned log. Fusion services consume from it. So do audit, historical replay, and downstream analytics. The bus is the spinal cord of the fusion platform.

The pattern that scales: Kafka or NATS JetStream as the durable event log; topic-per-source-type at the input side; topic-per-output-type at the fusion side. Adapters publish to raw.<source-type>; the fusion engine consumes those and publishes to tracks.updates, tracks.lifecycle, tracks.classification. Consumers subscribe to whichever topics they need.
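Topic names are easiest to keep consistent when they are produced by one small helper rather than typed by hand in each adapter. The sketch below only builds and validates names following the pattern described above; it does not talk to Kafka or NATS, and the allowed character set is an assumption made for illustration.

```python
import re

# Output-side topics named in the text.
FUSION_OUTPUT_TOPICS = ("tracks.updates", "tracks.lifecycle", "tracks.classification")


def raw_topic(source_type: str) -> str:
    """Input-side topic for a source type, following raw.<source-type>."""
    name = source_type.strip().lower()
    # Assumed naming rule: lowercase letters, digits, and hyphens only.
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]*", name):
        raise ValueError(f"invalid source type for a topic name: {source_type!r}")
    return f"raw.{name}"


print(raw_topic("AIS"), raw_topic("asterix-cat048"))
```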

The detailed trade-offs between Kafka and NATS, the topic-modelling discipline, and the operational considerations are in Message Queues for Defense Data Pipelines.

The architectural rule worth surfacing: do not call HTTP between fusion components. Synchronous request-response coupling between adapters, fusion services, and consumers makes the pipeline brittle. A sensor surge that stalls one consumer must not stall the producer side. The bus with backpressure handling is the structural solution; HTTP between fusion components is a recurring source of outages.

Step 6: test the source catalogue against reality

The source catalogue is a hypothesis until it is tested. The disciplines that validate it before the pipeline goes operational:

Captured-data replay. For each source, capture days or weeks of real wire-format traffic into a file. Replay the file against the adapter at original rate and at accelerated rate. The adapter that handles real data at 10× speed is the adapter that handles operational sensor surges; the adapter that handles only catalogue-synthetic data is not yet ready.
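Replaying at original or accelerated rate comes down to preserving the gaps between captured messages and dividing them by a speed-up factor. The sketch below assumes the capture is available as `(capture_time_seconds, payload)` pairs in capture order; the function names and the injected `send` and `sleep` callables are illustrative.

```python
import time


def replay_schedule(capture_times_s, speedup=1.0):
    """Return the delay in seconds to wait before sending each message."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    delays = []
    previous = None
    for t in capture_times_s:
        # Out-of-order capture times are clamped to a zero gap.
        gap = 0.0 if previous is None else max(t - previous, 0.0)
        delays.append(gap / speedup)
        previous = t
    return delays


def replay(messages, send, speedup=1.0, sleep=time.sleep):
    """Send captured (time_s, payload) pairs with their original spacing."""
    messages = list(messages)
    delays = replay_schedule([t for t, _ in messages], speedup)
    for delay, (_, payload) in zip(delays, messages):
        if delay > 0:
            sleep(delay)
        send(payload)


capture = [(0.0, b"msg-1"), (1.0, b"msg-2"), (3.0, b"msg-3")]
print(replay_schedule([t for t, _ in capture], speedup=10.0))
# → [0.0, 0.1, 0.2]
```

In a test harness, `send` would write to the adapter's transport (for example a UDP socket), and the same capture file can be run at 1× and 10× without modification.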

Adversarial input testing. Inject malformed messages, spoofed AIS, radar plots that violate physics (Mach 5 ground tracks), CoT XML with schema violations. The adapter must reject these loudly, not crash, not silently propagate. The discipline carries through to the fusion engine itself, treated in Military Data Fusion Explained.
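A physics check of the "Mach 5 ground track" kind can be sketched as an implied-speed test between two consecutive plots. The example uses the haversine great-circle distance on a spherical Earth; the 70 m/s ceiling for a ground vehicle is an illustrative threshold, not a recommended value, and the function names are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points (degrees)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def implied_speed_mps(prev, curr):
    """Speed implied by two (lat_deg, lon_deg, epoch_s) plots."""
    dt = curr[2] - prev[2]
    if dt <= 0:
        raise ValueError("plots must be in increasing time order")
    return haversine_m(prev[0], prev[1], curr[0], curr[1]) / dt


def is_plausible(prev, curr, max_speed_mps):
    """True if the move between the plots stays under the speed ceiling."""
    return implied_speed_mps(prev, curr) <= max_speed_mps


GROUND_VEHICLE_MAX_MPS = 70.0  # illustrative threshold only

start = (54.00, 10.00, 0.0)
print(is_plausible(start, (54.01, 10.00, 60.0), GROUND_VEHICLE_MAX_MPS))  # about 1.1 km in a minute
print(is_plausible(start, (54.01, 10.00, 1.0), GROUND_VEHICLE_MAX_MPS))   # about 1.1 km in a second
```

An adversarial test suite would feed the second kind of pair through the adapter and assert that it is rejected and logged rather than emitted onto the bus.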

Schema round-trip tests. Every adapter must be able to round-trip its native input through the canonical schema and back, preserving operationally significant fields. A lossy adapter is a design failure that surfaces under conformance testing months later.

Catalogue audit against real production data. Once the pipeline runs in pilot deployment, audit the source catalogue against real ingest data. Sources that produce attributes the catalogue did not anticipate, latencies that exceed the catalogue's expectations, or failure modes the catalogue did not document – these are findings that update the catalogue, the adapter, or both.

What's next

Part 1 has covered the foundation: cataloguing the sources, designing a canonical track schema with additive evolution, building adapters with strict isolation, wiring the message bus, and applying the testing disciplines that validate the layer. The pipeline now ingests source data and produces canonical track observations on the bus – but those observations are not yet correlated into tracks.

Part 2: Track Correlation and Lifecycle Management takes the canonical observation stream and builds the heart of the fusion engine: rule-based gating, data association (JPDA, MHT), the lifecycle state machine, and the track store as an event-sourced read model.
