Every serious military intelligence platform eventually confronts the same structural problem: five or more intelligence disciplines each produce data in their own format, at their own velocity, with their own semantics — and analysts need a unified picture that reasons across all of them simultaneously. The complete guide to defense data fusion covers the processing pipeline in broad terms. This article goes deeper on the schema layer — the canonical data model that sits beneath the fusion engine and gives it something coherent to work with.
Getting the data model right is not a detail. A poorly designed schema forces INT-specific logic into the application layer, makes cross-source queries fragile, and turns schema migrations into multi-week platform freezes. A well-designed model absorbs new INT types, supports bi-temporal queries, and keeps provenance intact through every stage of fusion. This article covers the design decisions that determine which category your platform falls into.
Why each INT needs a different schema adaptation
The five main intelligence disciplines differ not just in what they collect but in how that data is structured, at what velocity it arrives, and what metadata is inherently available. These differences are not superficial. They determine what adapter logic is needed before any unified model can ingest a source, and they constrain what cross-INT queries are feasible.
HUMINT (human intelligence) is primarily textual. A HUMINT report is a narrative document describing what a source observed, heard, or was told. Timestamps are often imprecise — the report may describe an event that occurred over a range of days, with uncertainty in both the time and the location. The most important metadata is source assessment: how reliable is this particular source, and how credible is this specific piece of information? HUMINT data velocity is low — tens to hundreds of reports per day at a busy collection point, not thousands per second.
SIGINT (signals intelligence) — covering both COMINT (communications) and ELINT (electronic intelligence) — is high-velocity, highly structured, and time-stamped to millisecond precision. A SIGINT intercept or emitter detection carries frequency parameters, bearing angles, time-difference-of-arrival fixes, and modulation characteristics. The semantic content (what was said) is often classified separately from the signal parameters. SIGINT data velocity can reach millions of records per hour for a modern collection system covering a contested electromagnetic environment.
IMINT (imagery intelligence) produces structured observation records derived from imagery analysis: bounding boxes with entity class labels and confidence scores, geolocation coordinates, ground sample distance, and collection timestamp. A single satellite pass or drone flight may generate thousands of object detection records. The challenge is that IMINT detections are spatial snapshots — they tell you where something was at a specific moment, not where it is going.
OSINT (open-source intelligence) is structurally the most heterogeneous. It includes social media posts, news articles, commercial satellite imagery analysis, flight tracking data, and maritime AIS feeds. Each source type has its own schema. OSINT is also the least controlled — source quality ranges from authoritative government publications to anonymous unverified social media claims.
MASINT (measurement and signature intelligence) covers physical phenomenon measurement: seismic, acoustic, nuclear radiation, chemical/biological signatures, and radar cross-section profiles. MASINT observations are often indirect — they detect a phenomenon (explosion, vehicle movement, RF emission) rather than directly identifying an entity. The chain from MASINT observation to entity identification requires explicit inference steps that must be modeled in the schema.
The implication for a unified model is that the schema must accommodate this diversity without collapsing it. The answer is a typed core envelope with discipline-specific extension payloads — a design pattern covered in detail in the building defense fusion pipeline part 1 series.
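As a minimal sketch of the typed core envelope with discipline-specific extension payloads, the following Python uses dataclasses. The class and field names (Observation, HumintPayload, SigintPayload, describe) are illustrative assumptions, not names from any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Union
from uuid import UUID, uuid4


@dataclass(frozen=True)
class HumintPayload:
    narrative: str            # free-text report body
    event_window_days: int    # width of the uncertain event time window


@dataclass(frozen=True)
class SigintPayload:
    frequency_mhz: float
    bearing_deg: float


# Hypothetical union of discipline-specific payloads; one class per INT.
Payload = Union[HumintPayload, SigintPayload]


@dataclass(frozen=True)
class Observation:
    """Core envelope: the same fields for every discipline."""
    obs_id: UUID
    int_type: str             # "HUMINT" | "SIGINT" | ...
    collection_time: datetime
    payload: Payload          # typed extension, never flattened into the core


def describe(obs: Observation) -> str:
    """INT-specific logic stays behind a type check on the payload."""
    if isinstance(obs.payload, SigintPayload):
        p = obs.payload
        return f"emitter at {p.frequency_mhz} MHz, bearing {p.bearing_deg}"
    return f"report: {obs.payload.narrative[:40]}"


obs = Observation(uuid4(), "SIGINT", datetime.now(timezone.utc),
                  SigintPayload(frequency_mhz=433.9, bearing_deg=271.0))
print(describe(obs))
```

Adding a new discipline in this sketch means adding one payload class and extending the union; the envelope fields and any code that reads only the envelope stay unchanged.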
Canonical entity types for a unified model
The starting point for schema design is defining the entity type taxonomy: the list of real-world things the platform must track and reason about. For most military intelligence platforms, six entity types cover the vast majority of intelligence objects:
- Person — individual human subjects: combatants, commanders, facilitators, civilians of interest
- Organization — groups, units, networks, command structures
- Location — fixed geographic sites: facilities, infrastructure, landmarks, named areas of interest
- Equipment — vehicles, weapons systems, sensors, communications devices
- Event — discrete occurrences: engagements, explosions, meetings, transmissions
- Document — captured materials, publications, intelligence reports as objects of analysis
Each entity type has a core field set that is INT-agnostic — fields that must be populated regardless of which intelligence discipline contributed the information:
EntityCore {
    entity_id:        UUID           // globally unique, immutable
    entity_type:      Enum           // Person | Organization | Location |
                                     // Equipment | Event | Document
    classification:   ClassMarkings  // see provenance section
    valid_time:       TimeInterval   // [start, end) when fact was true
    transaction_time: TimeInterval   // [start, end) when row was current
    confidence:       Float[0..1]    // fused confidence across sources
    source_obs_ids:   UUID[]         // contributing observation record IDs
    schema_version:   SemVer         // for evolution compatibility
    created_at:       Timestamp
    updated_at:       Timestamp
}
Beyond the core, each entity type has typed attribute extensions. A Person entity carries biometric identifiers, aliases, nationality, and associated organization links. An Equipment entity carries platform type, serial identifiers if known, and associated unit link. An Event entity carries event class, involved entity references, and spatial footprint. These extensions are stored as typed payloads attached to the core envelope — not as columns on the core table. This separation is what enables the schema to absorb new attributes for one entity type without affecting others.
The same separation principle applies to INT contributions. When a SIGINT intercept links to a Person entity (because an IMSI was resolved to a known individual), that link is stored as an observation record with a SIGINT-typed payload pointing to the Person entity UUID. The Person entity itself does not carry SIGINT-specific columns — that coupling would make the schema fragile to any SIGINT collection change.
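The separation described above can be sketched as follows, assuming an in-memory store. EntityCore here carries only a subset of the core fields, and ObservationLink, person_ext, and the IMSI value are hypothetical placeholders for illustration.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4

ENTITY_TYPES = {"Person", "Organization", "Location",
                "Equipment", "Event", "Document"}


@dataclass
class EntityCore:
    """INT-agnostic core envelope (subset of fields for brevity)."""
    entity_id: UUID
    entity_type: str
    confidence: float
    source_obs_ids: list = field(default_factory=list)

    def __post_init__(self):
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


@dataclass(frozen=True)
class ObservationLink:
    """An observation pointing at an entity; INT-specific fields live here."""
    obs_id: UUID
    entity_id: UUID
    int_type: str
    payload: dict


# Typed extensions are stored beside the core, keyed by entity_id.
person_ext = {}

person = EntityCore(uuid4(), "Person", confidence=0.6)
person_ext[person.entity_id] = {"aliases": [], "nationality": None}

# A SIGINT intercept resolved to this person (IMSI is a made-up test value).
link = ObservationLink(uuid4(), person.entity_id, "SIGINT",
                       {"imsi": "001010000000001"})
person.source_obs_ids.append(link.obs_id)
```

The Person entity gains a reference to the observation but no SIGINT-specific attribute, so a change to the SIGINT payload shape does not touch the entity schema.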
Provenance and source tracking
Provenance is the most critical non-functional requirement of any intelligence data model. Every piece of information in the fused picture must be traceable back to its source observation, the collection system that produced it, and the human assessments applied to its reliability. Without this chain, analysts cannot evaluate the quality of the picture they are working from, and the platform cannot perform rollback when a source is found to be unreliable.
A provenance block attached to every observation record should carry at minimum:
ProvenanceBlock {
    int_type:           Enum           // HUMINT | SIGINT | IMINT | OSINT | MASINT
    source_id:          UUID           // internal source registry reference
    source_reliability: Char           // A–F (NATO admiralty scale)
    info_credibility:   Integer        // 1–6 (NATO admiralty scale)
    collection_time:    Timestamp
    report_time:        Timestamp      // when report entered system
    originator:         String         // unit or system that produced report
    classification:     ClassMarkings
    handling_caveats:   String[]       // NOFORN, ORCON, REL TO, etc.
    dissemination_ctrl: String[]
}
The NATO admiralty scale encodes two independent human assessments on each piece of intelligence. Source reliability (A through F) rates the historical track record and trustworthiness of the source: an A-rated source has been consistently accurate and reliable, an E-rated source has a poor track record, and F means the reliability cannot be judged, typically because the source has no history. Information credibility (1 through 6) rates the plausibility of the specific information independent of source history: a 1-rated item is confirmed by other independent sources, a 5-rated item is improbable given what else is known, and 6 means the truth of the item cannot be judged.
These two grades are human assessments made by trained intelligence officers. They are distinct from, and must not be conflated with, the machine-computed fusion confidence score on the entity. The fusion confidence reflects statistical agreement across corroborating sources; the admiralty grades reflect human judgment about source quality. Both must be preserved and surfaced to analysts separately.
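A small sketch of how the two admiralty grades might be validated and kept apart from the fusion confidence. The label strings follow the commonly published wording of the scale; AdmiraltyGrade and AssessedEntityView are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

SOURCE_RELIABILITY = {
    "A": "completely reliable", "B": "usually reliable",
    "C": "fairly reliable", "D": "not usually reliable",
    "E": "unreliable", "F": "reliability cannot be judged",
}
INFO_CREDIBILITY = {
    1: "confirmed by other sources", 2: "probably true",
    3: "possibly true", 4: "doubtful",
    5: "improbable", 6: "truth cannot be judged",
}


@dataclass(frozen=True)
class AdmiraltyGrade:
    source_reliability: str   # human assessment of the source
    info_credibility: int     # human assessment of this item

    def __post_init__(self):
        if self.source_reliability not in SOURCE_RELIABILITY:
            raise ValueError(f"bad reliability: {self.source_reliability!r}")
        if self.info_credibility not in INFO_CREDIBILITY:
            raise ValueError(f"bad credibility: {self.info_credibility!r}")

    @classmethod
    def parse(cls, code: str) -> "AdmiraltyGrade":
        """Parse a combined grade such as 'B2'."""
        return cls(code[0].upper(), int(code[1:]))

    def __str__(self) -> str:
        return f"{self.source_reliability}{self.info_credibility}"


@dataclass(frozen=True)
class AssessedEntityView:
    """What an analyst sees: machine score and human grades, side by side."""
    fusion_confidence: float   # machine-computed, in [0, 1]
    grades: tuple              # one AdmiraltyGrade per contributing report


view = AssessedEntityView(0.82, (AdmiraltyGrade.parse("B2"),
                                 AdmiraltyGrade.parse("F6")))
print([str(g) for g in view.grades], view.fusion_confidence)
```

Keeping the grades as their own type, rather than folding them into a single float, means no code path can silently average a human judgment into the statistical score.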
Classification markings require structured representation, not free text. A ClassMarkings type must encode: classification level (UNCLASSIFIED through TOP SECRET), compartments and codewords, and handling caveats as an enumerated list. The structure enables programmatic access control enforcement — the platform can evaluate at query time whether a given user's clearance satisfies the classification of each field, and can selectively redact or withhold fields that exceed the user's clearance rather than refusing to return the entire entity.
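Field-level redaction over structured markings could look like the sketch below. The four-level ladder, the Clearance type, the "ALPHA" compartment, and the redact helper are assumptions for illustration; handling caveats are carried on the marking but not evaluated here, since that requires attributes of the user (such as nationality) that this sketch does not model.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    # Assumed ladder; real platforms follow their national or NATO scheme.
    UNCLASSIFIED = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3


@dataclass(frozen=True)
class ClassMarkings:
    level: Level
    compartments: frozenset = frozenset()
    caveats: frozenset = frozenset()   # carried, not evaluated in this sketch


@dataclass(frozen=True)
class Clearance:
    level: Level
    compartments: frozenset = frozenset()


def can_read(user: Clearance, marking: ClassMarkings) -> bool:
    """Level must dominate and the user must hold every compartment."""
    return (user.level >= marking.level
            and marking.compartments <= user.compartments)


def redact(fields: dict, user: Clearance) -> dict:
    """fields maps name -> (value, ClassMarkings); withhold per field."""
    return {
        name: (value if can_read(user, marking) else "[REDACTED]")
        for name, (value, marking) in fields.items()
    }


record = {
    "name": ("J. Doe (placeholder)", ClassMarkings(Level.SECRET)),
    "location": ("grid 100,200",
                 ClassMarkings(Level.TOP_SECRET, frozenset({"ALPHA"}))),
}
analyst = Clearance(Level.TOP_SECRET)   # cleared to the level, no compartment
print(redact(record, analyst))
```

The analyst receives the entity with one field withheld rather than an outright refusal, which is the behavior the structured marking is meant to enable.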
Cross-INT entity resolution
Entity resolution — determining that records from different sources refer to the same real-world entity — is the core fusion problem, and it is hardest precisely at the cross-INT boundary. Within a single INT, identifier schemes are consistent: two SIGINT records that share an IMSI refer to the same device. Across INTs, no shared identifier exists by default. An IMINT detection of a vehicle, a SIGINT bearing fix on an emitter collocated with that vehicle, and a HUMINT report naming a person seen in that vehicle must be linked through probabilistic inference, not through a shared key.
The entity resolution pipeline for a unified model must handle three linking scenarios:
Hard links: shared identifiers that definitively link records to the same entity. Examples include a known IMSI, a license plate read on two IMINT passes, or a biometric match. Hard links should be propagated automatically with no confidence decay.
Soft links: probabilistic associations based on attribute similarity within uncertainty bounds. An example is two observations reporting a vehicle of the same class at locations that are close enough, within a time window short enough, to be consistent with movement between them. Soft links carry a match confidence score computed by the resolution engine.
Inferred links: associations derived from domain knowledge. If a SIGINT emitter bearing consistently co-moves with an IMINT vehicle track, they are likely the same platform. These links require explicit rule definitions and carry lower confidence than soft links based on direct attribute overlap.
The resolution pipeline produces match hypotheses. Hypotheses above a high-confidence threshold are automatically fused into the golden record. Hypotheses in the middle range are flagged for analyst review. Hypotheses below the low threshold are retained as separate entities. The threshold values are configurable and should be tunable per entity type — Person entity merges warrant higher confidence thresholds than Equipment merges, because false person fusions produce worse analytical consequences than false equipment fusions.
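The three-way triage with per-entity-type thresholds can be sketched in a few lines. The threshold values below are made-up placeholders, not recommendations; in practice they would come from configuration and be tuned against labeled merge decisions.

```python
# Hypothetical thresholds per entity type: (review_floor, auto_merge_floor).
THRESHOLDS = {
    "Person": (0.60, 0.95),      # stricter: false person merges cost more
    "Equipment": (0.50, 0.85),
}
DEFAULT_THRESHOLDS = (0.55, 0.90)


def triage(entity_type: str, match_confidence: float) -> str:
    """Route a match hypothesis to one of three outcomes."""
    low, high = THRESHOLDS.get(entity_type, DEFAULT_THRESHOLDS)
    if match_confidence >= high:
        return "auto_merge"          # fuse into the golden record
    if match_confidence >= low:
        return "analyst_review"      # flag for a human decision
    return "keep_separate"           # retain as distinct entities


print(triage("Person", 0.90))
print(triage("Equipment", 0.90))
```

The same match score of 0.90 leads to analyst review for a Person hypothesis and to an automatic merge for Equipment, which is the asymmetry the per-type thresholds exist to express.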
Golden record management requires a defined merge policy for attribute conflicts. When two sources disagree on an attribute — one HUMINT report says a person was at location A, an IMINT detection places them at location B one hour later — the merge policy must specify how to reconcile the attribute in the golden record. Common policies include: most recent valid time wins, highest source reliability wins, weighted combination for numeric attributes. The chosen policy must be stored on the golden record as metadata so analysts can understand why the golden record shows a particular attribute value.
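A sketch of pluggable merge policies, with the policy name recorded next to the winning value. Claim, POLICIES, and merge_attribute are hypothetical names; ranking reliability by letter order is a simplification, since F means "cannot be judged" rather than "worst".

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Claim:
    value: object
    valid_time: datetime
    source_reliability: str   # admiralty letter A-F


def most_recent_wins(claims):
    return max(claims, key=lambda c: c.valid_time)


def highest_reliability_wins(claims):
    # Simplification: letter order A < B < ... < F is used as the ranking.
    return min(claims, key=lambda c: c.source_reliability)


POLICIES = {
    "most_recent_valid_time": most_recent_wins,
    "highest_source_reliability": highest_reliability_wins,
}


def merge_attribute(claims, policy_name: str) -> dict:
    """Resolve a conflict and record which policy produced the value."""
    winner = POLICIES[policy_name](claims)
    return {"value": winner.value, "merge_policy": policy_name}


claims = [
    Claim("location A", datetime(2024, 1, 1, 9, tzinfo=timezone.utc), "B"),
    Claim("location B", datetime(2024, 1, 1, 10, tzinfo=timezone.utc), "C"),
]
print(merge_attribute(claims, "most_recent_valid_time"))
print(merge_attribute(claims, "highest_source_reliability"))
```

The two policies pick different winners from the same pair of claims, which is why the policy name has to travel with the golden record value.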
The JDL data fusion model frames entity resolution as a Level 1 (object refinement) and Level 2 (situation refinement) problem. The schema design described here is what makes those JDL levels implementable in practice.
Temporal modeling: valid time vs transaction time
Bi-temporal modeling is not optional for a military intelligence platform. It is the minimum temporal structure needed to support the two most critical query types: "what was true in the world at time T?" (valid time query) and "what did the system know about X as of time T?" (transaction time query). These are different questions that require different answers, and a schema that conflates them — using a single timestamp per record — cannot answer either correctly.
Valid time represents when a fact was true in the real world. For an IMINT detection of a vehicle at a grid coordinate, valid time is the imaging timestamp. For a HUMINT report describing a meeting, valid time is the analyst's best estimate of when the meeting occurred — which may be a range of days, not a precise timestamp. Valid time is a property of the world, not of the database.
Transaction time represents when a record was current in the database. For the same IMINT detection, transaction time starts when the detection record was inserted and ends if the record is ever superseded (e.g., if the geolocation is reprocessed and corrected). Transaction time is a property of the database, automatically managed by the system.
The combination enables two critical operations. First, as-of queries: "reconstruct the complete intelligence picture as the system held it at 14:00 on day D." This requires querying across transaction time — returning only records that were current in the database as of 14:00 on day D, regardless of when their valid time falls. This is essential for post-incident analysis and for audit of intelligence-based decisions. Second, historical fact queries: "what events occurred at location X between day D-7 and day D?" This queries across valid time — returning records whose valid time interval overlaps the query window, regardless of when they were inserted.
Implementation in PostgreSQL uses range-typed period columns. The valid time dimension is represented as a tstzrange column (a range of timezone-aware timestamps). The transaction time dimension uses either a system-versioned temporal table (available through extensions rather than as a core PostgreSQL feature) or an explicit transaction_start and transaction_end column pair, with transaction_end set to infinity for current rows and stamped on update to record when the row was superseded. All updates must be implemented as insert-new-row / stamp-old-row operations, never as in-place overwrites of attribute values.
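The insert-new-row / stamp-old-row discipline and the two query types can be illustrated with an in-memory sketch. This is Python rather than SQL, and Row, supersede, as_of, and valid_during are hypothetical names; a real implementation would express the same predicates over the database columns.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone

INFINITY = datetime.max.replace(tzinfo=timezone.utc)   # "still current"


@dataclass(frozen=True)
class Row:
    fact_id: str
    value: str
    valid_start: datetime    # when the fact was true in the world
    valid_end: datetime
    tx_start: datetime       # when the row became current in the store
    tx_end: datetime = INFINITY


def supersede(table, fact_id, new_value, valid_start, valid_end, now):
    """Stamp tx_end on the current row, then append the corrected row."""
    for i, row in enumerate(table):
        if row.fact_id == fact_id and row.tx_end == INFINITY:
            table[i] = replace(row, tx_end=now)   # only tx_end changes
    table.append(Row(fact_id, new_value, valid_start, valid_end, tx_start=now))


def as_of(table, t):
    """Transaction-time query: rows the system held as current at time t."""
    return [r for r in table if r.tx_start <= t < r.tx_end]


def valid_during(table, start, end):
    """Valid-time query over current rows whose interval overlaps [start, end)."""
    return [r for r in table
            if r.tx_end == INFINITY and r.valid_start < end and start < r.valid_end]


def at(hour, minute=0):
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Detection imaged at 09:00, ingested at 12:00, geolocation corrected at 15:00.
table = [Row("det-1", "grid 100,200", at(9), at(9, 1), tx_start=at(12))]
supersede(table, "det-1", "grid 101,200", at(9), at(9, 1), now=at(15))

print([r.value for r in as_of(table, at(14))])
print([r.value for r in valid_during(table, at(8), at(10))])
```

The as-of query at 14:00 returns the uncorrected position, because that is what the system believed at the time, while the valid-time query returns the corrected one.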
Version control and lineage for fused objects
Intelligence entities are not static. A person entity may begin as a tentative identification from a single HUMINT report, gain spatial confirmation from an IMINT detection three days later, and receive a biometric confirmation from a separate collection event a week after that. Each of these updates changes the golden record — but the previous states must be recoverable, not overwritten.
The standard implementation is an append-only event log per entity. Every state change to a golden record generates an update event. Each event is immutable once written and carries:
- The entity UUID
- The event type (Created / Updated / Merged / Split / Retracted)
- The previous state snapshot (full copy of the golden record before the change)
- The new state snapshot
- The IDs of the observation records that triggered the update
- The fusion policy name and version applied
- The transaction timestamp
- The operator ID (human analyst or system process)
The current golden record is the result of applying all events in sequence from the beginning of the log. This is the event-sourcing pattern applied to intelligence data. It provides a complete audit trail for every entity state at every point in time, which is required for intelligence accountability in most military frameworks.
Rollback is a first-class operation: given an entity UUID and a target transaction timestamp, the platform re-materializes the golden record as it existed at that timestamp by replaying the event log up to and including the target time and ignoring every later event. Rollback is triggered when a source is assessed as deceptive or erroneous: all golden records that incorporated observations from that source must be re-evaluated with the contaminated observations excluded.
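Re-materializing a golden record at a target time can be sketched as a filter over the append-only log. EntityEvent carries only a subset of the event fields, and materialize is a hypothetical helper. Because each event in this model stores a full new-state snapshot, the replay reduces to taking the snapshot of the last event at or before the target time.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class EntityEvent:
    entity_id: str
    event_type: str      # Created | Updated | Merged | Split | Retracted
    new_state: dict      # full snapshot of the golden record after the change
    tx_time: datetime
    operator_id: str


def materialize(log, entity_id: str, as_of: datetime) -> Optional[dict]:
    """Replay events for one entity up to and including as_of."""
    state = None
    for event in sorted(log, key=lambda e: e.tx_time):
        if event.entity_id != entity_id or event.tx_time > as_of:
            continue
        state = dict(event.new_state)   # copy so callers cannot mutate the log
    return state


def day(d):
    return datetime(2024, 1, d, tzinfo=timezone.utc)


log = [
    EntityEvent("p-1", "Created", {"status": "tentative"}, day(1), "system"),
    EntityEvent("p-1", "Updated", {"status": "located"}, day(4), "system"),
    EntityEvent("p-1", "Updated", {"status": "biometric match"}, day(11), "analyst-7"),
]

print(materialize(log, "p-1", day(5)))
print(materialize(log, "p-1", day(12)))
```

Nothing in the log is modified by the rollback itself; the earlier state is simply read back out, which is what keeps the audit trail intact.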
A retraction event is the mechanism for handling this scenario at scale. When source S is invalidated, the system generates a retraction event for every observation attributed to S, then re-runs fusion for every entity that referenced any of those observations. Entities that were solely supported by the retracted source revert to a lower confidence state or are marked unconfirmed. Entities that had corroborating sources from other INTs absorb the retraction with a confidence penalty but remain in the picture.
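The retraction flow can be sketched as follows. The fusion function here is a noisy-OR placeholder chosen only to make the example concrete; it is an assumption, not the fusion method of any particular platform, and Obs and retract_source are hypothetical names.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Obs:
    obs_id: str
    source_id: str
    entity_id: str
    confidence: float


def fuse(observations) -> float:
    """Placeholder fusion (noisy-OR); returns 0.0 when nothing supports the entity."""
    return 1.0 - prod(1.0 - o.confidence for o in observations)


def retract_source(observations, source_id: str) -> dict:
    """Re-run fusion for every entity that referenced the invalidated source."""
    affected = {o.entity_id for o in observations if o.source_id == source_id}
    result = {}
    for entity_id in affected:
        remaining = [o for o in observations
                     if o.entity_id == entity_id and o.source_id != source_id]
        result[entity_id] = {
            "confidence": fuse(remaining),
            "status": "corroborated" if remaining else "unconfirmed",
        }
    return result


observations = [
    Obs("o-1", "S1", "e-1", 0.7),
    Obs("o-2", "S2", "e-1", 0.6),   # independent corroboration for e-1
    Obs("o-3", "S1", "e-2", 0.8),   # e-2 rests on S1 alone
    Obs("o-4", "S2", "e-3", 0.5),   # e-3 never referenced S1
]

print(retract_source(observations, "S1"))
```

Entity e-1 stays in the picture at reduced confidence, e-2 drops to unconfirmed, and e-3 is not touched because it never depended on the retracted source.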
The lineage model also enables split events — the reverse of entity resolution. If two entities were incorrectly merged (a false positive fusion), a split event un-merges them: the erroneous golden record is retracted, and two new entity records are created, each inheriting the source observations that properly belong to them. The split event preserves the full history of the merged state and the split decision, enabling later analysts to understand why the split occurred.
Schema evolution in production
A military intelligence platform is not a static product. New collection systems come online, new INT disciplines are added to scope, and existing schemas need attribute additions as new analytical requirements emerge. Schema evolution in a production platform that cannot tolerate downtime requires deliberate design choices from day one.
The core principle is backwards compatibility as a contract. The core entity envelope — the EntityCore fields — must be strictly versioned using a schema_version field. Any change to the core envelope that removes a field or changes a field's type is a breaking change and requires a major version bump with a defined migration path. Adding optional fields to the core is a minor version change. The version field allows consumers to declare which schema versions they support and enables the platform to serve different versions to different consumers during a migration period.
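Version negotiation under this contract might be sketched as below. It assumes consumers ignore unknown optional fields, so any two versions with the same major number are compatible; SemVer and pick_version are hypothetical helpers, and pre-release or build suffixes are not handled.

```python
from typing import List, NamedTuple, Optional, Set


class SemVer(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "SemVer":
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)


def is_breaking(old: SemVer, new: SemVer) -> bool:
    """Removing a field or changing a type requires a major bump."""
    return new.major != old.major


def pick_version(available: List[SemVer],
                 supported_majors: Set[int]) -> Optional[SemVer]:
    """Serve the newest schema version whose major the consumer supports."""
    candidates = [v for v in available if v.major in supported_majors]
    return max(candidates) if candidates else None


available = [SemVer.parse(v) for v in ("1.4.0", "2.0.0", "2.1.3")]

print(pick_version(available, {1}))      # legacy consumer during migration
print(pick_version(available, {1, 2}))   # consumer that has already upgraded
```

During a migration window the platform keeps both major versions in the available list, and each consumer is served the newest one it has declared support for.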
Extension payloads are the correct vehicle for adding new INT types or new attributes. When a new imagery analysis system comes online and produces additional attribute types (for example, structural damage assessment scores derived from SAR imagery), those attributes go into a new or updated IMINT extension payload version — not into the core entity schema. Existing consumers that do not need SAR-specific attributes are unaffected.
The provenance taxonomy must be expanded when a new INT type is added. The INT type enumeration gains a new value, and the source reliability and credibility grade definitions must be reviewed for applicability to the new source type. Some new source types may require new credibility criteria that do not map cleanly to the existing six-point admiralty scale — in those cases, the provenance block should carry the raw source-specific reliability metadata in addition to the translated admiralty grade, preserving fidelity.
Entity resolution rules are the most labor-intensive evolution path. When a new INT type joins the unified model, resolution engineers must specify how observations from the new source can be linked to existing entity types. This requires both data analysis (what attributes are available for matching?) and domain knowledge (what attribute proximity thresholds are operationally meaningful?). These rules must be peer-reviewed by experienced intelligence analysts, not just software engineers — incorrect resolution rules produce false fusions that silently corrupt the intelligence picture.
Schema migration in a bi-temporal model has an additional consideration: historical rows must be migrated without altering their transaction time history. A migration that re-writes existing rows and updates their transaction timestamps breaks the historical query semantics. Migrations must be additive: add new columns with defaults for historical rows, never update existing column values in historical records.
Testing schema evolution requires a multi-layer strategy: unit tests for each schema version's serialization and deserialization; integration tests for cross-version consumer compatibility; and regression tests using historical intelligence data samples to confirm that existing queries still return identical results after a migration. The historical data tests are the ones most commonly skipped and the ones that catch the most production-breaking regressions.
The data model described in this article represents a design target, not a starting point for a one-sprint implementation. Most platforms build toward this architecture incrementally — starting with a simpler schema for two or three INT types and adding the bi-temporal model, full provenance blocks, and event-sourced lineage as operational requirements solidify. What matters is that the core design decisions — typed extension payloads, INT-agnostic entity envelopes, separated valid and transaction time — are made early, because retrofitting them onto a monolithic schema is far more expensive than building them in from the start.