A battalion-level logistics officer requests a readiness report. The C2 system lists 47 vehicles assigned to the unit. The property accountability system records 51 vehicles under the same unit identifier. The maintenance system tracks 44 of those vehicles under different serial number formats, and three are flagged as deadlined under identifiers that do not appear in either of the other two systems. No one can answer the question "how many operational vehicles does this unit have right now" without a phone call — because the same physical entities are represented by different records, different identifiers, and different attribute values across systems that were never designed to agree with each other.
This is the defense MDM problem in its most common form. Master data management (MDM) is the discipline that creates a single authoritative representation of each core entity (person, organization, equipment, location) that all systems reference consistently. It is not a reporting tool and it is not a data warehouse. It is the layer that makes cross-system joins meaningful, multi-domain analytics trustworthy, and data-driven operational decisions reliable. This article covers the architecture of a defense MDM system from hub model selection through entity resolution, golden record construction, stewardship workflows, and the survivability requirements that make MDM viable in contested environments.
The defense MDM problem — equipment records in three systems with three different IDs, personnel records diverging between HR and C2, the cost of inconsistency
Defense organizations accumulate source-of-truth fragmentation organically. The HR system was procured to manage personnel administration. The C2 system was built to track unit structure and tactical assignments. The logistics system was designed to manage supply and property accountability. Each was designed by a different program office, deployed in a different decade, and uses a different data model. None was designed to interoperate with the others at the entity identity level.
The result is a set of identifier schemes that are incompatible by design. A person may be identified by their personnel number in HR, their military occupational specialty code and rank combination in C2, and their equipment custodian identifier in logistics: three identifiers for one human being, with no system-maintained mapping between them. Equipment is worse. An M1A2 tank may carry a bumper number stenciled on the hull (used by the crew and C2 operators), be tracked under its owning unit's Department of Defense activity address code in logistics, carry a unique item identifier (UII) mark in property accountability, and appear under a maintenance work order number in the maintenance management system. No two of these identifiers share a format, and the systems that use them were not built to translate between them.
The cost of this inconsistency is not merely inconvenience. When a readiness analyst joins C2 unit assignment data against logistics supply data to compute a unit's operational capability score, the join fails to match records that refer to the same physical equipment under different identifiers. The resulting capability score is wrong — and it is wrong in a direction the analyst cannot detect without independently verifying each system, which defeats the purpose of the analytical model. For an integrated view of how these integration failures manifest across the broader data architecture, see our treatment of defense data integration patterns.
Personnel record divergence between HR and C2 is an additional, distinct problem. HR maintains the administrative record: permanent rank, duty position, assigned unit, clearance level, training history. C2 maintains the operational picture: which person is physically present with which unit, what role they are filling in the current task organization, what systems they are credentialed to operate. During stable garrison conditions, these records agree reasonably well. During operations — when personnel are attached to different units, when task organization deviates from table of organization and equipment (TO&E), when temporary duty assignments create secondary affiliations — the two records diverge rapidly. An MDM system that manages the person entity must reconcile both the administrative and operational representations into a coherent golden record that is useful for both readiness reporting and operational planning.
MDM hub architecture — registry vs consolidation vs coexistence hub models, selection criteria for defense environments, hub placement for classified vs unclassified data
Three hub models dominate production MDM deployments, each with a distinct relationship to the source systems it serves and a different set of operational trade-offs that matter in defense environments.
The registry model stores only the cross-reference table — a mapping of each source system identifier to the MDM-assigned global entity ID — without replicating or storing any entity attributes in the hub itself. When a consumer needs entity data, it queries the hub for the global ID, then queries the appropriate source systems using the source-specific identifiers returned by the cross-reference. The registry model has the lowest data footprint and requires no synchronization of attribute data into the hub, making it attractive for environments where replicating classified data into a central location raises authorization issues. Its limitation is that it forces every consumer to resolve entities across multiple source systems at query time, which is operationally impractical for high-frequency operational queries.
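A minimal sketch of the cross-reference table at the heart of the registry model is shown below. The class name, the source system labels, and every identifier are hypothetical, and the sketch assumes at most one identifier per source system for each entity, which a real registry may not be able to assume.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Registry-model hub: stores identifier cross-references only, no attributes."""
    # (source_system, source_id) -> global entity ID
    _to_global: dict = field(default_factory=dict)
    # global entity ID -> {source_system: source_id}
    _to_sources: dict = field(default_factory=dict)

    def link(self, global_id: str, source_system: str, source_id: str) -> None:
        """Record that a source identifier refers to the given global entity."""
        self._to_global[(source_system, source_id)] = global_id
        self._to_sources.setdefault(global_id, {})[source_system] = source_id

    def resolve(self, source_system: str, source_id: str):
        """Return (global_id, all known source identifiers), or None if unknown."""
        global_id = self._to_global.get((source_system, source_id))
        if global_id is None:
            return None
        return global_id, dict(self._to_sources[global_id])


# Hypothetical identifiers for one vehicle as seen by three systems.
registry = Registry()
registry.link("EQ-000123", "c2", "B-12")
registry.link("EQ-000123", "property", "UII-D1234ABC987")
registry.link("EQ-000123", "maintenance", "WO-SN-00987")

resolved = registry.resolve("c2", "B-12")
```

A consumer that starts from the C2 bumper number receives the global ID plus the identifiers it needs to query the property and maintenance systems itself, which is exactly the query-time burden the registry model places on consumers.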
The consolidation model copies and normalizes entity attributes from all contributing source systems into the hub, runs entity resolution to link duplicates, and serves a unified entity view to consumers. Source systems are not modified — the hub is a read-optimized consumer of source data, not a writer back into it. This model is the most practical for defense environments because it does not require source system modification rights (C2, HR, and logistics systems are typically not modifiable by the MDM program), and it concentrates the entity resolution computation in the hub rather than distributing it across consumers. The military data lake architecture typically consumes the MDM consolidation hub's golden records as a curated reference layer rather than joining raw source tables directly.
The coexistence model adds write-back capability to the consolidation model: the hub constructs the golden record and then propagates authoritative attribute values back to the source systems, overwriting locally maintained values with the hub-determined authoritative value. This model produces the strongest consistency across systems but requires that every participating source system accept hub-initiated writes — a requirement that is frequently blocked by system authorization constraints, vendor change control processes, and operational risk aversion in live systems.
Hub placement for classified vs unclassified data requires a separate hub instance at each classification level. A single MDM hub that processes both classified and unclassified entity data would have to operate at the higher classification level, which would prevent unclassified consumers from accessing any entity data from the hub and defeat its purpose. The practical architecture deploys an unclassified hub for entities that exist entirely in unclassified source systems (general reference data, unclassified location data, commercial supplier records), and a classified hub at the appropriate classification level for entities that carry classified attributes. Cross-classification entity resolution, meaning the determination that an entity in the unclassified system corresponds to the same physical entity in a classified system, requires a specialized cross-domain solution guard that carries only the identifier cross-reference across the classification boundary, never the classified attributes themselves.
Entity types in defense MDM — person (military personnel), organization (unit/command), equipment (platform/end item), location (facility/position)
Defense MDM manages four primary entity domains, each with distinct source systems, identifier schemes, attribute structures, and data quality challenges.
The person entity domain covers military personnel, contractors, and coalition partners. Key source systems are the personnel management system (administrative record, permanent assignment, rank), the C2 system (current operational assignment, task organization position), and the clearance management system (clearance level, compartment authorizations, access expiration dates). The primary identifier challenge is that the same individual may appear under their permanent duty unit in HR while being operationally attached to a different unit in C2 — both records are correct in their respective systems, but they represent two different operational states of the same entity. Golden record construction for the person domain must capture both the administrative and operational states as distinct attribute groups rather than collapsing them into a single unit assignment field.
The organization entity domain covers units, commands, sub-units, and task forces. Organizations in defense environments have a hierarchical parent-child structure (brigade contains battalions, battalion contains companies) and a temporal dimension (task forces are created, merge, reorganize, and dissolve on operational timescales). The MDM system must maintain the current organizational hierarchy, its historical states at past timestamps, and the relationships between administrative TO&E structure and current operational task organization. Source systems are the unit hierarchy in the C2 system, the TO&E in the personnel management system, and the command element record in logistics systems.
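One way to hold both the administrative and the operational hierarchy with their history is to store each parent-child link with a validity interval and answer "as of" queries against it. The sketch below assumes that representation; the `ParentLink` type, the `kind` labels, and the unit names are illustrative and not drawn from any specific system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ParentLink:
    child: str
    parent: str
    kind: str                              # "administrative" (TO&E) or "operational"
    valid_from: datetime
    valid_to: Optional[datetime] = None    # None means the link is still in effect


def parent_as_of(links, child: str, kind: str, when: datetime) -> Optional[str]:
    """Return the parent of `child` in the given hierarchy at time `when`."""
    for link in links:
        if (link.child == child and link.kind == kind
                and link.valid_from <= when
                and (link.valid_to is None or when < link.valid_to)):
            return link.parent
    return None


# Hypothetical company with a permanent parent and a temporary task force attachment.
links = [
    ParentLink("A Co", "1st Bn", "administrative", datetime(2020, 1, 1)),
    ParentLink("A Co", "TF Alpha", "operational",
               datetime(2024, 3, 1), datetime(2024, 6, 1)),
]
```

Keeping the two hierarchies as separate link kinds lets a query ask for the TO&E parent and the task organization parent at the same timestamp without one overwriting the other.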
The equipment entity domain covers platforms, vehicles, weapons systems, and end-item inventory. This is the most identifier-fragmented domain in practice: a single physical item typically carries four or more distinct identifiers across the systems that manage it. The UII (unique item identifier) is the Department of Defense standard globally unique identifier for serialized equipment items, marked on the item as a machine-readable 2D data matrix, and it should serve as the master identifier for the equipment domain. Legacy systems predating the UII standard use proprietary identifiers that require a mapping table to link to the UII space. Matching across systems requires fuzzy matching on serial number fields that may include different separator characters, leading zeros, or manufacturer code prefixes depending on which system recorded the identifier.
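Much of the serial number variation described here can be removed by normalization before any fuzzy comparison runs. The following is a sketch under assumptions: the prefix list is hypothetical, and stripping prefixes or leading zeros is only safe where no legitimate serial depends on them, which has to be checked against the actual source data.

```python
import re

# Hypothetical manufacturer-code prefixes that some systems prepend to serials.
KNOWN_PREFIXES = ("MFR", "GD")


def normalize_serial(raw: str, prefixes=KNOWN_PREFIXES) -> str:
    """Reduce a serial number to a comparable canonical form."""
    # Drop separators (hyphens, spaces, slashes) and unify case.
    s = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    # Remove at most one known manufacturer prefix.
    for prefix in prefixes:
        if s.startswith(prefix) and len(s) > len(prefix):
            s = s[len(prefix):]
            break
    # Remove leading zeros, keeping a single zero for an all-zero serial.
    return s.lstrip("0") or "0"
```

With these assumptions, `"gd-00123-A"`, `"123 a"`, and `"00123A"` all normalize to `"123A"`, so an exact comparison succeeds where the raw strings differ.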
The location entity domain covers facilities, installations, named areas of interest, and tactical positions. Location matching is fundamentally a geospatial problem: the same physical location may be referenced by MGRS coordinates in one system, decimal-degree geographic coordinates in another, and a natural language place name in a third. Geospatial blocking groups candidate location records by proximity using a geohash grid, and matching determines whether two spatially proximate records refer to the same facility or to distinct nearby locations. Location entities also have a temporal dimension — facility classification, operational status, and controlling force change over time — requiring the golden record to carry temporal validity intervals for key location attributes. The real-time intelligence fusion layer that correlates ISR detections against known locations depends directly on the accuracy and completeness of the location entity master.
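Geohash blocking can be sketched in a few lines. The encoder below follows the standard public geohash scheme (longitude and latitude bits interleaved, five bits per base-32 character); the record tuples and their identifiers are hypothetical. Note that two points on opposite sides of a cell boundary land in different blocks, so a production blocker would likely also compare neighboring cells.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a decimal-degree coordinate as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # geohash interleaves bits starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def block_by_geohash(records, precision: int = 6) -> dict:
    """Group (record_id, lat, lon) tuples by geohash cell."""
    blocks = defaultdict(list)
    for rec_id, lat, lon in records:
        blocks[geohash_encode(lat, lon, precision)].append(rec_id)
    return dict(blocks)


# Hypothetical records: two nearby entries and one distant entry.
blocks = block_by_geohash([
    ("c2-1", 57.64911, 10.40744),
    ("fac-9", 57.6492, 10.4075),
    ("log-4", 48.8566, 2.3522),
])
```

The two nearby records share a cell and become a candidate pair for matching; the distant record is never compared against them.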
Entity resolution: matching across source systems — blocking strategies for large-scale defense entity sets, ML matching models, deterministic vs probabilistic matching
Entity resolution is the process that determines which records across source systems refer to the same real-world entity and links them under a shared global identifier. At the scale of a large defense dataset — millions of equipment records across logistics, maintenance, and property accountability systems; hundreds of thousands of personnel records across HR, C2, and clearance systems — naive pairwise comparison of all records against all other records is computationally infeasible. The matching pipeline must be structured as a two-stage process: blocking, which reduces the candidate pair space to a tractable size; followed by matching, which evaluates candidate pairs with sufficient precision to separate true matches from near-misses.
Blocking strategies for defense entity domains use domain-specific partitioning keys to group records that could plausibly be the same entity. For personnel records, phonetic blocking on the surname field using the Soundex or Double Metaphone algorithm groups records where different systems have transcribed the same name with variant spellings, extra spaces, or hyphenation differences — all common in personnel management systems that predate Unicode normalization. For equipment records, prefix blocking on the first six characters of the serial number (after normalizing whitespace and case) groups records from systems that represent the same serial number with different separator conventions. For location records, a geohash grid at precision level 6 (approximately 1.2 km cell width) groups spatially proximate records while excluding obviously distinct locations. The blocking design must be validated against a gold-standard dataset of known matches before deployment — the blocking step must retain at least 95% of true match pairs in the candidate set, or the resolution pipeline will produce systematic miss errors for the excluded record types.
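The validation step can be made concrete with a small sketch of prefix blocking plus a recall check against known matches (often called pair completeness). The record identifiers, serials, and gold pairs below are invented for illustration, and the blocking key normalizes only whitespace and case, as described above.

```python
from collections import defaultdict
from itertools import combinations


def block_key(serial: str, width: int = 6) -> str:
    """Prefix blocking key: first `width` characters after removing whitespace and upper-casing."""
    return "".join(serial.split()).upper()[:width]


def candidate_pairs(records) -> set:
    """records: iterable of (record_id, serial). Returns sorted ID pairs that share a block."""
    blocks = defaultdict(list)
    for rec_id, serial in records:
        blocks[block_key(serial)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def pair_completeness(candidates: set, gold_matches) -> float:
    """Fraction of known true-match pairs that survive blocking."""
    gold = {tuple(sorted(pair)) for pair in gold_matches}
    if not gold:
        return 1.0
    return len(gold & candidates) / len(gold)


# Hypothetical records; the third uses a hyphen that whitespace normalization does not remove.
records = [
    ("log-1", "ab 12345x"),
    ("mnt-7", "AB12345X"),
    ("prop-3", "AB-12345X"),
    ("log-2", "ZZ99001"),
]
gold = [("log-1", "mnt-7"), ("log-1", "prop-3")]
recall = pair_completeness(candidate_pairs(records), gold)
```

In this toy data the hyphenated serial falls into a different block, so only one of the two true pairs is retained and the recall is 0.5, well under the 95% target. A result like that indicates the normalization in the blocking key needs to be widened before deployment.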
Within each candidate block, the matching model evaluates each pair:
- Deterministic matching applies a fixed rule set producing a binary match/non-match decision. A deterministic rule for equipment records: two records match if and only if their UIIs are identical (after stripping non-alphanumeric characters), or if their serial numbers are identical and their national stock numbers agree to the first nine digits. Deterministic rules require no training data and are fully auditable. They produce no false positives only when the rule is correctly specified and the identifiers it compares were entered correctly, so they are appropriate for attributes that are intended to be globally unique and maintained under data entry controls.
- Probabilistic matching computes a composite match score from weighted field-level similarity metrics. A personnel record comparison might apply Jaro-Winkler similarity on the given name, phonetic matching on the surname, exact match on date of birth, and fuzzy matching on the rank abbreviation (to handle variant formatting), combined with a logistic regression or gradient boosted tree classifier trained on labeled match/non-match pairs from the actual source systems. The trained model learns the relative importance of each field in the specific dirty-data environment — in some source systems, date of birth is highly reliable; in others it has a 3% error rate that makes it a weak discriminating feature.
- ML matching models extend probabilistic matching with learned representations of entity attribute text. A Siamese neural network trained on personnel name pairs learns a vector representation of names such that phonetically or orthographically similar names have similar vectors — capturing similarity patterns that hand-crafted string distance metrics miss. ML models require larger labeled datasets for training and are harder to audit than deterministic rules, but they outperform classical probabilistic models on high-noise entity sets where the data entry error patterns are complex and non-uniform across source systems.
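The deterministic equipment rule from the first bullet can be written down directly. This is a sketch, not a reference implementation: the record field names (`uii`, `serial`, `nsn`) are assumptions, the comparison is made case-insensitive as an added assumption, and the sample values are made up.

```python
import re


def _alnum(value) -> str:
    """Keep letters and digits only, upper-cased; missing values become ''."""
    return re.sub(r"[^A-Za-z0-9]", "", value or "").upper()


def _digits(value) -> str:
    """Keep digits only; missing values become ''."""
    return re.sub(r"\D", "", value or "")


def deterministic_match(a: dict, b: dict) -> bool:
    """Match on identical UII, or identical serial plus NSN agreement on the first nine digits."""
    uii_a, uii_b = _alnum(a.get("uii")), _alnum(b.get("uii"))
    if uii_a and uii_a == uii_b:
        return True
    sn_a, sn_b = _alnum(a.get("serial")), _alnum(b.get("serial"))
    nsn_a, nsn_b = _digits(a.get("nsn")), _digits(b.get("nsn"))
    return (bool(sn_a) and sn_a == sn_b
            and len(nsn_a) >= 9 and nsn_a[:9] == nsn_b[:9])


# Illustrative records only.
rec_property = {"uii": "D1234-ABC 987"}
rec_maint = {"uii": "d1234abc987"}
```

Empty or missing identifiers never match each other in this sketch, which avoids linking every record that lacks a UII into one entity.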
The output of the matching pipeline for each candidate pair is a match decision (match / non-match / review) and a confidence score. Pairs above the match threshold are automatically linked under a shared global entity ID. Pairs in the review band — where the confidence score falls between the non-match threshold and the match threshold — are routed to a human steward for adjudication rather than being resolved automatically.
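The three-way routing can be sketched as follows. Several things here are stand-ins and should be read as assumptions: `difflib.SequenceMatcher` from the Python standard library substitutes for Jaro-Winkler, fixed weights substitute for a trained classifier, and the two threshold values are placeholders that a real deployment would calibrate on labeled pairs.

```python
from difflib import SequenceMatcher

# Placeholder thresholds; real values come from calibration on labeled data.
MATCH_THRESHOLD = 0.90
NON_MATCH_THRESHOLD = 0.60

# Placeholder field weights standing in for a trained model.
WEIGHTS = {"given_name": 0.3, "surname": 0.3, "dob": 0.4}


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_pair(a: dict, b: dict) -> float:
    """Weighted sum of field similarities; date of birth is compared exactly."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        if name == "dob":
            sim = 1.0 if a[name] == b[name] else 0.0
        else:
            sim = field_similarity(a[name], b[name])
        score += weight * sim
    return score


def decide(score: float) -> str:
    """Route a scored pair to automatic linking, rejection, or steward review."""
    if score >= MATCH_THRESHOLD:
        return "match"
    if score < NON_MATCH_THRESHOLD:
        return "non-match"
    return "review"
```

The width of the band between the two thresholds is the lever that trades steward workload against the risk of automatic false links.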
Golden record construction and maintenance — survivorship rules for conflicting source attributes, golden record confidence scoring, automated vs steward-assisted resolution
Once entity resolution has linked records from multiple source systems to the same global entity ID, the MDM system constructs the golden record: the authoritative, unified representation of that entity whose attributes are drawn from the best available source for each field. The golden record is not a simple merge of all source attributes — when sources disagree, a survivorship rule must determine which value the golden record carries.
Survivorship rules are defined per attribute per entity domain and encode the authority hierarchy for that attribute:
Each attribute in the golden record carries three metadata fields in addition to its value: a source reference (which source system provided this value and the identifier of the specific source record), an update timestamp (when the source system last confirmed this value), and a confidence score (a normalized value reflecting the reliability of the source and the quality of the match that linked this record to the golden entity). A confidence score of 1.0 indicates a deterministic match on a globally unique identifier from a highly reliable source; lower scores reflect probabilistic match results, sources with known data quality issues, or attributes where survivorship was contested between sources with conflicting values.
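The per-attribute metadata maps naturally onto a small record type. The sketch below also shows one possible survivorship rule, a per-attribute source ranking with the most recent update as tie-breaker. That rule and the `AUTHORITY` table are hypothetical illustrations, not the rule set of any particular deployment.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AttributeValue:
    value: object
    source_system: str       # which source system provided the value
    source_record_id: str    # identifier of the specific source record
    updated_at: datetime     # when the source last confirmed the value
    confidence: float        # normalized to [0, 1]


# Hypothetical authority ranking per attribute, most authoritative first.
AUTHORITY = {"condition_code": ["maintenance", "property", "c2"]}


def survive(attribute: str, candidates: list) -> AttributeValue:
    """Pick the winning value: highest-ranked source, then most recent update."""
    ranking = AUTHORITY.get(attribute, [])

    def sort_key(candidate):
        if candidate.source_system in ranking:
            rank = ranking.index(candidate.source_system)
        else:
            rank = len(ranking)  # unranked sources sort last
        return (rank, -candidate.updated_at.timestamp())

    return min(candidates, key=sort_key)


older = AttributeValue("F", "maintenance", "WO-SN-00987",
                       datetime(2024, 5, 1, tzinfo=timezone.utc), 0.95)
newer = AttributeValue("A", "c2", "B-12",
                       datetime(2024, 5, 3, tzinfo=timezone.utc), 0.70)
winner = survive("condition_code", [newer, older])
```

Here the maintenance system's older value wins over the newer C2 value because the assumed ranking treats maintenance as authoritative for condition codes; recency only decides between sources of equal rank.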
The confidence score is not a decoration — it is operationally significant. A readiness analyst building a capability assessment can filter the golden record query to exclude attributes below a confidence threshold, or the analytics layer can weight each equipment record's contribution to the unit readiness score by the confidence of its maintenance status attribute. An analyst who receives a readiness metric without visibility into the underlying confidence scores cannot distinguish between a high-confidence assessment and a figure assembled from low-quality matches and stale source data.
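To illustrate both uses, the sketch below computes a readiness fraction in which each vehicle contributes in proportion to the confidence of its status attribute, with an optional confidence floor. The function name and the four sample records are invented for the example.

```python
def weighted_readiness(records, min_confidence: float = 0.0):
    """records: iterable of (is_operational, confidence).

    Returns the confidence-weighted operational fraction, or None if no
    record meets the confidence floor.
    """
    usable = [(op, conf) for op, conf in records if conf >= min_confidence]
    total = sum(conf for _, conf in usable)
    if total == 0:
        return None
    return sum(conf for op, conf in usable if op) / total


# Four hypothetical vehicles: (operational?, confidence of the status attribute).
fleet = [(True, 1.0), (True, 0.9), (False, 0.5), (True, 0.4)]

unweighted = sum(1 for op, _ in fleet if op) / len(fleet)   # 0.75
weighted = weighted_readiness(fleet)                        # about 0.821
high_conf_only = weighted_readiness(fleet, 0.8)             # 1.0
```

The same four records yield 0.75, roughly 0.82, or 1.0 depending on how confidence is treated, which is why a readiness figure reported without its confidence handling is hard to interpret.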
Golden record maintenance is a continuous process, not a one-time batch operation. When a source system updates an entity record, the MDM ingest pipeline receives the update, re-evaluates survivorship for all affected attributes, and updates the golden record accordingly. If the update causes the golden record to change for an attribute that downstream systems have consumed, a change notification is published to the MDM event bus so consuming systems can re-query the updated golden record. The change notification carries the entity global ID, the list of changed attributes, and the before/after values — enough information for a consuming system to determine whether the change affects any of its active operational data without requiring a full re-fetch of the golden record.
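A change notification with the contents described above can be derived by diffing the golden record before and after the survivorship re-evaluation. The payload keys and sample values in this sketch are assumptions; the event bus and its publish call are deliberately left out.

```python
def build_change_notification(global_id: str, before: dict, after: dict):
    """Return a notification payload, or None if no attribute value changed."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = {"before": old, "after": new}
    if not changes:
        return None
    return {
        "entity_global_id": global_id,
        "changed_attributes": sorted(changes),
        "changes": changes,
    }


# Hypothetical golden record values before and after a maintenance update.
note = build_change_notification(
    "EQ-000123",
    {"condition_code": "A", "unit": "1-64 AR"},
    {"condition_code": "F", "unit": "1-64 AR"},
)
```

A consumer that only cares about unit assignment can inspect `changed_attributes` and skip this event without re-fetching the record.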
Data stewardship workflows — steward assignment by domain (equipment steward, person steward), dispute resolution workflow, audit trail for golden record changes
Automated entity resolution and survivorship rules handle the majority of matching and attribute conflict cases in a production MDM system — typically 85 to 95 percent of records can be resolved without human intervention when the matching pipeline is well calibrated to the specific source system data. The remaining 5 to 15 percent of cases — low-confidence matches, contested attribute values where multiple sources claim equal authority, and entity splits or merges that require a judgment call — must be routed to a human data steward for adjudication.
Stewardship is organized by entity domain, with each domain assigned to a steward who holds both the domain expertise and the system access authority needed to resolve disputes:
- Equipment steward — typically embedded in the property accountability or logistics function, holds authority to determine the correct equipment record when systems disagree on serial numbers, unit assignment, or condition code. The equipment steward has direct access to physical records and can verify the ground truth by querying the originating paper trail or coordinating with equipment custodians.
- Personnel steward — typically embedded in the HR or personnel management function, holds authority to resolve conflicting name spellings, duplicate personnel records (common when a person re-enlists and receives a new system-generated identifier), and assignment discrepancies between administrative and operational records.
- Organization steward — typically the J1 or J3 staff element, holds authority to resolve unit hierarchy ambiguities during task organization changes, to confirm unit activation and inactivation dates, and to adjudicate parent-unit assignment when sub-units are temporarily attached to multiple headquarters.
- Location steward — typically the geospatial intelligence or facilities management function, holds authority to confirm whether two spatially proximate records represent the same facility or distinct co-located facilities, and to establish the canonical coordinate and name for a location that appears under different designations in different systems.
The dispute resolution workflow presents each steward case as a structured package. The case package contains: the candidate records from each source system that the MDM system believes may refer to the same entity, field-level similarity scores for each compared attribute, the MDM system's automated recommendation (match, non-match, or the recommended winning value for a contested attribute), and a clear display of where the sources agree and disagree. The steward selects from the available resolution options — confirm the automated recommendation, override it with an alternative resolution, or flag the case as requiring additional information before it can be resolved. A free-text rationale field allows the steward to document the reasoning for non-obvious decisions.
The audit trail for golden record changes records every state transition in the golden record's history, regardless of whether the change was triggered by an automated survivorship update or a manual steward decision. Each audit record contains:
- The entity global ID and the entity domain
- The attribute(s) changed and the before/after values for each
- The change trigger: automated survivorship rule (with rule ID and version), incoming source update (with source system ID and source record ID), or steward decision (with steward case ID)
- The authenticated identity of the principal responsible for the change — the system service account for automated changes, the steward's authenticated identity for manual decisions
- The change timestamp (logical clock value for distributed deployments, wall clock for single-hub deployments)
Audit records are written to append-only storage and cannot be modified after write. This immutability allows the MDM system to reconstruct the complete state of any golden record at any past point in time by replaying the audit trail from the initial record creation forward — a capability that is operationally necessary when analysts need to assess what the MDM system believed about an entity at a specific historical moment, and a compliance requirement for classified entity management systems.
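Point-in-time reconstruction by replay can be sketched with an in-memory log. The record fields mirror the list above in simplified form; the class names are hypothetical, integer timestamps stand in for a logical or wall clock, and the sketch assumes records are appended in timestamp order. A Python list is of course not tamper-proof storage, so the append-only property is only modeled by the interface here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRecord:
    entity_id: str
    timestamp: int     # logical clock value or epoch seconds
    changes: dict      # attribute name -> (before, after)
    trigger: str       # e.g. rule ID, source update, or steward case ID
    principal: str     # authenticated identity responsible for the change


class AuditLog:
    """Append-only log: records can be added and read, never updated."""

    def __init__(self):
        self._records = []

    def append(self, record: AuditRecord) -> None:
        self._records.append(record)

    def state_as_of(self, entity_id: str, timestamp: int) -> dict:
        """Replay changes up to and including `timestamp` to rebuild the record."""
        state = {}
        for record in self._records:
            if record.entity_id == entity_id and record.timestamp <= timestamp:
                for attribute, (_, after) in record.changes.items():
                    state[attribute] = after
        return state


log = AuditLog()
log.append(AuditRecord("EQ-000123", 10, {"condition_code": (None, "A")},
                       "source-update", "svc-mdm-ingest"))
log.append(AuditRecord("EQ-000123", 20, {"condition_code": ("A", "F")},
                       "steward-case-42", "steward-jdoe"))
```

Querying the log at timestamp 15 returns the record as the hub believed it before the steward decision; querying at 20 or later returns the corrected value.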
MDM survivability in disconnected environments — local MDM cache for forward-deployed systems, conflict resolution on reconnect, eventual consistency model for disconnected nodes
A centralized MDM hub that requires network connectivity to the rear echelon for every entity query is operationally unusable in a contested environment where communications are degraded or severed. Forward-deployed C2 systems, logistics systems, and operational planning tools all depend on entity resolution to function correctly — if the MDM hub is unreachable, they must fall back to raw source identifiers that may not match across systems, returning the operational picture to the pre-MDM state of fragmented entity identity. MDM survivability design prevents this regression by deploying a local cache forward and defining a rigorous synchronization protocol for the reconnect phase.
The forward MDM cache is a pre-positioned read replica of the golden records and cross-reference table for the entity subset relevant to the forward element's operational scope. "Relevant scope" is defined by two dimensions: geographic area of operations (all location entities within the AO, all equipment entities with a last-known location within the AO) and task organization (all person, organization, and equipment entities associated with units in the forward element's task organization). The cache is populated before the forward element deploys, using a snapshot export from the central hub. It is deployed on hardware that operates independently of the wide-area network — a ruggedized server co-located with the forward C2 node, or embedded in the tactical command post computing environment.
During disconnected operation, tactical systems query the local cache exactly as they would query the central hub — the same API, the same response format. Entity resolution requests are served from the local cross-reference table. Golden record queries return the cached attribute values with a freshness timestamp indicating when each attribute was last confirmed by the central hub. Tactical systems are responsible for presenting this freshness timestamp to users when the data may be stale, rather than presenting cached data as if it were current.
All entity creates, updates, merges, and splits that occur during the disconnected period are written to a local change log — an ordered, bounded queue of entity change events. Each event carries a vector clock timestamp that establishes a partial ordering of events across disconnected nodes without requiring synchronized wall clocks. The change log capacity limits the duration of disconnected operation: a change log sized for 72 hours of typical operational entity update rates provides 72 hours of independent operation before synchronization becomes mandatory to prevent log overflow. The operational planning process must account for this constraint.
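The partial ordering that vector clocks provide can be shown with three small functions. Clocks are represented here as plain dictionaries from node name to counter, and the node names are hypothetical.

```python
def tick(clock: dict, node: str) -> dict:
    """Return a copy of `clock` with this node's counter incremented."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum, applied when one node learns of another's events."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def compare(a: dict, b: dict) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for clock a relative to b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Two nodes each record one event while disconnected from each other.
forward_event = tick({}, "fwd-1")
hub_event = tick({}, "hub")
relation = compare(forward_event, hub_event)
```

Events that compare as concurrent are the ones that neither node could have known about when it made its change, so they are the ones that need a conflict resolution decision on reconnect; events ordered as before or after can simply be applied in order.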
The reconnect synchronization protocol operates in three phases when the forward element re-establishes network connectivity with the central hub:
The eventual consistency model that underlies this architecture provides a guarantee: after reconnection and successful completion of the synchronization protocol, all nodes in the MDM topology (the central hub and all forward caches) hold identical state for every entity they share. During a disconnected period, nodes may diverge. The synchronization protocol is designed so that divergence is bounded, detectable, and resolvable without data loss: no entity update made during the disconnected period is discarded, although updates may be reordered, and a value may be overridden by a survivorship rule that applies the organization's defined authority hierarchy.
The operational implication is that analysts and systems using the forward cache during a disconnected period must understand they are working from a snapshot of uncertain currency. The forward cache should expose a "last sync timestamp" field at the cache level and a "last confirmed" timestamp at the individual golden record attribute level, giving downstream systems the information they need to present appropriate confidence caveats when operational decisions depend on entity data that has not been recently confirmed by the central hub. A unit readiness assessment built on equipment golden records from a snapshot that was already 12 hours old when a 71-hour disconnected period began warrants a different level of confidence than one built on records synchronized 30 minutes before the assessment was generated. Making this distinction visible in the data, rather than hiding it behind a uniform presentation of golden record attributes, is what makes MDM useful rather than misleading in contested operational environments.