A military data lake and a commercial data lake share the same core idea — cheap, scalable object storage as the foundation for analytics — but they diverge sharply on every constraint that matters operationally. Classification enforcement, compartment-aware access, air-gap operation, and mission-critical query SLAs are not add-on features for a defense deployment; they are load-bearing requirements that shape every layer of the architecture from storage bucket naming to query engine configuration. An architecture that treats them as afterthoughts — adding a classification tag to existing commercial data lake patterns — will fail security audits at best and produce compliance violations at worst.
This article walks through the complete design of a production military data lake: the storage layer; the ingest pipeline for C2, ISR, and logistics sources; the metadata catalog; compartmentalized access control; analytics workload patterns; and data quality and governance. Each section addresses the design decisions that separate an operationally deployable system from a reference architecture that looks correct in a whitepaper but breaks at contact with real classified data.
What a military data lake must do that a commercial data lake does not — classification enforcement, compartment-aware access, air-gap operation, mission-critical SLAs
The most fundamental difference between a commercial and a military data lake is not a technology choice — it is a threat model. A commercial data lake protects against external attackers and insider data exfiltration. A military data lake must also protect against a more subtle failure: an authorized user accessing data they are cleared for but not authorized to see in the current context. A signals analyst with a Top Secret clearance must not be able to query a compartmented HUMINT dataset unless they hold the specific compartment code. A logistics officer cleared for Secret must not be able to see ISR intelligence even if it is also classified Secret. Classification level and compartment membership are orthogonal access dimensions, and both must be enforced at every layer of the system.
Air-gap operation eliminates every cloud-managed service dependency. There is no managed Kafka, no cloud-hosted catalog, no vendor-operated key management service. Every component — object storage, streaming broker, query engine, metadata catalog, key management hardware — must be deployable on-premises from packages that can be transferred to the air-gapped environment via approved media. This constraint eliminates architectures that assume internet-reachable update endpoints, license validation servers, or telemetry collection services. It also means the operations team owns the full upgrade and patching lifecycle for every component with no vendor-managed path.
Mission-critical SLAs apply to both data freshness and query latency. An ISR fusion analyst querying the latest sensor ingestion cannot accept a multi-hour Spark batch job window. A logistics officer building a resupply model needs query results in minutes, not days. These requirements force architectural choices that commercial data lakes often defer: always-on query engines with warm workers and up-to-date catalog statistics, hot-tier storage for recent data, and streaming ingest that makes new data queryable within seconds of receipt rather than at the next batch window. For deeper background on the foundational patterns used across the layers described here, see our guide to defense data integration patterns.
A fourth requirement that commercial data lakes also underserve is multi-domain data integration — joining C2 track data, ISR sensor data, logistics records, and geospatial reference layers in a single query across classification boundaries (with appropriate controls). In a commercial lake, cross-domain joins are a data engineering convenience. In a military lake, they are the primary analytical use case — and they must be possible without relaxing classification enforcement. The query engine must be capable of joining tables from different data domains while applying per-table classification filters transparently to each side of the join.
Storage layer design — object storage (MinIO/Ceph for on-prem), tiered storage (hot/warm/cold), object naming and partitioning for military data domains
Object storage is the foundation of every modern data lake architecture, and for air-gapped military deployments the choice typically narrows to two production-proven platforms: MinIO and Ceph. MinIO exposes an S3-compatible API with AES-256 encryption at rest, bucket and IAM-style access policies, and object-level versioning and locking, all without any cloud dependency. On modern NVMe hardware it can sustain high sequential write throughput, which is the dominant pattern for sensor data ingest. Ceph provides object, block, and file storage from a single cluster, with the S3-compatible object interface served by its RADOS Gateway, making it appropriate for environments where the same physical storage infrastructure must serve both the data lake's object workload and block volumes for virtual machines. For a dedicated data lake deployment, MinIO is operationally simpler; Ceph is the correct choice when the storage cluster must also serve block and file interfaces.
Classification isolation is implemented at the bucket level, not the object level. Each classification tier has its own bucket namespace:
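As an illustrative sketch only, bucket names could be derived from the classification tier and the processing zone. The tier labels and the `lake` prefix are assumptions made for this example, not a mandated convention; only the raw, structured, and curated zones come from the design described here.

```python
import re

# Hypothetical tier labels and prefix; adapt to the local naming policy.
TIERS = ("unclass", "confidential", "secret", "topsecret")
ZONES = ("raw", "structured", "curated")

# S3-style bucket names: 3-63 characters, lowercase letters, digits, hyphens.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def bucket_name(tier: str, zone: str, prefix: str = "lake") -> str:
    """Return the bucket for one classification tier and processing zone."""
    if tier not in TIERS:
        raise ValueError(f"unknown classification tier: {tier!r}")
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    name = f"{prefix}-{tier}-{zone}"
    if not _BUCKET_RE.match(name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return name


def all_buckets(prefix: str = "lake") -> list:
    """Enumerate every tier/zone bucket so none is created ad hoc."""
    return [bucket_name(t, z, prefix) for t in TIERS for z in ZONES]
```

Generating the full set up front keeps the namespace closed: a bucket that is not in the enumerated list should not exist, which makes policy review simpler.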
The zone structure — raw, structured, curated — partitions data by processing stage within each classification tier. The raw zone holds data exactly as received from source systems, in native format (CoT XML, NIEM XML, binary sensor streams, CSV logistics exports). Nothing in the raw zone is modified after write; it is an immutable audit record of what was received. The structured zone holds data after normalization to Parquet with Delta Lake table format — schema enforced, fields mapped to canonical names, provenance metadata appended. The curated zone holds analyst-ready aggregates, pre-joined multi-domain views, and summary tables that enable fast interactive queries without requiring every analyst to perform the same join logic independently.
Object key design for the structured zone encodes all partition dimensions needed for efficient query pushdown:
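A minimal sketch of one possible key layout follows, using Hive-style `name=value` path segments so that query engines can prune partitions from the key alone. The partition dimensions chosen here (domain, source system, UTC date, UTC hour) and the function name are assumptions for illustration.

```python
from datetime import datetime, timezone


def structured_key(domain: str, source: str, event_time: datetime, filename: str) -> str:
    """Build a Hive-style partitioned object key (layout is an assumption)."""
    if event_time.tzinfo is None:
        # Naive timestamps would make the date/hour partition ambiguous.
        raise ValueError("event_time must be timezone-aware")
    t = event_time.astimezone(timezone.utc)
    return (
        f"domain={domain}/source={source}/"
        f"date={t:%Y-%m-%d}/hour={t:%H}/{filename}"
    )
```

Partitioning on event time in UTC keeps keys stable regardless of where the ingest node runs; a filter on `date` and `hour` then maps directly to a key prefix.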
Tiered storage is managed via MinIO's lifecycle policies. The hot tier uses NVMe-backed nodes and holds data from the last 30 days — recent enough that analysts frequently query it for current operational analysis. After 30 days, objects transition to the warm tier on high-density spinning disk, where they remain queryable but with higher latency. After 180 days, objects transition to the cold tier on high-density archive storage with configurable retrieval latency. Tier thresholds are configurable per domain and per classification level — ISR sensor data from active operational areas may have a shorter warm-to-cold transition than logistics archive data.
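The tiering rule itself is simple enough to express as a small function. The sketch below is not MinIO lifecycle configuration; it only models the thresholds described above, and the per-domain override table is a hypothetical example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    hot_days: int = 30    # objects up to this age stay on NVMe
    warm_days: int = 180  # objects up to this age stay on spinning disk


# Hypothetical per-(domain, classification) overrides.
OVERRIDES = {("isr", "secret"): TierPolicy(hot_days=30, warm_days=90)}


def policy_for(domain: str, classification: str) -> TierPolicy:
    """Return the override for this domain and tier, or the default policy."""
    return OVERRIDES.get((domain, classification), TierPolicy())


def storage_tier(age_days: int, policy: TierPolicy = TierPolicy()) -> str:
    """Map an object's age in days to hot, warm, or cold."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= policy.hot_days:
        return "hot"
    if age_days <= policy.warm_days:
        return "warm"
    return "cold"
```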
Ingest pipeline architecture — C2 system feeds (CoT/NIEM), ISR sensor data ingestion, logistics system ETL, streaming vs batch ingestion patterns
The ingest layer must handle three fundamentally different data velocity profiles simultaneously: high-velocity streaming feeds from C2 and ISR systems producing thousands of events per minute; medium-velocity structured feeds from logistics and planning systems producing hundreds of records per hour; and low-velocity batch exports from legacy systems that deliver daily or weekly file dumps. These profiles require different ingest patterns that must interoperate cleanly at the landing zone boundary.
Apache Kafka is the streaming backbone for real-time feeds. Topic design follows the same classification-scoped convention as storage — each topic is prefixed with its classification level, and Kafka ACLs restrict producer and consumer access by classification tier. The topic naming convention encodes source type and system identity:
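One way to make such a convention enforceable is to build and parse topic names through a single helper. The four-part, dot-separated layout below (classification, stage, source type, system ID) is an assumption for illustration; only the classification prefix and the presence of source type and system identity come from the convention described above.

```python
import re
from typing import NamedTuple

# Components may not contain dots, since the dot is the separator.
_PART = re.compile(r"^[a-z0-9-]+$")


class Topic(NamedTuple):
    classification: str  # e.g. "secret"
    stage: str           # e.g. "raw" or "normalized"
    source_type: str     # e.g. "c2", "isr", "logistics"
    system_id: str       # hypothetical source system identifier

    def name(self) -> str:
        """Render the topic name, validating every component."""
        for part in self:
            if not _PART.match(part):
                raise ValueError(f"invalid topic component: {part!r}")
        return ".".join(self)


def parse_topic(name: str) -> Topic:
    """Split a topic name back into its components."""
    parts = name.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 components, got {len(parts)}: {name!r}")
    topic = Topic(*parts)
    topic.name()  # re-validates each component
    return topic
```

Because the classification level is always the leading component, a prefix-based Kafka ACL on, say, `secret.` covers every topic in that tier.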
CoT/NIEM adapter services subscribe to the raw C2 topics, parse the XML, map fields to the canonical ingest schema, attach provenance metadata (source system ID, collection timestamp, classification level, originating unit), and publish to the normalized topic. CoT parsing must handle both standard CoT fields and Detail element extensions used by specific C2 systems — the adapter must not silently drop extension content; it must either map known extensions to canonical fields or preserve them in a structured extensions column. NIEM adapters handle significantly more complex schemas — NIEM's modular design means a NIEM exchange instance may reference types from multiple NIEM core and domain schemas, and the adapter must resolve all references before field mapping. The adapter outputs a flattened row-per-report structure suitable for Parquet storage.
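The extension-preservation rule can be sketched with the standard library XML parser. The canonical column names, the `KNOWN_DETAIL` mapping, and the function name are assumptions for this example; a production adapter would also use a hardened XML parser (for instance the `defusedxml` package) because `xml.etree` does not defend against malicious entity expansion.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical mapping: detail child tag -> (attribute, canonical column).
KNOWN_DETAIL = {"contact": ("callsign", "callsign")}


def cot_to_row(xml_text: str, classification: str, source_system: str) -> dict:
    """Flatten one CoT event into a row, keeping unknown detail children."""
    event = ET.fromstring(xml_text)
    if event.tag != "event":
        raise ValueError("not a CoT event")
    point = event.find("point")
    if point is None:
        raise ValueError("CoT event has no point element")
    row = {
        "uid": event.get("uid"),
        "cot_type": event.get("type"),
        "event_time": event.get("time"),
        "stale_time": event.get("stale"),
        # lat, lon and hae are assumed present, as the CoT schema requires.
        "lat": float(point.get("lat")),
        "lon": float(point.get("lon")),
        "hae": float(point.get("hae")),
        "classification": classification,
        "source_system": source_system,
    }
    extensions = {}
    detail = event.find("detail")
    for child in (detail if detail is not None else []):
        mapping = KNOWN_DETAIL.get(child.tag)
        if mapping:
            attr, column = mapping
            row[column] = child.get(attr)
        else:
            # Unknown extension: preserve the raw XML instead of dropping it.
            extensions.setdefault(child.tag, []).append(
                ET.tostring(child, encoding="unicode")
            )
    row["extensions"] = json.dumps(extensions, sort_keys=True)
    return row
```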
ISR sensor ingest varies by sensor type. UAV video analytics feeds produce detection event records — bounding box, detection class, confidence score, frame timestamp, sensor position — at rates that can exceed 10,000 records per minute for a multi-stream platform. These must be ingested via Kafka with Avro or Protobuf schema enforcement at the producer side, not just at the consumer. SIGINT receivers produce emitter intercept records with frequency, modulation, pulse parameters, bearing, and TDOA data. Radar feeds produce track reports with position, velocity, and track quality metrics. Each sensor type has a dedicated adapter that normalizes its native format to the canonical ISR observation schema before the record reaches the landing zone. For an in-depth treatment of real-time intelligence fusion architecture, see our complementary article on multi-source stream processing.
Logistics ETL typically operates in batch mode — most logistics ERP systems produce periodic exports rather than real-time change streams. The ETL pipeline extracts data from the source system's export interface (SFTP file drop, database query, or REST API pull), validates the schema, applies field mapping to the canonical logistics schema, and writes Parquet files to the raw landing zone. A Spark job then processes the raw files into the structured zone on a schedule synchronized with the source system's export cadence. For logistics systems that do support change data capture (CDC) via database log streaming, the streaming path is preferred — it reduces the time between a logistics event and its availability for analytics queries from hours to minutes.
Data quality validation at ingest time is not optional — it is the mechanism that prevents malformed data from reaching the structured zone and corrupting analytics. Every ingest adapter must perform at minimum: schema conformance validation, provenance completeness checks, temporal sanity checks (event timestamps within plausible bounds), and for geospatial data, coordinate range validation. Validation failures route to a dead-letter queue with the full original message body and a structured error record identifying the specific check that failed.
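The minimum checks and the dead-letter routing can be sketched as follows. The field names, the seven-day age bound, and the five-minute clock-skew allowance are assumptions chosen for illustration; the provenance fields mirror those attached by the adapters.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_PROVENANCE = ("source_system", "collected_at", "classification", "originating_unit")


def validate(record: dict, now: datetime,
             max_age: timedelta = timedelta(days=7),
             max_skew: timedelta = timedelta(minutes=5)) -> list:
    """Return a list of failed checks; an empty list means the record passes."""
    errors = []
    for name in REQUIRED_PROVENANCE:
        if not record.get(name):
            errors.append(f"provenance_missing:{name}")
    ts = record.get("event_time")
    if not isinstance(ts, datetime) or ts.tzinfo is None:
        errors.append("schema:event_time")
    elif ts > now + max_skew or ts < now - max_age:
        errors.append("temporal:event_time_out_of_bounds")
    for name, bound in (("lat", 90), ("lon", 180)):
        value = record.get(name)
        if value is None:
            continue  # non-geospatial records carry no coordinates
        if not isinstance(value, (int, float)) or abs(value) > bound:
            errors.append(f"geo:{name}_out_of_range")
    return errors


def route(record: dict, now: datetime):
    """Send failing records to the dead-letter queue with the original body."""
    errors = validate(record, now)
    if errors:
        return "dead_letter", {"original": record, "errors": errors}
    return "landing", record
```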
Metadata catalog and data discovery — classification and caveat labeling in metadata, schema registry, data lineage tracking, search and discovery UX for analysts
The metadata catalog is the component that makes the data lake usable by analysts who did not build it. Without a catalog, the data lake is a collection of opaque Parquet files in object storage that only the engineers who wrote the ingest adapters know how to query. With a well-populated catalog, any analyst with appropriate clearance can discover available datasets, understand their schemas and provenance, and construct queries against them — without requiring engineering support for every new analytical question. The military data catalog metadata management guide covers the catalog design in detail; this section addresses its integration with the data lake architecture.
Apache Atlas is the production-proven open-source catalog for environments that require extensible metadata types, lineage tracking, and integration with Hadoop ecosystem tools. Atlas's type system allows custom metadata entity types to be defined as first-class schema objects — not as free-text annotations — which is essential for the defense-specific fields that must be mandatory at dataset registration time:
- ClassificationLevel — enumerated type: UNCLASSIFIED, CONFIDENTIAL, SECRET, TOP_SECRET. Mandatory on every dataset entity. Drives storage bucket routing and access control policy evaluation.
- CaveatCodes — multi-valued string list. Records applicable caveats such as NOFORN, compartment identifiers, and handling instructions. Must be populated at ingest or left as an empty explicit declaration — null is not permitted.
- DataOwnerUnit — string reference to the organizational unit responsible for the dataset. Used for data stewardship routing and for establishing point-of-contact for data access requests.
- RetentionPolicy — reference to the applicable retention rule defining maximum retention duration, secure deletion procedure, and the authority that approved the retention period.
- OriginatingSystem — identifier of the source system that produced the data. Used for lineage tracing and for routing data quality issues back to the source system operator.
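The mandatory-field rules above can be checked before a dataset is ever submitted to the catalog. The sketch below is plain Python rather than an Atlas type definition; the class and method names are assumptions, and only the five field names and their constraints come from the list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ClassificationLevel(Enum):
    UNCLASSIFIED = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3


@dataclass
class DatasetRegistration:
    name: str
    classification_level: Optional[ClassificationLevel]
    caveat_codes: Optional[List[str]]  # [] is an explicit "no caveats"
    data_owner_unit: str
    retention_policy: str
    originating_system: str

    def problems(self) -> List[str]:
        """Return every rule this registration violates (empty if valid)."""
        found = []
        if not isinstance(self.classification_level, ClassificationLevel):
            found.append("ClassificationLevel must be one of the enumerated values")
        if self.caveat_codes is None:
            found.append("CaveatCodes must be an explicit list (empty allowed, null not)")
        for label, value in (
            ("DataOwnerUnit", self.data_owner_unit),
            ("RetentionPolicy", self.retention_policy),
            ("OriginatingSystem", self.originating_system),
        ):
            if not value:
                found.append(f"{label} is mandatory")
        return found
```

Distinguishing an empty caveat list from a missing one matters: the former is a deliberate declaration, the latter is an unreviewed dataset.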
Schema registry integration with Apache Kafka ensures that the schema for every topic is registered and versioned before any producer begins writing. The Confluent Schema Registry (deployable on-premises without cloud connectivity) enforces Avro or Protobuf schema compatibility rules. BACKWARD compatibility, the usual setting for consumer-facing topics, means a consumer upgraded to the new schema version can still read data written with the previous version; in practice this permits adding optional fields (fields with defaults) and removing fields. Where consumers still on the old schema must also read newly written data, FULL compatibility is the stricter choice. Every schema version is linked from the Atlas catalog entity for the corresponding dataset, providing a complete history of how the dataset's structure has evolved over time.
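To make the compatibility rule concrete, here is a deliberately simplified check: a new reader schema can decode records written under the previous schema when every field it adds carries a default and no shared field changes type. The dictionary schema format is an assumption for this sketch, and it ignores Avro type promotion, unions, and aliases, which the real registry does evaluate.

```python
def backward_compatible(old: dict, new: dict) -> list:
    """Return reasons the new schema cannot read old data ([] if compatible).

    Schemas are modelled as {field_name: {"type": str, "default": optional}}.
    """
    problems = []
    for name, spec in new.items():
        if name not in old:
            # Old records lack this field, so the reader needs a default.
            if "default" not in spec:
                problems.append(f"added field {name!r} has no default")
        elif old[name]["type"] != spec["type"]:
            problems.append(f"field {name!r} changed type")
    # Fields removed in the new schema are simply ignored by the reader.
    return problems
```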
Data lineage tracking in Atlas captures the transformation chain from raw ingest to curated analytics table. With an Atlas hook for Spark installed (for example, the open-source Spark Atlas Connector), lineage events are emitted automatically when a Spark job reads from one Atlas-registered dataset and writes to another. Trino does not ship an Atlas hook, so lineage capture there requires a custom event listener that records which source tables were accessed to produce a query result. The lineage graph that Atlas builds from these events allows an analyst to trace any value in a curated analytics table back through every transformation step to the original raw record in the landing zone, and further back to the source system and originating collection event. This chain is operationally important, because analysts frequently need to assess whether a given analytical result is traceable to a reliable source, and it is a compliance requirement for classified data.
Analyst-facing search and discovery requires the catalog to be queryable by the attributes analysts actually use: data domain (C2, ISR, logistics, geospatial), time range covered, classification level and caveats (filtered to what the user is authorized to see), and keyword search across dataset descriptions and column names. Atlas's search API supports all of these dimensions, but the access control layer must ensure that catalog search results are themselves filtered by the user's clearance and compartment memberships — an analyst must not be able to discover the existence of a compartmented dataset they are not authorized to access, even if the catalog entry does not expose the dataset contents.
Compartmentalized access control — attribute-based access control (ABAC) for compartment enforcement, row/column-level security in query engines, audit logging
Access control in a military data lake is not a single policy — it is a layered enforcement stack where each layer catches what the layer above it might miss. The layers are: storage-level bucket policies (who can read and write which buckets), catalog-level visibility policies (who can discover which datasets), query-engine-level row and column security (what rows and columns a user can see within a dataset they are authorized to access), and audit logging (what was accessed, by whom, when, and what was returned). All four layers must be active simultaneously — any single layer operating in isolation provides only partial protection.
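Attribute-based access control (ABAC) is the correct model for the classification and compartment dimensions. Role-based access control (RBAC) would need a distinct role for every combination of clearance level and compartment set, and that role explosion becomes unmanageable at scale. An ABAC policy instead evaluates access by computing a policy function over user attributes (clearance level, compartment memberships) and resource attributes (classification level, caveat codes).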
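A minimal sketch of such a policy function, written in Python for readability rather than in a policy language, might look like this. The compartment labels in the usage note are invented placeholders, and the type names are assumptions; the rule itself (clearance must dominate classification, and the user must hold every compartment on the resource) follows the two orthogonal dimensions described earlier.

```python
from dataclasses import dataclass

LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP_SECRET": 3}


@dataclass(frozen=True)
class User:
    clearance: str
    compartments: frozenset = frozenset()


@dataclass(frozen=True)
class Resource:
    classification: str
    compartments: frozenset = frozenset()


def is_allowed(user: User, resource: Resource) -> bool:
    """Allow only if both the level check and the compartment check pass."""
    level_ok = LEVELS[user.clearance] >= LEVELS[resource.classification]
    # The user must hold every compartment the resource is marked with.
    compartments_ok = resource.compartments <= user.compartments
    return level_ok and compartments_ok
```

For example, a user with `TOP_SECRET` clearance and only a hypothetical `SIGINT-Y` compartment is denied a `TOP_SECRET` resource marked `HUMINT-X`, even though the clearance level alone would suffice.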
Open Policy Agent (OPA) acts as the policy decision point and is deployed as a sidecar to the Trino coordinator. When an analyst submits a SQL query, Trino's access control plugin calls OPA for each table reference in the query, passing the authenticated user's identity attributes (from the identity provider) and the table's resource attributes (from the catalog). OPA returns an allow/deny decision, and for allowed tables, returns the row-filter predicate and column mask definitions that must be applied to that table in this user's session. Trino injects the row filters and column masks transparently: the analyst's query runs as written, but the result set contains only the rows and columns the analyst is authorized to see.
Row-level security in Trino appends a predicate to the WHERE clause of every query against a protected table. For a table containing records at multiple classification levels, the row filter restricts the query to rows where the record's classification level is within the user's authorized range. Column-level masking replaces column values with NULL or a string literal for columns the user is not authorized to see in full — for example, a field containing a compartment identifier may be visible to users with the appropriate compartment authorization and masked to NULL for users without it.
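The two mechanisms can be illustrated independently of any engine. In the sketch below, `row_filter_sql` produces the kind of predicate a policy service might hand back for a user's clearance, and `mask_row` shows the effect of NULL masking on one result row. The column name `classification_level`, the function names, and the compartment labels are assumptions for illustration.

```python
LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP_SECRET": 3}


def row_filter_sql(clearance: str, column: str = "classification_level") -> str:
    """Build a predicate limiting rows to levels at or below the clearance."""
    ceiling = LEVELS[clearance]
    allowed = [name for name, rank in LEVELS.items() if rank <= ceiling]
    quoted = ", ".join(f"'{name}'" for name in allowed)
    return f"{column} IN ({quoted})"


def mask_row(row: dict, user_compartments: set, protected: dict) -> dict:
    """Replace protected column values with None unless the user qualifies.

    `protected` maps a column name to the compartment required to read it.
    """
    masked = {}
    for column, value in row.items():
        required = protected.get(column)
        visible = required is None or required in user_compartments
        masked[column] = value if visible else None
    return masked
```

The predicate is built only from the fixed `LEVELS` table, never from user-supplied strings, so it cannot be used to inject SQL.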
Audit logging must capture every query event with sufficient detail to reconstruct what the user saw. The minimum audit record fields are: authenticated user identity (not just username — full identity provider attributes), query submission timestamp, query text, list of tables accessed with their classification levels, row count returned per table, any row-filter or column-mask policies that were applied, and query completion status. Audit records must be written to append-only object storage with object lock enabled, preventing modification or deletion of audit records even by storage administrators. The audit log store must be at the highest classification level that can appear in any query, ensuring that audit records are not themselves a cross-classification data channel.
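The minimum audit fields map naturally onto a small immutable record that is serialized as one JSON line per query. The class and field names below are assumptions; writing the line to object-locked storage is outside the scope of this sketch.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TableAccess:
    table: str
    classification: str
    rows_returned: int
    policies_applied: tuple = ()  # names of row filters / column masks used


@dataclass(frozen=True)
class AuditRecord:
    identity: dict       # full identity-provider attributes, not just a username
    submitted_at: str    # ISO 8601 timestamp
    query_text: str
    tables: tuple        # one TableAccess per table touched
    status: str          # e.g. "FINISHED" or "FAILED"

    def to_json_line(self) -> str:
        """Serialize deterministically so records can be hashed or diffed."""
        return json.dumps(asdict(self), sort_keys=True)
```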
Query and analytics workload patterns — Trino/Presto for ad-hoc SQL over object storage, Spark for batch analytics, notebook environments for analysts
A military data lake serves two distinct query workload profiles that require different execution engines optimized for different cost functions. Ad-hoc analyst queries prioritize low latency — an analyst building a targeting assessment needs an answer in seconds to minutes, not hours. Batch analytics jobs prioritize throughput — a data engineering pipeline processing the full historical corpus of sensor records for pattern-of-life analysis can accept longer run times if it produces accurate, comprehensive results. Deploying both Trino and Spark against the same object storage serves both profiles without compromise.
Trino is the ad-hoc query engine of choice for data lake SQL analytics. It reads Parquet files through its Hive connector and Delta Lake tables through its Delta Lake connector, directly from object storage, with no data loading step and no ETL into a separate database. Trino's distributed query execution engine splits the query into fragments executed in parallel across worker nodes, with intermediate results exchanged over the network. For typical analyst queries (filtering recent sensor records, joining C2 tracks against geospatial reference layers, aggregating logistics consumption by unit and time period), Trino delivers results in seconds to minutes against datasets that span hundreds of gigabytes of Parquet files. The key tuning parameters for a military data lake deployment are: partition pruning configuration (Trino must push WHERE clause predicates on partition columns down to the object storage listing, avoiding full bucket scans), statistics collection frequency (Trino's cost-based optimizer requires accurate table statistics to produce efficient join plans), and worker memory limits (large join queries can exceed per-worker memory; configuring spill-to-disk prevents out-of-memory failures at the cost of increased query latency).
Apache Spark is the batch analytics engine for workloads that require full-corpus processing or complex iterative computation. Spark's Delta Lake integration provides ACID write semantics: each batch job commits atomically to the Delta Lake transaction log, so a failed job leaves no partial output, and concurrent Spark and Trino readers see a consistent snapshot of the table through Delta Lake's snapshot isolation. Common Spark workloads in a military data lake include:
- Nightly ETL jobs that process the raw landing zone files from the previous 24 hours into the structured zone — parsing, validating, normalizing, and writing Parquet with partition metadata
- Pattern-of-life analysis over weeks or months of historical track data — identifying regular movement patterns, frequent co-location events, and behavioral changes over time
- Entity resolution batch jobs that consolidate duplicate records across sources for the curated entity master table
- Training data preparation pipelines for ML models — feature extraction from raw sensor data, label joining, dataset export to the format expected by the training framework
- Data quality sweeps that run statistical profiling across the full structured zone to detect schema drift, value distribution anomalies, and provenance metadata gaps
JupyterHub provides the analyst-facing notebook environment. Notebooks connect to Trino for SQL queries and to Spark for programmatic data manipulation, combining the exploratory flexibility of Python with the structured query capabilities of both engines. In a classified deployment, the JupyterHub server must be isolated within the appropriate classification network segment — each classification tier requires its own JupyterHub instance connected to the Trino and Spark clusters within that tier. Notebooks created by analysts must be stored in the data lake's curated zone (not on the JupyterHub server's local disk) so that they are subject to the same retention and audit policies as other data lake content. Pre-built notebook templates for common multi-domain join patterns — C2 track joined against ISR sensor detections, logistics consumption joined against operational reports — reduce the time an analyst needs to spend on boilerplate query construction before reaching their analytical question.
Data quality and governance — data quality checks at ingest, master data management for entity references, data retention and disposal policies for classified data
Data governance in a military data lake is not a documentation exercise — it is an operational discipline that determines whether the analytics produced by the lake can be trusted for decision-making. An analyst who discovers that a targeting assessment was based on data that failed its quality checks but was allowed to propagate to the curated zone has lost trust in the entire analytics platform. Governance mechanisms must be enforced automatically, not relying on analyst awareness of data quality state.
Data quality checks at ingest time (described in the ingest pipeline section) are the first governance gate. The second gate is a daily data quality sweep across the structured zone using Apache Spark. The sweep runs a configurable set of checks against every registered dataset:
- Completeness checks — percentage of non-null values for mandatory fields; any dataset where a mandatory field drops below 95% completeness triggers an alert to the data steward
- Value distribution checks — statistical profiling of numeric columns detects sudden distribution shifts that may indicate sensor miscalibration, coordinate system changes, or ETL bugs
- Referential integrity checks — foreign key relationships between datasets (e.g., track entity IDs referenced in sensor records must exist in the C2 entity master) are validated daily to detect orphaned references produced by ingest sequencing issues
- Freshness checks — every dataset has a maximum expected ingest latency; datasets that have not received new records within that window trigger an alert that may indicate source system failure or network partition
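Two of these checks, completeness and freshness, are sketched below in plain Python over in-memory rows so the logic is easy to follow; in the sweep itself the same computations would run as Spark aggregations. The function names are assumptions, and the 95% threshold is the one stated above.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows: list, field: str) -> float:
    """Fraction of rows where the field is present and not None."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(field) is not None)
    return present / len(rows)


def completeness_alerts(rows: list, mandatory: list, threshold: float = 0.95) -> list:
    """Return the mandatory fields whose completeness is below the threshold."""
    return [f for f in mandatory if completeness(rows, f) < threshold]


def is_stale(last_ingest: datetime, now: datetime, max_latency: timedelta) -> bool:
    """True when no new records arrived within the expected ingest window."""
    return now - last_ingest > max_latency
```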
Master data management (MDM) for entity references is the governance mechanism that prevents the proliferation of inconsistent entity identifiers across data domains. In a military context, the primary entities requiring MDM are: units and formations (with their hierarchical relationships and current task organization), equipment items (with model, variant, and assigned unit), geographic features and facilities (with classified and unclassified names), and personnel (with clearance and assignment records). Each of these entity types requires a golden record in the entity master table that serves as the authoritative reference for joins across data domains. Without a maintained entity master, analysts who join ISR sensor records against C2 unit records will produce join failures or incorrect results whenever the same unit is identified by different identifiers in different source systems — a common problem in multi-source military data environments.
Data retention and disposal for classified data uses cryptographic erasure as the standard mechanism. Data encryption keys (DEKs) are scoped to the unit of disposal (a dataset or object prefix within a classification tier) and are protected by a hardware security module (HSM). When a retention policy triggers disposal of a dataset, the HSM destroys the DEK for the affected objects rather than performing object-level overwrite across potentially petabytes of encrypted data. After DEK destruction, provided no other copy of the key exists, the encrypted objects are computationally unrecoverable, and the requirement for secure deletion is satisfied without the operational burden of multi-pass overwrite procedures at petabyte scale. The purge service that executes DEK destruction must log every purge event to an immutable audit trail that records:
- The dataset identifier and object path prefix affected
- The count of objects whose encryption key was destroyed
- The retention policy rule that triggered the disposal (by rule ID and version)
- The identity of the automated service or human principal that authorized the disposal
- A cryptographic hash of the DEK, computed inside the HSM at the time of destruction (evidence that the key existed before destruction, without preserving the key itself)
- The timestamp of DEK destruction and the HSM attestation record confirming the destruction event
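The ordering of a purge (fingerprint first, destroy second, then emit the audit record) can be sketched with an in-memory stand-in for the key store. `InMemoryKeyStore` is purely hypothetical: a real HSM never exposes key bytes to application code and would supply its own attestation, which is left as a placeholder here.

```python
import hashlib
import os
from datetime import datetime, timezone


class InMemoryKeyStore:
    """Hypothetical stand-in for an HSM, for illustration only."""

    def __init__(self):
        self._keys = {}

    def create(self, key_id: str) -> None:
        self._keys[key_id] = os.urandom(32)

    def fingerprint(self, key_id: str) -> str:
        return hashlib.sha256(self._keys[key_id]).hexdigest()

    def destroy(self, key_id: str) -> None:
        del self._keys[key_id]

    def exists(self, key_id: str) -> bool:
        return key_id in self._keys


def purge(store, key_id, dataset_id, prefix, object_count,
          rule_id, rule_version, authorized_by) -> dict:
    """Destroy a DEK and return the audit record describing the purge."""
    dek_hash = store.fingerprint(key_id)  # must be taken before destruction
    store.destroy(key_id)
    return {
        "dataset_id": dataset_id,
        "object_prefix": prefix,
        "object_count": object_count,
        "retention_rule": {"id": rule_id, "version": rule_version},
        "authorized_by": authorized_by,
        "dek_sha256": dek_hash,
        "destroyed_at": datetime.now(timezone.utc).isoformat(),
        "hsm_attestation": None,  # supplied by a real HSM; not simulated here
    }
```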
The governance framework is only effective if its outputs are visible to the humans responsible for data quality. A daily governance dashboard — built from the data quality sweep results, catalog registration completeness metrics, retention policy compliance status, and audit log summaries — gives data stewards the visibility needed to identify and address issues before they propagate to analyst-facing curated datasets. The dashboard itself must be deployed within the appropriate classification network segment, and its data must be sourced exclusively from within that segment — a governance dashboard for Secret-classified data must not pull metrics from Unclassified monitoring infrastructure. Treating governance as a second-class concern that can share infrastructure with lower-classification services is a common architectural mistake that creates both compliance exposure and an operational blind spot for the data quality of the lake's most sensitive content.