A modern defense organization generates data at a rate that breaks traditional database assumptions. A single ISR platform produces gigabytes of imagery per sortie. A sensor-dense convoy generates thousands of Cursor on Target (CoT) position reports per minute. A single SIGINT collection session can produce terabytes of raw I/Q data before any signal processing has occurred. Multiply those volumes across an entire joint force (hundreds of platforms, dozens of sensor types, multiple classification levels) and the resulting data problem is no longer a database problem. It is a data lake problem.
This article walks through the full architecture of a defense data lake: how data enters, how it is stored and structured, how classification boundaries are enforced, how analysts query it, and how classified data is securely retired. The patterns here apply whether you are building an on-premises classified system, a hybrid deployment, or a cloud-connected analytics platform at the unclassified or controlled-unclassified level.
Why traditional databases cannot handle defense data at scale
Relational databases are designed around structured, well-defined schemas. They excel at transactional workloads: creating, reading, updating, and deleting records with strong consistency guarantees. Most defense sensor data is neither structured nor transactional. It arrives in heterogeneous formats: CoT XML from ground troops, binary radar track files, compressed video from UAV feeds, JSON from software-defined radio pipelines, PDF intelligence reports, and audio transcripts from communications monitoring. Forcing all of that into a normalized relational schema is not just operationally impractical; it also destroys the raw fidelity that analysts sometimes need to return to.
NoSQL databases solve the schema problem but introduce new ones: most are not designed for the analytical query patterns that intelligence work demands (full-table scans, aggregations across millions of records, vector similarity searches over document embeddings). Time-series databases handle high-frequency sensor streams well but collapse under ad-hoc analytical joins. The defense data lake pattern addresses all of these gaps by separating ingestion, storage, and query concerns into independently scalable layers.
Ingestion layer: streaming versus batch
The ingestion layer is where raw data enters the lake. Two distinct patterns dominate defense environments, and a production lake needs both.
Streaming ingestion
Real-time sensor feeds — position reports, radar tracks, signals intelligence alerts, chat messages, video analytics events — arrive continuously and must be ingested with low latency. Apache Kafka is the dominant open-source choice for on-premises and air-gapped environments. Kafka topics map naturally to data sources: one topic per sensor type or feed. Topic-level access control lists (ACLs) provide a first line of classification enforcement — a Secret-classified sensor feed lands in a Secret-classified topic, and only consumers with appropriate credentials can subscribe.
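The topic-per-feed and deny-by-default ACL idea can be sketched in a few lines of plain Python. This is only a model of the semantics, not Kafka's admin API: the `<classification>.<sensor_type>` naming convention, the principal names, and the `TopicAcls` class are all assumptions made for illustration.

```python
from dataclasses import dataclass


def topic_name(classification: str, sensor_type: str) -> str:
    """Hypothetical convention: <classification>.<sensor_type>, lowercased."""
    return f"{classification.lower()}.{sensor_type.lower()}"


@dataclass(frozen=True)
class AclEntry:
    principal: str  # e.g. "User:secret-fusion" (illustrative name)
    topic: str
    operation: str  # "READ" or "WRITE"


class TopicAcls:
    """Deny-by-default allow list: access exists only if an entry was added."""

    def __init__(self):
        self._entries = set()

    def allow(self, principal: str, topic: str, operation: str) -> None:
        self._entries.add(AclEntry(principal, topic, operation))

    def is_allowed(self, principal: str, topic: str, operation: str) -> bool:
        return AclEntry(principal, topic, operation) in self._entries


acls = TopicAcls()
secret_radar = topic_name("SECRET", "radar-track")
# Only the Secret-side fusion consumer is granted read access to this topic.
acls.allow("User:secret-fusion", secret_radar, "READ")

print(acls.is_allowed("User:secret-fusion", secret_radar, "READ"))
print(acls.is_allowed("User:unclass-dashboard", secret_radar, "READ"))
```

Because nothing is allowed until an entry exists, a newly created Secret topic is unreadable by every consumer until someone grants access explicitly, which is the failure mode you want.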
For hybrid or cloud-connected deployments, Azure Event Hubs offers a Kafka-compatible API surface with native integration into Azure Data Lake Storage Gen2 and Azure Synapse Analytics. Event Hubs Capture writes incoming events directly to ADLS Gen2 in Avro format, with Parquet output available through the Azure Stream Analytics no-code editor, eliminating a separate consumer process for the raw landing zone. The operational overhead is significantly lower than self-managed Kafka, at the cost of reduced control over topic-level access policies.
Schema registries — either Confluent Schema Registry (for Kafka) or Azure Schema Registry — should be mandatory for streaming ingestion. Registering schemas at the point of entry prevents malformed messages from propagating downstream and provides a versioned contract for schema evolution. A sensor firmware update that changes a field name or adds new telemetry fields should never silently break downstream analytics.
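To make the "versioned contract" concrete, here is a minimal sketch of the kind of check a registry performs before accepting a new schema version. It is a deliberately conservative policy written for illustration, not the compatibility algorithm of Confluent Schema Registry or Azure Schema Registry; the schema representation (a dict of field name to type and optional default) and the field names are assumptions.

```python
def breaking_changes(old: dict, new: dict) -> list:
    """Return human-readable reasons why `new` would break consumers of `old`.

    Conservative illustrative policy:
      - removing (or renaming) an existing field is breaking
      - changing a field's type is breaking
      - adding a field is fine only if it carries a default
    """
    problems = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"removed or renamed field: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(
                f"type change on {name}: {spec['type']} -> {new[name]['type']}"
            )
    for name, spec in new.items():
        if name not in old and "default" not in spec:
            problems.append(f"new field without default: {name}")
    return problems


v1 = {
    "track_id": {"type": "string"},
    "lat": {"type": "double"},
    "lon": {"type": "double"},
}
# Firmware update A adds telemetry with a default: accepted.
v2_ok = {**v1, "battery_pct": {"type": "int", "default": -1}}
# Firmware update B renames "lat" to "latitude": rejected.
v2_bad = {
    "track_id": {"type": "string"},
    "latitude": {"type": "double"},
    "lon": {"type": "double"},
}

print(breaking_changes(v1, v2_ok))
print(breaking_changes(v1, v2_bad))
```

A rename shows up as two findings (one removal, one addition without a default), which is exactly the silent breakage the registry is meant to catch at the point of entry.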
Batch ingestion
Not all defense data arrives in real time. Daily intelligence summary dumps, archived signal recordings, historical track databases, and imported data from allied systems typically arrive as bulk transfers on a defined schedule. Batch ingestion pipelines are simpler than streaming pipelines but carry their own challenges: files may arrive in legacy formats (NITF imagery, STANAG 4607 GMTI, CSV exports from aging C2 systems), and file sizes can range from kilobytes to hundreds of gigabytes per transfer.
A robust batch ingestion layer needs format detection and validation at the entry point, checksum verification to confirm transfer integrity, and a dead-letter path for files that fail validation. Ingestion should be idempotent — running the same batch job twice should not duplicate records in the structured zone. Delta Lake's transaction log makes idempotent batch ingestion straightforward: write jobs check the transaction log before appending, and duplicate detection can be implemented with a deterministic row key derived from source system identifiers and timestamps.
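The three entry-point requirements (checksum verification, a dead-letter path, and idempotent writes keyed on a deterministic row key) can be sketched with the standard library alone. This is an illustrative model, not Delta Lake code: the in-memory `table` dict stands in for the structured zone, and the JSON-lines format, the `id` and `ts` field names, and the `BatchIngestor` class are assumptions.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def row_key(source_id: str, record_id: str, timestamp: str) -> str:
    """Deterministic key: the same source record always hashes to the same key."""
    return sha256_hex(f"{source_id}|{record_id}|{timestamp}".encode("utf-8"))


def parse_json_lines(payload: bytes) -> list:
    """Validate and parse one JSON object per line; raise ValueError if invalid."""
    records = []
    for line in payload.decode("utf-8").splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)  # JSONDecodeError is a subclass of ValueError
        if not isinstance(rec, dict) or "id" not in rec or "ts" not in rec:
            raise ValueError("record missing 'id' or 'ts'")
        records.append(rec)
    return records


class BatchIngestor:
    def __init__(self):
        self.table = {}        # row_key -> record (stand-in for the structured zone)
        self.dead_letter = []  # (filename, reason) for files that fail validation

    def ingest(self, filename, payload, expected_sha256, source_id) -> int:
        """Return the number of new rows written; 0 for rejects and re-runs."""
        if sha256_hex(payload) != expected_sha256:
            self.dead_letter.append((filename, "checksum mismatch"))
            return 0
        try:
            records = parse_json_lines(payload)
        except ValueError as exc:
            self.dead_letter.append((filename, f"invalid format: {exc}"))
            return 0
        written = 0
        for rec in records:
            key = row_key(source_id, rec["id"], rec["ts"])
            if key not in self.table:  # skip rows already ingested
                self.table[key] = rec
                written += 1
        return written


payload = (
    b'{"id": "t1", "ts": "2024-01-01T00:00:00Z"}\n'
    b'{"id": "t2", "ts": "2024-01-01T00:00:05Z"}\n'
)
ingestor = BatchIngestor()
first_run = ingestor.ingest("tracks.jsonl", payload, sha256_hex(payload), "c2-archive")
second_run = ingestor.ingest("tracks.jsonl", payload, sha256_hex(payload), "c2-archive")
print(first_run, second_run, len(ingestor.table))
```

Running the same file twice leaves the table unchanged on the second pass, which is the idempotency property described above.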
Storage layer: landing zone to structured zone
A defense data lake uses a multi-zone storage model. Data moves through zones as it is validated, transformed, and made available for analysis.
Raw landing zone
The raw landing zone is the first destination for all inbound data — streaming events written as Avro or JSON line files, batch transfers stored in their original format. Nothing is modified here. The landing zone is a forensic record: if a processing error corrupts a downstream dataset, the raw landing zone is the recovery point. Storage is S3-compatible object storage — AWS S3, Azure Data Lake Storage Gen2, MinIO for on-premises air-gapped deployments, or Ceph for large-scale on-premises object storage.
Objects in the landing zone are named with a deterministic key scheme that encodes source system, classification level, data type, and arrival timestamp. A naming convention like raw/{classification}/{source}/{year}/{month}/{day}/{hour}/{uuid}.{ext} gives the transformation pipeline a reliable partitioning structure and makes it possible to reprocess a specific time window for a single source without touching unrelated data.
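As a sketch of that convention, the helper below builds a landing-zone key and the matching hour prefix used to reprocess a time window. The function names, the lowercase normalization, and the choice to use UTC are assumptions for illustration; only the path template comes from the convention above.

```python
import uuid
from datetime import datetime, timezone


def hour_prefix(classification: str, source: str, arrived_at: datetime) -> str:
    """Prefix shared by every object from one source in one UTC hour."""
    ts = arrived_at.astimezone(timezone.utc)
    return f"raw/{classification.lower()}/{source}/{ts:%Y/%m/%d/%H}/"


def landing_key(classification, source, arrived_at, ext, object_id=None) -> str:
    """Build raw/{classification}/{source}/{year}/{month}/{day}/{hour}/{uuid}.{ext}."""
    object_id = object_id or uuid.uuid4()
    return f"{hour_prefix(classification, source, arrived_at)}{object_id}.{ext}"


arrival = datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)
fixed_id = uuid.UUID("12345678-1234-5678-1234-567812345678")  # fixed for the demo
key = landing_key("SECRET", "radar-07", arrival, "avro", fixed_id)
print(key)
print(hour_prefix("SECRET", "radar-07", arrival))
```

Listing objects under the hour prefix is enough to select one source's data for one hour, so a reprocessing job never has to scan unrelated sources or classification levels.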
Structured zone: Parquet and Delta Lake
The structured zone is where raw data is transformed into a format that analytical engines can query efficiently. The current standard is columnar Parquet files managed by a Delta Lake or Apache Iceberg table format layer. Parquet's columnar layout dramatically reduces I/O for analytical queries that access only a subset of fields — which is the norm for intelligence analysis. A query for all air tracks within a 50 km radius over a six-hour window only needs the latitude, longitude, altitude, timestamp, and track ID columns, not the full 80-field sensor schema.
Delta Lake adds four capabilities that are critical in a classified environment. First, ACID transactions ensure that concurrent writes from multiple Spark jobs do not produce partial or corrupted datasets. Second, the transaction log provides a complete history of every write, update, and delete operation, which is a requirement for data provenance in classified systems. Third, time-travel queries allow analysts to reconstruct the state of a dataset at any past version still retained in the log, which supports both forensic analysis and after-action review. Fourth, schema enforcement prevents upstream ingestion errors from silently writing malformed records into a production table.
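The relationship between an append-only transaction log and time travel is easier to see in a toy model. The sketch below is not Delta Lake's API or on-disk format (Delta records file-level add and remove actions, not individual rows); `ToyTableLog` and its row-level `put`/`delete` actions are simplifications invented for illustration.

```python
class ToyTableLog:
    """Append-only commit log; any past version can be rebuilt by replay."""

    def __init__(self):
        self._commits = []  # each commit is a tuple of (op, key, row) actions

    def commit(self, actions) -> int:
        """Validate every action first, then append all of them as one commit."""
        actions = tuple(actions)
        for op, _key, _row in actions:
            if op not in ("put", "delete"):
                raise ValueError(f"unknown operation: {op}")
        self._commits.append(actions)  # all-or-nothing: one append per commit
        return len(self._commits) - 1  # version number of this commit

    def snapshot(self, version=None) -> dict:
        """Replay commits 0..version and return the table state at that point."""
        if version is None:
            version = len(self._commits) - 1
        elif not 0 <= version < len(self._commits):
            raise IndexError(f"no such version: {version}")
        state = {}
        for actions in self._commits[: version + 1]:
            for op, key, row in actions:
                if op == "put":
                    state[key] = row
                else:
                    state.pop(key, None)
        return state


log = ToyTableLog()
v0 = log.commit([("put", "t1", {"alt": 1000})])
v1 = log.commit([("put", "t1", {"alt": 1200}), ("put", "t2", {"alt": 300})])
v2 = log.commit([("delete", "t2", None)])

print(log.snapshot(v0))  # state as of the first commit
print(log.snapshot())    # current state
```

The deleted track `t2` is gone from the current snapshot but still recoverable at version 1, which is the property that supports after-action review. It also means a logical delete alone does not remove data from storage.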
Classification isolation
Classification boundaries must be enforced at the storage layer, not merely at the application layer. Each classification tier (Unclassified, Controlled Unclassified Information, Confidential, Secret, Top Secret/SCI) requires physically separate storage buckets or namespaces. Shared buckets with path-based separation are not sufficient — a misconfigured IAM policy or a software bug in the access control layer can expose cross-classification data if objects share the same bucket.
Each classification tier uses a separate data encryption key (DEK) managed by a hardware security module (HSM) or a key management service validated to FIPS 140-2 or FIPS 140-3 Level 3. Encryption is applied server-side at the storage layer so that even removal of the storage media does not expose plaintext data. Key rotation schedules are defined per classification tier and must be automated, because manual key rotation at the frequency required for classified data is operationally impractical.
Data catalog and classification enforcement
A data lake without a catalog is a data swamp. Defense analysts need to discover what datasets exist, what they contain, when they were last updated, and what classification level they carry — before issuing a query that might inadvertently request data above their clearance. A metadata catalog serves as the searchable index of the lake's contents.
Apache Atlas (commonly deployed with Hadoop-ecosystem stacks) and AWS Glue Data Catalog (for cloud or hybrid deployments) are the two most widely used options. Both support schema registration, lineage tracking, and custom metadata attributes. Classification level should be a mandatory schema attribute — not an optional tag — so that every dataset in the catalog has an explicit classification label that the query layer can enforce.
Catalog visibility should itself respect access policy: an analyst cleared for Secret should not be able to browse the catalog entries for Top Secret datasets, even though they could not query the underlying data anyway, because a dataset's name, schema, and update history can themselves be sensitive. This requires integrating the catalog's authorization layer with the organization's identity provider (Active Directory, LDAP, or a SAML-compatible IdP). Every catalog access event should be logged to a central audit sink alongside query events.
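A minimal sketch of clearance-aware catalog browsing follows. It treats classification as a simple linear ordering, which is a simplification (real systems also handle compartments and caveats), and it is not the API of Apache Atlas or AWS Glue; `CatalogEntry`, `visible_entries`, and the dataset names are hypothetical.

```python
from dataclasses import dataclass

# Simplified linear ordering; compartments such as SCI are ignored here.
LEVELS = ["UNCLASSIFIED", "CUI", "CONFIDENTIAL", "SECRET", "TOP SECRET"]
RANK = {name: i for i, name in enumerate(LEVELS)}


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    classification: str  # mandatory attribute, validated on creation

    def __post_init__(self):
        if self.classification not in RANK:
            raise ValueError(f"unknown classification: {self.classification!r}")


def visible_entries(entries, user: str, clearance: str, audit_log: list) -> list:
    """Return only entries at or below the user's clearance, and log the access."""
    allowed = [e for e in entries if RANK[e.classification] <= RANK[clearance]]
    audit_log.append({
        "action": "catalog.list",
        "user": user,
        "clearance": clearance,
        "returned": [e.name for e in allowed],
    })
    return allowed


catalog = [
    CatalogEntry("weather_obs", "UNCLASSIFIED"),
    CatalogEntry("air_tracks", "SECRET"),
    CatalogEntry("sigint_summaries", "TOP SECRET"),
]
audit = []
names = [e.name for e in visible_entries(catalog, "analyst7", "SECRET", audit)]
print(names)
```

Making the classification a validated constructor argument rather than an optional tag means an unlabeled dataset cannot enter the catalog at all.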
Query layer: SQL, batch analytics, and vector search
The query layer is where analysts and downstream systems consume data from the lake. A production defense data lake needs at least three query modalities.
Ad-hoc SQL with Trino
Trino (formerly PrestoSQL) is the standard choice for ad-hoc SQL queries across large Parquet or Delta Lake datasets. Trino's connector architecture allows a single query to join data from multiple sources — the Delta Lake structured zone, a live PostgreSQL operational database, and an Elasticsearch index — in a single SQL statement. For defense analytics, this means an analyst can write a query that correlates historical track data from the lake with live contact reports from the operational picture without exporting data between systems.
Trino's access control layer supports row-level filtering and column masking through connector-level policies. A row filter can restrict a query to only the records that match the analyst's authorized geographic area of responsibility. Column masking can redact sensitive fields — source system identifiers, collection method codes — for analysts whose clearance does not extend to that metadata. All query events are logged to an audit sink that captures the query text, the authenticated user identity, the tables accessed, and the classification level of the returned data.
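The effect of a row filter, a column mask, and a query audit record can be illustrated in plain Python. This models the semantics only; it is not Trino's policy syntax or plugin API. The bounding-box area of responsibility, the column names, and the `sees_source_metadata` flag are assumptions made for the example.

```python
SENSITIVE_COLUMNS = {"source_system_id", "collection_method"}


def in_aor(row: dict, aor: dict) -> bool:
    """Row filter: keep only records inside the user's bounding box."""
    return (aor["lat_min"] <= row["lat"] <= aor["lat_max"]
            and aor["lon_min"] <= row["lon"] <= aor["lon_max"])


def run_query(rows: list, user: dict, audit_log: list) -> list:
    """Apply the row filter, then the column mask, then record an audit event."""
    visible = []
    for row in rows:
        if not in_aor(row, user["aor"]):
            continue
        masked = {
            col: ("REDACTED"
                  if col in SENSITIVE_COLUMNS and not user["sees_source_metadata"]
                  else val)
            for col, val in row.items()
        }
        visible.append(masked)
    audit_log.append({"user": user["name"], "rows_returned": len(visible)})
    return visible


tracks = [
    {"track_id": "A1", "lat": 34.5, "lon": 44.1, "source_system_id": "RDR-7"},
    {"track_id": "B2", "lat": 51.0, "lon": 10.0, "source_system_id": "RDR-9"},
]
analyst = {
    "name": "analyst7",
    "aor": {"lat_min": 30.0, "lat_max": 38.0, "lon_min": 40.0, "lon_max": 48.0},
    "sees_source_metadata": False,
}
audit = []
result = run_query(tracks, analyst, audit)
print(result)
```

The important design point is that both controls run inside the query layer, so the analyst's SQL never needs to mention them and cannot opt out of them.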
Large-scale batch analytics with Spark
Some intelligence analysis tasks are too large for interactive SQL. Pattern-of-life analysis over six months of position data, correlation of signals intelligence with ground movement across an entire theater, or training a machine learning model on labeled track data all require distributed batch processing. Apache Spark running on a YARN or Kubernetes cluster is the standard engine for these workloads.
Spark integrates natively with Delta Lake and can read Parquet directly from S3-compatible storage. For classified environments, Spark jobs should run within classification-level-isolated clusters or namespaces so that a job running at a lower classification level cannot read a Secret dataset, and a Secret-level job cannot write its output to an unclassified location, through a misconfigured path variable. Job execution should be logged with the same audit detail as interactive queries: job owner, classification level of input datasets, classification level of output datasets, and execution timestamp.
Vector search for intelligence documents
Unstructured intelligence documents — reports, transcripts, translated intercepts — do not fit well into SQL query patterns. Analysts need semantic search: "find all reports that discuss supply route disruption near this grid reference" rather than "find all records where document_text LIKE '%supply route%'." Vector embeddings generated by a language model and stored in a vector database (pgvector on PostgreSQL, or a dedicated service like Qdrant for on-premises deployment) enable this type of semantic retrieval.
The vector search layer must respect classification boundaries in the same way as the SQL and Spark layers. Embedding generation pipelines should run within the classification tier of the source documents, and the resulting vector indexes should be isolated per classification level. Cross-classification semantic search — finding unclassified documents that are topically similar to a classified query — requires explicit cross-domain solution (CDS) architecture review and is not a default capability.
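Per-tier index isolation can be sketched as one index per classification level, with every search scoped to exactly one of them. The example below uses hand-written three-dimensional vectors and brute-force cosine similarity purely for illustration; it is not the pgvector or Qdrant API, and `TieredVectorStore` and the document IDs are hypothetical.

```python
import math


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TieredVectorStore:
    """One independent index per classification tier; no cross-tier search."""

    def __init__(self):
        self._indexes = {}  # tier -> list of (doc_id, vector)

    def add(self, tier: str, doc_id: str, vector) -> None:
        self._indexes.setdefault(tier, []).append((doc_id, vector))

    def search(self, tier: str, query, k: int = 3) -> list:
        """Rank documents in a single tier's index by cosine similarity."""
        scored = [(cosine(query, vec), doc_id)
                  for doc_id, vec in self._indexes.get(tier, [])]
        scored.sort(reverse=True)
        return [doc_id for _score, doc_id in scored[:k]]


store = TieredVectorStore()
# Toy embeddings; a real pipeline would generate these with a language model
# running inside the same classification tier as the source documents.
store.add("SECRET", "rpt-1", [1.0, 0.0, 0.0])
store.add("SECRET", "rpt-2", [0.0, 1.0, 0.0])
store.add("UNCLASSIFIED", "news-1", [1.0, 0.0, 0.0])

secret_hits = store.search("SECRET", [0.9, 0.1, 0.0])
print(secret_hits)
```

The search method has no way to name more than one tier, so a cross-classification query is impossible by construction rather than merely forbidden by policy.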
Retention, purge, and audit trail
Data in a defense data lake does not accumulate indefinitely. Classification-driven retention policies define how long each type of data is kept at each classification level. Operational sensor data might have a 90-day retention at Secret level; strategic intelligence products might be retained for 10 years. Retention policies are defined in a policy registry and enforced by automated lifecycle management jobs that run on a defined schedule.
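A policy registry and the lifecycle job that evaluates it might look like the sketch below. The registry keys, partition fields, and paths are hypothetical, and ten years is approximated as 3,650 days for simplicity; the 90-day and 10-year figures are the examples given above.

```python
from datetime import date, timedelta

# Hypothetical policy registry: (classification, data type) -> retention in days.
RETENTION_DAYS = {
    ("SECRET", "sensor"): 90,
    ("SECRET", "strategic"): 3650,  # roughly 10 years
}


def expired_partitions(partitions: list, today: date) -> list:
    """Return paths of partitions whose age exceeds their retention period."""
    expired = []
    for part in partitions:
        days = RETENTION_DAYS.get((part["classification"], part["data_type"]))
        if days is None:
            continue  # no rule registered: never purge by default
        if today - part["partition_date"] > timedelta(days=days):
            expired.append(part["path"])
    return expired


partitions = [
    {"path": "secret/sensor/2024-02-01", "classification": "SECRET",
     "data_type": "sensor", "partition_date": date(2024, 2, 1)},
    {"path": "secret/sensor/2024-04-15", "classification": "SECRET",
     "data_type": "sensor", "partition_date": date(2024, 4, 15)},
    {"path": "secret/strategic/2015-01-01", "classification": "SECRET",
     "data_type": "strategic", "partition_date": date(2015, 1, 1)},
]
to_purge = expired_partitions(partitions, today=date(2024, 6, 1))
print(to_purge)
```

Skipping partitions with no registered rule is a deliberate choice in this sketch: a missing policy should surface as a governance gap, not as silent data loss.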
Secure deletion for classified data cannot rely on standard filesystem deletion or object expiration. Standard deletion marks storage blocks as available for reuse but does not overwrite them. For classified data, the required approach is cryptographic erasure, also called crypto-shredding: the data is encrypted under a DEK, and when a retention policy triggers deletion, that DEK and every backup copy of it are destroyed. For this to remove only the expired data, keys must be scoped to the unit of deletion, for example one DEK per dataset partition or retention window, each wrapped by the tier-level key; destroying a key shared by an entire classification tier would render everything in that tier unreadable. Without the DEK, the stored ciphertext is computationally indistinguishable from random noise. This approach scales to petabyte datasets without the performance penalty of multi-pass overwrite procedures.
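The mechanics of crypto-shredding can be shown with a toy example. Two loud caveats: the cipher below is a SHAKE-256 keystream XOR written only so the sketch runs on the standard library, and it must not be used to protect real data (a production system would use an authenticated cipher such as AES-GCM with keys held in an HSM or KMS). The sketch also assumes one key per partition so that a purge affects only that partition; `KeyStore` and the partition names are hypothetical.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # TOY keystream for illustration only; stands in for a real cipher.
    return hashlib.shake_256(key + nonce).digest(length)


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


class KeyStore:
    """Stand-in for an HSM or KMS holding one key per purge unit."""

    def __init__(self):
        self._keys = {}

    def create(self, partition: str) -> bytes:
        self._keys[partition] = secrets.token_bytes(32)
        return self._keys[partition]

    def get(self, partition: str) -> bytes:
        return self._keys[partition]  # raises KeyError once destroyed

    def destroy(self, partition: str) -> None:
        del self._keys[partition]


keys = KeyStore()
storage = {}  # partition -> ciphertext (stand-in for object storage)
for partition, data in [("2024-01", b"january tracks"), ("2024-02", b"february tracks")]:
    storage[partition] = toy_encrypt(keys.create(partition), data)

keys.destroy("2024-01")  # purge: the ciphertext stays, the key is gone

print(toy_decrypt(keys.get("2024-02"), storage["2024-02"]))
```

Note that the purge never touches `storage`. The January ciphertext is still physically present, but with its key destroyed there is nothing left that can decrypt it, while February remains readable.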
Every purge event must produce an immutable audit log entry. The audit entry must record the object keys or partition identifiers that were purged, the retention rule that triggered the purge, the timestamp of key destruction, and the identity of the automated or human principal that authorized the operation. The audit log itself must be stored in a write-once, tamper-evident configuration — append-only S3 bucket with object lock, or a dedicated audit log service with cryptographic chaining.
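Cryptographic chaining, one of the two tamper-evidence options mentioned above, can be sketched as a hash chain in which each entry commits to the one before it. The `AuditChain` class and the event field names below are illustrative assumptions that mirror the fields listed in the text; this is not the format of any particular audit log service.

```python
import hashlib
import json


class AuditChain:
    """Append-only log where each entry's hash covers the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(prev_hash: str, event: dict) -> str:
        body = json.dumps(event, sort_keys=True)  # canonical serialization
        return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

    def append(self, event: dict) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        entry = {"event": event, "prev": prev_hash,
                 "hash": self._digest(prev_hash, event)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if self._digest(prev_hash, entry["event"]) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True


chain = AuditChain()
chain.append({
    "purged": ["secret/sensor/2024-02-01"],
    "rule": "SECRET/sensor/90d",
    "key_destroyed_at": "2024-06-01T02:00:00Z",
    "principal": "svc-lifecycle",
})
chain.append({
    "purged": ["secret/sensor/2024-02-02"],
    "rule": "SECRET/sensor/90d",
    "key_destroyed_at": "2024-06-02T02:00:00Z",
    "principal": "svc-lifecycle",
})
print(chain.verify())
```

A chain like this detects modification but does not prevent it, so in practice the latest hash would also be anchored somewhere the log writer cannot alter, such as write-once storage.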
For more detail on how message queues support the streaming ingestion layer described here, see our article on message queue architecture for high-throughput defense data. For the fusion patterns that operate over data once it reaches the structured zone, see our guide to multi-sensor fusion architecture.
Operational considerations
A defense data lake is not a set-and-forget infrastructure deployment. Several operational concerns deserve explicit attention during architecture and procurement.
Air-gap compatibility. Many classified deployments cannot maintain persistent internet connectivity. All components of the lake stack — Kafka, Spark, Trino, the catalog service, the vector store — must be deployable from local package mirrors and container registries. Dependency on public package repositories during runtime is a security and availability risk in classified environments.
Schema evolution governance. Sensor firmware updates, new platform integrations, and changing reporting requirements will alter data schemas over time. Schema changes in the structured zone must go through a change control process that evaluates downstream impact: does the change break existing Trino queries? Does it require a backfill of historical data? Delta Lake's schema evolution controls (mergeSchema option) and Iceberg's built-in schema versioning provide the technical mechanisms, but the governance process around them is equally important.
Performance monitoring per classification tier. Query performance may differ significantly between classification tiers: an analyst running queries against a petabyte-scale Secret dataset is operating in a different performance envelope than an analyst querying a small Unclassified dataset. Monitoring query latency, data scan volume, and cluster utilization per classification tier allows capacity planning to track actual usage patterns rather than theoretical peaks.
Corvus.Head is built to integrate directly with multi-source defense data lakes — ingesting sensor feeds, fusing tracks across classification boundaries where cross-domain solutions permit, and surfacing actionable analytics to operators and intelligence teams in real time.