1. The Air-Gapped K8s Reality

"Air-gapped" rarely means what marketing decks imply. In a classified deployment, an air gap is an accredited boundary: no routable path to the public internet, no DNS resolution outward, no implicit egress for package managers or container runtimes. Every artifact crossing the boundary is logged, scanned, and approved. The cluster behaves as if the internet does not exist — because for the accreditation authority, it does not.

The first reality engineers underestimate is that Kubernetes assumes connectivity. kubelet pulls from registry.k8s.io. Helm charts reference quay.io and docker.io. CNI plugins fetch binaries on install. CSI drivers fetch sidecar images. Operators reconcile by pulling new image digests. A vanilla kubeadm init on a disconnected host fails in the first minute.
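A quick way to see how much of a workload assumes connectivity is to scan its rendered manifests for image references that do not point at the internal registry. The following is a minimal sketch, not a complete parser: it assumes plain `image:` lines in YAML text, and the function name and the `harbor.enclave.mil` hostname in the usage note are illustrative.

```python
import re

# Matches "image: <ref>" lines, with or without a leading list dash or quotes.
IMAGE_LINE = re.compile(r"^\s*(?:-\s*)?image:\s*['\"]?([^\s'\"]+)", re.MULTILINE)

def external_images(manifest_text, internal_registry):
    """Return image references that would not resolve inside the enclave."""
    found = []
    for ref in IMAGE_LINE.findall(manifest_text):
        # The registry host is the first path component. A bare name such as
        # "nginx:1.27" has no host and implicitly resolves to docker.io.
        host = ref.split("/", 1)[0] if "/" in ref else ""
        if host != internal_registry:
            found.append(ref)
    return found
```

Run against the output of a chart render, `external_images(rendered, "harbor.enclave.mil")` should return an empty list before anything is carried across the boundary.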

The second reality is the recurring-update problem. A cluster does not stand up once and stay still. Kubernetes minor releases land every four months. CVEs in containerd, runc, CoreDNS, and the kernel arrive weekly. Accreditation reviewers ask one question above all others: show me your patching cadence and the evidence trail. An air-gapped cluster without a documented, repeatable update pipeline is a cluster that will be denied authority to operate at the next review.

Everything that follows is a consequence of these two facts: nothing pulls itself, and the cluster must be patchable for years.

2. Distribution Choice — RKE2 vs K3s vs kubeadm vs OpenShift

RKE2 (Rancher Kubernetes Engine 2). SUSE/Rancher's hardened distribution. RKE2 1.31 ships a CIS-1.9 profile out of the box: kube-apiserver is configured with the audit policy, admission controllers, and TLS posture the CIS benchmark requires. The release artifacts bundle the system images (kube-apiserver, kube-controller-manager, etcd, CoreDNS, the default Canal CNI, the ingress controller) into a single rke2-images.linux-amd64.tar.zst, with separate per-CNI tarballs such as rke2-images-cilium for clusters that run a non-default CNI. For air-gap, this is the right answer 80% of the time.

K3s. The lightweight sibling. Single binary, embedded etcd or sqlite, ~50 MB. Excellent for edge nodes inside the enclave (forward command posts, mobile shelters, sensor gateways) where the cluster runs on a single rugged appliance. K3s 1.31 has the same air-gap tarball pattern as RKE2 but a smaller component set and no hardened profile preset — you bring your own admission policy.

kubeadm. Upstream Kubernetes. Maximum flexibility, maximum work. Every component image must be mirrored manually, every CNI installed manually, every CIS control applied by you. Choose kubeadm only when the accreditation authority forbids vendor-distributed binaries (rare but real on some national programs).

OpenShift. Red Hat's distribution. Stronger air-gap tooling (oc mirror, Operator Lifecycle Manager with offline catalogs) and a serious accreditation footprint (FIPS 140-3, CC EAL4+ on RHEL). The trade-off is licensing: OpenShift subscriptions are expensive and the platform footprint is heavy. For programs that already have Red Hat enterprise agreements, this is the path of least resistance.

For most defense engagements we recommend RKE2 1.31 with Rancher 2.10 as the multi-cluster management plane, sitting inside the classified enclave. K3s 1.31 fills the edge slot. Kubeadm and OpenShift are program-specific choices driven by procurement and accreditation, not engineering preference.

3. Offline Image Registry

The registry is the heart of an air-gapped Kubernetes platform. Every pod pulls from it. If it goes down, the cluster freezes on the next image pull. If it is compromised, every workload is compromised.

Harbor 2.11. The CNCF-graduated, enterprise-grade registry. Native Trivy integration scans every pushed image for CVEs against an offline vulnerability database that you sync in via the same approved transfer process you use for application images. Harbor supports cosign signature verification at pull time, project-scoped RBAC, replication policies, and a robot-account model that works cleanly with admission webhooks. For a primary registry inside the enclave, Harbor is the default.

zot. The OCI-native registry, shipped as a single Go binary. Far lighter than Harbor (no Postgres, no Redis, no Trivy sidecar). zot 2.1 supports OCI 1.1 referrers and cosign, and its small footprint suits forward nodes where Harbor would be over-provisioned. Pair zot at the edge with Harbor at the central site, replicating one-way.

Sonatype Nexus. The polyglot artifact manager. If the program already standardised on Nexus for Maven, npm, and APT mirrors, adding Docker repositories keeps everything in one tool and one set of audit logs. Nexus' container scanning is weaker than Harbor's, so it pairs with a separate scanning gate in the ingest pipeline.

The pattern most large programs converge on is the registry of registries: one central Harbor inside the enclave as the source of truth, one zot per site or edge cluster, and a documented replication topology. Application clusters never pull from the central Harbor directly — they pull from the local zot mirror. Failure domains stay small. Network round trips stay short. The accreditation diagram stays drawable on one page.

4. Sneakernet and One-Way Diodes

Images, Helm charts, vulnerability databases, OS package mirrors, GitOps repositories — everything has to physically cross the boundary. Two transport patterns dominate.

Sneakernet. Approved removable media, hand-carried. The media is wiped, written, hash-verified, sealed, signed for at the boundary, hash-verified again on the high side, ingested into a staging registry, scanned, manually approved, then promoted to the production registry. The full cycle takes hours to days. It is slow, auditable, and survives any accreditation review.
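The two hash-verification steps can be sketched as a digest manifest written on the low side and re-checked on the high side. This is a minimal illustration assuming SHA-256 and a bundle laid out as a directory tree; the function names are hypothetical, and a real transfer would also sign the manifest itself.

```python
import hashlib
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large image tarballs do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def build_manifest(bundle_dir):
    """Low side: map each file's relative path to its SHA-256 digest."""
    root = Path(bundle_dir)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def verify_bundle(bundle_dir, manifest):
    """High side: report every path that is missing, altered, or unexpected."""
    actual = build_manifest(bundle_dir)
    problems = {}
    for name in manifest.keys() | actual.keys():
        if name not in actual:
            problems[name] = "missing"
        elif name not in manifest:
            problems[name] = "unexpected"
        elif actual[name] != manifest[name]:
            problems[name] = "altered"
    return problems
```

An empty result from `verify_bundle` is the evidence record for the "hash-verified on the high side" step; anything else stops the ingest.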

One-way data diodes. Hardware-enforced unidirectional transfer (Owl Cyber Defense, Fox-IT DataDiode, Advenica). Bandwidth is real (1–10 Gbps on current hardware) and the lack of a return path is enforced in fibre, not configuration. Diodes work brilliantly for telemetry leaving the high side; for images entering, the absence of acknowledgement complicates retries, so most programs pair a diode with a strict resend-on-checksum-failure protocol layered on top.
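One way to picture a resend-on-checksum-failure layer is as checksummed, numbered frames that the sender repeats and the receiver deduplicates. The sketch below is an assumption about how such a layer might look, not any vendor's protocol: the frame layout and class names are hypothetical, and because the diode has no return path, the list of missing frames has to reach the low side through a separate, procedurally approved channel, or the sender simply repeats every pass a fixed number of times.

```python
import hashlib
from dataclasses import dataclass

def frame_digest(seq, total, payload):
    # Cover the header fields too, so a corrupted sequence number is detected.
    return hashlib.sha256(f"{seq}/{total}:".encode() + payload).hexdigest()

@dataclass(frozen=True)
class Frame:
    seq: int
    total: int
    payload: bytes
    checksum: str

def make_frames(data, size):
    """Sender side: split a bundle into numbered, checksummed frames."""
    chunks = [data[i:i + size] for i in range(0, len(data), size)] or [b""]
    return [Frame(i, len(chunks), c, frame_digest(i, len(chunks), c))
            for i, c in enumerate(chunks)]

class Receiver:
    """High side: keep the first valid copy of each frame across passes."""

    def __init__(self):
        self.total = None
        self.good = {}

    def accept(self, frame):
        if frame_digest(frame.seq, frame.total, frame.payload) != frame.checksum:
            return False  # corrupt frame: drop it and wait for a later pass
        self.total = frame.total
        self.good.setdefault(frame.seq, frame.payload)
        return True

    def missing(self):
        if self.total is None:
            return None  # nothing valid received yet
        return [i for i in range(self.total) if i not in self.good]

    def assemble(self):
        if self.missing() != []:
            raise ValueError("transfer incomplete")
        return b"".join(self.good[i] for i in range(self.total))
```

The receiver never needs to talk back for the happy path: a second blind pass from the sender fills whatever the first pass lost.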

Both patterns share the same staged acceptance workflow: receive → quarantine registry → automated scan (Trivy, ClamAV, YARA) → content disarm and reconstruction for any non-container artifact → manual analyst approval → promotion to production registry. Skipping the quarantine stage is the single most common cause of accreditation findings.
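The staged acceptance workflow is easiest to enforce when the stage order is encoded rather than remembered. A minimal sketch, assuming hypothetical stage names that mirror the sequence above, with content disarm applied only to non-container artifacts:

```python
from dataclasses import dataclass, field

PIPELINE = ["received", "quarantined", "scanned", "disarmed", "approved", "promoted"]

@dataclass
class Artifact:
    name: str
    is_container: bool
    stage: str = "received"
    history: list = field(default_factory=list)

    def stages(self):
        # Content disarm and reconstruction applies only to non-container artifacts.
        return [s for s in PIPELINE if not (self.is_container and s == "disarmed")]

    def advance(self, to_stage, actor):
        """Move exactly one stage forward, recording who authorised the step."""
        order = self.stages()
        position = order.index(self.stage)
        expected = order[position + 1] if position + 1 < len(order) else None
        if to_stage != expected:
            raise ValueError(f"{self.name}: cannot move from {self.stage} to {to_stage}")
        self.history.append((self.stage, to_stage, actor))
        self.stage = to_stage
```

Because `advance` refuses any jump, an artifact cannot reach "promoted" without a quarantine entry in its history, which is the finding reviewers look for first.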

5. GitOps in an Air Gap

GitOps works inside the enclave — provided every reference is internal. ArgoCD 2.13 and Flux 2.4 both run happily air-gapped. The reconciliation loop does not care that the Git server is a Gitea or GitLab instance hosted on the high side rather than github.com. What breaks is every Helm chart that references an external chart repository, every Kustomize overlay that pulls a base from a public Git remote, and every operator that watches an external image stream.

The manifest mirror pattern fixes this. A scheduled job on the low side pulls upstream Helm charts, container images, and Git repositories; rewrites every external reference (repository: docker.io/bitnami/postgresql becomes repository: harbor.enclave.mil/bitnami/postgresql); commits to an internal Git mirror; and exports the bundle for sneakernet. Inside the enclave, ArgoCD points at the mirror exclusively. There is no fallback path to the internet because there is no internet.
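The reference-rewriting step can be sketched as a small pure function. This is an illustration under stated assumptions: it follows the example above by dropping the upstream registry host and keeping the repository path, and it expands bare Docker Hub names to the `library/` namespace. The function name is hypothetical, and a real mirror job would also need a rule for two upstream registries that publish the same repository path.

```python
def rewrite_image(ref, internal_registry="harbor.enclave.mil"):
    """Rewrite an image reference so it resolves against the internal registry."""
    first, sep, rest = ref.partition("/")
    # The first component is a registry host only if it looks like one
    # (contains a dot or a port, or is "localhost"); otherwise Docker Hub is implied.
    has_host = bool(sep) and ("." in first or ":" in first or first == "localhost")
    host, path = (first, rest) if has_host else ("docker.io", ref)
    if host == internal_registry:
        return ref  # already internal, leave untouched
    if host == "docker.io" and "/" not in path:
        path = "library/" + path  # official images live under library/
    return f"{internal_registry}/{path}"
```

For example, `rewrite_image("docker.io/bitnami/postgresql")` yields `harbor.enclave.mil/bitnami/postgresql`, and a tag or digest suffix passes through unchanged.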

Drift detection without phone-home is straightforward: ArgoCD compares the manifests in the internal Git mirror with the live in-cluster state and never consults an external service. The only feature you lose is automated upstream notification of new chart versions; that detection moves to the low-side mirror job, which is where it belonged anyway. For the broader pattern, see our walkthrough on DevSecOps and zero trust in the defense pipeline.

6. Supply-Chain Integrity

An air gap stops outbound exfiltration; it does not stop a malicious image that arrived through the approved channel. Supply-chain integrity is the second line of defence.

cosign signing. Every image promoted to the production registry is signed with a cosign key whose root sits in the enclave HSM. The signing happens at the promotion step, after scanning and analyst approval. The signature attests "this image was vetted through our process," not "this image is upstream-authentic" — the upstream provenance is verified separately at the low-side ingest gate.

Kyverno or OPA Gatekeeper at admission. A ClusterPolicy rejects any pod whose image is not signed by a key in the enclave trust bundle or whose digest is not pinned. Tag-based references (:latest, :v1) are blocked outright: only @sha256:... digests pass admission. Kyverno 1.13 is the lighter-weight choice; Gatekeeper suits programs already invested in Rego.
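The decision such a policy encodes is small enough to state as code. The sketch below is not Kyverno or Rego syntax; it is a plain illustration of the rule, and the `signed_digests` set is a stand-in assumption for real cosign verification against the enclave trust bundle.

```python
import re

# A pinned reference is "<repository>@sha256:<64 lowercase hex characters>".
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")

def admit(image_ref, signed_digests):
    """Return (allowed, reason) for a single container image reference."""
    if not DIGEST_REF.match(image_ref):
        return False, "image must be pinned by sha256 digest"
    digest = image_ref.rsplit("@", 1)[1]
    if digest not in signed_digests:
        return False, "digest has no signature from the enclave trust bundle"
    return True, "ok"
```

Either failure is enough to reject the pod: a signed image referenced by tag is refused, and so is a pinned digest that was never signed.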

SBOM verification. Every signed image carries an attached SPDX or CycloneDX SBOM as an OCI referrer. Admission policy verifies the SBOM signature and optionally checks for forbidden components (e.g. log4j 2.x below 2.17, any package on the program's banned list). For the wider picture, see SBOM enforcement in defense pipelines.
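A forbidden-component check over a CycloneDX JSON SBOM can be sketched as follows. It assumes the standard top-level `components` array with `name` and `version` fields; the function names, the name-prefix match for log4j, and the shape of the banned list are illustrative choices, not a description of any particular admission controller.

```python
import json
import re

def version_tuple(version):
    """Turn '2.14.1' into (2, 14, 1); non-numeric parts are ignored."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])

def forbidden_components(sbom_json, banned_names=frozenset()):
    """Return (name, version, reason) for every component that violates policy."""
    sbom = json.loads(sbom_json)
    hits = []
    for component in sbom.get("components", []):
        name = component.get("name", "")
        version = component.get("version", "")
        if name in banned_names:
            hits.append((name, version, "on banned list"))
        # Flag log4j 2.x releases older than 2.17, as in the example above.
        elif name.startswith("log4j") and (2,) <= version_tuple(version) < (2, 17):
            hits.append((name, version, "log4j 2.x below 2.17"))
    return hits
```

An empty list means the SBOM passes this gate; the signature on the SBOM still has to be verified separately before its contents are trusted.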

Key insight: The signing key in the enclave HSM is the trust anchor for the entire cluster. Its key ceremony, rotation schedule, and split-knowledge custody are accreditation artifacts in their own right. Build them before the cluster, not after.

7. Day-2 Operations

CVE patching cadence is where most air-gapped programs lose the accreditation conversation. The reviewer's question is simple: a critical CVE was disclosed yesterday — when does it land in your production cluster, and where is the evidence?

A defensible answer has three tiers. Hot-fix for critical CVEs with active exploitation: the low-side ingest pipeline accepts an emergency patch within 24 hours, sneakernet runs out-of-cycle, the production registry receives the patched image within 72 hours, and emergency-change records cover the deployment. Scheduled window for high and medium CVEs: monthly maintenance windows pull a curated batch through the full staged-acceptance pipeline. Quarterly upgrades for the Kubernetes control plane itself: new RKE2 releases land in a test cluster first, then production, with a documented rollback plan.
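The hot-fix tier is the one with hard clocks, so it is worth computing the deadlines rather than estimating them. The sketch below makes two assumptions that the text leaves open: both the 24-hour and 72-hour limits are measured from disclosure, and anything that is not a critical CVE under active exploitation waits for the monthly window. The function names are hypothetical.

```python
from datetime import datetime, timedelta

def patch_plan(severity, actively_exploited, disclosed):
    """Map a CVE to a patching tier and, for hot-fixes, hard deadlines."""
    if severity == "critical" and actively_exploited:
        return {
            "tier": "hot-fix",
            "ingest_by": disclosed + timedelta(hours=24),
            "registry_by": disclosed + timedelta(hours=72),
        }
    # Assumption: everything else, including criticals with no known
    # exploitation, goes through the monthly maintenance window.
    return {"tier": "scheduled-window"}

def met_deadline(plan, registry_time):
    """True when the patched image reached the production registry in time."""
    deadline = plan.get("registry_by")
    return deadline is None or registry_time <= deadline
```

Storing the plan alongside the actual registry timestamp gives the reviewer the evidence trail in one record.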

The cluster-lifecycle plan that satisfies accreditation reviewers is not a one-page diagram. It is a written runbook covering: control-plane upgrade procedure, worker-node upgrade procedure, etcd backup and restore drill (executed quarterly, not theoretical), CNI upgrade procedure, registry replication failure procedure, and the named individuals responsible for each. Reviewers read these. They notice when the runbook references a tool the team does not actually use.

For the broader cloud-architecture context these clusters live inside, see sovereign cloud architecture for defense.

8. Multi-Enclave Federation

A defense organisation rarely runs one cluster. There is an unclassified enclave, a NATO SECRET enclave, a national SECRET enclave, occasionally a TOP SECRET enclave for specific programs. The instinct is to "federate" them through Kubernetes Federation v2 or a similar mechanism. The instinct is wrong.

Federation across classification boundaries is forbidden by every accreditation framework we have worked under. The correct pattern is separate clusters per classification, linked only by cross-domain gateways. Each cluster has its own control plane, its own registry, its own GitOps repository, its own signing keys. Manifests for shared workloads are duplicated — yes, duplicated, with all the version-drift risk that implies — because the alternative is a federated control plane that breaches the boundary.

GitOps duplication strategy is the operational discipline that makes this manageable. The same low-side mirror that produces the unclassified bundle produces a NATO SECRET bundle and a national SECRET bundle, each with the same upstream content but distinct signing keys and distinct registry destinations. The Git repositories diverge only on enclave-specific configuration (registry hostnames, network policies, secrets references). Drift between enclaves is detected by a low-side comparison tool that reads the low-side manifests and the redacted versions of the high-side manifests that come back through an outbound diode.
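The comparison tool reduces to a structural diff that ignores the fields allowed to differ. A minimal sketch, assuming manifests already parsed into nested dictionaries and lists; the key names in `ENCLAVE_SPECIFIC` are hypothetical placeholders for whatever the program declares enclave-specific.

```python
# Hypothetical keys that are expected to differ between enclaves.
ENCLAVE_SPECIFIC = {"registry", "networkPolicy", "secretRef"}

def normalise(node):
    """Strip enclave-specific keys at every depth."""
    if isinstance(node, dict):
        return {k: normalise(v) for k, v in node.items() if k not in ENCLAVE_SPECIFIC}
    if isinstance(node, list):
        return [normalise(v) for v in node]
    return node

def _diff(a, b, path):
    if isinstance(a, dict) and isinstance(b, dict):
        out = []
        for key in sorted(a.keys() | b.keys()):
            child = f"{path}.{key}" if path else key
            if key not in a or key not in b:
                out.append(child)  # present on one side only
            else:
                out.extend(_diff(a[key], b[key], child))
        return out
    return [] if a == b else [path]

def drift(low_side, high_side):
    """Return dotted paths where two manifests differ after normalisation."""
    return _diff(normalise(low_side), normalise(high_side), "")
```

A differing registry hostname produces no finding, while a differing image digest or replica count does, which is the distinction the duplication strategy depends on.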

Cross-domain message flow — telemetry up, commands down, intelligence sideways — runs through accredited cross-domain solutions (Forcepoint Trusted Gateway, Owl, national equivalents), not through Kubernetes-level integration. The cluster does not know it is multi-enclave. Each cluster believes it is alone, and that is the property that lets it be accredited. For the network model that wraps these enclaves, see zero trust on military networks.