Real-time SIGINT answers the question of what is happening now. Forensic SIGINT answers a different and often harder question: what happened then, and can you prove it? Answering that requires a wideband recording and spectrum-archiving capability – a system that captures raw IQ at full fidelity, stores it with verifiable provenance, indexes it so an analyst can find a signal that occurred weeks ago, and replays it through the same processing chain to reproduce a result. This article walks through the architecture of such a system: the recording front end, IQ data storage, the metadata index, retention policy, and the replay path that turns an archive into a forensic instrument.
Why archive raw IQ at all
Most operational SIGINT pipelines discard raw samples within seconds. They channelize the band, detect signals, classify them, and keep only the derived products – detection records, demodulated content, geolocation fixes. That is the right design for a real-time mission, where the question is always about the present and storage is finite.
Forensic work breaks that assumption. When a new emitter of interest appears, analysts want to look back and ask: was this signal present last week, and where? When a detection is challenged, an analyst must reproduce it from the original samples. When a novel waveform is discovered, the only way to study it is to replay the raw capture through new processing that did not exist when the signal was recorded. None of these are possible if the system kept only derived products. Raw IQ is the ground truth – everything else is a lossy interpretation of it, and the interpretation can always be redone if the samples survive.
The cost of that ground truth is data volume, and it is severe. A single 100 MHz channel at 16-bit complex samples produces roughly 400 MB every second – about 1.44 TB per hour. A four-channel wideband collector running continuously can exceed 100 TB per day. The entire architecture of a spectrum archive is, in effect, a set of strategies for getting forensic value out of raw IQ without paying to keep all of it forever.
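The figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the complex sample rate equals the channel bandwidth (100 MS/s for 100 MHz; real digitizers often oversample somewhat) and uses decimal units. The function name is illustrative.

```python
def iq_bytes_per_second(sample_rate_hz: float, bits_per_component: int) -> float:
    """Raw data rate of a complex (I + Q) sample stream, in bytes per second."""
    bytes_per_sample = 2 * bits_per_component / 8  # one I and one Q component
    return sample_rate_hz * bytes_per_sample


rate = iq_bytes_per_second(100e6, 16)        # one 100 MHz channel, 16-bit I and Q
per_hour_tb = rate * 3600 / 1e12
per_day_tb_4ch = 4 * rate * 86400 / 1e12

print(f"{rate / 1e6:.0f} MB/s")                     # 400 MB/s
print(f"{per_hour_tb:.2f} TB/hour")                 # 1.44 TB/hour
print(f"{per_day_tb_4ch:.1f} TB/day, 4 channels")   # 138.2 TB/day, 4 channels
```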
The recording front end
The recording front end sits directly behind the digitizer and has one job: move complex samples off the wire and onto durable storage without dropping any. At wideband rates this is a sustained-throughput problem, not a compute problem. The standard pattern is a lock-free ring buffer in host memory that the digitizer DMA fills and a writer thread drains, writing fixed-size segment files to a write-optimized storage tier (NVMe in a striped array, or a parallel filesystem for very high aggregate rates).
Two properties must be designed in from the start. First, disciplined timing. Every sample must be traceable to an accurate timestamp derived from a GPS-disciplined oscillator or a PTP (IEEE 1588) reference; in practice the timestamp is typically recorded for the first sample of each segment, and later samples are located by counting at the known sample rate. Forensic correlation across collectors – and any future geolocation work – depends on timestamps that are accurate to better than a microsecond and provably traceable to a reference. Second, integrity at write time. A cryptographic hash (SHA-256 is typical) is computed over each segment as it is written and stored in the segment's metadata. This fixes the capture's integrity at the moment of recording, which is the foundation of any later evidentiary claim.
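The write path can be sketched as a function that hashes each chunk as it goes to disk. This is an illustration under stated assumptions, not a production writer: the chunks stand in for blocks drained from the ring buffer, `start_time_ns` stands in for a timestamp read from the disciplined clock, and the sidecar JSON layout and all names are hypothetical.

```python
import hashlib
import json
import os
import tempfile


def write_segment(path, chunks, first_sample_index, start_time_ns):
    """Write IQ byte chunks to one segment file, hashing them as they are written."""
    digest = hashlib.sha256()
    byte_count = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            digest.update(chunk)      # hash exactly the bytes handed to the file
            byte_count += len(chunk)
        f.flush()
        os.fsync(f.fileno())          # ask the OS to push the data to durable storage
    sidecar = {                       # hypothetical per-segment metadata layout
        "first_sample_index": first_sample_index,
        "start_time_ns": start_time_ns,
        "byte_count": byte_count,
        "sha256": digest.hexdigest(),
    }
    with open(path + ".json", "w") as f:
        json.dump(sidecar, f)
    return sidecar


# Two small fake chunks stand in for blocks drained from the ring buffer.
with tempfile.TemporaryDirectory() as tmp:
    info = write_segment(
        os.path.join(tmp, "seg_000000.iq"),
        [b"\x01\x00\x02\x00", b"\x03\x00\x04\x00"],
        first_sample_index=0,
        start_time_ns=1_700_000_000_000_000_000,
    )
    print(info["byte_count"], info["sha256"][:16])
```

Hashing inside the write loop means the digest covers the bytes as they were recorded, with no second pass over the file.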
The recording front end also writes a self-describing metadata record for each capture. The IQ-to-intelligence processing that consumes these recordings is the same channelization-and-detection chain described in our walkthrough of SDR signal processing pipelines – the archive simply inserts a durable storage stage between collection and processing so the pipeline can run again later against stored samples.
IQ data storage and file format
The choice of storage format determines whether a recording is still usable in five years. The de facto open standard is SigMF (Signal Metadata Format), which pairs a binary IQ data file with a JSON metadata file. The metadata describes the sample rate, center frequency, datatype (e.g. complex 16-bit integer or 32-bit float), hardware, capture start time, and a list of annotations marking regions of interest within the file. Because the format is self-describing and non-proprietary, a SigMF recording remains interpretable without the original collection software.
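To make the format concrete, here is a minimal sketch of the metadata that would sit in a `.sigmf-meta` file beside its `.sigmf-data` file. The field names come from the SigMF core namespace; every value (frequency, timestamp, hardware string, annotation) is invented for illustration.

```python
import json

meta = {
    "global": {
        "core:datatype": "ci16_le",        # complex 16-bit integer, little-endian
        "core:sample_rate": 100e6,
        "core:version": "1.0.0",
        "core:hw": "example digitizer",    # illustrative value
    },
    "captures": [
        {
            "core:sample_start": 0,
            "core:frequency": 2.44e9,
            "core:datetime": "2024-01-01T00:00:00Z",
        }
    ],
    "annotations": [
        {   # one region of interest inside the recording
            "core:sample_start": 1_000_000,
            "core:sample_count": 50_000,
            "core:freq_lower_edge": 2.4395e9,
            "core:freq_upper_edge": 2.4405e9,
            "core:label": "unknown narrowband",
        }
    ],
}

text = json.dumps(meta, indent=2)
print(text)
```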
Separating raw samples from derived metadata
A spectrum archive maintains two physically distinct data stores. The first is the raw IQ store: large, write-once segment files in SigMF, expensive to keep, accessed rarely and only by byte offset. The second is the derived-metadata store: small structured records describing what was found in the IQ – detections, features, spectrograms, emitter identifications. Analysts query the derived store constantly; they touch the raw store only when they need to replay actual samples.
This separation drives the technology choice. Raw IQ lives on object storage or a parallel filesystem optimized for sequential throughput. Derived metadata lives in a database tuned for time-range and frequency-range queries – frequently a columnar file format such as Apache Parquet for the analytic feature tables, fronted by a time-series or search index for interactive lookup. The two are joined by a key: each metadata record carries the path of its source IQ segment and the byte offset of the sample range it describes.
Bit depth is a deliberate trade between fidelity and cost. Sixteen-bit complex samples preserve roughly 96 dB of dynamic range, enough to capture a weak signal sitting beneath a strong neighbour in the same band – the situation where forensic detail matters most. Dropping to 8-bit halves the data volume but leaves only about 48 dB, so high-bit-depth recording is reserved for collection bands where co-channel dynamic range is operationally significant. The decision is recorded in the SigMF datatype field so a later analyst knows exactly what fidelity the archive holds.
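The 96 dB figure follows from the ratio between full scale and one quantization step, about 6.02 dB per bit. A small sketch of that ideal-quantizer rule of thumb (real converters deliver somewhat less; the function name is illustrative):

```python
import math


def ideal_dynamic_range_db(bits: int) -> float:
    """Ratio of full scale to one quantization step, in dB (about 6.02 dB per bit)."""
    return 20 * math.log10(2 ** bits)


for bits in (8, 16):
    print(f"{bits}-bit: {ideal_dynamic_range_db(bits):.1f} dB")
# 8-bit: 48.2 dB
# 16-bit: 96.3 dB
```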
Indexing the archive
An unindexed archive is a write-only black hole – samples go in and never come back out, because no one can find them. Indexing is what turns storage into an archive. The index is built by a detection-and-feature-extraction pass that runs as segments land, the same kind of detection front end used in spectrum monitoring for unauthorized emitters, repurposed to write durable index records instead of live alerts.
For each detected signal the pass emits one index record containing, at minimum: the time interval, center frequency, occupied bandwidth, peak and average power, an estimated modulation class, an emitter identification where available, and – critically – the source IQ file path and the byte offset of the sample range. The offset is what makes retrieval cheap: an analyst's query returns index records, and each record points directly into the raw store, so replay is a seek-and-read rather than a scan.
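One way to picture such a record is as a small immutable structure plus the arithmetic that turns a sample range into a byte range. This is a sketch only: the field names and power units are assumptions, and the 4-byte sample size assumes 16-bit I and 16-bit Q.

```python
from dataclasses import dataclass
from typing import Optional

BYTES_PER_SAMPLE = 4  # assumes ci16: 2-byte I + 2-byte Q


@dataclass(frozen=True)
class IndexRecord:
    start_time_ns: int
    stop_time_ns: int
    center_freq_hz: float
    bandwidth_hz: float
    peak_power_db: float        # units (dBm, dBFS) are a deployment choice
    avg_power_db: float
    modulation: str             # estimated class, may be "unknown"
    emitter_id: Optional[str]   # filled in only where available
    iq_path: str                # source segment in the raw store
    byte_offset: int            # where the sample range starts in that file
    byte_length: int


def sample_range_to_bytes(sample_start, sample_count, bytes_per_sample=BYTES_PER_SAMPLE):
    """Convert a sample range within a segment to (byte_offset, byte_length)."""
    return sample_start * bytes_per_sample, sample_count * bytes_per_sample


offset, length = sample_range_to_bytes(1_000_000, 50_000)
record = IndexRecord(0, 500_000, 2.44e9, 1e6, -60.0, -66.0, "unknown", None,
                     "seg_000000.sigmf-data", offset, length)
print(record.byte_offset, record.byte_length)   # 4000000 200000
```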
The index store must support the queries analysts actually run, which are overwhelmingly bounded in time and frequency: "show me everything between 2.40 and 2.48 GHz on this date with bandwidth under 1 MHz and power above this threshold." A composite index on (time, frequency, power) over the detection records serves these efficiently. Spectrograms – downsampled time-frequency power images – are stored as a tertiary product so an analyst can visually scan an entire band-day without decoding any raw IQ.
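The shape of that query can be sketched with SQLite from the Python standard library, standing in here for whatever database a real deployment uses. The table layout, column names, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE detections (
        start_time_ns  INTEGER,
        center_freq_hz REAL,
        bandwidth_hz   REAL,
        peak_power_db  REAL,
        iq_path        TEXT,
        byte_offset    INTEGER
    )""")
# Composite index on (time, frequency, power), as described above.
conn.execute("""
    CREATE INDEX idx_time_freq_power
    ON detections (start_time_ns, center_freq_hz, peak_power_db)""")

conn.executemany("INSERT INTO detections VALUES (?, ?, ?, ?, ?, ?)", [
    (1_000, 2.44e9, 500e3, -60.0, "seg0", 0),      # matches
    (2_000, 2.45e9, 5e6,   -50.0, "seg0", 4000),   # too wide
    (3_000, 5.80e9, 200e3, -55.0, "seg1", 0),      # out of band
    (4_000, 2.41e9, 100e3, -95.0, "seg1", 8000),   # too weak
])

hits = conn.execute("""
    SELECT iq_path, byte_offset FROM detections
    WHERE start_time_ns  BETWEEN ? AND ?
      AND center_freq_hz BETWEEN ? AND ?
      AND bandwidth_hz   < ?
      AND peak_power_db  > ?
    ORDER BY start_time_ns""",
    (0, 10_000, 2.40e9, 2.48e9, 1e6, -80.0)).fetchall()

print(hits)   # [('seg0', 0)]
```

Each hit carries the segment path and byte offset, so the result of a metadata query is already a pointer into the raw store.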
The detection pass should be conservative rather than clever. Its job is to make every signal findable later, not to reach a final classification on the spot, so it favours recall: it marks any energy that crosses a threshold, even if the modulation estimate is uncertain. Refinement – confirming a modulation class, attributing an emitter, correlating across collectors – happens later, on demand, by replaying the indexed raw IQ through heavier analysis. An index that under-detects cannot be repaired once the raw IQ has aged out, because the missing signals were never written as findable records; an index that over-detects merely costs a little extra metadata storage.
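A recall-favouring pass can be as simple as an energy threshold set a small margin above the estimated noise floor. The sketch below works on one frame of per-bin power values; the median noise estimate, the 6 dB margin, and the sample values are illustrative assumptions, not tuned parameters.

```python
import statistics


def detect_energy(power_db, threshold_db):
    """Return (start, stop) bin ranges, stop exclusive, for every run above threshold."""
    runs = []
    start = None
    for i, p in enumerate(power_db):
        if p > threshold_db and start is None:
            start = i                      # a run of occupied bins begins
        elif p <= threshold_db and start is not None:
            runs.append((start, i))        # the run ended at the previous bin
            start = None
    if start is not None:                  # run extends to the end of the frame
        runs.append((start, len(power_db)))
    return runs


frame = [-100, -99, -85, -84, -100, -101, -70, -100]   # per-bin power, dB
noise_floor = statistics.median(frame)                  # -99.5
threshold = noise_floor + 6.0                           # low margin favours recall
print(detect_energy(frame, threshold))                  # [(2, 4), (6, 7)]
```

Every run becomes an index record regardless of how confident the classifier is about it; the cost of a false alarm is one extra metadata row.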
Key insight: The expensive resource in a spectrum archive is raw IQ, but the resource analysts actually query is metadata. A well-designed archive answers the overwhelming majority of forensic questions from cheap derived data – detection records and spectrograms – and reaches into the multi-terabyte raw store only for the rare capture that must be replayed sample-for-sample. Design the index first; it determines whether the archive is usable at all.
Retention policy: the only thing keeping the archive affordable
No realistic budget keeps full-fidelity raw IQ from a wideband collector forever. A tiered retention policy is what makes the archive financially possible, and it is a deliberate, documented mission parameter rather than an afterthought.
Tier 0 – rolling raw buffer. A short window of full-fidelity raw IQ (minutes to a few hours) is always retained on the fastest storage. Its purpose is look-back: when a new signal of interest appears, analysts can immediately reach backward to capture the moments before it was flagged, which are otherwise lost. The buffer overwrites continuously.
Tier 1 – triggered raw retention. Captures flagged by a trigger – a watchlist match, an anomaly detector, or explicit analyst tasking – are promoted out of the rolling buffer and retained as raw IQ for days to weeks. This is the selective-archiving mechanism: the system keeps full fidelity only where evidence suggests it will be needed.
Tier 2 – derived products, long-term. Spectrograms, detection metadata, and extracted features are retained indefinitely. They occupy a tiny fraction of the raw data volume yet answer most forensic questions on their own. An analyst can establish that a signal was present, at what frequency, with what characteristics, from Tier 2 alone – and only escalate to a Tier 1 raw replay when sample-level proof is required.
Migration between tiers is automated. Aging raw IQ moves from hot NVMe to cheaper object storage and is eventually deleted per policy, while its derived metadata persists. The retention windows, the trigger definitions, and the deletion schedule are all configuration, not code, so a collection manager can tune them per mission and per classification authority.
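The tiers above can be expressed as a small configuration object and one decision function. This is a sketch: the window lengths, key names, and action strings are hypothetical, and derived metadata is outside its scope because Tier 2 products are retained regardless of what happens to the raw segment.

```python
import json
from dataclasses import dataclass

# Hypothetical per-mission configuration; the values are examples only.
CONFIG = """
{
  "rolling_buffer_s": 7200,
  "hot_storage_s": 86400,
  "triggered_raw_s": 1209600
}
"""


@dataclass(frozen=True)
class RetentionPolicy:
    rolling_buffer_s: float   # Tier 0 window for untriggered raw IQ
    hot_storage_s: float      # how long triggered raw IQ stays on fast storage
    triggered_raw_s: float    # Tier 1 total retention for triggered raw IQ


def raw_action(age_s, triggered, policy):
    """Decide what happens to one raw IQ segment of the given age."""
    if not triggered:
        return "keep_hot" if age_s <= policy.rolling_buffer_s else "delete_raw"
    if age_s <= policy.hot_storage_s:
        return "keep_hot"
    if age_s <= policy.triggered_raw_s:
        return "move_to_object_storage"
    return "delete_raw"


policy = RetentionPolicy(**json.loads(CONFIG))
print(raw_action(3 * 3600, False, policy))    # delete_raw
print(raw_action(3 * 86400, True, policy))    # move_to_object_storage
```

Because the policy is loaded from data, changing a retention window is an edit to configuration rather than a code change.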
Replay and forensic defensibility
The payoff of the whole architecture is replay: taking a stored capture and running it through processing as if it were live. Because the archive retained both the original raw IQ and the exact capture-time parameters, an analyst can reprocess a signal with newer or different algorithms, or simply reproduce an earlier result to confirm it. Replay recomputes the segment's hash and compares it with the value stored at write time to confirm the samples are unaltered, seeks to the byte offset from the index record, and feeds the range through the processing chain.
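The verify-then-read step of replay might look like the following sketch. Function names are illustrative, and the stored hash is assumed to be the SHA-256 recorded for the whole segment at write time.

```python
import hashlib
import os
import tempfile


def file_sha256(path, block_size=1 << 20):
    """Hash a file in blocks so large segments do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def read_verified_range(path, expected_sha256, byte_offset, byte_length):
    """Return the requested byte range only if the segment matches its stored hash."""
    if file_sha256(path) != expected_sha256:
        raise ValueError(f"hash mismatch for {path}")
    with open(path, "rb") as f:
        f.seek(byte_offset)
        data = f.read(byte_length)
    if len(data) != byte_length:
        raise ValueError("requested range extends past end of segment")
    return data


payload = bytes(range(32))                  # stand-in for recorded IQ bytes
with tempfile.TemporaryDirectory() as tmp:
    seg = os.path.join(tmp, "seg.iq")
    with open(seg, "wb") as f:
        f.write(payload)
    stored_hash = hashlib.sha256(payload).hexdigest()
    samples = read_verified_range(seg, stored_hash, byte_offset=8, byte_length=8)

print(samples == payload[8:16])             # True
```

With a per-segment hash, verification reads the whole segment even when only a short range is replayed, which is one reason to keep segment files at a fixed, moderate size.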
Forensic defensibility is the property that makes a replay credible to someone who was not present at collection. It rests on three things the architecture has already provided: provenance – each capture carries collector identity, calibration state, antenna configuration, and disciplined timestamps fixed at recording time; integrity – the write-time hash proves the samples have not changed; and chain of custody – every access, export, and replay is written to an immutable audit log. Reproducibility follows directly: because the raw samples and the processing parameters are both retained, an independent analyst can replay the capture and obtain the same detections and measurements, which is the essence of a defensible forensic finding.
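The text does not say how the audit log is made immutable. One common technique, shown here purely as an illustration, is a hash chain in which each entry commits to the one before it, so any later modification is detectable. All names are hypothetical; a real log would also carry timestamps and live on write-once storage.

```python
import hashlib
import json

GENESIS = "0" * 64
FIELDS = ("actor", "action", "target", "prev_hash")


def _entry_hash(body):
    """Hash the entry body in a canonical (sorted-key) JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log, actor, action, target):
    """Append an access record that commits to the previous entry's hash."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS
    body = {"actor": actor, "action": action, "target": target, "prev_hash": prev_hash}
    entry = dict(body, entry_hash=_entry_hash(body))
    log.append(entry)
    return entry


def verify_chain(log):
    """Return True if no entry has been altered, removed, or reordered."""
    prev = GENESIS
    for entry in log:
        body = {k: entry[k] for k in FIELDS}
        if entry["prev_hash"] != prev or _entry_hash(body) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True


log = []
append_entry(log, "analyst_a", "replay", "seg_000000.iq")
append_entry(log, "analyst_b", "export", "seg_000000.iq")
print(verify_chain(log))          # True

log[0]["actor"] = "someone_else"  # tamper with an earlier entry
print(verify_chain(log))          # False
```

A hash chain makes tampering evident rather than impossible, so it complements, and does not replace, access control on the log itself.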
These same provenance and access-control requirements shape the broader collection system; the layered design that produces them is covered in our reference on software-defined radio platforms for defense.
This analysis was prepared by Corvus Intelligence engineers who build mission-critical SIGINT and RF analytics systems for defense and government organizations.