Sound arrives before you see the source. A rifle shot at 500 metres reaches an acoustic sensor node in under 1.5 seconds. A tracked vehicle moving under tree cover at 2 km produces engine harmonics that propagate across terrain long before any optical or radar sensor can resolve the platform. Acoustic detection AI exploits this physics: by classifying what a microphone array hears – and computing bearing from the timing differences between elements – an edge-deployed acoustic node can contribute a detection layer to the common operating picture (COP) that optical sensors cannot replicate. This article walks through the sensor physics, feature extraction, machine learning architectures, bearing estimation algorithms, and Cursor on Target (CoT) integration that make edge acoustic sensing a viable military AI capability.
Why acoustic sensing at the edge?
The operational case for edge-deployed acoustic sensors rests on three properties that no other passive sensing modality shares.
Passive detection. Acoustic sensors emit nothing. Unlike radar or active sonar, a microphone array has no RF signature, no laser return, and no thermal output beyond the minimal power draw of the compute node. This makes acoustic sensors suitable for covert unattended ground sensor (UGS) deployments at choke points, along supply routes, or around defended positions, with no risk of giving away the sensor's position through its own emissions.
Penetration through visual obscurants. Acoustic waves propagate through fog, smoke, vegetation, and darkness with far less attenuation than visible or infrared light. A wheeled vehicle in a treeline that is invisible to an EO drone is acoustically loud. A crew-served weapon firing from behind a berm still produces a detectable muzzle blast. The acoustic domain provides sensing persistence in conditions that defeat optical systems.
Low power, long endurance. A microphone array with a microcontroller-class inference engine consumes 20–100 mW in continuous monitoring mode. A small battery pack provides weeks to months of unattended operation. By contrast, a ground radar or persistent EO sensor requires orders of magnitude more power for comparable continuous coverage. Acoustic sensors fill the endurance niche that powered sensors cannot.
Sensor array geometry and the physics of TDOA
A single microphone can detect and classify acoustic events but cannot determine where they come from. Direction-finding requires an array – multiple microphones at known geometric separations – and a time-difference-of-arrival (TDOA) algorithm that computes bearing from the microsecond differences in when the acoustic wavefront reaches each element.
For a linear array of N microphones with spacing d, the largest possible TDOA between adjacent elements is d/c, where c is the speed of sound (approximately 343 m/s at 20°C, varying by roughly 0.6 m/s per degree Celsius). To resolve bearing from inter-element phase without aliasing, the spacing must not exceed half a wavelength at the highest frequency of interest – the same spatial sampling criterion as phased-array radar. For gunshot classification where the relevant spectral content extends to 10 kHz (wavelength ≈ 34 mm), the array spacing would have to be under 17 mm to avoid ambiguity at the highest frequency. In practice, production military acoustic arrays use a 2D arrangement (cross, pentagon, or hexagon) with element spacings in the 10–30 cm range and rely on the lower-frequency content of the muzzle blast (1–4 kHz) for bearing. Because the blast is a broadband impulse, the cross-correlation peak between channels stays unambiguous at these spacings, even though the half-wavelength limit for a 10 cm spacing is only about 1.7 kHz.
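The spacing arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative, and the 343 m/s constant assumes air at roughly 20°C.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 degrees C


def max_tdoa(spacing_m, c=SPEED_OF_SOUND):
    """Largest possible delay between two elements (source on the array axis)."""
    return spacing_m / c


def max_spacing(f_max_hz, c=SPEED_OF_SOUND):
    """Half-wavelength spacing limit for alias-free phase measurements."""
    return c / (2.0 * f_max_hz)


def alias_free_frequency(spacing_m, c=SPEED_OF_SOUND):
    """Highest frequency that still meets the half-wavelength rule."""
    return c / (2.0 * spacing_m)


print(f"spacing limit at 10 kHz: {max_spacing(10_000) * 1e3:.2f} mm")
print(f"delay range of a 15 cm pair: {max_tdoa(0.15) * 1e6:.0f} us")
print(f"half-wavelength limit at 10 cm: {alias_free_frequency(0.10):.0f} Hz")
```

The last line shows why wide-spaced arrays lean on time-delay estimation of the whole impulse instead of single-frequency phase measurements.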
The generalized cross-correlation with phase transform (GCC-PHAT) is the standard algorithm for estimating TDOA between a pair of microphone channels. It cross-correlates the two channel signals in the frequency domain, normalizes by the cross-spectral magnitude (the "phase transform" step), and finds the time lag at the correlation peak. GCC-PHAT is robust to reverberation – the normalization step suppresses multi-path energy – and it produces a sharp peak even in noisy outdoor environments when the direct-path signal is coherent across channels.
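The GCC-PHAT steps described above can be sketched in dependency-free Python. The naive O(N²) DFT helper is an assumption made to keep the example self-contained; a fielded implementation would use an FFT library. The function names and the synthetic click are illustrative.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT, adequate for a short demo only."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(sig_a, sig_b, fs):
    """Delay of sig_b relative to sig_a in seconds (positive means b is later)."""
    n = len(sig_a) + len(sig_b)  # zero-pad so the correlation does not wrap
    a = dft(list(sig_a) + [0.0] * (n - len(sig_a)))
    b = dft(list(sig_b) + [0.0] * (n - len(sig_b)))
    cross = [bk * ak.conjugate() for ak, bk in zip(a, b)]
    # Phase transform: discard magnitude, keep only the phase of each bin
    whitened = [c / (abs(c) + 1e-12) for c in cross]
    corr = [v.real for v in dft(whitened, inverse=True)]
    peak = max(range(n), key=lambda i: corr[i])
    lag = peak if peak < n // 2 else peak - n  # map index to a signed lag
    return lag / fs


fs = 48_000
pulse = [0.0] * 5 + [1.0, -0.6, 0.3, -0.1] + [0.0] * 23  # short broadband click
delayed = [0.0] * 3 + pulse[:-3]                          # same click, 3 samples later
delay_s = gcc_phat(pulse, delayed, fs)
print(f"estimated delay: {delay_s * 1e6:.1f} us")
```

The resolution here is one sample period; sub-sample accuracy would need interpolation around the correlation peak, which this sketch omits.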
Array calibration and environmental compensation
Two practical complications degrade TDOA accuracy in field deployment. First, the actual microphone positions in a manufactured array may differ from the nominal geometry by 1–3 mm due to manufacturing tolerances. At 343 m/s sound speed, 1 mm of position error corresponds to approximately 3 µs of timing error, a fraction of the roughly 21 µs sample period at 48 kHz. For a 15 cm aperture that is a bearing error of roughly 0.4° at broadside, and a 3 mm error pushes it past 1°. Arrays should be calibrated after assembly using an acoustic point source at a known position, fitting the actual positions to the observed TDOAs.
Second, temperature changes the speed of sound by roughly 0.6 m/s per °C. A 20°C temperature swing – common between night and midday at mid-latitudes – shifts the sound speed by 12 m/s (3.5%), which propagates directly into range and bearing error if temperature compensation is not applied. Edge acoustic nodes should include a temperature sensor (and ideally a humidity and barometric pressure sensor) to update the sound speed estimate in real time.
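The effect of an uncompensated temperature change on bearing can be illustrated with a short sketch. It assumes the common dry-air approximation for sound speed and a far-field, two-element model; humidity and pressure corrections are left out, and the function names are illustrative.

```python
import math


def speed_of_sound(temp_c):
    """Dry-air approximation: c = 331.3 * sqrt(1 + T / 273.15) in m/s."""
    return 331.3 * math.sqrt(1.0 + temp_c / 273.15)


def bearing_from_tdoa(tdoa_s, spacing_m, temp_c):
    """Far-field bearing in degrees from broadside for one microphone pair."""
    s = speed_of_sound(temp_c) * tdoa_s / spacing_m
    s = max(-1.0, min(1.0, s))  # clamp against noise pushing |sin| above 1
    return math.degrees(math.asin(s))


# A source at 45 degrees, observed at midday (30 C) on a 15 cm pair
spacing = 0.15
true_tdoa = spacing * math.sin(math.radians(45.0)) / speed_of_sound(30.0)

compensated = bearing_from_tdoa(true_tdoa, spacing, temp_c=30.0)
stale = bearing_from_tdoa(true_tdoa, spacing, temp_c=10.0)  # night-time value kept
print(f"compensated: {compensated:.2f} deg, stale temperature: {stale:.2f} deg")
```

Under these assumptions the stale sound speed shifts the reported bearing by a degree or two, an error that compensation removes at negligible compute cost.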
Feature extraction for audio classification
Classifying acoustic events as gunshots, explosions, vehicles, or ambient noise requires features that capture the spectral and temporal structure of each event class while being compact enough to process on edge hardware within the latency budget.
Mel-frequency cepstral coefficients (MFCCs). The most widely used compact audio feature for classification tasks. MFCCs map the short-time Fourier transform of a signal onto a mel-scale filterbank (which approximates the human auditory system's frequency resolution), then apply a discrete cosine transform to decorrelate the filterbank outputs. Twenty to 40 coefficients per analysis frame capture the broad spectral shape of the event. For gunshot versus vehicle discrimination, the key discriminant is the ratio of high-frequency to low-frequency energy: gunshots concentrate energy above 2 kHz in a brief impulsive burst, while vehicles produce sustained low-frequency content below 500 Hz with harmonic structure.
Log-mel spectrograms. For deep-learning classifiers, log-mel spectrograms – two-dimensional time-frequency representations on a mel scale – give the model access to the full spectrotemporal structure of the event. A 64-band, 25 ms frame, 10 ms hop spectrogram of a 200 ms event window produces a 64×19 feature image that a small CNN classifies accurately. The log-mel representation preserves transient onset structure (critical for gunshot detection) and sustained harmonic patterns (critical for vehicle classification) in a format amenable to convolutional feature extraction.
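The feature dimensions quoted in the two paragraphs above follow from simple framing arithmetic. The sketch below assumes the HTK-style mel formula and zero-padding of the final partial frame; the function names are illustrative and are not taken from any particular audio library.

```python
import math


def frame_count(window_ms, frame_ms, hop_ms, pad_end=True):
    """Number of analysis frames in a window (integer milliseconds)."""
    if window_ms < frame_ms:
        return 1 if pad_end else 0
    span = window_ms - frame_ms
    frames = span // hop_ms + 1
    if pad_end and span % hop_ms:
        frames += 1  # final partial frame is zero-padded
    return frames


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min_hz, f_max_hz):
    """Edge frequencies of n_bands triangular filters, equally spaced in mel."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


def mfcc_from_band_energies(band_energies, n_coeffs=13):
    """DCT-II of the log filterbank energies, truncated to n_coeffs."""
    log_e = [math.log(e + 1e-10) for e in band_energies]
    n = len(log_e)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(log_e))
        for k in range(n_coeffs)
    ]


print(frame_count(200, 25, 10), "frames with end padding")
print(frame_count(200, 25, 10, pad_end=False), "frames without padding")
```

With end padding the 200 ms window yields the 19 frames behind the 64×19 shape; without padding the same settings give 18. The DCT step shows why MFCCs are compact: a spectrally flat frame collapses into the first coefficient alone.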
Onset detection and event segmentation. Before feature extraction can run, the system needs to identify that an event worth classifying has occurred. A simple energy threshold triggers on loud transients but has high false-alarm rates from thunder, metal impacts, and industrial noise. A better approach uses a learned onset detector – a small model trained to distinguish acoustic onsets that precede classifiable military events from all other transients – as a pre-filter. This two-stage architecture reduces the false-alarm rate fed to the main classifier by 60–80% in typical outdoor industrial environments, at the cost of an additional 5–10 ms of inference latency.
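The simple energy-threshold trigger mentioned above can be sketched as follows. This covers only the baseline first stage, not the learned onset detector; the 12 dB ratio and the floor-tracking constant are assumed values chosen for illustration.

```python
import math


def detect_onsets(samples, frame_len, ratio_db=12.0, floor_alpha=0.05):
    """Indices of frames whose energy exceeds the running noise floor by ratio_db."""
    onsets = []
    floor = None
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len + 1e-12
        if floor is None:
            floor = energy  # first frame initialises the noise floor
            continue
        if 10.0 * math.log10(energy / floor) > ratio_db:
            onsets.append(start // frame_len)
        else:
            # Track slow changes in ambient level using quiet frames only
            floor = (1.0 - floor_alpha) * floor + floor_alpha * energy
    return onsets


quiet = [0.01 * (-1) ** n for n in range(480)]  # 10 ms of low-level signal at 48 kHz
loud = [1.0 * (-1) ** n for n in range(480)]    # one loud 10 ms frame
signal = quiet * 5 + loud + quiet * 4
print(detect_onsets(signal, frame_len=480))
```

In a two-stage design, each frame index returned here would be handed to the learned onset model, and only the candidates it accepts would reach the main classifier.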
Machine learning architectures for edge acoustic classification
Three model families are production-viable for edge acoustic classification in military applications.
Convolutional neural networks on spectrograms. A MobileNetV2 or EfficientNet-Lite architecture adapted for audio (replacing the ImageNet input shape with the spectrogram dimensions) achieves 92–96% accuracy on four-class acoustic event datasets (gunshot, vehicle, explosion, ambient) at under 20 ms inference time on an ARM Cortex-M55 with INT8 quantization. The key adaptation is using a relatively narrow temporal context window – 200–500 ms – to keep the input tensor small enough for on-device memory. For gunshot detection specifically, the same quantization and optimization techniques used in visual edge AI apply directly to audio CNN deployment.
Audio transformer models. Models in the Audio Spectrogram Transformer (AST) family apply self-attention across spectrogram patches, achieving state-of-the-art accuracy on general audio classification benchmarks. On edge hardware, the attention mechanism is more memory-intensive than convolutions at equivalent model size, and attention layers degrade more under INT8 quantization than convolutional layers. Distilled tiny-AST variants with 1–5 million parameters are feasible on Cortex-A class processors at 10–30 ms inference time. The accuracy advantage over CNN-based models is modest (1–3%) for military acoustic event classification, where the training set is domain-specific rather than the broad AudioSet on which AST was designed to excel.
Recurrent classifiers for vehicle identification. Vehicle classification – distinguishing wheeled from tracked, light from heavy, and specific platform types – benefits from temporal context that CNNs capture poorly with short windows. A bidirectional LSTM operating on a sequence of 20–50 MFCC frames (200–500 ms of audio) captures the evolution of engine harmonics as load and speed change, producing more stable vehicle-type estimates over multi-second windows. The LSTM classifier can run asynchronously from the event-trigger classifier, continuously updating a vehicle-type estimate as long as acoustic contact is maintained.
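The INT8 quantization referred to above for the CNN and transformer options can be illustrated with a symmetric per-tensor scheme. This is a generic sketch, not the scheme of any specific toolchain; the function names and the 64×19 input shape are used for illustration only.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the range [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid zero scale
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    return [x * scale for x in q]


def tensor_bytes(n_bands, n_frames, bytes_per_value):
    """Memory needed for one spectrogram input tensor."""
    return n_bands * n_frames * bytes_per_value


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.5f}, worst round-trip error={worst:.5f}")
print(tensor_bytes(64, 19, 4), "bytes as float32 vs", tensor_bytes(64, 19, 1), "as int8")
```

The round-trip error is bounded by half a quantization step, and the input tensor shrinks by a factor of four, which is the property that matters on microcontroller-class memory.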
Supersonic ballistic shockwave versus muzzle blast
A rifle or heavy weapon fired at a sensor produces two distinct acoustic events: the muzzle blast (an omnidirectional impulsive wavefront from the propellant gas) and the ballistic shockwave (a conical N-wave generated by the supersonic projectile). These arrive at the sensor at different times depending on the geometry of the engagement, and the time difference between them encodes information about the weapon type, the muzzle velocity, and – critically – the shooter's location relative to the target-sensor geometry.
The muzzle blast TDOA gives the direction toward the weapon. The ballistic shockwave TDOA gives the direction of the projectile trajectory. Combining both estimates, a properly trained classifier and estimator can determine whether the weapon was fired toward, away from, or across the sensor position. This capability – distinguishing incoming from outgoing fire – has obvious operational value for defensive posture decisions. Systems that classify only on muzzle blast without separating the shockwave component will systematically misreport the shooter's bearing by an angle that increases with shooter-to-sensor range.
Key insight: The most common failure in deployed acoustic gunshot detectors is not the classification model – it is the failure to separate the muzzle blast from the ballistic shockwave before running bearing estimation. A single-peak TDOA estimator that does not model both arrivals will report a bearing that is a weighted average of the two propagation directions, biased toward whichever event has higher SNR at the array. For engagements at ranges above 200 metres, this can produce bearing errors exceeding 15°. The fix is a multi-hypothesis TDOA estimator that explicitly models both arrivals and assigns each to its physical source.
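The geometry behind the two arrivals can be sketched with a simplified model. It assumes a constant projectile speed, still uniform air, and a flat 2D layout with the shooter at the origin firing along the +x axis; real projectiles decelerate, so the numbers are illustrative only. The function name and the example values are assumptions.

```python
import math


def arrivals(range_m, miss_m, v_bullet, c=343.0):
    """Return (t_shock, t_muzzle, angular_separation_deg) at a sensor.

    Shooter at the origin firing along +x; sensor at (range_m, miss_m).
    Assumes constant projectile speed (a simplification).
    """
    mach = v_bullet / c
    if mach <= 1.0:
        raise ValueError("a subsonic projectile produces no shockwave")
    root = math.sqrt(mach * mach - 1.0)
    # Point on the trajectory whose shock front reaches the sensor
    x_emit = range_m - miss_m / root
    if x_emit < 0.0:
        raise ValueError("sensor is outside the region swept by the Mach cone")
    t_shock = x_emit / v_bullet + math.hypot(range_m - x_emit, miss_m) / c
    t_muzzle = math.hypot(range_m, miss_m) / c
    # Arrival directions seen from the sensor, measured from the -x direction
    muzzle_deg = math.degrees(math.atan2(miss_m, range_m))
    shock_deg = math.degrees(math.atan2(miss_m, range_m - x_emit))
    return t_shock, t_muzzle, shock_deg - muzzle_deg


t_shock, t_muzzle, sep = arrivals(range_m=300.0, miss_m=10.0, v_bullet=850.0)
print(f"shockwave at {t_shock * 1e3:.0f} ms, muzzle blast at {t_muzzle * 1e3:.0f} ms")
print(f"arrival directions differ by {sep:.1f} deg")
```

For a near miss the shockwave arrives first and from a direction far off the shooter's bearing, which is why an estimator that blends the two arrivals reports a misleading bearing.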
Integrating acoustic detections into the common operating picture
An acoustic detection that stays on the edge node is tactically useless. The value is realized only when the detection event – bearing, classification, confidence, timestamp, sensor position – reaches operators and automated fusion engines on the COP. The integration pattern mirrors what is well-established for distributed military sensor networks: each node reports locally processed results over a constrained link to a hub that fuses across nodes.
For TAK-ecosystem integration, acoustic detection events are published as CoT XML to TAK Server. The CoT event type for an acoustic observation is drawn from the CoT type taxonomy (b-m-p-s-p-op for observation, or a hostile type code if the classification confidence and rules of engagement permit). The CoT detail field carries structured extension elements: bearing, bearing uncertainty, event class, acoustic confidence, and an identifier for the reporting sensor node. TAK Server's built-in CoT subscription model delivers the event to all connected ATAK clients within 1–3 seconds of the acoustic onset.
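A CoT event of the kind described above could be assembled with the Python standard library roughly as follows. The `event`, `point`, and `detail` structure follows the base CoT schema; the `acoustic` extension element, its attribute names, the `how` value, and the 120-second stale interval are assumptions for illustration, not a published schema.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

UNKNOWN = "9999999.0"  # CoT convention for an unknown height or error value


def cot_time(dt):
    """Format an aware datetime as a CoT UTC timestamp with milliseconds."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"


def build_acoustic_cot(uid, lat, lon, bearing_deg, bearing_sigma_deg,
                       event_class, confidence, node_id, onset, stale_s=120):
    event = ET.Element("event", {
        "version": "2.0",
        "uid": uid,
        "type": "b-m-p-s-p-op",  # observation type named in the text
        "how": "m-g",            # assumed: machine-generated, GPS-derived position
        "time": cot_time(datetime.now(timezone.utc)),
        "start": cot_time(onset),
        "stale": cot_time(onset + timedelta(seconds=stale_s)),
    })
    # The point is the sensor position: a single node reports bearing only
    ET.SubElement(event, "point", {
        "lat": f"{lat:.6f}", "lon": f"{lon:.6f}",
        "hae": UNKNOWN, "ce": UNKNOWN, "le": UNKNOWN,
    })
    detail = ET.SubElement(event, "detail")
    # Hypothetical extension element carrying the acoustic fields
    ET.SubElement(detail, "acoustic", {
        "bearing": f"{bearing_deg:.1f}",
        "bearingSigma": f"{bearing_sigma_deg:.1f}",
        "class": event_class,
        "confidence": f"{confidence:.2f}",
        "node": node_id,
    })
    return ET.tostring(event, encoding="unicode")


onset = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
xml_text = build_acoustic_cot("acoustic-node7-0001", 50.1, 14.4, 132.5, 2.0,
                              "gunshot", 0.94, "node7", onset)
print(xml_text)
```

Any consumer that needs the extension fields would have to agree on the element and attribute names, so these are best fixed in an interface document before deployment.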
Multi-node fusion is the capability that turns bearing lines into position fixes. When two or more acoustic nodes report the same event (matched by timestamp and classification within a configurable time window), their bearing lines are intersected using a weighted least-squares algorithm. The weight for each bearing line is inversely proportional to bearing uncertainty. The fused position is reported with a 2D error ellipse whose size and orientation depend on the geometry of the node network and the bearing uncertainties of the contributing nodes. For a two-node network with 90° crossing angle and 2° bearing uncertainty per node, the 1σ cross-range error at 500 m from each node is approximately 17.5 metres per axis, which corresponds to a circular error probable (CEP) of roughly 21 metres – sufficient to cue an observation team or direct a UAS to investigate.
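The bearing-line intersection can be sketched as a small weighted least-squares problem in local east/north coordinates. The sketch assumes bearings measured clockwise from north and a flat-earth grid in metres. It treats each bearing as an infinite line, so it does not reject solutions that fall behind a node. The function names are illustrative.

```python
import math


def fuse_bearings(reports):
    """Weighted least-squares intersection of bearing lines.

    reports: iterable of (east_m, north_m, bearing_deg, sigma_deg).
    Minimises the weighted squared perpendicular distance to every line.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for east, north, bearing_deg, sigma_deg in reports:
        theta = math.radians(bearing_deg)
        nx, ny = math.cos(theta), -math.sin(theta)  # unit normal to the line
        w = 1.0 / sigma_deg  # weighting described in the text
        proj = nx * east + ny * north
        a11 += w * nx * nx
        a12 += w * nx * ny
        a22 += w * ny * ny
        b1 += w * nx * proj
        b2 += w * ny * proj
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("bearing lines are parallel; no unique fix")
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


def cross_range_sigma(range_m, sigma_deg):
    """Lateral 1-sigma error of a single bearing line at a given range."""
    return range_m * math.tan(math.radians(sigma_deg))


fix = fuse_bearings([(0.0, 0.0, 45.0, 2.0), (1000.0, 0.0, 315.0, 2.0)])
print(f"fused position: east={fix[0]:.1f} m, north={fix[1]:.1f} m")
print(f"cross-range sigma at 500 m: {cross_range_sigma(500.0, 2.0):.1f} m")
```

The second helper reproduces the figure of roughly 18 metres at 500 m for a 2° bearing uncertainty; with poorer crossing angles the ellipse stretches along the bisector of the two lines.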
Battery-powered edge nodes that operate in communications-denied periods store detections locally with precise GPS timestamps. On reconnection to the tactical network, buffered events are replayed to TAK Server with their original timestamps, reconstructing the acoustic event history on the COP for post-event analysis.
Fuse acoustic detections into your operational picture
Corvus SENSE integrates acoustic sensor nodes, TDOA bearing estimates, and classification results directly into the common operating picture – publishing CoT events to TAK Server and providing multi-node fusion across the sensor network in real time.
This analysis was prepared by Corvus Intelligence engineers who build mission-critical ISR and field applications for defense and government organizations.