Abstract:
Non-stationary mechanical noise at complex fully mechanized mining faces seriously interferes with underground dispatching communication. Existing speech separation models adopt a single masking architecture and struggle to suppress noise effectively while preserving weak speech components, which reduces separation accuracy. To address this issue, this paper proposes a Mask Feature Cross-Predissociation Network comprising two collaborative modules: mask feature extraction and feature cross-fusion. The former extracts noise-related features from two mask paths via concatenation operations and convolutional gating modules and generates noise-related complementary weights, which are then adaptively fused with the original masks to filter out noise. The latter captures target mask features through cross-path fusion of the masks and further refines the masks using convolutional gating and residual modules, while avoiding the masking of weak speech components, thereby providing high-quality preprocessed features for speech reconstruction in the decoder. Notably, the proposed network is plug-and-play: it can be integrated into existing speech separation frameworks without modifying their original structures. Experimental results validate the effectiveness and generality of the proposed network: on the fully mechanized mining face noisy speech dataset (CM2VSD), the Convolutional Time-Domain Audio Separation Network (Conv-TasNet) and the Dual-Path Transformer Network (DPTNet), each integrated with the proposed network, achieve state-of-the-art performance, with SI-SNRi (Scale-Invariant Signal-to-Noise Ratio improvement) upper bounds of 17.06 dB (+1.4 dB) and 15.03 dB (+1.49 dB), respectively, and SDRi (Signal-to-Distortion Ratio improvement) upper bounds of 17.15 dB (+1.11 dB) and 15.35 dB (+1.41 dB), respectively.
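
For readers unfamiliar with the headline metric, the following is a minimal sketch of how SI-SNR and SI-SNRi are commonly computed for a single-channel signal. It follows the standard definition (zero-mean signals, projection of the estimate onto the target) and is not taken from the paper's evaluation code. The function names `si_snr` and `si_snr_improvement` and the stabilizing constant `eps` are illustrative assumptions.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Standard definition; `eps` is an assumed small constant that
    guards against division by zero and log of zero.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    # Remove the mean so that a DC offset does not affect the metric.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target: this is the "target" part.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    s_target = [dot / t_energy * b for b in t]
    # Whatever is left after the projection is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(n * n for n in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy usage: a sinusoidal "speech" target with an additive sinusoidal "noise".
target = [math.sin(0.1 * n) for n in range(1000)]
noise = [math.cos(0.37 * n) for n in range(1000)]
mixture = [s + v for s, v in zip(target, noise)]
# A hypothetical separator that reduces the noise amplitude to one tenth.
estimate = [s + 0.1 * v for s, v in zip(target, noise)]
improvement_db = si_snr_improvement(estimate, mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant. The values in parentheses in the abstract, such as +1.4 dB, are gains relative to the corresponding baseline model, not the SI-SNRi itself.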