A Speech Separation Method for Fully Mechanized Mining Faces Based on the Mask Feature Cross-Predissociation Network

Abstract: The complex, non-stationary mechanical noise at fully mechanized mining faces severely interferes with underground dispatching communication. Existing speech separation models, built on a single-mask architecture, struggle to suppress this noise while preserving weak speech components, which degrades separation accuracy. To address this, this paper proposes the Mask Feature Cross-Predissociation Network, comprising two collaborative modules: mask feature extraction and feature cross-fusion. The former concatenates the two mask paths and applies convolutional gating to extract their noise-correlated features, producing complementary noise-related weights; these weights are then complementarily applied to the masks to filter out noise. The latter fully captures target-mask features through cross-path fusion of the masks and further purifies them with convolutional gating and residual modules, while avoiding the suppression of weak speech, thereby providing high-quality preprocessed features for speech reconstruction in the decoder. The network is plug-and-play: it can be integrated into existing speech separation frameworks without modifying their original structures. Experiments confirm its effectiveness and generality: on the fully mechanized mining face noisy speech dataset (CM2VSD), integrating the network raises the SI-SNRi (scale-invariant signal-to-noise ratio improvement) upper bounds of the Convolutional Time-Domain Audio Separation Network (Conv-TasNet) and the Dual-Path Transformer Network (DPTNet) to 17.06 dB (+1.4 dB) and 15.03 dB (+1.49 dB), respectively, and their SDRi (signal-to-distortion ratio improvement) upper bounds to 17.15 dB (+1.11 dB) and 15.35 dB (+1.41 dB), respectively.
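To make the two-module pipeline concrete, below is a minimal PyTorch sketch of the architecture as described in the abstract: the two masks are concatenated and gated to produce complementary noise-filtering weights, then cross-fused and refined with gated residual blocks before being handed to the decoder. All class names (ConvGate, MaskFeatureExtraction, FeatureCrossFusion, CrossPredissociationNet), the layer sizes, and the exact form of the complementary weighting are assumptions for illustration only; the paper's concrete design may differ.

```python
import torch
import torch.nn as nn


class ConvGate(nn.Module):
    """Convolutional gating: a tanh feature branch modulated by a sigmoid gate."""
    def __init__(self, channels: int):
        super().__init__()
        self.feat = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.gate = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.feat(x)) * torch.sigmoid(self.gate(x))


class MaskFeatureExtraction(nn.Module):
    """Concatenates the two mask paths, extracts noise-correlated features via
    convolutional gating, and emits complementary weights for noise filtering."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv1d(2 * channels, channels, kernel_size=1)
        self.gated = ConvGate(channels)
        self.to_weight = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, m1, m2):
        joint = self.gated(self.fuse(torch.cat([m1, m2], dim=1)))
        w = torch.sigmoid(self.to_weight(joint))  # noise-related weights in (0, 1)
        # Assumed complementary weighting: one path is scaled by w, the other
        # by (1 - w), so energy the weights attribute to noise is attenuated.
        return m1 * w, m2 * (1.0 - w)


class FeatureCrossFusion(nn.Module):
    """Cross-path fusion of the filtered masks, then gated residual refinement;
    the residual connection keeps the original mask content intact, so weak
    speech components are not buried by the refinement."""
    def __init__(self, channels: int):
        super().__init__()
        self.cross = nn.Conv1d(2 * channels, channels, kernel_size=1)
        self.gated = ConvGate(channels)

    def forward(self, f1, f2):
        r1 = f1 + self.gated(self.cross(torch.cat([f1, f2], dim=1)))
        r2 = f2 + self.gated(self.cross(torch.cat([f2, f1], dim=1)))
        return r1, r2


class CrossPredissociationNet(nn.Module):
    """Plug-and-play wrapper refining the two masks produced by an existing
    separator (e.g. Conv-TasNet or DPTNet) before they reach the decoder."""
    def __init__(self, channels: int):
        super().__init__()
        self.extract = MaskFeatureExtraction(channels)
        self.fusion = FeatureCrossFusion(channels)

    def forward(self, m1, m2):
        return self.fusion(*self.extract(m1, m2))


if __name__ == "__main__":
    # Masks shaped (batch, feature_channels, frames), as in Conv-TasNet.
    net = CrossPredissociationNet(channels=256)
    m1, m2 = torch.rand(2, 256, 100), torch.rand(2, 256, 100)
    r1, r2 = net(m1, m2)
    print(r1.shape, r2.shape)  # torch.Size([2, 256, 100]) twice
```

Because the wrapper consumes and returns masks of the same shape, it can be dropped between an existing separator's mask estimator and its decoder, which is consistent with the plug-and-play property claimed in the abstract.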

     
