Speech separation method for fully mechanized mining faces based on a mask feature cross pre-decoding network

Abstract: Complex, non-stationary mechanical noise at fully mechanized mining faces severely interferes with underground dispatch communication. Existing speech separation methods based on the Time-Domain Audio Separation Network (TasNet) architecture (encoder, mask network, decoder) tend to generate target speech masks that retain residual noise and interfering speech components. In addition, noise suppression can damage target speech features, which reduces separation accuracy. To address these problems, a speech separation method for fully mechanized mining faces based on a mask feature cross pre-decoding network is proposed. The mask feature cross pre-decoding network is inserted after the mask network of TasNet and consists mainly of a mask feature extraction module and a feature cross pre-decoding module. The mask feature extraction module uses a concatenation operation and a convolutional gating module to learn noise-related features across the different target speech masks, generates noise-related complementary weights, and applies these weights to the target speech masks as complementary weighting in order to filter out noise. The feature cross pre-decoding module performs cross-complementary fusion of the features of the different target speech masks to mine the correlation information among them, and then uses convolutional gating and residual enhancement modules to purify and compensate the masks, which prevents weak speech from being masked and protects target speech that might otherwise be damaged during noise suppression. Experimental results show that, compared with mainstream TasNet-based speech separation methods, namely the Convolutional Time-Domain Audio Separation Network (Conv-TasNet), the Dual-Path Recurrent Neural Network (DPRNN), the Dual-Path Transformer Network (DPTNet), and the Globally Attentive Locally Recurrent Network (GALR), the proposed method increases the scale-invariant signal-to-noise ratio improvement (SI-SNRi) by 3.52, 1.74, 1.40, and 2.09 dB, respectively, and increases the signal-to-distortion ratio improvement (SDRi) by 3.21, 1.45, 1.14, and 1.80 dB, respectively, while keeping the parameter count relatively small. The proposed method can be deployed on embedded chips with a built-in neural network processing unit (NPU). The resulting module is compact and has low computational cost, which meets the engineering requirements for miniaturized, low-power underground voice terminals.
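For readers unfamiliar with the SI-SNRi metric quoted in the abstract, the following is a minimal sketch of how it is commonly computed: the scale-invariant SNR of the separated signal against the clean reference, minus the same quantity for the unprocessed mixture. The function names (`si_snr`, `si_snr_improvement`) and the synthetic sine-wave signals are illustrative assumptions, not the authors' evaluation code or data.

```python
import math


def _zero_mean(signal):
    """Remove the mean so the metric ignores any DC offset."""
    mean = sum(signal) / len(signal)
    return [v - mean for v in signal]


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals."""
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything not explained by the reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated output over the raw mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Synthetic example (assumed data): a target tone plus an interfering tone.
n = 800
reference = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
interference = [math.sin(2 * math.pi * 13 * t / n) for t in range(n)]
mixture = [s + i for s, i in zip(reference, interference)]
# A hypothetical separator that attenuates the interference by a factor of 10.
estimate = [s + 0.1 * i for s, i in zip(reference, interference)]

print(round(si_snr_improvement(estimate, mixture, reference), 2))
```

Because the estimate is projected onto the reference before the energies are compared, rescaling the separated signal leaves the score unchanged. This is why the metric is preferred over plain SNR for separation networks whose output gain is arbitrary.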

     
