Abstract:
The complex, non-stationary mechanical noise at fully mechanized mining faces severely interferes with underground dispatch communication. Existing speech separation methods based on the Time-Domain Audio Separation Network (TasNet) architecture (encoder - mask network - decoder) tend to generate target speech masks that retain residual noise and interfering speech components; moreover, noise suppression may damage target speech features, reducing separation accuracy. To address this problem, a speech separation method for fully mechanized mining faces based on a mask feature cross pre-decoding network is proposed. The network is integrated after the mask network of TasNet and consists of a mask feature extraction module and a feature cross pre-decoding module. The mask feature extraction module learns noise-related features from the different target speech masks through concatenation and a convolutional gating module, generates noise-related complementary weights, and applies these weights to the masks in a complementary manner to filter out noise. The feature cross pre-decoding module performs cross-complementary fusion of features from the different target speech masks to mine the correlations among them, and then uses a convolutional gating module and a residual enhancement module to purify and compensate the masks, preventing weak speech components from being masked and protecting target speech that may have been damaged during noise suppression.
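The complementary-weighting and cross pre-decoding steps described above can be sketched as follows. This is a minimal NumPy illustration under assumed tensor shapes and randomly initialized pointwise (1x1) convolution weights; the gating form (tanh filter times sigmoid gate), the specific fusion rule, and the residual formulation are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

T, C = 100, 64                  # time frames, feature channels (assumed sizes)
m1 = rng.random((T, C))         # estimated target speech mask for source 1
m2 = rng.random((T, C))         # estimated target speech mask for source 2

# --- mask feature extraction: concatenation + convolutional gating ---
x = np.concatenate([m1, m2], axis=-1)         # joint view of both masks, (T, 2C)
w_f = 0.1 * rng.standard_normal((2 * C, C))   # pointwise "filter" conv weights
w_g = 0.1 * rng.standard_normal((2 * C, C))   # pointwise "gate" conv weights
feat = np.tanh(x @ w_f) * sigmoid(x @ w_g)    # gated noise-related features
alpha = sigmoid(feat)                         # complementary weights in (0, 1)

# Complementary weighting: noise shared by both masks is attenuated because
# one mask is scaled by alpha and the other by (1 - alpha).
m1_f = alpha * m1
m2_f = (1.0 - alpha) * m2

# --- feature cross pre-decoding: cross-complementary fusion + residual enhancement ---
fused1 = m1_f + (1.0 - alpha) * m2_f   # borrow correlated cues from the other mask
fused2 = m2_f + alpha * m1_f
# The residual connection to the raw masks compensates speech that noise
# suppression may have damaged; clipping keeps the result a valid mask.
out1 = np.clip(m1 + np.tanh(fused1), 0.0, 1.0)
out2 = np.clip(m2 + np.tanh(fused2), 0.0, 1.0)
```

In a real system these weights would be learned end to end (e.g. as 1x1 convolutions in PyTorch), and the purified masks would then be passed to the TasNet decoder in place of the raw masks.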
Experimental results show that, compared with mainstream TasNet-based speech separation methods, namely the Convolutional Time-Domain Audio Separation Network (Conv-TasNet), the Dual-Path Recurrent Neural Network (DPRNN), the Dual-Path Transformer Network (DPTNet), and the Globally Attentive Locally Recurrent Network (GALR), the proposed method improves the Scale-Invariant Signal-to-Noise Ratio improvement (SI-SNRi) by 3.52, 1.74, 1.40, and 2.09 dB and the Signal-to-Distortion Ratio improvement (SDRi) by 3.21, 1.45, 1.14, and 1.80 dB, respectively, while using fewer parameters. The proposed method can be deployed on embedded chips with built-in Neural Network Processing Units (NPUs); its compact structure and low computational cost meet the engineering requirements of miniaturized, low-power underground voice terminals.