Abstract:
Coal-gangue image recognition can miss key targets. Compared with image-based target recognition models, video-based target recognition models better match the requirements of coal-gangue recognition and separation, and they can extract coal-gangue features from video data more broadly and deeply. However, current coal-gangue video recognition methods do not consider how frame repetition, frame similarity, and the randomness of key frames affect model performance. To address these problems, this paper proposes an aggregation-enhanced coal-gangue video recognition model based on long- and short-term storage (LSS). First, video frames are divided into key frames and non-key frames to screen the large volume of information, and multi-frame aggregation is performed on the coal-gangue video frame sequence. The feature information of each key frame and its adjacent frames is aggregated through temporal relation networks (TRN) to build long-term and short-term video frames, which reduces the computational load of the model without losing key feature information. Second, an attention mechanism that integrates semantic similarity weights, learnable weights, and region of interest (ROI) similarity weights reallocates the feature weights among the long-term video frames, the short-term video frames, and the key frames. Finally, an LSS module is designed to store the effective features of the long-term and short-term video frames and to fuse them into key-frame recognition, enhancing the representational capability of the key-frame features and thereby enabling coal-gangue recognition. The model is tested on a coal-gangue video dataset from the Zaoquan Coal Preparation Plant.
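To make the weighting and storage steps more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the three weights are summed into one attention logit per stored frame and normalized with a softmax, uses bounding-box IoU as a stand-in for ROI similarity, treats the learnable weights as fixed per-frame biases (they would be trained in a real model), and models the LSS module as two bounded FIFO buffers in which frames evicted from the short-term buffer move into the long-term buffer. The names `LongShortMemory`, `aggregate`, and the buffer capacities are hypothetical, and the TRN aggregation step is omitted.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two feature vectors (semantic similarity weight)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes; an assumed stand-in for ROI similarity."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


class LongShortMemory:
    """Hypothetical long/short-term feature store; capacities are assumptions."""

    def __init__(self, long_cap=8, short_cap=2):
        self.long = deque(maxlen=long_cap)
        self.short = deque(maxlen=short_cap)

    def push(self, feat, box):
        # Every new frame enters the short-term buffer; when it is full, the
        # oldest short-term entry is copied into the long-term buffer first.
        if len(self.short) == self.short.maxlen:
            self.long.append(self.short[0])
        self.short.append((feat, box))

    def entries(self):
        return list(self.long) + list(self.short)


def aggregate(key_feat, key_box, memory, learnable_bias):
    """Reweight stored features against the key frame and fuse them.

    learnable_bias needs one value per stored entry (long-term first).
    """
    support = memory.entries()
    if not support:
        return list(key_feat)
    logits = []
    for i, (feat, box) in enumerate(support):
        sem = cosine(key_feat, feat)  # semantic similarity weight
        roi = box_iou(key_box, box)  # ROI similarity weight (IoU assumption)
        logits.append(sem + roi + learnable_bias[i])  # assumed additive combination
    weights = softmax(logits)
    fused = [sum(w * feat[d] for w, (feat, _) in zip(weights, support))
             for d in range(len(key_feat))]
    # Residual fusion: key-frame feature plus the attention-weighted memory summary.
    return [k + f for k, f in zip(key_feat, fused)]


if __name__ == "__main__":
    memory = LongShortMemory(long_cap=4, short_cap=2)
    for t in range(5):
        memory.push([1.0, 0.1 * t], (0, 0, 10, 10 + t))
    print(aggregate([1.0, 0.5], (0, 0, 10, 14), memory, [0.0] * len(memory.entries())))
```

The residual addition keeps the key-frame feature intact when the stored frames are uninformative; a trained model would more likely learn this fusion step as well.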
The results show that, compared with the memory enhanced global-local aggregation (MEGA) network, flow-guided feature aggregation (FGFA), relation distillation networks (RDN), and deep feature flow (DFF) models for video object recognition, the proposed LSS-based aggregation-enhanced coal-gangue video recognition model achieves the highest mean average precision (mAP), 77.12%. The recognition precision of all models is negatively correlated with the moving speed of the target in the video. For slow-moving targets, the proposed model reaches a recognition precision of 83.82%, the best among the compared models.