Aggregation-enhanced coal-gangue video recognition model based on long and short-term storage
Abstract: Coal-gangue recognition based on image recognition alone misses some key targets. A video object recognition model matches the requirements of the coal-gangue recognition and separation scene more closely than an image-based model, since coal-gangue features can be extracted more broadly and deeply from video data. However, existing coal-gangue video recognition techniques do not account for the effects of frame repetition, inter-frame similarity, and the contingency of key frames on model performance. To address these problems, an aggregation-enhanced coal-gangue video recognition model based on long and short-term storage (LSS) is proposed. First, key frames and non-key frames are used to pre-screen the massive video information. Multi-frame aggregation is performed on the coal-gangue video frame sequence: the features of each key frame are aggregated with those of its adjacent frames through a temporal relation network (TRN), and long-term and short-term video frames are established, which reduces the computational cost of the model without losing key feature information. Second, an attention mechanism that fuses semantic similarity weights, learnable weights, and region-of-interest (ROI) similarity weights is used to redistribute the feature weights among the long-term video frames, the short-term video frames, and the key frames. Finally, an LSS module is designed for storage enhancement: it stores the effective features of the long-term and short-term video frames and fuses them during key-frame recognition, strengthening the representation of key-frame features and thereby realizing coal-gangue recognition. The model was validated on a self-built coal-gangue video dataset from the Zaoquan Coal Preparation Plant. The results show that the proposed model achieves a mean average precision (mAP) of 77.12%, outperforming the memory-enhanced global-local aggregation (MEGA) network, flow-guided feature aggregation (FGFA) for video object detection, relation distillation networks (RDN), and deep feature flow (DFF) for video recognition. Recognition precision is negatively correlated with the motion speed of targets in the video; on slow-moving targets the proposed model attains its highest recognition precision of 83.82%.
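The following is a minimal PyTorch sketch (not the authors' implementation) of the aggregation idea described above: ROI features of a key frame are re-weighted against features held in long-term and short-term stores, using an attention weight that fuses cosine (semantic) similarity, a learnable projection, and an ROI-geometry similarity term. All class, tensor, and parameter names (FusedAttentionAggregator, LongShortTermStore, long_size, short_size, etc.) are illustrative assumptions, not names taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusedAttentionAggregator(nn.Module):
    """Enhances key-frame ROI features with features read from the long/short-term stores."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # learnable-weight branch (query)
        self.k_proj = nn.Linear(dim, dim)   # learnable-weight branch (key)
        self.out = nn.Linear(dim, dim)

    def forward(self, key_feat, stored_feat, key_roi, stored_roi):
        # key_feat:    (Nk, C) ROI features of the key frame
        # stored_feat: (Ns, C) ROI features held in the long/short-term stores
        # key_roi, stored_roi: (Nk, 4) / (Ns, 4) box coordinates used for ROI similarity
        sem_sim = F.normalize(key_feat, dim=-1) @ F.normalize(stored_feat, dim=-1).t()
        learn_sim = self.q_proj(key_feat) @ self.k_proj(stored_feat).t() / key_feat.size(-1) ** 0.5
        roi_sim = -torch.cdist(key_roi.float(), stored_roi.float())   # closer boxes -> larger weight
        weights = F.softmax(sem_sim + learn_sim + roi_sim, dim=-1)    # fused weight redistribution
        return key_feat + self.out(weights @ stored_feat)             # enhanced key-frame features


class LongShortTermStore:
    """FIFO buffers for long-term (sparsely sampled) and short-term (adjacent) frame features."""

    def __init__(self, long_size: int = 12, short_size: int = 3):
        self.long, self.short = [], []
        self.long_size, self.short_size = long_size, short_size

    def update(self, feat, roi, is_long: bool):
        buf, cap = (self.long, self.long_size) if is_long else (self.short, self.short_size)
        buf.append((feat.detach(), roi.detach()))
        if len(buf) > cap:
            buf.pop(0)          # discard the oldest entry once the buffer is full

    def read(self):
        feats, rois = zip(*(self.long + self.short))
        return torch.cat(feats, dim=0), torch.cat(rois, dim=0)


if __name__ == "__main__":
    agg = FusedAttentionAggregator(dim=256)
    store = LongShortTermStore()
    store.update(torch.randn(5, 256), torch.rand(5, 4) * 100, is_long=True)   # a long-term frame
    store.update(torch.randn(4, 256), torch.rand(4, 4) * 100, is_long=False)  # an adjacent frame
    key_feat, key_roi = torch.randn(3, 256), torch.rand(3, 4) * 100           # the key frame
    stored_feat, stored_roi = store.read()
    enhanced = agg(key_feat, stored_feat, key_roi, stored_roi)
    print(enhanced.shape)  # torch.Size([3, 256])
```

In a two-stage detector such as Faster R-CNN, the enhanced key-frame ROI features would replace the original ones before the classification and box-regression heads; the scheme above is only a sketch of that idea under the stated assumptions.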
Table 1. mAP comparison between the proposed model and the MEGA, FGFA, RDN, and DFF models (recognition precision, %)

Model            Fast-moving targets   Medium-speed targets   Slow-moving targets   mAP
Proposed model   55.12                 76.02                  83.82                 77.12
MEGA-101         55.63                 76.24                  82.39                 76.65
MEGA-50          49.53                 70.58                  79.83                 72.63
RDN-101          51.65                 71.95                  82.10                 74.68
RDN-50           45.27                 67.46                  80.22                 70.40
FGFA-101         43.97                 69.86                  81.07                 71.91
FGFA-50          40.75                 66.57                  78.90                 68.68
DFF-101          37.47                 66.87                  79.32                 68.42
DFF-50           35.65                 62.19                  74.14                 63.50
[1] SHARMA V, GUPTA M, KUMAR A, et al. Video processing using deep learning techniques: a systematic literature review[J]. IEEE Access, 2021, 9: 139489-139507. doi: 10.1109/ACCESS.2021.3118541
[2] AICH A, ZHENG M, KARANAM S, et al. Spatio-temporal representation factorization for video-based person re-identification[C]. International Conference on Computer Vision, Montreal, 2021: 152-162.
[3] SUN Lixin. Research on coal gangue recognition method based on convolutional neural network[D]. Handan: Hebei University of Engineering, 2020. (in Chinese)
[4] PAN Hongguang, SHI Yuhong, LEI Xinyu, et al. Fast identification model for coal and gangue based on the improved tiny YOLO V3[J]. Journal of Real-Time Image Processing, 2022, 19(3): 687-701. doi: 10.1007/s11554-022-01215-1
[5] ZHU Xizhou, WANG Yujie, DAI Jifeng, et al. Flow-guided feature aggregation for video object detection[C]. IEEE International Conference on Computer Vision, Venice, 2017: 408-417.
[6] ZHANG Yong. Research on gangue identification based on video processing[D]. Xuzhou: China University of Mining and Technology, 2018. (in Chinese)
[7] CHENG Jian, WANG Dongwei, YANG Lingkai, et al. An improved Gaussian mixture model for coal gangue video detection[J]. Journal of Central South University (Science and Technology), 2018, 49(1): 118-123. (in Chinese)
[8] LEI Xinyu, PAN Hongguang, HUANG Xiangdong. A dilated CNN model for image classification[J]. IEEE Access, 2019, 7: 124087-124095. doi: 10.1109/ACCESS.2019.2927169
[9] PAN Hongguang, WEN Fan, HUANG Xiangdong, et al. The enhanced deep plug-and-play super-resolution algorithm with residual channel attention networks[J]. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 2021, 41(2): 4069-4078.
[10] ZHU Xizhou, DAI Jifeng, YUAN Lu, et al. Towards high performance video object detection[C]. IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018: 7210-7218.
[11] WANG Shiyao, ZHOU Yucong, YAN Junjie, et al. Fully motion-aware network for video object detection[C]. European Conference on Computer Vision, Munich, 2018: 542-557.
[12] WU Haiping, CHEN Yuntao, WANG Naiyan, et al. Sequence level semantics aggregation for video object detection[C]. IEEE/CVF International Conference on Computer Vision, Seoul, 2019: 9217-9225.
[13] FEICHTENHOFER C, PINZ A, ZISSERMAN A. Detect to track and track to detect[C]. IEEE International Conference on Computer Vision, Venice, 2017: 3038-3046.
[14] ZHOU Bolei, ANDONIAN A, TORRALBA A. Temporal relational reasoning in videos[C]. European Conference on Computer Vision, Munich, 2018: 803-818.
[15] LIN T Y, DOLLAR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]. IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017: 936-944.
[16] HE Kaiming, GKIOXARI G, DOLLAR P, et al. Mask R-CNN[C]. International Conference on Computer Vision, Venice, 2017: 2961-2969.
[17] REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031
[18] CHEN Yihong, CAO Yue, HU Han, et al. Memory enhanced global-local aggregation for video object detection[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, 2020: 10337-10346.
[19] GIRSHICK R. Fast R-CNN[C]. IEEE International Conference on Computer Vision, Santiago, 2015: 1440-1448.
[20] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780. doi: 10.1162/neco.1997.9.8.1735
[21] DENG Jiajun, PAN Yingwei, YAO Ting, et al. Relation distillation networks for video object detection[C]. IEEE/CVF International Conference on Computer Vision, Seoul, 2019: 7023-7032.
[22] ZHU Xizhou, XIONG Yuwen, DAI Jifeng, et al. Deep feature flow for video recognition[C]. IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017: 2349-2358.