Research on a visual semantic method for mine personnel behavior

  • Abstract: Detecting the behavior of personnel in underground coal mines is a key concern in the construction of sensing mines. However, existing detection methods, whether based on electromagnetic waves, wearable devices, or computer vision, cannot combine time, location, behavior, environment, and other factors to judge whether a miner's behavior is safe. This paper proposes a visual semantic method for mine personnel behavior that generates sentences describing personnel behavior in video through feature extraction, semantic detection, feature reconstruction, and decoding. An InceptionV4 network and an I3D network extract the static and dynamic features of the video, respectively, and a parallel dual attention mechanism, built from a spatial position attention model and a channel attention model, is introduced into the InceptionV4 network to improve its feature extraction ability. To address the tendency of generated visual semantics to be inconsistent with the video content, a semantic detection network attaches high-level semantic tags to the video features to produce embedded features, which are fed to the decoder together with the video features and semantic features. A feature reconstruction module is introduced into the decoding process; it rebuilds the video features from the decoder's hidden states, which strengthens the association between the video features and the description sentences and improves the accuracy of the generated visual semantics. Experiments on the public MSVD and MSR-VTT datasets and a self-built mine video dataset show that the method has good semantic consistency, accurately captures the key semantics in a video, and better reflects the video's true meaning.
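
The parallel dual attention idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the layer definitions, so everything below is an assumption made for illustration: the feature map is a plain nested list indexed as `feat[channel][row][col]`, both branches use parameter-free sigmoid weights instead of learned layers, and the two branch outputs are fused by element-wise addition. The function names are hypothetical and this is not the authors' network.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """Scale each channel by sigmoid(global average of that channel).

    Assumption: a learned network would normally produce these weights.
    """
    out = []
    for channel in feat:
        values = [v for row in channel for v in row]
        weight = sigmoid(sum(values) / len(values))
        out.append([[v * weight for v in row] for row in channel])
    return out


def spatial_attention(feat):
    """Scale each position by sigmoid(mean over channels at that position)."""
    n_channels = len(feat)
    height, width = len(feat[0]), len(feat[0][0])
    weights = [
        [sigmoid(sum(feat[c][h][w] for c in range(n_channels)) / n_channels)
         for w in range(width)]
        for h in range(height)
    ]
    return [
        [[feat[c][h][w] * weights[h][w] for w in range(width)]
         for h in range(height)]
        for c in range(n_channels)
    ]


def parallel_dual_attention(feat):
    """Run both branches on the same input; fuse by element-wise sum (assumed)."""
    by_channel = channel_attention(feat)
    by_position = spatial_attention(feat)
    return [
        [[a + b for a, b in zip(row_c, row_s)]
         for row_c, row_s in zip(chan_c, chan_s)]
        for chan_c, chan_s in zip(by_channel, by_position)
    ]
```

The point of the parallel layout is that both branches see the unmodified input, so neither the channel weighting nor the spatial weighting is computed from features the other branch has already rescaled; a serial arrangement would apply one after the other.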

     
