A multi-modal detection method for dedicated ladder-holding personnel in underground climbing operations

  • Abstract: Most current research on recognizing unsafe behaviors of underground mine personnel focuses on improving precision through computer vision. However, underground scenes are prone to occlusion, unstable lighting, and reflections, so computer vision alone struggles to recognize unsafe behaviors accurately. In particular, similar actions in climbing operations, such as climbing a ladder and holding a ladder, are easily confused during recognition, which poses a safety hazard. To address these problems, a multi-modal detection method for dedicated ladder-holding personnel in underground climbing operations is proposed. The method analyzes surveillance video data in two modalities: visual and audio. In the visual modality, a YOLOv8 model detects whether a ladder is present; if so, the position coordinates of the ladder are obtained and the video segment is passed to the OpenPose algorithm for pose estimation, yielding features for each skeletal joint of the human body. These joint sequences are then fed into an improved spatial attention temporal graph convolutional network (SAT-GCN) to obtain human action labels and their corresponding probabilities. In the audio modality, the PaddlePaddle automatic speech recognition system converts speech into text, and a bidirectional encoder representations from transformers (BERT) model analyzes and extracts features from the text to obtain text labels and their corresponding probabilities. Finally, the information from the two modalities is fused at the decision level to determine whether a dedicated person is holding the ladder during underground climbing operations. Experimental results show that, in action recognition based on skeleton data, the optimized SAT-GCN model improves recognition precision for the three actions of holding a ladder, climbing a ladder, and standing by 3.36%, 2.83%, and 10.71%, respectively, and that the multi-modal detection method achieves higher recognition accuracy than the single-modality method, reaching 98.29%.
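As a rough illustration of the visual branch described above, the sketch below gates pose estimation on a positive ladder detection. It assumes the ultralytics YOLOv8 Python API with a hypothetical fine-tuned checkpoint exposing a "ladder" class, and it stubs out the OpenPose call, since the paper's skeleton extraction and SAT-GCN stages are not reproduced here.

from ultralytics import YOLO

# Hypothetical checkpoint fine-tuned to detect ladders; not the paper's weights.
detector = YOLO("yolov8_ladder.pt")

def estimate_skeletons(frames):
    # Stub for OpenPose: the paper extracts skeletal joint sequences here and
    # feeds them to the improved SAT-GCN. Returning an empty list keeps the
    # sketch runnable without the OpenPose bindings installed.
    return []

def visual_branch(frames):
    """Return (ladder boxes, skeleton sequence), or None if no ladder is seen."""
    result = detector(frames[0])[0]              # run detection on the first frame
    ladder_boxes = [
        box.xyxy[0].tolist()                     # position coordinates of the ladder
        for box in result.boxes
        if result.names[int(box.cls)] == "ladder"
    ]
    if not ladder_boxes:
        return None                              # no ladder: skip pose estimation
    return ladder_boxes, estimate_skeletons(frames)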
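The audio branch can be sketched in the same spirit, assuming the PaddleSpeech ASRExecutor interface for speech-to-text and the Hugging Face transformers implementation of BERT; the bert-base-chinese checkpoint and the three-way label set are illustrative assumptions, not the model trained in the paper.

import torch
from paddlespeech.cli.asr.infer import ASRExecutor
from transformers import BertForSequenceClassification, BertTokenizer

asr = ASRExecutor()
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
classifier = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=3  # illustrative labels: holding / climbing / other
)
classifier.eval()

def audio_branch(wav_path):
    """Transcribe speech, then return (label id, probability) from BERT."""
    transcript = asr(audio_file=wav_path, lang="zh")   # speech -> text
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(classifier(**inputs).logits, dim=-1)[0]
    label = int(probs.argmax())
    return label, float(probs[label])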
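Finally, a minimal sketch of the decision-level fusion step. The abstract states that the per-modality labels and probabilities are fused to make the final judgment but does not specify the rule, so the weighted sum and the weights below are assumptions for illustration only.

import numpy as np

LABELS = ("holding", "climbing", "standing")

def fuse(p_visual, p_audio, w_visual=0.6, w_audio=0.4):
    """Weighted sum of per-modality probability vectors over the same labels."""
    p = w_visual * np.asarray(p_visual) + w_audio * np.asarray(p_audio)
    idx = int(p.argmax())
    return LABELS[idx], float(p[idx])

# Example: vision is fairly confident the ladder is being held; audio agrees.
label, conf = fuse([0.7, 0.2, 0.1], [0.8, 0.1, 0.1])
print(label, conf)   # -> holding ~0.74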
