A multi-modal detection method for holding ladders in underground climbing operations
Graphical Abstract
Abstract
Currently, most research on recognizing unsafe behaviors of underground personnel focuses on improving precision through computer vision. However, underground areas are prone to occlusion, unstable lighting, and reflection, making it difficult to accurately recognize unsafe behaviors using computer vision technology alone. In particular, similar actions such as climbing a ladder and holding a ladder during climbing operations are easily confused during recognition, posing safety hazards. To address these problems, a multi-modal detection method for holding ladders in underground climbing operations is proposed. This method analyzes surveillance video data from two modalities: visual and audio. In the visual modality, the YOLOv8 model is used to detect the presence of a ladder. If a ladder is present, its position coordinates are obtained, and the video segment is fed into the OpenPose algorithm for pose estimation to extract the features of the skeletal joint points of the human body. These skeletal joint point sequences are then passed to an improved spatial attention temporal graph convolutional network (SAT-GCN) to obtain human action labels and their corresponding probabilities. In the audio modality, the PaddlePaddle automatic speech recognition system is used to convert speech into text, and the bidirectional encoder representations from transformers (BERT) model is used to analyze the text and extract its features, yielding a text label and its corresponding probability. Finally, the information obtained from the visual and audio modalities is fused at the decision-making level to determine whether personnel are holding ladders for underground climbing operations.
The experimental results show that in action recognition based on skeleton data, the optimized SAT-GCN model improves the recognition precision of three types of actions, holding, climbing, and standing, by 3.36%, 2.83%, and 10.71%, respectively. The multi-modal detection method achieves a higher recognition accuracy than either single-modality method, reaching 98.29%.
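The decision-level fusion step described above can be sketched as a weighted combination of the per-modality label probabilities. This is a minimal illustrative sketch, not the paper's actual implementation: the weights, threshold-free argmax rule, label names, and example probabilities are all assumptions for demonstration.

```python
# Hypothetical sketch of decision-level fusion: the visual branch (SAT-GCN)
# and the audio branch (ASR + BERT) each produce label probabilities, and
# the two are combined with assumed weights to pick a final label.
# w_visual, w_audio, and all example values below are illustrative
# assumptions, not parameters from the paper.

def fuse_decisions(visual, audio, w_visual=0.6, w_audio=0.4):
    """Weighted decision-level fusion of per-modality label probabilities.

    visual, audio: dicts mapping label -> probability for each modality.
    Returns (fused label, fused score).
    """
    labels = set(visual) | set(audio)
    fused = {
        label: w_visual * visual.get(label, 0.0) + w_audio * audio.get(label, 0.0)
        for label in labels
    }
    best = max(fused, key=fused.get)
    return best, fused[best]

# Example: the visual branch leans toward "holding" and the audio branch
# (keywords recognized in speech) agrees, so the fused decision is "holding".
visual_probs = {"holding": 0.80, "climbing": 0.15, "standing": 0.05}
audio_probs = {"holding": 0.70, "climbing": 0.30}
label, score = fuse_decisions(visual_probs, audio_probs)
```

A single argmax over the fused scores is the simplest fusion rule; in practice the per-modality weights could themselves be learned or tuned on validation data.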