Abstract:
Skeleton-sequence-based behavior recognition models offer fast processing, low computational cost, and simple structures, and graph convolutional networks (GCNs) are well suited to processing skeleton sequence data. However, existing graph-convolution-based miner behavior recognition models struggle to balance high accuracy with low computational complexity. To address this issue, this study proposed a miner behavior recognition model based on a lightweight pose estimation network (Lite-HRNet) and a multi-dimensional feature-enhanced spatial-temporal graph convolutional network (MEST-GCN). First, a target detector performed human detection: image features were extracted by a convolutional neural network (CNN), anchor boxes were generated by a region proposal network (RPN) and classified to determine whether they contained a target, bounding box regression was applied to the anchor boxes identified as containing targets to output human bounding boxes, and the optimal detection result was selected via non-maximum suppression. The detected human regions were then cropped and fed into Lite-HRNet to generate skeleton sequences from human pose keypoints. MEST-GCN improved on the spatial-temporal graph convolutional network (ST-GCN) by removing redundant layers to simplify the model structure and reduce the number of parameters, and by introducing a multi-dimensional feature fusion attention module (M2FA). The generated skeleton sequences were normalized by a batch normalization (BN) layer, miner behavior features were extracted by the multi-dimensional feature-enhanced graph convolution modules, and these features were passed through global average pooling and a Softmax layer to obtain behavior confidences, yielding the miner behavior prediction results. Experimental results showed that: ① the parameter count of MEST-GCN was reduced to 1.87 MiB; ② on the public NTU60 dataset, with Lite-HRNet extracting 2D human keypoint coordinates, the model based on Lite-HRNet and MEST-GCN reached accuracies of 88.0% and 92.6% under the cross-subject and cross-view evaluation protocols, respectively; ③ on a custom-built miner behavior dataset, the model achieved an accuracy of 88.5% and a video processing speed of 18.26 frames per second, identifying miner action categories accurately and quickly.
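To make the recognition-head data flow described above concrete (input batch normalization, stacked spatial-temporal graph convolution blocks, global average pooling, and a Softmax classifier), the following is a minimal PyTorch sketch of a simplified skeleton-based head in the spirit of MEST-GCN. All class names, channel sizes, the identity-adjacency placeholder, and the omission of the M2FA attention module are illustrative assumptions for exposition only, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConvBlock(nn.Module):
    """One spatial-temporal graph convolution block: a spatial GCN over the
    skeleton adjacency, followed by a temporal convolution with BN and ReLU."""

    def __init__(self, in_channels, out_channels, adjacency, temporal_kernel=9, stride=1):
        super().__init__()
        # Fixed (non-learnable) normalized adjacency of the skeleton graph.
        self.register_buffer("A", adjacency)
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Sequential(
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=(temporal_kernel, 1),
                      stride=(stride, 1), padding=(pad, 0)),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        # x: (N, C, T, V) -- batch, channels, frames, joints.
        x = self.spatial(x)
        # Aggregate features over neighboring joints via the adjacency matrix.
        x = torch.einsum("nctv,vw->nctw", x, self.A)
        return F.relu(self.temporal(x))


class SkeletonActionHead(nn.Module):
    """Simplified recognition head: input BN -> stacked graph conv blocks ->
    global average pooling over frames and joints -> Softmax confidences."""

    def __init__(self, num_joints=17, in_channels=2, num_classes=6, adjacency=None):
        super().__init__()
        if adjacency is None:
            # Placeholder: self-connections only; a real model would use the
            # normalized adjacency of the pose-keypoint skeleton graph.
            adjacency = torch.eye(num_joints)
        self.data_bn = nn.BatchNorm1d(in_channels * num_joints)
        self.blocks = nn.ModuleList([
            GraphConvBlock(in_channels, 64, adjacency),
            GraphConvBlock(64, 128, adjacency, stride=2),
            GraphConvBlock(128, 256, adjacency, stride=2),
        ])
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):
        # x: (N, C, T, V) skeleton sequence, e.g. C=2 for 2D keypoint coordinates.
        n, c, t, v = x.shape
        x = self.data_bn(x.permute(0, 1, 3, 2).reshape(n, c * v, t))
        x = x.reshape(n, c, v, t).permute(0, 1, 3, 2)
        for block in self.blocks:
            x = block(x)
        x = x.mean(dim=(2, 3))               # global average pooling over T and V
        return F.softmax(self.fc(x), dim=1)  # per-class behavior confidence


if __name__ == "__main__":
    model = SkeletonActionHead(num_joints=17, in_channels=2, num_classes=6)
    clip = torch.randn(1, 2, 64, 17)   # one 64-frame clip of 17 2D keypoints
    print(model(clip).shape)           # torch.Size([1, 6])
```

In this sketch the skeleton sequence produced by the pose estimator is treated as a tensor of shape (batch, coordinate channels, frames, joints); the number of joints, channel widths, and class count are arbitrary example values.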