Abstract:
Skeleton-sequence-based behavior recognition models offer fast processing, low computational cost, and simple structures, and graph convolutional networks (GCNs) are well suited to processing skeleton sequence data. However, existing graph-convolution-based miner behavior recognition models struggle to balance high accuracy with low computational complexity. To address this issue, this study proposed a miner behavior recognition model based on a lightweight pose estimation network (Lite-HRNet) and a multi-dimensional feature-enhanced spatial-temporal graph convolutional network (MEST-GCN). First, a target detector performed human detection: image features were extracted by a convolutional neural network (CNN), anchor boxes were generated by a region proposal network (RPN) and classified to determine whether they contained a target, bounding box regression was applied to the anchor boxes identified as containing targets to output human bounding boxes, and the optimal detection result was selected via non-maximum suppression. The detected human regions were then cropped and fed into Lite-HRNet to generate skeleton sequences from human pose keypoints. MEST-GCN improved on the spatial-temporal graph convolutional network (ST-GCN) by removing redundant layers to simplify the model structure and reduce the number of parameters, and by introducing a multi-dimensional feature fusion attention module (M2FA). The generated skeleton sequences were normalized by a batch normalization (BN) layer, miner behavior features were extracted by the multi-dimensional feature-enhanced graph convolution modules, and these features were passed through global average pooling and a Softmax layer to obtain behavior confidences, yielding the miner behavior prediction results. Experimental results showed that: ① the parameter count of MEST-GCN was reduced to 1.87 MiB; ② on the public NTU60 dataset, with Lite-HRNet extracting 2D human keypoint coordinates, the model based on Lite-HRNet and MEST-GCN reached accuracies of 88.0% and 92.6% under the cross-subject and cross-view evaluation protocols, respectively; ③ on a custom-built miner behavior dataset, the model achieved an accuracy of 88.5% and a video processing speed of 18.26 frames per second, identifying miner action categories accurately and quickly.
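To make the recognition-head data flow described above concrete (input batch normalization, stacked spatial-temporal graph convolution blocks, global average pooling, and a Softmax classifier), the following is a minimal PyTorch sketch of a simplified skeleton-based head in the spirit of MEST-GCN. All class names, channel sizes, the identity-adjacency placeholder, and the omission of the M2FA attention module are illustrative assumptions for exposition only, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConvBlock(nn.Module):
    """One spatial-temporal graph convolution block: a spatial GCN over the
    skeleton adjacency, followed by a temporal convolution with BN and ReLU."""

    def __init__(self, in_channels, out_channels, adjacency, temporal_kernel=9, stride=1):
        super().__init__()
        # Fixed (non-learnable) normalized adjacency of the skeleton graph.
        self.register_buffer("A", adjacency)
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Sequential(
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=(temporal_kernel, 1),
                      stride=(stride, 1), padding=(pad, 0)),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        # x: (N, C, T, V) -- batch, channels, frames, joints.
        x = self.spatial(x)
        # Aggregate features over neighboring joints via the adjacency matrix.
        x = torch.einsum("nctv,vw->nctw", x, self.A)
        return F.relu(self.temporal(x))


class SkeletonActionHead(nn.Module):
    """Simplified recognition head: input BN -> stacked graph conv blocks ->
    global average pooling over frames and joints -> Softmax confidences."""

    def __init__(self, num_joints=17, in_channels=2, num_classes=6, adjacency=None):
        super().__init__()
        if adjacency is None:
            # Placeholder: self-connections only; a real model would use the
            # normalized adjacency of the pose-keypoint skeleton graph.
            adjacency = torch.eye(num_joints)
        self.data_bn = nn.BatchNorm1d(in_channels * num_joints)
        self.blocks = nn.ModuleList([
            GraphConvBlock(in_channels, 64, adjacency),
            GraphConvBlock(64, 128, adjacency, stride=2),
            GraphConvBlock(128, 256, adjacency, stride=2),
        ])
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):
        # x: (N, C, T, V) skeleton sequence, e.g. C=2 for 2D keypoint coordinates.
        n, c, t, v = x.shape
        x = self.data_bn(x.permute(0, 1, 3, 2).reshape(n, c * v, t))
        x = x.reshape(n, c, v, t).permute(0, 1, 3, 2)
        for block in self.blocks:
            x = block(x)
        x = x.mean(dim=(2, 3))               # global average pooling over T and V
        return F.softmax(self.fc(x), dim=1)  # per-class behavior confidence


if __name__ == "__main__":
    model = SkeletonActionHead(num_joints=17, in_channels=2, num_classes=6)
    clip = torch.randn(1, 2, 64, 17)   # one 64-frame clip of 17 2D keypoints
    print(model(clip).shape)           # torch.Size([1, 6])
```

In this sketch the skeleton sequence produced by the pose estimator is treated as a tensor of shape (batch, coordinate channels, frames, joints); the number of joints, channel widths, and class count are arbitrary example values.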