融合词汇信息的煤矿安全事故实体提取

Entity extraction integrating lexical information for coal mine safety accidents

  • 摘要: 命名实体识别是构建煤矿安全事故领域知识图谱的基本任务,但中文缺乏明显的词汇边界特征,导致现有实体提取模型对词汇信息利用不充分。针对上述问题,提出了一种融合词汇信息的煤矿安全事故实体提取模型——融合词汇信息的RoBERTa−BiLSTM−CRF模型。首先,构建煤矿安全领域专业词典,采用RoBERTa获取字符特征向量,采用AC自动机算法进行字词匹配,得到字符对应的潜在词汇,采用Glove获取词汇特征向量。然后,通过自注意机制分配权重,将基于RoBERTa得到的字符特征向量和基于GloVe得到的词汇特征向量进行融合,得到包含词汇信息的融合向量。最后,将融合向量作为BiLSTM−CRF的输入,得到最优预测序列结果,实现煤矿安全事故实体提取。实验结果表明:① 融合词汇信息的RoBERTa−BiLSTM−CRF模型对煤矿安全领域12种实体提取的F1达91.63%,较RoBERTa−BiLSTM−CRF模型提高了1.63%。② 融合词汇信息的RoBERTa−BiLSTM−CRF模型在整体实体提取任务及各类实体类型的提取任务中,综合性能优于其他模型,说明模型架构设计对不同实体类型具有广泛适用性。

     

    Abstract: Named Entity Recognition (NER) serves as a foundational task in constructing knowledge graphs for coal mine safety accidents, yet the absence of explicit lexical boundaries in Chinese text has constrained the effective utilization of lexical information by existing entity extraction models. To address this challenge, a RoBERTa-BiLSTM-CRF model integrated with lexical information was proposed for entity extraction in coal mine safety accidents. Initially, a domain-specific lexicon for coal mine safety was constructed, where character-level feature vectors were obtained via RoBERTa, and potential lexical units corresponding to characters were identified through the Aho-Corasick (AC) Automation. Subsequently, lexical feature vectors were derived using GloVe embeddings. These vectors were then fused via a self-attention mechanism, which dynamically allocated weights to integrate RoBERTa-based character features and GloVe-based lexical features, yielding a composite vector enriched with lexical semantics. Finally, the fused vector was fed into a BiLSTM-CRF framework to generate optimized prediction sequences, thereby achieving accurate entity extraction in coal mine safety accidents. Experimental results demonstrated that: (1) the proposed model achieved an F1-score of 91.63%, which was 1.63 % higher than that of the RoBERTa-BiLSTM-CRF model. (2) It outperformed comparative models in both overall entity extraction tasks and across various entity categories, indicating the broad applicability of its design to diverse entity types.

     

/

返回文章
返回