Abstract:
Named Entity Recognition (NER) is a foundational task in constructing knowledge graphs for coal mine safety accidents, yet the absence of explicit word boundaries in Chinese text has limited how effectively existing entity extraction models use lexical information. To address this challenge, a RoBERTa-BiLSTM-CRF model integrated with lexical information was proposed for entity extraction in coal mine safety accidents. First, a domain-specific lexicon for coal mine safety was constructed; character-level feature vectors were obtained via RoBERTa, and the candidate lexicon words corresponding to each character were identified with an Aho-Corasick (AC) automaton. Lexical feature vectors for the matched words were then derived from GloVe embeddings. Next, a self-attention mechanism dynamically allocated weights to fuse the RoBERTa-based character features with the GloVe-based lexical features, yielding a composite vector enriched with lexical semantics. Finally, the fused vector was fed into a BiLSTM-CRF framework to generate the optimal label sequence, thereby achieving accurate entity extraction in coal mine safety accidents. Experimental results demonstrated that: (1) the proposed model achieved an F1-score of 91.63%, which was 1.63% higher than that of the baseline RoBERTa-BiLSTM-CRF model; and (2) it outperformed the comparative models both on the overall entity extraction task and on the individual entity categories, indicating that its design is applicable to diverse entity types.
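
The lexicon-matching step described above can be illustrated with a minimal sketch. It assumes a plain-Python Aho-Corasick automaton and a tiny hypothetical lexicon; the function names `build_automaton` and `match_words_per_char` and the example words are illustrative assumptions, not the authors' implementation or lexicon. For each character in a sentence, the sketch collects every lexicon word that covers that character, which is the information a later fusion step would need.

```python
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick automaton (trie + failure links) over `words`.

    Hypothetical helper for illustration only.
    goto[n] maps a character to a child node, fail[n] is the failure link,
    and out[n] lists every lexicon word that ends at node n.
    """
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(word)

    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit words that end at the failure target (shorter suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out


def match_words_per_char(text, words):
    """Return, for each character position, the lexicon words covering it."""
    goto, fail, out = build_automaton(words)
    matches = [[] for _ in text]
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for word in out[node]:
            start = i - len(word) + 1
            for j in range(start, i + 1):
                matches[j].append(word)
    return matches


# Hypothetical lexicon entries and sentence, chosen only for illustration.
lexicon = ["瓦斯", "瓦斯爆炸", "爆炸", "事故"]
sentence = "瓦斯爆炸事故"
for ch, covering in zip(sentence, match_words_per_char(sentence, lexicon)):
    print(ch, covering)
```

In a full pipeline, the words collected for each character would be mapped to their embedding vectors and combined with that character's contextual vector; the attention-based weighting itself is not shown here because the abstract does not specify its exact form.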