Entity Extraction of Coal Mine Safety Accidents with Integrated Lexical Information
Abstract: Named entity recognition (NER) is a key step in constructing knowledge graphs, and extracting information from unstructured text on coal mine safety accidents remains difficult. This paper proposes an entity extraction method that fuses lexical information, built on a large-scale Chinese pre-trained language model, for NER in the coal mine safety accident domain. First, coal mine texts are collected into a corpus, and an ontology model of coal mine safety accidents with 12 concept classes is constructed from an all-factor safety evaluation within the framework of the overall system structure. Second, character and word information are fused on the domain dataset: RoBERTa provides character feature vectors, an Aho-Corasick (AC) automaton matches characters to lexicon words, GloVe provides word feature vectors, and a self-attention mechanism combines the character and word features into fused vectors. Finally, NER is performed on the fused representation: a BiLSTM captures contextual features and a CRF applies label constraints to produce the predictions. The 6564 extracted entities are stored in a Neo4j graph database, which supports basic queries. The RoBERTa-BiLSTM-CRF model with fused lexical information achieves an F1-score of 91.63% on coal mine safety accident NER. The study delivers entity extraction and dataset construction for coal mine safety accidents, laying a foundation for building a knowledge graph in this vertical domain.
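The character-to-word matching step mentioned in the abstract can be illustrated with a small Aho-Corasick automaton. The following is a minimal sketch, not the authors' implementation: the function names, the four-word lexicon, and the example sentence are all assumptions chosen for illustration. It finds every lexicon word in a sentence in a single pass and then lists, for each character, the words that cover it, which is the kind of lookup a character-word fusion layer would need before attending over word embeddings.

```python
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick automaton (trie plus failure links) from a lexicon."""
    goto, fail, out = [{}], [0], [[]]  # per-node children, failure link, matched words
    for w in words:
        node = 0
        for ch in w:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(w)
    # Breadth-first pass: failure links of shallow nodes are set before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit words that end at the failure target (proper suffixes).
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def match_words(text, automaton):
    """Return (start, end, word) for every lexicon word found in text; end is exclusive."""
    goto, fail, out = automaton
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for w in out[node]:
            matches.append((i - len(w) + 1, i + 1, w))
    return matches


def words_per_char(text, matches):
    """For each character position, list the matched words that cover it."""
    covering = [[] for _ in text]
    for start, end, w in matches:
        for i in range(start, end):
            covering[i].append(w)
    return covering


# Hypothetical lexicon and sentence, for illustration only.
lexicon = ["煤矿", "瓦斯", "瓦斯爆炸", "爆炸"]
sentence = "煤矿发生瓦斯爆炸"
automaton = build_automaton(lexicon)
matches = match_words(sentence, automaton)
covering = words_per_char(sentence, matches)
```

In this sketch the character 爆 (index 6) is covered by both 瓦斯爆炸 and 爆炸, while 发 and 生 are covered by no lexicon word. In a fusion model, each character's covering words would be looked up in a word-embedding table (GloVe in the abstract) and combined with the character's RoBERTa vector; that part is omitted here.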