Journal of Southwest Petroleum University (Science & Technology Edition) ›› 2020, Vol. 42 ›› Issue (6): 165-173. DOI: 10.11885/j.issn.1674-5086.2020.05.12.01

• Special Issue on Artificial Intelligence Technology and Applications in Oil and Gas Fields •

Research on NER-Based Extraction of Unstructured Petroleum Information

ZHONG Yuan, LIU Xiaorong, WANG Jie, CHEN Yan, ZHANG Tai

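  1. School of Computer Science, Southwest Petroleum University, Chengdu, Sichuan 610500
  • Received: 2020-05-12 Published: 2020-12-21
  • Corresponding author: ZHONG Yuan, E-mail: strangeryy0202@163.com
  • About the authors: ZHONG Yuan, born in 1982, female, Han ethnicity, from Nanchong, Sichuan, lecturer, master's degree, mainly engaged in research on machine learning, deep learning, learning patterns and their applications. E-mail: strangeryy0202@163.com; LIU Xiaorong, born in 1994, female, Han ethnicity, from Bazhong, Sichuan, master's student, mainly engaged in research on machine learning and petroleum text mining. E-mail: lxr0528@126.com; WANG Jie, born in 1994, male, Han ethnicity, from Guangyuan, Sichuan, master's student, mainly engaged in research on natural language processing. E-mail: wangjie_self@qq.com; CHEN Yan, born in 1982, female, Han ethnicity, from Qingyang, Gansu, associate professor, mainly engaged in research on machine learning and complex networks. E-mail: carly.chenyan@gmail.com; ZHANG Tai, born in 1995, male, Han ethnicity, from Nanchong, Sichuan, master's student, mainly engaged in research on applying machine learning in petroleum drilling environments. E-mail: m.dreamland@foxmail.com
  • Funding:
    Open Fund of the State Key Laboratory of Oil and Gas Reservoir Geology and Exploitation (PLN201731); Southwest Petroleum University Innovation Base Project (642)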

Research of Extraction on Petroleum Unstructured Information Based on Named Entity Recognition

ZHONG Yuan, LIU Xiaorong, WANG Jie, CHEN Yan, ZHANG Tai   

  1. School of Computer Science, Southwest Petroleum University, Chengdu, Sichuan 610500, China
  • Received:2020-05-12 Published:2020-12-21

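Abstract (translated from the Chinese): As the construction of "intelligent oilfields" accelerates, building intelligent analysis systems on top of massive petroleum data is of great significance. However, because the text data generated during petroleum production are often unstructured and of many types, extracting key information from them for analysis has become a research hotspot, and information extraction in turn requires high-quality semantic entities as support. To address this specific problem, the paper proposes extracting information from unstructured petroleum text with named entity recognition (NER): a bidirectional long short-term memory (Bi-LSTM) network extracts features from the corpus, a conditional random field (CRF) serves as the classifier, and the resulting Bi-LSTM+CRF model is a high-accuracy NER model for extracting named entities from unstructured text in the petroleum industry. Comparative experiments on a well workover text dataset show that the method achieves relatively high precision and recall.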

Keywords (translated from the Chinese): named entity recognition, Bi-LSTM+CRF, information extraction, unstructured text

Abstract: As the construction of "intelligent oilfields" accelerates, building an intelligent analysis system for massive oil data is of great significance. However, because the dynamic text data generated during oilfield production are often unstructured and of various types, extracting the crucial information for analysis has become a popular area of research, and information extraction in turn needs high-quality semantic entities as support. To address this particular problem, we propose an information extraction method for unstructured text based on Named Entity Recognition (NER). A Bidirectional Long Short-Term Memory (Bi-LSTM) network extracts features from the oil corpus, and a Conditional Random Field (CRF) is combined with it as the classifier. The resulting Bi-LSTM+CRF model is a high-precision NER model for extracting named entities from unstructured texts in the petroleum industry. Comparative experiments on a text data set of well workover treatment show that this method achieves higher precision and recall than other state-of-the-art methods.

Key words: NER, Bi-LSTM+CRF, information extraction, unstructured text
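
To make the pipeline in the abstract more concrete, the following is a minimal sketch of the CRF decoding and evaluation steps only, not the authors' implementation. It assumes a hypothetical BIO tag set with one entity type (`TOOL`), and it replaces the Bi-LSTM with hand-written per-token emission scores; the transition scores, the example sentence, and all function names are likewise illustrative assumptions. Viterbi decoding picks the best tag sequence, the tags are turned into entity spans, and entity-level precision and recall are computed against a gold set.

```python
from typing import List, Set, Tuple

# Hypothetical BIO tag set; the tag set used in the paper is not given here.
TAGS = ["O", "B-TOOL", "I-TOOL"]

# Made-up emission scores (one row per token) standing in for Bi-LSTM outputs.
# Tokens (hypothetical): "replace", "the", "sucker", "rod", "today"
EMISSIONS = [
    [2.0, 0.1, 0.1],
    [2.0, 0.2, 0.1],
    [0.5, 1.5, 1.0],
    [0.6, 0.4, 1.4],
    [1.8, 0.1, 0.9],
]

# Made-up CRF transition scores: TRANSITIONS[i][j] is the score of tag i -> tag j.
# O -> I-TOOL is heavily penalised because it is invalid in the BIO scheme.
TRANSITIONS = [
    [0.5, 0.2, -10.0],
    [0.0, -1.0, 0.8],
    [0.3, -0.5, 0.4],
]


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the tag-index sequence with the highest total score."""
    n_tags = len(transitions)
    score = list(emissions[0])
    backptr: List[List[int]] = []
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for j in range(n_tags):
            # Best previous tag for ending in tag j at this token.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backptr.append(ptr)
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(backptr):  # follow back-pointers from the last token
        best = ptr[best]
        path.append(best)
    return path[::-1]


def bio_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, type) entity spans from BIO tags."""
    spans: Set[Tuple[int, int, str]] = set()
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag.startswith("I-") and start is not None and tag[2:] == kind:
            continue  # still inside the current entity
        if start is not None:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def precision_recall(pred: Set[Tuple[int, int, str]],
                     gold: Set[Tuple[int, int, str]]) -> Tuple[float, float]:
    """Entity-level precision and recall using exact span matching."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    tags = [TAGS[i] for i in viterbi(EMISSIONS, TRANSITIONS)]
    print(tags, bio_spans(tags))
```

In a trained model the emission scores would come from the Bi-LSTM and the transition scores would be learned jointly with it; the transition matrix is what lets the CRF rule out tag sequences that are locally plausible but globally invalid.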

Chinese Library Classification (CLC) number: