Research on Extraction of Unstructured Petroleum Information Based on NER

doi:10.11885/j.issn.1674-5086.2020.05.12.01

Abstract

Abstract: As construction of the "intelligent oilfield" accelerates, building intelligent analysis systems on top of massive petroleum data has become highly significant. However, the text data generated during oilfield production is often unstructured and varied in type, so extracting key information from it for analysis has become an active research topic, and information extraction in turn depends on high-quality semantic entities. To address this specific problem, this paper proposes an information extraction method for unstructured petroleum text based on named entity recognition (NER). A bidirectional long short-term memory (Bi-LSTM) network extracts features from the corpus, and a conditional random field (CRF) serves as the classifier; together they form a high-precision Bi-LSTM+CRF NER model for extracting named entities from unstructured text in the petroleum industry. Comparative experiments on a well workover text dataset show that the method achieves higher precision and recall than the other methods compared.

Keywords: named entity recognition (NER), Bi-LSTM+CRF, information extraction, unstructured text
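The final stage of a Bi-LSTM+CRF tagger, as summarized in the abstract, can be illustrated with a minimal sketch: Viterbi decoding over a linear-chain CRF, followed by conversion of BIO tags into entity spans and an entity-level precision/recall computation. This is not the authors' implementation. The tag set (`TOOL`), the toy sentence, the hand-set scores, and all function names are assumptions made for illustration; in a real model the emission scores would come from the Bi-LSTM layer and the transition scores would be learned.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[Tuple[str, str], float],
                   tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence for a linear-chain CRF.

    emissions[t][tag] is the per-token score (a stand-in for Bi-LSTM output);
    transitions[(prev, cur)] scores moving from `prev` to `cur`.
    Missing transitions default to 0.0.
    """
    if not emissions:
        return []
    # best[t][tag] = best score of any path ending in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert BIO tags to entity spans; I- tags without a matching B- are ignored."""
    spans: List[Span] = []
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def precision_recall(pred: List[Span], gold: List[Span]) -> Tuple[float, float]:
    """Entity-level precision and recall: a span counts only on an exact match."""
    pred_set, gold_set = set(pred), set(gold)
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    return precision, recall


# Toy example with hand-set (hypothetical) scores.
TAGS = ["O", "B-TOOL", "I-TOOL"]
tokens = ["replace", "sucker", "rod", "today"]
emissions = [
    {"O": 2.0, "B-TOOL": 0.1, "I-TOOL": 0.0},
    {"O": 0.2, "B-TOOL": 1.0, "I-TOOL": 1.2},  # emission alone prefers I-TOOL
    {"O": 0.1, "B-TOOL": 0.3, "I-TOOL": 2.0},
    {"O": 2.0, "B-TOOL": 0.0, "I-TOOL": 0.1},
]
# Penalize the invalid BIO transition O -> I-TOOL.
transitions = {("O", "I-TOOL"): -10.0}

path = viterbi_decode(emissions, transitions, TAGS)
print(path)                # ['O', 'B-TOOL', 'I-TOOL', 'O']
print(bio_to_spans(path))  # [(1, 3, 'TOOL')]
```

In this toy case, picking the best tag for each token independently would start the entity with `I-TOOL`, which is not a valid BIO sequence. The transition penalty makes the decoder choose `B-TOOL` instead, which is the usual motivation for placing a CRF layer on top of per-token scores.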

CLC number: TE319

ZHONG Yuan, LIU Xiaorong, WANG Jie, CHEN Yan, ZHANG Tai. Research of Extraction on Petroleum Unstructured Information Based on Named Entity Recognition[J]. Journal of Southwest Petroleum University (Science & Technology Edition), 2020, 42(6): 165-173.


Link to this article: http://journal15.magtechjournal.com/Jwk_xnzk/CN/10.11885/j.issn.1674-5086.2020.05.12.01

http://journal15.magtechjournal.com/Jwk_xnzk/CN/Y2020/V42/I6/165
