Thesis Information

Title:

Research and Application of Biomedical Entity Recognition and Relation Extraction Based on Pointer Tagging

    

Name:

 Dong Gaocai (董高材)

Student ID:

 1049721801631

Confidentiality level:

 Public

Thesis language:

 chi (Chinese)

Discipline code:

 083500

Discipline:

 Engineering - Software Engineering

Student type:

 Master's

Degree:

 Master of Engineering

University:

 Wuhan University of Technology (武汉理工大学)

School:

 School of Computer Science and Technology

Major:

 Software Engineering

Research direction:

 Biomedical relation extraction

First advisor:

 Yuan Xiaohui (袁晓辉)

First advisor's affiliation:

 Wuhan University of Technology

Completion date:

 2021-03-30

Defense date:

 2021-05-23

Keywords:

pointer tagging; nested entity recognition; relation extraction; joint learning; drug repositioning

    

Abstract:

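In recent years, biomedical research has advanced rapidly, and a large body of literature reporting new findings has been published. Although considerable human and material resources are devoted to manually curating the information in this literature, manual updating alone cannot keep pace with the rate of publication. Compared with the general domain, information extraction from biomedical literature must also cope with large numbers of nested entities and overlapping relations. Accurately extracting valuable knowledge from the vast biomedical literature is therefore a major challenge for biomedical information extraction.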

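To this end, this thesis studies entity recognition and relation extraction in the biomedical domain. For entity recognition, the proposed method, based on cascade pointer tagging, recognizes both non-nested (flat) and nested biomedical entities. Once entities have been recognized, entities and relations are extracted jointly with a method based on two-time pointer tagging. Finally, combining entity recognition and relation extraction, a literature-mining-based drug repositioning framework is proposed. Specifically, the main contributions are as follows: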

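(1) For nested entity recognition in the biomedical domain, a method based on cascade pointer tagging, CPT (Cascade Pointer Tagging), is constructed. Cascade pointer tagging overcomes the inability of sequence-labeling approaches to recognize nested entities. In addition, entity descriptions are used as prior knowledge, which introduces entity category information into the recognition process and yields better results. In comparisons with baseline methods, CPT achieves the highest F1 score on both nested and non-nested entity recognition.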

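(2) To address the large number of overlapping relations in biomedical literature, a joint-learning relation extraction method based on two-time pointer tagging, TPT (Two-time Pointer Tagging), is constructed. Unlike pipeline-based methods, it does not suffer from error propagation, neglect of the interaction between subtasks, or redundant information, and it also handles overlapping relations in the biomedical domain. The extraction of relation triples is cast as a functional mapping from head entities to tail entities, which strengthens the dependencies within each triple, and a bias term is added to the loss function to mitigate label imbalance. Compared with baseline methods on two public biomedical corpora, DDI and CPI, the method improves precision and, more markedly, recall, achieving the highest F1 score on both corpora.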

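(3) A literature-mining-based drug repositioning framework is proposed to obtain potential drug-disease relations from a large body of biomedical literature. Cascade pointer tagging (CPT) is used to expand the entity list of clinical variables; a rank-sum test is used to obtain relations between diseases and clinical variables; relation extraction with two-time pointer tagging (TPT) yields relations between drugs and clinical variables; a logistic regression model then predicts potential drug-disease relations and ranks the candidate drugs for treating a given disease. In total, 986 clinical variables, 2,532 drug entities, and more than 800,000 literature abstracts were collected, and more than 500 candidate therapeutic drugs were identified for three common diseases (asthma, diabetes, and heart failure), providing a reference for literature-mining-based drug repositioning. This both validates the effectiveness of the proposed entity recognition and relation extraction methods and gives the extracted information practical value.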
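To make the idea behind item (1) concrete, the following is a minimal sketch of how per-type start and end pointers can represent nested entities. The abstract does not give the decoding rule used by CPT, so the nearest-end pairing below, the function name `decode_pointer_tags`, and the example tags are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

def decode_pointer_tags(
    starts: Dict[str, List[int]],
    ends: Dict[str, List[int]],
) -> List[Tuple[str, int, int]]:
    """Decode (type, start, end) spans from per-type binary pointer tags.

    Assumed heuristic: each start pointer is paired with the nearest
    end pointer of the same type at or after it.
    """
    spans = []
    for label, start_tags in starts.items():
        end_tags = ends[label]
        for i, is_start in enumerate(start_tags):
            if not is_start:
                continue
            for j in range(i, len(end_tags)):
                if end_tags[j]:
                    spans.append((label, i, j))
                    break
    return spans

# Hypothetical example: the protein "IL-2" is nested inside the DNA
# mention "human IL-2 gene". Each entity type has its own tag layer.
tokens = ["human", "IL-2", "gene", "expression"]
starts = {"DNA": [1, 0, 0, 0], "protein": [0, 1, 0, 0]}
ends = {"DNA": [0, 0, 1, 0], "protein": [0, 1, 0, 0]}

entities = decode_pointer_tags(starts, ends)
for label, i, j in entities:
    print(label, "->", " ".join(tokens[i : j + 1]))
```

Because every entity type has its own pair of tag sequences, the token "IL-2" can belong to a DNA span and a protein span at once, which a single BIO label per token cannot express.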
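The head-to-tail mapping described in item (2) can be sketched in the same style. This is not the TPT model itself: the tag sequences that a trained network would predict are hard-coded, and the names `decode_spans` and `extract_triples`, the example sentence, and the relation label are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]
Tags = Tuple[List[int], List[int]]  # (start pointers, end pointers)

def decode_spans(tags: Tags) -> List[Span]:
    """Pair each start pointer with the nearest end pointer at or after it."""
    start_tags, end_tags = tags
    spans = []
    for i, is_start in enumerate(start_tags):
        if is_start:
            for j in range(i, len(end_tags)):
                if end_tags[j]:
                    spans.append((i, j))
                    break
    return spans

def extract_triples(
    tokens: List[str],
    head_tags: Tags,
    tail_tags: Dict[Span, Dict[str, Tags]],
) -> List[Tuple[str, str, str]]:
    """Build (head, relation, tail) triples from two tagging passes."""
    def text(span: Span) -> str:
        return " ".join(tokens[span[0] : span[1] + 1])

    triples = []
    for head in decode_spans(head_tags):  # first pass: head entities
        for relation, tags in tail_tags.get(head, {}).items():
            for tail in decode_spans(tags):  # second pass: tails per relation
                triples.append((text(head), relation, text(tail)))
    return triples

# Hypothetical sentence and tags: one head entity takes part in two triples.
tokens = ["aspirin", "may", "enhance", "the", "effect", "of",
          "warfarin", "and", "heparin"]
head_tags = ([1, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0])
tail_tags = {
    (0, 0): {"effect": ([0, 0, 0, 0, 0, 0, 1, 0, 1],
                        [0, 0, 0, 0, 0, 0, 1, 0, 1])},
}

triples = extract_triples(tokens, head_tags, tail_tags)
print(triples)
```

Since tails are tagged separately for each head and each relation, a head entity shared by several triples poses no conflict, which is the overlapping-relation case the abstract refers to.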
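Item (3) names a rank-sum test for linking diseases to clinical variables but gives no further detail. As a rough illustration, the sketch below implements the Wilcoxon rank-sum (Mann-Whitney U) test with a normal approximation using only the standard library. The function name `rank_sum_test` and the measurements are hypothetical, and the thesis may use a different variant of the test.

```python
import math
from typing import List, Tuple

def rank_sum_test(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Return (U statistic of x, two-sided p-value by normal approximation)."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    # Assign 1-based ranks, averaging the ranks of tied values.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return u, p

# Hypothetical values of one clinical variable in two groups of subjects.
with_disease = [7.8, 8.4, 9.1, 10.2, 8.9]
without_disease = [5.1, 5.6, 4.9, 6.0, 5.4]

u, p = rank_sum_test(with_disease, without_disease)
print(f"U = {u}, p = {p:.4f}")
```

A small p-value suggests that the variable is distributed differently in the two groups, so it could be kept as a disease-associated clinical variable. For samples this small, an exact test is preferable to the normal approximation used here.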


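CLC classification number:

 TP391.1

Barcode:

 002000062029

Holdings number:

 TD10050805

Shelving location:

 403

Notes:

 403: West Campus Branch Library, doctoral and master's thesis collection; 203: Yujiatou Branch Library, doctoral and master's thesis collection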
