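Thesis information

Title (Chinese):

 Research on Word-Vector-Based Extraction of Domain Concepts and Relations for Basic Education Resources (基于词向量的基础教育资源领域概念及关系抽取研究)

Name:

 Liu Yameng (刘雅梦)

Student ID:

 1049721201344

Confidentiality level:

 Public

Thesis language:

 chi

Discipline code:

 081203

Discipline name:

 Computer Application Technology

Student type:

 Master's

Degree:

 Master of Engineering

University:

 Wuhan University of Technology (武汉理工大学)

Award:

 University Outstanding Master's Thesis

School:

 School of Computer Science and Technology

Major:

 Computer Science and Technology

Research direction:

 Artificial Intelligence and Machine Learning

First advisor name:

 Xiong Shengwu (熊盛武)

First advisor affiliation:

 Wuhan University of Technology

Completion date:

 2015-04-08

Defense date:

 2015-05-17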

Keywords (Chinese):

 concept extraction; concept relation extraction; compound word extraction; word vectors; basic education resources

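Abstract (Chinese):

Human-like intelligence is currently developing rapidly worldwide. Representative projects include Japan's Todai Robot project, whose goal is for a robot to sit the university entrance examination and achieve a high score, and IBM's Watson project, which has already expanded into fields such as medicine. Such projects are of great benefit to sectors such as education and healthcare. Their successful application, however, depends on the support of a comprehensive knowledge base, so expanding the knowledge base of a human-like intelligence project is crucial to raising its level of intelligence. Basic education resources, typified by the "comprehensive liberal arts" (文综) subjects, contain a wealth of knowledge. Studying how to obtain rich semantic information from massive basic education resources, and how to build an ontology knowledge base for the basic education domain, is therefore of great significance for building human-like intelligence products.

Supported by the 863 Program project "Key Technologies for Human-like Intelligent Knowledge Understanding and Reasoning for Basic Education" (2015AA015403), this thesis studies two core tasks of knowledge extraction: the extraction of domain concepts and the extraction of relations between concepts. The main work is as follows: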

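1) A compound word extraction algorithm based on F-M-E and a domain concept extraction algorithm based on word frequency distribution are designed. To handle domain concepts that are incorrectly split by word segmentation tools, the F-M-E compound word extraction algorithm (F denotes word frequency, M mutual information, and E information entropy) combines part-of-speech information with statistical features of the language model to extract compound words, which preserves the integrity of domain concepts. To handle low-frequency domain concepts that are missed and high-frequency non-domain concepts that are extracted by mistake, the domain concept extraction algorithm based on word frequency distribution exploits the difference between the distributions of domain concepts and non-domain concepts. It is applied to extract domain concepts for the subject of history.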

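2) An algorithm for obtaining domain concept pairs based on word vector semantic relatedness is designed. A neural-network-based word embedding method maps the words of the text corpus into a low-dimensional word vector space, and the distance between vectors represents the semantic relatedness between words. This not only expands the domain concept set but also identifies semantically related concept pairs.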

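3) An extended association rule algorithm for non-taxonomic relation extraction based on semantic relatedness is designed. Relation extraction based on association rules selects associated concept pairs using only the statistical features of the corpus. The proposed algorithm adds semantic relatedness as a further evaluation criterion on top of the association rule algorithm when extracting concept pairs. It then uses word vectors and K-means clustering to expand the relation labels with nouns and assign them to the corresponding concept pairs, finally producing a set of non-taxonomic relations in the form of (concept pair, relation label) triples.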
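The three statistics named in item 1) can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the function name `fme_candidates`, the thresholds, the choice of pointwise mutual information for M, and the choice of left/right neighbour entropy for E are all assumptions made for illustration, and the part-of-speech filtering mentioned in item 1) is omitted.

```python
import math
from collections import Counter, defaultdict


def fme_candidates(sentences, min_freq=2, min_pmi=1.0, min_entropy=0.5):
    """Score adjacent token pairs by frequency (F), pointwise mutual
    information (M) and neighbour entropy (E).

    `sentences` is a list of token lists produced by a word segmenter.
    All thresholds are hypothetical defaults, not values from the thesis.
    """
    unigrams = Counter()
    bigrams = Counter()
    left_ctx = defaultdict(Counter)   # tokens seen just before each pair
    right_ctx = defaultdict(Counter)  # tokens seen just after each pair

    for tokens in sentences:
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            bigrams[pair] += 1
            # Sentence boundaries are treated as ordinary context symbols.
            left = tokens[i - 1] if i > 0 else "<s>"
            right = tokens[i + 2] if i + 2 < len(tokens) else "</s>"
            left_ctx[pair][left] += 1
            right_ctx[pair][right] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    results = {}
    for pair, freq in bigrams.items():
        p_xy = freq / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        # A pair that can be followed and preceded by many different tokens
        # behaves like a free-standing unit, so take the weaker side.
        ent = min(entropy(left_ctx[pair]), entropy(right_ctx[pair]))
        if freq >= min_freq and pmi >= min_pmi and ent >= min_entropy:
            results["".join(pair)] = (freq, pmi, ent)
    return results
```

A pair is kept only when it is frequent, its parts co-occur more often than chance, and its surrounding context is varied, which is one common way to decide that two segmenter tokens should be merged into a single term.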
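The idea in items 2) and 3), adding word-vector relatedness as an extra filter on top of association-rule statistics, can be sketched as follows. This is an assumed simplification rather than the thesis's method: the names `cosine` and `related_pairs`, the thresholds, the use of cosine similarity as the relatedness measure, and the symmetric confidence (co-occurrence count divided by the smaller of the two concept counts) are all hypothetical choices. The K-means step for relation labels is not covered.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_pairs(transactions, vectors,
                  min_support=0.2, min_confidence=0.5, min_similarity=0.6):
    """Keep concept pairs that pass support, confidence and similarity.

    `transactions` is a list of concept lists (for example one per sentence);
    `vectors` maps each concept to its word vector. Thresholds are
    hypothetical defaults, not values from the thesis.
    """
    n = len(transactions)
    item_count = Counter()
    pair_count = Counter()
    for concepts in transactions:
        # Ignore concepts without a vector; sort so each pair has one key.
        present = sorted(c for c in set(concepts) if c in vectors)
        item_count.update(present)
        pair_count.update(combinations(present, 2))

    kept = []
    for (a, b), count in pair_count.items():
        support = count / n
        confidence = count / min(item_count[a], item_count[b])
        similarity = cosine(vectors[a], vectors[b])
        if (support >= min_support
                and confidence >= min_confidence
                and similarity >= min_similarity):
            kept.append((a, b, support, confidence, similarity))
    return kept
```

With this filter, a pair that co-occurs often but whose vectors point in unrelated directions is rejected, which is the kind of case that statistics alone cannot separate.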

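The proposed method for extracting domain concepts and relations from basic education resources achieves high performance and is highly practical. The extracted domain concept set and concept relation set can be used to build the corresponding domain ontology knowledge base, achieving a degree of automation, and can also be applied in other areas such as semantic retrieval, text summarization, knowledge graphs, and question answering systems.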

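Chinese Library Classification number:

 TP391.1

Call number:

 TP391.1/1344/2015

Notes:

 403: West Campus Branch Library, Doctoral and Master's Theses Collection; 203: Yujiatou Branch Library, Doctoral and Master's Theses Collection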
