Abstract: One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solve the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, typically upwards of hundreds of thousands of terms. As such, much of the research and development in NLP in the last two decades has been devoted to finding and optimizing solutions to this problem, that is, effectively to feature selection in NLP. This paper looks at the development of these various techniques, which leverage a variety of statistical methods resting on linguistic theories advanced in the middle of the last century, namely the distributional hypothesis, which suggests that words found in similar contexts generally have similar meanings. In this survey paper we examine the development of some of the most popular of these techniques from a mathematical as well as a data-structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants, typically referred to as word embeddings. In reviewing algorithms such as Word2Vec, GloVe, ELMo, and BERT, we explore the idea of semantic spaces more generally, beyond their applicability to NLP.
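The vector-arithmetic view of semantic spaces that the survey covers can be sketched in a few lines. The 3-dimensional vectors below are invented toy values for illustration, not trained embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vocab):
    """Solve a : b :: c : ? by vector arithmetic (b - a + c),
    returning the nearest vocabulary word (excluding a, b, c)."""
    target = vocab[b] - vocab[a] + vocab[c]
    scores = {w: cosine_similarity(target, vec)
              for w, vec in vocab.items() if w not in (a, b, c)}
    return max(scores, key=scores.get)

# Toy 3-d vectors chosen by hand for illustration only.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.9]),
}
print(analogy("man", "king", "woman", vocab))  # "queen"
```

The same nearest-neighbor machinery underlies both the similarity and the analogy evaluations used throughout the literature the survey reviews.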
Funding: supported by the National High-Tech Research and Development (863) Program (No. 2015AA015401), the National Natural Science Foundation of China (Nos. 61533018 and 61402220), the State Scholarship Fund of CSC (No. 201608430240), the Philosophy and Social Science Foundation of Hunan Province (No. 16YBA323), and the Scientific Research Fund of Hunan Provincial Education Department (Nos. 16C1378 and 14B153)
Abstract: Word embedding has drawn a lot of attention due to its usefulness in many NLP tasks. So far, a handful of neural-network-based word embedding algorithms have been proposed without considering the effects of pronouns in the training corpus. In this paper, we propose using co-reference resolution to improve word embeddings by extracting better context. We evaluate four word embeddings trained with co-reference resolution and compare the quality of the embeddings on word analogy and word similarity tasks over multiple data sets. Experiments show that by using co-reference resolution, performance on the word analogy task can be improved by around 1.88%. We find that words that are names of countries are affected the most, which is as expected.
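The paper's exact preprocessing is not spelled out in this abstract; one plausible sketch of how co-reference output can improve context is to substitute each pronoun with its cluster's representative mention before training the embeddings. The `resolve_pronouns` helper and the cluster format below are illustrative assumptions, not the authors' code:

```python
PRONOUNS = {"he", "she", "it", "they", "him", "her",
            "them", "its", "his", "their"}

def resolve_pronouns(tokens, clusters):
    """Replace pronoun tokens with the representative mention of their
    co-reference cluster, so the embedding model sees the referent's
    surface form as context instead of an uninformative pronoun.

    clusters: list of (representative, set_of_token_indices) pairs,
    as an off-the-shelf co-reference resolver might produce."""
    out = list(tokens)
    for representative, indices in clusters:
        for i in indices:
            if out[i].lower() in PRONOUNS:
                out[i] = representative
    return out

tokens = "France raised taxes because it needed revenue".split()
clusters = [("France", {0, 4})]
print(" ".join(resolve_pronouns(tokens, clusters)))
```

After this substitution, "France" rather than "it" lands in the context windows of "needed" and "revenue", which is consistent with the abstract's finding that country names are affected the most.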
Funding: Project supported by the National Natural Science Foundation of China (Nos. 61663041 and 61763041), the Program for Changjiang Scholars and Innovative Research Team in Universities, China (No. IRT_15R40), the Research Fund for the Chunhui Program of Ministry of Education of China (No. Z2014022), the Natural Science Foundation of Qinghai Province, China (No. 2014-ZJ-721), and the Fundamental Research Funds for the Central Universities, China (No. 2017TS045)
Abstract: Most word embedding models have the following problems: (1) in models based on bag-of-words contexts, the structural relations of sentences are completely neglected; (2) each word uses a single embedding, which makes the model indiscriminative for polysemous words; (3) word embeddings tend too easily toward the contextual-structure similarity of sentences. To solve these problems, we propose an easy-to-use representation algorithm for syntactic word embedding (SWE). The main procedures are: (1) a polysemous tagging algorithm based on latent Dirichlet allocation (LDA) is used for polysemous representation; (2) the symbols '+' and '-' are adopted to indicate the directions of the dependency syntax; (3) stopwords and their dependencies are deleted; (4) dependency skip is applied to connect indirect dependencies; (5) dependency-based contexts are input to a word2vec model. Experimental results show that our model generates desirable word embeddings in similarity evaluation tasks. Moreover, semantic and syntactic features can be captured from dependency-based syntactic contexts, exhibiting less topical and more syntactic similarity. We conclude that SWE outperforms single-embedding learning models.
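Procedures (2)-(4) of SWE can be sketched as follows. The edge format, stopword list, and the `resolve` helper are illustrative assumptions, not the authors' implementation:

```python
STOPWORDS = {"the", "a", "of", "to"}

def dependency_contexts(edges):
    """Build direction-marked, dependency-based contexts in the style of
    SWE: '+' marks a head->dependent arc, '-' the inverse arc.  Arcs that
    pass through a stopword are 'skipped': the stopword is dropped and
    its head is linked directly to its dependents.

    edges: list of (head, dependent) pairs from a dependency parse."""
    parent = {d: h for h, d in edges}

    def resolve(word):
        # Dependency skip: walk up through stopwords to a content word.
        while word in STOPWORDS and word in parent:
            word = parent[word]
        return word

    contexts = {}
    for head, dep in edges:
        h, d = resolve(head), resolve(dep)
        if h in STOPWORDS or d in STOPWORDS or h == d:
            continue
        contexts.setdefault(h, []).append("+" + d)
        contexts.setdefault(d, []).append("-" + h)
    return contexts

# Toy arcs for "dog of farmer barks": barks -> dog -> of -> farmer
edges = [("barks", "dog"), ("dog", "of"), ("of", "farmer")]
print(dependency_contexts(edges))
```

The stopword "of" disappears from the contexts, and the indirect dependency between "dog" and "farmer" becomes a direct, direction-marked pair; these context strings would then be fed to word2vec as in procedure (5).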
Abstract: Addressing the phenomenon of personal-pronoun anaphora in Uyghur, we propose a deep-learning approach based on a bi-directional long short-term memory network (Bi-LSTM) for Uyghur personal-pronoun anaphora resolution using deep semantic information. First, word-embedding vectors rich in semantic and syntactic information are fed into the Bi-LSTM to mine the latent contextual semantic features of Uyghur. Second, the phenomenon of Uyghur personal-pronoun anaphora is analyzed, and 24 hand-crafted features tailored to personal-pronoun anaphora are extracted. Then a multilayer perceptron (MLP) fuses the contextual semantic features learned by the Bi-LSTM with the hand-crafted features. Finally, the two fused feature types are used to train a softmax classifier that performs Uyghur personal-pronoun anaphora resolution. Experimental results show that, by exploiting the strengths of both feature types, the F1 score of Uyghur personal-pronoun anaphora resolution reaches 76.86%. The experiments verify that the Bi-LSTM is better at mining latent deep contextual semantics than a unidirectional LSTM and the shallow machine-learning algorithms SVM and ANN, while the introduction of hand-crafted features effectively improves resolution performance.
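The fusion and classification steps can be sketched with stand-in features; the random 100-dimensional "Bi-LSTM output", the weight matrices, and the layer sizes are placeholders, not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_and_classify(contextual, handcrafted, W_hidden, W_out):
    """Fuse contextual features (stand-in for a Bi-LSTM encoding) with
    hand-crafted features through one perceptron layer, then classify
    with softmax, mirroring the MLP-fusion step described above."""
    x = np.concatenate([contextual, handcrafted])
    hidden = np.tanh(x @ W_hidden)
    return softmax(hidden @ W_out)

contextual = rng.normal(size=100)    # stand-in Bi-LSTM output
handcrafted = rng.normal(size=24)    # 24 hand-crafted features, as in the paper
W_hidden = rng.normal(size=(124, 32))
W_out = rng.normal(size=(32, 2))     # coreferent vs. not coreferent
probs = fuse_and_classify(contextual, handcrafted, W_hidden, W_out)
print(probs, probs.sum())
```

Only the 24-feature count comes from the abstract; everything else is a minimal sketch of "concatenate, pass through an MLP, softmax".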
Abstract: Sentiment analysis has in recent years been a core research direction in natural language processing attracting wide scholarly attention. Traditional text sentiment-analysis models capture only the surface features of text, generalize poorly across domains and contexts, and struggle with long texts and semantic ambiguity. To address these problems, this paper designs a text sentiment analysis model based on graph neural networks and representation learning (GNNRL). spaCy is used to generate the syntactic dependency tree of each sentence, which a graph convolutional network encodes to capture more complex relations among the words of a sentence; dynamic k-max pooling then further filters the features while preserving the sequential features of relative position in the text, avoiding the loss of partial features and improving the model's feature-extraction ability. Finally, the sentiment feature vector is fed into a softmax classifier, and the sentiment class is determined from the normalized values. To verify the effectiveness of the proposed GNNRL model, tests on two text sentiment-analysis data sets, OS10 and SMP2020, compare it against the HyperGAT, IBHC, BERT_CNN, BERT_GCN, and TextGCN models. The results show that across all four metrics (accuracy, precision, recall, and F1), the proposed model outperforms the others and achieves good classification performance on text sentiment; the influence of different optimizer choices on the model is also explored.
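The order-preserving k-max pooling step can be sketched as below; for simplicity this uses a fixed k, whereas the model presumably chooses k dynamically as a function of input length:

```python
import numpy as np

def k_max_pooling(features, k):
    """Keep the k largest activations of a feature sequence while
    preserving their original order, so the relative-position sequence
    information survives the pooling step."""
    features = np.asarray(features)
    # Indices of the k largest values, then re-sorted into input order.
    idx = np.sort(np.argpartition(features, -k)[-k:])
    return features[idx]

print(k_max_pooling([0.1, 0.9, 0.3, 0.7, 0.2], k=3))  # [0.9 0.3 0.7]
```

Note that 0.3 survives ahead of 0.7 in the output because order is kept from the input, not from the magnitudes; this is exactly the property the abstract credits with preserving relative-position features.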
Abstract: Existing research on text-based information networks mostly models only the network's own information; constrained by the scale of the task corpus, modeling with task-related text alone easily causes semantic drift or semantic incompleteness. This paper introduces an external corpus into the modeling process and uses word vectors learned from it to optimize modeling, proposing NE-EWV (network embedding based on external word vectors), which learns a feature-fused network representation from both the semantic feature space and the structural feature space. The model's effectiveness is validated experimentally on real-world network data sets. The results show that, on the AUC metric of the link-prediction task, the model gains 7%-19% over models that consider only structural features and, in most cases, 1%-12% over models that consider both structural and text features; on node classification, it performs comparably to CANE, the best-performing baseline. This demonstrates that introducing external word vectors as external knowledge can effectively improve network representation ability.
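One minimal way to combine the two feature spaces, assuming simple concatenation of a structural embedding with a mean of externally trained word vectors (the paper's exact fusion is not described in this abstract), is:

```python
import numpy as np

def node_representation(structure_vec, tokens, external_vectors):
    """Fuse a node's structural embedding with a text embedding built
    from externally trained word vectors: the mean of the vectors for
    tokens found in the external vocabulary, zeros if none are found."""
    dim = len(next(iter(external_vectors.values())))
    hits = [external_vectors[t] for t in tokens if t in external_vectors]
    text_vec = np.mean(hits, axis=0) if hits else np.zeros(dim)
    return np.concatenate([structure_vec, text_vec])

# Toy external vectors and a toy structural embedding (e.g. from a
# random-walk method on the network); "oov" is outside the vocabulary.
external = {"graph": np.array([1.0, 0.0]), "embedding": np.array([0.0, 1.0])}
structure = np.array([0.5, 0.5, 0.5])
rep = node_representation(structure, ["graph", "embedding", "oov"], external)
print(rep)
```

Because the text half comes from vectors trained on a large external corpus, node text that is sparse in the task corpus still lands in a well-populated semantic space, which is the abstract's argument for why external knowledge mitigates semantic drift.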
Abstract: Establishing accurate sales channels for agricultural products is of great guiding significance to their production and circulation, but constraints such as place of origin and shelf life leave most agricultural products with poor circulation on e-commerce platforms. Moreover, recommendation work related to agricultural products has focused on recommending goods to buyers; few systems take sellers as the target users and recommend sales channels for their pre-sale products. This paper proposes a sales-channel recommendation method for agricultural products based on network representation learning. It uses AP-GloVe (Global Word Vector Representation for Agricultural Products), a word-embedding-based method, to recommend sales locations and distributors, and IAGNN (Influence-Aware Graph Neural Networks), an influence-based graph neural network model, to recommend potential buyers, thereby realizing recommendation of sales regions, distributors, and product buyers for agricultural-product sales. The models outperform existing models in experiments on word-similarity detection, node classification, and link prediction.