Funding: The authors would like to thank all anonymous reviewers for their suggestions and feedback. This work was supported by the National Natural Science Foundation of China (Nos. 61379052 and 61379103), the National Key Research and Development Program (2016YFB1000101), the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (Grant No. 14JJ1026), and the Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20124307110015).
Abstract: Document processing in natural language includes retrieval, sentiment analysis, theme extraction, and related tasks. Classical methods for these tasks rely on probabilistic models, semantic models, and machine-learning networks. The probabilistic model essentially loses semantic information, which degrades processing accuracy. Machine-learning approaches include supervised, unsupervised, and semi-supervised methods, and labeled corpora are necessary for semantic models and supervised learning. A reliably labeled corpus is usually built manually, which is costly and time-consuming because annotators must read and label every document. Recently, the continuous bag-of-words (CBOW) model has proven efficient for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships; it can easily be extended to learn paragraph vectors, but the resulting vectors are not precise. To address these problems, this paper develops a new model for learning paragraph vectors by combining the CBOW model with convolutional neural networks (CNNs) to form a new deep learning model. Experimental results show that the paragraph vectors generated by the new model outperform those generated by the CBOW model in semantic relatedness and accuracy.
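The abstract does not spell out the combined architecture, so the following is only a minimal sketch of one plausible arrangement: CBOW word vectors (learned with gensim) pooled into a paragraph vector by a small 1-D convolution with max-over-time pooling (in PyTorch). The toy corpus, the ParagraphCNN class, and all layer sizes are illustrative assumptions, not the paper's actual model.

```python
# Hedged sketch: CBOW word vectors pooled into a paragraph vector by a 1-D CNN.
# The paper's concrete architecture is not given here; all sizes are illustrative.
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

corpus = [
    ["document", "processing", "includes", "retrieval", "and", "sentiment", "analysis"],
    ["paragraph", "vectors", "capture", "syntactic", "and", "semantic", "relationships"],
]

# 1. Learn CBOW word vectors (sg=0 selects CBOW in gensim).
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=0, epochs=50)

# 2. A small 1-D CNN mapping a sequence of word vectors to a fixed-size paragraph vector.
class ParagraphCNN(nn.Module):
    def __init__(self, emb_dim=100, out_dim=100, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, out_dim, kernel_size=kernel, padding=kernel // 2)

    def forward(self, word_vecs):              # word_vecs: (seq_len, emb_dim)
        x = word_vecs.t().unsqueeze(0)          # -> (1, emb_dim, seq_len)
        h = torch.relu(self.conv(x))            # -> (1, out_dim, seq_len)
        return h.max(dim=2).values.squeeze(0)   # max-over-time pooling -> (out_dim,)

cnn = ParagraphCNN()
words = torch.from_numpy(np.stack([w2v.wv[w] for w in corpus[0]]))
paragraph_vector = cnn(words)
print(paragraph_vector.shape)  # torch.Size([100])
```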
Abstract: Current word-vector-based multi-document summarization methods ignore the order of words within sentences, so different sentences may be mapped to the same vector, and the summaries generated from small training sets are highly redundant. To address these problems, a multi-document summarization method based on the Distributed Memory Model of Paragraph Vectors (PV-DM) is proposed. The method first constructs a monotone submodular objective function; it then trains a PV-DM model to obtain sentence vectors and computes the semantic similarity between sentences, which is used to evaluate the submodular objective; finally, an optimization algorithm selects sentences to form the summary. Experimental results on the standard Opinosis dataset show that the method outperforms current mainstream multi-document summarization methods.
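The exact objective and optimizer are not specified in the abstract, so the sketch below only illustrates the general pipeline: PV-DM sentence vectors from gensim's Doc2Vec, a facility-location-style coverage function (which is monotone submodular), and a greedy selection pass. The sentences, the budget, and the coverage function are assumptions for illustration only.

```python
# Illustrative sketch (not the paper's exact objective): PV-DM sentence vectors
# plus a greedy pass over a simple monotone submodular coverage function.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [
    "the battery life of this phone is excellent",
    "battery life is very good and lasts all day",
    "the screen is bright but the camera is mediocre",
    "camera quality could be better in low light",
]
tokenized = [s.split() for s in sentences]

# Train PV-DM (dm=1) to obtain one vector per sentence.
docs = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(tokenized)]
model = Doc2Vec(docs, dm=1, vector_size=50, window=3, min_count=1, epochs=100)
vecs = np.array([model.dv[i] for i in range(len(sentences))])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

sim = np.array([[cosine(u, v) for v in vecs] for u in vecs])

def coverage(selected):
    """Facility-location coverage: how well the selected set represents every sentence."""
    if not selected:
        return 0.0
    return sum(max(sim[i][j] for j in selected) for i in range(len(sentences)))

# Greedy maximisation under a budget of two sentences.
budget, summary = 2, []
while len(summary) < budget:
    best = max((i for i in range(len(sentences)) if i not in summary),
               key=lambda i: coverage(summary + [i]) - coverage(summary))
    summary.append(best)

print([sentences[i] for i in summary])
```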
Abstract: To overcome the sparsity and high dimensionality of short texts while improving clustering quality, a short-text clustering method combining the Biterm Topic Model (BTM) with Paragraph Vectors (PV) is proposed. The method consists of two main steps. First, the word-document-topic probability distributions obtained from BTM are combined with the local outlier factor and Jensen-Shannon divergence to semantically split the words across the whole text collection. Second, the texts produced by this word splitting are fed into the PV-DBOW (Distributed Bag of Words version of Paragraph Vector) model to obtain paragraph vectors, which are concatenated with the corresponding document-topic distributions to form the text feature vectors. Experimental results show that the resulting feature vectors discriminate short texts well, effectively improve short-text clustering, and avoid the adverse effects of short-text sparsity.
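As a rough sketch of the feature-construction step only: PV-DBOW vectors concatenated with a document-topic distribution, then clustered. No standard BTM implementation is assumed here, so sklearn's LDA stands in for the topic model, and the LOF/JS-divergence word-splitting step is omitted; the texts, dimensions, and cluster count are hypothetical.

```python
# Hedged sketch of the feature construction: PV-DBOW vectors concatenated with a
# document-topic distribution. LDA is only a stand-in for BTM; word splitting is omitted.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

texts = [
    "cheap flights to tokyo this weekend",
    "discount airfare deals for japan travel",
    "new smartphone camera review and battery test",
    "latest phone release has a great camera",
]

# PV-DBOW paragraph vectors (dm=0 in gensim's Doc2Vec).
docs = [TaggedDocument(t.split(), [i]) for i, t in enumerate(texts)]
pv = Doc2Vec(docs, dm=0, vector_size=30, min_count=1, epochs=100)
pv_vecs = np.array([pv.dv[i] for i in range(len(texts))])

# Document-topic distribution (stand-in topic model; the paper uses BTM).
counts = CountVectorizer().fit_transform(texts)
topic_dist = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# Concatenate both parts to form the final text feature vector, then cluster.
features = np.hstack([pv_vecs, topic_dist])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)
```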
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 71171187, 71371107, and 61473284.
Abstract: The risk classification of BBS posts is important for evaluating the societal risk level within a given period. Using posts collected from the Tianya forum as the data source, the authors adopt societal risk indicators from social psychology and conduct document-level classification of BBS posts into multiple societal risk categories. To effectively capture the semantics and word order of documents, a shallow neural network, Paragraph Vector, is applied to obtain distributed vector representations of the posts. Based on these document vectors, the authors apply the KNN classification method to identify the societal risk category of each post. The experimental results reveal that paragraph vectors in document-level societal risk classification achieve much faster training and at least a 10% improvement in F-measure over Bag-of-Words. Furthermore, the performance of paragraph vectors is also superior to edit distance and a Lucene-based search method. The present work is the first attempt to combine a document embedding method with social-psychology research results in the area of public opinion analysis.
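A minimal sketch of the pipeline described above, with made-up posts and risk labels: gensim's Doc2Vec stands in for the Paragraph Vector model, and sklearn's KNN classifies the resulting document vectors. The example posts, the two category names, and the hyper-parameters are illustrative assumptions, not the authors' data or settings.

```python
# Hedged sketch: Paragraph Vector document embeddings followed by a KNN classifier.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.neighbors import KNeighborsClassifier

posts = [
    "food safety scandal reported at a local market",
    "another contaminated food incident worries residents",
    "house prices keep rising beyond what families can afford",
    "rent and housing costs are a growing burden",
]
labels = ["daily_life", "economy_livelihood", "economy_livelihood", "economy_livelihood"]
labels = ["daily_life", "daily_life", "economy_livelihood", "economy_livelihood"]  # hypothetical risk categories

# Learn distributed representations of the posts (PV-DM by default in gensim).
docs = [TaggedDocument(p.split(), [i]) for i, p in enumerate(posts)]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=100)
X = np.array([model.dv[i] for i in range(len(posts))])

# KNN over the document vectors; a new post is embedded with infer_vector.
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
new_vec = model.infer_vector("residents complain about unaffordable housing".split())
print(knn.predict([new_vec]))
```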