Abstract: This paper explores how Chinese college students' lives are represented in graffiti collected on campus. It analyzes and compares the topics of graffiti from different settings and the linguistic features they manifest. The findings show that, compared with graffiti abroad, fewer graffiti from the female toilets and classrooms of this university address political issues. Graffiti in the female toilets mainly focus on the theme of love and are more interactive in discourse, whereas graffiti on desks tend to cover mixed themes and be less interactive. There are more graphic graffiti and exam answers on undergraduate students' desks than on postgraduates'. The graffiti exhibit linguistic features such as thematization, repetition, and salience.
Funding: Supported by the National Key Research and Development Program of China (2019YFB2102500), the China Postdoctoral Science Foundation (2021M700533), the Natural Science Basic Research Program of Shaanxi Province of China (2021JQ-289, 2020JQ-855), and the Social Science Fund of Shaanxi Province of China (2019S044).
Abstract: With the extensive integration of the Internet, social networks, and the Internet of Things, the social Internet of Things has become an increasingly significant research topic. In social Internet of Things application scenarios, one of the greatest challenges is how to accurately recommend or match smart objects to users given massive resources. Although a variety of recommendation algorithms have been employed in this field, they ignore the massive text resources in the social Internet of Things, which can effectively improve recommendation quality. This paper proposes a smart object recommendation approach named object recommendation based on topic learning and joint features. The approach extracts and computes topics and service-relevant features from texts related to smart objects and introduces the “thing-thing” relationship information in the Internet of Things to improve recommendation. Experiments show that the proposed approach achieves higher accuracy than existing recommendation methods.
Abstract: Modeling topics in short texts presents significant challenges due to feature sparsity, particularly when analyzing content generated by large-scale online users. This sparsity can substantially impair semantic capture accuracy. We propose a novel approach that incorporates pre-clustered knowledge into the BERTopic model while reducing the l2 norm for low-frequency words. Our method effectively mitigates feature sparsity during cluster mapping. Empirical evaluation on the StackOverflow dataset demonstrates that our approach outperforms baseline models, achieving superior Macro-F1 scores. These results validate the effectiveness of our proposed feature sparsity reduction technique for short-text topic modeling.
Funding: Supported by the National Nature Science Foundation of China under Grant Nos. 60905017 and 61072061, the National High Technical Research and Development Program of China (863 Program) under Grant No. 2009AA01A346, the 111 Project of China under Grant No. B08004, and the Special Project for Innovative Young Researchers of Beijing University of Posts and Telecommunications.
Abstract: As a generative model, Latent Dirichlet Allocation (LDA) focuses on how to generate data and does not optimize the discrimination capability of its topics. This paper aims to improve that discrimination capability through unsupervised feature selection. Theoretical analysis shows that the discrimination capability of a topic is limited by the discrimination capability of its representative words. The discrimination capability of a word is approximated by the information gain of the word for topics, which is used to distinguish between "general words" and "special words" in LDA topics. A constraint is therefore added to the LDA objective function so that "general words" occur only in "general topics" rather than in "special topics", and a heuristic algorithm is presented to obtain the solution. Experiments show that this method not only improves the information gain of topics but also makes the topics easier for humans to understand.
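The information-gain criterion mentioned in the abstract above can be illustrated with a small sketch. It uses the textbook definition of information gain over a word's presence or absence, which is an assumption: the abstract does not give the paper's exact approximation. The function names and the toy probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(word_given_topic, topic_prior):
    """Information gain of a word's presence/absence about the topic.

    word_given_topic[k] is p(word | topic k); topic_prior[k] is p(topic k).
    Both inputs are hypothetical; in practice they would come from a fitted model.
    """
    p_word = sum(pw * pt for pw, pt in zip(word_given_topic, topic_prior))
    gain = entropy(topic_prior)
    if p_word > 0:
        # Posterior over topics given that the word occurs (Bayes' rule).
        present = [pw * pt / p_word
                   for pw, pt in zip(word_given_topic, topic_prior)]
        gain -= p_word * entropy(present)
    if p_word < 1:
        # Posterior over topics given that the word does not occur.
        absent = [(1 - pw) * pt / (1 - p_word)
                  for pw, pt in zip(word_given_topic, topic_prior)]
        gain -= (1 - p_word) * entropy(absent)
    return gain


prior = [1 / 3, 1 / 3, 1 / 3]
# A word spread evenly over all topics versus one concentrated in a single topic.
general_word = information_gain([0.1, 0.1, 0.1], prior)
special_word = information_gain([0.3, 0.0, 0.0], prior)
print(general_word, special_word)
```

Under this definition a word that is equally likely in every topic carries essentially no information about the topic, while a word confined to one topic carries more, which matches the "general word" versus "special word" distinction.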
Abstract: Text classification is an essential task for many applications in the Natural Language Processing domain. It can be applied in many fields, such as information retrieval, knowledge extraction, and knowledge modeling. Despite the importance of this task, Arabic text classification tools still suffer from many problems and remain incapable of handling the increasing volume of Arabic content that circulates on the web or resides in large databases. This paper introduces a novel machine learning-based approach that exclusively uses hybrid (stylistic and semantic) features. First, the Arabic documents are cleaned and translated into English using translation tools. The semantic features are then automatically extracted from the translated documents using an existing database of English topics. In addition, the model automatically extracts from the textual content a set of stylistic features such as word and character frequencies and punctuation. This yields three types of features: semantic, stylistic, and hybrid. Using a different type of feature each time, we performed an in-depth comparison of nine well-known machine learning models on a standard Arabic corpus to evaluate our approach. The results show that the neural network outperforms the other models and performs well with hybrid features (F1-score = 0.88).
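The stylistic features named in the abstract above (word and character frequencies, punctuation) could be computed roughly as follows. This is a minimal sketch, not the authors' feature set: the function names, the choice of features, and the use of Unicode categories to detect punctuation are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def is_punctuation(char):
    """True for any Unicode punctuation mark, including Arabic ones."""
    return unicodedata.category(char).startswith("P")


def stylistic_features(text):
    """Return a small dictionary of hypothetical stylistic features."""
    tokens = []
    for raw in text.split():
        # Strip punctuation from each whitespace-separated token.
        cleaned = "".join(c for c in raw if not is_punctuation(c)).lower()
        if cleaned:
            tokens.append(cleaned)
    visible = [c for c in text if not c.isspace()]
    punct_count = sum(1 for c in visible if is_punctuation(c))
    return {
        "word_count": len(tokens),
        "avg_word_length": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
        "punctuation_ratio": punct_count / len(visible) if visible else 0.0,
        "word_freq": Counter(tokens),
        "char_freq": Counter(c for c in visible if not is_punctuation(c)),
    }


features = stylistic_features("Hello, world! Hello.")
print(features["word_count"], features["punctuation_ratio"])
```

Detecting punctuation by Unicode category rather than an ASCII list means the same function works on the original Arabic text as well as on its English translation.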
Funding: The Key Projects of Social Sciences of Anhui Provincial Department of Education (SK2018A1064, SK2018A1072), the Natural Scientific Project of Anhui Provincial Department of Education (KJ2019A0371), and the Innovation Team of Health Information Management and Application Research (BYKC201913), BBMC.
Abstract: Topic recognition with a dynamic number of topics can update hyperparameters dynamically and obtain the probability distribution of dynamic topics over time, which helps clarify the understanding and tracking of convection text data. However, current topic recognition models tend to assume a fixed number of topics K and lack multi-granularity analysis of subject knowledge, so they cannot deeply perceive how a topic changes over the time series. This paper introduces a novel approach based on the infinite Latent Dirichlet Allocation model that constructs a topic feature lattice under a dynamic topic number. In the model, documents, topics, and vocabulary are jointly modeled to generate two probability distribution matrices: document-topic and topic-feature word. The association intensity between each topic and its feature vocabulary is then computed to establish the topic formal context matrix. Finally, the topic features are induced according to formal concept analysis (FCA) theory. The topic feature lattice under dynamic topic number (TFL DTN) model is validated on a real dataset by comparison with mainstream methods. Experiments show that the model is more in line with actual needs and achieves better results in semi-automatic modeling for topic visualization analysis.
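The last two steps described in the abstract above, building a formal context from topic-word association intensities and inducing concepts with FCA, can be sketched as follows. The thresholding rule, the function names, and the toy scores are assumptions for illustration; the abstract does not say how the paper binarizes the intensities. The enumeration is brute force and only suitable for small contexts.

```python
from itertools import combinations


def formal_context(intensity, threshold):
    """Binarize topic-word scores: a topic 'has' a word if its score >= threshold.

    intensity maps topic -> {word: score}; the threshold is a hypothetical choice.
    """
    return {topic: {w for w, s in scores.items() if s >= threshold}
            for topic, scores in intensity.items()}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs of a small formal context."""
    objects = sorted(context)
    attributes = set().union(*context.values())
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            # Intent: attributes shared by every object in the subset.
            intent = set(attributes)
            for obj in subset:
                intent &= context[obj]
            # Extent: every object that carries the whole intent (closure).
            extent = frozenset(o for o in objects if intent <= context[o])
            concepts.add((extent, frozenset(intent)))
    return concepts


scores = {
    "t1": {"model": 0.9, "topic": 0.8, "graph": 0.1},
    "t2": {"model": 0.7, "topic": 0.2, "graph": 0.9},
}
context = formal_context(scores, threshold=0.5)
concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Ordering the resulting concepts by extent inclusion gives the concept lattice; here the concept shared by both topics contains only the word common to them.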
Abstract: Purpose: The purpose of this paper is to analyze topics as alternative features for sentiment analysis in Indonesian tweets. Design/methodology/approach: Given Indonesian tweets, sentiment analysis starts by extracting features from the tweets. The features are words or topics. The authors use non-negative matrix factorization to extract the topics and apply a support vector machine to classify the tweets into their sentiment classes. Findings: The authors analyze the accuracy using two-class and three-class sentiment analysis data sets. Both data sets concern sentiment toward candidates in the Indonesian presidential election. The experiments show that the standard word features give better accuracy than the topic features for two-class sentiment analysis. Moreover, the topic features can slightly improve the accuracy of the standard word features. The topic features can also improve the accuracy of the standard word features for three-class sentiment analysis. Originality/value: The standard textual data representation for sentiment analysis using machine learning is the bag of words and its extensions, mainly created by natural language processing. This paper applies topics as novel features for machine learning-based sentiment analysis in Indonesian tweets.
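The topic-extraction step named in the abstract above can be sketched with a small pure-Python non-negative matrix factorization. The abstract does not say which solver the authors used; the multiplicative update rules below are one common choice and are an assumption here, as are the function names and the toy document-term matrix. The classification step is omitted: the rows of `w` would serve as per-document topic features for a classifier.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of v - w*h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor v (documents x terms) into w (documents x k) and h (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


# Toy document-term counts: two documents per vocabulary block.
v = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
w, h = nmf(v, k=2)
print(frobenius_error(v, w, h))
```

Because every update multiplies by a non-negative ratio, the factors stay non-negative throughout, which is what lets each row of `h` be read as a weighted list of terms for one topic.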
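Abstract: Topic tracking is a research task concerned with identifying, mining, and self-organizing information related to news topics; one of its key problems is how to build statistical models that fit the morphology of a topic. The study of topic morphology involves two issues: the structural properties of a topic and topic deformation. This paper compares the morphological characteristics of three mainstream classes of topic model (bag-of-words, hierarchical tree, and chain), examines in depth the strengths and weaknesses of static and dynamic topic models in fitting the thread of a topic, and proposes a kernel-capture decay evaluation strategy based on the feature overlap ratio, designed specifically to measure how well static and dynamic topic models track the development trend of a topic. On this basis, a burst-based incremental learning method and an update algorithm for temporal event chains are presented to improve the kernel-capture performance of dynamic topic models. Experiments use the international standard evaluation corpus TDT4 and the minimum detection error tradeoff coefficient evaluation method proposed by NIST (National Institute of Standards and Technology), combined with the proposed kernel-capture decay evaluation method, to test the main types of topic model. The results show that structured dynamic topic models achieve the best tracking performance, and that burst-based incremental learning and the temporal event chain update algorithm improve the performance of dynamic topic models by 0.4% and 3.3%, respectively.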