Funding: Funded by the Natural Science Foundation of Fujian Province, China, grant No. 2022J05291.
Abstract: Topic modeling is a fundamental technique of content analysis in natural language processing, widely applied in domains such as the social sciences and finance. In the era of digital communication, social scientists increasingly rely on large-scale social media data to explore public discourse, collective behavior, and emerging social concerns. However, traditional models such as Latent Dirichlet Allocation (LDA) and neural topic models such as BERTopic struggle to capture deep semantic structures in short-text datasets, especially in complex non-English languages such as Chinese. This paper presents Generative Language Model Topic (GLMTopic), a novel hybrid topic modeling framework that leverages the capabilities of large language models and is designed to support social science research by uncovering coherent and interpretable themes from Chinese social media platforms. GLMTopic integrates Adaptive Community-enhanced Graph Embedding for advanced semantic representation, Uniform Manifold Approximation and Projection (UMAP)-based dimensionality reduction, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) clustering, and large language model (LLM)-powered representation tuning to generate more contextually relevant and interpretable topics. By reducing dependence on extensive text preprocessing and on human expert intervention in post-analysis topic label annotation, GLMTopic facilitates a fully automated and user-friendly topic extraction process. Experimental evaluations on a social media dataset sourced from Weibo demonstrate that GLMTopic outperforms LDA and BERTopic in coherence score and in usability with automated interpretation, providing a more scalable and semantically accurate solution for Chinese topic modeling. Future research will explore optimizing computational efficiency, integrating knowledge graphs and sentiment analysis for more complicated workflows, and extending the framework to real-time and multilingual topic modeling.
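The abstract above compares models by coherence score without naming the variant. As a rough illustration only, the sketch below computes one common variant, average NPMI (normalized pointwise mutual information) over pairs of a topic's top words, using document-level co-occurrence. The function name `npmi_coherence` and the toy corpus are hypothetical, and nothing here is claimed to be the measure used in the GLMTopic evaluation.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of top words (illustrative variant only).

    top_words: list of a topic's most representative words (at least two).
    documents: list of tokenized documents (each a list of words).
    """
    if len(top_words) < 2:
        raise ValueError("need at least two top words")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: minimum NPMI
        elif p12 == 1.0:
            scores.append(1.0)   # the pair co-occurs in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus: "a" and "b" always appear together, "a" and "c" never do.
docs = [["a", "b"], ["a", "b"], ["c", "d"], ["c", "d"]]
coherent = npmi_coherence(["a", "b"], docs)
incoherent = npmi_coherence(["a", "c"], docs)
```

Scores lie in [-1, 1]; higher values mean the top words tend to appear in the same documents.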
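Abstract: Purpose/Significance: To explore methods for identifying the evolution paths of interdisciplinary topics in medical informatics, providing a reference for interdisciplinary research planning and research management in the field. Method/Process: First, BERTopic and a large language model (LLM) are used to identify global topics and stage-specific topics in the collected medical informatics literature; next, interdisciplinary topics are determined using a two-dimensional analysis framework of topic influence and disciplinary diversity; finally, the evolution paths of the interdisciplinary topics are analyzed. Results/Conclusion: Using the Topic-LLM-based analysis method, the study finds that the disciplinary diversity and topic influence of interdisciplinary topics in medical informatics both show a steadily increasing trend.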
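Abstract: Purpose/Significance: To construct a two-layer analysis framework that comprehensively captures disciplinary structure, identifies emerging frontier areas, and tracks topic evolution. Method/Process: Medical informatics literature from 2016 to 2025 was retrieved from the PubMed, Scopus, and Web of Science databases; BERTopic was used to identify topics, which were classified into three evolution patterns: emerging, stable, and declining. A retrieval-augmented generation system was built on ChromaDB, with document-topic mapping used for micro-level validation and knowledge association mining. Results/Conclusion: Topic evolution in medical informatics shows three characteristics: a shift in research focus, deepening technology integration, and growing interdisciplinary crossover. The BERTopic-RAG framework offers a new method for knowledge discovery.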
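The abstract above sorts topics into emerging, stable, and declining patterns but does not state the rule. One plausible way to do this, sketched below purely as an assumption, is to fit a least-squares line to a topic's yearly document counts and compare the slope, relative to the mean count, against a threshold. The function name `classify_trend` and the `threshold` value are hypothetical choices for illustration.

```python
def classify_trend(yearly_counts, threshold=0.05):
    """Label a topic by the relative slope of its yearly document counts.

    yearly_counts: counts in chronological order (at least two years).
    threshold: hypothetical cutoff on slope divided by mean count.
    """
    n = len(yearly_counts)
    if n < 2:
        raise ValueError("need at least two yearly counts")
    mean_x = (n - 1) / 2
    mean_y = sum(yearly_counts) / n
    # Ordinary least-squares slope of count against year index.
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(yearly_counts))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    relative = slope / mean_y if mean_y else 0.0
    if relative > threshold:
        return "emerging"
    if relative < -threshold:
        return "declining"
    return "stable"


labels = [
    classify_trend([1, 2, 3, 4, 5]),
    classify_trend([3, 3, 3, 3]),
    classify_trend([5, 4, 3, 2, 1]),
]
```

Dividing the slope by the mean count keeps large and small topics comparable; a rule based on topic share per year rather than raw counts would be an equally reasonable choice.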
Funding: This work was supported by the National High Technology Research and Development Program of China (Nos. 2010AA012505, 2011AA010702, 2012AA01A401, and 2012AA01A402), the Chinese National Science Foundation (Nos. 60933005, 91124002, and 61303265), the National Technology Support Foundation (No. 2012BAH38B04), and the National 242 Foundation (No. 2011A010).
Abstract: Microblogs have become an important platform for people to publish and transform information and to acquire knowledge. This paper focuses on the problem of discovering user interest in microblogs. We propose a topic mining model based on Latent Dirichlet Allocation (LDA), named the user-topic model. For each user, interests are divided into two parts according to the way the microblogs are generated: original interest and retweet interest. We present a Gibbs sampling implementation for inferring the parameters of our model and discover not only a user's original interest but also the retweet interest. We then combine original interest and retweet interest to compute interest words for users. Experiments on a dataset of Sina microblogs demonstrate that our model discovers user interest effectively and outperforms existing topic models on this task. We also find that original interest and retweet interest are similar and that the topics of interest contain user labels. The interest words discovered by our model reflect user labels, but their range is much broader.
Funding: Supported by the National High Technology Research and Development Program of China (No. 2012AA011005).
Abstract: Topic models such as Latent Dirichlet Allocation (LDA) have been successfully applied to many text mining tasks for extracting topics embedded in corpora. However, existing topic models generally cannot discover bursty topics, that is, topics that experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness by explicitly modeling each topic's burst states with a first-order Markov chain and using the chain to generate the topic proportions of documents in a logistic-normal fashion. A Gibbs sampling algorithm is developed for posterior inference in the proposed model. Experimental results on a news dataset show that our model can efficiently discover bursty topics, outperforming the state-of-the-art method.
Funding: Project (50808025) supported by the National Natural Science Foundation of China; Project (20090162110057) supported by the Doctoral Fund of Ministry of Education, China.
Abstract: Most research on anomaly detection has focused on events that differ from their spatio-temporal neighboring events. Detecting anomalies that involve multiple normal events interacting in an unusual pattern remains a significant challenge. In this work, a novel unsupervised method based on a sparse topic model was proposed to capture motion patterns and detect anomalies in traffic surveillance. Scale-invariant feature transform (SIFT) flow was used to improve the dense trajectory in order to extract interest points and the corresponding descriptors with less interference. To strengthen the relationship between interest points on the same trajectory, the Fisher kernel method was applied to obtain the trajectory representation, which was quantized into visual words. The sparse topic model was then proposed to explore the latent motion patterns and achieve a sparse representation of the video scene. Finally, two anomaly detection algorithms were compared, based on video clip detection and visual word analysis respectively. Experiments were conducted on the QMUL Junction and AVSS datasets. The results demonstrated the superior efficiency of the proposed method.
Funding: Supported by the National Natural Science Foundation of China (71473183, 71503188).
Abstract: User interest is not static; it changes dynamically. For the search engine scenario, this paper presents a personalized adaptive user interest prediction framework. It represents user interest as a topic distribution, captures every change of user interest in the history, and uses the changes to predict future individual user interest dynamically. More specifically, it first uses a personalized user interest representation model to infer user interest from the queries in the user's history data using a topic model; it then presents a personalized user interest prediction model that captures the dynamic changes of user interest and predicts future user interest by leveraging the query submission times in the history data. Compared with the Interest Degree Multi-Stage Quantization Model, experimental results on the AOL Search Query Log show that our framework is more stable and effective in user interest prediction.
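To make the idea of predicting interest from timestamped topic distributions concrete, here is a minimal sketch that is not the framework described above. It assumes each past query has already been mapped to a topic distribution and simply averages those distributions with exponential time decay, so recent queries count more. The function name `predict_interest` and the `half_life` parameter are hypothetical.

```python
def predict_interest(history, now, half_life=7.0):
    """Decay-weighted average of past topic distributions.

    history: non-empty list of (timestamp_in_days, topic_distribution) pairs.
    now: time (in days) at which the interest is predicted.
    half_life: hypothetical number of days after which a query's weight halves.
    """
    if not history:
        raise ValueError("history must not be empty")
    n_topics = len(history[0][1])
    weighted = [0.0] * n_topics
    total = 0.0
    for timestamp, dist in history:
        weight = 0.5 ** ((now - timestamp) / half_life)
        total += weight
        for k, p in enumerate(dist):
            weighted[k] += weight * p
    # Normalizing by the total weight keeps the result a distribution.
    return [v / total for v in weighted]


# A query about topic 0 a week ago and a query about topic 1 today.
history = [(0.0, [1.0, 0.0]), (7.0, [0.0, 1.0])]
predicted = predict_interest(history, now=7.0, half_life=7.0)
```

With a seven-day half-life the older query gets half the weight of the newer one, so the predicted interest leans toward topic 1 by two to one.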
Funding: Supported in part by the National High-tech R&D Program of China (863 Program) under Grant No. 2013AA102301 and by the technological project of Henan province (162102210214).
Abstract: Recommendation systems can greatly alleviate "information overload" in the big data era. Existing recommendation methods, however, typically focus on predicting missing rating values by analyzing the dualistic user-item relationship, which neglects the important fact that users' latent interests can influence their rating behavior. Moreover, traditional recommendation methods easily suffer from the high-dimensionality problem and the cold-start problem. To address these challenges, we propose the PBUED (PLSA-Based Uniform Euclidean Distance) scheme, which uses a topic model and a uniform Euclidean distance to recommend suitable items to users. The solution first employs probabilistic latent semantic analysis (PLSA) to extract users' interests, and users with different interests are divided into different subgroups. The uniform Euclidean distance is then used to compute the similarity of users within the same interest subset. Finally, missing rating values are predicted by aggregating the ratings of similar neighbors. We evaluate PBUED on two datasets, and experimental results show that PBUED achieves better prediction and ranking performance than other approaches.
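The last two steps of the scheme above (distance within an interest subgroup, then neighbor aggregation) can be sketched as follows. This is an illustration under stated assumptions, not the PBUED implementation: the subgroup labels are taken as given rather than derived with PLSA, the abstract does not define the "uniform Euclidean distance", so a Euclidean distance averaged over co-rated items stands in for it, and the weighting 1 / (1 + distance) is a hypothetical choice. All names are made up for the example.

```python
import math


def distance(ratings_a, ratings_b):
    """Euclidean distance averaged over co-rated items (stand-in metric)."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return None  # no shared items, so similarity is undefined
    squared = sum((ratings_a[i] - ratings_b[i]) ** 2 for i in common)
    return math.sqrt(squared / len(common))


def predict_rating(user, item, ratings, groups):
    """Similarity-weighted mean of ratings from neighbors in the same subgroup."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or groups[other] != groups[user]:
            continue  # only users in the same interest subgroup are neighbors
        if item not in other_ratings:
            continue
        d = distance(ratings[user], other_ratings)
        if d is None:
            continue
        weight = 1.0 / (1.0 + d)  # closer neighbors contribute more
        numerator += weight * other_ratings[item]
        denominator += weight
    return numerator / denominator if denominator else None


ratings = {
    "u1": {"a": 5, "b": 3},
    "u2": {"a": 5, "b": 3, "c": 4},
    "u3": {"a": 1, "b": 1, "c": 1},
    "u4": {"a": 3, "c": 2},
}
groups = {"u1": 0, "u2": 0, "u3": 1, "u4": 0}  # assumed subgroup labels
predicted = predict_rating("u1", "c", ratings, groups)
```

Here u3 is ignored because it belongs to a different subgroup, u2 matches u1 exactly and gets weight 1, and u4 is two rating points away on the shared item and gets weight 1/3.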