Funding: Funded by the Natural Science Foundation of Fujian Province, China, grant No. 2022J05291.
Abstract: Topic modeling is a fundamental content-analysis technique in natural language processing, widely applied in domains such as the social sciences and finance. In the era of digital communication, social scientists increasingly rely on large-scale social media data to explore public discourse, collective behavior, and emerging social concerns. However, traditional models such as Latent Dirichlet Allocation (LDA) and neural topic models such as BERTopic struggle to capture deep semantic structures in short-text datasets, especially in complex non-English languages such as Chinese. This paper presents Generative Language Model Topic (GLMTopic), a novel hybrid topic modeling framework that leverages large language models and is designed to support social science research by uncovering coherent and interpretable themes from Chinese social media platforms. GLMTopic integrates Adaptive Community-enhanced Graph Embedding for advanced semantic representation, Uniform Manifold Approximation and Projection (UMAP)-based dimensionality reduction, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) clustering, and large language model (LLM)-powered representation tuning to generate more contextually relevant and interpretable topics. By reducing dependence on extensive text preprocessing and on human expert intervention in post-analysis topic labeling, GLMTopic facilitates a fully automated and user-friendly topic extraction process. Experimental evaluations on a social media dataset sourced from Weibo demonstrate that GLMTopic outperforms LDA and BERTopic in coherence score and in usability through automated interpretation, providing a more scalable and semantically accurate solution for Chinese topic modeling. Future research will explore optimizing computational efficiency, integrating knowledge graphs and sentiment analysis for more complex workflows, and extending the framework to real-time and multilingual topic modeling.
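The abstract above compares models by coherence score but does not say which coherence metric was used. As an illustration only, the sketch below computes NPMI (normalized pointwise mutual information) coherence, one common choice, from document-level co-occurrence counts. The function name `npmi_coherence` and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    `documents` is a list of token lists; probabilities are estimated as
    the fraction of documents containing a word (or both words of a pair).
    `top_words` must contain at least two words.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        hits = sum(1 for d in doc_sets if all(w in d for w in words))
        return hits / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: lowest NPMI by convention
        elif p12 == 1.0:
            scores.append(1.0)   # both words in every document: avoid dividing by log(1)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy, already-tokenized corpus (hypothetical data).
corpus = [
    ["housing", "price", "loan"],
    ["housing", "price", "policy"],
    ["match", "goal", "team"],
    ["match", "team", "coach"],
]
print(npmi_coherence(["housing", "price"], corpus))  # words that always co-occur
print(npmi_coherence(["housing", "team"], corpus))   # words that never co-occur
```

Scores range from -1 (top words never appear together) to 1 (they always appear together), so a higher average suggests a topic whose top words belong together in the corpus.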
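Abstract: Purpose/Significance: To explore a method for identifying the evolution paths of interdisciplinary topics in medical informatics, providing a reference for interdisciplinary research planning and research management in the field. Method/Process: First, BERTopic and a large language model (LLM) are used to identify global topics and stage-specific topics in the collected medical informatics literature; then interdisciplinary topics are determined with a two-dimensional analysis framework of topic influence and disciplinary diversity; finally, the evolution paths of the interdisciplinary topics are analyzed. Result/Conclusion: Using the Topic-LLM analysis method, the study finds that the disciplinary diversity and topic influence of interdisciplinary topics in medical informatics show a steady upward trend.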
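Abstract: Purpose/Significance: To build a two-layer analysis framework that captures the overall disciplinary structure, identifies emerging frontier areas, and tracks topic evolution. Method/Process: Medical informatics literature from 2016 to 2025 was retrieved from the PubMed, Scopus, and Web of Science databases; BERTopic was used to identify topics, which were classified into three evolution patterns: emerging, stable, and declining. A retrieval-augmented generation system was built on ChromaDB, and document-topic mapping was used for micro-level validation and for mining knowledge associations. Result/Conclusion: Topic evolution in medical informatics shows three characteristics: a shift in research focus, deeper technological integration, and stronger interdisciplinary crossover. The BERTopic-RAG framework offers a new method for knowledge discovery.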
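The abstract above sorts topics into emerging, stable, and declining patterns without stating the rule used. One plausible way to make such a split, shown purely as a sketch, is to fit a least-squares trend line to each topic's yearly share of publications and compare the slope with a cutoff. The function names and the `threshold` value are hypothetical assumptions, not the authors' criteria.

```python
def trend_slope(values):
    """Ordinary least-squares slope of `values` against their index 0..n-1.

    Requires at least two values.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def classify_topic(yearly_share, threshold=0.005):
    """Label a topic by the trend of its yearly share of all documents.

    `threshold` is an illustrative cutoff (change in share per year),
    not a value taken from the paper.
    """
    slope = trend_slope(yearly_share)
    if slope > threshold:
        return "emerging"
    if slope < -threshold:
        return "declining"
    return "stable"


# Hypothetical yearly shares for three topics.
print(classify_topic([0.01, 0.02, 0.03, 0.04]))
print(classify_topic([0.04, 0.03, 0.02, 0.01]))
print(classify_topic([0.02, 0.02, 0.02, 0.02]))
```

A slope-based rule is easy to explain but sensitive to the chosen cutoff and to short time series, so the threshold would need tuning against the actual corpus.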
Funding: Grant from the National Natural Science Foundation of China (82274685).
Abstract: Objective: To systematically characterize the developmental trajectory and interdisciplinary integration of intelligent diagnosis in traditional Chinese medicine (TCM) through quantitative topic evolution analysis, addressing the fragmentation of existing research and clarifying the long-term research structure and evolutionary patterns of the field. Methods: A topic evolution analysis was performed on Chinese-language literature pertaining to intelligent diagnosis in TCM. Publications were retrieved from the China National Knowledge Infrastructure (CNKI), Wanfang Data, and the China Science and Technology Journal Database (VIP), covering the period from database inception to July 3, 2025. A hybrid segmentation approach, based on cumulative publication growth trends and inflection point detection, was applied to divide the research timeline into distinct stages. Subsequently, the latent Dirichlet allocation (LDA) model was used to extract research topics, followed by alignment and evolutionary analysis of topics across stages. Results: A total of 3919 publications published between 2003 and 2025 were included, and the research trajectory was divided into five stages based on data-driven breakpoint detection. The field exhibited a clear evolutionary shift from early rule-based systems and tongue-pulse image and signal analysis (2006–2010), to machine-learning-based syndrome and prescription modeling (2011–2015), followed by deep-learning-driven pattern recognition and formula association (2016–2020). Since 2021, research has increasingly emphasized knowledge-graph construction, multimodal integration, and intelligent clinical decision-support systems, with recent studies (2024–2025) showing the emergence of large language models and agent-based diagnostic frameworks. Topic evolution analysis further revealed sustained cross-stage continuity in syndrome modeling and prescription association analysis, alongside the progressive consolidation of integrated intelligent diagnostic platforms. Conclusion: By identifying key technological transitions and persistent core research themes, our findings offer a structured reference framework for the design of intelligent diagnostic systems, the construction of knowledge-driven clinical decision-support tools, and the alignment of AI models with TCM diagnostic logic. Importantly, the stage-based evolutionary insights derived from this analysis can inform future methodological choices, improve model interpretability and clinical applicability, and support the translation of intelligent TCM diagnosis from experimental research to real-world clinical practice.
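The abstract above mentions aligning topics across stages but gives no alignment rule. A minimal sketch, assuming each topic is represented by a word-weight dictionary and that topics in adjacent stages are linked by cosine similarity, could look like the following. The names `cosine` and `align_topics`, the `min_sim` cutoff, and the sample topics are all hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse word-weight dicts."""
    dot = sum(w * q.get(word, 0.0) for word, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def align_topics(stage_a, stage_b, min_sim=0.3):
    """Link each topic in stage_a to its most similar topic in stage_b.

    Returns (topic_a, topic_b, similarity) tuples for links whose
    similarity is at least `min_sim` (an illustrative cutoff).
    """
    links = []
    for name_a, words_a in stage_a.items():
        best_name, best_sim = None, 0.0
        for name_b, words_b in stage_b.items():
            sim = cosine(words_a, words_b)
            if sim > best_sim:
                best_name, best_sim = name_b, sim
        if best_name is not None and best_sim >= min_sim:
            links.append((name_a, best_name, best_sim))
    return links


# Hypothetical top-word weights for two adjacent stages.
earlier = {
    "A1": {"syndrome": 0.5, "classification": 0.3, "svm": 0.2},
    "A2": {"tongue": 0.6, "image": 0.4},
}
later = {
    "B1": {"syndrome": 0.4, "classification": 0.2, "cnn": 0.4},
    "B2": {"knowledge": 0.5, "graph": 0.5},
}
for link in align_topics(earlier, later):
    print(link)
```

Topics with no sufficiently similar successor (here `A2`) are left unlinked, which is one simple way to represent a theme that fades between stages.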
Funding: This work was supported by the National High Technology Research and Development Program of China (Nos. 2010AA012505, 2011AA010702, 2012AA01A401, and 2012AA01A402), the Chinese National Science Foundation (Nos. 60933005, 91124002, and 61303265), the National Technology Support Foundation (No. 2012BAH38B04), and the National 242 Foundation (No. 2011A010).
Abstract: Microblogs have become an important platform for people to publish and transmit information and to acquire knowledge. This paper focuses on the problem of discovering user interest in microblogs. We propose a topic mining model based on Latent Dirichlet Allocation (LDA), named the user-topic model. For each user, interests are divided into two parts according to how the microblogs were generated: original interest and retweet interest. We present a Gibbs sampling implementation for inferring the parameters of our model, and discover not only a user's original interest but also the retweet interest. We then combine original interest and retweet interest to compute interest words for users. Experiments on a dataset of Sina microblogs demonstrate that our model discovers user interest effectively and outperforms existing topic models on this task. We also find that original interest and retweet interest are similar and that the topics of interest contain user labels. The interest words discovered by our model reflect user labels, but their range is much broader.
Funding: Supported by the National High Technology Research and Development Program of China (No. 2012AA011005).
Abstract: Topic models such as Latent Dirichlet Allocation (LDA) have been successfully applied to many text mining tasks for extracting topics embedded in corpora. However, existing topic models generally cannot discover bursty topics, that is, topics that experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness by explicitly modeling each topic's burst states with a first-order Markov chain and using the chain to generate the topic proportions of documents in a logistic-normal fashion. A Gibbs sampling algorithm is developed for posterior inference in the proposed model. Experimental results on a news dataset show that our model can efficiently discover bursty topics, outperforming the state-of-the-art method.
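To make the generative idea in the abstract above concrete, here is a toy sketch of a first-order Markov chain over binary burst states feeding a logistic-normal draw of topic proportions. It is not the Burst-LDA model itself: the transition probabilities `p_enter` and `p_stay`, the `boost` added to a bursting topic's mean, and the noise level `sigma` are all hypothetical, and no Gibbs sampling inference is shown.

```python
import math
import random


def sample_burst_states(n_steps, p_enter, p_stay, rng):
    """Sample a 0/1 burst-state sequence from a first-order Markov chain.

    p_enter: probability of moving from non-burst (0) to burst (1).
    p_stay:  probability of remaining in burst (1).
    """
    states = []
    state = 0  # assumption: every topic starts in the non-burst state
    for _ in range(n_steps):
        p_burst = p_stay if state == 1 else p_enter
        state = 1 if rng.random() < p_burst else 0
        states.append(state)
    return states


def logistic_normal_proportions(means, sigma, rng):
    """Add Gaussian noise to `means`, then softmax onto the simplex."""
    eta = [rng.gauss(m, sigma) for m in means]
    peak = max(eta)  # subtract the max for numerical stability
    exps = [math.exp(e - peak) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(42)
n_topics, n_steps, boost = 3, 6, 2.0  # illustrative settings
chains = [sample_burst_states(n_steps, 0.2, 0.7, rng) for _ in range(n_topics)]

for t in range(n_steps):
    # A bursting topic gets a higher mean, so it tends to take a larger share.
    means = [boost * chains[k][t] for k in range(n_topics)]
    theta = logistic_normal_proportions(means, sigma=0.5, rng=rng)
    print(t, [chains[k][t] for k in range(n_topics)], [round(x, 3) for x in theta])
```

Because each state depends only on the previous one, bursts tend to persist for several steps when `p_stay` is high, which is the behavior a burst-aware topic model is meant to capture.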
Funding: Project (50808025) supported by the National Natural Science Foundation of China; Project (20090162110057) supported by the Doctoral Fund of the Ministry of Education, China.
Abstract: Most research on anomaly detection has focused on events that differ from their spatial-temporal neighbors. Detecting anomalies that involve multiple normal events interacting in an unusual pattern remains a significant challenge. In this work, a novel unsupervised method based on a sparse topic model was proposed to capture motion patterns and detect anomalies in traffic surveillance. Scale-invariant feature transform (SIFT) flow was used to improve the dense trajectory in order to extract interest points and the corresponding descriptors with less interference. To strengthen the relationship between interest points on the same trajectory, the Fisher kernel method was applied to obtain a trajectory representation, which was then quantized into visual words. The sparse topic model was then proposed to explore the latent motion patterns and achieve a sparse representation of the video scene. Finally, two anomaly detection algorithms were compared, based on video clip detection and visual word analysis respectively. Experiments were conducted on the QMUL Junction dataset and the AVSS dataset. The results demonstrated the superior efficiency of the proposed method.