Abstract: Modeling topics in short texts is challenging because of feature sparsity, particularly for content generated by large numbers of online users. This sparsity can substantially reduce the accuracy with which semantics are captured. We propose an approach that incorporates pre-clustered knowledge into the BERTopic model while reducing the l2 norm for low-frequency words, which mitigates feature sparsity during cluster mapping. On the StackOverflow dataset, our approach outperforms baseline models in Macro-F1. These results support the effectiveness of the proposed feature sparsity reduction technique for short-text topic modeling.
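For readers unfamiliar with the metric reported above, Macro-F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The following is a minimal sketch of that standard definition only; the function name `macro_f1` and the toy labels are illustrative assumptions, and the abstract does not say how clusters were mapped to labels before scoring.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never matches
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical labels, not data from the paper)
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
score = macro_f1(y_true, y_pred)
```

In the toy example the per-class F1 values are 2/3, 0.8, and 1.0, and the score is their plain average.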
Funding: Supported by the National High Technology Research and Development Program of China (No. 2012AA011005)
Abstract: Topic models such as Latent Dirichlet Allocation (LDA) have been successfully applied to many text mining tasks to extract the topics embedded in corpora. However, existing topic models generally cannot discover bursty topics, that is, topics that experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness by explicitly modeling each topic's burst states with a first-order Markov chain and using the chain to generate the topic proportions of documents in a logistic-normal fashion. A Gibbs sampling algorithm is developed for posterior inference in the proposed model. Experimental results on a news data set show that our model can efficiently discover bursty topics, outperforming the state-of-the-art method.
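The generative idea described above can be sketched roughly as follows. This is not the Burst-LDA specification: the abstract gives no parameterization, so the two-state chain, the `p_stay` probabilities, the additive `burst_boost` on the Gaussian mean, and all function names are assumptions made purely for illustration. The sketch only shows how a first-order Markov chain over burst states could shift a logistic-normal draw of topic proportions.

```python
import math
import random


def sample_burst_chain(n_steps, p_stay, rng):
    """Two-state first-order Markov chain: 0 = normal, 1 = bursty (assumed)."""
    if n_steps <= 0:
        return []
    states = [0]
    for _ in range(n_steps - 1):
        prev = states[-1]
        # Stay in the previous state with probability p_stay[prev], else flip
        states.append(prev if rng.random() < p_stay[prev] else 1 - prev)
    return states


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_topic_proportions(burst_states, base_mean, burst_boost, sigma, rng):
    """Logistic-normal draw: Gaussian per topic, then softmax.

    Assumption: a bursty topic has its Gaussian mean raised by burst_boost.
    """
    eta = [
        rng.gauss(base_mean[k] + burst_boost * s, sigma)
        for k, s in enumerate(burst_states)
    ]
    return softmax(eta)


rng = random.Random(0)
n_topics, n_steps = 3, 10
chains = [sample_burst_chain(n_steps, (0.9, 0.7), rng) for _ in range(n_topics)]
t = 5  # time stamp of a hypothetical document
theta = sample_topic_proportions(
    [chains[k][t] for k in range(n_topics)], [0.0] * n_topics, 2.0, 0.5, rng
)
```

Under these assumptions, a topic that is in its bursty state at time `t` tends to receive a larger share of `theta` for documents written at that time. Posterior inference (the Gibbs sampler mentioned in the abstract) is not covered here.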
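Abstract: [Purpose/Significance] This study aims to provide a reference for the development of online education. [Method/Process] Literature on online education was retrieved from the Web of Science (WoS) Core Collection (abstracts were retained and exported in Excel format), and the BERTopic model was used to extract hot topics, which were then used to analyze the topics of the online education field. [Results/Conclusions] The BERTopic model automatically generated 145 topics (without intervention). After summarization and screening, four topics were obtained: Topic 1, research on the underlying technologies of online education; Topic 2, research on online teaching; Topic 3, cultivation of creative thinking; and Topic 4, research on online learning. The emergence of the metaverse has promoted the development of online education by providing an immersive learning environment. At the same time, online education has promoted the individualized development of students and, to some extent, compensated for educational inequity.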
Abstract: Background: With mounting global environmental, social and economic pressures, the resilience and stability of forests, and thus the provisioning of vital ecosystem services, are increasingly threatened. Intensified monitoring can help to detect ecological threats and changes earlier, but monitoring resources are limited. Participatory forest monitoring with the help of "citizen scientists" can provide additional resources for forest monitoring and at the same time help to communicate with stakeholders and the general public. Examples of citizen science projects in the forestry domain can be found, but a solid, applicable larger framework to utilise public participation in the area of forest monitoring seems to be lacking. We propose that a better understanding of shared and related topics in citizen science and forest monitoring might be a first step towards such a framework. Methods: We conduct a systematic meta-analysis of 1015 publication abstracts addressing "forest monitoring" and "citizen science" in order to explore the combined topical landscape of these subjects. We employ topic modelling, an unsupervised probabilistic machine learning method, to identify latent shared topics in the analysed publications. Results: We find that large shared topics exist, but that these are primarily topics that would be expected in scientific publications in general. Common domain-specific topics are under-represented, which indicates a topical separation of the two document sets on "forest monitoring" and "citizen science" and thus of the represented domains. While topic modelling as a method proves to be a scalable and useful analytical tool, we propose that our approach could deliver even more useful data if a larger document set and full-text publications were available for analysis. Conclusions: We propose that these results, together with the observation of non-shared but related topics, point to under-utilised opportunities for public participation in forest monitoring. Citizen science could be applied as a versatile tool in forest ecosystem monitoring, complementing traditional forest monitoring programmes, assisting early threat recognition and helping to connect forest management with the general public. We conclude that our presented approach should be pursued further, as it may aid the understanding and setup of citizen science efforts in the forest monitoring domain.
Funding: Supported by the National High Technology Research and Development Program of China (No. 2012AA011005)
Abstract: This paper presents a non-parametric topic model that captures not only the latent topics in text collections, but also how the topics change over space. Unlike other recent work that relies on either Gaussian assumptions or discretization of locations, here topics are associated with a distance-dependent Chinese Restaurant Process (ddCRP), and for each document, the observed words are influenced by the document's GPS tag. Our model allows both an unbounded number and a flexible distribution of the geographical variations of the topics' content. We develop a Gibbs sampler for the proposed model and compare it with existing models on a real data set.
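To make the ddCRP prior mentioned above more concrete, here is a minimal sketch of drawing a partition from it. In a ddCRP each item links to another item with probability proportional to a decay function of their distance, or to itself with probability proportional to a concentration parameter, and clusters are the connected components of the resulting links. The exponential decay, the planar Euclidean distance standing in for GPS distance, and the names `sample_links`, `clusters_from_links`, `alpha`, and `scale` are assumptions for illustration; the abstract does not state the paper's choices, and the word-generation part of the model is omitted.

```python
import math
import random


def sample_links(points, alpha, scale, rng):
    """Draw one ddCRP link per point.

    Point i links to j != i with weight exp(-dist(i, j) / scale) (assumed
    decay) and to itself with weight alpha (alpha > 0).
    """
    n = len(points)
    links = []
    for i in range(n):
        weights = [
            alpha if j == i else math.exp(-math.dist(points[i], points[j]) / scale)
            for j in range(n)
        ]
        links.append(rng.choices(range(n), weights=weights, k=1)[0])
    return links


def clusters_from_links(links):
    """Label each item by the connected component of the link graph (union-find)."""
    parent = list(range(len(links)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in enumerate(links):
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(links))]


rng = random.Random(0)
# Hypothetical planar coordinates: two nearby documents and one far away
points = [(0.0, 0.0), (0.1, 0.0), (100.0, 100.0)]
links = sample_links(points, alpha=1.0, scale=1.0, rng=rng)
labels = clusters_from_links(links)
```

Because nearby points receive much larger link weights than distant ones, documents written close together tend to fall into the same component, which is how the prior lets the number of spatial groups grow with the data rather than being fixed in advance.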