Due to the slow processing speed of text topic clustering in stand-alone architecture under the background of big data,this paper takes news text as the research object and proposes LDA text topic clustering algorithm...Due to the slow processing speed of text topic clustering in stand-alone architecture under the background of big data,this paper takes news text as the research object and proposes LDA text topic clustering algorithm based on Spark big data platform.Since the TF-IDF(term frequency-inverse document frequency)algorithm under Spark is irreversible to word mapping,the mapped words indexes cannot be traced back to the original words.In this paper,an optimized method is proposed that TF-IDF under Spark to ensure the text words can be restored.Firstly,the text feature is extracted by the TF-IDF algorithm combined CountVectorizer proposed in this paper,and then the features are inputted to the LDA(Latent Dirichlet Allocation)topic model for training.Finally,the text topic clustering is obtained.Experimental results show that for large data samples,the processing speed of LDA topic model clustering has been improved based Spark.At the same time,compared with the LDA topic model based on word frequency input,the model proposed in this paper has a reduction of perplexity.展开更多
Modeling topics in short texts presents significant challenges due to feature sparsity, particularly when analyzing content generated by large-scale online users. This sparsity can substantially impair semantic captur...Modeling topics in short texts presents significant challenges due to feature sparsity, particularly when analyzing content generated by large-scale online users. This sparsity can substantially impair semantic capture accuracy. We propose a novel approach that incorporates pre-clustered knowledge into the BERTopic model while reducing the l2 norm for low-frequency words. Our method effectively mitigates feature sparsity during cluster mapping. Empirical evaluation on the StackOverflow dataset demonstrates that our approach outperforms baseline models, achieving superior Macro-F1 scores. These results validate the effectiveness of our proposed feature sparsity reduction technique for short-text topic modeling.展开更多
Topic models such as Latent Dirichlet Allocation(LDA) have been successfully applied to many text mining tasks for extracting topics embedded in corpora. However, existing topic models generally cannot discover bursty...Topic models such as Latent Dirichlet Allocation(LDA) have been successfully applied to many text mining tasks for extracting topics embedded in corpora. However, existing topic models generally cannot discover bursty topics that experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness through explicitly modeling each topic's burst states with a first order Markov chain and using the chain to generate the topic proportion of documents in a Logistic Normal fashion. A Gibbs sampling algorithm is developed for the posterior inference of the proposed model. Experimental results on a news data set show our model can efficiently discover bursty topics, outperforming the state-of-the-art method.展开更多
[目的/意义]旨在为在线教育的发展提供参考。[方法/过程]检索Web of Science(WoS)核心数据库关于在线教育的文献(保留摘要以Excel格式导出),运用BERTopic模型进行热点主题抽取,进而对在线教育领域主题进行分析。[结果/结论]BERTopic模...[目的/意义]旨在为在线教育的发展提供参考。[方法/过程]检索Web of Science(WoS)核心数据库关于在线教育的文献(保留摘要以Excel格式导出),运用BERTopic模型进行热点主题抽取,进而对在线教育领域主题进行分析。[结果/结论]BERTopic模型自动生成145个主题(未经干扰),经过归纳和筛选共得到四个主题,即主题1在线教育底层技术研究、主题2在线教学研究、主题3创造思维培养以及主题4在线学习研究。目前元宇宙的出现促进了在线教育的发展,给在线教育提供了沉浸式的学习环境。同时,在线教育促进了学生的个性化发展,在一定程度上弥补了教育不公平。展开更多
以编目分类和规则匹配为主的古籍文本主题分类方法存在工作效能低、专家知识依赖性强、分类依据单一化、古籍文本主题自动分类难等问题。对此,本文结合古籍文本内容和文字特征,尝试从古籍内容分类得到符合研究者需求的主题,推动数字人...以编目分类和规则匹配为主的古籍文本主题分类方法存在工作效能低、专家知识依赖性强、分类依据单一化、古籍文本主题自动分类难等问题。对此,本文结合古籍文本内容和文字特征,尝试从古籍内容分类得到符合研究者需求的主题,推动数字人文研究范式的转型。首先,参照东汉古籍《说文解字》对文字的分析方式,以前期标注的古籍语料数据集为基础,构建全新的“字音(说)-原文(文)-结构(解)-字形(字)”四维特征数据集。其次,设计四维特征向量提取模型(speaking,word,pattern,and font to vector,SWPF2vec),并结合预训练模型实现对古籍文本细粒度的特征表示。再其次,构建融合卷积神经网络、循环神经网络和多头注意力机制的古籍文本主题分类模型(dianji-recurrent convolutional neural networks for text classification,DJ-TextRCNN)。最后,融入四维语义特征,实现对古籍文本多维度、深层次、细粒度的语义挖掘。在古籍文本主题分类任务上,DJ-TextRCNN模型在不同维度特征下的主题分类准确率均为最优,在“说文解字”四维特征下达到76.23%的准确率,初步实现了对古籍文本的精准主题分类。展开更多
Single-pass is commonly used in topic detection and tracking( TDT) due to its simplicity,high efficiency and low cost. When dealing with large-scale data,time cost will increase sharply and clustering performance will...Single-pass is commonly used in topic detection and tracking( TDT) due to its simplicity,high efficiency and low cost. When dealing with large-scale data,time cost will increase sharply and clustering performance will be affected greatly. Aiming at this problem,hierarchical clustering algorithm based on single-pass is proposed,which is inspired by hierarchical and concurrent ideas to divide clustering process into three stages. News reports are classified into different categories firstly.Then there are twice single-pass clustering processes in the same category,and one agglomerative clustering among different categories. In addition,for semantic similarity in news reports,topic model is improved based on named entities. Experimental results show that the proposed method can effectively accelerate the process as well as improve the performance.展开更多
Background: With mounting global environmental, social and economic pressures the resilience and stability of forests and thus the provisioning of vital ecosystem services is increasingly threatened. Intensified moni...Background: With mounting global environmental, social and economic pressures the resilience and stability of forests and thus the provisioning of vital ecosystem services is increasingly threatened. Intensified monitoring can help to detect ecological threats and changes earlier, but monitoring resources are limited. Participatory forest monitoring with the help of "citizen scientists" can provide additional resources for forest monitoring and at the same time help to communicate with stakeholders and the general public. Examples for citizen science projects in the forestry domain can be found but a solid, applicable larger framework to utilise public participation in the area of forest monitoring seems to be lacking. We propose that a better understanding of shared and related topics in citizen science and forest monitoring might be a first step towards such a framework. Methods: We conduct a systematic meta-analysis of 1015 publication abstracts addressing "forest monitoring" and "citizen science" in order to explore the combined topical landscape of these subjects. We employ 'topic modelling an unsupervised probabilistic machine learning method, to identify latent shared topics in the analysed publications. Results: We find that large shared topics exist, but that these are primarily topics that would be expected in scientific publications in general. Common domain-specific topics are under-represented and indicate a topical separation of the two document sets on "forest monitoring" and "citizen science" and thus the represented domains. While topic modelling as a method proves to be a scalable and useful analytical tool, we propose that our approach could deliver even more useful data if a larger document set and full-text publications would be available for analysis. Conclusions: We propose that these results, together with the observation of non-shared but related topics, point at under-utilised opportunities for public participation in forest monitoring. Citizen science could be applied as a versatile tool in forest ecosystems monitoring, complementing traditional forest monitoring programmes, assisting early threat recognition and helping to connect forest management with the general public. We conclude that our presented approach should be pursued further as it may aid the understanding and setup of citizen science efforts in the forest monitoring domain.展开更多
基金This work is supported by the Science Research Projects of Hunan Provincial Education Department(Nos.18A174,18C0262)the National Natural Science Foundation of China(No.61772561)+2 种基金the Key Research&Development Plan of Hunan Province(Nos.2018NK2012,2019SK2022)the Degree&Postgraduate Education Reform Project of Hunan Province(No.209)the Postgraduate Education and Teaching Reform Project of Central South Forestry University(No.2019JG013).
文摘Due to the slow processing speed of text topic clustering in stand-alone architecture under the background of big data,this paper takes news text as the research object and proposes LDA text topic clustering algorithm based on Spark big data platform.Since the TF-IDF(term frequency-inverse document frequency)algorithm under Spark is irreversible to word mapping,the mapped words indexes cannot be traced back to the original words.In this paper,an optimized method is proposed that TF-IDF under Spark to ensure the text words can be restored.Firstly,the text feature is extracted by the TF-IDF algorithm combined CountVectorizer proposed in this paper,and then the features are inputted to the LDA(Latent Dirichlet Allocation)topic model for training.Finally,the text topic clustering is obtained.Experimental results show that for large data samples,the processing speed of LDA topic model clustering has been improved based Spark.At the same time,compared with the LDA topic model based on word frequency input,the model proposed in this paper has a reduction of perplexity.
文摘Modeling topics in short texts presents significant challenges due to feature sparsity, particularly when analyzing content generated by large-scale online users. This sparsity can substantially impair semantic capture accuracy. We propose a novel approach that incorporates pre-clustered knowledge into the BERTopic model while reducing the l2 norm for low-frequency words. Our method effectively mitigates feature sparsity during cluster mapping. Empirical evaluation on the StackOverflow dataset demonstrates that our approach outperforms baseline models, achieving superior Macro-F1 scores. These results validate the effectiveness of our proposed feature sparsity reduction technique for short-text topic modeling.
基金Supported by the National High Technology Research and Development Program of China(No.2012AA011005)
文摘Topic models such as Latent Dirichlet Allocation(LDA) have been successfully applied to many text mining tasks for extracting topics embedded in corpora. However, existing topic models generally cannot discover bursty topics that experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness through explicitly modeling each topic's burst states with a first order Markov chain and using the chain to generate the topic proportion of documents in a Logistic Normal fashion. A Gibbs sampling algorithm is developed for the posterior inference of the proposed model. Experimental results on a news data set show our model can efficiently discover bursty topics, outperforming the state-of-the-art method.
文摘[目的/意义]旨在为在线教育的发展提供参考。[方法/过程]检索Web of Science(WoS)核心数据库关于在线教育的文献(保留摘要以Excel格式导出),运用BERTopic模型进行热点主题抽取,进而对在线教育领域主题进行分析。[结果/结论]BERTopic模型自动生成145个主题(未经干扰),经过归纳和筛选共得到四个主题,即主题1在线教育底层技术研究、主题2在线教学研究、主题3创造思维培养以及主题4在线学习研究。目前元宇宙的出现促进了在线教育的发展,给在线教育提供了沉浸式的学习环境。同时,在线教育促进了学生的个性化发展,在一定程度上弥补了教育不公平。
文摘以编目分类和规则匹配为主的古籍文本主题分类方法存在工作效能低、专家知识依赖性强、分类依据单一化、古籍文本主题自动分类难等问题。对此,本文结合古籍文本内容和文字特征,尝试从古籍内容分类得到符合研究者需求的主题,推动数字人文研究范式的转型。首先,参照东汉古籍《说文解字》对文字的分析方式,以前期标注的古籍语料数据集为基础,构建全新的“字音(说)-原文(文)-结构(解)-字形(字)”四维特征数据集。其次,设计四维特征向量提取模型(speaking,word,pattern,and font to vector,SWPF2vec),并结合预训练模型实现对古籍文本细粒度的特征表示。再其次,构建融合卷积神经网络、循环神经网络和多头注意力机制的古籍文本主题分类模型(dianji-recurrent convolutional neural networks for text classification,DJ-TextRCNN)。最后,融入四维语义特征,实现对古籍文本多维度、深层次、细粒度的语义挖掘。在古籍文本主题分类任务上,DJ-TextRCNN模型在不同维度特征下的主题分类准确率均为最优,在“说文解字”四维特征下达到76.23%的准确率,初步实现了对古籍文本的精准主题分类。
基金Supported by the National Natural Science Foundation of China(No.61502312)the Fundamental Research Funds for the Central Universities(No.2017BQ024)+1 种基金the Natural Science Foundation of Guangdong Province(No.2017A030310428)the Science and Technology Programm of Guangzhou(No.201806020075,20180210025)
文摘Single-pass is commonly used in topic detection and tracking( TDT) due to its simplicity,high efficiency and low cost. When dealing with large-scale data,time cost will increase sharply and clustering performance will be affected greatly. Aiming at this problem,hierarchical clustering algorithm based on single-pass is proposed,which is inspired by hierarchical and concurrent ideas to divide clustering process into three stages. News reports are classified into different categories firstly.Then there are twice single-pass clustering processes in the same category,and one agglomerative clustering among different categories. In addition,for semantic similarity in news reports,topic model is improved based on named entities. Experimental results show that the proposed method can effectively accelerate the process as well as improve the performance.
文摘Background: With mounting global environmental, social and economic pressures the resilience and stability of forests and thus the provisioning of vital ecosystem services is increasingly threatened. Intensified monitoring can help to detect ecological threats and changes earlier, but monitoring resources are limited. Participatory forest monitoring with the help of "citizen scientists" can provide additional resources for forest monitoring and at the same time help to communicate with stakeholders and the general public. Examples for citizen science projects in the forestry domain can be found but a solid, applicable larger framework to utilise public participation in the area of forest monitoring seems to be lacking. We propose that a better understanding of shared and related topics in citizen science and forest monitoring might be a first step towards such a framework. Methods: We conduct a systematic meta-analysis of 1015 publication abstracts addressing "forest monitoring" and "citizen science" in order to explore the combined topical landscape of these subjects. We employ 'topic modelling an unsupervised probabilistic machine learning method, to identify latent shared topics in the analysed publications. Results: We find that large shared topics exist, but that these are primarily topics that would be expected in scientific publications in general. Common domain-specific topics are under-represented and indicate a topical separation of the two document sets on "forest monitoring" and "citizen science" and thus the represented domains. While topic modelling as a method proves to be a scalable and useful analytical tool, we propose that our approach could deliver even more useful data if a larger document set and full-text publications would be available for analysis. Conclusions: We propose that these results, together with the observation of non-shared but related topics, point at under-utilised opportunities for public participation in forest monitoring. Citizen science could be applied as a versatile tool in forest ecosystems monitoring, complementing traditional forest monitoring programmes, assisting early threat recognition and helping to connect forest management with the general public. We conclude that our presented approach should be pursued further as it may aid the understanding and setup of citizen science efforts in the forest monitoring domain.