Funding: This work is supported by the Science Research Projects of Hunan Provincial Education Department (Nos. 18A174, 18C0262), the National Natural Science Foundation of China (No. 61772561), the Key Research & Development Plan of Hunan Province (Nos. 2018NK2012, 2019SK2022), the Degree & Postgraduate Education Reform Project of Hunan Province (No. 209), and the Postgraduate Education and Teaching Reform Project of Central South Forestry University (No. 2019JG013).
Abstract: Text topic clustering is slow on a stand-alone architecture at big-data scale. This paper therefore takes news text as the research object and proposes an LDA text topic clustering algorithm based on the Spark big data platform. Because the TF-IDF (term frequency-inverse document frequency) algorithm under Spark maps words irreversibly, the mapped word indexes cannot be traced back to the original words. This paper proposes an optimized TF-IDF method under Spark that ensures the text words can be restored. First, text features are extracted by the proposed TF-IDF algorithm combined with CountVectorizer; the features are then input to the LDA (Latent Dirichlet Allocation) topic model for training. Finally, the text topic clusters are obtained. Experimental results show that, for large data samples, the Spark-based approach improves the processing speed of LDA topic model clustering. In addition, compared with an LDA topic model that takes word-frequency input, the proposed model achieves lower perplexity.
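The reversibility point can be illustrated outside Spark. In Spark, `HashingTF` maps terms to indices by hashing, so the index alone does not identify the word, whereas a fitted `CountVectorizerModel` keeps an explicit vocabulary. The pure-Python sketch below is not the paper's implementation; it only mimics the vocabulary-based idea on a toy corpus. The function names `fit_vocabulary` and `tfidf` and the sample documents are hypothetical, and the IDF formula log((N + 1) / (df + 1)) is the smoothed form documented for Spark MLlib.

```python
import math
from collections import Counter


def fit_vocabulary(docs):
    """Assign each distinct token a stable index, most frequent first."""
    counts = Counter(tok for doc in docs for tok in doc)
    # Sort by descending frequency, then alphabetically, for a deterministic order.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(ordered)}


def tfidf(docs, vocab):
    """Return one sparse {index: weight} vector per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((n_docs + 1) / (df[tok] + 1)) for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({vocab[tok]: tf[tok] * idf[tok] for tok in tf})
    return vectors


# Toy corpus (hypothetical), already tokenized.
docs = [
    ["spark", "lda", "topic"],
    ["spark", "news", "topic"],
    ["news", "text", "cluster"],
]

vocab = fit_vocabulary(docs)
index_to_word = {i: tok for tok, i in vocab.items()}  # the reverse mapping
vectors = tfidf(docs, vocab)

# Because the vocabulary is explicit, a weighted index maps back to its word.
top_index = max(vectors[0], key=vectors[0].get)
top_word = index_to_word[top_index]
print(top_word)
```

Keeping `index_to_word` alongside the vectors is what lets topic-term indices produced by a downstream model be rendered as readable words.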
Abstract: Modeling topics in short texts presents significant challenges due to feature sparsity, particularly when analyzing content generated by large-scale online users. This sparsity can substantially impair semantic capture accuracy. We propose a novel approach that incorporates pre-clustered knowledge into the BERTopic model while reducing the l2 norm for low-frequency words. Our method effectively mitigates feature sparsity during cluster mapping. Empirical evaluation on the StackOverflow dataset demonstrates that our approach outperforms baseline models, achieving superior Macro-F1 scores. These results validate the effectiveness of our proposed feature sparsity reduction technique for short-text topic modeling.
Funding: Supported by the National Natural Science Foundation of China (No. 61502312), the Fundamental Research Funds for the Central Universities (No. 2017BQ024), the Natural Science Foundation of Guangdong Province (No. 2017A030310428), and the Science and Technology Program of Guangzhou (Nos. 201806020075, 20180210025).
Abstract: Single-pass clustering is commonly used in topic detection and tracking (TDT) because of its simplicity, high efficiency, and low cost. On large-scale data, however, its time cost increases sharply and clustering performance degrades considerably. To address this problem, a hierarchical clustering algorithm based on single-pass is proposed; drawing on hierarchical and concurrent ideas, it divides the clustering process into three stages. News reports are first classified into different categories. Two rounds of single-pass clustering are then performed within each category, followed by one round of agglomerative clustering across categories. In addition, to capture semantic similarity between news reports, the topic model is improved using named entities. Experimental results show that the proposed method effectively accelerates the process and improves clustering performance.
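For readers unfamiliar with the baseline, plain single-pass clustering can be sketched as follows. This is only the basic algorithm, not the three-stage hierarchical method or the named-entity topic model described in the abstract. The bag-of-words representation, the cosine similarity measure, the threshold value 0.3, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, threshold=0.3):
    """Assign each document, in arrival order, to the most similar cluster.

    A new cluster is opened when no existing cluster reaches the threshold.
    Each cluster is represented by the summed term counts of its members.
    """
    centroids = []
    labels = []
    for doc in docs:
        vec = Counter(doc)
        best, best_sim = None, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            centroids[best].update(vec)  # fold the document into the cluster
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels


# Toy news stream (hypothetical), already tokenized.
docs = [
    ["flood", "river", "rescue"],
    ["election", "vote", "party"],
    ["flood", "rescue", "team"],
    ["vote", "party", "leader"],
]
labels = single_pass(docs)
print(labels)
```

Each incoming document is compared against every existing cluster, which is why the cost grows quickly as the number of clusters increases on large streams.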
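Abstract: WeChat groups contain large volumes of conversational text, and extracting keywords from it helps in understanding group dynamics and topic evolution. Because WeChat conversation text is short, interleaves topics, and uses non-standard language, traditional extraction methods perform poorly. To address this, a multi-stage keyword extraction algorithm based on conversation topic clustering is proposed. First, a conversation topic clustering algorithm incorporating pre-trained knowledge (Single Pass Using Thread Segmentation and Pre-training Knowledge, SP_TSPK) is proposed; it jointly considers semantic relevance, message activity, and user intimacy, and effectively addresses the problems of interleaved conversation topics and insufficient information. Second, a multi-stage keyword extraction algorithm (Multi-Stage Keyword Extraction, MSKE) is designed; it decomposes the task into unsupervised keyword extraction and supervised keyword generation, effectively extracting keywords both present in and absent from the source text while reducing the candidate set size and semantic redundancy. Finally, SP_TSPK and MSKE are combined to extract keywords from WeChat conversation text. On the WeChat dataset, compared with the AutoKeyGen algorithm, F1@5 and F1@O improve by 12.8% and 10.8% on average, and R@10 reaches on average 2.59 times that of AutoKeyGen. The experimental results show that the algorithm can effectively extract keywords from WeChat conversation text.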
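The F1@k and R@k figures above are rank-cutoff metrics. The sketch below shows one common way to compute them; it is an assumption that the paper uses this exact variant, since definitions differ (for example, some divide precision by k even when fewer than k keywords are predicted, and some stem or normalize keywords before matching). The function names and the sample keyword lists are hypothetical.

```python
def recall_at_k(predicted, gold, k):
    """Fraction of gold keywords found among the top-k predictions."""
    if not gold:
        return 0.0
    hits = len(set(predicted[:k]) & set(gold))
    return hits / len(set(gold))


def f1_at_k(predicted, gold, k):
    """Harmonic mean of precision and recall over the top-k predictions."""
    top_k = predicted[:k]
    hits = len(set(top_k) & set(gold))
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(set(gold))
    return 2 * precision * recall / (precision + recall)


# Ranked predictions and reference keywords (hypothetical).
predicted = ["meeting", "budget", "friday", "lunch", "deadline", "report"]
gold = ["budget", "deadline", "meeting", "schedule"]

# Top 5 contains 3 of the 4 gold keywords: precision 3/5, recall 3/4.
f1_5 = f1_at_k(predicted, gold, 5)
r_10 = recall_at_k(predicted, gold, 10)
print(round(f1_5, 4), r_10)
```

F1@O is usually computed the same way with k set to the number of gold keywords for each document.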
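Abstract: [Background] Bioleaching is an effective means of leaching tailings, waste ores, low-grade ores, and refractory ores; compared with traditional leaching technologies, it is environmentally friendly, profitable, and metallurgically efficient. [Objective] To explore in depth the global development trends and academic impact of bioleaching, and to help researchers determine research directions, carry out related studies, and understand the most relevant topics in the field. [Methods] Literature on bioleaching worldwide from 2011 to 2023 was retrieved from the Web of Science Core Collection database and analyzed. [Results] The annual publication trend reveals that research interest in bioleaching has declined somewhat. Analysis of highly cited papers indicates that, besides the quality of the papers themselves, another important factor is national policy support and the availability of funding. In total, 1546 institutions in 80 countries conducted research, published across 580 journals. China, Iran, India, and Australia carried out a large amount of research, integrating multiple disciplines such as metallurgical engineering, environmental sciences and ecology, mining engineering, and biotechnology and applied microbiology. Cluster analysis identified four frequently occurring keywords: chalcopyrite, waste circuit boards, heavy metals, and leaching, which provide researchers with new search terms. [Conclusion] Current research on bioleaching focuses mainly on single strains, whereas the adsorption and tolerance mechanisms between mixed strains and minerals are the direction of future development.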
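Abstract: [Background] Cordyceps militaris, the type species of the genus Cordyceps in the family Cordycipitaceae, has long attracted the attention of researchers worldwide. [Objective] To examine the current status and future trends of C. militaris research from multiple dimensions. [Methods] SCI core collection papers on C. militaris published between 2005 and 2024 were comprehensively collected, organized, analyzed, and visualized based on the Web of Science Core Collection database. [Results] Over the past 20 years, C. militaris research has expanded from culture characteristics alone into interdisciplinary fields; in particular, its active components and pharmacological effects have become a focus of academic attention. Bibliometric analysis shows that in 2005-2009 the main research direction was the artificial cultivation of C. militaris. In 2010-2014, research topics expanded to pharmacology related to the fruiting body, and the standing of the research rose markedly. After 2015, topics diversified further, covering optimization, expression, oxidative stress, fungi, antioxidants, chemical constituents, NF-κB, and cell cycle arrest, showing a shift from cultivation techniques toward in-depth study of biological and medical mechanisms. [Conclusion] Research on C. militaris has undergone a profound transformation from traditional cultivation studies to multidisciplinary integration; future research will focus more on the functional mechanisms of active components, the pharmacological effects of bioactive substances, and potential medical applications, providing a scientific basis for the further research, development, and utilization of C. militaris.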