Fund: Supported by the National High Technology Research and Development Program of China (No. 2012AA011005)
Abstract: Topic models such as Latent Dirichlet Allocation (LDA) have been successfully applied to many text mining tasks for extracting topics embedded in corpora. However, existing topic models generally cannot discover bursty topics, which experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness by explicitly modeling each topic's burst states with a first-order Markov chain and using the chain to generate the topic proportions of documents in a logistic-normal fashion. A Gibbs sampling algorithm is developed for the posterior inference of the proposed model. Experimental results on a news data set show that our model can efficiently discover bursty topics, outperforming the state-of-the-art method.
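The generative step described in the abstract can be sketched as follows: a first-order Markov chain of binary burst states per topic, whose states shift the Gaussian logits of a logistic-normal draw of topic proportions. All parameter values (`p_stay`, `boost`, `sigma`) are illustrative choices, not values from the paper.

```python
import numpy as np

def simulate_burst_chain(n_topics, n_epochs, p_stay=0.8, seed=0):
    """First-order Markov chain of binary burst states, one chain per topic.
    p_stay is an illustrative self-transition probability."""
    rng = np.random.default_rng(seed)
    s = np.zeros((n_topics, n_epochs), dtype=int)
    for t in range(1, n_epochs):
        stay = rng.random(n_topics) < p_stay
        # Each topic either keeps its previous burst state or flips it.
        s[:, t] = np.where(stay, s[:, t - 1], 1 - s[:, t - 1])
    return s

def topic_proportions(burst_states, mu=0.0, boost=2.0, sigma=0.5, seed=1):
    """Logistic-normal draw of a document's topic proportions: Gaussian logits,
    shifted upward for topics currently in a burst state, then softmaxed."""
    rng = np.random.default_rng(seed)
    logits = mu + boost * burst_states + sigma * rng.standard_normal(len(burst_states))
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()
```

A bursty topic's logit is boosted, so its expected share of the document's topic proportions rises during its burst epochs.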
Abstract: Statistical language modeling techniques are investigated in order to construct a language model for Chinese text proofreading. After the defects of the n-gram model are analyzed, a novel statistical language model for Chinese text proofreading is proposed. The model takes full account of the information located before and after the target word w_i, and of the relationship between non-neighboring words w_i and w_j in the linguistic environment (LE). First, the word association degree between w_i and w_j is defined using a distance-weighted factor, where w_j is l words apart from w_i in the LE; then Bayes' formula is used to calculate the LE-related degree of the word w_i; and lastly, the LE-related degree is taken as the criterion for predicting the reasonability of the word w_i appearing in its context. Comparing the proposed model with the traditional n-gram model in a Chinese text automatic error detection system, the experimental results show that both the recall and precision of error detection are improved.
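The distance-weighted association step might be sketched roughly as follows. The abstract does not fully specify the weighting function or the Bayes-formula combination, so the 1/l weight and the simple neighborhood averaging used here are assumptions.

```python
from collections import Counter

def association_degrees(sentences, max_dist=4):
    """Co-occurrence counts weighted by 1/distance — a stand-in for the
    paper's distance-weighted word association degree."""
    assoc = Counter()
    for words in sentences:
        for i, wi in enumerate(words):
            for l in range(1, max_dist + 1):
                j = i + l
                if j < len(words):
                    # Pairs l words apart contribute weight 1/l.
                    assoc[(wi, words[j])] += 1.0 / l
    return assoc

def le_related_degree(target_idx, words, assoc, max_dist=4):
    """Average association between the target word and its neighbors in the
    linguistic environment; a low value flags a suspicious word."""
    wi = words[target_idx]
    scores = []
    for j in range(max(0, target_idx - max_dist),
                   min(len(words), target_idx + max_dist + 1)):
        if j == target_idx:
            continue
        pair = (words[j], wi) if j < target_idx else (wi, words[j])
        scores.append(assoc.get(pair, 0.0))
    return sum(scores) / len(scores) if scores else 0.0
```

A word whose LE-related degree falls below some threshold would be reported as a likely error, which is the detection criterion the abstract describes.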
Abstract: Modeling topics in short texts presents significant challenges due to feature sparsity, particularly when analyzing content generated by large-scale online users. This sparsity can substantially impair the accuracy of semantic capture. We propose a novel approach that incorporates pre-clustered knowledge into the BERTopic model while reducing the l2 norm of low-frequency words. Our method effectively mitigates feature sparsity during cluster mapping. Empirical evaluation on the StackOverflow dataset demonstrates that our approach outperforms baseline models, achieving superior Macro-F1 scores. These results validate the effectiveness of the proposed feature sparsity reduction technique for short-text topic modeling.
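The l2-norm reduction for low-frequency words might look like the following sketch. The frequency threshold and shrink factor are hypothetical, and the paper's integration with BERTopic's cluster mapping is not reproduced here.

```python
import numpy as np
from collections import Counter

def shrink_rare_word_vectors(vectors, words, corpus_tokens, min_count=5, factor=0.5):
    """Scale down the l2 norm of embedding rows for low-frequency words so
    they carry less weight during cluster mapping. min_count and factor are
    illustrative choices, not values from the paper."""
    freq = Counter(corpus_tokens)
    out = np.asarray(vectors, dtype=float).copy()
    for i, w in enumerate(words):
        if freq[w] < min_count:
            out[i] *= factor   # uniform scaling shrinks the row's l2 norm
    return out
```

Shrinking rare-word vectors keeps them from dominating distance computations during clustering, which is one plausible reading of how reducing the l2 norm mitigates sparsity.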
Abstract: Mining the rich semantic information hidden in heterogeneous information networks is one of the important tasks of data mining. Generally, a nuclear medicine text consists of a description of the disease (i.e., lesions) and the diagnostic results. However, constructing a computer-aided diagnostic model from a large number of medical texts is a challenging task. To automatically diagnose diseases from SPECT imaging, in this work we create a knowledge-based diagnostic model by exploring the association between a disease and its properties. First, an overview of nuclear medicine and data mining is presented. Second, a method for preprocessing textual nuclear medicine diagnostic reports is proposed. Last, diagnostic models based on random forest and SVM are built. Experimental evaluation conducted on real-world diagnostic reports of SPECT imaging demonstrates that our diagnostic models are workable and effective at automatically identifying diseases from textual diagnostic reports.
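A toy version of the report-preprocessing and feature-extraction steps might look like this. The section markers ("Findings:", "Impression:") and the property vocabulary are hypothetical, since the abstract does not specify the report format; the resulting feature vector is the kind of input a random forest or SVM classifier would consume.

```python
import re

def preprocess_report(text):
    """Split a diagnostic report into a lesion description and a conclusion.
    The section headings are hypothetical; real SPECT reports vary by site."""
    m = re.search(r"Findings:(.*?)Impression:(.*)", text, flags=re.S | re.I)
    if not m:
        return {"findings": text.strip(), "impression": ""}
    return {"findings": m.group(1).strip(), "impression": m.group(2).strip()}

def extract_features(findings, vocab):
    """Bag-of-words indicator features over a fixed property vocabulary."""
    tokens = set(re.findall(r"[a-z]+", findings.lower()))
    return [1 if w in tokens else 0 for w in vocab]
```

The impression section would supply the training label (the diagnosed disease), while the findings section supplies the disease-property features.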
Abstract: To address the lack of a systematic evaluation benchmark for domestic Chinese large language models (LLMs) in the field of geographic information science (GIS), a GIS-customized evaluation framework built on the Geo-Text-700 test set is constructed, and ten mainstream domestic models are evaluated across multiple dimensions using an AHP-weighted technique for order preference by similarity to ideal solution (TOPSIS). The results show a significant divergence across question types: objective questions average 68.4 points (standard deviation ±5.2), 21.7% lower than subjective questions (P<0.05); Doubao-pro-32k achieves the best overall score (87.3), with a clear advantage on objective questions (multiple choice 86, fill-in-the-blank 77); hunyuan-turbo shows potential on higher-order tasks among the subjective questions (short answer 88.1, programming 90.83); and domain knowledge blind spots are prominent, e.g., the error rate on GIS topology-rule questions reaches 43.6%.
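The TOPSIS ranking at the core of the multi-dimensional evaluation can be sketched as follows, assuming all criteria are benefit criteria (higher is better) and the weights are already given; the AHP weighting step is omitted.

```python
import numpy as np

def topsis(scores, weights):
    """TOPSIS: normalize the score matrix, apply criterion weights, and rank
    alternatives by relative closeness to the ideal solution."""
    x = np.asarray(scores, dtype=float)
    norm = x / np.sqrt((x ** 2).sum(axis=0))        # vector normalization per criterion
    v = norm * np.asarray(weights, dtype=float)     # weighted normalized matrix
    best, worst = v.max(axis=0), v.min(axis=0)      # ideal and anti-ideal solutions
    d_best = np.sqrt(((v - best) ** 2).sum(axis=1))
    d_worst = np.sqrt(((v - worst) ** 2).sum(axis=1))
    return d_worst / (d_best + d_worst)             # closeness coefficient in [0, 1]
```

Each model's scores on the question-type dimensions (multiple choice, fill-in-the-blank, short answer, programming, etc.) would form one row of the matrix, and the closeness coefficients yield the overall ranking.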
Abstract: In recent years, large language models (LLMs) have made remarkable progress in natural language processing (NLP) and related fields, demonstrating strong language understanding and generation abilities. In practical applications, however, LLMs still face many challenges, among which the hallucination problem has drawn wide attention from both academia and industry. Effectively detecting LLM hallucinations has become a key challenge for ensuring reliable, safe, and trustworthy use of LLMs in text generation and other downstream tasks. This study surveys hallucination detection methods for LLMs. First, it introduces the concept of LLMs, clarifies the definition and taxonomy of hallucination, systematically reviews the characteristics of each stage of the LLM life cycle from construction to deployment, and analyzes the mechanisms and causes of hallucination. Second, oriented toward practical application needs and accounting for differences in model transparency across task scenarios, it divides detection methods into two categories, for white-box and black-box models, and compares them in depth. It then summarizes the mainstream hallucination detection benchmarks, laying a foundation for subsequent work on hallucination detection. Finally, it points out potential research directions and new challenges in LLM hallucination detection.
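As one illustration of the black-box family of detection methods the survey covers, a sampling-consistency check can be sketched as follows: sample several answers to the same prompt and measure their mutual agreement, on the intuition that hallucinated claims vary across samples. The Jaccard token-overlap used here is a crude stand-in for the semantic-consistency measures used in actual methods.

```python
def consistency_score(answers):
    """Average pairwise Jaccard token overlap across sampled answers.
    Low agreement is a (rough) black-box hallucination signal."""
    sets = [set(a.lower().split()) for a in answers]
    pairs, total = 0, 0.0
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            inter = len(sets[i] & sets[j])
            union = len(sets[i] | sets[j]) or 1
            total += inter / union
            pairs += 1
    return total / pairs if pairs else 0.0
```

A black-box detector of this kind needs only repeated generations from the model, not access to logits or internal states, which is why the survey's white-box/black-box split hinges on model transparency.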