Funding: Funded by the Natural Science Foundation of Fujian Province, China, grant No. 2022J05291.
Abstract: Topic modeling is a fundamental technique of content analysis in natural language processing, widely applied in domains such as social sciences and finance. In the era of digital communication, social scientists increasingly rely on large-scale social media data to explore public discourse, collective behavior, and emerging social concerns. However, traditional models like Latent Dirichlet Allocation (LDA) and neural topic models like BERTopic struggle to capture deep semantic structures in short-text datasets, especially in complex non-English languages like Chinese. This paper presents Generative Language Model Topic (GLMTopic), a novel hybrid topic modeling framework that leverages large language models and is designed to support social science research by uncovering coherent and interpretable themes from Chinese social media platforms. GLMTopic integrates Adaptive Community-enhanced Graph Embedding for advanced semantic representation, Uniform Manifold Approximation and Projection-based (UMAP-based) dimensionality reduction, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) clustering, and large language model-powered (LLM-powered) representation tuning to generate more contextually relevant and interpretable topics. By reducing dependence on extensive text preprocessing and on human expert intervention in post-analysis topic labeling, GLMTopic facilitates a fully automated and user-friendly topic extraction process. Experimental evaluations on a social media dataset sourced from Weibo demonstrate that GLMTopic outperforms LDA and BERTopic in coherence score and in usability with automated interpretation, providing a more scalable and semantically accurate solution for Chinese topic modeling. Future research will explore optimizing computational efficiency, integrating knowledge graphs and sentiment analysis for more complex workflows, and extending the framework to real-time and multilingual topic modeling.
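The abstract compares models by coherence score but does not say which coherence measure was used. As an illustration only, the sketch below computes one common choice, document-level NPMI (normalized pointwise mutual information), averaged over all pairs of a topic's top words. The function name `npmi_coherence` and the toy data are assumptions made for this example, not part of GLMTopic.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    documents is a list of token lists; co-occurrence is counted per document.
    This is an illustrative measure, not necessarily the one used in the paper.
    """
    if len(top_words) < 2 or not documents:
        raise ValueError("need at least two top words and one document")

    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: lowest NPMI by convention
        elif p12 == 1:
            scores.append(1.0)   # both words in every document; avoids 0/0
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)
```

Scores range from -1 (top words never appear together) to 1 (top words always appear together), so a higher average suggests a more coherent topic under this particular measure.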
Funding: Supported by the National Natural Science Foundation of China [Grant Number: 92067106] and the Ministry of Education of the People’s Republic of China [Grant Number: E-GCCRC20200309].
Abstract: As the COVID-19 pandemic swept the globe, social media platforms became an essential source of information and communication for many. International students, in particular, turned to Twitter to express their struggles and hardships during this difficult time. To better understand the sentiments and experiences of these international students, we developed the Situational Aspect-Based Annotation and Classification (SABAC) text mining framework. This framework uses a three-layer approach, combining baseline Deep Learning (DL) models with Machine Learning (ML) models as meta-classifiers to predict the sentiments and aspects expressed in tweets from our collected Student-COVID-19 dataset. Using the proposed aspect2class annotation algorithm, we labeled bulk unlabeled tweets according to the aspect terms they contain. However, we also recognized that the data's high dimensionality and sparsity must be reduced to improve performance and annotation on unlabeled datasets. To address this issue, we proposed the Volatile Stopwords Filtering (VSF) technique to reduce sparsity and enhance classifier performance. On the resulting Student-COVID Twitter dataset, the framework achieved an accuracy of 93.21% when using random forest as the meta-classifier. Testing on three benchmark datasets showed that the SABAC ensemble framework performed exceptionally well. Our findings showed that international students during the pandemic faced various issues, including stress, uncertainty, health concerns, financial stress, and difficulties with online classes and returning to school. By analyzing and summarizing these annotated tweets, decision-makers can better understand and address the real-time problems international students face during the ongoing pandemic.
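The abstract says tweets were labeled according to the aspect terms they contain but gives no details of the aspect2class algorithm. The sketch below shows only the general idea of lexicon-based labeling. The lexicon, the category names, and the function `label_by_aspect` are hypothetical and should not be read as the authors' method.

```python
import re

# Hypothetical aspect lexicon. The categories echo issues named in the
# abstract, but the terms are illustrative, not the authors' lists.
ASPECT_TERMS = {
    "health": {"sick", "covid", "vaccine", "quarantine"},
    "finance": {"tuition", "rent", "fees", "job"},
    "online_classes": {"zoom", "online", "lecture", "timezone"},
}


def label_by_aspect(tweet, lexicon=ASPECT_TERMS):
    """Return the aspect whose terms occur most often in the tweet, or None.

    Ties go to the aspect listed first in the lexicon.
    """
    tokens = re.findall(r"[a-z']+", tweet.lower())
    counts = {
        aspect: sum(tok in terms for tok in tokens)
        for aspect, terms in lexicon.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

For example, `label_by_aspect("Cannot pay tuition and rent this month")` matches two finance terms and no others, so it returns `"finance"`; a tweet with no lexicon terms returns `None` and would stay unlabeled.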
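Abstract: Social networks are developing rapidly, and instant messaging systems have become an indispensable communication tool in daily life. Online group chats let people quickly exchange information about life, technology, and work, but because group chat messages are updated so quickly, the sheer volume of messages makes chat topics hard to follow. Traditional topic mining models are poorly suited to group chat text. Based on an analysis of the characteristics of group chat text, this paper proposes a GRU and LDA Based Group Chat Topic Mining (GLB-GCTM) model, which addresses the word-order problem that traditional topic models cannot handle. First, each document is assumed to have a topic vector drawn from a Gaussian distribution. The hidden state of each word is then generated following the GRU mechanism, and a Bernoulli distribution over the current word's hidden state determines whether the current word is a stop word, which in turn decides which language model is used. The method is validated experimentally on a dataset of the most recent three months of chat messages from 10 QQ groups that the authors had joined. Evaluated against the criteria of the comparative experiments, the model effectively identifies the topics in group chat text.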
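The abstract does not give the model's equations. One plausible reading of the generative process it describes, written in the style of recurrent topic models, is sketched below. All symbols are assumptions introduced here: $\theta$ is the document topic vector, $x_t$ and $h_t$ are the input and GRU hidden state at position $t$, $l_t$ is the stop-word indicator, and $\gamma$, $v_i$, $b_i$ are hypothetical parameters. The last line is one way the stop-word indicator could decide which language model is used, with topic information switched off for stop words.

```latex
\begin{align}
\theta &\sim \mathcal{N}(0, I) \\
h_t &= \mathrm{GRU}(x_t, h_{t-1}) \\
l_t &\sim \mathrm{Bernoulli}\big(\sigma(\gamma^{\top} h_t)\big) \\
p(y_t = i \mid h_t, \theta, l_t) &\propto \exp\big(v_i^{\top} h_t + (1 - l_t)\, b_i^{\top} \theta\big)
\end{align}
```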
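Abstract: [Purpose/Significance] Against the background of rapid development and profound change in artificial intelligence technologies and applications, new research topics and methods keep emerging in machine learning, and deep learning and reinforcement learning continue to advance. It is therefore necessary to explore how machine learning research topics evolve in different fields and to identify hot and emerging topics. [Method/Process] Taking machine learning research papers in library and information science indexed in the Web of Science database from 2011 to 2022 as an example, this paper combines LDA and Word2vec for topic modeling and topic evolution analysis, and introduces topic intensity, topic influence, topic attention, and topic novelty indicators to identify hot topics and emerging hot topics. [Results/Conclusions] The results show that (1) combining the semantic processing capability of Word2vec with the topic evolution capability of LDA identifies research topics more accurately and presents their stage-by-stage evolution intuitively; (2) machine learning research topics in library and information science fall into three main categories: natural language processing and text analysis, data mining and analysis, and information and knowledge services. The topics in these categories are strongly related and evolve in association with one another; (3) the proposed topic intensity, topic influence, and topic attention indicators, together with the composite indicator, identify hot topics well across the three periods 2011-2014, 2015-2018, and 2019-2022.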
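The abstract names a topic intensity indicator without defining it. A definition often used in topic evolution studies, assumed here purely for illustration, is the mean document-topic probability of a topic over the documents published in a period. The function name and data layout below are hypothetical.

```python
def topic_intensity_by_period(doc_topics, doc_years, periods):
    """Mean document-topic probability per topic within each period.

    doc_topics: per-document topic distributions (e.g. from LDA), one list each.
    doc_years:  publication year of each document, in the same order.
    periods:    list of (start_year, end_year) tuples, both ends inclusive.
    Returns a dict mapping each period to a list of intensities, or to None
    when no document falls in that period.
    """
    result = {}
    for start, end in periods:
        rows = [
            theta
            for theta, year in zip(doc_topics, doc_years)
            if start <= year <= end
        ]
        if not rows:
            result[(start, end)] = None
            continue
        n_topics = len(rows[0])
        result[(start, end)] = [
            sum(row[k] for row in rows) / len(rows) for k in range(n_topics)
        ]
    return result
```

Calling it with the three periods from the abstract, `[(2011, 2014), (2015, 2018), (2019, 2022)]`, gives one intensity vector per period; a topic whose value rises from one period to the next would be a candidate hot topic under this assumed definition.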