Topic modeling is a well-explored and foundational challenge in the text mining domain. Traditional topic schemes based on word co-occurrences aim to expose the latent semantic structure embedded in a document corpus. Nevertheless, the inherent brevity of short texts introduces data sparsity, hindering the effectiveness of conventional topic models and yielding suboptimal outcomes for such text. Typically, short texts encompass a restricted number of topics, so relevant background knowledge is needed for a comprehensive understanding of their semantic content. Motivated by these observations, this research introduces a novel Deep Autoencoder Graph Regularized Non-negative Matrix Factorization algorithm (DAGR-NMF) to uncover significant and meaningful topics within short documents. The three main phases of the proposed work are preprocessing, feature extraction and topic modeling. Initially, the data are preprocessed using natural language preprocessing tasks such as stop word removal, stemming and lemmatizing. Then, feature extraction is performed using hybrid Absolute Deviation Factors-Class Term Frequency (ADF-CTF) to capture the most relevant information from the text. Finally, the topic modeling task is executed using the proposed DAGR-NMF approach. Experimental findings demonstrate that the introduced DAGR-NMF model outperforms all other compared techniques, achieving NMI values of 0.852, 0.857, 0.793, and 0.831 on the Associated Press, political blog, 20NewsGroups, and News Category datasets, respectively.
Purpose: This study aims to provide a comprehensive analysis of two-sided matching (TSM) research, an interdisciplinary field that integrates both theoretical and practical perspectives. By examining 756 research articles from the Web of Science database, this paper seeks to identify key trends, collaboration patterns and emerging research topics within the TSM domain. Design/methodology/approach: The research utilizes bibliometric analysis combined with a structural topic model to analyze TSM-related articles published between January 1, 2000, and September 30, 2022. The study identifies leading subfields, journals, countries/regions and institutions based on publication volume, total citations and average citations per article. Interaction and collaboration patterns among these entities are examined through co-occurrence and coupling networks. Additionally, five major research topics are identified and explored using topic modeling and co-word networks. This hybrid knowledge mining approach better reveals the inherent structural changes in topic clusters. Topic distribution and network analysis help capture how different entities allocate attention to knowledge. Findings: The analysis reveals five prominent research topics in TSM: communication resource allocation, stable matching research, computing task assignment, TSM decision-making and market matching mechanism design. These topics represent the main directions of TSM research. The study also uncovers a shift in research focus from theoretical aspects to practical applications. Furthermore, the distribution of knowledge and interaction patterns among key entities align with the identified research trends. Originality/value: This study offers a novel and detailed overview of TSM research, highlighting significant trends and collaboration patterns within the field. By integrating bibliometric methods with structural topic modeling, the study provides unique insights into the evolution of TSM research, making it a valuable resource for both academic and professional communities.
This paper develops a novel online algorithm, namely moving average stochastic variational inference (MASVI), which applies the results obtained by previous iterations to smooth out noisy natural gradients. We analyze the convergence property of the proposed algorithm and conduct a set of experiments on two large-scale collections that contain millions of documents. Experimental results indicate that, compared with the algorithms named 'stochastic variational inference' and 'SGRLD', our algorithm achieves a faster convergence rate and better performance.
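The idea of reusing earlier iterations to smooth a noisy gradient can be illustrated with a plain moving average. The sketch below is only a generic illustration and is not the MASVI update from the paper: the class name `MovingAverageGradient`, the `window` parameter and the toy gradient values are all assumptions made for this example.

```python
from collections import deque


class MovingAverageGradient:
    """Hypothetical helper: average the most recent noisy gradient estimates."""

    def __init__(self, window=3):
        # Keep only the last `window` gradients; older ones are dropped.
        self.history = deque(maxlen=window)

    def update(self, noisy_grad):
        """Store the new estimate and return the element-wise mean of the window."""
        self.history.append(list(noisy_grad))
        n = len(self.history)
        dim = len(noisy_grad)
        return [sum(g[i] for g in self.history) / n for i in range(dim)]


# Toy usage with made-up two-dimensional gradients and a window of two.
smoother = MovingAverageGradient(window=2)
first = smoother.update([1.0, 3.0])
second = smoother.update([3.0, 5.0])
third = smoother.update([5.0, 1.0])
```

A short window follows the newest estimate closely, while a longer window suppresses more noise at the cost of reacting more slowly.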
Objective: To analyze Chinese medicine (CM) prescriptions for gastroesophageal reflux disease (GERD), we model topics in GERD-related classical CM literature, providing insights into potential treatments. Methods: Clinical guidelines were used to identify symptom terms for GERD, and CM literature from the database "Imedbooks" was searched for related prescriptions and their corresponding sources, indications, and other information. BERTopic was applied to identify the main topics and visualize the data. Results: A total of 36,207 entries were retrieved and 1,938 valid entries were retained after manual filtering. Eight topics were identified by BERTopic, including digestion function abate, stomach flu, respiratory-related symptoms, gastric dysfunction, regurgitation and gastrointestinal dysfunction in pediatric patients, vomiting, stroke and alcohol accumulation are associated with the risk of GERD, vomiting and its causes, regurgitation, epigastric pain, and symptoms of heartburn. Conclusions: Topic modeling provides an unbiased analysis of classical CM literature on GERD in a time-efficient and scale-efficient manner. Based on this analysis, we present a range of treatment options for relieving symptoms, including herbal remedies and non-pharmacological interventions such as acupuncture and dietary therapy.
The sudden arrival of artificial intelligence (AI) in people's daily lives around the world was marked by the introduction of ChatGPT, which was officially released on November 30, 2022. This AI invasion of our lives drew the attention of not only tech enthusiasts but also scholars from diverse fields, as its capabilities extend across many domains. Consequently, numerous articles and journals have discussed ChatGPT, making it a headline for several topics. However, this coverage does not reflect most public opinion about the product. Therefore, this paper investigates the public's opinions on ChatGPT through topic modelling, VADER-based sentiment analysis and SWOT analysis. To gather data for this study, 202,905 comments were collected from the Reddit platform between December 2022 and December 2023. The findings reveal that the Reddit community engaged in discussions related to ChatGPT covering a range of topics, including comparisons with traditional search engines; the impacts on software development, the job market, and the education industry; ChatGPT's responses on entertainment and politics; the responses from Dan, the alter ego of ChatGPT; the ethical usage of user data; and queries related to AI-generated images. The sentiment analysis indicates that most people hold positive views of this innovative technology across these aspects. However, concerns also arise regarding the potential negative impacts associated with the product. The SWOT analysis of these results highlights the strengths and pain points, as well as the market opportunities and threats, associated with ChatGPT. This analysis also serves as a foundation for the recommendations on product development and policy implementation provided in this paper.
User-generated content (UGC) such as blogs and tweets is exploding in modern Internet services. In such systems, recommender systems are needed to help people filter the vast amount of UGC generated by other users. However, traditional recommendation models do not use user authorship of items. In this paper, we show that with this additional information, we can significantly improve the performance of recommendations. A generative model that combines hierarchical topic modeling and matrix factorization is proposed. Empirical results show that our model outperforms other state-of-the-art models and can provide interpretable topic structures for users and items. Furthermore, since user interests can be inferred from the content they produce, recommendations can be made for users who do not have any ratings, which addresses the cold-start problem.
The health care system encompasses the participation of individuals, groups, agencies, and resources that offer services to address the health requirements of the person, community, and population. In parallel with the growing debates on healthcare systems in relation to diseases, treatments, interventions, medication, and clinical practice guidelines, the world is currently discussing the healthcare industry, technology perspectives, and healthcare costs. To gain a comprehensive understanding of the healthcare systems research paradigm, we propose a novel contextual topic modeling approach that links the CombinedTM model with our healthcare BERT to discover contextual topics in the healthcare domain. This research work discovered 60 contextual topics. Fifteen of them are the hottest topics, which include smart medical monitoring systems, causes and effects of stress and anxiety, and healthcare cost estimation, and twelve topics are the coldest. Moreover, thirty-three topics show insignificant trends. We further investigated various clusters and correlations among the topics by exploring inter-topic distance maps, which add depth to the understanding of the research structure of this scientific domain. The current study enhances prior topic modeling methodologies that examine the healthcare literature from a particular disciplinary perspective. It further extends existing topic modeling approaches that do not incorporate contextual information in the topic discovery process, adding contextual information by creating sentence embedding vectors through transformer-based models. We also utilized corpus tuning, the mean pooling technique, and the Hugging Face tool. Our method gives a higher coherence score than the state-of-the-art models (LSA, LDA, and BERTopic).
Many existing warning prioritization techniques seek to reorder static analysis warnings such that true positives are presented first. However, an excessive amount of time is still required to investigate and fix prioritized warnings because some are not actually true positives or are irrelevant to the code context and topic. In this paper, we propose a warning prioritization technique that reflects various latent topics from bug-related code blocks. Our main aim is to build a prioritization model that comprises separate warning priorities depending on the topic of the change sets to identify the number of true positive warnings. For the performance evaluation of the proposed model, we employ a performance metric called warning detection rate, widely used in many warning prioritization studies, and compare the proposed model with other competitive techniques. Additionally, the effectiveness of our model is verified by applying our technique to eight industrial projects of a real global company.
Topic modeling is a fundamental technique of content analysis in natural language processing, widely applied in domains such as social sciences and finance. In the era of digital communication, social scientists increasingly rely on large-scale social media data to explore public discourse, collective behavior, and emerging social concerns. However, traditional models like Latent Dirichlet Allocation (LDA) and neural topic models like BERTopic struggle to capture deep semantic structures in short-text datasets, especially in complex non-English languages like Chinese. This paper presents Generative Language Model Topic (GLMTopic), a novel hybrid topic modeling framework leveraging the capabilities of large language models, designed to support social science research by uncovering coherent and interpretable themes from Chinese social media platforms. GLMTopic integrates Adaptive Community-enhanced Graph Embedding for advanced semantic representation, Uniform Manifold Approximation and Projection-based (UMAP-based) dimensionality reduction, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) clustering, and large language model-powered (LLM-powered) representation tuning to generate more contextually relevant and interpretable topics. By reducing dependence on extensive text preprocessing and human expert intervention in post-analysis topic label annotation, GLMTopic facilitates a fully automated and user-friendly topic extraction process. Experimental evaluations on a social media dataset sourced from Weibo demonstrate that GLMTopic outperforms LDA and BERTopic in coherence score and usability with automated interpretation, providing a more scalable and semantically accurate solution for Chinese topic modeling. Future research will explore optimizing computational efficiency, integrating knowledge graphs and sentiment analysis for more complicated workflows, and extending the framework for real-time and multilingual topic modeling.
Emerging topics in app reviews highlight the topics (e.g., software bugs) with which users are concerned during certain periods. Identifying emerging topics accurately and in a timely manner could help developers update apps more effectively. Methods for identifying emerging topics in app reviews based on topic models or clustering methods have been proposed in the literature. However, the accuracy of emerging topic identification is reduced because reviews are short and offer limited information. To solve this problem, an improved emerging topic identification (IETI) approach is proposed in this work. Specifically, we adopt natural language processing techniques to reduce noisy data, and identify emerging topics in app reviews using the adaptive online biterm topic model. Then we interpret the implicature of emerging topics through relevant phrases and sentences. We adopt the official app changelogs as ground truth, and evaluate IETI on six common apps. The experimental results indicate that IETI is more accurate than the baseline in identifying emerging topics, with improvements in the F1 score of 0.126 for phrase labels and 0.061 for sentence labels. Finally, we release the code of IETI on GitHub (https://github.com/wanizhou/IETI).
Environmental, social, and governance (ESG) factors are critical in achieving sustainability in business management and are used as values aiming to enhance corporate value. Recently, non-financial indicators have been considered important for the actual valuation of corporations; thus, analyzing natural language data related to ESG is essential. Several previous studies limited their focus to specific countries or did not use big data. Past methodologies are insufficient for obtaining potential insights into the best practices to leverage ESG. To address this problem, in this study, the authors used data from two platforms: LexisNexis, a platform that provides media monitoring, and Web of Science, a platform that provides scientific papers. These big data were analyzed by topic modeling, which can derive hidden semantic structures within text. Through this process, it is possible to collect information on public and academic sentiment. The authors explored the data from a text-mining perspective using bidirectional encoder representations from transformers topic (BERTopic), a state-of-the-art topic-modeling technique. In addition, changes in subject patterns over time were considered using dynamic topic modeling. The results show that concepts proposed by international organizations such as the United Nations (UN) have been discussed in academia, and that the media have formed a variety of agendas.
Most research on anomaly detection has focused on events that differ from their spatio-temporal neighboring events. It is still a significant challenge to detect anomalies that involve multiple normal events interacting in an unusual pattern. In this work, a novel unsupervised method based on a sparse topic model was proposed to capture motion patterns and detect anomalies in traffic surveillance. Scale-invariant feature transform (SIFT) flow was used to improve the dense trajectory in order to extract interest points and the corresponding descriptors with less interference. To strengthen the relationship between interest points on the same trajectory, the Fisher kernel method was applied to obtain a representation of each trajectory, which was then quantized into a visual word. Then the sparse topic model was proposed to explore the latent motion patterns and achieve a sparse representation of the video scene. Finally, two anomaly detection algorithms were compared, based on video clip detection and visual word analysis, respectively. Experiments were conducted on the QMUL Junction dataset and the AVSS dataset. The results demonstrated the superior efficiency of the proposed method.
Topic models such as Latent Dirichlet Allocation (LDA) have been successfully applied to many text mining tasks for extracting topics embedded in corpora. However, existing topic models generally cannot discover bursty topics that experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness by explicitly modeling each topic's burst states with a first-order Markov chain and using the chain to generate the topic proportions of documents in a logistic normal fashion. A Gibbs sampling algorithm is developed for the posterior inference of the proposed model. Experimental results on a news data set show that our model can efficiently discover bursty topics, outperforming the state-of-the-art method.
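To make the notion of a first-order Markov chain over burst states concrete, the following minimal sketch samples a binary state sequence (0 = normal, 1 = bursty) for a single topic. It is not the Burst-LDA generative process or its Gibbs sampler: the function name `sample_burst_states` and the transition probabilities `p_enter` and `p_stay` are hypothetical values chosen only for illustration.

```python
import random


def sample_burst_states(n_steps, p_enter=0.1, p_stay=0.8, seed=0):
    """Sample a 0/1 burst-state sequence from a two-state first-order Markov chain.

    p_enter: assumed probability of moving from the normal state to the burst state.
    p_stay:  assumed probability of remaining in the burst state.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    states = []
    state = 0  # assume the topic starts in the normal state
    for _ in range(n_steps):
        # The next state depends only on the current state (first-order property).
        p_burst = p_stay if state == 1 else p_enter
        state = 1 if rng.random() < p_burst else 0
        states.append(state)
    return states


states = sample_burst_states(20)
```

With a high `p_stay`, bursts tend to persist over several consecutive time steps once they begin, which is the qualitative behaviour a burst-state chain is meant to capture.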
Purpose: Research dynamics have long been a research interest. They offer a macro-perspective tool for discovering temporal research trends in a certain discipline or subject. A micro perspective of research dynamics, however, concerning a single researcher or a highly cited paper in terms of their citations and “citations of citations” (forward chaining), remains unexplored. Design/methodology/approach: In this paper, we use a cross-collection topic model to reveal the research dynamics of topic disappearance, topic inheritance, and topic innovation in each generation of forward chaining. Findings: For highly cited work, scientific influence exists in indirect citations. Topic modeling can reveal how long this influence persists in forward chaining, as well as its character. Research limitations: This paper measures scientific influence and indirect scientific influence only if the relevant words or phrases are borrowed or used in direct or indirect citations. Paraphrasing or semantically similar concepts may be neglected in this research. Practical implications: This paper demonstrates, through its analysis of forward chaining, that scientific influence exists in indirect citations. This can serve as an inspiration for how to adequately evaluate research influence. Originality: The main contributions of this paper are threefold. First, besides the research dynamics of topic inheritance and topic innovation, we model topic disappearance by using a cross-collection topic model. Second, we explore the length and character of the research impact through “citations of citations” content analysis. Finally, we analyze the research dynamics of artificial intelligence researcher Geoffrey Hinton’s publications and the topic dynamics of forward chaining.
User interest is not static and changes dynamically. In the scenario of a search engine, this paper presents a personalized adaptive user interest prediction framework. It represents user interest as a topic distribution, captures every change of user interest in the history, and uses the changes to predict future individual user interest dynamically. More specifically, it first uses a personalized user interest representation model to infer user interest from queries in the user's history data using a topic model; then it presents a personalized user interest prediction model to capture the dynamic changes of user interest and to predict future user interest by leveraging the query submission time in the history data. Compared with the Interest Degree Multi-Stage Quantization Model, experimental results on an AOL search query log show that our framework is more stable and effective in user interest prediction.
Funding: Project (No. 60773180) supported by the National Natural Science Foundation of China.
Abstract: This paper deals with the statistical modeling of latent topic hierarchies in text corpora. The height of the topic tree is assumed to be fixed, while the number of topics on each level is unknown a priori and is inferred from data. Taking a nonparametric Bayesian approach to this problem, we propose a new probabilistic generative model based on the nested hierarchical Dirichlet process (nHDP) and present a Markov chain Monte Carlo sampling algorithm for inferring the topic tree structure as well as the word distribution of each topic and the topic distribution of each document. Our theoretical analysis and experimental results show that this model produces a more compact hierarchical topic structure and captures more fine-grained topic relationships than the hierarchical latent Dirichlet allocation model.
Funding: Funded by the Swiss National Science Foundation Project PlaceGen [grant number 200021_149823].
Abstract: User-Generated Content (UGC) provides a potential data source which can help us to better describe and understand how places are conceptualized, and in turn better represent places in Geographic Information Science (GIScience). In this article, we aim at aggregating the shared meanings associated with places and linking these to a conceptual model of place. Our focus is on the metadata of Flickr images, in the form of locations and tags. We use topic modeling to identify regions associated with shared meanings. We choose a grid approach and generate topics associated with one or more cells using Latent Dirichlet Allocation. We analyze the sensitivity of our results to both grid resolution and the chosen number of topics using a range of measures, including corpus distance and the coherence value. Using a resolution of 500 m and 40 topics, we are able to generate meaningful topics which characterize places in London based on 954 unique tags associated with around 300,000 images and more than 7000 individuals.
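The grid approach can be pictured as a preprocessing step in which the tags of all images falling in the same cell are pooled into one "document" for the topic model. The sketch below shows only that pooling step and is not the authors' pipeline: the function `tags_by_cell`, the assumption that coordinates are already projected to metres, and the sample photos are all hypothetical.

```python
from collections import defaultdict


def tags_by_cell(photos, cell_size_m=500.0):
    """Pool photo tags into square grid cells; each cell acts as one document.

    photos: iterable of (x_m, y_m, tags), with coordinates assumed to be in a
            projected system measured in metres (an assumption of this sketch).
    """
    cells = defaultdict(list)
    for x_m, y_m, tags in photos:
        # Floor division gives the integer column/row index of the cell.
        key = (int(x_m // cell_size_m), int(y_m // cell_size_m))
        cells[key].extend(tags)
    return dict(cells)


# Made-up sample data: two photos fall in cell (0, 0), one in cell (1, 0).
photos = [
    (120.0, 480.0, ["thames", "bridge"]),
    (450.0, 10.0, ["bridge"]),
    (510.0, 20.0, ["market"]),
]
docs = tags_by_cell(photos)
```

Changing `cell_size_m` is the knob that corresponds to the grid resolution whose sensitivity the abstract discusses; the pooled tag lists would then be passed to a topic model such as LDA.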
Abstract: Topic modeling is a probabilistic technique that identifies the topics covered in a text or set of texts. In this paper, topics were obtained from two implementations of topic modeling, namely Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA). The analysis was performed on a corpus of 1000 academic papers written in English, obtained from the PLOS ONE website, in the areas of Biology, Medicine, Physics and Social Sciences. The objective is to verify whether the four academic fields are represented in the four topics obtained by topic modeling. The four topics obtained from LSI and LDA did not represent the four academic fields.
Abstract: Modeling topics in short texts presents significant challenges due to feature sparsity, particularly when analyzing content generated by large-scale online users. This sparsity can substantially impair the accuracy of semantic capture. We propose a novel approach that incorporates pre-clustered knowledge into the BERTopic model while reducing the l2 norm for low-frequency words. Our method effectively mitigates feature sparsity during cluster mapping. Empirical evaluation on the StackOverflow dataset demonstrates that our approach outperforms baseline models, achieving superior Macro-F1 scores. These results validate the effectiveness of the proposed feature sparsity reduction technique for short-text topic modeling.
Funding: Supported by the National Statistical Science Research Project of China (No. 2024LY021).
Abstract: Purpose: This study aims to reveal the technological development trajectories within the fuzzy field, addressing the long-standing gap between the extensive application of fuzzy technologies and the lack of a comprehensive, data-driven understanding of their developmental logic and evolution. Design/methodology/approach: An integrated methodological framework combining structural topic modeling (STM) and main path analysis (MPA) is developed. A total of 43,905 patents related to fuzzy technologies were collected from the Derwent Innovation Index. STM is applied to identify 12 representative topics within the fuzzy technology field, followed by the construction of topic-specific citation networks. MPA is then used to extract the core development paths across these topics, capturing structural dynamics and knowledge diffusion. This integration enables a multidimensional exploration of topic structures and technology trajectories, providing a frame of reference for analyzing other emerging technologies. Findings: The combined STM-MPA approach effectively identifies the classification and developmental trajectories of fuzzy-related technologies. Results highlight topic-specific knowledge flows and inter-topic linkages, offering new insights into the internal evolution and external integration of fuzzy technologies. The study demonstrates how different subfields of fuzzy technologies have progressed and interacted over time. Originality/value: This study is among the first to systematically explore the development of fuzzy technologies using large-scale patent data and a hybrid analytical framework. By integrating topic modeling with citation analysis, it captures both topic patterns and development paths. The approach enhances existing methods in technology analysis and offers new insights for innovation research, policy design and enterprise strategy.
Abstract: Topic modeling is a well-explored and foundational challenge in the text mining domain. Traditional topic schemes, based on word co-occurrences, aim to expose the latent semantic structure embedded in a document corpus. However, the inherent brevity of short texts introduces data sparsity, which hinders conventional topic models and yields suboptimal outcomes for such text. Short texts typically cover a restricted number of topics, so relevant background knowledge is needed for a comprehensive understanding of their semantic content. Motivated by these observations, this research introduces a novel Deep Autoencoder Graph Regularized Non-negative Matrix Factorization algorithm (DAGR-NMF) to uncover significant and meaningful topics within short documents. The three main phases of the proposed work are preprocessing, feature extraction and topic modeling. First, the data are preprocessed using natural language preprocessing tasks such as stop word removal, stemming and lemmatizing. Then, feature extraction is performed using hybrid Absolute Deviation Factors-Class Term Frequency (ADF-CTF) to capture the most relevant information from the text. Finally, the topic modeling task is executed using the proposed DAGR-NMF approach. Experimental findings demonstrate that the DAGR-NMF model outperforms all other compared techniques, achieving NMI values of 0.852, 0.857, 0.793, and 0.831 on the Associated Press, political blog, 20NewsGroups, and News Category datasets, respectively.
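For readers unfamiliar with the NMI scores quoted above, the following is a minimal sketch of normalized mutual information between a reference labeling and a predicted clustering. It normalizes by the arithmetic mean of the two entropies, which is one common convention; the abstract does not say which normalization the authors used, so treat that choice (and the function names) as an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, pred_labels):
    """Normalized mutual information, arithmetic-mean normalization (assumed)."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    count_true = Counter(true_labels)
    count_pred = Counter(pred_labels)
    # Mutual information summed over the observed (true, predicted) label pairs.
    mi = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(true_labels) + entropy(pred_labels)) / 2
    # Both labelings consist of a single cluster: treat them as identical.
    return mi / denom if denom > 0 else 1.0


perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # same partition, renamed labels
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])    # statistically independent labels
```

NMI is invariant to how cluster labels are named, which is why it suits topic or cluster assignments whose indices are arbitrary.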
Funding: Supported by the Social Science Foundation Project of Jiangsu Province, China (20GLC010) and the National Statistical Science Research Project (2024LY021).
Abstract: Purpose: This study aims to provide a comprehensive analysis of two-sided matching (TSM) research, an interdisciplinary field that integrates both theoretical and practical perspectives. By examining 756 research articles from the Web of Science database, this paper seeks to identify key trends, collaboration patterns and emerging research topics within the TSM domain. Design/methodology/approach: The research utilizes bibliometric analysis combined with a structural topic model to analyze TSM-related articles published between January 1, 2000, and September 30, 2022. The study identifies leading subfields, journals, countries/regions and institutions based on publication volume, total citations and average citations per article. Interaction and collaboration patterns among these entities are examined through co-occurrence and coupling networks. Additionally, five major research topics are identified and explored using topic modeling and co-word networks. This hybrid knowledge mining approach better reveals the inherent structural changes in topic clusters. Topic distribution and network analysis are beneficial in capturing the attention allocation of different entities to knowledge. Findings: The analysis reveals five prominent research topics in TSM: communication resource allocation, stable matching research, computing task assignment, TSM decision-making and market matching mechanism design. These topics represent the main directions of TSM research. The study also uncovers a shift in research focus from theoretical aspects to practical applications. Furthermore, the distribution of knowledge and interaction patterns among key entities align with the identified research trends. Originality/value: This study offers a novel and detailed overview of TSM research, highlighting significant trends and collaboration patterns within the field. By integrating bibliometric methods with structural topic modeling, the study provides unique insights into the evolution of TSM research, making it a valuable resource for both academic and professional communities.
Funding: Project supported by the National Natural Science Foundation of China (Nos. 61170092, 61133011, and 61103091).
Abstract: This paper develops a novel online algorithm, namely moving average stochastic variational inference (MASVI), which applies the results obtained in previous iterations to smooth out noisy natural gradients. We analyze the convergence property of the proposed algorithm and conduct a set of experiments on two large-scale collections that contain millions of documents. Experimental results indicate that, compared with the algorithms named 'stochastic variational inference' and 'SGRLD', our algorithm achieves a faster convergence rate and better performance.
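The general idea of smoothing noisy gradients with results from earlier iterations can be illustrated on a toy problem. The sketch below averages the last few noisy gradients before each update of a scalar parameter. It is not MASVI itself: the window size, step size, quadratic objective, and function names are all assumptions, and variational inference is replaced by plain gradient descent to keep the example self-contained.

```python
import random
from collections import deque


def smoothed_descent(noisy_grad, x0, steps=200, window=10, step_size=0.1):
    """Gradient descent on a scalar, using the mean of the last `window`
    noisy gradients as the update direction (a simplified, assumed scheme)."""
    recent = deque(maxlen=window)  # the oldest gradient is dropped automatically
    x = x0
    for _ in range(steps):
        recent.append(noisy_grad(x))
        x -= step_size * (sum(recent) / len(recent))
    return x


rng = random.Random(0)


def noisy_grad(x):
    """True gradient of f(x) = (x - 3)^2 / 2 plus unit-variance Gaussian noise."""
    return (x - 3.0) + rng.gauss(0.0, 1.0)


estimate = smoothed_descent(noisy_grad, x0=0.0)
```

Averaging over a window trades a little staleness in the gradient for lower variance in each step, which is the trade-off a moving-average scheme exploits.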
Funding: Supported by Shanghai Municipal Administrator of Traditional Chinese Medicine [No. ZY(2021-2023)-0301-01] and the National Natural Science Foundation of China (No. 82074366).
Abstract: Objective: To analyze Chinese medicine (CM) prescriptions for gastroesophageal reflux disease (GERD), we model topics in GERD-related classical CM literature, providing insights into potential treatments. Methods: Clinical guidelines were used to identify symptom terms for GERD, and CM literature from the database "Imedbooks" was retrieved for related prescriptions and their corresponding sources, indications, and other information. BERTopic was applied to identify the main topics and visualize the data. Results: A total of 36,207 entries were queried and 1,938 valid entries were acquired after manual filtering. Eight topics were identified by BERTopic, including digestion function abate, stomach flu, respiratory-related symptoms, gastric dysfunction, regurgitation and gastrointestinal dysfunction in pediatric patients, vomiting, stroke and alcohol accumulation are associated with the risk of GERD, vomiting and its causes, regurgitation, epigastric pain, and symptoms of heartburn. Conclusions: Topic modeling provides an unbiased analysis of classical CM literature on GERD in a time-efficient and scale-efficient manner. Based on this analysis, we present a range of treatment options for relieving symptoms, including herbal remedies and non-pharmacological interventions such as acupuncture and dietary therapy.
Abstract: The sudden arrival of AI (Artificial Intelligence) in people's daily lives all around the world was marked by the introduction of ChatGPT, which was officially released on November 30, 2022. This AI invasion of our lives drew the attention of not only tech enthusiasts but also scholars from diverse fields, as its capacity extends across various fields. Consequently, numerous articles and journals have been discussing ChatGPT, making it a headline for several topics. However, this coverage does not reflect most public opinion about the product. Therefore, this paper investigated the public's opinions on ChatGPT through topic modelling, VADER-based sentiment analysis and SWOT analysis. To gather data for this study, 202,905 comments were collected from the Reddit platform between December 2022 and December 2023. The findings reveal that the Reddit community engaged in discussions related to ChatGPT covering a range of topics, including comparisons with traditional search engines; the impacts on software development, the job market, and the education industry; ChatGPT's responses on entertainment and politics; the responses from Dan, the alter ego of ChatGPT; the ethical usage of user data; and queries related to AI-generated images. The sentiment analysis indicates that most people hold positive views towards this innovative technology across these aspects. However, concerns also arise regarding the potential negative impacts associated with this product. The SWOT analysis of these results highlights the strengths and pain points, as well as the market opportunities and threats, associated with ChatGPT. This analysis also serves as a foundation for the recommendations on product development and policy implementation given in this paper.
Funding: Project supported by the Monitoring Statistics Project on Agricultural and Rural Resources, MOA, China; the Innovative Talents Project, MOA, China; and the Science and Technology Innovation Project Fund of Chinese Academy of Agricultural Sciences (No. CAAS-ASTIP-2015-AI I-02).
Abstract: User-generated content (UGC) such as blogs and tweets is exploding in modern Internet services. In such systems, recommender systems are needed to help people filter the vast amount of UGC generated by other users. However, traditional recommendation models do not use user authorship of items. In this paper, we show that with this additional information, we can significantly improve the performance of recommendations. A generative model that combines hierarchical topic modeling and matrix factorization is proposed. Empirical results show that our model outperforms other state-of-the-art models and can provide interpretable topic structures for users and items. Furthermore, since user interests can be inferred from what users produce, recommendations can be made for users who do not have any ratings, which addresses the cold-start problem.
Abstract: The health care system encompasses the participation of individuals, groups, agencies, and resources that offer services to address the health requirements of the person, community, and population. In parallel with the rising debates on healthcare systems in relation to diseases, treatments, interventions, medication, and clinical practice guidelines, the world is currently discussing the healthcare industry, technology perspectives, and healthcare costs. To gain a comprehensive understanding of the healthcare systems research paradigm, we propose a novel contextual topic modeling approach that links the CombinedTM model with our healthcare BERT to discover contextual topics in the domain of healthcare. This work discovered 60 contextual topics. Fifteen of them are the hottest, including smart medical monitoring systems, causes and effects of stress and anxiety, and healthcare cost estimation; twelve are the coldest; and the remaining thirty-three show insignificant trends. We further investigated various clusters and correlations among the topics by exploring inter-topic distance maps, which adds depth to the understanding of the research structure of this scientific domain. The current study enhances prior topic modeling methodologies that examine the healthcare literature from a particular disciplinary perspective. It further extends existing topic modeling approaches that do not incorporate contextual information in the topic discovery process, adding contextual information by creating sentence embedding vectors through transformer-based models. We also utilized corpus tuning, the mean pooling technique, and the Hugging Face tool. Our method gives a higher coherence score than the state-of-the-art models (LSA, LDA, and BERTopic).
Funding: The research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning under Grant No. NRF-2019R1A2C2084158, and by Samsung Electronics Co. Ltd.
Abstract: Many existing warning prioritization techniques seek to reorder static analysis warnings such that true positives are presented first. However, an excessive amount of time is required to investigate and fix the prioritized warnings because some are not actually true positives or are irrelevant to the code context and topic. In this paper, we propose a warning prioritization technique that reflects various latent topics from bug-related code blocks. Our main aim is to build a prioritization model that comprises separate warning priorities depending on the topic of the change sets to identify the number of true positive warnings. To evaluate the proposed model, we employ a performance metric called warning detection rate, widely used in warning prioritization studies, and compare the proposed model with other competitive techniques. Additionally, the effectiveness of our model is verified by applying our technique to eight industrial projects of a real global company.
Funding: Funded by the Natural Science Foundation of Fujian Province, China, grant No. 2022J05291.
Abstract: Topic modeling is a fundamental technique of content analysis in natural language processing, widely applied in domains such as social sciences and finance. In the era of digital communication, social scientists increasingly rely on large-scale social media data to explore public discourse, collective behavior, and emerging social concerns. However, traditional models like Latent Dirichlet Allocation (LDA) and neural topic models like BERTopic struggle to capture deep semantic structures in short-text datasets, especially in complex non-English languages like Chinese. This paper presents Generative Language Model Topic (GLMTopic), a novel hybrid topic modeling framework leveraging the capabilities of large language models, designed to support social science research by uncovering coherent and interpretable themes from Chinese social media platforms. GLMTopic integrates Adaptive Community-enhanced Graph Embedding for advanced semantic representation, Uniform Manifold Approximation and Projection-based (UMAP-based) dimensionality reduction, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) clustering, and large language model-powered (LLM-powered) representation tuning to generate more contextually relevant and interpretable topics. By reducing dependence on extensive text preprocessing and human expert intervention in post-analysis topic label annotation, GLMTopic facilitates a fully automated and user-friendly topic extraction process. Experimental evaluations on a social media dataset sourced from Weibo demonstrate that GLMTopic outperforms LDA and BERTopic in coherence score and usability with automated interpretation, providing a more scalable and semantically accurate solution for Chinese topic modeling. Future research will explore optimizing computational efficiency, integrating knowledge graphs and sentiment analysis for more complicated workflows, and extending the framework for real-time and multilingual topic modeling.
Funding: Project supported by the Anhui Provincial Natural Science Foundation of China (No. 1908085MF183); the National Natural Science Foundation of China (Nos. 62002084 and 61976005); the Training Program for Young and Middle-Aged Top Talents of Anhui Polytechnic University, China (No. 201812); the Zhejiang Provincial Natural Science Foundation of China (No. LQ21F020004); the State Key Laboratory for Novel Software Technology (Nanjing University) Research Program, China (No. KFKT2019B23); the Open Research Fund of Anhui Key Laboratory of Detection Technology and Energy Saving Devices, Anhui Polytechnic University, China (No. DTESD2020B03); and the Stable Support Plan for Colleges and Universities in Shenzhen, China (No. GXWD20201230155427003-20200730101839009).
Abstract: Emerging topics in app reviews highlight the topics (e.g., software bugs) with which users are concerned during certain periods. Identifying emerging topics accurately, and in a timely manner, could help developers update apps more effectively. Methods for identifying emerging topics in app reviews based on topic models or clustering methods have been proposed in the literature. However, the accuracy of emerging topic identification is reduced because reviews are short and offer limited information. To solve this problem, an improved emerging topic identification (IETI) approach is proposed in this work. Specifically, we adopt natural language processing techniques to reduce noisy data, and identify emerging topics in app reviews using the adaptive online biterm topic model. Then we interpret the implicature of emerging topics through relevant phrases and sentences. We adopt the official app changelogs as ground truth, and evaluate IETI on six common apps. The experimental results indicate that IETI is more accurate than the baseline in identifying emerging topics, with improvements in the F1 score of 0.126 for phrase labels and 0.061 for sentence labels. Finally, we release the code of IETI on GitHub (https://github.com/wanizhou/IETI).
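Biterm topic models address the shortness of reviews by modeling unordered word pairs ("biterms") that co-occur in the same short text, rather than per-document word counts. The sketch below shows only that extraction step; the function name and the sample reviews are hypothetical, and the adaptive online inference used by IETI is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one short text."""
    # Sorting each pair makes ("login", "crashes") and ("crashes", "login") equal.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# Hypothetical, already-tokenized app reviews
reviews = [
    ["app", "crashes", "login"],
    ["login", "crashes"],
]

# Corpus-level biterm counts, the input a biterm topic model is fitted on.
biterm_counts = Counter(b for review in reviews for b in extract_biterms(review))
```

Because the counts are pooled across the whole corpus, co-occurrence evidence accumulates even when each individual review contains only a few words.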
Funding: Supported by a National Research Foundation of Korea (NRF) (http://nrf.re.kr/eng/index) grant funded by the Korean government (RS-2023-00208278).
Abstract: Environmental, social, and governance (ESG) factors are critical in achieving sustainability in business management and are used as values aiming to enhance corporate value. Recently, non-financial indicators have been considered important for the actual valuation of corporations, so analyzing natural language data related to ESG is essential. Several previous studies limited their focus to specific countries or did not use big data. Past methodologies are insufficient for obtaining potential insights into the best practices to leverage ESG. To address this problem, in this study, the authors used data from two platforms: LexisNexis, a platform that provides media monitoring, and Web of Science, a platform that provides scientific papers. These big data were analyzed by topic modeling, which can derive hidden semantic structures within text. Through this process, it is possible to collect information on public and academic sentiment. The authors explored the data from a text-mining perspective using bidirectional encoder representations from transformers topic (BERTopic), a state-of-the-art topic-modeling technique. In addition, changes in subject patterns over time were considered using dynamic topic modeling. The results show that concepts proposed by international organizations such as the United Nations (UN) have been discussed in academia, and that the media have formed a variety of agendas.
Funding: Project (50808025) supported by the National Natural Science Foundation of China; Project (20090162110057) supported by the Doctoral Fund of Ministry of Education, China.
Abstract: Most research on anomaly detection has focused on events that differ from their spatial-temporal neighboring events. It is still a significant challenge to detect anomalies that involve multiple normal events interacting in an unusual pattern. In this work, a novel unsupervised method based on a sparse topic model was proposed to capture motion patterns and detect anomalies in traffic surveillance. Scale-invariant feature transform (SIFT) flow was used to improve the dense trajectory in order to extract interest points and the corresponding descriptors with less interference. To strengthen the relationship between interest points on the same trajectory, the Fisher kernel method was applied to obtain a representation of each trajectory, which was quantized into a visual word. Then the sparse topic model was proposed to explore the latent motion patterns and achieve a sparse representation of the video scene. Finally, two anomaly detection algorithms were compared, based on video clip detection and visual word analysis respectively. Experiments were conducted on the QMUL Junction dataset and the AVSS dataset. The results demonstrated the superior efficiency of the proposed method.
Funding: Supported by the National High Technology Research and Development Program of China (No. 2012AA011005).
Abstract: Topic models such as Latent Dirichlet Allocation (LDA) have been successfully applied to many text mining tasks for extracting topics embedded in corpora. However, existing topic models generally cannot discover bursty topics that experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness by explicitly modeling each topic's burst states with a first-order Markov chain and using the chain to generate the topic proportions of documents in a Logistic Normal fashion. A Gibbs sampling algorithm is developed for the posterior inference of the proposed model. Experimental results on a news data set show that our model can efficiently discover bursty topics, outperforming the state-of-the-art method.
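The burst-state component can be pictured as a two-state first-order Markov chain: whether a topic is bursty at time t depends only on its state at time t-1. The sketch below samples such a sequence. The transition probabilities, the initial state, and the function name are illustrative assumptions, and the Logistic Normal link to topic proportions described in the abstract is not modeled.

```python
import random


def sample_burst_states(n_steps, p_enter=0.1, p_stay=0.7, seed=0):
    """Sample a 0/1 burst-state sequence from a first-order Markov chain.

    p_enter: P(burst at t | no burst at t-1)
    p_stay:  P(burst at t | burst at t-1)
    Both probabilities are illustrative values, not taken from the paper.
    """
    rng = random.Random(seed)
    states = [0]  # assumption: the topic starts in the non-bursty state
    for _ in range(n_steps - 1):
        p_burst = p_stay if states[-1] == 1 else p_enter
        states.append(1 if rng.random() < p_burst else 0)
    return states


states = sample_burst_states(50)
```

With `p_stay` well above `p_enter`, bursts tend to persist for several consecutive time steps once they start, which is the behavior a burst model aims to capture.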
Funding: This work is supported by the Programs for the Young Talents of National Science Library, Chinese Academy of Sciences (Grant No. 2019QNGR003).
Abstract: Purpose: Research dynamics have long been a research interest. They offer a macro-perspective tool for discovering temporal research trends in a certain discipline or subject. A micro perspective on research dynamics, however, concerning a single researcher or a highly cited paper in terms of their citations and “citations of citations” (forward chaining), remains unexplored. Design/methodology/approach: In this paper, we use a cross-collection topic model to reveal the research dynamics of topic disappearance, topic inheritance, and topic innovation in each generation of forward chaining. Findings: For highly cited work, scientific influence exists in indirect citations. Topic modeling can reveal how long this influence persists in forward chaining, as well as its character. Research limitations: This paper measures scientific influence and indirect scientific influence only if the relevant words or phrases are borrowed or used in direct or indirect citations. Paraphrased or semantically similar concepts may be neglected in this research. Practical implications: This paper demonstrates, through its analysis of forward chaining, that scientific influence exists in indirect citations. This can serve as an inspiration for how to adequately evaluate research influence. Originality: The main contributions of this paper are threefold. First, besides the research dynamics of topic inheritance and topic innovation, we model topic disappearance by using a cross-collection topic model. Second, we explore the length and character of the research impact through content analysis of “citations of citations”. Finally, we analyze the research dynamics of artificial intelligence researcher Geoffrey Hinton’s publications and the topic dynamics of forward chaining.
Funding: Supported by the National Natural Science Foundation of China (71473183, 71503188).
Abstract: User interest is not static; it changes dynamically. For the search engine scenario, this paper presents a personalized adaptive user interest prediction framework. It represents user interest as a topic distribution, captures every change of user interest in the history, and uses these changes to predict future individual user interest dynamically. More specifically, it first uses a personalized user interest representation model, built on a topic model, to infer user interest from the queries in the user's history data; then it presents a personalized user interest prediction model that captures the dynamic changes of user interest and predicts future user interest by leveraging the query submission times in the history data. Experimental results on an AOL search query log show that, compared with the Interest Degree Multi-Stage Quantization Model, our framework is more stable and effective in user interest prediction.