Topic modeling is a fundamental content-analysis technique in natural language processing, widely applied in domains such as the social sciences and finance. In the era of digital communication, social scientists increasingly rely on large-scale social media data to explore public discourse, collective behavior, and emerging social concerns. However, traditional models such as Latent Dirichlet Allocation (LDA) and neural topic models such as BERTopic struggle to capture deep semantic structures in short-text datasets, especially in complex non-English languages such as Chinese. This paper presents Generative Language Model Topic (GLMTopic), a novel hybrid topic modeling framework that leverages large language models and is designed to support social science research by uncovering coherent and interpretable themes from Chinese social media platforms. GLMTopic integrates Adaptive Community-enhanced Graph Embedding for advanced semantic representation, Uniform Manifold Approximation and Projection (UMAP)-based dimensionality reduction, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) clustering, and large language model (LLM)-powered representation tuning to generate more contextually relevant and interpretable topics. By reducing dependence on extensive text preprocessing and on human expert intervention in post-analysis topic labeling, GLMTopic facilitates a fully automated and user-friendly topic extraction process. Experimental evaluations on a social media dataset sourced from Weibo demonstrate that GLMTopic outperforms LDA and BERTopic in coherence score and in usability with automated interpretation, providing a more scalable and semantically accurate solution for Chinese topic modeling. Future research will explore optimizing computational efficiency, integrating knowledge graphs and sentiment analysis for more complex workflows, and extending the framework to real-time and multilingual topic modeling.
Supervised topic modeling algorithms have been successfully applied to multi-label document classification tasks. Representative models include labeled latent Dirichlet allocation (L-LDA) and dependency-LDA. However, these models neglect the class frequency of words (i.e., the number of classes in which a word has occurred in the training data), which is significant for classification. To address this, we propose a method, the class frequency weight (CF-weight), that weights words according to their class frequency. The CF-weight is based on the intuition that a word with a higher (lower) class frequency is less (more) discriminative. In this study, the CF-weight is used to improve L-LDA and dependency-LDA. A number of experiments have been conducted on real-world multi-label datasets. Experimental results demonstrate that CF-weight-based algorithms are competitive with existing supervised topic models.
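The class-frequency intuition in the abstract above can be illustrated with a minimal sketch. The abstract defines class frequency but does not give the weighting formula, so the IDF-style weight `log(1 + C / cf(w))` used here is an assumption chosen only because it decreases as class frequency grows; the function names and the toy corpus are hypothetical as well.

```python
import math
from collections import defaultdict


def class_frequency(docs):
    """Count, for each word, the number of distinct classes it occurs in.

    docs is a list of (tokens, labels) pairs from a multi-label training set.
    """
    classes_per_word = defaultdict(set)
    for tokens, labels in docs:
        for word in set(tokens):
            classes_per_word[word].update(labels)
    return {word: len(classes) for word, classes in classes_per_word.items()}


def cf_weight(docs):
    """Assumed IDF-style weight (not the paper's formula): log(1 + C / cf(w))."""
    all_classes = set()
    for _, labels in docs:
        all_classes.update(labels)
    n_classes = len(all_classes)
    return {
        word: math.log(1 + n_classes / freq)
        for word, freq in class_frequency(docs).items()
    }


# Toy corpus: "team" appears in every class, "goal" in only one.
docs = [
    (["goal", "match", "team"], ["sports"]),
    (["election", "vote", "team"], ["politics"]),
    (["stock", "market", "team"], ["finance"]),
]
weights = cf_weight(docs)
print(weights["goal"] > weights["team"])  # the rarer-across-classes word weighs more
```

Under this assumed weighting, a word spread across all classes gets the smallest weight, which matches the stated intuition that high class frequency means low discriminative power.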
Automation has recently come to be considered vital in most fields, since computing methods play a significant role in facilitating work such as automatic text summarization. However, most of the computing methods used in real systems are based on graph models, which are characterized by their simplicity and stability. This paper therefore proposes an improved extractive text summarization algorithm based on both topic and graph models. The methodology consists of two stages. First, the well-known TextRank algorithm is analyzed and its shortcomings are investigated. Then, an improved method is proposed with a new computational model of sentence weights. Finally, the proposed improved graph-model algorithm, TG-SMR (Topic Graph-Summarizer), is evaluated on the standard DUC2004 and DUC2006 datasets and compared with four text summarization methods. The experimental results show that the proposed TG-SMR algorithm achieves higher ROUGE scores. The TG-SMR algorithm is expected to open new directions for improving performance on ROUGE evaluation metrics.
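For readers unfamiliar with the baseline that the abstract above analyzes, here is a minimal sketch of plain TextRank sentence ranking. It is not TG-SMR: the abstract does not specify the improved sentence-weight model, so none is attempted here. The word-overlap similarity is the one commonly associated with the original TextRank formulation; the damping factor, iteration count, and toy sentences are assumptions.

```python
import math


def sentence_similarity(a, b):
    """Word-overlap similarity: shared words / (log|a| + log|b|)."""
    denom = math.log(len(a)) + math.log(len(b))
    if denom == 0:  # both sentences contain a single token
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(sentences, damping=0.85, iterations=50):
    """Score tokenized sentences with weighted PageRank on the similarity graph."""
    n = len(sentences)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i][j] = sentence_similarity(sentences[i], sentences[j])
    out_weight = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if sim[j][i] > 0 and out_weight[j] > 0:
                    rank += sim[j][i] / out_weight[j] * scores[j]
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores


sentences = [
    ["topic", "models", "find", "themes", "in", "text"],
    ["graph", "models", "rank", "sentences", "in", "text"],
    ["the", "weather", "is", "sunny", "today"],
]
scores = textrank(sentences)
best = max(range(len(sentences)), key=lambda i: scores[i])
print(scores, best)
```

An extractive summary would keep the top-scoring sentences. The third sentence shares no words with the others, so it receives only the baseline score of `1 - damping`.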
Social media has revolutionized the dissemination of real-life information, serving as a robust platform for sharing life events. Twitter, characterized by its brevity and continuous flow of posts, has emerged as a crucial source for public health surveillance, offering valuable insights into public reactions during the COVID-19 pandemic. This study aims to leverage a range of machine learning techniques to extract pivotal themes and facilitate text classification on a dataset of tweets related to the COVID-19 outbreak. Diverse topic modeling approaches have been employed to extract pertinent themes and subsequently form a dataset for training text classification models. An assessment of coherence metrics revealed that the Gibbs Sampling Dirichlet Mixture Model (GSDMM), which utilizes trigram and bag-of-words (BOW) feature extraction, outperformed Non-negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), and a hybrid strategy involving Bidirectional Encoder Representations from Transformers (BERT) combined with LDA and K-means in pinpointing significant themes within the dataset. Among the models assessed for text clustering, the use of LDA, either as a clustering model or for feature extraction combined with BERT for K-means, resulted in higher coherence scores, consistent with human ratings, signifying their efficacy. In particular, LDA, notably in conjunction with trigram representation and BOW, demonstrated superior performance. This underscores the suitability of LDA for topic modeling, given its proficiency in capturing intricate textual relationships. In the context of text classification, models such as Linear Support Vector Classification (LSVC), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Network with BiLSTM (CNN-BiLSTM), and BERT have shown outstanding performance, achieving accuracy and weighted F1 scores exceeding 80%. These results significantly surpassed those of other models, such as Multinomial Naive Bayes (MNB), Linear Support Vector Machine (LSVM), and Logistic Regression (LR), which achieved scores in the range of 60 to 70 percent.
Modeling topics in short texts presents significant challenges due to feature sparsity, particularly when analyzing content generated by large-scale online users. This sparsity can substantially impair semantic capture accuracy. We propose a novel approach that incorporates pre-clustered knowledge into the BERTopic model while reducing the l2 norm for low-frequency words. Our method effectively mitigates feature sparsity during cluster mapping. Empirical evaluation on the StackOverflow dataset demonstrates that our approach outperforms baseline models, achieving superior Macro-F1 scores. These results validate the effectiveness of our proposed feature sparsity reduction technique for short-text topic modeling.
Automatic extraction of key data from design specifications is an important means of assisting engineering design automation. Considering that design specifications are characterized by diverse data types, small scale, insufficient character information content, and strong contextual relevance, a named entity recognition model integrating high-quality topics and an attention mechanism, namely Quality Topic-Char Embedding-BiLSTM-Attention-CRF, was proposed to automatically identify entities in design specifications. Based on the topic model, an improved algorithm for high-quality topic extraction was proposed first, and the high-quality topic information obtained was then added to the distributed representation of Chinese characters to enrich character features. Next, the attention mechanism was used in parallel on the basis of the BiLSTM-CRF model to fully mine the contextual semantic information. Finally, the experiment was performed on a collected corpus of Chinese ship design specifications, and the model was compared with multiple sets of models. The results show that the F-score (harmonic mean of precision and recall) of the model is 80.24%. The model performs better than other models on design specifications and is expected to provide an automatic means for engineering design.
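The F-score mentioned in the abstract above is the harmonic mean of precision and recall. The following minimal sketch shows how it is typically computed for entity recognition by exact match between gold and predicted entity spans. The entity tuples and type names (`SHIP`, `PART`, `VALUE`) are hypothetical illustrations, not data from the paper.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of (start, end, type)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical spans: one of two predictions matches one of three gold entities.
gold = {(0, 2, "SHIP"), (5, 7, "PART"), (9, 10, "VALUE")}
predicted = {(0, 2, "SHIP"), (5, 7, "VALUE")}
precision, recall, f1 = evaluate(gold, predicted)
print(precision, recall, f1)
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high F-score by maximizing precision or recall alone.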
Traditionally, exam preparation involves manually analyzing past question papers to identify and prioritize key topics. This research proposes a data-driven solution to automate this process using techniques such as Document Layout Segmentation, Optical Character Recognition (OCR), and Latent Dirichlet Allocation (LDA) for topic modelling. This study aims to develop a system that uses machine learning and topic modelling to identify and rank key topics from historical exam papers, aiding students in efficient exam preparation. The research addresses the difficulty of exam preparation caused by the manual and labour-intensive process of analyzing past exam papers to identify and prioritize key topics. The approach is designed to streamline and optimize exam preparation, making it easier for students to focus on the most relevant topics and thereby use their efforts more effectively. The process involves three stages: (i) Document Layout Segmentation and Data Preparation, using deep learning techniques to separate text from non-textual content in past exam papers; (ii) Text Extraction and Processing, using OCR to convert images into machine-readable text; and (iii) Topic Modelling with LDA to identify key topics covered in the exams. The research demonstrates the effectiveness of the proposed method in identifying and prioritizing key topics from exam papers. The LDA model successfully extracts relevant themes, helping students focus their study efforts. The research presents a promising approach for optimizing exam preparation. By leveraging machine learning and topic modelling, the system offers a data-driven and efficient solution for students to prioritize their study efforts. Future work includes expanding the dataset size to further enhance model accuracy. Additionally, integration with educational platforms holds potential for personalized recommendations and adaptive learning experiences.
Most research on anomaly detection has focused on events that differ from their spatio-temporal neighboring events. It remains a significant challenge to detect anomalies that involve multiple normal events interacting in an unusual pattern. In this work, a novel unsupervised method based on a sparse topic model was proposed to capture motion patterns and detect anomalies in traffic surveillance. Scale-invariant feature transform (SIFT) flow was used to improve the dense trajectory in order to extract interest points and the corresponding descriptors with less interference. To strengthen the relationship between interest points on the same trajectory, the Fisher kernel method was applied to obtain a representation of each trajectory, which was quantized into visual words. The sparse topic model was then proposed to explore the latent motion patterns and achieve a sparse representation of the video scene. Finally, two anomaly detection algorithms were compared, based on video clip detection and visual word analysis respectively. Experiments were conducted on the QMUL Junction dataset and the AVSS dataset. The results demonstrated the superior efficiency of the proposed method.
Topic models such as Latent Dirichlet Allocation (LDA) have been successfully applied to many text mining tasks for extracting topics embedded in corpora. However, existing topic models generally cannot discover bursty topics that experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness by explicitly modeling each topic's burst states with a first-order Markov chain and using the chain to generate the topic proportions of documents in a logistic-normal fashion. A Gibbs sampling algorithm is developed for the posterior inference of the proposed model. Experimental results on a news data set show that our model can efficiently discover bursty topics, outperforming the state-of-the-art method.
Purpose: Research dynamics have long been a research interest. It is a macro-perspective tool for discovering temporal research trends of a certain discipline or subject. A micro perspective of research dynamics, however, concerning a single researcher or a highly cited paper in terms of their citations and "citations of citations" (forward chaining), remains unexplored. Design/methodology/approach: In this paper, we use a cross-collection topic model to reveal the research dynamics of topic disappearance, topic inheritance, and topic innovation in each generation of forward chaining. Findings: For highly cited work, scientific influence exists in indirect citations. Topic modeling can reveal how long this influence persists in forward chaining, as well as its character. Research limitations: This paper measures scientific influence and indirect scientific influence only if the relevant words or phrases are borrowed or used in direct or indirect citations. Paraphrasing or semantically similar concepts may be neglected in this research. Practical implications: This paper demonstrates that scientific influence exists in indirect citations through its analysis of forward chaining. This can serve as an inspiration for how to adequately evaluate research influence. Originality: The main contributions of this paper are threefold. First, besides the research dynamics of topic inheritance and topic innovation, we model topic disappearance by using a cross-collection topic model. Second, we explore the length and character of the research impact through content analysis of "citations of citations". Finally, we analyze the research dynamics of artificial intelligence researcher Geoffrey Hinton's publications and the topic dynamics of forward chaining.
User interest is not static and changes dynamically. In the scenario of a search engine, this paper presents a personalized adaptive user interest prediction framework. It represents user interest as a topic distribution, captures every change of user interest in the history, and uses the changes to predict future individual user interest dynamically. More specifically, it first uses a personalized user interest representation model to infer user interest from the queries in the user's history data using a topic model; it then presents a personalized user interest prediction model to capture the dynamic changes of user interest and to predict future user interest by leveraging the query submission time in the history data. Experimental results on the AOL search query log show that, compared with the Interest Degree Multi-Stage Quantization Model, our framework is more stable and effective in user interest prediction.
Recommendation systems can greatly alleviate "information overload" in the big data era. Existing recommendation methods, however, typically focus on predicting missing rating values by analyzing the dualistic user-item relationship, neglecting the important fact that the latent interests of users can influence their rating behaviors. Moreover, traditional recommendation methods easily suffer from the high-dimensionality problem and the cold-start problem. To address these challenges, in this paper we propose a PBUED (PLSA-Based Uniform Euclidean Distance) scheme, which utilizes a topic model and the uniform Euclidean distance to recommend suitable items to users. The solution first employs probabilistic latent semantic analysis (PLSA) to extract users' interests, and users with different interests are divided into different subgroups. Then, the uniform Euclidean distance is adopted to compute the similarity of users in the same interest subset. Finally, the missing rating values are predicted by aggregating similar neighbors' ratings. We evaluate PBUED on two datasets, and experimental results show that PBUED achieves better prediction and ranking performance than other approaches.
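The three steps described in the abstract above (interest subgroups, distance-based similarity within a subgroup, neighbor aggregation) can be sketched as follows. This is an illustration under stated assumptions, not the PBUED scheme itself: the per-user interest distributions are taken as given rather than inferred with PLSA, the abstract does not define the "uniform Euclidean distance", so a Euclidean distance normalized by the number of co-rated items is assumed, and the similarity `1 / (1 + d)` and all names and data are hypothetical.

```python
import math


def dominant_interest(topic_dist):
    """Index of the user's strongest latent interest (subgroup id)."""
    return max(range(len(topic_dist)), key=lambda k: topic_dist[k])


def normalized_distance(ratings_u, ratings_v):
    """Assumed stand-in for the 'uniform' distance: RMS gap on co-rated items."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return None
    squared = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common)
    return math.sqrt(squared / len(common))


def predict(user, item, ratings, interests):
    """Similarity-weighted average of same-subgroup neighbors' ratings."""
    group = dominant_interest(interests[user])
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        if dominant_interest(interests[other]) != group:
            continue
        distance = normalized_distance(ratings[user], other_ratings)
        if distance is None:
            continue
        similarity = 1.0 / (1.0 + distance)  # assumed distance-to-similarity map
        numerator += similarity * other_ratings[item]
        denominator += similarity
    return numerator / denominator if denominator else None


# Hypothetical data: u1 and u2 share an interest subgroup, u3 does not.
interests = {"u1": [0.8, 0.2], "u2": [0.7, 0.3], "u3": [0.1, 0.9]}
ratings = {
    "u1": {"a": 5, "b": 4},
    "u2": {"a": 5, "b": 4, "c": 2},
    "u3": {"a": 1, "c": 5},
}
print(predict("u1", "c", ratings, interests))
```

Restricting neighbors to the same interest subgroup is what keeps `u3`, who rated item `c` very differently, from influencing the prediction for `u1`.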
Environmental, social, and governance (ESG) factors are critical in achieving sustainability in business management and are used as values aiming to enhance corporate value. Recently, non-financial indicators have come to be considered important for the actual valuation of corporations; analyzing natural language data related to ESG is therefore essential. Several previous studies limited their focus to specific countries or did not use big data. Past methodologies are insufficient for obtaining potential insights into the best practices for leveraging ESG. To address this problem, in this study the authors used data from two platforms: LexisNexis, a platform that provides media monitoring, and Web of Science, a platform that provides scientific papers. These big data were analyzed by topic modeling, which can derive hidden semantic structures within text. Through this process, it is possible to collect information on public and academic sentiment. The authors explored the data from a text-mining perspective using bidirectional encoder representations from transformers topic (BERTopic), a state-of-the-art topic-modeling technique. In addition, changes in subject patterns over time were considered using dynamic topic modeling. The results show that concepts proposed by international organizations such as the United Nations (UN) have been discussed in academia, and that the media have formed a variety of agendas.
Background: With mounting global environmental, social and economic pressures, the resilience and stability of forests, and thus the provisioning of vital ecosystem services, is increasingly threatened. Intensified monitoring can help to detect ecological threats and changes earlier, but monitoring resources are limited. Participatory forest monitoring with the help of "citizen scientists" can provide additional resources for forest monitoring and at the same time help to communicate with stakeholders and the general public. Examples of citizen science projects in the forestry domain can be found, but a solid, applicable larger framework to utilise public participation in the area of forest monitoring seems to be lacking. We propose that a better understanding of shared and related topics in citizen science and forest monitoring might be a first step towards such a framework. Methods: We conduct a systematic meta-analysis of 1015 publication abstracts addressing "forest monitoring" and "citizen science" in order to explore the combined topical landscape of these subjects. We employ "topic modelling", an unsupervised probabilistic machine learning method, to identify latent shared topics in the analysed publications. Results: We find that large shared topics exist, but that these are primarily topics that would be expected in scientific publications in general. Common domain-specific topics are under-represented and indicate a topical separation of the two document sets on "forest monitoring" and "citizen science" and thus of the represented domains. While topic modelling as a method proves to be a scalable and useful analytical tool, we propose that our approach could deliver even more useful data if a larger document set and full-text publications were available for analysis. Conclusions: We propose that these results, together with the observation of non-shared but related topics, point to under-utilised opportunities for public participation in forest monitoring. Citizen science could be applied as a versatile tool in forest ecosystem monitoring, complementing traditional forest monitoring programmes, assisting early threat recognition and helping to connect forest management with the general public. We conclude that our presented approach should be pursued further as it may aid the understanding and setup of citizen science efforts in the forest monitoring domain.
The problem of "rich topics get richer" (RTGR) is common in topic models and leads to a wrong topic distribution if the distributing process is not intervened in. In the standard LDA (Latent Dirichlet Allocation) model, each word in all the documents has the same statistical ability. In fact, words have different impacts on different topics. Guided by this idea, we extend ILDA (Infinite LDA) by considering the bias role of words in dividing the topics. We propose a self-adaptive topic model specifically to overcome the RTGR problem. The model proposed in this paper addresses three questions: (1) the number of topics changes with the collection of documents, which suits dynamic data; (2) words have discriminating attributes with respect to the topic distribution; (3) a self-adaptive method is used to realize automatic re-sampling. To verify our model, we design a topic evolution analysis system which realizes the following functions: topic classification in each cycle, topic correlation between adjacent cycles, and strength calculation of the sub-topics in order. Experiments on both the NIPS corpus and our self-built news collections showed that the system could meet the given demand and that the results were feasible.
This paper deals with the statistical modeling of latent topic hierarchies in text corpora. The height of the topic tree is assumed to be fixed, while the number of topics on each level is unknown a priori and is to be inferred from data. Taking a nonparametric Bayesian approach to this problem, we propose a new probabilistic generative model based on the nested hierarchical Dirichlet process (nHDP) and present a Markov chain Monte Carlo sampling algorithm for the inference of the topic tree structure as well as the word distribution of each topic and the topic distribution of each document. Our theoretical analysis and experimental results show that this model can produce a more compact hierarchical topic structure and capture more fine-grained topic relationships than the hierarchical latent Dirichlet allocation model.
User-Generated Content (UGC) provides a potential data source which can help us to better describe and understand how places are conceptualized, and in turn better represent places in Geographic Information Science (GIScience). In this article, we aim at aggregating the shared meanings associated with places and linking these to a conceptual model of place. Our focus is on the metadata of Flickr images, in the form of locations and tags. We use topic modeling to identify regions associated with shared meanings. We choose a grid approach and generate topics associated with one or more cells using Latent Dirichlet Allocation. We analyze the sensitivity of our results to both grid resolution and the chosen number of topics using a range of measures including corpus distance and the coherence value. Using a resolution of 500 m and 40 topics, we are able to generate meaningful topics which characterize places in London based on 954 unique tags associated with around 300,000 images and more than 7000 individuals.
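The grid approach described in the abstract above can be sketched as a preprocessing step that turns geotagged photos into one bag of tags per grid cell, which could then be passed to an LDA implementation. The abstract does not detail how cells and tags are combined, so treating each cell as a single document is an assumption, as are the projected metre coordinates, the function names, and the example tags.

```python
from collections import defaultdict


def cell_of(x_m, y_m, resolution_m=500):
    """Map projected coordinates in metres to a (column, row) grid cell."""
    return (int(x_m // resolution_m), int(y_m // resolution_m))


def build_cell_documents(photos, resolution_m=500):
    """Aggregate the tags of all photos falling into the same cell.

    photos is an iterable of (x_m, y_m, tags) tuples; the result maps each
    occupied cell to the list of tags observed there (one document per cell).
    """
    documents = defaultdict(list)
    for x_m, y_m, tags in photos:
        documents[cell_of(x_m, y_m, resolution_m)].extend(tags)
    return dict(documents)


# Hypothetical photos: two fall into cell (0, 0), one into cell (1, 0).
photos = [
    (120.0, 480.0, ["bridge", "thames"]),
    (499.0, 10.0, ["bridge"]),
    (510.0, 10.0, ["market"]),
]
documents = build_cell_documents(photos)
print(documents)
```

Changing `resolution_m` regroups the same photos into coarser or finer documents, which is the kind of sensitivity to grid resolution the abstract reports analyzing.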
Traditional topic models have been widely used for analyzing semantic topics in electronic documents. However, the topic words they produce have obvious defects: they are poor in readability and consistency, and only domain experts can guess their meaning. In fact, phrases are the main unit with which people express semantics. This paper presents Distributed Representation-Phrase Latent Dirichlet Allocation (DR-Phrase LDA), a phrase topic model. Specifically, we enhance the semantic information of phrases via distributed representation in this model. The experimental results show that the topics acquired by our model are more readable and consistent than those of other similar topic models.
Retelling extraction is an important branch of Natural Language Processing (NLP), and high-quality retelling resources are very helpful for improving the performance of machine translation. However, traditional methods based on bilingual parallel corpora often ignore the document background in the process of retelling acquisition and application. To solve this problem, we introduce topic model information into the translation model and propose a topic-based statistical machine translation method to improve translation performance. In this method, Probabilistic Latent Semantic Analysis (PLSA) is used to obtain the co-occurrence relationship between words and documents through hybrid matrix decomposition. We then design a decoder to simplify the decoding process. Experiments show that the proposed method can effectively improve the accuracy of translation.
This study explored user satisfaction with mobile payments by applying a novel structural topic model. Specifically, we collected 17,927 online reviews of a specific mobile payment service (i.e., PayPal). We then employed a structural topic model to investigate the relationship between the attributes extracted from online reviews and user satisfaction with mobile payment. We discovered that “lack of reliability” and “poor customer service” tend to appear in negative reviews, whereas the terms “convenience,” “user-friendly interface,” “simple process,” and “secure system” tend to appear in positive reviews. On the basis of information system success theory, we categorized the topics “convenience,” “user-friendly interface,” and “simple process” as system quality. In addition, “poor customer service” was categorized as service quality. Furthermore, based on previous studies of trust and security, “lack of reliability” and “secure system” were categorized as trust and security, respectively. These outcomes indicate that users are satisfied when they perceive the system quality and security of a specific mobile payment service to be high. Conversely, users are dissatisfied when they feel that its service quality and reliability are lacking. Overall, our research implies that a novel structural topic model is an effective method for exploring the mobile payment user experience.
Funding (GLMTopic study): funded by the Natural Science Foundation of Fujian Province, China, grant No. 2022J05291.
Funding: Project supported by the National Natural Science Foundation of China (No. 61602204).
Abstract: Supervised topic modeling algorithms have been successfully applied to multi-label document classification tasks. Representative models include labeled latent Dirichlet allocation (L-LDA) and dependency-LDA. However, these models neglect the class frequency information of words (i.e., the number of classes in which a word has occurred in the training data), which is significant for classification. To address this, we propose a method, namely the class frequency weight (CF-weight), to weight words by considering the class frequency knowledge. The CF-weight is based on the intuition that a word with higher (lower) class frequency will be less (more) discriminative. In this study, the CF-weight is used to improve L-LDA and dependency-LDA. A number of experiments have been conducted on real-world multi-label datasets. Experimental results demonstrate that CF-weight based algorithms are competitive with the existing supervised topic models.
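The class-frequency intuition in this abstract can be illustrated with a minimal sketch. The abstract does not give the CF-weight formula, so the IDF-style weight `log(1 + C / cf)` below is an assumption chosen only because it decreases as class frequency grows; the function names are hypothetical.

```python
import math
from collections import defaultdict


def class_frequency(docs):
    """Count, for each word, the number of distinct classes it occurs in.

    docs is an iterable of (tokens, labels) pairs from a multi-label corpus.
    """
    classes_by_word = defaultdict(set)
    for tokens, labels in docs:
        for word in set(tokens):
            classes_by_word[word].update(labels)
    return {word: len(classes) for word, classes in classes_by_word.items()}


def cf_weight(docs):
    """Assumed IDF-style weight: words seen in fewer classes get larger weights."""
    all_classes = set()
    for _, labels in docs:
        all_classes.update(labels)
    n_classes = len(all_classes)
    cf = class_frequency(docs)
    return {word: math.log(1 + n_classes / count) for word, count in cf.items()}
```

A word that appears in every class receives the smallest weight, matching the stated intuition that high class frequency means low discriminative power.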
Abstract: Automation has recently come to be considered vital in most fields, since computing methods play a significant role in facilitating work such as automatic text summarization. However, most of the computing methods used in real systems are based on graph models, which are characterized by their simplicity and stability. Thus, this paper proposes an improved extractive text summarization algorithm based on both topic and graph models. The methodology of this work consists of two stages. First, the well-known TextRank algorithm is analyzed and its shortcomings are investigated. Then, an improved method is proposed with a new computational model of sentence weights. Experiments were carried out on the standard DUC2004 and DUC2006 datasets, where the proposed improved graph model algorithm, TG-SMR (Topic Graph-Summarizer), was compared with four other text summarization methods. The experimental results show that the proposed TG-SMR algorithm achieves higher ROUGE scores. It is foreseen that the TG-SMR algorithm will open a new horizon concerning performance on ROUGE evaluation indicators.
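For orientation, the baseline TextRank sentence ranking that this abstract analyzes can be sketched as follows. This is the standard algorithm (word-overlap similarity plus damped power iteration), not the TG-SMR sentence-weight model, which the abstract does not specify; the damping factor 0.85 and the iteration count are conventional defaults assumed here.

```python
import math


def overlap_similarity(a, b):
    """TextRank sentence similarity: shared words normalised by log lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) == 0 would make the normaliser degenerate
    shared = len(set(a) & set(b))
    return shared / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score tokenised sentences by weighted PageRank over the similarity graph."""
    n = len(sentences)
    sim = [
        [overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in sim]  # total edge weight leaving each node
    scores = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            rank = sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            updated.append((1 - damping) + damping * rank)
        scores = updated
    return scores
```

An extractive summary is then formed by taking the highest-scoring sentences; a sentence with no word overlap with any other sentence keeps only the base score `1 - damping`.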
Abstract: Social media has revolutionized the dissemination of real-life information, serving as a robust platform for sharing life events. Twitter, characterized by its brevity and continuous flow of posts, has emerged as a crucial source for public health surveillance, offering valuable insights into public reactions during the COVID-19 pandemic. This study aims to leverage a range of machine learning techniques to extract pivotal themes and facilitate text classification on a dataset of COVID-19 outbreak-related tweets. Diverse topic modeling approaches have been employed to extract pertinent themes and subsequently form a dataset for training text classification models. An assessment of coherence metrics revealed that the Gibbs Sampling Dirichlet Mixture Model (GSDMM), which utilizes trigram and bag-of-words (BOW) feature extraction, outperformed Non-negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), and a hybrid strategy involving Bidirectional Encoder Representations from Transformers (BERT) combined with LDA and K-means to pinpoint significant themes within the dataset. Among the models assessed for text clustering, the utilization of LDA, either as a clustering model or for feature extraction combined with BERT for K-means, resulted in higher coherence scores, consistent with human ratings, signifying their efficacy. In particular, LDA, notably in conjunction with trigram representation and BOW, demonstrated superior performance. This underscores the suitability of LDA for conducting topic modeling, given its proficiency in capturing intricate textual relationships. In the context of text classification, models such as Linear Support Vector Classification (LSVC), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Network with BiLSTM (CNN-BiLSTM), and BERT have shown outstanding performance, achieving accuracy and weighted F1 scores exceeding 80%. These results significantly surpassed other models, such as Multinomial Naive Bayes (MNB), Linear Support Vector Machine (LSVM), and Logistic Regression (LR), which achieved scores in the range of 60 to 70 percent.
Abstract: Modeling topics in short texts presents significant challenges due to feature sparsity, particularly when analyzing content generated by large-scale online users. This sparsity can substantially impair semantic capture accuracy. We propose a novel approach that incorporates pre-clustered knowledge into the BERTopic model while reducing the L2 norm for low-frequency words. Our method effectively mitigates feature sparsity during cluster mapping. Empirical evaluation on the StackOverflow dataset demonstrates that our approach outperforms baseline models, achieving superior Macro-F1 scores. These results validate the effectiveness of our proposed feature sparsity reduction technique for short-text topic modeling.
Funding: Supported by the High Tech Ship Project of the Ministry of Industry and Information Technology (No. [2019]331).
Abstract: Automatic extraction of key data from design specifications is an important means to assist in engineering design automation. Considering the characteristics of design specifications, namely diverse data types, small scale, insufficient character information content, and strong contextual relevance, a named entity recognition model integrated with high-quality topics and an attention mechanism, namely Quality Topic-Char Embedding-BiLSTM-Attention-CRF, was proposed to automatically identify entities in design specifications. Based on the topic model, an improved algorithm for high-quality topic extraction was proposed first, and then the high-quality topic information obtained was added into the distributed representation of Chinese characters to better enrich character features. Next, the attention mechanism was used in parallel on the basis of the BiLSTM-CRF model to fully mine the contextual semantic information. Finally, an experiment was performed on the collected corpus of Chinese ship design specifications, and the model was compared with multiple sets of models. The results show that the F-score (harmonic mean of precision and recall) of the model is 80.24%. The model performs better than other models on design specifications, and is expected to provide an automatic means for engineering design.
Abstract: Traditionally, exam preparation involves manually analyzing past question papers to identify and prioritize key topics. This research proposes a data-driven solution to automate this process using techniques like Document Layout Segmentation, Optical Character Recognition (OCR), and Latent Dirichlet Allocation (LDA) for topic modelling. This study aims to develop a system that utilizes machine learning and topic modelling to identify and rank key topics from historical exam papers, aiding students in efficient exam preparation. The research addresses the difficulty in exam preparation due to the manual and labour-intensive process of analyzing past exam papers to identify and prioritize key topics. This approach is designed to streamline and optimize exam preparation, making it easier for students to focus on the most relevant topics, thereby using their efforts more effectively. The process involves three stages: (i) Document Layout Segmentation and Data Preparation, using deep learning techniques to separate text from non-textual content in past exam papers, (ii) Text Extraction and Processing, using OCR to convert images into machine-readable text, and (iii) Topic Modeling with LDA to identify key topics covered in the exams. The research demonstrates the effectiveness of the proposed method in identifying and prioritizing key topics from exam papers. The LDA model successfully extracts relevant themes, aiding students in focusing their study efforts. The research presents a promising approach for optimizing exam preparation. By leveraging machine learning and topic modelling, the system offers a data-driven and efficient solution for students to prioritize their study efforts. Future work includes expanding the dataset size to further enhance model accuracy. Additionally, integration with educational platforms holds potential for personalized recommendations and adaptive learning experiences.
Funding: Project (50808025) supported by the National Natural Science Foundation of China; Project (20090162110057) supported by the Doctoral Fund of Ministry of Education, China.
Abstract: Most research on anomaly detection has focused on events that differ from their spatial-temporal neighboring events. It is still a significant challenge to detect anomalies that involve multiple normal events interacting in an unusual pattern. In this work, a novel unsupervised method based on a sparse topic model was proposed to capture motion patterns and detect anomalies in traffic surveillance. Scale-invariant feature transform (SIFT) flow was used to improve the dense trajectory in order to extract interest points and the corresponding descriptors with less interference. For the purpose of strengthening the relationship of interest points on the same trajectory, the Fisher kernel method was applied to obtain the representation of a trajectory, which was quantized into a visual word. Then the sparse topic model was proposed to explore the latent motion patterns and achieve a sparse representation for the video scene. Finally, two anomaly detection algorithms were compared, based on video clip detection and visual word analysis respectively. Experiments were conducted on the QMUL Junction dataset and the AVSS dataset. The results demonstrated the superior efficiency of the proposed method.
Funding: Supported by the National High Technology Research and Development Program of China (No. 2012AA011005).
Abstract: Topic models such as Latent Dirichlet Allocation (LDA) have been successfully applied to many text mining tasks for extracting topics embedded in corpora. However, existing topic models generally cannot discover bursty topics that experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness by explicitly modeling each topic's burst states with a first-order Markov chain and using the chain to generate the topic proportions of documents in a Logistic Normal fashion. A Gibbs sampling algorithm is developed for the posterior inference of the proposed model. Experimental results on a news data set show that our model can efficiently discover bursty topics, outperforming the state-of-the-art method.
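The burst-state component mentioned in this abstract can be illustrated in isolation. The sketch below scores a binary burst/non-burst sequence for one topic under a two-state first-order Markov chain; the parameter names and values are hypothetical, and the coupling to topic proportions and the Gibbs sampler of Burst-LDA are not reproduced here.

```python
import math


def sequence_log_prob(states, p_start_burst, p_enter, p_stay):
    """Log-probability of a 0/1 burst-state sequence under a two-state chain.

    p_start_burst: probability that the first time slice is bursting.
    p_enter: probability of moving from non-burst (0) to burst (1).
    p_stay: probability of remaining in burst (1) once bursting.
    """
    def transition(prev, cur):
        p_burst = p_stay if prev == 1 else p_enter
        return p_burst if cur == 1 else 1.0 - p_burst

    first = p_start_burst if states[0] == 1 else 1.0 - p_start_burst
    log_prob = math.log(first)
    for prev, cur in zip(states, states[1:]):
        log_prob += math.log(transition(prev, cur))
    return log_prob
```

With a high `p_stay`, contiguous bursts score higher than isolated flickers, which is the behaviour a first-order chain contributes to burst detection.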
Funding: This work is supported by the Programs for the Young Talents of National Science Library, Chinese Academy of Sciences (Grant No. 2019QNGR003).
Abstract: Purpose: Research dynamics have long been a research interest. It is a macro perspective tool for discovering temporal research trends of a certain discipline or subject. A micro perspective of research dynamics, however, concerning a single researcher or a highly cited paper in terms of their citations and “citations of citations” (forward chaining) remains unexplored. Design/methodology/approach: In this paper, we use a cross-collection topic model to reveal the research dynamics of topic disappearance, topic inheritance, and topic innovation in each generation of forward chaining. Findings: For highly cited work, scientific influence exists in indirect citations. Topic modeling can reveal how long this influence exists in forward chaining, as well as its influence. Research limitations: This paper measures scientific influence and indirect scientific influence only if the relevant words or phrases are borrowed or used in direct or indirect citations. Paraphrasing or semantically similar concepts may be neglected in this research. Practical implications: This paper demonstrates that a scientific influence exists in indirect citations through its analysis of forward chaining. This can serve as an inspiration on how to adequately evaluate research influence. Originality: The main contributions of this paper are the following three aspects. First, besides research dynamics of topic inheritance and topic innovation, we model topic disappearance by using a cross-collection topic model. Second, we explore the length and character of the research impact through “citations of citations” content analysis. Finally, we analyze the research dynamics of artificial intelligence researcher Geoffrey Hinton’s publications and the topic dynamics of forward chaining.
Funding: Supported by the National Natural Science Foundation of China (71473183, 71503188).
Abstract: User interest is not static and changes dynamically. In the scenario of a search engine, this paper presents a personalized adaptive user interest prediction framework. It represents user interest as a topic distribution, captures every change of user interest in the history, and uses the changes to predict future individual user interest dynamically. More specifically, it first uses a personalized user interest representation model to infer user interest from queries in the user's history data using a topic model; then it presents a personalized user interest prediction model to capture the dynamic changes of user interest and to predict future user interest by leveraging the query submission time in the history data. Compared with the Interest Degree Multi-Stage Quantization Model, experimental results on an AOL search query log show that our framework is more stable and effective in user interest prediction.
Funding: Supported in part by the National High-tech R&D Program of China (863 Program) under Grant No. 2013AA102301 and the technological project of Henan province (162102210214).
Abstract: Recommendation systems can greatly alleviate the "information overload" of the big data era. Existing recommendation methods, however, typically focus on predicting missing rating values by analyzing the dualistic user-item relationship, neglecting the important fact that the latent interests of users can influence their rating behaviors. Moreover, traditional recommendation methods easily suffer from the high-dimensionality problem and the cold-start problem. To address these challenges, in this paper we propose a PBUED (PLSA-Based Uniform Euclidean Distance) scheme, which utilizes a topic model and uniform Euclidean distance to recommend suitable items to users. The solution first employs probabilistic latent semantic analysis (PLSA) to extract users' interests, and users with different interests are divided into different subgroups. Then, the uniform Euclidean distance is adopted to compute the users' similarity within the same interest subset. Finally, the missing rating values are predicted by aggregating similar neighbors' ratings. We evaluate PBUED on two datasets, and experimental results show that PBUED can lead to better prediction performance and ranking performance than other approaches.
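The last two steps described in this abstract (distance-based similarity inside an interest subgroup, then neighbor aggregation) can be sketched as below. The abstract does not define "uniform Euclidean distance", so the per-item averaged distance over co-rated items is an assumption, as are the function names; the PLSA subgrouping step is taken as given and passed in as `group`.

```python
import math


def similarity(ratings_a, ratings_b):
    """Similarity from Euclidean distance over co-rated items.

    Assumption: the squared distance is averaged over the number of co-rated
    items so that users with many shared items are not penalised.
    """
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    mean_sq = sum((ratings_a[i] - ratings_b[i]) ** 2 for i in common) / len(common)
    return 1.0 / (1.0 + math.sqrt(mean_sq))


def predict(ratings, group, user, item):
    """Similarity-weighted mean of the subgroup neighbours' ratings for item.

    Returns None when no neighbour in the subgroup has rated the item.
    """
    numerator = 0.0
    denominator = 0.0
    for other in group:
        if other == user or item not in ratings[other]:
            continue
        weight = similarity(ratings[user], ratings[other])
        numerator += weight * ratings[other][item]
        denominator += weight
    return numerator / denominator if denominator > 0 else None
```

Restricting `group` to users in the same interest subgroup is what keeps the neighbor search small, which is how such a scheme could ease the high-dimensionality problem the abstract mentions.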
Funding: Supported by a National Research Foundation of Korea (NRF) (http://nrf.re.kr/eng/index) grant funded by the Korean government (RS-2023-00208278).
Abstract: Environmental, social, and governance (ESG) factors are critical in achieving sustainability in business management and are used as values aiming to enhance corporate value. Recently, non-financial indicators have been considered important for the actual valuation of corporations; thus, analyzing natural language data related to ESG is essential. Several previous studies limited their focus to specific countries or have not used big data. Past methodologies are insufficient for obtaining potential insights into the best practices to leverage ESG. To address this problem, in this study, the authors used data from two platforms: LexisNexis, a platform that provides media monitoring, and Web of Science, a platform that provides scientific papers. These big data were analyzed by topic modeling. Topic modeling can derive hidden semantic structures within the text. Through this process, it is possible to collect information on public and academic sentiment. The authors explored data from a text-mining perspective using bidirectional encoder representations from transformers topic (BERTopic), a state-of-the-art topic-modeling technique. In addition, changes in subject patterns over time were considered using dynamic topic modeling. As a result, concepts proposed in an international organization such as the United Nations (UN) have been discussed in academia, and the media have formed a variety of agendas.
Abstract: Background: With mounting global environmental, social and economic pressures, the resilience and stability of forests, and thus the provisioning of vital ecosystem services, is increasingly threatened. Intensified monitoring can help to detect ecological threats and changes earlier, but monitoring resources are limited. Participatory forest monitoring with the help of "citizen scientists" can provide additional resources for forest monitoring and at the same time help to communicate with stakeholders and the general public. Examples of citizen science projects in the forestry domain can be found, but a solid, applicable larger framework to utilise public participation in the area of forest monitoring seems to be lacking. We propose that a better understanding of shared and related topics in citizen science and forest monitoring might be a first step towards such a framework. Methods: We conduct a systematic meta-analysis of 1015 publication abstracts addressing "forest monitoring" and "citizen science" in order to explore the combined topical landscape of these subjects. We employ 'topic modelling', an unsupervised probabilistic machine learning method, to identify latent shared topics in the analysed publications. Results: We find that large shared topics exist, but that these are primarily topics that would be expected in scientific publications in general. Common domain-specific topics are under-represented and indicate a topical separation of the two document sets on "forest monitoring" and "citizen science" and thus the represented domains. While topic modelling as a method proves to be a scalable and useful analytical tool, we propose that our approach could deliver even more useful data if a larger document set and full-text publications were available for analysis. Conclusions: We propose that these results, together with the observation of non-shared but related topics, point at under-utilised opportunities for public participation in forest monitoring. Citizen science could be applied as a versatile tool in forest ecosystems monitoring, complementing traditional forest monitoring programmes, assisting early threat recognition and helping to connect forest management with the general public. We conclude that our presented approach should be pursued further as it may aid the understanding and setup of citizen science efforts in the forest monitoring domain.
Funding: This work is supported by grants from the National 973 project (No. 2013CB29606), the Natural Science Foundation of China (No. 61202244), and the research fund of ShangQiu Normal College (No. 2013GGJS013). The NIPS corpus is supported by SourceForge. We thank the anonymous reviewers for their helpful comments.
Abstract: The "rich topics get richer" (RTGR) problem is common in topic models and leads to a wrong topic distribution if there is no intervention in the distribution process. In the standard LDA (Latent Dirichlet Allocation) model, every word in every document carries the same statistical weight. In fact, words have different impacts on different topics. Guided by this idea, we extend ILDA (Infinite LDA) by considering the biasing role of words in dividing topics. We propose a self-adaptive topic model specifically to overcome the RTGR problem. The model proposed in this paper addresses three points: (1) the number of topics changes with the document collection, which suits dynamic data; (2) words have discriminating attributes with respect to the topic distribution; (3) a self-adaptive method is used to realize automatic re-sampling. To verify our model, we design a topic evolution analysis system that provides the following functions: topic classification in each cycle, topic correlation between adjacent cycles, and strength calculation of the sub-topics in order. Experiments on both the NIPS corpus and our self-built news collections showed that the system met the given requirements and that the results were feasible.
Funding: Project (No. 60773180) supported by the National Natural Science Foundation of China.
Abstract: This paper deals with the statistical modeling of latent topic hierarchies in text corpora. The height of the topic tree is assumed to be fixed, while the number of topics on each level is unknown a priori and is to be inferred from data. Taking a nonparametric Bayesian approach to this problem, we propose a new probabilistic generative model based on the nested hierarchical Dirichlet process (nHDP) and present a Markov chain Monte Carlo sampling algorithm for the inference of the topic tree structure as well as the word distribution of each topic and the topic distribution of each document. Our theoretical analysis and experimental results show that this model can produce a more compact hierarchical topic structure and capture more fine-grained topic relationships compared to the hierarchical latent Dirichlet allocation model.
Funding: Funded by the Swiss National Science Foundation Project PlaceGen [grant number 200021_149823].
Abstract: User-Generated Content (UGC) provides a potential data source which can help us to better describe and understand how places are conceptualized, and in turn better represent the places in Geographic Information Science (GIScience). In this article, we aim at aggregating the shared meanings associated with places and linking these to a conceptual model of place. Our focus is on the metadata of Flickr images, in the form of locations and tags. We use topic modeling to identify regions associated with shared meanings. We choose a grid approach and generate topics associated with one or more cells using Latent Dirichlet Allocation. We analyze the sensitivity of our results to both grid resolution and the chosen number of topics using a range of measures including corpus distance and the coherence value. Using a resolution of 500 m and with 40 topics, we are able to generate meaningful topics which characterize places in London based on 954 unique tags associated with around 300,000 images and more than 7000 individuals.
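The grid approach in this abstract amounts to turning each grid cell into one "document" of tags before running LDA. A minimal sketch of that binning step follows, assuming photo locations are already projected to metres (the abstract does not state the coordinate system) and using a hypothetical `(x, y, tags)` record layout.

```python
from collections import defaultdict


def cell_documents(photos, cell_size=500.0):
    """Group photo tags into one bag-of-tags document per square grid cell.

    photos: iterable of (x, y, tags) with x and y in metres.
    cell_size: grid resolution in metres (500 m is the value in the abstract).
    """
    documents = defaultdict(list)
    for x, y, tags in photos:
        # Floor division maps every coordinate to the index of its cell.
        cell = (int(x // cell_size), int(y // cell_size))
        documents[cell].extend(tags)
    return dict(documents)
```

Re-running the binning with a different `cell_size` and refitting the topic model is one way to reproduce the kind of grid-resolution sensitivity analysis the abstract describes.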
Funding: This work was supported by the Project of Industry and University Cooperative Research of Jiangsu Province, China (No. BY2019051). Ma, J. would like to thank the Jiangsu Eazytec Information Technology Company (www.eazytec.com) for their financial support.
Abstract: Traditional topic models have been widely used for analyzing semantic topics from electronic documents. However, the topic words they produce have obvious defects: they are poor in readability and consistency, and only domain experts can guess their meaning. In fact, phrases are the main unit with which people express semantics. This paper presents Distributed Representation-Phrase Latent Dirichlet Allocation (DR-Phrase LDA), a phrase topic model. Specifically, we enhance the semantic information of phrases via distributed representation in this model. The experimental results show that the topics acquired by our model are more readable and consistent than those of other similar topic models.
Funding: Supported by the National Social Science Fund of China (Youth Program): “A Study of Acceptability of Chinese Government Public Signs in the New Era and the Countermeasures of the English Translation” (No.: 13CYY010); the Subject Construction and Management Project of Zhejiang Gongshang University: “Research on the Organic Integration Path of Constructing Ideological and Political Training and Design of Mixed Teaching Platform during Epidemic Period” (No.: XKJS2020007); and the Ministry of Education Industry-University Cooperative Education Program: “Research on the Construction of Cross-border Logistics Marketing Bilingual Course Integration” (No.: 202102494002).
Abstract: Retelling extraction is an important branch of Natural Language Processing (NLP), and high-quality retelling resources are very helpful for improving the performance of machine translation. However, traditional methods based on the bilingual parallel corpus often ignore the document background in the process of retelling acquisition and application. In order to solve this problem, we introduce topic model information into the translation model and propose a topic-based statistical machine translation method to improve translation performance. In this method, Probabilistic Latent Semantic Analysis (PLSA) is used to obtain the co-occurrence relationship between words and documents by hybrid matrix decomposition. Then we design a decoder to simplify the decoding process. Experiments show that the proposed method can effectively improve the accuracy of translation.
Funding: This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (NRF-2020R1A2C1014957).
Abstract: This study explored user satisfaction with mobile payments by applying a novel structural topic model. Specifically, we collected 17,927 online reviews of a specific mobile payment service (i.e., PayPal). Then, we employed a structural topic model to investigate the relationship between the attributes extracted from online reviews and user satisfaction with mobile payment. Consequently, we discovered that “lack of reliability” and “poor customer service” tend to appear in negative reviews. By contrast, the terms “convenience,” “user-friendly interface,” “simple process,” and “secure system” tend to appear in positive reviews. On the basis of information system success theory, we categorized the topics “convenience,” “user-friendly interface,” and “simple process” as system quality. In addition, “poor customer service” was categorized as service quality. Furthermore, based on previous studies of trust and security, “lack of reliability” and “secure system” were categorized as trust and security, respectively. These outcomes indicate that users are satisfied when they perceive that the system quality and security of specific mobile payments are great. On the contrary, users are dissatisfied when they feel that the service quality and reliability of specific mobile payments are lacking. Overall, our research implies that a novel structural topic model is an effective method to explore mobile payment user experience.