Abstract: This study is an exploratory analysis of applying natural language processing techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) and sentiment analysis to Twitter data. The novelty of this work lies in determining the overall sentiment of a politician’s tweets based on the TF-IDF values of the terms used in their published tweets. By calculating the TF-IDF value of terms from the corpus, this work displays the correlation between TF-IDF score and polarity. The results show that calculating TF-IDF scores over the corpus allows for a more accurate representation of the overall polarity, since terms are weighted by their uniqueness and relevance rather than just by the frequency at which they appear in the corpus.
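The weighting idea in the abstract above can be illustrated with a minimal sketch. It is not the study's implementation: the polarity lexicon, the toy tweets, the unsmoothed `log(N / df)` form of IDF, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy lexicon (term -> polarity in [-1, 1]); not from the study.
LEXICON = {"great": 1.0, "win": 0.5, "bad": -1.0, "crisis": -0.8}


def tfidf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Assumes tf = count / document length and idf = log(N / df), unsmoothed.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def weighted_polarity(doc_weights, lexicon):
    """TF-IDF-weighted mean polarity over lexicon terms (0.0 if no weight)."""
    num = sum(w * lexicon[t] for t, w in doc_weights.items() if t in lexicon)
    den = sum(w for t, w in doc_weights.items() if t in lexicon)
    return num / den if den > 0 else 0.0


def frequency_polarity(doc, lexicon):
    """Plain mean polarity, counting every occurrence of a lexicon term."""
    hits = [lexicon[t] for t in doc if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


# Toy corpus: "great" occurs in every tweet, so its idf (and weight) is zero.
tweets = [
    "great great great crisis".split(),
    "great win today".split(),
    "great bad news".split(),
]

weights = tfidf(tweets)
for doc, w in zip(tweets, weights):
    print(doc, frequency_polarity(doc, LEXICON), weighted_polarity(w, LEXICON))
```

In this toy corpus the term that appears in every tweet contributes nothing to the weighted score, so the first tweet's polarity is driven by its one distinctive term rather than by the repeated common one. This is the contrast between frequency-based and TF-IDF-based polarity that the abstract describes.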
Abstract: Anomaly detection (AD) is an important aspect of various domains, and title insurance (TI) is no exception. Robotic process automation (RPA) is taking over manual tasks in TI business processes, but it has its limitations without the support of artificial intelligence (AI) and machine learning (ML). With increasing data dimensionality and in composite population scenarios, the complexity of detecting anomalies increases, and AD in automated document management systems (ADMS) is the least explored domain. Deep learning, being the fastest-maturing technology, can be combined with traditional anomaly detectors to facilitate and improve RPA in TI. We present a hybrid model for AD using autoencoders (AE) and a one-class support vector machine (OSVM). In the present study, the OSVM receives input features representing real-time documents from the TI business, orchestrated and reduced in dimension by the AE. The results obtained from multiple experiments are comparable with traditional methods and within a business-acceptable range regarding accuracy and performance.
Abstract: The rapid expansion of online content and big data has created an urgent need for efficient summarization techniques that allow vast textual documents to be understood quickly without compromising their original integrity. Current approaches in Extractive Text Summarization (ETS) rely on modeling inter-sentence relationships, a task of paramount importance in producing coherent summaries. This study introduces a model that integrates Graph Attention Networks (GATs) with Bidirectional Encoder Representations from Transformers (BERT) and Latent Dirichlet Allocation (LDA), further enhanced by Term Frequency-Inverse Document Frequency (TF-IDF) values, to improve sentence selection by capturing comprehensive topical information. Our approach constructs a graph with nodes representing sentences, words, and topics, thereby increasing interconnectivity and enabling a more refined understanding of text structures. The model is extended from Single-Document Summarization to Multi-Document Summarization (MDS), offering significant improvements over existing models such as THGS-GMM and Topic-GraphSum, as demonstrated by empirical evaluations on benchmark news datasets such as Cable News Network (CNN)/Daily Mail (DM) and Multi-News. The results consistently demonstrate superior performance, showing the model’s robustness in handling complex summarization tasks across single- and multi-document contexts. This research not only advances the integration of BERT and LDA within GATs but also emphasizes the model’s capacity to manage global information effectively and adapt to diverse summarization challenges.
Funding: Supported by the National Natural Science Foundation of China (60503020, 60503033, 60703086), the Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow University (KJS0714), the Research Foundation of Nanjing University of Posts and Telecommunications (NY207052, NY207082), and the National Natural Science Foundation of Jiangsu (BK2006094).
Abstract: A new common phrase scoring method is proposed based on term frequency-inverse document frequency (TFIDF) and the independence of the phrase. Combining the two properties helps identify more reasonable common phrases, which improves the accuracy of clustering. An equation to measure the independence of a phrase is also proposed in this paper. The new algorithm, which improves the suffix tree clustering (STC) algorithm, is named improved suffix tree clustering (ISTC). To validate the proposed algorithm, a prototype system is implemented and used to cluster several groups of web search results obtained from the Google search engine. Experimental results show that the improved algorithm offers higher accuracy than traditional suffix tree clustering.
Abstract: With the upsurge in biomedical literature, using data-mining methods to discover new knowledge from the literature has drawn increasing attention from scholars. In this study, taking the mining of non-coding gene literature from the PubMed online database as an example, we first preprocessed the abstract data, next applied the term frequency (TF) and inverse document frequency (IDF) method (TF-IDF) to select features, and then established a biomedical literature data-mining model based on a Bayesian algorithm. Finally, we assessed the model using the area under the receiver operating characteristic curve (AUC), accuracy, specificity, sensitivity, precision rate and recall rate. When 1,000 features are selected, the AUC, specificity, sensitivity, accuracy rate, precision rate and recall rate are 0.8683, 84.63%, 89.02%, 86.83%, 89.02% and 98.14%, respectively. These results indicate that our method can effectively identify the targeted literature related to a particular topic.
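The evaluation measures named in the abstract above can be sketched from a binary confusion matrix, with AUC computed as the fraction of positive-negative pairs that the classifier ranks correctly. This is an illustrative sketch only: the labels, scores, 0.5 threshold and function names are assumptions, not the study's data or code. Under the standard definitions used here, recall is identical to sensitivity; the abstract reports different values for the two, so the study presumably defines at least one of them differently.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def metrics(y_true, y_pred):
    """Standard binary-classification metrics from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "sensitivity": safe_div(tp, tp + fn),  # same as recall in this sketch
        "specificity": safe_div(tn, tn + fp),
        "precision": safe_div(tp, tp + fp),
    }


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return safe_div(wins, len(pos) * len(neg))


# Hypothetical labels (1 = on-topic abstract) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(metrics(y_true, y_pred))
print(auc(y_true, scores))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based computation would be preferable for a corpus the size of a PubMed query result.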