This study is an exploratory analysis of applying natural language processing techniques such as Term Frequency-Inverse Document Frequency and Sentiment Analysis on Twitter data. The uniqueness of this work is establi...This study is an exploratory analysis of applying natural language processing techniques such as Term Frequency-Inverse Document Frequency and Sentiment Analysis on Twitter data. The uniqueness of this work is established by determining the overall sentiment of a politician’s tweets based on TF-IDF values of terms used in their published tweets. By calculating the TF-IDF value of terms from the corpus, this work displays the correlation between TF-IDF score and polarity. The results of this work show that calculating the TF-IDF score of the corpus allows for a more accurate representation of the overall polarity since terms are given a weight based on their uniqueness and relevance rather than just the frequency at which they appear in the corpus.展开更多
Anomaly detection(AD)is an important aspect of various domains and title insurance(TI)is no exception.Robotic process automation(RPA)is taking over manual tasks in TI business processes,but it has its limitations with...Anomaly detection(AD)is an important aspect of various domains and title insurance(TI)is no exception.Robotic process automation(RPA)is taking over manual tasks in TI business processes,but it has its limitations without the support of artificial intelligence(AI)and machine learning(ML).With increasing data dimensionality and in composite population scenarios,the complexity of detecting anomalies increases and AD in automated document management systems(ADMS)is the least explored domain.Deep learning,being the fastest maturing technology can be combined along with traditional anomaly detectors to facilitate and improve the RPAs in TI.We present a hybrid model for AD,using autoencoders(AE)and a one-class support vector machine(OSVM).In the present study,OSVM receives input features representing real-time documents from the TI business,orchestrated and with dimensions reduced by AE.The results obtained from multiple experiments are comparable with traditional methods and within a business acceptable range,regarding accuracy and performance.展开更多
A new common phrase scoring method is proposed according to term frequency-inverse document frequency(TFIDF)and independence of the phrase.Combining the two properties can help identify more reasonable common phrases,...A new common phrase scoring method is proposed according to term frequency-inverse document frequency(TFIDF)and independence of the phrase.Combining the two properties can help identify more reasonable common phrases,which improve the accuracy of clustering.Also,the equation to measure the in-dependence of a phrase is proposed in this paper.The new algorithm which improves suffix tree clustering algorithm(STC)is named as improved suffix tree clustering(ISTC).To validate the proposed algorithm,a prototype system is implemented and used to cluster several groups of web search results obtained from Google search engine.Experimental results show that the improved algorithm offers higher accuracy than traditional suffix tree clustering.展开更多
文摘This study is an exploratory analysis of applying natural language processing techniques such as Term Frequency-Inverse Document Frequency and Sentiment Analysis on Twitter data. The uniqueness of this work is established by determining the overall sentiment of a politician’s tweets based on TF-IDF values of terms used in their published tweets. By calculating the TF-IDF value of terms from the corpus, this work displays the correlation between TF-IDF score and polarity. The results of this work show that calculating the TF-IDF score of the corpus allows for a more accurate representation of the overall polarity since terms are given a weight based on their uniqueness and relevance rather than just the frequency at which they appear in the corpus.
文摘Anomaly detection(AD)is an important aspect of various domains and title insurance(TI)is no exception.Robotic process automation(RPA)is taking over manual tasks in TI business processes,but it has its limitations without the support of artificial intelligence(AI)and machine learning(ML).With increasing data dimensionality and in composite population scenarios,the complexity of detecting anomalies increases and AD in automated document management systems(ADMS)is the least explored domain.Deep learning,being the fastest maturing technology can be combined along with traditional anomaly detectors to facilitate and improve the RPAs in TI.We present a hybrid model for AD,using autoencoders(AE)and a one-class support vector machine(OSVM).In the present study,OSVM receives input features representing real-time documents from the TI business,orchestrated and with dimensions reduced by AE.The results obtained from multiple experiments are comparable with traditional methods and within a business acceptable range,regarding accuracy and performance.
基金Supported by the National Natural Science Foundation of China(60503020,60503033,60703086)Opening Foundation of Jiangsu Key Laboratory of Computer Information Processing Technology in Soochow Uni-versity(KJS0714)+1 种基金Research Foundation of Nanjing University of Posts and Telecommunications(NY207052,NY207082)National Natural Science Foundation of Jiangsu(BK2006094).
文摘A new common phrase scoring method is proposed according to term frequency-inverse document frequency(TFIDF)and independence of the phrase.Combining the two properties can help identify more reasonable common phrases,which improve the accuracy of clustering.Also,the equation to measure the in-dependence of a phrase is proposed in this paper.The new algorithm which improves suffix tree clustering algorithm(STC)is named as improved suffix tree clustering(ISTC).To validate the proposed algorithm,a prototype system is implemented and used to cluster several groups of web search results obtained from Google search engine.Experimental results show that the improved algorithm offers higher accuracy than traditional suffix tree clustering.