Named Entity Recognition (NER) is vital in natural language processing for the analysis of news texts, as it accurately identifies entities such as locations, persons, and organizations, which is crucial for applications like news summarization and event tracking. However, NER in the news domain faces challenges due to insufficient annotated data, complex entity structures, and strong context dependencies. To address these issues, we propose a new Chinese named entity recognition method that integrates transfer learning with word embeddings. Our approach leverages the ERNIE pre-trained model for transfer learning to obtain general language representations, and incorporates the Soft-lexicon word embedding technique to handle varied entity structures. This dual strategy enhances the model's understanding of context and boosts its ability to process complex texts. Experimental results show that our method achieves an F1 score of 94.72% on a news dataset, surpassing baseline methods by 3%–4%, thereby confirming its effectiveness for Chinese named entity recognition in the news domain.
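A minimal sketch of the transfer-learning half of such a pipeline: load a pre-trained ERNIE-style encoder and treat NER as token classification. The checkpoint name and label set are assumptions for illustration, and the Soft-lexicon branch described in the abstract is not reproduced.

```python
# Illustrative sketch, not the paper's implementation: fine-tunable token
# classifier on top of a pre-trained Chinese ERNIE-style encoder.
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

MODEL_NAME = "nghuyong/ernie-3.0-base-zh"   # assumed checkpoint, not from the paper
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(labels))

text = "新华社北京电"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # (1, seq_len, num_labels)
pred = logits.argmax(dim=-1)                 # per-token label ids (untrained head)
```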
Reviews have a significant impact on online businesses. Nowadays, online consumers rely heavily on other people's reviews before purchasing a product, instead of looking at the product description. With the emergence of technology, malicious online actors are using techniques such as Natural Language Processing (NLP) and others to generate a large number of fake reviews to destroy their competitors' markets. To remedy this situation, several studies have been conducted in the last few years. Most of them have applied NLP techniques to preprocess the text before building Machine Learning (ML) or Deep Learning (DL) models to detect and filter these fake reviews. However, with the same NLP techniques, machine-generated fake reviews are increasing exponentially. This work explores a powerful text representation technique called embedding models to combat the proliferation of fake reviews in online marketplaces. Indeed, these embedding structures can capture much more information from the data compared to other standard text representations. To do this, we tested our hypothesis in two different Recurrent Neural Network (RNN) architectures, namely Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), using fake review data from Amazon and TripAdvisor. Our experimental results show that our best-proposed model can distinguish between real and fake reviews with 91.44% accuracy. Furthermore, our results corroborate the state-of-the-art research in this area and demonstrate some improvements over other approaches. Therefore, proper text representation improves the accuracy of fake review detection.
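A hedged sketch of the kind of architecture the abstract describes: an embedding layer feeding a recurrent classifier for real-vs-fake review detection. Vocabulary size, dimensions, and the random batch are assumed toy values, not the authors' tuned model; swapping nn.LSTM for nn.GRU gives the GRU variant.

```python
# Illustrative only: embedding + LSTM binary classifier for fake-review detection.
import torch
import torch.nn as nn

class ReviewLSTM(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)      # single logit: fake vs. real

    def forward(self, token_ids):
        x = self.embed(token_ids)               # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)              # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1]).squeeze(-1)     # (batch,) logits

model = ReviewLSTM()
dummy_batch = torch.randint(1, 20000, (4, 50))  # 4 reviews, 50 tokens each
probs = torch.sigmoid(model(dummy_batch))       # predicted P(fake)
```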
Electronic medical record (EMR) text containing rich biomedical information has great potential in disease diagnosis and biomedical research. However, the EMR information is usually in the form of unstructured text, which increases the cost of use and hinders its applications. In this work, an effective named entity recognition (NER) method is presented for information extraction on Chinese EMR, achieved by word embedding bootstrapped deep active learning, to promote the acquisition of medical information from Chinese EMR and to release its value. Specifically, deep active learning with a bi-directional long short-term memory network followed by a conditional random field (Bi-LSTM+CRF) is used to capture the characteristics of different information from the labeled corpus, and the continuous bag-of-words and skip-gram word embedding models are combined in the above model to capture the text features of Chinese EMR from the unlabeled corpus. To evaluate the performance of the method, NER tasks on Chinese EMR with "medical history" content were used. Experimental results show that the word embedding bootstrapped deep active learning method using the unlabeled medical corpus can achieve better performance compared with other models.
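A small sketch of the word-embedding bootstrap step under stated assumptions: train continuous bag-of-words and skip-gram embeddings (gensim) on an unlabeled corpus and concatenate them as input features. The toy corpus is invented, and the Bi-LSTM+CRF active-learning loop itself is not shown.

```python
# Illustrative sketch: CBOW (sg=0) and skip-gram (sg=1) views of the same corpus.
import numpy as np
from gensim.models import Word2Vec

# toy unlabeled corpus: each record is a list of tokens (characters here)
corpus = [["患", "者", "既", "往", "高", "血", "压"],
          ["否", "认", "糖", "尿", "病", "病", "史"]]

cbow = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)

# concatenate the two views of a token into a single input feature for the tagger
vec = np.concatenate([cbow.wv["高"], skipgram.wv["高"]])   # 200-dim feature
```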
One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solve the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, typically upwards of hundreds of thousands of terms. As such, much of the research and development in NLP in the last two decades has been devoted to finding and optimizing solutions to this problem, in effect to feature selection in NLP. This paper looks at the development of these various techniques, leveraging a variety of statistical methods which rest on linguistic theories advanced in the middle of the last century, namely the distributional hypothesis, which suggests that words found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as a data structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants, which are typically referred to as word embeddings. In this review of algorithms such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally, beyond their applicability to NLP.
Two learning models, Zolu-continuous bag of words (ZL-CBOW) and Zolu-skip-gram (ZL-SG), based on the Zolu function are proposed. The slope of ReLU in word2vec is changed by the Zolu function. The proposed models can process extremely large data sets, as word2vec does, without increasing the complexity. Also, the models outperform several word embedding methods in both word similarity and syntactic accuracy. ZL-CBOW outperforms CBOW in accuracy by 8.43% on the capital-world training set and by 1.24% on the plural-verbs training set. Moreover, experimental simulations on word similarity and syntactic accuracy show that ZL-CBOW and ZL-SG are superior to LL-CBOW and LL-SG, respectively.
Aspect-based sentiment analysis aims to detect and classify sentiment polarities as negative, positive, or neutral while associating them with their identified aspects from the corresponding context. In this regard, prior methodologies widely utilize either word embeddings or tree-based representations. Meanwhile, the separate use of these deep features, such as word embeddings and tree-based dependencies, has become a significant cause of information loss. Generally, word embeddings preserve the syntactic and semantic relations between a pair of terms in a sentence, while the tree-based structure conserves the grammatical and logical dependencies of the context. In addition, the sentence-oriented word position is a critical factor that influences the contextual information of a targeted sentence; therefore, knowledge of the position-oriented information of words in a sentence is considered significant. In this study, we propose to use word embeddings, tree-based representations, and contextual position information in combination, and to evaluate whether their combination improves the effectiveness of the results. Their joint utilization enhances the accurate identification and extraction of targeted aspect terms, which also influences the classification process. We propose a method named Attention-Based Multi-Channel Convolutional Neural Network (Att-MC-CNN) that jointly utilizes these three deep features: word embeddings, tree-based structure, and contextual position information. These three inputs are delivered to a Multi-Channel Convolutional Neural Network (MC-CNN) that identifies and extracts the potential terms and classifies their polarities. In addition, these terms are further filtered with an attention mechanism, which determines the most significant words. The empirical analysis proves the proposed approach's effectiveness compared to existing techniques when evaluated on standard datasets. The experimental results show that our approach outperforms in the F1 measure, with an overall achievement of 94% in identifying aspects and 92% in the task of sentiment classification.
One of the issues in Computer Vision is the automatic development of descriptions for images, sometimes known as image captioning. Deep Learning techniques have made significant progress in this area. The typical architecture of image captioning systems consists mainly of an image feature extractor subsystem followed by a caption generation lingual subsystem. This paper aims to find optimized models for these two subsystems. For the image feature extraction subsystem, the research tested eight different concatenations of pairs of vision models to find among them the most expressive extracted feature vector of the image. For the caption generation lingual subsystem, this paper tested three different pre-trained language embedding models: GloVe (Global Vectors for Word Representation), BERT (Bidirectional Encoder Representations from Transformers), and TaCL (Token-aware Contrastive Learning), to select from them the most accurate pre-trained language embedding model. Our experiments showed that an image captioning system that uses a concatenation of the two Transformer-based models SWIN (Shifted Window) and PVT (Pyramid Vision Transformer) as an image feature extractor, combined with the TaCL language embedding model, gives the best result among the tested combinations.
Word embedding, which refers to low-dimensional dense vector representations of natural words, has demonstrated its power in many natural language processing tasks. However, it may suffer from the inaccurate and incomplete information contained in the free text corpus as training data. To tackle this challenge, there have been quite a few studies that leverage knowledge graphs as an additional information source to improve the quality of word embedding. Although these studies have achieved certain success, they have neglected some important facts about knowledge graphs: 1) many relationships in knowledge graphs are many-to-one, one-to-many or even many-to-many, rather than simply one-to-one; 2) most head entities and tail entities in knowledge graphs come from very different semantic spaces. To address these issues, in this paper, we propose a new algorithm named ProjectNet. ProjectNet models the relationships between head and tail entities after transforming them with different low-rank projection matrices. The low-rank projection can allow non-one-to-one relationships between entities, while different projection matrices for head and tail entities allow them to originate in different semantic spaces. The experimental results demonstrate that ProjectNet yields more accurate word embedding than previous studies, and thus leads to clear improvements in various natural language processing tasks.
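The abstract does not give ProjectNet's scoring function; one plausible translation-style form consistent with its description (relation-specific, low-rank projections applied separately to the head and tail embeddings) is sketched below. The matrices, factorization, and norm are assumptions for illustration, not the paper's notation.

```latex
% Hedged sketch, not the paper's equation: head and tail are projected by
% different low-rank, relation-specific matrices before a translation-style score.
% A_r = U_r V_r^{\top},\quad B_r = P_r Q_r^{\top}\ \text{(low-rank factorizations)}
f_r(h, t) = \left\| A_r \mathbf{h} + \mathbf{r} - B_r \mathbf{t} \right\|_2^2
```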
Long-document semantic measurement has great significance in many applications such as semantic search, plagiarism detection, and automatic technical surveys. However, research efforts have mainly focused on the semantic similarity of short texts; document-level semantic measurement remains an open issue due to problems such as the omission of background knowledge and topic transition. In this paper, we propose a novel semantic matching method for long documents in the academic domain. To accurately represent the general meaning of an academic article, we construct a semantic profile in which key semantic elements such as the research purpose, methodology, and domain are included and enriched. As such, we can obtain the overall semantic similarity of two papers by computing the distance between their profiles. The distances between the concepts of two different semantic profiles are measured by word vectors. To improve the semantic representation quality of the word vectors, we propose a joint word-embedding model that incorporates a domain-specific semantic relation constraint into the traditional context constraint. Our experimental results demonstrate that, in the measurement of document semantic similarity, our approach achieves substantial improvement over state-of-the-art methods, and our joint word-embedding model produces significantly better word representations than traditional word-embedding models.
Existing news recommendation algorithms lack in-depth analysis of news texts and timeliness. To address these issues, an algorithm for news recommendations based on time factor and word embedding (TFWE) was proposed to improve the interpretability and precision of news recommendations. First, TFWE used term frequency-inverse document frequency (TF-IDF) to extract news feature words and used the bidirectional encoder representations from transformers (BERT) pre-trained model to convert the feature words into vector representations. By calculating the distance between the vectors, TFWE analyzed the semantic similarity to construct a user interest model. Second, considering the timeliness of news, a method of calculating news popularity by integrating time factors into the similarity calculation was proposed. Finally, TFWE combined the similarity of news content with the similarity of collaborative filtering (CF) and recommended higher-ranked news items to users. In addition, results of experiments on a real dataset showed that TFWE significantly improved precision, recall, and F1 score compared to the classic hybrid recommendation algorithm.
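A minimal sketch of the content-similarity-plus-freshness idea, assuming TF-IDF features, cosine similarity, and an exponential time decay. The decay constant, weighting, and toy articles are illustrative assumptions, not the TFWE paper's exact formulas, and the CF component is omitted.

```python
# Illustrative only: rank articles by content similarity weighted by freshness.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = ["stock markets rally after rate decision",
            "central bank holds interest rates steady",
            "local team wins championship final"]
age_hours = np.array([2.0, 30.0, 5.0])            # time since publication

tfidf = TfidfVectorizer().fit_transform(articles)
sim_to_profile = cosine_similarity(tfidf[0], tfidf).ravel()   # article 0 stands in for a user profile

freshness = np.exp(-age_hours / 24.0)             # assumed 24-hour decay constant
score = 0.7 * sim_to_profile + 0.3 * freshness    # assumed weighting of the two signals
ranking = np.argsort(-score)                      # indices of recommended articles, best first
```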
Most word embedding models have the following problems: (1) in models based on bag-of-words contexts, the structural relations of sentences are completely neglected; (2) each word uses a single embedding, which makes the model indiscriminative for polysemous words; (3) word embeddings easily tend toward the contextual structure similarity of sentences. To solve these problems, we propose an easy-to-use representation algorithm of syntactic word embedding (SWE). The main procedures are: (1) a polysemous tagging algorithm based on latent Dirichlet allocation (LDA) is used for polysemous representation; (2) the symbols '+' and '-' are adopted to indicate the directions of the dependency syntax; (3) stop-words and their dependencies are deleted; (4) dependency skip is applied to connect indirect dependencies; (5) dependency-based contexts are input to a word2vec model. Experimental results show that our model generates desirable word embeddings in similarity evaluation tasks. Besides, semantic and syntactic features can be captured from dependency-based syntactic contexts, exhibiting less topical and more syntactic similarity. We conclude that SWE outperforms single embedding learning models.
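A rough illustration of the dependency-based-context idea only: build head/dependent pairs with '+'/'-' direction markers (as the abstract describes), drop stop-words, and feed the pairs to word2vec as short pseudo-sentences. The parser, the English example, and the pair-as-sentence trick are stand-ins for the paper's actual pipeline, which also includes LDA-based polysemy tagging and dependency skips.

```python
# Illustrative approximation of dependency-based contexts for word2vec.
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")                 # assumes the small English model is installed
doc = nlp("The committee approved the new budget quickly.")

contexts = []
for tok in doc:
    if tok.is_stop or tok.is_punct or tok.dep_ == "ROOT":
        continue                                   # drop stop-words/punctuation, keep content dependencies
    # '+' marks the head seen from the dependent, '-' the reverse direction
    contexts.append([tok.text.lower(), "+" + tok.head.text.lower()])
    contexts.append([tok.head.text.lower(), "-" + tok.text.lower()])

model = Word2Vec(sentences=contexts, vector_size=50, window=1, min_count=1, sg=1)
```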
Word embedding acts as one of the backbones of modern natural language processing (NLP). Recently, with the need to deploy NLP models on low-resource devices, there has been a surge of interest in compressing word embeddings into hash codes or binary vectors so as to save storage and memory consumption. Typically, existing work learns to encode an embedding into a compressed representation from which the original embedding can be reconstructed. Although these methods aim to preserve most information of every individual word, they often fail to retain the relations between words and thus can yield large losses on certain tasks. To this end, this paper presents Relation Reconstructive Binarization (R2B) to transform word embeddings into binary codes that preserve the relations between words. At its heart, R2B trains an auto-encoder to generate binary codes that allow reconstructing the word-by-word relations in the original embedding space. Experiments showed that our method achieved significant improvements over previous methods on a number of tasks, along with space savings of up to 98.4%. Specifically, our method reached even better results on word similarity evaluation than the uncompressed pre-trained embeddings, and was significantly better than previous compression methods that do not consider word relations.
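A minimal sketch of a binarizing auto-encoder, assuming toy sizes and a plain reconstruction loss; the actual R2B objective reconstructs word-by-word relations rather than individual vectors, and its training details follow the paper, not this toy.

```python
# Illustrative only: auto-encoder with straight-through binarization of the code.
import torch
import torch.nn as nn

class BinarizingAutoEncoder(nn.Module):
    def __init__(self, embed_dim=300, code_bits=128):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, code_bits)
        self.decoder = nn.Linear(code_bits, embed_dim)

    def forward(self, x):
        h = torch.tanh(self.encoder(x))
        # straight-through estimator: forward pass uses signs, gradients flow through tanh
        b = torch.sign(h).detach() + h - h.detach()
        return self.decoder(b), b

ae = BinarizingAutoEncoder()
emb = torch.randn(32, 300)                     # a batch of pre-trained word vectors (random stand-in)
recon, codes = ae(emb)
loss = nn.functional.mse_loss(recon, emb)      # toy per-vector loss; R2B reconstructs pairwise relations instead
```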
Word embedding has drawn a lot of attention due to its usefulness in many NLP tasks. So far a handful of neural-network based word embedding algorithms have been proposed without considering the effects of pronouns in the training corpus. In this paper, we propose using co-reference resolution to improve the word embedding by extracting better context. We evaluate four word embeddings with considerations of co-reference resolution and compare the quality of word embedding on the task of word analogy and word similarity on multiple data sets. Experiments show that by using co-reference resolution, the word embedding performance in the word analogy task can be improved by around 1.88%. We find that the words that are names of countries are affected the most, which is as expected.
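A toy illustration of the idea: substitute pronouns with their antecedents before training word2vec, so pronoun positions contribute real entity contexts. The coreference output is hard-coded here as an assumed mapping; a real pipeline would obtain it from a coreference resolver.

```python
# Illustrative only: pronoun substitution followed by ordinary word2vec training.
from gensim.models import Word2Vec

sentence = ["france", "raised", "rates", "because", "it", "feared", "inflation"]
coref_map = {4: "france"}     # token index 4 ("it") -> antecedent (assumed resolver output)

resolved = [coref_map.get(i, tok) for i, tok in enumerate(sentence)]
model = Word2Vec(sentences=[resolved], vector_size=50, window=3, min_count=1)
```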
This research paper provides the methodology and design for implementing a hybrid author recommender system using Azure Data Lake Analytics and Power BI. It offers a recommendation of the top 1000 authors of computer science in different fields of study. The technique used in this paper handles inadequate information for citation; it removes the cold-start problem, which is encountered by many other recommender systems. In this paper, the abstracts, the titles, and the Microsoft Academic Graph have been used in producing the recommendation list for every document, which is used to combine the content-based approaches and the co-citations. Prioritization and blending of each technique are enabled by tuning system parameters, allowing a trade-off between the authority of the recommendation results and paper novelty. In the end, we observe that there is a direct correlation between the similarity rankings produced by the system and the scores of the participants. The results coming from the associated analysis scripts and the user survey have been made available through the recommendation system. Managers must gain the required expertise to fully utilize the benefits that come with business intelligence systems [1]. Data mining has become an important tool for managers that provides insights about their daily operations and leverages the information provided by decision support systems to improve customer relationships [2]. Additionally, managers require business intelligence systems that can rank the output in order of priority. Ranking algorithms can replace the traditional data mining algorithms, which will be discussed in depth in the literature review [3].
Social perception refers to how individuals interpret and understand the social world. It is a foundational area of theory and measurement within the social sciences, particularly in communication, political science, psychology, and sociology. Classical models include the Stereotype Content Model (SCM), Dual Perspective Model (DPM), and Semantic Differential (SD). Extensive research has been conducted on these models. However, their interrelationships are still difficult to define using conventional comparison methods, which often lack efficiency, validity, and scalability. To tackle this challenge, we employ a text-based computational approach to quantitatively represent each theoretical dimension of the models. Specifically, we map key content dimensions into a shared semantic space using word embeddings and automate the selection of over 500 contrasting word pairs based on semantic differential theory. The results suggest that social perception can be organized around two fundamental components: subjective evaluation (e.g., how good or likable someone is) and objective attributes (e.g., power or competence). Furthermore, we validate this computational approach with the widely used Rosenberg's 64 personality traits, demonstrating improvements in predictive performance over previous methods, with increases of 19%, 13%, and 4% for the SD, DPM, and SCM dimensions, respectively. By enabling scalable and interpretable comparisons across these models, our findings would facilitate both theoretical integration and practical applications.
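A small sketch of mapping a content dimension into a shared embedding space: average the difference vectors of contrasting word pairs to form an axis, then score target words by projection onto it. The example pairs and the small pre-trained vectors are placeholders, not the study's curated 500+ pairs.

```python
# Illustrative only: a semantic-differential-style evaluation axis from word pairs.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")        # small pre-trained vectors, chosen for illustration

pairs = [("good", "bad"), ("pleasant", "unpleasant"), ("kind", "cruel")]
axis = np.mean([wv[p] - wv[n] for p, n in pairs], axis=0)
axis /= np.linalg.norm(axis)

for word in ["friendly", "hostile", "powerful"]:
    v = wv[word] / np.linalg.norm(wv[word])
    print(word, float(v @ axis))               # projection on the evaluation axis
```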
Social media data are rapidly increasing and constitute a source of user opinions and tips on a wide range of products and services. The increasing availability of such big data on biased reviews and blogs creates challenges for customers and businesses in reviewing all content in their decision-making process. To overcome this challenge, extracting suggestions from opinionated text is a possible solution. In this study, the characteristics of suggestions are analyzed and a suggestion mining extraction process is presented for classifying suggestive sentences from online customers' reviews. A classification using a word-embedding approach is performed via the XGBoost classifier. The two datasets used in this experiment relate to online hotel reviews and Microsoft Windows App Studio discussion reviews. F1, precision, recall, and accuracy scores are calculated. The results demonstrated that the XGBoost classifier outperforms the alternatives, with an accuracy of more than 80%. Moreover, the results revealed that suggestion keywords and phrases are the predominant features for suggestion extraction. Thus, this study contributes to knowledge and practice by comparing feature extraction classifiers and identifying XGBoost as the better suggestion mining process for identifying online reviews.
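A hedged sketch of a word-embedding plus XGBoost pipeline of the kind the abstract describes: represent each sentence as the average of its word vectors and feed that to an XGBoost classifier. The toy sentences, labels, dimensions, and hyperparameters are assumed values, not the study's setup.

```python
# Illustrative only: averaged word vectors as features for an XGBoost suggestion classifier.
import numpy as np
from gensim.models import Word2Vec
from xgboost import XGBClassifier

reviews = [["the", "hotel", "room", "was", "lovely"],
           ["you", "should", "add", "a", "late", "checkout", "option"]]
labels = [0, 1]                                  # 0 = non-suggestion, 1 = suggestion (toy labels)

w2v = Word2Vec(sentences=reviews, vector_size=50, min_count=1)
X = np.array([np.mean([w2v.wv[t] for t in r], axis=0) for r in reviews])

clf = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
clf.fit(X, labels)
print(clf.predict(X))
```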
Aspect extraction is a critical task in aspect-based sentiment analysis, including explicit and implicit aspect identification. While extensive research has addressed explicit aspects, little effort has been put into implicit aspect extraction due to the complexity of the problem. Moreover, existing research on implicit aspect identification is widely carried out on product reviews targeting specific aspects while neglecting sentence dependency problems. Therefore, in this paper, a multi-level knowledge engineering approach for identifying implicit movie aspects is proposed. The proposed method first identifies explicit aspects using a variant of BiLSTM-CRF (Bidirectional Long Short-Term Memory with a Conditional Random Field), which serves as a memory to process dependent sentences and infer implicit aspects. It can identify implicit aspects from four types of sentences, including independent sentences and three types of dependent sentences. The study is evaluated on a large movie review dataset with 50k examples. The experimental results showed that the explicit aspect identification method achieved an 89% F1-score and the implicit aspect extraction method achieved a 76% F1-score. In addition, the proposed approach also performs better than the state-of-the-art techniques (NMFIAD and ML-KB+) on the product review dataset, where it achieved 93% precision, 92% recall, and 93% F1-score.
Purpose: The paper aims to enhance Arabic machine translation (MT) by proposing novel approaches: (1) a dimensionality reduction technique for word embeddings tailored for Arabic text, optimizing efficiency while retaining semantic information; (2) a comprehensive comparison of meta-embedding techniques to improve translation quality; and (3) a method leveraging self-attention and Gated CNNs to capture token dependencies, including temporal and hierarchical features within sentences, and interactions between different embedding types. These approaches collectively aim to enhance translation quality by combining different embedding schemes and leveraging advanced modeling techniques. Design/methodology/approach: Recent works on MT in general, and Arabic MT in particular, often pick one type of word embedding model. In this paper, we present a novel approach to enhance Arabic MT by addressing three key aspects. Firstly, we propose a new dimensionality reduction technique for word embeddings, specifically tailored for Arabic text. This technique optimizes the efficiency of the embeddings while retaining their semantic information. Secondly, we conduct an extensive comparison of different meta-embedding techniques, exploring the combination of static and contextual embeddings. Through this analysis, we identify the most effective approach to improve translation quality. Lastly, we introduce a novel method that leverages self-attention and gated convolutional neural networks (CNNs) to capture token dependencies, including temporal and hierarchical features within sentences, as well as interactions between different types of embeddings. Our experimental results demonstrate the effectiveness of the proposed approach in significantly enhancing Arabic MT performance. It outperforms baseline models with a BLEU score increase of 2 points and achieves superior results compared to state-of-the-art approaches, with an average improvement of 4.6 points across all evaluation metrics. Findings: The proposed approaches significantly enhance Arabic MT performance. The dimensionality reduction technique improves the efficiency of word embeddings while preserving semantic information. The comprehensive comparison identifies effective meta-embedding techniques, with the contextualized dynamic meta-embeddings (CDME) model showing competitive results. Integration of Gated CNNs with the transformer model surpasses baseline performance, leveraging the strengths of both architectures. Overall, these findings demonstrate substantial improvements in translation quality, with a BLEU score increase of 2 points and an average improvement of 4.6 points across all evaluation metrics, outperforming state-of-the-art approaches. Originality/value: The paper's originality lies in its departure from simply fine-tuning the transformer model for a specific task. Instead, it introduces modifications to the internal architecture of the transformer, integrating Gated CNNs to enhance translation performance. This departure from traditional fine-tuning approaches demonstrates a novel perspective on model enhancement, offering unique insights into improving translation quality without relying solely on pre-existing architectures. The originality of the dimensionality reduction lies in the approach tailored for Arabic text. While dimensionality reduction techniques are not new, the paper introduces a specific method optimized for Arabic word embeddings. By employing independent component analysis (ICA) and a post-processing method, the paper effectively reduces the dimensionality of word embeddings while preserving semantic information, which has not been investigated before, especially for the MT task.
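A minimal sketch of the ICA-based dimensionality reduction idea, assuming a generic embedding matrix (random data stands in for the Arabic embeddings); the paper's post-processing step and the chosen target dimensionality are not reproduced here.

```python
# Illustrative only: reduce a 300-d embedding matrix to 128 independent components.
import numpy as np
from sklearn.decomposition import FastICA

embeddings = np.random.randn(5000, 300)          # stand-in for an Arabic word-embedding matrix
ica = FastICA(n_components=128, random_state=0, max_iter=500)
reduced = ica.fit_transform(embeddings)          # (5000, 128) reduced embeddings
```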
Purpose: With more and more digital collections of various information resources becoming available, the challenge of assigning subject index terms and classes from quality knowledge organization systems is also increasing. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC. Design/methodology/approach: State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research involved 143,838 records, which had to be reduced to the top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels). Findings: Evaluation shows that a Support Vector Machine with linear kernel outperforms the other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes when the characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than the Support Vector Machine, but reached close results, with the benefit of a smaller representation size. Analysis of feature impact in machine learning shows that using keywords or combining titles and keywords gives better results than using only titles as input. Stemming only marginally improves the results. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available in the training set, and 66.13% when too few records (often less than 100 per class) are available on which to train, and these figures hold only for the top three hierarchical levels (803 instead of 14,413 classes). Research limitations: Having to reduce the number of hierarchical levels to the top three levels of DDC because of the lack of training data for all classes skews the results, so that they work in experimental conditions but barely for end users in operational retrieval systems. Practical implications: In conclusion, for operational information retrieval systems, applying purely automatic DDC does not work, either using machine learning (because of the lack of training data for the large number of DDC classes) or using a string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance the accuracy of automatic classification, which may also benefit information retrieval performance based on DDC. In order for quality information services to reach the objective of the highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with the quality of human decisions at the final stage should be the way forward. Originality/value: The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to the lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further since it provides potential for real-life applications with large target classification systems.
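A minimal sketch of the best-performing setup reported (a linear-kernel SVM on titles combined with keywords). The records and DDC class labels below are invented examples, not the Swedish collection, and the feature pipeline is only one reasonable choice.

```python
# Illustrative only: TF-IDF features of title+keywords fed to a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["climate change and coastal ecosystems; keywords: ecology, sea level",
         "introduction to group theory; keywords: algebra, symmetry",
         "marine biology field methods; keywords: ecology, sampling"]
ddc_top3 = ["577", "512", "578"]                  # assumed top-three-level DDC classes

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, ddc_top3)
print(clf.predict(["symmetry groups in physics; keywords: algebra"]))
```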
Purpose: The ever-increasing penetration of the Internet in our lives has led to an enormous amount of multimedia content generation on the internet. Textual data contributes a major share of the data generated on the world wide web. Understanding people's sentiment is an important aspect of natural language processing, but this opinion can be biased and incorrect if people use sarcasm while commenting, posting status updates, or reviewing any product or a movie. Thus, it is of utmost importance to detect sarcasm correctly and make a correct prediction about people's intentions. Design/methodology/approach: This study tries to evaluate various machine learning models along with standard and hybrid deep learning models across various standardized datasets. We have performed vectorization of text using word embedding techniques; this has been done to convert the textual data into vectors for analytical purposes. We have used three standardized datasets available in the public domain and three word embeddings, i.e., Word2Vec, GloVe, and fastText, to validate the hypothesis. Findings: The results were analyzed and conclusions were drawn. The key finding is that the hybrid models which include Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) outperform other conventional machine learning as well as deep learning models across all the datasets considered in this study, making our hypothesis valid. Research limitations: Using data from different sources and customizing the models according to each dataset slightly decreases the usability of the technique. But, overall, this methodology provides effective measures to identify the presence of sarcasm with a minimum average accuracy of 80% or above for one dataset and better than the current baseline results for the other datasets. Practical implications: The results provide solid insights for system developers to integrate this model into real-time analysis of any review or comment posted in the public domain. This study has various other practical implications for businesses that depend on user ratings and public opinions. This study also provides a launching platform for various researchers to work on the problem of sarcasm identification in textual data. Originality/value: This is a first-of-its-kind study that shows the difference between conventional and hybrid methods for predicting sarcasm in textual data. The study also provides possible indicators that hybrid models are better when applied to textual data for analysis of sarcasm.
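A rough sketch of a hybrid CNN plus Bi-LSTM sarcasm classifier over word-embedding inputs, in the spirit of the models the abstract reports as strongest. Layer sizes, vocabulary, and the random batch are assumptions, not the study's tuned architecture; the embedding layer would normally be initialized from Word2Vec, GloVe, or fastText vectors.

```python
# Illustrative only: CNN feature extraction followed by a bidirectional LSTM.
import torch
import torch.nn as nn

class SarcasmCnnBiLstm(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=300, conv_channels=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)    # could load pre-trained vectors here
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(conv_channels, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, 1)

    def forward(self, ids):
        x = self.embed(ids).transpose(1, 2)          # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x)).transpose(1, 2) # (batch, seq_len, conv_channels)
        _, (h, _) = self.bilstm(x)
        h = torch.cat([h[-2], h[-1]], dim=-1)        # concat final forward/backward states
        return self.fc(h).squeeze(-1)                # sarcasm logit per example

model = SarcasmCnnBiLstm()
logits = model(torch.randint(0, 30000, (2, 40)))     # 2 toy sequences, 40 tokens each
```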
基金funded by Advanced Research Project(30209040702).
文摘Named Entity Recognition(NER)is vital in natural language processing for the analysis of news texts,as it accurately identifies entities such as locations,persons,and organizations,which is crucial for applications like news summarization and event tracking.However,NER in the news domain faces challenges due to insufficient annotated data,complex entity structures,and strong context dependencies.To address these issues,we propose a new Chinesenamed entity recognition method that integrates transfer learning with word embeddings.Our approach leverages the ERNIE pre-trained model for transfer learning and obtaining general language representations and incorporates the Soft-lexicon word embedding technique to handle varied entity structures.This dual-strategy enhances the model’s understanding of context and boosts its ability to process complex texts.Experimental results show that our method achieves an F1 score of 94.72% on a news dataset,surpassing baseline methods by 3%–4%,thereby confirming its effectiveness for Chinese-named entity recognition in the news domain.
文摘Reviews have a significant impact on online businesses.Nowadays,online consumers rely heavily on other people’s reviews before purchasing a product,instead of looking at the product description.With the emergence of technology,malicious online actors are using techniques such as Natural Language Processing(NLP)and others to generate a large number of fake reviews to destroy their competitors’markets.To remedy this situation,several researches have been conducted in the last few years.Most of them have applied NLP techniques to preprocess the text before building Machine Learning(ML)or Deep Learning(DL)models to detect and filter these fake reviews.However,with the same NLP techniques,machine-generated fake reviews are increasing exponentially.This work explores a powerful text representation technique called Embedding models to combat the proliferation of fake reviews in online marketplaces.Indeed,these embedding structures can capture much more information from the data compared to other standard text representations.To do this,we tested our hypothesis in two different Recurrent Neural Network(RNN)architectures,namely Long Short-Term Memory(LSTM)and Gated Recurrent Unit(GRU),using fake review data from Amazon and TripAdvisor.Our experimental results show that our best-proposed model can distinguish between real and fake reviews with 91.44%accuracy.Furthermore,our results corroborate with the state-of-the-art research in this area and demonstrate some improvements over other approaches.Therefore,proper text representation improves the accuracy of fake review detection.
基金the Artificial Intelligence Innovation and Development Project of Shanghai Municipal Commission of Economy and Information (No. 2019-RGZN-01081)。
文摘Electronic medical record (EMR) containing rich biomedical information has a great potential in disease diagnosis and biomedical research. However, the EMR information is usually in the form of unstructured text, which increases the use cost and hinders its applications. In this work, an effective named entity recognition (NER) method is presented for information extraction on Chinese EMR, which is achieved by word embedding bootstrapped deep active learning to promote the acquisition of medical information from Chinese EMR and to release its value. In this work, deep active learning of bi-directional long short-term memory followed by conditional random field (Bi-LSTM+CRF) is used to capture the characteristics of different information from labeled corpus, and the word embedding models of contiguous bag of words and skip-gram are combined in the above model to respectively capture the text feature of Chinese EMR from unlabeled corpus. To evaluate the performance of above method, the tasks of NER on Chinese EMR with “medical history” content were used. Experimental results show that the word embedding bootstrapped deep active learning method using unlabeled medical corpus can achieve a better performance compared with other models.
文摘One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solves the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, upwards of hundreds of thousands of terms typically. As such, much of the research and development in NLP in the last two decades has been in finding and optimizing solutions to this problem, to feature selection in NLP effectively. This paper looks at the development of these various techniques, leveraging a variety of statistical methods which rest on linguistic theories that were advanced in the middle of the last century, namely the distributional hypothesis which suggests that words that are found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as data structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants which are typically referred to as word embeddings. In this review of algoriths such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally beyond applicability to NLP.
基金Supported by the National Natural Science Foundation of China(61771051,61675025)。
文摘Two learning models,Zolu-continuous bags of words(ZL-CBOW)and Zolu-skip-grams(ZL-SG),based on the Zolu function are proposed.The slope of Relu in word2vec has been changed by the Zolu function.The proposed models can process extremely large data sets as well as word2vec without increasing the complexity.Also,the models outperform several word embedding methods both in word similarity and syntactic accuracy.The method of ZL-CBOW outperforms CBOW in accuracy by 8.43%on the training set of capital-world,and by 1.24%on the training set of plural-verbs.Moreover,experimental simulations on word similarity and syntactic accuracy show that ZL-CBOW and ZL-SG are superior to LL-CBOW and LL-SG,respectively.
基金supported by the Deanship of Scientific Research,Vice Presidency for Graduate Studies and Scientific Research,King Faisal University,Saudi Arabia[Grant No.3418].
文摘Aspect-based sentiment analysis aims to detect and classify the sentiment polarities as negative,positive,or neutral while associating them with their identified aspects from the corresponding context.In this regard,prior methodologies widely utilize either word embedding or tree-based rep-resentations.Meanwhile,the separate use of those deep features such as word embedding and tree-based dependencies has become a significant cause of information loss.Generally,word embedding preserves the syntactic and semantic relations between a couple of terms lying in a sentence.Besides,the tree-based structure conserves the grammatical and logical dependencies of context.In addition,the sentence-oriented word position describes a critical factor that influences the contextual information of a targeted sentence.Therefore,knowledge of the position-oriented information of words in a sentence has been considered significant.In this study,we propose to use word embedding,tree-based representation,and contextual position information in combination to evaluate whether their combination will improve the result’s effectiveness or not.In the meantime,their joint utilization enhances the accurate identification and extraction of targeted aspect terms,which also influences their classification process.In this research paper,we propose a method named Attention Based Multi-Channel Convolutional Neural Net-work(Att-MC-CNN)that jointly utilizes these three deep features such as word embedding with tree-based structure and contextual position informa-tion.These three parameters deliver to Multi-Channel Convolutional Neural Network(MC-CNN)that identifies and extracts the potential terms and classifies their polarities.In addition,these terms have been further filtered with the attention mechanism,which determines the most significant words.The empirical analysis proves the proposed approach’s effectiveness compared to existing techniques when evaluated on standard datasets.The experimental results represent our approach outperforms in the F1 measure with an overall achievement of 94%in identifying aspects and 92%in the task of sentiment classification.
文摘One of the issues in Computer Vision is the automatic development of descriptions for images,sometimes known as image captioning.Deep Learning techniques have made significant progress in this area.The typical architecture of image captioning systems consists mainly of an image feature extractor subsystem followed by a caption generation lingual subsystem.This paper aims to find optimized models for these two subsystems.For the image feature extraction subsystem,the research tested eight different concatenations of pairs of vision models to get among them the most expressive extracted feature vector of the image.For the caption generation lingual subsystem,this paper tested three different pre-trained language embedding models:Glove(Global Vectors for Word Representation),BERT(Bidirectional Encoder Representations from Transformers),and TaCL(Token-aware Contrastive Learning),to select from them the most accurate pre-trained language embedding model.Our experiments showed that building an image captioning system that uses a concatenation of the two Transformer based models SWIN(Shiftedwindow)and PVT(PyramidVision Transformer)as an image feature extractor,combined with the TaCL language embedding model is the best result among the other combinations.
文摘Word embedding, which refers to low-dimensional dense vector representations of natural words, has demon- strated its power in many natural language processing tasks. However, it may suffer from the inaccurate and incomplete information contained in the free text corpus as training data. To tackle this challenge, there have been quite a few studies that leverage knowledge graphs as an additional information source to improve the quality of word embedding. Although these studies have achieved certain success, they have neglected some important facts about knowledge graphs: 1) many relationships in knowledge graphs are many-to-one, one-to-many or even many-to-many, rather than simply one-to-one; 2) most head entities and tail entities in knowledge graphs come from very different semantic spaces. To address these issues, in this paper, we propose a new algorithm named ProjectNet. ProjectNet models the relationships between head and tail entities after transforming them with different low-rank projection matrices. The low-rank projection can allow non one- to-one relationships between entities, while different projection matrices for head and tail entities allow them to originate in different semantic spaces. The experimental results demonstrate that ProjectNet yields more accurate word embedding than previous studies, and thus leads to clear improvements in various natural language processing tasks.
基金supported by the Foundation of the State Key Laboratory of Software Development Environment(No.SKLSDE-2015ZX-04)
文摘Long-document semantic measurement has great significance in many applications such as semantic searchs, plagiarism detection, and automatic technical surveys. However, research efforts have mainly focused on the semantic similarity of short texts. Document-level semantic measurement remains an open issue due to problems such as the omission of background knowledge and topic transition. In this paper, we propose a novel semantic matching method for long documents in the academic domain. To accurately represent the general meaning of an academic article, we construct a semantic profile in which key semantic elements such as the research purpose, methodology, and domain are included and enriched. As such, we can obtain the overall semantic similarity of two papers by computing the distance between their profiles. The distances between the concepts of two different semantic profiles are measured by word vectors. To improve the semantic representation quality of word vectors, we propose a joint word-embedding model for incorporating a domain-specific semantic relation constraint into the traditional context constraint. Our experimental results demonstrate that, in the measurement of document semantic similarity, our approach achieves substantial improvement over state-of-the-art methods, and our joint word-embedding model produces significantly better word representations than traditional word-embedding models.
基金supported by the Research Program of the Basic Scientific Research of National Defense of China (JCKY2019210B005, JCKY2018204B025, and JCKY2017204B011)the Key Scientific Project Program of National Defense of China (ZQ2019D20401 )+2 种基金the Open Program of National Engineering Laboratory for Modeling and Emulation in E-Government (MEL-20-02 )the Foundation Strengthening Project of China (2019JCJZZD13300 )the Jiangsu Postgraduate Research and Innovation Program (KYCX20_0824)。
文摘Existing algorithms of news recommendations lack in depth analysis of news texts and timeliness. To address these issues, an algorithm for news recommendations based on time factor and word embedding(TFWE) was proposed to improve the interpretability and precision of news recommendations. First, TFWE used term frequency-inverse document frequency(TF-IDF) to extract news feature words and used the bidirectional encoder representations from transformers(BERT) pre-training model to convert the feature words into vector representations. By calculating the distance between the vectors, TFWE analyzed the semantic similarity to construct a user interest model. Second, considering the timeliness of news, a method of calculating news popularity by integrating time factors into the similarity calculation was proposed. Finally, TFWE combined the similarity of news content with the similarity of collaborative filtering(CF) and recommended some news with higher rankings to users. In addition, results of the experiments on real dataset showed that TFWE significantly improved precision, recall, and F1 score compared to the classic hybrid recommendation algorithm.
基金Project supported by the National Natural Science Foundation of China(Nos.61663041 and 61763041)the Program for Changjiang Scholars and Innovative Research Team in Universities,China(No.IRT_15R40)+2 种基金the Research Fund for the Chunhui Program of Ministry of Education of China(No.Z2014022)the Natural Science Foundation of Qinghai Province,China(No.2014-ZJ-721)the Fundamental Research Funds for the Central Universities,China(No.2017TS045)
文摘Most word embedding models have the following problems:(1)In the models based on bag-of-words contexts,the structural relations of sentences are completely neglected;(2)Each word uses a single embedding,which makes the model indiscriminative for polysemous words;(3)Word embedding easily tends to contextual structure similarity of sentences.To solve these problems,we propose an easy-to-use representation algorithm of syntactic word embedding(SWE).The main procedures are:(1)A polysemous tagging algorithm is used for polysemous representation by the latent Dirichlet allocation(LDA)algorithm;(2)Symbols‘+’and‘-’are adopted to indicate the directions of the dependency syntax;(3)Stopwords and their dependencies are deleted;(4)Dependency skip is applied to connect indirect dependencies;(5)Dependency-based contexts are inputted to a word2vec model.Experimental results show that our model generates desirable word embedding in similarity evaluation tasks.Besides,semantic and syntactic features can be captured from dependency-based syntactic contexts,exhibiting less topical and more syntactic similarity.We conclude that SWE outperforms single embedding learning models.
基金The reseach work was supported by the National Key Research and Development Program of China(2017YFB1002104)the National Natural Science Foundation of China(Grant Nos.92046003,61976204,U1811461)Xiang Ao was also supported by the Project of Youth Innovation Promotion Association CAS and Beijing Nova Program(Z201100006820062).
文摘Word-embedding acts as one of the backbones of modern natural language processing(NLP).Recently,with the need for deploying NLP models to low-resource devices,there has been a surge of interest to compress word embeddings into hash codes or binary vectors so as to save the storage and memory consumption.Typically,existing work learns to encode an embedding into a compressed representation from which the original embedding can be reconstructed.Although these methods aim to preserve most information of every individual word,they often fail to retain the relation between words,thus can yield large loss on certain tasks.To this end,this paper presents Relation Reconstructive Binarization(R2B)to transform word embeddings into binary codes that can preserve the relation between words.At its heart,R2B trains an auto-encoder to generate binary codes that allow reconstructing the wordby-word relations in the original embedding space.Experiments showed that our method achieved significant improvements over previous methods on a number of tasks along with a space-saving of up to 98.4%.Specifically,our method reached even better results on word similarity evaluation than the uncompressed pre-trained embeddings,and was significantly better than previous compression methods that do not consider word relations.
基金supported by the National HighTech Research and Development(863)Program(No.2015AA015401)the National Natural Science Foundation of China(Nos.61533018 and 61402220)+2 种基金the State Scholarship Fund of CSC(No.201608430240)the Philosophy and Social Science Foundation of Hunan Province(No.16YBA323)the Scientific Research Fund of Hunan Provincial Education Department(Nos.16C1378 and 14B153)
文摘Word embedding has drawn a lot of attention due to its usefulness in many NLP tasks. So far a handful of neural-network based word embedding algorithms have been proposed without considering the effects of pronouns in the training corpus. In this paper, we propose using co-reference resolution to improve the word embedding by extracting better context. We evaluate four word embeddings with considerations of co-reference resolution and compare the quality of word embedding on the task of word analogy and word similarity on multiple data sets.Experiments show that by using co-reference resolution, the word embedding performance in the word analogy task can be improved by around 1.88%. We find that the words that are names of countries are affected the most,which is as expected.
文摘This research paper has provided the methodology and design for implementing the hybrid author recommender system using Azure Data Lake Analytics and Power BI. It offers a recommendation for the top 1000 Authors of computer science in different fields of study. The technique used in this paper is handling the inadequate Information for citation;it removes the problem of cold start, which is encountered by very many other recommender systems. In this paper, abstracts, the titles, and the Microsoft academic graphs have been used in coming up with the recommendation list for every document, which is used to combine the content-based approaches and the co-citations. Prioritization and the blending of every technique have been allowed by the tuning system parameters, allowing for the authority in results of recommendation versus the paper novelty. In the end, we do observe that there is a direct correlation between the similarity rankings that have been produced by the system and the scores of the participant. The results coming from the associated scrips of analysis and the user survey have been made available through the recommendation system. Managers must gain the required expertise to fully utilize the benefits that come with business intelligence systems [1]. Data mining has become an important tool for managers that provides insights about their daily operations and leverage the information provided by decision support systems to improve customer relationships [2]. Additionally, managers require business intelligence systems that can rank the output in the order of priority. Ranking algorithm can replace the traditional data mining algorithms that will be discussed in-depth in the literature review [3].
Abstract: Social perception refers to how individuals interpret and understand the social world. It is a foundational area of theory and measurement within the social sciences, particularly in communication, political science, psychology, and sociology. Classical models include the Stereotype Content Model (SCM), the Dual Perspective Model (DPM), and the Semantic Differential (SD). Extensive research has been conducted on these models; however, their interrelationships are still difficult to define using conventional comparison methods, which often lack efficiency, validity, and scalability. To tackle this challenge, we employ a text-based computational approach to quantitatively represent each theoretical dimension of the models. Specifically, we map key content dimensions into a shared semantic space using word embeddings and automate the selection of over 500 contrasting word pairs based on semantic differential theory. The results suggest that social perception can be organized around two fundamental components: subjective evaluation (e.g., how good or likable someone is) and objective attributes (e.g., power or competence). Furthermore, we validate this computational approach with the widely used Rosenberg's 64 personality traits, demonstrating improvements in predictive performance over previous methods, with increases of 19%, 13%, and 4% for the SD, DPM, and SCM dimensions, respectively. By enabling scalable and interpretable comparisons across these models, our findings facilitate both theoretical integration and practical applications.
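A rough sketch of the embedding-based semantic-differential step: build an evaluation axis from a few contrasting word pairs and project trait words onto it. The handful of pairs and traits below is illustrative rather than the paper's 500+ automatically selected pairs, and the small public GloVe model is just a convenient stand-in for whichever embeddings are used.

```python
# Sketch: derive a "subjective evaluation" axis from contrasting word pairs
# and score trait words by their projection onto it.
import numpy as np
import gensim.downloader as api

vecs = api.load("glove-wiki-gigaword-50")  # small public model; downloads on first use

contrast_pairs = [("good", "bad"), ("pleasant", "unpleasant"), ("kind", "cruel")]
axis = np.mean([vecs[pos] - vecs[neg] for pos, neg in contrast_pairs], axis=0)
axis /= np.linalg.norm(axis)

for trait in ["friendly", "hostile", "competent"]:
    v = vecs[trait] / np.linalg.norm(vecs[trait])
    print(trait, round(float(v @ axis), 3))  # higher = more positive evaluation
```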
Funding: This research is funded by Taif University, TURSP-2020/115.
Abstract: Social media data are rapidly increasing and constitute a source of user opinions and tips on a wide range of products and services. The increasing availability of such big data on biased reviews and blogs creates challenges for customers and businesses in reviewing all content in their decision-making process. To overcome this challenge, extracting suggestions from opinionated text is a possible solution. In this study, the characteristics of suggestions are analyzed and a suggestion mining extraction process is presented for classifying suggestive sentences from online customers' reviews. A classification using a word-embedding approach is applied via the XGBoost classifier. The two datasets used in this experiment relate to online hotel reviews and Microsoft Windows App Studio discussion reviews. F1, precision, recall, and accuracy scores are calculated. The results demonstrated that the XGBoost classifier outperforms the alternatives, with an accuracy of more than 80%. Moreover, the results revealed that suggestion keywords and phrases are the predominant features for suggestion extraction. Thus, this study contributes to knowledge and practice by comparing feature extraction classifiers and identifying XGBoost as the better suggestion mining process for identifying online reviews.
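A minimal sketch of this classification setup, assuming averaged word vectors as the sentence representation: each review sentence is embedded and fed to an XGBoost classifier that labels it as suggestive or not. The tiny corpus and labels are invented for illustration.

```python
# Sketch: averaged word embeddings + XGBoost for suggestion vs. non-suggestion.
import numpy as np
from gensim.models import Word2Vec
from xgboost import XGBClassifier

sentences = [
    "you should add a late checkout option".split(),
    "the room was clean and quiet".split(),
    "please provide better wifi in the lobby".split(),
    "breakfast was tasty".split(),
]
labels = [1, 0, 1, 0]  # 1 = suggestion, 0 = non-suggestion (toy labels)

w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1)

def sent_vec(tokens):
    # Represent a sentence by the mean of its word vectors.
    return np.mean([w2v.wv[t] for t in tokens], axis=0)

X = np.vstack([sent_vec(s) for s in sentences])
clf = XGBClassifier(n_estimators=20, max_depth=3).fit(X, labels)
print(clf.predict(X))
```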
Abstract: Aspect extraction is a critical task in aspect-based sentiment analysis, covering both explicit and implicit aspect identification. While extensive research has addressed explicit aspects, little effort has been put toward implicit aspect extraction due to the complexity of the problem. Moreover, existing research on implicit aspect identification is widely carried out on product reviews targeting specific aspects while neglecting sentence dependency problems. Therefore, in this paper, a multi-level knowledge engineering approach for identifying implicit movie aspects is proposed. The proposed method first identifies explicit aspects using a variant of BiLSTM-CRF (Bidirectional Long Short-Term Memory with Conditional Random Field), which serves as a memory for processing dependent sentences to infer implicit aspects. It can identify implicit aspects from four types of sentences: independent sentences and three types of dependent sentences. The study is evaluated on a large movie review dataset with 50k examples. The experimental results showed that the explicit aspect identification method achieved an 89% F1-score and the implicit aspect extraction method achieved a 76% F1-score. In addition, the proposed approach also performs better than the state-of-the-art techniques (NMFIAD and ML-KB+) on the product review dataset, where it achieved 93% precision, 92% recall, and a 93% F1-score.
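The explicit-aspect tagger can be sketched as a standard BiLSTM-CRF sequence labeller over BIO aspect tags, as below. Dimensions, the three-tag scheme, and the use of the pytorch-crf package are assumptions; the paper's specific variant and the downstream implicit-aspect inference are not reproduced here.

```python
# Sketch: BiLSTM-CRF tagger for explicit aspect terms (BIO scheme).
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=100, hidden=128, num_tags=3):  # O, B-ASP, I-ASP
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, tokens, tags):
        emissions = self.fc(self.lstm(self.emb(tokens))[0])
        return -self.crf(emissions, tags)       # negative log-likelihood

    def decode(self, tokens):
        emissions = self.fc(self.lstm(self.emb(tokens))[0])
        return self.crf.decode(emissions)       # best tag sequence per sentence

model = BiLSTMCRF()
tokens = torch.randint(0, 1000, (2, 7))         # toy batch: 2 sentences, 7 tokens each
tags = torch.randint(0, 3, (2, 7))
model.loss(tokens, tags).backward()
print(model.decode(tokens))
```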
Abstract: Purpose: The paper aims to enhance Arabic machine translation (MT) by proposing novel approaches: (1) a dimensionality reduction technique for word embeddings tailored for Arabic text, optimizing efficiency while retaining semantic information; (2) a comprehensive comparison of meta-embedding techniques to improve translation quality; and (3) a method leveraging self-attention and Gated CNNs to capture token dependencies, including temporal and hierarchical features within sentences and interactions between different embedding types. These approaches collectively aim to enhance translation quality by combining different embedding schemes and leveraging advanced modeling techniques. Design/methodology/approach: Recent works on MT in general, and Arabic MT in particular, often pick a single type of word embedding model. In this paper, we present a novel approach to enhance Arabic MT by addressing three key aspects. Firstly, we propose a new dimensionality reduction technique for word embeddings, specifically tailored for Arabic text, which optimizes the efficiency of embeddings while retaining their semantic information. Secondly, we conduct an extensive comparison of different meta-embedding techniques, exploring the combination of static and contextual embeddings; through this analysis, we identify the most effective approach to improve translation quality. Lastly, we introduce a novel method that leverages self-attention and Gated convolutional neural networks (CNNs) to capture token dependencies, including temporal and hierarchical features within sentences, as well as interactions between different types of embeddings. Our experimental results demonstrate the effectiveness of the proposed approach in significantly enhancing Arabic MT performance: it outperforms baseline models with a BLEU score increase of 2 points and achieves superior results compared to state-of-the-art approaches, with an average improvement of 4.6 points across all evaluation metrics. Findings: The proposed approaches significantly enhance Arabic MT performance. The dimensionality reduction technique improves the efficiency of word embeddings while preserving semantic information. The comprehensive comparison identifies effective meta-embedding techniques, with the contextualized dynamic meta-embeddings (CDME) model showing competitive results. Integrating Gated CNNs with the transformer model surpasses baseline performance, leveraging the strengths of both architectures. Overall, these findings demonstrate substantial improvements in translation quality, with a BLEU score increase of 2 points and an average improvement of 4.6 points across all evaluation metrics, outperforming state-of-the-art approaches. Originality/value: The paper's originality lies in its departure from simply fine-tuning the transformer model for a specific task. Instead, it introduces modifications to the internal architecture of the transformer, integrating Gated CNNs to enhance translation performance. This departure from traditional fine-tuning approaches demonstrates a novel perspective on model enhancement, offering unique insights into improving translation quality without relying solely on pre-existing architectures. The originality in dimensionality reduction lies in the approach tailored for Arabic text. While dimensionality reduction techniques are not new, the paper introduces a specific method optimized for Arabic word embeddings. By employing independent component analysis (ICA) and a post-processing method, the paper effectively reduces the dimensionality of word embeddings while preserving semantic information, which has not been investigated before, especially for the MT task.
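As a rough illustration of the ICA-based dimensionality reduction mentioned above, the snippet below centres a word-embedding matrix and keeps a smaller number of independent components. The mean-removal post-processing step and the target dimensionality are assumptions; the paper's exact procedure for Arabic embeddings is not detailed in the abstract.

```python
# Sketch: reduce word-embedding dimensionality with independent component analysis.
import numpy as np
from sklearn.decomposition import FastICA

embeddings = np.random.randn(5000, 300)          # stand-in for an Arabic embedding matrix
centered = embeddings - embeddings.mean(axis=0)  # assumed post-processing: remove the common mean

ica = FastICA(n_components=128, random_state=0)
reduced = ica.fit_transform(centered)            # (5000, 128) compact embeddings
print(reduced.shape)
```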
Abstract: Purpose: With more and more digital collections of various information resources becoming available, the challenge of assigning subject index terms and classes from quality knowledge organization systems is also increasing. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC. Design/methodology/approach: State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research involved 143,838 records, which had to be reduced to the top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels). Findings: Evaluation shows that a Support Vector Machine with linear kernel outperforms the other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes when characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than the Support Vector Machine, but reached close results, with the benefit of a smaller representation size. The impact of features in machine learning shows that using keywords, or combining titles and keywords, gives better results than using only titles as input. Stemming only marginally improves the results. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available in the training set, and 66.13% when too few records (often fewer than 100 per class) are available to train on; and these figures hold only for the top three hierarchical levels (803 instead of 14,413 classes). Research limitations: Having to reduce the number of hierarchical levels to the top three levels of DDC, because of the lack of training data for all classes, skews the results so that they work in experimental conditions but barely for end users in operational retrieval systems. Practical implications: In conclusion, for operative information retrieval systems, applying purely automatic DDC does not work, whether using machine learning (because of the lack of training data for the large number of DDC classes) or the string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance the accuracy of automatic classification, which may also benefit information retrieval performance based on DDC. In order for quality information services to reach the objective of the highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with the quality of human decisions at the final stage should be the way forward. Originality/value: The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to the lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further, since it provides potential for real-life applications with large target classification systems.
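The best-performing setup reported above, a linear-kernel Support Vector Machine over title and keyword text, can be sketched with scikit-learn as follows. The toy records and three-digit DDC labels are illustrative; a realistic run would need on the order of 1,000 examples per class, as the study notes.

```python
# Sketch: TF-IDF features of titles + keywords fed to a linear SVM for DDC classes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "introduction to algebra; mathematics equations",
    "european history; medieval europe kings",
    "organic chemistry basics; molecules reactions",
    "number theory primes; mathematics proofs",
]
ddc_classes = ["510", "940", "540", "510"]  # illustrative top-level DDC labels

clf = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, ddc_classes)
print(clf.predict(["world war battles in europe"]))
```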
Abstract: Purpose: The ever-increasing penetration of the Internet in our lives has led to an enormous amount of multimedia content being generated online. Textual data contributes a major share of the data generated on the world wide web. Understanding people's sentiment is an important aspect of natural language processing, but this opinion can be biased and incorrect if people use sarcasm while commenting, posting status updates, or reviewing a product or movie. Thus, it is of utmost importance to detect sarcasm correctly and make a correct prediction about people's intentions. Design/methodology/approach: This study evaluates various machine learning models along with standard and hybrid deep learning models across several standardized datasets. We performed vectorization of text using word embedding techniques to convert the textual data into vectors for analysis. We used three standardized datasets available in the public domain and three word embeddings, i.e., Word2Vec, GloVe, and fastText, to validate the hypothesis. Findings: The results were analyzed and conclusions drawn. The key finding is that the hybrid models combining Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) outperform both conventional machine learning and deep learning models across all the datasets considered in this study, making our hypothesis valid. Research limitations: Using data from different sources and customizing the models for each dataset slightly decreases the usability of the technique. Overall, however, this methodology provides effective measures to identify the presence of sarcasm, with a minimum average accuracy of 80% or above for one dataset and results better than the current baselines for the other datasets. Practical implications: The results provide solid insights for system developers to integrate this model into real-time analysis of any review or comment posted in the public domain. This study has various other practical implications for businesses that depend on user ratings and public opinions, and it also provides a launching platform for researchers to work on the problem of sarcasm identification in textual data. Originality/value: This is a first-of-its-kind study showing the difference between conventional and hybrid methods for predicting sarcasm in textual data. The study also provides possible indicators that hybrid models are better when applied to textual data for sarcasm analysis.
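A schematic Keras sketch of the kind of hybrid architecture the study found best: an embedding layer (which could be initialised from Word2Vec, GloVe, or fastText vectors), a 1D convolution for local n-gram features, a Bi-LSTM for longer-range context, and a sigmoid output for the sarcastic/non-sarcastic decision. All hyperparameters are illustrative, not the study's reported configuration.

```python
# Sketch: hybrid CNN + Bi-LSTM sarcasm classifier over word embeddings.
from tensorflow.keras import layers, models

vocab_size, max_len, emb_dim = 20000, 100, 300

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, emb_dim),         # could be seeded with pre-trained vectors
    layers.Conv1D(128, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation="sigmoid"),         # sarcastic vs. non-sarcastic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```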