Covert timing channels (CTCs) exploit network resources to establish hidden communication pathways, posing significant risks to data security and policy compliance. Detecting such hidden and dangerous threats therefore remains a security challenge. This paper proposes LinguTimeX, a new framework that combines natural language processing with artificial intelligence (AI) and explainable AI, not only to detect CTCs but also to provide insight into the decision process. LinguTimeX performs multidimensional feature extraction by fusing linguistic attributes with temporal network patterns to identify covert channels precisely. LinguTimeX demonstrates strong effectiveness in detecting CTCs across multiple languages, namely English, Arabic, and Chinese. Specifically, the LSTM and RNN models achieved F1-scores of 90% on the English dataset, 89% on the Arabic dataset, and 88% on the Chinese dataset, showcasing their superior performance and ability to generalize across multiple languages. This highlights their robustness in detecting CTCs within security systems, regardless of the language or cultural context of the data. By comparison, the DeepForest model produced F1-scores ranging from 86% to 87% across the same datasets, further confirming the framework's effectiveness in CTC detection. Although other algorithms also showed reasonable accuracy, the LSTM and RNN models consistently outperformed them in multilingual settings, suggesting that deep learning models might be better suited to this problem.
Background: In mental health, recovery is emphasized, and qualitative analyses of service users’ narratives have accumulated; however, while qualitative approaches excel at capturing rich context and generating new concepts, they are limited in generalizability and feasible data volume. This study aimed to quantify the subjective life history narratives of users of psychiatric home-visit nursing using natural language processing (NLP) and to clarify the relationships between linguistic features and recovery-related indicators. Methods: We conducted semi-structured interviews on daily life, which were audio-recorded and transcribed verbatim, and collected self-report questionnaires (Recovery Assessment Scale [RAS]) and clinician ratings (Global Assessment of Functioning [GAF]) from Japanese users of psychiatric home-visit nursing. Using the artificial intelligence-based topic-modeling method BERTopic, we extracted topics from the interview texts and calculated each participant’s topic proportions, and then examined associations between topic proportions and recovery-related indicators using Pearson correlation analyses. Results: “School” showed a significant positive correlation with RAS (r=0.39, p=0.05), whereas “Family” showed a significant negative correlation (r=–0.46, p=0.02). GAF was positively correlated with word count (r=0.44, p=0.02) and “Hospital” (r=0.42, p=0.03), and negatively correlated with “Backchannels” (aizuchi) (r=–0.41, p=0.03). Conclusion: The present results suggest that the quantity, quality, and content of narratives can serve as useful indicators of mental health and recovery, and that objective NLP-based analysis of service users’ narratives can complement traditional self-report scales and clinician ratings to inform the design of recovery-oriented care in psychiatric home-visit nursing.
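The analysis step described in the abstract above (per-participant topic proportions correlated with a recovery score) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `topic_proportion` and `pearson_r`, the topic labels per utterance, and all numbers are hypothetical, and the topic labels are assumed to have already been produced by a topic model.

```python
import math


def topic_proportion(utterance_topics, topic):
    """Share of one participant's utterances assigned to `topic`."""
    if not utterance_topics:
        raise ValueError("participant has no utterances")
    return utterance_topics.count(topic) / len(utterance_topics)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: topic label per utterance for three participants,
# plus a made-up questionnaire score for each participant.
participants = [
    ["School", "Family", "School", "Hospital"],
    ["Family", "Family", "Hospital", "Family"],
    ["School", "School", "School", "Family"],
]
scores = [80.0, 60.0, 95.0]

school_share = [topic_proportion(p, "School") for p in participants]
print(pearson_r(school_share, scores))
```

With real data, a significance test would accompany each coefficient; the sketch computes only r.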
Sentiment analysis, a crucial task in discerning emotional tones within text, plays a pivotal role in understanding public opinion and user sentiment across diverse languages. While numerous scholars conduct sentiment analysis in widely spoken languages such as English, Chinese, Arabic, Roman Arabic, and more, working with resource-poor languages such as Urdu remains a challenge. Urdu is a uniquely crafted language, characterized by a script that amalgamates elements from diverse languages, including Arabic, Parsi, Pashtu, Turkish, Punjabi, Saraiki, and more. Urdu literature, characterized by distinct character sets and linguistic features, presents an additional hurdle due to the lack of accessible datasets, rendering sentiment analysis a formidable undertaking. The limited availability of resources has fueled increased interest among researchers, prompting a deeper exploration into Urdu sentiment analysis. This research is dedicated to Urdu language sentiment analysis, employing deep learning models on an extensive dataset categorized into five labels: Positive, Negative, Neutral, Mixed, and Ambiguous. The primary objective is to discern sentiments and emotions within the Urdu language, despite the absence of well-curated datasets. To tackle this challenge, the initial step involves the creation of a comprehensive Urdu dataset by aggregating data from various sources such as newspapers, articles, and social media comments. After data collection, a thorough process of cleaning and preprocessing is implemented to ensure the quality of the data. The study leverages two well-known deep learning models, namely Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), for both training and evaluating sentiment analysis performance. Additionally, the study explores hyperparameter tuning to optimize the models’ efficacy. Evaluation metrics such as precision, recall, and the F1-score are employed to assess the effectiveness of the models. The research findings reveal that RNN surpasses CNN in Urdu sentiment analysis, achieving a significantly higher accuracy of 91%. This result underscores the strong performance of RNN, making it a compelling option for sentiment analysis tasks in the Urdu language.
Machine translation of low-resource languages (LRLs) has long been hindered by limited corpora and linguistic complexity. This review summarizes key developments, from traditional methods to recent progress with large language models (LLMs), while highlighting ongoing challenges such as data bottlenecks, biases, fairness, and computational costs. Finally, it discusses future directions, including efficient parameter fine-tuning, multimodal translation, and community-driven corpus construction, providing insights for advancing LRL translation research.
Text clustering is an important task because of its vital role in NLP-related tasks. However, existing research on clustering is mainly based on the English language, with limited work on low-resource languages such as Urdu. Low-resource language text clustering is hampered by limited annotated collections and strong linguistic diversity. The contribution of this paper is twofold: (1) a clustering dataset named UNC-2025, comprising 100k Urdu news documents, and (2) a detailed empirical benchmark of Large Language Model (LLM)-enhanced clustering methods for Urdu text. We explicitly evaluate the behavior of 11 multilingual and Urdu-specific embeddings on 3 different clustering algorithms, and assess performance using a set of internal and external validity measures. The best configuration, the mBERT embedding with the HDBSCAN algorithm, attains a new state-of-the-art external validity score of 0.95. This LLM-based method establishes a strong new baseline for Urdu text clustering. Importantly, the results confirm the strength and scalability of LLM-generated embeddings and their ability to capture the fine, subtle semantics needed to discover topics in low-resource settings, opening the door to novel NLP applications in underrepresented languages.
The natural language processing (NLP) domain has witnessed significant advancements with the emergence of transformer-based models, which have reshaped the text understanding and generation landscape. While their capabilities are well recognized, there remains a limited systematic synthesis of how these models perform across tasks, scale efficiently, adapt to domains, and address ethical challenges. Therefore, the aim of this paper was to analyze the performance of transformer-based models across various NLP tasks, their scalability, domain adaptation, and the ethical implications of such models. This meta-analysis paper synthesizes findings from 25 peer-reviewed studies on NLP transformer-based models, adhering to the PRISMA framework. Relevant papers were sourced from electronic databases, including IEEE Xplore, Springer, ACM Digital Library, Elsevier, PubMed, and Google Scholar. The findings highlight the superior performance of transformers over conventional approaches, attributed to self-attention mechanisms and pre-trained language representations. Despite these advantages, challenges such as high computational costs, data bias, and hallucination persist. The study provides new perspectives by underscoring the necessity for future research to optimize transformer architectures for efficiency, address ethical AI concerns, and enhance generalization across languages. This paper contributes valuable insights into the current trends, limitations, and potential improvements in transformer-based models for NLP.
The increasing frequency and severity of natural disasters, exacerbated by global warming, necessitate novel solutions to strengthen the resilience of Critical Infrastructure Systems (CISs). Recent research reveals the significant potential of natural language processing (NLP) to analyze unstructured human language during disasters, thereby facilitating the uncovering of disruptions and providing situational awareness supporting various aspects of resilience regarding CISs. Despite this potential, few studies have systematically mapped the global research on NLP applications with respect to supporting various aspects of resilience of CISs. This paper contributes to the body of knowledge by presenting a review of current knowledge using the scientometric review technique. Using 231 bibliographic records from the Scopus and Web of Science core collections, we identify five key research areas where researchers have used NLP to support the resilience of CISs during natural disasters, including sentiment analysis, crisis informatics, data and knowledge visualization, disaster impacts, and content analysis. Furthermore, we map the utility of NLP in the identified research focus with respect to four aspects of resilience (i.e., preparedness, absorption, recovery, and adaptability) and present various common techniques used and potential future research directions. This review highlights that NLP has the potential to become a supplementary data source to support the resilience of CISs. The results of this study serve as an introductory-level guide designed to help scholars and practitioners unlock the potential of NLP for strengthening the resilience of CISs against natural disasters.
DeepSeek, a Chinese open-source artificial intelligence (AI) model, has gained a lot of attention due to its economical training and efficient inference. DeepSeek, a model trained with large-scale reinforcement learning without supervised fine-tuning as a preliminary step, demonstrates remarkable reasoning capabilities across a wide range of tasks. DeepSeek is a prominent AI-driven chatbot that assists individuals in learning and enhances responses by generating insightful solutions to inquiries. Users hold divergent viewpoints regarding advanced models like DeepSeek, posting about both its merits and shortcomings across several social media platforms. This research presents a new framework for predicting public sentiment to evaluate perceptions of DeepSeek. To transform the unstructured data into a suitable form, we first collect DeepSeek-related tweets from Twitter and then apply various preprocessing methods. Subsequently, we annotated the tweets using the Valence Aware Dictionary and sEntiment Reasoner (VADER) and the lexicon-driven TextBlob. Next, we classified the sentiments obtained from the cleaned data using the proposed hybrid model. The proposed hybrid model consists of long short-term memory (LSTM) and bidirectional gated recurrent units (BiGRU). To strengthen it, we include multi-head attention, regularizer activation, and dropout units to enhance performance. Topic modeling employing KMeans clustering and Latent Dirichlet Allocation (LDA) was used to analyze public behavior concerning DeepSeek. The results show that 82.5% of people are positive, 15.2% negative, and 2.3% neutral using TextBlob, and 82.8% positive, 16.1% negative, and 1.2% neutral using the VADER analysis. The slight difference in results indicates that the two analyses agree on overall perception while differing in how they handle language peculiarities. The results indicate that the proposed model surpassed previous state-of-the-art approaches.
The increased accessibility of social networking services (SNSs) has facilitated communication and information sharing among users. However, it has also heightened concerns about digital safety, particularly for children and adolescents who are increasingly exposed to online grooming crimes. Early and accurate identification of grooming conversations is crucial in preventing long-term harm to victims. However, research on grooming detection in South Korea remains limited, as existing models are trained primarily on English text and fail to reflect the unique linguistic features of SNS conversations, leading to inaccurate classifications. To address these issues, this study proposes a novel framework that integrates optical character recognition (OCR) technology with KcELECTRA, a deep learning-based natural language processing (NLP) model that shows excellent performance in processing the colloquial Korean language. In the proposed framework, the KcELECTRA model is fine-tuned on an extensive dataset, including Korean social media conversations, Korean ethical verification data from AI-Hub, and Korean hate speech data from HuggingFace, to enable more accurate classification of text extracted from social media conversation images. Experimental results show that the proposed framework achieves an accuracy of 0.953, outperforming existing transformer-based models. Furthermore, OCR technology shows high accuracy in extracting text from images, demonstrating that the proposed framework is effective for online grooming detection. The proposed framework is expected to contribute to the more accurate detection of grooming text and the prevention of grooming-related crimes.
Objective To develop a clinical decision and prescription generation system (CDPGS) specifically for diarrhea in traditional Chinese medicine (TCM), utilizing a specialized large language model (LLM), Qwen-TCM-Dia, to standardize diagnostic processes and prescription generation. Methods Two primary datasets were constructed: an evaluation benchmark and a fine-tuning dataset consisting of fundamental diarrhea knowledge, medical records, and chain-of-thought (CoT) reasoning datasets. After an initial evaluation of 16 open-source LLMs across inference time, accuracy, and output quality, Qwen2.5 was selected as the base model due to its superior overall performance. We then employed a two-stage low-rank adaptation (LoRA) fine-tuning strategy, integrating continued pre-training on domain-specific knowledge with instruction fine-tuning using CoT-enriched medical records. This approach was designed to embed the clinical logic (symptoms → pathogenesis → therapeutic principles → prescriptions) into the model’s reasoning capabilities. The resulting fine-tuned model, specialized for TCM diarrhea, was designated as Qwen-TCM-Dia. Model performance was evaluated for disease diagnosis and syndrome type differentiation using accuracy, precision, recall, and F1-score. Furthermore, the quality of the generated prescriptions was compared with that of established open-source TCM LLMs. Results Qwen-TCM-Dia achieved the best performance compared with both the base Qwen2.5 model and five other open-source TCM LLMs. It achieved 97.05% accuracy and 91.48% F1-score in disease diagnosis, and 74.54% accuracy and 74.21% F1-score in syndrome type differentiation. Compared with existing open-source TCM LLMs (BianCang, HuangDi, LingDan, TCMLLM-PR, and ZhongJing), Qwen-TCM-Dia exhibited higher fidelity in reconstructing the “symptoms → pathogenesis → therapeutic principles → prescriptions” logic chain. It provided complete prescriptions, whereas other models often omitted dosages or generated mismatched prescriptions. Conclusion By integrating continued pre-training, CoT reasoning, and a two-stage fine-tuning strategy, this study establishes a CDPGS for diarrhea in TCM. The results demonstrate the synergistic effect of strengthening domain representation through pre-training and activating logical reasoning via CoT. This research not only provides critical technical support for the standardized diagnosis and treatment of diarrhea but also offers a scalable paradigm for the digital inheritance of expert TCM experience and the intelligent transformation of TCM.
The malicious dissemination of hate speech via compromised accounts, automated bot networks and malware-driven social media campaigns has become a growing cybersecurity concern. Automatically detecting such content in Spanish is challenging due to linguistic complexity and the scarcity of annotated resources. In this paper, we compare two predominant AI-based approaches for the forensic detection of malicious hate speech: (1) fine-tuning encoder-only models that have been trained in Spanish and (2) In-Context Learning techniques (Zero- and Few-Shot Learning) with large-scale language models. Our approach goes beyond binary classification, proposing a comprehensive, multidimensional evaluation that labels each text by: (1) type of speech, (2) recipient, (3) level of intensity (ordinal) and (4) targeted group (multi-label). Performance is evaluated on an annotated Spanish corpus using standard metrics such as precision, recall and F1-score, together with stability-oriented metrics that assess the transition from zero-shot to few-shot prompting (Zero-to-Few Shot Retention and Zero-to-Few Shot Gain). The results indicate that fine-tuned encoder-only models (notably MarIA and BETO variants) consistently deliver the strongest and most reliable performance: in our experiments their macro F1-scores lie in the range of approximately 46%–66% depending on the task. Zero-shot approaches are much less stable and typically yield substantially lower performance (observed F1-scores range approximately 0%–39%), often producing invalid outputs in practice. Few-shot prompting (e.g., Qwen 38B, Mistral 7B) generally improves stability and recall relative to pure zero-shot, bringing F1-scores into a moderate range of approximately 20%–51% but still falling short of fully fine-tuned models. These findings highlight the importance of supervised adaptation. We also discuss the potential of both paradigms as components in AI-powered cybersecurity and malware forensics systems designed to identify and mitigate coordinated online hate campaigns.
Sentiment analysis (SA) is the procedure of recognizing the emotions related to the data that exist in social networking. The existence of sarcasm in textual data is a major challenge to the efficiency of SA. Earlier works on sarcasm detection in text utilize lexical as well as pragmatic cues, namely interjections, punctuation, and sentiment shift, that are vital indicators of sarcasm. With the advent of deep learning, recent works leverage neural networks to learn lexical and contextual features, removing the need for handcrafted features. In this context, this study designs a deep learning with natural language processing enabled SA (DLNLP-SA) technique for sarcasm classification. The proposed DLNLP-SA technique aims to detect and classify the occurrence of sarcasm in the input data. The DLNLP-SA technique comprises several sub-processes, namely preprocessing, feature vector conversion, and classification. Initially, pre-processing is performed in several ways, such as single character removal, multi-space removal, URL removal, stopword removal, and tokenization. Secondly, the transformation into feature vectors takes place using the N-gram feature vector technique. Finally, mayfly optimization (MFO) with a multi-head self-attention based gated recurrent unit (MHSA-GRU) model is employed for the detection and classification of sarcasm. To verify the enhanced outcomes of the DLNLP-SA model, a comprehensive experimental investigation is performed on the News Headlines Dataset from the Kaggle Repository, and the results indicate its superiority over existing approaches.
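The preprocessing and N-gram steps listed in the abstract above can be sketched as follows. This is only an illustration under stated assumptions, not the DLNLP-SA implementation: the function names, the tiny stopword set, the lowercasing, the punctuation stripping, and the order of the steps are all choices made here for the example.

```python
import re

# Illustrative stopword list only; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}


def preprocess(text):
    """Clean a headline and return its tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URL removal
    text = re.sub(r"[^a-z\s]", " ", text)       # drop punctuation/digits (assumption)
    text = re.sub(r"\b[a-z]\b", " ", text)      # single-character removal
    text = re.sub(r"\s+", " ", text).strip()    # multi-space removal
    tokens = text.split()                       # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal


def ngrams(tokens, n):
    """Contiguous n-grams over a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("Oh great, another Monday!  See https://example.com x")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

The resulting n-grams would typically be counted or hashed into a fixed-size feature vector before being passed to a classifier.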
As one of the most widely used languages in the world, the Chinese language is distinct from most western languages in many properties, thus providing a unique opportunity for understanding the brain basis of human language and cognition. In recent years, non-invasive neuroimaging techniques such as magnetic resonance imaging (MRI) have blazed a new trail for comprehensively studying the specific neural correlates of Chinese language processing and of Chinese speakers. We reviewed the application of functional MRI (fMRI) in such studies and some essential findings on the brain systems involved in processing Chinese. Specifically, we covered the application of task fMRI and resting-state fMRI in observing the processes of reading and writing logographic characters and of producing or listening to tonal speech. Elementary cognitive neuroscience and several potential research directions around the brain and the Chinese language were discussed, which may be informative for future research.
A variety of neural networks have been presented to deal with issues in deep learning in the last decades. Despite the prominent success achieved by neural networks, theoretical guidance for designing an efficient neural network model is still lacking, and verifying the performance of a model requires excessive resources. Previous research studies have demonstrated that many existing models can be regarded as different numerical discretizations of differential equations. This connection sheds light on designing an effective recurrent neural network (RNN) by resorting to numerical analysis. The simple RNN can be regarded as a forward Euler discretisation. Considering the limited solution accuracy of the forward Euler method, a Taylor‐type discrete scheme with lower truncation error is presented, and a Taylor‐type RNN (T‐RNN) is designed under its guidance. Extensive experiments are conducted to evaluate its performance on statistical language models and emotion analysis tasks. The noticeable gains obtained by T‐RNN demonstrate its superiority and the feasibility of designing neural network models using numerical methods.
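The link between a recurrent update and a forward Euler step, and the accuracy gain from a higher-order Taylor step, can be illustrated with a small numerical sketch. Everything here is an assumption made for illustration: the scalar residual-style update in `rnn_euler_step`, the generic second-order Taylor step, and the test equation dh/dt = -h. It is not the T-RNN scheme from the paper, whose details the abstract does not give.

```python
import math


def rnn_euler_step(h, x, w, u, b, dt):
    """Scalar recurrent update h + dt * tanh(w*h + u*x + b).

    This is one forward Euler step for dh/dt = tanh(w*h + u*x + b).
    """
    return h + dt * math.tanh(w * h + u * x + b)


def euler_step(h, dt, f):
    """First-order step: h(t+dt) ~ h + dt * f(h)."""
    return h + dt * f(h)


def taylor2_step(h, dt, f, df):
    """Second-order Taylor step, using h'' = f'(h) * f(h)."""
    fh = f(h)
    return h + dt * fh + 0.5 * dt * dt * df(h) * fh


def integrate(step, h0, n):
    """Apply a one-argument step function n times starting from h0."""
    h = h0
    for _ in range(n):
        h = step(h)
    return h


# Test problem with a known solution: dh/dt = -h, h(0) = 1, so h(1) = exp(-1).
f = lambda h: -h
df = lambda h: -1.0
dt, n = 0.1, 10
exact = math.exp(-1.0)

err_euler = abs(integrate(lambda h: euler_step(h, dt, f), 1.0, n) - exact)
err_taylor = abs(integrate(lambda h: taylor2_step(h, dt, f, df), 1.0, n) - exact)
print(err_euler, err_taylor)
```

On this test problem the second-order step lands noticeably closer to the exact value than the Euler step at the same step size, which is the kind of truncation-error reduction the abstract appeals to.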
Recent developments in Multimedia Internet of Things (MIoT) devices, empowered with Natural Language Processing (NLP) models, point to a promising future for smart devices. Such models play an important role in industrial applications such as speech understanding, emotion detection, home automation, and so on. If an image needs to be captioned, then the objects in that image, their actions and connections, and any salient feature that remains under-projected or missing from the image should be identified. The aim of the image captioning process is to generate a caption for an image. In the next step, the image should be provided with one of the most significant and detailed descriptions that is syntactically as well as semantically correct. In this scenario, a computer vision model is used to identify the objects and NLP approaches are followed to describe the image. The current study develops a Natural Language Processing with Optimal Deep Learning Enabled Intelligent Image Captioning System (NLPODL-IICS). The aim of the presented NLPODL-IICS model is to produce a proper description for an input image. To attain this, the proposed NLPODL-IICS follows two stages, namely encoding and decoding. Initially, at the encoding side, the proposed NLPODL-IICS model makes use of Hunger Games Search (HGS) with the Neural Architecture Search Network (NASNet) model. This model represents the input data appropriately by embedding it into a vector of predefined length. Besides, during the decoding phase, the Chimp Optimization Algorithm (COA) with a deeper Long Short Term Memory (LSTM) approach is followed to concatenate the description sentences produced by the method. The application of the HGS and COA algorithms helps in accomplishing proper parameter tuning for the NASNet and LSTM models respectively. The proposed NLPODL-IICS model was experimentally validated with the help of two benchmark datasets. A wide-ranging comparative analysis confirmed the superior performance of the NLPODL-IICS model over other models. [CMC, 2023, vol. 74, no. 2, p. 4436]
Purpose: This work aims to normalize the NLPCONTRIBUTIONS scheme (henceforward, NLPCONTRIBUTIONGRAPH) to structure, directly from article sentences, the contributions information in Natural Language Processing (NLP) scholarly articles via a two-stage annotation methodology: 1) a pilot stage, to define the scheme (described in prior work); and 2) an adjudication stage, to normalize the graphing model (the focus of this paper). Design/methodology/approach: We re-annotate, a second time, the contributions-pertinent information across 50 prior-annotated NLP scholarly articles in terms of a data pipeline comprising contribution-centered sentences, phrases, and triple statements. To this end, care was taken in the adjudication annotation stage to reduce annotation noise while formulating the guidelines for our proposed novel NLP contributions structuring and graphing scheme. Findings: The application of NLPCONTRIBUTIONGRAPH on the 50 articles resulted in a dataset of 900 contribution-focused sentences, 4,702 contribution-information-centered phrases, and 2,980 surface-structured triples. The intra-annotation agreement between the first and second stages, in terms of F1-score, was 67.92% for sentences, 41.82% for phrases, and 22.31% for triple statements, indicating that with increased granularity of the information, the annotation decision variance is greater. Research limitations: NLPCONTRIBUTIONGRAPH has limited scope for structuring scholarly contributions compared with STEM (Science, Technology, Engineering, and Medicine) scholarly knowledge at large. Further, the annotation scheme in this work is designed by only an intra-annotator consensus: a single annotator first annotated the data to propose the initial scheme, following which the same annotator re-annotated the data to normalize the annotations in an adjudication stage. However, the expected goal of this work is to achieve a standardized retrospective model of capturing NLP contributions from scholarly articles. This would entail a larger initiative of enlisting multiple annotators to accommodate different worldviews into a “single” set of structures and relationships as the final scheme. Given that this is the first proposal of the scheme, and given the complexity of the annotation task within a realistic timeframe, our intra-annotation procedure is well-suited. Nevertheless, the model proposed in this work is presently limited since it does not incorporate multiple annotator worldviews. This is planned as future work to produce a robust model. Practical implications: We demonstrate NLPCONTRIBUTIONGRAPH data integrated into the Open Research Knowledge Graph (ORKG), a next-generation KG-based digital library with intelligent computations enabled over structured scholarly knowledge, as a viable aid to assist researchers in their day-to-day tasks. Originality/value: NLPCONTRIBUTIONGRAPH is a novel scheme to annotate research contributions from NLP articles and integrate them in a knowledge graph, which to the best of our knowledge does not exist in the community. Furthermore, our quantitative evaluations over the two-stage annotation tasks offer insights into task difficulty.
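Agreement between two annotation passes expressed as an F1-score, as reported in the abstract above, can be computed along the following lines. This is a sketch under assumptions: it treats the first-stage annotations as the reference set and the second-stage annotations as the predicted set, and requires exact matches between items. The abstract does not state the matching criteria actually used, and the item identifiers below are hypothetical.

```python
def f1_agreement(stage1_items, stage2_items):
    """F1 between two sets of annotated items (exact-match assumption)."""
    reference = set(stage1_items)
    predicted = set(stage2_items)
    if not reference or not predicted:
        return 0.0
    true_positives = len(reference & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence identifiers selected in each annotation stage.
stage1 = {"s1", "s2", "s3"}
stage2 = {"s2", "s3", "s4"}
print(f1_agreement(stage1, stage2))
```

Because finer-grained units such as phrases and triples offer more ways for two passes to differ, exact-match agreement of this kind tends to fall as granularity increases, which is consistent with the trend the abstract reports.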
One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solve the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, typically upwards of hundreds of thousands of terms. As such, much of the research and development in NLP in the last two decades has been in finding and optimizing solutions to this problem, which is effectively a problem of feature selection in NLP. This paper looks at the development of these various techniques, leveraging a variety of statistical methods which rest on linguistic theories that were advanced in the middle of the last century, namely the distributional hypothesis, which suggests that words that are found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as data structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants which are typically referred to as word embeddings. In this review of algorithms such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally beyond applicability to NLP.
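The distributional hypothesis mentioned in the abstract above can be made concrete with a toy count-based vector space. The sketch below is a minimal illustration, not any of the surveyed algorithms: the function names, the toy corpus, and the choice of the whole sentence as the context window are assumptions made here. Each word is represented by counts of the words it co-occurs with, and similarity is measured by cosine.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences):
    """Map each word to a Counter of the words sharing a sentence with it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["cat", "drinks", "milk"],
    ["dog", "drinks", "water"],
    ["cat", "chases", "ball"],
    ["dog", "chases", "ball"],
    ["bank", "lends", "money"],
]

vectors = cooccurrence_vectors(corpus)
print(cosine(vectors["cat"], vectors["dog"]))   # words with shared contexts
print(cosine(vectors["cat"], vectors["bank"]))  # words with no shared contexts
```

Raw count vectors like these grow with vocabulary size, which is the dimensionality problem the abstract describes; the surveyed methods compress them into dense, lower-dimensional representations.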
Objective: Natural language processing (NLP) was used to excavate and visualize the core content of syndrome element syndrome differentiation (SESD). Methods: The first step was to build a text mining and analysis environment based on the Python language and to build a corpus based on the core chapters of SESD. The second step was to digitalize the corpus. The main steps included word segmentation, information cleaning and merging, document-entry matrix, dictionary compilation and information conversion. The third step was to mine and display the internal information of the SESD corpus by means of word cloud, keyword extraction and visualization. Results: NLP played a positive role in computer recognition and comprehension of SESD. Different chapters had different keywords and weights. Deficiency syndrome elements were an important component of SESD, such as "Qi deficiency", "Yang deficiency" and "Yin deficiency". The important syndrome elements of substantiality included "Blood stasis", "Qi stagnation", etc. Core syndrome elements were closely related. Conclusions: Syndrome differentiation and treatment was the core of SESD. Using NLP to excavate syndrome differentiation could help reveal the internal relationships within syndrome differentiation and provide a basis for artificial intelligence to learn syndrome differentiation.
In this paper, we study an automatic pattern abstraction and recognition method for large-scale database systems based on natural language processing. In a distributed database, through the network connection between nodes, data distributed across different nodes and even regions are well recognized. Because a database model designed to reduce data redundancy usually contains a large number of forms, we combine NLP theory to optimize the traditional method. The experimental analysis and simulation support the correctness of our method.
A systolic array architecture computer (FXCQ) has been designed for signal processing. It can handle floating point data at very high speed. It is composed of 16 processing cells and a cache that are connected linearly and form a ring structure. All processing cells are identical and programmable. Each processing cell has a peak performance of 20 million floating-point operations per second (20 MFLOPS). The machine therefore has a peak performance of 16 × 20 = 320 MFLOPS. It is integrated as an attached processor into a host system through a VME bus interface. Programs for FXCQ are written in a high-level language, B language, which is supported by a parallel optimizing compiler. This paper describes the architecture of FXCQ, B language and its compiler.
Funding: This study is financed by the European Union-NextGenerationEU, through the National Recovery and Resilience Plan of the Republic of Bulgaria, Project No. BG-RRP-2.013-0001.
Abstract: Covert timing channels (CTC) exploit network resources to establish hidden communication pathways, posing significant risks to data security and policy compliance. Therefore, detecting such hidden and dangerous threats remains one of the security challenges. This paper proposes LinguTimeX, a new framework that combines natural language processing with artificial intelligence, along with explainable Artificial Intelligence (AI), not only to detect CTC but also to provide insights into the decision process. LinguTimeX performs multidimensional feature extraction by fusing linguistic attributes with temporal network patterns to identify covert channels precisely. LinguTimeX demonstrates strong effectiveness in detecting CTC across multiple languages, namely English, Arabic, and Chinese. Specifically, the LSTM and RNN models achieved F1 scores of 90% on the English dataset, 89% on the Arabic dataset, and 88% on the Chinese dataset, showcasing their superior performance and ability to generalize across multiple languages. This highlights their robustness in detecting CTCs within security systems, regardless of the language or cultural context of the data. In contrast, the DeepForest model produced F1-scores ranging from 86% to 87% across the same datasets, further confirming its effectiveness in CTC detection. Although other algorithms also showed reasonable accuracy, the LSTM and RNN models consistently outperformed them in multilingual settings, suggesting that deep learning models might be better suited for this particular problem.
Abstract: Background: In mental health, recovery is emphasized, and qualitative analyses of service users’ narratives have accumulated; however, while qualitative approaches excel at capturing rich context and generating new concepts, they are limited in generalizability and feasible data volume. This study aimed to quantify the subjective life history narratives of users of psychiatric home-visit nursing using natural language processing (NLP) and to clarify the relationships between linguistic features and recovery-related indicators. Methods: We conducted semi-structured interviews on daily life, which were audio-recorded and transcribed verbatim, and collected self-report questionnaires (Recovery Assessment Scale [RAS]) and clinician ratings (Global Assessment of Functioning [GAF]) from Japanese users of psychiatric home-visit nursing. Using the artificial intelligence-based topic-modeling method BERTopic, we extracted topics from the interview texts and calculated each participant’s topic proportions, and then examined associations between topic proportions and recovery-related indicators using Pearson correlation analyses. Results: “School” showed a significant positive correlation with RAS (r = 0.39, p = 0.05), whereas “Family” showed a significant negative correlation (r = –0.46, p = 0.02). GAF was positively correlated with word count (r = 0.44, p = 0.02) and “Hospital” (r = 0.42, p = 0.03), and negatively correlated with “Backchannels” (aizuchi) (r = –0.41, p = 0.03). Conclusion: The present results suggest that the quantity, quality, and content of narratives can serve as useful indicators of mental health and recovery, and that objective NLP-based analysis of service users’ narratives can complement traditional self-report scales and clinician ratings to inform the design of recovery-oriented care in psychiatric home-visit nursing.
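The Pearson correlation analysis named in the abstract above can be sketched as follows. This is a minimal illustration only: the function name `pearson_r` and the sample values are hypothetical and are not data or code from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only (not data from the study):
# one topic proportion and one scale score per participant.
topic_share = [0.10, 0.25, 0.05, 0.40, 0.20]
scale_score = [60, 72, 55, 90, 70]
r = pearson_r(topic_share, scale_score)
print(f"r = {r:.2f}")
```

In practice a library routine that also reports a p-value would normally be used; the plain-Python version is shown only to make the formula explicit.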
Abstract: Sentiment analysis, a crucial task in discerning emotional tones within text, plays a pivotal role in understanding public opinion and user sentiment across diverse languages. While numerous scholars conduct sentiment analysis in widely spoken languages such as English, Chinese, Arabic, Roman Arabic, and more, grappling with resource-poor languages such as Urdu remains a challenge. Urdu is a uniquely crafted language, characterized by a script that amalgamates elements from diverse languages, including Arabic, Parsi, Pashtu, Turkish, Punjabi, Saraiki, and more. Urdu literature, characterized by distinct character sets and linguistic features, presents an additional hurdle due to the lack of accessible datasets, rendering sentiment analysis a formidable undertaking. The limited availability of resources has fueled increased interest among researchers, prompting a deeper exploration into Urdu sentiment analysis. This research is dedicated to Urdu language sentiment analysis, employing sophisticated deep learning models on an extensive dataset categorized into five labels: Positive, Negative, Neutral, Mixed, and Ambiguous. The primary objective is to discern sentiments and emotions within the Urdu language, despite the absence of well-curated datasets. To tackle this challenge, the initial step involves the creation of a comprehensive Urdu dataset by aggregating data from various sources such as newspapers, articles, and social media comments. Subsequent to this data collection, a thorough process of cleaning and preprocessing is implemented to ensure the quality of the data. The study leverages two well-known deep learning models, namely Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), for both training and evaluating sentiment analysis performance. Additionally, the study explores hyperparameter tuning to optimize the models’ efficacy. Evaluation metrics such as precision, recall, and the F1-score are employed to assess the effectiveness of the models. The research findings reveal that RNN surpasses CNN in Urdu sentiment analysis, achieving a significantly higher accuracy rate of 91%. This result accentuates the exceptional performance of RNN, solidifying its status as a compelling option for conducting sentiment analysis tasks in the Urdu language.
Funding: Supported by the China Undergraduate Innovation Training Program [Grant No. 202410699184] and the Humanities and Social Sciences Research Project funded by the Ministry of Education of China [Grant No. 23YJAZH139].
Abstract: Machine translation of low-resource languages (LRLs) has long been hindered by limited corpora and linguistic complexity. This review summarizes key developments, from traditional methods to recent progress with large language models (LLMs), while highlighting ongoing challenges such as data bottlenecks, biases, fairness, and computational costs. Finally, it discusses future directions, including efficient parameter fine-tuning, multimodal translation, and community-driven corpus construction, providing insights for advancing LRL translation research.
Funding: Chang Gung University and Chang Gung Memorial Hospital under project number NERPD4Q0021.
Abstract: Text clustering is an important task because of its vital role in NLP-related tasks. However, existing research on clustering is mainly based on the English language, with limited work on low-resource languages such as Urdu. Low-resource language text clustering is hampered by limited annotated collections and strong linguistic diversity. The primary aim of this paper is twofold: (1) to introduce a clustering dataset named UNC-2025 comprising 100k Urdu news documents, and (2) to provide a detailed empirical benchmark of Large Language Model (LLM)-enhanced clustering methods for Urdu text. We explicitly evaluate the behavior of 11 multilingual and Urdu-specific embeddings on 3 different clustering algorithms. We carefully evaluated performance based on a set of internal and external validity measures. We find that the best configuration, the mBERT embedding with the HDBSCAN algorithm, attains a new state-of-the-art performance with an external validity score of 0.95. This LLM-based method establishes a strong new benchmark for Urdu text clustering. Importantly, the results confirm the strength and scalability of LLM-generated embeddings in capturing the fine, subtle semantics needed to discover topics in low-resource settings and open the door to novel NLP applications in underrepresented languages.
Abstract: The natural language processing (NLP) domain has witnessed significant advancements with the emergence of transformer-based models, which have reshaped the text understanding and generation landscape. While their capabilities are well recognized, there remains a limited systematic synthesis of how these models perform across tasks, scale efficiently, adapt to domains, and address ethical challenges. Therefore, the aim of this paper was to analyze the performance of transformer-based models across various NLP tasks, their scalability, domain adaptation, and the ethical implications of such models. This meta-analysis paper synthesizes findings from 25 peer-reviewed studies on NLP transformer-based models, adhering to the PRISMA framework. Relevant papers were sourced from electronic databases, including IEEE Xplore, Springer, ACM Digital Library, Elsevier, PubMed, and Google Scholar. The findings highlight the superior performance of transformers over conventional approaches, attributed to self-attention mechanisms and pre-trained language representations. Despite these advantages, challenges such as high computational costs, data bias, and hallucination persist. The study provides new perspectives by underscoring the necessity for future research to optimize transformer architectures for efficiency, address ethical AI concerns, and enhance generalization across languages. This paper contributes valuable insights into the current trends, limitations, and potential improvements in transformer-based models for NLP.
Funding: Financial support from the National Science Foundation (NSF) EPSCoR R.I.I. Track-2 Program, awarded under NSF grant number 2119691.
Abstract: The increasing frequency and severity of natural disasters, exacerbated by global warming, necessitate novel solutions to strengthen the resilience of Critical Infrastructure Systems (CISs). Recent research reveals the significant potential of natural language processing (NLP) to analyze unstructured human language during disasters, thereby facilitating the uncovering of disruptions and providing situational awareness supporting various aspects of resilience regarding CISs. Despite this potential, few studies have systematically mapped the global research on NLP applications with respect to supporting various aspects of resilience of CISs. This paper contributes to the body of knowledge by presenting a review of current knowledge using the scientometric review technique. Using 231 bibliographic records from the Scopus and Web of Science core collections, we identify five key research areas where researchers have used NLP to support the resilience of CISs during natural disasters, including sentiment analysis, crisis informatics, data and knowledge visualization, disaster impacts, and content analysis. Furthermore, we map the utility of NLP in the identified research focus with respect to four aspects of resilience (i.e., preparedness, absorption, recovery, and adaptability) and present various common techniques used and potential future research directions. This review highlights that NLP has the potential to become a supplementary data source to support the resilience of CISs. The results of this study serve as an introductory-level guide designed to help scholars and practitioners unlock the potential of NLP for strengthening the resilience of CISs against natural disasters.
Abstract: DeepSeek, a Chinese open-source artificial intelligence (AI) model, has gained a lot of attention due to its economical training and efficient inference. DeepSeek, a model trained on large-scale reinforcement learning without supervised fine-tuning as a preliminary step, demonstrates remarkable reasoning capabilities in performing a wide range of tasks. DeepSeek is a prominent AI-driven chatbot that assists individuals in learning and enhances responses by generating insightful solutions to inquiries. Users possess divergent viewpoints regarding advanced models like DeepSeek, posting both their merits and shortcomings across several social media platforms. This research presents a new framework for predicting public sentiment to evaluate perceptions of DeepSeek. To transform the unstructured data into a suitable form, we initially collect DeepSeek-related tweets from Twitter and subsequently implement various preprocessing methods. Subsequently, we annotated the tweets utilizing the Valence Aware Dictionary and sEntiment Reasoner (VADER) methodology and the lexicon-driven TextBlob. Next, we classified the attitudes obtained from the cleaned data utilizing the proposed hybrid model. The proposed hybrid model consists of long short-term memory (LSTM) and bidirectional gated recurrent units (BiGRU). To strengthen it, we include multi-head attention, regularizer activation, and dropout units to enhance performance. Topic modeling employing KMeans clustering and Latent Dirichlet Allocation (LDA) was utilized to analyze public behavior concerning DeepSeek. The perceptions demonstrate that 82.5% of the people are positive, 15.2% negative, and 2.3% neutral using TextBlob, and 82.8% positive, 16.1% negative, and 1.2% neutral using the VADER analysis. The slight difference in results indicates that both analyses concur in their overall perceptions while possibly differing in how they treat language peculiarities. The results indicate that the proposed model surpassed previous state-of-the-art approaches.
Funding: Supported by the IITP (Institute of Information & Communications Technology Planning & Evaluation)-ITRC (Information Technology Research Center) grant funded by the Korean government (Ministry of Science and ICT) (IITP-2025-RS-2024-00438056).
Abstract: The increased accessibility of social networking services (SNSs) has facilitated communication and information sharing among users. However, it has also heightened concerns about digital safety, particularly for children and adolescents who are increasingly exposed to online grooming crimes. Early and accurate identification of grooming conversations is crucial in preventing long-term harm to victims. However, research on grooming detection in South Korea remains limited, as existing models are trained primarily on English text and fail to reflect the unique linguistic features of SNS conversations, leading to inaccurate classifications. To address these issues, this study proposes a novel framework that integrates optical character recognition (OCR) technology with KcELECTRA, a deep learning-based natural language processing (NLP) model that shows excellent performance in processing the colloquial Korean language. In the proposed framework, the KcELECTRA model is fine-tuned on an extensive dataset, including Korean social media conversations, Korean ethical verification data from AI-Hub, and Korean hate speech data from HuggingFace, to enable more accurate classification of text extracted from social media conversation images. Experimental results show that the proposed framework achieves an accuracy of 0.953, outperforming existing transformer-based models. Furthermore, OCR technology shows high accuracy in extracting text from images, demonstrating that the proposed framework is effective for online grooming detection. The proposed framework is expected to contribute to the more accurate detection of grooming text and the prevention of grooming-related crimes.
Funding: National Key Research and Development Program of China (2024YFC3505400); Capital Clinical Project of Beijing Municipal Science & Technology Commission (Z221100007422092); Capital’s Funds for Health Improvement and Research (2024-1-2231).
Abstract: Objective: To develop a clinical decision and prescription generation system (CDPGS) specifically for diarrhea in traditional Chinese medicine (TCM), utilizing a specialized large language model (LLM), Qwen-TCM-Dia, to standardize diagnostic processes and prescription generation. Methods: Two primary datasets were constructed: an evaluation benchmark and a fine-tuning dataset consisting of fundamental diarrhea knowledge, medical records, and chain-of-thought (CoT) reasoning datasets. After an initial evaluation of 16 open-source LLMs across inference time, accuracy, and output quality, Qwen2.5 was selected as the base model due to its superior overall performance. We then employed a two-stage low-rank adaptation (LoRA) fine-tuning strategy, integrating continued pre-training on domain-specific knowledge with instruction fine-tuning using CoT-enriched medical records. This approach was designed to embed the clinical logic (symptoms → pathogenesis → therapeutic principles → prescriptions) into the model’s reasoning capabilities. The resulting fine-tuned model, specialized for TCM diarrhea, was designated as Qwen-TCM-Dia. Model performance was evaluated for disease diagnosis and syndrome type differentiation using accuracy, precision, recall, and F1-score. Furthermore, the quality of the generated prescriptions was compared with that of established open-source TCM LLMs. Results: Qwen-TCM-Dia achieved peak performance compared to both the base Qwen2.5 model and five other open-source TCM LLMs. It achieved 97.05% accuracy and 91.48% F1-score in disease diagnosis, and 74.54% accuracy and 74.21% F1-score in syndrome type differentiation. Compared with existing open-source TCM LLMs (BianCang, HuangDi, LingDan, TCMLLM-PR, and ZhongJing), Qwen-TCM-Dia exhibited higher fidelity in reconstructing the “symptoms → pathogenesis → therapeutic principles → prescriptions” logic chain. It provided complete prescriptions, whereas other models often omitted dosages or generated mismatched prescriptions. Conclusion: By integrating continued pre-training, CoT reasoning, and a two-stage fine-tuning strategy, this study establishes a CDPGS for diarrhea in TCM. The results demonstrate the synergistic effect of strengthening domain representation through pre-training and activating logical reasoning via CoT. This research not only provides critical technical support for the standardized diagnosis and treatment of diarrhea but also offers a scalable paradigm for the digital inheritance of expert TCM experience and the intelligent transformation of TCM.
Funding: Supported by the research project LaTe4PoliticES (PID2022-138099OB-I00), funded by MCIN/AEI/10.13039/501100011033 and the European Fund for Regional Development (ERDF)-a way to make Europe. Tomás Bernal-Beltrán is supported by the University of Murcia through the predoctoral programme.
Abstract: The malicious dissemination of hate speech via compromised accounts, automated bot networks and malware-driven social media campaigns has become a growing cybersecurity concern. Automatically detecting such content in Spanish is challenging due to linguistic complexity and the scarcity of annotated resources. In this paper, we compare two predominant AI-based approaches for the forensic detection of malicious hate speech: (1) fine-tuning encoder-only models that have been trained in Spanish and (2) In-Context Learning techniques (Zero- and Few-Shot Learning) with large-scale language models. Our approach goes beyond binary classification, proposing a comprehensive, multidimensional evaluation that labels each text by: (1) type of speech, (2) recipient, (3) level of intensity (ordinal) and (4) targeted group (multi-label). Performance is evaluated on an annotated Spanish corpus using standard metrics such as precision, recall and F1-score, together with stability-oriented metrics that assess the transition from zero-shot to few-shot prompting (Zero-to-Few Shot Retention and Zero-to-Few Shot Gain). The results indicate that fine-tuned encoder-only models (notably MarIA and BETO variants) consistently deliver the strongest and most reliable performance: in our experiments their macro F1-scores lie in the range of approximately 46%–66% depending on the task. Zero-shot approaches are much less stable and typically yield substantially lower performance (observed F1-scores range approximately 0%–39%), often producing invalid outputs in practice. Few-shot prompting (e.g., Qwen 38B, Mistral 7B) generally improves stability and recall relative to pure zero-shot, bringing F1-scores into a moderate range of approximately 20%–51% but still falling short of fully fine-tuned models. These findings highlight the importance of supervised adaptation and discuss the potential of both paradigms as components in AI-powered cybersecurity and malware forensics systems designed to identify and mitigate coordinated online hate campaigns.
Funding: Supported through the Annual Funding track by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Project No. AN000685].
Abstract: Sentiment analysis (SA) is the procedure of recognizing the emotions related to the data that exist in social networking. The existence of sarcasm in textual data is a major challenge to the efficiency of SA. Earlier works on sarcasm detection in text utilize lexical as well as pragmatic cues, namely interjections, punctuation, and sentiment shift, that are vital indicators of sarcasm. With the advent of deep learning, recent works leverage neural networks to learn lexical and contextual features, removing the need for handcrafted features. In this aspect, this study designs a deep learning with natural language processing enabled SA (DLNLP-SA) technique for sarcasm classification. The proposed DLNLP-SA technique aims to detect and classify the occurrence of sarcasm in the input data. Besides, the DLNLP-SA technique holds various sub-processes, namely preprocessing, feature vector conversion, and classification. Initially, the pre-processing is performed in diverse ways such as single character removal, multi-space removal, URL removal, stopword removal, and tokenization. Secondly, the transformation of feature vectors takes place using the N-gram feature vector technique. Finally, mayfly optimization (MFO) with a multi-head self-attention based gated recurrent unit (MHSA-GRU) model is employed for the detection and classification of sarcasm. To verify the enhanced outcomes of the DLNLP-SA model, a comprehensive experimental investigation is performed on the News Headlines Dataset from the Kaggle Repository, and the results signified its supremacy over the existing approaches.
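The tokenization and N-gram feature vector steps mentioned in the abstract above can be sketched roughly as below. This is a simplified illustration, assuming whitespace tokenization and raw counts; the helper names `word_ngrams` and `ngram_counts` are hypothetical, and the abstract does not specify the authors' actual feature pipeline.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n_values=(1, 2)):
    """Count word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in n_values:
        counts.update(word_ngrams(tokens, n))
    return counts


# Hypothetical headline, used only to show the shape of the output.
features = ngram_counts("oh great another monday")
print(features[("oh", "great")])
```

A sparse feature vector is then obtained by fixing a vocabulary of n-grams and reading off each document's counts in that order.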
Funding: The National Natural Scientific Foundation of China (Grants 81790650, 81790651, 81727808, 81627901, and 31771253); the Beijing Municipal Science and Technology Commission (Grants Z171100000117012 and Z181100001518003); the Collaborative Research Fund of the Chinese Institute for Brain Research, Beijing (No. 2020-NKXPT-02).
Abstract: As one of the most widely used languages in the world, the Chinese language is distinct from most western languages in many properties, thus providing a unique opportunity for understanding the brain basis of human language and cognition. In recent years, non-invasive neuroimaging techniques such as magnetic resonance imaging (MRI) have blazed a new trail to comprehensively study specific neural correlates of Chinese language processing and Chinese speakers. We reviewed the application of functional MRI (fMRI) in such studies and some essential findings on brain systems in processing Chinese. Specifically, we covered, for example, the application of task fMRI and resting-state fMRI in observing the process of reading and writing the logographic characters and producing or listening to the tonal speech. Elementary cognitive neuroscience and several potential research directions around brain and Chinese language were discussed, which may be informative for future research.
Funding: Supported in part by the National Natural Science Foundation of China under Grant 62176109, in part by the Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province under Grant 2021‐Z‐003, in part by the Natural Science Foundation of Gansu Province under Grant 21JR7RA531 and Grant 22JR5RA487, in part by the Fundamental Research Funds for the Central Universities under Grant lzujbky‐2022‐23, in part by the CAAI‐Huawei MindSpore Open Fund under Grant CAAIXSJLJJ‐2022‐020A, in part by the Supercomputing Center of Lanzhou University, and in part by Sichuan Science and Technology Program No. 2022nsfsc0916.
Abstract: A variety of neural networks have been presented to deal with issues in deep learning in the last decades. Despite the prominent success achieved by the neural network, it still lacks theoretical guidance to design an efficient neural network model, and verifying the performance of a model needs excessive resources. Previous research studies have demonstrated that many existing models can be regarded as different numerical discretizations of differential equations. This connection sheds light on designing an effective recurrent neural network (RNN) by resorting to numerical analysis. Simple RNN is regarded as a discretisation of the forward Euler scheme. Considering the limited solution accuracy of the forward Euler methods, a Taylor-type discrete scheme is presented with lower truncation error and a Taylor-type RNN (T-RNN) is designed with its guidance. Extensive experiments are conducted to evaluate its performance on statistical language models and emotion analysis tasks. The noticeable gains obtained by T-RNN present its superiority and the feasibility of designing the neural network model using numerical methods.
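As background for the abstract above, the textbook forward Euler scheme and its truncation error can be written out as follows. The symbols $h$, $x$, $f$, $\theta$ and $\Delta t$ are chosen here for illustration, and the specific Taylor-type scheme proposed in the paper is not given in the abstract and is not reproduced.

```latex
% Continuous-time view of a hidden state h(t) driven by an input x(t)
\frac{\mathrm{d}h}{\mathrm{d}t} = f\bigl(h(t), x(t); \theta\bigr)

% Taylor expansion of the exact solution around time t
h(t + \Delta t) = h(t) + \Delta t\, h'(t) + \frac{\Delta t^{2}}{2}\, h''(t) + O(\Delta t^{3})

% Forward Euler keeps only the first-order term, giving a recurrent update
h_{n+1} = h_{n} + \Delta t\, f\bigl(h_{n}, x_{n}; \theta\bigr),
\qquad \text{local truncation error } O(\Delta t^{2})
```

Retaining further terms of the expansion, or combining several past states, lowers the truncation error; this is the general sense in which a higher-order discretisation can guide the design of a recurrent update.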
Funding: Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R161), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. The authors would like to thank the Deanship of Scientific Research at Umm Al-Qura University for supporting this work by Grant Code: (22UQU4310373DSR33).
Abstract: The recent developments in Multimedia Internet of Things (MIoT) devices, empowered with Natural Language Processing (NLP) models, seem to be a promising future of smart devices. NLP plays an important role in industrial models such as speech understanding, emotion detection, home automation, and so on. If an image needs to be captioned, then the objects in that image, its actions and connections, and any salient feature that remains under-projected or missing from the image should be identified. The aim of the image captioning process is to generate a caption for an image. In the next step, the image should be provided with one of the most significant and detailed descriptions that is syntactically as well as semantically correct. In this scenario, a computer vision model is used to identify the objects and NLP approaches are followed to describe the image. The current study develops a Natural Language Processing with Optimal Deep Learning Enabled Intelligent Image Captioning System (NLPODL-IICS). The aim of the presented NLPODL-IICS model is to produce a proper description for an input image. To attain this, the proposed NLPODL-IICS follows two stages, namely encoding and decoding. Initially, at the encoding side, the proposed NLPODL-IICS model makes use of Hunger Games Search (HGS) with the Neural Architecture Search Network (NASNet) model. This model represents the input data appropriately by inserting it into a predefined length vector. Besides, during the decoding phase, the Chimp Optimization Algorithm (COA) with a deeper Long Short Term Memory (LSTM) approach is followed to concatenate the description sentences produced by the method. The application of the HGS and COA algorithms helps in accomplishing proper parameter tuning for the NASNet and LSTM models respectively. The proposed NLPODL-IICS model was experimentally validated with the help of two benchmark datasets. A widespread comparative analysis confirmed the superior performance of the NLPODL-IICS model over other models.
Funding: This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and by the TIB Leibniz Information Centre for Science and Technology.
Abstract: Purpose: This work aims to normalize the NLPCONTRIBUTIONS scheme (henceforward, NLPCONTRIBUTIONGRAPH) to structure, directly from article sentences, the contributions information in Natural Language Processing (NLP) scholarly articles via a two-stage annotation methodology: 1) a pilot stage, to define the scheme (described in prior work); and 2) an adjudication stage, to normalize the graphing model (the focus of this paper). Design/methodology/approach: We re-annotate, a second time, the contributions-pertinent information across 50 previously annotated NLP scholarly articles in terms of a data pipeline comprising contribution-centered sentences, phrases, and triple statements. Specifically, care was taken in the adjudication annotation stage to reduce annotation noise while formulating the guidelines for our proposed novel NLP contributions structuring and graphing scheme. Findings: Applying NLPCONTRIBUTIONGRAPH to the 50 articles resulted in a dataset of 900 contribution-focused sentences, 4,702 contribution-information-centered phrases, and 2,980 surface-structured triples. The intra-annotation agreement between the first and second stages, in terms of F1-score, was 67.92% for sentences, 41.82% for phrases, and 22.31% for triple statements, indicating that annotation decision variance grows as the granularity of the information increases. Research limitations: NLPCONTRIBUTIONGRAPH has limited scope for structuring scholarly contributions compared with STEM (Science, Technology, Engineering, and Medicine) scholarly knowledge at large. Further, the annotation scheme in this work is designed by intra-annotator consensus only: a single annotator first annotated the data to propose the initial scheme, after which the same annotator re-annotated the data to normalize the annotations in an adjudication stage. However, the expected goal of this work is to achieve a standardized retrospective model for capturing NLP contributions from scholarly articles. This would entail a larger initiative of enlisting multiple annotators to accommodate different worldviews into a "single" set of structures and relationships as the final scheme. Given that the scheme is proposed here for the first time, and given the complexity of the annotation task within a realistic timeframe, our intra-annotation procedure is well suited. Nevertheless, the model proposed in this work is presently limited since it does not incorporate multiple annotator worldviews. This is planned as future work to produce a robust model. Practical implications: We demonstrate NLPCONTRIBUTIONGRAPH data integrated into the Open Research Knowledge Graph (ORKG), a next-generation KG-based digital library with intelligent computations enabled over structured scholarly knowledge, as a viable aid to assist researchers in their day-to-day tasks. Originality/value: NLPCONTRIBUTIONGRAPH is a novel scheme to annotate research contributions from NLP articles and integrate them in a knowledge graph, which to the best of our knowledge does not exist in the community. Furthermore, our quantitative evaluations over the two-stage annotation tasks offer insights into task difficulty.
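The agreement figures above are reported as F1-scores between the two annotation stages. The abstract does not say how items were matched, so the following is only a minimal sketch, assuming exact-match comparison of annotated items treated as sets, with the second (adjudication) pass taken as the reference. The function name `f1_agreement` and the sentence IDs are hypothetical and are not data from the paper.

```python
def f1_agreement(first_pass, second_pass):
    """F1 between two annotation passes, using the second pass as reference.

    Assumption: items agree only on exact match (set intersection).
    """
    first, second = set(first_pass), set(second_pass)
    overlap = len(first & second)
    if overlap == 0:
        return 0.0
    precision = overlap / len(first)   # share of first-pass items that were kept
    recall = overlap / len(second)     # share of final items already found in pass one
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence IDs selected in each stage (illustration only).
pilot = {"s1", "s2", "s3", "s4"}
adjudicated = {"s2", "s3", "s4", "s5", "s6"}
score = f1_agreement(pilot, adjudicated)
print(f"{score:.4f}")
```

Under exact matching, finer-grained units such as phrases and triples have more ways to differ, which is consistent with the lower agreement the abstract reports for them.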
Abstract: One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that address the so-called curse of dimensionality, a problem that plagues NLP in general because the feature set for learning starts as a function of the size of the language in question, typically upwards of hundreds of thousands of terms. As such, much of the research and development in NLP in the last two decades has gone into finding and optimizing solutions to this problem, in effect to feature selection in NLP. This paper looks at the development of these techniques, which leverage a variety of statistical methods resting on linguistic theories advanced in the middle of the last century, namely the distributional hypothesis, which suggests that words found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as a data structure perspective, from Latent Semantic Analysis to Vector Space Models to their more modern variants, which are typically referred to as word embeddings. In reviewing algorithms such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally, beyond their applicability to NLP.
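The distributional hypothesis mentioned above can be illustrated with a count-based vector space: each word is represented by the counts of its neighbouring words, and similarity is measured by the cosine of the angle between two such vectors. This is a minimal sketch and not any of the surveyed algorithms; the toy corpus, the window size, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (invented): "cat" and "dog" share contexts, "stocks" does not.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "stocks fell on monday",
]
vecs = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_stocks = cosine(vecs["cat"], vecs["stocks"])
print(sim_cat_dog > sim_cat_stocks)
```

Each vector here has one dimension per vocabulary word, which is the dimensionality problem the abstract describes. Latent Semantic Analysis and word embeddings replace these sparse counts with much shorter dense vectors.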
Funding: Supported by the National Natural Science Foundation of China (No. 81874429); the Digital and Applied Research Platform for Diagnosis of Traditional Chinese Medicine (No. 49021003005); the 2018 Hunan Provincial Postgraduate Research Innovation Project (No. CX2018B465); and the Excellent Youth Project of Hunan Education Department in 2018 (No. 18B241).
Abstract: Objective: Natural language processing (NLP) was used to mine and visualize the core content of syndrome element syndrome differentiation (SESD). Methods: First, a text mining and analysis environment was built in Python, and a corpus was built from the core chapters of SESD. Second, the corpus was digitized; the main steps were word segmentation, information cleaning and merging, construction of a document-entry matrix, dictionary compilation, and information conversion. Third, the internal information of the SESD corpus was mined and displayed by means of word clouds, keyword extraction, and visualization. Results: NLP played a positive role in computer recognition and comprehension of SESD. Different chapters had different keywords and weights. Deficiency syndrome elements, such as "Qi deficiency", "Yang deficiency", and "Yin deficiency", were an important component of SESD. The important syndrome elements of substantiality included "Blood stasis", "Qi stagnation", etc. Core syndrome elements were closely related. Conclusions: Syndrome differentiation and treatment was the core of SESD. Using NLP to mine syndrome differentiation could help reveal its internal relationships and provide a basis for artificial intelligence to learn syndrome differentiation.
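The three-step workflow described above (segmentation and cleaning, a document-by-term count matrix, then keyword weighting) can be sketched as follows. This is an illustration under stated assumptions and not the authors' pipeline. Whitespace splitting stands in for Chinese word segmentation, TF-IDF stands in for whatever keyword weighting the study used, and the stopword list, function names, and three short "chapters" are invented.

```python
import math
from collections import Counter

STOPWORDS = {"the", "of", "and", "is", "with", "a"}  # assumed cleaning list


def segment(text):
    """Stand-in for word segmentation: lowercase, split, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[d][t] is a raw term count."""
    counts = [Counter(segment(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def top_keywords(vocab, matrix, doc_index, k=2):
    """Rank one document's terms by TF-IDF (ties broken alphabetically)."""
    n_docs = len(matrix)
    scores = {}
    for col, term in enumerate(vocab):
        tf = matrix[doc_index][col]
        if tf == 0:
            continue
        df = sum(1 for row in matrix if row[col] > 0)
        scores[term] = tf * math.log(n_docs / df)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Invented miniature "chapters" (not text from the SESD corpus).
chapters = [
    "qi deficiency with fatigue and qi deficiency with weak pulse",
    "blood stasis with fixed pain and blood stasis",
    "qi stagnation with distension",
]
vocab, matrix = document_term_matrix(chapters)
keywords_ch1 = top_keywords(vocab, matrix, doc_index=1)
print(keywords_ch1)
```

Because each chapter is scored against the others, a term that appears in every chapter receives little weight, which is one way different chapters end up with different keywords and weights.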
Abstract: This paper studies an automatic pattern abstraction and recognition method for large-scale database systems based on natural language processing (NLP). In a distributed database, data spread across different nodes, and even across regions, can be recognized through the network connections between the nodes. Because a database model designed to reduce data redundancy usually contains many forms, we combine NLP theory to optimize the traditional method. Experimental analysis and simulation support the correctness of our method.
Abstract: A systolic array architecture computer (FXCQ) has been designed for signal processing. It can handle floating-point data at very high speed. It is composed of 16 processing cells and a cache that are connected linearly and form a ring structure. All processing cells are identical and programmable. Each processing cell has a peak performance of 20 million floating-point operations per second (20 MFLOPS). The machine therefore has a peak performance of 320 MFLOPS. It is integrated as an attached processor into a host system through a VME bus interface. Programs for FXCQ are written in a high-level language, the B language, which is supported by a parallel optimizing compiler. This paper describes the architecture of FXCQ, the B language, and its compiler.
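The peak figure quoted above follows directly from the cell count and the per-cell rate. The short sketch below works through that arithmetic and models the ring as a successor map. The position of the cache in the ring and the node names are assumptions, since the abstract only states that the cells and the cache are connected linearly to form a ring.

```python
NUM_CELLS = 16             # processing cells, from the abstract
PEAK_PER_CELL_MFLOPS = 20  # peak rate of one cell, from the abstract


def peak_mflops(num_cells=NUM_CELLS, per_cell=PEAK_PER_CELL_MFLOPS):
    """Aggregate peak rate when every cell runs at its own peak."""
    return num_cells * per_cell


def ring_successors(num_cells=NUM_CELLS):
    """Map each node to its successor in a ring of one cache plus the cells.

    Assumption: the cache sits between the last and the first cell.
    """
    nodes = ["cache"] + [f"cell{i}" for i in range(num_cells)]
    return {node: nodes[(i + 1) % len(nodes)] for i, node in enumerate(nodes)}


ring = ring_successors()
print(peak_mflops(), "MFLOPS peak across", NUM_CELLS, "cells")
```

The 320 MFLOPS figure is an upper bound that assumes all 16 cells are busy on every cycle.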