Abstract: Due to the availability of a huge number of electronic text documents from a variety of sources representing unstructured and semi-structured information, the document classification task has become an interesting area for controlling data behavior. This paper presents a document classification multimodal for categorizing textual semi-structured and unstructured documents. The multimodal implements several individual deep learning models such as Deep Neural Networks (DNN), Recurrent Convolutional Neural Networks (RCNN), and Bidirectional LSTM (Bi-LSTM). A stacked-ensemble meta-model technique is used to combine the results of the individual classifiers, producing better results than any of the above-mentioned models achieves individually. A series of textual preprocessing steps is executed to normalize the input corpus, followed by text vectorization techniques. These techniques use Term Frequency-Inverse Document Frequency (TFIDF) or Continuous Bag of Words (CBOW) to convert text data into a numeric form suitable for deep learning models. The proposed model is validated using a dataset collected from several spaces with a huge number of documents in every class, and the experimental results show that it achieves effective performance. For PDF document classification, the proposed model achieves accuracy of up to 0.9045 and 0.959 for the TFIDF and CBOW features, respectively. For JSON document classification, it achieves accuracy of up to 0.914 and 0.956 for the TFIDF and CBOW features, respectively. For XML document classification, it achieves accuracy of up to 0.92 and 0.959 for the TFIDF and CBOW features, respectively.
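To make the vectorization step concrete, the following is a minimal sketch of TF-IDF weighting using only the Python standard library. It is not the paper's implementation: the function name `tfidf_vectors`, the toy documents, and the specific weighting (length-normalized term frequency times `log(N / df)`) are assumptions chosen for illustration, since the abstract does not state which TF-IDF variant was used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    The abstract does not specify the variant used in the paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([
            (counts[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in vocab
        ])
    return vocab, vectors


# Hypothetical toy corpus, already tokenized.
docs = [
    "invoice total amount due".split(),
    "invoice payment received".split(),
    "meeting agenda and notes".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` is a fixed-length numeric representation of one document; rows like these are the kind of input that the individual classifiers and the stacked meta-model would consume.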
Abstract: Large language models (LLMs), such as ChatGPT developed by OpenAI, represent a significant advancement in artificial intelligence (AI), designed to understand, generate, and interpret human language by analyzing extensive text data. Their potential integration into clinical settings offers a promising avenue that could transform clinical diagnosis and decision-making processes in the future (Thirunavukarasu et al., 2023). This article aims to provide an in-depth analysis of LLMs’ current and potential impact on clinical practices. Their ability to generate differential diagnosis lists underscores their potential as invaluable tools in medical practice and education (Hirosawa et al., 2023; Koga et al., 2023).
Abstract: To address the problems of incomplete characterization of text relations, poor guidance from potential representations, and low generation quality in controllable long text generation, this paper proposes a new GSPT-CVAE model (Graph Structured Processing, Single Vector, and Potential Attention Computing Transformer-Based Conditioned Variational Autoencoder model). The model obtains a more comprehensive representation of textual relations through graph-structured processing of the input text, and at the same time obtains a single vector representation by weighted merging of the vector sequences produced by that processing, yielding an effective potential representation. When the potential representation guides text generation, the model combines traditional embedding with potential attention calculation so that the potential representation fully guides the generated text, improving the controllability and effectiveness of text generation. The experimental results show that the model has excellent representation learning ability and can learn rich and useful textual relationship representations. The model also achieves satisfactory results in the effectiveness and controllability of text generation and can generate long texts that match the given constraints. The model's ROUGE-1 F1 score is 0.243, its ROUGE-2 F1 score is 0.041, its ROUGE-L F1 score is 0.22, and its PPL-Word score is 34.303, which gives the GSPT-CVAE model a certain advantage over the baseline model. The paper also compares the model with state-of-the-art generative models such as T5, GPT-4, and Llama2, and the experimental results show that the GSPT-CVAE model is competitive.
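For readers unfamiliar with the ROUGE-N F1 scores reported above, the sketch below shows how the metric is commonly computed from clipped n-gram overlap. It is an illustrative simplification, not the evaluation code used in the paper: the function name `rouge_n_f1` and the example sentences are hypothetical, and details such as stemming or tokenization rules used by standard ROUGE toolkits are omitted.

```python
from collections import Counter


def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N F1 between two token lists, using clipped n-gram overlap."""
    def ngrams(tokens):
        return Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example texts.
reference = "the model generates long text that matches the constraints".split()
candidate = "the model generates text matching the constraints".split()
r1 = rouge_n_f1(candidate, reference, n=1)  # 6 shared unigrams -> 0.75
r2 = rouge_n_f1(candidate, reference, n=2)  # 3 shared bigrams -> 3/7
```

Scores for long generated texts are typically much lower than in this toy example, because exact n-gram overlap with a single reference becomes rare as length grows.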
Funding: Supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [RS-2021-II211341, Artificial Intelligence Graduate School Program (Chung-Ang University)] and by the Chung-Ang University Graduate Research Scholarship in 2024.
Abstract: Restoring texts corrupted by visually perturbed homoglyph characters presents significant challenges to conventional Natural Language Processing (NLP) systems, primarily due to ambiguities arising from characters that appear visually similar yet differ semantically. Traditional text restoration methods struggle with these homoglyph perturbations due to limitations such as a lack of contextual understanding and difficulty in handling cases where one character maps to multiple candidates. To address these issues, we propose an Optical Character Recognition (OCR)-assisted masked Bidirectional Encoder Representations from Transformers (BERT) model specifically designed for homoglyph-perturbed text restoration. Our method integrates OCR preprocessing with a character-level BERT architecture, where OCR preprocessing transforms visually perturbed characters into their approximate alphabetic equivalents, significantly reducing multi-correspondence ambiguities. Subsequently, the character-level BERT leverages bidirectional contextual information to accurately resolve remaining ambiguities by predicting intended characters based on surrounding semantic cues. Extensive experiments conducted on realistic phishing email datasets demonstrate that the proposed method significantly outperforms existing restoration techniques, including OCR-based, dictionary-based, and traditional BERT-based approaches, achieving a word-level restoration accuracy of up to 99.59% in fine-tuned settings. Additionally, our approach exhibits robust performance in zero-shot scenarios and maintains effectiveness under low-resource conditions. Further evaluations across multiple downstream tasks, such as part-of-speech tagging, chunking, toxic comment classification, and homoglyph detection under conditions of severe visual perturbation (up to 40%), confirm the method’s generalizability and applicability. Our proposed hybrid approach, combining OCR preprocessing with character-level contextual modeling, represents a scalable and practical solution for mitigating visually adversarial text attacks, thereby enhancing the security and reliability of NLP systems in real-world applications.
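The multi-candidate ambiguity described above can be illustrated with a small dictionary-based restoration sketch, the kind of baseline the abstract compares against rather than the proposed OCR-assisted BERT method. Everything here is hypothetical: the `HOMOGLYPHS` table, the function names, and the toy vocabulary are assumptions made for illustration only.

```python
from itertools import product

# Hypothetical homoglyph table: each perturbed character maps to one or
# more plausible Latin letters. Real tables are far larger.
HOMOGLYPHS = {
    "0": ["o"],
    "1": ["l", "i"],
    "\u0430": ["a"],  # Cyrillic small letter a
    "\u0435": ["e"],  # Cyrillic small letter ie
    "@": ["a"],
}


def candidate_restorations(word):
    """Enumerate every string obtainable by substituting homoglyphs."""
    options = [HOMOGLYPHS.get(ch, [ch]) for ch in word]
    return ["".join(chars) for chars in product(*options)]


def restore(word, vocabulary):
    """Return all candidate restorations that are known vocabulary words."""
    return [c for c in candidate_restorations(word) if c in vocabulary]


# Hypothetical vocabulary.
vocabulary = {"login", "mail", "mall", "account"}

unambiguous = restore("1og1n", vocabulary)  # only "login" is a real word
ambiguous = restore("ma11", vocabulary)     # both "mall" and "mail" survive
```

When more than one candidate survives, as with `ma11`, a lookup table alone cannot choose between them; this is the gap that surrounding context, for example from a character-level language model, is meant to close.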
Funding: The Special Project of the Shanghai Municipal Commission of Economy and Information Technology for Promoting High-Quality Industrial Development (No. 2024-GZL-RGZN-02011); the Shanghai City Digital Transformation Project (No. 202301002); the Project of Shanghai Shenkang Hospital Development Center (No. SHDC22023214).
Abstract: Surgical site infections (SSIs) are the most common healthcare-related infections in patients with lung cancer. Constructing a lung cancer SSI risk prediction model requires the extraction of relevant risk factors from lung cancer case texts, which involves two types of text structuring tasks: attribute discrimination and attribute extraction. This article proposes a joint model, Multi-BGLC, for these two tasks, using bidirectional encoder representations from transformers (BERT) as the encoder and fine-tuning a decoder composed of a graph convolutional neural network (GCNN), long short-term memory (LSTM), and a conditional random field (CRF) on cancer case data. The GCNN is used for attribute discrimination, whereas the LSTM and CRF are used for attribute extraction. The experiment verified the effectiveness and accuracy of the model compared with other baseline models.
Abstract: We analyze the suitability of existing pre-trained transformer-based language models (PLMs) for abstractive text summarization on German technical healthcare texts. The study focuses on the multilingual capabilities of these models and their ability to perform abstractive text summarization in the healthcare field. The research hypothesis was that large language models could perform high-quality abstractive text summarization on German technical healthcare texts, even if the model is not specifically trained in that language. Through experiments, the research questions explore the performance of transformer language models in dealing with complex syntax constructs, the difference in performance between models trained in English and German, and the impact of translating the source text to English before conducting the summarization. We evaluated four PLMs (GPT-3, a translation-based approach also utilizing GPT-3, a German language model, and a domain-specific biomedical model approach). The evaluation considered informativeness, using three types of metrics based on Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and the quality of the results, which was evaluated manually across five aspects. The results show that text summarization models could be used in the German healthcare domain and that domain-independent language models achieved the best results. The study indicates that text summarization models can simplify the search for pre-existing German knowledge in various domains.
Abstract: On January 14, Heimtextil kicked off the new trade fair year with over 3,000 exhibitors from 65 countries. With steady growth, the leading trade fair for home and contract textiles and textile design is strongly positioned. This makes it a reliable platform for international participants. At the opening, architect and designer Patricia Urquiola presented her installation 'among-us' at Heimtextil.
Abstract: The application of legal texts in the context of digital television is a process that relies on several normative instruments, ranging from international treaties, such as those of the ITU (International Telecommunications Union), to national regulations defining the obligations of audiovisual operators and the modalities of consumer support. Many countries have introduced specific laws and regulations to organize the gradual switch-off of analog broadcasting and encourage the adoption of new digital standards. Consequently, the digitization of Guinea’s broadcasting network cannot be carried out without taking into account the legal framework: allocation of resources and broadcasting players. Analog and digital broadcasting, according to regulatory texts, shows the relationships between the different communication management structures. As for digital broadcasting, we note the appearance of a new service, the multiplex.
Abstract: China's agriculture developed persistently and robustly from 2024 to 2025, and has been sustainably improved, reshaped, and remolded in recent years. This development marks a new starting point for China's agriculture, giving it a new orientation and new protection.
Funding: Partially supported by the National Natural Science Foundation of China under Grant No. 62372121; the Ministry of Education Humanities and Social Science project under Grant No. 20YJAZH118; the National Key Research and Development Program of China under Grant No. 2020YFB1005804; and the MOE Project at the Center for Linguistics and Applied Linguistics, Guangdong University of Foreign Studies.
Abstract: With the rapid development of web technology, Social Networks (SNs) have become one of the most popular platforms for users to exchange views and express their emotions. More and more people are used to commenting on hot topics in SNs, resulting in a large amount of text containing emotions. Textual Emotion Cause Extraction (TECE) aims to automatically extract the causes of a certain emotion in texts, which is an important research issue in natural language processing. It differs from the earlier tasks of emotion recognition and emotion classification: rather than stopping at shallow-level emotion classification of text, it traces the source of the emotion. In this paper, we provide a survey of TECE. First, we introduce the development process and classification of TECE. Then, we discuss the existing methods and key factors for TECE. Finally, we enumerate the challenges and development trends for TECE.
Abstract: With the booming growth of global tourism, more and more tourist attractions and cultural heritage sites attract international tourists through different forms. As an important bridge for cultural communication, the translation of tourism texts not only needs to convey information accurately, but also needs to take into account cultural differences and cognitive characteristics. Cognitive construal theory, as an emerging theoretical framework for translation, provides an understanding and explanation of different dimensions in translation practice by focusing on human cognitive processes. This paper investigates the English translation of several tourism texts based on cognitive construal theory and examines the cognitive mechanisms reflected in the translated texts. The study shows that the four dimensions of cognitive construal all have an important impact on the English translation of tourism texts; people from different countries or ethnic backgrounds have different cognitive construals, so translators should flexibly adjust their cognitive construals in order to achieve a specific translation purpose. This study helps people to understand translation activities from the perspective of cognitive construal and provides a reference for the practice of translating tourism texts.
Funding: Sponsored by the Humanities and Social Sciences Project of the Ministry of Education under Grant No. 24YJCZH443; the Shanghai Philosophy and Social Science Planning Project under Grant No. 2024EYY015; and the Shanghai Municipal Philosophy and Social Sciences Planning Project under Grant No. 2024EYY011.
Abstract: This study investigates translation strategies for Chinese cultural terms in academic texts through a case study of Chapter 7 from “Jade Myth Belief and Chinese Spirit”. Using a qualitative research approach based on a cultural context framework and a cognitive model, the study analyzes translation challenges and solutions in rendering cultural terms related to jade mythology and archaeological concepts. The research identifies three primary translation strategies: transliteration with annotation, domestication with explanation, and cognitive-based translation. The findings reveal that effective translation requires a balanced approach between maintaining academic precision and preserving cultural authenticity. The study demonstrates that successful translation of cultural terms in academic contexts demands a sophisticated understanding of both source and target cultural contexts, along with careful consideration of the academic audience’s needs. This research contributes to the field by providing practical insights for translators working with Chinese cultural texts in academic settings and proposing an approach to handling complex cultural terminology.