Funding: Supported by the Ningbo Top Medical and Health Research Program, No. 2023020612; the Ningbo Leading Medical & Healthy Discipline Project, No. 2022-S04; the Medical Health Science and Technology Project of Zhejiang Provincial Health Commission, No. 2022KY315; and the Ningbo Science and Technology Public Welfare Project, No. 2023S133.
Abstract: BACKGROUND With the rising use of endoscopic submucosal dissection (ESD) and endoscopic mucosal resection (EMR), patients are increasingly questioning various aspects of these endoscopic procedures. At the same time, conversational artificial intelligence (AI) tools like Chat Generative Pretrained Transformer (ChatGPT) are rapidly emerging as sources of medical information. AIM To evaluate ChatGPT's reliability and usefulness regarding ESD and EMR for patients and healthcare professionals. METHODS In this study, 30 specific questions related to ESD and EMR were identified. These questions were entered into ChatGPT repeatedly, and two independent answers were generated for each question. A Likert scale was used to rate the accuracy, completeness, and comprehensibility of the responses. In addition, a binary category (high/low) was used to evaluate each aspect of the two responses generated by ChatGPT and of the response retrieved from Google. RESULTS Based on the average scores of the three raters, the responses generated by ChatGPT received high ratings for accuracy (mean score of 5.14 out of 6), completeness (mean score of 2.34 out of 3), and comprehensibility (mean score of 2.96 out of 3). Kendall's coefficients of concordance indicated good agreement among raters (all P < 0.05). For the responses retrieved from Google, more than half were classified by experts as having low accuracy and low completeness. CONCLUSION ChatGPT provided accurate and reliable answers to questions about ESD and EMR. Future studies should address ChatGPT's current limitations by incorporating more detailed and up-to-date medical information, which could establish AI chatbots as a significant resource for both patients and healthcare professionals.
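The rating analysis described above — averaging per-dimension Likert scores and checking inter-rater agreement with Kendall's coefficient of concordance — can be sketched in plain Python. The scores and ranks below are invented illustrations, not the study's data:

```python
# Minimal sketch of the rating analysis: mean Likert scores per dimension
# and Kendall's W across raters. All numbers here are illustrative.

def mean_score(ratings):
    """Average a list of per-question scores from one rating dimension."""
    return sum(ratings) / len(ratings)

def kendalls_w(rank_matrix):
    """Kendall's coefficient of concordance W for untied rank vectors.

    rank_matrix[i][j] is rater i's rank for item j (ranks 1..n).
    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared
    deviations of the per-item rank sums from their mean.
    """
    m = len(rank_matrix)       # number of raters
    n = len(rank_matrix[0])    # number of items ranked
    rank_sums = [sum(r[j] for r in rank_matrix) for j in range(n)]
    mean_sum = sum(rank_sums) / n
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three raters ranking four answers in perfect agreement -> W = 1.0
perfect = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
print(kendalls_w(perfect))  # 1.0
```

W ranges from 0 (no agreement) to 1 (perfect agreement); the significance test against chi-square that yields the P values reported above is omitted here for brevity.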
Funding: Supported by the Natural Science Foundation of China (No. 61906213).
Abstract: Most current object detection algorithms use models pretrained on ImageNet and then fine-tuned on the target network, which achieves good performance for general object detectors. However, in remote sensing image object detection, pretrained models differ significantly from remote sensing data, so it is worthwhile to explore a train-from-scratch technique for remote sensing images. This paper proposes SRS-Net, an object detection framework trained from scratch, and describes the design of a densely connected backbone network that provides integrated hidden-layer supervision for the convolution module. Two necessary improvement principles are then proposed: studying the role of normalization in the network structure, and improving data augmentation methods for remote sensing images. To evaluate the proposed framework, we performed extensive ablation experiments on the DIOR, DOTA, and AS datasets. The results show that the improved backbone network, the normalization method, and the training data augmentation strategy each increased the performance of the object detection network trained from scratch. These principles compensate for the lack of pretrained models. Furthermore, we found that SRS-Net achieves performance similar to or slightly better than baseline methods, and surpasses most advanced general detectors.
Abstract: Data diversity and abundance are essential for improving the performance and generalization of models in natural language processing and 2D vision. However, the 3D vision domain suffers from a lack of 3D data, and simply combining multiple 3D datasets for pretraining a 3D backbone does not yield significant improvement, because the domain discrepancies among different 3D datasets impede effective feature learning. In this work, we identify the main sources of the domain discrepancies between 3D indoor scene datasets and propose Swin3D++, an enhanced architecture based on Swin3D for efficient pretraining on multi-source 3D point clouds. Swin3D++ introduces domain-specific mechanisms into Swin3D's modules to address domain discrepancies and enhance the network's capability for multi-source pretraining. Moreover, we devise a simple source-augmentation strategy to increase the pretraining data scale and facilitate supervised pretraining. We validate the effectiveness of our design and demonstrate that Swin3D++ surpasses state-of-the-art 3D pretraining methods on typical indoor scene understanding tasks.
Funding: Jointly supported by the National Natural Science Foundation of China (Nos. 62236010, 62322607, 62276261, and 62076014); the Youth Innovation Promotion Association of the Chinese Academy of Sciences, China (No. 2021128); the Joint Fund of Natural Science of Hunan Province, China (No. 2023JJ50242); and the Key Projects of the Education Department of Hunan Province, China (No. 22A0115).
Abstract: Most visual-language navigation (VLN) research focuses on simulated environments, but applying these methods to real-world scenarios is challenging because of misalignments between vision and language in complex environments, which lead to path deviations. To address this, we propose a novel vision-and-language object navigation strategy that uses multimodal pretrained knowledge as a cross-modal bridge to link semantic concepts in images and text. This improves navigation supervision at key-points and enhances robustness. Specifically, we 1) randomly generate key-points within a specific density range and optimize them based on challenging locations; 2) use pretrained multimodal knowledge to efficiently retrieve target objects; 3) combine depth information with simultaneous localization and mapping (SLAM) map data to predict optimal positions and orientations for accurate navigation; and 4) implement the method on a physical robot, successfully conducting navigation tests. Our approach achieves a maximum success rate of 66.7%, outperforming existing VLN methods in real-world environments.
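One way to read "randomly generate key-points within a specific density range" is as rejection sampling with a minimum spacing constraint between points. The sketch below is an illustrative interpretation we assume for concreteness; the paper's exact density criterion and optimization step may differ:

```python
# Illustrative sketch: sample up to n key-points in a width x height region,
# rejecting any candidate closer than min_dist to an accepted point.
# This is an assumed reading of the density-range constraint, not the
# paper's implementation.
import math
import random

def sample_keypoints(n, width, height, min_dist, max_tries=10000, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    points = []
    tries = 0
    while len(points) < n and tries < max_tries:
        tries += 1
        p = (rng.uniform(0, width), rng.uniform(0, height))
        if all(math.dist(p, q) >= min_dist for q in points):
            points.append(p)
    return points

pts = sample_keypoints(n=10, width=20.0, height=20.0, min_dist=3.0)
print(len(pts))  # up to 10 points, pairwise at least min_dist apart
```

Enforcing a minimum spacing bounds the local density from above; a maximum spacing (not shown) would bound it from below.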
Abstract: The use of pretrained backbones with fine-tuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We designed a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on the synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets, but also outperforms state-of-the-art methods on downstream tasks, with +2.3 mIoU and +2.2 mIoU on S3DIS Area 5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validated the scalability, generality, and superior performance enabled by our approach.
Funding: Supported by the National Key R&D Program of China (Nos. 2019YFA0905700 and 2021YFC2101500) and the National Natural Science Foundation of China (No. 62072283).
Abstract: The effectiveness of AI-driven drug discovery can be enhanced by pretraining on small molecules. However, conventional masked language model pretraining techniques are not suitable for molecule pretraining because of the limited vocabulary size and the non-sequential structure of molecules. To overcome these challenges, we propose FragAdd, a strategy that adds a chemically implausible molecular fragment to the input molecule. This approach allows the incorporation of rich local information and the generation of a high-quality graph representation, which is advantageous for tasks such as virtual screening. Accordingly, we developed a virtual screening protocol that focuses on identifying binders of estrogen receptor alpha, a nuclear receptor. Our results demonstrate a significant improvement in the binding capacity of the retrieved molecules. Additionally, we show that the FragAdd strategy can be combined with other self-supervised methods to further expedite the drug discovery process.
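Viewed abstractly, adding a fragment to a molecule is a graph operation: append the fragment's atoms and bonds with shifted indices and join the two graphs by one new bond. The sketch below is a generic graph-merge illustration under that assumption; the molecule, fragment, and attachment choice are invented, and FragAdd's actual chemistry handling is not reproduced here:

```python
# Hedged sketch of fragment addition as a graph merge. A molecule is a
# dict mapping atom index -> list of bonded atom indices; the fragment's
# indices are shifted by the molecule's size to avoid collisions.

def add_fragment(mol, frag, attach_mol_atom, attach_frag_atom):
    """Return a new adjacency dict with `frag` appended to `mol`."""
    offset = len(mol)
    merged = {a: list(nbrs) for a, nbrs in mol.items()}
    for a, nbrs in frag.items():
        merged[a + offset] = [n + offset for n in nbrs]
    # one new bond joins the chosen attachment atoms
    merged[attach_mol_atom].append(attach_frag_atom + offset)
    merged[attach_frag_atom + offset].append(attach_mol_atom)
    return merged

two_atoms = {0: [1], 1: [0]}   # two bonded atoms (hypothetical)
fragment = {0: []}             # single-atom "fragment" (hypothetical)
print(add_fragment(two_atoms, fragment, attach_mol_atom=1, attach_frag_atom=0))
# {0: [1], 1: [0, 2], 2: [1]}
```

In a real pipeline the merged graph would then carry node/edge features (atom and bond types) into the graph encoder; those are omitted here.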
Abstract: Decomposing complex real-world tasks into simpler subtasks and devising a subtask execution plan is critical for humans to achieve effective decision-making. However, replicating this process remains challenging for AI agents, and it naturally raises two questions: (1) How can discriminative knowledge representations be extracted from priors? (2) How can a rational plan be developed to decompose complex problems? To address these issues, we introduce a novel framework that incorporates two main contributions. First, our multiple-encoder and individual-predictor regime goes beyond traditional architectures to extract nuanced task-specific dynamics from datasets, enriching the feature space for subtasks. Second, we innovate in planning by introducing a top-K subtask planning tree generated through an attention mechanism, which allows for dynamic adaptability and forward-looking decision-making. Our framework is empirically validated on the challenging BabyAI benchmark, including multiple combinatorially rich synthetic tasks (e.g., GoToSeq, SynthSeq, BossLevel), where it not only outperforms competitive baselines but also demonstrates superior adaptability and effectiveness in complex task decomposition.
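The top-K expansion step of such a planning tree can be sketched simply: attention scores over candidate subtasks are softmax-normalized, and the K highest-weighted candidates become a node's children. The subtask names and scores below are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of top-K candidate selection from attention scores.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def top_k_subtasks(candidates, scores, k):
    """Return the k candidates with the highest attention weight."""
    weights = softmax(scores)
    ranked = sorted(zip(candidates, weights), key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:k]]

subtasks = ["go_to_door", "pick_up_key", "open_box", "go_to_goal"]
print(top_k_subtasks(subtasks, [0.2, 1.5, 0.1, 0.9], k=2))
# ['pick_up_key', 'go_to_goal']
```

Applying this selection recursively at each chosen child yields a tree of width K, which is what makes the plan both bounded in size and forward-looking.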
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62171297 and 61931013).
Abstract: Objective Appropriate medical imaging is important for value-based care. We aim to evaluate the performance of generative pretrained transformer 4 (GPT-4), an innovative natural language processing model, in automatically providing appropriate medical imaging recommendations in different clinical scenarios. Methods Institutional Review Board (IRB) approval was not required because only non-identifiable data were used. We used 112 questions from the American College of Radiology (ACR) Radiology-TEACHES Program, an open-source question-and-answer program that guides appropriate medical imaging, as prompts. We included 69 free-text case vignettes and 43 simplified cases. To evaluate the performance of GPT-4 and GPT-3.5, we took the recommendations of the ACR guidelines as the gold standard, and three radiologists analyzed the consistency of the GPT models' responses with those of the ACR. We set a five-point scoring criterion for evaluating consistency. A paired t-test was applied to assess the statistical significance of the findings. Results For free-text case vignettes, the accuracy of GPT-4 was 92.9%, whereas the accuracy of GPT-3.5 was only 78.3%. GPT-4 provided more appropriate suggestions to reduce the overutilization of medical imaging than GPT-3.5 (t = 3.429, P = 0.001). For simplified scenarios, the accuracy of GPT-4 and GPT-3.5 was 66.5% and 60.0%, respectively; the difference was not statistically significant (t = 1.858, P = 0.070). GPT-4 was characterized by longer response times (27.1 s on average) and more extensive responses (137.1 words on average) than GPT-3.5. Conclusion As an advanced tool for improving value-based healthcare in clinics, GPT-4 may guide appropriate medical imaging accurately and efficiently.
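The paired t-test used above compares the two models' per-case scores through the differences within each pair: t = mean(d) / (sd(d) / sqrt(n)). A minimal sketch, with invented score lists rather than the study's data:

```python
# Minimal paired t-test sketch: t = mean(d) / (sd(d) / sqrt(n)),
# where d are the per-case score differences and sd uses n - 1.
import math

def paired_t(scores_a, scores_b):
    """t statistic for paired samples (scores_a[i] paired with scores_b[i])."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n)

# Hypothetical per-case consistency scores for two models
gpt4_scores = [5, 4, 5, 5, 3, 5, 4, 5]
gpt35_scores = [4, 4, 3, 5, 2, 4, 4, 4]
print(paired_t(gpt4_scores, gpt35_scores))  # 3.0
```

The resulting t statistic would then be compared against the t distribution with n - 1 degrees of freedom to obtain a P value (e.g., via `scipy.stats.ttest_rel`, which performs the whole computation directly).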
Funding: Supported in part by the Guangdong Basic and Applied Basic Research Foundation (No. GDST23EG32).
Abstract: The recurrent neural network transducer (RNN-T) is an important branch of current end-to-end automatic speech recognition (ASR). Various promising approaches have been designed to boost the RNN-T architecture; however, few studies exploit the effectiveness of pretrained methods in this framework. In this paper, we introduce a pretrained acoustic extractor (PAE) and a pretrained linguistic network (PLN) to enhance the Conformer long short-term memory (Conformer-LSTM) transducer. First, we construct the input of the acoustic encoder from two different latent representations: one extracted by the PAE from the raw waveform, and the other obtained from a filter-bank transformation. Second, we fuse an extra semantic feature from the PLN into the joint network to reduce illogical and homophonic errors. Compared with previous works, our approaches obtain pretrained representations for better model generalization. Evaluation on two large-scale datasets demonstrates that the proposed approaches yield better performance than existing approaches.
Funding: Supported by the National Natural Science Foundation of China (No. 62072462), the National Key R&D Program of China (No. 2020AAA0108600), and the Large-scale Pretraining Program of the Beijing Academy of Artificial Intelligence (BAAI).
Abstract: Multimodal pretraining has achieved convincing results on various downstream tasks in recent years. However, since the majority of existing works build models based on English, their applications are limited by language. In this work, we address this issue by developing models with multimodal and multilingual capabilities. We explore two types of methods for extending a multimodal pretraining model from monolingual to multilingual. Specifically, we propose a pretraining-based model named multilingual multimodal pretraining (MLMM), and two generalization-based models named multilingual CLIP (M-CLIP) and multilingual acquisition (MLA). In addition, we further extend the generalization-based models to incorporate the audio modality and develop the multilingual CLIP for vision, language, and audio (CLIP4VLA). Our models achieve state-of-the-art performance on multilingual vision-text retrieval, visual question answering, and image captioning benchmarks. Based on the experimental results, we discuss the pros and cons of the two types of models and their potential practical applications.
Abstract: To address the widespread polysemy of words in Chinese text, as well as the robustness problems that non-standard text causes for classification models, we propose AT-NEZHA (Adversarial Training NEZHA), a classification model that combines adversarial training with a Chinese pretrained model. On the one hand, the word embeddings of NEZHA, a Chinese-oriented improvement of the BERT model, are introduced to fuse contextual information and resolve polysemy in Chinese text. On the other hand, an adversarial training algorithm applies gradient perturbations to the parameter matrix of the word embedding layer to increase the loss during training, driving the model toward more suitable parameters and thereby improving its robustness. Experimental results show that AT-NEZHA effectively improves the accuracy of text classification.
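The gradient perturbation of the embedding layer described above is commonly realized in a fast-gradient-method style: the embedding gradient is L2-normalized, scaled by a step size epsilon, and added to the embeddings to raise the loss. The sketch below illustrates that update rule on a single vector; it is an assumed reading of the algorithm, not AT-NEZHA's actual code:

```python
# Hedged sketch of an FGM-style embedding perturbation:
# r_adv = epsilon * g / ||g||_2, then embedding + r_adv.
import math

def fgm_perturbation(grad, epsilon=1.0):
    """Return the adversarial perturbation for one embedding vector."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)  # no gradient, no perturbation
    return [epsilon * g / norm for g in grad]

def perturb_embedding(embedding, grad, epsilon=1.0):
    """Apply the perturbation to an embedding vector."""
    r = fgm_perturbation(grad, epsilon)
    return [e + d for e, d in zip(embedding, r)]

print(fgm_perturbation([3.0, 4.0], epsilon=1.0))  # [0.6, 0.8]
```

In training, the loss on the perturbed embeddings is backpropagated alongside the clean loss, and the perturbation is removed before the parameter update, which is what yields the robustness gain.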