Most current object detection algorithms use models pretrained on ImageNet and then fine-tuned in the detection network, which achieves good performance for general object detectors. However, in the field of remote sensing image object detection, because pretrained models differ significantly from remote sensing data, it is worthwhile to explore a train-from-scratch technique for remote sensing images. This paper proposes SRS-Net, an object detection framework trained from scratch, and describes the design of a densely connected backbone network that provides integrated hidden-layer supervision for the convolution module. Two further improvement principles are then proposed: studying the role of normalization in the network structure, and improving data augmentation methods for remote sensing images. To evaluate the proposed framework, we performed extensive ablation experiments on the DIOR, DOTA, and AS datasets. The results show that whether using the improved backbone network, the normalization method, or the training data augmentation strategy, the performance of the object detection network trained from scratch increases; these principles compensate for the lack of pretrained models. Furthermore, we found that SRS-Net achieves performance similar to or slightly better than baseline methods and surpasses most advanced general detectors.
Data diversity and abundance are essential for improving the performance and generalization of models in natural language processing and 2D vision. However, the 3D vision domain suffers from a lack of 3D data, and simply combining multiple 3D datasets for pretraining a 3D backbone does not yield significant improvement, due to the domain discrepancies among different 3D datasets that impede effective feature learning. In this work, we identify the main sources of the domain discrepancies between 3D indoor scene datasets and propose Swin3D++, an enhanced architecture based on Swin3D for efficient pretraining on multi-source 3D point clouds. Swin3D++ introduces domain-specific mechanisms into Swin3D's modules to address domain discrepancies and enhance the network capability for multi-source pretraining. Moreover, we devise a simple source-augmentation strategy to increase the pretraining data scale and facilitate supervised pretraining. We validate the effectiveness of our design and demonstrate that Swin3D++ surpasses state-of-the-art 3D pretraining methods on typical indoor scene understanding tasks.
Multimodal pretraining has made convincing achievements in various downstream tasks in recent years. However, since the majority of existing works construct models based on English, their applications are limited by language. In this work, we address this issue by developing models with multimodal and multilingual capabilities. We explore two types of methods to extend multimodal pretraining models from monolingual to multilingual. Specifically, we propose a pretraining-based model named multilingual multimodal pretraining (MLMM), and two generalization-based models named multilingual CLIP (M-CLIP) and multilingual acquisition (MLA). In addition, we further extend the generalization-based models to incorporate the audio modality and develop the multilingual CLIP for vision, language, and audio (CLIP4VLA). Our models achieve state-of-the-art performance on multilingual vision-text retrieval, visual question answering, and image captioning benchmarks. Based on the experimental results, we discuss the pros and cons of the two types of models and their potential practical applications.
The effectiveness of AI-driven drug discovery can be enhanced by pretraining on small molecules. However, conventional masked language model pretraining techniques are not suitable for molecule pretraining due to the limited vocabulary size and the non-sequential structure of molecules. To overcome these challenges, we propose FragAdd, a strategy that adds a chemically implausible molecular fragment to the input molecule. This approach allows rich local information to be incorporated and a high-quality graph representation to be generated, which is advantageous for tasks such as virtual screening. Consequently, we have developed a virtual screening protocol that focuses on identifying binders of estrogen receptor alpha, a nuclear receptor. Our results demonstrate a significant improvement in the binding capacity of the retrieved molecules. Additionally, we show that the FragAdd strategy can be combined with other self-supervised methods to further expedite the drug discovery process.
3D shape recognition has drawn much attention in recent years, and view-based approaches perform best among existing methods. However, current multi-view methods are almost all fully supervised, and their pretraining models are almost all based on ImageNet. Although pretraining results on ImageNet are impressive, there is still a significant discrepancy between multi-view datasets and ImageNet: multi-view datasets naturally retain rich 3D information. In addition, large-scale datasets such as ImageNet require considerable cleaning and annotation work, so it is difficult to build a second dataset of comparable scale. In contrast, unsupervised learning methods can learn general feature representations without any extra annotation. To this end, we propose a three-stage unsupervised joint pretraining model. Specifically, we decouple the final representations into three fine-grained representations: data augmentation is used to obtain pixel-level representations within each view, spatially invariant features are boosted at the view level, and global information is exploited at the shape level through a novel extract-and-swap module. Experimental results demonstrate that the proposed method achieves significant gains in 3D object classification and retrieval tasks and generalizes to cross-dataset tasks.
Decomposing complex real-world tasks into simpler subtasks and devising a subtask execution plan is critical for humans to achieve effective decision-making. However, replicating this process remains challenging for AI agents and naturally raises two questions: (1) How to extract discriminative knowledge representations from priors? (2) How to develop a rational plan to decompose complex problems? To address these issues, we introduce a groundbreaking framework that incorporates two main contributions. First, our multiple-encoder and individual-predictor regime goes beyond traditional architectures to extract nuanced task-specific dynamics from datasets, enriching the feature space for subtasks. Second, we innovate in planning by introducing a top-K subtask planning tree generated through an attention mechanism, which allows for dynamic adaptability and forward-looking decision-making. Our framework is empirically validated on the challenging BabyAI benchmark, including multiple combinatorially rich synthetic tasks (e.g., GoToSeq, SynthSeq, BossLevel), where it not only outperforms competitive baselines but also demonstrates superior adaptability and effectiveness in complex task decomposition.
Recurrent neural network transducer (RNN-T) is an important branch of current end-to-end automatic speech recognition (ASR). Various promising approaches have been designed for boosting the RNN-T architecture; however, few studies exploit the effectiveness of pretrained methods in this framework. In this paper, we introduce a pretrained acoustic extractor (PAE) and a pretrained linguistic network (PLN) to enhance the Conformer long short-term memory (Conformer-LSTM) transducer. First, we construct the input of the acoustic encoder from two different latent representations: one extracted by the PAE from the raw waveform, and the other obtained from filter-bank transformation. Second, we fuse an extra semantic feature from the PLN into the joint network to reduce illogical and homophonic errors. Compared with previous works, our approaches are able to obtain pretrained representations for better model generalization. Evaluation on two large-scale datasets demonstrates that our proposed approaches yield better performance than existing approaches.
Panoramic images, offering a 360-degree view, are essential in virtual reality (VR) and augmented reality (AR), enhancing realism with high-quality textures. However, acquiring complete and high-quality panoramic textures is challenging. This paper introduces a method using generative adversarial networks (GANs) and the contrastive language-image pretraining (CLIP) model to restore and control texture in panoramic images. The GAN model captures complex structures and maintains consistency, while CLIP enables fine-grained texture control via semantic text-image associations. GAN inversion optimizes latent codes for precise texture details. The resulting low dynamic range (LDR) images are converted to high dynamic range (HDR) using the Blender engine for seamless texture blending. Experimental results demonstrate the effectiveness and flexibility of this method in panoramic texture restoration and generation.
BACKGROUND: With the rising use of endoscopic submucosal dissection (ESD) and endoscopic mucosal resection (EMR), patients are increasingly questioning various aspects of these endoscopic procedures. At the same time, conversational artificial intelligence (AI) tools such as the chat generative pretrained transformer (ChatGPT) are rapidly emerging as sources of medical information. AIM: To evaluate ChatGPT's reliability and usefulness regarding ESD and EMR for patients and healthcare professionals. METHODS: In this study, 30 specific questions related to ESD and EMR were identified. These questions were then repeatedly entered into ChatGPT, with two independent answers generated for each question. A Likert scale was used to rate the accuracy, completeness, and comprehensibility of the responses. Meanwhile, a binary category (high/low) was used to evaluate each aspect of the two responses generated by ChatGPT and the response retrieved from Google. RESULTS: Analyzing the average scores of the three raters, our findings indicated that the responses generated by ChatGPT received high ratings for accuracy (mean score of 5.14 out of 6), completeness (mean score of 2.34 out of 3), and comprehensibility (mean score of 2.96 out of 3). Kendall's coefficients of concordance indicated good agreement among raters (all P < 0.05). For the responses generated by Google, more than half were classified by experts as having low accuracy and low completeness. CONCLUSION: ChatGPT provided accurate and reliable answers to questions about ESD and EMR. Future studies should address ChatGPT's current limitations by incorporating more detailed and up-to-date medical information, which could establish AI chatbots as a significant resource for both patients and healthcare professionals.
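For readers who want to reproduce the inter-rater agreement statistic mentioned above, the sketch below computes Kendall's coefficient of concordance (W) from a ratings matrix. It is a minimal illustration: the rating values are invented placeholders rather than data from the study, and the classic formula without tie correction is used.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's coefficient of concordance for an (n_items, n_raters) matrix."""
    n_items, n_raters = ratings.shape
    # Rank each rater's scores across items (ties receive average ranks).
    ranks = np.apply_along_axis(rankdata, 0, ratings)
    rank_sums = ranks.sum(axis=1)
    # Sum of squared deviations of the rank sums from their mean.
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    # Classic formulation without tie correction.
    return 12.0 * s / (n_raters ** 2 * (n_items ** 3 - n_items))

# Hypothetical Likert ratings (rows = responses, columns = raters).
ratings = np.array([[5, 6, 5],
                    [4, 4, 5],
                    [6, 6, 6],
                    [3, 4, 3]])
print(f"Kendall's W = {kendalls_w(ratings):.3f}")  # values near 1 indicate strong agreement
```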
Background: Multiparametric magnetic resonance imaging (mpMRI) has significantly advanced prostate cancer (PCa) detection, yet decisions on invasive biopsy for moderate Prostate Imaging Reporting and Data System (PI-RADS) scores remain ambiguous. Methods: To explore the decision-making capacity of Generative Pretrained Transformer-4 (GPT-4) for automated prostate biopsy recommendations, we included 2299 individuals who underwent prostate biopsy from 2018 to 2023 in 3 large medical centers, with available mpMRI before biopsy and documented clinical-histopathological records. GPT-4 generated structured reports with given prompts. The performance of GPT-4 was quantified using confusion matrices, and sensitivity, specificity, and area under the curve were calculated. Multiple manual evaluation procedures were conducted. Wilcoxon's rank sum test, Fisher's exact test, and Kruskal-Wallis tests were used for comparisons. Results: Utilizing the largest sample size in the Chinese population, patients with moderate PI-RADS scores (scores 3 and 4) accounted for 39.7% (912/2299) and were defined as the subset-of-interest (SOI). The detection rates of clinically significant PCa corresponding to PI-RADS scores 2-5 were 9.4%, 27.3%, 49.2%, and 80.1%, respectively. Nearly 47.5% (433/912) of SOI patients were histopathologically proven to have undergone unnecessary prostate biopsies. With the assistance of GPT-4, 20.8% (190/912) of the SOI population could avoid unnecessary biopsies, and it performed even better [28.8% (118/410)] in the most heterogeneous subgroup of PI-RADS score 3. More than 90.0% of GPT-4-generated reports were comprehensive and easy to understand, although satisfaction with accuracy was lower (82.8%). GPT-4 also demonstrated cognitive potential for handling complex problems. Additionally, the Chain of Thought method enabled us to better understand the decision-making logic behind GPT-4. Eventually, we developed a ProstAIGuide platform to facilitate accessibility for both doctors and patients. Conclusions: This multi-center study highlights the clinical utility of GPT-4 for prostate biopsy decision-making and advances our understanding of the latest artificial intelligence implementations in various medical scenarios.
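As a pointer to how the reported metrics are typically derived, the following sketch computes sensitivity, specificity, and area under the curve from binary biopsy recommendations using scikit-learn; the labels and scores are made-up placeholders rather than values from the study.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical ground truth (1 = clinically significant PCa) and model outputs.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])            # binary biopsy recommendation
y_score = np.array([.9, .2, .8, .4, .1, .6, .7, .3])   # graded confidence, if available

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true-positive rate: significant cancers caught
specificity = tn / (tn + fp)   # true-negative rate: unnecessary biopsies avoided
auc = roc_auc_score(y_true, y_score)

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, AUC={auc:.2f}")
```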
Sentence classification is the process of categorizing a sentence based on its context. Sentence categorization requires more semantic highlights than tasks such as dependency parsing, which require more syntactic elements. Most existing strategies focus on the general semantics of a conversation without involving the context of the sentence, recognizing the progress, and comparing impacts. An ensemble of pre-trained language models is used here to classify sentences from a conversation corpus. The conversational sentences are classified into four categories: information, question, directive, and commission. These classification label sequences are used to analyze conversation progress and predict the pecking order of the conversation. An ensemble of Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT Pretraining Approach (RoBERTa), Generative Pre-trained Transformer (GPT), DistilBERT, and Generalized Autoregressive Pretraining for Language Understanding (XLNet) models is trained on the conversation corpus, and hyperparameter tuning is carried out for better sentence classification performance. This Ensemble of Pre-trained Language Models with Hyperparameter Tuning (EPLM-HT) system is trained on an annotated conversation dataset. The proposed approach outperforms the base BERT, GPT, DistilBERT, and XLNet transformer models, and the ensemble model with fine-tuned parameters achieves an F1-score of 0.88.
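One common way to realize such an ensemble is soft voting over the per-model class probabilities. The sketch below illustrates that idea only; the class names come from the abstract, while the probability values and the uniform weighting are assumptions, not the actual EPLM-HT configuration.

```python
import numpy as np

LABELS = ["information", "question", "directive", "commission"]  # categories from the abstract

def soft_vote(prob_matrices, weights=None):
    """Average per-model class probabilities (each n_sentences x n_classes), then take the argmax."""
    stacked = np.stack(prob_matrices)                   # (n_models, n_sentences, n_classes)
    w = np.ones(len(prob_matrices)) if weights is None else np.asarray(weights, dtype=float)
    avg = np.tensordot(w / w.sum(), stacked, axes=1)    # weighted mean over the model axis
    return avg.argmax(axis=1)

# Hypothetical softmax outputs from three ensemble members for two sentences.
bert    = np.array([[0.7, 0.1, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1]])
roberta = np.array([[0.6, 0.2, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]])
xlnet   = np.array([[0.5, 0.2, 0.2, 0.1], [0.3, 0.5, 0.1, 0.1]])

preds = soft_vote([bert, roberta, xlnet])
print([LABELS[i] for i in preds])   # -> ['information', 'question']
```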
In the field of natural language processing (NLP), various pre-trained language models have emerged in recent years, with question answering systems gaining significant attention. However, as algorithms, data, and computing power advance, the issue of increasingly larger models and a growing number of parameters has surfaced; consequently, model training has become more costly and less efficient. To enhance the efficiency and accuracy of the training process while reducing the model size, this paper proposes PAL-BERT, a first-order pruning model based on the ALBERT model, designed around the characteristics of question-answering (QA) systems and language models. First, a first-order network pruning method based on the ALBERT model is designed, forming the PAL-BERT model. Then, the parameter optimization strategy of PAL-BERT is formulated, and the Mish function is used as the activation function instead of ReLU to improve performance. Finally, comparison experiments with the traditional deep learning models TextCNN and BiLSTM confirm that PAL-BERT is a pruning-based model compression method that can significantly reduce training time and optimize training efficiency. Compared with traditional models, PAL-BERT significantly improves performance on the NLP task.
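The two ingredients named above can be illustrated briefly: the Mish activation and a first-order (gradient-based) importance score for ranking weights before pruning. The PyTorch sketch below is a toy illustration under those assumptions, not the actual PAL-BERT pruning procedure.

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Mish activation: x * tanh(softplus(x)), used here in place of ReLU."""
    return x * torch.tanh(F.softplus(x))

def first_order_importance(weight, grad):
    """First-order Taylor importance |w * dL/dw|: estimated loss change if the weight is removed."""
    return (weight * grad).abs()

# Toy example: score a weight matrix and prune the lower-importance half.
w = torch.randn(4, 4, requires_grad=True)
loss = mish(w @ torch.randn(4)).sum()
loss.backward()

with torch.no_grad():
    scores = first_order_importance(w, w.grad)
    threshold = scores.flatten().kthvalue(scores.numel() // 2).values
    mask = (scores > threshold).float()      # 1 = keep, 0 = prune
    pruned_w = w * mask
print(f"kept {int(mask.sum())} of {mask.numel()} weights")
```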
As Natural Language Processing (NLP) continues to advance, driven by the emergence of sophisticated large language models such as ChatGPT, there has been notable growth in research activity. This rapid uptake reflects increasing interest in the field and prompts critical inquiries into ChatGPT's applicability in the NLP domain. This review paper systematically investigates the role of ChatGPT in diverse NLP tasks, including information extraction, Named Entity Recognition (NER), event extraction, relation extraction, Part-of-Speech (PoS) tagging, text classification, sentiment analysis, emotion recognition, and text annotation. The novelty of this work lies in its comprehensive analysis of the existing literature, addressing a critical gap in understanding ChatGPT's adaptability, limitations, and optimal application. We employed a systematic stepwise approach following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework to direct our search process and identify relevant studies. Our review reveals ChatGPT's significant potential in enhancing various NLP tasks. Its adaptability in information extraction, sentiment analysis, and text classification showcases its ability to comprehend diverse contexts and extract meaningful details. Additionally, ChatGPT's flexibility in annotation tasks reduces manual effort and accelerates the annotation process, making it a valuable asset in NLP development and research. Furthermore, GPT-4 and prompt engineering emerge as a complementary mechanism, empowering users to guide the model and enhance overall accuracy. Despite this promising potential, challenges persist. The performance of ChatGPT needs to be tested using more extensive datasets and diverse data structures. In addition, its limitations in handling domain-specific language and the need for fine-tuning in specific applications highlight the importance of further investigation to address these issues.
Purpose: Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we aim to combine the benefits of sequence labeling formulation and pretrained language models to propose an automatic keyphrase extraction model for Chinese scientific text. Design/methodology/approach: We regard AKE from Chinese text as a character-level sequence labeling task to avoid segmentation errors of Chinese tokenizers and initialize our model with the pretrained language model BERT, released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, which contains 100,000 abstracts as the training set, 6,000 abstracts as the development set, and 3,094 abstracts as the test set. We use unsupervised keyphrase extraction methods including term frequency (TF), TF-IDF, and TextRank, and supervised machine learning methods including Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory networks (BiLSTM), and BiLSTM-CRF as baselines. Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models. Findings: Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, a 9.64% absolute improvement. Research limitations: We only consider the automatic keyphrase extraction task rather than keyphrase generation, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases. Practical implications: We make our character-level IOB-format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community at https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction. Originality/value: By designing comparative experiments, our study demonstrates that the character-level formulation is more suitable for Chinese automatic keyphrase extraction under the general trend of pretrained language models. Our proposed dataset also provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
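To make the character-level IOB formulation concrete, the sketch below decodes keyphrases from per-character B/I/O tags; the example sentence and tags are invented for illustration and are not drawn from the CAKE dataset.

```python
def decode_iob(chars, tags):
    """Recover keyphrases from character-level IOB tags (B = begin, I = inside, O = outside)."""
    phrases, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "B":                 # a new keyphrase starts here
            if current:
                phrases.append("".join(current))
            current = [ch]
        elif tag == "I" and current:   # continue the current keyphrase
            current.append(ch)
        else:                          # O, or an I with no open span
            if current:
                phrases.append("".join(current))
            current = []
    if current:
        phrases.append("".join(current))
    return phrases

# Hypothetical example: the first three characters form one keyphrase.
chars = list("高血压患者的治疗")
tags  = ["B", "I", "I", "O", "O", "O", "O", "O"]
print(decode_iob(chars, tags))  # -> ['高血压']
```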
Named Entity Recognition (NER) is a fundamental task in biomedical text mining, aiming to extract specific types of entities such as genes, proteins, and diseases from complex biomedical texts and categorize them into predefined entity types. This process can provide basic support for the automatic construction of knowledge bases. In contrast to general texts, biomedical texts frequently contain numerous nested entities and local dependencies among these entities, presenting significant challenges to prevailing NER models. To address these issues, we propose RoBGP, a novel Chinese nested biomedical NER model based on RoBERTa and Global Pointer. Our model first uses the RoBERTa-wwm-ext-large pretrained language model to dynamically generate word-level initial vectors. It then incorporates a Bidirectional Long Short-Term Memory network to capture bidirectional semantic information, effectively addressing the issue of long-distance dependencies. Furthermore, the Global Pointer model is employed to comprehensively recognize all nested entities in the text. We conduct extensive experiments on the Chinese medical dataset CMeEE, and the results demonstrate the superior performance of RoBGP over several baseline models. This research confirms the effectiveness of RoBGP for Chinese biomedical NER, providing reliable technical support for biomedical information extraction and knowledge base construction.
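A minimal sketch of the pipeline described above follows: contextual token embeddings feed a BiLSTM, and every (start, end) span is scored with a pointer-style head. The dimensions are placeholders, and the span scorer is a simplification in the spirit of Global Pointer rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SpanNERSketch(nn.Module):
    """Pretrained-encoder embeddings -> BiLSTM -> span scores for one entity type (simplified)."""
    def __init__(self, hidden=768, lstm_hidden=256, head_dim=64):
        super().__init__()
        self.bilstm = nn.LSTM(hidden, lstm_hidden, batch_first=True, bidirectional=True)
        self.q_proj = nn.Linear(2 * lstm_hidden, head_dim)  # span-start representation
        self.k_proj = nn.Linear(2 * lstm_hidden, head_dim)  # span-end representation

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, hidden) contextual vectors from the pretrained encoder.
        h, _ = self.bilstm(token_embeddings)
        q, k = self.q_proj(h), self.k_proj(h)
        # Score every (start, end) pair; nested entities appear as multiple high-scoring spans.
        return torch.einsum("bqd,bkd->bqk", q, k)

model = SpanNERSketch()
scores = model(torch.randn(2, 20, 768))   # placeholder embeddings
print(scores.shape)                       # torch.Size([2, 20, 20])
```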
Next point-of-interest (POI) recommendation has been adopted by many internet companies to enhance the user travel experience. Recent research advocates deep-learning methods that model long-term check-in sequences and mine people's mobility patterns to improve recommendation performance. Existing approaches model general user preferences based on historical check-ins and can be termed preference pattern models. The preference pattern differs from the intention pattern in that it does not emphasize the mobility pattern of revisiting POIs, which is a common behavior and a kind of user intention. An effective module is needed to predict when and where users will repeat visits. In this paper, we propose a Spatio-Temporal Intention Learning Self-Attention Network (STILSAN) for next POI recommendation. STILSAN employs a preference-intention module to capture the user's long-term preference and recognize the user's intention to revisit specific POIs at specific times. Meanwhile, we design a spatial encoder module as a pretrained model for learning POI spatial features by simulating the spatial clustering phenomenon and the spatial proximity of POIs. Experiments are conducted on two real-world check-in datasets, and the results demonstrate that all the proposed modules effectively improve recommendation accuracy and that STILSAN yields outstanding improvements over state-of-the-art models.
Hand Gesture Recognition (HGR) is a promising research area with an extensive range of applications, such as surgery, video game techniques, and sign language translation, where sign language is a complicated structured form of hand gestures. The fundamental building blocks of structured expressions in sign language are the arrangement of the fingers, the orientation of the hand, and the hand's position relative to the body. The importance of HGR has increased due to the growing number of touchless applications and the rapid growth of the hearing-impaired population. Real-time HGR is therefore one of the most effective interaction methods between computers and humans, and developing a user-free interface with good recognition performance should be the goal of real-time HGR systems. Convolutional Neural Networks (CNNs) currently show excellent recognition rates for different image-level classification tasks, but it is challenging to train deep CNNs such as VGG-16, VGG-19, Inception-v3, and EfficientNet-B0 from scratch because few large labeled image datasets are available for static hand gesture images. Instead, an efficient and robust sign language hand gesture recognition system employing fine-tuned Inception-v3 and EfficientNet-B0 networks is proposed to identify hand gestures using a comparatively small HGR dataset. Experiments show that Inception-v3 achieved 90% accuracy with a precision of 0.93, recall of 0.91, and F1-score of 0.90, while EfficientNet-B0 achieved 99% accuracy with a precision of 0.98, recall of 0.97, and F1-score of 0.98.
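Fine-tuning a pretrained backbone of this kind usually amounts to loading ImageNet weights, replacing the classification head, and optionally freezing the feature extractor. The torchvision sketch below shows that recipe for EfficientNet-B0; the number of gesture classes and the freezing choice are assumptions, not details from the paper.

```python
import torch.nn as nn
from torchvision import models

NUM_GESTURES = 10  # placeholder: set to the number of classes in the HGR dataset

# Load EfficientNet-B0 with ImageNet weights and swap in a new classification head.
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
in_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(in_features, NUM_GESTURES)

# Optionally freeze the convolutional backbone so only the new head is trained at first.
for param in model.features.parameters():
    param.requires_grad = False
```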
Thanks to the strong representation capability of pre-trained language models, supervised machine translation models have achieved outstanding performance. However, the performance of these models drops sharply when the scale of the parallel training corpus is limited. Considering that pre-trained language models have a strong ability for monolingual representation, the key challenge for machine translation is to construct an in-depth relationship between the source and target languages by injecting lexical and syntactic information into pre-trained language models. To alleviate the dependence on the parallel corpus, we propose a Linguistics Knowledge-Driven Multi-Task (LKMT) approach that injects part-of-speech and syntactic knowledge into pre-trained models, thus enhancing machine translation performance. On the one hand, we integrate part-of-speech and dependency labels into the embedding layer and exploit a large-scale monolingual corpus to update all parameters of the pre-trained language model, ensuring that the updated language model contains potential lexical and syntactic information. On the other hand, we leverage an extra self-attention layer to explicitly inject linguistic knowledge into the pre-trained language model-enhanced machine translation model. Experiments on the benchmark dataset show that our proposed LKMT approach improves Urdu-English translation accuracy by 1.97 points and English-Urdu translation accuracy by 2.42 points, highlighting the effectiveness of the LKMT framework. Detailed ablation experiments confirm the positive impact of part-of-speech and dependency parsing on machine translation.
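A minimal sketch of the embedding-layer injection described above: token, part-of-speech, and dependency-label embeddings are summed before entering the encoder. The vocabulary and tag-set sizes are placeholders, and the extra self-attention layer mentioned in the abstract is omitted.

```python
import torch
import torch.nn as nn

class LinguisticEmbedding(nn.Module):
    """Sum token, part-of-speech, and dependency-label embeddings (placeholder sizes)."""
    def __init__(self, vocab_size=30000, n_pos=50, n_dep=50, dim=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(n_pos, dim)   # part-of-speech tag of each token
        self.dep = nn.Embedding(n_dep, dim)   # dependency relation label of each token
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_ids, pos_ids, dep_ids):
        # All three inputs are (batch, seq_len) index tensors aligned token by token.
        return self.norm(self.tok(token_ids) + self.pos(pos_ids) + self.dep(dep_ids))

emb = LinguisticEmbedding()
x = emb(torch.randint(0, 30000, (2, 8)),
        torch.randint(0, 50, (2, 8)),
        torch.randint(0, 50, (2, 8)))
print(x.shape)  # torch.Size([2, 8, 512])
```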
In this paper, a cross-sensor generative self-supervised learning network is proposed for multi-sensor fault detection. By modeling the sensor signals in multiple dimensions, the pretext task mines correlation information between channels, so that shared features across multi-sensor data are captured and the gap between channel-level data features is reduced. Meanwhile, to model fault features in the downstream task, a salience module is developed that optimizes cross-sensor data features based on a small amount of labeled data, making warning-related feature information prominent and improving separator accuracy. Finally, experimental results on the public FEMTO-ST dataset and the private SMT shock absorber dataset (SMT-SA dataset) show that the proposed method performs favorably against other state-of-the-art methods.
The use of pretrained backbones with fine-tuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on the synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks, with +2.3 mIoU and +2.2 mIoU on S3DIS Area 5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validates the scalability, generality, and superior performance enabled by our approach.
基金supported by the Natural Science Foundation of China(No.61906213).
文摘Most of the current object detection algorithms use pretrained models that are trained on ImageNet and then fine-tuned in the network,which can achieve good performance in terms of general object detectors.However,in the field of remote sensing image object detection,as pretrained models are significantly different from remote sensing data,it is meaningful to explore a train-fromscratch technique for remote sensing images.This paper proposes an object detection framework trained from scratch,SRS-Net,and describes the design of a densely connected backbone network to provide integrated hidden layer supervision for the convolution module.Then,two necessary improvement principles are proposed:studying the role of normalization in the network structure,and improving data augmentation methods for remote sensing images.To evaluate the proposed framework,we performed many ablation experiments on the DIOR,DOTA,and AS datasets.The results show that whether using the improved backbone network,the normalization method or training data enhancement strategy,the performance of the object detection network trained from scratch increased.These principles compensate for the lack of pretrained models.Furthermore,we found that SRS-Net could achieve similar to or slightly better performance than baseline methods,and surpassed most advanced general detectors.
文摘Data diversity and abundance are essential for improving the performance and generalization of models in natural language processing and 2D vision.However,the 3D vision domain suffers from a lack of 3D data,and simply combining multiple 3D datasets for pretraining a 3D backbone does not yield significant improvement,due to the domain discrepancies among different 3D datasets that impede effective feature learning.In this work,we identify the main sources of the domain discrepancies between 3D indoor scene datasets,and propose Swin3d++,an enhanced architecture based on Swin3d for efficient pretraining on multi-source 3D point clouds.Swin3d++introduces domain-specific mechanisms to SWIN3D's modules to address domain discrepancies and enhance the network capability on multi-source pretraining.Moreover,we devise a simple source-augmentation strategy to increase the pretraining data scale and facilitate supervised pretraining.We validate the effectiveness of our design,and demonstrate that Swin3d++surpasses the state-of-the-art 3D pretraining methods on typical indoor scene understanding tasks.
基金supported by the National Natural Science Foundation of China(No.62072462)the National Key R&D Program of China(No.2020AAA0108600)the Large-scale Pretraining Program 468 of Beijing Academy of Artificial Intelligence(BAAI).
文摘Multimodal pretraining has made convincing achievements in various downstream tasks in recent years.However,since the majority of the existing works construct models based on English,their applications are limited by language.In this work,we address this issue by developing models with multimodal and multilingual capabilities.We explore two types of methods to extend multimodal pretraining model from monolingual to multilingual.Specifically,we propose a pretraining-based model named multilingual multimodal pretraining(MLMM),and two generalization-based models named multilingual CLIP(M-CLIP)and multilingual acquisition(MLA).In addition,we further extend the generalization-based models to incorporate the audio modality and develop the multilingual CLIP for vision,language,and audio(CLIP4VLA).Our models achieve state-of-the-art performances on multilingual vision-text retrieval,visual question answering,and image captioning benchmarks.Based on the experimental results,we discuss the pros and cons of the two types of models and their potential practical applications.
基金supported by the National Key R&D Program of China(Nos.2019YFA0905700 and 2021YFC2101500)the National Natural Science Foundation of China(No.62072283).
文摘The effectiveness of Al-driven drug discovery can be enhanced by pretraining on small molecules.However,the conventional masked language model pretraining techniques are not suitable for molecule pretraining due to the limited vocabulary size and the non-sequential structure of molecules.To overcome these challenges,we propose FragAdd,a strategy that involves adding a chemically implausible molecular fragment to the input molecule.This approach allows for the incorporation of rich local information and the generation of a high-quality graph representation,which is advantageous for tasks like virtual screening.Consequently,we have developed a virtual screening protocol that focuses on identifying estrogen receptor alpha binders on a nucleus receptor.Our results demonstrate a significant improvement in the binding capacity of the retrieved molecules.Additionally,we demonstrate that the FragAdd strategy can be combined with other self-supervised methods to further expedite the drug discovery process.
基金This work was supported in part by National Natural Science Foundation of China(No.61976095)the Science and Technology Planning Project of Guangdong Province,China(No.2018B030323026).
文摘3D shape recognition has drawn much attention in recent years.The view-based approach performs best of all.However,the current multi-view methods are almost all fully supervised,and the pretraining models are almost all based on ImageNet.Although the pretraining results of ImageNet are quite impressive,there is still a significant discrepancy between multi-view datasets and ImageNet.Multi-view datasets naturally retain rich 3D information.In addition,large-scale datasets such as ImageNet require considerable cleaning and annotation work,so it is difficult to regenerate a second dataset.In contrast,unsupervised learning methods can learn general feature representations without any extra annotation.To this end,we propose a three-stage unsupervised joint pretraining model.Specifically,we decouple the final representations into three fine-grained representations.Data augmentation is utilized to obtain pixel-level representations within each view.And we boost the spatial invariant features from the view level.Finally,we exploit global information at the shape level through a novel extract-and-swap module.Experimental results demonstrate that the proposed method gains significantly in 3D object classification and retrieval tasks,and shows generalization to cross-dataset tasks.
文摘Decomposing complex real-world tasks into simpler subtasks and devising a subtask execution plan is critical for humans to achieve effective decision-making.However,replicating this process remains challenging for AI agents and naturally raises two questions:(1)How to extract discriminative knowledge representation from priors?(2)How to develop a rational plan to decompose complex problems?To address these issues,we introduce a groundbreaking framework that incorporates two main contributions.First,our multiple-encoder and individual-predictor regime goes beyond traditional architectures to extract nuanced task-specific dynamics from datasets,enriching the feature space for subtasks.Second,we innovate in planning by introducing a top-K subtask planning tree generated through an attention mechanism,which allows for dynamic adaptability and forward-looking decision-making.Our framework is empirically validated against challenging benchmarks BabyAI including multiple combinatorially rich synthetic tasks(e.g.,GoToSeq,SynthSeq,BossLevel),where it not only outperforms competitive baselines but also demonstrates superior adaptability and effectiveness incomplex task decomposition.
基金supported in part by the Guangdong Basic and Applied Basic Research Foundation(No.GDST23EG32).
文摘Recurrent neural network transducer(RNN-T)is an important branch of current end-to-end automatic speech recognition(ASR).Various promising approaches have been designed for boosting RNN-T architecture;however,few studies exploit the effectiveness of pretrained methods in this framework.In this paper,we introduce the pretrained acoustic extractor(PAE)and the pretrained linguistic network(PLN)to enhance the Conformer long short-term memory(Conformer-LSTM)transducer.First,we construct the input of the acoustic encoder with two different latent representations:one extracted by PAE from the raw waveform,and the other obtained from filter-bank transformation.Second,we fuse an extra semantic feature from the PLN into the joint network to reduce illogical and homophonic errors.Compared with previous works,our approaches are able to obtain pretrained representations for better model generalization.Evaluation on two large-scale datasets has demonstrated that our proposed approaches yield better performance than existing approaches.
文摘Panoramic images, offering a 360-degree view, are essential in virtual reality(VR) and augmented reality(AR), enhancing realism with high-quality textures. However, acquiring complete and high-quality panoramic textures is challenging. This paper introduces a method using generative adversarial networks(GANs) and the contrastive language-image pretraining(CLIP) model to restore and control texture in panoramic images. The GAN model captures complex structures and maintains consistency, while CLIP enables fine-grained texture control via semantic text-image associations. GAN inversion optimizes latent codes for precise texture details. The resulting low dynamic range(LDR) images are converted to high dynamic range(HDR) using the Blender engine for seamless texture blending. Experimental results demonstrate the effectiveness and flexibility of this method in panoramic texture restoration and generation.
基金Supported by Ningbo Top Medical and Health Research Program,No.2023020612the Ningbo Leading Medical&Healthy Discipline Project,No.2022-S04+1 种基金the Medical Health Science and Technology Project of Zhejiang Provincial Health Commission,No.2022KY315Ningbo Science and Technology Public Welfare Project,No.2023S133.
文摘BACKGROUND With the rising use of endoscopic submucosal dissection(ESD)and endoscopic mucosal resection(EMR),patients are increasingly questioning various aspects of these endoscopic procedures.At the same time,conversational artificial intelligence(AI)tools like chat generative pretrained transformer(ChatGPT)are rapidly emerging as sources of medical information.AIM To evaluate ChatGPT’s reliability and usefulness regarding ESD and EMR for patients and healthcare professionals.METHODS In this study,30 specific questions related to ESD and EMR were identified.Then,these questions were repeatedly entered into ChatGPT,with two independent answers generated for each question.A Likert scale was used to rate the accuracy,completeness,and comprehensibility of the responses.Meanwhile,a binary category(high/Low)was used to evaluate each aspect of the two responses generated by ChatGPT and the response retrieved from Google.RESULTS By analyzing the average scores of the three raters,our findings indicated that the responses generated by ChatGPT received high ratings for accuracy(mean score of 5.14 out of 6),completeness(mean score of 2.34 out of 3),and comprehensibility(mean score of 2.96 out of 3).Kendall’s coefficients of concordance indicated good agreement among raters(all P<0.05).For the responses generated by Google,more than half were classified by experts as having low accuracy and low completeness.CONCLUSION ChatGPT provided accurate and reliable answers in response to questions about ESD and EMR.Future studies should address ChatGPT’s current limitations by incorporating more detailed and up-to-date medical information.This could establish AI chatbots as significant resource for both patients and health care professionals.
基金supported by the Beijing Key Clinical Specialty Project(20240930)the National Natural Science Foundation of China(NSFC 82373436)+7 种基金the Beijing Hospitals Authority’Youth Program(BHAYP,QML20230114)the Beijing Natural Science Foundation(BNSF Z200027)the Beijing Chaoyang Hospital Multi-disciplinary Team Program(CYDXK202204),the NSFC(62331001)the BNSF(Z200027)the NSFC(82202097)the BHAYP(QML20230113)the Training Fund for Open Projects at Clinical Institutes and Departments of Capital Medical University(CCMU2022ZKYXY010)the Beijing Scholars Program(No.[2015]160).
文摘Background:Multiparametric magnetic resonance imaging(mpMRI)has significantly advanced prostate cancer(PCa)detection,yet decisions on invasive biopsy with moderate prostate imaging reporting and data system(PI-RADS)scores remain ambiguous.Methods:To explore the decision-making capacity of Generative Pretrained Transformer-4(GPT-4)for automated prostate biopsy recommendations,we included 2299 individuals who underwent prostate biopsy from 2018 to 2023 in 3 large medical centers,with available mpMRI before biopsy and documented clinical-histopathological records.GPT-4 generated structured reports with given prompts.The performance of GPT-4 was quantified using confusion matrices,and sensitivity,specificity,as well as area under the curve were calculated.Multiple artificial evaluation procedures were conducted.Wilcoxon’s rank sum test,Fisher’s exact test,and Kruskal-Wallis tests were used for comparisons.Results:Utilizing the largest sample size in the Chinese population,patients with moderate PI-RADS scores(scores 3 and 4)accounted for 39.7%(912/2299),defined as the subset-of-interest(SOI).The detection rates of clinically significant PCa corresponding to PI-RADS scores 2-5 were 9.4%,27.3%,49.2%,and 80.1%,respectively.Nearly 47.5%(433/912)of SOI patients were histopathologically proven to have undergone unnecessary prostate biopsies.With the assistance of GPT-4,20.8%(190/912)of the SOI population could avoid unnecessary biopsies,and it performed even better[28.8%(118/410)]in the most heterogeneous subgroup of PI-RADS score 3.More than 90.0%of GPT-4-generated reports were comprehensive and easy to understand,but less satisfied with the accuracy(82.8%).GPT-4 also demonstrated cognitive potential for handling complex problems.Additionally,the Chain of Thought method enabled us to better understand the decision-making logic behind GPT-4.Eventually,we developed a ProstAIGuide platform to facilitate accessibility for both doctors and patients.Conclusions:This multi-center study highlights the clinical utility of GPT-4 for prostate biopsy decision-making and advances our understanding of the latest artificial intelligence implementation in various medical scenarios.
文摘Sentence classification is the process of categorizing a sentence based on the context of the sentence.Sentence categorization requires more semantic highlights than other tasks,such as dependence parsing,which requires more syntactic elements.Most existing strategies focus on the general semantics of a conversation without involving the context of the sentence,recognizing the progress and comparing impacts.An ensemble pre-trained language model was taken up here to classify the conversation sentences from the conversation corpus.The conversational sentences are classified into four categories:information,question,directive,and commission.These classification label sequences are for analyzing the conversation progress and predicting the pecking order of the conversation.Ensemble of Bidirectional Encoder for Representation of Transformer(BERT),Robustly Optimized BERT pretraining Approach(RoBERTa),Generative Pre-Trained Transformer(GPT),DistilBERT and Generalized Autoregressive Pretraining for Language Understanding(XLNet)models are trained on conversation corpus with hyperparameters.Hyperparameter tuning approach is carried out for better performance on sentence classification.This Ensemble of Pre-trained Language Models with a Hyperparameter Tuning(EPLM-HT)system is trained on an annotated conversation dataset.The proposed approach outperformed compared to the base BERT,GPT,DistilBERT and XLNet transformer models.The proposed ensemble model with the fine-tuned parameters achieved an F1_score of 0.88.
基金Supported by Sichuan Science and Technology Program(2021YFQ0003,2023YFSY0026,2023YFH0004).
文摘In the field of natural language processing(NLP),there have been various pre-training language models in recent years,with question answering systems gaining significant attention.However,as algorithms,data,and computing power advance,the issue of increasingly larger models and a growing number of parameters has surfaced.Consequently,model training has become more costly and less efficient.To enhance the efficiency and accuracy of the training process while reducing themodel volume,this paper proposes a first-order pruningmodel PAL-BERT based on the ALBERT model according to the characteristics of question-answering(QA)system and language model.Firstly,a first-order network pruning method based on the ALBERT model is designed,and the PAL-BERT model is formed.Then,the parameter optimization strategy of the PAL-BERT model is formulated,and the Mish function was used as an activation function instead of ReLU to improve the performance.Finally,after comparison experiments with traditional deep learning models TextCNN and BiLSTM,it is confirmed that PALBERT is a pruning model compression method that can significantly reduce training time and optimize training efficiency.Compared with traditional models,PAL-BERT significantly improves the NLP task’s performance.
文摘As Natural Language Processing(NLP)continues to advance,driven by the emergence of sophisticated large language models such as ChatGPT,there has been a notable growth in research activity.This rapid uptake reflects increasing interest in the field and induces critical inquiries into ChatGPT’s applicability in the NLP domain.This review paper systematically investigates the role of ChatGPT in diverse NLP tasks,including information extraction,Name Entity Recognition(NER),event extraction,relation extraction,Part of Speech(PoS)tagging,text classification,sentiment analysis,emotion recognition and text annotation.The novelty of this work lies in its comprehensive analysis of the existing literature,addressing a critical gap in understanding ChatGPT’s adaptability,limitations,and optimal application.In this paper,we employed a systematic stepwise approach following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses(PRISMA)framework to direct our search process and seek relevant studies.Our review reveals ChatGPT’s significant potential in enhancing various NLP tasks.Its adaptability in information extraction tasks,sentiment analysis,and text classification showcases its ability to comprehend diverse contexts and extract meaningful details.Additionally,ChatGPT’s flexibility in annotation tasks reducesmanual efforts and accelerates the annotation process,making it a valuable asset in NLP development and research.Furthermore,GPT-4 and prompt engineering emerge as a complementary mechanism,empowering users to guide the model and enhance overall accuracy.Despite its promising potential,challenges persist.The performance of ChatGP Tneeds tobe testedusingmore extensivedatasets anddiversedata structures.Subsequently,its limitations in handling domain-specific language and the need for fine-tuning in specific applications highlight the importance of further investigations to address these issues.
基金This work is supported by the project“Research on Methods and Technologies of Scientific Researcher Entity Linking and Subject Indexing”(Grant No.G190091)from the National Science Library,Chinese Academy of Sciencesthe project“Design and Research on a Next Generation of Open Knowledge Services System and Key Technologies”(2019XM55).
文摘Purpose:Automatic keyphrase extraction(AKE)is an important task for grasping the main points of the text.In this paper,we aim to combine the benefits of sequence labeling formulation and pretrained language model to propose an automatic keyphrase extraction model for Chinese scientific research.Design/methodology/approach:We regard AKE from Chinese text as a character-level sequence labeling task to avoid segmentation errors of Chinese tokenizer and initialize our model with pretrained language model BERT,which was released by Google in 2018.We collect data from Chinese Science Citation Database and construct a large-scale dataset from medical domain,which contains 100,000 abstracts as training set,6,000 abstracts as development set and 3,094 abstracts as test set.We use unsupervised keyphrase extraction methods including term frequency(TF),TF-IDF,TextRank and supervised machine learning methods including Conditional Random Field(CRF),Bidirectional Long Short Term Memory Network(BiLSTM),and BiLSTM-CRF as baselines.Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models.Findings:Compared with character-level BiLSTM-CRF,the best baseline model with F1 score of 50.16%,our character-level sequence labeling model based on BERT obtains F1 score of 59.80%,getting 9.64%absolute improvement.Research limitations:We just consider automatic keyphrase extraction task rather than keyphrase generation task,so only keyphrases that are occurred in the given text can be extracted.In addition,our proposed dataset is not suitable for dealing with nested keyphrases.Practical implications:We make our character-level IOB format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts(CAKE)publicly available for the benefits of research community,which is available at:https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction.Originality/value:By designing comparative experiments,our study demonstrates that character-level formulation is more suitable for Chinese automatic keyphrase extraction task under the general trend of pretrained language models.And our proposed dataset provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
基金supported by the Outstanding Youth Team Project of Central Universities(QNTD202308)the Ant Group through CCF-Ant Research Fund(CCF-AFSG 769498 RF20220214).
文摘Named Entity Recognition(NER)stands as a fundamental task within the field of biomedical text mining,aiming to extract specific types of entities such as genes,proteins,and diseases from complex biomedical texts and categorize them into predefined entity types.This process can provide basic support for the automatic construction of knowledge bases.In contrast to general texts,biomedical texts frequently contain numerous nested entities and local dependencies among these entities,presenting significant challenges to prevailing NER models.To address these issues,we propose a novel Chinese nested biomedical NER model based on RoBERTa and Global Pointer(RoBGP).Our model initially utilizes the RoBERTa-wwm-ext-large pretrained language model to dynamically generate word-level initial vectors.It then incorporates a Bidirectional Long Short-Term Memory network for capturing bidirectional semantic information,effectively addressing the issue of long-distance dependencies.Furthermore,the Global Pointer model is employed to comprehensively recognize all nested entities in the text.We conduct extensive experiments on the Chinese medical dataset CMeEE and the results demonstrate the superior performance of RoBGP over several baseline models.This research confirms the effectiveness of RoBGP in Chinese biomedical NER,providing reliable technical support for biomedical information extraction and knowledge base construction.
基金supported by Chongqing Technology Innovation and Application Development Project[grant number cstc2021jscx-dxwtBX0023]funding from Chongqing Changan Automobile Co.,Ltd.,Dongfeng Motor Corporation,and Dongfeng Changxing Tech Co.,Ltd.
文摘Next point-of-interest(POI)recommendation has been applied by many internet companies to enhance the user travel experience.Recent research advocates deep-learning methods to model long-term check-in sequences and mine mobility patterns of people to improve recommendation performance.Existing approaches model general user preferences based on historical check-ins and can be termed as preference pattern models.The preference pattern is different from the intention pattern,in that it does not emphasize the user mobility pattern of revisiting POIs,which is a common behavior and kind of intention for users.An effective module is needed to predict when and where users will repeat visits.In this paper,we propose a Spatio-Temporal Intention Learning Self-Attention Network(STILSAN)for next POI recommendation.STILSAN employs a preference-intention module to capture the user’s long-term preference and recognizes the user’s intention to revisit some specific POIs at a specific time.Meanwhile,we design a spatial encoder module as a pretrained model for learning POI spatial feature by simulating the spatial clustering phenomenon and the spatial proximity of the POIs.Experiments are conducted on two real-world check-in datasets.The experimental results demonstrate that all the proposed modules can effectively improve recommendation accuracy and STILSAN yields outstanding improvements over the state-of-the-art models.
Funding: This research work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (NRF-2022R1A2C1004657).
Abstract: Hand Gesture Recognition (HGR) is a promising research area with an extensive range of applications, such as surgery, video game techniques, and sign language translation, where sign language is a complex, structured form of hand gestures. The fundamental building blocks of structured expressions in sign language are the arrangement of the fingers, the orientation of the hand, and the hand's position relative to the body. The importance of HGR has increased due to the growing number of touchless applications and the rapid growth of the hearing-impaired population. Real-time HGR is therefore one of the most effective interaction methods between computers and humans, and developing a user-free interface with good recognition performance should be the goal of real-time HGR systems. Convolutional Neural Networks (CNNs) now achieve high recognition rates on many image-level classification tasks, but training deep CNNs such as VGG-16, VGG-19, Inception-v3, and EfficientNet-B0 from scratch is challenging because only a few sizable labeled image datasets are available for static hand gesture images. We therefore propose an efficient and robust hand gesture recognition system for sign language that employs fine-tuned Inception-v3 and EfficientNet-B0 networks to identify hand gestures using a comparatively small HGR dataset. Experiments show that Inception-v3 achieved 90% accuracy with precision, recall, and F1-score of 0.93, 0.91, and 0.90, respectively, while EfficientNet-B0 achieved 99% accuracy with precision, recall, and F1-score of 0.98, 0.97, and 0.98, respectively.
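The fine-tuning recipe described above is the standard transfer-learning pattern: load ImageNet weights, replace the classifier head, and train on the small gesture dataset. Below is a minimal sketch using torchvision's EfficientNet-B0; the 10-class setup, frozen backbone, and hyperparameters are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: fine-tuning an ImageNet-pretrained EfficientNet-B0 for
# static hand gesture classification. Class count and hyperparameters assumed.
import torch
import torch.nn as nn
from torchvision import models

num_gestures = 10                                  # hypothetical gesture classes
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)

# Replace the ImageNet classifier head with a gesture head.
in_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(in_features, num_gestures)

# With a small HGR dataset, a common choice is to freeze the backbone
# and train only the new head (or unfreeze later with a small learning rate).
for p in model.features.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)               # stand-in batch of gesture images
labels = torch.randint(0, num_gestures, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```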
Funding: Supported by the National Natural Science Foundation of China under Grants 61732005 and 61972186, and by the Yunnan Provincial Major Science and Technology Special Plan Projects (Nos. 202103AA080015 and 202203AA080004).
Abstract: Thanks to the strong representation capability of pretrained language models, supervised machine translation models have achieved outstanding performance. However, the performance of these models drops sharply when the parallel training corpus is limited. Since pretrained language models have strong monolingual representation abilities, a key challenge for machine translation is to build an in-depth relationship between the source and target languages by injecting lexical and syntactic information into these models. To alleviate the dependence on the parallel corpus, we propose a Linguistics Knowledge-Driven Multi-Task (LKMT) approach that injects part-of-speech and syntactic knowledge into pretrained models, thus enhancing machine translation performance. On the one hand, we integrate part-of-speech and dependency labels into the embedding layer and exploit a large-scale monolingual corpus to update all parameters of the pretrained language model, ensuring that the updated model contains latent lexical and syntactic information. On the other hand, we leverage an extra self-attention layer to explicitly inject linguistic knowledge into the pretrained-language-model-enhanced machine translation model. Experiments on the benchmark dataset show that our LKMT approach improves Urdu-English translation accuracy by 1.97 points and English-Urdu translation accuracy by 2.42 points, highlighting the effectiveness of the LKMT framework. Detailed ablation experiments confirm the positive impact of part-of-speech and dependency parsing on machine translation.
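To illustrate the first of the two mechanisms, integrating linguistic labels into the embedding layer, the following is a sketch of additively fusing part-of-speech and dependency-label embeddings with token embeddings before a standard Transformer encoder. The vocabulary sizes and the simple additive fusion are assumptions for illustration, not the exact LKMT design.

```python
# Illustrative sketch: injecting POS and dependency-label embeddings into a
# token embedding layer. Sizes and additive fusion are assumptions.
import torch
import torch.nn as nn

class LinguisticEmbedding(nn.Module):
    def __init__(self, vocab_size=32000, num_pos_tags=50, num_dep_labels=45, d=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d)
        self.pos = nn.Embedding(num_pos_tags, d)   # part-of-speech tags
        self.dep = nn.Embedding(num_dep_labels, d) # dependency relation labels
        self.norm = nn.LayerNorm(d)

    def forward(self, token_ids, pos_ids, dep_ids):
        # All inputs: (batch, seq_len); the output feeds a standard Transformer encoder.
        x = self.tok(token_ids) + self.pos(pos_ids) + self.dep(dep_ids)
        return self.norm(x)

emb = LinguisticEmbedding()
token_ids = torch.randint(0, 32000, (2, 16))
pos_ids = torch.randint(0, 50, (2, 16))
dep_ids = torch.randint(0, 45, (2, 16))
print(emb(token_ids, pos_ids, dep_ids).shape)      # torch.Size([2, 16, 512])
```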
Funding: Supported by the National Natural Science Foundation of China under Grant No. 62173317 and the Key Research and Development Program of Anhui under Grant No. 202104a05020064.
Abstract: In this paper, a cross-sensor generative self-supervised learning network is proposed for multi-sensor fault detection. By modeling the sensor signals along multiple dimensions, the pretext task mines the correlation information between channels, so that features shared across multi-sensor data can be captured and the gap between channel-level feature distributions is reduced. Meanwhile, to model fault features in the downstream task, a salience module is developed that optimizes cross-sensor data features based on a small amount of labeled data, making warning-related feature information prominent and improving separator accuracy. Finally, experimental results on the public FEMTO-ST dataset and the private SMT shock absorber dataset (SMT-SA dataset) show that the proposed method performs favorably against other state-of-the-art methods.
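For readers unfamiliar with generative cross-channel pretext tasks, the sketch below shows one common formulation: reconstruct a masked sensor channel from the remaining channels, which forces the encoder to learn cross-channel correlations. This is an assumed, generic formulation for illustration; it does not reproduce the paper's exact pretext task or its salience module.

```python
# Generic sketch of a cross-channel generative pretext task for multi-sensor
# signals: hide one channel and reconstruct it from the others (assumed design).
import torch
import torch.nn as nn

class CrossChannelAE(nn.Module):
    def __init__(self, num_channels=4, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(num_channels, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3), nn.ReLU(),
        )
        self.decoder = nn.Conv1d(hidden, num_channels, kernel_size=7, padding=3)

    def forward(self, x):
        # x: (batch, num_channels, seq_len) raw multi-sensor signals.
        return self.decoder(self.encoder(x))

model = CrossChannelAE()
x = torch.randn(8, 4, 256)                         # stand-in multi-sensor batch
masked = x.clone()
masked[:, 1, :] = 0.0                              # hide one sensor channel
recon = model(masked)
# Self-supervised objective: recover the hidden channel from the other channels.
loss = nn.functional.mse_loss(recon[:, 1, :], x[:, 1, :])
loss.backward()
```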
Abstract: The use of pretrained backbones with fine-tuning has been successful for 2D vision and natural language processing tasks, showing advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on the synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets, but also outperforms state-of-the-art methods on downstream tasks, with +2.3 mIoU and +2.2 mIoU on S3DIS Area 5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validates the scalability, generality, and superior performance enabled by our approach.