Funding: National Key Research and Development Program of China, Grant/Award Number: 2021YFC1910402.
Abstract: Intelligent sorting is an important prerequisite for the full quantitative consumption and harmless disposal of kitchen waste. The existing object detection method based on an ImageNet pre-trained model is an effective way of sorting. Owing to significant domain gaps between natural images and kitchen waste images, an ImageNet pre-trained model struggles to reflect the diverse scales and dense distribution characteristic of kitchen waste, leading to poor generalisation. In this article, the authors propose the first pre-trained model for kitchen waste sorting, called KitWaSor, which combines contrastive learning (CL) and masked image modelling (MIM) through self-supervised learning (SSL). First, to address the issue of diverse scales, the authors propose a mixed masking strategy that adds an incomplete masking branch to the original random masking branch. It prevents the complete loss of small-scale objects while avoiding excessive leakage of large-scale object pixels. Second, to address the issue of dense distribution, the authors introduce semantic consistency constraints on top of the mixed masking strategy. That is, object semantic reasoning is performed through semantic consistency constraints to compensate for the lack of contextual information. To train KitWaSor, the authors construct the first million-level kitchen waste dataset across seasonal and regional distributions, named KWD-Million. Extensive experiments show that KitWaSor achieves state-of-the-art (SOTA) performance on the two most relevant downstream tasks for kitchen waste sorting (i.e. image classification and object detection), demonstrating the effectiveness of the proposed KitWaSor.
Abstract: Self-supervised learning aims to learn a universal feature representation without labels. To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning framework that works directly at the level of pixels (or local features) by taking into account the correspondence between local features. Specifically, we present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to supervised ImageNet pre-training and other self-supervised learning methods, our self-supervised DenseCL pre-training demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation. Specifically, our approach significantly outperforms the strong MoCo-v2 by 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation. The improvements are up to 3.5% AP and 8.8% mIoU over MoCo-v2, and 6.1% AP and 6.1% mIoU over the supervised counterpart under the frozen-backbone evaluation protocol.
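The pixel-level contrastive objective described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the matching rule (each local feature is paired with its most similar feature in the other view), the temperature of 0.2, and the names `cosine` and `dense_contrastive_loss` are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def dense_contrastive_loss(view_q, view_k, negatives, temperature=0.2):
    """Average InfoNCE-style loss over the local features of one view.

    view_q, view_k: lists of local feature vectors from two views of an image.
    negatives: list of feature vectors taken from other images.
    The temperature value is an assumed hyperparameter.
    """
    total = 0.0
    for q in view_q:
        # Assumed matching rule: the positive is the most similar
        # local feature in the other view.
        pos = max(view_k, key=lambda k: cosine(q, k))
        pos_term = math.exp(cosine(q, pos) / temperature)
        neg_term = sum(math.exp(cosine(q, n) / temperature) for n in negatives)
        total += -math.log(pos_term / (pos_term + neg_term))
    return total / len(view_q)
```

In this sketch the loss is small when every local feature has a close match in the other view and grows when the best available match is no closer than the negatives.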
Abstract: Edge computing, a novel paradigm for performing computations at the network edge, holds significant relevance in the healthcare domain for extracting medical knowledge from traditional Uygur medical texts. Medical knowledge extraction methods based on edge computing deploy deep learning models on edge devices to achieve localized entity and relation extraction. This approach avoids transferring substantial sensitive data to cloud data centers, effectively safeguarding the privacy of healthcare services. However, existing relation extraction methods mainly employ a sequential pipeline approach, which classifies relations between determined entities after entity recognition. This mode faces challenges such as error propagation between tasks, insufficient consideration of dependencies between the two subtasks, and the neglect of interrelations between different relations within a sentence. To address these challenges, a joint extraction model with parameter sharing in edge computing is proposed, named CoEx-Bert. This model leverages shared parameterization between two models to jointly extract entities and relations. Specifically, CoEx-Bert employs two models, each separately sharing hidden layer parameters, and combines their two loss functions for joint backpropagation to optimize the model parameters. Additionally, it effectively resolves the issue of entity overlapping when extracting knowledge from unstructured Uygur medical texts by considering contextual relations. Finally, this model is deployed on edge devices for real-time extraction and inference of Uygur medical knowledge. Experimental results demonstrate that CoEx-Bert outperforms existing state-of-the-art methods, achieving accuracy, recall, and F1-score of 90.65%, 92.45%, and 91.54%, respectively, on the Uygur traditional medical literature dataset. These improvements represent a 6.45% increase in accuracy, a 9.45% increase in recall, and a 7.95% increase in F1-score compared to the baseline.
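As a quick consistency check on the reported figures, the sketch below computes the harmonic mean of the first two numbers. It assumes that the value the abstract labels "accuracy" plays the role of precision in the usual F1 definition; the abstract does not state this, and the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same units)."""
    return 2 * precision * recall / (precision + recall)


# 90.65 is reported as "accuracy"; treating it as precision is an assumption.
print(round(f1_score(90.65, 92.45), 2))  # prints 91.54
```

Under that assumption the result matches the reported F1-score of 91.54%.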
Abstract: Massive Multiple-Input-Multiple-Output (MIMO) is a promising technology to meet the demand for the connection of massive devices and high data capacity for mobile networks in the next-generation communication system. However, due to the massive connectivity of mobile devices, the pilot contamination problem will severely degrade the communication quality and spectrum efficiency of the massive MIMO system. We propose a deep Monte Carlo Tree Search (MCTS)-based intelligent Pilot-power Allocation Scheme (iPAS) to address this issue. The core of iPAS is a multi-task deep reinforcement learning algorithm that can automatically learn the radio environment and make decisions on the pilot sequence and power allocation to maximize the spectrum efficiency with self-play training. To accelerate search convergence, we introduce a Deep Neural Network (DNN) to predict the pilot sequence and power allocation actions. The DNN is trained in a self-supervised manner, where the training data is generated from the search process of the MCTS algorithm. Numerical results show that our proposed iPAS achieves a better Cumulative Distribution Function (CDF) of the ergodic spectral efficiency compared with previous suboptimal algorithms.
Funding: Supported by the National Key Research and Development Program of China [Grant number 2021YFB3900903] and the National Natural Science Foundation of China [Grant number 41971337].
Abstract: Spatial relation extraction is the process of identifying geographic entities from text and determining their corresponding spatial relations. Traditional spatial relation extraction mainly uses rule-based pattern matching, supervised learning-based or unsupervised learning-based methods. However, these methods suffer from poor timeliness, high labor cost and high dependence on large-scale data. As pre-trained language models have greatly alleviated the shortcomings of traditional methods, supervised learning methods incorporating pre-trained language models have become the mainstream relation extraction methods. Pipeline extraction and joint extraction, the two dominant approaches to relation extraction, have both obtained good performance on different datasets; the main difference between them is whether the contextual information of entities and relations is shared. In this paper, we compare the performance of the two approaches on spatial relation extraction using Chinese corpus data in the field of geography, and verify which method based on pre-trained language models is more suitable for Chinese spatial relation extraction. We fine-tuned the hyperparameters of the two models to optimize the extraction accuracy before the comparison experiments. The results of the comparison experiments show that pipeline extraction outperforms joint extraction on spatial relation extraction for Chinese text data at sentence granularity, because different tasks focus on different contextual information, and it is difficult to take into account the needs of both tasks by sharing contextual information. In addition, we further compare the performance of the two models with the rule-based template approach in extracting topological, directional and distance relations, summarize the shortcomings of this experiment and provide an outlook for future work.
Abstract: Robustness is a long-standing challenge for automatic speech recognition (ASR), as the applied environment of any ASR system faces much noisier speech samples than clean training corpora. However, it is impractical to annotate every type of noisy environment. In this work, we propose a novel phonetic-semantic pre-training (PSP) framework that allows a model to effectively improve the performance of ASR in practical noisy environments by seamlessly integrating pre-training, self-supervised learning, and fine-tuning. In particular, there are three fundamental stages in PSP. First, pre-train the phone-to-word transducer (PWT) to map the generated phone sequence to the target text using only unpaired text data; second, continue training the PWT on more complex data generated from an empirical phone-perturbation heuristic, in addition to self-supervised signals obtained by recovering the tainted phones; and third, fine-tune the resultant PWT with real-world speech data. We perform experiments on two real-life datasets collected from industrial scenarios and on synthetic noisy datasets, which show that PSP effectively improves the traditional ASR pipeline, with relative character error rate (CER) reductions of 28.63% and 26.38%, respectively, on the two real-life datasets. It also demonstrates robustness on highly noisy synthetic speech datasets.
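For readers unfamiliar with the metric, a relative CER reduction is conventionally the drop in error rate divided by the baseline error rate. The sketch below shows that arithmetic. The abstract gives no absolute CER values, so the baseline of 10.0 and the improved value of 7.137 are hypothetical numbers chosen only to reproduce a 28.63% relative reduction, and the function name is illustrative.

```python
def relative_reduction(baseline_cer, new_cer):
    """Relative error-rate reduction in percent: (baseline - new) / baseline."""
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


# Hypothetical CER values; they are not taken from the paper.
print(round(relative_reduction(10.0, 7.137), 2))
```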
Funding: Supported by 2021 UQ School of Information Technology and Electrical Engineering (ITEE) Research Support Funding, Cyber Research Seed Funding (No. 2021-R3), the University of Adelaide (No. 1531570), and New Staff Research Start-up Funds (No. NS-2102).
Abstract: Medical data refers to health-related information associated with regular patient care or collected as part of a clinical trial program. There are many categories of such data, such as clinical imaging data, bio-signal data, electronic health records (EHR), and multi-modality medical data. With the development of deep neural networks in the last decade, the emerging pre-training paradigm has become dominant because it has significantly improved the performance of machine learning methods in data-limited scenarios. In recent years, studies of pre-training in the medical domain have achieved significant progress. To summarize these technological advancements, this work provides a comprehensive survey of recent advances in pre-training on several major types of medical data. In this survey, we summarize a large number of related publications and the existing benchmarks in the medical domain. In particular, the survey briefly describes how some pre-training methods are applied to or developed for medical data. From a data-driven perspective, we examine the extensive use of pre-training in many medical scenarios. Moreover, based on the summary of recent pre-training studies, we identify several challenges in this field to provide insights for future studies.
Funding: Supported in part by the Science and Technology Major Project of Guangxi under Grant No. AA22068057, the National Natural Science Foundation of China under Grant No. 62076077, and the School Foundation of Guilin University of Aerospace Technology under Grant No. XJ21KT32.
Abstract: The majority of vision-language pre-training (VLP) models rely on pre-trained object detectors, which incur high costs and restrict the recognition of object classes. Additionally, their encoder-based structures hinder their ability to perform text generation tasks effectively. To mitigate these challenges, we propose a Detector-free Vision-and-Language Pre-training (D-VLP) model designed to bolster intermodal interaction for unified understanding and generation tasks. Our D-VLP model employs a co-modality decoder equipped with a fused multi-attention self-attention module, enhancing feature fusion and information alignment between images and text. It is pre-trained using a novel Prefix Masked Language Modeling (prefixMLM) approach, leveraging the strengths of masked language modeling and unidirectional language modeling, which enables bidirectional processing and autoregressive token generation. Extensive experiments demonstrate that D-VLP surpasses state-of-the-art models in vision-language tasks, highlighting its superior performance and adaptability across various image-text tasks with minimal adjustments.
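One common way to combine bidirectional processing with autoregressive generation is a prefix-style attention mask, in which prefix positions are visible to every position and the remaining positions attend causally. The sketch below illustrates that generic idea. The abstract does not specify how prefixMLM is implemented, so this mask layout and the name `prefix_lm_mask` are assumptions rather than the authors' method.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """Build a boolean attention mask for a prefix language model.

    mask[i][j] is True when position i may attend to position j.
    Positions before prefix_len are visible to all positions;
    later positions are visible only to themselves and to positions after them.
    """
    mask = []
    for i in range(seq_len):
        row = [j < prefix_len or j <= i for j in range(seq_len)]
        mask.append(row)
    return mask


# Example: 4 tokens, the first 2 form the bidirectional prefix.
for row in prefix_lm_mask(4, 2):
    print(["x" if allowed else "." for allowed in row])
```

In this layout the prefix tokens attend to one another in both directions, while each later token sees the prefix and the tokens generated before it.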
Funding: The National Natural Science Foundation of China (Grant Nos. 61751201 and 61672162), the Shanghai Municipal Science and Technology Major Project (Grant No. 2018SHZDZX01), and ZJLab.
Abstract: Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is intended to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.