Funding: National Key Research and Development Program of China, Grant/Award Number: 2023YFC3305003; National Natural Science Foundation of China, Grant/Award Number: 62376076.
Abstract: Bilingual lexicon induction focuses on learning word translation pairs, also known as bitexts, from monolingual corpora by establishing a mapping between the source and target embedding spaces. Despite recent advancements, bilingual lexicon induction is limited to inducing bitexts consisting of individual words, lacking the ability to handle semantics-rich phrases. To bridge this gap and support downstream cross-lingual tasks, it is practical to develop a method for bilingual phrase induction that extracts bilingual phrase pairs from monolingual corpora without relying on cross-lingual knowledge. In this paper, we propose a novel phrase embedding training method based on the skip-gram structure. Specifically, we introduce a local hard negative sampling strategy that uses negative samples of central tokens in sliding windows to enhance phrase embedding learning. The proposed method achieves competitive or superior performance compared with baseline approaches, with exceptional results on distant language pairs. Additionally, we develop a phrase representation learning method that leverages multilingual pre-trained language models (mPLMs). These mPLM-based representations can be combined with the above static phrase embeddings to further improve the accuracy of bilingual phrase induction. We also manually construct a dataset of bilingual phrase pairs and integrate it with MUSE to facilitate the bilingual phrase induction task.
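To make the training objective concrete, here is a minimal NumPy sketch of one skip-gram negative-sampling update in which negatives are drawn from the current sliding window rather than the whole vocabulary, loosely mirroring the local hard negative sampling idea; all sizes, token indices, and hyperparameters are toy assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 50, 16
W_in = rng.normal(0.0, 0.1, (vocab_size, dim))   # center-token embeddings
W_out = rng.normal(0.0, 0.1, (vocab_size, dim))  # context-token embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, window_tokens, k=3, lr=0.05):
    """One skip-gram negative-sampling update. Negatives are drawn from
    the current sliding window (a local strategy), whereas plain SGNS
    samples them from the whole vocabulary."""
    negatives = rng.choice(window_tokens, size=k, replace=False)
    # positive pair: push the center and true context embeddings together
    g_pos = sigmoid(W_in[center] @ W_out[context]) - 1.0
    loss = -np.log(sigmoid(W_in[center] @ W_out[context]))
    grad_center = g_pos * W_out[context]
    W_out[context] -= lr * g_pos * W_in[center]
    # negative pairs: push the center away from window negatives
    for n in negatives:
        g_neg = sigmoid(W_in[center] @ W_out[n])
        loss -= np.log(sigmoid(-W_in[center] @ W_out[n]))
        grad_center += g_neg * W_out[n]
        W_out[n] -= lr * g_neg * W_in[center]
    W_in[center] -= lr * grad_center
    return loss

loss = sgns_step(center=1, context=2, window_tokens=np.array([5, 7, 9, 11]))
```

Restricting negatives to the window makes them "hard" in the sense that they co-occur locally with the center token, so the model must separate genuinely related pairs from merely nearby ones.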
基金supported in part by the National Natural Science Foundation of China under Grants 61972267,and 61772070in part by the Natural Science Foundation of Hebei Province under Grant F2024210005.
Abstract: Face Presentation Attack Detection (fPAD) plays a vital role in securing face recognition systems against various presentation attacks. While supervised learning-based methods demonstrate effectiveness, they are prone to overfitting to known attack types and struggle to generalize to novel attack scenarios. Recent studies have explored formulating fPAD as an anomaly detection problem or a one-class classification task, enabling the training of generalized models for unknown attack detection. However, conventional anomaly detection approaches encounter difficulties in precisely delineating the boundary between bonafide samples and unknown attacks. To address this challenge, we propose a novel framework focusing on unknown attack detection using exclusively bonafide facial data during training. The core innovation lies in our pseudo-negative sample synthesis (PNSS) strategy, which facilitates learning of compact decision boundaries between bonafide faces and potential attack variations. Specifically, PNSS generates synthetic negative samples within low-likelihood regions of the bonafide feature space to represent diverse unknown attack patterns. To overcome the inherent imbalance between positive and synthetic negative samples during iterative training, we implement a dual-loss mechanism combining focal loss for classification optimization with pairwise confusion loss as a regularizer. This architecture effectively mitigates model bias towards bonafide samples while maintaining discriminative power. Comprehensive evaluations across three benchmark datasets validate the framework's superior performance. Notably, our PNSS achieves an 8%–18% average classification error rate (ACER) reduction compared with state-of-the-art one-class fPAD methods in cross-dataset evaluations on the Idiap Replay-Attack and MSU-MFSD datasets.
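The dual-loss mechanism can be sketched as follows; the focal loss and pairwise confusion terms below follow their standard textbook forms, and the batch, scores, and weighting factor are illustrative assumptions rather than the paper's actual configuration.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: down-weights easy examples so the abundant
    synthetic negatives do not dominate training."""
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)       # probability of the true class
    a = np.where(y == 1, alpha, 1 - alpha)
    return np.mean(-a * (1 - pt) ** gamma * np.log(pt))

def pairwise_confusion(p):
    """Pairwise confusion regularizer: mean squared difference between
    predicted probabilities of random sample pairs in the batch, which
    discourages overconfident, biased predictions."""
    half = len(p) // 2
    return np.mean((p[:half] - p[half:2 * half]) ** 2)

rng = np.random.default_rng(0)
probs = rng.uniform(0.05, 0.95, 8)             # model scores for a toy batch
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])    # bonafide = 1, pseudo-negative = 0
total = focal_loss(probs, labels) + 0.1 * pairwise_confusion(probs)
```

The 0.1 weight on the regularizer is a placeholder; in practice it would be tuned on validation data.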
基金Project(G2022165004L)supported by the High-end Foreign Expert Introduction Program,ChinaProject(2021XM3008)supported by the Special Foundation of Postdoctoral Support Program,Chongqing,China+1 种基金Project(2018-ZL-01)supported by the Sichuan Transportation Science and Technology Project,ChinaProject(HZ2021001)supported by the Chongqing Municipal Education Commission,China。
Abstract: Landslide susceptibility mapping is a crucial tool for disaster prevention and management. The performance of conventional data-driven models is greatly influenced by the quality of the sample data, and the random selection of negative samples results in a lack of interpretability throughout the assessment process. To address this limitation and construct a high-quality negative sample database, this study introduces a physics-informed machine learning approach, combining the random forest model with Scoops 3D, to optimize the negative sample selection strategy and assess the landslide susceptibility of the study area. Scoops 3D is employed to determine the factor of safety using Bishop's simplified method. Instead of conventional random selection, negative samples are extracted from areas with a high factor of safety. Subsequently, the results of the conventional random forest model and the physics-informed data-driven model are analyzed and discussed, focusing on model performance and prediction uncertainty. Compared with conventional methods, the physics-informed model, set with a safety area threshold of 3, demonstrates a noteworthy improvement in the mean AUC value of 36.7%, coupled with reduced prediction uncertainty. It is evident that the choice of safety area threshold affects both prediction uncertainty and model performance.
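A minimal sketch of the physics-informed negative sampling step, assuming toy factor-of-safety values in place of actual Scoops 3D output: negatives are drawn only from cells whose factor of safety exceeds the safety threshold, instead of at random from all non-landslide cells.

```python
import numpy as np

rng = np.random.default_rng(42)
n_cells = 1000
fs = rng.uniform(0.5, 5.0, n_cells)       # toy factor-of-safety values per grid cell
landslide = rng.random(n_cells) < 0.1     # toy landslide inventory (positives)

FS_THRESHOLD = 3.0                        # safety area threshold used in the study

# Positive samples: mapped landslide cells.
positives = np.flatnonzero(landslide)

# Negative samples: drawn only from cells the physics model marks as
# stable (FS above the threshold), not from arbitrary non-landslide cells.
stable = np.flatnonzero((~landslide) & (fs > FS_THRESHOLD))
negatives = rng.choice(stable, size=len(positives), replace=False)
```

The resulting balanced positive/negative set would then feed a classifier such as a random forest; the inventory and FS values here are synthetic stand-ins.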
基金supported by the National Key Research and Development Program of China(No.2021YFB3300503)Regional Innovation and Development Joint Fund of National Natural Science Foundation of China(No.U22A20167)National Natural Science Foundation of China(No.61872260).
Abstract: Recently, self-supervised learning has shown great potential in Graph Neural Networks (GNNs) through contrastive learning, which aims to learn discriminative features for each node without label information. The key to graph contrastive learning is data augmentation. An anchor node regards its augmented samples as positive samples and all remaining samples as negative samples, some of which may in fact be positive. We call these mislabeled samples "false negative" samples, and they seriously affect the final learning result. Since such semantically similar samples are ubiquitous in graphs, the false negative sample problem is significant. To address this issue, this paper proposes a novel model, False negative sample Detection for Graph Contrastive Learning (FD4GCL), which uses attribute- and structure-aware representations to detect false negative samples. Experimental results on seven datasets show that FD4GCL outperforms state-of-the-art baselines and even exceeds several supervised methods.
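One simple way to realize attribute- and structure-aware false negative detection, sketched here with toy embeddings and a hypothetical similarity threshold `tau` (the paper's actual detection mechanism may differ): a candidate negative is flagged when it is highly similar to the anchor in both spaces.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

rng = np.random.default_rng(0)
n_nodes, d = 6, 8
attr = rng.normal(size=(n_nodes, d))            # toy attribute embeddings
struct = rng.normal(size=(n_nodes, d))          # toy structure embeddings
# Make node 3 semantically almost identical to anchor node 0.
attr[3] = attr[0] + 0.01 * rng.normal(size=d)
struct[3] = struct[0] + 0.01 * rng.normal(size=d)

anchor, candidates = 0, [1, 2, 3, 4, 5]
tau = 0.9  # hypothetical similarity threshold

# Flag candidate negatives that look like the anchor in BOTH views;
# these are likely false negatives and should be excluded (or reweighted)
# in the contrastive loss.
false_negs = [c for c in candidates
              if cosine(attr[anchor], attr[c]) > tau
              and cosine(struct[anchor], struct[c]) > tau]
```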
Abstract: If a population is rare and clustered, simple random sampling gives a poor estimate of the population total. For such populations, adaptive cluster sampling is useful, but it loses control over the final sample size, so the cost of sampling increases substantially. To overcome this problem, surveyors often use auxiliary information that is easy to obtain and inexpensive; here, an attempt is made to control the final sample size through the auxiliary information. In this article, we propose a two-stage negative adaptive cluster sampling design, a new design that combines two-stage sampling with negative adaptive cluster sampling. In this design, we consider an auxiliary variable that is highly negatively correlated with the variable of interest and for which the auxiliary information is completely known. In the first stage, an initial random sample is drawn using the auxiliary information. Then, using Thompson's adaptive procedure (J Am Stat Assoc 85:1050-1059, 1990), networks in the population are discovered. These networks serve as the primary-stage units (PSUs). In the second stage, random samples of unequal sizes are drawn from the PSUs to obtain the secondary-stage units (SSUs). The values of the auxiliary variable and the variable of interest are recorded for these SSUs. A regression estimator is proposed to estimate the population total of the variable of interest. A new estimator, a composite Horvitz-Thompson (CHT)-type estimator based only on information on the variable of interest, is also proposed. Variances of these two estimators, along with their unbiased estimators, are derived. Using the proposed methodology, a sample survey was conducted in the Western Ghats of Maharashtra, India. The performance of these estimators and the methodology is compared with existing methods, and a cost-benefit analysis is given.
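For reference, the basic Horvitz-Thompson estimator of a population total, on which the proposed CHT-type estimator builds, can be sketched with toy inclusion probabilities:

```python
import numpy as np

def horvitz_thompson_total(y, pi):
    """Horvitz-Thompson estimator of a population total: each sampled
    value is weighted by the inverse of its inclusion probability, which
    makes the estimator design-unbiased under the sampling design."""
    return float(np.sum(np.asarray(y) / np.asarray(pi)))

# Toy sample: 4 observed values with known inclusion probabilities.
y = [12.0, 7.0, 30.0, 5.0]
pi = [0.2, 0.1, 0.5, 0.1]
estimate = horvitz_thompson_total(y, pi)  # 12/0.2 + 7/0.1 + 30/0.5 + 5/0.1 ≈ 240.0
```

The composite estimator in the paper combines information across the two sampling stages; the values above are purely illustrative.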
基金supported by the National Natural Science Foundation of China (No.69972024)the National High Technology Research and Development Program of China (No.2001A4114081).
Abstract: A novel face verification algorithm using competitive negative samples is proposed. In the algorithm, the tested face is matched not only against the claimed client face but also against competitive negative samples, and all the matching scores are combined to make the final decision. Based on the algorithm, three schemes are designed: the closest-negative-sample scheme, the all-negative-sample scheme, and the closest-few-negative-sample scheme. They are tested and compared with the traditional similarity-based verification approach on several databases with different features and classifiers. Experiments demonstrate that the three schemes reduce the verification error rate by 25.15%, 30.24%, and 30.97% on average, respectively.
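The closest-negative-sample scheme can be sketched as follows; the Euclidean distance metric, margin, and template vectors are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def verify_closest_negative(probe, client_template, negative_templates, margin=0.0):
    """Closest-negative-sample scheme (sketch): accept the identity claim
    only if the probe is closer to the claimed client template than to
    the nearest competitive negative template."""
    d_client = np.linalg.norm(probe - client_template)
    d_closest_neg = min(np.linalg.norm(probe - n) for n in negative_templates)
    return bool(d_client + margin < d_closest_neg)

rng = np.random.default_rng(1)
client = rng.normal(size=8)                          # toy client template
probe = client + 0.05 * rng.normal(size=8)           # genuine attempt near the client
negatives = [rng.normal(size=8) for _ in range(5)]   # toy competitive negatives
accepted = verify_closest_negative(probe, client, negatives)
```

The all-negative-sample and closest-few variants would combine scores over all negatives, or over the few nearest ones, instead of only the single closest.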
基金supported in part by the National Natural Science Foundation of China(No.62202383)the Guangdong Basic and Applied Basic Research Foundation(No.2024A1515012602)the National Key Research and Development Program of China(No.2022YFD1801200).
Abstract: Identifying cancer driver genes is of paramount significance in elucidating the intricate mechanisms underlying cancer development, progression, and therapeutic intervention. Abundant omics data and interactome networks provided by numerous large databases enable the application of graph deep learning techniques that incorporate network structure into the deep learning framework. However, most existing models focus primarily on an individual network, inevitably neglecting the incompleteness and noise of interactions. Moreover, class imbalance in driver gene identification hampers model performance. To address this, we propose a novel deep learning framework, MMGN, which integrates multiplex networks and pan-cancer multiomics data using graph neural networks combined with negative sample inference to discover cancer driver genes. MMGN not only enhances gene feature learning through mutual information and a consensus regularizer, but also achieves balanced classes of positive and negative samples for model training. The reliability of MMGN has been verified by the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). We believe MMGN has the potential to open new prospects in precision oncology and may find broader application in predicting biomarkers for other intricate diseases.
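A minimal sketch of one plausible negative sample inference step that also balances classes, using toy gene embeddings; the distance-from-centroid heuristic is an illustrative assumption, not necessarily MMGN's actual inference rule.

```python
import numpy as np

rng = np.random.default_rng(7)
n_genes, d = 200, 16
emb = rng.normal(size=(n_genes, d))     # toy learned gene embeddings
driver_idx = np.arange(10)              # toy set of known driver genes
emb[driver_idx] += 2.0                  # drivers cluster in feature space (toy)

# Distance of every gene from the centroid of the known drivers.
centroid = emb[driver_idx].mean(axis=0)
dist = np.linalg.norm(emb - centroid, axis=1)

# Negative sample inference (sketch): treat the unlabeled genes farthest
# from the driver centroid as reliable negatives, and take exactly as
# many of them as there are positives so the classes are balanced.
unlabeled = np.setdiff1d(np.arange(n_genes), driver_idx)
order = unlabeled[np.argsort(-dist[unlabeled])]
negatives = order[:len(driver_idx)]
```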
基金funded by the National Key R&D Program of China(Project No.2019YFC1509605)High-end Foreign Expert Introduction program(No.G20200022005 and DL2021165001L)Science and Technology Research Program of Chongqing Municipal Education Commission(Grant No.HZ2021001)。
Abstract: Landslide susceptibility mapping is a crucial tool for analyzing geohazards in a region. Recent publications have popularized data-driven models, particularly machine learning-based methods, owing to their strong capability in dealing with complex nonlinear problems. However, a significant proportion of these models have neglected qualitative aspects during analysis, resulting in a lack of interpretability throughout the process and causing inaccuracies in negative sample extraction. In this study, Scoops 3D was employed as a physics-informed tool to qualitatively assess slope stability in the study area (the Hubei Province section of the Three Gorges Reservoir Area). The non-landslide samples were extracted based on the calculated factor of safety (FS). Subsequently, the random forest algorithm was employed for data-driven landslide susceptibility analysis, with the area under the receiver operating characteristic curve (AUC) serving as the model evaluation index. Compared to the benchmark model (i.e., the standard method of utilizing the pure random forest algorithm), the proposed method's AUC value improved by 20.1%, validating the effectiveness of the dual-driven method (physics-informed data-driven).
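Since the AUC serves as the evaluation index here, a compact rank-based AUC computation on toy susceptibility scores may help make the metric concrete; it equals the probability that a randomly chosen positive outranks a randomly chosen negative.

```python
import numpy as np

def auc_score(y_true, scores):
    """Rank-based AUC: fraction of positive/negative pairs where the
    positive receives the higher score (ties count as half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Toy example: 3 landslide cells (1) and 3 non-landslide cells (0).
y = [1, 1, 1, 0, 0, 0]
susceptibility = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]  # toy model outputs
auc = auc_score(y, susceptibility)  # 8 of 9 pairs ranked correctly ≈ 0.889
```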