The paper aims to discuss three interesting issues of statistical inference for a common risk ratio (RR) in sparse meta-analysis data. Firstly, the conventional log-risk ratio estimator encounters a number of problems when the number of events in the experimental or control group is zero in sparse data of a 2 × 2 table. An adjusted log-risk ratio estimator with continuity correction points, chosen to minimize the Bayes risk with respect to the uniform prior density over (0, 1) and the Euclidean loss function, is proposed. Secondly, the interest is to find the optimal weights of the pooled estimate that minimize its mean square error (MSE) subject to a constraint on the weights. Finally, the performance of this minimum-MSE weighted estimator, adjusted with various values of the correction points, is compared with other popular estimators, such as the Mantel-Haenszel (MH) estimator and the weighted least squares (WLS) estimator (also known as the inverse-variance weighted estimator), in terms of both point estimation and hypothesis testing via simulation studies. The estimation results illustrate that, regardless of the true value of RR, the MH estimator achieves the best performance with the smallest MSE when the study size is rather large and the sample sizes within each study are small. The MSEs of the WLS estimator and the proposed weighted estimator under several of the correction points are close together, and these estimators perform best when the sample sizes are moderate to large while the study size is rather small.
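As a rough companion to the abstract above, the NumPy sketch below shows the general shape of a continuity-corrected log-RR for a 2 × 2 table with a zero cell, pooled across studies. The correction constant `c = 0.5`, the delta-method variance, and the inverse-variance weights are the textbook choices, not the Bayes-risk-optimal correction points or the MSE-optimal weights derived in the paper; the function names and toy data are invented for the example.

```python
import numpy as np

def adjusted_log_rr(x1, n1, x0, n0, c=0.5):
    """Continuity-corrected log risk ratio and an approximate variance.

    x1, n1: events / sample size in the experimental group
    x0, n0: events / sample size in the control group
    c:      correction added to each cell; 0.5 is the textbook choice,
            whereas the paper derives Bayes-risk-optimal correction points.
    """
    p1 = (x1 + c) / (n1 + c)
    p0 = (x0 + c) / (n0 + c)
    log_rr = np.log(p1 / p0)
    var = 1.0 / (x1 + c) - 1.0 / (n1 + c) + 1.0 / (x0 + c) - 1.0 / (n0 + c)
    return log_rr, var

def pooled_log_rr(studies, c=0.5):
    """Weighted pooled log RR; here the weights are inverse-variance (WLS),
    not the MSE-optimal weights sought in the paper."""
    est = np.array([adjusted_log_rr(*s, c=c) for s in studies])
    log_rr, var = est[:, 0], est[:, 1]
    w = (1.0 / var) / np.sum(1.0 / var)          # weights summing to 1
    return np.sum(w * log_rr), np.sum(w ** 2 * var)

# (x1, n1, x0, n0) per study; the first study has zero events in the control arm
studies = [(3, 50, 0, 48), (5, 120, 2, 118), (1, 40, 1, 39)]
est, var = pooled_log_rr(studies)
lo, hi = est - 1.96 * np.sqrt(var), est + 1.96 * np.sqrt(var)
print(np.exp(est), (np.exp(lo), np.exp(hi)))     # pooled RR and 95% CI
```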
Missing data are a problem in geophysical surveys, and interpolation and reconstruction of missing data are part of data processing and interpretation. Based on the sparseness of the geophysical data or of the transform domain, we can improve the accuracy and stability of the reconstruction by transforming it into a sparse optimization problem. In this paper, we propose a mathematical model for the sparse reconstruction of data based on L0-norm minimization. Furthermore, we discuss two types of approximation algorithms for L0-norm minimization according to the size and characteristics of the geophysical data: namely, the iteratively reweighted least-squares algorithm and the fast iterative hard thresholding algorithm. Theoretical and numerical analysis showed that applying the iteratively reweighted least-squares algorithm to the reconstruction of potential field data exploits its fast convergence rate, short calculation time, and high precision, whereas the fast iterative hard thresholding algorithm is more suitable for processing seismic data; moreover, its computational efficiency is better than that of the traditional iterative hard thresholding algorithm.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 41074133).
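To make the hard-thresholding branch concrete, here is a minimal NumPy sketch of basic iterative hard thresholding for the L0-constrained problem min ||y - Ax||^2 subject to ||x||_0 <= k. It is the generic textbook version, not the paper's accelerated ("fast") variant or its IRLS counterpart; the step-size rule, iteration count, and toy measurement setup are assumptions for the example.

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def iht(A, y, k, n_iter=300, step=None):
    """Iterative hard thresholding for min ||y - A x||^2  s.t.  ||x||_0 <= k."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # safe constant step size
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = hard_threshold(x + step * A.T @ (y - A @ x), k)
    return x

# toy demo: recover an 8-sparse vector from 120 random measurements
rng = np.random.default_rng(0)
n, m, k = 256, 120, 8
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)                   # unit-norm columns
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
x_hat = iht(A, A @ x_true, k)
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```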
An algorithm, the Clustering Algorithm Based On Sparse Feature Vector (CABOSFV), was proposed for the high-dimensional clustering of binary sparse data. This algorithm compresses the data effectively by using a tool called the 'Sparse Feature Vector', thus reducing the data scale enormously, and can obtain the clustering result with only one data scan. Both theoretical analysis and empirical tests showed that CABOSFV is of low computational complexity. The algorithm finds clusters in high-dimensional large datasets efficiently and handles noise effectively.
Deep learning has been probed for airfoil performance prediction in recent years. Compared with expensive CFD simulations and wind tunnel experiments, deep learning models can be leveraged to somewhat mitigate such expenses with proper means. Nevertheless, effective training of data-driven models in deep learning severely hinges on the diversity and quantity of the data. In this paper, we present a novel data-augmented Generative Adversarial Network (GAN), daGAN, for rapid and accurate flow field prediction, allowing adaptation to tasks with sparse data. The presented approach consists of two modules: a pre-training module and a fine-tuning module. The pre-training module utilizes a conditional GAN (cGAN) to preliminarily estimate the distribution of the training data. In the fine-tuning module, we propose a novel adversarial architecture with two generators, one of which fulfils a data augmentation operation so that the complementary data are adequately incorporated to boost the generalization of the model. We use numerical simulation data to verify the generalization of daGAN on airfoils and flow conditions with sparse training data. The results show that daGAN is a promising tool for rapid and accurate evaluation of detailed flow fields without the requirement for big training data.
Funding: Supported by the Key Laboratory of Aerodynamic Noise Control (No. ANCL20190103), the State Key Laboratory of Aerodynamics, China (No. SKLA20180102), the Aeronautical Science Foundation of China (Nos. 2018ZA52002, 2019ZA052011), and the Priority Academic Program Development of Jiangsu Higher Education Institutions, China (PAPD).
For a data cube there are always constraints between dimensions or among attributes in a dimension, such as functional dependencies. We introduce the problem of how to use functional dependencies, when they are present, to speed up the computation of sparse data cubes. A new algorithm, CFD (Computation by Functional Dependencies), is presented to satisfy this demand. CFD determines the order of dimensions by considering cardinalities of dimensions and functional dependencies between dimensions together, thus reducing the number of partitions for such dimensions. CFD also combines bottom-up partitioning and top-down aggregate computation to speed up the computation further. CFD can efficiently compute a data cube with hierarchies in a dimension from the smallest granularity to the coarsest one.
Keywords: sparse data cube; functional dependency; dimension; partition; CFD
Funding: Supported by the E-Government Project of the Ministry of Science and Technology of China (2001BA110B01).
Recent advances in deep learning have opened new possibilities for fluid flow simulation in petroleum reservoirs. However, the predominant approach in existing research is to train neural networks using high-fidelity numerical simulation data. This presents a significant challenge because the sole source of authentic wellbore production data for training is sparse. In response to this challenge, this work introduces a novel architecture called physics-informed neural network based on domain decomposition (PINN-DD), aiming to effectively utilize the sparse production data of wells for reservoir simulation of large-scale systems. To harness the capabilities of physics-informed neural networks (PINNs) in handling small-scale spatial-temporal domains while addressing the challenges of large-scale systems with sparse labeled data, the computational domain is divided into two distinct sub-domains: the well-containing and the well-free sub-domain. Moreover, the two sub-domains and the interface are rigorously constrained by the governing equations, data matching, and boundary conditions. The accuracy of the proposed method is evaluated on two problems, and its performance is compared against state-of-the-art PINNs through numerical analysis as a benchmark. The results demonstrate the superiority of PINN-DD in handling large-scale reservoir simulation with limited data and show its potential to outperform conventional PINNs in such scenarios.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 52274048), the Beijing Natural Science Foundation (Grant No. 3222037), the CNPC 14th Five-Year Perspective Fundamental Research Project (Grant No. 2021DJ2104), and the Science Foundation of China University of Petroleum-Beijing (No. 2462021YXZZ010).
Oil and gas seismic exploration has to adopt irregular seismic acquisition, due to increasingly complex exploration conditions, to adapt to complex geological conditions and environments. However, irregular seismic acquisition is accompanied by a lack of acquisition data, which requires high-precision regularization. In this paper, the sparse signal feature in the transform domain from compressed sensing theory is used to recover the missing signal, involving sparse transform basis optimization and threshold modeling. First, this paper analyzes and compares the effects of six sparse transformation bases on the reconstruction accuracy and efficiency of irregular seismic data and establishes the quantitative relationship between sparse transformation and reconstruction accuracy and efficiency. Second, an adaptive threshold modeling method based on the sparse coefficients is provided to improve the reconstruction accuracy. Test results show that the method has good adaptability to different seismic data and sparse transform bases. The f-x domain reconstruction method of effective frequency samples is studied to address the problem of low computational efficiency. A parallel computing strategy of the curvelet transform combined with OpenMP is further proposed, which substantially improves the computational efficiency under the premise of ensuring the reconstruction accuracy. Finally, actual acquisition data are used to verify the proposed method. The results indicate that the proposed strategy can solve the regularization problem of irregular seismic data in production and improve the imaging quality of the target layer economically and efficiently.
Funding: Supported by the National Science and Technology Major Project (No. 2016ZX05024001003), the Innovation Consortium Project of China Petroleum, and Southwest Petroleum University (No. 2020CX010201).
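To illustrate the transform-domain thresholding idea behind the entry above, here is a minimal NumPy sketch of iterative-thresholding (POCS-style) trace interpolation using a plain 2-D Fourier basis and a simple linearly decaying threshold. The curvelet transform, the adaptive sparse-coefficient threshold model, the f-x treatment, and the OpenMP parallelization of the paper are not reproduced; the synthetic record and all parameter values are assumptions.

```python
import numpy as np

def pocs_interpolate(data, mask, n_iter=100):
    """Reconstruct missing traces by iterative thresholding of 2-D FFT coefficients.

    data: 2-D array (time samples x traces), zeros at the missing traces
    mask: same shape, 1 where a trace was acquired, 0 where it is missing
    """
    recon = data.copy()
    tau0 = np.abs(np.fft.fft2(data)).max()
    for it in range(n_iter):
        spec = np.fft.fft2(recon)
        tau = tau0 * (1.0 - it / n_iter)     # simple linearly decaying threshold
        spec[np.abs(spec) < tau] = 0.0
        approx = np.real(np.fft.ifft2(spec))
        # project back onto the acquired data (keep observed traces unchanged)
        recon = mask * data + (1.0 - mask) * approx
    return recon

# synthetic record: two dipping events, 40% of traces removed at random
nt, nx = 128, 64
t = np.arange(nt)[:, None]
x = np.arange(nx)[None, :]
clean = np.sin(2 * np.pi * (t - 0.5 * x) / 32) + 0.5 * np.sin(2 * np.pi * (t + x) / 16)
rng = np.random.default_rng(1)
mask = (rng.random(nx) > 0.4).astype(float)[None, :] * np.ones((nt, nx))
rec = pocs_interpolate(clean * mask, mask)
print(np.linalg.norm(rec - clean) / np.linalg.norm(clean))
```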
To address the difficulties caused by scattered and sparse observational data in ocean science, a new interpolation technique based on information diffusion is proposed in this paper. Based on a fuzzy mapping idea, sparse data samples are diffused and mapped into corresponding fuzzy sets, in the form of probabilities, in an interpolation ellipse model. To avoid the shortcoming of the normal diffusion function on asymmetric structures, a kind of asymmetric information diffusion function is developed, and a corresponding algorithm, the ellipse model for diffusion of asymmetric information, is established. Through interpolation experiments and comparative analysis of sea surface temperature data with ARGO data, the rationality and validity of the ellipse model are assessed.
Funding: Supported by the Natural Science Foundation of China (41276088).
Various uncertainties arising during the acquisition of geoscience data may result in anomalous data instances (i.e., outliers) that do not conform with the expected pattern of regular data instances. With sparse multivariate data obtained from geotechnical site investigation, it is impossible to identify outliers with certainty, due to the distortion of the statistics of geotechnical parameters caused by outliers and the associated statistical uncertainty resulting from data sparsity. This paper develops a probabilistic outlier detection method for sparse multivariate data obtained from geotechnical site investigation. The proposed approach quantifies the outlying probability of each data instance based on Mahalanobis distance and determines outliers as those data instances with outlying probabilities greater than 0.5. It tackles the distortion issue of statistics estimated from a dataset with outliers by a re-sampling technique and accounts rationally for the statistical uncertainty by Bayesian machine learning. Moreover, the proposed approach also suggests an exclusive method to determine the outlying components of each outlier. The proposed approach is illustrated and verified using simulated and real-life datasets. It is shown that the proposed approach properly identifies outliers among sparse multivariate data, and their corresponding outlying components, in a probabilistic manner. It can significantly reduce the masking effect (i.e., missing some actual outliers due to the distortion of statistics by the outliers and statistical uncertainty). It is also found that outliers among sparse multivariate data instances significantly affect the construction of the multivariate distribution of geotechnical parameters for uncertainty quantification. This emphasizes the necessity of a data cleaning process (e.g., outlier detection) for uncertainty quantification based on geoscience data.
Funding: Supported by the National Key R&D Program of China (Project No. 2016YFC0800200), the NRF-NSFC 3rd Joint Research Grant (Earth Science) (Project No. 41861144022), the National Natural Science Foundation of China (Project Nos. 51679174 and 51779189), and the Shenzhen Key Technology R&D Program (Project No. 20170324).
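A rough NumPy/SciPy sketch of the core idea in the entry above, an outlying probability built from Mahalanobis distances on random re-samples, is given below. The Bayesian machine learning treatment of statistical uncertainty and the component-wise diagnosis of the paper are not reproduced; the chi-square cutoff, subset fraction, and planted-outlier demo are assumptions for the illustration.

```python
import numpy as np
from scipy.stats import chi2

def outlying_probability(X, n_rounds=500, subset_frac=0.7, level=0.975, seed=0):
    """Outlying probability of each row of X from re-sampled Mahalanobis distances.

    Each round estimates the mean and covariance from a random subset of the
    data (reducing the distortion caused by the outliers themselves) and flags
    instances whose squared Mahalanobis distance exceeds a chi-square quantile;
    the flag frequency over rounds is returned as the outlying probability.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = max(d + 2, int(subset_frac * n))
    cutoff = chi2.ppf(level, df=d)
    flags = np.zeros(n)
    for _ in range(n_rounds):
        idx = rng.choice(n, size=m, replace=False)
        mu = X[idx].mean(axis=0)
        cov = np.cov(X[idx], rowvar=False) + 1e-9 * np.eye(d)
        diff = X - mu
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        flags += (d2 > cutoff)
    return flags / n_rounds

# simulated sparse bivariate data with two planted outliers
rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=30)
X[0] = [4.0, -4.0]
X[1] = [5.0, 5.0]
prob = outlying_probability(X)
print(np.where(prob > 0.5)[0])    # indices declared outliers (threshold 0.5)
```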
At present, the acquisition of seismic data is developing toward high-precision and high-density methods. However, complex natural environments and cultural factors in many exploration areas cause difficulties in achieving uniform and intensive acquisition, which makes complete seismic data collection impossible. Therefore, data reconstruction is required in the processing workflow to ensure imaging accuracy. Deep learning, as a new field in rapid development, presents clear advantages in feature extraction and modeling. In this study, a convolutional neural network deep learning algorithm is applied to seismic data reconstruction. Based on the convolutional neural network algorithm and combined with the characteristics of seismic data acquisition, two training strategies, supervised and unsupervised learning, are designed to reconstruct sparsely acquired seismic records. First, a supervised learning strategy is proposed for labeled data, wherein the complete seismic data are segmented as the input of the training set and are randomly sampled before each training pass, thereby increasing the number of samples and the richness of features. Second, an unsupervised learning strategy based on large samples is proposed for unlabeled data, and a rolling segmentation method is used to update (pseudo) labels and training parameters during the training process. Through reconstruction tests on simulated and actual data, the deep learning algorithm based on a convolutional neural network shows better reconstruction quality and higher accuracy than compressed sensing based on the curvelet transform.
Funding: Supported by the National Natural Science Foundation of China under the project 'Research on the Dynamic Location of Receiver Points and Wave Field Separation Technology Based on Deep Learning in OBN Seismic Exploration' (No. 42074140).
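To illustrate the supervised strategy's data preparation (not the network itself), the NumPy sketch below segments a complete record into patches and draws a fresh random trace-decimation mask, producing (sparsely sampled input, complete label) pairs. Patch sizes, the keep ratio, and the toy record are invented for the example.

```python
import numpy as np

def make_training_pairs(record, patch_t=64, patch_x=64, keep_ratio=0.6, seed=None):
    """Cut a complete record into patches and randomly decimate traces.

    Returns (inputs, labels): inputs are patches with a freshly drawn subset of
    traces zeroed out (simulating sparse acquisition), labels are the complete
    patches. Re-drawing the masks before every epoch multiplies the effective
    number of samples, as in the supervised strategy described above.
    """
    rng = np.random.default_rng(seed)
    nt, nx = record.shape
    inputs, labels = [], []
    for i0 in range(0, nt - patch_t + 1, patch_t):
        for j0 in range(0, nx - patch_x + 1, patch_x):
            patch = record[i0:i0 + patch_t, j0:j0 + patch_x]
            mask = (rng.random(patch_x) < keep_ratio).astype(patch.dtype)
            inputs.append(patch * mask[None, :])    # sparsely sampled input
            labels.append(patch)                    # complete label
    return np.stack(inputs), np.stack(labels)

# toy record with two dipping events
nt, nx = 256, 256
t = np.arange(nt)[:, None]
x = np.arange(nx)[None, :]
record = np.sin(2 * np.pi * (t - 0.5 * x) / 40) + 0.5 * np.sin(2 * np.pi * (t + 0.3 * x) / 25)
X_in, X_lab = make_training_pairs(record, seed=0)
print(X_in.shape, X_lab.shape)    # (16, 64, 64) each for this toy record
```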
Background: Meta-analysis is a statistical method to synthesize evidence from a number of independent studies, including those from clinical studies with binary outcomes. In practice, when there are zero events in one or both groups, this may cause statistical problems in the subsequent analysis. Methods: In this paper, taking the relative risk as the effect size, we conduct a comparative study of four continuity correction methods and another state-of-the-art method without continuity correction, namely the generalized linear mixed models (GLMMs). To further advance the literature, we also introduce a new continuity correction method for estimating the relative risk. Results: From the simulation studies, the new method performs well in terms of mean squared error when there are few studies. In contrast, the generalized linear mixed model performs best when the number of studies is large. In addition, by reanalyzing recent coronavirus disease 2019 (COVID-19) data, it is evident that the double-zero-event studies impact the estimate of the mean effect size. Conclusions: We recommend the new method to handle zero-event studies when there are few studies in a meta-analysis, or instead the GLMM when the number of studies is large. The double-zero-event studies may be informative, and so we suggest not excluding them.
Funding: Supported by grants awarded to Tie-Jun Tong from the General Research Fund (HKBU12303918), the National Natural Science Foundation of China (1207010822), and the Initiation Grants for Faculty Niche Research Areas (RC-IG-FNRA/17-18/13, RC-FNRAIG/20-21/SCI/03) of Hong Kong Baptist University.
Individual participant data (IPD) meta-analysis was developed to overcome several meta-analytical pitfalls of classical meta-analysis. One advantage of classical psychometric meta-analysis over IPD meta-analysis is the correction of the aggregated unit of studies for study differences, i.e., artifacts such as measurement error. Without these corrections at the study level, meta-analysts may attribute differences between studies to moderator variables instead of artifacts. The psychometric correction of the aggregation unit of individuals in IPD meta-analysis has been neglected by IPD meta-analysts thus far. In this paper, we present the adaptation of a psychometric approach for IPD meta-analysis that accounts for differences in the aggregation unit of individuals, thereby overcoming differences between individuals. We introduce the reader to this approach using the aggregation of lens model studies on individual data as an example, and lay out different application possibilities for the future (e.g., big data analysis). Our suggested psychometric IPD meta-analysis supplements the meta-analysis approaches within the field and is a suitable alternative for future analyses.
A sparse signal is a kind of sparse matrix that can carry fault information and simplify the signal at the same time. This can effectively reduce the cost of signal storage, improve the efficiency of data transmission, and ultimately save the cost of equipment fault diagnosis in the aviation field. At present, existing sparse decomposition methods generally extract sparse fault characteristic signals based on orthogonal basis atoms, which limits the adaptability of sparse decomposition. In this paper, a self-adaptive atom is extracted by an improved dual-channel tunable Q-factor wavelet transform (TQWT) method to construct a self-adaptive complete dictionary. Finally, the sparse signal is obtained by the orthogonal matching pursuit (OMP) algorithm. The atoms obtained by this method are more flexible and are no longer constrained to an orthogonal basis, so they better reflect the oscillation characteristics of signals. Therefore, the sparse signal can better extract the fault characteristics. The simulation and experimental results show that the self-adaptive dictionary with atoms extracted from the dual-channel TQWT has greater decomposition freedom and signal matching ability than orthogonal basis dictionaries, such as the discrete cosine transform (DCT), discrete Hartley transform (DHT), and discrete wavelet transform (DWT). In addition, the sparse signal extracted with the self-adaptive complete dictionary can reflect the time-domain characteristics of the vibration signals and can more accurately extract the bearing fault feature frequency.
Funding: Supported by the National Key R&D Program of China (Grant No. 2018YFB1503103).
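Since the final coding step in the entry above is standard orthogonal matching pursuit, a minimal NumPy version is sketched below for a generic unit-norm (possibly overcomplete) dictionary. The dual-channel TQWT atom extraction that builds the self-adaptive dictionary is not reproduced; the random dictionary and sparsity level in the demo are placeholders.

```python
import numpy as np

def omp(D, y, k, tol=1e-8):
    """Orthogonal matching pursuit: approximate y with at most k atoms of D.

    D: dictionary with unit-norm columns (may be overcomplete, non-orthogonal)
    y: signal vector;  k: target sparsity.  Returns the sparse coefficients.
    """
    residual = y.copy()
    support = []
    coef = np.zeros(D.shape[1])
    for _ in range(k):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares re-fit of y on the selected atoms
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
        if np.linalg.norm(residual) < tol:
            break
    coef[support] = sol
    return coef

# demo with a random overcomplete dictionary
rng = np.random.default_rng(0)
D = rng.standard_normal((128, 512))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(512)
x_true[rng.choice(512, 5, replace=False)] = rng.standard_normal(5)
coef = omp(D, D @ x_true, k=5)
print(sorted(np.nonzero(coef)[0]), sorted(np.nonzero(x_true)[0]))
```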
Latent factor (LF) models are highly effective in extracting useful knowledge from High-Dimensional and Sparse (HiDS) matrices, which are commonly seen in various industrial applications. An LF model usually adopts iterative optimizers, which may consume many iterations to reach a local optimum, resulting in considerable time cost. Hence, determining how to accelerate the training process for LF models has become a significant issue. To address this, this work proposes a randomized latent factor (RLF) model. It incorporates the principle of randomized learning techniques from neural networks into the LF analysis of HiDS matrices, thereby greatly alleviating the computational burden. It also extends a standard learning process for randomized neural networks to the context of LF analysis to make the resulting model represent an HiDS matrix correctly. Experimental results on three HiDS matrices from industrial applications demonstrate that, compared with state-of-the-art LF models, RLF is able to achieve significantly higher computational efficiency and comparable prediction accuracy for missing data. It provides an important alternative approach to LF analysis of HiDS matrices, which is especially desired for industrial applications demanding highly efficient models.
Funding: Supported in part by the National Natural Science Foundation of China (Grant Nos. 61772493 and 91646114), the Chongqing research program of technology innovation and application (cstc2017rgzn-zdyfX0020), and in part by the Pioneer Hundred Talents Program of the Chinese Academy of Sciences.
Least squares support vector machine (LS-SVM) plays an important role in steel surface defect classification because of its high speed. However, the defect samples obtained from the real production line may be noisy, and LS-SVM suffers from poor classification performance when noisy samples are present. Thus, in the classification stage, it is necessary to design an effective algorithm to process the defect dataset obtained from the real production line. To this end, an adaptive weight function was employed to reduce the adverse effect of noisy samples. Moreover, although LS-SVM offers fast speed, it still suffers from high computational complexity if the number of training samples is large. The time for steel surface defect classification should be as short as possible; therefore, a sparse strategy was adopted to prune the training samples. Finally, since steel surface defect classification is an unbalanced data classification problem, the plain LS-SVM algorithm is not directly applicable; hence, the unbalanced data information was introduced to improve the classification performance. Comprehensively considering the above-mentioned factors, an improved LS-SVM classification model was proposed, termed ILS-SVM. Experimental results show that the new algorithm has the advantages of high speed and strong anti-noise ability.
Funding: Supported by the Natural Science Foundation of Liaoning Province, China (20180550067), the Liaoning Province Ministry of Education Scientific Study Project (2020LNZD06 and 2017LNQN11), and University of Science and Technology Liaoning Talent Project Grants (601011507-20 and 601013360-17).
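For orientation, a minimal NumPy sketch of a weighted LS-SVM classifier (the standard Suykens-style linear system with an RBF kernel) follows; the per-sample weights are where an adaptive noise- or imbalance-aware weighting could be plugged in. The specific adaptive weight function, the sparse pruning strategy, and the unbalanced-data terms of ILS-SVM are not reproduced, and the kernel, regularization value, and toy data are assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def train_ls_svm(X, y, C=10.0, gamma=0.5, sample_weight=None):
    """Solve the weighted LS-SVM classifier linear system (Suykens-style dual).

    y must be in {-1, +1}. sample_weight rescales each sample's error penalty,
    which is where an adaptive or class-imbalance weight can be inserted.
    """
    n = X.shape[0]
    if sample_weight is None:
        sample_weight = np.ones(n)
    Omega = (y[:, None] * y[None, :]) * rbf_kernel(X, X, gamma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.diag(1.0 / (C * sample_weight))
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]                       # alpha, b

def predict_ls_svm(X_train, y_train, alpha, b, X_test, gamma=0.5):
    return np.sign(rbf_kernel(X_test, X_train, gamma) @ (alpha * y_train) + b)

# toy two-class data with a few mislabeled points given small weights
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.6, (40, 2)), rng.normal(1, 0.6, (40, 2))])
y = np.concatenate([-np.ones(40), np.ones(40)])
y[:3] *= -1                                      # simulated label noise
w = np.ones(80)
w[:3] = 0.1                                      # e.g. from an adaptive weight function
alpha, b = train_ls_svm(X, y, sample_weight=w)
print((predict_ls_svm(X, y, alpha, b, X) == y).mean())
```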
(Aim) COVID-19 is an ongoing infectious disease. It had caused more than 107.45 million confirmed cases and 2.35 million deaths by 11 February 2021. Traditional computer vision methods have achieved promising results on automatic smart diagnosis. (Method) This study aims to propose a novel deep learning method that can obtain better performance. We use the pseudo-Zernike moment (PZM), derived from the Zernike moment, as the extracted features. Two settings are introduced: (i) image plane over unit circle; and (ii) image plane inside the unit circle. Afterward, we use a deep-stacked sparse autoencoder (DSSAE) as the classifier. Besides, multiple-way data augmentation is chosen to overcome overfitting. The multiple-way data augmentation is based on Gaussian noise, salt-and-pepper noise, speckle noise, horizontal and vertical shear, rotation, Gamma correction, random translation, and scaling. (Results) 10 runs of 10-fold cross validation show that our PZM-DSSAE method achieves a sensitivity of 92.06%±1.54%, a specificity of 92.56%±1.06%, a precision of 92.53%±1.03%, and an accuracy of 92.31%±1.08%. Its F1 score, MCC, and FMI arrive at 92.29%±1.10%, 84.64%±2.15%, and 92.29%±1.10%, respectively. The AUC of our model is 0.9576. (Conclusion) We demonstrate that “image plane over unit circle” can get better results than “image plane inside a unit circle.” Besides, the proposed PZM-DSSAE model is better than eight state-of-the-art approaches.
Funding: Supported by the Royal Society International Exchanges Cost Share Award, UK (RP202G0230), the Medical Research Council Confidence in Concept Award, UK (MC_PC_17171), the Hope Foundation for Cancer Research, UK (RM60G0680), and the Global Challenges Research Fund (GCRF), UK (P202PF11).
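The multiple-way augmentation step listed above is easy to prototype; the NumPy/SciPy sketch below applies the named perturbations (Gaussian, salt-and-pepper, and speckle noise, shear, rotation, gamma correction, translation, and scaling) to one grayscale image. Noise levels, angles, and shifts are illustrative guesses, and the PZM feature extraction and the DSSAE classifier are not included.

```python
import numpy as np
from scipy import ndimage

def augment(img, rng):
    """Return several augmented copies of a grayscale image with values in [0, 1]."""
    out = {}
    out['gaussian'] = np.clip(img + rng.normal(0, 0.05, img.shape), 0, 1)
    sp = img.copy()
    m = rng.random(img.shape)
    sp[m < 0.02] = 0.0
    sp[m > 0.98] = 1.0
    out['salt_pepper'] = sp
    out['speckle'] = np.clip(img * (1 + rng.normal(0, 0.1, img.shape)), 0, 1)
    shear = np.array([[1.0, 0.15], [0.0, 1.0]])              # horizontal shear
    out['shear'] = ndimage.affine_transform(img, shear, order=1)
    out['rotate'] = ndimage.rotate(img, angle=rng.uniform(-15, 15), reshape=False, order=1)
    out['gamma'] = np.clip(img, 0, 1) ** rng.uniform(0.7, 1.4)   # gamma correction
    out['translate'] = ndimage.shift(img, shift=rng.integers(-6, 7, size=2), order=1)
    out['scale'] = ndimage.zoom(img, zoom=rng.uniform(0.9, 1.1), order=1)
    return out

rng = np.random.default_rng(0)
img = rng.random((64, 64))          # stand-in for a normalized CT slice
aug = augment(img, rng)
print({k: v.shape for k, v in aug.items()})
```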
In this paper, based on sparse representation classification and robust estimation ideas, we propose a new classifier, named MRSRC (Metasample-Based Robust Sparse Representation Classifier), for DNA microarray data classification. Firstly, we extract metasamples from the training samples. Secondly, a weighting matrix W is added to solve an l1-regularized least squares problem. Finally, the testing sample is classified according to its sparse coefficient vector. The experimental results on DNA microarray data classification prove that the proposed algorithm is efficient.
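A bare-bones sparse representation classifier is sketched below in NumPy: the training samples form the dictionary, the l1-regularized least squares code is obtained with plain ISTA, and the test sample is assigned to the class with the smallest class-wise reconstruction residual. The metasample extraction and the weighting matrix W that define MRSRC are not reproduced; the regularization value and the toy "expression profiles" are invented.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_l1(D, y, lam=0.05, n_iter=500):
    """Solve min_x 0.5*||y - D x||^2 + lam*||x||_1 with plain ISTA."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the smooth part
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + D.T @ (y - D @ x) / L, lam / L)
    return x

def src_classify(D, labels, y, lam=0.05):
    """Assign y to the class whose atoms reconstruct it with the smallest residual."""
    x = ista_l1(D, y, lam)
    residuals = {c: np.linalg.norm(y - D @ np.where(labels == c, x, 0.0))
                 for c in np.unique(labels)}
    return min(residuals, key=residuals.get)

# toy "expression profiles": two classes scattered around different prototypes
rng = np.random.default_rng(0)
proto = rng.standard_normal((100, 2))
D = np.hstack([proto[:, [0]] + 0.2 * rng.standard_normal((100, 20)),
               proto[:, [1]] + 0.2 * rng.standard_normal((100, 20))])
D /= np.linalg.norm(D, axis=0)
labels = np.array([0] * 20 + [1] * 20)
test = proto[:, 1] + 0.2 * rng.standard_normal(100)
print(src_classify(D, labels, test))    # expected: 1
```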
AIM: To propose a new meta-analysis method for bivariate P values which accounts for their paired structure. METHODS: Studies that test two different features from the same sample give rise to bivariate P values. A relevant example is testing for periodicity as well as expression in time-course gene expression studies. Kocak et al. (2010) use George and Mudholkar's (1983) “Difference of Two Logit-Sums” method to pool bivariate P values across independent experiments, assuming independence within a pair. As bivariate P values need not be independent within a given study, we propose a new meta-analysis approach for pooling bivariate P values across independent experiments which accounts for potential correlation between paired P values. We compare the “Difference of Two Logit-Sums” method with our novel approach in terms of their sensitivity and specificity through extensive simulations, by generating P value samples from the most commonly used tests, namely the Z test, t test, chi-square test, and F test, with varying sample sizes and correlation structures. RESULTS: The simulation results showed that our new meta-analysis approach for correlated and uncorrelated bivariate P values has much more desirable sensitivity and specificity features than the existing method, which treats each member of the paired P values as independent. We also compare these meta-analysis approaches on bivariate P values from periodicity and expression tests of 4936 S. pombe genes from 10 independent time-course experiments, and we show that our new approach ranks the periodic, conserved, and cycling genes significantly higher and detects many more periodic, “conserved” and “cycling” genes among the top 100 genes, compared to the “Difference of Two Logit-Sums” method. Finally, we used our meta-analytic approach to compare the relative evidence for the association of pre-term birth with preschool wheezing versus preschool asthma. CONCLUSION: The new meta-analysis method has much better sensitivity and specificity characteristics compared to the “Difference of Two Logit-Sums” method, and it is not computationally more expensive.
This work focuses on enhancing low-frequency seismic data using a convolutional neural network trained on synthetic data. Traditional seismic data often lack both high and low frequencies, which are essential for detailed geological interpretation and various geophysical applications. Low-frequency data are particularly valuable for reducing wavelet sidelobes and improving full waveform inversion (FWI). Conventional methods for bandwidth extension include seismic deconvolution and sparse inversion, which have limitations in recovering low frequencies. The study explores the potential of the U-net, which has been successful in other geophysical applications such as noise attenuation and seismic resolution enhancement. The novelty in our approach is that we do not rely on computationally expensive finite difference modelling to create training data. Instead, our synthetic training data are created from individual randomly perturbed events with variations in bandwidth, making it more adaptable to different data sets compared to previous deep learning methods. The method was tested on both synthetic and real seismic data, demonstrating effective low-frequency reconstruction and sidelobe reduction. With a synthetic full waveform inversion to recover a velocity model and a seismic amplitude inversion to estimate acoustic impedance, we demonstrate the validity and benefit of the proposed method. Overall, the study presents a robust approach to seismic bandwidth extension using deep learning, emphasizing the importance of diverse and well-designed but computationally inexpensive synthetic training data.
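In the spirit of the cheap synthetic training data described above, here is a NumPy sketch that builds one (input, target) pair from a random sparse reflectivity convolved with a bandwidth-perturbed Ricker wavelet, where the input has its low frequencies removed and the target keeps the full band. The wavelet choice, cut-off frequency, and perturbation ranges are assumptions; the paper's exact event-perturbation scheme and the U-net itself are not reproduced.

```python
import numpy as np

def ricker(f0, dt, n):
    """Zero-phase Ricker wavelet with peak frequency f0."""
    t = (np.arange(n) - n // 2) * dt
    a = (np.pi * f0 * t) ** 2
    return (1 - 2 * a) * np.exp(-a)

def low_cut(trace, dt, f_low):
    """Remove frequencies below f_low via simple FFT masking."""
    spec = np.fft.rfft(trace)
    freqs = np.fft.rfftfreq(trace.size, dt)
    spec[freqs < f_low] = 0.0
    return np.fft.irfft(spec, n=trace.size)

def make_pair(n=512, dt=0.002, n_events=20, f0=25.0, f_low=8.0, rng=None):
    """One (input, target) pair: random sparse reflectivity convolved with a
    bandwidth-perturbed Ricker wavelet; the input lacks the low frequencies,
    the target keeps the full band."""
    rng = np.random.default_rng() if rng is None else rng
    refl = np.zeros(n)
    refl[rng.choice(n, n_events, replace=False)] = rng.uniform(-1, 1, n_events)
    wav = ricker(f0 * rng.uniform(0.8, 1.2), dt, 128)   # perturbed bandwidth
    full = np.convolve(refl, wav, mode='same')          # target trace
    return low_cut(full, dt, f_low), full               # (input, target)

rng = np.random.default_rng(0)
pairs = [make_pair(rng=rng) for _ in range(1000)]       # cheap synthetic training set
x, y = pairs[0]
print(x.shape, y.shape)
```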
Time-domain airborne electromagnetic (AEM) data are frequently subject to interference from various types of noise, which can reduce the data quality and affect data inversion and interpretation. Traditional denoising methods primarily deal with the data directly, without analyzing them in detail; thus, the results are not always satisfactory. In this paper, we propose a method based on dictionary learning for EM data denoising. This method uses dictionary learning to perform feature analysis and to extract and reconstruct the true signal. In the process of dictionary learning, the random noise is filtered out as residuals. To verify the effectiveness of this dictionary learning approach for denoising, we use a fixed overcomplete discrete cosine transform (ODCT) dictionary algorithm, the method-of-optimal-directions (MOD) dictionary learning algorithm, and the K-singular value decomposition (K-SVD) dictionary learning algorithm to denoise decay curves at single points and to denoise profile data for different time channels in time-domain AEM. The results show obvious differences among the three dictionaries for denoising AEM data, with the K-SVD dictionary achieving the best performance.
Funding: Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA14020102), the National Natural Science Foundation of China (Nos. 41774125, 41530320 and 41804098), and the Key National Research Project of China (Nos. 2016YFC0303100, 2017YFC0601900).
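As a rough illustration of the dictionary-learning denoising idea in the entry above, the NumPy sketch below runs a minimal method-of-optimal-directions (MOD) loop with a crude correlation-based sparse coding step (OMP, as in the earlier sketch, could be substituted); the reconstruction from the learned dictionary is the denoised signal and the residual plays the role of the filtered-out noise. K-SVD, the ODCT dictionary, and any AEM-specific processing are not reproduced, and the synthetic decay curves and parameters are assumptions.

```python
import numpy as np

def sparse_code(D, Y, k):
    """Crude sparse coding: keep the k most correlated atoms per signal and
    least-squares fit on that support (a cheap stand-in for OMP)."""
    X = np.zeros((D.shape[1], Y.shape[1]))
    corr = D.T @ Y
    for j in range(Y.shape[1]):
        idx = np.argsort(np.abs(corr[:, j]))[-k:]
        sol, *_ = np.linalg.lstsq(D[:, idx], Y[:, j], rcond=None)
        X[idx, j] = sol
    return X

def mod_dictionary_learning(Y, n_atoms=16, k=2, n_iter=15, seed=0):
    """Method of optimal directions: alternate sparse coding with the
    closed-form dictionary update D = Y X^T (X X^T)^-1."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        X = sparse_code(D, Y, k)
        D = Y @ X.T @ np.linalg.pinv(X @ X.T)
        D /= np.linalg.norm(D, axis=0) + 1e-12
    return D, sparse_code(D, Y, k)

# noisy decay-like curves: the reconstruction D @ X is the denoised signal,
# and the residual Y - D @ X plays the role of the filtered-out noise
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 64)[:, None]
clean = np.exp(-t / rng.uniform(0.05, 0.3, size=(1, 200)))
noisy = clean + 0.05 * rng.standard_normal(clean.shape)
D, X = mod_dictionary_learning(noisy)
print(np.linalg.norm(D @ X - clean) / np.linalg.norm(clean))
```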