Objective Humans are exposed to complex mixtures of environmental chemicals and other factors that can affect their health.Analysis of these mixture exposures presents several key challenges for environmental epidemio...Objective Humans are exposed to complex mixtures of environmental chemicals and other factors that can affect their health.Analysis of these mixture exposures presents several key challenges for environmental epidemiology and risk assessment,including high dimensionality,correlated exposure,and subtle individual effects.Methods We proposed a novel statistical approach,the generalized functional linear model(GFLM),to analyze the health effects of exposure mixtures.GFLM treats the effect of mixture exposures as a smooth function by reordering exposures based on specific mechanisms and capturing internal correlations to provide a meaningful estimation and interpretation.The robustness and efficiency was evaluated under various scenarios through extensive simulation studies.Results We applied the GFLM to two datasets from the National Health and Nutrition Examination Survey(NHANES).In the first application,we examined the effects of 37 nutrients on BMI(2011–2016 cycles).The GFLM identified a significant mixture effect,with fiber and fat emerging as the nutrients with the greatest negative and positive effects on BMI,respectively.For the second application,we investigated the association between four pre-and perfluoroalkyl substances(PFAS)and gout risk(2007–2018 cycles).Unlike traditional methods,the GFLM indicated no significant association,demonstrating its robustness to multicollinearity.Conclusion GFLM framework is a powerful tool for mixture exposure analysis,offering improved handling of correlated exposures and interpretable results.It demonstrates robust performance across various scenarios and real-world applications,advancing our understanding of complex environmental exposures and their health impacts on environmental epidemiology and toxicology.展开更多
High-dimensional and incomplete(HDI) matrices are primarily generated in all kinds of big-data-related practical applications. A latent factor analysis(LFA) model is capable of conducting efficient representation lear...High-dimensional and incomplete(HDI) matrices are primarily generated in all kinds of big-data-related practical applications. A latent factor analysis(LFA) model is capable of conducting efficient representation learning to an HDI matrix,whose hyper-parameter adaptation can be implemented through a particle swarm optimizer(PSO) to meet scalable requirements.However, conventional PSO is limited by its premature issues,which leads to the accuracy loss of a resultant LFA model. To address this thorny issue, this study merges the information of each particle's state migration into its evolution process following the principle of a generalized momentum method for improving its search ability, thereby building a state-migration particle swarm optimizer(SPSO), whose theoretical convergence is rigorously proved in this study. It is then incorporated into an LFA model for implementing efficient hyper-parameter adaptation without accuracy loss. Experiments on six HDI matrices indicate that an SPSO-incorporated LFA model outperforms state-of-the-art LFA models in terms of prediction accuracy for missing data of an HDI matrix with competitive computational efficiency.Hence, SPSO's use ensures efficient and reliable hyper-parameter adaptation in an LFA model, thus ensuring practicality and accurate representation learning for HDI matrices.展开更多
In this paper,we introduce the censored composite conditional quantile coefficient(cC-CQC)to rank the relative importance of each predictor in high-dimensional censored regression.The cCCQC takes advantage of all usef...In this paper,we introduce the censored composite conditional quantile coefficient(cC-CQC)to rank the relative importance of each predictor in high-dimensional censored regression.The cCCQC takes advantage of all useful information across quantiles and can detect nonlinear effects including interactions and heterogeneity,effectively.Furthermore,the proposed screening method based on cCCQC is robust to the existence of outliers and enjoys the sure screening property.Simulation results demonstrate that the proposed method performs competitively on survival datasets of high-dimensional predictors,particularly when the variables are highly correlated.展开更多
The estimation of covariance matrices is very important in many fields, such as statistics. In real applications, data are frequently influenced by high dimensions and noise. However, most relevant studies are based o...The estimation of covariance matrices is very important in many fields, such as statistics. In real applications, data are frequently influenced by high dimensions and noise. However, most relevant studies are based on complete data. This paper studies the optimal estimation of high-dimensional covariance matrices based on missing and noisy sample under the norm. First, the model with sub-Gaussian additive noise is presented. The generalized sample covariance is then modified to define a hard thresholding estimator , and the minimax upper bound is derived. After that, the minimax lower bound is derived, and it is concluded that the estimator presented in this article is rate-optimal. Finally, numerical simulation analysis is performed. The result shows that for missing samples with sub-Gaussian noise, if the true covariance matrix is sparse, the hard thresholding estimator outperforms the traditional estimate method.展开更多
The performance of conventional similarity measurement methods is affected seriously by the curse of dimensionality of high-dimensional data.The reason is that data difference between sparse and noisy dimensionalities...The performance of conventional similarity measurement methods is affected seriously by the curse of dimensionality of high-dimensional data.The reason is that data difference between sparse and noisy dimensionalities occupies a large proportion of the similarity,leading to the dissimilarities between any results.A similarity measurement method of high-dimensional data based on normalized net lattice subspace is proposed.The data range of each dimension is divided into several intervals,and the components in different dimensions are mapped onto the corresponding interval.Only the component in the same or adjacent interval is used to calculate the similarity.To validate this method,three data types are used,and seven common similarity measurement methods are compared.The experimental result indicates that the relative difference of the method is increasing with the dimensionality and is approximately two or three orders of magnitude higher than the conventional method.In addition,the similarity range of this method in different dimensions is [0,1],which is fit for similarity analysis after dimensionality reduction.展开更多
The Binary star DataBase(BDB, http://bdb.inasan.ru) combines data from catalogs of binary and multiple stars of all observational types. There is a number of ways for variable stars to form or to be a part of binary o...The Binary star DataBase(BDB, http://bdb.inasan.ru) combines data from catalogs of binary and multiple stars of all observational types. There is a number of ways for variable stars to form or to be a part of binary or multiple systems. We describe how such stars are represented in the database.展开更多
Problems existin similarity measurement and index tree construction which affect the performance of nearest neighbor search of high-dimensional data. The equidistance problem is solved using NPsim function to calculat...Problems existin similarity measurement and index tree construction which affect the performance of nearest neighbor search of high-dimensional data. The equidistance problem is solved using NPsim function to calculate similarity. And a sequential NPsim matrix is built to improve indexing performance. To sum up the above innovations,a nearest neighbor search algorithm of high-dimensional data based on sequential NPsim matrix is proposed in comparison with the nearest neighbor search algorithms based on KD-tree or SR-tree on Munsell spectral data set. Experimental results show that the proposed algorithm similarity is better than that of other algorithms and searching speed is more than thousands times of others. In addition,the slow construction speed of sequential NPsim matrix can be increased by using parallel computing.展开更多
Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines’ leaves/petioles and their associated spectral reflectance in order to understand the fruit ripeni...Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines’ leaves/petioles and their associated spectral reflectance in order to understand the fruit ripening rate, water status, nutrient levels, and disease risk. In this paper, we implement imaging spectroscopy (hyperspectral) reflectance data, for the reflective 330 - 2510 nm wavelength region (986 total spectral bands), to assess vineyard nutrient status;this constitutes a high dimensional dataset with a covariance matrix that is ill-conditioned. The identification of the variables (wavelength bands) that contribute useful information for nutrient assessment and prediction, plays a pivotal role in multivariate statistical modeling. In recent years, researchers have successfully developed many continuous, nearly unbiased, sparse and accurate variable selection methods to overcome this problem. This paper compares four regularized and one functional regression methods: Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis for wavelength variable selection. Thereafter, the predictive performance of these regularized sparse models is enhanced using the stepwise regression. This comparative study of regression methods using a high-dimensional and highly correlated grapevine hyperspectral dataset revealed that the performance of Elastic Net for variable selection yields the best predictive ability.展开更多
We presented the first photometric light curve solutions of four W Ursae Majoris-type contact binary systems.This investigation utilized photometric data from the Transiting Exoplanet Survey Satellite and Gaia Data Re...We presented the first photometric light curve solutions of four W Ursae Majoris-type contact binary systems.This investigation utilized photometric data from the Transiting Exoplanet Survey Satellite and Gaia Data Release 3(DR3).We used the PHysics Of Eclipsing BinariEs Python code and the Markov Chain Monte Carlo method for these light curve solutions.Only TIC 249064185 among the target systems needed a cold starspot to be included in the analysis.Based on the estimated mass ratios for these total eclipse systems,three of them are categorized as low mass ratio contact binary stars.The absolute parameters of the systems were estimated using the Gaia DR3 parallax method and the orbital period and semimajor axis(P-a)empirical relationship.We ascertained that the TIC 318015356 and TIC 55522736 systems are A-subtypes,while TIC 249064185 and TIC 397984843 are W-subtypes,depending on each component’s effective temperature and mass.We estimated the initial masses of the stars,the mass lost by the binary system,and the systems’ages.We displayed star positions in the mass-radius,mass-luminosity,and total mass-orbital angular momentum diagrams.In addition,our findings indicate a good agreement with the mass-temperature empirical parameter relationship for the primary stars.展开更多
Making accurate forecast or prediction is a challenging task in the big data era, in particular for those datasets involving high-dimensional variables but short-term time series points,which are generally available f...Making accurate forecast or prediction is a challenging task in the big data era, in particular for those datasets involving high-dimensional variables but short-term time series points,which are generally available from real-world systems.To address this issue, Prof.展开更多
Latent factor(LF)models are highly effective in extracting useful knowledge from High-Dimensional and Sparse(HiDS)matrices which are commonly seen in various industrial applications.An LF model usually adopts iterativ...Latent factor(LF)models are highly effective in extracting useful knowledge from High-Dimensional and Sparse(HiDS)matrices which are commonly seen in various industrial applications.An LF model usually adopts iterative optimizers,which may consume many iterations to achieve a local optima,resulting in considerable time cost.Hence,determining how to accelerate the training process for LF models has become a significant issue.To address this,this work proposes a randomized latent factor(RLF)model.It incorporates the principle of randomized learning techniques from neural networks into the LF analysis of HiDS matrices,thereby greatly alleviating computational burden.It also extends a standard learning process for randomized neural networks in context of LF analysis to make the resulting model represent an HiDS matrix correctly.Experimental results on three HiDS matrices from industrial applications demonstrate that compared with state-of-the-art LF models,RLF is able to achieve significantly higher computational efficiency and comparable prediction accuracy for missing data.I provides an important alternative approach to LF analysis of HiDS matrices,which is especially desired for industrial applications demanding highly efficient models.展开更多
A novel local binary pattern-based reversible data hiding(LBP-RDH)technique has been suggested to maintain a fair symmetry between the perceptual transparency and hiding capacity.During embedding,the image is divided ...A novel local binary pattern-based reversible data hiding(LBP-RDH)technique has been suggested to maintain a fair symmetry between the perceptual transparency and hiding capacity.During embedding,the image is divided into various 3×3 blocks.Then,using the LBP-based image descriptor,the LBP codes for each block are computed.Next,the obtained LBP codes are XORed with the embedding bits and are concealed in the respective blocks using the proposed pixel readjustment process.Further,each cover image(CI)pixel produces two different stego-image pixels.Likewise,during extraction,the CI pixels are restored without the loss of a single bit of information.The outcome of the proposed technique with respect to perceptual transparency measures,such as peak signal-to-noise ratio and structural similarity index,is found to be superior to that of some of the recent and state-of-the-art techniques.In addition,the proposed technique has shown excellent resilience to various stego-attacks,such as pixel difference histogram as well as regular and singular analysis.Besides,the out-off boundary pixel problem,which endures in most of the contemporary data hiding techniques,has been successfully addressed.展开更多
Aimed at the issue that traditional clustering methods are not appropriate to high-dimensional data, a cuckoo search fuzzy-weighting algorithm for subspace clustering is presented on the basis of the exited soft subsp...Aimed at the issue that traditional clustering methods are not appropriate to high-dimensional data, a cuckoo search fuzzy-weighting algorithm for subspace clustering is presented on the basis of the exited soft subspace clustering algorithm. In the proposed algorithm, a novel objective function is firstly designed by considering the fuzzy weighting within-cluster compactness and the between-cluster separation, and loosening the constraints of dimension weight matrix. Then gradual membership and improved Cuckoo search, a global search strategy, are introduced to optimize the objective function and search subspace clusters, giving novel learning rules for clustering. At last, the performance of the proposed algorithm on the clustering analysis of various low and high dimensional datasets is experimentally compared with that of several competitive subspace clustering algorithms. Experimental studies demonstrate that the proposed algorithm can obtain better performance than most of the existing soft subspace clustering algorithms.展开更多
As a crucial data preprocessing method in data mining,feature selection(FS)can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected featu...As a crucial data preprocessing method in data mining,feature selection(FS)can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected features.Evolutionary computing(EC)is promising for FS owing to its powerful search capability.However,in traditional EC-based methods,feature subsets are represented via a length-fixed individual encoding.It is ineffective for high-dimensional data,because it results in a huge search space and prohibitive training time.This work proposes a length-adaptive non-dominated sorting genetic algorithm(LA-NSGA)with a length-variable individual encoding and a length-adaptive evolution mechanism for bi-objective highdimensional FS.In LA-NSGA,an initialization method based on correlation and redundancy is devised to initialize individuals of diverse lengths,and a Pareto dominance-based length change operator is introduced to guide individuals to explore in promising search space adaptively.Moreover,a dominance-based local search method is employed for further improvement.The experimental results based on 12 high-dimensional gene datasets show that the Pareto front of feature subsets produced by LA-NSGA is superior to those of existing algorithms.展开更多
In the process of encoding and decoding,erasure codes over binary fields,which just need AND operations and XOR operations and therefore have a high computational efficiency,are widely used in various fields of inform...In the process of encoding and decoding,erasure codes over binary fields,which just need AND operations and XOR operations and therefore have a high computational efficiency,are widely used in various fields of information technology.A matrix decoding method is proposed in this paper.The method is a universal data reconstruction scheme for erasure codes over binary fields.Besides a pre-judgment that whether errors can be recovered,the method can rebuild sectors of loss data on a fault-tolerant storage system constructed by erasure codes for disk errors.Data reconstruction process of the new method has simple and clear steps,so it is beneficial for implementation of computer codes.And more,it can be applied to other non-binary fields easily,so it is expected that the method has an extensive application in the future.展开更多
With the advent of modern devices,such as smartphones and wearable devices,high-dimensional data are collected on many participants for a period of time or even in perpetuity.For this type of data,dependencies between...With the advent of modern devices,such as smartphones and wearable devices,high-dimensional data are collected on many participants for a period of time or even in perpetuity.For this type of data,dependencies between and within data batches exist because data are collected from the same individual over time.Under the framework of streamed data,individual historical data are not available due to the storage and computation burden.It is urgent to develop computationally efficient methods with statistical guarantees to analyze high-dimensional streamed data and make reliable inferences in practice.In addition,the homogeneity assumption on the model parameters may not be valid in practice over time.To address the above issues,in this paper,we develop a new renewable debiased-lasso inference method for high-dimensional streamed data allowing dependences between and within data batches to exist and model parameters to gradually change.We establish the large sample properties of the proposed estimators,including consistency and asymptotic normality.The numerical results,including simulations and real data analysis,show the superior performance of the proposed method.展开更多
In this paper,an Observation Points Classifier Ensemble(OPCE)algorithm is proposed to deal with High-Dimensional Imbalanced Classification(HDIC)problems based on data processed using the Multi-Dimensional Scaling(MDS)...In this paper,an Observation Points Classifier Ensemble(OPCE)algorithm is proposed to deal with High-Dimensional Imbalanced Classification(HDIC)problems based on data processed using the Multi-Dimensional Scaling(MDS)feature extraction technique.First,dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible.Second,a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low-dimensional data space.Third,optimization of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples.Exhaustive experiments have been conducted to evaluate the feasibility,rationality,and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets.Experimental results show that(1)the OPCE algorithm can be trained faster on low-dimensional imbalanced data than on high-dimensional data;(2)the OPCE algorithm can correctly identify samples as the number of optimised observation points is increased;and(3)statistical analysis reveals that OPCE yields better HDIC performances on the selected data sets in comparison with eight other HDIC algorithms.This demonstrates that OPCE is a viable algorithm to deal with HDIC problems.展开更多
Missingness in mixed-type variables is commonly encountered in a variety of areas.The requirement of complete observations necessitates data imputation when a moderate or large proportion of data is missing.However,in...Missingness in mixed-type variables is commonly encountered in a variety of areas.The requirement of complete observations necessitates data imputation when a moderate or large proportion of data is missing.However,inappropriate imputation would downgrade the performance of machine learning algorithms,leading to bad predictions and unreliable statistical inference.For high-dimensional large-scale mixed-type missing data,we develop a computationally efficient imputation method,missing value imputation via generalized factor models(MIG),under missing at random.The proposed MIG method allows missing variables to be of different types,including continuous,binary,and count variables,and are scalable to both data size n and variable dimension p while existing imputation methods rely on restrictive assumptions such as the same type of missing variables,the low dimensionality of variables,and a limited sample size.We explicitly show that the imputation error of the proposed MIG method diminishes to zero with the rate Op(max{n^(-1/2),p^(-1/2)})as both n and p tend to infinity.Five real datasets demonstrate the superior empirical performance of the proposed MIG method over existing methods that the average normalized absolute imputation error is reduced by 5.3%–34.1%.展开更多
Secret data hiding in binary images is more difficult than other formats since binary images require only one bit repre-sentation to indicate black and white. This study proposes a new method for data hiding in binary...Secret data hiding in binary images is more difficult than other formats since binary images require only one bit repre-sentation to indicate black and white. This study proposes a new method for data hiding in binary images using opti-mized bit position to replace a secret bit. This method manipulates blocks, which are sub-divided. The parity bit for a specified block decides whether to change or not, to embed a secret bit. By finding the best position to insert a secret bit for each divided block, the image quality of the resulting stego-image can be improved, while maintaining low computational complexity. The experimental results show that the proposed method has an improvement with respect to a previous work.展开更多
基金supported in part by the Young Scientists Fund of the National Natural Science Foundation of China(Grant Nos.82304253)(and 82273709)the Foundation for Young Talents in Higher Education of Guangdong Province(Grant No.2022KQNCX021)the PhD Starting Project of Guangdong Medical University(Grant No.GDMUB2022054).
文摘Objective Humans are exposed to complex mixtures of environmental chemicals and other factors that can affect their health.Analysis of these mixture exposures presents several key challenges for environmental epidemiology and risk assessment,including high dimensionality,correlated exposure,and subtle individual effects.Methods We proposed a novel statistical approach,the generalized functional linear model(GFLM),to analyze the health effects of exposure mixtures.GFLM treats the effect of mixture exposures as a smooth function by reordering exposures based on specific mechanisms and capturing internal correlations to provide a meaningful estimation and interpretation.The robustness and efficiency was evaluated under various scenarios through extensive simulation studies.Results We applied the GFLM to two datasets from the National Health and Nutrition Examination Survey(NHANES).In the first application,we examined the effects of 37 nutrients on BMI(2011–2016 cycles).The GFLM identified a significant mixture effect,with fiber and fat emerging as the nutrients with the greatest negative and positive effects on BMI,respectively.For the second application,we investigated the association between four pre-and perfluoroalkyl substances(PFAS)and gout risk(2007–2018 cycles).Unlike traditional methods,the GFLM indicated no significant association,demonstrating its robustness to multicollinearity.Conclusion GFLM framework is a powerful tool for mixture exposure analysis,offering improved handling of correlated exposures and interpretable results.It demonstrates robust performance across various scenarios and real-world applications,advancing our understanding of complex environmental exposures and their health impacts on environmental epidemiology and toxicology.
基金supported in part by the National Natural Science Foundation of China (62372385, 62272078, 62002337)the Chongqing Natural Science Foundation (CSTB2022NSCQ-MSX1486, CSTB2023NSCQ-LZX0069)the Deanship of Scientific Research at King Abdulaziz University, Jeddah, Saudi Arabia (RG-12-135-43)。
文摘High-dimensional and incomplete(HDI) matrices are primarily generated in all kinds of big-data-related practical applications. A latent factor analysis(LFA) model is capable of conducting efficient representation learning to an HDI matrix,whose hyper-parameter adaptation can be implemented through a particle swarm optimizer(PSO) to meet scalable requirements.However, conventional PSO is limited by its premature issues,which leads to the accuracy loss of a resultant LFA model. To address this thorny issue, this study merges the information of each particle's state migration into its evolution process following the principle of a generalized momentum method for improving its search ability, thereby building a state-migration particle swarm optimizer(SPSO), whose theoretical convergence is rigorously proved in this study. It is then incorporated into an LFA model for implementing efficient hyper-parameter adaptation without accuracy loss. Experiments on six HDI matrices indicate that an SPSO-incorporated LFA model outperforms state-of-the-art LFA models in terms of prediction accuracy for missing data of an HDI matrix with competitive computational efficiency.Hence, SPSO's use ensures efficient and reliable hyper-parameter adaptation in an LFA model, thus ensuring practicality and accurate representation learning for HDI matrices.
基金Outstanding Youth Foundation of Hunan Provincial Department of Education(Grant No.22B0911)。
文摘In this paper,we introduce the censored composite conditional quantile coefficient(cC-CQC)to rank the relative importance of each predictor in high-dimensional censored regression.The cCCQC takes advantage of all useful information across quantiles and can detect nonlinear effects including interactions and heterogeneity,effectively.Furthermore,the proposed screening method based on cCCQC is robust to the existence of outliers and enjoys the sure screening property.Simulation results demonstrate that the proposed method performs competitively on survival datasets of high-dimensional predictors,particularly when the variables are highly correlated.
文摘The estimation of covariance matrices is very important in many fields, such as statistics. In real applications, data are frequently influenced by high dimensions and noise. However, most relevant studies are based on complete data. This paper studies the optimal estimation of high-dimensional covariance matrices based on missing and noisy sample under the norm. First, the model with sub-Gaussian additive noise is presented. The generalized sample covariance is then modified to define a hard thresholding estimator , and the minimax upper bound is derived. After that, the minimax lower bound is derived, and it is concluded that the estimator presented in this article is rate-optimal. Finally, numerical simulation analysis is performed. The result shows that for missing samples with sub-Gaussian noise, if the true covariance matrix is sparse, the hard thresholding estimator outperforms the traditional estimate method.
基金Supported by the National Natural Science Foundation of China(No.61502475)the Importation and Development of High-Caliber Talents Project of the Beijing Municipal Institutions(No.CIT&TCD201504039)
文摘The performance of conventional similarity measurement methods is affected seriously by the curse of dimensionality of high-dimensional data.The reason is that data difference between sparse and noisy dimensionalities occupies a large proportion of the similarity,leading to the dissimilarities between any results.A similarity measurement method of high-dimensional data based on normalized net lattice subspace is proposed.The data range of each dimension is divided into several intervals,and the components in different dimensions are mapped onto the corresponding interval.Only the component in the same or adjacent interval is used to calculate the similarity.To validate this method,three data types are used,and seven common similarity measurement methods are compared.The experimental result indicates that the relative difference of the method is increasing with the dimensionality and is approximately two or three orders of magnitude higher than the conventional method.In addition,the similarity range of this method in different dimensions is [0,1],which is fit for similarity analysis after dimensionality reduction.
基金supportedby the Russian Foundation of Basic Researches,projects 16–07–1162 and 18–02–00890Funding for the DPAC has been provided by nationalinstitutions, in particular the institutions participating inthe Gaia Multilateral Agreement
文摘The Binary star DataBase(BDB, http://bdb.inasan.ru) combines data from catalogs of binary and multiple stars of all observational types. There is a number of ways for variable stars to form or to be a part of binary or multiple systems. We describe how such stars are represented in the database.
基金Supported by the National Natural Science Foundation of China(No.61300078)the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions(No.CIT&TCD201504039)+1 种基金Funding Project for Academic Human Resources Development in Beijing Union University(No.BPHR2014A03,Rk100201510)"New Start"Academic Research Projects of Beijing Union University(No.Hzk10201501)
文摘Problems existin similarity measurement and index tree construction which affect the performance of nearest neighbor search of high-dimensional data. The equidistance problem is solved using NPsim function to calculate similarity. And a sequential NPsim matrix is built to improve indexing performance. To sum up the above innovations,a nearest neighbor search algorithm of high-dimensional data based on sequential NPsim matrix is proposed in comparison with the nearest neighbor search algorithms based on KD-tree or SR-tree on Munsell spectral data set. Experimental results show that the proposed algorithm similarity is better than that of other algorithms and searching speed is more than thousands times of others. In addition,the slow construction speed of sequential NPsim matrix can be increased by using parallel computing.
文摘Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines’ leaves/petioles and their associated spectral reflectance in order to understand the fruit ripening rate, water status, nutrient levels, and disease risk. In this paper, we implement imaging spectroscopy (hyperspectral) reflectance data, for the reflective 330 - 2510 nm wavelength region (986 total spectral bands), to assess vineyard nutrient status;this constitutes a high dimensional dataset with a covariance matrix that is ill-conditioned. The identification of the variables (wavelength bands) that contribute useful information for nutrient assessment and prediction, plays a pivotal role in multivariate statistical modeling. In recent years, researchers have successfully developed many continuous, nearly unbiased, sparse and accurate variable selection methods to overcome this problem. This paper compares four regularized and one functional regression methods: Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis for wavelength variable selection. Thereafter, the predictive performance of these regularized sparse models is enhanced using the stepwise regression. This comparative study of regression methods using a high-dimensional and highly correlated grapevine hyperspectral dataset revealed that the performance of Elastic Net for variable selection yields the best predictive ability.
基金founded by the Gordon and Betty Moore Foundation through grants GBMF5490 and GBMF10501 to Ohio State Universitythe Alfred P.Sloan Foundation grant G-2021-14192
文摘We presented the first photometric light curve solutions of four W Ursae Majoris-type contact binary systems.This investigation utilized photometric data from the Transiting Exoplanet Survey Satellite and Gaia Data Release 3(DR3).We used the PHysics Of Eclipsing BinariEs Python code and the Markov Chain Monte Carlo method for these light curve solutions.Only TIC 249064185 among the target systems needed a cold starspot to be included in the analysis.Based on the estimated mass ratios for these total eclipse systems,three of them are categorized as low mass ratio contact binary stars.The absolute parameters of the systems were estimated using the Gaia DR3 parallax method and the orbital period and semimajor axis(P-a)empirical relationship.We ascertained that the TIC 318015356 and TIC 55522736 systems are A-subtypes,while TIC 249064185 and TIC 397984843 are W-subtypes,depending on each component’s effective temperature and mass.We estimated the initial masses of the stars,the mass lost by the binary system,and the systems’ages.We displayed star positions in the mass-radius,mass-luminosity,and total mass-orbital angular momentum diagrams.In addition,our findings indicate a good agreement with the mass-temperature empirical parameter relationship for the primary stars.
基金supported by the grants from CASthe National Key R&D Program of Chinathe National Natural Science Foundation of China
文摘Making accurate forecast or prediction is a challenging task in the big data era, in particular for those datasets involving high-dimensional variables but short-term time series points,which are generally available from real-world systems.To address this issue, Prof.
基金supported in part by the National Natural Science Foundation of China (6177249391646114)+1 种基金Chongqing research program of technology innovation and application (cstc2017rgzn-zdyfX0020)in part by the Pioneer Hundred Talents Program of Chinese Academy of Sciences
文摘Latent factor(LF)models are highly effective in extracting useful knowledge from High-Dimensional and Sparse(HiDS)matrices which are commonly seen in various industrial applications.An LF model usually adopts iterative optimizers,which may consume many iterations to achieve a local optima,resulting in considerable time cost.Hence,determining how to accelerate the training process for LF models has become a significant issue.To address this,this work proposes a randomized latent factor(RLF)model.It incorporates the principle of randomized learning techniques from neural networks into the LF analysis of HiDS matrices,thereby greatly alleviating computational burden.It also extends a standard learning process for randomized neural networks in context of LF analysis to make the resulting model represent an HiDS matrix correctly.Experimental results on three HiDS matrices from industrial applications demonstrate that compared with state-of-the-art LF models,RLF is able to achieve significantly higher computational efficiency and comparable prediction accuracy for missing data.I provides an important alternative approach to LF analysis of HiDS matrices,which is especially desired for industrial applications demanding highly efficient models.
文摘A novel local binary pattern-based reversible data hiding(LBP-RDH)technique has been suggested to maintain a fair symmetry between the perceptual transparency and hiding capacity.During embedding,the image is divided into various 3×3 blocks.Then,using the LBP-based image descriptor,the LBP codes for each block are computed.Next,the obtained LBP codes are XORed with the embedding bits and are concealed in the respective blocks using the proposed pixel readjustment process.Further,each cover image(CI)pixel produces two different stego-image pixels.Likewise,during extraction,the CI pixels are restored without the loss of a single bit of information.The outcome of the proposed technique with respect to perceptual transparency measures,such as peak signal-to-noise ratio and structural similarity index,is found to be superior to that of some of the recent and state-of-the-art techniques.In addition,the proposed technique has shown excellent resilience to various stego-attacks,such as pixel difference histogram as well as regular and singular analysis.Besides,the out-off boundary pixel problem,which endures in most of the contemporary data hiding techniques,has been successfully addressed.
基金supported in part by the National Natural Science Foundation of China (Nos. 61303074, 61309013)the Programs for Science, National Key Basic Research and Development Program ("973") of China (No. 2012CB315900)Technology Development of Henan province (Nos.12210231003, 13210231002)
文摘Aimed at the issue that traditional clustering methods are not appropriate to high-dimensional data, a cuckoo search fuzzy-weighting algorithm for subspace clustering is presented on the basis of the exited soft subspace clustering algorithm. In the proposed algorithm, a novel objective function is firstly designed by considering the fuzzy weighting within-cluster compactness and the between-cluster separation, and loosening the constraints of dimension weight matrix. Then gradual membership and improved Cuckoo search, a global search strategy, are introduced to optimize the objective function and search subspace clusters, giving novel learning rules for clustering. At last, the performance of the proposed algorithm on the clustering analysis of various low and high dimensional datasets is experimentally compared with that of several competitive subspace clustering algorithms. Experimental studies demonstrate that the proposed algorithm can obtain better performance than most of the existing soft subspace clustering algorithms.
基金supported in part by the National Natural Science Foundation of China(62172065,62072060)。
文摘As a crucial data preprocessing method in data mining,feature selection(FS)can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected features.Evolutionary computing(EC)is promising for FS owing to its powerful search capability.However,in traditional EC-based methods,feature subsets are represented via a length-fixed individual encoding.It is ineffective for high-dimensional data,because it results in a huge search space and prohibitive training time.This work proposes a length-adaptive non-dominated sorting genetic algorithm(LA-NSGA)with a length-variable individual encoding and a length-adaptive evolution mechanism for bi-objective highdimensional FS.In LA-NSGA,an initialization method based on correlation and redundancy is devised to initialize individuals of diverse lengths,and a Pareto dominance-based length change operator is introduced to guide individuals to explore in promising search space adaptively.Moreover,a dominance-based local search method is employed for further improvement.The experimental results based on 12 high-dimensional gene datasets show that the Pareto front of feature subsets produced by LA-NSGA is superior to those of existing algorithms.
基金supported by the National Natural Science Foundation of China under Grant No.61501064Sichuan Provincial Science and Technology Project under Grant No.2016GZ0122
文摘In the process of encoding and decoding,erasure codes over binary fields,which just need AND operations and XOR operations and therefore have a high computational efficiency,are widely used in various fields of information technology.A matrix decoding method is proposed in this paper.The method is a universal data reconstruction scheme for erasure codes over binary fields.Besides a pre-judgment that whether errors can be recovered,the method can rebuild sectors of loss data on a fault-tolerant storage system constructed by erasure codes for disk errors.Data reconstruction process of the new method has simple and clear steps,so it is beneficial for implementation of computer codes.And more,it can be applied to other non-binary fields easily,so it is expected that the method has an extensive application in the future.
基金Supported by National Key R&D Program of China(Grant No.2022YFA1003702)National Natural Science Foundation of China(Grant No.12271441)。
文摘With the advent of modern devices,such as smartphones and wearable devices,high-dimensional data are collected on many participants for a period of time or even in perpetuity.For this type of data,dependencies between and within data batches exist because data are collected from the same individual over time.Under the framework of streamed data,individual historical data are not available due to the storage and computation burden.It is urgent to develop computationally efficient methods with statistical guarantees to analyze high-dimensional streamed data and make reliable inferences in practice.In addition,the homogeneity assumption on the model parameters may not be valid in practice over time.To address the above issues,in this paper,we develop a new renewable debiased-lasso inference method for high-dimensional streamed data allowing dependences between and within data batches to exist and model parameters to gradually change.We establish the large sample properties of the proposed estimators,including consistency and asymptotic normality.The numerical results,including simulations and real data analysis,show the superior performance of the proposed method.
基金National Natural Science Foundation of China,Grant/Award Number:61972261Basic Research Foundations of Shenzhen,Grant/Award Numbers:JCYJ20210324093609026,JCYJ20200813091134001。
文摘In this paper,an Observation Points Classifier Ensemble(OPCE)algorithm is proposed to deal with High-Dimensional Imbalanced Classification(HDIC)problems based on data processed using the Multi-Dimensional Scaling(MDS)feature extraction technique.First,dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible.Second,a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low-dimensional data space.Third,optimization of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples.Exhaustive experiments have been conducted to evaluate the feasibility,rationality,and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets.Experimental results show that(1)the OPCE algorithm can be trained faster on low-dimensional imbalanced data than on high-dimensional data;(2)the OPCE algorithm can correctly identify samples as the number of optimised observation points is increased;and(3)statistical analysis reveals that OPCE yields better HDIC performances on the selected data sets in comparison with eight other HDIC algorithms.This demonstrates that OPCE is a viable algorithm to deal with HDIC problems.
基金supported by National Key R&D Program of China(Grant No.2022YFA1003702)National Natural Science Foundation of China(Grant Nos.11931014 and 12271441)。
文摘Missingness in mixed-type variables is commonly encountered in a variety of areas.The requirement of complete observations necessitates data imputation when a moderate or large proportion of data is missing.However,inappropriate imputation would downgrade the performance of machine learning algorithms,leading to bad predictions and unreliable statistical inference.For high-dimensional large-scale mixed-type missing data,we develop a computationally efficient imputation method,missing value imputation via generalized factor models(MIG),under missing at random.The proposed MIG method allows missing variables to be of different types,including continuous,binary,and count variables,and are scalable to both data size n and variable dimension p while existing imputation methods rely on restrictive assumptions such as the same type of missing variables,the low dimensionality of variables,and a limited sample size.We explicitly show that the imputation error of the proposed MIG method diminishes to zero with the rate Op(max{n^(-1/2),p^(-1/2)})as both n and p tend to infinity.Five real datasets demonstrate the superior empirical performance of the proposed MIG method over existing methods that the average normalized absolute imputation error is reduced by 5.3%–34.1%.
文摘Secret data hiding in binary images is more difficult than other formats since binary images require only one bit repre-sentation to indicate black and white. This study proposes a new method for data hiding in binary images using opti-mized bit position to replace a secret bit. This method manipulates blocks, which are sub-divided. The parity bit for a specified block decides whether to change or not, to embed a secret bit. By finding the best position to insert a secret bit for each divided block, the image quality of the resulting stego-image can be improved, while maintaining low computational complexity. The experimental results show that the proposed method has an improvement with respect to a previous work.